Research article

Enhancing classification performance in imbalanced datasets: A comparative analysis of machine learning models

  • Received: 16 June 2023 Revised: 14 August 2023 Accepted: 05 September 2023 Published: 13 October 2023
  • In the realm of machine learning, where data-driven insights guide decision-making, addressing the challenges posed by class imbalance in datasets has emerged as a crucial concern. The effectiveness of classification algorithms hinges not only on their intrinsic capabilities but also on their adaptability to uneven class distributions, a common issue encountered across diverse domains. This study delves into the intricate interplay between varying class imbalance levels and the performance of ten distinct classification models, unravelling the critical impact of this imbalance on the landscape of predictive analytics. Results showed that random forest (RF) and decision tree (DT) models outperformed others, exhibiting robustness to class imbalance. Logistic regression (LR), stochastic gradient descent classifier (SGDC) and naïve Bayes (NB) models struggled with imbalanced datasets. Adaptive boosting (ADA), gradient boosting (GB), extreme gradient boosting (XGB), light gradient boosting machine (LGBM), and k-nearest neighbour (kNN) models improved with balanced data. Adaptive synthetic sampling (ADASYN) yielded more reliable predictions than the under-sampling (UNDER) technique. This study provides insights for practitioners and researchers dealing with imbalanced datasets, guiding model selection and data balancing techniques. RF and DT models demonstrate superior performance, while LR, SGDC and NB models have limitations. By leveraging the strengths of RF and DT models and addressing class imbalance, classification performance in imbalanced datasets can be enhanced. This study enriches credit risk modelling literature by revealing how class imbalance impacts default probability estimation. The research deepens our understanding of class imbalance's critical role in predictive analytics. 
Serving as a roadmap for practitioners and researchers dealing with imbalanced data, the findings guide model selection and data balancing strategies, enhancing classification performance despite class imbalance.

    Citation: Lindani Dube, Tanja Verster. Enhancing classification performance in imbalanced datasets: A comparative analysis of machine learning models[J]. Data Science in Finance and Economics, 2023, 3(4): 354-379. doi: 10.3934/DSFE.2023021




    Autism Spectrum Disorder (ASD) is described as a strongly heterogeneous disorder, owing to the large number of symptoms which may appear in the child's functioning [1] and the variable response to treatment [1],[2]. Although the symptoms are multiple and vary in intensity, every person with autism presents abnormalities in communication and social interaction [2], and exhibits repetitive behaviours and a limited scope of interests [3],[4]. The onset of the disorder occurs in early childhood [3]. Only a small percentage of people diagnosed with ASD, those with mild symptoms (e.g., difficulty in social communication, problems adapting to change, difficulty with planning), are able to live a relatively independent life as adults [2],[3],[5]. The majority (with symptoms of moderate and severe intensity) need the help of their families or social welfare for the rest of their lives [2]. Their functioning in adult life depends on the early introduction of intensive therapeutic programmes that modify undesirable behaviours and teach social and communication skills [6]–[8].

    The scientific literature reports a constant growth in the incidence of the disorder under discussion. For example, data from the Autism and Developmental Disabilities Monitoring Network show that, in 2012, in the USA, there were twice as many eight-year-old children with diagnosed ASD as only two years earlier, in 2010 [9]. Taking into consideration the whole population of children, in 2000, ASD was reported as occurring in one in every 150 children, and in 2010, in one in every 68 [10]. The causes of this situation are unknown. Researchers believe it is related to greater public awareness of the symptoms of autism, new diagnostic criteria, and the possibility of diagnosis at a younger age [11]–[14]. These are only hypotheses, but they undoubtedly encourage various agencies (medical, social, educational and others) to search for effective solutions for supporting people with autism and their families [4].

    The symptoms of autism are recognised in the child's environment quite quickly. Usually, it is the parents who first realise that their child is not reaching the expected developmental milestones and that his or her development is delayed or has stopped [15]–[17]. At that time, parents observe that their child does not react to their physical affection, does not want to express emotions, often avoids hugging (which is very hard for parents, especially for mothers) and eye contact, and does not want to communicate in any way [15],[16]. Moreover, the child may present atypical behaviours and movements related to a strong need for isolation from its surroundings, which are incomprehensible to the parents [16]. Usually, these include destructive, socially unacceptable behaviours [18]. These symptoms arouse anxiety and feelings of helplessness in the parents and make them seek professional help [15].

    The problems affecting the autistic child also affect the parents. Therefore, it may be said that the autism of a child has considerable implications for its parents [19]. Caring for a child with autism is associated with emotional consequences [20]–[23]. It has been shown that parents of atypical children experience parental stress much more frequently than the general population [24], as the moment of the child's diagnosis generates strong uncertainty about the future life of the child and the whole family [25],[26].

    The study by Bitsika and Sharpley, based on an analysis of fathers and mothers of ASD children, showed that women caring for their disabled children are more preoccupied and prone to depression than men [27]. Similar results were obtained by Dąbrowska, who indicated that mothers are much more frequently exposed to stress [28]–[30]. Moreover, parents of ASD children have been shown to be three to five times more vulnerable to depression than parents of neurotypical children [31]. The most commonly used assessment tools for preoccupation and depression [27] of parents include:

    • Self-Rating Depression Scale—SDS [32].
    • Self-Rating Anxiety Scale—SAS [33].
    • Connor-Davidson Resilience Scale—CD-RISC [34].

    The aim of the paper is to evaluate the functioning of families with an ASD child and compare it to the functioning of families with neurotypical children. The degree of flexibility, cohesion and level of communication enables the family to be classified either as healthy or dysfunctional.

    The study was approved by the Bioethics Committee of the Poznan University of Medical Sciences (approval number: 1223/17) and registered in the Australian New Zealand Clinical Trials Registry (ANZCTR) under number ACTRN12618000598280.

    The study was performed using the Flexibility and Cohesion Evaluation Scales (FACES-IV) questionnaire by David H. Olson, in its Polish form, developed by Andrzej Margasiński. The questionnaire consists of sixty-two statements, to which the subject responds on a 5-point scale, from strongly disagree to strongly agree. The statements are divided into eight sub-scales. Six of them are the main sub-scales of David H. Olson's Circumplex Model, covering the two dimensions of family functioning, cohesion and flexibility (Balanced Cohesion, Disengaged, Enmeshed, Balanced Flexibility, Rigid, Chaotic). The two remaining sub-scales measure family communication (which is the third dimension of the Circumplex Model) and family satisfaction. Apart from sub-scale results, it is possible to calculate three complex ratios: the Cohesion Ratio, the Flexibility Ratio and the Total Circumplex Ratio, which reflects the degree to which family functioning is healthy [35].
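    The text does not define the three ratios. A minimal sketch, assuming the definitions commonly given for FACES-IV (each balanced score divided by the mean of its two unbalanced counterparts, and the total ratio taken as the mean of the two), might look as follows. The function name, the input values and the reading that a ratio above 1 points to a more balanced system are illustrative assumptions and are not taken from this study.

```python
def circumplex_ratios(balanced_cohesion, disengaged, enmeshed,
                      balanced_flexibility, rigid, chaotic):
    """Return (cohesion ratio, flexibility ratio, total circumplex ratio).

    Assumed definitions (not stated in the article):
      cohesion ratio    = Balanced Cohesion / mean(Disengaged, Enmeshed)
      flexibility ratio = Balanced Flexibility / mean(Rigid, Chaotic)
      total ratio       = mean(cohesion ratio, flexibility ratio)
    """
    cohesion_ratio = balanced_cohesion / ((disengaged + enmeshed) / 2)
    flexibility_ratio = balanced_flexibility / ((rigid + chaotic) / 2)
    total_ratio = (cohesion_ratio + flexibility_ratio) / 2
    return cohesion_ratio, flexibility_ratio, total_ratio


# Hypothetical sub-scale scores, chosen only to make the arithmetic easy to follow.
cohesion, flexibility, total = circumplex_ratios(
    balanced_cohesion=60, disengaged=30, enmeshed=30,
    balanced_flexibility=50, rigid=40, chaotic=60,
)
print(cohesion, flexibility, total)
```

    Under these assumed definitions, a family scoring high on the balanced sub-scales and low on the unbalanced ones obtains ratios above 1, and the opposite pattern pushes the ratios below 1.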

    The tool used is based on the Circumplex Model, which focuses on three crucial dimensions of family functioning: cohesion, flexibility and communication. Cohesion means the emotional bonding that family members have towards one another. Flexibility of relationships is defined as the quality and expression of leadership and organization, role relationships, and relationship rules and negotiations [36]. The communication dimension is viewed as a facilitating dimension that helps families alter their levels of cohesion and flexibility. The intensity of cohesion and flexibility of family relationships may have two basic levels: balanced or unbalanced. Unbalanced cohesion may mean an extremely high cohesion level (enmeshed relationships) or an extremely low cohesion level (disengaged relationships, lack of bonding). Likewise, unbalanced flexibility may mean extremely high (chaotic family relationships) or extremely low (rigid family relationships) flexibility levels. The main hypothesis of the model is that there is a positive relationship between a balanced cohesion level, a balanced flexibility level and healthy family functioning, as well as a positive relationship between an unbalanced cohesion level, an unbalanced flexibility level and problematic family functioning [36].

    The third basic dimension of D. H. Olson's Circumplex Model, which influences both flexibility and cohesion, is communication [37]. This refers to the skill of sharing information, plans and emotions with the family members. This dimension is also defined as the positive communication skills utilized in the couple or family system [38].

    Cluster analysis of data obtained from studies using the Flexibility and Cohesion Evaluation Scales resulted in distinguishing six family types: Balanced, Cohesively Rigid, Flexibly Disengaged, Mid-range, Rigidly Disengaged and Unbalanced [35]. The Balanced type is characterised by the highest scores on the balanced sub-scales and the lowest scores on the remaining sub-scales. The Cohesively Rigid type is characterised by high scores on the Balanced Cohesion and Rigid sub-scales, moderate Enmeshed scores, and low Disengaged and Chaotic scores. The Flexibly Disengaged type is characterised by high scores on the Balanced Flexibility and Disengaged sub-scales, and low scores on the Rigid sub-scale. The Mid-range type is characterised by moderate scores on all of the sub-scales, with the exception of the Disengaged sub-scale, where the score is usually low. The Rigidly Disengaged type is characterised by high scores on all of the sub-scales other than Cohesion, where moderate to low scores are characteristic. The Unbalanced type is characterised by high scores on all four of the unbalanced scales (Disengaged, Enmeshed, Rigid and Chaotic) and low scores on the two balanced scales (Balanced Cohesion and Balanced Flexibility). These families are assumed to experience the greatest difficulties and to be the most problematic in terms of their functioning. It is estimated that this is the family type that most often seeks therapy [35].

    The study with the Flexibility and Cohesion Evaluation Scales by David H. Olson, in the Polish adaptation by Andrzej Margasiński, included 70 parents of ASD children and, as the control group, 70 parents of children without diagnosed ASD. The study was performed in January and February 2018. The inclusion criteria were: (1) parents aged 25–45; (2) children without comorbidities; (3) a diagnosis of autism in the child.

    In order to compare the FACES-IV results obtained by the parents of ASD children and the control group, an independent samples t-test for equality of means was performed, and the statistical significance of the obtained differences was assessed.

    The analysis of the Balanced Cohesion sub-scale indicated that the parents of children with autism achieve lower FACES-IV results in the Balanced Cohesion sub-scale than the control group. The study covered 140 observations. The significance level of Levene's test indicates that the results should be interpreted with the assumed equality of variance. The p-value for the t-test for difference of means is 0.002; therefore, the means in both groups differ in a statistically significant way. The results are presented in Table 1.

    Table 1.  The sub-scales in the group of parents of children with ASD vs. parents of neurotypical children.
    Sub-scale (STEN)        Group           N    Mean     SD       P-value
    Balanced Cohesion       Autism          70   5.2000   2.1843   0.002
                            Control group   70   6.3571   2.2135
    Balanced Flexibility    Autism          70   5.6857   2.0820   0.123
                            Control group   70   6.2143   1.9478
    Disengaged              Autism          70   7.2857   1.8583   0.006
                            Control group   70   6.4143   1.8217
    Enmeshed                Autism          70   6.8857   2.0610   < 0.001
                            Control group   70   5.4857   1.8077
    Rigid                   Autism          70   6.9143   1.7672   0.444
                            Control group   70   6.6857   1.7573
    Chaotic                 Autism          70   6.7143   1.8893   0.033
                            Control group   70   6.0143   1.9597
    Family Communication    Autism          70   5.3857   2.4215   0.060
                            Control group   70   6.1714   2.4846
    Family Satisfaction     Autism          70   6.3143   2.4586   0.014
                            Control group   70   7.2857   2.1274


    • Balanced Flexibility sub-scale: the p-value for the t-test for difference of means is 0.123; therefore, the means in both groups do not differ in a statistically significant way.
    • Disengaged sub-scale: the p-value for the t-test for difference of means is 0.006; therefore, the means in both groups differ in a statistically significant way.
    • Enmeshed sub-scale: the p-value for the t-test for difference of means is below 0.001; therefore, the means in both groups differ in a statistically significant way.
    • Rigid sub-scale: the p-value for the t-test for difference of means is 0.444; therefore, the means in both groups do not differ in a statistically significant way.
    • Chaotic sub-scale: the p-value for the t-test for difference of means is 0.033; therefore, the means in both groups differ in a statistically significant way.
    • Family Communication sub-scale: the p-value for the t-test for difference of means is 0.060; therefore, the means in both groups do not differ in a statistically significant way.
    • Family Satisfaction sub-scale: the p-value for the t-test for difference of means is 0.014; therefore, the means in both groups differ in a statistically significant way.
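
    The comparison of means can be retraced from summary statistics alone. The sketch below is a plain-Python illustration of an equal-variance (pooled) independent samples t-test. It reads the dispersion values in Table 1 as standard deviations (for example 2.1843 and 2.2135 for Balanced Cohesion), which is an assumption about how the table should be read. The function names and the numerical integration of the t density are likewise illustrative choices, not the authors' procedure.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value, using Simpson's rule on the density over [0, |t|]."""
    b = abs(t)
    if b == 0:
        return 1.0
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * total * h / 3  # P(-|t| < T < |t|)
    return max(0.0, 1.0 - central_mass)


def pooled_t_test(mean1, sd1, n1, mean2, sd2, n2):
    """Equal-variance independent samples t-test from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean1 - mean2) / std_err
    return t, df, two_sided_p(t, df)


# Balanced Cohesion: ASD parents vs. control group (values read from Table 1).
t_stat, df, p_value = pooled_t_test(5.2000, 2.1843, 70, 6.3571, 2.2135, 70)
print(f"t = {t_stat:.3f}, df = {df}, p = {p_value:.4f}")
```

    Under this reading of the table, the Balanced Cohesion comparison gives a p-value that rounds to the reported 0.002, and the same call with the Balanced Flexibility row gives a value close to the reported 0.123.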

    The analyses within the group of parents of ASD children did not show any statistically significant differences in FACES-IV due to socio-demographic variables.

    Research into parental stress levels showed that parents of children with ASD have greater uncertainty, stress and depression levels than parents of neurotypical children [39]–[43] and also parents of children with other disabilities [44],[45]. Similar results can be observed in the comparison between the stress levels of parents of ASD children and the general population [21],[43],[46]–[49].

    The most significant factor generating parenting stress is the presence of ASD symptoms in the child [31]. Among the most frequent symptoms contributing to parental stress, researchers list impaired cognitive functions and impaired social reactions, which directly correspond to the emergence of parental stress, anxiety and depression [50]–[53]. Other aspects of autism which may induce parental stress are: the level of functioning of the child, the child's age, the dysfunction of adaptive behaviours, aggression, tantrums, and self-inflicted injuries [21],[54]–[57].

    However, it is emphasized that there is a lack of social understanding of the characteristics of ASD, as a result of which both the parents and the ASD children themselves are subject to more severe social criticism. The specific behaviours of ASD patients are often perceived as parenting errors [31]. More importantly, it is considered that parental stress factors come exclusively from outside this social group and not from the personality and behavioural models of the parents themselves [31],[58]. Important factors influencing the development of parental stress and burn-out include the lack of activity outside the home among mothers of ASD children, in comparison to mothers of neurotypical children, who can spend much more of their free time outside the family, in a stress-free environment [59]. Similar conclusions were drawn in other studies, which, apart from isolation factors, also identified the phenomenon of “self-blaming” mothers, who burden themselves with blame for their child's difficulties [60],[61]. Another aspect of parental stress described in the literature is escaping from the problems related to the child's disability, which manifest as difficult behaviours [62].

    The assessment of parental stress showed that over half (55.8%) of fathers feel overwhelming helplessness one to five times a month, while over 70% of mothers experience this feeling one to five times a month. The results of this study confirmed earlier research into anxiety and depression in parents, conducted by the same authors on a group of parents in Australia [27],[63].

    Another study focused on the parents of children with diagnosed ASD. The research into parental stress showed that the majority of the subjects agreed with the statements “caring for a child takes a lot of time and energy” and “the behaviour of my child embarrasses and stresses me”. In the area of social support, the majority of the subjects agreed with the statements “the members of my family rely on me” and “I cannot rely on the members of my family”. As far as the area of self-efficacy is concerned, the majority marked the answer “try another solution if the first one did not bring expected results”. The study showed that there are multiple sources of parental stress and that its level is influenced by all members of the ASD child's family, including parents, siblings, and grandparents. It was also shown that, despite the difficulties and problems, the caregivers of ASD children have social support and can cope with difficult situations [64].

    The scientific literature also includes works devoted to the role of stress resilience and self-efficacy in parents of ASD children. One of them analyses the group under discussion. The study was conducted using the Satisfaction With Life Scale (SWLS) [65], the Coping Strategy Inventory (CSI) [66], and the Coping with Stress Self-Efficacy Scale (CSSES) [67]. The results confirm that bringing up a child influences coping strategies and the sense of self-efficacy. Therefore, stress has an impact on the level of satisfaction with life of parents of ASD children. The researchers found differences depending on the parent's sex, stating that the primary goal of women is the sense of self-efficacy, while men put problem solving first [67]–[72]. It was also shown that, as the parents of ASD children age, the social support for these families decreases, as does cognitive restructuring [69],[73],[74].

    The results of many studies show that the sense of self-efficacy contributes to higher life satisfaction. Moreover, the sense of self-efficacy correlates positively with resilience strategies (problem solving and cognitive restructuring) and negatively with dysfunctional strategies (social isolation, wishful thinking, self-criticism) [60],[69],[75]–[79].

    It is worth mentioning the study conducted by Bitsika et al. (2017), analysing daily cortisol levels in parents of ASD children. Cortisol is called the neurohormone of stress [57],[80]. Cortisol levels were measured via the analysis of the subjects' saliva. It is estimated that cortisol is present in this material for about 10 minutes from the occurrence of the stress factor [81]. It was shown that, in 129 subjects, the levels of cortisol drop in accordance with the circadian rhythm. At the same time, the study showed that self-inflicted injuries in children with ASD may be a stress-provoking factor for parents [57],[82].

    In order to reduce parental stress, effective mitigation of the child's autism symptoms is recommended [83]. It is emphasized that only successful reduction of ASD symptoms in the child may improve the well-being of the whole family [84]. Long-term stress may have drastic health consequences for parents of ASD children [31]. Support groups for parents of ASD children are one of the forms of therapy aimed at coping with stress and preventing burn-out [85].

    (1). It has been established that the parents of children with autism achieve lower results in the balanced cohesion sub-scale than the control group.

    (2). The parents of ASD children obtained higher scores in the disengaged sub-scale than the control group.

    (3). Furthermore, in the enmeshed sub-scale, their scores were higher than in the control group.

    (4). In the chaotic sub-scale, the parents of ASD children obtained higher scores than the control group.

    (5). It was found that the family satisfaction level in parents of ASD children is lower than in the control group.

    (6). In the balanced flexibility, rigid and family communication sub-scales, there were no statistically significant differences between the parents of ASD children and the parents from the control group.

    (7). In parents of ASD children, the scores in all unbalanced sub-scales were higher than in families with children without autism (even if some of the differences were not statistically significant), while the scores in the balanced sub-scales were lower.

    (8). The STEN analysis of the mean results of the parents of ASD children does not show extreme results in the scales studied; their results remain in the mid-range values (assuming that the middle of the STEN scale is 5.5 and the standard deviation is 2).

    (9). In families with ASD children, there is a higher risk of the unbalanced or rigidly disengaged family type than in families with neurotypical children.

    This may be a significant result: it suggests a risk of disturbed family system functioning in families with children with ASD, which should prompt early diagnosis of family functioning for these families, followed by support and therapy.
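
    Conclusions (7) and (8) can be checked directly against the group means in Table 1. The short sketch below does so, treating "mid-range" as lying within one standard deviation of the STEN midpoint (5.5 ± 2). That band is an assumed reading of conclusion (8), and the variable names are illustrative.

```python
# Mean STEN scores copied from Table 1: (parents of ASD children, control group).
means = {
    "Balanced Cohesion": (5.2000, 6.3571),
    "Balanced Flexibility": (5.6857, 6.2143),
    "Disengaged": (7.2857, 6.4143),
    "Enmeshed": (6.8857, 5.4857),
    "Rigid": (6.9143, 6.6857),
    "Chaotic": (6.7143, 6.0143),
    "Family Communication": (5.3857, 6.1714),
    "Family Satisfaction": (6.3143, 7.2857),
}

BALANCED = ("Balanced Cohesion", "Balanced Flexibility")
UNBALANCED = ("Disengaged", "Enmeshed", "Rigid", "Chaotic")

# Assumed definition of "mid-range": within one SD of the STEN midpoint.
STEN_MID, STEN_SD = 5.5, 2.0
low, high = STEN_MID - STEN_SD, STEN_MID + STEN_SD

# Conclusion (7): ASD means are higher on unbalanced and lower on balanced sub-scales.
unbalanced_higher = all(means[s][0] > means[s][1] for s in UNBALANCED)
balanced_lower = all(means[s][0] < means[s][1] for s in BALANCED)

# Conclusion (8): no ASD group mean falls outside the assumed mid-range band.
asd_in_mid_range = all(low <= asd <= high for asd, _ in means.values())

print(unbalanced_higher, balanced_lower, asd_in_mid_range)
```

    With the Table 1 means, all three checks hold. The check says nothing about statistical significance, which depends on the t-tests reported above.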



    [1] Alija S, Beqiri E, Gaafar AS, et al. (2023) Predicting students performance using supervised machine learning based on imbalanced dataset and wrapper feature selection. Informatica 47. https://doi.org/10.31449/inf.v47i1.4519 doi: 10.31449/inf.v47i1.4519
    [2] Aljedaani W, Rustam F, Mkaouer MW, et al. (2022) Sentiment analysis on twitter data integrating textblob and deep learning models: The case of us airline industry. Knowl-Based Syst 255: 109780. https://doi.org/10.1016/j.knosys.2022.109780 doi: 10.1016/j.knosys.2022.109780
    [3] Anguita D, Ghelardoni L, Ghio A, et al. (2012) The 'K' in K-fold cross validation, in 'ESANN', 441–446.
    [4] Bentéjac C, Csörgő A, Martínez-Muñoz G (2021) A comparative analysis of gradient boosting algorithms. Artif Intell Rev 54: 1937–1967. https://doi.org/10.1007/s10462-020-09896-5 doi: 10.1007/s10462-020-09896-5
    [5] Booth A, Gerding E, McGroarty F (2015) Performance-weighted ensembles of random forests for predicting price impact. Quant Financ 15: 1823–1835. https://doi.org/10.1080/14697688.2014.983539 doi: 10.1080/14697688.2014.983539
    [6] Breeden J (2021) A survey of machine learning in credit risk. J Credit Risk 17. https://ssrn.com/abstract=3946261
    [7] Breiman L (2001) Random forests. Mach learn 45: 5–32. https://doi.org/10.1023/A:1010933404324 doi: 10.1023/A:1010933404324
    [8] Breiman L, Friedman J, Olshen R, et al. (1984) Classification and regression trees (wadsworth, belmont, ca). 13: 978–0412048418.
    [9] Calderoni L, Ferrara M, Franco A, et al. (2015) Indoor localization in a hospital environment using random forest classifiers. Expert Syst Appl 42: 125–134. https://doi.org/10.1016/j.eswa.2014.07.042 doi: 10.1016/j.eswa.2014.07.042
    [10] Cawley GC, Talbot NL (2010) On over-fitting in model selection and subsequent selection bias in performance evaluation. J Mach Learn Res 11: 2079–2107.
    [11] Chen T, Guestrin C (2016) Xgboost: A scalable tree boosting system, in Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, 785–794.
    [12] Chicco D, Jurman G (2020) The advantages of the matthews correlation coefficient (mcc) over f1 score and accuracy in binary classification evaluation. BMC Genomics 21: 1–13. https://doi.org/10.1186/s12864-019-6413-7 doi: 10.1186/s12864-019-6413-7
    [13] Das S, Datta S, Chaudhuri BB (2018) Handling data irregularities in classification: Foundations, trends, and future challenges. Pattern Recogn 81: 674–693. https://doi.org/10.1016/j.patcog.2018.03.008 doi: 10.1016/j.patcog.2018.03.008
    [14] De Campos LM, Cano A, Castellano JG, et al. (2011) Bayesian networks classifiers for gene-expression data, in 2011 11th International Conference on Intelligent Systems Design and Applications, IEEE, 1200–1206. https://doi.org/10.1109/ISDA.2011.6121822
    [15] Deng M, Chen J, Huang J, et al. (2018) Agricultural drought risk evaluation based on an optimized comprehensive index system. Sustainability 10: 3465. https://doi.org/10.3390/su10103465 doi: 10.3390/su10103465
    [16] Dhieb N, Ghazzai H, Besbes H, et al. (2019) Extreme gradient boosting machine learning algorithm for safe auto insurance operations, in 2019 IEEE international conference on vehicular electronics and safety (ICVES), IEEE, 1–5. https://doi.org/10.1109/ICVES.2019.8906396
    [17] Dorogush AV, Ershov V, Gulin A (2018) Catboost: gradient boosting with categorical features support. arXiv preprint. https://doi.org/10.48550/arXiv.1810.11363 doi: 10.48550/arXiv.1810.11363
    [18] Fayyad UM, Irani KB (1992) The attribute selection problem in decision tree generation, in 'AAAI', 104–110.
    [19] Fernando KRM, Tsokos CP (2021) Dynamically weighted balanced loss: class imbalanced learning and confidence calibration of deep neural networks. IEEE T Neur Net Learn Syst 33: 2940–2951.
    [20] Granström D, Abrahamsson J (2019) Loan default prediction using supervised machine learning algorithms.
    [21] Han J, Kamber M, Pei J (2012) Data mining concepts and techniques third edition, University of Illinois at Urbana-Champaign Micheline Kamber Jian Pei Simon Fraser University.
    [22] Ho TK (1995) Random decision forests, in Proceedings of 3rd international conference on document analysis and recognition, IEEE, 1: 278–282.
    [23] Kaggle (2023) Give me some credit. Available from: https://www.kaggle.com/competitions/GiveMeSomeCredit/dataselect = cs-training.csv. Accessed: 2023-02-05.
    [24] Ke G, Meng Q, Finley T, et al. (2017) Lightgbm: A highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst 30.
    [25] Kelleher JD, Mac Namee B, D'arcy A (2020) Fundamentals of machine learning for predictive data analytics: algorithms, worked examples, and case studies, MIT press.
    [26] Khemakhem S, Boujelbene Y (2018) Predicting credit risk on the basis of financial and non-financial variables and data mining. Rev Account Financ 17: 316–340. https://doi.org/10.1108/RAF-07-2017-0143
    [27] Laradji IH, Alshayeb M, Ghouti L (2015) Software defect prediction using ensemble learning on selected features. Inform Software Tech 58: 388–402. https://doi.org/10.1016/j.infsof.2014.07.005
    [28] Leo M, Sharma S, Maddulety K (2019) Machine learning in banking risk management: A literature review. Risks 7: 29. https://doi.org/10.3390/risks7010029
    [29] Li K, Xu H, Liu X (2022) Analysis and visualization of accidents severity based on lightgbm-tpe. Chaos, Solitons Fract 157: 111987. https://doi.org/10.1016/j.chaos.2022.111987
    [30] Liu L, Li P, Chu M, et al. (2021) Stochastic gradient support vector machine with local structural information for pattern recognition. Int J Mach Learn Cybe 12: 2237–2254. https://doi.org/10.1007/s13042-021-01303-x
    [31] Liu W, Chawla S, Cieslak DA, et al. (2010) A robust decision tree algorithm for imbalanced data sets, in Proceedings of the 2010 SIAM International Conference on Data Mining, SIAM, 766–777.
    [32] Lokeswari N, Amaravathi K (2018) Comparative study of classification algorithms in sentiment analysis. Int Res J Sci Eng Technol 4: 31–39.
    [33] Mitchell TM (1997) Machine learning, McGraw-Hill, New York.
    [34] Ogunleye A, Wang QG (2019) Xgboost model for chronic kidney disease diagnosis. IEEE/ACM T Comput Bi 17: 2131–2140. https://doi.org/10.1109/TCBB.2019.2911071
    [35] Okey OD, Maidin SS, Adasme P, et al. (2022) Boostedenml: Efficient technique for detecting cyberattacks in iot systems using boosted ensemble machine learning. Sensors 22: 7409. https://doi.org/10.3390/s22197409
    [36] Padmaja TM, Dhulipalla N, Bapi RS, et al. (2007) Unbalanced data classification using extreme outlier elimination and sampling techniques for fraud detection, in 15th International Conference on Advanced Computing and Communications (ADCOM 2007), IEEE, 511–516. https://doi.org/10.1109/ADCOM.2007.74
    [37] Patro S, Sahu KK (2015) Normalization: A preprocessing stage. arXiv preprint arXiv: 1503.06462. https://doi.org/10.48550/arXiv.1503.06462
    [38] Rubin DB (1976) Inference and missing data. Biometrika 63: 581–592.
    [39] Schafer JL, Graham JW (2002) Missing data: our view of the state of the art. Psychol Methods 7: 147. https://doi.org/10.1037/1082-989X.7.2.147
    [40] Singhal Y, Jain A, Batra S, et al. (2018) Review of bagging and boosting classification performance on unbalanced binary classification, in 2018 IEEE 8th International Advance Computing Conference (IACC), IEEE, 338–343. https://doi.org/10.1109/IADCC.2018.8692138
    [41] Stephens D, Diesing M (2014) A comparison of supervised classification methods for the prediction of substrate type using multibeam acoustic and legacy grain-size data. PloS One 9: e93950. https://doi.org/10.1371/journal.pone.0093950
    [42] Sun J, Lang J, Fujita H, et al. (2018) Imbalanced enterprise credit evaluation with dte-sbd: Decision tree ensemble based on smote and bagging with differentiated sampling rates. Inform Sci 425: 76–91. https://doi.org/10.1016/j.ins.2017.10.017
    [43] Thabtah F, Hammoud S, Kamalov F, et al. (2020) Data imbalance in classification: Experimental evaluation. Inform Sci 513: 429–441. https://doi.org/10.1016/j.ins.2019.11.004
    [44] Wilson DR, Martinez TR (1997) Improved heterogeneous distance functions. J Artif Intell Res 6: 1–34.
    [45] Yao Z, Ruzzo WL (2006) A regression-based k nearest neighbor algorithm for gene function prediction from heterogeneous data, BMC Bioinformatics, BioMed Central, 7: 1–11. https://doi.org/10.1186/1471-2105-7-S1-S11
    [46] Zhang C, Liu C, Zhang X, et al. (2017) An up-to-date comparison of state-of-the-art classification algorithms. Expert Syst Appl 82: 128–150. https://doi.org/10.1016/j.eswa.2017.04.003
    [47] Zhou L, Wang H (2012) Loan default prediction on large imbalanced data using random forests. TELKOMNIKA Indonesian J Electr Eng 10: 1519–1525. https://doi.org/10.11591/telkomnika.v10i6.1323
  • © 2023 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)