Research article

A selective evolutionary heterogeneous ensemble algorithm for classifying imbalanced data

  • Received: 18 December 2022 Revised: 17 February 2023 Accepted: 03 March 2023 Published: 13 March 2023
  • Learning from imbalanced data is a challenging task, as with this type of data, most conventional supervised learning algorithms tend to favor the majority class, which has significantly more instances than the other classes. Ensemble learning is a robust solution for addressing the imbalanced classification problem. To construct a successful ensemble classifier, the diversity of base classifiers should receive specific attention. In this paper, we present a novel ensemble learning algorithm called Selective Evolutionary Heterogeneous Ensemble (SEHE), which produces diversity in two ways: 1) adopting multiple different sampling strategies to generate diverse training subsets and 2) training multiple heterogeneous base classifiers to construct an ensemble. In addition, considering that some low-quality base classifiers may pull down the performance of an ensemble and that it is difficult to estimate the potential of each base classifier directly, we draw on the idea of a selective ensemble to adaptively select base classifiers for constructing an ensemble. In particular, an evolutionary algorithm is adopted to conduct the procedure of adaptive selection in SEHE. The experimental results on 42 imbalanced data sets show that SEHE is significantly superior to several state-of-the-art ensemble learning algorithms specifically designed to address the class imbalance problem, indicating its effectiveness and superiority.

    Citation: Xiaomeng An, Sen Xu. A selective evolutionary heterogeneous ensemble algorithm for classifying imbalanced data[J]. Electronic Research Archive, 2023, 31(5): 2733-2757. doi: 10.3934/era.2023138
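
    The pipeline described in the abstract can be made concrete in a short sketch. The Python code below is our illustration, not the authors' implementation: it assumes three common sampling strategies from imbalanced-learn, four scikit-learn base learners, and a plain genetic algorithm with AUC fitness on a held-out validation set. The paper's full text specifies the actual sampling strategies, classifier pool, GA operators and fitness measure.

    ```python
    # A minimal sketch of the SEHE idea: build a pool of heterogeneous
    # classifiers trained on differently resampled subsets, then let a
    # simple genetic algorithm pick the sub-ensemble that maximizes a
    # validation metric.  The samplers, learner pool, GA operators and
    # AUC fitness below are illustrative assumptions, not the paper's
    # exact configuration.
    import numpy as np
    from sklearn.base import clone
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from imblearn.over_sampling import SMOTE, RandomOverSampler
    from imblearn.under_sampling import RandomUnderSampler

    rng = np.random.default_rng(0)

    # Imbalanced toy problem (roughly 9:1).
    X, y = make_classification(n_samples=2000, weights=[0.9], flip_y=0.02,
                               random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3,
                                                stratify=y, random_state=0)

    # Diversity source 1: multiple sampling strategies -> diverse subsets.
    samplers = [SMOTE(random_state=0), RandomUnderSampler(random_state=0),
                RandomOverSampler(random_state=0)]
    # Diversity source 2: heterogeneous base learners.
    protos = [DecisionTreeClassifier(random_state=0), GaussianNB(),
              KNeighborsClassifier(), LogisticRegression(max_iter=1000)]

    pool_preds = []  # one row of validation-set votes per base classifier
    for sampler in samplers:
        X_s, y_s = sampler.fit_resample(X_tr, y_tr)
        for proto in protos:
            clf = clone(proto).fit(X_s, y_s)
            pool_preds.append(clf.predict(X_val))
    pool_preds = np.array(pool_preds)
    n_pool = len(pool_preds)

    def fitness(mask):
        """AUC of the majority vote of the selected pool members."""
        if mask.sum() == 0:
            return 0.0
        votes = pool_preds[mask.astype(bool)].mean(axis=0)
        return roc_auc_score(y_val, votes)

    # Genetic algorithm over binary chromosomes (one bit per pool member).
    pop = rng.integers(0, 2, size=(20, n_pool))
    best_mask, best_fit = pop[0].copy(), fitness(pop[0])
    for _ in range(30):
        fits = np.array([fitness(m) for m in pop])
        if fits.max() > best_fit:  # running best acts as elitism
            best_fit, best_mask = fits.max(), pop[fits.argmax()].copy()
        # binary tournament selection
        duels = rng.integers(0, len(pop), size=(len(pop), 2))
        winners = np.where(fits[duels[:, 0]] >= fits[duels[:, 1]],
                           duels[:, 0], duels[:, 1])
        parents = pop[winners]
        # one-point crossover on consecutive pairs
        children = parents.copy()
        for i in range(0, len(pop) - 1, 2):
            cut = rng.integers(1, n_pool)
            children[i, cut:] = parents[i + 1, cut:]
            children[i + 1, cut:] = parents[i, cut:]
        # bit-flip mutation
        children ^= (rng.random(children.shape) < 0.05).astype(children.dtype)
        pop = children

    print(f"selected {best_mask.sum()}/{n_pool} classifiers, "
          f"validation AUC = {best_fit:.3f}")
    ```

    Keeping a running best chromosome plays the role of elitism; in practice the fitness function would be whatever imbalance-aware measure (e.g., F-measure, G-mean or AUC) the experiments adopt, and the fraction of positive votes serves as a soft score for AUC.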

  • © 2023 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)