CCkEL: Compensation-based correlated k-labelsets for classifying imbalanced multi-label data

Qianpeng Xiao; Changbin Shao; Sen Xu; Xibei Yang; Hualong Yu

doi:10.3934/era.2024139

Electronic Research Archive

2024, Volume 32, Issue 5: 3038-3058. doi: 10.3934/era.2024139


Research article


1. School of Computer, Jiangsu University of Science and Technology, Zhenjiang, Jiangsu, China
2. Jiangsu Key Laboratory of Media Design and Software Technology, Jiangnan University, Wuxi, Jiangsu, China
3. School of Information Engineering, Yancheng Institute of Technology, Yancheng, Jiangsu, China

Received: 07 February 2024; Revised: 03 April 2024; Accepted: 11 April 2024; Published: 23 April 2024

Imbalanced data distribution and label correlation are two intrinsic characteristics of multi-label data: instances associated with certain labels may be sparse, and some labels may be correlated with others, which poses a challenge for traditional machine learning techniques. To adapt to imbalanced data distribution and label correlation simultaneously, this study proposes a novel algorithm called compensation-based correlated k-labelsets (CCkEL). First, for each label, CCkEL selects the k-1 most strongly correlated labels in the label space to form multiple correlated k-labelsets; this improves its efficiency in comparison with the random k-labelsets (RAkEL) algorithm. Then, CCkEL transforms each k-labelset into a multiclass classification problem. Finally, it uses a fast decision output compensation strategy to address class imbalance in the decoded multi-label decision space. We compared the performance of CCkEL with that of several popular multi-label imbalance learning algorithms on 10 benchmark multi-label datasets, and the results show its effectiveness and superiority.
Keywords: multi-label learning; class imbalance; random k-labelsets; label correlation; decision output compensation
Citation: Qianpeng Xiao, Changbin Shao, Sen Xu, Xibei Yang, Hualong Yu. CCkEL: Compensation-based correlated k-labelsets for classifying imbalanced multi-label data[J]. Electronic Research Archive, 2024, 32(5): 3038-3058. doi: 10.3934/era.2024139
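
The first two steps described in the abstract (forming correlated k-labelsets, then turning each labelset into a multiclass problem) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes the Jaccard coefficient as the correlation measure, drops duplicate labelsets, and encodes each label combination as an integer class id; the function names `jaccard_label_similarity`, `correlated_k_labelsets`, and `encode_labelset` are hypothetical. The decision output compensation step is not sketched because the abstract does not specify it.

```python
import numpy as np


def jaccard_label_similarity(Y):
    """Pairwise Jaccard similarity between label columns of a binary matrix Y.

    Assumption: Jaccard is used as the correlation measure for illustration.
    """
    Y = np.asarray(Y).astype(int)
    inter = Y.T @ Y                          # co-occurrence counts per label pair
    counts = Y.sum(axis=0)                   # positives per label
    union = counts[:, None] + counts[None, :] - inter
    sim = np.zeros(inter.shape, dtype=float)
    np.divide(inter, union, out=sim, where=union > 0)
    return sim


def correlated_k_labelsets(Y, k):
    """For each label, group it with its k-1 most similar labels."""
    sim = jaccard_label_similarity(Y)
    np.fill_diagonal(sim, -1.0)              # a label is never its own partner
    labelsets = []
    for j in range(sim.shape[0]):
        partners = np.argsort(-sim[j], kind="stable")[: k - 1]
        labelsets.append(tuple(sorted([j, *partners.tolist()])))
    # Assumption of this sketch: identical labelsets are kept only once.
    return list(dict.fromkeys(labelsets))


def encode_labelset(Y, labelset):
    """Label-powerset style encoding: each 0/1 combination becomes a class id."""
    sub = np.asarray(Y).astype(int)[:, list(labelset)]
    weights = 2 ** np.arange(len(labelset))  # read each row as a binary number
    return sub @ weights


if __name__ == "__main__":
    # Toy label matrix: 5 instances, 4 labels (made up for illustration).
    Y = [[1, 1, 0, 0],
         [1, 1, 0, 0],
         [0, 0, 1, 1],
         [0, 0, 1, 0],
         [1, 0, 0, 1]]
    for labelset in correlated_k_labelsets(Y, k=2):
        print(labelset, encode_labelset(Y, labelset).tolist())
```

Each encoded target vector could then be passed to any multiclass learner, one model per labelset. How the per-labelset predictions are decoded and compensated for class imbalance is the part of CCkEL that this sketch leaves out.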



© 2024 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)