Multilabel learning is an important topic in machine learning research. Evaluating models in multilabel settings requires specific cross validation methods designed for multilabel data. In this article, we show that the most widely used cross validation split quality measure does not behave adequately with multilabel data that has strong class imbalance. We present improved measures and an algorithm, optisplit, for optimizing cross validations splits. Extensive comparison of various types of cross validation methods shows that optisplit produces more even cross validation splits than the existing methods and it is among the fastest methods with good splitting performance.
Citation: Henri Tiittanen, Liisa Holm, Petri Törönen. Novel split quality measures for stratified multilabel cross validation with application to large and sparse gene ontology datasets[J]. Applied Computing and Intelligence, 2022, 2(1): 49-62. doi: 10.3934/aci.2022003
Multilabel learning is an important topic in machine learning research. Evaluating models in multilabel settings requires specific cross validation methods designed for multilabel data. In this article, we show that the most widely used cross validation split quality measure does not behave adequately with multilabel data that has strong class imbalance. We present improved measures and an algorithm, optisplit, for optimizing cross validations splits. Extensive comparison of various types of cross validation methods shows that optisplit produces more even cross validation splits than the existing methods and it is among the fastest methods with good splitting performance.
[1] | M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, et al., Gene ontology: tool for the unification of biology, Nature genetics, 25 (2000), 25–29. https://doi.org/10.1038/75556 doi: 10.1038/75556 |
[2] | S. Bengio, K. Dembczynski, T. Joachims, M. Kloft, M. Varma, Extreme Classification (Dagstuhl Seminar 18291), Dagstuhl Reports, 8 (2019), 62–80. |
[3] | K. Bhatia, K. Dahiya, H. Jain, P. Kar, A. Mittal, Y. Prabhu, M. Varma, The extreme classification repository: Multi-label datasets and code, 2016. |
[4] | F. Charte, A. Rivera, M. J. del Jesus, F. Herrera, A. Troncoso, H. Quintián, E. Corchado, On the impact of dataset complexity and sampling strategy in multilabel classifiers performance, Hybrid Artificial Intelligent Systems, (2016), 500–511. Springer International Publishing. https://doi.org/10.1007/978-3-319-32034-2_42 |
[5] | A. De Myttenaere, B. Golden, B. Le Grand, F. Rossi, Mean absolute percentage error for regression models, Neurocomputing, 192 (2016), 38–48. https://doi.org/10.1016/j.neucom.2015.12.114 doi: 10.1016/j.neucom.2015.12.114 |
[6] | F. Florez-Revuelta, Evosplit: An evolutionary approach to split a multi-label data set into disjoint subsets, Applied Sciences, 11 (2021), 2823. https://doi.org/10.3390/app11062823 doi: 10.3390/app11062823 |
[7] | M Merrillees, L Du, Stratified Sampling for Extreme Multi-Label Data, Pacific-Asia Conference on Knowledge Discovery and Data Mining, (2021), 334–345. https://doi.org/10.1007/978-3-030-75765-6_27 |
[8] | M Merrillees, L Du, Stratified sampling for xml, 2021. Available from: https://github.com/maxitron93/stratified_sampling_for_XML. |
[9] | K. Sechidis, G. Tsoumakas, I. Vlahavas, On the stratification of multi-label data, Joint European Conference on Machine Learning and Knowledge Discovery in Databases, (2011), 145–158. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-23808-6_10 |
[10] | P. Szymański, T. Kajdanowicz, A scikit-based Python environment for performing multi-label classification, arXiv e-prints, 2017. |
[11] | P. Szymański, T. Kajdanowicz, A network perspective on stratification of multi-label data, Proceedings of the First International Workshop on Learning with Imbalanced Domains: Theory and Applications, volume 74 of Proceedings of Machine Learning Research, (2017), 22–35. |
[12] | H. Tiittanen, L. Holm, P. Törönen, Optisplit. Available from: https://github.com/xtixtixt/optisplit. |
[13] | P. Törönen, A. Medlar, L. Holm, Pannzer2: a rapid functional annotation web server, Nucleic acids res., 46 (2018), W84–W88. https://doi.org/10.1093/nar/gky350 doi: 10.1093/nar/gky350 |
[14] | G. Tsoumakas, E. Spyromitros-Xioufis, J. Vilcek, I. Vlahavas, Mulan: A java library for multi-label learning, J. Mach. Learn. Res., 12 (2011), 2411–2414. |
[15] | D. H. Wolpert, Stacked generalization, Neural Networks, 5 (1992), 241–259. https://doi.org/10.1016/S0893-6080(05)80023-1 |
[16] | N. Zhou, Y. Jiang, T. R. Bergquist, A. J. Lee, B. Z. Kacsoh, A. W. Crocker, K. A. Lewis, G. Georghiou, et al., The cafa challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens, Genome biol., 20 (2019), 1–23. |