Research article

Novel split quality measures for stratified multilabel cross validation with application to large and sparse gene ontology datasets


  • Received: 11 February 2022 Revised: 30 March 2022 Accepted: 30 March 2022 Published: 31 March 2022
  • Multilabel learning is an important topic in machine learning research. Evaluating models in multilabel settings requires specific cross validation methods designed for multilabel data. In this article, we show that the most widely used cross validation split quality measure does not behave adequately with multilabel data that has strong class imbalance. We present improved measures and an algorithm, optisplit, for optimizing cross validations splits. Extensive comparison of various types of cross validation methods shows that optisplit produces more even cross validation splits than the existing methods and it is among the fastest methods with good splitting performance.

    Citation: Henri Tiittanen, Liisa Holm, Petri Törönen. Novel split quality measures for stratified multilabel cross validation with application to large and sparse gene ontology datasets[J]. Applied Computing and Intelligence, 2022, 2(1): 49-62. doi: 10.3934/aci.2022003

    Related Papers:

  • Multilabel learning is an important topic in machine learning research. Evaluating models in multilabel settings requires specific cross validation methods designed for multilabel data. In this article, we show that the most widely used cross validation split quality measure does not behave adequately with multilabel data that has strong class imbalance. We present improved measures and an algorithm, optisplit, for optimizing cross validations splits. Extensive comparison of various types of cross validation methods shows that optisplit produces more even cross validation splits than the existing methods and it is among the fastest methods with good splitting performance.



    加载中


    [1] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, et al., Gene ontology: tool for the unification of biology, Nature genetics, 25 (2000), 25–29. https://doi.org/10.1038/75556 doi: 10.1038/75556
    [2] S. Bengio, K. Dembczynski, T. Joachims, M. Kloft, M. Varma, Extreme Classification (Dagstuhl Seminar 18291), Dagstuhl Reports, 8 (2019), 62–80.
    [3] K. Bhatia, K. Dahiya, H. Jain, P. Kar, A. Mittal, Y. Prabhu, M. Varma, The extreme classification repository: Multi-label datasets and code, 2016.
    [4] F. Charte, A. Rivera, M. J. del Jesus, F. Herrera, A. Troncoso, H. Quintián, E. Corchado, On the impact of dataset complexity and sampling strategy in multilabel classifiers performance, Hybrid Artificial Intelligent Systems, (2016), 500–511. Springer International Publishing. https://doi.org/10.1007/978-3-319-32034-2_42
    [5] A. De Myttenaere, B. Golden, B. Le Grand, F. Rossi, Mean absolute percentage error for regression models, Neurocomputing, 192 (2016), 38–48. https://doi.org/10.1016/j.neucom.2015.12.114 doi: 10.1016/j.neucom.2015.12.114
    [6] F. Florez-Revuelta, Evosplit: An evolutionary approach to split a multi-label data set into disjoint subsets, Applied Sciences, 11 (2021), 2823. https://doi.org/10.3390/app11062823 doi: 10.3390/app11062823
    [7] M Merrillees, L Du, Stratified Sampling for Extreme Multi-Label Data, Pacific-Asia Conference on Knowledge Discovery and Data Mining, (2021), 334–345. https://doi.org/10.1007/978-3-030-75765-6_27
    [8] M Merrillees, L Du, Stratified sampling for xml, 2021. Available from: https://github.com/maxitron93/stratified_sampling_for_XML.
    [9] K. Sechidis, G. Tsoumakas, I. Vlahavas, On the stratification of multi-label data, Joint European Conference on Machine Learning and Knowledge Discovery in Databases, (2011), 145–158. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-23808-6_10
    [10] P. Szymański, T. Kajdanowicz, A scikit-based Python environment for performing multi-label classification, arXiv e-prints, 2017.
    [11] P. Szymański, T. Kajdanowicz, A network perspective on stratification of multi-label data, Proceedings of the First International Workshop on Learning with Imbalanced Domains: Theory and Applications, volume 74 of Proceedings of Machine Learning Research, (2017), 22–35.
    [12] H. Tiittanen, L. Holm, P. Törönen, Optisplit. Available from: https://github.com/xtixtixt/optisplit.
    [13] P. Törönen, A. Medlar, L. Holm, Pannzer2: a rapid functional annotation web server, Nucleic acids res., 46 (2018), W84–W88. https://doi.org/10.1093/nar/gky350 doi: 10.1093/nar/gky350
    [14] G. Tsoumakas, E. Spyromitros-Xioufis, J. Vilcek, I. Vlahavas, Mulan: A java library for multi-label learning, J. Mach. Learn. Res., 12 (2011), 2411–2414.
    [15] D. H. Wolpert, Stacked generalization, Neural Networks, 5 (1992), 241–259. https://doi.org/10.1016/S0893-6080(05)80023-1
    [16] N. Zhou, Y. Jiang, T. R. Bergquist, A. J. Lee, B. Z. Kacsoh, A. W. Crocker, K. A. Lewis, G. Georghiou, et al., The cafa challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens, Genome biol., 20 (2019), 1–23.
  • Reader Comments
  • © 2022 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
通讯作者: 陈斌, bchen63@163.com
  • 1. 

    沈阳化工大学材料科学与工程学院 沈阳 110142

  1. 本站搜索
  2. 百度学术搜索
  3. 万方数据库搜索
  4. CNKI搜索

Metrics

Article views(1215) PDF downloads(29) Cited by(0)

Article outline

Figures and Tables

Figures(3)  /  Tables(3)

Other Articles By Authors

/

DownLoad:  Full-Size Img  PowerPoint
Return
Return

Catalog