Research article

Two-stage feature selection for classification of gene expression data based on an improved Salp Swarm Algorithm


  • Received: 31 July 2022 Revised: 03 September 2022 Accepted: 07 September 2022 Published: 19 September 2022
  • Microarray technology has developed rapidly in recent years, producing large volumes of ultra-high-dimensional gene expression data. However, because the number of genes vastly exceeds the number of samples, screening important genes from such data is very challenging. For small-sample, high-dimensional biomedical data, this paper proposes a two-stage feature selection framework that combines wrapper, embedded and filter methods to avoid the curse of dimensionality. In the first stage, the framework uses weighted gene co-expression network analysis (WGCNA), random forest and minimal redundancy maximal relevance (mRMR) for feature selection. In the second stage, a new gene selection method based on an improved binary Salp Swarm Algorithm is proposed, which is combined with machine learning methods to adaptively select feature subsets suited to the classification algorithm. Finally, classification accuracy is evaluated using six methods: LightGBM, RF, SVM, XGBoost, MLP and KNN. To verify the performance of the framework and the effectiveness of the proposed algorithm, the number of selected genes and the classification accuracy are compared with those of five other intelligent optimization algorithms. The results show that the proposed framework achieves an accuracy equal to or higher than that of other advanced intelligent algorithms on 10 datasets, and achieves an accuracy of over 97.6% on all 10 datasets. This indicates that the proposed method can solve the feature selection problem for high-dimensional data. The framework is not restricted to particular datasets and can be applied to other fields involving feature selection.
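
    The second-stage wrapper search can be illustrated with a minimal sketch. It assumes the standard Salp Swarm Algorithm made binary with a sigmoid transfer function; it is not the improved variant proposed in the paper, whose details are not given in this abstract. The leave-one-out 1-NN error, the weight alpha = 0.99, the position bounds, the function names and the toy data are all illustrative assumptions.

```python
import math
import random

def loo_1nn_error(X, y, mask):
    """Leave-one-out 1-nearest-neighbour error on the features where mask is 1."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 1.0  # an empty subset cannot classify anything
    wrong = 0
    for i, xi in enumerate(X):
        best_d, best_label = float("inf"), None
        for k, xk in enumerate(X):
            if k == i:
                continue
            d = sum((xi[j] - xk[j]) ** 2 for j in cols)
            if d < best_d:
                best_d, best_label = d, y[k]
        wrong += best_label != y[i]
    return wrong / len(X)

def fitness(X, y, mask, alpha=0.99):
    """Assumed objective: weighted error rate plus fraction of features kept."""
    return alpha * loo_1nn_error(X, y, mask) + (1 - alpha) * sum(mask) / len(mask)

def to_mask(position, rng):
    """Sigmoid transfer: each coordinate becomes a selection probability."""
    return [1 if rng.random() < 1.0 / (1.0 + math.exp(-p)) else 0 for p in position]

def binary_ssa(X, y, n_salps=10, n_iter=30, lb=-6.0, ub=6.0, seed=0):
    """Standard SSA on continuous positions, binarised only for evaluation."""
    rng = random.Random(seed)
    dim = len(X[0])
    salps = [[rng.uniform(lb, ub) for _ in range(dim)] for _ in range(n_salps)]
    best_mask, best_fit, food = None, float("inf"), salps[0][:]
    for it in range(1, n_iter + 1):
        # Evaluate every salp and remember the best position as the food source.
        for s in salps:
            m = to_mask(s, rng)
            f = fitness(X, y, m)
            if f < best_fit:
                best_fit, best_mask, food = f, m, s[:]
        # c1 decays over iterations, shifting from exploration to exploitation.
        c1 = 2.0 * math.exp(-((4.0 * it / n_iter) ** 2))
        for i in range(n_salps):
            if i == 0:
                # Leader moves around the food source.
                for j in range(dim):
                    step = c1 * ((ub - lb) * rng.random() + lb)
                    salps[0][j] = food[j] + step if rng.random() >= 0.5 else food[j] - step
            else:
                # Followers move to the midpoint with the salp ahead of them.
                salps[i] = [(a + b) / 2.0 for a, b in zip(salps[i], salps[i - 1])]
            salps[i] = [min(ub, max(lb, v)) for v in salps[i]]
    return best_mask, best_fit

def make_toy(n=20, dim=8, seed=1):
    """Synthetic stand-in for first-stage output: only features 0 and 1 carry signal."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        row[0] += 4.0 * label
        row[1] -= 4.0 * label
        X.append(row)
        y.append(label)
    return X, y

X, y = make_toy()
mask, fit = binary_ssa(X, y)
print(mask, round(fit, 4))
```

    In a two-stage setting, `X` would hold only the genes that survive the first-stage filters, so the swarm searches a much smaller space than the raw expression matrix. The small size penalty in `fitness` is one common way to favour compact subsets when two candidates classify equally well.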

    Citation: Xiwen Qin, Shuang Zhang, Dongmei Yin, Dongxue Chen, Xiaogang Dong. Two-stage feature selection for classification of gene expression data based on an improved Salp Swarm Algorithm[J]. Mathematical Biosciences and Engineering, 2022, 19(12): 13747-13781. doi: 10.3934/mbe.2022641



  • © 2022 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)