Research article

Group feature screening for ultrahigh-dimensional data missing at random

  • Received: 21 November 2023 Revised: 23 December 2023 Accepted: 05 January 2024 Published: 12 January 2024
  • MSC: 62H30, 62R07

  • Statistical inference with missing data is a common task in data analysis, and missing values remain widespread in big data. The literature has established the practicability of two-stage feature screening with categorical covariates missing at random (IMCSIS). Building on this, we propose group feature screening for ultrahigh-dimensional data with categorical covariates missing at random (GIMCSIS), which effectively selects important grouped features. The proposed method extends the scope of IMCSIS and further improves classification performance when covariates are missing. A two-stage group feature screening procedure is constructed from adjusted Pearson chi-square statistics, and theoretical analysis proves that it satisfies the sure screening property. Numerical simulations show that GIMCSIS achieves better finite-sample performance with binary and multi-class responses and multi-class covariates. An empirical analysis using multiple classification metrics shows that GIMCSIS outperforms IMCSIS in imbalanced data classification.

    Citation: Hanji He, Meini Li, Guangming Deng. Group feature screening for ultrahigh-dimensional data missing at random[J]. AIMS Mathematics, 2024, 9(2): 4032-4056. doi: 10.3934/math.2024197
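
To make the two-stage screening idea concrete, the sketch below ranks pre-specified covariate groups by a Pearson chi-square utility normalized by its degrees of freedom. It is a minimal Python illustration under stated assumptions, not the authors' implementation: the helper names adjusted_pearson_chi2 and group_screen, the max-aggregation over group members, and the complete-case handling of missing covariate values (a stand-in for the paper's missing-at-random adjustment) are all choices made for the example.

    import numpy as np

    def adjusted_pearson_chi2(x, y):
        # Pearson chi-square utility between one categorical covariate x and
        # a categorical response y, divided by its degrees of freedom so that
        # covariates with different numbers of categories stay comparable
        # (this normalization is an assumption; the paper defines the exact
        # adjusted statistic). Missing covariate values (np.nan) are dropped,
        # i.e., complete-case estimation instead of the two-stage correction.
        obs = ~np.isnan(x)
        x, y = x[obs], y[obs]
        n = len(y)
        stat = 0.0
        for r in np.unique(y):
            p_r = np.mean(y == r)
            for k in np.unique(x):
                p_k = np.mean(x == k)
                p_kr = np.mean((x == k) & (y == r))
                stat += (p_kr - p_k * p_r) ** 2 / (p_k * p_r)
        df = (len(np.unique(x)) - 1) * (len(np.unique(y)) - 1)
        return n * stat / max(df, 1)

    def group_screen(X, y, groups, d):
        # Rank pre-specified covariate groups by the largest utility among
        # their member covariates and keep the top d groups.
        utility = {g: max(adjusted_pearson_chi2(X[:, j], y) for j in idx)
                   for g, idx in groups.items()}
        return sorted(utility, key=utility.get, reverse=True)[:d]

    # Toy usage: 200 samples, six three-level covariates in three groups of
    # two, with about 10% of covariate values set to missing. Only the
    # covariates in group 0 depend on y, so group 0 should be ranked first.
    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, 200)
    X = rng.integers(0, 3, (200, 6)).astype(float)
    X[:, 0] = (y + rng.integers(0, 2, 200)) % 3
    X[rng.random((200, 6)) < 0.1] = np.nan
    groups = {0: [0, 1], 1: [2, 3], 2: [4, 5]}
    print(group_screen(X, y, groups, d=1))

In the paper itself the group utility is estimated from the grouped covariates jointly and the missingness is corrected in two stages; the complete-case, max-aggregated version above is only meant to convey the rank-and-threshold structure of the procedure.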



  • © 2024 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)