Research article

Group feature screening for ultrahigh-dimensional data missing at random

  • Received: 21 November 2023; Revised: 23 December 2023; Accepted: 05 January 2024; Published: 12 January 2024
  • MSC: 62H30, 62R07

  • Statistical inference with missing data is a common problem in data analysis, and missing values remain widespread in big data. Prior work has established the practicability of two-stage feature screening with categorical covariates missing at random (IMCSIS). We therefore propose group feature screening for ultrahigh-dimensional data with categorical covariates missing at random (GIMCSIS), which can effectively select important features. The proposed method extends the scope of IMCSIS and further improves classification performance when covariates are missing. A two-stage group feature screening procedure is built on adjusted Pearson chi-square statistics, and theoretical analysis proves that it enjoys the sure screening property. In numerical simulations, GIMCSIS achieves better finite-sample performance under both binary and multi-class responses and multi-class covariates. An empirical analysis using multiple classification metrics shows that GIMCSIS is superior to IMCSIS in imbalanced data classification.

    Citation: Hanji He, Meini Li, Guangming Deng. Group feature screening for ultrahigh-dimensional data missing at random. AIMS Mathematics, 2024, 9(2): 4032–4056. doi: 10.3934/math.2024197
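
To make the abstract's description concrete, the sketch below illustrates marginal Pearson chi-square screening with group-level aggregation, computed on complete cases only. It is a minimal illustration, not the paper's algorithm: the function names (pearson_chi2, group_screen), the mean aggregation across a group's covariates, and the complete-case treatment of missing entries are assumptions made here for brevity, while the adjusted statistic and the second, missingness-adjusted screening stage of GIMCSIS are developed in the article itself.

```python
import numpy as np

def pearson_chi2(y, x):
    # Pearson chi-square statistic between a categorical response y and one
    # categorical covariate x; NaN entries of x are dropped (complete cases).
    keep = ~np.isnan(x)
    y, x = y[keep], x[keep]
    n = len(y)
    stat = 0.0
    for r in np.unique(y):
        p_r = np.mean(y == r)
        for c in np.unique(x):
            p_c = np.mean(x == c)
            p_rc = np.mean((y == r) & (x == c))
            stat += (p_rc - p_r * p_c) ** 2 / (p_r * p_c)
    return n * stat

def group_screen(y, X, groups, d):
    # Score each covariate group by the mean marginal statistic of its
    # member columns and keep the ids of the d highest-scoring groups.
    scores = {g: np.mean([pearson_chi2(y, X[:, j]) for j in cols])
              for g, cols in groups.items()}
    return sorted(scores, key=scores.get, reverse=True)[:d]

# Toy usage: 6 binary covariates in 3 groups of 2, with roughly 20% of
# entries missing at random; group 0 holds the covariate driving y, so
# it should be ranked first.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 6)).astype(float)
y = ((X[:, 0] + rng.integers(0, 2, size=100)) > 1).astype(int)
X[rng.random(size=X.shape) < 0.2] = np.nan
print(group_screen(y, X, {0: [0, 1], 1: [2, 3], 2: [4, 5]}, d=2))
```

Ranking groups by an aggregate of marginal statistics is what makes the procedure a group screen rather than a per-variable screen; the aggregation rule (mean here) is one of several reasonable choices.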


  • © 2024 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)