Most model-free feature screening methods operate on individual predictors and therefore cannot account for structured predictors such as grouped variables. In this study, we propose a model-free, direct extension of the original sure independence screening approach to group screening, using Gini impurity for classification models. Compared with existing feature screening approaches, the proposed method performs better in terms of screening efficiency and classification accuracy. Under specific regularity conditions, we establish that the proposed group screening procedure has the sure screening property and the ranking consistency property. Simulation studies and a real data analysis illustrate the finite-sample performance of the proposed technique.
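To make the idea of ranking groups by Gini impurity concrete, the following is a minimal Python sketch. It is not the statistic defined in the paper: it assumes categorical predictors, scores each feature by the reduction in the Gini impurity of the response after conditioning on that feature, and (as a further assumption) scores a group by the mean of its features' reductions. The function names `gini_impurity`, `gini_gain`, and `screen_groups`, and the toy data, are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_gain(x, y):
    """Impurity of y minus the weighted impurity of y within each level of x."""
    by_level = defaultdict(list)
    for xi, yi in zip(x, y):
        by_level[xi].append(yi)
    n = len(y)
    within = sum(len(ys) / n * gini_impurity(ys) for ys in by_level.values())
    return gini_impurity(y) - within


def screen_groups(columns, groups, y, keep):
    """Rank groups by mean per-feature Gini gain (an illustrative choice).

    columns: dict mapping feature name -> list of categorical values
    groups:  dict mapping group name -> list of feature names
    Returns the names of the top `keep` groups and all group scores.
    """
    scores = {
        name: sum(gini_gain(columns[f], y) for f in feats) / len(feats)
        for name, feats in groups.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep], scores


# Toy data: three classes, group "A" is informative, group "B" is noise.
y = [0, 0, 1, 1, 2, 2]
columns = {
    "a1": [0, 0, 1, 1, 2, 2],  # separates all three classes
    "a2": [0, 0, 1, 1, 1, 1],  # separates class 0 from the rest
    "b1": [0, 1, 0, 1, 0, 1],  # unrelated to the class label
    "b2": [0, 0, 0, 0, 0, 0],  # constant
}
groups = {"A": ["a1", "a2"], "B": ["b1", "b2"]}

selected, scores = screen_groups(columns, groups, y, keep=1)
print(selected, scores)
```

Averaging within a group keeps scores comparable across groups of different sizes; other aggregations (a sum or a maximum, for example) would be equally easy to substitute and may behave differently when group sizes vary.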
Citation: Zhongzheng Wang, Guangming Deng, Haiyun Xu. Group feature screening based on Gini impurity for ultrahigh-dimensional multi-classification[J]. AIMS Mathematics, 2023, 8(2): 4342-4362. doi: 10.3934/math.2023216