Research article

Ultra-high-dimensional feature screening of binary categorical response data based on Jensen-Shannon divergence

  • Received: 23 October 2023 Revised: 05 December 2023 Accepted: 13 December 2023 Published: 02 January 2024
  • MSC : 62H30, 62R07

  • Currently, most of the ultra-high-dimensional feature screening methods for categorical data are based on the correlation between covariates and response variables, using some statistics as the screening index to screen important covariates. Thus, with the increasing number of data types and model availability limitations, there may be a potential problem with the existence of a class of unimportant covariates that are also highly correlated with the response variable due to their high correlation with the other covariates. To address this issue, in this paper, we establish a model-free feature screening procedure for binary categorical response variables from the perspective of the contribution of features to classification. The idea is to introduce the Jensen-Shannon divergence to measure the difference between the conditional probability distributions of the covariates when the response variables take on different values. The larger the value of the Jensen-Shannon divergence, the stronger the covariate's contribution to the classification of the response variable, and the more important the covariate is. We propose two kinds of model-free ultra-high-dimensional feature screening methods for binary response data. Meanwhile, the methods are suitable for continuous or categorical covariates. When the numbers of covariate categories are the same, the feature screening is based on traditional Jensen-Shannon divergence. When the numbers of covariate categories are different, the Jensen-Shannon divergence is adjusted using the logarithmic factor of the number of categories. We theoretically prove that the proposed methods have sure screening and ranking consistency properties, and through simulations and real data analysis, we demonstrate that, in feature screening, the approaches proposed in this paper have the advantages of effectiveness, stability, and less computing time compared with an existing method.

    Citation: Qingqing Jiang, Guangming Deng. Ultra-high-dimensional feature screening of binary categorical response data based on Jensen-Shannon divergence[J]. AIMS Mathematics, 2024, 9(2): 2874-2907. doi: 10.3934/math.2024142

    Related Papers:

  • Currently, most of the ultra-high-dimensional feature screening methods for categorical data are based on the correlation between covariates and response variables, using some statistics as the screening index to screen important covariates. Thus, with the increasing number of data types and model availability limitations, there may be a potential problem with the existence of a class of unimportant covariates that are also highly correlated with the response variable due to their high correlation with the other covariates. To address this issue, in this paper, we establish a model-free feature screening procedure for binary categorical response variables from the perspective of the contribution of features to classification. The idea is to introduce the Jensen-Shannon divergence to measure the difference between the conditional probability distributions of the covariates when the response variables take on different values. The larger the value of the Jensen-Shannon divergence, the stronger the covariate's contribution to the classification of the response variable, and the more important the covariate is. We propose two kinds of model-free ultra-high-dimensional feature screening methods for binary response data. Meanwhile, the methods are suitable for continuous or categorical covariates. When the numbers of covariate categories are the same, the feature screening is based on traditional Jensen-Shannon divergence. When the numbers of covariate categories are different, the Jensen-Shannon divergence is adjusted using the logarithmic factor of the number of categories. We theoretically prove that the proposed methods have sure screening and ranking consistency properties, and through simulations and real data analysis, we demonstrate that, in feature screening, the approaches proposed in this paper have the advantages of effectiveness, stability, and less computing time compared with an existing method.



    加载中


    [1] J. Q. Fan, J. C. Lv, Sure independence screening for ultrahigh dimensional feature space, J. R. Statist. Soc. B., 70 (2008), 849–911. https://doi.org/10.1111/j.1467-9868.2008.00674.x doi: 10.1111/j.1467-9868.2008.00674.x
    [2] P. Hall, H. Miller, Using generalized correlation to effect variable selection in very high dimensional problems, J. Comput. Graph. Stat., 18 (2009), 533–550. https://doi.org/10.1198/jcgs.2009.08041 doi: 10.1198/jcgs.2009.08041
    [3] G. X. Li, H. Peng, J. Zhang, L. X. Zhu, Robust rank correlation based screening, Ann. Statist., 40 (2012), 1846–1877. https://doi.org/10.1214/12-AOS1024 doi: 10.1214/12-AOS1024
    [4] J. Q. Fan, R. Song, Sure independence screening in generalized linear models with NP-dimensionality, Ann. Statist., 38 (2010), 3567–3604. https://doi.org/10.1214/10-AOS798 doi: 10.1214/10-AOS798
    [5] J. Q. Fan, Y. Feng, R. Song, Nonparametric independence screening in sparse ultra-high-dimensional additive models, J. Am. Stat. Assoc., 106 (2011), 544–557. https://doi.org/10.1198/jasa.2011.tm09779 doi: 10.1198/jasa.2011.tm09779
    [6] J. Y. Liu, R. Z. Li, R. L. Wu, Feature selection for varying coefficient models with ultrahigh-dimensional covariates, J. Am. Stat. Assoc., 109 (2014), 266–274. https://doi.org/10.1080/01621459.2013.850086 doi: 10.1080/01621459.2013.850086
    [7] H. Liang, H. S. Wang, C. L. Tsai, Profiled forward regression for ultrahigh dimensional variable screening in semiparametric partially linear models, Stat. Sinica, 22 (2012), 531–554. https://doi.org/10.5705/ss.2010.134 doi: 10.5705/ss.2010.134
    [8] L. P. Zhu, L. X. Li, R. Z. Li, L. X. Zhu, Model-free feature screening for ultrahigh-dimensional data, J. Am. Stat. Assoc., 106 (2011), 1464–1475. https://doi.org/10.1198/jasa.2011.tm10563 doi: 10.1198/jasa.2011.tm10563
    [9] R. Z. Li, W. Zhong, L. P. Zhu, Feature screening via distance correlation learning, J. Am. Stat. Assoc., 107 (2012), 1129–1139. https://doi.org/10.1080/01621459.2012.695654 doi: 10.1080/01621459.2012.695654
    [10] X. He, L. Wang, H. G. Hong, Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data, Ann. Statist., 41 (2013), 342–369. https://doi.org/10.1214/13-AOS1087 doi: 10.1214/13-AOS1087
    [11] W. L. Pan, X. Q. Wang, W. N. Xiao, H. T. Zhu, A generic sure independence screening procedure, J. Am. Stat. Assoc., 114 (2018), 928–937. https://doi.org/10.1080/01621459.2018.1462709 doi: 10.1080/01621459.2018.1462709
    [12] J. Q. Fan, Y. Y. Fan, High-dimensional classification using features annealed independence rules, Ann. Statist, 36 (2008), 2605–2637. https://doi.org/10.1214/07-AOS504 doi: 10.1214/07-AOS504
    [13] Q. Mai, H. Zou, The Kolmogorov filter for variable screening in high-dimensional binary classification, Biometrika, 100 (2013), 229–234. https://doi.org/10.1093/biomet/ass062 doi: 10.1093/biomet/ass062
    [14] H. J. Cui, R. Z. Li, W. Zhong, Model-free feature screening for ultrahigh dimensional discriminant analysis, J. Am. Stat. Assoc., 110 (2015), 630–641. https://doi.org/10.1080/01621459.2014.920256 doi: 10.1080/01621459.2014.920256
    [15] D. Y. Huang, R. Z. Li, H. S. Wang, Feature screening for ultrahigh dimensional categorical data with applications, J. Bus. Econ. Stat., 32 (2014), 237–244. https://doi.org/10.1080/07350015.2013.863158 doi: 10.1080/07350015.2013.863158
    [16] L. Ni, F. Fang, Entropy-based model-free feature screening for ultrahigh-dimensional multiclass classification, J. Nonparametr Stat., 28 (2016), 515–530. https://doi.org/10.1080/10485252.2016.1167206 doi: 10.1080/10485252.2016.1167206
    [17] F. Y. Xiao, Multi-sensor data fusion based on the belief divergence measure of evidences and the belief entropy, Inform. Fusion., 46 (2019), 23–32. https://doi.org/10.1016/j.inffus.2018.04.003 doi: 10.1016/j.inffus.2018.04.003
    [18] F. Y. Xiao, A new divergence measure for belief functions in D-S evidence theory for multisensor data fusion, Inform. Sciences, 514 (2020), 462–483. https://doi.org/10.1016/j.ins.2019.11.022 doi: 10.1016/j.ins.2019.11.022
    [19] F. Y. Xiao, GEJS: A generalized evidential divergence measure for multisource information fusion, IEEE T. Syst. Man Cy-S., 53 (2022), 2246–2258. https://doi.org/10.1109/TSMC.2022.3211498 doi: 10.1109/TSMC.2022.3211498
    [20] F. Y. Xiao, J. H. Wen, W. Pedrycz, Generalized divergence-based decision making method with an application to pattern classification, IEEE T. Knowl. Data En., 35 (2022), 6941–6956. https://doi.org/10.1109/TKDE.2022.3177896 doi: 10.1109/TKDE.2022.3177896
    [21] J. Lin, Divergence measures based on the shannon entropy, IEEE Trans. Inform. Theory, 37 (1991), 145–151. https://doi.org/10.1109/18.61115 doi: 10.1109/18.61115
    [22] C. E. Shannon, A mathematical theory of communication, Bell Syst. Tech. J., 27 (1948), 379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x doi: 10.1002/j.1538-7305.1948.tb01338.x
    [23] H. J. He, G. M. Deng, Grouped feature screening for ultra-high dimensional data for the classification model, J. Stat. Comput. Sim., 92 (2022), 974–997. https://doi.org/10.1080/00949655.2021.1981901 doi: 10.1080/00949655.2021.1981901
    [24] D. Singh, P. G. Febbo, K. Ross, D. G. Jackson, J. Manola, C. Ladd, et al., Gene expression correlates of clinical prostate cancer behavior, Cancer Cell, 1 (2002), 203–209. https://doi.org/10.1016/S1535-6108(02)00030-2 doi: 10.1016/S1535-6108(02)00030-2
    [25] M. A. Shipp, K. N. Ross, P. Tamayo, A. P. Weng, J. L. Kutok, R. C. T. Aguiar, et al., Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning, Nat. Med., 8 (2002), 68–74. https://doi.org/10.1038/news011227-7 doi: 10.1038/news011227-7
    [26] W. Hoeffding, Probability inequalities for sums of bounded random variables, J. Am. Stat. Assoc., 58 (1963), 13–30. https://doi.org/10.1080/01621459.1963.10500830 doi: 10.1080/01621459.1963.10500830
  • Reader Comments
  • © 2024 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
通讯作者: 陈斌, bchen63@163.com
  • 1. 

    沈阳化工大学材料科学与工程学院 沈阳 110142

  1. 本站搜索
  2. 百度学术搜索
  3. 万方数据库搜索
  4. CNKI搜索

Metrics

Article views(785) PDF downloads(61) Cited by(0)

Article outline

Figures and Tables

Tables(16)

Other Articles By Authors

/

DownLoad:  Full-Size Img  PowerPoint
Return
Return

Catalog