Research article

Generalized Jaccard feature screening for ultra-high dimensional survival data

  • Received: 05 July 2024; Revised: 10 September 2024; Accepted: 18 September 2024; Published: 24 September 2024
  • MSC: 62F07, 62N01

  • Feature screening methods play a vital role in identifying the critical genes that influence a cancer patient's survival time. Most current research relies on a fixed survival function model, which limits its generality in practical applications. In this paper, we propose the generalized Jaccard coefficient (GJAC), which extends the traditional Jaccard coefficient from measuring the similarity of binary vectors to measuring the correlation between general vectors. The larger the GJAC value, the higher the sample similarity. Using the GJAC, we introduce a novel model-free screening method, GJAC-Sure Independence Screening (GJAC-SIS), to select the active set of covariates in ultra-high dimensional survival data. In Monte Carlo simulations, GJAC-SIS shows higher accuracy, lower errors, and good applicability across different types of survival data compared with existing model-free feature screening methods for survival data. Additionally, on a real cancer dataset (DLBCL), GJAC-SIS screens out two additional important genes that have been confirmed in biomedical experiments, whereas the other five methods cannot. As a result, GJAC-SIS achieves high screening precision, delivers a more effective screening outcome, and offers better utility and generality.
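
    The abstract does not reproduce the GJAC formula, so the following is only a minimal sketch of the general idea, assuming the common Tanimoto form of the generalized Jaccard similarity, x·y / (‖x‖² + ‖y‖² − x·y), which reduces to the classical Jaccard coefficient on binary vectors. The function names `generalized_jaccard` and `screen_top_d` are hypothetical, and the ranking step ignores censoring entirely, so it should not be read as the authors' GJAC-SIS procedure.

```python
import numpy as np


def generalized_jaccard(x, y):
    """Tanimoto-style generalized Jaccard similarity of two real vectors.

    Assumed form (not taken from the paper): x.y / (|x|^2 + |y|^2 - x.y).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dot = float(np.dot(x, y))
    denom = float(np.dot(x, x) + np.dot(y, y) - dot)
    if denom == 0.0:
        # Both vectors are all zeros; treating them as identical is a convention.
        return 1.0
    return dot / denom


def screen_top_d(X, y, d):
    """Rank the columns of X by |similarity to y| and keep the d largest.

    Illustrative marginal screening only: censoring is NOT handled here.
    """
    X = np.asarray(X, dtype=float)
    scores = np.array(
        [abs(generalized_jaccard(X[:, j], y)) for j in range(X.shape[1])]
    )
    order = np.argsort(-scores)  # indices sorted by descending score
    return order[:d], scores


if __name__ == "__main__":
    # On binary vectors the value matches the classical Jaccard coefficient:
    # intersection size 2, union size 3.
    print(generalized_jaccard([1, 1, 0, 1], [1, 0, 0, 1]))

    rng = np.random.default_rng(0)
    y = rng.normal(size=50)
    X = rng.normal(size=(50, 200))
    X[:, 0] = y  # plant one perfectly matching covariate
    selected, _ = screen_top_d(X, y, d=5)
    print(selected)
```

    Under this assumed form the absolute score is at most 1 and equals 1 only when the two vectors coincide, which is consistent with the abstract's statement that a larger value indicates higher similarity.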

    Citation: Renqing Liu, Guangming Deng, Hanji He. Generalized Jaccard feature screening for ultra-high dimensional survival data. AIMS Mathematics, 2024, 9(10): 27607-27626. doi: 10.3934/math.20241341



  • © 2024 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)