Feature screening methods play a vital role in identifying the critical genomes that influence a cancer patient's survival time. Most current research relies on a fixed survival function model, which limits its universality in practical applications. In this paper, we propose the Generalized Jaccard coefficient (GJAC), which extends the traditional Jaccard coefficient from measuring the similarity of binary vectors to measuring the correlation between general vectors. The larger the GJAC value, the higher the sample similarity. Using the GJAC, we introduce a novel model-free screening method to select the active set of covariates in ultra-high dimensional survival data. In Monte Carlo simulations, GJAC-Sure Independence Screening (GJAC-SIS) shows higher accuracy, lower errors, and excellent applicability across different types of survival data compared with other existing model-free feature screening methods. Additionally, on the real cancer dataset (DLBCL), GJAC-SIS screens out two additional important genomes, both confirmed in real biomedical experiments, that the other five methods fail to select. As a result, GJAC-SIS achieves high screening precision, delivers a more effective screening outcome, and offers better utility and universality.
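The abstract does not state the GJAC formula, so the following is only an illustrative sketch of the general idea, not the authors' estimator. It assumes the Tanimoto form, one common way of extending the Jaccard coefficient to real-valued vectors, and it ignores censoring entirely, which the actual method for survival data must handle. The names `tanimoto` and `screen_top_d`, and the choice to center each vector before scoring, are hypothetical.

```python
import numpy as np


def tanimoto(x, y):
    """Tanimoto (extended Jaccard) similarity: x.y / (|x|^2 + |y|^2 - x.y).

    Assumption: this stands in for the GJAC, whose exact definition is not
    given in the abstract. For 0/1 vectors it equals the classical Jaccard
    coefficient |A and B| / |A or B|.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dot = float(x @ y)
    denom = float(x @ x) + float(y @ y) - dot
    # The denominator is zero only when both vectors are all zeros.
    return 0.0 if denom == 0.0 else dot / denom


def screen_top_d(X, y, d):
    """Marginal screening sketch: score each column of X against y, keep top d.

    Assumption: columns and response are centered first so the score reflects
    co-variation rather than shared offsets. Censoring is NOT handled here.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    scores = np.array([abs(tanimoto(Xc[:, j], yc)) for j in range(X.shape[1])])
    selected = np.argsort(-scores)[:d]  # indices of the d largest scores
    return selected, scores
```

For the binary vectors `[1, 1, 0, 1]` and `[1, 0, 0, 1]`, `tanimoto` gives 2/3, the same value as the classical Jaccard coefficient (two shared positions out of three occupied positions), which is the sense in which such a measure generalizes the binary case.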
Citation: Renqing Liu, Guangming Deng, Hanji He. Generalized Jaccard feature screening for ultra-high dimensional survival data[J]. AIMS Mathematics, 2024, 9(10): 27607-27626. doi: 10.3934/math.20241341