Polygenic risk scores (PRS) can estimate an individual's genetic risk of breast cancer. However, single nucleotide polymorphism (SNP) data alone, as used for PRS, may not provide satisfactory prediction accuracy. In addition, current PRS models based on linear regression have insufficient power to capture non-linear effects among thousands of associated SNPs. Here, we propose a transcriptional risk score (TRS) based on multiple omics data to estimate the risk of breast cancer.
Multi-omics and clinical data for breast invasive carcinoma (BRCA) were collected from The Cancer Genome Atlas (TCGA) and the Gene Expression Omnibus (GEO). First, we developed a novel TRS model for BRCA using single-omics data and the LightGBM algorithm. Next, we built a combination model of the TRS derived from each omics data type to further improve prediction accuracy. Finally, we performed association analysis and prognosis prediction to evaluate the utility of the TRS generated by our method.
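As a minimal sketch of the combination step, the following Python code fits a logistic-regression combiner on two per-omics TRS values. The combiner, the synthetic scores, and all names (`fit_logistic`, `combined_trs`, `per_omics`) are illustrative assumptions: the abstract does not specify how the per-omics TRS are combined, and in the actual method each per-omics TRS would come from a LightGBM model rather than from simulated noise.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * d
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def combined_trs(scores, w, b):
    """Map one sample's per-omics scores to a single combined score in (0, 1)."""
    return sigmoid(sum(wj * sj for wj, sj in zip(w, scores)) + b)


def auc(scores, labels):
    """Probability that a random case outscores a random control (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Synthetic stand-in data (assumption): 200 samples, alternating control/case,
# with two hypothetical per-omics TRS values that are each a noisy copy of the label.
rng = random.Random(0)
labels = [i % 2 for i in range(200)]
per_omics = [[y + rng.gauss(0, 1.0), y + rng.gauss(0, 1.0)] for y in labels]

w, b = fit_logistic(per_omics, labels)
combined = [combined_trs(s, w, b) for s in per_omics]
combined_auc = auc(combined, labels)
```

In practice the combiner should be fitted on out-of-fold per-omics scores and evaluated on held-out samples; the in-sample AUC computed here only illustrates the mechanics.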
The proposed TRS model achieved better predictive performance than linear models and other machine learning (ML) methods on single-omics datasets. An independent validation dataset also confirmed the effectiveness of our model. Moreover, combining the TRS derived from different omics data further improved prediction accuracy. Analysis of prevalence and of the associations between the TRS and phenotypes, including case-control status and cancer stage, indicated that the risk of breast cancer increases as the TRS increases. Survival analysis also suggested that the TRS for cancer stage is an effective prognostic metric for breast cancer patients.
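The prevalence analysis can be illustrated with a short Python sketch that ranks samples by TRS, splits them into equal-sized bins, and reports the fraction of cases in each bin. The function name `prevalence_by_quantile`, the number of bins, and the toy scores and labels are hypothetical and are not taken from the study data.

```python
def prevalence_by_quantile(scores, labels, n_bins=5):
    """Sort samples by score, split into n_bins near-equal groups,
    and return the fraction of cases (label 1) in each group."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    prevalence = []
    for k in range(n_bins):
        idx = order[k * n // n_bins:(k + 1) * n // n_bins]
        prevalence.append(sum(labels[i] for i in idx) / len(idx))
    return prevalence


# Toy data (assumption): ten samples with increasing TRS and case/control labels.
toy_scores = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
toy_prevalence = prevalence_by_quantile(toy_scores, toy_labels, n_bins=5)
```

A prevalence that rises from the lowest to the highest bin is the pattern described above, in which breast cancer risk increases with the TRS.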
Our proposed TRS model extends the current definition of PRS from standalone SNP data to multiple omics data and outperforms linear models, and it may provide a powerful tool for the diagnostic and prognostic prediction of breast cancer.
Citation: Jianqiao Pan, Baoshan Ma, Xiaoyu Hou, Chongyang Li, Tong Xiong, Yi Gong, Fengju Song. The construction of transcriptional risk scores for breast cancer based on lightGBM and multiple omics data[J]. Mathematical Biosciences and Engineering, 2022, 19(12): 12353-12370. doi: 10.3934/mbe.2022576