The transcriptional risk scores for kidney renal clear cell carcinoma using XGBoost and multiple omics data

Xiaoyu Hou; Baoshan Ma; Ming Liu; Yuxuan Zhao; Bingjie Chai; Jianqiao Pan; Pengcheng Wang; Di Li; Shuxin Liu; Fengju Song

doi:10.3934/mbe.2023519

Mathematical Biosciences and Engineering

2023, Volume 20, Issue 7: 11676-11687. doi: 10.3934/mbe.2023519

Research article

1. School of Information Science and Technology, Dalian Maritime University, Dalian 116026, China
2. Physical Department of Science and Technology, Dalian University, Dalian 116622, China
3. Department of Mechanical Engineering, University of Houston, Houston 77204, USA
4. Department of Neuro Intervention, Dalian Medical University affiliated Dalian Municipal Central Hospital, Dalian 116033, China
5. Department of Nephrology, Dalian Medical University affiliated Dalian Municipal Central Hospital, Dalian 116033, China
6. Department of Epidemiology and Biostatistics, Key Laboratory of Molecular Cancer Epidemiology, Tianjin, National Clinical Research Center of Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin 300060, China

Academic Editor: Yang Kuang

Received: 06 March 2023; Revised: 20 April 2023; Accepted: 22 April 2023; Published: 08 May 2023

Abstract

Kidney renal clear cell carcinoma (KIRC) accounts for most kidney cancers and is a major cause of cancer-related death. The polygenic risk score (PRS), a weighted linear combination of phenotype-associated alleles across the genome, can be used to assess KIRC risk. However, a PRS model that takes standalone SNP data as input may not provide satisfactory results. We therefore propose transcriptional risk scores (TRS) based on multi-omics data and machine learning models to assess the risk of KIRC. First, we collected four types of omics data (DNA methylation, miRNA, mRNA and lncRNA) for KIRC patients from the TCGA database. Subsequently, we developed a novel TRS method that combines multiple omics data with the XGBoost model. Finally, we performed prevalence analysis and prognosis prediction to evaluate the utility of the TRS generated by our method. Our TRS method exhibited better predictive performance than the linear models and the other machine learning models. Furthermore, the prediction accuracy of the combined TRS model was higher than that of the single-omics TRS models. Kaplan-Meier (KM) curves showed that TRS was a valid prognostic indicator for cancer staging. The proposed method extends the current definition of TRS from standalone SNP data to multi-omics data and was superior to the linear models and the other machine learning models, and it may provide a useful tool for the diagnostic and prognostic prediction of KIRC.

Keywords:
- kidney renal clear cell carcinoma
- diagnosis
- transcriptional risk score
- multi-omics data
- XGBoost
Citation: Xiaoyu Hou, Baoshan Ma, Ming Liu, Yuxuan Zhao, Bingjie Chai, Jianqiao Pan, Pengcheng Wang, Di Li, Shuxin Liu, Fengju Song. The transcriptional risk scores for kidney renal clear cell carcinoma using XGBoost and multiple omics data[J]. Mathematical Biosciences and Engineering, 2023, 20(7): 11676-11687. doi: 10.3934/mbe.2023519
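
The abstract describes the PRS as a weighted linear combination of phenotype-associated alleles. As a minimal sketch of that idea only, the Python snippet below computes such a score for two hypothetical individuals. The function name `linear_risk_score`, the weights, and the allele counts are all illustrative assumptions and are not taken from the article; the article's own TRS replaces this linear form with an XGBoost model trained on multi-omics features, which is not reproduced here.

```python
from typing import Dict, List, Sequence


def linear_risk_score(dosages: Sequence[float], weights: Sequence[float]) -> float:
    """Return the weighted linear combination sum_j weights[j] * dosages[j]."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(w * x for w, x in zip(weights, dosages))


# Hypothetical per-allele effect sizes (illustrative values, not from the article).
weights: List[float] = [0.5, -0.25, 1.0]

# Hypothetical allele counts (0, 1 or 2 copies of each risk allele).
individuals: Dict[str, List[int]] = {
    "sample_A": [2, 0, 1],
    "sample_B": [0, 2, 0],
}

# One score per individual; a higher score means higher estimated risk.
scores = {name: linear_risk_score(x, weights) for name, x in individuals.items()}

for name, score in scores.items():
    print(f"{name}: {score}")
```

In this toy setting `sample_A` carries more of the positively weighted alleles and therefore receives the higher score. A nonlinear learner such as XGBoost would instead learn the mapping from features to risk from training data rather than summing fixed per-feature weights.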

© 2023 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)