Diabetes is a metabolic disorder caused by insufficient or impaired insulin secretion. The progression from health to diabetes generally passes through three stages: health, pre-diabetes and type 2 diabetes. Early diagnosis is the most effective way to prevent and control diabetes and its complications. In this work, we collected physical examination data from the Beijing Physical Examination Center covering January 2006 to December 2017, and divided the population into three groups according to the WHO (1999) Diabetes Diagnostic Standards: normal fasting plasma glucose (NFG) (FPG < 6.1 mmol/L), mildly impaired fasting plasma glucose (IFG) (6.1 mmol/L ≤ FPG < 7.0 mmol/L) and type 2 diabetes (T2DM) (FPG ≥ 7.0 mmol/L). In total, we obtained 1,221,598 NFG samples, 285,965 IFG samples and 387,076 T2DM samples, each described by 15 physical examination indexes. Using eXtreme Gradient Boosting (XGBoost), random forest (RF), logistic regression (LR) and a fully connected neural network (FCN) as classifiers, we constructed four models to distinguish NFG, IFG and T2DM. The comparison shows that XGBoost performs best, with a macro-averaged AUC of 0.7874 and a micro-averaged AUC of 0.8633. In addition, based on the XGBoost classifier, we established three binary classification models to discriminate NFG from IFG, NFG from T2DM, and IFG from T2DM. On the independent dataset, their AUCs were 0.7808, 0.8687 and 0.7067, respectively. Finally, we analyzed the importance of the features and identified the risk factors associated with diabetes.
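The grouping rule described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `fpg_group` is hypothetical, and the sketch assumes that a value of exactly 7.0 mmol/L falls in the T2DM group, as in the WHO (1999) fasting criterion.

```python
def fpg_group(fpg_mmol_per_l: float) -> str:
    """Map a fasting plasma glucose value (mmol/L) to NFG, IFG or T2DM.

    Hypothetical helper for illustration only. Thresholds follow the
    abstract: NFG below 6.1, IFG from 6.1 up to (not including) 7.0,
    T2DM otherwise. Treating exactly 7.0 as T2DM is an assumption.
    """
    if fpg_mmol_per_l < 0:
        raise ValueError("FPG cannot be negative")
    if fpg_mmol_per_l < 6.1:
        return "NFG"
    if fpg_mmol_per_l < 7.0:
        return "IFG"
    return "T2DM"


# Example usage with made-up FPG readings (not data from the study).
readings = [5.2, 6.1, 6.9, 7.0, 9.4]
labels = [fpg_group(value) for value in readings]
```

Because the three intervals are checked in increasing order, every non-negative FPG value is assigned to exactly one group.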
Citation: Yu-Mei Han, Hui Yang, Qin-Lai Huang, Zi-Jie Sun, Ming-Liang Li, Jing-Bo Zhang, Ke-Jun Deng, Shuo Chen, Hao Lin. Risk prediction of diabetes and pre-diabetes based on physical examination data[J]. Mathematical Biosciences and Engineering, 2022, 19(4): 3597-3608. doi: 10.3934/mbe.2022166