1. Introduction
Diabetes is a metabolic disorder caused by insufficient insulin secretion or defects in insulin action [1]. The main manifestation of diabetes is hyperglycemia. Long-term exposure of organs to hyperglycemia damages physiological systems, leading to chronic progressive lesions and failure of tissues and organs such as the eyes, kidneys, nerves, heart and blood vessels [2]. At present, diabetes mellitus is divided into type 1 diabetes mellitus (T1DM) and type 2 diabetes mellitus (T2DM), of which T2DM is the most common type, accounting for about 95% of diabetic patients [3]. The main factors leading to T2DM are environmental factors and poor living habits; in addition, age, overnutrition and insufficient exercise are all triggers of diabetes [4]. The development from health to T2DM usually goes through three stages: health, pre-diabetes, and type 2 diabetes [5]. Once T2DM is diagnosed, the blood glucose level of patients continues to rise and is difficult to reverse with drug treatment [6,7]. However, patients with pre-diabetes can maintain stable blood glucose and even restore health through intervention. Many studies have shown that early diagnosis and treatment are the most effective ways to prevent and control T2DM. Therefore, early detection and timely adjustment of lifestyle are the key to the treatment of T2DM [8].
With the development of the economy and culture, people pay more and more attention to physical examinations [9,10]. Finding valuable diabetes-related information in physical examination data and identifying the changing pattern of diabetes at each stage is of great importance to the prevention and treatment of diabetes.
In recent years, many algorithms have been used to predict diabetes. For example, Zou et al. used principal component analysis (PCA) and minimum redundancy maximum relevance (mRMR) to screen risk factors, and utilized decision tree (DT), RF and neural network (NN) models to predict diabetes [11]. By using mutual information (MI) and Gini impurity (GI) to screen diabetes-related risk factors in physical examination data, Yang et al. established a cascaded diabetes risk prediction system [12]. The invasive risk assessment model HCL predicted diabetes by using invasive characteristics and referring to the Harvard Cancer Risk Index [13].
Machine learning algorithms have been widely used in the field of medicine because of their powerful performance [14,15,16,17]. Therefore, based on real-world physical examination data, this study used XGBoost, RF, LR and FCN to predict diabetes and analyzed the impact of these indicators at each stage of T2DM.
2. Materials and methods
2.1. Benchmark Dataset
The physical examination data were collected from Beijing Physical Examination Center from January 2006 to December 2017. In this study, the fasting plasma glucose (FPG) index in the physical examination data was used as the standard to classify the sample types of the dataset. FPG reflects the function of islet β cells and generally indicates the secretion function of basal insulin; it is the most commonly used indicator for diabetes [18]. Clinical application of FPG is conducive to the early diagnosis and prevention of T2DM. According to the WHO (1999) diagnostic criteria for diabetes, the population was divided into three groups: normal FPG (NFG, FPG < 6.1 mmol/L), slightly impaired FPG (IFG, 6.1 mmol/L ≤ FPG < 7.0 mmol/L), and T2DM (FPG ≥ 7.0 mmol/L) [19]. Finally, the benchmark data included 1,221,598 NFG samples, 285,965 IFG samples, and 387,076 T2DM samples.
There are 17 initial features in the physical examination data: waistline, age, systolic pressure (SP), gender, blood uric acid (BUA), serum creatinine (SC), triglyceride, diastolic pressure (DP), glutamic oxalacetic transaminase (GOT), hipline, high-density lipoprotein (HDL), glutamic-pyruvic transaminase (GPT), height, blood urea nitrogen (BUN), weight, total cholesterol (TC), and low-density lipoprotein (LDL). Height and waistline alone cannot directly evaluate a person's obesity, so we added the waist-to-height ratio (WHtR) to reflect whether a person has visceral fat accumulation. As a result, a total of 18 features were used for further analysis and model construction.
To facilitate the performance evaluation of the model, we divided the dataset into a training set and a test set at a ratio of 7:3. Thus, the benchmark dataset can be formulated as

Sᵢ = Strainᵢ ∪ Stestᵢ, i = 1, 2, 3,

where the subscripts 1, 2 and 3 represent NFG, IFG and T2DM, respectively, and "train" and "test" denote the training data and test data, respectively.
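The per-class 7:3 partition described above can be sketched as follows. This is a minimal stdlib-only illustration; the function name and random seed are our own assumptions, not the tooling actually used in the study.

```python
import random

def stratified_split(samples, labels, train_ratio=0.7, seed=42):
    """Split (sample, label) pairs 7:3 within each class, mirroring
    the per-class Strain/Stest partition of the benchmark dataset."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    train, test = [], []
    for y, xs in by_class.items():
        rng.shuffle(xs)                       # randomize within the class
        cut = int(len(xs) * train_ratio)      # 70% boundary
        train += [(x, y) for x in xs[:cut]]
        test += [(x, y) for x in xs[cut:]]
    return train, test
```

Splitting within each class keeps the NFG/IFG/T2DM proportions the same in the training and test sets.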
2.2. Machine learning methods
In this study, the eXtreme Gradient Boosting (XGBoost), random forest (RF), logistic regression (LR), and fully connected neural network (FCN) algorithms were used as classifiers. The details are as follows.
2.2.1. eXtreme Gradient Boosting (XGBoost)
XGBoost is based on the gradient boosting algorithm [20,21,22]. In the modeling process, trees are added continuously, and each new tree learns a function that fits the residual of the previous prediction. After training, a gradient boosting model of K trees is obtained. The ultimate goal of XGBoost is to make the predicted value of the tree ensemble as close to the true value as possible while generalizing as well as possible.
The objective function of XGBoost is:

Obj = ∑ᵢ l(yᵢ, ŷᵢ) + ∑ₖ Ω(fₖ),

where ŷᵢ is the output of the entire cumulative model, l(yᵢ, ŷᵢ) is the training loss, and the regularization term ∑ₖ Ω(fₖ) represents the complexity of the trees: the smaller its value, the lower the complexity and the stronger the generalization ability of the model.
In this study, Gini impurity (GI) is used to evaluate the contribution of features to the model. In the tree model, better decision conditions can be selected by comparing GI values, and each split of a tree node should make the GI as low as possible. GI is used mainly because of its low computational complexity. It is defined as:

GI(t) = 1 − ∑ᵢ p(i|t)²,

where t represents a given node, i represents any label category, and p(i|t) represents the proportion of label category i at node t.
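As an illustration, GI can be computed directly from the labels that reach a node. This is a minimal sketch with our own helper name, not code from the study.

```python
def gini_impurity(labels):
    """GI(t) = 1 - sum_i p(i|t)^2, computed over the labels at node t."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    # Sum the squared class proportions and subtract from 1.
    return 1.0 - sum((c / n) ** 2 for c in counts.values())
```

A pure node (one class) gives GI = 0; an evenly mixed two-class node gives GI = 0.5, the two-class maximum.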
2.2.2. Random Forest (RF)
RF is also a tree-based ensemble classifier which is a representative model of the bagging method. The core idea of the bagging method is to construct multiple independent evaluators, and then the prediction results are determined by the principle of average or majority voting [23,24].
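The two ingredients of bagging, bootstrap sampling for the independent base learners and majority voting over their predictions, can be sketched as follows. These are illustrative helpers under our own names, not the RF implementation used in this study.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw a bootstrap sample (sampling with replacement) of the same
    size as the original data, used to train one base learner."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Aggregate the base learners' predictions for a single sample
    by the majority-voting principle."""
    return Counter(predictions).most_common(1)[0][0]
```

Because each tree sees a different bootstrap sample, the base learners are decorrelated, which is what makes averaging their votes effective.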
2.2.3. Logistic Regression (LR)
LR is a generalized linear regression analysis algorithm and is often used in the field of disease diagnosis [25,26]. It is a variation of linear regression and is widely used for both regression and classification. LR constructs a mapping from the feature vector x to the prediction ŷ and calculates the parameters of the model, formulated as

ŷ = g(wᵀx + b),

where w and b are the model parameters. The process is as follows. First, a loss function is defined; then the parameter vector is solved by minimizing the loss function. Finally, LR uses the Sigmoid function to constrain the output between 0 and 1:

g(z) = 1 / (1 + e⁻ᶻ).
The Sigmoid function maps z to a value of g(z) between 0 and 1. When g(z) approaches 0, the sample is assigned to category 0; when g(z) is close to 1, the sample is assigned to category 1. In this way, a classification model is obtained.
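The Sigmoid mapping and the resulting decision rule can be sketched as follows. The helper names and the 0.5 decision threshold are illustrative assumptions.

```python
import math

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)), mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def lr_predict(x, w, b, threshold=0.5):
    """Predict class 0 or 1 from g(w.x + b): outputs near 1 map to
    category 1, outputs near 0 map to category 0."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if sigmoid(z) >= threshold else 0
```
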
2.2.4. Fully connected neural network (FCN)
An FCN generally consists of three parts: an input layer, hidden layers and an output layer [27,28]. Each layer uses the output of the previous layer as input and passes its own output to the next layer. The most basic unit in a neural network is the neuron; each neuron receives multiple inputs and produces one output, and multiple interconnected neurons form a neural network. An FCN generates nonlinear output through activation functions; commonly used activation functions are ReLU, Sigmoid and Tanh. FCN training is divided into two processes: forward propagation and backward propagation. Forward propagation fits the features and uses the loss function to calculate the gap between the model output and the target value. Backpropagation uses gradient descent to update the parameters of each layer according to the loss value produced by forward propagation, thereby optimizing the parameters.
We established a three-layer fully connected neural network. The input layer has 18 neurons; the first and second hidden layers have 7 and 4 neurons, respectively, with the ReLU activation function; and the optimization function is RMSprop. The output layer has three neurons with the Softmax activation function.
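The forward pass of this 18-7-4-3 architecture can be sketched with NumPy as follows. The weights here are random placeholders for illustration only; in the study the parameters were learned with RMSprop as described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes matching the architecture: 18 inputs,
# hidden layers of 7 and 4 ReLU neurons, 3 Softmax outputs.
W1, b1 = rng.standard_normal((7, 18)), np.zeros(7)
W2, b2 = rng.standard_normal((4, 7)), np.zeros(4)
W3, b3 = rng.standard_normal((3, 4)), np.zeros(3)

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def forward(x):
    """One forward pass: 18 features in, 3 class probabilities out."""
    h1 = relu(W1 @ x + b1)
    h2 = relu(W2 @ h1 + b2)
    return softmax(W3 @ h2 + b3)
```

The Softmax output is a probability distribution over the three classes (NFG, IFG, T2DM), so the predicted class is simply its argmax.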
2.3. Performance measurement
In this study, accuracy, precision, recall, F1 and AUC were used to evaluate the performance of the proposed models [29], which were calculated as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)

where TP represents true positives, the number of correctly predicted positive samples; FP denotes false positives, the number of negative samples predicted as positive; FN indicates false negatives, the number of positive samples classified as negative; and TN denotes true negatives, the number of samples correctly predicted as negative. Accuracy is the ratio of the number of correctly predicted samples to the total number of samples.
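These definitions can be computed directly from the four confusion-matrix counts (an illustrative helper; zero denominators are mapped to 0 by convention):

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1
```
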
The receiver operating characteristic (ROC) curve is often used to measure the predictive power of the current method across the entire range of algorithm decision value [30]. The ROC can reveal the relationship between true positive rate (TPR) and false positive rate (FPR). We used the area under the ROC curve, referred to as area under curve (AUC), to evaluate the performance of the model.
2.4. Model validation
Generally, there are three methods for model verification: Holdout test, K-Fold cross-validation test and Leave-One-Out (LOO) test [31,32].
The Holdout test divides the sample into two mutually exclusive parts: one part is used as the training set and the other as the test set. The model is trained on the training set and examined on the test set, and all evaluation indexes are calculated on the test set. K-Fold cross-validation divides the dataset into K mutually exclusive subsets. Each time, one subset is used as the test set and all other subsets are used as the training set, traversing the K subsets in turn. Finally, the average values of the evaluation indexes are used as the final evaluation indexes. The stability of K-Fold cross-validation is closely related to the value of K: if K is too small, the experiment is not stable enough; if K is too large, the modeling cost may increase. Generally, K is 5 or 10. LOO is a special case of K-Fold cross-validation in which K equals the number of samples in the dataset. The results obtained by this method are closest to the expectation of training on the entire dataset, but the computational cost is too large.
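The K-Fold partition of sample indices can be sketched as follows (a stdlib-only illustration under our own function name; in practice a library splitter would typically be used):

```python
def kfold_indices(n_samples, k):
    """Partition sample indices into k mutually exclusive folds;
    each fold serves as the test set once, the rest as training."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        # Spread any remainder across the first folds.
        size = fold_size + (1 if i < remainder else 0)
        folds.append(indices[start:start + size])
        start += size
    splits = []
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, test))
    return splits
```

Setting k = n_samples reproduces the LOO scheme described above.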
In this article, we use Holdout test for model verification.
3. Results and discussion
In this study, four machine learning methods, namely XGBoost, RF, LR and FCN, were used as classifiers. The following two experiments were performed.
3.1. Prediction of NFG, IFG and T2DM
In the first experiment, three-class classification models were established based on the above four methods to distinguish NFG, IFG and T2DM. We used Strain1, Strain2 and Strain3 to train the four machine learning methods for constructing the models. Stest1, Stest2 and Stest3 were utilized to investigate the performance of the models for the prediction of NFG, IFG and T2DM. The results were recorded in Table 1 and shown in Figure 1. Table 1 displays the six evaluation indexes of the four models on the test data. From the table, we noticed that XGBoost produced the best results, with an AUC (macro) of 0.7874 and an AUC (micro) of 0.8633. It is worth noting that the prediction result of FCN is the worst, suggesting that FCN is not suitable for this health data analysis; this is consistent with the fact that neural networks are not well suited to samples with few features. Figure 1 shows the ROC curves of the four classifiers on the test set. For each algorithm, we drew the micro-average ROC curve, the macro-average ROC curve and the per-class ROC curves. According to Figure 1a, the AUCs of XGBoost for identifying NFG, IFG and T2DM from the entire population are 0.79, 0.70 and 0.84, respectively.
Subsequently, we performed feature analysis; Figure 2 shows the feature importance of XGBoost on the benchmark dataset. Waistline ranked first, indicating that obesity is the most important risk factor for diabetes, and age ranked second: the older the age, the greater the risk of diabetes. Figure 3 shows the incremental feature selection (IFS) curve. It can be seen that when the first 7 features (waistline, age, SP, gender, BUA, SC, triglyceride) are used for modeling, the model achieves the highest AUC, and adding further features does not improve the overall results. We believe these 7 features are important risk factors for distinguishing NFG, IFG and T2DM.
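The IFS strategy, evaluating growing prefixes of importance-ranked features and keeping the best-scoring prefix, can be sketched generically as follows. The scoring callback (e.g. a model's AUC on a feature subset) is left abstract, and the function name is our own.

```python
def incremental_feature_selection(ranked_features, score_fn):
    """Evaluate prefixes of importance-ranked features and return
    (best_subset, best_score). score_fn(subset) would train a model
    on the subset and return an evaluation index such as AUC."""
    best_subset, best_score = [], float("-inf")
    for k in range(1, len(ranked_features) + 1):
        subset = ranked_features[:k]
        score = score_fn(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score
```

In the study, this is the loop that identified the 7-feature subset at the peak of the IFS curve.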
3.2. Discrimination between any two classes
On the basis of the benchmark dataset, three binary models were established to distinguish NFG from IFG, NFG from T2DM, and IFG from T2DM. The importance of features in each model was assessed using GI, and incremental feature selection (IFS) was used to find the optimal feature subset. Due to its good performance and wide usage on health data, we only used XGBoost to construct the three models. The results are recorded in Table 2.
At first, we built a model for discriminating NFG from IFG. The ROC curve and feature ranking of the model are drawn in Figure 2. The results show that the AUC is 0.7808, indicating that there is little difference between NFG and IFG. Although blood glucose is elevated in the pre-diabetes stage, the pancreatic islets have not been completely impaired, and irreversible damage to the body has not yet occurred. From Figure 2b and c, it can be observed that the most important features in this case are waistline, age, WHtR, gender and SP, indicating that the risk factors for the early-stage population are obesity, age and hypertension.
Subsequently, we focused on the discrimination between NFG and T2DM. From Table 2 and Figure 3a, the XGBoost-based model produced an AUC of 0.8687, showing that a model built on physical examination indicators can accurately distinguish normal people from diabetic patients. The order of feature importance is age, waistline, triglyceride, WHtR, SP, gender and SC (Figure 3b). In the identification of diabetic patients, some molecular markers, such as triglycerides, play an important role, reflecting the physiological level of diabetic patients. At present, the diagnosis rate of diabetes in China is less than 50%, so it is of great significance to identify diabetic patients through physical examination indicators, especially in free physical examinations in rural China.
The third binary model was built to distinguish IFG from T2DM based on XGBoost. From the results in Table 2 and Figure 4a, the model achieved an AUC of 0.7067 on the test dataset, the lowest among the three binary classification models. This is mainly because many physical indicators of pre-diabetes and diabetes are very similar. Patients with pre-diabetes are not easily controlled and treated, and are easily converted to diabetic patients. In this classification problem, both the IFG and T2DM populations are exposed to hyperglycemia, which affects various physical indicators. Figure 4b and c show that the most important features are gender, SC, triglyceride, age, BUA, waistline, GOT, WHtR and GPT. Some features, such as SC and GOT, may indicate that the renal and liver function of the T2DM population is impaired compared with the IFG population.
4. Conclusions
Diabetes is a metabolic disease. From health to diabetes, there are generally three stages: health, pre-diabetes and type 2 diabetes. How to use machine learning methods for early prediction and diagnosis of the disease is worth studying. In the three-class classification experiment distinguishing NFG, IFG and T2DM, comparing the results of the four classifiers (XGBoost, RF, LR and FCN) showed little difference between them; XGBoost was slightly better than the other classifiers, with an AUC (macro) of 0.7874 and an AUC (micro) of 0.8633. We then chose XGBoost as the basic classifier and constructed three binary classification models to distinguish NFG from IFG, NFG from T2DM, and IFG from T2DM. The AUCs of these models on the test dataset are 0.7808, 0.8687 and 0.7067, respectively. We used the GI index to evaluate feature importance, sorted the features accordingly, and mined relevant risk factors by combining this ranking with the IFS strategy. Overall, age, triglyceride, WHtR and SP are important risk factors. In particular, it was found that T2DM patients may have liver and kidney damage.
Through this work, we hope to explore the possibility of early prediction of diabetes with physical examination data, to mine valuable diabetes-related information from physical examination data and other omics data [33], and to discover the changes at each stage of diabetes, so as to provide clues for early prevention and treatment. In the future, we hope to clarify the causal relationships between various risk factors and diabetes through cohort studies and Mendelian randomization studies, and to explore effective intervention schemes on this basis.
Acknowledgments
The study was supported by grants from the National Key R & D Program of China (2020YFC2003403), Capital's Funds for Health Improvement and Research (2018-2-2242) and the National Natural Science Foundation of China (82130112).
Conflict of interest
The authors declare that there is no conflict of interest.