Research article

ACRnet: Adaptive Cross-transfer Residual neural network for chest X-ray images discrimination of the cardiothoracic diseases


  • Cardiothoracic diseases pose a serious threat to human health, and chest X-ray images are an important reference in diagnosis and treatment. Automatic and accurate recognition of chest X-ray images with computer vision techniques has become a research hot-spot, and many scholars have obtained exciting results. However, emphysema and cardiomegaly are often associated and their symptoms are very similar, so discriminating between their X-ray images easily leads to misdiagnosis. Therefore, efforts are still needed to develop deep learning models with higher precision and better performance to recognize the two diseases efficiently. In this work, we construct an adaptive cross-transfer residual neural network (ACRnet) to identify emphysema, cardiomegaly and normal cases. In ACRnet, we cross-transfer the information extracted by the residual blocks and the adaptive structure to different levels; this avoids the residual structure weakening the adaptive function and improves the recognition performance of the model. To evaluate the recognition ability of ACRnet, four neural networks, VGG16, InceptionV2, ResNet101 and CliqueNet, are used for comparison. The results show that ACRnet has better recognition ability than the other networks. In addition, we use a deep convolutional generative adversarial network (DCGAN) to expand the original dataset, which greatly improves ACRnet's recognition ability.

    Citation: Boyang Wang, Wenyu Zhang. ACRnet: Adaptive Cross-transfer Residual neural network for chest X-ray images discrimination of the cardiothoracic diseases[J]. Mathematical Biosciences and Engineering, 2022, 19(7): 6841-6859. doi: 10.3934/mbe.2022322




    Protein post-translational modification is an important chemical process that plays a key role in regulating cell functions [1] and also changes the physical and chemical properties of proteins. More than 400 post-translational modifications, including methylation [2], acetylation [3], phosphorylation [4] and S-nitrosylation (SNO) [5], have been discovered so far. SNO is a reversible post-translational modification of proteins, and a large number of studies have shown that it plays an important role in multiple biological processes such as redox signal transduction [6], cell signal transduction [7], cell senescence [8] and transcription [9]. SNO is also related to many human diseases such as cancer [10], Alzheimer's disease [11] and chronic renal failure [12]. Therefore, a well-grounded understanding of SNO is of great significance for the study of basic biological processes [9,13] and the development of drugs [14]. In recent years, many SNO sites have been identified through molecular signals [15,16], but the identification of SNO sites still faces challenges, including low accuracy and time- and labor-intensive procedures. With the continuous development of computer technology, a large number of computational models have been used to predict the specific sites of SNO modification.

    Many post-translational modifications of proteins have been detected by a variety of computational models. Qiu et al. identified phosphorylated [17] and acetylated [18] proteins with GO annotations. GPS-SNO [19], SNOSite [20], iSNO-PseAAC [21], PreSNO [5] and RecSNO [22] have been applied to the prediction of SNO sites. The GPS-SNO, SNOSite and iSNO-PseAAC models use relatively small data sets; in addition, many negative samples in these data sets have since been experimentally verified to be positive samples. The data sets used by PreSNO and RecSNO are larger and newer, but there is still room to improve the performance of the models.

    On the basis of previous research, this work establishes two models, one for predicting SNO proteins and one for predicting SNO sites. For predicting SNO proteins, a bag-of-words model is proposed on the basis of the KNN scoring matrix obtained from the proteins' GO annotation information [18] and the PseAAC [23,24] of the amino acid sequence. Fusing multiple features reflects the information of the protein sequence more comprehensively and improves the prediction results. Since the data sets involved are imbalanced, a combination of an oversampling technique and random deletion is applied to balance the training set. For predicting SNO sites, two feature extraction methods, TPC [25] and CKSAAP [26], are used to extract the features of protein sequence fragments. In order to eliminate the redundancy and noise in the original feature space, elastic nets [27] are used to reduce the dimensionality of the feature space after the original features are fused. Random Forest serves as the classifier and is verified with 5-fold cross-validation. The specific flow chart is shown in Figure 1.

    Figure 1.  The framework of RF-SNOPS.

    To obtain scientific prediction results, a strict benchmark data set is essential. UniProtKB has been accepted by most bioinformatics researchers. Here, the negative samples are extracted from UniProtKB and the positive samples are extracted from Xie's data set [28], a high-quality data set based on extensive literature research. A protein sequence can be expressed as:

    P = R_1 R_2 R_3 \cdots R_i \cdots R_L \quad (1)

    where Ri represents the i-th amino acid residue, and L represents the length of the protein sequence.

    In order to identify SNO proteins, we constructed a benchmark data set similar to that of Hasan et al. [5], which consists of 3113 SNO proteins. Each positive sample, i.e., SNO protein, has at least one SNO site. For the negative samples, we randomly selected 18,047 proteins without any SNO site from UniProtKB. To make the results more rigorous, CD-HIT was used to remove redundant sequences from the 3113 positive samples and 18,047 negative samples at a 30% sequence-identity threshold. Finally, 2192 positive samples and 7809 negative samples were collected in the proposed benchmark data set.

    The benchmark data set for predicting SNO sites is the same as that of Hasan et al. [5], which consists of 3383 positive samples and 3365 negative samples. A peptide sample containing a potential SNO (C) site can be generally expressed by

    P_\xi = R_{-\xi} R_{-(\xi-1)} \cdots R_{-2} R_{-1} \, C \, R_{+1} R_{+2} \cdots R_{+(\xi-1)} R_{+\xi} \quad (2)

    where the subscript ξ is an integer, R_{-ξ} represents the ξ-th upstream amino acid residue from the center, R_{+ξ} the ξ-th downstream residue, and so forth. If the number of residues to the left or right of the center C is less than ξ, the pseudo amino acid "X" is used to pad the sequence. The (2ξ+1)-tuple peptide sample P_ξ can be further classified into the following two categories:

    \bar{P}_\xi \in \begin{cases} \bar{P}_\xi^{+}, & \text{if its center is an SNO site} \\ \bar{P}_\xi^{-}, & \text{otherwise} \end{cases} \quad (3)

    where \bar{P}_\xi^{+} denotes a true SNO segment with C at its center, \bar{P}_\xi^{-} a corresponding false SNO segment, and the symbol ∈ means "a member of" in set theory.
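The windowing scheme of Eqs (2) and (3) can be sketched as follows; the function name and the toy sequence are illustrative, not from the paper.

```python
def extract_peptide(sequence, center, xi=20, pad="X"):
    """Cut a (2*xi+1)-residue window around `center` (0-based index),
    padding with the pseudo amino acid 'X' where the window runs off
    either end of the sequence, as in Eq (2)."""
    left = sequence[max(0, center - xi):center]
    right = sequence[center + 1:center + 1 + xi]
    left = pad * (xi - len(left)) + left          # pad missing upstream residues
    right = right + pad * (xi - len(right))       # pad missing downstream residues
    return left + sequence[center] + right

# Example: a cysteine near the start of a short sequence (xi=4 for brevity)
peptide = extract_peptide("MKCDEF", center=2, xi=4)
# 4 upstream residues (2 padded), then C, then 4 downstream residues (1 padded)
```

With ξ = 20, as used later for the site features, each sample is a 41-residue fragment.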

    GO-KNN [18] features are extracted on the basis of the GO annotations of proteins. In this work, we need to find the GO terms of all protein sequences and calculate the distances between proteins. Taking protein P1 as an example, for any other protein, say P2, the GO term sets P1_GO = {GO_1^1, GO_2^1, ..., GO_M^1} and P2_GO = {GO_1^2, GO_2^2, ..., GO_N^2} are obtained. If a protein has no GO term, we replace it with the GO terms of its homologous protein. The distance between two proteins is calculated with Eq (4):

    \mathrm{Distance}(P_1, P_2) = 1 - \frac{\left| P_1^{GO} \cap P_2^{GO} \right|}{\left| P_1^{GO} \cup P_2^{GO} \right|} \quad (4)

    where GO_i^1 and GO_i^2 represent the i-th GO term of P1 and P2, respectively; M and N are the numbers of GO terms; ∪ and ∩ are the union and intersection in set theory; and | · | represents the number of elements in a set. The GO-KNN features are then extracted in three steps: 1) sort the calculated distances in ascending order; 2) select the first k nearest neighbors of the test protein; 3) calculate the percentage of positive samples among the k neighbors. In this study, k was set to 2, 4, 8, 16, 62, 64, 128, 256, 512 and 1024, so a 10-dimensional feature vector (x_1, x_2, ..., x_10) represents the protein P1.
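A minimal sketch of Eq (4) and the three extraction steps; the function names and the representation of the training set as (GO-term set, label) pairs are assumptions for illustration.

```python
def go_distance(go1, go2):
    """Jaccard distance between two GO-term sets, Eq (4)."""
    go1, go2 = set(go1), set(go2)
    if not go1 | go2:
        return 1.0                      # no annotation on either side
    return 1.0 - len(go1 & go2) / len(go1 | go2)

def go_knn_features(test_go, train, ks=(2, 4, 8, 16, 62, 64, 128, 256, 512, 1024)):
    """For each k, the fraction of positive samples among the k nearest
    training proteins; `train` is a list of (go_terms, label) pairs
    with label 1 for SNO proteins and 0 otherwise."""
    ranked = sorted(train, key=lambda p: go_distance(test_go, p[0]))
    feats = []
    for k in ks:
        k = min(k, len(ranked))         # guard against small training sets
        feats.append(sum(label for _, label in ranked[:k]) / k)
    return feats
```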

    A bag-of-words model [29] based on the physical and chemical properties of proteins has previously been used to identify GPCR-drug interactions. The main steps are as follows: 1) Encoding the protein sequence with its physical and chemical properties. Up to now, scientists have obtained various physical and chemical properties of the 20 common amino acids [30]; after careful experimental comparison, hydrophilicity was selected as the indicator for the proposed model. 2) Designing wordbooks for the protein. With window sizes of 1, 2 and 3 and a step size of 1, the coded sequence is divided into segments of different lengths: segments of length 1 form wordbook WB1, segments of length 2 form wordbook WB2, and segments of length 3 form wordbook WB3. A fourth wordbook again uses a window of size 2 with step size 1, but with the two positions separated by one amino acid; the resulting gapped fragments of length 2 form wordbook WB4. 3) Clustering the wordbooks. The words in WB1 are divided into 20 sub-groups according to the amino acid types, and the words in WB2, WB3 and WB4 are clustered with the K-means algorithm into 16, 62 and 16 clusters, respectively. 4) Calculating the ratio of the number of words in each cluster to the total number of words in the wordbook with Eq (5).

    X_i^{WB_j} = \frac{X_i^{WB_j}}{N}, \quad i = 1, \ldots, K, \quad j = 1, 2, 3, 4 \quad (5)

    where K is the number of clusters in wordbook WB_j, X_i^{WB_j} is the number of words in the i-th category of wordbook WB_j, and N is the total number of words in WB_j. A 114-D feature vector (X_1^{WB_1}, ..., X_20^{WB_1}, X_1^{WB_2}, ..., X_16^{WB_2}, X_1^{WB_3}, ..., X_62^{WB_3}, X_1^{WB_4}, ..., X_16^{WB_4}) is then formed for a given protein sequence.
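The word extraction of step 2) and the normalization of Eq (5) can be sketched as below; a toy cluster assignment stands in for the K-means clustering of step 3), and all names are illustrative.

```python
def wordbook_segments(coded, size, gap=0):
    """Sliding-window words with step 1: contiguous words of length
    `size` (WB1-WB3), or, with size=2 and gap=1, the gapped pairs whose
    two positions are separated by one residue (WB4)."""
    span = size + gap * (size - 1)      # window footprint on the sequence
    return [coded[i:i + span:gap + 1] for i in range(len(coded) - span + 1)]

def bow_vector(coded, clusters, size=2, gap=0):
    """Eq (5): fraction of a wordbook's words falling in each cluster.
    `clusters` maps a word to its cluster index 0..K-1."""
    words = [w for w in wordbook_segments(coded, size, gap) if w in clusters]
    k = max(clusters.values()) + 1
    vec = [0.0] * k
    for w in words:
        vec[clusters[w]] += 1
    n = len(words)
    return [c / n for c in vec] if n else vec
```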

    PseAAC [23,24] is a very popular feature in bioinformatics. In this work, six physical and chemical properties are used: hydrophobicity, hydrophilicity, molecular side-chain mass, pK1, pK2 and pI. We first use Eq (6) to normalize the original physical and chemical property values of the amino acids:

    W_a(i) = \frac{W_a^0(i) - \frac{1}{20}\sum_{i=1}^{20} W_a^0(i)}{\sqrt{\frac{1}{20}\sum_{i=1}^{20}\left[ W_a^0(i) - \frac{1}{20}\sum_{i=1}^{20} W_a^0(i) \right]^2}} \quad (6)

    where a ∈ {1, 2, ..., 6} and i ∈ {1, 2, ..., 20}, and W_a^0(i) represents the value of the a-th original physical and chemical property of the i-th amino acid. We substitute the transformed property values into Eq (7):

    \Theta(R_i, R_j) = \frac{1}{6} \sum_{a=1}^{6} \left[ W_a(R_j) - W_a(R_i) \right]^2 \quad (7)

    where W_1(R_j) represents the hydrophobicity value of R_j; by analogy, W_6(R_i) represents the pI value of R_i. The correlation factor of each layer can then be obtained with Eq (8):

    \theta_\lambda = \frac{1}{L - \lambda} \sum_{i=1}^{L-\lambda} \Theta(R_i, R_{i+\lambda}), \quad \lambda < L \quad (8)

    where θ_λ represents the correlation factor of the λ-th layer of the protein sequence. Finally, the protein sequence is converted into a feature vector by Eq (9):

    x_i = \begin{cases} \dfrac{f_i}{\sum_{k=1}^{20} f_k + \omega \sum_{j=1}^{\lambda} \theta_j}, & 1 \le i \le 20 \\[2ex] \dfrac{\omega \, \theta_{i-20}}{\sum_{k=1}^{20} f_k + \omega \sum_{j=1}^{\lambda} \theta_j}, & 20 + 1 \le i \le 20 + \lambda \end{cases} \quad (9)

    where f_i represents the frequency of the i-th amino acid, ω is 0.5, and λ is 5. In this way, a 25-dimensional feature vector is formed.
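Eqs (6)-(9) can be sketched end-to-end as follows. The property table passed in as `props` (amino acid -> six property values) is an assumption of the caller; real hydrophobicity, pK and pI tables would be substituted for it.

```python
import math

def normalize_props(props):
    """Eq (6): zero-mean scaling of each property over the 20 amino
    acids, divided by the root mean squared deviation."""
    aas = sorted(props)
    out = {a: [] for a in aas}
    for p in range(len(props[aas[0]])):
        vals = [props[a][p] for a in aas]
        mean = sum(vals) / 20
        std = math.sqrt(sum((v - mean) ** 2 for v in vals) / 20)
        for a, v in zip(aas, vals):
            out[a].append((v - mean) / std if std else 0.0)
    return out

def pseaac(seq, props, lam=5, w=0.5):
    """Eqs (7)-(9): 20 amino acid frequency terms plus `lam`
    sequence-order correlation factors -> a (20+lam)-D vector."""
    norm = normalize_props(props)
    theta = lambda a, b: sum((norm[b][k] - norm[a][k]) ** 2
                             for k in range(6)) / 6                # Eq (7)
    thetas = [sum(theta(seq[i], seq[i + l]) for i in range(len(seq) - l))
              / (len(seq) - l) for l in range(1, lam + 1)]         # Eq (8)
    aas = sorted(props)
    freqs = [seq.count(a) for a in aas]
    denom = sum(freqs) + w * sum(thetas)                           # Eq (9)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]
```

By construction the 25 components sum to 1, which is a quick sanity check on an implementation.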

    In order to reduce the adverse effect of imbalanced data on the performance of the model, many methods for dealing with imbalanced data have been proposed, such as the Synthetic Minority Oversampling Technique (SMOTE) [31] and the Random Under-Sampler (RUS) [32]. SMOTE, proposed by Chawla et al., has been used to predict protein sites [27] and to improve the prognostic assessment of lung cancer [33]. RUS is a very simple and popular under-sampling method that has been used in pediatric pneumonia detection [34] and in improving the performance of convolutional neural networks [35]. In this study, we combine the two methods: SMOTE oversamples the positive samples and RUS under-samples the negative samples, so that the numbers of processed positive and negative samples are equal. The specific process is shown in Figure 2.

    Figure 2.  Balance database processing.
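A simplified stand-in for this balancing step is sketched below. For brevity it interpolates between random minority pairs rather than between a sample and its k nearest neighbors as real SMOTE does, and the meet-in-the-middle target size is an assumption; it only illustrates the combined oversample/under-sample idea.

```python
import random

def sr_balance(pos, neg, seed=0):
    """SR balancer sketch: SMOTE-style interpolation grows the minority
    class and random under-sampling (RUS) shrinks the majority class
    until both classes reach the average of the two class sizes."""
    rng = random.Random(seed)
    minority, majority = sorted([pos, neg], key=len)
    target = (len(pos) + len(neg)) // 2
    # SMOTE-style: a new point on the segment between two minority samples
    synthetic = []
    while len(minority) + len(synthetic) < target:
        a, b = rng.sample(minority, 2)
        t = rng.random()
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    # RUS: keep a random subset of the majority class
    reduced = rng.sample(majority, target)
    return minority + synthetic, reduced
```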

    CKSAAP [25] has been widely used in protein site prediction [26] since it can effectively express the internal regularities of a given protein sequence. A protein fragment is composed of the 20 common amino acids and one pseudo amino acid, which gives 441 residue pairs (AA, AC, ..., XX) for each gap l. Here l represents the number of positions between the residues of a pair. The following formula is used to calculate the features of the fragment:

    \left( \frac{N_{AA}}{N_T}, \frac{N_{AC}}{N_T}, \frac{N_{AD}}{N_T}, \ldots, \frac{N_{XX}}{N_T} \right)_{441} \quad (10)

    where N_AA, N_AC, ... represent the numbers of times the corresponding amino acid pairs appear in the fragment, L is the length of the protein fragment, and N_T = L − l − 1. In this study, l takes the values 0, 1, 2, 3 and 4, and the corresponding N_T values are 40, 39, 38, 37 and 36, respectively. A 2205-D feature vector is then formed.
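A minimal sketch of Eq (10) over the five gaps; the function name and pair ordering are illustrative.

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"                         # 20 amino acids + pseudo 'X'
PAIRS = ["".join(p) for p in product(ALPHABET, repeat=2)]  # 21*21 = 441 residue pairs

def cksaap(fragment, gaps=(0, 1, 2, 3, 4)):
    """Eq (10) for each gap l: frequency of every residue pair whose two
    members are separated by l positions; 5 gaps x 441 pairs = 2205-D."""
    vec = []
    for l in gaps:
        nt = len(fragment) - l - 1                 # N_T = L - l - 1
        counts = {p: 0 for p in PAIRS}
        for i in range(nt):
            counts[fragment[i] + fragment[i + l + 1]] += 1
        vec.extend(counts[p] / nt for p in PAIRS)
    return vec
```

For a 41-residue fragment (ξ = 20) this reproduces N_T = 40, 39, 38, 37, 36, and each 441-entry block sums to 1.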

    Based on the structural properties of proteins, researchers have proposed the tripeptide composition (TPC). It has been used to predict protein subcellular localization [36] and to identify Plasmodium mitochondrial proteins [37]. TPC calculates the frequency of every three consecutive amino acids, so a protein fragment can be represented by a 9261-dimensional vector:

    P_i = \frac{N_i}{\sum_{i=1}^{9261} N_i} \quad (11)

    where N_i represents the number of occurrences of the i-th of the 9261 tripeptides.
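Eq (11) can be sketched as below; the 9261 dimensions come from the 21-letter alphabet (20 amino acids plus the pseudo amino acid "X"), since 21³ = 9261. Names are illustrative.

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"                               # 21 letters
TRIPEPTIDES = ["".join(t) for t in product(ALPHABET, repeat=3)]  # 21**3 = 9261

def tpc(fragment):
    """Eq (11): frequency of each of the 9261 tripeptides among all
    consecutive triples of the fragment."""
    counts = {t: 0 for t in TRIPEPTIDES}
    for i in range(len(fragment) - 2):
        counts[fragment[i:i + 3]] += 1
    total = len(fragment) - 2
    return [counts[t] / total for t in TRIPEPTIDES]
```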

    The elastic net proposed by Zou and Hastie [38] is an effective feature selection method. By introducing the L1 and L2 norms into a simple linear regression model, the elastic net can perform continuous shrinkage and automatic variable selection at the same time, and can also select groups of correlated variables. At present, elastic nets have been widely used in protein site prediction [27,39] and have achieved good results.

    In this study, four indicators are used to evaluate the performance of the models: accuracy (ACC), sensitivity (Sn), specificity (Sp) and the Matthews correlation coefficient (MCC) [40], which are defined by Eq (12):

    \begin{cases} Sn = \dfrac{TP}{TP + FN} \\[1ex] Sp = \dfrac{TN}{TN + FP} \\[1ex] ACC = \dfrac{TP + TN}{TP + FP + TN + FN} \\[1ex] MCC = \dfrac{TP \times TN - FP \times FN}{\sqrt{(TP + FP) \times (TP + FN) \times (TN + FP) \times (TN + FN)}} \end{cases} \quad (12)

    In predicting SNO proteins, TP is the number of proteins that are predicted to have SNO sites and actually have them; TN is the number of proteins predicted to have no SNO sites that actually have none; FP is the number of proteins without SNO sites that are predicted to have them; and FN is the number of proteins with SNO sites that are predicted to have none. In addition, the area under the ROC curve (AUC) is also used to evaluate the model.

    In predicting SNO sites, TP indicates the number of actual SNO sites predicted to be SNO sites, and TN indicates the number of non-SNO sites predicted to be not SNO sites. FP is the number of non-SNO sites predicted to be SNO sites, and FN is the number of actual SNO sites predicted to be non-SNO sites.
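Eq (12) translates directly into code; the function name is illustrative.

```python
import math

def metrics(tp, tn, fp, fn):
    """Eq (12): Sn, Sp, ACC and MCC from the confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fp + tn + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return sn, sp, acc, mcc

# Perfectly separated predictions give Sn = Sp = ACC = MCC = 1
```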

    Random Forest [41] is an algorithm that integrates multiple trees through the idea of ensemble learning; its basic unit is the decision tree. As a highly flexible machine learning algorithm, Random Forest (RF) has been widely used in data analysis [42], bioinformatics [43] and technological development [44].

    Naive Bayes (NB) [45] is a simple and effective classifier, which is widely used in software defect prediction [46], medical diagnosis [47] and biological information [48]. NB is based on the Bayes theorem and the assumption of the conditional independence of features, which greatly reduces the complexity of the classification algorithm.

    K Nearest Neighbor (KNN) [49] is one of the supervised machine learning algorithms, which is widely used in face recognition [50], disease research [51] and engineering applications [52]. Its main idea is to judge the category of the predicted value based on the category of the k points closest to the predicted value.

    XGBoost [53,54] is an improved boosting algorithm based on GBDT [55]. It is an ensemble method that integrates many base models into a strong model. Because of its good prediction performance and high training efficiency, XGBoost has been widely used in the field of data analysis.

    In this research, three feature extraction methods, GO-KNN, BOW and PseAAC, were used to encode the protein sequences, giving 10-D, 114-D and 25-D feature vectors, respectively. These three kinds of features were fused into a 149-D feature vector denoted ALL. The prediction results obtained with the different features under 5-fold cross-validation are shown in Table 1.

    Table 1.  The predict results of different features.
    Feature Acc (%) Sn (%) Sp (%) MCC AUC
    GO-KNN 82.72 48.36 92.37 0.4533 0.8521
    BOW 78.83 13.04 97.30 0.1969 0.7359
    PseAAC 79.38 19.43 96.22 0.2503 0.7616
    ALL 83.77 49.49 93.40 0.4840 0.8593


    It can be seen from Table 1 that different features give different prediction results. Among the three methods, GO-KNN has the highest ACC, Sn, MCC and AUC, which are 82.72%, 48.36%, 0.4533 and 0.8521, respectively. BOW has the lowest ACC, Sn, MCC and AUC (78.83%, 13.04%, 0.1969 and 0.7359) but the highest Sp (97.30%). After fusing the three features, the ACC, Sn, Sp, MCC and AUC are 83.77%, 49.49%, 93.40%, 0.4840 and 0.8593, respectively; ACC, Sn, MCC and AUC are all higher than those produced by GO-KNN. The results show that multi-feature fusion improves several indicators. To better analyze the influence of the different features on SNO protein prediction, the results for the three features and their fusion are shown in Figure 3.

    Figure 3.  Comparison of prediction results on different features.

    It can be seen from Figure 3 that the three features and their fusion affect the five evaluation indicators to different extents: the results are weaker on Sn and MCC and better on ACC, Sp and AUC. Comparing the four feature encodings, the fusion feature ALL improves ACC, Sn, MCC and AUC. Multi-feature fusion reflects sequence information more comprehensively and thereby improves prediction ability, so it is adopted for predicting SNO proteins.

    Here, the combination of SMOTE and RUS is denoted as the SR balancer. We input the data sets before and after balancing into the model and obtained the ACC, Sn, Sp, MCC and AUC under 5-fold cross-validation, as shown in Table 2.

    Table 2.  Comparison of predict results before and after SR Balancer.
    Acc (%) Sn (%) Sp (%) MCC AUC
    Imbalance 83.77 49.49 93.40 0.4840 0.8593
    Balance 81.84 70.82 84.93 0.5178 0.8635


    It can be seen from Table 2 that after balancing, Sn and Sp are much closer to each other; in addition, Sn, MCC and AUC are all improved. In summary, balancing the data set is necessary.

    Classifiers play an important role in model prediction. This work used the above four classifiers to identify SNO proteins. After 5-fold cross-validation, the ACC, Sn, Sp, MCC and AUC of each classifier are shown in Table 3. It can be seen from Table 3 that Random Forest performs best on every evaluation indicator except Sn. To better compare the classifiers, their ROC curves are shown in Figure 4.

    Table 3.  The prediction results of different classifiers.
    Algorithms Acc (%) Sn (%) Sp (%) MCC AUC
    RF 81.84 70.82 84.93 0.5178 0.8635
    NB 63.81 78.37 59.73 0.3154 0.7710
    KNN 71.97 83.44 68.75 0.4366 0.8360
    XGBoost 80.73 70.07 83.72 0.4953 0.8553

    Figure 4.  The ROC curves of different classification methods.

    The area under the ROC curve can evaluate the predictive performance of the model. It can be seen from Figure 4 that when the random forest is used as a classifier, the area under the ROC curve is the largest. Therefore, random forest is the best choice for the proposed model.

    In this study, two kinds of features, CKSAAP and TPC, were used, yielding 2205-dimension and 9261-dimension feature vectors on the basis of the above algorithms. To better reflect the information of the protein fragments, these features were fused into an 11,466-dimension feature vector. The prediction results obtained with the different features under 5-fold cross-validation are shown in Table 4.

    Table 4.  The prediction results of different feature extraction methods.
    Feature Acc (%) Sn (%) Sp (%) MCC AUC
    CKSAAP 73.97 83.67 64.27 0.4891 0.8036
    TPC 71.38 66.07 76.74 0.4305 0.8069
    ALL 75.36 86.39 64.31 0.5201 0.8196


    It can be seen from Table 4 that the ACC, Sn and MCC of CKSAAP are higher than those of TPC, while TPC performs better on Sp and AUC. After feature fusion, ACC, Sn, MCC and AUC are all higher than for either single feature. Therefore, feature fusion is necessary for this task.

    Multi-information fusion extracts protein sequence information more comprehensively, but it also introduces redundancy and noise. Dimensionality reduction not only retains the important features but also improves the computational efficiency of the model. In this paper, the elastic net was used to reduce the dimensionality of the fused feature set, yielding a feature subset of 704 dimensions. The prediction results of Random Forest under 5-fold cross-validation are shown in Table 5.

    Table 5.  Results before and after feature selection.
    Acc (%) Sn (%) Sp (%) MCC AUC
    All 75.36 86.39 64.31 0.5201 0.8196
    Elastic net 76.02 85.68 66.33 0.5304 0.8260


    After dimensionality reduction with the elastic net, all evaluation indicators except Sn are improved. In addition, because the feature dimension is greatly reduced, the efficiency of the model is also significantly improved.

    Four classifiers, Random Forest, Naive Bayes, K-Nearest Neighbor and XGBoost, were tested in this work for predicting SNO sites. The results of 5-fold cross-validation are shown in Table 6, from which we can see that Naive Bayes and K-Nearest Neighbor are relatively inferior, and that Random Forest is the best on all indicators except Sp. To evaluate the classifiers more comprehensively, their ROC curves are shown in Figure 5.

    Table 6.  The prediction results of different classifiers.
    Algorithms Acc (%) Sn (%) Sp (%) MCC AUC
    RF 76.02 85.68 66.33 0.5304 0.8260
    NB 69.74 79.46 59.98 0.4022 0.7605
    KNN 63.63 46.39 81.00 0.2923 0.7246
    XGBoost 72.88 74.37 71.40 0.4580 0.8015

    Figure 5.  The ROC curves of different classification methods.

    From Figure 5, we can clearly see that the area under the ROC curve of the random forest is the largest. Therefore, random forest has been selected as the classifier of the proposed model.

    To further evaluate the performance of this model, we compared it with the PreSNO and RecSNO models. The prediction results of the three methods on the same data set are shown in Table 7. From Table 7, we can see that the ACC, Sn and MCC of our model are the highest, and its Sp and AUC also reach good levels. Therefore, the performance of this model is better than that of PreSNO and RecSNO.

    Table 7.  Comparison of RF-SNOPS with other methods.
    Method Acc (%) Sn (%) Sp (%) MCC AUC
    PreSNO 70 54 86 0.42 0.84
    RecSNO 72 79 66 0.45 0.79
    RF-SNOPS 76.02 85.68 66.33 0.5304 0.8260


    In order to identify SNO proteins, we used GO-KNN, BOW and PseAAC to extract sequence information: GO-KNN extracts nearest-neighbor information based on the proteins' GO annotations, while BOW and PseAAC extract protein sequence information based on physical and chemical properties. In addition, we used the SR balancer to process the imbalanced data set and reduce the negative impact of the imbalance on the model. Finally, Random Forest was used to make predictions. For predicting SNO sites, CKSAAP and TPC were used to extract protein fragment information, and elastic nets were used to reduce the dimensionality of the fused features, improving computational efficiency and eliminating the redundancy and noise generated by feature fusion. These processes require only computational models, without any physical or chemical experiments, which saves experimental costs and improves work efficiency. We hope that this work will be helpful for solving biological problems with computational methods.

    This work was supported by grants from the National Natural Science Foundation of China (No. 31760315, 62162032, 61761023) and the Natural Science Foundation of Jiangxi Province, China (No. 20202BAB202007).

    The authors declare that no competing interests exist.



    [1] D. Brenner, J. McLaughlin, R. Hung, Previous lung diseases and lung cancer risk: A systematic review and meta-analysis, PLoS One, 6 (2011). https://doi.org/10.1371/journal.pone.0017479 doi: 10.1371/journal.pone.0017479
    [2] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, Commun. ACM, 60 (2017), 84–90. https://doi.org/10.1145/3065386 doi: 10.1145/3065386
    [3] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, et al., ImageNet Large Scale Visual Recognition Challenge, Int. J. Comput. Vision, 115 (2015), 211–252. https://doi.org/10.1007/s11263-015-0816-y doi: 10.1007/s11263-015-0816-y
    [4] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, A. Zisserman, The pascal visual object classes challenge: A retrospective, Int. J. Comp. Vision, 111 (2014), 98–136, https://doi.org/10.1007/s11263-014-0733-5 doi: 10.1007/s11263-014-0733-5
    [5] T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, et al., Microsoft COCO: Common objects in context, in Computer Vision – ECCV 2014, Springer, 8693 (2014), 740–755. https://doi.org/10.1007/978-3-319-10602-1_48
    [6] L. Zhang, P. Yang, H. Feng, Q. Zhao, H. Liu, Using network distance analysis to predict lncRNA-miRNA interactions, Interdiscip. Sci., 13 (2021), 535–545. https://doi.org/10.1007/s12539-021-00458-z doi: 10.1007/s12539-021-00458-z
    [7] G. Liang, L. Zheng, A transfer learning method with deep residual network for pediatric pneumonia diagnosis, Comput. Methods Programs Biomed., 187 (2020). https://doi.org/10.1016/j.cmpb.2019.06.023 doi: 10.1016/j.cmpb.2019.06.023
    [8] X. Wei, Y. Chen, Z. Zhang, Comparative experiment of convolutional neural network (CNN) models based on pneumonia X-ray images detection, in 2020 2nd International Conference on Machine Learning, Big Data and Business Intelligence (MLBDBI), (2020), 449–454. https://doi.org/10.1109/MLBDBI51377.2020.00095
    [9] L. Račić, T. Popovic, S. Caki, S. Sandi, Pneumonia detection using deep learning based on convolutional neural network, in 2021 25th International Conference on Information Technology, (2021), 1–4. https://doi.org/10.1109/IT51528.2021.9390137
    [10] A. G. Taylor, C. Mielke, J. Mongan, Automated detection of moderate and large pneumothorax on frontal chest X-rays using deep convolutional neural networks: A retrospective study, PLoS Med., 15 (2018). https://doi.org/10.1371/journal.pmed.1002697 doi: 10.1371/journal.pmed.1002697
    [11] T. K. K. Ho, J. Gwak, O. Prakash, J. I. Song, C. M. Park, Utilizing pretrained deep learning models for automated pulmonary tuberculosis detection using chest radiography, in Intelligent Information and Database Systems, Springer, 11432 (2019), 395–403. https://doi.org/10.1007/978-3-030-14802-7_34
    [12] R. Zhang, M. Sun, S. Wang, K. Chen, Computed Tomography pulmonary nodule detection method based on deep learning, US 10937157B2, L. Infervision Medical Technology, 2021. Available from: https://patentimages.storage.googleapis.com/9c/00/cc/4c302cd759496a/US10937157.pdf.
    [13] C. Tong, B. Liang, Q. Su, M. Yu, J. Hu, A. K. Bashir, et al., Pulmonary nodule classification based on heterogeneous features learning, IEEE J. Sel. Areas Commun., 39 (2021), 574–581. https://doi.org/10.1109/JSAC.2020.3020657 doi: 10.1109/JSAC.2020.3020657
    [14] L. J. Hyuk, S. H. Young, P. Sunggyun, K. Hyungjin, H. E. Jin, G. J. Mo, et al., Performance of a deep learning algorithm compared with radiologic interpretation for lung cancer detection on chest radiographs in a health screening population, Radiology, 297 (2020), 687–696. https://doi.org/10.1148/radiol.2020201240 doi: 10.1148/radiol.2020201240
    [15] A. Hosny, C. Parmar, T. P. Coroller, P. Grossmann, R. Zeleznik, A. Kumar, et al., Deep learning for lung cancer prognostication: a retrospective multi-cohort radiomics study, PLoS Med., 15 (2018). https://doi.org/10.1371/journal.pmed.1002711 doi: 10.1371/journal.pmed.1002711
    [16] M. Masud, N. Sikder, A. A. Nahid, A. K. Bairagi, M. A. Alzain, A machine learning approach to diagnosing lung and colon cancer using a deep learning-based classification framework, Sensors (Basel), 21 (2021), 1–21. https://doi.org/10.3390/s21030748 doi: 10.3390/s21030748
    [17] S. Roy, W. Menapace, S. Oei, B. Luijten, E. Fini, C. Saltori, et al., Deep learning for classification and localization of COVID-19 markers in point-of-care lung ultrasound, IEEE Trans. Med. Imaging, 39 (2020), 2676–2687. https://doi.org/10.1109/TMI.2020.2994459 doi: 10.1109/TMI.2020.2994459
    [18] H. T. Qing, K. Mohammad, M. Mokhtar, P. GholamReza, T. Karim, T. A. Rashid, Real-time COVID-19 diagnosis from X-Ray images using deep CNN and extreme learning machines stabilized by chimp optimization algorithm, Biomed. Signal Process. Control, 68 (2021). https://doi.org/10.1016/j.bspc.2021.102764 doi: 10.1016/j.bspc.2021.102764
    [19] M. A. Khan, S. Kadry, Y. D. Zhang, T. Akram, M. Sharif, A. Rehman, et al., Prediction of COVID-19 - pneumonia based on selected deep features and one class kernel extreme learning machine, Comput. Electr. Eng., 90 (2021). https://doi.org/10.1016/j.compeleceng.2020.106960 doi: 10.1016/j.compeleceng.2020.106960
    [20] Y. Qasim, B. Ahmed, T. Alhadad, H. A. Sameai, O. Ali, The impact of data augmentation on accuracy of COVID-19 detection based on X-ray images, in Innovative Systems for Intelligent Health Informatics, Lecture Notes on Data Engineering and Communications Technologies, Springer, 72 (2021), 1041–1049. https://doi.org/10.1007/978-3-030-70713-2_93
    [21] M. Loey, F. Smarandache, N. E. M. Khalifa, Within the lack of chest COVID-19 X-ray dataset: A novel detection model based on GAN and deep transfer learning, Symmetry, 12 (2020). https://doi.org/10.3390/sym12040651 doi: 10.3390/sym12040651
    [22] S. Y. Lu, D. Wu, Z. Zhang, S. H. Wang, An explainable framework for diagnosis of COVID-19 pneumonia via transfer learning and discriminant correlation analysis, ACM Trans. Multimedia Comput. Commun. Appl., 17 (2021), 1–16. https://doi.org/10.1145/3449785 doi: 10.1145/3449785
    [23] S. Y. Lu, Z. Q. Zhu, J. M. Gorriz, S. H. Wang, Y. D. Zhang, NAGNN: Classification of COVID-19 based on neighboring aware representation from deep graph neural network, Int. J. Intell. Syst., 37 (2021), 1572–1598. https://doi.org/10.1002/int.22686 doi: 10.1002/int.22686
    [24] L. T. Duong, N. H. Le, T. B. Tran, V. M. Ngo, P. T. Nguyen, Detection of tuberculosis from chest X-ray images: boosting the performance with vision transformer and transfer learning, Expert Syst. Appl., 184 (2021), 115519. https://doi.org/10.1016/j.eswa.2021.115519 doi: 10.1016/j.eswa.2021.115519
    [25] J. R. F. Junior, D. A. Cardona, R. A. Moreno, M. F. S. Rebelo, J. E. Krieger, M. A. Gutierrez, A general fully automated deep-learning method to detect cardiomegaly in chest x-rays, in Progress in Biomedical Optics and Imaging 2021: Computer-Aided Diagnosis, 2021. https://doi.org/10.1117/12.2581980
    [26] Y. Wu, S. Qi, Y. Sun, S. Xia, Y. Yao, W. Qian, et al., A vision transformer for emphysema classification using CT images, Phys. Med. Biol., 66 (2021), 245016. https://doi.org/10.1088/1361-6560/ac3dc8 doi: 10.1088/1361-6560/ac3dc8
    [27] P. Rajpurkar, J. Irvin, R. L. Ball, K. Zhu, B. Yang, H. Mehta, et al., Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists, PLoS Med., 15 (2018). https://doi.org/10.1371/journal.pmed.1002686 doi: 10.1371/journal.pmed.1002686
    [28] A. I. A. Rivero, N. Papadakis, R. Li, P. Sellars, Q. Fan, R. T. Tan, et al., GraphX NET-chest X-Ray classification under extreme minimal supervision, in Medical Image Computing and Computer Assisted Intervention–MICCAI 2019-22nd International Conference, 2019. https://doi.org/10.1007/978-3-030-32226-7_56
    [29] X. Wang, Y. Peng, L. Lu, Z. Lu, R. M. Summers, TieNet: Text-image embedding network for common thorax disease classification and reporting in chest X-rays, in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, (2018), 9049–9058. https://doi.org/10.1109/CVPR.2018.00943
    [30] J. Zhao, M. Li, W. Shi, Y. Miao, Z. Jiang, B. Ji, A deep learning method for classification of chest X-ray images, J. Phys. Conf. Ser., 1848 (2021). https://doi.org/10.1088/1742-6596/1848/1/012030 doi: 10.1088/1742-6596/1848/1/012030
    [31] T. K. K. Ho, J. Gwak, Utilizing knowledge distillation in deep learning for classification of chest X-ray abnormalities, IEEE Access, 8 (2020), 160749–160761. https://doi.org/10.1109/ACCESS.2020.3020802 doi: 10.1109/ACCESS.2020.3020802
    [32] Y. Xiao, M. Lu, Z. Fu, Covered face recognition based on deep convolution generative adversarial networks, in Lecture Notes in Computer Science, (2020), 133–141. https://doi.org/10.1007/978-3-030-57884-8_12
    [33] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings, 2015. https://doi.org/10.1109/acpr.2015.7486599
    [34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, et al., Going deeper with convolutions, in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, (2015), 1–9. https://doi.org/10.1109/cvpr.2015.7298594
    [35] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in 32nd International Conference on Machine Learning, (2015), 448–456. Available from: http://proceedings.mlr.press/v37/ioffe15.pdf.
    [36] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, (2016), 770–778. https://doi.org/10.1109/cvpr.2016.90
    [37] Y. Yang, Z. Zhong, T. Shen, Z. Lin, Convolutional neural networks with alternately updated clique, in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, (2018), 2413–2422. https://doi.org/10.1109/CVPR.2018.00256
    [38] J. Hu, L. Shen, S. Albanie, G. Sun, E. Wu, Squeeze-and-excitation networks, IEEE Trans. Pattern Anal. Mach. Intell., 42 (2020), 2011–2023. https://doi.org/10.1109/TPAMI.2019.2913372 doi: 10.1109/TPAMI.2019.2913372
    [39] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, R. R. Salakhutdinov, Improving neural networks by preventing co-adaptation of feature detectors, preprint, arXiv: 1207.0580v1.
    [40] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, Improved techniques for training GANs, Adv. Neural Inf. Process. Syst., (2016), 2234–2242. Available from: https://proceedings.neurips.cc/paper/2016/file/8a3363abe792db2d8761d6403605aeb7-Paper.pdf.
    [41] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, S. Hochreiter, GANs trained by a two time-scale update rule converge to a local Nash equilibrium, Adv. Neural Inf. Process. Syst., (2017), 6627–6638. Available from: https://proceedings.neurips.cc/paper/2017/file/8a1d694707eb0fefe65871369074926d-Paper.pdf.
    [42] F. F. Li, J. Deng, K. Li, ImageNet: Constructing a large-scale image database, J. Vision, 9 (2009). https://doi.org/10.1167/9.8.1037 doi: 10.1167/9.8.1037
    [43] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2016), 2818–2826. https://doi.org/10.1109/cvpr.2016.308
    [44] N. L. Ramo, K. L. Troyer, C. M. Puttlitz, Comparing predictive accuracy and computational costs for viscoelastic modeling of spinal cord tissues, J. Biomech. Eng., 141 (2019). https://doi.org/10.1115/1.4043033 doi: 10.1115/1.4043033
    [45] D. M. Powers, Evaluation: From precision, recall and f-measure to ROC, informedness, markedness and correlation, J. Mach. Learn. Technol., 2 (2011), 2229–3981. Available from: http://hdl.handle.net/2328/27165.
    [46] T. Fawcett, An introduction to ROC analysis, Pattern Recognit. Lett., 27 (2006), 861–874. https://doi.org/10.1016/j.patrec.2005.10.010 doi: 10.1016/j.patrec.2005.10.010
    [47] C. X. Ling, J. Huang, H. Zhang, AUC: A better measure than accuracy in comparing learning algorithms, in Advances in Artificial Intelligence, 16th Conference of the Canadian Society for Computational Studies of Intelligence, AI 2003, Halifax, Canada, 2003. https://doi.org/10.1007/3-540-44886-1_25
    [48] G. Zeng, On the confusion matrix in credit scoring and its analytical properties, in Communications in Statistics - Theory and Methods, 49 (2020), 2080–2093. https://doi.org/10.1080/03610926.2019.1568485
  • © 2022 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)