
We develop a mathematical model for the transmission of brucellosis in sheep that takes into account external inputs, immunity, stage structure and other factors. We derive the basic reproduction number R0 in terms of the model parameters and prove the global stability of the disease-free equilibrium. We then prove the existence and global stability of the endemic equilibrium. Finally, sheep data from Yulin, China are used to fit the model parameters under three different environmental infection exposure conditions. The variability between the models in terms of control measures is analyzed numerically. The results show that the model is sensitive to the control parameters for different environmental infection exposure functions, which means that in practical modeling, the choice of environmental infection exposure function needs to be considered carefully.
Citation: Zongmin Yue, Yuanhua Mu, Kekui Yu. Dynamic analysis of sheep Brucellosis model with environmental infection pathways[J]. Mathematical Biosciences and Engineering, 2023, 20(7): 11688-11712. doi: 10.3934/mbe.2023520
Protein post-translational modification is an important chemical process that plays a key role in regulating cell functions [1] and also changes the physical and chemical properties of proteins. More than 400 post-translational modifications, including methylation [2], acetylation [3], phosphorylation [4] and S-nitrosylation (SNO) [5], have been discovered so far. SNO is a reversible post-translational modification of proteins, and a large number of studies have shown that it plays an important role in biological processes such as redox signaling [6], cell signal transduction [7], cell senescence [8] and transcription [9]. SNO is also related to many human diseases, such as cancer [10], Alzheimer's disease [11] and chronic renal failure [12]. Therefore, a well-grounded understanding of SNO is of great significance for the study of basic biological processes [9,13] and the development of drugs [14]. In recent years, many SNO sites have been identified through molecular signals [15,16], but experimental identification of SNO sites still faces challenges: it is time-consuming, labor-intensive and of limited accuracy. With the continuous development of computer technology, a large number of computational models have been used to predict the specific sites of SNO modification.
Many post-translational modifications of proteins have been detected by a variety of computational models. Qiu identified phosphorylated [17] and acetylated [18] proteins using GO annotations. GPS-SNO [19], SNOSite [20], iSNO-PseAAC [21], PreSNO [5] and RecSNO [22] have been applied to the prediction of SNO sites. The GPS-SNO, SNOSite and iSNO-PseAAC models use relatively small data sets; moreover, many negative samples in these data sets have since been experimentally verified as positive. The data sets used by PreSNO and RecSNO are larger and newer, but there is still room for improvement in model performance.
Building on previous research, this work established two models, one for predicting SNO proteins and one for predicting SNO sites. For predicting SNO proteins, a bag-of-words model is proposed on the basis of a KNN scoring matrix obtained from the proteins' GO annotation information [18] and the PseAAC [23,24] of the amino acid sequence. Fusing multiple features reflects the information in a protein sequence more comprehensively and improves the prediction results. Because the data sets involved are imbalanced, a combination of an oversampling technique and random deletion is applied to balance the training set. For predicting SNO sites, two feature extraction methods, TPC [25] and CKSAAP [26], are used to extract features from protein sequence fragments. To eliminate the redundancy and noise of the original feature space, elastic nets [27] are used to reduce its dimensionality after the original features are fused. Random Forest serves as the classifier and is evaluated with 5-fold cross-validation. The overall workflow is shown in Figure 1.
To obtain a reliable prediction result, a strict benchmark data set is essential. UniProtKB has been accepted by most bioinformatics researchers. Here, the negative samples are extracted from UniProtKB, and the positive samples are taken from Xie's data set [28], a high-quality data set based on extensive literature research. A protein sequence can be expressed as:
$$P = R_1 R_2 R_3 \cdots R_i \cdots R_L \qquad (1)$$
where $R_i$ represents the $i$-th amino acid residue and $L$ represents the length of the protein sequence.
To identify SNO proteins, we constructed a benchmark data set similar to that of Hasan et al. [5], which consists of 3113 SNO proteins. Each positive sample, i.e., each SNO protein, has at least one SNO site. For negative samples, we randomly selected 18,047 proteins without any SNO site from UniProtKB. To make the results more rigorous, CD-HIT was applied with a 30% sequence-identity threshold to remove redundant sequences from the 3113 positive and 18,047 negative samples. Finally, 2192 positive samples and 7809 negative samples were collected in the proposed benchmark data set.
The benchmark data set for predicting SNO sites is the same as that of Hasan et al. [5] and consists of 3383 positive samples and 3365 negative samples. A peptide sample containing a potential SNO (C) site can generally be expressed as
$$\mathbb{P}_\xi = R_{-\xi} R_{-(\xi-1)} \cdots R_{-2} R_{-1} \, \mathrm{C} \, R_{+1} R_{+2} \cdots R_{+(\xi-1)} R_{+\xi} \qquad (2)$$
where the subscript $\xi$ is an integer, $R_{-\xi}$ represents the $\xi$-th upstream amino acid residue from the center, $R_{+\xi}$ the $\xi$-th downstream residue, and so forth. If the number of residues to the left or right of the central C is less than $\xi$, the pseudo amino acid "X" is used to pad the sequence. The $(2\xi+1)$-tuple peptide sample $\mathbb{P}_\xi$ can be further classified into the following two categories:
$$\mathbb{P}_\xi \in \begin{cases} \mathbb{P}_\xi^{+}, & \text{if its center is an SNO site} \\ \mathbb{P}_\xi^{-}, & \text{otherwise} \end{cases} \qquad (3)$$
where $\mathbb{P}_\xi^{+}$ denotes a true SNO segment with C at its center, $\mathbb{P}_\xi^{-}$ a corresponding false SNO segment, and the symbol $\in$ means "a member of" in set theory.
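The windowing of Eq (2) can be sketched in a few lines of Python. This is our own illustrative implementation (function name and default $\xi$ included), not the paper's code: it slices a $(2\xi+1)$-residue window around each cysteine and pads with "X" where the sequence runs out.

```python
def extract_windows(sequence: str, xi: int = 20):
    """Return one (2*xi + 1)-length window per cysteine "C" in `sequence`,
    padded with the pseudo amino acid "X" at either end when needed."""
    windows = []
    for center, residue in enumerate(sequence):
        if residue != "C":
            continue
        left = sequence[max(0, center - xi):center]
        right = sequence[center + 1:center + 1 + xi]
        # Pad so every window has exactly 2*xi + 1 residues.
        left = "X" * (xi - len(left)) + left
        right = right + "X" * (xi - len(right))
        windows.append(left + "C" + right)
    return windows
```

For example, `extract_windows("MKCLV", xi=3)` yields the single padded window `"XMKCLVX"`.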
GO-KNN features [18] are extracted on the basis of the GO annotations of proteins. In this work, we need to find the GO terms of all protein sequences and calculate the distances between proteins. Take protein $P_1$ as an example: for any other protein, say $P_2$, their GO term sets can be listed as $P_1^{GO} = \{GO_1^1, GO_2^1, \cdots, GO_M^1\}$ and $P_2^{GO} = \{GO_1^2, GO_2^2, \cdots, GO_N^2\}$. If a protein has no GO term, we replace it with the GO terms of its homologous protein. The distance between two proteins is calculated with Eq (4):
$$\mathrm{Distance}(P_1, P_2) = 1 - \frac{\left| P_1^{GO} \cap P_2^{GO} \right|}{\left| P_1^{GO} \cup P_2^{GO} \right|} \qquad (4)$$
where $GO_i^1$ and $GO_i^2$ represent the $i$-th GO term of $P_1$ and $P_2$, respectively, $M$ and $N$ are the numbers of GO terms, $\cup$ and $\cap$ are the set union and intersection, and $|\cdot|$ denotes the number of elements in a set. The GO-KNN features are then extracted in three steps: 1) sort the calculated distances in ascending order; 2) select the first $k$ nearest neighbors of the test protein; 3) calculate the percentage of positive samples among the $k$ neighbors. In this study, $k$ was set to 2, 4, 8, 16, 32, 64, 128, 256, 512 and 1024, so a 10-dimensional feature vector $(x_1, x_2, \cdots, x_{10})$ represents the protein $P_1$.
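The three steps above amount to a Jaccard-style distance (Eq 4) followed by a positive-fraction score per $k$. A minimal sketch, with illustrative names and toy data rather than the paper's implementation:

```python
def go_distance(go_a: set, go_b: set) -> float:
    """Eq (4): one minus the Jaccard similarity of two GO term sets."""
    if not go_a and not go_b:
        return 1.0  # convention for two proteins with no GO terms at all
    return 1.0 - len(go_a & go_b) / len(go_a | go_b)

def go_knn_features(query_go, train, ks=(2, 4, 8)):
    """train: list of (go_term_set, is_positive) pairs.
    Returns, for each k, the fraction of positives among the k nearest
    training proteins to the query."""
    ranked = sorted(train, key=lambda t: go_distance(query_go, t[0]))
    feats = []
    for k in ks:
        neighbours = ranked[:k]
        feats.append(sum(1 for _, pos in neighbours if pos) / k)
    return feats
```

With the full ten $k$ values this yields the 10-dimensional GO-KNN vector described above.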
A bag-of-words model [29] based on the physical and chemical properties of proteins has previously been used to identify GPCR-drug interactions. The main steps are as follows: 1) Encode the protein sequence with its physical and chemical properties. Scientists have measured various physical and chemical properties of the 20 common amino acids [30]; after careful experimental comparison, hydrophilicity was selected as the indicator for the proposed model. 2) Design wordbooks for the protein. With window sizes of 1, 2 and 3 and a moving-window step size of 1, the coded sequence is divided into segments of different lengths: segments of length 1 form wordbook $WB_1$, segments of length 2 form $WB_2$, and segments of length 3 form $WB_3$. A fourth window also has size 2 and step size 1, but its two positions are separated by one amino acid; the resulting length-2 fragments form wordbook $WB_4$. 3) Cluster the wordbooks. The words in $WB_1$ are divided into 20 sub-groups according to amino acid type, while the words in $WB_2$, $WB_3$ and $WB_4$ are clustered with the K-means algorithm into 16, 62 and 16 clusters, respectively. 4) Calculate the ratio of the number of words in each cluster to the total number of words in the wordbook with Eq (5):
$$x_i^{WB_j} = \frac{X_i^{WB_j}}{N}, \qquad i = 1, \ldots, K, \quad j = 1, 2, 3, 4 \qquad (5)$$
where $K$ is the number of clusters in wordbook $WB_j$, $X_i^{WB_j}$ is the number of words in the $i$-th cluster of $WB_j$, and $N$ is the total number of words in $WB_j$. A 114-dimensional feature vector is thus formed for a given protein sequence, i.e., $(x_1^{WB_1}, \ldots, x_{20}^{WB_1}, x_1^{WB_2}, \ldots, x_{16}^{WB_2}, x_1^{WB_3}, \ldots, x_{62}^{WB_3}, x_1^{WB_4}, \ldots, x_{16}^{WB_4})$.
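The wordbook construction in step 2 can be sketched as below. The K-means clustering of $WB_2$-$WB_4$ (step 3) is omitted; function names are our own:

```python
def build_wordbooks(coded: str):
    """Slice a property-coded sequence into the four wordbooks:
    contiguous windows of length 1, 2 and 3 (WB1-WB3), plus a gapped
    length-2 window whose residues are one position apart (WB4)."""
    wb1 = [coded[i] for i in range(len(coded))]
    wb2 = [coded[i:i + 2] for i in range(len(coded) - 1)]
    wb3 = [coded[i:i + 3] for i in range(len(coded) - 2)]
    wb4 = [coded[i] + coded[i + 2] for i in range(len(coded) - 2)]
    return wb1, wb2, wb3, wb4

def bow_frequencies(words):
    """Eq (5): ratio of each word's count to the total word count."""
    total = len(words)
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return {w: c / total for w, c in counts.items()}
```

In the full model, Eq (5) is applied per cluster rather than per raw word, giving the 20 + 16 + 62 + 16 = 114 dimensions.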
PseAAC [23,24] is a very popular feature in bioinformatics. In this work, six physical and chemical properties are used: hydrophobicity, hydrophilicity, side-chain mass, pK1, pK2 and pI. We first use Eq (6) to standardize the original property values of the amino acids:
$$W_a(i) = \frac{W_a^0(i) - \frac{1}{20}\sum_{i=1}^{20} W_a^0(i)}{\sqrt{\frac{1}{20}\sum_{i=1}^{20}\left[ W_a^0(i) - \frac{1}{20}\sum_{i=1}^{20} W_a^0(i) \right]^2}} \qquad (6)$$
where $a \in \{1, 2, \cdots, 6\}$ and $i \in \{1, 2, \cdots, 20\}$, and $W_a^0(i)$ represents the value of the $a$-th original physical or chemical property of the $i$-th amino acid. The transformed property values are then substituted into Eq (7):
$$\Theta(R_i, R_j) = \frac{1}{6} \sum_{a=1}^{6} \left[ W_a(R_j) - W_a(R_i) \right]^2 \qquad (7)$$
where $W_1(R_j)$ represents the hydrophobicity value of $R_j$; by analogy, $W_6(R_i)$ represents the pI value of $R_i$. The correlation factor of each tier is then obtained using Eq (8):
$$\theta_\lambda = \frac{1}{L - \lambda} \sum_{i=1}^{L-\lambda} \Theta(R_i, R_{i+\lambda}), \qquad \lambda < L \qquad (8)$$
where $\theta_\lambda$ represents the correlation factor of the $\lambda$-th tier of the protein sequence. Finally, the protein sequence is converted into a feature vector by Eq (9):
$$x_i = \begin{cases} \dfrac{f_i}{\sum_{i=1}^{20} f_i + \omega \sum_{j=1}^{\lambda} \theta_j}, & 1 \le i \le 20 \\[2ex] \dfrac{\omega\, \theta_{i-20}}{\sum_{i=1}^{20} f_i + \omega \sum_{j=1}^{\lambda} \theta_j}, & 21 \le i \le 20 + \lambda \end{cases} \qquad (9)$$
where $f_i$ represents the frequency of the $i$-th amino acid, $\omega$ is 0.5, and $\lambda$ is 5. In this way, a 25-dimensional feature vector is formed.
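Eqs (6)-(9) fit in a short Python sketch. The property table below is a placeholder argument: the real model uses the six published scales named above, which we do not reproduce here, and the function names are illustrative.

```python
import math

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical amino acids, fixed order

def standardize(props):
    """Eq (6): zero-mean, unit-spread standardization of each property
    over the 20 amino acids. props maps property name -> {aa: raw value}."""
    out = {}
    for name, raw in props.items():
        mean = sum(raw[x] for x in AA) / 20
        spread = math.sqrt(sum((raw[x] - mean) ** 2 for x in AA) / 20)
        out[name] = {x: (raw[x] - mean) / spread for x in AA}
    return out

def pseaac(seq, props, lam=5, w=0.5):
    """Eqs (7)-(9): 20 amino-acid frequencies plus lam sequence-order
    correlation factors, for a sequence of canonical amino acids."""
    norm = standardize(props)

    def theta_pair(r1, r2):  # Eq (7), averaged over the supplied properties
        return sum((p[r2] - p[r1]) ** 2 for p in norm.values()) / len(norm)

    thetas = [sum(theta_pair(seq[i], seq[i + k])
                  for i in range(len(seq) - k)) / (len(seq) - k)
              for k in range(1, lam + 1)]                      # Eq (8)
    freqs = [seq.count(x) / len(seq) for x in AA]
    denom = sum(freqs) + w * sum(thetas)                       # Eq (9)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]
```

Note that, by construction, the $20 + \lambda$ components of the resulting vector sum to 1.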
To reduce the adverse effect of imbalanced data on model performance, many methods for handling imbalanced data have been proposed, such as the Synthetic Minority Oversampling Technique (SMOTE) [31] and the Random Under-Sampler (RUS) [32]. SMOTE, proposed by Chawla et al., has been used to predict protein sites [27] and to improve the prognostic assessment of lung cancer [33]. RUS is a very simple and popular under-sampling method that has been used in pediatric pneumonia detection [34] and to improve convolutional neural network performance [35]. In this study, we combined the two methods: SMOTE oversamples the positive samples and RUS under-samples the negative samples, so that after processing the number of positive samples equals the number of negative samples. The specific process is shown in Figure 2.
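The combined scheme can be sketched in pure Python as below. This is a simplified stand-in for the SMOTE and RandomUnderSampler implementations in the imbalanced-learn library, not the library itself; the function name, seed handling and midpoint target are our own choices, and at least two minority samples are assumed.

```python
import random

def sr_balance(positives, negatives, seed=0):
    """SMOTE-style interpolation oversamples the minority class and random
    deletion under-samples the majority class until both classes reach the
    midpoint of the two original class sizes."""
    rng = random.Random(seed)
    minority, majority = sorted([list(positives), list(negatives)], key=len)
    target = (len(minority) + len(majority)) // 2
    while len(minority) < target:
        a, b = rng.sample(minority, 2)          # pick two minority points
        t = rng.random()                        # interpolate between them
        minority.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    majority = rng.sample(majority, target)     # random under-sampling
    return minority, majority
```

In practice one would use imbalanced-learn's `SMOTE` and `RandomUnderSampler` directly; this sketch only illustrates the mechanics.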
CKSAAP [25] has been widely used in protein site prediction [26] since it effectively expresses internal patterns in a given protein sequence. A protein fragment is composed of the 20 common amino acids and one pseudo amino acid, giving 441 residue pairs (AA, AC, ..., XX) for each gap $l$, where $l$ is the number of positions separating the two residues of a pair. The following formula calculates the features of a fragment:
$$\left( \frac{N_{AA}}{N_T}, \frac{N_{AC}}{N_T}, \frac{N_{AD}}{N_T}, \cdots, \frac{N_{XX}}{N_T} \right)_{441} \qquad (10)$$
where $N_{AA}, N_{AC}, \cdots$ represent the number of times the corresponding residue pair appears in the fragment, $L$ is the length of the fragment, and $N_T = L - l - 1$. In this study, $l$ takes the values 0, 1, 2, 3 and 4, and the corresponding $N_T$ are 40, 39, 38, 37 and 36, respectively. A 2205-dimensional feature vector is thus formed.
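Eq (10) can be sketched as follows; the function name and the hard-coded gap set are illustrative, but the pair counting follows the definition above ($5 \times 441 = 2205$ features):

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"  # 20 amino acids plus the pseudo residue X
PAIRS = ["".join(p) for p in product(ALPHABET, repeat=2)]  # 441 ordered pairs

def cksaap(fragment, gaps=(0, 1, 2, 3, 4)):
    """For each gap l, the frequency of every residue pair whose members are
    separated by l positions; concatenating the gap blocks gives the full
    CKSAAP vector."""
    features = []
    for l in gaps:
        n_t = len(fragment) - l - 1  # number of pairs at this gap
        counts = dict.fromkeys(PAIRS, 0)
        for i in range(n_t):
            counts[fragment[i] + fragment[i + l + 1]] += 1
        features.extend(counts[p] / n_t for p in PAIRS)
    return features
```

Each 441-feature block sums to 1, since every pair position contributes exactly once.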
Based on the structural properties of proteins, researchers proposed the tripeptide composition (TPC). It has been used to predict protein subcellular localization [36] and to identify Plasmodium mitochondrial proteins [37]. TPC calculates the frequency of every run of three consecutive amino acids, so a protein fragment is represented by a 9261-dimensional vector:
$$P_i = \frac{N_i}{\sum_{i=1}^{9261} N_i} \qquad (11)$$
where $N_i$ represents the number of occurrences of the $i$-th of the 9261 possible tripeptides.
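With the 21-letter alphabet (20 amino acids plus "X") there are $21^3 = 9261$ possible tripeptides, and Eq (11) reduces to a normalized sliding-window count. A minimal sketch with an illustrative function name:

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"  # 21 residues -> 21**3 = 9261 tripeptides
TRIPEPTIDES = ["".join(t) for t in product(ALPHABET, repeat=3)]

def tpc(fragment):
    """Eq (11): frequency of each possible tripeptide among all consecutive
    residue triples in the fragment."""
    counts = dict.fromkeys(TRIPEPTIDES, 0)
    total = len(fragment) - 2  # number of consecutive triples
    for i in range(total):
        counts[fragment[i:i + 3]] += 1
    return [counts[t] / total for t in TRIPEPTIDES]
```

The resulting vector sums to 1 by construction.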
The elastic net proposed by Zou and Hastie [38] is an effective feature selection method. By introducing both the $L_1$ and $L_2$ norms into a simple linear regression model, the elastic net performs continuous shrinkage and automatic variable selection at the same time, and can select groups of correlated variables. Elastic nets have been widely used in protein site prediction [27,39] with good results.
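One way to apply the elastic net for feature selection, assuming scikit-learn is available, is to fit the penalized linear model and keep the features with non-zero coefficients. The hyperparameters below are illustrative only; the 704-dimensional subset reported later came from tuning on the real data.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.feature_selection import SelectFromModel

def elastic_net_select(X, y, alpha=0.01, l1_ratio=0.5):
    """Fit an elastic net (combined L1/L2 penalty) and return the reduced
    feature matrix together with the indices of the retained columns."""
    enet = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=5000)
    selector = SelectFromModel(enet, threshold=1e-5).fit(X, y)
    return selector.transform(X), selector.get_support(indices=True)
```

The `threshold` discards coefficients the $L_1$ penalty has driven to (near) zero, which is what makes the elastic net act as a feature selector.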
In this study, four indicators are used to evaluate model performance: accuracy (ACC), sensitivity (Sn), specificity (Sp) and the Matthews correlation coefficient (MCC) [40], defined by Eq (12):
$$\begin{cases} Sn = \dfrac{TP}{TP + FN} \\[1.5ex] Sp = \dfrac{TN}{TN + FP} \\[1.5ex] ACC = \dfrac{TP + TN}{TP + FP + TN + FN} \\[1.5ex] MCC = \dfrac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \end{cases} \qquad (12)$$
In predicting SNO proteins, TP is the number of proteins predicted to have SNO sites that actually have SNO sites, and TN is the number of proteins predicted to have no SNO sites that actually have none. FP is the number of proteins without SNO sites that are predicted to have them, and FN is the number of proteins with SNO sites that are predicted to have none. In addition, the area under the ROC curve (AUC) is used to evaluate the model.
In predicting SNO sites, TP is the number of actual SNO sites predicted to be SNO sites, and TN is the number of non-SNO sites predicted to be non-SNO sites. FP is the number of non-SNO sites predicted to be SNO sites, and FN is the number of actual SNO sites predicted to be non-SNO sites.
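Eq (12) translates directly into code; this helper (name is ours) computes all four indicators from the confusion-matrix counts:

```python
import math

def metrics(tp, tn, fp, fn):
    """Eq (12): sensitivity, specificity, accuracy and Matthews correlation
    coefficient from the four confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fp + tn + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return sn, sp, acc, mcc
```

For instance, `metrics(tp=50, tn=40, fp=10, fn=0)` gives Sn = 1.0, Sp = 0.8 and ACC = 0.9.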
Random Forest [41] is an algorithm that integrates multiple trees through the idea of ensemble learning. Its basic unit is decision tree. As a highly flexible machine learning algorithm, Random Forest (RF) has been widely used in data analysis [42], biological information [43] and technological development [44].
Naive Bayes (NB) [45] is a simple and effective classifier, which is widely used in software defect prediction [46], medical diagnosis [47] and biological information [48]. NB is based on the Bayes theorem and the assumption of the conditional independence of features, which greatly reduces the complexity of the classification algorithm.
K Nearest Neighbor (KNN) [49] is one of the supervised machine learning algorithms, which is widely used in face recognition [50], disease research [51] and engineering applications [52]. Its main idea is to judge the category of the predicted value based on the category of the k points closest to the predicted value.
XGBoost [53,54] is an improved boosting algorithm based on GBDT [55]. It is an ensemble method that combines many basic models into a strong one. Because of its good predictive performance and high training efficiency, XGBoost has been widely used in data analysis.
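A compact way to compare such classifiers under 5-fold cross-validation, assuming scikit-learn is available, is shown below. Because XGBoost is a separate package, scikit-learn's `GradientBoostingClassifier` stands in for it in this sketch, and the synthetic data is illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

def compare_classifiers(X, y):
    """Mean 5-fold AUC for each of the four candidate classifiers."""
    models = {
        "RF": RandomForestClassifier(n_estimators=100, random_state=0),
        "NB": GaussianNB(),
        "KNN": KNeighborsClassifier(n_neighbors=5),
        "Boosting": GradientBoostingClassifier(random_state=0),
    }
    return {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
            for name, m in models.items()}
```

Running this on the actual feature matrices would reproduce the kind of comparison reported in Tables 3 and 6.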
In this research, three feature extraction methods, GO-KNN, BOW and PseAAC, were used to encode the protein sequences, yielding 10-D, 114-D and 25-D feature vectors, respectively. These three feature sets were fused into a 149-D feature vector, denoted ALL. The prediction results obtained with the different feature sets under 5-fold cross-validation are shown in Table 1.
| Feature | Acc (%) | Sn (%) | Sp (%) | MCC | AUC |
| --- | --- | --- | --- | --- | --- |
| GO-KNN | 82.72 | 48.36 | 92.37 | 0.4533 | 0.8521 |
| BOW | 78.83 | 13.04 | 97.30 | 0.1969 | 0.7359 |
| PseAAC | 79.38 | 19.43 | 96.22 | 0.2503 | 0.7616 |
| ALL | 83.77 | 49.49 | 93.40 | 0.4840 | 0.8593 |
Table 1 shows that different features yield different prediction results. Among the three single-feature methods, GO-KNN has the highest ACC, Sn, MCC and AUC, at 82.72%, 48.36%, 0.4533 and 0.8521, respectively. BOW has the lowest ACC, Sn, MCC and AUC, at 78.83%, 13.04%, 0.1969 and 0.7359, respectively, although its Sp of 97.30% is the highest. After combining the three features, ACC, Sn, Sp, MCC and AUC are 83.77%, 49.49%, 93.40%, 0.4840 and 0.8593, respectively; ACC, Sn, MCC and AUC are all higher than those produced by GO-KNN alone. The results show that multi-feature fusion improves several indicators. To better analyze the influence of the different features on SNO protein prediction, the results for the three features and their fusion are shown in Figure 3.
Figure 3 shows that the three features and their fusion affect the five evaluation indicators to different extents: they are less effective on Sn and MCC, and better on ACC, Sp and AUC. Comparing the four feature encodings, the fusion feature ALL improves ACC, Sn, MCC and AUC. Multi-feature fusion reflects sequence information more comprehensively and thereby improves predictive ability, so it is used to predict SNO proteins.
Here, the combination of SMOTE and RUS is denoted the SR balancer. We fed the data set into the model before and after balancing and obtained the ACC, Sn, Sp, MCC and AUC under 5-fold cross-validation, as shown in Table 2.
| | Acc (%) | Sn (%) | Sp (%) | MCC | AUC |
| --- | --- | --- | --- | --- | --- |
| Imbalanced | 83.77 | 49.49 | 93.40 | 0.4840 | 0.8593 |
| Balanced | 81.84 | 70.82 | 84.93 | 0.5178 | 0.8635 |
Table 2 shows that after balancing, Sn and Sp are much closer to each other, and Sn, MCC and AUC all improve. Balancing the data set is therefore necessary.
Classifiers play an important role in model prediction. This work used the four classifiers above to identify SNO proteins. The ACC, Sn, Sp, MCC and AUC of each classifier under 5-fold cross-validation are shown in Table 3, from which it can be seen that Random Forest performs best on every evaluation indicator. To better compare the classifiers, their prediction results are shown in Figure 4.
| Algorithm | Acc (%) | Sn (%) | Sp (%) | MCC | AUC |
| --- | --- | --- | --- | --- | --- |
| RF | 81.84 | 70.82 | 84.93 | 0.5178 | 0.8635 |
| NB | 63.81 | 78.37 | 59.73 | 0.3154 | 0.7710 |
| KNN | 71.97 | 83.44 | 68.75 | 0.4366 | 0.8360 |
| XGBoost | 80.73 | 70.07 | 83.72 | 0.4953 | 0.8553 |
The area under the ROC curve evaluates the predictive performance of a model. Figure 4 shows that the area under the ROC curve is largest when Random Forest is used as the classifier, so Random Forest is the best choice for the proposed model.
In this part of the study, two kinds of features, CKSAAP and TPC, were used, yielding 2205-dimensional and 9261-dimensional feature vectors on the basis of the algorithms above. To better reflect the information in the protein fragments, these features were fused into an 11,466-dimensional feature vector. The prediction results obtained with the different feature sets under 5-fold cross-validation are shown in Table 4.
| Feature | Acc (%) | Sn (%) | Sp (%) | MCC | AUC |
| --- | --- | --- | --- | --- | --- |
| CKSAAP | 73.97 | 83.67 | 64.27 | 0.4891 | 0.8036 |
| TPC | 71.38 | 66.07 | 76.74 | 0.4305 | 0.8069 |
| ALL | 75.36 | 86.39 | 64.31 | 0.5201 | 0.8196 |
Table 4 shows that the ACC, Sn and MCC of CKSAAP are higher than those of TPC, while TPC performs better than CKSAAP on Sp and AUC. After feature fusion, ACC, Sn, MCC and AUC are all higher than for either single feature. Feature fusion is therefore beneficial for this problem.
Multi-information fusion extracts protein sequence information more comprehensively, but it also introduces redundancy and noise. Dimensionality reduction not only retains the important features but also improves the computational efficiency of the model. In this paper, the elastic net was used to reduce the dimensionality of the fused feature set, yielding a 704-dimensional feature subset. The prediction results of Random Forest under 5-fold cross-validation are shown in Table 5.
| | Acc (%) | Sn (%) | Sp (%) | MCC | AUC |
| --- | --- | --- | --- | --- | --- |
| All | 75.36 | 86.39 | 64.31 | 0.5201 | 0.8196 |
| Elastic net | 76.02 | 85.68 | 66.33 | 0.5304 | 0.8260 |
With the features selected by the elastic net, every evaluation indicator except Sn improves. Moreover, because the feature dimension is greatly reduced, the efficiency of the model also improves significantly.
Four classifiers, Random Forest, Naive Bayes, K-Nearest Neighbor and XGBoost, were tested for predicting SNO sites. The results under 5-fold cross-validation are shown in Table 6, from which we can see that Naive Bayes and K-Nearest Neighbor perform relatively poorly, while Random Forest is best on every indicator except Sp. To evaluate the classifiers more comprehensively, their ROC curves are shown in Figure 5.
| Algorithm | Acc (%) | Sn (%) | Sp (%) | MCC | AUC |
| --- | --- | --- | --- | --- | --- |
| RF | 76.02 | 85.68 | 66.33 | 0.5304 | 0.8260 |
| NB | 69.74 | 79.46 | 59.98 | 0.4022 | 0.7605 |
| KNN | 63.63 | 46.39 | 81.00 | 0.2923 | 0.7246 |
| XGBoost | 72.88 | 74.37 | 71.40 | 0.4580 | 0.8015 |
Figure 5 clearly shows that the area under the ROC curve of Random Forest is the largest, so Random Forest was selected as the classifier of the proposed model.
To further evaluate the performance of this model, we compared it with the PreSNO and RecSNO models. The prediction results of the three methods on the same data set are shown in Table 7. The ACC, Sn and MCC of our model are the highest, and its Sp and AUC are also competitive. The performance of this model is therefore better than that of PreSNO and RecSNO.
| Method | Acc (%) | Sn (%) | Sp (%) | MCC | AUC |
| --- | --- | --- | --- | --- | --- |
| PreSNO | 70 | 54 | 86 | 0.42 | 0.84 |
| RecSNO | 72 | 79 | 66 | 0.45 | 0.79 |
| RF-SNOPS | 76.02 | 85.68 | 66.33 | 0.5304 | 0.8260 |
To identify SNO proteins, we used GO-KNN, BOW and PseAAC to extract sequence information: GO-KNN extracts nearest-neighbor information from protein GO annotations, while BOW and PseAAC extract sequence information from physical and chemical properties. We used the SR balancer to process the imbalanced data set and reduce the negative impact of imbalance on the model, and Random Forest to make the predictions. To predict SNO sites, CKSAAP and TPC were used to extract protein fragment information, and elastic nets were used to reduce the dimensionality of the fused features, improving computational efficiency and eliminating redundancy and noise. These procedures require only computational models, without any physical or chemical experiments, which saves experimental cost and improves efficiency. We hope this work will be helpful for solving biological problems with computational methods.
This work was supported by grants from the National Natural Science Foundation of China (Nos. 31760315, 62162032, 61761023) and the Natural Science Foundation of Jiangxi Province, China (No. 20202BAB202007).
The authors have declared that no competing interest exists.
| Feature | Acc (%) | Sn (%) | Sp (%) | MCC | AUC |
|---------|---------|--------|--------|-----|-----|
| GO-KNN | 82.72 | 48.36 | 92.37 | 0.4533 | 0.8521 |
| BOW | 78.83 | 13.04 | 97.30 | 0.1969 | 0.7359 |
| PseAAC | 79.38 | 19.43 | 96.22 | 0.2503 | 0.7616 |
| ALL | 83.77 | 49.49 | 93.40 | 0.4840 | 0.8593 |

| Dataset | Acc (%) | Sn (%) | Sp (%) | MCC | AUC |
|---------|---------|--------|--------|-----|-----|
| Imbalance | 83.77 | 49.49 | 93.40 | 0.4840 | 0.8593 |
| Balance | 81.84 | 70.82 | 84.93 | 0.5178 | 0.8635 |

| Algorithms | Acc (%) | Sn (%) | Sp (%) | MCC | AUC |
|------------|---------|--------|--------|-----|-----|
| RF | 81.84 | 70.82 | 84.93 | 0.5178 | 0.8635 |
| NB | 63.81 | 78.37 | 59.73 | 0.3154 | 0.7710 |
| KNN | 71.97 | 83.44 | 68.75 | 0.4366 | 0.8360 |
| XGBoost | 80.73 | 70.07 | 83.72 | 0.4953 | 0.8553 |

| Feature | Acc (%) | Sn (%) | Sp (%) | MCC | AUC |
|---------|---------|--------|--------|-----|-----|
| CKSAAP | 73.97 | 83.67 | 64.27 | 0.4891 | 0.8036 |
| TPC | 71.38 | 66.07 | 76.74 | 0.4305 | 0.8069 |
| ALL | 75.36 | 86.39 | 64.31 | 0.5201 | 0.8196 |

| Feature set | Acc (%) | Sn (%) | Sp (%) | MCC | AUC |
|-------------|---------|--------|--------|-----|-----|
| All | 75.36 | 86.39 | 64.31 | 0.5201 | 0.8196 |
| Elastic net | 76.02 | 85.68 | 66.33 | 0.5304 | 0.8260 |

| Algorithms | Acc (%) | Sn (%) | Sp (%) | MCC | AUC |
|------------|---------|--------|--------|-----|-----|
| RF | 76.02 | 85.68 | 66.33 | 0.5304 | 0.8260 |
| NB | 69.74 | 79.46 | 59.98 | 0.4022 | 0.7605 |
| KNN | 63.63 | 46.39 | 81.00 | 0.2923 | 0.7246 |
| XGBoost | 72.88 | 74.37 | 71.40 | 0.4580 | 0.8015 |

| Predictor | Acc (%) | Sn (%) | Sp (%) | MCC | AUC |
|-----------|---------|--------|--------|-----|-----|
| PreSNO | 70 | 54 | 86 | 0.42 | 0.84 |
| RecSNO | 72 | 79 | 66 | 0.45 | 0.79 |
| RF-SNOPS | 76.02 | 85.68 | 66.33 | 0.5304 | 0.8260 |