
In the fight against the COVID-19 pandemic, China has long adhered to the "Dynamic Zero COVID-19" strategy till the end of 2022. To understand the mechanism of this strategy, we used the case of the Yangzhou summer outbreak in 2021 and a multi-stage dynamical model incorporating city-wide and key area testing-trace-isolation (TTI) strategies. We defined two time-varying indexes for measuring the disease transmission risk and the public health prevention and control force, respectively, which allowed us to explore the mechanisms of TTI policies. Integrating with the historical data and literature parameter values, we first estimated the parameters and then quantified the relevant indexes over time. The findings showed that multiple rounds of rapid testing were one of the critical measures to overcome the outbreak in Yangzhou within one month. In addition, we compared the impact of the duration of the free transmission stage, tracking rate, testing interval and precise division of key areas on the epidemiological indicators, including the final sizes of infections and isolations, peak value, peak arrival time and epidemic duration and the minimum round of testing. Our results suggest that the early detection of the epidemic, an improved efficiency of tracking, and a reduced duration of each test play a positive role in restraining COVID-19; however, a considerable investment of resources was essential to achieve a significant effect quickly.
Citation: Juan Li, Wendi Bao, Xianghong Zhang, Yongzhong Song, Zhigui Lin, Huaiping Zhu. Modelling the transmission and control of COVID-19 in Yangzhou city with the implementation of Zero-COVID policy[J]. Mathematical Biosciences and Engineering, 2023, 20(9): 15781-15808. doi: 10.3934/mbe.2023703
Various chemical modifications, including cytosine modification, uridine isomerization, and adenosine methylation, have been found in cellular RNA [1] and have been linked to important biological and physiological functions in cells [2]. Pseudouridine (Ψ) is a common post-transcriptional RNA modification known as the fifth base of RNA [3]. It is present in a wide variety of species, and research has revealed that tRNA and rRNA contain large amounts of it [4]. Ψ has been shown to be crucial in numerous biological processes, and Ψ modifications at different positions serve different purposes [5,6,7]. Therefore, identifying Ψ sites in RNA sequences is important for both fundamental and applied biological research.
Initially, researchers identified Ψ modification sites through biochemical experiments. Paper chromatography was first used to find Ψ modification sites in yeast RNA: the RNA was decomposed with RNA-degrading enzymes, and the resulting fragments were separated by electrophoresis and chromatography on paper [3,4,5,6,7,8]. Later, researchers used high-performance liquid chromatography and mass spectrometry to detect modification sites [9]. With growing interest in this field, a variety of high-throughput sequencing technologies were proposed, including Ψ-seq [10,11], PseudoU-seq [12] and CeU-Seq [13], and were successfully used to detect Ψ sites. However, these methods rely on time-consuming, expensive, and difficult biochemical experiments that are susceptible to environmental factors, and sequencing becomes increasingly difficult as the sequence length increases. Therefore, robust, fast, and inexpensive computational methods are needed to predict Ψ sites in RNA sequences.
First, Panwar and colleagues proposed the tRNAmod model to predict Ψ sites in tRNA [14]. Then, a web server (PPUS) based on the support vector machine (SVM) was proposed by Li et al. to identify Ψ sites in S.cerevisiae and H.sapiens [15]. In the iRNA-PseU model created by Chen et al., the nucleotide frequency composition and the pseudo K-tuple nucleotide composition (PseKNC) were merged for feature representation [16]. Subsequently, He et al. developed an SVM model (PseUI) to identify Ψ sites in H.sapiens, S.cerevisiae and M.musculus, which combined a variety of feature extraction techniques including position-specific dinucleotide propensity (PSDP) [17]. Later, Tahir et al. created a predictor based on convolutional neural networks (iPseU-CNN) [18]. Liu et al. used extreme gradient boosting (XGBoost) to create a model known as XG-PseU [19]. Lv et al. proposed a method called RF-PseU, which uses the LightGBM algorithm for feature selection and the random forest algorithm for classification [20]. Saad et al. proposed a convolutional neural network model, MU-PseUDeep [21], which combines sequence and secondary structure features to predict Ψ sites. Li et al. built the model Porpoise by feeding multiple types of features into a stacked ensemble learning framework [22]. Although these techniques have proven successful in identifying Ψ sites in RNA sequences, their performance still leaves room for improvement when compared with high-performance predictors [23,24,25,26,27,28].
In this study, we build a Ψ site identification model (iPseU-TWSVM) based on the twin support vector machine (TWSVM); Figure 1 depicts the model construction process. The model combines multiple feature representation methods, including Kmer, ENAC and EIIP. The minimum redundancy maximum relevance (mRMR) approach is used to obtain the best subset of features. The model is then evaluated using 10-fold cross-validation (10-CV) and independent testing (Ⅰ-testing). The average Ⅰ-testing accuracy of iPseU-TWSVM is 3.4% higher than that of current advanced predictors, demonstrating the better generalization performance of our model. Therefore, iPseU-TWSVM may become an effective tool for Ψ site identification.
In this work, we train and evaluate our models using datasets created by Chen et al. [29]. The steps of constructing the benchmark dataset were as follows: 1307 positive samples and 33,280 negative samples were obtained at first, and then the subset-balancing treatment was adopted to reduce the number of negative samples according to Euclidean distance. The obtained distance values were sorted in ascending order, and the first 1307 negative samples were selected to form the negative subset. The training datasets contained data from three species, namely, H.sapiens, S.cerevisiae and M.musculus. The H.sapiens training dataset included 495 positive samples and 495 negative samples; the S.cerevisiae dataset included 314 positive samples and 314 negative samples; and the M.musculus dataset included 944 samples, half of which were positive samples. There were just two species in the Ⅰ-testing datasets: H.sapiens and S.cerevisiae. Each of them included 200 samples, of which only half were positive and half were negative.
Different types of features reflect biological significance from different perspectives, including sequence composition and physicochemical properties. In this work, a variety of types of features are used to comprehensively consider the composition, distribution and physicochemical properties of nucleotides in the sequence from various aspects to further improve the prediction performance of subsequent work.
One effective technique for extracting RNA sequence characteristics is Kmer, which reflects the frequency of k adjacent nucleotides in the sequence. The frequencies of the k-neighboring nucleotides are used to generate the feature vector [30]. The method is provided by the web server Pse-in-One2.0 (http://bioinformatics.hitsz.edu.cn/Pse-in-One2.0/) [31].
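As a minimal sketch of the Kmer idea described above, the following pure-Python function counts overlapping k-mers and normalizes by the number of windows. The function name `kmer_frequencies`, the lexicographic ordering of the output and the choice to skip windows containing unexpected symbols are illustrative assumptions, not details taken from Pse-in-One2.0.

```python
from itertools import product


def kmer_frequencies(seq, k=2, alphabet="ACGU"):
    """Return normalized frequencies of all k-mers, in lexicographic order.

    Illustrative sketch only; not the Pse-in-One2.0 implementation.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1  # number of overlapping windows
    if total <= 0:
        return [0.0] * len(kmers)
    for i in range(total):
        window = seq[i:i + k]
        if window in counts:  # assumption: ignore windows with unknown symbols
            counts[window] += 1
    return [counts[km] / total for km in kmers]


vec = kmer_frequencies("ACGUACGU", k=2)
print(len(vec))  # 4**2 = 16 features for k = 2
```

For k = 2 the vector has 16 entries and for k = 3 it has 64, so the dimension grows as 4^k.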
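The PC-PseDNC-General approach offers 22 different physicochemical properties to create the pseudo-dinucleotide composition [32,33,34]. It encodes both the local and the global sequence order information into the feature vector. The resulting features are expressed as

$$\text{Vector} = (m_1, m_2, \cdots, m_{16}, m_{16+1}, \cdots, m_{16+\lambda})^{T} \tag{1}$$

with

$$m_i = \begin{cases} \dfrac{q_i}{\sum_{j=1}^{16} q_j + \alpha \sum_{k=1}^{\lambda} \rho_k}, & 1 \le i \le 16 \\[2ex] \dfrac{\alpha \rho_{i-16}}{\sum_{j=1}^{16} q_j + \alpha \sum_{k=1}^{\lambda} \rho_k}, & 16+1 \le i \le 16+\lambda \end{cases} \tag{2}$$

where $q_i$ $(i = 1, 2, \cdots, 16)$ is the normalized frequency of occurrence of the $i$th of the 16 dinucleotides; $\alpha$ $(0 \le \alpha \le 1)$ is the weight factor; $\lambda$ is the highest counted rank; and $\rho_k$ is the $k$-tier correlation factor.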
$\rho_k$ reflects the sequence-order correlation between all dinucleotides that are $k$ positions apart along a given RNA sequence of length $l$, and can be written as

$$\rho_k = \frac{1}{l-k-1} \sum_{j=1}^{l-k-1} C(R_j, R_{j+k}) \quad (k = 1, 2, \cdots, \lambda;\ \lambda < l-1) \tag{3}$$

where $C(R_j, R_{j+k})$ is the correlation function

$$C(R_j, R_{j+k}) = \frac{1}{\sigma} \sum_{g=1}^{\sigma} \left[ P_g(R_j) - P_g(R_{j+k}) \right]^2 \tag{4}$$

where $\sigma$ is the number of physicochemical properties considered, and $P_g(R_j)$ and $P_g(R_{j+k})$ are the values of the $g$th property for the dinucleotide $R_j$ at position $j$ and the dinucleotide $R_{j+k}$ at position $j+k$, respectively.
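The pseudo-dinucleotide composition in Eqs (1)–(4) can be sketched in a few lines of Python. The property table `TOY_PROPS` below is entirely hypothetical (two made-up values per dinucleotide) and stands in for the 22 real physicochemical properties, which are not listed in this text; the function names are also illustrative.

```python
from itertools import product

DINUCS = ["".join(p) for p in product("ACGU", repeat=2)]

# HYPOTHETICAL property table: two made-up "properties" per dinucleotide.
# Real PC-PseDNC-General uses measured physicochemical values instead.
TOY_PROPS = {d: (i / 15.0, (i % 4) / 3.0) for i, d in enumerate(DINUCS)}


def correlation(d1, d2, props):
    """Eq (4): mean squared difference over the sigma properties."""
    p1, p2 = props[d1], props[d2]
    return sum((a - b) ** 2 for a, b in zip(p1, p2)) / len(p1)


def rho(seq, k, props):
    """Eq (3): k-tier correlation factor for a sequence of length l."""
    dinucs = [seq[j:j + 2] for j in range(len(seq) - 1)]
    n_terms = len(seq) - k - 1
    return sum(correlation(dinucs[j], dinucs[j + k], props)
               for j in range(n_terms)) / n_terms


def pse_dnc(seq, lam, alpha, props):
    """Eqs (1)-(2): 16 dinucleotide terms followed by lam correlation terms."""
    dinucs = [seq[j:j + 2] for j in range(len(seq) - 1)]
    q = [dinucs.count(d) / len(dinucs) for d in DINUCS]
    rhos = [rho(seq, k, props) for k in range(1, lam + 1)]
    denom = sum(q) + alpha * sum(rhos)
    return [x / denom for x in q] + [alpha * r / denom for r in rhos]


features = pse_dnc("ACGUACGUAC", lam=2, alpha=0.5, props=TOY_PROPS)
print(len(features))  # 16 + lambda
```

Because every component shares the same denominator, the components of the vector sum to 1, which gives a quick sanity check on any implementation.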
The nucleotide chemical property (NCP) coding method reflects the fact that each nucleotide in the sequence has a different chemical structure and different binding properties. The four RNA nucleotides (A, C, G, U) differ from one another in ring structure, hydrogen bonding, and functional group. Based on these differences, each nucleotide can be represented by a three-dimensional coordinate [35].
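The accumulated nucleotide frequency (ANF) method incorporates both the distribution of each nucleotide in the RNA sequence and its frequency [35]. For each position $i$, we calculate the density $d_i$ of the nucleotide $x_i$ within the $i$th prefix subsequence. It is defined as

$$d_i = \frac{1}{i} \sum_{j=1}^{i} f(x_j), \quad \text{where } f(x_j) = \begin{cases} 1, & \text{if } x_j = x_i \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

where $i$ is the length of the prefix subsequence and $x_j$ represents the nucleotide at the $j$th position.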
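Eq (5) amounts to a running count: at each position, divide the number of times the current nucleotide has appeared so far by the prefix length. A minimal sketch follows; the function name `anf_density` is an illustrative choice.

```python
def anf_density(seq):
    """Return d_i for every position i (1-based) following Eq (5)."""
    counts = {}
    densities = []
    for i, nt in enumerate(seq, start=1):
        counts[nt] = counts.get(nt, 0) + 1  # occurrences of x_i in the prefix
        densities.append(counts[nt] / i)
    return densities


print(anf_density("AACA"))
```

For the toy sequence `AACA`, the third position holds the first `C` in a prefix of length 3, so its density is 1/3, while the fourth position holds the third `A` in a prefix of length 4, giving 3/4.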
The electron-ion interaction pseudopotential (EIIP) values represent the energy of the delocalized electrons in each nucleotide. The nucleotides in DNA sequences have previously been represented by the EIIP values of A, G, C and T [36]. In the RF-PseU method [20], each nucleotide in an RNA sequence was likewise encoded by its EIIP value.
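A short sketch of EIIP encoding is given here. The numeric values are the ones commonly quoted in the literature for A, C, G and T; assigning the T value to U for RNA is an assumption made for this illustration, and the exact values used in RF-PseU should be checked against that paper.

```python
# Commonly quoted EIIP values for DNA nucleotides.
# ASSUMPTION: U in RNA is given the value usually listed for T.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "U": 0.1335}


def eiip_encode(seq):
    """Map each nucleotide to its EIIP value (one feature per position)."""
    return [EIIP[nt] for nt in seq]


print(eiip_encode("ACGU"))
```

The encoding yields one feature per position, so a sequence of length L gives an L-dimensional vector.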
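The enhanced nucleic acid composition (ENAC) approach determines the nucleotide composition within a fixed-length window that slides along the sequence [20,35]. In this way, RNA sequences are converted into feature vectors of equal length. The dimension of the ENAC coding depends on two factors: the sequence length and the sliding window size.

$$E = (b_1, b_2, \cdots, b_n), \quad \text{where } b_i = \frac{N_i}{N},\ i \in \{A, C, G, U\} \tag{6}$$

where $N_i$ is the number of occurrences of nucleotide $i$ in the current window, $N$ is the sliding window size and $n$ is the coding dimension.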
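The sliding-window composition in Eq (6) can be sketched as follows. The default window size of 5 and the function name `enac` are assumptions for illustration; the window size used in this study is not stated in the text.

```python
def enac(seq, window=5, alphabet="ACGU"):
    """Per-window nucleotide frequencies N_i / N, concatenated over all windows."""
    features = []
    for start in range(len(seq) - window + 1):
        win = seq[start:start + window]
        features.extend(win.count(nt) / window for nt in alphabet)
    return features


print(enac("AACGU", window=5))
```

With this definition, a sequence of length L and window size N produces (L - N + 1) × 4 features.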
Nucleotide binary profiles (NBP) provide the position-specific composition of nucleotides in RNA fragments [35,36]. Each nucleotide is encoded by a four-digit binary vector. Dibinary profiles differ from binary profiles in that the four-digit code represents one of the 16 dinucleotides, e.g., AA is denoted by (0, 0, 0, 0).
mRMR [37] is a commonly used feature selection method for compressing the feature vector space. Its goal is to identify a subset of the initial feature set whose features have the lowest correlation with one another and the highest correlation with the output. It therefore considers both the relationship among features and the association between features and labels. The mechanism of feature selection is as follows.
First, mutual information is used to find a feature subset $S$ containing $m$ features that have the maximum correlation with the category $c$. The correlation between the feature subset $S$ and the category $c$ is defined as the average mutual information between each feature and the category, as shown in (7).

$$\max D(S, c), \quad D = \frac{1}{|S|} \sum_{x_i \in S} I(x_i; c) \tag{7}$$

where $I(x_i; c)$ is the mutual information, $S$ is a feature subset of size $m$, $x_i$ is the $i$th feature in $S$ and $c$ is the category variable.
The features selected by maximum correlation alone may be redundant, so (8) is used to reduce the redundancy among the $m$ features.

$$\min R(S), \quad R = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i; x_j) \tag{8}$$

The final feature subset $S$ is obtained by combining the maximum correlation $D$ with the minimum redundancy $R$.

$$\text{mRMR} = \max \left[ \frac{1}{|S|} \sum_{x_i \in S} I(x_i; c) - \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i; x_j) \right] \tag{9}$$

Compared with other feature selection methods, mRMR considers the redundancy among features, further optimizes the feature subset, and avoids the difficulty of computing the maximum dependency directly. However, only approximately optimal solutions can be obtained in practical applications.
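Since the exact optimum of Eq (9) is hard to reach, mRMR is usually applied as a greedy, incremental search: at each step the feature with the largest relevance minus average redundancy to the already selected features is added. The sketch below assumes discrete feature values and uses a plug-in estimate of mutual information; the toy data, the feature names `f1`–`f3` and the greedy formulation are illustrative assumptions rather than the exact procedure used in this study.

```python
from collections import Counter
from math import log


def mutual_information(x, y):
    """Plug-in estimate of I(x; y) in nats for two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())


def mrmr(features, labels, m):
    """Greedy mRMR: features maps a name to a list of discrete values."""
    relevance = {f: mutual_information(v, labels) for f, v in features.items()}
    selected, remaining = [], list(features)

    def score(f):
        if not selected:
            return relevance[f]
        redundancy = sum(mutual_information(features[f], features[s])
                         for s in selected) / len(selected)
        return relevance[f] - redundancy  # relevance minus redundancy

    while remaining and len(selected) < m:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy = {
    "f1": [0, 0, 0, 0, 1, 1, 1, 0],  # informative
    "f2": [0, 0, 0, 0, 1, 1, 1, 0],  # exact copy of f1, fully redundant
    "f3": [0, 1, 0, 0, 1, 1, 0, 1],  # weaker but complementary
}
print(mrmr(toy, labels, m=2))
```

In this toy example the duplicate `f2` is as relevant as `f1`, yet it is skipped in the second step because its redundancy with `f1` outweighs its relevance.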
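Consider a binary classification problem with the training dataset

$$D_{train} = \{(u_1, 1), (u_2, 1), \cdots, (u_m, 1), (u_{m+1}, -1), (u_{m+2}, -1), \cdots, (u_{m+n}, -1)\}, \tag{10}$$

where $u_i \in R^n$, $i = 1, 2, \cdots, m+n$.
Let $T = (u_1, u_2, \cdots, u_m)^T \in R^{m \times n}$, $F = (u_{m+1}, u_{m+2}, \cdots, u_{m+n})^T \in R^{n \times n}$ and $l = m + n$.
In the linear case, TWSVM [38] looks for a pair of nonparallel hyperplanes

$$w_+ u + b_+ = 0 \quad \text{and} \quad w_- u + b_- = 0 \tag{11}$$

where $w_+ \in R^n$, $w_- \in R^n$, $b_+ \in R$, $b_- \in R$, by solving the following pair of quadratic programming problems (QPPs):

$$\min_{w_+, b_+, \xi_-} \frac{1}{2} (T w_+ + e_+ b_+)^T (T w_+ + e_+ b_+) + c_1 e_-^T \xi_- \quad \text{s.t.} \quad -(F w_+ + e_- b_+) + \xi_- \ge e_-,\ \xi_- \ge 0 \tag{12}$$

and

$$\min_{w_-, b_-, \xi_+} \frac{1}{2} (F w_- + e_- b_-)^T (F w_- + e_- b_-) + c_2 e_+^T \xi_+ \quad \text{s.t.} \quad (T w_- + e_+ b_-) + \xi_+ \ge e_+,\ \xi_+ \ge 0 \tag{13}$$

where $c_1, c_2$ are the penalty parameters; $e_+, e_-$ are all-ones vectors ($e_+, e_- = [1 \cdots 1]^T$) whose dimensions equal the numbers of positive and negative samples, respectively; and $\xi_+, \xi_-$ are slack vectors of appropriate dimension.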
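Minimizing the objective function brings each hyperplane as close as possible to one class of data, while the constraint requires the other class to lie at a distance of at least 1 from that hyperplane, up to the slack. The corresponding Lagrange dual problems are

$$\max_{\alpha}\ e_-^T \alpha - \frac{1}{2} \alpha^T J (K^T K)^{-1} J^T \alpha \quad \text{s.t.} \quad 0 \le \alpha \le c_1 e_- \tag{14}$$

and

$$\max_{\gamma}\ e_+^T \gamma - \frac{1}{2} \gamma^T K (J^T J)^{-1} K^T \gamma \quad \text{s.t.} \quad 0 \le \gamma \le c_2 e_+ \tag{15}$$

where

$$K = [T \ \ e_+] \in R^{m \times (n+1)}, \quad J = [F \ \ e_-] \in R^{n \times (n+1)}. \tag{16}$$

The solution to the primal problems is recovered from the solutions of the dual problems. Setting the gradient of each Lagrangian with respect to $(w^T, b)^T$ to zero gives

$$(w_+^T, b_+)^T = -(K^T K)^{-1} J^T \alpha, \tag{17}$$

$$(w_-^T, b_-)^T = (J^T J)^{-1} K^T \gamma. \tag{18}$$

The sign in (18) is positive because the constraint in (13) enters the Lagrangian with the opposite sign to the constraint in (12). An unknown point $u \in R^n$ is then assigned to the class

$$\text{Class} = \arg\min_{s = -, +} |w_s u + b_s|, \tag{19}$$

where $|\cdot|$ is the perpendicular distance of the point $u$ from the planes $w_s u + b_s = 0$, $s = -, +$.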
This method divides one large quadratic programming problem into two smaller ones, which improves the training speed, and it is also relatively insensitive to noise.
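The decision rule in Eq (19) can be illustrated without a QP solver once the two hyperplanes are known. In the sketch below the plane parameters are hypothetical hand-picked values, not trained ones, and the optional `normalize` flag (dividing by the norm of w to obtain a true perpendicular distance) is an addition for illustration that is not part of Eq (19) as written.

```python
from math import sqrt


def plane_distance(w, b, u, normalize=False):
    """|w . u + b|, optionally divided by ||w|| (illustrative option)."""
    value = abs(sum(wi * ui for wi, ui in zip(w, u)) + b)
    if normalize:
        value /= sqrt(sum(wi * wi for wi in w))
    return value


def twsvm_predict(u, plane_pos, plane_neg, normalize=False):
    """Return +1 if u is closer to the positive-class plane, otherwise -1."""
    d_pos = plane_distance(plane_pos[0], plane_pos[1], u, normalize)
    d_neg = plane_distance(plane_neg[0], plane_neg[1], u, normalize)
    return 1 if d_pos <= d_neg else -1


# HYPOTHETICAL planes: x1 = 1 for the positive class, x2 = -1 for the negative.
plane_pos = ([1.0, 0.0], -1.0)
plane_neg = ([0.0, 1.0], 1.0)
print(twsvm_predict([1.1, 3.0], plane_pos, plane_neg))
print(twsvm_predict([5.0, -1.2], plane_pos, plane_neg))
```

Training would supply the plane parameters by solving the two dual problems, each of which involves only the samples of one class in its constraints, which is where the speed advantage comes from.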
Five widely used indicators were adopted to assess the performance of the built models [39,40,41]: accuracy (ACC), sensitivity (SN), specificity (SP), the Matthews correlation coefficient (MCC), and the area under the receiver operating characteristic curve (auROC, reported as AUC in the tables). The first four were calculated using the following equations.

$$ACC = \frac{TP + TN}{TP + TN + FP + FN} \tag{20}$$

$$SN = \frac{TP}{FN + TP} \tag{21}$$

$$SP = \frac{TN}{TN + FP} \tag{22}$$

$$MCC = \frac{TN \times TP - FN \times FP}{\sqrt{(TP + FP) \times (TP + FN) \times (TN + FN) \times (TN + FP)}} \tag{23}$$

where TP, TN, FP, and FN represent the numbers of true positives, true negatives, false positives and false negatives, respectively.
We use 10-CV for comparison [42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62]. The training datasets are divided equally into ten subsets. In each round, the model is trained on nine subsets and tested on the remaining one. The procedure is repeated ten times so that each subset is tested exactly once, and the average of the ten results represents the final performance. Finally, Ⅰ-testing is performed on the testing datasets to evaluate the trained model.
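The 10-CV procedure can be sketched with a small index generator. The shuffling seed, the function names and the `fit_and_score` callback (which stands in for training and evaluating a classifier) are hypothetical choices for illustration.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # near-equal fold sizes
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def cross_validate(n_samples, fit_and_score, k=10, seed=0):
    """Average the score returned by fit_and_score(train, test) over k folds."""
    scores = [fit_and_score(train, test)
              for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)


splits = list(k_fold_indices(990, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

For a training set of 990 samples, each round trains on 891 samples and tests on 99, and every sample is tested exactly once.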
Feature extraction affects the results of subsequent sequence classification. To obtain better performance, this paper studied seven different features: Kmer, PC-PseDNC-General, ANF, EIIP, ENAC, NCP and NBP, the last five as used in the RF-PseU method [20]. These features were first used in the experiment separately, and then multiple features were combined according to the test results to obtain better experimental results.
Table 1 lists the results of the feature combinations for the H_990 dataset using the TWSVM method. The first six rows give the performance of single features; Kmer, EIIP, ENAC and NCP + NBP returned the best ACC values, which lie roughly in the range of 0.57–0.59. Since the results of single features were lower than those of the RF-PseU predictor, we combined several features that performed well and used the mRMR method to select the best features for model construction. For the H_990 dataset, Table 1 lists four feature combinations: Kmer + ENAC, Kmer + EIIP, Kmer + NCP + NBP and Kmer + PC-PseDNC-General + ANF + EIIP + ENAC + NCP + NBP. Using the TWSVM method, the ACC of combined features was usually approximately 1–3% higher than that of single features. The best feature combination was Kmer + PC-PseDNC-General + ANF + EIIP + ENAC + NCP + NBP, with a maximum ACC of 0.65, which is 0.7% higher than that of the RF-PseU predictor. We chose this feature combination for the 10-CV of the training set H_990 and applied it to the Ⅰ-testing of the testing set H_200.
Feature subset (TWSVM) | ACC | MCC | SN | SP | AUC |
Kmer | 0.59 | 0.181 | 0.533 | 0.646 | 0.618 |
PC-PseDNC-General | 0.534 | 0.07 | 0.434 | 0.635 | 0.543 |
ANF | 0.526 | 0.053 | 0.568 | 0.485 | 0.518 |
EIIP | 0.572 | 0.144 | 0.525 | 0.618 | 0.6 |
ENAC | 0.587 | 0.178 | 0.472 | 0.7 | 0.59 |
NCP + NBP | 0.584 | 0.172 | 0.53 | 0.639 | 0.582 |
Kmer + NCP + NBP | 0.59 | 0.182 | 0.532 | 0.648 | 0.587 |
Kmer + ENAC | 0.603 | 0.208 | 0.582 | 0.624 | 0.62 |
Kmer + EIIP | 0.606 | 0.212 | 0.585 | 0.626 | 0.625 |
Kmer + PC-PseDNC-General + ANF + EIIP + ENAC + NCP + NBP | 0.65 | 0.301 | 0.697 | 0.602 | 0.682 |
Table 2 lists the test results of different feature combinations on the S_628 dataset using the TWSVM method. Similar to the results in Table 1, Kmer, EIIP, ENAC and NCP + NBP gave the better single-feature results, so we combined these features. Except for Kmer + NCP + NBP, the combined features improved on the results of the corresponding single features. Among them, the feature combination Kmer + PC-PseDNC-General + ANF + EIIP + ENAC + NCP + NBP performed best, with an ACC approximately 6% higher than that of the other combinations. We also used the independent test set S_200 to test this feature combination.
Feature subset (TWSVM) | ACC | MCC | SN | SP | AUC |
Kmer | 0.627 | 0.259 | 0.625 | 0.63 | 0.67 |
PC-PseDNC-General | 0.574 | 0.153 | 0.666 | 0.483 | 0.589 |
ANF | 0.584 | 0.175 | 0.716 | 0.452 | 0.584 |
EIIP | 0.614 | 0.238 | 0.473 | 0.755 | 0.632 |
ENAC | 0.634 | 0.326 | 0.435 | 0.836 | 0.693 |
NCP + NBP | 0.669 | 0.34 | 0.659 | 0.678 | 0.711 |
Kmer + NCP + NBP | 0.664 | 0.33 | 0.653 | 0.675 | 0.683 |
Kmer + ENAC | 0.653 | 0.308 | 0.631 | 0.675 | 0.683 |
Kmer + EIIP | 0.631 | 0.263 | 0.628 | 0.634 | 0.662 |
Kmer + PC-PseDNC-General + ANF + EIIP + ENAC + NCP + NBP | 0.722 | 0.45 | 0.656 | 0.786 | 0.758 |
Table 3 shows the test results for the dataset M_944. The feature combination Kmer + PC-PseDNC-General + ANF + EIIP + ENAC + NCP + NBP had the best performance, with higher ACC, MCC and AUC than the other feature combinations. Compared with the next best combination, its ACC increased by approximately 5%, its MCC by approximately 10%, and its AUC was also clearly improved.
Feature subset (TWSVM) | ACC | MCC | SN | SP | AUC |
Kmer | 0.584 | 0.17 | 0.676 | 0.49 | 0.614 |
PC-PseDNC-General | 0.541 | 0.084 | 0.517 | 0.566 | 0.539 |
ANF | 0.553 | 0.113 | 0.71 | 0.396 | 0.527 |
EIIP | 0.625 | 0.276 | 0.826 | 0.424 | 0.632 |
ENAC | 0.664 | 0.329 | 0.667 | 0.661 | 0.7 |
NCP + NBP | 0.662 | 0.326 | 0.636 | 0.688 | 0.703 |
Kmer + NCP + NBP | 0.677 | 0.36 | 0.623 | 0.731 | 0.73 |
Kmer + ENAC | 0.662 | 0.326 | 0.657 | 0.667 | 0.704 |
Kmer + EIIP | 0.636 | 0.274 | 0.697 | 0.574 | 0.667 |
Kmer + PC-PseDNC-General + ANF + EIIP + ENAC + NCP + NBP | 0.728 | 0.462 | 0.795 | 0.661 | 0.775 |
Since feature selection is a crucial component of model construction, we compared the mRMR approach with the LightGBM method [64]. Figure 2 shows their accuracy on the training datasets of the three species. On all three species, the accuracy of the mRMR method is higher than that of the LightGBM approach, so mRMR was adopted to further enhance the classification accuracy of the model.
Feature selection can increase the accuracy of classification results. We first used the mRMR technique to pick feature subsets with high correlation with the class labels and low redundancy among features. To find the feature dimension with the best accuracy, the incremental feature selection approach was then applied. After many experiments, we found that the accuracy of Ⅰ-testing and 10-CV fluctuates as the number of features rises, and the highest accuracy mostly appeared within the first 100 to 150 dimensions, as illustrated in Figure 3. The accuracy of each species initially increases rapidly as the feature dimension increases, and then fluctuates continuously. For H.sapiens, the highest 10-CV accuracy of 0.65 was obtained at a feature dimension of 33, while the highest independent test accuracy of 0.763 was also obtained at a relatively low dimension. For S. cerevisiae, the highest 10-CV accuracy of 0.722 and the highest independent test accuracy of 0.825 were obtained at 72 and 76 dimensions, respectively. For M. musculus, only 10-CV results are available, with the highest value at a feature dimension of 62.
Tables 4–6 show the changes in feature dimensions after feature selection and the distribution of the optimized feature subsets for the three species. ENAC and EIIP account for the largest share of the optimized feature subsets of the three species, followed by NCP and NBP. ANF and Kmer contribute only a few features, and no PC-PseDNC-General feature remains in the optimized subsets. This indicates that the features contribute differently to the model, with ENAC and EIIP playing an important role.
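Incremental feature selection can be sketched as a loop over the ranked feature list: evaluate the top-1, top-2, ... subsets and keep the dimension with the best score. The `evaluate` callback is a hypothetical stand-in for a 10-CV run of the classifier, and the toy scoring function below is made up purely to exercise the loop.

```python
def incremental_feature_selection(ranked_features, evaluate, max_dim=None):
    """Return (best_k, best_score, curve) over the top-k ranked features."""
    limit = len(ranked_features)
    if max_dim is not None:
        limit = min(limit, max_dim)
    best_k, best_score, curve = 0, float("-inf"), []
    for k in range(1, limit + 1):
        score = evaluate(ranked_features[:k])  # e.g. 10-CV accuracy
        curve.append(score)
        if score > best_score:  # keep the smallest k on ties
            best_k, best_score = k, score
    return best_k, best_score, curve


# HYPOTHETICAL scoring function that peaks at three features.
ranked = ["f%d" % i for i in range(1, 9)]
toy_score = lambda subset: 1.0 - 0.1 * abs(len(subset) - 3)
best_k, best_score, curve = incremental_feature_selection(ranked, toy_score)
print(best_k, best_score)
```

The returned `curve` corresponds to the accuracy-versus-dimension plot described for Figure 3, and `best_k` to the dimension at which the highest accuracy is reached.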
Feature | The original dimension | The dimension after feature selection |
NCP | 63 | 14 |
EIIP | 21 | 21 |
NBP | 84 | 15 |
ENAC | 105 | 41 |
ANF | 21 | 8 |
Kmer | 64 | 1 |
PC-PseDNC-General | 22 | 0 |
Feature | The original dimension | The dimension after feature selection |
NCP | 93 | 22 |
EIIP | 31 | 31 |
NBP | 124 | 25 |
ENAC | 155 | 53 |
ANF | 31 | 19 |
Kmer | 256 | 0 |
PC-PseDNC-General | 18 | 0 |
Feature | The original dimension | The dimension after feature selection |
NCP | 63 | 11 |
EIIP | 21 | 21 |
NBP | 84 | 17 |
ENAC | 105 | 43 |
ANF | 21 | 7 |
Kmer | 64 | 1 |
PC-PseDNC-General | 22 | 0 |
Since many previous researchers built Ψ site recognition models based on support vector machines, we employed SVM [65] as a classifier in the same feature space to compare the performance of TWSVM with that of SVM. Figure 4 displays the comparison. In the 10-CV results of the three species, the ACC, MCC, and AUC of the TWSVM model were larger than those of the SVM model. The independent test results showed the difference more clearly: TWSVM outperformed SVM on every evaluation metric. We therefore concluded that the TWSVM model performs better than the SVM model and may be better suited for identifying Ψ sites in RNA sequences.
The effectiveness of iPseU-TWSVM was also evaluated in comparison with other advanced predictors, namely iRNA-PseU [16], PseUI [17], iPseU-CNN [18], XG-PseU [19] and RF-PseU [20]. The 10-CV and Ⅰ-testing results of iPseU-TWSVM and these predictors are compared in Tables 7 and 8, respectively. The 10-CV results reveal that the accuracy of iPseU-TWSVM on H.sapiens is 1.7% lower than that of iPseU-CNN, the best predictor on this species, and that its accuracies on S.cerevisiae and M.musculus are 0.722 and 0.728, which are 2.6% and 2.0% lower than those of the best predictor RF-PseU. Although iPseU-TWSVM does not perform best on the training sets, it has higher accuracy than the other predictors on all species in Ⅰ-testing. On H.sapiens and S.cerevisiae, it is 1.3% and 5.5% more accurate in Ⅰ-testing than the best predictor RF-PseU, with accuracy values of 0.763 and 0.825, respectively. We also calculated the average accuracy over species to compare the predictors' overall performance. As shown in Table 9, the average 10-CV accuracy of iPseU-TWSVM is 1.3% lower than that of RF-PseU, whereas its average Ⅰ-testing accuracy is 3.4% higher than that of RF-PseU. These findings indicate that iPseU-TWSVM has better generalization performance than the other predictors and is well suited to recognizing Ψ sites in RNA sequences.
Species | Classifier | ACC (10-CV) | MCC | SN | SP | AUC |
H.sapiens | iRNA-PseU | 0.604 | 0.21 | 0.61 | 0.598 | 0.64 |
H.sapiens | PseUI | 0.642 | 0.28 | 0.649 | 0.636 | 0.68 |
H.sapiens | iPseU-CNN | 0.667 | 0.34 | 0.65 | 0.688 | / |
H.sapiens | XG-PseU | 0.661 | 0.32 | 0.635 | 0.687 | 0.7 |
H.sapiens | RF-PseU | 0.643 | 0.29 | 0.661 | 0.626 | 0.7 |
H.sapiens | iPseU-TWSVM | 0.65 | 0.301 | 0.697 | 0.602 | 0.682 |
S.cerevisiae | iRNA-PseU | 0.645 | 0.29 | 0.647 | 0.643 | 0.81 |
S.cerevisiae | PseUI | 0.641 | 0.3 | 0.647 | 0.675 | 0.69 |
S.cerevisiae | iPseU-CNN | 0.682 | 0.37 | 0.664 | 0.705 | / |
S.cerevisiae | XG-PseU | 0.682 | 0.37 | 0.668 | 0.695 | 0.77 |
S.cerevisiae | RF-PseU | 0.748 | 0.49 | 0.772 | 0.724 | 0.81 |
S.cerevisiae | iPseU-TWSVM | 0.722 | 0.45 | 0.656 | 0.786 | 0.758 |
M.musculus | iRNA-PseU | 0.691 | 0.38 | 0.733 | 0.648 | 0.75 |
M.musculus | PseUI | 0.704 | 0.41 | 0.799 | 0.703 | 0.71 |
M.musculus | iPseU-CNN | 0.718 | 0.44 | 0.748 | 0.691 | / |
M.musculus | XG-PseU | 0.72 | 0.45 | 0.765 | 0.676 | 0.74 |
M.musculus | RF-PseU | 0.748 | 0.5 | 0.731 | 0.765 | 0.796 |
M.musculus | iPseU-TWSVM | 0.728 | 0.462 | 0.795 | 0.661 | 0.775 |
Species | Classifier | ACC (independent testing) | MCC | SN | SP | AUC |
H.sapiens | iRNA-PseU | 0.65 | 0.3 | 0.6 | 0.7 | / |
H.sapiens | PseUI | 0.655 | 0.31 | 0.63 | 0.7 | / |
H.sapiens | iPseU-CNN | 0.69 | 0.4 | 0.777 | 0.68 | / |
H.sapiens | XG-PseU | 0.675 | / | / | 0.608 | / |
H.sapiens | RF-PseU | 0.75 | 0.5 | 0.78 | 0.72 | 0.8 |
H.sapiens | iPseU-TWSVM | 0.763 | 0.529 | 0.825 | 0.7 | 0.786 |
S.cerevisiae | iRNA-PseU | 0.6 | 0.2 | 0.63 | 0.57 | / |
S.cerevisiae | PseUI | 0.685 | 0.37 | 0.65 | 0.72 | / |
S.cerevisiae | iPseU-CNN | 0.735 | 0.47 | 0.688 | 0.778 | / |
S.cerevisiae | XG-PseU | 0.71 | / | / | / | / |
S.cerevisiae | RF-PseU | 0.77 | 0.54 | 0.75 | 0.79 | 0.838 |
S.cerevisiae | iPseU-TWSVM | 0.825 | 0.65 | 0.85 | 0.8 | 0.905 |
Average ACC | iPseU-TWSVM | RF-PseU | XG-PseU | iPseU-CNN | PseUI | iRNA-PseU |
Cross-validation | 0.7 | 0.713 | 0.687 | 0.689 | 0.662 | 0.647 |
Independent testing | 0.794 | 0.76 | 0.693 | 0.713 | 0.7 | 0.625 |
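The averages for iPseU-TWSVM and RF-PseU can be reproduced from the per-species accuracies reported for 10-CV and Ⅰ-testing. The short check below copies those accuracies by hand; the dictionary layout and the rounding to three decimals are illustrative choices.

```python
# Per-species ACC values copied from the 10-CV and independent-testing results.
cv_acc = {
    "iPseU-TWSVM": [0.65, 0.722, 0.728],  # H.sapiens, S.cerevisiae, M.musculus
    "RF-PseU": [0.643, 0.748, 0.748],
}
indep_acc = {
    "iPseU-TWSVM": [0.763, 0.825],  # H.sapiens, S.cerevisiae
    "RF-PseU": [0.75, 0.77],
}


def mean(values):
    return sum(values) / len(values)


cv_avg = {name: round(mean(v), 3) for name, v in cv_acc.items()}
indep_avg = {name: round(mean(v), 3) for name, v in indep_acc.items()}
cv_gap = round(cv_avg["RF-PseU"] - cv_avg["iPseU-TWSVM"], 3)
indep_gap = round(indep_avg["iPseU-TWSVM"] - indep_avg["RF-PseU"], 3)
print(cv_avg, indep_avg, cv_gap, indep_gap)
```

The resulting gaps of 0.013 (10-CV) and 0.034 (Ⅰ-testing) are the 1.3% and 3.4% differences quoted in the text, expressed as absolute differences in accuracy.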
This work proposes a novel model called iPseU-TWSVM to identify RNA Ψ sites across various species. We used an efficient feature selection method to obtain the best feature subset and selected TWSVM as the classifier to increase recognition accuracy. Compared with advanced predictors, iPseU-TWSVM improved the average independent test accuracy by 3.4%, while its average cross-validation accuracy was lower by 1.3%. We attribute the relatively poor performance on the training datasets to two reasons: the features used by the best predictor are different, and the classifier of the model is different. The above results indicate that iPseU-TWSVM has better generalization performance and can more accurately identify Ψ sites in RNA sequences. It is anticipated that iPseU-TWSVM will be effective in identifying RNA Ψ sites.
The contribution of this work has the following three aspects: (ⅰ) the model uses TWSVM as a classifier, which improves both the accuracy and the training speed of the model; (ⅱ) the model has good generalization performance and can be applied to the prediction of other sites in the sequence; (ⅲ) more accurate identification of Ψ sites in the sequence lays a foundation for disease control and related drug development. At the same time, this work has the following shortcomings: (ⅰ) in the feature selection part, only two algorithms are compared, and subsequent research can try other algorithms to further improve the feature subset; (ⅱ) the model uses TWSVM as a classifier, and the primal problem of TWSVM minimizes only the empirical risk, not the structural risk. Moreover, the algorithm can only obtain approximate solutions. Subsequent research can consider improving TWSVM or trying other classification algorithms as the classifier of the model to improve the prediction performance. Future work will study emerging methods [66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81] to further improve the accuracy of the model.
This work was supported by the National Natural Science Foundation of China (NSFC 62172076, 62072385) and the Municipal Government of Quzhou under Grant Numbers 2020D003 and 2021D004.
The authors declare there is no conflict of interest.
[1] | CSSE: COVID-19 Dashboard, Johns Hopkins University (JHU), 2022. Available from: https://www.arcgis.com/apps/dashboards/bda7594740fd40299423467b48e9ecf6. |
[2] | WHO: Tracking SARS-CoV-2 variants, 2022. Available from: https://www.who.int/en/activities/tracking-SARS-CoV-2-variants/. |
[3] |
T. K. Burki, Omicron variant and booster COVID-19 vaccines, Lancet Respir. Med., 10 (2022). https://doi.org/10.1016/S2213-2600(21)00559-2 doi: 10.1016/S2213-2600(21)00559-2
![]() |
[4] |
Y. Liu, J. Rocklöv, The reproductive number of the Delta variant of SARS-CoV-2 is far higher compared to the ancestral SARS-CoV-2 virus, J. Travel Med., 28 (2021), 1–3. https://doi.org/10.1093/jtm/taab124 doi: 10.1093/jtm/taab124
![]() |
[5] | MMWR: Centers for Disease Control and Prevention, 2022. Available from: https://www.cdc.gov/mmwr/volumes/71/wr/mm7104e4.htm. |
[6] | O. Barnes, J. Burn-Murdoch, Omicron's less severe cases prompt cautious optimism in South Africa, 2021. Available from: https://www.ft.com/content/d315be08-cda0-462b-85ec-811290ad488e. |
[7] | WHO: WHO Director-General's opening remarks at the media briefing – 5 May 2023, 2023. Available from: https://www.who.int/director-general/speeches/detail/who-director-general-s-opening-remarks-at-the-media-briefing---5-may-2023. |
[8] |
M. Oliu-Barton, B. S. R. Pradelski, Y. Algan, M. G. Baker, A. Binagwaho, G. J. Dore, et al., Elimination versus mitigation of SARS-CoV-2 in the presence of effective vaccines, Lancet Global Health, 10 (2022), e142–e147, 2022. https://doi.org/10.1016/S2214-109X(21)00494-0 doi: 10.1016/S2214-109X(21)00494-0
![]() |
[9] |
M. Oliu-Barton, B. S. R. Pradelski, P. Aghion, P. Artus, I. Kickbusch, J. V. Lazarus, et al., SARS-CoV-2 elimination, not mitigation, creates best outcomes for health, the economy, and civil liberties, Lancet, 397 (2021), 12–18. https://doi.org/10.1016/S0140-6736(21)00978-8 doi: 10.1016/S0140-6736(21)00978-8
![]() |
[10] |
W. Liang, M. Liu, J. Liu, Y. Wang, J. Wu, X. Liu, The dynamic COVID-zero strategy on prevention and control of COVID-19 in China (in Chinese), Chin. Med. J., 102 (2022), 239–242. https://doi.org/10.3760/cma.j.cn112137-20211205-02710 doi: 10.3760/cma.j.cn112137-20211205-02710
![]() |
[11] | Coexist with COVID-19' for only ten days, the number of critically ill cases in Korea has reached a new high (in Chinese), 2021. Available from: https://baijiahao.baidu.com/s?id = 1716037339938236108 & wfr = spider & for = pc. |
[12] |
Y. Xing, G. Wong, W. Ni, X. Hu, Q. Xing, Rapid response to an outbreak in Qingdao, China, N. Engl. J. Med., 383 (2020), e129. https://doi.org/10.1056/nejmc2032361 doi: 10.1056/nejmc2032361
![]() |
[13] |
Z. Wu, Q. Wang, J. Zhao, P. Yang, J. M. McGoogan, Z. Feng, et al., Time course of a second outbreak of COVID-19 in Beijing, China, June-July 2020, JAMA, 324 (2020), 1458−1459. https://doi.org/10.1001/jama.2020.15894 doi: 10.1001/jama.2020.15894
![]() |
[14] |
Z. Li, F. Liu, J. Cui, Z. Peng, Z. Chang, S. Lai, et al., Comprehensive large-scale nucleic acid–testing strategies support China's sustained containment of COVID-19, Nat. Med., 27 (2021), 740–742. https://doi.org/10.1038/s41591-021-01308-7 doi: 10.1038/s41591-021-01308-7
![]() |
[15] | New Coronavirus Pneumonia Diagnosis and Treatment Plan (Trial Version 8) (in Chinese), National Health Commission of the People's Republic of China, 2020. Available from: http://www.nhc.gov.cn/yzygj/s7653p/202008/0a7bdf12bd4b46e5bd28ca7f9a7f5e5a.shtml. |
[16] |
Y. Zhang, C. You, X. Gai, X. Zhou, On coexistence with COVID-19: estimations and perspectives, China CDC Wkly, 3 (2021), 1057–1061. https://doi.org/10.46234/ccdcw2021.245 doi: 10.46234/ccdcw2021.245
![]() |
[17] |
B. Tang, F. Xia, S. Tang, N. L. Bragazzi, Q. Li, X. Sun, et al., The effectiveness of quarantine and isolation determine the trend of the COVID-19 epidemic in the final phase of the current outbreak in China, Int. J. Infect. Dis., 95 (2020), 288–293. https://doi.org/10.1016/j.ijid.2020.03.018 doi: 10.1016/j.ijid.2020.03.018
![]() |
[18] |
C. Hou, J. Chen, Y. Zhou, L. Hua, J. Yuan, S. He, et al., The effectiveness of quarantine of Wuhan city against the Corona Virus Disease 2019 (COVID-19): a well-mixed SEIR model analysis, J. Med. Virol., 92 (2020), 841–848. https://doi.org/10.1002/jmv.25827 doi: 10.1002/jmv.25827
![]() |
[19] |
K. Shimizu, T. Kuniya, Y. Tokuda, Modeling population-wide testing of SARS-CoV-2 for containing COVID-19 pandemic in Okinawa, Japan, J. Gen. Fam. Med., 22 (2021), 173–181. https://doi.org/10.1002/jgf2.439 doi: 10.1002/jgf2.439
![]() |
[20] |
A. Aleta, D. Martín-Corral, A. P. Y. Piontti, M. Ajelli, M. Litvinova, M. Chinazzi, et al., Modelling the impact of testing, contact tracing and household quarantine on second waves of COVID-19, Nat. Hum. Behav., 4 (2020), 964–971. https://doi.org/10.1038/s41562-020-0931-9 doi: 10.1038/s41562-020-0931-9
![]() |
[21] |
T. Colbourn, W. Waites, J. Panovska-Griffiths, D. Manheim, S. Sturniolo, G. Colbourn, et al., Modelling the health and economic impacts of population-wide testing, contact tracing and isolation (PTTI) strategies for COVID-19 in the UK, SSRN Electron. J., 9 (2020). https://doi.org/10.2139/ssrn.3627273 doi: 10.2139/ssrn.3627273
![]() |
[22] |
A. J. Kucharski, P. Klepac, A. J. K. Conlan, S. M. Kissler, M. L. Tang, H. Fry, et al., Effectiveness of isolation, testing, contact tracing, and physical distancing on reducing transmission of SARS-CoV-2 in different settings: a mathematical modelling study, Lancet Infect. Dis., 20 (2020), 1151–1160. https://doi.org/10.1016/s1473-3099(20)30457-6 doi: 10.1016/s1473-3099(20)30457-6
![]() |
[23] |
S. Contreras, J. Dehning, M. Loidolt, J. Zierenberg, F. P. Spitzner, J. H. Urrea-Quintero, et al., The challenges of containing SARS-CoV-2 via test-trace-and-isolate, Nat. Commun., 12 (2021), 378. https://doi.org/10.1038/s41467-020-20699-8 doi: 10.1038/s41467-020-20699-8
![]() |
[24] | Yangzhou city health and family planning commission website, 2021. Available from: http://wjw.yangzhou.gov.cn/. |
[25] | The paper news official website, 2021. Available from: ttps: //www.thepaper.cn/. |
[26] |
A. Holborow, H. Asad, L. Porter, P. Tidswell, C. Johnston, I. Blyth, et al., The clinical sensitivity of a single SARS-CoV-2 upper respiratory tract RT-PCR test for diagnosing COVID-19 using convalescent antibody as a comparator, Clin. Med., 20 (2020), 6. https://doi.org/10.7861/clinmed.2020-0555 doi: 10.7861/clinmed.2020-0555
![]() |
[27] |
J. Chhatwal, Y. Xiao, P. Mueller, M. Adee, O. O. Dalgic, M. A. Ladd, et al., Changing dynamics of COVID-19 in the U.S. with the emergence of the Delta variant: projections of the COVID-19 simulator, medRxiv, 2021. https://doi.org/10.1101/2021.08.11.21261845 doi: 10.1101/2021.08.11.21261845
![]() |
[28] |
B. Li, A. Deng, K. Li, Y. Hu, Z. Li, Y. Shi, et al., Viral infection and transmission in a large well-traced outbreak caused by the Delta SARS-CoV-2 variant, Nat. Commun., 13 (2022). https://doi.org/10.1038/s41467-022-28089-y doi: 10.1038/s41467-022-28089-y
![]() |
[29] |
M. Du, M. Liu, J. Liu, Progress in research of epidemiologic feature and control of SARS-CoV-2 Delta variant (in Chinese), Chin. J. Epidemiol., 42 (2021), 1774–1779. https://doi.org/10.3760/cma.j.cn112338-20210808-00619 doi: 10.3760/cma.j.cn112338-20210808-00619
![]() |
[30] |
P. Dreessche, J. Watmough, Reproduction numbers and sub-threshold endemic equilibria for compartmental models of disease transmission, Math. Biosci., 180 (2022), 29–48. https://doi.org/10.1016/S0025-5564(02)00108-6 doi: 10.1016/S0025-5564(02)00108-6
![]() |
[31] |
H. R. Thieme, Spectral bound and reproduction number for infinite dimensional population structure and time heterogeneity, SIAM J. Appl. Math., 70 (2009), 188–211. https://doi.org/10.1137/080732870 doi: 10.1137/080732870
![]() |
[32] |
A. Cintrón-Arias, C. Castillo-Chávez, L. M. A. Bettencourt, A. L. Lloyd, H. T. Banks, The estimation of the effective reproductive number from disease outbreak data, Math. Biosci. Eng., 6 (2009), 261–282. https://doi.org/10.3934/mbe.2009.6.261 doi: 10.3934/mbe.2009.6.261
![]() |
[33] |
H. Haario, M. Laine, A. Mira, E. Saksman, DRAM: efficient adaptive MCMC, Stat. Comput., 16 (2006), 339–354. https://doi.org/10.1007/s11222-006-9438-0 doi: 10.1007/s11222-006-9438-0
![]() |
[34] |
M. D. Mckay, R. J. Beckman, W. J. Conover, A comparison of three methods for selecting values of input variables in the analysis of output from a computer code, Technometrics, 21 (1979), 239–245. https://doi.org/10.1080/00401706.2000.10485979 doi: 10.1080/00401706.2000.10485979
![]() |
[35] | Guidelines for Organization of Regional Novel Coronavirus Nucleic Acid Tests (Third Edition) (in Chinese), 2022. Available from: http://www.nhc.gov.cn/yzygj/s7659/202203/b5aaa96dfe1b4f14b19bf2f888a10673.shtml. |
[36] |
J. Qi, D. Zhang, X. Zhang, T. Takana, Y. Pan, P. Yin, et al., Short- and medium-term impacts of strict anti-contagion policies on non-COVID-19 mortality in China, Nat. Hum. Behav., 6 (2022), 55–63. https://doi.org/10.1038/s41562-021-01189-3 doi: 10.1038/s41562-021-01189-3
![]() |
Feature subset (TWSVM classifier) | ACC | MCC | SN | SP | AUC |
Kmer | 0.59 | 0.181 | 0.533 | 0.646 | 0.618 |
PC-PseDNC-General | 0.534 | 0.07 | 0.434 | 0.635 | 0.543 |
ANF | 0.526 | 0.053 | 0.568 | 0.485 | 0.518 |
EIIP | 0.572 | 0.144 | 0.525 | 0.618 | 0.6 |
ENAC | 0.587 | 0.178 | 0.472 | 0.7 | 0.59 |
NCP + NBP | 0.584 | 0.172 | 0.53 | 0.639 | 0.582 |
Kmer + NCP + NBP | 0.59 | 0.182 | 0.532 | 0.648 | 0.587 |
Kmer + ENAC | 0.603 | 0.208 | 0.582 | 0.624 | 0.62 |
Kmer + EIIP | 0.606 | 0.212 | 0.585 | 0.626 | 0.625 |
Kmer + PC-PseDNC-General + ANF + EIIP + ENAC + NCP + NBP | 0.65 | 0.301 | 0.697 | 0.602 | 0.682 |
Feature subset (TWSVM classifier) | ACC | MCC | SN | SP | AUC |
Kmer | 0.627 | 0.259 | 0.625 | 0.63 | 0.67 |
PC-PseDNC-General | 0.574 | 0.153 | 0.666 | 0.483 | 0.589 |
ANF | 0.584 | 0.175 | 0.716 | 0.452 | 0.584 |
EIIP | 0.614 | 0.238 | 0.473 | 0.755 | 0.632 |
ENAC | 0.634 | 0.326 | 0.435 | 0.836 | 0.693 |
NCP + NBP | 0.669 | 0.34 | 0.659 | 0.678 | 0.711 |
Kmer + NCP + NBP | 0.664 | 0.33 | 0.653 | 0.675 | 0.683 |
Kmer + ENAC | 0.653 | 0.308 | 0.631 | 0.675 | 0.683 |
Kmer + EIIP | 0.631 | 0.263 | 0.628 | 0.634 | 0.662 |
Kmer + PC-PseDNC-General + ANF + EIIP + ENAC + NCP + NBP | 0.722 | 0.45 | 0.656 | 0.786 | 0.758 |
Feature subset (TWSVM classifier) | ACC | MCC | SN | SP | AUC |
Kmer | 0.584 | 0.17 | 0.676 | 0.49 | 0.614 |
PC-PseDNC-General | 0.541 | 0.084 | 0.517 | 0.566 | 0.539 |
ANF | 0.553 | 0.113 | 0.71 | 0.396 | 0.527 |
EIIP | 0.625 | 0.276 | 0.826 | 0.424 | 0.632 |
ENAC | 0.664 | 0.329 | 0.667 | 0.661 | 0.7 |
NCP + NBP | 0.662 | 0.326 | 0.636 | 0.688 | 0.703 |
Kmer + NCP + NBP | 0.677 | 0.36 | 0.623 | 0.731 | 0.73 |
Kmer + ENAC | 0.662 | 0.326 | 0.657 | 0.667 | 0.704 |
Kmer + EIIP | 0.636 | 0.274 | 0.697 | 0.574 | 0.667 |
Kmer + PC-PseDNC-General + ANF + EIIP + ENAC + NCP + NBP | 0.728 | 0.462 | 0.795 | 0.661 | 0.775 |
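The columns ACC, MCC, SN and SP in the tables above are standard binary-classification metrics. As a minimal sketch, the following Python function computes them from confusion-matrix counts using the usual textbook definitions. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the paper's data.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute ACC, MCC, SN and SP from confusion-matrix counts.

    Assumes both classes are present, so tp + fn > 0 and tn + fp > 0.
    """
    total = tp + tn + fp + fn
    acc = (tp + tn) / total          # accuracy
    sn = tp / (tp + fn)              # sensitivity (recall on positives)
    sp = tn / (tn + fp)              # specificity (recall on negatives)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "MCC": mcc, "SN": sn, "SP": sp}


# Hypothetical counts for a balanced test set of 100 positives and 100 negatives.
metrics = binary_metrics(tp=85, tn=80, fp=20, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

AUC is not included because it depends on the ranking of prediction scores rather than on a single confusion matrix.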
Feature | The original dimension | The dimension after feature selection |
NCP | 63 | 14 |
EIIP | 21 | 21 |
NBP | 84 | 15 |
ENAC | 105 | 41 |
ANF | 21 | 8 |
Kmer | 64 | 1 |
PC-PseDNC-General | 22 | 0 |
Feature | The original dimension | The dimension after feature selection |
NCP | 93 | 22 |
EIIP | 31 | 31 |
NBP | 124 | 25 |
ENAC | 155 | 53 |
ANF | 31 | 19 |
Kmer | 256 | 0 |
PC-PseDNC-General | 18 | 0 |
Feature | The original dimension | The dimension after feature selection |
NCP | 63 | 11 |
EIIP | 21 | 21 |
NBP | 84 | 17 |
ENAC | 105 | 43 |
ANF | 21 | 7 |
Kmer | 64 | 1 |
PC-PseDNC-General | 22 | 0 |
Species | Classifier (cross-validation) | ACC | MCC | SN | SP | AUC |
H. sapiens | iRNA-PseU | 0.604 | 0.21 | 0.61 | 0.598 | 0.64 |
H. sapiens | PseUI | 0.642 | 0.28 | 0.649 | 0.636 | 0.68 |
H. sapiens | iRNA-CNN | 0.667 | 0.34 | 0.65 | 0.688 | / |
H. sapiens | XG-PseU | 0.661 | 0.32 | 0.635 | 0.687 | 0.7 |
H. sapiens | RF-PseU | 0.643 | 0.29 | 0.661 | 0.626 | 0.7 |
H. sapiens | iPseU-TWSVM | 0.65 | 0.301 | 0.697 | 0.602 | 0.682 |
S. cerevisiae | iRNA-PseU | 0.645 | 0.29 | 0.647 | 0.643 | 0.81 |
S. cerevisiae | PseUI | 0.641 | 0.3 | 0.647 | 0.675 | 0.69 |
S. cerevisiae | iRNA-CNN | 0.682 | 0.37 | 0.664 | 0.705 | / |
S. cerevisiae | XG-PseU | 0.682 | 0.37 | 0.668 | 0.695 | 0.77 |
S. cerevisiae | RF-PseU | 0.748 | 0.49 | 0.772 | 0.724 | 0.81 |
S. cerevisiae | iPseU-TWSVM | 0.722 | 0.45 | 0.656 | 0.786 | 0.758 |
M. musculus | iRNA-PseU | 0.691 | 0.38 | 0.733 | 0.648 | 0.75 |
M. musculus | PseUI | 0.704 | 0.41 | 0.799 | 0.703 | 0.71 |
M. musculus | iRNA-CNN | 0.718 | 0.44 | 0.748 | 0.691 | / |
M. musculus | XG-PseU | 0.72 | 0.45 | 0.765 | 0.676 | 0.74 |
M. musculus | RF-PseU | 0.748 | 0.5 | 0.731 | 0.765 | 0.796 |
M. musculus | iPseU-TWSVM | 0.728 | 0.462 | 0.795 | 0.661 | 0.775 |
Species | Classifier (independent testing) | ACC | MCC | SN | SP | AUC |
H. sapiens | iRNA-PseU | 0.65 | 0.3 | 0.6 | 0.7 | / |
H. sapiens | PseUI | 0.655 | 0.31 | 0.63 | 0.7 | / |
H. sapiens | iRNA-CNN | 0.69 | 0.4 | 0.777 | 0.68 | / |
H. sapiens | XG-PseU | 0.675 | / | / | 0.608 | / |
H. sapiens | RF-PseU | 0.75 | 0.5 | 0.78 | 0.72 | 0.8 |
H. sapiens | iPseU-TWSVM | 0.763 | 0.529 | 0.825 | 0.7 | 0.786 |
S. cerevisiae | iRNA-PseU | 0.6 | 0.2 | 0.63 | 0.57 | / |
S. cerevisiae | PseUI | 0.685 | 0.37 | 0.65 | 0.72 | / |
S. cerevisiae | iRNA-CNN | 0.735 | 0.47 | 0.688 | 0.778 | / |
S. cerevisiae | XG-PseU | 0.71 | / | / | / | / |
S. cerevisiae | RF-PseU | 0.77 | 0.54 | 0.75 | 0.79 | 0.838 |
S. cerevisiae | iPseU-TWSVM | 0.825 | 0.65 | 0.85 | 0.8 | 0.905 |
Score type | iPseU-TWSVM | RF-PseU | XG-PseU | iRNA-CNN | PseUI | iRNA-PseU |
Cross-validation | 0.7 | 0.713 | 0.687 | 0.689 | 0.662 | 0.647 |
Independent testing | 0.794 | 0.76 | 0.693 | 0.713 | 0.7 | 0.625 |
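The iPseU-TWSVM and RF-PseU scores in the summary table above appear to be the per-species ACC values averaged over species. The short Python sketch below reproduces that arithmetic for these two predictors under this assumption, using the ACC values reported in the comparison tables. The variable and function names are illustrative only.

```python
from statistics import mean

# Per-species ACC values copied from the comparison tables.
# Cross-validation order: H. sapiens, S. cerevisiae, M. musculus.
cv_acc = {
    "iPseU-TWSVM": [0.650, 0.722, 0.728],
    "RF-PseU": [0.643, 0.748, 0.748],
}
# Independent testing order: H. sapiens, S. cerevisiae.
ind_acc = {
    "iPseU-TWSVM": [0.763, 0.825],
    "RF-PseU": [0.750, 0.770],
}


def mean_acc(table):
    """Average ACC over species for each predictor, rounded to 3 decimals."""
    return {name: round(mean(values), 3) for name, values in table.items()}


cv_mean = mean_acc(cv_acc)
ind_mean = mean_acc(ind_acc)

# Differences in percentage points, iPseU-TWSVM minus RF-PseU.
cv_gap = round((cv_mean["iPseU-TWSVM"] - cv_mean["RF-PseU"]) * 100, 1)
ind_gap = round((ind_mean["iPseU-TWSVM"] - ind_mean["RF-PseU"]) * 100, 1)

print("cross-validation means:", cv_mean, "gap:", cv_gap)
print("independent-test means:", ind_mean, "gap:", ind_gap)
```

Under this reading, the gaps correspond to the 3.4-point gain in independent testing and the 1.3-point deficit in cross-validation stated in the conclusion.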