1.
Introduction
To date, more than 400 different types of protein post-translational modifications (PTMs) have been discovered in the whole process of life activities [1,2]. Ubiquitylation is a modification process in which ubiquitin molecules may attach themselves to substrate proteins on lysine residues under the action of E1 activation enzyme, E2 conjugation enzyme and E3 ligation enzyme [3,4,5]. Ubiquitination is widely involved in various physiological processes due to its diversity and multivalence, including cell proliferation, apoptosis, autophagy, DNA damage repair and immune response [6,7]. Similar to the phosphorylation pathway, the ubiquitin modification pathway is reversible, that is, ubiquitin protein modifications can be removed by deubiquitinase. Therefore, it is difficult to study protein ubiquitination. However, there are many experimental approaches which have been developed, mainly including high-throughput mass spectrometry techniques, ubiquitin antibodies, ubiquitin binding proteins, and so on [8,9].
To study the ubiquitination of proteins, we need to make clear that: 1) Which proteins can be ubiquitinated; 2) For ubiquitinated proteins, which lysine residues can be ubiquitinated; 3) Quantitative analysis should be performed to find the motifs of ubiquitinated protein sequences. After clarifying the above points, we need to further understand how the ubiquitination happens, and what the key molecules that affect this ubiquitination process are. In other words, what is the role of E3 enzyme in this process? To complete these studies, a large amount of statistical data is needed. With the development of molecular biology and computer aided calculation, motif identification has become a helpful method for digging valuable information from biological sequences. Recently, a variety of machine learning approaches have been developed for automatic recognition of protein ubiquitylation sites.
The first online predictor for identifying protein ubiquitylation sites, called Ubipred [10], took advantage of 31 informative attributes out of 531 physicochemical properties as features, and support vector machine (SVM) as classifier. After that, many researches have proposed new models to predict ubiquitylation sites based on traditional classification algorithm, such as SVM [8,11,12,13,14], random forest (RF) [15,16], gray system model. In these studies, the composition of k-spaced amino acid pairs, binary amino acid encoding, physicochemical properties of amino acids, pseudo-amino acid composition, and so on, are adopted to characterize sequence information. Chen et al. [17] proposed a predictor called hCKSAAP_UbSite, which used SVM classifier incorporated with the composition of k-spaced amino acid pairs (CKSAAP), the binary amino acid encoding, the AAindex physicochemical property, and the protein aggregation propensity, to recognize protein ubiquitylation sites of human. Qiu et al. [18] developed a predictor called iUbiq-Lys, which adopted the evolutionary information, pseudo-amino acid composition (PseAAC), as well as the gray system model to predict protein ubiquitylation sites. Cai and Jiang [19] employed various traditional machine learning methods for the ubiquitylation site identification based on physicochemical properties of amino acids concerning protein sequences. Wang et al. [20] designed a tool, ESA-UbiSite, using physicochemical properties together with support vector machine to identify human ubiquitylation sites. And they also proposed the evolutionary screening algorithm (ESA) to select negative samples from non-validated sites effectively. In recent years, deep learning has been extensively used in the field of bioinformatics [21,22,23,24]. In 2017, He et al. [21] proposed the first deep learning architecture by utilizing raw protein sequence fragments, selected physicochemical properties of amino acids, and corresponding position-specific scoring matrix as input. More recently, Wang et al. [24] proposed an improved training scheme with word-embedding model, incorporated with the multilayer convolutional neural network to predict plant ubiquitylation sites. Although many models have been proposed to recognize ubiquitination sites of different species, there are only several models are designed for plant [13,15,16,24]. The latest model is built by Wang et al. [24], which achieved the accuracies of 0.782 on 10-fold cross-validation and 0.756 on independent test, respectively. Comparing the performance with other species, there is still a lot of room to improve the prediction performance. One solution is to use more sequence order information and position information of whole sequences. Another solution is to consider both traditional classification algorithm and deep learning. For the convenience of research, we show the related works and major information in Table 1.
In this work, we present a novel prediction model called UPFPSR to further improve the predictive performance for plant potential ubiquitylation sites. We build and optimize our model from three aspects. First, in order to extract more effective and representative information from protein fragments, four sequence feature extraction methods, namely DBPB (di-amino acid bi-profile Bayes), EGAAC (enhanced grouped amino acid composition), Pse-AAC (pseudo-amino acid composition) and PWAA (position-weight amino acid composition) are used to transform sequence fragments of length 31 into numerical feature vectors efficiently. Second, deep learning algorithms and several extensively used traditional machine learning algorithms were compared during model construction, and random forest (RF) is the chosen classification algorithm to establish our lysine ubiquitination site prediction model UPFPSR. Last, we perform a 10-fold cross-validation test, as well as an independent test to compare and evaluate the performance of the constructed model objectively using five common measures, i.e., accuracy (Acc), precision, recall, F1-score and the area under the ROC curve (AUC) values. When compared with one of the most advanced prediction tool CNN+word2vec on an independent dataset, UPFPSR shows its advantage over CNN+word2vec with the accuracy of 77.3%, precision of 75.0%, recall of 81.7%, F1-score of 0.782, and AUC of 0.84.
2.
Materials and methods
Figure 1 shows the overall framework of UPFPSR, which contains four major steps: 1) Data collection and preprocessing; 2) Feature encoding schemes; 3) Model construction; 4) Model evaluation. In the first step, the training dataset and the independent testing set originating from the PLMD database [25], are collected and pre-processed. Then in the second step, we adopt four different sequence-based feature-encoding techniques to extract effective feature vectors. We perform a 10-fold cross-validation on multiple classifiers in the third step to select the optimal model for plant ubiquitylated site recognition. Finally, the trained RF model is further evaluated by an independent test set, and the predictive performance is compared with the existing predictor of Wang et al. [24]. Details are described in following sections.
2.2. Data collection and pre-processing
In this study, we use datasets constructed by Wang et al. [24] to train and validate our model. The experimentally verified lysine ubiquitylation proteins are collected from the PLMD database [25]. We select ubiquitylation sites from Oryza sativa subsp indica, O. sativa subsp japonica, and Arabidopsis thaliana for the plant subset. The protein peptide sequences of length 31 with experimentally verified ubiquitylation lysine in the center are collected as positive dataset. If the number of upstream or downstream amino acids is less than 15, the lacking amino acids are complemented with the same number of pseudo amino acid "X"s. The negative samples (non-ubiquitination sites) were generated by satisfying the requirement that 31 long sequences with lysine in the center. Meanwhile, the negative sample should not be annotated experimentally. After a series of treatment, a total of 7000 protein peptide sequences are obtained for species of plant with sequence similarity less than 30%, which contains 3500 positive peptides and 3500 negative peptides. We randomly select 2750 peptides from the 3500 positive peptides and 2750 peptides from the 3500 negative peptides, separately, as the positive training dataset and the negative training dataset. The training dataset, consisting of 5500 protein fragments, is used to train and optimize the prediction model. The remaining 1500 protein peptides including 750 positive samples and 750 negative samples are used to evaluate the generalization ability of the established predictor. For the specific procedure of constructing data set, please refer to the work of Wang et al. [24].
2.2. Sequence encoding schemes
Four different protein fragment encoding methods are used in this study, namely DBPB, EGAAC, Pse-AAC and PWAA. These encoding schemes consider 20 natural amino acids and a pse-amino acid (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y, X) presented in protein sequence fragments, and transform them into numerical feature vectors.
2.2.1. Di-amino acid bi-profile Bayes (DBPB)
DBPB [26] considers the frequency of every two adjacent amino acids at each position in all the positive and negative samples. It has been widely used in the field of post translational modification of proteins. For example, Zhu et al. [27] proposed Inspector, a novel succinylation prediction tool that used random forest algorithm to identify key determinants of succinylation among six sequence-based features: di-amino acid bi-profile Bayes, position-specific di-amino acid propensity, pseudo-amino acid composition, position-weight amino acid composition, enhanced grouped amino acid composition and composition of k-spaced amino acid group pairs. Jia et al. [28] proposed an ensemble model O-GlcNAcPRED-II to predict O-GlcNAcylation sites by fusing multiple features incorporated di-amino acid bi-profile Bayes. Given a protein peptide sequence S, the DBPB feature vector is defined as:
where n is the length of the sequence fragment after omitting amino acid K in the central position. (i.e., n = 30), pj(j=1,2,⋅⋅⋅,n−1) denotes the posterior probability of two adjacent amino acids at the j-th position in all positive samples, while pj(j=n,n+1,⋅⋅⋅,2(n−1)) represents the posterior probability of two adjacent amino acids at the (j−n+1)-th position in all negative samples.
2.2.2. Enhanced grouped amino acid composition (EGAAC)
The EGAAC feature encoding [29] firstly classifies the 20 amino acids into five categories, according to their physicochemical properties, e.g., charge, hydrophobicity and molecular size. These five categories are aliphatic group (g1): (G, A, V, L, M, I), aromatic group (g2): (F, Y, W), positive charge group (g3): (K, R, H), negative charge group (g4): (D, E) and uncharged group (g5): (S, T, C, P, N, Q), respectively. EGAAC descriptor calculates the frequency of each amino acid group in windows of fixed length, which is defined as:
where N(w) is the size of the sliding window w, and N(g,w)is the number of amino acids in group g within the sliding window w. In this study, we used the default setting w = 5, the size of the sliding window is 5.
2.2.3. Pseudo-amino acid composition (Pse-AAC)
Considering both local and global sequence-order information of protein sequences, Pse-AAC [30] has been developed and widely used to represent protein sequences [31,32,33]. Pse-AAC expresses the protein sequence as a (20+λ)-dimensional feature vector, where the first 20 dimensions contain information about the composition of amino acids, while the last λ-dimensional vector represents a range of physicochemical properties. This method can effectively avoid the loss of information in amino acid order and the loss of physicochemical information in protein sequence. The Pse-AAC feature vector for a protein sequence can be formulated as:
where
fr(r=1,2,…,20) represents the normalized occurrence frequency of 20 natural amino acids in the protein sequence [27,31]. Parameter λ is the integer representing the top counted grade (or rank) of the correlation along a protein sequence, and λ = 30 is adopted in this section. In addition, w(ranging from 0 to 1) is a weight factor used to improve accuracy and is set to w = 0.05 in this work. θk(k=1,2,…,λ) is referred to as the j-tier correlation factor imaging the sequence-order correlation among all the j-th most contiguous residues along the protein chain. θk can be calculated as follows:
The correlation function θ(Ri,Ri+k) is given by:
where μ is the number of physical and chemical indices considered, and Ij(Ri) is the j-th physicochemical index value of the amino acid Ri. In this section, we employed thirty physicochemical properties, and so, μ is equal to 30 in Eq (6).
It should be noted that before replacing the physicochemical index values through Eq (6), the standard conversion described by the following formula is performed:
where I′j(Ai) is the j-th original physical and chemical value of the i-th amino acid Ai. xk and xm represent 20 natural amino acids (m,k=1,2,…,20). The amino acid "X" is omitted here. We apply the iLearn package [34] to calculate the Pse-AAC features.
2.2.4. Position-weight amino acid (PWAA) composition
PWAA [35] is the improvement of the traditional amino acid composition vector and can reflect the sequence position information of amino acid around the intermediate site. Therefore, this feature can avoid the loss of sequence-order information effectively. Given a natural amino acid Ai(i=1,2,…,20), the position information of Ai in a protein peptide fragment can be calculated as follows:
where D represents the number of upstream residues or downstream residues from the center site in a protein or peptide fragment. xi,j=1 if Ai is the j-th residue in a protein peptide fragment, otherwise xi,j=0. According to Eq (8), wi demonstrates the distance information betweenAi and intermediate site.
2.3. Model training
Random Forest (RF) [36] is an algorithm that integrates a lot of decision trees through the idea of ensemble learning, and has been widely used in computational biology, since it is non-parametric, efficient and interpretable. For example, it has been used in identifying protein succinylation sites [37,38,39], phosphorylation sites [40], glutarylation sites [41], et al..
The basic unit of random forest is a decision tree, which casts a unit vote for each subset of samples. Then the forest is constructed based on the majority voting strategy. In general, the number of trees has a great impact on the performance of the RF classification algorithm. Therefore, we search for optimal RF parameters in the training process, by setting the tree number from set{50,100,150,200,250,300}, respectively. The performance results are shown in Supplementary Table S1.
Support vector machine (SVM) is a classical machine learning method originally proposed in 1963 by Vapnik et al. [42], and has been widely used to solve data classification problems [11,43,44,45,46]. Based on the statistical learning theory, the main idea of SVM is to design a kernel function and look for the optimal separating maximum margin hyperplane which can differentiate between ubiquitylation sites and non-ubiquitylation sites. The Gaussian radial basis function (RBF) is adopted as the kernel function in this study. The RBF-SVM needs optimizing two key parameters: penalty parameter C and kernel parameter γ. For SVM algorithm, we apply grid search by setting C∈{22,23,24,25} and γ∈{0.01,0.1,1}, and finally select the pair of parameters which show the best prediction performance on the 10-fold cross validation.
The k-nearest neighbor (KNN) algorithm is the most common and simplest among all machine learning classifiers, and is also called lazy learning algorithm because it requires less training time [47,48,49]. KNN algorithm selects k training samples which are nearest to the input sample in the feature space according to certain decision rules. In this section, Euclidean distance is used as the measure of the difference between two data points and we assign the sample to be determined to the class label with the maximum voted class among these neighboring classes [50]. The Euclidean distance d between two samples x and y can be calculated through the following formula:
where N is the total number of feature dimensions, and xi, yi are the i-th dimension feature of x and y respectively.
It is important to select the optimal k value, which affects the performance of the KNN classifier significantly. We choose the optimal k value by 10-fold cross validation, and set the parameter k from set {3,5,7,9,11,13,15}.
The Naïve Bayes (NB) is an efficient statistical classification algorithm, constructing model through the joint probability P(x,y)=P(x|y)P(y). It achieves posterior probability P(y|x) based on Bayes theorem, and then the prediction is given according to the category label with the maximum posterior probability [50,51,52].
Despite its simplicity, the NB algorithm tends to outperform some more complex classification approaches, and is a widely used algorithm in bioinformatics researches. Because of its simple implementation, no iteration and high learning efficiency, Naïve Bayes algorithm has been extensively used in classification and other decision support applications [50,53,54].
2.4. Performance evaluation
To evaluate the performance of the constructed model and compare it with existing methods objectively, statistical analyses, including k-fold cross-validation tests and independent tests, are performed in this study. We also adopt four common performance measures including accuracy, precision, recall, and F1-score [24,55,56], which are defined as follows:
where TP, FP, TN, and FN indicate the number of true positives, false positives, true negatives and false negatives, respectively. In addition, we plot the receiver operating characteristic (ROC) curves and calculate the area under curve (AUC) values to further assess the performance of our model. The higher the AUC value is, the better the model performs.
3.
Results and discussion
3.1. Selection of the classification algorithm
To find the most suitable algorithm for distinguishing ubiquitination sites in plant, we firstly try four traditional algorithms, namely support vector machine (SVM), k-nearest neighbor (KNN), naive Bayes (NB), and random forest (RF). For each machine-learning algorithm, we perform 10-fold cross validation to search for optimal model parameters. The detailed performance results of each parameter combination are provided in supplementary Tables S1, S2 and S3.
When training the RF model, we set the number of decision trees as 50, 100, 150, 200, 250 and 300, separately, to screen the optimal number of decision trees. Table S1 demonstrates that the optimal accuracy achieves the best at 0.811 with tree number of 200. Moreover, other measurement precision, such as recall and F1-score, also reach the best values simultaneously.
During the process of SVM algorithm optimization, the penalty parameter C is selected from the set {22,23,24,25} and the kernel parameter γ is selected from the set {0.01,0.1,1}. Table S2 shows that the maximum accuracy achieves 0.747 by the parameter combinations C = 2^2, γ = 0.1 or C = 2^3, γ = 0.1.
For k-nearest neighbor algorithm, the parameter k is set as 3, 5, 7, 9, 11, 13 and 15 orderly. For each neighbor number k, the 10-fold cross-validation results are shown in Supplementary Table S3, where the k value of 11 achieves the best accuracy of 0.676.
We report the best performance for each classifier with the measures of accuracy, precision, recall, and F1-score in Table 2 and Figure 2. Moreover, we also plot their ROC curves and provide the corresponding AUC values, respectively, as shown in Figure 3. It is noted that the RF classification algorithm demonstrates its superiority to other classifiers on all of these measures. Specifically, the optimal model RF on 10-fold cross validation achieves an accuracy of 81.1%, precision of 81.0%, recall of 81.2%, F1-score of 0.811, and AUC of 0.888.
Due to their outperforming learning ability, deep learning algorithms have also been extensively applied in prediction realms [23,56,57,58,59]. We test two deep neural network architectures, deep neural network (DNN) and convolutional neural network (CNN) on our training data as well. The 10-fold cross-validation prediction results show that RF is superior to CNN and DNN once again, with accuracy of 0.811 versus 0.698 of CNN and 0.743 of DNN. As far as we know, deep learning frameworks tend to show good performance on large-scale data. The plant data size we used is not large enough for deep neural networks, but suitable for traditional machine learning algorithms. If enough ubiquitination sites are discovered, deep learning framework will still be the first choice in the future.
3.2. Comparison of UPFPSR with existing method CNN+word2vec on the independent test set
As far as we know, the most up-to-date ubiquitylation prediction tool CNN+word2vec [24] demonstrated the best performance compared to other state-of-the art methods [8,12,18,21,23]. Therefore, we only compare the predictive performance of UPFPSR with CNN+word2vec on the independent test dataset including 750 positive samples and 750 negative samples. The specific performance comparison results are shown in Figure 4 and Table S4. The results illustrate that between these two predictors, UPFPSR achieves a better predictive performance than CNN+word2vec on each measurement index. In particular, UPFPSR achieves the recall value of 81.7%, while the recall value of CNN+word2vec is 76.7%. In addition, UPFPSR improves the accuracy by 1.7% over CNN+word2vec. What is more, UPFPSR gives the AUC of 0.84 versus 0.81 of CNN+word2vec. In conclusion, all results demonstrate that our proposed model has high confidence on plant ubiquitylation site prediction and is more appropriate for recognizing the plant ubiquitylation site.
3.3. Amino acid preferences of ubiquitylation sites
We analyze the amino acid preferences around ubiquitylation sites as compared with non- ubiquitylation sites using the Two Sample Logo [60] web server and show the statistical results in Figure 5. The larger font indicates that this kind of amino acid is enriched in this position, which is statistically significant. It is clearly shown that arginine (R) and glutamic acid (E) are more likely to appear around ubiquitylated lysine than non-ubiquitylated lysine in plant, especially on the -11th to -1th and 1th to 10th positions. By contrast, serine (S) and lysine (K) are less likely to appear in ubiquitylated peptides than in non-ubiquitylated peptides. This analysis indicates that there is no obvious motif for ubiquitylated sites. Therefore, it is significant to build a model for the prediction of plant ubiquitylation sites.
4.
Results
In this study, a novel model named UPFPSR is developed to identify lysine ubiquitylation sites for plant. UPFPSR incorporates various sequence-based information including DBPB, EGAAC, Pse-AAC and PWAA. Rigorous benchmarking tests based on 10-fold cross validation and an independent test set have illustrated that this novel method is efficient and promising for improving the prediction of lysine ubiquitylation sites. But the overall prediction performance is less than 90%, the model still needs to be further improved. Besides the types of features adopted, more secondary and tertiary structure information of proteins should be considered in the future. In addition, more up-to-date deep learning neural networks ensemble with traditional algorithms can be employed for further researches.
Acknowledgments
This work is supported by the National Natural Science Foundation of China (No: 62071079 and 61803065).
Conflict of interest
The authors declare there is no conflict of interest.