Citation: Yunyun Liang, Shengli Zhang, Huijuan Qiao, Yinan Cheng. iEnhancer-MFGBDT: Identifying enhancers and their strength by fusing multiple features and gradient boosting decision tree[J]. Mathematical Biosciences and Engineering, 2021, 18(6): 8797-8814. doi: 10.3934/mbe.2021434
Abstract
An enhancer is a non-coding DNA fragment that can be bound by proteins to activate transcription of a gene, and hence plays an important role in regulating gene expression. Enhancer identification is very challenging, and more complicated than for other genetic factors, because of enhancers' positional variation and free scattering. In addition, it has been proved that genetic variation in enhancers is related to human diseases. Therefore, identifying enhancers and their strength has important biological meaning. In this paper, a novel model named iEnhancer-MFGBDT is developed to identify enhancers and their strength by fusing multiple features with a gradient boosting decision tree (GBDT). The multiple features include k-mer and reverse complement k-mer nucleotide composition based on the DNA sequence, and second-order moving average, normalized Moreau-Broto auto-cross correlation and Moran auto-cross correlation based on the dinucleotide physical structural property matrix. We then use GBDT to select features and to perform classification. The accuracies reach 78.67% and 66.04% for identifying enhancers and their strength on the benchmark dataset, respectively. Compared with other models, the results show that our model is a useful and effective intelligent tool to identify enhancers and their strength; the datasets and source codes are available at https://github.com/shengli0201/iEnhancer-MFGBDT1.
1.
Introduction
Enhancers are non-coding DNA fragments that regulate gene expression in both transcription and translation, and thereby the production of RNA and proteins [1]. Unlike promoters, the proximal elements of a gene, enhancers are distal elements that can be located up to 20 kb upstream or downstream of a gene, or even on a different chromosome [2]. Such locational variation makes the identification of enhancers challenging. Moreover, genetic variation in enhancers has been shown to be related to many human illnesses, such as cancer [3,4], disorders [4,5] and inflammatory bowel disease [6]. Genome-wide studies of histone modifications have shown that enhancers are a large group of functional elements with many different subgroups, such as strong and weak enhancers, and poised and inactive enhancers [7]. Because enhancers of different subgroups have different biological activities, understanding enhancers and their subgroups is an important task, especially the identification of enhancers and their strength.
Due to the importance of enhancers in genomics and disease, the identification of enhancers and their strength has become a popular topic in biological research. The pioneering works, carried out purely by experimental techniques, include chromatin immunoprecipitation followed by deep sequencing [8,9,10], DNase I hypersensitivity [11] and genome-wide mapping of histone modifications [12,13,14,15,16]. However, the experimental methods are expensive, time-consuming and of low accuracy. Therefore, several computational methods were developed to rapidly identify enhancers and their strength in genomes. In 2016, Liu et al. [2] developed a two-layer predictor, iEnhancer-2L, the first computational model for identifying not only enhancers but also their strength, using pseudo k-tuple nucleotide composition. In the same year, Jia et al. [17] proposed the EnhancerPred model, which fuses bi-profile Bayes and pseudo-nucleotide composition as multiple features with a two-step wrapper for feature selection, to distinguish enhancers from non-enhancers and to determine enhancers' strength. In 2018, Liu et al. [18] established the iEnhancer-EL model for identifying enhancers and their strength with an ensemble learning approach. In 2019, Nguyen et al. [19] put forward the iEnhancer-ECNN model to identify enhancers and their strength using ensembles of convolutional neural networks. In the same year, Tan et al. [20] used an ensemble of deep recurrent neural networks to identify enhancers via dinucleotide physicochemical properties, and Le et al. [21] developed the iEnhancer-5Step model to identify enhancers and their strength using hidden information of DNA sequences via Chou's 5-step rule and word embedding. In 2021, Basith et al. [22] proposed the Enhancer-IF model, an integrative machine learning (ML)-based framework for identifying cell-specific enhancers. In the same year, Cai et al. [23] established the iEnhancer-XG model, using XGBoost as the base classifier and k-spectrum profile, mismatch k-tuple, subsequence profile, position-specific scoring matrix and pseudo dinucleotide composition as feature extraction methods. Le et al. [24] used a transformer architecture based on BERT and a 2D convolutional neural network to identify DNA enhancers. Lim et al. [25] proposed the iEnhancer-RF model to identify enhancers and their strength by enhanced feature representation using random forest. However, the stability of these models still needs to be improved, especially for distinguishing strong enhancers from weak enhancers.
In this study, we develop a novel model named iEnhancer-MFGBDT to identify enhancers and their strength. Its first layer identifies whether a DNA sequence sample is an enhancer or not, while its second layer identifies whether the identified enhancer is strong or weak. We fuse k-mer and reverse complement k-mer nucleotide composition based on the DNA sequence, and second-order moving average, normalized Moreau-Broto auto-cross correlation and Moran auto-cross correlation based on the dinucleotide physical structural property matrix, obtaining a 902-dimensional feature vector for each enhancer sequence. Then, the gradient boosting decision tree (GBDT) algorithm is adopted both as the feature selection strategy and as the classifier. With 10-fold cross-validation, the accuracies for identifying enhancers and their strength are 78.67% and 66.04% on the benchmark dataset, and 77.50% and 68.50% on the independent dataset, respectively. The experimental results indicate that our model improves the accuracy of identifying enhancers and their strength, and is a useful supplementary tool.
2.
Materials and methods
2.1. Datasets
To facilitate comparison, in this study we adopt the benchmark dataset S constructed by Liu et al. [2], who collected 2968 enhancer sequences of 200 bp; the dataset can be formulated as
S = S+ ∪ S−,  S+ = S+strong ∪ S+weak,
(1)
where S+ contains 1484 enhancer sequences, S− contains 1484 non-enhancer sequences, S+strong contains 742 strong enhancer sequences, and S+weak contains 742 weak enhancer sequences; none of the enhancer DNA sequences has a pairwise sequence similarity of more than 80%.
2.2. Feature extraction
Suppose that a DNA enhancer sequence D with L nucleic acid residues is expressed by
D = B1B2⋯BL,
(2)
where Bi denotes the i-th nucleic acid residue of the DNA sequence at sequence position i. In this study, 902 multiple features are extracted by fusing k-mer nucleotide composition, reverse complementary k-mer, second-order moving average, normalized Moreau-Broto auto-cross correlation, and Moran auto-cross correlation based on the dinucleotide property matrix.
2.2.1. K-mer nucleotide composition
K-mer nucleotide composition is a basic feature extraction approach widely used in different fields of bioinformatics [26,27,28,29]. For an enhancer sequence with L nucleotides, the k-mer nucleotide compositions involve all possible subsequences of length k of the enhancer sequence. We slide along the enhancer sequence using a sliding window with a step size of one nucleotide. When a subsequence of the enhancer sequence matches the i-th k-mer, the occurrence number ni of that k-mer is incremented. The occurrence frequency fi of the i-th k-mer can be expressed by
fi = ni / (L − k + 1).
(3)
For each k, we can obtain 4^k k-mer features. Here we let k = 1, 2, 3; thus, each enhancer sequence yields a 4^1 + 4^2 + 4^3 = 84-dimensional k-mer feature vector.
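As a concrete sketch, the k-mer frequencies of Eq (3) can be computed as follows (the function name and the example sequence are ours, not from the paper):

```python
from itertools import product

def kmer_features(seq, ks=(1, 2, 3)):
    """Frequency f_i = n_i / (L - k + 1) of every k-mer, for k = 1, 2, 3."""
    feats = []
    for k in ks:
        # one counter per possible k-mer, in fixed lexicographic order
        counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
        total = len(seq) - k + 1
        for i in range(total):  # sliding window, step size 1
            counts[seq[i:i + k]] += 1
        feats.extend(counts[m] / total for m in sorted(counts))
    return feats

print(len(kmer_features("ACGTACGTACGT")))  # 4 + 16 + 64 = 84
```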
2.2.2. Reverse complementary k-mer
The reverse complementary k-mer, abbreviated RevKmer, is a variant of the basic k-mer in which the k-mers are not expected to be strand-specific, so each k-mer and its reverse complement are collapsed into a single feature. For example, when k = 2 there are 16 basic k-mers in total, but after collapsing reverse complements only 10 distinct dinucleotides (AA, AC, AG, AT, CA, CC, CG, GA, GC and TA) are retained. In other words, we obtain 10 reverse complementary 2-mer features. Letting k = 1, 2, 3, a total of 2 + 10 + 32 = 44 RevKmer features are extracted, which can be calculated with the web server Pse-in-One 2.0 [30].
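The collapsing of each k-mer with its reverse complement can be sketched as follows (an illustrative re-implementation only; the paper itself uses the Pse-in-One 2.0 server, and the helper names are ours):

```python
from itertools import product

COMP = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Collapse a k-mer and its reverse complement into one representative."""
    rc = kmer.translate(COMP)[::-1]
    return min(kmer, rc)

def revkmer_features(seq, ks=(1, 2, 3)):
    feats = []
    for k in ks:
        canon = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
        total = len(seq) - k + 1
        counts = dict.fromkeys(canon, 0)
        for i in range(total):
            counts[canonical(seq[i:i + k])] += 1
        feats.extend(counts[c] / total for c in canon)
    return feats

print(len(revkmer_features("ACGTACGT")))  # 2 + 10 + 32 = 44
```

For k = 2 this retains exactly the 10 dinucleotides listed above, since, e.g., TT collapses onto AA and GT onto AC.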
2.2.3. Second-order moving average based on dinucleotide property matrix
As has been reported, DNA physicochemical properties play a crucial role in gene expression regulation and genome analysis, and are also closely correlated with functional non-coding elements [31,32,33]. In this study, six dinucleotide physical structural properties are adopted, including the three local translational parameters shift, slide and rise, and the three local angular parameters twist, tilt and roll [34]. The values of the six DNA dinucleotide physical structural properties are shown in Table 1. Each physical structural property is normalized, to reduce bias and noise, by the following formula
P′ = (P − Pmin) / (Pmax − Pmin),
(4)
where P is the original value of the property, and Pmin and Pmax are the minimum and maximum property values, respectively.
Table 1.
The original values of the six physical structural properties for the 16 dinucleotides in DNA.
A DNA sequence is a polymer of the four nucleotides A, C, G and T. Any combination of two nucleotides is called a dinucleotide; hence, there are 4 × 4 = 16 basic dinucleotides in total. First, each dinucleotide in a DNA sequence is replaced by the value of a physical structural property. Then, each DNA sequence in the datasets can be converted into a matrix P = (pi,j)(L−1)×6, named the dinucleotide property matrix (DPM), where L represents the number of nucleic acid residues in the DNA sequence and pi,j represents the value of the i-th dinucleotide for the j-th physical structural property.
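A sketch of the DPM construction with the Eq (4) normalization is shown below. Note that the property values here are random placeholders, since the real per-dinucleotide values come from Table 1 (ref [34]), and normalizing each property over its 16 table entries is our reading of Eq (4):

```python
import numpy as np

DINUCS = [a + b for a in "ACGT" for b in "ACGT"]  # 16 basic dinucleotides

# Placeholder table: the real values of the 6 structural properties (shift,
# slide, rise, twist, tilt, roll) per dinucleotide come from Table 1.
rng = np.random.default_rng(0)
PROPS = {d: rng.random(6) for d in DINUCS}

def dpm(seq):
    """Dinucleotide property matrix P of shape (L-1, 6), with each property
    min-max normalized over its 16 table values (Eq 4)."""
    table = np.array([PROPS[d] for d in DINUCS])
    pmin, pmax = table.min(axis=0), table.max(axis=0)
    P = np.array([PROPS[seq[i:i + 2]] for i in range(len(seq) - 1)])
    return (P - pmin) / (pmax - pmin)

print(dpm("ACGTACGT").shape)  # (7, 6)
```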
The second-order moving average (SOMA) algorithm was proposed by Alessio et al. [35] and is defined by fusing the ideas of the moving average and the second-order difference. SOMA mainly investigates the long-range correlation properties of a stochastic time series.
Let a discrete stochastic time series be y(i),i=1,2,⋯,L, where L is the size of the stochastic series y(i). The algorithm of the SOMA is described as follows
Step 1. Calculate the moving average ˜yn(i) of the time series y(i) as
ỹn(i) = (1/n) ∑_{k=0}^{n−1} y(i − k),
(5)
where n is the moving average window. When n = 1, ỹn(i) = y(i).
Step 2. For a given moving average window n, 2⩽n<L, the second-order difference between the y(i) and ˜yn(i) is defined by
σ²MA = (1/(L − n)) ∑_{i=n}^{L} [y(i) − ỹn(i)]²,
(6)
where σ²MA quantifies the fluctuations of y(i) around ỹn(i), and is called the second-order moving average.
A dinucleotide property matrix contains 6 columns, and each column is treated as a time series; in other words, a dinucleotide property matrix contains 6 time series. Hence, for a given moving average window n, each enhancer DNA sequence is represented by 6 SOMA features. Letting n = 2, 3, ⋯, 10, we construct a 6 × 9 = 54-dimensional SOMA-DPM feature vector for each enhancer sequence.
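The SOMA-DPM construction above can be sketched as follows (function names are ours); each of the 6 DPM columns is scored with Eqs (5)-(6) for every window n = 2, …, 10:

```python
import numpy as np

def soma(y, n):
    """sigma^2_MA of Eq (6): mean squared deviation of y from its
    backward moving average of window n (Eq 5)."""
    L = len(y)
    # residual y(i) - ỹ_n(i) for i = n .. L (0-based: n-1 .. L-1)
    resid = [y[i] - np.mean(y[i - n + 1:i + 1]) for i in range(n - 1, L)]
    return np.sum(np.square(resid)) / (L - n)

def soma_dpm_features(P, windows=range(2, 11)):
    """6 columns x 9 windows = 54 SOMA-DPM features per sequence."""
    return [soma(P[:, j], n) for n in windows for j in range(P.shape[1])]

P = np.random.default_rng(1).random((199, 6))  # a 200 bp sequence gives 199 rows
print(len(soma_dpm_features(P)))  # 54
```

As a sanity check, a constant series has zero deviation from its moving average, so its SOMA value is 0.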
2.2.4. Moreau-Broto auto-cross correlation based on dinucleotide property matrix
The normalized Moreau-Broto auto-cross correlation (NMBACC) [36] based on the dinucleotide property matrix, which extracts global sequence information, can be described by
NMBACC(s, t, λ) = (1/(L − 1 − λ)) ∑_{i=1}^{L−1−λ} pi,s pi+λ,t,
(7)
where λ is the lag of the auto-cross correlation along the columns of the dinucleotide property matrix, pi,s represents the value at the i-th row of the s-th column (s-th property index), and pi+λ,t represents the value at the (i+λ)-th row of the t-th column (t-th property index). When s = t, NMBACC(s, s, λ) represents the auto-correlation of a single property; when s ≠ t, NMBACC(s, t, λ) represents the cross-correlation between different properties. Here we let λ = 1, 2, 3, ⋯, 10; finally, each enhancer sequence obtains a 6 × 6 × 10 = 360-dimensional NMBACC-DPM feature vector.
2.2.5. Moran auto-cross correlation based on dinucleotide property matrix
The Moran auto-cross correlation (MACC) [37] based on the dinucleotide property matrix, which extracts global sequence information, can be described by
MACC(s, t, λ) = [(1/(L − 1 − λ)) ∑_{i=1}^{L−1−λ} (pi,s − p̄s)(pi+λ,t − p̄t)] / [(1/(L − 1)) ∑_{i=1}^{L−1} (pi,s − p̄s)²],
(8)
where λ is the lag along the columns of the dinucleotide property matrix, pi,s and pi,t represent the values at the i-th row of the s-th and t-th columns, respectively, pi+λ,t represents the value at the (i+λ)-th row of the t-th column, and p̄s and p̄t are the average values of the s-th and t-th columns, respectively. When s = t, MACC(s, s, λ) represents the auto-correlation of a single property; when s ≠ t, MACC(s, t, λ) represents the cross-correlation between different properties. Here we let λ = 1, 2, 3, ⋯, 10; finally, each enhancer sequence obtains a 6 × 6 × 10 = 360-dimensional MACC-DPM feature vector.
2.3. Gradient boosting decision tree
Gradient boosting decision tree (GBDT) is a boosting algorithm that uses decision trees as base learners; it was proposed by Friedman in 2001 [38,39]. In each iteration it builds a decision tree that reduces the residual of the current model in the gradient direction, then linearly combines that decision tree with the current model to obtain a new model. GBDT repeats this iteration until the number of decision trees reaches a specified value, yielding the final strong learner. GBDT is commonly used for regression, classification and feature selection. Its advantages include: (a) it flexibly processes various types of data, both continuous and discrete; (b) it has powerful predictive and generalization ability; (c) it has good interpretability and robustness, can automatically discover high-order relationships between features, and does not require data normalization or other preprocessing.
The GBDT classification algorithm process is as follows
Input: training dataset D={(x1,y1),(x2,y2),⋯,(xm,ym)}. Suppose that the maximum iteration number is T, the loss function is L(y,f(x)), and m is the number of samples.
(1) Initialize the weak classifier as follows
f0(x) = argmin_c ∑_{i=1}^{m} L(yi, c),
(9)
where c is the constant value that minimizes the loss function; that is, f0(x) is a tree with only one root node.
(2) For t=1 to T
a. For i = 1 to m, calculate the negative gradient as follows
rti = −[∂L(yi, f(xi)) / ∂f(xi)]_{f(x) = ft−1(x)},
(10)
where the loss function is L(y, f(x)) = log(1 + exp(−y f(x))), y ∈ {−1, 1}.
b. Use (xi, rti), i = 1, 2, ⋯, m, to fit a CART regression tree and obtain the t-th regression tree, whose corresponding leaf node areas are Rtj, j = 1, 2, ⋯, J, where J is the number of leaf nodes of regression tree t.
c. For leaf node area j=1,2,⋯,J, calculate the best residual fitting value as follows
ctj = argmin_c ∑_{xi∈Rtj} log[1 + exp(−yi(ft−1(xi) + c))].
(11)
As the above equation is difficult to optimize, ctj is generally replaced by the approximate value as
ctj = ∑_{xi∈Rtj} rti / ∑_{xi∈Rtj} |rti| (1 − |rti|).
(12)
d. Update the strong classifier by
ft(x) = ft−1(x) + ∑_{j=1}^{J} ctj I(x ∈ Rtj).
(13)
(3) Get the final strong classifier f(x) by
f(x) = fT(x) = ∑_{t=1}^{T} ∑_{j=1}^{J} ctj I(x ∈ Rtj).
(14)
Output: fT(x).
GBDT can be used not only for classification but also for feature selection, by calculating the Gini index. Features are ranked in descending order of Gini-based importance, and the first k features can be selected as needed. In this study, we adopt GBDT to carry out both feature selection and classification.
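The select-then-classify scheme can be sketched with scikit-learn (an illustration only: the data below are random stand-ins for the 902-dimensional feature matrix, and the hyperparameters are not the paper's settings):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score

# Random stand-in for the 902-dimensional fused feature matrix and labels.
rng = np.random.default_rng(0)
X = rng.random((200, 902))
y = rng.integers(0, 2, 200)

# Step 1: rank features by GBDT importance and keep those above the mean
# importance (a "mean" threshold).
selector = SelectFromModel(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    threshold="mean",
).fit(X, y)
X_sel = selector.transform(X)

# Step 2: classify on the selected features with 10-fold cross-validation.
scores = cross_val_score(
    GradientBoostingClassifier(n_estimators=50, random_state=0), X_sel, y, cv=10
)
print(X_sel.shape[1], round(float(scores.mean()), 3))
```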
2.4. Cross-validation and performance assessment
To save computational time, 10-fold cross-validation is carried out for each feature set to evaluate identification performance in this study. The dataset is randomly divided into ten subsets of approximately equal size, so the ratio of the testing set to the training set is 1:9. Each subset is in turn taken as the test set while the remaining nine subsets train the GBDT classifier, and the average of the ten validation results is used for performance evaluation. The k-fold cross-validation approach improves the reliability of evaluation because all of the original data are considered and each subset is tested exactly once.
To make an objective and comprehensive evaluation, we employ several performance measures [40,41,42,43], including sensitivity (Sn), specificity (Sp), accuracy (Acc) and the Matthews correlation coefficient (MCC). The MCC value ranges from −1 to 1, while the other three measures range from 0 to 1. They can be formulated as
Sn = 1 − N+−/N+,
Sp = 1 − N−+/N−,
Acc = 1 − (N+− + N−+)/(N+ + N−),
MCC = [1 − (N+−/N+ + N−+/N−)] / √[(1 + (N−+ − N+−)/N+)(1 + (N+− − N−+)/N−)],
where N+ represents the total number of true enhancer sequences investigated, N+− represents the number of true enhancer sequences incorrectly identified as non-enhancer sequences, N− represents the total number of non-enhancer sequences investigated, and N−+ represents the number of non-enhancer sequences incorrectly identified as enhancer sequences.
We also employ the receiver operating characteristic (ROC) curve [44] and the area under the ROC curve (AUC) [45] to evaluate our model. The ROC curve plots the true positive rate (sensitivity) as a function of the false positive rate (1 − specificity) over all possible thresholds. The closer the ROC curve is to the upper left corner, the better the identification performance; in other words, the closer the AUC is to 1, the better the identification system.
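A minimal sketch of the four measures in terms of the error counts defined above (the function name is ours; this Chou-style formulation is equivalent to the usual confusion-matrix definitions):

```python
import numpy as np

def metrics(y_true, y_pred):
    """Sn, Sp, Acc, MCC from N+, N-, N+_- (false negatives) and
    N-_+ (false positives)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n_pos = int((y_true == 1).sum())                 # N+
    n_neg = int((y_true == 0).sum())                 # N-
    fn = int(((y_true == 1) & (y_pred == 0)).sum())  # N+_-
    fp = int(((y_true == 0) & (y_pred == 1)).sum())  # N-_+
    sn = 1 - fn / n_pos
    sp = 1 - fp / n_neg
    acc = 1 - (fn + fp) / (n_pos + n_neg)
    mcc = (1 - (fn / n_pos + fp / n_neg)) / np.sqrt(
        (1 + (fp - fn) / n_pos) * (1 + (fn - fp) / n_neg)
    )
    return sn, sp, acc, float(mcc)

# one false negative and one false positive out of 3 + 3 samples
print(metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1]))
```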
3.
Results and discussion
3.1. Identification performance on the benchmark dataset
Identifying enhancers is a binary classification problem organized in two layers: the first layer identifies whether a DNA sequence is an enhancer or not, while the second layer identifies an enhancer sequence as a strong or weak enhancer. In this study, a novel model, iEnhancer-MFGBDT, is proposed using multiple features and a gradient boosting decision tree. First, 902 multi-features are extracted for both layers for each enhancer sequence, comprising 84 k-mer features, 44 RevKmer features, 54 SOMA-DPM features, 360 NMBACC-DPM features and 360 MACC-DPM features. Next, 156 features for the first layer and 263 features for the second layer are selected from the 902 multi-features with the GBDT algorithm via the Gini index. Finally, the GBDT classifier is adopted to implement classification using 10-fold cross-validation. Figure 1 shows the operating flow of the iEnhancer-MFGBDT model.
Figure 1.
The flowchart of the iEnhancer-MFGBDT model.
Identification results of our iEnhancer-MFGBDT model with 10-fold cross-validation on the benchmark dataset are shown in Table 2. From Table 2, we can see that the accuracy reaches 78.67% and 66.04% for the first and second layers, respectively. Meanwhile, Sn, Sp and MCC reach 77.54%, 79.78% and 0.5735 for the first layer, and 70.56%, 61.63% and 0.3232 for the second layer. The AUC indicates the probability that the model ranks a randomly selected positive sample higher than a randomly selected negative sample; in fact, the AUC measures the overall performance of an identification system. The ROC curves, plotted for both the first and second layers, are shown in Figure 2. The AUC values on the benchmark dataset are 0.8615 and 0.7187 for the first and second layers, respectively. Obviously, the second layer is more difficult to identify than the first layer, owing to enhancers' positional variation and free scattering.
Table 2.
The identification performance of iEnhancer-MFGBDT with 10-fold cross validation on the benchmark dataset.
3.2. Feature group analysis
In this study, we adopt five different approaches to extract features from the benchmark dataset, named the K-mer, RevKmer, SOMA-DPM, NMBACC-DPM and MACC-DPM feature groups, respectively. To gauge the importance of each single feature group, we calculate the performance of K-mer, RevKmer, SOMA-DPM, NMBACC-DPM and MACC-DPM individually, as shown in Table 3. The accuracy of every single feature group is lower than that of the multiple features after GBDT feature selection (MGBDT) for both layers; therefore, the fusion of multiple features is necessary. From Table 3, we can see that for the first layer the best identification performance comes from K-mer, followed by RevKmer, NMBACC-DPM and SOMA-DPM in turn, with MACC-DPM the lowest. For the second layer, the best identification performance comes from RevKmer, followed by K-mer, SOMA-DPM and MACC-DPM in turn, with NMBACC-DPM the lowest. Among these five feature groups, K-mer and RevKmer are feature extraction methods based on the DNA sequence, while SOMA-DPM, NMBACC-DPM and MACC-DPM are based on the physical structural properties of DNA dinucleotides. Evidently, the DNA sequence-based feature groups are superior to the physical structural property-based feature groups.
Table 3.
Feature group analysis of iEnhancer-MFGBDT with 10-fold cross validation on the benchmark dataset.
3.3. Comparison with feature selection and without feature selection
We construct 902 features by fusing multiple feature groups, but such a large dimension can decrease predictive performance, burden the computation and introduce information redundancy. Feature selection can help the classification system achieve better predictive performance at lower computational cost by removing redundant features; hence, finding a suitable dimension reduction method is very important. For GBDT, features are ranked in descending order of Gini-based importance; here, we use "mean" as the threshold and "gini" as the criterion for feature selection. Figure 3 shows the accuracy comparison between our model with and without feature selection. The accuracies are improved for both layers on the benchmark dataset, which clearly shows that the GBDT feature selection method has a great effect on improving accuracy: the accuracy is improved by 1.35% for the first layer and 5.87% for the second layer. These experimental results show that GBDT feature selection is very effective on the benchmark dataset.
Figure 3.
Identification accuracy comparison between with feature selection and without feature selection on the benchmark dataset.
To demonstrate the superiority of the GBDT classifier, support vector machine (SVM), extra trees (ET), random forest (RF) and Bagging classifiers are tested in turn on the features selected by GBDT, using 10-fold cross-validation. As shown in Figure 4, the identification accuracies of SVM, ET, RF and Bagging reach 75.64%, 77.02%, 77.15% and 76.75% for the first layer, and 60.04%, 62.47%, 65.02% and 64.75% for the second layer, respectively. The identification accuracy of GBDT reaches 78.67% and 66.04% for the first and second layers, respectively; thus, from Figure 4, the accuracies of SVM, ET, RF and Bagging are all lower than those obtained by GBDT for both layers. The results show that GBDT is more powerful on our benchmark dataset than the other classifiers.
Figure 4.
Identification accuracy comparison with different classifiers.
To avoid experimental bias, it is persuasive to use an independent dataset to evaluate our model objectively. We adopt the independent dataset also constructed by Liu et al. [2], which contains 400 enhancer sequences of 200 bp: 100 strong enhancer sequences, 100 weak enhancer sequences and 200 non-enhancer sequences, with pairwise sequence similarity less than or equal to 80%. The results obtained by the proposed model using 10-fold cross-validation on the independent dataset are given in Table 4. For the first layer, the Acc, Sn, Sp, MCC and AUC reach 77.50%, 76.79%, 79.55%, 0.5607 and 0.8589, respectively; for the second layer, they reach 68.50%, 72.55%, 66.81%, 0.3862 and 0.7524, respectively. These values further illustrate the effectiveness of our model.
Table 4.
The identification performance of iEnhancer-MFGBDT with 10-fold cross validation on the independent dataset.
The proposed iEnhancer-MFGBDT model is compared with eight state-of-the-art models: iEnhancer-2L [2], EnhancerPred [17], iEnhancer-EL [18], iEnhancer-ECNN [19], Tan et al. [20], iEnhancer-XG [23], BERT-2D CNNs [24] and iEnhancer-RF [25]. The values of Acc, Sn, Sp and MCC are listed in Tables 5 and 6.
Table 5.
The comparison with other methods in identifying enhancers and their strength on the benchmark dataset.
For the benchmark dataset, the iEnhancer-2L, EnhancerPred, iEnhancer-EL, Tan et al., iEnhancer-XG and iEnhancer-RF models are adopted for comparison on both layers; their Acc, Sn, Sp and MCC values are listed in Table 5. Among the six models, the accuracy of our model is lower than that of iEnhancer-XG for both layers, but the stability of our model is higher. Our model's accuracy is 1.78%, 5.49%, 0.64%, 3.84% and 2.49% higher than that of iEnhancer-2L, EnhancerPred, iEnhancer-EL, Tan et al. and iEnhancer-RF for the first layer, respectively, and 4.11%, 3.98%, 1.01%, 7.08% and 3.51% higher for the second layer, respectively. As Table 5 shows in terms of Sn, Sp and MCC, our model offers a strong balance of performance and stability.
For the independent dataset, the iEnhancer-2L, EnhancerPred, iEnhancer-EL, iEnhancer-ECNN, Tan et al., iEnhancer-XG and BERT-2D CNNs models are adopted for comparison on the first layer; their Acc, Sn, Sp and MCC values are listed in Table 6. The accuracy is improved by 0.6%-4.5% for the first layer. From Table 6, the iEnhancer-2L, EnhancerPred, iEnhancer-EL, iEnhancer-ECNN, Tan et al. and iEnhancer-XG models are adopted for comparison on the second layer, where the accuracy is improved by 0.01%-13.5%. These test results show that iEnhancer-MFGBDT performs best on the independent dataset; our model achieves markedly better results than the other existing models and delivers a considerable improvement in performance.
4.
Conclusions
In this study, an effective computational tool called iEnhancer-MFGBDT has been developed for the identification of DNA enhancers and their strength. The iEnhancer-MFGBDT model is established by fusing multiple features with GBDT and evaluated with 10-fold cross-validation. Compared with existing models, our model obtains satisfactory accuracies for the first and second layers on both the benchmark and independent datasets. It is anticipated that iEnhancer-MFGBDT will become a very useful high-throughput tool for enhancer research or, at the least, play an important complementary role to existing models. As pointed out by Chou and Shen [46], user-friendly and publicly accessible web servers represent the future direction for developing practically useful computational tools and have an increasing impact on medical science [47]. In the future, we will make great efforts to establish a web server for the iEnhancer-MFGBDT model to facilitate communication among colleagues in bioinformatics.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (No. 12101480), the Natural Science Basic Research Program of Shaanxi (Nos.2021JM-115, 2021JM-444), and the Fundamental Research Funds for the Central Universities (No. JB210715).
Conflict of interest
The authors declare no conflict of interest.
References
[1]
N. Omar, W. Y. Shiong, L. Xi, C. C Yee Ling, M. T. D. Abdullah, N. K. Lee, Enhancer prediction in proboscis monkey genome: A comparative study, J. Telecom. Electron. Computer Eng., 9 (2017), 175-179.
[2]
B. Liu, L. Y. Fang, R. Long, X. Lan, K. C. Chou, iEnhancer-2L: A two-layer predictor for identifying enhancers and their strength by pseudo k-tuple nucleotide composition, Bioinformatics, 32 (2016), 362-369. doi: 10.1093/bioinformatics/btv604
[3]
H. M. Herz, Enhancer deregulation in cancer and other diseases, Bioessays, 38 (2016), 1003-1015. doi: 10.1002/bies.201600106
[4]
G. Zhang, J. Shi, S. Zhu, Y. Lan, L. Xu, H. Yuan, et al., DiseaseEnhancer: A resource of human disease-associated enhancer catalog, Nucleic Acids Res., 46 (2018), D78-D84.
[5]
O. Corradin, P. C. Scacheri, Enhancer variants: Evaluating functions in common disease, Genome Med., 6 (2014), 85.
[6]
M. Boyd, M. Thodberg, M. Vitezic, J. Bornholdt, K. Vitting-Seerup, Y. Chen, et al., Characterization of the enhancer and promoter landscape of inflammatory bowel disease from human colon biopsies, Nat. Commun., 9 (2018), 1661.
[7]
D. Shlyueva, G. Stampfel, A. Stark, Transcriptional enhancers: from properties to genome-wide predictions, Nat. Rev. Genet., 15 (2014), 272-286. doi: 10.1038/nrg3682
[8]
N. D. Heintzman, B. Ren, Finding distal regulatory elements in the human genome, Curr. Opin. Genet. Dev., 19 (2009), 541-549. doi: 10.1016/j.gde.2009.09.006
[9]
N. D. Heintzman, R. K. Stuart, G. Hon, Y. T. Fu, C. W. Ching, R. D. Hawkins, et al., Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome, Nat. Genet., 39 (2007), 311-318.
[10]
A. Visel, M. J. Blow, Z. R. Li, T. Zhang, J. A. Akiyama, A. Holt, et al., ChIP-seq accurately predicts tissue-specific activity of enhancers, Nature, 457 (2009), 854-858.
[11]
A. P. Boyle, L. Y. Song, B. K. Lee, D. London, D. Keefe, E. Birney, et al., High-resolution genome-wide in vivo footprinting of diverse transcription factors in human cells, Genome Res., 21 (2011), 456-464.
[12]
J. Ernst, P. Kheradpour, T. S. Mikkelsen, N. Shoresh, L. D. Ward, C. B. Epstein, et al., Mapping and analysis of chromatin state dynamics in nine human cell types, Nature, 473 (2011), 43-49.
[13]
G. D. Erwin, N. Oksenberg, R. M. Truty, D. Kostka, K. K. Murphy, N. Ahituv, et al., Integrating diverse datasets improves developmental enhancer prediction, PLoS Comput. Biol., 10 (2014), e1003677.
[14]
M. Fernandez, D. Miranda-Saavedra, Genome-wide enhancer prediction from epigenetic signatures using genetic algorithm-optimized support vector machine, Nucleic Acids Res., 40 (2012), e77.
[15]
H. A. Firpi, D. Ucar, K. Tan, Discover regulatory DNA elements using chromatin signatures and artificial neural network, Bioinformatics, 26 (2010), 1579-1586. doi: 10.1093/bioinformatics/btq248
[16]
N. Rajagopal, W. Xie, Y. Li, U. Wagner, W. Wang, J. Stamatoyannopoulos, et al., RFECS: A random-forest based algorithm for enhancer identification from chromatin state, PLoS Comput. Biol., 9 (2013), e1002968.
[17]
C. Z. Jia, W. Y. He, EnhancerPred: A predictor for discovering enhancers based on the combination and selection of multiple features, Sci. Rep., 6 (2016), 38741.
[18]
B. Liu, K. Li, D. S. Huang, K. C. Chou, iEnhancer-EL: Identifying enhancers and their strength with ensemble learning approach, Bioinformatics, 34 (2018), 3835-3842. doi: 10.1093/bioinformatics/bty458
[19]
Q. H. Nguyen, T. Nguyen-Vo, N. Q. K. Le, T. T. T. DO, S. Raharja, B. P. Nguyen, iEnhancer-ECNN: Identifying enhancers and their strength using ensemble of convolutional neural networks, BMC Genom., 20 (2019), 951.
[20]
K. K. Tan, N. Q. K. Le, H. Y. Yeh, M. C. H. Chua, Ensemble of deep recurrent neural networks for identifying enhancers via dinucleotide physicochemical properties, Cells, 8 (2019), 767.
[21]
N. Q. K. Le, E. K. Y. Yapp, Q. T. Ho, N. Nagasundaram, Y. Y. Ou, H. Y. Yeha, iEnhancer-5Step: Identifying enhancers using hidden information of DNA sequences via Chou's 5-step rule and word embedding, Anal. Biochem., 571 (2019), 53-61. doi: 10.1016/j.ab.2019.02.017
[22]
S. Basith, M. M. Hasan, G. Lee, L. Y. Wei, B. Manavalan, Integrative machine learning framework for the identification of cell-specific enhancers from the human genome, Brief. Bioinform., (2021), 1-13. doi: 10.1093/bib/bbab252
[23]
L. J. Cai, X. B. Ren, X. Z. Fu, L. Peng, M. Y. Gao, X. X. Zeng, iEnhancer-XG: Interpretable sequence-based enhancers and their strength predictor, Bioinformatics, 37 (2021), 1060-1067. doi: 10.1093/bioinformatics/btaa914
[24]
N. Q. K. Le, Q. T. Ho, T. T. D. Nguyen, Y. Y. Ou, A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information, Brief. Bioinform., 22 (2021), 1-7. doi: 10.1093/bib/bbaa398
[25]
D. Y. Lim, J. Khanal, H. Tayara, K. T. Chong, iEnhancer-RF: Identifying enhancers and their strength by enhanced feature representation using random forest, Chemometr. Intell. Lab., 212 (2021), 104284.
[26]
W. He, Y. Ju, X. Zeng, X. Liu, Q. Zou, Sc-ncDNAPred: A sequence-based predictor for identifying non-coding DNA in Saccharomyces cerevisiae, Front. Microbiol., 9 (2018), 2174.
[27]
C. S. Kim, M. D. Winn, V. Sachdeva, K. E. Jordan, K-mer clustering algorithm using a MapReduce framework: Application to the parallelization of the Inchworm module of Trinity, BMC Bioinform., 18 (2017), 467.
[28]
J. Matias Rodrigues, T. S. Schmidt, J. Tackmann, C. von Mering, MAPseq: Highly efficient k-mer search with confidence estimates, for rRNA sequence analysis, Bioinformatics, 33 (2017), 3808-3810.
[29]
J. S. Wang, S. L. Zhang, PA-PseU: An incremental passive-aggressive based method for identifying RNA pseudouridine sites via Chou's 5-steps rule, Chemometr. Intell. Lab., 210 (2021), 104250.
[30]
B. Liu, H. Wu, K. C. Chou, An improved package of web servers for generating various modes of pseudo components of DNA, RNA, and protein sequences, Natural Sci., 4 (2017), 67-91.
[31]
B. Liu, S. Y. Wang, R. Long, K. C. Chou, iRSpot-EL: Identify recombination spots with an ensemble learning approach, Bioinformatics, 33 (2017), 35-41. doi: 10.1093/bioinformatics/btw539
[32]
Y. Y. Yao, S. L. Zhang, Y. Y. Liang, iORI-ENST: Identifying origin of replication sites based on elastic net and stacking learning, SAR QSAR Environ. Res., 32 (2021), 317-331. doi: 10.1080/1062936X.2021.1895884
[33]
Z. Liu, X. Xiao, D. J. Yu, J. H. Jia, W. R. Qiu, K. C. Chou, pRNAm-PC: Predicting N6-methyladenosine sites in RNA sequences via physical-chemical properties, Anal. Biochem., 497 (2016), 60-67. doi: 10.1016/j.ab.2015.12.017
[34]
R. E. Dickerson, Definitions and nomenclature of nucleic acid structure components, Nucleic Acids Res., 17 (1989), 1797-1803. doi: 10.1093/nar/17.5.1797
[35]
E. Alessio, A. Carbone, G. Castelli, V. Frappietro, Second-order moving average and scaling of stochastic time series, Eur. Phys. J. B, 27 (2002), 197-200.
[36]
Y. Y. Liang, S. L. Zhang, Identify Gram-negative bacterial secreted protein types by incorporating different modes of PSSM into Chou's general PseAAC via Kullback-Leibler divergence, J. Theor. Biol., 454 (2018), 22-29. doi: 10.1016/j.jtbi.2018.05.035
[37]
S. L. Zhang, T. Xue, Use Chou's 5 steps rule to identify DNase I hypersensitive sites via dinucleotide property matrix and extreme gradient boosting, Mol. Genet. Genom., 295 (2020), 1431-1442. doi: 10.1007/s00438-020-01711-8
[38]
J. H. Friedman, Greedy Function Approximation: A Gradient Boosting Machine, Ann. Stat., 29 (2001), 1189-1232. doi: 10.1214/aos/1013203450
[39]
N. Alexey, K. Alois, Gradient boosting machines, a tutorial, Front. Neurorobot., 7 (2013), 21.
[40]
B. Manavalan, S. Basith, T. H. Shin, L. Wei, G. Lee, mAHTPred: A sequence-based meta-predictor for improving the prediction of anti-hypertensive peptides using effective feature representation, Bioinformatics, 35 (2019), 2757-2765. doi: 10.1093/bioinformatics/bty1047
[41]
J. H. Jia, Z. Liu, X. Xiao, B. X. Liu, K. C. Chou, iPPI-Esml: An ensemble classifier for identifying the interactions of proteins by incorporating their physicochemical properties and wavelet transforms into PseAAC, J. Theor. Biol., 377 (2015), 47-56. doi: 10.1016/j.jtbi.2015.04.011
[42]
B. Liu, K. Li, D. S. Huang, K. C. Chou, iEnhancer-EL: Identifying enhancers and their strength with ensemble learning approach, Bioinformatics, 34 (2018), 3835-3842. doi: 10.1093/bioinformatics/bty458
[43]
S. Basith, B. Manavalan, T. H. Shin, G. Lee, iGHBP: Computational identification of growth hormone binding proteins from sequences using extremely randomised tree, Comput. Struct. Biotec., 16 (2018), 412-420. doi: 10.1016/j.csbj.2018.10.007
[44]
T. Fawcett, An introduction to ROC analysis, Pattern Recogn. Lett., 27 (2006), 861-874.
[45]
A. P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recogn., 30 (1997), 1145-1159. doi: 10.1016/S0031-3203(96)00142-2
[46]
K. C. Chou, H. B. Shen, Review: Recent advances in developing web-servers for predicting protein attributes, Natural Sci., 1 (2009), 63-92. doi: 10.4236/ns.2009.12011
[47]
K. C. Chou, Impacts of bioinformatics to medicinal chemistry, Med. Chem., 11 (2015), 218-234. doi: 10.2174/1573406411666141229162834