
To address the complexity, low efficiency, and high cost of wet-lab experimental methods for identifying conotoxin types, this study proposes a method that combines the sequence information of conotoxin peptides with a long short-term memory (LSTM) network to predict the conotoxin category. The method takes only the conotoxin peptide sequence as input and uses the character embedding technique from text processing to map the sequence automatically to a feature vector representation, from which the model extracts features for training and prediction. Experimental results show that the correct index of this method on the test set reaches 0.80 and the AUC reaches 0.817. On the same test set, the AUC of the KNN algorithm is 0.641, indicating that the proposed method can effectively assist in identifying the type of conotoxin.
Citation: Feng Wang, Shan Chang, Dashun Wei. Prediction of conotoxin type based on long short-term memory network[J]. Mathematical Biosciences and Engineering, 2021, 18(5): 6700-6708. doi: 10.3934/mbe.2021332
Conus (cone snails) is a genus of venomous, carnivorous marine mollusks found in tropical seas [1]. There are more than 500 species of Conus in the world, and their venom contains at least 50,000 active peptides. The secreted toxin (called conotoxin) is mainly used for predation and defense [2]. Conotoxin is extremely toxic and can cause animals to tremble, convulse, and even become paralyzed and die. Other estimates put the number at more than 700 species secreting more than 100,000 toxins. However, experiments to date have confirmed and recorded relatively few conotoxins (about 3000 peptides) [3]. Conotoxin has strong biological activity and a novel chemical structure. It has extremely high selectivity for ligand-gated or voltage-gated ion channels [4]. It can distinguish between similar ion channel types and is widely used as a pharmacological reagent in ion channel research. Because the insectivorous conotoxin can kill many kinds of worms [5], it has the potential to be used in breeding new insect-resistant crop varieties or to be developed as a peptide insecticide. Conotoxin has therefore become a new source for drug development and a powerful tool for pharmacology and neuroscience [6], and it ranks first in research on animal neurotoxins. It has been called "the treasure house of marine drugs", has attracted wide attention, and has broad development prospects.
According to their target sites [7], conotoxins can be divided into three categories: 1) conotoxins that act on ligand-gated ion channels; 2) conotoxins that act on voltage-gated ion channels, which are also called voltage-sensitive channels; 3) conotoxins that act on other receptors [8]. There are more than 300 ion channels in living cells. Many important functions in life, such as heartbeat, sensory conduction, and central nervous system response, are controlled by cell signaling through various ion channels. Ion channel dysfunction can cause a variety of diseases, such as epilepsy, arrhythmia, and type II diabetes. These diseases are mainly treated with drugs that regulate the relevant ion channels [9]. Ion channels are also an important target for the treatment of viral diseases. Because of their importance to human life, ion channels have become the second most common drug development target. The following three ion channels are usually the targets of toxins: potassium (K) channels, sodium (Na) channels, and calcium (Ca) channels. Based on function and target, conotoxins can accordingly be divided into three types: (i) potassium channel-targeting; (ii) sodium channel-targeting; (iii) calcium channel-targeting [10].
Due to the explosive growth of protein sequence data [11], traditional wet-lab experimental methods can no longer meet the need for rapid identification of protein sequences. Yuan et al. developed a feature selection technique based on the binomial distribution and used radial basis function networks to predict the types of ion channel-targeted conotoxins [12]. Subsequently, they developed a predictor (iCTX type) to improve prediction accuracy [13]. Zhang et al. applied hybrid features to the prediction problem [14]. Wang et al. combined the analysis of variance and correlation (AVC) with SVM to reduce attribute redundancy and to improve prediction accuracy and computation speed [15]. However, none of these methods can be used to predict the type of conotoxin defined by its target ion channel. For example, δ-toxoid-like Ac6.1 and ω-toxin-like Ai6.2 both belong to toxoid C1, yet the former targets voltage-gated sodium channels while the latter targets voltage-gated calcium channels [16].
To solve this problem, this article proposes a method that identifies the three types of conotoxins from their sequence information alone. We propose a deep learning model based on a long short-term memory (LSTM) neural network to predict the class of a conotoxin [14], and we use word embedding to represent the conotoxin sequence as a vector, because a protein sequence can be viewed as a natural language. Effective features are extracted from the conotoxin sequence. To further evaluate its performance, the model is compared with the existing machine learning model SVM [15]. The experimental results show that the method has good prediction performance and is suitable for the classification and prediction of conotoxins. The workflow is shown in Figure 1.
In this paper, word embedding and LSTM are combined to construct a model for conotoxin type prediction, so as to exploit the strengths of LSTM in sequence modeling and long-term memory and of word embedding in sequence representation.
The conotoxin sequences and their functions used in this experiment were collected from UniProt [16]. To improve data quality, we first restricted the collection to conotoxins whose function involves potassium, calcium, or sodium channels [17]. Conotoxins whose target is not clearly annotated in UniProt were discarded, except for a few that we were able to identify ourselves. In the end, we obtained 192 conotoxins: 74 calcium channel-targeting, 84 sodium channel-targeting, and 34 potassium channel-targeting. The training set consists of 60 calcium channel conotoxins, 67 sodium channel conotoxins, and 25 potassium channel conotoxins. The test set consists of 14 calcium channel conotoxins, 17 sodium channel conotoxins, and 9 potassium channel conotoxins. The details of the data set are shown in Table 1.
Data set | Ca ion channel | Na ion channel | K ion channel |
Training set | 60 | 67 | 25 |
Test set | 14 | 17 | 9 |
This algorithm does not require the physical and chemical properties of amino acids to be determined manually through wet-lab experiments. It uses only the conotoxin character sequence as input data. Each conotoxin sequence is first split into individual characters, and each character is mapped to an integer. Because the length of a conotoxin sequence is not fixed, we set a fixed maximum length according to the data set and encode every sequence to that length; when an encoded sequence is shorter than the maximum length, it is padded with 0 at the end. Word embedding training is then carried out through the neural network, mapping the 20 amino acid letters to the word embedding vector space so that each character corresponds to a vector representation. The above steps can be completed automatically with the Tokenizer API provided by Keras. Each conotoxin sequence can thus be encoded as an M × N matrix, where M is the chosen sequence length and N is the chosen dimension of the embedding space.
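The encoding described above can be sketched in plain Python. This is a minimal illustration, not the authors' code: it does not use the Keras Tokenizer, the example peptide and the values of M and N are hypothetical, and the embedding table is filled with random numbers as a stand-in for weights that would normally be learned during training.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acid letters

# Index 0 is reserved for padding, so residues are numbered from 1.
CHAR_TO_INT = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode_sequence(seq, max_len):
    """Map each residue to its integer index, then truncate or zero-pad at the end."""
    ids = [CHAR_TO_INT[aa] for aa in seq[:max_len]]
    return ids + [0] * (max_len - len(ids))


def make_embedding_table(vocab_size, dim, seed=0):
    """Random stand-in (assumption) for an embedding matrix that would be learned."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]


def embed(ids, table):
    """Look up one N-dimensional vector per position, giving an M x N matrix."""
    return [table[i] for i in ids]


M, N = 12, 4                                 # hypothetical length and embedding dimension
table = make_embedding_table(len(AMINO_ACIDS) + 1, N)
ids = encode_sequence("GCCSDPRCAW", M)       # hypothetical example peptide
matrix = embed(ids, table)
print(ids)                                   # integer codes followed by zero padding
print(len(matrix), len(matrix[0]))           # M rows, N columns
```

Reserving index 0 for padding keeps padded positions distinguishable from real residues, which is why the vocabulary size passed to the table is 21 rather than 20.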
LSTM is a recurrent neural network with a special structure and is an effective technique for handling long-range dependencies in sequences [18]. It is composed of a group of unit modules with a memory function. Each unit module contains an input gate, a forget gate, and an output gate, which control the input, filtering, and output of information. These gating operations enable LSTM to automatically extract and learn the long-range correlations in a sequence that are useful for the overall classification task. Predicting the class of a conotoxin from its sequence is exactly this kind of sequence classification problem, so LSTM is well suited to it.
This article classifies conotoxins that target three ion channels: potassium, calcium, and sodium. The output activation function should therefore be softmax rather than sigmoid: sigmoid performs well on two-class problems, but softmax works better on multi-class problems [19].
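To make the role of the three gates concrete, the following is a minimal sketch of one LSTM time step for a single hidden unit, written with the standard LSTM update equations. The weights and the toy input sequence are hypothetical values chosen only for illustration; they are not taken from the trained model.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step for a single hidden unit.

    x is the input vector; h_prev and c_prev are the previous hidden and
    cell states. params maps each gate name to a tuple of
    (input weights, recurrent weight, bias).
    """
    def pre_activation(name):
        w, u, b = params[name]
        return sum(wi * xi for wi, xi in zip(w, x)) + u * h_prev + b

    i = sigmoid(pre_activation("input"))         # how much new information to write
    f = sigmoid(pre_activation("forget"))        # how much old memory to keep
    o = sigmoid(pre_activation("output"))        # how much memory to expose
    g = math.tanh(pre_activation("candidate"))   # candidate memory content
    c = f * c_prev + i * g                       # updated cell state
    h = o * math.tanh(c)                         # updated hidden state
    return h, c


# Hypothetical weights, for illustration only.
params = {
    "input": ([0.5, -0.3], 0.1, 0.0),
    "forget": ([0.2, 0.4], -0.2, 1.0),
    "output": ([-0.1, 0.3], 0.2, 0.0),
    "candidate": ([0.7, 0.1], 0.3, 0.0),
}

h, c = 0.0, 0.0
for x in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):   # toy sequence of embedded residues
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

The cell state `c` is the memory carried along the sequence; the forget gate scales what is kept from earlier residues, which is how information from distant positions can still influence the final hidden state.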
$$S_i = \frac{e^{z_i}}{\sum_k e^{z_k}} \tag{1}$$
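As a small worked example of Eq. (1), the sketch below computes softmax over three scores, one per channel type. The input scores are hypothetical, and subtracting the maximum before exponentiating is a common numerical-stability step that leaves the result of Eq. (1) unchanged.

```python
import math


def softmax(z):
    """Softmax of Eq. (1), shifted by max(z) to avoid overflow in exp."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for the K, Na, and Ca classes.
probs = softmax([2.0, 1.0, 0.1])
print(probs)                              # three probabilities that sum to 1
print(probs.index(max(probs)))            # index of the predicted class
```

The outputs are non-negative and sum to 1, so they can be read as class probabilities, and the predicted class is the one with the largest value.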
The overall process of the classification prediction algorithm proposed in this paper is shown in Figure 2. First, the amino acid characters appearing in the conotoxin sequence are automatically mapped to the embedding vector space through neural network training, so that each amino acid character corresponds to a vector representation. Each conotoxin sequence is then represented as the corresponding matrix. Finally, the matrix is used as the input of the LSTM model for training and learning.
Cross-validation and an independent test data set are used to verify the performance of the algorithm. Cross-validation divides the training set into five subsets [20]. Each time, one subset is used as the validation set and the remaining four are combined as the training set. This process is repeated five times, so that each subset serves as the validation set once. The evaluation indicators of the algorithm are: 1) true positive rate (TPR); 2) false positive rate (FPR); 3) correct index; 4) the ROC curve [21] and the area under it (AUC). The formula for each indicator is as follows:
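$$\mathrm{TPR} = \frac{TP}{TP + FN} \times 100\% \tag{2}$$
$$\mathrm{FPR} = \frac{FP}{TN + FP} \times 100\% \tag{3}$$
$$\text{Correct index} = (\mathrm{TPR} + 1 - \mathrm{FPR}) \times \frac{1}{2} \tag{4}$$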
In the formula: TP refers to the number of positive samples predicted to be positive; FP refers to the number of negative samples predicted to be positive; TN refers to the number of negative samples predicted to be negative; FN refers to the number of positive samples predicted to be negative.
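The indicators in Eqs. (2) to (4) can be computed directly from the four counts. The sketch below is illustrative only: it treats one class as positive and the other two as negative (a one-vs-rest convention, which is an assumption, since the text does not state how the three classes were handled), it returns rates as fractions rather than percentages, and the labels are made up.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, FP, TN, FN with one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """TPR, FPR, and correct index as fractions in [0, 1]."""
    tpr = tp / (tp + fn)
    fpr = fp / (tn + fp)
    correct_index = (tpr + 1 - fpr) / 2
    return tpr, fpr, correct_index


# Made-up labels, for illustration only.
y_true = ["Ca", "Ca", "Na", "Na", "K", "K"]
y_pred = ["Ca", "Na", "Na", "Na", "K", "Ca"]

tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive="Ca")
tpr, fpr, correct_index = rates(tp, fp, tn, fn)
print(tp, fp, tn, fn)             # 1 1 3 1
print(tpr, fpr, correct_index)    # 0.5 0.25 0.625
```

Because the correct index averages the true positive rate and the true negative rate (1 − FPR), it is less affected by class imbalance than plain accuracy.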
Because the training set is small, this paper uses cross-validation to conduct the experiments. Accuracy and ROC are used as the measurement indicators for validation [22].
Predicting which ion channel a conotoxin targets is a three-class problem: from the sequence of the conotoxin, the model must decide whether it targets a potassium, sodium, or calcium channel. The activation function is therefore set to softmax when compiling the model [23].
The first step is to determine an appropriate word embedding dimension. Based on the characteristics of the collected conotoxin sequence data, embedding dimensions of 60 and 90 are compared. The ROC curves of the model during validation are shown in Figure 3.
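A five-fold split of the training set can be sketched as follows. This is a minimal illustration with assumptions: the split here is a simple shuffled partition (the text does not say whether folds were stratified by class), the seed is arbitrary, and the sample count of 152 is the training set total from Table 1 (60 + 67 + 25).

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]      # k nearly equal parts
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(152, k=5))
for train, validation in splits:
    print(len(train), len(validation))
```

Each sample appears in exactly one validation fold, so every training example is used for validation once and for training four times.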
When the embedding dimension is 90, the obtained area under the ROC curve, that is, the AUC value, is the largest. At this time, the prediction performance is the best. Therefore, the vector dimension of the word embedding space is set to 90.
With the number of training epochs set to 10, the accuracy and loss curves of the model on the training set and the independent test set are shown in Figure 4.
In Figure 4(a), the accuracy increases steadily between epochs 6 and 9. In Figure 4(b), however, the validation loss increases between epochs 7 and 8 and then decreases between epochs 8 and 9. This is a normal phenomenon. When the model discards some unnecessary features in the later stage of training, the loss on some correctly classified samples can rise, but the rise is not large enough to change the predicted result; the same can happen for samples that were already misclassified.
As can be seen from Figure 4, whether it is on the training set or the test set, the accuracy and loss value curves are close to each other, indicating that there is no over-fitting phenomenon, which shows that the model has a good generalization ability.
KNN is one of the most commonly used classification algorithms. It has good predictive performance and is not sensitive to outliers. Because the conotoxin data set collected in this paper may contain some erroneous data, the KNN algorithm is used for comparison. The ROC curves of the two methods on the test set are shown in Figure 5. The area under the ROC curve is 0.817 for LSTM and 0.641 for KNN, a difference of 0.176. This indicates that LSTM is more accurate than KNN and that the LSTM-based method is superior to the KNN algorithm for classifying conotoxin data.
Combined with Figure 4(a), and considering the imbalance of the classification data, the proposed method performs well in terms of both accuracy and ROC, which further demonstrates its advantage for the three-class classification of conotoxins.
Based on the characteristics of the conotoxin sequence, this paper uses an LSTM algorithm with word embedding to classify and predict conotoxins. The algorithm requires no manual feature extraction or feature reconstruction, which simplifies its design, and it exploits the ability of LSTM to capture long-range dependencies in the conotoxin sequence to provide better prediction performance. The experimental results show that the proposed algorithm can effectively classify conotoxins into three categories.
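For reference, the area under the ROC curve equals the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one. The sketch below computes it that way for a binary case; the scores and labels are made up, and how the reported AUC was aggregated over the three classes is not stated in the text, so applying this one class at a time (one-vs-rest) would be an assumption.

```python
def binary_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. labels use 1 for positive, 0 for negative.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up scores for one class versus the rest.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
auc = binary_auc(scores, labels)
print(auc)   # 5 of the 6 positive-negative pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the values 0.817 and 0.641 are compared.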
The authors declare that they have no conflict of interest.
[1] J. Thewissen, J. D. Sensor, M. T. Clementz, S. Bajpai, Evolution of dental wear and diet during the origin of whales, Paleobiology, 37 (2011), 655-669. doi: 10.1666/10038.1
[2] Z. Li, G. Beauchamp, M. S. Mooring, Relaxed selection for tick-defense grooming in Père David's deer?, Biol. Conserv., 178 (2014), 12-18. doi: 10.1016/j.biocon.2014.06.026
[3] D. J. Adams, P. F. Alewood, D. J. Craik, R. D. Drinkwater, R. J. Lewis, Conotoxins and their potential pharmaceutical applications, Drug Dev. Res., 46 (2015), 219-234.
[4] J. P. Johnson, J. R. Balser, P. B. Bennett, A novel extracellular calcium sensing mechanism in voltage-gated potassium ion channels, J. Neurosci., 21 (2001), 4143-4153. doi: 10.1523/JNEUROSCI.21-12-04143.2001
[5] R. H. Cox, N. J. Rusch, New expression profiles of voltage-gated ion channels in arteries exposed to high blood pressure, Microcirculation, 9 (2015), 243-257.
[6] M. Verhulsel, M. Vignes, S. Descroix, L. Malaquin, D. M. Vignjevic, J. L. Viovy, A review of microfabrication and hydrogel engineering for micro-organs on chips, Biomaterials, 35 (2014), 1816-1832. doi: 10.1016/j.biomaterials.2013.11.021
[7] A. Fu, Z. Zhao, F. Gao, M. Zhang, Cellular uptake mechanism and therapeutic utility of a novel peptide in targeted-delivery of proteins into neuronal cells, Pharm. Res., 30 (2013), 2108-2117. doi: 10.1007/s11095-013-1068-6
[8] A. Beyeler, N. Kadiri, S. Navailles, M. B. Boujema, F. Gonon, C. Le Moine, et al., Stimulation of serotonin2C receptors elicits abnormal oral movements by acting on pathways other than the sensorimotor one in the rat basal ganglia, Neuroscience, 169 (2010), 158-170. doi: 10.1016/j.neuroscience.2010.04.061
[9] O. Wesołowska, Interaction of phenothiazines, stilbenes and flavonoids with multidrug resistance-associated transporters, P-glycoprotein and MRP1, Acta Biochim. Pol., 58 (2011), 433-448.
[10] Y. Yang, Y. Cai, G. Wu, X. Chen, C. Zeng, Plasma long non-coding RNA, CoroMarker, a novel biomarker for diagnosis of coronary artery disease, Clin. Sci., 129 (2015), 675-685. doi: 10.1042/CS20150121
[11] R. Mason, Application of cathodoluminescence imaging to the study of sedimentary rocks, J. Geol., 115 (2006), 710-710.
[12] L. F. Yuan, C. Ding, S. H. Guo, H. Ding, W. Chen, H. Lin, Prediction of the types of ion channel-targeted conotoxins based on radial basis function network, Toxicol. Vitro, 27 (2013), 852-856. doi: 10.1016/j.tiv.2012.12.024
[13] S. Jouanneau, L. Reroutes, M. J. Durand, A. Boukabache, V. Picot, Y. Primault, et al., Methods for assessing biochemical oxygen demand (BOD): A review, Water Res., 49 (2014), 62-82. doi: 10.1016/j.watres.2013.10.066
[14] L. Zhang, C. Zhang, R. Gao, R. Yang, Q. Song, Using the SMOTE technique and hybrid features to predict the types of ion channel-targeted conotoxins, J. Theor. Biol., (2016), 75-84.
[15] J. Wang, R. M. Nishikawa, Y. Yang, Improving the accuracy in detection of clustered microcalcifications with a context-sensitive classification model, Med. Phys., 43 (2016), 159.
[16] M. L. Pall, Electromagnetic fields act via activation of voltage-gated calcium channels to produce beneficial or adverse effects, J. Cell. Mol. Med., 17 (2013), 958-965. doi: 10.1111/jcmm.12088
[17] J. Szendroedi, W. Sandtner, T. Zarrabi, E. Zebedin, K. Hilber, S. C. Dudley, et al., Speeding the recovery from ultraslow inactivation of voltage-gated Na+ channels by metal ion binding to the selectivity filter: a foot-on-the-door?, Biophys. J., 93 (2007), 4209-4224. doi: 10.1529/biophysj.107.104794
[18] J. Bandyopadhyay, J. Velázquez, Blow-up rate estimates for the solutions of the bosonic Boltzmann-Nordheim equation, J. Math. Phys., 56 (2015), 761-847.
[19] C. Angulo, F. J. Ruiz, L. González, J. A. Ortega, Multi-classification by using tri-class SVM, Neural Process. Lett., 23 (2006), 89-101. doi: 10.1007/s11063-005-3500-3
[20] J. C. Chang, S. G. Hilsenbeck, S. Fuqua, Genomic approaches in the management and treatment of breast cancer, Br. J. Cancer, 92 (2005), 618-624. doi: 10.1038/sj.bjc.6602410
[21] J. Yin, L. Tian, Joint confidence region estimation for area under ROC curve and Youden index, Stat. Med., 33 (2014), 985-1000. doi: 10.1002/sim.5992
[22] A. Mihret, Y. Bekele, K. Bobosha, M. Kidd, A. Aseffa, R. Howe, et al., Plasma cytokines and chemokines differentiate between active disease and non-active tuberculosis infection, J. Infect., 66 (2013), 357-365. doi: 10.1016/j.jinf.2012.11.005
[23] K. Sasaki, H. M. Kantarjian, E. J. Jabbour, S. O'Brien, J. E. Cortes, Clinical application of artificial intelligence in patients with chronic myeloid leukemia in chronic phase, Blood, 128 (2016), 940-940. doi: 10.1182/blood.V128.22.940.940