Drugs are an important means of treating disease. They are grouped into classes that indicate their properties and effects, and drugs in the same class generally share important features. The Kyoto Encyclopedia of Genes and Genomes (KEGG) DRUG recently reported a new drug classification system that divides drugs into 14 classes. Correctly identifying the class of a drug-like compound helps to roughly determine its effects on a particular type of disease; experiments can then be conducted to confirm such latent effects, thus accelerating the discovery of novel drugs. This study investigated this classification system and proposed, for the first time, a classification model that assigns one of its classes to any given drug. Unlike traditional fingerprint features, which are widely used in drug-related problems but describe only the essential properties of an individual drug, the features used here were derived from a large drug network via the well-known network embedding algorithm Node2vec. These features abstract the drug associations generated from essential drug properties and thus describe each drug against the background of all other drugs. Because class sizes differed greatly, the synthetic minority over-sampling technique (SMOTE) was employed to tackle the imbalance problem. The balanced dataset was then fed into a support vector machine to build the model. The 10-fold cross-validation results suggested excellent performance. The model was also superior to models using other drug features, including features generated by another network embedding algorithm and fingerprint features. Furthermore, it provided more balanced performance across all classes than the model built without SMOTE.
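The class-balancing step can be illustrated with a minimal sketch of the SMOTE idea: each synthetic minority sample is placed at a random point on the segment between a minority sample and one of its nearest same-class neighbours. This is not the implementation used in the study. The function name `smote_oversample`, the parameter `k`, and the toy two-dimensional vectors (standing in for Node2vec embeddings) are assumptions made purely for illustration.

```python
import random
from collections import Counter


def smote_oversample(samples, labels, k=5, seed=0):
    """Oversample every class up to the size of the largest class.

    Illustrative only: each new sample is an interpolation between a
    randomly chosen sample and one of its k nearest same-class neighbours.
    """
    rng = random.Random(seed)

    # Group feature vectors by class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(rows) for rows in by_class.values())

    new_samples, new_labels = list(samples), list(labels)
    for y, rows in by_class.items():
        if len(rows) < 2:
            continue  # a single sample has no neighbour to interpolate with
        for _ in range(target - len(rows)):
            base = rng.choice(rows)
            # Rank the other same-class samples by squared Euclidean distance.
            others = [r for r in rows if r is not base]
            others.sort(key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)))
            neighbour = rng.choice(others[:k])
            gap = rng.random()  # position along the segment base -> neighbour
            synthetic = [a + gap * (b - a) for a, b in zip(base, neighbour)]
            new_samples.append(synthetic)
            new_labels.append(y)
    return new_samples, new_labels


# Toy data: six "A" vectors and two "B" vectors (hypothetical embeddings).
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3],
     [0.1, 0.4], [0.4, 0.2], [2.0, 2.0], [3.0, 3.0]]
y = ["A"] * 6 + ["B"] * 2

X_balanced, y_balanced = smote_oversample(X, y)
print(sorted(Counter(y_balanced).items()))
```

After the call, both classes have the same number of samples, and every synthetic "B" vector lies between the two original "B" vectors. In a full pipeline the balanced vectors would then be passed to the classifier; oversampling is typically applied only to the training portion of each cross-validation fold so that synthetic samples do not leak into the test portion.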
Citation: Chenhao Wu, Lei Chen. A model with deep analysis on a large drug network for drug classification[J]. Mathematical Biosciences and Engineering, 2023, 20(1): 383-401. doi: 10.3934/mbe.2023018
Supplementary files: mbe-20-01-018-Supplementary-S1.pdf, mbe-20-01-018-Supplementary-S2.pdf