Imbalanced data can severely bias machine learning models, which leads to false positives when screening therapeutic drugs for breast cancer. To address this problem, we propose a multi-model ensemble framework that combines tree-based, linear and deep-learning models. Using this framework, we screened the 20 most critical molecular descriptors from the 729 molecular descriptors of 1974 anti-breast cancer drug candidates. To assess the pharmacokinetic properties and safety of the drug candidates, the selected descriptors were then used for subsequent prediction tasks, including bioactivity, absorption, distribution, metabolism, excretion, toxicity and others. The results show that the proposed method outperforms, and is more stable than, the individual models used in the ensemble.
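As a rough illustration of how descriptor importances from several models might be combined into one shortlist, the sketch below averages each descriptor's rank across models and keeps the top k. The averaging rule, the function names (`rank_positions`, `ensemble_top_k`), the descriptor names and the scores are all assumptions made for this example; the abstract does not state how the study aggregates its models.

```python
from statistics import mean


def rank_positions(scores):
    """Map each descriptor to its rank (1 = most important) under one model.

    Importance is taken as the absolute score, so signed linear-model
    coefficients are handled; ties are broken alphabetically.
    """
    ordered = sorted(scores, key=lambda name: (-abs(scores[name]), name))
    return {name: position for position, name in enumerate(ordered, start=1)}


def ensemble_top_k(per_model_scores, k):
    """Average each descriptor's rank across models and return the k best.

    Assumption: mean-rank aggregation, chosen only for illustration.
    """
    per_model_ranks = [rank_positions(s) for s in per_model_scores.values()]
    names = sorted(per_model_ranks[0])
    mean_rank = {
        name: mean(ranks[name] for ranks in per_model_ranks) for name in names
    }
    return sorted(names, key=lambda name: (mean_rank[name], name))[:k]


# Made-up importance scores for four hypothetical descriptors.
toy_scores = {
    "tree_model": {"d1": 0.40, "d2": 0.30, "d3": 0.20, "d4": 0.10},
    "linear_model": {"d1": 0.10, "d2": -0.90, "d3": 0.05, "d4": 0.50},
    "deep_model": {"d1": 0.35, "d2": 0.25, "d3": 0.30, "d4": 0.10},
}
selected = ensemble_top_k(toy_scores, k=2)
print(selected)
```

In the setting described above, the same idea would apply with k = 20 over 729 descriptors, using importance scores taken from the fitted tree-based, linear and deep-learning models.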
Citation: Juan Zhou, Xiong Li, Yuanting Ma, Zejiu Wu, Ziruo Xie, Yuqi Zhang, Yiming Wei. Optimal modeling of anti-breast cancer candidate drugs screening based on multi-model ensemble learning with imbalanced data[J]. Mathematical Biosciences and Engineering, 2023, 20(3): 5117-5134. doi: 10.3934/mbe.2023237