Research article Special Issues

Optimal modeling of anti-breast cancer candidate drugs screening based on multi-model ensemble learning with imbalanced data

  • Received: 24 October 2022 Revised: 07 December 2022 Accepted: 15 December 2022 Published: 06 January 2023
  • Imbalanced data can seriously bias machine learning models, which leads to false positives in the screening of therapeutic drugs for breast cancer. To address this problem, a multi-model ensemble framework based on tree models, linear models and deep-learning models is proposed. Using this methodology, we screened the 20 most critical molecular descriptors from 729 molecular descriptors of 1974 anti-breast cancer drug candidates. To assess the pharmacokinetic properties and safety of the drug candidates, the selected descriptors were then used for subsequent prediction tasks, including bioactivity, absorption, distribution, metabolism, excretion and toxicity. The results show that the proposed method performs better and is more stable than the individual models used in the ensemble.
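
    As a rough illustration of ensemble descriptor screening, the sketch below combines importance scores from several models by averaging each descriptor's rank and keeping the best-ranked ones. The abstract does not state how the tree, linear and deep-learning results are combined, so mean-rank aggregation is an assumption here, and the function names, descriptor names and scores are all hypothetical.

```python
from statistics import mean


def rank_scores(scores):
    """Map each feature name to its rank (1 = most important) by absolute score."""
    ordered = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def ensemble_top_k(per_model_scores, k):
    """Average each feature's rank across models and return the k best-ranked names.

    Assumption: mean-rank aggregation stands in for whatever combination rule
    the article actually uses.
    """
    per_model_ranks = [rank_scores(scores) for scores in per_model_scores.values()]
    features = list(per_model_ranks[0])
    mean_rank = {f: mean(ranks[f] for ranks in per_model_ranks) for f in features}
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(features, key=lambda f: (mean_rank[f], f))[:k]


# Made-up importance scores for four hypothetical descriptors.
toy_scores = {
    "tree": {"d1": 0.50, "d2": 0.30, "d3": 0.15, "d4": 0.05},
    "linear": {"d1": -0.80, "d2": 0.10, "d3": 0.60, "d4": 0.02},
    "deep": {"d1": 0.40, "d2": 0.35, "d3": 0.20, "d4": 0.05},
}
top_two = ensemble_top_k(toy_scores, k=2)
```

    In the setting described above, `k` would be 20 and each inner dictionary would hold 729 descriptor scores. Ranks are used instead of raw scores because importance values from different model families are generally not on a comparable scale.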

    Citation: Juan Zhou, Xiong Li, Yuanting Ma, Zejiu Wu, Ziruo Xie, Yuqi Zhang, Yiming Wei. Optimal modeling of anti-breast cancer candidate drugs screening based on multi-model ensemble learning with imbalanced data[J]. Mathematical Biosciences and Engineering, 2023, 20(3): 5117-5134. doi: 10.3934/mbe.2023237

  • © 2023 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (
