Research article

A new imbalanced data oversampling method based on Bootstrap method and Wasserstein Generative Adversarial Network


  • Received: 25 October 2023; Revised: 17 December 2023; Accepted: 22 December 2023; Published: 26 February 2024
  • Traditional machine learning classifiers are strongly biased toward the majority class, so they face a great challenge when biological data are class-imbalanced. More recently, generative adversarial networks (GANs) have been applied to imbalanced data classification. For GANs, the distribution of the minority class data fed into the discriminator is unknown, and the input to the generator is random noise ($ z $) drawn from a standard normal distribution $ N(0, 1) $. This choice of input inevitably increases the training difficulty of the network and reduces the quality of the generated data. To solve this problem, we propose a new oversampling algorithm that combines the Bootstrap method (BM) and the Wasserstein GAN (BM-WGAN). In our approach, the input to the generator network is data ($ z $) drawn from the distribution of the minority class estimated by the BM. Once network training is completed, the generator is used to synthesize minority class data. In this way, the generator can learn useful features from the minority class and generate realistic-looking minority class samples. The experimental results indicate that BM-WGAN greatly improves classification performance compared with other oversampling algorithms. The BM-WGAN implementation is available at: https://github.com/ithbjgit1/BMWGAN.git.
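    The central idea of the abstract, replacing $ N(0, 1) $ noise with generator inputs drawn from a bootstrap estimate of the minority class distribution, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the authors' repository: the helper names `bootstrap_estimate` and `sample_generator_input` are hypothetical, the estimated distribution is assumed to be an independent Gaussian per feature, and the toy data are made up. The WGAN training step itself is not shown.

```python
import random
import statistics


def bootstrap_estimate(minority, n_boot=200, seed=0):
    """Estimate the per-feature mean and standard deviation of the minority
    class by averaging both statistics over bootstrap resamples.
    (Hypothetical helper; the paper's exact estimator may differ.)"""
    rng = random.Random(seed)
    n = len(minority)
    n_features = len(minority[0])
    means = [0.0] * n_features
    stds = [0.0] * n_features
    for _ in range(n_boot):
        # One bootstrap resample: n rows drawn with replacement.
        resample = [minority[rng.randrange(n)] for _ in range(n)]
        for j in range(n_features):
            column = [row[j] for row in resample]
            means[j] += statistics.fmean(column) / n_boot
            stds[j] += statistics.pstdev(column) / n_boot
    return means, stds


def sample_generator_input(means, stds, n_samples, seed=0):
    """Draw generator inputs z from the estimated distribution.
    Assumption: an independent Gaussian per feature, used in place of N(0, 1)."""
    rng = random.Random(seed)
    return [[rng.gauss(m, s) for m, s in zip(means, stds)]
            for _ in range(n_samples)]


# Toy minority class: 5 samples, 2 features (made-up numbers).
minority = [[1.0, 10.0], [1.2, 11.0], [0.8, 9.5], [1.1, 10.5], [0.9, 9.0]]
means, stds = bootstrap_estimate(minority)
z = sample_generator_input(means, stds, n_samples=4)
```

    In this sketch each row of `z` would be passed to the generator instead of standard normal noise, so the generator starts from points that already lie near the minority class.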

    Citation: Binjie Hou, Gang Chen. A new imbalanced data oversampling method based on Bootstrap method and Wasserstein Generative Adversarial Network. Mathematical Biosciences and Engineering, 2024, 21(3): 4309-4327. doi: 10.3934/mbe.2024190



  • © 2024 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)