A central problem in applying classification algorithms is class imbalance. In an imbalanced dataset, particularly a two-class one, one class has far more observations than the other, so the majority class dominates the minority class. The synthetic minority over-sampling technique (SMOTE) was developed to deal with the classification of imbalanced datasets. The SMOTE algorithm increases the number of minority samples by interpolating between minority samples that lie close together. It has three critical parameters: "k", "perc.over", and "perc.under". The "perc.over" and "perc.under" hyperparameters determine the minority and majority class ratios, and "k" is the number of nearest neighbors used to create new minority class instances. Finding the best parameter values for the SMOTE algorithm is difficult. To address this issue, a hybrid of genetic algorithm (GA) and support vector machine (SVM) approaches is proposed for selecting the SMOTE parameters. Three scenarios were created. Scenario 1 evaluates the SVM without the SMOTE algorithm. Scenario 2 applies the SVM after SMOTE, without the GA. Scenario 3 applies the SVM after SMOTE with its parameters selected by the GA. The study used two imbalanced datasets: drug use data and simulated data. The scenarios were then compared using model performance metrics, and the third scenario achieved the highest performance. The study thus shows that a genetic algorithm can optimize the class ratios and the k hyperparameter to improve the performance of the SMOTE algorithm.
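As a rough illustration of how the three parameters interact, the sketch below implements a minimal SMOTE-like resampler in plain Python. It is not the implementation used in the study. The function name `smote_sketch` and the toy data are hypothetical, and the code assumes a convention in which `perc_over / 100` synthetic points are created per minority sample and `perc_under / 100` majority samples are kept per synthetic point. Majority samples are drawn without replacement here, which is a simplification.

```python
import math
import random


def smote_sketch(minority, majority, k=5, perc_over=200, perc_under=200, seed=0):
    """Return (new_minority, sampled_majority) under the assumed convention."""
    rng = random.Random(seed)
    n_new_per_point = perc_over // 100  # assumption: perc_over is a multiple of 100
    synthetic = []
    for i, x in enumerate(minority):
        # k nearest minority neighbours of x (Euclidean distance), excluding x itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(n_new_per_point):
            nb = rng.choice(neighbours)
            gap = rng.random()  # position along the segment from x to nb
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    # assumption: keep perc_under / 100 majority rows per synthetic row,
    # capped at the number of majority rows available
    n_majority = min(len(majority), round(perc_under / 100 * len(synthetic)))
    sampled_majority = rng.sample(majority, n_majority)
    return list(minority) + synthetic, sampled_majority


# Toy data (hypothetical): 4 minority points, 40 majority points
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority = [(float(i), float(i % 7)) for i in range(5, 45)]

new_min, new_maj = smote_sketch(minority, majority, k=3, perc_over=200, perc_under=150)
print(len(new_min), len(new_maj))  # 12 12
```

With these settings, each of the 4 minority points yields 2 synthetic points (8 in total, 12 minority rows overall), and 1.5 × 8 = 12 majority rows are retained. A search procedure such as a GA would vary `k`, `perc_over`, and `perc_under` and score each candidate by the downstream classifier's performance.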
Citation: Pelin Akın. A new hybrid approach based on genetic algorithm and support vector machine methods for hyperparameter optimization in synthetic minority over-sampling technique (SMOTE)[J]. AIMS Mathematics, 2023, 8(4): 9400-9415. doi: 10.3934/math.2023473