Class imbalance is a critical concern in predictive modeling, particularly in domains where certain classes are under-represented. This study investigated how class imbalance affects the performance and interpretability of random forest models in churn and fraud detection scenarios. We trained and evaluated random forest models on churn datasets with class imbalances ranging from 20% to 50% and on fraud datasets with imbalances from 1% to 15%. Precision, recall, F1-score, and accuracy improved consistently as class imbalance decreased, indicating that the models identified rare events more precisely and accurately when trained on more balanced datasets. Additionally, we employed interpretability techniques such as Shapley values, partial dependence plots (PDPs), and breakdown plots to elucidate the effect of class imbalance on model interpretability. Shapley values showed varying feature importance across class distributions, with a general decrease as datasets became more balanced. PDPs showed a consistent upward trend in estimated values as datasets approached balance, indicating consistent relationships between input variables and predicted outcomes. Breakdown plots highlighted significant changes in individual predictions as class imbalance varied, underscoring the importance of considering class distribution when interpreting model outputs. These findings contribute to our understanding of the interplay between class balance, model performance, and interpretability, offering insights for developing more robust and reliable predictive models in real-world applications.
Citation: Lindani Dube, Tanja Verster. Interpretability of the random forest model under class imbalance[J]. Data Science in Finance and Economics, 2024, 4(3): 446-468. doi: 10.3934/DSFE.2024019
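The four performance measures named in the abstract can all be computed from the confusion counts of a binary classifier. The sketch below is a minimal illustration, not the authors' code: the class `ConfusionCounts`, the function `classification_metrics`, and the counts in `scenarios` are hypothetical and are not taken from the paper. It treats the rare event (churn or fraud) as the positive class.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ConfusionCounts:
    """Confusion counts for a binary classifier; the rare event is the positive class."""
    tp: int  # rare events correctly flagged
    fp: int  # majority cases wrongly flagged
    fn: int  # rare events missed
    tn: int  # majority cases correctly passed


def classification_metrics(c: ConfusionCounts) -> Dict[str, float]:
    """Return precision, recall, F1-score, and accuracy; undefined ratios become 0.0."""
    total = c.tp + c.fp + c.fn + c.tn
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (c.tp + c.tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts (assumed for illustration only): 10,000 cases per scenario.
scenarios = {
    "1% event rate": ConfusionCounts(tp=20, fp=10, fn=80, tn=9890),
    "15% event rate": ConfusionCounts(tp=1200, fp=300, fn=300, tn=8200),
}

for name, counts in scenarios.items():
    metrics = classification_metrics(counts)
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

With these made-up counts, accuracy stays high in the 1% scenario even though most rare events are missed, because the majority class dominates the total. This is why accuracy is usually read alongside precision, recall, and F1-score when classes are imbalanced.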