Research article Special Issues

Assessing agreement between permutation and dropout variable importance methods for regression and random forest models

  • Received: 31 December 2023 Revised: 01 June 2024 Accepted: 08 July 2024 Published: 22 July 2024
  • Permutation techniques have been used extensively in machine learning algorithms for evaluating variable importance. In ordinary regression, however, variables are often removed to gauge their importance. In this paper, we compared the results for permuting variables to removing variables in regression to assess relations between these two methods. We compared permute-and-predict (PaP) methods with leave-one-covariate-out (LOCO) techniques. We also compared these results with conventional metrics such as regression coefficient estimates, t-statistics, and random forest out-of-bag (OOB) PaP importance. Our results indicate that permutation importance metrics are practically equivalent to those obtained from removing variables in a regression setting. We demonstrate a strong association between the PaP metrics, true coefficients, and regression-estimated coefficients. We also show a strong relation between the LOCO metrics and the regression t-statistics. Finally, we illustrate that manual PaP methods are not equivalent to the OOB PaP technique and suggest prioritizing the use of manual PaP methods on validation data.

    Citation: Kelvyn Bladen, D. Richard Cutler. Assessing agreement between permutation and dropout variable importance methods for regression and random forest models[J]. Electronic Research Archive, 2024, 32(7): 4495-4514. doi: 10.3934/era.2024203

    Related Papers:

  • Permutation techniques have been used extensively in machine learning algorithms for evaluating variable importance. In ordinary regression, however, variables are often removed to gauge their importance. In this paper, we compared the results for permuting variables to removing variables in regression to assess relations between these two methods. We compared permute-and-predict (PaP) methods with leave-one-covariate-out (LOCO) techniques. We also compared these results with conventional metrics such as regression coefficient estimates, t-statistics, and random forest out-of-bag (OOB) PaP importance. Our results indicate that permutation importance metrics are practically equivalent to those obtained from removing variables in a regression setting. We demonstrate a strong association between the PaP metrics, true coefficients, and regression-estimated coefficients. We also show a strong relation between the LOCO metrics and the regression t-statistics. Finally, we illustrate that manual PaP methods are not equivalent to the OOB PaP technique and suggest prioritizing the use of manual PaP methods on validation data.



    加载中


    [1] W. Kruskal, R. Majors, Concepts of relative importance in recent scientific literature, Am. Stat., 43 (1989), 2–6.
    [2] C. Achen, Interpreting and Using Regression, Sage, 29 (1982).
    [3] R. Tibshirani, Regression shrinkage and selection via the lasso, J. R. Stat. Soc. B, 58 (1996), 267–288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x doi: 10.1111/j.2517-6161.1996.tb02080.x
    [4] H. Zou, T. Hastie, Regularization and variable selection via the elastic net, J. R. Stat. Soc. B, 67 (2005), 301–320. https://doi.org/10.1111/j.1467-9868.2005.00503.x doi: 10.1111/j.1467-9868.2005.00503.x
    [5] J. Pratt, Dividing the indivisible: using simple symmetry to partition variance explained, in Proceedings of the Second International Tampere Conference in Statistics, (1987), 245–260.
    [6] L. Breiman, Random forests, Mach. Learn., 45 (2001), 5–32. https://doi.org/10.1023/A: 1010933404324
    [7] C. Strobl, A. Boulesteix, T. Kneib, T. Augustin, A. Zeileis, Conditional variable importance for random forests, BMC Bioinf., 9 (2008), 1–11. https://doi.org/10.1186/1471-2105-9-307 doi: 10.1186/1471-2105-9-307
    [8] K. Bladen, Contributions to Random Forest Variable Importance with Applications in R, MS thesis, Utah State University, 2022.
    [9] G. Hooker, L. Mentch, S. Zhou, Unrestricted permutation forces extrapolation: variable importance requires at least one more model, or there is no free variable importance, Stat. Comput., 31 (2021), 1–16. https://doi.org/10.1007/s11222-021-10057-z doi: 10.1007/s11222-021-10057-z
    [10] J. Lei, M. G'Sell, A. Rinaldo, R. Tibshirani, L. Wasserman, Distribution-free predictive inference for regression, J. Am. Stat. Assoc., 113 (2018), 1094–1111. https://doi.org/10.1080/01621459.2017.1307116 doi: 10.1080/01621459.2017.1307116
    [11] R. Barber, E. Candès, Controlling the false discovery rate via knockoffs, Ann. Stat., 43 (2015), 2055–2085.
    [12] E. Candès, Y. Fan, L. Janson, J. Lv, Panning for gold: 'model-X' knockoffs for high dimensional controlled variable selection, J. R. Stat. Soc. B, 80 (2018), 551–577. https://doi.org/10.1111/rssb.12265 doi: 10.1111/rssb.12265
    [13] C. Ye, Y. Yang, Y. Yang, Sparsity oriented importance learning for high-dimensional linear regression, J. Am. Stat. Assoc., 113 (2018), 1797–1812. https://doi.org/10.1080/01621459.2017.1377080 doi: 10.1080/01621459.2017.1377080
    [14] D. Apley, J. Zhu, Visualizing the effects of predictor variables in black box supervised learning models, J. R. Stat. Soc. B, 82 (2020), 1059–1086. https://doi.org/10.1111/rssb.12377 doi: 10.1111/rssb.12377
    [15] A. Goldstein, A. Kapelner, J. Bleich, E. Pitkin, Peeking inside the black box: visualizing statistical learning with plots of individual conditional expectation, J. Comput. Graphical Stat., 24 (2015), 44–65. https://doi.org/10.1080/10618600.2014.907095 doi: 10.1080/10618600.2014.907095
    [16] B. Greenwell, B. Boehmke, A. McCarthy, A simple and effective model-based variable importance measure, preprint, arXiv: 1805.04755, 2018.
  • Reader Comments
  • © 2024 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
通讯作者: 陈斌, bchen63@163.com
  • 1. 

    沈阳化工大学材料科学与工程学院 沈阳 110142

  1. 本站搜索
  2. 百度学术搜索
  3. 万方数据库搜索
  4. CNKI搜索

Metrics

Article views(489) PDF downloads(36) Cited by(0)

Article outline

Figures and Tables

Figures(10)

Other Articles By Authors

/

DownLoad:  Full-Size Img  PowerPoint
Return
Return

Catalog