Research article

Distributed Bayesian posterior voting strategy for massive data

  • Received: 21 March 2022; Revised: 01 April 2022; Accepted: 06 April 2022; Published: 11 April 2022
  • The emergence of massive data has driven recent interest in statistical learning and large-scale algorithms for analysis on distributed platforms. One widely used statistical approach is split-and-conquer (SaC), which originally aggregated all local solutions through a simple average to reduce the computational burden caused by communication costs. Aiming at lower computation cost with acceptable accuracy, this paper extends SaC to Bayesian variable selection for ultrahigh-dimensional linear regression and builds BVSaC for aggregation. Suppose ultrahigh-dimensional data are stored in a distributed manner across multiple computing nodes, with each node holding a disjoint subset of the data. On each node, we perform variable selection and coefficient estimation through a hierarchical Bayes formulation. Then a weighted majority voting method, BVSaC, combines the local results to retain good performance. The proposed approach requires only a small portion of the computation cost on each local dataset and therefore eases the computational burden, especially in Bayesian computation, while sacrificing little accuracy, which in turn makes the analysis of extraordinarily large datasets more feasible. Simulations and a real-world example show that the proposed approach performs as well as the whole-sample hierarchical Bayes method in terms of the accuracy of variable selection and estimation.
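
    The aggregation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' BVSaC implementation: the function names, the 0.5 vote threshold, the use of subset sizes as weights, and the rule that a node contributes zero for a variable it did not select are all assumptions made for illustration. The local selections and estimates are taken as given inputs; in the paper they would come from the hierarchical Bayes fit on each node.

```python
from typing import Dict, List, Sequence


def weighted_majority_vote(
    local_selections: Sequence[Sequence[int]],
    weights: Sequence[float],
    threshold: float = 0.5,
) -> List[int]:
    """Keep variables whose weighted share of node votes exceeds `threshold`.

    local_selections[k] lists the variable indices selected on node k.
    weights[k] is the (assumed) voting weight of node k.
    """
    if len(local_selections) != len(weights):
        raise ValueError("one weight per node is required")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    votes: Dict[int, float] = {}
    for selected, w in zip(local_selections, weights):
        for j in set(selected):  # a node votes at most once per variable
            votes[j] = votes.get(j, 0.0) + w
    return sorted(j for j, v in votes.items() if v / total > threshold)


def combine_coefficients(
    local_estimates: Sequence[Dict[int, float]],
    weights: Sequence[float],
    selected: Sequence[int],
) -> Dict[int, float]:
    """Weighted average of local coefficient estimates for voted-in variables.

    Assumption: a node that did not select variable j contributes 0 for it.
    """
    total = float(sum(weights))
    combined: Dict[int, float] = {}
    for j in selected:
        acc = sum(w * est.get(j, 0.0) for est, w in zip(local_estimates, weights))
        combined[j] = acc / total
    return combined


if __name__ == "__main__":
    # Hypothetical output of three nodes holding 100, 100 and 200 observations.
    estimates = [
        {0: 1.0, 2: 0.5, 5: -2.0},
        {0: 1.2, 2: 0.4},
        {0: 0.9, 3: 0.1, 5: -1.8},
    ]
    sizes = [100, 100, 200]
    kept = weighted_majority_vote([list(e) for e in estimates], sizes)
    print(kept, combine_coefficients(estimates, sizes, kept))
```

    With these hypothetical inputs, variables 0 and 5 pass the vote, while variables 2 and 3 each receive exactly half of the total weight and are dropped. Only the selected index sets and the local estimates travel between nodes, which is what keeps the communication cost low in a split-and-conquer scheme.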

    Citation: Xuerui Li, Lican Kang, Yanyan Liu, Yuanshan Wu. Distributed Bayesian posterior voting strategy for massive data. Electronic Research Archive, 2022, 30(5): 1936-1953. doi: 10.3934/era.2022098



  • © 2022 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)