The emergence of massive data has driven recent interest in statistical learning and large-scale algorithms for analysis on distributed platforms. One widely used statistical approach is split-and-conquer (SaC), which was originally carried out by aggregating all local solutions through a simple average, thereby reducing the computational burden arising from communication costs. Aiming at low computation cost with acceptable accuracy, this paper extends SaC to Bayesian variable selection for ultrahigh-dimensional linear regression and develops a weighted majority voting aggregation scheme, BVSaC. Suppose ultrahigh-dimensional data are stored in a distributed manner across multiple computing nodes, each holding a disjoint subset of the data. On each node, variable selection and coefficient estimation are performed through a hierarchical Bayes formulation; BVSaC then combines the local results by weighted majority voting so that good overall performance is retained. The proposed approach requires only a small fraction of the computational cost on each local dataset, which eases the burden of Bayesian computation in particular, while sacrificing only a little accuracy; this in turn makes the analysis of extraordinarily large datasets feasible. Simulations and a real-world example show that the proposed approach performs as well as the whole-sample hierarchical Bayes method in terms of variable selection and estimation accuracy.
Citation: Xuerui Li, Lican Kang, Yanyan Liu, Yuanshan Wu. Distributed Bayesian posterior voting strategy for massive data[J]. Electronic Research Archive, 2022, 30(5): 1936-1953. doi: 10.3934/era.2022098
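The weighted majority voting step described in the abstract can be made concrete with a minimal sketch. The code below assumes each node has already returned a binary inclusion vector from its local hierarchical Bayes fit, together with a nonnegative weight; the function name `bvsac_vote`, the `threshold` parameter, and the toy weights are hypothetical illustrations, and the paper's actual weighting scheme and posterior sampler are not reproduced here.

```python
import numpy as np

def bvsac_vote(local_selections, weights, threshold=0.5):
    """Aggregate per-node variable-selection results by weighted majority vote.

    local_selections : (K, p) binary array; row k is node k's inclusion
        indicator over the p candidate predictors.
    weights : length-K array of nonnegative node weights (an assumed
        input here, e.g., some local goodness-of-fit score).
    threshold : a predictor is retained when its normalized weighted
        vote share exceeds this value (0.5 gives a simple majority).
    """
    S = np.asarray(local_selections, dtype=float)  # K x p indicator matrix
    w = np.asarray(weights, dtype=float)
    vote_share = w @ S / w.sum()  # weighted inclusion frequency per predictor
    return vote_share > threshold

# Toy usage: 3 nodes voting on 5 candidate predictors.
sel = np.array([[1, 1, 0, 0, 1],
                [1, 0, 0, 0, 1],
                [1, 1, 0, 1, 0]])
w = np.array([0.5, 0.3, 0.2])
print(bvsac_vote(sel, w))  # [ True  True False False  True]
```

Because only the short indicator vectors and weights leave each node, the aggregation itself is essentially free in both communication and computation; the expensive Bayesian sampling happens once per local subsample.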