Quantile regression (QR) has been widely used in many fields because of its robustness and comprehensiveness. However, performing QR on streaming data remains challenging for conventional methods, as they all assume that the entire dataset fits in memory. To address this issue, this paper proposes a Bayesian QR approach for streaming data, in which the posterior distribution is updated using aggregated statistics of the current and historical data. In addition, theoretical results are presented to confirm that the streaming posterior distribution is equivalent to the oracle posterior distribution calculated using the entire dataset at once. Moreover, we provide an algorithmic procedure for the proposed method. The algorithm shows that the proposed method only needs to store the parameters of the historical posterior distribution of the streaming data. Thus, it is computationally simple and not storage-intensive. Both simulations and real data analysis are conducted to illustrate the good performance of the proposed method.
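The idea of updating a posterior from stored parameters alone can be illustrated with a minimal sketch. This is not the paper's algorithm: as an assumption made for simplicity, a conjugate Gaussian linear model with known noise variance stands in for the QR likelihood, and the names `update_posterior`, `noise_var`, and the simulated data are all hypothetical. The sketch only shows that batch-by-batch updating, keeping nothing but the posterior mean and precision, can reproduce the posterior computed from the full dataset.

```python
import numpy as np

def update_posterior(mean, precision, X, y, noise_var=1.0):
    """Combine a stored Gaussian posterior N(mean, precision^-1) with a new
    batch (X, y), assuming a Gaussian likelihood with known noise variance.
    (Illustrative stand-in; the paper works with a QR likelihood instead.)"""
    new_precision = precision + X.T @ X / noise_var
    rhs = precision @ mean + X.T @ y / noise_var
    new_mean = np.linalg.solve(new_precision, rhs)
    return new_mean, new_precision

# Simulated data (hypothetical): 600 observations, 3 coefficients.
rng = np.random.default_rng(0)
p = 3
beta_true = np.array([1.0, -2.0, 0.5])
X_all = rng.normal(size=(600, p))
y_all = X_all @ beta_true + rng.normal(size=600)

prior_mean = np.zeros(p)
prior_precision = np.eye(p)

# Streaming: process 6 batches, storing only (mean, precision) between batches.
mean, precision = prior_mean, prior_precision
for X_b, y_b in zip(np.array_split(X_all, 6), np.array_split(y_all, 6)):
    mean, precision = update_posterior(mean, precision, X_b, y_b)

# Oracle: a single update that sees the entire dataset at once.
oracle_mean, oracle_precision = update_posterior(
    prior_mean, prior_precision, X_all, y_all
)

# The two posteriors agree up to floating-point error.
print(np.allclose(mean, oracle_mean), np.allclose(precision, oracle_precision))
```

In this simplified setting the storage cost is one vector of length p and one p-by-p matrix, regardless of how many batches have arrived. The QR case is harder because the likelihood is not Gaussian, which is the gap the proposed method addresses.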
Citation: Zixuan Tian, Xiaoyue Xie, Jian Shi. Bayesian quantile regression for streaming data[J]. AIMS Mathematics, 2024, 9(9): 26114-26138. doi: 10.3934/math.20241276