Research article

Corrected optimal subsampling for a class of generalized linear measurement error models

  • Received: 29 December 2024; Revised: 15 February 2025; Accepted: 24 February 2025; Published: 28 February 2025
  • MSC: 62F12, 62J12

  • Subsampling techniques are widely used for massive data because they can substantially reduce computing time. However, existing subsampling techniques do not account for dirty data, in particular covariates that are inaccurate because of measurement errors, which lead to inconsistent estimators of the regression coefficients. To eliminate the influence of measurement errors on parameter estimators for massive data, this paper combines the corrected score function with the subsampling technique. The consistency and asymptotic normality of the estimators under general subsampling are derived. In addition, optimal subsampling probabilities are obtained from the general subsampling algorithm using the A-optimality and L-optimality criteria together with the truncation method, and an adaptive two-step algorithm is then developed. The effectiveness of the proposed method is demonstrated through numerical simulations and two real data analyses.
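    As a rough illustration of the idea summarized in the abstract, the following sketch applies a corrected score to a weighted subsample in the simplest special case, a linear model with additive measurement error W = X + U. It is not the authors' algorithm: the function name `subsample_estimates`, the simulated data, the assumption that the error covariance `sigma_uu` is known, and the use of uniform subsampling probabilities in place of the A- or L-optimal ones are all assumptions made here for illustration.

```python
import numpy as np

def subsample_estimates(W, y, sigma_uu, probs, r, rng):
    """Return (naive, corrected) weighted estimates from r subsampled rows.

    Hypothetical helper: linear model only, sigma_uu assumed known.
    """
    n = W.shape[0]
    # Draw r rows with replacement according to the subsampling probabilities.
    idx = rng.choice(n, size=r, replace=True, p=probs)
    w = 1.0 / probs[idx]                      # inverse-probability weights
    Ws, ys = W[idx], y[idx]
    gram = Ws.T @ (Ws * w[:, None])           # sum_i w_i W_i W_i'
    rhs = Ws.T @ (ys * w)                     # sum_i w_i W_i y_i
    naive = np.linalg.solve(gram, rhs)
    # Corrected score: sum_i w_i [W_i (y_i - W_i' b) + sigma_uu b] = 0,
    # i.e. subtract the weighted error covariance from the Gram matrix.
    corrected = np.linalg.solve(gram - w.sum() * sigma_uu, rhs)
    return naive, corrected

# Simulated example (all settings are illustrative assumptions).
rng = np.random.default_rng(0)
n, r = 200_000, 5_000
beta_true = np.array([1.0, -0.5, 0.5])
sigma_uu = 0.25 * np.eye(3)                   # assumed known error covariance

X = rng.normal(size=(n, 3))                   # true, unobserved covariates
W = X + rng.normal(scale=0.5, size=(n, 3))    # observed covariates with error
y = X @ beta_true + rng.normal(scale=0.5, size=n)

probs = np.full(n, 1.0 / n)                   # uniform placeholder probabilities
beta_naive, beta_corr = subsample_estimates(W, y, sigma_uu, probs, r, rng)

print("naive    :", beta_naive)
print("corrected:", beta_corr)
```

    In this setup the naive subsample estimate is attenuated toward zero, while the corrected one should land closer to `beta_true`. Replacing `probs` with non-uniform probabilities only changes the weights `w`; how to choose those probabilities optimally is the subject of the paper itself and is not reproduced here.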

    Citation: Ruiyuan Chang, Xiuli Wang, Mingqiu Wang. Corrected optimal subsampling for a class of generalized linear measurement error models[J]. AIMS Mathematics, 2025, 10(2): 4412-4440. doi: 10.3934/math.2025203



  • © 2025 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
