Research article

Under-bagging nearest neighbors for long-tailed regression

  • Published: 25 November 2025
  • Long-tailed regression, also known as imbalanced regression, poses significant challenges in real-world prediction tasks where the target labels follow highly skewed or heavy-tailed distributions. In such scenarios, conventional regressors tend to be biased toward high-density regions and perform poorly on tail samples, which often carry critical information in scientific and industrial applications. To address this issue, we propose an ensemble-based under-sampling algorithm named under-bagging nearest neighbors for long-tailed regression (UNNLR). The method employs data-driven histogram density estimation (HDE) to adaptively estimate the label densities and determine sampling probabilities, thereby generating approximately uniform label distributions in the subsampled datasets. To mitigate the information loss caused by under-sampling, a bagging mechanism is introduced, allowing multiple $ k $-nearest neighbor ($ k $-NN) regressors to be trained on bootstrap sub-samples in parallel. Our theoretical analysis establishes, for the first time, minimax-optimal convergence rates for sampling-based regression under label imbalance. Experiments on both synthetic and real-world datasets demonstrate that UNNLR consistently outperforms existing sampling-based methods such as SMOGN, IRRCE, and ReBagg in terms of balanced mean squared error (BMSE), while achieving superior computational efficiency. The proposed framework bridges theoretical guarantees with scalable regression learning under long-tailed label distributions.
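    The pipeline sketched in the abstract — estimate label density with a histogram, sub-sample with probability inversely proportional to that density, and average $ k $-NN regressors over bootstrap sub-samples — can be illustrated with a minimal NumPy sketch. The function names, the acceptance-probability rule, and the sub-sample size heuristic below are illustrative assumptions for exposition, not the paper's exact estimator:

    ```python
    import numpy as np

    def knn_predict(X_train, y_train, X_test, k):
        """Brute-force k-NN regression: average the labels of the k nearest points."""
        dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
        nearest = np.argsort(dists, axis=1)[:, :k]
        return y_train[nearest].mean(axis=1)

    def unnlr_predict(X, y, X_test, n_bags=10, k=5, n_bins=20, seed=0):
        """Under-bagging k-NN sketch: sub-sample inversely to the histogram
        label density, fit one k-NN per sub-sample, average the predictions."""
        rng = np.random.default_rng(seed)
        counts, edges = np.histogram(y, bins=n_bins)       # histogram density estimate
        bin_idx = np.clip(np.digitize(y, edges[1:-1]), 0, n_bins - 1)
        probs = 1.0 / counts[bin_idx]                      # inverse-density weights
        probs /= probs.sum()                               # -> roughly uniform label draw
        n_sub = int(counts[counts > 0].min()) * n_bins     # balanced sub-sample size
        preds = np.zeros((n_bags, len(X_test)))
        for b in range(n_bags):                            # bags could run in parallel
            idx = rng.choice(len(y), size=n_sub, replace=True, p=probs)
            preds[b] = knn_predict(X[idx], y[idx], X_test, min(k, n_sub))
        return preds.mean(axis=0)
    ```

    Drawing sub-sample indices with inverse-density probabilities makes rare tail labels roughly as likely to appear in each bag as head labels, while averaging over bags recovers some of the information a single small sub-sample would discard.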

    Citation: Hanyuan Hang, Hongwei Wen, Zhouchen Lin. Under-bagging nearest neighbors for long-tailed regression[J]. Big Data and Information Analytics, 2025, 9: 231-264. doi: 10.3934/bdia.2025011




    [1] Yang Y, Zha K, Chen Y, Wang H, Katabi D, (2021) Delving into deep imbalanced regression, In: International Conference on Machine Learning, 11842–11851.
    [2] Ren J, Zhang M, Yu C, Liu Z, (2022) Balanced MSE for imbalanced visual regression, In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7926–7935. https://doi.org/10.1109/CVPR52688.2022.00777
    [3] Branco P, Torgo L, (2019) A study on the impact of data characteristics in imbalanced regression tasks, In: 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA), 193–202. https://doi.org/10.1109/DSAA.2019.00034
    [4] Steininger M, Kobs K, Davidson P, Krause A, Hotho A, (2021) Density-based weighting for imbalanced regression. Mach Learn 110: 2187–2211. https://doi.org/10.1007/s10994-021-06023-5
    [5] Ding Y, Jia M, Zhuang J, Ding P, (2022) Deep imbalanced regression using cost-sensitive learning and deep feature transfer for bearing remaining useful life estimation. Appl Soft Comput 127: 109271. https://doi.org/10.1016/j.asoc.2022.109271
    [6] Islam A, Belhaouari SB, Rehman AU, Bensmail H, (2022) KNNOR: An oversampling technique for imbalanced datasets. Appl Soft Comput 115: 108288. https://doi.org/10.1016/j.asoc.2021.108288
    [7] Branco P, Torgo L, Ribeiro RP, (2017) SMOGN: A pre-processing approach for imbalanced regression, In: First International Workshop on Learning with Imbalanced Domains: Theory and Applications, 36–50.
    [8] Song XY, Dao N, Branco P, (2022) DistSMOGN: Distributed SMOGN for imbalanced regression problems, In: Fourth International Workshop on Learning with Imbalanced Domains: Theory and Applications, 38–52.
    [9] Orhobor OI, Grinberg NF, Soldatova LN, King RD, (2022) Imbalanced regression using regressor-classifier ensembles. Mach Learn 112: 1365–1387. https://doi.org/10.1007/s10994-022-06199-4
    [10] Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP, (2002) SMOTE: Synthetic minority over-sampling technique. J Artif Intell Res 16: 321–357. https://doi.org/10.1613/jair.953
    [11] Branco P, Torgo L, Ribeiro RP, (2019) Pre-processing approaches for imbalanced distributions in regression. Neurocomputing 343: 76–99. https://doi.org/10.1016/j.neucom.2018.11.100
    [12] Torgo L, Branco P, Ribeiro RP, Pfahringer B, (2015) Resampling strategies for regression. Exp Syst 32: 465–476. https://doi.org/10.1111/exsy.12081
    [13] Branco P, Torgo L, Ribeiro RP, (2018) ReBagg: Resampled bagging for imbalanced regression, In: Second International Workshop on Learning with Imbalanced Domains: Theory and Applications, 67–81.
    [14] Hang H, Steinwart I, Feng Y, Suykens JAK, (2018) Kernel density estimation for dynamical systems. J Mach Learn Res 19: 1–49.
    [15] Sadouk L, Gadi T, Essoufi EH, (2021) A novel cost-sensitive algorithm and new evaluation strategies for regression in imbalanced domains. Exp Syst 38: e12680. https://doi.org/10.1111/exsy.12680
    [16] Tsybakov AB, (2004) Introduction to Nonparametric Estimation, Springer New York. https://doi.org/10.1007/b13794
    [17] Bartlett PL, Long PM, Lugosi G, Tsigler A, (2020) Benign overfitting in linear regression. Proc Natl Acad Sci 117: 30063–30070. https://doi.org/10.1073/pnas.1907378117
    [18] Friedman JH, Bentley JL, Finkel RA, (1977) An algorithm for finding best matches in logarithmic expected time. ACM Trans Math Software 3: 209–226. https://doi.org/10.1145/355744.355745
    [19] Hang H, Cai Y, Yang H, Lin Z, (2022) Under-bagging nearest neighbors for imbalanced classification. J Mach Learn Res 23: 1–63.
    [20] Chaudhuri K, Dasgupta S, (2014) Rates of convergence for nearest neighbor classification. Adv Neural Inf Process Syst 27: 3437–3445.
    [21] Zhao P, Lai L, (2021) Minimax rate optimal adaptive nearest neighbor classification and regression. IEEE Trans Inf Theory 67: 3155–3182. https://doi.org/10.1109/TIT.2021.3062078
    [22] Dua D, Graff C, UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, 2017. Available from: http://archive.ics.uci.edu/ml.
    [23] Vanschoren J, van Rijn JN, Bischl B, Torgo L, (2014) OpenML: Networked science in machine learning. ACM SIGKDD Explor Newsl 15: 49–60. https://doi.org/10.1145/2641190.2641198
    [24] Nash WJ, Sellers TL, Talbot SR, Cawthorn AJ, Ford WB, (1994) The population biology of abalone (Haliotis species) in Tasmania. Sea Fish Div Tech Rep 48: 411.
    [25] Fanaee-T H, Gama J, (2014) Event labeling combining ensemble detectors and background knowledge. Prog Artif Intell 2: 113–127. https://doi.org/10.1007/s13748-013-0040-3
    [26] Zhao Y, Sun J, (2009) Recursive reduced least squares support vector regression. Pattern Recognit 42: 837–842. https://doi.org/10.1016/j.patcog.2008.09.028
    [27] Yeh IC, (1998) Modeling of strength of high-performance concrete using artificial neural networks. Cem Concr Res 28: 1797–1808. https://doi.org/10.1016/S0008-8846(98)00165-3
    [28] Alhamdoosh M, Wang D, (2014) Fast decorrelated neural network ensembles with random weights. Inf Sci 264: 104–117. https://doi.org/10.1016/j.ins.2013.12.016
    [29] Torgo L, Ribeiro R, (2003) Predicting outliers, In: Knowledge Discovery in Databases: PKDD 2003, 447–458. https://doi.org/10.1007/978-3-540-39804-2_40
    [30] Bernstein SN, (1946) The Theory of Probabilities, Moscow: Gastehizdat Publishing House.
    [31] Hoeffding W, (1963) Probability inequalities for sums of bounded random variables. J Am Stat Assoc 58: 13–30. https://doi.org/10.1080/01621459.1963.10500830
    [32] Steinwart I, Christmann A, (2008) Support Vector Machines, Springer Science & Business Media, 1–312. https://doi.org/10.1007/978-0-387-77242-4_1
    [33] Torgo L, Ribeiro R, (2007) Utility-based regression, In: European Conference on Principles of Data Mining and Knowledge Discovery, 597–604. https://doi.org/10.1007/978-3-540-74976-9_63
    [34] Torgo L, Ribeiro RP, Pfahringer B, Branco P, (2013) SMOTE for regression, In: Portuguese Conference on Artificial Intelligence, 378–389. https://doi.org/10.1007/978-3-642-40669-0_33
  • © 2025 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
