Research article

An innovative approach of determining the sample data size for machine learning models: a case study on health and safety management for infrastructure workers

  • Received: 14 June 2022 Revised: 05 July 2022 Accepted: 05 July 2022 Published: 22 July 2022
  • Numerical experiments are an essential part of academic studies in transportation management. Using an appropriate sample size can reduce both the cost of data collection and the computing time. However, few studies have addressed how to determine the sample size. In this research, we use four typical machine learning regression models and a dataset from transport infrastructure workers to explore the appropriate sample size. By observing 12 learning curves, we conclude that a sample size of 250 balances model performance against the cost of data collection. Our study can serve as a reference when deciding in advance how many samples to collect.

    Citation: Haoqing Wang, Wen Yi, Yannick Liu. An innovative approach of determining the sample data size for machine learning models: a case study on health and safety management for infrastructure workers[J]. Electronic Research Archive, 2022, 30(9): 3452-3462. doi: 10.3934/era.2022176
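
    The learning-curve idea summarized in the abstract can be illustrated with a minimal sketch: fit a model on training subsets of increasing size and track the error on a fixed test set until it levels off. Everything below is an assumption made for illustration, not the authors' setup: the data are synthetic, the model is a one-feature ordinary least squares fit rather than the four regression models of the study, and the names `fit_line`, `rmse`, `learning_curve`, and `make_data` are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = a + b * x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def rmse(a, b, xs, ys):
    """Root mean squared error of the line (a, b) on the given points."""
    return (sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5


def learning_curve(train, test, sizes, repeats=30, seed=0):
    """Mean test RMSE for each training size, averaged over random subsets."""
    rng = random.Random(seed)
    x_test, y_test = zip(*test)
    curve = []
    for n in sizes:
        errors = []
        for _ in range(repeats):
            xs, ys = zip(*rng.sample(train, n))  # subset drawn without replacement
            a, b = fit_line(xs, ys)
            errors.append(rmse(a, b, x_test, y_test))
        curve.append((n, statistics.fmean(errors)))
    return curve


def make_data(n, rng):
    """Synthetic data (assumed): y = 2 + 3x plus Gaussian noise with sd 1."""
    data = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        data.append((x, 2.0 + 3.0 * x + rng.gauss(0, 1.0)))
    return data


if __name__ == "__main__":
    rng = random.Random(42)
    train, test = make_data(500, rng), make_data(200, rng)
    for n, err in learning_curve(train, test, [5, 10, 25, 50, 100, 250, 500]):
        print(f"n={n:4d}  mean test RMSE={err:.3f}")
```

    Reading the printed curve from small to large `n`, the point where additional samples stop reducing the error noticeably is a candidate sample size. Where that point falls depends on the data and the model, so this sketch says nothing about the value of 250 reported in the abstract.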



  • © 2022 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License.
