1. Introduction
Conducting experiments is an essential part of academic research in the field of transportation and construction management (see [1,2,3]). Scholars need samples to conduct analyses or validate models. Especially with the application of machine learning algorithms [4,5], an increasing number of studies have to collect data to test the algorithms they use. Choosing the sample size of an experiment is an essential first step and a vital decision problem. If the sample size is too small, the experimental results will not be convincing; however, because data collection takes time and money, collecting more samples than necessary is wasteful. Thus, determining the appropriate sample size is worth investigating.
Hu et al. [6] explore the minimum training sample size for Bayesian networks and find that Bayesian networks are not sensitive to the training sample size. Ma et al. [7] examine the influence of sample size on establishing reference intervals in clinical medicine and report that the reference intervals are more consistent when the sample size is greater than or equal to 2000. Burmeister and Aitken [8] also address the sample size problem in clinical medicine and argue that this research topic is both clinically and statistically meaningful. Cui and Gong [9] use six regression models to study the effect of sample size when predicting personalized behavior and find that prediction accuracy increases with sample size. Taherdoost [10] addresses the problem of survey sample size calculation in the social sciences. Lakens [11] proposes six factors for determining the appropriate sample size in empirical studies. With the widespread application of machine learning methods, which need to be experimentally verified and evaluated, the choice of sample size has become a problem that cannot be ignored. Specifically, there are many data-driven and algorithm-related studies in the field of transportation and construction management. In autonomous vehicle control, historical vehicle trajectories are used to predict future trajectories (see [12,13,14,15]). In ship fuel consumption management, data on ships' static factors, voyage-dependent factors, weather information, and fuel consumption rate are used (see [16,17]). In ship inspection planning, data on ships' age, flag, and historical inspection results are used to develop machine learning models [18]. In health and safety management of construction workers, data on workers' physiological factors, environmental factors, work-related factors, and fatigue are used [19]. Li et al. [20] use a dataset of 589 images of rebar to apply deep learning to rebar counting. Shehadeh et al. [21] apply three machine learning algorithms to predict the residual value of heavy construction equipment. However, these studies in the field of transportation and construction management do not discuss how to determine the sample size used in their experiments.
The existing literature on sample size determination lies mainly in clinical medicine and survey research. To the best of our knowledge, no studies focus on sample size determination in transportation management or construction management. Therefore, we choose four typical regression models in machine learning—multiple linear regression, ridge regression, LASSO regression, and support vector regression—to study the optimal sample size for machine learning models in health and safety management for transport infrastructure construction workers. Our research contributes at both theoretical and practical levels. First, the methodology used in our study can provide scholars and practitioners with a reference for determining the number of samples to collect. Even if they do not use the four methods tested in our experiments, the framework of starting with a small sample size and increasing it incrementally can still guide the choice of an appropriate sample size. Second, by observing learning curves, we suggest a sample size for machine learning models using a case in health and safety management for urban transportation infrastructure workers. Studies in similar fields can directly apply our results.
We note that collecting data and developing a machine learning model are usually not the end of solving a problem. Their outputs often serve as inputs to an optimization model, which produces decisions that can be implemented. For instance, predicted numbers of commuters can be used for bus route design [22]; predicted ship handling times can be used for pilot scheduling and shipping service design (see [23,24,25]); and predicted amounts of ship emissions can be used for managing ship operations [26,27,28,29,30,31]. Moreover, a number of advanced techniques that solve problems involving both prediction and optimization have been developed [32,33,34].
The remainder of this paper is organized as follows. Section 2 describes our dataset and methods in detail. Section 3 reports the results. Conclusions are drawn in Section 4.
2. Dataset and methods
We have collected a dataset of 550 workers at construction sites in Hong Kong. The dataset contains eight features: age, body mass index (BMI), alcohol drinking habit, smoking habit, temperature, relative humidity, job nature, and work duration, as well as one label, the rating of perceived exertion (RPE). Chan et al. [35,36] have used this dataset to explore construction workers' heat-stress problems. Detailed explanations of the features and the label can be found in Table 1. Descriptive statistics of the dataset are shown in Table 2.
We adopt four commonly used regression algorithms: multiple linear regression, ridge regression, LASSO regression, and support vector regression (SVR), as these four models are typical and our prediction task is suitable for regression models. Multiple linear regression is a well-known method that fits the feature variables and the label into a linear model by minimizing the residual sum of squares [37,38]:

$$ (\omega^*, b^*) = \underset{\omega, b}{\arg\min} \sum_{i=1}^{N} \left( \omega^\top x_i + b - y_i \right)^2, $$
where subscript i indexes samples, i=1,…,N, and N is the total number of samples; xi and yi represent the feature variables and the label (i.e., RPE) of sample i, respectively. Multiple linear regression aims to find the optimal ω∗ and b∗ to predict unknown labels from the given feature variables. In our study, xi is an 8-dimensional vector, as shown in Table 1, and ω∗ is an 8-dimensional coefficient vector.
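To make the procedure concrete, the following minimal sketch fits a multiple linear regression with scikit-learn and recovers ω∗ and b∗. The random matrices stand in for our dataset, which is not reproduced here; the variable names are illustrative.

```python
# A minimal sketch, assuming a stand-in for the 550-record, 8-feature dataset.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(550, 8))                    # placeholder for the 8 features
y = rng.integers(1, 9, size=550).astype(float)   # placeholder for RPE in [1, 8]

model = LinearRegression().fit(X, y)
omega_star, b_star = model.coef_, model.intercept_  # fitted omega* and b*
print(omega_star.shape, b_star)                     # (8,) and a scalar
```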
Ridge regression introduces an L2-norm penalty in the objective function [39,40]:

$$ \min_{\omega, b} \sum_{i=1}^{N} \left( \omega^\top x_i + b - y_i \right)^2 + \alpha \|\omega\|_2^2, $$
where the regularization term ‖ω‖₂² is the sum of the squares of the coefficients, and α≥0 is a hyperparameter that makes a trade-off between the prediction errors and the regularization term. xi is the 8-dimensional feature vector and yi represents the label (i.e., RPE).
LASSO regression introduces an L1-norm penalty in the objective function [41,42]:

$$ \min_{\omega, b} \sum_{i=1}^{N} \left( \omega^\top x_i + b - y_i \right)^2 + \alpha \|\omega\|_1, $$

where ‖ω‖₁ represents the sum of the absolute values of the coefficients.
Both ridge regression and LASSO regression can effectively deal with overfitting, as they add a penalty on ω to the minimization objective. Moreover, LASSO regression can shrink the coefficients of some features to exactly 0; that is, LASSO regression can perform feature selection [43]. We use cross validation to determine the value of α.
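The sketch below illustrates this feature-selection effect on synthetic data: the L1 penalty drives some coefficients exactly to zero, while the L2 penalty of ridge regression only shrinks them. The data, the sparse ground-truth coefficients, and the penalty values are illustrative assumptions, not taken from our experiments.

```python
# A small illustration on synthetic data, assuming a sparse true coefficient vector.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
true_w = np.array([2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.5, 0.0])  # sparse truth
y = X @ true_w + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print("ridge coefficients:", np.round(ridge.coef_, 3))  # shrunk but all nonzero
print("lasso coefficients:", np.round(lasso.coef_, 3))  # typically several exactly 0
```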
SVR can tolerate an ϵ error between f(xi) = ω⊤xi + b and yi, i.e., the prediction error is 0 when |f(xi)−yi| ≤ ϵ. SVR can be expressed by the formulation below [44,45]:

$$ \min_{\omega, b, \xi, \xi^*} \; \frac{1}{2}\|\omega\|_2^2 + C \sum_{i=1}^{N} \left( \xi_i + \xi_i^* \right) $$

subject to

$$ f(x_i) - y_i \le \epsilon + \xi_i, \quad y_i - f(x_i) \le \epsilon + \xi_i^*, \quad \xi_i, \xi_i^* \ge 0, \quad i = 1, \dots, N. $$
In the above formulation, ξi and ξi* are slack variables measuring deviations beyond the ϵ-insensitive band, and C is a hyperparameter that makes a trade-off between bias and variance. xi is the 8-dimensional vector of the eight feature variables and yi is the label of sample i. We also use cross validation to determine the value of C. Our label, the rating of perceived exertion, ranges between 1 and 8; therefore, we set ϵ to 0.1 in our experiments.
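A sketch of this setup with scikit-learn follows, using the ϵ-insensitive loss with ϵ = 0.1 and selecting C by cross-validated grid search. The linear kernel mirrors the formulation above; the candidate values of C and the synthetic data are assumptions for illustration.

```python
# A sketch of SVR with epsilon = 0.1 and C chosen by 10-fold grid search.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 8))
y = rng.integers(1, 9, size=300).astype(float)   # placeholder for RPE

grid = GridSearchCV(
    SVR(kernel="linear", epsilon=0.1),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},   # illustrative candidate values
    cv=10,
    scoring="neg_mean_squared_error",
)
grid.fit(X, y)
print("selected C:", grid.best_params_["C"])
```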
Following previous studies [46,47], we use mean squared error (MSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) to evaluate the performance of the four regression models under different sample sizes.
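The three metrics can be computed directly with scikit-learn, as in the helper below. Note that scikit-learn returns MAPE as a fraction, so it is multiplied by 100 here to report a percentage; the function name `evaluate` is our own convenience wrapper.

```python
# Computing MSE, MAE, and MAPE (as a percentage) with scikit-learn.
from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    mean_absolute_percentage_error,
)

def evaluate(y_true, y_pred):
    return {
        "MSE": mean_squared_error(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        "MAPE": 100 * mean_absolute_percentage_error(y_true, y_pred),
    }
```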
For multiple linear regression, we set the sample size to 50, 75, …, 500 in each experiment. For each sample size, we conduct 100 experiments by randomly splitting the dataset into a training set containing exactly "sample size" records and a testing set containing the remaining samples. We record each prediction on the testing set and calculate the values of MSE, MAE, and MAPE. Finally, we measure the model's performance at each sample size by averaging each of the three metrics over the 100 experiments. A sketch of this protocol is given below.
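The following sketch implements the learning-curve protocol under the assumptions above: it reuses the `evaluate` helper from the previous sketch, and `X` and `y` stand in for the 550-record dataset.

```python
# A sketch of the learning-curve protocol: for each sample size, run 100
# experiments with random train/test splits and average the test metrics.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def learning_curve(X, y, sizes=range(50, 501, 25), n_runs=100):
    results = {}
    for n in sizes:
        metrics = []
        for seed in range(n_runs):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, train_size=n, random_state=seed
            )
            model = LinearRegression().fit(X_tr, y_tr)
            metrics.append(evaluate(y_te, model.predict(X_te)))
        # average each metric over the 100 runs at this sample size
        results[n] = {k: np.mean([m[k] for m in metrics]) for k in metrics[0]}
    return results
```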
For ridge regression, LASSO regression, and SVR, we first need to deal with the hyperparameters α and C. We use K-fold cross validation to determine their values. We set α to 0.001, 0.01, 0.1, 1, 5, and 10, respectively, and randomly divide our dataset into 10 parts, i.e., K=10. We use 9 parts of the data to train the model and the remaining part as the testing set to evaluate the model's performance. For each value of α, we conduct 10 experiments and again calculate the values of MSE, MAE, and MAPE. By comparing the average values of these three indicators, we choose the optimal α for ridge regression and LASSO regression, respectively. Then, we use the optimal α in ridge regression or LASSO regression to test the impact of different sample sizes. The process afterwards is the same as for multiple linear regression. The determination of C for SVR is similar: we set C to different values and then conduct K-fold cross validation to choose the optimal C.
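A sketch of this hyperparameter selection follows, shown for ridge regression over the candidate α values listed above (LASSO is analogous). For brevity it scores candidates by MSE only, whereas our experiments compare all three indicators.

```python
# A sketch of choosing alpha by 10-fold cross validation. The score is
# negated MSE, so the maximum corresponds to the lowest mean MSE.
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def select_alpha(X, y, alphas=(0.001, 0.01, 0.1, 1, 5, 10)):
    mean_scores = {
        a: cross_val_score(
            Ridge(alpha=a), X, y, cv=10, scoring="neg_mean_squared_error"
        ).mean()
        for a in alphas
    }
    return max(mean_scores, key=mean_scores.get)  # alpha with the lowest mean MSE
```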
3. Results
The learning curves of the RPE forecasts produced by the four regression methods are reported in Figures 1-4, respectively. In each figure, the left axis shows the MSE and MAE results and the right axis shows the MAPE results. The results of multiple linear regression (see Figure 1) show that the prediction error first decreases with increasing sample size and then the error curves level off. The results of the other three methods (i.e., ridge regression, LASSO regression, and SVR) show the same pattern. The MSE, MAE, and MAPE of the four methods finally converge to approximately 1.05, 0.82, and 20.5, respectively. Observing these 12 learning curves, we find that the performance of the four models no longer improves significantly once the sample size exceeds 250. More specifically, the values of MSE, MAE, and MAPE almost keep decreasing until the sample size reaches 250; beyond that point, MSE and MAPE do not decrease by more than 2.5%, MAE does not decrease by more than 1.5%, and all three sometimes even increase. Thus, considering the cost of collecting data and the efficiency of computing, 250 samples are enough to conduct experiments.
4. Conclusions
In this work, we adopt four typical regression models in machine learning—multiple linear regression, ridge regression, LASSO regression, and SVR—to determine the optimal sample size in the field of transportation and construction management. By observing the percentage decline of the three error metrics, we find that a sample size of 250 is a good choice for scholars. This observation process can be automated by calculating the error variation between consecutive experiments and designing experiment stopping rules; error changes and the corresponding sample sizes can easily be recorded in a computer program. For example, we can regard the optimal sample size as found when the percentage error changes in five consecutive experiments do not exceed 3%, as sketched below.
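A minimal sketch of such a stopping rule follows. It assumes `errors` holds one metric value (e.g., MSE) per sample size in increasing order; the function name and the 5-experiment, 3% defaults mirror the example above.

```python
# A sketch of the proposed stopping rule: stop once the percentage error
# change stays within the threshold for `window` consecutive experiments.
def optimal_size_index(errors, window=5, threshold=0.03):
    consecutive = 0
    for i in range(1, len(errors)):
        change = abs(errors[i] - errors[i - 1]) / errors[i - 1]
        consecutive = consecutive + 1 if change <= threshold else 0
        if consecutive >= window:
            return i  # index of the experiment where the rule fires
    return None  # rule never satisfied; keep collecting samples
```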
As collecting data and running programs are costly and time-consuming, our study helps scholars and practitioners determine the sample size to use. However, this study is not without limitations. First, the dataset used comes from transport infrastructure workers, so the findings may not be widely applicable to other fields. Second, we only use regression models to test the impact of sample size; more complicated machine learning models, e.g., the Long Short-Term Memory (LSTM) network, may behave differently. Third, our dataset contains 8 feature variables and 1 label. If more feature variables are available to forecast the label, the optimal sample size may decrease, as more variables provide more information. Similarly, if the values of the label are more diverse, the optimal sample size may increase to maintain prediction performance. Therefore, the feature variables and labels used will affect the results, which is a limitation of the study.
Acknowledgments
The authors thank the three referees for their constructive comments, which significantly improved the quality of the paper. This study is supported by the start-up grant of The Hong Kong Polytechnic University (Project ID P0040224).
Conflict of interest
The authors declare there is no conflict of interest.