1. Introduction
Influenza (flu) is a very common respiratory infectious disease with high variability and infectivity. It spreads rapidly through droplets, with a very fast transmission speed and a short incubation period, so an influenza epidemic can spread widely in a short time and pose a serious threat to human health [1]. According to statistics from the World Health Organization (WHO), there are an estimated 600 million to 1.2 billion cases of influenza worldwide each year. This number includes at least 3 million severe cases and complications associated with influenza, and the number of fatalities ranges from 250,000 to 500,000 [2,3]. In China, approximately 84,200 to 92,000 people die every year from respiratory diseases caused by the flu, accounting for 8.2% of total deaths from respiratory diseases [4]. At the same time, the annual economic burden caused by influenza in China is about ¥ 26.381 billion, equivalent to 0.233‰ of GDP in 2021 [5]. If active and effective prevention and control measures are not taken, influenza will continue to impose a serious health and economic burden on both China and the world. Therefore, preventing and controlling flu is an important public health task that requires extensive attention and investment.
Several methods have been proposed for real-time detection and routine monitoring of flu activity. Traditional influenza surveillance systems primarily rely on reported influenza-like cases and virological data from health care providers, including hospitals, clinics and contract laboratories [6,7]. Although China has established a nationwide influenza surveillance system, the time taken to publicly report influenza cases is usually delayed by about 1–2 weeks. Furthermore, there are issues such as high operating costs, low coverage of the surveillance network, low efficiency in information reporting, over-reliance on historical influenza data without multidimensional data support and simplistic methods for data mining, prediction and early warning [8]. If it is possible to predict the flu trends in certain areas promptly and accurately, and take appropriate prevention and control measures before the outbreak of influenza, we can effectively control the spread of the disease and reduce the harm and economic losses caused by it.
Yang et al. [9] propose a comprehensive learning particle swarm optimization based machine learning (CLPSO-ML) framework incorporating support vector regression (SVR) and multilayer perceptron (MLP) for multi-step-ahead influenza prediction. Wang et al. [10] propose a new end-to-end spatiotemporal deep neural network structure for influenza risk prediction. The proposed model mainly consists of two parts. The first stage is the spatiotemporal feature extraction stage where two-stream convolutional and recurrent neural networks are constructed to extract different regions and time granularity information. Then, a dynamically parametric-based fusion method is adopted to integrate the two stream features and make predictions. Kumar et al. [11] propose a hybrid fuzzy time series forecasting model based on particle swarm optimization and the fuzzy c-mean technique, named as fuzzy time series particle swarm optimization extended fuzzy c-mean technique. Thomas et al. [12] develop methods for real-time prediction of the risk that an ongoing influenza epidemic will be exceptionally severe and for real-time detection of anomalous epidemics and use them for prediction and detection of anomalies for influenza epidemics in France. The quality of predictions is assessed on observed and simulated data. Wei et al. [13] aimed to enhance their prediction model by incorporating traditional hydrological and atmospheric data. Features, such as popular search keywords on Google Trends, public holiday information, population density, air quality indices, and the numbers of COVID-19 confirmed cases, were also used to train the model. Kara [14] introduced a hybrid method that combines long short-term memory (LSTM) neural network and genetic algorithm (GA) for multi-step influenza outbreak forecasting problems. Kumar et al. [15] propose a hybrid fuzzy time series model for the prediction of upcoming COVID-19 infection cases and deaths in India by using a modified fuzzy C-means clustering technique.
At present, some non-traditional methods for influenza monitoring have been developed. For example, Ackley et al. [16] conducted a comparative analysis by integrating data from smart thermometers and mobile applications with regional influenza and influenza-like illness (ILI) surveillance data from the California Department of Public Health. They utilized smart thermometer readings and mobile application data to predict regional influenza in California. The experimental results demonstrated that these data improved the predictive capability of influenza illness. Murayama et al. [17] utilized inter-regional commuting data as a representation of human mobility when building a regional influenza prediction model and used it as spatial information in graph convolutional network (GCN) to predict the geographical distribution of influenza patients. The results show that the GCN model based on commuting data significantly improves the prediction accuracy in both temporal and spatial dimensions, thus providing an appropriate prediction interval. Yang et al. [18] developed a comprehensive influenza monitoring framework by integrating electronic medical records (EMRs) from several hospitals in Taiwan and ILI data from the Taiwan Center for Disease Control and Prevention (TWCDC). This framework is scalable and can periodically integrate TWCDC ILI open data with EMRs across multiple hospitals to automatically monitor influenza activity and support early surveillance of influenza outbreaks. In addition, some researchers have achieved real-time monitoring and prediction of influenza activity by utilizing non-traditional data sources such as social media data [19], web search data [20], call center data [21], pharmacy sales data [22] and meteorological data [23].
To enhance the prediction and response capabilities to influenza outbreaks, numerous researchers and institutions are devoted to improving influenza prediction models. These models encompass prediction models based on machine learning and deep learning, alongside prediction models grounded in mathematical models. For instance, Lu et al. [24] proposed the ARGONet method, which combines two prediction approaches with machine learning to estimate local influenza epidemics in real-time. This method first extended the proven inference method for influenza activity, called ARGO, to various states in the United States, and incorporates information related to influenza, such as Google search frequency, electronic health records and historical flu trends. To enhance prediction accuracy, a spatial network method called Net was developed based on ARGO, which improved the influenza estimation of ARGO by combining the spatiotemporal patterns of influenza transmission in neighboring regions. In this study, the ARGO model alone outperformed the Google Flu Trend prediction system that operated from 2008 to 2015. Researcher Fred Lu stated that this new method may lay the foundation for effective prevention of infectious diseases. With the increasing availability of online search data and cloud-based electronic health records collected from medical service providers, this new model will be able to predict disease outbreaks and epidemics more accurately in the future. Zimmer et al. [25] combined the developed calibration and prediction framework with the established humidity-based propagation dynamics model to predict influenza. They found that incorporating daily near real-time internet search data improved the accuracy of short-term and medium-term predictions of influenza activity. Miliou et al. [26] proposed the use of retail market data to improve the prediction of seasonal influenza and developed a near-term forecasting and prediction framework that provided estimates of influenza incidence in Italy. They employed an SVR model to predict seasonal influenza incidence. The results quantitatively show the value of incorporating retail market data into the prediction model, which can serve as an agent for real-time analysis of epidemics. Huang [27] utilized a retrospective epidemiological survey method and, based on the Baidu index of H7N9 avian influenza keywords and clinical symptom keywords of H7N9 subtype avian influenza, established an SVR prediction model and a multiple linear regression prediction model in different segments to analyze the degree of fit. The results revealed that public search behavior, epidemic segment characteristics and the frequency of public search for clinical symptoms of infectious diseases significantly improved the capability of search engine big data to predict the epidemic trends of H7N9 subtype avian influenza.
Recent research indicates that flu transmission trends can be effectively monitored by integrating open-source search query data and machine learning methods. This approach not only enables the timely provision of useful information to the public and medical professionals for taking appropriate prevention and control measures but also holds tremendous potential. However, research in this field is still relatively limited domestically, which necessitates further exploration and development. Currently, research methods mainly focus on multiple correlation regression analysis [28,29], but this approach has some issues in predicting the trend of influenza transmission. For example, in a multiple linear regression model, multicollinearity among the independent variables may lead to model instability. Additionally, the relationship between ILI and related factors is influenced by various factors, which may not exhibit a simple linear relationship. Therefore, using conventional linear models for fitting may not achieve the desired predictive performance.
Overseas research has primarily focused on using Google search engine data and Twitter data [30,31,32,33,34], while in China, Baidu index has become one of the main sources of search engine data. As of July 2022, Baidu holds a dominant market share of 71.2% in the Chinese search engine market, far surpassing other search engines, which better reflects the level of attention that most Chinese people have towards the epidemic. Therefore, this study utilized web search data provided by Baidu index and ILI data, and constructed a nonlinear influenza prediction model suitable for the characteristics of southern China based on machine learning methods. By leveraging Baidu index and machine learning algorithms, the model can better predict the spread trend of influenza and provide relevant information in time to provide scientific support for influenza prevention and control efforts.
2. Data acquisition and processing
2.1. Data source
The official influenza-like illness (ILI) case data used in this study were obtained from the weekly ILI reports released by the National Influenza Center of China [35]. These data are collected through the collaboration of medical institutions at all levels, disease prevention and control centers and sentinel hospitals, which monitor, summarize and analyze the influenza data reported by sentinel hospitals across the country. In this paper, we collected 207 weeks of ILI data for the southern region of China, from the 1st week of 2018 to the 49th week of 2021. These official data are recognized as a reliable source and are widely used for research on and monitoring of influenza transmission trends. By analyzing these data, we can obtain important information about the epidemic situation and changing trends of influenza in southern China.
The web search data originates from Baidu index of Baidu search engine [36]. It is a statistical index that comprehensively reflects the reference value of user interest and media attention to a specific keyword on a certain day. Based on the search volume of internet users on Baidu, the weighted sum of search frequency of each keyword in Baidu web search is analyzed and calculated. In this paper, we first conducted a long-tail keyword search using "flu" as the initial value on "Chinaz.com." We selected keywords with a whole network index greater than 200 and chose relatively original search terms related to influenza symptoms, treatment, preventive measures and other aspects. Then, we referred to the literature to summarize other keywords used in relevant studies. A total of 37 keywords that may be related to changes in the influenza epidemic trend were sorted out, as shown in Table 1.
2.2. Data preprocessing
The Baidu index of flu-related keywords is counted on a daily basis. In order to conduct consistent analysis with other time series data, it needs to be aggregated on a weekly basis. Each keyword's weekly summaries are calculated separately. However, missing data were found when collecting the Baidu index for keywords. To improve the accuracy of the prediction of ILI, it is necessary to repair the raw data. To address this issue, the K-nearest neighbors (KNN) algorithm was employed to fill in the missing data in the Baidu index of keywords K1, K2, K4, K5, K6, K8, K9, K18, K31 and K32. The algorithm utilizes existing adjacent data points to infer the missing value and interpolates by finding neighbor data that is most similar to the missing data. This approach enables the estimation of the missing data and reduces its impact on the accuracy of ILI prediction. The repaired data allows for a more comprehensive analysis of the trends and changes in flu-related keywords, providing more accurate predictions and insights.
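The aggregation and repair steps described above can be sketched in plain Python as follows. This is a minimal illustration and not the authors' implementation: the function names, the use of weekly sums, the choice of k, the distance computed over the keywords observed in both weeks, and the toy numbers are all assumptions.

```python
import math


def weekly_totals(daily, days_per_week=7):
    """Sum a daily index series into consecutive weekly totals.

    A week that contains any missing day (None) is marked as missing (None).
    Trailing days that do not fill a whole week are dropped.
    """
    weeks = []
    for start in range(0, len(daily) - days_per_week + 1, days_per_week):
        chunk = daily[start:start + days_per_week]
        weeks.append(None if any(v is None for v in chunk) else sum(chunk))
    return weeks


def knn_impute(rows, k=2):
    """Fill None entries with the mean of the k most similar weeks.

    rows: one list per week, holding the weekly index of each keyword.
    Similarity is the root mean squared difference over the keywords
    that are observed in both weeks (an assumed distance).
    """
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Toy example: two keywords, the second one is missing in week 2.
rows = [[100, 200], [110, None], [300, 600], [105, 210]]
print(knn_impute(rows, k=2)[1])
```

In this toy case the missing value is estimated from the two weeks whose first keyword is closest to 110, so an outlying week such as the third one does not influence the repair.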
2.3. Keyword filtering
Research has shown that an increase in the number of keywords does not necessarily improve the model's fitting performance. In order to accurately select the influencing factors related to the predicted outcome variable ILI%, a correlation analysis was conducted by comparing ILI% with the curated Baidu search index of keywords. In this way, keywords that contribute to the prediction model can be screened and included in the prediction model. In this study, the IBM SPSS Statistics 26.0 statistical tool was used for conducting the correlation analysis. To preliminarily screen keywords, a minimum correlation coefficient of 0.5 between the time series of Baidu search index for keywords and ILI% was required. By conducting a ranking analysis based on the correlation between the Baidu search index for each keyword and ILI%, it was found that out of the 37 keywords, 17 keywords had correlation coefficients less than 0.5 with ILI%, while 20 keywords had correlation coefficients greater than 0.5. The specific analysis results are shown in Table 2, which will help in the further selection of the most relevant keywords to establish a more accurate prediction model.
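The screening rule above (keep a keyword only if its correlation with ILI% exceeds 0.5) can be illustrated with a short sketch. The paper ran this step in SPSS; the code below assumes a Pearson coefficient, and the function names and toy series are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / (sd_x * sd_y)


def screen_keywords(index_by_keyword, ili, threshold=0.5):
    """Return keywords whose correlation with ILI% exceeds the threshold,
    ranked from the strongest to the weakest correlation."""
    kept = {}
    for name, series in index_by_keyword.items():
        r = pearson(series, ili)
        if r > threshold:
            kept[name] = r
    return dict(sorted(kept.items(), key=lambda kv: kv[1], reverse=True))


# Toy example with two hypothetical keywords.
ili = [1.0, 2.0, 3.0, 4.0, 5.0]
indices = {"K_a": [2, 4, 6, 8, 10], "K_b": [5, 1, 4, 2, 3]}
print(screen_keywords(indices, ili))
```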
Influenza viruses are primarily transmitted through airborne droplets produced by sneezing or coughing, as well as through direct contact between people or contact with objects contaminated by influenza viruses. With the rapid and frequent operation of modern transportation, the frequent flow of people and the transportation of various new types of food, previously localized infectious diseases may become widespread epidemics. In areas with frequent population mobility, if an influenza outbreak occurs in one region, other closely related regions are also likely to be affected. Therefore, it is of great significance to analyze ILI in both northern and southern China. Figure 1 shows the trends of ILI in the southern and northern regions of China from week 1 in 2018 to week 49 in 2021. From the graph, it can be observed that the rises and declines in influenza activity in southern and northern China are relatively consistent and strongly correlated. This observation indicates that the impact of influenza transmission in the northern region on the southern region should be considered when building prediction models, and that its effect on prediction performance should be assessed.
2.4. Keyword time-delay correlation analysis
Due to the incubation period and subsequent disease development of influenza, the predictive factors generally exhibit time-delay characteristics. Therefore, the correlation trend between ILI% in the southern region of China and the Baidu search index of preliminary screened keywords was analyzed. The keywords "How to prevent influenza (K1)", "Influenza A symptoms (K9)", "Viral cold (K12)", "Fever (K15)", "High fever (K17)" and "Influenza A (K33)" were taken as examples. Figure 2 shows the distribution of ILI% and Baidu index of specific keywords in the southern region of China from week 1 in 2018 to week 49 in 2021. From the figure, it can be observed that the Baidu index of keywords K1, K9, K12 and K33 show a certain leading relationship compared to ILI%, while the Baidu index of keywords K15 and K17 exhibit relative synchronicity with ILI%. Based on the above analysis, it is evident that the influence of time lag should be considered when conducting keyword correlation analysis. Therefore, this study employed cross-correlation analysis to examine the time-lagged relationship between the selected keywords and ILI% in the southern region within a time range of 7 weeks before and after. The maximum absolute correlation coefficient for each keyword and ILI% was selected to ensure the correlation between the data. The results of the keyword cross-correlation analysis are shown in Table 3.
Keywords are classified into synchronous keywords, leading keywords and lagging keywords according to their temporal nature. From Table 3, it can be observed that as the number of lag days decreases, the correlation of each keyword gradually increases. Among them, 8 keywords reach the maximum value when the delay is 0, which belong to the "synchronous" keywords, including fever, high fever, what medicine to take for the flu, amoxicillin, flu, influenza virus, H1N1 influenza virus and influenza A virus. Additionally, there are 12 keywords that reach their maximum value when the delay is -1, which are "leading" keywords, including how to prevent flu, influenza A symptoms, cold, viral cold, flu treatment, cold medicine, Contac, Gankang, Tylenol, H1N1 influenza, what is H1N1 influenza and influenza A. Due to each keyword being highly correlated with ILI% at different lag times, the lag variable with the largest correlation coefficient was used to establish the model. This can more accurately reflect the association between keywords and ILI, thereby improving the accuracy of the prediction model.
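The lag selection described above can be sketched as follows. The sign convention is an assumption chosen to match the text: a lag of -1 pairs the keyword index of week t-1 with the ILI% of week t, so a keyword whose best lag is negative is "leading". Function names and the toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 0.0 if sd_x == 0 or sd_y == 0 else cov / (sd_x * sd_y)


def lagged_correlation(keyword, ili, lag):
    """Correlation between keyword[t + lag] and ili[t] over the overlap."""
    n = len(ili)
    pairs = [(keyword[t + lag], ili[t]) for t in range(n) if 0 <= t + lag < n]
    xs, ys = zip(*pairs)
    return pearson(xs, ys)


def best_lag(keyword, ili, max_lag=7):
    """Lag in [-max_lag, max_lag] with the largest absolute correlation."""
    scores = {lag: lagged_correlation(keyword, ili, lag)
              for lag in range(-max_lag, max_lag + 1)}
    lag = max(scores, key=lambda k: abs(scores[k]))
    return lag, scores[lag]


def classify(lag):
    """Label a keyword by the sign of its best lag."""
    if lag < 0:
        return "leading"
    if lag > 0:
        return "lagging"
    return "synchronous"


# Toy example: the keyword series is the ILI series shifted one week earlier.
ili = [1, 3, 2, 6, 4, 9, 5, 12, 7, 8, 15, 10, 11, 20, 13, 14, 18, 16, 25, 17]
keyword = ili[1:] + [ili[-1]]
lag, r = best_lag(keyword, ili)
print(lag, classify(lag))
```

The lagged series selected this way (for example, the keyword index of the previous week for a leading keyword) is what would then enter the prediction model.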
3. Model
According to the analysis results of the above keywords, it can be observed that there is a significant positive correlation between the Baidu search index of the 12 leading keywords and the weekly officially reported ILI%. Based on this observation, the influenza prediction model (model 1) was first established using the leading keywords to verify whether search query data can reflect influenza transmission trends. Understanding the historical data of influenza is of great significance for predicting future trends. Therefore, model 2 was established to consider the influence of past influenza activity on the next week's influenza activity, in order to verify the impact of historical ILI% data in the southern region on ILI prediction. In addition, contact is an important pathway of influenza transmission, and with frequent population mobility, influenza can easily spread. Therefore, the influenza level information from the northern region was incorporated into the real-time influenza prediction model for the southern region to build model 3. The models are expressed as follows:
where $\mathrm{ILI}_t\%$ represents the ILI% of the southern region in the $t$-th week, $G_{i,t-1}$ denotes the Baidu search index of the $i$-th leading keyword in week $t-1$, $S_{t-j}$ represents the official ILI% of the southern region $j$ weeks earlier, $N_{t-k}$ represents the official ILI% of the northern region $k$ weeks earlier, $P=12$ is the number of leading keywords, and $M$ and $N$ are the lag orders of $S_t$ and $N_t$, respectively. Experimental verification showed that the models achieve the best predictive performance when $M=4$ and $N=3$. $\alpha_i$, $\beta_i$, $\gamma_j$, $\phi_i$, $\varphi_j$ and $\eta_k$ are the coefficients of the models, while $\varepsilon_t$, $o_t$ and $\sigma_t$ represent the residual terms of models 1, 2 and 3, respectively.
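One form of Equations (1)–(3) that is consistent with the symbol definitions above is sketched below. It assumes additive terms with no intercept, which the definitions suggest but do not confirm. Note that $N_{t-k}$ denotes the northern ILI% series, whereas $N$ on its own is its lag order.

```latex
\begin{align}
\mathrm{ILI}_t\% &= \sum_{i=1}^{P} \alpha_i G_{i,t-1} + \varepsilon_t \tag{1}\\
\mathrm{ILI}_t\% &= \sum_{i=1}^{P} \beta_i G_{i,t-1} + \sum_{j=1}^{M} \gamma_j S_{t-j} + o_t \tag{2}\\
\mathrm{ILI}_t\% &= \sum_{i=1}^{P} \phi_i G_{i,t-1} + \sum_{j=1}^{M} \varphi_j S_{t-j} + \sum_{k=1}^{N} \eta_k N_{t-k} + \sigma_t \tag{3}
\end{align}
```

Read this way, each model extends the previous one by one block of regressors: leading keywords only, then southern ILI% history, then northern ILI% history.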
Wang et al. [37] suggested in their study that only leading keywords can be used to establish influenza prediction models. However, the cross-correlation analysis showed that the correlation coefficients between the synchronous keywords and ILI% were all greater than 0.5 at a lag of 1 week. Therefore, this study incorporated the Baidu index of the synchronous keywords at a lag of 1 week, along with the leading keywords, into the influenza prediction model to analyze their influence on the prediction accuracy of ILI. Because there are many keywords and they are correlated with one another, the information they carry overlaps to some extent. To reduce the number of variables while retaining the main information, this study applied principal component analysis to the input Baidu keywords and retained the principal components whose cumulative contribution reached at least 90% of the variance as the input variables of the model. With 7 principal components, the cumulative contribution rate reached 95.58%. Therefore, these 7 principal components were selected as inputs for the influenza prediction model (model 4). The specific model formula is as follows:
where $\mathrm{ILI}_t\%$, $S_{t-j}$, $N_{t-k}$, $M$ and $N$ are the same as in Equations (1)–(3), $\mu_l$, $\bar{\varphi}_j$ and $\bar{\eta}_k$ are the coefficients of the model, $\bar{\sigma}_t$ is the residual term, $Z_{l,t-1}$ represents the value of the $l$-th principal component at time $t-1$ and $Q=7$ is the number of search principal components included in the model.
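Under the same assumption of additive terms without an intercept, a form of Equation (4) that matches these definitions is:

```latex
\begin{equation}
\mathrm{ILI}_t\% = \sum_{l=1}^{Q} \mu_l Z_{l,t-1}
  + \sum_{j=1}^{M} \bar{\varphi}_j S_{t-j}
  + \sum_{k=1}^{N} \bar{\eta}_k N_{t-k}
  + \bar{\sigma}_t \tag{4}
\end{equation}
```

Compared with model 3, the only change is that the $P$ keyword indices are replaced by the $Q$ principal components computed from the leading and synchronous keywords.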
4. Model prediction and analysis
4.1. Research method
4.1.1. Support vector regression
Support vector regression (SVR) is a machine learning method based on statistical learning theory. It employs the criterion of structural risk minimization, which aims to minimize the error on the sample points while also maximizing the model's generalization ability. Training an SVR is a convex quadratic optimization problem, which ensures that the extreme value found is the globally optimal solution [38]. SVR can be used to capture complex nonlinear relationships in the real world. Its main idea is to find a regression hyperplane that minimizes the distance of all training points to that hyperplane.
In a typical regression problem, given the training set:
where $x_i \in \mathbb{R}^{d}$ is the input vector, $y_i \in \mathbb{R}$ is the output variable and $l$ represents the number of samples.
The modeling purpose of the nonlinear SVR is to map $x$ into a high-dimensional feature space through a nonlinear mapping $\varphi$, and then determine the linear regression function $y=f(x)$ in that space to fit the data $(x_i,y_i)$, which can be expressed as:
where $\omega$ is the weight vector and $b$ is the threshold, both estimated from the training set $G$, and $\varphi(x)$ is the nonlinear mapping function that maps the input vector to a high-dimensional feature space $F$. Linear regression in the high-dimensional feature space therefore corresponds to nonlinear regression in the low-dimensional input space, and the inner product between $\omega$ and $\varphi(x)$ in the feature space does not need to be computed explicitly, because it can be evaluated through a kernel function.
Based on the principle of structural risk minimization, the objective functions and constraints of SVR are defined as follows:
where $C$ is the trade-off parameter that balances the regression error against the regularization term, $l$ is the number of training samples, $\xi_i$ and $\bar{\xi}_i$ are the slack variables that allow the regression function to exceed the error tolerance and $\varepsilon \geq 0$ is the parameter of the $\varepsilon$-insensitive loss function, which controls the accuracy of the regression approximation.
By introducing the Lagrange multipliers $\alpha$ and $\bar{\alpha}$, the quadratic programming problem in Equation (7) can be converted into its dual problem, which can be written as:
where $\alpha=\{\alpha_1,\cdots,\alpha_l\}$ and $\bar{\alpha}=\{\bar{\alpha}_1,\cdots,\bar{\alpha}_l\}$ are the dual variables and $K(x_i,x_j)$ is the kernel function representing the inner product $\langle\varphi(x_i),\varphi(x_j)\rangle$.
By using the Karush-Kuhn-Tucker (KKT) conditions to solve for $\alpha_i$, $\bar{\alpha}_i$ and $b$ in Equation (9), the regression function is obtained as follows:
For the training of SVR method, the first step is to determine the kernel function. At present, several kernel functions have been proposed, but there is no theoretical solution for selecting the optimal kernel function, and the trial-and-error method is usually adopted [39]. In this paper, through iterative tests, radial basis function (RBF) is employed as the basic kernel function, expressed as follows:
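For reference, the standard $\varepsilon$-insensitive SVR formulation matches the symbol definitions in this subsection and can be written as below. The numbering follows the in-text references to Equations (7) and (9). The scaling of $C$ in the objective and the $2\sigma^{2}$ convention in the RBF kernel are assumptions, since several equivalent conventions are in use.

```latex
\begin{align}
G &= \{(x_i, y_i)\}_{i=1}^{l} \tag{5}\\
f(x) &= \omega^{T}\varphi(x) + b \tag{6}\\
\min_{\omega,\,b,\,\xi,\,\bar{\xi}}\quad & \frac{1}{2}\lVert\omega\rVert^{2}
  + C\sum_{i=1}^{l}\left(\xi_i + \bar{\xi}_i\right) \tag{7}\\
\text{s.t.}\quad & y_i - \omega^{T}\varphi(x_i) - b \le \varepsilon + \xi_i, \notag\\
& \omega^{T}\varphi(x_i) + b - y_i \le \varepsilon + \bar{\xi}_i, \notag\\
& \xi_i,\ \bar{\xi}_i \ge 0,\quad i = 1,\dots,l \tag{8}\\
\max_{\alpha,\,\bar{\alpha}}\quad & -\frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}
  (\alpha_i-\bar{\alpha}_i)(\alpha_j-\bar{\alpha}_j)K(x_i,x_j) \notag\\
& -\varepsilon\sum_{i=1}^{l}(\alpha_i+\bar{\alpha}_i)
  + \sum_{i=1}^{l}y_i(\alpha_i-\bar{\alpha}_i) \tag{9}\\
\text{s.t.}\quad & \sum_{i=1}^{l}(\alpha_i-\bar{\alpha}_i) = 0,\qquad
  0 \le \alpha_i,\ \bar{\alpha}_i \le C \notag\\
f(x) &= \sum_{i=1}^{l}(\alpha_i-\bar{\alpha}_i)K(x_i,x) + b \tag{10}\\
K(x_i,x_j) &= \exp\!\left(-\frac{\lVert x_i-x_j\rVert^{2}}{2\sigma^{2}}\right) \tag{11}
\end{align}
```

Only the training points with $\alpha_i - \bar{\alpha}_i \neq 0$ (the support vectors) contribute to Equation (10), and the two parameters left to tune for the RBF kernel are $C$ and $\sigma$.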
4.1.2. Improved particle swarm optimization algorithm
The particle swarm optimization (PSO) algorithm, based on swarm intelligence, is a widely used method for SVR parameter optimization. It does not require gradient information during the iterative process and involves a relatively small number of adjustable parameters. The algorithm is known for its ease of implementation, high efficiency and fast convergence speed [40,41]. The PSO algorithm can be described as follows. The particle swarm consists of $m$ particles in an $n$-dimensional search space, and its state is described by four vectors: $x_i=(x_{i1},x_{i2},\cdots,x_{ij},\cdots,x_{in})^{T}$, $i=1,2,\cdots,m$, is the current position of the $i$-th particle in the search space, $v_i=(v_{i1},\cdots,v_{ij},\cdots,v_{in})^{T}$ is the velocity of the $i$-th particle, $p_i=(p_{i1},\cdots,p_{ij},\cdots,p_{in})^{T}$ is the best position found by the $i$-th particle up to the current moment and $p_g=(p_{g1},\cdots,p_{gj},\cdots,p_{gn})^{T}$ is the best position found by the whole swarm up to the current iteration. The velocity and position of each particle are updated according to Equations (12) and (13):
where $v_{ij}^{k}$ is the velocity of the $j$-th component of the $i$-th particle in the $k$-th iteration, $\omega$ is the inertia weight, $c_1$ and $c_2$ are the cognitive and social learning factors, respectively, and $r_1^{k}$ and $r_2^{k}$ are random numbers generated within the interval $(0,1)$.
Inertia weight ω plays a crucial role in the performance of PSO, as it balances the global search ability and local search ability of particles [42]. A large inertia weight enhances the algorithm's global search ability, but it may lead to lower search efficiency. In contrast, a smaller inertia weight is beneficial for local search, but may lead to local optimality.
where $\omega_{\max}$ and $\omega_{\min}$ are the maximum and minimum values of the inertia weight $\omega$, respectively, $t$ is the current iteration number and $t_{\max}$ is the maximum number of iterations. In Equation (14), $\omega$ gradually decreases during the search process, which satisfies the requirement that the algorithm shift adaptively from global optimization to local optimization.
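The canonical PSO update rules correspond to the symbols defined for Equations (12) and (13). For Equation (14), a linearly decreasing schedule is shown as one common choice that satisfies the description above; the exact schedule used in the improved algorithm is an assumption here.

```latex
\begin{align}
v_{ij}^{k+1} &= \omega\, v_{ij}^{k}
  + c_1 r_1^{k}\left(p_{ij}^{k} - x_{ij}^{k}\right)
  + c_2 r_2^{k}\left(p_{gj}^{k} - x_{ij}^{k}\right) \tag{12}\\
x_{ij}^{k+1} &= x_{ij}^{k} + v_{ij}^{k+1} \tag{13}\\
\omega &= \omega_{\max} - \left(\omega_{\max} - \omega_{\min}\right)\frac{t}{t_{\max}} \tag{14}
\end{align}
```

With this schedule, $\omega=\omega_{\max}$ at the first iteration, which favors exploration, and $\omega$ approaches $\omega_{\min}$ near $t_{\max}$, which favors refinement around the best positions found.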
4.1.3. Improved particle swarm optimization-based SVR method
In the SVR method based on the RBF kernel function, C and σ (kernel width) are two adjustable parameters that play a crucial role in the performance of SVR [43,44]. In this study, an improved PSO algorithm is utilized to optimize the parameters C and σ of SVR. The method of optimizing SVR parameters using the improved PSO algorithm is referred to as IPSO-SVR, and the basic steps are summarized as follows:
Step 1: Initialize all the parameters of the algorithm, including the maximum number of iterations $t_{\max}$, the population size, the cognitive learning factor $c_1$, the social learning factor $c_2$, the velocity range $[V_{\min},V_{\max}]$, etc.
Step 2: Randomly generate the initial positions and velocities of the population, and calculate the initial fitness value of each particle using Equation (15). Set $p_i$ to the initial position $x_i$ of each particle, and set $p_g$ to the position of the particle with the best fitness.
where $\bar{y}_i$ is the predicted value and $y_i$ is the true value.
Step 3: Update the velocities, positions and inertia weight of the particles according to Equations (12)–(14), and evaluate the fitness of each particle. If the fitness value $fit_i$ of the $i$-th particle is lower than the fitness of $p_i$, set $p_i$ to $x_i$; otherwise, leave $p_i$ unchanged. If $fit_i$ is lower than the fitness of $p_g$, set $p_g$ to $x_i$; otherwise, retain the original value.
Step 4: Determine whether the termination condition is met. If it is, proceed to the next step; otherwise, go back to Step 3.
Step 5: Obtain the best parameters $C_{best}$ and $\sigma_{best}$ of the SVR model, train an SVR model with $C_{best}$ and $\sigma_{best}$ on the training set and use the trained model to predict ILI% in southern China.
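The five steps above can be sketched as a small PSO loop over the two SVR parameters. This is an illustration under stated assumptions rather than the authors' code: the inertia weight decreases linearly, positions and velocities are clamped to the given ranges, and `toy_fitness` is a hypothetical stand-in for the real fitness, which would train an SVR with the candidate $(C, \sigma)$ and return its prediction error (assumed here to be the MSE).

```python
import random


def ipso_minimize(fitness, bounds, n_particles=20, t_max=60,
                  c1=2.0, c2=2.0, w_max=0.9, w_min=0.4, seed=0):
    """Minimize fitness(params) with PSO and a decreasing inertia weight."""
    # Step 1: fixed settings (all default values are assumptions).
    rng = random.Random(seed)
    dim = len(bounds)
    v_max = [0.2 * (hi - lo) for lo, hi in bounds]

    # Step 2: random positions and velocities; p_i starts at x_i.
    x = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    v = [[rng.uniform(-vm, vm) for vm in v_max] for _ in range(n_particles)]
    p = [xi[:] for xi in x]
    p_fit = [fitness(xi) for xi in x]
    best = min(range(n_particles), key=lambda i: p_fit[i])
    g_pos, g_fit = p[best][:], p_fit[best]

    # Steps 3 and 4: iterate until the iteration budget is exhausted.
    for t in range(t_max):
        w = w_max - (w_max - w_min) * t / t_max  # linearly decreasing weight
        for i in range(n_particles):
            for j in range(dim):
                r1, r2 = rng.random(), rng.random()
                v[i][j] = (w * v[i][j]
                           + c1 * r1 * (p[i][j] - x[i][j])
                           + c2 * r2 * (g_pos[j] - x[i][j]))
                v[i][j] = max(-v_max[j], min(v_max[j], v[i][j]))
                lo, hi = bounds[j]
                x[i][j] = max(lo, min(hi, x[i][j] + v[i][j]))
            fit = fitness(x[i])
            if fit < p_fit[i]:          # update the personal best p_i
                p[i], p_fit[i] = x[i][:], fit
                if fit < g_fit:         # update the global best p_g
                    g_pos, g_fit = x[i][:], fit

    # Step 5: the caller trains the final SVR with the returned parameters.
    return g_pos, g_fit


def toy_fitness(params):
    """Hypothetical error surface with its minimum at C=10, sigma=0.5."""
    c, sigma = params
    return (c - 10.0) ** 2 / 100.0 + (sigma - 0.5) ** 2


(c_best, sigma_best), err = ipso_minimize(
    toy_fitness, bounds=[(0.1, 100.0), (0.01, 10.0)])
print(c_best, sigma_best, err)
```

Passing the fitness as a function keeps the search loop independent of the learner, so the same loop could wrap any SVR implementation whose validation error can be computed for a given parameter pair.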
4.2. Model evaluation index
To validate the predictive performance of each model, the mean squared error (MSE), root mean squared error (RMSE) and mean absolute error (MAE) were used to evaluate the prediction results of each model, as shown in Equations (16)–(18). MSE is the mean of the squared prediction errors. RMSE is the square root of the MSE, that is, the square root of the sum of squared differences between the predicted and true values divided by the sample size $m$, and it measures the deviation of the overall prediction results from the actual values. MAE is the mean of the absolute errors and reflects the actual size of the prediction error.
where $m$ represents the number of samples, $y_i$ represents the true value of ILI% and $\hat{y}_i$ represents the predicted value of ILI%.
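In their standard form, which matches the verbal definitions above, the three indices read:

```latex
\begin{align}
\mathrm{MSE} &= \frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^{2} \tag{16}\\
\mathrm{RMSE} &= \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^{2}} \tag{17}\\
\mathrm{MAE} &= \frac{1}{m}\sum_{i=1}^{m}\left|y_i - \hat{y}_i\right| \tag{18}
\end{align}
```

Because MSE squares the errors, it penalizes large misses (for example around epidemic peaks) more heavily than MAE, while RMSE expresses the same information in the units of ILI%.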
4.3. Prediction results and analysis
The prediction target of this study is the ILI% in the southern region of China. Because both domestic and international scholars often adopt the traditional multiple linear regression method for influenza trend prediction [28,29], this paper uses the prediction results of that method as a baseline and compares them with the prediction results of the SVR, GA-SVR, PSO-SVR and IPSO-SVR methods. The models were implemented in Python 3.8.12. In this study, 164 weeks of data from week 1 of 2018 to week 7 of 2021 were selected as training samples and 42 weeks of data from week 8 of 2021 to week 49 of 2021 were selected as test samples.
The independent variables of models 1–4 were used as inputs for multiple linear regression, SVR, GA-SVR, PSO-SVR and IPSO-SVR, with $\mathrm{ILI}_t\%$ as the output. Training samples were used to train each model, and the trained models were used to predict the ILI% for the southern region of China from week 8 to week 49 of 2021. The MSE, RMSE and MAE results of each model on the test samples are shown in Table 4, where LR represents the prediction results of the multiple linear regression method.
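The evaluation protocol described above, a chronological split into 164 training weeks and 42 test weeks followed by the three error indices, can be sketched as follows. The function names and the toy values are hypothetical; only the split sizes come from the text.

```python
import math


def chronological_split(series, n_train):
    """Split a time-ordered series without shuffling: first n_train for
    training, the remainder for testing."""
    return series[:n_train], series[n_train:]


def mse(y_true, y_pred):
    """Mean of the squared prediction errors."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean of the absolute prediction errors."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


# Week indices stand in for the actual weekly records.
weeks = list(range(206))
train_weeks, test_weeks = chronological_split(weeks, n_train=164)
print(len(train_weeks), len(test_weeks))

# Toy predictions, only to show how the indices are computed.
y_true = [3.1, 3.4, 2.9]
y_pred = [3.0, 3.6, 2.7]
print(mse(y_true, y_pred), rmse(y_true, y_pred), mae(y_true, y_pred))
```

Keeping the split chronological matters for this kind of data: every test week lies after all training weeks, so the reported errors reflect genuine forecasting rather than interpolation.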
Comparing the MSE, RMSE and MAE results of the five prediction methods for each model in Table 4, it can be observed that, when compared with the traditional multiple linear regression, SVR, GA-SVR and PSO-SVR methods, the prediction results of IPSO-SVR are the best in models 2–4, demonstrating superior prediction performance. Among the SVR, GA-SVR, PSO-SVR and IPSO-SVR methods, model 4 demonstrates the best predictive performance. However, within the LR method, model 3 exhibits the most favorable prediction effectiveness. From the IPSO-SVR prediction results of model 1, it can be observed that when ILI% is predicted by the leading keywords, although there is some discrepancy between the predicted results and the true values, the trend of the predictions is relatively consistent with the true values. By comparing the three evaluation index results of the IPSO-SVR algorithm in model 1 and model 2, it can be found that by adding historical ILI% data from the southern region, the MSE, RMSE and MAE index results of the model were reduced by 74.7%, 49.7% and 54.1%, respectively. This indicates that the historical ILI data contains a significant amount of influenza epidemic trend information.
By comparing the three evaluation index results of the IPSO-SVR algorithm in model 2 and model 3, it can be observed that adding historical ILI% data from the northern region led to a reduction of 18.3%, 9.5% and 1.2% in the model's MSE, RMSE, and MAE index, respectively. This indicates that the influenza epidemic in the northern region has some impact on the southern region, which means that influenza transmission can be affected by interregional transmission. Therefore, when analyzing and forecasting ILI, it is essential to consider not only the impact of ILI in the current region but also the influence of the epidemic situation in other regions on the current region.
By comparing the three evaluation index results of the IPSO-SVR algorithm in model 3 and model 4, it can be observed that adding the Baidu index of synchronous keywords from the previous week reduces the MSE, RMSE and MAE of the model by 4.6%, 2.4% and 7.3%, respectively. This indicates that incorporating synchronous keywords into the model can improve its predictive accuracy. Therefore, when establishing an influenza prediction model based on web search data, the information in synchronous keywords should not be excluded outright. Instead, their influence on influenza prediction should be assessed by constructing models with and without them.
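The three comparisons above report both MSE and RMSE reductions, and the two are linked: since RMSE is the square root of MSE, an MSE reduction of r implies an RMSE reduction of 1 − √(1 − r). The sketch below applies this identity to the reported percentages as a consistency check. The dictionary labels and the function name `rmse_reduction_from_mse` are hypothetical.

```python
import math


def rmse_reduction_from_mse(mse_reduction):
    """RMSE = sqrt(MSE), so an MSE cut of r implies an RMSE cut of 1 - sqrt(1 - r)."""
    return 1.0 - math.sqrt(1.0 - mse_reduction)


# (reported MSE reduction, reported RMSE reduction) for each added input.
reported = {
    "southern historical ILI% added": (0.747, 0.497),
    "northern historical ILI% added": (0.183, 0.095),
    "synchronous keywords added": (0.046, 0.024),
}

for label, (mse_cut, rmse_cut) in reported.items():
    implied = rmse_reduction_from_mse(mse_cut)
    print(f"{label}: reported RMSE cut {rmse_cut:.1%}, implied {implied:.1%}")
```

The implied and reported RMSE reductions agree to within roughly 0.1 percentage points in each case; the small differences may reflect rounding in the underlying table values. No such identity links MAE to MSE, so the MAE reductions cannot be checked this way.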
The fitted, actual and predicted values for the training and testing samples under the five forecasting methods in models 1–4 are compared in Figures 3–6. In each figure, every subgraph is divided into two parts by a vertical deep red line: the left part shows the actual values (red) and the fitted values on the training samples, while the right part shows the actual values (red) and the predicted values on the testing samples. Comparing the results of the five methods across models 1–4 shows that the output of model 4 is closest to the real values on both the training set and the test set. Across both the fitting and the prediction periods, the IPSO-SVR method in model 4 captures the peaks and troughs of the ILI time series, and its prediction effect is better than that of the other models.
5.
Conclusions
Influenza is a common respiratory disease that can lead to illness and death in humans. Timely and accurate prediction of disease risks is essential for public health management and prevention. While various prediction efforts regarding infectious diseases have matured, the current infectious disease surveillance system is excessively passive, relies heavily on case reporting and suffers from a large time lag. Additionally, the geographical distribution and genetic diversity of novel influenza viruses are rapidly expanding, presenting a direct challenge to the existing disease control system in China. To achieve near real-time monitoring of influenza spread, both domestic and international scholars have proposed influenza prediction methods based on informal sources of data, such as news reports, social media data, online search query data and electronic health information records. However, there is little research on the domestic influenza epidemic in this field. Many existing methods rely solely on historical time series data for prediction, overlooking the impact of spatial correlations among neighboring regions and temporal correlations across different time periods. Additionally, influenza prediction methods often rely heavily on multivariate linear regression techniques.
In this study, significant keywords related to influenza were first identified and given an initial screening, and were then further filtered by analyzing the time-delay correlation between each keyword and ILI. Second, based on the distinct types of keywords identified and considering the influence of influenza transmission between neighboring regions, an influenza prediction model suited to the characteristics of the southern region of China was constructed. The model considers both spatial and temporal correlations, providing a more accurate reflection of influenza transmission trends in the region. Finally, an improved PSO-based SVR method was proposed for model prediction, and its prediction results were compared with those of the multiple linear regression, SVR, GA-SVR and PSO-SVR methods.
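The time-delay correlation step can be illustrated with a short sketch. It assumes the Pearson correlation coefficient and a simple scan over weekly lags; the exact statistic and lag range used in the study may differ. The function names and the toy series are hypothetical, with the keyword series built to lead the ILI series by two weeks.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def lagged_correlations(keyword, ili, max_lag):
    """Correlate keyword[t - lag] with ili[t] for lag = 0..max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        shifted_keyword = keyword[: len(keyword) - lag]
        aligned_ili = ili[lag:]
        result[lag] = pearson(shifted_keyword, aligned_ili)
    return result


# Toy data (hypothetical): search volume moves two weeks ahead of ILI.
base = [1, 3, 2, 5, 4, 6, 3, 2, 4, 7, 5, 3]
keyword_index = base[2:]
ili_series = base[:-2]

correlations = lagged_correlations(keyword_index, ili_series, max_lag=3)
best_lag = max(correlations, key=lambda lag: correlations[lag])
print(best_lag, correlations[best_lag])
```

On this reading, a keyword whose correlation peaks at a positive lag would behave as a leading keyword, and one whose correlation peaks at lag 0 as a synchronous keyword.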
By comparing the prediction results of each model, the following conclusions were drawn: 1) The influenza epidemic in the northern region has some impact on the southern region, indicating that influenza transmission is influenced by interregional spread. 2) When establishing influenza prediction models based on web search data, the information in synchronous keywords should not be excluded outright. Instead, their impact on influenza prediction should be analyzed through further modeling. 3) The IPSO-SVR method used in model 4 captures the peaks and troughs in the ILI time series, achieves higher prediction accuracy than the other methods and better reflects the real level of influenza.
In this study, the integration of Baidu search data and machine learning methods was employed to construct a series of influenza trend prediction models, along with the incorporation of the IPSO-SVR algorithm as a predictive tool. This innovative approach introduces a novel predictive framework to the field of influenza trend forecasting, providing essential decision support for public health management and epidemic prevention and control. By constructing various prediction models, this study has unveiled multiple factors that influence the spread of influenza, thereby enhancing our understanding of the mechanisms underlying influenza transmission. The significant impact of introducing the IPSO-SVR algorithm in enhancing the accuracy of predictions is particularly noteworthy. This optimization algorithm demonstrates promising potential in influenza trend prediction, offering a novel avenue to improve the accuracy of prediction outcomes. By incorporating this algorithm into the model, it becomes possible to capture the dynamic changes in influenza trends with greater precision, which provides new perspectives and approaches for research and application of influenza prediction.
The paper still has some limitations. For instance, the scope of the study is confined to influenza forecasting in the southern region of China, and the prediction performance in other regions has not undergone sufficient in-depth research. Further validation and expansion are necessary in this regard. In reality, there might be cases of cross-infection and mutual influence among different diseases. ILI% data could be affected by these underlying factors. Therefore, when conducting influenza prediction, it is crucial to take into account the impact of other relevant disease data to mitigate the prediction errors arising from multifactorial influences. In future research, further exploration can be conducted on how to incorporate additional disease data into the predictive model, aiming to enhance the accuracy of influenza trend prediction and further improve the reliability and applicability of the predictive model. Furthermore, the exploration of more advanced machine learning techniques and data analysis methods will be pursued to optimize the performance of the influenza prediction model. By introducing new technological approaches, there is a potential to further enhance the predictive capabilities of the model across various regions, offering more forward-looking and practical solutions for research and practical applications in the field of influenza prediction.
Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
Acknowledgments
This paper is supported by the National Natural Science Foundation of China [No. 62106238], the Fundamental Research Program of Shanxi Province [No. 202103021224195], the Shanxi Province Science and Technology Major Special Project "Revealing the Leader" Program [No. 202201150401021] and the Provincial Natural Science Foundation of Shanxi [No. 202203021212138].
Conflict of interest
All authors declare no conflicts of interest in this paper.