Phenomenological models are particularly useful for characterizing epidemic trajectories because they often offer a simple mathematical form defined through ordinary differential equations (ODEs) that in many cases can be solved explicitly. Such models avoid the description of biological mechanisms that may be difficult to identify, are based on a small number of model parameters that can be calibrated easily, and can be utilized for efficient and rapid forecasts with quantified uncertainty. These advantages motivate an in-depth examination of 37 data sets of epidemic outbreaks, with the aim to identify for each case the best suited model to describe epidemiological growth. Four parametric ODE-based models are chosen for study, namely the logistic and Gompertz model with their respective generalizations that in each case consists in elevating the cumulative incidence function to a power p∈[0,1]. This parameter within the generalized models provides a criterion on the early growth behavior of the epidemic between constant incidence for p=0, sub-exponential growth for 0<p<1 and exponential growth for p=1. Our systematic comparison of a number of epidemic outbreaks using phenomenological growth models indicates that the GLM model outperformed the other models in describing the great majority of the epidemic trajectories. In contrast, the errors of the GoM and GGoM models stay fairly close to each other and the contribution of the adjustment of p remains subtle in some cases. More generally, we also discuss how this methodology could be extended to assess the "distance" between models irrespective of their complexity.
1.
Introduction
1.1. Scope
Dynamic growth models provide an important quantitative framework for characterizing epidemic trajectories, generating estimates of key transmission parameters, assessing the impact of control interventions, gaining insight to the contribution of different transmission pathways, and producing short- and long-term forecasts [1]. A natural question is that of the choice of the best suitable growth model for a given epidemic. It is the purpose of this paper to shed light on the performance of different growth models in describing different real epidemic outbreaks. Specifically, we employ four different growth models based on differential equations (two of them with two parameters, and two with three parameters), and apply them to a total of 37 infectious disease outbreak datasets consisting of time series of case incidence for different historic outbreaks comprising different diseases and settings.
The two-parameter models are the well-known logistic model (LM) [2] and Gompertz model (GoM) [3], and the three-parameter models are generalizations for both models which we refer to as the generalized logistic model (GLM) and the generalized Gompertz model (GGoM), respectively. These models incorporate a parameter p, which is an exponent that provides a criterion about the type of early growth dynamics, namely sub-exponential (0<p<1) or exponential (p=1) growth. (For p=1, the GLM and GGoM models reduce to the LM and GoM models, respectively.) We explored the performance of these models in describing the trajectory of 37 outbreaks by applying the methodology described by Chowell [1] to estimate parameters with their confidence intervals. In this analysis, we analyzed how well models fitted the 37 outbreaks using the root mean squared error (RMSE).
The particular choice of parametric models complements that of [1], where the well-known exponential and Richards [4,5] growth models are employed along with their generalized counterparts. Moreover, since that work is focused in detailing the methodology, the data set in [1] is limited to the 2013–2016 Ebola outbreak in Sierra Leone, and no mechanism of choice between two or more alternative models for the same data set is established. In this paper we are particularly interested in gaining insight into the types of outbreaks where the different model variants provide an enhanced description of the epidemic outbreaks.
1.2. Related work
This paper is focused on models given by ordinary differential equations (ODEs) to describe the temporal dynamics of epidemic outbreaks. The properties of ODEs as models of growth are treated in numerous monographs, see e.g. [6,7,8,9,10,11,12]. On the other hand, the presence of the nonlinearity caused by the growth rate exponent p precludes in some cases solutions of the corresponding ODE in closed form. Nevertheless, we mention that for p=1, the properties of the Richards, logistic, Gompertz, and related (e.g., von Bertalanffy [13,14]) models are broadly discussed in terms of closed algebraic expressions in [15,16,17] (see also the references cited in these papers).
We use phenomenological models within an empirical approach (without an explicit basis of physical laws or mechanisms) that are useful to reproduce the patterns observed in the time series data [18]. The result is a fairly simple temporal description of epidemic growth patterns [1]. For instance, epidemics display variable epidemic growth scaling (e.g., from sub-exponential to exponential). Here we are particularly interested in the contribution of the parameter p as a corrector in the fit and the possible improvement in the forecasts. The relevance of this parameter was recently highlighted by Chowell and Viboud [19] who demonstrated that a generalized-growth model is a simple tool that can be used to characterize the early epidemic growth profile from case incidence data as well as from synthetic data derived from transmission models via stochastic simulation [18]. Related references to early epidemic growth models also include [20,21]. For the connection between the growth rate and the reproductive number of an epidemic, an aspect that is not discussed herein, we refer to [22,23,24,25].
Finally, we mention that there are also stochastic models built to study sigmoidal behaviours. In particular, in recent years there have been many advances in stochastic models based on diffusion processes, particularly associated with the Gompertz and logistic curves. A general procedure for obtaining and estimating this type of models is considered in [26], where also further references can be found (see also [27]). As is discussed in the introduction of [26], considering particular choices of the time functions that define the exogenous factors has enabled researchers to define diffusion processes associated to alternative expressions of already-known growth curves [26]. These processes include a Gompertz-type process [28] (applied to the study of rabbit growth), a generalized von Bertalanffy diffusion process (with an application to the growth of fish species) [29], a logistic-type process [30] (applied to the growth of a microorganism culture), and a Richards-type diffusion process [31]. More recent contributions to this line of research are [32] and [33].
1.3. Outline of the paper
The remainder of the paper is organized as follows. In Section 2 the mathematical growth models that we employ to investigate the trajectory of epidemic outbreaks are described. Specifically, we employ two growth models with two parameters and two models with three parameters, which are generalizations of the first models that incorporate a third parameter to model varying degrees of early growth profiles (from sub-exponential to exponential growth). The materials and methods employed in our analyses are presented in the Section 3, including the descriptions of the datasets for the 37 epidemic outbreaks, the concept of the RMSE, and the methods for the parameter estimation and confidence interval generation using the parametric bootstrap approach. The results of applying the methodology developed in Section 3 to the datasets are presented in Section 4. Finally, some conclusions and possible directions of future work are collected in Section 5.
2.
Mathematical models
The general form of a phenomenological model is
where dxi/dt denotes the rate of change of the system state xi, i=1,…,n, and Θ=(θ1,…,θm) is the set of model parameters, where the complexity of a model depends on the number m of parameters that are needed to characterize the states of the system and the spectrum of the dynamics that can be recovered from the model [1]. In this contribution we highlight the logistic growth model (LM) and the Gompertz model (GoM) and their respective generalizations, namely the generalized logistic model (GLM) and the generalized Gompertz model (GGoM). The last two models incorporate a parameter p that indicates the kind of scaling of growth. These models can be described as follows.
The logistic growth model (LM) relies on two parameters to characterize the trajectory of an epidemic, where the model is given by the differential equation
where t is time, C′(t) describes the incidence curve over time, C(t) is the cumulative number of cases at time t, while the parameter r>0 indicates the growth rate (its dimension is 1/time), and K is the size of the epidemic. During the initial stages of disease propagation, when C(t)≪K, this model assumes an exponential growth phase, as can be inferred from the well-known explicit solution of (2.2),
The two-parameter Gompertz model (GoM) is given by the ODE
where the parameter b>0 describes the exponential decay of the growth rate r, and the quantities C and C′ have the same meaning as for the LM model. If C(0) is the initial number of cases, then the solution of (2.3) is
We generalize the logistic and Gompertz models by incorporating a growth scaling parameter p∈[0,1] that indicates the kind of growth, where p=0 corresponds to a constant incidence over time, p=1 corresponds to the exponential growth and recovers the logistic model, and any value 0<p<1 leads to a model that describes a sub-exponential growth, a property that leads to potentially more realistic models as shown in [18]. The model is given by the differential equation
Similarly, the Gompertz model leads to the following ODE that defines the Generalized Gompertz Model (GGoM), where p plays the same role as in the GLM:
It is worth noting that for general values p∈(0,1), (2.5) does not possess an explicit solution in closed algebraic form. (For a detailed discussion of this point and further references we refer to Ohnishi et al. [34], who deal with the Pütter-von Bertalanffy equation dC/dt=αCA−βCB with positive constants α, β, A and B, which includes (2.5). Nevertheless, this equation admits an analytical solution given in implicit form [34]HY__HY, Eq. (9)].)
In contrast to the GLM equation (2.5), one may easily integrate the GGoM equation (2.6) for these values of p to get
For this expression we get
It is interesting to note that for the Gompertz model with p=1, (2.3), the expression (2.4) implies that C(t)→C(0)exp(r/b) as t→∞ so the limit value depends linearly on C(0) (unless the initial population is absorbed into b or r), while for 0<p<1, (2.7) means that the limit of C(t) still depends on C(0) but does so in a nonlinear fashion.
Summarizing, we have two two-parameter models with their respective generalizations that are three-parameter models, where the third parameter is the growth scaling parameter p∈[0,1], as we show in Table 1. Before we proceed, we illustrate by an example the effect of varying p within the GLM and GGoM, see Figure 1. We start with the logistic model (2.2) setting r=1, C(0)=10 and K=1000. The solid red curve in Figure 1 (top left) shows the incidence curve t↦C′(t) corresponding to the solution t↦C(t) (Figure 1, top right). This solution approximates the maximum (K=1000). Now we pass to the GLM (2.5) by gradually decreasing p from one to p=0.995, p=0.99, and so on (see the caption of Figure 1). We observe that the maxima of the incidence C′(t) decrease (as follows easily from discussing the extrema of C↦Cp(1−C/K)), but their time of occurrence increases, as p is decreased. Furthermore, the incidence curves stay fairly close to the curve for p=1 for values of p close to one, and all solutions behave like C(t)→K as t→∞.
In order to compare these observations with those for the Gompertz and GGoM models, we plot in Figure 1 (middle left) the incidence curve t↦C′(t) corresponding to the solution t↦C(t) (Figure 1, middle right) for the Gompertz model (2.3) with parameters C(0)=10,
which have been chosen in such a way that C′(0) is the same as for the GLM as well as that C(t)→K as t→∞ (cf. (2.4)) for p=1. Note that the maximum of C′(t) is smaller than for the logistic model. As p is decreased, but all other parameters are kept, these maxima become smaller (as with the GLM), but they appear each time earlier (in contrast to the GLM). However, for t→∞ we observe that consistently with (2.7), C(t) approaches smaller values than K as t→∞. If we wish to ensure that the GGoM with p∈(0,1) has the same value of C′(0) as the GLM (for the corresponding value of p) and C(t)→K as t→∞, then we must also adjust b by setting
(which results from equating the limit in (2.7) with K). From the bottom plots of Figure 1 we observe that the joint variation of p and b produces curves similar to those of the GLM.
Finally, let us emphasize once again that the exponent p is introduced in both (2.5) and (2.6) in such a way that it affects the initial growth rate, corresponding to the early stage when C(t)/K≪1 and therefore C′(t)≈rCp(t), so that p characterizes sub-exponential growth dynamics [18]. In particular, the identification of p at early stage of an epidemic is fundamental for forecasting the outbreak [19]. It is therefore instructive to provide an example to compare (2.5) with an alternative way of introducing an exponent p into (2.2), namely the well-known Richards equation [4]
Figure 2 displays the incidence curves t↦C′(t) and the solution t↦C(t) for selected values of p for both the GLM model (2.5) and the Richards equation (2.10). We observe that since C(0)/K≪1, the initial growth rates for (2.10) are very similar for all values of p, in contrast to those of the GLM model. Thus, the variability of the exponent p in the Richards equation (2.10) is not suitable for capturing sub-exponential initial growth.
On a similar note, we mention that the traditional form of the Gompertz ODE (cf., e.g., [28]) is
with a constant α>0, which is a nonlinear differential equation, in contrast to the linear ODE (2.3) utilized herein. Our preference of (2.3) is based on the fact that this equation can easily be equipped with the exponent p to give (2.6). Furthermore it is fairly easily possible to compare (2.6) and its solutions with those of the sub-exponential growth equation dC/dt=rC(t)p analyzed in [18,19], while the multiple, and nonlinear occurrence of C(t) makes such a generalization at least more complicated.
3.
Materials and methods
In order to compare the mathematical models, we need time series data that describe the temporal changes in one or more states of the system, whose temporal resolution varies among daily, weekly or yearly and by the frequency at which the state of the system is measured. We herein employ a data set for 37 different epidemic trajectories with different temporal resolutions (see Table 2). Additionally we present the method for fitting the model to the data, that is, to estimate the parameters as in [1]. Finally, to compare the models, we conduct a comparative analysis of RMSEs for all models and epidemics. Then, to continue we present the materials and methods that allow us to understand the methodology.
3.1. Datasets
Table 2 summarizes the information of the 37 epidemic outbreaks analyzed, including the name of the disease associated with each epidemic, the location where the outbreak occurred, the temporal resolution (by days, weeks, or years) of the time series, and the number of data points. For each outbreak, the onset corresponds to the first observation associated with a monotonic increase in incident cases, up to the peak incidence. We notice that for Ebola we have more information about the outbreak in West Africa (see also [35,36,37]).
3.2. The root mean square error (RMSE)
As in [1], besides using the residuals for any systematic deviations for the model fit to the data, it is also possible to quantify the error of the model fit to the data using performance metrics [55]. These metrics are also useful to quantify the error associated with a forecast. A widely used performance metric is the root mean squared error (RMSE) given by
where ˆΘ is the set of parameter estimates, f(ti,ˆΘ) denotes the best-fit model, and yti (i=1,…,n) is the time series data (for that specific epidemic outbreak) and n is the total number of data points. In this work we employ the RMSE since this quantity naturally arises in the context of least-squares methods. Other applicable performance metrics [1] include the mean absolute error (MAE) and the mean absolute percentage error (MAPE), given by the respective expressions
While we have not applied any special treatment on outliers when calculating the RSME, the sensitivity of each of these performance metrics to anomalous data is left as a topic for future study.
3.3. Parameter estimation and confidence interval generation
Based on the description of the determination of the best fit in [1], we use the built-in Matlab (The Mathworks, Inc.) function LSQCURVEFIT to obtain parameter estimates via least-square fitting of the model solution to the observed data. This is achieved by searching for the set of parameters ˆΘ=(ˆθ1,…,ˆθm) that minimizes the sum of squared differences between the observed data yti=yt1,…,ytn and the corresponding model solution denoted by f(ti,Θ). For the implementation for this function, we need the initial parameter guesses and the upper and lower bounds for these parameters as well as the initial data point C(0). The process for the parameter estimation is summarized in the next steps:
1. Define the upper and lower bounds for each parameter.
2. Consider m sets of initial parameters defined with the Matlab function LSHDESING and the upper and lower bounds defined in step 1.
3. Calculate the parameter estimation for each set of initial parameters with the function LSQCURVEFIT.
4. Measure the error RMSE and select the parameter estimates with lowest RMSE, in order to ensure that the global minimum rather than a local minimum was found.
On the other hand, to generate the confidence interval, we use the parametric bootstrap method [56] (see also [57,58]) with Poisson error structure that was implemented to generate 250 model realizations. This process can be summarized in the following steps:
1. With the parameter estimations ˆΘ obtained by the least-squares fit of the model f(ti,Θ) to the time series data yt1,…,ytn, we achieve the best-fit model f(ti,ˆΘ).
2. Then, we generate S-times replicated simulated datasets, using the best-fit model, which we denote by f∗1(tj,ˆΘ),…,f∗S(tj,ˆΘ). To generate these simulated data sets, we first use the best-fit model f(ti,ˆΘ) to calculate the corresponding cumulative curve function F(tj,ˆΘ) defined as
Moreover, f∗k(t1,ˆΘ)=f(t1,ˆΘ) for k=1,…,S. Besides, these data are generated assuming a Poisson error structure as follows: we assume that
where Po(λ) denotes the Poisson distribution with mean λ.
3. We re-estimate parameters for each of the S simulated realizations, which are denoted by ˆΘi for i=1,…,S.
4. Finally, using the set of re-estimated parameters ˆΘi, i=1,…,S, we construct the confidence interval, so the resulting uncertainty around the model fit is given by f(t,ˆΘ1),f(t,ˆΘ2),…,f(t,ˆΘS).
Then, for our case, from these S=250 realizations, we calculate 95% confidence intervals for model parameters.
3.4. Methodology: Analysis of the RMSE
In this section we summarize the methodology used to decide which is the best model for a given outbreak, and to analyze the contribution of the parameter p. The definitions and theory are taken from [1]. The methodology consists in an analysis of the RMSE error with the help of bar and scatter charts.
For this purpose, we first explore the initial parameters for each model and epidemic in order to ensure that the best fit of the model yields the smallest RMSE following the steps defined in the Section 3.3 for parameter estimation and considering r,b∈[0,5], K∈[0,107] and the known p∈[0,1]. The above is an important process in order to ensure that we are obtaining the best fit to the data using the LSQCURVEFIT function in Matlab. We then with the best fits for each model and epidemic, we have their incidence curves and the lower RMSE. With these values we obtain graphs that compare the fit with the data, bar charts and scatter plots, which will be used for the error analysis (see Figure 3).
4.
Results
4.1. Error analysis and comparison of fits for each epidemic
With the RMSE and the best fits obtained for each model, we obtain tables and graphics (see Table 3 and Figures 4 to 8) to compare the sizes of the errors for each model and epidemic outbreak, where the numbers from 1 to 37 in Table 3 identify the cases of outbreak (see Table 2). In Table 3 we observe that (independently of the epidemic) the GLM method yields the lowest RMSE in most of the cases (highlighted in green), and the LM yields the larger errors (highlighted in yellow). Besides, whenever the GLM is not the "best'' model, the GGoM follows.
Furthermore, we also observe that between LM and GoM, the GoM is better, because the dynamics of this model are more closely aligned to the dynamics of the GGoM. Furthermore, the LM is associated with the largest errors in the great majority of the cases of outbreaks.
Figures 4 and 5 display the RMSE for each model and dataset. In Figure 4 we can see that although the GLM outperforms in most cases, we note that the error for the GLM is higher for Cases 3, 5, 7, 18, 20, 23, and 28 compared to the GGoM. Yet, those error differences are very small.
We also employ scatter plots to compare the errors yielded by a pair of models across all of the epidemics (Figure 5). Therefore, we compare the models with or without the para-meter p, and then between the logistic and Gompertz models. For the first comparison we verify that the GGoM has errors with sizes larger than the GLM, unlike the models without p, where the behavior is different, since the LM has the errors with more scatter and below the line with slope one. Moreover, for the second group of cases, we note that the logistic models have a more scattered behavior above the diagonal line, where LM has errors with sizes greater than the sizes for the GLM's errors. This contrasts with the Gompertz models, where the scatter is closer to the diagonal. This shows that the errors yielded by both Gompertz models are very similar, and we can readily observe that these models are stable or closer to each other.
Having analyzed the RMSE for each model, now we study their respective fits for each epidemic outbreak, where we obtain a graphic sample of the best fit that corresponds to the RMSE, i.e., we will plot the best fits. These results are plotted in Figures 6 to 8. In these figures we can observe and compare the quality of the fits and their erorrs, where can note that the best fits to the data correspond to the smaller errors in terms of the RMSE.
Having finalized our comparative analysis of the model fits and their corresponding errors, we point out that for the Ebola epidemics (Cases 1 to 25), the GLM tends to yield an improved description to the data because in those cases where the GGoM wins (in terms of smallness of the RMSE), the corresponding errors do not differ by more than 0.6399. However, for the rest of the cases of epidemic outbreaks, the best model remains the GLM which yields smaller errors compared to the GGoM.
4.2. Parameter estimation
These results were obtained from the fits calculated in the previous section with the use of the Matlab function LSDCURVEFIT. We summarize the results for all cases in Table 4. We note that for the GGoM, there are 24 cases with p=1, which means that these exhibit an initial exponential growth, where moreover the Gompertz and GGoM models yield equal RSMEs for that value of p. On the other hand for this same period of time and for the logistic models, we notice that only for four epidemics we have p=1 (exponential initial growth), and the others give rise to initial sub-exponential growth with p∈(0,1). There were a number of outbreaks where the Gompertz models yield p=1 (Gompertz and GGoM models are equivalent), for which the differences between the corresponding RMSEs are negligible.
Additionally, we observe that for the cases of Ebola in Grand Cape Mount, Lofa, Kono and Pandemic Influenza (Cases 9, 10, 17, and 33), we obtained p=1 for the two generalized models. Also, for epidemics when the value of p for GLM is near one, the corresponding value of the parameter for the GGoM is one including the epidemics of Ebola (Kindia, Montserrado; Cases 4 and 12), Plague (Bombay; Case 27) and Cholera (Aalborg; Case 37), in Table 4. We also observe that when the value of the parameter for GLM is small, for example the cases of Ebola (Bomi, Bong; Cases 7 and 8), the value for the GGoM is also small, and for all cases when the value of p=1 en GGoM, the values of p for GLM is greater than 0.6.
4.3. Confidence intervals
In this part, for the calculation of confidence intervals, we consider the generalized models (GLM and GGoM), for which we can obtain another piece of information to compare both models, and to decide which models best fit a given dataset. To this end we take the same initial parameters obtained for the RMSE calculation, and we use the parametric bootstrap process with 250 simulations with Poisson error structure, defined in Section 3, and summarize the results in Tables 5 and 6. In these results we note that the intervals are narrower and contain the mean value, suggesting that the parameters are identifiable (see [1]) for the GLM model. On the other hand, for the GGoM model, this situation occurs in some cases, for example, see Figure 9, where for Case 1 the confidence interval obtained with GLM model has a bar chart that is centred, while that for the GGoM model, the bar chart displays a distribution with two modes. This behavior displayed by the GGoM model can be due to dependency or correlations (presented in Section 2) between the parameters b and p.
Another observation is that the non-identifiability can be present in the results where the upper and lower limit of the 95%CI intervals are not so close, and the mean is not a central value inside the interval. This is observed for the GGoM in the Cases 1 and 24, and the opposite situation can be observed, for instance, for Cases 12 and 19, where the mean value is a central value inside the interval which has the extremes very close. This last situation also appears in all the results derived from the GLM.
5.
Discussion and conclusion
Our systematic comparison of a number of epidemic outbreaks using phenomenological growth models indicates that the GLM outperformed the other models in describing the great majority of the epidemic trajectories. In a few cases (such as Cases 3, 4, 23, and 28) the GGoM outperformed the other models. These findings indicate that the parameter p plays a much more significant role in shaping the dynamic trajectories supported by the GLM compared to the GoM since we observed that the errors of the GoM and GGoM models stay fairly close to each other and the contribution of the adjustment of p remains subtle in some cases. In fact, a closer examination of the parameter estimates derived from both models GoM and GGoM indicates that parameter p is close to 1 in these models, which explains the similarity in the fits derived from these models. So the GGoM model could be reduced to GoM without much impact on the model fit. This is in sharp contrast to what is happening with the logistic models where both the LM and GLM models only yield similar fits for three epidemics. Future research could be directed at determining which of the models equipped with generalized growth are easier to calibrate than the other, considering the initial or final parts of the dynamics and with the aim to improve predictions.
Referring to the parameter estimation procedure and the need to provide an initial solution to the optimization numerical methods, we have found that Matlab functions and the steps defined in the section 3.3, are sufficient for the present study, in agreement with the experience made in [1,18]. However, since there is a limited range for some of the parameters (as is the case of parameter p, but not of the others) it might be interesting future work to use metaheuristic procedures to the parameter estimation that possibly guarantee in an appropriate form that the parameters found are indeed optimal globally. As is mentioned in [26], such procedures include simulated annealing (see, e.g., [27,31,59]), variable neighborhood search (VNS) [27,31], and the so-called firefly algorithm [33].
While we compared phenomenological growth models based on their ability to describe empirical trajectories of real epidemics, our methodology could be extended to assess the "distance" between models in terms of the range of dynamics supported by model A that can also be supported by model B and vice versa. For instance, based on our empirical findings we hypothesize that the distance between the LM and GLM models is larger compared to the distance between the GoM and GGoM models. Importantly such distance could be derived for any pair of models regardless of model complexity. Future work could explore this research direction by analyzing a larger set of dynamic models including phenomenological and mechanistic models.
Acknowledgements
RB is supported by Fondecyt project 1170473; CONICYT/PIA/Concurso Apoyo a Centros Científicos y Tecnológicos de Excelencia con Financiamiento Basal AFB170001; and CRHIAM, project CONICYT/FONDAP/15130015. GC acknowledges financial support from grants NSF-IIS RAPID award #1518939, NSF grant 1318788 Ⅲ: Small: Data Management for Real-Time Data Driven Epidemic simulation, and Conicyt (Chile), project MEC80170119. LYLD is supported by CONICYT scholarship CONICYT-PCHA/Doctorado Nacional/2019-21190640.
Conflict of interest
All authors declare no conflicts of interest in this paper.