Research article

A new test for detecting specification errors in Gaussian linear mixed-effects models

  • Linear mixed-effects models (LMEMs) are widely used in medical, engineering, and social applications. The accurate specification of the covariance matrix structure within the error term is known to impact the estimation and inference procedures. Thus, it is crucial to detect the source of errors in LMEMs specifications. In this study, we propose combining a user-friendly computational test with an analytical method to visualize the source of errors. Through statistical simulations under different scenarios, we evaluate the performance of the proposed test in terms of the Power and Type I error rate. Our findings indicate that as the sample size n increases, the proposed test effectively detects misspecification in the systematic component, the number of random effects, the within-subject covariance structure, and the covariance structure of the error term in the LMEM with high Power while maintaining the nominal Type I error rate. Finally, we show the practical usefulness of our proposed test with a real-world application.

    Citation: Jairo A. Angel, Francisco M.M. Rocha, Jorge I. Vélez, Julio M. Singer. A new test for detecting specification errors in Gaussian linear mixed-effects models[J]. AIMS Mathematics, 2024, 9(11): 30710-30727. doi: 10.3934/math.20241483




    In linear mixed-effects models (LMEMs), the assumptions that concern the structure of the covariance matrix of the response variable are presumed to be adequately specified. This involves specifying the mean structure, the covariance matrix structure, and the distribution pattern of the covariance matrix. These elements define both the covariance among the individuals and those associated with the vector of random effects. However, verifying these assumptions in LMEMs can be challenging due to the complexity of the data structure and the presence of two sources of error within the model.

    Diagnostic tests are crucial for detecting misspecification errors in LMEMs, as they identify potential model misrepresentations through analytical methods. Some authors have proposed tests to detect misspecification when formulating LMEMs. For instance, [3] used a strategy to evaluate the random effect distribution via a parametric Bootstrap for small samples in the case of mixed models, generalized linear models, and non-linear mixed models. For this purpose, they used an asymptotic test based on a gradient function.

    Through statistical simulations, [10] showed that the increase in Type II error was consistent with the effect of specification error on the distribution of the random effect for generalized LMEMs. On the other hand, [7] developed a diagnostic method by performing data reconstruction, to detect misspecification for generalized LMEMs. The author proposed a theoretical justification of the method and investigated the behavior of this method via a simulation in finite samples. In these simulations, the author compared a model without a specification error with a model with a misspecification in its fixed part.

    Several techniques have emerged to discern errors in either the distribution of the error term or the distribution of random effects within statistical models. These components have been assessed by different researchers to ensure model integrity and draw accurate conclusions [5,9,16]. Specifically, [8] proposed a test to analyze specification errors in the distribution of the random error term and the random effects, while [7] presented a test that allowed them to identify errors in the specification of the distribution of the random effects.

    For LMEMs, exploratory techniques have also been proposed to identify the sources of errors. For instance, [18] proposed a set of graphical and analytical techniques, based on three types of residuals (i.e., marginal, conditional, and random effects), for the diagnosis of the intra-unit sample covariance matrix in repeated measures studies, as well as graphical tools to analyze violations of the error structure in LMEMs. To identify the number of random effects in an LMEM, the recommended exploratory methods are based on individual and mean profile analyses [15].

    Despite their utility, the tests previously described fall short in pinpointing the specific sources of these errors. Hence, in this paper we propose combining a user-friendly computational test with an analytical method to visualise the source of errors, while considering the methodology suggested by [15]. Our aim is not to compare our approach with other approaches in the literature to determine which has the superior power behaviour. Instead, we concentrate on integrating two approaches: a formal test and a graphical diagnostic tool. The test detects when the model is misspecified, and the graphical method allows us to visualise the source of the misspecification.

    This article is organized as follows: in Section 2, the Gaussian LMEM is outlined; in Section 3, the proposed test is described; in Section 4, the simulation study and the different scenarios are described in detail, and in Section 5 the results are reported; subsequently, in Section 6, we illustrate the usefulness of the proposed test using real data, and present the diagnostic graphical tools to visualize the source of the misspecification; and finally, the conclusions and recommendations are presented, and further areas of research are discussed.

    A general form of the Gaussian LMEM is

    $$y_i = X_i\beta + Z_i b_i + e_i, \quad i = 1, \ldots, n, \tag{2.1}$$

    where $y_i = (y_{i1}, y_{i2}, \ldots, y_{im_i})^{\top}$ represents the vector of the $m_i$ observations recorded for the $i$-th sample unit, $\beta = (\beta_1, \ldots, \beta_p)^{\top}$ denotes the vector of location parameters (fixed effects), $X_i$ is the matrix corresponding to the specification of the fixed terms, $b_i = (b_{1i}, \ldots, b_{qi})^{\top}$ represents the vector of random effects, $Z_i$ is the matrix corresponding to the specification of the vector of random effects, and $e_i$ represents the vector of random errors. By definition, $b_i \sim N_q[0, G(\theta)]$ and $e_i \sim N_{m_i}[0, R_i(\theta)]$ are assumed to be independent.

    The vector $\theta = (\theta_1, \ldots, \theta_k)^{\top}$ contains all non-redundant components (parameters) of the covariance matrices of the vectors $b_i$ and $e_i$. The vector $\phi = (\beta^{\top}, \theta^{\top})^{\top}$, of dimension $s \times 1$, represents the vector of all the parameters in (2.1). It follows that the covariance matrix of the vector $y_i$ can be written as

    $$\mathrm{Var}(y_i) = V_i(\theta) = V_i = Z_i G Z_i^{\top} + R_i, \tag{2.2}$$

    where $G = G(\theta)$ and $R_i = R_i(\theta)$. In summary,

    $$y_i \sim N_{m_i}[X_i\beta, V_i]. \tag{2.3}$$
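
    As an illustration of (2.2), the following minimal Python sketch assembles the marginal covariance matrix of a single unit. The function name `marginal_covariance`, the measurement times, and the numerical values of G and R_i are hypothetical choices made only for this example; NumPy is assumed to be available.

```python
import numpy as np

def marginal_covariance(Z, G, R):
    """Return V_i = Z_i G Z_i^T + R_i, as in (2.2)."""
    return Z @ G @ Z.T + R

# Hypothetical unit with m_i = 4 measurement times and q = 2 random effects.
x = np.array([0.0, 1.0, 2.0, 3.0])
Z = np.column_stack([np.ones_like(x), x])   # random intercept and slope
G = np.array([[1.0, 0.3],
              [0.3, 0.5]])                  # illustrative values only
R = 0.4 * np.eye(x.size)                    # conditional independence
V = marginal_covariance(Z, G, R)
print(V)
```

    With a random slope, the diagonal of V grows with the measurement time, which is why omitting the slope changes the implied variance pattern.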

    The most commonly used estimation methods for the parameters in model (2.1) are the Maximum Likelihood Estimation (MLE) method and the Restricted MLE (RMLE). For additional details on the MLE and the RMLE, see [12] and [14], respectively.

    The maximum likelihood methodology yields unbiased estimators for the fixed effects, though it introduces bias in the estimators for the random effects. This bias stems from disregarding the loss of the degrees of freedom during the estimation of the fixed terms. Consequently, it also results in biased estimators for the parameters of the intra-unit covariance matrix.

    The random vectors $y_1, \ldots, y_n$ are independent with a distribution given by (2.3). The probability density function associated with each vector $y_i$ is denoted as $f(y_i; \phi)$. Considering the vector

    $$y = (y_1^{\top}, \ldots, y_n^{\top})^{\top},$$

    and the probability density function of the random vectors $y_i$, the likelihood function of $\phi$ is as follows:

    $$L(\phi; y) = \prod_{1 \le i \le n} f(y_i; \phi) = \prod_{1 \le i \le n} \int_{\mathbb{R}^{q_i}} f(y_i; \phi \mid b_i)\, f(b_i; \phi)\, db_i, \tag{2.4}$$

    where $\mathbb{R}^{q_i}$ is the $q_i$-dimensional space of the vector $b_i$. It follows that the logarithm of (2.4) is represented by the following:

    $$l(\phi; y) = -\frac{1}{2}\left\{ N\log(2\pi) + \log|V(\theta)| + (y - X\beta)^{\top}[V(\theta)]^{-1}(y - X\beta) \right\}, \tag{2.5}$$

    where $V(\theta)$ is the covariance matrix of $y$.
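
    The log-likelihood (2.5) can be evaluated numerically as sketched below. This is only an illustration under the assumption that V(θ) has already been assembled as a dense matrix; the function name `gaussian_loglik` and the toy inputs are hypothetical.

```python
import numpy as np

def gaussian_loglik(y, X, beta, V):
    """Marginal Gaussian log-likelihood (2.5) for given beta and V."""
    resid = y - X @ beta
    sign, logdet = np.linalg.slogdet(V)
    if sign <= 0:
        raise ValueError("V must be positive definite")
    # Solve V u = resid instead of inverting V explicitly.
    quad = resid @ np.linalg.solve(V, resid)
    return -0.5 * (y.size * np.log(2.0 * np.pi) + logdet + quad)

# Hypothetical toy example with N = 3 observations and p = 1 fixed effect.
y = np.array([1.0, 2.0, 0.5])
X = np.ones((3, 1))
beta = np.array([1.0])
V = np.eye(3)
print(gaussian_loglik(y, X, beta, V))
```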

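    To reduce the bias in the process of estimating the components of the vector $\theta$ by RMLE, [12] and [6] proposed using a linear transformation of the type $y^{*} = U^{\top}y$ with $E(U^{\top}y) = 0$. They considered the matrix $U$ such that $U^{\top}U = I_{N-p}$ and $UU^{\top} = I_N - X(X^{\top}X)^{-1}X^{\top}$, with

    $$X = (X_1^{\top}, \ldots, X_n^{\top})^{\top},$$

    so that

    $$y^{*} \sim N_{N-p}[0, U^{\top}V(\theta)U], \tag{2.6}$$

    with

    $$N = \sum_{i=1}^{n} m_i.$$

    The logarithm of the restricted marginal likelihood function is as follows:

    $$l_R(\theta; y) = -\frac{(N-p)}{2}\log(2\pi) - \frac{1}{2}\log|V(\theta)| - \frac{1}{2}\log\left|X^{\top}[V(\theta)]^{-1}X\right| - \frac{1}{2}\hat{e}^{\top}[V(\theta)]^{-1}\hat{e}, \tag{2.7}$$

    where $\hat{e} = y - X\hat\beta(\theta)$, and $\hat\beta(\theta) = \left(X^{\top}[V(\theta)]^{-1}X\right)^{-1}X^{\top}[V(\theta)]^{-1}y$ is the MLE of $\beta$ assuming that $\theta$ is known. The maximization of (2.7) yields the restricted maximum likelihood estimator $\hat\theta_R$ of $\theta$. Hence, $\hat\theta_R$ in conjunction with $\hat\beta_R = \hat\beta(\hat\theta_R)$ are the desired RMLEs [2].

    The function (2.7) can also be written as follows:

    $$l_R(\theta; y) = -\frac{1}{2}\left\{ (N-p)\log(2\pi) + \log|V(\theta)| + \log\left|X^{\top}[V(\theta)]^{-1}X\right| + y^{\top}Py \right\}, \tag{2.8}$$

    with

    $$P = [V(\theta)]^{-1} - [V(\theta)]^{-1}X\left(X^{\top}[V(\theta)]^{-1}X\right)^{-1}X^{\top}[V(\theta)]^{-1}.$$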
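
    The restricted log-likelihood (2.8) and the matrix P can be checked numerically. The sketch below is an illustration only: the function name `reml_loglik` and the toy design and covariance are hypothetical. It also computes the quadratic form in the generalized least squares residuals, which should coincide with the quadratic form in P, since P annihilates the columns of X.

```python
import numpy as np

def reml_loglik(y, X, V):
    """Restricted log-likelihood (2.8) for a given covariance matrix V."""
    N, p = X.shape
    V_inv = np.linalg.inv(V)
    XtVinvX = X.T @ V_inv @ X
    # Matrix P as defined right after (2.8).
    P = V_inv - V_inv @ X @ np.linalg.solve(XtVinvX, X.T @ V_inv)
    _, logdet_V = np.linalg.slogdet(V)
    _, logdet_XtVinvX = np.linalg.slogdet(XtVinvX)
    value = -0.5 * ((N - p) * np.log(2.0 * np.pi) + logdet_V
                    + logdet_XtVinvX + y @ P @ y)
    return value, P

# Hypothetical toy data: N = 6 observations, p = 2 fixed effects.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(6), np.arange(6.0)])
V = 0.5 * np.eye(6) + 0.2 * np.ones((6, 6))   # illustrative covariance
y = rng.normal(size=6)

value, P = reml_loglik(y, X, V)

# Generalized least squares estimate of beta and its residual quadratic form.
V_inv = np.linalg.inv(V)
beta_hat = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ y)
resid = y - X @ beta_hat
quad_resid = resid @ V_inv @ resid
print(value, y @ P @ y, quad_resid)
```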

    In general, the process that generates the vector $y$ is not known (i.e., the true probability density $g(y)$ is not known). An LMEM is usually proposed assuming that the distributions of both the random effects and the random error term are known. Here, we consider that $f(y, \phi)$ is the density function of the random vector $y$. If there exists a vector $\phi_0 \in \Theta$ such that $g(y) = f(y, \phi_0)$, with $\Theta$ being a compact subset of a $p$-dimensional Euclidean space, it can be concluded that the model is correctly specified. Otherwise, the model has a specification error. [20] illustrates that when the model is correctly specified, $\hat\phi_n$, obtained by either maximum likelihood or restricted maximum likelihood, is a consistent estimator for $\phi_0$ [17, p. 34], that is,

    $$\hat\phi_n \xrightarrow{\;p\;} \phi_0. \tag{2.9}$$

    When the model is incorrectly specified, there exists a vector $\phi^{*} \in \Theta$ that minimizes the Kullback-Leibler (KL) information criterion, that is,

    $$KL(g : f, \phi) = E_g\left[\log\frac{g(y)}{f(y, \phi)}\right] = \int_{\mathbb{R}^N} g(y)\log\frac{g(y)}{f(y, \phi)}\, dy. \tag{2.10}$$

    However, when the model is properly specified, [19] proved that the only value that minimizes the KL criterion is $\phi^{*} = \phi_0$.
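
    When both the true density g and the candidate density f are multivariate Gaussian, the criterion (2.10) has a well-known closed form, which the sketch below evaluates. This is an illustrative special case rather than part of the proposed test; the function name `kl_gaussian` and the inputs are hypothetical.

```python
import numpy as np

def kl_gaussian(mu_g, cov_g, mu_f, cov_f):
    """KL(g : f) for g = N(mu_g, cov_g) and f = N(mu_f, cov_f)."""
    mu_g = np.atleast_1d(np.asarray(mu_g, dtype=float))
    mu_f = np.atleast_1d(np.asarray(mu_f, dtype=float))
    cov_g = np.atleast_2d(np.asarray(cov_g, dtype=float))
    cov_f = np.atleast_2d(np.asarray(cov_f, dtype=float))
    k = mu_g.size
    diff = mu_f - mu_g
    _, logdet_g = np.linalg.slogdet(cov_g)
    _, logdet_f = np.linalg.slogdet(cov_f)
    trace_term = np.trace(np.linalg.solve(cov_f, cov_g))
    quad_term = diff @ np.linalg.solve(cov_f, diff)
    return 0.5 * (trace_term + quad_term - k + logdet_f - logdet_g)

# The criterion is zero when f coincides with g and positive otherwise.
kl_same = kl_gaussian([0.0], [[1.0]], [0.0], [[1.0]])
kl_shift = kl_gaussian([0.0], [[1.0]], [1.0], [[1.0]])
print(kl_same, kl_shift)
```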

    Let us assume that the following matrices exist:

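    $$
    \begin{aligned}
    A(\phi) &= E\left[\left(\frac{\partial^2 l(\phi; y)}{\partial\phi_k\,\partial\phi_l}\right)\right], &
    B(\phi) &= E\left[\left(\frac{\partial l(\phi; y)}{\partial\phi_k}\,\frac{\partial l(\phi; y)}{\partial\phi_l}\right)\right], \\
    A_n(\phi) &= \frac{1}{n}\sum_{i=1}^{n}\left(\frac{\partial^2 l_i(\phi; y_i)}{\partial\phi_k\,\partial\phi_l}\right), &
    B_n(\phi) &= \frac{1}{n}\sum_{i=1}^{n}\left(\frac{\partial l_i(\phi; y_i)}{\partial\phi_k}\,\frac{\partial l_i(\phi; y_i)}{\partial\phi_l}\right), \\
    A(\phi_0) &= E\left[\left(\frac{\partial^2 l(\phi; y)}{\partial\phi_k\,\partial\phi_l}\right)\right]_{\phi=\phi_0}, &
    B(\phi_0) &= E\left[\left(\frac{\partial l(\phi; y)}{\partial\phi_k}\,\frac{\partial l(\phi; y)}{\partial\phi_l}\right)\right]_{\phi=\phi_0}, \\
    A_n(\hat\phi_n) &= \frac{1}{n}\sum_{i=1}^{n}\left(\frac{\partial^2 l_i(\phi; y_i)}{\partial\phi_k\,\partial\phi_l}\right)_{\phi=\hat\phi_n}, &
    B_n(\hat\phi_n) &= \frac{1}{n}\sum_{i=1}^{n}\left(\frac{\partial l_i(\phi; y_i)}{\partial\phi_k}\,\frac{\partial l_i(\phi; y_i)}{\partial\phi_l}\right)_{\phi=\hat\phi_n}.
    \end{aligned}
    $$

    According to [19],

    $$A_n(\hat\phi_n) \xrightarrow{\;a.s.\;} A(\phi^{*}). \tag{2.11}$$

    Now, let

    $$H = \left(\frac{\partial^2 l_i(\phi; y_i)}{\partial\phi_k\,\partial\phi_l}\right) \tag{2.12}$$

    be the observed information matrix. If model (2.1) is correctly specified [19],

    $$A(\phi_0) + B(\phi_0) = 0. \tag{2.13}$$

    The expression (2.13) is called the information matrix equality. Under appropriate conditions [4], it is possible to demonstrate that

    $$\sqrt{n}\,(\hat\phi_n - \phi_0) \xrightarrow{\;d\;} N_s[0, V(\phi_0)], \tag{2.14}$$

    with

    $$V(\phi_0) = [A(\phi_0)]^{-1}B(\phi_0)[A(\phi_0)]^{-1}. \tag{2.15}$$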
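
    The empirical counterparts of A, B, and the sandwich covariance (2.15) can be assembled from per-unit scores and Hessians, as sketched below. The function name `sandwich` and the toy model are hypothetical: the example uses independent observations from N(μ, 1), for which the score of unit i is y_i − μ and the Hessian is −1, rather than an LMEM.

```python
import numpy as np

def sandwich(scores, hessians):
    """Return A_n, B_n and A_n^{-1} B_n A_n^{-1} from per-unit derivatives.

    scores   : array of shape (n, s), gradient of l_i at the estimate
    hessians : array of shape (n, s, s), Hessian of l_i at the estimate
    """
    scores = np.asarray(scores, dtype=float)
    hessians = np.asarray(hessians, dtype=float)
    n = scores.shape[0]
    A_n = hessians.mean(axis=0)
    B_n = np.einsum("ik,il->kl", scores, scores) / n
    A_inv = np.linalg.inv(A_n)
    return A_n, B_n, A_inv @ B_n @ A_inv

# Toy data whose sample mean is 0 and whose mean squared deviation is 1.
y = np.array([-1.0, 1.0, -1.0, 1.0])
mu_hat = y.mean()
scores = (y - mu_hat).reshape(-1, 1)
hessians = -np.ones((y.size, 1, 1))
A_n, B_n, V_hat = sandwich(scores, hessians)
print(A_n + B_n, V_hat)
```

    In this toy case A_n + B_n vanishes, mirroring the information matrix equality (2.13) for a correctly specified model.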

    In this section, we describe in detail a new test designed to detect misspecification errors and their source in Gaussian LMEMs. Our proposed test is based on the "Sandwich" estimator of the covariance matrix of $\hat\phi_n$ and on the information matrix equality given by (2.13), under the null hypothesis $H_0$ that the model (2.1) is correctly specified.

    Considering the asymptotic distribution of the estimator ˆϕn and under regularity conditions [19], it is known that

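    $$\sqrt{n}\,(\hat\phi_n - \phi_0) \overset{A}{\sim} N_s[0, V(\phi_0)], \tag{3.1}$$

    where $V(\phi_0)$ is as in (2.15).

    Let us consider the vector

    $$d(\phi) = \mathrm{diag}\left\{[A(\phi)]^{-1}B(\phi)[A(\phi)]^{-1} + [A(\phi)]^{-1}\right\}, \tag{3.2}$$

    and assume that it is differentiable in $\phi_0$. Thus, we can build the following:

    $$\nabla d(\phi_0) = \nabla d(\phi)\big|_{\phi=\phi_0} \neq 0. \tag{3.3}$$

    Equation (3.3) is a condition necessary for the existence of the gradient. It follows that the vector

    $$
    \begin{aligned}
    d_n(\hat\phi_n) &= \mathrm{diag}\left\{[A_n(\hat\phi_n)]^{-1}B_n(\hat\phi_n)[A_n(\hat\phi_n)]^{-1} + [A_n(\hat\phi_n)]^{-1}\right\} \\
    &= \mathrm{diag}\left\{[A_n(\hat\phi_n)]^{-1}B_n(\hat\phi_n)[A_n(\hat\phi_n)]^{-1}\right\} + \mathrm{diag}\left\{[A_n(\hat\phi_n)]^{-1}\right\} \\
    &= \frac{1}{n}\sum_{i=1}^{n} d_i(\phi)\Big|_{\phi=\hat\phi_n}
    \end{aligned}
    \tag{3.4}
    $$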

    is an estimator of (3.2), and a potential indicator for detecting specification errors in the model (2.1).

    Let dn(ϕ) be the following:

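    $$d_n(\phi) = \mathrm{diag}\left\{[A_n(\phi)]^{-1}B_n(\phi)[A_n(\phi)]^{-1} + [A_n(\phi)]^{-1}\right\} = \begin{pmatrix} d_{11}(\phi) \\ \vdots \\ d_{ss}(\phi) \end{pmatrix}. \tag{3.5}$$

    Then,

    $$\nabla d_n(\phi) = \nabla\left\{\mathrm{diag}\left[[A_n(\phi)]^{-1}B_n(\phi)[A_n(\phi)]^{-1} + [A_n(\phi)]^{-1}\right]\right\} \tag{3.6}$$

    $$= \begin{pmatrix} \dfrac{\partial d_{11}(\phi)}{\partial\phi_1} & \cdots & \dfrac{\partial d_{11}(\phi)}{\partial\phi_s} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial d_{ss}(\phi)}{\partial\phi_1} & \cdots & \dfrac{\partial d_{ss}(\phi)}{\partial\phi_s} \end{pmatrix}_{s \times s}. \tag{3.7}$$

    Using (3.1) under $H_0$ and the Delta method [17, p. 136], we obtain the following:

    $$\sqrt{n}\,\left(d_n(\hat\phi_n) - d(\phi_0)\right) \xrightarrow{\;D\;} N_s\left(0, \nabla d(\phi_0)\, V(\phi_0)\, [\nabla d(\phi_0)]^{\top}\right), \quad n \to \infty. \tag{3.8}$$

    Now, if we consider (3.8) under $H_0$ and use Cochran's theorem [17, p. 137], then the test statistic of the alternative "Sandwich" estimator (ASEST) takes the following form

    $$\mathrm{ASEST} = n\, d_n(\hat\phi_n)^{\top}\left[\nabla d_n(\hat\phi_n)\, \hat{V}(\hat\phi_n)\, [\nabla d_n(\hat\phi_n)]^{\top}\right]^{-1} d_n(\hat\phi_n), \tag{3.9}$$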

    where

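    $$\hat{V}(\hat\phi_n) = [A_n(\hat\phi_n)]^{-1}B_n(\hat\phi_n)[A_n(\hat\phi_n)]^{-1} \tag{3.10}$$

    is an unbiased and consistent estimator of the covariance matrix of the "Sandwich" estimator for $\hat\phi$ [19]. It is straightforward to show that, under $H_0$, $\mathrm{ASEST} \xrightarrow{\;d\;} \chi^2_s$ as $n \to \infty$. Thus, the Type I error of the test can be calculated as follows:

    $$\text{Type I error} = P\left(\mathrm{ASEST} > \chi^2_{\alpha,s} \mid H_0\right). \tag{3.11}$$

    Similarly, under the alternative hypothesis $H_1$, the power of the test can be calculated as

    $$\text{Power} = 1 - P\left(\mathrm{ASEST} < \chi^2_{\alpha,s} \mid H_1\right). \tag{3.12}$$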
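
    The quadratic form (3.9) and the Monte Carlo estimates of (3.11) and (3.12) can be sketched as follows. This is not the authors' implementation: the function names `asest_statistic` and `rejection_rate` are hypothetical, the inputs d_n, its Jacobian, and the sandwich estimate are assumed to have been computed elsewhere, and the critical value is passed in by the user (3.841 is the upper 5% point of a chi-square distribution with 1 degree of freedom).

```python
import numpy as np

def asest_statistic(d_n, grad_d_n, V_hat, n):
    """Quadratic form (3.9).

    d_n      : vector of length s
    grad_d_n : s x s Jacobian of d_n, rows indexed as in (3.7)
    V_hat    : s x s sandwich estimate (3.10)
    n        : number of sample units
    """
    d_n = np.asarray(d_n, dtype=float)
    J = np.asarray(grad_d_n, dtype=float)
    middle = J @ np.asarray(V_hat, dtype=float) @ J.T
    return float(n * d_n @ np.linalg.solve(middle, d_n))

def rejection_rate(statistics, critical_value):
    """Share of simulated statistics above the critical value.

    Estimates the Type I error rate when the data come from H0 and
    the Power when they come from H1.
    """
    statistics = np.asarray(statistics, dtype=float)
    return float(np.mean(statistics > critical_value))

# Hypothetical inputs, for illustration only.
stat = asest_statistic([0.1, -0.2], np.eye(2), np.eye(2), n=100)
rate = rejection_rate([1.0, 5.0, 10.0, 2.0], critical_value=3.841)
print(stat, rate)
```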

    We conducted numerical experiments to study the behavior of our proposed test in terms of the Type I error rate and the Power to identify potential misspecifications in LMEMs. The structure in all models is the same as in (2.1).

    In order to reflect particular situations we commonly encounter in practical contexts, five different cases were considered:

    Case I: Misspecification of the systematic component;

    Case II: Misspecification of the number of random effects;

    Case III: Random effects are considered independent;

    Case IVa: Misspecification of the within-subject covariance structure; and

    Case IVb: Misspecification of the number of random effects and of the within-subject covariance structure.

    We simulated data following a previously published structure of a Gaussian LMEM [1], and considered [15] to identify a correctly specified model. In particular, the individuals' profiles were used to identify both the structure of the mean response and of the random effects of the LMEM. The identified model, shown as Case I in Table 1, is a second-degree polynomial in the fixed part, and a first-degree polynomial in the random part. This model can be written as follows:

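    $$y_{ij} = \beta_0 + \beta_1 x_{ij} + \beta_2 x_{ij}^2 + b_{0i} + b_{1i}x_{ij} + e_{ij}, \tag{4.1}$$

    where $y_{ij}$ represents the $j$-th observation of the $i$-th individual, $x_{ij}$ is the $j$-th time registered for the $i$-th individual, $\beta = (\beta_0, \beta_1, \beta_2)^{\top}$ are the location parameters, $b_i = (b_{0i}, b_{1i})^{\top}$ represents the random intercept and the random slope, and $e_{ij}$ is the random error term.

    Table 1.  Structure, fixed effects, random effects, and error term considered in the numerical experiments.

    | Case | Model | $G$ | $R_i$ |
    |---|---|---|---|
    | I | $y_{ij} = \underbrace{\beta_0 + \beta_1 x_{ij} + \beta_2 x_{ij}^2}_{\text{Fixed effects}} + \underbrace{b_{0i} + b_{1i}x_{ij}}_{\text{Random effects}} + \underbrace{e_{ij}}_{\text{Error term}}$ | $\begin{pmatrix}\sigma_0^2 & \sigma_{01} \\ \sigma_{01} & \sigma_1^2\end{pmatrix}$ | $\sigma^2 I_{m_i}$ |
    | II | $y_{ij} = \underbrace{\beta_0 + \beta_1 x_{ij}}_{\text{Fixed effects}} + \underbrace{b_{0i} + b_{1i}x_{ij}}_{\text{Random effects}} + \underbrace{e_{ij}}_{\text{Error term}}$ | $\begin{pmatrix}\sigma_0^2 & \sigma_{01} \\ \sigma_{01} & \sigma_1^2\end{pmatrix}$ | $\sigma^2 I_{m_i}$ |
    | III | $y_{ij} = \underbrace{\beta_0 + \beta_1 x_{ij}}_{\text{Fixed effects}} + \underbrace{b_{0i}}_{\text{Random effect}} + \underbrace{e_{ij}}_{\text{Error term}}$ | $\sigma_0^2$ | $\sigma^2 I_{m_i}$ |
    | IVa | As in Case I | $\begin{pmatrix}\sigma_0^2 & \sigma_{01} \\ \sigma_{01} & \sigma_1^2\end{pmatrix}$ | $\sigma^2\begin{pmatrix}1 & \rho & \rho^2 & \cdots & \rho^{m_i-1} \\ & 1 & \rho & \cdots & \rho^{m_i-2} \\ & & \ddots & \ddots & \vdots \\ & & & 1 & \rho \\ & & & & 1\end{pmatrix}$, $\lvert\rho\rvert < 1$ |
    | IVb | As in Case I | $\begin{pmatrix}\sigma_0^2 & 0 \\ 0 & \sigma_1^2\end{pmatrix}$ | $\sigma^2 I_{m_i}$ |
    |  | $y_{ij} = \underbrace{\beta_0 + \beta_1 x_{ij} + \beta_2 x_{ij}^2}_{\text{Fixed effects}} + \underbrace{b_{0i}}_{\text{Random effect}} + \underbrace{e_{ij}}_{\text{Error term}}$ | $\sigma_0^2$ | $\sigma^2\begin{pmatrix}1 & \rho & \rho^2 & \cdots & \rho^{m_i-1} \\ & 1 & \rho & \cdots & \rho^{m_i-2} \\ & & \ddots & \ddots & \vdots \\ & & & 1 & \rho \\ & & & & 1\end{pmatrix}$, $\lvert\rho\rvert < 1$ |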

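    Model (4.1) can be written in a compact form where

    $$X_i = \begin{pmatrix} 1 & x_{i1} & x_{i1}^2 \\ \vdots & \vdots & \vdots \\ 1 & x_{im_i} & x_{im_i}^2 \end{pmatrix} \quad \text{and} \quad Z_i = \begin{pmatrix} 1 & x_{i1} \\ \vdots & \vdots \\ 1 & x_{im_i} \end{pmatrix}.$$

    Assuming that $G = \begin{pmatrix}\sigma_0^2 & \sigma_{01} \\ \sigma_{01} & \sigma_1^2\end{pmatrix}$, $R_i = \sigma^2 I_{m_i}$, $b_i \sim N_2[0, G]$, $e_i \sim N_{m_i}[0, R_i]$, and that $b_i$ and $e_i$ are independent ($i = 1, \ldots, n$), it follows that

    $$y_i \sim N_{m_i}(X_i\beta, V_i), \quad V_i = Z_i G Z_i^{\top} + R_i. \tag{4.2}$$

    Let $\phi = (\beta^{\top}, \theta^{\top})^{\top}$ be the vector of all model parameters for model (4.1). Using LME, the parameter estimates $\hat\phi_n$ of model (4.1) are as follows:

    $$\hat\beta = \begin{pmatrix} 2.451 \\ 1.390 \\ 0.548 \end{pmatrix}, \quad \hat{G} = \begin{pmatrix} 0.536 & 0.569 \\ 0.569 & 0.888 \end{pmatrix}, \quad \text{and } \hat\sigma^2 = 0.404, \tag{4.3}$$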

    where $\hat{G}$ is the estimated covariance matrix of the vector of random effects, $G$. These values are considered to be the true values of the parameters in the simulation process. Hence, we subsequently simulated $B$ data sets of size $n$ and fitted a model according to Table 1. The next step was to calculate the Power and the Type I error rate of the proposed test. To estimate the Power, we simulated data from the identified model (that is, Case I) and fitted a misspecified model from Table 1. To estimate the Type I error rate, we simulated data from the identified model and fitted that same, correctly specified model.

    Here, we generated $B = 10{,}000$ data sets from Model I in Table 1 using the estimates $\hat\beta$, $\hat{G}$, and $\hat\sigma^2$ in (4.3). Thus, the response vector is such that $y_i \sim N_{m_i}(X_i\beta, V_i)$, where $V_i = Z_i G Z_i^{\top} + R_i$.
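
    One replicate of this data-generating step can be sketched as follows. The function name `simulate_dataset`, the five equally spaced measurement times, the seed, and the balanced design are assumptions made for the example; the parameter values are the magnitudes printed in (4.3).

```python
import numpy as np

def simulate_dataset(n, x, beta, G, sigma2, rng):
    """Simulate n units from Model I: quadratic fixed part, random
    intercept and slope, independent homoscedastic errors."""
    X = np.column_stack([np.ones_like(x), x, x**2])
    Z = X[:, :2]                                         # intercept and slope
    b = rng.multivariate_normal(np.zeros(2), G, size=n)  # shape (n, 2)
    e = rng.normal(0.0, np.sqrt(sigma2), size=(n, x.size))
    return X @ beta + b @ Z.T + e                        # shape (n, m)

rng = np.random.default_rng(2024)          # hypothetical seed
x = np.arange(5.0)                         # hypothetical measurement times
beta = np.array([2.451, 1.390, 0.548])     # values as printed in (4.3)
G = np.array([[0.536, 0.569],
              [0.569, 0.888]])
sigma2 = 0.404
Y = simulate_dataset(50, x, beta, G, sigma2, rng)
print(Y.shape)
```

    Repeating this step B times and fitting the candidate model to each replicate yields the collection of test statistics from which the rejection rate is computed.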

    In order to assess the misspecification of the systematic component, we fitted Model II in Table 1 to the simulated data. Note that, in this model, the quadratic term is omitted; hence, the model is misspecified. The Type I error rate and Power, as a function of $n$, are reported in Figure 1. Overall, our results for Case I indicate that as $n \to \infty$, the Type I error of the test gets closer to the nominal Type I error rate of 5% and the Power increases, suggesting that our proposed test is capable of correctly detecting a misspecification in the systematic component of the LMEM as $n$ increases.

    Figure 1.  (a) Type I error and (b) Power of the proposed test for each Case in Table 1 as a function of the sample size, n. The blue line represents a Type I error of 5%.

    To assess the Type I error and Power of the proposed test in this specific instance, we created $B = 10{,}000$ data sets based on Model I of Table 1, utilizing the estimates provided in (4.3). Subsequently, we fitted Model III from the same table. By design, here we omit the random slope, which induces the specification error in the number of random effects; therefore, the covariance matrix generated by this model has a uniform structure, in which the variances are constant over time. Our results, presented in Figure 1, show that as $n \to \infty$, the Type I error of the test gets closer to the nominal level and the Power increases, suggesting that the proposed test is capable of correctly detecting the misspecification in the number of random effects as the sample size increases.

    Using a simulation strategy similar to that previously discussed in Case I and Case II, we simulated B=10,000 data sets from Model I and fitted Model III (see Table 1 for more details) to reflect a misspecification error in the covariance matrix of the random effects. In particular, we wrongly considered that the random effects were independent and assessed the performance of the proposed test to detect such misspecifications. Figure 1 shows the results. Overall, our findings suggest that the Type I error of the proposed test gets closer to the nominal level of 5% as n increases, suggesting that our test controls the probability of wrongly identifying a misspecification of the covariance structure of the random effects. In addition, the Power of the proposed test in Case III is >0.95 regardless of n, and increases to 1 for n>300. This result indicates that the proposed test is highly likely to detect a misspecification error in the covariance structure of the random effects when it actually exists.

    Here, we assess the performance of the proposed test when the covariance matrix of the response vector is misspecified. In particular, we consider that the random error term follows an autoregressive (AR) process of order 1 with $\rho = 0.9$. Following (3.9), the test statistic is $\chi^2_{\text{calculated}} = 1233.4$ and the associated p-value is $< 1 \times 10^{-16}$, indicating that the LMEM is misspecified.

    Following a similar simulation strategy to that previously described, we assessed the Type I error rate and the Power of the proposed test when a misspecification of the within-subject covariance structure exists. The results are presented in Figure 1. As seen for the other cases, for Case IVa, the Type I error of the proposed test gets closer to the nominal level as n increases, and the Power of the proposed test is >0.9 regardless of n and increases to 1 for n>500. Overall, these results indicate that the proposed test performs reasonably well for detecting a misspecification error in the within-subject covariance structure when it actually exists, and controls the Type I error when it does not.
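
    The AR(1) within-subject covariance structure used in Case IVa, whose (j, k) entry is σ²ρ^|j−k|, can be built as in the following sketch. The function name `ar1_covariance` and the example values are hypothetical.

```python
import numpy as np

def ar1_covariance(m, sigma2, rho):
    """Return the m x m AR(1) covariance matrix sigma2 * rho**|j - k|."""
    if abs(rho) >= 1:
        raise ValueError("rho must satisfy |rho| < 1")
    idx = np.arange(m)
    lags = np.abs(idx[:, None] - idx[None, :])
    return sigma2 * rho ** lags

R = ar1_covariance(3, sigma2=2.0, rho=0.5)
print(R)
```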

    This case is similar to Case IVa. However, there is a specification error in the number of random effects and in the within-subject covariance matrix. In particular, we simulated $B = 10{,}000$ data sets from a model that has an intercept, slope, and quadratic effect in the fixed part, and only an intercept in the random part. In addition, the random error term follows an AR(1) process. With the simulated data, we fitted Model I of Table 1, which includes an intercept, slope, and quadratic effect in the fixed part, and an intercept and slope in the random part. We subsequently fitted Model IVb, which includes an intercept and a slope in the fixed part, and considers that the random effects are homoscedastic with conditional independence. Figure 1 displays the results.

    Nagle (2018) [11] presented a dataset that analyzed the temporal shift in the production of stop consonants by a group of 24 English learners (Figure 2). The phonetic context is controlled with 4 dummy characters: 'Pafo', 'Bafo', 'Pamuso', and 'Bamuso'. With the first two characters, the occlusion occurs on a stressed syllable, while with the other two, the occlusion occurs on an unstressed syllable. The outcome variable was voice onset time (VOT), which is an acoustic measure representing the time elapsed between the onset of vocalisation and the release of an occlusion closure. Five sessions were conducted for each student. Two stress categories were analysed, as well as each participant's age and stress, among other variables. In this study, the interest was in differentiating the VOT variation from the individuals' variations.

    Figure 2.  Individuals' profiles (in gray) around the loess profile (dashed line).

    Figure 2 suggests that introducing sessions as a random effect may be appropriate. Thus, an LMEM will allow us to analyze the individuals' and the global behavior, even with incomplete data for some individuals, as shown in Figure 2. In addition, we illustrate the proposed test to identify the source of error, if there is a misspecification in the model.

    Given the behavior of the profiles in Figure 2, we suggest an LMEM with an intercept, slope, and quadratic effect in the fixed part, and an intercept and slope in the random part. To model the observed heteroscedasticity, we propose a homoscedastic LMEM with conditional independence for simplicity. The general form of this model was previously described and is given in (4.1). Table 2 shows the MLEs of $\phi = (\beta^{\top}, \theta^{\top})^{\top}$, the vector of all model parameters in (4.1).

    Table 2.  MLEs for the (a) Fixed and (b) Random part of model (4.1).

    (a) Fixed part

    | Term | Estimate | SE | df | t | P-value |
    |---|---|---|---|---|---|
    | $\beta_0$ | 46.24 | 4.16 | 24.22 | 11.12 | 0.00 |
    | $\beta_1$ | -10.22 | 1.33 | 28.69 | -7.66 | 0.00 |
    | $\beta_2$ | 1.98 | 0.16 | 4356.48 | 12.34 | 0.00 |

    (b) Random part

    | Term | Variance estimate |
    |---|---|
    | $b_0$ | 405.96 |
    | $b_1$ | 33.55 |
    | $e$ | 15.984 |

    Note: SE: Standard Error; df: degrees of freedom; t: test statistic.

    Under the null hypothesis that the model (4.1) is correctly specified, the ASEST test statistic is $\chi^2_{\text{calculated}} = 1.279 \times 10^{-9}$ and the associated p-value is $> 0.05$. Thus, no misspecification is detected.

    As a complement to the proposed ASEST test for identifying misspecification errors in model (4.1) for the VOT data set [11], here we employed several graphical diagnostic tools on the fitted model. Our results are presented in Figures 3 to 8.

    Figure 3.  Modified Lesaffre–Verbeke unit index plot for model (4.1). Dashed line represents the Q3 + 1.5 IQR. Q3: 3rd quartile; IQR: interquartile range.
    Figure 4.  Q-Q plot (left) and histogram (right) for the standardized conditional residuals.
    Figure 5.  Mahalanobis distance as a function of the unit index for identifying influential observations.
    Figure 6.  Standardized vs. marginal fitted residuals (left) and histogram of the standardized residuals (right) for the VOT data set based on model (4.1).
    Figure 7.  Q-Q plot for the Mahalanobis distance.
    Figure 8.  Standardized residuals vs. predicted values (left) and histogram of the standardized residuals (right) for the VOT data set based on model (4.1).

    Figure 3 presents the residual diagnostic plot for the modified Lesaffre–Verbeke index. The plot suggests that the proposed covariance structure may be suitable for the 24 units, as it shows none of the deviations that would be expected if the model were incorrectly specified.
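
    The dashed reference line in Figure 3, Q3 + 1.5 IQR, can be computed as in the following sketch. The function name `upper_fence` and the index values are hypothetical (they are not the values of the VOT analysis), and the quartiles use NumPy's default linear interpolation, which may differ slightly from other quartile definitions.

```python
import numpy as np

def upper_fence(values):
    """Return Q3 + 1.5 * IQR and the positions of values above it."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    cutoff = q3 + 1.5 * (q3 - q1)
    flagged = np.flatnonzero(values > cutoff)
    return cutoff, flagged

# Hypothetical per-unit index values; the last unit stands out.
index_values = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
cutoff, flagged = upper_fence(index_values)
print(cutoff, flagged)
```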

    Our interest in this section is to illustrate the graphical diagnostic methodology that allows us to identify and locate the source of the specification error, if it exists. In this case, the model is incorrectly specified in the sources identified.

    On the other hand, Figure 4 shows the standard raw residual Q-Q plot for the fitted model. This result, which is in line with the findings of the proposed test, suggests that the error distribution for the adopted LMEMs does not seem to have heavy tails.

    Additionally, we analyzed the potential influential observations based on the Mahalanobis distance (Figure 5). Notably, identifying such observations may be crucial for a subsequent analysis in the context of LMEMs; however, this cannot be achieved using our ASEST test. Therefore, the relevance of this plot lies in its ability to visualize and potentially highlight influential observations. Interestingly, we identified two observations (i.e., observations #5 and #6) that may be influential. Although this result warrants further investigation to detect potential outliers or inconsistent values, we can statistically treat them as possible influential values.

    Another important aspect that cannot be addressed using the analytical version of the ASEST test is the identification of patterns in the fixed effects of the LMEM. Figure 6 shows the corresponding diagnostic plot when model (4.1) is fitted to the VOT data (Figure 2), which provides no evidence of the omission of any fixed effect.

    Similarly, whether the Gaussian assumption adopted for the random effects in model (4.1) is suitable cannot be checked through the ASEST test. Thus, we propose the use of a Q-Q plot for the Mahalanobis distance, as shown in Figure 7. Overall, these results suggest no evidence against the adopted Gaussian assumption for the random effects.

    Finally, the vector of conditional errors was assumed to have a homoscedastic structure in model (4.1). However, the ASEST test cannot be directly used to validate such an assumption. Thus, we encourage the use of the standardized minimally confounded residuals, shown in Figure 8, as a suitable alternative for such a purpose. When applied to the VOT data set, the plot indicates that the assumption holds, which is in line with the assumption made while proposing the LMEM model given in (4.1).

    LMEMs are widely used to analyze complex data structures; however, they are susceptible to various types of misspecification errors that can lead to biased or inconsistent estimates. Hence, detecting misspecification errors in LMEMs is crucial to ensure the accuracy and reliability of statistical inferences. These errors can arise from incorrect assumptions about the systematic component, random effects, within-subject covariance structure, or error term covariance structure.

    In this study, we developed ASEST, a novel test specifically designed to detect misspecifications within LMEMs and to identify their sources. Our approach applied the Delta method to leverage the asymptotic behavior of the test. The construction methodology was outlined in Sections 2 and 3, followed by a comprehensive analysis of diverse scenarios explained in detail in Section 4, and the associated results in Section 5.

    One of the main advantages of our proposed test is its computational efficiency. We developed and implemented functions that exhibited rapid responsiveness, which ensures their applicability for real-world use. In the future, we plan to contribute the implementation of the ASEST test to the Comprehensive R Archive Network (CRAN)[13] repository to facilitate its accessibility and wider use.

    An intriguing aspect of our methodology lies in its compatibility with incomplete data, as displayed in Figure 2. Furthermore, we complement our analytical framework with a graphical methodology. This combined approach not only discerns erroneous model specifications, but also visually elucidates the origins of misspecification within the model structure. Notably, few methodologies are specifically designed to achieve this goal, particularly when it is of interest to identify the source of the misspecification. The integration of a robust statistical test, methodological clarity, computational efficiency, and the potential for future repository integration underscores the substantive contributions of our study. Our approach holds promise for widespread adoption, offering a comprehensive toolkit for accurately assessing and rectifying misspecification errors within LMEMs. Hence, the integration of statistical testing and graphical diagnostics in our methodology will significantly expand the diagnostic capabilities of LMEMs.

    First, future studies should investigate additional scenarios to assess the impact of simultaneous changes in the distribution of random effects and random errors. In this complex setting, it is crucial to assess whether the origin of both errors can be detected simultaneously, which is particularly important because the two are challenging to distinguish. Identifying the sources of these errors in this scenario also poses significant computational challenges, which will require innovative methods and advanced statistical techniques.

    Second, the performance of our test must be assessed in cases where the vector of random effects does not follow a Normal distribution. Although preliminary numerical experiments showed that the results presented in this study seemed to hold when this was the case, a more comprehensive evaluation has yet to be completed. Third, here we considered a homoscedastic conditional independence structure for the covariance of the random error term. Further lines of research could explore AR(p), non-constant, and non-homogeneous covariance structures for the error term.
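    To make the contrast between these error covariance structures concrete, the following sketch builds three within-subject covariance matrices for a subject with m observations. It is an illustration only and not part of the ASEST implementation (which is written in R): the function names, the use of NumPy, and the choice of AR(1) as the simplest AR(p) case are assumptions made here.

```python
import numpy as np


def independent_cov(m, sigma2):
    """Homoscedastic conditional independence: sigma2 * I_m."""
    return sigma2 * np.eye(m)


def ar1_cov(m, sigma2, rho):
    """Stationary AR(1): entry (j, k) equals sigma2 * rho**|j - k|,
    where sigma2 is the marginal variance of each error."""
    idx = np.arange(m)
    lags = np.abs(idx[:, None] - idx[None, :])
    return sigma2 * float(rho) ** lags


def heteroscedastic_cov(variances):
    """Conditional independence with non-constant variances (diagonal)."""
    return np.diag(np.asarray(variances, dtype=float))


if __name__ == "__main__":
    # Hypothetical values, chosen only to display the three structures.
    print(independent_cov(3, 2.0))
    print(ar1_cov(3, 2.0, 0.5))
    print(heteroscedastic_cov([1.0, 2.0, 4.0]))
```

    With rho = 0, the AR(1) matrix reduces to the homoscedastic conditional independence structure considered in this study, so the latter is a special case of the former.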

    Finally, in our numerical experiments (see Section 4 for further details), the number of predictors or features p was small compared to the sample size n. However, in many fields, the number of features in the data is typically much larger than the sample size (i.e., p ≫ n). Therefore, the performance of our proposed test must be evaluated in these settings to ensure its effectiveness in real-world applications.

    The R code for generating the plots and results in this paper is available from the first author upon request.

    Jairo A. Angel: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data curation, Writing—original draft, Writing—review and editing, Visualization, Funding acquisition; Francisco M. M. Rocha: Conceptualization, Methodology, Software, Validation, Writing—original draft, Supervision, Project administration, Funding acquisition; Jorge I. Vélez: Formal analysis, Resources, Writing—review and editing, Visualization, Funding acquisition; Julio M. Singer: Conceptualization, Methodology, Software, Validation, Resources, Writing—original draft, Supervision, Project administration, Funding acquisition. All authors have read and agreed to the published version of the manuscript.

    The article processing charge was funded by Universidad del Norte, Barranquilla, Colombia.

    The authors declare that they have no conflict of interest in this work.



    [1] J. Afiune, Avaliação Ecocardiografica Evolutiva de Recémnascidos Pre-termo, do Nascimento Até o Termo, PhD thesis, Universidade de São Paulo, 2000.
    [2] E. Demidenko, Mixed Models: Theory and Applications with R, 2 Eds., Hoboken: John Wiley & Sons, Inc., 2013.
    [3] R. Drikvandi, G. Verbeke, G. Molenberghs, Diagnosing Misspecification of the Random-Effects Distribution in Mixed Models, Biometrics, 73 (2017), 63–71.
    [4] D. A. Freedman, On the so-called 'Huber Sandwich Estimator' and 'Robust Standard Errors', Amer. Stat., 60 (2006), 299–302. http://doi.org/10.1198/000313006X152207
    [5] S. K. Hanneman, Design, Analysis, and Interpretation of Method-Comparison Studies, AACN Adv. Crit. Care, 19 (2008), 223–234. http://doi.org/10.4037/15597768-2008-2015
    [6] D. A. Harville, Maximum Likelihood Approaches to Variance Component Estimation and to Related Problems, J. Amer. Stat. Assoc., 72 (1977), 320–338.
    [7] X. Huang, Detecting Random-Effects Model Misspecification via Coarsened Data, Comput. Stat. Data Anal., 55 (2011), 703–714. https://doi.org/10.1016/j.csda.2010.06.012
    [8] J. Jiang, Goodness-of-Fit Tests for Mixed Model Diagnostics, Ann. Stat., 29 (2001), 1137–1164.
    [9] N. Lange, L. Ryan, Assessing Normality in Random Effects Models, Ann. Stat., 17 (1989), 624–642.
    [10] S. Litière, A. Alonso, G. Molenberghs, Type I and Type II Error under Random-Effects Misspecification in Generalized Linear Mixed Models, Biometrics, 63 (2007), 1038–1044.
    [11] C. Nagle, An Introduction to Fitting and Evaluating Mixed-effects Models in R, Proceedings of the 10th Pronunciation in Second Language Learning and Teaching Conference, 2018, 82–105.
    [12] H. D. Patterson, R. Thompson, Recovery of Inter-block Information when Block Sizes are Unequal, Biometrika, 58 (1971), 545–554.
    [13] R Core Team, R: A Language and Environment for Statistical Computing, Available from: https://www.r-project.org/.
    [14] G. K. Robinson, That BLUP is a Good Thing: The Estimation of Random Effects, Stat. Sci., 6 (1991), 15–32.
    [15] F. M. Rocha, J. M. Singer, Selection of Terms in Random Coefficient Regression Models, J. Appl. Stat., 45 (2018), 225–242.
    [16] H. Schielzeth, N. J. Dingemanse, S. Nakagawa, D. F. Westneat, H. Allegue, C. Teplitsky, et al., Robustness of Linear Mixed-Effects Models to Violations of Distributional Assumptions, Methods Ecol. Evol., 11 (2020), 1141–1152. https://doi.org/10.1111/2041-210X.13434
    [17] P. Sen, J. Singer, Large Sample Methods in Statistics: An Introduction with Applications, Boca Raton: CRC Press, 1993.
    [18] J. M. Singer, F. M. Rocha, J. S. Nobre, Graphical Tools for Detecting Departures from Linear Mixed Model Assumptions and Some Remedial Measures, Int. Stat. Rev., 85 (2017), 290–324. https://doi.org/10.1111/insr.12178
    [19] H. White, Maximum Likelihood Estimation of Misspecified Models, Econometrica, 50 (1982), 1–25. https://doi.org/10.2307/1912526
    [20] D. Yu, X. Zhang, K. K. Yau, Asymptotic Properties and Information Criteria for Misspecified Generalized Linear Mixed Models, J. R. Stat. Soc. Ser. B (Stat. Methodol.), 80 (2018), 817–836.
  • © 2024 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)