1.
Introduction
The varying-coefficient partially linear model, an important class of semiparametric models, has attracted much attention because it combines the interpretability of parametric models with the flexibility of nonparametric models. In this model, the response is assumed to depend linearly on the covariates in the parametric part. However, this assumption may be inappropriate in some practical applications. For example, empirical studies of a real ecological data set [1] suggest that it is more reasonable to assume a nonlinear relationship between net ecosystem CO2 exchange (NEE) and the amount of photosynthetically active radiation (PAR). Likewise, the scatterplot of viral load against CD4 cell count shows no strict linearity between them [2].
To describe such potentially complicated nonlinear relationships between the response and certain covariates, Li and Mei [3] proposed the varying-coefficient partially nonlinear model, which takes the form
where Y is the response and X∈Rp, Z∈Rr and U are the associated covariates. g(⋅,⋅) is a pre-specified nonlinear function with parameter vector β=(β1,⋯,βq)T, α(⋅)=(α1(⋅),⋯,αp(⋅))T is a vector of coefficient functions, and ε is a model error with mean zero and variance σ2. The dimensions of Z and β need not be equal.
Model (1.1) is clearly more flexible, as it covers the varying-coefficient partially linear model as a special case; consequently, it has been explored extensively since it was proposed. Li and Mei [3] proposed a profile nonlinear least squares estimation procedure and established the asymptotic properties of the resulting estimators. Zhou et al. [4] investigated empirical likelihood inference and showed the corresponding Wilks phenomenon. Furthermore, Yang and Yang [5] developed a two-stage estimation procedure based on the orthogonality-projection method and smooth-threshold estimating equations. In addition, Xiao and Chen [6] developed a local bias-corrected empirical likelihood procedure for model (1.1) with measurement errors in the nonparametric part. Tang et al. [7] proposed an adjusted empirical likelihood inference procedure for model (1.1) with endogenous covariates.
The research above is based on either the least squares method or the empirical likelihood method, both of which are tied to mean regression. As is well known, although these methods perform satisfactorily for normally distributed data, they are sensitive and may produce relatively large bias when the data follow a heavy-tailed distribution or contain outliers. Therefore, a number of robust inference procedures have been developed. For model (1.1), Jiang et al. [8] proposed a robust estimation procedure based on the exponential squared loss function, and Yang et al. [9] developed quantile regression for robust inference. Recently, Xiao and Liang [10] proposed a robust two-stage estimation procedure for model (1.1) based on modal regression.
Missing data are frequently encountered in research fields such as biomedicine, economics and clinical medicine. Common missing mechanisms include missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR) (Little and Rubin [11]). Under the MAR assumption, the missing probability, called the propensity score function, depends only on the fully observed variables. Some research has explored statistical inference for model (1.1) under the MAR assumption. For example, with covariates missing at random, Wang et al. [12] proposed a profile nonlinear least squares estimation procedure based on the inverse probability weighted method. Xia et al. [13] developed an empirical likelihood inference based on a weighted imputation method when the response was missing at random.
Nevertheless, the propensity score may depend on the value of the unobserved variable itself. For example, in income surveys, individuals with high income are usually less willing to report their specific income, so nonresponse rates tend to be related to the values of the response. In this situation, the MNAR assumption becomes reasonable. Compared with MAR, handling MNAR data is a great challenge, since some parameters are not identifiable without further restrictions on the propensity score. Over the past few years, MNAR data analysis has received much attention in the literature. Kim and Yu [14] first defined the exponential tilting model for the nonignorable missing mechanism and proposed a semiparametric estimation procedure for mean functions. Tang et al. [15] developed modified estimating equations by imputing missing data through a kernel regression method under the exponential tilting model. Wang et al. [16] used an instrumental variable, which is related to the study variable but not involved in the missing-data propensity, to construct estimating equations. To our knowledge, there is no literature studying inference for the varying-coefficient partially nonlinear model with nonignorable missing responses, which motivates this work. First, as an extension of the varying-coefficient partially linear model, the model combines the interpretability of parametric models with the flexibility of nonparametric models, and it overcomes the limitation of the linear function in traditional semiparametric models. Second, compared with MAR, MNAR is much more common in practical applications and more challenging to study.
Furthermore, as a robust estimation procedure, modal regression not only performs well in the presence of outliers or heavy-tailed distributions, but also attains full asymptotic efficiency under normal error distributions.
In this paper, we propose a robust estimation procedure for model (1.1) with nonignorable missing responses. For the nonignorable nonresponse propensity, we assume a semiparametric model, because fully parametric approaches are very sensitive to misspecification of the assumed parametric form. An instrumental variable not involved in the propensity has been studied by Shao and Wang [17] to avoid the identifiability issue. The nonparametric component of the propensity is replaced by a kernel-type estimator, and the parameter is then estimated by profiled estimating equations and the generalized method of moments. We then propose robust estimators in the varying-coefficient partially nonlinear model based on the inverse probability weighting technique and modal regression. On the one hand, estimation procedures using only complete-case data may yield seriously biased results, especially when the missing probability is large; the inverse probability weighting technique reduces this bias by using the inverse of the propensity as weights. On the other hand, modal regression, introduced by Yao et al. [18], has been applied extensively to other semiparametric models, such as [19,20]. Modal regression has several advantages over traditional mean regression. First, it is easy to implement, involving only a tuning parameter that can be selected automatically. Second, it is robust in the presence of outliers or heavy-tailed errors, as has been verified in an extensive literature. Meanwhile, the resulting estimator achieves full asymptotic efficiency under the normal error distribution. This encourages us to apply inverse probability weighting and modal regression to the varying-coefficient partially nonlinear model.
The rest of this paper is organized as follows. Section 2 introduces a robust estimation procedure based on modal regression and inverse probability weighting. Section 3 establishes its theoretical properties. In Section 4, we discuss the selection of the bandwidth and the specific estimation algorithm. Simulation studies and a real data analysis are then conducted in Sections 5 and 6 to evaluate the performance of the proposed estimation procedure. We make concluding remarks in Section 7 and leave the proofs of the main theorems to the Appendix.
2.
Estimation methodology
2.1. Modal regression procedure with nonignorable missing response
Suppose that {Yi,Xi,Zi,Ui}, i=1,⋯,n, is a random sample from model (1.1); that is,
where covariates Xi,Zi,Ui are completely observed and response Yi has nonignorable missing values. Let δi be a binary response indicator for Yi, where δi=1 if Yi is observed, and δi=0 otherwise. Furthermore, we define the propensity of missing data as follows
where Ti=(XTi,ZTi,Ui)T. Under the MNAR assumption, the propensity of missing data depends on Yi itself, regardless of whether Yi is observed or missing.
To simplify model (2.1), we apply B-spline functions to approximate the nonparametric function α(u), instead of using local polynomial estimation. This choice is motivated by the fact that the B-spline approximation has bounded support and is numerically stable. In addition, it avoids the high computational complexity associated with local polynomial estimation, as elaborated in [21]. More specifically, let B(u)=(B1(u),…,BL(u))T be the B-spline basis functions of order M, where L=K+M and K is the number of interior knots. The nonparametric function αk(u) can then be approximated by
where γk=(γk1,γk2,⋯,γkL)T. Then, we substitute (2.2) into model (2.1) to obtain
where γ=(γT1,γT2,⋯,γTp)T and Wi=Ip⊗B(Ui)⋅Xi, with Ip being the p-dimensional identity matrix.
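As an illustration of the approximation above, the B-spline basis B(u) can be evaluated with the Cox-de Boor recursion. The sketch below is our own (the function name and knot handling are not from the paper): it builds the L = K + M basis functions of order M on [0, 1] for a given set of interior knots, using clamped boundary knots.

```python
import numpy as np

def bspline_basis(u, interior_knots, order=4):
    """Evaluate the L = K + M B-spline basis functions of order M
    (degree M - 1) at points u in [0, 1] via the Cox-de Boor recursion."""
    M = order
    K = len(interior_knots)
    # clamped knot sequence: M boundary knots repeated at each end
    knots = np.concatenate([np.zeros(M), np.sort(interior_knots), np.ones(M)])
    u = np.atleast_1d(np.asarray(u, dtype=float))
    n0 = len(knots) - 1                       # number of degree-0 pieces
    B = np.zeros((len(u), n0))
    for j in range(n0):                       # indicators of the knot spans
        B[:, j] = (knots[j] <= u) & (u < knots[j + 1])
    B[u == 1.0, K + M - 1] = 1.0              # close the last span at u = 1
    for m in range(2, M + 1):                 # raise the order from 2 up to M
        nb = n0 - (m - 1)
        Bn = np.zeros((len(u), nb))
        for j in range(nb):
            d1 = knots[j + m - 1] - knots[j]
            d2 = knots[j + m] - knots[j + 1]
            left = (u - knots[j]) / d1 * B[:, j] if d1 > 0 else 0.0
            right = (knots[j + m] - u) / d2 * B[:, j + 1] if d2 > 0 else 0.0
            Bn[:, j] = left + right
        B = Bn
    return B                                  # shape (len(u), K + M)
```

With K = 3 interior knots and order M = 4 this yields L = 7 basis functions, which sum to one at every u (the partition-of-unity property underlying the approximation of αk(u)).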
For fixed β, model (2.3) can be reexpressed as
where Y=(Y1,Y2,⋯,Yn)T, g(Z,β)=(g(Z1,β),g(Z2,β),⋯,g(Zn,β))T, W=(W1,W2,⋯,Wn)T and ε=(ε1,ε2,⋯,εn)T.
Clearly, model (2.4) is a standard linear model. We can obtain the initial estimators of γ with the ordinary least squares method as follows
We replace γ in (2.4) by ˆγ(1), and model (2.4) becomes
where P=In−W(WTW)−1WT, with In being an n-dimensional identity matrix.
Then, the inverse probability weighting technique is adopted to reduce the estimation bias caused by missing Yi, and modal regression is employed to attain robust estimators. Motivated by the idea of Yao et al. [18], the modal estimator of β is constructed by maximizing the following objective function
where Δi=δi/π(Yi,Ti), ϕh(t)=ϕ(t/h)/h with kernel density function ϕ(⋅) and bandwidth h. Pi denotes the ith row of the matrix P. According to Yao et al. [18], we adopt the Gaussian kernel density function for ϕ(⋅) throughout the paper. The specific selection of h is described in Section 4.1.
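The objective above can be written down directly. The following Python sketch (our own notation; `g`, `P` and `pi_hat` are supplied by the caller) evaluates the inverse-probability-weighted modal objective with the Gaussian kernel, as a minimal illustration rather than the paper's implementation.

```python
import numpy as np

def gaussian_kernel(t, h):
    """phi_h(t) = phi(t / h) / h with the standard normal density phi."""
    return np.exp(-0.5 * (t / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

def ipw_modal_objective(beta, Y, Z, P, delta, pi_hat, g, h):
    """Q(beta) = (1/n) sum_i Delta_i * phi_h( P_i (Y - g(Z, beta)) ),
    where Delta_i = delta_i / pi_hat_i are the inverse probability weights."""
    resid = P @ (Y - g(Z, beta))      # projected residuals P_i (Y - g(Z, beta))
    weights = delta / pi_hat          # Delta_i
    return np.mean(weights * gaussian_kernel(resid, h))
```

Since phi_h peaks at zero, the objective is largest when the weighted residuals concentrate around zero, which is why its maximizer targets the conditional mode rather than the mean.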
However, the propensity score π(Yi,Ti) is usually unknown in practice, and it is common to use the estimator ˆπ(Yi,Ti) to replace π(Yi,Ti). The estimation procedure of the propensity π(Yi,Ti) is given in the next subsection.
2.2. Estimation for semiparametric nonignorable propensity
According to Shao and Wang [17], we assume the following semiparametric model for the propensity
where q(Yi,ζ) is a known function of ζ, which is a d-dimensional unknown parameter, and ψ(⋅) is an unknown function. Several common models are special cases of the semiparametric model (2.7). For example, (2.7) reduces to the model for ignorable missing data when q(Yi,ζ) does not depend on Yi. When q(Yi,ζ)=exp(ζYi), (2.7) simplifies to the exponential tilting model for nonignorable missing data, as described in Kim and Yu [14].
Without further assumptions, Shao and Wang [17] showed that ψ(⋅) and ζ are not identifiable in the propensity score function (2.7). To overcome this difficulty, we assume that Ti can be decomposed into two parts, Ti=(VTi,STi)T, where Si can be excluded from the propensity model; then (2.7) reduces to
where ψ(⋅) is an unspecified function of Vi. The covariate Si is referred to as the instrumental variable by Wang et al. [16], and it aids in identifying and estimating all unknown parameters.
To estimate the propensity defined by (2.8), for any fixed ζ, we have
Thus, a nonparametric kernel estimator of ψζ(V) is given by
where Lb(⋅)=b−1L(⋅/b), L(⋅) is a kernel function and b is a bandwidth. Note that ˆψζ(V) is not an estimator of ψ(V) as ζ is an unknown parameter value. However, ˆψζ(V) is useful in the following estimating equations for estimating unknown ζ. Define
where π(Y,T,ψζ,ζ)={1+ψ(V)q(Y,ζ)}−1 and h(S) is a known vector-valued function with dimension R≥d+1. As in Wang et al. [16], we recommend using h(S)=(1,VT,ST)T throughout this paper. It is easy to verify that E[f(Y,T,δ,ψζ0,ζ0)]=0 at the true parameter value ζ0. The estimating equations are over-identified when R>d, so we employ the two-step generalized method of moments (GMM).
Let Fn(ˆψζ,ζ)=1nn∑i=1f(Yi,Ti,δi,ˆψζ,ζ) and ˆζ(1)=argminζFn(ˆψζ,ζ)TFn(ˆψζ,ζ). The two-step GMM estimator of ζ is
where ˆWn=1nn∑i=1f(Yi,Ti,δi,ˆψˆζ(1),ˆζ(1))f(Yi,Ti,δi,ˆψˆζ(1),ˆζ(1))T. Eventually, the propensity model can be consistently estimated by
Then, the modal estimator ˆβ for β is constructed by maximizing the following objective function
where ˆΔi=δi/ˆπ(Yi,Ti).
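To make the propensity estimation of Subsection 2.2 concrete, the sketch below is our own illustration under the working choices q(Y,ζ)=exp(ζY) and a scalar ζ searched over a grid (all function names are ours, not the paper's). It implements the kernel estimator of ψζ in the spirit of Shao and Wang [17], the moment functions f, and the two-step GMM.

```python
import numpy as np

def kernel_L(t, b):
    """Product Gaussian kernel L_b(t) = prod_j phi(t_j / b) / b."""
    t = np.atleast_2d(t)
    return np.prod(np.exp(-0.5 * (t / b) ** 2) / (b * np.sqrt(2 * np.pi)), axis=1)

def psi_hat(v, zeta, Y, V, delta, q, b):
    """Kernel estimator of psi_zeta(v):
    sum_i (1 - delta_i) L_b(v - V_i) / sum_i delta_i q(Y_i, zeta) L_b(v - V_i).
    Missing Y_i are multiplied by delta_i = 0, so they never enter."""
    w = kernel_L(v - V, b)
    return np.sum((1 - delta) * w) / np.sum(delta * q(Y, zeta) * w)

def moments(zeta, Y, V, S, delta, q, b, hfun):
    """Moments f_i = h(S_i) * {delta_i (1 + psi_hat(V_i) q(Y_i, zeta)) - 1},
    i.e. h(S)(delta / pi - 1), which have mean zero at the true zeta."""
    Yfill = np.where(delta == 1, Y, 0.0)        # placeholder for missing Y
    psis = np.array([psi_hat(V[i], zeta, Yfill, V, delta, q, b)
                     for i in range(len(Yfill))])
    resid = delta * (1.0 + psis * q(Yfill, zeta)) - 1.0
    return hfun(S) * resid[:, None]             # n x R moment matrix

def two_step_gmm(zeta_grid, Y, V, S, delta, q, b, hfun):
    """Two-step GMM over a scalar grid: step 1 uses the identity weight,
    step 2 reweights by the inverse of the estimated moment covariance."""
    Fn = lambda z: moments(z, Y, V, S, delta, q, b, hfun).mean(axis=0)
    z1 = min(zeta_grid, key=lambda z: Fn(z) @ Fn(z))
    f1 = moments(z1, Y, V, S, delta, q, b, hfun)
    Winv = np.linalg.pinv(f1.T @ f1 / len(V))
    return min(zeta_grid, key=lambda z: Fn(z) @ Winv @ Fn(z))
```

Note the sanity check this construction admits: when no responses are missing, the numerator of psi_hat vanishes, the estimated propensity is one, and every moment function is exactly zero.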
Once modal estimator ˆβ is obtained, model (2.3) can be transformed to
where Y∗i=Yi−g(Zi,ˆβ).
Similarly, the modal estimators of γ are obtained by maximizing the following objective function
Therefore, the estimators of the coefficient functions are given by ˆαk(u)=B(u)Tˆγk, k=1,⋯,p.
3.
Theoretical properties
In this section, we discuss the asymptotic properties of the proposed modal regression estimators for both the parametric and nonparametric components. Denote the true values of β and αk(⋅) in model (2.1) as β0 and α0k(⋅), respectively. Correspondingly, γ0k is the optimal approximation coefficient of γk in (2.2). Let F(x, z, u, h) = \mathrm{E}(\phi_h''(\varepsilon)|{X} = x, {Z} = z, U = u) and G(x, z, u, h) = \mathrm{E}(\phi_h'(\varepsilon)^2|{{X} = x, {Z} = z}, U = u) . To facilitate the presentation, we denote by g'(\mathit{\boldsymbol{z}}, \mathit{\boldsymbol{\beta}}) = \partial g(\mathit{\boldsymbol{z}}, \mathit{\boldsymbol{\beta}}) / \partial \mathit{\boldsymbol{\beta}} the q\times 1 vector of first-order derivatives of g(\mathit{\boldsymbol{z}}, \mathit{\boldsymbol{\beta}}) with respect to \mathit{\boldsymbol{\beta}} .
The assumptions required for the main results are listed below. These assumptions are common and can be easily satisfied.
A1: The random variable U has a bounded support \mathcal{U} . Its density function f(\cdot) is Lipschitz continuous and bounded away from zero and infinity on its support.
A2: The true coefficient functions {\alpha _{0k}}(u), k = 1, \cdots, p are r -times continuously differentiable on the interval (0, 1) with r \ge 2 .
A3: For any \mathit{\boldsymbol{z}} , g(\mathit{\boldsymbol{z}}, \mathit{\boldsymbol{\beta}}) is a continuous function of \mathit{\boldsymbol{\beta}} , and the second derivative of g(\mathit{\boldsymbol{z}}, \mathit{\boldsymbol{\beta}}) with respect to \mathit{\boldsymbol{\beta}} is continuous.
A4: There exists an s > 0 such that \mathrm{E}||\textbf{X}||^{2s} < \infty , \mathrm{E}||g'(\mathit{\boldsymbol{z}}, \mathit{\boldsymbol{\beta}})||^{2s} < \infty .
A5: The bandwidth h satisfies n{h^{2r}} \to \infty and n{h}/\log n \to \infty as n \to \infty .
A6: Suppose that {{\mathit{\boldsymbol{\zeta}}} _0} is the unique solution to E[{f({Y}, \mathit{\boldsymbol{T}}, {\delta}, {\psi _{\mathit{\boldsymbol{\zeta}}} }, \mathit{\boldsymbol{\zeta}}) }] = 0 , \Gamma = E[\partial f({Y}, \mathit{\boldsymbol{T}}, {\delta}, {\psi _{\mathit{\boldsymbol{\zeta}}_0} }, \mathit{\boldsymbol{\zeta_0}})/\partial \mathit{\boldsymbol{\zeta}}] is of full rank and D = E[{ f {{({Y}, \mathit{\boldsymbol{T}}, {\delta}, {\psi _{\mathit{\boldsymbol{\zeta}}_0} }, \mathit{\boldsymbol{\zeta_0}})}^{ \otimes 2}}}] is positive definite.
A7: Let c_1, \ldots, c_K be the interior knots on [0, 1] . Set c_0 = 0, \; c_{K+1} = 1 and h_i = c_i-c_{i-1} . Then, there exists a constant C_0 such that
A8: F(x, z, u, h) and G(x, z, u, h) are continuous with respect to (x, z, u) . In addition, F(x, z, u, h) < 0 for any h > 0 .
A9: \mathrm{E}(\phi_h'(\varepsilon)|{x}, z, u) = 0 and \mathrm{E}(\phi_h''(\varepsilon)^2|{x}, z, u) , \mathrm{E}(\phi_h'(\varepsilon)^3|{x}, z, u) and \mathrm{E}(\phi_h'''(\varepsilon)|{x}, z, u) are continuous with respect to (x, z, u) .
A10: The matrix A and \Sigma defined in Theorem 1 are positive definite.
These assumptions are commonly adopted in the semiparametric literature. Assumptions A1–A4 are generally required for the varying-coefficient partially nonlinear model. Assumption A5 is a standard condition on the kernel bandwidth. Assumption A6 is necessary for proving the asymptotic properties of the two-step GMM estimator \hat{{\mathit{\boldsymbol{\zeta}}}} . Assumption A7 indicates that c_1, \ldots, c_K is a quasi-uniform sequence of partitions on [0, 1] . Assumptions A8 and A9 are commonly used in modal regression. Assumption A10 ensures the existence of the asymptotic variance of the estimator \mathit{\boldsymbol{\hat\beta}} .
The following Theorem 1 gives the asymptotic normality for the parameter estimator \mathit{\boldsymbol{\hat\beta}} .
Theorem 1. Suppose that assumptions A1-A10 hold and the number of interior knots is K = O(n^{1/(2r+1)}) . As n\rightarrow \infty , we have
where C = \Sigma A^{-1}\Sigma ^T , \Sigma = E({F(x, z, u, h)g'(\mathit{\boldsymbol{z}}, \mathit{\boldsymbol{\beta}}){{g'(\mathit{\boldsymbol{z}}, \mathit{\boldsymbol{\beta}})^T}}}) , A = B+H {\rm{\Pi}} H^T , B = E[{{\pi ({Y}, \mathit{\boldsymbol{T}}, {\mathit{\boldsymbol{\zeta}}_0})}}^{-1} G(x, z, u, h)g'(\mathit{\boldsymbol{z}}, \mathit{\boldsymbol{\beta}}){{ g'(\mathit{\boldsymbol{z}}, \mathit{\boldsymbol{\beta}})^T}}] , {\rm{\Pi}} is the asymptotic variance of \hat{\mathit{\boldsymbol{\zeta}}} defined in remark 1, H = \lim\limits_{n\to\infty}\frac{1}{n}\sum_{i = 1}^nE[{{\pi ({Y}, \mathit{\boldsymbol{T}}, {\mathit{\boldsymbol{\zeta}}_0})}}^{-1} \phi_h'(\varepsilon)(\mathit{\boldsymbol{P}}_ig'(\mathit{\boldsymbol{\beta}}))^T\{\partial {{\pi ({Y}, \mathit{\boldsymbol{T}}, {\mathit{\boldsymbol{\zeta}}_0})}}/ \partial {\mathit{\boldsymbol{\zeta}}}\}^T] and \mathit{\boldsymbol{P}}_i is the i th row of \mathit{\boldsymbol{P}} = {\mathit{\boldsymbol{I}}_n} - {\mathit{\boldsymbol{W}}{{({{\mathit{\boldsymbol{W}}^T}\mathit{\boldsymbol{W}}})}^{ - 1}}{\mathit{\boldsymbol{W}}^T}} , g'(\mathit{\boldsymbol{\beta}}) = {(g'({\mathit{\boldsymbol{Z}}_1}, \mathit{\boldsymbol{\beta}}), \cdots, g'({\mathit{\boldsymbol{Z}}_n}, \mathit{\boldsymbol{\beta}}))^T} .
Remark 1. As discussed in [22], \hat {\mathit{\boldsymbol{\zeta}}} is a consistent estimator of {\mathit{\boldsymbol{\zeta}}} , and
where {\rm{\Pi }} = {({{{\rm{\Xi }}^ T }{D^{-1}}{\rm{\Xi }}})^{-1}}{{\rm{\Xi }}^T }{D^{-1}}{\rm{\Omega }}{D^{-1}}{\rm{\Xi }}{({{{\rm{\Xi }}^T }{D^{ - 1}}{\rm{\Xi }}})^{ - 1}} , {\rm{\Xi }} = E\{{(1 - \delta)[{h(\mathit{\boldsymbol{S}})-{m_s}(\mathit{\boldsymbol{V}})}][{\eta ({Y}, {{\mathit{\boldsymbol{\zeta}}}_0}})-{m_\eta}(\mathit{\boldsymbol{V}})}]\} , {m_\eta }(\mathit{\boldsymbol{V}}) = {{E\{ {{\delta \eta ({{Y}, {{\mathit{\boldsymbol{\zeta}}} _0}})q({{Y}, {{\mathit{\boldsymbol{\zeta}}} _0}})} |\mathit{\boldsymbol{V}}} \}}/ {E\{ { {\delta q({{Y}, {{\mathit{\boldsymbol{\zeta}}} _0}})} |\mathit{\boldsymbol{V}}} \}}} , {m_s}(\mathit{\boldsymbol{V}}) = {{E\{ {{\delta h(\mathit{\boldsymbol{S}})q({{Y}, {{\mathit{\boldsymbol{\zeta}}} _0}})} |\mathit{\boldsymbol{V}}} \}} / {E\{ {{\delta q({{Y}, {{\mathit{\boldsymbol{\zeta}}} _0}})} |\mathit{\boldsymbol{V}}} \}}} , \eta ({{Y}, {{\mathit{\boldsymbol{\zeta}}} _0}}) = {\{ {q({{Y}, {{\mathit{\boldsymbol{\zeta}}} _0}})} \}^{ - 1}}\partial q({{Y}, {{\mathit{\boldsymbol{\zeta}}} _0}})/\partial {{\mathit{\boldsymbol{\zeta}}} ^ \top } , {\Omega } is the covariance matrix of {\rm{\Lambda }} = \{ {h(\mathit{\boldsymbol{S}}) - {m_s}(\mathit{\boldsymbol{V}})} \} [{\delta \{ {1 + \psi_{\zeta _0}(\mathit{\boldsymbol{V}})q({{Y}, {{\mathit{\boldsymbol{\zeta}}} _0}})} \} -1}] + {\cal M} , {\cal M} = {({{{\cal M}_1}, \ldots, {{\cal M}_R}})^ \top } , {{\cal M}_r} = vec (\Phi_{r})^{T} \ell(\mathit{\boldsymbol{T}}, {Y}, \delta), r = 1, \ldots, R , \ell(\mathit{\boldsymbol{T}}, {Y}, \delta) is an influence function, R is the dimension of h(\mathit{\boldsymbol{S}}) , \Phi_{r} = E\{ {{{\psi_{\zeta _0}(\mathit{\boldsymbol{V}})[{\nabla {{\cal C}_r}(\mathit{\boldsymbol{V}}){f_v}(\mathit{\boldsymbol{V}}) + {{\cal C}_r }(\mathit{\boldsymbol{V}})\nabla {f_v}(\mathit{\boldsymbol{V}})}]} / {{f_v}(\mathit{\boldsymbol{V}})}}} \} , {{\cal C}_r}(\mathit{\boldsymbol{V}}) = {\mathop{\rm Cov}} ({\mathit{\boldsymbol{T}}, {{\tilde {\cal W}}_r }\mid 
\mathit{\boldsymbol{V}}}), \nabla {{\cal C}_r}(\mathit{\boldsymbol{V}}) = \partial {{\cal C}_r}(\mathit{\boldsymbol{V}}) /\partial {\mathit{\boldsymbol{V}}^ \top }, \nabla {f_v}(\mathit{\boldsymbol{V}}) = \partial {f_v}(\mathit{\boldsymbol{V}})/\partial {\mathit{\boldsymbol{V}}^ \top } , {f_v}(\mathit{\boldsymbol{V}}) is the density of \mathit{\boldsymbol{V}} and {{\tilde {\cal W}}_r} is the r th coordinate of \delta q({{Y}, {{\mathit{\boldsymbol{\zeta}}} _0}})[{h(\mathit{\boldsymbol{S}}) - {m_s}(\mathit{\boldsymbol{V}})}] .
The following Theorem 2 establishes the consistency of the estimators {\hat \alpha _k}(\cdot), k = 1, \cdots, p .
Theorem 2. Suppose that assumptions A1–A10 hold, and the number of interior knots is K = O(n^{1/(2r+1)}) . As n\rightarrow \infty , we have
4.
Bandwidth selection and estimation algorithm
In this section, we discuss the selection of the bandwidth and estimation procedure based on the modal expectation maximization (MEM) algorithm proposed by Li et al. [23].
4.1. Bandwidth selection
To obtain the modal estimators \hat{\mathit{\boldsymbol{\beta}}} and \hat{\mathit{\boldsymbol{\gamma}}} of the parameters \mathit{\boldsymbol{\beta}} and \mathit{\boldsymbol{\gamma}} , it is necessary to select an appropriate bandwidth h . For simplicity, we assume that the error variable is independent of \mathit{\boldsymbol{X}} , \mathit{\boldsymbol{Z}} and U . Following Yao et al. [18], the ratio of the asymptotic variance of our proposed estimator to that of the least squares estimator is given by
where \sigma_1^2 = E(\varepsilon^2) , F(h_1) = \mathrm{E}[\phi_{h_1}''(\varepsilon)] and G(h_1) = \mathrm{E}[\phi_{h_1}'(\varepsilon)^2] . In practice, the error distribution is unknown, so F(h_1) and G(h_1) cannot be obtained directly. A feasible approach is to estimate F(h_1) and G(h_1) by \hat F({h_1}) = \frac{1}{n}\sum\limits_{i = 1}^n {{\phi''_{{h_1}}}\left({{{\hat \varepsilon }_{1i}}} \right)} and \hat G({h_1}) = \frac{1}{n}\sum\limits_{i = 1}^n [\phi'_{{h_1}}(\hat\varepsilon_{1i})]^2 , respectively.
Then, the estimator for R({h_1}) is given by
where {\hat \varepsilon _{1i}} = \hat{\Delta} _i{\mathit{\boldsymbol{P}}_i}[\mathit{\boldsymbol{Y}} - g(\mathit{\boldsymbol{Z}}, {\mathit{\boldsymbol{\hat \beta}} ^{{\mathop{\rm int}} }})] , \hat \sigma _1^2 = \frac{1}{n}\sum\limits_{i = 1}^n \hat\varepsilon_{1i}^2 and \hat{\mathit{\boldsymbol{\beta}}}^{\rm int} is the pilot estimator.
Similarly, to obtain the modal estimator {\hat{\gamma}} , the estimator for the ratio of the asymptotic variance of our proposed estimator to that of the least square estimator is given by
where \hat F({h_2}) = \frac{1}{n}\sum\limits_{i = 1}^n {{\phi''_{{h_2}}}\left({{{\hat \varepsilon }_{2i}}} \right)}, \hat G({h_2}) = \frac{1}{n}\sum\limits_{i = 1}^n [\phi'_{{h_2}}(\hat\varepsilon_{2i})]^2 , {\hat \varepsilon _{2i}} = {{\hat{\Delta }}_i}({{Y_i} - g({{\mathit{\boldsymbol{Z}}_i}, \mathit{\boldsymbol{\hat \beta}} }) - \mathit{\boldsymbol{X}}_i^T{{\mathit{\boldsymbol{\hat \alpha}} }^{{\mathop{\rm int}} }}({{U_i}})}) and \hat \sigma _2^2 = \frac{1}{n}\sum\limits_{i = 1}^n \hat\varepsilon_{2i}^2 , where \hat{\mathit{\boldsymbol{\beta}}} is the modal estimator and { \mathit{\boldsymbol{\hat \alpha}} ^{{\mathop{\rm int}} }}({{\cdot}}) are the pilot estimators.
Furthermore, we use a grid search to choose appropriate bandwidths: {h_{opt1}} and {h_{opt2}} are obtained by minimizing (4.2) and (4.3), respectively. The optimal choices of h_{1} and h_{2} are
Following Yao et al. [18], the possible grid points for bandwidth can be {h_1} = 0.5{\hat \sigma _1} \times {1.02^j} , {h_2} = 0.5{\hat \sigma _2} \times {1.02^j}, j = 0, 1, \cdots, 100 .
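A hypothetical implementation of this grid search could look as follows (our own notation: we write the estimated variance ratio as \hat R(h) = \hat G(h)/(\hat\sigma^2 \hat F(h)^2) and skip grid points violating \hat F(h) < 0 , cf. assumption A8).

```python
import numpy as np

def phi_d1(t, h):
    """First derivative of the Gaussian kernel phi_h(t) = phi(t / h) / h."""
    base = np.exp(-0.5 * (t / h) ** 2) / (h * np.sqrt(2 * np.pi))
    return -t / h ** 2 * base

def phi_d2(t, h):
    """Second derivative of phi_h."""
    base = np.exp(-0.5 * (t / h) ** 2) / (h * np.sqrt(2 * np.pi))
    return (t ** 2 / h ** 2 - 1.0) / h ** 2 * base

def select_bandwidth(resid, sigma2_hat, n_grid=101):
    """Grid search over h = 0.5 * sigma_hat * 1.02**j, j = 0, ..., n_grid - 1,
    minimizing R_hat(h) = G_hat(h) / (sigma2_hat * F_hat(h)**2), the estimated
    ratio of the asymptotic variances (modal versus least squares)."""
    sig = np.sqrt(sigma2_hat)
    best_h, best_r = None, np.inf
    for j in range(n_grid):
        h = 0.5 * sig * 1.02 ** j
        F = np.mean(phi_d2(resid, h))
        G = np.mean(phi_d1(resid, h) ** 2)
        if F >= 0:                 # F(h) < 0 is required for a valid ratio
            continue
        r = G / (sigma2_hat * F ** 2)
        if r < best_r:
            best_h, best_r = h, r
    return best_h, best_r
```

In practice `resid` would be the pilot residuals \hat\varepsilon_{1i} (or \hat\varepsilon_{2i} ) and `sigma2_hat` the corresponding \hat\sigma_1^2 (or \hat\sigma_2^2 ).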
4.2. The MEM algorithm of parametric components
In this subsection, we adopt the MEM algorithm of Li et al. [23] to obtain the estimator of the parameter \mathit{\boldsymbol{\beta}} by maximizing (2.13). Let \mathit{\boldsymbol{\beta}}^{(0)} be the initial estimator of \mathit{\boldsymbol{\beta}} and set m = 0 .
\textbf{Step 1 (E-step):} We update \pi(i|\mathit{\boldsymbol{\beta}}^{(m)}) by
\textbf{Step 2 (M-step):} We update \mathit{\boldsymbol{\beta}}^{(m+1)} by
where \mathit{\boldsymbol{\hat{Y}}} = (\hat{Y}_{1}, \hat{Y}_{2}, \ldots, \hat{Y}_{n})^{T} , {\hat Y_i} = {{\hat{\Delta }}_i}{\mathit{\boldsymbol{P}}_i} [{\mathit{\boldsymbol{Y}} - g({\mathit{\boldsymbol{Z}}, {{\mathit{\boldsymbol{\beta}}} ^{(m)}}}) + {g^\prime }({\mathit{\boldsymbol{Z}}, {{\mathit{\boldsymbol{\beta}}} ^{(m)}}}){{\mathit{\boldsymbol{\beta}}} ^{(m)}}}] , \mathit{\boldsymbol{G}} = {({{\hat\Delta _1}{\mathit{\boldsymbol{P}}_1}{g^\prime }({\mathit{\boldsymbol{Z}}, {{\mathit{\boldsymbol{\beta}}} ^{(m)}}}), {\hat\Delta _2}{\mathit{\boldsymbol{P}}_2}{g^\prime }({\mathit{\boldsymbol{Z}}, {{\mathit{\boldsymbol{\beta}}} ^{(m)}}}), \cdots, {\hat\Delta _n}{\mathit{\boldsymbol{P}}_n}{g^\prime }({\mathit{\boldsymbol{Z}}, {{\mathit{\boldsymbol{\beta}}} ^{(m)}}})})^T} , \mathit{\boldsymbol{D}} is n\times n diagonal matrix with \mathit{\boldsymbol{D}} = \mathrm{diag} ({\pi ({\left. 1 \right|{{\mathit{\boldsymbol{\beta}}} ^{(m)}}}), \pi ({\left. 2 \right|{{\mathit{\boldsymbol{\beta}}} ^{(m)}}}), \cdots, \pi ({\left. n \right|{{\mathit{\boldsymbol{\beta}}} ^{(m)}}})}) .
\textbf{Step 3:} Let m = m + 1 and iterate the E-step and M-step until the algorithm converges. Denote the final estimator of \mathit{\boldsymbol{\beta}} as \mathit{\boldsymbol{\hat{\beta}}} .
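The E/M iteration above can be sketched for the simplest reduction in which g is linear, so the weighted M-step is exact weighted least squares; this is an illustrative simplification of Section 4.2, not the paper's full algorithm (which linearizes a general g at \mathit{\boldsymbol{\beta}}^{(m)} ). A known property of the MEM iteration is that the modal objective never decreases between iterations.

```python
import numpy as np

def mem_modal(X, y, w, h, beta0, n_iter=50, tol=1e-8):
    """MEM algorithm for weighted linear modal regression.
    E-step: pi_i proportional to w_i * phi_h(y_i - x_i' beta);
    M-step: weighted least squares with weights pi_i."""
    beta = np.asarray(beta0, dtype=float)

    def phi_h(t):
        return np.exp(-0.5 * (t / h) ** 2) / (h * np.sqrt(2 * np.pi))

    def Q(b):                              # modal objective sum w_i phi_h(r_i)
        return np.sum(w * phi_h(y - X @ b))

    history = [Q(beta)]
    for _ in range(n_iter):
        r = y - X @ beta
        pi = w * phi_h(r)
        pi = pi / pi.sum()                 # E-step weights
        D = np.diag(pi)
        beta_new = np.linalg.solve(X.T @ D @ X, X.T @ D @ y)   # M-step (WLS)
        history.append(Q(beta_new))
        if np.linalg.norm(beta_new - beta) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta, history
```

Because the Gaussian-kernel weights phi_h of gross outliers are essentially zero, the weighted least squares step effectively discards them, which is the source of the robustness discussed in Section 1.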
4.3. The MEM algorithm of nonparametric components
Based on the MEM algorithm, we can obtain the estimator of \mathit{\boldsymbol{\gamma}} by maximizing (2.15). Let \mathit{\boldsymbol{\gamma}}^{(0)} be the initial estimator of \mathit{\boldsymbol{\gamma}} and set m = 0 .
\textbf{Step 1 (E-step):} We update \pi(i|\mathit{\boldsymbol{\gamma}}^{(m)}) by
\textbf{Step 2 (M-step):} We update \mathit{\boldsymbol{\gamma}}^{(m+1)} by
where \tilde {\mathit{\boldsymbol{Y}}} = {{\mathit{\boldsymbol{\Delta}} }}{\mathit{\boldsymbol{Y}}^ * } , \tilde {\mathit{\boldsymbol{W}}} = {{\mathit{\boldsymbol{\Delta}} }}\mathit{\boldsymbol{W}} , \tilde {\mathit{\boldsymbol{D}}} = \mathrm{diag}({\pi ({\left. 1 \right|{{\mathit{\boldsymbol{\gamma}}} ^{(m)}}}), \cdots, \pi ({\left. n \right|{{\mathit{\boldsymbol{\gamma}}} ^{(m)}}})}) and {{\mathit{\boldsymbol{\Delta}} }} = \mathrm{diag}({{{\hat{{\Delta} }}_1}, \cdots, {{\hat{{\Delta} }}_n}}) .
\textbf{Step 3:} Let m = m + 1 and iterate the E-step and M-step until the algorithm converges. Denote the final estimator of \mathit{\boldsymbol{\gamma}} as \mathit{\boldsymbol{\hat{\gamma}}} . The corresponding estimator of nonparametric function is {\hat \alpha _k}(u) = \mathit{\boldsymbol{B}}{(u)^T}{\hat {\mathit{\boldsymbol{\gamma}}} _k}, k = 1, \cdots, p .
5.
Simulation studies
In this section, we conduct some simulations to evaluate the finite sample performances of the proposed estimation procedures. We generate the data from the varying-coefficient partially nonlinear model described as follows:
where the nonlinear function g(\mathit{\boldsymbol{Z}}, \mathit{\boldsymbol{\beta}}) = \exp \left({{Z_1}{\beta _1} + {Z_2}{\beta _2}} \right) with parameter vector \mathit{\boldsymbol{\beta}} = {\left({{\beta _1}, {\beta _2}} \right)^T} = {\left({1, 1.5} \right)^T} , and the coefficient functions are \mathit{\boldsymbol{\alpha}} (U) = {\left({{\alpha _1}(U), {\alpha _2}(U)} \right)^T} with {\alpha _1}(U) = \sin (2\pi U) and {\alpha _2}(U) = 3.5\{\exp [{-{{(4U-1)}^2}}] + \exp [{-{{(4U-3)}^2}}] - 1.5\}. The variable U is generated from the uniform distribution U(0, 1) . The variable \mathit{\boldsymbol{X}} = {\left({{X_1}, {X_2}} \right)^T} follows N\left({\bf{0}}, \Sigma \right) with {\Sigma _{ij}} = {0.5^{\left| {\left. {i - j} \right|} \right.}}, 1 \le i, j \le 2 , and \mathit{\boldsymbol{Z}} = {\left({{Z_1}, {Z_2}} \right)^T} is generated in the same way as \mathit{\boldsymbol{X}} . To illustrate the robustness of the proposed method, the following three distributions of the model error are considered: (1) the normal distribution, \varepsilon\sim N(0, 1) ; (2) the t distribution, \varepsilon\sim t(3) ; (3) the mixed normal distribution (MN), \varepsilon\sim 0.9N(0, 1^{2})+0.1N(0, 9^{2}) . The sample sizes n are 200 and 400, respectively, and the simulations are based on 500 replications. Furthermore, we generate {\delta _i} from the Bernoulli distribution with probability \pi ({Y_i}, {\mathit{\boldsymbol{V}}_i}) = \{{1 + \psi ({\mathit{\boldsymbol{V}}_i})\exp \{ \zeta {Y_i}\} }\}^{-1} and consider four choices of the function \psi (\cdot) and parameter \zeta :
Case1. \psi \left({{\mathit{\boldsymbol{V}}_i}} \right) = \exp \left\{ { - 0.1 - 1.5{X_1}_i - 1.5{Z_{1i}} - 1.5{U_i}} \right\} and \zeta = 0 ;
Case2. \psi \left({{\mathit{\boldsymbol{V}}_i}} \right) = \exp \left\{ { - 0.1 + 0.5{X_1}_i + 0.5{Z_{1i}} + 0.5{U_i}} \right\} and \zeta = - 0.8 ;
Case3. \psi \left({{\mathit{\boldsymbol{V}}_i}} \right) = \exp \left\{ { - 0.1 + 0.5\sin ({X_{1i}}) + 0.5\sin ({Z_{1i}}) + 0.5\sin ({U_i})} \right\} and \zeta = - 0.8 ;
Case4. \psi \left({{\mathit{\boldsymbol{V}}_i}} \right) = \exp \left\{ {0.6 - 0.3\exp ({X_1}_i) - 0.3\exp ({Z_{1i}}) - 0.3\exp ({U_i})} \right\} and \zeta = - 0.1 .
While Case 1 is an ignorable missing mechanism, Cases 2–4 represent three different kinds of nonignorable missing mechanisms. The coefficients in the above four probability models are chosen so that the average proportions of missing data lie between 30 % and 40 % . It can be seen that {\mathit{\boldsymbol{V}}_i} = ({{X_{1i}}, {Z_{1i}}, {U_i}})^{T} and the instrumental variable is {\mathit{\boldsymbol{S}}_i} = ({{X_{2i}}, {Z_{2i}}})^{T} . As in Shao and Wang [17], we select the Gaussian kernel function L(\cdot) = \frac{1}{\sqrt{2\pi}}{\rm exp}({-(\cdot)^{2}}/{2}) to compute the nonparametric kernel estimator with L\left({{X_{1i}}, {Z_{1i}}, {U_i}} \right) = L\left({{X_{1i}}} \right)L\left({{Z_{1i}}} \right)L\left({{U_i}} \right) , and choose the bandwidths {b_1} = 1.5{\hat \sigma _{{X_1}}}{n^{{{ - 1} / 3}}} , {b_2} = 1.5{\hat \sigma _{{Z_1}}}{n^{{{ - 1} / 3}}} and {b_3} = 1.5{\hat \sigma _U}{n^{{{ - 1} / 3}}} , where {\hat \sigma _{{X_1}}}, {\hat \sigma _{{Z_1}}}, {\hat \sigma _U} are the standard deviations of \left\{ {{X_{1i}}} \right\}, \left\{ {{Z_{1i}}} \right\}, \left\{ {{U_i}} \right\}, i = 1, 2, \cdots, n .
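For concreteness, the data-generating mechanism of Case 2 can be reproduced as follows. This is a sketch under our reading of the design, assuming the response follows model (1.1) with Y = \mathit{\boldsymbol{X}}^T\mathit{\boldsymbol{\alpha}}(U) + g(\mathit{\boldsymbol{Z}}, \mathit{\boldsymbol{\beta}}) + \varepsilon and normal errors; the function name is ours.

```python
import numpy as np

def simulate_case2(n, rng):
    """Generate (Y, X, Z, U, delta) following the Section 5 design, Case 2:
    pi(Y_i, V_i) = 1 / (1 + psi(V_i) * exp(zeta * Y_i)) with
    psi(V) = exp(-0.1 + 0.5 X1 + 0.5 Z1 + 0.5 U) and zeta = -0.8."""
    U = rng.uniform(size=n)
    cov = np.array([[1.0, 0.5], [0.5, 1.0]])       # Sigma_ij = 0.5**|i-j|
    X = rng.multivariate_normal(np.zeros(2), cov, size=n)
    Z = rng.multivariate_normal(np.zeros(2), cov, size=n)
    beta = np.array([1.0, 1.5])
    eps = rng.standard_normal(n)                   # normal error scenario
    alpha1 = np.sin(2 * np.pi * U)
    alpha2 = 3.5 * (np.exp(-(4 * U - 1) ** 2) + np.exp(-(4 * U - 3) ** 2) - 1.5)
    Y = X[:, 0] * alpha1 + X[:, 1] * alpha2 + np.exp(Z @ beta) + eps
    psi = np.exp(-0.1 + 0.5 * X[:, 0] + 0.5 * Z[:, 0] + 0.5 * U)
    pi = 1.0 / (1.0 + psi * np.exp(-0.8 * Y))      # nonignorable propensity
    delta = (rng.uniform(size=n) < pi).astype(int) # response indicator
    return Y, X, Z, U, delta, pi
```

Here V_i = (X_{1i}, Z_{1i}, U_i)^T enters the propensity while the instrumental variable S_i = (X_{2i}, Z_{2i})^T does not, matching the decomposition of Subsection 2.2.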
To evaluate the efficiencies of the proposed estimation procedure, four estimators of \mathit{\boldsymbol{\beta}} are considered as follows:
(ⅰ) The proposed modal regression (MR) estimator is based on (2.13), which is denoted as MNAR-MR.
(ⅱ) The MR estimator is based on (2.13) in which \hat\pi is estimated with MAR assumption, which is denoted as MAR-MR.
(ⅲ) The MR estimator is based on complete-case (CC) data, which is denoted as CC-MR.
(ⅳ) The MR estimator is based on full sample data, which is denoted as FULL-MR.
Further, in order to assess the robustness of the proposed procedure, we also compare with the profile least squares (PLS) method in four scenarios: (ⅰ) the PLS estimator based on (2.13), denoted as MNAR-PLS; (ⅱ) the PLS estimator based on (2.13) in which \hat\pi is estimated under the MAR assumption, denoted as MAR-PLS; (ⅲ) the PLS estimator based on complete-case data, denoted as CC-PLS; and (ⅳ) the PLS estimator based on full sample data, denoted as FULL-PLS. The absolute bias (Bias) and standard deviation (SD) for the proposed MR method and the PLS method are reported in Tables 1–4, respectively.
The simulation results in Tables 1–4 lead to the following conclusions: (1) According to Tables 1 and 2, the proposed MNAR-MR estimation procedure performs better than CC-MR and MAR-MR, giving smaller Bias and SD in most of the considered scenarios, even under the ignorable missing Case 1. The CC-MR method uses only the fully observed samples and ignores the information in the covariates corresponding to the missing responses, which leads to biased estimators. As expected, the MAR-MR estimators perform well under Case 1, where the missing mechanism is ignorable, but have large biases under Cases 2–4 with nonignorable missing mechanisms. (2) Comparing the proposed MR procedure with the PLS method, the Bias and SD of the estimators obtained by the PLS method are smaller than those of the proposed MR procedure when the error distribution is normal. However, when the model error follows a t distribution or a mixed normal distribution, the estimators obtained by the proposed MR procedure are more accurate than those obtained by the PLS method. This indicates that the proposed MR estimation procedure is less sensitive to heavy-tailed error distributions or distributions containing outliers. (3) As the sample size increases, the performance of all estimation procedures improves in most of the considered scenarios.
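The robustness noted in conclusion (2) can be illustrated with a generic example, not the paper's estimator: a simple linear modal regression fitted by the usual iteratively reweighted least-squares (modal EM) scheme, in which the Gaussian kernel weight exp(-r^2/(2h^2)) drives the weights of outlying residuals toward zero. The data, the bandwidth h and all names here are illustrative:

```python
import numpy as np

def modal_linear_fit(x, y, h=0.5, n_iter=20):
    """Modal linear regression via iteratively reweighted least squares:
    maximize sum_i phi_h(y_i - a - b*x_i) with a Gaussian kernel phi_h."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # ordinary least squares start
    for _ in range(n_iter):
        r = y - X @ beta
        w = np.exp(-r**2 / (2 * h**2))             # kernel weights: gross outliers get ~0
        W = np.diag(w)
        beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return beta

# The majority of points follow y = 1 + 2x exactly; four gross outliers are added.
x = np.linspace(0, 1, 24)
y = 1 + 2 * x
y[:4] += 8                                         # contaminate 4 observations
beta = modal_linear_fit(x, y)
```

The least-squares start is strongly biased by the contaminated points, but the reweighting iterations recover the intercept and slope of the uncontaminated majority, which is the behavior the tables attribute to the MR estimators under contaminated errors.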
Furthermore, to evaluate the performance of the proposed estimation procedure for the nonparametric functions, we present the estimated curves of {\hat \alpha _1}(\cdot) and {\hat \alpha _2}(\cdot) obtained with the different methods. Specifically, Figure 1 shows the estimated curves of {\hat \alpha _1}(\cdot) for the proposed MR procedure and the PLS method when the sample size n is 200 and the missing mechanism is Case 2. Similarly, Figure 2 displays the estimated curves of {\hat \alpha _2}(\cdot) . Meanwhile, we consider the root of average square error (RASE) of the estimates of the nonparametric functions, where RASE = \sqrt{\frac{1}{M}\sum_{j = 1}^{2}\sum_{i = 1}^{M}[\hat{\alpha}_{j}(u_{i})-\alpha_{j}(u_{i})]^{2}} , M = 200 , and the u_{i} are equally spaced points on the interval (0, 1) . Figure 3 shows the boxplots of the RASE for the four estimators based on the proposed MR procedure when the sample size n is 200 and 400, with the model error \varepsilon\sim t(3) .
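The RASE criterion above translates directly into code. A minimal sketch, with placeholder coefficient functions standing in for the \alpha_j of the simulation design defined earlier in the section:

```python
import numpy as np

def rase(alpha_hat, alpha_true, M=200):
    """Root of average square error over M equally spaced grid points in (0, 1):
    RASE = sqrt((1/M) * sum_j sum_i [alpha_hat_j(u_i) - alpha_j(u_i)]^2)."""
    u = np.linspace(0, 1, M + 2)[1:-1]             # M interior equally spaced points
    sq = sum(np.sum((aj_hat(u) - aj(u)) ** 2)
             for aj_hat, aj in zip(alpha_hat, alpha_true))
    return np.sqrt(sq / M)

# Placeholder example: true functions and slightly perturbed "estimates".
alpha_true = [np.sin, np.cos]
alpha_hat = [lambda u: np.sin(u) + 0.01, lambda u: np.cos(u) - 0.01]
```

With a constant offset of 0.01 on each of the two functions, each grid point contributes 2 x 10^{-4} to the average, so the RASE equals 0.01*sqrt(2).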
According to Figures 1–3, a few conclusions can be drawn: (1) The curve-fitting results of MNAR-MR are more effective than those of MNAR-PLS, being much closer to the true curve, regardless of whether the model error follows a normal or a mixed normal distribution. This indicates that the advantage of modal regression is apparent when the data contain outliers. (2) It can be seen from Figure 3 that the RASE value of MAR-MR is closest to that of FULL-MR under Case 1, while it has a large bias under the nonignorable missing mechanisms. In contrast, the RASE value of MNAR-MR is closest to that of FULL-MR under the nonignorable missing mechanisms, which is evidence that the proposed procedure also performs well for the nonparametric part. (3) The RASE values of the four estimators based on the proposed MR procedure decrease as the sample size increases.
6.
A real data analysis
In this section, we illustrate the proposed method using the HIV-CD4 data, which were collected on 2139 HIV-positive patients enrolled in AIDS Clinical Trials Group Protocol 175 (ACTG 175) (Hammer et al. [24]). In this study, the patients were randomized into four groups receiving their respective antiretroviral treatment regimens: (1) zidovudine (ZDV) with 532 subjects, (2) didanosine (ddI) with 522 subjects, (3) ZDV + ddI with 524 subjects and (4) ZDV + zalcitabine with 561 subjects. Let the response Y be the CD4 count at 96 \pm 5 weeks under each antiretroviral treatment regimen. There are six continuous baseline covariates: age ( U ), weight ( Z_{1} ), CD4 cell counts at baseline and at 20 \pm 5 weeks ( Z_{2} and Z_{3} ) and CD8 cell counts at baseline and at 20 \pm 5 weeks ( Z_{4} and Z_{5} ). Due to adverse events, death, dropout and other reasons, Y has missing values, but the covariates are fully observed. Specifically, the proportions of missing data in Y under the four regimens are about 39.66%, 36.21%, 35.69% and 37.43%, respectively.
Typically, a sharp decline in the CD4 cell count is indicative of disease progression, and patients with low CD4 cell counts are more likely to drop out of the scheduled study visits than patients with normal counts (Yuan and Yin [25]). Therefore, the missingness of the CD4 count Y at 96 \pm 5 weeks is likely related to the CD4 count itself, and Y has nonignorable missing values. Following Zhang and Wang [26], given the six baseline covariates and the CD4 count at 96 \pm 5 weeks, the missing data propensity does not depend on age and weight. Accordingly, to apply the proposed method, we use age and weight as the instrument variable \mathit{\boldsymbol{S}} .
The scatterplot in Figure 4 indicates that there is no obvious linear relationship between Y and the covariates. According to [24], it is generally believed that the influence of CD4 cell counts and CD8 cell counts on Y may not be immediate. There is a lagging effect, which is similar to the impact of acceleration on displacement. Based on the above discussions, we establish the following varying-coefficient partially nonlinear model:
The estimators of the parameters based on the MNAR-MR procedure and the fitted curves for the nonparametric function are given in Table 5 and Figure 5 (from left to right, top to bottom: regimen (1) to regimen (4)).
It can be seen that: (1) For all the regimens, the estimated coefficients of Z_{4} and Z_{5} are close to zero, while the estimated coefficients of Z_{2} and Z_{3} are relatively large. This may indicate that the CD8 cell counts at baseline and at 20 \pm 5 weeks have no significant impact on the CD4 count at 96 \pm 5 weeks, whereas the CD4 cell counts at baseline and at 20 \pm 5 weeks appear to affect it. (2) Among the four regimens, the coefficients of Z_{2} and Z_{3} under regimen (3) are greater than those under the other regimens, which may indicate that regimen (3) gives smaller relative hazard ratios than the other regimens. (3) The estimated curves for the nonparametric function differ under the MNAR, MAR and CC assumptions, which suggests that the assumption of a nonignorable missing propensity is reasonable, and also that the effect of age on the response differs across the four scenarios. In future studies, the methods of distributed fault estimation based on the fault estimation observer for multi-agent systems [27,28,29] may be applied to improve the estimation performance.
7.
Conclusions
This article studied statistical inference for the varying-coefficient partially nonlinear model with a nonignorable missing response. A robust estimation procedure was proposed to estimate the parametric and nonparametric parts separately based on modal regression. To address the identifiability of the nonresponse propensity, we considered a semiparametric propensity model and applied the GMM approach, making use of an instrument variable to identify the unknown parameters in the propensity. Further, the coefficient functions were approximated by B-spline functions to simplify the model structure, and the inverse probability weighting technique and modal regression were applied to construct the efficient MNAR-MR estimators. Compared to mean regression, modal regression is robust against outliers and heavy-tailed error distributions, and it performs no worse than the least-squares-based method under a normal error distribution. Compared with the complete-case procedure, in which subjects with missing data are excluded, the inverse probability weighting technique improves the estimation efficiency. Under some mild conditions, the asymptotic properties of the proposed procedure were established. Simulation studies and a real example were carried out to demonstrate that the proposed procedure performs well in finite samples. In our assumptions, the nonlinear function g(.) is pre-specified. A more flexible model would allow g(.) to be unknown, which would be an interesting topic for future research.
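The inverse probability weighting idea summarized above can be illustrated by a generic Horvitz–Thompson-style example, not the paper's full estimator: when the response propensity depends on the outcome itself, the complete-case average is biased, while weighting each observed response by the inverse of its propensity restores an unbiased estimate of the mean. For illustration the propensity is treated as known; in the paper it is estimated via GMM with an instrument variable:

```python
import numpy as np

# Simulated outcome with true mean 2; the response propensity pi_i depends on
# the outcome itself, so the missingness is nonignorable.
rng = np.random.default_rng(1)
n = 5000
y = rng.normal(loc=2.0, size=n)
pi = 1.0 / (1.0 + np.exp(-(1.0 + y)))      # P(observed | y): larger y, more likely observed
delta = rng.uniform(size=n) < pi           # delta_i = 1 if Y_i is observed

cc_mean = y[delta].mean()                  # complete-case mean (biased upward here)
ipw_mean = np.mean(delta * y / pi)         # inverse probability weighted mean
```

Since E[delta_i | y_i] = pi_i, each weighted term delta_i * y_i / pi_i has expectation E[y_i], which is why the IPW average targets the true mean even under nonignorable missingness, provided the propensity model is correct.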
Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
Acknowledgments
This work is supported by the Natural Science Basic Research Plan in Shaanxi Province of China (2022JM-027).
Conflict of interest
The authors declare no conflict of interest.
Appendix: Proofs of theorems
To facilitate the presentation, we denote by
then the modal estimator \mathit{\boldsymbol{\hat \beta}} satisfies \sum_{i = 1}^n \eta_i(\hat{\mathit{\boldsymbol{\beta}}}) = 0 , which is obtained by maximizing (2.13).
Lemma 1. Suppose Assumptions (A1)–(A10) hold. As n\rightarrow \infty , we have
Proof. Let {R_j}(U) = {\alpha _j}(U) - \mathit{\boldsymbol{B}}^T{(U)}{\mathit{\boldsymbol{\gamma}}_{0j}} , \mathit{\boldsymbol{R}}(U) = {({R_1}(U), {R_2}(U), \cdots, {R_p}(U))^T} . Model (2.4) can be reexpressed as
For fixed \mathit{\boldsymbol{\beta}} , model (7.2) becomes
Notice that
By simple calculation,
For I_{1i} , by (7.4) and the fact that \left\| \mathit{\boldsymbol{R}}(U) \right\| = O({K^{-r}}) , it can be shown that E(I_{1i}) = 0 and \mathrm{Var}(I_{1i}) = E(I_{1i}^2) = E [{{\pi}({Y_i}, {\mathit{\boldsymbol{T}}_i}, {\mathit{\boldsymbol{\zeta}}_0})^{-1}} {\phi }'_h(\varepsilon_i)^2 (P_ig'(\mathit{\boldsymbol{\beta}}_0))^T P_ig'(\mathit{\boldsymbol{\beta}}_0)] .
Therefore,
is obtained, where B = E[{{\pi ({Y}, \mathit{\boldsymbol{T}}, {\mathit{\boldsymbol{\zeta}}_0})}}^{-1} G(x, z, u, h)g'(\mathit{\boldsymbol{z}}, \mathit{\boldsymbol{\beta}}){{ g'(\mathit{\boldsymbol{z}}, \mathit{\boldsymbol{\beta}})^T}}].
In terms of I_{2i} , noticing that
it is easy to show that
where H is defined in Theorem 1.
According to Wang et al. [22], we have \hat{\mathit{\boldsymbol{\zeta}}}-\mathit{\boldsymbol{\zeta}}_0 = O_p(n^{-1/2}) and n^{1/2}(\hat{\mathit{\boldsymbol{\zeta}}}-\mathit{\boldsymbol{\zeta}}_0)\stackrel{d}\longrightarrow N(0, \Pi) . Then, we have
By (7.5), (7.6) and (7.8), the result is proved. □
Lemma 2. Suppose Assumptions (A1)–(A10) hold. As n\rightarrow \infty , we have
Proof.
it is easy to obtain E(T_1) = E\Big({\phi }''_h(\varepsilon_i)(\mathit{\boldsymbol{P}}_ig'(\mathit{\boldsymbol{\beta}}_0))^T \mathit{\boldsymbol{P}}_ig'(\mathit{\boldsymbol{\beta}}_0)\Big). By the law of large numbers, we have T_1\stackrel{p}\longrightarrow \Sigma , where \Sigma = E\left({F(x, z, u, h)g'(\mathit{\boldsymbol{z}}, \mathit{\boldsymbol{\beta}}){{g'(\mathit{\boldsymbol{z}}, \mathit{\boldsymbol{\beta}})^T}}} \right) .
For T_2 , applying the Taylor expansion, we get
By the law of large numbers, we have
As \hat{\mathit{\boldsymbol{\zeta}}} is a consistent estimator of \mathit{\boldsymbol{\zeta}}_0 , we conclude that T_2 = o_p(1) . This yields the result that
□
Proof of Theorem 1. Since the modal estimator \mathit{\boldsymbol{\hat {{\beta}}}} satisfies \sum_{i = 1}^n {{\eta _i}(\hat{\mathit{\boldsymbol{\beta}}})} = 0 , by the Taylor expansion, we have
From Lemmas 1, 2 and (7.12), we have \sqrt n (\hat {\mathit{\boldsymbol{\beta}}} - \mathit{\boldsymbol{\beta_0}})\stackrel{d}\longrightarrow N(0, {\rm{ }}{\Sigma ^{ - 1}} A {\Sigma ^{ - 1}}) . □
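The sandwich-form asymptotic covariance {\Sigma ^{ - 1}} A {\Sigma ^{ - 1}} in Theorem 1 is typically estimated by plugging in sample analogues. A generic numerical sketch, with hand-picked score vectors and a Hessian standing in for estimates of A and \Sigma (the names and toy values are illustrative, not the paper's quantities):

```python
import numpy as np

def sandwich_cov(scores, hessian):
    """Plug-in sandwich covariance Sigma^{-1} A Sigma^{-1}:
    A is the average outer product of the score vectors s_i,
    Sigma is (an estimate of) the expected negative Hessian of the objective."""
    n = scores.shape[0]
    A = scores.T @ scores / n           # A = (1/n) sum_i s_i s_i^T
    Sigma_inv = np.linalg.inv(hessian)
    return Sigma_inv @ A @ Sigma_inv

# Toy check: four hand-picked 2-d scores and an identity "Hessian",
# so A = 0.5 * I and the sandwich covariance equals 0.5 * I.
scores = np.array([[1., 0.], [-1., 0.], [0., 1.], [0., -1.]])
cov = sandwich_cov(scores, np.eye(2))
```

When the objective is correctly specified so that A = \Sigma , the sandwich collapses to \Sigma^{-1} ; the general form is what the theorem's heteroscedastic setting requires.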
Proof of Theorem 2. Let {\delta _n} = {n^{ - r/(2r + 1)}} and v = {\rm{ }}{(v_1^T, \cdots, {\rm{ }}v_p^T)^T} of dimension pL . Define \mathit{\boldsymbol{\gamma}} = {{\mathit{\boldsymbol{\gamma}}} _0} + {\delta _n}v .
We show that, for any given \varepsilon > 0 , there exists a large enough constant C such that
where \hat{Q}(\mathit{\boldsymbol{\gamma}}) is defined in (2.15). Let \varphi (\mathit{\boldsymbol{\gamma}}) = \hat Q(\mathit{\boldsymbol{\gamma}}) - \hat Q({\mathit{\boldsymbol{\gamma}} _0}) . By Taylor expansion we have
where {\xi _i} lies between Y_i^* - \mathit{\boldsymbol{W}}_i^T{\mathit{\boldsymbol{\gamma}} _0} and Y_i^* - \mathit{\boldsymbol{W}}_i^T\mathit{\boldsymbol{\gamma}} . Then, for {I_1} , using the Taylor expansion, we obtain that
where {\zeta _i} lies between {\varepsilon _i} and {\varepsilon _i} + \mathit{\boldsymbol{X}}_i^TR({U_i}) + g'({\mathit{\boldsymbol{Z}}_i}, \mathit{\boldsymbol{\beta}})(\mathit{\boldsymbol{\beta}} - \hat {\mathit{\boldsymbol{\beta}}}) .
By Assumption (A6), (A8) and some calculation results, we have {I_1} = {O_p}(n{K^{ - r}}{\delta _n}\left\| v \right\|) = {O_p}(n\delta _n^2\left\| v \right\|) .
Similarly, we can prove that {I_2} = {O_p}(n\delta _n^2{\left\| v \right\|^2}) and {I_3} = {O_p}(n\delta _n^3{\left\| v \right\|^3}) . Hence, by choosing a sufficiently large C , {I_2} dominates {I_1} uniformly on \left\| v \right\| = C . Since {\delta _n} \to 0 , it follows that {\delta _n}\left\| v \right\| \to 0 with \left\| v \right\| = C , which leads to {I_3} = {O_p}({I_2}) . Hence, (7.13) holds. Therefore, there exists a local maximizer \hat {\mathit{\boldsymbol{\gamma}}} , and we have
Similar to the proof of Theorem 2.1(b) in [30], we can obtain that
This completes the proof of Theorem 2. □