1. Introduction
In practical applications, collected data are often incomplete owing to the interference of various factors. Missing data are common in public opinion polls, medical research, experimental science, and other application fields. Missing data not only reduce the amount of effective information and bias the estimation results, but also affect statistical decision-making and distort the analysis to some extent. One approach to dealing with missing data is complete-case analysis, which deletes all incomplete observations. However, Little and Rubin [1] pointed out that this causes biased estimation when the data are not missing completely at random. Yates [2] introduced an imputation method that is widely used to handle missing responses: suitable values are found to fill in the missing data, and the imputed data set is then treated as completely observed and analyzed by classical methods. Inverse probability weighting (IPW), proposed by Horvitz and Thompson [3], is another method for dealing with missing data: the inverse of the selection probability is used as the weight assigned to the fully observed data. The missing at random (MAR) assumption, in the sense of Rubin et al. [4], is a common assumption for statistical analysis with missing data.
When data are missing, the missing-data mechanism is usually unknown, and parametric or nonparametric methods are commonly used to estimate it. Parametric methods, however, may suffer from model misspecification. Imai and Ratkovic [5] proposed the covariate balancing propensity score (CBPS), which makes the parametric approach more robust. Based on this idea, Guo et al. [6] applied the CBPS method to mean regression to obtain estimators of the regression parameter β and the mean μ in the presence of missing data.
Expectile regression, proposed by Newey and Powell [7], can be regarded as a generalization of mean regression. Expectile regression uses the sum of asymmetrically weighted squared residuals as its loss function; since this loss is convex and differentiable, expectile regression has computational advantages over quantile regression. Expectile regression has received considerable attention in recent years. Sobotka et al. [8] established the asymptotic properties of a semi-parametric expectile regression estimator and introduced confidence intervals for expectiles. Waltrup et al. [9] observed that expectile regression tends to exhibit less crossing and more robustness against heavy-tailed distributions than quantile regression. Ziegel [10] showed that expectiles possess both coherence and elicitability. Pan et al. [11] considered fitting a linear expectile regression model for estimating conditional expectiles based on a large quantity of data with covariates missing at random. Recently, Pan et al. [12] developed a weighted expectile regression approach for estimating the conditional expectile when covariates are missing at random (MAR). They considered only a single expectile, and the missing mechanism was assumed to follow a logistic regression model, which may be misspecified. In addition, it is known that making full use of information from multiple expectiles can improve the efficiency of parameter estimation. Motivated by these considerations, when the propensity score model may be misspecified, we use the idea of covariate balancing to study CBPS-based weighted expectile average estimation of the unknown parameters, combining information across multiple expectiles. Our estimator improves on the usual weighted expectile average estimator in terms of standard deviation (SD) and mean squared error (MSE).
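To make the asymmetric squared loss concrete, the following minimal Python sketch (our own illustration; the function name is not from the paper) evaluates $\Phi_{\tau}(v) = |\tau - I(v \le 0)|v^2$ and shows that the choice $\tau = 0.5$ reduces to half the ordinary squared loss, so that the 0.5-expectile coincides with the mean.

```python
import numpy as np

def expectile_loss(v, tau):
    """Asymmetric squared loss Phi_tau(v) = |tau - I(v <= 0)| * v**2."""
    v = np.asarray(v, dtype=float)
    return np.abs(tau - (v <= 0)) * v ** 2

residuals = np.array([-2.0, -0.5, 0.3, 1.5])

# At tau = 0.5 the loss is exactly half the squared loss, so minimizing it
# is equivalent to ordinary least squares (mean regression).
print(expectile_loss(residuals, 0.5))   # [2.    0.125 0.045 1.125]
print(0.5 * residuals ** 2)             # identical values

# At tau = 0.9 positive residuals are penalized more heavily than negative ones.
print(expectile_loss(residuals, 0.9))
```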
The rest of this paper is organized as follows. In Section 2, we propose a CBPS-based estimator of the propensity score. In Section 3, we construct the CBPS-based weighted expectile average estimator of the regression parameters. In Section 4, we establish the asymptotic normality of the weighted estimator. In Section 5, a simulation study is carried out to assess the performance of the proposed method. The proofs of the theoretical results are deferred to the Appendix.
2. CBPS-based estimator for the propensity score
Consider the following linear regression model:
$$Y_i = X_i^{\top}\beta + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (2.1)$$
where $Y_i$ is the response, $X_i$ is the covariate, $\beta$ is the $p$-dimensional vector of unknown parameters, and $\varepsilon_i$ is the random error. We assume that the response $Y_i$ is subject to missingness, while the covariate $X_i$ is fully observed. For the $i$th individual, let $\delta_i$ denote the observation indicator, i.e., $\delta_i = 1$ if $Y_i$ is observed and $\delta_i = 0$ otherwise. In this paper, we consider only the missing at random (MAR) mechanism, that is,
$$P(\delta_i = 1 \mid X_i, Y_i) = P(\delta_i = 1 \mid X_i) = \pi(X_i) \equiv \pi_i, \qquad (2.2)$$
where $\pi_i$ is called the selection probability function or the propensity score.
The most popular choice for $\pi(X_i)$ is a logistic regression function (Peng et al. [13]). We make the same choice and posit a logistic regression model for $\pi(X_i)$,
$$\pi(X_i,\gamma) = \frac{\exp(\gamma_0 + \gamma_1^{\top}X_i)}{1 + \exp(\gamma_0 + \gamma_1^{\top}X_i)}, \qquad (2.3)$$
where $\gamma = (\gamma_0, \gamma_1^{\top})^{\top} \in \Theta$ is the unknown parameter vector with parameter space $\Theta \subseteq \mathbb{R}^{q+1}$. Here, $\gamma$ can be estimated by maximizing the log-likelihood function
$$L(\gamma) = \sum_{i=1}^{n}\left\{\delta_i \log \pi(X_i,\gamma) + (1-\delta_i)\log\big(1-\pi(X_i,\gamma)\big)\right\}.$$
Assuming that $\pi(X_i,\gamma)$ is twice continuously differentiable with respect to $\gamma$, maximizing $L(\gamma)$ yields the first-order condition
$$\frac{1}{n}\sum_{i=1}^{n}\left[\frac{\delta_i}{\pi(X_i,\gamma)} - \frac{1-\delta_i}{1-\pi(X_i,\gamma)}\right]\pi'(X_i,\gamma) = 0, \qquad (2.4)$$
where $\pi'(X_i,\gamma) = \partial\pi(X_i,\gamma)/\partial\gamma^{\top}$. Maximum likelihood is a commonly used and simple method of parameter estimation. However, if the selection probability model (2.3) is misspecified, the resulting estimator may be severely biased. To make the parametric approach more robust, we use the covariate balancing propensity score method proposed by Imai and Ratkovic [5] to estimate the unknown parameter $\gamma$, that is,
$\widetilde{X}_i = f(X_i)$ is an $M$-dimensional vector-valued measurable function of $X_i$. For any such covariate function, Eq (2.5) holds as long as the expectation exists. If the propensity score model is incorrectly specified, maximum likelihood estimation may fail to balance the covariates. Following Imai and Ratkovic [5], we can set $\widetilde{X}_i = X_i$ to ensure that the first moment of each covariate is balanced even when the model is misspecified. Thus, $\pi(X_i,\gamma)$ satisfies the condition
The sample version of the covariate balancing condition obtained from (2.6) is
where
According to Imai and Ratkovic [5], if we use only the condition that balances $\pi'(X_i,\gamma)$, i.e., (2.4), then the number of equations equals the number of parameters, and the covariate balancing propensity score is just-identified. If we combine the score condition (2.4) with the covariate balancing condition given in Eq (2.7),
where
then the covariate balancing propensity score is over-identified, because the number of moment conditions exceeds the number of parameters. For the over-identified CBPS, the estimate of $\gamma$ can be obtained by the generalized method of moments (GMM) (Hansen [14]). For a positive semidefinite symmetric weight matrix $W$, the GMM estimator $\hat{\gamma}$ is obtained by minimizing the following objective function in $\gamma$:
The above method also applies to the case where the covariate balancing propensity score is just-identified.
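To fix ideas, the following Python sketch illustrates a CBPS-type GMM fit of $\gamma$ under the logistic model (2.3). It is our own illustration, not the authors' code: the first-moment balancing condition $E[(\delta/\pi(X,\gamma) - 1)X] = 0$ and the identity weight matrix used below are simplifying assumptions standing in for the paper's conditions (2.5)–(2.7) and the weight matrix $W$.

```python
import numpy as np
from scipy.optimize import minimize

def pi_logistic(X, gamma):
    """Logistic propensity score pi(X, gamma) with intercept gamma[0]."""
    eta = gamma[0] + X @ gamma[1:]
    return 1.0 / (1.0 + np.exp(-eta))

def moment_conditions(gamma, X, delta):
    """Stack the logistic likelihood score and the covariate-balancing moments."""
    n = X.shape[0]
    Z = np.column_stack([np.ones(n), X])             # design matrix with intercept
    p = np.clip(pi_logistic(X, gamma), 1e-6, 1 - 1e-6)
    score = (delta - p)[:, None] * Z                 # score of the Bernoulli log-likelihood
    balance = (delta / p - 1.0)[:, None] * X         # sample version of E[(delta/pi - 1) X] = 0
    return np.hstack([score, balance]).mean(axis=0)

def gmm_objective(gamma, X, delta, W=None):
    """Quadratic form g_bar' W g_bar of the stacked moment conditions."""
    g = moment_conditions(gamma, X, delta)
    W = np.eye(g.size) if W is None else W           # identity weight for illustration
    return g @ W @ g

def cbps_fit(X, delta):
    """Over-identified CBPS estimate of gamma obtained by minimizing the GMM objective."""
    gamma0 = np.zeros(X.shape[1] + 1)
    res = minimize(gmm_objective, gamma0, args=(X, delta), method="BFGS")
    return res.x
```

Because the stacked moment vector has more components than $\gamma$ has entries, this fit is over-identified; dropping the balancing block (or the score block) recovers the just-identified case mentioned above.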
3. Estimator for the regression parameter
Pan et al. [12] introduced weighted expectile regression estimation for linear models in detail. Following the idea of inverse probability weighting, when the selection probabilities $(\pi_1, \ldots, \pi_n)^{\top}$ are known, the expectile estimator of $\beta$ under missing responses is defined as
where $\tau_k \in (0,1)$ is the expectile level, $\Phi_{\tau_k}(v) = |\tau_k - I(v \le 0)|v^2$, and $b_{\tau_k}$ denotes the $\tau_k$-expectile of the error term $\varepsilon_i$. Then, following Zhao et al. [15], let $K$ be the number of expectiles and consider the equally spaced expectile levels $\tau_k = k/(K+1)$, $k = 1, 2, \ldots, K$. The weighted expectile average estimator of the linear model parameter $\beta$ when the missing mechanism is known is defined as
where the weight vector $(\omega_1, \ldots, \omega_K)^{\top}$ satisfies $\sum_{k=1}^{K}\omega_k = 1$.
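As a concrete illustration of this construction, the Python sketch below (our own, not the authors' code) minimizes the inverse-probability-weighted asymmetric squared loss at a single level $\tau_k$ and then averages the $K$ single-level fits. Two simplifying assumptions are made here: the error expectile $b_{\tau_k}$ is absorbed into an intercept term, and equal weights $\omega_k = 1/K$ are used as a default; the paper's choice of weights may differ.

```python
import numpy as np
from scipy.optimize import minimize

def ipw_expectile_fit(Y, X, delta, pi, tau):
    """Minimize sum_{i: delta_i=1} (1/pi_i) * |tau - I(r_i<=0)| * r_i^2 over (b, beta),
    where r_i = Y_i - b - X_i' beta and b stands in for the tau-expectile of the error."""
    n, p = X.shape
    Z = np.column_stack([np.ones(n), X])      # intercept column absorbs b_tau
    obs = delta == 1                          # only observed responses enter the loss

    def loss(theta):
        r = Y[obs] - Z[obs] @ theta
        return np.sum(np.abs(tau - (r <= 0)) * r ** 2 / pi[obs])

    res = minimize(loss, np.zeros(p + 1), method="BFGS")
    return res.x[1:]                          # slope part: estimate of beta

def weighted_expectile_average(Y, X, delta, pi, K=10, omega=None):
    """Weighted average of the K single-level estimators at tau_k = k/(K+1)."""
    omega = np.full(K, 1.0 / K) if omega is None else np.asarray(omega)
    taus = np.arange(1, K + 1) / (K + 1)
    betas = np.array([ipw_expectile_fit(Y, X, delta, pi, t) for t in taus])
    return omega @ betas                      # sum_k omega_k * beta_hat_{tau_k}
```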
When the selection probability function is unknown, we use the method proposed in Section 2 to estimate the parameter $\gamma$ based on CBPS, and thereby obtain $\pi(X_i,\hat{\gamma})$. The loss function for the $\tau_k$-expectile can then be defined as
By minimizing this loss function, we obtain the expectile estimator of the unknown parameter $\beta$,
Therefore, the weighted expectile average estimator of the linear model parameter $\beta$ under missing responses, when the missing mechanism is unknown, is defined as
The weight vector $(\omega_1, \ldots, \omega_K)^{\top}$ satisfies $\sum_{k=1}^{K}\omega_k = 1$.
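Putting the two steps together, a minimal sketch of the CBPS-based weighted expectile average estimator, reusing the illustrative functions `cbps_fit`, `pi_logistic`, and `weighted_expectile_average` from the sketches above, would look as follows.

```python
def cbps_weae(Y, X, delta, K=10, omega=None):
    """CBPS-based weighted expectile average estimator (illustrative sketch):
    (1) estimate gamma by the CBPS/GMM routine from Section 2,
    (2) plug pi(X, gamma_hat) into the IPW weighted expectile average estimator."""
    gamma_hat = cbps_fit(X, delta)            # step 1: propensity score parameters
    pi_hat = pi_logistic(X, gamma_hat)        # estimated selection probabilities
    return weighted_expectile_average(Y, X, delta, pi_hat, K=K, omega=omega)
```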
4. Asymptotic property
Let $\gamma_0$ and $\beta_0$ denote the true values of $\gamma$ and $\beta$, respectively, and let $U(\gamma) = \begin{pmatrix} s(\delta,X,\gamma) \\ z(\delta,X,\gamma)X \end{pmatrix}$. In addition, following Pan et al. [12] and Guo [16], the following regularity conditions are required.
C1: $\gamma_0$ is an interior point of $\Theta$.
C2: $U(\gamma)$ is differentiable in a neighborhood $\triangle$ of $\gamma_0$.
C3: $E[U(\gamma_0)] = 0$ and $E[\|U(\gamma_0)\|^2] < \infty$.
C4: $E[\sup_{\gamma\in\triangle}\|\nabla_{\gamma}U(\gamma)\|] < \infty$, where $\nabla_{\gamma}$ denotes the first-order partial derivative with respect to $\gamma$.
C5: $\Gamma = E[\nabla_{\gamma}U(\gamma)]$ exists.
C6: For each $i$, there exists a compact set $\mathcal{X}$ such that $X_i \in \mathcal{X} \subset \mathbb{R}^p$, and $X_i$ and $\varepsilon_i$ are independent.
C7: The regression errors $\{\varepsilon_i\}_{i=1}^{n}$ are independent and identically distributed with common cumulative distribution function $F(\cdot)$, satisfying $E[\varepsilon_i^2] < \infty$.
C8: There exists $a > 0$ such that $\pi(V_i,\gamma) > a$ for all $i$.
C9: The symmetric matrix $\Sigma_1$ is positive definite.
The following theorem presents the asymptotic distribution of $\hat{\beta}_w$.
Theorem 4.1 (Asymptotic normality of $\hat{\beta}_w$). Under assumptions C1–C9, we have
where $\Sigma_1 = E[X_i X_i^{\top}]$, $\Lambda = E[\lambda\lambda^{\top}]$, $\lambda = \mu - E[\partial\mu/\partial\gamma^{\top}]\{E[\partial U(\gamma)/\partial\gamma^{\top}]\}^{-1}U(\gamma)$, and $\mu = \dfrac{\delta}{\pi(X,\gamma)}X\sum_{k=1}^{K}\omega_k\dfrac{\Psi_{\tau_k}(\varepsilon - b_{0k})}{g(\tau_k)}$.
5. Simulation
In this section, the proposed CBPS-based weighted expectile average estimator is examined by numerical simulation and compared with the usual parametric estimation methods when the propensity score model is correctly specified and when it is misspecified. Consider the following linear model:
where $\beta_1 = 0.5$, $\beta_2 = 1$, $\beta_3 = 1$, and $(X_1, X_2, X_3)$ follows a joint normal distribution with mean $0$, pairwise covariance $0.5$, and unit variances. The error term $\varepsilon$ follows the standard normal distribution. In our simulation, we take $K = 10$ and $\tau_k = k/11$ for $k = 1, 2, \ldots, 10$, and consider the following true selection probability model:
Under the MAR assumption, in order to illustrate the effect of propensity score model misspecification, we consider the transformed covariates
If the working propensity score model (5.2) is specified in terms of $\pi(X^*)$, the model is misspecified. In the simulation study of the expectile regression estimation of the unknown parameter $\beta$, we consider the following two cases: (1) the propensity score model is correctly specified; (2) the propensity score model is misspecified. Zhao [15] proposed a weighted composite expectile regression method for a varying-coefficient partially linear model. Referring to Zhao [15], we compare the proposed CBPS-based weighted expectile average estimator, denoted CBPS-WEAE, with the weighted composite expectile regression estimator, denoted WCER, and the weighted composite quantile regression estimator, denoted WCQR, where the weights of WCER and WCQR are estimated via a generalized linear model.
In the simulation, samples of sizes $n = 500, 800, 1000,$ and $1200$ are generated independently. For each scenario, we conduct 1000 replications and calculate the average mean squared error (MSE) of the estimator of $\beta$, as well as the average bias (Bias) and standard deviation (SD) of the estimators of $\beta_1$, $\beta_2$, and $\beta_3$. To examine the influence of the error distribution on the performance of the proposed method, two distributions of the model error $\varepsilon$ are considered: the standard normal distribution $N(0,1)$ and the centralized $\chi^2$ distribution with 4 degrees of freedom. The results of our simulations are presented in Tables 1 and 2.
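For readers who wish to reproduce a setup of this kind, the sketch below generates one simulated data set along the lines described above. Since the paper's true selection probability model (5.1) and the covariate transformation $X^*$ are not reproduced here, the logistic missingness mechanism in the code is a hypothetical placeholder rather than the authors' specification.

```python
import numpy as np

rng = np.random.default_rng(2024)

def generate_data(n, chi2_error=False):
    """Generate one data set: equicorrelated normal covariates, linear response,
    and MAR missingness in Y. The missingness model below is a placeholder."""
    beta = np.array([0.5, 1.0, 1.0])
    cov = np.full((3, 3), 0.5)                   # pairwise covariance 0.5
    np.fill_diagonal(cov, 1.0)                   # unit variances
    X = rng.multivariate_normal(np.zeros(3), cov, size=n)
    if chi2_error:
        eps = rng.chisquare(4, size=n) - 4.0     # centralized chi-square with 4 df
    else:
        eps = rng.standard_normal(n)             # N(0, 1) error
    Y = X @ beta + eps
    # Hypothetical MAR mechanism: P(delta = 1 | X) depends only on the covariates.
    pi_true = 1.0 / (1.0 + np.exp(-(1.0 + 0.5 * X[:, 0] - 0.5 * X[:, 1])))
    delta = rng.binomial(1, pi_true)
    return Y, X, delta

# A small Monte Carlo study would repeat generate_data() 1000 times for each n,
# apply an estimator such as cbps_weae(Y, X, delta) from the earlier sketches,
# and summarize the bias, SD, and MSE across replications.
```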
From Tables 1 and 2 we observe that, as expected, all three estimators are approximately unbiased. In terms of MSE, taken as a convenient measure of average error, when the model error $\varepsilon$ follows the standard normal distribution $N(0,1)$, CBPS-WEAE performs best among the three estimators, followed by WCER, while WCQR performs worst. When $\varepsilon$ follows the centralized $\chi^2$ distribution with 4 degrees of freedom, CBPS-WEAE remains superior to the other two methods. As the sample size increases, the performance of all three estimators improves markedly. Overall, the proposed estimator is effective.
6. Conclusions
In this paper, in order to improve the efficiency of weighted expectile average estimation, we estimate the selection probability function based on CBPS and propose a CBPS-based weighted expectile average estimator for the case where the responses are missing at random. The asymptotic normality of the proposed estimator is established, and its performance is further illustrated by numerical simulation. The simulation results show that the method is effective.
Author contributions
Qiang Zhao: Conceptualization, methodology, supervision, writing-review and editing; Zhaodi Wang: Validation, software, writing-original draft; Jingjing Wu: Funding acquisition, formal analysis, writing-original draft; Xiuli Wang: Funding acquisition, investigation, resources, writing-review and editing. All authors have read and approved the final version of the manuscript for publication.
Use of AI tools declaration
The authors declare they have not used artificial intelligence (AI) tools in the creation of this article.
Acknowledgments
The research is supported by the Natural Science Foundation of Shandong Province (Grant Nos. ZR2021MA077 and ZR2021MA048).
Conflict of interest
All authors declare that there is no conflict of interest.
Appendix: Assumptions and proofs
Define the following symbols:
$\eta_i = \dfrac{\delta_i}{\pi(X_i,\gamma)}X_i\Psi_{\tau_k}(\varepsilon_i)$,
$\hat{\eta}_i = \dfrac{\delta_i}{\pi(X_i,\hat{\gamma})}X_i\Psi_{\tau_k}(\varepsilon_i)$,
$F_n = -\dfrac{1}{\sqrt{n}}\sum_{i=1}^{n}\hat{\eta}_i$,
$\varepsilon_i = Y_i - X_i^{\top}\beta_0$,
$\omega = (\omega_1, \omega_2, \ldots, \omega_n)^{\top}$,
$\Sigma_1 = E[X_i X_i^{\top}]$,
$\Psi_{\tau_k}(v) = 2|\tau_k - I(v \le 0)|v$,
$u = (u_1, u_2, \ldots, u_n)^{\top}$,
$G_n(u) = \sum_{i=1}^{n}\dfrac{\delta_i}{\pi(X_i,\hat{\gamma})}\left[\Psi_{\tau_k}\!\left(\varepsilon_i - \dfrac{X_i^{\top}u}{\sqrt{n}}\right) - \Psi_{\tau_k}(\varepsilon_i)\right]$.
Lemma 1. Assume that C1–C5 hold. Then, as $n \to \infty$,
where $\Gamma = E[\nabla_{\gamma}U(\gamma)]$ and $\Sigma = E[U(\gamma)U^{\top}(\gamma)]$.
For the proof of Lemma 1, we refer to Theorem 2.2.1 in Guo [16].
Lemma 2. If the conditions C1–C4 are satisfied, then
where $\Omega = E[QQ^{\top}]$ and $Q = \eta - E[\partial\eta/\partial\gamma^{\top}]\{E[\partial U(\gamma)/\partial\gamma^{\top}]\}^{-1}U(\gamma)$.
Proof. Expanding $\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\hat{\eta}_i$ around $\gamma$ and using the argument in the proof of Lemma 1, we obtain
where $D_n = \left[\dfrac{1}{n}\sum_{i=1}^{n}\dfrac{\partial\eta_i}{\partial\gamma}\right]_{\gamma^*}$, $B_n = \left[\dfrac{1}{n}\sum_{i=1}^{n}\dfrac{\partial U_i(\gamma)}{\partial\gamma}\right]_{\gamma^*}$, and $\gamma^*$ lies between $\gamma$ and $\hat{\gamma}$.
According to the central limit theorem,
where $\Omega = E[QQ^{\top}]$ and $Q = \eta - E[\partial\eta/\partial\gamma^{\top}]\{E[\partial U(\gamma)/\partial\gamma^{\top}]\}^{-1}U(\gamma)$.
Therefore, Lemma 2 is proved.
Lemma 3. If the conditions C1–C4 are satisfied, then
Proof. If conditions C1–C4 are satisfied, it follows from Pan et al. [12] that
where $g(\tau) = (1-\tau)F(0) + \tau(1-F(0))$.
By Hjort and Pollard [17], if
where $D_n(u)$ is a convex objective function with minimizer $\hat{u}_n$, $A$ is a symmetric and positive definite matrix, and $B$ is a random variable, then
Therefore, if we define $\hat{u}_n = \sqrt{n}(\hat{\beta}_{\tau_k} - \beta_0)$, then $\hat{\beta}_{\tau_k} = \beta_0 + \hat{u}_n/\sqrt{n}$. By some simple calculations and (A.2), we have
According to condition C9, $\Sigma_1$ is a symmetric positive definite matrix. Lemma 3 then follows from Lemma 1 and Slutsky's theorem.
Proof of Theorem 4.1. By Lemma 3 we know that
From $\hat{\beta}_w = \sum_{k=1}^{K}\omega_k\hat{\beta}_{\tau_k}$ and $\sum_{k=1}^{K}\omega_k = 1$, we can get
According to the proof of Lemma 2, we can obtain that
Let $\mu_i = \dfrac{\delta_i}{\pi(X_i,\gamma)}X_i\sum_{k=1}^{K}\omega_k\dfrac{\Psi_{\tau_k}(\varepsilon_i - b_{0k})}{g(\tau_k)}$ and $\hat{\mu}_i = \dfrac{\delta_i}{\pi(X_i,\hat{\gamma})}X_i\sum_{k=1}^{K}\omega_k\dfrac{\Psi_{\tau_k}(\varepsilon_i - b_{0k})}{g(\tau_k)}$, and then
where $H_n = \left[\dfrac{1}{n}\sum_{i=1}^{n}\dfrac{\partial\mu_i}{\partial\gamma}\right]_{\gamma^*}$. Therefore, Eq (A.4) is equivalent to
Therefore,
where $\Lambda = E[\lambda\lambda^{\top}]$, $\lambda = \mu - E[\partial\mu/\partial\gamma^{\top}]\{E[\partial U(\gamma)/\partial\gamma^{\top}]\}^{-1}U(\gamma)$, and $\mu = \dfrac{\delta}{\pi(X,\gamma)}X\sum_{k=1}^{K}\omega_k\dfrac{\Psi_{\tau_k}(\varepsilon - b_{0k})}{g(\tau_k)}$.