1. Introduction
Analyzing high-dimensional survival data has become an important topic in statistics, and identifying covariates with good predictive power for survival is a fundamental step. For variable selection, penalized least squares procedures are an attractive approach: by penalizing the regression coefficients, they perform variable selection and estimation simultaneously.
As a useful alternative to the Cox model [2], the AFT model [10] is based on linear regression and has a more intuitive interpretation. The AFT model with an unspecified error distribution has been widely studied for right-censored data. Two approaches have attracted particular attention. One uses the Kaplan-Meier estimator to construct a weighted least squares estimator. The other is the rank-based estimator, which is motivated by the score function of the partial likelihood; see, for example, [1,13,17].
To identify significant factors with predictive power, many techniques for linear regression models have been extended to the Cox regression and the AFT model. Penalized methods, which impose penalties on the regression coefficients, have drawn extensive attention. By balancing goodness of fit against model complexity, penalization reduces a complex model to a parsimonious one. Many such methods have been applied to gene expression analysis with survival data; see, for example, [16,19]. Moreover, various penalization methods with consistent selection have been proposed, including the adaptive Lasso [19], the smoothly clipped absolute deviation (SCAD) penalty [6], the minimax concave penalty (MCP) [20] and the bridge penalty. The bridge penalty has been shown to achieve oracle estimation in linear regression models with a divergent number of covariates. For AFT models, there is also a large literature (e.g., [7,9,22]). To name but a few, Huang et al. [7] considered regularized estimation in the AFT model with high-dimensional covariates based on Stute's weighted least squares method. Huang and Ma [8] considered variable selection for the AFT model with the bridge penalty. Wang and Song [18] applied the adaptive Lasso to AFT models. AFT models continue to attract attention in recent years. For example, Chai et al. [3] considered a set of low-dimensional covariates of main interest together with a set of high-dimensional covariates that may also affect survival under the accelerated failure time model. Choi and Choi [4] proposed a logistic-kernel smoothing procedure for the semiparametric AFT model with high-dimensional right-censored data. Li et al. [12] proposed a unified Expectation-Maximization approach combined with an L1-norm penalty to perform variable selection and parameter estimation simultaneously in the AFT model with right-censored survival data of moderate size.
This article is motivated by the need to study the seamless-L0 (SELO) penalty [5], a smooth function that closely approximates the L0 penalty, under the AFT model. Under appropriate conditions, we show that SELO selects a model whose dimension is comparable to that of the underlying model, and we prove that the proposed estimator is asymptotically normal. Monte Carlo simulations are conducted to evaluate the finite sample performance of the proposed procedure, and the method is further demonstrated through an empirical analysis.
The rest of this paper is organized as follows. The SELO-penalized AFT model and the computational algorithm are introduced in Section 2. In Section 3, we establish the root-n consistency and the asymptotic normality of the resulting estimator for the high-dimensional sparse AFT model. A Monte Carlo simulation study examining the finite sample performance of the proposed estimator is presented in Section 4. A real data example is used to illustrate the proposed methodology in Section 5.
2. SELO estimation in the AFT model
Let T_i be the logarithm of the failure time and X_i be the p-dimensional covariate vector. The AFT model assumes
T_i = α + X_i^T β + ε_i, i = 1, …, n,
where α is the intercept, β ∈ R^p is an unknown vector of interest, and ε_i is the random error. When T_i is subject to right censoring, we can only observe (Y_i, δ_i, X_i), where Y_i = min(T_i, C_i), X_i is the ith row of the n×p covariate matrix X, C_i is the logarithm of the censoring time, and δ_i = I{T_i ≤ C_i} is the censoring indicator. We assume that (Y_i, δ_i, X_i), i = 1, …, n, are independent and identically distributed.
Let F̂_n be the Kaplan-Meier estimator of the distribution function F of T. F̂_n can be written as
F̂_n(y) = ∑_{i=1}^n w_i I{Y_{(i)} ≤ y},
where the w_i's are the jumps of the Kaplan-Meier estimator,
w_1 = δ_{(1)}/n, w_i = δ_{(i)}/(n − i + 1) ∏_{j=1}^{i−1} ((n − j)/(n − j + 1))^{δ_{(j)}}, i = 2, …, n.
The w_i's are also called the Kaplan-Meier weights; see Stute and Wang [14]. Here Y_{(1)} ≤ ⋯ ≤ Y_{(n)} are the order statistics of the Y_i's and δ_{(1)}, …, δ_{(n)} are the associated censoring indicators. Similarly, let X_{(1)}, …, X_{(n)} be the covariates associated with the ordered Y_i's. The weighted least squares (WLS) loss function is
ℓ_n(α, β) = ∑_{i=1}^n w_i (Y_{(i)} − α − X_{(i)}^T β)². (2.2)
Let X̄_w = ∑_{i=1}^n w_i X_{(i)} / ∑_{i=1}^n w_i and Ȳ_w = ∑_{i=1}^n w_i Y_{(i)} / ∑_{i=1}^n w_i, and define X*_{(i)} = (n w_i)^{1/2}(X_{(i)} − X̄_w) and Y*_{(i)} = (n w_i)^{1/2}(Y_{(i)} − Ȳ_w). The WLS objective function (2.2) can then be written as
ℓ_n(β) = n^{−1} ∑_{i=1}^n (Y*_{(i)} − X*_{(i)}^T β)².
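As a concrete illustration, the Kaplan-Meier weights and the centered, weight-scaled variables above can be computed as follows; this is a minimal numpy sketch (the function names are ours) that assumes no ties among the observed times.

```python
import numpy as np

def km_weights(y, delta):
    """Kaplan-Meier weights w_i for the ordered observations Y_(1) <= ... <= Y_(n).

    Assumes no ties among the observed times; returns the sort order and the
    weights aligned with that order.
    """
    n = len(y)
    order = np.argsort(y)
    d = np.asarray(delta)[order].astype(float)
    w = np.empty(n)
    w[0] = d[0] / n
    prod = 1.0  # running product of ((n - j) / (n - j + 1))^{delta_(j)}
    for k in range(1, n):  # k is 0-based, so the paper's index i equals k + 1
        prod *= ((n - k) / (n - k + 1)) ** d[k - 1]
        w[k] = d[k] / (n - k) * prod
    return order, w

def wls_transform(X, y, delta):
    """Centered, weight-scaled variables X*_(i) and Y*_(i) from the text."""
    order, w = km_weights(y, delta)
    Xo, yo = X[order], y[order]
    sw = w.sum()
    Xbar = (w[:, None] * Xo).sum(axis=0) / sw
    ybar = (w * yo).sum() / sw
    scale = np.sqrt(len(y) * w)
    return scale[:, None] * (Xo - Xbar), scale * (yo - ybar)
```

With no censoring (all δ_i = 1), the weights reduce to the usual 1/n and the transform simply centers the data, which gives a quick sanity check.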
Penalized regression has been studied extensively. The LASSO, based on the L1 penalty, is among the most popular and widely studied methods, but its estimator can be inconsistent for model selection. The smoothly clipped absolute deviation (SCAD) penalty and the minimax concave penalty (MCP) are two other popular choices: the SCAD penalty is continuous and its estimator has the oracle property, and the MCP estimator is also selection consistent. The L0 penalty directly penalizes the number of nonzero parameters, but its discontinuity makes computation difficult. The seamless-L0 (SELO) penalty, proposed by Dicker [5], was explicitly designed to mimic the L0 penalty and has been shown to possess good theoretical properties.
We now describe variable selection for the AFT model via SELO. Coordinate descent is used to solve the resulting optimization problem, and the tuning parameter λ is selected by cross-validation. The SELO penalized objective function is
where the SELO penalty is defined as
p_SELO(β_j) = (λ / log 2) log(1 + |β_j| / (|β_j| + τ)),
and λ > 0 is a tuning parameter. A larger λ produces a sparser model; throughout, λ is chosen by cross-validation. It is easy to see that when τ is small enough, p_SELO(β_j) ≈ λ I{β_j ≠ 0}, so the penalty behaves like the L0 penalty.
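For reference, the penalty can be sketched in a few lines, assuming the functional form p_SELO(b) = (λ/log 2) log(1 + |b|/(|b| + τ)) from Dicker [5]:

```python
import numpy as np

def p_selo(beta, lam, tau):
    """SELO penalty, assuming the form from Dicker [5]:
    p(b) = (lam / log 2) * log(1 + |b| / (|b| + tau))."""
    b = np.abs(beta)
    return lam / np.log(2.0) * np.log1p(b / (b + tau))
```

The penalty is exactly 0 at b = 0 and, for small τ, rises rapidly toward λ for any nonzero b, illustrating the approximation p_SELO(b) ≈ λ I{b ≠ 0}.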
To minimize (2.3), we use the coordinate descent algorithm [21], which has been widely applied to penalized regression problems; it optimizes the objective function by updating one parameter at a time until convergence. Dicker [5] described this algorithm for computing SELO estimators. The algorithm is run for a fixed value of the tuning parameter λ, cycling through the coordinates with the other parameters held fixed.
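A minimal sketch of one way to implement this loop follows. Rather than Dicker's exact coordinate update, each univariate subproblem is minimized numerically (via scipy) and compared against the exact zero solution; the penalty form and the 1/(2n) loss scaling are our assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def p_selo(b, lam, tau):
    # SELO penalty, assuming the form (lam / log 2) * log(1 + |b| / (|b| + tau))
    b = abs(b)
    return lam / np.log(2.0) * np.log1p(b / (b + tau))

def selo_cd(Xs, ys, lam, tau=0.01, n_iter=100, tol=1e-6):
    """Coordinate descent for the (assumed) objective
    (2n)^{-1} ||ys - Xs beta||^2 + sum_j p_selo(beta_j)."""
    n, p = Xs.shape
    beta = np.zeros(p)
    r = ys - Xs @ beta  # current residual
    for _ in range(n_iter):
        old = beta.copy()
        for j in range(p):
            r_j = r + Xs[:, j] * beta[j]  # partial residual without coordinate j

            def f(b):
                return 0.5 / n * np.sum((r_j - Xs[:, j] * b) ** 2) + p_selo(b, lam, tau)

            # the univariate problem is nonconvex: minimize numerically and
            # always compare against the exact-zero solution
            res = minimize_scalar(f, bounds=(-10.0, 10.0), method="bounded")
            beta[j] = res.x if f(res.x) < f(0.0) else 0.0
            r = r_j - Xs[:, j] * beta[j]
        if np.max(np.abs(beta - old)) < tol:
            break
    return beta
```

On uncensored data the transformed variables reduce to centered data, so the sketch can be tried directly on a small least squares problem; coefficients whose fit gain does not exceed the penalty are set exactly to zero.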
3. Theoretical properties
In this section, we establish the consistency and asymptotic normality of the WLS estimator via SELO under suitable conditions. Following the notation of Stute [14,15], let H denote the distribution function of Y. Under the assumption of independence between T and C, 1 − H(y) = (1 − F(y))(1 − G(y)), where F and G are the distribution functions of T and C, respectively. Let τ_Y, τ_T and τ_C be the endpoints of the supports of Y, T and C. We put
Now, introduce the following sub-distribution functions:
Under random censoring the limit variance becomes much more complicated. Let
and
We assume that
(A1) (a) E[(Y − X^T β*)² X X^T δ] < ∞; (b) ∫ |(w − x^T β*) x_j| D^{1/2}(w) F̃_0(dx, dw) < ∞ for j = 1, …, p, where D(y) = ∫_0^{y−} [(1 − H(w))(1 − G(w))]^{−1} G(dw).
(A2) λ = O(1), τ = O(p^{1/2}/n^{3/2}), λ√n/p → ∞ and pσ²/n → 0 as n → ∞.
(A3) r ≤ λ_min(E(X X^T)) ≤ λ_max(E(X X^T)) ≤ R, where r and R are positive constants.
(A4) lim_{n→∞} n^{−1} max_{1≤i≤n} ∑_{j=1}^p w_i² x_{ij}² = 0.
(A5) E(|ε/σ|^{2 + 2δ/(1+δ)}) < M for some M < ∞ and δ > 0.
Condition (A1) is standard for the consistency proof in Stute [14]. Condition (A2) restricts the sizes of λ and τ. Condition (A3) bounds the eigenvalues of E(X X^T) and is used in Theorem 3.1. Conditions (A4) and (A5) are used in the proof of the asymptotic normality of the SELO estimator and relate to the Lindeberg condition of the Lindeberg-Feller central limit theorem.
Theorem 3.1. Suppose that conditions (A1)–(A5) hold. Then
(i) lim_{n→∞} P({j : β̂_j ≠ 0} = A) = 1, where A = {j : β*_j ≠ 0};
(ii) √n (n^{−1} X_A^T W X_A / σ²)^{1/2} (β̂_A − β*_A) →_D (σ² X_A^T W X_A)^{−1/2} G_A, where G ∼ N(0, Σ) with Σ = Var{δ γ_0(Y)(Y − X^T β*) X + (1 − δ) γ_1(Y; β*) − γ_2(Y; β*)}/n, and G_A is the subvector of G corresponding to β*_A.
The proof is given in the Supplementary.
4. Numerical results
Let T be generated from T = Xβ + ε, where ε ∼ N(0, σ²) with σ = 0.1. The covariates X = (X_1, …, X_p) are standard normal. The censoring variables are generated from a uniform distribution U(0, C_0) independently of the event times, where C_0 is chosen to obtain censoring rates of 25% and 40%. We use two sample sizes, n = 200 and n = 400. The tuning parameter λ is chosen by cross-validation. For each value of n, we simulated 1000 independent datasets {(y_1, x_1^T), …, (y_n, x_n^T)} and calculated estimates of β for each dataset. For each estimator β̂, we recorded: the model size, Â = {j : β̂_j ≠ 0}; an indicator of whether the true model was selected, I{Â = A}; the false positive rate, |Â \ A| / |Â|; the false negative rate, |A \ Â| / (p − |Â|); and the model error, (β̂ − β*)^T (β̂ − β*). The columns labeled "size", "rate", "F+", "F−" and "MSE" report these quantities. Results for SELO, LASSO, SCAD and MCP are summarized in the tables. The tuning parameter is determined by V-fold cross-validation with CV score ∑_{v=1}^V [ℓ_n(β̂^{(−v)}) − ℓ_n^{(−v)}(β̂^{(−v)})]; we set V = 5.
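The data-generating step can be sketched as follows; calibrating C_0 by bisection against the empirical censoring rate is our own device for hitting the target rate, not necessarily the paper's.

```python
import numpy as np

def gen_censored(n, beta, sigma=0.1, target=0.25, seed=0):
    """Generate (Y, delta, X): T = X beta + eps, C ~ U(0, C0), Y = min(T, C).

    C0 is calibrated by bisection so that the empirical censoring rate
    mean(T > C) matches `target` (this calibration device is an assumption).
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, len(beta)))
    T = X @ beta + sigma * rng.normal(size=n)
    U = rng.uniform(size=n)  # C = C0 * U, so the censoring rate is monotone in C0
    lo, hi = 1e-6, 1e3
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if (T > mid * U).mean() > target:
            lo = mid  # too much censoring: enlarge C0
        else:
            hi = mid
    C = 0.5 * (lo + hi) * U
    Y = np.minimum(T, C)
    delta = (T <= C).astype(int)
    return Y, delta, X
```

The same generator can then be called with target 0.25 or 0.40 to reproduce the two censoring settings.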
4.1. Simulation I
This example uses p = 8 with β = (3, 1.5, 0, 0, 1, 0, 0, 0) ∈ R^8. Table 1 summarizes the variable selection results for SELO, LASSO, SCAD and MCP at censoring rates of 25% and 40%. Overall, SELO outperforms the other three methods, selecting the correct model more frequently. For instance, when the censoring rate is 25%, SELO selects the true model most accurately: the true model size is 3 and the average size from SELO is 3.43. LASSO performs worst of the four methods in both model size and correct-selection rate. LASSO, SCAD and MCP select models with average sizes 4.12, 4.05 and 4.04 and select the correct model in 42%, 46.7% and 47% of the replications, respectively. The results at a 40% censoring rate show a similar pattern, although performance is better under 25% censoring than under 40%. Performance also improves as n increases: for instance, SELO selects 3.07 variables on average when n = 400, compared with 3.43 when n = 200, and the other indicators improve as well.
4.2. Simulation II
This example uses p = 50 with β = (3, 1.5, 0, 0, 2, 0, 3, 0, 0, 2, 0, …, 0) ∈ R^50, where the remaining components of β are zero. The other settings are as in Simulation I, and the results are listed in Table 2. SELO again performs better than the other three methods. When the censoring rate is 25%, the average model size from SELO is 5.68, the closest to the true model size, compared with 9.64, 9.27 and 8.86 for LASSO, SCAD and MCP. SELO also performs better on the correct-selection rate and the other indicators: it selects the correct model in 52% of replications, compared with 3%, 5% and 7% for LASSO, SCAD and MCP, and its "F+" and "F−" values are the smallest among the four methods. The patterns are similar to those for p = 8. The results are worse at a 40% censoring rate, but when n increases from 200 to 400, SELO selects the correct model more often.
4.3. Simulation III
This example is run under 25% and 40% censoring, and we also compute the mean estimated variance across 1000 simulated datasets for n = 200 and n = 400. From Tables 3 and 4, SELO with tuning over τ ∈ {0.001, 0.01, 0.1, 0.5} gives better variance than SELO with τ = 0.01 fixed. The 1000 estimators remain stable at both censoring rates, and the simulation results improve when n increases to 400.
4.4. PBC Data
The PBC data were collected in the Mayo Clinic trial of primary biliary cirrhosis of the liver conducted between 1974 and 1984. A total of 424 PBC patients, referred to the Mayo Clinic during that ten-year interval, met the eligibility criteria for the randomized placebo-controlled trial of the drug D-penicillamine. The first 312 cases in the dataset participated in the randomized trial and have complete data; the additional 112 cases did not participate in the clinical trial. After removing cases with missing data, the remaining 276 observations are used for the analysis. We consider 17 covariates: age, albumin, alk.phos, ascites, ast, bili, chol, copper, platelet, edema, hepato, protime, sex, spiders, stage, trt and trig.
Based on the 276 complete cases, we obtained the results in Table 5. The optimal value of λ chosen by cross-validation for SELO is small (0.008), while the optimal values for LASSO, SCAD and MCP are 0.011. LASSO selects 6 variables (sex, hepato, bili, albumin, protime and stage), while SELO selects 3 (sex, albumin and protime); SCAD selects the same variables as MCP. The variables selected by SELO are also selected by the other three methods. We also calculate the AIC (Akaike information criterion), AIC = n log(σ̂²) + 2(d + 1), where d is the number of nonzero parameters and σ̂² is the estimated error variance, for the four methods; SELO attains the smallest AIC. In Table 6, we also report p values for the coefficients of the variables selected by SELO, computed in several steps. First, we calculated the t statistics t_k = β̂_k / s, where s is the standard error of the estimator. Second, we found the critical value t_{α/2}(n − K) with α = 0.05, where n − K is the number of degrees of freedom. Finally, we computed the two-sided p value 2P(T > |t_k|).
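The AIC and the two-sided t-based p values described above can be sketched as follows; the standard-error input s is assumed to be available from the fitting step, and the function names are ours.

```python
import numpy as np
from scipy import stats

def aft_aic(resid, d):
    """AIC = n log(sigma_hat^2) + 2 (d + 1), with sigma_hat^2 the mean
    squared residual and d the number of nonzero parameters."""
    n = len(resid)
    return n * np.log(np.mean(np.square(resid))) + 2 * (d + 1)

def t_pvalue(beta_hat, se, n, K):
    """Two-sided p value for H0: beta_j = 0 against a t(n - K) reference."""
    t = beta_hat / se
    return 2.0 * stats.t.sf(abs(t), df=n - K)
```

For example, a coefficient five standard errors from zero with n = 276 yields a p value far below 0.05, while one standard error from zero does not.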
Table 6 presents the significance test results. The p values for sex, albumin and protime are all less than 0.05, so all variables selected by SELO are significant. Overall, SELO selects a simpler model than the other three methods.
5. Conclusions
Statistical analysis of failure times with high-dimensional covariates is an important topic. In this article, we investigate a new method based on the SELO penalty for the AFT model with high-dimensional covariates, which performs variable selection and estimation simultaneously. A real dataset (PBC) is analyzed, and SELO selects several important covariates. Our numerical results indicate that SELO performs better than the other three methods considered. We address the situation where p ≪ n and prove the oracle property under the condition p/n → 0: both n and p may diverge, but p grows more slowly than n. The situation where p is much larger than n is left for future research.
Acknowledgments
The authors would like to thank the editors and three reviewers for their valuable and helpful comments. The authors also thank Guangren Yang and Yiming Liu for their advice on revising the paper.
Conflict of interest
The authors declare no conflict of interest.
Supplementary
To complete the proof of Theorem 3.1 (i), we first establish two lemmas.
Lemma 1. Recall that
Then for every r∈(0,1), there exists a constant C0>0 such that
where C≥C0.
Proof. Let α_n = √(pσ²/n) and fix r ∈ (0, 1). To prove Lemma 1, it suffices to show that if C > 0 is large enough, then
P(inf_{‖u‖=1} Q_n(u) > 0) ≥ 1 − r
for all n sufficiently large.
Furthermore, define Q_n(u) = Q(β* + C α_n u) − Q(β*). Then,
By the results of Stute (1993, 1996), we have
where W ∼ N(0, Σ), with Σ defined in the theorem. The last term in Q_n(u) involves the index set K(u) = {j : p_SELO(β*_j + C α_n u_j) − p_SELO(β*_j) < 0}. The fact that p_SELO is concave on [0, ∞) implies that, for each β,
p_SELO(β*_j + C α_n u_j) − p_SELO(β*_j) ≥ −C α_n |u_j| p′_SELO(β_j + C α_n u_j).
Under conditions (A1) and (A3), for I1,
For I2,
For I_3, by condition (A2), we have I_3 = o_p(C α_n). We conclude that if C > 0 is large enough, then inf_{‖u‖=1} Q_n(u) > 0 holds for all n sufficiently large, with probability at least 1 − r. This completes the proof of Lemma 1.
Lemma 2. Let C > 0 and let Q(β) be as in Lemma 1. Under conditions (A1)–(A3),
where A^c = {1, …, p} \ A is the complement of A in {1, …, p}.
Proof. Suppose that β ∈ R^p and ‖β − β*‖ < C α_n. Define β̃ ∈ R^p by β̃_{A^c} = 0 and β̃_A = β_A. Similarly to the proof of Lemma 1, if D_n(β, β̃) = Q_n(β) − Q_n(β̃), then
For I1 and I2, under the conditions (A1) and (A3),
For I_3, since p_SELO is concave and ‖β‖ < C√(pσ²/n) implies |β_j| < C√(pσ²/n), we have
Under condition (A2), combining the results for I_1, I_2 and I_3 gives D_n(β, β̃) > 0.
Combining Lemmas 1 and 2 yields the conclusion of Theorem 3.1 (i).
Proof of Theorem 3.1 (ii). The proof relies on verifying the Lindeberg condition of the Lindeberg-Feller CLT. Under conditions (A1)–(A5), let β̂_A be the subvector of the estimator corresponding to A = {j : β̂_j ≠ 0}.
We then have the following decomposition,
and
To prove,
where w_{i,n} = (σ² X_A^T W X_A)^{−1/2} w_i x_{(i),A} ε_i. Let η_{i,n} = w_i² x_{(i),A}^T (X_A^T W X_A)^{−1} x_{(i),A}.
By the condition of Lindeberg-Feller CLT, we have
By Hölder's inequality with 1/p = 2/(2 + δ) and 1/q = δ/(δ + 2),
By Markov's inequality,
and
We have ∑_{i=1}^n η_{i,n} = ∑_{i=1}^n w_i and η_{i,n} ≤ ‖(n^{−1} X_A^T W X_A)^{−1/2}‖² max_{1≤i≤n} ∑_{j=1}^q n^{−1} w_i² x_{ij}².
By conditions (A4) and (A5), ∑_{i=1}^n E[‖w_{i,n}‖²; ‖w_{i,n}‖² > δ_0] → 0.
By conditions (A1)–(A3), ‖(σ² X_A^T W X_A)^{−1/2} p′_A(β̂)‖ = o_p(1),
and
where Σ = Var{δ γ_0(Y)(Y − X^T β*) X + (1 − δ) γ_1(Y; β*) − γ_2(Y; β*)}.
Therefore, (σ² X_A^T W X_A)^{−1/2} X_A^T W ε →_D (σ² X_A^T W X_A)^{−1/2} G_A, where G ∼ N(0, Σ), as required.