1. Introduction
Survival analysis is a set of methods for analyzing the time to an event of interest under the assumption that all subjects will eventually experience the event [1]. It has been widely applied in fields such as medicine, engineering, and credit scoring [2,3,4]. In practice, however, a proportion of subjects may never experience the event of interest. These subjects are considered 'cured', or in other words, long-term survivors. For instance, in medical practice, a fraction of patients may be cured of their disease. The cure rate model, an extension of survival analysis, was therefore introduced to model data with long-term survivors [5].
Let $x_i$ be the covariate vector of subject $i$, and $S_i(t\,|\,x_i)$ the survival probability at time $t$. The cure rate model can be written as
$$S_i(t\,|\,x_i) = \pi_i(x_i) + \big(1 - \pi_i(x_i)\big)\, S_i(t\,|\,U_i = 1, x_i), \tag{1.1}$$
where $U_i$ is a binary random variable indicating the cure status of subject $i$: $U_i = 0$ denotes that subject $i$ is cured and will never experience the event, and $U_i = 1$ otherwise. Here $\pi_i(x_i)$ is the probability of being cured, governed by the vector of regression coefficients $\alpha$, and $S_i(t\,|\,U_i = 1, x_i)$ is the survival function of the uncured, governed by the vector of regression coefficients $\beta$.
As shown in (1.1), the cure rate model is composed of two parts. The first is the incidence part, which models the probability of being cured. The second is the latency part, which describes the conditional survival of the uncured subjects. Since the cure rate model can predict both whether subjects are cured and the time to event of the uncured subjects, it is commonly adopted when a proportion of subjects are long-term survivors [6].
Numerous extensions of the cure rate model have been studied in the literature. Cooner et al. [7] proposed a flexible hierarchical cure rate model to distinguish among the underlying mechanisms that lead to cure. Rodrigues et al. [8] assumed the number of risk factors to follow the Conway-Maxwell Poisson distribution and proposed a Conway-Maxwell Poisson cure rate model that unifies several cure rate models. Li et al. [9] considered a mixture of linear dependent tail-free processes as the prior for the distribution of the cure rate parameter to develop a latent promotion time cure rate model. A cure rate model incorporating penalized splines has been shown to achieve better predictive performance [10]. Georgiana proposed a Bayesian spatial cure rate model with Weibull lifetimes to capture spatial variability in the censoring mechanism [11]. Pal et al. [12] proposed a projected non-linear conjugate gradient algorithm for the cure rate model under a competing risks scenario. In addition, many works have developed semi-parametric and nonparametric methods to investigate the effects of covariates on the outcome. For instance, Li et al. [13] proposed a semi-parametric additive predictor consisting of a sum of linear and nonparametric terms in the incidence part, and Chen et al. [14] modeled the covariate effects nonparametrically.
However, most existing works pay little attention to the relation between the coefficient vectors $\alpha$ and $\beta$ of the two model parts. They generally assume that there are no direct constraints linking the coefficients in $\alpha$ and $\beta$ that correspond to the same covariates [15,16]. In other words, these works treat the probability of being cured and the time to event as independent, with no direct constraints between $\alpha$ and $\beta$.
In practical applications, however, the two model parts are closely related. The incidence part describes the probability of cure (infinite survival), and the latency part describes the conditional survival of the uncured subjects (finite survival). It is therefore plausible that the coefficients in $\alpha$ and $\beta$ corresponding to the same covariates are connected. Theoretical derivations and case studies also suggest that relaxing the assumption of no direct constraints on the regression coefficients can improve model performance [17,18]. Liu et al. [19] relax this assumption by establishing a joint distribution of the covariates and the logarithm of the hazard rate. Fan et al. [20] incorporate structural effects of $\alpha$ and $\beta$ in the cure rate model to improve estimation accuracy and interpretability. A joint distribution or structural effects of $\alpha$ and $\beta$ are crude yet effective ways to impose relations between the coefficients. However, the two model parts still describe two different aspects, and the constraints implied by a joint distribution or structural effects may be too restrictive. We therefore consider a more flexible model that allows $\alpha$ and $\beta$ to differ in distribution and magnitude. Zhang [21] proposed a sign consistency penalty that promotes similarity in sign and yields more interpretable results. In many practical cases, results are hard to interpret when the coefficients in $\alpha$ and $\beta$ corresponding to the same covariate have conflicting signs. In this paper, we consider a sign consistency cure rate model with a sign-based penalty. In addition, models may perform poorly when the data are high dimensional, and grouping structures arise naturally in many practical cases, so a group lasso penalty is also imposed for group variable selection.
In this paper, we propose a cure rate model with group selection and sign consistency (CRGS), which selects important groups of covariates and promotes similarity in the signs of the coefficients $\alpha$ and $\beta$, thereby improving interpretability. Compared with individual variable selection approaches such as the sign consistency method in [22], the proposed group selection approach accounts for the grouping structure and can lead to better prediction. Compared with previously employed approaches based on a joint distribution or structural effects of the coefficients, the CRGS method avoids overly strict constraints between the coefficients and leads to more consistent, and hence more interpretable, results.
The paper is organized as follows. In Section 2, the sign consistency cure rate model with Weibull lifetime and the corresponding algorithm are introduced. A simulation study is presented in Section 3. Section 4 presents a real data application. Finally, Section 5 concludes.
2. Methods
2.1. Sign consistency cure rate model with Weibull lifetime
Consider data with $n$ subjects and $p$ covariates. Denote by $Y_i$ the time to event of subject $i$, that is, the time until the event of interest occurs. Let $C_i$ be the right-censoring time, and let $\delta_i = I(Y_i < C_i)$ be the censoring indicator of subject $i$, where $\delta_i = 0$ for censored and $\delta_i = 1$ for uncensored subjects. Denote $y_i = \min(Y_i, C_i)$. Note that the censored subjects include both the cured subjects and the uncured subjects for whom the event has not occurred by the censoring time, so the cure status $U_i$ is unobservable. The observable data are $\{(y_i, \delta_i, x_i),\, i = 1, \ldots, n\}$.
Denote by $x$ the covariate vector with grouping structure. Let $x = (x_1^T, \ldots, x_J^T)^T$ be the covariate vector with $J$ nonoverlapping subgroups, where $x_j = (x_{j1}, \ldots, x_{jp_j})^T$ is the $j$-th subgroup and $\sum_{j=1}^J p_j = p$. Grouping structures arise naturally in many practical cases. Examples include the expression of a multi-level factor by a group of dummy covariates and the expression of an additive model by several basis functions [23]. In addition, grouping structure can be introduced into a model using prior knowledge [24], for example, genes belonging to the same biological pathway [25].
In the incidence part of the cure rate model, we adopt logistic regression, a generalized linear model, to describe the probability of cure: $\pi_i(x_i) = 1/\big(1 + \exp(\alpha_0 + x_i^T\alpha)\big)$. Here $\alpha = (\alpha_1^T, \ldots, \alpha_J^T)^T$ is the vector of regression coefficients, $\alpha_j = (\alpha_{j1}, \ldots, \alpha_{jp_j})^T$ is the $j$-th subgroup of the coefficient vector, and $\alpha_0$ is the intercept.
In the latency part, let $\lambda_i$ be the rate parameter specified through a link function. The survival function is the probability that an individual survives to time $t$ given that the individual will eventually experience the event of interest, while $\lambda_i$ governs the probability of experiencing the event in the next instant of time.
We assume the time to event $t$ follows the Weibull distribution, a considerably flexible distribution for modeling lifetime data [26]. It has been justified as a valid lifetime distribution within the broad family of generalized gamma models [27,28]. Referring to [1,29], the survival function for the uncured subjects under the Weibull distribution can be written as
$$S(t\,|\,U_i = 1, x_i) = \exp\big(-(\lambda_i t)^r\big),$$
with probability density function $f(t\,|\,U_i = 1) = (r/t)(\lambda_i t)^r \exp(-(\lambda_i t)^r)$. Here $r > 0$ is the shape parameter (more discussion below). The link function can be written as
$$\lambda_i = \exp\big(\beta_0 + x_i^T\beta\big),$$
where $\beta_0$ is the intercept, $\beta = (\beta_1^T, \ldots, \beta_J^T)^T$ is the vector of regression coefficients, and the $j$-th subgroup of the coefficient vector is $\beta_j = (\beta_{j1}, \ldots, \beta_{jp_j})^T$.
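To fix ideas, the following minimal Python sketch implements the two model components and the mixture survival function (1.1). It is an illustration rather than the authors' implementation; the helper names are ours, and the log-link form of $\lambda_i$ follows the link function above.

```python
import numpy as np

def cure_prob(X, alpha0, alpha):
    """Incidence part: logistic probability of being cured, pi_i(x_i)."""
    return 1.0 / (1.0 + np.exp(alpha0 + X @ alpha))

def weibull_link(X, beta0, beta):
    """Latency part: log-link for the Weibull rate parameter lambda_i."""
    return np.exp(beta0 + X @ beta)

def weibull_surv(t, lam, r):
    """Conditional survival of the uncured: S(t | U=1) = exp(-(lam*t)^r)."""
    return np.exp(-(lam * t) ** r)

def weibull_dens(t, lam, r):
    """Conditional density: f(t | U=1) = (r/t) * (lam*t)^r * exp(-(lam*t)^r)."""
    return (r / t) * (lam * t) ** r * np.exp(-(lam * t) ** r)

def mixture_surv(t, X, alpha0, alpha, beta0, beta, r):
    """Population survival (1.1): pi + (1 - pi) * S(t | U=1)."""
    pi = cure_prob(X, alpha0, alpha)
    lam = weibull_link(X, beta0, beta)
    return pi + (1.0 - pi) * weibull_surv(t, lam, r)
```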
For the observable data $\{(y_i, \delta_i, x_i),\, i = 1, \ldots, n\}$, the log-likelihood function is
$$l(\alpha_0, \alpha, \beta_0, \beta) = \sum_{i=1}^n \Big\{\delta_i \log\big[(1 - \pi_i(x_i))\, f(y_i\,|\,U_i = 1)\big] + (1 - \delta_i) \log\big[\pi_i(x_i) + (1 - \pi_i(x_i))\, S(y_i\,|\,U_i = 1)\big]\Big\}.$$
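The log-likelihood can be evaluated directly from these components. The sketch below reuses the helpers defined above and follows the standard mixture-cure decomposition into uncensored and censored contributions.

```python
def obs_loglik(y, delta, X, alpha0, alpha, beta0, beta, r):
    """Observed-data log-likelihood of the mixture cure model:
    uncensored subjects contribute (1 - pi_i) * f(y_i | U_i = 1),
    censored subjects contribute pi_i + (1 - pi_i) * S(y_i | U_i = 1)."""
    pi = cure_prob(X, alpha0, alpha)
    lam = weibull_link(X, beta0, beta)
    f = weibull_dens(y, lam, r)
    S = weibull_surv(y, lam, r)
    return np.sum(delta * np.log((1.0 - pi) * f)
                  + (1.0 - delta) * np.log(pi + (1.0 - pi) * S))
```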
To promote sign consistency and perform variable selection, we propose the following objective function:
$$Q(\alpha_0, \alpha, \beta_0, \beta) = -l(\alpha_0, \alpha, \beta_0, \beta) + P_1(\alpha, \beta) + P_2(\alpha, \beta).$$
Here $P_1(\alpha, \beta)$ is the group variable selection penalty and $P_2(\alpha, \beta)$ is the sign consistency penalty:
$$P_1(\alpha, \beta) = \mu_1 \sum_{j=1}^J \big(\|\alpha_j\| + \|\beta_j\|\big), \qquad P_2(\alpha, \beta) = \mu_2 \sum_{j=1}^J \sum_{k=1}^{p_j} \big(\mathrm{sign}(\alpha_{jk}) - \mathrm{sign}(\beta_{jk})\big)^2,$$
where $\|\cdot\|$ is the $l_2$ norm, $\mathrm{sign}(\cdot)$ is the sign function, and $\mu_1 > 0$ and $\mu_2 > 0$ are tuning parameters controlling the degree of penalization.
The first penalty $P_1(\alpha, \beta)$ is a group lasso penalty, which conducts group variable selection and yields more accurate estimation. The group lasso performs well for group variable selection and is commonly used in the literature; it has been shown to be more robust to noise than the lasso when the underlying signal is strongly group-sparse [30]. In addition, unlike the methods in some existing works [20,22], the group lasso accounts for grouping structures, and it outperforms the group LARS and the group non-negative garrotte [31]. The second penalty $P_2(\alpha, \beta)$ promotes similarity in the signs of the coefficients $\alpha$ and $\beta$, which leads to more interpretable results. The proposed method is more flexible than methods based on structural effects [20] or a joint distribution [19] of the coefficients.
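As an illustration, the two penalties can be computed as follows. This is a sketch under the assumption of unweighted group norms; whether each group norm is scaled (e.g., by $\sqrt{p_j}$) is a detail not fixed here.

```python
import numpy as np

def group_lasso_penalty(alpha, beta, groups, mu1):
    """P1: group lasso over the J subgroups, mu1 * sum_j (||alpha_j|| + ||beta_j||)."""
    return mu1 * sum(np.linalg.norm(alpha[g]) + np.linalg.norm(beta[g])
                     for g in groups)

def sign_penalty(alpha, beta, mu2):
    """P2: penalizes coefficient pairs whose signs disagree."""
    return mu2 * np.sum((np.sign(alpha) - np.sign(beta)) ** 2)

# Example: p = 5 covariates in J = 2 groups of sizes 3 and 2.
groups = [np.arange(0, 3), np.arange(3, 5)]
```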
2.2. EGCD algorithm
In this section, we propose the Expectation Group Coordinate Descent (EGCD) algorithm to optimize the objective function. In the E-step, we introduce the latent unobserved $U_i$ to obtain a complete log-likelihood function. In the GCD-step, group coordinate descent is adopted to iteratively update one subgroup of parameters at a time, with the remaining parameters fixed at their most recent values. Since the sign function $\mathrm{sign}(\cdot)$ is neither continuous nor differentiable, it is hard to optimize directly, so we introduce the following smooth approximation for computational feasibility [21,22]:
$$\mathrm{sign}(t) \approx \frac{t}{\sqrt{t^2 + \tau}},$$
where $\tau$ is a small positive constant.
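A one-line implementation of this surrogate (the default $\tau$ is an arbitrary illustrative value):

```python
import numpy as np

def smooth_sign(t, tau=1e-4):
    """Differentiable surrogate for sign(t); approaches sign(t) as tau -> 0."""
    return t / np.sqrt(t ** 2 + tau)
```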
The EGCD algorithm iteratively updates $\alpha_0$, $\alpha$, $\beta_0$, and $\beta$; the $m$-th iteration proceeds as follows.
2.2.1. E-step
In the E-step of the $m$-th iteration, let $U_i^{[m]}$ denote the value of the latent $U_i$, and consider the complete data $\{(y_i, \delta_i, U_i^{[m]}, x_i),\, i = 1, \ldots, n\}$. The complete log-likelihood is
$$l^{[m]} = l_1^{[m]} + l_2^{[m]}, \tag{2.7}$$
where
$$l_1^{[m]} = \sum_{i=1}^n \Big[(1 - U_i^{[m]}) \log \pi_i(x_i) + U_i^{[m]} \log\big(1 - \pi_i(x_i)\big)\Big],$$
$$l_2^{[m]} = \sum_{i=1}^n U_i^{[m]} \Big[\delta_i \log f(y_i\,|\,U_i = 1) + (1 - \delta_i) \log S(y_i\,|\,U_i = 1)\Big].$$
Regarding the expectation of $U_i$, there are three possible situations: (a) $\delta_i = 0$ and $U_i = 0$: censored and cured, indicating a long-term survivor; (b) $\delta_i = 0$ and $U_i = 1$: censored and uncured, indicating a subject who will eventually experience the event but for whom the event has not occurred by the censoring time $C_i$; (c) $\delta_i = 1$ and $U_i = 1$: uncensored and uncured, indicating a subject who has experienced the event. Therefore, the expectation of $U_i$ is 1 when uncensored ($\delta_i = 1$). When censored ($\delta_i = 0$), the expectation of $U_i$ depends on the probability of cure and on the proportion of uncured subjects for whom the event has not occurred by time $t$. Denote the expectation of $U_i^{[m]}$ by $u_i^{[m]}$:
$$u_i^{[m]} = \delta_i + (1 - \delta_i)\, \frac{\big(1 - \pi_i(x_i)\big)\, S(y_i\,|\,U_i = 1)}{\pi_i(x_i) + \big(1 - \pi_i(x_i)\big)\, S(y_i\,|\,U_i = 1)}.$$
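A sketch of the E-step, reusing the helper functions defined earlier:

```python
def e_step(y, delta, X, alpha0, alpha, beta0, beta, r):
    """E-step: posterior expectation u_i of the uncured indicator U_i.
    Uncensored subjects are uncured (u_i = 1); for censored subjects,
    u_i = (1 - pi) * S(y) / (pi + (1 - pi) * S(y))."""
    pi = cure_prob(X, alpha0, alpha)
    S = weibull_surv(y, weibull_link(X, beta0, beta), r)
    return np.where(delta == 1, 1.0, (1.0 - pi) * S / (pi + (1.0 - pi) * S))
```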
Given the complete data $\{(y_i, \delta_i, U_i^{[m]}, x_i),\, i = 1, \ldots, n\}$, we take the expectation of $l^{[m]}$ in (2.7) with respect to $U_i^{[m]}$, which amounts to replacing $U_i^{[m]}$ by $u_i^{[m]}$ in $l_1^{[m]}$ and $l_2^{[m]}$.
The objective function can then be written as
$$Q^{[m]}(\alpha_0, \alpha, \beta_0, \beta) = -\big(l_1^{[m]} + l_2^{[m]}\big) + P_1(\alpha, \beta) + P_2(\alpha, \beta). \tag{2.14}$$
2.2.2. GCD-step
In the GCD-step, group coordinate descent is adopted to iteratively update $\alpha_0$, $\alpha$, $\beta_0$, and $\beta$. Group coordinate descent optimizes the objective function with respect to one group of parameters at a time and iteratively cycles through the parameter groups until convergence [25]. The intercept is updated by a Newton-Raphson step,
$$\alpha_0^{[m+1]} = \alpha_0^{[m]} - \frac{\partial l_1^{[m]}/\partial \alpha_0^{[m]}}{\partial^2 l_1^{[m]}/\partial \big(\alpha_0^{[m]}\big)^2},$$
where $\partial l_1^{[m]}/\partial \alpha_0^{[m]} = \sum_{i=1}^n \big(\pi_i^{[m]} + u_i^{[m]} - 1\big)$ and $\partial^2 l_1^{[m]}/\partial \big(\alpha_0^{[m]}\big)^2 = \sum_{i=1}^n \pi_i^{[m]}\big(\pi_i^{[m]} - 1\big)$.
For $\alpha_j^{[m+1]} \in \mathbb{R}^{p_j}$, we take a quadratic (Taylor-type) expansion of the objective function in (2.14) with respect to $\alpha_j$. Referring to the fast unified algorithm for the group lasso [32], the objective function is bounded above by
$$\frac{M_{1j}^{[m]}}{2}\,\Big\|\alpha_j - \frac{1}{M_{1j}^{[m]}}\,V_{1j}^{[m]}\Big\|^2 + \mu_1\|\alpha_j\| + \mathrm{const}, \tag{2.16}$$
where the $k$-th element of the gradient is $\partial l_1^{[m]}/\partial \alpha_{jk}^{[m]} = \sum_{i=1}^n \big(\pi_i^{[m]} + u_i^{[m]} - 1\big) x_{ijk}$. Here $V_{1j}^{[m]}$ is a vector of length $p_j$ and $M_{1j}^{[m]}$ is a constant:
$$V_{1j}^{[m]} = M_{1j}^{[m]}\,\alpha_j^{[m]} + \frac{\partial l_1^{[m]}}{\partial \alpha_j^{[m]}}, \qquad M_{1j}^{[m]} = \psi\Big(-\frac{\partial^2 l_1^{[m]}}{\partial \alpha_j^{[m]}\,\partial \big(\alpha_j^{[m]}\big)^T}\Big),$$
where $-\partial^2 l_1^{[m]}/\partial \alpha_{jk_1}^{[m]}\partial \alpha_{jk_2}^{[m]} = \sum_{i=1}^n \big(1 - \pi_i^{[m]}\big)\pi_i^{[m]} x_{ijk_1} x_{ijk_2}$, and $\psi(\cdot)$ is the maximum eigenvalue function. By minimizing (2.16), we obtain
$$\alpha_j^{[m+1]} = \frac{1}{M_{1j}^{[m]}}\Big(1 - \frac{\mu_1}{\|V_{1j}^{[m]}\|}\Big)_+ V_{1j}^{[m]},$$
where $(a)_+ = \max\{a, 0\}$.
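The groupwise update can be sketched as follows. This illustrates only the group soft-thresholding step; in the full algorithm, the gradient argument would be the gradient of $l_1^{[m]}$ (or $l_2^{[m]}$) at the current iterate, possibly adjusted for the smoothed sign penalty.

```python
import numpy as np

def update_group(theta_j, grad_j, M_j, mu1):
    """One groupwise MM update: form V_j = M_j * theta_j + grad_j, then
    apply group soft-thresholding, theta_j <- (1/M_j)(1 - mu1/||V_j||)_+ V_j."""
    V_j = M_j * theta_j + grad_j
    norm_V = np.linalg.norm(V_j)
    if norm_V <= mu1:                 # the whole group is zeroed out
        return np.zeros_like(theta_j)
    return (1.0 - mu1 / norm_V) * V_j / M_j
```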
Similarly, the intercept is updated by
$$\beta_0^{[m+1]} = \beta_0^{[m]} - \frac{\partial l_2^{[m]}/\partial \beta_0^{[m]}}{\partial^2 l_2^{[m]}/\partial \big(\beta_0^{[m]}\big)^2},$$
where $\partial l_2^{[m]}/\partial \beta_0^{[m]} = \sum_{i=1}^n \big(\delta_i - u_i^{[m]} r\, (y_i\lambda_i^{[m]})^r\big)$ and $\partial^2 l_2^{[m]}/\partial \big(\beta_0^{[m]}\big)^2 = -\sum_{i=1}^n u_i^{[m]} r^2 (y_i\lambda_i^{[m]})^r$.
For $\beta_j^{[m+1]} \in \mathbb{R}^{p_j}$, consider the analogous quadratic upper bound
$$\frac{M_{2j}^{[m]}}{2}\,\Big\|\beta_j - \frac{1}{M_{2j}^{[m]}}\,V_{2j}^{[m]}\Big\|^2 + \mu_1\|\beta_j\| + \mathrm{const}, \tag{2.21}$$
where the $k$-th element of the gradient is $\partial l_2^{[m]}/\partial \beta_{jk}^{[m]} = \sum_{i=1}^n \big(\delta_i x_{ijk} - u_i^{[m]} r\, x_{ijk} (y_i\lambda_i^{[m]})^r\big)$.
Here $V_{2j}^{[m]}$ is a vector of length $p_j$ and $M_{2j}^{[m]}$ is a constant:
$$V_{2j}^{[m]} = M_{2j}^{[m]}\,\beta_j^{[m]} + \frac{\partial l_2^{[m]}}{\partial \beta_j^{[m]}}, \qquad M_{2j}^{[m]} = \psi\Big(-\frac{\partial^2 l_2^{[m]}}{\partial \beta_j^{[m]}\,\partial \big(\beta_j^{[m]}\big)^T}\Big),$$
where $-\partial^2 l_2^{[m]}/\partial \beta_{jk_1}^{[m]}\partial \beta_{jk_2}^{[m]} = \sum_{i=1}^n u_i^{[m]} r^2 x_{ijk_1} x_{ijk_2} (y_i\lambda_i^{[m]})^r$.
By minimizing (2.21), we obtain
$$\beta_j^{[m+1]} = \frac{1}{M_{2j}^{[m]}}\Big(1 - \frac{\mu_1}{\|V_{2j}^{[m]}\|}\Big)_+ V_{2j}^{[m]}.$$
Regarding the parameters $\mu_1$, $\mu_2$, and $r$, Wang et al. [33] demonstrated that tuning parameters selected by a Bayesian information criterion (BIC) type criterion identify the true model consistently as long as the covariate dimension is fixed. The parameters $\mu_1$, $\mu_2$, and $r$ are therefore selected by BIC. The EGCD algorithm is summarized in Table 1.
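A minimal sketch of BIC-based tuning; approximating the degrees of freedom by the number of nonzero estimated coefficients is our assumption, since the exact definition is not spelled out here.

```python
import numpy as np

def bic(loglik, n_obs, df):
    """BIC-type criterion: -2 * log-likelihood + df * log(n); smaller is better."""
    return -2.0 * loglik + df * np.log(n_obs)

# Hypothetical tuning loop: fit the model over a grid of (mu1, mu2, r),
# take df as the number of nonzero estimated coefficients, and keep the
# triple with the smallest BIC.
```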
3. Simulation
In this section, we perform a numerical study to evaluate our method in terms of both variable selection and estimation performance. Variable selection is assessed using the (1) true positive rate (TPR) and (2) false positive rate (FPR); estimation is evaluated using the mean squared error (MSE) of the coefficient estimates.
3.1. Design and assessment
In Scenario 1, the covariates $x_{jk}$, $j = 1, \ldots, J$, $k = 1, \ldots, p_j$, are generated from a multivariate normal distribution. The correlation between covariates $x_{jk_m}$ and $x_{jk_n}$ in the same group is $\rho = 0.1^{|k_m - k_n|}$, whereas $\rho = 0$ for covariates in different groups. Scenario 2 considers discrete covariates; $x_{jk}$ is defined as follows.
The sample size is $n = 500$. We consider low-dimensional data with $p = 40$ and high-dimensional data with $p = 200$. The censoring time is generated from a Weibull distribution with shape parameter $r \in \{0.25, 2.5\}$. We compare the proposed method (CRGS) with three alternatives: the standard cure rate model without sign consistency or variable selection penalties (CR), the cure rate model with sign consistency (CRS), and the cure rate model with the group lasso penalty (CRG). For comparability, the alternatives also use logistic regression in the incidence part and the Weibull distribution in the latency part. The grouping structures and coefficients of the two scenarios are listed in Table 2.
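A sketch of the Scenario 1 covariate generator under these settings:

```python
import numpy as np

def gen_covariates(n, group_sizes, rho=0.1, seed=None):
    """Scenario 1 covariates: multivariate normal with correlation
    rho**|k_m - k_n| within a group and zero correlation across groups."""
    rng = np.random.default_rng(seed)
    blocks = []
    for pj in group_sizes:
        idx = np.arange(pj)
        cov = rho ** np.abs(idx[:, None] - idx[None, :])
        blocks.append(rng.multivariate_normal(np.zeros(pj), cov, size=n))
    return np.hstack(blocks)

# Example: n = 500 subjects, p = 40 covariates in 10 groups of size 4.
X = gen_covariates(500, [4] * 10, seed=0)
```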
Let $\theta \in \{\alpha, \beta\}$ and let $\hat{\theta}$ be the estimate of $\theta$. We evaluate variable selection performance in terms of $\mathrm{TPR}(\theta)$ and $\mathrm{FPR}(\theta)$:
$$\mathrm{TPR}(\theta) = \frac{\#\{k : \hat{\theta}_k \neq 0 \text{ and } \theta_k \neq 0\}}{\#\{k : \theta_k \neq 0\}}, \qquad \mathrm{FPR}(\theta) = \frac{\#\{k : \hat{\theta}_k \neq 0 \text{ and } \theta_k = 0\}}{\#\{k : \theta_k = 0\}}.$$
Estimation is evaluated by
$$\mathrm{MSE}(\theta) = \frac{1}{p}\sum_{k=1}^p \big(\hat{\theta}_k - \theta_k\big)^2.$$
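These metrics can be computed as follows for a single coefficient vector:

```python
import numpy as np

def tpr_fpr_mse(theta_hat, theta_true):
    """Selection (TPR/FPR) and estimation (MSE) metrics for one coefficient vector."""
    selected, truth = theta_hat != 0, theta_true != 0
    tpr = np.sum(selected & truth) / max(np.sum(truth), 1)
    fpr = np.sum(selected & ~truth) / max(np.sum(~truth), 1)
    mse = np.mean((theta_hat - theta_true) ** 2)
    return tpr, fpr, mse
```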
3.2. Results
Tables 3 and 4 summarize the means and standard deviations (in parentheses) of the MSEs, TPRs, and FPRs for Scenarios 1 and 2. Both scenarios are repeated 100 times.
As shown in Tables 3 and 4, the MSEs of the methods with the group lasso penalty (the proposed method and CRG) are smaller than those of the CRS and CR methods, and the MSEs of CRS and CR increase with the dimension. These results indicate that higher dimension leads to less efficient estimation and that the group lasso penalty improves estimation performance. Comparing the TPRs and FPRs of the proposed and CRG methods, the proposed method performs better in variable selection. Moreover, the proposed method has the lowest MSEs among all methods. Overall, the simulation results show that, compared with the alternatives, the proposed method improves both variable selection and estimation.
4. Analysis of credit data in China
In this section, we apply the proposed method to credit data from the retail business of a commercial bank in China. The data contain 16 covariates for 1213 personal loan customers from 2014 to 2019. The primary interest is to assess the credit risk of a credit loan and to identify important covariates for predicting the time to default. The mean observed time is 1.38 years with a standard deviation of 0.69. Customers with missing annual household income are removed from the analysis. After preprocessing, the covariates and their descriptions are summarized in Table 5. After transforming the multi-level covariates, there are 24 covariates in the credit model. The censoring time $C_i$ is the interval between the value date and either default or the end of observation (June 1, 2019). Because the loans have different value dates, the censoring times vary across individuals. Customers whose time to event $Y_i$ is longer than the censoring time $C_i$ are censored ($\delta_i = 0$). In these data, 1201 of the 1213 customers are censored.
The data are randomly divided into a training set and a test set in a 7:3 ratio. The training data are used to fit the model, and the test data are used to verify the performance of the fitted model. The parameters $\mu_1$, $\mu_2$, and $r$ are selected by BIC. Unlike in the simulations, the true coefficients are unknown for the real data; therefore, we adopt the negative log-likelihood to evaluate the performance of the methods. The mean negative log-likelihood (standard error) of the proposed method is 25.91 (11.61), compared with 56.33 (258.73), 31.04 (27.34), and 26.17 (13.02) for the CR, CRS, and CRG methods, respectively. For stability, all results are based on 100 replications. The results indicate that the proposed method achieves better prediction performance than the alternatives.
The coefficients are estimated over 100 replications. Using the median estimates of the coefficients, we compute the probability of cure (non-default) $\pi_i(x_i)$ for all customers. We dichotomize $\pi_i(x_i)$ at the median to obtain two groups of customers: the group with lower $\pi_i(x_i)$ is labeled "high risk", and the other, with higher $\pi_i(x_i)$, is labeled "low risk". Figure 1 presents the Kaplan-Meier curves of the survival of the customers in the two groups. The Kaplan-Meier curve describes the change in survival probability over time and is commonly used in survival analysis; see Rodrigues et al. [8] and Pal [34]. As shown in Figure 1, the "low risk" group has a higher survival probability than the "high risk" group.
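A sketch of this grouping-and-plotting step, assuming the lifelines package is available and pi_hat holds the estimated cure probabilities:

```python
import numpy as np
from lifelines import KaplanMeierFitter  # assumed installed

def km_by_risk(y, delta, pi_hat):
    """Split customers at the median estimated cure probability and plot
    Kaplan-Meier survival curves for the two groups."""
    high_risk = pi_hat < np.median(pi_hat)   # lower pi(x) -> "high risk"
    kmf = KaplanMeierFitter()
    ax = None
    for mask, label in [(high_risk, "high risk"), (~high_risk, "low risk")]:
        kmf.fit(y[mask], event_observed=delta[mask], label=label)
        ax = kmf.plot_survival_function(ax=ax)
```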
The median coefficient estimates over 100 replications are listed in Table 6. A positive coefficient in $\alpha$ indicates that the covariate is positively related to the probability of default, and a positive coefficient in $\beta$ indicates that the covariate is negatively related to the time to default. The probability of default and the time to default are two closely related credit aspects: customers with a higher probability of default are likely to default earlier. Compared with the alternative methods, the signs of $\alpha$ and $\beta$ estimated by the proposed method are more consistent, whereas for several covariates, such as housing status in the CR method, the estimated effects on the probability of default and the time to default conflict.
The coefficient estimates of the proposed method reveal that loan line, early repayment, gender, housing status, annual household income, employment status, type of workplace, and occupation are important covariates for credit risk assessment. The effects of business type and education on credit are unclear.
The loan line has a positive effect; one possible explanation is that customers with better credit status are more likely to obtain a higher loan line. Customers with housing loans are more likely to default. Higher annual household income is associated with better credit status. Employed customers are less likely to default: compared with other employment groups such as the self-employed, freelancers, and the unemployed, employed customers have more stable incomes. Customers who work in government organizations and institutions have better credit status. Customers who are managers or commercial and service workers are less likely to default. Customers with early repayment records tend to maintain good credit records and are less likely to default. Compared with women, men are more likely to default, which is consistent with the results of [35] and with men's greater risk preference [36].
5. Conclusions
The cure rate model is commonly used when the data contain long-term survivors. The model is composed of two parts: the incidence part describes the probability of cure, and the latency part describes the survival function of the uncured group. A drawback of the standard cure rate model is the assumption that there are no direct constraints between the coefficients corresponding to the same covariates in the two model parts, which may lead to conflicting estimated effects of a covariate on the probability of cure and on the conditional survival of the uncured group. In fact, the two parts of the model describe closely related aspects, and connections between the coefficients corresponding to the same covariates are to be expected. Existing works have considered a joint distribution or structural effects of the two sets of coefficients, which is too strict.
In this paper, we consider a more flexible cure rate model that allows the two sets of coefficients to differ in distribution and magnitude. The proposed sign consistency cure rate model promotes similarity in the signs of the coefficients in the two model parts to improve interpretability. In addition, we impose a group lasso penalty for variable selection. Simulation results show that, compared with the alternatives, the proposed method performs better in terms of variable selection and estimation. An analysis of credit data in China illustrates that the proposed method improves both prediction performance and interpretability.
Acknowledgments
We are grateful to the reviewers and the editor for their helpful comments and suggestions. This work was supported by the National Office for Philosophy and Social Sciences of China under Grant 20&ZD137.
Conflict of interest
The authors declare there is no conflict of interest.