Research article

Promote sign consistency in cure rate model with Weibull lifetime

  • Received: 27 October 2021 Accepted: 22 November 2021 Published: 25 November 2021
  • MSC : 62N01, 62N02

  • In survival analysis, the cure rate model is widely adopted when a proportion of subjects are long-term survivors. The cure rate model is composed of two parts: the incident part, which describes the probability of cure (infinite survival), and the latency part, which describes the conditional survival of the uncured subjects (finite survival). In the standard cure rate model, there are no constraints on the relations between the coefficients in the two model parts. However, in practical applications, the two model parts are closely related, and it is desirable to allow for relations between the two sets of coefficients corresponding to the same covariates. Existing works have incorporated a joint distribution or structural effects, which is too restrictive. In this paper, we consider a more flexible model that allows the two sets of coefficients to have different distributions and magnitudes. In many practical cases, it is hard to interpret the results when the two coefficients of the same covariate have conflicting signs. Therefore, we propose a sign consistency cure rate model with a sign-based penalty to improve interpretability. To accommodate high-dimensional data, we adopt a group lasso penalty for variable selection. Simulations and a real data analysis demonstrate that the proposed method has competitive performance compared with alternative methods.

    Citation: Chenlu Zheng, Jianping Zhu. Promote sign consistency in cure rate model with Weibull lifetime[J]. AIMS Mathematics, 2022, 7(2): 3186-3202. doi: 10.3934/math.2022176




    Survival analysis is a method to analyze the time to an event of interest, assuming that all subjects will eventually experience the event [1]. It has been widely applied in many fields such as medicine, engineering, and credit scoring [2,3,4]. However, in practice, a proportion of subjects may never experience the event of interest. They are considered 'cured', or in other words, long-term survivors. For instance, in medical practice, a fraction of patients may be cured of their disease. Therefore, the cure rate model, an extension of survival analysis, is introduced to model data with long-term survivors [5].

    Let x_i be the covariate vector of subject i, and S_i(t|x_i) be the survival probability at time t. The cure rate model can be written as

    S_i(t|x_i) = π_i(x_i) + (1 − π_i(x_i)) S_i(t|U_i=1, x_i), (1.1)

    where U_i is a binary random variable indicating the cure status of subject i: U_i = 0 denotes that subject i is cured and will never experience the event, and U_i = 1 otherwise. Here π_i(x_i) is the probability of being cured, with regression coefficient vector α, and S_i(t|U_i=1, x_i) is the survival function of the uncured, with regression coefficient vector β.

    As shown in (1.1), the cure rate model is composed of two parts. The first part is the incident part, which predicts the probability of being cured. The second part is the latency part, which describes the conditional survival of the uncured subjects. Since the cure rate model can predict whether subjects are cured and the time to event of the uncured subjects, it is commonly adopted when a proportion of subjects have long-term survivors [6].
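The two-part structure of (1.1) can be sketched numerically. The following is a minimal illustration, not the authors' code; the unit-rate exponential conditional survival is an arbitrary placeholder for the uncured part.

```python
import math

def mixture_survival(t, pi, cond_survival):
    """Cure rate model (1.1): S(t) = pi + (1 - pi) * S(t | U = 1)."""
    return pi + (1.0 - pi) * cond_survival(t)

# Placeholder conditional survival for the uncured (unit-rate exponential).
uncured = lambda t: math.exp(-t)

# Every subject is alive at t = 0; as t grows, S(t) levels off at the cure
# probability pi instead of decaying to zero.
s0 = mixture_survival(0.0, 0.3, uncured)       # -> 1.0
s_late = mixture_survival(50.0, 0.3, uncured)  # -> ~0.3
```

The plateau at π_i(x_i) is what distinguishes the cure rate model from standard survival models, whose survival functions decay to zero.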

    There are numerous studies in the literature on extensions of the cure rate model. Cooner et al. [7] proposed a flexible hierarchical cure rate model to distinguish among underlying mechanisms that lead to cure. Rodrigues et al. [8] assumed the number of risk factors to follow the Conway-Maxwell Poisson distribution and proposed a Conway-Maxwell Poisson cure rate model that unifies several cure rate models. Li et al. [9] considered a mixture of linear dependent tail-free processes as the prior for the distribution of the cure rate parameter to develop a latent promotion time cure rate model. A cure rate model incorporating a penalized spline has been shown to have better predictive performance [10]. Georgiana proposed a Bayesian spatial cure rate model with Weibull lifetime to model spatial variability in the censoring mechanism [11]. Pal et al. [12] proposed a projected non-linear conjugate gradient algorithm for the cure rate model under a competing risks scenario. Besides, many works have developed semi-parametric and nonparametric methods to investigate the effects of covariates on the outcome. For instance, Li et al. [13] proposed a semi-parametric additive predictor consisting of a sum of linear and nonparametric terms in the incident part, and Chen et al. [14] modeled the covariate effects in a nonparametric form.

    However, many existing works pay much less attention to the relations between the coefficient vectors α and β of the two model parts. Generally, they assume that there are no direct constraints on the relation between the coefficient vectors α and β corresponding to the same covariates [15,16]. In other words, these works assume that the probability of being cured and the time to event are independent, with no direct constraints between α and β.

    Notably, in practical applications, the two model parts are quite related. The incident part describes the probability of cure (infinite survival), and the latency part describes the conditional survival of the uncured subjects (finite survival), so some connection between the coefficient vectors α and β corresponding to the same covariates is to be expected. Theoretical derivations and case studies also suggest that relaxing the assumption of no direct constraints on the regression coefficients can improve model performance [17,18]. Liu et al. [19] relaxed this assumption by establishing a joint distribution of the covariates and the logarithm of the hazard rate. Fan et al. [20] incorporated structural effects of α and β in the cure rate model to improve estimation accuracy and interpretability. A joint distribution or structural effects of α and β are crude yet effective ways to impose relations between the coefficients. However, the two parts of the model still describe two different aspects, and assuming a joint distribution or structural effects is too restrictive and might not work well. We therefore consider a more flexible model that allows the coefficients α and β to have different distributions and magnitudes. Zhang [21] proposed a sign consistency penalty that promotes similarity in sign to obtain more interpretable results. In many practical cases, it is hard to interpret the results when the coefficients α and β corresponding to the same covariates have conflicting signs. In this paper, we consider a sign consistency cure rate model with a sign-based penalty. In addition, models may perform poorly on high-dimensional data, and grouping structure arises naturally in many practical cases, so a group lasso penalty is also imposed for group variable selection.

    In this paper, we propose a cure rate model with group selection and sign consistency (CRGS), which can select important groups of covariates and promote similarity in the signs of coefficients. By promoting similarity in the signs of α and β, the proposed method improves interpretability. Compared with individual variable selection approaches such as the sign consistency method in [22], the proposed group selection approach takes the grouping structure into consideration and can lead to better prediction. Compared with previously employed approaches such as a joint distribution or structural effects of coefficients, the CRGS method avoids overly strict constraints between the coefficients and leads to more consistent signs and hence more interpretable results.

    The paper is organized as follows. In section 2, the sign consistency cure rate model with Weibull lifetime and the corresponding algorithm are introduced. Simulations are presented in section 3. Section 4 displays a real data application. Finally, section 5 concludes.

    Consider data with n subjects and p covariates. Denote Y_i as the time to event of subject i, that is, the time until the event of interest occurs. Let C_i be the right-censoring time, and δ_i = I(Y_i < C_i) be the censoring indicator of subject i, where δ_i = 0 for censored and δ_i = 1 for uncensored. Denote y_i = min(Y_i, C_i). Note that the censored subjects include both the cured subjects and the uncured subjects for whom the event has not occurred by the censoring time, so the cure status U_i is unobservable. The observable data are {(y_i, δ_i, x_i), i = 1, …, n}.

    Denote x as a covariate vector with grouping structure. Let x = (x_1^T, …, x_J^T)^T be the covariate vector with J nonoverlapping subgroups, where x_j = (x_{j1}, …, x_{jp_j})^T is the j-th subgroup, with ∑_{j=1}^J p_j = p. Grouping structure arises naturally in many practical cases. Examples include the expression of a multi-level factor by a group of dummy covariates and the expression of an additive model by several basis functions [23]. In addition, grouping structure can also be introduced into a model through prior knowledge [24], for example, genes belonging to the same biological pathway [25].

    In the incident part of the cure rate model, we adopt logistic regression, a generalized linear model, to describe the probability of cure: π_i(x_i) = 1/(1 + exp(α_0 + x_i^T α)). Here α = (α_1^T, …, α_J^T)^T is the vector of regression coefficients, α_j = (α_{j1}, …, α_{jp_j})^T is the j-th subgroup of the coefficient vector, and α_0 is the intercept.

    In the latency part, let λ_i be the rate parameter determined by a link function of the covariates. The survival function is the chance that an individual survives to time t given that the individual will eventually experience the event of interest, while λ_i scales the hazard, the probability of experiencing the event in the next instant of time t.

    We assume the time to event t follows the Weibull distribution, a considerably flexible distribution for modeling lifetime data [26]. It has been justified as a valid lifetime distribution within the broad family of generalized gamma models [27,28]. Referring to [1,29], the survival function for the uncured subjects following the Weibull distribution can be written as

    S_i(t|U_i=1) = exp(−(λ_i t)^r), (2.1)

    with probability density function f(t|U_i=1) = (r/t)(λ_i t)^r exp(−(λ_i t)^r). Here r > 0 is the shape parameter (more discussion below). The link function can be written as

    λ_i = exp(β_0 + x_i^T β), (2.2)

    where β_0 is the intercept, β = (β_1^T, …, β_J^T)^T is the vector of regression coefficients, and the j-th subgroup of the coefficient vector is β_j = (β_{j1}, …, β_{jp_j})^T.
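As a sanity check on (2.1) and (2.2), the Weibull density stated above should equal the negative derivative of the survival function. The following is an illustrative sketch, not the authors' code; the parameter values are arbitrary.

```python
import math

def weibull_survival(t, lam, r):
    """S(t | U = 1) = exp(-(lam * t)^r), Eq. (2.1)."""
    return math.exp(-((lam * t) ** r))

def weibull_density(t, lam, r):
    """f(t | U = 1) = (r / t) * (lam * t)^r * exp(-(lam * t)^r)."""
    return (r / t) * ((lam * t) ** r) * math.exp(-((lam * t) ** r))

def link(beta0, xb):
    """lam = exp(beta0 + x^T beta), Eq. (2.2); `xb` stands for x^T beta."""
    return math.exp(beta0 + xb)

# f should match -dS/dt (central difference) at an arbitrary point.
lam, r, t, h = link(-0.5, 0.2), 1.7, 1.3, 1e-6
numeric = -(weibull_survival(t + h, lam, r) - weibull_survival(t - h, lam, r)) / (2 * h)
```

The agreement of `numeric` with `weibull_density(t, lam, r)` confirms that the survival and density expressions above are mutually consistent.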

    For the observable data {(y_i, δ_i, x_i), i = 1, …, n}, the log-likelihood function is

    L(α_0, α, β_0, β) = ∑_{i=1}^n δ_i (log(1 − π_i) + log(r) − log(y_i) + r log(λ_i y_i) − (λ_i y_i)^r) + ∑_{i=1}^n (1 − δ_i) log((1 − π_i) exp(−(λ_i y_i)^r) + π_i). (2.3)
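A direct numpy transcription of (2.3) may be helpful; this is a sketch under the notation above (π_i from the logistic incident part, λ_i from (2.2)), not the authors' implementation.

```python
import numpy as np

def observed_loglik(alpha0, alpha, beta0, beta, y, delta, X, r):
    """Observed-data log-likelihood (2.3) of the Weibull cure rate model."""
    pi = 1.0 / (1.0 + np.exp(alpha0 + X @ alpha))   # P(cured | x_i)
    lam = np.exp(beta0 + X @ beta)                  # Weibull rate, Eq. (2.2)
    lt_r = (lam * y) ** r
    # Uncensored subjects contribute log[(1 - pi_i) * f(y_i)].
    uncensored = delta * (np.log(1 - pi) + np.log(r) - np.log(y)
                          + r * np.log(lam * y) - lt_r)
    # Censored subjects contribute log[pi_i + (1 - pi_i) * S(y_i)].
    censored = (1 - delta) * np.log((1 - pi) * np.exp(-lt_r) + pi)
    return float(np.sum(uncensored + censored))
```

The two sums mirror the two terms of (2.3): the first collects the uncensored contributions, the second the censored mixture term.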

    For promoting sign consistency and variable selection, we propose the following objective function

    Q(α_0, α, β_0, β) = −L(α_0, α, β_0, β) + P_1(α, β) + P_2(α, β). (2.4)

    Here P1(α,β) is the group variable selection penalty function, and P2(α,β) is the sign consistency penalty function as follows.

    P_1(α, β) = μ_1 ∑_{j=1}^J √p_j (‖α_j‖ + ‖β_j‖), P_2(α, β) = (μ_2/2) ∑_{j=1}^J ∑_{k=1}^{p_j} (sign(α_jk) − sign(β_jk))^2, (2.5)

    where ‖·‖ is the l_2 norm, sign(·) is the sign function, and μ_1 > 0 and μ_2 > 0 are tuning parameters controlling the degree of penalization.

    The first penalty P_1(α, β) is a group lasso penalty, which conducts group variable selection and yields more accurate estimates. The group lasso performs well at group variable selection and is commonly used in the literature. It has been shown to be more robust to noise than the lasso when the underlying signal is strongly group-sparse [30]. In addition, unlike the methods in some existing works [20,22], the group lasso accounts for grouping structure, and it outperforms the group LARS and the group non-negative garrotte [31]. The second penalty P_2(α, β) promotes similarity in the signs of the coefficients α and β, which leads to more interpretable results. The proposed method is more flexible than methods imposing structural effects [20] or a joint distribution [19] on the coefficients.
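In code, the two penalties in (2.5) might look as follows (a sketch; `groups`, a list of index arrays with one entry per subgroup, is an assumed representation of the grouping structure):

```python
import numpy as np

def p1_group_lasso(alpha, beta, groups, mu1):
    """P1: mu1 * sum_j sqrt(p_j) * (||alpha_j|| + ||beta_j||)."""
    total = 0.0
    for idx in groups:                       # idx: indices of subgroup j
        w = np.sqrt(len(idx))                # group-size weight sqrt(p_j)
        total += w * (np.linalg.norm(alpha[idx]) + np.linalg.norm(beta[idx]))
    return mu1 * total

def p2_sign(alpha, beta, mu2):
    """P2: (mu2 / 2) * sum_jk (sign(alpha_jk) - sign(beta_jk))^2."""
    return 0.5 * mu2 * np.sum((np.sign(alpha) - np.sign(beta)) ** 2)
```

Note that P2 is zero when every coefficient pair agrees in sign, costs 2μ_2 per pair with conflicting signs, and μ_2/2 per pair where exactly one coefficient is zero.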

    In this section, we propose the Expectation Group Coordinate Descent (EGCD) algorithm to optimize the objective function. In the E-step, we introduce the latent unobserved U_i to obtain a complete log-likelihood function. In the GCD-step, group coordinate descent is adopted to iteratively update a subgroup of parameters with the remaining parameters fixed at their most recent values. Since the sign function sign(·) is neither differentiable nor continuous, it is hard to optimize, so we introduce the following approximation for computational feasibility [21,22].

    (sign(α_jk) − sign(β_jk))^2 ≈ (α_jk/(|α_jk| + τ) − β_jk/(|β_jk| + τ))^2, (2.6)

    where τ is a small positive constant.
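One can check numerically that the surrogate in (2.6) approaches the exact squared sign difference as τ shrinks. An illustrative sketch:

```python
def smooth_sign_diff(a, b, tau):
    """Surrogate (2.6) for (sign(a) - sign(b))^2 with smoothing constant tau."""
    return (a / (abs(a) + tau) - b / (abs(b) + tau)) ** 2

def sign_diff(a, b):
    """Exact (sign(a) - sign(b))^2."""
    sgn = lambda v: (v > 0) - (v < 0)
    return (sgn(a) - sgn(b)) ** 2

# With conflicting signs the exact value is 4; the surrogate is close for small tau.
approx = smooth_sign_diff(2.0, -3.0, 1e-8)   # -> ~4.0
```

Unlike sign(·), the surrogate a/(|a| + τ) is continuous everywhere, including at zero, which is what makes the GCD updates below tractable.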

    The EGCD algorithm iteratively updates α_0, α, β_0, and β; the m-th iteration proceeds as follows.

    In the E-step of the m-th iteration, let the imputed value of the latent U_i be U_i^[m], and consider the complete data {(y_i, δ_i, U_i^[m], x_i), i = 1, …, n}. The complete log-likelihood is

    L^[m] = ∑_{i=1}^n ((1 − U_i^[m]) log(π_i^[m]) + U_i^[m] log(1 − π_i^[m])) + ∑_{i=1}^n (δ_i (β_0^[m] + x_i^T β^[m]) − U_i^[m] (y_i λ_i^[m])^r) := L_1^[m] + L_2^[m], (2.7)

    where

    π_i^[m] = 1/(1 + exp(α_0^[m] + x_i^T α^[m])), (2.8)
    λ_i^[m] = exp(β_0^[m] + x_i^T β^[m]). (2.9)

    Regarding the expectation of U_i, there are three possible situations: (a) δ_i = 0 and U_i = 0, censored and cured, indicating long-term survivors; (b) δ_i = 0 and U_i = 1, censored and uncured, indicating subjects who will eventually experience the event but for whom it has not occurred by the censoring time C_i; (c) δ_i = 1 and U_i = 1, uncensored and uncured, indicating subjects who have experienced the event. Therefore, the expectation of U_i is 1 when uncensored (δ_i = 1). When censored (δ_i = 0), the expectation of U_i depends on the probability of cure and on the proportion of uncured subjects for whom the event has not occurred by time t. Denote the expectation of U_i^[m] as u_i^[m]:

    u_i^[m] = E(U_i^[m]) = 1 if δ_i = 1, and u_i^[m] = (1 − π_i^[m]) exp(−(λ_i^[m] y_i)^r) / (π_i^[m] + (1 − π_i^[m]) exp(−(λ_i^[m] y_i)^r)) if δ_i = 0. (2.10)
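The E-step update (2.10) vectorizes directly. A sketch using numpy, with π_i and λ_i as computed in (2.8) and (2.9):

```python
import numpy as np

def e_step(y, delta, pi, lam, r):
    """Expected cure-status indicator u_i from (2.10)."""
    s = np.exp(-(lam * y) ** r)                       # Weibull survival of the uncured
    posterior = (1 - pi) * s / (pi + (1 - pi) * s)    # P(U_i = 1 | censored)
    return np.where(delta == 1, 1.0, posterior)
```

Uncensored subjects are known to be uncured (u_i = 1); for censored subjects, u_i is the posterior probability of being uncured, which shrinks toward 0 as the cure probability π_i grows or as the observed time y_i becomes large.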

    Given the complete data {(y_i, δ_i, U_i^[m], x_i), i = 1, …, n}, we take the expectation of L^[m] in (2.7), replacing U_i^[m] by its expectation u_i^[m]:

    E(L^[m]) = l_1^[m] + l_2^[m], (2.11)

    where

    l_1^[m] = E(L_1^[m]) = ∑_{i=1}^n ((1 − u_i^[m]) log(π_i^[m]) + u_i^[m] log(1 − π_i^[m])), (2.12)
    l_2^[m] = E(L_2^[m]) = ∑_{i=1}^n (δ_i (β_0^[m] + x_i^T β^[m]) − u_i^[m] (y_i exp(β_0^[m] + x_i^T β^[m]))^r). (2.13)

    The objective function can be written as

    −(l_1 + l_2) + μ_1 ∑_{j=1}^J √p_j (‖α_j‖ + ‖β_j‖) + (μ_2/2) ∑_{j=1}^J ∑_{k=1}^{p_j} (α_jk/(|α_jk| + τ) − β_jk/(|β_jk| + τ))^2. (2.14)

    In the GCD-step, group coordinate descent is adopted to iteratively update α_0, α, β_0, and β. Group coordinate descent optimizes the objective function with respect to one group of parameters at a time and iteratively cycles through the parameter groups until convergence [25]. The intercept α_0^[m+1] is updated by

    α_0^[m+1] = α_0^[m] − (∂²l_1^[m]/∂(α_0^[m])²)^{−1} ∂l_1^[m]/∂α_0^[m], (2.15)

    where ∂l_1^[m]/∂α_0^[m] = ∑_{i=1}^n (π_i^[m] + u_i^[m] − 1), and ∂²l_1^[m]/∂(α_0^[m])² = ∑_{i=1}^n π_i^[m] (π_i^[m] − 1).

    For α_j^[m+1] ∈ R^{p_j}, we can obtain a quadratic Taylor expansion of the objective function in (2.14) with respect to α_j. Referring to the fast unified algorithm for group lasso [32], an upper bound of the objective function can be written as

    (M_1j^[m]/2) (α_j^[m+1] − α_j^[m])^T (α_j^[m+1] − α_j^[m]) − (α_j^[m+1] − α_j^[m])^T (∂l_1^[m]/∂α_j^[m] + μ_2 V_1j^[m]) + μ_1 √p_j ‖α_j^[m+1]‖, (2.16)

    where ∂l_1^[m]/∂α_jk^[m] = ∑_{i=1}^n (π_i^[m] + u_i^[m] − 1) x_ijk. Here V_1j^[m] is a vector of length p_j, and M_1j^[m] is a constant, as follows.

    V_1j^[m] = (1/(|α_jk^[m]| + τ) (β_jk^[m]/(|β_jk^[m]| + τ) − α_jk^[m]/(|α_jk^[m]| + τ)))_{1≤k≤p_j},
    M_1j^[m] = ψ((−∂²l_1^[m]/∂α_jk1^[m] ∂α_jk2^[m])_{1≤k1,k2≤p_j}) + max_k (μ_2 (1/(|α_jk^[m]| + τ))²), (2.17)

    where ∂²l_1^[m]/∂α_jk1^[m] ∂α_jk2^[m] = −∑_{i=1}^n (1 − π_i^[m]) π_i^[m] x_ijk1 x_ijk2, and ψ(·) is the maximum-eigenvalue function. By minimizing (2.16), we obtain α_j^[m+1]:

    α_j^[m+1] = ((M_1j^[m] α_j^[m] + ∂l_1^[m]/∂α_j^[m] + μ_2 V_1j^[m]) / M_1j^[m]) (1 − μ_1 √p_j / ‖M_1j^[m] α_j^[m] + ∂l_1^[m]/∂α_j^[m] + μ_2 V_1j^[m]‖)_+, (2.19)

    where (a)_+ = max{a, 0}.
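Both groupwise updates (2.19) and (2.23) share the same multiplicative soft-thresholding form, which could be sketched as follows. This is a hypothetical helper, not the authors' code; `grad_j` stands for the gradient ∂l/∂θ_j, and `V_j`, `M_j` for the quantities defined in (2.17)/(2.22).

```python
import numpy as np

def group_prox_update(theta_j, grad_j, V_j, M_j, mu1, mu2):
    """Groupwise update of the form (2.19)/(2.23):
    z = M*theta_j + grad_j + mu2*V_j; new theta_j = (z/M) * (1 - mu1*sqrt(p_j)/||z||)_+."""
    z = M_j * theta_j + grad_j + mu2 * V_j
    norm_z = np.linalg.norm(z)
    if norm_z == 0.0:                  # whole group already at zero
        return np.zeros_like(theta_j)
    shrink = max(0.0, 1.0 - mu1 * np.sqrt(len(theta_j)) / norm_z)
    return (z / M_j) * shrink
```

When μ_1 is large enough relative to ‖z‖, the whole subgroup is set to zero at once, which is exactly the group-selection behavior of the group lasso penalty.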

    Similarly, the intercept β_0^[m+1] is updated by

    β_0^[m+1] = β_0^[m] − (∂²l_2^[m]/∂(β_0^[m])²)^{−1} ∂l_2^[m]/∂β_0^[m], (2.20)

    where ∂l_2^[m]/∂β_0^[m] = ∑_{i=1}^n (δ_i − u_i^[m] r (y_i λ_i^[m])^r), and ∂²l_2^[m]/∂(β_0^[m])² = −∑_{i=1}^n u_i^[m] r² (y_i λ_i^[m])^r.

    For β_j^[m+1] ∈ R^{p_j}, consider the optimization function

    (M_2j^[m]/2) (β_j^[m+1] − β_j^[m])^T (β_j^[m+1] − β_j^[m]) − (β_j^[m+1] − β_j^[m])^T (∂l_2^[m]/∂β_j^[m] + μ_2 V_2j^[m]) + μ_1 √p_j ‖β_j^[m+1]‖, (2.21)

    where ∂l_2^[m]/∂β_jk^[m] = ∑_{i=1}^n (δ_i x_ijk − u_i^[m] r x_ijk (y_i λ_i^[m])^r).

    Here V_2j^[m] is a vector of length p_j, and M_2j^[m] is a constant, as follows.

    V_2j^[m] = (1/(|β_jk^[m]| + τ) (α_jk^[m]/(|α_jk^[m]| + τ) − β_jk^[m]/(|β_jk^[m]| + τ)))_{1≤k≤p_j},
    M_2j^[m] = ψ((−∂²l_2^[m]/∂β_jk1^[m] ∂β_jk2^[m])_{1≤k1,k2≤p_j}) + max_k (μ_2 (1/(|β_jk^[m]| + τ))²), (2.22)

    where ∂²l_2^[m]/∂β_jk1^[m] ∂β_jk2^[m] = −∑_{i=1}^n u_i^[m] r² x_ijk1 x_ijk2 (y_i λ_i^[m])^r.

    By minimizing (2.21), we obtain β_j^[m+1]:

    β_j^[m+1] = ((M_2j^[m] β_j^[m] + ∂l_2^[m]/∂β_j^[m] + μ_2 V_2j^[m]) / M_2j^[m]) (1 − μ_1 √p_j / ‖M_2j^[m] β_j^[m] + ∂l_2^[m]/∂β_j^[m] + μ_2 V_2j^[m]‖)_+. (2.23)

    Regarding the parameters μ_1, μ_2, and r, Wang et al. [33] demonstrated that tuning parameters selected by a Bayesian information criterion (BIC) type criterion can identify the true model consistently as long as the covariate dimension is fixed. So the parameters μ_1, μ_2, and r are selected by the BIC. The EGCD algorithm is summarized in Table 1.
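Tuning by BIC can be sketched as a grid search. Here `fit_model` and `eval_loglik` are hypothetical stand-ins for running the EGCD algorithm and evaluating (2.3), and the degrees of freedom are counted as the number of nonzero estimated coefficients, one common convention; the paper does not spell out its df definition.

```python
import numpy as np
from itertools import product

def select_by_bic(fit_model, eval_loglik, mu1_grid, mu2_grid, r_grid, n):
    """Pick (mu1, mu2, r) minimizing BIC = -2 * loglik + df * log(n)."""
    best, best_bic = None, np.inf
    for mu1, mu2, r in product(mu1_grid, mu2_grid, r_grid):
        coefs = fit_model(mu1, mu2, r)        # fitted coefficients, stacked
        df = int(np.count_nonzero(coefs))     # nonzero coefficients as df
        bic = -2.0 * eval_loglik(coefs, r) + df * np.log(n)
        if bic < best_bic:
            best, best_bic = (mu1, mu2, r), bic
    return best, best_bic
```

The same loop covers the Weibull shape r, which the paper treats as a tuning parameter rather than estimating it inside the EM iterations.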

    Table 1.  EGCD algorithm.
    Expectation Group Coordinate Descent Algorithm
    1. Initialize m = 0, α_0^[m] = β_0^[m] = 0, and α^[m] = β^[m] = 0_p;
    2. Repeat the following updates:
      2.1 E-step:
        Update π_i^[m] from (2.8), λ_i^[m] from (2.9), and u_i^[m] from (2.10);
      2.2 GCD-step:
        Update α_0^[m+1] from (2.15);
        For j = 1, …, J, update α_j^[m+1] from (2.19);
        Update β_0^[m+1] from (2.20);
        For j = 1, …, J, update β_j^[m+1] from (2.23);
        m = m + 1;
      Until max_j {‖α_j^[m+1] − α_j^[m]‖, ‖β_j^[m+1] − β_j^[m]‖} ≤ 5×10⁻³.


    In this section, we perform a numerical study to evaluate our method in terms of both variable selection and estimation performance. Variable selection is assessed in terms of (1) the true positive rate (TPR) and (2) the false positive rate (FPR). Estimation is evaluated in terms of the mean square error (MSE) of the coefficient estimates.

    In Scenario 1, the covariates x_jk, j = 1, …, J, k = 1, …, p_j, are generated from a multivariate normal distribution. The correlation coefficient of covariates x_jk_m and x_jk_n in the same group is ρ = 0.1^{|k_m − k_n|}, whereas ρ = 0 for covariates in different groups. We consider the case with discrete covariates in Scenario 2, in which x_jk is defined as follows.

    x_jk = { x_jk,          j ≤ J/3,
             I(x_jk > 0),   j > J/3.    (3.1)

    The sample size is n = 500. We consider low-dimensional data with p = 40 and high-dimensional data with p = 200. The censoring time is generated from a Weibull distribution with shape parameter r ∈ {0.25, 2.5}. We compare the proposed method (CRGS) with three alternative methods: the standard cure rate model without sign consistency or variable selection penalties (CR), the cure rate model with sign consistency (CRS), and the cure rate model with the group lasso penalty (CRG). For comparability, the alternatives also use logistic regression in the incident part and the Weibull distribution in the latency part. The grouping structure and coefficients of the two scenarios are listed in Table 2.
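The simulated design can be reproduced along these lines. This is a sketch, not the authors' code; it assumes the within-group correlation decays as 0.1^{|k_m − k_n|} (an AR-type structure) and that the first J/3 groups stay continuous in Scenario 2.

```python
import numpy as np

def scenario1_covariates(n, J, pj, rho=0.1, seed=0):
    """Draw n observations of J independent groups of pj covariates each,
    with within-group correlation rho**|k_m - k_n| and zero correlation across groups."""
    rng = np.random.default_rng(seed)
    k = np.arange(pj)
    sigma = rho ** np.abs(k[:, None] - k[None, :])   # pj x pj correlation block
    blocks = [rng.multivariate_normal(np.zeros(pj), sigma, size=n) for _ in range(J)]
    return np.hstack(blocks)                         # shape (n, J * pj)

def scenario2_covariates(x, J, pj):
    """Scenario 2 (3.1): keep the first J/3 groups continuous, dichotomize the rest."""
    x = x.copy()
    cut = (J // 3) * pj
    x[:, cut:] = (x[:, cut:] > 0).astype(float)      # I(x_jk > 0)
    return x
```

Dichotomizing two thirds of the groups mimics the mix of discrete and continuous covariates that arises in applications such as the credit data of section 4.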

    Table 2.  Grouping structure and coefficients in Scenarios 1 and 2.
                              Scenario 1                                                            Scenario 2
    Non-zero subgroups (α)    (0.5, …, 0.5)_5, (0.5, …, 0.5)_5, (0.5, …, 0.5)_5, (0.4, …, 0.4)_5    (0.5, …, 0.5)_15, (0.4, …, 0.4)_5
    Non-zero subgroups (β)    (0.1, …, 0.1)_5, (0.1, …, 0.1)_5, (0.1, …, 0.1)_5, (0.3, …, 0.3)_5    (0.1, …, 0.1)_15, (0.3, …, 0.3)_5
    Covariates                Continuous                                                            Discrete and continuous
    Here (c, …, c)_q denotes a subgroup of q coefficients all equal to c.


    Let θ ∈ {α, β}, and let θ̂ be the estimate of θ. We evaluate variable selection performance in terms of TPR(θ) and FPR(θ):

    TPR(θ) = TP/(TP + FN), FPR(θ) = FP/(TN + FP), (3.2)

    where

    TP = ∑_{j=1}^p I(θ_j ≠ 0, θ̂_j ≠ 0), TP + FN = ∑_{j=1}^p I(θ_j ≠ 0),
    FP = ∑_{j=1}^p I(θ_j = 0, θ̂_j ≠ 0), TN + FP = ∑_{j=1}^p I(θ_j = 0). (3.3)

    Estimates are evaluated by MSE(θ):

    MSE(θ) = ∑_{j=1}^p (θ_j − θ̂_j)² / ∑_{j=1}^p θ_j². (3.4)
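The three criteria translate directly into code (a sketch):

```python
import numpy as np

def evaluation(theta, theta_hat):
    """TPR and FPR from (3.2)-(3.3) and relative MSE from (3.4)."""
    theta = np.asarray(theta, dtype=float)
    theta_hat = np.asarray(theta_hat, dtype=float)
    true_nz = theta != 0            # truly active coefficients
    est_nz = theta_hat != 0         # selected coefficients
    tpr = np.sum(true_nz & est_nz) / np.sum(true_nz)
    fpr = np.sum(~true_nz & est_nz) / np.sum(~true_nz)
    mse = np.sum((theta - theta_hat) ** 2) / np.sum(theta ** 2)
    return tpr, fpr, mse
```

Note that (3.4) normalizes by ∑θ_j², so the reported MSEs are relative errors and comparable across the α and β parts despite their different magnitudes.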

    Tables 3 and 4 summarize the means and standard deviations (in parentheses) of the MSEs, TPRs, and FPRs for Scenarios 1 and 2. Both scenarios are repeated 100 times.

    Table 3.  Results of Scenario 1.
    r p α β
    CR CRS CRG CRGS CR CRS CRG CRGS
    0.25 40 MSE 24.40 2.62 1.75 1.51 9.36 15.24 4.08 3.49
    (17.50) (0.38) (0.18) (0.17) (4.01) (4.06) (4.36) (4.38)
    TPR - - 0.98 0.96 - - 0.49 0.94
    - - (0.09) (0.07) - - (0.22) (0.10)
    FPR - - 0.08 0.06 - - 0.15 0.06
    - - (0.11) (0.10) - - (0.15) (0.10)
    200 MSE 35.91 10.35 1.71 1.49 773.53 74.24 3.99 3.09
    (37.71) (3.23) (0.11) (0.07) (1396.13) (22.85) (0.71) (0.47)
    TPR - - 0.99 0.96 - - 0.44 0.94
    - - (0.04) (0.06) - - (0.19) (0.09)
    FPR - - 0.02 0.01 - - 0.09 0.01
    - - (0.02) (0.01) - - (0.05) (0.01)
    2.5 40 MSE 24.41 3.34 1.73 1.50 9.06 8.16 3.64 3.06
    (17.48) (0.44) (0.13) (0.07) (2.18) (1.31) (0.70) (0.41)
    TPR - - 0.98 0.96 - - 0.48 0.94
    - - (0.09) (0.07) - - (0.21) (0.10)
    FPR - - 0.07 0.05 - - 0.14 0.05
    - - (0.07) (0.02) - - (0.13) (0.02)
    200 MSE 50.98 3.66 24.71 1.54 48.03 14.03 5.17 3.92
    (11.07) (0.51) (27.75) (0.06) (7.82) (2.23) (2.91) (0.52)
    TPR - - 0.52 0.98 - - 0.65 0.98
    - - (0.50) (0.06) - - (0.35) (0.06)
    FPR - - 0.01 0.01 - - 0.06 0.01
    - - (0.01) (0.00) - - (0.06) (0.00)
    Note: in each cell, mean (standard deviation).

    Table 4.  Results of Scenario 2.
    r p α β
    CR CRS CRG CRGS CR CRS CRG CRGS
    0.25 40 MSE 6.01 2.17 1.57 1.42 67.18 19.26 9.83 6.70
    (2.47) (0.85) (0.12) (0.09) (33.69) (16.42) (3.33) (2.36)
    TPR - - 0.95 0.95 - - 0.95 0.94
    - - (0.08) (0.06) - - (0.06) (0.07)
    FPR - - 0.05 0.06 - - 0.35 0.19
    - - (0.00) (0.06) - - (0.21) (0.15)
    200 MSE 58.02 9.78 1.37 1.26 187.33 74.62 8.96 6.12
    (58.60) (12.42) (0.11) (0.08) (199.25) (99.61) (3.20) (2.19)
    TPR - - 0.93 0.94 - - 0.87 0.90
    - - (0.15) (0.07) - - (0.18) (0.08)
    FPR - - 0.01 0.05 - - 0.25 0.24
    - - (0.00) (0.02) - - (0.05) (0.05)
    2.5 40 MSE 48.42 2.94 2.46 2.04 9.14 6.60 7.46 4.55
    (24.54) (0.33) (0.14) (0.11) (2.63) (0.67) (0.31) (0.38)
    TPR - - 1.00 0.99 - - 0.41 0.99
    - - (0.02) (0.02) - - (0.17) (0.02)
    FPR - - 0.50 0.11 - - 0.07 0.11
    - - (0.16) (0.07) - - (0.08) (0.07)
    200 MSE 81.13 3.43 2.57 1.95 73.28 9.83 8.02 4.42
    (18.85) (0.43) (0.17) (0.09) (12.06) (2.50) (0.37) (0.33)
    TPR - - 1.00 0.99 - - 0.53 0.99
    - - (0.02) (0.02) - - (0.14) (0.02)
    FPR - - 0.54 0.13 - - 0.06 0.13
    - - (0.04) (0.03) - - (0.03) (0.03)
    Note: in each cell, mean (standard deviation).


    As shown in Tables 3 and 4, the MSEs of the methods with the group lasso penalty (the proposed and CRG) are smaller than those of the CRS and CR methods, and the MSEs of the CRS and CR estimates grow as the dimension increases. These results indicate that a higher dimension leads to less efficient estimation and that the group lasso penalty improves estimation performance. Comparing the TPRs and FPRs of the proposed and CRG methods, the proposed method performs better at variable selection. Compared with all the alternatives, the proposed method has the lowest MSEs. The simulation results thus reveal that the proposed method improves both variable selection and estimation.

    In this section, we apply the proposed method to credit data. The data come from the retail business of a commercial bank in China and contain 16 covariates of 1213 customers with personal loans from 2014 to 2019. The primary interest is to assess the credit risk of a credit loan and find the important covariates for predicting the time to default of the credit loan customers. The mean observed time is 1.38 years with a standard deviation of 0.69. Customers with missing annual household income are removed from the analysis. After preprocessing, the covariates and their descriptions are summarized in Table 5. By expanding the multi-level covariates, we obtain 24 covariates in the credit model. The censoring time C_i is the interval between the value date and either default or the end of observation (June 1, 2019). Due to the different value dates of the loans, the censoring times vary from individual to individual. Customers whose time to event Y_i is longer than the censoring time C_i are censored (δ_i = 0). In this data set, 1201 out of 1213 customers are censored.

    Table 5.  Covariates and their descriptions.
    Covariates Descriptions
    Interest rate [0.037,0.087]
    Loan line (0, 7,000,000)
    Loan term (0, +∞)
    Business type consumer durables, housing decoration loans, and other personal consumption loans
    Entrusted payment yes, no
    Early repayment yes, no
    Age [20,70]
    Gender male, female
    Education master/doctor, bachelor, vocational education, high school and below
    Medical insurance yes, no
    Housing status self-purchasing (with a mortgage), self-purchasing (without a mortgage), others
    Annual household income (RMB) < 200,000, 200,000–400,000, 400,000–600,000, > 600,000
    Employment employed, others
    Type of workplace government organization/institution, firm, others
    Occupation managers, commercial and service workers, others
    Professional title advanced, intermediate, primary, no professional title


    The data are randomly divided into a training set and a test set in a 7:3 ratio. The training data are used to fit the model, and the test data are used to verify the performance of the fitted model. The parameters μ_1, μ_2, and r are selected by the BIC. Unlike in the simulations, the true coefficients are unknown for the real data, so we adopt the negative log-likelihood to evaluate the performance of the methods. The mean negative log-likelihood (standard error) of the proposed method is 25.91 (11.61), compared with 56.33 (258.73), 31.04 (27.34), and 26.17 (13.02) for the CR, CRS, and CRG methods respectively. For stability, all results are based on 100 replicates. The results indicate that the proposed method has better prediction performance than the alternatives.

    The coefficients are estimated based on 100 replicates. With the median estimates of the coefficients, we compute the probability of cure (non-default) π_i(x_i) for all customers. We dichotomize π_i(x_i) at the median to obtain two groups of customers: the group with lower π_i(x_i) is denoted "high risk", and the group with higher π_i(x_i) is denoted "low risk". Figure 1 presents the Kaplan-Meier curves of the survival of the customers in the two groups. The Kaplan-Meier curve describes the change of the survival probability over time and is commonly used in survival analysis; see Rodrigues et al. [8] and Pal [34]. As indicated in Figure 1, the "low risk" group has a higher survival probability than the "high risk" group.
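The curves in Figure 1 are standard product-limit (Kaplan-Meier) estimates. A self-contained sketch of the estimator, not the authors' plotting code:

```python
import numpy as np

def kaplan_meier(times, events):
    """Product-limit survival estimates: returns (event_time, S(t)) pairs.
    `events` is 1 when the event (default) was observed, 0 when censored."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    surv, curve = 1.0, []
    n_at_risk = len(times)
    for t in np.unique(times):                 # unique times in increasing order
        at_t = times == t
        deaths = int(np.sum(events[at_t]))
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk   # multiply in this time's factor
            curve.append((float(t), surv))
        n_at_risk -= int(np.sum(at_t))         # deaths and censorings leave the risk set
    return curve
```

Stratifying the customers by the dichotomized π̂_i(x_i) and running this estimator on each stratum reproduces the kind of comparison shown in Figure 1.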

    Figure 1.  Kaplan-Meier curves stratified by different groups.

    The median estimates of the coefficients based on 100 replicates are listed in Table 6. A positive coefficient in α indicates that the covariate is positively related to the probability of default, and a positive coefficient in β indicates that the covariate is negatively related to the default time. The probability of default and the default time are two closely related credit aspects: customers with a higher probability of default are likely to default earlier. Compared with the alternative methods, the signs of α and β under the proposed method are more consistent, whereas in the CR method many covariates, such as housing status, have opposite effects on the probability of default and the time to default.

    Table 6.  Estimates of coefficients.
                                                    CR             CRS            CRG            CRGS
                                                 α      β       α      β       α      β       α      β
    α0/β0                                     -0.73  -9.15   -0.95  -7.89   -2.40  -5.82   -2.40  -5.82
    Interest rate                              0.00   4.68    0.00   0.00    0.00   0.00    0.00   0.00
    Loan line                                 -0.33 -11.55   -1.45  -5.01   -1.47  -2.38   -1.47  -2.38
    Loan term                                  0.00   0.00    0.00   0.00    0.00   0.00    0.00   0.00
    Business type
      Consumer durables                        0.00   4.44    1.22   1.07    0.91   0.57    0.91   0.57
      Housing decoration loans                -0.67  -5.12   -1.24  -2.77   -1.08  -1.27   -1.08  -1.27
    Entrusted payment (yes)                    0.00  -2.91   -0.18  -0.25    0.00   0.00    0.00   0.00
    Early repayment (yes)                      0.00 -28.92   -1.51  -7.29   -1.42  -3.78   -1.42  -3.78
    Age                                        0.00  -0.10    0.00  -0.05    0.00  -0.07    0.00  -0.07
    Gender (male)                              0.71   9.50    1.16   6.04    0.98   4.96    0.98   4.96
    Education
      Master/Doctor                           -0.30  -3.13   -0.28  -0.37    0.00   0.00    0.00   0.00
      Bachelor                                 0.77  -0.71    0.08   0.01    0.05   0.05    0.05   0.05
      Vocational education                     0.00   1.55    0.35   0.38    0.00   0.00    0.00   0.00
    Medical insurance (yes)                    0.24   4.18    1.02   1.42    0.91   0.72    0.91   0.72
    Housing status
      Self-purchasing (with a mortgage)        0.59  -0.02    0.33   0.29    0.24   0.19    0.24   0.19
      Self-purchasing (without a mortgage)    -0.75   1.40    0.00   0.00    0.00   0.00    0.00   0.00
    Annual household income (RMB)
      200,000-400,000                          0.47  -3.23    0.00  -0.75    0.00   0.00    0.00   0.00
      400,000-600,000                         -0.85  -5.66   -0.95  -1.74   -0.61  -0.64   -0.61  -0.64
      > 600,000                               -1.18  -4.12   -0.76  -1.66   -0.50  -0.60   -0.50  -0.60
    Employment (employed)                      0.00  -2.03   -0.54  -0.70   -0.52  -0.44   -0.52  -0.44
    Type of workplace
      Government organization and institution -0.36  -7.53   -0.74  -2.61   -0.76  -0.84   -0.76  -0.84
      Firm                                     0.00  -0.39    0.00   0.00    0.00   0.00    0.00   0.00
    Occupation
      Managers                                 0.28  -3.69   -0.38  -0.83   -0.52  -0.41   -0.52  -0.41
      Commercial and service workers          -0.36  -1.95   -0.95  -0.76   -0.76  -0.40   -0.76  -0.40
    Professional title
      Advanced                                 1.27   4.77    1.12   2.90    0.88   0.72    0.88   0.72
      Intermediate                             0.63   5.03    1.31   2.42    0.97   1.02    0.97   1.02
      Primary                                  0.17   4.95    0.88   2.25    0.00   0.00    0.00   0.00

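The sign-consistency contrast can be checked directly on Table 6. Using a few (α, β) pairs from the CR and CRGS columns (housing status with and without a mortgage, bachelor education, managers, and income 200,000-400,000), the following sketch counts covariates whose two coefficients have conflicting signs:

```python
import numpy as np

def sign_conflicts(alpha, beta):
    """Indices of covariates whose alpha and beta have opposite
    (both nonzero) signs."""
    alpha, beta = np.asarray(alpha), np.asarray(beta)
    return np.where(alpha * beta < 0)[0]

# (alpha, beta) pairs from Table 6 for five covariates:
# housing (with mortgage), housing (without mortgage), bachelor,
# managers, income 200,000-400,000
cr_alpha   = [ 0.59, -0.75,  0.77,  0.28,  0.47]
cr_beta    = [-0.02,  1.40, -0.71, -3.69, -3.23]
crgs_alpha = [ 0.24,  0.00,  0.05, -0.52,  0.00]
crgs_beta  = [ 0.19,  0.00,  0.05, -0.41,  0.00]

print(len(sign_conflicts(cr_alpha, cr_beta)))      # → 5 (all conflict)
print(len(sign_conflicts(crgs_alpha, crgs_beta)))  # → 0 (none conflict)
```

For these five covariates, every CR pair conflicts while no CRGS pair does, which is the improved interpretability the text describes.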

    The coefficient estimates of the proposed method reveal that loan line, early repayment, gender, housing status, annual household income, employment status, type of workplace, and occupation are important covariates for credit risk assessment. The effects of business type and education on credit are less clear.

    The loan line has a positive effect. One possible explanation is that customers with better credit status are more likely to be granted a higher loan line. Customers with housing loans are more likely to default. Higher annual household income is associated with better credit status. Employed customers are less likely to default: compared with other groups such as the self-employed, freelancers, and the unemployed, employed customers have a more stable income. Customers who work in a government organization or institution have better credit status. Customers who are managers or commercial and service workers are less likely to default. Customers with early repayment records tend to maintain good credit and are less likely to default. Compared with women, men are more likely to default, which is consistent with the findings of [35] and with men's greater risk preference [36].

    The cure rate model is commonly used when the data contain long-term survivors. The model consists of two parts: the incidence part describes the probability of cure, and the latency part describes the survival function of the uncured group. A drawback of the standard cure rate model is that it places no constraints on the coefficients corresponding to the same covariates in the two model parts, which may yield conflicting estimated effects of a covariate on the probability of cure and on the conditional survival of the uncured group. In fact, the two parts of the model describe closely related aspects, so some connection between the coefficients of the same covariates is to be expected. Existing works have considered a joint distribution or structural effect for the two sets of coefficients, which is too restrictive.
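The two-part structure above can be written as S_pop(t|x) = π(x) + (1 − π(x)) S_u(t|x). A minimal sketch, assuming a logistic incidence part and a Weibull latency part with a proportional-hazards covariate effect (one common parameterization; the paper's exact Weibull form may differ):

```python
import numpy as np

def cure_population_survival(t, x, alpha, beta, shape, scale):
    """Mixture cure model: S_pop(t|x) = pi(x) + (1 - pi(x)) * S_u(t|x).

    Incidence: pi(x) = 1 / (1 + exp(x @ alpha)), the cure probability,
               so a positive alpha raises the default probability 1 - pi(x).
    Latency  : Weibull survival for the uncured group with a
               proportional-hazards covariate effect exp(x @ beta).
    """
    pi = 1.0 / (1.0 + np.exp(x @ alpha))            # probability of cure
    s_u = np.exp(-((t / scale) ** shape) * np.exp(x @ beta))
    return pi + (1.0 - pi) * s_u

# Toy covariates and coefficients (illustrative values only)
x = np.array([1.0, 0.5])
alpha = np.array([-0.5, 0.8])
beta = np.array([0.3, -0.2])
s = cure_population_survival(5.0, x, alpha, beta, shape=1.5, scale=10.0)
```

The defining feature of the model is visible in the limit: as t grows, S_pop(t|x) levels off at π(x) instead of decaying to zero, which is what "long-term survivors" means.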

    In this paper, we consider a more flexible cure rate model that allows the two sets of coefficients to differ in distribution and magnitude. We propose a sign consistency cure rate model that promotes similarity in the signs of the coefficients in the two model parts, improving interpretability. In addition, we impose a group lasso penalty for variable selection. Simulation results show that, compared with the alternatives, the proposed method performs better in terms of variable selection and estimation. An analysis of credit data from China illustrates that the proposed method can improve both prediction performance and interpretability.
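A hedged sketch of how a sign-based penalty and a group lasso penalty could be combined in the objective. The smooth sign surrogate below is an illustrative choice, not necessarily the paper's exact penalty; `sign_penalty` and `group_lasso_penalty` are hypothetical names.

```python
import numpy as np

def sign_penalty(alpha, beta, mu, eps=1e-6):
    """Sign-consistency penalty sketch: small when alpha_j and beta_j
    share signs, large when they conflict. Uses the smooth surrogate
    x / (|x| + eps) ≈ sign(x), so the term for a conflicting pair is
    roughly (1 - (-1))^2 = 4 and for an agreeing pair roughly 0."""
    a = alpha / (np.abs(alpha) + eps)
    b = beta / (np.abs(beta) + eps)
    return mu * np.sum((a - b) ** 2)

def group_lasso_penalty(alpha, beta, lam):
    """Group lasso over (alpha_j, beta_j) pairs: the two coefficients of
    a covariate are selected or dropped jointly."""
    return lam * np.sum(np.sqrt(alpha ** 2 + beta ** 2))

# Penalized objective = negative log-likelihood + both penalties
alpha = np.array([0.9, -0.5, 0.0])
beta = np.array([4.9, -3.7, 0.0])
total_penalty = (sign_penalty(alpha, beta, mu=0.5)
                 + group_lasso_penalty(alpha, beta, lam=0.1))
```

Adding these terms to the negative log-likelihood pushes the optimizer toward coefficient pairs with matching signs while zeroing out unimportant covariates in both model parts at once.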

    We are grateful to the reviewers and the editor for their helpful comments and suggestions. This work was supported by the National Office for Philosophy and Social Sciences of China under Grant 20&ZD137.

    The authors declare there is no conflict of interest.



    [1] J. P. Klein, M. L. Moeschberger, Survival analysis: techniques for censored and truncated data, 2 Eds., New York: Springer-Verlag, 2003. doi: 10.1007/b97377.
    [2] M. Stepanova, L. Thomas, Survival analysis methods for personal loan data, Oper. Res., 50 (2002), 277-289. doi: 10.1287/opre.50.2.277.426.
    [3] V. B. Djeundje, J. Crook, Dynamic survival models with varying coefficients for credit risks, Eur. J. Oper. Res., 275 (2019), 319-333. doi: 10.1016/j.ejor.2018.11.029.
    [4] Q. Zhang, S. Zhang, J. Liu, J. Huang, S. Ma, Penalized integrative analysis under the accelerated failure time model, Stat. Sin., 26 (2016), 492-508. doi: 10.5705/ss.2014.194.
    [5] J. Berkson, R. P. Gage, Survival curve for cancer patients following treatment, J. Am. Stat. Assoc., 47 (1952), 501-515. doi: 10.1080/01621459.1952.10501187.
    [6] J. Rodrigues, V. G. Cancho, M. de Castro, F. Louzada-Neto, On the unification of long-term survival models, Stat. Probab. Lett., 79 (2009), 753-759. doi: 10.1016/j.spl.2008.10.029.
    [7] F. Cooner, S. Banerjee, B. P. Carlin, D. Sinha, Flexible cure rate modeling under latent activation schemes, J. Am. Stat. Assoc., 102 (2007), 560-572. doi: 10.1198/016214507000000112.
    [8] J. Rodrigues, M. de Castro, V. G. Cancho, N. Balakrishnan, COM-Poisson cure rate survival models and an application to a cutaneous melanoma data, J. Stat. Plan. Infer., 139 (2009), 3605-3611. doi: 10.1016/j.jspi.2009.04.014.
    [9] L. Li, J. H. Lee, A latent promotion time cure rate model using dependent tail-free mixtures, J. R. Statist. Soc. A, 180 (2017), 891-905. doi: 10.1111/rssa.12226.
    [10] L. Dirick, G. Claeskens, B. Baesens, Time to default in credit scoring using survival analysis: a benchmark study, J. Oper. Res. Soc., 68 (2017), 652-665. doi: 10.1057/s41274-016-0128-9.
    [11] O. Georgiana, A. B. Lawson, Bayesian cure-rate survival model with spatially structured censoring, Spatial Stat., 28 (2018), 352-364. doi: 10.1016/j.spasta.2018.08.007.
    [12] S. Pal, S. Roy, A new non-linear conjugate gradient algorithm for destructive cure rate model and a simulation study: illustration with negative binomial competing risks, Commun. Stat.-Simul. Comput., 2020. doi: 10.1080/03610918.2020.1819321.
    [13] C. Li, J. M. G. Taylor, Smoothing covariate effects in cure models, Commun. Stat.-Theory Meth., 31 (2002), 477-493. doi: 10.1081/STA-120002860.
    [14] T. Chen, P. Du, Promotion time cure rate model with nonparametric form of covariate effects, Stat. Med., 37 (2018), 1625-1635. doi: 10.1002/sim.7597.
    [15] E. N. C. Tong, C. Mues, L. C. Thomas, Mixture cure models in credit scoring: if and when borrowers default, Eur. J. Oper. Res., 218 (2012), 132-139. doi: 10.1016/j.ejor.2011.10.007.
    [16] C. Jiang, Z. Wang, H. Zhao, A prediction-driven mixture cure model and its application in credit scoring, Eur. J. Oper. Res., 277 (2019), 20-31. doi: 10.1016/j.ejor.2019.01.072.
    [17] C. Han, R. Kronmal, Two-part models for analysis of Agatston scores with possible proportionality constraints, Commun. Stat.-Theory Meth., 35 (2006), 99-111. doi: 10.1080/03610920500438614.
    [18] K. Fang, X. Wang, B. C. Shia, S. Ma, Identification of proportionality structure with two-part models using penalization, Comput. Stat. Data Anal., 99 (2016), 12-24. doi: 10.1016/j.csda.2016.01.002.
    [19] F. Liu, Z. Hua, A. Lim, Identifying future defaulters: a hierarchical Bayesian method, Eur. J. Oper. Res., 241 (2015), 202-211. doi: 10.1016/j.ejor.2014.08.008.
    [20] X. Fan, M. Liu, K. Fang, Y. Huang, S. Ma, Promoting structural effects of covariates in the cure rate model with penalization, Stat. Methods Med. Res., 26 (2017), 2078-2092. doi: 10.1177/0962280217708684.
    [21] Q. Zhang, S. Ma, Y. Huang, Promote sign consistency in the joint estimation of precision matrices, Comput. Stat. Data Anal., 159 (2021), 107210. doi: 10.1016/j.csda.2021.107210.
    [22] X. Shi, S. Ma, Y. Huang, Promoting sign consistency in the cure model estimation and selection, Stat. Methods Med. Res., 29 (2020), 15-28. doi: 10.1177/0962280218820356.
    [23] M. Yuan, Y. Lin, Model selection and estimation in regression with grouped variables, J. R. Statist. Soc. B, 68 (2006), 49-67. doi: 10.1111/j.1467-9868.2005.00532.x.
    [24] T. Hastie, R. Tibshirani, J. Friedman, The elements of statistical learning: data mining, inference, and prediction, 2 Eds., New York: Springer, 2009. doi: 10.1007/978-0-387-84858-7.
    [25] J. Huang, P. Breheny, S. Ma, A selective review of group selection in high-dimensional models, Stat. Sci., 27 (2012), 481-499. doi: 10.1214/12-STS392.
    [26] N. Balakrishnan, S. Pal, Expectation maximization-based likelihood inference for flexible cure rate models with Weibull lifetimes, Stat. Methods Med. Res., 25 (2016), 1535-1563. doi: 10.1177/0962280213491641.
    [27] M. Omer, M. Bakar, M. B. Adam, M. S. Mustafa, Cure models with exponentiated Weibull exponential distribution for the analysis of melanoma patients, Mathematics, 8 (2021), 1926. doi: 10.3390/math8111926.
    [28] S. Pal, N. Balakrishnan, Likelihood inference based on EM algorithm for the destructive length-biased Poisson cure rate model with Weibull lifetime, Commun. Stat.-Simul. Comput., 47 (2018), 644-660. doi: 10.1080/03610918.2015.1053918.
    [29] X. Li, Y. Tang, A. Xu, Objective Bayesian analysis of Weibull mixture cure model, Qual. Eng., 32 (2020), 449-464. doi: 10.1080/08982112.2020.1757706.
    [30] J. Huang, T. Zhang, The benefit of group sparsity, Ann. Stat., 38 (2010), 1978-2004. doi: 10.1214/09-AOS778.
    [31] L. Meier, S. van de Geer, P. Bühlmann, The group lasso for logistic regression, J. R. Statist. Soc. B, 70 (2008), 53-71. doi: 10.1111/j.1467-9868.2007.00627.x.
    [32] Y. Yang, H. Zou, A fast unified algorithm for solving group-lasso penalized learning problems, Stat. Comput., 25 (2015), 1129-1141. doi: 10.1007/s11222-014-9498-5.
    [33] H. Wang, B. Li, C. Leng, Shrinkage tuning parameter selection with a diverging number of parameters, J. R. Statist. Soc. B, 71 (2009), 671-683. doi: 10.1111/j.1467-9868.2008.00693.x.
    [34] S. Pal, A simplified stochastic EM algorithm for cure rate model with negative binomial competing risks: an application to breast cancer data, Stat. Med., 2021. doi: 10.1002/sim.9189.
    [35] Y. Li, Y. Li, Y. Li, What factors are influencing credit card customer's default behavior in China? A study based on survival analysis, Physica A, 526 (2019), 120861. doi: 10.1016/j.physa.2019.04.097.
    [36] Y. Shu, Q. Y. Yang, Research on auto loan default prediction based on large sample data model, Manage. Rev., 29 (2017), 59-71.
  • © 2022 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)