1. Introduction
With the rapid development of information science and technology, increasing attention has been paid to the analysis of ultrahigh-dimensional data, and efficient methods for handling such data are in great demand. Ultrahigh-dimensional data often appear in biomedical imaging, gene expression and proteomics studies, among other areas. The dimensionality $p$ of the collected data is allowed to diverge at a nonpolynomial rate with the sample size $n$, that is, $\log p = O(n^{\xi})$ for some $\xi > 0$. Hence, dimension reduction is imperative for the efficient manipulation and analysis of ultrahigh-dimensional data.
Feature reduction techniques such as principal component analysis and linear discriminant analysis (LDA) have been proposed and applied successfully in practice to reduce the dimensionality of the original features without losing too much information. LDA is one of the most popular approaches in discriminant classification and pattern recognition; it aims to find a proper linear transformation so that each high-dimensional sample vector is projected into a low-dimensional vector while preserving the original cluster structure as much as possible. However, when the dimension is ultrahigh, classical classification methods such as LDA are no longer applicable because of the large number of redundant variables [1]. For example, the ovarian cancer data studied by Sorace and Zhan [2] consisted of serum samples from 162 ovarian cancer patients and 91 control subjects. For each sample, 15,154 distinct mass-to-charge ratios (M/Z) were available for analysis. It is of interest to identify proteomic patterns (corresponding to M/Z values) that can distinguish ovarian cancer subjects from control subjects. The small number of samples and the many redundant variables prevent discriminant classification from working effectively.
To deal with the difficulty of ultrahigh-dimensional discriminant classification, statisticians have proposed many marginal feature screening procedures to reduce the dimension rapidly, after which classical discriminant analysis methods can be applied. Mai and Zou [3] proposed a feature screening procedure named the Kolmogorov filter (KF) for binary classification based on the Kolmogorov-Smirnov statistic, which enjoyed the sure screening property under much weaker model assumptions. Mai and Zou [4] proposed the fused Kolmogorov filter, which generalized the KF procedure to the multi-classification case. Lai et al. [5] proposed a feature screening procedure based on the expected conditional Kolmogorov filter for the ultrahigh-dimensional binary classification problem with a dependent variable. Cui et al. [6] proposed a model-free feature screening index named MV for ultrahigh-dimensional discriminant analysis based on the difference between conditional and unconditional distribution functions. Pan et al. [7] developed a pairwise sure independence screening procedure (PSIS) for LDA, but this procedure depends on parametric modeling assumptions and may perform poorly for heavy-tailed data. Cheng et al. [8] proposed a robust ranking screening procedure based on the conditional expectation of the rank of predictor samples for ultrahigh-dimensional discriminant analysis, which is robust against heavy-tailed distributions, potential outliers and the sample shortage for some categories. He et al. [9] generalized the MV procedure by modifying MV with a weight function; the resulting Anderson-Darling sure independence screening procedure (AD-SIS) is more robust against heavy-tailed distributions. Song et al. [10] proposed a robust composite weighted quantile screening procedure based on the difference between the conditional and the unconditional quantiles of the feature. Different from the existing methods, which use differences of means or of conditional cumulative distribution functions between classes as screening indexes, Sheng and Wang [11] proposed a new feature screening method that ranks the importance of predictors based on the classification accuracy of marginal classifiers.
Although many feature screening procedures, some of them even model-free, have been proposed for ultrahigh-dimensional discriminant analysis problems, the study of the LDA problem, one of the most popular approaches in discriminant classification and pattern recognition, remains very attractive. To solve the linear discrimination problem with ultrahigh-dimensional features, a dimension reduction method based on the linear projection may perform better than model-free methods. In this paper, we propose a feature screening procedure based on Fisher's linear discriminant framework. By minimizing the linearly projected within-group sum of squares and maximizing the linearly projected between-group sum of squares, a marginal score test is constructed and combined with the linear projection optimization problem. First, the proposed method can screen out the irrelevant predictors in the linear discriminant function through the use of estimating equations. Second, the proposed procedure possesses the sure screening property. Third, the simple structure of the screening index makes the calculation fast.
The rest of this paper is organized as follows. In Section 2, we construct the feature screening estimating equations, propose the feature screening procedure and further study its theoretical properties. In Section 3, we present Monte Carlo simulation studies to examine the finite sample performance of the proposed procedure. We also use the proposed procedure in a real data example. All technical details are presented in the Appendix.
2. Linear projection feature screening
Consider the class data samples $(X_i, Y_i)$, $i = 1, \ldots, n$, where $Y_i$ is the binary class index variable taking values in $\{1, 2\}$ and $X_i \in \mathbb{R}^p$. Suppose that $G = \{G_1, G_2\} \subset \mathbb{R}^{p \times n}$, where each $G_j \subset \mathbb{R}^{p \times n_j}$ for $j = 1, 2$ represents an independent class data set $\{X_{ij}\} = \{X_i : Y_i = j\}$, $n_j$ denotes the number of samples in class $G_j$, and $n_1 + n_2 = n$. When the dimensionality $p$ is large, discriminant analysis based on these data would be rather complex and inefficient. LDA aims to find a linear projection $\beta \in \mathbb{R}^p$ that maps each sample vector $X_i$ to a new low-dimensional sample $\beta^\top X_i$, $i = 1, \ldots, n$. It seeks to project the observations into a lower-dimensional space such that the intergroup variance of the projected samples is large and the intragroup variance is small. A classification rule is obtained by assigning a sample to its nearest centroid in the transformed space. To find the projection direction and delete the irrelevant predictors simultaneously, we construct the linear projection feature screening procedure as follows.
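For concreteness, the idea just described can be sketched with the classical two-class Fisher criterion. The display below is only a sketch: the between-group and within-group quantities $B$ and $E$, with $\bar{X}_j$ and $\bar{X}$ the group and overall sample means defined formally in Section 2.1, anticipate the notation of (2.1) and (2.2), and the exact expressions there may differ slightly.
\[
B = \sum_{j=1}^{2}\sum_{i=1}^{n}(\bar{X}_j - \bar{X})(\bar{X}_j - \bar{X})^{\top} I(Y_i = j), \qquad
E = \sum_{j=1}^{2}\sum_{i=1}^{n}(X_i - \bar{X}_j)(X_i - \bar{X}_j)^{\top} I(Y_i = j),
\]
\[
\hat{\beta} = \arg\max_{\|\beta\| = 1} \frac{\beta^{\top} B \beta}{\beta^{\top} E \beta}, \qquad
\hat{Y}(x) = \arg\min_{j \in \{1, 2\}} \bigl|\beta^{\top} x - \beta^{\top}\bar{X}_j\bigr|.
\]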
2.1. Screening method
Based on the linear projection, a random sample $X_i$ is projected to $\beta^\top X_i$. Define
where $\bar{X}_j = \frac{1}{n_j}\sum_{i=1}^{n_j} X_{ij}$ and $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$. Thus, the linear projection procedure aims to obtain $\beta$ by
When the dimensionality $p$ is ultrahigh, the traditional solutions for (2.2) fail. For example, the eigenvectors associated with the eigenvalues solved from $|B - \lambda E| = 0$ are hard to obtain. For ultrahigh-dimensional problems, sparsity is often present, meaning that only a small number of predictors contribute significantly to the LDA procedure. We denote the active set and the inactive set by $\mathcal{A} = \{k : \beta_k \neq 0, 1 \le k \le p\}$ and $\mathcal{A}^c = \{k : \beta_k = 0, 1 \le k \le p\}$, respectively. Note that (2.2) is a constrained optimization problem. To avoid the constraint, without loss of generality, we assume $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^\top = (\beta_1, \beta_{(-1)}^\top)^\top$, where $\beta_{(-1)} = (\beta_2, \ldots, \beta_p)^\top$, $\beta_1 > 0$ and $\beta_1 = \sqrt{1 - \|\beta_{(-1)}\|^2}$. The condition $\beta_1 > 0$ means that $X_1$ is important for the linear discriminant classification. The active predictor $X_1$ can be identified by comparing the marginal correlations of $X_k$ and $Y$ for $k = 1, \ldots, p$. We propose a feature screening procedure to identify potential active predictors as follows. Assume $E(X_k) = 0$, $\operatorname{Var}(X_k) = 1$, and redefine $X_k = X_k - E(X_k \mid X_1)$ for $k = 2, \ldots, p$. This transformation of $X_k$ removes the linear correlation between $X_k$ and the active predictor $X_1$.
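As a small illustration of this preprocessing step, the sketch below standardizes each predictor and replaces $X_k$ ($k \ge 2$) by the residual of its least-squares projection on $X_1$. Approximating $E(X_k \mid X_1)$ by this linear projection is an assumption of the sketch (it is exact under joint normality); the function name and interface are illustrative only.

```python
import numpy as np

def standardize_and_decorrelate(X):
    """Center and scale each column, then replace X_k (k >= 2) by the residual of its
    least-squares projection on X_1, mimicking X_k - E(X_k | X_1) under linearity.
    X: (n, p) array whose first column is the designated active predictor X_1."""
    X = (X - X.mean(axis=0)) / X.std(axis=0)          # enforce E(X_k) = 0, Var(X_k) = 1
    x1 = X[:, 0]
    slopes = X[:, 1:].T @ x1 / (x1 @ x1)              # regression slopes of X_k on X_1
    X_new = X.copy()
    X_new[:, 1:] -= np.outer(x1, slopes)              # remove linear dependence on X_1
    return X_new
```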
By introducing $\beta_{(-1)}$, we estimate $\beta(\beta_{(-1)})$ by maximizing
or solving
where $J_{\beta_{(-1)}} = \frac{\partial \beta}{\partial \beta_{(-1)}} = (b_1, \ldots, b_p)^\top$ is a $p \times (p-1)$ matrix, $b_1 = -(1 - \|\beta_{(-1)}\|^2)^{-1/2}\beta_{(-1)}$, and $b_s = (0, \ldots, 0, 1, 0, \ldots, 0)^\top$ with the $s$th element equal to 1, $s = 2, \ldots, p$. Let $\hat{L}'_k(\beta)$ be the $k$th component of $\frac{\partial \hat{L}(\beta_{(-1)})}{\partial \beta_{(-1)}}$. Therefore, $(\hat{L}'_2(\beta), \ldots, \hat{L}'_p(\beta))^\top = 0$ are the estimating equations of $\beta_{(-1)}$. Since the sparsity property is satisfied, motivated by the score test screening procedure proposed by Zhao and Li [12], for each $k$ ($k \neq 1$) we consider a marginal estimating equation of $\beta_k$ ($k = 2, \ldots, p$) and assume that all the other covariates, except $X_1$, are unrelated to the linear discriminant classification. Denote this marginal estimating equation by $L'_k(\beta)$, and let $\hat{\omega}_k(\beta_k) = \hat{L}'_k(\beta_1, 0, \ldots, 0, \beta_k, 0, \ldots, 0) = 0$. From this marginal estimating equation, if $|\hat{\omega}_k(0)|$ is larger than 0, then $\beta_k = 0$ is not a solution of the estimating equation, and thus $X_k$ is a possibly active predictor. Otherwise, the coefficient $\beta_k = 0$ indicates that $X_k$ is not important in the linear discriminant analysis. Therefore, similar to Zhao and Li [12] and Ma et al. [13], each $|\hat{\omega}_k(0)| = \hat{L}'_k(1, 0, \ldots, 0)$ is the numerator of the score statistic for the hypothesis $\beta_k = 0$ ($k \ge 2$) under the $k$th marginal model and can therefore serve as a sensible screening statistic. Here $\beta_1 = 1$ follows from $\|\beta\| = 1$. Let $\hat{\omega}_k = \hat{\omega}_k(0)$. It follows that
where $E_{11} = \sum_{j=1}^{2}\sum_{i=1}^{n}[X_{i1} - \bar{X}_{j1}]^2 I(Y_i = j)$, $E_{k1} = \sum_{j=1}^{2}\sum_{i=1}^{n}[X_{ik} - \bar{X}_{jk}][X_{i1} - \bar{X}_{j1}] I(Y_i = j)$, $B_{11} = \sum_{j=1}^{2}\sum_{i=1}^{n}[\bar{X}_{j1} - \bar{X}_{1}]^2 I(Y_i = j)$ and $B_{k1} = \sum_{j=1}^{2}\sum_{i=1}^{n}[\bar{X}_{jk} - \bar{X}_{k}][\bar{X}_{j1} - \bar{X}_{1}] I(Y_i = j)$. To simplify the calculation and theoretical derivation, define
Note that $\hat{\omega}_k^*$ is a scaled version of $\hat{\omega}_k$; they lead to the same feature ranking and screening results.
Define $\omega_k^* = \omega_k^*(0) = T_{11}T_{12} - T_{21}T_{22}$, where
where $I_j = I(Y = j)$ and $I(\cdot)$ is the indicator function. From (2.6), if $X_k$ and $Y$ are independent, then it follows that
Therefore, $\hat{\omega}_k$ could be used as the feature screening index.
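To make the construction concrete, the sketch below computes the scaled marginal statistics $\hat{\omega}_k^*$ from data. It assumes the plug-in form $\hat{\omega}_k^* = (E_{11}B_{k1} - B_{11}E_{k1})/n^2$, i.e., $\hat{T}_{11}\hat{T}_{12} - \hat{T}_{21}\hat{T}_{22}$ with the sample quantities $\hat{T}_{uv}$ listed in the Appendix; this closed form is our reading of the omitted display equations, and the function name is illustrative.

```python
import numpy as np

def screening_stats(X, y):
    """Absolute scaled marginal screening statistics |omega*_k| for k = 2, ..., p.
    X: (n, p) array with X_1 in column 0 (after the transformation of Section 2.1);
    y: labels in {1, 2}. Assumes omega*_k = (E11 * Bk1 - B11 * Ek1) / n^2."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    E11 = B11 = 0.0
    Ek1 = np.zeros(p - 1)
    Bk1 = np.zeros(p - 1)
    for j in (1, 2):
        idx = (y == j)
        nj = idx.sum()
        xbar_j = X[idx].mean(axis=0)
        r1 = X[idx, 0] - xbar_j[0]                    # within-group deviations of X_1
        E11 += np.sum(r1 ** 2)
        Ek1 += (X[idx, 1:] - xbar_j[1:]).T @ r1
        B11 += nj * (xbar_j[0] - xbar[0]) ** 2
        Bk1 += nj * (xbar_j[1:] - xbar[1:]) * (xbar_j[0] - xbar[0])
    stats = np.zeros(p)                               # slot 0 (X_1 itself) is left at 0
    stats[1:] = np.abs(E11 * Bk1 - B11 * Ek1) / n ** 2
    return stats
```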
For a given threshold value $c_n$, the active set is estimated as
In practice, a suitable $c_n$ is usually not easy to specify in advance. An alternative is to select the top $d_n$ predictors and estimate the active set as
The submodel size $d_n$ is a predefined threshold value, e.g., $d_n = v[n/\log(n)]$, where $v$ is some positive integer; see Fan and Lv [14]. In practice, $v$ is chosen to be larger than 1 to increase the probability of selecting all the relevant predictors.
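A short usage sketch of this ranking-and-thresholding step, continuing the hypothetical screening_stats function above on toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 1000
X = rng.standard_normal((n, p))                       # toy data; column 0 plays the role of X_1
y = rng.integers(1, 3, size=n)                        # toy labels in {1, 2}

v = 2                                                 # v > 1 enlarges the screened set
d_n = v * int(np.floor(n / np.log(n)))                # d_n = v[n / log(n)]
stats = screening_stats(X, y)                         # from the sketch above
A_hat = np.argsort(stats)[::-1][:d_n]                 # indices of the top d_n statistics
X_screened = X[:, np.r_[0, A_hat]]                    # keep X_1 together with the screened predictors
```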
2.2. Sure screening property
Next, we establish the theoretical property of the proposed feature screening method. To study the sure screening property, the following regularity conditions are assumed.
● C1. $X$ satisfies the sub-exponential tail probability uniformly in $p$. That is, there exists a positive constant $s_0$ such that for all $0 < s \le 2s_0$,
● C2. There exist some constants $c > 0$ and $0 \le \kappa < \frac{1}{2}$ such that $\min_{k \in \mathcal{A}} \omega_k^* \ge 2cn^{-\kappa}$.
Theorem 1. (Sure Screening Property) Under Condition C1, for any $0 < \gamma < \frac{1}{2} - \kappa$, there exist positive constants $c_1 > 0$ and $c_2 > 0$ such that
Further, if both Conditions C1 and C2 hold, then by taking $c_n = cn^{-\kappa}$ in (2.8), we have
where $s_n$ is the cardinality of $\mathcal{A}$, which is sparse and may vary with $n$.
Theorem 2. (Minimum Model Size) Under the conditions of Theorem 1, for any $c_n = c_3 n^{-\kappa}$ with $c_3 > 0$, there exist positive constants $c_4$ and $c_5$ such that
Here $\|\cdot\|_0$ denotes the cardinality of a set.
Remark 1. Theorem 1 shows that the sure screening property holds for the proposed linear projection feature screening procedure. The dimensionality $p$ is allowed to increase at an exponential rate of the sample size $n$, i.e., $p = o(\exp(n^{\alpha}))$. From (2.10) of Theorem 1, the left-hand term of (2.10) tends to 0 if $0 < \alpha < 1 - 2\gamma - 2\kappa$. Furthermore, it shows that the feature screening procedure retains all the important classification predictors with probability tending to 1, that is, $P(\mathcal{A} \subset \hat{\mathcal{A}}_{c_n}) \to 1$. The screened features can then be used in the linear discriminant analysis; if the dimensionality is still high, penalized methods can be applied. Theorem 2 shows that as long as $\sum_{k=1}^{p}|\omega_k^*|$ is of a polynomial order of the sample size, the number of selected variables is also of a polynomial order of the sample size.
3. Numerical examples
In this section, we present two simulation studies of the popular discriminant analysis models, the logistic model and the probit model, and one real data analysis to assess the finite sample performances of the proposed method (LDA-SIS). Furthermore, we compare the effectiveness of our proposed method with other existing competitive screening methods, including the T-test (Fan and Fan [1]), DC (Li et al. [15]), KF (Mai and Zou [3]), MV (Cui et al. [6]), PSIS (Pan et al. [7]) and RRS (Cheng et al. [8]).
3.1. Monte Carlo Simulations
For each simulation, we set the dimensionality $p$ to 1000 and 2000 and the sample size $n$ to 100 and 200, respectively. All the simulation results are based on 1000 replications. Similar to Fan and Lv [14] and Li et al. [15], the screening threshold is set to $d_n = [n/\log(n)]$. The following criteria are considered to evaluate the performances of all screening methods (a small computation sketch of these criteria follows the list).
● MMS: The minimum model size of the submodel that contains all active predictors. The five quantiles of MMS over 1000 replications are presented.
● Pk: The proportion of replications in which the $k$th active predictor is selected into the model of size $d_n$.
● Pa: The proportion of replications in which all active predictors are selected into the model.
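A minimal sketch of how these three criteria could be computed from ranked screening statistics over replications (function and variable names are illustrative):

```python
import numpy as np

def evaluate_screening(ranked_lists, active_set, d_n):
    """ranked_lists: one array per replication, predictor indices sorted by decreasing
    screening statistic; active_set: indices of the truly active predictors."""
    mms, hits = [], []
    for ranked in ranked_lists:
        rank = {k: int(np.where(ranked == k)[0][0]) + 1 for k in active_set}  # 1-based ranks
        mms.append(max(rank.values()))                 # smallest model covering all actives
        hits.append([rank[k] <= d_n for k in active_set])
    hits = np.asarray(hits)
    P_k = hits.mean(axis=0)                            # per-predictor selection proportions
    P_a = hits.all(axis=1).mean()                      # proportion of runs selecting all actives
    mms_quantiles = np.quantile(mms, [0.05, 0.25, 0.5, 0.75, 0.95])  # e.g., five quantiles
    return mms_quantiles, P_k, P_a
```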
Example 1 (Logistic Model): Consider the logistic regression model
The covariate $X = (X_1, X_{(-1)}^\top)^\top$, with $X_{(-1)} = (X_2, \ldots, X_p)^\top$, is generated from $X_1 \sim N(0, 1)$ and $X_{(-1)} \sim N_{p-1}(0, \Sigma)$, where $\Sigma$ is a $(p-1) \times (p-1)$ covariance matrix with elements $\sigma_{ij} = \rho^{|i-j|}$, $i, j = 1, \ldots, p-1$. We consider $\rho = 0.2$, $0.5$ and $0.8$, respectively. Set $\beta = (1.4, 1.2, 1.0, 0.8, 0.6, \mathbf{0}_{p-5})^\top$, and the random error $\varepsilon$ added to $X^\top\beta$ follows $N(0, 1)$. The simulation results of Example 1 are shown in Tables 1 and 2.
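Since the display equation of the logistic model is not reproduced here, the data-generating sketch below assumes the standard logistic link applied to $X^\top\beta + \varepsilon$, which is consistent with the description of the added $N(0,1)$ error; the exact form used in the paper may differ.

```python
import numpy as np

def generate_example1(n=100, p=1000, rho=0.5, seed=0):
    """Example 1 sketch: X_1 ~ N(0,1) independent of X_(-1) ~ N_{p-1}(0, Sigma) with
    AR(1) entries sigma_ij = rho^{|i-j|}; labels from an assumed logistic link on
    X'beta + eps with eps ~ N(0,1)."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(p)
    beta[:5] = [1.4, 1.2, 1.0, 0.8, 0.6]
    idx = np.arange(p - 1)
    L = np.linalg.cholesky(rho ** np.abs(np.subtract.outer(idx, idx)))
    X = np.column_stack([rng.standard_normal(n),
                         rng.standard_normal((n, p - 1)) @ L.T])
    eta = X @ beta + rng.standard_normal(n)              # linear predictor plus N(0,1) error
    y = 1 + rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))  # labels coded 1 / 2
    return X, y, beta
```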
Example 2 (Probit Model): Consider the probit regression model
where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution. Assume that the true active set is $\mathcal{A} = \{1, 5, 20, 21, 100\}$, and $\beta$ is the $p$-dimensional parameter vector with $\beta_{\mathcal{A}} = (1, 1, 1, 1, 1)^\top$ and 0 otherwise. The other settings are the same as in Example 1. The simulation results of Example 2 are shown in Tables 3 and 4.
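Similarly, a brief sketch for Example 2, assuming the standard probit link $P(Y = 2 \mid X) = \Phi(X^\top\beta)$; the paper's display equation, and whether the extra error term of Example 1 also enters here, are not reproduced above, so this is only an assumption.

```python
import numpy as np
from scipy.stats import norm

def generate_example2(n=100, p=1000, rho=0.5, seed=0):
    """Example 2 sketch: active set A = {1, 5, 20, 21, 100} (1-based) with beta_A = 1;
    covariates generated as in Example 1; assumed probit link Phi(X'beta)."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(p)
    beta[np.array([1, 5, 20, 21, 100]) - 1] = 1.0
    idx = np.arange(p - 1)
    L = np.linalg.cholesky(rho ** np.abs(np.subtract.outer(idx, idx)))
    X = np.column_stack([rng.standard_normal(n),
                         rng.standard_normal((n, p - 1)) @ L.T])
    y = 1 + rng.binomial(1, norm.cdf(X @ beta))           # labels coded 1 / 2
    return X, y, beta
```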
From Tables 1–4, we find that the proposed LDA-SIS procedure has better feature screening performance than the other procedures. The proportion of all active predictors selected into the screened submodel (Pa) is larger for the LDA-SIS procedure, and the minimum model size of the submodel containing all active predictors (MMS) is smaller. From Tables 1 and 2, as the correlation parameter $\rho$ increases, the performance of the feature screening procedures improves. This phenomenon shows that when the active predictors are strongly related to each other, the proposed feature screening procedure selects the important predictors more accurately. On the other hand, from Tables 3 and 4, as the correlation parameter $\rho$ increases, the performance of the feature screening procedures deteriorates. This shows that when the active predictors are strongly related to the inactive predictors, the feature screening accuracy is compromised. Furthermore, better results are obtained as the sample size increases.
3.2. Real data analysis
We applied the LDA-SIS feature screening procedure to the ovarian cancer data previously studied by Sorace and Zhan [2], Fushiki et al. [16], Zhang et al. [17], and Zhang et al. [18]. This dataset was generated using surface-enhanced laser desorption/ionization time-of-flight mass spectrometry and comprises serum samples from 162 ovarian cancer patients and 91 control subjects. The data are available on the Clinical Proteomics Program Databank website (http://home.ccr.cancer.gov/ncifdaproteomics/ppatterns.asp). For each ovarian cancer sample, we analyzed 15,154 distinct mass-to-charge ratios (M/Z). As reported by Sorace and Zhan [2], the region with M/Z values below 500 was often discarded as noise, reducing the dimensionality of the biomarker features from 15,154 to 12,757. Our goal in this study was to identify proteomic patterns corresponding to specific M/Z values that could distinguish ovarian cancer subjects from control subjects.
We randomly split the 253 samples into the training data set and the testing data set. In particular, we sampled approximately 100γ% of the ovarian cancer patients and 100γ% of the control subjects as the training data set, and the rest as the testing data set. We standardized the data to zero mean and unit variance before the discriminant classification.
Different feature screening procedures are utilized to identify the important potential biomarkers in the standardized training data. In our LDA-SIS procedure, we select the variable with the largest Kolmogorov-Smirnov statistic (Mai and Zou [3]) as $X_1$. Let $d_n = [c_0 n_{tr}/\log(n_{tr})]$ with $c_0 = 0.25, 0.5, 1$, where $n_{tr}$ is the sample size of the training data; in this case, $d_n$ = 8, 17 and 35, respectively. After the feature screening step, the kernel support vector machine (KSVM) with a Gaussian kernel and the penalized logistic model (PLM) with the LASSO (Tibshirani [19]) are applied in the modeling step to the screened $d_n$ potential biomarkers, and their performances are evaluated on the testing data. The R packages e1071 and glmnet are used here.
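A sketch of one split of this workflow in Python. The paper uses the R packages e1071 and glmnet; the scikit-learn SVC (RBF kernel) and L1-penalized LogisticRegressionCV below are stand-ins for them, the screening_stats function is the hypothetical sketch from Section 2.1, and the selection of $X_1$ by the Kolmogorov-Smirnov statistic (with the corresponding column reordering) is omitted for brevity.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegressionCV

def screen_and_classify(X_tr, y_tr, X_te, d_n, screening_stats):
    """Standardize, screen the training data down to d_n biomarkers, then fit a
    Gaussian-kernel SVM and an L1-penalized logistic model on the screened features."""
    scaler = StandardScaler().fit(X_tr)
    X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
    keep = np.argsort(screening_stats(X_tr, y_tr))[::-1][:d_n]   # top-d_n biomarkers
    ksvm = SVC(kernel="rbf").fit(X_tr[:, keep], y_tr)
    plm = LogisticRegressionCV(penalty="l1", solver="liblinear", cv=5).fit(X_tr[:, keep], y_tr)
    return ksvm.predict(X_te[:, keep]), plm.predict(X_te[:, keep])
```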
The procedure is repeated 200 times with γ = 0.7 and 0.8, respectively. Three assessment criteria are introduced to investigate the classification performance of the different methods.
● Testing error: The number of misclassified samples in the testing set.
● TPR (sensitivity or true positive rate): The proportion of ovarian cancer patients diagnosed correctly.
● PPV (positive predictive value): The proportion of samples diagnosed with ovarian cancer that actually had the disease (a short computation sketch of these criteria follows the list).
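The three criteria can be computed from the test-set predictions as follows; treating label 2 as the ovarian-cancer class is an assumption of this sketch.

```python
import numpy as np

def classification_metrics(y_true, y_pred, positive=2):
    """Testing error count, TPR (sensitivity) and PPV on the testing set."""
    err = int(np.sum(y_pred != y_true))                # number of misclassified samples
    tp = np.sum((y_pred == positive) & (y_true == positive))
    tpr = tp / max(np.sum(y_true == positive), 1)      # correctly diagnosed cancer patients
    ppv = tp / max(np.sum(y_pred == positive), 1)      # diagnosed positives that are true positives
    return err, tpr, ppv
```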
Table 5 summarizes the median and the robust standard deviation (RSD, in parentheses) of the testing error, TPR, and PPV. The results show that our proposed LDA-SIS method outperforms the other methods under these evaluation criteria. Furthermore, we observe that increasing the proportion of training data ($\gamma = 0.8$) leads to better performance across all model sizes $d_n$ (8, 17, 35).
4. Conclusions
In this article, we employed Fisher's linear projection and the marginal score test to develop a feature screening procedure for the ultrahigh-dimensional binary classification problem. Although many feature screening procedures, some of them even model-free, have been proposed for ultrahigh-dimensional discriminant analysis problems, the study of the LDA problem, one of the most popular approaches in discriminant classification and pattern recognition, remains very attractive. By minimizing the linearly projected within-group sum of squares and maximizing the linearly projected between-group sum of squares, we constructed the marginal score test and combined it with the linear projection optimization problem to build the feature screening index. The sure screening property and the minimum model size of the procedure are studied. The sure screening property ensures that the feature screening procedure retains all the important classification predictors with probability tending to 1, and Theorem 2 shows that as long as $\sum_{k=1}^{p}|\omega_k^*|$ is of a polynomial order of the sample size, the number of selected variables is also of a polynomial order of the sample size. The finite sample performance of the proposed procedure was illustrated by Monte Carlo studies and a real-data example. The simulation studies demonstrate that the proposed feature screening method performs well, and the simple structure of the screening index makes the calculation fast.
Acknowledgment
Peng Lai's research is supported by National Natural Science Foundation of China (11771215). Yanqiu Zhou's research is supported by Guangxi Science and Technology Base and Talent Project (2020ACI9151), and Guangxi University Young and Middle-aged Teachers Basic Research Ability Improvement Project (2021KY0343).
Conflict of interest
The authors declare no conflict of interest.
Appendix. Technical proofs
The proofs of Theorems 1 and 2 in this paper are similar to the proofs of Theorem 1 in Li et al. [15] and Theorems 1 and 2 in Liu et al. [20]. Similar lemmas are used here to facilitate proving the proposed theorems; these lemmas are listed below, and their proofs can be found in the appendices of Li et al. [15] and Liu et al. [20].
Lemma 1. Let $\mu = E(X)$. If $P(a_1 \le X \le b_1) = 1$, then
Lemma 2. Let $h(X_1, \ldots, X_m)$ be the kernel of the U-statistic $U_n$, and let $\theta = E\{h(X_1, \ldots, X_m)\}$. If $a \le h(X_1, \ldots, X_m) \le b$, then for any $\epsilon > 0$ and $n > m$, we have
where $[n/m]$ denotes the integer part of $n/m$.
Furthermore, due to the symmetry of the U-statistic, we also have
In the following, we give the proofs of Theorems 1 and 2. For convenience, we denote by $M$, $M_i$ and $c_i$, $i = 1, 2, \ldots$, generic constants depending on the context. Define $I_j = I(Y = j)$ and $I_{ij} = I(Y_i = j)$.
Proof of Theorem 1. For $\hat{\omega}_k^* - \omega_k^*$, we have
where $\hat{T}_{11} = \frac{1}{n}E_{11}$, $\hat{T}_{12} = \frac{1}{n}B_{k1}$, $\hat{T}_{21} = \frac{1}{n}B_{11}$ and $\hat{T}_{22} = \frac{1}{n}E_{k1}$. We first consider $\hat{T}_{11} - T_{11}$.
Define $\tilde{T}_{11} = \frac{1}{n}\sum_{j=1}^{2}\sum_{i=1}^{n}\left[X_{i1} - \frac{E(X_1 I_j)}{E(I_j)}\right]^2 I_{ij}$. We have
with $n$ sufficiently large, i.e., $n \ge M_1$. It follows that
where
Obviously, $\hat{T}_{11}^*$ is a U-statistic, and $T_{11}^*$ is the kernel of the U-statistic $\hat{T}_{11}^*$. Define $h_1(X_{i1}, Y_i) = \left[X_{i1} - \frac{E(X_1 I_j)}{E(I_j)}\right]^2 I_{ij}$. Thus, we have
Accordingly, we decompose $T_{11}^*$ into two parts:
Clearly, $\hat{T}_{111}^*$ and $\hat{T}_{112}^*$ are unbiased estimators of $T_{111}^*$ and $T_{112}^*$, respectively.
Similar to the proof of Theorem 2 in Zhu et al. [21], using Markov's inequality and the properties of U-statistics, for any $t > 0$ we can obtain
where the last inequality follows from Lemma 1. By choosing $t = 4\varepsilon n/M^2$, we have
Therefore, by the symmetry of the U-statistic, we get
Next, we show the consistency of $\hat{T}_{112}^*$.
By the Cauchy-Schwarz and Markov inequalities, for any $s' > 0$,
Note that
which yields
By Condition C1, if we choose $M = cn^{\gamma}$ for $0 < \gamma < \frac{1}{2} - \kappa$, then $T_{112}^* \le \frac{\varepsilon}{2}$ when $n$ is sufficiently large. Consequently, similar to the proof of (B.4) in Li et al. [15], there exist some constant $c_1$ and some $s > 0$ such that
Recall that $M = cn^{\gamma}$. Combining (3)–(6), we have
where $c_2$ and $c_3$ are some positive constants.
Similarly, for $u, v = 1, 2$, we can prove
where $c_{1uv}$, $c_{2uv}$ and $c_{3uv}$ are some positive constants. Therefore, it follows that
for $u, v = 1, 2$.
Therefore, similar to the proofs of Lemmas 4 and 5 in Liu et al. [20], by (1), (2) and (7), we get
where $c_4$–$c_6$ are some positive constants. Thus,
Next, we prove the second part of Theorem 1 using a method similar to the proof of Theorem 1 in Li et al. [15]. If $\mathcal{A} \not\subseteq \hat{\mathcal{A}}_{c_n}$, then there must exist some $k \in \mathcal{A}$ such that $\hat{\omega}_k^{*} \le cn^{-\kappa}$. Since $\min_{k \in \mathcal{A}} \omega_k^{*} \ge 2cn^{-\kappa}$, it follows that
Therefore,
This completes the proof of the second part.
Proof of Theorem 2. Note that for any $c_9 > 0$, the number of $\{k : |\omega_k^*| > \frac{c_9}{2}n^{-\kappa}\}$ is bounded by $O(n^{\kappa}\sum_{k=1}^{p}|\omega_k^*|)$. Then, on the set
the number of $\{k : |\hat{\omega}_k^*| > c_9 n^{-\kappa}\}$ cannot exceed the number of $\{k : |\omega_k^*| > \frac{c_9}{2}n^{-\kappa}\}$. Therefore, we have
Then, by (8), the proof is completed.