A novel concentration inequality for the sum of independent sub-Gaussian variables with random, dependent weights is introduced in statistical settings for high-dimensional data. The random dependent weights are functions of regularized estimators. We apply the proposed concentration inequality to obtain a high-probability bound on the stochastic Lipschitz constant of the negative binomial loss functions involved in Lasso-penalized negative binomial regressions, and we use this bound to study oracle inequalities for Lasso estimators. Additionally, a similar concentration inequality is derived for a randomly weighted sum of independent centered exponential family variables.
Citation: Huiming Zhang, Hengzhen Huang. Concentration for multiplier empirical processes with dependent weights[J]. AIMS Mathematics, 2023, 8(12): 28738-28752. doi: 10.3934/math.20231471
Over the last two decades, modern data collection techniques have enabled scientists and engineers to access and load vast numbers of variables as random data in their experiments. Probability theory provides the mathematical foundation for statistics, and data-driven problems have led to various new advances in statistical research, which in turn contribute new and challenging problems in probability for further study. For instance, the rapid development of high-dimensional statistics has spurred the growth of probability theory and even pure mathematics, including concentration inequalities, random matrix theory, geometric functional analysis and more [2,16].
The emergence of high-throughput data has led to a surge in statistical research on complex data, particularly on high-dimensional data and statistical learning [14,20]. This trend has gained traction in various scientific fields despite the high cost of measurements. Data sets are typically small, with only tens or hundreds of observations, and limited computing power often restricts the size of usable finite samples. As a result, modern statisticians and data scientists have shifted their focus from asymptotic to non-asymptotic analysis, as the latter can handle small sample sizes and large model dimensions. Concentration inequalities play a crucial role in high-dimensional statistical inference, since they yield explicit non-asymptotic error bounds as functions of the sample size, sparsity level and dimension. When analyzing the various error bounds of regularized estimators, concentration inequalities are indispensable tools [3,20].
When the random variables are unbounded, the classical Hoeffding inequality [8] fails to provide a non-asymptotic analysis. We need the concept of sub-Gaussian random variables [9] to obtain tight Hoeffding-type concentration inequalities for sums of independent random variables. A centered random variable (r.v.) $X$ is called sub-Gaussian ($X\sim\mathrm{subG}(\sigma^2)$) if $Ee^{sX}\le e^{s^2\sigma^2/2}$ for all $s\in\mathbb{R}$, where $\sigma>0$ is the sub-Gaussian parameter. From Chernoff's inequality, the exponential decay of the sub-Gaussian tail is obtained by $P(X\ge t)\le\inf_{s>0}\exp\{-st\}E\exp\{sX\}\le\inf_{s>0}\exp(-st+\tfrac{\sigma^2s^2}{2})=\exp(-\tfrac{t^2}{2\sigma^2})$, minimizing the upper bound by putting $s=t/\sigma^2$. Moreover, for independent $\{X_i\}_{i=1}^n$ with $X_i\sim\mathrm{subG}(\sigma_i^2)$, we have the sub-Gaussian concentration inequality
$$P\Big(\Big|\sum_{i=1}^nX_i\Big|\ge t\Big)\le2\exp\Big\{-\frac{t^2}{2\sum_{i=1}^n\sigma_i^2}\Big\},\qquad t\ge0, \tag{1.1}$$
for any variance proxies $\{\sigma_i^2\}_{i=1}^n$ of $\{X_i\}_{i=1}^n$ (see Theorem 1.5 in [1]). Define the $L_p$-norm of an r.v. $X$ as $\|X\|_p:=(E|X|^p)^{1/p}$. An alternative form of the sub-Gaussian parameter is given by the sub-Gaussian norm $\|\cdot\|_{\theta_2}$: for a zero-mean r.v. $X$, it is defined as $\|X\|_{\theta_2}:=\sup_{p\ge1}\big[\frac{EX^{2p}}{(2p-1)!!}\big]^{1/(2p)}$ (see page 23 in [1]).
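As a quick sanity check (an illustration we add, not part of the paper; numpy is assumed available), the following snippet estimates this norm by Monte Carlo for a standard normal variable, for which $EX^{2p}=(2p-1)!!$ so the norm equals 1, and confirms numerically that the normalizer $(2p-1)!!$ coincides with $(2p)!/(2^pp!)$, the form used later in the proof of Theorem 1.

```python
import math
import numpy as np

def double_factorial_odd(p):
    # (2p-1)!! written as (2p)! / (2^p * p!)
    return math.factorial(2 * p) / (2 ** p * math.factorial(p))

def theta2_norm_mc(sample, p_max=6):
    # Monte Carlo estimate of sup_{p>=1} [ E X^{2p} / (2p-1)!! ]^{1/(2p)}
    vals = []
    for p in range(1, p_max + 1):
        moment = np.mean(sample ** (2 * p))
        vals.append((moment / double_factorial_odd(p)) ** (1.0 / (2 * p)))
    return max(vals)

rng = np.random.default_rng(0)
x = rng.standard_normal(10**6)
# For N(0,1), E X^{2p} = (2p-1)!!, so every ratio is 1 and the norm is 1.
print(theta2_norm_mc(x))                       # ~ 1.0 up to Monte Carlo error
# The two normalizers coincide: (2p-1)!! = (2p)!/(2^p p!).
print([double_factorial_odd(p) == math.prod(range(1, 2 * p, 2)) for p in range(1, 6)])
```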
Corollary 1.7 in [15] extended the sub-Gaussian concentration inequality (1.1) to the weighted sum of independent sub-Gaussian random variables with fixed weights.
Lemma 1. [Concentration for weighted sub-Gaussian sum] Let $Y_1,\dots,Y_n$ be $n$ independent r.v.s with $Y_i\sim\mathrm{subG}(\sigma_i^2)$. Define $\sigma^2=\max_{1\le i\le n}\sigma_i^2<\infty$. For any $w:=(w_1,\cdots,w_n)^\top$, we have
$$P\Big(\Big|\sum_{i=1}^nw_iY_i\Big|>t\Big)\le2\exp\Big(-\frac{t^2}{2\sigma^2\|w\|_2^2}\Big).$$
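The following Monte Carlo sketch (an illustration we add, with Gaussian $Y_i$ and arbitrary fixed weights) compares the empirical tail of the weighted sum with the bound of Lemma 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 50, 200_000
w = rng.uniform(-1.0, 1.0, size=n)     # fixed, non-random weights
sigma2 = 1.0                           # each Y_i ~ N(0,1) is subG(1)

Y = rng.standard_normal((reps, n))
S = Y @ w                              # weighted sums sum_i w_i Y_i

for t in (2.0, 4.0, 6.0):
    emp = np.mean(np.abs(S) > t)
    bound = 2 * np.exp(-t**2 / (2 * sigma2 * np.sum(w**2)))
    print(f"t={t}: empirical {emp:.4g} <= bound {bound:.4g}")
```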
However, if the $w_i$'s are random in Lemma 1, the story is totally different. The goal of this paper is to obtain novel theoretical results on concentration inequalities for sums of dependent variables with random weights, in a high-dimensional data setting. Our theory is motivated by the non-asymptotic oracle inequalities of regularized estimators in high-dimensional negative binomial regressions [21], and by the concentration of random Lipschitz coefficients associated with empirical loss functions [4]. Our setting is different from classical multiplier empirical processes serving multiplier bootstrap inference, where the multipliers are random variables independent of $\{Y_i\}_{i=1}^n$ (see Chapter 2.9 of [17] and [6,7]). Mendelson [11] studied concentration inequalities for the centered multiplier process indexed by a functional class, where the i.i.d. multipliers need not be independent of the original empirical processes. In analyses of high-dimensional continuous-data regressions by empirical processes, researchers often resort to concentration inequalities for Lipschitz functions of strongly log-concave distributions (see Theorems 2.26 and 3.16 in [18]). For high-dimensional count-data regressions, our Section 3.3 discusses discrete distributions with a strongly log-concave structure, for which the definition of discrete strong log-concavity is hard to check (see (3.12) below); this strong assumption is usually intractable and unverifiable from the data. In contrast, the sub-Gaussian assumption for i.i.d. data is testable (see [23]).
In section two, we present the main results of the theory and demonstrate their applications in a class of high-dimensional generalized linear models. Theoretical proofs of the main results and some lemmas and additional results are given in section three. Finally, the conclusions are presented in section four.
When controlling a sum of functions of the random sample indexed by a common estimator $\hat\theta$, it is invalid to directly apply classical laws of large numbers and central limit theorems (or concentration inequalities for independent sums).
Formally, let $X_1,\dots,X_n$ be a random sample independently drawn from $P$ on a measurable space $(\mathcal{X},\mathcal{A})$. Given an estimator $\hat\theta$, we want to study the behavior of the centered sum of the functions $f_{\hat\theta}(X_i)$,
$$\frac1n\sum_{i=1}^n\big[f_{\hat\theta}(X_i)-Ef_{\hat\theta}(X_i)\big].$$
A possible solution is to prove a uniform version (the suprema of empirical processes, see [17]) over all possible $\hat\theta$ in a set $K$, which is usually stronger than what is needed:
$$\frac1n\sum_{i=1}^n\big[f_{\hat\theta}(X_i)-Ef_{\hat\theta}(X_i)\big]\le\sup_{\theta\in K}\Big|\frac1n\sum_{i=1}^n\big[f_{\theta}(X_i)-Ef_{\theta}(X_i)\big]\Big|.$$
The sum inside the supremum enjoys independence.
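To see why some uniform control is unavoidable, here is a toy illustration we add: if each weight is allowed to look at the data, e.g. $w_i=\mathrm{sign}(Y_i)$, the weighted average no longer concentrates around zero, and the quantity one can still control is the supremum over the whole weight class.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
Y = rng.standard_normal(n)

# Data-dependent weights: each w_i looks at Y_i itself.
w_dep = np.sign(Y)
print(np.mean(w_dep * Y))      # ~ E|Y| = sqrt(2/pi) ~ 0.80, far from 0

# Fixed (data-independent) weights of the same magnitude concentrate near 0.
w_fix = rng.choice([-1.0, 1.0], size=n)
print(np.mean(w_fix * Y))      # ~ 0, fluctuations of order 1/sqrt(n)

# What uniform arguments bound instead: the supremum over all sign vectors,
# sup_w |(1/n) sum_i w_i Y_i| = (1/n) sum_i |Y_i|.
print(np.mean(np.abs(Y)))
```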
In the following theorem, we extend Lemma 1 to dependent and random weights.
Theorem 1. [Concentration for weighted dependent sum] Let the $Y_i$'s be independent centered sub-Gaussian random variables with $\max_{1\le i\le n}\|Y_i\|_{\theta_2}<\infty$. Let the $w_i(\hat\theta)$'s be bounded functions of a bounded random vector $\hat\theta$ serving as the weights (they may depend on all the $Y_i$'s), where $\|\hat\theta\|_1\le r<\infty$ and $\max_{1\le i\le n}|w_i(\cdot)|\le1$. Then, with probability at least $1-\delta$,
$$\Big|\frac1n\sum_{i=1}^nw_i(\hat\theta)Y_i\Big|\le4\sqrt{\frac1n\sum_{i=1}^n\big\||Y_i-Y_i'|\big\|_{\theta_2}^2}\,\sqrt{\frac{\log\delta^{-1}}n}+2\sqrt{\frac1n\sum_{i=1}^nE(Y_i^2)}\,\sqrt{\frac{2\log(2p)}n}, \tag{2.1}$$
for all $\hat\theta$ with $\|\hat\theta\|_1\le r$, where $Y_i'$ is an independent copy of $Y_i$.
The first term in (2.1) is due to sub-Gaussian concentration, and the second term in (2.1) comes from the upper bound on the expectation of the supremum of the empirical process $f(Y):=\frac1n\sup_{\|\theta\|_1\le r}|\sum_{i=1}^nw_i(\theta)Y_i|$ (see the proof in Section 3).
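For concreteness, the following small helper (an illustration we add; the $\theta_2$-norms and second moments are assumed to be known or separately estimated, which goes beyond what the theorem itself provides) evaluates the right-hand side of (2.1) for given plug-in values.

```python
import numpy as np

def theorem1_bound(theta2_norms_diff, second_moments, p, delta):
    """Evaluate the right-hand side of (2.1).

    theta2_norms_diff : values of || |Y_i - Y_i'| ||_{theta_2} (assumed known or estimated)
    second_moments    : values of E[Y_i^2]
    p                 : the dimension appearing in log(2p)
    delta             : the bound holds with probability at least 1 - delta
    """
    theta2_norms_diff = np.asarray(theta2_norms_diff, dtype=float)
    second_moments = np.asarray(second_moments, dtype=float)
    n = len(theta2_norms_diff)
    term1 = 4 * np.sqrt(np.mean(theta2_norms_diff**2)) * np.sqrt(np.log(1 / delta) / n)
    term2 = 2 * np.sqrt(np.mean(second_moments)) * np.sqrt(2 * np.log(2 * p) / n)
    return term1 + term2

# Hypothetical example: n = 500, p = 1000, 99% confidence.
print(theorem1_bound(np.full(500, 2.0), np.full(500, 1.0), p=1000, delta=0.01))
```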
The concentration of random Lipschitz coefficients associated with empirical loss functions is crucial for deriving error bounds of Lasso or Elastic-net penalized high-dimensional generalized linear models (GLMs) in high-dimensional regressions. For more information, please refer to [3,4,10].
Definition 1. [Elastic-net or Lasso penalized loss problems] Let $\{(Y_i,X_i)\}_{i=1}^n$ be independent and identically distributed random variables taking values in $\mathbb{R}\times\mathbb{R}^p$, where $\{Y_i\}_{i=1}^n\sim Y$ are response variables and $\{X_i\}_{i=1}^n\sim X$ are covariates. Let $l(y,x,\beta)$ be a loss function of the parameter $\beta$ and data $(y,x)$. The empirical loss function is defined as
$$P_nl(Y,X,\beta):=\frac1n\sum_{i=1}^nl(Y_i,X_i,\beta).$$
Elastic-net (or Lasso) estimators are given by
$$\hat\beta=:\hat\beta(\lambda_1,\lambda_2)=\mathop{\arg\min}_{\beta\in\mathbb{R}^p}\big\{P_nl(Y,X,\beta)+\lambda_1\|\beta\|_1+\lambda_2\|\beta\|_2^2\big\}, \tag{2.2}$$
where λ1>0 and λ2≥0 are tuning parameters.
Define the minimizer
$$\beta^*=\mathop{\arg\min}_{\beta\in\mathbb{R}^p}E[l(Y,X,\beta)] \tag{2.3}$$
as the vector of true coefficients, where $l(Y,X,\beta)$ is the loss function. The $\ell_1$ ball is denoted as $S_R(\beta^*):=\{\beta\in\mathbb{R}^p:\|\beta-\beta^*\|_1\le R\}$. Theorem 1 can be used to establish exponential-type concentration inequalities for the local stochastic Lipschitz (LSL) constant:
$$\sup_{\beta\in S_R(\beta^*)}\frac{(P_n-P)\big[l(Y,X,\beta)-l(Y,X,\beta^*)\big]}{\|\beta-\beta^*\|_1}.$$
When studying error bounds $\|\hat\beta-\beta^*\|_1$ of Lasso or Elastic-net penalized high-dimensional GLMs, one must bound the dependent empirical process $\frac{(P_n-P)[l(Y,X,\hat\beta)-l(Y,X,\beta^*)]}{\|\hat\beta-\beta^*\|_1}$, with the LSL constant as its upper bound.
Next, we provide an example of the negative binomial loss in negative binomial regressions [21]. The negative binomial loss function is $l(y,x,\beta)=(\theta+y)\log(\theta+e^{x^\top\beta})-yx^\top\beta$, where $\theta$ is called the dispersion parameter. Denote the expected risk function as $Pl(Y,X,\beta):=El(Y,X,\beta)$. Let $l_1(y,x,\beta):=-y[x^\top\beta-\log(\theta+\exp\{x^\top\beta\})]$ and $l_2(x,\beta):=\theta\log(\theta+\exp\{x^\top\beta\})$; then
$$(P_n-P)[l(y,x,\beta)-l(y,x,\beta^*)]=(P_n-P)[l_1(y,x,\beta)-l_1(y,x,\beta^*)]+(P_n-P)[l_2(x,\beta)-l_2(x,\beta^*)].$$
Obtaining upper bounds for the first and second parts of the empirical process, $(P_n-P)(l_m(\beta^*)-l_m(\hat\beta))$ for $m=1,2$, is paramount for studying the error bound on $\|\hat\beta-\beta^*\|_1$.
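As an illustration of this loss and of the penalized problem (2.2), the sketch below codes the empirical negative binomial loss and the Elastic-net objective and minimizes it with a generic derivative-free optimizer; the toy data, tuning parameters and optimizer are hypothetical choices for illustration only and are not the algorithm used in [21].

```python
import numpy as np
from scipy.optimize import minimize

def nb_loss(beta, X, y, theta):
    # l(y, x, beta) = (theta + y) * log(theta + exp(x^T beta)) - y * x^T beta,
    # averaged over the sample: the empirical loss P_n l.
    eta = X @ beta
    return np.mean((theta + y) * np.log(theta + np.exp(eta)) - y * eta)

def penalized_objective(beta, X, y, theta, lam1, lam2):
    # Elastic-net objective from (2.2); lam2 = 0 gives the Lasso case.
    return nb_loss(beta, X, y, theta) + lam1 * np.sum(np.abs(beta)) + lam2 * np.sum(beta**2)

# Toy data (hypothetical): n = 200 observations, p = 10 covariates, dispersion theta = 5.
rng = np.random.default_rng(3)
n, p, theta = 200, 10, 5.0
X = rng.uniform(-0.3, 0.3, size=(n, p))
beta_true = np.zeros(p); beta_true[:3] = 1.0
mu = np.exp(X @ beta_true)
y = rng.negative_binomial(theta, theta / (theta + mu))   # NB responses with mean mu

res = minimize(penalized_objective, x0=np.zeros(p),
               args=(X, y, theta, 0.05, 0.0), method="Powell")
print(np.round(res.x, 2))
```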
Let λ be a positive constant that needs to be determined. We have
$$P\Big(\sup_{\beta\in S_R(\beta^*)}\frac{|(P_n-P)[l(Y,X,\beta)-l(Y,X,\beta^*)]|}{\|\beta-\beta^*\|_1}\ge\lambda\Big)\le P\Big(\sup_{\beta\in S_R(\beta^*)}\frac{|(P_n-P)[l_1(Y,X,\beta)-l_1(Y,X,\beta^*)]|}{\|\beta-\beta^*\|_1}\ge\frac\lambda2\Big)+P\Big(\sup_{\beta\in S_R(\beta^*)}\frac{|(P_n-P)[l_2(X,\beta)-l_2(X,\beta^*)]|}{\|\beta-\beta^*\|_1}\ge\frac\lambda2\Big). \tag{2.4}$$
Here, we assume that both $x$ and $\beta$ are bounded, and that $\theta$ is a known dispersion parameter. The second term in (2.4) is easy to handle if we apply McDiarmid's inequality (see Lemma 4 of [21]). However, the first term in (2.4) is hard to control since it contains the unbounded negative binomial variables $\{Y_i\}_{i=1}^n$. Zhang and Jia [21] used a concentration inequality for strongly log-concave discrete distributions to solve this problem, but the strong log-concavity property is difficult to check for discrete distributions (see (H.4) in [21]). The sub-Gaussian assumption is easy to verify for negative binomial variables, since the negative binomial distribution belongs to the exponential family when the dispersion parameter is given. When $\Theta$ is compact in (2.7) below, Proposition 3.2 in [20] shows that $\{Y_i\}_{i=1}^n$ is sub-Gaussian.
From the mean value form of Taylor's expansion, one has $\log(\theta+e^{x})-\log(\theta+e^{a})=\frac{e^{\tilde a}}{\theta+e^{\tilde a}}(x-a)$, where $\tilde a$ is some real number between $a$ and $x$. Let $X_i^\top\tilde\beta$ be some point between $X_i^\top\hat\beta$ and $X_i^\top\beta^*$, i.e., $\tilde\beta=(t_1\hat\beta_1,\dots,t_p\hat\beta_p)^\top+((1-t_1)\beta^*_1,\dots,(1-t_p)\beta^*_p)^\top$ for $\{t_j\}_{j=1}^p\subset[0,1]$. Observe that
$$(P_n-P)[l_1(\beta^*)-l_1(\hat\beta)]=-\frac1n\sum_{i=1}^n(Y_i-EY_i)\Big[X_i^\top(\beta^*-\hat\beta)-\log\Big(\frac{\theta+\exp\{X_i^\top\beta^*\}}{\theta+\exp\{X_i^\top\hat\beta\}}\Big)\Big]=-\frac1n\sum_{i=1}^n(Y_i-EY_i)\Big[X_i^\top(\beta^*-\hat\beta)-\frac{\exp\{X_i^\top\tilde\beta\}X_i^\top(\beta^*-\hat\beta)}{\theta+\exp\{X_i^\top\tilde\beta\}}\Big]=\frac1n\sum_{i=1}^n\frac{\theta X_i^\top(\hat\beta-\beta^*)}{\theta+\exp\{X_i^\top\tilde\beta\}}\cdot(Y_i-EY_i). \tag{2.5}$$
For a finite $M_0$, if $\hat\beta\in S_{M_0}(\beta^*)$, then $\tilde\beta\in S_{M_0}(\beta^*)$. This follows from $\|\tilde\beta-\beta^*\|_1\le\sum_{j=1}^pt_j|\hat\beta_j-\beta^*_j|\le\|\hat\beta-\beta^*\|_1\le M_0$. Suppose $\|X_i\|_\infty$ is uniformly bounded by $1/M_0$. Then (2.5) equals $\frac1n\sum_{i=1}^nw_i(\hat\beta)(Y_i-EY_i)$ with dependent weights
$$w_i(\hat\beta):=\frac{\theta X_i^\top(\hat\beta-\beta^*)}{\theta+\exp\{X_i^\top\tilde\beta\}}\quad\text{and}\quad|w_i(\hat\beta)|\le1.$$
Thus, the high probability upper bound in Theorem 1 is applicable to determine $\lambda/2$, i.e.,
$$\frac\lambda2=4\sqrt{\frac1n\sum_{k=1}^n\big\||Y_k-Y_k'|\big\|_{\theta_2}^2}\,\sqrt{\frac{\log\delta^{-1}}n}+2\sqrt{\frac1n\sum_{i=1}^nE(Y_i-EY_i)^2}\,\sqrt{\frac{2\log2p}n}. \tag{2.6}$$
In this section, let {Yi}ni=1 be exponential family random variables with density
$$f(y_i;\eta_i)=c(y_i)\exp\{y_i\eta_i-b(\eta_i)\},\qquad\eta_i\in\Theta. \tag{2.7}$$
Here, E(Yi)=˙b(ηi) and Var(Yi)=¨b(ηi). It should be noted that Proposition 3.2 in [20] shows that {Yi}ni=1 is sub-Gaussian if Θ is compact.
Under the distributional assumption (2.7), we will study Bernstein-type concentration inequalities for the randomly weighted sum of centered exponential family random variables (with different parameters $\eta_i$):
n∑i=1{wi(ˆθ)Yi−E[wi(ˆθ)Yi]}, |
where $\{w_i(\hat\theta)\}_{i=1}^n$ are called the multipliers (or random weights), and $\hat\theta$ is an estimator that may depend on $\{Y_i\}_{i=1}^n$.
Theorem 2. [Concentration inequalities for randomly weighted sums of exponential family r.v.s] Suppose $\{Y_i\}_{i=1}^n$ have densities of the form (2.7) and satisfy the moment conditions $E|Y_i|^k\le k!C_Y^k$ for all $k\ge1$, where $C_Y>0$ is a constant. We assume that
(i) Bounded weights: Wˆθ:=(w1(ˆθ),⋯,wn(ˆθ))⊤ is a random vector s.t. max1≤i≤n|wi(ˆθ)|≤w<∞;
(ii) Let E[|Yi|k|Wˆθ]=ρi,kE[|Yi|k] with Eρi,k=1;
(iii) There exists a non-decreasing sequence {un} and constant Cρ s.t. P(maxk≥1,1≤i≤nρi,k>un)≤Cρ/un. Then,
$$P\Big(\Big|\sum_{i=1}^n\big[w_i(\hat\theta)Y_i-E[w_i(\hat\theta)Y_i]\big]\Big|\ge t\Big)\le2\exp\Big\{-\frac{t^2}{16nu_n(wC_Y)^2+4wC_Yt}\Big\}+\frac{C_\rho}{u_n}, \tag{2.8}$$
where {wi(ˆθ)Yi} is dependent since each wi(ˆθ)Yi depends on a common estimator ˆθ from the data {Yi}ni=1.
It should be noted that our work here is related to Proposition 3.2 in [20], which concerns concentration inequalities for non-randomly weighted sums of exponential family random variables. For condition (ii), suppose that the estimator $\hat\theta$ converges to a true parameter $\theta^*$ almost surely; then $E[|Y_i|^k|W_{\theta^*}]=E[|Y_i|^k]$ since $W_{\theta^*}$ is non-random. The difference between the conditional and unconditional expectations is:
$$\max_{k\ge1,1\le i\le n}|\rho_{i,k}-1|=\max_{k\ge1,1\le i\le n}\frac{\big|E[|Y_i|^k|W_{\hat\theta}]-E[|Y_i|^k|W_{\theta^*}]\big|}{E[|Y_i|^k|W_{\theta^*}]}\le O_p(\|\hat\theta-\theta^*\|_1),$$
provided $E[|Y_i|^k|W_\beta]$ is an $\ell_1$-Lipschitz function of $\beta$. We call $P(\max_{k\ge1,1\le i\le n}\rho_{i,k}\ge u_n)\le C_\rho/u_n$ in assumption (iii) a high-level condition. Intuitively, due to the dependence in the summation, the randomly weighted sum loses some rate of convergence in the exponential inequality (the additional term $4wC_Yt$ appears), compared to the case of a non-randomly weighted sum. The assumption of a compact parameter space for the exponential family is key to obtaining sub-Gaussian-type concentration inequalities.
Our multiplier concentration inequality here is different from that of [11], which studies concentration upper bounds for centered multiplier empirical processes $\frac1{\sqrt n}\sum_{i=1}^n[W_iY_i-E(W_iY_i)]$ (the random weights $\{W_i\}$ and random variables $\{Y_i\}$ need not be independent); however, [11] assumes that $\{W_i\}$ is i.i.d. To the best of our knowledge, Theorem 2 is a new concentration inequality suitable for weighted sums of dependent random variables.
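A toy Monte Carlo sketch (ours, purely illustrative) of the setting of Theorem 2: Poisson responses, which are of the form (2.7), and a bounded weight that depends on every $Y_i$ through the sample mean $\hat\theta$; the centered, randomly weighted sum still shows fast tail decay, in line with (2.8). The constants $u_n$, $C_Y$ and $C_\rho$ are not evaluated here.

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps, mu = 200, 50_000, 3.0

Y = rng.poisson(mu, size=(reps, n))            # exponential-family responses
theta_hat = Y.mean(axis=1, keepdims=True)      # common estimator built from all Y_i

# Bounded weight (|w| <= 1) that depends on every Y_i through theta_hat.
w = 1.0 / (1.0 + (theta_hat - mu) ** 2)
S = (w * Y).sum(axis=1)                        # randomly weighted sum
S_centered = S - S.mean()                      # center by a Monte Carlo estimate of its mean

for t in (20, 40, 60):
    print(t, np.mean(np.abs(S_centered) > t))  # tails decay quickly with t
```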
Proof of Theorem 1: Let $Y=(Y_1,\cdots,Y_n)^\top$ be a vector of independent r.v.s in a space $\mathcal{Y}$, and define $(Y_1',\cdots,Y_n')^\top$ as an independent copy of $(Y_1,\cdots,Y_n)^\top$. For any function $f:\mathcal{Y}^n\to\mathbb{R}$, it is of interest to study the concentration of $f(Y)$ about its expectation. In the case of Theorem 1,
f(Y):=1nsup‖θ‖1≤r|n∑i=1wi(θ)Yi|. |
For z∈Y and k∈{1,…,n}, define the substitution operator
Skz:Yn→Yn by Skzy:=(y1,…,yk−1,z,yk+1,…,yn) |
and the centered conditional version of f
$$D_{f,Y_k}(y):=f(y_1,\dots,y_{k-1},Y_k,y_{k+1},\dots,y_n)-Ef(y_1,\dots,y_{k-1},Y_k',y_{k+1},\dots,y_n)=f(S^k_{Y_k}y)-Ef(S^k_{Y_k'}y)=E\big[f(S^k_{Y_k}y)-f(S^k_{Y_k'}y)\mid Y_k\big]. \tag{3.1}$$
Next, we use a sub-Gaussian concentration inequality for $f(Z)$ with a sharper constant, Corollary 4 in [22], which requires a $\|\cdot\|_{\theta_2}$-norm condition on the r.v.s $\{D_{f,Z_i}(z)\}_{i=1}^n$.
Lemma 2. If {Df,Zi(z)}ni=1 has finite ‖⋅‖θ2-norm for z∈Z, then f(Z)−Ef(Z)∼subG(8supz∈Z∑ni=1‖Df,Zi(z)‖2θ2) and
$$P\{f(Z)-Ef(Z)>t\}\le e^{-t^2/\big(16\sup_{z\in\mathcal{Z}}\sum_{i=1}^n\|D_{f,Z_i}(z)\|_{\theta_2}^2\big)},\qquad t\ge0.$$
From the identity in (3.1), we have
$$\begin{aligned}\|D_{f,Y_k}(y)\|_{\theta_2}&=\big\|f(y_1,\dots,y_{k-1},Y_k,y_{k+1},\dots,y_n)-Ef(y_1,\dots,y_{k-1},Y_k',y_{k+1},\dots,y_n)\big\|_{\theta_2}\\&=\Big\|E\Big[\tfrac1n\sup_{\|\theta\|_1\le r}\big|w_1(\theta)y_1+\cdots+w_{k-1}(\theta)y_{k-1}+w_k(\theta)Y_k+w_{k+1}(\theta)y_{k+1}+\cdots+w_n(\theta)y_n\big|\\&\qquad\quad-\tfrac1n\sup_{\|\theta\|_1\le r}\big|w_1(\theta)y_1+\cdots+w_{k-1}(\theta)y_{k-1}+w_k(\theta)Y_k'+w_{k+1}(\theta)y_{k+1}+\cdots+w_n(\theta)y_n\big|\,\Big|\,Y_k\Big]\Big\|_{\theta_2}\\&\le\frac1n\Big\|E\Big[\sup_{\|\theta\|_1\le r}|w_k(\theta)|\,|Y_k-Y_k'|\,\Big|\,Y_k\Big]\Big\|_{\theta_2}\le\frac1n\big\|E\big[|Y_k-Y_k'|\,\big|\,Y_k\big]\big\|_{\theta_2}. \end{aligned}\tag{3.2}$$
The conditional Jensen's inequality gives
$$E\big[\big|E[|Y_k-Y_k'|\mid Y_k]\big|^p\big]\le E\big[\big(E\{|Y_k-Y_k'|\mid Y_k\}\big)^p\big]=E\big[\big(E\{(|Y_k-Y_k'|^p)^{1/p}\mid Y_k\}\big)^p\big]\le E\big[E\{|Y_k-Y_k'|^p\mid Y_k\}\big]=E|Y_k-Y_k'|^p,\quad p\ge1. \tag{3.3}$$
The definition $\|X\|_{\theta_2}=\sup_{k\ge1}\big[\frac{2^kk!}{(2k)!}EX^{2k}\big]^{1/(2k)}$ (equivalent to the form given in Section 1, since $(2k)!/(2^kk!)=(2k-1)!!$) shows that $\|D_{f,Y_k}(y)\|_{\theta_2}\le\frac1n\big\||Y_k-Y_k'|\big\|_{\theta_2}$ by (3.2) and (3.3). Hence, we have $\sup_{z\in\mathcal{Z}}\sum_{i=1}^n\|D_{f,Z_i}(z)\|_{\theta_2}^2\le\frac1{n^2}\sum_{k=1}^n\big\||Y_k-Y_k'|\big\|_{\theta_2}^2$ in Lemma 2, which leads to
$$P\{f(Y)-Ef(Y)>t\}\le e^{-(nt)^2/\big(16\sum_{k=1}^n\||Y_k-Y_k'|\|_{\theta_2}^2\big)},\qquad t\ge0.$$
Let $\delta=e^{-(nt)^2/\big(16\sum_{k=1}^n\||Y_k-Y_k'|\|_{\theta_2}^2\big)}$, i.e., $t=4\sqrt{\frac1n\sum_{k=1}^n\||Y_k-Y_k'|\|_{\theta_2}^2}\sqrt{\frac{\log\delta^{-1}}n}$. We have
$$f(Y)\le t+Ef(Y)=4\sqrt{\frac1n\sum_{i=1}^n\big\||Y_i-Y_i'|\big\|_{\theta_2}^2}\,\sqrt{\frac{\log\delta^{-1}}n}+Ef(Y), \tag{3.4}$$
with probability at least 1−δ.
It remains to obtain a bound on $Ef(Y)$, which is upper bounded via the symmetrization theorem with different functions, Lemma 3 below. To see this, let $X_i=Y_i$ in Lemma 3 and $g_i(Y_i)=w_i(\theta)Y_i$ for $i=1,\cdots,n$.
Since the $w_i(\theta)$'s are a series of bounded functions of a common bounded vector $\theta$ with $\|\theta\|_1\le r$ and $\max_{1\le i\le n}|w_i(\cdot)|\le1$, for any vector $\theta$ with $\|\theta\|_1\le r$ there exists a sequence of vectors $\{a_{w_i}\}_{i=1}^n\subset\mathbb{R}^p$ with $\|a_{w_i}\|_\infty\le1/r$ such that
$$w_i(\theta)=a_{w_i}^\top\theta\le\|a_{w_i}\|_\infty\|\theta\|_1\le1. \tag{3.5}$$
Equation (3.5) and Lemma 3 imply
$$\begin{aligned}Ef(Y)&\le\frac2nE\Big(\sup_{\|\theta\|_1\le r}\Big|\sum_{i=1}^nw_i(\theta)\epsilon_iY_i\Big|\Big)=\frac2nE\Big(\sup_{\|\theta\|_1\le r}\Big|\sum_{i=1}^n\sum_{j=1}^p\epsilon_iY_ia_{w_ij}\theta_j\Big|\Big)=\frac2nE\Big(\sup_{\|\theta\|_1\le r}\Big|\sum_{j=1}^p\Big(\sum_{i=1}^n\epsilon_iY_ia_{w_ij}\Big)\theta_j\Big|\Big)\\&\le\frac2nE\Big(\sup_{\|\theta\|_1\le r}\max_{1\le j\le p}\Big|\sum_{i=1}^n\epsilon_iY_ia_{w_ij}\Big|\cdot\|\theta\|_1\Big)\qquad\text{[by Hölder's inequality]}\\&\le\frac{2r}nE\Big(\max_{1\le j\le p}\Big|\sum_{i=1}^n\epsilon_iY_ia_{w_ij}\Big|\Big)=\frac{2r}nE\Big(E_\epsilon\max_{1\le j\le p}\Big|\sum_{i=1}^n\epsilon_iY_ia_{w_ij}\Big|\Big).\end{aligned}$$
Next, we apply the maximal inequality. By Corollary 7.5 in [20], with $E[\epsilon_iY_ia_{w_ij}|Y]=0$ and $|\epsilon_iY_ia_{w_ij}|\le\max_{1\le i\le n}\|a_{w_i}\|_\infty|Y_i|=r^{-1}|Y_i|$, one has
$$\frac{2r}nE\Big(E_\epsilon\max_{1\le j\le p}\Big|\sum_{i=1}^n\epsilon_iY_ia_{w_ij}\Big|\Big)\le\frac2n\sqrt{2\log(2p)}\,E\Big(\sqrt{\sum_{i=1}^nY_i^2}\Big)\overset{\text{[by Jensen's inequality]}}{\le}\frac2n\sqrt{2\log(2p)}\sqrt{E\sum_{i=1}^nY_i^2}=2\sqrt{\frac1n\sum_{i=1}^nEY_i^2}\,\sqrt{\frac{2\log(2p)}n}.$$
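The maximal inequality used in this step can be checked numerically. The sketch below (ours, for illustration) fixes $y$ and vectors $a_{w_i}$ with $\|a_{w_i}\|_\infty\le1/r$ and compares a Monte Carlo value of $E_\epsilon\max_j|\sum_i\epsilon_iy_ia_{w_ij}|$ with the bound $r^{-1}\sqrt{2\log(2p)}\sqrt{\sum_iy_i^2}$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, r, reps = 100, 500, 2.0, 5000

y = rng.standard_normal(n)
A = rng.uniform(-1 / r, 1 / r, size=(n, p))     # columns a_{.j} with sup-norm <= 1/r

eps = rng.choice([-1.0, 1.0], size=(reps, n))   # Rademacher signs
Z = eps @ (y[:, None] * A)                      # Z[k, j] = sum_i eps_i y_i a_{ij}
lhs = np.abs(Z).max(axis=1).mean()              # E_eps max_j |sum_i eps_i y_i a_{ij}|
rhs = np.sqrt(2 * np.log(2 * p)) * np.sqrt(np.sum(y**2)) / r
print(lhs, "<=", rhs)
```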
Thus, $Ef(Y)\le2\sqrt{\frac1n\sum_{i=1}^nEY_i^2}\sqrt{\frac{2\log(2p)}n}$. Using (3.4),
$$f(Y)\le t+Ef(Y)=4\sqrt{\frac1n\sum_{i=1}^n\big\||Y_i-Y_i'|\big\|_{\theta_2}^2}\,\sqrt{\frac{\log\delta^{-1}}n}+2\sqrt{\frac1n\sum_{i=1}^nEY_i^2}\,\sqrt{\frac{2\log(2p)}n}, \tag{3.6}$$
with probability at least $1-\delta$.
Proof of Theorem 2: We will adopt the following result, which gives a moment inequality for the exponential family; it follows from the analytic properties of the absolute moments of exponential family random variables:
$$E|Y|^k\le k!C_Y^k,$$
see Proposition 5.2 in [20]. For notational simplicity, let $W:=W_{\hat\theta}$ and $W_i=w_i(\hat\theta)$. By using Taylor's expansion and the binomial coefficient formula, we have the following upper bound for the conditional moment generating function of $W_iY_i-E(W_iY_i)$, conditioning on the event $\{\max_{k\ge1,1\le i\le n}\rho_{i,k}\le u_n\}$:
$$\begin{aligned}E\big[e^{s(W_iY_i-E(W_iY_i))}\,\big|\,W\big]&=1+\sum_{m=2}^\infty\frac{s^m}{m!}E\big[(W_iY_i-E(W_iY_i))^m\,\big|\,W\big]\\&=1+\sum_{m=2}^\infty\frac{s^m}{m!}E\Big[\sum_{k=0}^m\binom mk(W_iY_i)^k(-E(W_iY_i))^{m-k}\,\Big|\,W\Big]\\&\le1+\sum_{m=2}^\infty\frac{s^m}{m!}\Big[w^m\sum_{k=0}^m\binom mkE(|Y_i|^k\,|\,W)(E|Y_i|)^{m-k}\Big]\\&\le1+\sum_{m=2}^\infty\frac{(2ws)^m}{m!}\max_{0\le k\le m}\big\{E(|Y_i|^k\,|\,W)(E|Y_i|)^{m-k}\big\}\\&\le1+u_n\sum_{m=2}^\infty\frac{(2ws)^m}{m!}\max_{0\le k\le m}\big\{E(|Y_i|^k)(E|Y_i|)^{m-k}\big\}\quad\text{(by assumption (ii), on the event }\{\max_{k\ge1,1\le i\le n}\rho_{i,k}\le u_n\}\text{)},\end{aligned}\tag{3.7}$$
for s∈(0,δ) with some δ>0.
Therefore, using $E|Y_i|^k\le k!C_Y^k$ (so that $\max_{0\le k\le m}\{E(|Y_i|^k)(E|Y_i|)^{m-k}\}\le m!C_Y^m$) and assuming $|2swC_Y|<1$, we have
$$E\big[e^{s(W_iY_i-E(W_iY_i))}\,\big|\,W\big]\le1+u_n\sum_{m=2}^\infty\frac{(2ws)^m}{m!}\big[m!C_Y^m\big]=1+u_n(2swC_Y)^2\sum_{m=2}^\infty(2swC_Y)^{m-2}=1+\frac{u_n(2swC_Y)^2}{1-2swC_Y}\le e^{\frac{u_n(2swC_Y)^2}{1-2swC_Y}}. \tag{3.8}$$
Define the randomly weighted sum $S_n^W:=\sum_{i=1}^nW_iY_i$. By the conditional independence of $\{W_iY_i\,|\,W\}_{i=1}^n$, it follows from (3.8) that
$$E\big[e^{s(S_n^W-ES_n^W)}\,\big|\,W\big]=\prod_{i=1}^nE\big[\exp\{s[W_iY_i-E(W_iY_i)]\}\,\big|\,W\big]\le e^{\frac{nu_n(2swC_Y)^2}{1-2swC_Y}}. \tag{3.9}$$
By the conditional Markov inequality, on the event $\{\max_{k\ge1,1\le i\le n}\rho_{i,k}\le u_n\}$ we have, for $a>0$,
$$\begin{aligned}P\big(|S_n^W-ES_n^W|\ge t\,\big|\,W\big)&\le P\big(a(S_n^W-ES_n^W)\ge at\,\big|\,W\big)+P\big(a(-S_n^W+ES_n^W)\ge at\,\big|\,W\big)\\&\le\frac{E\big[e^{a(S_n^W-ES_n^W)}\,|\,W\big]}{\exp(at)}+\frac{E\big[e^{a(-S_n^W+ES_n^W)}\,|\,W\big]}{\exp(at)}\\&\le2\exp\Big\{\frac{nu_n(2awC_Y)^2}{1-2awC_Y}-at\Big\}\qquad\text{[using (3.9) for }a\in(-\delta,\delta)\text{]}\\&=2\exp\Big\{-\frac{t^2}{16nu_n(wC_Y)^2+4wC_Yt}\Big\},\end{aligned}\tag{3.10}$$
where the last equality is obtained by setting $a=\frac{t}{8nu_n(wC_Y)^2+2wC_Yt}$.
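The algebra behind this choice of $a$ can be verified symbolically; the snippet below (ours, assuming sympy is available) checks that the two exponents in (3.10) coincide and displays $2awC_Y$, which equals $t/(4nu_nwC_Y+t)<1$, so the expansion (3.8) indeed applies.

```python
import sympy as sp

t, n, u, w, C = sp.symbols("t n u_n w C_Y", positive=True)

a = t / (8 * n * u * (w * C) ** 2 + 2 * w * C * t)       # the choice of a made above
exponent = n * u * (2 * a * w * C) ** 2 / (1 - 2 * a * w * C) - a * t
target = -t ** 2 / (16 * n * u * (w * C) ** 2 + 4 * w * C * t)

print(sp.simplify(exponent - target))   # 0: the two exponents in (3.10) agree
print(sp.simplify(2 * a * w * C))       # t/(4*n*u_n*w*C_Y + t), which is < 1
```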
Taking the expectation w.r.t. $W$ in (3.10) implies
$$\begin{aligned}P(|S_n^W-ES_n^W|\ge t)&=P\Big(|S_n^W-ES_n^W|\ge t,\max_{k\ge1,1\le i\le n}\rho_{i,k}>u_n\Big)+P\Big(|S_n^W-ES_n^W|\ge t,\max_{k\ge1,1\le i\le n}\rho_{i,k}\le u_n\Big)\\&\le P\Big(\max_{k\ge1,1\le i\le n}\rho_{i,k}>u_n\Big)+E\Big[P\Big(|S_n^W-ES_n^W|\ge t,\max_{k\ge1,1\le i\le n}\rho_{i,k}\le u_n\,\Big|\,W\Big)\Big]\\&\le C_\rho/u_n+2\exp\Big\{-\frac{t^2}{16nu_n(wC_Y)^2+4wC_Yt}\Big\}.\end{aligned}$$
Lemma 3. [Symmetrization theorem with different functions] Let ε1,...,εn be a Rademacher sequence with uniform distribution on {−1,1}, independent of X1,...,Xn and gi∈Gi. Then,
E(supg1,⋯,gn∈G1,⋯,Gn|n∑i=1[gi(Xi)−E{gi(Xi)}]|)≤2E[Eϵ{supg1,⋯,gn∈G1,⋯,Gn|n∑i=1ϵigi(Xi)|}], |
where Eϵ{⋅} refers to the expectation w.r.t. ϵ1,...,ϵn.
Proof: Let $\{X_i'\}_{i=1}^n$ be an independent copy of $\{X_i\}_{i=1}^n$. Here $E'$ denotes the expectation w.r.t. $\{X_i'\}_{i=1}^n$, and let $\mathcal{F}_n:=\sigma(X_1,\cdots,X_n)$. So,
$$\begin{aligned}E\Big(\sup_{g_1,\cdots,g_n\in\mathcal{G}_1,\cdots,\mathcal{G}_n}\Big|\sum_{i=1}^n[g_i(X_i)-E\{g_i(X_i)\}]\Big|\Big)&=E\Big(\sup_{g_1,\cdots,g_n\in\mathcal{G}_1,\cdots,\mathcal{G}_n}\Big|E'\Big[\sum_{i=1}^n[g_i(X_i)-g_i(X_i')]\,\Big|\,\mathcal{F}_n\Big]\Big|\Big)\\(\text{Jensen's inequality for the absolute value})\quad&\le E\Big(\sup_{g_1,\cdots,g_n\in\mathcal{G}_1,\cdots,\mathcal{G}_n}E'\Big[\Big|\sum_{i=1}^n[g_i(X_i)-g_i(X_i')]\Big|\,\Big|\,\mathcal{F}_n\Big]\Big)\\(\text{the supremum of expectations is at most the expectation of the supremum})\quad&\le E\Big(E'\Big[\sup_{g_1,\cdots,g_n\in\mathcal{G}_1,\cdots,\mathcal{G}_n}\Big|\sum_{i=1}^n[g_i(X_i)-g_i(X_i')]\Big|\,\Big|\,\mathcal{F}_n\Big]\Big)\\&=E\Big(\sup_{g_1,\cdots,g_n\in\mathcal{G}_1,\cdots,\mathcal{G}_n}\Big|\sum_{i=1}^n[g_i(X_i)-g_i(X_i')]\Big|\Big),\end{aligned}$$
where we use the conditional expectation version of Jensen's inequalities.
Since $\varepsilon_i[g_i(X_i)-g_i(X_i')]$ and $g_i(X_i)-g_i(X_i')$ have the same distribution, the last expression equals
$$E\Big(\sup_{g_1,\cdots,g_n\in\mathcal{G}_1,\cdots,\mathcal{G}_n}\Big|\sum_{i=1}^n\varepsilon_i[g_i(X_i)-g_i(X_i')]\Big|\Big)\le2E\Big[E_\epsilon\Big\{\sup_{g_1,\cdots,g_n\in\mathcal{G}_1,\cdots,\mathcal{G}_n}\Big|\sum_{i=1}^n\epsilon_ig_i(X_i)\Big|\Big\}\Big].$$
If gi=f,Gi=F for i=1,2,⋯,n, then we have the classical symmetrization theorem.
Lemma 4. [Symmetrization Theorem, Lemma 2.3.1 in [17]] Let ε1,...,εn be a Rademacher sequence with uniform distribution on {−1,1}, independent of X1,...,Xn and f∈F. Then, we have
E[supf∈F|n∑i=1[f(Xi)−E{f(Xi)}]|]≤2E[Eϵ{supf∈F|n∑i=1ϵif(Xi)|}], |
where E[⋅] refers to the expectation w.r.t. X1,...,Xn and Eϵ{⋅} w.r.t. ϵ1,...,ϵn.
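As a small numerical check of Lemma 4 (ours, for illustration), take the class $\{x\mapsto|x-\theta|:\theta\in[0,1]\}$ with $X_i\sim U(0,1)$, for which $E|X-\theta|=\theta^2-\theta+1/2$; a Monte Carlo comparison over a grid of $\theta$ illustrates the inequality.

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 50, 5000
grid = np.linspace(0.0, 1.0, 21)                     # class {x -> |x - theta|}

def emp_sup(X, signs=None):
    # sup over theta of |sum_i s_i f_theta(X_i)|, or of the centered sum if signs is None
    F = np.abs(X[:, :, None] - grid)                 # shape (reps, n, len(grid))
    if signs is None:
        mean_f = grid**2 - grid + 0.5                # E|U(0,1) - theta|
        return np.abs((F - mean_f).sum(axis=1)).max(axis=1)
    return np.abs((signs[:, :, None] * F).sum(axis=1)).max(axis=1)

X = rng.uniform(0.0, 1.0, size=(reps, n))
eps = rng.choice([-1.0, 1.0], size=(reps, n))
lhs = emp_sup(X).mean()                              # E sup_theta |sum (f - Ef)|
rhs = 2 * emp_sup(X, eps).mean()                     # 2 E sup_theta |sum eps_i f(X_i)|
print(lhs, "<=", rhs)
```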
In this section, we restate the concentration inequality for a function of the data under the so-called strongly log-concave discrete distribution assumption, which was used in the Supplementary Material of [21]. We utilize the convex geometry approach to establish the tail bounds. In convex geometry, the following discrete version of the Prékopa-Leindler inequality can be found in Theorem 1.2 of [5]. The discrete Prékopa-Leindler inequality is an essential tool for deriving concentration inequalities for strongly log-concave counting measures; this shares the same idea as the continuous version of the Prékopa-Leindler inequality (see Theorem 3.15 of [18]).
Let ⌊r⌋=max{m∈Z;m≤r} be the lower integer part of r∈R, and ⌈r⌉=−⌊−r⌋ be the upper integer part. Denote ⌊x⌋=(⌊x1⌋,…⌊xn⌋) and ⌈x⌉=(⌈x1⌉,…,⌈xn⌉).
Lemma 5. [Discrete Prékopa-Leindler inequality] Let f,g,h,k:Zn→[0,∞) be functions that satisfy the following inequality:
f(x)g(y)≤h(⌊λx+(1−λ)y⌋)k(⌈(1−λ)x+λy⌉),∀x,y∈Zn,∀λ∈[0,1]. | (3.11) |
Then, we have
(∑x∈Znf(x))(∑x∈Zng(x))≤(∑x∈Znh(x))(∑x∈Znk(x)). |
From a geometric perspective, the Prékopa-Leindler inequality is a valuable method for proving concentration inequalities for Lipschitz functions of strongly log-concave distributions. Following the idea in [12], a distribution $P$ with density $p(x)$ (w.r.t. the counting measure) is said to be strongly discrete log-concave if $\psi(x):=-\log p(x):\mathbb{Z}^n\to\mathbb{R}$ is strongly midpoint convex for some $\gamma>0$:
ψ(x)+ψ(y)−ψ(⌈12x+12y⌉)−ψ(⌊12x+12y⌋)≥γ4‖x−y‖22,∀x,y∈Zn. | (3.12) |
The inequality (3.12) is an extension of strong convexity for continuous functions on $\mathbb{R}^n$:
λψ(x)+(1−λ)ψ(y)−ψ(λx+(1−λ)y)≥γ2λ(1−λ)‖x−y‖22,∀x,y∈Rn,∀λ∈[0,1], |
with modulus of convexity γ [13].
The strong midpoint convexity property for a discrete function is typically obtained by restricting a continuous function to a lattice space. If $\gamma=0$, (3.12) reduces to the discrete midpoint convexity property for $\psi(x)$:
ψ(x)+ψ(y)≥ψ(⌈12x+12y⌉)+ψ(⌊12x+12y⌋),∀x,y∈Zn, |
see [12]. However, directly restricting a continuous convex function to some lattice space does not necessarily yield a discrete convex function. For a corresponding counterexample, see [19].
For one-dimensional P, the probability mass function p(x) is said to be log-concave if the sequence {p(x)}x∈Z is log-concave; that is, for any λn+(1−λ)m∈Z with m,n∈Z and λ∈(0,1), one has
$$p(\lambda n+(1-\lambda)m)\ge p(n)^\lambda p(m)^{1-\lambda}.$$
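For a quick numerical illustration of this definition (ours, assuming scipy is available), one can verify the standard three-term form of log-concavity, $p(k)^2\ge p(k-1)p(k+1)$, for some common count distributions supported on consecutive integers.

```python
import numpy as np
from scipy.stats import poisson, nbinom

def is_log_concave(pmf_vals):
    # For a pmf on consecutive integers, log-concavity reads p(k)^2 >= p(k-1) p(k+1).
    p = np.asarray(pmf_vals)
    return bool(np.all(p[1:-1] ** 2 >= p[:-2] * p[2:]))

k = np.arange(0, 60)
print(is_log_concave(poisson.pmf(k, mu=4.0)))   # True
print(is_log_concave(nbinom.pmf(k, 5, 0.4)))    # True for dispersion 5
```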
Proposition 1. [The concentration inequality of strongly log-concave discrete distributions] Consider a strongly log-concave discrete distribution $P_\gamma$ with index $\gamma>0$ on $\mathbb{Z}^n$. For any $f:\mathbb{R}^n\to\mathbb{R}$ that is $L$-Lipschitz w.r.t. the Euclidean norm,
$$\Pr\{|f(X)-Ef(X)|\ge t\}\le2e^{-\frac{\gamma t^2}{4L^2}}. \tag{3.13}$$
Proof: Let $h$ be a zero-mean function with Lipschitz constant $L$ (w.r.t. the Euclidean norm). It suffices to prove the upper bound $Ee^{h(X)}\le e^{L^2/\gamma}$ for the moment generating function. Then, for $f$ with Lipschitz constant $K$ and $\lambda\in\mathbb{R}$, we apply this upper bound to the zero-mean function $h(X):=\lambda(f(X)-Ef(X))$, which has Lipschitz constant $L=|\lambda|K$. For $y\in\mathbb{Z}^n$, define the proximity operator of $h$ as
l(y):=infx∈Zn{h(x)+γ4‖x−y‖22}. |
With this proximity operator, the proof proceeds by applying the discrete Prékopa-Leindler inequality (Lemma 5) with $\lambda=1/2$, taking the two functions on the right-hand side of (3.11) to be $p(t):=e^{-\psi(t)}$, and setting $f(x):=e^{-h(x)-\psi(x)}$ and $g(y):=e^{l(y)-\psi(y)}$. We check that
$$e^{\frac12[l(y)-h(x)-\psi(y)-\psi(x)]}\le e^{-\frac12\psi(\lceil\frac12x+\frac12y\rceil)}\cdot e^{-\frac12\psi(\lfloor\frac12x+\frac12y\rfloor)},\qquad\forall x,y\in\mathbb{Z}^n. \tag{3.14}$$
Then, (3.11) holds and Lemma 5 applies with $\lambda=1/2$.
By discrete strong convexity of the function ψ
$$\frac12\Big[\psi(x)+\psi(y)-\psi\big(\lceil\tfrac12x+\tfrac12y\rceil\big)-\psi\big(\lfloor\tfrac12x+\tfrac12y\rfloor\big)\Big]\ge\frac\gamma8\|x-y\|_2^2,$$
and the definition of the proximity operator of $h$ (which gives $l(y)-h(x)\le\frac\gamma4\|x-y\|_2^2$ for every $x$), we have
$$-\frac12\psi\big(\lceil\tfrac12x+\tfrac12y\rceil\big)-\frac12\psi\big(\lfloor\tfrac12x+\tfrac12y\rfloor\big)\ge\frac\gamma8\|x-y\|_2^2-\frac12\psi(x)-\frac12\psi(y)\ge\frac12\{l(y)-h(x)\}-\frac12\psi(y)-\frac12\psi(x),$$
which verifies (3.14).
Since $\sum_{x\in\mathbb{Z}^n}p(x)=1$, Lemma 5 gives
$$Ee^{-h(X)}\,Ee^{l(Y)}=\Big(\sum_{x\in\mathbb{Z}^n}e^{-h(x)-\psi(x)}\Big)\Big(\sum_{y\in\mathbb{Z}^n}e^{l(y)-\psi(y)}\Big)\le1.$$
Then, Jensen's inequality implies
$$Ee^{l(Y)}\le\big(Ee^{-h(X)}\big)^{-1}\le\big(e^{E[-h(X)]}\big)^{-1}=1,$$
where in the last equality we use $E[-h(X)]=-E[\lambda(f(X)-Ef(X))]=0$. The definition of the proximity operator shows
$$1\ge Ee^{l(Y)}=Ee^{\inf_{x\in\mathbb{Z}^n}\{h(x)+\frac\gamma4\|x-Y\|_2^2\}}=Ee^{\inf_{x\in\mathbb{Z}^n}\{h(Y)+[h(x)-h(Y)]+\frac\gamma4\|x-Y\|_2^2\}}\ge Ee^{h(Y)+\inf_{x\in\mathbb{R}^n}\{-L\|x-Y\|_2+\frac\gamma4\|x-Y\|_2^2\}}=Ee^{h(Y)-L^2/\gamma},$$
where the second-to-last inequality is due to the $L$-Lipschitz property of $h$, i.e., $|h(x)-h(Y)|\le L\|x-Y\|_2$, and the last equality uses $\inf_{u\ge0}\{\frac\gamma4u^2-Lu\}=-L^2/\gamma$.
Then, we have $Ee^{\lambda(f(X)-Ef(X))}\le e^{\lambda^2L^2/\gamma}=e^{\frac{\lambda^2}2\cdot\frac{2L^2}\gamma}$ for all $\lambda\in\mathbb{R}$, where now $L$ denotes the Lipschitz constant of $f$. This means that $f(X)-Ef(X)\sim\mathrm{subG}(\frac{2L^2}\gamma)$, hence the tail bound (3.13) follows.
Non-asymptotic statistical inference on high-dimensional data is important for many fields, such as data mining and machine learning. In this paper, we derived a novel concentration inequality for the sum of independent sub-Gaussian variables with random dependent weights in high-dimensional regression settings. We applied the proposed concentration inequality to obtain a high probability bound for the stochastic Lipschitz constant for negative binomial loss functions involved in Lasso-penalized negative binomial regressions, and used this bound to study oracle inequalities for the Lasso estimators. The usefulness of the proposed concentration inequality in applications was justified by solid theoretical proofs.
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
The authors would like to thank the reviewers and Dr. Pengfei Wang for their valuable suggestions that have significantly improved the quality of this paper. This work was supported in part by the National Natural Science Foundation of China under grant numbers 12101630 and 12261011, and the "double first-class" construction projects of Chinese universities under grant number ZG216S2348.
The authors declare no conflict of interest.
[1] | V. V. Buldygin, Y. V. Kozachenko, Metric characterization of random variables and random processes, Providence: American Mathematical Society, 2000. |
[2] | S. Boucheron, G. Lugosi, P. Massart, Concentration inequalities: A nonasymptotic theory of independence, Oxford: Oxford University Press, 2013. |
[3] | P. Bühlmann, S. A. van de Geer, Statistics for high-dimensional data: methods, theory and applications, Berlin: Springer, 2011. https://doi.org/10.1007/978-3-642-20192-9 |
[4] | Z. Chi, A local stochastic Lipschitz condition with application to Lasso for high dimensional generalized linear models, arXiv: 1009.1052. https://doi.org/10.48550/arXiv.1009.1052 |
[5] | D. Halikias, B. Klartag, B. A. Slomka, Discrete variants of Brunn-Minkowski type inequalities, Annales de la Faculté des Sciences de Toulouse Mathématiques, 30 (2021), 267–279. https://doi.org/10.5802/afst.1674 |
[6] | Q. Han, J. A. Wellner, Convergence rates of least squares regression estimators with heavy-tailed errors, Ann. Statist., 47 (2019), 2286–2319. https://doi.org/10.1214/18-AOS1748 |
[7] | Q. Han, Multiplier U-processes: sharp bounds and applications, Bernoulli, 28 (2022), 87–124. https://doi.org/10.3150/21-BEJ1334 |
[8] | W. Hoeffding, Probability inequalities for sums of bounded random variables, J. Am. Stat. Assoc., 58 (1963), 13–30. https://doi.org/10.1080/01621459.1963.10500830 |
[9] | J. Kahane, Propriétés locales des fonctions à séries de Fourier aléatoires, Stud. Math., 19 (1960), 1–25. https://doi.org/10.4064/sm-19-1-1-25 |
[10] | S. Li, H. Wei, X. Lei, Heterogeneous overdispersed count data regressions via double-penalized estimations, Mathematics, 10 (2022), 1700. https://doi.org/10.3390/math10101700 |
[11] | S. Mendelson, Upper bounds on product and multiplier empirical processes, Stoch. Proc. Appl., 126 (2016), 3652–3680. https://doi.org/10.1016/j.spa.2016.04.019 |
[12] | S. Moriguchi, K. Murota, A. Tamura, F. Tardella, Discrete midpoint convexity, Math. Oper. Res., 45 (2020), 99–128. https://doi.org/10.1287/moor.2018.0984 |
[13] | M. W. Mahoney, J. C. Duchi, A. C. Gilbert, The mathematics of data, Providence: American Mathematical Society, 2018. |
[14] | P. Massart, Some applications of concentration inequalities to statistics, Annales de la Faculté des Sciences de Toulouse Mathématiques, 9 (2000), 245–303. https://doi.org/10.5802/afst.961 |
[15] | P. Rigollet, J. C. Hütter, High dimensional statistics, New York: Springer, 2019. |
[16] | R. Vershynin, Introduction to the non-asymptotic analysis of random matrices, arXiv: 1011.3027. https://doi.org/10.48550/arXiv.1011.3027 |
[17] | A. W. van der Vaart, J. A. Wellner, Weak convergence and empirical processes: with applications to statistics, New York: Springer, 1996. https://doi.org/10.1007/978-1-4757-2545-2 |
[18] | M. J. Wainwright, High-dimensional statistics: a non-asymptotic viewpoint, Cambridge: Cambridge University Press, 2019. |
[19] | Ü. Yüceer, Discrete convexity: convexity for functions defined on discrete spaces, Discrete Appl. Math., 119 (2002), 297–304. https://doi.org/10.1016/S0166-218X(01)00191-3 |
[20] | H. Zhang, S. Chen, Concentration inequalities for statistical inference, Commun. Math. Res., 37 (2021), 1–85. https://doi.org/10.4208/cmr.2020-0041 |
[21] | H. Zhang, J. Jia, Elastic-net regularized high-dimensional negative binomial regression: consistency and weak signals detection, Stat. Sinica, 32 (2022), 181–207. https://doi.org/10.5705/SS.202019.0315 |
[22] | H. Zhang, X. Lei, Growing-dimensional partially functional linear models: non-asymptotic optimal prediction error, Phys. Scr., 98 (2023), 095216. https://doi.org/10.1088/1402-4896/aceac0 |
[23] | H. Zhang, H. Wei, G. Cheng, Tight non-asymptotic inference via sub-Gaussian intrinsic moment norm, arXiv: 2303.07287. https://doi.org/10.48550/arXiv.2303.07287 |