We consider the rescaled pure greedy learning algorithm (RPGLA) with dependent samples drawn according to a non-identical sequence of probability distributions. The generalization performance is analyzed by applying the independent-blocks technique and accounting for the drift error. We derive a satisfactory learning rate for the algorithm under the assumption that the process is stationary and β-mixing, and we show that the optimal rate $O(n^{-1})$ can be obtained for i.i.d. processes.
Citation: Qin Guo, Binlei Cai. Learning capability of the rescaled pure greedy algorithm with non-iid sampling[J]. Electronic Research Archive, 2023, 31(3): 1387-1404. doi: 10.3934/era.2023071
Greedy learning algorithms, that is, greedy algorithms applied to supervised learning problems, have triggered enormous recent research activity because of their low computational burden [1,2,3,4]. Theoretical analyses of greedy learning have also received wide attention in the framework of learning theory [1,2,3,5,6]. We consider the learning capability of the rescaled pure greedy algorithm (RPGA), introduced by Petrova in [7], in a non-i.i.d. sampling setting.
We first give a brief review of regression learning and greedy algorithms. Let $X$ be a compact metric space and $Y=\mathbb R$. Let $\mathbf z=\{z_i\}_{i=1}^n=\{(x_i,y_i)\}_{i=1}^n\in Z^n$ be a stationary real-valued sequence with unknown Borel probability distribution $\rho$ on $Z=X\times Y$.
The generalization error can be defined by
$$ \mathcal E(f)=\int_Z (f(x)-y)^2\,d\rho,\qquad \forall\, f: X\to Y. \tag{1.1} $$
Minimizing E(f), we can obtain the following regression function
$$ f_\rho(x)=\int_Y y\,d\rho(y|x), $$
where $\rho(\cdot|x)$ denotes the conditional probability measure (given $x$) on $Y$. The empirical error $\mathcal E_{\mathbf z}(f)$, which is a good approximation of the generalization error $\mathcal E(f)$ for a fixed function $f$ on $X$, is defined by
$$ \mathcal E_{\mathbf z}(f)=\|y-f\|_n^2:=\frac1n\sum_{i=1}^n (f(x_i)-y_i)^2. \tag{1.2} $$
The regression problem in learning theory aims at a good approximation $f_{\mathbf z}$ of $f_\rho$, constructed by learning algorithms. Denote by $L^2_{\rho_X}(X)$ the Hilbert space of square-integrable functions on $X$ with respect to the measure $\rho_X$, where $\rho_X$ denotes the marginal probability distribution on $X$ and $\|f\|_{\rho_X}=\bigl(\int_X |f|^2\,d\rho_X\bigr)^{1/2}$. It is known that, for any $f\in L^2_{\rho_X}(X)$, it holds that
$$ \mathcal E(f)-\mathcal E(f_\rho)=\|f-f_\rho\|^2, \tag{1.3} $$
where
$$ \|u\|^2=E\bigl(|u(x)|^2\bigr)=\|u\|_{\rho_X}^2. $$
The learning ability of the regression algorithm can be measured by the excess generalization error
$$ \|f_{\mathbf z}-f_\rho\|^2=\mathcal E(f_{\mathbf z})-\mathcal E(f_\rho). $$
Let $H$ be a real, separable Hilbert space endowed with inner product $\langle\cdot,\cdot\rangle$ and norm $\|\cdot\|:=\|\cdot\|_H=\langle\cdot,\cdot\rangle^{1/2}$. A set of functions $\mathcal D\subset H$ is called a dictionary if $\|g\|=1$ for every $g\in\mathcal D$, $g\in\mathcal D$ implies $-g\in\mathcal D$, and the closure of $\operatorname{span}(\mathcal D)$ is $H$. We define the RPGA($\mathcal D$) as follows:
RPGA(D):
Step 0: Define f0:=0.
Step m (m≥1):
(1) If $f=f_{m-1}$, stop the algorithm and define $f_k=f_{m-1}=f$ for $k\ge m$.
(2) If $f\neq f_{m-1}$, choose a direction $\varphi_m\in\mathcal D$ such that
$$ |\langle f-f_{m-1},\varphi_m\rangle|=\sup_{\varphi\in\mathcal D}|\langle f-f_{m-1},\varphi\rangle|. \tag{1.4} $$
With
$$ \lambda_m:=\langle f-f_{m-1},\varphi_m\rangle, \tag{1.5} $$
$$ \hat f_m:=f_{m-1}+\lambda_m\varphi_m, \tag{1.6} $$
$$ s_m:=\frac{\langle f,\hat f_m\rangle}{\|\hat f_m\|^2}, \tag{1.7} $$
define the next approximant to be
$$ f_m=s_m\hat f_m, \tag{1.8} $$
and proceed to Step m+1.
Note that the RPGA applies just the right rescaling to the output of the pure greedy algorithm (PGA), which boosts the convergence rate of the PGA to the optimal approximation rate $O(m^{-1/2})$ for functions in $A_1(\mathcal D)$; see [7].
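To make the iteration concrete, the following sketch runs the RPGA in the Euclidean space $\mathbb R^d$ with a finite dictionary of unit-norm atoms; the function names, the iteration cap and the stopping tolerance are our own illustrative choices and are not part of the original algorithm description.

```python
import numpy as np

def rpga(f, dictionary, m_max=100, tol=1e-12):
    """Minimal RPGA sketch in R^d with the Euclidean inner product.
    `dictionary` is an array of shape (N, d) whose rows are unit-norm atoms;
    `f` is the target vector to be approximated."""
    f = np.asarray(f, dtype=float)
    fm = np.zeros_like(f)                       # Step 0: f_0 = 0
    for _ in range(m_max):
        residual = f - fm
        if np.linalg.norm(residual) <= tol:     # f = f_{m-1}: stop
            break
        inner = dictionary @ residual           # (1.4): greedy atom selection
        j = int(np.argmax(np.abs(inner)))
        lam = inner[j]                          # (1.5): lambda_m
        f_hat = fm + lam * dictionary[j]        # (1.6): intermediate approximant
        denom = float(np.dot(f_hat, f_hat))
        if denom <= tol:                        # degenerate case: nothing to rescale
            break
        s = float(np.dot(f, f_hat)) / denom     # (1.7): rescaling factor s_m
        fm = s * f_hat                          # (1.8): f_m = s_m * f_hat_m
    return fm

# Illustration: a random spanning dictionary in R^10 drives the residual toward 0.
rng = np.random.default_rng(0)
D = rng.standard_normal((50, 10))
D /= np.linalg.norm(D, axis=1, keepdims=True)
target = rng.standard_normal(10)
print(np.linalg.norm(target - rpga(target, D, m_max=500)))
```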
Throughout this paper, we derive the error bounds under the assumption that $|y|\le M$ almost surely for some $M>0$; hence $|f_\rho(x)|\le M$ for any $x\in X$. We also define the following truncation function as in [8,9,10].
Definition 1. Fix $M>0$. The truncation function $\pi_M$ on the space of measurable functions $f: X\to\mathbb R$ is defined as
$$ \pi_M(f)(x)=\begin{cases} M, & \text{if } f(x)>M,\\ f(x), & \text{if } |f(x)|\le M,\\ -M, & \text{if } f(x)<-M. \end{cases} \tag{1.9} $$
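Since the truncation operator acts pointwise, it amounts to clipping function values to the interval $[-M,M]$; a one-line sketch (the helper name is ours) is

```python
import numpy as np

def pi_M(f_values, M):
    """Truncation pi_M of (1.9), applied to an array of function values."""
    return np.clip(f_values, -M, M)
```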
Now we use the RPGA to realize greedy learning. Here we consider learning with an indefinite kernel $K: X\times X\to\mathbb R$ [11,12,13,14] and define the hypothesis space
$$ \mathcal H_{K,\mathbf z}=\Bigl\{f=\sum_{i=1}^n \alpha_i K_{x_i}=\sum_{i=1}^n\alpha_i K(x_i,\cdot):\ \alpha=(\alpha_1,\cdots,\alpha_n)\in\mathbb R^n,\ n\in\mathbb N\Bigr\}, \tag{1.10} $$
where
$$ \|f\|_{l^1}:=\inf\Bigl\{\sum_{i=1}^n|\alpha_i|:\ f=\sum_{i=1}^n\alpha_i K_{x_i}\in\mathcal H_{K,\mathbf z}\Bigr\}. \tag{1.11} $$
We now present the rescaled pure greedy learning algorithm (RPGLA) as follows:
Algorithm 1 RPGLA
Input: a data set $\mathbf z=\{z_i\}_{i=1}^n=\{(x_i,y_i)\}_{i=1}^n\in Z^n$, the kernel $K$, $T>0$, and the dictionary $\mathcal D_n=\{K_{x_i}: i=1,\dots,n\}$.
Step 1. Normalization: $\widetilde K_{x_i}=K_{x_i}/\|K_{x_i}\|_n$, $i=1,\dots,n$; dictionary $\widetilde{\mathcal D}_n=\{\widetilde K_{x_i}: i=1,\dots,n\}$.
Step 2. Computation: Let $\tilde f_0=0$. For $k=1,2,\dots$, the approximant $\tilde f_k$ is generated by the RPGA($\widetilde{\mathcal D}_n$); if $\|y-\tilde f_k\|_n^2+\|\tilde f_k\|_{l^1}\le\|y\|_n^2$ and $k\ge T$, break.
Output: $\pi_M(\tilde f_k)$.
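A minimal sketch of Algorithm 1 follows, assuming a user-supplied symmetric kernel and representing every function by its values at the sample points, so that the RPGA steps are carried out under the empirical inner product $\langle u,v\rangle_n=\frac1n\sum_i u(x_i)v(x_i)$. The $l^1$ quantity tracked below is the sum of absolute coefficients of the current representation, an upper bound on the infimum in (1.11); all function and variable names are illustrative.

```python
import numpy as np

def rpgla(x, y, kernel, T=10, k_max=200, M=None):
    """Sketch of the RPGLA: RPGA on the normalized kernel dictionary under
    the empirical inner product, followed by truncation at level M.
    Returns the values of pi_M(f_k) at the sample points."""
    x, y = list(x), np.asarray(y, dtype=float)
    n = len(y)
    M = float(np.max(np.abs(y))) if M is None else M
    # G[:, i] holds the values of K_{x_i} at the sample points x_1, ..., x_n.
    G = np.array([[kernel(xj, xi) for xi in x] for xj in x], dtype=float)
    norms_n = np.sqrt(np.mean(G**2, axis=0))        # ||K_{x_i}||_n
    Gt = G / norms_n                                # normalized dictionary
    fk = np.zeros(n)                                # f_0 = 0 (values at samples)
    coef = np.zeros(n)                              # coefficients on normalized atoms
    for k in range(1, k_max + 1):
        r = y - fk
        inner = (Gt.T @ r) / n                      # <y - f_{k-1}, K~_{x_i}>_n
        j = int(np.argmax(np.abs(inner)))
        lam = inner[j]
        f_hat = fk + lam * Gt[:, j]
        c_hat = coef.copy()
        c_hat[j] += lam
        denom = np.mean(f_hat**2)
        if denom <= 0:
            break
        s = np.mean(y * f_hat) / denom              # rescaling step of the RPGA
        fk, coef = s * f_hat, s * c_hat
        l1 = np.sum(np.abs(coef) / norms_n)         # coefficients on the original K_{x_i}
        if np.mean((y - fk)**2) + l1 <= np.mean(y**2) and k >= T:
            break                                   # stopping rule of Algorithm 1
    return np.clip(fk, -M, M)                       # pi_M(f_k)

# Hypothetical usage with a Gaussian kernel on scalar inputs:
# fit = rpgla([0.1, 0.5, 0.9], [1.0, 0.0, 1.0], lambda s, t: np.exp(-(s - t)**2), T=2)
```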
Many greedy learning schemes have recently been used successfully in the i.i.d. sampling setting [1,2,3,4,5,6,15]. For example, Barron et al. [5] used a complexity regularization principle as the stopping criterion and deduced the best learning rate $O((n/\log n)^{-1/2})$ for various greedy algorithms. Lin et al. [3] studied the learning capability of the relaxed greedy learning algorithm (RGLA) and proved that its learning rate is faster than the order $O(n^{-1/2})$; their numerical simulations confirmed that the relaxed greedy algorithm (RGA) is more stable than the orthogonal greedy algorithm (OGA) in dealing with noisy machine learning problems. Chen et al. [16] introduced a sparse semi-supervised method that learns regression functions from samples using the OGA and derived the learning rate $O(n^{-1})$ under mild assumptions. To reduce the computational burden of the OGA, Fang et al. [1] considered applications of the orthogonal super greedy algorithm (OSGA), which selects more than one atom from the dictionary in each iteration, to supervised learning and deduced almost the same learning rate as that of the orthogonal greedy learning algorithm (OGLA) in [5]. Different from the traditional variants RGA and OGA, Xu et al. [4] proposed the truncated greedy algorithm (TGA), which truncates the step size of the PGA at a specified value in each greedy iteration to cut down the model complexity; they also proved that, for some specified learning tasks, the truncated greedy learning algorithm (TGLA) can remove the logarithmic factor in the learning rates of the OGLA and the RGLA. All these results show that, in the realm of supervised learning, each greedy algorithm has its own pros and cons. For instance, compared with the OGA, the PGA and the RGA have computational benefits but suffer from low convergence rates. In this paper, we study the learning capability of the RPGA, which is a very simple modification of the PGA. Motivated by the research in [7], we proceed to derive the error bound of the RPGLA. Our results show that the RPGLA further reduces the computational burden without sacrificing generalization capability when compared with the OGLA and the RGLA. However, the assumption of independent and identically distributed samples is rather restrictive; for example, in [17,18,19] the authors studied non-i.i.d. sampling settings for different learning algorithms. We shall study β-mixing and non-identical sampling; see [20] and the references therein for details.
Definition 2. Let $\mathbf z=\{z_t\}_{t\ge1}$ be a sequence of random variables. For any $i,j\in\mathbb N\cup\{+\infty\}$, $\sigma_i^j$ denotes the σ-algebra generated by the random variables $\{z_t=(x_t,y_t)\}_{t=i}^j$. Then for any $l\in\mathbb N$, the β-mixing coefficients of the stochastic process $\mathbf z$ are defined as
$$ \beta(l)=\sup_{j\ge1} E\sup_{A\in\sigma_{j+l}^{\infty}}\bigl|P(A\,|\,\sigma_1^j)-P(A)\bigr|. \tag{1.12} $$
$\mathbf z$ is said to be β-mixing if $\beta(l)\to0$ as $l\to\infty$. Specifically, it is said to be polynomially β-mixing if there exist some $\beta_0>0$ and $\gamma>0$ such that, for all $l\ge1$,
$$ \beta(l)\le\beta_0\, l^{-\gamma}. \tag{1.13} $$
The β-mixing condition is "just the right" assumption, which has been adopted in previous studies for learning with weakly dependent samples; see [18,21] and the references therein. It is quite easy to establish and covers more general non-i.i.d. cases such as Gaussian and Markov processes. Markov chains appear often and naturally in applications, especially in market prediction, speech recognition, biological sequence analysis, content-based web search and character recognition.
We assume that $\{z_i\}_{i=1}^n$ is drawn according to the Borel probability measures $\{\rho^{(i)}\}_{i=1,2,\cdots}$ on $Z$. Let $\rho^{(i)}_X$ be the marginal distribution of $\rho^{(i)}$. For every $x\in X$, the conditional distribution of $\rho^{(i)}$ at $x$ is $\rho(\cdot|x)$.
Definition 3. We say that $\{\rho^{(i)}_X\}$ converges to $\rho_X$ exponentially in $(C^s(X))^*$ if, for some $C>0$ and $0<\alpha<1$,
$$ \|\rho^{(i)}_X-\rho_X\|_{(C^s(X))^*}\le C\alpha^i,\qquad\forall\, i\in\mathbb N. \tag{1.14} $$
The above condition (1.14) is also equivalent to
$$ \Bigl|\int_X f(x)\,d\rho^{(i)}_X-\int_X f(x)\,d\rho_X\Bigr|\le C\alpha^i\bigl(\|f\|_\infty+|f|_{C^s(X)}\bigr),\qquad\forall\, f\in C^s(X),\ i\in\mathbb N, \tag{1.15} $$
where
$$ \|f\|_{C^s(X)}:=\|f\|_\infty+|f|_{C^s(X)}, \tag{1.16} $$
and
$$ |f|_{C^s(X)}:=\sup_{x\neq y\in X}\frac{|f(x)-f(y)|}{(d(x,y))^s}. \tag{1.17} $$
Before giving our key analysis, we first impose some mild assumptions concerning $K$, $\mathcal H_{K,\mathbf z}$ and $\{\rho(y|x): x\in X\}$.
The kernel function $K$ is said to satisfy a Lipschitz condition of order $(\alpha,\beta)$ with $0<\alpha,\beta\le1$ if, for some $C_\alpha, C_\beta>0$,
$$ |K(x,t)-K(x,t')|\le C_\alpha|t-t'|^\alpha,\qquad\forall\, x,t,t'\in X, \tag{1.18} $$
$$ |K(x,t)-K(x',t)|\le C_\beta|x-x'|^\beta,\qquad\forall\, t,x,x'\in X. \tag{1.19} $$
Let $R>0$ and let $B_R$ be the ball of $\mathcal H_{K,\mathbf z}$ with radius $R$:
$$ B_R=\{f\in\mathcal H_{K,\mathbf z}:\ \|f\|_{l^1}\le R\}. \tag{1.20} $$
As in [22], we give the complexity assumption on the unit ball $B_1$.
Capacity assumption. We say that $B_1$ has polynomial complexity exponent $0<p<2$ if there is some constant $c_p>0$ such that
$$ \log\mathcal N_2(B_1,\epsilon)\le c_p\,\epsilon^{-p},\qquad\forall\,\epsilon>0. \tag{1.21} $$
The following concept describes the continuity of {ρ(y|x):x∈X}.
Definition 4. We say that $\{\rho(y|x): x\in X\}$ satisfies a Lipschitz condition of order $s$ in $(C^s(Y))^*$ if there is some constant $C_\rho\ge0$ such that
$$ \|\rho(y|x)-\rho(y|u)\|_{(C^s(Y))^*}\le C_\rho|x-u|^s,\qquad\forall\, x,u\in X. \tag{1.22} $$
Throughout this paper, we denote $\kappa^2=\sup_{t,x\in X}|K(x,t)|$. Since all the constants are independent of $\delta$, $n$ and $\lambda$, for simplicity of notation we denote them all by $C$.
The rest of this paper is organized as follows: in Section 2, we state the error decomposition of Algorithm 1 and the rate of uniform convergence. In Sections 3–5, we analyze the drift error, the sample error and the hypothesis error. Finally, we prove the main results in Section 6.
We use the technique developed for coefficient-regularization algorithms with non-i.i.d. sampling [19,21] to analyze the learning ability of Algorithm 1. We first define the function space
$$ \mathcal H_1=\Bigl\{f=\sum_{j=1}^\infty\alpha_j\overline K_{u_j}:\ \{\alpha_j\}\in l^1,\ \{u_j\}\subset X,\ \overline K_{u_j}=\frac{K_{u_j}}{\|K_{u_j}\|_{\rho_X}}\Bigr\}, \tag{2.1} $$
with the norm
$$ \|f\|_{\mathcal H_1}:=\inf\Bigl\{\sum_{j=1}^\infty|\alpha_j|:\ f=\sum_{j=1}^\infty\alpha_j\overline K_{u_j}\Bigr\}. \tag{2.2} $$
We define the regularizing function
$$ f_\lambda:=\arg\min_{f\in\mathcal H_1}\bigl\{\mathcal E(f)+\lambda\|f\|_{\mathcal H_1}\bigr\},\qquad\lambda>0. \tag{2.3} $$
In order to describe the error caused by the change of $\{\rho^{(i)}_X\}$, we introduce
$$ \mathcal E_n(f)=\frac1n\sum_{i=1}^n\int_Z(f(u)-y)^2\,d\rho^{(i)}(u,y). \tag{2.4} $$
Now we can give the error decomposition for Algorithm 1:
$$ \mathcal E(\pi_M(\tilde f_k))-\mathcal E(f_\rho)\le P(\lambda)+S(\mathbf z,\lambda)+H(\mathbf z,\lambda)+D(\lambda), \tag{2.5} $$
where
$$ \begin{aligned}
P(\lambda)&=\{\mathcal E(\pi_M(\tilde f_k))-\mathcal E_n(\pi_M(\tilde f_k))\}+\{\mathcal E_n(f_\lambda)-\mathcal E(f_\lambda)\},\\
S(\mathbf z,\lambda)&=\{\mathcal E_n(\pi_M(\tilde f_k))-\mathcal E_{\mathbf z}(\pi_M(\tilde f_k))\}+\{\mathcal E_{\mathbf z}(f_\lambda)-\mathcal E_n(f_\lambda)\},\\
H(\mathbf z,\lambda)&=\mathcal E_{\mathbf z}(\pi_M(\tilde f_k))-\mathcal E_{\mathbf z}(f_\lambda),\\
D(\lambda)&=\mathcal E(f_\lambda)-\mathcal E(f_\rho)+\lambda\|f_\lambda\|_{\mathcal H_1}.
\end{aligned} \tag{2.6} $$
The drift error $P(\lambda)$ describes the change of $\rho^{(i)}$ from $\rho$, and the sample error $S(\mathbf z,\lambda)$ connects the estimator $\pi_M(\tilde f_k)$ with $f_\lambda$. $H(\mathbf z,\lambda)$ and $D(\lambda)$ are known as the hypothesis error and the approximation error, respectively.
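For completeness, (2.5) follows from a telescoping identity together with the fact that the regularization term $\lambda\|f_\lambda\|_{\mathcal H_1}$ contained in $D(\lambda)$ is nonnegative:
$$ \begin{aligned}
P(\lambda)+S(\mathbf z,\lambda)+H(\mathbf z,\lambda)+D(\lambda)
&=\bigl\{\mathcal E(\pi_M(\tilde f_k))-\mathcal E_n(\pi_M(\tilde f_k))\bigr\}+\bigl\{\mathcal E_n(\pi_M(\tilde f_k))-\mathcal E_{\mathbf z}(\pi_M(\tilde f_k))\bigr\}+\bigl\{\mathcal E_{\mathbf z}(\pi_M(\tilde f_k))-\mathcal E_{\mathbf z}(f_\lambda)\bigr\}\\
&\quad+\bigl\{\mathcal E_{\mathbf z}(f_\lambda)-\mathcal E_n(f_\lambda)\bigr\}+\bigl\{\mathcal E_n(f_\lambda)-\mathcal E(f_\lambda)\bigr\}+\mathcal E(f_\lambda)-\mathcal E(f_\rho)+\lambda\|f_\lambda\|_{\mathcal H_1}\\
&=\mathcal E(\pi_M(\tilde f_k))-\mathcal E(f_\rho)+\lambda\|f_\lambda\|_{\mathcal H_1}\ \ge\ \mathcal E(\pi_M(\tilde f_k))-\mathcal E(f_\rho).
\end{aligned} $$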
To compare with the main results in [16], we assume that $D(\lambda)$ satisfies the decay condition
$$ D(\lambda)\le c_q\lambda^{q},\qquad\forall\ 0<\lambda\le1, \tag{2.7} $$
for some exponent $0<q\le1$ and a constant $c_q>0$.
Next we can state the generalization error bound and give the proofs in Sections 3–6.
Theorem 1. Assume that $\{z_i=(x_i,y_i)\}_{i=1}^n$ satisfies condition (1.13), the hypothesis space $\mathcal H_{K,\mathbf z}$ satisfies the capacity assumption (1.21) with $0<p<2$, the kernel $K$ satisfies a Lipschitz condition of order $(\alpha,\beta)$ with $0<k_0\le K(u,v)\le k_1$ for any $u,v\in X$, the target function $f_\rho$ can be approximated with the exponent $0<q\le1$ in $\mathcal H_1$, and (1.14) for $\rho_X$ and (1.22) for $\rho(y|x)$ hold. Take $k\ge T\ge n$. Then for any $0<\delta<1$, with confidence $1-\delta$, we have
$$ \mathcal E(\pi_M(\tilde f_k))-\mathcal E(f_\rho)\le C\,t\,\Bigl\{\lambda^{q}+b_n^{-1}\lambda^{2q-2}+b_n^{-\frac{2}{2+p}}\Bigr\}, \tag{2.8} $$
where $t=\log\frac{6}{\delta-6b_n\beta(a_n)}$ with $b_n$ and $a_n$ given explicitly later.
Theorem 2. Under the assumptions of Theorem 1, if
$$ n\ge\max\Bigl\{8^{\frac1\zeta},\ \Bigl(\frac{6\beta_0}{\delta}\Bigr)^{\frac{1}{(\gamma+1)(1-\zeta)-1}}\Bigr\},\qquad \zeta\in\Bigl(0,\frac{\gamma}{\gamma+1}\Bigr), \tag{2.9} $$
then we obtain
$$ \|\pi_M(\tilde f_k)-f_\rho\|_{\rho_X}^2\le\widetilde D\, n^{-\theta'}\log\frac{12}{\delta}, \tag{2.10} $$
where
$$ \theta'=\min\Bigl\{\frac{q\zeta}{2-q},\ \frac{2\zeta}{2+p}\Bigr\}. $$
Let $\alpha=0$ and $\zeta=1$. Then we obtain the following learning rate for i.i.d. sampling:
$$ \|\pi_M(\tilde f_k)-f_\rho\|_{\rho_X}^2\le\widetilde C\Bigl(\frac1n\Bigr)^{\min\{\frac{q}{2-q},\,\frac{2}{2+p}\}}\log\frac{12}{\delta}, $$
which is the same as that in [16]. In particular, as $p\to0$, $\frac{2}{2+p}\to1$, which is the optimal convergence rate.
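The interplay between the mixing parameter $\zeta$, the approximation exponent $q$ and the capacity exponent $p$ can be illustrated by evaluating the exponent $\theta'$ of Theorem 2 numerically (a small sketch; the function name is ours):

```python
def rate_exponent(q, p, zeta):
    """theta' = min{ q*zeta/(2-q), 2*zeta/(2+p) } from Theorem 2."""
    return min(q * zeta / (2 - q), 2 * zeta / (2 + p))

print(rate_exponent(q=1.0, p=0.01, zeta=1.0))  # ~0.995: close to the optimal O(n^{-1})
print(rate_exponent(q=1.0, p=1.0, zeta=1.0))   # 2/3: a larger capacity exponent slows the rate
print(rate_exponent(q=0.5, p=0.5, zeta=0.8))   # weaker mixing and approximation slow it further
```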
Proposition 3. Under the assumptions of Theorem 1, the inequality
$$ P(\lambda)\le C\,n^{-1}\lambda^{2q-2} \tag{3.1} $$
holds.
Proof. By (1.1) and (2.4), we get
$$ \begin{aligned}
&\bigl(\mathcal E(\pi_M(\tilde f_k))-\mathcal E(f_\lambda)\bigr)-\bigl(\mathcal E_n(\pi_M(\tilde f_k))-\mathcal E_n(f_\lambda)\bigr)\\
&\le\frac1n\sum_{i=1}^n\Bigl|\int_Z\bigl[(\pi_M(\tilde f_k)(u)-y)^2-(f_\lambda(u)-y)^2\bigr]\,d\bigl(\rho(u,y)-\rho^{(i)}(u,y)\bigr)\Bigr|\\
&=\frac1n\sum_{i=1}^n\Bigl|\int_X\bigl(\pi_M(\tilde f_k)(u)-f_\lambda(u)\bigr)\bigl(\pi_M(\tilde f_k)(u)+f_\lambda(u)-2f_\rho(u)\bigr)\,d\bigl(\rho_X(u)-\rho^{(i)}_X(u)\bigr)\Bigr|.
\end{aligned} \tag{3.2} $$
Now (1.15) tells us that
$$ \begin{aligned}
&\bigl(\mathcal E(\pi_M(\tilde f_k))-\mathcal E(f_\lambda)\bigr)-\bigl(\mathcal E_n(\pi_M(\tilde f_k))-\mathcal E_n(f_\lambda)\bigr)\\
&\le\frac1n\sum_{i=1}^n C\alpha^i\bigl\|\bigl(\pi_M(\tilde f_k)(u)-f_\lambda(u)\bigr)\bigl(\pi_M(\tilde f_k)(u)+f_\lambda(u)-2f_\rho(u)\bigr)\bigr\|_{C^s(X)}\\
&\le\frac{C}{n}\,\frac{\alpha}{1-\alpha}\bigl(3M+\|f_\lambda\|_\infty\bigr)\bigl\{2|\tilde f_k|_{C^s(X)}+2|f_\lambda|_{C^s(X)}+2|f_\rho|_{C^s(X)}+4M+2\|f_\lambda\|_\infty\bigr\},
\end{aligned} \tag{3.3} $$
where the last inequality holds since $\|fg\|_{C^s(X)}\le\|f\|_{C(X)}\|g\|_{C^s(X)}+\|f\|_{C^s(X)}\|g\|_{C(X)}$; see [19].
In the following, we estimate $\|f_\lambda\|_\infty$, $|f_\lambda|_{C^s(X)}$, $|\tilde f_k|_{C^s(X)}$ and $|f_\rho|_{C^s(X)}$ separately. Let $f_\lambda(x)=\sum_{j=1}^\infty\alpha_{j,\lambda}\overline K_{u_j}(x)$ with $\{\alpha_{j,\lambda}\}\in l^1$. It follows that
$$ |f_\lambda(x)|\le\frac{\kappa}{\|K_{u_j}\|_{\rho_X}}\sum_{j=1}^\infty|\alpha_{j,\lambda}|\le\frac{\kappa}{\|K_{u_j}\|_{\rho_X}}\|f_\lambda\|_{\mathcal H_1}. \tag{3.4} $$
Furthermore,
$$ \|f_\lambda\|_\infty\le\frac{\kappa}{\|K_{u_j}\|_{\rho_X}}\,\frac{D(\lambda)}{\lambda}. \tag{3.5} $$
The Lipschitz condition (1.18) of the kernel function $K$ yields, for any $f\in\mathcal H_1$, that
$$ |f(x)-f(x')|\le\frac{C_\alpha|x-x'|^{s}}{\|K_{u_j}\|_{\rho_X}}\,\|f\|_{\mathcal H_1},\qquad\forall\, x,x'\in X. $$
Together with (1.17), this implies that
$$ |f_\lambda|_{C^s(X)}\le\frac{C_\alpha\|f_\lambda\|_{\mathcal H_1}}{\|K_{u_j}\|_{\rho_X}}\le\frac{C_\alpha}{\|K_{u_j}\|_{\rho_X}}\,\frac{D(\lambda)}{\lambda}. \tag{3.6} $$
In the same way, from the definition of $\tilde f_k$, we have
$$ |\tilde f_k|_{C^s(X)}\le C_\alpha\|\tilde f_k\|_{l^1}\le C_\alpha\|y\|_n^2\le C_\alpha M^2. \tag{3.7} $$
In addition, combining (1.17) with (1.22) gives
$$ |f_\rho|_{C^s(X)}=\sup_{x,x'\in X}\frac{\bigl|\int_Y y\,d\rho(y|x)-\int_Y y\,d\rho(y|x')\bigr|}{|x-x'|^{s}}\le\frac{\|y\|_{C^s(Y)}\,C_\rho|x-x'|^{s}}{|x-x'|^{s}}\le C_\rho\bigl(M+(2M)^{1-s}\bigr). \tag{3.8} $$
Plugging (3.5), (3.6), (3.7) and (3.8) into (3.3), the desired estimate (3.1) follows, and the proposition is proved.
In our analysis, we apply the method in [18,23] to deal with the original weakly dependent sequence. Let $(a_n,b_n)$ be any integer pair with $b_n=[n/(2a_n)]$. The dependent observations are split into $2b_n$ blocks, each of size $a_n$. For $1\le k\le 2b_n$, $Q_k^{a_n}$ denotes the marginal distribution of the block $(z_{(k-1)a_n+1},z_{(k-1)a_n+2},\cdots,z_{ka_n})$. With the constructed blocks, one can then take a new sequence $(z'_1,\cdots,z'_{2b_na_n})$ with product distribution $\prod_{k=1}^{2b_n}Q_k^{a_n}$. We further define
$$ \begin{aligned}
Z_1&=(z_1,\cdots,z_{a_n},\,z_{2a_n+1},\cdots,z_{3a_n},\,\cdots,\,z_{2(b_n-1)a_n+1},\cdots,z_{(2b_n-1)a_n}),\\
Z_2&=(z_{a_n+1},\cdots,z_{2a_n},\,z_{3a_n+1},\cdots,z_{4a_n},\,\cdots,\,z_{(2b_n-1)a_n+1},\cdots,z_{2b_na_n}),
\end{aligned} $$
and correspondingly
$$ \begin{aligned}
Z'_1&=(z'_1,\cdots,z'_{a_n},\,z'_{2a_n+1},\cdots,z'_{3a_n},\,\cdots,\,z'_{2(b_n-1)a_n+1},\cdots,z'_{(2b_n-1)a_n}),\\
Z'_2&=(z'_{a_n+1},\cdots,z'_{2a_n},\,z'_{3a_n+1},\cdots,z'_{4a_n},\,\cdots,\,z'_{(2b_n-1)a_n+1},\cdots,z'_{2b_na_n}).
\end{aligned} $$
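The block construction can be made explicit as follows: the first $2b_na_n$ observations are grouped into $2b_n$ consecutive blocks of length $a_n$, the odd-indexed blocks forming $Z_1$ and the even-indexed blocks forming $Z_2$ (a small sketch; the helper name is ours).

```python
import numpy as np

def block_indices(n, a_n):
    """Index sets for the independent-blocks technique: 2*b_n consecutive
    blocks of length a_n with b_n = n // (2*a_n); odd blocks give Z_1,
    even blocks give Z_2 (any remaining observations are discarded)."""
    b_n = n // (2 * a_n)
    idx = np.arange(2 * b_n * a_n).reshape(2 * b_n, a_n)
    return idx[0::2], idx[1::2], b_n

# Hypothetical usage: pick out Z_1 and Z_2 from a sample array z of length n.
# odd, even, b_n = block_indices(len(z), a_n=5)
# Z1, Z2 = z[odd.ravel()], z[even.ravel()]
```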
The sample error S(z,λ) can be rewritten as
$$ \begin{aligned}
S(\mathbf z,\lambda)&=\{\mathcal E_{\mathbf z}(f_\lambda)-\mathcal E_{\mathbf z}(f_\rho)\}-\{\mathcal E_n(f_\lambda)-\mathcal E_n(f_\rho)\}\\
&\quad+\{\mathcal E_n(\pi_M(\tilde f_k))-\mathcal E_n(f_\rho)\}-\{\mathcal E_{\mathbf z}(\pi_M(\tilde f_k))-\mathcal E_{\mathbf z}(f_\rho)\}\\
&:=S_1(\mathbf z,\lambda)+S_2(\mathbf z,k).
\end{aligned} $$
We analyze the term S1(z,λ) by using the following inequality from [18].
Lemma 4.1. If $g$ is a measurable function on $Z$ satisfying $\|g(z)-\int_Z g\,d\rho^{(i)}\|_\infty\le M$, then for any $\delta>0$, with confidence $1-\delta$, there holds
$$ \frac1n\sum_{i=1}^n\Bigl(g(z_i)-\int_Z g\,d\rho^{(i)}\Bigr)\le b_n^{-1}\Biggl\{\frac83M\log\frac{2}{\delta-2b_n\beta(a_n)}+\sqrt{\frac{2}{a_n}\sum_{i=1}^{2a_nb_n}\int_Z g^2\,d\rho^{(i)}\,\log\frac{2}{\delta-2b_n\beta(a_n)}}+M\Biggr\}. $$
Proposition 4. Under the assumptions of Theorem 1, for any 0<δ<1, with confidence 1−δ/3,
$$ S_1(\mathbf z,\lambda)\le C\Bigl\{b_n^{-1}\Bigl(1+\frac{D(\lambda)^2}{\lambda^2}\Bigr)+D(\lambda)\Bigr\}\,t. \tag{4.1} $$
Proof. Let $g(z)=(y-f_\lambda(u))^2-(y-f_\rho(u))^2$, $z=(u,y)\in Z$. Thus
$$ \Bigl\|g(z)-\int_Z g\,d\rho^{(i)}\Bigr\|_\infty\le 2\Bigl(3M+\frac{\kappa}{\|K_{u_j}\|_{\rho_X}}\frac{D(\lambda)}{\lambda}\Bigr)^2:=2B_\lambda $$
and
$$ \int_Z g^2\,d\rho^{(i)}\le B_\lambda\int_Z g\,d\rho^{(i)}. $$
Using Lemma 4.1, with confidence 1−δ/3, we have
$$ \frac1n\sum_{i=1}^n\Bigl(g(z_i)-\int_Z g\,d\rho^{(i)}\Bigr)\le\Bigl(\frac{19t}{3}+2\Bigr)B_\lambda b_n^{-1}+\frac{1}{2a_nb_n}\sum_{i=1}^{2a_nb_n}\int_Z g\,d\rho^{(i)}\le\Bigl(\frac{19t}{3}+2\Bigr)B_\lambda b_n^{-1}+2\bigl(\mathcal E_n(f_\lambda)-\mathcal E_n(f_\rho)\bigr). \tag{4.2} $$
Observe that
$$ \mathcal E_n(f_\lambda)-\mathcal E_n(f_\rho)\le\bigl(\mathcal E_n(f_\lambda)-\mathcal E(f_\lambda)+\mathcal E(f_\rho)-\mathcal E_n(f_\rho)\bigr)+D(\lambda). \tag{4.3} $$
By (1.15), we have
$$ \begin{aligned}
\mathcal E_n(f_\lambda)-\mathcal E(f_\lambda)+\mathcal E(f_\rho)-\mathcal E_n(f_\rho)
&\le\frac1n\sum_{i=1}^n\Bigl|\int_X(f_\lambda(u)-f_\rho(u))^2\,d\bigl(\rho^{(i)}_X(u)-\rho_X(u)\bigr)\Bigr|\\
&\le\frac1n\sum_{i=1}^n C\alpha^i\bigl\|(f_\lambda(u)-f_\rho(u))^2\bigr\|_{C^s(X)}\le\frac{C\alpha}{n(1-\alpha)}\Bigl(1+\frac{D(\lambda)}{\lambda}\Bigr)^2,
\end{aligned} \tag{4.4} $$
where the last inequality follows from (3.6) and (3.8).
Combining (4.2), (4.3) and (4.4), we get the desired error bound (4.1) of S1(z,λ). Proposition 4 is proved.
We continue to analyze S2(z,k) by applying the following probability inequality for the β-mixing sequences from [18].
Lemma 4.2. Let $\mathcal G$ be a class of measurable functions on $Z$. Moreover, assume that $\|g-\int_Z g\,d\rho^{(i)}\|_\infty\le M$ for all $g\in\mathcal G$. Then
$$ \mathrm{Prob}\Bigl\{\sup_{g\in\mathcal G}\frac1n\sum_{i=1}^n\Bigl(g(z_i)-\int_Z g(z)\,d\rho^{(i)}\Bigr)>\epsilon+\frac{M}{b_n}\Bigr\}\le\Pi_1+\Pi_2+2b_n\beta(a_n), $$
where
$$ \Pi_1=\mathrm{Prob}\Bigl\{\sup_{g\in\mathcal G}\frac1{b_n}\sum_{j=1}^{b_n}\Bigl(\frac{2b_n}{n}\sum_{i=2(j-1)a_n+1}^{(2j-1)a_n}\Bigl(g(z'_i)-\int_Z g(z)\,d\rho^{(i)}\Bigr)\Bigr)\ge\epsilon\Bigr\}, $$
$$ \Pi_2=\mathrm{Prob}\Bigl\{\sup_{g\in\mathcal G}\frac1{b_n}\sum_{j=1}^{b_n}\Bigl(\frac{2b_n}{n}\sum_{i=(2j-1)a_n+1}^{2ja_n}\Bigl(g(z'_i)-\int_Z g(z)\,d\rho^{(i)}\Bigr)\Bigr)\ge\epsilon\Bigr\}. $$
To get upper bounds for the terms $\Pi_1$ and $\Pi_2$, we invoke the following inequality for non-identical sequences of probability distributions.
Proposition 5. Assume $\{X_i\}_{i=1}^n$ is a random sequence in the measurable space $\bigl(X^n,\prod_{i=1}^n Q_i\bigr)$. Let $\mathcal F$ be a set of measurable functions on $X$ and $B>0$ be a constant such that each $f\in\mathcal F$ satisfies $\|f\|_\infty\le B$. Suppose there exist a nonnegative functional $w$ on $\mathcal F$ and some positive constants $(\Delta_i)_{i=1}^n$ such that
$$ Ef^2(X_i)\le w(f)+\Delta_i,\qquad\forall\, f\in\mathcal F. \tag{4.5} $$
Also assume for some a>0 and p∈(0,2),
$$ \log\mathcal N_2(\mathcal F,\varepsilon)\le a\varepsilon^{-p},\qquad\forall\,\varepsilon>0. $$
Then for any $x>0$ and any $D>0$, with probability at least $1-e^{-x}$, there holds
$$ \frac1n\sum_{i=1}^n Ef(X_i)-\frac1n\sum_{i=1}^n f(X_i)\le D^{-1}w(f)+c'_p\tilde\eta+\frac{(D+18B+2)x}{n},\qquad\forall\, f\in\mathcal F, $$
where $c'_p$ is a constant depending only on $p$ and
$$ \tilde\eta:=\max\Bigl\{D^{\frac{2-p}{2+p}},\ B^{\frac{2-p}{2+p}+1}\Bigr\}\Bigl(\frac{a}{n}\Bigr)^{\frac{2}{p+2}}+\frac1n\sum_{i=1}^n\Delta_i. $$
The above inequalities imply the estimate of S2(z,k).
Proposition 6. Under the assumptions of Theorem 1, for any 0<δ<1, with confidence 1−δ/3,
$$ S_2(\mathbf z,k)\le\frac12\bigl\{\mathcal E(\pi_M(\tilde f_k))-\mathcal E(f_\rho)\bigr\}+C_{p,\Phi,\rho}\,\eta_R+\frac{(192M^2+2)t}{b_n}, \tag{4.6} $$
where
$$ \eta_R:=\Bigl(\frac{R^p}{b_n}\Bigr)^{\frac{2}{2+p}}+\frac{\alpha}{1-\alpha}\,\frac1n\max\{R,1\}. \tag{4.7} $$
Proof. Define the function set $\widetilde{\mathcal G}$ on $Z^{a_n}$ by
$$ \widetilde{\mathcal G}=\Bigl\{G(t_1,\cdots,t_{a_n})=\frac{2b_n}{n}\sum_{k=1}^{a_n}g(t_k):\ g\in\mathcal G\Bigr\},\qquad \mathcal G=\bigl\{g(z)=g(u,y)=(y-\pi_M(f)(u))^2-(y-f_\rho(u))^2:\ f\in B_R\bigr\}, $$
and
$$ w(G):=\int_{Z^{a_n}}G^2(t_1,\cdots,t_{a_n})\,d\rho(t_1)\,d\rho(t_2)\cdots d\rho(t_{a_n})=\frac{4a_n^2b_n^2}{n^2}\int_Z g^2\,d\rho. $$
It follows that
$$ \begin{aligned}
EG^2(z'_{(k-1)a_n+1},z'_{(k-1)a_n+2},\cdots,z'_{ka_n})
&\le\frac{4b_n^2a_n}{n^2}\sum_{i=(k-1)a_n+1}^{ka_n}\int_Z g^2\,d\rho^{(i)}\\
&\le w(G)+\frac{4b_n^2a_n}{n^2}\sum_{i=(k-1)a_n+1}^{ka_n}\Bigl|\int_Z g^2\,d(\rho^{(i)}-\rho)\Bigr|.
\end{aligned} \tag{4.8} $$
We see from (1.15) and (1.22) that
$$ \Bigl|\int_Z g^2\,d(\rho^{(i)}-\rho)\Bigr|\le C\alpha^i\Bigl\|(f_\rho(u)-\pi_M(f)(u))^2\int_Y(2y-\pi_M(f)(u)-f_\rho(u))^2\,d\rho(y|u)\Bigr\|_{C^s(X)}\le C\alpha^i(1+R). \tag{4.9} $$
By (4.8) and (4.9), we know that Δk in (4.5) satisfies
$$ \Delta_k\le\frac{4b_n^2a_n}{n^2}\,C_{\rho,\Phi}\max\{R,1\}\sum_{i=1}^{a_n}\alpha^{(k-1)a_n+i}. $$
Let $\mathbf w=\{\vec t_j=(t_1^j,\cdots,t_{a_n}^j)\}_{j=1}^d\subset(Z^{a_n})^d$, $d\in\mathbb N$. We know that for any functions $G_1=\frac{2b_n}{n}\sum_{k=1}^{a_n}g_1(t_k)$ and $G_2=\frac{2b_n}{n}\sum_{k=1}^{a_n}g_2(t_k)$ in $\widetilde{\mathcal G}$,
$$ d_{2,\mathbf w}^2(G_1,G_2)=\frac1d\sum_{j=1}^d\bigl(G_1(\vec t_j)-G_2(\vec t_j)\bigr)^2=\frac1d\sum_{j=1}^d\Bigl(\frac{2b_n}{n}\sum_{k=1}^{a_n}\bigl(g_1(t_k^j)-g_2(t_k^j)\bigr)\Bigr)^2\le\frac1{da_n}\sum_{j=1}^d\sum_{k=1}^{a_n}\bigl(g_1(t_k^j)-g_2(t_k^j)\bigr)^2=d_{2,\mathbf w}^2(g_1,g_2), $$
so
$$ \mathcal N_2(\widetilde{\mathcal G},\epsilon)\le\mathcal N_2(\mathcal G,\epsilon). \tag{4.10} $$
Moreover,
$$ \mathcal N_2(\mathcal G,\epsilon)\le\mathcal N_2\Bigl(B_R,\frac{\epsilon}{4M}\Bigr). $$
This together with (4.10) yields
$$ \log\mathcal N_2(\widetilde{\mathcal G},\epsilon)\le\log\mathcal N_2(\mathcal G,\epsilon)\le c_p(4M)^pR^p\epsilon^{-p}. $$
Note that $\|G\|_\infty\le\|g\|_\infty\le 8M^2$. It is also easy to see that
$$ EG(z'_{(k-1)a_n+1},z'_{(k-1)a_n+2},\cdots,z'_{ka_n})\le\frac{2b_n}{n}\sum_{i=(k-1)a_n+1}^{ka_n}\int_Z g\,d\rho^{(i)}, $$
and
$$ w(G)=\frac{4a_n^2b_n^2}{n^2}\int_Z g^2\,d\rho\le\int_Z g^2\,d\rho\le 8M^2\int_Z g\,d\rho. $$
Now we apply Proposition 5 to $\widetilde{\mathcal G}$ in $\bigl((Z^{a_n})^{b_n},\prod_{j=1}^{b_n}Q_{2j-1}^{a_n}\bigr)$. Let $B=8M^2$ and $a=c_p(4M)^pR^p$. Then for any $D>0$ and $g\in\mathcal G$, with confidence at least $1-e^{-t}$, we have
$$ \frac1{b_n}\sum_{j=1}^{b_n}\Bigl(\frac{2b_n}{n}\sum_{i=2(j-1)a_n+1}^{(2j-1)a_n}\Bigl(\int_Z g(z)\,d\rho^{(i)}-g(z'_i)\Bigr)\Bigr)\le\frac{8M^2}{D}\int_Z g\,d\rho+c'_p\eta_1+\frac{(D+144M^2+2)t}{b_n}. $$
Here
$$ \eta_1=\max\Bigl\{D^{\frac{2-p}{2+p}},\ (8M^2)^{\frac{2-p}{2+p}+1}\Bigr\}\Bigl\{\frac{c_p(4M)^pR^p}{b_n}\Bigr\}^{\frac{2}{2+p}}+\frac{4b_na_n}{n^2}C_{\rho,\Phi}\max\{R,1\}\sum_{j=1}^{b_n}\sum_{i=1}^{a_n}\alpha^{(2j-2)a_n+i}. $$
It follows by taking $\epsilon_1=c'_p\eta_1+\frac{(D+144M^2+2)t}{b_n}$ that
$$ \mathrm{Prob}\Bigl\{\sup_{g\in\mathcal G}\frac1{b_n}\sum_{j=1}^{b_n}\Bigl(\frac{2b_n}{n}\sum_{i=2(j-1)a_n+1}^{(2j-1)a_n}\Bigl(\int_Z g(z)\,d\rho^{(i)}-g(z'_i)\Bigr)\Bigr)-\frac{8M^2}{D}\int_Z g\,d\rho\ge\epsilon_1\Bigr\}\le e^{-t}. $$
Applying Proposition 5 to $\widetilde{\mathcal G}$ in $\bigl((Z^{a_n})^{b_n},\prod_{j=1}^{b_n}Q_{2j}^{a_n}\bigr)$ once again, we have
$$ \mathrm{Prob}\Bigl\{\sup_{g\in\mathcal G}\frac1{b_n}\sum_{j=1}^{b_n}\Bigl(\frac{2b_n}{n}\sum_{i=(2j-1)a_n+1}^{2ja_n}\Bigl(\int_Z g(z)\,d\rho^{(i)}-g(z'_i)\Bigr)\Bigr)-\frac{8M^2}{D}\int_Z g\,d\rho\ge\epsilon_2\Bigr\}\le e^{-t}. $$
Here $\epsilon_2=c'_p\eta_2+\frac{(D+144M^2+2)t}{b_n}$ with
$$ \eta_2=\max\Bigl\{D^{\frac{2-p}{2+p}},\ (8M^2)^{\frac{2-p}{2+p}+1}\Bigr\}\Bigl\{\frac{c_p(4M)^pR^p}{b_n}\Bigr\}^{\frac{2}{2+p}}+\frac{4b_na_n}{n^2}C_{\rho,\Phi}\max\{R,1\}\sum_{j=1}^{b_n}\sum_{i=1}^{a_n}\alpha^{(2j-1)a_n+i}. $$
Moreover, we obviously have
$$ \frac{4b_na_n}{n^2}\sum_{j=1}^{b_n}\sum_{i=1}^{a_n}\alpha^{(2j-2)a_n+i}+\frac{4b_na_n}{n^2}\sum_{j=1}^{b_n}\sum_{i=1}^{a_n}\alpha^{(2j-1)a_n+i}\le\frac2n\,\frac{\alpha}{1-\alpha}, $$
and
$$ \Bigl\|g(z)-\int_Z g(z)\,d\rho^{(i)}\Bigr\|_\infty<16M^2. $$
We then apply Lemma 4.2 with $\epsilon=c'_p\tilde\eta+\frac{(D+144M^2+2)t}{b_n}$, where
$$ \tilde\eta=\max\Bigl\{D^{\frac{2-p}{2+p}},\ (8M^2)^{\frac{2-p}{2+p}+1}\Bigr\}\Bigl\{\frac{c_p(4M)^pR^p}{b_n}\Bigr\}^{\frac{2}{2+p}}+\frac2n\,C_{\rho,\Phi}\max\{R,1\}\frac{\alpha}{1-\alpha}, $$
and obtain
$$ \mathrm{Prob}\Bigl\{\sup_{g\in\mathcal G}\frac1n\sum_{i=1}^n\Bigl(\int_Z g(z)\,d\rho^{(i)}-g(z_i)\Bigr)-\frac{16M^2}{D}\int_Z g\,d\rho>\epsilon+\frac{16M^2}{b_n}\Bigr\}\le 2e^{-t}+2b_n\beta(a_n). $$
Then we obtain (4.6) by taking $2e^{-t}+2b_n\beta(a_n):=\frac{\delta}{3}$ and $D=32M^2$.
Different from the widely used regularization methods with data-dependent hypothesis spaces [8,10,21,22], our estimate of the hypothesis error $\mathcal E_{\mathbf z}(\pi_M(\tilde f_k))-\mathcal E_{\mathbf z}(f_\lambda)$ is based on the following lemma; see Theorem 3.3 in [7].
Lemma 5.1. If $f\in H$ and $h\in\mathcal H_1^n$, then the output $(f_m)_{m\ge0}$ of the RPGA satisfies the inequality
$$ \|f-f_m\|^2-\|f-h\|^2\le\frac{4}{m+1}\|h\|_{\mathcal H_1^n}^2,\qquad m=0,1,2,\cdots, \tag{5.1} $$
where
$$ \mathcal H_1^n=\Bigl\{h=\sum_i\alpha_i^n\overline K_{u_i}^n:\ \alpha_i^n=\alpha_i\|\overline K_{u_i}\|_n,\ \overline K_{u_i}^n=\frac{\overline K_{u_i}}{\|\overline K_{u_i}\|_n},\ \sum_i\alpha_i\overline K_{u_i}\in\mathcal H_1\Bigr\} \tag{5.2} $$
with
$$ \|f\|_{\mathcal H_1^n}:=\inf\Bigl\{\sum_i|\alpha_i^n|:\ f=\sum_i\alpha_i\overline K_{u_i}\Bigr\}. \tag{5.3} $$
Proposition 7. Under the assumptions of Theorem 1, for $k\ge T$ and any $0<\delta<1$, with confidence at least $1-\delta/3$, there holds
$$ H(\mathbf z,\lambda)\le 4\min\Bigl\{\Bigl\{\Bigl(\frac{19t}{3}+2\Bigr)Mb_n^{-1}+M+\frac{\alpha}{n(1-\alpha)}\Bigl(\frac{k_1^2}{k_0^2}+\frac{2C_\alpha k_1}{k_0^2}\Bigr)+1\Bigr\}^2,\ \frac{k_1^2}{k_0^2}\Bigr\}\frac{D^2(\lambda)}{(k+1)\lambda^2}. \tag{5.4} $$
Proof. By Lemma 5.1, we have
$$ H(\mathbf z,\lambda)=\mathcal E_{\mathbf z}(\pi_M(\tilde f_k))-\mathcal E_{\mathbf z}(f_\lambda)\le\frac{4\|f_\lambda\|_{\mathcal H_1^n}^2}{k+1}. \tag{5.5} $$
From the definitions of $\|f\|_{\mathcal H_1^n}$ and $\|f\|_{\mathcal H_1}$, we have
$$ \|f_\lambda\|_{\mathcal H_1^n}^2\le\frac{k_1^2}{k_0^2}\|f_\lambda\|_{\mathcal H_1}^2. \tag{5.6} $$
Meanwhile, we define the function $g(x)=|\overline K_{u_i}(x)|^2$ for any $i$. Notice that
$$ \Bigl\|g(x)-\int_X g\,d\rho^{(j)}_X\Bigr\|_\infty\le\frac{2k_1^2}{k_0^2}:=2M, $$
and
$$ \int_X g^2\,d\rho^{(j)}_X\le M\int_X g\,d\rho^{(j)}_X. $$
Using Lemma 4.1, with confidence 1−δ/3, we have
$$ \frac1n\sum_{j=1}^n\Bigl(g(x_j)-\int_X g\,d\rho^{(j)}_X\Bigr)\le\Bigl(\frac{19t}{3}+2\Bigr)Mb_n^{-1}+M. \tag{5.7} $$
By (1.17) and (1.15), we get
$$ \frac1n\sum_{j=1}^n\Bigl(\int_X g\,d\rho^{(j)}_X-\int_X g\,d\rho_X\Bigr)\le\frac1n\sum_{j=1}^n C\alpha^j\bigl(\|g\|_\infty+|g|_{C^s(X)}\bigr)\le\frac{\alpha}{n(1-\alpha)}\Bigl(\frac{k_1^2}{k_0^2}+\frac{2C_\alpha k_1}{k_0^2}\Bigr). \tag{5.8} $$
This in connection with (5.7) tells us that
$$ \frac1n\sum_{j=1}^n|\overline K(u_i,x_j)|^2-E\overline K_{u_i}^2=\frac1n\sum_{j=1}^n\Bigl(g(x_j)-\int_X g\,d\rho_X\Bigr)\le\Bigl(\frac{19t}{3}+2\Bigr)Mb_n^{-1}+M+\frac{\alpha}{n(1-\alpha)}\Bigl(\frac{k_1^2}{k_0^2}+\frac{2C_\alpha k_1}{k_0^2}\Bigr). \tag{5.9} $$
It is easy to see that $\|\overline K_{u_j}\|_{\rho_X}^2=E\overline K_{u_j}^2=1$. Now (5.9) implies that
$$ \|\overline K_{u_i}\|_n=\sqrt{\frac1n\sum_{j=1}^n|\overline K(u_i,x_j)|^2}\le\frac1n\sum_{j=1}^n|\overline K(u_i,x_j)|^2\le\Bigl(\frac{19t}{3}+2\Bigr)Mb_n^{-1}+M+\frac{\alpha}{n(1-\alpha)}\Bigl(\frac{k_1^2}{k_0^2}+\frac{2C_\alpha k_1}{k_0^2}\Bigr)+1. \tag{5.10} $$
Therefore,
$$ \|f_\lambda\|_{\mathcal H_1^n}^2\le\Bigl\{\Bigl(\frac{19t}{3}+2\Bigr)Mb_n^{-1}+M+\frac{\alpha}{n(1-\alpha)}\Bigl(\frac{k_1^2}{k_0^2}+\frac{2C_\alpha k_1}{k_0^2}\Bigr)+1\Bigr\}^2\|f_\lambda\|_{\mathcal H_1}^2. \tag{5.11} $$
Combining (5.5), (5.6), (5.11), we obtain (5.4).
Proof of Theorem 1. Combining the bounds (2.7), (3.1), (4.1), (4.6) and (5.4), with confidence at least 1−δ,
$$ \begin{aligned}
\mathcal E(\pi_M(\tilde f_k))-\mathcal E(f_\rho)
&\le c_q\lambda^{q}+4\min\Bigl\{\Bigl\{\Bigl(\frac{19t}{3}+2\Bigr)Mb_n^{-1}+M+\frac{\alpha}{1-\alpha}\Bigl(\frac{k_1^2}{k_0^2}+\frac{2C_\alpha k_1}{k_0^2}\Bigr)+1\Bigr\}^2,\ \frac{k_1^2}{k_0^2}\Bigr\}\frac{c_q^2\lambda^{2q}}{(k+1)\lambda^2}\\
&\quad+\frac{C\lambda^{2q-2}}{n}+C\Bigl\{b_n^{-1}\Bigl(1+\frac{c_q^2\lambda^{2q}}{\lambda^2}\Bigr)+c_q\lambda^{q}\Bigr\}t+\frac12\bigl\{\mathcal E(\pi_M(\tilde f_k))-\mathcal E(f_\rho)\bigr\}+C_{p,\Phi,\rho}\,\eta_R+\frac{(192M^2+2)t}{b_n}.
\end{aligned} \tag{6.1} $$
Note that $k\ge T\ge n$. Taking $R=M^2$, we then have
$$ \mathcal E(\pi_M(\tilde f_k))-\mathcal E(f_\rho)\le t\,\bigl\{(k+1)^{-1}\lambda^{2q-2}+n^{-1}\lambda^{2q-2}+b_n^{-1}\lambda^{2q-2}+\lambda^{q}+b_n^{-\frac{2}{2+p}}+n^{-1}+b_n^{-1}\bigr\}\le C\,t\,\bigl\{\lambda^{q}+b_n^{-1}\lambda^{2q-2}+b_n^{-\frac{2}{2+p}}\bigr\}. \tag{6.2} $$
This finishes the proof of Theorem 1.
Proof of Theorem 2. Under the conditions of Theorem 1, let $n^{1-\zeta}\le a_n<n^{1-\zeta}+1$, $\zeta\in[0,1]$ and $n\ge 8^{1/\zeta}$. Then
$$ \frac1{b_n}\le\frac{1}{\frac{n}{2a_n}-1}\le\frac{2(n^{1-\zeta}+1)}{n-2n^{1-\zeta}}\le\frac{4n^{1-\zeta}}{n-2n^{1-\zeta}}=\frac{4n^{-\zeta}}{1-2n^{-\zeta}}\le 8n^{-\zeta}. \tag{6.3} $$
Substituting (6.3) into (6.2), we obtain
$$ \mathcal E(\pi_M(\tilde f_k))-\mathcal E(f_\rho)\le C\,t\,\bigl\{\lambda^{q}+n^{-\zeta}\lambda^{2q-2}+n^{-\frac{2\zeta}{2+p}}\bigr\}. \tag{6.4} $$
By setting $\lambda=n^{-\theta}$, we know that
$$ \mathcal E(\pi_M(\tilde f_k))-\mathcal E(f_\rho)\le D_2\, t\, n^{-\theta'}, \tag{6.5} $$
where
$$ \theta'=\min\Bigl\{q\theta,\ \zeta-(2-2q)\theta,\ \frac{2\zeta}{2+p}\Bigr\}. $$
To balance the errors in (2.5), we take $\theta=\frac{\zeta}{2-q}$. Then
$$ \theta'=\min\Bigl\{\frac{q\zeta}{2-q},\ \frac{2\zeta}{2+p}\Bigr\}. $$
Finally, we choose
$$ n\ge\Bigl(\frac{6\beta_0}{\delta}\Bigr)^{\frac{1}{(\gamma+1)(1-\zeta)-1}},\qquad\zeta\in\Bigl(0,\frac{\gamma}{\gamma+1}\Bigr); $$
it follows from $\beta(a_n)\le\beta_0(a_n)^{-\gamma}$ and $a_n\ge n^{1-\zeta}$ that
$$ \frac{12b_n\beta(a_n)}{\delta}\le1, $$
thus
$$ t=\log\frac{6}{\delta-6b_n\beta(a_n)}\le\log\frac{12}{\delta}. $$
This finishes the proof of Theorem 2.
This research is supported by the National Science Foundation for Young Scientists of China (Grant No. 12001328), Doctoral Research Fund of Shandong Jianzhu University (No. XNBS1942), the Development Plan of Youth Innovation Team of University in Shandong Province (No. 2021KJ067) and Shandong Provincial Natural Science Foundation of China (Grant No. ZR2022MF223). All authors contributed substantially to this paper, participated in drafting and checking the manuscript. All authors read and approved the final manuscript.
The authors declare that they have no competing interests regarding the publication of this paper.
[1] J. Fang, S. B. Lin, Z. B. Xu, Learning and approximation capabilities of orthogonal super greedy algorithm, Knowl.-Based Syst., 95 (2016), 86–98. https://doi.org/10.1016/j.knosys.2015.12.011
[2] H. Chen, L. Q. Li, Z. B. Pan, Learning rates of multi-kernel regression by orthogonal greedy algorithm, J. Stat. Plan. Infer., 143 (2013), 276–282. https://doi.org/10.1016/j.jspi.2012.08.002
[3] S. B. Lin, Y. H. Rong, X. P. Sun, Z. B. Xu, Learning capability of relaxed greedy algorithms, IEEE Trans. Neural Netw. Learn. Syst., 24 (2013), 1598–1608. https://doi.org/10.1109/TNNLS.2013.2265397
[4] L. Xu, S. B. Lin, Z. B. Xu, Learning capability of the truncated greedy algorithm, Sci. China Inform. Sci., 59 (2016), 052103. https://doi.org/10.1007/s11432-016-5536-6
[5] A. R. Barron, A. Cohen, W. Dahmen, R. A. DeVore, Approximation and learning by greedy algorithms, Ann. Statist., 36 (2008), 64–94. https://doi.org/10.1214/009053607000000631
[6] L. Xu, S. B. Lin, J. S. Zeng, X. Liu, Z. B. Xu, Greedy criterion in orthogonal greedy learning, IEEE Trans. Cybernetics, 48 (2018), 955–966. https://doi.org/10.1109/TCYB.2017.2669259
[7] G. Petrova, Rescaled pure greedy algorithm for Hilbert and Banach spaces, Appl. Comput. Harmon. Anal., 41 (2016), 852–866. https://doi.org/10.1016/j.acha.2015.10.008
[8] S. G. Lv, D. M. Shi, Q. W. Xiao, M. S. Zhang, Sharp learning rates of coefficient-based lq-regularized regression with indefinite kernels, Sci. China Math., 56 (2013), 1557–1574. https://doi.org/10.1007/s11425-013-4688-8
[9] Y. L. Feng, S. G. Lv, Unified approach to coefficient-based regularized regression, Comput. Math. Appl., 62 (2011), 506–515. https://doi.org/10.1016/j.camwa.2011.05.034
[10] W. L. Nie, C. Wang, Constructive analysis for coefficient regularization regression algorithms, J. Math. Anal. Appl., 431 (2015), 1153–1171. https://doi.org/10.1016/j.jmaa.2015.06.006
[11] H. W. Sun, Q. Wu, Least square regression with indefinite kernels and coefficient regularization, Appl. Comput. Harmon. Anal., 30 (2011), 96–109. https://doi.org/10.1016/j.acha.2010.04.001
[12] B. Schölkopf, A. Smola, Learning with Kernels, MIT Press, Cambridge, 2002.
[13] C. J. Liu, Gabor-based kernel PCA with fractional power polynomial models for face recognition, IEEE Trans. Pattern Anal. Mach. Intell., 26 (2004), 572–581. https://doi.org/10.1109/TPAMI.2004.1273927
[14] R. Opfer, Multiscale kernels, Adv. Comput. Math., 25 (2006), 357–380. https://doi.org/10.1007/s10444-004-7622-3
[15] A. R. Barron, Universal approximation bounds for superpositions of a sigmoidal function, IEEE Trans. Inform. Theory, 39 (1993), 930–945. https://doi.org/10.1109/18.256500
[16] H. Chen, Y. C. Zhou, Y. Y. Tang, Convergence rate of the semi-supervised greedy algorithm, Neural Networks, 44 (2013), 44–50. https://doi.org/10.1016/j.neunet.2013.03.001
[17] S. Smale, D. X. Zhou, Online learning with Markov sampling, Anal. Appl., 7 (2009), 87–113. https://doi.org/10.1142/S0219530509001293
[18] Z. C. Guo, L. Shi, Classification with non-i.i.d. sampling, Math. Comput. Model., 54 (2011), 1347–1364. https://doi.org/10.1016/j.mcm.2011.03.042
[19] Z. W. Pan, Q. W. Xiao, Least-square regularized regression with non-iid sampling, J. Stat. Plan. Infer., 139 (2009), 3579–3587. https://doi.org/10.1016/j.jspi.2009.04.007
[20] R. C. Bradley, Basic properties of strong mixing conditions, Progr. Probab. Statist., 2 (1986), 165–192. https://doi.org/10.1007/978-1-4615-8162-8_8
[21] Q. Guo, P. X. Ye, B. L. Cai, Convergence rate for lq-coefficient regularized regression with non-i.i.d. sampling, IEEE Access, 6 (2018), 18804–18813. https://doi.org/10.1109/ACCESS.2018.2817215
[22] L. Shi, Y. L. Feng, D. X. Zhou, Concentration estimates for learning with l1-regularizer and data dependent hypothesis spaces, Appl. Comput. Harmon. Anal., 31 (2011), 286–302. https://doi.org/10.1016/j.acha.2011.01.001
[23] B. Yu, Rates of convergence for empirical processes of stationary mixing sequences, Ann. Probab., 22 (1994), 94–116. https://www.jstor.org/stable/2244496