Research article

Learning capability of the rescaled pure greedy algorithm with non-iid sampling


  • Received: 24 August 2022 Revised: 07 December 2022 Accepted: 14 December 2022 Published: 11 January 2023
  • We consider the rescaled pure greedy learning algorithm (RPGLA) with dependent samples drawn according to a non-identical sequence of probability distributions. A generalization performance analysis is provided by applying the independent-blocks technique and adding a drift error term. We derive a satisfactory learning rate for the algorithm under the assumption that the process is stationary and β-mixing, and also find that the optimal rate $O(n^{-1})$ can be attained for i.i.d. processes.

    Citation: Qin Guo, Binlei Cai. Learning capability of the rescaled pure greedy algorithm with non-iid sampling[J]. Electronic Research Archive, 2023, 31(3): 1387-1404. doi: 10.3934/era.2023071




    Greedy learning algorithms, that is, greedy algorithms applied to supervised learning problems, have triggered enormous recent research activity because of their low computational burden [1,2,3,4]. Theoretical analyses of greedy learning have attracted wide attention within the framework of learning theory [1,2,3,5,6]. We consider the learning capability of the rescaled pure greedy algorithm (RPGA), introduced by Petrova in [7], in a non-i.i.d. sampling setting.

    We first give a brief review of regression learning and of greedy algorithms. Let $X$ be a compact metric space and $Y=\mathbb{R}$. Let $\mathbf{z}=\{z_i\}_{i=1}^{n}=\{(x_i,y_i)\}_{i=1}^{n}\in Z^{n}$ be a stationary sequence of samples with unknown Borel probability distribution $\rho$ on $Z=X\times Y$.

    The generalization error can be defined by

    $$\mathcal{E}(f)=\int_Z (f(x)-y)^2\,d\rho, \qquad f:X\to Y. \qquad (1.1)$$

    Minimizing $\mathcal{E}(f)$ over all measurable functions yields the regression function

    $$f_\rho(x)=\int_Y y\,d\rho(y|x),$$

    where $\rho(\cdot|x)$ denotes the conditional probability measure (given $x$) on $Y$. The empirical error $\mathcal{E}_{\mathbf{z}}(f)$, which for a fixed function $f$ on $X$ is a good approximation of the generalization error $\mathcal{E}(f)$, is defined by

    $$\mathcal{E}_{\mathbf{z}}(f)=\|y-f\|_n^2:=\frac{1}{n}\sum_{i=1}^{n}(f(x_i)-y_i)^2. \qquad (1.2)$$

    The regression problem in learning theory aims at a good approximation $f_{\mathbf{z}}$ of $f_\rho$, constructed by learning algorithms. Denote by $L^2_{\rho_X}(X)$ the Hilbert space of square integrable functions defined on $X$ with respect to the measure $\rho_X$, where $\rho_X$ denotes the marginal probability distribution on $X$ and $\|f\|_{\rho_X}=\big(\int_X |f(x)|^2\,d\rho_X\big)^{\frac{1}{2}}$. It is known that, for any $f\in L^2_{\rho_X}(X)$, it holds that

    $$\mathcal{E}(f)-\mathcal{E}(f_\rho)=\|f-f_\rho\|^2, \qquad (1.3)$$

    where

    $$\|u\|^2=E\big(|u(x)|^2\big)=\|u\|_{\rho_X}^2.$$

    The learning ability of the regression algorithm can be measured by the excess generalization error

    $$\|f_{\mathbf{z}}-f_\rho\|^2=\mathcal{E}(f_{\mathbf{z}})-\mathcal{E}(f_\rho).$$
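
    For completeness, here is a short verification of identity (1.3); it is a standard computation added for the reader's convenience and uses only the definition of $f_\rho$:

    $$\begin{aligned}
    \mathcal{E}(f)-\mathcal{E}(f_\rho)
    &=\int_Z\big[(f(x)-y)^2-(f_\rho(x)-y)^2\big]\,d\rho\\
    &=\int_X\big(f(x)-f_\rho(x)\big)\int_Y\big(f(x)+f_\rho(x)-2y\big)\,d\rho(y|x)\,d\rho_X\\
    &=\int_X\big(f(x)-f_\rho(x)\big)\big(f(x)+f_\rho(x)-2f_\rho(x)\big)\,d\rho_X=\|f-f_\rho\|_{\rho_X}^2,
    \end{aligned}$$

    where the inner integral was evaluated using $\int_Y y\,d\rho(y|x)=f_\rho(x)$.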

    Let $H$ be a real, separable Hilbert space endowed with inner product $\langle\cdot,\cdot\rangle$ and norm $\|\cdot\|:=\|\cdot\|_H=\langle\cdot,\cdot\rangle^{\frac{1}{2}}$. A set of functions $D\subset H$ is called a dictionary if $\|g\|=1$ for every $g\in D$, $g\in D$ implies $-g\in D$, and the closure of $\mathrm{span}(D)$ is $H$. We define the RPGA($D$) as follows:

    RPGA(D):

    Step 0: Define $f_0:=0$.

    Step $m$ ($m\geq 1$):

    (1) If $f=f_{m-1}$, stop the algorithm and define $f_k=f_{m-1}=f$, for $k\geq m$.

    (2) If $f\neq f_{m-1}$, choose a direction $\varphi_m\in D$ such that

    $$|\langle f-f_{m-1},\varphi_m\rangle|=\sup_{\varphi\in D}|\langle f-f_{m-1},\varphi\rangle|. \qquad (1.4)$$

    With

    $$\lambda_m:=\langle f-f_{m-1},\varphi_m\rangle, \qquad (1.5)$$
    $$\hat{f}_m:=f_{m-1}+\lambda_m\varphi_m, \qquad (1.6)$$
    $$s_m:=\frac{\langle f,\hat{f}_m\rangle}{\|\hat{f}_m\|^2}, \qquad (1.7)$$

    define the next approximant to be

    $$f_m=s_m\hat{f}_m, \qquad (1.8)$$

    and proceed to Step m+1.

    Note that the RPGA simply applies an appropriate rescaling to the output of the pure greedy algorithm (PGA); this rescaling boosts the convergence rate of the PGA to the optimal approximation rate $O(m^{-\frac{1}{2}})$ for functions in $\mathcal{A}_1(D)$, see [7].
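
    To make the iteration above concrete, the following sketch implements RPGA($D$) for the special case $H=\mathbb{R}^d$ with a finite dictionary stored as the rows of a matrix. The function and variable names (rpga, dictionary, n_iter) are ours, and the snippet is only an illustration of the scheme (1.4)–(1.8), not the authors' implementation.

        import numpy as np

        def rpga(f, dictionary, n_iter):
            """Rescaled pure greedy algorithm in H = R^d.

            f          : target vector in R^d
            dictionary : array of shape (N, d), rows normalized to unit norm
            n_iter     : number of greedy steps
            """
            fm = np.zeros_like(f, dtype=float)
            for _ in range(n_iter):
                residual = f - fm
                if np.allclose(residual, 0.0):               # step (1): f is recovered exactly
                    break
                inner = dictionary @ residual                # <f - f_{m-1}, phi> for every phi
                m = int(np.argmax(np.abs(inner)))            # greedy selection, (1.4)
                lam = inner[m]                               # lambda_m, (1.5)
                if lam == 0.0:                               # residual orthogonal to the dictionary
                    break
                f_hat = fm + lam * dictionary[m]             # PGA update, (1.6)
                s = np.dot(f, f_hat) / np.dot(f_hat, f_hat)  # rescaling factor, (1.7)
                fm = s * f_hat                               # next approximant, (1.8)
            return fm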

    Throughout this paper, we derive the error bounds under the assumption that $|y|\leq M$ almost surely for some $M\geq 0$; hence $|f_\rho(x)|\leq M$ for any $x\in X$. We also define the following truncation function as in [8,9,10].

    Definition 1. Fix $M>0$. We define the truncation function $\pi_M$ on the space of measurable functions $f:X\to\mathbb{R}$ as

    $$\pi_M(f)(x)=\begin{cases} M, & \text{if } f(x)>M,\\ f(x), & \text{if } |f(x)|\leq M,\\ -M, & \text{if } f(x)<-M. \end{cases} \qquad (1.9)$$
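
    In code, the truncation $\pi_M$ of (1.9) is simply a clipping operation; a minimal sketch (the function name truncate is ours):

        import numpy as np

        def truncate(f_values, M):
            """Apply the truncation pi_M of (1.9) to an array of function values."""
            return np.clip(f_values, -M, M)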

    Now we use the RPGA to realize greedy learning. Here we consider learning with an indefinite kernel $K:X\times X\to\mathbb{R}$ [11,12,13,14] and define the following hypothesis space:

    $$\mathcal{H}_{K,\mathbf{z}}=\Big\{f=\sum_{i=1}^{n}\alpha_i K_{x_i}=\sum_{i=1}^{n}\alpha_i K(x_i,\cdot):\ \alpha=(\alpha_1,\ldots,\alpha_n)\in\mathbb{R}^n,\ n\in\mathbb{N}\Big\}, \qquad (1.10)$$

    where

    $$\|f\|_{\ell^1}:=\inf\Big\{\sum_{i=1}^{n}|\alpha_i|:\ f=\sum_{i=1}^{n}\alpha_i K_{x_i}\in\mathcal{H}_{K,\mathbf{z}}\Big\}. \qquad (1.11)$$

    We now present the rescaled pure greedy learning algorithm (RPGLA) as follows:

    Algorithm 1 RPGLA
    Input: a data set $\mathbf{z}=\{z_i\}_{i=1}^{n}=\{(x_i,y_i)\}_{i=1}^{n}\in Z^n$, the kernel $K$, a maximal iteration number $T>0$ and the dictionary $D_n=\{K_{x_i},\ i=1,\ldots,n\}$
      Step 1. Normalization: $\widetilde{K}_{x_i}=K_{x_i}/\|K_{x_i}\|_n$, $i=1,\ldots,n$;
        dictionary: $\widetilde{D}_n=\{\widetilde{K}_{x_i}:\ i=1,\ldots,n\}$
      Step 2. Computation: let $\widetilde{f}_0=0$;
        for $k=1,2,\ldots$, the approximant $\widetilde{f}_k$ is generated by the RPGA($\widetilde{D}_n$);
          if $\|y-\widetilde{f}_k\|_n^2+\|\widetilde{f}_k\|_{\ell^1}\geq\|y\|_n^2$ and $k\leq T$, break
        end
    Output: $\pi_M(\widetilde{f}_k)$
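
    The following sketch shows one way Algorithm 1 could be realized in code. It is only illustrative: the names are ours, the $\ell^1$ penalty is tracked through the coefficients with respect to the normalized dictionary, and the inequality in the stopping rule reflects our reading of the criterion above.

        import numpy as np

        def rpgla(x, y, kernel, M, T):
            """Illustrative sketch of the RPGLA (Algorithm 1); not the authors' code."""
            n = len(y)
            # Values of the dictionary functions K_{x_i} at the sample points x_j.
            G = np.array([[kernel(x[j], x[i]) for i in range(n)] for j in range(n)])
            emp_norm = lambda v: np.sqrt(np.mean(v ** 2))            # empirical norm ||.||_n
            # Step 1: normalize every dictionary element in the empirical norm.
            D = G / np.array([emp_norm(G[:, i]) for i in range(n)])
            # Step 2: run RPGA(D~_n) on the data vector y with empirical inner products.
            f = np.zeros(n)          # values of the current approximant at the x_j's
            c = np.zeros(n)          # coefficients with respect to the normalized dictionary
            for k in range(1, T + 1):
                inner = D.T @ (y - f) / n                            # <y - f_{k-1}, K~_{x_i}>_n
                m = int(np.argmax(np.abs(inner)))
                lam = inner[m]
                if lam == 0.0:
                    break
                f_hat = f + lam * D[:, m]
                c_hat = c.copy()
                c_hat[m] += lam
                s = (np.dot(y, f_hat) / n) / (emp_norm(f_hat) ** 2)  # rescaling step of the RPGA
                f, c = s * f_hat, s * c_hat
                ell1 = np.sum(np.abs(c))                             # proxy for ||f~_k||_{l1}
                if emp_norm(y - f) ** 2 + ell1 >= emp_norm(y) ** 2:  # stopping rule (our reading)
                    break
            return np.clip(f, -M, M)                                 # output pi_M(f~_k)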

    Many greedy learning schemes have recently been used successfully in the i.i.d. sampling setting [1,2,3,4,5,6,15]. For example, Barron et al. [5] used a complexity regularization principle as the stopping criterion and deduced the learning rate $O((n/\log n)^{-\frac{1}{2}})$ for various greedy algorithms. Lin et al. [3] analyzed the learning capability of the relaxed greedy learning algorithm (RGLA) and proved that its learning rate is faster than the order $O(n^{-\frac{1}{2}})$. Their numerous numerical simulations confirmed that the relaxed greedy algorithm (RGA) is more stable than the orthogonal greedy algorithm (OGA) in dealing with noisy machine learning problems. Chen et al. [16] introduced a sparse semi-supervised method that learns regression functions from samples via the OGA and derived the learning rate $O(n^{-1})$ under mild assumptions. To reduce the computational burden of the OGA, Fang et al. [1] considered the orthogonal super greedy algorithm (OSGA), which selects more than one atom from the dictionary in each iteration, in supervised learning and deduced almost the same learning rate as that of the orthogonal greedy learning algorithm (OGLA) in [5]. Different from the traditional variants RGA and OGA, Xu et al. [4] proposed the truncated greedy algorithm (TGA), which truncates the step size of the PGA at a specified value in each greedy iteration to cut down the model complexity; they also proved that, for some specific learning tasks, the truncated greedy learning algorithm (TGLA) removes the logarithmic factor in the learning rates of the OGLA and the RGLA. All these results show that, in the realm of supervised learning, each greedy algorithm has its own pros and cons. For instance, compared with the OGA, the PGA and the RGA are computationally cheaper but suffer from lower convergence rates. In this paper, we study the learning capability of the RPGA, a very simple modification of the PGA. Motivated by the work in [7], we proceed to deduce the error bound of the RPGLA. Our results show that the RPGLA further reduces the computational burden without sacrificing generalization capability when compared with the OGLA and the RGLA. However, the assumption of independent and identically distributed samples is rather restrictive; for example, in [17,18,19] the authors studied non-i.i.d. sampling settings for different learning algorithms. We shall study β-mixing and non-identical sampling; see [20] and the references therein for details.

    Definition 2. Let $\mathbf{z}=\{z_t\}_{t\geq 1}$ be a sequence of random variables. For any $i,j\in\mathbb{N}\cup\{+\infty\}$, $\sigma_i^{j}$ denotes the $\sigma$-algebra generated by the random variables $\{z_t=(x_t,y_t)\}_{t=i}^{j}$. Then for any $l\in\mathbb{N}$, the β-mixing coefficients of the stochastic process $\mathbf{z}$ are defined as

    $$\beta(l)=\sup_{j\geq 1}\mathbb{E}\sup_{A\in\sigma_{j+l}^{\infty}}\big|P(A|\sigma_1^{j})-P(A)\big|. \qquad (1.12)$$

    The process $\mathbf{z}$ is said to be β-mixing if $\beta(l)\to 0$ as $l\to\infty$. Specifically, it is said to be polynomially β-mixing if there exist some $\beta_0>0$ and $\gamma>0$ such that, for all $l\geq 1$,

    $$\beta(l)\leq\beta_0\,l^{-\gamma}. \qquad (1.13)$$

    The β-mixing condition is "just the right" assumption and has been adopted in previous studies of learning with weakly dependent samples; see [18,21] and the references therein. It is quite easy to verify and covers general non-i.i.d. cases such as Gaussian and Markov processes. Markov chains arise often and naturally in applications, especially in market prediction, speech recognition, biological sequence analysis, content-based web search and character recognition.
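
    As a concrete illustration of the kind of weakly dependent sample covered by Definition 2, the sketch below generates a stationary Gaussian AR(1) input sequence (a Markov process, hence β-mixing with exponentially decaying coefficients) together with noisy regression responses. The model, parameter values and names are ours and serve only as an example of a non-i.i.d. data source.

        import numpy as np

        def ar1_regression_sample(n, a=0.5, f_rho=np.tanh, noise=0.1, seed=0):
            """Stationary AR(1) inputs x_t with responses y_t = f_rho(x_t) + bounded noise."""
            rng = np.random.default_rng(seed)
            x = np.empty(n)
            x[0] = rng.standard_normal()                   # start in the stationary law N(0, 1)
            for t in range(1, n):
                # x_{t+1} = a * x_t + sqrt(1 - a^2) * e_t keeps the chain stationary for |a| < 1.
                x[t] = a * x[t - 1] + np.sqrt(1.0 - a ** 2) * rng.standard_normal()
            y = f_rho(x) + noise * rng.uniform(-1.0, 1.0, n)  # bounded noise, so |y| <= 1 + noise a.s.
            return x, y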

    We assume that $\{z_i\}_{i=1}^{n}$ is drawn according to a sequence of Borel probability measures $\{\rho^{(i)}\}_{i=1,2,\ldots}$ on $Z$. Let $\rho^{(i)}_X$ be the marginal distribution of $\rho^{(i)}$. For every $x\in X$, the conditional distribution of each $\rho^{(i)}$ at $x$ is $\rho(\cdot|x)$.

    Definition 3. We say that $\{\rho^{(i)}_X\}$ converges to $\rho_X$ exponentially in $(C^s(X))^*$ if, for some $C>0$ and $0<\alpha<1$,

    $$\|\rho^{(i)}_X-\rho_X\|_{(C^s(X))^*}\leq C\alpha^{i}, \qquad i\in\mathbb{N}. \qquad (1.14)$$

    The above condition (1.14) is also equivalent to

    $$\Big|\int_X f(x)\,d\rho^{(i)}_X-\int_X f(x)\,d\rho_X\Big|\leq C\alpha^{i}\big(\|f\|_\infty+|f|_{C^s(X)}\big), \qquad f\in C^s(X),\ i\in\mathbb{N}, \qquad (1.15)$$

    where

    $$\|f\|_{C^s(X)}:=\|f\|_\infty+|f|_{C^s(X)}, \qquad (1.16)$$

    and

    $$|f|_{C^s(X)}:=\sup_{x\neq y\in X}\frac{|f(x)-f(y)|}{(d(x,y))^{s}}. \qquad (1.17)$$

    Before giving our key analysis, we first need to impose some mild assumptions concerning $K$, $\mathcal{H}_{K,\mathbf{z}}$ and $\{\rho(y|x):x\in X\}$.

    The kernel function $K$ is said to satisfy a Lipschitz condition of order $(\alpha,\beta)$ with $0<\alpha,\beta\leq 1$ if, for some $C_\alpha,C_\beta>0$,

    $$|K(x,t)-K(x,t')|\leq C_\alpha|t-t'|^{\alpha}, \qquad x,t,t'\in X, \qquad (1.18)$$
    $$|K(x,t)-K(x',t)|\leq C_\beta|x-x'|^{\beta}, \qquad t,x,x'\in X. \qquad (1.19)$$

    Let $R>0$ and let $B_R$ be the ball of $\mathcal{H}_{K,\mathbf{z}}$ with radius $R$:

    $$B_R=\{f\in\mathcal{H}_{K,\mathbf{z}}:\ \|f\|_{\ell^1}\leq R\}. \qquad (1.20)$$

    As in [22], we impose the following complexity assumption on the unit ball $B_1$.

    Capacity assumption. We say that $B_1$ has polynomial complexity exponent $0<p<2$ if there is some constant $c_p>0$ such that

    $$\log\mathcal{N}_2(B_1,\epsilon)\leq c_p\epsilon^{-p}, \qquad \forall\,\epsilon>0. \qquad (1.21)$$

    The following concept describes the continuity of $\{\rho(y|x):x\in X\}$.

    Definition 4. We say that $\{\rho(y|x):x\in X\}$ satisfies a Lipschitz condition of order $s$ in $(C^s(Y))^*$ if there is some constant $C_\rho\geq 0$ such that

    $$\|\rho(y|x)-\rho(y|u)\|_{(C^s(Y))^*}\leq C_\rho|x-u|^{s}, \qquad \forall\,x,u\in X. \qquad (1.22)$$

    Throughout this paper, we denote $\kappa^2=\sup_{t,x\in X}|K(x,t)|$. Since all the constants are independent of $\delta$, $n$ and $\lambda$, for simplicity of notation we denote them all by $C$.

    The rest of this paper is organized as follows: in Section 2, we state the error decomposition for Algorithm 1 and the rate of uniform convergence. In Sections 3–5, we analyze the drift error, the sample error and the hypothesis error, respectively. Finally, we complete the proofs of the main results in Section 6.

    We use the technique developed for coefficient regularization algorithms with non-i.i.d. sampling [19,21] to analyze the learning ability of Algorithm 1. We first define the function space

    $$\mathcal{H}_1=\Big\{f:\ f=\sum_{j=1}^{\infty}\alpha_j\overline{K}_{u_j}:\ \{\alpha_j\}\in\ell^1,\ \{u_j\}\subset X,\ \overline{K}_{u_j}=\frac{K_{u_j}}{\|K_{u_j}\|_{\rho_X}}\Big\}, \qquad (2.1)$$

    with the norm

    $$\|f\|_{\mathcal{H}_1}:=\inf\Big\{\sum_{j=1}^{\infty}|\alpha_j|:\ f=\sum_{j=1}^{\infty}\alpha_j\overline{K}_{u_j}\Big\}. \qquad (2.2)$$

    We define the regularizing function

    $$f_\lambda:=\arg\min_{f\in\mathcal{H}_1}\big\{\mathcal{E}(f)+\lambda\|f\|_{\mathcal{H}_1}\big\}, \qquad \lambda>0. \qquad (2.3)$$

    In order to describe the error caused by the variation of $\{\rho^{(i)}_X\}$, we introduce

    $$\mathcal{E}_n(f)=\frac{1}{n}\sum_{i=1}^{n}\int_Z (f(u)-y)^2\,d\rho^{(i)}(u,y). \qquad (2.4)$$

    Now we can give the error decomposition for Algorithm 1:

    $$\mathcal{E}(\pi_M(\widetilde{f}_k))-\mathcal{E}(f_\rho)\leq\mathcal{P}(\lambda)+\mathcal{S}(\mathbf{z},\lambda)+\mathcal{H}(\mathbf{z},\lambda)+\mathcal{D}(\lambda), \qquad (2.5)$$

    where

    $$\begin{aligned}
    \mathcal{P}(\lambda)&=\{\mathcal{E}(\pi_M(\widetilde{f}_k))-\mathcal{E}_n(\pi_M(\widetilde{f}_k))\}+\{\mathcal{E}_n(f_\lambda)-\mathcal{E}(f_\lambda)\},\\
    \mathcal{S}(\mathbf{z},\lambda)&=\{\mathcal{E}_n(\pi_M(\widetilde{f}_k))-\mathcal{E}_{\mathbf{z}}(\pi_M(\widetilde{f}_k))\}+\{\mathcal{E}_{\mathbf{z}}(f_\lambda)-\mathcal{E}_n(f_\lambda)\},\\
    \mathcal{H}(\mathbf{z},\lambda)&=\{\mathcal{E}_{\mathbf{z}}(\pi_M(\widetilde{f}_k))-\mathcal{E}_{\mathbf{z}}(f_\lambda)\},\\
    \mathcal{D}(\lambda)&=\mathcal{E}(f_\lambda)-\mathcal{E}(f_\rho)+\lambda\|f_\lambda\|_{\mathcal{H}_1}.
    \end{aligned} \qquad (2.6)$$

    The drift error $\mathcal{P}(\lambda)$ describes the deviation of the $\rho^{(i)}$ from $\rho$, and the sample error $\mathcal{S}(\mathbf{z},\lambda)$ connects the estimator $\pi_M(\widetilde{f}_k)$ with $f_\lambda$. The terms $\mathcal{H}(\mathbf{z},\lambda)$ and $\mathcal{D}(\lambda)$ are known as the hypothesis error and the approximation error, respectively.
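
    To see why (2.5) holds, note that summing the four quantities in (2.6) telescopes; this short check is added here for readability:

    $$\mathcal{P}(\lambda)+\mathcal{S}(\mathbf{z},\lambda)+\mathcal{H}(\mathbf{z},\lambda)+\mathcal{D}(\lambda)=\mathcal{E}(\pi_M(\widetilde{f}_k))-\mathcal{E}(f_\rho)+\lambda\|f_\lambda\|_{\mathcal{H}_1}\geq\mathcal{E}(\pi_M(\widetilde{f}_k))-\mathcal{E}(f_\rho),$$

    since the intermediate terms $\mathcal{E}_n(\pi_M(\widetilde{f}_k))$, $\mathcal{E}_{\mathbf{z}}(\pi_M(\widetilde{f}_k))$, $\mathcal{E}_n(f_\lambda)$, $\mathcal{E}_{\mathbf{z}}(f_\lambda)$ and $\mathcal{E}(f_\lambda)$ cancel in pairs and $\lambda\|f_\lambda\|_{\mathcal{H}_1}\geq 0$.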

    To compare with the main results in [16], we shall assume that $\mathcal{D}(\lambda)$ satisfies the same decay rate,

    $$\mathcal{D}(\lambda)\leq c_q\lambda^{q}, \qquad 0<\lambda\leq 1, \qquad (2.7)$$

    for some exponent $0<q\leq 1$ and a constant $c_q>0$.

    Next we can state the generalization error bound and give the proofs in Sections 3–6.

    Theorem 1. Assume that $\{z_i=(x_i,y_i)\}_{i=1}^{n}$ satisfies condition (1.13), the hypothesis space $\mathcal{H}_{K,\mathbf{z}}$ satisfies the capacity assumption (1.21) with $0<p<2$, the kernel $K$ satisfies a Lipschitz condition of order $(\alpha,\beta)$ with $0<k_0\leq K(u,v)\leq k_1$ for any $u,v\in X$, the target function $f_\rho$ can be approximated with exponent $0<q\leq 1$ in $\mathcal{H}_1$, and that (1.14) for $\rho_X$ and (1.22) for $\rho(y|x)$ hold. Take $k\leq T\leq n$. Then for any $0<\delta<1$, with confidence $1-\delta$, we have

    $$\mathcal{E}(\pi_M(\widetilde{f}_k))-\mathcal{E}(f_\rho)\leq Ct\Big\{\lambda^{q}+b_n^{-1}\lambda^{2q-2}+b_n^{-\frac{2}{2+p}}\Big\}, \qquad (2.8)$$

    where $t=\log\big(\frac{6}{\delta-6b_n\beta(a_n)}\big)$, with $b_n$ and $a_n$ given explicitly later.

    Theorem 2. Under the assumptions of Theorem 1, if

    $$n\geq\max\Big\{8^{\frac{1}{\zeta}},\ \Big(\frac{6\beta_0}{\delta}\Big)^{\frac{1}{(\gamma+1)(1-\zeta)-1}}\Big\}, \qquad \zeta\in\Big(0,\frac{\gamma}{\gamma+1}\Big), \qquad (2.9)$$

    then we obtain

    $$\|\pi_M(\widetilde{f}_k)-f_\rho\|_{\rho_X}^2\leq\widetilde{D}\,n^{-\theta}\log\Big(\frac{12}{\delta}\Big), \qquad (2.10)$$

    where

    $$\theta=\min\Big\{\frac{q\zeta}{2-q},\ \frac{2\zeta}{2+p}\Big\}.$$

    Let $\alpha=0$ and $\zeta=1$. Then we obtain the following learning rate for i.i.d. sampling:

    $$\|\pi_M(\widetilde{f}_k)-f_\rho\|_{\rho_X}^2\leq\widetilde{C}\Big(\frac{1}{n}\Big)^{\min\{\frac{q}{2-q},\,\frac{2}{2+p}\}}\log\Big(\frac{12}{\delta}\Big),$$

    which is the same as that in [16]. In particular, as $p\to 0$ we have $\frac{2}{2+p}\to 1$, and together with $q=1$ this gives the optimal convergence rate $O(n^{-1})$.
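
    For a concrete reading of Theorem 2, suppose for illustration that $q=1$, $p=1$ and $\gamma=3$, so that $\zeta$ may be chosen close to $\gamma/(\gamma+1)=3/4$; these values are ours and serve only to show the arithmetic. Then

    $$\theta=\min\Big\{\frac{q\zeta}{2-q},\ \frac{2\zeta}{2+p}\Big\}=\min\Big\{\zeta,\ \frac{2\zeta}{3}\Big\}=\frac{2\zeta}{3},$$

    so with $\zeta$ close to $3/4$ the bound (2.10) decays essentially like $n^{-1/2}$, while in the i.i.d. case ($\zeta=1$) the same choice of $q$ and $p$ gives the faster rate $n^{-2/3}$.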

    Proposition 3. Under the assumptions of Theorem 1, the inequality

    $$\mathcal{P}(\lambda)\leq\frac{C\lambda^{2q-2}}{n} \qquad (3.1)$$

    holds.

    Proof. By (1.1) and (2.4), we get

    $$\begin{aligned}
    &\big(\mathcal{E}(\pi_M(\widetilde{f}_k))-\mathcal{E}(f_\lambda)\big)-\big(\mathcal{E}_n(\pi_M(\widetilde{f}_k))-\mathcal{E}_n(f_\lambda)\big)\\
    &\quad\leq\frac{1}{n}\sum_{i=1}^{n}\Big|\int_Z\big[(\pi_M(\widetilde{f}_k)(u)-y)^2-(f_\lambda(u)-y)^2\big]\,d\big(\rho(u,y)-\rho^{(i)}(u,y)\big)\Big|\\
    &\quad=\frac{1}{n}\sum_{i=1}^{n}\Big|\int_X\big(\pi_M(\widetilde{f}_k)(u)-f_\lambda(u)\big)\big(\pi_M(\widetilde{f}_k)(u)+f_\lambda(u)-2f_\rho(u)\big)\,d\big(\rho_X(u)-\rho^{(i)}_X(u)\big)\Big|. \qquad (3.2)
    \end{aligned}$$

    Now (1.15) tells us that

    $$\begin{aligned}
    &\big(\mathcal{E}(\pi_M(\widetilde{f}_k))-\mathcal{E}(f_\lambda)\big)-\big(\mathcal{E}_n(\pi_M(\widetilde{f}_k))-\mathcal{E}_n(f_\lambda)\big)\\
    &\quad\leq\frac{1}{n}\sum_{i=1}^{n}C\alpha^{i}\big\|\big(\pi_M(\widetilde{f}_k)-f_\lambda\big)\big(\pi_M(\widetilde{f}_k)+f_\lambda-2f_\rho\big)\big\|_{C^s(X)}\\
    &\quad\leq\frac{C}{n}\cdot\frac{\alpha}{1-\alpha}\big(3M+\|f_\lambda\|_\infty\big)\Big\{2|\widetilde{f}_k|_{C^s(X)}+2|f_\lambda|_{C^s(X)}+2|f_\rho|_{C^s(X)}+4M+2\|f_\lambda\|_\infty\Big\}, \qquad (3.3)
    \end{aligned}$$

    where the last inequality holds since $\|fg\|_{C^s(X)}\leq\|f\|_{C(X)}\|g\|_{C^s(X)}+\|f\|_{C^s(X)}\|g\|_{C(X)}$, see [19].

    In the following, we estimate $\|f_\lambda\|_\infty$, $|f_\lambda|_{C^s(X)}$, $|\widetilde{f}_k|_{C^s(X)}$ and $|f_\rho|_{C^s(X)}$ separately. Let $f_\lambda(x)=\sum_{j=1}^{\infty}\alpha_{j,\lambda}\overline{K}_{u_j}(x)$ with $\{\alpha_{j,\lambda}\}\in\ell^1$. It follows that

    $$|f_\lambda(x)|\leq\frac{\kappa}{\|K_{u_j}\|_{\rho_X}}\sum_{j=1}^{\infty}|\alpha_{j,\lambda}|\leq\frac{\kappa}{\|K_{u_j}\|_{\rho_X}}\|f_\lambda\|_{\mathcal{H}_1}. \qquad (3.4)$$

    Furthermore,

    $$\|f_\lambda\|_\infty\leq\frac{\kappa}{\|K_{u_j}\|_{\rho_X}}\cdot\frac{\mathcal{D}(\lambda)}{\lambda}. \qquad (3.5)$$

    The Lipschitz condition (1.18) of the kernel function $K$ yields, for any $f\in\mathcal{H}_1$, that

    $$|f(x)-f(x')|\leq\frac{C_\alpha|x-x'|^{s}}{\|K_{u_j}\|_{\rho_X}}\|f\|_{\mathcal{H}_1}, \qquad x,x'\in X.$$

    Together with (1.17), this implies that

    $$|f_\lambda|_{C^s(X)}\leq\frac{C_\alpha\|f_\lambda\|_{\mathcal{H}_1}}{\|K_{u_j}\|_{\rho_X}}\leq\frac{C_\alpha}{\|K_{u_j}\|_{\rho_X}}\cdot\frac{\mathcal{D}(\lambda)}{\lambda}. \qquad (3.6)$$

    In the same way, from the definition of $\widetilde{f}_k$, we have

    $$|\widetilde{f}_k|_{C^s(X)}\leq C_\alpha\|\widetilde{f}_k\|_{\ell^1}\leq C_\alpha\|y\|_n^2\leq C_\alpha M^2. \qquad (3.7)$$

    In addition, combining (1.17) with (1.22) gives

    $$|f_\rho|_{C^s(X)}=\sup_{x,x'\in X}\frac{\big|\int_Y y\,d\rho(y|x)-\int_Y y\,d\rho(y|x')\big|}{|x-x'|^{s}}\leq\frac{\|y\|_{C^s(Y)}C_\rho|x-x'|^{s}}{|x-x'|^{s}}\leq C_\rho\big(M+(2M)^{1-s}\big). \qquad (3.8)$$

    Plugging (3.5), (3.6), (3.7) and (3.8) into (3.3), the desired estimate (3.1) follows, and the proposition is proved.

    In our analysis, we apply the method in [18,23] to deal with the original weakly dependent sequence. Let $(a_n,b_n)$ be any integer pair with $b_n=[n/(2a_n)]$. The dependent observations are split into $2b_n$ blocks, each of size $a_n$. For $1\leq k\leq 2b_n$, $Q^{a_n}_k$ denotes the marginal distribution of the block $(z_{(k-1)a_n+1},z_{(k-1)a_n+2},\ldots,z_{ka_n})$. With the constructed blocks, one can then take a new sequence $(\widehat{z}_1,\ldots,\widehat{z}_{2b_na_n})$ of independent blocks with product distribution $\prod_{k=1}^{2b_n}Q^{a_n}_k$. We further define

    $$Z_1=(z_1,\ldots,z_{a_n},z_{2a_n+1},\ldots,z_{3a_n},\ldots,z_{2(b_n-1)a_n+1},\ldots,z_{(2b_n-1)a_n}),$$
    $$Z_2=(z_{a_n+1},\ldots,z_{2a_n},z_{3a_n+1},\ldots,z_{4a_n},\ldots,z_{(2b_n-1)a_n+1},\ldots,z_{2b_na_n}),$$

    and correspondingly

    $$\widehat{Z}_1=(\widehat{z}_1,\ldots,\widehat{z}_{a_n},\widehat{z}_{2a_n+1},\ldots,\widehat{z}_{3a_n},\ldots,\widehat{z}_{2(b_n-1)a_n+1},\ldots,\widehat{z}_{(2b_n-1)a_n}),$$
    $$\widehat{Z}_2=(\widehat{z}_{a_n+1},\ldots,\widehat{z}_{2a_n},\widehat{z}_{3a_n+1},\ldots,\widehat{z}_{4a_n},\ldots,\widehat{z}_{(2b_n-1)a_n+1},\ldots,\widehat{z}_{2b_na_n}),$$

    where the $\widehat{z}_i$ denote the corresponding independent-block copies.
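
    The index bookkeeping behind this construction is simple; the sketch below splits the first $2b_na_n$ observations into the alternating odd/even blocks described above (function and variable names are ours):

        import numpy as np

        def split_into_blocks(z, a_n):
            """Split a sample sequence z into the alternating block groups Z1 and Z2."""
            n = len(z)
            b_n = n // (2 * a_n)                       # b_n = [n / (2 a_n)]
            assert b_n >= 1, "need n >= 2 * a_n"
            blocks = [z[k * a_n:(k + 1) * a_n] for k in range(2 * b_n)]
            Z1 = np.concatenate(blocks[0::2])          # blocks 1, 3, ..., 2 b_n - 1
            Z2 = np.concatenate(blocks[1::2])          # blocks 2, 4, ..., 2 b_n
            return Z1, Z2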

    The sample error $\mathcal{S}(\mathbf{z},\lambda)$ can be rewritten as

    $$\begin{aligned}
    \mathcal{S}(\mathbf{z},\lambda)&=\big\{\mathcal{E}_{\mathbf{z}}(f_\lambda)-\mathcal{E}_{\mathbf{z}}(f_\rho)\big\}-\big\{\mathcal{E}_n(f_\lambda)-\mathcal{E}_n(f_\rho)\big\}+\big\{\mathcal{E}_n(\pi_M(\widetilde{f}_k))-\mathcal{E}_n(f_\rho)\big\}-\big\{\mathcal{E}_{\mathbf{z}}(\pi_M(\widetilde{f}_k))-\mathcal{E}_{\mathbf{z}}(f_\rho)\big\}\\
    &:=\mathcal{S}_1(\mathbf{z},\lambda)+\mathcal{S}_2(\mathbf{z},k).
    \end{aligned}$$

    We analyze the term $\mathcal{S}_1(\mathbf{z},\lambda)$ by using the following inequality from [18].

    Lemma 4.1. If $g$ is a measurable function on $Z$ satisfying $\big|g(z)-\int_Z g\,d\rho^{(i)}\big|\leq M$ for each $i$, then for any $\delta>0$, with confidence $1-\delta$, there holds

    $$\frac{1}{n}\sum_{i=1}^{n}\Big(g(z_i)-\int_Z g\,d\rho^{(i)}\Big)\leq b_n^{-1}\Bigg\{\frac{8}{3}M\log\Big(\frac{2}{\delta-2b_n\beta(a_n)}\Big)+\sqrt{\frac{2}{a_n}\sum_{i=1}^{2a_nb_n}\int_Z g^2\,d\rho^{(i)}\,\log\Big(\frac{2}{\delta-2b_n\beta(a_n)}\Big)}+M\Bigg\}.$$

    Proposition 4. Under the assumptions of Theorem 1, for any $0<\delta<1$, with confidence $1-\delta/3$,

    $$\mathcal{S}_1(\mathbf{z},\lambda)\leq C\Big\{b_n^{-1}\Big(1+\frac{\mathcal{D}(\lambda)^2}{\lambda^2}\Big)+\mathcal{D}(\lambda)\Big\}t. \qquad (4.1)$$

    Proof. Let $g(z)=(y-f_\lambda(u))^2-(y-f_\rho(u))^2$, $z=(u,y)\in Z$. Thus

    $$\Big|g(z)-\int_Z g\,d\rho^{(i)}\Big|\leq 2\Big(3M+\frac{\kappa}{\|K_{u_j}\|_{\rho_X}}\cdot\frac{\mathcal{D}(\lambda)}{\lambda}\Big)^2:=2B_\lambda$$

    and

    $$\int_Z g^2\,d\rho^{(i)}\leq B_\lambda\int_Z g\,d\rho^{(i)}.$$

    Using Lemma 4.1, with confidence $1-\delta/3$, we have

    $$\frac{1}{n}\sum_{i=1}^{n}\Big(g(z_i)-\int_Z g\,d\rho^{(i)}\Big)\leq\Big(\frac{19t}{3}+2\Big)B_\lambda b_n^{-1}+\frac{1}{2a_nb_n}\sum_{i=1}^{2a_nb_n}\int_Z g\,d\rho^{(i)}\leq\Big(\frac{19t}{3}+2\Big)B_\lambda b_n^{-1}+2\big(\mathcal{E}_n(f_\lambda)-\mathcal{E}_n(f_\rho)\big). \qquad (4.2)$$

    Observe that

    $$\mathcal{E}_n(f_\lambda)-\mathcal{E}_n(f_\rho)\leq\big(\mathcal{E}_n(f_\lambda)-\mathcal{E}(f_\lambda)+\mathcal{E}(f_\rho)-\mathcal{E}_n(f_\rho)\big)+\mathcal{D}(\lambda). \qquad (4.3)$$

    By (1.15), we have

    $$\begin{aligned}
    \mathcal{E}_n(f_\lambda)-\mathcal{E}(f_\lambda)+\mathcal{E}(f_\rho)-\mathcal{E}_n(f_\rho)
    &\leq\frac{1}{n}\sum_{i=1}^{n}\Big|\int_X\big(f_\lambda(u)-f_\rho(u)\big)^2\,d\big(\rho^{(i)}_X(u)-\rho_X(u)\big)\Big|\\
    &\leq\frac{1}{n}\sum_{i=1}^{n}C\alpha^{i}\big\|(f_\lambda-f_\rho)^2\big\|_{C^s(X)}\leq\frac{C\alpha}{n(1-\alpha)}\Big(1+\frac{\mathcal{D}(\lambda)}{\lambda}\Big)^2, \qquad (4.4)
    \end{aligned}$$

    where the last inequality follows from (3.6) and (3.8).

    Combining (4.2), (4.3) and (4.4), we get the desired error bound (4.1) for $\mathcal{S}_1(\mathbf{z},\lambda)$. Proposition 4 is proved.

    We continue to analyze $\mathcal{S}_2(\mathbf{z},k)$ by applying the following probability inequality for β-mixing sequences from [18].

    Lemma 4.2. Let $\mathcal{G}$ be a class of measurable functions on $Z$. Moreover, assume that $\big|g-\int_Z g\,d\rho^{(i)}\big|\leq M$ for all $g\in\mathcal{G}$ and each $i$. Then

    $$\mathrm{Prob}\Big\{\sup_{g\in\mathcal{G}}\frac{1}{n}\sum_{i=1}^{n}\Big(g(z_i)-\int_Z g(z)\,d\rho^{(i)}\Big)>\epsilon+\frac{M}{b_n}\Big\}\leq\mathcal{I}_1+\mathcal{I}_2+2b_n\beta(a_n),$$

    where

    $$\mathcal{I}_1=\mathrm{Prob}\Big\{\sup_{g\in\mathcal{G}}\frac{1}{b_n}\sum_{j=1}^{b_n}\Big(\frac{2b_n}{n}\sum_{i=2(j-1)a_n+1}^{(2j-1)a_n}\Big(g(\widehat{z}_i)-\int_Z g(z)\,d\rho^{(i)}\Big)\Big)\geq\epsilon\Big\},$$
    $$\mathcal{I}_2=\mathrm{Prob}\Big\{\sup_{g\in\mathcal{G}}\frac{1}{b_n}\sum_{j=1}^{b_n}\Big(\frac{2b_n}{n}\sum_{i=(2j-1)a_n+1}^{2ja_n}\Big(g(\widehat{z}_i)-\int_Z g(z)\,d\rho^{(i)}\Big)\Big)\geq\epsilon\Big\}.$$

    To obtain upper bounds for the terms $\mathcal{I}_1$ and $\mathcal{I}_2$, we need to invoke the following inequality for non-identical sequences of probability distributions.

    Proposition 5. Assume $\{X_i\}_{i=1}^{n}$ is a random sequence in the measurable space $(\mathcal{X}^n,\prod_{i=1}^{n}Q_i)$. Let $\mathcal{F}$ be a set of measurable functions on $\mathcal{X}$ and $B>0$ a constant such that $\|f\|_\infty\leq B$ for each $f\in\mathcal{F}$. Suppose there exist a nonnegative functional $w$ on $\mathcal{F}$ and some positive constants $(\Delta_i)_{i=1}^{n}$ such that

    $$\mathbb{E}f^2(X_i)\leq w(f)+\Delta_i, \qquad \forall f\in\mathcal{F}. \qquad (4.5)$$

    Also assume that, for some $a>0$ and $p\in(0,2)$,

    $$\log\mathcal{N}_2(\mathcal{F},\varepsilon)\leq a\varepsilon^{-p}, \qquad \forall\varepsilon>0.$$

    Then for any $x>0$ and any $D>0$, with probability at least $1-e^{-x}$, there holds

    $$\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}f(X_i)-\frac{1}{n}\sum_{i=1}^{n}f(X_i)\leq D^{-1}w(f)+c_p\widetilde{\eta}+\frac{(D+18B+2)x}{n}, \qquad \forall f\in\mathcal{F},$$

    where $c_p$ is a constant depending only on $p$ and

    $$\widetilde{\eta}:=\max\Big\{D^{\frac{2-p}{2+p}},\,B^{\frac{2-p}{2+p}+1}\Big\}\Big(\frac{a}{n}\Big)^{\frac{2}{p+2}}+\frac{1}{n}\sum_{i=1}^{n}\Delta_i.$$

    The above inequalities imply the following estimate for $\mathcal{S}_2(\mathbf{z},k)$.

    Proposition 6. Under the assumptions of Theorem 1, for any $0<\delta<1$, with confidence $1-\delta/3$,

    $$\mathcal{S}_2(\mathbf{z},k)\leq\frac{1}{2}\big\{\mathcal{E}(\pi_M(\widetilde{f}_k))-\mathcal{E}(f_\rho)\big\}+C_{p,\Phi,\rho}\,\eta_R+\frac{(192M^2+2)t}{b_n}, \qquad (4.6)$$

    where

    $$\eta_R:=\Big(\frac{R^{p}}{b_n}\Big)^{\frac{2}{2+p}}+\frac{\alpha}{1-\alpha}\cdot\frac{1}{n}\max\{R,1\}. \qquad (4.7)$$

    Proof. Define the function set $\widetilde{\mathcal{G}}$ on $Z^{a_n}$ by

    $$\widetilde{\mathcal{G}}=\Big\{G(t_1,\ldots,t_{a_n})=\frac{2b_n}{n}\sum_{k=1}^{a_n}g(t_k):\ g\in\mathcal{G}\Big\},\qquad\mathcal{G}=\Big\{g(z)=g(u,y)=\big(y-\pi_M(f)(u)\big)^2-\big(y-f_\rho(u)\big)^2:\ f\in B_R\Big\},$$

    and

    $$w(G):=\frac{4a_n^2b_n^2}{n^2}\int_Z g^2\,d\rho,$$

    which dominates $\int_{Z^{a_n}}G^2(t_1,\ldots,t_{a_n})\,d\rho(t_1)\,d\rho(t_2)\cdots d\rho(t_{a_n})$ by the Cauchy–Schwarz inequality.

    It follows that

    $$\mathbb{E}G^2\big(z_{(k-1)a_n+1},z_{(k-1)a_n+2},\ldots,z_{ka_n}\big)\leq\frac{4b_n^2a_n}{n^2}\sum_{i=(k-1)a_n+1}^{ka_n}\int_Z g^2\,d\rho^{(i)}\leq w(G)+\frac{4b_n^2a_n}{n^2}\sum_{i=(k-1)a_n+1}^{ka_n}\Big|\int_Z g^2\,d\big(\rho^{(i)}-\rho\big)\Big|. \qquad (4.8)$$

    We see from (1.15) and (1.22) that

    $$\Big|\int_Z g^2\,d\big(\rho^{(i)}-\rho\big)\Big|\leq C\alpha^{i}\Big\|\big(f_\rho(u)-\pi_M(f)(u)\big)^2\int_Y\big(2y-\pi_M(f)(u)-f_\rho(u)\big)^2\,d\rho(y|u)\Big\|_{C^s(X)}\leq C\alpha^{i}(1+R). \qquad (4.9)$$

    By (4.8) and (4.9), we know that $\Delta_k$ in (4.5) satisfies

    $$\Delta_k\leq\frac{4b_n^2a_n}{n^2}C_{\rho,\Phi}\max\{R,1\}\sum_{i=1}^{a_n}\alpha^{(k-1)a_n+i}.$$

    Let $\mathbf{w}=\{t^j=(t^j_1,\ldots,t^j_{a_n})\}_{j=1}^{d}\in(Z^{a_n})^{d}$, $d\in\mathbb{N}$. We know that, for any functions $G_1=\frac{2b_n}{n}\sum_{k=1}^{a_n}g_1(t_k)$ and $G_2=\frac{2b_n}{n}\sum_{k=1}^{a_n}g_2(t_k)$ in $\widetilde{\mathcal{G}}$,

    $$d_{2,\mathbf{w}}^2(G_1,G_2)=\frac{1}{d}\sum_{j=1}^{d}\big(G_1(t^j)-G_2(t^j)\big)^2=\frac{1}{d}\sum_{j=1}^{d}\Big(\frac{2b_n}{n}\sum_{k=1}^{a_n}\big(g_1(t^j_k)-g_2(t^j_k)\big)\Big)^2\leq\frac{1}{da_n}\sum_{j=1}^{d}\sum_{k=1}^{a_n}\big(g_1(t^j_k)-g_2(t^j_k)\big)^2=d_{2,\mathbf{w}}^2(g_1,g_2),$$

    so

    $$\mathcal{N}_2(\widetilde{\mathcal{G}},\epsilon)\leq\mathcal{N}_2(\mathcal{G},\epsilon). \qquad (4.10)$$

    Moreover,

    $$\mathcal{N}_2(\mathcal{G},\epsilon)\leq\mathcal{N}_2\Big(B_R,\frac{\epsilon}{4M}\Big).$$

    This together with (4.10) yields

    $$\log\mathcal{N}_2(\widetilde{\mathcal{G}},\epsilon)\leq\log\mathcal{N}_2(\mathcal{G},\epsilon)\leq c_p(4M)^{p}R^{p}\epsilon^{-p}.$$

    Note that $\|G\|_\infty\leq\|g\|_\infty\leq 8M^2$. It is also easy to see that

    $$\mathbb{E}G\big(z_{(k-1)a_n+1},z_{(k-1)a_n+2},\ldots,z_{ka_n}\big)=\frac{2b_n}{n}\sum_{i=(k-1)a_n+1}^{ka_n}\int_Z g\,d\rho^{(i)},$$

    and

    $$w(G)=\frac{4a_n^2b_n^2}{n^2}\int_Z g^2\,d\rho\leq\int_Z g^2\,d\rho\leq 8M^2\int_Z g\,d\rho.$$

    Now apply Proposition 5 to $\widetilde{\mathcal{G}}$ in $\big((Z^{a_n})^{b_n},\prod_{j=1}^{b_n}Q^{a_n}_{2j-1}\big)$. Let $B=8M^2$ and $a=c_p(4M)^{p}R^{p}$. Then for any $D>0$ and $g\in\mathcal{G}$, with confidence at least $1-e^{-t}$, we have

    $$\frac{1}{b_n}\sum_{j=1}^{b_n}\Big(\frac{2b_n}{n}\sum_{i=2(j-1)a_n+1}^{(2j-1)a_n}\Big(\int_Z g(z)\,d\rho^{(i)}-g(\widehat{z}_i)\Big)\Big)\leq\frac{8M^2}{D}\int_Z g\,d\rho+c_p\eta_1+\frac{(D+144M^2+2)t}{b_n}.$$

    Here

    $$\eta_1=\max\Big\{D^{\frac{2-p}{2+p}},\,(8M^2)^{\frac{2-p}{2+p}+1}\Big\}\Big\{\frac{c_p(4M)^{p}R^{p}}{b_n}\Big\}^{\frac{2}{2+p}}+\frac{4b_na_n}{n^2}C_{\rho,\Phi}\max\{R,1\}\sum_{j=1}^{b_n}\sum_{i=1}^{a_n}\alpha^{(2j-2)a_n+i}.$$

    It follows by taking $\epsilon_1=c_p\eta_1+\frac{(D+144M^2+2)t}{b_n}$ that

    $$\mathrm{Prob}\Big\{\sup_{g\in\mathcal{G}}\frac{1}{b_n}\sum_{j=1}^{b_n}\Big(\frac{2b_n}{n}\sum_{i=2(j-1)a_n+1}^{(2j-1)a_n}\Big(\int_Z g(z)\,d\rho^{(i)}-g(\widehat{z}_i)\Big)\Big)-\frac{8M^2}{D}\int_Z g\,d\rho\geq\epsilon_1\Big\}\leq e^{-t}.$$

    Applying Proposition 5 to $\widetilde{\mathcal{G}}$ in $\big((Z^{a_n})^{b_n},\prod_{j=1}^{b_n}Q^{a_n}_{2j}\big)$ once again, we have

    $$\mathrm{Prob}\Big\{\sup_{g\in\mathcal{G}}\frac{1}{b_n}\sum_{j=1}^{b_n}\Big(\frac{2b_n}{n}\sum_{i=(2j-1)a_n+1}^{2ja_n}\Big(\int_Z g(z)\,d\rho^{(i)}-g(\widehat{z}_i)\Big)\Big)-\frac{8M^2}{D}\int_Z g\,d\rho\geq\epsilon_2\Big\}\leq e^{-t}.$$

    Here $\epsilon_2=c_p\eta_2+\frac{(D+144M^2+2)t}{b_n}$ with

    $$\eta_2=\max\Big\{D^{\frac{2-p}{2+p}},\,(8M^2)^{\frac{2-p}{2+p}+1}\Big\}\Big\{\frac{c_p(4M)^{p}R^{p}}{b_n}\Big\}^{\frac{2}{2+p}}+\frac{4b_na_n}{n^2}C_{\rho,\Phi}\max\{R,1\}\sum_{j=1}^{b_n}\sum_{i=1}^{a_n}\alpha^{(2j-1)a_n+i}.$$

    Moreover, we obviously have

    $$\frac{4b_na_n}{n^2}\sum_{j=1}^{b_n}\sum_{i=1}^{a_n}\alpha^{(2j-2)a_n+i}+\frac{4b_na_n}{n^2}\sum_{j=1}^{b_n}\sum_{i=1}^{a_n}\alpha^{(2j-1)a_n+i}\leq\frac{2}{n}\cdot\frac{\alpha}{1-\alpha},$$

    and

    $$\Big|g(z)-\int_Z g(z)\,d\rho^{(i)}\Big|\leq 16M^2.$$

    We know from Lemma 4.2, by taking $\epsilon=c_p\widetilde{\eta}+\frac{(D+144M^2+2)t}{b_n}$ with

    $$\widetilde{\eta}=\max\Big\{D^{\frac{2-p}{2+p}},\,(8M^2)^{\frac{2-p}{2+p}+1}\Big\}\Big\{\frac{c_p(4M)^{p}R^{p}}{b_n}\Big\}^{\frac{2}{2+p}}+\frac{2}{n}C_{\rho,\Phi}\max\{R,1\}\frac{\alpha}{1-\alpha},$$

    then

    $$\mathrm{Prob}\Big\{\sup_{g\in\mathcal{G}}\frac{1}{n}\sum_{i=1}^{n}\Big(\int_Z g(z)\,d\rho^{(i)}-g(z_i)\Big)-\frac{16M^2}{D}\int_Z g\,d\rho>\epsilon+\frac{16M^2}{b_n}\Big\}\leq 2e^{-t}+2b_n\beta(a_n).$$

    Then we obtain (4.6) by taking $2e^{-t}+2b_n\beta(a_n):=\frac{\delta}{3}$ and $D=32M^2$.

    Different from the widely used regularization methods with data-dependent hypothesis spaces [8,10,21,22], our estimate for the hypothesis error $\mathcal{E}_{\mathbf{z}}(\pi_M(\widetilde{f}_k))-\mathcal{E}_{\mathbf{z}}(f_\lambda)$ is based on the following lemma; see Theorem 3.3 in [7].

    Lemma 5.1. If $f\in H$ and $h\in\mathcal{H}_1^{n}$, then the output $(f_m)_{m\geq 0}$ of the RPGA satisfies the inequality

    $$\|f-f_m\|^2-\|f-h\|^2\leq\frac{4}{m+1}\|h\|_{\mathcal{H}_1^{n}}^2, \qquad m=0,1,2,\ldots, \qquad (5.1)$$

    where

    $$\mathcal{H}_1^{n}=\Big\{h=\sum_{i}\alpha_i^{n}\overline{K}^{n}_{u_i}:\ \alpha_i^{n}=\alpha_i\|\overline{K}_{u_i}\|_n,\ \overline{K}^{n}_{u_i}=\frac{\overline{K}_{u_i}}{\|\overline{K}_{u_i}\|_n},\ \sum_{i}\alpha_i\overline{K}_{u_i}\in\mathcal{H}_1\Big\} \qquad (5.2)$$

    with

    $$\|f\|_{\mathcal{H}_1^{n}}:=\inf\Big\{\sum_{i}|\alpha_i^{n}|:\ f=\sum_{i}\alpha_i\overline{K}_{u_i}\Big\}. \qquad (5.3)$$

    Proposition 7. Under the assumptions of Theorem 1, for $k\leq T$ and any $0<\delta<1$, with confidence at least $1-\delta/3$, there holds

    $$\mathcal{H}(\mathbf{z},\lambda)\leq\frac{4\min\Big\{\Big\{\big(\frac{19t}{3}+2\big)Mb_n^{-1}+M+\frac{\alpha}{n(1-\alpha)}\big(\frac{k_1^2}{k_0^2}+\frac{2C_\alpha k_1}{k_0^2}\big)+1\Big\}^2,\ \frac{k_1^2}{k_0^2}\Big\}\mathcal{D}^2(\lambda)}{(k+1)\lambda^2}. \qquad (5.4)$$

    Proof. By Lemma 5.1, we have

    $$\mathcal{H}(\mathbf{z},\lambda)=\big\{\mathcal{E}_{\mathbf{z}}(\pi_M(\widetilde{f}_k))-\mathcal{E}_{\mathbf{z}}(f_\lambda)\big\}\leq\frac{4\|f_\lambda\|_{\mathcal{H}_1^{n}}^2}{k+1}. \qquad (5.5)$$

    From the definitions of $\|f\|_{\mathcal{H}_1^{n}}$ and $\|f\|_{\mathcal{H}_1}$, we have

    $$\|f_\lambda\|_{\mathcal{H}_1^{n}}^2\leq\frac{k_1^2}{k_0^2}\|f_\lambda\|_{\mathcal{H}_1}^2. \qquad (5.6)$$

    Meanwhile, we define the function $g(x)=|\overline{K}_{u_i}(x)|^2$ for any $i$. Notice that

    $$\Big|g(x)-\int_X g\,d\rho^{(j)}_X\Big|\leq\frac{2k_1^2}{k_0^2}:=2M,$$

    and

    $$\int_X g^2\,d\rho^{(j)}_X\leq M\int_X g\,d\rho^{(j)}_X.$$

    Using Lemma 4.1, with confidence $1-\delta/3$, we have

    $$\frac{1}{n}\sum_{j=1}^{n}\Big(g(x_j)-\int_X g\,d\rho^{(j)}_X\Big)\leq\Big(\frac{19t}{3}+2\Big)Mb_n^{-1}+M. \qquad (5.7)$$

    By (1.17) and (1.15), we get

    $$\frac{1}{n}\sum_{j=1}^{n}\Big(\int_X g\,d\rho^{(j)}_X-\int_X g\,d\rho_X\Big)\leq\frac{1}{n}\sum_{j=1}^{n}C\alpha^{j}\big(\|g\|_\infty+|g|_{C^s(X)}\big)\leq\frac{\alpha}{n(1-\alpha)}\Big(\frac{k_1^2}{k_0^2}+\frac{2C_\alpha k_1}{k_0^2}\Big). \qquad (5.8)$$

    This in connection with (5.7) tells us that

    $$\frac{1}{n}\sum_{j=1}^{n}|\overline{K}(u_i,x_j)|^2-\mathbb{E}\overline{K}^2_{u_i}=\frac{1}{n}\sum_{j=1}^{n}\Big(g(x_j)-\int_X g\,d\rho_X\Big)\leq\Big(\frac{19t}{3}+2\Big)Mb_n^{-1}+M+\frac{\alpha}{n(1-\alpha)}\Big(\frac{k_1^2}{k_0^2}+\frac{2C_\alpha k_1}{k_0^2}\Big). \qquad (5.9)$$

    It is easy to see that $\|\overline{K}_{u_j}\|_{\rho_X}^2=\mathbb{E}\overline{K}^2_{u_j}=1$. Now (5.9) implies that

    $$\|\overline{K}_{u_i}\|_n=\sqrt{\frac{1}{n}\sum_{j=1}^{n}|\overline{K}(u_i,x_j)|^2}\leq\max\Big\{1,\ \frac{1}{n}\sum_{j=1}^{n}|\overline{K}(u_i,x_j)|^2\Big\}\leq\Big(\frac{19t}{3}+2\Big)Mb_n^{-1}+M+\frac{\alpha}{n(1-\alpha)}\Big(\frac{k_1^2}{k_0^2}+\frac{2C_\alpha k_1}{k_0^2}\Big)+1. \qquad (5.10)$$

    Therefore,

    $$\|f_\lambda\|_{\mathcal{H}_1^{n}}^2\leq\Big\{\Big(\frac{19t}{3}+2\Big)Mb_n^{-1}+M+\frac{\alpha}{n(1-\alpha)}\Big(\frac{k_1^2}{k_0^2}+\frac{2C_\alpha k_1}{k_0^2}\Big)+1\Big\}^2\|f_\lambda\|_{\mathcal{H}_1}^2. \qquad (5.11)$$

    Combining (5.5), (5.6), (5.11), we obtain (5.4).

    Proof of Theorem 1. Combining the bounds (2.7), (3.1), (4.1), (4.6) and (5.4), with confidence at least $1-\delta$,

    $$\begin{aligned}
    \mathcal{E}(\pi_M(\widetilde{f}_k))-\mathcal{E}(f_\rho)
    &\leq c_q\lambda^{q}+\frac{4\min\Big\{\Big\{\big(\frac{19t}{3}+2\big)Mb_n^{-1}+M+\frac{\alpha}{1-\alpha}\big(\frac{k_1^2}{k_0^2}+\frac{2C_\alpha k_1}{k_0^2}\big)+1\Big\}^2,\ \frac{k_1^2}{k_0^2}\Big\}c_q^2\lambda^{2q}}{(k+1)\lambda^2}+\frac{C\lambda^{2q-2}}{n}\\
    &\quad+C\Big\{b_n^{-1}\Big(1+\frac{c_q^2\lambda^{2q}}{\lambda^2}\Big)+c_q\lambda^{q}\Big\}t+\frac{1}{2}\big\{\mathcal{E}(\pi_M(\widetilde{f}_k))-\mathcal{E}(f_\rho)\big\}+C_{p,\Phi,\rho}\,\eta_R+\frac{(192M^2+2)t}{b_n}. \qquad (6.1)
    \end{aligned}$$

    Note that $k\leq T\leq n$. By taking $R=M^2$, we then get

    $$\mathcal{E}(\pi_M(\widetilde{f}_k))-\mathcal{E}(f_\rho)\leq Ct\Big\{(k+1)^{-1}\lambda^{2q-2}+n^{-1}\lambda^{2q-2}+b_n^{-1}\lambda^{2q-2}+\lambda^{q}+b_n^{-\frac{2}{2+p}}+n^{-1}+b_n^{-1}\Big\}\leq Ct\Big\{\lambda^{q}+b_n^{-1}\lambda^{2q-2}+b_n^{-\frac{2}{2+p}}\Big\}. \qquad (6.2)$$

    This finishes the proof of Theorem 1.

    Proof of Theorem 2. Under the conditions of Theorem 1, let $n^{1-\zeta}\leq a_n<n^{1-\zeta}+1$ with $\zeta\in[0,1]$ and $n\geq 8^{\frac{1}{\zeta}}$. Then

    $$\frac{1}{b_n}\leq\frac{1}{\frac{n}{2a_n}-1}\leq\frac{2(n^{1-\zeta}+1)}{n-2n^{1-\zeta}}\leq\frac{4n^{1-\zeta}}{n-2n^{1-\zeta}}=\frac{4n^{-\zeta}}{1-2n^{-\zeta}}\leq 8n^{-\zeta}. \qquad (6.3)$$

    Substituting (6.3) into (6.2), we obtain

    $$\mathcal{E}(\pi_M(\widetilde{f}_k))-\mathcal{E}(f_\rho)\leq Ct\Big\{\lambda^{q}+n^{-\zeta}\lambda^{2q-2}+n^{-\frac{2\zeta}{2+p}}\Big\}. \qquad (6.4)$$

    By setting $\lambda=n^{-\theta'}$, we know that

    $$\mathcal{E}(\pi_M(\widetilde{f}_k))-\mathcal{E}(f_\rho)\leq D_2\,t\,n^{-\theta}, \qquad (6.5)$$

    where

    $$\theta=\min\Big\{q\theta',\ \zeta-(2-2q)\theta',\ \frac{2\zeta}{2+p}\Big\}.$$

    To balance the errors in (2.5), we take $\theta'=\frac{\zeta}{2-q}$. Then

    $$\theta=\min\Big\{\frac{q\zeta}{2-q},\ \frac{2\zeta}{2+p}\Big\}.$$

    Finally, we choose

    $$n\geq\Big(\frac{6\beta_0}{\delta}\Big)^{\frac{1}{(\gamma+1)(1-\zeta)-1}}, \qquad \zeta\in\Big(0,\frac{\gamma}{\gamma+1}\Big),$$

    and it follows from $\beta(a_n)\leq\beta_0(a_n)^{-\gamma}$ and $a_n\geq n^{1-\zeta}$ that

    $$\frac{12b_n\beta(a_n)}{\delta}\leq 1,$$

    thus

    $$t=\log\frac{6}{\delta-6b_n\beta(a_n)}\leq\log\frac{12}{\delta}.$$

    This finishes the proof of Theorem 2.

    This research is supported by the National Science Foundation for Young Scientists of China (Grant No. 12001328), Doctoral Research Fund of Shandong Jianzhu University (No. XNBS1942), the Development Plan of Youth Innovation Team of University in Shandong Province (No. 2021KJ067) and Shandong Provincial Natural Science Foundation of China (Grant No. ZR2022MF223). All authors contributed substantially to this paper, participated in drafting and checking the manuscript. All authors read and approved the final manuscript.

    The authors declare that they have no competing interests regarding the publication of this paper.



    [1] J. Fang, S. B. Lin, Z. B. Xu, Learning and approximation capabilities of orthogonal super greedy algorithm, Knowl-Based Syst., 95 (2016), 86–98. https://doi.org/10.1016/j.knosys.2015.12.011 doi: 10.1016/j.knosys.2015.12.011
    [2] H. Chen, L. Q. Li, Z. B. Pan, Learning rates of multi-kernel regression by orthogonal greedy algorithm, J. Stat. Plan. Infer., 143 (2013), 276–282. https://doi.org/10.1016/j.jspi.2012.08.002 doi: 10.1016/j.jspi.2012.08.002
    [3] S. B. Lin, Y. H. Rong, X. P. Sun, Z. B. Xu, Learning capability of relaxed greedy algorithms, IEEE Trans. Neur. Net. Lear., 24 (2013), 1598–1608. https://doi.org/10.1109/TNNLS.2013.2265397 doi: 10.1109/TNNLS.2013.2265397
    [4] L. Xu, S. B. Lin, Z. B. Xu, Learning capability of the truncated greedy algorithm, Sci. China Inform. Sci., 59 (2016), 052103. https://doi.org/10.1007/s11432-016-5536-6
    [5] A. R. Barron, A. Cohen, W. Dahmen, R. A. DeVore, Approximation and learning by greedy algorithms, Ann. Statist., 36 (2008), 64–94. https://doi.org/10.1214/009053607000000631 doi: 10.1214/009053607000000631
    [6] L. Xu, S. B. Lin, J. S. Zeng, X. Liu, Z. B. Xu, Greedy criterion in orthogonal greedy learning, IEEE Trans. Cybernetics, 48 (2018), 955–966. https://doi.org/10.1109/TCYB.2017.2669259 doi: 10.1109/TCYB.2017.2669259
    [7] G. Petrova, Rescaled pure greedy algorithm for Hilbert and Banach spaces, Appl. Comput. Harmon. Anal., 41 (2016), 852–866. https://doi.org/10.1016/j.acha.2015.10.008 doi: 10.1016/j.acha.2015.10.008
    [8] S. G. Lv, D. M. Shi, Q. W. Xiao, M. S. Zhang, Sharp learning rates of coefficient-based lq-regularized regression with indefinite kernels, Sci. China Math., 56 (2013), 1557–1574. https://doi.org/10.1007/s11425-013-4688-8 doi: 10.1007/s11425-013-4688-8
    [9] Y. L. Feng, S. G. Lv, Unified approach to coefficient-based regularized regression, Comput. Math. Appl., 62 (2011), 506–515. https://doi.org/10.1016/j.camwa.2011.05.034 doi: 10.1016/j.camwa.2011.05.034
    [10] W. L. Nie, C. Wang, Constructive analysis for coefficient regularization regression algorithms, J. Math. Anal. Appl., 431 (2015), 1153–1171. https://doi.org/10.1016/j.jmaa.2015.06.006 doi: 10.1016/j.jmaa.2015.06.006
    [11] H. W. Sun, Q. Wu, Least square regression with indefinite kernels and coefficient regularization, Appl. Comput. Harmon., 30 (2011), 96–109. https://doi.org/10.1016/j.acha.2010.04.001 doi: 10.1016/j.acha.2010.04.001
    [12] B. Schölkopf, A. Smola, Learning with Kernels, MIT Press, Cambridge, 2002.
    [13] C. J. Liu, Gabor-based kernel pca with fractional power polynomial models for face recognition, IEEE Trans. Pattern Anal. Mach. Intell., 26 (2004), 572–581. https://doi.org/10.1109/TPAMI.2004.1273927 doi: 10.1109/TPAMI.2004.1273927
    [14] R. Opfer, Multiscale kernels, Adv. Comput. Math., 25 (2006), 357–380. https://doi.org/10.1007/s10444-004-7622-3
    [15] A. R. Barron, Universal approximation bounds for superposition of a sigmoidal function, IEEE Trans. Inform. Theory, 39 (1993), 930–945. https://doi.org/10.1109/18.256500 doi: 10.1109/18.256500
    [16] H. Chen, Y. C. Zhou, Y. Y. Tang, Convergence rate of the semi-supervised greedy algorithm, Neural Networks, 44 (2013), 44–50. https://doi.org/10.1016/j.neunet.2013.03.001 doi: 10.1016/j.neunet.2013.03.001
    [17] S. Smale, D. X. Zhou, Online learning with markov sampling, Anal. Appl., 7 (2009), 87–113. https://doi.org/10.1142/S0219530509001293 doi: 10.1142/S0219530509001293
    [18] Z. C. Guo, L. Shi, Classification with non-i.i.d. sampling, Math. Comput. Model., 54 (2011), 1347–1364. https://doi.org/10.1016/j.mcm.2011.03.042 doi: 10.1016/j.mcm.2011.03.042
    [19] Z. W. Pan, Q. W. Xiao, Least-square regularized regression with non-iid sampling, J. Stat. Plan. Infer., 139 (2009), 3579–3587. https://doi.org/10.1016/j.jspi.2009.04.007 doi: 10.1016/j.jspi.2009.04.007
    [20] R. C. Bradley, Basic properties of strong mixing conditions, Progr. Probab. Statist., 2 (1986), 165–192. https://doi.org/10.1007/978-1-4615-8162-8_8 doi: 10.1007/978-1-4615-8162-8_8
    [21] Q. Guo, P. X. Ye, B. L. Cai, Convergence rate for lq-coefficient regularized regression with non-i.i.d. sampling, IEEE Access, 6 (2018), 18804–18813. https://doi.org/10.1109/ACCESS.2018.2817215 doi: 10.1109/ACCESS.2018.2817215
    [22] L. Shi, Y. L. Feng, D. X. Zhou, Concentration estimates for learning with l1-regularizer and data dependent hypothesis spaces, Appl. Comput. Harmon. Anal., 31 (2011), 286–302. https://doi.org/10.1016/j.acha.2011.01.001 doi: 10.1016/j.acha.2011.01.001
    [23] B. Yu, Rates of convergence for empirical processes of stationary mixing sequences, Ann. Probab., 22 (1994), 94–116. https://www.jstor.org/stable/2244496
  • © 2023 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)