Research article

Convergence of online learning algorithm with a parameterized loss

  • Received: 19 July 2022 Revised: 27 August 2022 Accepted: 06 September 2022 Published: 13 September 2022
  • MSC : 41A25, 68Q32, 68T40, 90C25

  • The learning performance of machine learning algorithms is one of the central topics of machine learning theory, and the choice of loss function is one of the key factors affecting this performance. In this paper, we introduce a parameterized loss function into the online learning algorithm and investigate its performance. By applying convex analysis techniques, the convergence of the learning sequence is proved and the convergence rate is provided in the expectation sense. The analysis shows that the convergence rate can be greatly improved by adjusting the parameter in the loss function.

    Citation: Shuhua Wang. Convergence of online learning algorithm with a parameterized loss[J]. AIMS Mathematics, 2022, 7(11): 20066-20084. doi: 10.3934/math.20221098




    Let $X \subset \mathbb{R}^d$ be the input space and $Y = [-M, M]$ be the output space for some $M > 0$. The samples $\mathbf{z} = \{(x_t, y_t)\}_{t=1}^T$ are drawn i.i.d. (independently and identically distributed) according to a Borel probability measure $\rho(x, y) = \rho(y|x)\rho_X(x)$ on $Z = X \times Y$. Based on these samples, the goal of regression is to find a predictor $f: X \to \mathbb{R}$ from some hypothesis space such that $f(x)$ is a "good" approximation of $y$. The quality of the predictor $f$ is measured by the generalization error

    $$\mathcal{E}(f) := \int_Z V(x, y, f)\, d\rho(x, y),$$

    where $V: \mathbb{R} \to \mathbb{R}_+$ is a prescribed loss function.

    The hypothesis space considered in this paper is the reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$. This means that there exists a unique symmetric and positive definite continuous function $K: X \times X \to \mathbb{R}$, called the reproducing kernel of $\mathcal{H}_K$ (or Mercer kernel), and an inner product $\langle \cdot, \cdot \rangle_K$ such that $f(x) = \langle K(x, \cdot), f \rangle_K$, which is the reproducing property of the kernel, and all $f \in \mathcal{H}_K$ are limits of linear combinations of kernel functions. In other words, the RKHS $\mathcal{H}_K$ is the closure of the linear span of the set of functions $\{K_x(\cdot) = K(x, \cdot) : x \in X\}$ with respect to the inner product $\langle \cdot, \cdot \rangle_K$. For each $x \in X$ and $f \in \mathcal{H}_K$, the evaluation functional $e_x(f) := f(x)$ is continuous (i.e., bounded) in the topology of $\mathcal{H}_K$, and $|f(x)| \le \kappa \|f\|_K$ with $\kappa := \sup_{x \in X}\sqrt{K(x, x)}$ (see [1]).

    Traditional off-line learning, also called batch learning, requires all sample points to be processed in each training pass. When the amount of data is large or new sample points are added, the learning efficiency of batch learning decreases significantly. Online learning is an effective approach for analyzing and processing big data in various applications, such as communication, electronics, medicine, biology, and other fields (see e.g., [2,3,4,5,6]). The performance of kernel-based regularized online learning algorithms has been studied and their effectiveness has been verified (see e.g., [7,8,9,10,11] and references therein). Unlike off-line learning algorithms, online learning algorithms process the observations one by one, and the output is adjusted in time according to the result of the previous learning step.

    With the observations $\mathbf{z} = \{(x_t, y_t)\}_{t=1}^T$, the kernel regularized online learning algorithm based on the stochastic gradient descent method is given by (see e.g., [8,9,10])

    $$\begin{cases} f_1 = 0, \\ f_{t+1} = f_t - \eta_t\left(\partial_f V(x_t, y_t, f_t(x_t))K_{x_t} + \lambda f_t\right), \end{cases} \tag{1.1}$$

    where $\partial_f V$ denotes the derivative of the loss with respect to $f(x)$, $\eta_t$ is called the stepsize, $\lambda > 0$ is a regularization parameter, and the sequence $\{f_t : t = 1, \ldots, T+1\}$ is the learning sequence.

    When the least-square loss function $V(x, y, f(x)) = (f(x) - y)^2$ is selected, the specific iteration of the online learning algorithm is given by

    $$\begin{cases} f_1 = 0, \\ f_{t+1} = f_t - \eta_t\left((f_t(x_t) - y_t)K_{x_t} + \lambda f_t\right). \end{cases} \tag{1.2}$$

    To study the learning performance of online learning algorithms, we need to bound the convergence rate of the iterative sequence $\{f_t : t = 1, \ldots, T+1\}$. The convergence of the online learning algorithm (1.2) has been extensively studied in the literature (see e.g., [8,9,12]). The results in [12] show that, under mild conditions, the regularized online learning algorithm can converge comparably fast as the off-line learning algorithm.
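    In implementations, each iterate of (1.1) and (1.2) is a finite linear combination of the kernel sections $K_{x_1}, \ldots, K_{x_{t-1}}$, so the update can be carried out on the expansion coefficients. The following Python sketch illustrates this; it is not code from the paper, and the Gaussian kernel, function names and parameters are illustrative assumptions.

    ```python
    import numpy as np

    def gaussian_kernel(x1, x2, gamma=1.0):
        # A Mercer kernel K(x, x'); the Gaussian kernel is just one admissible choice.
        return float(np.exp(-gamma * np.sum((np.asarray(x1) - np.asarray(x2)) ** 2)))

    def online_kernel_update(samples, eta, lam, dloss, kernel=gaussian_kernel):
        """One pass of (1.1): f_1 = 0 and
        f_{t+1} = (1 - lam*eta_t) f_t - eta_t * dloss(x_t, y_t, f_t(x_t)) * K_{x_t},
        with the iterate stored through its kernel-expansion coefficients."""
        xs, coefs = [], []  # f_t(.) = sum_i coefs[i] * kernel(xs[i], .)
        for t, (x, y) in enumerate(samples, start=1):
            f_x = sum(c * kernel(xi, x) for c, xi in zip(coefs, xs))  # evaluate f_t(x_t)
            step = eta(t)
            coefs = [(1.0 - lam * step) * c for c in coefs]           # shrink old coefficients
            xs.append(x)
            coefs.append(-step * dloss(x, y, f_x))                    # new coefficient on K_{x_t}
        return xs, coefs

    # Least-square case: the derivative term is taken as f_t(x_t) - y_t,
    # matching the iteration (1.2) displayed above.
    least_square_dloss = lambda x, y, fx: fx - y
    ```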

    The least-square loss is the most widely used criterion for regression in practice. However, from a robustness point of view, the least-square loss is not a good choice: in many practical applications, outliers or heavy-tailed distributions often occur in real data sets. In recent years, how to improve the robustness of algorithms has become one of the hot topics in machine learning. In the learning theory literature, much effort has been devoted to the generalization analysis (see e.g., [13,14,15,16,17]) and the empirical study (see e.g., [18,19]) of learning algorithms when outliers or heavy-tailed noise are allowed.

    One of the main strategies to improve robustness is to use a robust loss function with a scale parameter (see e.g., [20,21]). Based on the function $\sqrt{1+t^2}$, $t \in \mathbb{R}$, which plays an important role in constructing shape-preserving quasi-interpolation and in solving partial differential equations with mesh-free methods due to its strong nonlinearity and convexity, [21] constructed a robust loss function $V_\sigma(r)$ with a scale parameter $\sigma \in (0, 1]$. For $\sigma \in (0, 1]$, the parameterized loss function $V_\sigma(r): \mathbb{R} \to [0, \infty)$ is defined as

    $$V_\sigma(r) := \sigma^2\left(\frac{\sqrt{\sigma^2 + r^2}}{\sigma} - 1\right), \quad r \in \mathbb{R}.$$

    The analysis results in [21] show that the learning algorithm based on the parameterized loss function $V_\sigma(r)$ has a good generalization ability.
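    To make the role of the scale parameter concrete, the following elementary observations follow directly from this definition (they are our reformulation, not statements quoted from [21]):

    $$V_\sigma(r) = \sigma\sqrt{\sigma^2 + r^2} - \sigma^2 = \sigma^2\left(\sqrt{1 + \left(\frac{r}{\sigma}\right)^2} - 1\right) \approx \begin{cases} \dfrac{r^2}{2}, & |r| \ll \sigma, \\[2mm] \sigma|r| - \sigma^2, & |r| \gg \sigma, \end{cases} \qquad V_\sigma'(r) = \frac{r}{\sqrt{1 + \left(\frac{r}{\sigma}\right)^2}}, \qquad |V_\sigma'(r)| \le \sigma.$$

    Thus $V_\sigma$ is quadratic for small residuals and grows only linearly, with derivative bounded by $\sigma$, for large residuals; this bounded-derivative property is exactly what is used in (3.4) below and is the mechanism behind the robustness to outliers.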

    Encouraged by these studies, we aim to further improve the performance and applicability of the online algorithm. In this paper, we introduce the parameterized loss function $V_\sigma(r)$ into the online learning algorithm and analyze the influence of the scale parameter on the convergence rate of the algorithm.

    To give the specific form of the learning algorithm with the parameterized loss function, we introduce the corresponding notation. We denote

    $$\mathcal{E}^\sigma(f) = \int_Z V_\sigma(y - f(x))\, d\rho(x, y), \tag{1.3}$$

    and

    $$f_\sigma = \arg\min_{f \in L^2(\rho_X)} \mathcal{E}^\sigma(f). \tag{1.4}$$

    The kernel-based regularized off-line algorithm is the following optimization problem:

    $$f_\lambda^\sigma = \arg\min_{f \in \mathcal{H}_K}\left\{\mathcal{E}^\sigma(f) + \frac{\lambda}{2}\|f\|_K^2\right\}. \tag{1.5}$$

    Based on the random observations $\mathbf{z} = \{(x_t, y_t)\}_{t=1}^T$, an approximate solution of the problem (1.5) is obtained through the following learning process:

    $$f_{\mathbf{z},\lambda}^\sigma = \arg\min_{f \in \mathcal{H}_K}\left\{\frac{1}{T}\sum_{t=1}^T V_\sigma(y_t - f(x_t)) + \frac{\lambda}{2}\|f\|_K^2\right\}.$$

    It is easy to see that the derivative of the loss function $V_\sigma$ is given by

    $$V_\sigma'(y_t - f_t(x_t)) = \frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}.$$

    Following the online algorithm (1.1), the kernel regularized online learning algorithm with the parameterized loss function $V_\sigma(r)$ is defined by

    $$\begin{cases} f_1 = 0, \\ f_{t+1} = f_t - \eta_t\left(-\dfrac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}K_{x_t} + \lambda f_t\right). \end{cases} \tag{1.6}$$

    In this paper, we focus on the performance of the sequence $\{f_t : t = 1, \ldots, T+1\}$ produced by the algorithm (1.6).
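    As a minimal implementation sketch (again not code from the paper), the only change needed relative to the least-square update of (1.2) is the bounded derivative of $V_\sigma$; the helper below plugs into the online_kernel_update sketch given after (1.2), and its names and parameters are illustrative assumptions.

    ```python
    import math

    def vsigma_dloss(sigma):
        """Derivative term used in (1.6). By the chain rule,
        d/df V_sigma(y - f(x)) = -V_sigma'(y - f(x)), so the online update adds
        eta_t * V_sigma'(y_t - f_t(x_t)) * K_{x_t} to the shrunken iterate."""
        def dloss(x, y, fx):
            r = y - fx
            return -r / math.sqrt(1.0 + (r / sigma) ** 2)  # magnitude at most sigma
        return dloss

    # Example wiring with the earlier sketch and a polynomially decaying stepsize
    # (C, theta, lam, sigma are user-chosen; see Section 2 for admissible ranges):
    # xs, coefs = online_kernel_update(samples,
    #                                  eta=lambda t: 1.0 / (C * t ** theta),
    #                                  lam=lam,
    #                                  dloss=vsigma_dloss(sigma))
    ```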

    The remaining parts of this paper are organized as follows: we present the main results of this paper in Section 2. The proofs of the main results are given in Section 3. Discussions and comparisons with related works are given in Section 4. Conclusions and some questions for further study are mentioned in Section 5.

    In the present paper, we write $A = O(B)$ if there is a constant $C_0$ such that $A \le C_0 B$. We use $E_{\mathbf{z}}[\cdot]$ to denote the expectation with respect to $\mathbf{z}$. When its meaning is clear from the context, we use the shorthand notation $E[\cdot]$.

    In this section, we present our main results about the performance of the algorithm (1.6); the proofs are given in Section 3.

    Our first main result establishes the convergence of the sequence $\{f_t : t = 1, \ldots, T+1\}$ in expectation, under mild conditions on the stepsize.

    Theorem 2.1. Let $f_\lambda^\sigma$ be defined as in (1.5), and let $\{f_t : t = 1, \ldots, T+1\}$ be the sequence produced by the algorithm (1.6), $\lambda > 0$. If $\{\eta_t\}$ satisfies $\sum_{t=1}^{\infty}\eta_t = +\infty$, $\lim_{t\to+\infty}\eta_t = 0$ and $\eta_t \le \frac{1}{\lambda}$, then it holds that

    $$E_{z_1,\ldots,z_T}\left[\left\|f_{T+1} - f_\lambda^\sigma\right\|_K\right] \to 0 \quad (T \to +\infty).$$

    Our second main result gives the explicit convergence rate of the last iterate by specifying the step sizes in the algorithm (1.6).

    Theorem 2.2. Let $f_\lambda^\sigma$ be defined as in (1.5), and let $\{f_t : t = 1, \ldots, T+1\}$ be the sequence produced by the algorithm (1.6). For any $0 < \lambda \le 1$, $0 < \theta \le 1$, take $\eta_t = \frac{1}{Ct^\theta}$ with $C \ge \lambda + 4(\kappa^2 + 1)$. Then, it holds that

    $$E\left[\left\|f_{T+1} - f_\lambda^\sigma\right\|_K^2\right] \le \begin{cases} \left(D_\sigma(\lambda) + \dfrac{9C_\sigma T^{1-\theta}}{C^2(1-\theta)2^{1-\theta}}\right)\exp\left\{-\dfrac{\lambda(1 - 2^{\theta-1})}{2C(1-\theta)}(T+1)^{1-\theta}\right\} + \dfrac{36C_\sigma}{\lambda C}T^{-\theta}, & \text{if } 0 < \theta < 1, \\[3mm] \left(D_\sigma(\lambda) + \dfrac{5C_\sigma T^{1-\theta}}{C^2(C-\lambda)}\right)T^{-\frac{\lambda}{2C}}, & \text{if } \theta = 1, \end{cases}$$

    where $D_\sigma(\lambda) = \frac{2M\sigma}{\lambda}$ and $C_\sigma = 4M\sigma(\kappa^2 + 1)$.

    Corollary 2.1. Let $f_\lambda^\sigma$ be defined as in (1.5), and let $\{f_t : t = 1, \ldots, T+1\}$ be the sequence produced by the algorithm (1.6). For any $\theta \in (0,1)$, $0 < \alpha \le \min\{1-\theta, \theta\}$, take $\lambda = T^{-\alpha}$. If the stepsize is chosen as $\eta_t = \frac{1}{Ct^\theta}$, then it holds that

    $$E\left[\left\|f_{T+1} - f_\lambda^\sigma\right\|_K^2\right] \le C_{M,C,\kappa,\theta}\cdot\sigma T^{-(\theta-\alpha)}, \tag{2.1}$$

    where $C_{M,C,\kappa,\theta}$ is a constant depending only on $\theta, \kappa, C$ and $M$.

    Remark 1. The results given in Theorem 2.2 and Corollary 2.1 show that the scale parameter $\sigma$ can effectively control the convergence rate of $\|f_{T+1} - f_\lambda^\sigma\|_K$, which is usually referred to as the sample error. Depending on the circumstances, the sample error bound can be greatly improved by choosing the parameter $\sigma$ properly. In fact, take $\sigma = \lambda = T^{-\alpha}$. Then, by (2.1), we have $E\left[\|f_{T+1} - f_\lambda^\sigma\|_K^2\right] = O(T^{-\theta})$, which is better than the sample error bound $O(T^{-(\theta-\alpha)})$ given in [9].
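    As a worked instance of Remark 1 (the specific numbers are chosen by us only for illustration): take $\theta = \frac{2}{3}$ and $\alpha = \frac{1}{3}$, so that $\alpha \le \min\{1-\theta, \theta\}$. Choosing $\sigma = \lambda = T^{-1/3}$ in (2.1) gives

    $$E\left[\left\|f_{T+1} - f_\lambda^\sigma\right\|_K^2\right] \le C_{M,C,\kappa,\theta}\, T^{-\frac{1}{3}}\cdot T^{-\left(\frac{2}{3} - \frac{1}{3}\right)} = C_{M,C,\kappa,\theta}\, T^{-\frac{2}{3}},$$

    whereas with a scale parameter kept fixed (independent of $T$) the same bound only yields the order $T^{-(\theta-\alpha)} = T^{-1/3}$.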

    The results provided above mainly describe the convergence rate of the sample error. However, in studying the learning performance of learning algorithms, we are often interested in the excess generalization error $\mathcal{E}^\sigma(f_{T+1}) - \mathcal{E}^\sigma(f_\sigma)$. Define

    $$\mathcal{K}(f, \lambda) = \inf_{g \in \mathcal{H}_K}\left\{\mathcal{E}^\sigma(g) - \mathcal{E}^\sigma(f) + \frac{\lambda}{2}\|g\|_K^2\right\},$$

    which is often used to denote the approximation error, whose convergence is determined by the capacity of $\mathcal{H}_K$. We assume that the K-functional satisfies the following decay condition

    $$\mathcal{K}(f_\sigma, \lambda) = O(\lambda^\beta), \quad \lambda \to 0^+, \tag{2.2}$$

    with $0 < \beta \le 1$.

    By combining the sample error with the approximation error, we obtain the overall learning rate stated as follows.

    Corollary 2.2. Let $\mathcal{E}^\sigma(f)$ be the generalization error defined as in (1.3), and let $\{f_t : t = 1, \ldots, T+1\}$ be the sequence produced by the algorithm (1.6). For $\theta \in (0,1)$, $0 < \alpha \le \min\{1-\theta, \theta\}$, take $\lambda = T^{-\alpha}$. If the stepsize is chosen as $\eta_t = \frac{1}{Ct^\theta}$, then it holds that

    $$E\left[\mathcal{E}^\sigma(f_{T+1}) - \mathcal{E}^\sigma(f_\sigma)\right] = O\left(\sigma^{\frac{3}{2}}T^{-\frac{\theta-\alpha}{2}} + T^{-\beta\alpha}\right). \tag{2.3}$$

    Now, we compare the performance of the algorithm (1.6) with that of the kernel regularized online learning algorithm based on the least-square loss.

    In [22], the kernel regularized online learning algorithm with the least-square loss is studied, and learning rates are established under certain assumptions. Namely, for $0 < \beta \le 1$ and $0 < \delta < \frac{\beta}{\beta+1}$, there holds

    $$E\left[\mathcal{E}(f_{T+1}) - \mathcal{E}(f_\rho)\right] = O\left(T^{\delta - \frac{\beta}{\beta+1}}\right), \tag{2.4}$$

    and for $\frac{1}{2} < \beta \le 1$ and $0 < \delta < \frac{2\beta-1}{2\beta+1}$, there holds

    $$E\left[\mathcal{E}(f_{T+1}) - \mathcal{E}(f_\rho)\right] = O\left(T^{\delta - \frac{2\beta-1}{2\beta+1}}\right). \tag{2.5}$$

    For $0 < \beta \le 1$, we choose $\sigma = T^{-\frac{\mu}{3}}$ with $0 < \mu \le \frac{2\beta-1}{2\beta+1}$. Take $\lambda = T^{\frac{\delta}{2\beta} - \frac{1}{2(\beta+1)} - \frac{\mu}{2\beta+1}}$ and $\eta_t = \frac{1}{4\kappa^2+5}t^{\frac{2\beta+1}{2\beta}\delta - \frac{2\beta+1}{2(\beta+1)}}$ with $\frac{\beta}{\beta+1} - \frac{2\beta}{2\beta+1}\mu < \delta < \frac{\beta}{\beta+1}$. By Corollary 2.2, we have

    $$E\left[\mathcal{E}^\sigma(f_{T+1}) - \mathcal{E}^\sigma(f_\sigma)\right] = O\left(T^{\frac{1}{2}\left(\delta - \frac{\beta}{\beta+1}\right) - \frac{\beta}{2\beta+1}\mu}\right). \tag{2.6}$$

    And for $\frac{1}{2} < \beta \le 1$, we choose $\sigma^3 = \lambda = T^{\frac{2\beta}{2\beta-1}\delta - \frac{2\beta}{2\beta+1}}$ with $0 < \delta < \frac{2\beta-1}{2\beta+1}$, and take $\eta_t = \frac{1}{4\kappa^2+5}t^{\frac{2\beta}{2\beta-1}\delta - \frac{2\beta}{2\beta+1}}$. By Corollary 2.2, we have

    $$E\left[\mathcal{E}^\sigma(f_{T+1}) - \mathcal{E}^\sigma(f_\sigma)\right] = O\left(T^{\frac{\beta}{2\beta-1}\delta - \frac{\beta}{2\beta+1}}\right). \tag{2.7}$$

    The rate (2.6) is better than (2.4), and (2.7) is better than (2.5). The analysis results illustrate that the convergence rate of the algorithm (1.6) can be improved by choosing the parameter σ appropriately.

    Lemma 3.1. Let $\{f_t : t = 1, \ldots, T+1\}$ be the sequence produced by the algorithm (1.6). If $\eta_t \le \frac{1}{\lambda}$, then, for any $t = 1, \ldots, T+1$, it holds that

    $$\|f_t\|_K \le \frac{\kappa\sigma}{\lambda}. \tag{3.1}$$

    Proof. We prove (3.1) by induction on $t$. The inequality (3.1) is true for $t = 1$ because of the initialization condition $f_1 = 0$. Suppose the bound (3.1) holds for some $t$, and consider $f_{t+1}$. The iteration (1.6) can be rewritten as

    $$f_{t+1} = f_t + \frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}\eta_t K_{x_t} - \lambda\eta_t f_t = (1 - \lambda\eta_t)f_t + \frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}\eta_t K_{x_t}. \tag{3.2}$$

    This implies that

    $$\|f_{t+1}\|_K \le (1 - \lambda\eta_t)\|f_t\|_K + \kappa\eta_t\left|\frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}\right|. \tag{3.3}$$

    Note that

    $$\left|\frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}\right| \le \sigma. \tag{3.4}$$

    Combining this with the induction hypothesis $\|f_t\|_K \le \frac{\kappa\sigma}{\lambda}$, we have

    $$\|f_{t+1}\|_K \le (1 - \lambda\eta_t)\|f_t\|_K + \kappa\sigma\eta_t \le (1 - \lambda\eta_t)\frac{\kappa\sigma}{\lambda} + \kappa\sigma\eta_t = \frac{\kappa\sigma}{\lambda}. \tag{3.5}$$

    This completes the proof.

    Lemma 3.2. Assume $f_\lambda^\sigma$ is defined as in (1.5). Then, for any $f \in \mathcal{H}_K$, it holds that

    $$\int_Z \frac{y - f_\lambda^\sigma(x)}{\sqrt{1 + \left(\frac{y - f_\lambda^\sigma(x)}{\sigma}\right)^2}}\left(f_\lambda^\sigma(x) - f(x)\right)d\rho = \lambda\left\langle f_\lambda^\sigma, f_\lambda^\sigma - f\right\rangle_K.$$

    Proof. By the Taylor formula, for any $u, v \in \mathbb{R}$, we have

    $$V_\sigma(u) - V_\sigma(v) = V_\sigma'(v)(u - v) + \frac{1}{2}V_\sigma''(\xi)(u - v)^2, \tag{3.6}$$

    where $\xi \in \mathbb{R}$ is between $u$ and $v$.

    Note that $V_\sigma''(\xi) = \frac{1}{\left(1 + \left(\frac{\xi}{\sigma}\right)^2\right)^{3/2}} > 0$. Then,

    $$V_\sigma(u) - V_\sigma(v) \ge V_\sigma'(v)(u - v) = \frac{v(u - v)}{\sqrt{1 + \left(\frac{v}{\sigma}\right)^2}}. \tag{3.7}$$

    Therefore, for any $f, g \in \mathcal{H}_K$, we have

    $$\mathcal{E}^\sigma(f) - \mathcal{E}^\sigma(g) = \int_Z V_\sigma(y - f(x))d\rho - \int_Z V_\sigma(y - g(x))d\rho \ge -\int_Z \frac{y - g(x)}{\sqrt{1 + \left(\frac{y - g(x)}{\sigma}\right)^2}}\left(f(x) - g(x)\right)d\rho = \left\langle f - g,\ -\int_Z \frac{y - g(x)}{\sqrt{1 + \left(\frac{y - g(x)}{\sigma}\right)^2}}K_x\, d\rho\right\rangle_K = \left\langle f - g, \nabla\mathcal{E}^\sigma(g)\right\rangle_K. \tag{3.8}$$

    By (i) of Lemma 5.1 in [23], we know that $\mathcal{E}^\sigma(f)$ is a convex function on $\mathcal{H}_K$, and $\|f\|_K^2$ is a strictly convex function on $\mathcal{H}_K$. Then, $\Omega^\sigma(f) = \mathcal{E}^\sigma(f) + \frac{\lambda}{2}\|f\|_K^2$ is a convex function on $\mathcal{H}_K$. Based on (ii) of Lemma 5.1 in [23], we have

    $$0 = \nabla\Omega^\sigma(f)\big|_{f = f_\lambda^\sigma} = \nabla\mathcal{E}^\sigma(f)\big|_{f = f_\lambda^\sigma} + \lambda f_\lambda^\sigma = -\int_Z \frac{y - f_\lambda^\sigma(x)}{\sqrt{1 + \left(\frac{y - f_\lambda^\sigma(x)}{\sigma}\right)^2}}K_x\, d\rho + \lambda f_\lambda^\sigma. \tag{3.9}$$

    Taking the inner product with $f - f_\lambda^\sigma$ on both sides of the above formula, we get

    $$0 = \left\langle -\int_Z \frac{y - f_\lambda^\sigma(x)}{\sqrt{1 + \left(\frac{y - f_\lambda^\sigma(x)}{\sigma}\right)^2}}K_x\, d\rho + \lambda f_\lambda^\sigma,\ f - f_\lambda^\sigma\right\rangle_K = -\int_Z \frac{y - f_\lambda^\sigma(x)}{\sqrt{1 + \left(\frac{y - f_\lambda^\sigma(x)}{\sigma}\right)^2}}\left\langle K_x, f - f_\lambda^\sigma\right\rangle_K d\rho + \lambda\left\langle f_\lambda^\sigma, f - f_\lambda^\sigma\right\rangle_K$$
    $$= \int_Z \frac{y - f_\lambda^\sigma(x)}{\sqrt{1 + \left(\frac{y - f_\lambda^\sigma(x)}{\sigma}\right)^2}}\left(f_\lambda^\sigma(x) - f(x)\right)d\rho + \lambda\left\langle f_\lambda^\sigma, f - f_\lambda^\sigma\right\rangle_K.$$

    This proves our conclusion.

    Lemma 3.3. Let $f_\lambda^\sigma$ be defined as in (1.5). For any $f \in \mathcal{H}_K$, we denote $\Omega^\sigma(f) = \mathcal{E}^\sigma(f) + \frac{\lambda}{2}\|f\|_K^2$. Then, it holds that

    $$\frac{\lambda}{2}\left\|f - f_\lambda^\sigma\right\|_K^2 \le \Omega^\sigma(f) - \Omega^\sigma(f_\lambda^\sigma).$$

    Proof. For any $f \in \mathcal{H}_K$, we define a function $f^{(\theta)} = f_\lambda^\sigma + \theta(f - f_\lambda^\sigma)$, $\theta \in [0, 1]$. Then $f^{(0)} = f_\lambda^\sigma$ and $f^{(1)} = f$. Denote $F(\theta) = \Omega^\sigma(f^{(\theta)}) = \int_Z V_\sigma(y - f^{(\theta)}(x))d\rho + \frac{\lambda}{2}\|f^{(\theta)}\|_K^2$; then $F(1) = \Omega^\sigma(f)$ and $F(0) = \Omega^\sigma(f_\lambda^\sigma)$. Since $V_\sigma$ is differentiable, $F(\theta)$ is differentiable as a function of $\theta$, and for any $\theta \in [0, 1]$ we have

    $$F'(\theta) = \lim_{\Delta\theta\to 0}\frac{F(\theta + \Delta\theta) - F(\theta)}{\Delta\theta} = \lim_{\Delta\theta\to 0}\frac{\Omega^\sigma(f^{(\theta+\Delta\theta)}) - \Omega^\sigma(f^{(\theta)})}{\Delta\theta} = \lim_{\Delta\theta\to 0}\frac{1}{\Delta\theta}\left(\int_Z\left(V_\sigma(y - f^{(\theta+\Delta\theta)}(x)) - V_\sigma(y - f^{(\theta)}(x))\right)d\rho + \frac{\lambda}{2}\|f^{(\theta+\Delta\theta)}\|_K^2 - \frac{\lambda}{2}\|f^{(\theta)}\|_K^2\right)$$
    $$= \lim_{\Delta\theta\to 0}\frac{1}{\Delta\theta}\left(\int_Z\left(V_\sigma\big(y - f^{(\theta)}(x) - \Delta\theta(f(x) - f_\lambda^\sigma(x))\big) - V_\sigma(y - f^{(\theta)}(x))\right)d\rho + \frac{\lambda}{2}\left(\left\|f^{(\theta)} + \Delta\theta(f - f_\lambda^\sigma)\right\|_K^2 - \left\|f^{(\theta)}\right\|_K^2\right)\right). \tag{3.10}$$

    By the mean value theorem, there holds

    $$V_\sigma\big(y - f^{(\theta)}(x) - \Delta\theta(f(x) - f_\lambda^\sigma(x))\big) - V_\sigma\big(y - f^{(\theta)}(x)\big) = \Delta\theta\, V_\sigma'(\xi)\left(f_\lambda^\sigma(x) - f(x)\right), \tag{3.11}$$

    where $\xi$ lies between $y - f^{(\theta)}(x) - \Delta\theta(f(x) - f_\lambda^\sigma(x))$ and $y - f^{(\theta)}(x)$.

    This, in connection with $\left\|f^{(\theta)} + \Delta\theta(f - f_\lambda^\sigma)\right\|_K^2 - \left\|f^{(\theta)}\right\|_K^2 = 2\Delta\theta\left\langle f - f_\lambda^\sigma, f^{(\theta)}\right\rangle_K + (\Delta\theta)^2\left\|f - f_\lambda^\sigma\right\|_K^2$ and (3.10), tells us that

    $$F'(\theta) = \lim_{\Delta\theta\to 0}\left(\int_Z V_\sigma'(\xi)\left(f_\lambda^\sigma(x) - f(x)\right)d\rho + \lambda\left\langle f - f_\lambda^\sigma, f^{(\theta)}\right\rangle_K + \frac{\lambda}{2}\Delta\theta\left\|f - f_\lambda^\sigma\right\|_K^2\right) = \int_Z V_\sigma'\big(y - f^{(\theta)}(x)\big)\left(f_\lambda^\sigma(x) - f(x)\right)d\rho + \lambda\left\langle f - f_\lambda^\sigma, f^{(\theta)}\right\rangle_K$$
    $$= \int_Z V_\sigma'\big((y - f_\lambda^\sigma(x)) + \theta(f_\lambda^\sigma(x) - f(x))\big)\left(f_\lambda^\sigma(x) - f(x)\right)d\rho + \lambda\left\langle f - f_\lambda^\sigma, f_\lambda^\sigma + \theta(f - f_\lambda^\sigma)\right\rangle_K = \int_Z V_\sigma'\big((y - f_\lambda^\sigma(x)) + \theta(f_\lambda^\sigma(x) - f(x))\big)\left(f_\lambda^\sigma(x) - f(x)\right)d\rho + \lambda\left\langle f - f_\lambda^\sigma, f_\lambda^\sigma\right\rangle_K + \theta\lambda\left\|f - f_\lambda^\sigma\right\|_K^2. \tag{3.12}$$

    By Lemma 3.2, we see that

    $$\lambda\left\langle f - f_\lambda^\sigma, f_\lambda^\sigma\right\rangle_K = -\int_Z \frac{y - f_\lambda^\sigma(x)}{\sqrt{1 + \left(\frac{y - f_\lambda^\sigma(x)}{\sigma}\right)^2}}\left(f_\lambda^\sigma(x) - f(x)\right)d\rho = -\int_Z V_\sigma'\big(y - f_\lambda^\sigma(x)\big)\left(f_\lambda^\sigma(x) - f(x)\right)d\rho. \tag{3.13}$$

    On the other hand, since $V_\sigma(u)$ is a convex function on $\mathbb{R}$, by the discussions in [24], we know that

    $$\left(V_\sigma'\big((y - f_\lambda^\sigma(x)) + \theta(f_\lambda^\sigma(x) - f(x))\big) - V_\sigma'\big(y - f_\lambda^\sigma(x)\big)\right)\left(f_\lambda^\sigma(x) - f(x)\right) \ge 0. \tag{3.14}$$

    Therefore, for $\theta \in (0, 1)$, we have

    $$F'(\theta) = \int_Z\left(V_\sigma'\big((y - f_\lambda^\sigma(x)) + \theta(f_\lambda^\sigma(x) - f(x))\big) - V_\sigma'\big(y - f_\lambda^\sigma(x)\big)\right)\left(f_\lambda^\sigma(x) - f(x)\right)d\rho + \lambda\theta\left\|f - f_\lambda^\sigma\right\|_K^2 \ge \lambda\theta\left\|f - f_\lambda^\sigma\right\|_K^2. \tag{3.15}$$

    By the definition of $f_\lambda^\sigma$, we know that $F(\theta) \ge F(0) = \Omega^\sigma(f_\lambda^\sigma)$ for $\theta \in [0, 1]$. Therefore, (3.15) implies that

    $$\Omega^\sigma(f) - \Omega^\sigma(f_\lambda^\sigma) = F(1) - F(0) = \int_0^1 F'(\theta)d\theta \ge \int_0^1\lambda\theta\left\|f - f_\lambda^\sigma\right\|_K^2 d\theta = \lambda\left\|f - f_\lambda^\sigma\right\|_K^2\int_0^1\theta\, d\theta = \frac{\lambda}{2}\left\|f - f_\lambda^\sigma\right\|_K^2.$$

    The proof is completed.

    Lemma 3.4. Let $f_\lambda^\sigma$ be defined as in (1.5) and let $\{f_t : t = 1, \ldots, T+1\}$ be the sequence produced by the algorithm (1.6). If $\lambda > 0$ and $\eta_t \le \frac{1}{\lambda}$, then

    $$E_{z_1,\ldots,z_T}\left[\left\|f_{T+1} - f_\lambda^\sigma\right\|_K^2\right] \le 4\kappa^2\sigma^2\sum_{t=1}^T\eta_t^2\prod_{j=t+1}^T(1 - \lambda\eta_j) + \frac{2M\sigma}{\lambda}\prod_{t=1}^T(1 - \lambda\eta_t). \tag{3.16}$$

    Furthermore, there holds

    $$E_{z_1,\ldots,z_T}\left[\left\|f_{T+1} - f_\lambda^\sigma\right\|_K^2\right] \le 4\kappa^2\sigma^2\sum_{t=1}^T\eta_t^2\exp\left\{-\lambda\sum_{j=t+1}^T\eta_j\right\} + \frac{2M\sigma}{\lambda}\exp\left\{-\lambda\sum_{t=1}^T\eta_t\right\}. \tag{3.17}$$

    Proof. According to the algorithm (1.6), we know

    $$\left\|f_{t+1} - f_\lambda^\sigma\right\|_K^2 = \left\|f_t - f_\lambda^\sigma\right\|_K^2 + \eta_t^2\left\|\lambda f_t - \frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}K_{x_t}\right\|_K^2 + 2\eta_t\left\langle\lambda f_t - \frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}K_{x_t},\ f_\lambda^\sigma - f_t\right\rangle_K = \left\|f_t - f_\lambda^\sigma\right\|_K^2 + \eta_t^2\left\|\lambda f_t - \frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}K_{x_t}\right\|_K^2 + 2\eta_t A, \tag{3.18}$$

    where $A = \left\langle\lambda f_t - \dfrac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}K_{x_t},\ f_\lambda^\sigma - f_t\right\rangle_K$.

    By using the inequality $\langle a, b - a\rangle_K \le \frac{1}{2}\left(\|b\|_K^2 - \|a\|_K^2\right)$, $a, b \in \mathcal{H}_K$, with $a = f_t$, $b = f_\lambda^\sigma$, we have

    $$A = \lambda\left\langle f_t, f_\lambda^\sigma - f_t\right\rangle_K - \frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}\left\langle K_{x_t}, f_\lambda^\sigma - f_t\right\rangle_K \le \frac{\lambda}{2}\left(\|f_\lambda^\sigma\|_K^2 - \|f_t\|_K^2\right) - \frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}\left(f_\lambda^\sigma(x_t) - f_t(x_t)\right)$$
    $$\le \frac{\lambda}{2}\left(\|f_\lambda^\sigma\|_K^2 - \|f_t\|_K^2\right) + V_\sigma(y_t - f_\lambda^\sigma(x_t)) - V_\sigma(y_t - f_t(x_t)) = \left(V_\sigma(y_t - f_\lambda^\sigma(x_t)) + \frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\right) - \left(V_\sigma(y_t - f_t(x_t)) + \frac{\lambda}{2}\|f_t\|_K^2\right). \tag{3.19}$$

    Since $f_t$ depends on $\{z_1, \ldots, z_{t-1}\}$ but not on $z_t$, it follows that

    $$E_{z_1,\ldots,z_t}[A] \le E_{z_1,\ldots,z_{t-1}}\left[E_{z_t}\left[\left(V_\sigma(y_t - f_\lambda^\sigma(x_t)) + \frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\right) - \left(V_\sigma(y_t - f_t(x_t)) + \frac{\lambda}{2}\|f_t\|_K^2\right)\right]\right] = \mathcal{E}^\sigma(f_\lambda^\sigma) + \frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2 - E_{z_1,\ldots,z_{t-1}}\left[\mathcal{E}^\sigma(f_t) + \frac{\lambda}{2}\|f_t\|_K^2\right]. \tag{3.20}$$

    Combining (3.18) with (3.20), we have

    $$E_{z_1,\ldots,z_t}\left[\left\|f_{t+1} - f_\lambda^\sigma\right\|_K^2\right] \le E_{z_1,\ldots,z_{t-1}}\left[\left\|f_t - f_\lambda^\sigma\right\|_K^2\right] + \eta_t^2 E_{z_1,\ldots,z_t}\left[\left\|\lambda f_t - \frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}K_{x_t}\right\|_K^2\right] + 2\eta_t E_{z_1,\ldots,z_{t-1}}\left[\left(\mathcal{E}^\sigma(f_\lambda^\sigma) + \frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\right) - \left(\mathcal{E}^\sigma(f_t) + \frac{\lambda}{2}\|f_t\|_K^2\right)\right]$$
    $$= E_{z_1,\ldots,z_{t-1}}\left[\left\|f_t - f_\lambda^\sigma\right\|_K^2\right] + \eta_t^2 E_{z_1,\ldots,z_t}\left[\left\|\lambda f_t - \frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}K_{x_t}\right\|_K^2\right] + 2\eta_t E_{z_1,\ldots,z_{t-1}}\left[\Omega^\sigma(f_\lambda^\sigma) - \Omega^\sigma(f_t)\right]. \tag{3.21}$$

    According to Lemma 3.3, we know

    $$\Omega^\sigma(f_\lambda^\sigma) - \Omega^\sigma(f_t) \le -\frac{\lambda}{2}\left\|f_t - f_\lambda^\sigma\right\|_K^2. \tag{3.22}$$

    Therefore, (3.21) implies that

    $$E_{z_1,\ldots,z_t}\left[\left\|f_{t+1} - f_\lambda^\sigma\right\|_K^2\right] \le E_{z_1,\ldots,z_{t-1}}\left[\left\|f_t - f_\lambda^\sigma\right\|_K^2\right] + \eta_t^2 E_{z_1,\ldots,z_t}\left[\left\|\lambda f_t - \frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}K_{x_t}\right\|_K^2\right] - \lambda\eta_t E_{z_1,\ldots,z_{t-1}}\left[\left\|f_t - f_\lambda^\sigma\right\|_K^2\right]$$
    $$= (1 - \lambda\eta_t)E_{z_1,\ldots,z_{t-1}}\left[\left\|f_t - f_\lambda^\sigma\right\|_K^2\right] + \eta_t^2 E_{z_1,\ldots,z_t}\left[\left\|\lambda f_t - \frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}K_{x_t}\right\|_K^2\right]. \tag{3.23}$$

    By Lemma 3.1, we know

    $$\left\|\lambda f_t - \frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}K_{x_t}\right\|_K^2 \le \left(\lambda\|f_t\|_K + \kappa\left|\frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}\right|\right)^2 \le \left(\lambda\cdot\frac{\kappa\sigma}{\lambda} + \kappa\sigma\right)^2 = 4\kappa^2\sigma^2. \tag{3.24}$$

    Substituting (3.24) into (3.23), we obtain

    $$E_{z_1,\ldots,z_t}\left[\left\|f_{t+1} - f_\lambda^\sigma\right\|_K^2\right] \le (1 - \lambda\eta_t)E_{z_1,\ldots,z_{t-1}}\left[\left\|f_t - f_\lambda^\sigma\right\|_K^2\right] + 4\kappa^2\sigma^2\eta_t^2.$$

    Applying this relation iteratively for $t = T, T-1, \ldots, 1$, we have

    $$E_{z_1,\ldots,z_T}\left[\left\|f_{T+1} - f_\lambda^\sigma\right\|_K^2\right] \le (1 - \lambda\eta_T)E_{z_1,\ldots,z_{T-1}}\left[\left\|f_T - f_\lambda^\sigma\right\|_K^2\right] + 4\kappa^2\sigma^2\eta_T^2 \le (1 - \lambda\eta_T)\left((1 - \lambda\eta_{T-1})E_{z_1,\ldots,z_{T-2}}\left[\left\|f_{T-1} - f_\lambda^\sigma\right\|_K^2\right] + 4\kappa^2\sigma^2\eta_{T-1}^2\right) + 4\kappa^2\sigma^2\eta_T^2$$
    $$\le \cdots \le 4\kappa^2\sigma^2\sum_{t=1}^T\eta_t^2\prod_{j=t+1}^T(1 - \lambda\eta_j) + \prod_{t=1}^T(1 - \lambda\eta_t)E\left[\left\|f_1 - f_\lambda^\sigma\right\|_K^2\right].$$

    Here we use the convention $\prod_{j=T+1}^T(1 - \lambda\eta_j) = 1$. From the definition of $f_\lambda^\sigma$, we see that

    $$\frac{\lambda}{2}\left\|f_\lambda^\sigma\right\|_K^2 \le \mathcal{E}^\sigma(0) \le M\sigma. \tag{3.25}$$

    This, in connection with the initialization condition $f_1 = 0$, implies that

    $$E_{z_1,\ldots,z_T}\left[\left\|f_{T+1} - f_\lambda^\sigma\right\|_K^2\right] \le 4\kappa^2\sigma^2\sum_{t=1}^T\eta_t^2\prod_{j=t+1}^T(1 - \lambda\eta_j) + \prod_{t=1}^T(1 - \lambda\eta_t)\left\|f_\lambda^\sigma\right\|_K^2 \le 4\kappa^2\sigma^2\sum_{t=1}^T\eta_t^2\prod_{j=t+1}^T(1 - \lambda\eta_j) + \frac{2M\sigma}{\lambda}\prod_{t=1}^T(1 - \lambda\eta_t).$$

    This shows (3.16). The bound (3.17) then follows from (3.16) and the inequality $1 - u \le e^{-u}$ for any $u \ge 0$.

    Proof of Theorem 2.1. It is easy to see that

    $$\prod_{t=1}^T(1 - \lambda\eta_t) \le \exp\left\{-\lambda\sum_{t=1}^T\eta_t\right\} \to 0 \quad (T \to +\infty).$$

    This implies that, for any $\varepsilon > 0$, there exists some $T_1 \in \mathbb{N}$ such that

    $$\prod_{t=1}^T(1 - \lambda\eta_t) \le \varepsilon,$$

    whenever $T \ge T_1$.

    According to the assumption $\lim_{t\to+\infty}\eta_t = 0$, there exists some $t(\varepsilon) \in \mathbb{N}$ such that $\eta_t \le \lambda\varepsilon$ for every $t \ge t(\varepsilon)$. Furthermore, we have

    $$\sum_{t=t(\varepsilon)+1}^T\eta_t^2\prod_{j=t+1}^T(1 - \lambda\eta_j) \le \lambda\varepsilon\sum_{t=t(\varepsilon)+1}^T\eta_t\prod_{j=t+1}^T(1 - \lambda\eta_j) = \varepsilon\sum_{t=t(\varepsilon)+1}^T\left(\big(1 - (1 - \lambda\eta_t)\big)\prod_{j=t+1}^T(1 - \lambda\eta_j)\right) = \varepsilon\sum_{t=t(\varepsilon)+1}^T\left(\prod_{j=t+1}^T(1 - \lambda\eta_j) - \prod_{j=t}^T(1 - \lambda\eta_j)\right)$$
    $$= \varepsilon\left(\left(\prod_{j=t(\varepsilon)+2}^T(1 - \lambda\eta_j) - \prod_{j=t(\varepsilon)+1}^T(1 - \lambda\eta_j)\right) + \left(\prod_{j=t(\varepsilon)+3}^T(1 - \lambda\eta_j) - \prod_{j=t(\varepsilon)+2}^T(1 - \lambda\eta_j)\right) + \cdots + \left(\prod_{j=T+1}^T(1 - \lambda\eta_j) - \prod_{j=T}^T(1 - \lambda\eta_j)\right)\right) = \varepsilon\left(1 - \prod_{j=t(\varepsilon)+1}^T(1 - \lambda\eta_j)\right) \le \varepsilon. \tag{3.26}$$

    Since $t(\varepsilon)$ is fixed, there exists some $T_2 \in \mathbb{N}$ such that, for every $T \ge T_2$, it holds that

    $$\sum_{j=t(\varepsilon)+1}^T\eta_j \ge \sum_{j=t(\varepsilon)+1}^{T_2}\eta_j \ge \frac{1}{\lambda}\log\frac{t(\varepsilon)}{\lambda^2\varepsilon}.$$

    So, for any $1 \le t \le t(\varepsilon)$, we have

    $$\prod_{j=t+1}^T(1 - \lambda\eta_j) \le \exp\left\{-\sum_{j=t+1}^T\lambda\eta_j\right\} \le \exp\left\{-\sum_{j=t(\varepsilon)+1}^T\lambda\eta_j\right\} \le \frac{\lambda^2\varepsilon}{t(\varepsilon)}.$$

    Hence

    $$\sum_{t=1}^{t(\varepsilon)}\eta_t^2\prod_{j=t+1}^T(1 - \lambda\eta_j) \le \frac{\lambda^2\varepsilon}{t(\varepsilon)}\sum_{t=1}^{t(\varepsilon)}\eta_t^2 \le \varepsilon. \tag{3.27}$$

    From (3.26) and (3.27), we know that, for any $\varepsilon > 0$,

    $$\sum_{t=1}^T\eta_t^2\prod_{j=t+1}^T(1 - \lambda\eta_j) = \sum_{t=1}^{t(\varepsilon)}\eta_t^2\prod_{j=t+1}^T(1 - \lambda\eta_j) + \sum_{t=t(\varepsilon)+1}^T\eta_t^2\prod_{j=t+1}^T(1 - \lambda\eta_j) \le \varepsilon + \varepsilon = 2\varepsilon, \tag{3.28}$$

    whenever $T \ge T_2$. Let $T^* = \max\{T_1, T_2\}$; then by (3.16), (3.28) and the definition of $T_1$, we have

    $$E_{z_1,\ldots,z_T}\left[\left\|f_{T+1} - f_\lambda^\sigma\right\|_K^2\right] \le \left(8\kappa^2\sigma^2 + \frac{2M\sigma}{\lambda}\right)\varepsilon,$$

    whenever $T \ge T^*$. Since $\varepsilon > 0$ is arbitrary, this completes the proof of Theorem 2.1.

    Proof of Theorem 2.2. By (3.21), we know

    $$E_{z_1,\ldots,z_t}\left[\left\|f_{t+1} - f_\lambda^\sigma\right\|_K^2\right] \le E_{z_1,\ldots,z_{t-1}}\left[\left\|f_t - f_\lambda^\sigma\right\|_K^2\right] + \eta_t^2 E_{z_1,\ldots,z_t}\left[\left\|\lambda f_t - \frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}K_{x_t}\right\|_K^2\right] + 2\eta_t E_{z_1,\ldots,z_{t-1}}\left[\left(\mathcal{E}^\sigma(f_\lambda^\sigma) + \frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\right) - \left(\mathcal{E}^\sigma(f_t) + \frac{\lambda}{2}\|f_t\|_K^2\right)\right]. \tag{3.29}$$

    From the inequality $\|a - b\|_K^2 \le 2\|a\|_K^2 + 2\|b\|_K^2$, we have

    $$\left\|\lambda f_t - \frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}K_{x_t}\right\|_K^2 \le 2\lambda^2\|f_t\|_K^2 + 2\left|\frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}\right|^2\left\|K_{x_t}(\cdot)\right\|_K^2 \le 2\lambda^2\|f_t\|_K^2 + 2\kappa^2\left|\frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}\right|^2. \tag{3.30}$$

    On the other hand, for any $r \in \mathbb{R}$, it holds that $\left|\frac{r}{\sqrt{1 + \left(\frac{r}{\sigma}\right)^2}}\right|^2 \le 2\sigma^2\left(\frac{\sqrt{\sigma^2 + r^2}}{\sigma} - 1\right) = 2V_\sigma(r)$. This implies that

    $$\left|\frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}\right|^2 \le 2V_\sigma(y_t - f_t(x_t)). \tag{3.31}$$

    Combining (3.30) with (3.31), we get

    $$\left\|\lambda f_t - \frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}K_{x_t}\right\|_K^2 \le 4\kappa^2 V_\sigma(y_t - f_t(x_t)) + 2\lambda^2\|f_t\|_K^2 \le 4\kappa^2 V_\sigma(y_t - f_t(x_t)) + 2\lambda\|f_t\|_K^2 = 4\kappa^2 V_\sigma(y_t - f_t(x_t)) + 4\cdot\frac{\lambda}{2}\|f_t\|_K^2 \le 4(\kappa^2 + 1)\left(V_\sigma(y_t - f_t(x_t)) + \frac{\lambda}{2}\|f_t\|_K^2\right). \tag{3.32}$$

    Substituting (3.32) into (3.29), we get

    $$E_{z_1,\ldots,z_t}\left[\left\|f_{t+1} - f_\lambda^\sigma\right\|_K^2\right] \le E_{z_1,\ldots,z_{t-1}}\left[\left\|f_t - f_\lambda^\sigma\right\|_K^2\right] + 4(\kappa^2 + 1)\eta_t^2 E_{z_1,\ldots,z_t}\left[V_\sigma(y_t - f_t(x_t)) + \frac{\lambda}{2}\|f_t\|_K^2\right] + 2\eta_t E_{z_1,\ldots,z_{t-1}}\left[\left(\mathcal{E}^\sigma(f_\lambda^\sigma) + \frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\right) - \left(\mathcal{E}^\sigma(f_t) + \frac{\lambda}{2}\|f_t\|_K^2\right)\right]$$
    $$= E_{z_1,\ldots,z_{t-1}}\left[\left\|f_t - f_\lambda^\sigma\right\|_K^2\right] + 4(\kappa^2 + 1)\eta_t^2 E_{z_1,\ldots,z_{t-1}}\left[\mathcal{E}^\sigma(f_t) + \frac{\lambda}{2}\|f_t\|_K^2\right] + 2\eta_t E_{z_1,\ldots,z_{t-1}}\left[\left(\mathcal{E}^\sigma(f_\lambda^\sigma) + \frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\right) - \left(\mathcal{E}^\sigma(f_t) + \frac{\lambda}{2}\|f_t\|_K^2\right)\right]$$
    $$\le E_{z_1,\ldots,z_{t-1}}\left[\left\|f_t - f_\lambda^\sigma\right\|_K^2\right] + 4(\kappa^2 + 1)\eta_t^2 E_{z_1,\ldots,z_{t-1}}\left[\left(\mathcal{E}^\sigma(f_t) + \frac{\lambda}{2}\|f_t\|_K^2\right) - \left(\mathcal{E}^\sigma(f_\lambda^\sigma) + \frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\right)\right] + 2\eta_t E_{z_1,\ldots,z_{t-1}}\left[\left(\mathcal{E}^\sigma(f_\lambda^\sigma) + \frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\right) - \left(\mathcal{E}^\sigma(f_t) + \frac{\lambda}{2}\|f_t\|_K^2\right)\right] + 4(\kappa^2 + 1)\eta_t^2\,\mathcal{E}^\sigma(0)$$
    $$\le E_{z_1,\ldots,z_{t-1}}\left[\left\|f_t - f_\lambda^\sigma\right\|_K^2\right] + 2\eta_t\left(1 - 2(\kappa^2 + 1)\eta_t\right)E_{z_1,\ldots,z_{t-1}}\left[\left(\mathcal{E}^\sigma(f_\lambda^\sigma) + \frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\right) - \left(\mathcal{E}^\sigma(f_t) + \frac{\lambda}{2}\|f_t\|_K^2\right)\right] + 4(\kappa^2 + 1)M\sigma\eta_t^2$$
    $$= E_{z_1,\ldots,z_{t-1}}\left[\left\|f_t - f_\lambda^\sigma\right\|_K^2\right] + B + 4(\kappa^2 + 1)M\sigma\eta_t^2, \tag{3.33}$$

    where

    $$B = 2\eta_t\left(1 - 2(\kappa^2 + 1)\eta_t\right)E_{z_1,\ldots,z_{t-1}}\left[\left(\mathcal{E}^\sigma(f_\lambda^\sigma) + \frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\right) - \left(\mathcal{E}^\sigma(f_t) + \frac{\lambda}{2}\|f_t\|_K^2\right)\right].$$

    Based on the assumptions on $\eta_t$, we know that $1 - 2(\kappa^2 + 1)\eta_t \ge \frac{1}{2}$. And by Lemma 3.3, we know

    $$E_{z_1,\ldots,z_{t-1}}\left[\left(\mathcal{E}^\sigma(f_\lambda^\sigma) + \frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\right) - \left(\mathcal{E}^\sigma(f_t) + \frac{\lambda}{2}\|f_t\|_K^2\right)\right] \le -\frac{\lambda}{2}E_{z_1,\ldots,z_{t-1}}\left[\left\|f_t - f_\lambda^\sigma\right\|_K^2\right].$$

    This implies that

    $$B \le -\frac{\lambda\eta_t}{2}E_{z_1,\ldots,z_{t-1}}\left[\left\|f_t - f_\lambda^\sigma\right\|_K^2\right]. \tag{3.34}$$

    Combining (3.33) with (3.34), we obtain

    $$E_{z_1,\ldots,z_t}\left[\left\|f_{t+1} - f_\lambda^\sigma\right\|_K^2\right] \le \left(1 - \frac{\lambda\eta_t}{2}\right)E_{z_1,\ldots,z_{t-1}}\left[\left\|f_t - f_\lambda^\sigma\right\|_K^2\right] + 4M\sigma(\kappa^2 + 1)\eta_t^2. \tag{3.35}$$

    Denote $C_\sigma = 4M\sigma(\kappa^2 + 1)$. Applying the relation (3.35) iteratively for $t = T, T-1, \ldots, 1$, we have

    $$E_{z_1,\ldots,z_T}\left[\left\|f_{T+1} - f_\lambda^\sigma\right\|_K^2\right] \le \left(1 - \frac{\lambda}{2}\eta_T\right)E_{z_1,\ldots,z_{T-1}}\left[\left\|f_T - f_\lambda^\sigma\right\|_K^2\right] + C_\sigma\eta_T^2 \le \left(1 - \frac{\lambda}{2}\eta_T\right)\left(\left(1 - \frac{\lambda}{2}\eta_{T-1}\right)E_{z_1,\ldots,z_{T-2}}\left[\left\|f_{T-1} - f_\lambda^\sigma\right\|_K^2\right] + C_\sigma\eta_{T-1}^2\right) + C_\sigma\eta_T^2$$
    $$\le \cdots \le C_\sigma\sum_{t=1}^T\eta_t^2\prod_{j=t+1}^T\left(1 - \frac{\lambda}{2}\eta_j\right) + \prod_{t=1}^T\left(1 - \frac{\lambda}{2}\eta_t\right)E\left[\left\|f_1 - f_\lambda^\sigma\right\|_K^2\right] \le C_\sigma\sum_{t=1}^T\eta_t^2\prod_{j=t+1}^T\left(1 - \frac{\lambda}{2}\eta_j\right) + \frac{2M\sigma}{\lambda}\prod_{t=1}^T\left(1 - \frac{\lambda}{2}\eta_t\right). \tag{3.36}$$

    For any $u \ge 0$, the inequality $1 - u \le e^{-u}$ holds, and hence (3.36) implies that

    $$E_{z_1,\ldots,z_T}\left[\left\|f_{T+1} - f_\lambda^\sigma\right\|_K^2\right] \le C_\sigma\sum_{t=1}^T\eta_t^2\exp\left\{-\frac{\lambda}{2}\sum_{j=t+1}^T\eta_j\right\} + \frac{2M\sigma}{\lambda}\exp\left\{-\frac{\lambda}{2}\sum_{t=1}^T\eta_t\right\}. \tag{3.37}$$

    Denote

    $$I_1 = \frac{2M\sigma}{\lambda}\exp\left\{-\frac{\lambda}{2}\sum_{t=1}^T\eta_t\right\} = D_\sigma(\lambda)\exp\left\{-\frac{\lambda}{2}\sum_{t=1}^T\eta_t\right\} = D_\sigma(\lambda)\exp\left\{-\frac{\lambda}{2C}\sum_{t=1}^T t^{-\theta}\right\}$$

    and

    $$I_2 = C_\sigma\sum_{t=1}^T\left(\frac{1}{Ct^\theta}\right)^2\exp\left\{-\frac{\lambda}{2C}\sum_{j=t+1}^T j^{-\theta}\right\} = \frac{C_\sigma}{C^2}\sum_{t=1}^T t^{-2\theta}\exp\left\{-\frac{\lambda}{2C}\sum_{j=t+1}^T j^{-\theta}\right\}.$$

    Then, by (3.37) and the assumption on the stepsize $\eta_t$, we know

    $$E_{z_1,\ldots,z_T}\left[\left\|f_{T+1} - f_\lambda^\sigma\right\|_K^2\right] \le I_1 + I_2. \tag{3.38}$$

    Now, we estimate $I_1$ and $I_2$ respectively. By Lemma 4 of [9], we obtain the following estimate of $I_1$:

    $$I_1 \le \begin{cases} D_\sigma(\lambda)\exp\left\{-\dfrac{\lambda}{2C}\cdot\dfrac{1 - 2^{\theta-1}}{1 - \theta}(T+1)^{1-\theta}\right\}, & \text{if } 0 < \theta < 1, \\[3mm] D_\sigma(\lambda)(T+1)^{-\frac{\lambda}{2C}}, & \text{if } \theta = 1. \end{cases} \tag{3.39}$$

    On the other hand, by Lemma 5.10 of [23] with $\nu = \frac{\lambda}{2C}$ and $s = \theta$, we have

    $$\sum_{t=1}^{T-1}t^{-2\theta}\exp\left\{-\frac{\lambda}{2C}\sum_{j=t+1}^T j^{-\theta}\right\} \le \begin{cases} \dfrac{36C}{\lambda}T^{-\theta} + \dfrac{9T^{1-\theta}}{(1-\theta)2^{1-\theta}}\exp\left\{-\dfrac{\lambda(1 - 2^{\theta-1})}{2C(1-\theta)}(T+1)^{1-\theta}\right\}, & \text{if } 0 < \theta < 1, \\[3mm] \dfrac{16C}{2C - \lambda}(T+1)^{-\frac{\lambda}{2C}}, & \text{if } \theta = 1. \end{cases}$$

    So,

    $$\sum_{t=1}^T t^{-2\theta}\exp\left\{-\frac{\lambda}{2C}\sum_{j=t+1}^T j^{-\theta}\right\} \le \sum_{t=1}^{T-1}t^{-2\theta}\exp\left\{-\frac{\lambda}{2C}\sum_{j=t+1}^T j^{-\theta}\right\} + T^{-2\theta} \le \begin{cases} \dfrac{36C}{\lambda}T^{-\theta} + \dfrac{9T^{1-\theta}}{(1-\theta)2^{1-\theta}}\exp\left\{-\dfrac{\lambda(1 - 2^{\theta-1})}{2C(1-\theta)}(T+1)^{1-\theta}\right\} + T^{-2\theta}, & \text{if } 0 < \theta < 1, \\[3mm] \dfrac{16C}{2C - \lambda}(T+1)^{-\frac{\lambda}{2C}} + T^{-2}, & \text{if } \theta = 1. \end{cases}$$

    Furthermore, we have the following estimate of $I_2$:

    $$I_2 \le \begin{cases} \dfrac{C_\sigma}{C^2}\left(\dfrac{36C}{\lambda}T^{-\theta} + \dfrac{9T^{1-\theta}}{(1-\theta)2^{1-\theta}}\exp\left\{-\dfrac{\lambda(1 - 2^{\theta-1})}{2C(1-\theta)}(T+1)^{1-\theta}\right\} + T^{-2\theta}\right), & \text{if } 0 < \theta < 1, \\[3mm] \dfrac{C_\sigma}{C^2}\left(\dfrac{16C}{2C - \lambda}(T+1)^{-\frac{\lambda}{2C}} + T^{-2}\right), & \text{if } \theta = 1. \end{cases} \tag{3.40}$$

    Since $T^{-2\theta} \le T^{-\theta}$ and $\lambda \le C$, the conclusion can be established by combining (3.38) with (3.39) and (3.40).

    Proof of Corollary 2.1. For $\theta \in (0,1)$ and $0 < \alpha \le \min\{1-\theta, \theta\}$, by Theorem 2.2 with $\lambda = T^{-\alpha}$, we have

    $$E\left[\left\|f_{T+1} - f_\lambda^\sigma\right\|_K^2\right] \le \sigma\left(\left(2MT^{\alpha} + \frac{36M(\kappa^2+1)}{C^2(1-\theta)2^{1-\theta}}T^{1-\theta}\right)\exp\left\{-\frac{1 - 2^{\theta-1}}{2C(1-\theta)}T^{(1-\theta)-\alpha}\right\} + \frac{36M(\kappa^2+1)}{C}T^{-(\theta-\alpha)}\right). \tag{3.41}$$

    Since for any $\nu > 0$, $c > 0$ and $\eta > 0$ there exists $L > 0$ such that $\exp\{-cT^\nu\} \le LT^{-\eta}$, the first term on the right-hand side of (3.41) decays in the form of $O(\sigma T^{-\eta})$ for arbitrarily large $\eta > 0$. The second term on the right-hand side of (3.41) is bounded by $O(\sigma T^{-(\theta-\alpha)})$. Consequently, there exists a constant $C_{M,C,\kappa,\theta}$ depending only on $\theta, \kappa, C$ and $M$ such that

    $$E\left[\left\|f_{T+1} - f_\lambda^\sigma\right\|_K^2\right] \le C_{M,C,\kappa,\theta}\cdot\sigma T^{-(\theta-\alpha)}.$$

    The proof is completed.

    Proof of Corollary 2.2. According to the mean value theorem, there exists $\xi$ between $y - f_{T+1}(x)$ and $y - f_\lambda^\sigma(x)$ such that

    $$\left|V_\sigma(y - f_{T+1}(x)) - V_\sigma(y - f_\lambda^\sigma(x))\right| = \left|V_\sigma'(\xi)\right|\left|f_{T+1}(x) - f_\lambda^\sigma(x)\right| = \frac{|\xi|}{\sqrt{1 + \left(\frac{\xi}{\sigma}\right)^2}}\left|f_{T+1}(x) - f_\lambda^\sigma(x)\right| \le \sigma\left|f_{T+1}(x) - f_\lambda^\sigma(x)\right|.$$

    Then, we have

    $$\mathcal{E}^\sigma(f_{T+1}) - \mathcal{E}^\sigma(f_\lambda^\sigma) \le \int_Z\left|V_\sigma(y - f_{T+1}(x)) - V_\sigma(y - f_\lambda^\sigma(x))\right|d\rho(x, y) \le \sigma\int_Z\left|f_{T+1}(x) - f_\lambda^\sigma(x)\right|d\rho(x, y) \le \kappa\sigma\left\|f_{T+1} - f_\lambda^\sigma\right\|_K. \tag{3.42}$$

    And we get

    $$\mathcal{E}^\sigma(f_{T+1}) - \mathcal{E}^\sigma(f_\sigma) = \left(\mathcal{E}^\sigma(f_{T+1}) - \mathcal{E}^\sigma(f_\lambda^\sigma)\right) + \mathcal{E}^\sigma(f_\lambda^\sigma) - \mathcal{E}^\sigma(f_\sigma) \le \kappa\sigma\left\|f_{T+1} - f_\lambda^\sigma\right\|_K + \mathcal{E}^\sigma(f_\lambda^\sigma) - \mathcal{E}^\sigma(f_\sigma)$$
    $$\le \kappa\sigma\left\|f_{T+1} - f_\lambda^\sigma\right\|_K + \mathcal{E}^\sigma(f_\lambda^\sigma) - \mathcal{E}^\sigma(f_\sigma) + \frac{\lambda}{2}\left\|f_\lambda^\sigma\right\|_K^2 = \kappa\sigma\left\|f_{T+1} - f_\lambda^\sigma\right\|_K + \inf_{f \in \mathcal{H}_K}\left\{\mathcal{E}^\sigma(f) - \mathcal{E}^\sigma(f_\sigma) + \frac{\lambda}{2}\|f\|_K^2\right\} = \kappa\sigma\left\|f_{T+1} - f_\lambda^\sigma\right\|_K + \mathcal{K}(f_\sigma, \lambda).$$

    Combining this with Corollary 2.1 (noting that $E\left[\|f_{T+1} - f_\lambda^\sigma\|_K\right] \le \left(E\left[\|f_{T+1} - f_\lambda^\sigma\|_K^2\right]\right)^{1/2}$) and the assumption (2.2), the desired result follows.

    Most studies of online learning algorithms focus on convergence in expectation (see, for example, [8,9,10,25]). However, these results were established for some fixed loss functions, such as the least-square loss (see e.g., [9,22]). Our results are established for a parameterized loss function with a scale parameter $\sigma$. The analysis in Section 2 shows that the scale parameter $\sigma$ can effectively control the convergence rate of the learning algorithm, and a better convergence rate is obtained. On the other hand, previous research on online learning algorithms relies on integral operator theory (see [25]); this paper instead establishes the error bounds for the learning sequence by applying the convex analysis method. The convex analysis method has been widely used in various research fields, for example, in the analysis of machine learning algorithms (see e.g., [21,23,26]) and in studies of discrete fractional operators (see e.g., [27,28]), and it has proved to be a very effective analysis tool.

    In [23], the online pairwise regression problem with the quadratic loss is studied. Different from [23], in this paper we use the parameterized loss function for the pointwise learning model, which has a wider range of applications than the pairwise learning model. It is known that deep convolutional networks can increase the approximation order (see e.g., [29,30,31,32]); it is therefore hopeful that the convergence rate provided in this paper can be improved by employing deep neural network methods.

    In the present paper, we analyze the learning performance of the kernel regularized online algorithm with a parameterized loss. The convergence of the learning sequence is proved and the error bound is provided in the expectation sense by using the convex analysis method. There are some questions for further study. In this paper, we focus on the theoretical analysis of the kernel regularized online algorithm with the parameterized loss $V_\sigma$; however, there is still a gap between the theoretical analysis and the optimization process of empirical risk minimization based on a parameterized loss. In future work, it would be interesting to apply the online learning algorithm based on $V_\sigma$ to practical problems and to construct an effective solution method. In addition, we mainly analyze the sample error in this paper, and the approximation error is represented by the K-functional. How to carry out a more accurate analysis of the approximation error, and how the scale parameter $\sigma$ influences the approximation error, remain to be further studied.

    This work is supported by the Special Project for Scientific and Technological Cooperation of Jiangxi Province (Project No. 20212BDH80021) and the Science and Technology Project of the Jiangxi Provincial Department of Education (Project No. GJJ211334).

    No potential conflict of interest was reported by the author.



    [1] N. Aronszajn, Theory of reproducing kernels, Trans. Amer. Math. Soc., 68 (1950), 337–404. http://dx.doi.org/10.2307/1990404 doi: 10.2307/1990404
    [2] W. Dai, J. Hu, Y. Cheng, X. Wang, T. Chai, RVFLN-based online adaptive semi-supervised learning algorithm with application to product quality estimation of industrial processes, J. Cent. South Univ., 26 (2019), 3338–3350. http://dx.doi.org/10.1007/s11771-019-4257-6 doi: 10.1007/s11771-019-4257-6
    [3] J. Gui, Y. Liu, X. Deng, B. Liu, Network capacity optimization for Cellular-assisted vehicular systems by online learning-based mmWave beam selection, Wirel. Commun. Mob. Com., 2021 (2021), 8876186. http://dx.doi.org/10.1155/2021/8876186 doi: 10.1155/2021/8876186
    [4] M. Li, I. Sethi, A new online learning algorithm with application to image segmentation, Image Processing: Algorithms and Systems IV, 5672 (2005), 277–286. http://dx.doi.org/10.1117/12.586328 doi: 10.1117/12.586328
    [5] S. Sai Santosh, S. Darak, Intelligent and reconfigurable architecture for KL divergence based online machine learning algorithm, arXiv: 2002.07713.
    [6] B. Yang, J. Yao, X. Yang, Y. Shi, Painting image classification using online learning algorithm, In: Distributed, ambient and pervasive interactions, Cham: Springer, 2017,393–403. http://dx.doi.org/10.1007/978-3-319-58697-7_29
    [7] S. Das, Kuhoo, D. Mishra, M. Rout, An optimized feature reduction based currency forecasting model exploring the online sequential extreme learning machine and krill herd strategies, Physica A, 513 (2019), 339–370. http://dx.doi.org/10.1016/j.physa.2018.09.021 doi: 10.1016/j.physa.2018.09.021
    [8] S. Smale, Y. Yao, Online learning algorithms, Found. Comput. Math., 6 (2006), 145–170. http://dx.doi.org/10.1007/s10208-004-0160-z
    [9] Y. Ying, D. Zhou, Online regularized classification algorithms, IEEE Trans. Inform. Theory, 52 (2006), 4775–4788. http://dx.doi.org/10.1109/TIT.2006.883632 doi: 10.1109/TIT.2006.883632
    [10] Y. Ying, D. Zhou, Unregularized online learning algorithms with general loss functions, Appl. Comput. Harmon. Anal., 42 (2017), 224–244. http://dx.doi.org/10.1016/J.ACHA.2015.08.007 doi: 10.1016/J.ACHA.2015.08.007
    [11] Y. Zeng, D. Klabjian, Online adaptive machine learning based algorithm for implied volatility surface modeling, Knowl.-Based Syst., 163 (2019), 376–391. http://dx.doi.org/10.1016/j.knosys.2018.08.039 doi: 10.1016/j.knosys.2018.08.039
    [12] J. Lin, D. Zhou, Online learning algorithms can converge comparably fast as batch learning, IEEE Trans. Neural Netw. Learn. Syst., 29 (2018), 2367–2378. http://dx.doi.org/10.1109/TNNLS.2017.2677970 doi: 10.1109/TNNLS.2017.2677970
    [13] P. Huber, E. Ronchetti, Robust statistics, Hoboken: John Wiley & Sons, 2009. http://dx.doi.org/10.1002/9780470434697
    [14] Y. Wu, Y. Liu, Robust truncated hinge loss support vector machine, J. Am. Stat. Assoc., 102 (2007), 974–983. http://dx.doi.org/10.1198/016214507000000617 doi: 10.1198/016214507000000617
    [15] Y. Yu, M. Yang, L. Xu, M. White, D. Schuurmans, Relaxed clipping: a global training method for robust regression and classification, Proceedings of the 23rd International Conference on Neural Information Processing Systems, 2 (2010), 2532–2540.
    [16] S. Huang, Y. Feng, Q. Wu, Learning theory of minimum error entropy under weak moment conditions, Anal. Appl., 20 (2022), 121–139. http://dx.doi.org/10.1142/S0219530521500044 doi: 10.1142/S0219530521500044
    [17] F. Lv, J. Fan, Optimal learning with Gaussians and correntropy loss, Anal. Appl., 19 (2021), 107–124. http://dx.doi.org/10.1142/S0219530519410124 doi: 10.1142/S0219530519410124
    [18] X. Zhu, Z. Li, J. Sun, Expression recognition method combining convolutional features and Transformer, Math. Found. Comput., in press. http://dx.doi.org/10.3934/mfc.2022018
    [19] S. Suzumura, K. Ogawa, M. Sugiyama, M. Karasuyama, I. Takeuchi, Homotopy continuation approaches for robust SV classification and regression, Mach. Learn., 106 (2017), 1009–1038. http://dx.doi.org/10.1007/s10994-017-5627-7 doi: 10.1007/s10994-017-5627-7
    [20] Z. Guo, T. Hu, L. Shi, Gradient descent for robust kernel-based regression, Inverse Probl., 34 (2018), 065009. http://dx.doi.org/10.1088/1361-6420/aabe55 doi: 10.1088/1361-6420/aabe55
    [21] B. Sheng, H. Zhu, The convergence rate of semi-supervised regression with quadratic loss, Appl. Math. Comput., 321 (2018), 11–24. http://dx.doi.org/10.1016/j.amc.2017.10.033 doi: 10.1016/j.amc.2017.10.033
    [22] M. Pontil, Y. Ying, D. Zhou, Error analysis for online gradient descent algorithms in reproducing kernel Hilbert spaces, Proceedings of Technical Report, University College London, 2005, 1–20.
    [23] S. Wang, Z. Chen, B. Sheng, Convergence of online pairwise regression learning with quadratic loss, Commun. Pur. Appl. Anal., 19 (2020), 4023–4054. http://dx.doi.org/10.3934/cpaa.2020178 doi: 10.3934/cpaa.2020178
    [24] H. Bauschke, P. Combettes, Convex analysis and monotone operator theory in Hilbert spaces, Cham: Springer-Verlag, 2010. http://dx.doi.org/10.1007/978-3-319-48311-5
    [25] Z. Guo, L. Shi, Fast and strong convergence of online learning algorithms, Adv. Comput. Math., 45 (2019), 2745–2770. http://dx.doi.org/10.1007/s10444-019-09707-8 doi: 10.1007/s10444-019-09707-8
    [26] Y. Lei, D. Zhou, Convergence of online mirror descent, Appl. Comput. Harmon. Anal., 48 (2020), 343–373. http://dx.doi.org/10.1016/j.acha.2018.05.005 doi: 10.1016/j.acha.2018.05.005
    [27] I. Baloch, T. Abdeljawad, S. Bibi, A. Mukheimer, G. Farid, A. Haq, Some new Caputo fractional derivative inequalities for exponentially (θ,hm)-convex functions, AIMS Mathematics, 7 (2022), 3006–3026. http://dx.doi.org/10.3934/math.2022166 doi: 10.3934/math.2022166
    [28] P. Mohammed, D. O'Regan, A. Brzo, K. Abualnaja, D. Baleanu, Analysis of positivity results for discrete fractional operators by means of exponential kernels, AIMS Mathematics, 7 (2022), 15812–15823. http://dx.doi.org/10.3934/math.2022865 doi: 10.3934/math.2022865
    [29] Y. Xia, J. Zhou, T. Xu, W. Gao, An improved deep convolutional neural network model with kernel loss function in image classifiaction, Math. Found. Comput., 3 (2020), 51–64. http://dx.doi.org/10.3934/mfc.2020005 doi: 10.3934/mfc.2020005
    [30] D. Zhou, Deep distributed convolutional neural networks: universality, Anal. Appl., 16 (2018), 895–919. http://dx.doi.org/10.1142/S0219530518500124 doi: 10.1142/S0219530518500124
    [31] D. Zhou, Universality of deep convolutional neural networks, Appl. Comput. Harmon. Anal., 48 (2020), 787–794. http://dx.doi.org/10.1016/j.acha.2019.06.004 doi: 10.1016/j.acha.2019.06.004
    [32] D. Zhou, Theory of deep convolutional neural networks: downsampling, Neural Networks, 124 (2020), 319–327. http://dx.doi.org/10.1016/j.neunet.2020.01.018 doi: 10.1016/j.neunet.2020.01.018
  • © 2022 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)