The study of the learning performance of machine learning algorithms is a central topic in machine learning theory, and the choice of loss function is one of the key factors affecting that performance. In this paper, we introduce a parameterized loss function into the online learning algorithm and investigate its performance. By applying convex analysis techniques, we prove the convergence of the learning sequence and provide the convergence rate in expectation. The analysis shows that the convergence rate can be greatly improved by adjusting the parameter in the loss function.
Citation: Shuhua Wang. Convergence of online learning algorithm with a parameterized loss[J]. AIMS Mathematics, 2022, 7(11): 20066-20084. doi: 10.3934/math.20221098
Let $X\subset\mathbb{R}^d$ be the input space and $Y=[-M,M]$ the output space for some $M>0$. The samples $\mathbf{z}=\{(x_t,y_t)\}_{t=1}^{T}$ are drawn i.i.d. (independently and identically distributed) according to a Borel probability measure $\rho(x,y)=\rho(y|x)\rho_X(x)$ on $Z=X\times Y$. Based on these samples, the goal of regression is to find a predictor $f:X\to\mathbb{R}$ from some hypothesis space such that $f(x)$ is a "good" approximation of $y$. The quality of the predictor $f$ is measured by the generalization error
$$\mathcal{E}(f):=\int_Z V(x,y,f)\,d\rho(x,y),$$
where $V:\mathbb{R}\to\mathbb{R}_{+}$ is a prescribed loss function.
The hypothesis space considered in this paper is the reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$. This means that there exists a unique symmetric, positive definite, continuous function $K:X\times X\to\mathbb{R}$, called the reproducing kernel of $\mathcal{H}_K$ (or Mercer kernel), and an inner product $\langle\cdot,\cdot\rangle_K$ such that $f(x)=\langle K(x,\cdot),f\rangle_K$ (the reproducing property of the kernel), and every $f\in\mathcal{H}_K$ is a limit of linear combinations of kernel functions. In other words, the RKHS $\mathcal{H}_K$ is the closure of the linear span of the set of functions $\{K_x(\cdot)=K(x,\cdot):x\in X\}$ under the inner product $\langle\cdot,\cdot\rangle_K$. For each $x\in X$ and $f\in\mathcal{H}_K$, the evaluation functional $e_x(f):=f(x)$ is continuous (i.e., bounded) in the topology of $\mathcal{H}_K$, and $|f(x)|\le\kappa\|f\|_K$ with $\kappa:=\sup_{x\in X}\sqrt{K(x,x)}$ (see [1]).
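To make the kernel expansion concrete, the following small Python sketch (ours, not part of the paper) represents a function $f=\sum_i\alpha_iK_{x_i}$ in the RKHS of an illustrative Gaussian kernel and checks the pointwise bound $|f(x)|\le\kappa\|f\|_K$ numerically; the kernel choice, centres and coefficients are arbitrary.

```python
import numpy as np

def gaussian_kernel(u, v, gamma=1.0):
    """Mercer kernel K(u, v) = exp(-gamma * ||u - v||^2); here K(x, x) = 1, so kappa = 1."""
    return np.exp(-gamma * np.sum((u - v) ** 2))

rng = np.random.default_rng(0)
centers = rng.normal(size=(5, 2))   # points x_i defining the kernel sections K_{x_i}(.)
alpha = rng.normal(size=5)          # coefficients alpha_i

def f(x):
    """f(x) = sum_i alpha_i K(x_i, x), a generic element of span{K_{x_i}}."""
    return sum(a * gaussian_kernel(c, x) for a, c in zip(alpha, centers))

# RKHS norm: ||f||_K^2 = alpha^T G alpha, with Gram matrix G_ij = K(x_i, x_j)
G = np.array([[gaussian_kernel(a, b) for b in centers] for a in centers])
norm_K = np.sqrt(alpha @ G @ alpha)

x_test = rng.normal(size=2)
print(abs(f(x_test)), "<=", 1.0 * norm_K)   # reproducing property gives |f(x)| <= kappa * ||f||_K
```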
Traditional off-line learning, also called batch learning, processes all sample points in each training pass. When the amount of data is large or new sample points arrive, the learning ability of batch learning degrades significantly. Online learning is an effective approach for analyzing and processing big data in various applications, such as communication, electronics, medicine, biology, and other fields (see e.g., [2,3,4,5,6]). The performance of kernel-based regularized online learning algorithms has been studied and their effectiveness has been verified (see e.g., [7,8,9,10,11] and references therein). Unlike off-line learning algorithms, online learning algorithms process the observations one by one, and the output is adjusted immediately according to the result of the last learning step.
With the observations $\mathbf{z}=\{(x_t,y_t)\}_{t=1}^{T}$, the kernel regularized online learning algorithm based on stochastic gradient descent is given by (see e.g., [8,9,10])
$$\begin{cases} f_1=0,\\ f_{t+1}=f_t-\eta_t\big(\nabla_f V(x_t,y_t,f_t)\,K_{x_t}+\lambda f_t\big), \end{cases} \qquad (1.1)$$
where $\eta_t$ is the stepsize, $\lambda>0$ is a regularization parameter, and $\{f_t: t=1,\dots,T+1\}$ is the learning sequence.
When the least-square loss $V(x,y,f(x))=(f(x)-y)^2$ is selected, the iteration of the online learning algorithm becomes
$$\begin{cases} f_1=0,\\ f_{t+1}=f_t-\eta_t\big((f_t(x_t)-y_t)\,K_{x_t}+\lambda f_t\big). \end{cases} \qquad (1.2)$$
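For readers who prefer code, here is a minimal sketch (ours, not from the paper) of iteration (1.2): the iterate $f_t$ is stored as kernel-expansion coefficients over the inputs seen so far, the Gaussian kernel is an illustrative choice, and the stepsize follows the form $\eta_t=\frac{1}{C}t^{-\theta}$ used later in Theorem 2.2.

```python
import numpy as np

def gaussian_kernel(u, v, gamma=1.0):
    return np.exp(-gamma * np.sum((u - v) ** 2))

def online_least_squares(stream, lam=0.1, C=5.0, theta=0.5, gamma=1.0):
    """Iteration (1.2): f_{t+1} = f_t - eta_t * ((f_t(x_t) - y_t) K_{x_t} + lam * f_t),
    with f_t stored as coefficients c_i so that f_t(x) = sum_i c_i K(x_i, x)."""
    xs, coefs = [], []
    for t, (x_t, y_t) in enumerate(stream, start=1):
        eta_t = 1.0 / (C * t ** theta)
        # evaluate f_t(x_t) through the kernel expansion
        f_xt = sum(c * gaussian_kernel(x_i, x_t, gamma) for c, x_i in zip(coefs, xs))
        # the -eta_t * lam * f_t term shrinks every existing coefficient
        coefs = [(1.0 - eta_t * lam) * c for c in coefs]
        # the residual term adds a new kernel centre at x_t
        xs.append(x_t)
        coefs.append(-eta_t * (f_xt - y_t))
    return xs, coefs
```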
To study the learning performance of online learning algorithms, we need to bound the convergence rate of the iterative sequence $\{f_t: t=1,\dots,T+1\}$. The convergence of the online learning algorithm (1.2) has been extensively studied in the literature (see e.g., [8,9,12]). The results in [12] show that, under mild conditions, the regularized online learning algorithm converges comparably fast as the off-line learning algorithm.
The least-square loss is the most widely used criterion for regression in practice. However, from the robustness point of view, the least-square loss is not a good choice: in many practical applications, outliers or heavy-tailed distributions occur in real data sets. In recent years, improving the robustness of algorithms has become one of the hot topics in machine learning. In the learning theory literature, much effort has been devoted to generalization analysis (see e.g., [13,14,15,16,17]) and empirical experiments (see e.g., [18,19]) of learning algorithms when outliers or heavy-tailed noise are allowed.
One of the main strategies to improve robustness is to use a robust loss function with a scale parameter (see e.g., [20,21]). Based on the function $\sqrt{1+t^2}$, $t\in\mathbb{R}$ (known as the multiquadric), which plays an important role in constructing shape-preserving quasi-interpolation and in mesh-free methods for partial differential equations because of its strong nonlinearity and convexity, [21] constructed a robust loss function $V_\sigma(r)$ with a scale parameter $\sigma\in(0,1]$. For $\sigma\in(0,1]$, the parameterized loss function $V_\sigma:\mathbb{R}\to[0,\infty)$ is defined as
$$V_\sigma(r):=\sigma^2\left(\frac{\sqrt{\sigma^2+r^2}}{\sigma}-1\right),\qquad r\in\mathbb{R}.$$
The analysis results in [21] show that the learning algorithm based on the parameterized loss function $V_\sigma(r)$ has a good generalization ability.
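A quick numerical illustration (ours) of why $V_\sigma$ is robust: its derivative is uniformly bounded by $\sigma$, so a single large residual can only move an iterate by a bounded amount, while for small residuals the loss behaves like the least-square loss. The sketch below simply evaluates $V_\sigma$ and $V_\sigma'$.

```python
import numpy as np

def V_sigma(r, sigma):
    """V_sigma(r) = sigma^2 * (sqrt(sigma^2 + r^2)/sigma - 1)."""
    return sigma ** 2 * (np.sqrt(sigma ** 2 + r ** 2) / sigma - 1.0)

def V_sigma_prime(r, sigma):
    """V_sigma'(r) = r / sqrt(1 + (r/sigma)^2); its absolute value never exceeds sigma."""
    return r / np.sqrt(1.0 + (r / sigma) ** 2)

r = np.linspace(-100.0, 100.0, 10001)
for sigma in (0.1, 0.5, 1.0):
    assert np.all(np.abs(V_sigma_prime(r, sigma)) <= sigma + 1e-12)
    # near 0 the loss is approximately r^2/2; for large |r| it grows only linearly (about sigma*|r|)
    print(sigma, V_sigma(0.01, sigma), V_sigma(100.0, sigma))
```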
Encouraged by these studies, we want to further improve the performance and applicability of the online algorithm. In this paper, we introduce the parameterized loss function $V_\sigma(r)$ into the online learning algorithm and analyze the influence of the scale parameter on the convergence rate of the algorithm.
To give the specific form of the learning algorithm with the parameterized loss function, we introduce the corresponding notation. We denote
$$\mathcal{E}_\sigma(f)=\int_Z V_\sigma\big(y-f(x)\big)\,d\rho(x,y) \qquad (1.3)$$
and
$$f_\sigma=\arg\min_{f\in L^2(\rho_X)}\mathcal{E}_\sigma(f). \qquad (1.4)$$
The kernel-based regularized off-line algorithm is the following optimization problem:
$$f_\lambda^\sigma=\arg\min_{f\in\mathcal{H}_K}\Big\{\mathcal{E}_\sigma(f)+\frac{\lambda}{2}\|f\|_K^2\Big\}. \qquad (1.5)$$
Based on the random observations $\mathbf{z}=\{(x_t,y_t)\}_{t=1}^{T}$, an approximate solution of problem (1.5) is obtained through the following learning process:
$$f_{\mathbf{z},\lambda}^\sigma=\arg\min_{f\in\mathcal{H}_K}\frac{1}{T}\sum_{t=1}^{T}V_\sigma\big(y_t-f(x_t)\big)+\frac{\lambda}{2}\|f\|_K^2.$$
It is easy to see that the gradient of the loss function $V_\sigma$ is given by
$$\nabla_f V_\sigma\big(y_t-f_t(x_t)\big)=-\frac{y_t-f_t(x_t)}{\sqrt{1+\left(\frac{y_t-f_t(x_t)}{\sigma}\right)^2}}.$$
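For completeness, this expression follows from a direct computation (ours, added for the reader's convenience): writing $V_\sigma(r)=\sigma\sqrt{\sigma^2+r^2}-\sigma^2$, one gets
$$V_\sigma'(r)=\frac{\sigma r}{\sqrt{\sigma^2+r^2}}=\frac{r}{\sqrt{1+(r/\sigma)^2}},\qquad V_\sigma''(r)=\frac{\sigma^3}{(\sigma^2+r^2)^{3/2}}=\frac{1}{\big(1+(r/\sigma)^2\big)^{3/2}},$$
and the chain rule applied to $r=y_t-f(x_t)$ gives the stated gradient; in the update (1.6) below this scalar is multiplied by $K_{x_t}$.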
Along with the online algorithm (1.1), the kernel regularized online learning algorithm with the parameterized loss function $V_\sigma(r)$ is defined by
$$\begin{cases} f_1=0,\\[4pt] f_{t+1}=f_t-\eta_t\left(-\dfrac{y_t-f_t(x_t)}{\sqrt{1+\left(\frac{y_t-f_t(x_t)}{\sigma}\right)^2}}\,K_{x_t}+\lambda f_t\right). \end{cases} \qquad (1.6)$$
In this paper, we focus on the performance of the sequence $\{f_t: t=1,\dots,T+1\}$ produced by the algorithm (1.6).
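As with (1.2), iteration (1.6) has a simple coefficient-space implementation; compared with the earlier sketch, only the residual term changes. Again this is our illustrative sketch (Gaussian kernel, arbitrary default parameters), not code from the paper.

```python
import numpy as np

def gaussian_kernel(u, v, gamma=1.0):
    return np.exp(-gamma * np.sum((u - v) ** 2))

def online_parameterized_loss(stream, sigma=0.5, lam=0.1, C=5.0, theta=0.5, gamma=1.0):
    """Iteration (1.6): the raw residual of (1.2) is replaced by the bounded quantity
    (y_t - f_t(x_t)) / sqrt(1 + ((y_t - f_t(x_t))/sigma)^2)."""
    xs, coefs = [], []
    for t, (x_t, y_t) in enumerate(stream, start=1):
        eta_t = 1.0 / (C * t ** theta)                    # stepsize eta_t = t^{-theta}/C
        f_xt = sum(c * gaussian_kernel(x_i, x_t, gamma) for c, x_i in zip(coefs, xs))
        r = y_t - f_xt
        grad = -r / np.sqrt(1.0 + (r / sigma) ** 2)       # gradient of V_sigma, bounded by sigma
        coefs = [(1.0 - eta_t * lam) * c for c in coefs]  # shrinkage from the -eta_t*lam*f_t term
        xs.append(x_t)
        coefs.append(-eta_t * grad)                       # new kernel centre K_{x_t}
    return xs, coefs
```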
The remaining parts of this paper are organized as follows: we present the main results of this paper in Section 2. The proofs of the main results are given in Section 3. Discussions and comparisons with related works are given in Section 4. Conclusions and some questions for further study are mentioned in Section 5.
In the present paper, we write $A=O(B)$ if there is a constant $C\ge0$ such that $A\le CB$. We use $E_{\mathbf{z}}[\cdot]$ to denote the expectation with respect to $\mathbf{z}$. When its meaning is clear from the context, we use the shorthand notation $E[\cdot]$.
In this section, we present our main results on the performance of the algorithm (1.6); the proofs are given in Section 3.
Our first main result establishes the convergence of the sequence $\{f_t: t=1,\dots,T+1\}$ in expectation, under mild conditions on the stepsize.
Theorem 2.1. Let $f_\lambda^\sigma$ be defined as in (1.5), and let $\{f_t: t=1,\dots,T+1\}$ be the sequence produced by the algorithm (1.6) with $\lambda>0$. If $\{\eta_t\}$ satisfies $\sum_{t=1}^{\infty}\eta_t=+\infty$, $\lim_{t\to+\infty}\eta_t=0$ and $\eta_t\le\frac{1}{\lambda}$, then it holds that
$$E_{z_1,\dots,z_T}\big[\|f_{T+1}-f_\lambda^\sigma\|_K\big]\to0\qquad(T\to+\infty).$$
Our second main result gives the explicit convergence rate of the last iterate by specifying the step sizes in the algorithm (1.6).
Theorem 2.2. Let $f_\lambda^\sigma$ be defined as in (1.5), and let $\{f_t: t=1,\dots,T+1\}$ be the sequence produced by the algorithm (1.6). For any $0<\lambda\le1$ and $0<\theta\le1$, take $\eta_t=\frac{1}{C}t^{-\theta}$ with $C\ge\lambda+4(\kappa^2+1)$. Then, it holds that
$$E\big[\|f_{T+1}-f_\lambda^\sigma\|_K^2\big]\le\begin{cases}\left(D_\sigma(\lambda)+\dfrac{9C_\sigma T^{1-\theta}}{C^2(1-\theta)2^{1-\theta}}\right)\exp\left\{-\dfrac{\lambda(1-2^{\theta-1})}{2C(1-\theta)}(T+1)^{1-\theta}\right\}+\dfrac{36C_\sigma}{\lambda C}T^{-\theta}, & \text{if }0<\theta<1,\\[12pt]\left(D_\sigma(\lambda)+\dfrac{5C_\sigma T^{1-\theta}}{C^2(C-\lambda)}\right)T^{-\frac{\lambda}{2C}}, & \text{if }\theta=1,\end{cases}$$
where $D_\sigma(\lambda)=\frac{2M\sigma}{\lambda}$ and $C_\sigma=4M\sigma(\kappa^2+1)$.
Corollary 2.1. Let $f_\lambda^\sigma$ be defined as in (1.5), and let $\{f_t: t=1,\dots,T+1\}$ be the sequence produced by the algorithm (1.6). For any $\theta\in(0,1)$ and $0<\alpha\le\min\{1-\theta,\theta\}$, take $\lambda=T^{-\alpha}$. If the stepsize is chosen as $\eta_t=\frac{1}{C}t^{-\theta}$, then it holds that
$$E\big[\|f_{T+1}-f_\lambda^\sigma\|_K^2\big]\le C_{M,C,\kappa,\theta}\,\frac{\sigma}{T^{\theta-\alpha}}, \qquad (2.1)$$
where $C_{M,C,\kappa,\theta}$ is a constant depending only on $\theta$, $\kappa$, $C$ and $M$.
Remark 1. The results given in Theorem 2.2 and Corollary 2.1 show that the scale parameter $\sigma$ can effectively control the convergence rate of $\|f_{T+1}-f_\lambda^\sigma\|_K$, which is usually referred to as the sample error. Depending on the circumstances, the sample error bound can be greatly improved by choosing the parameter $\sigma$ properly. In fact, take $\sigma=\lambda=T^{-\alpha}$. Then by (2.1), we have $E\big[\|f_{T+1}-f_\lambda^\sigma\|_K^2\big]=O(T^{-\theta})$, which is better than the sample error bound $O(T^{-(\theta-\alpha)})$ given in [9].
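To make the last claim explicit, substituting $\sigma=\lambda=T^{-\alpha}$ into (2.1) gives
$$E\big[\|f_{T+1}-f_\lambda^\sigma\|_K^2\big]\le C_{M,C,\kappa,\theta}\,\frac{T^{-\alpha}}{T^{\theta-\alpha}}=C_{M,C,\kappa,\theta}\,T^{-\theta}.$$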
The results above mainly describe the convergence rate of the sample error. However, in studying the learning performance of learning algorithms, we are often interested in the excess generalization error $\mathcal{E}_\sigma(f_{T+1})-\mathcal{E}_\sigma(f_\sigma)$. Define
$$\mathcal{K}(f,\lambda)=\inf_{g\in\mathcal{H}_K}\Big\{\mathcal{E}_\sigma(g)-\mathcal{E}_\sigma(f)+\frac{\lambda}{2}\|g\|_K^2\Big\},$$
which is often used to denote the approximation error, whose convergence is determined by the capacity of $\mathcal{H}_K$. We assume that this $K$-functional satisfies the following decay:
$$\mathcal{K}(f_\sigma,\lambda)=O(\lambda^{\beta}),\qquad\lambda\to0^{+}, \qquad (2.2)$$
with $0<\beta\le1$.
By combining the sample error with the approximation error, we obtain the overall learning rate stated as follows.
Corollary 2.2. Let $\mathcal{E}_\sigma(f)$ be the generalization error defined as in (1.3), and let $\{f_t: t=1,\dots,T+1\}$ be the sequence produced by the algorithm (1.6). For $\theta\in(0,1)$ and $0<\alpha\le\min\{1-\theta,\theta\}$, take $\lambda=T^{-\alpha}$. If the stepsize is chosen as $\eta_t=\frac{1}{C}t^{-\theta}$, then it holds that
$$E\big[\mathcal{E}_\sigma(f_{T+1})-\mathcal{E}_\sigma(f_\sigma)\big]=O\left(\frac{\sigma^{3/2}}{T^{\frac{\theta-\alpha}{2}}}+\frac{1}{T^{\beta\alpha}}\right). \qquad (2.3)$$
Now, we compare the performance of the algorithm (1.6) with that of the online kernel regularized learning algorithm based on the least square loss.
In [22], the online kernel regularized learning algorithm with the least-square loss is studied, and learning rates are established under some assumptions. Namely, for $0<\beta\le1$ and $0<\delta<\frac{\beta}{\beta+1}$, there holds
$$E\big[\mathcal{E}(f_{T+1})-\mathcal{E}(f_\rho)\big]=O\Big(T^{\delta-\frac{\beta}{\beta+1}}\Big), \qquad (2.4)$$
and for $\frac12<\beta\le1$ and $0<\delta<\frac{2\beta-1}{2\beta+1}$, there holds
$$E\big[\mathcal{E}(f_{T+1})-\mathcal{E}(f_\rho)\big]=O\Big(T^{\delta-\frac{2\beta-1}{2\beta+1}}\Big). \qquad (2.5)$$
For $0<\beta\le1$, we choose $\sigma=T^{-\frac{\mu}{3}}$ with $0<\mu\le\frac{2\beta-1}{2\beta+1}$. Take $\lambda=T^{\frac{\delta}{2\beta}-\frac{1}{2(\beta+1)}-\frac{\mu}{2\beta+1}}$ and $\eta_t=\frac{1}{4\kappa^2+5}\,t^{\frac{2\beta+1}{2\beta}\delta-\frac{2\beta+1}{2(\beta+1)}}$ with $\frac{\beta}{\beta+1}-\frac{2\beta}{2\beta+1}\mu<\delta<\frac{\beta}{\beta+1}+\frac{2\beta}{2\beta+1}\mu$. By Corollary 2.2, we have
$$E\big[\mathcal{E}_\sigma(f_{T+1})-\mathcal{E}_\sigma(f_\sigma)\big]=O\Big(T^{\frac12\left(\delta-\frac{\beta}{\beta+1}\right)-\frac{\beta}{2\beta+1}\mu}\Big). \qquad (2.6)$$
And for $\frac12<\beta\le1$, we choose $\sigma^3=\lambda=T^{\frac{2\beta}{2\beta-1}\delta-\frac{2\beta}{2\beta+1}}$ with $0<\delta<\frac{2\beta-1}{2\beta+1}$. Take $\eta_t=\frac{1}{4\kappa^2+5}\,t^{\frac{2\beta}{2\beta-1}\delta-\frac{2\beta}{2\beta+1}}$. By Corollary 2.2, we have that
$$E\big[\mathcal{E}_\sigma(f_{T+1})-\mathcal{E}_\sigma(f_\sigma)\big]=O\Big(T^{\frac{\beta}{2\beta-1}\delta-\frac{1}{2\beta+1}}\Big). \qquad (2.7)$$
The rate (2.6) is better than (2.4), and (2.7) is better than (2.5). The analysis results illustrate that the convergence rate of the algorithm (1.6) can be improved by choosing the parameter σ appropriately.
Lemma 3.1. Let $\{f_t: t=1,\dots,T+1\}$ be the sequence produced by the algorithm (1.6). If $\eta_t\le\frac{1}{\lambda}$, then, for any $t=1,\dots,T+1$, it holds that
$$\|f_t\|_K\le\frac{\kappa\sigma}{\lambda}. \qquad (3.1)$$
Proof. We prove (3.1) by induction on $t$. The inequality (3.1) holds for $t=1$ because of the initialization $f_1=0$. Suppose the bound (3.1) holds for some $t$, and consider $f_{t+1}$. The iteration (1.6) can be rewritten as
$$f_{t+1}=f_t+\frac{y_t-f_t(x_t)}{\sqrt{1+\left(\frac{y_t-f_t(x_t)}{\sigma}\right)^2}}\,\eta_tK_{x_t}-\lambda\eta_tf_t=(1-\lambda\eta_t)f_t+\frac{y_t-f_t(x_t)}{\sqrt{1+\left(\frac{y_t-f_t(x_t)}{\sigma}\right)^2}}\,\eta_tK_{x_t}. \qquad (3.2)$$
This implies that
$$\|f_{t+1}\|_K\le(1-\lambda\eta_t)\|f_t\|_K+\kappa\eta_t\left|\frac{y_t-f_t(x_t)}{\sqrt{1+\left(\frac{y_t-f_t(x_t)}{\sigma}\right)^2}}\right|. \qquad (3.3)$$
Since $\left|\frac{r}{\sqrt{1+(r/\sigma)^2}}\right|=\frac{\sigma|r|}{\sqrt{\sigma^2+r^2}}\le\sigma$ for every $r\in\mathbb{R}$, we have in particular
$$\left|\frac{y_t-f_t(x_t)}{\sqrt{1+\left(\frac{y_t-f_t(x_t)}{\sigma}\right)^2}}\right|\le\sigma. \qquad (3.4)$$
Combined with the induction hypothesis $\|f_t\|_K\le\frac{\kappa\sigma}{\lambda}$, this gives
$$\|f_{t+1}\|_K\le(1-\lambda\eta_t)\|f_t\|_K+\kappa\sigma\eta_t\le(1-\lambda\eta_t)\frac{\kappa\sigma}{\lambda}+\kappa\sigma\eta_t=\frac{\kappa\sigma}{\lambda}. \qquad (3.5)$$
This completes the proof.
Lemma 3.2. Assume $f_\lambda^\sigma$ is defined as in (1.5). Then, for any $f\in\mathcal{H}_K$, it holds that
$$\int_Z\frac{y-f_\lambda^\sigma(x)}{\sqrt{1+\left(\frac{y-f_\lambda^\sigma(x)}{\sigma}\right)^2}}\big(f_\lambda^\sigma(x)-f(x)\big)\,d\rho=\lambda\big\langle f_\lambda^\sigma,\,f_\lambda^\sigma-f\big\rangle_K.$$
Proof. By Taylor's formula, for any $u,v\in\mathbb{R}$, we have
$$V_\sigma(u)-V_\sigma(v)=V_\sigma'(v)(u-v)+\frac12V_\sigma''(\xi)(u-v)^2, \qquad (3.6)$$
where $\xi\in\mathbb{R}$ lies between $u$ and $v$.
Note that $V_\sigma''(\xi)=\frac{1}{\left(1+(\xi/\sigma)^2\right)^{3/2}}>0$. Then,
$$V_\sigma(u)-V_\sigma(v)\ge V_\sigma'(v)(u-v)=\frac{v(u-v)}{\sqrt{1+(v/\sigma)^2}}. \qquad (3.7)$$
Therefore, for any $f,g\in\mathcal{H}_K$, we have
$$\begin{aligned}\mathcal{E}_\sigma(f)-\mathcal{E}_\sigma(g)&=\int_ZV_\sigma(y-f(x))\,d\rho-\int_ZV_\sigma(y-g(x))\,d\rho\ge-\int_Z\frac{y-g(x)}{\sqrt{1+\left(\frac{y-g(x)}{\sigma}\right)^2}}\big(f(x)-g(x)\big)\,d\rho\\&=\Big\langle f-g,\,-\int_Z\frac{y-g(x)}{\sqrt{1+\left(\frac{y-g(x)}{\sigma}\right)^2}}\,K_x\,d\rho\Big\rangle_K=\big\langle f-g,\,\nabla_g\mathcal{E}_\sigma(g)\big\rangle_K.\end{aligned} \qquad (3.8)$$
By (i) of Lemma 5.1 in [23], we know that $\mathcal{E}_\sigma(f)$ is a convex function on $\mathcal{H}_K$, and $\|f\|_K^2$ is a strictly convex function on $\mathcal{H}_K$. Then $\Omega_\sigma(f)=\mathcal{E}_\sigma(f)+\frac{\lambda}{2}\|f\|_K^2$ is a convex function on $\mathcal{H}_K$. Based on (ii) of Lemma 5.1 in [23], we have
$$0=\nabla_f\Omega_\sigma(f)\big|_{f=f_\lambda^\sigma}=\nabla_f\mathcal{E}_\sigma(f)\big|_{f=f_\lambda^\sigma}+\lambda f_\lambda^\sigma=-\int_Z\frac{y-f_\lambda^\sigma(x)}{\sqrt{1+\left(\frac{y-f_\lambda^\sigma(x)}{\sigma}\right)^2}}\,K_x\,d\rho+\lambda f_\lambda^\sigma. \qquad (3.9)$$
Taking the inner product with $f-f_\lambda^\sigma$ on both sides of the above formula, we get
$$\begin{aligned}0&=\Big\langle-\int_Z\frac{y-f_\lambda^\sigma(x)}{\sqrt{1+\left(\frac{y-f_\lambda^\sigma(x)}{\sigma}\right)^2}}\,K_x\,d\rho+\lambda f_\lambda^\sigma,\;f-f_\lambda^\sigma\Big\rangle_K\\&=-\int_Z\frac{y-f_\lambda^\sigma(x)}{\sqrt{1+\left(\frac{y-f_\lambda^\sigma(x)}{\sigma}\right)^2}}\big\langle K_x,\,f-f_\lambda^\sigma\big\rangle_K\,d\rho+\lambda\big\langle f_\lambda^\sigma,\,f-f_\lambda^\sigma\big\rangle_K\\&=\int_Z\frac{y-f_\lambda^\sigma(x)}{\sqrt{1+\left(\frac{y-f_\lambda^\sigma(x)}{\sigma}\right)^2}}\big(f_\lambda^\sigma(x)-f(x)\big)\,d\rho+\lambda\big\langle f_\lambda^\sigma,\,f-f_\lambda^\sigma\big\rangle_K.\end{aligned}$$
This proves our conclusion.
Lemma 3.3. Let $f_\lambda^\sigma$ be defined as in (1.5). For any $f\in\mathcal{H}_K$, denote $\Omega_\sigma(f)=\mathcal{E}_\sigma(f)+\frac{\lambda}{2}\|f\|_K^2$. Then, it holds that
$$\frac{\lambda}{2}\|f-f_\lambda^\sigma\|_K^2\le\Omega_\sigma(f)-\Omega_\sigma(f_\lambda^\sigma).$$
Proof. For any $f\in\mathcal{H}_K$, we define $f^{(\theta)}=f_\lambda^\sigma+\theta(f-f_\lambda^\sigma)$, $\theta\in[0,1]$. Then $f^{(0)}=f_\lambda^\sigma$ and $f^{(1)}=f$. Denote $F(\theta)=\Omega_\sigma\big(f^{(\theta)}\big)=\int_ZV_\sigma\big(y-f^{(\theta)}(x)\big)\,d\rho+\frac{\lambda}{2}\|f^{(\theta)}\|_K^2$; then $F(1)=\Omega_\sigma(f)$ and $F(0)=\Omega_\sigma(f_\lambda^\sigma)$. Since $V_\sigma$ is differentiable, $F(\theta)$ is differentiable as a function of $\theta$, and for any $\theta\in[0,1]$ we have
$$\begin{aligned}F'(\theta)&=\lim_{\Delta\theta\to0}\frac{F(\theta+\Delta\theta)-F(\theta)}{\Delta\theta}=\lim_{\Delta\theta\to0}\frac{\Omega_\sigma\big(f^{(\theta+\Delta\theta)}\big)-\Omega_\sigma\big(f^{(\theta)}\big)}{\Delta\theta}\\&=\lim_{\Delta\theta\to0}\frac{1}{\Delta\theta}\bigg(\int_Z\Big(V_\sigma\big(y-f^{(\theta+\Delta\theta)}(x)\big)-V_\sigma\big(y-f^{(\theta)}(x)\big)\Big)d\rho+\frac{\lambda}{2}\|f^{(\theta+\Delta\theta)}\|_K^2-\frac{\lambda}{2}\|f^{(\theta)}\|_K^2\bigg)\\&=\lim_{\Delta\theta\to0}\frac{1}{\Delta\theta}\bigg(\int_Z\Big(V_\sigma\big(y-f^{(\theta)}(x)-\Delta\theta(f(x)-f_\lambda^\sigma(x))\big)-V_\sigma\big(y-f^{(\theta)}(x)\big)\Big)d\rho\\&\qquad\qquad\qquad+\frac{\lambda}{2}\Big(\big\|f^{(\theta)}+\Delta\theta(f-f_\lambda^\sigma)\big\|_K^2-\|f^{(\theta)}\|_K^2\Big)\bigg).\end{aligned} \qquad (3.10)$$
By the mean value theorem, there holds
$$V_\sigma\big(y-f^{(\theta)}(x)-\Delta\theta(f(x)-f_\lambda^\sigma(x))\big)-V_\sigma\big(y-f^{(\theta)}(x)\big)=\Delta\theta\,V_\sigma'(\xi)\big(f_\lambda^\sigma(x)-f(x)\big), \qquad (3.11)$$
where $\xi$ lies between $y-f^{(\theta)}(x)-\Delta\theta\big(f(x)-f_\lambda^\sigma(x)\big)$ and $y-f^{(\theta)}(x)$.
This, in connection with $\big\|f^{(\theta)}+\Delta\theta(f-f_\lambda^\sigma)\big\|_K^2-\|f^{(\theta)}\|_K^2=2\Delta\theta\big\langle f-f_\lambda^\sigma,f^{(\theta)}\big\rangle_K+(\Delta\theta)^2\|f-f_\lambda^\sigma\|_K^2$ and (3.10), tells us that
$$\begin{aligned}F'(\theta)&=\lim_{\Delta\theta\to0}\Big(\int_ZV_\sigma'(\xi)\big(f_\lambda^\sigma(x)-f(x)\big)d\rho+\lambda\big\langle f-f_\lambda^\sigma,f^{(\theta)}\big\rangle_K+\frac{\lambda}{2}\Delta\theta\|f-f_\lambda^\sigma\|_K^2\Big)\\&=\int_ZV_\sigma'\big(y-f^{(\theta)}(x)\big)\big(f_\lambda^\sigma(x)-f(x)\big)d\rho+\lambda\big\langle f-f_\lambda^\sigma,f^{(\theta)}\big\rangle_K\\&=\int_ZV_\sigma'\big((y-f_\lambda^\sigma(x))+\theta(f_\lambda^\sigma(x)-f(x))\big)\big(f_\lambda^\sigma(x)-f(x)\big)d\rho+\lambda\big\langle f-f_\lambda^\sigma,\,f_\lambda^\sigma+\theta(f-f_\lambda^\sigma)\big\rangle_K\\&=\int_ZV_\sigma'\big((y-f_\lambda^\sigma(x))+\theta(f_\lambda^\sigma(x)-f(x))\big)\big(f_\lambda^\sigma(x)-f(x)\big)d\rho+\lambda\big\langle f-f_\lambda^\sigma,f_\lambda^\sigma\big\rangle_K+\theta\lambda\|f-f_\lambda^\sigma\|_K^2.\end{aligned} \qquad (3.12)$$
By Lemma 3.2, we see that
$$\lambda\big\langle f-f_\lambda^\sigma,\,f_\lambda^\sigma\big\rangle_K=-\int_Z\frac{y-f_\lambda^\sigma(x)}{\sqrt{1+\left(\frac{y-f_\lambda^\sigma(x)}{\sigma}\right)^2}}\big(f_\lambda^\sigma(x)-f(x)\big)d\rho=-\int_ZV_\sigma'\big(y-f_\lambda^\sigma(x)\big)\big(f_\lambda^\sigma(x)-f(x)\big)d\rho. \qquad (3.13)$$
On the other hand, since $V_\sigma(u)$ is a convex function on $\mathbb{R}$, its derivative $V_\sigma'$ is non-decreasing; by the discussions in [24], we know that
$$\Big(V_\sigma'\big((y-f_\lambda^\sigma(x))+\theta(f_\lambda^\sigma(x)-f(x))\big)-V_\sigma'\big(y-f_\lambda^\sigma(x)\big)\Big)\big(f_\lambda^\sigma(x)-f(x)\big)\ge0. \qquad (3.14)$$
Therefore, for $\theta\in(0,1)$, we have
$$F'(\theta)=\int_Z\Big(V_\sigma'\big((y-f_\lambda^\sigma(x))+\theta(f_\lambda^\sigma(x)-f(x))\big)-V_\sigma'\big(y-f_\lambda^\sigma(x)\big)\Big)\big(f_\lambda^\sigma(x)-f(x)\big)d\rho+\lambda\theta\|f-f_\lambda^\sigma\|_K^2\ge\lambda\theta\|f-f_\lambda^\sigma\|_K^2. \qquad (3.15)$$
By the definition of $f_\lambda^\sigma$, we know that $F(\theta)\ge F(0)=\Omega_\sigma(f_\lambda^\sigma)$ for $\theta\in[0,1]$. Therefore, (3.15) implies that
$$\Omega_\sigma(f)-\Omega_\sigma(f_\lambda^\sigma)=F(1)-F(0)=\int_0^1F'(\theta)\,d\theta\ge\int_0^1\lambda\theta\|f-f_\lambda^\sigma\|_K^2\,d\theta=\lambda\|f-f_\lambda^\sigma\|_K^2\int_0^1\theta\,d\theta=\frac{\lambda}{2}\|f-f_\lambda^\sigma\|_K^2.$$
The proof is completed.
Lemma 3.4. Let $f_\lambda^\sigma$ be defined as in (1.5), and let $\{f_t: t=1,\dots,T+1\}$ be the sequence produced by the algorithm (1.6). If $\lambda>0$ and $\eta_t\le\frac{1}{\lambda}$, then
$$E_{z_1,\dots,z_T}\big[\|f_{T+1}-f_\lambda^\sigma\|_K^2\big]\le4\kappa^2\sigma^2\sum_{t=1}^{T}\eta_t^2\prod_{j=t+1}^{T}(1-\lambda\eta_j)+\frac{2M\sigma}{\lambda}\prod_{t=1}^{T}(1-\lambda\eta_t). \qquad (3.16)$$
Furthermore, there holds
$$E_{z_1,\dots,z_T}\big[\|f_{T+1}-f_\lambda^\sigma\|_K^2\big]\le4\kappa^2\sigma^2\sum_{t=1}^{T}\eta_t^2\exp\Big\{-\lambda\sum_{j=t+1}^{T}\eta_j\Big\}+\frac{2M\sigma}{\lambda}\exp\Big\{-\lambda\sum_{t=1}^{T}\eta_t\Big\}. \qquad (3.17)$$
Proof. According to the algorithm (1.6), we know
$$\begin{aligned}\|f_{t+1}-f_\lambda^\sigma\|_K^2&=\|f_t-f_\lambda^\sigma\|_K^2+\eta_t^2\Big\|\lambda f_t-\frac{y_t-f_t(x_t)}{\sqrt{1+\left(\frac{y_t-f_t(x_t)}{\sigma}\right)^2}}K_{x_t}\Big\|_K^2+2\eta_t\Big\langle\lambda f_t-\frac{y_t-f_t(x_t)}{\sqrt{1+\left(\frac{y_t-f_t(x_t)}{\sigma}\right)^2}}K_{x_t},\;f_\lambda^\sigma-f_t\Big\rangle_K\\&=\|f_t-f_\lambda^\sigma\|_K^2+\eta_t^2\Big\|\lambda f_t-\frac{y_t-f_t(x_t)}{\sqrt{1+\left(\frac{y_t-f_t(x_t)}{\sigma}\right)^2}}K_{x_t}\Big\|_K^2+2\eta_tA,\end{aligned} \qquad (3.18)$$
where $A=\Big\langle\lambda f_t-\frac{y_t-f_t(x_t)}{\sqrt{1+\left(\frac{y_t-f_t(x_t)}{\sigma}\right)^2}}K_{x_t},\;f_\lambda^\sigma-f_t\Big\rangle_K$.
By using the inequality $\langle a,b-a\rangle_K\le\frac12\big(\|b\|_K^2-\|a\|_K^2\big)$, $a,b\in\mathcal{H}_K$, with $a=f_t$ and $b=f_\lambda^\sigma$, we have
$$\begin{aligned}A&=\lambda\langle f_t,f_\lambda^\sigma-f_t\rangle_K-\Big\langle\frac{y_t-f_t(x_t)}{\sqrt{1+\left(\frac{y_t-f_t(x_t)}{\sigma}\right)^2}}K_{x_t},\;f_\lambda^\sigma-f_t\Big\rangle_K\\&\le\frac{\lambda}{2}\big(\|f_\lambda^\sigma\|_K^2-\|f_t\|_K^2\big)-\frac{y_t-f_t(x_t)}{\sqrt{1+\left(\frac{y_t-f_t(x_t)}{\sigma}\right)^2}}\big(f_\lambda^\sigma(x_t)-f_t(x_t)\big)\\&\le\frac{\lambda}{2}\big(\|f_\lambda^\sigma\|_K^2-\|f_t\|_K^2\big)+V_\sigma\big(y_t-f_\lambda^\sigma(x_t)\big)-V_\sigma\big(y_t-f_t(x_t)\big)\\&=\Big(V_\sigma\big(y_t-f_\lambda^\sigma(x_t)\big)+\frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\Big)-\Big(V_\sigma\big(y_t-f_t(x_t)\big)+\frac{\lambda}{2}\|f_t\|_K^2\Big),\end{aligned} \qquad (3.19)$$
where the second inequality follows from (3.7) with $u=y_t-f_\lambda^\sigma(x_t)$ and $v=y_t-f_t(x_t)$.
Since $f_t$ depends only on $\{z_1,\dots,z_{t-1}\}$ and not on $z_t$, it follows that
$$\begin{aligned}E_{z_1,\dots,z_t}(A)&\le E_{z_1,\dots,z_{t-1}}\Big[E_{z_t}\Big[\Big(V_\sigma\big(y_t-f_\lambda^\sigma(x_t)\big)+\frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\Big)-\Big(V_\sigma\big(y_t-f_t(x_t)\big)+\frac{\lambda}{2}\|f_t\|_K^2\Big)\Big]\Big]\\&=\mathcal{E}_\sigma(f_\lambda^\sigma)+\frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2-E_{z_1,\dots,z_{t-1}}\Big[\mathcal{E}_\sigma(f_t)+\frac{\lambda}{2}\|f_t\|_K^2\Big].\end{aligned} \qquad (3.20)$$
Combining (3.18) with (3.20), we have
$$\begin{aligned}E_{z_1,\dots,z_t}\big[\|f_{t+1}-f_\lambda^\sigma\|_K^2\big]&\le E_{z_1,\dots,z_{t-1}}\big[\|f_t-f_\lambda^\sigma\|_K^2\big]+\eta_t^2E_{z_1,\dots,z_t}\Big[\Big\|\lambda f_t-\frac{y_t-f_t(x_t)}{\sqrt{1+\left(\frac{y_t-f_t(x_t)}{\sigma}\right)^2}}K_{x_t}\Big\|_K^2\Big]\\&\quad+2\eta_tE_{z_1,\dots,z_{t-1}}\Big[\Big(\mathcal{E}_\sigma(f_\lambda^\sigma)+\frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\Big)-\Big(\mathcal{E}_\sigma(f_t)+\frac{\lambda}{2}\|f_t\|_K^2\Big)\Big]\\&=E_{z_1,\dots,z_{t-1}}\big[\|f_t-f_\lambda^\sigma\|_K^2\big]+\eta_t^2E_{z_1,\dots,z_t}\Big[\Big\|\lambda f_t-\frac{y_t-f_t(x_t)}{\sqrt{1+\left(\frac{y_t-f_t(x_t)}{\sigma}\right)^2}}K_{x_t}\Big\|_K^2\Big]+2\eta_tE_{z_1,\dots,z_{t-1}}\big[\Omega_\sigma(f_\lambda^\sigma)-\Omega_\sigma(f_t)\big].\end{aligned} \qquad (3.21)$$
According to Lemma 3.3, we know
$$\Omega_\sigma(f_\lambda^\sigma)-\Omega_\sigma(f_t)\le-\frac{\lambda}{2}\|f_t-f_\lambda^\sigma\|_K^2. \qquad (3.22)$$
Therefore, (3.21) implies that
$$\begin{aligned}E_{z_1,\dots,z_t}\big[\|f_{t+1}-f_\lambda^\sigma\|_K^2\big]&\le E_{z_1,\dots,z_{t-1}}\big[\|f_t-f_\lambda^\sigma\|_K^2\big]+\eta_t^2E_{z_1,\dots,z_t}\Big[\Big\|\lambda f_t-\frac{y_t-f_t(x_t)}{\sqrt{1+\left(\frac{y_t-f_t(x_t)}{\sigma}\right)^2}}K_{x_t}\Big\|_K^2\Big]-\lambda\eta_tE_{z_1,\dots,z_{t-1}}\big[\|f_t-f_\lambda^\sigma\|_K^2\big]\\&=(1-\lambda\eta_t)E_{z_1,\dots,z_{t-1}}\big[\|f_t-f_\lambda^\sigma\|_K^2\big]+\eta_t^2E_{z_1,\dots,z_t}\Big[\Big\|\lambda f_t-\frac{y_t-f_t(x_t)}{\sqrt{1+\left(\frac{y_t-f_t(x_t)}{\sigma}\right)^2}}K_{x_t}\Big\|_K^2\Big].\end{aligned} \qquad (3.23)$$
By Lemma 3.1, we know
$$\Big\|\lambda f_t-\frac{y_t-f_t(x_t)}{\sqrt{1+\left(\frac{y_t-f_t(x_t)}{\sigma}\right)^2}}K_{x_t}\Big\|_K^2\le\Bigg(\lambda\|f_t\|_K+\kappa\bigg|\frac{y_t-f_t(x_t)}{\sqrt{1+\left(\frac{y_t-f_t(x_t)}{\sigma}\right)^2}}\bigg|\Bigg)^2\le\Big(\lambda\cdot\frac{\kappa\sigma}{\lambda}+\kappa\sigma\Big)^2=4\kappa^2\sigma^2. \qquad (3.24)$$
Substituting (3.24) into (3.23), we obtain
$$E_{z_1,\dots,z_t}\big[\|f_{t+1}-f_\lambda^\sigma\|_K^2\big]\le(1-\lambda\eta_t)E_{z_1,\dots,z_{t-1}}\big[\|f_t-f_\lambda^\sigma\|_K^2\big]+4\kappa^2\sigma^2\eta_t^2.$$
Applying the relation iteratively for t=T,T−1,...,1, we have
$$\begin{aligned}E_{z_1,\dots,z_T}\big[\|f_{T+1}-f_\lambda^\sigma\|_K^2\big]&\le(1-\lambda\eta_T)E_{z_1,\dots,z_{T-1}}\big[\|f_T-f_\lambda^\sigma\|_K^2\big]+4\kappa^2\sigma^2\eta_T^2\\&\le(1-\lambda\eta_T)\Big((1-\lambda\eta_{T-1})E_{z_1,\dots,z_{T-2}}\big[\|f_{T-1}-f_\lambda^\sigma\|_K^2\big]+4\kappa^2\sigma^2\eta_{T-1}^2\Big)+4\kappa^2\sigma^2\eta_T^2\\&\le\cdots\le4\kappa^2\sigma^2\sum_{t=1}^{T}\eta_t^2\prod_{j=t+1}^{T}(1-\lambda\eta_j)+\prod_{t=1}^{T}(1-\lambda\eta_t)\,E\big[\|f_1-f_\lambda^\sigma\|_K^2\big],\end{aligned}$$
where we use the convention $\prod_{j=T+1}^{T}(1-\lambda\eta_j)=1$. From the definition of $f_\lambda^\sigma$, we see that
$$\frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\le\mathcal{E}_\sigma(0)\le M\sigma. \qquad (3.25)$$
This, in connection with the initialization $f_1=0$, yields
$$E_{z_1,\dots,z_T}\big[\|f_{T+1}-f_\lambda^\sigma\|_K^2\big]\le4\kappa^2\sigma^2\sum_{t=1}^{T}\eta_t^2\prod_{j=t+1}^{T}(1-\lambda\eta_j)+\prod_{t=1}^{T}(1-\lambda\eta_t)\,\|f_\lambda^\sigma\|_K^2\le4\kappa^2\sigma^2\sum_{t=1}^{T}\eta_t^2\prod_{j=t+1}^{T}(1-\lambda\eta_j)+\frac{2M\sigma}{\lambda}\prod_{t=1}^{T}(1-\lambda\eta_t).$$
This shows (3.16); (3.17) then follows from (3.16) and the inequality $1-u\le e^{-u}$ for any $u\ge0$.
Proof of Theorem 2.1. It is easy to see that
$$\prod_{t=1}^{T}(1-\lambda\eta_t)\le\exp\Big\{-\lambda\sum_{t=1}^{T}\eta_t\Big\}\to0\qquad(T\to+\infty).$$
This implies that, for any $\varepsilon>0$, there exists some $T_1\in\mathbb{N}$ such that
$$\prod_{t=1}^{T}(1-\lambda\eta_t)\le\varepsilon,$$
whenever $T\ge T_1$.
According to the assumption $\lim_{t\to+\infty}\eta_t=0$, there exists some $t(\varepsilon)\in\mathbb{N}$ such that $\eta_t\le\lambda\varepsilon$ for every $t\ge t(\varepsilon)$. Furthermore, since the sum below telescopes, we have
$$\begin{aligned}\sum_{t=t(\varepsilon)+1}^{T}\eta_t^2\prod_{j=t+1}^{T}(1-\lambda\eta_j)&\le\lambda\varepsilon\sum_{t=t(\varepsilon)+1}^{T}\eta_t\prod_{j=t+1}^{T}(1-\lambda\eta_j)=\varepsilon\sum_{t=t(\varepsilon)+1}^{T}\big(1-(1-\lambda\eta_t)\big)\prod_{j=t+1}^{T}(1-\lambda\eta_j)\\&=\varepsilon\sum_{t=t(\varepsilon)+1}^{T}\Big(\prod_{j=t+1}^{T}(1-\lambda\eta_j)-\prod_{j=t}^{T}(1-\lambda\eta_j)\Big)=\varepsilon\Big(1-\prod_{j=t(\varepsilon)+1}^{T}(1-\lambda\eta_j)\Big)\le\varepsilon.\end{aligned} \qquad (3.26)$$
Since $t(\varepsilon)$ is fixed, there exists some $T_2\in\mathbb{N}$ such that, for every $T\ge T_2$, it holds that
$$\sum_{j=t(\varepsilon)+1}^{T}\eta_j\ge\sum_{j=t(\varepsilon)+1}^{T_2}\eta_j\ge\frac{1}{\lambda}\log\frac{t(\varepsilon)}{\lambda^2\varepsilon}.$$
So, for any $1\le t\le t(\varepsilon)$, we have
$$\prod_{j=t+1}^{T}(1-\lambda\eta_j)\le\exp\Big\{-\sum_{j=t+1}^{T}\lambda\eta_j\Big\}\le\exp\Big\{-\sum_{j=t(\varepsilon)+1}^{T}\lambda\eta_j\Big\}\le\frac{\lambda^2\varepsilon}{t(\varepsilon)}.$$
Hence
$$\sum_{t=1}^{t(\varepsilon)}\eta_t^2\prod_{j=t+1}^{T}(1-\lambda\eta_j)\le\frac{\lambda^2\varepsilon}{t(\varepsilon)}\sum_{t=1}^{t(\varepsilon)}\eta_t^2\le\varepsilon. \qquad (3.27)$$
From (3.26) and (3.27), we know that for any $\varepsilon>0$ there exists some $T_2\in\mathbb{N}$ such that
$$\sum_{t=1}^{T}\eta_t^2\prod_{j=t+1}^{T}(1-\lambda\eta_j)=\sum_{t=1}^{t(\varepsilon)}\eta_t^2\prod_{j=t+1}^{T}(1-\lambda\eta_j)+\sum_{t=t(\varepsilon)+1}^{T}\eta_t^2\prod_{j=t+1}^{T}(1-\lambda\eta_j)\le\varepsilon+\varepsilon=2\varepsilon, \qquad (3.28)$$
whenever $T\ge T_2$. Let $T'=\max\{T_1,T_2\}$; then by (3.16), (3.26) and (3.27) we have
$$E_{z_1,\dots,z_T}\big[\|f_{T+1}-f_\lambda^\sigma\|_K^2\big]\le\Big(8\kappa^2\sigma^2+\frac{2M\sigma}{\lambda}\Big)\varepsilon,$$
whenever $T\ge T'$. Since $E_{z_1,\dots,z_T}\big[\|f_{T+1}-f_\lambda^\sigma\|_K\big]\le\big(E_{z_1,\dots,z_T}\big[\|f_{T+1}-f_\lambda^\sigma\|_K^2\big]\big)^{1/2}$ by Jensen's inequality, this completes the proof of Theorem 2.1.
Proof of Theorem 2.2. By (3.21), we know
$$\begin{aligned}E_{z_1,\dots,z_t}\big[\|f_{t+1}-f_\lambda^\sigma\|_K^2\big]&\le E_{z_1,\dots,z_{t-1}}\big[\|f_t-f_\lambda^\sigma\|_K^2\big]+\eta_t^2E_{z_1,\dots,z_t}\Big[\Big\|\lambda f_t-\frac{y_t-f_t(x_t)}{\sqrt{1+\left(\frac{y_t-f_t(x_t)}{\sigma}\right)^2}}K_{x_t}\Big\|_K^2\Big]\\&\quad+2\eta_tE_{z_1,\dots,z_{t-1}}\Big[\Big(\mathcal{E}_\sigma(f_\lambda^\sigma)+\frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\Big)-\Big(\mathcal{E}_\sigma(f_t)+\frac{\lambda}{2}\|f_t\|_K^2\Big)\Big].\end{aligned} \qquad (3.29)$$
From the inequality $\|a-b\|_K^2\le2\|a\|_K^2+2\|b\|_K^2$, we have
$$\Big\|\lambda f_t-\frac{y_t-f_t(x_t)}{\sqrt{1+\left(\frac{y_t-f_t(x_t)}{\sigma}\right)^2}}K_{x_t}\Big\|_K^2\le2\lambda^2\|f_t\|_K^2+2\bigg|\frac{y_t-f_t(x_t)}{\sqrt{1+\left(\frac{y_t-f_t(x_t)}{\sigma}\right)^2}}\bigg|^2\big\|K_{x_t}(\cdot)\big\|_K^2\le2\lambda^2\|f_t\|_K^2+2\kappa^2\bigg|\frac{y_t-f_t(x_t)}{\sqrt{1+\left(\frac{y_t-f_t(x_t)}{\sigma}\right)^2}}\bigg|^2. \qquad (3.30)$$
On the other hand, for any $r\in\mathbb{R}$, it holds that $\Big|\frac{r}{\sqrt{1+(r/\sigma)^2}}\Big|^2\le2\sigma^2\Big(\frac{\sqrt{\sigma^2+r^2}}{\sigma}-1\Big)=2V_\sigma(r)$. This implies that
$$\bigg|\frac{y_t-f_t(x_t)}{\sqrt{1+\left(\frac{y_t-f_t(x_t)}{\sigma}\right)^2}}\bigg|^2\le2V_\sigma\big(y_t-f_t(x_t)\big). \qquad (3.31)$$
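For the reader's convenience, the elementary inequality used above can be verified directly (this short verification is ours): with $u=\sqrt{\sigma^2+r^2}\ge\sigma$,
$$\Big|\frac{r}{\sqrt{1+(r/\sigma)^2}}\Big|^2=\frac{\sigma^2r^2}{\sigma^2+r^2}=\sigma(u-\sigma)\cdot\frac{\sigma(u+\sigma)}{u^2}\le2\sigma(u-\sigma)=2V_\sigma(r),$$
since $\frac{\sigma(u+\sigma)}{u^2}=\frac{\sigma}{u}+\frac{\sigma^2}{u^2}\le2$.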
Combining (3.30) with (3.31), we get
$$\begin{aligned}\Big\|\lambda f_t-\frac{y_t-f_t(x_t)}{\sqrt{1+\left(\frac{y_t-f_t(x_t)}{\sigma}\right)^2}}K_{x_t}\Big\|_K^2&\le4\kappa^2V_\sigma\big(y_t-f_t(x_t)\big)+2\lambda^2\|f_t\|_K^2\le4\kappa^2V_\sigma\big(y_t-f_t(x_t)\big)+2\lambda\|f_t\|_K^2\\&=4\kappa^2V_\sigma\big(y_t-f_t(x_t)\big)+4\cdot\frac{\lambda}{2}\|f_t\|_K^2\le4(\kappa^2+1)\Big(V_\sigma\big(y_t-f_t(x_t)\big)+\frac{\lambda}{2}\|f_t\|_K^2\Big),\end{aligned} \qquad (3.32)$$
where the second inequality uses $\lambda\le1$.
Substituting (3.32) into (3.29), we get
$$\begin{aligned}E_{z_1,\dots,z_t}\big[\|f_{t+1}-f_\lambda^\sigma\|_K^2\big]&\le E_{z_1,\dots,z_{t-1}}\big[\|f_t-f_\lambda^\sigma\|_K^2\big]+4(\kappa^2+1)\eta_t^2E_{z_1,\dots,z_t}\Big[V_\sigma\big(y_t-f_t(x_t)\big)+\frac{\lambda}{2}\|f_t\|_K^2\Big]\\&\quad+2\eta_tE_{z_1,\dots,z_{t-1}}\Big[\Big(\mathcal{E}_\sigma(f_\lambda^\sigma)+\frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\Big)-\Big(\mathcal{E}_\sigma(f_t)+\frac{\lambda}{2}\|f_t\|_K^2\Big)\Big]\\&=E_{z_1,\dots,z_{t-1}}\big[\|f_t-f_\lambda^\sigma\|_K^2\big]+4(\kappa^2+1)\eta_t^2E_{z_1,\dots,z_{t-1}}\Big[\mathcal{E}_\sigma(f_t)+\frac{\lambda}{2}\|f_t\|_K^2\Big]\\&\quad+2\eta_tE_{z_1,\dots,z_{t-1}}\Big[\Big(\mathcal{E}_\sigma(f_\lambda^\sigma)+\frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\Big)-\Big(\mathcal{E}_\sigma(f_t)+\frac{\lambda}{2}\|f_t\|_K^2\Big)\Big]\\&\le E_{z_1,\dots,z_{t-1}}\big[\|f_t-f_\lambda^\sigma\|_K^2\big]+4(\kappa^2+1)\eta_t^2E_{z_1,\dots,z_{t-1}}\Big[\mathcal{E}_\sigma(f_t)+\frac{\lambda}{2}\|f_t\|_K^2-\Big(\mathcal{E}_\sigma(f_\lambda^\sigma)+\frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\Big)\Big]\\&\quad+2\eta_tE_{z_1,\dots,z_{t-1}}\Big[\Big(\mathcal{E}_\sigma(f_\lambda^\sigma)+\frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\Big)-\Big(\mathcal{E}_\sigma(f_t)+\frac{\lambda}{2}\|f_t\|_K^2\Big)\Big]+4(\kappa^2+1)\eta_t^2\,\mathcal{E}_\sigma(0)\\&\le E_{z_1,\dots,z_{t-1}}\big[\|f_t-f_\lambda^\sigma\|_K^2\big]+2\eta_t\big(1-2(\kappa^2+1)\eta_t\big)E_{z_1,\dots,z_{t-1}}\Big[\Big(\mathcal{E}_\sigma(f_\lambda^\sigma)+\frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\Big)-\Big(\mathcal{E}_\sigma(f_t)+\frac{\lambda}{2}\|f_t\|_K^2\Big)\Big]\\&\quad+4(\kappa^2+1)M\sigma\eta_t^2\\&=E_{z_1,\dots,z_{t-1}}\big[\|f_t-f_\lambda^\sigma\|_K^2\big]+B+4(\kappa^2+1)M\sigma\eta_t^2,\end{aligned} \qquad (3.33)$$
where
$$B=2\eta_t\big(1-2(\kappa^2+1)\eta_t\big)E_{z_1,\dots,z_{t-1}}\Big[\Big(\mathcal{E}_\sigma(f_\lambda^\sigma)+\frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\Big)-\Big(\mathcal{E}_\sigma(f_t)+\frac{\lambda}{2}\|f_t\|_K^2\Big)\Big].$$
Based on the assumption $\eta_t\le\frac{1}{C}$ with $C\ge\lambda+4(\kappa^2+1)$, we know that $1-2(\kappa^2+1)\eta_t\ge\frac12$. And by Lemma 3.3, we know
$$E_{z_1,\dots,z_{t-1}}\Big[\Big(\mathcal{E}_\sigma(f_\lambda^\sigma)+\frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\Big)-\Big(\mathcal{E}_\sigma(f_t)+\frac{\lambda}{2}\|f_t\|_K^2\Big)\Big]\le-\frac{\lambda}{2}E_{z_1,\dots,z_{t-1}}\big[\|f_t-f_\lambda^\sigma\|_K^2\big].$$
This implies that
$$B\le-\frac{\lambda\eta_t}{2}E_{z_1,\dots,z_{t-1}}\big[\|f_t-f_\lambda^\sigma\|_K^2\big]. \qquad (3.34)$$
Combining (3.33) with (3.34), we obtain
$$E_{z_1,\dots,z_t}\big[\|f_{t+1}-f_\lambda^\sigma\|_K^2\big]\le\Big(1-\frac{\lambda\eta_t}{2}\Big)E_{z_1,\dots,z_{t-1}}\big[\|f_t-f_\lambda^\sigma\|_K^2\big]+4M\sigma(\kappa^2+1)\eta_t^2. \qquad (3.35)$$
Denote $C_\sigma=4M\sigma(\kappa^2+1)$. Applying this relation iteratively for $t=T,T-1,\dots,1$, and using $f_1=0$ together with (3.25), we have
$$\begin{aligned}E_{z_1,\dots,z_T}\big[\|f_{T+1}-f_\lambda^\sigma\|_K^2\big]&\le\Big(1-\frac{\lambda}{2}\eta_T\Big)E_{z_1,\dots,z_{T-1}}\big[\|f_T-f_\lambda^\sigma\|_K^2\big]+C_\sigma\eta_T^2\\&\le\Big(1-\frac{\lambda}{2}\eta_T\Big)\Big(\Big(1-\frac{\lambda}{2}\eta_{T-1}\Big)E_{z_1,\dots,z_{T-2}}\big[\|f_{T-1}-f_\lambda^\sigma\|_K^2\big]+C_\sigma\eta_{T-1}^2\Big)+C_\sigma\eta_T^2\\&\le\cdots\le C_\sigma\sum_{t=1}^{T}\eta_t^2\prod_{j=t+1}^{T}\Big(1-\frac{\lambda}{2}\eta_j\Big)+\prod_{t=1}^{T}\Big(1-\frac{\lambda}{2}\eta_t\Big)\|f_\lambda^\sigma\|_K^2\\&\le C_\sigma\sum_{t=1}^{T}\eta_t^2\prod_{j=t+1}^{T}\Big(1-\frac{\lambda}{2}\eta_j\Big)+\frac{2M\sigma}{\lambda}\prod_{t=1}^{T}\Big(1-\frac{\lambda}{2}\eta_t\Big).\end{aligned} \qquad (3.36)$$
Since $1-u\le e^{-u}$ for any $u\ge0$, (3.36) implies that
$$E_{z_1,\dots,z_T}\big[\|f_{T+1}-f_\lambda^\sigma\|_K^2\big]\le C_\sigma\sum_{t=1}^{T}\eta_t^2\exp\Big\{-\frac{\lambda}{2}\sum_{j=t+1}^{T}\eta_j\Big\}+\frac{2M\sigma}{\lambda}\exp\Big\{-\frac{\lambda}{2}\sum_{t=1}^{T}\eta_t\Big\}. \qquad (3.37)$$
Denote
$$I_1=\frac{2M\sigma}{\lambda}\exp\Big\{-\frac{\lambda}{2}\sum_{t=1}^{T}\eta_t\Big\}=D_\sigma(\lambda)\exp\Big\{-\frac{\lambda}{2}\sum_{t=1}^{T}\eta_t\Big\}=D_\sigma(\lambda)\exp\Big\{-\frac{\lambda}{2C}\sum_{t=1}^{T}t^{-\theta}\Big\}$$
and
$$I_2=C_\sigma\sum_{t=1}^{T}\Big(\frac{1}{C}t^{-\theta}\Big)^2\exp\Big\{-\frac{\lambda}{2C}\sum_{j=t+1}^{T}j^{-\theta}\Big\}=\frac{C_\sigma}{C^2}\sum_{t=1}^{T}t^{-2\theta}\exp\Big\{-\frac{\lambda}{2C}\sum_{j=t+1}^{T}j^{-\theta}\Big\}.$$
Then, by (3.37) and the assumptions on the stepsize $\eta_t$, we know
$$E_{z_1,\dots,z_T}\big[\|f_{T+1}-f_\lambda^\sigma\|_K^2\big]\le I_1+I_2. \qquad (3.38)$$
Now, we estimate $I_1$ and $I_2$ respectively. By Lemma 4 of [9], we obtain the following estimate of $I_1$:
$$I_1\le\begin{cases}D_\sigma(\lambda)\exp\Big\{-\dfrac{\lambda}{2C}\cdot\dfrac{1-2^{\theta-1}}{1-\theta}(T+1)^{1-\theta}\Big\}, & \text{if }0<\theta<1,\\[8pt]D_\sigma(\lambda)(T+1)^{-\frac{\lambda}{2C}}, & \text{if }\theta=1.\end{cases} \qquad (3.39)$$
On the other hand, by Lemma 5.10 of [23] with $\nu=\frac{\lambda}{2C}$ and $s=\theta$, we have
$$\sum_{t=1}^{T-1}t^{-2\theta}\exp\Big\{-\frac{\lambda}{2C}\sum_{j=t+1}^{T}j^{-\theta}\Big\}\le\begin{cases}\dfrac{36C}{\lambda T^{\theta}}+\dfrac{9T^{1-\theta}}{(1-\theta)2^{1-\theta}}\exp\Big\{-\dfrac{\lambda(1-2^{\theta-1})}{2C(1-\theta)}(T+1)^{1-\theta}\Big\}, & \text{if }0<\theta<1,\\[10pt]\dfrac{8}{1-\frac{\lambda}{2C}}(T+1)^{-\frac{\lambda}{2C}}, & \text{if }\theta=1.\end{cases}$$
So,
$$\begin{aligned}\sum_{t=1}^{T}t^{-2\theta}\exp\Big\{-\frac{\lambda}{2C}\sum_{j=t+1}^{T}j^{-\theta}\Big\}&\le\sum_{t=1}^{T-1}t^{-2\theta}\exp\Big\{-\frac{\lambda}{2C}\sum_{j=t+1}^{T}j^{-\theta}\Big\}+T^{-2\theta}\exp\Big\{-\frac{\lambda}{2C}\sum_{j=T+1}^{T}j^{-\theta}\Big\}\\&=\sum_{t=1}^{T-1}t^{-2\theta}\exp\Big\{-\frac{\lambda}{2C}\sum_{j=t+1}^{T}j^{-\theta}\Big\}+T^{-2\theta}\\&\le\begin{cases}\dfrac{36C}{\lambda T^{\theta}}+\dfrac{9T^{1-\theta}}{(1-\theta)2^{1-\theta}}\exp\Big\{-\dfrac{\lambda(1-2^{\theta-1})}{2C(1-\theta)}(T+1)^{1-\theta}\Big\}+T^{-2\theta}, & \text{if }0<\theta<1,\\[10pt]\dfrac{8C}{2C-\lambda}(T+1)^{-\frac{\lambda}{2C}}+T^{-2}, & \text{if }\theta=1.\end{cases}\end{aligned}$$
Furthermore, we have the following estimate of I2
$$I_2\le\begin{cases}\dfrac{C_\sigma}{C^2}\bigg(\dfrac{36C}{\lambda T^{\theta}}+\dfrac{9T^{1-\theta}}{(1-\theta)2^{1-\theta}}\exp\Big\{-\dfrac{\lambda(1-2^{\theta-1})}{2C(1-\theta)}(T+1)^{1-\theta}\Big\}+T^{-2\theta}\bigg), & \text{if }0<\theta<1,\\[10pt]\dfrac{C_\sigma}{C^2}\bigg(\dfrac{8C}{2C-\lambda}(T+1)^{-\frac{\lambda}{2C}}+T^{-2}\bigg), & \text{if }\theta=1.\end{cases} \qquad (3.40)$$
Since $T^{-2\theta}\le T^{-\theta}$ and $\lambda\le C$, the conclusion can be established by combining (3.38) with (3.39) and (3.40).
Proof of Corollary 2.1. For $\theta\in(0,1)$ and $0<\alpha\le\min\{1-\theta,\theta\}$, by Theorem 2.2 with $\lambda=T^{-\alpha}$, we have
$$E\big[\|f_{T+1}-f_\lambda^\sigma\|_K^2\big]\le\sigma\Bigg(\bigg(2MT^{\alpha}+\frac{36M(\kappa^2+1)}{C^2(1-\theta)2^{1-\theta}}T^{1-\theta}\bigg)\exp\Big\{-\frac{1-2^{\theta-1}}{2C(1-\theta)}T^{(1-\theta)-\alpha}\Big\}+\frac{36M(\kappa^2+1)}{C}T^{\alpha-\theta}\Bigg). \qquad (3.41)$$
Since for any $\nu>0$, $c>0$ and $\eta>0$ there exists $L>0$ such that $\exp\{-cT^{\nu}\}\le LT^{-\eta}$, the first term on the right-hand side of (3.41) decays like $O(\sigma T^{-\eta})$ for an arbitrarily large $\eta>0$. The second term on the right-hand side of (3.41) is bounded by $O(\sigma T^{-(\theta-\alpha)})$. Consequently, there exists a constant $C_{M,C,\kappa,\theta}$ depending only on $\theta$, $\kappa$, $C$ and $M$ such that
$$E\big[\|f_{T+1}-f_\lambda^\sigma\|_K^2\big]\le C_{M,C,\kappa,\theta}\,\frac{\sigma}{T^{\theta-\alpha}}.$$
The proof is completed.
Proof of Corollary 2.2. According to the mean value theorem, there exists $\xi$ between $y-f_{T+1}(x)$ and $y-f_\lambda^\sigma(x)$ such that
$$\big|V_\sigma\big(y-f_{T+1}(x)\big)-V_\sigma\big(y-f_\lambda^\sigma(x)\big)\big|=\big|V_\sigma'(\xi)\big|\,\big|f_{T+1}(x)-f_\lambda^\sigma(x)\big|=\frac{|\xi|}{\sqrt{1+(\xi/\sigma)^2}}\big|f_{T+1}(x)-f_\lambda^\sigma(x)\big|\le\sigma\big|f_{T+1}(x)-f_\lambda^\sigma(x)\big|.$$
Then, we have
$$\mathcal{E}_\sigma(f_{T+1})-\mathcal{E}_\sigma(f_\lambda^\sigma)\le\int_Z\big|V_\sigma\big(y-f_{T+1}(x)\big)-V_\sigma\big(y-f_\lambda^\sigma(x)\big)\big|\,d\rho(x,y)\le\sigma\int_Z\big|f_{T+1}(x)-f_\lambda^\sigma(x)\big|\,d\rho(x,y)\le\kappa\sigma\|f_{T+1}-f_\lambda^\sigma\|_K. \qquad (3.42)$$
And we get
$$\begin{aligned}\mathcal{E}_\sigma(f_{T+1})-\mathcal{E}_\sigma(f_\sigma)&\le\big(\mathcal{E}_\sigma(f_{T+1})-\mathcal{E}_\sigma(f_\lambda^\sigma)\big)+\mathcal{E}_\sigma(f_\lambda^\sigma)-\mathcal{E}_\sigma(f_\sigma)\le\kappa\sigma\|f_{T+1}-f_\lambda^\sigma\|_K+\mathcal{E}_\sigma(f_\lambda^\sigma)-\mathcal{E}_\sigma(f_\sigma)\\&\le\kappa\sigma\|f_{T+1}-f_\lambda^\sigma\|_K+\mathcal{E}_\sigma(f_\lambda^\sigma)-\mathcal{E}_\sigma(f_\sigma)+\frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\\&=\kappa\sigma\|f_{T+1}-f_\lambda^\sigma\|_K+\inf_{f\in\mathcal{H}_K}\Big\{\mathcal{E}_\sigma(f)-\mathcal{E}_\sigma(f_\sigma)+\frac{\lambda}{2}\|f\|_K^2\Big\}=\kappa\sigma\|f_{T+1}-f_\lambda^\sigma\|_K+\mathcal{K}(f_\sigma,\lambda).\end{aligned}$$
Taking expectations and using Jensen's inequality together with Corollary 2.1, $E\big[\kappa\sigma\|f_{T+1}-f_\lambda^\sigma\|_K\big]\le\kappa\sigma\big(E\big[\|f_{T+1}-f_\lambda^\sigma\|_K^2\big]\big)^{1/2}=O\big(\sigma^{3/2}T^{-\frac{\theta-\alpha}{2}}\big)$, while assumption (2.2) gives $\mathcal{K}(f_\sigma,\lambda)=O(\lambda^{\beta})=O(T^{-\beta\alpha})$. The desired result follows.
∙ Most studies of online learning algorithms focus on convergence in expectation (for example [8,9,10,25]). However, those results were established for fixed loss functions, such as the least-square loss (see e.g., [9,22]). Our results are established for a parameterized loss function with a scale parameter $\sigma$. The analysis in Section 2 shows that the scale parameter $\sigma$ can effectively control the convergence rate of the learning algorithm, and a better convergence rate is obtained. On the other hand, whereas previous research on online learning algorithms relies on integral operator theory (see [25]), this paper establishes the error bounds for the learning sequence by applying convex analysis. The convex analysis method has been widely used in various research fields, for example, in the analysis of machine learning algorithms (see e.g., [21,23,26]) and in studies of discrete fractional operators (see e.g., [27,28]), and it has proved to be a very effective tool.
∙ In [23], the online pairwise regression problem with the quadratic loss is studied. Different from [23], in this paper we use the parameterized loss function for the pointwise learning model, which has a wider range of applications than the pairwise learning model. It is known that deep convolutional networks can increase the approximation order (see e.g., [29,30,31,32]), so it is hoped that the convergence rate provided in this paper can be further improved by combining it with deep neural network methods.
In the present paper, we analyzed the learning performance of the kernel regularized online algorithm with a parameterized loss. The convergence of the learning sequence is proved and the error bound is provided in expectation by using the convex analysis method. There are some questions for further study. In this paper, we focused on the theoretical analysis of the kernel regularized online algorithm with the parameterized loss $V_\sigma$; however, there is still a gap between the theoretical analysis and the optimization process of empirical risk minimization based on a parameterized loss. In future work, it would be interesting to apply the online learning algorithm based on $V_\sigma$ to practical problems and to construct an effective solution method. In addition, we mainly analyzed the sample error in this paper, and the approximation error is represented by the $K$-functional; a more accurate analysis of the approximation error, and of the influence of the scale parameter $\sigma$ on it, remains to be studied.
This work is supported by the Special Project for Scientific and Technological Cooperation (Project No. 20212BDH80021) of Jiangxi Province, the Science and Technology Project in Jiangxi Province Department of Education (Project No. GJJ211334).
No potential conflict of interest was reported by the author.
[1] N. Aronszajn, Theory of reproducing kernels, Trans. Amer. Math. Soc., 68 (1950), 337–404. http://dx.doi.org/10.2307/1990404
[2] W. Dai, J. Hu, Y. Cheng, X. Wang, T. Chai, RVFLN-based online adaptive semi-supervised learning algorithm with application to product quality estimation of industrial processes, J. Cent. South Univ., 26 (2019), 3338–3350. http://dx.doi.org/10.1007/s11771-019-4257-6
[3] J. Gui, Y. Liu, X. Deng, B. Liu, Network capacity optimization for Cellular-assisted vehicular systems by online learning-based mmWave beam selection, Wirel. Commun. Mob. Com., 2021 (2021), 8876186. http://dx.doi.org/10.1155/2021/8876186
[4] M. Li, I. Sethi, A new online learning algorithm with application to image segmentation, Image Processing: Algorithms and Systems IV, 5672 (2005), 277–286. http://dx.doi.org/10.1117/12.586328
[5] S. Sai Santosh, S. Darak, Intelligent and reconfigurable architecture for KL divergence based online machine learning algorithm, arXiv: 2002.07713.
[6] B. Yang, J. Yao, X. Yang, Y. Shi, Painting image classification using online learning algorithm, In: Distributed, ambient and pervasive interactions, Cham: Springer, 2017, 393–403. http://dx.doi.org/10.1007/978-3-319-58697-7_29
[7] S. Das, Kuhoo, D. Mishra, M. Rout, An optimized feature reduction based currency forecasting model exploring the online sequential extreme learning machine and krill herd strategies, Physica A, 513 (2019), 339–370. http://dx.doi.org/10.1016/j.physa.2018.09.021
[8] S. Smale, Y. Yao, Online learning algorithms, Found. Comput. Math., 6 (2006), 145–170. http://dx.doi.org/10.1007/s10208-004-0160-z
[9] Y. Ying, D. Zhou, Online regularized classification algorithms, IEEE Trans. Inform. Theory, 52 (2006), 4775–4788. http://dx.doi.org/10.1109/TIT.2006.883632
[10] Y. Ying, D. Zhou, Unregularized online learning algorithms with general loss functions, Appl. Comput. Harmon. Anal., 42 (2017), 224–244. http://dx.doi.org/10.1016/j.acha.2015.08.007
[11] Y. Zeng, D. Klabjian, Online adaptive machine learning based algorithm for implied volatility surface modeling, Knowl.-Based Syst., 163 (2019), 376–391. http://dx.doi.org/10.1016/j.knosys.2018.08.039
[12] J. Lin, D. Zhou, Online learning algorithms can converge comparably fast as batch learning, IEEE Trans. Neural Netw. Learn. Syst., 29 (2018), 2367–2378. http://dx.doi.org/10.1109/TNNLS.2017.2677970
[13] P. Huber, E. Ronchetti, Robust statistics, Hoboken: John Wiley & Sons, 2009. http://dx.doi.org/10.1002/9780470434697
[14] Y. Wu, Y. Liu, Robust truncated hinge loss support vector machine, J. Am. Stat. Assoc., 102 (2007), 974–983. http://dx.doi.org/10.1198/016214507000000617
[15] Y. Yu, M. Yang, L. Xu, M. White, D. Schuurmans, Relaxed clipping: a global training method for robust regression and classification, Proceedings of the 23rd International Conference on Neural Information Processing Systems, 2 (2010), 2532–2540.
[16] S. Huang, Y. Feng, Q. Wu, Learning theory of minimum error entropy under weak moment conditions, Anal. Appl., 20 (2022), 121–139. http://dx.doi.org/10.1142/S0219530521500044
[17] F. Lv, J. Fan, Optimal learning with Gaussians and correntropy loss, Anal. Appl., 19 (2021), 107–124. http://dx.doi.org/10.1142/S0219530519410124
[18] X. Zhu, Z. Li, J. Sun, Expression recognition method combining convolutional features and Transformer, Math. Found. Comput., in press. http://dx.doi.org/10.3934/mfc.2022018
[19] S. Suzumura, K. Ogawa, M. Sugiyama, M. Karasuyama, I. Takeuchi, Homotopy continuation approaches for robust SV classification and regression, Mach. Learn., 106 (2017), 1009–1038. http://dx.doi.org/10.1007/s10994-017-5627-7
[20] Z. Guo, T. Hu, L. Shi, Gradient descent for robust kernel-based regression, Inverse Probl., 34 (2018), 065009. http://dx.doi.org/10.1088/1361-6420/aabe55
[21] B. Sheng, H. Zhu, The convergence rate of semi-supervised regression with quadratic loss, Appl. Math. Comput., 321 (2018), 11–24. http://dx.doi.org/10.1016/j.amc.2017.10.033
[22] M. Pontil, Y. Ying, D. Zhou, Error analysis for online gradient descent algorithms in reproducing kernel Hilbert spaces, Technical Report, University College London, 2005, 1–20.
[23] S. Wang, Z. Chen, B. Sheng, Convergence of online pairwise regression learning with quadratic loss, Commun. Pur. Appl. Anal., 19 (2020), 4023–4054. http://dx.doi.org/10.3934/cpaa.2020178
[24] H. Bauschke, P. Combettes, Convex analysis and monotone operator theory in Hilbert spaces, Cham: Springer-Verlag, 2010. http://dx.doi.org/10.1007/978-3-319-48311-5
[25] Z. Guo, L. Shi, Fast and strong convergence of online learning algorithms, Adv. Comput. Math., 45 (2019), 2745–2770. http://dx.doi.org/10.1007/s10444-019-09707-8
[26] Y. Lei, D. Zhou, Convergence of online mirror descent, Appl. Comput. Harmon. Anal., 48 (2020), 343–373. http://dx.doi.org/10.1016/j.acha.2018.05.005
[27] I. Baloch, T. Abdeljawad, S. Bibi, A. Mukheimer, G. Farid, A. Haq, Some new Caputo fractional derivative inequalities for exponentially (θ,h−m)-convex functions, AIMS Mathematics, 7 (2022), 3006–3026. http://dx.doi.org/10.3934/math.2022166
[28] P. Mohammed, D. O'Regan, A. Brzo, K. Abualnaja, D. Baleanu, Analysis of positivity results for discrete fractional operators by means of exponential kernels, AIMS Mathematics, 7 (2022), 15812–15823. http://dx.doi.org/10.3934/math.2022865
[29] Y. Xia, J. Zhou, T. Xu, W. Gao, An improved deep convolutional neural network model with kernel loss function in image classification, Math. Found. Comput., 3 (2020), 51–64. http://dx.doi.org/10.3934/mfc.2020005
[30] D. Zhou, Deep distributed convolutional neural networks: universality, Anal. Appl., 16 (2018), 895–919. http://dx.doi.org/10.1142/S0219530518500124
[31] D. Zhou, Universality of deep convolutional neural networks, Appl. Comput. Harmon. Anal., 48 (2020), 787–794. http://dx.doi.org/10.1016/j.acha.2019.06.004
[32] D. Zhou, Theory of deep convolutional neural networks: downsampling, Neural Networks, 124 (2020), 319–327. http://dx.doi.org/10.1016/j.neunet.2020.01.018