Research article

Convergence of online learning algorithm with a parameterized loss

  • Received: 19 July 2022 Revised: 27 August 2022 Accepted: 06 September 2022 Published: 13 September 2022
  • MSC : 41A25, 68Q32, 68T40, 90C25

  • The learning performance of machine learning algorithms is one of the central topics of machine learning theory, and the choice of loss function is one of the key factors affecting this performance. In this paper, we introduce a parameterized loss function into the online learning algorithm and investigate its performance. By applying convex analysis techniques, the convergence of the learning sequence is proved and the convergence rate is provided in the expectation sense. The analysis shows that the convergence rate can be greatly improved by adjusting the parameter in the loss function.

    Citation: Shuhua Wang. Convergence of online learning algorithm with a parameterized loss[J]. AIMS Mathematics, 2022, 7(11): 20066-20084. doi: 10.3934/math.20221098




    Let $X \subset \mathbb{R}^d$ be the input space and $Y = [-M, M]$ be the output space for some $M > 0$. The samples $\mathbf{z} = \{(x_t, y_t)\}_{t=1}^T$ are drawn i.i.d. (independently and identically distributed) according to a Borel probability measure $\rho(x, y) = \rho(y|x)\rho_X(x)$ on $Z = X \times Y$. Based on these samples, the goal of regression is to find a predictor $f: X \to \mathbb{R}$ from some hypothesis space such that $f(x)$ is a "good" approximation of $y$. The quality of the predictor $f$ is measured by the generalization error

    $$\mathcal{E}(f) := \int_Z V(x, y, f)\, d\rho(x, y),$$

    where $V: \mathbb{R} \to \mathbb{R}_+$ is a prescribed loss function.

    The hypothesis space considered in this paper is the reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$. This means that there exists a unique symmetric and positive definite continuous function $K: X \times X \to \mathbb{R}$, called the reproducing kernel of $\mathcal{H}_K$ (or Mercer kernel), and an inner product $\langle \cdot, \cdot \rangle_K$ such that $f(x) = \langle K(x, \cdot), f \rangle_K$, which is the reproducing property of the kernel, and all $f \in \mathcal{H}_K$ are limits of linear combinations of kernel functions. In other words, the RKHS $\mathcal{H}_K$ is the closure of the linear span of the set of functions $\{K_x(\cdot) = K(x, \cdot) : x \in X\}$ with respect to the inner product $\langle \cdot, \cdot \rangle_K$. For each $x \in X$ and $f \in \mathcal{H}_K$, the evaluation functional $e_x(f) := f(x)$ is continuous (i.e., bounded) in the topology of $\mathcal{H}_K$, and $|f(x)| \le \kappa \|f\|_K$ with $\kappa := \sup_{x \in X}\sqrt{K(x, x)}$ (see [1]).

    Traditional off-line learning, also called batch learning, requires all sample points to be processed in each training pass. When the amount of data is large or new sample points are added, the learning efficiency of batch learning decreases significantly. Online learning is an effective approach for analyzing and processing big data in various applications, such as communication, electronics, medicine, biology, and other fields (see e.g., [2,3,4,5,6]). The performance of kernel-based regularized online learning algorithms has been studied and their effectiveness has been verified (see e.g., [7,8,9,10,11] and references therein). Unlike off-line learning algorithms, online learning algorithms process the observations one by one, and the output is adjusted in time according to the result of the previous learning step.

    With the observations $\mathbf{z} = \{(x_t, y_t)\}_{t=1}^T$, the kernel regularized online learning algorithm based on the stochastic gradient descent method is given by (see e.g., [8,9,10])

    $$\begin{cases} f_1 = 0, \\ f_{t+1} = f_t - \eta_t\left(\partial_f V(x_t, y_t, f_t(x_t))K_{x_t} + \lambda f_t\right), \end{cases} \tag{1.1}$$

    where $\partial_f V$ denotes the derivative of the loss with respect to $f(x)$, $\eta_t$ is called the stepsize, $\lambda > 0$ is a regularization parameter, and the sequence $\{f_t : t = 1, \ldots, T+1\}$ is the learning sequence.

    When the least-square loss function $V(x, y, f(x)) = (f(x) - y)^2$ is selected, the specific iteration of the online learning algorithm is given by

    $$\begin{cases} f_1 = 0, \\ f_{t+1} = f_t - \eta_t\left((f_t(x_t) - y_t)K_{x_t} + \lambda f_t\right). \end{cases} \tag{1.2}$$

    To study the learning performance of online learning algorithms, we need to bound the convergence rate of the iterative sequence $\{f_t : t = 1, \ldots, T+1\}$. The convergence of the online learning algorithm (1.2) has been extensively studied in the literature (see e.g., [8,9,12]). The results in [12] show that, under mild conditions, the regularized online learning algorithm can converge comparably fast as the off-line learning algorithm.
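    In implementations, each iterate of (1.1) and (1.2) is a finite linear combination of the kernel sections $K_{x_1}, \ldots, K_{x_{t-1}}$, so the update can be carried out on the expansion coefficients. The following Python sketch illustrates this; it is not code from the paper, and the Gaussian kernel, function names and parameters are illustrative assumptions.

    ```python
    import numpy as np

    def gaussian_kernel(x1, x2, gamma=1.0):
        # A Mercer kernel K(x, x'); the Gaussian kernel is just one admissible choice.
        return float(np.exp(-gamma * np.sum((np.asarray(x1) - np.asarray(x2)) ** 2)))

    def online_kernel_update(samples, eta, lam, dloss, kernel=gaussian_kernel):
        """One pass of (1.1): f_1 = 0 and
        f_{t+1} = (1 - lam*eta_t) f_t - eta_t * dloss(x_t, y_t, f_t(x_t)) * K_{x_t},
        with the iterate stored through its kernel-expansion coefficients."""
        xs, coefs = [], []  # f_t(.) = sum_i coefs[i] * kernel(xs[i], .)
        for t, (x, y) in enumerate(samples, start=1):
            f_x = sum(c * kernel(xi, x) for c, xi in zip(coefs, xs))  # evaluate f_t(x_t)
            step = eta(t)
            coefs = [(1.0 - lam * step) * c for c in coefs]           # shrink old coefficients
            xs.append(x)
            coefs.append(-step * dloss(x, y, f_x))                    # new coefficient on K_{x_t}
        return xs, coefs

    # Least-square case: the derivative term is taken as f_t(x_t) - y_t,
    # matching the iteration (1.2) displayed above.
    least_square_dloss = lambda x, y, fx: fx - y
    ```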

    The least-square loss is the most widely used criterion for regression in practice. However, from a robustness point of view, the least-square loss is not a good choice: in many practical applications, outliers or heavy-tailed distributions often occur in real data sets. In recent years, how to improve the robustness of algorithms has become one of the hot topics in machine learning. In the learning theory literature, much effort has been devoted to the generalization analysis (see e.g., [13,14,15,16,17]) and the empirical study (see e.g., [18,19]) of learning algorithms when outliers or heavy-tailed noise are allowed.

    One of the main strategies to improve robustness is to use a robust loss function with a scale parameter (see e.g., [20,21]). Based on the function $\sqrt{1+t^2}$, $t \in \mathbb{R}$, which plays an important role in constructing shape-preserving quasi-interpolation and in solving partial differential equations with mesh-free methods due to its strong nonlinearity and convexity, [21] constructed a robust loss function $V_\sigma(r)$ with a scale parameter $\sigma \in (0, 1]$. For $\sigma \in (0, 1]$, the parameterized loss function $V_\sigma(r): \mathbb{R} \to [0, \infty)$ is defined as

    $$V_\sigma(r) := \sigma^2\left(\frac{\sqrt{\sigma^2 + r^2}}{\sigma} - 1\right), \quad r \in \mathbb{R}.$$

    The analysis results in [21] show that the learning algorithm based on the parameterized loss function $V_\sigma(r)$ has a good generalization ability.
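    To make the role of the scale parameter concrete, the following elementary observations follow directly from this definition (they are our reformulation, not statements quoted from [21]):

    $$V_\sigma(r) = \sigma\sqrt{\sigma^2 + r^2} - \sigma^2 = \sigma^2\left(\sqrt{1 + \left(\frac{r}{\sigma}\right)^2} - 1\right) \approx \begin{cases} \dfrac{r^2}{2}, & |r| \ll \sigma, \\[2mm] \sigma|r| - \sigma^2, & |r| \gg \sigma, \end{cases} \qquad V_\sigma'(r) = \frac{r}{\sqrt{1 + \left(\frac{r}{\sigma}\right)^2}}, \qquad |V_\sigma'(r)| \le \sigma.$$

    Thus $V_\sigma$ is quadratic for small residuals and grows only linearly, with derivative bounded by $\sigma$, for large residuals; this bounded-derivative property is exactly what is used in (3.4) below and is the mechanism behind the robustness to outliers.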

    Encouraged by these studies, we aim to further improve the performance and applicability of the online algorithm. In this paper, we introduce the parameterized loss function $V_\sigma(r)$ into the online learning algorithm and analyze the influence of the scale parameter on the convergence rate of the algorithm.

    To give the specific form of the learning algorithm with the parameterized loss function, we introduce the corresponding notation. We denote

    $$\mathcal{E}^\sigma(f) = \int_Z V_\sigma(y - f(x))\, d\rho(x, y), \tag{1.3}$$

    and

    $$f_\sigma = \arg\min_{f \in L^2(\rho_X)} \mathcal{E}^\sigma(f). \tag{1.4}$$

    The kernel-based regularized off-line algorithm is the following optimization problem:

    $$f_\lambda^\sigma = \arg\min_{f \in \mathcal{H}_K}\left\{\mathcal{E}^\sigma(f) + \frac{\lambda}{2}\|f\|_K^2\right\}. \tag{1.5}$$

    Based on the random observations $\mathbf{z} = \{(x_t, y_t)\}_{t=1}^T$, an approximate solution of the problem (1.5) is obtained through the following learning process:

    $$f_{\mathbf{z},\lambda}^\sigma = \arg\min_{f \in \mathcal{H}_K}\left\{\frac{1}{T}\sum_{t=1}^T V_\sigma(y_t - f(x_t)) + \frac{\lambda}{2}\|f\|_K^2\right\}.$$

    It is easy to see that the derivative of the loss function $V_\sigma$ is given by

    $$V_\sigma'(y_t - f_t(x_t)) = \frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}.$$

    Following the online algorithm (1.1), the kernel regularized online learning algorithm with the parameterized loss function $V_\sigma(r)$ is defined by

    $$\begin{cases} f_1 = 0, \\ f_{t+1} = f_t - \eta_t\left(-\dfrac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}K_{x_t} + \lambda f_t\right). \end{cases} \tag{1.6}$$

    In this paper, we focus on the performance of the sequence $\{f_t : t = 1, \ldots, T+1\}$ produced by the algorithm (1.6).
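    As a minimal implementation sketch (again not code from the paper), the only change needed relative to the least-square update of (1.2) is the bounded derivative of $V_\sigma$; the helper below plugs into the online_kernel_update sketch given after (1.2), and its names and parameters are illustrative assumptions.

    ```python
    import math

    def vsigma_dloss(sigma):
        """Derivative term used in (1.6). By the chain rule,
        d/df V_sigma(y - f(x)) = -V_sigma'(y - f(x)), so the online update adds
        eta_t * V_sigma'(y_t - f_t(x_t)) * K_{x_t} to the shrunken iterate."""
        def dloss(x, y, fx):
            r = y - fx
            return -r / math.sqrt(1.0 + (r / sigma) ** 2)  # magnitude at most sigma
        return dloss

    # Example wiring with the earlier sketch and a polynomially decaying stepsize
    # (C, theta, lam, sigma are user-chosen; see Section 2 for admissible ranges):
    # xs, coefs = online_kernel_update(samples,
    #                                  eta=lambda t: 1.0 / (C * t ** theta),
    #                                  lam=lam,
    #                                  dloss=vsigma_dloss(sigma))
    ```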

    The remaining parts of this paper are organized as follows: we present the main results of this paper in Section 2. The proofs of the main results are given in Section 3. Discussions and comparisons with related works are given in Section 4. Conclusions and some questions for further study are mentioned in Section 5.

    In the present paper, we write $A = O(B)$ if there is a constant $C_0$ such that $A \le C_0 B$. We use $E_{\mathbf{z}}[\cdot]$ to denote the expectation with respect to $\mathbf{z}$. When its meaning is clear from the context, we use the shorthand notation $E[\cdot]$.

    In this section, we present our main results about the performance of the algorithm (1.6); the proofs are given in Section 3.

    Our first main result establishes the convergence of the sequence $\{f_t : t = 1, \ldots, T+1\}$ in expectation, under mild conditions on the stepsize.

    Theorem 2.1. Let $f_\lambda^\sigma$ be defined as in (1.5), and let $\{f_t : t = 1, \ldots, T+1\}$ be the sequence produced by the algorithm (1.6), $\lambda > 0$. If $\{\eta_t\}$ satisfies $\sum_{t=1}^{\infty}\eta_t = +\infty$, $\lim_{t\to+\infty}\eta_t = 0$ and $\eta_t \le \frac{1}{\lambda}$, then it holds that

    $$E_{z_1,\ldots,z_T}\left[\left\|f_{T+1} - f_\lambda^\sigma\right\|_K\right] \to 0 \quad (T \to +\infty).$$

    Our second main result gives the explicit convergence rate of the last iterate by specifying the step sizes in the algorithm (1.6).

    Theorem 2.2. Let $f_\lambda^\sigma$ be defined as in (1.5), and let $\{f_t : t = 1, \ldots, T+1\}$ be the sequence produced by the algorithm (1.6). For any $0 < \lambda \le 1$, $0 < \theta \le 1$, take $\eta_t = \frac{1}{Ct^\theta}$ with $C \ge \lambda + 4(\kappa^2 + 1)$. Then, it holds that

    $$E\left[\left\|f_{T+1} - f_\lambda^\sigma\right\|_K^2\right] \le \begin{cases} \left(D_\sigma(\lambda) + \dfrac{9C_\sigma T^{1-\theta}}{C^2(1-\theta)2^{1-\theta}}\right)\exp\left\{-\dfrac{\lambda(1 - 2^{\theta-1})}{2C(1-\theta)}(T+1)^{1-\theta}\right\} + \dfrac{36C_\sigma}{\lambda C}T^{-\theta}, & \text{if } 0 < \theta < 1, \\[3mm] \left(D_\sigma(\lambda) + \dfrac{5C_\sigma T^{1-\theta}}{C^2(C-\lambda)}\right)T^{-\frac{\lambda}{2C}}, & \text{if } \theta = 1, \end{cases}$$

    where $D_\sigma(\lambda) = \frac{2M\sigma}{\lambda}$ and $C_\sigma = 4M\sigma(\kappa^2 + 1)$.

    Corollary 2.1. Let $f_\lambda^\sigma$ be defined as in (1.5), and let $\{f_t : t = 1, \ldots, T+1\}$ be the sequence produced by the algorithm (1.6). For any $\theta \in (0,1)$, $0 < \alpha \le \min\{1-\theta, \theta\}$, take $\lambda = T^{-\alpha}$. If the stepsize is chosen as $\eta_t = \frac{1}{Ct^\theta}$, then it holds that

    $$E\left[\left\|f_{T+1} - f_\lambda^\sigma\right\|_K^2\right] \le C_{M,C,\kappa,\theta}\cdot\sigma T^{-(\theta-\alpha)}, \tag{2.1}$$

    where $C_{M,C,\kappa,\theta}$ is a constant depending only on $\theta, \kappa, C$ and $M$.

    Remark 1. The results given in Theorem 2.2 and Corollary 2.1 show that the scale parameter $\sigma$ can effectively control the convergence rate of $\|f_{T+1} - f_\lambda^\sigma\|_K$, which is usually referred to as the sample error. Depending on the circumstances, the sample error bound can be greatly improved by choosing the parameter $\sigma$ properly. In fact, take $\sigma = \lambda = T^{-\alpha}$. Then, by (2.1), we have $E\left[\|f_{T+1} - f_\lambda^\sigma\|_K^2\right] = O(T^{-\theta})$, which is better than the sample error bound $O(T^{-(\theta-\alpha)})$ given in [9].
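    As a worked instance of Remark 1 (the specific numbers are chosen by us only for illustration): take $\theta = \frac{2}{3}$ and $\alpha = \frac{1}{3}$, so that $\alpha \le \min\{1-\theta, \theta\}$. Choosing $\sigma = \lambda = T^{-1/3}$ in (2.1) gives

    $$E\left[\left\|f_{T+1} - f_\lambda^\sigma\right\|_K^2\right] \le C_{M,C,\kappa,\theta}\, T^{-\frac{1}{3}}\cdot T^{-\left(\frac{2}{3} - \frac{1}{3}\right)} = C_{M,C,\kappa,\theta}\, T^{-\frac{2}{3}},$$

    whereas with a scale parameter kept fixed (independent of $T$) the same bound only yields the order $T^{-(\theta-\alpha)} = T^{-1/3}$.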

    The results provided above mainly describe the convergence rate of the sample error. However, in studying the learning performance of learning algorithms, we are often interested in the excess generalization error $\mathcal{E}^\sigma(f_{T+1}) - \mathcal{E}^\sigma(f_\sigma)$. Define

    $$\mathcal{K}(f, \lambda) = \inf_{g \in \mathcal{H}_K}\left\{\mathcal{E}^\sigma(g) - \mathcal{E}^\sigma(f) + \frac{\lambda}{2}\|g\|_K^2\right\},$$

    which is often used to denote the approximation error, whose convergence is determined by the capacity of $\mathcal{H}_K$. We assume that the K-functional satisfies the following decay condition

    $$\mathcal{K}(f_\sigma, \lambda) = O(\lambda^\beta), \quad \lambda \to 0^+, \tag{2.2}$$

    with $0 < \beta \le 1$.

    By combining the sample error with the approximation error, we obtain the overall learning rate stated as follows.

    Corollary 2.2. Let $\mathcal{E}^\sigma(f)$ be the generalization error defined as in (1.3), and let $\{f_t : t = 1, \ldots, T+1\}$ be the sequence produced by the algorithm (1.6). For $\theta \in (0,1)$, $0 < \alpha \le \min\{1-\theta, \theta\}$, take $\lambda = T^{-\alpha}$. If the stepsize is chosen as $\eta_t = \frac{1}{Ct^\theta}$, then it holds that

    $$E\left[\mathcal{E}^\sigma(f_{T+1}) - \mathcal{E}^\sigma(f_\sigma)\right] = O\left(\sigma^{\frac{3}{2}}T^{-\frac{\theta-\alpha}{2}} + T^{-\beta\alpha}\right). \tag{2.3}$$

    Now, we compare the performance of the algorithm (1.6) with that of the kernel regularized online learning algorithm based on the least-square loss.

    In [22], the kernel regularized online learning algorithm with the least-square loss is studied, and learning rates are established under certain assumptions. Namely, for $0 < \beta \le 1$ and $0 < \delta < \frac{\beta}{\beta+1}$, there holds

    $$E\left[\mathcal{E}(f_{T+1}) - \mathcal{E}(f_\rho)\right] = O\left(T^{\delta - \frac{\beta}{\beta+1}}\right), \tag{2.4}$$

    and for $\frac{1}{2} < \beta \le 1$ and $0 < \delta < \frac{2\beta-1}{2\beta+1}$, there holds

    $$E\left[\mathcal{E}(f_{T+1}) - \mathcal{E}(f_\rho)\right] = O\left(T^{\delta - \frac{2\beta-1}{2\beta+1}}\right). \tag{2.5}$$

    For $0 < \beta \le 1$, we choose $\sigma = T^{-\frac{\mu}{3}}$ with $0 < \mu \le \frac{2\beta-1}{2\beta+1}$. Take $\lambda = T^{\frac{\delta}{2\beta} - \frac{1}{2(\beta+1)} - \frac{\mu}{2\beta+1}}$ and $\eta_t = \frac{1}{4\kappa^2+5}t^{\frac{2\beta+1}{2\beta}\delta - \frac{2\beta+1}{2(\beta+1)}}$ with $\frac{\beta}{\beta+1} - \frac{2\beta}{2\beta+1}\mu < \delta < \frac{\beta}{\beta+1}$. By Corollary 2.2, we have

    $$E\left[\mathcal{E}^\sigma(f_{T+1}) - \mathcal{E}^\sigma(f_\sigma)\right] = O\left(T^{\frac{1}{2}\left(\delta - \frac{\beta}{\beta+1}\right) - \frac{\beta}{2\beta+1}\mu}\right). \tag{2.6}$$

    And for $\frac{1}{2} < \beta \le 1$, we choose $\sigma^3 = \lambda = T^{\frac{2\beta}{2\beta-1}\delta - \frac{2\beta}{2\beta+1}}$ with $0 < \delta < \frac{2\beta-1}{2\beta+1}$, and take $\eta_t = \frac{1}{4\kappa^2+5}t^{\frac{2\beta}{2\beta-1}\delta - \frac{2\beta}{2\beta+1}}$. By Corollary 2.2, we have

    $$E\left[\mathcal{E}^\sigma(f_{T+1}) - \mathcal{E}^\sigma(f_\sigma)\right] = O\left(T^{\frac{\beta}{2\beta-1}\delta - \frac{\beta}{2\beta+1}}\right). \tag{2.7}$$

    The rate (2.6) is better than (2.4), and (2.7) is better than (2.5). The analysis results illustrate that the convergence rate of the algorithm (1.6) can be improved by choosing the parameter σ appropriately.

    Lemma 3.1. Let $\{f_t : t = 1, \ldots, T+1\}$ be the sequence produced by the algorithm (1.6). If $\eta_t \le \frac{1}{\lambda}$, then, for any $t = 1, \ldots, T+1$, it holds that

    $$\|f_t\|_K \le \frac{\kappa\sigma}{\lambda}. \tag{3.1}$$

    Proof. We prove (3.1) by induction on $t$. The inequality (3.1) is true for $t = 1$ because of the initialization condition $f_1 = 0$. Suppose the bound (3.1) holds for some $t$, and consider $f_{t+1}$. The iteration (1.6) can be rewritten as

    $$f_{t+1} = f_t + \frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}\eta_t K_{x_t} - \lambda\eta_t f_t = (1 - \lambda\eta_t)f_t + \frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}\eta_t K_{x_t}. \tag{3.2}$$

    This implies that

    $$\|f_{t+1}\|_K \le (1 - \lambda\eta_t)\|f_t\|_K + \kappa\eta_t\left|\frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}\right|. \tag{3.3}$$

    Note that

    $$\left|\frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}\right| \le \sigma. \tag{3.4}$$

    Combining this with the induction hypothesis $\|f_t\|_K \le \frac{\kappa\sigma}{\lambda}$, we have

    $$\|f_{t+1}\|_K \le (1 - \lambda\eta_t)\|f_t\|_K + \kappa\sigma\eta_t \le (1 - \lambda\eta_t)\frac{\kappa\sigma}{\lambda} + \kappa\sigma\eta_t = \frac{\kappa\sigma}{\lambda}. \tag{3.5}$$

    This completes the proof.

    Lemma 3.2. Assume $f_\lambda^\sigma$ is defined as in (1.5). Then, for any $f \in \mathcal{H}_K$, it holds that

    $$\int_Z \frac{y - f_\lambda^\sigma(x)}{\sqrt{1 + \left(\frac{y - f_\lambda^\sigma(x)}{\sigma}\right)^2}}\left(f_\lambda^\sigma(x) - f(x)\right)d\rho = \lambda\left\langle f_\lambda^\sigma, f_\lambda^\sigma - f\right\rangle_K.$$

    Proof. By the Taylor formula, for any $u, v \in \mathbb{R}$, we have

    $$V_\sigma(u) - V_\sigma(v) = V_\sigma'(v)(u - v) + \frac{1}{2}V_\sigma''(\xi)(u - v)^2, \tag{3.6}$$

    where $\xi \in \mathbb{R}$ is between $u$ and $v$.

    Note that $V_\sigma''(\xi) = \frac{1}{\left(1 + \left(\frac{\xi}{\sigma}\right)^2\right)^{3/2}} > 0$. Then,

    $$V_\sigma(u) - V_\sigma(v) \ge V_\sigma'(v)(u - v) = \frac{v(u - v)}{\sqrt{1 + \left(\frac{v}{\sigma}\right)^2}}. \tag{3.7}$$

    Therefore, for any $f, g \in \mathcal{H}_K$, we have

    $$\mathcal{E}^\sigma(f) - \mathcal{E}^\sigma(g) = \int_Z V_\sigma(y - f(x))d\rho - \int_Z V_\sigma(y - g(x))d\rho \ge -\int_Z \frac{y - g(x)}{\sqrt{1 + \left(\frac{y - g(x)}{\sigma}\right)^2}}\left(f(x) - g(x)\right)d\rho = \left\langle f - g,\ -\int_Z \frac{y - g(x)}{\sqrt{1 + \left(\frac{y - g(x)}{\sigma}\right)^2}}K_x\, d\rho\right\rangle_K = \left\langle f - g, \nabla\mathcal{E}^\sigma(g)\right\rangle_K. \tag{3.8}$$

    By (i) of Lemma 5.1 in [23], we know that $\mathcal{E}^\sigma(f)$ is a convex function on $\mathcal{H}_K$, and $\|f\|_K^2$ is a strictly convex function on $\mathcal{H}_K$. Then, $\Omega^\sigma(f) = \mathcal{E}^\sigma(f) + \frac{\lambda}{2}\|f\|_K^2$ is a convex function on $\mathcal{H}_K$. Based on (ii) of Lemma 5.1 in [23], we have

    $$0 = \nabla\Omega^\sigma(f)\big|_{f = f_\lambda^\sigma} = \nabla\mathcal{E}^\sigma(f)\big|_{f = f_\lambda^\sigma} + \lambda f_\lambda^\sigma = -\int_Z \frac{y - f_\lambda^\sigma(x)}{\sqrt{1 + \left(\frac{y - f_\lambda^\sigma(x)}{\sigma}\right)^2}}K_x\, d\rho + \lambda f_\lambda^\sigma. \tag{3.9}$$

    Taking the inner product with $f - f_\lambda^\sigma$ on both sides of the above formula, we get

    $$0 = \left\langle -\int_Z \frac{y - f_\lambda^\sigma(x)}{\sqrt{1 + \left(\frac{y - f_\lambda^\sigma(x)}{\sigma}\right)^2}}K_x\, d\rho + \lambda f_\lambda^\sigma,\ f - f_\lambda^\sigma\right\rangle_K = -\int_Z \frac{y - f_\lambda^\sigma(x)}{\sqrt{1 + \left(\frac{y - f_\lambda^\sigma(x)}{\sigma}\right)^2}}\left\langle K_x, f - f_\lambda^\sigma\right\rangle_K d\rho + \lambda\left\langle f_\lambda^\sigma, f - f_\lambda^\sigma\right\rangle_K$$
    $$= \int_Z \frac{y - f_\lambda^\sigma(x)}{\sqrt{1 + \left(\frac{y - f_\lambda^\sigma(x)}{\sigma}\right)^2}}\left(f_\lambda^\sigma(x) - f(x)\right)d\rho + \lambda\left\langle f_\lambda^\sigma, f - f_\lambda^\sigma\right\rangle_K.$$

    This proves our conclusion.

    Lemma 3.3. Let $f_\lambda^\sigma$ be defined as in (1.5). For any $f \in \mathcal{H}_K$, we denote $\Omega^\sigma(f) = \mathcal{E}^\sigma(f) + \frac{\lambda}{2}\|f\|_K^2$. Then, it holds that

    $$\frac{\lambda}{2}\left\|f - f_\lambda^\sigma\right\|_K^2 \le \Omega^\sigma(f) - \Omega^\sigma(f_\lambda^\sigma).$$

    Proof. For any $f \in \mathcal{H}_K$, we define a function $f^{(\theta)} = f_\lambda^\sigma + \theta(f - f_\lambda^\sigma)$, $\theta \in [0, 1]$. Then $f^{(0)} = f_\lambda^\sigma$ and $f^{(1)} = f$. Denote $F(\theta) = \Omega^\sigma(f^{(\theta)}) = \int_Z V_\sigma(y - f^{(\theta)}(x))d\rho + \frac{\lambda}{2}\|f^{(\theta)}\|_K^2$; then $F(1) = \Omega^\sigma(f)$ and $F(0) = \Omega^\sigma(f_\lambda^\sigma)$. Since $V_\sigma$ is differentiable, $F(\theta)$ is differentiable as a function of $\theta$, and for any $\theta \in [0, 1]$ we have

    $$F'(\theta) = \lim_{\Delta\theta\to 0}\frac{F(\theta + \Delta\theta) - F(\theta)}{\Delta\theta} = \lim_{\Delta\theta\to 0}\frac{\Omega^\sigma(f^{(\theta+\Delta\theta)}) - \Omega^\sigma(f^{(\theta)})}{\Delta\theta} = \lim_{\Delta\theta\to 0}\frac{1}{\Delta\theta}\left(\int_Z\left(V_\sigma(y - f^{(\theta+\Delta\theta)}(x)) - V_\sigma(y - f^{(\theta)}(x))\right)d\rho + \frac{\lambda}{2}\|f^{(\theta+\Delta\theta)}\|_K^2 - \frac{\lambda}{2}\|f^{(\theta)}\|_K^2\right)$$
    $$= \lim_{\Delta\theta\to 0}\frac{1}{\Delta\theta}\left(\int_Z\left(V_\sigma\big(y - f^{(\theta)}(x) - \Delta\theta(f(x) - f_\lambda^\sigma(x))\big) - V_\sigma(y - f^{(\theta)}(x))\right)d\rho + \frac{\lambda}{2}\left(\left\|f^{(\theta)} + \Delta\theta(f - f_\lambda^\sigma)\right\|_K^2 - \left\|f^{(\theta)}\right\|_K^2\right)\right). \tag{3.10}$$

    By the mean value theorem, there holds

    $$V_\sigma\big(y - f^{(\theta)}(x) - \Delta\theta(f(x) - f_\lambda^\sigma(x))\big) - V_\sigma\big(y - f^{(\theta)}(x)\big) = \Delta\theta\, V_\sigma'(\xi)\left(f_\lambda^\sigma(x) - f(x)\right), \tag{3.11}$$

    where $\xi$ lies between $y - f^{(\theta)}(x) - \Delta\theta(f(x) - f_\lambda^\sigma(x))$ and $y - f^{(\theta)}(x)$.

    This, in connection with $\left\|f^{(\theta)} + \Delta\theta(f - f_\lambda^\sigma)\right\|_K^2 - \left\|f^{(\theta)}\right\|_K^2 = 2\Delta\theta\left\langle f - f_\lambda^\sigma, f^{(\theta)}\right\rangle_K + (\Delta\theta)^2\left\|f - f_\lambda^\sigma\right\|_K^2$ and (3.10), tells us that

    $$F'(\theta) = \lim_{\Delta\theta\to 0}\left(\int_Z V_\sigma'(\xi)\left(f_\lambda^\sigma(x) - f(x)\right)d\rho + \lambda\left\langle f - f_\lambda^\sigma, f^{(\theta)}\right\rangle_K + \frac{\lambda}{2}\Delta\theta\left\|f - f_\lambda^\sigma\right\|_K^2\right) = \int_Z V_\sigma'\big(y - f^{(\theta)}(x)\big)\left(f_\lambda^\sigma(x) - f(x)\right)d\rho + \lambda\left\langle f - f_\lambda^\sigma, f^{(\theta)}\right\rangle_K$$
    $$= \int_Z V_\sigma'\big((y - f_\lambda^\sigma(x)) + \theta(f_\lambda^\sigma(x) - f(x))\big)\left(f_\lambda^\sigma(x) - f(x)\right)d\rho + \lambda\left\langle f - f_\lambda^\sigma, f_\lambda^\sigma + \theta(f - f_\lambda^\sigma)\right\rangle_K = \int_Z V_\sigma'\big((y - f_\lambda^\sigma(x)) + \theta(f_\lambda^\sigma(x) - f(x))\big)\left(f_\lambda^\sigma(x) - f(x)\right)d\rho + \lambda\left\langle f - f_\lambda^\sigma, f_\lambda^\sigma\right\rangle_K + \theta\lambda\left\|f - f_\lambda^\sigma\right\|_K^2. \tag{3.12}$$

    By Lemma 3.2, we see that

    $$\lambda\left\langle f - f_\lambda^\sigma, f_\lambda^\sigma\right\rangle_K = -\int_Z \frac{y - f_\lambda^\sigma(x)}{\sqrt{1 + \left(\frac{y - f_\lambda^\sigma(x)}{\sigma}\right)^2}}\left(f_\lambda^\sigma(x) - f(x)\right)d\rho = -\int_Z V_\sigma'\big(y - f_\lambda^\sigma(x)\big)\left(f_\lambda^\sigma(x) - f(x)\right)d\rho. \tag{3.13}$$

    On the other hand, since $V_\sigma(u)$ is a convex function on $\mathbb{R}$, by the discussions in [24], we know that

    $$\left(V_\sigma'\big((y - f_\lambda^\sigma(x)) + \theta(f_\lambda^\sigma(x) - f(x))\big) - V_\sigma'\big(y - f_\lambda^\sigma(x)\big)\right)\left(f_\lambda^\sigma(x) - f(x)\right) \ge 0. \tag{3.14}$$

    Therefore, for $\theta \in (0, 1)$, we have

    $$F'(\theta) = \int_Z\left(V_\sigma'\big((y - f_\lambda^\sigma(x)) + \theta(f_\lambda^\sigma(x) - f(x))\big) - V_\sigma'\big(y - f_\lambda^\sigma(x)\big)\right)\left(f_\lambda^\sigma(x) - f(x)\right)d\rho + \lambda\theta\left\|f - f_\lambda^\sigma\right\|_K^2 \ge \lambda\theta\left\|f - f_\lambda^\sigma\right\|_K^2. \tag{3.15}$$

    By the definition of $f_\lambda^\sigma$, we know that $F(\theta) \ge F(0) = \Omega^\sigma(f_\lambda^\sigma)$ for $\theta \in [0, 1]$. Therefore, (3.15) implies that

    $$\Omega^\sigma(f) - \Omega^\sigma(f_\lambda^\sigma) = F(1) - F(0) = \int_0^1 F'(\theta)d\theta \ge \int_0^1\lambda\theta\left\|f - f_\lambda^\sigma\right\|_K^2 d\theta = \lambda\left\|f - f_\lambda^\sigma\right\|_K^2\int_0^1\theta\, d\theta = \frac{\lambda}{2}\left\|f - f_\lambda^\sigma\right\|_K^2.$$

    The proof is completed.

    Lemma 3.4. Let $f_\lambda^\sigma$ be defined as in (1.5) and let $\{f_t : t = 1, \ldots, T+1\}$ be the sequence produced by the algorithm (1.6). If $\lambda > 0$ and $\eta_t \le \frac{1}{\lambda}$, then

    $$E_{z_1,\ldots,z_T}\left[\left\|f_{T+1} - f_\lambda^\sigma\right\|_K^2\right] \le 4\kappa^2\sigma^2\sum_{t=1}^T\eta_t^2\prod_{j=t+1}^T(1 - \lambda\eta_j) + \frac{2M\sigma}{\lambda}\prod_{t=1}^T(1 - \lambda\eta_t). \tag{3.16}$$

    Furthermore, there holds

    $$E_{z_1,\ldots,z_T}\left[\left\|f_{T+1} - f_\lambda^\sigma\right\|_K^2\right] \le 4\kappa^2\sigma^2\sum_{t=1}^T\eta_t^2\exp\left\{-\lambda\sum_{j=t+1}^T\eta_j\right\} + \frac{2M\sigma}{\lambda}\exp\left\{-\lambda\sum_{t=1}^T\eta_t\right\}. \tag{3.17}$$

    Proof. According to the algorithm (1.6), we know

    $$\left\|f_{t+1} - f_\lambda^\sigma\right\|_K^2 = \left\|f_t - f_\lambda^\sigma\right\|_K^2 + \eta_t^2\left\|\lambda f_t - \frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}K_{x_t}\right\|_K^2 + 2\eta_t\left\langle\lambda f_t - \frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}K_{x_t},\ f_\lambda^\sigma - f_t\right\rangle_K = \left\|f_t - f_\lambda^\sigma\right\|_K^2 + \eta_t^2\left\|\lambda f_t - \frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}K_{x_t}\right\|_K^2 + 2\eta_t A, \tag{3.18}$$

    where $A = \left\langle\lambda f_t - \dfrac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}K_{x_t},\ f_\lambda^\sigma - f_t\right\rangle_K$.

    By using the inequality $\langle a, b - a\rangle_K \le \frac{1}{2}\left(\|b\|_K^2 - \|a\|_K^2\right)$, $a, b \in \mathcal{H}_K$, with $a = f_t$, $b = f_\lambda^\sigma$, we have

    $$A = \lambda\left\langle f_t, f_\lambda^\sigma - f_t\right\rangle_K - \frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}\left\langle K_{x_t}, f_\lambda^\sigma - f_t\right\rangle_K \le \frac{\lambda}{2}\left(\|f_\lambda^\sigma\|_K^2 - \|f_t\|_K^2\right) - \frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}\left(f_\lambda^\sigma(x_t) - f_t(x_t)\right)$$
    $$\le \frac{\lambda}{2}\left(\|f_\lambda^\sigma\|_K^2 - \|f_t\|_K^2\right) + V_\sigma(y_t - f_\lambda^\sigma(x_t)) - V_\sigma(y_t - f_t(x_t)) = \left(V_\sigma(y_t - f_\lambda^\sigma(x_t)) + \frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\right) - \left(V_\sigma(y_t - f_t(x_t)) + \frac{\lambda}{2}\|f_t\|_K^2\right). \tag{3.19}$$

    Since $f_t$ depends on $\{z_1, \ldots, z_{t-1}\}$ but not on $z_t$, it follows that

    $$E_{z_1,\ldots,z_t}[A] \le E_{z_1,\ldots,z_{t-1}}\left[E_{z_t}\left[\left(V_\sigma(y_t - f_\lambda^\sigma(x_t)) + \frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\right) - \left(V_\sigma(y_t - f_t(x_t)) + \frac{\lambda}{2}\|f_t\|_K^2\right)\right]\right] = \mathcal{E}^\sigma(f_\lambda^\sigma) + \frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2 - E_{z_1,\ldots,z_{t-1}}\left[\mathcal{E}^\sigma(f_t) + \frac{\lambda}{2}\|f_t\|_K^2\right]. \tag{3.20}$$

    Combining (3.18) with (3.20), we have

    $$E_{z_1,\ldots,z_t}\left[\left\|f_{t+1} - f_\lambda^\sigma\right\|_K^2\right] \le E_{z_1,\ldots,z_{t-1}}\left[\left\|f_t - f_\lambda^\sigma\right\|_K^2\right] + \eta_t^2 E_{z_1,\ldots,z_t}\left[\left\|\lambda f_t - \frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}K_{x_t}\right\|_K^2\right] + 2\eta_t E_{z_1,\ldots,z_{t-1}}\left[\left(\mathcal{E}^\sigma(f_\lambda^\sigma) + \frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\right) - \left(\mathcal{E}^\sigma(f_t) + \frac{\lambda}{2}\|f_t\|_K^2\right)\right]$$
    $$= E_{z_1,\ldots,z_{t-1}}\left[\left\|f_t - f_\lambda^\sigma\right\|_K^2\right] + \eta_t^2 E_{z_1,\ldots,z_t}\left[\left\|\lambda f_t - \frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}K_{x_t}\right\|_K^2\right] + 2\eta_t E_{z_1,\ldots,z_{t-1}}\left[\Omega^\sigma(f_\lambda^\sigma) - \Omega^\sigma(f_t)\right]. \tag{3.21}$$

    According to Lemma 3.3, we know

    $$\Omega^\sigma(f_\lambda^\sigma) - \Omega^\sigma(f_t) \le -\frac{\lambda}{2}\left\|f_t - f_\lambda^\sigma\right\|_K^2. \tag{3.22}$$

    Therefore, (3.21) implies that

    $$E_{z_1,\ldots,z_t}\left[\left\|f_{t+1} - f_\lambda^\sigma\right\|_K^2\right] \le E_{z_1,\ldots,z_{t-1}}\left[\left\|f_t - f_\lambda^\sigma\right\|_K^2\right] + \eta_t^2 E_{z_1,\ldots,z_t}\left[\left\|\lambda f_t - \frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}K_{x_t}\right\|_K^2\right] - \lambda\eta_t E_{z_1,\ldots,z_{t-1}}\left[\left\|f_t - f_\lambda^\sigma\right\|_K^2\right]$$
    $$= (1 - \lambda\eta_t)E_{z_1,\ldots,z_{t-1}}\left[\left\|f_t - f_\lambda^\sigma\right\|_K^2\right] + \eta_t^2 E_{z_1,\ldots,z_t}\left[\left\|\lambda f_t - \frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}K_{x_t}\right\|_K^2\right]. \tag{3.23}$$

    By Lemma 3.1, we know

    $$\left\|\lambda f_t - \frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}K_{x_t}\right\|_K^2 \le \left(\lambda\|f_t\|_K + \kappa\left|\frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}\right|\right)^2 \le \left(\lambda\cdot\frac{\kappa\sigma}{\lambda} + \kappa\sigma\right)^2 = 4\kappa^2\sigma^2. \tag{3.24}$$

    Substituting (3.24) into (3.23), we obtain

    $$E_{z_1,\ldots,z_t}\left[\left\|f_{t+1} - f_\lambda^\sigma\right\|_K^2\right] \le (1 - \lambda\eta_t)E_{z_1,\ldots,z_{t-1}}\left[\left\|f_t - f_\lambda^\sigma\right\|_K^2\right] + 4\kappa^2\sigma^2\eta_t^2.$$

    Applying this relation iteratively for $t = T, T-1, \ldots, 1$, we have

    $$E_{z_1,\ldots,z_T}\left[\left\|f_{T+1} - f_\lambda^\sigma\right\|_K^2\right] \le (1 - \lambda\eta_T)E_{z_1,\ldots,z_{T-1}}\left[\left\|f_T - f_\lambda^\sigma\right\|_K^2\right] + 4\kappa^2\sigma^2\eta_T^2 \le (1 - \lambda\eta_T)\left((1 - \lambda\eta_{T-1})E_{z_1,\ldots,z_{T-2}}\left[\left\|f_{T-1} - f_\lambda^\sigma\right\|_K^2\right] + 4\kappa^2\sigma^2\eta_{T-1}^2\right) + 4\kappa^2\sigma^2\eta_T^2$$
    $$\le \cdots \le 4\kappa^2\sigma^2\sum_{t=1}^T\eta_t^2\prod_{j=t+1}^T(1 - \lambda\eta_j) + \prod_{t=1}^T(1 - \lambda\eta_t)E\left[\left\|f_1 - f_\lambda^\sigma\right\|_K^2\right].$$

    Here we use the convention $\prod_{j=T+1}^T(1 - \lambda\eta_j) = 1$. From the definition of $f_\lambda^\sigma$, we see that

    $$\frac{\lambda}{2}\left\|f_\lambda^\sigma\right\|_K^2 \le \mathcal{E}^\sigma(0) \le M\sigma. \tag{3.25}$$

    This, in connection with the initialization condition $f_1 = 0$, implies that

    $$E_{z_1,\ldots,z_T}\left[\left\|f_{T+1} - f_\lambda^\sigma\right\|_K^2\right] \le 4\kappa^2\sigma^2\sum_{t=1}^T\eta_t^2\prod_{j=t+1}^T(1 - \lambda\eta_j) + \prod_{t=1}^T(1 - \lambda\eta_t)\left\|f_\lambda^\sigma\right\|_K^2 \le 4\kappa^2\sigma^2\sum_{t=1}^T\eta_t^2\prod_{j=t+1}^T(1 - \lambda\eta_j) + \frac{2M\sigma}{\lambda}\prod_{t=1}^T(1 - \lambda\eta_t).$$

    This shows (3.16). The bound (3.17) then follows from (3.16) and the inequality $1 - u \le e^{-u}$ for any $u \ge 0$.

    Proof of Theorem 2.1. It is easy to see that

    $$\prod_{t=1}^T(1 - \lambda\eta_t) \le \exp\left\{-\lambda\sum_{t=1}^T\eta_t\right\} \to 0 \quad (T \to +\infty).$$

    This implies that, for any $\varepsilon > 0$, there exists some $T_1 \in \mathbb{N}$ such that

    $$\prod_{t=1}^T(1 - \lambda\eta_t) \le \varepsilon,$$

    whenever $T \ge T_1$.

    According to the assumption $\lim_{t\to+\infty}\eta_t = 0$, there exists some $t(\varepsilon) \in \mathbb{N}$ such that $\eta_t \le \lambda\varepsilon$ for every $t \ge t(\varepsilon)$. Furthermore, we have

    $$\sum_{t=t(\varepsilon)+1}^T\eta_t^2\prod_{j=t+1}^T(1 - \lambda\eta_j) \le \lambda\varepsilon\sum_{t=t(\varepsilon)+1}^T\eta_t\prod_{j=t+1}^T(1 - \lambda\eta_j) = \varepsilon\sum_{t=t(\varepsilon)+1}^T\left(\big(1 - (1 - \lambda\eta_t)\big)\prod_{j=t+1}^T(1 - \lambda\eta_j)\right) = \varepsilon\sum_{t=t(\varepsilon)+1}^T\left(\prod_{j=t+1}^T(1 - \lambda\eta_j) - \prod_{j=t}^T(1 - \lambda\eta_j)\right)$$
    $$= \varepsilon\left(\left(\prod_{j=t(\varepsilon)+2}^T(1 - \lambda\eta_j) - \prod_{j=t(\varepsilon)+1}^T(1 - \lambda\eta_j)\right) + \left(\prod_{j=t(\varepsilon)+3}^T(1 - \lambda\eta_j) - \prod_{j=t(\varepsilon)+2}^T(1 - \lambda\eta_j)\right) + \cdots + \left(\prod_{j=T+1}^T(1 - \lambda\eta_j) - \prod_{j=T}^T(1 - \lambda\eta_j)\right)\right) = \varepsilon\left(1 - \prod_{j=t(\varepsilon)+1}^T(1 - \lambda\eta_j)\right) \le \varepsilon. \tag{3.26}$$

    Since $t(\varepsilon)$ is fixed, there exists some $T_2 \in \mathbb{N}$ such that, for every $T \ge T_2$, it holds that

    $$\sum_{j=t(\varepsilon)+1}^T\eta_j \ge \sum_{j=t(\varepsilon)+1}^{T_2}\eta_j \ge \frac{1}{\lambda}\log\frac{t(\varepsilon)}{\lambda^2\varepsilon}.$$

    So, for any $1 \le t \le t(\varepsilon)$, we have

    $$\prod_{j=t+1}^T(1 - \lambda\eta_j) \le \exp\left\{-\sum_{j=t+1}^T\lambda\eta_j\right\} \le \exp\left\{-\sum_{j=t(\varepsilon)+1}^T\lambda\eta_j\right\} \le \frac{\lambda^2\varepsilon}{t(\varepsilon)}.$$

    Hence

    $$\sum_{t=1}^{t(\varepsilon)}\eta_t^2\prod_{j=t+1}^T(1 - \lambda\eta_j) \le \frac{\lambda^2\varepsilon}{t(\varepsilon)}\sum_{t=1}^{t(\varepsilon)}\eta_t^2 \le \varepsilon. \tag{3.27}$$

    From (3.26) and (3.27), we know that, for any $\varepsilon > 0$,

    $$\sum_{t=1}^T\eta_t^2\prod_{j=t+1}^T(1 - \lambda\eta_j) = \sum_{t=1}^{t(\varepsilon)}\eta_t^2\prod_{j=t+1}^T(1 - \lambda\eta_j) + \sum_{t=t(\varepsilon)+1}^T\eta_t^2\prod_{j=t+1}^T(1 - \lambda\eta_j) \le \varepsilon + \varepsilon = 2\varepsilon, \tag{3.28}$$

    whenever $T \ge T_2$. Let $T^* = \max\{T_1, T_2\}$; then by (3.16), (3.28) and the definition of $T_1$, we have

    $$E_{z_1,\ldots,z_T}\left[\left\|f_{T+1} - f_\lambda^\sigma\right\|_K^2\right] \le \left(8\kappa^2\sigma^2 + \frac{2M\sigma}{\lambda}\right)\varepsilon,$$

    whenever $T \ge T^*$. Since $\varepsilon > 0$ is arbitrary, this completes the proof of Theorem 2.1.

    Proof of Theorem 2.2. By (3.21), we know

    $$E_{z_1,\ldots,z_t}\left[\left\|f_{t+1} - f_\lambda^\sigma\right\|_K^2\right] \le E_{z_1,\ldots,z_{t-1}}\left[\left\|f_t - f_\lambda^\sigma\right\|_K^2\right] + \eta_t^2 E_{z_1,\ldots,z_t}\left[\left\|\lambda f_t - \frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}K_{x_t}\right\|_K^2\right] + 2\eta_t E_{z_1,\ldots,z_{t-1}}\left[\left(\mathcal{E}^\sigma(f_\lambda^\sigma) + \frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\right) - \left(\mathcal{E}^\sigma(f_t) + \frac{\lambda}{2}\|f_t\|_K^2\right)\right]. \tag{3.29}$$

    From the inequality $\|a - b\|_K^2 \le 2\|a\|_K^2 + 2\|b\|_K^2$, we have

    $$\left\|\lambda f_t - \frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}K_{x_t}\right\|_K^2 \le 2\lambda^2\|f_t\|_K^2 + 2\left|\frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}\right|^2\left\|K_{x_t}(\cdot)\right\|_K^2 \le 2\lambda^2\|f_t\|_K^2 + 2\kappa^2\left|\frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}\right|^2. \tag{3.30}$$

    On the other hand, for any $r \in \mathbb{R}$, it holds that $\left|\frac{r}{\sqrt{1 + \left(\frac{r}{\sigma}\right)^2}}\right|^2 \le 2\sigma^2\left(\frac{\sqrt{\sigma^2 + r^2}}{\sigma} - 1\right) = 2V_\sigma(r)$. This implies that

    $$\left|\frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}\right|^2 \le 2V_\sigma(y_t - f_t(x_t)). \tag{3.31}$$

    Combining (3.30) with (3.31), we get

    $$\left\|\lambda f_t - \frac{y_t - f_t(x_t)}{\sqrt{1 + \left(\frac{y_t - f_t(x_t)}{\sigma}\right)^2}}K_{x_t}\right\|_K^2 \le 4\kappa^2 V_\sigma(y_t - f_t(x_t)) + 2\lambda^2\|f_t\|_K^2 \le 4\kappa^2 V_\sigma(y_t - f_t(x_t)) + 2\lambda\|f_t\|_K^2 = 4\kappa^2 V_\sigma(y_t - f_t(x_t)) + 4\cdot\frac{\lambda}{2}\|f_t\|_K^2 \le 4(\kappa^2 + 1)\left(V_\sigma(y_t - f_t(x_t)) + \frac{\lambda}{2}\|f_t\|_K^2\right). \tag{3.32}$$

    Substituting (3.32) into (3.29), we get

    $$E_{z_1,\ldots,z_t}\left[\left\|f_{t+1} - f_\lambda^\sigma\right\|_K^2\right] \le E_{z_1,\ldots,z_{t-1}}\left[\left\|f_t - f_\lambda^\sigma\right\|_K^2\right] + 4(\kappa^2 + 1)\eta_t^2 E_{z_1,\ldots,z_t}\left[V_\sigma(y_t - f_t(x_t)) + \frac{\lambda}{2}\|f_t\|_K^2\right] + 2\eta_t E_{z_1,\ldots,z_{t-1}}\left[\left(\mathcal{E}^\sigma(f_\lambda^\sigma) + \frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\right) - \left(\mathcal{E}^\sigma(f_t) + \frac{\lambda}{2}\|f_t\|_K^2\right)\right]$$
    $$= E_{z_1,\ldots,z_{t-1}}\left[\left\|f_t - f_\lambda^\sigma\right\|_K^2\right] + 4(\kappa^2 + 1)\eta_t^2 E_{z_1,\ldots,z_{t-1}}\left[\mathcal{E}^\sigma(f_t) + \frac{\lambda}{2}\|f_t\|_K^2\right] + 2\eta_t E_{z_1,\ldots,z_{t-1}}\left[\left(\mathcal{E}^\sigma(f_\lambda^\sigma) + \frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\right) - \left(\mathcal{E}^\sigma(f_t) + \frac{\lambda}{2}\|f_t\|_K^2\right)\right]$$
    $$\le E_{z_1,\ldots,z_{t-1}}\left[\left\|f_t - f_\lambda^\sigma\right\|_K^2\right] + 4(\kappa^2 + 1)\eta_t^2 E_{z_1,\ldots,z_{t-1}}\left[\left(\mathcal{E}^\sigma(f_t) + \frac{\lambda}{2}\|f_t\|_K^2\right) - \left(\mathcal{E}^\sigma(f_\lambda^\sigma) + \frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\right)\right] + 2\eta_t E_{z_1,\ldots,z_{t-1}}\left[\left(\mathcal{E}^\sigma(f_\lambda^\sigma) + \frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\right) - \left(\mathcal{E}^\sigma(f_t) + \frac{\lambda}{2}\|f_t\|_K^2\right)\right] + 4(\kappa^2 + 1)\eta_t^2\,\mathcal{E}^\sigma(0)$$
    $$\le E_{z_1,\ldots,z_{t-1}}\left[\left\|f_t - f_\lambda^\sigma\right\|_K^2\right] + 2\eta_t\left(1 - 2(\kappa^2 + 1)\eta_t\right)E_{z_1,\ldots,z_{t-1}}\left[\left(\mathcal{E}^\sigma(f_\lambda^\sigma) + \frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\right) - \left(\mathcal{E}^\sigma(f_t) + \frac{\lambda}{2}\|f_t\|_K^2\right)\right] + 4(\kappa^2 + 1)M\sigma\eta_t^2$$
    $$= E_{z_1,\ldots,z_{t-1}}\left[\left\|f_t - f_\lambda^\sigma\right\|_K^2\right] + B + 4(\kappa^2 + 1)M\sigma\eta_t^2, \tag{3.33}$$

    where

    $$B = 2\eta_t\left(1 - 2(\kappa^2 + 1)\eta_t\right)E_{z_1,\ldots,z_{t-1}}\left[\left(\mathcal{E}^\sigma(f_\lambda^\sigma) + \frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\right) - \left(\mathcal{E}^\sigma(f_t) + \frac{\lambda}{2}\|f_t\|_K^2\right)\right].$$

    Based on the assumptions on $\eta_t$, we know that $1 - 2(\kappa^2 + 1)\eta_t \ge \frac{1}{2}$. And by Lemma 3.3, we know

    $$E_{z_1,\ldots,z_{t-1}}\left[\left(\mathcal{E}^\sigma(f_\lambda^\sigma) + \frac{\lambda}{2}\|f_\lambda^\sigma\|_K^2\right) - \left(\mathcal{E}^\sigma(f_t) + \frac{\lambda}{2}\|f_t\|_K^2\right)\right] \le -\frac{\lambda}{2}E_{z_1,\ldots,z_{t-1}}\left[\left\|f_t - f_\lambda^\sigma\right\|_K^2\right].$$

    This implies that

    $$B \le -\frac{\lambda\eta_t}{2}E_{z_1,\ldots,z_{t-1}}\left[\left\|f_t - f_\lambda^\sigma\right\|_K^2\right]. \tag{3.34}$$

    Combining (3.33) with (3.34), we obtain

    $$E_{z_1,\ldots,z_t}\left[\left\|f_{t+1} - f_\lambda^\sigma\right\|_K^2\right] \le \left(1 - \frac{\lambda\eta_t}{2}\right)E_{z_1,\ldots,z_{t-1}}\left[\left\|f_t - f_\lambda^\sigma\right\|_K^2\right] + 4M\sigma(\kappa^2 + 1)\eta_t^2. \tag{3.35}$$

    Denote $C_\sigma = 4M\sigma(\kappa^2 + 1)$. Applying the relation (3.35) iteratively for $t = T, T-1, \ldots, 1$, we have

    $$E_{z_1,\ldots,z_T}\left[\left\|f_{T+1} - f_\lambda^\sigma\right\|_K^2\right] \le \left(1 - \frac{\lambda}{2}\eta_T\right)E_{z_1,\ldots,z_{T-1}}\left[\left\|f_T - f_\lambda^\sigma\right\|_K^2\right] + C_\sigma\eta_T^2 \le \left(1 - \frac{\lambda}{2}\eta_T\right)\left(\left(1 - \frac{\lambda}{2}\eta_{T-1}\right)E_{z_1,\ldots,z_{T-2}}\left[\left\|f_{T-1} - f_\lambda^\sigma\right\|_K^2\right] + C_\sigma\eta_{T-1}^2\right) + C_\sigma\eta_T^2$$
    $$\le \cdots \le C_\sigma\sum_{t=1}^T\eta_t^2\prod_{j=t+1}^T\left(1 - \frac{\lambda}{2}\eta_j\right) + \prod_{t=1}^T\left(1 - \frac{\lambda}{2}\eta_t\right)E\left[\left\|f_1 - f_\lambda^\sigma\right\|_K^2\right] \le C_\sigma\sum_{t=1}^T\eta_t^2\prod_{j=t+1}^T\left(1 - \frac{\lambda}{2}\eta_j\right) + \frac{2M\sigma}{\lambda}\prod_{t=1}^T\left(1 - \frac{\lambda}{2}\eta_t\right). \tag{3.36}$$

    For any $u \ge 0$, the inequality $1 - u \le e^{-u}$ holds, and hence (3.36) implies that

    $$E_{z_1,\ldots,z_T}\left[\left\|f_{T+1} - f_\lambda^\sigma\right\|_K^2\right] \le C_\sigma\sum_{t=1}^T\eta_t^2\exp\left\{-\frac{\lambda}{2}\sum_{j=t+1}^T\eta_j\right\} + \frac{2M\sigma}{\lambda}\exp\left\{-\frac{\lambda}{2}\sum_{t=1}^T\eta_t\right\}. \tag{3.37}$$

    Denote

    $$I_1 = \frac{2M\sigma}{\lambda}\exp\left\{-\frac{\lambda}{2}\sum_{t=1}^T\eta_t\right\} = D_\sigma(\lambda)\exp\left\{-\frac{\lambda}{2}\sum_{t=1}^T\eta_t\right\} = D_\sigma(\lambda)\exp\left\{-\frac{\lambda}{2C}\sum_{t=1}^T t^{-\theta}\right\}$$

    and

    $$I_2 = C_\sigma\sum_{t=1}^T\left(\frac{1}{Ct^\theta}\right)^2\exp\left\{-\frac{\lambda}{2C}\sum_{j=t+1}^T j^{-\theta}\right\} = \frac{C_\sigma}{C^2}\sum_{t=1}^T t^{-2\theta}\exp\left\{-\frac{\lambda}{2C}\sum_{j=t+1}^T j^{-\theta}\right\}.$$

    Then, by (3.37) and the assumption on the stepsize $\eta_t$, we know

    $$E_{z_1,\ldots,z_T}\left[\left\|f_{T+1} - f_\lambda^\sigma\right\|_K^2\right] \le I_1 + I_2. \tag{3.38}$$

    Now, we estimate $I_1$ and $I_2$ respectively. By Lemma 4 of [9], we obtain the following estimate of $I_1$:

    $$I_1 \le \begin{cases} D_\sigma(\lambda)\exp\left\{-\dfrac{\lambda}{2C}\cdot\dfrac{1 - 2^{\theta-1}}{1 - \theta}(T+1)^{1-\theta}\right\}, & \text{if } 0 < \theta < 1, \\[3mm] D_\sigma(\lambda)(T+1)^{-\frac{\lambda}{2C}}, & \text{if } \theta = 1. \end{cases} \tag{3.39}$$

    On the other hand, by Lemma 5.10 of [23] with $\nu = \frac{\lambda}{2C}$ and $s = \theta$, we have

    $$\sum_{t=1}^{T-1}t^{-2\theta}\exp\left\{-\frac{\lambda}{2C}\sum_{j=t+1}^T j^{-\theta}\right\} \le \begin{cases} \dfrac{36C}{\lambda}T^{-\theta} + \dfrac{9T^{1-\theta}}{(1-\theta)2^{1-\theta}}\exp\left\{-\dfrac{\lambda(1 - 2^{\theta-1})}{2C(1-\theta)}(T+1)^{1-\theta}\right\}, & \text{if } 0 < \theta < 1, \\[3mm] \dfrac{16C}{2C - \lambda}(T+1)^{-\frac{\lambda}{2C}}, & \text{if } \theta = 1. \end{cases}$$

    So,

    $$\sum_{t=1}^T t^{-2\theta}\exp\left\{-\frac{\lambda}{2C}\sum_{j=t+1}^T j^{-\theta}\right\} \le \sum_{t=1}^{T-1}t^{-2\theta}\exp\left\{-\frac{\lambda}{2C}\sum_{j=t+1}^T j^{-\theta}\right\} + T^{-2\theta} \le \begin{cases} \dfrac{36C}{\lambda}T^{-\theta} + \dfrac{9T^{1-\theta}}{(1-\theta)2^{1-\theta}}\exp\left\{-\dfrac{\lambda(1 - 2^{\theta-1})}{2C(1-\theta)}(T+1)^{1-\theta}\right\} + T^{-2\theta}, & \text{if } 0 < \theta < 1, \\[3mm] \dfrac{16C}{2C - \lambda}(T+1)^{-\frac{\lambda}{2C}} + T^{-2}, & \text{if } \theta = 1. \end{cases}$$

    Furthermore, we have the following estimate of $I_2$:

    $$I_2 \le \begin{cases} \dfrac{C_\sigma}{C^2}\left(\dfrac{36C}{\lambda}T^{-\theta} + \dfrac{9T^{1-\theta}}{(1-\theta)2^{1-\theta}}\exp\left\{-\dfrac{\lambda(1 - 2^{\theta-1})}{2C(1-\theta)}(T+1)^{1-\theta}\right\} + T^{-2\theta}\right), & \text{if } 0 < \theta < 1, \\[3mm] \dfrac{C_\sigma}{C^2}\left(\dfrac{16C}{2C - \lambda}(T+1)^{-\frac{\lambda}{2C}} + T^{-2}\right), & \text{if } \theta = 1. \end{cases} \tag{3.40}$$

    Since $T^{-2\theta} \le T^{-\theta}$ and $\lambda \le C$, the conclusion can be established by combining (3.38) with (3.39) and (3.40).

    Proof of Corollary 2.1. For $\theta \in (0,1)$ and $0 < \alpha \le \min\{1-\theta, \theta\}$, by Theorem 2.2 with $\lambda = T^{-\alpha}$, we have

    $$E\left[\left\|f_{T+1} - f_\lambda^\sigma\right\|_K^2\right] \le \sigma\left(\left(2MT^{\alpha} + \frac{36M(\kappa^2+1)}{C^2(1-\theta)2^{1-\theta}}T^{1-\theta}\right)\exp\left\{-\frac{1 - 2^{\theta-1}}{2C(1-\theta)}T^{(1-\theta)-\alpha}\right\} + \frac{36M(\kappa^2+1)}{C}T^{-(\theta-\alpha)}\right). \tag{3.41}$$

    Since for any $\nu > 0$, $c > 0$ and $\eta > 0$ there exists $L > 0$ such that $\exp\{-cT^\nu\} \le LT^{-\eta}$, the first term on the right-hand side of (3.41) decays in the form of $O(\sigma T^{-\eta})$ for arbitrarily large $\eta > 0$. The second term on the right-hand side of (3.41) is bounded by $O(\sigma T^{-(\theta-\alpha)})$. Consequently, there exists a constant $C_{M,C,\kappa,\theta}$ depending only on $\theta, \kappa, C$ and $M$ such that

    $$E\left[\left\|f_{T+1} - f_\lambda^\sigma\right\|_K^2\right] \le C_{M,C,\kappa,\theta}\cdot\sigma T^{-(\theta-\alpha)}.$$

    The proof is completed.

    Proof of Corollary 2.2. According to the mean value theorem, there exists $\xi$ between $y - f_{T+1}(x)$ and $y - f_\lambda^\sigma(x)$ such that

    $$\left|V_\sigma(y - f_{T+1}(x)) - V_\sigma(y - f_\lambda^\sigma(x))\right| = \left|V_\sigma'(\xi)\right|\left|f_{T+1}(x) - f_\lambda^\sigma(x)\right| = \frac{|\xi|}{\sqrt{1 + \left(\frac{\xi}{\sigma}\right)^2}}\left|f_{T+1}(x) - f_\lambda^\sigma(x)\right| \le \sigma\left|f_{T+1}(x) - f_\lambda^\sigma(x)\right|.$$

    Then, we have

    $$\mathcal{E}^\sigma(f_{T+1}) - \mathcal{E}^\sigma(f_\lambda^\sigma) \le \int_Z\left|V_\sigma(y - f_{T+1}(x)) - V_\sigma(y - f_\lambda^\sigma(x))\right|d\rho(x, y) \le \sigma\int_Z\left|f_{T+1}(x) - f_\lambda^\sigma(x)\right|d\rho(x, y) \le \kappa\sigma\left\|f_{T+1} - f_\lambda^\sigma\right\|_K. \tag{3.42}$$

    And we get

    $$\mathcal{E}^\sigma(f_{T+1}) - \mathcal{E}^\sigma(f_\sigma) = \left(\mathcal{E}^\sigma(f_{T+1}) - \mathcal{E}^\sigma(f_\lambda^\sigma)\right) + \mathcal{E}^\sigma(f_\lambda^\sigma) - \mathcal{E}^\sigma(f_\sigma) \le \kappa\sigma\left\|f_{T+1} - f_\lambda^\sigma\right\|_K + \mathcal{E}^\sigma(f_\lambda^\sigma) - \mathcal{E}^\sigma(f_\sigma)$$
    $$\le \kappa\sigma\left\|f_{T+1} - f_\lambda^\sigma\right\|_K + \mathcal{E}^\sigma(f_\lambda^\sigma) - \mathcal{E}^\sigma(f_\sigma) + \frac{\lambda}{2}\left\|f_\lambda^\sigma\right\|_K^2 = \kappa\sigma\left\|f_{T+1} - f_\lambda^\sigma\right\|_K + \inf_{f \in \mathcal{H}_K}\left\{\mathcal{E}^\sigma(f) - \mathcal{E}^\sigma(f_\sigma) + \frac{\lambda}{2}\|f\|_K^2\right\} = \kappa\sigma\left\|f_{T+1} - f_\lambda^\sigma\right\|_K + \mathcal{K}(f_\sigma, \lambda).$$

    Combining this with Corollary 2.1 (noting that $E\left[\|f_{T+1} - f_\lambda^\sigma\|_K\right] \le \left(E\left[\|f_{T+1} - f_\lambda^\sigma\|_K^2\right]\right)^{1/2}$) and the assumption (2.2), the desired result follows.

    Most studies of online learning algorithms focus on convergence in expectation (see, for example, [8,9,10,25]). However, these results were established for some fixed loss functions, such as the least-square loss (see e.g., [9,22]). Our results are established for a parameterized loss function with a scale parameter $\sigma$. The analysis in Section 2 shows that the scale parameter $\sigma$ can effectively control the convergence rate of the learning algorithm, and a better convergence rate is obtained. On the other hand, previous research on online learning algorithms relies on integral operator theory (see [25]); this paper instead establishes the error bounds for the learning sequence by applying the convex analysis method. The convex analysis method has been widely used in various research fields, for example, in the analysis of machine learning algorithms (see e.g., [21,23,26]) and in studies of discrete fractional operators (see e.g., [27,28]), and it has proved to be a very effective analysis tool.

    In [23], the online pairwise regression problem with the quadratic loss is studied. Different from [23], in this paper we use the parameterized loss function for the pointwise learning model, which has a wider range of applications than the pairwise learning model. It is known that deep convolutional networks can increase the approximation order (see e.g., [29,30,31,32]); it is therefore hopeful that the convergence rate provided in this paper can be improved by employing deep neural network methods.

    In the present paper, we analyze the learning performance of the kernel regularized online algorithm with a parameterized loss. The convergence of the learning sequence is proved and the error bound is provided in the expectation sense by using the convex analysis method. There are some questions for further study. In this paper, we focus on the theoretical analysis of the kernel regularized online algorithm with the parameterized loss $V_\sigma$; however, there is still a gap between the theoretical analysis and the optimization process of empirical risk minimization based on a parameterized loss. In future work, it would be interesting to apply the online learning algorithm based on $V_\sigma$ to practical problems and to construct an effective solution method. In addition, we mainly analyze the sample error in this paper, and the approximation error is represented by the K-functional. How to carry out a more accurate analysis of the approximation error, and how the scale parameter $\sigma$ influences the approximation error, remain to be further studied.

    This work is supported by the Special Project for Scientific and Technological Cooperation of Jiangxi Province (Project No. 20212BDH80021) and the Science and Technology Project of the Jiangxi Provincial Department of Education (Project No. GJJ211334).

    No potential conflict of interest was reported by the author.



    [1] N. Aronszajn, Theory of reproducing kernels, Trans. Amer. Math. Soc., 68 (1950), 337–404. http://dx.doi.org/10.2307/1990404 doi: 10.2307/1990404
    [2] W. Dai, J. Hu, Y. Cheng, X. Wang, T. Chai, RVFLN-based online adaptive semi-supervised learning algorithm with application to product quality estimation of industrial processes, J. Cent. South Univ., 26 (2019), 3338–3350. http://dx.doi.org/10.1007/s11771-019-4257-6 doi: 10.1007/s11771-019-4257-6
    [3] J. Gui, Y. Liu, X. Deng, B. Liu, Network capacity optimization for Cellular-assisted vehicular systems by online learning-based mmWave beam selection, Wirel. Commun. Mob. Com., 2021 (2021), 8876186. http://dx.doi.org/10.1155/2021/8876186 doi: 10.1155/2021/8876186
    [4] M. Li, I. Sethi, A new online learning algorithm with application to image segmentation, Image Processing: Algorithms and Systems IV, 5672 (2005), 277–286. http://dx.doi.org/10.1117/12.586328 doi: 10.1117/12.586328
    [5] S. Sai Santosh, S. Darak, Intelligent and reconfigurable architecture for KL divergence based online machine learning algorithm, arXiv: 2002.07713.
    [6] B. Yang, J. Yao, X. Yang, Y. Shi, Painting image classification using online learning algorithm, In: Distributed, ambient and pervasive interactions, Cham: Springer, 2017,393–403. http://dx.doi.org/10.1007/978-3-319-58697-7_29
    [7] S. Das, Kuhoo, D. Mishra, M. Rout, An optimized feature reduction based currency forecasting model exploring the online sequential extreme learning machine and krill herd strategies, Physica A, 513 (2019), 339–370. http://dx.doi.org/10.1016/j.physa.2018.09.021 doi: 10.1016/j.physa.2018.09.021
    [8] S. Smale, Y. Yao, Online learning algorithms, Found. Comput. Math., 6 (2006), 145–170. http://dx.doi.org/10.1007/s10208-004-0160-z
    [9] Y. Ying, D. Zhou, Online regularized classification algorithms, IEEE Trans. Inform. Theory, 52 (2006), 4775–4788. http://dx.doi.org/10.1109/TIT.2006.883632 doi: 10.1109/TIT.2006.883632
    [10] Y. Ying, D. Zhou, Unregularized online learning algorithms with general loss functions, Appl. Comput. Harmon. Anal., 42 (2017), 224–244. http://dx.doi.org/10.1016/J.ACHA.2015.08.007 doi: 10.1016/J.ACHA.2015.08.007
    [11] Y. Zeng, D. Klabjian, Online adaptive machine learning based algorithm for implied volatility surface modeling, Knowl.-Based Syst., 163 (2019), 376–391. http://dx.doi.org/10.1016/j.knosys.2018.08.039 doi: 10.1016/j.knosys.2018.08.039
    [12] J. Lin, D. Zhou, Online learning algorithms can converge comparably fast as batch learning, IEEE Trans. Neural Netw. Learn. Syst., 29 (2018), 2367–2378. http://dx.doi.org/10.1109/TNNLS.2017.2677970 doi: 10.1109/TNNLS.2017.2677970
    [13] P. Huber, E. Ronchetti, Robust statistics, Hoboken: John Wiley & Sons, 2009. http://dx.doi.org/10.1002/9780470434697
    [14] Y. Wu, Y. Liu, Robust truncated hinge loss support vector machine, J. Am. Stat. Assoc., 102 (2007), 974–983. http://dx.doi.org/10.1198/016214507000000617 doi: 10.1198/016214507000000617
    [15] Y. Yu, M. Yang, L. Xu, M. White, D. Schuurmans, Relaxed clipping: a global training method for robust regression and classification, Proceedings of the 23rd International Conference on Neural Information Processing Systems, 2 (2010), 2532–2540.
    [16] S. Huang, Y. Feng, Q. Wu, Learning theory of minimum error entropy under weak moment conditions, Anal. Appl., 20 (2022), 121–139. http://dx.doi.org/10.1142/S0219530521500044 doi: 10.1142/S0219530521500044
    [17] F. Lv, J. Fan, Optimal learning with Gaussians and correntropy loss, Anal. Appl., 19 (2021), 107–124. http://dx.doi.org/10.1142/S0219530519410124 doi: 10.1142/S0219530519410124
    [18] X. Zhu, Z. Li, J. Sun, Expression recognition method combining convolutional features and Transformer, Math. Found. Comput., in press. http://dx.doi.org/10.3934/mfc.2022018
    [19] S. Suzumura, K. Ogawa, M. Sugiyama, M. Karasuyama, I. Takeuchi, Homotopy continuation approaches for robust SV classification and regression, Mach. Learn., 106 (2017), 1009–1038. http://dx.doi.org/10.1007/s10994-017-5627-7 doi: 10.1007/s10994-017-5627-7
    [20] Z. Guo, T. Hu, L. Shi, Gradient descent for robust kernel-based regression, Inverse Probl., 34 (2018), 065009. http://dx.doi.org/10.1088/1361-6420/aabe55 doi: 10.1088/1361-6420/aabe55
    [21] B. Sheng, H. Zhu, The convergence rate of semi-supervised regression with quadratic loss, Appl. Math. Comput., 321 (2018), 11–24. http://dx.doi.org/10.1016/j.amc.2017.10.033 doi: 10.1016/j.amc.2017.10.033
    [22] M. Pontil, Y. Ying, D. Zhou, Error analysis for online gradient descent algorithms in reproducing kernel Hilbert spaces, Proceedings of Technical Report, University College London, 2005, 1–20.
    [23] S. Wang, Z. Chen, B. Sheng, Convergence of online pairwise regression learning with quadratic loss, Commun. Pur. Appl. Anal., 19 (2020), 4023–4054. http://dx.doi.org/10.3934/cpaa.2020178 doi: 10.3934/cpaa.2020178
    [24] H. Bauschke, P. Combettes, Convex analysis and monotone operator theory in Hilbert spaces, Cham: Springer-Verlag, 2010. http://dx.doi.org/10.1007/978-3-319-48311-5
    [25] Z. Guo, L. Shi, Fast and strong convergence of online learning algorithms, Adv. Comput. Math., 45 (2019), 2745–2770. http://dx.doi.org/10.1007/s10444-019-09707-8 doi: 10.1007/s10444-019-09707-8
    [26] Y. Lei, D. Zhou, Convergence of online mirror descent, Appl. Comput. Harmon. Anal., 48 (2020), 343–373. http://dx.doi.org/10.1016/j.acha.2018.05.005 doi: 10.1016/j.acha.2018.05.005
    [27] I. Baloch, T. Abdeljawad, S. Bibi, A. Mukheimer, G. Farid, A. Haq, Some new Caputo fractional derivative inequalities for exponentially (θ,hm)-convex functions, AIMS Mathematics, 7 (2022), 3006–3026. http://dx.doi.org/10.3934/math.2022166 doi: 10.3934/math.2022166
    [28] P. Mohammed, D. O'Regan, A. Brzo, K. Abualnaja, D. Baleanu, Analysis of positivity results for discrete fractional operators by means of exponential kernels, AIMS Mathematics, 7 (2022), 15812–15823. http://dx.doi.org/10.3934/math.2022865 doi: 10.3934/math.2022865
    [29] Y. Xia, J. Zhou, T. Xu, W. Gao, An improved deep convolutional neural network model with kernel loss function in image classifiaction, Math. Found. Comput., 3 (2020), 51–64. http://dx.doi.org/10.3934/mfc.2020005 doi: 10.3934/mfc.2020005
    [30] D. Zhou, Deep distributed convolutional neural networks: universality, Anal. Appl., 16 (2018), 895–919. http://dx.doi.org/10.1142/S0219530518500124 doi: 10.1142/S0219530518500124
    [31] D. Zhou, Universality of deep convolutional neural networks, Appl. Comput. Harmon. Anal., 48 (2020), 787–794. http://dx.doi.org/10.1016/j.acha.2019.06.004 doi: 10.1016/j.acha.2019.06.004
    [32] D. Zhou, Theory of deep convolutional neural networks: downsampling, Neural Networks, 124 (2020), 319–327. http://dx.doi.org/10.1016/j.neunet.2020.01.018 doi: 10.1016/j.neunet.2020.01.018
  • © 2022 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)