We consider the rescaled pure greedy learning algorithm (RPGLA) with dependent samples drawn according to a non-identical sequence of probability distributions. The generalization performance is analyzed by applying the independent-blocks technique and accounting for the drift error. We derive a satisfactory learning rate for the algorithm under the assumption that the process is stationary and β-mixing, and we show that the optimal rate $O(n^{-1})$ can be obtained for i.i.d. processes.
Citation: Qin Guo, Binlei Cai. Learning capability of the rescaled pure greedy algorithm with non-iid sampling[J]. Electronic Research Archive, 2023, 31(3): 1387-1404. doi: 10.3934/era.2023071
Greedy learning algorithms, that is, greedy algorithms applied to supervised learning problems, have triggered enormous recent research activity because of their low computational burden [1,2,3,4]. Theoretical analyses of greedy learning have also received wide attention in the framework of learning theory [1,2,3,5,6]. We consider the learning capability of the rescaled pure greedy algorithm (RPGA), introduced by Petrova in [7], in a non-i.i.d. sampling setting.
We first give a brief review of regression learning and greedy algorithms. Let $X$ be a compact metric space and $Y=\mathbb R$. Let $\mathbf z=\{z_i\}_{i=1}^n=\{(x_i,y_i)\}_{i=1}^n\in Z^n$ be a stationary real-valued sequence with unknown Borel probability distribution $\rho$ on $Z=X\times Y$.
The generalization error can be defined by
$$ \mathcal E(f)=\int_Z (f(x)-y)^2\,d\rho,\qquad \forall\, f: X\to Y. \tag{1.1} $$
Minimizing E(f), we can obtain the following regression function
$$ f_\rho(x)=\int_Y y\,d\rho(y|x), $$
where $\rho(\cdot|x)$ denotes the conditional probability measure (given $x$) on $Y$. The empirical error $\mathcal E_{\mathbf z}(f)$, which is a good approximation of the generalization error $\mathcal E(f)$ for a fixed function $f$ on $X$, is defined by
$$ \mathcal E_{\mathbf z}(f)=\|y-f\|_n^2:=\frac1n\sum_{i=1}^n (f(x_i)-y_i)^2. \tag{1.2} $$
The regression problem in learning theory aims at a good approximation $f_{\mathbf z}$ of $f_\rho$, constructed by learning algorithms. Denote by $L^2_{\rho_X}(X)$ the Hilbert space of square-integrable functions on $X$ with respect to the measure $\rho_X$, where $\rho_X$ denotes the marginal probability distribution on $X$ and $\|f\|_{\rho_X}=\bigl(\int_X |f|^2\,d\rho_X\bigr)^{1/2}$. It is known that, for any $f\in L^2_{\rho_X}(X)$, it holds that
$$ \mathcal E(f)-\mathcal E(f_\rho)=\|f-f_\rho\|^2, \tag{1.3} $$
where
$$ \|u\|^2=E\bigl(|u(x)|^2\bigr)=\|u\|_{\rho_X}^2. $$
The learning ability of the regression algorithm can be measured by the excess generalization error
$$ \|f_{\mathbf z}-f_\rho\|^2=\mathcal E(f_{\mathbf z})-\mathcal E(f_\rho). $$
Let $H$ be a real, separable Hilbert space endowed with inner product $\langle\cdot,\cdot\rangle$ and norm $\|\cdot\|:=\|\cdot\|_H=\langle\cdot,\cdot\rangle^{1/2}$. A set of functions $\mathcal D\subset H$ is called a dictionary if $\|g\|=1$ for every $g\in\mathcal D$, $g\in\mathcal D$ implies $-g\in\mathcal D$, and the closure of $\operatorname{span}(\mathcal D)$ is $H$. We define the RPGA($\mathcal D$) as follows:
RPGA(D):
Step 0: Define f0:=0.
Step m (m≥1):
(1) If $f=f_{m-1}$, stop the algorithm and define $f_k=f_{m-1}=f$ for $k\ge m$.
(2) If $f\neq f_{m-1}$, choose a direction $\varphi_m\in\mathcal D$ such that
$$ |\langle f-f_{m-1},\varphi_m\rangle|=\sup_{\varphi\in\mathcal D}|\langle f-f_{m-1},\varphi\rangle|. \tag{1.4} $$
With
$$ \lambda_m:=\langle f-f_{m-1},\varphi_m\rangle, \tag{1.5} $$
$$ \hat f_m:=f_{m-1}+\lambda_m\varphi_m, \tag{1.6} $$
$$ s_m:=\frac{\langle f,\hat f_m\rangle}{\|\hat f_m\|^2}, \tag{1.7} $$
define the next approximant to be
$$ f_m=s_m\hat f_m, \tag{1.8} $$
and proceed to Step m+1.
Note that the RPGA applies just the right rescaling to the output of the pure greedy algorithm (PGA), which boosts the convergence rate of the PGA to the optimal approximation rate $O(m^{-1/2})$ for functions in $A_1(\mathcal D)$; see [7].
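To make the iteration concrete, the following sketch runs the RPGA in the Euclidean space $\mathbb R^d$ with a finite dictionary of unit-norm atoms; the function names, the iteration cap and the stopping tolerance are our own illustrative choices and are not part of the original algorithm description.

```python
import numpy as np

def rpga(f, dictionary, m_max=100, tol=1e-12):
    """Minimal RPGA sketch in R^d with the Euclidean inner product.
    `dictionary` is an array of shape (N, d) whose rows are unit-norm atoms;
    `f` is the target vector to be approximated."""
    f = np.asarray(f, dtype=float)
    fm = np.zeros_like(f)                       # Step 0: f_0 = 0
    for _ in range(m_max):
        residual = f - fm
        if np.linalg.norm(residual) <= tol:     # f = f_{m-1}: stop
            break
        inner = dictionary @ residual           # (1.4): greedy atom selection
        j = int(np.argmax(np.abs(inner)))
        lam = inner[j]                          # (1.5): lambda_m
        f_hat = fm + lam * dictionary[j]        # (1.6): intermediate approximant
        denom = float(np.dot(f_hat, f_hat))
        if denom <= tol:                        # degenerate case: nothing to rescale
            break
        s = float(np.dot(f, f_hat)) / denom     # (1.7): rescaling factor s_m
        fm = s * f_hat                          # (1.8): f_m = s_m * f_hat_m
    return fm

# Illustration: a random spanning dictionary in R^10 drives the residual toward 0.
rng = np.random.default_rng(0)
D = rng.standard_normal((50, 10))
D /= np.linalg.norm(D, axis=1, keepdims=True)
target = rng.standard_normal(10)
print(np.linalg.norm(target - rpga(target, D, m_max=500)))
```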
Throughout this paper, we derive the error bounds under the assumption that $|y|\le M$ almost surely for some $M>0$; hence $|f_\rho(x)|\le M$ for any $x\in X$. We also define the following truncation function as in [8,9,10].
Definition 1. Fix $M>0$. The truncation function $\pi_M$ on the space of measurable functions $f: X\to\mathbb R$ is defined as
$$ \pi_M(f)(x)=\begin{cases} M, & \text{if } f(x)>M,\\ f(x), & \text{if } |f(x)|\le M,\\ -M, & \text{if } f(x)<-M. \end{cases} \tag{1.9} $$
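Since the truncation operator acts pointwise, it amounts to clipping function values to the interval $[-M,M]$; a one-line sketch (the helper name is ours) is

```python
import numpy as np

def pi_M(f_values, M):
    """Truncation pi_M of (1.9), applied to an array of function values."""
    return np.clip(f_values, -M, M)
```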
Now we use the RPGA to realize greedy learning. Here we consider learning with an indefinite kernel $K: X\times X\to\mathbb R$ [11,12,13,14] and define the hypothesis space
$$ \mathcal H_{K,\mathbf z}=\Bigl\{f=\sum_{i=1}^n \alpha_i K_{x_i}=\sum_{i=1}^n\alpha_i K(x_i,\cdot):\ \alpha=(\alpha_1,\cdots,\alpha_n)\in\mathbb R^n,\ n\in\mathbb N\Bigr\}, \tag{1.10} $$
where
$$ \|f\|_{l^1}:=\inf\Bigl\{\sum_{i=1}^n|\alpha_i|:\ f=\sum_{i=1}^n\alpha_i K_{x_i}\in\mathcal H_{K,\mathbf z}\Bigr\}. \tag{1.11} $$
We now present the rescaled pure greedy learning algorithm (RPGLA) as follows:
Algorithm 1 RPGLA
Input: a data set $\mathbf z=\{z_i\}_{i=1}^n=\{(x_i,y_i)\}_{i=1}^n\in Z^n$, the kernel $K$, $T>0$, and the dictionary $\mathcal D_n=\{K_{x_i}: i=1,\dots,n\}$.
Step 1. Normalization: $\widetilde K_{x_i}=K_{x_i}/\|K_{x_i}\|_n$, $i=1,\dots,n$; dictionary $\widetilde{\mathcal D}_n=\{\widetilde K_{x_i}: i=1,\dots,n\}$.
Step 2. Computation: Let $\tilde f_0=0$. For $k=1,2,\dots$, the approximant $\tilde f_k$ is generated by the RPGA($\widetilde{\mathcal D}_n$); if $\|y-\tilde f_k\|_n^2+\|\tilde f_k\|_{l^1}\le\|y\|_n^2$ and $k\ge T$, break.
Output: $\pi_M(\tilde f_k)$.
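A minimal sketch of Algorithm 1 follows, assuming a user-supplied symmetric kernel and representing every function by its values at the sample points, so that the RPGA steps are carried out under the empirical inner product $\langle u,v\rangle_n=\frac1n\sum_i u(x_i)v(x_i)$. The $l^1$ quantity tracked below is the sum of absolute coefficients of the current representation, an upper bound on the infimum in (1.11); all function and variable names are illustrative.

```python
import numpy as np

def rpgla(x, y, kernel, T=10, k_max=200, M=None):
    """Sketch of the RPGLA: RPGA on the normalized kernel dictionary under
    the empirical inner product, followed by truncation at level M.
    Returns the values of pi_M(f_k) at the sample points."""
    x, y = list(x), np.asarray(y, dtype=float)
    n = len(y)
    M = float(np.max(np.abs(y))) if M is None else M
    # G[:, i] holds the values of K_{x_i} at the sample points x_1, ..., x_n.
    G = np.array([[kernel(xj, xi) for xi in x] for xj in x], dtype=float)
    norms_n = np.sqrt(np.mean(G**2, axis=0))        # ||K_{x_i}||_n
    Gt = G / norms_n                                # normalized dictionary
    fk = np.zeros(n)                                # f_0 = 0 (values at samples)
    coef = np.zeros(n)                              # coefficients on normalized atoms
    for k in range(1, k_max + 1):
        r = y - fk
        inner = (Gt.T @ r) / n                      # <y - f_{k-1}, K~_{x_i}>_n
        j = int(np.argmax(np.abs(inner)))
        lam = inner[j]
        f_hat = fk + lam * Gt[:, j]
        c_hat = coef.copy()
        c_hat[j] += lam
        denom = np.mean(f_hat**2)
        if denom <= 0:
            break
        s = np.mean(y * f_hat) / denom              # rescaling step of the RPGA
        fk, coef = s * f_hat, s * c_hat
        l1 = np.sum(np.abs(coef) / norms_n)         # coefficients on the original K_{x_i}
        if np.mean((y - fk)**2) + l1 <= np.mean(y**2) and k >= T:
            break                                   # stopping rule of Algorithm 1
    return np.clip(fk, -M, M)                       # pi_M(f_k)

# Hypothetical usage with a Gaussian kernel on scalar inputs:
# fit = rpgla([0.1, 0.5, 0.9], [1.0, 0.0, 1.0], lambda s, t: np.exp(-(s - t)**2), T=2)
```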
Many greedy learning schemes have recently been used successfully in the i.i.d. sampling setting [1,2,3,4,5,6,15]. For example, Barron et al. [5] used a complexity regularization principle as the stopping criterion and deduced the best learning rate $O((n/\log n)^{-1/2})$ for various greedy algorithms. Lin et al. [3] studied the learning capability of the relaxed greedy learning algorithm (RGLA) and proved that its learning rate is faster than the order $O(n^{-1/2})$; their numerical simulations confirmed that the relaxed greedy algorithm (RGA) is more stable than the orthogonal greedy algorithm (OGA) in dealing with noisy machine learning problems. Chen et al. [16] introduced a sparse semi-supervised method that learns regression functions from samples using the OGA and derived the learning rate $O(n^{-1})$ under mild assumptions. To reduce the computational burden of the OGA, Fang et al. [1] considered applications of the orthogonal super greedy algorithm (OSGA), which selects more than one atom from the dictionary in each iteration, to supervised learning and deduced almost the same learning rate as that of the orthogonal greedy learning algorithm (OGLA) in [5]. Different from the traditional variants RGA and OGA, Xu et al. [4] proposed the truncated greedy algorithm (TGA), which truncates the step size of the PGA at a specified value in each greedy iteration to cut down the model complexity; they also proved that, for some specified learning tasks, the truncated greedy learning algorithm (TGLA) can remove the logarithmic factor in the learning rates of the OGLA and the RGLA. All these results show that, in the realm of supervised learning, each greedy algorithm has its own pros and cons. For instance, compared with the OGA, the PGA and the RGA have computational benefits but suffer from low convergence rates. In this paper, we study the learning capability of the RPGA, which is a very simple modification of the PGA. Motivated by the research in [7], we proceed to derive the error bound of the RPGLA. Our results show that the RPGLA further reduces the computational burden without sacrificing generalization capability when compared with the OGLA and the RGLA. However, the assumption of independent and identically distributed samples is rather restrictive; for example, in [17,18,19] the authors studied non-i.i.d. sampling settings for different learning algorithms. We shall study β-mixing and non-identical sampling; see [20] and the references therein for details.
Definition 2. Let $\mathbf z=\{z_t\}_{t\ge1}$ be a sequence of random variables. For any $i,j\in\mathbb N\cup\{+\infty\}$, $\sigma_i^j$ denotes the σ-algebra generated by the random variables $\{z_t=(x_t,y_t)\}_{t=i}^j$. Then for any $l\in\mathbb N$, the β-mixing coefficients of the stochastic process $\mathbf z$ are defined as
$$ \beta(l)=\sup_{j\ge1} E\sup_{A\in\sigma_{j+l}^{\infty}}\bigl|P(A\,|\,\sigma_1^j)-P(A)\bigr|. \tag{1.12} $$
$\mathbf z$ is said to be β-mixing if $\beta(l)\to0$ as $l\to\infty$. Specifically, it is said to be polynomially β-mixing if there exist some $\beta_0>0$ and $\gamma>0$ such that, for all $l\ge1$,
$$ \beta(l)\le\beta_0\, l^{-\gamma}. \tag{1.13} $$
The β-mixing condition is "just the right" assumption, which has been adopted in previous studies for learning with weakly dependent samples; see [18,21] and the references therein. It is quite easy to establish and covers more general non-i.i.d. cases such as Gaussian and Markov processes. Markov chains appear often and naturally in applications, especially in market prediction, speech recognition, biological sequence analysis, content-based web search and character recognition.
We assume that $\{z_i\}_{i=1}^n$ is drawn according to the Borel probability measures $\{\rho^{(i)}\}_{i=1,2,\cdots}$ on $Z$. Let $\rho^{(i)}_X$ be the marginal distribution of $\rho^{(i)}$. For every $x\in X$, the conditional distribution of $\rho^{(i)}$ at $x$ is $\rho(\cdot|x)$.
Definition 3. We say that $\{\rho^{(i)}_X\}$ converges to $\rho_X$ exponentially in $(C^s(X))^*$ if, for some $C>0$ and $0<\alpha<1$,
$$ \|\rho^{(i)}_X-\rho_X\|_{(C^s(X))^*}\le C\alpha^i,\qquad\forall\, i\in\mathbb N. \tag{1.14} $$
The above condition (1.14) is also equivalent to
$$ \Bigl|\int_X f(x)\,d\rho^{(i)}_X-\int_X f(x)\,d\rho_X\Bigr|\le C\alpha^i\bigl(\|f\|_\infty+|f|_{C^s(X)}\bigr),\qquad\forall\, f\in C^s(X),\ i\in\mathbb N, \tag{1.15} $$
where
$$ \|f\|_{C^s(X)}:=\|f\|_\infty+|f|_{C^s(X)}, \tag{1.16} $$
and
$$ |f|_{C^s(X)}:=\sup_{x\neq y\in X}\frac{|f(x)-f(y)|}{(d(x,y))^s}. \tag{1.17} $$
Before giving our key analysis, we first impose some mild assumptions concerning $K$, $\mathcal H_{K,\mathbf z}$ and $\{\rho(y|x): x\in X\}$.
The kernel function $K$ is said to satisfy a Lipschitz condition of order $(\alpha,\beta)$ with $0<\alpha,\beta\le1$ if, for some $C_\alpha, C_\beta>0$,
$$ |K(x,t)-K(x,t')|\le C_\alpha|t-t'|^\alpha,\qquad\forall\, x,t,t'\in X, \tag{1.18} $$
$$ |K(x,t)-K(x',t)|\le C_\beta|x-x'|^\beta,\qquad\forall\, t,x,x'\in X. \tag{1.19} $$
Let $R>0$ and let $B_R$ be the ball of $\mathcal H_{K,\mathbf z}$ with radius $R$:
$$ B_R=\{f\in\mathcal H_{K,\mathbf z}:\ \|f\|_{l^1}\le R\}. \tag{1.20} $$
As in [22], we give the complexity assumption on the unit ball $B_1$.
Capacity assumption. We say that $B_1$ has polynomial complexity exponent $0<p<2$ if there is some constant $c_p>0$ such that
$$ \log\mathcal N_2(B_1,\epsilon)\le c_p\,\epsilon^{-p},\qquad\forall\,\epsilon>0. \tag{1.21} $$
The following concept describes the continuity of {ρ(y|x):x∈X}.
Definition 4. We say that $\{\rho(y|x): x\in X\}$ satisfies a Lipschitz condition of order $s$ in $(C^s(Y))^*$ if there is some constant $C_\rho\ge0$ such that
$$ \|\rho(y|x)-\rho(y|u)\|_{(C^s(Y))^*}\le C_\rho|x-u|^s,\qquad\forall\, x,u\in X. \tag{1.22} $$
Throughout this paper, we denote $\kappa^2=\sup_{t,x\in X}|K(x,t)|$. Since all the constants are independent of $\delta$, $n$ and $\lambda$, for simplicity of notation we denote them all by $C$.
The rest of this paper is organized as follows: in Section 2, we state the error decomposition of Algorithm 1 and the rate of uniform convergence. In Sections 3–5, we analyze the drift error, the sample error and the hypothesis error. Finally, we prove the main results in Section 6.
We use the technique developed for coefficient-regularization algorithms with non-i.i.d. sampling [19,21] to analyze the learning ability of Algorithm 1. We first define the function space
$$ \mathcal H_1=\Bigl\{f=\sum_{j=1}^\infty\alpha_j\overline K_{u_j}:\ \{\alpha_j\}\in l^1,\ \{u_j\}\subset X,\ \overline K_{u_j}=\frac{K_{u_j}}{\|K_{u_j}\|_{\rho_X}}\Bigr\}, \tag{2.1} $$
with the norm
$$ \|f\|_{\mathcal H_1}:=\inf\Bigl\{\sum_{j=1}^\infty|\alpha_j|:\ f=\sum_{j=1}^\infty\alpha_j\overline K_{u_j}\Bigr\}. \tag{2.2} $$
We define the regularizing function
$$ f_\lambda:=\arg\min_{f\in\mathcal H_1}\bigl\{\mathcal E(f)+\lambda\|f\|_{\mathcal H_1}\bigr\},\qquad\lambda>0. \tag{2.3} $$
In order to describe the error caused by the change of $\{\rho^{(i)}_X\}$, we introduce
$$ \mathcal E_n(f)=\frac1n\sum_{i=1}^n\int_Z(f(u)-y)^2\,d\rho^{(i)}(u,y). \tag{2.4} $$
Now we can give the error decomposition for Algorithm 1:
$$ \mathcal E(\pi_M(\tilde f_k))-\mathcal E(f_\rho)\le P(\lambda)+S(\mathbf z,\lambda)+H(\mathbf z,\lambda)+D(\lambda), \tag{2.5} $$
where
$$ \begin{aligned}
P(\lambda)&=\{\mathcal E(\pi_M(\tilde f_k))-\mathcal E_n(\pi_M(\tilde f_k))\}+\{\mathcal E_n(f_\lambda)-\mathcal E(f_\lambda)\},\\
S(\mathbf z,\lambda)&=\{\mathcal E_n(\pi_M(\tilde f_k))-\mathcal E_{\mathbf z}(\pi_M(\tilde f_k))\}+\{\mathcal E_{\mathbf z}(f_\lambda)-\mathcal E_n(f_\lambda)\},\\
H(\mathbf z,\lambda)&=\mathcal E_{\mathbf z}(\pi_M(\tilde f_k))-\mathcal E_{\mathbf z}(f_\lambda),\\
D(\lambda)&=\mathcal E(f_\lambda)-\mathcal E(f_\rho)+\lambda\|f_\lambda\|_{\mathcal H_1}.
\end{aligned} \tag{2.6} $$
The drift error $P(\lambda)$ describes the change of $\rho^{(i)}$ from $\rho$, and the sample error $S(\mathbf z,\lambda)$ connects the estimator $\pi_M(\tilde f_k)$ with $f_\lambda$. $H(\mathbf z,\lambda)$ and $D(\lambda)$ are known as the hypothesis error and the approximation error, respectively.
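For completeness, (2.5) follows from a telescoping identity together with the fact that the regularization term $\lambda\|f_\lambda\|_{\mathcal H_1}$ contained in $D(\lambda)$ is nonnegative:
$$ \begin{aligned}
P(\lambda)+S(\mathbf z,\lambda)+H(\mathbf z,\lambda)+D(\lambda)
&=\bigl\{\mathcal E(\pi_M(\tilde f_k))-\mathcal E_n(\pi_M(\tilde f_k))\bigr\}+\bigl\{\mathcal E_n(\pi_M(\tilde f_k))-\mathcal E_{\mathbf z}(\pi_M(\tilde f_k))\bigr\}+\bigl\{\mathcal E_{\mathbf z}(\pi_M(\tilde f_k))-\mathcal E_{\mathbf z}(f_\lambda)\bigr\}\\
&\quad+\bigl\{\mathcal E_{\mathbf z}(f_\lambda)-\mathcal E_n(f_\lambda)\bigr\}+\bigl\{\mathcal E_n(f_\lambda)-\mathcal E(f_\lambda)\bigr\}+\mathcal E(f_\lambda)-\mathcal E(f_\rho)+\lambda\|f_\lambda\|_{\mathcal H_1}\\
&=\mathcal E(\pi_M(\tilde f_k))-\mathcal E(f_\rho)+\lambda\|f_\lambda\|_{\mathcal H_1}\ \ge\ \mathcal E(\pi_M(\tilde f_k))-\mathcal E(f_\rho).
\end{aligned} $$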
To compare with the main results in [16], we assume that $D(\lambda)$ satisfies the decay condition
$$ D(\lambda)\le c_q\lambda^{q},\qquad\forall\ 0<\lambda\le1, \tag{2.7} $$
for some exponent $0<q\le1$ and a constant $c_q>0$.
Next we can state the generalization error bound and give the proofs in Sections 3–6.
Theorem 1. Assume that $\{z_i=(x_i,y_i)\}_{i=1}^n$ satisfies condition (1.13), the hypothesis space $\mathcal H_{K,\mathbf z}$ satisfies the capacity assumption (1.21) with $0<p<2$, the kernel $K$ satisfies a Lipschitz condition of order $(\alpha,\beta)$ with $0<k_0\le K(u,v)\le k_1$ for any $u,v\in X$, the target function $f_\rho$ can be approximated with the exponent $0<q\le1$ in $\mathcal H_1$, and (1.14) for $\rho_X$ and (1.22) for $\rho(y|x)$ hold. Take $k\ge T\ge n$. Then for any $0<\delta<1$, with confidence $1-\delta$, we have
$$ \mathcal E(\pi_M(\tilde f_k))-\mathcal E(f_\rho)\le C\,t\,\Bigl\{\lambda^{q}+b_n^{-1}\lambda^{2q-2}+b_n^{-\frac{2}{2+p}}\Bigr\}, \tag{2.8} $$
where $t=\log\frac{6}{\delta-6b_n\beta(a_n)}$ with $b_n$ and $a_n$ given explicitly later.
Theorem 2. Under the assumptions of Theorem 1, if
$$ n\ge\max\Bigl\{8^{\frac1\zeta},\ \Bigl(\frac{6\beta_0}{\delta}\Bigr)^{\frac{1}{(\gamma+1)(1-\zeta)-1}}\Bigr\},\qquad \zeta\in\Bigl(0,\frac{\gamma}{\gamma+1}\Bigr), \tag{2.9} $$
then we obtain
$$ \|\pi_M(\tilde f_k)-f_\rho\|_{\rho_X}^2\le\widetilde D\, n^{-\theta'}\log\frac{12}{\delta}, \tag{2.10} $$
where
$$ \theta'=\min\Bigl\{\frac{q\zeta}{2-q},\ \frac{2\zeta}{2+p}\Bigr\}. $$
Let $\alpha=0$ and $\zeta=1$. Then we obtain the following learning rate for i.i.d. sampling:
$$ \|\pi_M(\tilde f_k)-f_\rho\|_{\rho_X}^2\le\widetilde C\Bigl(\frac1n\Bigr)^{\min\{\frac{q}{2-q},\,\frac{2}{2+p}\}}\log\frac{12}{\delta}, $$
which is the same as that in [16]. In particular, as $p\to0$, $\frac{2}{2+p}\to1$, which is the optimal convergence rate.
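The interplay between the mixing parameter $\zeta$, the approximation exponent $q$ and the capacity exponent $p$ can be illustrated by evaluating the exponent $\theta'$ of Theorem 2 numerically (a small sketch; the function name is ours):

```python
def rate_exponent(q, p, zeta):
    """theta' = min{ q*zeta/(2-q), 2*zeta/(2+p) } from Theorem 2."""
    return min(q * zeta / (2 - q), 2 * zeta / (2 + p))

print(rate_exponent(q=1.0, p=0.01, zeta=1.0))  # ~0.995: close to the optimal O(n^{-1})
print(rate_exponent(q=1.0, p=1.0, zeta=1.0))   # 2/3: a larger capacity exponent slows the rate
print(rate_exponent(q=0.5, p=0.5, zeta=0.8))   # weaker mixing and approximation slow it further
```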
Proposition 3. Under the assumptions of Theorem 1, the inequality
$$ P(\lambda)\le C\,n^{-1}\lambda^{2q-2} \tag{3.1} $$
holds.
Proof. By (1.1) and (2.4), we get
$$ \begin{aligned}
&\bigl(\mathcal E(\pi_M(\tilde f_k))-\mathcal E(f_\lambda)\bigr)-\bigl(\mathcal E_n(\pi_M(\tilde f_k))-\mathcal E_n(f_\lambda)\bigr)\\
&\le\frac1n\sum_{i=1}^n\Bigl|\int_Z\bigl[(\pi_M(\tilde f_k)(u)-y)^2-(f_\lambda(u)-y)^2\bigr]\,d\bigl(\rho(u,y)-\rho^{(i)}(u,y)\bigr)\Bigr|\\
&=\frac1n\sum_{i=1}^n\Bigl|\int_X\bigl(\pi_M(\tilde f_k)(u)-f_\lambda(u)\bigr)\bigl(\pi_M(\tilde f_k)(u)+f_\lambda(u)-2f_\rho(u)\bigr)\,d\bigl(\rho_X(u)-\rho^{(i)}_X(u)\bigr)\Bigr|.
\end{aligned} \tag{3.2} $$
Now (1.15) tells us that
$$ \begin{aligned}
&\bigl(\mathcal E(\pi_M(\tilde f_k))-\mathcal E(f_\lambda)\bigr)-\bigl(\mathcal E_n(\pi_M(\tilde f_k))-\mathcal E_n(f_\lambda)\bigr)\\
&\le\frac1n\sum_{i=1}^n C\alpha^i\bigl\|\bigl(\pi_M(\tilde f_k)(u)-f_\lambda(u)\bigr)\bigl(\pi_M(\tilde f_k)(u)+f_\lambda(u)-2f_\rho(u)\bigr)\bigr\|_{C^s(X)}\\
&\le\frac{C}{n}\,\frac{\alpha}{1-\alpha}\bigl(3M+\|f_\lambda\|_\infty\bigr)\bigl\{2|\tilde f_k|_{C^s(X)}+2|f_\lambda|_{C^s(X)}+2|f_\rho|_{C^s(X)}+4M+2\|f_\lambda\|_\infty\bigr\},
\end{aligned} \tag{3.3} $$
where the last inequality holds since $\|fg\|_{C^s(X)}\le\|f\|_{C(X)}\|g\|_{C^s(X)}+\|f\|_{C^s(X)}\|g\|_{C(X)}$; see [19].
In the following, we estimate $\|f_\lambda\|_\infty$, $|f_\lambda|_{C^s(X)}$, $|\tilde f_k|_{C^s(X)}$ and $|f_\rho|_{C^s(X)}$ separately. Let $f_\lambda(x)=\sum_{j=1}^\infty\alpha_{j,\lambda}\overline K_{u_j}(x)$ with $\{\alpha_{j,\lambda}\}\in l^1$. It follows that
$$ |f_\lambda(x)|\le\frac{\kappa}{\|K_{u_j}\|_{\rho_X}}\sum_{j=1}^\infty|\alpha_{j,\lambda}|\le\frac{\kappa}{\|K_{u_j}\|_{\rho_X}}\|f_\lambda\|_{\mathcal H_1}. \tag{3.4} $$
Furthermore,
$$ \|f_\lambda\|_\infty\le\frac{\kappa}{\|K_{u_j}\|_{\rho_X}}\,\frac{D(\lambda)}{\lambda}. \tag{3.5} $$
The Lipschitz condition (1.18) of the kernel function $K$ yields, for any $f\in\mathcal H_1$, that
$$ |f(x)-f(x')|\le\frac{C_\alpha|x-x'|^{s}}{\|K_{u_j}\|_{\rho_X}}\,\|f\|_{\mathcal H_1},\qquad\forall\, x,x'\in X. $$
Together with (1.17), this implies that
$$ |f_\lambda|_{C^s(X)}\le\frac{C_\alpha\|f_\lambda\|_{\mathcal H_1}}{\|K_{u_j}\|_{\rho_X}}\le\frac{C_\alpha}{\|K_{u_j}\|_{\rho_X}}\,\frac{D(\lambda)}{\lambda}. \tag{3.6} $$
In the same way, from the definition of $\tilde f_k$, we have
$$ |\tilde f_k|_{C^s(X)}\le C_\alpha\|\tilde f_k\|_{l^1}\le C_\alpha\|y\|_n^2\le C_\alpha M^2. \tag{3.7} $$
In addition, combining (1.17) with (1.22) gives
$$ |f_\rho|_{C^s(X)}=\sup_{x,x'\in X}\frac{\bigl|\int_Y y\,d\rho(y|x)-\int_Y y\,d\rho(y|x')\bigr|}{|x-x'|^{s}}\le\frac{\|y\|_{C^s(Y)}\,C_\rho|x-x'|^{s}}{|x-x'|^{s}}\le C_\rho\bigl(M+(2M)^{1-s}\bigr). \tag{3.8} $$
Plugging (3.5), (3.6), (3.7) and (3.8) into (3.3), the desired estimate (3.1) follows, and the proposition is proved.
In our analysis, we apply the method in [18,23] to deal with the original weakly dependent sequence. Let $(a_n,b_n)$ be any integer pair with $b_n=[n/(2a_n)]$. The dependent observations are split into $2b_n$ blocks, each of size $a_n$. For $1\le k\le 2b_n$, $Q_k^{a_n}$ denotes the marginal distribution of the block $(z_{(k-1)a_n+1},z_{(k-1)a_n+2},\cdots,z_{ka_n})$. With the constructed blocks, one can then take a new sequence $(z'_1,\cdots,z'_{2b_na_n})$ with product distribution $\prod_{k=1}^{2b_n}Q_k^{a_n}$. We further define
$$ \begin{aligned}
Z_1&=(z_1,\cdots,z_{a_n},\,z_{2a_n+1},\cdots,z_{3a_n},\,\cdots,\,z_{2(b_n-1)a_n+1},\cdots,z_{(2b_n-1)a_n}),\\
Z_2&=(z_{a_n+1},\cdots,z_{2a_n},\,z_{3a_n+1},\cdots,z_{4a_n},\,\cdots,\,z_{(2b_n-1)a_n+1},\cdots,z_{2b_na_n}),
\end{aligned} $$
and correspondingly
$$ \begin{aligned}
Z'_1&=(z'_1,\cdots,z'_{a_n},\,z'_{2a_n+1},\cdots,z'_{3a_n},\,\cdots,\,z'_{2(b_n-1)a_n+1},\cdots,z'_{(2b_n-1)a_n}),\\
Z'_2&=(z'_{a_n+1},\cdots,z'_{2a_n},\,z'_{3a_n+1},\cdots,z'_{4a_n},\,\cdots,\,z'_{(2b_n-1)a_n+1},\cdots,z'_{2b_na_n}).
\end{aligned} $$
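The block construction can be made explicit as follows: the first $2b_na_n$ observations are grouped into $2b_n$ consecutive blocks of length $a_n$, the odd-indexed blocks forming $Z_1$ and the even-indexed blocks forming $Z_2$ (a small sketch; the helper name is ours).

```python
import numpy as np

def block_indices(n, a_n):
    """Index sets for the independent-blocks technique: 2*b_n consecutive
    blocks of length a_n with b_n = n // (2*a_n); odd blocks give Z_1,
    even blocks give Z_2 (any remaining observations are discarded)."""
    b_n = n // (2 * a_n)
    idx = np.arange(2 * b_n * a_n).reshape(2 * b_n, a_n)
    return idx[0::2], idx[1::2], b_n

# Hypothetical usage: pick out Z_1 and Z_2 from a sample array z of length n.
# odd, even, b_n = block_indices(len(z), a_n=5)
# Z1, Z2 = z[odd.ravel()], z[even.ravel()]
```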
The sample error S(z,λ) can be rewritten as
$$ \begin{aligned}
S(\mathbf z,\lambda)&=\{\mathcal E_{\mathbf z}(f_\lambda)-\mathcal E_{\mathbf z}(f_\rho)\}-\{\mathcal E_n(f_\lambda)-\mathcal E_n(f_\rho)\}\\
&\quad+\{\mathcal E_n(\pi_M(\tilde f_k))-\mathcal E_n(f_\rho)\}-\{\mathcal E_{\mathbf z}(\pi_M(\tilde f_k))-\mathcal E_{\mathbf z}(f_\rho)\}\\
&:=S_1(\mathbf z,\lambda)+S_2(\mathbf z,k).
\end{aligned} $$
We analyze the term S1(z,λ) by using the following inequality from [18].
Lemma 4.1. If $g$ is a measurable function on $Z$ satisfying $\|g(z)-\int_Z g\,d\rho^{(i)}\|_\infty\le M$, then for any $\delta>0$, with confidence $1-\delta$, there holds
$$ \frac1n\sum_{i=1}^n\Bigl(g(z_i)-\int_Z g\,d\rho^{(i)}\Bigr)\le b_n^{-1}\Biggl\{\frac83M\log\frac{2}{\delta-2b_n\beta(a_n)}+\sqrt{\frac{2}{a_n}\sum_{i=1}^{2a_nb_n}\int_Z g^2\,d\rho^{(i)}\,\log\frac{2}{\delta-2b_n\beta(a_n)}}+M\Biggr\}. $$
Proposition 4. Under the assumptions of Theorem 1, for any 0<δ<1, with confidence 1−δ/3,
$$ S_1(\mathbf z,\lambda)\le C\Bigl\{b_n^{-1}\Bigl(1+\frac{D(\lambda)^2}{\lambda^2}\Bigr)+D(\lambda)\Bigr\}\,t. \tag{4.1} $$
Proof. Let $g(z)=(y-f_\lambda(u))^2-(y-f_\rho(u))^2$, $z=(u,y)\in Z$. Thus
$$ \Bigl\|g(z)-\int_Z g\,d\rho^{(i)}\Bigr\|_\infty\le 2\Bigl(3M+\frac{\kappa}{\|K_{u_j}\|_{\rho_X}}\frac{D(\lambda)}{\lambda}\Bigr)^2:=2B_\lambda $$
and
$$ \int_Z g^2\,d\rho^{(i)}\le B_\lambda\int_Z g\,d\rho^{(i)}. $$
Using Lemma 4.1, with confidence 1−δ/3, we have
$$ \frac1n\sum_{i=1}^n\Bigl(g(z_i)-\int_Z g\,d\rho^{(i)}\Bigr)\le\Bigl(\frac{19t}{3}+2\Bigr)B_\lambda b_n^{-1}+\frac{1}{2a_nb_n}\sum_{i=1}^{2a_nb_n}\int_Z g\,d\rho^{(i)}\le\Bigl(\frac{19t}{3}+2\Bigr)B_\lambda b_n^{-1}+2\bigl(\mathcal E_n(f_\lambda)-\mathcal E_n(f_\rho)\bigr). \tag{4.2} $$
Observe that
$$ \mathcal E_n(f_\lambda)-\mathcal E_n(f_\rho)\le\bigl(\mathcal E_n(f_\lambda)-\mathcal E(f_\lambda)+\mathcal E(f_\rho)-\mathcal E_n(f_\rho)\bigr)+D(\lambda). \tag{4.3} $$
By (1.15), we have
$$ \begin{aligned}
\mathcal E_n(f_\lambda)-\mathcal E(f_\lambda)+\mathcal E(f_\rho)-\mathcal E_n(f_\rho)
&\le\frac1n\sum_{i=1}^n\Bigl|\int_X(f_\lambda(u)-f_\rho(u))^2\,d\bigl(\rho^{(i)}_X(u)-\rho_X(u)\bigr)\Bigr|\\
&\le\frac1n\sum_{i=1}^n C\alpha^i\bigl\|(f_\lambda(u)-f_\rho(u))^2\bigr\|_{C^s(X)}\le\frac{C\alpha}{n(1-\alpha)}\Bigl(1+\frac{D(\lambda)}{\lambda}\Bigr)^2,
\end{aligned} \tag{4.4} $$
where the last inequality follows from (3.6) and (3.8).
Combining (4.2), (4.3) and (4.4), we get the desired error bound (4.1) of S1(z,λ). Proposition 4 is proved.
We continue to analyze S2(z,k) by applying the following probability inequality for the β-mixing sequences from [18].
Lemma 4.2. Let $\mathcal G$ be a class of measurable functions on $Z$. Moreover, assume that $\|g-\int_Z g\,d\rho^{(i)}\|_\infty\le M$ for all $g\in\mathcal G$. Then
$$ \mathrm{Prob}\Bigl\{\sup_{g\in\mathcal G}\frac1n\sum_{i=1}^n\Bigl(g(z_i)-\int_Z g(z)\,d\rho^{(i)}\Bigr)>\epsilon+\frac{M}{b_n}\Bigr\}\le\Pi_1+\Pi_2+2b_n\beta(a_n), $$
where
$$ \Pi_1=\mathrm{Prob}\Bigl\{\sup_{g\in\mathcal G}\frac1{b_n}\sum_{j=1}^{b_n}\Bigl(\frac{2b_n}{n}\sum_{i=2(j-1)a_n+1}^{(2j-1)a_n}\Bigl(g(z'_i)-\int_Z g(z)\,d\rho^{(i)}\Bigr)\Bigr)\ge\epsilon\Bigr\}, $$
$$ \Pi_2=\mathrm{Prob}\Bigl\{\sup_{g\in\mathcal G}\frac1{b_n}\sum_{j=1}^{b_n}\Bigl(\frac{2b_n}{n}\sum_{i=(2j-1)a_n+1}^{2ja_n}\Bigl(g(z'_i)-\int_Z g(z)\,d\rho^{(i)}\Bigr)\Bigr)\ge\epsilon\Bigr\}. $$
To get upper bounds for the terms $\Pi_1$ and $\Pi_2$, we invoke the following inequality for non-identical sequences of probability distributions.
Proposition 5. Assume $\{X_i\}_{i=1}^n$ is a random sequence in the measurable space $\bigl(X^n,\prod_{i=1}^n Q_i\bigr)$. Let $\mathcal F$ be a set of measurable functions on $X$ and $B>0$ be a constant such that each $f\in\mathcal F$ satisfies $\|f\|_\infty\le B$. Suppose there exist a nonnegative functional $w$ on $\mathcal F$ and some positive constants $(\Delta_i)_{i=1}^n$ such that
$$ Ef^2(X_i)\le w(f)+\Delta_i,\qquad\forall\, f\in\mathcal F. \tag{4.5} $$
Also assume for some a>0 and p∈(0,2),
$$ \log\mathcal N_2(\mathcal F,\varepsilon)\le a\varepsilon^{-p},\qquad\forall\,\varepsilon>0. $$
Then for any $x>0$ and any $D>0$, with probability at least $1-e^{-x}$, there holds
$$ \frac1n\sum_{i=1}^n Ef(X_i)-\frac1n\sum_{i=1}^n f(X_i)\le D^{-1}w(f)+c'_p\tilde\eta+\frac{(D+18B+2)x}{n},\qquad\forall\, f\in\mathcal F, $$
where $c'_p$ is a constant depending only on $p$ and
$$ \tilde\eta:=\max\Bigl\{D^{\frac{2-p}{2+p}},\ B^{\frac{2-p}{2+p}+1}\Bigr\}\Bigl(\frac{a}{n}\Bigr)^{\frac{2}{p+2}}+\frac1n\sum_{i=1}^n\Delta_i. $$
The above inequalities imply the estimate of S2(z,k).
Proposition 6. Under the assumptions of Theorem 1, for any 0<δ<1, with confidence 1−δ/3,
$$ S_2(\mathbf z,k)\le\frac12\bigl\{\mathcal E(\pi_M(\tilde f_k))-\mathcal E(f_\rho)\bigr\}+C_{p,\Phi,\rho}\,\eta_R+\frac{(192M^2+2)t}{b_n}, \tag{4.6} $$
where
$$ \eta_R:=\Bigl(\frac{R^p}{b_n}\Bigr)^{\frac{2}{2+p}}+\frac{\alpha}{1-\alpha}\,\frac1n\max\{R,1\}. \tag{4.7} $$
Proof. Define the function set $\widetilde{\mathcal G}$ on $Z^{a_n}$ by
$$ \widetilde{\mathcal G}=\Bigl\{G(t_1,\cdots,t_{a_n})=\frac{2b_n}{n}\sum_{k=1}^{a_n}g(t_k):\ g\in\mathcal G\Bigr\},\qquad \mathcal G=\bigl\{g(z)=g(u,y)=(y-\pi_M(f)(u))^2-(y-f_\rho(u))^2:\ f\in B_R\bigr\}, $$
and
$$ w(G):=\int_{Z^{a_n}}G^2(t_1,\cdots,t_{a_n})\,d\rho(t_1)\,d\rho(t_2)\cdots d\rho(t_{a_n})=\frac{4a_n^2b_n^2}{n^2}\int_Z g^2\,d\rho. $$
It follows that
$$ \begin{aligned}
EG^2(z'_{(k-1)a_n+1},z'_{(k-1)a_n+2},\cdots,z'_{ka_n})
&\le\frac{4b_n^2a_n}{n^2}\sum_{i=(k-1)a_n+1}^{ka_n}\int_Z g^2\,d\rho^{(i)}\\
&\le w(G)+\frac{4b_n^2a_n}{n^2}\sum_{i=(k-1)a_n+1}^{ka_n}\Bigl|\int_Z g^2\,d(\rho^{(i)}-\rho)\Bigr|.
\end{aligned} \tag{4.8} $$
We see from (1.15) and (1.22) that
$$ \Bigl|\int_Z g^2\,d(\rho^{(i)}-\rho)\Bigr|\le C\alpha^i\Bigl\|(f_\rho(u)-\pi_M(f)(u))^2\int_Y(2y-\pi_M(f)(u)-f_\rho(u))^2\,d\rho(y|u)\Bigr\|_{C^s(X)}\le C\alpha^i(1+R). \tag{4.9} $$
By (4.8) and (4.9), we know that Δk in (4.5) satisfies
$$ \Delta_k\le\frac{4b_n^2a_n}{n^2}\,C_{\rho,\Phi}\max\{R,1\}\sum_{i=1}^{a_n}\alpha^{(k-1)a_n+i}. $$
Let $\mathbf w=\{\vec t_j=(t_1^j,\cdots,t_{a_n}^j)\}_{j=1}^d\subset(Z^{a_n})^d$, $d\in\mathbb N$. We know that for any functions $G_1=\frac{2b_n}{n}\sum_{k=1}^{a_n}g_1(t_k)$ and $G_2=\frac{2b_n}{n}\sum_{k=1}^{a_n}g_2(t_k)$ in $\widetilde{\mathcal G}$,
$$ d_{2,\mathbf w}^2(G_1,G_2)=\frac1d\sum_{j=1}^d\bigl(G_1(\vec t_j)-G_2(\vec t_j)\bigr)^2=\frac1d\sum_{j=1}^d\Bigl(\frac{2b_n}{n}\sum_{k=1}^{a_n}\bigl(g_1(t_k^j)-g_2(t_k^j)\bigr)\Bigr)^2\le\frac1{da_n}\sum_{j=1}^d\sum_{k=1}^{a_n}\bigl(g_1(t_k^j)-g_2(t_k^j)\bigr)^2=d_{2,\mathbf w}^2(g_1,g_2), $$
so
$$ \mathcal N_2(\widetilde{\mathcal G},\epsilon)\le\mathcal N_2(\mathcal G,\epsilon). \tag{4.10} $$
Moreover,
$$ \mathcal N_2(\mathcal G,\epsilon)\le\mathcal N_2\Bigl(B_R,\frac{\epsilon}{4M}\Bigr). $$
This together with (4.10) yields
$$ \log\mathcal N_2(\widetilde{\mathcal G},\epsilon)\le\log\mathcal N_2(\mathcal G,\epsilon)\le c_p(4M)^pR^p\epsilon^{-p}. $$
Note that $\|G\|_\infty\le\|g\|_\infty\le 8M^2$. It is also easy to see that
$$ EG(z'_{(k-1)a_n+1},z'_{(k-1)a_n+2},\cdots,z'_{ka_n})\le\frac{2b_n}{n}\sum_{i=(k-1)a_n+1}^{ka_n}\int_Z g\,d\rho^{(i)}, $$
and
$$ w(G)=\frac{4a_n^2b_n^2}{n^2}\int_Z g^2\,d\rho\le\int_Z g^2\,d\rho\le 8M^2\int_Z g\,d\rho. $$
Now we apply Proposition 5 to $\widetilde{\mathcal G}$ in $\bigl((Z^{a_n})^{b_n},\prod_{j=1}^{b_n}Q_{2j-1}^{a_n}\bigr)$. Let $B=8M^2$ and $a=c_p(4M)^pR^p$. Then for any $D>0$ and $g\in\mathcal G$, with confidence at least $1-e^{-t}$, we have
$$ \frac1{b_n}\sum_{j=1}^{b_n}\Bigl(\frac{2b_n}{n}\sum_{i=2(j-1)a_n+1}^{(2j-1)a_n}\Bigl(\int_Z g(z)\,d\rho^{(i)}-g(z'_i)\Bigr)\Bigr)\le\frac{8M^2}{D}\int_Z g\,d\rho+c'_p\eta_1+\frac{(D+144M^2+2)t}{b_n}. $$
Here
$$ \eta_1=\max\Bigl\{D^{\frac{2-p}{2+p}},\ (8M^2)^{\frac{2-p}{2+p}+1}\Bigr\}\Bigl\{\frac{c_p(4M)^pR^p}{b_n}\Bigr\}^{\frac{2}{2+p}}+\frac{4b_na_n}{n^2}C_{\rho,\Phi}\max\{R,1\}\sum_{j=1}^{b_n}\sum_{i=1}^{a_n}\alpha^{(2j-2)a_n+i}. $$
It follows by taking $\epsilon_1=c'_p\eta_1+\frac{(D+144M^2+2)t}{b_n}$ that
$$ \mathrm{Prob}\Bigl\{\sup_{g\in\mathcal G}\frac1{b_n}\sum_{j=1}^{b_n}\Bigl(\frac{2b_n}{n}\sum_{i=2(j-1)a_n+1}^{(2j-1)a_n}\Bigl(\int_Z g(z)\,d\rho^{(i)}-g(z'_i)\Bigr)\Bigr)-\frac{8M^2}{D}\int_Z g\,d\rho\ge\epsilon_1\Bigr\}\le e^{-t}. $$
Applying Proposition 5 to $\widetilde{\mathcal G}$ in $\bigl((Z^{a_n})^{b_n},\prod_{j=1}^{b_n}Q_{2j}^{a_n}\bigr)$ once again, we have
$$ \mathrm{Prob}\Bigl\{\sup_{g\in\mathcal G}\frac1{b_n}\sum_{j=1}^{b_n}\Bigl(\frac{2b_n}{n}\sum_{i=(2j-1)a_n+1}^{2ja_n}\Bigl(\int_Z g(z)\,d\rho^{(i)}-g(z'_i)\Bigr)\Bigr)-\frac{8M^2}{D}\int_Z g\,d\rho\ge\epsilon_2\Bigr\}\le e^{-t}. $$
Here $\epsilon_2=c'_p\eta_2+\frac{(D+144M^2+2)t}{b_n}$ with
$$ \eta_2=\max\Bigl\{D^{\frac{2-p}{2+p}},\ (8M^2)^{\frac{2-p}{2+p}+1}\Bigr\}\Bigl\{\frac{c_p(4M)^pR^p}{b_n}\Bigr\}^{\frac{2}{2+p}}+\frac{4b_na_n}{n^2}C_{\rho,\Phi}\max\{R,1\}\sum_{j=1}^{b_n}\sum_{i=1}^{a_n}\alpha^{(2j-1)a_n+i}. $$
Moreover, we obviously have
$$ \frac{4b_na_n}{n^2}\sum_{j=1}^{b_n}\sum_{i=1}^{a_n}\alpha^{(2j-2)a_n+i}+\frac{4b_na_n}{n^2}\sum_{j=1}^{b_n}\sum_{i=1}^{a_n}\alpha^{(2j-1)a_n+i}\le\frac2n\,\frac{\alpha}{1-\alpha}, $$
and
$$ \Bigl\|g(z)-\int_Z g(z)\,d\rho^{(i)}\Bigr\|_\infty<16M^2. $$
We then apply Lemma 4.2 with $\epsilon=c'_p\tilde\eta+\frac{(D+144M^2+2)t}{b_n}$, where
$$ \tilde\eta=\max\Bigl\{D^{\frac{2-p}{2+p}},\ (8M^2)^{\frac{2-p}{2+p}+1}\Bigr\}\Bigl\{\frac{c_p(4M)^pR^p}{b_n}\Bigr\}^{\frac{2}{2+p}}+\frac2n\,C_{\rho,\Phi}\max\{R,1\}\frac{\alpha}{1-\alpha}, $$
and obtain
$$ \mathrm{Prob}\Bigl\{\sup_{g\in\mathcal G}\frac1n\sum_{i=1}^n\Bigl(\int_Z g(z)\,d\rho^{(i)}-g(z_i)\Bigr)-\frac{16M^2}{D}\int_Z g\,d\rho>\epsilon+\frac{16M^2}{b_n}\Bigr\}\le 2e^{-t}+2b_n\beta(a_n). $$
Then we obtain (4.6) by taking $2e^{-t}+2b_n\beta(a_n):=\frac{\delta}{3}$ and $D=32M^2$.
Different from the widely used regularization methods with data-dependent hypothesis spaces [8,10,21,22], our estimate of the hypothesis error $\mathcal E_{\mathbf z}(\pi_M(\tilde f_k))-\mathcal E_{\mathbf z}(f_\lambda)$ is based on the following lemma; see Theorem 3.3 in [7].
Lemma 5.1. If $f\in H$ and $h\in\mathcal H_1^n$, then the output $(f_m)_{m\ge0}$ of the RPGA satisfies the inequality
$$ \|f-f_m\|^2-\|f-h\|^2\le\frac{4}{m+1}\|h\|_{\mathcal H_1^n}^2,\qquad m=0,1,2,\cdots, \tag{5.1} $$
where
$$ \mathcal H_1^n=\Bigl\{h=\sum_i\alpha_i^n\overline K_{u_i}^n:\ \alpha_i^n=\alpha_i\|\overline K_{u_i}\|_n,\ \overline K_{u_i}^n=\frac{\overline K_{u_i}}{\|\overline K_{u_i}\|_n},\ \sum_i\alpha_i\overline K_{u_i}\in\mathcal H_1\Bigr\} \tag{5.2} $$
with
$$ \|f\|_{\mathcal H_1^n}:=\inf\Bigl\{\sum_i|\alpha_i^n|:\ f=\sum_i\alpha_i\overline K_{u_i}\Bigr\}. \tag{5.3} $$
Proposition 7. Under the assumptions of Theorem 1, for $k\ge T$ and any $0<\delta<1$, with confidence at least $1-\delta/3$, there holds
$$ H(\mathbf z,\lambda)\le 4\min\Bigl\{\Bigl\{\Bigl(\frac{19t}{3}+2\Bigr)Mb_n^{-1}+M+\frac{\alpha}{n(1-\alpha)}\Bigl(\frac{k_1^2}{k_0^2}+\frac{2C_\alpha k_1}{k_0^2}\Bigr)+1\Bigr\}^2,\ \frac{k_1^2}{k_0^2}\Bigr\}\frac{D^2(\lambda)}{(k+1)\lambda^2}. \tag{5.4} $$
Proof. By Lemma 5.1, we have
$$ H(\mathbf z,\lambda)=\mathcal E_{\mathbf z}(\pi_M(\tilde f_k))-\mathcal E_{\mathbf z}(f_\lambda)\le\frac{4\|f_\lambda\|_{\mathcal H_1^n}^2}{k+1}. \tag{5.5} $$
From the definitions of $\|f\|_{\mathcal H_1^n}$ and $\|f\|_{\mathcal H_1}$, we have
$$ \|f_\lambda\|_{\mathcal H_1^n}^2\le\frac{k_1^2}{k_0^2}\|f_\lambda\|_{\mathcal H_1}^2. \tag{5.6} $$
Meanwhile, we define the function $g(x)=|\overline K_{u_i}(x)|^2$ for any $i$. Notice that
$$ \Bigl\|g(x)-\int_X g\,d\rho^{(j)}_X\Bigr\|_\infty\le\frac{2k_1^2}{k_0^2}:=2M, $$
and
$$ \int_X g^2\,d\rho^{(j)}_X\le M\int_X g\,d\rho^{(j)}_X. $$
Using Lemma 4.1, with confidence 1−δ/3, we have
$$ \frac1n\sum_{j=1}^n\Bigl(g(x_j)-\int_X g\,d\rho^{(j)}_X\Bigr)\le\Bigl(\frac{19t}{3}+2\Bigr)Mb_n^{-1}+M. \tag{5.7} $$
By (1.17) and (1.15), we get
$$ \frac1n\sum_{j=1}^n\Bigl(\int_X g\,d\rho^{(j)}_X-\int_X g\,d\rho_X\Bigr)\le\frac1n\sum_{j=1}^n C\alpha^j\bigl(\|g\|_\infty+|g|_{C^s(X)}\bigr)\le\frac{\alpha}{n(1-\alpha)}\Bigl(\frac{k_1^2}{k_0^2}+\frac{2C_\alpha k_1}{k_0^2}\Bigr). \tag{5.8} $$
This in connection with (5.7) tells us that
$$ \frac1n\sum_{j=1}^n|\overline K(u_i,x_j)|^2-E\overline K_{u_i}^2=\frac1n\sum_{j=1}^n\Bigl(g(x_j)-\int_X g\,d\rho_X\Bigr)\le\Bigl(\frac{19t}{3}+2\Bigr)Mb_n^{-1}+M+\frac{\alpha}{n(1-\alpha)}\Bigl(\frac{k_1^2}{k_0^2}+\frac{2C_\alpha k_1}{k_0^2}\Bigr). \tag{5.9} $$
It is easy to see that $\|\overline K_{u_j}\|_{\rho_X}^2=E\overline K_{u_j}^2=1$. Now (5.9) implies that
$$ \|\overline K_{u_i}\|_n=\sqrt{\frac1n\sum_{j=1}^n|\overline K(u_i,x_j)|^2}\le\frac1n\sum_{j=1}^n|\overline K(u_i,x_j)|^2\le\Bigl(\frac{19t}{3}+2\Bigr)Mb_n^{-1}+M+\frac{\alpha}{n(1-\alpha)}\Bigl(\frac{k_1^2}{k_0^2}+\frac{2C_\alpha k_1}{k_0^2}\Bigr)+1. \tag{5.10} $$
Therefore,
$$ \|f_\lambda\|_{\mathcal H_1^n}^2\le\Bigl\{\Bigl(\frac{19t}{3}+2\Bigr)Mb_n^{-1}+M+\frac{\alpha}{n(1-\alpha)}\Bigl(\frac{k_1^2}{k_0^2}+\frac{2C_\alpha k_1}{k_0^2}\Bigr)+1\Bigr\}^2\|f_\lambda\|_{\mathcal H_1}^2. \tag{5.11} $$
Combining (5.5), (5.6), (5.11), we obtain (5.4).
Proof of Theorem 1. Combining the bounds (2.7), (3.1), (4.1), (4.6) and (5.4), with confidence at least 1−δ,
$$ \begin{aligned}
\mathcal E(\pi_M(\tilde f_k))-\mathcal E(f_\rho)
&\le c_q\lambda^{q}+4\min\Bigl\{\Bigl\{\Bigl(\frac{19t}{3}+2\Bigr)Mb_n^{-1}+M+\frac{\alpha}{1-\alpha}\Bigl(\frac{k_1^2}{k_0^2}+\frac{2C_\alpha k_1}{k_0^2}\Bigr)+1\Bigr\}^2,\ \frac{k_1^2}{k_0^2}\Bigr\}\frac{c_q^2\lambda^{2q}}{(k+1)\lambda^2}\\
&\quad+\frac{C\lambda^{2q-2}}{n}+C\Bigl\{b_n^{-1}\Bigl(1+\frac{c_q^2\lambda^{2q}}{\lambda^2}\Bigr)+c_q\lambda^{q}\Bigr\}t+\frac12\bigl\{\mathcal E(\pi_M(\tilde f_k))-\mathcal E(f_\rho)\bigr\}+C_{p,\Phi,\rho}\,\eta_R+\frac{(192M^2+2)t}{b_n}.
\end{aligned} \tag{6.1} $$
Note that $k\ge T\ge n$. Taking $R=M^2$, we then have
$$ \mathcal E(\pi_M(\tilde f_k))-\mathcal E(f_\rho)\le t\,\bigl\{(k+1)^{-1}\lambda^{2q-2}+n^{-1}\lambda^{2q-2}+b_n^{-1}\lambda^{2q-2}+\lambda^{q}+b_n^{-\frac{2}{2+p}}+n^{-1}+b_n^{-1}\bigr\}\le C\,t\,\bigl\{\lambda^{q}+b_n^{-1}\lambda^{2q-2}+b_n^{-\frac{2}{2+p}}\bigr\}. \tag{6.2} $$
This finishes the proof of Theorem 1.
Proof of Theorem 2. Under the conditions of Theorem 1, let $n^{1-\zeta}\le a_n<n^{1-\zeta}+1$, $\zeta\in[0,1]$ and $n\ge 8^{1/\zeta}$. Then
$$ \frac1{b_n}\le\frac{1}{\frac{n}{2a_n}-1}\le\frac{2(n^{1-\zeta}+1)}{n-2n^{1-\zeta}}\le\frac{4n^{1-\zeta}}{n-2n^{1-\zeta}}=\frac{4n^{-\zeta}}{1-2n^{-\zeta}}\le 8n^{-\zeta}. \tag{6.3} $$
Substituting (6.3) into (6.2), we obtain
$$ \mathcal E(\pi_M(\tilde f_k))-\mathcal E(f_\rho)\le C\,t\,\bigl\{\lambda^{q}+n^{-\zeta}\lambda^{2q-2}+n^{-\frac{2\zeta}{2+p}}\bigr\}. \tag{6.4} $$
By setting $\lambda=n^{-\theta}$, we know that
$$ \mathcal E(\pi_M(\tilde f_k))-\mathcal E(f_\rho)\le D_2\, t\, n^{-\theta'}, \tag{6.5} $$
where
$$ \theta'=\min\Bigl\{q\theta,\ \zeta-(2-2q)\theta,\ \frac{2\zeta}{2+p}\Bigr\}. $$
To balance the errors in (2.5), we take $\theta=\frac{\zeta}{2-q}$. Then
$$ \theta'=\min\Bigl\{\frac{q\zeta}{2-q},\ \frac{2\zeta}{2+p}\Bigr\}. $$
Finally, we choose
$$ n\ge\Bigl(\frac{6\beta_0}{\delta}\Bigr)^{\frac{1}{(\gamma+1)(1-\zeta)-1}},\qquad\zeta\in\Bigl(0,\frac{\gamma}{\gamma+1}\Bigr); $$
it follows from $\beta(a_n)\le\beta_0(a_n)^{-\gamma}$ and $a_n\ge n^{1-\zeta}$ that
$$ \frac{12b_n\beta(a_n)}{\delta}\le1, $$
thus
$$ t=\log\frac{6}{\delta-6b_n\beta(a_n)}\le\log\frac{12}{\delta}. $$
This finishes the proof of Theorem 2.
This research is supported by the National Science Foundation for Young Scientists of China (Grant No. 12001328), Doctoral Research Fund of Shandong Jianzhu University (No. XNBS1942), the Development Plan of Youth Innovation Team of University in Shandong Province (No. 2021KJ067) and Shandong Provincial Natural Science Foundation of China (Grant No. ZR2022MF223). All authors contributed substantially to this paper, participated in drafting and checking the manuscript. All authors read and approved the final manuscript.
The authors declare that they have no competing interests regarding the publication of this paper.
[1] J. Fang, S. B. Lin, Z. B. Xu, Learning and approximation capabilities of orthogonal super greedy algorithm, Knowl.-Based Syst., 95 (2016), 86–98. https://doi.org/10.1016/j.knosys.2015.12.011
[2] H. Chen, L. Q. Li, Z. B. Pan, Learning rates of multi-kernel regression by orthogonal greedy algorithm, J. Stat. Plan. Infer., 143 (2013), 276–282. https://doi.org/10.1016/j.jspi.2012.08.002
[3] S. B. Lin, Y. H. Rong, X. P. Sun, Z. B. Xu, Learning capability of relaxed greedy algorithms, IEEE Trans. Neural Netw. Learn. Syst., 24 (2013), 1598–1608. https://doi.org/10.1109/TNNLS.2013.2265397
[4] L. Xu, S. B. Lin, Z. B. Xu, Learning capability of the truncated greedy algorithm, Sci. China Inform. Sci., 59 (2016), 052103. https://doi.org/10.1007/s11432-016-5536-6
[5] A. R. Barron, A. Cohen, W. Dahmen, R. A. DeVore, Approximation and learning by greedy algorithms, Ann. Statist., 36 (2008), 64–94. https://doi.org/10.1214/009053607000000631
[6] L. Xu, S. B. Lin, J. S. Zeng, X. Liu, Z. B. Xu, Greedy criterion in orthogonal greedy learning, IEEE Trans. Cybernetics, 48 (2018), 955–966. https://doi.org/10.1109/TCYB.2017.2669259
[7] G. Petrova, Rescaled pure greedy algorithm for Hilbert and Banach spaces, Appl. Comput. Harmon. Anal., 41 (2016), 852–866. https://doi.org/10.1016/j.acha.2015.10.008
[8] S. G. Lv, D. M. Shi, Q. W. Xiao, M. S. Zhang, Sharp learning rates of coefficient-based lq-regularized regression with indefinite kernels, Sci. China Math., 56 (2013), 1557–1574. https://doi.org/10.1007/s11425-013-4688-8
[9] Y. L. Feng, S. G. Lv, Unified approach to coefficient-based regularized regression, Comput. Math. Appl., 62 (2011), 506–515. https://doi.org/10.1016/j.camwa.2011.05.034
[10] W. L. Nie, C. Wang, Constructive analysis for coefficient regularization regression algorithms, J. Math. Anal. Appl., 431 (2015), 1153–1171. https://doi.org/10.1016/j.jmaa.2015.06.006
[11] H. W. Sun, Q. Wu, Least square regression with indefinite kernels and coefficient regularization, Appl. Comput. Harmon. Anal., 30 (2011), 96–109. https://doi.org/10.1016/j.acha.2010.04.001
[12] B. Schölkopf, A. Smola, Learning with Kernels, MIT Press, Cambridge, 2002.
[13] C. J. Liu, Gabor-based kernel PCA with fractional power polynomial models for face recognition, IEEE Trans. Pattern Anal. Mach. Intell., 26 (2004), 572–581. https://doi.org/10.1109/TPAMI.2004.1273927
[14] R. Opfer, Multiscale kernels, Adv. Comput. Math., 25 (2006), 357–380. https://doi.org/10.1007/s10444-004-7622-3
[15] A. R. Barron, Universal approximation bounds for superpositions of a sigmoidal function, IEEE Trans. Inform. Theory, 39 (1993), 930–945. https://doi.org/10.1109/18.256500
[16] H. Chen, Y. C. Zhou, Y. Y. Tang, Convergence rate of the semi-supervised greedy algorithm, Neural Networks, 44 (2013), 44–50. https://doi.org/10.1016/j.neunet.2013.03.001
[17] S. Smale, D. X. Zhou, Online learning with Markov sampling, Anal. Appl., 7 (2009), 87–113. https://doi.org/10.1142/S0219530509001293
[18] Z. C. Guo, L. Shi, Classification with non-i.i.d. sampling, Math. Comput. Model., 54 (2011), 1347–1364. https://doi.org/10.1016/j.mcm.2011.03.042
[19] Z. W. Pan, Q. W. Xiao, Least-square regularized regression with non-iid sampling, J. Stat. Plan. Infer., 139 (2009), 3579–3587. https://doi.org/10.1016/j.jspi.2009.04.007
[20] R. C. Bradley, Basic properties of strong mixing conditions, Progr. Probab. Statist., 2 (1986), 165–192. https://doi.org/10.1007/978-1-4615-8162-8_8
[21] Q. Guo, P. X. Ye, B. L. Cai, Convergence rate for lq-coefficient regularized regression with non-i.i.d. sampling, IEEE Access, 6 (2018), 18804–18813. https://doi.org/10.1109/ACCESS.2018.2817215
[22] L. Shi, Y. L. Feng, D. X. Zhou, Concentration estimates for learning with l1-regularizer and data dependent hypothesis spaces, Appl. Comput. Harmon. Anal., 31 (2011), 286–302. https://doi.org/10.1016/j.acha.2011.01.001
[23] B. Yu, Rates of convergence for empirical processes of stationary mixing sequences, Ann. Probab., 22 (1994), 94–116. https://www.jstor.org/stable/2244496