1. Introduction
In the era of 'big data', machine learning algorithms that need to process each observation only once, or a few times, are desirable. Stochastic approximation algorithms such as stochastic gradient descent (SGD) and stochastic proximal gradient descent (SPGD) have been widely studied for this task and successfully applied in various scenarios [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]. Regression and classification are effective analysis methods in machine learning and have been successfully applied to practical problems; with the wide application of deep learning, these two methods are often used to train model parameters. In this paper, we consider stochastic approximation algorithms for minimizing a convex function when only unbiased estimates of its gradients at the observations are available.
The convex function, defined on a closed convex set in Euclidean space, is usually given by f(θ)=(1/2)E[ℓ(yi,⟨θ,xi⟩)], where (xi,yi)∈X×Y denotes the sample data, assumed to be i.i.d., ℓ denotes a convex loss function and E denotes the expectation with respect to the samples. This formulation covers losses such as the least-squares and logistic losses. In the stochastic approximation framework, the samples arrive sequentially according to an unknown probability measure ρ and the predictor defined by θ is updated after each pair is seen.
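As a quick illustration (ours, not from the paper), under a simple Gaussian-design assumption the per-sample quantity (⟨θ,xk⟩−yk)xk is an unbiased estimate of the gradient of the least-squares objective, which is exactly the kind of information the algorithms below rely on:

```python
# Illustration (assumed Gaussian design, not from the paper): for x ~ N(0, I)
# and y = <theta*, x> + noise, the population gradient of
# f(theta) = (1/2) E[<theta, x>^2 - 2 y <theta, x>] equals theta - theta*,
# and the Monte-Carlo average of the per-sample estimates (<theta, x> - y) x
# matches it.
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 200_000
theta, theta_star = rng.normal(size=d), rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ theta_star + 0.1 * rng.normal(size=n)

mc_grad = ((X @ theta - y)[:, None] * X).mean(axis=0)  # average of per-sample estimates
assert np.allclose(mc_grad, theta - theta_star, atol=0.05)
```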
Robbins and Monro [1] were the first to propose stochastic approximation (SA) for the gradient descent method. Since then, algorithms based on SA have been widely used in stochastic optimisation and machine learning. Polyak [2] and Polyak and Juditsky [3] developed an important improvement of SA by using longer step-sizes together with averaging of the obtained iterates. The mirror-descent SA was developed by Nemirovski et al. [6], who showed that it achieves an unimprovable expected rate for solving non-strongly convex programming problems. Shalev-Shwartz et al. [5] and Nemirovski et al. [6] studied an averaged stochastic gradient descent method for least-square regression.
Theoretical studies on SGD for least-square and logistic regression have shown that the convexity and smoothness of the loss function and the step-size policy play a critical role in the convergence rate. It was found that under a strong-convexity assumption (i.e., the loss function is twice differentiable and its Hessian is lower bounded by a constant c), the convergence rate of averaged SGD with a proper step-size is O(1/cn) [5,6], while it is only O(1/√n) in the non-strongly convex case [6]. By exploiting the smoothness of the loss function, it was shown in [10] that averaged SGD with a constant step-size can achieve a convergence rate of O(1/n) without requiring strong convexity.
Kingma and Ba [24] proposed Adam, a method for efficient stochastic optimization that only requires first-order gradients and has little memory requirement. The method computes individual adaptive learning rates for different parameters from estimates of the first and second moments of the gradients; the name Adam is derived from adaptive moment estimation.
Allen-Zhu [25] introduced Katyusha, a direct, primal-only stochastic gradient method with a provably accelerated convergence rate in convex (off-line) stochastic optimization. It can be incorporated into variance-reduction based algorithms to speed them up, both in terms of sequential and parallel performance.
In this paper, we develop two stochastic accelerated gradient algorithms for the considered least-square and logistic regressions, aiming to improve the convergence rate without assuming strong convexity. Our development is inspired by work on accelerated gradient methods for general stochastic non-linear programming (NLP) [17,18,19,20,21,22,23]. In the stochastic NLP setting, different from the problem considered here where the gradient can be estimated without bias at given points, the gradient is noisy and available only through a stochastic oracle; that is, random vectors with unknown distribution are associated with each gradient evaluation. For such problems, the stochastic accelerated gradient method recently proposed by Ghadimi and Lan [22] achieves a convergence rate of O(1/n2) for convex smooth functions with Lipschitz continuous gradients.
In this paper, we prove by a non-asymptotic analysis that, without the Lipschitz continuous gradient and strong-convexity assumptions, the developed algorithms achieve a convergence rate of O(1/n2). Experimental studies on synthetic data and benchmarks support our theoretical results and show a faster convergence rate than classical SGD and constant-step-size averaged SGD [10].
The rest of the paper is organised as follows. In Section 2, we present the accelerated gradient algorithm for least square regression. In Section 3, we study the accelerated gradient algorithm for logistic regression. Section 4 empirically verifies the obtained theoretical results. Section 5 concludes the paper.
2. Stochastic accelerated gradient algorithm for least square regression
In this section, we consider the least square regression. Let (X,d) be a compact metric space and Y=R. Assume ρ is a probability distribution on Z=X×Y and (X,Y) is the corresponding random variable. We further assume:
(a) The training data (xk,yk),k≥1 are i.i.d. sampled from ρ.
(b) E‖xk‖2 is finite, i.e., E‖xk‖2≤M for any k≥1.
(c) The global minimum of f(θ)=(1/2)E[⟨θ,xk⟩2−2yk⟨θ,xk⟩] is attained at a certain point θ∗∈Rd. (This objective differs from the usual least-squares risk (1/2)E[(⟨θ,xk⟩−yk)2] only by the constant (1/2)E[yk2], so both have the same minimisers.)
(d) In the following, we denote by ξk=(yk−⟨θk,xk⟩)xk the residual and by ξ̄k=(ξ1+⋯+ξk)/k the average residual. We assume that E‖ξk‖2≤M1 for every k.
These assumptions are standard in stochastic approximation [9,10]. However, compared with the work in [10], we make no assumptions on the covariance operators E(xk⨂xk) and E[ξk⨂ξk].
In the following, we present the accelerated stochastic gradient algorithm for least square regression learning in Algorithm 1. The algorithm takes a stream of data (xk,yk) and an initial guess θ0 of the parameter as input. The other requirements are a sequence {αk} satisfying α1=1 and αk>0 for k≥2, and sequences βk>0 and λk>0. The algorithm maintains two intermediate quantities, θag (initialised to θ0) and θmd. When a new sample arrives, θmd is updated as a linear combination of θag and the current parameter estimate θ, with αk as the combination coefficient. The parameter θ is then updated using λk as the step-size. Next, the residual and the average of all residuals up to the k-th sample are computed, and θag is updated using βk as a parameter. The process repeats whenever a new pair of data is seen.
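The following Python sketch (ours) mirrors the description above. The exact form of the θag update is only given in Algorithm 1 itself, so the line `theta_ag = theta_md - beta(k) * xi_bar` is an assumption on our part and should be read against the listing; with αk≡1 the θ update reduces to the LMS recursion noted in Section 2.1.

```python
import numpy as np

def asga_least_squares(stream, theta0, alpha, beta, lam):
    """One pass of the accelerated scheme; alpha, beta, lam map k -> step parameters."""
    theta = theta0.copy()           # current parameter estimate
    theta_ag = theta0.copy()        # aggregated iterate (initialised to theta0)
    xi_sum = np.zeros_like(theta0)  # running sum of residuals
    for k, (x, y) in enumerate(stream, start=1):
        # theta_md: linear combination of theta_ag and theta (alpha(1) = 1)
        theta_md = (1 - alpha(k)) * theta_ag + alpha(k) * theta
        # unbiased gradient estimate at theta_md: <theta_md, x> x - y x
        theta = theta - lam(k) * (x @ theta_md - y) * x
        # residual and its running average
        xi_sum += (y - x @ theta) * x
        xi_bar = xi_sum / k
        # assumed form of the theta_ag update, parameterised by beta_k
        theta_ag = theta_md - beta(k) * xi_bar
    return theta_ag
```

A usage example: theta_ag = asga_least_squares(zip(X, y), np.zeros(d), alpha=lambda k: 2/(k+1), beta=lambda k: 0.1, lam=lambda k: 0.01); the particular sequences here are placeholders, not the choices of Corollary 1.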
2.1. Notes on the algorithm
The unbiased estimate of the gradient, i.e., (⟨θmdk,xk⟩xk−ykxk) for the data point (xk,yk), is used in the update of θk. From this perspective, the update of θk is the same as in the stochastic gradient descent (also called least-mean-square, LMS) algorithm if we set αk=1.
During optimization, the residual ξk is computed. All residuals seen so far are averaged, and the averaged residual enters the update of θagk. This differs from the accelerated stochastic gradient algorithm in [22], where no residual is computed or used during optimization.
2.2. Non-asymptotic analysis on convergence rate
In this section we establish the convergence rate of the developed algorithm. The goal is to bound the expectation E[f(θagn)−f(θ∗)]. It turns out that the developed algorithm achieves a convergence rate of O(1/n2) without the strong convexity and Lipschitz continuous gradient assumptions.
To establish the convergence rate of the developed gradient algorithm, we need the following Lemma (see Lemma 1 of [22]).
Lemma 1. Let {αk} be the sequence of step-sizes in the accelerated gradient algorithm and suppose the sequence {ηk} satisfies ηk≤(1−αk)ηk−1+τk, k=1,2,…, where Γk is defined by
Γ1=1, Γk=(1−αk)Γk−1 for k≥2. (2.1)
Then we have ηk≤Γk(τ1/Γ1+τ2/Γ2+⋯+τk/Γk) for any k≥1.
Proof. Noting that α1=1 and αk∈(0,1], we obtain
η1/Γ1 ≤ τ1/Γ1,
and, for k≥2,
ηk/Γk ≤ (1−αk)ηk−1/Γk + τk/Γk = ηk−1/Γk−1 + τk/Γk.
The result then immediately follows by summing up the above inequalities and rearranging the terms.
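As a quick numeric sanity check of Lemma 1 (ours, purely illustrative), one can run the recursion with random αk∈(0,1) for k≥2 and nonnegative τk and compare it with the stated bound:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
alpha = rng.uniform(0.01, 0.99, size=n)
alpha[0] = 1.0                      # alpha_1 = 1
tau = rng.uniform(0.0, 1.0, size=n) # nonnegative perturbations

eta, gamma, bound_sum = 0.0, 1.0, 0.0
for k in range(n):
    eta = (1 - alpha[k]) * eta + tau[k]    # eta_k = (1 - alpha_k) eta_{k-1} + tau_k
    if k > 0:
        gamma *= 1 - alpha[k]              # Gamma_k = (1 - alpha_k) Gamma_{k-1}
    bound_sum += tau[k] / gamma            # running sum of tau_i / Gamma_i
    assert eta <= gamma * bound_sum + 1e-8 # Lemma 1: eta_k <= Gamma_k * sum
```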
Applying Lemma 1, we can obtain the convergence rate of the developed algorithm, as stated in Theorem 1.
Theorem 1. Let {θmdk,θagk} be computed by the accelerated gradient algorithm and Γk be defined in (2.1). Assume (a–d). If {αk},{βk},{λk} are chosen such that
then for any n≥1, we have
Proof. By Taylor expansion of the function f, Algorithm 1 (line 3) and (line 6), we have:
where the last inequality holds due to assumption (b). Since
we have
where the inequality follows from the positive semi-definiteness of the matrix E(xkx⊺k). By Algorithm 1 (line 2) and (2.2), we have
So we obtain
It follows from Algorithm 1 (line 4) that
Then we have
While
Combining (2.4) and (2.5), we obtain
The above inequality is equivalent to
Using Lemma 1, we have
Since
then
So we obtain
where the inequality follows from the assumption
Under assumption (d), we have
Taking expectations on both sides of inequality (2.6) with respect to (xi,yi), we obtain, for any θ∈Rd,
Now, fixing θ=θ∗, we have
This finishes the proof of the theorem.
In the following, we apply the results of Theorem 1 to some particular selections of {αk},{βk} and {λk}. We obtain the following Corollary 1.
Corollary 1. Suppose that αk and βk in the accelerated gradient algorithm for regression learning are set to
then for any n≥1, we have
Proof. In view of (2.1) and (2.7), we have for k≥2
It is easy to check
Then we obtain
From the result of Theorem 1, we have
This finishes the proof of the Corollary.
3. The accelerated stochastic gradient algorithm for logistic regression learning
In this section, we develop the accelerated gradient algorithm for logistic regression. For logistic regression, we consider the logistic loss function l(θ)=E[log(1+exp(−y⟨x,θ⟩))]. Assume the observations (xi,yi)∈F×{−1,1} are independent and identically distributed from an unknown distribution ρ, where F is a d-dimensional Euclidean space with d≥1. Further, we denote by θ∗∈Rd a global minimiser of l and assume its existence. Let ξi=(yi−⟨θi,xi⟩)xi denote the residual and ξ̄k=(ξ1+⋯+ξk)/k the average residual up to the k-th input datum. To analyse the algorithm, we make the following assumptions:
(B1) E‖xi‖2 is finite, i.e., E‖xi‖2≤M for any i≥1.
(B2) E‖ξi‖2≤M1 for every i.
Again, unlike the algorithm by Bach et al. [10], we make no assumption on the Hessian operator at the global optimum θ∗.
The developed accelerated stochastic gradient algorithm for the logistic regression is presented in Algorithm 2. In the algorithm, θ0∈F is an initial guess, and
It can be seen that the basic framework of Algorithm 2 is the same as Algorithm 1 except that the unbiased estimation to the gradient is different due to the loss functions.
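Under our reading, only the gradient-estimate step of the Section 2 sketch changes; for the logistic loss l(θ)=E[log(1+exp(−y⟨x,θ⟩))] with y∈{−1,+1}, the unbiased per-sample gradient estimate at θmd is −y x/(1+exp(y⟨x,θmd⟩)):

```python
import numpy as np

def logistic_grad_estimate(theta_md, x, y):
    # per-sample gradient of log(1 + exp(-y <x, theta>)) evaluated at theta_md
    return -y * x / (1.0 + np.exp(y * (x @ theta_md)))

# in the least-squares sketch of Section 2, the theta update becomes:
# theta = theta - lam(k) * logistic_grad_estimate(theta_md, x, y)
```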
3.1. Non-asymptotic analysis on convergence rate
In this section, we also provide the non-asymptotic analysis on the convergence rate of the developed algorithm in expectation. Theorem 2 describes the convergence rate.
Theorem 2. Let {θmdk,θagk} be computed by the accelerated gradient algorithm and Γk be defined in (2.1). Assume (B1) and (B2). If {αk},{βk},{λk} are chosen such that
then for any n≥1, we have
Proof. By Taylor expansion of the function l, there exists a ϑ such that
In the equation, we know
It is easy to verify the matrix
is positive semidefinite and its largest eigenvalue satisfies
Combining with Algorithm 2 (line 6) and (3.1), we have
Similar to (3.1), there exists a ζ∈Rd satisfying
we have
where the inequality follows from the positive semi-definiteness of the matrix. Similar to (2.2), we have
So we obtain
It follows from Algorithm 2 (line 4) that
Combining the above two inequalities, we obtain
The above inequality is equivalent to
Using Lemma 1, we have
Since
then
So we obtain
where the inequality follows from the assumption
Under assumption (B2), we have
Taking expectation on both sides of the inequality (3.2) with respect to (xi,yi), we obtain for θ∈Rd,
Now, fixing θ=θ∗, we have
This finishes the proof of the theorem.
Similar to Corollary 1, we specialise the result of Theorem 2 for some particular selections of {αk},{βk} and {λk}.
Corollary 2. Suppose that αk and βk in the accelerated gradient algorithm for logistic regression learning are set to
then for any n≥1, we have
4. Experimental results
In this section, we empirically investigate the performance of our algorithms on synthetic data and some benchmarks widely used by the machine learning community.
4.1. Least square regression
We consider normally distributed inputs, with a covariance matrix H that has random eigenvectors and eigenvalues 1/k, k=1,…,d. The outputs are generated from a linear function with homoscedastic noise; the noise level σ controls the signal-to-noise ratio. We consider d=20 and 100,000 samples, using a mini-batch size of 100.
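A sketch of this data-generating process (our reconstruction; the draw of the ground-truth parameter θ∗ is an assumption, since the paper does not specify it):

```python
import numpy as np

def make_least_squares_data(n=100_000, d=20, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # covariance H with random eigenvectors (via QR) and eigenvalues 1/k
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    eigvals = 1.0 / np.arange(1, d + 1)
    H_sqrt = Q @ np.diag(np.sqrt(eigvals)) @ Q.T
    X = rng.normal(size=(n, d)) @ H_sqrt             # rows have covariance H
    theta_star = rng.normal(size=d)                  # assumed ground-truth parameter
    y = X @ theta_star + sigma * rng.normal(size=n)  # homoscedastic noise of level sigma
    return X, y, theta_star
```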
We compare SGD, stochastic approximation (SA) with averaging [10], ADAM [24], Katyusha [25] and ASGA on synthetic noisy datasets with different noise levels: σ=0, 0.1 and 0.01. For SGD and SA we choose the step-sizes ρ=1/(2R2) and γn=1/(2R2√n), respectively, where R2=trace(H). The reported loss is log10[f(θ)−f(θ∗)]. The average loss over 100 runs on the training data is shown in Figure 1 (a)–(c). It can be seen that ASGA converges much faster than SGD, SA and ADAM. The results also verify our theoretical improvement on the convergence rate of ASGA.
4.2. Logistic regression
For logistic regression, we consider the same input data as for least-squares, but the outputs are generated from the logistic probabilistic model. Comparison results are shown in Figure 1 (d). A step-size γn=1/(2R2√n) is chosen for SA with averaging for optimal performance. For ADAM, the step-size α is decayed as 1/√n, as suggested in [24]. From Figure 1 (d), it is clearly seen that ASGA converges significantly faster than all the compared algorithms.
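For completeness, a sketch of the logistic label generation used with the same inputs (our assumption of the standard logistic model):

```python
import numpy as np

def make_logistic_labels(X, theta_star, seed=0):
    rng = np.random.default_rng(seed)
    p = 1.0 / (1.0 + np.exp(-(X @ theta_star)))  # P(y = +1 | x)
    return np.where(rng.uniform(size=X.shape[0]) < p, 1, -1)
```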
4.3. Benchmarks
We evaluate ASGA on the MNIST, Wine and News datasets. In our experiments, the raw features of the datasets are used directly as input to the classifier. For MNIST, 9000/1000 data points are used as the training/test sets; the numbers are 4000/898 for Wine and 18000/846 for News. We compare SGD, stochastic approximation (SA) with averaging [10], ADAM [24], Katyusha [25] and ASGA, using a mini-batch size of 90 for MNIST and 100 for Wine and News. Figure 2 shows the training curves of ASGA on these three datasets, averaged over 50 runs. From Figure 2, it is seen that ASGA obtains better loss values with a faster convergence rate, which clearly demonstrates the effectiveness of ASGA. Table 1 summarizes the prediction accuracies on the test sets of the optimal logistic-regression parameters obtained by the compared algorithms. The p-values obtained by a t-test at the 5% significance level are also given as subscripts for the compared algorithms. From the table, it is seen that ASGA achieves significantly better accuracies than the other algorithms (p<0.05).
5. Conclusions
In this paper, we proposed two accelerated stochastic gradient algorithms (ASGA) for least-square and logistic regression in which the averaged residual is used to adjust the parameter estimation. A non-asymptotic analysis proved that ASGA can achieve a convergence rate of O(1/n2), which is much tighter than the state-of-the-art rate under non-strong-convexity assumptions. Experimental results on synthetic data and benchmark datasets support our theoretical findings.
Acknowledgments
The authors acknowledge the financial support from the National Natural Science Foundation of China [No. 61573326], the Support Project for Outstanding Young Talents in Colleges and Universities in Anhui Province [No. gxyq2018076], the Natural Science Research Project of Colleges and Universities in Anhui Province [No. KJ2018A0455], and the Scientific Research Project of Chaohu University [No. XLY-201903].
Conflict of interest
The authors declare that they have no competing interests.