Research article

On stochastic accelerated gradient with non-strongly convexity

  • Received: 14 June 2021 Accepted: 17 October 2021 Published: 26 October 2021
  • MSC : 68Q19, 68Q25, 68Q30

  • In this paper, we consider stochastic approximation algorithms for least-square and logistic regression with no strong-convexity assumption on the convex loss functions. We develop two algorithms with varying step-sizes, motivated by the accelerated gradient algorithm originally developed for convex stochastic programming. We show that the developed algorithms achieve a rate of $O(1/n^2)$, where $n$ is the number of samples, which is tighter than the best convergence rate $O(1/n)$ achieved so far for non-strongly-convex stochastic approximation with constant step-size on classic supervised learning problems. Our analysis is a non-asymptotic analysis of the empirical risk (in expectation) that requires fewer assumptions than existing results: neither finite-dimensionality nor a Lipschitz condition is needed. We carry out controlled experiments on synthetic and some standard machine learning data sets. Empirical results justify our theoretical analysis and show a faster convergence rate than other existing methods.

    Citation: Yiyuan Cheng, Yongquan Zhang, Xingxing Zha, Dongyin Wang. On stochastic accelerated gradient with non-strongly convexity[J]. AIMS Mathematics, 2022, 7(1): 1445-1459. doi: 10.3934/math.2022085




    In the era of 'big data', machine learning algorithms that need to process each observation only once, or a few times, are desirable. Stochastic approximation algorithms such as stochastic gradient descent (SGD) and stochastic proximal gradient descent (SPGD) have been widely studied for this task and have been successfully applied in various scenarios [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]. Regression and classification are effective analysis methods in machine learning and have been successfully applied to practical problems. With the wide application of deep learning, these two methods are often used to train the parameters. In this paper, we consider stochastic approximation algorithms for minimizing a convex function where only unbiased estimates of its gradients at the observed points are assumed to be available.

    The convex function, defined on a closed convex set in Euclidean space, is usually given by $f(\theta)=\frac{1}{2}\mathbb{E}[\ell(y_i,\langle\theta,x_i\rangle)]$, where $(x_i,y_i)\in X\times Y$ denotes the sample data, assumed to be i.i.d., $\ell$ is a convex loss function, and $\mathbb{E}$ denotes the expectation taken with respect to the samples. Examples of such loss functions include the least-square and logistic losses. In the stochastic approximation framework, the samples arrive sequentially according to an unknown probability measure $\rho$, and the predictor defined by $\theta$ is updated after each pair is seen.
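    For concreteness, the following minimal Python sketch shows unbiased single-sample gradient estimates for the two losses considered below (the least-square loss and the logistic loss). The function names and interfaces are illustrative assumptions, not part of the original paper.

      import numpy as np

      def lsq_grad(theta, x, y):
          # Unbiased gradient estimate of the least-square objective at one sample:
          # (<theta, x> - y) x, as used later in Algorithm 1.
          return (np.dot(theta, x) - y) * x

      def logistic_grad(theta, x, y):
          # Unbiased gradient estimate of the logistic loss log(1 + exp(-y <x, theta>))
          # at one sample, with y in {-1, +1}, as used later in Algorithm 2.
          m = np.exp(-y * np.dot(x, theta))
          return -y * x * m / (1.0 + m)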

    Robbins and Monro [1] were the first to propose stochastic approximation (SA) for the gradient descent method. Since then, algorithms based on SA have been widely used in stochastic optimisation and machine learning. Polyak [2] and Polyak and Juditsky [3] developed an important improvement of SA by using longer step-sizes with subsequent averaging of the obtained iterates. The mirror-descent SA was developed by Nemirovski et al. [6], who showed that it exhibits an unimprovable expected convergence rate for solving non-strongly convex programming problems. Shalev-Shwartz et al. [5] and Nemirovski et al. [6] studied an averaged stochastic gradient descent method for least-square regression.

    Theoretical studies on SGD for least-square and logistic regression have shown that the convexity and smoothness of the loss function and the step-size policy play a critical role in the convergence rate. It was found that under a strong-convexity assumption (i.e., the loss function is twice differentiable and its Hessian is lower bounded by a constant $c$), the convergence rate of averaged SGD with a proper step-size is $O(1/(cn))$ [5,6], while it is only $O(1/\sqrt{n})$ in the non-strongly-convex case [6]. By using the smoothness of the loss function, it was shown in [10] that averaged SGD with a constant step-size can achieve a convergence rate of $O(1/n)$ without requiring the strong-convexity assumption.

    Kingma and Ba [24] proposed Adam, a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement. The method computes individual adaptive learning rates for different parameters from estimates of the first and second moments of the gradients; the name Adam is derived from adaptive moment estimation.

    Z. A. Zhu [25] introduced Katyusha, a direct, primal-only stochastic gradient method with a provably accelerated convergence rate for convex (off-line) stochastic optimization. It can be incorporated into variance-reduction based algorithms and speed them up, in terms of both sequential and parallel performance.

    In this paper, we develop two stochastic accelerated gradient algorithms for the considered least-square and logistic regressions, aiming to improve the convergence rate without assuming strong convexity. Our development is inspired by work in the area of accelerated gradient methods for general stochastic non-linear programming (NLP) [17,18,19,20,21,22,23]. In the stochastic NLP setting, different from the considered problem where the gradient can be estimated unbiasedly at certain points, the gradient is noisy and available only through a stochastic oracle. That is, random vectors with unknown distribution are associated with each gradient at certain points. For such problems, the recently developed stochastic accelerated gradient method of Ghadimi and Lan [22] achieves a convergence rate of $O(1/n^2)$ for convex smooth functions with Lipschitz continuous gradients.

    In this paper, we prove that, without the Lipschitz-continuous-gradient and strong-convexity assumptions, the developed algorithms achieve a convergence rate of $O(1/n^2)$ by using a non-asymptotic analysis. Experimental studies on synthetic data and benchmarks justify our theoretical results and show a faster convergence rate than classical SGD and constant-step-size averaged SGD [10].

    The rest of the paper is organised as follows. In Section 2, we present the accelerated gradient algorithm for least square regression. In Section 3, we study the accelerated gradient algorithm for logistic regression. Section 4 empirically verifies the obtained theoretical results. Section 5 concludes the paper.

    In this section, we consider least square regression. Let $(X,d)$ be a compact metric space and $Y=\mathbb{R}$. Assume $\rho$ is a probability distribution on $Z=X\times Y$ and $(X,Y)$ is the corresponding random variable. We further assume:

    (a) The training data $(x_k,y_k)$, $k\ge 1$, are i.i.d. sampled from $\rho$.

    (b) $\mathbb{E}\|x_k\|^2$ is finite, i.e., $\mathbb{E}\|x_k\|^2\le M$ for any $k\ge 1$.

    (c) The global minimum of $f(\theta)=\frac{1}{2}\mathbb{E}\big[\langle\theta,x_k\rangle^2-2y_k\langle\theta,x_k\rangle\big]$ is attainable at a certain point $\theta^*\in\mathbb{R}^d$.

    (d) In the following, we denote $\xi_k=(y_k-\langle\theta_k,x_k\rangle)x_k$ as the residual. We assume that $\mathbb{E}\|\xi_k\|^2\le M_1$ for every $k$ and $\bar\xi_k=\frac{1}{k}\sum_{i=1}^k\xi_i$.

    These assumptions are standard in stochastic approximation [9,10]. However, compared with the work in [10], we do not make assumptions on the covariance operators $\mathbb{E}(x_kx_k^\top)$ and $\mathbb{E}[\xi_k\xi_k^\top]$.

    In the following, we present the accelerated stochastic gradient algorithm for least square regression learning in Algorithm 1. The algorithm takes a stream of data $(x_k,y_k)$ as input, together with an initial guess $\theta_0$ of the parameter. The other requirements include $\{\alpha_k\}$ satisfying $\alpha_1=1$ and $\alpha_k>0$ for any $k\ge 2$, $\{\beta_k>0\}$, and $\{\lambda_k>0\}$. The algorithm involves two intermediate quantities, $\theta^{ag}$ (which is initialised to $\theta_0$) and $\theta^{md}$. When a data point comes in, $\theta^{md}$ is updated as a linear combination of $\theta^{ag}$ and the current estimate of the parameter $\theta$ (line 2), where $\alpha_k$ is the coefficient. The parameter $\theta$ is updated in line 4, taking $\lambda_k$ as a parameter. The residual and the average of the residuals up to the $k$-th data point are computed in line 5. $\theta^{ag}$ is then updated in line 6, taking $\beta_k$ as a parameter. The process continues whenever a new pair of data is seen.

    Algorithm 1 The accelerated stochastic gradient algorithm for least square regression.
    Require: $\theta_0$ and $\alpha_1=1$, $\alpha_k>0$ for $k=2,\dots$, $\{\beta_k>0\}$ and $\{\lambda_k>0\}$
      (1) Set $\theta^{ag}_0=\theta_0$, $\bar\xi_0=0$ and $k=1$;
      (2) Set $\theta^{md}_k=(1-\alpha_k)\theta^{ag}_{k-1}+\alpha_k\theta_{k-1}$;
      (3) Set $z_k=\nabla f(\theta^{md}_k)/\alpha_k=\big(\langle\theta^{md}_k,x_k\rangle x_k-y_kx_k\big)/\alpha_k$;
      (4) Set $\theta_k=\theta_{k-1}-\lambda_kz_k$;
      (5) Compute $\xi_k=(y_k-\langle\theta_k,x_k\rangle)x_k$ and $\bar\xi_k=\bar\xi_{k-1}+\frac{1}{k}(\xi_k-\bar\xi_{k-1})$;
      (6) Set $\theta^{ag}_k=\theta^{md}_k-\beta_k\big(z_k+\frac{1}{k}\bar\xi_k\big)$;
      (7) Set $k\leftarrow k+1$, and go to step 2.

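    To make the procedure concrete, the following minimal Python sketch implements Algorithm 1 with the step-size choices of Corollary 1 below; the function name, the data-stream interface and the constant $M$ (an upper bound on $\mathbb{E}\|x_k\|^2$) are assumptions of the sketch, not part of the paper.

      import numpy as np

      def asga_least_squares(stream, theta0, M):
          # Minimal sketch of Algorithm 1 (ASGA for least-square regression) with the
          # Corollary 1 step sizes; `stream` yields pairs (x_k, y_k).
          theta = np.array(theta0, dtype=float)        # theta_{k-1}
          theta_ag = theta.copy()                      # theta^{ag}_{k-1}
          xi_bar = np.zeros_like(theta)                # running average of residuals
          for k, (x, y) in enumerate(stream, start=1):
              alpha = 2.0 / (k + 1)
              beta = 1.0 / (M * (k + 1))
              lam = k / (2.0 * M * (k + 1))
              theta_md = (1 - alpha) * theta_ag + alpha * theta        # step (2)
              z = (np.dot(theta_md, x) * x - y * x) / alpha            # step (3)
              theta = theta - lam * z                                  # step (4)
              xi = (y - np.dot(theta, x)) * x                          # step (5)
              xi_bar = xi_bar + (xi - xi_bar) / k
              theta_ag = theta_md - beta * (z + xi_bar / k)            # step (6)
          return theta_ag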

    The unbiased estimate of the gradient, i.e., $\langle\theta^{md}_k,x_k\rangle x_k-y_kx_k$ for each data point $(x_k,y_k)$, is used in line 3. From this perspective, it is seen that the update of $\theta_k$ (line 4) is actually the same as in the stochastic gradient descent (also called least-mean-square, LMS) algorithm if we set $\alpha_k=1$.

    During optimisation, the residual $\xi_k$ is computed (line 5). All the residuals computed so far are averaged, and the averaged residual affects the update of $\theta^{ag}_k$ (line 6). This differs from the accelerated stochastic gradient algorithm in [22], where no residual is computed or used during optimisation.

    In this section we establish the convergence rate of the developed algorithm. The goal is to bound the expectation $\mathbb{E}[f(\theta^{ag}_n)-f(\theta^*)]$. It turns out that the developed algorithm is able to achieve a convergence rate of $O(1/n^2)$ without strong-convexity and Lipschitz-continuous-gradient assumptions.

    To establish the convergence rate of the developed gradient algorithm, we need the following Lemma (see Lemma 1 of [22]).

    Lemma 1. Let $\{\alpha_k\}$ be the sequence of step sizes in the accelerated gradient algorithm and let the sequence $\{\eta_k\}$ satisfy $\eta_k\le(1-\alpha_k)\eta_{k-1}+\tau_k$, $k=1,2,\dots$, where

    $$ \Gamma_k=\begin{cases}1, & k=1,\\ (1-\alpha_k)\Gamma_{k-1}, & k\ge 2.\end{cases} \tag{2.1} $$

    Then we have $\eta_k\le\Gamma_k\sum_{i=1}^k\frac{\tau_i}{\Gamma_i}$ for any $k\ge 1$.

    Proof. Noting that $\alpha_1=1$ and $\alpha_k\in(0,1]$, we obtain

    $$ \frac{\eta_1}{\Gamma_1}\le\frac{(1-\alpha_1)\eta_0}{\Gamma_1}+\frac{\tau_1}{\Gamma_1}=\frac{\tau_1}{\Gamma_1}, $$

    and

    $$ \frac{\eta_i}{\Gamma_i}\le\frac{(1-\alpha_i)\eta_{i-1}}{\Gamma_i}+\frac{\tau_i}{\Gamma_i}=\frac{\eta_{i-1}}{\Gamma_{i-1}}+\frac{\tau_i}{\Gamma_i},\qquad i\ge 2. $$

    The result then immediately follows by summing up the above inequalities and rearranging the terms.
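    As an illustration (not part of the original paper), the following short Python check draws a random sequence satisfying the recursion of Lemma 1 with equality and verifies the stated bound numerically; all constants are arbitrary.

      import numpy as np

      rng = np.random.default_rng(0)
      n = 50
      alpha = rng.uniform(0.1, 1.0, n)   # alpha_k in (0, 1)
      alpha[0] = 1.0                     # alpha_1 = 1
      tau = rng.uniform(0.0, 1.0, n)     # nonnegative tau_k
      eta, Gamma, partial_sum = 0.0, 1.0, 0.0
      for k in range(n):
          Gamma = 1.0 if k == 0 else (1.0 - alpha[k]) * Gamma   # definition (2.1)
          eta = (1.0 - alpha[k]) * eta + tau[k]                 # recursion, taken with equality
          partial_sum += tau[k] / Gamma
          # Lemma 1: eta_k <= Gamma_k * sum_{i<=k} tau_i / Gamma_i (up to rounding)
          assert eta <= Gamma * partial_sum * (1.0 + 1e-9) + 1e-12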

    Applying Lemma 1, we can obtain the convergence rate of the developed algorithm as explained in Theorem 1.

    Theorem 1. Let $\{\theta^{md}_k,\theta^{ag}_k\}$ be computed by the accelerated gradient algorithm and $\Gamma_k$ be defined in (2.1). Assume (a)–(d). If $\{\alpha_k\},\{\beta_k\},\{\lambda_k\}$ are chosen such that

    $$ 1-\frac{\lambda_k\alpha_k}{2\beta_k}-\frac{\beta_kM}{\alpha_k}\ge 0,\qquad \frac{\alpha_1^2}{\lambda_1\Gamma_1}\ge\frac{\alpha_2^2}{\lambda_2\Gamma_2}\ge\cdots,\qquad 2\beta_k^2M-\beta_k\alpha_k=0, $$

    then for any $n\ge 1$, we have

    $$ \mathbb{E}[f(\theta^{ag}_n)-f(\theta^*)]\le\frac{\Gamma_n}{2\lambda_1}\|\theta_0-\theta^*\|^2+MM_1\Gamma_n\sum_{k=1}^n\frac{\beta_k^2}{k^2\Gamma_k}. $$

    Proof. By a Taylor expansion of the function $f$ and Algorithm 1 (lines 3 and 6), we have

    $$ f(\theta^{ag}_k)=f(\theta^{md}_k)+\langle\nabla f(\theta^{md}_k),\theta^{ag}_k-\theta^{md}_k\rangle+(\theta^{ag}_k-\theta^{md}_k)^\top\nabla^2f(\theta^{md}_k)(\theta^{ag}_k-\theta^{md}_k)\le f(\theta^{md}_k)-\beta_k\alpha_k\|z_k\|^2-\beta_k\alpha_k\frac{1}{k}\langle z_k,\bar\xi_k\rangle+\beta_k^2\,\mathbb{E}\|x_k\|^2\,\Big\|z_k+\frac{1}{k}\bar\xi_k\Big\|^2\le f(\theta^{md}_k)-\beta_k\alpha_k\|z_k\|^2-\beta_k\alpha_k\frac{1}{k}\langle z_k,\bar\xi_k\rangle+\beta_k^2M\Big\|z_k+\frac{1}{k}\bar\xi_k\Big\|^2, $$

    where the last inequality holds due to assumption (b). Since

    $$ f(\mu)-f(\nu)=\langle\nabla f(\nu),\mu-\nu\rangle+(\mu-\nu)^\top\mathbb{E}(x_kx_k^\top)(\mu-\nu), $$

    we have

    $$ f(\nu)-f(\mu)=\langle\nabla f(\nu),\nu-\mu\rangle-(\mu-\nu)^\top\mathbb{E}(x_kx_k^\top)(\mu-\nu)\le\langle\nabla f(\nu),\nu-\mu\rangle, \tag{2.2} $$

    where the inequality follows from the positive semi-definiteness of the matrix $\mathbb{E}(x_kx_k^\top)$. By Algorithm 1 (line 2) and (2.2), we have

    $$ f(\theta^{md}_k)-\big[(1-\alpha_k)f(\theta^{ag}_{k-1})+\alpha_kf(\theta^*)\big]=\alpha_k[f(\theta^{md}_k)-f(\theta^*)]+(1-\alpha_k)[f(\theta^{md}_k)-f(\theta^{ag}_{k-1})]\le\alpha_k\langle\nabla f(\theta^{md}_k),\theta^{md}_k-\theta^*\rangle+(1-\alpha_k)\langle\nabla f(\theta^{md}_k),\theta^{md}_k-\theta^{ag}_{k-1}\rangle=\langle\nabla f(\theta^{md}_k),\alpha_k(\theta^{md}_k-\theta^*)+(1-\alpha_k)(\theta^{md}_k-\theta^{ag}_{k-1})\rangle=\alpha_k\langle\nabla f(\theta^{md}_k),\theta_{k-1}-\theta^*\rangle=\alpha_k^2\langle z_k,\theta_{k-1}-\theta^*\rangle. $$

    So we obtain

    $$ f(\theta^{ag}_k)\le(1-\alpha_k)f(\theta^{ag}_{k-1})+\alpha_kf(\theta^*)+\alpha_k^2\langle z_k,\theta_{k-1}-\theta^*\rangle-\beta_k\alpha_k\|z_k\|^2-\beta_k\alpha_k\frac{1}{k}\langle z_k,\bar\xi_k\rangle+\beta_k^2M\Big\|z_k+\frac{1}{k}\bar\xi_k\Big\|^2. $$

    It follows from Algorithm 1 (line 4) that

    $$ \|\theta_k-\theta^*\|^2=\|\theta_{k-1}-\lambda_kz_k-\theta^*\|^2 \tag{2.3} $$
    $$ \phantom{\|\theta_k-\theta^*\|^2}=\|\theta_{k-1}-\theta^*\|^2-2\lambda_k\langle z_k,\theta_{k-1}-\theta^*\rangle+\lambda_k^2\|z_k\|^2. \tag{2.4} $$

    Then we have

    $$ \langle z_k,\theta_{k-1}-\theta^*\rangle=\frac{1}{2\lambda_k}\big[\|\theta_{k-1}-\theta^*\|^2-\|\theta_k-\theta^*\|^2\big]+\frac{\lambda_k}{2}\|z_k\|^2. $$

    Moreover,

    $$ \Big\|z_k+\frac{1}{k}\bar\xi_k\Big\|^2=\|z_k\|^2+\frac{1}{k^2}\|\bar\xi_k\|^2+\frac{2}{k}\langle z_k,\bar\xi_k\rangle. \tag{2.5} $$

    Combining (2.4) and (2.5), we obtain

    $$ f(\theta^{ag}_k)\le(1-\alpha_k)f(\theta^{ag}_{k-1})+\alpha_kf(\theta^*)+\frac{\alpha_k^2}{2\lambda_k}\big[\|\theta_{k-1}-\theta^*\|^2-\|\theta_k-\theta^*\|^2\big]-\beta_k\alpha_k\Big(1-\frac{\lambda_k\alpha_k}{2\beta_k}-\frac{\beta_kM}{\alpha_k}\Big)\|z_k\|^2+M\beta_k^2\frac{1}{k^2}\|\bar\xi_k\|^2+\Big\langle\bar\xi_k,\frac{1}{k}(2\beta_k^2M-\beta_k\alpha_k)z_k\Big\rangle. $$

    The above inequality is equivalent to

    $$ f(\theta^{ag}_k)-f(\theta^*)\le(1-\alpha_k)[f(\theta^{ag}_{k-1})-f(\theta^*)]+\frac{\alpha_k^2}{2\lambda_k}\big[\|\theta_{k-1}-\theta^*\|^2-\|\theta_k-\theta^*\|^2\big]-\beta_k\alpha_k\Big(1-\frac{\lambda_k\alpha_k}{2\beta_k}-\frac{\beta_kM}{\alpha_k}\Big)\|z_k\|^2+M\beta_k^2\frac{1}{k^2}\|\bar\xi_k\|^2+\Big\langle\bar\xi_k,\frac{1}{k}(2\beta_k^2M-\beta_k\alpha_k)z_k\Big\rangle. $$

    Using Lemma 1, we have

    $$ f(\theta^{ag}_n)-f(\theta^*)\le\Gamma_n\sum_{k=1}^n\frac{\alpha_k^2}{2\lambda_k\Gamma_k}\big[\|\theta_{k-1}-\theta^*\|^2-\|\theta_k-\theta^*\|^2\big]+\Gamma_n\sum_{k=1}^n\frac{\beta_k^2M}{\Gamma_kk^2}\|\bar\xi_k\|^2-\Gamma_n\sum_{k=1}^n\frac{\beta_k\alpha_k}{\Gamma_k}\Big(1-\frac{\lambda_k\alpha_k}{2\beta_k}-\frac{\beta_kM}{\alpha_k}\Big)\|z_k\|^2+\Gamma_n\sum_{k=1}^n\frac{1}{\Gamma_k}\Big\langle\bar\xi_k,\frac{1}{k}(2\beta_k^2M-\beta_k\alpha_k)z_k\Big\rangle. $$

    Since

    $$ \frac{\alpha_1^2}{\lambda_1\Gamma_1}\ge\frac{\alpha_2^2}{\lambda_2\Gamma_2}\ge\cdots,\qquad \alpha_1=\Gamma_1=1, $$

    then

    $$ \sum_{k=1}^n\frac{\alpha_k^2}{2\lambda_k\Gamma_k}\big[\|\theta_{k-1}-\theta^*\|^2-\|\theta_k-\theta^*\|^2\big]\le\frac{\alpha_1^2}{2\lambda_1\Gamma_1}\|\theta_0-\theta^*\|^2=\frac{1}{2\lambda_1}\|\theta_0-\theta^*\|^2. $$

    So we obtain

    $$ f(\theta^{ag}_n)-f(\theta^*)\le\frac{\Gamma_n}{2\lambda_1}\|\theta_0-\theta^*\|^2+\Gamma_n\sum_{k=1}^n\frac{\beta_k^2M}{\Gamma_kk^2}\|\bar\xi_k\|^2, \tag{2.6} $$

    where the inequality follows from the assumptions

    $$ 1-\frac{\lambda_k\alpha_k}{2\beta_k}-\frac{\beta_kM}{\alpha_k}\ge 0,\qquad 2\beta_k^2M-\beta_k\alpha_k=0. $$

    Under assumption (d), we have

    $$ \mathbb{E}\|\bar\xi_k\|^2=\mathbb{E}\Big[\frac{1}{k^2}\Big\|\sum_{i=1}^k\xi_i\Big\|^2\Big]\le\mathbb{E}\Big[\frac{1}{k^2}\,k\sum_{i=1}^k\|\xi_i\|^2\Big]\le M_1. $$

    Taking expectations on both sides of inequality (2.6) with respect to the samples $(x_i,y_i)$ and using the bound above, we obtain

    $$ \mathbb{E}[f(\theta^{ag}_n)-f(\theta^*)]\le\frac{\Gamma_n}{2\lambda_1}\|\theta_0-\theta^*\|^2+MM_1\Gamma_n\sum_{k=1}^n\frac{\beta_k^2}{k^2\Gamma_k}. $$

    This finishes the proof of the theorem.

    In the following, we apply the results of Theorem 1 to some particular selections of $\{\alpha_k\}$, $\{\beta_k\}$ and $\{\lambda_k\}$. We obtain the following Corollary 1.

    Corollary 1. Suppose that $\alpha_k$, $\beta_k$ and $\lambda_k$ in the accelerated gradient algorithm for regression learning are set to

    $$ \alpha_k=\frac{2}{k+1},\qquad \beta_k=\frac{1}{M(k+1)},\qquad \lambda_k=\frac{k}{2M(k+1)},\qquad k\ge 1, \tag{2.7} $$

    then for any $n\ge 1$, we have

    $$ \mathbb{E}[f(\theta^{ag}_n)-f(\theta^*)]\le\frac{4M}{n(n+1)}\|\theta_0-\theta^*\|^2+\frac{M_1}{Mn(n+1)}. $$

    Proof. In view of (2.1) and (2.7), we have for $k\ge 2$

    $$ \Gamma_k=(1-\alpha_k)\Gamma_{k-1}=\frac{k-1}{k+1}\times\frac{k-2}{k}\times\frac{k-3}{k-1}\times\cdots\times\frac{2}{4}\times\frac{1}{3}\times\Gamma_1=\frac{2}{k(k+1)}. $$

    It is easy to check that

    $$ 1-\frac{\lambda_k\alpha_k}{2\beta_k}-\frac{\beta_kM}{\alpha_k}\ge 0,\qquad \frac{\alpha_1^2}{\lambda_1\Gamma_1}=\frac{\alpha_2^2}{\lambda_2\Gamma_2}=\cdots=4M,\qquad 2\beta_k^2M-\beta_k\alpha_k=0. $$

    Then we obtain

    $$ MM_1\Gamma_n\sum_{k=1}^n\frac{\beta_k^2}{k^2\Gamma_k}=\frac{2M_1}{n(n+1)}\sum_{k=1}^n\frac{M}{M^2(k+1)^2}\cdot\frac{k(k+1)}{2k^2}=\frac{M_1}{Mn(n+1)}\sum_{k=1}^n\frac{1}{k(k+1)}\le\frac{M_1}{Mn(n+1)}. $$

    From the result of Theorem 1, we have

    $$ \mathbb{E}[f(\theta^{ag}_n)-f(\theta^*)]\le\frac{4M}{n(n+1)}\|\theta_0-\theta^*\|^2+\frac{M_1}{Mn(n+1)}. $$

    This finishes the proof of the corollary.
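    As a quick numerical illustration (not part of the paper), the following Python snippet checks the conditions of Theorem 1 and the closed form $\Gamma_k=2/(k(k+1))$ for the step sizes of Corollary 1; the value of $M$ is arbitrary.

      M = 3.0                         # arbitrary bound on E||x_k||^2
      Gamma = 1.0
      for k in range(1, 1001):
          alpha = 2.0 / (k + 1)
          beta = 1.0 / (M * (k + 1))
          lam = k / (2.0 * M * (k + 1))
          Gamma = 1.0 if k == 1 else (1.0 - alpha) * Gamma
          assert 1.0 - lam * alpha / (2.0 * beta) - beta * M / alpha >= 0.0   # first condition
          assert abs(2.0 * beta**2 * M - beta * alpha) < 1e-15                # second condition
          assert abs(alpha**2 / (lam * Gamma) - 4.0 * M) < 1e-6               # alpha_k^2/(lambda_k Gamma_k) = 4M
          assert abs(Gamma - 2.0 / (k * (k + 1))) < 1e-12                     # Gamma_k = 2/(k(k+1))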

    In this section, we develop the accelerated gradient algorithm for logistic regression. For logistic regression, we consider the logistic loss function $\ell(\theta)=\mathbb{E}[\log(1+\exp(-y\langle x,\theta\rangle))]$. Assume the observations $(x_i,y_i)\in F\times\{-1,1\}$ are independent and identically distributed from an unknown distribution $\rho$, where $F$ is a $d$-dimensional Euclidean space with $d\ge 1$. Further, we denote by $\theta^*\in\mathbb{R}^d$ a global minimiser of $\ell$ and assume its existence. Let $\xi_k=(y_k-\langle\theta_k,x_k\rangle)x_k$ denote the residual and $\bar\xi_k=\frac{1}{k}\sum_{i=1}^k\xi_i$ the average residual over the first $k$ input data. To analyse the algorithm, we make the following assumptions:

    (B1) $\mathbb{E}\|x_i\|^2$ is finite, i.e., $\mathbb{E}\|x_i\|^2\le M$ for any $i\ge 1$.

    (B2) $\mathbb{E}\|\xi_i\|^2\le M_1$ for every $i$.

    Again, unlike the algorithm by Bach et al. [10], we make no assumption on the Hessian operator at the global optimum $\theta^*$.

    The developed accelerated stochastic gradient algorithm for logistic regression is presented in Algorithm 2. In the algorithm, $\theta_0\in F$ is an initial guess, and

    $$ \nabla\ell(\theta_k)=\frac{-y_k\exp\{-y_k\langle x_k,\theta_k\rangle\}x_k}{1+\exp\{-y_k\langle x_k,\theta_k\rangle\}}. $$

    It can be seen that the basic framework of Algorithm 2 is the same as that of Algorithm 1, except that the unbiased estimate of the gradient differs because of the different loss function.

    Algorithm 2 The accelerated stochastic gradient approximation algorithm for logistic regression.
    Require: $\theta_0$ and $\alpha_1=1$, $\alpha_k>0$ for $k=2,\dots$, $\{\beta_k>0\}$ and $\{\lambda_k>0\}$
      (1) Set $\theta^{ag}_0=\theta_0$, $\bar\xi_0=0$ and $k=1$;
      (2) Set $\theta^{md}_k=(1-\alpha_k)\theta^{ag}_{k-1}+\alpha_k\theta_{k-1}$;
      (3) Compute $z_k=\frac{1}{\alpha_k}\nabla\ell(\theta^{md}_k)$;
      (4) Set $\theta_k=\theta_{k-1}-\lambda_kz_k$;
      (5) Compute $\xi_k=(y_k-\langle\theta_k,x_k\rangle)x_k$ and $\bar\xi_k=\bar\xi_{k-1}+\frac{1}{k}(\xi_k-\bar\xi_{k-1})$;
      (6) Set $\theta^{ag}_k=\theta^{md}_k-\beta_k\big(z_k+\frac{1}{k}\bar\xi_k\big)$;
      (7) Set $k\leftarrow k+1$, and go to step 2.

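    Analogously to the sketch given for Algorithm 1, the following minimal Python sketch implements Algorithm 2 with the step sizes of Corollary 2 below; the function name, the stream interface and the bound $M$ on $\mathbb{E}\|x_k\|^2$ are assumptions of the sketch.

      import numpy as np

      def asga_logistic(stream, theta0, M):
          # Minimal sketch of Algorithm 2 (ASGA for logistic regression); `stream`
          # yields pairs (x_k, y_k) with y_k in {-1, +1}.
          theta = np.array(theta0, dtype=float)
          theta_ag = theta.copy()
          xi_bar = np.zeros_like(theta)
          for k, (x, y) in enumerate(stream, start=1):
              alpha = 2.0 / (k + 1)
              beta = 1.0 / (M * (k + 1))
              lam = k / (2.0 * M * (k + 1))
              theta_md = (1 - alpha) * theta_ag + alpha * theta          # step (2)
              e = np.exp(-y * np.dot(x, theta_md))
              z = (-y * x * e / (1.0 + e)) / alpha                       # step (3): grad of logistic loss / alpha_k
              theta = theta - lam * z                                    # step (4)
              xi = (y - np.dot(theta, x)) * x                            # step (5)
              xi_bar = xi_bar + (xi - xi_bar) / k
              theta_ag = theta_md - beta * (z + xi_bar / k)              # step (6)
          return theta_ag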

    In this section, we also provide the non-asymptotic analysis on the convergence rate of the developed algorithm in expectation. Theorem 2 describes the convergence rate.

    Theorem 2. Let $\{\theta^{md}_k,\theta^{ag}_k\}$ be computed by the accelerated gradient algorithm and $\Gamma_k$ be defined in (2.1). Assume (B1) and (B2). If $\{\alpha_k\},\{\beta_k\},\{\lambda_k\}$ are chosen such that

    $$ 1-\frac{\lambda_k\alpha_k}{2\beta_k}-\frac{\beta_kM}{\alpha_k}\ge 0,\qquad 2\beta_k^2M-\beta_k\alpha_k=0,\qquad \frac{\alpha_1^2}{\lambda_1\Gamma_1}\ge\frac{\alpha_2^2}{\lambda_2\Gamma_2}\ge\cdots, $$

    then for any $n\ge 1$, we have

    $$ \mathbb{E}[\ell(\theta^{ag}_n)-\ell(\theta^*)]\le\frac{\Gamma_n}{2\lambda_1}\|\theta_0-\theta^*\|^2+MM_1\Gamma_n\sum_{k=1}^n\frac{\beta_k^2}{k^2\Gamma_k}. $$

    Proof. By a Taylor expansion of the function $\ell$, there exists a $\vartheta$ such that

    $$ \ell(\theta^{ag}_k)=\ell(\theta^{md}_k)+\langle\nabla\ell(\theta^{md}_k),\theta^{ag}_k-\theta^{md}_k\rangle+(\theta^{ag}_k-\theta^{md}_k)^\top\nabla^2\ell(\vartheta)(\theta^{ag}_k-\theta^{md}_k)=\ell(\theta^{md}_k)-\beta_k\alpha_k\|z_k\|^2-\beta_k\alpha_k\Big\langle z_k,\frac{1}{k}\bar\xi_k\Big\rangle+(\theta^{ag}_k-\theta^{md}_k)^\top\mathbb{E}\Big[\frac{\exp\{-y_k\langle x_k,\vartheta\rangle\}x_kx_k^\top}{1+\exp\{-y_k\langle x_k,\vartheta\rangle\}}\Big](\theta^{ag}_k-\theta^{md}_k). \tag{3.1} $$

    In this equation, note that

    $$ \frac{\exp\{-y_k\langle x_k,\vartheta\rangle\}}{1+\exp\{-y_k\langle x_k,\vartheta\rangle\}}\le 1,\qquad \lambda_{\max}(x_kx_k^\top)\le\|x_k\|^2. $$

    It is easy to verify that the matrix

    $$ \mathbb{E}\Big[\frac{\exp\{-y_k\langle x_k,\vartheta\rangle\}x_kx_k^\top}{1+\exp\{-y_k\langle x_k,\vartheta\rangle\}}\Big] $$

    is positive semidefinite and its largest eigenvalue satisfies

    $$ \lambda_{\max}\Big(\mathbb{E}\Big[\frac{\exp\{-y_k\langle x_k,\vartheta\rangle\}x_kx_k^\top}{1+\exp\{-y_k\langle x_k,\vartheta\rangle\}}\Big]\Big)\le\mathbb{E}\|x_k\|^2\le M. $$

    Combining this with Algorithm 2 (line 6) and (3.1), we have

    $$ \ell(\theta^{ag}_k)\le\ell(\theta^{md}_k)-\beta_k\alpha_k\|z_k\|^2-\beta_k\alpha_k\Big\langle z_k,\frac{1}{k}\bar\xi_k\Big\rangle+\beta_k^2M\Big\|z_k+\frac{1}{k}\bar\xi_k\Big\|^2. $$

    Similar to (3.1), there exists a $\zeta\in\mathbb{R}^d$ satisfying

    $$ \ell(\mu)-\ell(\nu)=\langle\nabla\ell(\nu),\mu-\nu\rangle+(\mu-\nu)^\top\mathbb{E}\Big[\frac{\exp\{-y_k\langle x_k,\zeta\rangle\}x_kx_k^\top}{1+\exp\{-y_k\langle x_k,\zeta\rangle\}}\Big](\mu-\nu),\qquad\forall\,\mu,\nu\in\mathbb{R}^d, $$

    so we have

    $$ \ell(\nu)-\ell(\mu)=\langle\nabla\ell(\nu),\nu-\mu\rangle-(\mu-\nu)^\top\mathbb{E}\Big[\frac{\exp\{-y_k\langle x_k,\zeta\rangle\}x_kx_k^\top}{1+\exp\{-y_k\langle x_k,\zeta\rangle\}}\Big](\mu-\nu)\le\langle\nabla\ell(\nu),\nu-\mu\rangle, $$

    where the inequality follows from the positive semi-definiteness of the matrix. As in (2.2), we have

    $$ \ell(\theta^{md}_k)-\big[(1-\alpha_k)\ell(\theta^{ag}_{k-1})+\alpha_k\ell(\theta^*)\big]\le\alpha_k^2\langle z_k,\theta_{k-1}-\theta^*\rangle. $$

    So we obtain

    $$ \ell(\theta^{ag}_k)\le(1-\alpha_k)\ell(\theta^{ag}_{k-1})+\alpha_k\ell(\theta^*)+\alpha_k^2\langle z_k,\theta_{k-1}-\theta^*\rangle-\beta_k\alpha_k\|z_k\|^2-\beta_k\alpha_k\Big\langle z_k,\frac{1}{k}\bar\xi_k\Big\rangle+\beta_k^2M\Big\|z_k+\frac{1}{k}\bar\xi_k\Big\|^2. $$

    It follows from Algorithm 2 (line 4) that

    $$ \|\theta_k-\theta^*\|^2=\|\theta_{k-1}-\lambda_kz_k-\theta^*\|^2=\|\theta_{k-1}-\theta^*\|^2-2\lambda_k\langle z_k,\theta_{k-1}-\theta^*\rangle+\lambda_k^2\|z_k\|^2. $$

    Combining the above two inequalities, we obtain

    $$ \ell(\theta^{ag}_k)\le(1-\alpha_k)\ell(\theta^{ag}_{k-1})+\alpha_k\ell(\theta^*)+\frac{\alpha_k^2}{2\lambda_k}\big[\|\theta_{k-1}-\theta^*\|^2-\|\theta_k-\theta^*\|^2\big]-\beta_k\alpha_k\Big(1-\frac{\lambda_k\alpha_k}{2\beta_k}-\frac{\beta_kM}{\alpha_k}\Big)\|z_k\|^2+M\beta_k^2\frac{1}{k^2}\|\bar\xi_k\|^2+\Big\langle\bar\xi_k,\frac{1}{k}(2\beta_k^2M-\beta_k\alpha_k)z_k\Big\rangle. $$

    The above inequality is equivalent to

    $$ \ell(\theta^{ag}_k)-\ell(\theta^*)\le(1-\alpha_k)[\ell(\theta^{ag}_{k-1})-\ell(\theta^*)]+\frac{\alpha_k^2}{2\lambda_k}\big[\|\theta_{k-1}-\theta^*\|^2-\|\theta_k-\theta^*\|^2\big]-\beta_k\alpha_k\Big(1-\frac{\lambda_k\alpha_k}{2\beta_k}-\frac{\beta_kM}{\alpha_k}\Big)\|z_k\|^2+M\beta_k^2\frac{1}{k^2}\|\bar\xi_k\|^2+\Big\langle\bar\xi_k,\frac{1}{k}(2\beta_k^2M-\beta_k\alpha_k)z_k\Big\rangle. $$

    Using Lemma 1, we have

    $$ \ell(\theta^{ag}_n)-\ell(\theta^*)\le\Gamma_n\sum_{k=1}^n\frac{\alpha_k^2}{2\lambda_k\Gamma_k}\big[\|\theta_{k-1}-\theta^*\|^2-\|\theta_k-\theta^*\|^2\big]+\Gamma_n\sum_{k=1}^n\frac{\beta_k^2M}{\Gamma_kk^2}\|\bar\xi_k\|^2-\Gamma_n\sum_{k=1}^n\frac{\beta_k\alpha_k}{\Gamma_k}\Big(1-\frac{\lambda_k\alpha_k}{2\beta_k}-\frac{\beta_kM}{\alpha_k}\Big)\|z_k\|^2+\Gamma_n\sum_{k=1}^n\frac{1}{\Gamma_k}\Big\langle\bar\xi_k,\frac{1}{k}(2\beta_k^2M-\beta_k\alpha_k)z_k\Big\rangle. $$

    Since

    $$ \frac{\alpha_1^2}{\lambda_1\Gamma_1}\ge\frac{\alpha_2^2}{\lambda_2\Gamma_2}\ge\cdots,\qquad \alpha_1=\Gamma_1=1, $$

    then

    $$ \sum_{k=1}^n\frac{\alpha_k^2}{2\lambda_k\Gamma_k}\big[\|\theta_{k-1}-\theta^*\|^2-\|\theta_k-\theta^*\|^2\big]\le\frac{\alpha_1^2}{2\lambda_1\Gamma_1}\|\theta_0-\theta^*\|^2=\frac{1}{2\lambda_1}\|\theta_0-\theta^*\|^2. $$

    So we obtain

    $$ \ell(\theta^{ag}_n)-\ell(\theta^*)\le\frac{\Gamma_n}{2\lambda_1}\|\theta_0-\theta^*\|^2+\Gamma_n\sum_{k=1}^n\frac{\beta_k^2M}{\Gamma_kk^2}\|\bar\xi_k\|^2, \tag{3.2} $$

    where the inequality follows from the assumptions

    $$ 1-\frac{\lambda_k\alpha_k}{2\beta_k}-\frac{\beta_kM}{\alpha_k}\ge 0,\qquad 2\beta_k^2M-\beta_k\alpha_k=0. $$

    Under assumption (B2), we have

    $$ \mathbb{E}\|\bar\xi_k\|^2=\mathbb{E}\Big[\frac{1}{k^2}\Big\|\sum_{i=1}^k\xi_i\Big\|^2\Big]\le\mathbb{E}\Big[\frac{1}{k^2}\,k\sum_{i=1}^k\|\xi_i\|^2\Big]\le M_1. $$

    Taking expectations on both sides of inequality (3.2) with respect to the samples $(x_i,y_i)$ and using the bound above, we obtain

    $$ \mathbb{E}[\ell(\theta^{ag}_n)-\ell(\theta^*)]\le\frac{\Gamma_n}{2\lambda_1}\|\theta_0-\theta^*\|^2+MM_1\Gamma_n\sum_{k=1}^n\frac{\beta_k^2}{k^2\Gamma_k}. $$

    This finishes the proof of the theorem.

    Similar to Corollary 1, we specialise the result of Theorem 2 for some particular selections of $\{\alpha_k\}$, $\{\beta_k\}$ and $\{\lambda_k\}$.

    Corollary 2. Suppose that $\alpha_k$, $\beta_k$ and $\lambda_k$ in the accelerated gradient algorithm for regression learning are set to

    $$ \alpha_k=\frac{2}{k+1},\qquad \beta_k=\frac{1}{M(k+1)},\qquad \lambda_k=\frac{k}{2M(k+1)},\qquad k\ge 1, $$

    then for any $n\ge 1$, we have

    $$ \mathbb{E}[\ell(\theta^{ag}_n)-\ell(\theta^*)]\le\frac{4M}{n(n+1)}\|\theta_0-\theta^*\|^2+\frac{M_1}{Mn(n+1)}. $$

    In this section, we empirically investigate the performance of our algorithms on synthetic data and some benchmarks widely used by the machine learning community.

    We consider normally distributed inputs with a covariance matrix $H$ that has random eigenvectors and eigenvalues $1/k$, $k=1,\dots,d$. The outputs are generated from a linear function with homoscedastic noise, with various signal-to-noise ratios controlled by $\sigma$. We consider $d=20$ and $100000$ samples, using a mini-batch size of 100.
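    A minimal Python sketch of this data-generation procedure is given below; the constants that the text does not specify (random seed, the scale of the ground-truth predictor) are assumptions.

      import numpy as np

      rng = np.random.default_rng(0)
      d, n, sigma = 20, 100_000, 0.1
      Q, _ = np.linalg.qr(rng.standard_normal((d, d)))       # random eigenvectors
      H = Q @ np.diag(1.0 / np.arange(1, d + 1)) @ Q.T       # covariance with eigenvalues 1/k
      X = rng.multivariate_normal(np.zeros(d), H, size=n)    # normally distributed inputs
      theta_true = rng.standard_normal(d)                    # ground-truth linear predictor
      y = X @ theta_true + sigma * rng.standard_normal(n)    # outputs with homoscedastic noise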

    We compare SGD, stochastic approximation (SA) with averaging [10], ADAM [24], Katyusha [25] and ASGA on synthetic noisy datasets with different noise levels: $\sigma=0$, $0.1$ and $0.01$. For SGD and SA we choose the step sizes $\rho=1/(2R^2)$ and $\gamma_n=1/(2R^2\sqrt{n})$, respectively, where $R^2=\mathrm{trace}(H)$. The reported quantity is $\log_{10}[f(\theta)-f(\theta^*)]$. The average loss over 100 runs on the training data is shown in Figure 1 (a)–(c). It can be seen that ASGA converges much faster than SGD, SA and ADAM. The results also verify our theoretical improvement on the convergence rate of ASGA.

    Figure 1.  (a)–(c): Least square regression training log-likelihood on synthetic data sets with different noise levels (0, 0.01 and 0.1 clockwise in turn); and (d) logistic regression on synthetic data set.

    For logistic regression, we consider the same input data as for least squares, but the outputs are generated from the logistic probabilistic model. Comparison results are shown in Figure 1 (d). A step size $\gamma_n=1/(2R^2\sqrt{n})$ is chosen for SA with averaging for optimal performance. For ADAM, its step size $\alpha$ is decayed as $1/\sqrt{n}$, as suggested in [24]. From Figure 1 (d), it is clearly seen that ASGA converges significantly faster than all the compared algorithms.

    We evaluate ASGA on the MNIST, Wine and News datasets. In our experiments, the raw features of the datasets are used directly as the input to the classifier. For MNIST, 9000/1000 data points are used as the training/test sets; the numbers are 4000/898 for Wine and 18000/846 for News. We compare SGD, stochastic approximation (SA) with averaging [10], ADAM [24], Katyusha [25] and ASGA, using a mini-batch size of 90 on MNIST and of 100 on Wine and News. Figure 2 shows the training procedure of ASGA on these three datasets, averaged over 50 runs. From Figure 2, it is seen that ASGA obtains better loss values with a faster convergence rate. This clearly demonstrates the effectiveness of ASGA. Table 1 summarizes the prediction accuracies on the test sets of the optimal logistic regression parameters obtained by the compared algorithms. The p-values obtained by a t-test at the 5% significance level are also given in parentheses. From the table, it is seen that ASGA achieves significantly better accuracies than the other algorithms (p<0.05).

    Table 1.  The prediction accuracy of the compared algorithms on the test datasets of MNIST, Wine and News, where the values in parentheses are the p-values obtained by the t-test between the corresponding algorithm and ASGA.

    Method      MNIST                  Wine                   News
    ASGA        0.9056                 0.8863                 0.9236
    ADAM        0.8812 (1.5×10⁻⁶)      0.8520 (2.6×10⁻⁸)      0.9120 (3.5×10⁻⁶)
    SA          0.8913 (4.2×10⁻⁵)      0.8612 (2.7×10⁻⁷)      0.8875 (4.7×10⁻⁵)
    Katyusha    0.8845 (1.7×10⁻⁴)      0.8624 (3.8×10⁻³)      0.8943 (6.3×10⁻⁴)
    SGD         0.8615 (3.7×10⁻¹⁰)     0.8523 (4.7×10⁻⁸)      0.8831 (2.4×10⁻⁶)

    Figure 2.  The training procedure of ASGA on Wine, MNIST and News averaged over 50 runs.
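    For reference, the significance test reported in Table 1 can be computed with a standard two-sample t-test; the following sketch uses SciPy with placeholder accuracy arrays, since the per-run accuracies are not listed in the text.

      import numpy as np
      from scipy import stats

      # Placeholder per-run test accuracies for ASGA and one baseline (illustrative only).
      acc_asga = np.array([0.905, 0.906, 0.907, 0.904, 0.906])
      acc_base = np.array([0.881, 0.880, 0.882, 0.881, 0.883])
      t_stat, p_value = stats.ttest_ind(acc_asga, acc_base)
      print(f"t = {t_stat:.3f}, p = {p_value:.2e}, significant at 5%: {p_value < 0.05}")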

    In this paper, we proposed two accelerated stochastic gradient algorithms (ASGA) for least-square and logistic regression, in which the averaged residual is used to adjust the parameter estimation. A non-asymptotic analysis proved that ASGA can achieve a convergence rate of $O(1/n^2)$, which is much tighter than the state-of-the-art rate under non-strong-convexity assumptions. Experimental results on synthetic data and benchmark datasets justified our theoretical results.

    The authors acknowledge the financial support from the National Natural Science Foundation of China [No. 61573326], the Support Project for Outstanding Young Talents in Colleges and Universities in Anhui Province [No. gxyq2018076], the Natural Science Research Project of Colleges and Universities in Anhui Province [No. KJ2018A0455], and the Scientific Research Project of Chaohu University [No. XLY-201903].

    The authors declare that they have no competing interests.



    [1] H. Robbins, S. Monro, A stochastic approximation method, Ann. Math. Statist., 22 (1951), 400–407.
    [2] B. T. Polyak, New stochastic approximation type procedures, Automat. i Telemekh., 1990, 98–107.
    [3] B. T. Polyak, A. B. Juditsky, Acceleration of stochastic approximation by averaging, SIAM J. Control Optim., 30 (1992), 838–855. doi: 10.1137/0330046
    [4] L. Bottou, O. Bousquet, The tradeoffs of large scale learning, In: Advances in neural information processing systems, 20 (2007), 1–8.
    [5] S. Shalev-Shwartz, Y. Singer, N. Srebro, A. Cotter, Pegasos: Primal estimated sub-gradient solver for SVM, Math. Program., 127 (2011), 3–30. doi: 10.1007/s10107-010-0420-4
    [6] A. Nemirovski, A. Juditsky, G. Lan, A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM J. Optim., 19 (2009), 1574–1609. doi: 10.1137/070704277
    [7] G. H. Lan, R. D. C. Monteiro, Iteration-complexity of first-order penalty methods for convex programming, Math. Program., 138 (2013), 115–139. doi: 10.1007/s10107-012-0588-x
    [8] S. Ghadimi, G. H. Lan, Accelerated gradient methods for nonconvex nonlinear and stochastic programming, Math. Program., 156 (2016), 59–99. doi: 10.1007/s10107-015-0871-8
    [9] F. Bach, E. Moulines, Non-asymptotic analysis of stochastic approximation algorithms for machine learning, 2011.
    [10] F. Bach, E. Moulines, Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n), 2013. Available from: https://proceedings.neurips.cc/paper/2013/file/7fe1f8abaad094e0b5cb1b01d712f708-Paper.pdf.
    [11] J. C. Duchi, E. Hazan, Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, J. Mach. Learn. Res., 12 (2011), 2121–2159.
    [12] Y. E. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k^2), Dokl. Akad. Nauk SSSR, 269 (1983), 543–547.
    [13] A. Beck, M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci., 2 (2009), 183–202. doi: 10.1137/080716542
    [14] P. Tseng, On accelerated proximal gradient methods for convex-concave optimization, SIAM J. Optimiz., 2008.
    [15] Y. Nesterov, Smooth minimization of non-smooth functions, Math. Program., 103 (2005), 127–152. doi: 10.1007/s10107-004-0552-5
    [16] Y. Nesterov, Gradient methods for minimizing composite functions, Math. Program., 140 (2013), 125–161. doi: 10.1007/s10107-012-0629-5
    [17] G. H. Lan, An optimal method for stochastic composite optimization, Math. Program., 133 (2012), 365–397. doi: 10.1007/s10107-010-0434-y
    [18] S. Ghadimi, G. H. Lan, Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: A generic algorithmic framework, SIAM J. Optim., 22 (2012), 1469–1492. doi: 10.1137/110848864
    [19] S. Ghadimi, G. H. Lan, Stochastic first- and zeroth-order methods for nonconvex stochastic programming, SIAM J. Optim., 23 (2013), 2341–2368. doi: 10.1137/120880811
    [20] S. Ghadimi, G. H. Lan, H. C. Zhang, Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization, Math. Program., 155 (2016), 267–305. doi: 10.1007/s10107-014-0846-1
    [21] G. H. Lan, Bundle-level type methods uniformly optimal for smooth and nonsmooth convex optimization, Math. Program., 149 (2015), 1–45. doi: 10.1007/s10107-013-0737-x
    [22] S. Ghadimi, G. H. Lan, Accelerated gradient methods for nonconvex nonlinear and stochastic programming, Math. Program., 156 (2016), 59–99. doi: 10.1007/s10107-015-0871-8
    [23] L. Bottou, Large-scale machine learning with stochastic gradient descent, In: Proceedings of COMPSTAT'2010, Physica-Verlag HD, 2010, 177–186. doi: 10.1007/978-3-7908-2604-3_16
    [24] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv: 1412.6980, 2014.
    [25] Z. A. Zhu, Katyusha: The first direct acceleration of stochastic gradient methods, In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC 2017), New York, USA: Association for Computing Machinery, 2017. doi: 10.1145/3055399.3055448
  • © 2022 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)