
Expert Systems With Applications 206 (2022) 117719


Adaptive stochastic conjugate gradient for machine learning


Zhuang Yang ∗
School of Computer Science and Technology, Soochow University, Suzhou 215006, China
School of Electronics and Communication Engineering, Sun Yat-Sen University, Guangzhou 510275, China

ARTICLE INFO

Keywords: Stochastic conjugate gradient; Recursive iteration; Second-order methods; Sub sampling; Large scale learning

ABSTRACT

Due to their faster convergence rate than gradient descent algorithms and lower computational cost than second-order algorithms, conjugate gradient (CG) algorithms have been widely used in machine learning. This paper considers the conjugate gradient method in the mini-batch setting. Concretely, we propose a stable adaptive stochastic conjugate gradient (SCG) algorithm by incorporating both the stochastic recursive gradient algorithm (SARAH) and second-order information into a CG-type algorithm. Unlike most existing CG algorithms, which spend considerable time determining the step size by line search and may fail in stochastic optimization, the proposed algorithms use a local quadratic model to estimate the step size sequence without computing Hessian information, which makes the proposed algorithms attain a computational cost as low as that of first-order algorithms. We establish the linear convergence rate of a class of SCG algorithms when the loss function is strongly convex. Moreover, we show that the complexity of the proposed algorithm matches that of modern stochastic optimization algorithms. As a by-product, we develop a practical variant of the proposed algorithm by setting a stopping criterion for the number of inner loop iterations. Various numerical experiments on machine learning problems demonstrate the efficiency of the proposed algorithms.

1. Introduction

Many large-scale machine learning problems boil down to the following composite optimization problem:

min_{w∈R^d} F(w) = (1/n) Σ_{i=1}^{n} f_i(w) + R(w),   (1)

where f(w) = (1/n) Σ_{i=1}^{n} f_i(w) defines the loss function, R(w) usually defines a regularizer term and n represents the number of samples. Instances of f_i(w) include f_i(w) = max(0, 1 − y_i w^T x_i) for the hinge loss, f_i(w) = (1/2)(y_i − w^T x_i)^2 for the squared loss, or f_i(w) = log(1 + exp(−y_i x_i^T w)) for the logistic loss, where {x_i, y_i} denotes the feature vector and label of sample i, respectively. Typical selections of R(w) include R(w) = λ‖w‖_1 (ℓ1-norm regularizer), R(w) = (λ/2)‖w‖^2 (ℓ2-norm regularizer), and R(w) = λ_1‖w‖_1 + (λ_2/2)‖w‖^2 (elastic-net regularizer). The structure of Eq. (1) can also be found in sparse learning (Zhang, Ghanem, Liu, & Ahuja, 2013), deep learning (Guo, Ye, Xiao, & Zhu, 2020), and non-negative matrix factorization (Hoyer, 2004), to name a few.

1.1. Stochastic first-order algorithm

Stochastic first-order (SFO) algorithms are greatly popularized in solving Problem (1). The stochastic gradient descent (SGD) algorithm and related algorithms such as ADAM (based on adaptive estimates of lower-order moments) (Kingma & Ba, 2014), RMSprop (adapting the learning rate per weight according to the observed sign alteration in the gradients) (Tieleman & Hinton, 2017), and the adaptive gradient (AdaGrad) algorithm (Duchi, Hazan, & Singer, 2011) work with a single or a batch of training samples in each iterative step and proceed along a descent direction. In contrast to deterministic first-order (DFO) algorithms, which work with the full gradient of Problem (1) and meet challenges when n is very large, SFO algorithms are preferred since they have a lower computational cost by using only a small part of the samples in large-scale learning problems. Although practical and effective, SGD-type algorithms often suffer from a slow convergence speed in many applications. One of the major reasons is that noisy gradients give the SGD-type algorithms high variance. Usually, to guarantee convergence, an SGD-type algorithm works with a decreasing step size sequence that does not decrease too quickly, i.e., the step size is proportional to 1/T^α with 1/2 < α ≤ 1, where T denotes the number of iterations. Specifically, the SGD-type algorithm converges with a rate of O(1/T) for strongly convex objectives and O(1/√T) for convex objectives. However, even with a carefully selected step size, neither the performance nor the convergence properties of SGD-type algorithms improve significantly.
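To make the composite objective (1) concrete, the following minimal sketch (our illustration, not part of the original paper; the function names are hypothetical) evaluates the ℓ2-regularized logistic regression instance of Problem (1) and its gradient, assuming NumPy and a dataset (X, y) with labels in {−1, +1}.

```python
import numpy as np

def logistic_objective(w, X, y, lam):
    """F(w) = (1/n) * sum_i log(1 + exp(-y_i x_i^T w)) + (lam/2) * ||w||^2."""
    margins = y * (X @ w)                        # y_i x_i^T w for every sample
    loss = np.mean(np.logaddexp(0.0, -margins))  # numerically stable logistic loss
    reg = 0.5 * lam * np.dot(w, w)               # l2-norm regularizer R(w)
    return loss + reg

def logistic_gradient(w, X, y, lam, idx=None):
    """Full gradient of F, or a mini-batch gradient when idx selects a subset S."""
    if idx is not None:
        X, y = X[idx], y[idx]
    margins = y * (X @ w)
    coeff = -y / (1.0 + np.exp(margins))         # derivative of each logistic loss term
    return X.T @ coeff / len(y) + lam * w
```

This pair of helpers is reused in the later sketches to stand in for F(w) and its (mini-batch) gradient.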

∗ Corresponding author at: School of Computer Science and Technology, Soochow University, Suzhou 215006, China.
E-mail address: [email protected].

https://doi.org/10.1016/j.eswa.2022.117719
Received 30 April 2021; Received in revised form 11 March 2022; Accepted 31 May 2022
Available online 9 June 2022
0957-4174/© 2022 Elsevier Ltd. All rights reserved.

To improve the convergence speed of the SGD-type algorithms, several strategies have shown potential, including the following.

Importance sampling: Needell, Srebro, and Ward (2016) showed how importance sampling can further improve the convergence speed of SGD. Johnson and Guestrin (2018) developed a robust, approximate importance sampling (RAIS) scheme for SGD, which provides much of the advantage of exact importance sampling with greatly reduced overhead by approximating the ideal sampling distribution using robust optimization. Csiba and Richtárik (2018) proposed the first importance sampling scheme for the mini-batch setting and provided a simple and rigorous complexity analysis of its performance. Based on safe bounds on the gradient, Stich, Raj, and Jaggi (2017) proposed a valid approximation of gradient-based sampling, which can easily be incorporated into existing optimization algorithms.

Control variate: The control variate is a canonical strategy for reducing the variance of a stochastic quantity without introducing bias. Baker, Fearnhead, Fox, and Nemeth (2019) presented an alternative log-posterior gradient estimator for stochastic gradient Markov chain Monte Carlo (SGMCMC), using the control variate to reduce the variance. Wang, Chen, Smola, and Xing (2013) proposed a general approach of employing control variates for variance reduction in SGD. Gower, Le Roux, and Bach (2018) improved variance-reduced stochastic algorithms by using better control variates. Notice that several modern stochastic optimization algorithms, e.g., the stochastic variance reduced gradient (SVRG) method (Johnson & Zhang, 2013) and SAGA (Defazio, Bach, & Lacoste-Julien, 2014), can be viewed as using the control variate technique. The SVRG-type method and the stochastic recursive gradient algorithm (SARAH) mentioned below (proposed by Nguyen, Liu, Scheinberg, & Takáč, 2017) first run a deterministic step (viewed as the outer loop), followed by a large number of stochastic steps (viewed as the inner loop). The conventional SAGA approach keeps two moving quantities when solving Eq. (1): the current iterate w and a table of past gradients. Following the work in Gower et al. (2018), Yang (2021c) introduced a modified form of the implicit gradient transport (MIGT) approach into SVRG, leading to a novel algorithm using the control variate: SVRG-MIGT.

Momentum scheme: The momentum scheme was originally proposed for accelerating gradient optimization algorithms to obtain an optimal convergence rate in convex optimization. Common choices for the momentum scheme include heavy ball (HB) momentum, Nesterov's accelerated gradient (NAG), and quasi-hyperbolic momentum (QHM). Gitman, Lang, Zhang, and Xiao (2019) discussed the role of momentum in SGD in detail. Specifically, they used the general formulation of QHM to give a unified analysis of several popular algorithms, covering their stability regions, asymptotic convergence conditions, and the properties of their stationary distributions. Recently, Li, Fang, and Lin (2020) provided a comprehensive survey of stochastic optimization algorithms with different momentum schemes for achieving accelerated algorithms. Under a strong growth condition, Vaswani, Bach, and Schmidt (2019) showed that fixed step size SGD with Nesterov acceleration matches the convergence speed of the deterministic accelerated algorithm for both convex and strongly convex cases.

Beyond the above-mentioned techniques, there exist several other techniques for improving stochastic optimization algorithms. For example, one can average the iterates to improve the performance of SGD-type algorithms; see the work in Polyak and Juditsky (1992), where the authors developed a new recursive SGD method with averaging of trajectories and proved that the proposed method achieves the highest possible speed of convergence. Several works proposed using normalized updates to speed up the convergence of stochastic optimization algorithms, where the model parameter is updated by a normalized gradient per iteration, see, e.g., Cui, Wu, and Huang (2020). Another common technique for accelerating stochastic optimization algorithms is an adaptive step size sequence, popularized by AdaGrad and ADAM. In addition, it has been shown that mini-batch training is an effective way to accelerate the convergence speed of stochastic optimization algorithms, where the incremental iteration is executed with an average gradient over several data points at a time, rather than a single data point, see, e.g., Li, Zhang, Chen, and Smola (2014).

1.2. Stochastic second-order algorithm

The Newton or quasi-Newton algorithms working with second-order information attain better performance than first-order algorithms. In the past few years, stochastic quasi-Newton algorithms have flourished, achieving solutions more efficiently than SFO algorithms by using approximate second-order information. For instance, Schraudolph, Yu, and Günter (2007) proposed stochastic variants of the well-known BFGS quasi-Newton optimization algorithm for online optimization under the convex setting. Byrd, Hansen, Nocedal, and Singer (2016) proposed an efficient, robust, stable stochastic quasi-Newton algorithm by utilizing the classical BFGS update scheme in its limited memory form for large-scale optimization. Bordes, Bottou, and Gallinari (2009) used SGD with a diagonal rescaling matrix based on the secant condition associated with quasi-Newton algorithms. Agarwal, Bullins, and Hazan (2017) developed second-order stochastic algorithms for machine learning problems that match the per-iteration cost of gradient-based algorithms, and in certain settings improve the total running time over the state of the art. Mokhtari and Ribeiro (2020) discussed recent developments that accelerate the convergence speed of stochastic optimization algorithms by exploiting second-order information. Gower, Goldfarb, and Richtárik (2016) developed a limited-memory stochastic block BFGS update for introducing enriched second-order information into stochastic optimization algorithms. Also, Moritz, Nishihara, and Jordan (2016) proposed a linearly-convergent stochastic L-BFGS for the strongly convex and smooth case. Wang, Wang, and Yuan (2019) developed a generic framework for stochastic proximal quasi-Newton (SPQN) algorithms for solving non-convex composite optimization problems.

Notice that the acceleration techniques used in SFO algorithms are also adopted in stochastic second-order optimization algorithms. For instance, Yasuda, Mahboubi, Indrapriyadarsini, Ninomiya, and Asai (2019) proposed a stochastic variance reduced Nesterov's accelerated quasi-Newton algorithm for large-scale learning problems by introducing the SVRG gradient estimator and NAG into the stochastic quasi-Newton algorithm. Zhou, Gao, and Goldfarb (2017) proposed a stochastic adaptive quasi-Newton algorithm for minimizing expected values, which employs adaptive step sizes and eliminates the requirement for the user to enter a step size manually. For use in stochastic optimization algorithms, Wills and Schön (2021) presented a novel quasi-Newton algorithm by taking a highly flexible model for the curvature information and inferring its value from noisy gradients. Also, they adopted standard line-search procedures in their proposed algorithm.

1.3. Stochastic conjugate gradient

The conjugate gradient (CG) algorithm, enjoying a faster convergence speed than gradient descent algorithms and a lower computational cost than second-order algorithms, was originally proposed in the pioneering work of Hestenes and Stiefel for solving linear systems (Hestenes & Stiefel, 1952), known as the Hestenes–Stiefel (HS) method. For minimizing a smooth nonlinear function, Fletcher and Reeves (1964) developed a nonlinear CG algorithm, known as the Fletcher–Reeves (FR) method. Gilbert and Nocedal (1992) considered the convergence of several well-known CG algorithms, e.g., FR and the Polak–Ribière (PR) algorithm (Polak & Ribière, 1969), under various line search strategies. We will see the details of these CG-type algorithms later. Dai and Yuan (1999) presented a novel CG algorithm,


which converged globally, provided the line search made the stan- 2. Related work
dard Wolfe conditions hold. Gao et al. (2020) proposed an improved
CG algorithm with a generalized Armijo search technique to train a Algorithms employing second-order information for stochastic opti-
recalling-enhanced current neural network (RERNN) and the proposed mization have been actively researched in recent years, see, e.g., Byrd,
algorithm obtained a better performance. Chin, Neveitt, and Nocedal (2011), Wang and Zhang (2019). Adjust-
The CG techniques are not designed to deal with noisy gradient
ing the step size schedule in stochastic optimization algorithms is a
and curvature information, which makes CG diverge in the stochas-
meaningful unresolved problem which needs to tune in practice.
tic setting. However, empirical loss minimization problems are often
Adaptive stochastic optimization algorithms such as AdaGrad,
using noisy-gradient measurement obtained on small, random sam-
ples of datasets. This is the major reason why we need to discuss ADAM, RMSprop, AMSGRAD (Reddi, Kale, & Kumar, 2018) and
the CG techniques in the stochastic setting. Recent years had pre- Adadelta (Zeiler, 2012) have been preferred for achieving a rapid
sented several stochastic conjugate gradient (SCG) algorithms. For training process with an element-wise scaling terms of choosing the
instance, Schraudolph and Graepel (2002) explored ideas from CG in step size. However, they are observed to have a poor generalization
the stochastic setting, taking fast Hessian-gradient products to gen- compared with SGD, or even fail to converge to the optimal due
erate low-dimensional Krylov subspaces within individual mini-batch to unstable and extreme step sizes. Several approaches have been
samples. Jiang and Wilford (2012) proposed a new SCG algorithm, proposed to improve the performance of these adaptive stochastic
which avoided evaluating and storing the covariance matrix in the optimization, see, e.g., Shazeer and Stern (2018).
normal equations for the least square solution. Byrd, Chin, Nocedal, and The Barzilai–Borwein (BB) type approach, originally developed in
Wu (2012) developed a Newton-CG algorithm that employed varying the pioneer work of Barzilai and Borwein (1988)for tackling the uncon-
sample sizes for the evaluation of the function and gradient as well strained minimization problem, has been widely studied in stochastic
as for the incorporation of curvature information. Jin, Zhang, Huang, optimization algorithms to automatically determine the step size se-
and Geng (2019) proposed a novel SCG algorithm with the SVRG quence because of its efficacy and simplicity, see, e.g., Ma et al.
gradient estimator and Wolfe conditions, called conjugate gradient with (2018), Yang (2021a) and Yang, Wang, Zang, and Li (2018a), Yang,
variance reduction (CGVR). Also, they proved that CGVR has a linear Wang, Zhang, and Li (2018b). The Polyak step size, originally used in
convergence rate with the FR algorithm for strongly convex and smooth the subgradient algorithm, is also applied into stochastic optimization
cases.
algorithms. Moreover, Loizou, Vaswani, Laradji, and Lacoste-Julien
(2021) showed that the Polyak step size can improve the convergence
1.4. Our contributions
speed of stochastic optimization algorithms. More recently, Baydin,
In this paper, we develop a stable adaptive SCG algorithm via Cornish, Rubio, Schmidt, and Wood (2018) and Yang, Wang, Zhang,
using the SARAH gradient estimator and second-order information. and Li (2019) proposed using the Hypergradient Descent (HD) method
Specifically, the main contributions of this work are summarized as to compute the step size for stochastic optimization algorithms.
follows:
3. Preliminary
(i) Different from existing CG algorithms, spending a lot of time
in hunting the step size sequence by using a line search tech-
nique which may fail in stochastic optimization algorithms, the Throughout this paper, we define the Euclidean norm (also known
proposed algorithm uses a local quadratic model to estimate the as 𝓁2 -norm) by the symbol, ‖⋅‖, and define 𝓁1 -norm by the symbol, ‖⋅‖1 .
step size sequence, but does not require computing the curva- We define the gradient of the function, 𝐹 (𝑤), by ∇𝐹 (𝑤). In addition, we
ture information, which makes the proposed algorithm have low take R𝑑 to present a set of 𝑑-dimension vectors. We take the symbol,
computational cost as first-order algorithms. ⟨⋅, ⋅⟩, to define the inner product in the Euclidean space. We note E[𝑤]
(ii) Unlike the work in Castera, Bolte, Févotte, and Pauwels (2021) the expectation of a random variable, 𝑤.
that incorporated an adaptive step size into the SGD algorithm, To finish the theoretical analysis of the proposed algorithm, the
we study the performance of the SCG-type algorithm with an following assumptions are provided.
adaptive step size. Besides, compared to the work in Castera et al.
(2021), this paper uses a different iterative scheme to determine Assumption 1. The gradient of each individual function, 𝑓𝑖 (𝑤), in (1)
the step size sequence. is 𝐿-Lipschitz, i.e., for any 𝑤, 𝑣 ∈ R𝑑
(iii) We theoretically analyze the convergence properties of the pro-
posed algorithm in the strongly convex setting, which shows ‖∇𝑓𝑖 (𝑤) − ∇𝑓𝑖 (𝑣)‖ ≤ 𝐿‖𝑤 − 𝑣‖. (2)
that the proposed algorithm converges sublinearly within one
outer loop and converges linearly within multiple outer loops. Assumption 1 implies that the gradient of function, 𝐹 (𝑤), is 𝐿-
Moreover, we show that the complexity of the proposed algorithm Lipschitz as well. Moreover, from (2), one can obtain the following
matches state-of-the-art stochastic optimization algorithms. crucial inequalities:
(iv) As a by-product, we propose a practical variant of the pro- 𝐿
𝐹 (𝑤) ≤ 𝐹 (𝑣) + ⟨∇𝐹 (𝑣), 𝑤 − 𝑣⟩ + ‖𝑤 − 𝑣‖2 , (3)
posed algorithm, providing an automatic and adaptive selection 2
of the number of inner loop iterations, which further reduces the 1
⟨∇𝐹 (𝑤) − ∇𝐹 (𝑣), 𝑤 − 𝑣⟩ ≥ ‖∇𝐹 (𝑤) − ∇𝐹 (𝑣)‖2 . (4)
difficulty in selecting the crucial parameter. A large panel of nu- 𝐿
merical experiments on machine learning problems demonstrates
the efficiency of the proposed algorithms. Assumption 2. The loss function 𝐹 (𝑤) is continuously differentiable
and 𝜇-strongly convex, i.e., there exist a positive constant, 𝜇, such that
The rest of this paper is organized as follows. Section 2 presents for all, 𝑤, 𝑣 ∈ R𝑑 ,
several related works. Section 3 provides preliminary information com-
monly used in stochastic optimization. Section 4 presents our first ⟨∇𝐹 (𝑤) − ∇𝐹 (𝑣), 𝑤 − 𝑣⟩ ≥ 𝜇‖𝑤 − 𝑣‖2 . (5)
algorithm and provides its convergence analysis. Section 5 develops
a practical variant of the proposed algorithm. Section 6 contains nu- Actually, the strong convexity of 𝐹 (𝑤) indicates that
merical experiments for the proposed algorithms applied to the 𝓁2 - 2𝜇[𝐹 (𝑤) − 𝐹 (𝑤∗ )] ≤ ‖∇𝐹 (𝑤)‖2 , ∀𝑤 ∈ R𝑑 . (6)
regularized logistic regression problem. Section 7 provides a conclusion
of the paper. where we define 𝑤∗ = arg min𝑤 𝐹 (𝑤).
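As a concrete illustration of Assumptions 1 and 2 (ours, not from the paper): for the ℓ2-regularized logistic regression objective used later in Section 6, each logistic term plus the regularizer has an L-Lipschitz gradient with L ≤ ‖x_i‖²/4 + λ, and the full objective is μ-strongly convex with μ = λ. The sketch below estimates these constants from data, assuming NumPy.

```python
import numpy as np

def smoothness_and_convexity_constants(X, lam):
    """Conservative L and mu for (1/n) sum_i log(1+exp(-y_i x_i^T w)) + (lam/2)||w||^2.

    Each logistic term has Hessian sigma*(1-sigma)*x_i x_i^T with sigma*(1-sigma) <= 1/4,
    so grad f_i is Lipschitz with constant ||x_i||^2 / 4 + lam (Assumption 1), while the
    l2 regularizer makes the full objective lam-strongly convex (Assumption 2).
    """
    row_norms_sq = np.einsum('ij,ij->i', X, X)   # ||x_i||^2 for every row of X
    L = row_norms_sq.max() / 4.0 + lam
    mu = lam
    return L, mu
```

With these values, the condition number κ = L/μ appearing in the complexity comparison of Section 4.4 can be computed directly.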


Assumption 3. There exists a constant β̂ < 1 such that

β_{k+1} = ‖v_{k+1}‖² / ‖v_k‖² ≤ β̂,   (7)

where v_k denotes the SARAH gradient estimator in the mini-batch setting, i.e., v_k = ∇F_S(w_k) − ∇F_S(w_{k−1}) + v_{k−1}, where ∇F_S(w_k) = (1/b) Σ_{i∈S} ∇f_i(w_k), ∇F_S(w_{k−1}) = (1/b) Σ_{i∈S} ∇f_i(w_{k−1}), and S ⊂ {1, …, n} with b = |S|. In addition, the sequence {w_k} denotes the solution sequence.

To understand the theoretical results well, the following definition is provided.

Definition 1. A stochastic gradient algorithm attains ε-accuracy in k iterations if E[‖∇F(w_k)‖²] ≤ ε, where the expectation is over the stochasticity of the algorithm.

4. Stochastic conjugate gradient with second-order step size tuning and SARAH technique

On one hand, the proposed algorithm adopts the conjugate gradient algorithm with the SARAH gradient estimator. On the other hand, instead of using a line search to obtain the step size, the proposed algorithm uses second-order step size tuning. Therefore, for clarity, we briefly describe the SARAH technique in Section 4.1 and second-order step size tuning in Section 4.2, respectively. Our first SCG-type algorithm is presented in Section 4.3. In addition, for simplicity, the proposed algorithm is referred to as CG-SARAH-SO.

4.1. The SARAH gradient estimator

SARAH was proposed by Nguyen et al. (2017) for dealing with machine learning problems. In particular, we present it in Algorithm 1.

Algorithm 1 SARAH
Input: the step size η > 0, the initial point w̃_0 and the inner loop size m.
for s = 1 to Ŝ do
  w_0 = w̃_{s−1}
  v_0 = ∇F(w_0)
  w_1 = w_0 − η v_0
  for k = 1 to m − 1 do
    Randomly pick i ∈ {1, …, n} and update the weight:
      v_k = ∇f_i(w_k) − ∇f_i(w_{k−1}) + v_{k−1}   (8)
      w_{k+1} = w_k − η v_k
  end for
  w̃_s = w_k, where k is chosen uniformly at random from {0, 1, …, m}.
end for

SARAH can be viewed as a variant of SVRG. For clarity, we briefly discuss SVRG. SVRG was originally proposed by Johnson and Zhang (2013) for smooth and strongly convex functions, where the crucial step is the following update scheme for the stochastic gradient estimate:

v_k = ∇f_i(w_k) − ∇f_i(w̃) + v_0,   (9)

where v_0 = ∇F(w̃) and w̃ denotes a snapshot point. Notice that the gradient ∇F(w̃) has been obtained in a past iteration. This is followed by the stochastic update scheme

w_{k+1} = w_k − η v_k.   (10)

Combining Eq. (9) with the facts that E[∇f_i(w̃)] = ∇F(w̃) and v_0 = ∇F(w̃), we obtain E[v_k] = ∇F(w_k) − ∇F(w̃) + ∇F(w̃) = ∇F(w_k), which demonstrates that v_k is an unbiased estimate of the gradient. However, in SARAH (as seen from (8)), we do not have E[v_k] = ∇F(w_k). Hence, we can say that SARAH is a biased stochastic optimization algorithm.

In this work, we focus on stochastic optimization algorithms in the mini-batch setting. Therefore, the update scheme of the SARAH gradient estimator v_k is rewritten as

v_k = ∇F_S(w_k) − ∇F_S(w_{k−1}) + v_{k−1},   (11)

where ∇F_S(w_k) = (1/b) Σ_{i∈S} ∇f_i(w_k), ∇F_S(w_{k−1}) = (1/b) Σ_{i∈S} ∇f_i(w_{k−1}), and S ⊂ {1, …, n} with b = |S|. Notice that the mini-batch version of SARAH can be found in Nguyen, Scheinberg, and Takáč (2021), Yang (2021b) and Yang, Chen, and Wang (2021).

4.2. Second-order infinitesimal step size tuning

The idea of second-order step size tuning of SGD for non-convex cases was proposed by Castera et al. (2021). Let us assume that F(w) is a twice-differentiable function. Given an update direction p ∈ R^d, a common strategy is to select η ∈ R minimizing F(w + ηp). Let us approximate η ↦ F(w + ηp) around zero with a Taylor expansion, i.e.,

q_p(η) := F(w) + η⟨∇F(w), p⟩ + (η²/2)⟨∇²F(w)p, p⟩.   (12)

If the second-order term ⟨∇²F(w)p, p⟩ is positive, then q_p has a unique minimizer at the point

η* = − ⟨∇F(w), p⟩ / ⟨∇²F(w)p, p⟩.   (13)

Further, when setting the update direction p = −∇F(w), we obtain

η(w) = ‖∇F(w)‖² / ⟨∇²F(w)∇F(w), ∇F(w)⟩.   (14)

Generally, we write the traditional gradient descent update scheme w_{k+1} = w_k − η_k ∇F(w_k) with η_k = η(w_k) when η(w_k) > 0. Further, we take a step size η_k such that η_k ≃ η(w_{k−1}). Hence, let us suppose that, for k ≥ 1, w_{k−1} and η_{k−1} are known. Also, let us approximate the quantity

η(w_{k−1}) = ‖∇F(w_{k−1})‖² / ⟨∇²F(w_{k−1})∇F(w_{k−1}), ∇F(w_{k−1})⟩,   (15)

using only first-order information. Relying on the following two identities:

s_k = w_k − w_{k−1} = −η_{k−1} ∇F(w_{k−1}),   (16)
y_k = ∇F(w_k) − ∇F(w_{k−1}) ≃ −η_{k−1} C_F(w_{k−1}),   (17)

where C_F(w) := ∇²F(w)∇F(w) and (17) is obtained by using Taylor's formula, the above descriptions lead to the following step size:

η_k = ‖s_k‖² / ⟨s_k, y_k⟩.   (18)

The formula (18) can be viewed as the BB method. For clarity, we provide a brief description of the BB method. To tackle Problem (1), the iterative scheme of the BB method is formulated as

w_{k+1} = w_k − η_k ∇F(w_k),

where η_k is obtained by minimizing the residual of the secant equation underlying the quasi-Newton approach, i.e., ‖(1/η_k)s_k − y_k‖², leading to the step size η_k = ‖s_k‖²/⟨s_k, y_k⟩. Obviously, this way of selecting the step size in the BB method is the same as Eq. (18); the two simply use different perspectives to determine the step size. In addition, utilizing symmetry and minimizing ‖s_k − η_k y_k‖², Barzilai and Borwein (1988) proposed another BB-like step size:

η_k = ⟨s_k, y_k⟩ / ‖y_k‖².   (19)
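A minimal sketch (ours; the function and variable names are hypothetical) of the two ingredients just described: the mini-batch SARAH recursion (11) and the two BB-type step sizes (18)-(19) computed from the displacement s_k and the gradient difference y_k. It reuses logistic_gradient from the earlier sketch.

```python
import numpy as np

def sarah_direction(w_k, w_prev, v_prev, X, y, lam, b, rng):
    """Mini-batch SARAH recursion (11): v_k = grad_S(w_k) - grad_S(w_{k-1}) + v_{k-1}."""
    idx = rng.choice(len(y), size=b, replace=False)      # mini-batch S of size b
    g_new = logistic_gradient(w_k, X, y, lam, idx)
    g_old = logistic_gradient(w_prev, X, y, lam, idx)    # same batch at both iterates
    return g_new - g_old + v_prev

def bb_step_sizes(w_k, w_prev, g_k, g_prev):
    """The two BB-type step sizes (18) and (19), built from s_k and y_k."""
    s = w_k - w_prev
    y_diff = g_k - g_prev
    eta_18 = np.dot(s, s) / np.dot(s, y_diff)            # ||s_k||^2 / <s_k, y_k>
    eta_19 = np.dot(s, y_diff) / np.dot(y_diff, y_diff)  # <s_k, y_k> / ||y_k||^2
    return eta_18, eta_19
```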


In practice, the formula (18) is widely used in deterministic and stochastic optimization algorithms, since it is believed to achieve better performance than the formula (19). Actually, when setting the update direction p = −(∇²F(w))^{−1}∇²F(w)∇F(w), we obtain the update scheme (19) from Eq. (15). In this work, we only consider the performance of the proposed algorithms under the second BB-type step size (i.e., (19)), which is not commonly used in stochastic optimization algorithms. Although we do not discuss the performance of the proposed algorithms with the formula (18), we can easily introduce the update scheme (18) into the proposed algorithms.

Similar to the work in Castera et al. (2021), we also consider using batch samples to approximate the curvature information. In detail, the update information y_k is replaced by

ŷ_k = ∇F_{S_H}(w_k) − ∇F_{S_H}(w_{k−1}) ≃ −η_{k−1} C_{F_{S_H}}(w_{k−1}),   (20)

where C_{F_{S_H}}(w) := ∇²F_{S_H}(w)∇F_{S_H}(w) and S_H ⊂ {1, 2, …, n} with b_H = |S_H|. Note that we denote ∇F_{S_H}(w) = (1/b_H) Σ_{i∈S_H} ∇f_i(w), which means that ∇F_{S_H}(w_k) = (1/b_H) Σ_{i∈S_H} ∇f_i(w_k) and ∇F_{S_H}(w_{k−1}) = (1/b_H) Σ_{i∈S_H} ∇f_i(w_{k−1}).

Hence, the update schemes (18) and (19) are rewritten as

η̂_k = (1/b_H) ⋅ ‖s_k‖² / ⟨s_k, ŷ_k⟩,   (21)
η̃_k = (1/b_H) ⋅ ⟨s_k, ŷ_k⟩ / ‖ŷ_k‖².   (22)

In addition, in this paper, to control the convergence speed of the proposed algorithm, we introduce a parameter γ > 0 into the second-order step size tuning. Hence, we obtain

η̂′_k = (γ/b_H) ⋅ ‖s_k‖² / ⟨s_k, ŷ_k⟩,   (23)
η′_k = (γ/b_H) ⋅ ⟨s_k, ŷ_k⟩ / ‖ŷ_k‖².   (24)

Note that in Castera et al. (2021), the authors, using second-order step size tuning, set η_k = ν (ν > 0) when the curvature term ⟨∇²F(w)p, p⟩ is not positive. The authors also show that this strategy makes their algorithm use the curvature information more fully.

4.3. CG-SARAH-SO

Since the proposed algorithm also uses the CG-type iterative scheme, we briefly describe the CG algorithm for solving Problem (1). The update schemes of CG generally consist of

w_k = w_{k−1} + η_{k−1} p_{k−1},   (25)
p_k = −∇F(w_k) + β_{k−1} p_{k−1},   (26)

where we often set p_0 = −∇F(w_0) in practice.

Popular choices for β_k are the FR, PR and HS formulas, given by:

β_k^{FR} = ‖∇F(w_k)‖² / ‖∇F(w_{k−1})‖²,   (27)
β_k^{PR} = ⟨∇F(w_k), ∇F(w_k) − ∇F(w_{k−1})⟩ / ⟨∇F(w_{k−1}), ∇F(w_{k−1})⟩,   (28)
β_k^{HS} = ⟨∇F(w_k), ∇F(w_k) − ∇F(w_{k−1})⟩ / ⟨p_{k−1}, ∇F(w_k) − ∇F(w_{k−1})⟩.   (29)

In this paper, our main focus is the FR formula (i.e., (27)).

According to the above descriptions of CG, SARAH and second-order step size tuning, we now present our first algorithm, named CG-SARAH-SO, in Algorithm 2.

Algorithm 2 CG-SARAH-SO
Input: initial point w̃_0, epoch length m, mini-batch sizes b and b_H, initial step size η′_0, constant γ > 0
Given z_0 = ∇F(w̃_0)
for s = 1, 2, … do
  w_0 = w̃_{s−1}
  v_0 = ∇F(w_0)
  u_0 = z_{s−1}
  d_0 = −u_0
  for k = 1 to m do
    w_k = w_{k−1} + η′_{k−1} d_{k−1}
    Choose a mini-batch S ⊂ {1, …, n} of size b uniformly at random and compute
      v_k = ∇F_S(w_k) − ∇F_S(w_{k−1}) + v_{k−1}
    Compute β_k by
      β_k^{FR} = ‖v_k‖² / ‖v_{k−1}‖²
    d_k = −v_k + β_k d_{k−1}
    Choose a mini-batch S_H ⊂ {1, …, n} of size b_H uniformly at random and compute the step size
      η′_k = (γ/b_H) ⋅ |⟨w_k − w_{k−1}, ∇F_{S_H}(w_k) − ∇F_{S_H}(w_{k−1})⟩| / ‖∇F_{S_H}(w_k) − ∇F_{S_H}(w_{k−1})‖²
  end for
  z_s = v_m
  w̃_s = w_m
end for

A few remarks about CG-SARAH-SO are provided in the following; a code sketch of Algorithm 2 is given after these remarks.

(i) As pointed out above, there are two different ways to determine the step size sequence, i.e., (23) and (24). In this paper we only study the performance of the proposed algorithm with the update scheme (24), which is not often used in stochastic optimization algorithms. Actually, most existing studies only discuss the performance of stochastic optimization algorithms using the update scheme (23) to determine the step size sequence. It seems that the update scheme (23) more easily achieves better performance than the update scheme (24) in stochastic or deterministic settings.
(ii) Unlike the work in Castera et al. (2021), which sets η_k = ν (ν > 0) when the curvature term ⟨∇²F(w)p, p⟩ is not positive, we take the absolute value in the update scheme (24). This is another difference between the work in Castera et al. (2021) and ours. Moreover, it seems that the parameter γ in the update scheme (24) lets the proposed algorithm work with a better Hessian approximation than conventional BB-type or quasi-Newton algorithms.
(iii) Although this work only considers and analyzes the performance of the proposed algorithm using the FR formula, we can easily introduce other CG formulas, e.g., PR and HS, into our CG-SARAH-SO algorithm.
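The following sketch (ours, not the authors' reference code; it reuses the hypothetical logistic_objective/logistic_gradient helpers above) implements one reading of Algorithm 2 for the ℓ2-regularized logistic regression problem, with the FR coefficient and the absolute-value step size (24).

```python
import numpy as np

def cg_sarah_so(X, y, lam, m, b, b_H, gamma, eta0, epochs, seed=0):
    """Sketch of CG-SARAH-SO (Algorithm 2) for l2-regularized logistic regression."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w_tilde = np.zeros(d)
    z = logistic_gradient(w_tilde, X, y, lam)           # z_0 = grad F(w_tilde_0)
    for s in range(epochs):
        w_prev = w_tilde.copy()
        v = logistic_gradient(w_prev, X, y, lam)        # v_0 = full gradient
        d_dir = -z                                      # d_0 = -u_0 with u_0 = z_{s-1}
        eta = eta0                                      # eta'_0
        for k in range(1, m + 1):
            w = w_prev + eta * d_dir                    # w_k = w_{k-1} + eta'_{k-1} d_{k-1}
            # SARAH estimator on a fresh mini-batch S of size b
            S = rng.choice(n, size=b, replace=False)
            v_new = (logistic_gradient(w, X, y, lam, S)
                     - logistic_gradient(w_prev, X, y, lam, S) + v)
            beta = np.dot(v_new, v_new) / np.dot(v, v)  # FR coefficient (27) with v_k
            d_dir = -v_new + beta * d_dir               # d_k = -v_k + beta_k d_{k-1}
            # second-order step size (24), with absolute value, on an independent batch S_H
            S_H = rng.choice(n, size=b_H, replace=False)
            y_hat = (logistic_gradient(w, X, y, lam, S_H)
                     - logistic_gradient(w_prev, X, y, lam, S_H))
            eta = (gamma / b_H) * abs(np.dot(w - w_prev, y_hat)) / np.dot(y_hat, y_hat)
            w_prev, v = w, v_new
        z, w_tilde = v, w_prev                          # z_s = v_m, w_tilde_s = w_m
    return w_tilde
```

A call such as cg_sarah_so(X, y, lam=1e-2, m=len(y)//(200+400), b=200, b_H=400, gamma=1.0, eta0=0.1, epochs=20) mirrors parameter choices of the kind reported in Section 6; these particular values are illustrative.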
and our work. Moreover, it seems that the parameter, 𝛾, in the
we briefly describe the CG algorithm for solving Problem (1).
update scheme (24) keeps the proposed algorithm work with
The update schemes of CG generally consist of
a better Hessian approximation than conventional BB-type or
𝑤𝑘 = 𝑤𝑘−1 + 𝜂𝑘−1 𝑝𝑘−1 , (25) quasi-Newton algorithms.
𝑝𝑘 = −∇𝐹 (𝑤𝑘 ) + 𝛽𝑘−1 𝑝𝑘−1 , (26) (iii) Although this work only considers and analyzes the performance
of the proposed algorithm using the FR algorithm, we can eas-
where we often set 𝑝0 = −∇𝐹 (𝑤0 ) in practice. ily introduce other CG algorithms, e.g., PR and HS, into our
Popular choices for 𝛽𝑘 are the FR, PR, HS formulas, and are provided CG-SARAH-SO algorithm.
by:
‖∇𝐹 (𝑤𝑘 )‖2 4.4. Convergence analysis for CG-SARAH-SO
𝛽𝑘𝐹 𝑅 = , (27)
‖∇𝐹 (𝑤𝑘−1 )‖2
Before providing our main theoretical results, we start from the
⟨∇𝐹 (𝑤𝑘 ), ∇𝐹 (𝑤𝑘 ) − ∇𝐹 (𝑤𝑘−1 )⟩
𝛽𝑘𝑃 𝑅 = , (28) following lemmas.
⟨∇𝐹 (𝑤𝑘−1 ), ∇𝐹 (𝑤𝑘−1 )⟩ The bound of the step size sequence is provided in Lemma 1,
⟨∇𝐹 (𝑤𝑘 ), ∇𝐹 (𝑤𝑘 ) − ∇𝐹 (𝑤𝑘−1 )⟩ computed by (23) and (24).
𝛽𝑘𝐻𝑆 = . (29)
⟨𝑝𝑘−1 , ∇𝐹 (𝑤𝑘 ) − ∇𝐹 (𝑤𝑘−1 )⟩
In this paper, our main focus is the FR formula (a.k.a. (27)). Lemma 1. Suppose that Assumptions 1 and 2 hold, then according to (23)
According to the above descriptions of CG, SARAH and second- and (24), we have
order step size tuning, we now present our first algorithm named as 𝛾 𝛾
≤ 𝜂𝑘′ ≤ 𝜂̂𝑘′ ≤ . (30)
CG-SARAH-SO in Algorithm 2. 𝑏𝐻 𝐿 𝑏𝐻 𝜇
A few remarks about CG-SARAH-SO are provided in the following:
Proof. See Appendix A.1. □
(i) As we pointed out that there are two different ways to determine
the step size sequence, i.e., (23) and (24), in this paper we only In addition, from Yang et al. (2021), we have the following bound
study the performance of the proposed algorithm with the update of E[‖∇𝐹 (𝑤𝑘 ) − 𝑣𝑘 ‖2 ].


Lemma 2. Suppose that Assumption 1 holds and consider v_k defined in Algorithm 2. Then, for any k ≥ 1, we have

E[‖∇F(w_k) − v_k‖²] ≤ (L²γ²)/(μ²bb_H²) ⋅ (n − b)/(n − 1) ⋅ Σ_{j=1}^{k} E[‖v_{j−1}‖²].

Lemma 3. Suppose that Assumption 1 holds, consider CG-SARAH-SO within a single outer loop in Algorithm 2 and set w* = arg min_w F(w). Then we obtain

Σ_{k=0}^{m} E[‖∇F(w_k)‖²] ≤ (2μb_H)/(γ − γβ̂) [F(w_0) − F(w*)] + 1/(1 − β̂) Σ_{k=0}^{m} ‖∇F(w_k) − v_k‖² − (1 − (2Lγ)/(μb_H)) ⋅ 1/(1 − β̂) Σ_{k=0}^{m} ‖v_k‖².

Proof. See Appendix A.2. □

Here, we summarize the first main convergence result of CG-SARAH-SO in Theorem 1.

Theorem 1. Suppose that Assumptions 1, 2, 3 and Lemmas 1, 2, 3 hold. Let w* = arg min_w F(w) and choose S, S_H ⊂ {1, …, n} with b = |S| and b_H = |S_H|. Consider CG-SARAH-SO (within a single outer loop in Algorithm 2) with

(L²γ²)/(μ²bb_H²) ⋅ (n − b)/(n − 1) ⋅ m − (1 − (Lγ)/(μb_H)) ≤ 0.   (31)

Then we have

E[‖∇F(w_m)‖²] ≤ (2μb_H)/(γ(1 − β̂)(m + 1)) [F(w_0) − F(w*)].

Proof. See Appendix A.3. □

To obtain E[‖∇F(w_m)‖²] ≤ ε, it is sufficient to require

(2μb_H)/(γ(1 − β̂)(m + 1)) E[F(w_0) − F(w*)] ≤ ε.

Actually, to make the above inequality hold, we can choose m = O(μb_H/(γ(1 − β̂)ε)). Hence, the complexity of CG-SARAH-SO (Algorithm 2) to attain an ε-accurate solution is n + 2m(b + b_H) = O(n + μb_H(b + b_H)/(γ(1 − β̂)ε)). For clarity, we summarize the complexity result of CG-SARAH-SO in Algorithm 2 within a single outer loop.

Corollary 1. Suppose that Assumptions 1, 2, 3 and Lemmas 1, 2, 3 hold and consider CG-SARAH-SO within a single outer loop. Then ‖∇F(w̃_s)‖² converges sub-linearly in expectation with a speed of O(μb_H/(γ(1 − β̂)m)). Also, the total complexity to attain an ε-accurate solution is O(n + μb_H(b + b_H)/(γ(1 − β̂)ε)).

In the following theorem, we give the theoretical result of CG-SARAH-SO with multiple outer iterations.

Theorem 2. Suppose that Assumptions 1, 2, 3 and Lemmas 1, 2, 3 hold. Let w* = arg min_w F(w) and set S, S_H ⊂ {1, …, n} with sizes b and b_H at random, respectively. Consider CG-SARAH-SO with

(L²γ²)/(μ²bb_H²) ⋅ (n − b)/(n − 1) ⋅ m − (1 − (Lγ)/(μb_H)) ≤ 0.

Then we have

E[‖∇F(w̃_s)‖²] ≤ (σ_m)^s ‖∇F(w̃_0)‖²,

where σ_m = b_H/(γ(1 + β̂)(m + 1)).

Proof. See Appendix A.4. □

According to Theorem 2, to obtain E[‖∇F(w̃_s)‖²] ≤ (σ_m)^s ‖∇F(w̃_0)‖² ≤ ε, it is sufficient to select s = O(log(1/ε)). Therefore, we have the following result for the total complexity of CG-SARAH-SO.

Corollary 2. Suppose that Assumptions 1, 2, 3 and Lemmas 1, 2, 3 hold. The total complexity of CG-SARAH-SO to attain an ε-accurate solution is O((n + μb_H(b + b_H)/(γ(1 − β̂)ε)) log(1/ε)).

For strongly convex cases, it has been pointed out that the complexity of some variants of modern SGD approaches, e.g., the stochastic average gradient (SAG) approach (Roux, Schmidt, & Bach, 2012), SAGA, SVRG, the stochastic dual coordinate ascent (SDCA) approach (Shalev-Shwartz & Zhang, 2013) and SARAH, is O((n + κ) log(1/ε)), where κ denotes the condition number, κ = L/μ. It can be seen from Corollary 2 that the complexity of CG-SARAH-SO is comparable to that of modern stochastic optimization algorithms when choosing appropriate parameters b, b_H and γ. More specifically, when setting b ≤ (κγ(1 − β̂)ε − μb_H²)/(μb_H), CG-SARAH-SO attains a better complexity than modern stochastic optimization algorithms.

5. A practical variant

As a practical variant of SARAH, Nguyen et al. (2017) proposed the SARAH+ algorithm, which works with an automatic and adaptive selection of the inner loop size m. Actually, the SARAH+ algorithm sets a stopping criterion according to the value of the stochastic estimator ‖v_k‖², while providing the number of iterative steps by a large enough m for robustness. Yang et al. (2021) also proposed a new variant of the SARAH+-type algorithm by introducing the random Barzilai–Borwein (RBB) step size into SARAH+. Similarly, here we propose a practical variant of CG-SARAH-SO, named CG-SARAH+-SO and shown in Algorithm 3.

Algorithm 3 CG-SARAH+-SO
Input: initial point w̃_0, epoch length m, mini-batch sizes b and b_H, initial step size η′_0, constants 0 < ξ ≤ 1 and γ > 0
Given z_0 = ∇F(w̃_0)
for s = 1, 2, … do
  w_0 = w̃_{s−1}
  v_0 = ∇F(w_0)
  u_0 = z_{s−1}
  d_0 = −u_0
  k = 1
  while ‖d_{k−1}‖² > ξ‖d_0‖² and k < m do
    w_k = w_{k−1} + η′_{k−1} d_{k−1}
    Choose a mini-batch S ⊂ {1, …, n} of size b uniformly at random and compute
      v_k = ∇F_S(w_k) − ∇F_S(w_{k−1}) + v_{k−1}
    Compute β_k by
      β_k^{FR} = ‖v_k‖² / ‖v_{k−1}‖²
    d_k = −v_k + β_k d_{k−1}
    Choose a mini-batch S_H ⊂ {1, …, n} of size b_H uniformly at random and compute the step size
      η′_k = (γ/b_H) ⋅ ⟨w_k − w_{k−1}, ∇F_{S_H}(w_k) − ∇F_{S_H}(w_{k−1})⟩ / ‖∇F_{S_H}(w_k) − ∇F_{S_H}(w_{k−1})‖²
    k = k + 1
  end while
  z_s = v_m
  w̃_s = w_m
end for

Proof. See Appendix A.4. □
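A sketch of the practical variant (ours; it reuses the hypothetical logistic_gradient helper above). The only change from the CG-SARAH-SO sketch is that the inner loop stops once the squared norm of the update direction drops below a ξ-fraction of its initial value, as in Algorithm 3; the step size is taken without the absolute value, following the printed Algorithm 3.

```python
import numpy as np

def cg_sarah_plus_so(X, y, lam, m, b, b_H, gamma, eta0, xi, epochs, seed=0):
    """Sketch of CG-SARAH+-SO (Algorithm 3): Algorithm 2 with an adaptive inner-loop stop."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w_tilde = np.zeros(d)
    z = logistic_gradient(w_tilde, X, y, lam)
    for s in range(epochs):
        w_prev = w_tilde.copy()
        v = logistic_gradient(w_prev, X, y, lam)
        d_dir = -z
        d0_sq = np.dot(d_dir, d_dir)                    # ||d_0||^2
        eta, k = eta0, 1
        # stop once ||d_{k-1}||^2 <= xi * ||d_0||^2 or k reaches m
        while np.dot(d_dir, d_dir) > xi * d0_sq and k < m:
            w = w_prev + eta * d_dir
            S = rng.choice(n, size=b, replace=False)
            v_new = (logistic_gradient(w, X, y, lam, S)
                     - logistic_gradient(w_prev, X, y, lam, S) + v)
            beta = np.dot(v_new, v_new) / np.dot(v, v)
            d_dir = -v_new + beta * d_dir
            S_H = rng.choice(n, size=b_H, replace=False)
            y_hat = (logistic_gradient(w, X, y, lam, S_H)
                     - logistic_gradient(w_prev, X, y, lam, S_H))
            eta = (gamma / b_H) * np.dot(w - w_prev, y_hat) / np.dot(y_hat, y_hat)
            w_prev, v, k = w, v_new, k + 1
        z, w_tilde = v, w_prev
    return w_tilde
```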


Table 1
Datasets used in the experiments.

Dataset | Training size | Testing size | Features | λ
rcv1 | 20,242 | 677,399 | 47,236 | 10^{-1}
a8a | 22,696 | 9,865 | 123 | 10^{-2}
w8a | 49,749 | 14,951 | 300 | 10^{-2}
ijcnn1 | 49,990 | 91,701 | 22 | 10^{-2}

Remarks. Unlike the work in Nguyen et al. (2017) and Yang et al. (2021), which use the stochastic estimator ‖v_k‖² to set a stopping criterion, we use the update direction d_k to set the stopping criterion. In addition, the parameter ξ can be set as suggested in Nguyen et al. (2017) and Yang et al. (2021). More specifically, to better understand the properties of CG-SARAH+-SO, we will show that the performance of CG-SARAH+-SO is not sensitive to the choice of the hyperparameter ξ.

6. Numerical results

In this section, we test CG-SARAH-SO (Algorithm 2) and CG-SARAH+-SO (Algorithm 3) on the ℓ2-regularized logistic regression problem,

min_{w∈R^d} F(w) := (1/n) Σ_{i=1}^{n} log(1 + exp(−y_i x_i^T w)) + (λ/2)‖w‖².   (32)

Several public datasets, including a8a, w8a, ijcnn1 and rcv1, were used in our experiments; they were downloaded from the LIBSVM Data website.¹ For clarity, we summarize the information of these datasets in Table 1.

6.1. Properties of CG-SARAH-SO

As seen from Algorithm 2, the performance of CG-SARAH-SO is mainly governed by the initial step size η′_0, the mini-batch sizes b and b_H, the epoch length m, and the parameter γ. Therefore, we show the numerical results of CG-SARAH-SO with these different parameters, respectively. Actually, it is easy to show that our CG-SARAH-SO algorithm is robust to the selection of the initial step size η′_0. In addition, in this section, we set m = n/(b + b_H) for CG-SARAH-SO. For simplicity, when discussing the effect of one parameter in CG-SARAH-SO, we fix the other parameters. Moreover, we present comparison results between CG-SARAH-SO and SARAH utilizing best-tuned step sizes in the mini-batch setting; for simplicity, the latter is named MB-SARAH. When running MB-SARAH, we set m = n/b. The x-axis stands for the number of epochs k and the y-axis stands for the loss residual F(w̃_s) − F(w*), unless otherwise specified. For instance, when showing the classification accuracy of the different approaches, the vertical axis stands for the test error rate.

6.1.1. Effect of the mini-batch size, b, in CG-SARAH-SO
We present the effect of the mini-batch size b in CG-SARAH-SO on a8a and ijcnn1. When running CG-SARAH-SO, we set b_H = 600 and γ = 20 on a8a, and b_H = 400 and γ = 0.4 on ijcnn1. When executing MB-SARAH, we choose the same batch sizes b as CG-SARAH-SO. The best choice of the step size in MB-SARAH is provided in the figure legend for the different batch sizes.

We plot the comparison results between CG-SARAH-SO with different mini-batch sizes b and MB-SARAH working with best-tuned step sizes in Fig. 1. In the top row of Fig. 1 we provide the sub-optimality results of the algorithms, while in the bottom row of Fig. 1 we show their classification accuracy. From the top row of Fig. 1, we see that our CG-SARAH-SO algorithm performs better than MB-SARAH working with the best-tuned step size. Also, the top of Fig. 1 shows that a small mini-batch size b makes our CG-SARAH-SO algorithm perform well; however, a too small mini-batch size will make the proposed algorithm diverge. From the bottom of Fig. 1, we have that our CG-SARAH-SO algorithm attains the same test error rate as MB-SARAH with the best choice of the step size. The bottom of Fig. 1 also indicates that our CG-SARAH-SO algorithm reaches the best test error rate faster than MB-SARAH with the best choice of the step size.

6.1.2. Effect of the mini-batch size, b_H, in CG-SARAH-SO
We present the effect of the mini-batch size b_H in CG-SARAH-SO on w8a and rcv1. When running CG-SARAH-SO, we set the mini-batch size b = 200 and the parameter γ = 1 on both w8a and rcv1. Moreover, similar to the above subsection, when running MB-SARAH, we take the same mini-batch size b as CG-SARAH-SO.

We plot the comparison results between CG-SARAH-SO with different batch sizes b_H and MB-SARAH working with best-tuned step sizes in Fig. 2. As seen from Fig. 2, our CG-SARAH-SO algorithm achieves better performance than MB-SARAH working with best-tuned step sizes. Also, a small mini-batch size b_H makes our CG-SARAH-SO algorithm achieve a better performance; however, a too small mini-batch size b_H leads to divergence of the proposed algorithm.

6.1.3. Effect of the parameter, γ, in CG-SARAH-SO
We present the effect of the parameter γ in CG-SARAH-SO on w8a and ijcnn1. When running CG-SARAH-SO, we set b = 200 and b_H = 400 on the different datasets. The choices of the parameter γ are shown in the legend of Fig. 3 for CG-SARAH-SO. Also, the best choice of the step size is depicted in the legend of Fig. 3 for MB-SARAH.

The results of CG-SARAH-SO with different parameters γ are plotted in Fig. 3. Fig. 3 demonstrates that our CG-SARAH-SO algorithm achieves better performance than the original MB-SARAH algorithm working with best-tuned step sizes. In addition, Fig. 3 indicates that a small parameter γ makes our CG-SARAH-SO algorithm converge slowly. Notice that a larger parameter γ will make our CG-SARAH-SO algorithm diverge.

6.1.4. Effect of the initial step size, η′_0, in CG-SARAH-SO
In this subsection, we show that our CG-SARAH-SO algorithm is robust to the initial step size on w8a and ijcnn1. When running CG-SARAH-SO, we set b = 200, b_H = 400 and γ = 1 for the different datasets. The initial step size for CG-SARAH-SO is chosen from {0.01, 0.1, 1}. We plot the results of CG-SARAH-SO with different initial step sizes in Fig. 4. It is observed from Fig. 4 that our CG-SARAH-SO algorithm is robust to the selection of the initial step size.

6.2. Performance of CG-SARAH+-SO

Here, we show the performance of CG-SARAH+-SO (Algorithm 3) with different parameters. Similar to the presentation of CG-SARAH-SO (Algorithm 2) with different parameters, we also present comparison results between CG-SARAH+-SO and MB-SARAH+ utilizing the best-tuned step size. First, we show the effect of the mini-batch size b in CG-SARAH+-SO on a8a and w8a, plotted in Fig. 5. When discussing the properties of CG-SARAH+-SO with the mini-batch size b, we provide the results of sub-optimality and classification accuracy in Fig. 5. Then, we show the effect of the mini-batch size b_H in CG-SARAH+-SO on w8a and ijcnn1, plotted in Fig. 6. Finally, we show the effect of the parameter γ in CG-SARAH+-SO on rcv1 and ijcnn1, plotted in Fig. 7.

Figs. 5-7 show that the CG-SARAH+-SO algorithm, working with different parameters, achieves a better performance than the MB-SARAH+ algorithm working with best-tuned step sizes. Additionally, the test error rate results in Fig. 5 demonstrate that CG-SARAH+-SO obtains the same test error rate as MB-SARAH+.

¹ https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
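For reproducibility, the following sketch (ours; it assumes scikit-learn is available and reuses the earlier logistic_objective helper and the cg_sarah_so sketch, with illustrative parameter values) loads a LIBSVM-format dataset and reports the loss residual F(w̃_s) − F(w*) used on the y-axis of the figures, with F(w*) approximated by a much longer reference run.

```python
import numpy as np
from sklearn.datasets import load_svmlight_file

# a8a in LIBSVM format, downloaded from the LIBSVM Data website (local file name assumed)
X_sparse, y = load_svmlight_file("a8a")
X = np.asarray(X_sparse.todense())
lam = 1e-2                                   # lambda for a8a, as in Table 1

m = len(y) // (200 + 600)                    # m = n / (b + b_H), with b=200, b_H=600, gamma=20 on a8a
w_hat = cg_sarah_so(X, y, lam, m=m, b=200, b_H=600, gamma=20.0, eta0=0.1, epochs=30)

# F(w*) is unknown; approximate it with a much longer run and report the residual
w_star = cg_sarah_so(X, y, lam, m=m, b=200, b_H=600, gamma=20.0, eta0=0.1, epochs=300)
residual = logistic_objective(w_hat, X, y, lam) - logistic_objective(w_star, X, y, lam)
print(f"loss residual after 30 epochs: {residual:.3e}")
```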


Fig. 1. Performance of CG-SARAH-SO with different mini-batch size, 𝑏, on 𝑎8𝑎 (left) and 𝑖𝑗𝑐𝑛𝑛1 (right).

Fig. 2. Performance of CG-SARAH-SO with different mini-batch size, 𝑏𝐻 , on 𝑤8𝑎 (left) and 𝑟𝑐𝑣1 (right).

At the end of this subsection, we show that the performance of CG-SARAH+-SO is insensitive to the choice of the hyperparameter ξ on a8a and ijcnn1, which, in practice, greatly reduces the difficulty of selecting the hyperparameter of the original CG-SARAH-SO algorithm.


Fig. 3. Performance of CG-SARAH-SO with different parameters, 𝛾, on 𝑤8𝑎 (left) and 𝑖𝑗𝑐𝑛𝑛1 (right).

Fig. 4. Performance of CG-SARAH-SO with different initial step sizes, 𝜂0 , on 𝑤8𝑎 (left) and 𝑖𝑗𝑐𝑛𝑛1 (right).

When running CG-SARAH+-SO, we set γ = 18 and γ = 1 on a8a and ijcnn1, respectively. Additionally, the details of the other parameters for CG-SARAH+-SO are displayed in the legend of Fig. 8. Fig. 8 demonstrates that the performance of CG-SARAH+-SO is not sensitive to the choice of the hyperparameter ξ.

6.3. Comparison with other related algorithms

To end this section, we perform numerical experiments comparing our CG-SARAH-SO (Algorithm 2) and CG-SARAH+-SO (Algorithm 3) algorithms with some state-of-the-art algorithms for solving Problem (32), which include Step-Tuned SGD (Castera et al., 2021), the stochastic average gradient algorithm with line search (SAG-LS) (Schmidt et al., 2015), SVRG (Johnson & Zhang, 2013), CGVR (Jin et al., 2019), SDCA (Shalev-Shwartz & Zhang, 2013), AMSGRAD (a variant of ADAM) (Reddi et al., 2018) and the accelerated mini-batch Prox-SVRG (Acc-Prox-SVRG) algorithm (incorporating NAG and SVRG in the mini-batch setting) (Nitanda, 2014). To enable a fair comparison, for all these algorithms we choose the optimal parameters as advised in their papers.

As observed from Fig. 9, CG-SARAH-SO (Algorithm 2) and CG-SARAH+-SO (Algorithm 3) work better than or are comparable to the state-of-the-art algorithms.

7. Conclusion

In this paper, we developed a stable adaptive SCG algorithm, CG-SARAH-SO, based on the SARAH gradient estimator and second-order step size tuning. Instead of using the line search of conventional CG algorithms, which, in practice, is time consuming and may fail in the stochastic setting, the proposed algorithms use a local quadratic model to estimate the step size sequence and have a lower computational cost than Newton and quasi-Newton algorithms. We rigorously proved that CG-SARAH-SO converges sublinearly within a single outer iteration and attains a linear convergence speed over multiple outer iterations. Also, we analyzed the complexity of CG-SARAH-SO and showed that it is comparable to modern stochastic optimization algorithms. In addition, to further reduce the difficulty of selecting the parameters when running CG-SARAH-SO, we proposed CG-SARAH+-SO, which works with an automatic and adaptive selection of the inner loop size m. Various numerical experiments, executed on the ℓ2-regularized logistic regression problem, validated the effectiveness of the two proposed algorithms.

Fig. 5. Performance of CG-SARAH+-SO with different mini-batch sizes, 𝑏, on 𝑎8𝑎 (left) and 𝑤8𝑎 (right).

Fig. 6. Performance of CG-SARAH+-SO with different mini-batch sizes, 𝑏𝐻 , on 𝑤8𝑎 (left) and 𝑖𝑗𝑐𝑛𝑛1 (right).


Fig. 7. Performance of CG-SARAH+-SO with different parameters, 𝛾, on 𝑖𝑗𝑐𝑛𝑛1 (left) and 𝑟𝑐𝑣1 (right).

Fig. 8. Performance of CG-SARAH+-SO with different parameters, 𝜉, on 𝑎8𝑎 (left) and 𝑖𝑗𝑐𝑛𝑛1 (right).

CRediT authorship contribution statement

Zhuang Yang: Data curation, Formal analysis, Investigation, Methodology, Validation, Visualization, Writing – original draft, Writing – review & editing, Funding acquisition.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the China Postdoctoral Science Foundation under Grant 2019M663238. We are very grateful for the valuable comments, helpful advice and the patience of the anonymous reviewers.

Appendix. Proofs for CG-SARAH-SO

A.1. Proof of Lemma 1

Proof. From the definitions of (23) and (24), we have

η̂′_k / η′_k = (‖s_k‖² / ⟨s_k, ŷ_k⟩) ⋅ (‖ŷ_k‖² / ⟨s_k, ŷ_k⟩) ≥ 1,

where the inequality holds due to the Cauchy–Schwarz inequality, i.e., ⟨s_k, ŷ_k⟩² ≤ ‖s_k‖² ‖ŷ_k‖². Hence, we have η′_k ≤ η̂′_k.

In addition, from (23), we have

η̂′_k ≤ γ/(b_H μ),

where the inequality holds due to Assumption 2. Also, from (24), we have

η′_k ≥ γ/(b_H L),   (33)

where the inequality holds due to Assumption 1. □
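A quick numerical sanity check of the bound (30) (our illustration, not part of the paper): on a synthetic quadratic whose Hessian spectrum lies in [μ, L], the gradient difference over any batch equals A s exactly, so the two step sizes (23)-(24) can be evaluated in closed form and compared against γ/(b_H L) and γ/(b_H μ).

```python
import numpy as np

# Sanity check of (30) on F(w) = 0.5 * w^T A w, where y_hat = A s exactly.
rng = np.random.default_rng(0)
d, mu, L = 20, 0.5, 10.0
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = Q @ np.diag(np.linspace(mu, L, d)) @ Q.T     # Hessian with eigenvalues in [mu, L]

gamma, b_H = 1.0, 400
for _ in range(1000):
    s = rng.standard_normal(d)                   # a displacement w_k - w_{k-1}
    y_hat = A @ s                                # gradient difference over S_H
    eta_lo = (gamma / b_H) * (s @ y_hat) / (y_hat @ y_hat)   # step size (24)
    eta_hi = (gamma / b_H) * (s @ s) / (s @ y_hat)           # step size (23)
    assert gamma / (b_H * L) - 1e-12 <= eta_lo <= eta_hi <= gamma / (b_H * mu) + 1e-12
```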


Fig. 9. Comparisons of loss residuals from different state-of-the-art algorithms on 𝑎8𝑎 (top left), 𝑤8𝑎 (top right), 𝑖𝑗𝑐𝑛𝑛1 (bottom left) and 𝑟𝑐𝑣1 (bottom right).

A.2. Proof of Lemma 3

Proof. From Assumption 1, w_{k+1} = w_k + η′_k d_k and d_k = −v_k + β_k d_{k−1} in Algorithm 2, we obtain

E[F(w_{k+1})]
≤ E[F(w_k) + ⟨∇F(w_k), w_{k+1} − w_k⟩ + (L/2)‖w_{k+1} − w_k‖²]
= E[F(w_k) + η′_k ⟨∇F(w_k), d_k⟩ + (L(η′_k)²/2)‖d_k‖²]
≤ E[F(w_k)] + (γ/(μb_H)) E[⟨∇F(w_k), −v_k + β_k d_{k−1}⟩] + (Lγ²/(2μ²b_H²)) E[‖−v_k + β_k d_{k−1}‖²]
= E[F(w_k)] − (γ/(μb_H)) E[⟨∇F(w_k), v_k⟩] + (γβ_k/(μb_H)) E[⟨∇F(w_k), d_{k−1}⟩] + (Lγ²/(2μ²b_H²)) E[‖d_k‖²]
≤ F(w_k) − (γ/(2μb_H)) E[‖∇F(w_k)‖² + ‖v_k‖² − ‖∇F(w_k) − v_k‖²] + (γβ_k/(2μb_H)) [‖∇F(w_k)‖² + ‖d_{k−1}‖² − ‖∇F(w_k) − d_{k−1}‖²] + (Lγ²/(2μ²b_H²)) E[2‖v_k‖² + 2β_k²‖d_{k−1}‖²]
= E[F(w_k)] − (γ/(2μb_H) − γβ_k/(2μb_H)) ‖∇F(w_k)‖² + (γ/(2μb_H)) ‖∇F(w_k) − v_k‖² − (γ/(2μb_H) − Lγ²/(μ²b_H²)) ‖v_k‖² + (γβ_k/(2μb_H) + Lγ²β_k²/(μ²b_H²)) ‖d_{k−1}‖² − (γβ_k/(2μb_H)) ‖∇F(w_k) − d_{k−1}‖²
≤ E[F(w_k)] − (γ/(2μb_H) − γβ̂/(2μb_H)) ‖∇F(w_k)‖² + (γ/(2μb_H)) ‖∇F(w_k) − v_k‖² − (γ/(2μb_H) − Lγ²/(μ²b_H²)) ‖v_k‖² + (γβ̂/(2μb_H) + Lγ²β̂²/(μ²b_H²)) ‖d_{k−1}‖²,   (34)

where the first inequality holds due to Assumption 1, the second inequality holds due to the identity ⟨a, b⟩ = (1/2)[‖a‖² + ‖b‖² − ‖a − b‖²] and the inequality (a − b)² ≤ 2a² + 2b², and the last inequality holds due to Assumption 3.

To hold (34), it is enough to set

E[F(w_{k+1})] ≤ E[F(w_k)] − (γ/(2μb_H) − γβ̂/(2μb_H)) ‖∇F(w_k)‖² + (γ/(2μb_H)) ‖∇F(w_k) − v_k‖² − (γ/(2μb_H) − Lγ²/(μ²b_H²)) ‖v_k‖².

By adding over k = 0, 1, …, m, we have

E[F(w_{m+1})] ≤ E[F(w_0)] − (γ/(2μb_H) − γβ̂/(2μb_H)) Σ_{k=0}^{m} E[‖∇F(w_k)‖²] + (γ/(2μb_H)) Σ_{k=0}^{m} E[‖∇F(w_k) − v_k‖²] − (γ/(2μb_H) − Lγ²/(μ²b_H²)) Σ_{k=0}^{m} E[‖v_k‖²].

Further, we have

Σ_{k=0}^{m} E[‖∇F(w_k)‖²] ≤ (2μb_H/(γ − γβ̂)) [F(w_0) − F(w_{m+1})] + (1/(1 − β̂)) Σ_{k=0}^{m} ‖∇F(w_k) − v_k‖² − (1 − (2Lγ)/(μb_H)) (1/(1 − β̂)) Σ_{k=0}^{m} ‖v_k‖²
≤ (2μb_H/(γ − γβ̂)) [F(w_0) − F(w*)] + (1/(1 − β̂)) Σ_{k=0}^{m} ‖∇F(w_k) − v_k‖² − (1 − (2Lγ)/(μb_H)) (1/(1 − β̂)) Σ_{k=0}^{m} ‖v_k‖²,

where the last inequality holds due to w* = arg min_w F(w). Here, we have finished the proof of Lemma 3. □

A.3. Proof of Theorem 1

Proof. Combining Lemma 2 and ‖∇F(w_0) − v_0‖² = 0, when adding over k = 0, …, m, we have

Σ_{k=0}^{m} E[‖∇F(w_k) − v_k‖²] ≤ (L²γ²/(μ²bb_H²)) ((n − b)/(n − 1)) [m E[‖v_0‖²] + (m − 1) E[‖v_1‖²] + ⋯ + E[‖v_{m−1}‖²]].

Further, we obtain

Σ_{k=0}^{m} E[‖∇F(w_k) − v_k‖²] − (1 − (2Lγ)/(μb_H)) Σ_{k=0}^{m} ‖v_k‖²
≤ (L²γ²/(μ²bb_H²)) ((n − b)/(n − 1)) [m E[‖v_0‖²] + (m − 1) E[‖v_1‖²] + ⋯ + E[‖v_{m−1}‖²]] − (1 − (2Lγ)/(μb_H)) Σ_{k=0}^{m} ‖v_k‖²
≤ [(L²γ²/(μ²bb_H²)) ((n − b)/(n − 1)) m − (1 − (2Lγ)/(μb_H))] Σ_{k=0}^{m} ‖v_k‖²
≤ 0.

Based on Lemma 3, we have

Σ_{k=0}^{m} E[‖∇F(w_k)‖²] ≤ (2μb_H/(γ − γβ̂)) [F(w_0) − F(w*)] + (1/(1 − β̂)) Σ_{k=0}^{m} ‖∇F(w_k) − v_k‖² − (1 − (2Lγ)/(μb_H)) (1/(1 − β̂)) Σ_{k=0}^{m} ‖v_k‖²
≤ (2μb_H/(γ − γβ̂)) [F(w_0) − F(w*)].

Further, according to the definition of w̃_s in Algorithm 2 and w̃_s = w_m, we obtain

E[‖∇F(w_m)‖²] = (1/(m + 1)) Σ_{k=0}^{m} E[‖∇F(w_k)‖²] ≤ (2μb_H/(γ(1 − β̂)(m + 1))) E[F(w_0) − F(w*)].

Thus, we have completed the proof of Theorem 1. □

A.4. Proof of Theorem 2

Proof. Note that w_0 = w̃_{s−1} and w̃_s = w_m, s ≥ 1. According to Theorem 1, we ascertain

E[‖∇F(w̃_s) | w̃_{s−1}‖²] = E[‖∇F(w̃_s) | w_0‖²]
≤ (2μb_H/(γ(1 + β̂)(m + 1))) E[F(w_0) − F(w*)]
≤ (b_H/(γ(1 + β̂)(m + 1))) ‖∇F(w_0)‖²
= (b_H/(γ(1 + β̂)(m + 1))) ‖∇F(w̃_{s−1})‖²,

where the second inequality holds due to (6). Hence, taking expectation on both sides, we obtain

E[‖∇F(w̃_s)‖²] ≤ (b_H/(γ(1 + β̂)(m + 1))) E[‖∇F(w̃_{s−1})‖²] ≤ [b_H/(γ(1 + β̂)(m + 1))]^s ‖∇F(w̃_0)‖².

When setting σ_m = b_H/(γ(1 + β̂)(m + 1)), we have the desired result. □

References

Agarwal, N., Bullins, B., & Hazan, E. (2017). Second-order stochastic optimization for machine learning in linear time. Journal of Machine Learning Research, 18(1), 4148–4187.
Baker, J., Fearnhead, P., Fox, E. B., & Nemeth, C. (2019). Control variates for stochastic gradient MCMC. Statistics and Computing, 29(3), 599–615.
Barzilai, J., & Borwein, J. M. (1988). Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8(1), 141–148.
Baydin, A. G., Cornish, R., Rubio, D. M., Schmidt, M. W., & Wood, F. D. (2018). Online learning rate adaptation with hypergradient descent. In International conference on learning representations.
Bordes, A., Bottou, L., & Gallinari, P. (2009). SGD-QN: Careful quasi-Newton stochastic gradient descent. Journal of Machine Learning Research, 10, 1737–1754.
Byrd, R. H., Chin, G. M., Neveitt, W., & Nocedal, J. (2011). On the use of stochastic Hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3), 977–995.
Byrd, R. H., Chin, G. M., Nocedal, J., & Wu, Y. (2012). Sample size selection in optimization methods for machine learning. Mathematical Programming, 134(1), 127–155.
Byrd, R. H., Hansen, S. L., Nocedal, J., & Singer, Y. (2016). A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization, 26(2), 1008–1031.
Castera, C., Bolte, J., Févotte, C., & Pauwels, E. (2021). Second-order step-size tuning of SGD for non-convex optimization. arXiv preprint arXiv:2103.03570.
Csiba, D., & Richtárik, P. (2018). Importance sampling for minibatches. Journal of Machine Learning Research, 19(1), 962–982.
Cui, Y., Wu, D., & Huang, J. (2020). Optimize TSK fuzzy systems for classification problems: Minibatch gradient descent with uniform regularization and batch normalization. IEEE Transactions on Fuzzy Systems, 28(12), 3065–3075.
Dai, Y.-H., & Yuan, Y. (1999). A nonlinear conjugate gradient method with a strong global convergence property. SIAM Journal on Optimization, 10(1), 177–182.
Defazio, A., Bach, F., & Lacoste-Julien, S. (2014). SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in neural information processing systems (pp. 1646–1654).
Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul), 2121–2159.
Fletcher, R., & Reeves, C. M. (1964). Function minimization by conjugate gradients. Computer Journal, 7(2), 149–154.
Gao, T., Gong, X., Zhang, K., Lin, F., Wang, J., Huang, T., et al. (2020). A recalling-enhanced recurrent neural network: Conjugate gradient learning algorithm and its convergence analysis. Information Sciences, 519, 273–288.
Gilbert, J. C., & Nocedal, J. (1992). Global convergence properties of conjugate gradient methods for optimization. SIAM Journal on Optimization, 2(1), 21–42.
Gitman, I., Lang, H., Zhang, P., & Xiao, L. (2019). Understanding the role of momentum in stochastic gradient methods. In Advances in neural information processing systems.
Gower, R., Goldfarb, D., & Richtárik, P. (2016). Stochastic block BFGS: Squeezing more curvature out of data. In International conference on machine learning (pp. 1869–1878). PMLR.
Gower, R., Le Roux, N., & Bach, F. (2018). Tracking the gradients using the Hessian: A new look at variance reducing stochastic methods. In International conference on artificial intelligence and statistics (pp. 707–715). PMLR.

