Automatic, Dynamic, and Nearly Optimal Learning Rate Specification by Local Quadratic Approximation
University of China
Beijing, China
1 INTRODUCTION
Given a differentiable objective function, gradient descent is a natural and efficient method for optimization. Among various gradient descent methods, the stochastic gradient descent (SGD) method [23] plays a critical role. In the standard SGD method, the first-order gradient of a randomly selected sample is used to iteratively update the parameter estimates of a network. Specifically, the parameter estimates are adjusted with the negative of the random gradient multiplied by a step size. The step size is called the learning rate. Many generalized methods based on the SGD method have been proposed [1, 4, 11, 25, 26]. Most of these extensions specify improved update rules to adjust the direction or the step size. However, [1] pointed out that many hand-designed update rules are designed for circumstances with certain characteristics, such as sparsity or nonconvexity. As a result, rule-based methods might perform well in some cases but poorly in others. Consequently, an optimizer with an automatically adjusted update rule is preferable.

An update rule contains two important components: one is the update direction, and the other is the step size. The learning rate determines the step size, which plays a significant role in optimization. If it is set inappropriately, the parameter estimates could be suboptimal. Empirical experience suggests that a relatively large learning rate might be preferred in the early stages of the optimization; otherwise, the algorithm might converge very slowly. In contrast, a relatively small learning rate should be used in the later stages; otherwise, the objective function cannot be fully optimized. This phenomenon inspires us to design a method to automatically search for an optimal learning rate in each update step during optimization.

To this end, we propose here a novel optimization method based on local quadratic approximation (LQA). It tunes the learning rate in a dynamic, automatic and nearly optimal manner. The method can obtain the best step size in each update step. Intuitively, given a search direction, what should be the best step size? One natural definition is the step size that can lead to the greatest reduction in the global loss. Accordingly, the step size itself should be treated as a parameter that needs to be optimized. For this purpose, the proposed method can be decomposed into two important steps: the expansion step and the approximation step. First, in the expansion step, we conduct a Taylor expansion of the loss function around the current parameter estimates. Accordingly, the objective function can be locally approximated by a quadratic function in terms of the learning rate. Then, the learning rate is also treated as a parameter to be optimized, which leads to a nearly optimal determination of the learning rate for this particular update step.

Second, to implement this idea, we need to compute the first- and second-order derivatives of the objective function along the gradient direction. One way to solve this problem is to compute the Hessian matrix of the loss function. However, this solution is computationally expensive, because many complex deep neural networks involve a large number of parameters, which makes the Hessian matrix have ultra-high dimensionality. To solve this problem, we propose here a novel approximation step. Note that, given a fixed gradient direction, the loss function can be approximated by a standard quadratic function with the learning rate as the only input variable. For a univariate quadratic function such as this, there are only two unknown coefficients: the linear term coefficient and the quadratic term coefficient. As long as these two coefficients can be determined, the optimal learning rate can be obtained. To estimate the two unknown coefficients, one can try, for example, two different but reasonably small learning rates. Then, the corresponding objective function can be evaluated. This step leads to two equations, which can be solved to estimate the two unknown coefficients in the quadratic approximation function. Thereafter, the optimal learning rate can be obtained.
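In symbols, and anticipating the notation of Section 3, the idea can be sketched as follows (a schematic restatement only; the precise batch-level construction is given in Section 3). With the gradient direction g held fixed, the loss change as a function of the learning rate δ behaves like a univariate quadratic,
\[
\Delta\ell(\delta) = \ell\bigl(\hat{\theta} - \delta g\bigr) - \ell\bigl(\hat{\theta}\bigr) \approx -a\,\delta + b\,\delta^{2},
\qquad
\delta^{*} = \frac{a}{2b},
\]
so evaluating the loss at two reasonably small trial rates yields two linear equations in the unknown coefficients (a, b), and the nearly optimal rate δ* follows immediately whenever the quadratic coefficient b is positive.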
Our contributions: We propose an automatic, dynamic and nearly optimal learning rate tuning algorithm that has the following three important features.
(1) The algorithm is automatic. In other words, it leads to an optimization method with little subjective judgment.
(2) The method is dynamic in the sense that the learning rate used in each update step is different. It is dynamically adjusted according to the current status of the loss function and the parameter estimates. Typically, larger rates are used in the earlier iterations, while smaller rates are used in the later iterations.
(3) The learning rate derived from the proposed method is nearly optimal. For each update step, by the novel quadratic approximation, the learning rate leads to almost the greatest reduction in terms of the loss function. Here, "almost" refers to the fact that the loss function is locally approximated by a quadratic function whose unknown coefficients are estimated numerically. For this particular update step, with the gradient direction fixed, and among all possible learning rates, the one determined by the proposed method results in nearly the greatest reduction in terms of the loss function.

The rest of this article is organized as follows. In Section 2, we review related work on gradient-based optimizers. Section 3 presents the proposed algorithm in detail. In Section 4, we verify the performance of the proposed method through empirical studies on open datasets. Then, concluding remarks are given in Section 5.

2 RELATED WORK
To optimize a loss function, two important components need to be specified: the update direction and the step size. Ideally, the best update direction should be the gradient of the loss function computed on the whole data. For convenience, we refer to it as the global gradient. Since the calculation of the global gradient is computationally expensive, the SGD method [23] uses a gradient estimated from a stochastic subsample in each iteration, which we refer to as a sample gradient. It leads to fairly satisfactory empirical performance. The SGD method has inspired many new optimization methods, most of which enhance their performance by improving the estimation of the global gradient direction. A natural improvement is to combine sample gradients from different update steps so that a more reliable estimate of the global gradient direction can be obtained. This improvement has led to momentum-based optimization methods, such as those proposed in [3, 15, 25, 27]. In particular, [3] adopted Nesterov's accelerated gradient algorithm [21] to further improve the calculation of the gradient direction.

There exist other optimization methods that focus on the adjustment of the step size. [4] proposed AdaGrad, in which the step size is iteratively decreased according to a prespecified function. However, it still involves a parameter related to the learning rate, which needs to be subjectively determined. More extensions of AdaGrad have been proposed, such as RMSProp [26] and AdaDelta [30]. In particular, RMSProp introduced a decay factor to adjust the weights of previous sample gradients. [11] proposed the adaptive moment estimation (Adam) method, which combines RMSProp with a momentum-based method. Accordingly, the step size and the update direction are both adjusted during each iteration. However, because step sizes are adjusted without considering the loss function, the loss reduction obtained in each update step is suboptimal. Thus, the resulting convergence rate can be further improved.

To summarize, most existing optimization methods suffer from one or both of the following two limitations. First, they are not automatic, and human intervention is required. Second, they are suboptimal because the loss reduction achieved in each update step can be further improved. These pioneering works inspired us to develop a new method for the automatic determination of the learning rate. Ideally, the new method should be automatic, with little human intervention. It should be dynamic, so that the learning rate used for each update step is particularly selected. Most importantly, in each update step, the learning rate determined by the new method should be optimal (or nearly optimal) in terms of the loss reduction, given a fixed update direction.

3 METHODOLOGY
In this section, we first introduce the notation used in this paper and the general formulation of the SGD method. Then, we propose an algorithm based on local quadratic approximation to dynamically search for an optimal learning rate. This results in a new variant of the SGD method.

3.1 Stochastic gradient descent
Assume we have a total of N samples. They are indexed by 1 ≤ i ≤ N and collected by S = {1, 2, ..., N}. For each sample, a loss function can be defined as ℓ(X_i; θ), where X_i is the input corresponding to the i-th sample and θ ∈ R^p denotes the parameter. Then the global loss function can be defined as
\[
\ell(\theta) = \frac{1}{N}\sum_{i=1}^{N}\ell(X_i;\theta) = \frac{1}{|S|}\sum_{i\in S}\ell(X_i;\theta).
\]
Ideally, one should optimize ℓ(θ) by a gradient descent algorithm. Assume there are a total of T iterations. Let θ̂^(t) be the parameter estimate obtained in the t-th iteration. Then, the estimate in the next iteration, θ̂^(t+1), is given by
\[
\hat{\theta}^{(t+1)} = \hat{\theta}^{(t)} - \delta\,\nabla\ell\bigl(\hat{\theta}^{(t)}\bigr),
\]
where δ is the learning rate and ∇ℓ(θ̂^(t)) is the gradient of the global loss function ℓ(θ) with respect to θ at θ̂^(t). More specifically, ∇ℓ(θ̂^(t)) = N^{-1} Σ_{i=1}^{N} ∇ℓ(X_i; θ̂^(t)), where ∇ℓ(X_i; θ̂^(t)) is the gradient of the local loss function for the i-th sample.

Unfortunately, such a straightforward implementation is computationally expensive if the sample size N is relatively large, which is particularly true if the dimensionality of θ is also ultrahigh. To alleviate the computational burden, researchers proposed the idea of SGD. The key idea is to randomly partition the whole sample into a number of nonoverlapping batches. For example, we can write S = ∪_{k=1}^{K} S_k, where S_k collects the indices of the samples in the k-th batch. We should have S_{k_1} ∩ S_{k_2} = ∅ for any k_1 ≠ k_2 and |S_k| = n for any 1 ≤ k ≤ K, where n is a fixed batch size. Next, instead of computing the global gradient ∇ℓ(θ̂^(t)), we can replace it by an estimate computed based on the k-th batch. More specifically, each iteration (e.g., the t-th iteration) is further decomposed into a total of K batch steps. Let θ̂^(t,k) be the estimate obtained in the k-th (1 ≤ k ≤ K) batch step of the t-th iteration. Then, we have
\[
\hat{\theta}^{(t,k+1)} = \hat{\theta}^{(t,k)} - \frac{\delta}{n}\sum_{i\in S_k}\nabla\ell\bigl(X_i;\hat{\theta}^{(t,k)}\bigr),
\]
where k = 1, ..., K − 1. In particular, θ̂^(t+1,1) = θ̂^(t,K) − δ n^{-1} Σ_{i∈S_K} ∇ℓ(X_i; θ̂^(t,K)).

By doing so, the computational burden can be alleviated. However, the tradeoff is that the batch-sample-based gradient estimate could be unstable, which is particularly true if the batch size n is relatively small. To fix this problem, various momentum-based methods have been proposed. The key idea is to record the gradients from previous iterations and integrate them together to form a more stable estimate.
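To fix ideas, one iteration of this mini-batch update can be sketched in a few lines of Python (a schematic sketch only, assuming a user-supplied function grad_fn that returns the batch-averaged gradient; it is not tied to any particular framework):

```python
def sgd_iteration(theta, batches, grad_fn, delta):
    """One iteration t of mini-batch SGD over the batches S_1, ..., S_K.

    theta   : current parameter estimate (e.g., a NumPy array)
    batches : list of index sets S_k, each of size n
    grad_fn : callable(batch, theta) -> batch-averaged gradient
    delta   : fixed learning rate
    """
    for S_k in batches:
        # theta^(t,k+1) = theta^(t,k) - (delta / n) * sum_{i in S_k} grad l(X_i; theta)
        theta = theta - delta * grad_fn(S_k, theta)
    return theta
```

The LQA method developed next keeps this loop structure but replaces the fixed delta with a learning rate recomputed at every batch step.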
3.2 Local quadratic approximation
In this work, we assume that for each batch step, the estimate of the gradient direction is given. It can be obtained by different algorithms. For example, it could be the estimate obtained by a standard SGD algorithm or an estimate that involves rule-based corrections, such as that from a momentum-based method. We focus on how to specify the learning rate in an optimal (or nearly optimal) way.

To this end, we treat the learning rate δ as an unknown parameter. It is remarkable that the optimal learning rate could change dynamically across batch steps. Thus, we use δ_{t,k} to denote the learning rate in the k-th batch step within the t-th iteration. Since the reduction in the loss in this batch step is influenced by the learning rate δ_{t,k}, we express it as a function of the learning rate, ∆ℓ(δ_{t,k}).

To find the optimal value for δ_{t,k}, we investigate the optimization of ∆ℓ(δ_{t,k}) based on the Taylor expansion. For simplicity, we use g_{t,k} = n^{-1} Σ_{i∈S_k} ∇ℓ(X_i; θ̂^(t,k)) to denote the current gradient. Given θ̂^(t,k) and g_{t,k}, the loss reduction can be expressed as
\[
\Delta\ell(\delta_{t,k}) = \frac{1}{n}\sum_{i\in S_k}\Bigl[\ell\bigl(X_i;\hat{\theta}^{(t,k+1)}\bigr) - \ell\bigl(X_i;\hat{\theta}^{(t,k)}\bigr)\Bigr] = \frac{1}{n}\sum_{i\in S_k}\Bigl[\ell\bigl(X_i;\hat{\theta}^{(t,k)} - \delta_{t,k}g_{t,k}\bigr) - \ell\bigl(X_i;\hat{\theta}^{(t,k)}\bigr)\Bigr].
\]
Then, two estimation steps are conducted to determine an appropriate value for δ_{t,k} in this batch step.

(1) Expansion Step. By a Taylor expansion of ℓ(X_i; θ) around θ̂^(t,k), we have
\[
\ell\bigl(X_i;\hat{\theta}^{(t,k)} - \delta_{t,k}g_{t,k}\bigr) = \ell\bigl(X_i;\hat{\theta}^{(t,k)}\bigr) - \nabla\ell\bigl(X_i;\hat{\theta}^{(t,k)}\bigr)^{\top}\delta_{t,k}g_{t,k} + \frac{1}{2}\delta_{t,k}^{2}\,g_{t,k}^{\top}\nabla^{2}\ell\bigl(X_i;\hat{\theta}^{(t,k)}\bigr)g_{t,k} + o\bigl(\delta_{t,k}^{2}\,g_{t,k}^{\top}g_{t,k}\bigr),
\]
where ∇ℓ(X_i; θ̂^(t,k)) and ∇²ℓ(X_i; θ̂^(t,k)) denote the first- and second-order derivatives of the local loss function, respectively. Plugging this expansion into the definition of ∆ℓ(δ_{t,k}) and collecting terms, the reduction is
\[
\Delta\ell(\delta_{t,k}) = -\frac{1}{n}\sum_{i\in S_k}\nabla\ell\bigl(X_i;\hat{\theta}^{(t,k)}\bigr)^{\top}g_{t,k}\,\delta_{t,k} + \frac{1}{2n}\sum_{i\in S_k}g_{t,k}^{\top}\nabla^{2}\ell\bigl(X_i;\hat{\theta}^{(t,k)}\bigr)g_{t,k}\,\delta_{t,k}^{2} + o\bigl(n^{-1}\delta_{t,k}^{2}\,g_{t,k}^{\top}g_{t,k}\bigr). \tag{1}
\]
According to (1), ∆ℓ(δ_{t,k}) is a quadratic function of δ_{t,k}. For simplicity, the coefficients of the linear term and the quadratic term are denoted as
\[
a_{t,k} = \frac{1}{n}\sum_{i\in S_k}\nabla\ell\bigl(X_i;\hat{\theta}^{(t,k)}\bigr)^{\top}g_{t,k}
\quad\text{and}\quad
b_{t,k} = \frac{1}{2n}\sum_{i\in S_k}g_{t,k}^{\top}\nabla^{2}\ell\bigl(X_i;\hat{\theta}^{(t,k)}\bigr)g_{t,k},
\]
respectively. Since the Taylor remainder here is negligible, (1) can be written simply as
\[
\Delta\ell(\delta_{t,k}) \approx -a_{t,k}\,\delta_{t,k} + b_{t,k}\,\delta_{t,k}^{2}. \tag{2}
\]
To achieve the greatest loss reduction, we minimize ∆ℓ(δ_{t,k}) with respect to δ_{t,k} by setting its derivative to zero, which leads to
\[
\frac{\partial\,\Delta\ell(\delta_{t,k})}{\partial\,\delta_{t,k}} \approx -a_{t,k} + 2b_{t,k}\,\delta_{t,k} = 0.
\]
As a result, the optimal learning rate in this batch step can be approximated by
\[
\delta_{t,k}^{*} = (2b_{t,k})^{-1}a_{t,k}. \tag{3}
\]
Note that the computation of b_{t,k} involves the first- and second-order derivatives. For a general loss function, this calculation may be computationally expensive in real applications. Thus, an approximation step is preferred to improve the computational efficiency.

(2) Approximation Step. To compute the coefficients a_{t,k} and b_{t,k} while avoiding the computation of second derivatives, we consider the following approximation method. The basic idea is to build two equations with respect to the two unknown coefficients.

Let g_{t,k} be a given estimate of the gradient direction. We then compute
\[
\sum_{i\in S_k}\ell\bigl(X_i;\hat{\theta}^{(t,k)} - \delta_0 g_{t,k}\bigr) = \sum_{i\in S_k}\ell\bigl(X_i;\hat{\theta}^{(t,k)}\bigr) - a_{t,k}\,\delta_0 n + b_{t,k}\,\delta_0^{2} n, \tag{4}
\]
\[
\sum_{i\in S_k}\ell\bigl(X_i;\hat{\theta}^{(t,k)} + \delta_0 g_{t,k}\bigr) = \sum_{i\in S_k}\ell\bigl(X_i;\hat{\theta}^{(t,k)}\bigr) + a_{t,k}\,\delta_0 n + b_{t,k}\,\delta_0^{2} n, \tag{5}
\]
for a reasonably small learning rate δ_0. A natural choice for δ_0 is δ*_{t,k−1} if k > 1 and δ*_{t−1,K} if k = 1. By solving (4) and (5), we have
\[
\tilde{b}_{t,k} = \frac{1}{2n\delta_0^{2}}\sum_{i\in S_k}\Bigl[\ell\bigl(X_i;\hat{\theta}^{(t,k)} + \delta_0 g_{t,k}\bigr) + \ell\bigl(X_i;\hat{\theta}^{(t,k)} - \delta_0 g_{t,k}\bigr) - 2\,\ell\bigl(X_i;\hat{\theta}^{(t,k)}\bigr)\Bigr], \tag{6}
\]
\[
\tilde{a}_{t,k} = \frac{1}{2n\delta_0}\sum_{i\in S_k}\Bigl[\ell\bigl(X_i;\hat{\theta}^{(t,k)} + \delta_0 g_{t,k}\bigr) - \ell\bigl(X_i;\hat{\theta}^{(t,k)} - \delta_0 g_{t,k}\bigr)\Bigr], \tag{7}
\]
where ã_{t,k} and b̃_{t,k} serve as approximations of a_{t,k} and b_{t,k}, respectively. Then, we apply these results back to (3), which gives the approximated optimal learning rate δ̂*_{t,k}. Because δ̂*_{t,k} is optimally selected, the reduction in the loss function is nearly optimal for each batch step. As a consequence, the total number of iterations required for convergence can be much reduced, which makes the whole algorithm converge much faster than usual. In summary, Algorithm 1 gives the pseudocode of the proposed method.

Algorithm 1 Local quadratic approximation algorithm
Require: T: number of iterations; K: number of batches within one iteration; ℓ(θ): loss function with parameters θ; θ_0: initial estimate for the parameters (e.g., a zero vector); δ_0: initial (small) learning rate.
t ← 1;
θ̂^(1,1) ← θ_0;
while t ≤ T do
    k ← 1;
    while k ≤ K − 1 do
        Compute the gradient g_{t,k};
        Compute ã_{t,k} and b̃_{t,k} according to (6) and (7);
        δ̂*_{t,k} ← (2 b̃_{t,k})^{-1} ã_{t,k};
        θ̂^(t,k+1) ← θ̂^(t,k) − δ̂*_{t,k} g_{t,k};
        k ← k + 1;
    end while
    Compute the gradient g_{t,K};
    Compute ã_{t,K} and b̃_{t,K} according to (6) and (7);
    δ̂*_{t,K} ← (2 b̃_{t,K})^{-1} ã_{t,K};
    θ̂^(t+1,1) ← θ̂^(t,K) − δ̂*_{t,K} g_{t,K};
    t ← t + 1;
end while
return θ̂^(T+1,1), the resulting estimate.
It is remarkable that the computational cost required for calculating the optimal learning rate is negligible. The main cost is due to the calculation of the loss function values (not its derivatives) at two different points. The cost of this step is substantially smaller than that of computing the gradient, and this is particularly true if the dimension of the unknown parameter θ is ultrahigh.

4 EXPERIMENTS
In this section, we empirically evaluate the proposed method based on different models and compare it with various optimizers under different parameter settings. The details are listed as follows.

Classification Model. To demonstrate the robustness of the proposed method, we consider three classic models. They are multinomial logistic regression, multilayer perceptron (MLP), and deep convolutional neural network (CNN) models.

Competing Optimizers. For comparison purposes, we compare the proposed LQA method with other popular optimizers. They are the standard SGD method, the SGD method with momentum, the SGD method based on Nesterov's accelerated gradient (NAG), AdaGrad, RMSProp and Adam. For simplicity, we use "SGD-M" to denote the SGD method with momentum and "SGD-NAG" to denote the SGD method based on the NAG algorithm.

Parameter Settings. For the competing optimizers, we adopt three different learning rates, δ = 0.1, 0.01, and 0.001. For all the optimizers, the minibatch size is 64, and the initial values of all the parameters are zero. If there are other hyperparameters (e.g., decay rates) in the models, they are set to their default values.

Performance Measurement. To gauge the performance of the optimizers, we report the training loss of the different optimizers, which is defined as the negative log-likelihood function. The results of the different optimizers in each iteration are shown in figures for comparison purposes.
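For concreteness, the competing configurations described above could be instantiated as follows (a PyTorch-style sketch under our own assumptions: the paper does not name the software framework, and the momentum value of 0.9 for SGD-M and SGD-NAG is an assumption, since only the learning-rate grid, the batch size, and the zero initialization are stated):

```python
import torch.nn as nn
from torch import optim

LEARNING_RATES = [0.1, 0.01, 0.001]   # grid used for the competing optimizers
BATCH_SIZE = 64                       # minibatch size stated in the settings

def zero_init(model):
    """All parameters are initialized at zero, as stated in the settings."""
    for p in model.parameters():
        nn.init.zeros_(p)
    return model

def make_optimizer(name, params, lr):
    """Competing optimizers; hyperparameters other than lr are left at defaults."""
    if name == "SGD":
        return optim.SGD(params, lr=lr)
    if name == "SGD-M":
        return optim.SGD(params, lr=lr, momentum=0.9)  # momentum value assumed
    if name == "SGD-NAG":
        return optim.SGD(params, lr=lr, momentum=0.9, nesterov=True)
    if name == "AdaGrad":
        return optim.Adagrad(params, lr=lr)
    if name == "RMSProp":
        return optim.RMSprop(params, lr=lr)
    if name == "Adam":
        return optim.Adam(params, lr=lr)
    raise ValueError(f"unknown optimizer: {name}")
```

LQA itself takes no learning-rate grid; only the small initial trial rate δ_0 of Algorithm 1 needs to be supplied.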
4.1 Multinomial Logistic Regression
We first compare the performance of the different optimizers on multinomial logistic regression, which has a convex objective function. We consider the MNIST dataset [17] for illustration. The dataset consists of a total of 70,000 (28×28) images of handwritten digits, each of which corresponds to a 10-dimensional one-hot vector as its label. The images are flattened into 784-dimensional vectors to train the logistic regression classifier. Figure 1 displays the performance of the proposed LQA method against the competing optimizers. We can draw the following conclusions.

Learning Rate. For the competing optimizers, the training loss curves of different learning rates clearly have different shapes, which means different convergence speeds, because the convergence speed is greatly affected by δ. The best δ in this case is 0.01 for the SGD, SGD-M, SGD-NAG and AdaGrad methods, whereas that for RMSProp and Adam is 0.001. Note that for RMSProp and Adam, the loss may fail to converge with an inappropriate δ (e.g., δ = 0.1). It is remarkable that with the LQA method, the learning rate is automatically determined and dynamically updated. Thus, the proposed method provides an automatic solution for the training of multinomial logistic regression classifiers. Next, we compare LQA and the competing optimizers with their best learning rates.

Loss Reduction. First, the loss curve of LQA remains lower than those of the SGD optimizers during the whole training process. This finding means that LQA converges faster than the SGD optimizers. For example, LQA reduces the loss to 0.256 in the first 10 iterations, while it takes 40 iterations for the standard SGD optimizer with δ = 0.1 to reach the same level. Second, for AdaGrad and RMSProp with δ = 0.1, although the training loss curves are slightly lower than that of LQA in the early stages (e.g., the first 5 iterations), LQA performs better in the later stages. Third, the best performance for Adam in this case is achieved with δ = 0.001. Although the performances of LQA and Adam with δ = 0.001 are quite similar, LQA has lower loss values in the early stages.

4.2 Multilayer Perceptron
MLP models are powerful neural network models and have been widely used in machine learning tasks [22]. They contain multiple fully connected layers with activation functions between those layers. An MLP can approximate arbitrary continuous functions over compact input sets [9].

To investigate the performance of the proposed method in this case, we again consider the MNIST dataset. Following the model setting in [11], the MLP is built with 2 fully connected hidden layers, each of which has 1,000 units, and the ReLU function is adopted as the activation function. Figure 2 shows the performances of the different optimizers. The following conclusions can be drawn.

Learning Rate. In this case, the best learning rates for the competing methods are quite different: (1) for the standard SGD method, the best learning rate is δ = 0.1; (2) for the SGD-M, SGD-NAG and AdaGrad methods, the best learning rate is 0.01; (3) for RMSProp and Adam, δ = 0.001 is the best. It is remarkable that even for the same optimizer, different learning rates can lead to different performance if the model changes. Thus, determining the appropriate learning rate in practice may depend on expert experience and subjective judgement. In contrast, the proposed method avoids such effort in choosing δ and gives a comparable and robust performance.

Loss Reduction. First, compared with the standard SGD optimizers, LQA performs much better, as can be seen from its lower training loss curve. Second, compared with the SGD-M, SGD-NAG, RMSProp and Adam optimizers, the performance of LQA is comparable to their best performances in the early stages (e.g., the first 5 iterations). In the later stages, the LQA method continues to reduce the loss, which makes the training loss curve of LQA lower than those of the other methods. For example, the smallest loss achieved by the Adam optimizer in the 20th iteration is 0.011, while that of the LQA method is 0.002. Third, although AdaGrad converges faster with δ = 0.01, the performance of the proposed method is slightly better than that of AdaGrad after the 12th iteration.

4.3 Deep Convolutional Neural Networks
CNNs have brought remarkable breakthroughs in computer vision tasks over the past two decades [7, 10, 14] and play a critical role in various industrial applications, such as face recognition [28] and driverless vehicles [18]. In this subsection, we investigate the performance of the LQA method for the training of CNNs. Two classic CNNs are considered: LeNet [17] and ResNet [7]. More specifically, LeNet-5 and ResNet-18 are studied in this paper. The MNIST and CIFAR10 [13] datasets are used to demonstrate the performance. The CIFAR10 dataset contains 60,000 (32×32) RGB images, which are divided into 10 classes.

LeNet. Figure 3 and Figure 4 show the results of the experiments on the MNIST and CIFAR10 datasets, respectively. The following conclusions can be drawn: (1) For both datasets, the loss curves of the LQA method remain lower than those of the standard SGD and AdaGrad optimizers. This finding suggests that LQA converges faster than those optimizers during the whole training process. (2) LQA performs similarly to the SGD-M, SGD-NAG, RMSProp and Adam optimizers in the early stages (e.g., the first 20 iterations). However, in the later stages, the proposed method can further reduce the loss and leads to a lower loss than those optimizers after the same number of iterations. (3) For the CIFAR10 dataset, a large δ (e.g., δ = 0.1) may lead to an unstable loss curve for the standard SGD optimizer. Although the loss curve of LQA is unstable in the early stages of training, it becomes smooth in the later stages, because the proposed method is able to automatically and adaptively adjust the update step size to accelerate training. It is fairly robust.

ResNet. Figure 5 displays the training loss of ResNet-18 for the different optimizers on the CIFAR10 dataset. Accordingly, we draw the following conclusions. First, the LQA method performs similarly to the other optimizers in the early stages of training (e.g., the first 15 iterations). However, it converges faster in the later stages. In particular, the proposed method leads to a lower loss than RMSProp and Adam within the same number of iterations. Second, in this case, the loss curves of the SGD optimizers and AdaGrad are quite unstable during the whole training period. The LQA method is much more stable in the later stages of the training than in the early stages.
5 CONCLUSIONS
In this work, we propose LQA, a novel approach to determine
the nearly optimal learning rate for automatic optimization.
Our method has three important features. First, the learning rate is automatically estimated in each update step. Second, it is dynamically adjusted during the whole training process. Third, given the gradient direction, the learning rate
leads to nearly the greatest reduction in the loss function. Experiments on openly available datasets demonstrate its effectiveness.

We discuss two interesting topics for future research. First, the optimal learning rate derived by LQA is shared by all dimensions of the parameter estimate. A potential extension is to allow for different optimal learning rates for different dimensions. Second, in this paper, we focus on accelerating the training of the network models. We do not discuss the overfitting issue or the sparsity of the gradients. To further improve the performance of the proposed method, it is possible to combine dropout or sparsity penalties with LQA.

REFERENCES
[1] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. 2016. Learning to learn by gradient descent by gradient descent. In Advances in neural information processing systems. 3981–3989.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
[3] Andrew Cotter, Ohad Shamir, Nati Srebro, and Karthik Sridharan. 2011. Better mini-batch algorithms via accelerated gradient methods. In Advances in neural information processing systems. 1647–1655.
[4] John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research 12, Jul (2011), 2121–2159.
[5] Yoav Goldberg and Omer Levy. 2014. word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722 (2014).
[6] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing. IEEE, 6645–6649.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
[8] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal processing magazine 29, 6 (2012), 82–97.
[9] Kurt Hornik. 1991. Approximation capabilities of multilayer feedforward networks. Neural networks 4, 2 (1991), 251–257.
[10] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4700–4708.
[11] Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations 2015.
[12] Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
[13] Alex Krizhevsky and Geoffrey Hinton. 2009. Learning multiple layers of features from tiny images. Technical Report.
[14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.
[15] Guanghui Lan. 2012. An optimal method for stochastic composite optimization. Mathematical Programming 133, 1-2 (2012), 365–397.
[16] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. 1989. Backpropagation applied to handwritten zip code recognition. Neural computation 1, 4 (1989), 541–551.
[17] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
[18] Peiliang Li, Xiaozhi Chen, and Shaojie Shen. 2019. Stereo r-cnn based 3d object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7644–7652.
[19] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
[20] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529–533.
[21] Yu Nesterov. 1983. A method of solving a convex programming problem with convergence rate O(1/k²). In Soviet Mathematics Doklady, Vol. 27. 372–376.
[22] Hassan Ramchoun, Mohammed Amine Janati Idrissi, Youssef Ghanou, and Mohamed Ettaouil. 2016. Multilayer Perceptron: Architecture Optimization and Training. International Journal of Interactive Multimedia and Artificial Intelligence 4, 1 (2016), 26–30.
[23] Herbert Robbins and Sutton Monro. 1951. A stochastic approximation method. The annals of mathematical statistics (1951), 400–407.
[24] David E Rumelhart, Richard Durbin, Richard Golden, and Yves Chauvin. 1995. Backpropagation: The basic theory. Backpropagation: Theory, architectures and applications (1995), 1–34.
[25] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. 1986. Learning representations by back-propagating errors. Nature 323, 6088 (1986), 533–536.
[26] Tijmen Tieleman and Geoffrey Hinton. 2012. RMSProp, COURSERA: Neural networks for machine learning. Technical Report.
[27] Paul Tseng. 1998. An incremental gradient (-projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization 8, 2 (1998), 506–531.
[28] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. 2016. A discriminative feature learning approach for deep face recognition. In European conference on computer vision. Springer, 499–515.
[29] Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, Andreas Stolcke, Dong Yu, and Geoffrey Zweig. 2016. Achieving human parity in conversational speech recognition. arXiv preprint arXiv:1610.05256 (2016).
[30] Matthew D Zeiler. 2012. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701 (2012).