Automatic, Dynamic, and Nearly Optimal Learning Rate Specification by Local Quadratic Approximation
University of China
Beijing, China
1 INTRODUCTION
Given a differentiable objective function, gradient descent is a natural and efficient method for optimization. Among various gradient descent methods, the stochastic gradient descent (SGD) method [23] plays a critical role. In the standard SGD method, the first-order gradient of a randomly selected sample is used to iteratively update the parameter estimates of a network. Specifically, the parameter estimates are adjusted with the negative of the random gradient multiplied by a step size. The step size is called the learning rate. Many generalized methods based on the SGD method have been proposed [1, 4, 11, 25, 26]. Most of these extensions specify improved update rules to adjust the direction or the step size. However, [1] pointed out that many hand-designed update rules are designed for circumstances with certain characteristics, such as sparsity or nonconvexity. As a result, rule-based methods might perform well in some cases but poorly in others. Consequently, an optimizer with an automatically adjusted update rule is preferable.

An update rule contains two important components: one is the update direction, and the other is the step size. The learning rate determines the step size, which plays a significant role in optimization. If it is set inappropriately, the parameter estimates could be suboptimal. Empirical experience suggests that a relatively large learning rate might be preferred in the early stages of the optimization; otherwise, the algorithm might converge very slowly. In contrast, a relatively small learning rate should be used in the later stages; otherwise, the objective function cannot be fully optimized. This phenomenon inspires us to design a method to automatically search for an optimal learning rate in each update step during optimization.

To this end, we propose here a novel optimization method based on local quadratic approximation (LQA). It tunes the learning rate in a dynamic, automatic and nearly optimal manner. The method can obtain the best step size in each update step. Intuitively, given a search direction, what should be the best step size? One natural definition is the step size that can lead to the greatest reduction in the global loss. Accordingly, the step size itself should be treated as a parameter that needs to be optimized. For this purpose, the proposed method can be decomposed into two important steps: the expansion step and the approximation step. First, in the expansion step, we conduct a Taylor expansion of the loss function around the current parameter estimates. Accordingly, the objective function can be locally approximated by a quadratic function in terms of the learning rate. Then, the learning rate is also treated as a parameter to be optimized, which leads to a nearly optimal determination of the learning rate for this particular update step.

Second, to implement this idea, we need to compute the first- and second-order derivatives of the objective function along the gradient direction. One way to solve this problem is to compute the Hessian matrix of the loss function. However, this solution is computationally expensive, because many complex deep neural networks involve a large number of parameters, which makes the Hessian matrix have ultra-high dimensionality. To solve this problem, we propose here a novel approximation step. Note that, given a fixed gradient direction, the loss function can be approximated by a standard quadratic function with the learning rate as the only input variable. For a univariate quadratic function such as this, there are only two unknown coefficients: the linear term coefficient and the quadratic term coefficient. As long as these two coefficients can be determined, the optimal learning rate can be obtained. To estimate the two unknown coefficients, one can try, for example, two different but reasonably small learning rates. Then, the corresponding objective function can be evaluated. This step leads to two equations, which can be solved to estimate the two unknown coefficients in the quadratic approximation function. Thereafter, the optimal learning rate can be obtained.
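In symbols, and anticipating the notation of Section 3, the idea can be sketched as follows (a schematic restatement only; the precise batch-level construction is given in Section 3). With the gradient direction g held fixed, the loss change as a function of the learning rate δ behaves like a univariate quadratic,
\[
\Delta\ell(\delta) = \ell\bigl(\hat{\theta} - \delta g\bigr) - \ell\bigl(\hat{\theta}\bigr) \approx -a\,\delta + b\,\delta^{2},
\qquad
\delta^{*} = \frac{a}{2b},
\]
so evaluating the loss at two reasonably small trial rates yields two linear equations in the unknown coefficients (a, b), and the nearly optimal rate δ* follows immediately whenever the quadratic coefficient b is positive.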
Our contributions: We propose an automatic, dynamic and nearly optimal learning rate tuning algorithm that has the following three important features.
(1) The algorithm is automatic. In other words, it leads to an optimization method with little subjective judgment.
(2) The method is dynamic in the sense that the learning rate used in each update step is different. It is dynamically adjusted according to the current status of the loss function and the parameter estimates. Typically, larger rates are used in the earlier iterations, while smaller rates are used in the later iterations.
(3) The learning rate derived from the proposed method is nearly optimal. For each update step, by the novel quadratic approximation, the learning rate leads to almost the greatest reduction in terms of the loss function. Here, "almost" refers to the fact that the loss function is locally approximated by a quadratic function whose unknown coefficients are estimated numerically. For this particular update step, with the gradient direction fixed, and among all possible learning rates, the one determined by the proposed method results in nearly the greatest reduction in terms of the loss function.

The rest of this article is organized as follows. In Section 2, we review related work on gradient-based optimizers. Section 3 presents the proposed algorithm in detail. In Section 4, we verify the performance of the proposed method through empirical studies on open datasets. Then, concluding remarks are given in Section 5.

2 RELATED WORK
To optimize a loss function, two important components need to be specified: the update direction and the step size. Ideally, the best update direction should be the gradient of the loss function computed on the whole data. For convenience, we refer to it as the global gradient. Since the calculation of the global gradient is computationally expensive, the SGD method [23] uses a gradient estimated from a stochastic subsample in each iteration, which we refer to as a sample gradient. It leads to fairly satisfactory empirical performance. The SGD method has inspired many new optimization methods, most of which enhance their performance by improving the estimation of the global gradient direction. A natural improvement is to combine sample gradients from different update steps so that a more reliable estimate of the global gradient direction can be obtained. This improvement has led to momentum-based optimization methods, such as those proposed in [3, 15, 25, 27]. In particular, [3] adopted Nesterov's accelerated gradient algorithm [21] to further improve the calculation of the gradient direction.

There exist other optimization methods that focus on the adjustment of the step size. [4] proposed AdaGrad, in which the step size is iteratively decreased according to a prespecified function. However, it still involves a parameter related to the learning rate, which needs to be subjectively determined. More extensions of AdaGrad have been proposed, such as RMSProp [26] and AdaDelta [30]. In particular, RMSProp introduced a decay factor to adjust the weights of previous sample gradients. [11] proposed the adaptive moment estimation (Adam) method, which combines RMSProp with a momentum-based method. Accordingly, the step size and the update direction are both adjusted during each iteration. However, because step sizes are adjusted without considering the loss function, the loss reduction obtained in each update step is suboptimal. Thus, the resulting convergence rate can be further improved.

To summarize, most existing optimization methods suffer from one or both of the following two limitations. First, they are not automatic, and human intervention is required. Second, they are suboptimal because the loss reduction achieved in each update step can be further improved. These pioneering works inspired us to develop a new method for the automatic determination of the learning rate. Ideally, the new method should be automatic, with little human intervention. It should be dynamic, so that the learning rate used for each update step is particularly selected. Most importantly, in each update step, the learning rate determined by the new method should be optimal (or nearly optimal) in terms of the loss reduction, given a fixed update direction.

3 METHODOLOGY
In this section, we first introduce the notation used in this paper and the general formulation of the SGD method. Then, we propose an algorithm based on local quadratic approximation to dynamically search for an optimal learning rate. This results in a new variant of the SGD method.

3.1 Stochastic gradient descent
Assume we have a total of N samples. They are indexed by 1 ≤ i ≤ N and collected by S = {1, 2, ..., N}. For each sample, a loss function can be defined as ℓ(X_i; θ), where X_i is the input corresponding to the i-th sample and θ ∈ R^p denotes the parameter. Then the global loss function can be defined as
\[
\ell(\theta) = \frac{1}{N}\sum_{i=1}^{N}\ell(X_i;\theta) = \frac{1}{|S|}\sum_{i\in S}\ell(X_i;\theta).
\]
Ideally, one should optimize ℓ(θ) by a gradient descent algorithm. Assume there are a total of T iterations. Let θ̂^(t) be the parameter estimate obtained in the t-th iteration. Then, the estimate in the next iteration, θ̂^(t+1), is given by
\[
\hat{\theta}^{(t+1)} = \hat{\theta}^{(t)} - \delta\,\nabla\ell\bigl(\hat{\theta}^{(t)}\bigr),
\]
where δ is the learning rate and ∇ℓ(θ̂^(t)) is the gradient of the global loss function ℓ(θ) with respect to θ at θ̂^(t). More specifically, ∇ℓ(θ̂^(t)) = N^{-1} Σ_{i=1}^{N} ∇ℓ(X_i; θ̂^(t)), where ∇ℓ(X_i; θ̂^(t)) is the gradient of the local loss function for the i-th sample.

Unfortunately, such a straightforward implementation is computationally expensive if the sample size N is relatively large, which is particularly true if the dimensionality of θ is also ultrahigh. To alleviate the computational burden, researchers proposed the idea of SGD. The key idea is to randomly partition the whole sample into a number of nonoverlapping batches. For example, we can write S = ∪_{k=1}^{K} S_k, where S_k collects the indices of the samples in the k-th batch. We should have S_{k_1} ∩ S_{k_2} = ∅ for any k_1 ≠ k_2 and |S_k| = n for any 1 ≤ k ≤ K, where n is a fixed batch size. Next, instead of computing the global gradient ∇ℓ(θ̂^(t)), we can replace it by an estimate computed based on the k-th batch. More specifically, each iteration (e.g., the t-th iteration) is further decomposed into a total of K batch steps. Let θ̂^(t,k) be the estimate obtained in the k-th (1 ≤ k ≤ K) batch step of the t-th iteration. Then, we have
\[
\hat{\theta}^{(t,k+1)} = \hat{\theta}^{(t,k)} - \frac{\delta}{n}\sum_{i\in S_k}\nabla\ell\bigl(X_i;\hat{\theta}^{(t,k)}\bigr),
\]
where k = 1, ..., K − 1. In particular, θ̂^(t+1,1) = θ̂^(t,K) − δ n^{-1} Σ_{i∈S_K} ∇ℓ(X_i; θ̂^(t,K)).

By doing so, the computational burden can be alleviated. However, the tradeoff is that the batch-sample-based gradient estimate could be unstable, which is particularly true if the batch size n is relatively small. To fix this problem, various momentum-based methods have been proposed. The key idea is to record the gradients from previous iterations and integrate them together to form a more stable estimate.
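To fix ideas, one iteration of this mini-batch update can be sketched in a few lines of Python (a schematic sketch only, assuming a user-supplied function grad_fn that returns the batch-averaged gradient; it is not tied to any particular framework):

```python
def sgd_iteration(theta, batches, grad_fn, delta):
    """One iteration t of mini-batch SGD over the batches S_1, ..., S_K.

    theta   : current parameter estimate (e.g., a NumPy array)
    batches : list of index sets S_k, each of size n
    grad_fn : callable(batch, theta) -> batch-averaged gradient
    delta   : fixed learning rate
    """
    for S_k in batches:
        # theta^(t,k+1) = theta^(t,k) - (delta / n) * sum_{i in S_k} grad l(X_i; theta)
        theta = theta - delta * grad_fn(S_k, theta)
    return theta
```

The LQA method developed next keeps this loop structure but replaces the fixed delta with a learning rate recomputed at every batch step.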
3.2 Local quadratic approximation
In this work, we assume that for each batch step, the estimate of the gradient direction is given. It can be obtained by different algorithms. For example, it could be the estimate obtained by a standard SGD algorithm or an estimate that involves rule-based corrections, such as that from a momentum-based method. We focus on how to specify the learning rate in an optimal (or nearly optimal) way.

To this end, we treat the learning rate δ as an unknown parameter. It is remarkable that the optimal learning rate could change dynamically across batch steps. Thus, we use δ_{t,k} to denote the learning rate in the k-th batch step within the t-th iteration. Since the reduction in the loss in this batch step is influenced by the learning rate δ_{t,k}, we express it as a function of the learning rate, ∆ℓ(δ_{t,k}).

To find the optimal value for δ_{t,k}, we investigate the optimization of ∆ℓ(δ_{t,k}) based on the Taylor expansion. For simplicity, we use g_{t,k} = n^{-1} Σ_{i∈S_k} ∇ℓ(X_i; θ̂^(t,k)) to denote the current gradient. Given θ̂^(t,k) and g_{t,k}, the loss reduction can be expressed as
\[
\Delta\ell(\delta_{t,k}) = \frac{1}{n}\sum_{i\in S_k}\Bigl[\ell\bigl(X_i;\hat{\theta}^{(t,k+1)}\bigr) - \ell\bigl(X_i;\hat{\theta}^{(t,k)}\bigr)\Bigr] = \frac{1}{n}\sum_{i\in S_k}\Bigl[\ell\bigl(X_i;\hat{\theta}^{(t,k)} - \delta_{t,k}g_{t,k}\bigr) - \ell\bigl(X_i;\hat{\theta}^{(t,k)}\bigr)\Bigr].
\]
Then, two estimation steps are conducted to determine an appropriate value for δ_{t,k} in this batch step.

(1) Expansion Step. By a Taylor expansion of ℓ(X_i; θ) around θ̂^(t,k), we have
\[
\ell\bigl(X_i;\hat{\theta}^{(t,k)} - \delta_{t,k}g_{t,k}\bigr) = \ell\bigl(X_i;\hat{\theta}^{(t,k)}\bigr) - \nabla\ell\bigl(X_i;\hat{\theta}^{(t,k)}\bigr)^{\top}\delta_{t,k}g_{t,k} + \frac{1}{2}\delta_{t,k}^{2}\,g_{t,k}^{\top}\nabla^{2}\ell\bigl(X_i;\hat{\theta}^{(t,k)}\bigr)g_{t,k} + o\bigl(\delta_{t,k}^{2}\,g_{t,k}^{\top}g_{t,k}\bigr),
\]
where ∇ℓ(X_i; θ̂^(t,k)) and ∇²ℓ(X_i; θ̂^(t,k)) denote the first- and second-order derivatives of the local loss function, respectively. Plugging this expansion into the definition of ∆ℓ(δ_{t,k}) and collecting terms, the reduction is
\[
\Delta\ell(\delta_{t,k}) = -\frac{1}{n}\sum_{i\in S_k}\nabla\ell\bigl(X_i;\hat{\theta}^{(t,k)}\bigr)^{\top}g_{t,k}\,\delta_{t,k} + \frac{1}{2n}\sum_{i\in S_k}g_{t,k}^{\top}\nabla^{2}\ell\bigl(X_i;\hat{\theta}^{(t,k)}\bigr)g_{t,k}\,\delta_{t,k}^{2} + o\bigl(n^{-1}\delta_{t,k}^{2}\,g_{t,k}^{\top}g_{t,k}\bigr). \tag{1}
\]
According to (1), ∆ℓ(δ_{t,k}) is a quadratic function of δ_{t,k}. For simplicity, the coefficients of the linear term and the quadratic term are denoted as
\[
a_{t,k} = \frac{1}{n}\sum_{i\in S_k}\nabla\ell\bigl(X_i;\hat{\theta}^{(t,k)}\bigr)^{\top}g_{t,k}
\quad\text{and}\quad
b_{t,k} = \frac{1}{2n}\sum_{i\in S_k}g_{t,k}^{\top}\nabla^{2}\ell\bigl(X_i;\hat{\theta}^{(t,k)}\bigr)g_{t,k},
\]
respectively. Since the Taylor remainder here is negligible, (1) can be written simply as
\[
\Delta\ell(\delta_{t,k}) \approx -a_{t,k}\,\delta_{t,k} + b_{t,k}\,\delta_{t,k}^{2}. \tag{2}
\]
To achieve the greatest loss reduction, we minimize ∆ℓ(δ_{t,k}) with respect to δ_{t,k} by setting its derivative to zero, which leads to
\[
\frac{\partial\,\Delta\ell(\delta_{t,k})}{\partial\,\delta_{t,k}} \approx -a_{t,k} + 2b_{t,k}\,\delta_{t,k} = 0.
\]
As a result, the optimal learning rate in this batch step can be approximated by
\[
\delta_{t,k}^{*} = (2b_{t,k})^{-1}a_{t,k}. \tag{3}
\]
Note that the computation of b_{t,k} involves the first- and second-order derivatives. For a general loss function, this calculation may be computationally expensive in real applications. Thus, an approximation step is preferred to improve the computational efficiency.

(2) Approximation Step. To compute the coefficients a_{t,k} and b_{t,k} while avoiding the computation of second derivatives, we consider the following approximation method. The basic idea is to build two equations with respect to the two unknown coefficients.

Let g_{t,k} be a given estimate of the gradient direction. We then compute
\[
\sum_{i\in S_k}\ell\bigl(X_i;\hat{\theta}^{(t,k)} - \delta_0 g_{t,k}\bigr) = \sum_{i\in S_k}\ell\bigl(X_i;\hat{\theta}^{(t,k)}\bigr) - a_{t,k}\,\delta_0 n + b_{t,k}\,\delta_0^{2} n, \tag{4}
\]
\[
\sum_{i\in S_k}\ell\bigl(X_i;\hat{\theta}^{(t,k)} + \delta_0 g_{t,k}\bigr) = \sum_{i\in S_k}\ell\bigl(X_i;\hat{\theta}^{(t,k)}\bigr) + a_{t,k}\,\delta_0 n + b_{t,k}\,\delta_0^{2} n, \tag{5}
\]
for a reasonably small learning rate δ_0. A natural choice for δ_0 is δ*_{t,k−1} if k > 1 and δ*_{t−1,K} if k = 1. By solving (4) and (5), we have
\[
\tilde{b}_{t,k} = \frac{1}{2n\delta_0^{2}}\sum_{i\in S_k}\Bigl[\ell\bigl(X_i;\hat{\theta}^{(t,k)} + \delta_0 g_{t,k}\bigr) + \ell\bigl(X_i;\hat{\theta}^{(t,k)} - \delta_0 g_{t,k}\bigr) - 2\,\ell\bigl(X_i;\hat{\theta}^{(t,k)}\bigr)\Bigr], \tag{6}
\]
\[
\tilde{a}_{t,k} = \frac{1}{2n\delta_0}\sum_{i\in S_k}\Bigl[\ell\bigl(X_i;\hat{\theta}^{(t,k)} + \delta_0 g_{t,k}\bigr) - \ell\bigl(X_i;\hat{\theta}^{(t,k)} - \delta_0 g_{t,k}\bigr)\Bigr], \tag{7}
\]
where ã_{t,k} and b̃_{t,k} serve as approximations of a_{t,k} and b_{t,k}, respectively. Then, we apply these results back to (3), which gives the approximated optimal learning rate δ̂*_{t,k}. Because δ̂*_{t,k} is optimally selected, the reduction in the loss function is nearly optimal for each batch step. As a consequence, the total number of iterations required for convergence can be much reduced, which makes the whole algorithm converge much faster than usual. In summary, Algorithm 1 gives the pseudocode of the proposed method.

Algorithm 1 Local quadratic approximation algorithm
Require: T: number of iterations; K: number of batches within one iteration; ℓ(θ): loss function with parameters θ; θ_0: initial estimate for the parameters (e.g., a zero vector); δ_0: initial (small) learning rate.
t ← 1;
θ̂^(1,1) ← θ_0;
while t ≤ T do
    k ← 1;
    while k ≤ K − 1 do
        Compute the gradient g_{t,k};
        Compute ã_{t,k} and b̃_{t,k} according to (6) and (7);
        δ̂*_{t,k} ← (2 b̃_{t,k})^{-1} ã_{t,k};
        θ̂^(t,k+1) ← θ̂^(t,k) − δ̂*_{t,k} g_{t,k};
        k ← k + 1;
    end while
    Compute the gradient g_{t,K};
    Compute ã_{t,K} and b̃_{t,K} according to (6) and (7);
    δ̂*_{t,K} ← (2 b̃_{t,K})^{-1} ã_{t,K};
    θ̂^(t+1,1) ← θ̂^(t,K) − δ̂*_{t,K} g_{t,K};
    t ← t + 1;
end while
return θ̂^(T+1,1), the resulting estimate.
It is remarkable that the computational cost required for calculating the optimal learning rate is negligible. The main cost is due to the calculation of the loss function values (not its derivatives) at two different points. The cost of this step is substantially smaller than that of computing the gradient, and this is particularly true if the dimension of the unknown parameter θ is ultrahigh.

4 EXPERIMENTS
In this section, we empirically evaluate the proposed method based on different models and compare it with various optimizers under different parameter settings. The details are listed as follows.

Classification Model. To demonstrate the robustness of the proposed method, we consider three classic models. They are multinomial logistic regression, multilayer perceptron (MLP), and deep convolutional neural network (CNN) models.

Competing Optimizers. For comparison purposes, we compare the proposed LQA method with other popular optimizers. They are the standard SGD method, the SGD method with momentum, the SGD method based on Nesterov's accelerated gradient (NAG), AdaGrad, RMSProp and Adam. For simplicity, we use "SGD-M" to denote the SGD method with momentum and "SGD-NAG" to denote the SGD method based on the NAG algorithm.

Parameter Settings. For the competing optimizers, we adopt three different learning rates, δ = 0.1, 0.01, and 0.001. For all the optimizers, the minibatch size is 64, and the initial values of all the parameters are zero. If there are other hyperparameters (e.g., decay rates) in the models, they are set to their default values.

Performance Measurement. To gauge the performance of the optimizers, we report the training loss of the different optimizers, which is defined as the negative log-likelihood function. The results of the different optimizers in each iteration are shown in figures for comparison purposes.
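For concreteness, the competing configurations described above could be instantiated as follows (a PyTorch-style sketch under our own assumptions: the paper does not name the software framework, and the momentum value of 0.9 for SGD-M and SGD-NAG is an assumption, since only the learning-rate grid, the batch size, and the zero initialization are stated):

```python
import torch.nn as nn
from torch import optim

LEARNING_RATES = [0.1, 0.01, 0.001]   # grid used for the competing optimizers
BATCH_SIZE = 64                       # minibatch size stated in the settings

def zero_init(model):
    """All parameters are initialized at zero, as stated in the settings."""
    for p in model.parameters():
        nn.init.zeros_(p)
    return model

def make_optimizer(name, params, lr):
    """Competing optimizers; hyperparameters other than lr are left at defaults."""
    if name == "SGD":
        return optim.SGD(params, lr=lr)
    if name == "SGD-M":
        return optim.SGD(params, lr=lr, momentum=0.9)  # momentum value assumed
    if name == "SGD-NAG":
        return optim.SGD(params, lr=lr, momentum=0.9, nesterov=True)
    if name == "AdaGrad":
        return optim.Adagrad(params, lr=lr)
    if name == "RMSProp":
        return optim.RMSprop(params, lr=lr)
    if name == "Adam":
        return optim.Adam(params, lr=lr)
    raise ValueError(f"unknown optimizer: {name}")
```

LQA itself takes no learning-rate grid; only the small initial trial rate δ_0 of Algorithm 1 needs to be supplied.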
4.1 Multinomial Logistic Regression
We first compare the performance of the different optimizers on multinomial logistic regression, which has a convex objective function. We consider the MNIST dataset [17] for illustration. The dataset consists of a total of 70,000 (28×28) images of handwritten digits, each of which corresponds to a 10-dimensional one-hot vector as its label. The images are flattened into 784-dimensional vectors to train the logistic regression classifier. Figure 1 displays the performance of the proposed LQA method against the competing optimizers. We can draw the following conclusions.

Learning Rate. For the competing optimizers, the training loss curves of different learning rates clearly have different shapes, which means different convergence speeds, because the convergence speed is greatly affected by δ. The best δ in this case is 0.01 for the SGD, SGD-M, SGD-NAG and AdaGrad methods, whereas that for RMSProp and Adam is 0.001. Note that for RMSProp and Adam, the loss may fail to converge with an inappropriate δ (e.g., δ = 0.1). It is remarkable that with the LQA method, the learning rate is automatically determined and dynamically updated. Thus, the proposed method provides an automatic solution for the training of multinomial logistic regression classifiers. Next, we compare LQA and the competing optimizers with their best learning rates.

Loss Reduction. First, the loss curve of LQA remains lower than those of the SGD optimizers during the whole training process. This finding means that LQA converges faster than the SGD optimizers. For example, LQA reduces the loss to 0.256 in the first 10 iterations, while it takes 40 iterations for the standard SGD optimizer with δ = 0.1 to reach the same level. Second, for AdaGrad and RMSProp with δ = 0.1, although the training loss curves are slightly lower than that of LQA in the early stages (e.g., the first 5 iterations), LQA performs better in the later stages. Third, the best performance for Adam in this case is achieved with δ = 0.001. Although the performances of LQA and Adam with δ = 0.001 are quite similar, LQA has lower loss values in the early stages.

4.2 Multilayer Perceptron
MLP models are powerful neural network models and have been widely used in machine learning tasks [22]. They contain multiple fully connected layers with activation functions between those layers. An MLP can approximate arbitrary continuous functions over compact input sets [9].

To investigate the performance of the proposed method in this case, we again consider the MNIST dataset. Following the model setting in [11], the MLP is built with 2 fully connected hidden layers, each of which has 1,000 units, and the ReLU function is adopted as the activation function. Figure 2 shows the performances of the different optimizers. The following conclusions can be drawn.

Learning Rate. In this case, the best learning rates for the competing methods are quite different: (1) for the standard SGD method, the best learning rate is δ = 0.1; (2) for the SGD-M, SGD-NAG and AdaGrad methods, the best learning rate is 0.01; (3) for RMSProp and Adam, δ = 0.001 is the best. It is remarkable that even for the same optimizer, different learning rates can lead to different performance if the model changes. Thus, determining the appropriate learning rate in practice may depend on expert experience and subjective judgement. In contrast, the proposed method avoids such effort in choosing δ and gives a comparable and robust performance.

Loss Reduction. First, compared with the standard SGD optimizers, LQA performs much better, as can be seen from its lower training loss curve. Second, compared with the SGD-M, SGD-NAG, RMSProp and Adam optimizers, the performance of LQA is comparable to their best performances in the early stages (e.g., the first 5 iterations). In the later stages, the LQA method continues to reduce the loss, which makes the training loss curve of LQA lower than those of the other methods. For example, the smallest loss achieved by the Adam optimizer in the 20th iteration is 0.011, while that of the LQA method is 0.002. Third, although AdaGrad converges faster with δ = 0.01, the performance of the proposed method is slightly better than that of AdaGrad after the 12th iteration.

4.3 Deep Convolutional Neural Networks
CNNs have brought remarkable breakthroughs in computer vision tasks over the past two decades [7, 10, 14] and play a critical role in various industrial applications, such as face recognition [28] and driverless vehicles [18]. In this subsection, we investigate the performance of the LQA method for the training of CNNs. Two classic CNNs are considered: LeNet [17] and ResNet [7]. More specifically, LeNet-5 and ResNet-18 are studied in this paper. The MNIST and CIFAR10 [13] datasets are used to demonstrate the performance. The CIFAR10 dataset contains 60,000 (32×32) RGB images, which are divided into 10 classes.

LeNet. Figure 3 and Figure 4 show the results of the experiments on the MNIST and CIFAR10 datasets, respectively. The following conclusions can be drawn: (1) For both datasets, the loss curves of the LQA method remain lower than those of the standard SGD and AdaGrad optimizers. This finding suggests that LQA converges faster than those optimizers during the whole training process. (2) LQA performs similarly to the SGD-M, SGD-NAG, RMSProp and Adam optimizers in the early stages (e.g., the first 20 iterations). However, in the later stages, the proposed method can further reduce the loss and leads to a lower loss than those optimizers after the same number of iterations. (3) For the CIFAR10 dataset, a large δ (e.g., δ = 0.1) may lead to an unstable loss curve for the standard SGD optimizer. Although the loss curve of LQA is unstable in the early stages of training, it becomes smooth in the later stages, because the proposed method is able to automatically and adaptively adjust the update step size to accelerate training. It is fairly robust.

ResNet. Figure 5 displays the training loss of ResNet-18 for the different optimizers on the CIFAR10 dataset. Accordingly, we draw the following conclusions. First, the LQA method performs similarly to the other optimizers in the early stages of training (e.g., the first 15 iterations). However, it converges faster in the later stages. In particular, the proposed method leads to a lower loss than RMSProp and Adam within the same number of iterations. Second, in this case, the loss curves of the SGD optimizers and AdaGrad are quite unstable during the whole training period. The LQA method is much more stable in the later stages of the training than in the early stages.
5 CONCLUSIONS
In this work, we propose LQA, a novel approach to determine
the nearly optimal learning rate for automatic optimization.
Our method has three important features. First, the learning rate is automatically estimated in each update step. Second, it is dynamically adjusted during the whole training process. Third, given the gradient direction, the learning rate
leads to nearly the greatest reduction in the loss function. Experiments on openly available datasets demonstrate its effectiveness.

We discuss two interesting topics for future research. First, the optimal learning rate derived by LQA is shared by all dimensions of the parameter estimate. A potential extension is to allow for different optimal learning rates for different dimensions. Second, in this paper, we focus on accelerating the training of the network models. We do not discuss the overfitting issue or the sparsity of the gradients. To further improve the performance of the proposed method, it is possible to combine dropout or sparsity penalties with LQA.

REFERENCES
[1] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. 2016. Learning to learn by gradient descent by gradient descent. In Advances in neural information processing systems. 3981–3989.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
[3] Andrew Cotter, Ohad Shamir, Nati Srebro, and Karthik Sridharan. 2011. Better mini-batch algorithms via accelerated gradient methods. In Advances in neural information processing systems. 1647–1655.
[4] John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research 12, Jul (2011), 2121–2159.
[5] Yoav Goldberg and Omer Levy. 2014. word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722 (2014).
[6] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing. IEEE, 6645–6649.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
[8] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal processing magazine 29, 6 (2012), 82–97.
[9] Kurt Hornik. 1991. Approximation capabilities of multilayer feedforward networks. Neural networks 4, 2 (1991), 251–257.
[10] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4700–4708.
[11] Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations 2015.
[12] Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
[13] Alex Krizhevsky and Geoffrey Hinton. 2009. Learning multiple layers of features from tiny images. Technical Report.
[14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.
[15] Guanghui Lan. 2012. An optimal method for stochastic composite optimization. Mathematical Programming 133, 1-2 (2012), 365–397.
[16] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. 1989. Backpropagation applied to handwritten zip code recognition. Neural computation 1, 4 (1989), 541–551.
[17] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
[18] Peiliang Li, Xiaozhi Chen, and Shaojie Shen. 2019. Stereo r-cnn based 3d object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7644–7652.
[19] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
[20] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529–533.
[21] Yu Nesterov. 1983. A method of solving a convex programming problem with convergence rate O(1/k²). In Soviet Mathematics Doklady, Vol. 27. 372–376.
[22] Hassan Ramchoun, Mohammed Amine Janati Idrissi, Youssef Ghanou, and Mohamed Ettaouil. 2016. Multilayer Perceptron: Architecture Optimization and Training. International Journal of Interactive Multimedia and Artificial Intelligence 4, 1 (2016), 26–30.
[23] Herbert Robbins and Sutton Monro. 1951. A stochastic approximation method. The annals of mathematical statistics (1951), 400–407.
[24] David E Rumelhart, Richard Durbin, Richard Golden, and Yves Chauvin. 1995. Backpropagation: The basic theory. Backpropagation: Theory, architectures and applications (1995), 1–34.
[25] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. 1986. Learning representations by back-propagating errors. Nature 323, 6088 (1986), 533–536.
[26] Tijmen Tieleman and Geoffrey Hinton. 2012. RMSProp, COURSERA: Neural networks for machine learning. Technical Report.
[27] Paul Tseng. 1998. An incremental gradient (-projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization 8, 2 (1998), 506–531.
[28] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. 2016. A discriminative feature learning approach for deep face recognition. In European conference on computer vision. Springer, 499–515.
[29] Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, Andreas Stolcke, Dong Yu, and Geoffrey Zweig. 2016. Achieving human parity in conversational speech recognition. arXiv preprint arXiv:1610.05256 (2016).
[30] Matthew D Zeiler. 2012. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701 (2012).