
Optimization Algorithms for Deep Learning

Piji Li
Department of Systems Engineering and Engineering Management,
The Chinese University of Hong Kong
[email protected]

Abstract

Gradient descent algorithms are the most important and popular techniques for
optimizing deep learning models. Given large-scale datasets and limited
computation memory, especially on GPUs, the traditional batch gradient descent
and stochastic gradient descent methods cannot conduct training effectively and
efficiently. Moreover, tuning the step size (learning rate) also plays a critical
role during the training procedure. To address these problems, several variants
of gradient methods have been proposed. In this paper, we investigate the basic
methodology and properties of these methods. We also conduct experiments on a
variety of tasks to compare their performance.

1 Introduction

Nowadays, Deep Learning [3] shows state-of-the-art performance on various tasks in different
fields, such as speech recognition, computer vision, and natural language processing. There is a
huge number of variants of deep architectures, such as the Deep Neural Network (DNN), Deep
Belief Network (DBN), Convolutional Neural Network (CNN), and Recurrent Neural Network
(RNN) [3]. The core technique behind AlphaGo is also a neural network [14].
Back-Propagation (BP) [13] is the common method for training neural networks. The algorithm
repeats a two-phase cycle of propagation and weight update. Errors obtained at the output layer are
propagated backward to the other nodes. These errors are used to calculate the gradient of the loss
function with respect to the weights in the network. The gradient is then fed to the optimization
method, which in turn uses it to update the weights in an attempt to minimize the loss function.
According to the amount of data used to compute the gradient of the objective function, gradient
descent methods can be divided into three categories: batch gradient descent, mini-batch gradient
descent, and stochastic gradient descent [12]. Given large-scale datasets and limited computation
memory, especially on GPUs, the traditional batch gradient descent and stochastic gradient descent
methods cannot conduct training effectively and efficiently. Moreover, step size adjustment also
plays a critical role during the training procedure, but it is very difficult to tune the step size
precisely for a task that needs a large dataset and a long training period. To address these problems,
several variants of the typical gradient methods have been proposed. In the following sections, we
introduce the basic framework of the related methods. We also conduct experiments on image
classification and headline generation to show how these methods perform.

2 Optimization Algorithms

Assume that the objective function to be minimized is f(x), with x ∈ R^n. The corresponding
gradient is \nabla f(x). The step size for iteration k is t_k.
2.1 Batch Gradient Descent

The batch gradient descent algorithm updates the parameters x after scanning the whole training set:

    x_{k+1} = x_k - t_k \nabla f(x_k)_{(1:n)}    (1)

Batch gradient descent is guaranteed to converge to the global minimum for convex problems and
to a local minimum for non-convex problems. However, in deep learning tasks the training set
contains millions or even billions of samples, so scanning the entire training set to calculate the
gradient takes a long time, and a single parameter update is too slow. Moreover, computation
memory is limited, and it is difficult to feed all the data into the model at one time. Therefore, few
deep learning models use the batch gradient descent method to handle the optimization problem.

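To make the update rule concrete, the following is a minimal NumPy sketch of a full-batch gradient descent loop. It is not the paper's Theano implementation; the least-squares objective, the synthetic data, and the fixed step size t are assumptions made for the example.

import numpy as np

# Full-batch gradient descent on an assumed objective f(x) = 0.5 * ||A x - b||^2.
rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 20))      # toy design matrix (assumption)
b = A @ rng.normal(size=20)          # toy targets (assumption)

def grad_f(x):
    # Gradient over the WHOLE training set, as required by Eq. (1).
    return A.T @ (A @ x - b)

x = np.zeros(20)
t = 1e-4                             # fixed step size t_k (assumption)
for k in range(100):
    x = x - t * grad_f(x)            # Eq. (1): one update per full data pass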

2.2 Stochastic Gradient Descent

In contrast, stochastic gradient descent (SGD) calculates the gradient and updates the parameters
for each training sample:

    x_{k+1} = x_k - t_k \nabla f(x_k)_{(i)}    (2)

However, due to the high variance across training samples, updating the parameters this frequently
causes the objective function to fluctuate heavily. Although a small step size can make SGD
converge to a good point, it makes training slow. Moreover, when we use GPUs for the computation,
the frequent data communication between GPU memory and main memory also decreases efficiency.

2.3 Mini-Batch Gradient Descent

Mini-batch gradient descent combines the advantages of batch gradient descent and stochastic
gradient descent, and updates the parameters after obtaining the gradient of a mini-batch of samples:

    x_{k+1} = x_k - t_k \nabla f(x_k)_{(i:i+m)}    (3)

where m is the mini-batch size. Mini-batch gradient descent cannot guarantee good convergence,
and tuning the step size still requires some experience. Therefore, researchers have extended it with
more useful tricks and techniques to improve convergence. For convenience, mini-batch gradient
descent is often also referred to as SGD.

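A minimal sketch of mini-batch updates, continuing the synthetic least-squares setup from the previous sketch (A, b, and rng as defined there); setting m = 1 recovers per-sample SGD and m = n recovers batch gradient descent. The epoch count, batch size, and step size are assumptions.

def grad_f_minibatch(x, idx):
    # Gradient of the same least-squares objective, restricted to rows idx.
    return A[idx].T @ (A[idx] @ x - b[idx])

x = np.zeros(20)
t, m = 1e-3, 32                      # step size and mini-batch size (assumptions)
for epoch in range(10):
    perm = rng.permutation(len(A))   # reshuffle the training set each epoch
    for start in range(0, len(A), m):
        idx = perm[start:start + m]
        x = x - t * grad_f_minibatch(x, idx)   # Eq. (3)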

2.4 Gradient Descent with Momentum

If there is a long, shallow ravine with steep walls along the direction to the optimal point, standard
SGD tends to oscillate across the narrow ravine. Momentum is one mechanism used to correct the
update direction:

    v_k = m v_{k-1} + t_k \nabla f(x_k)
    x_{k+1} = x_k - v_k    (4)

where m ∈ (0, 1] determines for how many iterations the previous gradients are incorporated into
the current update. Generally, m is set to 0.5 until the initial learning stabilizes and is then increased
to 0.9 or higher.

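A minimal sketch of the momentum update in Eq. (4); the function signature and the default coefficient m = 0.9 are illustrative assumptions.

def momentum_step(x, v, grad, t, m=0.9):
    # Eq. (4): accumulate a velocity from past gradients, then step along it.
    v = m * v + t * grad
    x = x - v
    return x, v

# usage sketch: x, v = momentum_step(x, v, grad_f(x), t=0.01),
# with v initialised to an all-zero array of the same shape as x.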

2.5 Nesterov Accelerated Gradient

We can use the Nesterov accelerated gradient (NAG) [10] to improve on momentum by looking
ahead to approximate the update direction:

    v_k = m v_{k-1} + t_k \nabla f(x_k - m v_{k-1})
    x_{k+1} = x_k - v_k    (5)

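A corresponding sketch of the NAG update in Eq. (5); note that the gradient is evaluated at the look-ahead point x - m v. Here grad_f stands in for whatever gradient function the model provides, an assumption of the example.

def nag_step(x, v, grad_f, t, m=0.9):
    # Eq. (5): evaluate the gradient at the look-ahead point x - m * v.
    v = m * v + t * grad_f(x - m * v)
    x = x - v
    return x, v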

2.6 Adagrad

Adagrad [2] scales the step size for each parameter according to the history of gradients for that
parameter, which is basically done by dividing the current gradient in the update rule by the square
root of the accumulated squared gradients:

    G_k = G_{k-1} + \nabla f(x_k)^2
    x_{k+1} = x_k - \frac{t}{\sqrt{G_k + \epsilon}} \nabla f(x_k)    (6)

where G is the accumulation of the squared historical gradients, and ε is a smoothing term that
avoids division by zero (e.g., 1e-6). The step size differs for each parameter: it is larger for
parameters whose historical gradients are small (since G is small) and smaller whenever the
historical gradients are relatively large. Therefore, we do not need to manually tune the step size t;
a default value of 0.01 can be used. However, as the accumulator G keeps growing, the effective
step size eventually shrinks toward zero, so the following methods were proposed.

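A minimal sketch of the Adagrad update in Eq. (6); elementwise NumPy operations apply the per-parameter scaling, and the defaults follow the values quoted above.

import numpy as np

def adagrad_step(x, G, grad, t=0.01, eps=1e-6):
    # Eq. (6): accumulate squared gradients and scale each coordinate's step.
    G = G + grad ** 2
    x = x - t / np.sqrt(G + eps) * grad
    return x, G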

2.7 Adadelta

Adadelta [15] is derived from Adagrad in order to address the two main drawbacks of that method:
1) the continual decay of the learning rate throughout training, and 2) the need for a manually
selected global learning rate. Adadelta combines the ideas of momentum and Adagrad. Specifically,
it scales the step size based on historical gradients, but it only uses a recent time window instead of
the whole history as Adagrad does. It also uses a component that serves as an acceleration term by
accumulating historical updates (similar to momentum). The operations of Adadelta are as follows:

    E[\nabla f(x)^2]_k = \rho E[\nabla f(x)^2]_{k-1} + (1 - \rho) \nabla f(x_k)^2
    \hat{x}_k = - \frac{\sqrt{E[\hat{x}^2]_{k-1} + \epsilon}}{\sqrt{E[\nabla f(x)^2]_k + \epsilon}} \nabla f(x_k)    (7)
    E[\hat{x}^2]_k = \rho E[\hat{x}^2]_{k-1} + (1 - \rho) \hat{x}_k^2
    x_{k+1} = x_k + \hat{x}_k

where ρ is a decay constant (e.g., 0.95) and ε is a small value (e.g., 1e-6) for numerical stability.

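A minimal sketch of the Adadelta update in Eq. (7), keeping running averages of squared gradients and squared updates; the defaults follow the values quoted above.

import numpy as np

def adadelta_step(x, Eg2, Edx2, grad, rho=0.95, eps=1e-6):
    # Eq. (7): decaying average of squared gradients ...
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2
    # ... the ratio of the RMS of past updates to the RMS of gradients sets the step
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad
    # decaying average of squared updates, then apply the update
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2
    return x + dx, Eg2, Edx2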

2.8 RMSprop

RMSprop [5] was also proposed to tackle the vanishing step size problem of Adagrad. It likewise
employs a decaying average of the historical squared gradients:

    E[\nabla f(x)^2]_k = \rho E[\nabla f(x)^2]_{k-1} + (1 - \rho) \nabla f(x_k)^2
    x_{k+1} = x_k - \frac{t}{\sqrt{E[\nabla f(x)^2]_k + \epsilon}} \nabla f(x_k)    (8)

where ρ is a decay constant (e.g., 0.9).

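A minimal sketch of the RMSprop update in Eq. (8); the step size t remains a free hyperparameter, and the default of 0.001 is an assumption of the example.

import numpy as np

def rmsprop_step(x, Eg2, grad, t=0.001, rho=0.9, eps=1e-6):
    # Eq. (8): a decaying average of squared gradients rescales the step per parameter.
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2
    x = x - t / np.sqrt(Eg2 + eps) * grad
    return x, Eg2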

2.9 Adam

Adam [6] is another method that computes an adaptive step size for each parameter. It uses decaying
averages of both the historical gradients and their squared values. The Adam update rule consists
of the following steps:

    m_k = \beta_1 m_{k-1} + (1 - \beta_1) \nabla f(x_k)
    v_k = \beta_2 v_{k-1} + (1 - \beta_2) \nabla f(x_k)^2
    \hat{m}_k = \frac{m_k}{1 - \beta_1^k}, \quad \hat{v}_k = \frac{v_k}{1 - \beta_2^k}    (9)
    x_{k+1} = x_k - \frac{t}{\sqrt{\hat{v}_k} + \epsilon} \hat{m}_k

where β1 can be 0.9, β2 can be 0.999, and ε can be 1e-8.

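A minimal sketch of the Adam update in Eq. (9), including the bias correction of the first and second moment estimates; the default step size t = 0.001 is an assumption, the other defaults follow the values quoted above, and k is the 1-based iteration counter.

import numpy as np

def adam_step(x, m, v, grad, k, t=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Eq. (9): first and second moment estimates ...
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # ... bias-corrected (k starts at 1)
    m_hat = m / (1 - beta1 ** k)
    v_hat = v / (1 - beta2 ** k)
    x = x - t / (np.sqrt(v_hat) + eps) * m_hat
    return x, m, v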

[Figure 1: Structure of LeNet [9].]

2.10 Adapg

Combining Adadelta and Adam, we obtain a new method:

    E[\nabla f(x)]_k = \rho E[\nabla f(x)]_{k-1} + (1 - \rho) \nabla f(x_k)
    E[\nabla f(x)^2]_k = \rho E[\nabla f(x)^2]_{k-1} + (1 - \rho) \nabla f(x_k)^2
    \hat{x}_k = - \frac{\sqrt{E[\hat{x}^2]_{k-1} + \epsilon}}{\sqrt{E[\nabla f(x)^2]_k + \epsilon}} E[\nabla f(x)]_k    (10)
    E[\hat{x}^2]_k = \rho E[\hat{x}^2]_{k-1} + (1 - \rho) \hat{x}_k^2
    x_{k+1} = x_k + \hat{x}_k

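A minimal sketch of the Adapg update in Eq. (10): Adadelta's step scaling applied to a decaying average of the raw gradients rather than to the current gradient. The paper does not give hyperparameter values for Adapg, so the defaults below are assumptions borrowed from Adadelta.

import numpy as np

def adapg_step(x, Eg, Eg2, Edx2, grad, rho=0.95, eps=1e-6):
    # Eq. (10): decaying averages of gradients and squared gradients ...
    Eg = rho * Eg + (1 - rho) * grad
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2
    # ... Adadelta-style scaling applied to the averaged gradient
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * Eg
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2
    return x + dx, Eg, Eg2, Edx2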

3 Image Classification

3.1 Frameworks

To investigate the performance of the aforementioned gradient methods in different model structures,
we employ two neural network models to handle the handwritten digit recognition problem: the
Multi-Layer Perceptron (MLP) and the Convolutional Neural Network (CNN).

3.1.1 Multi-Layer Perceptron (MLP)


MLP is a one-hidden-layer artificial neural network. For the handwritten digit recognition problem,
we use a vector x ∈ R^d to denote the input image; the operations in the feed-forward direction are:

    h = \sigma(W_h x + b_h)
    \hat{y} = \mathrm{softmax}(W_y h + b_y)    (11)

where W_h ∈ R^{k×d} and W_y ∈ R^{c×k} are the weight matrices, and b_h ∈ R^k and b_y ∈ R^c are the
biases. σ(·) is the sigmoid function σ(z) = 1/(1 + e^{-z}), and softmax(z)_i = e^{z_i} / \sum_{j=1}^{c} e^{z_j}.
The model's prediction y* is the class whose probability is maximal, i.e., y* = arg max_i \hat{y}_i. In
fact, we can regard \hat{y} as the probability distribution over the classes, so P(y* = i | x, Θ) = \hat{y}_i.

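A minimal NumPy sketch of the MLP forward pass in Eq. (11); the layer sizes, the random weights, and the random input are illustrative assumptions, not the trained model.

import numpy as np

def mlp_forward(x, W_h, b_h, W_y, b_y):
    # Eq. (11): sigmoid hidden layer followed by a softmax output layer.
    h = 1.0 / (1.0 + np.exp(-(W_h @ x + b_h)))
    z = W_y @ h + b_y
    z = z - z.max()                      # shift for numerical stability
    y_hat = np.exp(z) / np.exp(z).sum()  # class probability distribution
    return y_hat

d, k, c = 784, 500, 10                   # input, hidden, and class sizes (assumptions)
rng = np.random.default_rng(0)
params = [rng.normal(scale=0.01, size=s) for s in [(k, d), (k,), (c, k), (c,)]]
y_hat = mlp_forward(rng.normal(size=d), *params)
y_star = int(np.argmax(y_hat))           # predicted class y*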

3.1.2 Convolutional Neural Networks (CNN)


CNN is a biologically inspired variant of the MLP. It consists of convolutional layers and pooling
layers. A feature map is obtained by convolving the input image with a linear filter, adding a bias
term, and then applying a non-linear function:

    h_{ij} = \tanh((W * x)_{ij} + b)    (12)

where * is the convolution operator. After each convolutional operation, we add a pooling layer,
which is a form of non-linear down-sampling. Max-pooling partitions the input image into a set
of non-overlapping rectangles and, for each such sub-region, outputs the maximum value. In our
experiments, we use LeNet-5 [9] as the basic framework, as shown in Figure 1.

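A minimal sketch of Eq. (12) followed by 2x2 non-overlapping max-pooling, using SciPy's 2-D cross-correlation for the filtering step; the image size and 5x5 filter are illustrative assumptions, not the LeNet-5 configuration.

import numpy as np
from scipy.signal import correlate2d

def conv_tanh_maxpool(x, W, b):
    # Eq. (12): linear filtering, bias, tanh non-linearity ...
    h = np.tanh(correlate2d(x, W, mode="valid") + b)
    # ... then 2x2 non-overlapping max-pooling over the feature map
    H, Wd = h.shape[0] // 2 * 2, h.shape[1] // 2 * 2
    return h[:H, :Wd].reshape(H // 2, 2, Wd // 2, 2).max(axis=(1, 3))

rng = np.random.default_rng(0)
out = conv_tanh_maxpool(rng.normal(size=(28, 28)),   # toy image (assumption)
                        rng.normal(size=(5, 5)),     # 5x5 filter (assumption)
                        0.0)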

4
3.2 Learning

Learning the optimal model parameters involves minimizing a loss function. For multi-class logistic
regression, it is very common to use the negative log-likelihood as the loss. This is equivalent to
maximizing the likelihood of the dataset D under the model parameterized by Θ:

    \min_\Theta J = - \sum_{i=1}^{n} \log P(y^{*(i)} = y^{(i)} \mid x^{(i)}, \Theta) + \lambda \|\Theta\|_2^2    (13)

where λ‖Θ‖²₂ is an L2-regularization term (a penalty) that discourages complex models, and λ > 0
is a hyperparameter that controls the magnitude of the penalty.

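A minimal sketch of the regularized negative log-likelihood in Eq. (13); the probability matrix, label vector, and λ value are placeholders assumed for the example.

import numpy as np

def nll_l2_loss(probs, labels, params, lam=1e-4):
    # Eq. (13): negative log-likelihood of the true classes ...
    nll = -np.log(probs[np.arange(len(labels)), labels] + 1e-12).sum()
    # ... plus an L2 penalty on all parameter arrays
    l2 = lam * sum((p ** 2).sum() for p in params)
    return nll + l2

# usage sketch: probs is an (n, c) matrix of predicted class probabilities,
# labels is an (n,) integer vector, params is the list of weight arrays.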

4 Neural Headline Generation


4.1 Framework

In this task we build a neural network model that learns to generate headlines for news articles
automatically, as a human would.

[Figure 2: Our deep recurrent generative decoder (DRGD) for latent structure modeling. The diagram shows a bidirectional recurrent encoder, an attention-based decoder, and a variational auto-encoder component with latent variables z and the KL term D_KL[N(µ, σ²) || N(0, I)].]
As shown in the left block of Figure 2, the encoder is designed based on bidirectional recurrent neural
networks. Let x_t be the word embedding vector of the t-th word in the source sequence. A GRU maps
x_t and the previous hidden state h_{t-1} to the current hidden state h_t in the forward and backward
directions respectively:

    \overrightarrow{h}_t = \mathrm{GRU}(x_t, \overrightarrow{h}_{t-1}), \quad \overleftarrow{h}_t = \mathrm{GRU}(x_t, \overleftarrow{h}_{t-1})    (14)

Then the final hidden state h^e_t ∈ R^{2k_h} is the concatenation of the hidden states from the two
directions: h^e_t = \overrightarrow{h}_t \| \overleftarrow{h}_t. As shown in the middle block of Figure 2, the decoder consists of two
components: discriminative deterministic decoding and generative latent structure modeling.
The discriminative deterministic decoding is an attention-based recurrent sequence decoder. The
first hidden state h^d_1 is initialized using the average of all the source input states:
h^d_1 = \frac{1}{T^e} \sum_{t=1}^{T^e} h^e_t, where h^e_t is a source input hidden state and T^e is the input sequence length. The
deterministic decoder hidden state h^d_t is calculated using two layers of GRUs. On the first layer, the
hidden state is calculated using only the current input word embedding y_{t-1} and the previous hidden
state h^{d_1}_{t-1}: h^{d_1}_t = \mathrm{GRU}_1(y_{t-1}, h^{d_1}_{t-1}), where the superscript d_1 denotes the first decoder GRU layer.
Then the attention weights at time step t are calculated based on the relationship between h^{d_1}_t and
all the source hidden states {h^e_t}. Let a_{i,j} be the attention weight between h^{d_1}_i and h^e_j, which can
be calculated as follows:

    a_{i,j} = \frac{\exp(e_{i,j})}{\sum_{j'=1}^{T^e} \exp(e_{i,j'})}, \quad e_{i,j} = v^T \tanh(W^d_{hh} h^{d_1}_i + W^e_{hh} h^e_j + b_a)

where W^d_{hh} ∈ R^{k_h×k_h}, W^e_{hh} ∈ R^{k_h×2k_h}, b_a ∈ R^{k_h}, and v ∈ R^{k_h}. The attention context is
obtained by the weighted linear combination of all the source hidden states: c_t = \sum_{j'=1}^{T^e} a_{t,j'} h^e_{j'}.

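A minimal NumPy sketch of the attention weights and context computation above; the toy dimensions and random states stand in for the real encoder and decoder states and are assumptions of the example.

import numpy as np

def attention_context(h_d, H_e, W_d, W_e, b_a, v):
    # e_{t,j} = v^T tanh(W_d h_d + W_e h^e_j + b_a) for every source position j
    scores = np.tanh(W_d @ h_d + H_e @ W_e.T + b_a) @ v
    a = np.exp(scores - scores.max())
    a = a / a.sum()                      # attention weights over source positions
    return a @ H_e                       # context c_t: weighted sum of source states

k_h, T_e = 4, 6                          # toy sizes (assumptions)
rng = np.random.default_rng(0)
c_t = attention_context(rng.normal(size=k_h),             # decoder state h^{d_1}_t
                        rng.normal(size=(T_e, 2 * k_h)),  # source states h^e_j
                        rng.normal(size=(k_h, k_h)),
                        rng.normal(size=(k_h, 2 * k_h)),
                        rng.normal(size=k_h),
                        rng.normal(size=k_h))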

The final deterministic hidden state h^{d_2}_t is the output of the second decoder GRU layer, jointly
considering the word y_{t-1}, the previous hidden state h^{d_2}_{t-1}, and the attention context c_t:

    h^{d_2}_t = \mathrm{GRU}_2(y_{t-1}, h^{d_2}_{t-1}, c_t)    (15)

For the recurrent generative component, inspired by ideas in previous works [7, 11, 4], we assume
that both the prior and the posterior of the latent variables are Gaussian, i.e., p_θ(z_t) = N(0, I)
and q_φ(z_t | y_{<t}, z_{<t}) = N(z_t; µ, σ²I), where µ and σ denote the variational mean and standard
deviation respectively, which can be calculated via a multilayer perceptron. Precisely, given the word
embedding y_{t-1}, the previous latent structure variable z_{t-1}, and the previous deterministic hidden
state h^d_{t-1}, we first project them to a new hidden space:

    h^{ez}_t = g(W^{ez}_{yh} y_{t-1} + W^{ez}_{zh} z_{t-1} + W^{ez}_{hh} h^d_{t-1} + b^{ez}_h)

where W^{ez}_{yh} ∈ R^{k_h×k_w}, W^{ez}_{zh} ∈ R^{k_h×k_z}, W^{ez}_{hh} ∈ R^{k_h×k_h}, and b^{ez}_h ∈ R^{k_h}. g is the sigmoid
activation function: σ(x) = 1/(1 + e^{-x}). Then the Gaussian parameters µ_t ∈ R^{k_z} and σ_t ∈ R^{k_z}
can be obtained via a linear transformation based on h^{ez}_t:

    \mu_t = W^{ez}_{h\mu} h^{ez}_t + b^{ez}_\mu, \quad \log(\sigma_t^2) = W^{ez}_{h\sigma} h^{ez}_t + b^{ez}_\sigma    (16)

The latent structure variable z_t ∈ R^{k_z} can then be calculated using the reparameterization trick:

    \epsilon \sim N(0, I), \quad z_t = \mu_t + \sigma_t \otimes \epsilon    (17)

where ε ∈ R^{k_z} is an auxiliary noise variable. The process of inferring z_t with neural networks can
be treated as a variational encoding process.

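A minimal sketch of the reparameterization trick in Eq. (17): z is sampled from N(µ, σ²I) as a deterministic function of µ, σ, and auxiliary noise, so gradients can flow through µ and σ. The latent size used in the call is an assumption taken from the experimental settings.

import numpy as np

def reparameterize(mu, log_sigma_sq, rng):
    # Eq. (17): z = mu + sigma * eps with eps ~ N(0, I).
    eps = rng.normal(size=mu.shape)
    sigma = np.exp(0.5 * log_sigma_sq)
    return mu + sigma * eps

rng = np.random.default_rng(0)
z_t = reparameterize(np.zeros(500), np.zeros(500), rng)  # k_z = 500 (assumption)
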
To generate summaries precisely, we integrate the recurrent generative decoding component with the
discriminative deterministic decoding component, and map the latent structure variable z_t and the
deterministic decoding hidden state h^{d_2}_t to a new hidden variable:

    h^{d_y}_t = \tanh(W^d_{zh_y} z_t + W^{d_2}_{hh} h^{d_2}_t + b^d_{h_y})    (18)

Given the combined decoding state h^{d_y}_t at time t, the probability of generating any target word y_t
is given as follows:

    y_t = \varsigma(W^d_{hy} h^{d_y}_t + b^d_{hy})    (19)

where W^d_{hy} ∈ R^{k_y×k_h} and b^d_{hy} ∈ R^{k_y}, and ς(·) is the softmax function. Finally, we use a beam
search algorithm [8] for decoding and generating the best summaries.

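For reference, the following is a minimal, generic beam search sketch, not the authors' actual decoder. The beam size of 10 matches the experimental settings, while step_log_probs is a hypothetical stand-in for the model's per-step softmax in Eq. (19), and the eos token id and maximum length are assumptions.

import numpy as np

def beam_search(step_log_probs, beam_size=10, max_len=50, eos=0):
    # step_log_probs(prefix) -> log-probabilities over the vocabulary for the
    # next token given the generated prefix (stand-in for Eq. (19)).
    beams = [([], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:
                candidates.append((seq, score))    # keep finished hypotheses
                continue
            logp = step_log_probs(seq)
            for w in np.argsort(logp)[-beam_size:]:
                candidates.append((seq + [int(w)], score + float(logp[w])))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return max(beams, key=lambda c: c[1])[0]       # best-scoring sequence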

4.2 Learning

We use {X}_N and {Y}_N to denote the training source and target sequences. The final objective
function, the negative log-likelihood (NLL) under the model parameterized by Θ, which needs to be
minimized, is:

    \min_\Theta J = - \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T} \log p(y^{(n)}_t \mid y^{(n)}_{<t}, X^{(n)})    (20)

5 Experiments and Discussions


5.1 Datasets

We train and evaluate our framework on two popular datasets. For image classification, we use the
MNIST¹ dataset. For headline generation, we use Gigawords, an English sentence summarization
dataset prepared from Annotated Gigawords² by extracting the first sentence of each article to create
a source-summary pair. It contains roughly 3.8M training pairs, 190K validation pairs, and 2,000
test pairs.
¹ http://yann.lecun.com/exdb/mnist/
² https://catalog.ldc.upenn.edu/ldc2012t21

Table 1: Examples of the generated headlines. S: input source document. G: ground truth. P:
generated abstractive headlines.
S(1): factory orders for manufactured goods rose #.# percent in september , the
commerce department said here thursday.
G: us september factory orders up #.# percent.
P: us factory orders up #.# percent in september.
S(2): nick zito , who had three horses finish second during rival trainer d. wayne
lukas ’ streak of six straight victories in triple crown races , won the preakness
on saturday with louis quatorze , who had finished ##th in the kentucky derby.
G: UNK louis UNK wins preakness.
P: zito wins preakness for louis UNK.
S(3): UNK thorpe remembers her difficult marriage to jim thorpe , called the
greatest athlete of the modern era , and their harsh life outside the spotlight.
G: late olympian ’s wife recalls hard times <unk> photo available.
P: thorpe remembers her difficult marriage to thorpe.

5.2 Experimental Settings

For the MLP and CNN frameworks, the hidden size is 500. For the experiments on the English dataset
Gigawords, we set the dimension of the word embeddings to 300, and the dimension of the hidden
states and latent variables to 500. The maximum lengths of documents and summaries are 100 and 50
respectively. The batch size for mini-batch training is 256. The beam size of the decoder was set to 10.
Our neural network based framework is implemented using Theano [1] on a single GTX 1080 GPU.

5.3 Performance of Different Gradient Methods

As shown in Figure 3 and Figure 4, the gradient methods discussed above perform differently on
different tasks and neural network structures. For the MLP, momentum, Adagrad, and Adam converge
faster than the other methods. For the CNN, Adadelta also performs well. However, in the RNN
model, the methods that do not require tuning the step size, Adadelta and Adapg, perform best.
Considering that training can take several weeks on some large-scale tasks, needing less step size
tuning makes these methods more convenient.

5.4 Case Analysis for Headline Generation

As shown in Figure 5, we report the ROUGE scores (overlap between prediction and ground truth) of
our own framework DRGD and the baseline methods. Our neural network based framework trained
with Adadelta outperforms all the baselines. As shown in Table 1, we select several sample cases to
demonstrate the performance of our framework. From these cases we can see that our framework
generates headlines with good linguistic quality.

6 Conclusions
In this investigation we compare the performance of different gradient methods in deep learning and
find that the structure of the model affects the choice of method.

[Figure 3: Performance of different gradient methods in the MLP and CNN frameworks. Panels: (a) MLP: objective value; (b) MLP: accuracy; (c) CNN: objective value; (d) CNN: accuracy. The curves compare adapg, adadelta, adam, adagrad, rmsprop, momentum, and sgd with various step sizes, plotted as loss or accuracy against iteration.]

[Figure 4: RNN: objective value. The curves compare adapg, adadelta, adagrad, adam, rmsprop, momentum, and sgd, plotted as loss against iteration.]

Figure 5: ROUGE-F1 on Gigawords

System         R-1     R-2     R-L
ABS            29.55   11.32   26.42
ABS+           29.78   11.89   26.97
RAS-LSTM       32.55   14.70   30.03
RAS-Elman      33.78   15.97   31.15
ASC + FSC1     34.17   15.94   31.92
lvt2k-1sent    32.67   15.59   30.64
lvt5k-1sent    35.30   16.64   32.62
DRGD           36.27   17.57   33.62

References
[1] Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian Goodfellow, Arnaud Bergeron,
Nicolas Bouchard, David Warde-Farley, and Yoshua Bengio. Theano: new features and speed improvements.
arXiv preprint arXiv:1211.5590, 2012.
[2] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and
stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
[3] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016.
[4] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Rezende, and Daan Wierstra. Draw: A recurrent neural
network for image generation. In ICML, pages 1462–1471, 2015.
[5] Geoffrey Hinton, N. Srivastava, and Kevin Swersky. Lecture 6a: Overview of mini-batch gradient descent.
Coursera lecture slides, https://class.coursera.org/neuralnets-2012-001/lecture, 2012.
[6] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
[7] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.
[8] Philipp Koehn. Pharaoh: a beam search decoder for phrase-based statistical machine translation models.
In Conference of the Association for Machine Translation in the Americas, pages 115–124. Springer, 2004.
[9] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[10] Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence
O(1/k²). In Doklady AN SSSR, volume 269, pages 543–547, 1983.
[11] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approxi-
mate inference in deep generative models. In ICML, pages 1278–1286, 2014.
[12] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint
arXiv:1609.04747, 2016.
[13] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-
propagating errors. Cognitive modeling, 5(3):1, 1988.
[14] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian
Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go
with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[15] Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
