
Optimization Algorithms for Deep Learning

Piji Li
Department of Systems Engineering and Engineering Management,
The Chinese University of Hong Kong
[email protected]

Abstract

Gradient descent algorithms are the most important and popular techniques for
optimizing deep learning models. Given large-scale datasets and limited
computation memory, especially on GPUs, the traditional batch gradient descent
and stochastic gradient descent methods cannot conduct training effectively and
efficiently. Moreover, tuning the step size (learning rate) also plays a critical
role during the training procedure. To address these problems, several variants
of gradient methods have been proposed. In this paper, we investigate the basic
methodology and properties of these methods. We also conduct experiments on a
variety of tasks to compare their performance.

1 Introduction

Nowadays, Deep Learning [3] shows state-of-the-art performance on various tasks in different
fields, such as speech recognition, computer vision, and natural language processing. There is a
huge number of variants of deep architectures, such as the Deep Neural Network (DNN), Deep
Belief Network (DBN), Convolutional Neural Network (CNN), and Recurrent Neural Network
(RNN) [3]. The core technique behind AlphaGo is also a neural network [14].
Back-Propagation (BP) [13] is the common method for training neural networks. The algorithm
repeats a two-phase cycle of propagation and weight update. Errors obtained at the output layer are
propagated backward to the other nodes. These errors are used to calculate the gradient of the loss
function with respect to the weights in the network. The gradient is then fed to the optimization
method, which in turn uses it to update the weights in an attempt to minimize the loss function.
According to the amount of data used to compute the gradient of the objective function, gradient
descent methods can be divided into three categories: batch gradient descent, mini-batch gradient
descent, and stochastic gradient descent [12]. Given large-scale datasets and limited computation
memory, especially on GPUs, the traditional batch gradient descent and stochastic gradient descent
methods cannot conduct training effectively and efficiently. Moreover, step size adjustment also
plays a critical role during the training procedure, but it is very difficult to tune the step size
precisely for a task that needs a large dataset and a long training period. To address these problems,
several variants of the typical gradient methods have been proposed. In the following sections, we
introduce the basic framework of the related methods. We also conduct experiments on image
classification and headline generation to show how these methods perform.

2 Optimization Algorithms

Assume that the objective function to be minimized is f(x), with x ∈ R^n. The corresponding
gradient is \nabla f(x). The step size for iteration k is t_k.
2.1 Batch Gradient Descent

The batch gradient descent algorithm updates the parameters x after scanning the whole training set:

    x_{k+1} = x_k - t_k \nabla f(x_k)_{(1:n)}    (1)

Batch gradient descent is guaranteed to converge to the global minimum for convex problems and
to a local minimum for non-convex problems. However, in deep learning tasks the training set
contains millions or even billions of samples, so scanning the entire training set to calculate the
gradient takes a long time, and a single parameter update is too slow. Moreover, computation
memory is limited, and it is difficult to feed all the data into the model at one time. Therefore, few
deep learning models use the batch gradient descent method to handle the optimization problem.

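To make the update rule concrete, the following is a minimal NumPy sketch of a full-batch gradient descent loop. It is not the paper's Theano implementation; the least-squares objective, the synthetic data, and the fixed step size t are assumptions made for the example.

import numpy as np

# Full-batch gradient descent on an assumed objective f(x) = 0.5 * ||A x - b||^2.
rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 20))      # toy design matrix (assumption)
b = A @ rng.normal(size=20)          # toy targets (assumption)

def grad_f(x):
    # Gradient over the WHOLE training set, as required by Eq. (1).
    return A.T @ (A @ x - b)

x = np.zeros(20)
t = 1e-4                             # fixed step size t_k (assumption)
for k in range(100):
    x = x - t * grad_f(x)            # Eq. (1): one update per full data pass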

2.2 Stochastic Gradient Descent

In contrast, stochastic gradient descent (SGD) calculates the gradient and updates the parameters
for each training sample:

    x_{k+1} = x_k - t_k \nabla f(x_k)_{(i)}    (2)

However, due to the high variance across training samples, updating the parameters this frequently
causes the objective function to fluctuate heavily. Although a small step size can make SGD
converge to a good point, it makes training slow. Moreover, when we use GPUs for the computation,
the frequent data communication between GPU memory and main memory also decreases efficiency.

2.3 Mini-Batch Gradient Descent

Mini-batch gradient descent combines the advantages of batch gradient descent and stochastic
gradient descent, and updates the parameters after obtaining the gradient of a mini-batch of samples:

    x_{k+1} = x_k - t_k \nabla f(x_k)_{(i:i+m)}    (3)

where m is the mini-batch size. Mini-batch gradient descent cannot guarantee good convergence,
and tuning the step size still requires some experience. Therefore, researchers have extended it with
more useful tricks and techniques to improve convergence. For convenience, mini-batch gradient
descent is often also referred to as SGD.

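A minimal sketch of mini-batch updates, continuing the synthetic least-squares setup from the previous sketch (A, b, and rng as defined there); setting m = 1 recovers per-sample SGD and m = n recovers batch gradient descent. The epoch count, batch size, and step size are assumptions.

def grad_f_minibatch(x, idx):
    # Gradient of the same least-squares objective, restricted to rows idx.
    return A[idx].T @ (A[idx] @ x - b[idx])

x = np.zeros(20)
t, m = 1e-3, 32                      # step size and mini-batch size (assumptions)
for epoch in range(10):
    perm = rng.permutation(len(A))   # reshuffle the training set each epoch
    for start in range(0, len(A), m):
        idx = perm[start:start + m]
        x = x - t * grad_f_minibatch(x, idx)   # Eq. (3)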

2.4 Gradient Descent with Momentum

If there is a long, shallow ravine with steep walls along the direction to the optimal point, standard
SGD tends to oscillate across the narrow ravine. Momentum is one mechanism used to correct the
update direction:

    v_k = m v_{k-1} + t_k \nabla f(x_k)
    x_{k+1} = x_k - v_k    (4)

where m ∈ (0, 1] determines for how many iterations the previous gradients are incorporated into
the current update. Generally, m is set to 0.5 until the initial learning stabilizes and is then increased
to 0.9 or higher.

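A minimal sketch of the momentum update in Eq. (4); the function signature and the default coefficient m = 0.9 are illustrative assumptions.

def momentum_step(x, v, grad, t, m=0.9):
    # Eq. (4): accumulate a velocity from past gradients, then step along it.
    v = m * v + t * grad
    x = x - v
    return x, v

# usage sketch: x, v = momentum_step(x, v, grad_f(x), t=0.01),
# with v initialised to an all-zero array of the same shape as x.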

2.5 Nesterov Accelerated Gradient

We can use the Nesterov accelerated gradient (NAG) [10] to improve on momentum by looking
ahead to approximate the update direction:

    v_k = m v_{k-1} + t_k \nabla f(x_k - m v_{k-1})
    x_{k+1} = x_k - v_k    (5)

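A corresponding sketch of the NAG update in Eq. (5); note that the gradient is evaluated at the look-ahead point x - m v. Here grad_f stands in for whatever gradient function the model provides, an assumption of the example.

def nag_step(x, v, grad_f, t, m=0.9):
    # Eq. (5): evaluate the gradient at the look-ahead point x - m * v.
    v = m * v + t * grad_f(x - m * v)
    x = x - v
    return x, v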

2.6 Adagrad

Adagrad [2] scales the step size for each parameter according to the history of gradients for that
parameter, which is basically done by dividing the current gradient in the update rule by the square
root of the accumulated squared gradients:

    G_k = G_{k-1} + \nabla f(x_k)^2
    x_{k+1} = x_k - \frac{t}{\sqrt{G_k + \epsilon}} \nabla f(x_k)    (6)

where G is the accumulation of the squared historical gradients, and ε is a smoothing term that
avoids division by zero (e.g., 1e-6). The step size differs for each parameter: it is larger for
parameters whose historical gradients are small (since G is small) and smaller whenever the
historical gradients are relatively large. Therefore, we do not need to manually tune the step size t;
a default value of 0.01 can be used. However, as the accumulator G keeps growing, the effective
step size eventually shrinks toward zero, so the following methods were proposed.

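A minimal sketch of the Adagrad update in Eq. (6); elementwise NumPy operations apply the per-parameter scaling, and the defaults follow the values quoted above.

import numpy as np

def adagrad_step(x, G, grad, t=0.01, eps=1e-6):
    # Eq. (6): accumulate squared gradients and scale each coordinate's step.
    G = G + grad ** 2
    x = x - t / np.sqrt(G + eps) * grad
    return x, G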

2.7 Adadelta

Adadelta [15] is derived from Adagrad in order to address the two main drawbacks of that method:
1) the continual decay of the learning rate throughout training, and 2) the need for a manually
selected global learning rate. Adadelta combines the ideas of momentum and Adagrad. Specifically,
it scales the step size based on historical gradients, but it only uses a recent time window instead of
the whole history as Adagrad does. It also uses a component that serves as an acceleration term by
accumulating historical updates (similar to momentum). The operations of Adadelta are as follows:

    E[\nabla f(x)^2]_k = \rho E[\nabla f(x)^2]_{k-1} + (1 - \rho) \nabla f(x_k)^2
    \hat{x}_k = - \frac{\sqrt{E[\hat{x}^2]_{k-1} + \epsilon}}{\sqrt{E[\nabla f(x)^2]_k + \epsilon}} \nabla f(x_k)    (7)
    E[\hat{x}^2]_k = \rho E[\hat{x}^2]_{k-1} + (1 - \rho) \hat{x}_k^2
    x_{k+1} = x_k + \hat{x}_k

where ρ is a decay constant (e.g., 0.95) and ε is a small value (e.g., 1e-6) for numerical stability.

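A minimal sketch of the Adadelta update in Eq. (7), keeping running averages of squared gradients and squared updates; the defaults follow the values quoted above.

import numpy as np

def adadelta_step(x, Eg2, Edx2, grad, rho=0.95, eps=1e-6):
    # Eq. (7): decaying average of squared gradients ...
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2
    # ... the ratio of the RMS of past updates to the RMS of gradients sets the step
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad
    # decaying average of squared updates, then apply the update
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2
    return x + dx, Eg2, Edx2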

2.8 RMSprop

RMSprop [5] was also proposed to tackle the vanishing step size problem of Adagrad. It likewise
employs a decaying average of the historical squared gradients:

    E[\nabla f(x)^2]_k = \rho E[\nabla f(x)^2]_{k-1} + (1 - \rho) \nabla f(x_k)^2
    x_{k+1} = x_k - \frac{t}{\sqrt{E[\nabla f(x)^2]_k + \epsilon}} \nabla f(x_k)    (8)

where ρ is a decay constant (e.g., 0.9).

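A minimal sketch of the RMSprop update in Eq. (8); the step size t remains a free hyperparameter, and the default of 0.001 is an assumption of the example.

import numpy as np

def rmsprop_step(x, Eg2, grad, t=0.001, rho=0.9, eps=1e-6):
    # Eq. (8): a decaying average of squared gradients rescales the step per parameter.
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2
    x = x - t / np.sqrt(Eg2 + eps) * grad
    return x, Eg2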

2.9 Adam

Adam [6] is another method that computes an adaptive step size for each parameter. It uses decaying
averages of both the historical gradients and their squared values. The Adam update rule consists
of the following steps:

    m_k = \beta_1 m_{k-1} + (1 - \beta_1) \nabla f(x_k)
    v_k = \beta_2 v_{k-1} + (1 - \beta_2) \nabla f(x_k)^2
    \hat{m}_k = \frac{m_k}{1 - \beta_1^k}, \quad \hat{v}_k = \frac{v_k}{1 - \beta_2^k}    (9)
    x_{k+1} = x_k - \frac{t}{\sqrt{\hat{v}_k} + \epsilon} \hat{m}_k

where β1 can be 0.9, β2 can be 0.999, and ε can be 1e-8.

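A minimal sketch of the Adam update in Eq. (9), including the bias correction of the first and second moment estimates; the default step size t = 0.001 is an assumption, the other defaults follow the values quoted above, and k is the 1-based iteration counter.

import numpy as np

def adam_step(x, m, v, grad, k, t=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Eq. (9): first and second moment estimates ...
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # ... bias-corrected (k starts at 1)
    m_hat = m / (1 - beta1 ** k)
    v_hat = v / (1 - beta2 ** k)
    x = x - t / (np.sqrt(v_hat) + eps) * m_hat
    return x, m, v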

[Figure 1: Structure of LeNet [9].]

2.10 Adapg

Combining Adadelta and Adam, we obtain a new method:

    E[\nabla f(x)]_k = \rho E[\nabla f(x)]_{k-1} + (1 - \rho) \nabla f(x_k)
    E[\nabla f(x)^2]_k = \rho E[\nabla f(x)^2]_{k-1} + (1 - \rho) \nabla f(x_k)^2
    \hat{x}_k = - \frac{\sqrt{E[\hat{x}^2]_{k-1} + \epsilon}}{\sqrt{E[\nabla f(x)^2]_k + \epsilon}} E[\nabla f(x)]_k    (10)
    E[\hat{x}^2]_k = \rho E[\hat{x}^2]_{k-1} + (1 - \rho) \hat{x}_k^2
    x_{k+1} = x_k + \hat{x}_k

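A minimal sketch of the Adapg update in Eq. (10): Adadelta's step scaling applied to a decaying average of the raw gradients rather than to the current gradient. The paper does not give hyperparameter values for Adapg, so the defaults below are assumptions borrowed from Adadelta.

import numpy as np

def adapg_step(x, Eg, Eg2, Edx2, grad, rho=0.95, eps=1e-6):
    # Eq. (10): decaying averages of gradients and squared gradients ...
    Eg = rho * Eg + (1 - rho) * grad
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2
    # ... Adadelta-style scaling applied to the averaged gradient
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * Eg
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2
    return x + dx, Eg, Eg2, Edx2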

3 Image Classification

3.1 Frameworks

To investigate the performance of the aforementioned gradient methods in different model structures,
we employ two neural network models to handle the handwritten digit recognition problem: the
Multi-Layer Perceptron (MLP) and the Convolutional Neural Network (CNN).

3.1.1 Multi-Layer Perceptron (MLP)


MLP is a one-hidden-layer artificial neural network. For the handwritten digit recognition problem,
we use a vector x ∈ R^d to denote the input image; the operations in the feed-forward direction are:

    h = \sigma(W_h x + b_h)
    \hat{y} = \mathrm{softmax}(W_y h + b_y)    (11)

where W_h ∈ R^{k×d} and W_y ∈ R^{c×k} are the weight matrices, and b_h ∈ R^k and b_y ∈ R^c are the
biases. σ(·) is the sigmoid function σ(z) = 1/(1 + e^{-z}), and softmax(z)_i = e^{z_i} / \sum_{j=1}^{c} e^{z_j}.
The model's prediction y* is the class whose probability is maximal, i.e., y* = arg max_i \hat{y}_i. In
fact, we can regard \hat{y} as the probability distribution over the classes, so P(y* = i | x, Θ) = \hat{y}_i.

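A minimal NumPy sketch of the MLP forward pass in Eq. (11); the layer sizes, the random weights, and the random input are illustrative assumptions, not the trained model.

import numpy as np

def mlp_forward(x, W_h, b_h, W_y, b_y):
    # Eq. (11): sigmoid hidden layer followed by a softmax output layer.
    h = 1.0 / (1.0 + np.exp(-(W_h @ x + b_h)))
    z = W_y @ h + b_y
    z = z - z.max()                      # shift for numerical stability
    y_hat = np.exp(z) / np.exp(z).sum()  # class probability distribution
    return y_hat

d, k, c = 784, 500, 10                   # input, hidden, and class sizes (assumptions)
rng = np.random.default_rng(0)
params = [rng.normal(scale=0.01, size=s) for s in [(k, d), (k,), (c, k), (c,)]]
y_hat = mlp_forward(rng.normal(size=d), *params)
y_star = int(np.argmax(y_hat))           # predicted class y*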

3.1.2 Convolutional Neural Networks (CNN)


CNN is a biologically inspired variant of the MLP. It consists of convolutional layers and pooling
layers. A feature map is obtained by convolving the input image with a linear filter, adding a bias
term, and then applying a non-linear function:

    h_{ij} = \tanh((W * x)_{ij} + b)    (12)

where * is the convolution operator. After each convolutional operation, we add a pooling layer,
which is a form of non-linear down-sampling. Max-pooling partitions the input image into a set
of non-overlapping rectangles and, for each such sub-region, outputs the maximum value. In our
experiments, we use LeNet-5 [9] as the basic framework, as shown in Figure 1.

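A minimal sketch of Eq. (12) followed by 2x2 non-overlapping max-pooling, using SciPy's 2-D cross-correlation for the filtering step; the image size and 5x5 filter are illustrative assumptions, not the LeNet-5 configuration.

import numpy as np
from scipy.signal import correlate2d

def conv_tanh_maxpool(x, W, b):
    # Eq. (12): linear filtering, bias, tanh non-linearity ...
    h = np.tanh(correlate2d(x, W, mode="valid") + b)
    # ... then 2x2 non-overlapping max-pooling over the feature map
    H, Wd = h.shape[0] // 2 * 2, h.shape[1] // 2 * 2
    return h[:H, :Wd].reshape(H // 2, 2, Wd // 2, 2).max(axis=(1, 3))

rng = np.random.default_rng(0)
out = conv_tanh_maxpool(rng.normal(size=(28, 28)),   # toy image (assumption)
                        rng.normal(size=(5, 5)),     # 5x5 filter (assumption)
                        0.0)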

4
3.2 Learning

Learning the optimal model parameters involves minimizing a loss function. For multi-class logistic
regression, it is very common to use the negative log-likelihood as the loss. This is equivalent to
maximizing the likelihood of the dataset D under the model parameterized by Θ:

    \min_\Theta J = - \sum_{i=1}^{n} \log P(y^{*(i)} = y^{(i)} \mid x^{(i)}, \Theta) + \lambda \|\Theta\|_2^2    (13)

where λ‖Θ‖²₂ is an L2-regularization term (a penalty) that discourages complex models, and λ > 0
is a hyperparameter that controls the magnitude of the penalty.

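A minimal sketch of the regularized negative log-likelihood in Eq. (13); the probability matrix, label vector, and λ value are placeholders assumed for the example.

import numpy as np

def nll_l2_loss(probs, labels, params, lam=1e-4):
    # Eq. (13): negative log-likelihood of the true classes ...
    nll = -np.log(probs[np.arange(len(labels)), labels] + 1e-12).sum()
    # ... plus an L2 penalty on all parameter arrays
    l2 = lam * sum((p ** 2).sum() for p in params)
    return nll + l2

# usage sketch: probs is an (n, c) matrix of predicted class probabilities,
# labels is an (n,) integer vector, params is the list of weight arrays.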

4 Neural Headline Generation


4.1 Framework

In this task we build a neural network model that learns to generate headlines for news articles
automatically, as a human would.

[Figure 2: Our deep recurrent generative decoder (DRGD) for latent structure modeling. The diagram shows a bidirectional recurrent encoder, an attention-based decoder, and a variational auto-encoder component with latent variables z and the KL term D_KL[N(µ, σ²) || N(0, I)].]
As shown in the left block of Figure 2, the encoder is designed based on bidirectional recurrent neural
networks. Let x_t be the word embedding vector of the t-th word in the source sequence. A GRU maps
x_t and the previous hidden state h_{t-1} to the current hidden state h_t in the forward and backward
directions respectively:

    \overrightarrow{h}_t = \mathrm{GRU}(x_t, \overrightarrow{h}_{t-1}), \quad \overleftarrow{h}_t = \mathrm{GRU}(x_t, \overleftarrow{h}_{t-1})    (14)

Then the final hidden state h^e_t ∈ R^{2k_h} is the concatenation of the hidden states from the two
directions: h^e_t = \overrightarrow{h}_t \| \overleftarrow{h}_t. As shown in the middle block of Figure 2, the decoder consists of two
components: discriminative deterministic decoding and generative latent structure modeling.
The discriminative deterministic decoding is an attention-based recurrent sequence decoder. The
first hidden state h^d_1 is initialized using the average of all the source input states:
h^d_1 = \frac{1}{T^e} \sum_{t=1}^{T^e} h^e_t, where h^e_t is a source input hidden state and T^e is the input sequence length. The
deterministic decoder hidden state h^d_t is calculated using two layers of GRUs. On the first layer, the
hidden state is calculated using only the current input word embedding y_{t-1} and the previous hidden
state h^{d_1}_{t-1}: h^{d_1}_t = \mathrm{GRU}_1(y_{t-1}, h^{d_1}_{t-1}), where the superscript d_1 denotes the first decoder GRU layer.
Then the attention weights at time step t are calculated based on the relationship between h^{d_1}_t and
all the source hidden states {h^e_t}. Let a_{i,j} be the attention weight between h^{d_1}_i and h^e_j, which can
be calculated as follows:

    a_{i,j} = \frac{\exp(e_{i,j})}{\sum_{j'=1}^{T^e} \exp(e_{i,j'})}, \quad e_{i,j} = v^T \tanh(W^d_{hh} h^{d_1}_i + W^e_{hh} h^e_j + b_a)

where W^d_{hh} ∈ R^{k_h×k_h}, W^e_{hh} ∈ R^{k_h×2k_h}, b_a ∈ R^{k_h}, and v ∈ R^{k_h}. The attention context is
obtained by the weighted linear combination of all the source hidden states: c_t = \sum_{j'=1}^{T^e} a_{t,j'} h^e_{j'}.

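A minimal NumPy sketch of the attention weights and context computation above; the toy dimensions and random states stand in for the real encoder and decoder states and are assumptions of the example.

import numpy as np

def attention_context(h_d, H_e, W_d, W_e, b_a, v):
    # e_{t,j} = v^T tanh(W_d h_d + W_e h^e_j + b_a) for every source position j
    scores = np.tanh(W_d @ h_d + H_e @ W_e.T + b_a) @ v
    a = np.exp(scores - scores.max())
    a = a / a.sum()                      # attention weights over source positions
    return a @ H_e                       # context c_t: weighted sum of source states

k_h, T_e = 4, 6                          # toy sizes (assumptions)
rng = np.random.default_rng(0)
c_t = attention_context(rng.normal(size=k_h),             # decoder state h^{d_1}_t
                        rng.normal(size=(T_e, 2 * k_h)),  # source states h^e_j
                        rng.normal(size=(k_h, k_h)),
                        rng.normal(size=(k_h, 2 * k_h)),
                        rng.normal(size=k_h),
                        rng.normal(size=k_h))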

The final deterministic hidden state h^{d_2}_t is the output of the second decoder GRU layer, jointly
considering the word y_{t-1}, the previous hidden state h^{d_2}_{t-1}, and the attention context c_t:

    h^{d_2}_t = \mathrm{GRU}_2(y_{t-1}, h^{d_2}_{t-1}, c_t)    (15)

For the recurrent generative component, inspired by ideas in previous works [7, 11, 4], we assume
that both the prior and the posterior of the latent variables are Gaussian, i.e., p_θ(z_t) = N(0, I)
and q_φ(z_t | y_{<t}, z_{<t}) = N(z_t; µ, σ²I), where µ and σ denote the variational mean and standard
deviation respectively, which can be calculated via a multilayer perceptron. Precisely, given the word
embedding y_{t-1}, the previous latent structure variable z_{t-1}, and the previous deterministic hidden
state h^d_{t-1}, we first project them to a new hidden space:

    h^{ez}_t = g(W^{ez}_{yh} y_{t-1} + W^{ez}_{zh} z_{t-1} + W^{ez}_{hh} h^d_{t-1} + b^{ez}_h)

where W^{ez}_{yh} ∈ R^{k_h×k_w}, W^{ez}_{zh} ∈ R^{k_h×k_z}, W^{ez}_{hh} ∈ R^{k_h×k_h}, and b^{ez}_h ∈ R^{k_h}. g is the sigmoid
activation function: σ(x) = 1/(1 + e^{-x}). Then the Gaussian parameters µ_t ∈ R^{k_z} and σ_t ∈ R^{k_z}
can be obtained via a linear transformation based on h^{ez}_t:

    \mu_t = W^{ez}_{h\mu} h^{ez}_t + b^{ez}_\mu, \quad \log(\sigma_t^2) = W^{ez}_{h\sigma} h^{ez}_t + b^{ez}_\sigma    (16)

The latent structure variable z_t ∈ R^{k_z} can then be calculated using the reparameterization trick:

    \epsilon \sim N(0, I), \quad z_t = \mu_t + \sigma_t \otimes \epsilon    (17)

where ε ∈ R^{k_z} is an auxiliary noise variable. The process of inferring z_t with neural networks can
be treated as a variational encoding process.

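A minimal sketch of the reparameterization trick in Eq. (17): z is sampled from N(µ, σ²I) as a deterministic function of µ, σ, and auxiliary noise, so gradients can flow through µ and σ. The latent size used in the call is an assumption taken from the experimental settings.

import numpy as np

def reparameterize(mu, log_sigma_sq, rng):
    # Eq. (17): z = mu + sigma * eps with eps ~ N(0, I).
    eps = rng.normal(size=mu.shape)
    sigma = np.exp(0.5 * log_sigma_sq)
    return mu + sigma * eps

rng = np.random.default_rng(0)
z_t = reparameterize(np.zeros(500), np.zeros(500), rng)  # k_z = 500 (assumption)
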
To generate summaries precisely, we integrate the recurrent generative decoding component with the
discriminative deterministic decoding component, and map the latent structure variable z_t and the
deterministic decoding hidden state h^{d_2}_t to a new hidden variable:

    h^{d_y}_t = \tanh(W^d_{zh_y} z_t + W^{d_2}_{hh} h^{d_2}_t + b^d_{h_y})    (18)

Given the combined decoding state h^{d_y}_t at time t, the probability of generating any target word y_t
is given as follows:

    y_t = \varsigma(W^d_{hy} h^{d_y}_t + b^d_{hy})    (19)

where W^d_{hy} ∈ R^{k_y×k_h} and b^d_{hy} ∈ R^{k_y}, and ς(·) is the softmax function. Finally, we use a beam
search algorithm [8] for decoding and generating the best summaries.

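For reference, the following is a minimal, generic beam search sketch, not the authors' actual decoder. The beam size of 10 matches the experimental settings, while step_log_probs is a hypothetical stand-in for the model's per-step softmax in Eq. (19), and the eos token id and maximum length are assumptions.

import numpy as np

def beam_search(step_log_probs, beam_size=10, max_len=50, eos=0):
    # step_log_probs(prefix) -> log-probabilities over the vocabulary for the
    # next token given the generated prefix (stand-in for Eq. (19)).
    beams = [([], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:
                candidates.append((seq, score))    # keep finished hypotheses
                continue
            logp = step_log_probs(seq)
            for w in np.argsort(logp)[-beam_size:]:
                candidates.append((seq + [int(w)], score + float(logp[w])))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return max(beams, key=lambda c: c[1])[0]       # best-scoring sequence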

4.2 Learning

We use {X}_N and {Y}_N to denote the training source and target sequences. The final objective
function, the negative log-likelihood (NLL) under the model parameterized by Θ, which needs to be
minimized, is:

    \min_\Theta J = - \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T} \log p(y^{(n)}_t \mid y^{(n)}_{<t}, X^{(n)})    (20)

5 Experiments and Discussions


5.1 Datasets

We train and evaluate our framework on two popular datasets. For image classification, we use the
MNIST¹ dataset. For headline generation, we use Gigawords, an English sentence summarization
dataset prepared from Annotated Gigawords² by extracting the first sentence of each article to create
a source-summary pair. It contains roughly 3.8M training pairs, 190K validation pairs, and 2,000
test pairs.
¹ http://yann.lecun.com/exdb/mnist/
² https://catalog.ldc.upenn.edu/ldc2012t21

Table 1: Examples of the generated headlines. S: input source document. G: ground truth. P:
generated abstractive headlines.
S(1): factory orders for manufactured goods rose #.# percent in september , the
commerce department said here thursday.
G: us september factory orders up #.# percent.
P: us factory orders up #.# percent in september.
S(2): nick zito , who had three horses finish second during rival trainer d. wayne
lukas ’ streak of six straight victories in triple crown races , won the preakness
on saturday with louis quatorze , who had finished ##th in the kentucky derby.
G: UNK louis UNK wins preakness.
P: zito wins preakness for louis UNK.
S(3): UNK thorpe remembers her difficult marriage to jim thorpe , called the
greatest athlete of the modern era , and their harsh life outside the spotlight.
G: late olympian ’s wife recalls hard times <unk> photo available.
P: thorpe remembers her difficult marriage to thorpe.

5.2 Experimental Settings

For the MLP and CNN frameworks, the hidden size is 500. For the experiments on the English dataset
Gigawords, we set the dimension of the word embeddings to 300, and the dimension of the hidden
states and latent variables to 500. The maximum lengths of documents and summaries are 100 and 50
respectively. The batch size for mini-batch training is 256. The beam size of the decoder was set to 10.
Our neural network based framework is implemented using Theano [1] on a single GTX 1080 GPU.

5.3 Performance of Different Gradient Methods

As shown in Figure 3 and Figure 4, the gradient methods discussed above perform differently on
different tasks and neural network structures. For the MLP, momentum, Adagrad, and Adam converge
faster than the other methods. For the CNN, Adadelta also performs well. However, in the RNN
model, the methods that do not require tuning the step size, Adadelta and Adapg, perform best.
Considering that training can take several weeks on some large-scale tasks, needing less step size
tuning makes these methods more convenient.

5.4 Case Analysis for Headline Generation

As shown in Figure 5, we report the ROUGE scores (overlap between prediction and ground truth) of
our own framework DRGD and the baseline methods. Our neural network based framework trained
with Adadelta outperforms all the baselines. As shown in Table 1, we select several sample cases to
demonstrate the performance of our framework. From these cases we can see that our framework
generates headlines with good linguistic quality.

6 Conclusions
In this investigation we compare the performance of different gradient methods in deep learning and
find that the structure of the model affects the choice of method.

[Figure 3: Performance of different gradient methods in the MLP and CNN frameworks. Panels: (a) MLP: objective value; (b) MLP: accuracy; (c) CNN: objective value; (d) CNN: accuracy. The curves compare adapg, adadelta, adam, adagrad, rmsprop, momentum, and sgd with various step sizes, plotted as loss or accuracy against iteration.]

[Figure 4: RNN: objective value. The curves compare adapg, adadelta, adagrad, adam, rmsprop, momentum, and sgd, plotted as loss against iteration.]

Figure 5: ROUGE-F1 on Gigawords

System         R-1     R-2     R-L
ABS            29.55   11.32   26.42
ABS+           29.78   11.89   26.97
RAS-LSTM       32.55   14.70   30.03
RAS-Elman      33.78   15.97   31.15
ASC + FSC1     34.17   15.94   31.92
lvt2k-1sent    32.67   15.59   30.64
lvt5k-1sent    35.30   16.64   32.62
DRGD           36.27   17.57   33.62

References
[1] Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian Goodfellow, Arnaud Bergeron,
Nicolas Bouchard, David Warde-Farley, and Yoshua Bengio. Theano: new features and speed improvements.
arXiv preprint arXiv:1211.5590, 2012.
[2] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and
stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
[3] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016.
[4] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Rezende, and Daan Wierstra. Draw: A recurrent neural
network for image generation. In ICML, pages 1462–1471, 2015.
[5] Geoffrey Hinton, N. Srivastava, and Kevin Swersky. Lecture 6a: Overview of mini-batch gradient descent.
Coursera lecture slides, https://class.coursera.org/neuralnets-2012-001/lecture, 2012.
[6] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
[7] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.
[8] Philipp Koehn. Pharaoh: a beam search decoder for phrase-based statistical machine translation models.
In Conference of the Association for Machine Translation in the Americas, pages 115–124. Springer, 2004.
[9] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[10] Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence
O(1/k²). In Doklady AN SSSR, volume 269, pages 543–547, 1983.
[11] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approxi-
mate inference in deep generative models. In ICML, pages 1278–1286, 2014.
[12] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint
arXiv:1609.04747, 2016.
[13] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-
propagating errors. Cognitive modeling, 5(3):1, 1988.
[14] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian
Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go
with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[15] Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
