
CS 437 / CS 5317


Deep Learning

Murtaza Taj
[email protected]

Lecture 6: Optimizers, Reading Ch 4


Wed 3rd Feb 2021
Gradient Descent with Momentum

! Gradient Descent
w = w − η ⋅ ∇_w J(w)
w = w − η ⋅ ∇_w J(w; x^(i); y^(i))

! Gradient Descent with momentum
v_t = γ v_{t−1} + η ∇_w J(w)
w = w − v_t
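As a rough illustration (not from the slides), here is a minimal NumPy sketch of the two updates above; grad_fn, eta, and gamma are placeholder names for the loss gradient, learning rate η, and momentum term γ:

import numpy as np

def gd_step(w, grad_fn, eta=0.01):
    # plain gradient descent: w = w - eta * grad J(w)
    return w - eta * grad_fn(w)

def momentum_step(w, v, grad_fn, eta=0.01, gamma=0.9):
    # v_t = gamma * v_{t-1} + eta * grad J(w);  w = w - v_t
    v = gamma * v + eta * grad_fn(w)   # decaying accumulation of past gradients
    return w - v, v

# toy usage on J(w) = 0.5 * ||w||^2, whose gradient is simply w
w, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(100):
    w, v = momentum_step(w, v, grad_fn=lambda w: w)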
Nesterov accelerated gradient (NAG)
! Slow down SGD before the hill slopes up again

! Computing w − γ v_{t−1} thus gives us an approximation (rough idea) of
the next position of the parameters

! We can now effectively look ahead by calculating the gradient not
w.r.t. our current parameters w but w.r.t. the approximate future
position of our parameters:

Momentum:  v_t = γ v_{t−1} + η ∇_w J(w);  w = w − v_t
NAG:       v_t = γ v_{t−1} + η ∇_w J(w − γ v_{t−1});  w = w − v_t

! NAG significantly increased the performance of RNNs
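A sketch of the look-ahead variant, under the same placeholder names as the momentum sketch above; the only change is that the gradient is evaluated at the approximate future position w − γ v_{t−1}:

def nag_step(w, v, grad_fn, eta=0.01, gamma=0.9):
    # v_t = gamma * v_{t-1} + eta * grad J(w - gamma * v_{t-1});  w = w - v_t
    g = grad_fn(w - gamma * v)         # gradient at the look-ahead position
    v = gamma * v + eta * g
    return w - v, v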


Nesterov accelerated gradient (NAG)

Nesterov update (Source: G. Hinton's lecture 6c)

! Momentum first computes the current gradient (small blue vector) and
then takes a big jump in the direction of the updated accumulated
gradient (big blue vector),

! NAG first makes a big jump in the direction of the previous
accumulated gradient (brown vector), measures the gradient and then
makes a correction (red vector), which results in the complete NAG
update (green vector).
Adagrad
! Adapts the learning rate:

! smaller updates (i.e. low learning rates) for parameters associated with
frequently occurring features,

! larger updates (i.e. high learning rates) for parameters associated with
infrequent features. For this reason, it is well-suited for dealing with
sparse data.

! Uses a different learning rate for every w_i at every time step t

g_{t,j} = ∇_w J(w_{t,j})

w_{t+1,j} = w_{t,j} − η / √(G_{t,jj} + ϵ) ⋅ g_{t,j}

! G_t ∈ ℝ^{d×d} is a diagonal matrix where each diagonal element (j, j) is the
sum of the squares of the gradients w.r.t. w_j up to time step t
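A minimal sketch of the per-parameter update, with the same illustrative grad_fn as before; the vector G stands in for the diagonal of G_t:

import numpy as np

def adagrad_step(w, G, grad_fn, eta=0.01, eps=1e-8):
    g = grad_fn(w)
    G = G + g**2                        # running sum of squared gradients (diag of G_t)
    w = w - eta / np.sqrt(G + eps) * g  # frequently updated parameters take smaller steps
    return w, G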
Adadelta & RMS Prop
! Adagrad's main weakness is its accumulation of the squared
gradients in the denominator: the positive terms keep accumulating
during training, so as t → ∞ the effective learning rate η → 0.
! Instead of inefficiently storing previous squared gradients, the sum of
gradients is recursively defined as a decaying average of all past
squared gradients.
E[g²]_t = γ E[g²]_{t−1} + (1 − γ) g_t²

w_{t+1,i} = w_{t,i} − η / √(E[g²]_t + ϵ) ⋅ g_t

Δw_t = − η / RMS[g]_t ⋅ g_t

! RMSprop:  E[g²]_t = 0.9 E[g²]_{t−1} + 0.1 g_t²
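A sketch of the decaying-average update in RMSprop form, again with illustrative placeholder names; replacing Adagrad's growing sum with the running average E[g²] keeps the denominator from growing without bound:

import numpy as np

def rmsprop_step(w, Eg2, grad_fn, eta=0.001, gamma=0.9, eps=1e-8):
    g = grad_fn(w)
    Eg2 = gamma * Eg2 + (1 - gamma) * g**2   # E[g^2]_t, with gamma = 0.9 in RMSprop
    w = w - eta / np.sqrt(Eg2 + eps) * g
    return w, Eg2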


Recall
Gradient descent Variants & optimization algorithms
! Variants
! Vanilla / Batch gradient descent:  w = w − η ⋅ ∇_w J(w)
! Stochastic gradient descent:  w = w − η ⋅ ∇_w J(w; x^(i); y^(i))
! Mini-batch gradient descent:  w = w − η ⋅ ∇_w J(w; x^(i:i+n); y^(i:i+n))

! Optimization Algos
! Momentum:  v_t = γ v_{t−1} + η ∇_w J(w);  w = w − v_t
! Nesterov accelerated gradient:  v_t = γ v_{t−1} + η ∇_w J(w − γ v_{t−1});  w = w − v_t
! Adagrad:  w_{t+1,i} = w_{t,i} − η / √(G_{t,ii} + ϵ) ⋅ g_{t,i}
! Adadelta:  E[g²]_t = γ E[g²]_{t−1} + (1 − γ) g_t²
             w_{t+1,i} = w_{t,i} − η / √(E[g²]_t + ϵ) ⋅ g_t
! RMSprop:  E[g²]_t = 0.9 E[g²]_{t−1} + 0.1 g_t²


Gradient descent optimization algorithms
! Optimization Algos

! Adam
m_t = β_1 m_{t−1} + (1 − β_1) g_t        m̂_t = m_t / (1 − β_1^t)
v_t = β_2 v_{t−1} + (1 − β_2) g_t²       v̂_t = v_t / (1 − β_2^t)
w_{t+1} = w_t − η / (√v̂_t + ϵ) ⋅ m̂_t

http://ruder.io/optimizing-gradient-descent/
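A sketch of the Adam step above; beta1 = 0.9 and beta2 = 0.999 are the commonly used defaults (not stated on the slide), and t is the 1-based iteration count used for bias correction:

import numpy as np

def adam_step(w, m, v, t, grad_fn, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    g = grad_fn(w)
    m = beta1 * m + (1 - beta1) * g          # first moment estimate m_t
    v = beta2 * v + (1 - beta2) * g**2       # second moment estimate v_t
    m_hat = m / (1 - beta1**t)               # bias-corrected m_hat_t
    v_hat = v / (1 - beta2**t)               # bias-corrected v_hat_t
    w = w - eta / (np.sqrt(v_hat) + eps) * m_hat
    return w, m, v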
! AdaMax
u_t = β_2^∞ v_{t−1} + (1 − β_2^∞) |g_t|^∞ = max(β_2 ⋅ v_{t−1}, |g_t|)
w_{t+1} = w_t − η / u_t ⋅ m̂_t

! Nadam
w_{t+1} = w_t − η / (√v̂_t + ϵ) ⋅ (β_1 m̂_t + (1 − β_1) g_t / (1 − β_1^t))

! AMSGrad
m_t = β_1 m_{t−1} + (1 − β_1) g_t
v_t = β_2 v_{t−1} + (1 − β_2) g_t²
v̂_t = max(v̂_{t−1}, v_t)
w_{t+1} = w_t − η / (√v̂_t + ϵ) ⋅ m_t
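And a sketch of AMSGrad along the same lines; per the update above it keeps the running maximum of past v_t and applies m_t without bias correction (names remain illustrative):

import numpy as np

def amsgrad_step(w, m, v, v_hat, grad_fn, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    g = grad_fn(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    v_hat = np.maximum(v_hat, v)             # v_hat_t = max(v_hat_{t-1}, v_t)
    w = w - eta / (np.sqrt(v_hat) + eps) * m
    return w, m, v, v_hat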
Optimizers Comparison

Figures: SGD optimization on loss surface contours; SGD optimization on saddle point

Reading:
http://ruder.io/optimizing-gradient-descent/
Next
! 1D Conv
! 2D Conv
! Convolution-Filters (Edge detection)
! Forward and Backward Propagation using Convolution
operation
! Transforming Multilayer Perceptron to Convolutional
Neural Network
