CS 437 / CS 5317
Deep Learning
Murtaza Taj
[email protected]
Lecture 6: Optimizers, Reading Ch 4
Wed 3rd Feb 2021
Gradient Descent with Momentum
! Gradient Descent
w = w − η ⋅ ∇_w J(w)
w = w − η ⋅ ∇_w J(w; x^(i); y^(i))
! Gradient Descent with momentum
v_t = γ v_{t−1} + η ∇_w J(w)
w = w − v_t
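Below is a minimal NumPy sketch of the momentum update above; grad_J, lr (for η), and gamma (for γ) are illustrative names, not from the slides.

```python
import numpy as np

# A minimal sketch of gradient descent with momentum (illustrative names).
# grad_J(w) is assumed to return the gradient of the loss J at w.
def momentum_step(w, v, grad_J, lr=0.01, gamma=0.9):
    v = gamma * v + lr * grad_J(w)   # v_t = gamma * v_{t-1} + eta * grad
    w = w - v                        # w = w - v_t
    return w, v

# Toy example: J(w) = 0.5 * ||w||^2, so grad_J(w) = w.
w, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(200):
    w, v = momentum_step(w, v, grad_J=lambda w: w)
print(w)  # moves towards the minimum at [0, 0]
```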
Nesterov accelerated gradient (NAG)
! Slow down SGD before the hill slopes up again
! Computing w − γ v_{t−1} thus gives us an approximation (rough idea) of
the next position of the parameters
! We can now effectively look ahead by calculating the gradient not
w.r.t. to our current parameters w but w.r.t. the approximate future
position of our parameters:
Momentum: v_t = γ v_{t−1} + η ∇_w J(w), w = w − v_t
NAG: v_t = γ v_{t−1} + η ∇_w J(w − γ v_{t−1}), w = w − v_t
! NAG has significantly increased the performance of RNNs
Nesterov accelerated gradient (NAG)
Nesterov update (Source: G. Hinton's lecture 6c)
! Momentum first computes the current gradient (small blue vector) and
then takes a big jump in the direction of the updated accumulated
gradient (big blue vector),
! NAG first makes a big jump in the direction of the previous
accumulated gradient (brown vector), measures the gradient and then
makes a correction (red vector), which results in the complete NAG
update (green vector).
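A minimal sketch of the NAG step under the same illustrative naming: the only change from plain momentum is that the gradient is measured at the look-ahead point w − γ v_{t−1}.

```python
# A minimal sketch of the Nesterov accelerated gradient (NAG) step (illustrative names).
def nag_step(w, v, grad_J, lr=0.01, gamma=0.9):
    lookahead = w - gamma * v               # approximate future position w - gamma * v_{t-1}
    v = gamma * v + lr * grad_J(lookahead)  # measure the gradient there, then accumulate
    w = w - v                               # complete NAG update
    return w, v

# Usage on a scalar toy loss J(w) = w**2 (gradient 2*w):
w, v = 4.0, 0.0
for _ in range(100):
    w, v = nag_step(w, v, grad_J=lambda w: 2 * w)
```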
Adagrad
! Adapts the learning rate
! smaller updates (i.e. low learning rates) for parameters associated with
frequently occurring features,
! larger updates (i.e. high learning rates) for parameters associated with
infrequent features. For this reason, it is well-suited for dealing with
sparse data.
! Uses a different learning rate for every parameter w_i at every time step t
g_{t,i} = ∇_w J(w_{t,i})
w_{t+1,i} = w_{t,i} − η / √(G_{t,ii} + ϵ) ⋅ g_{t,i}
! G_t ∈ ℝ^{d×d} is a diagonal matrix where each diagonal element (i, i) is the
sum of the squares of the gradients w.r.t. w_i up to time step t
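A minimal sketch of the Adagrad update; because G_t is diagonal, only a vector of accumulated squared gradients needs to be stored (names are illustrative).

```python
import numpy as np

# A minimal sketch of Adagrad: the diagonal of G_t is kept as a vector G.
def adagrad_step(w, G, grad_J, lr=0.01, eps=1e-8):
    g = grad_J(w)
    G = G + g ** 2                        # accumulate squared gradients per parameter
    w = w - lr / np.sqrt(G + eps) * g     # per-parameter effective learning rate
    return w, G

w, G = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(100):
    w, G = adagrad_step(w, G, grad_J=lambda w: w)
```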
Adadelta & RMS Prop
! Adagrad's main weakness is its accumulation of the squared gradients in the
denominator: as t → ∞ the accumulated sum keeps growing, so the effective
learning rate shrinks towards zero (η → 0) and learning eventually stalls
! Instead of inefficiently storing previous squared gradients, the sum of
gradients is recursively defined as a decaying average of all past
squared gradients.
E[g²]_t = γ E[g²]_{t−1} + (1 − γ) g_t²
w_{t+1,i} = w_{t,i} − η / √(E[g²]_t + ϵ) ⋅ g_t
Δw_t = − η / RMS[g]_t ⋅ g_t
! RMSprop: E[g²]_t = 0.9 E[g²]_{t−1} + 0.1 g_t²
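A minimal sketch of the RMSprop update, where the decaying average E[g²]_t replaces Adagrad's ever-growing sum (illustrative names).

```python
import numpy as np

# A minimal sketch of RMSprop: Eg2 holds the decaying average E[g^2]_t.
def rmsprop_step(w, Eg2, grad_J, lr=0.001, gamma=0.9, eps=1e-8):
    g = grad_J(w)
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2   # E[g^2]_t
    w = w - lr / np.sqrt(Eg2 + eps) * g
    return w, Eg2

w, Eg2 = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(500):
    w, Eg2 = rmsprop_step(w, Eg2, grad_J=lambda w: w)
```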
Recall
Gradient Descent Variants & Optimization Algorithms
! Variants
! Vanilla / Batch gradient descent: w = w − η ⋅ ∇_w J(w)
! Stochastic gradient descent: w = w − η ⋅ ∇_w J(w; x^(i); y^(i))
! Mini-batch gradient descent: w = w − η ⋅ ∇_w J(w; x^(i:i+n); y^(i:i+n))
! Optimization Algorithms
! Momentum: v_t = γ v_{t−1} + η ∇_w J(w), w = w − v_t
! Nesterov accelerated gradient: v_t = γ v_{t−1} + η ∇_w J(w − γ v_{t−1}), w = w − v_t
! Adagrad: w_{t+1,i} = w_{t,i} − η / √(G_{t,ii} + ϵ) ⋅ g_{t,i}
! Adadelta: E[g²]_t = γ E[g²]_{t−1} + (1 − γ) g_t², w_{t+1,i} = w_{t,i} − η / √(E[g²]_t + ϵ) ⋅ g_t
! RMSprop: E[g²]_t = 0.9 E[g²]_{t−1} + 0.1 g_t²
Gradient descent optimization algorithms
! Adam
m_t = β₁ m_{t−1} + (1 − β₁) g_t,  m̂_t = m_t / (1 − β₁^t)
v_t = β₂ v_{t−1} + (1 − β₂) g_t²,  v̂_t = v_t / (1 − β₂^t)
w_{t+1} = w_t − η / (√(v̂_t) + ϵ) ⋅ m̂_t
http://ruder.io/optimizing-gradient-descent/
! AdaMax
u_t = β₂^∞ v_{t−1} + (1 − β₂^∞) |g_t|^∞ = max(β₂ ⋅ v_{t−1}, |g_t|)
w_{t+1} = w_t − (η / u_t) ⋅ m̂_t
! Nadam
w_{t+1} = w_t − η / (√(v̂_t) + ϵ) ⋅ (β₁ m̂_t + (1 − β₁) g_t / (1 − β₁^t))
! AMSGrad
m_t = β₁ m_{t−1} + (1 − β₁) g_t
v_t = β₂ v_{t−1} + (1 − β₂) g_t²
v̂_t = max(v̂_{t−1}, v_t)
w_{t+1} = w_t − η / (√(v̂_t) + ϵ) ⋅ m_t
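A minimal sketch of the Adam update with bias correction; beta1, beta2, and eps stand for β₁, β₂, and ϵ (illustrative names).

```python
import numpy as np

# A minimal sketch of Adam: m and v are the first- and second-moment estimates.
def adam_step(w, m, v, t, grad_J, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    g = grad_J(w)
    m = beta1 * m + (1 - beta1) * g           # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * g ** 2      # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)              # bias-corrected m_t
    v_hat = v / (1 - beta2 ** t)              # bias-corrected v_t
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([5.0, -3.0])
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 1001):                      # t starts at 1 for bias correction
    w, m, v = adam_step(w, m, v, t, grad_J=lambda w: w)
```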
Optimizers Comparison
Figures: SGD optimization on loss surface contours; SGD optimization on saddle point
Reading:
http://ruder.io/optimizing-gradient-descent/
Next
! 1D Conv
! 2D Conv
! Convolution-Filters (Edge detection)
! Forward and Backward Propagation using Convolution
operation
! Transforming Multilayer Perceptron to Convolutional
Neural Network