
Optimization Methods for Deep Learning

S. Sumitra
Department of Mathematics
Indian Institute of Space Science and Technology

Graphical and Deep Learning


Optimization Methods

• Momentum
• Nesterov Momentum
• AdaGrad
• RMSProp
• Adam
Properties of slope

• When the curve is steep, the slope is large


• ∇f (x)/∇x = (f (x + h) − f (x))/h
• When the curve is gentle, the slope is small
• The weight update is proportional to the gradient. Therefore, in the areas where the curve is gentle the updates are small, whereas in the areas where the curve is steep the updates are large (see the numerical sketch after this list).
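
A quick numerical illustration of the slope formula above, as a minimal Python sketch; the function f and the step h are arbitrary choices for demonstration, not taken from the slides:

def slope(f, x, h=1e-5):
    # Finite-difference approximation: (f(x + h) - f(x)) / h
    return (f(x + h) - f(x)) / h

f = lambda x: x ** 2          # steep far from 0, gentle near 0
print(slope(f, 5.0))          # large slope in the steep region (about 10)
print(slope(f, 0.1))          # small slope in the gentle region (about 0.2)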
Momentum

• A lot of time is taken to navigate the regions that have a gentle slope, as the gradient in these regions is very small
• How to overcome this? Apply momentum
• Take a faster step if many past updates point in the same direction
• Just as a ball gains momentum while rolling down a slope
• To apply momentum, look at the history of the updates
• Use the concept of the exponentially moving average to reduce oscillations
Exponentially Moving Average (EMA)

The EMA for a series Y may be calculated recursively:

St = Y1 ,                       t = 1
St = βYt + (1 − β) · St−1 ,     t > 1

β ∈ (0, 1]
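
A minimal Python sketch of this recursion; the series values and the β below are made up for illustration:

def ema(Y, beta):
    # S_1 = Y_1; S_t = beta * Y_t + (1 - beta) * S_{t-1} for t > 1
    S = [Y[0]]
    for y in Y[1:]:
        S.append(beta * y + (1 - beta) * S[-1])
    return S

print(ema([1.0, 2.0, 3.0, 4.0], beta=0.5))   # [1.0, 1.5, 2.25, 3.125]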
Formulation

current update = α∇wt
previous update = updatet−1
The EMA of the current update and the previous update is given by

γ updatet−1 + (1 − γ)α∇wt

Absorbing the factor (1 − γ) into the learning rate, choose α := (1 − γ)α, so the combination becomes γ updatet−1 + α∇wt .
History of Updates

• γ ∈ (0, 1]
• updatet = γ updatet−1 + α∇wt
• wt+1 = wt − updatet = wt − γ updatet−1 − α∇wt
• In addition to the current update, also look at the history of updates
• update0 = 0
• update1 = γ update0 + α∇w1 = α∇w1
• update2 = γ update1 + α∇w2 = γα∇w1 + α∇w2
• update3 = γ update2 + α∇w3 = γ^2 α∇w1 + γα∇w2 + α∇w3
• In general, updatet = γ updatet−1 + α∇wt = γ^(t−1) α∇w1 + γ^(t−2) α∇w2 + · · · + α∇wt
• Common values of γ are 0.5, 0.9, 0.99
Algorithm 1 Momentum
Initial w, b
updatew = 0, updateb = 0
choose momentum parameter γ, learning rate α
Iterate until convergence
{
Choose a mini batch of d training examples
updatew := γ updatew + (α/d) Σ_{m=1}^{d} ∇w Lm (xm , ym )
updateb := γ updateb + (α/d) Σ_{m=1}^{d} ∇b Lm (xm , ym )
w := w − updatew
b := b − updateb
}
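
A minimal NumPy sketch of Algorithm 1 for a single parameter vector; the toy loss L(w) = ||w||^2, its gradient, and the hyperparameter values are illustrative assumptions, and in practice the gradient would be computed from a mini batch:

import numpy as np

def momentum_step(w, update_w, grad_w, gamma=0.9, alpha=0.01):
    # update_w := gamma * update_w + alpha * grad_w ;  w := w - update_w
    update_w = gamma * update_w + alpha * grad_w
    return w - update_w, update_w

# Toy example: minimise L(w) = ||w||^2, whose gradient is 2w
w = np.array([5.0, -3.0])
update_w = np.zeros_like(w)
for t in range(100):
    grad_w = 2 * w                       # a mini-batch gradient would go here
    w, update_w = momentum_step(w, update_w, grad_w)
print(w)                                 # close to the minimiser [0, 0]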
Nesterov Accelerated Gradient Descent

• Even in the regions having gentle slopes, momentum based gradient descent is able to take large steps because the momentum carries it along
• Is moving fast always good? Could the momentum cause us to run past our goal?
• Is there a way to reduce oscillations?
• In momentum,
• updatet = γ updatet−1 + α∇wt
• Move by at least γ updatet−1 and then a little more by α∇wt
Updation Steps

• Partially updated value of w:
• wt+1/2 = wt − γ updatet−1
• Calculate the gradient at the partially updated value of w: ∇wt+1/2 L
• Find updatet :
• updatet = γ updatet−1 + α ∇wt+1/2 L
• wt+1 = wt − updatet
Algorithm 2 Nesterov Momentum
Initial w, b
updatew = 0, updateb = 0, gw = gb = 0, w̄ = 0, b̄ = 0
choose momentum parameter γ, learning rate α
Iterate until convergence
{
Choose a mini batch of d training examples
{
interim update: w̄ := w − γ updatew
interim update: b̄ := b − γ updateb
Gradient at interim update: gw := (1/d) Σ_{m=1}^{d} ∇w̄ Lm (xm , ym )
Gradient at interim update: gb := (1/d) Σ_{m=1}^{d} ∇b̄ Lm (xm , ym )
updatew := γ updatew + αgw
updateb := γ updateb + αgb
w := w − updatew
b := b − updateb
}
}
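
A minimal NumPy sketch of Algorithm 2 along the same lines; grad_fn stands in for the mini-batch gradient, and the toy quadratic loss and hyperparameters are illustrative assumptions:

import numpy as np

def nesterov_step(w, update_w, grad_fn, gamma=0.9, alpha=0.01):
    w_bar = w - gamma * update_w          # interim (look-ahead) update
    g = grad_fn(w_bar)                    # gradient at the interim point, not at w
    update_w = gamma * update_w + alpha * g
    return w - update_w, update_w

# Toy example: minimise L(w) = ||w||^2, gradient 2w
grad_fn = lambda w: 2 * w
w, update_w = np.array([5.0, -3.0]), np.zeros(2)
for t in range(100):
    w, update_w = nesterov_step(w, update_w, grad_fn)
print(w)                                  # near the minimiser [0, 0]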
AdaGrad

• Different learning rate for each feature at each iteration (time step)
• Adaptively scale the learning rate for each dimension

wt+1 = wt − (α / √(ϵI + diag(Gt ))) ⊙ gt      (1)

where α is the learning rate, ϵ is a small quantity used to avoid division by zero, I is the identity matrix, and gt is the gradient estimate at time step t,

gt = (1/d) Σ_{m=1}^{d} ∇wt L(xm , ym )

where d is the number of data points in the batch.


The key to this algorithm is the matrix Gt , which is the sum of the outer products of the gradients until time step t, defined by Gt = Σ_{τ=1}^{t} gτ gτ^T . Only the diagonal elements of Gt are considered, for ease of computation.
From (1), considering only the diagonal elements of Gt ,

(wt+1^(1) , . . . , wt+1^(n) )^T = (wt^(1) , . . . , wt^(n) )^T − α diag( 1/√(ϵ + Gt^(1,1) ), . . . , 1/√(ϵ + Gt^(n,n) ) ) (gt^(1) , . . . , gt^(n) )^T

that is, each component of w is updated as

wt+1^(i) = wt^(i) − ( α / √(ϵ + Gt^(i,i) ) ) gt^(i) ,    i = 1, 2, . . . , n

The parameters with the largest partial derivative of the loss


have a correspondingly rapid decrease in their learning rate,
while parameters with small partial derivatives have a relatively
small decrease in their learning rate. The net effect is greater
progress in the more gently sloped directions of parameter
space.
(Gt^(1,1) , Gt^(2,2) , . . . , Gt^(n,n) )^T = Σ_{τ=1}^{t} gτ ⊙ gτ = st−1 + gt ⊙ gt

Here, st−1 = Σ_{τ=1}^{t−1} gτ ⊙ gτ
That is,

st = st−1 + gt ⊙ gt
Let st^(i) , i = 1, 2, . . . , n be the i th element of st . Now perform the following operation:

s̃t^(i) = α / √(ϵ + st^(i) ) ,    i = 1, 2, . . . , n

The elementwise multiplication and division performed on st , as described above, is denoted by

s̃t = α / √(ϵ + st )
Algorithm 3 AdaGrad
Initial w, b
ϵ = 1e − 8, sw = 0, sb = 0
gw = 0, gb = 0, learning rate α
Iterate until convergence
{
Choose a mini batch of d training examples
{
Gradient Computation: gw := (1/d) Σ_{m=1}^{d} ∇w L(xm , ym )
Gradient Computation: gb := (1/d) Σ_{m=1}^{d} ∇b L(xm , ym )
Accumulation of squared gradient: sw := sw + gw ⊙ gw
Accumulation of squared gradient: sb := sb + gb ⊙ gb
Updation: w := w − (α / √(ϵ + sw )) ⊙ gw
Updation: b := b − (α / √(ϵ + sb )) ⊙ gb
}
}
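
A minimal NumPy sketch of Algorithm 3 for one parameter vector; the ill-scaled toy loss and hyperparameter values are illustrative assumptions:

import numpy as np

def adagrad_step(w, s, grad_w, alpha=0.01, eps=1e-8):
    s = s + grad_w * grad_w                      # accumulate squared gradient
    w = w - (alpha / np.sqrt(eps + s)) * grad_w  # per-coordinate scaled update
    return w, s

# Toy example: L(w) = 10*w1^2 + 0.1*w2^2, so the gradient scales differ by 100x
grad_fn = lambda w: np.array([20.0 * w[0], 0.2 * w[1]])
w, s = np.array([1.0, 1.0]), np.zeros(2)
for t in range(500):
    w, s = adagrad_step(w, s, grad_fn(w))
print(w)   # both coordinates make comparable progress despite the scale difference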
Problems with AdaGrad

• Lowers the update size very aggressively
• AdaGrad is designed to converge rapidly when applied to a convex function. When applied to a nonconvex function to train a neural network, the learning trajectory may pass through many different structures and eventually arrive at a region that is a locally convex bowl. AdaGrad shrinks the learning rate according to the entire history of the squared gradient and may have made the learning rate too small before arriving at such a convex structure.
Refer: Deep Learning Book by Ian Goodfellow et al.
RMSprop

• The RMSProp algorithm modifies AdaGrad by using an exponential moving average in place of gradient accumulation. Hence history from the extreme past is discarded, so that it can converge rapidly after finding a convex bowl.
Algorithm 4 RMSprop
Initial w, b
ϵ = 1e − 8, sw = 0, sb = 0
gw = 0, gb = 0, learning rate α, decay rate γ
Iterate until convergence
{
Choose a mini batch of d training examples
{
Gradient Computation: gw := (1/d) Σ_{m=1}^{d} ∇w L(xm , ym )
Gradient Computation: gb := (1/d) Σ_{m=1}^{d} ∇b L(xm , ym )
EMA of squared gradient: sw := γsw + (1 − γ)gw ⊙ gw
EMA of squared gradient: sb := γsb + (1 − γ)gb ⊙ gb
Updation: w := w − (α / √(ϵ + sw )) ⊙ gw
Updation: b := b − (α / √(ϵ + sb )) ⊙ gb
}
}
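
A minimal NumPy sketch of the RMSprop update in Algorithm 4; the decay rate, learning rate, iteration count, and toy gradient are illustrative assumptions:

import numpy as np

def rmsprop_step(w, s, grad_w, alpha=0.001, gamma=0.9, eps=1e-8):
    s = gamma * s + (1 - gamma) * grad_w * grad_w   # EMA of squared gradient
    w = w - (alpha / np.sqrt(eps + s)) * grad_w     # per-coordinate scaled update
    return w, s

# Usage with a toy gradient; only recent squared gradients influence the scaling
grad_fn = lambda w: np.array([20.0 * w[0], 0.2 * w[1]])
w, s = np.array([1.0, 1.0]), np.zeros(2)
for t in range(2000):
    w, s = rmsprop_step(w, s, grad_fn(w))
print(w)   # both coordinates approach 0; old history is forgotten by the EMA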
Adam

• Adam is an adaptive learning rate optimization algorithm
• Adam stands for Adaptive Moment Estimation
• It makes use of the principles of RMSprop and momentum
• As in momentum, Adam takes the exponential moving average of the gradients
• mt = β1 mt−1 + (1 − β1 )gt (first moment estimate)
• As in RMSProp, Adam takes the exponential moving average of the square of the gradients
• st = β2 st−1 + (1 − β2 )gt ⊙ gt (second moment estimate)
Bias Correction

As mt and st are initialized as vectors of 0’s, the authors of


Adam observe that they are biased towards zero, especially during the initial time steps, and especially when the decay rates are small (i.e., β1 and β2 are close to 1). They counteract these biases by computing bias-corrected first and second moment estimates:

m̂t = mt / (1 − β1^t ) ,    ŝt = st / (1 − β2^t )      (2)
Adam

wt+1 = wt − (α / √(ŝt + ϵ)) ⊙ m̂t

Suggested defaults: β1 = 0.9, β2 = 0.999, ϵ = 1e − 8, α = 0.001
Algorithm 5 ADAM
Initial w, b
ϵ = 1e − 8, mw = 0, mb = 0, sw = 0, sb = 0
gw = 0, gb = 0, β1 = 0.9, β2 = 0.999, learning rate α

Iterate until convergence
{
Iteration t: Choose a mini batch of d training examples
{
gw := (1/d) Σ_{m=1}^{d} ∇w L(xm , ym ) and gb := (1/d) Σ_{m=1}^{d} ∇b L(xm , ym )
mw := β1 mw + (1 − β1 )gw and mb := β1 mb + (1 − β1 )gb
sw := β2 sw + (1 − β2 )gw ⊙ gw and sb := β2 sb + (1 − β2 )gb ⊙ gb
m̂w := mw /(1 − β1^t ) ; ŝw := sw /(1 − β2^t ) ; m̂b := mb /(1 − β1^t ) ; ŝb := sb /(1 − β2^t )
w := w − (α / √(ϵ + ŝw )) ⊙ m̂w and b := b − (α / √(ϵ + ŝb )) ⊙ m̂b
}
}
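
A minimal NumPy sketch of Algorithm 5 for one parameter vector; the toy gradient, the iteration count, and the learning rate (raised here so the short loop converges) are illustrative assumptions:

import numpy as np

def adam_step(w, m, s, grad_w, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad_w              # EMA of gradient (first moment)
    s = beta2 * s + (1 - beta2) * grad_w * grad_w     # EMA of squared gradient (second moment)
    m_hat = m / (1 - beta1 ** t)                      # bias correction, t starts at 1
    s_hat = s / (1 - beta2 ** t)
    w = w - (alpha / np.sqrt(eps + s_hat)) * m_hat
    return w, m, s

# Toy example: minimise L(w) = ||w||^2, gradient 2w
grad_fn = lambda w: 2 * w
w, m, s = np.array([5.0, -3.0]), np.zeros(2), np.zeros(2)
for t in range(1, 1001):
    w, m, s = adam_step(w, m, s, grad_fn(w), t, alpha=0.01)
print(w)   # near the minimiser [0, 0]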
