
Optimization Methods for Deep Learning

S. Sumitra
Department of Mathematics
Indian Institute of Space Science and Technology

Graphical and Deep Learning


Optimization Methods

• Momentum
• Nesterov Momentum
• AdaGrad
• RMSProp
• Adam
Properties of slope

• When the curve is steep, the slope is large


• ∇f (x)/∇x = (f (x + h) − f (x))/h
• When the curve is gentle, the slope is small
• The weight update is proportional to the gradient. Therefore, in the areas where the curve is gentle the updates are small, whereas in the areas where the curve is steep the updates are large (see the numerical sketch after this list).
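
A quick numerical illustration of the slope formula above, as a minimal Python sketch; the function f and the step h are arbitrary choices for demonstration, not taken from the slides:

def slope(f, x, h=1e-5):
    # Finite-difference approximation: (f(x + h) - f(x)) / h
    return (f(x + h) - f(x)) / h

f = lambda x: x ** 2          # steep far from 0, gentle near 0
print(slope(f, 5.0))          # large slope in the steep region (about 10)
print(slope(f, 0.1))          # small slope in the gentle region (about 0.2)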
Momentum

• A lot of time is taken to navigate the regions that have a gentle slope, as the gradient in these regions is very small
• How to overcome this? Apply momentum
• Take a faster step if many past updates point in the same direction
• Just as a ball gains momentum while rolling down a slope
• To apply momentum, look at the history of the updates
• Use the concept of the exponentially moving average to reduce oscillations
Exponentially Moving Average (EMA)

The EMA for a series Y may be calculated recursively:

St = Y1 ,                       t = 1
St = βYt + (1 − β) · St−1 ,     t > 1

β ∈ (0, 1]
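
A minimal Python sketch of this recursion; the series values and the β below are made up for illustration:

def ema(Y, beta):
    # S_1 = Y_1; S_t = beta * Y_t + (1 - beta) * S_{t-1} for t > 1
    S = [Y[0]]
    for y in Y[1:]:
        S.append(beta * y + (1 - beta) * S[-1])
    return S

print(ema([1.0, 2.0, 3.0, 4.0], beta=0.5))   # [1.0, 1.5, 2.25, 3.125]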
Formulation

current update = α∇wt
previous update = updatet−1
The EMA of the current update and the previous update is given by

γ updatet−1 + (1 − γ)α∇wt

Absorbing the factor (1 − γ) into the learning rate, choose α := (1 − γ)α, so the combination becomes γ updatet−1 + α∇wt .
History of Updates

• γ ∈ (0, 1]
• updatet = γ updatet−1 + α∇wt
• wt+1 = wt − updatet = wt − γ updatet−1 − α∇wt
• In addition to the current update, also look at the history of updates
• update0 = 0
• update1 = γ update0 + α∇w1 = α∇w1
• update2 = γ update1 + α∇w2 = γα∇w1 + α∇w2
• update3 = γ update2 + α∇w3 = γ^2 α∇w1 + γα∇w2 + α∇w3
• In general, updatet = γ updatet−1 + α∇wt = γ^(t−1) α∇w1 + γ^(t−2) α∇w2 + · · · + α∇wt
• Common values of γ are 0.5, 0.9, 0.99
Algorithm 1 Momentum
Initial w, b
updatew = 0, updateb = 0
choose momentum parameter γ, learning rate α
Iterate until convergence
{
Choose a mini batch of d training examples
updatew := γ updatew + (α/d) Σ_{m=1}^{d} ∇w Lm (xm , ym )
updateb := γ updateb + (α/d) Σ_{m=1}^{d} ∇b Lm (xm , ym )
w := w − updatew
b := b − updateb
}
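
A minimal NumPy sketch of Algorithm 1 for a single parameter vector; the toy loss L(w) = ||w||^2, its gradient, and the hyperparameter values are illustrative assumptions, and in practice the gradient would be computed from a mini batch:

import numpy as np

def momentum_step(w, update_w, grad_w, gamma=0.9, alpha=0.01):
    # update_w := gamma * update_w + alpha * grad_w ;  w := w - update_w
    update_w = gamma * update_w + alpha * grad_w
    return w - update_w, update_w

# Toy example: minimise L(w) = ||w||^2, whose gradient is 2w
w = np.array([5.0, -3.0])
update_w = np.zeros_like(w)
for t in range(100):
    grad_w = 2 * w                       # a mini-batch gradient would go here
    w, update_w = momentum_step(w, update_w, grad_w)
print(w)                                 # close to the minimiser [0, 0]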
Nesterov Accelerated Gradient Descent

• Even in the regions having gentle slopes, momentum based gradient descent is able to take large steps because the momentum carries it along
• Is moving fast always good? Could the momentum cause us to run past our goal?
• Is there a way to reduce oscillations?
• In momentum,
• updatet = γ updatet−1 + α∇wt
• Move by at least γ updatet−1 and then a little more by α∇wt
Updation Steps

• Partially updated value of w:
• wt+1/2 = wt − γ updatet−1
• Calculate the gradient at the partially updated value of w: ∇wt+1/2 L
• Find updatet :
• updatet = γ updatet−1 + α ∇wt+1/2 L
• wt+1 = wt − updatet
Algorithm 2 Nesterov Momentum
Initial w, b
updatew = 0, updateb = 0, gw = gb = 0, w̄ = 0, b̄ = 0
choose momentum parameter γ, learning rate α
Iterate until convergence
{
Choose a mini batch of d training examples
{
interim update: w̄ := w − γ updatew
interim update: b̄ := b − γ updateb
Gradient at interim update: gw := (1/d) Σ_{m=1}^{d} ∇w̄ Lm (xm , ym )
Gradient at interim update: gb := (1/d) Σ_{m=1}^{d} ∇b̄ Lm (xm , ym )
updatew := γ updatew + αgw
updateb := γ updateb + αgb
w := w − updatew
b := b − updateb
}
}
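
A minimal NumPy sketch of Algorithm 2 along the same lines; grad_fn stands in for the mini-batch gradient, and the toy quadratic loss and hyperparameters are illustrative assumptions:

import numpy as np

def nesterov_step(w, update_w, grad_fn, gamma=0.9, alpha=0.01):
    w_bar = w - gamma * update_w          # interim (look-ahead) update
    g = grad_fn(w_bar)                    # gradient at the interim point, not at w
    update_w = gamma * update_w + alpha * g
    return w - update_w, update_w

# Toy example: minimise L(w) = ||w||^2, gradient 2w
grad_fn = lambda w: 2 * w
w, update_w = np.array([5.0, -3.0]), np.zeros(2)
for t in range(100):
    w, update_w = nesterov_step(w, update_w, grad_fn)
print(w)                                  # near the minimiser [0, 0]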
AdaGrad

• Different learning rate for each feature at each iteration (time step)
• Adaptively scale the learning rate for each dimension

wt+1 = wt − (α / √(ϵI + diag(Gt ))) ⊙ gt      (1)

where α is the learning rate, ϵ is a small quantity used to avoid division by zero, I is the identity matrix, and gt is the gradient estimate at time step t,

gt = (1/d) Σ_{m=1}^{d} ∇wt L(xm , ym )

where d is the number of data points in the batch.


The key to this algorithm is the matrix Gt , which is the sum of the outer products of the gradients until time step t, defined by Gt = Σ_{τ=1}^{t} gτ gτ^T . Only the diagonal elements of Gt are considered, for ease of computation.
From (1), considering only the diagonal elements of Gt ,

(wt+1^(1) , . . . , wt+1^(n) )^T = (wt^(1) , . . . , wt^(n) )^T − α diag( 1/√(ϵ + Gt^(1,1) ), . . . , 1/√(ϵ + Gt^(n,n) ) ) (gt^(1) , . . . , gt^(n) )^T

that is, each component of w is updated as

wt+1^(i) = wt^(i) − ( α / √(ϵ + Gt^(i,i) ) ) gt^(i) ,    i = 1, 2, . . . , n

The parameters with the largest partial derivative of the loss


have a correspondingly rapid decrease in their learning rate,
while parameters with small partial derivatives have a relatively
small decrease in their learning rate. The net effect is greater
progress in the more gently sloped directions of parameter
space.
(Gt^(1,1) , Gt^(2,2) , . . . , Gt^(n,n) )^T = Σ_{τ=1}^{t} gτ ⊙ gτ = st−1 + gt ⊙ gt

Here, st−1 = Σ_{τ=1}^{t−1} gτ ⊙ gτ
That is,

st = st−1 + gt ⊙ gt
Let st^(i) , i = 1, 2, . . . , n be the i th element of st . Now perform the following operation:

s̃t^(i) = α / √(ϵ + st^(i) ) ,    i = 1, 2, . . . , n

The elementwise multiplication and division performed on st , as described above, is denoted by

s̃t = α / √(ϵ + st )
Algorithm 3 AdaGrad
Initial w, b
ϵ = 1e − 8, sw = 0, sb = 0
gw = 0, gb = 0, learning rate α
Iterate until convergence
{
Choose a mini batch of d training examples
{
Gradient Computation: gw := (1/d) Σ_{m=1}^{d} ∇w L(xm , ym )
Gradient Computation: gb := (1/d) Σ_{m=1}^{d} ∇b L(xm , ym )
Accumulation of squared gradient: sw := sw + gw ⊙ gw
Accumulation of squared gradient: sb := sb + gb ⊙ gb
Updation: w := w − (α / √(ϵ + sw )) ⊙ gw
Updation: b := b − (α / √(ϵ + sb )) ⊙ gb
}
}
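
A minimal NumPy sketch of Algorithm 3 for one parameter vector; the ill-scaled toy loss and hyperparameter values are illustrative assumptions:

import numpy as np

def adagrad_step(w, s, grad_w, alpha=0.01, eps=1e-8):
    s = s + grad_w * grad_w                      # accumulate squared gradient
    w = w - (alpha / np.sqrt(eps + s)) * grad_w  # per-coordinate scaled update
    return w, s

# Toy example: L(w) = 10*w1^2 + 0.1*w2^2, so the gradient scales differ by 100x
grad_fn = lambda w: np.array([20.0 * w[0], 0.2 * w[1]])
w, s = np.array([1.0, 1.0]), np.zeros(2)
for t in range(500):
    w, s = adagrad_step(w, s, grad_fn(w))
print(w)   # both coordinates make comparable progress despite the scale difference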
Problems with AdaGrad

• Lowers the update size very aggressively
• AdaGrad is designed to converge rapidly when applied to a convex function. When applied to a nonconvex function to train a neural network, the learning trajectory may pass through many different structures and eventually arrive at a region that is a locally convex bowl. AdaGrad shrinks the learning rate according to the entire history of the squared gradient and may have made the learning rate too small before arriving at such a convex structure.
Refer: Deep Learning Book by Ian Goodfellow et al.
RMSprop

• The RMSProp algorithm modifies AdaGrad by using an exponential moving average in place of gradient accumulation. Hence history from the extreme past is discarded, so that it can converge rapidly after finding a convex bowl.
Algorithm 4 RMSprop
Initial w, b
ϵ = 1e − 8, sw = 0, sb = 0
gw = 0, gb = 0, learning rate α, decay rate γ
Iterate until convergence
{
Choose a mini batch of d training examples
{
Gradient Computation: gw := (1/d) Σ_{m=1}^{d} ∇w L(xm , ym )
Gradient Computation: gb := (1/d) Σ_{m=1}^{d} ∇b L(xm , ym )
EMA of squared gradient: sw := γsw + (1 − γ)gw ⊙ gw
EMA of squared gradient: sb := γsb + (1 − γ)gb ⊙ gb
Updation: w := w − (α / √(ϵ + sw )) ⊙ gw
Updation: b := b − (α / √(ϵ + sb )) ⊙ gb
}
}
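
A minimal NumPy sketch of the RMSprop update in Algorithm 4; the decay rate, learning rate, iteration count, and toy gradient are illustrative assumptions:

import numpy as np

def rmsprop_step(w, s, grad_w, alpha=0.001, gamma=0.9, eps=1e-8):
    s = gamma * s + (1 - gamma) * grad_w * grad_w   # EMA of squared gradient
    w = w - (alpha / np.sqrt(eps + s)) * grad_w     # per-coordinate scaled update
    return w, s

# Usage with a toy gradient; only recent squared gradients influence the scaling
grad_fn = lambda w: np.array([20.0 * w[0], 0.2 * w[1]])
w, s = np.array([1.0, 1.0]), np.zeros(2)
for t in range(2000):
    w, s = rmsprop_step(w, s, grad_fn(w))
print(w)   # both coordinates approach 0; old history is forgotten by the EMA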
Adam

• Adam is an adaptive learning rate optimization algorithm
• Adam stands for Adaptive Moment Estimation
• It makes use of the principles of RMSprop and momentum
• As in momentum, Adam takes the exponential moving average of the gradients
• mt = β1 mt−1 + (1 − β1 )gt (first moment estimate)
• As in RMSProp, Adam takes the exponential moving average of the square of the gradients
• st = β2 st−1 + (1 − β2 )gt ⊙ gt (second moment estimate)
Bias Correction

As mt and st are initialized as vectors of 0’s, the authors of


Adam observe that they are biased towards zero, especially during the initial time steps, and especially when the decay rates are small (i.e., β1 and β2 are close to 1). They counteract these biases by computing bias-corrected first and second moment estimates:

m̂t = mt / (1 − β1^t ) ,    ŝt = st / (1 − β2^t )      (2)
Adam

wt+1 = wt − (α / √(ŝt + ϵ)) ⊙ m̂t

Suggested defaults: β1 = 0.9, β2 = 0.999, ϵ = 1e − 8, α = 0.001
Algorithm 5 ADAM
Initial w, b
ϵ = 1e − 8, mw = 0, mb = 0, sw = 0, sb = 0
gw = 0, gb = 0, β1 = 0.9, β2 = 0.999, learning rate α

Iterate until convergence
{
Iteration t: Choose a mini batch of d training examples
{
gw := (1/d) Σ_{m=1}^{d} ∇w L(xm , ym ) and gb := (1/d) Σ_{m=1}^{d} ∇b L(xm , ym )
mw := β1 mw + (1 − β1 )gw and mb := β1 mb + (1 − β1 )gb
sw := β2 sw + (1 − β2 )gw ⊙ gw and sb := β2 sb + (1 − β2 )gb ⊙ gb
m̂w := mw /(1 − β1^t ) ; ŝw := sw /(1 − β2^t ) ; m̂b := mb /(1 − β1^t ) ; ŝb := sb /(1 − β2^t )
w := w − (α / √(ϵ + ŝw )) ⊙ m̂w and b := b − (α / √(ϵ + ŝb )) ⊙ m̂b
}
}
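
A minimal NumPy sketch of Algorithm 5 for one parameter vector; the toy gradient, the iteration count, and the learning rate (raised here so the short loop converges) are illustrative assumptions:

import numpy as np

def adam_step(w, m, s, grad_w, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad_w              # EMA of gradient (first moment)
    s = beta2 * s + (1 - beta2) * grad_w * grad_w     # EMA of squared gradient (second moment)
    m_hat = m / (1 - beta1 ** t)                      # bias correction, t starts at 1
    s_hat = s / (1 - beta2 ** t)
    w = w - (alpha / np.sqrt(eps + s_hat)) * m_hat
    return w, m, s

# Toy example: minimise L(w) = ||w||^2, gradient 2w
grad_fn = lambda w: 2 * w
w, m, s = np.array([5.0, -3.0]), np.zeros(2), np.zeros(2)
for t in range(1, 1001):
    w, m, s = adam_step(w, m, s, grad_fn(w), t, alpha=0.01)
print(w)   # near the minimiser [0, 0]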
