
Optimizers

RMSProp, GD with Momentum, ADAM

© Nisheeth Joshi
RMSProp
• Root Mean Square Propagation
• Works on the principle of decay: it keeps an exponentially decaying average of the squared gradients
• Chooses a different effective learning rate for each parameter (this per-parameter adaptation is what sets it apart from plain gradient descent)
• Proposed by Geoffrey Hinton in 2012 during his Coursera lecture

Key Takeaway
• Rather than a single fixed learning rate (e.g., 0.01), each trainable parameter gets its own effective learning rate derived from a decaying average of its gradients
• This average is updated iteratively as a running average of the magnitude (square) of the gradients
• As a result, the weight change is not simply a step along the gradient: the gradient is rescaled element-wise by the running-average vector $v_t$
• This speeds up convergence

• Updates are larger in some directions and smaller in others

On iteration t:
• Compute $\delta w$ and $\delta b$ on the current GD step
• Keep an exponentially weighted average of the squared changes in w and b:

$S_{\delta w_t} = \beta\, S_{\delta w_{t-1}} + (1-\beta)\, \delta w^2$

$S_{\delta b_t} = \beta\, S_{\delta b_{t-1}} + (1-\beta)\, \delta b^2$

• Update the parameters, scaling each gradient by the root of its running average:

$w_{new} = w_{old} - \alpha\, \dfrac{\delta w}{\sqrt{S_{\delta w_t}} + \epsilon} \qquad b_{new} = b_{old} - \alpha\, \dfrac{\delta b}{\sqrt{S_{\delta b_t}} + \epsilon}$
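A minimal NumPy sketch of this update rule is shown below; the function name rmsprop_step and its argument layout are illustrative, not taken from the slides.

```python
import numpy as np

def rmsprop_step(w, b, dw, db, s_dw, s_db, alpha=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update for a weight array w and a bias array b.

    s_dw and s_db carry the exponentially weighted averages of the
    squared gradients from the previous iteration.
    """
    # Exponentially weighted average of the squared gradients (decay factor beta)
    s_dw = beta * s_dw + (1 - beta) * dw ** 2
    s_db = beta * s_db + (1 - beta) * db ** 2

    # Each gradient component is divided by the root of its own running average,
    # so every parameter effectively gets its own learning rate
    w = w - alpha * dw / (np.sqrt(s_dw) + eps)
    b = b - alpha * db / (np.sqrt(s_db) + eps)
    return w, b, s_dw, s_db
```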
RMSProp Takeaway
• RMSProp is very similar to Adagrad insofar as both use the
square of the gradient to scale coefficients.
• RMSProp shares with momentum the leaky averaging.
However, RMSProp uses the technique to adjust the coefficient-
wise preconditioner.
• The learning rate needs to be scheduled by the experimenter in
practice.
• The coefficient 𝛽 determines how long the history is when
adjusting the per-coordinate scale.

GD with Momentum
• BGD/MBGD/SGD take a long time to reach the global minimum,
• because the gradient steps are noisy even when they are correct on average

• Momentum is a technique to improve the convergence of gradient descent

[Figure: a small feed-forward network example with labelled weights and biases b1–b3, node values 60, 80 and 5, hidden activations g1 = 0.37 and g2 = 0.047, and output 24.95; its parameters are updated on the next slide.]
Vanilla gradient descent applies the same rule to every trainable parameter of the example network:

$W_i^+ = W_i - \alpha\, \dfrac{\partial\, cost}{\partial W_i}, \quad i = 1, \dots, 8$

$b_j^+ = b_j - \alpha\, \dfrac{\partial\, cost}{\partial b_j}, \quad j = 1, 2, 3$
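As a sketch, the whole slide collapses to one loop over the trainable parameters; the gd_step name and the dictionary layout below are illustrative.

```python
def gd_step(params, grads, alpha=0.01):
    """Vanilla gradient descent: apply the same rule to every parameter.

    params and grads are dicts keyed by parameter name, e.g. 'W1'...'W8'
    and 'b1'...'b3' from the example network; alpha is the learning rate.
    """
    return {name: value - alpha * grads[name] for name, value in params.items()}
```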
With momentum, the previous update is carried over into the new step:

$W_7^+ = W_7 - \alpha\, \dfrac{\partial\, cost}{\partial W_7} + \beta\, \Delta W_{7_{t-1}}$

where the gradient term plays the role of the acceleration and $\Delta W_{7_{t-1}}$, the previous update, plays the role of the velocity.
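A minimal sketch of this heavy-ball update for a single parameter; the momentum_step name and the prev_update state variable are illustrative.

```python
def momentum_step(w, dw, prev_update, alpha=0.01, beta=0.9):
    """Gradient descent with momentum (heavy-ball form) for one parameter w.

    prev_update is the previous step (the beta-weighted 'velocity' term),
    while the fresh gradient dw acts as the 'acceleration'.
    """
    update = -alpha * dw + beta * prev_update  # acceleration + velocity
    w = w + update
    return w, update  # the returned update feeds the next iteration
```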
Adam Optimizer
• One of the key components of Adam is that it uses exponential
weighted moving averages (also known as leaky averaging) to obtain
an estimate of both the momentum and also the second moment of
the gradient.
• That is, it combines the Gradient Descent with Momentum and RMSProp optimizers

Adam Optimization
Initialize:

$V_{\delta w} = 0, \quad S_{\delta w} = 0, \quad V_{\delta b} = 0, \quad S_{\delta b} = 0$

On iteration t, compute $\delta w$ and $\delta b$ using the current minibatch.

Momentum-like update with $\beta_1$:

$V_{\delta w_t} = \beta_1\, V_{\delta w_{t-1}} + (1-\beta_1)\, \delta w \qquad V_{\delta b_t} = \beta_1\, V_{\delta b_{t-1}} + (1-\beta_1)\, \delta b$

RMSProp-like update with $\beta_2$:

$S_{\delta w_t} = \beta_2\, S_{\delta w_{t-1}} + (1-\beta_2)\, \delta w^2 \qquad S_{\delta b_t} = \beta_2\, S_{\delta b_{t-1}} + (1-\beta_2)\, \delta b^2$
In a typical update of Adam we perform bias corrections:

$V_{\delta w}^{corrected} = \dfrac{V_{\delta w_t}}{1-\beta_1^t} \qquad V_{\delta b}^{corrected} = \dfrac{V_{\delta b_t}}{1-\beta_1^t}$

$S_{\delta w}^{corrected} = \dfrac{S_{\delta w_t}}{1-\beta_2^t} \qquad S_{\delta b}^{corrected} = \dfrac{S_{\delta b_t}}{1-\beta_2^t}$
Finally, a weight and bias update is performed:

$w^+ = w - \alpha\, \dfrac{V_{\delta w}^{corrected}}{\sqrt{S_{\delta w}^{corrected}} + \epsilon} \qquad b^+ = b - \alpha\, \dfrac{V_{\delta b}^{corrected}}{\sqrt{S_{\delta b}^{corrected}} + \epsilon}$
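Putting the three preceding slides together, here is a minimal NumPy sketch of a single Adam step for one parameter; adam_step and its state layout are illustrative, and the bias b would use the identical rule.

```python
import numpy as np

def adam_step(w, dw, v_dw, s_dw, t, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a parameter w at iteration t (t starts at 1).

    v_dw is the momentum-like first-moment estimate, s_dw the
    RMSProp-like second-moment estimate; both are initialized to 0.
    """
    # Momentum-like update with beta1
    v_dw = beta1 * v_dw + (1 - beta1) * dw
    # RMSProp-like update with beta2
    s_dw = beta2 * s_dw + (1 - beta2) * dw ** 2

    # Bias corrections (counteract the zero initialization of the moments)
    v_corr = v_dw / (1 - beta1 ** t)
    s_corr = s_dw / (1 - beta2 ** t)

    # Final parameter update
    w = w - alpha * v_corr / (np.sqrt(s_corr) + eps)
    return w, v_dw, s_dw
```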
Hyperparameter Choice
• Learning rate α needs to be tuned.
• Common choice for β1 = 0.9 (momentum term)
• Common choice for β2 = 0.999 (RMSProp term)
• ε = 10⁻⁸

• ADAM – ADAptive Moment Estimation
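For example, the adam_step sketch above can be driven with the common β1, β2 and ε defaults and a hand-tuned learning rate on a toy one-dimensional cost; the toy cost and the loop below are illustrative.

```python
# Toy cost: cost(w) = (w - 3)**2, so d(cost)/dw = 2 * (w - 3)
w, v_dw, s_dw = 0.0, 0.0, 0.0            # parameter and moment estimates
for t in range(1, 501):                  # t must start at 1 for the bias correction
    dw = 2 * (w - 3)                     # gradient of the toy cost
    w, v_dw, s_dw = adam_step(w, dw, v_dw, s_dw, t,
                              alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8)
print(w)                                 # approaches the minimum at w = 3
```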

Adam – Key Takeaway
• Reviewing the design of Adam, its inspiration is clear.
• Momentum and scale are clearly visible in the state variables.
• Their rather peculiar definition forces us to debias terms (this
could be fixed by a slightly different initialization and update
condition).
• The combination of both terms is pretty straightforward, given
RMSProp.
• The explicit learning rate α allows us to control the step length to address issues of convergence.
