Optimization Techniques (SGD Alternatives)
optimization algorithms
HYUNG IL KOO
Based on
http://sebastianruder.com/optimizing-gradient-descent/
Problem Statement
• Machine Learning Optimization Problem
• Training samples: $(x^{(i)}, y^{(i)})$
• Challenges
• Gradient descent optimization algorithms
• Momentum
• Adaptive Gradient
• Visualization
Batch Gradient Descent
• Properties
• Very slow
• Intractable for datasets that don't fit in memory
• No online learning
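As a concrete illustration of these trade-offs, a minimal NumPy sketch of batch gradient descent is given below; the toy least-squares problem, data, and learning rate are illustrative assumptions, not taken from the slides.

import numpy as np

# Toy least-squares problem (illustrative): minimize J(theta) = ||X @ theta - y||^2 / (2n)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # 100 training samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

theta = np.zeros(3)
eta = 0.1                                          # learning rate

for epoch in range(100):
    grad = X.T @ (X @ theta - y) / len(y)          # gradient computed over the FULL training set
    theta -= eta * grad                            # a single parameter update per full pass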
Stochastic Gradient Descent (SGD)
• Performs a parameter update for each training example $x^{(i)}$ and label $y^{(i)}$
• Properties:
• Faster
• Online learning
• Heavy fluctuation
• Ability to jump to new (and potentially better) local minima
• Complicated convergence behavior due to overshooting
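For comparison, a minimal sketch of SGD on the same illustrative least-squares problem; one update is made per training example rather than per full pass, which is what produces the heavy fluctuation.

import numpy as np

# Same toy least-squares setup as in the batch sketch (illustrative assumptions)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

theta = np.zeros(3)
eta = 0.01                                         # smaller learning rate to tame the fluctuation

for epoch in range(20):
    for i in rng.permutation(len(y)):              # shuffle, then visit one example at a time
        grad_i = X[i] * (X[i] @ theta - y[i])      # gradient from the single pair (x_i, y_i)
        theta -= eta * grad_i                      # one parameter update per training example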
SGD fluctuation
Batch Gradient vs SGD
Momentum
• Properties
• Fast convergence
• Less oscillation
• Essentially, when using momentum, we push a ball down a hill. The ball
accumulates momentum as it rolls downhill, becoming faster and faster on
the way (until it reaches its terminal velocity if there is air resistance).
• The same thing happens to our parameter updates: the momentum term increases for dimensions whose gradients point in the same direction and reduces updates for dimensions whose gradients change direction. As a result, we gain faster convergence and reduced oscillation.
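In symbols (following the formulation in the referenced blog post), momentum keeps a velocity $v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta)$ and updates $\theta \leftarrow \theta - v_t$, with $\gamma$ typically around 0.9. The sketch below applies this to an illustrative ill-conditioned quadratic (a stand-in for a ravine); the objective and step sizes are assumptions chosen for demonstration only.

import numpy as np

def grad_quadratic(theta):
    # Gradient of J(theta) = 0.5 * (10 * theta_0^2 + theta_1^2), an illustrative "ravine"
    return np.array([10.0, 1.0]) * theta

theta = np.array([1.0, 1.0])
velocity = np.zeros_like(theta)
eta, gamma = 0.01, 0.9                             # gamma ~ 0.9 is a common choice

for step in range(300):
    g = grad_quadratic(theta)
    velocity = gamma * velocity + eta * g          # velocity grows along consistently-signed gradients
    theta = theta - velocity                       # update with the velocity, not the raw gradient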
Momentum
• The momentum term is also useful in spaces with long
ravines characterized by sharp curvature across the ravine
and a gently sloping floor
• Sharp curvature tends to cause divergent oscillations across
the ravine
• To avoid this problem, we could decrease the learning rate, but this is too
slow
• The momentum term filters out the high curvature and allows the effective weight
steps to be bigger
• It turns out that ravines are not uncommon in optimization problems, so the use of
momentum can be helpful in many situations
• However, a momentum term can hurt when the search is close to the minimum (think of the error surface as a bowl)
• As the network approaches the bottom of the error surface, it builds enough
momentum to propel the weights in the opposite direction, creating an
undesirable oscillation that results in slower convergence
Smarter Ball?
NAG (Nesterov Accelerated Gradient)
• Nesterov accelerated gradient improves on the momentum method
• It evaluates the gradient at an approximation of the next position of the parameters (a look-ahead position), rather than at the current position
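A sketch of the look-ahead idea, reusing the illustrative quadratic from the momentum sketch: the gradient is evaluated at $\theta - \gamma v$ (the approximate next position) rather than at the current $\theta$.

import numpy as np

def grad_quadratic(theta):
    # Same illustrative quadratic "ravine" as in the momentum sketch
    return np.array([10.0, 1.0]) * theta

theta = np.array([1.0, 1.0])
velocity = np.zeros_like(theta)
eta, gamma = 0.01, 0.9

for step in range(300):
    lookahead = theta - gamma * velocity           # approximate next position of the parameters
    g = grad_quadratic(lookahead)                  # gradient at the look-ahead point, not at theta
    velocity = gamma * velocity + eta * g
    theta = theta - velocity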
Adaptive Gradient Methods
• Methods
• AdaGrad (Adaptive Gradient Method)
• AdaDelta
• RMSProp (Root Mean Square Propagation)
• Adam (Adaptive Moment Estimation)
• These methods use a different learning rate for each parameter $\theta_i \in \mathbb{R}$ at every time step $t$.
• For brevity, we write $g_i^{(t)}$ for the gradient of the objective function w.r.t. $\theta_i$ at time step $t$, so that plain SGD reads
$$\theta_i^{(t+1)} = \theta_i^{(t)} - \eta \cdot g_i^{(t)}$$
• These methods modify the learning rate $\eta$ at each time step $t$ for every parameter $\theta_i$ based on the past gradients that have been computed for $\theta_i$.
Adagrad
• Adagrad modifies the general learning rate 𝜂 at each time step 𝑡
for every parameter 𝜃𝑖 based on the past gradients that have
been computed for 𝜃𝑖 :
$$\theta_i^{(t+1)} = \theta_i^{(t)} - \frac{\eta}{\sqrt{G_{t,i} + \epsilon}} \, g_i^{(t)}$$
where
$$G_{t,i} = \sum_{k \le t} \big(g_i^{(k)}\big)^2, \qquad g_i^{(k)} = \left.\frac{\partial J(\theta)}{\partial \theta_i}\right|_{\theta = \theta^{(k)}}$$
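A minimal NumPy sketch of the per-parameter Adagrad update above; the toy gradient function is an illustrative assumption.

import numpy as np

def grad_quadratic(theta):
    # Illustrative per-parameter gradients (steep in one coordinate, shallow in the other)
    return np.array([10.0, 1.0]) * theta

theta = np.array([1.0, 1.0])
G = np.zeros_like(theta)                           # running sum of squared gradients, per parameter
eta, eps = 0.01, 1e-8

for step in range(1000):
    g = grad_quadratic(theta)
    G += g ** 2                                    # G_{t,i} = sum over k <= t of (g_i^(k))^2
    theta -= eta / np.sqrt(G + eps) * g            # per-parameter effective learning rate shrinks over time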
Adagrad
• Pros
• It eliminates the need to manually tune the learning rate. Most
implementations use a default value of 0.01.
• Cons
• Its main weakness is the accumulation of the squared gradients in the denominator: the accumulated sum keeps growing during training. This causes the learning rate to shrink and eventually become infinitesimally small. The following algorithms aim to resolve this flaw.
RMSprop
• RMSprop has been developed to resolve Adagrad's diminishing
learning rates.
$$\lambda_{t,i} = \gamma \, \lambda_{t-1,i} + (1 - \gamma) \big(g_i^{(t)}\big)^2$$
$$\theta_i^{(t+1)} = \theta_i^{(t)} - \frac{\eta}{\sqrt{\lambda_{t,i} + \epsilon}} \, g_i^{(t)}$$
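A minimal sketch of the RMSprop update above; $\gamma = 0.9$ and $\eta = 0.001$ are commonly suggested defaults, and the toy gradient function is again an illustrative assumption.

import numpy as np

def grad_quadratic(theta):
    return np.array([10.0, 1.0]) * theta           # illustrative per-parameter gradients

theta = np.array([1.0, 1.0])
avg_sq = np.zeros_like(theta)                      # lambda_{t,i}: decaying average of squared gradients
eta, gamma, eps = 0.001, 0.9, 1e-8                 # commonly used defaults

for step in range(3000):
    g = grad_quadratic(theta)
    avg_sq = gamma * avg_sq + (1 - gamma) * g ** 2 # the average does not grow without bound
    theta -= eta / np.sqrt(avg_sq + eps) * g       # learning rate no longer shrinks monotonically to zero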
COMPARISON
Visualization of algorithms
Which optimizer to choose?
• RMSprop is an extension of Adagrad that deals with its
radically diminishing learning rates.