Gradient-Based Optimizers

Uploaded by

sunnyrx100virat

Gradient-Based Optimizers &
Gradient descent optimization algorithms
Gradient descent is an optimization technique used to minimize the error or loss function
in machine learning and neural networks. It works by iteratively adjusting the parameters of
the model to find the values that result in the lowest possible error. Here’s how it works:
1. Objective: The goal of gradient descent is to find the minimum value of a function,
typically the loss function that measures how well the model's predictions match the
actual data.
2. Initialize Parameters: Start with an initial set of parameters or weights, which are usually
set randomly.
3. Compute Gradient: Calculate the gradient (or partial derivatives) of the loss function with
respect to each parameter. The gradient indicates the direction and rate at which the loss
function increases.
4. Update Parameters: Adjust the parameters in the direction that reduces the loss. This is
done by subtracting a fraction of the gradient from the current parameters. The size of
this step is controlled by a value called the learning rate.
5. Iterate: Repeat the process of computing gradients and updating parameters until the
changes in the loss function become very small or the number of iterations reaches a
predefined limit.
6. Convergence: The process continues until the parameters converge to values where the
loss function is minimized or the change in loss is below a certain threshold.
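The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the one-parameter loss (w - 3)^2, the function name gradient_descent, and the default values for the learning rate, iteration limit, and tolerance are all assumptions chosen for the example.

```python
def gradient_descent(loss, grad, w0, learning_rate=0.1, max_iters=1000, tol=1e-12):
    """Minimize a one-parameter loss by repeatedly stepping against its gradient."""
    w = w0                                   # step 2: initial parameter
    prev_loss = loss(w)
    for _ in range(max_iters):               # step 5: iterate up to a fixed limit
        w = w - learning_rate * grad(w)      # steps 3-4: compute gradient, then update
        cur_loss = loss(w)
        if abs(prev_loss - cur_loss) < tol:  # step 6: stop when the loss barely changes
            break
        prev_loss = cur_loss
    return w


# Hypothetical loss (w - 3)^2 with gradient 2 * (w - 3); its minimum is at w = 3.
w_min = gradient_descent(lambda w: (w - 3.0) ** 2, lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_min)
```

With a much larger learning rate (for example 1.5 in this sketch) each step overshoots the minimum by more than it corrects, which is the overshooting behaviour attributed to a high learning rate.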
Key Elements of Gradient Descent:
• Learning Rate: A hyperparameter that determines the size of the steps taken during parameter updates. A
learning rate that is too high may cause the algorithm to overshoot the minimum, while a learning rate
that is too low may result in a slow convergence.
• Cost Function: Also known as the loss function, it measures the performance of the model. The goal is to
minimize this function.
• Gradient: A vector that points in the direction of the steepest increase of the loss function. The negative
gradient points in the direction of the steepest decrease.
Variants of Gradient Descent:
• Batch Gradient Descent: Computes the gradient using the entire dataset. This can be computationally
expensive for large datasets.
• Stochastic Gradient Descent (SGD): Computes the gradient using a single data point at a time. This can be
faster but introduces more noise in the parameter updates.
• Mini-Batch Gradient Descent: Computes the gradient using a small random subset of the data. It balances
the stability of batch gradient descent and the speed of stochastic gradient descent.
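The three variants differ only in how many data points feed each gradient computation, so one loop with a batch_size argument can sketch all of them. Everything here is an illustrative assumption: the one-weight model y ≈ w * x, the synthetic noiseless data, and the function name minibatch_gd.

```python
import random


def minibatch_gd(xs, ys, batch_size, learning_rate=0.05, epochs=50, seed=0):
    """Fit y ≈ w * x by gradient descent on the mean squared error."""
    rng = random.Random(seed)
    w = 0.0
    updates = 0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the data in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of mean((w * x - y)^2) over the batch with respect to w.
            g = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= learning_rate * g
            updates += 1
    return w, updates


xs = [0.1 * i for i in range(1, 21)]  # 20 hypothetical inputs
ys = [2.0 * x for x in xs]            # noiseless targets, so the true weight is 2

w_batch, n_batch = minibatch_gd(xs, ys, batch_size=len(xs))  # batch gradient descent
w_sgd, n_sgd = minibatch_gd(xs, ys, batch_size=1)            # stochastic gradient descent
w_mini, n_mini = minibatch_gd(xs, ys, batch_size=5)          # mini-batch gradient descent
print(n_batch, n_sgd, n_mini)
```

Over the same 50 epochs, the batch variant performs one update per epoch, SGD performs one per data point, and the mini-batch variant sits in between.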

In summary, gradient descent is a fundamental optimization algorithm used to minimize the loss function by
iteratively adjusting model parameters based on computed gradients.
Different types of Gradient descent based Optimizers:
Batch Gradient Descent or Vanilla Gradient Descent or Gradient Descent (GD)
• Gradient descent is an optimization algorithm for finding a local
minimum of a differentiable function. It is used to find the values of
a function's parameters (coefficients) that minimize a cost function
as far as possible.

• The weights are initialized using an initialization strategy and are
updated at each epoch according to the update equation.
GD update equation:

$$\theta = \theta - \eta \cdot \nabla_\theta J(\theta)$$

This equation computes the gradient of the cost function J(θ)
w.r.t. the parameters/weights θ for the entire training dataset.
We then update the parameters in the opposite direction of the gradient,
scaled by the learning rate η.

Batch gradient descent is guaranteed to converge to the global
minimum for convex error surfaces and to a local minimum for
non-convex surfaces.
Stochastic Gradient Descent (SGD)

• SGD is an extension of gradient descent that overcomes some of
the disadvantages of the GD algorithm.
• Gradient descent requires a lot of memory, because it loads the
entire dataset of n points at once to compute the derivative of the
loss function.
• In SGD, the derivative is computed using one point at a time.

Here, imagine the same mountain, but this time, it's
foggy, and you can only see a few feet in front of you.
You can't determine the steepest descent over the whole
landscape, but you can still move downward based on
the local slope.

So, let’s have a dataset that contains 1000 rows. When we apply SGD, it will update
the model parameters 1000 times in one complete cycle of the dataset, instead of one
time as in Gradient Descent.

Batch gradient descent performs redundant computations for large datasets, as it
recomputes gradients for similar examples before each parameter update. SGD does away
with this redundancy by performing one update at a time. It is therefore usually much
faster and can also be used to learn online.
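The update counts in the 1000-row example follow from simple arithmetic, sketched below. The helper name updates_per_epoch and the batch size of 32 are illustrative assumptions.

```python
import math


def updates_per_epoch(n_rows, batch_size):
    """Number of parameter updates in one full pass over the dataset."""
    return math.ceil(n_rows / batch_size)


print(updates_per_epoch(1000, 1))     # SGD: one update per row
print(updates_per_epoch(1000, 1000))  # batch gradient descent: one update per pass
print(updates_per_epoch(1000, 32))    # a mini-batch size of 32, chosen for illustration
```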
SGD updates are quite noisy, but each update is much cheaper
than in batch gradient descent, so training is usually faster.
Because of the noise, SGD may keep fluctuating around a
minimum instead of converging to it.
Mini Batch Stochastic Gradient Descent (MB-SGD)
• MB-SGD is an extension of the SGD algorithm that overcomes the
problem of large time complexity in the case of SGD.

• MB-SGD takes a batch of points (a subset of the dataset) to
compute the derivative.

• It is observed that, after some number of iterations, the derivative
of the loss function for MB-SGD is almost the same as the derivative
of the loss function for GD. However, MB-SGD needs more iterations
than GD to reach a minimum, so the total cost of computation can
still be large even though each update is cheaper.
Mini-batch gradient descent is typically the algorithm of choice
when training a neural network.

The weight update depends on the derivative of the loss for a batch
of points. The updates in MB-SGD are noisy because the derivative
does not always point towards the minimum.

Each mini-batch is only a small sample of the total dataset, so it
might not fully represent the overall trend in the data. Different
mini-batches might suggest slightly different directions for updating
the model's parameters, leading to inconsistent (noisy) updates.
Gradient descent optimization algorithms
• SGD with momentum
• Nesterov Accelerated Gradient (NAG)
• Adaptive Gradient (AdaGrad)
• AdaDelta
• RMSprop
• Adam
SGD with momentum

• Instead of smoothly progressing towards the minimum of the cost
function (as in Batch Gradient Descent), the path in MB-SGD tends to
zigzag or oscillate because of the noise. The steps may sometimes
overshoot or undershoot, making the convergence to the optimal
solution less smooth.

• SGD with momentum overcomes this disadvantage by denoising
the gradients.
• The idea is to denoise the derivative using an exponentially
weighted average, which gives more weight to recent
gradients than to older ones.

• It accelerates convergence in the relevant direction and
reduces fluctuation in irrelevant directions.
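One common way to write the momentum idea is v = γ·v + η·g followed by θ = θ - v, where v is a running, exponentially decaying accumulation of past gradients. The sketch below assumes that formulation; the momentum coefficient γ = 0.9, the learning rate, and the elongated bowl f(x, y) = 0.5 * (x^2 + 25 * y^2) are illustrative choices, and exact gradients stand in for noisy mini-batch gradients.

```python
def sgd_momentum(grad, theta, learning_rate=0.01, gamma=0.9, steps=300):
    """Gradient descent with a velocity term that accumulates past gradients."""
    theta = list(theta)
    velocity = [0.0] * len(theta)
    for _ in range(steps):
        g = grad(theta)
        # Decay the old velocity and add the new scaled gradient.
        velocity = [gamma * v + learning_rate * gi for v, gi in zip(velocity, g)]
        theta = [t - v for t, v in zip(theta, velocity)]
    return theta


# Hypothetical loss f(x, y) = 0.5 * (x^2 + 25 * y^2), much steeper in y than in x.
def bowl_grad(theta):
    return [theta[0], 25.0 * theta[1]]


theta_momentum = sgd_momentum(bowl_grad, [5.0, 1.0])
print(theta_momentum)
```

Because the velocity averages successive gradients, components that keep flipping sign partly cancel, while components that keep the same sign build up.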
Nesterov Accelerated Gradient (NAG)
• In this version, we first look ahead to the point where the current
momentum is pointing and compute the gradient at that point.
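A minimal sketch of this look-ahead step, assuming the commonly used formulation in which the gradient is evaluated at θ - γ·v before the velocity is updated. The coefficient γ = 0.9, the learning rate, and the quadratic test function are illustrative assumptions.

```python
def nesterov(grad, theta, learning_rate=0.01, gamma=0.9, steps=300):
    """Momentum update whose gradient is taken at the look-ahead point."""
    theta = list(theta)
    velocity = [0.0] * len(theta)
    for _ in range(steps):
        # Point the current momentum would carry the parameters to.
        lookahead = [t - gamma * v for t, v in zip(theta, velocity)]
        g = grad(lookahead)
        velocity = [gamma * v + learning_rate * gi for v, gi in zip(velocity, g)]
        theta = [t - v for t, v in zip(theta, velocity)]
    return theta


# Hypothetical loss f(x, y) = 0.5 * (x^2 + 25 * y^2).
def bowl_grad(theta):
    return [theta[0], 25.0 * theta[1]]


theta_nag = nesterov(bowl_grad, [5.0, 1.0])
print(theta_nag)
```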
Parameter Initialization Strategies
1. Initialization of weight values

Heuristics for initial scale of weights

We almost always initialize all the weights in the model to values
drawn randomly from a Gaussian or uniform distribution.
Weight Initialization for Sigmoid and Tanh
Xavier Weight Initialization / Glorot Initialization
Initialize the weights of a fully connected layer with Nin inputs and
Nout outputs by sampling each weight from Uniform(-r, r), where the
bound r depends on Nin and Nout.
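A small sketch of uniform Xavier/Glorot sampling. The bound used here, r = sqrt(6 / (Nin + Nout)), is the commonly cited form and is an assumption with respect to this deck, since the formula itself is not reproduced in the text. The function name xavier_uniform and the nested-list weight layout are also illustrative.

```python
import math
import random


def xavier_uniform(n_in, n_out, seed=0):
    """Return an n_in x n_out weight matrix sampled from Uniform(-r, r)."""
    rng = random.Random(seed)
    r = math.sqrt(6.0 / (n_in + n_out))  # commonly cited Glorot bound (assumed here)
    return [[rng.uniform(-r, r) for _ in range(n_out)] for _ in range(n_in)]


weights = xavier_uniform(4, 2)  # Nin + Nout = 6, so r = 1 in this example
print(weights)
```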
Weight Initialization for ReLU

He Weight Initialization / Kaiming Initialization

The main goal of He initialization is to maintain the variance of the
activations in each layer, especially when using ReLU activations,
which tend to zero out negative inputs. This helps in avoiding issues
like vanishing or exploding gradients, which can hinder the training
process in deep neural networks.
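As a rough sketch, He initialization is often stated as sampling each weight from a zero-mean Gaussian with variance 2 / Nin. That form is an assumption here, since the deck gives only the goal and not the formula. The function name he_normal and the layer size are illustrative.

```python
import math
import random


def he_normal(n_in, n_out, seed=0):
    """Return an n_in x n_out weight matrix sampled from N(0, 2 / n_in)."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 / n_in)  # commonly stated He standard deviation (assumed here)
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]


weights = he_normal(200, 200)
flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
variance = sum((w - mean) ** 2 for w in flat) / len(flat)
print(variance)  # should be close to 2 / 200 = 0.01
```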
2. Initialization of bias
1. Zero Initialization: The most common and often the default
method is to initialize the bias terms to zero. This is because
biases primarily serve to shift the activation function.

2. Small Positive Value Initialization

• Positive Bias for ReLU:
• When using ReLU or its variants, biases can be initialized to a small
positive value (like 0.01). This is done to prevent neurons from
"dying," especially during the initial phases of training when many
ReLU neurons might output zero due to the nature of the
activation function.
3. Learned Biases
• In modern architectures and deep learning frameworks, biases
are typically learned during training, so the initialization is less
critical compared to weights. The optimizer will adjust the bias
terms based on the loss function, regardless of their initial
values.
Annealing the learning rate
Adagrad

• The key idea of AdaGrad is to have an adaptive learning
rate for each of the weights.

• It performs smaller updates for parameters associated
with frequently occurring features, and larger updates
for parameters associated with infrequently occurring
features.
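The per-weight learning rate can be sketched with the usual AdaGrad rule: each weight keeps a running sum of its squared gradients, and its step is divided by the square root of that sum. The toy gradient stream below (one feature active on every step, another on every tenth step) and all constants are illustrative assumptions.

```python
import math


def adagrad_step(theta, grads, accum, learning_rate=0.1, eps=1e-8):
    """One AdaGrad update; returns the new parameters and accumulators."""
    new_accum = [a + g * g for a, g in zip(accum, grads)]
    new_theta = [t - learning_rate * g / (math.sqrt(a) + eps)
                 for t, g, a in zip(theta, grads, new_accum)]
    return new_theta, new_accum


theta, accum = [0.0, 0.0], [0.0, 0.0]
for step in range(100):
    # Weight 0 sees a gradient on every step, weight 1 only on every 10th step.
    grads = [1.0, 1.0 if step % 10 == 0 else 0.0]
    theta, accum = adagrad_step(theta, grads, accum)

effective_lr = [0.1 / (math.sqrt(a) + 1e-8) for a in accum]
print(accum, effective_lr)
```

After 100 steps the frequently updated weight has the larger accumulator and therefore the smaller effective learning rate.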
Adadelta

• The problem with AdaGrad is that the learning rate becomes
very small after a large number of iterations, which leads to
slow convergence.

• To avoid this, Adadelta adapts learning rates based on a
moving window of gradient updates, instead of
accumulating all past gradients.
Accumulated Gradients:

E[⋅] represents the exponential moving average (EMA) of the quantity inside
the brackets. It's a way to smooth out the values over time, giving more weight
to recent values while still considering the past values.
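A sketch of the exponential moving average inside an Adadelta step, following the standard formulation: E[g^2] is updated as ρ·E[g^2] + (1 - ρ)·g^2, and the step is scaled by the ratio of the RMS of past updates to the RMS of gradients. The decay ρ = 0.9, ε = 1e-6, and the toy loss θ^2 are illustrative assumptions.

```python
import math


def adadelta_step(theta, g, eg2, edx2, rho=0.9, eps=1e-6):
    """One Adadelta update for a single parameter."""
    eg2 = rho * eg2 + (1.0 - rho) * g * g              # EMA of squared gradients
    dx = -math.sqrt(edx2 + eps) / math.sqrt(eg2 + eps) * g
    edx2 = rho * edx2 + (1.0 - rho) * dx * dx          # EMA of squared updates
    return theta + dx, eg2, edx2


theta, eg2, edx2 = 1.0, 0.0, 0.0
history = [theta]
for _ in range(10):
    g = 2.0 * theta  # gradient of the hypothetical loss theta^2
    theta, eg2, edx2 = adadelta_step(theta, g, eg2, edx2)
    history.append(theta)
print(history)
```

Note that no learning rate appears in the update: the step size comes entirely from the two moving averages.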
RMSprop

A good default value for the learning rate is 0.001.
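A minimal RMSprop sketch using the learning rate of 0.001 mentioned above. The update divides each gradient by the root of an exponential moving average of squared gradients. The decay of 0.9, ε = 1e-8, and the toy loss θ^2 are assumptions made for illustration.

```python
import math


def rmsprop_step(theta, g, eg2, learning_rate=0.001, rho=0.9, eps=1e-8):
    """One RMSprop update for a single parameter."""
    eg2 = rho * eg2 + (1.0 - rho) * g * g  # moving average of squared gradients
    theta = theta - learning_rate * g / math.sqrt(eg2 + eps)
    return theta, eg2


theta, eg2 = 1.0, 0.0
trajectory = [theta]
for _ in range(100):
    g = 2.0 * theta  # gradient of the hypothetical loss theta^2
    theta, eg2 = rmsprop_step(theta, g, eg2)
    trajectory.append(theta)
print(theta)
```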
Adam

• Adaptive Moment Estimation (Adam) is another method
that computes adaptive learning rates for each
parameter.
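A sketch of one Adam step in its standard form: moving averages of the gradient (m) and of the squared gradient (v), each corrected for its start at zero, then a step scaled by m / sqrt(v). The defaults shown (learning rate 0.001, β1 = 0.9, β2 = 0.999, ε = 1e-8) are the commonly quoted ones and are assumptions with respect to this deck, as is the toy loss θ^2.

```python
import math


def adam_step(theta, g, m, v, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * g      # moving average of gradients
    v = beta2 * v + (1.0 - beta2) * g * g  # moving average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)         # bias correction for the zero start
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 101):
    g = 2.0 * theta  # gradient of the hypothetical loss theta^2
    theta, m, v = adam_step(theta, g, m, v, t)
print(theta)
```

On the first step the bias-corrected ratio m_hat / sqrt(v_hat) has magnitude close to 1, so the parameter moves by roughly one learning rate regardless of the gradient's scale.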
