
Types of Optimizers

A) Gradient Descent:

This is one of the oldest and most common optimizers used in neural networks. It works best in cases where the data is arranged in a way that poses a convex optimization problem.

It tries to find the lowest value of the cost function by updating the weights of your learning algorithm, and comes up with the best-suited parameter values, corresponding to the Global Minimum.

This is done by moving down the slope of the cost curve: a negative slope increases the old weight, while a positive slope reduces it.

There are challenges in using this optimizer, however. If the data is arranged in a way that poses a non-convex optimization problem, it can land on a Local Minimum instead of the Global Minimum, thereby producing parameter values with a higher cost function.
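
As a minimal sketch of this update rule (assuming an illustrative least-squares cost and hypothetical names X, y, and lr that are not from the text above):

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, epochs=100):
    """Full-batch gradient descent on a convex least-squares cost."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # slope of the cost with respect to the weights
        w -= lr * grad                         # negative slope raises w, positive slope lowers it
    return w
```

Because every step uses the whole dataset, each update moves straight down the convex bowl towards the Global Minimum.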

B) Stochastic Gradient Descent:

This is another variant of the Gradient Descent optimizer, with the added ability to work with data that poses a non-convex optimization problem. The difficulty with such data is that the cost function can come to rest at a local minimum, which is not suitable for your learning algorithm.

Rather than going for batch processing, this optimizer performs one update at a time. Each update is therefore usually much faster, and the cost function decreases after every update.

It performs frequent updates with a high variance, which causes the objective function (cost function) to fluctuate heavily. This fluctuation is what allows the parameters to jump towards a potential Global Minimum.

However, if we choose a learning rate that is too small, convergence may be very slow, while a larger learning rate can make it difficult to converge, causing the cost function to fluctuate around the minimum or even diverge away from the Global Minimum.
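
A minimal sketch of the per-example update, again with hypothetical names X, y, and lr and the same illustrative least-squares cost as above:

```python
import numpy as np

def stochastic_gradient_descent(X, y, lr=0.01, epochs=10):
    """One weight update per training example instead of one per batch."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):  # shuffled order keeps the updates noisy
            grad = 2 * X[i] * (X[i] @ w - y[i])  # gradient from a single example
            w -= lr * grad                       # frequent, high-variance updates
    return w
```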

C) Adagrad:

This is the Adaptive Gradient optimization algorithm, where the learning rate plays an important role in determining the updated parameter values. Unlike Stochastic Gradient Descent, this optimizer uses a different learning rate for each parameter at every iteration, rather than the same learning rate for all of them. It therefore performs smaller updates (lower learning rates) for the weights corresponding to high-frequency features and bigger updates (higher learning rates) for the weights corresponding to low-frequency features, which in turn leads to better performance and higher accuracy. Adagrad is well suited to dealing with sparse data.

So at each iteration, the accumulator alpha at time t is calculated first; as the iterations go on, t increases and alpha_t keeps growing, since it sums up the squared gradients seen so far.

However, there is a disadvantage: the problem of the Vanishing Gradient. After many iterations the alpha value becomes very large, making the learning rate very small and leaving almost no change between the new and the old weight. The learning rate keeps shrinking until it is so small that the algorithm is not able to acquire any further knowledge.
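
A minimal sketch of this accumulator, assuming a hypothetical grad_fn(w) that returns the gradient of the cost at w (lr, steps, and eps are illustrative defaults):

```python
import numpy as np

def adagrad(grad_fn, w, lr=0.01, steps=100, eps=1e-8):
    """Adagrad: a separate effective learning rate for every parameter."""
    alpha = np.zeros_like(w)                     # running sum of squared gradients ("alpha")
    for _ in range(steps):
        g = grad_fn(w)
        alpha += g ** 2                          # alpha_t only ever grows as t increases
        w = w - lr * g / (np.sqrt(alpha) + eps)  # so the effective step keeps shrinking
    return w
```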

D) Adadelta:

This is an extension of the Adaptive Gradient optimizer that takes care of its aggressive tendency to shrink the learning rate towards zero. Instead of keeping the full sum of previous squared gradients, the sum is defined as a decaying weighted average of all past squared gradients, which prevents the learning rate from reducing to a very small value.

The formula for the new weight remains the same as in Adagrad; however, the learning rate at time step t is determined differently at each iteration.

At each iteration, the weighted average is calculated first, using the restricting term (gamma = 0.95), which helps in avoiding the problem of the Vanishing Gradient.
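
Written in the same plain notation used for the Adam formulae below (avg_t and g_t are illustrative names for the running average and the gradient at step t), the weighted average is:

avg_t = γ x avg_t-1 + (1 - γ) x g_t²   (with γ = 0.95)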

E) RMSprop:

Both optimization algorithms, RMSprop (Root Mean Square Propagation) and Adadelta, were developed around the same time and for the same purpose: to resolve Adagrad's problem of radically diminishing learning rates. Both use the same method, an Exponential Weighted Average of squared gradients, to determine the learning rate at time t for each iteration.

RMSprop is an adaptive learning rate method proposed by Geoffrey Hinton. It divides the learning rate by the root of an exponentially weighted average of squared gradients. It is suggested to set gamma at 0.95, as this has shown good results in most cases.
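
A minimal sketch of this division, under the same assumptions as the Adagrad sketch above (a hypothetical grad_fn(w) and illustrative defaults):

```python
import numpy as np

def rmsprop(grad_fn, w, lr=0.001, steps=100, gamma=0.95, eps=1e-8):
    """Divide the step by the root of a decaying average of squared gradients."""
    avg_sq = np.zeros_like(w)                     # exponentially weighted average of g²
    for _ in range(steps):
        g = grad_fn(w)
        avg_sq = gamma * avg_sq + (1 - gamma) * g ** 2
        w = w - lr * g / (np.sqrt(avg_sq) + eps)  # effective rate no longer decays to zero
    return w
```

Unlike Adagrad's ever-growing sum, the decaying average forgets old gradients, so the effective learning rate stays usable.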

F) Adam:

This is the Adaptive Moment Estimation algorithm, which also computes adaptive learning rates for each parameter at every iteration. It uses a combination of Gradient Descent with Momentum and RMSprop to determine the parameter values.

When the algorithm was introduced, it came with a list of attractive benefits for non-convex optimization problems, which made it the most commonly used optimizer.

It comes with several advantages, combining the benefits of both Gradient Descent with Momentum and RMSprop: low memory requirements, suitability for non-stationary objectives, and efficient computation that works well with large data and many parameters.

It works with the same adaptive learning rate methodology, in addition to storing an exponentially weighted average of the past squared derivatives of the loss with respect to the weights up to time t-1.

It comes with several hyperparameters: β1, β2, and ε (epsilon). β1 and β2 are the restricting parameters for the Momentum and RMSprop terms respectively; β1 corresponds to the first moment and β2 to the second moment.

To update the weights with an adaptive learning rate at iteration t, we first need to calculate the first and second moments, given by the following formulae:

VdW = β1 x VdW + (1 - β1) x dW    (GD with Momentum, 1st moment)

SdW = β2 x SdW + (1 - β2) x dW²    (RMSprop, 2nd moment)

Adam is relatively easy to configure, and the default configuration parameters do well on most problems. The proposed default values are β1 = 0.9, β2 = 0.999, and ε = 10⁻⁸. Studies show that Adam works well in practice in comparison to other adaptive learning algorithms.
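
Putting the two moment formulae together, here is a minimal sketch using the proposed defaults, again assuming a hypothetical grad_fn(w); the bias-correction step comes from the original Adam paper rather than the text above:

```python
import numpy as np

def adam(grad_fn, w, lr=0.001, steps=100, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum (1st moment) combined with RMSprop (2nd moment)."""
    v = np.zeros_like(w)                         # VdW: weighted average of gradients
    s = np.zeros_like(w)                         # SdW: weighted average of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(w)
        v = beta1 * v + (1 - beta1) * g          # GD with Momentum (1st moment)
        s = beta2 * s + (1 - beta2) * g ** 2     # RMSprop (2nd moment)
        v_hat = v / (1 - beta1 ** t)             # bias correction (from the Adam paper)
        s_hat = s / (1 - beta2 ** t)
        w = w - lr * v_hat / (np.sqrt(s_hat) + eps)
    return w
```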
