
Types of Optimizers

A) Gradient Descent:

This is one of the oldest and most common optimizers used in neural networks. It works best in cases where the data is arranged in a way that poses a convex optimization problem.

It tries to find the lowest value of the cost function by updating the weights of your learning algorithm, and comes up with the best-suited parameter values, corresponding to the Global Minimum.

This is done by moving down the slope of the cost curve: a negative slope increases the old weight, while a positive slope reduces it.

There are challenges in using this optimizer, however. If the data is arranged in a way that poses a non-convex optimization problem, it can land on a Local Minimum instead of the Global Minimum, thereby producing parameter values with a higher cost function.
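
As a minimal sketch of this update rule (assuming an illustrative least-squares cost and hypothetical names X, y, and lr that are not from the text above):

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, epochs=100):
    """Full-batch gradient descent on a convex least-squares cost."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # slope of the cost with respect to the weights
        w -= lr * grad                         # negative slope raises w, positive slope lowers it
    return w
```

Because every step uses the whole dataset, each update moves straight down the convex bowl towards the Global Minimum.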

B) Stochastic Gradient Descent:

This is another variant of the Gradient Descent optimizer, with the added ability to work with data that poses a non-convex optimization problem. The difficulty with such data is that the cost function can come to rest at a local minimum, which is not suitable for your learning algorithm.

Rather than going for batch processing, this optimizer performs one update at a time. Each update is therefore usually much faster, and the cost function decreases after every update.

It performs frequent updates with a high variance, which causes the objective function (cost function) to fluctuate heavily. This fluctuation is what allows the parameters to jump towards a potential Global Minimum.

However, if we choose a learning rate that is too small, convergence may be very slow, while a larger learning rate can make it difficult to converge, causing the cost function to fluctuate around the minimum or even diverge away from the Global Minimum.
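
A minimal sketch of the per-example update, again with hypothetical names X, y, and lr and the same illustrative least-squares cost as above:

```python
import numpy as np

def stochastic_gradient_descent(X, y, lr=0.01, epochs=10):
    """One weight update per training example instead of one per batch."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):  # shuffled order keeps the updates noisy
            grad = 2 * X[i] * (X[i] @ w - y[i])  # gradient from a single example
            w -= lr * grad                       # frequent, high-variance updates
    return w
```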

C) Adagrad:

This is the Adaptive Gradient optimization algorithm, where the learning rate plays an important role in determining the updated parameter values. Unlike Stochastic Gradient Descent, this optimizer uses a different learning rate for each parameter at every iteration, rather than the same learning rate for all of them. It therefore performs smaller updates (lower learning rates) for the weights corresponding to high-frequency features and bigger updates (higher learning rates) for the weights corresponding to low-frequency features, which in turn leads to better performance and higher accuracy. Adagrad is well suited to dealing with sparse data.

So at each iteration, the accumulator alpha at time t is calculated first; as the iterations go on, t increases and alpha_t keeps growing, since it sums up the squared gradients seen so far.

However, there is a disadvantage: the problem of the Vanishing Gradient. After many iterations the alpha value becomes very large, making the learning rate very small and leaving almost no change between the new and the old weight. The learning rate keeps shrinking until it is so small that the algorithm is not able to acquire any further knowledge.
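
A minimal sketch of this accumulator, assuming a hypothetical grad_fn(w) that returns the gradient of the cost at w (lr, steps, and eps are illustrative defaults):

```python
import numpy as np

def adagrad(grad_fn, w, lr=0.01, steps=100, eps=1e-8):
    """Adagrad: a separate effective learning rate for every parameter."""
    alpha = np.zeros_like(w)                     # running sum of squared gradients ("alpha")
    for _ in range(steps):
        g = grad_fn(w)
        alpha += g ** 2                          # alpha_t only ever grows as t increases
        w = w - lr * g / (np.sqrt(alpha) + eps)  # so the effective step keeps shrinking
    return w
```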

D) Adadelta:

This is an extension of the Adaptive Gradient optimizer that takes care of its aggressive tendency to shrink the learning rate towards zero. Instead of keeping the full sum of previous squared gradients, the sum is defined as a decaying weighted average of all past squared gradients, which prevents the learning rate from reducing to a very small value.

The formula for the new weight remains the same as in Adagrad; however, the learning rate at time step t is determined differently at each iteration.

At each iteration, the weighted average is calculated first, using the restricting term (gamma = 0.95), which helps in avoiding the problem of the Vanishing Gradient.
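
Written in the same plain notation used for the Adam formulae below (avg_t and g_t are illustrative names for the running average and the gradient at step t), the weighted average is:

avg_t = γ x avg_t-1 + (1 - γ) x g_t²   (with γ = 0.95)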

E) RMSprop:

Both optimization algorithms, RMSprop (Root Mean Square Propagation) and Adadelta, were developed around the same time and for the same purpose: to resolve Adagrad's problem of radically diminishing learning rates. Both use the same method, an Exponential Weighted Average of squared gradients, to determine the learning rate at time t for each iteration.

RMSprop is an adaptive learning rate method proposed by Geoffrey Hinton. It divides the learning rate by the root of an exponentially weighted average of squared gradients. It is suggested to set gamma at 0.95, as this has shown good results in most cases.
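
A minimal sketch of this division, under the same assumptions as the Adagrad sketch above (a hypothetical grad_fn(w) and illustrative defaults):

```python
import numpy as np

def rmsprop(grad_fn, w, lr=0.001, steps=100, gamma=0.95, eps=1e-8):
    """Divide the step by the root of a decaying average of squared gradients."""
    avg_sq = np.zeros_like(w)                     # exponentially weighted average of g²
    for _ in range(steps):
        g = grad_fn(w)
        avg_sq = gamma * avg_sq + (1 - gamma) * g ** 2
        w = w - lr * g / (np.sqrt(avg_sq) + eps)  # effective rate no longer decays to zero
    return w
```

Unlike Adagrad's ever-growing sum, the decaying average forgets old gradients, so the effective learning rate stays usable.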

F) Adam:

This is the Adaptive Moment Estimation algorithm, which also computes adaptive learning rates for each parameter at every iteration. It uses a combination of Gradient Descent with Momentum and RMSprop to determine the parameter values.

When the algorithm was introduced, it came with a list of attractive benefits for non-convex optimization problems, which made it the most commonly used optimizer.

It comes with several advantages, combining the benefits of both Gradient Descent with Momentum and RMSprop: low memory requirements, suitability for non-stationary objectives, and efficient computation that works well with large data and many parameters.

It works with the same adaptive learning rate methodology, in addition to storing an exponentially weighted average of the past squared derivatives of the loss with respect to the weights up to time t-1.

It comes with several hyperparameters: β1, β2, and ε (epsilon). β1 and β2 are the restricting parameters for the Momentum and RMSprop terms respectively; β1 corresponds to the first moment and β2 to the second moment.

To update the weights with an adaptive learning rate at iteration t, we first need to calculate the first and second moments, given by the following formulae:

VdW = β1 x VdW + (1 - β1) x dW    (GD with Momentum, 1st moment)

SdW = β2 x SdW + (1 - β2) x dW²    (RMSprop, 2nd moment)

Adam is relatively easy to configure, and the default configuration parameters do well on most problems. The proposed default values are β1 = 0.9, β2 = 0.999, and ε = 10⁻⁸. Studies show that Adam works well in practice in comparison to other adaptive learning algorithms.
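
Putting the two moment formulae together, here is a minimal sketch using the proposed defaults, again assuming a hypothetical grad_fn(w); the bias-correction step comes from the original Adam paper rather than the text above:

```python
import numpy as np

def adam(grad_fn, w, lr=0.001, steps=100, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum (1st moment) combined with RMSprop (2nd moment)."""
    v = np.zeros_like(w)                         # VdW: weighted average of gradients
    s = np.zeros_like(w)                         # SdW: weighted average of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(w)
        v = beta1 * v + (1 - beta1) * g          # GD with Momentum (1st moment)
        s = beta2 * s + (1 - beta2) * g ** 2     # RMSprop (2nd moment)
        v_hat = v / (1 - beta1 ** t)             # bias correction (from the Adam paper)
        s_hat = s / (1 - beta2 ** t)
        w = w - lr * v_hat / (np.sqrt(s_hat) + eps)
    return w
```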
