Adam Optimizer

Gradient Descent is a key optimization algorithm in machine learning used to minimize cost functions by iteratively adjusting model parameters in the direction of steepest descent. It relies on the learning rate to control the size of the steps taken towards the minimum, and can face challenges like vanishing or exploding gradients. Variants include Batch Gradient Descent, Stochastic Gradient Descent, and Mini-batch Gradient Descent, each with its own advantages and disadvantages depending on the dataset size and computational resources.

What is Gradient Descent?

Gradient Descent is a fundamental algorithm in machine learning and optimization. It is used for tasks like training neural networks, fitting regression lines, and minimizing cost functions in models. In this article we will understand what gradient descent is, how it works, the mathematics behind it, and why it is so important in machine learning.
Introduction to Gradient Descent
Gradient Descent is an algorithm that finds a good solution to a problem by making small adjustments in the right direction. It's like trying to find the lowest point in a hilly area by walking down the slope, step by step, until you reach the bottom.
For example, imagine you're at the top of a hill and your goal is to find the lowest point in the valley. You can't see the entire valley from the top, but you can feel the slope under your feet.
1. Start at the Top: You begin at the top of the hill (this is
like starting with random guesses for the model's
parameters).
2. Feel the Slope: You look around to find out which
direction the ground is sloping down. This is like
calculating the gradient, which tells you the steepest
way downhill.
3. Take a Step Down: Move in the direction where the
slope is steepest (this is adjusting the model's
parameters). The bigger the slope, the bigger the
step you take.
4. Repeat: You keep repeating the process — feeling the
slope and moving downhill — until you reach the
bottom of the valley (this is when the model has
learned and minimized the error).
The key idea is that, just like walking down a hill, Gradient Descent moves towards the "bottom" or minimum of the loss function, which represents the error in predictions. Moving in the direction opposite to the gradient allows the algorithm to gradually descend towards lower values of the function and eventually reach its minimum. These gradients guide the updates, ensuring convergence towards the optimal parameter values. The size of each step in the descent is set by the learning rate.
What is Learning Rate?
The learning rate is an important hyperparameter in gradient descent that controls how big or small the steps should be when moving down the gradient to update the model's parameters. It determines how quickly or slowly the algorithm converges toward the minimum of the cost function.
1. If the learning rate is too small: The algorithm takes tiny steps in each iteration and converges very slowly. This can significantly increase training time and computational cost, especially for large datasets. A similar stalling effect occurs in the vanishing gradient problem, where the gradients themselves become extremely small and the updates barely change the parameters.

Figure: learning rate with small steps.
2. If the learning rate is too big: The algorithm may take huge steps, overshooting the minimum of the cost function without settling. It fails to converge and oscillates instead. A related failure mode is the exploding gradient problem, where very large gradients produce unstable updates.
Figure: learning rate with big steps. In the image we can see the point oscillate from right to left without converging to the minimum.
To address these problems, there are some techniques that can be used (a small sketch of gradient clipping follows this list):
 Careful weight initialization: The initialization of the weights can be adjusted to ensure that they are in an appropriate range. Using a different activation function, such as the Rectified Linear Unit (ReLU), can also help mitigate the vanishing gradient problem.
 Gradient clipping: Restrict the gradients to a predefined range to prevent them from becoming excessively large or small.
 Batch normalization: Normalizing the input of each layer prevents the activation functions from saturating, reducing the vanishing and exploding gradient problems.
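To make gradient clipping concrete, here is a minimal NumPy sketch; the clip_by_norm helper and the threshold value are illustrative assumptions for this example, not part of any particular library:

import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    # Scale the gradient down if its L2 norm exceeds max_norm.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

# Example: a dangerously large gradient gets rescaled to norm 1.0
g = np.array([30.0, -40.0])           # norm = 50
print(clip_by_norm(g, max_norm=1.0))  # [ 0.6 -0.8]

The update direction is preserved; only the magnitude is capped, which is why clipping stabilizes training without changing where the optimizer is headed.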
Choosing the right learning rate leads to fast and stable convergence, improving the efficiency of the training process. Sometimes, however, the vanishing and exploding gradient problems are unavoidable, and to address them we have the techniques discussed further in this article.
Mathematics Behind Gradient Descent
For simplicity, let's consider a linear regression model with a single input feature x and target y. The loss function (or cost function) over the dataset is defined as the Mean Squared Error (MSE):

J(w, b) = \frac{1}{n} \sum_{i=1}^{n} (y_p - y)^2

Here:
 y_p = x \cdot w + b: the predicted value.
 w: weight (slope of the line).
 b: bias (intercept).
 n: number of data points.
To optimize the model parameter w, we compute the gradient of the loss function with respect to w. This involves taking the partial derivative of J(w, b).
The gradient with respect to w is:

\frac{\partial J(w, b)}{\partial w} = \frac{\partial}{\partial w} \left[ \frac{1}{n} \sum_{i=1}^{n} (y_p - y)^2 \right] = \frac{2}{n} \sum_{i=1}^{n} (y_p - y) \cdot \frac{\partial}{\partial w} (y_p - y)

Substituting y_p = x \cdot w + b:

\frac{\partial J(w, b)}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} (y_p - y) \cdot \frac{\partial}{\partial w} (x \cdot w + b - y)

Final gradient with respect to w:

\frac{\partial J(w, b)}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} (y_p - y) \cdot x
Gradient Descent Update:
Once the gradient is calculated, we update the parameter w in the direction opposite to the gradient (to minimize the loss function):

w_{new} = w - \gamma \cdot \frac{\partial J(w, b)}{\partial w}

Here:
 γ: learning rate (step size for each update).
 ∂J(w, b)/∂w: gradient of the loss with respect to w.
1. For a positive gradient: subtracting it effectively decreases w, reducing the cost function.
2. For a negative gradient: subtracting it effectively increases w, again reducing the cost function.
In both cases the update moves w toward the minimum.
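To make this update concrete, here is a small worked example in NumPy; the toy data and the starting values w = 0, b = 0 are arbitrary choices for illustration:

import numpy as np

x = np.array([1.0, 2.0, 3.0])   # input feature
y = np.array([2.0, 4.0, 6.0])   # targets (true line: y = 2x)

w, b = 0.0, 0.0                  # initial parameters
gamma = 0.1                      # learning rate

y_p = w * x + b                  # predictions (all zeros initially)
# Gradient of the MSE loss, as derived above:
# dJ/dw = (2/n) * sum((y_p - y) * x)
grad_w = (2 / len(x)) * np.sum((y_p - y) * x)
print(grad_w)                    # -18.666..., negative, so w will increase

w = w - gamma * grad_w           # gradient descent update
print(w)                         # 1.866..., moving toward the true w = 2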
Working of Gradient Descent
 Step 1: Initialize the parameters of the model randomly.
 Step 2: Compute the gradient of the cost function with respect to each parameter. This involves taking the partial derivative of the cost function with respect to each parameter.
 Step 3: Update the parameters of the model by taking steps in the direction opposite to the gradient. Here we choose a hyperparameter, the learning rate, denoted by γ, which decides the step size of the update.
 Step 4: Repeat steps 2 and 3 iteratively until the parameters converge to good values for the defined model.

Figure: an animation showing the iterative process of gradient descent as it traverses the 3D convex surface of the cost function. Each step represents an adjustment of the model parameters to minimize the loss, moving in the direction opposite to the gradient until it converges.
Pseudo code:
t ← 0
max_iterations ← 1000
w, b ← initialize randomly

while t < max_iterations do
    t ← t + 1
    w ← w − γ · ∂J(w, b)/∂w
    b ← b − γ · ∂J(w, b)/∂b
end

Here:
 max_iterations is the number of iterations used to update the parameters,
 w, b are the weight and bias parameters,
 γ is the learning rate.
A runnable version of this loop follows below.
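A runnable NumPy translation of this pseudocode might look like the following sketch; the toy dataset and hyperparameter values are illustrative assumptions:

import numpy as np

# Toy data: y = 3x + 1 with a little noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3 * x + 1 + 0.1 * rng.normal(size=100)

gamma = 0.1                          # learning rate
max_iterations = 1000
w, b = rng.normal(), rng.normal()    # random initialization

for t in range(max_iterations):
    y_p = w * x + b
    # Gradients of the MSE loss with respect to w and b
    grad_w = (2 / len(x)) * np.sum((y_p - y) * x)
    grad_b = (2 / len(x)) * np.sum(y_p - y)
    w -= gamma * grad_w              # step opposite to the gradient
    b -= gamma * grad_b

print(w, b)   # should end up close to 3 and 1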
Now that we have learned what gradient descent is and how it works, let's look at its variants.
Different Variants of Gradient Descent
Types of gradient descent are:
1. Batch Gradient Descent: Batch Gradient
Descent computes gradients using the entire dataset in
each iteration.
2. Stochastic Gradient Descent (SGD): SGD uses one
data point per iteration to compute gradients, making it
faster.
3. Mini-batch Gradient Descent: Mini-batch Gradient
Descent combines batch and SGD by using small batches
of data for updates.
4. Momentum-based Gradient Descent: Momentum-
based Gradient Descent speeds up convergence by
adding a fraction of the previous gradient to the current
update.
5. Adagrad: Adagrad adjusts learning rates based on the
historical magnitude of gradients.
6. RMSprop: RMSprop is similar to Adagrad but uses a
moving average of squared gradients for learning rate
adjustments.
7. Adam: Adam combines Momentum, Adagrad, and
RMSprop by using moving averages of gradients and
squared gradients.
To understand their explanations and use cases, please refer to: Types of Gradient Descent.

Advantages of Gradient Descent
 Flexibility: It can be used with various cost functions and can handle non-linear regression problems.
 Scalability: It scales to large datasets, since variants such as SGD update the parameters one training example at a time.
 Convergence: It can converge to the global minimum of the cost function, provided that the learning rate is set appropriately.
Disadvantages of Gradient Descent
 Sensitivity to Learning Rate: The choice of learning rate is important in gradient descent, as a poor choice can lead to vanishing or exploding gradient problems.
 Sensitivity to Initialization: It can be sensitive to the initialization of the model's parameters, which can affect the convergence and the quality of the solution.
 Local Minima: It can get stuck in local minima if the cost function has multiple local minima.
 Time-consuming: It can be time-consuming, especially when dealing with large datasets.
Gradient Descent is a fundamental optimization technique used in machine learning. Understanding it allows us to build efficient and accurate models by reducing the error the model makes on its cost function during training, making gradient descent essential for building effective machine learning models.
Difference between Batch Gradient Descent and Stochastic Gradient Descent
Gradient Descent is considered a complete algorithm, meaning it is guaranteed to find the global minimum of a convex cost function, assuming sufficient time and a proper learning rate are chosen. Two widely used variants of Gradient Descent are Batch Gradient Descent and Stochastic Gradient Descent (SGD). These variants differ mainly in how they process data and optimize the model parameters.
Batch Gradient Descent
Batch Gradient Descent computes the gradient of the cost
function using the entire training dataset for each iteration. This
approach ensures that the computed gradient is precise, but it
can be computationally expensive when dealing with very large
datasets.
Advantages of Batch Gradient Descent
1. Accurate Gradient Estimates: Since it uses the entire
dataset, the gradient estimate is precise.
2. Good for Smooth Error Surfaces: It works well for
convex or relatively smooth error manifolds.
Disadvantages of Batch Gradient Descent
1. Slow Convergence: Because the gradient is computed
over the entire dataset, it can take a long time to
converge, especially with large datasets.
2. High Memory Usage: Requires significant memory to
process the whole dataset in each iteration, making it
computationally intensive.
3. Inefficient for Large Datasets: With large-scale
datasets, Batch Gradient Descent becomes impractical
due to its high computation and memory requirements.
When to Use Batch Gradient Descent?
Batch Gradient Descent is ideal when the dataset is small to
medium-sized and when the error surface is smooth and convex.
It is also preferred when you can afford the computational cost.

Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) addresses the inefficiencies of Batch Gradient Descent by computing the gradient using only a single training example (or a small subset) in each iteration. This makes the algorithm much faster, since only a small fraction of the data is processed at each step.
Advantages of Stochastic Gradient Descent
1. Faster Convergence: Since the gradient is updated
after each individual data point, the algorithm converges
much faster than Batch Gradient Descent.
2. Lower Memory Requirements: As it processes only
one data point at a time, it requires significantly less
memory, making it suitable for large datasets.
3. Escape Local Minima: Due to its stochastic nature, SGD
can escape local minima and find the global minimum,
especially for non-convex functions.
Disadvantages of Stochastic Gradient Descent
(SGD)
1. Noisy Gradient Estimates: Since the gradient is based
on a single data point, the estimates can be noisy,
leading to less accurate results.
2. Convergence Issues: While SGD may converge quickly,
it tends to oscillate around the minimum and does not
settle exactly at the global minimum. This can be
mitigated by gradually decreasing the learning rate.
3. Requires Shuffling: To ensure randomness, the dataset
should be shuffled before each epoch.
When to Use Stochastic Gradient Descent?
SGD is particularly useful when dealing with large datasets,
where processing the entire dataset at once is computationally
expensive. It is also effective when optimizing non-convex loss
functions.
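As a minimal sketch of SGD on a linear-regression toy problem (the dataset and hyperparameters are illustrative assumptions), note the shuffling before each epoch and the per-sample updates:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 3 * x + 1 + 0.1 * rng.normal(size=200)

gamma = 0.05
w, b = 0.0, 0.0

for epoch in range(20):
    idx = rng.permutation(len(x))      # shuffle before each epoch
    for i in idx:
        y_p = w * x[i] + b             # prediction for ONE sample
        grad_w = 2 * (y_p - y[i]) * x[i]
        grad_b = 2 * (y_p - y[i])
        w -= gamma * grad_w            # noisy but frequent updates
        b -= gamma * grad_b

print(w, b)   # approaches 3 and 1, with some stochastic wobble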

Batch Gradient Descent vs Stochastic Gradient Descent
Here's a side-by-side comparison of Batch Gradient Descent and Stochastic Gradient Descent:

Aspect | Batch Gradient Descent | Stochastic Gradient Descent (SGD)
Data Processing | Uses the whole training dataset to compute the gradient. | Uses a single training sample to compute the gradient.
Convergence Speed | Slower, takes longer to converge. | Faster, converges quicker due to frequent updates.
Convergence Accuracy | More accurate, gives precise gradient estimates. | Less accurate due to noisy gradient estimates.
Computational and Memory Requirements | Requires significant computation and memory. | Requires less computation and memory.
Optimization of Non-Convex Functions | Can get stuck in local minima. | Can escape local minima and find the global minimum.
Suitability for Large Datasets | Not ideal for very large datasets due to slow computation. | Can handle large datasets effectively.
Nature | Deterministic: same result for the same initial conditions. | Stochastic: results can vary with different initial conditions.
Learning Rate | Fixed learning rate. | Learning rate can be adjusted dynamically.
Shuffling of Data | No need for shuffling. | Requires shuffling of data before each epoch.
Overfitting | Can overfit if the model is too complex. | Can reduce overfitting due to more frequent updates.
Escape Local Minima | Cannot escape shallow local minima. | Can escape shallow local minima more easily.
Computational Cost | High, due to processing the entire dataset at once. | Low, due to processing one sample at a time.
Final Solution | Tends to converge to the global minimum for convex loss functions. | May converge to a local minimum or saddle point.

Both Batch Gradient Descent and Stochastic Gradient Descent are powerful optimization algorithms that serve different purposes depending on the problem at hand.
 Batch Gradient Descent is more accurate but slower
and computationally expensive. It is ideal when working
with small to medium-sized datasets, and when high
accuracy is required.
 Stochastic Gradient Descent, on the other hand, is
faster and requires less computational power, making it
suitable for large datasets. It can also escape local
minima more easily but may converge less accurately.
Choosing between the two algorithms depends on factors like
the size of the dataset, computational resources, and the nature
of the error surface.
In machine learning, gradient descent is an optimization
technique used for computing the model parameters
(coefficients and bias) for algorithms like linear regression,
logistic regression, neural networks, etc. In this technique, we
repeatedly iterate through the training set and update the model
parameters in accordance with the gradient of the error with
respect to the training set. Depending on the number of training examples considered in updating the model parameters, we have three types of gradient descent:
1. Batch Gradient Descent: Parameters are updated after
computing the gradient of the error with respect to the
entire training set
2. Stochastic Gradient Descent: Parameters are updated
after computing the gradient of the error with respect to
a single training example
3. Mini-Batch Gradient Descent: Parameters are
updated after computing the gradient of the error with
respect to a subset of the training set
Batch Gradient Descent | Stochastic Gradient Descent | Mini-Batch Gradient Descent
Since the entire training data is considered before taking a step in the direction of the gradient, it takes a lot of time to make a single update. | Since only a single training example is considered before taking a step in the direction of the gradient, we are forced to loop over the training set and thus cannot exploit the speed associated with vectorizing the code. | Since a subset of training examples is considered, it can make quick updates in the model parameters and can also exploit the speed associated with vectorizing the code.
It makes smooth updates in the model parameters. | It makes very noisy updates in the parameters. | Depending upon the batch size, the updates can be made less noisy: the greater the batch size, the less noisy the update.

Thus, mini-batch gradient descent makes a compromise between speedy convergence and the noise associated with the gradient update, which makes it a more flexible and robust algorithm.

Figure: convergence in BGD, SGD & MBGD.

Mini-Batch Gradient Descent: Algorithm

Let theta = model parameters and max_iters = number of epochs.

for itr = 1, 2, 3, …, max_iters:
    for mini_batch (X_mini, y_mini):
        Forward pass on the batch X_mini:
            - Make predictions on the mini-batch
            - Compute the error in predictions, J(theta), with the current values of the parameters
        Backward pass:
            - Compute gradient(theta) = partial derivative of J(theta) w.r.t. theta
        Update parameters:
            - theta = theta − learning_rate * gradient(theta)
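A runnable NumPy sketch of this loop could look as follows; the batch size, toy dataset, and learning rate are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 1))
y = 3 * X[:, 0] + 1 + 0.1 * rng.normal(size=1000)

theta = np.zeros(2)          # theta = [w, b]
learning_rate = 0.1
batch_size = 32
max_iters = 50               # number of epochs

for itr in range(max_iters):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        X_mini, y_mini = X[batch, 0], y[batch]
        # Forward pass: predictions and error on the mini-batch
        y_p = theta[0] * X_mini + theta[1]
        # Backward pass: gradient of the MSE w.r.t. theta
        grad = np.array([
            (2 / len(batch)) * np.sum((y_p - y_mini) * X_mini),
            (2 / len(batch)) * np.sum(y_p - y_mini),
        ])
        theta -= learning_rate * grad   # parameter update

print(theta)   # close to [3, 1]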

Momentum-based Gradient Optimizer


Momentum-based gradient optimizers are a class of algorithms
used in machine learning and deep learning to optimize the
training of models. They are an enhancement over the classic
gradient descent method and help accelerate the training
process, especially for large-scale datasets and deep neural
networks.
By incorporating a “momentum” term, these optimizers can
navigate the loss surface more efficiently, leading to faster
convergence, reduced oscillations, and better overall
performance.
Understanding Gradient Descent
Before diving into momentum-based optimizers, it's essential to understand the traditional gradient descent method. In gradient descent, the model's weights are updated by taking small steps in the direction of the negative gradient of the loss function. Mathematically, the weight update rule is:

w_{t+1} = w_t - \eta \nabla L(w_t)

Where:
 w_t is the weight at time step t,
 η is the learning rate,
 ∇L(w_t) is the gradient of the loss function with respect to the weights.
This method works well, but it suffers from issues such as slow
convergence and the tendency to get stuck in local minima,
especially in high-dimensional spaces. This is where momentum-
based optimization comes into play.
What is Momentum?
Momentum is a concept borrowed from physics, where an
object’s motion depends not only on the current force but also
on its previous velocity. In the context of gradient optimization,
momentum refers to a method that smoothens the optimization
trajectory by adding a term that helps the optimizer remember
the past gradients.
In mathematical terms, the momentum-based gradient descent updates can be described as follows:

v_{t+1} = \beta v_t + (1 - \beta) \nabla L(w_t)
w_{t+1} = w_t - \eta v_{t+1}

Where:
 v_t is the velocity (a running average of gradients),
 β is the momentum factor, typically a value between 0 and 1 (often around 0.9),
 ∇L(w_t) is the current gradient of the loss function,
 η is the learning rate.
Here's how it works:
1. Velocity Update: The velocity v_t is updated by considering both the previous velocity (which represents the momentum) and the current gradient. The momentum factor β controls the contribution of the previous velocity to the current update.
2. Weight Update: The weights are updated using the velocity v_{t+1}, which is a weighted average of the past gradients and the current gradient.
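As a minimal sketch of these two equations (the quadratic toy objective and the hyperparameter values are assumptions chosen for illustration):

import numpy as np

def momentum_step(w, v, grad, eta=0.1, beta=0.9):
    # One momentum update, following the equations above:
    # v <- beta*v + (1-beta)*grad;  w <- w - eta*v
    v = beta * v + (1 - beta) * grad
    w = w - eta * v
    return w, v

# Illustration on f(w) = w^2 (gradient 2w), starting from w = 5
w, v = 5.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, grad=2 * w)
print(w)   # close to 0, the minimum of f

Because the velocity averages past gradients, successive steps in a consistent direction reinforce each other, which is exactly the acceleration effect described above.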
Benefits of Momentum-Based Optimizers
1. Faster Convergence: Momentum-based optimizers help
accelerate the convergence by considering past
gradients, which helps the model navigate through flat
regions more efficiently.
2. Reduces Oscillation: Traditional gradient descent can
oscillate when there are steep gradients in some
directions and flat gradients in others. Momentum
reduces this oscillation by maintaining the direction of
previous updates.
3. Improved Generalization: By smoothing the
optimization process, momentum-based methods can
lead to better generalization on unseen data, preventing
overfitting.
4. Helps Avoid Local Minima: The momentum term can
help the optimizer escape from local minima by
maintaining a strong enough “velocity” to continue
moving past these suboptimal points.
Types of Momentum-Based Optimizers
There are several variations of momentum-based optimizers,
each with slight modifications to the basic momentum algorithm:
1. Nesterov Accelerated Gradient (NAG)
Nesterov momentum is an advanced form of momentum-based
optimization. It modifies the update rule by calculating the
gradient at the “look-ahead” position, rather than the current
position of the weights.
The update rule becomes:

v_{t+1} = \beta v_t + \nabla L(w_t - \eta \beta v_t)
w_{t+1} = w_t - \eta v_{t+1}
NAG is considered more efficient than classical momentum
because it has a better understanding of the future trajectory,
leading to even faster convergence and better performance in
some cases.
2. AdaMomentum
AdaMomentum combines the concept of adaptive learning rates
with momentum. It adjusts the momentum term based on the
recent gradients, making the optimizer more sensitive to the
landscape of the loss function. This can help in fine-tuning the
convergence process.
3. RMSProp (Root Mean Square Propagation)
Although not strictly a momentum-based optimizer in the
traditional sense, RMSProp incorporates a form of momentum by
adapting the learning rate for each parameter. It’s particularly
effective when dealing with non-stationary objectives, such as in
training recurrent neural networks (RNNs).
Key Hyperparameters
 Learning Rate (η): The learning rate determines the size of the step taken during each update. It plays a crucial role in both standard gradient descent and momentum-based optimizers.
 Momentum Factor (β): This controls how much of the past gradients are remembered in the current update. A value close to 1 means the optimizer will have more inertia, while a value closer to 0 means less reliance on past gradients.
Challenges and Considerations
1. Choosing Hyperparameters: Selecting the appropriate
values for the learning rate and momentum factor can be
challenging. Typically, a momentum factor of 0.9 is
common, but it may vary based on the specific problem
or dataset.
2. Potential for Over-Accumulation: If the momentum
term becomes too large, it can lead to the optimizer
overshooting the minimum, especially in the presence of
noisy gradients.
3. Initial Momentum: When momentum is initialized, it
can have a significant impact on the convergence rate.
Poor initialization can lead to slow or erratic optimization
behavior.
Adagrad Optimizer in Deep Learning
Adagrad is short for “Adaptive Gradient Algorithm”. It is an
adaptive learning rate optimization algorithm used for training
deep learning models. It is particularly effective for sparse data
or scenarios where features exhibit a large variation in
magnitude.
Adagrad adjusts the learning rate for each parameter
individually. Unlike standard gradient descent, where a fixed learning rate is applied to all parameters, Adagrad adapts the learning rate based on the historical gradients for each parameter, allowing the model to focus on more important features and learn efficiently.
How Does Adagrad Work?
The primary concept behind Adagrad is the idea of adapting the
learning rate based on the historical sum of squared gradients
for each parameter. Here’s a step-by-step explanation of how
Adagrad works:
1. Initialization
Adagrad begins by initializing the parameter values randomly,
just like other optimization algorithms. Additionally, it initializes a
running sum of squared gradients for each parameter, which will
track the gradients over time.
2. Gradient Calculation
For each training step, the gradient of the loss function with
respect to the model’s parameters is calculated, just like in
standard gradient descent.
3. Adaptive Learning Rate
The key difference comes next. Instead of using a fixed learning rate, Adagrad adjusts the learning rate for each parameter based on the accumulated sum of squared gradients.
The updated learning rate for each parameter is calculated as follows:

lr_t = \frac{\eta}{\sqrt{G_t + \epsilon}}

Where:
 η is the global learning rate (a small constant value),
 G_t is the sum of squared gradients for a given parameter up to time step t,
 ε is a small value added to avoid division by zero (often set to 1e-8).
Here, the denominator \sqrt{G_t + \epsilon} grows as the squared gradients accumulate, causing the learning rate to decrease over time, which helps to stabilize the training.
4. Parameter Update
The model's parameters are updated by subtracting the product of the adaptive learning rate and the gradient at each step:

\theta_{t+1} = \theta_t - lr_t \cdot \nabla_\theta J(\theta)

Where:
 θ_t is the current parameter,
 ∇_θ J(θ) is the gradient of the loss function with respect to the parameter.
5. Convergence
The model continues updating the parameters until convergence is achieved. Over time, the learning rate for each parameter decreases as the sum of squared gradients grows, which helps avoid large updates that could lead to overshooting.
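A minimal NumPy sketch of these steps might look as follows; the quadratic toy objective and the learning rate are assumptions for illustration:

import numpy as np

def adagrad_step(theta, G, grad, eta=0.5, eps=1e-8):
    # One Adagrad update: accumulate squared gradients, then
    # scale the learning rate by 1/sqrt(G + eps) per parameter.
    G = G + grad ** 2                       # running sum of squared gradients
    theta = theta - (eta / np.sqrt(G + eps)) * grad
    return theta, G

# Illustration: minimize f(theta) = theta_0^2 + 10*theta_1^2,
# where the two coordinates have very different gradient scales
theta = np.array([5.0, 5.0])
G = np.zeros_like(theta)
for _ in range(500):
    grad = np.array([2 * theta[0], 20 * theta[1]])
    theta, G = adagrad_step(theta, G, grad)
print(theta)   # both coordinates shrink toward 0 despite the scale gap

The per-parameter division by sqrt(G) is what equalizes the effective step sizes across coordinates with very different gradient magnitudes.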
Advantages of Adagrad
1. Adaptive Learning Rate: The most significant
advantage of Adagrad is its ability to adapt the learning
rate for each parameter. This is especially beneficial
when dealing with sparse features (e.g., in natural
language processing or recommender systems) or when
the data contains a lot of noise.
2. Efficient for Sparse Data: Adagrad is particularly
effective when training models on sparse data. For
example, in text classification problems, certain words
(features) may appear infrequently but still carry
significant importance. Adagrad ensures that these rare
features have an appropriate learning rate, preventing
them from being neglected.
3. No Need for Manual Learning Rate Tuning: With
Adagrad, there’s no need to manually tune the learning
rate throughout the training process. The algorithm
adjusts the learning rates automatically based on the
gradients, making it easier to train models without
needing to experiment with different learning rate
values.
4. Improved Performance in Many Scenarios: Adagrad
often provides superior performance in problems where
the gradients of the loss function vary significantly
across different parameters, leading to more efficient
and effective convergence.
Limitations of Adagrad
While Adagrad has many benefits, it also comes with certain
limitations:
1. Diminishing Learning Rates
One of the main drawbacks of Adagrad is that the learning rates
decrease monotonically as the algorithm progresses. This means
that as training continues, the learning rates for each parameter
become smaller, which can result in slower convergence and
premature halting of updates. Once the learning rates become
too small, the algorithm may struggle to make further
improvements, especially in the later stages of training.
2. Sensitivity to the Initial Learning Rate
Adagrad is sensitive to the choice of the initial learning rate. If
the learning rate is set too high or too low, it can lead to
suboptimal training performance. Although the algorithm adapts
learning rates during training, the initial learning rate still plays a
significant role.
3. No Momentum
Adagrad does not incorporate momentum, which means that it
may not always escape from shallow local minima in highly
complex loss surfaces. This limitation can hinder its performance
in some deep learning tasks.
Variants of Adagrad
To overcome some of Adagrad’s limitations, several variants
have been proposed, with the most popular ones being:
1. RMSProp (Root Mean Square Propagation):
RMSProp addresses the diminishing learning rate issue by
introducing an exponentially decaying average of the squared
gradients instead of accumulating the sum. This prevents the
learning rate from decreasing too quickly, making the algorithm
more effective in training deep neural networks.
The update rule for RMSProp is as follows:

G_t = \gamma G_{t-1} + (1 - \gamma)(\nabla_\theta J(\theta))^2

Where:
 G_t is the decayed average of squared gradients,
 γ is the decay factor (typically set to 0.9),
 ∇_θ J(θ) is the gradient.
The parameter update rule is:

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \cdot \nabla_\theta J(\theta)
2. AdaDelta
AdaDelta is another modification of Adagrad that focuses on
reducing the accumulation of past gradients. It updates the
learning rates based on the moving average of past gradients
and incorporates a more stable and bounded update rule.

The key update for AdaDelta is:

\Delta\theta_{t+1} = -\frac{\sqrt{E[\Delta\theta^2]_t}}{\sqrt{E[(\nabla_\theta J(\theta))^2]_t + \epsilon}} \cdot \nabla_\theta J(\theta)

Where:
 E[Δθ²]_t is the running average of past squared parameter updates.
3. Adam (Adaptive Moment Estimation)
Adam combines the benefits of both Adagrad and momentum-
based methods. It uses both the moving average of the gradients
and the squared gradients to adapt the learning rate. Adam is
widely used due to its robustness and superior performance in
various machine learning tasks.
Adam has the following update rules:
 First moment estimate (m_t):

m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta J(\theta)

 Second moment estimate (v_t):

v_t = \beta_2 v_{t-1} + (1 - \beta_2)(\nabla_\theta J(\theta))^2

 Corrected moment estimates:

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}

 Parameter update:

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t
When to Use Adagrad?
Adagrad is ideal for:
 Problems with sparse data and features (e.g., natural
language processing, recommender systems).
 Tasks where features have different levels of importance
and frequency.
 Training models that do not require a very fast
convergence rate but benefit from a more stable
optimization process.
However, if you are dealing with problems where a more
constant learning rate is preferable (e.g., in some deep learning
tasks), using variants like RMSProp or Adam might be more
appropriate.
RMSProp Optimizer in Deep Learning
RMSProp (Root Mean Square Propagation) is an adaptive
learning rate optimization algorithm designed to improve the
performance and speed of training deep learning models.
 It is a variant of the gradient descent algorithm, which adapts the learning rate for each parameter individually by considering the magnitude of recent gradients for those parameters.
 This adaptive nature helps in dealing with the challenges
of non-stationary objectives and sparse gradients
commonly encountered in deep learning tasks.
Need of RMSProp Optimizer
RMSProp was developed to address the limitations of previous
optimization methods such as SGD (Stochastic Gradient
Descent) and AdaGrad.
While SGD uses a constant learning rate, which can be
inefficient, and AdaGrad reduces the learning rate too
aggressively, RMSProp strikes a balance by adapting the learning
rates based on a moving average of squared gradients. This
approach helps in maintaining a balance between efficient
convergence and stability during the training process, making
RMSProp a widely used optimization algorithm in modern deep
learning.
How RMSProp Works?
RMSProp keeps a moving average of the squared gradients to
normalize the gradient updates. By doing so, RMSProp prevents
the learning rate from becoming too small, which was a
drawback in AdaGrad, and ensures that the updates are
appropriately scaled for each parameter. This mechanism allows
RMSProp to perform well even in the presence of non-stationary
objectives, making it suitable for training deep learning models.
The mathematical formulation is as follows:
1. Compute the gradient g_t at time step t:

g_t = \nabla_\theta J(\theta_t)

2. Update the moving average of squared gradients:

E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2

where γ is the decay rate.
3. Update the parameter θ using the adjusted learning rate:

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t

where η is the learning rate and ε is a small constant added for numerical stability.
Key Parameters Involved in RMSProp
 Learning Rate (η): Controls the step size during the parameter updates. RMSProp typically uses a default learning rate of 0.001, but it can be adjusted based on the specific problem.
 Decay Rate (γ): Determines how quickly the moving average of squared gradients decays. A common default value is 0.9, which balances the contribution of recent and past gradients.
 Epsilon (ε): A small constant added to the denominator to prevent division by zero and ensure numerical stability. A typical value for ε is 1e-8.
By carefully adjusting these parameters, RMSProp effectively
adapts the learning rates during training, leading to faster and
more reliable convergence in deep learning models.
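As a minimal sketch of these update rules (the 1-D toy objective, and a learning rate raised above the 0.001 default so the tiny example converges quickly, are assumptions for illustration):

import numpy as np

def rmsprop_step(theta, Eg2, grad, eta=0.01, gamma=0.9, eps=1e-8):
    # One RMSProp update using the moving average of squared gradients.
    # Note: eta here is larger than the typical 0.001 default, for this toy.
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2
    theta = theta - (eta / np.sqrt(Eg2 + eps)) * grad
    return theta, Eg2

# Illustration: minimize f(theta) = theta^2 starting from theta = 3
theta, Eg2 = 3.0, 0.0
for _ in range(1000):
    theta, Eg2 = rmsprop_step(theta, Eg2, grad=2 * theta)
print(theta)   # near 0; RMSProp hovers in a small band around the minimum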
What is Adam Optimizer?
Adam (short for Adaptive Moment Estimation) optimizer
combines the strengths of two other well-known techniques—
Momentum and RMSprop—to deliver a powerful method for
adjusting the learning rates of parameters during training.
Adam is highly effective, especially when working with large
datasets and complex models, because it is memory-efficient
and adapts the learning rate dynamically for each parameter.
How Does Adam Work?
Adam builds upon two key concepts in optimization:
1. Momentum
Momentum is used to accelerate the gradient descent process by
incorporating an exponentially weighted moving average of past
gradients. This helps smooth out the trajectory of the
optimization, allowing the algorithm to converge faster by
reducing oscillations.
The update rule with momentum is:

w_{t+1} = w_t - \alpha m_t

where:
 m_t is the moving average of the gradients at time t,
 α is the learning rate,
 w_t and w_{t+1} are the weights at time t and t+1, respectively.
The momentum term m_t is updated recursively as:

m_t = \beta_1 m_{t-1} + (1 - \beta_1) \frac{\partial L}{\partial w_t}

where:
 β_1 is the momentum parameter (typically set to 0.9),
 ∂L/∂w_t is the gradient of the loss function with respect to the weights at time t.
2. RMSprop (Root Mean Square Propagation)
RMSprop is an adaptive learning rate method that improves upon
AdaGrad. While AdaGrad accumulates squared gradients,
RMSprop uses an exponentially weighted moving average of
squared gradients, which helps overcome the problem of
diminishing learning rates.
The update rule for RMSprop is:

w_{t+1} = w_t - \frac{\alpha}{\sqrt{v_t + \epsilon}} \frac{\partial L}{\partial w_t}

where:
 v_t is the exponentially weighted average of squared gradients:

v_t = \beta_2 v_{t-1} + (1 - \beta_2) \left( \frac{\partial L}{\partial w_t} \right)^2

 ε is a small constant (e.g., 10^{-8}) added to prevent division by zero.
Combining Momentum and RMSprop: Adam
Optimizer
Adam optimizer combines the momentum and RMSprop
techniques to provide a more balanced and efficient optimization
process. The key equations governing Adam are as follows:
 First moment (mean) estimate:

m_t = \beta_1 m_{t-1} + (1 - \beta_1) \frac{\partial L}{\partial w_t}

 Second moment (variance) estimate:

v_t = \beta_2 v_{t-1} + (1 - \beta_2) \left( \frac{\partial L}{\partial w_t} \right)^2

 Bias correction: Since both m_t and v_t are initialized at zero, they tend to be biased toward zero, especially during the initial steps. To correct this bias, Adam computes the bias-corrected estimates:

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}

 Final weight update: The weights are then updated as:

w_{t+1} = w_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t
Key Parameters in Adam
 α: The learning rate or step size (default is 0.001).
 β_1 and β_2: Decay rates for the moving averages of the gradient and squared gradient, typically set to β_1 = 0.9 and β_2 = 0.999.
 ε: A small positive constant (e.g., 10^{-8}) used to avoid division by zero when computing the final update.
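A minimal NumPy sketch of these update rules might look as follows; the toy objective and the learning rate (raised above the 0.001 default for this tiny problem) are illustrative assumptions:

import numpy as np

def adam_step(w, m, v, grad, t, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update with bias correction; the step counter t starts at 1.
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (variance)
    m_hat = m / (1 - beta1 ** t)                # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Illustration: minimize f(w) = (w - 4)^2 starting from w = 0
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 1001):
    grad = 2 * (w - 4)
    w, m, v = adam_step(w, m, v, grad, t)
print(w)   # close to 4, the minimizer

Note how the bias correction matters most in the first few steps, when m and v are still close to their zero initialization.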
Why Adam Works So Well?
Adam addresses several challenges of gradient descent
optimization:
 Dynamic learning rates: Each parameter has its own
adaptive learning rate based on past gradients and their
magnitudes. This helps the optimizer avoid oscillations
and get past local minima more effectively.
 Bias correction: By adjusting for the initial bias when
the first and second moment estimates are close to zero,
Adam helps prevent early-stage instability.
 Efficient performance: Adam typically requires fewer
hyperparameter tuning adjustments compared to other
optimization algorithms like SGD, making it a more
convenient choice for most problems.
Performance of Adam
In comparison to other optimizers like SGD (Stochastic Gradient Descent) and momentum-based SGD, Adam often trains faster and reaches lower training cost. Its ability to adjust the learning rate per parameter, combined with the bias-correction mechanism, leads to faster convergence and more stable optimization. This makes Adam especially useful in complex models with large datasets, as it avoids slow convergence and instability while approaching a good minimum.
In practice, Adam often achieves strong results with minimal tuning, making it a go-to optimizer for deep learning tasks.
Figure: performance comparison of optimizers on training cost.
