Adam Optimizer
Gradient Descent
Imagine you're at the top of a hill and your goal is to find the
lowest point in the valley. You can't see the entire valley from
the top, but you can feel the slope under your feet.
1. Start at the Top: You begin at the top of the hill (this is
like starting with random guesses for the model's
parameters).
2. Feel the Slope: You look around to find out which
direction the ground is sloping down. This is like
calculating the gradient, which tells you the steepest
way downhill.
3. Take a Step Down: Move in the direction where the
slope is steepest (this is adjusting the model's
parameters). The bigger the slope, the bigger the
step you take.
4. Repeat: You keep repeating the process — feeling the
slope and moving downhill — until you reach the
bottom of the valley (this is when the model has
learned and minimized the error).
The key idea is that, just like walking down a hill, Gradient
Descent moves towards the "bottom" or minimum of the loss
function, which represents the error in predictions.
Moving in the opposite direction of the gradient allows the algorithm
to gradually descend towards lower values of the function and
eventually reach its minimum. These gradients guide the updates,
ensuring convergence towards the optimal parameter values. The size
of each gradual step is controlled by a hyperparameter called the
learning rate.
What is Learning Rate?
The learning rate is an important hyperparameter in gradient descent
that controls how big or small the steps should be when moving down
the gradient to update the model's parameters. It largely determines
how quickly or slowly the algorithm converges towards the minimum of
the cost function.
1. If the learning rate is too small: The algorithm takes tiny
steps in each iteration and converges very slowly. This can
significantly increase training time and computational cost,
especially for large datasets, and the updates may become so small
that training effectively stalls before reaching the minimum.
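As a rough worked illustration (this toy example is mine, not from the
original text): for the cost J(w) = w^2 the update is
w_{t+1} = (1 - 2\gamma) w_t. With \gamma = 0.001 each step shrinks |w|
by a factor of 0.998, so reducing |w| by a factor of 100 takes roughly
2300 iterations, whereas with \gamma = 0.1 the per-step factor is 0.8
and the same reduction takes only about 21 iterations.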
1. For +ve gradient:
w_{new} = w - \gamma \cdot \frac{\partial J(w, b)}{\partial w}
Here:
\gamma: learning rate (step size for each update).
\frac{\partial J(w, b)}{\partial w}: gradient of the cost function with respect to w.
Since the gradient is positive, subtracting it effectively decreases w,
which reduces the cost function.
2. For -ve gradient:
w_{new} = w - \gamma \cdot \frac{\partial J(w, b)}{\partial w}
Since the gradient is negative, subtracting it effectively increases w
(the update adds a positive amount), which again reduces the cost
function.
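As a quick numerical check (a worked example added here for illustration),
take J(w) = w^2, so \partial J/\partial w = 2w, and let \gamma = 0.1:
For w = 3 the gradient is +6, so w_{new} = 3 - 0.1 \cdot 6 = 2.4 and the cost falls from 9 to 5.76.
For w = -3 the gradient is -6, so w_{new} = -3 - 0.1 \cdot (-6) = -2.4 and the cost again falls from 9 to 5.76.
In both cases the update moves w towards the minimum at w = 0.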
Working of Gradient Descent
Step 1: Initialize the parameters of the model randomly.
Step 2: Compute the gradient of the cost function with respect to
each parameter. This involves taking the partial derivative of the
cost function with respect to each parameter.
Step 3: Update the parameters by taking steps in the opposite
direction of the gradient. Here we choose a hyperparameter, the
learning rate, denoted by \gamma, which decides the size of each step.
Step 4: Repeat steps 2 and 3 iteratively until convergence to obtain
the best parameters for the model.
[Animation: gradient descent on a 3D convex cost surface]
This animation shows the iterative process of gradient descent as it
traverses the 3D convex surface of the cost function. Each step
represents an adjustment of the model's parameters to minimize the
loss, illustrating how the algorithm moves in the direction opposite
to the gradient until it converges.
Pseudo code:
t ← 0
max_iterations ← 1000
w, b ← initialize randomly
while t < max_iterations:
    w ← w − γ · ∂J(w, b)/∂w ;  b ← b − γ · ∂J(w, b)/∂b
    t ← t + 1
Here:
max_iterations is the number of iterations we want to perform to
update our parameters,
w, b are the weight and bias parameters,
γ is the learning rate.
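The pseudo code above can be turned into a short runnable sketch. The
following snippet is a minimal illustration of my own, assuming a
simple linear model y = w·x + b with mean squared error as the cost
J(w, b); the toy data and hyperparameter values are made up for the example.

import numpy as np

# Toy data for a linear model y = w*x + b (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])           # generated from w = 2, b = 1

learning_rate = 0.05                          # gamma: step size
max_iterations = 1000                         # number of parameter updates
w, b = np.random.randn(), np.random.randn()   # random initialization

for t in range(max_iterations):
    y_pred = w * x + b
    error = y_pred - y
    # Gradients of the MSE cost J(w, b) = mean((y_pred - y)^2)
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    # Step in the opposite direction of the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"learned w = {w:.3f}, b = {b:.3f}")    # should end up close to w = 2, b = 1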
Now that we have seen what gradient descent is and how it works, let
us look at its variations.
Different Variants of Gradient Descent
Types of gradient descent are:
1. Batch Gradient Descent: Batch Gradient
Descent computes gradients using the entire dataset in
each iteration.
2. Stochastic Gradient Descent (SGD): SGD uses one
data point per iteration to compute gradients, making it
faster.
3. Mini-batch Gradient Descent: Mini-batch Gradient
Descent combines batch and SGD by using small batches
of data for updates.
4. Momentum-based Gradient Descent: Momentum-based Gradient
Descent speeds up convergence by adding a fraction of the
previous update (an accumulation of past gradients) to the
current update (see the sketch just after this list).
5. Adagrad: Adagrad adjusts learning rates based on the
historical magnitude of gradients.
6. RMSprop: RMSprop is similar to Adagrad but uses a
moving average of squared gradients for learning rate
adjustments.
7. Adam: Adam combines Momentum, Adagrad, and
RMSprop by using moving averages of gradients and
squared gradients.
For a detailed explanation of these variants and their use-cases,
please refer to: Types of Gradient Descent.
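To make the momentum idea from the list above concrete, here is a
minimal sketch of my own (not code from the article); gradient_fn, lr,
beta and the toy objective are assumptions for the illustration.

def momentum_gradient_descent(gradient_fn, theta, lr=0.1, beta=0.9, steps=300):
    """Gradient descent with momentum: the velocity accumulates a fraction of past updates."""
    velocity = 0.0
    for _ in range(steps):
        grad = gradient_fn(theta)
        velocity = beta * velocity + grad   # fraction of the previous velocity plus the new gradient
        theta = theta - lr * velocity       # step along the accumulated direction
    return theta

# Example: minimize J(theta) = theta^2, whose gradient is 2*theta
print(momentum_gradient_descent(lambda th: 2 * th, theta=5.0))   # approximately 0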
Batch Gradient Descent vs Stochastic Gradient Descent (SGD):

| Aspect | Batch Gradient Descent | Stochastic Gradient Descent |
| --- | --- | --- |
| Data Processing | Uses the whole training dataset to compute the gradient. | Uses a single training sample to compute the gradient. |
| Convergence Speed | Slower, takes longer to converge. | Faster, converges quicker due to frequent updates. |
| Nature | Deterministic: same result for the same initial conditions. | Stochastic: results can vary with different initial conditions. |
| Shuffling of Data | No need for shuffling. | Requires shuffling of data before each epoch. |
| Overfitting | Can overfit if the model is too complex. | Can lead to overfitting due to more frequent updates. |
| Final Solution | Tends to converge to the global minimum for convex loss functions. | May converge to a local minimum or saddle point. |
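The contrast in the table can also be seen in code. The sketch below
is an illustration I am adding, with made-up toy data: the batch
version performs one update using all samples, while the stochastic
version shuffles the data and performs one update per sample.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # toy features (illustrative only)
y = X @ np.array([1.0, -2.0, 0.5])            # toy targets
w0 = np.zeros(3)
lr = 0.01

def grad(w, X_part, y_part):
    # Gradient of the MSE cost on the given subset of the data
    return 2 * X_part.T @ (X_part @ w - y_part) / len(y_part)

# Batch gradient descent: one update per pass over the whole dataset
w_batch = w0 - lr * grad(w0, X, y)

# Stochastic gradient descent: shuffle, then one update per single sample
w_sgd = w0.copy()
for i in rng.permutation(len(y)):
    w_sgd = w_sgd - lr * grad(w_sgd, X[i:i + 1], y[i:i + 1])

# w_batch changed once using all 100 samples; w_sgd was updated 100 times, one sample at a time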
Adagrad
3. Parameter Update
The parameters are updated using the adaptive learning rate and the
gradient at each step:
\theta_{t+1} = \theta_t - \mathrm{lr}_t \cdot \nabla_\theta J(\theta_t)
Where:
\mathrm{lr}_t = \frac{\eta}{\sqrt{G_t + \epsilon}} is the adaptive
learning rate built from the accumulated squared gradients G_t, and
\nabla_\theta J(\theta_t) is the gradient of the loss function with
respect to the parameter.
4. Convergence
The model continues updating the parameters until convergence
is achieved. Over time, the learning rate for each parameter
decreases as the sum of squared gradients grows, which helps
avoid large updates that could lead to overshooting.
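A minimal sketch of the Adagrad update described above (my own
illustration; the toy cost J(theta) = theta**2 and the hyperparameter
values are assumed). The accumulator G grows with every squared
gradient, so the effective learning rate eta / sqrt(G + eps) keeps
shrinking, which is exactly the behaviour discussed in the limitations below.

import numpy as np

eta, eps = 0.5, 1e-8        # base learning rate and stability term (assumed values)
theta = 5.0                 # single parameter, minimizing J(theta) = theta**2
G = 0.0                     # running sum of squared gradients

for t in range(1, 101):
    grad = 2 * theta                   # gradient of J(theta) = theta**2
    G += grad ** 2                     # accumulate squared gradients
    lr_t = eta / np.sqrt(G + eps)      # adaptive learning rate, decreases as G grows
    theta -= lr_t * grad
    if t in (1, 10, 100):
        print(f"step {t:3d}: effective lr = {lr_t:.4f}, theta = {theta:.4f}")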
Advantages of Adagrad
1. Adaptive Learning Rate: The most significant
advantage of Adagrad is its ability to adapt the learning
rate for each parameter. This is especially beneficial
when dealing with sparse features (e.g., in natural
language processing or recommender systems) or when
the data contains a lot of noise.
2. Efficient for Sparse Data: Adagrad is particularly
effective when training models on sparse data. For
example, in text classification problems, certain words
(features) may appear infrequently but still carry
significant importance. Adagrad ensures that these rare
features have an appropriate learning rate, preventing
them from being neglected.
3. No Need for Manual Learning Rate Tuning: With
Adagrad, there’s no need to manually tune the learning
rate throughout the training process. The algorithm
adjusts the learning rates automatically based on the
gradients, making it easier to train models without
needing to experiment with different learning rate
values.
4. Improved Performance in Many Scenarios: Adagrad
often provides superior performance in problems where
the gradients of the loss function vary significantly
across different parameters, leading to more efficient
and effective convergence.
Limitations of Adagrad
While Adagrad has many benefits, it also comes with certain
limitations:
1. Diminishing Learning Rates
One of the main drawbacks of Adagrad is that the learning rates
decrease monotonically as the algorithm progresses. This means
that as training continues, the learning rates for each parameter
become smaller, which can result in slower convergence and
premature halting of updates. Once the learning rates become
too small, the algorithm may struggle to make further
improvements, especially in the later stages of training.
2. Sensitivity to the Initial Learning Rate
Adagrad is sensitive to the choice of the initial learning rate. If
the learning rate is set too high or too low, it can lead to
suboptimal training performance. Although the algorithm adapts
learning rates during training, the initial learning rate still plays a
significant role.
3. No Momentum
Adagrad does not incorporate momentum, which means that it
may not always escape from shallow local minima in highly
complex loss surfaces. This limitation can hinder its performance
in some deep learning tasks.
Variants of Adagrad
To overcome some of Adagrad’s limitations, several variants
have been proposed, with the most popular ones being:
1. RMSProp (Root Mean Square Propagation):
RMSProp addresses the diminishing learning rate issue by
introducing an exponentially decaying average of the squared
gradients instead of accumulating the sum. This prevents the
learning rate from decreasing too quickly, making the algorithm
more effective in training deep neural networks.
The update rule for RMSProp is as follows:
G_t = \gamma G_{t-1} + (1 - \gamma)\,(\nabla_\theta J(\theta))^2
Where:
G_t is the exponentially decaying average of squared gradients.
The parameter update rule is:
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \cdot \nabla_\theta J(\theta)
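A minimal RMSProp sketch following the update rule above (illustrative
only; gamma, eta, the toy cost and the iteration count are assumed):

import numpy as np

eta, gamma, eps = 0.01, 0.9, 1e-8   # assumed hyperparameters
theta, G = 5.0, 0.0                 # minimize the toy cost J(theta) = theta**2

for _ in range(1000):
    grad = 2 * theta
    G = gamma * G + (1 - gamma) * grad ** 2       # decaying average of squared gradients
    theta -= eta / np.sqrt(G + eps) * grad        # parameter update

print(theta)   # close to the minimum at 0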
2. AdaDelta
AdaDelta is another modification of Adagrad that focuses on
reducing the accumulation of past gradients. It updates the
learning rates based on the moving average of past gradients
and incorporates a more stable and bounded update rule.
The key update for AdaDelta is:
\Delta\theta_{t+1} = -\frac{\sqrt{E[\Delta\theta^2]_t}}{\sqrt{E[(\nabla_\theta J(\theta))^2]_t + \epsilon}} \cdot \nabla_\theta J(\theta)
Where:
E[\Delta\theta^2]_t is the running average of past squared parameter
updates.
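A matching AdaDelta sketch (again my own illustration, with assumed
decay rate rho, eps and toy cost). Note that it keeps two running
averages, one of squared gradients and one of squared parameter
updates, so no explicit learning rate is required.

import numpy as np

rho, eps = 0.95, 1e-3
theta = 5.0                             # minimize the toy cost J(theta) = theta**2
avg_sq_grad, avg_sq_update = 0.0, 0.0

for _ in range(500):
    grad = 2 * theta
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad ** 2
    # Step scaled by the RMS of past updates over the RMS of recent gradients
    update = -np.sqrt(avg_sq_update + eps) / np.sqrt(avg_sq_grad + eps) * grad
    avg_sq_update = rho * avg_sq_update + (1 - rho) * update ** 2
    theta += update

print(theta)   # theta has moved from 5.0 into the neighbourhood of the minimum at 0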
3. Adam (Adaptive Moment Estimation)
Adam combines the benefits of both Adagrad and momentum-
based methods. It uses both the moving average of the gradients
and the squared gradients to adapt the learning rate. Adam is
widely used due to its robustness and superior performance in
various machine learning tasks.
Adam has the following update rules:
First moment estimate (m_t):
m_t = \beta_1 m_{t-1} + (1 - \beta_1)\,\nabla_\theta J(\theta)
Second moment estimate (v_t):
v_t = \beta_2 v_{t-1} + (1 - \beta_2)\,(\nabla_\theta J(\theta))^2
Bias-corrected moment estimates:
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
Parameter update:
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\,\hat{m}_t
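Finally, a compact Adam sketch that follows the four update rules
above (an illustration of my own; beta1, beta2, eta, eps and the toy
cost are assumed, using commonly quoted default-style values):

import numpy as np

eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8   # assumed hyperparameters
theta, m, v = 5.0, 0.0, 0.0                      # minimize the toy cost J(theta) = theta**2

for t in range(1, 501):
    grad = 2 * theta
    m = beta1 * m + (1 - beta1) * grad           # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta -= eta / (np.sqrt(v_hat) + eps) * m_hat

print(theta)   # much closer to the minimum at 0 than the starting value of 5.0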