Gradient-Based Optimizers
Optimizers & Gradient Descent Optimization Algorithms
Gradient descent is an optimization technique used to minimize the error or loss function
in machine learning and neural networks. It works by iteratively adjusting the parameters of
the model to find the values that result in the lowest possible error. Here’s how it works:
1. Objective: The goal of gradient descent is to find the minimum value of a function, typically the loss function that measures how well the model's predictions match the actual data.
2. Initialize Parameters: Start with an initial set of parameters or weights, usually set randomly.
3. Compute Gradient: Calculate the gradient (the partial derivatives) of the loss function with respect to each parameter. The gradient indicates the direction and rate at which the loss function increases.
4. Update Parameters: Adjust the parameters in the direction that reduces the loss by subtracting a fraction of the gradient from the current parameters. The size of this step is controlled by a value called the learning rate.
5. Iterate: Repeat the gradient computation and parameter updates until the changes in the loss function become very small or the number of iterations reaches a predefined limit.
6. Convergence: The process continues until the parameters converge to values where the loss function is minimized, or until the change in loss falls below a certain threshold.
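The six steps above can be sketched in a few lines of Python. This is a minimal, hypothetical example that minimizes the one-dimensional loss f(w) = (w − 3)², whose gradient we can write analytically; the function names and hyperparameter values are illustrative choices, not part of any particular library.

```python
def loss(w):
    return (w - 3.0) ** 2

def gradient(w):
    # Analytic derivative of the loss: d/dw (w - 3)^2 = 2 (w - 3)
    return 2.0 * (w - 3.0)

def gradient_descent(w0, learning_rate=0.1, max_iters=1000, tol=1e-8):
    w = w0                                    # step 2: initialize parameter
    for _ in range(max_iters):                # step 5: iterate
        g = gradient(w)                       # step 3: compute gradient
        w_new = w - learning_rate * g         # step 4: update parameter
        if abs(loss(w_new) - loss(w)) < tol:  # step 6: convergence check
            return w_new
        w = w_new
    return w

w_star = gradient_descent(w0=0.0)
print(round(w_star, 3))  # converges near the minimizer w = 3
```

In a real model the single scalar w would be a vector of weights and the gradient would come from backpropagation, but the update loop has exactly this shape.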
Key Elements of Gradient Descent:
• Learning Rate: A hyperparameter that determines the size of the steps taken during parameter updates. A learning rate that is too high may cause the algorithm to overshoot the minimum, while a learning rate that is too low results in slow convergence.
• Cost Function: Also known as the loss function, it measures the performance of the model. The goal is to
minimize this function.
• Gradient: A vector that points in the direction of the steepest increase of the loss function. The negative
gradient points in the direction of the steepest decrease.
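The learning-rate trade-off described above can be made concrete with a tiny illustration on f(w) = w² (gradient 2w). This is an assumed toy setting: a rate of 0.1 shrinks the iterate toward the minimum, while a rate of 1.5 makes each step overshoot and the iterate grows.

```python
def run(learning_rate, steps=20, w0=1.0):
    # Repeated gradient-descent steps on f(w) = w^2, whose gradient is 2w
    w = w0
    for _ in range(steps):
        w = w - learning_rate * 2.0 * w  # w <- w * (1 - 2 * learning_rate)
    return abs(w)

print(run(0.1))  # multiplier |1 - 0.2| = 0.8 < 1: converges toward 0
print(run(1.5))  # multiplier |1 - 3.0| = 2.0 > 1: diverges
```

Each update multiplies w by (1 − 2·learning_rate), so convergence depends entirely on whether that multiplier has magnitude below 1.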
Variants of Gradient Descent:
• Batch Gradient Descent: Computes the gradient using the entire dataset. This can be computationally
expensive for large datasets.
• Stochastic Gradient Descent (SGD): Computes the gradient using a single data point at a time. This can be
faster but introduces more noise in the parameter updates.
• Mini-Batch Gradient Descent: Computes the gradient using a small random subset of the data. It balances the stability of batch gradient descent with the speed of stochastic gradient descent.
In summary, gradient descent is a fundamental optimization algorithm used to minimize the loss function by
iteratively adjusting model parameters based on computed gradients.
Different Types of Gradient-Descent-Based Optimizers:
Batch Gradient Descent (also called Vanilla Gradient Descent or simply Gradient Descent, GD)
• Gradient Descent is an optimization algorithm for finding a local minimum of a differentiable function. It is used to find the values of a function's parameters (coefficients) that minimize a cost function.
The gradient of the cost function J(θ) is computed w.r.t. the parameters/weights θ over the entire training dataset, and the parameters are then updated in the opposite direction of the gradient, scaled by the learning rate η:
θ = θ − η · ∇θ J(θ)
Stochastic Gradient Descent (SGD)
• SGD updates are quite noisy, but at the same time SGD is much faster than the other variants; because of the noise, it may fail to converge exactly to a minimum.
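A minimal sketch of SGD's per-sample updates, on assumed toy data generated from y = 2x (so the true weight is 2.0); the data, seed, and hyperparameters are illustrative choices.

```python
import random

random.seed(0)
data = [(x, 2.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]  # noiseless y = 2x

w = 0.0
learning_rate = 0.05
for _ in range(500):
    x, y = random.choice(data)     # step on a single random data point
    grad = 2.0 * (w * x - y) * x   # gradient of (w*x - y)^2 w.r.t. w
    w -= learning_rate * grad      # noisy per-sample update
print(round(w, 2))  # close to the true weight 2.0
```

Each update uses the gradient from one sample only, which is why the trajectory of w is noisy even though it heads toward the minimum.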
Mini Batch Stochastic Gradient Descent (MB-SGD)
• The MB-SGD algorithm is an extension of the SGD algorithm that addresses the slow convergence caused by SGD's noisy per-sample updates: the gradient is averaged over a small random batch rather than computed from a single data point.
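The per-sample SGD sketch changes in only one place for MB-SGD: the gradient is averaged over a small random batch. Again the data (generated from y = 2x), seed, and hyperparameters are assumed for illustration.

```python
import random

random.seed(0)
data = [(x, 2.0 * x) for x in [0.25 * i for i in range(1, 17)]]  # y = 2x

w, learning_rate, batch_size = 0.0, 0.05, 4
for _ in range(300):
    batch = random.sample(data, batch_size)  # random mini-batch
    # Average the per-sample gradients of (w*x - y)^2 over the batch
    grad = sum(2.0 * (w * x - y) * x for x, y in batch) / batch_size
    w -= learning_rate * grad                # one averaged update
print(round(w, 2))  # close to the true weight 2.0
```

Averaging over a batch reduces the variance of each update relative to single-sample SGD, while still avoiding a full pass over the dataset per step.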
Nesterov Accelerated Gradient (NAG)
• In this version we first look ahead to the point where the current momentum is taking us, and compute the gradient at that look-ahead point rather than at the current parameters.
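The look-ahead idea can be sketched on the toy loss f(w) = w²; the learning rate and momentum coefficient below are assumed values, not prescribed ones. The only difference from classical momentum is that the gradient is evaluated at w − mu·v instead of at w.

```python
def grad(w):
    return 2.0 * w  # derivative of f(w) = w^2

w, v = 5.0, 0.0
learning_rate, mu = 0.1, 0.9
for _ in range(200):
    lookahead = w - mu * v  # the point current momentum is taking us to
    v = mu * v + learning_rate * grad(lookahead)  # gradient at look-ahead
    w = w - v               # step with the corrected velocity
print(round(w, 4))  # near the minimum at w = 0
```

Evaluating the gradient at the look-ahead point lets the velocity correct itself before overshooting, which is why NAG tends to oscillate less than plain momentum.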
Parameter Initialization Strategies
1. Initialization of weight values
Adadelta
• To avoid Adagrad's continually shrinking learning rate (caused by accumulating every past squared gradient), Adadelta adapts learning rates based on a moving window of gradient updates, instead of accumulating all past gradients.
Accumulated Gradients:
E[g²]_t = γ · E[g²]_{t−1} + (1 − γ) · g²_t
E[⋅] represents the exponential moving average (EMA) of the quantity inside the brackets. It is a way to smooth out the values over time, giving more weight to recent values while still considering past values.
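The accumulator described above, the exponential moving average E[g²]_t = ρ·E[g²]_{t−1} + (1 − ρ)·g²_t of squared gradients, can be sketched directly; the decay value ρ = 0.9 and the gradient sequence are assumed for illustration.

```python
def ema_sq(grads, rho=0.9):
    # Exponential moving average of squared gradients:
    # e_t = rho * e_{t-1} + (1 - rho) * g_t^2
    e = 0.0
    history = []
    for g in grads:
        e = rho * e + (1.0 - rho) * g * g  # recent g^2 weighted most
        history.append(e)
    return history

hist = ema_sq([1.0, 1.0, 1.0, 10.0])
print([round(v, 4) for v in hist])
```

Note how the final large gradient dominates the average immediately, while the influence of the early gradients decays geometrically; this windowed behavior is what lets Adadelta and RMSprop keep adapting instead of freezing as Adagrad's ever-growing sum would.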
RMSprop