
Gradient-Based Optimizers

Uploaded by

sunnyrx100virat
Gradient descent optimization algorithms
Gradient descent is an optimization technique used to minimize the error or loss function
in machine learning and neural networks. It works by iteratively adjusting the parameters of
the model to find the values that result in the lowest possible error. Here’s how it works:
1. Objective: The goal of gradient descent is to find the minimum value of a function,
typically the loss function that measures how well the model's predictions match the
actual data.
2. Initialize Parameters: Start with an initial set of parameters or weights, which are usually
set randomly.
3. Compute Gradient: Calculate the gradient (or partial derivatives) of the loss function with
respect to each parameter. The gradient indicates the direction and rate at which the loss
function increases.
4. Update Parameters: Adjust the parameters in the direction that reduces the loss. This is
done by subtracting a fraction of the gradient from the current parameters. The size of
this step is controlled by a value called the learning rate.
5. Iterate: Repeat the process of computing gradients and updating parameters until the
changes in the loss function become very small or the number of iterations reaches a
predefined limit.
6. Convergence: The process continues until the parameters converge to values where the
loss function is minimized or the change in loss is below a certain threshold.
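The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `gradient_descent`, the toy loss J(w) = (w - 3)^2, and the default values are assumptions chosen for the example, and the stopping test looks at the size of the parameter change rather than the change in loss.

```python
def gradient_descent(grad, w0, learning_rate=0.1, max_iters=1000, tol=1e-8):
    """Minimize a one-parameter function given its gradient `grad`, starting at `w0`."""
    w = w0                                # step 2: initial parameter value
    for _ in range(max_iters):            # step 5: iterate up to a fixed limit
        g = grad(w)                       # step 3: compute the gradient
        w_new = w - learning_rate * g     # step 4: move against the gradient
        if abs(w_new - w) < tol:          # step 6: stop once the update is tiny
            return w_new
        w = w_new
    return w


# Toy loss J(w) = (w - 3)^2, whose gradient is 2 * (w - 3); the minimum is at w = 3.
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_min)
```

With a learning rate of 0.1 the distance to the minimum shrinks by a constant factor on every step; a much larger learning rate would overshoot, and a much smaller one would need many more iterations.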
Key Elements of Gradient Descent:
• Learning Rate: A hyperparameter that determines the size of the steps taken during parameter updates. A
learning rate that is too high may cause the algorithm to overshoot the minimum, while a learning rate
that is too low may result in a slow convergence.
• Cost Function: Also known as the loss function, it measures the performance of the model. The goal is to
minimize this function.
• Gradient: A vector that points in the direction of the steepest increase of the loss function. The negative
gradient points in the direction of the steepest decrease.
Variants of Gradient Descent:
• Batch Gradient Descent: Computes the gradient using the entire dataset. This can be computationally
expensive for large datasets.
• Stochastic Gradient Descent (SGD): Computes the gradient using a single data point at a time. This can be
faster but introduces more noise in the parameter updates.
• Mini-Batch Gradient Descent: Computes the gradient using a small random subset of the data. It balances
the stable gradient estimates of batch gradient descent against the cheap, frequent updates of stochastic gradient descent.

In summary, gradient descent is a fundamental optimization algorithm used to minimize the loss function by
iteratively adjusting model parameters based on computed gradients.
Different types of Gradient descent based Optimizers:
Batch Gradient Descent or Vanilla Gradient Descent or Gradient Descent (GD)
• Gradient Descent is an optimization algorithm for finding a local
minimum of a differentiable function. Gradient descent is simply
used to find the values of a function's parameters (coefficients) that
minimize a cost function as far as possible.

• The weights are initialized using some initialization strategy and are
updated with each epoch according to the update equation:

$$\theta = \theta - \eta \, \nabla_\theta J(\theta)$$

The above equation computes the gradient of the cost function J(θ)
w.r.t. the parameters/weights θ for the entire training dataset.
We then update our parameters in the opposite direction of the gradient,
with the step size set by the learning rate η.

Batch gradient descent is guaranteed to converge to the global minimum for convex
error surfaces and to a local minimum for non-convex surfaces.
Stochastic Gradient Descent (SGD)

• The SGD algorithm is an extension of Gradient Descent that
overcomes some of the disadvantages of the GD algorithm.
• Gradient Descent has the disadvantage that it requires a lot of memory,
because the entire dataset of n points must be loaded at once to compute the
derivative of the loss function.
• In the SGD algorithm, the derivative is computed using one point at a
time.

Here, imagine the same mountain, but this time, it's foggy, and you can only
see a few feet in front of you. You can't determine the steepest descent over
the whole landscape, but you can still move downward based on the local slope.

Suppose we have a dataset that contains 1000 rows. When we apply SGD, it will update
the model parameters 1000 times in one complete cycle of the dataset, instead of one
time as in Gradient Descent.

Batch gradient descent performs redundant computations for large datasets, as it
recomputes gradients for similar examples before each parameter update. SGD does away
with this redundancy by performing one update at a time. It is therefore usually much
faster and can also be used to learn online.
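The 1000-row example can be checked with a short sketch. Everything here is an illustrative assumption: the helper `run_epoch`, the synthetic linear-regression data, the squared-error loss, and the batch size of 50 for the mini-batch case. The only difference between the three variants is the batch size.

```python
import numpy as np


def run_epoch(X, y, w, learning_rate, batch_size, rng):
    """One pass over the data for linear regression with squared error.

    Returns the updated weights and the number of parameter updates made.
    """
    n = X.shape[0]
    order = rng.permutation(n)                     # shuffle once per epoch
    updates = 0
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        residual = X[idx] @ w - y[idx]
        grad = 2.0 * X[idx].T @ residual / len(idx)  # gradient of the mean squared error
        w = w - learning_rate * grad
        updates += 1
    return w, updates


rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                     # 1000 rows, as in the example
y = X @ np.array([1.0, -2.0, 0.5])

w0 = np.zeros(3)
_, updates_batch = run_epoch(X, y, w0, 0.01, batch_size=1000, rng=rng)  # batch GD
_, updates_sgd = run_epoch(X, y, w0, 0.01, batch_size=1, rng=rng)       # SGD
_, updates_mini = run_epoch(X, y, w0, 0.01, batch_size=50, rng=rng)     # mini-batch
print(updates_batch, updates_sgd, updates_mini)    # 1 1000 20
```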

SGD updates are quite noisy, but each update is much cheaper than in the other
variants. Because of this noise, it is also possible that SGD does not converge
to a minimum.
Mini Batch Stochastic Gradient Descent (MB-SGD)
• The MB-SGD algorithm is an extension of the SGD algorithm that
overcomes the problem of large time complexity in the SGD algorithm.

• The MB-SGD algorithm takes a batch of points (a subset of the dataset)
to compute the derivative.

• It is observed that, after some number of iterations, the derivative of the
loss function for MB-SGD is almost the same as the derivative of the loss
function for GD. However, MB-SGD needs more iterations than GD to reach a
minimum, and the cost of computation is also large.
Mini-batch gradient descent is typically the algorithm of choice when training a
neural network.

The weight update depends on the derivative of the loss for a batch of points.
The updates in MB-SGD are much noisier than in GD because the derivative does not
always point towards the minimum.
Each mini-batch is only a small sample of the total dataset, so it might not fully
represent the overall trend in the data. Different mini-batches might suggest
slightly different directions for updating the model's parameters, leading to
inconsistent (noisy) updates.
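One way to see this noise is to compare mini-batch gradients against the full-dataset gradient at a fixed point. The sketch below is illustrative only: the synthetic regression data, the helper names `mse_grad` and `mean_deviation`, and the batch sizes 4 and 256 are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)
w = np.zeros(3)                        # point at which the gradients are compared


def mse_grad(Xb, yb, w):
    """Gradient of the mean squared error on the rows (Xb, yb)."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)


full_grad = mse_grad(X, y, w)          # what batch gradient descent would use


def mean_deviation(batch_size, trials=200):
    """Average distance between a mini-batch gradient and the full gradient."""
    total = 0.0
    for _ in range(trials):
        idx = rng.choice(len(y), size=batch_size, replace=False)
        total += np.linalg.norm(mse_grad(X[idx], y[idx], w) - full_grad)
    return total / trials


noise_small = mean_deviation(batch_size=4)
noise_large = mean_deviation(batch_size=256)
print(noise_small, noise_large)
```

The smaller batch should show the larger average deviation, which is the inconsistency between mini-batches described above.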
Gradient descent optimization algorithms
• SGD with momentum
• Nesterov Accelerated Gradient (NAG)
• Adaptive Gradient (AdaGrad)
• AdaDelta
• RMSprop
• Adam
SGD with momentum

• Instead of smoothly progressing towards the minimum of the cost
function (as in Batch Gradient Descent), the path in MB-SGD tends to
zigzag or oscillate because of the noise. The steps may sometimes
overshoot or undershoot, making the convergence to the optimal
solution less smooth.

• SGD with momentum overcomes this disadvantage by denoising the gradients.
• The idea is to denoise the derivative using an exponentially
weighted average, that is, to give more weight to
recent updates than to earlier ones.

• It accelerates convergence in the relevant
direction and reduces fluctuation in irrelevant
directions.
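A minimal sketch of one common form of the momentum update follows. It assumes the velocity form v = gamma * v + learning_rate * grad followed by w = w - v, with gamma = 0.9; the slides do not give the exact equation, so the form, the function name `sgd_momentum_step`, and the toy quadratic loss are assumptions for illustration.

```python
import numpy as np


def sgd_momentum_step(w, v, grad, learning_rate=0.01, gamma=0.9):
    """One momentum update: v is an exponentially decaying sum of past gradients."""
    v = gamma * v + learning_rate * grad   # recent gradients weigh more than old ones
    w = w - v
    return w, v


# Toy loss J(w) = 0.5 * (10 * w[0]**2 + w[1]**2): steep in w[0], shallow in w[1].
def grad_J(w):
    return np.array([10.0 * w[0], w[1]])


w = np.array([1.0, 1.0])
v = np.zeros_like(w)
for _ in range(200):
    w, v = sgd_momentum_step(w, v, grad_J(w))
print(w)
```

Because v keeps a decaying memory of earlier gradients, components that keep the same sign build up speed, while components that flip sign from step to step partly cancel.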
Nesterov Accelerated Gradient (NAG)
• In this version, we first look ahead to the point the current momentum
is pointing to, and compute the gradient at that point.
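The look-ahead idea can be sketched as follows. This assumes the common formulation in which the gradient is evaluated at w - gamma * v before the velocity is updated; the function name `nag_step`, the hyperparameter values, and the toy quadratic loss are illustrative assumptions.

```python
import numpy as np


def nag_step(w, v, grad_fn, learning_rate=0.01, gamma=0.9):
    """One Nesterov update: evaluate the gradient at the look-ahead point."""
    lookahead = w - gamma * v                        # where the momentum is pointing
    v = gamma * v + learning_rate * grad_fn(lookahead)
    w = w - v
    return w, v


# Toy loss J(w) = 0.5 * (10 * w[0]**2 + w[1]**2).
def grad_J(w):
    return np.array([10.0 * w[0], w[1]])


w = np.array([1.0, 1.0])
v = np.zeros_like(w)
for _ in range(200):
    w, v = nag_step(w, v, grad_J)
print(w)
```

The only difference from plain momentum is the point at which the gradient is evaluated, so `nag_step` takes a gradient function rather than a precomputed gradient.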
Parameter Initialization Strategies
1. Initialization of weight values

Heuristics for initial scale of weights

We almost always initialize all the weights in the model to values drawn
randomly from a Gaussian or uniform distribution.
Weight Initialization for Sigmoid and Tanh
Xavier Weight Initialization / Glorot Initialization
Initialize the weights of a fully connected layer with N_in inputs and N_out
outputs by sampling each weight from Uniform(-r, r), where

$$r = \sqrt{\frac{6}{N_{in} + N_{out}}}$$
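A short sketch of Xavier/Glorot uniform sampling, assuming the usual bound r = sqrt(6 / (N_in + N_out)). The function name `xavier_uniform` and the layer sizes 256 and 128 are hypothetical, chosen only for the example.

```python
import numpy as np


def xavier_uniform(n_in, n_out, rng):
    """Sample an (n_in, n_out) weight matrix from Uniform(-r, r)."""
    r = np.sqrt(6.0 / (n_in + n_out))     # assumed Glorot uniform bound
    return rng.uniform(-r, r, size=(n_in, n_out))


rng = np.random.default_rng(0)
W = xavier_uniform(256, 128, rng)
print(W.shape, W.min(), W.max())
```

With this bound the weight variance is r^2 / 3 = 2 / (N_in + N_out), which is the quantity the scheme is designed to control.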
Weight Initialization for ReLU

He Weight Initialization / Kaiming Initialization

The main goal of He initialization is to maintain the variance of the
activations in each layer, especially when using ReLU activations,
which tend to zero out negative inputs. This helps in avoiding issues
like vanishing or exploding gradients, which can hinder the training
process in deep neural networks.
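The effect can be illustrated with a small experiment. The sketch assumes the Gaussian form of He initialization, with weights drawn from N(0, 2 / N_in); the function name `he_normal`, the width of 512, the depth of 5, and the random input batch are all assumptions for the example.

```python
import numpy as np


def he_normal(n_in, n_out, rng):
    """Draw an (n_in, n_out) weight matrix from N(0, 2 / n_in)."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))


rng = np.random.default_rng(0)
width, depth = 512, 5
x = rng.normal(size=(256, width))         # a batch of random inputs
a = x
for _ in range(depth):
    a = np.maximum(0.0, a @ he_normal(width, width, rng))   # linear layer + ReLU

# Ratio of the mean squared activation after `depth` layers to that of the input.
scale_ratio = np.mean(a ** 2) / np.mean(x ** 2)
print(scale_ratio)
```

A ratio that stays on the order of 1 means the signal is neither dying out nor blowing up as it passes through the ReLU layers.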
2. Initialization of bias
1. Zero Initialization: The most common, and often the default,
method is to initialize the bias terms to zero. This is because
biases primarily serve to shift the activation function.

2. Small Positive Value Initialization
• Positive Bias for ReLU: When using ReLU or its variants, biases can be
initialized to a small positive value (like 0.01). This is done to prevent
neurons from "dying," especially during the initial phases of training, when
many ReLU neurons might output zero due to the nature of the
activation function.
3. Learned Biases
• In modern architectures and deep learning frameworks, biases
are typically learned during training, so the initialization is less
critical compared to weights. The optimizer will adjust the bias
terms based on the loss function, regardless of their initial
values.
Annealing the learning rate
Adagrad

• The key idea of AdaGrad is to have an adaptive learning
rate for each of the weights.

• It performs smaller updates for parameters associated
with frequently occurring features, and larger updates
for parameters associated with infrequently occurring
features.
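A minimal sketch of the AdaGrad update, assuming the common form in which each parameter's step is divided by the square root of its accumulated squared gradients. The function name `adagrad_step`, the learning rate of 0.1, and the two-feature toy setup are illustrative assumptions.

```python
import numpy as np


def adagrad_step(w, g_sum, grad, learning_rate=0.1, eps=1e-8):
    """One AdaGrad update with a per-parameter sum of squared gradients."""
    g_sum = g_sum + grad ** 2
    w = w - learning_rate * grad / (np.sqrt(g_sum) + eps)
    return w, g_sum


# Feature 0 has a nonzero gradient on every step, feature 1 only on every 10th step.
w = np.zeros(2)
g_sum = np.zeros(2)
step_sizes = []
for t in range(100):
    grad = np.array([1.0, 1.0 if t % 10 == 0 else 0.0])
    w_new, g_sum = adagrad_step(w, g_sum, grad)
    step_sizes.append(np.abs(w_new - w))
    w = w_new

last_frequent = step_sizes[90][0]   # step on the frequent feature at t = 90
last_rare = step_sizes[90][1]       # step on the rare feature at t = 90
print(last_frequent, last_rare)
```

At t = 90 both features see the same gradient value, but the rarely active one still takes the larger step because its accumulated sum is smaller.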
Adadelta

• The problem with AdaGrad is that the learning rate becomes very small
after a large number of iterations, which leads to slow convergence.

• To avoid this, Adadelta adapts learning rates based on a
moving window of gradient updates, instead of
accumulating all past gradients.
Accumulated Gradients:

$$E[g^2]_t = \gamma \, E[g^2]_{t-1} + (1 - \gamma) \, g_t^2$$

E[⋅] represents the exponential moving average (EMA) of the quantity inside
the brackets. It's a way to smooth out the values over time, giving more weight
to recent values while still considering the past values.
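A sketch of a full Adadelta step, assuming the commonly described form that keeps two moving averages: one of squared gradients and one of squared updates. The function name `adadelta_step`, the decay gamma = 0.95, eps = 1e-6, and the toy loss J(w) = w^2 are assumptions for illustration.

```python
import math


def adadelta_step(w, avg_sq_grad, avg_sq_delta, grad, gamma=0.95, eps=1e-6):
    """One Adadelta update for a single scalar parameter."""
    # E[g^2]: exponential moving average of squared gradients
    avg_sq_grad = gamma * avg_sq_grad + (1.0 - gamma) * grad ** 2
    # Step scaled by the RMS of past updates over the RMS of recent gradients
    delta = -math.sqrt(avg_sq_delta + eps) / math.sqrt(avg_sq_grad + eps) * grad
    # E[delta^2]: exponential moving average of squared updates
    avg_sq_delta = gamma * avg_sq_delta + (1.0 - gamma) * delta ** 2
    return w + delta, avg_sq_grad, avg_sq_delta


w, avg_sq_grad, avg_sq_delta = 1.0, 0.0, 0.0
for _ in range(300):
    w, avg_sq_grad, avg_sq_delta = adadelta_step(
        w, avg_sq_grad, avg_sq_delta, grad=2.0 * w   # gradient of J(w) = w**2
    )
print(w)
```

Note that no learning rate appears in this form: the step size comes from the ratio of the two moving averages.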
RMSprop

A good default value for the learning rate is 0.001.
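A minimal RMSprop sketch using that learning rate. It assumes the usual form, where the gradient is divided by the root of an exponential moving average of squared gradients; the decay of 0.9, the function name `rmsprop_step`, and the toy loss J(w) = w^2 are assumptions for the example.

```python
import math


def rmsprop_step(w, avg_sq, grad, learning_rate=0.001, decay=0.9, eps=1e-8):
    """One RMSprop update for a single scalar parameter."""
    avg_sq = decay * avg_sq + (1.0 - decay) * grad ** 2   # EMA of squared gradients
    w = w - learning_rate * grad / math.sqrt(avg_sq + eps)
    return w, avg_sq


w, avg_sq = 1.0, 0.0
for _ in range(100):
    w, avg_sq = rmsprop_step(w, avg_sq, grad=2.0 * w)     # gradient of J(w) = w**2
print(w)
```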
Adam

• Adaptive Moment Estimation (Adam) is another method
that computes adaptive learning rates for each
parameter.
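A sketch of the Adam update in its commonly cited form, with bias-corrected moving averages of the gradient and the squared gradient. The hyperparameter values (learning rate 0.001, beta1 = 0.9, beta2 = 0.999, eps = 1e-8) are the widely used defaults and are assumed here; the function name `adam_step` and the toy loss J(w) = w^2 are illustrative.

```python
import math


def adam_step(w, m, v, grad, t, learning_rate=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; t is the step count from 1."""
    m = beta1 * m + (1.0 - beta1) * grad          # EMA of gradients (first moment)
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # EMA of squared gradients (second moment)
    m_hat = m / (1.0 - beta1 ** t)                # bias correction for the zero start
    v_hat = v / (1.0 - beta2 ** t)
    w = w - learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


w, m, v = 1.0, 0.0, 0.0
for t in range(1, 501):
    w, m, v = adam_step(w, m, v, grad=2.0 * w, t=t)   # gradient of J(w) = w**2
print(w)
```

The first moment plays the role of momentum and the second moment plays the role of the RMSprop scaling, so each parameter gets its own effective step size.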
