Gradient Descent in Deep Learning

What is gradient descent?

Gradient descent is an optimization algorithm commonly used to train machine learning models and neural networks. These models learn from training data over time, and the cost function within gradient descent acts as a barometer, gauging the model's accuracy with each iteration of parameter updates. Until the function is close to or equal to zero, the model will
continue to adjust its parameters to yield the smallest possible error.
Once machine learning models are optimized for accuracy, they can be powerful tools
for artificial intelligence (AI) and computer science applications.
How does gradient descent work?

Before we dive into gradient descent, it may help to review some concepts from linear
regression. You may recall the equation of a line, y = mx + b,
where m represents the slope and b is the intercept on the y-axis.

You may also recall plotting a scatterplot in statistics and finding the line of best fit, which
required calculating the error between the actual output and the predicted output (y-hat) using the
mean squared error formula. The gradient descent algorithm behaves similarly, but it is based on
a convex function.

The starting point is just an arbitrary point for us to evaluate the performance. From that starting
point, we will find the derivative (or slope), and from there, we can use a tangent line to observe
the steepness of the slope. The slope will inform the updates to the parameters—i.e. the weights
and bias. The slope at the starting point will be steeper, but as new parameters are generated, the
steepness should gradually reduce until it reaches the lowest point on the curve, known as the
point of convergence.

Similar to finding the line of best fit in linear regression, the goal of gradient descent is to
minimize the cost function, or the error between predicted and actual y. In order to do this, it
requires two factors: a direction and a learning rate. These factors determine the partial
derivative calculations of future iterations, allowing the algorithm to gradually arrive at the local or global
minimum (i.e. the point of convergence).
• Learning rate (also referred to as step size or the alpha) is the size of the steps that are
taken to reach the minimum. This is typically a small value, and it is evaluated and
updated based on the behavior of the cost function. High learning rates result in larger
steps but risk overshooting the minimum. Conversely, a low learning rate takes small
steps; while it has the advantage of more precision, the larger number of iterations
compromises overall efficiency, as it takes more time and computation to reach the
minimum.
o The learning rate in a neural network is a hyperparameter that controls how quickly a
neural network learns and adjusts its weights. It's an important factor in the training
process, as it directly affects the speed and quality of the final model.
o The learning rate (LR) determines how far the neural network weights change within the
context of optimization while minimizing the loss function. Thus, this parameter is
important to both the optimizer and the loss function.

• The cost (or loss) function measures the difference, or error, between actual y and
predicted y at its current position. This improves the machine learning model's efficacy
by providing feedback to the model so that it can adjust the parameters to minimize the
error and find the local or global minimum. It continuously iterates, moving along the
direction of steepest descent (or the negative gradient) until the cost function is close to
or at zero. At this point, the model will stop learning. Additionally, while the terms cost
function and loss function are often used synonymously, there is a slight difference
between them: a loss function refers to the error of a single training
example, while a cost function calculates the average error across the entire training set. A short sketch after this list ties the learning rate and cost function together.
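To make the ideas above concrete, here is a minimal sketch of gradient descent fitting y = mx + b by minimizing the mean squared error cost. The toy data, learning rate, and iteration count are illustrative assumptions, not values from the original discussion.

```python
# Minimal gradient descent sketch for linear regression (illustrative values).
import numpy as np

# Toy data roughly following y = 2x + 1 (an assumption for the example)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

m, b = 0.0, 0.0          # arbitrary starting point for the parameters
learning_rate = 0.01     # step size (alpha)

for step in range(1000):
    y_hat = m * x + b                 # predicted output
    error = y_hat - y
    cost = np.mean(error ** 2)        # MSE cost function
    grad_m = 2 * np.mean(error * x)   # partial derivative w.r.t. m
    grad_b = 2 * np.mean(error)       # partial derivative w.r.t. b
    m -= learning_rate * grad_m       # step in the direction of steepest descent
    b -= learning_rate * grad_b

print(f"m={m:.2f}, b={b:.2f}, cost={cost:.4f}")
```

A larger learning rate would reach the minimum in fewer steps but risks overshooting; a smaller one would need more iterations, mirroring the trade-off described above.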

Types of gradient descent

There are three types of gradient descent learning algorithms: batch gradient descent, stochastic
gradient descent and mini-batch gradient descent.
Batch gradient descent

Batch gradient descent sums the error for each point in a training set, updating the model only
after all training examples have been evaluated. This process is referred to as a training epoch.

While this batching provides computational efficiency, it can still result in long processing times for
large training datasets, as all of the data must be stored in memory. Batch gradient descent
also usually produces a stable error gradient and convergence, but sometimes that convergence
point isn’t the most ideal, finding the local minimum versus the global one.
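A hedged sketch of the batch update loop described above, assuming a generic `grad_fn(w, example)` that returns the gradient of the loss for a single example (the function name and data layout are assumptions for illustration):

```python
# Batch gradient descent: one parameter update per pass over the full dataset.
import numpy as np

def batch_gradient_descent(w, data, grad_fn, lr=0.1, epochs=100):
    for _ in range(epochs):
        # Average the gradient over every example in the training set
        grad = np.mean([grad_fn(w, example) for example in data], axis=0)
        w = w - lr * grad   # single update after all examples are evaluated
    return w
```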
Stochastic gradient descent

Stochastic gradient descent (SGD) processes one training example at a time, updating the
model's parameters after each individual example. Since only one
training example needs to be held at a time, it is easy to store in memory. While these frequent updates can offer
more detail and speed, they can result in a loss of computational efficiency when compared to
batch gradient descent. Its frequent updates can result in noisy gradients, but this can also be
helpful in escaping the local minimum and finding the global one.
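A corresponding sketch of the stochastic variant, under the same assumed `grad_fn`; here the parameters are updated after every single example, which produces the noisy gradients mentioned above:

```python
# Stochastic gradient descent: one parameter update per training example.
import random

def stochastic_gradient_descent(w, data, grad_fn, lr=0.01, epochs=10):
    for _ in range(epochs):
        random.shuffle(data)            # visit examples in a new order each epoch
        for example in data:
            grad = grad_fn(w, example)  # noisy gradient from a single example
            w = w - lr * grad           # immediate update
    return w
```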
Mini-batch gradient descent

Mini-batch gradient descent combines concepts from both batch gradient descent and stochastic
gradient descent. It splits the training dataset into small batch sizes and performs updates on each
of those batches. This approach strikes a balance between the computational efficiency
of batch gradient descent and the speed of stochastic gradient descent.
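And a sketch of the mini-batch variant, again assuming `grad_fn`; the batch size of 32 is only an illustrative default:

```python
# Mini-batch gradient descent: one parameter update per small batch.
import numpy as np

def minibatch_gradient_descent(w, data, grad_fn, lr=0.01, batch_size=32, epochs=10):
    n = len(data)
    for _ in range(epochs):
        idx = np.random.permutation(n)               # shuffle example indices
        for start in range(0, n, batch_size):
            batch = [data[i] for i in idx[start:start + batch_size]]
            grad = np.mean([grad_fn(w, ex) for ex in batch], axis=0)
            w = w - lr * grad                        # one update per mini-batch
    return w
```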
Challenges with gradient descent

While gradient descent is the most common approach for optimization problems, it does come
with its own set of challenges. Some of them include:
Local minima and saddle points

For convex problems, gradient descent can find the global minimum with ease, but for nonconvex
problems emerge, gradient descent can struggle to find the global minimum, where the model
achieves the best results.

Recall that when the slope of the cost function is at or close to zero, the model stops learning. A
few scenarios beyond the global minimum can also yield this slope, which are local minima and
saddle points. Local minima mimic the shape of a global minimum, where the slope of the cost
function increases on either side of the current point. However, with saddle points, the negative
gradient only exists on one side of the point, reaching a local maximum on one side and a local
minimum on the other. Its name is inspired by the shape of a horse's saddle.

Noisy gradients can help the optimization escape local minima and saddle points.
Vanishing and Exploding Gradients

In deeper neural networks, particularly recurrent neural networks, we can also encounter two other
problems when the model is trained with gradient descent and backpropagation.
• Vanishing gradients: This occurs when the gradient is too small. As we move
backwards during backpropagation, the gradient continues to become smaller, causing the
earlier layers in the network to learn more slowly than later layers. When this happens,
the updates to the earlier weight parameters become insignificant (effectively zero), resulting in an
algorithm that is no longer learning.
• Exploding gradients: This happens when the gradient is too large, creating an unstable
model. In this case, the model weights will grow too large, and they will eventually be
represented as NaN. One solution to this issue is to leverage a dimensionality reduction
technique, which can help to minimize complexity within the model.
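As a rough numeric illustration (an assumption added here, not a claim from the article), the gradient reaching the earliest layers behaves roughly like a product of per-layer derivative factors, so it either shrinks toward zero or blows up as the network gets deeper:

```python
# Why gradients vanish or explode: repeated multiplication of per-layer factors.
layers = 30

vanishing = 1.0
exploding = 1.0
for _ in range(layers):
    vanishing *= 0.5   # each layer contributes a small derivative factor
    exploding *= 2.0   # each layer contributes a large derivative factor

print(vanishing)  # ~9.3e-10: early layers receive almost no learning signal
print(exploding)  # ~1.1e+09: weights grow until they become unstable (NaN)
```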

Introduction
Machine learning is a sub-field of Artificial Intelligence that is changing the world. It affects
almost every aspect of our life. Machine learning algorithms differ from traditional algorithms
because no human interaction is needed to improve the model. It learns from data and improves
accordingly. There are many algorithms used to make this self-learning possible. We will
explore one of these algorithms, called Nesterov Accelerated Gradient (NAG).

Gradient descent
It is essential to understand Gradient descent before we look at Nesterov Accelerated Algorithm.
Gradient descent is an optimization algorithm that is used to train our model. The accuracy of a
machine learning model is determined by the cost function. The lower the cost, the better
our machine learning model is performing. Optimization algorithms are used to reach the
minimum point of our cost function. Gradient descent is the most common optimization
algorithm. It takes parameters at the start and then changes them iteratively to reach the
minimum point of our cost function.

Starting from some initial weight, we are positioned at some point on our cost function. Gradient descent then tweaks the weight in each iteration, and
we move towards the minimum of our cost function accordingly.
The size of our steps depends on the learning rate of our model. The higher the learning rate,
the higher the step size. Choosing the correct learning rate for our model is very important as it
can cause problems while training.
A low learning rate makes it likely that we will reach the minimum point, but it takes many iterations to train,
while a very high learning rate can cause us to jump past the minimum point, a problem commonly
known as overshooting.
Drawbacks of gradient descent
The main drawback of gradient descent is that it depends on the learning rate and the gradient of
that particular step only. The gradient at a plateau or saddle point of our
function will be close to zero, so the step size becomes very small or even zero. Thus, the update
of our parameters is very slow on a gentle slope.
Let us look at an example. Imagine a cost curve with points A, B, C and D, where the starting point of our model is ‘A’. The loss function will decrease
rapidly on the path from A to B because of the higher gradient. But as the gradient decreases from B to C,
learning becomes negligible. The gradient at point ‘C’ is zero; it is a saddle point of our
function. Even after many iterations, we will be stuck at ‘C’ and will not reach the desired
minimum ‘D’.

This problem is solved by using momentum in our gradient descent.


Gradient descent with momentum
The issue discussed above can be solved by including the previous gradients in our calculation.
The intuition behind this is if we are repeatedly asked to go in a particular direction, we can take
bigger steps towards that direction.
The weighted average of all the previous gradients is added to our equation, and it acts as
momentum to our step.
With momentum, as we start to
descend, the momentum increases, and even at gentle slopes where the gradient is minimal, the
actual movement is large due to the added momentum.
But this added momentum causes a different type of problem. We actually cross the minimum
point and have to take a U-turn to get to the minimum point. Momentum-based gradient descent
oscillates around the minimum point, and we have to take a lot of U-turns to reach the desired
point. Despite these oscillations, momentum-based gradient descent is faster than conventional
gradient descent.
To reduce these oscillations, we can use Nesterov Accelerated Gradient.
Nesterov Accelerated Gradient (NAG)
NAG resolves this problem by adding a look ahead term in our equation. The intuition behind
NAG can be summarized as ‘look before you leap’. Let us try to understand this through an
example.
In momentum-based gradient descent, the steps become larger and larger due to the
accumulated momentum, until we overshoot, say at the 4th step. We then have to take steps in the
opposite direction to reach the minimum point.
However, the update in NAG happens in two steps. First, a partial step to reach the look-ahead
point, and then the final update. We calculate the gradient at the look-ahead point and then use it
to calculate the final update. If the gradient at the look-ahead point is negative, our final update
will be smaller than that of a regular momentum-based gradient. Like in the above example, the
updates of NAG are similar to that of the momentum-based gradient for the first three steps
because the gradient at that point and the look-ahead point are positive. But at step 4, the
gradient of the look-ahead point is negative.
In NAG, the first partial update 4a will be used to go to the look-ahead point and then the
gradient will be calculated at that point without updating the parameters. Since the gradient at
step 4b is negative, the overall update will be smaller than the momentum-based gradient
descent.
We can see in the above example that the momentum-based gradient descent takes six steps to
reach the minimum point, while NAG takes only five steps.
This looking ahead helps NAG to converge to the minimum points in fewer steps and reduce the
chances of overshooting.
How does NAG actually work?
We saw how NAG solves the problem of overshooting by ‘looking ahead’. Let us see how this is
calculated and the actual math behind it.
Update rule for gradient descent:
w_{t+1} = w_t − η · ∇w_t
In this equation, the weight w is updated in each iteration, η is the learning rate, and ∇w_t is the
gradient.
Update rule for momentum-based gradient descent:
Here, momentum is added to the conventional gradient descent equation. The update equation
is
w_{t+1} = w_t − update_t
where update_t is calculated as
update_t = γ · update_{t−1} + η · ∇w_t
This is how the gradients of all the previous updates are folded into the current update.
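A minimal sketch of this momentum update, with illustrative values for γ and η and an assumed `grad_fn(w)` that returns the gradient at w:

```python
# One step of momentum-based gradient descent (illustrative hyperparameters).
def momentum_step(w, update_prev, grad_fn, gamma=0.9, eta=0.01):
    update = gamma * update_prev + eta * grad_fn(w)   # update_t
    w_next = w - update                               # w_{t+1} = w_t - update_t
    return w_next, update
```

In a training loop, `update_prev` starts at zero and is carried from one iteration to the next, which is how past gradients accumulate into momentum.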
Update rule for NAG:
w_{t+1} = w_t − update_t
While calculating update_t, we include the look-ahead gradient ∇w_{look_ahead}:
update_t = γ · update_{t−1} + η · ∇w_{look_ahead}
where the look-ahead point is computed as
w_{look_ahead} = w_t − γ · update_{t−1}
This look-ahead gradient is used in our update and prevents overshooting.
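And a corresponding sketch of the NAG update, where the gradient is evaluated at the look-ahead point before the final step (same assumptions as the momentum sketch above):

```python
# One step of Nesterov Accelerated Gradient (illustrative hyperparameters).
def nag_step(w, update_prev, grad_fn, gamma=0.9, eta=0.01):
    w_look_ahead = w - gamma * update_prev                       # partial (look-ahead) step
    update = gamma * update_prev + eta * grad_fn(w_look_ahead)   # gradient at look-ahead point
    w_next = w - update                                          # final update
    return w_next, update
```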

Mini Batch Gradient Descent Deep Learning Optimizer


In this variant of gradient descent, instead of taking all the training data, only a subset of the
dataset is used for calculating the loss function. Since we are using a batch of data instead of
taking the whole dataset, fewer iterations are needed. That is why the mini-batch gradient
descent algorithm typically converges faster than both the stochastic gradient descent and batch gradient descent
algorithms. This algorithm is more efficient and robust than the earlier variants of gradient
descent. As the algorithm uses batching, all the training data need not be loaded in the memory,
thus making the process more efficient to implement. Moreover, the cost function in mini-batch
gradient descent is noisier than the batch gradient descent algorithm but smoother than that of the
stochastic gradient descent algorithm. Because of this, mini-batch gradient descent is ideal and
provides a good balance between speed and accuracy.
Despite all that, the mini-batch gradient descent algorithm has some downsides too. It introduces a
hyperparameter, the mini-batch size, which needs to be tuned to achieve the required
accuracy, although a batch size of 32 is considered appropriate for most cases.
Also, in some cases it results in poor final accuracy, which motivates the search for
other alternatives.

Adagrad (Adaptive Gradient Descent) Deep Learning Optimizer


The adaptive gradient descent algorithm is slightly different from other gradient descent
algorithms, because it uses a different learning rate for each iteration and each parameter. The change in
learning rate depends on how much the parameters have changed during training: the more a
parameter has been updated, the smaller its learning rate becomes. This modification is highly
beneficial because real-world datasets contain sparse as well as dense features, so it is unfair to
use the same learning rate for all the features. The Adagrad algorithm updates the weights with the
formula below, where α_t denotes the learning rate at iteration t, η is a constant, and ε is a small positive value to avoid division by zero:
w_{t+1} = w_t − α_t · g_t,   with   α_t = η / sqrt(Σ_{i=1..t} g_i² + ε)
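A minimal sketch of one Adagrad step under these definitions; the values of η and ε are illustrative, and G is the running accumulator of squared gradients:

```python
# One Adagrad step: per-parameter learning rate shrinks as squared gradients accumulate.
import numpy as np

def adagrad_step(w, G, grad, eta=0.01, eps=1e-8):
    G = G + grad ** 2                         # accumulate squared gradients per parameter
    w = w - (eta / np.sqrt(G + eps)) * grad   # adaptive, per-parameter step
    return w, G
```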

The benefit of using Adagrad is that it eliminates the need to modify the learning rate manually. It
is more reliable than gradient descent algorithms and their variants, and it reaches convergence at
a higher speed.
One downside of the AdaGrad optimizer is that it decreases the learning rate aggressively and
monotonically. There might be a point when the learning rate becomes extremely small. This is
because the squared gradients in the denominator keep accumulating, and thus the denominator
part keeps on increasing. Due to small learning rates, the model eventually becomes unable to
acquire more knowledge, and hence the accuracy of the model is compromised.

RMS Prop (Root Mean Square Propagation) Deep Learning Optimizer


RMS prop is one of the popular optimizers among deep learning enthusiasts. This may be
because it was never formally published but is still very well known in the community. RMS prop is
essentially an extension of RPPROP (resilient propagation). It resolves the problem of varying gradients: some gradients are small while others may be huge, so
defining a single learning rate might not be the best idea. RPPROP uses the sign of the gradient,
adapting the step size individually for each weight. In this algorithm, the current and previous gradients are first
compared for sign. If they have the same sign, we are going in the right direction, so we increase the
step size by a small fraction. If they have opposite signs, we decrease the step size. The step size
is then bounded, and the weight update is applied.
The problem with RPPROP is that it doesn’t work well with large datasets and when we want to
perform mini-batch updates. So, achieving the robustness of RPPROP and the efficiency of mini-
batches simultaneously was the main motivation behind the rise of RMS prop. RMS prop is an
advancement over the AdaGrad optimizer, as it avoids AdaGrad's monotonically decreasing learning rate.

RMS Prop Formula


The algorithm mainly focuses on accelerating the optimization process by decreasing the number
of function evaluations to reach the local minimum. The algorithm keeps the moving average of
squared gradients for every weight and divides the gradient by the square root of the mean
square.

The moving average is computed as E[g²]_t = γ · E[g²]_{t−1} + (1 − γ) · g_t², where γ (gamma) is the forgetting factor. Weights are then updated by w_{t+1} = w_t − (η / sqrt(E[g²]_t + ε)) · g_t.
In simpler terms, if there exists a parameter due to which the cost function oscillates a lot, we
want to penalize the update of this parameter. Suppose you built a model to classify a variety of
fishes. The model relies on the factor ‘color’ mainly to differentiate between the fishes. Due to
this, it makes a lot of errors. What RMS Prop does is penalize the updates of the parameter ‘color’ so that the model can
rely on other features too. This prevents the algorithm from adapting too quickly to changes in
the parameter ‘color’ compared to other parameters. This algorithm has several benefits as
compared to earlier versions of gradient descent algorithms. The algorithm converges quickly
and requires less tuning than gradient descent algorithms and their variants.
The problem with RMS Prop is that the learning rate has to be defined manually, and the
suggested value doesn’t work for every application.
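A minimal sketch of the RMS Prop update written out above; γ, η and ε are illustrative defaults, and `avg_sq_grad` is the moving (leaky) average of squared gradients:

```python
# One RMSProp step: leaky average of squared gradients replaces Adagrad's growing sum.
import numpy as np

def rmsprop_step(w, avg_sq_grad, grad, gamma=0.9, eta=0.001, eps=1e-8):
    avg_sq_grad = gamma * avg_sq_grad + (1 - gamma) * grad ** 2   # E[g^2]_t
    w = w - (eta / np.sqrt(avg_sq_grad + eps)) * grad             # scaled update
    return w, avg_sq_grad
```

Because the average is leaky rather than an ever-growing sum, the effective learning rate no longer shrinks monotonically as it does in Adagrad.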

AdaDelta Deep Learning Optimizer


AdaDelta can be seen as a more robust version of the AdaGrad optimizer. It is based upon
adaptive learning and is designed to deal with significant drawbacks of AdaGrad and RMS prop
optimizer. The main problem with the above two optimizers is that the initial learning rate must
be defined manually. One other problem is the decaying learning rate which becomes
infinitesimally small at some point. Due to this, a certain number of iterations later, the model
can no longer learn new knowledge.
To deal with these problems, AdaDelta uses two state variables: a leaky average of the second
moment of the gradient and a leaky average of the second moment of the change in the model's
parameters. The updates can be written as
s_t = ρ · s_{t−1} + (1 − ρ) · g_t²
g'_t = sqrt((Δx_{t−1} + ε) / (s_t + ε)) · g_t
x_t = x_{t−1} − g'_t
Δx_t = ρ · Δx_{t−1} + (1 − ρ) · (g'_t)²
Here s_t and Δx_t denote the state variables, g'_t denotes the rescaled gradient, Δx_{t−1} denotes the leaky average of the squared rescaled gradients, and ε represents a small positive constant to handle division by zero.
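A minimal sketch of one AdaDelta step following these equations; ρ and ε are illustrative values, and note that there is no global learning rate:

```python
# One AdaDelta step: two leaky-average state variables, no global learning rate.
import numpy as np

def adadelta_step(w, s, delta, grad, rho=0.9, eps=1e-6):
    s = rho * s + (1 - rho) * grad ** 2                      # second moment of gradients
    g_rescaled = np.sqrt((delta + eps) / (s + eps)) * grad   # rescaled gradient g'_t
    w = w - g_rescaled                                       # parameter update
    delta = rho * delta + (1 - rho) * g_rescaled ** 2        # second moment of updates
    return w, s, delta
```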

Adam Optimizer in Deep Learning


Adam optimizer, short for Adaptive Moment Estimation optimizer, is an optimization algorithm
commonly used in deep learning. It is an extension of the stochastic gradient descent (SGD)
algorithm and is designed to update the weights of a neural network during training.
The name “Adam” is derived from “adaptive moment estimation,” highlighting its ability to
adaptively adjust the learning rate for each network weight individually. Unlike SGD, which
maintains a single learning rate throughout training, Adam optimizer dynamically computes
individual learning rates based on the past gradients and their second moments.
The creators of Adam optimizer incorporated the beneficial features of other optimization
algorithms such as AdaGrad and RMSProp. Similar to RMSProp, Adam optimizer considers the
second moment of the gradients, but unlike RMSProp, it calculates the uncentered variance of
the gradients (without subtracting the mean).
By incorporating both the first moment (mean) and second moment (uncentered variance) of the
gradients, Adam optimizer achieves an adaptive learning rate that can efficiently navigate the
optimization landscape during training. This adaptivity helps in faster convergence and improved
performance of the neural network.
In summary, Adam optimizer is an optimization algorithm that extends SGD by dynamically
adjusting learning rates based on individual weights. It combines the features of AdaGrad and
RMSProp to provide efficient and adaptive updates to the network weights during deep learning
training.
Adam Optimizer Formula
The Adam optimizer has several benefits, due to which it is used widely. It is adopted as a
benchmark in deep learning papers and recommended as a default optimization algorithm.
Moreover, the algorithm is straightforward to implement, has a fast running time, low memory
requirements, and requires less tuning than many other optimization algorithms.

The working of the Adam optimizer can be written as
m_t = β1 · m_{t−1} + (1 − β1) · g_t
v_t = β2 · v_{t−1} + (1 − β2) · g_t²
m̂_t = m_t / (1 − β1^t),   v̂_t = v_t / (1 − β2^t)
w_{t+1} = w_t − η · m̂_t / (sqrt(v̂_t) + ε)
Here β1 and β2 represent the decay rates of the moving averages of the gradient and the squared gradient, respectively.
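A minimal sketch of one Adam step following these equations; β1, β2, η and ε follow the commonly used defaults and are illustrative:

```python
# One Adam step: bias-corrected first and second moment estimates.
import numpy as np

def adam_step(w, m, v, grad, t, beta1=0.9, beta2=0.999, eta=0.001, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)              # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```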
If the adam optimizer uses the good properties of all the algorithms and is the best available
optimizer, then why shouldn’t you use Adam in every application? And what was the need to
learn about other algorithms in depth? This is because even Adam has some downsides. It tends
to prioritize fast convergence, whereas algorithms like stochastic gradient descent often
generalize better, at the cost of slower training. So the optimization algorithm should be picked according to
the requirements and the type of data.
