Gradient Descent
Before we dive into gradient descent, it may help to review some concepts from linear regression. You may recall the formula for a line, y = mx + b, where m represents the slope and b is the intercept on the y-axis.
You may also recall plotting a scatterplot in statistics and finding the line of best fit, which
required calculating the error between the actual output and the predicted output (y-hat) using the
mean squared error formula. The gradient descent algorithm behaves similarly, but it is based on
a convex function.
The starting point is just an arbitrary point for us to evaluate the performance. From that starting
point, we will find the derivative (or slope), and from there, we can use a tangent line to observe
the steepness of the slope. The slope will inform the updates to the parameters—i.e. the weights
and bias. The slope at the starting point will be steeper, but as new parameters are generated, the
steepness should gradually reduce until it reaches the lowest point on the curve, known as the
point of convergence.
Similar to finding the line of best fit in linear regression, the goal of gradient descent is to minimize the cost function, or the error between predicted and actual y. In order to do this, it requires two pieces of information: a direction and a learning rate. These determine the partial derivative calculations of future iterations, allowing the algorithm to gradually arrive at the local or global minimum (i.e. the point of convergence).
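In its standard form, these two pieces combine into a single update rule, writing w for the parameters, α for the learning rate, and ∇J(w) for the gradient of the cost function:
w_{t+1} = w_t − α·∇J(w_t)
That is, each iteration takes a step whose size is controlled by α in the direction of the negative gradient, the direction of steepest descent.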
Learning rate (also referred to as step size or alpha) is the size of the steps that are taken to reach the minimum. This is typically a small value, and it is evaluated and updated based on the behavior of the cost function. High learning rates result in larger steps but risk overshooting the minimum. Conversely, a low learning rate takes small steps. While it has the advantage of more precision, the larger number of iterations compromises overall efficiency, as it takes more time and computation to reach the minimum.
In a neural network, the learning rate is a hyperparameter that controls how quickly the network learns and adjusts its weights. It is an important factor in the training process, as it directly affects the speed and quality of the final model.
The learning rate (LR) determines how far the neural network weights change within the context of optimization while minimizing the loss function, so this parameter matters to both the optimizer and the loss function.
The cost (or loss) function measures the difference, or error, between actual y and
predicted y at its current position. This improves the machine learning model's efficacy
by providing feedback to the model so that it can adjust the parameters to minimize the
error and find the local or global minimum. It continuously iterates, moving along the
direction of steepest descent (or the negative gradient) until the cost function is close to
or at zero. At this point, the model will stop learning. Additionally, while the terms cost function and loss function are often used interchangeably, there is a slight difference between them: a loss function refers to the error of one training example, while a cost function calculates the average error across an entire training set.
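To make this concrete, below is a minimal sketch (plain NumPy, with hypothetical synthetic data) of gradient descent fitting the slope m and intercept b of a line by minimizing the mean squared error; the learning rate and iteration count are illustrative choices rather than prescribed values.

```python
import numpy as np

# Hypothetical data roughly following y = 2x + 1 with some noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=50)

m, b = 0.0, 0.0          # arbitrary starting point
learning_rate = 0.01     # step size (alpha)

for _ in range(1000):
    y_hat = m * x + b                  # predicted output
    error = y_hat - y                  # difference from the actual output
    # Partial derivatives of the mean squared error with respect to m and b
    grad_m = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    # Step in the direction of the negative gradient
    m -= learning_rate * grad_m
    b -= learning_rate * grad_b

print(m, b)  # should move toward the slope and intercept underlying the data
```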
There are three types of gradient descent learning algorithms: batch gradient descent, stochastic
gradient descent and mini-batch gradient descent.
Batch gradient descent
Batch gradient descent sums the error for each point in a training set, updating the model only
after all training examples have been evaluated. This process is referred to as a training epoch.
While this batching provides computational efficiency, it can still have a long processing time for large training datasets, as it needs to hold all of the data in memory. Batch gradient descent also usually produces a stable error gradient and convergence, but sometimes that convergence point isn't the most ideal, finding a local minimum rather than the global one.
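As a rough sketch of a single training epoch of batch gradient descent, reusing the hypothetical linear-regression setup (x, y, m, b) from the earlier example: the error of every point is accumulated before one parameter update is made.

```python
def batch_gd_epoch(x, y, m, b, learning_rate):
    """One training epoch: evaluate all examples, then update once."""
    grad_m, grad_b = 0.0, 0.0
    n = len(x)
    for xi, yi in zip(x, y):            # sum the error contribution of every point
        error = (m * xi + b) - yi
        grad_m += 2 * error * xi / n
        grad_b += 2 * error / n
    # Parameters change only after the whole training set has been seen
    m -= learning_rate * grad_m
    b -= learning_rate * grad_b
    return m, b

# Example usage: m, b = batch_gd_epoch(x, y, m, b, learning_rate=0.01)
```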
Stochastic gradient descent
Stochastic gradient descent (SGD) runs a training epoch for each example within the dataset, updating the parameters one training example at a time. Since only one training example needs to be held at a time, it is easier to store in memory. While these frequent updates can offer more detail and speed, they can result in losses in computational efficiency when compared to batch gradient descent. The frequent updates also produce noisy gradients, but this can be helpful in escaping a local minimum and finding the global one.
Mini-batch gradient descent
Mini-batch gradient descent combines concepts from both batch gradient descent and stochastic
gradient descent. It splits the training dataset into small batch sizes and performs updates on each
of those batches. This approach strikes a balance between the computational efficiency
of batch gradient descent and the speed of stochastic gradient descent.
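Below is a sketch of one epoch of mini-batch gradient descent on the same hypothetical linear-regression setup; a batch size of 1 would reduce it to stochastic gradient descent, while a batch size equal to the full dataset would reduce it to batch gradient descent.

```python
import numpy as np

def minibatch_gd_epoch(x, y, m, b, learning_rate, batch_size=8):
    """One epoch of mini-batch gradient descent on y ≈ m*x + b."""
    indices = np.random.permutation(len(x))       # shuffle the training set
    for start in range(0, len(x), batch_size):
        batch = indices[start:start + batch_size]
        xb, yb = x[batch], y[batch]
        error = (m * xb + b) - yb
        # Update after each small batch rather than after the full set
        m -= learning_rate * 2 * np.mean(error * xb)
        b -= learning_rate * 2 * np.mean(error)
    return m, b
```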
Challenges with gradient descent
While gradient descent is the most common approach for optimization problems, it does come
with its own set of challenges. Some of them include:
Local minima and saddle points
For convex problems, gradient descent can find the global minimum with ease, but for nonconvex problems, gradient descent can struggle to find the global minimum, where the model achieves the best results.
Recall that when the slope of the cost function is at or close to zero, the model stops learning. A
few scenarios beyond the global minimum can also yield this slope, which are local minima and
saddle points. Local minima mimic the shape of a global minimum, where the slope of the cost
function increases on either side of the current point. However, with saddle points, the negative
gradient only exists on one side of the point, reaching a local maximum on one side and a local
minimum on the other. Its name is inspired by that of a horse's saddle.
Noisy gradients can help gradient descent escape local minima and saddle points.
Vanishing and Exploding Gradients
In deeper neural networks, particularly recurrent neural networks, we can also encounter two other
problems when the model is trained with gradient descent and backpropagation.
Vanishing gradients: This occurs when the gradient is too small. As we move
backwards during backpropagation, the gradient continues to become smaller, causing the
earlier layers in the network to learn more slowly than later layers. When this happens,
the weight parameters update until they become insignificant—i.e. 0—resulting in an
algorithm that is no longer learning.
Exploding gradients: This happens when the gradient is too large, creating an unstable
model. In this case, the model weights will grow too large, and they will eventually be
represented as NaN. One solution to this issue is to leverage a dimensionality reduction
technique, which can help to minimize complexity within the model.
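As a toy numeric illustration (not a real network): the gradient reaching an early layer during backpropagation is roughly a product of one local derivative per layer, so consistently small factors shrink it toward zero while consistently large factors blow it up.

```python
# Toy illustration: a gradient passed backwards through many layers is
# (roughly) multiplied by one local derivative per layer.
def backprop_gradient(local_derivative, num_layers, initial_grad=1.0):
    grad = initial_grad
    for _ in range(num_layers):
        grad *= local_derivative
    return grad

print(backprop_gradient(0.5, 50))   # ~8.9e-16: a vanishing gradient
print(backprop_gradient(2.0, 50))   # ~1.1e+15: an exploding gradient
```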
Introduction
Machine learning is a sub-field of Artificial Intelligence that is changing the world. It affects almost every aspect of our lives. Machine learning algorithms differ from traditional algorithms because no human interaction is needed to improve the model; it learns from the data and improves accordingly. There are many algorithms used to make this self-learning possible. We will explore one of them, called Nesterov Accelerated Gradient (NAG).
Gradient descent
It is essential to understand gradient descent before we look at the Nesterov Accelerated Gradient algorithm.
Gradient descent is an optimization algorithm that is used to train our model. The accuracy of a
machine learning model is determined by the cost function. The lower the cost, the better
our machine learning model is performing. Optimization algorithms are used to reach the
minimum point of our cost function. Gradient descent is the most common optimization
algorithm. It takes parameters at the start and then changes them iteratively to reach the
minimum point of our cost function.
Starting from some initial weight, we are positioned at some point on our cost function. Gradient descent then tweaks the weight in each iteration, and we move towards the minimum of our cost function accordingly.
The size of our steps depends on the learning rate of our model. The higher the learning rate, the larger the step size. Choosing the correct learning rate for our model is very important, as an unsuitable value can cause problems while training.
A low learning rate assures that we reach the minimum point, but it takes a lot of iterations to train, while a very high learning rate can cause us to step past the minimum point, a problem commonly known as overshooting.
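A small sketch on the toy cost function J(w) = w², whose minimum is at w = 0, illustrates both behaviors; the learning rates below are arbitrary illustrative values.

```python
def gradient_descent_on_square(learning_rate, steps=20, w=5.0):
    """Minimize J(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w

print(gradient_descent_on_square(0.01))  # low rate: slow but steady progress
print(gradient_descent_on_square(0.4))   # moderate rate: essentially at the minimum
print(gradient_descent_on_square(1.1))   # too high: overshoots and diverges
```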
Drawbacks of gradient descent
The main drawback of gradient descent is that the update depends only on the learning rate and the gradient of that particular step. The gradient at a plateau, such as at a saddle point of our function, will be close to zero. The step size then becomes very small or even zero, so the update of our parameters is very slow on a gentle slope.
Let us look at an example. Imagine a loss curve with points A, B, C, and D along it, where 'A' is the starting point of our model. The loss function will decrease rapidly along the path from A to B because of the higher gradient. But as the gradient decreases from B to C, the learning is negligible. The gradient at point 'C' is zero; it is the saddle point of our function. Even after many iterations, we will be stuck at 'C' and will not reach the desired minimum 'D'.
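The same effect can be sketched on the toy function f(x) = x³, which has a saddle point at x = 0: the gradient 3x² flattens out near zero, so gradient descent stalls there even though lower values of f lie beyond it.

```python
def gradient_descent_on_cubic(steps, x=2.0, learning_rate=0.01):
    """Minimize f(x) = x**3, whose gradient 3*x**2 vanishes at the saddle x = 0."""
    for _ in range(steps):
        x -= learning_rate * 3 * x ** 2
    return x

print(gradient_descent_on_cubic(100))     # ~0.29: progress slows as the slope flattens
print(gradient_descent_on_cubic(10000))   # ~0.003: effectively stuck at the saddle,
                                          # even though f keeps decreasing for x < 0
```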
The benefit of using Adagrad is that it abolishes the need to modify the learning rate manually. It
is more reliable than gradient descent algorithms and their variants, and it reaches convergence at
a higher speed.
One downside of the AdaGrad optimizer is that it decreases the learning rate aggressively and
monotonically. There might be a point when the learning rate becomes extremely small. This is
because the squared gradients in the denominator keep accumulating, and thus the denominator
part keeps on increasing. Due to small learning rates, the model eventually becomes unable to
acquire more knowledge, and hence the accuracy of the model is compromised.
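As a sketch, the AdaGrad update for a single parameter can be written as follows; the accumulating sum of squared gradients in the denominator is exactly what makes the effective learning rate shrink monotonically.

```python
import math

def adagrad_step(w, grad, accumulated_sq_grad, learning_rate=0.01, eps=1e-8):
    """One AdaGrad update for a single parameter."""
    accumulated_sq_grad += grad ** 2                    # this sum only ever grows
    effective_lr = learning_rate / (math.sqrt(accumulated_sq_grad) + eps)
    w -= effective_lr * grad                            # so the step keeps shrinking
    return w, accumulated_sq_grad
```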
RMS Prop addresses this shrinking learning rate by replacing the ever-growing sum with an exponentially decaying average of the squared gradients,
E[g²]_t = γ·E[g²]_{t−1} + (1 − γ)·g_t²,
where gamma (γ) is the forgetting factor. Weights are then updated by the formula below:
w_{t+1} = w_t − η·g_t / (√(E[g²]_t) + ε).
In simpler terms, if there exists a parameter due to which the cost function oscillates a lot, we want to penalize the update of this parameter. Suppose you built a model to classify a variety of fish. The model relies mainly on the feature 'color' to differentiate between the fish. Due to this, it makes a lot of errors. What RMS Prop does is penalize the parameter 'color' so that the model can rely on other features too. This prevents the algorithm from adapting too quickly to changes in the parameter 'color' compared to other parameters. This algorithm has several benefits compared to earlier versions of gradient descent algorithms. It converges quickly and requires less tuning than gradient descent algorithms and their variants.
The problem with RMS Prop is that the learning rate has to be defined manually, and the
suggested value doesn’t work for every application.
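A sketch of the corresponding RMS Prop update for a single parameter, following the formulas above; note that the base learning rate (an illustrative 0.001 here) still has to be set by hand, which is the drawback just mentioned.

```python
import math

def rmsprop_step(w, grad, avg_sq_grad, learning_rate=0.001, gamma=0.9, eps=1e-8):
    """One RMS Prop update: a decaying average of squared gradients."""
    avg_sq_grad = gamma * avg_sq_grad + (1 - gamma) * grad ** 2
    w -= learning_rate * grad / (math.sqrt(avg_sq_grad) + eps)
    return w, avg_sq_grad
```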
The Adam optimizer combines a moving average of the gradients with a moving average of the squared gradients when computing each update. Here B1 and B2 represent the decay rates of these averages of the gradients and the squared gradients.
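A sketch of the Adam update for a single parameter, combining a moving average of the gradient (decay rate B1) with an RMS Prop-style average of the squared gradient (decay rate B2), including the usual bias correction; the default values shown are common illustrative choices.

```python
import math

def adam_step(w, grad, m, v, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter at time step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad            # moving average of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2       # moving average of the squared gradient
    m_hat = m / (1 - beta1 ** t)                  # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    w -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v
```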
If the Adam optimizer uses the good properties of all these algorithms and is the best available optimizer, then why shouldn't you use Adam in every application? And what was the need to learn about the other algorithms in depth? This is because even Adam has some downsides. It tends to focus on faster computation time, whereas algorithms like stochastic gradient descent focus on the data points. That is why algorithms like SGD generalize the data in a better manner, at the cost of lower computation speed. So, the optimization algorithm can be picked depending on the requirements and the type of data.
Visualizations of these optimizers side by side create a better picture in the mind and help in comparing the results of the various optimization algorithms.