
Gradient Descent

What is a Cost Function?

A cost function measures the performance of a model for any given data. It quantifies the error between predicted values and expected values and presents it as a single real number.

After making a hypothesis with initial parameters, we calculate the cost function, and then, with the goal of reducing it, we modify the parameters using the gradient descent algorithm over the given data. For a hypothesis h_theta with m training examples, a commonly used cost function is the mean squared error:

J(theta) = (1 / 2m) * sum over i of (h_theta(x_i) - y_i)^2

where h_theta(x_i) is the predicted value and y_i is the expected value for the i-th example.
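As a concrete sketch of the mean squared error cost above (the function name and toy data are illustrative assumptions, not part of the original):

```python
import numpy as np

def mse_cost(theta, X, y):
    """Mean squared error cost: J(theta) = (1/2m) * sum((X @ theta - y)**2)."""
    m = len(y)                     # number of training examples
    errors = X @ theta - y         # predicted minus expected values
    return (1.0 / (2 * m)) * np.sum(errors ** 2)

# Toy data generated from y = 2x, so theta = [0, 2] should give zero cost.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # first column is the bias term
y = np.array([2.0, 4.0, 6.0])
print(mse_cost(np.array([0.0, 2.0]), X, y))  # 0.0 (perfect fit)
print(mse_cost(np.array([0.0, 0.0]), X, y))  # positive cost for a bad fit
```

A lower value of J means the hypothesis fits the data better, which is why gradient descent tries to push this number down.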

What is Gradient Descent?


Gradient Descent is one of the most commonly used optimization algorithms for training machine learning models: it minimizes the error between actual and predicted results. It is also widely used to train neural networks.

In mathematical terminology, optimization refers to the task of minimizing or maximizing an objective function f(x) parameterized by x. Similarly, in machine learning, optimization is the task of minimizing the cost function parameterized by the model's parameters. The main objective of gradient descent is to minimize the (ideally convex) cost function through iterative parameter updates. Once optimized, these models can be used as powerful tools for artificial intelligence and various computer science applications.

In this tutorial on Gradient Descent in Machine Learning, we will learn about gradient descent in detail: the role of cost functions as a barometer within machine learning, the types of gradient descent, learning rates, and more.

Example of Gradient Descent

Let’s say you are playing a game where the players are at the top of a mountain and are asked to reach its lowest point. Additionally, they are blindfolded. What approach do you think would get you to the bottom?

Take a moment to think about this before you read on.

The best way is to feel the ground and find where the land descends. From that position, take a step in the descending direction, and repeat this process until you reach the lowest point.

Finding the lowest point in a hilly landscape. (Source: Fisseha Berhane)

Gradient descent is an iterative optimization algorithm for finding the local minimum of a
function.

To find a local minimum of a function using gradient descent, we take steps proportional to the negative of the gradient (i.e., we move in the direction opposite to the gradient) of the function at the current point. If we instead take steps proportional to the positive of the gradient (moving in the direction of the gradient), we approach a local maximum of the function; that procedure is called Gradient Ascent.

Gradient descent was originally proposed by Cauchy in 1847 and is also known as steepest descent. The main objective of the gradient descent algorithm is to minimize the cost function through iteration. To achieve this goal, it performs two steps iteratively:

o Calculate the first-order derivative of the function to compute the gradient (slope) at the current point.
o Move in the direction opposite to the gradient, stepping away from the current point by alpha times the gradient, where alpha is the learning rate: a tuning parameter in the optimization process that decides the length of the steps.
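The two steps above can be sketched on a toy one-dimensional objective; f(x) = (x - 3)**2 is a hypothetical function chosen only for illustration:

```python
def gradient(x):
    # Step 1: first-order derivative of f(x) = (x - 3)**2, whose minimum is at x = 3.
    return 2 * (x - 3)

alpha = 0.1   # learning rate: decides the length of each step
x = 0.0       # arbitrary starting point

for _ in range(100):
    x = x - alpha * gradient(x)   # step 2: move opposite to the gradient

print(round(x, 4))  # converges close to the minimum at x = 3
```

Each iteration shrinks the distance to the minimum by a constant factor, which is why the iterate settles at x = 3.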

How does Gradient Descent work?

Before examining the working principle of gradient descent, we should recall how the slope of a line is described in linear regression. The equation for simple linear regression is:

Y = mX + c

where 'm' represents the slope of the line and 'c' represents the intercept on the y-axis.
An arbitrary starting point is used to evaluate the initial performance. At this starting point, we compute the first derivative (slope) and use a tangent line to measure its steepness. This slope then informs the updates to the parameters (weights and bias).

The slope is steep at the starting point, but as new parameters are generated, the steepness gradually reduces until, at the lowest point, it approaches zero. This lowest point is called the point of convergence.

The main objective of gradient descent is to minimize the cost function, i.e., the error between expected and actual values. Minimizing the cost function requires two factors:

o Direction & Learning Rate

These two factors determine the partial-derivative calculations of future iterations and carry the algorithm toward the point of convergence (a local or global minimum). Let's discuss the learning rate in brief:

Learning Rate:

It is defined as the step size taken toward the minimum or lowest point. It is typically a small value, evaluated and adjusted based on the behavior of the cost function. A high learning rate gives larger steps but risks overshooting the minimum; a low learning rate gives small steps, which sacrifices overall efficiency but has the advantage of more precision.
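This trade-off can be seen on the toy objective f(x) = x**2, whose gradient is 2x (a hypothetical example, not from the original text):

```python
def descend(alpha, steps=50, x0=10.0):
    """Run gradient descent on f(x) = x**2 with a given learning rate alpha."""
    x = x0
    for _ in range(steps):
        x -= alpha * 2 * x   # gradient of x**2 is 2x; minimum is at x = 0
    return x

print(descend(0.01))  # low rate: precise but slow, still far from 0 after 50 steps
print(descend(0.4))   # well-chosen rate: converges essentially to 0
print(descend(1.1))   # too high: each step overshoots and the iterate diverges
```

With alpha too large, each update flips the sign of x and grows its magnitude, which is exactly the overshooting risk described above.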
How Does Gradient Descent Work?
1. Gradient descent is an optimization algorithm used to minimize the cost function of a
model.
2. The cost function measures how well the model fits the training data and is defined
based on the difference between the predicted and actual values.
3. The gradient of the cost function is the derivative with respect to the model’s parameters
and points in the direction of the steepest ascent.
4. The algorithm starts with an initial set of parameters and updates them in small steps to
minimize the cost function.
5. In each iteration of the algorithm, the gradient of the cost function with respect to each
parameter is computed.
6. The gradient tells us the direction of the steepest ascent, and by moving in the opposite
direction, we can find the direction of the steepest descent.
7. The size of the step is controlled by the learning rate, which determines how quickly the
algorithm moves towards the minimum.
8. The process is repeated until the cost function converges to a minimum, indicating that
the model has reached the optimal set of parameters.
9. There are different variations of gradient descent, including batch gradient descent,
stochastic gradient descent, and mini-batch gradient descent, each with its own
advantages and limitations.
10. Efficient implementation of gradient descent is essential for achieving good performance
in machine learning tasks. The choice of the learning rate and the number of iterations
can significantly impact the performance of the algorithm.
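The steps above can be sketched as a minimal batch gradient descent fitting the line Y = mX + c; the function name and toy data here are illustrative assumptions:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.05, iterations=1000):
    """Fit y ≈ m*X + c by repeatedly stepping against the cost gradient."""
    m_coef, c = 0.0, 0.0           # step 4: start from an initial set of parameters
    n = len(y)
    for _ in range(iterations):
        preds = m_coef * X + c
        # step 5: gradient of the MSE cost with respect to each parameter
        grad_m = (2.0 / n) * np.sum((preds - y) * X)
        grad_c = (2.0 / n) * np.sum(preds - y)
        # steps 6-7: move opposite to the gradient, scaled by the learning rate
        m_coef -= alpha * grad_m
        c -= alpha * grad_c
    return m_coef, c

X = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * X + 1.0                  # ground truth: slope 2, intercept 1
m_coef, c = batch_gradient_descent(X, y)
print(round(m_coef, 2), round(c, 2))  # recovers approximately 2.0 and 1.0
```

Step 8 (convergence) shows up here as the gradients shrinking toward zero, after which further iterations barely change the parameters.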
Types of Gradient Descent

Based on how much of the training data is used to compute each update, gradient descent can be divided into batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. Let's understand these different types:

1. Batch Gradient Descent:

Batch gradient descent (BGD) computes the error for every point in the training set and updates the model only after all training examples have been evaluated; one such full pass over the data is known as a training epoch. In simple words, each update sums the gradient over all examples.

Advantages of Batch gradient descent:

o It produces less noise than other types of gradient descent.
o It produces stable convergence.
o It is computationally efficient, since all resources are applied to all training samples at once.

2. Stochastic gradient descent

Stochastic gradient descent (SGD) is a type of gradient descent that processes one training example per iteration: it updates the parameters after each example in the dataset, one at a time. Since it requires only one training example at a time, it is easier to fit in allocated memory. However, it loses some computational efficiency compared with batch gradient descent because of its frequent updates. Those frequent updates also make the gradient noisy; nonetheless, this noise can sometimes be helpful for escaping local minima and finding the global minimum.
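A minimal sketch of SGD on a toy linear-regression setup (the function name and data are hypothetical); note the update uses a single example's gradient:

```python
import random

def sgd(data, alpha=0.05, epochs=50, seed=0):
    """Stochastic gradient descent: update the parameters one example at a time."""
    random.seed(seed)
    m_coef, c = 0.0, 0.0
    for _ in range(epochs):
        random.shuffle(data)                  # visit examples in a random order
        for x, y in data:                     # one training example per update
            error = (m_coef * x + c) - y
            m_coef -= alpha * 2 * error * x   # noisy single-example gradient
            c -= alpha * 2 * error
    return m_coef, c

data = [(x, 2.0 * x + 1.0) for x in range(5)]  # toy data from y = 2x + 1
m_coef, c = sgd(data)
print(round(m_coef, 2), round(c, 2))  # close to 2.0 and 1.0
```

Because only one example lives in memory per update, this is the memory-friendly behaviour the description above refers to.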

Advantages of Stochastic gradient descent:

In stochastic gradient descent (SGD), learning happens on every example, which brings a few advantages over other types of gradient descent:

o It is easier to fit in allocated memory.
o It is relatively faster to compute than batch gradient descent.
o It is more efficient for large datasets.

3. MiniBatch Gradient Descent:

Mini-batch gradient descent is a combination of batch gradient descent and stochastic gradient descent. It divides the training dataset into small batches and performs an update after each batch. Splitting the dataset into smaller batches strikes a balance between the computational efficiency of batch gradient descent and the speed of stochastic gradient descent, giving higher computational efficiency with a less noisy gradient.

Advantages of Mini Batch gradient descent:

o It is easier to fit in allocated memory.
o It is computationally efficient.
o It produces stable gradient descent convergence.
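A minimal sketch of mini-batch gradient descent on the same toy linear-regression setup (names, batch size, and data are illustrative assumptions):

```python
import numpy as np

def minibatch_gd(X, y, batch_size=2, alpha=0.05, epochs=500, seed=0):
    """Mini-batch gradient descent: one parameter update per small batch."""
    rng = np.random.default_rng(seed)
    m_coef, c = 0.0, 0.0
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)            # shuffle, then split into batches
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            preds = m_coef * X[idx] + c
            # averaged gradient over the batch: less noisy than a single example
            grad_m = (2.0 / len(idx)) * np.sum((preds - y[idx]) * X[idx])
            grad_c = (2.0 / len(idx)) * np.sum(preds - y[idx])
            m_coef -= alpha * grad_m
            c -= alpha * grad_c
    return m_coef, c

X = np.arange(4, dtype=float)
y = 2.0 * X + 1.0                  # ground truth: slope 2, intercept 1
m_coef, c = minibatch_gd(X, y)
print(round(m_coef, 2), round(c, 2))  # close to 2.0 and 1.0
```

Averaging the gradient over each batch is exactly the compromise described above: fewer updates than SGD, but far cheaper per update than full-batch descent.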

Challenges with the Gradient Descent

Although gradient descent is one of the most popular methods for optimization problems, it still comes with some challenges, including the following:

1. Local Minima and Saddle Point:

For convex problems, gradient descent can find the global minimum easily, while for non-convex problems it can struggle to find the global minimum, which is where the machine learning model achieves its best results.

Whenever the slope of the cost function is zero or very close to zero, the model stops learning. Apart from the global minimum, this near-zero slope also occurs at saddle points and local minima. A local minimum has a shape similar to the global minimum: the slope of the cost function increases on both sides of the current point.

In contrast, at a saddle point the negative gradient occurs only on one side of the point: the function reaches a local maximum along one direction and a local minimum along the other. The name comes from the shape of a horse's saddle. A local minimum is so named because the loss function is minimal at that point within a local region; the global minimum is where the loss function is minimal over its entire domain.
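The stalling behaviour near a saddle can be sketched on the classic saddle function f(x, y) = x**2 - y**2, which has a saddle at the origin (a hypothetical example, not from the original text):

```python
def gd_step(x, y, alpha=0.1):
    # Gradient of f(x, y) = x**2 - y**2 is (2x, -2y); descent subtracts it.
    return x - alpha * 2 * x, y + alpha * 2 * y

x, y = 1.0, 1e-6      # start almost exactly on the ridge through the saddle
for _ in range(60):
    x, y = gd_step(x, y)

# x collapses toward the saddle at (0, 0), where the gradient is nearly zero
# and progress stalls, until the tiny y-component finally grows and escapes.
print(x, y)
```

Along x the point is attracted to the saddle, while along y it is repelled only very slowly, which is why learning appears to stop near such points.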

2. Vanishing and Exploding Gradient

In a deep neural network trained with gradient descent and backpropagation, two further issues can occur besides local minima and saddle points.

Vanishing Gradients:

A vanishing gradient occurs when the gradient becomes smaller than expected. During backpropagation, the gradient shrinks as it flows backwards, so the earlier layers of the network learn more slowly than the later layers. When this happens, the weight updates of the early layers become insignificant and training stalls.
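A rough illustration of why gradients vanish, assuming sigmoid activations (an assumption for this sketch): backpropagation multiplies one derivative factor per layer, and the sigmoid's derivative never exceeds 0.25.

```python
import math

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)          # maximum value is 0.25, reached at z = 0

# Backpropagation multiplies one derivative factor per layer, so even the
# best case (0.25 per layer) shrinks geometrically with network depth.
grad = 1.0
for layer in range(10):
    grad *= sigmoid_derivative(0.0)

print(grad)   # 0.25**10 ≈ 9.5e-7: the earliest layers receive almost no signal
```

This geometric shrinkage is why the early layers described above update so little.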

Exploding Gradient:

An exploding gradient is the opposite of a vanishing gradient: it occurs when the gradient grows too large, creating an unstable model. In this scenario, the model weights grow until they are represented as NaN. This problem can be addressed by reducing the complexity of the model, for example through dimensionality reduction.
