Gradient Descent
Gradient descent is a first-order optimisation algorithm used to find a local
minimum/maximum of a given function. This method is commonly used in machine
learning (ML) and deep learning (DL) to minimise a cost/loss function (e.g. in a linear
regression). Due to its importance and ease of implementation, this algorithm is usually taught at
the beginning of almost all machine learning courses.
However, its use is not limited to ML/DL; it is also widely used in other areas.
A function is convex when the line segment connecting any two of its points lies on or above its curve (it
does not cross it). If the segment does cross the curve, the function has a local minimum that is not a global one.
https://towardsdatascience.com/gradient-descent-algorithm-a-deep-dive-cf04e8115f21
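The chord condition for convexity can be checked numerically on sampled points. The sketch below is only an illustration: the helper name `looks_convex`, the sampling grid and the two test functions are assumptions made for this example, and a sampled check can suggest convexity but never prove it.

```python
def looks_convex(f, lower, upper, samples=50):
    """Return False if any sampled chord dips below the curve."""
    points = [lower + (upper - lower) * i / (samples - 1) for i in range(samples)]
    for a in points:
        for b in points:
            for t in (0.25, 0.5, 0.75):
                chord = t * f(a) + (1 - t) * f(b)   # point on the line segment
                curve = f(t * a + (1 - t) * b)      # point on the function
                if curve > chord + 1e-9:            # the segment crosses the curve
                    return False
    return True


convex_result = looks_convex(lambda x: x ** 2, -3.0, 3.0)
non_convex_result = looks_convex(lambda x: x ** 4 - 3 * x ** 2, -3.0, 3.0)
print(convex_result)      # True
print(non_convex_result)  # False
```

The second function has two separate valleys, so a chord between them passes below the bump in the middle and the check fails.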
We start by defining the initial parameter values, and from there the gradient descent
algorithm uses calculus to iteratively adjust the values so that they minimize the given cost function.
A gradient measures how much the error changes with regard to a change in each of the weights.
The higher the gradient, the steeper the slope and the faster a model can learn. But if the slope is
zero, the model stops learning. In mathematical terms, a gradient is the vector of partial derivatives
of a function with respect to its inputs.
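To make the idea of partial derivatives concrete, here is a minimal sketch that approximates a gradient with central finite differences. The function `example_function` and the helper `numerical_gradient` are hypothetical names chosen for this illustration; in practice gradients are usually derived analytically or by automatic differentiation.

```python
def numerical_gradient(f, point, h=1e-6):
    """Approximate the gradient of f at `point` with central differences."""
    grad = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h    # nudge only the i-th input up
        backward[i] -= h   # and down
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad


def example_function(p):
    # f(x, y) = x^2 + 3y, so the exact gradient is (2x, 3)
    x, y = p
    return x ** 2 + 3 * y


grad = numerical_gradient(example_function, [2.0, 1.0])
print(grad)  # approximately [4.0, 3.0]
```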
Imagine a blindfolded man who wants to climb to the top of a hill in as few steps
as possible. He might start climbing the hill by taking really big steps in the
steepest direction, which he can do as long as he is not close to the top. As he comes closer to the
top, however, his steps will get smaller and smaller to avoid overshooting it. This process can be
described mathematically using the gradient.
Note that the gradient ranging from X0 to X1 is much longer than the one reaching from
X3 to X4. This is because the steepness/slope of the hill, which determines the length of the
vector, is smaller between X3 and X4. This matches the example of the hill because the hill gets less
steep the higher it is climbed. A reduced gradient therefore goes along with a reduced slope and a
reduced step size for the hill climber.
The equation below describes what the gradient descent algorithm does: b is the next
position of our climber, while a represents his current position. The minus sign refers to the
minimization part of the gradient descent algorithm. The gamma in the middle is a weighting factor
(the step size), and the gradient term ∇f(a) points in the direction of the steepest ascent, so
subtracting it moves the climber in the direction of the steepest descent.
b = a - γ∇f(a)
So this formula tells us the next position we need to move to, which lies in the direction of the
steepest descent. Imagine you have a machine learning problem and want to train your algorithm
with gradient descent to minimize your cost function J(w, b) and reach its local minimum by
tweaking its parameters (w and b). The image below shows the horizontal axes representing the
parameters (w and b), while the cost function J(w, b) is represented on the vertical axis. The
cost function shown is convex.
We know we want to find the values of w and b that correspond to the minimum of the cost
function (marked with the red arrow). To start finding the right values we initialize w and b with
some random numbers. Gradient descent then starts at that point (somewhere around the top of
our illustration), and it takes one step after another in the steepest downward direction (i.e., from
the top to the bottom of the illustration) until it reaches the point where the cost function is as
small as possible.
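The procedure described above can be sketched in a few lines. The cost function below is a hypothetical convex bowl chosen only for illustration (the text does not specify J), and the starting point is fixed rather than random so that the run is reproducible.

```python
def cost(w, b):
    # Hypothetical convex cost with its minimum at w = 3, b = -1
    return (w - 3.0) ** 2 + (b + 1.0) ** 2


def cost_gradient(w, b):
    # Partial derivatives of the cost above with respect to w and b
    return 2.0 * (w - 3.0), 2.0 * (b + 1.0)


def gradient_descent(w, b, learning_rate=0.1, iterations=200):
    for _ in range(iterations):
        dw, db = cost_gradient(w, b)
        # Update rule: next position = current position - gamma * gradient
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b


w_opt, b_opt = gradient_descent(w=10.0, b=10.0)
print(w_opt, b_opt, cost(w_opt, b_opt))  # close to 3, -1 and 0
```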
The direction in which we move determines whether we find a local minimum or a local maximum of a function:
o If we move in the direction of the negative gradient, i.e. away from the gradient of the function at the current point, we
approach a local minimum of that function. This is gradient descent.
o If we move in the direction of the positive gradient, i.e. towards the gradient of the function at the current point,
we approach a local maximum of that function. This is gradient ascent.
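The two cases differ only in the sign of the step. The following sketch illustrates this on two hypothetical one-dimensional functions, a bowl and a dome, whose names and shapes are assumptions made for the example.

```python
def step(x, gradient, learning_rate, minimise=True):
    # Descent subtracts the gradient; ascent adds it.
    sign = -1.0 if minimise else 1.0
    return x + sign * learning_rate * gradient(x)


def bowl_gradient(x):
    # Derivative of f(x) = (x - 2)^2, which has a minimum at x = 2
    return 2.0 * (x - 2.0)


def dome_gradient(x):
    # Derivative of g(x) = -(x - 2)^2, which has a maximum at x = 2
    return -2.0 * (x - 2.0)


x_min = x_max = 5.0
for _ in range(100):
    x_min = step(x_min, bowl_gradient, 0.1, minimise=True)
    x_max = step(x_max, dome_gradient, 0.1, minimise=False)

print(x_min, x_max)  # both close to 2
```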
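The main objective of using a gradient descent algorithm is to minimize the cost function using iteration. To
achieve this goal, it performs two steps iteratively:
o Compute the gradient (the first-order derivative) of the function at the current point.
o Move in the direction opposite to the gradient by a step scaled by the learning rate.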
The cost function measures the difference, or error, between the actual values and the
expected values at the current position, expressed as a single real number. It helps to
improve machine learning efficiency by providing feedback to the model so that it can
minimize the error and find the local or global minimum.
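As one possible instance of such a cost function, the sketch below computes the mean squared error, which reduces the differences between two lists of values to a single real number. The text does not name a specific cost function, so the choice of mean squared error and the sample values are assumptions.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between two equally long lists."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


actual = [3.0, 5.0, 7.0]
predicted = [2.5, 5.0, 8.0]
mse = mean_squared_error(actual, predicted)
print(mse)  # (0.25 + 0 + 1) / 3, roughly 0.4167
```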
How big the steps are that gradient descent takes in the direction of the local minimum is determined
by the learning rate, which controls how fast or slow we move towards the optimal
weights.
For the gradient descent algorithm to reach the local minimum we must set the learning rate to an
appropriate value, which is neither too low nor too high. This is important because if the steps it
takes are too big, it may not reach the local minimum because it bounces back and forth across
the convex cost function (see left image below). If we set the learning rate to a
very small value, gradient descent will eventually reach the local minimum but that may take a
while (see the right image).
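The effect of the learning rate can be reproduced on a very small example. The sketch below minimises f(x) = x^2, a function chosen here purely for illustration, with three hypothetical learning rates over the same number of iterations.

```python
def run(learning_rate, start=1.0, iterations=50):
    x = start
    for _ in range(iterations):
        x -= learning_rate * 2.0 * x  # the gradient of x^2 is 2x
    return x


too_large = run(1.1)     # each step overshoots the minimum and |x| grows
too_small = run(0.001)   # moves in the right direction, but very slowly
reasonable = run(0.1)    # ends close to the minimum at x = 0

print(too_large, too_small, reasonable)
```

With the largest rate every update flips the sign of x and increases its magnitude, which is the bouncing behaviour described above; with the smallest rate x has barely moved from its starting value after 50 iterations.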
https://builtin.com/data-science/gradient-descent
A good way to make sure the gradient descent algorithm runs properly is by plotting the
cost function as the optimization runs. Put the number of iterations on the x-axis and the value of
the cost function on the y-axis. This helps you see the value of your cost function after each
iteration of gradient descent, and provides a way to easily spot how appropriate your learning
rate is. You can just try different values for it and plot them all together. The left image below
shows such a plot, while the image on the right illustrates the difference between good and bad
learning rates.
If the gradient descent algorithm is working properly, the cost function should decrease after
every iteration.
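A minimal sketch of this diagnostic is to record the cost after every iteration for several learning rates. The cost function, the learning rates and the helper name `cost_history` are assumptions for this example; the recorded lists could be passed to any plotting library, which is left out here to keep the sketch dependency-free.

```python
def cost(x):
    # Hypothetical convex cost with its minimum at x = 4
    return (x - 4.0) ** 2


def gradient(x):
    return 2.0 * (x - 4.0)


def cost_history(learning_rate, start=0.0, iterations=20):
    x = start
    history = [cost(x)]  # cost before the first update
    for _ in range(iterations):
        x -= learning_rate * gradient(x)
        history.append(cost(x))  # cost after each iteration
    return history


histories = {lr: cost_history(lr) for lr in (0.01, 0.1, 0.3)}
for lr, history in histories.items():
    print(f"learning rate {lr}: final cost {history[-1]:.6f}")
```

For these three rates the recorded cost never increases from one iteration to the next, and the larger rates reach a lower final cost within the same number of iterations.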
When gradient descent can’t decrease the cost-function anymore and remains more or less on the
same level, it has converged. The number of iterations gradient descent needs to converge can
sometimes vary a lot. It can take 50 iterations, 60,000 or maybe even 3 million, making the
number of iterations to convergence hard to estimate in advance.
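Because the iteration count is hard to predict, implementations often stop once the cost stops improving rather than after a fixed number of steps. The sketch below shows one such stopping rule; the tolerance, the iteration cap and the example function are hypothetical choices, not values taken from the text.

```python
def minimise(gradient, cost, start, learning_rate=0.1,
             tolerance=1e-9, max_iterations=100_000):
    x = start
    previous = cost(x)
    for iteration in range(1, max_iterations + 1):
        x -= learning_rate * gradient(x)
        current = cost(x)
        # Converged: the cost changed by less than the tolerance
        if abs(previous - current) < tolerance:
            return x, iteration
        previous = current
    return x, max_iterations  # gave up without meeting the tolerance


x_final, iterations_used = minimise(
    gradient=lambda x: 2.0 * (x + 2.0),  # derivative of (x + 2)^2
    cost=lambda x: (x + 2.0) ** 2,
    start=10.0,
)
print(x_final, iterations_used)
```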
There are three popular types of gradient descent that mainly differ in the amount of data they
use:
BATCH GRADIENT DESCENT
Batch gradient descent, also called vanilla gradient descent, calculates the error for each example
within the training dataset, but only after all training examples have been evaluated does the
model get updated. One such pass over the whole training dataset is called a training epoch.
Advantages of batch gradient descent are its computational efficiency and the stable
error gradient and stable convergence it produces. A disadvantage is that the stable error gradient
can sometimes result in a state of convergence that isn’t the best the model can achieve. It also
requires the entire training dataset to be in memory and available to the algorithm.
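The sketch below shows batch gradient descent for a one-variable linear regression with a mean squared error cost. The toy data, the learning rate and the number of epochs are assumptions made for the illustration.

```python
# Hypothetical toy data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def batch_gradient_descent(xs, ys, learning_rate=0.05, epochs=2000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Accumulate the error gradient over every training example...
        grad_w = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # ...and update the model only once per epoch.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = batch_gradient_descent(xs, ys)
print(w, b)  # close to 2 and 1
```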
STOCHASTIC GRADIENT DESCENT
By contrast, stochastic gradient descent (SGD) updates the parameters after each training example within the
dataset, one example at a time. Depending on
the problem, this can make SGD faster than batch gradient descent. One advantage is that the
frequent updates give us a fairly detailed picture of the rate of improvement.
The frequent updates, however, are more computationally expensive than the batch gradient
descent approach. Additionally, the frequency of those updates can result in noisy gradients,
which may cause the error rate to jump around instead of slowly decreasing.
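A minimal SGD sketch for a one-variable linear regression follows. The toy data is noise-free and generated from y = 2x + 1, which is an assumption that keeps the example deterministic in its result; the shuffling, seed and learning rate are likewise illustrative choices.

```python
import random

# Hypothetical toy data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def stochastic_gradient_descent(xs, ys, learning_rate=0.01, epochs=2000, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new order each epoch
        for i in indices:
            error = w * xs[i] + b - ys[i]
            # Update immediately after every single training example
            w -= learning_rate * 2.0 * error * xs[i]
            b -= learning_rate * 2.0 * error
    return w, b


w, b = stochastic_gradient_descent(xs, ys)
print(w, b)  # close to 2 and 1
```

With real, noisy data the individual updates would pull the parameters in slightly different directions, which is the source of the jumping error rate mentioned above.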
MINI-BATCH GRADIENT DESCENT
Mini-batch gradient descent combines the concepts of SGD
and batch gradient descent. It splits the training dataset into small batches and performs
an update for each of those batches. This creates a balance between the robustness of stochastic
gradient descent and the efficiency of batch gradient descent.
Common mini-batch sizes range between 50 and 256, but as with any other machine learning
technique, there is no clear rule because it varies across applications. This is the go-to
algorithm when training a neural network, and it is the most common type of gradient descent
within deep learning.
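The batching idea can be sketched as follows for a one-variable linear regression. The batch size of 4 is far smaller than the sizes quoted above and is used only because the hypothetical toy dataset has ten points; the learning rate, epoch count and seed are also assumptions.

```python
import random


def mini_batches(data, batch_size, rng):
    shuffled = data[:]
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]  # the last batch may be smaller


def mini_batch_gradient_descent(data, batch_size=4, learning_rate=0.005,
                                epochs=3000, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for batch in mini_batches(data, batch_size, rng):
            # Average the gradient over the current batch only
            grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Hypothetical toy data generated from y = 2x + 1
data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
w, b = mini_batch_gradient_descent(data)
print(w, b)  # close to 2 and 1
```

Setting `batch_size` to 1 turns this sketch into stochastic gradient descent, and setting it to `len(data)` turns it into batch gradient descent.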
A local minimum is so called because the value of the loss function is minimal at that point within a local region. In
contrast, a global minimum is so called because the value of the loss function is minimal there globally,
across the entire domain of the loss function.
In a deep neural network trained with gradient descent and backpropagation, two
more issues can occur besides local minima and saddle points.
Vanishing Gradients:
A vanishing gradient occurs when the gradient is smaller than expected. During backpropagation, the gradient
becomes progressively smaller, so the earlier layers of the network learn more slowly than the later layers.
Once this happens, the updates to the weight parameters become insignificant.
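The shrinking effect can be illustrated with a heavily simplified model: a chain of one-unit layers with sigmoid activations, where every layer shares the same weight and pre-activation value. The sigmoid activation and these simplifications are assumptions made for the sketch and are not stated in the text.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # never larger than 0.25


def backpropagated_gradient(num_layers, weight=1.0, pre_activation=0.0):
    # Chain rule through the stack: one factor per layer
    gradient = 1.0
    for _ in range(num_layers):
        gradient *= weight * sigmoid_derivative(pre_activation)
    return gradient


for depth in (1, 5, 10, 20):
    print(depth, backpropagated_gradient(depth))
```

Each extra layer multiplies the gradient by at most 0.25 in this setup, so the value reaching the earliest layers shrinks geometrically with depth.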
Exploding Gradient:
An exploding gradient is the opposite of a vanishing gradient: it occurs when the gradient is too large and creates an
unstable model. In this scenario, the model weights grow so large that they may be represented as NaN. This problem can
be mitigated using dimensionality reduction, which helps to minimize complexity within the model.
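The growth and the resulting NaN values can be reproduced with a simplified chain of linear one-unit layers that all share one large weight. This setup, the weight of 3.0 and the depths used are assumptions chosen for illustration.

```python
import math


def backpropagated_gradient(num_layers, weight):
    # With linear layers the local derivative of each layer is its weight
    gradient = 1.0
    for _ in range(num_layers):
        gradient *= weight
    return gradient


large = backpropagated_gradient(50, weight=3.0)         # 3^50, already enormous
overflowed = backpropagated_gradient(1000, weight=3.0)  # exceeds float range
not_a_number = overflowed - overflowed                  # inf - inf gives NaN

print(large)
print(math.isinf(overflowed), math.isnan(not_a_number))  # True True
```

Once a gradient or weight has overflowed to infinity, further arithmetic on it can produce NaN, after which the model can no longer be trained meaningfully.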