GRADIENT DESCENT OPTIMIZATION
Gradient descent is an optimization algorithm in machine learning used to minimize a function by
iteratively moving towards the minimum value of that function.
We essentially use this algorithm when we have to find the least possible values that can
satisfy a given cost function. In machine learning, more often than not we try
to minimize loss functions (like mean squared error). By minimizing the loss function, we can
improve our model, and gradient descent is one of the most popular algorithms used for this
purpose.
The graph above shows how exactly a gradient descent algorithm works.
We first take a point on the cost function and begin moving in steps towards the
minimum point. The size of the step, or how quickly we converge to the minimum
point, is defined by the learning rate.
We can cover more ground with a higher learning rate, but at the risk of overshooting the
minima. On the other hand, small steps / smaller learning rates will take a
lot of time to reach the lowest point.
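As a concrete illustration of these steps, here is a minimal sketch (our own, not part of the original notes) of gradient descent on the simple cost function f(w) = (w - 3)^2, whose minimum lies at w = 3. The starting point, learning rate, and number of steps are illustrative assumptions.

# A minimal sketch (illustrative, not from the notes) of gradient descent
# on f(w) = (w - 3)**2, whose minimum is at w = 3.

def grad(w):
    # Derivative of f(w) = (w - 3)**2 is 2 * (w - 3).
    return 2.0 * (w - 3.0)

w = 0.0              # starting point on the cost function
learning_rate = 0.1  # step size; too large a value risks overshooting the minimum
num_steps = 50

for step in range(num_steps):
    w = w - learning_rate * grad(w)   # move against the gradient

print(w)  # converges close to 3.0, the minimum point

With a learning rate of 0.1 the iterates approach the minimum smoothly; if the learning rate in this sketch were raised above 1.0, the same loop would overshoot and diverge, which is exactly the trade-off described above.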
Now, the direction in which the algorithm has to move is also important. We calculate this by
using derivatives. You need to be familiar with derivatives from calculus. A derivative is
basically calculated as the slope of the graph at any particular point.
We get that by finding the tangent line to the graph at that point. The steeper
the tangent, the more steps would be needed to reach the minimum point;
a less steep tangent suggests that fewer steps are required to reach the minimum.
Figure: the error surface over the w0-w1 plane forms a parabola with a single global minimum for every weight vector; the negated gradient gives the direction in the w0-w1 plane producing the steepest descent.
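The derivative's sign tells us which way to move: a positive slope means the minimum lies to the left, a negative slope means it lies to the right. The short sketch below (our own illustration; the function and finite-difference step are assumptions) estimates the slope of the tangent numerically and picks the direction accordingly.

# Illustrative sketch: estimate the slope of the tangent at a point and use its
# sign to choose the direction of movement. The function and step h are assumptions.

def f(w):
    return (w - 3.0) ** 2

def numerical_slope(w, h=1e-6):
    # Central-difference approximation of the derivative (slope of the tangent) at w.
    return (f(w + h) - f(w - h)) / (2.0 * h)

w = 5.0
slope = numerical_slope(w)
direction = -1.0 if slope > 0 else 1.0  # step opposite to the sign of the slope
print(slope, direction)  # slope is positive at w = 5, so we step towards smaller w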
STOCHASTIC GRADIENT DESCENT
The word stochastic refers to a system or process that involves randomness.
Hence, in stochastic gradient descent, a few samples are selected randomly instead of the
whole data set for each iteration.
Stochastic gradient descent is a type of gradient descent that processes one training example per
iteration. Over a training epoch it works through the examples in the dataset and updates the
model's parameters after each example, one at a time.
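To make this one-example-per-update procedure concrete, here is a minimal sketch of stochastic gradient descent fitting a line y ≈ w * x + b to a tiny synthetic dataset. The dataset, learning rate, and number of epochs are illustrative assumptions rather than part of the original notes.

import random

# Illustrative SGD sketch for a linear model y ≈ w * x + b on synthetic data.

data = [(x, 2.0 * x + 1.0) for x in range(10)]  # true relation: y = 2x + 1
w, b = 0.0, 0.0
learning_rate = 0.01

for epoch in range(200):
    random.shuffle(data)          # visit the examples in a random order each epoch
    for x, y in data:             # one training example per parameter update
        error = (w * x + b) - y
        # Gradients of the squared error (w*x + b - y)**2 with respect to w and b.
        w -= learning_rate * 2.0 * error * x
        b -= learning_rate * 2.0 * error

print(w, b)  # approaches w ≈ 2, b ≈ 1

Because each update looks at only a single example, every step is cheap but noisy; the parameters wander around the minimum rather than descending smoothly, which is the behaviour discussed below.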
As it requires only one training example at a time, it is easier to fit in the allocated
memory. However, it loses some computational efficiency in comparison to batch
gradient descent, because the frequent updates add overhead.
Further, due to the frequent updates, the gradient estimate is noisy. However, this noise
can sometimes be helpful in escaping local minima and finding the global minimum.
Advantages of stochastic gradient descent: It is easier to fit in the available memory. It is
relatively faster to compute than batch gradient descent. It is more efficient for large datasets.
Disadvantages of stochastic gradient descent: SGD requires a number of hyperparameters, such
as the regularization parameter and the number of iterations.
In Gradient Descent, there is a term called “batch” which denotes the total number of samples
from a dataset that is used for calculating the gradient for each iteration.
In typical Gradient Descent optimization, like Batch Gradient Descent, the batch is taken to
be the whole dataset. Although using the whole dataset is really useful for getting to the
minima in a less noisy and less random manner, the problem arises when our dataset gets big.
Suppose you have a million samples in your dataset; if you use a typical Gradient Descent
optimization technique, you will have to use all of the one million samples for completing
one iteration of Gradient Descent, and this has to be done for every iteration
until the minima are reached. Hence, it becomes computationally very expensive to perform.
This problem is solved by Stochastic Gradient Descent. SGD uses only a single sample,
i.e., a batch size of one, to perform each iteration.
The samples are randomly shuffled, and one is selected for performing each iteration.
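To make the "batch" terminology concrete, the sketch below (our own illustration, with an invented dataset and learning rate) compares one parameter update of batch gradient descent, which averages the gradient over every sample, against one update of SGD, which uses a single randomly chosen sample.

import random

# Illustrative comparison: one update of batch gradient descent (batch = whole
# dataset) versus one update of SGD (batch size = 1) for the model y ≈ w * x.

data = [(x, 2.0 * x) for x in range(1, 51)]  # 50 samples with true w = 2
learning_rate = 0.0001
w = 0.0

# Batch gradient descent: a single iteration touches all len(data) samples.
batch_grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
w_batch = w - learning_rate * batch_grad

# Stochastic gradient descent: a single iteration touches one random sample.
x, y = random.choice(data)
sgd_grad = 2.0 * (w * x - y) * x
w_sgd = w - learning_rate * sgd_grad

print(w_batch, w_sgd)  # both move towards w = 2; the SGD step used 1 sample instead of 50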