2.5 Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) is a variant of the Gradient Descent algorithm that optimizes machine learning models by using a single random training example or a small batch for each iteration, improving computational efficiency for large datasets. While SGD is faster and more memory-efficient than traditional methods, it can produce noisy updates leading to oscillations and may require more iterations to converge. Despite its disadvantages, SGD is preferred in many scenarios due to its ability to escape local minima and its suitability for online learning.

Stochastic Gradient Descent (SGD)

Gradient Descent is an iterative optimization process that searches for an objective function's optimum value (minimum or maximum). It is one of the most widely used methods for adjusting a model's parameters in order to reduce a cost function in machine learning projects.
The primary goal of gradient descent is to identify the model parameters that minimize the cost function, which in turn yields good accuracy on both the training and test datasets.
In gradient descent, the gradient is a vector pointing in the direction of the function's steepest ascent at a particular point. By repeatedly moving in the opposite direction of the gradient, the algorithm gradually moves towards lower values of the function until it reaches a minimum.
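To make the update rule concrete, here is a minimal sketch (our own illustration, not code from the original) that repeatedly steps against the gradient of a toy one-dimensional quadratic; the objective, starting point, and learning rate are arbitrary choices for demonstration:

```python
def f(theta):
    # Example objective: a simple quadratic whose minimum is at theta = 3.
    return (theta - 3.0) ** 2

def grad_f(theta):
    # Analytical derivative of the quadratic above.
    return 2.0 * (theta - 3.0)

theta = 10.0         # arbitrary starting point
learning_rate = 0.1  # step size, often called alpha

for _ in range(100):
    # Move against the gradient, i.e. towards lower values of f.
    theta -= learning_rate * grad_f(theta)

print(round(theta, 4))  # approaches 3.0, the minimizer of f
```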

Types of Gradient Descent:


Typically, there are three types of Gradient Descent:
1. Batch Gradient Descent
2. Stochastic Gradient Descent
3. Mini-batch Gradient Descent
In this article, we will be discussing Stochastic Gradient Descent (SGD).

Stochastic Gradient Descent (SGD):


Stochastic Gradient Descent (SGD) is a variant of the Gradient Descent algorithm that is used for optimizing machine learning models. It addresses the computational inefficiency of traditional Gradient Descent methods when dealing with large datasets in machine learning projects.
In SGD, instead of using the entire dataset for each iteration, only a single random training example (or a small batch) is selected to calculate the gradient and update the model parameters. This random selection introduces randomness into the optimization process, hence the term "stochastic" in Stochastic Gradient Descent.
The advantage of using SGD is its computational efficiency, especially when dealing with large datasets. By using a single example or a small batch, the computational cost per iteration is significantly reduced compared to traditional Gradient Descent methods that require processing the entire dataset.
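As a hedged illustration of the per-example update, the sketch below (the linear model and squared-error loss are assumptions of ours, not part of the original) computes the gradient from a single randomly selected example and applies one parameter update:

```python
import numpy as np

def sgd_step(w, b, x_i, y_i, lr):
    """One SGD update computed from a single training example (x_i, y_i)."""
    y_pred = np.dot(w, x_i) + b   # model prediction for this one example
    error = y_pred - y_i          # residual on this example only
    grad_w = 2.0 * error * x_i    # gradient of (y_pred - y_i)**2 w.r.t. w
    grad_b = 2.0 * error          # gradient w.r.t. the bias
    # Parameters move a small step in the direction of the negative gradient.
    return w - lr * grad_w, b - lr * grad_b
```

A full pass over the data then simply calls a step like this once per shuffled example, which is what the algorithm described next does.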
Stochastic Gradient Descent Algorithm
 Initialization: Randomly initialize the parameters of the model.
 Set Parameters: Determine the number of iterations and the learning rate (alpha) for updating the parameters.
 Stochastic Gradient Descent Loop: Repeat the following steps until the model converges or reaches the maximum number of iterations:
   Shuffle the training dataset to introduce randomness.
   Iterate over each training example (or a small batch) in the shuffled order.
   Compute the gradient of the cost function with respect to the model parameters using the current training example (or batch).
   Update the model parameters by taking a step in the direction of the negative gradient, scaled by the learning rate.
   Evaluate the convergence criteria, such as the difference in the cost function between successive iterations.
 Return Optimized Parameters: Once the convergence criteria are met or the maximum number of iterations is reached, return the optimized model parameters.
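Putting those steps together, the following is a minimal sketch of the loop for a linear-regression model with a squared-error cost; the model, the tolerance-based convergence test, and the synthetic data at the bottom are illustrative assumptions rather than details from the original:

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.01, max_epochs=100, tol=1e-6, seed=0):
    """Minimal SGD loop for linear regression with a squared-error cost."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = rng.normal(size=n_features)          # Initialization: random parameters
    b = 0.0
    prev_cost = float("inf")

    for epoch in range(max_epochs):          # loop until convergence or max iterations
        order = rng.permutation(n_samples)   # shuffle the training dataset
        for i in order:                      # iterate over single examples
            error = X[i] @ w + b - y[i]
            w -= lr * 2.0 * error * X[i]     # negative-gradient step, scaled by lr
            b -= lr * 2.0 * error

        cost = np.mean((X @ w + b - y) ** 2)   # cost over the full dataset
        if abs(prev_cost - cost) < tol:        # convergence criterion
            break
        prev_cost = cost

    return w, b                              # return the optimized parameters


# Hypothetical usage on synthetic data:
X = np.random.default_rng(1).normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.3
w, b = sgd_linear_regression(X, y)
```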
In SGD, since only one sample from the dataset is chosen at random for each iteration, the path taken by the algorithm to reach the minima is usually noisier than that of the typical Gradient Descent algorithm. This matters little in practice: what counts is that we reach the minimum, and with a significantly shorter training time.
The path taken by Batch Gradient Descent is shown below:
[Figure: batch gradient optimization path]
A path taken by Stochastic Gradient Descent looks as follows:
[Figure: stochastic gradient optimization path]

One thing to note is that, because SGD is generally noisier than typical Gradient Descent, it usually takes a higher number of iterations to reach the minima, owing to the randomness in its descent. Even though it requires more iterations than typical Gradient Descent, each iteration is computationally much cheaper, so the overall cost is usually lower. Hence, in most scenarios, SGD is preferred over Batch Gradient Descent for optimizing a learning algorithm.

Difference between Stochastic Gradient Descent & Batch Gradient Descent
The comparison between Stochastic Gradient Descent (SGD) and Batch Gradient Descent is as follows:
Dataset Usage
  SGD: Uses a single random sample or a small batch of samples at each iteration.
  Batch Gradient Descent: Uses the entire dataset (batch) at each iteration.

Computational Efficiency
  SGD: Computationally less expensive per iteration, as it processes fewer data points.
  Batch Gradient Descent: Computationally more expensive per iteration, as it processes the entire dataset.

Convergence
  SGD: Faster convergence due to frequent updates.
  Batch Gradient Descent: Slower convergence due to less frequent updates.

Noise in Updates
  SGD: High noise due to frequent updates with a single or few samples.
  Batch Gradient Descent: Low noise, as it updates parameters using all data points.

Stability
  SGD: Less stable, as it may oscillate around the optimal solution.
  Batch Gradient Descent: More stable, as it converges smoothly towards the optimum.

Memory Requirement
  SGD: Requires less memory, as it processes fewer data points at a time.
  Batch Gradient Descent: Requires more memory to hold the entire dataset in memory.

Update Frequency
  SGD: Frequent updates make it suitable for online learning and large datasets.
  Batch Gradient Descent: Less frequent updates make it suitable for smaller datasets.

Initialization Sensitivity
  SGD: Less sensitive to initial parameter values due to frequent updates.
  Batch Gradient Descent: More sensitive to initial parameter values.
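Much of the comparison above boils down to how many examples feed each parameter update. The helper below is a small illustration of that single difference; the function name and signature are our own, not taken from the original:

```python
import numpy as np

def select_examples(X, y, method, rng, batch_size=32):
    """Pick the examples that feed one parameter update under each variant."""
    n = len(X)
    if method == "batch":          # Batch GD: every update uses the whole dataset
        idx = np.arange(n)
    elif method == "sgd":          # SGD: one randomly chosen example per update
        idx = rng.integers(0, n, size=1)
    elif method == "mini-batch":   # Mini-batch GD: a small random subset
        idx = rng.choice(n, size=batch_size, replace=False)
    else:
        raise ValueError(f"unknown method: {method}")
    return X[idx], y[idx]
```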

Advantages of Stochastic Gradient Descent
 Speed: SGD is faster than other variants of Gradient Descent such as Batch Gradient Descent and Mini-Batch Gradient Descent, since it uses only one example (or a few) to update the parameters at each step.
 Memory Efficiency: Since SGD updates the parameters for each training example one at a time, it is memory-efficient and can handle large datasets that cannot fit into memory.
 Avoidance of Local Minima: Due to the noisy updates in SGD, it has the ability to escape from shallow local minima and move towards better (potentially global) minima.
Disadvantages of Stochastic Gradient Descent
 Noisy Updates: The updates in SGD are noisy and have a high variance, which can make the optimization process less stable and lead to oscillations around the minimum.
 Slow Convergence: SGD may require more iterations to converge to the minimum, since it updates the parameters for each training example one at a time.
 Sensitivity to Learning Rate: The choice of learning rate can be critical in SGD, since a high learning rate can cause the algorithm to overshoot the minimum, while a low learning rate can make the algorithm converge slowly.
 Less Accurate: Due to the noisy updates, SGD may not converge to the exact global minimum and can result in a suboptimal solution. This can be mitigated by using techniques such as learning rate scheduling and momentum-based updates (a brief sketch of both follows this list).
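As a hedged sketch of those two mitigations, the fragment below pairs a momentum-style update with a simple step-decay learning-rate schedule; the coefficient 0.9, the halving factor, and the drop interval are arbitrary demonstration values, not recommendations from the original:

```python
def momentum_update(w, velocity, grad, lr, beta=0.9):
    """One momentum-based SGD step: the velocity averages recent gradients,
    which damps the noise of single-example updates."""
    velocity = beta * velocity + grad
    return w - lr * velocity, velocity

def step_decay_lr(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Simple learning-rate schedule: halve the rate every few epochs so the
    later, smaller steps settle closer to the minimum."""
    return initial_lr * (drop ** (epoch // epochs_per_drop))
```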
