04 Batch SGD Mini Batch Gradient Descent Algorithms

The document discusses Stochastic Gradient Descent (SGD) as a variation of gradient descent used to optimize machine learning models, particularly in neural networks. It outlines the steps involved in performing SGD, including selecting a learning rate, initializing parameters, and updating them based on the gradient of a single training example. Additionally, it introduces mini-batch gradient descent as a compromise between batch gradient descent and SGD, allowing for more efficient computation by processing smaller batches of data.


Stochastic gradient descent is being used in neural networks (https://en.wikipedia.org/wiki/Neural_network) and decreases machine computation time while increasing complexity and performance for large-scale problems.[5]

Theory

SGD is a variation on gradient descent, also called batch gradient descent. As a review, gradient descent seeks to minimize an objective function J(θ) by iteratively updating each parameter by a small amount based on the negative gradient of a given data set.

Batch Gradient Descent Algorithm

The steps for performing batch gradient descent are as follows:

Step 1: Select a learning rate α
Step 2: Select initial parameter values θ as the starting point
Step 3: Update all parameters from the gradient of the training data set, i.e. compute θ_j := θ_j − α · ∂J(θ)/∂θ_j for every parameter θ_j
Step 4: Repeat Step 3 until a local minimum is reached

[Figure: Visualization of the gradient descent algorithm[6]]
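To make the four steps concrete, here is a minimal NumPy sketch of batch gradient descent for a linear model with squared-error loss. It is an illustrative sketch, not code from the original text; the names (batch_gradient_descent, X, y, alpha) and the mean-squared-error gradient are assumptions chosen to match the worked example later in the document.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.05, n_iters=100):
    """Step against the gradient of the squared-error loss computed on the FULL data set."""
    X = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])  # prepend x0 = 1 for the bias term
    y = np.asarray(y, dtype=float)
    theta = np.zeros(X.shape[1])               # Step 2: initial parameter values
    for _ in range(n_iters):                   # Step 4: repeat the update
        y_hat = X @ theta                      # predictions for every training example
        grad = 2 * (y_hat - y) @ X / len(X)    # gradient of the mean squared error
        theta = theta - alpha * grad           # Step 3: update all parameters at once
    return theta
```

Every iteration touches all of the training examples, which is exactly what the next paragraph identifies as the bottleneck for large data sets.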
Under batch gradient descent, the gradient, ∇J(θ), is calculated at every step against a full data set. When the training data is large, computation may be slow or require large amounts of computer memory.[2]
Stochastic Gradient Descent Algorithm

SGD modifies the batch gradient descent algorithm (https://en.wikipedia.org/wiki/Algorithm) by calculating the gradient for only one training example at every iteration.[7] The steps for performing SGD are as follows:

Step 1: Randomly shuffle the data set of size m
Step 2: Select a learning rate α
Step 3: Select initial parameter values θ as the starting point
Step 4: Update all parameters from the gradient of a single training example (x_i, y_i), i.e. compute θ_j := θ_j − α · ∂J(θ; x_i, y_i)/∂θ_j for every parameter θ_j
Step 5: Repeat Step 4 until a local minimum is reached

[Figure: Visualization of the stochastic gradient descent algorithm[6]]
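The per-example update can be sketched the same way as the batch version above, under the same illustrative assumptions (linear model, squared-error loss, hypothetical function name sgd):

```python
import numpy as np

def sgd(X, y, alpha=0.05, n_epochs=10, seed=0):
    """Like batch gradient descent, but each update uses the gradient of ONE training example."""
    X = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])  # prepend x0 = 1 for the bias term
    y = np.asarray(y, dtype=float)
    theta = np.zeros(X.shape[1])               # Step 3: initial parameter values
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):                  # Step 5: repeat until a local minimum is reached
        for i in rng.permutation(len(X)):      # Step 1: randomly shuffle the data set of size m
            y_hat = X[i] @ theta               # prediction for a single example
            grad = 2 * (y_hat - y[i]) * X[i]   # gradient of (y_hat - y)^2 for that example
            theta = theta - alpha * grad       # Step 4: update from one example
    return theta
```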
The learning rate (https://en.wikipedia.org/wiki/Learning_rate) is used to calculate the step size at every iteration. Too large a learning rate and the steps may overshoot past the optimum value; too small a learning rate and many iterations may be required to reach a local minimum (https://en.wikipedia.org/wiki/Maxima_and_minima). A good starting point for the learning rate is 0.1, adjusted as necessary.[8]
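The trade-off is easy to see on a toy one-dimensional objective. The snippet below (illustrative only, not from the original text) runs gradient descent on J(θ) = θ², whose gradient is 2θ, with three different learning rates:

```python
def descend(alpha, theta=1.0, n_iters=20):
    """Gradient descent on J(theta) = theta**2, whose gradient is 2*theta."""
    for _ in range(n_iters):
        theta = theta - alpha * 2 * theta
    return theta

print(descend(alpha=1.1))    # too large: each step overshoots and the iterates diverge
print(descend(alpha=0.001))  # too small: after 20 iterations theta is still close to 1
print(descend(alpha=0.1))    # reasonable: theta shrinks close to the minimum at 0
```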

Mini-Batch Gradient Descent

A variation on stochastic gradient descent is mini-batch gradient descent. In SGD, the gradient is computed on only one training example, which may result in a large number of iterations required to converge on a local minimum. Mini-batch gradient descent offers a compromise between batch gradient descent and SGD by splitting the training data into smaller batches. The steps for performing mini-batch gradient descent are identical to SGD with one exception: when updating the parameters from the gradient, rather than calculating the gradient of a single training example, the gradient is averaged over a batch of b training examples, i.e. compute θ_j := θ_j − α · (1/b) Σ_{i=1..b} ∂J(θ; x_i, y_i)/∂θ_j.
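Under the same illustrative assumptions as the earlier sketches, the only change is that each update averages the gradient over a small batch:

```python
import numpy as np

def mini_batch_gd(X, y, alpha=0.05, batch_size=2, n_epochs=10, seed=0):
    """SGD-style updates, but each gradient is averaged over a mini-batch of examples."""
    X = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])  # prepend x0 = 1 for the bias term
    y = np.asarray(y, dtype=float)
    theta = np.zeros(X.shape[1])
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        order = rng.permutation(len(X))         # shuffle, then walk through the data in batches
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            y_hat = X[idx] @ theta
            grad = 2 * (y_hat - y[idx]) @ X[idx] / len(idx)  # average gradient over the batch
            theta = theta - alpha * grad
    return theta
```

With batch_size=1 this reduces to SGD, and with batch_size=len(X) it reduces to batch gradient descent.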

Numerical Example of SGD

Data preparation

Consider a simple 2-D data set with only 6 data points (each point has coordinates x1 and x2), and each data point has a label value y assigned to it.

Training Data:

x1  x2    y
 4   1    2
 2   8  -14
 1   0    1
 3   2   -1
 1   4   -7
 6   7   -8

Model overview

For the purpose of demonstrating the computation of the SGD process, simply employ a linear regression model: ŷ = θ1·x1 + θ2·x2 + θ0, where θ1 and θ2 are weights and θ0 is the constant term. In this case, the goal of this model is to find the best values for θ1 and θ2, based on the data set.
Definition of loss function

In this example, the loss function is the squared l2 norm of the prediction error; with m = 1 it can be defined as J = (ŷ − y)². The learning rate is set as α = 0.05.

Forward model: ŷ = θ0·x0 + θ1·x1 + θ2·x2, with x0 = 1

Weight initialization: [θ0, θ1, θ2] = [0, −0.044, −0.042]

The linear regression model starts by initializing (https://en.wikipedia.org/wiki/Initialization_(programming)) the weights and setting the bias term θ0 at 0. In this case, initiate [θ1, θ2] = [−0.044, −0.042].

Weight update: since ∂J/∂θ_j = 2(ŷ − y)·x_j, each parameter is updated as θ_j = θ_j − 2α(ŷ − y)·x_j

Training with SGD (one example per update):

Training   [θ0, θ1, θ2]               ŷ       2(ŷ − y)·x_j                     Updated θ_j = θ_j − 2α(ŷ − y)·x_j
example
1          [0, -0.044, -0.042]        -0.2    [-4.4000, -17.6000, -4.4000]     [0.2200, 0.8360, 0.1780]
2          [0.2200, 0.8360, 0.1780]    3.3    [34.6000, 69.2000, 276.8000]     [-1.5080, -2.6170, -13.6610]
3
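To check the first row of the table above, here is a short snippet (illustrative, not part of the original) that reproduces the update for training example 1, starting from [θ0, θ1, θ2] = [0, −0.044, −0.042]; ŷ is rounded to one decimal place because that is how the table reports it before forming the gradient:

```python
import numpy as np

alpha = 0.05
theta = np.array([0.0, -0.044, -0.042])   # initial [theta0, theta1, theta2]
x = np.array([1.0, 4.0, 1.0])             # first training example with x0 = 1: (x1, x2) = (4, 1)
y = 2.0

y_hat = round(float(theta @ x), 1)        # -0.218, reported as -0.2 in the table
grad = 2 * (y_hat - y) * x                # [-4.4, -17.6, -4.4]
theta = theta - alpha * grad              # [0.22, 0.836, 0.178]
print(y_hat, grad, theta)
```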
Mini Batch (M = 2):

Training   [θ0, θ1, θ2]               ŷ       2(ŷ − y)·x_j                     Updated θ_j = θ_j − 2α(ŷ − y)·x_j
example
1          [0, -0.044, -0.042]        -0.2    [-4.4000, -17.6000, -4.4000]
2          [0.2200, 0.8360, 0.1780]    3.3    [34.6000, 69.2000, 276.8000]
Averaging                                     [15.1000, 25.8000, 136.2000]     [-0.7550, -1.3340, -6.8520]
3
4
Averaging                                     XXXXXXXXXX                       XXXXXXXX
…
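The averaging row can likewise be verified with a short illustrative snippet: the two per-example gradients are averaged, and a single update is applied to the initial weights.

```python
import numpy as np

alpha = 0.05
theta = np.array([0.0, -0.044, -0.042])   # initial [theta0, theta1, theta2]
g1 = np.array([-4.4, -17.6, -4.4])        # gradient from training example 1
g2 = np.array([34.6, 69.2, 276.8])        # gradient from training example 2

g_avg = (g1 + g2) / 2                     # [15.1, 25.8, 136.2]
theta = theta - alpha * g_avg             # [-0.755, -1.334, -6.852]
print(g_avg, theta)
```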

Batch Processing (all 6 examples per update):

Training   [θ0, θ1, θ2]               ŷ       2(ŷ − y)·x_j                     Updated θ_j = θ_j − 2α(ŷ − y)·x_j
example
1          [0, -0.044, -0.042]        -0.2    [-4.4000, -17.6000, -4.4000]
2
3
4
5
6
Averaging                                     XXXXXXXXXXX                      XXXXXXXXXXX
