04 Batch SGD Mini Batch Gradient Descent Algorithms
Theory
SGD is a variation on gradient descent, also called batch gradient descent. As a review, gradient descent seeks to minimize an objective function $J(\theta)$ by iteratively updating each parameter by a small amount based on the negative gradient of a given data set. The steps for performing gradient descent are as follows:

1. Initialize the parameters (weights) at some starting values.
2. Compute the gradient of the objective function with respect to each parameter, using the entire data set.
3. Update each parameter by subtracting the learning rate times its gradient component.
4. Repeat steps 2-3 until the objective converges or a maximum number of iterations is reached.
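As a concrete illustration of these steps, here is a minimal Python sketch of batch gradient descent for a linear model with a squared-error objective; the function name and setup are our own, not from the source.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, epochs=100):
    """Minimize J(theta) = (1/m) * sum((X @ theta - y)**2) by stepping
    against the gradient computed over the ENTIRE data set each iteration."""
    m, n = X.shape
    theta = np.zeros(n)                      # 1. initialize the parameters
    for _ in range(epochs):
        residual = X @ theta - y             # 2. predictions minus labels
        grad = (2.0 / m) * (X.T @ residual)  #    gradient over all m examples
        theta -= alpha * grad                # 3. small step along -gradient
    return theta                             # 4. repeat until done
```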
[Figure: Visualization of the stochastic gradient descent algorithm.[6]]
The learning rate (https://en.wikipedia.org/wiki/Learning_rate) is used to calculate the step size at every iteration. Too large a learning rate and the steps may overshoot far past the optimum value; too small a learning rate and many iterations may be required to reach a local minimum (https://en.wikipedia.org/wiki/Maxima_and_minima). A good starting point for the learning rate is 0.1, adjusted as necessary.[8]
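To see this trade-off numerically, the toy run below (our own illustration, not from the source) minimizes $J(\theta) = \theta^2$, whose gradient is $2\theta$, with three different learning rates:

```python
def descend(alpha, theta=1.0, steps=20):
    # Gradient descent on J(theta) = theta**2; the gradient is 2*theta.
    for _ in range(steps):
        theta -= alpha * 2 * theta
    return theta

print(descend(alpha=1.1))    # too large: each step multiplies theta by -1.2, diverges
print(descend(alpha=0.001))  # too small: ~0.96 after 20 steps, barely moved
print(descend(alpha=0.1))    # reasonable: ~0.01, close to the minimum at 0
```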
A variation on stochastic gradient descent is mini-batch gradient descent. In SGD, the gradient is computed on only one training example, which may result in a large number of iterations required to converge on a local minimum. Mini-batch gradient descent offers a compromise between batch gradient descent and SGD by splitting the training data into smaller batches. The steps for performing mini-batch gradient descent are identical to SGD with one exception: when updating the parameters from the gradient, rather than calculating the gradient of a single training example, the gradient is calculated against a batch of $b$ training examples, i.e. compute

$$\theta_j := \theta_j - \frac{\alpha}{b}\sum_{i=1}^{b} \frac{\partial J_i(\theta)}{\partial \theta_j}$$

where $J_i$ is the loss on the $i$-th example in the batch.
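A sketch of the mini-batch variant under the same squared-error setup as the batch sketch above (the shuffling scheme and default batch size are our choices):

```python
import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.1, batch_size=2, epochs=100, seed=0):
    """Identical to batch gradient descent except each update uses only
    `batch_size` rows, so the parameters move many times per epoch."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        order = rng.permutation(m)                   # reshuffle every epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]    # the current mini-batch
            residual = X[idx] @ theta - y[idx]
            grad = (2.0 / len(idx)) * (X[idx].T @ residual)  # batch-averaged
            theta -= alpha * grad
    return theta
```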
Numerical Example of SGD
Data preparation

Consider a simple 2-D data set with only 6 data points (each point has features $(x_1, x_2)$), and each data point has a label value $y$ assigned to it.

Training data:

x1  x2   y
 4   1   2
 2   8  -14
 1   0   1
 3   2   -1
 1   4   -7
 6   7   -8

Model overview

For the purpose of demonstrating the computation of the SGD process, simply employ a linear regression model: $\hat{y} = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2$ (with $x_0 = 1$), where $\theta_1$ and $\theta_2$ are weights and $\theta_0$ is the constant term. In this case, the goal of this model is to find the best values for $\theta_1$ and $\theta_2$, based on the data set.
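For later reference, the six training points and the model can be encoded directly in Python (the leading column of ones plays the role of $x_0$; the names are our own):

```python
import numpy as np

# Columns: x0 (constant 1), x1, x2; labels y, matching the table above.
X = np.array([[1, 4, 1],
              [1, 2, 8],
              [1, 1, 0],
              [1, 3, 2],
              [1, 1, 4],
              [1, 6, 7]], dtype=float)
y = np.array([2.0, -14.0, 1.0, -1.0, -7.0, -8.0])

def predict(theta, x):
    """Linear model: y_hat = theta0*x0 + theta1*x1 + theta2*x2."""
    return x @ theta
```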
Definition of loss function

In this example, the loss function is the squared L2 norm of the prediction error, $J = (\hat{y} - y)^2$; here $m = 1$ (the gradient is computed from a single training example at a time), and the learning rate $\alpha$ is set to 0.05.
Forward

Model: $\hat{y} = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2$

The linear regression model starts by initializing (https://en.wikipedia.org/wiki/Initialization_(programming)) the weights and setting the bias term to 0. In this case, initialize $[\theta_1, \theta_2] = [-0.044, -0.042]$, so $[\theta_0\ \theta_1\ \theta_2] = [0, -0.044, -0.042]$.

Weight update: since $\partial J / \partial \theta_j = 2(\hat{y} - y)x_j$, each parameter is updated as

$$\theta_j := \theta_j - 2\alpha(\hat{y} - y)x_j$$
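Carrying the first training example $(x_1, x_2) = (4, 1)$, $y = 2$ through one update by hand: $\hat{y} = 0 + (-0.044)(4) + (-0.042)(1) = -0.218 \approx -0.2$, so $2(\hat{y} - y)x_j \approx [-4.4, -17.6, -4.4]$ once $\hat{y}$ is rounded to $-0.2$, matching row 1 of the table below. A short sketch of the same step, using exact (unrounded) arithmetic:

```python
import numpy as np

alpha = 0.05
theta = np.array([0.0, -0.044, -0.042])   # [theta0, theta1, theta2]
x = np.array([1.0, 4.0, 1.0])             # first example, with x0 = 1
y = 2.0

y_hat = x @ theta                # -0.218 (the table shows it rounded to -0.2)
grad = 2 * (y_hat - y) * x       # [-4.436, -17.744, -4.436] unrounded
theta = theta - alpha * grad     # SGD update: theta_j -= 2*alpha*(y_hat - y)*x_j
print(y_hat, grad, theta)
```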
Batch processing with batch size M = 2: per-example gradients $2(\hat{y} - y)x_j$ are computed at the current weights, then averaged over the batch before the update $\theta_j := \theta_j - 2\alpha(\hat{y} - y)x_j$ is applied.

Training Example | $[\theta_0\ \theta_1\ \theta_2]$ | $\hat{y}$ | $2(\hat{y} - y)x_j$          | updated $[\theta_0\ \theta_1\ \theta_2]$
1                | [0, -0.044, -0.042]              | -0.2      | [-4.4000, -17.6000, -4.4000] |
2                |                                  |           |                              |
3                |                                  |           |                              |
4                |                                  |           |                              |
5                |                                  |           |                              |
6                |                                  |           |                              |
Averaging        |                                  |           | XXXXXXXXXXX                  | XXXXXXXXXXX
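Finally, a sketch of the bookkeeping the table implies, assuming per-example gradients are evaluated at the current weights and averaged over each batch of M = 2 before the weights move (this averaging scheme is our reading of the table, and the variable names are our own):

```python
import numpy as np

# Data as in the data-preparation sketch: columns x0, x1, x2 and labels y.
X = np.array([[1, 4, 1], [1, 2, 8], [1, 1, 0],
              [1, 3, 2], [1, 1, 4], [1, 6, 7]], dtype=float)
y = np.array([2.0, -14.0, 1.0, -1.0, -7.0, -8.0])

alpha, M = 0.05, 2
theta = np.array([0.0, -0.044, -0.042])

for start in range(0, len(X), M):
    grads = []
    for xi, yi in zip(X[start:start + M], y[start:start + M]):
        y_hat = xi @ theta
        grads.append(2 * (y_hat - yi) * xi)         # one per-example table row
    theta = theta - alpha * np.mean(grads, axis=0)  # the "Averaging" step, then update
print(theta)
```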