Stochastic Gradient Descent - Math and Python Code
Image by DALL-E-2
Introduction
The image above is not just an appealing visual that drew you to this article
(despite its length), but it also represents a potential journey of the SGD
algorithm in search of a global minimum. In this journey, it navigates rocky
paths where the height symbolizes the loss. If this doesn’t sound clear now,
don’t worry, it will be by the end of this article.
Index:
· 1: Understanding the Basics
∘ 1.1: What is Gradient Descent
∘ 1.2: The ‘Stochastic’ in Stochastic Gradient Descent
· 2: The Mechanics of SGD
∘ 2.1: The Algorithm Explained
∘ 2.2: Understanding Learning Rate
· 3: SGD in Practice
∘ 3.1: Implementing SGD in Machine Learning Models
∘ 3.2: SGD in Sci-kit Learn and Tensorflow
· 4: Advantages and Challenges
∘ 4.1: Why Choose SGD?
∘ 4.2: Overcoming Challenges in SGD
· 5: Beyond Basic SGD
∘ 5.1: Variants of SGD
∘ 5.2: Future of SGD
· Conclusion
Image by DALL-E-2
1: Understanding the Basics

1.1: What is Gradient Descent

The beauty of Gradient Descent is its simplicity and elegance. Here's how it
works: you start with a random point on the function you're trying to
minimize, for example a random starting point on the mountain. Then, you
calculate the gradient (slope) of the function at that point. In the mountain
analogy, this is like looking around you to find the steepest slope. Once you
know the direction, you take a step downhill in that direction, and then you
calculate the gradient again. Repeat this process until you reach the bottom.
The size of each step is determined by the learning rate. However, if the
learning rate is too small, it might take a long time to reach the bottom. If it’s
too large, you might overshoot the lowest point. Finding the right balance is
key to the success of the algorithm.
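As a concrete (if simplified) illustration of this loop, here is a minimal
gradient descent sketch on a one-dimensional function, f(x) = (x − 3)², whose
minimum sits at x = 3; the function, starting point, and learning rate are
illustrative choices, not part of the original article:

def grad_f(x):
    return 2 * (x - 3)        # derivative (the "slope") of f(x) = (x - 3)**2

x = 10.0                      # starting point "on the mountain"
learning_rate = 0.1

for step in range(50):
    x = x - learning_rate * grad_f(x)   # take a step downhill

print(x)   # ends up very close to 3, the minimum

The same loop generalizes directly to many parameters: x becomes a vector
and grad_f returns the gradient vector.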
One of the most appealing aspects of Gradient Descent is its generality. It can
be applied to almost any function, especially those where an analytical
solution is not feasible. This makes it incredibly versatile in solving various
types of problems in machine learning, from simple linear regression to
complex neural networks.
1.2: The 'Stochastic' in Stochastic Gradient Descent

In traditional batch gradient descent, you calculate the gradient of the loss
function with respect to the parameters for the entire training set. As you
can imagine, for large datasets, this can be quite computationally intensive
and time-consuming. This is where SGD comes into play. Instead of using
the entire dataset to calculate the gradient, SGD randomly selects just one
data point (or a few data points) to compute the gradient in each iteration.
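To make the contrast concrete, here is a small sketch of the two gradient
computations for a linear model with mean squared error; the data and
variable names are illustrative, not from the original article:

import numpy as np

X = np.random.randn(1000, 3)                       # 1,000 samples, 3 features
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * np.random.randn(1000)
w, b = np.zeros(3), 0.0

# Batch gradient descent: the gradient uses ALL 1,000 samples
error = X @ w + b - y
grad_w_batch = 2 * X.T @ error / len(y)

# Stochastic gradient descent: the gradient uses ONE randomly chosen sample
i = np.random.randint(len(y))
error_i = X[i] @ w + b - y[i]
grad_w_sgd = 2 * error_i * X[i]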
Think of this process as if you were again descending a mountain, but this
time in thick fog with limited visibility. Rather than viewing the entire
landscape to decide your next step, you make your decision based on where
your foot lands next. This step is small and random, but it’s repeated many
times, each time adjusting your path slightly in response to the immediate
terrain under your feet.
This approach has several practical advantages:

Speed: By using only a small subset of data at a time, SGD can make rapid
progress in reducing the loss, especially for large datasets.
Online Learning: SGD is well-suited for online learning, where the model
needs to be updated as new data comes in, due to its ability to update the
model incrementally.
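As a concrete example of this incremental style, scikit-learn's SGD-based
estimators expose a partial_fit method that updates the model one batch at a
time; the loop below is only a sketch, assuming batches of (X_batch, y_batch)
arrive from some data stream:

from sklearn.linear_model import SGDRegressor

model = SGDRegressor(learning_rate='constant', eta0=0.01)

# stream_of_batches is a placeholder for whatever source delivers new data
for X_batch, y_batch in stream_of_batches:
    model.partial_fit(X_batch, y_batch)   # update the model incrementally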
2: The Mechanics of SGD

2.1: The Algorithm Explained

Initialization (Step 1)
First, you initialize the parameters (weights) of your model. This can be done
randomly or by some other initialization technique. The starting point for
SGD is crucial as it influences the path the algorithm will take.
Gradient Formula

At each iteration, the algorithm computes the gradient of the loss with respect
to the current parameters:

∇θJ(θ)

Here, ∇θJ(θ) represents the gradient of the loss function J(θ) with respect to
the parameters θ. This gradient is a vector of partial derivatives, where each
component of the vector is the partial derivative of the loss function with
respect to the corresponding parameter in θ.

The parameters are then updated by taking a step in the direction of the
negative gradient:

θ = θ − η∇θJ(θ)

where:

η is the learning rate, a positive scalar determining the size of the step in
the direction of the negative gradient.

∇θJ(θ) is the gradient of the loss function J(θ) with respect to the
parameters θ.
2.2: Understanding Learning Rate

The learning rate determines the size of the steps you take towards the
minimum. If it’s too small, the algorithm will be slow; if it’s too large, you
might overshoot the minimum.
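A quick worked example with made-up numbers: suppose the current
parameter is θ = 0.8, the gradient at that point is ∇θJ(θ) = 2.5, and the
learning rate is η = 0.1. The update θ = θ − η∇θJ(θ) gives θ = 0.8 − 0.1 × 2.5 =
0.55. With a much larger learning rate, say η = 1.0, the same step would jump
all the way to θ = −1.7, very likely overshooting the minimum.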
A common remedy is to let the learning rate decrease over time. This approach
is called learning rate annealing or scheduling. One popular scheme is:

Step Decay: Reduce the learning rate by some factor after a certain
number of epochs, as sketched below.
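A minimal sketch of step decay (the decay factor and schedule below are
illustrative, not prescribed by the article):

initial_lr = 0.1
drop_factor = 0.5        # halve the learning rate...
epochs_per_drop = 20     # ...every 20 epochs

for epoch in range(100):
    learning_rate = initial_lr * (drop_factor ** (epoch // epochs_per_drop))
    # ... run one epoch of SGD updates with this learning_rate ...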
3: SGD in Practice
3.1: Implementing SGD in Machine Learning Models

models-from-scratch-python/Stochastic Gradient Descent/demo.ipynb (github.com) —
a repo where I recreate some popular machine learning models from scratch in Python.

Below is the full SGDRegressor class, which we will then break down step by step.
import numpy as np

class SGDRegressor:
    def __init__(self, learning_rate=0.01, epochs=100, batch_size=1, reg=None, reg_param=0.0):
        """
        Constructor for the SGDRegressor.

        Parameters:
        learning_rate (float): The step size used in each update.
        epochs (int): Number of passes over the training dataset.
        batch_size (int): Number of samples to be used in each batch.
        reg (str): Type of regularization ('l1' or 'l2'); None if no regularization.
        reg_param (float): Regularization parameter.

        The weights and bias are initialized as None and will be set during the fit method.
        """
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.batch_size = batch_size
        self.reg = reg
        self.reg_param = reg_param
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        """
        Fits the model to the training data using mini-batch SGD.

        Parameters:
        X (numpy.ndarray): Training data, shape (m_samples, n_features).
        y (numpy.ndarray): Target values, shape (m_samples,).

        This method initializes the weights and bias, and then updates them over several epochs.
        """
        m, n = X.shape  # m is number of samples, n is number of features
        self.weights = np.zeros(n)
        self.bias = 0

        for _ in range(self.epochs):
            # Shuffle the data at the start of each epoch
            indices = np.random.permutation(m)
            X_shuffled = X[indices]
            y_shuffled = y[indices]

            # Process the data in mini-batches
            for i in range(0, m, self.batch_size):
                X_batch = X_shuffled[i:i + self.batch_size]
                y_batch = y_shuffled[i:i + self.batch_size]

                # Gradients of the mean squared error w.r.t. weights and bias
                error = np.dot(X_batch, self.weights) + self.bias - y_batch
                gradient_w = (2 / X_batch.shape[0]) * np.dot(X_batch.T, error)
                gradient_b = (2 / X_batch.shape[0]) * np.sum(error)

                # Add the regularization term to the weight gradient if requested
                if self.reg == 'l1':
                    gradient_w += self.reg_param * np.sign(self.weights)
                elif self.reg == 'l2':
                    gradient_w += self.reg_param * self.weights

                # Update step: move against the gradient
                self.weights -= self.learning_rate * gradient_w
                self.bias -= self.learning_rate * gradient_b

    def predict(self, X):
        """
        Predicts target values for the given data.

        Parameters:
        X (numpy.ndarray): Data for which to predict target values.

        Returns:
        numpy.ndarray: Predicted target values.
        """
        return np.dot(X, self.weights) + self.bias

    def compute_loss(self, X, y):
        """
        Computes the regularized mean squared error loss.

        Parameters:
        X (numpy.ndarray): The input data.
        y (numpy.ndarray): The true target values.

        Returns:
        float: The computed loss value.
        """
        return np.mean((y - self.predict(X)) ** 2) + self._get_regularization_loss()

    def _get_regularization_loss(self):
        """
        Computes the regularization loss based on the regularization type.

        Returns:
        float: The regularization loss.
        """
        if self.reg == 'l1':
            return self.reg_param * np.sum(np.abs(self.weights))
        elif self.reg == 'l2':
            return self.reg_param * np.sum(self.weights ** 2)
        else:
            return 0

    def get_weights(self):
        """
        Returns the weights of the model.

        Returns:
        numpy.ndarray: The weights of the linear model.
        """
        return self.weights
Initialization (Step 1)
weights and bias are set to None initially and will be initialized in the
fit method.
This method fits the model to the training data. It starts by initializing
weights as a zero vector of length n (number of features) and bias to zero.
The model's parameters are then updated over a number of epochs through SGD.

for _ in range(self.epochs):
    indices = np.random.permutation(m)
    X_shuffled = X[indices]
    y_shuffled = y[indices]

In each epoch, the data is shuffled, and batches are created to update the
model parameters using SGD.

if self.reg == 'l1':
    gradient_w += self.reg_param * np.sign(self.weights)
elif self.reg == 'l2':
    gradient_w += self.reg_param * self.weights

Gradients for weights and bias are computed in each batch. These are then
used to update the model's weights and bias. If regularization is used, it is also
included in the gradient calculation.
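For reference, a minimal version of the per-batch gradient computation and
parameter update, written to be consistent with the class above (the exact
lines in the original notebook may differ slightly):

# inside the loop over mini-batches
error = np.dot(X_batch, self.weights) + self.bias - y_batch
gradient_w = (2 / X_batch.shape[0]) * np.dot(X_batch.T, error)
gradient_b = (2 / X_batch.shape[0]) * np.sum(error)

# regularization (if any) is added to gradient_w here, then the parameters
# move one step against the gradient
self.weights -= self.learning_rate * gradient_w
self.bias -= self.learning_rate * gradient_b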
The predict method calculates the predicted target values using the learned
linear model.
The compute_loss method calculates the mean squared error between the
predicted values and the actual target values y. Additionally, it incorporates
the regularization loss if regularization is specified.
def _get_regularization_loss(self):
    if self.reg == 'l1':
        return self.reg_param * np.sum(np.abs(self.weights))
    elif self.reg == 'l2':
        return self.reg_param * np.sum(self.weights ** 2)
    else:
        return 0
This private method computes the regularization loss based on the type of
regularization (l1 or l2) and the regularization parameter. This loss is
added to the main loss function to penalize large weights, which helps reduce
overfitting.
3.2: SGD in Sci-kit Learn and Tensorflow

Now, while the code above is very useful for educational purposes, data
scientists definitely don't use it on a daily basis. Indeed, we can call SGD
directly, with just a few lines of code, from popular libraries such as
scikit-learn (machine learning) or TensorFlow (deep learning).
from sklearn.linear_model import SGDRegressor

# Defining the model
model = SGDRegressor(max_iter=1000)

# Fitting the model
model.fit(X, y)

# Making predictions
predictions = model.predict(X)

The SGD regressor is called directly from the sklearn library, and it follows the
same structure as the other algorithms in that library.
The parameter ‘max_iter’ is the maximum number of epochs (full passes over
the training data). By setting max_iter to 1000, we let the algorithm sweep
through the training set up to 1000 times, updating the linear regression
weights and bias along the way.
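The learning rate and the regularization can be controlled through the same
constructor; for instance (the specific values below are only illustrative, not
taken from the original example):

model = SGDRegressor(
    max_iter=1000,
    learning_rate='invscaling',  # shrink the step size as training progresses
    eta0=0.01,                   # initial learning rate
    penalty='l2',                # L2 regularization
    alpha=0.0001                 # regularization strength
)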
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

# A small neural network: one hidden Dense layer with 64 nodes
model = Sequential([
    Dense(64, activation='relu', input_shape=(X.shape[1],)),
    Dense(1)
])

sgd = SGD(learning_rate=0.01)
In this code we are defining a neural network with one hidden Dense layer of
64 nodes. However, setting aside the specifics of the neural network, notice
that we are again calling SGD with just two lines of code:
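Those two lines are the optimizer definition shown above and the compile call
that attaches it to the model. A minimal sketch, assuming a regression-style
mean-squared-error loss:

sgd = SGD(learning_rate=0.01)
model.compile(optimizer=sgd, loss='mean_squared_error')

# training then proceeds as usual, for example:
# model.fit(X, y, epochs=100, batch_size=32)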
4: Advantages and Challenges

4.1: Why Choose SGD?

General Applicability:
SGD can be applied to a wide range of problems and is not limited to specific
types of models. This general applicability makes it a versatile tool in the
machine learning toolbox.
Improved Generalization:
By updating the model frequently with a high degree of variance, SGD can
often lead to models that generalize better on unseen data. This is because
the algorithm is less likely to overfit to the noise in the training data.
4.2: Overcoming Challenges in SGD

Hyperparameter Tuning
SGD requires careful tuning of hyperparameters, not just the learning rate
but also parameters like momentum and the size of the mini-batch.
Utilize grid search, random search, or more advanced methods like Bayesian
optimization to find the optimal set of hyperparameters.
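For example, a grid search over a few SGD hyperparameters with scikit-learn
might look like this (the parameter grid is illustrative, and X, y are assumed
to be the training data):

from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    'eta0': [0.001, 0.01, 0.1],        # initial learning rate
    'alpha': [0.0001, 0.001, 0.01],    # regularization strength
    'penalty': ['l1', 'l2'],           # type of regularization
}

search = GridSearchCV(SGDRegressor(max_iter=1000), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)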
Overfitting
Like any machine learning algorithm, there’s a risk of overfitting, where the
model performs well on training data but poorly on unseen data.
Use regularization techniques such as L1 or L2 regularization, and validate
the model using a hold-out set or cross-validation.
5: Beyond Basic SGD

5.1: Variants of SGD

Momentum SGD
Momentum is an approach that helps accelerate SGD in the relevant
direction and dampens oscillations. It does this by adding a fraction of the
previous update vector to the current update.
It helps in faster convergence and reduces oscillations. It is particularly
useful for navigating the ravines of the cost function, where the surface
curves much more steeply in one dimension than in another.
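A minimal sketch of the momentum update, where gamma is the momentum
coefficient (often around 0.9), and weights and gradient stand for the current
parameters and the gradient computed on the current mini-batch (the names
here are illustrative):

velocity = np.zeros_like(weights)   # initialized once, before training
gamma = 0.9                         # momentum coefficient
eta = 0.01                          # learning rate

# at each step, blend the previous update with the current gradient
velocity = gamma * velocity - eta * gradient
weights = weights + velocity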
RMSprop
RMSprop (Root Mean Square Propagation) modifies Adagrad, an earlier
adaptive-learning-rate method, to address Adagrad's radically diminishing
learning rates. It uses a moving average of squared gradients to normalize the gradient.
It works well in online and non-stationary settings and has been found to be
an effective and practical optimization algorithm for neural networks.
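A minimal sketch of the RMSprop update, where avg_sq_grad is the moving
average of squared gradients (the decay rate and epsilon below are typical
illustrative values, and weights and gradient are assumed to be defined as in
the momentum sketch above):

avg_sq_grad = np.zeros_like(weights)   # initialized once, before training
decay, eta, eps = 0.9, 0.001, 1e-8

# keep a running average of squared gradients...
avg_sq_grad = decay * avg_sq_grad + (1 - decay) * gradient ** 2
# ...and scale each parameter's step by its recent gradient magnitude
weights = weights - eta * gradient / (np.sqrt(avg_sq_grad) + eps)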
Each of these variants has its own strengths and is suited for specific types of
problems. Their development reflects the ongoing effort in the machine
learning community to refine and enhance optimization algorithms to
achieve better and faster results. Understanding these variants and their
appropriate applications is crucial for anyone looking to delve deeper into
machine learning optimization techniques.
5.2: Future of SGD

Quantum SGD

The advent of quantum computing presents an opportunity to explore
quantum-enhanced variants of SGD, which aim to speed up the optimization
process by exploiting quantum hardware.
Conclusion
As we wrap up our exploration of Stochastic Gradient Descent (SGD), it’s
clear that this algorithm is much more than just a method for optimizing
machine learning models. It stands as a testament to the ingenuity and
continuous evolution in the field of artificial intelligence. From its basic
form to its more advanced variants, SGD remains a critical tool in the
machine learning toolkit, adaptable to a wide array of challenges and
applications.
If you liked the article please leave a clap, and let me know in the comments
what you think about it!