ML | Mini-Batch Gradient Descent with Python
Gradient Descent is an optimization algorithm in machine learning used to determine the optimal parameters such as weights and bias for models. The idea is to minimize the model's error by iteratively updating the parameters in the direction of the steepest descent as determined by the gradient of the loss function.
Depending on how much data is used to compute the gradient during each update, gradient descent comes in three main variants:
- Batch Gradient Descent
- Stochastic Gradient Descent (SGD)
- Mini-Batch Gradient Descent
Each variant has its own strengths and trade-offs in terms of speed, stability and convergence behavior.
[Figure: Convergence behavior of Batch, Stochastic and Mini-Batch Gradient Descent]
Working of Mini-Batch Gradient Descent
Mini-batch gradient descent is an optimization method that updates model parameters using small subsets of the training data called mini-batches. This offers a middle path between the high variance of stochastic gradient descent and the high computational cost of batch gradient descent: only a small batch of examples is used for each update, which makes training faster and more memory-efficient. It also helps stabilize convergence while still introducing beneficial randomness during learning.
It is often preferred in modern machine learning applications because it combines the benefits of both batch and stochastic approaches.
Key advantages of mini-batch gradient descent:
- Computational Efficiency: Supports parallelism and vectorized operations on GPUs or TPUs.
- Faster Convergence: Provides more frequent updates than full-batch training, which improves speed.
- Noise Reduction: Less noisy than purely stochastic updates, which leads to smoother convergence.
- Better Generalization: Introduces slight randomness to help escape local minima.
- Memory Efficiency: Doesn’t require loading the entire dataset into memory.
Algorithm:
Let:
- \theta = model parameters
- max_iters = number of epochs
- \eta = learning rate
- b = mini-batch size
For itr=1,2,3,…,max_iters:
- Shuffle the training data. It is optional but often done for better randomness in mini-batch selection.
- Split the dataset into mini-batches of size b.
For each mini-batch (X_{mini}, y_{mini}):
1. Forward Pass on the batch X_{\text{mini}}:
Make predictions on the mini-batch:
\hat{y} = f(X_{\text{mini}},\ \theta)
Compute the error in the predictions, J(\theta), with the current values of the parameters:
J(\theta) = L(\hat{y},\ y_{\text{mini}})
2. Backward Pass:
Compute gradient:
\nabla_{\theta} J(\theta) = \frac{\partial J(\theta)}{\partial \theta}
3. Update parameters:
Gradient descent rule:
\theta = \theta - \eta \nabla_{\theta} J(\theta)
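As a quick numeric illustration of this rule (made-up values, not taken from the implementation below): if a single parameter currently has value \theta = 0.5, the mini-batch gradient is \nabla_{\theta} J(\theta) = 2.0 and the learning rate is \eta = 0.1, then the update gives
\theta \leftarrow 0.5 - 0.1 \times 2.0 = 0.3
so the parameter takes a small step in the direction that lowers the loss on that mini-batch.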
Python Implementation
Here we will use Mini-Batch Gradient Descent for Linear Regression.
1. Importing Libraries
We begin by importing NumPy for numerical operations and matplotlib.pyplot for plotting.
Python
import numpy as np
import matplotlib.pyplot as plt
2. Generating Synthetic 2D Data
Here, we generate 8000 two-dimensional data points sampled from a multivariate normal distribution:
- The data is centered at the point (5.0, 6.0).
- The cov matrix defines the variances of the two features and the covariance between them. The off-diagonal value of 0.95 corresponds to a strong positive correlation between the two features.
Python
mean = np.array([5.0, 6.0])
cov = np.array([[1.0, 0.95], [0.95, 1.2]])
data = np.random.multivariate_normal(mean, cov, 8000)
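Since the samples are drawn randomly, the exact numbers will differ from run to run. If you want reproducible results, a seeded variant of the sampling step can be used instead; printing the empirical correlation also confirms the strong positive relationship implied by the covariance matrix (roughly 0.95 / sqrt(1.0 * 1.2), about 0.87). This is an optional check, not part of the original code.
Python
# Optional: seed the generator before sampling so the data is reproducible,
# then verify that the two features are strongly correlated (about 0.87 here)
np.random.seed(42)
data = np.random.multivariate_normal(mean, cov, 8000)
print(np.corrcoef(data[:, 0], data[:, 1])[0, 1])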
3. Visualizing Generated Data
Python
plt.scatter(data[:500, 0], data[:500, 1], marker='.')
plt.title("Scatter Plot of First 500 Samples")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.grid(True)
plt.show()
Output:
[Figure: scatter plot of the first 500 samples, Feature 1 vs Feature 2]
4. Splitting Data
We split the data into training and testing sets:
- Original data shape: (8000, 2)
- New shape after adding a bias column of ones: (8000, 3)
- The last column (Feature 2) is used as the target y, while the bias column and Feature 1 form the inputs X.
- 90% of the data is used for training and 10% for testing.
Python
data = np.hstack((np.ones((data.shape[0], 1)), data)) # shape: (8000, 3)
split_factor = 0.90
split = int(split_factor * data.shape[0])
X_train = data[:split, :-1]
y_train = data[:split, -1].reshape((-1, 1))
X_test = data[split:, :-1]
y_test = data[split:, -1].reshape((-1, 1))
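A quick shape check (optional, not part of the original code) confirms the split described above.
Python
print(X_train.shape, y_train.shape)  # (7200, 2) (7200, 1)
print(X_test.shape, y_test.shape)    # (800, 2) (800, 1)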
5. Displaying Datasets
Python
print("Number of examples in training set = %d" % X_train.shape[0])
print("Number of examples in testing set = %d" % X_test.shape[0])
Output:
Number of examples in training set = 7200
Number of examples in testing set = 800
6. Defining Core Functions of Linear Regression
- hypothesis(X, theta): Computes the predicted output using the linear model h(X) = X ⋅ θ
- gradient(X, y, theta): Calculates the gradient of the cost function, which is used to update the model parameters during training.
- cost(X, y, theta): Computes the least-squares cost, i.e. half of the summed squared error over the batch.
Python
# Hypothesis function: linear model h(X) = X . theta
def hypothesis(X, theta):
    return np.dot(X, theta)

# Gradient of the least-squares cost with respect to theta
def gradient(X, y, theta):
    h = hypothesis(X, theta)
    grad = np.dot(X.T, (h - y))
    return grad

# Least-squares cost (half of the summed squared error)
def cost(X, y, theta):
    h = hypothesis(X, theta)
    J = np.dot((h - y).T, (h - y)) / 2
    return J[0]
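To see what these helpers return, here is a tiny hand-made batch where the parameters already fit the data perfectly (illustrative values, not part of the tutorial's dataset).
Python
X_demo = np.array([[1.0, 2.0],
                   [1.0, 3.0]])              # bias column plus one feature
y_demo = np.array([[5.0], [7.0]])
theta_demo = np.array([[1.0], [2.0]])        # intercept 1, slope 2

print(hypothesis(X_demo, theta_demo))        # [[5.] [7.]] -> predictions match y exactly
print(cost(X_demo, y_demo, theta_demo))      # [0.]        -> zero cost
print(gradient(X_demo, y_demo, theta_demo))  # [[0.] [0.]] -> zero gradient, no update needed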
7. Creating Mini-Batches for Training
This function divides the dataset into random mini-batches used during training:
- Combines the feature matrix X and target vector y, then shuffles the data to introduce randomness.
- Splits the shuffled data into batches of size batch_size.
- Each mini-batch is a tuple (X_mini, Y_mini) used for one update step in mini-batch gradient descent.
- Also handles the case where data isn’t evenly divisible by the batch size by including the leftover samples in an extra batch.
Python
# Create shuffled mini-batches of (X, y) pairs from the dataset
def create_mini_batches(X, y, batch_size):
    mini_batches = []
    data = np.hstack((X, y))   # join features and targets so they shuffle together
    np.random.shuffle(data)
    n_minibatches = data.shape[0] // batch_size

    # Full-sized mini-batches
    for i in range(n_minibatches):
        mini_batch = data[i * batch_size:(i + 1) * batch_size, :]
        X_mini = mini_batch[:, :-1]
        Y_mini = mini_batch[:, -1].reshape((-1, 1))
        mini_batches.append((X_mini, Y_mini))

    # Leftover samples form one final, smaller batch
    if data.shape[0] % batch_size != 0:
        mini_batch = data[n_minibatches * batch_size:, :]
        X_mini = mini_batch[:, :-1]
        Y_mini = mini_batch[:, -1].reshape((-1, 1))
        mini_batches.append((X_mini, Y_mini))

    return mini_batches
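As a quick illustration of the batching behavior (a toy array, not part of the tutorial's data), 100 rows with batch_size = 32 yield three full batches plus one leftover batch of 4.
Python
X_toy = np.random.rand(100, 3)
y_toy = np.random.rand(100, 1)
batches = create_mini_batches(X_toy, y_toy, batch_size=32)
print(len(batches))                         # 4
print([Xb.shape[0] for Xb, _ in batches])   # [32, 32, 32, 4]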
8. Mini-Batch Gradient Descent Function
This function performs mini-batch gradient descent to train the linear regression model:
- Initialization: The weights theta are initialized to zeros and an empty list error_list tracks the cost over time.
- Training loop: For a fixed number of iterations (max_iters), the dataset is divided into mini-batches.
- For each mini-batch: the gradient is computed, theta is updated to reduce the cost and the current error is recorded to track training progress.
Python
# Mini-batch gradient descent for linear regression
def gradientDescent(X, y, learning_rate=0.001, batch_size=32):
    theta = np.zeros((X.shape[1], 1))
    error_list = []
    max_iters = 3
    for itr in range(max_iters):
        mini_batches = create_mini_batches(X, y, batch_size)
        for X_mini, y_mini in mini_batches:
            # Update step: theta = theta - eta * gradient
            theta = theta - learning_rate * gradient(X_mini, y_mini, theta)
            error_list.append(cost(X_mini, y_mini, theta))
    return theta, error_list
9. Training and Visualization
The model is trained using gradientDescent() on the training data. After training:
- theta[0] is the bias term (intercept).
- theta[1:] contains the feature weights (coefficients).
- The plot shows how the cost decreases as the model learns, indicating convergence of the algorithm.
This provides a visual and quantitative insight into how well the mini-batch gradient descent is optimizing the regression model.
Python
theta, error_list = gradientDescent(X_train, y_train)
print("Bias = ", theta[0])
print("Coefficients = ", theta[1:])
# visualising gradient descent
plt.plot(error_list)
plt.xlabel("Number of iterations")
plt.ylabel("Cost")
plt.show()
Output:
[Figure: cost decreasing over mini-batch updates during training of the regression model]
10. Final Prediction and Evaluation
Prediction: The hypothesis() function is used to compute predicted values for the test set.
Visualization:
- A scatter plot shows actual test values.
- A line plot overlays the predicted values, helping to visually assess model performance.
Evaluation:
- Computes Mean Absolute Error (MAE) to measure average prediction deviation.
- A lower MAE indicates better accuracy of the model.
Python
# Predicting output for X_test
y_pred = hypothesis(X_test, theta)
# Visualizing predictions vs actual values
plt.scatter(X_test[:, 1], y_test, marker='.', label='Actual')
plt.plot(X_test[:, 1], y_pred, color='orange', label='Predicted')
plt.xlabel("Feature 1")
plt.ylabel("Target")
plt.title("Model Predictions vs Actual Values")
plt.legend()
plt.grid(True)
plt.show()
# Calculating mean absolute error
error = np.sum(np.abs(y_test - y_pred)) / y_test.shape[0]
print("Mean Absolute Error =", error)
Output:
[Figure: actual test values (scatter) with the model's predicted line (orange) overlaid]
The orange line represents the final hypothesis function learned by the model, i.e. ŷ = θ[0] + θ[1] * X_test[:, 1], where:
- θ[0] is the bias (intercept)
- θ[1] is the weight for Feature 1 (the model predicts Feature 2 from Feature 1)
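If scikit-learn is installed, the same error metric can be computed with its built-in helper as a cross-check (an extra dependency, not used in the original code).
Python
from sklearn.metrics import mean_absolute_error

print("Mean Absolute Error =", mean_absolute_error(y_test, y_pred))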
Comparison Between Gradient Descent Variants
Let's look at a quick comparison of Batch Gradient Descent, Stochastic Gradient Descent (SGD) and Mini-Batch Gradient Descent.

| Type | Update Strategy | Speed & Efficiency | Noise in Updates |
|---|---|---|---|
| Batch Gradient Descent | Updates parameters after computing the gradient over the entire training dataset | Slow, as it processes the full dataset before each update | Smooth and stable |
| Stochastic Gradient Descent (SGD) | Updates parameters after computing the gradient on a single training example | Faster updates, but cannot fully utilize vectorized computations | Highly noisy updates |
| Mini-Batch Gradient Descent | Updates parameters using a small batch (subset) of training examples | Efficient; leverages vectorization for faster computation | Moderate noise, dependent on batch size |
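In terms of the implementation above, the three variants differ only in the batch_size passed to gradientDescent(). The sketch below assumes the training arrays defined earlier; note that gradient() sums over the batch, so the learning rate and the number of epochs would need to be re-tuned for the full-batch and single-example cases rather than reused as-is.
Python
n = X_train.shape[0]

# Batch gradient descent: one update per epoch over the whole training set
# (learning rate scaled down here because the gradient is a sum over n examples;
# with only 3 epochs this would still need far more iterations to converge)
theta_batch, _ = gradientDescent(X_train, y_train,
                                 learning_rate=0.001 / n, batch_size=n)

# Stochastic gradient descent: one update per single training example
theta_sgd, _ = gradientDescent(X_train, y_train, batch_size=1)

# Mini-batch gradient descent: the compromise used throughout this article
theta_mini, _ = gradientDescent(X_train, y_train, batch_size=32)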