ML | Mini-Batch Gradient Descent with Python
Last Updated: 05 Jul, 2025
Gradient Descent is an optimization algorithm in machine learning used to determine the optimal parameters such as weights and bias for models. The idea is to minimize the model's error by iteratively updating the parameters in the direction of the steepest descent as determined by the gradient of the loss function.
Depending on how much data is used to compute the gradient during each update, gradient descent comes in three main variants:
- Batch Gradient Descent
- Stochastic Gradient Descent (SGD)
- Mini-Batch Gradient Descent
Each variant has its own strengths and trade-offs in terms of speed, stability and convergence behavior.
Convergence in BGD, SGD & MBGD
Working of Mini-Batch Gradient Descent
Mini-batch gradient descent is an optimization method that updates model parameters using small subsets of the training data called mini-batches. This technique offers a middle path between the high variance of stochastic gradient descent and the high computational cost of batch gradient descent. Because only a small batch of samples is used to perform each update, training is faster and more memory-efficient. The mini-batches also help stabilize convergence and introduce beneficial randomness during learning.
It is often preferred in modern machine learning applications because it combines the benefits of both batch and stochastic approaches.
Key advantages of mini-batch gradient descent:
- Computational Efficiency: Supports parallelism and vectorized operations on GPUs or TPUs.
- Faster Convergence: Provides more frequent parameter updates than full-batch gradient descent, which improves training speed (see the example after this list).
- Noise Reduction: Less noisy than purely stochastic updates, which leads to smoother convergence.
- Better Generalization: Introduces slight randomness to help escape local minima.
- Memory Efficiency: Doesn’t require loading the entire dataset into memory.
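For example, with 8,000 training samples and a batch size of 32, one epoch of mini-batch gradient descent performs 8,000 / 32 = 250 parameter updates, whereas batch gradient descent makes a single update per epoch and stochastic gradient descent makes 8,000.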
Algorithm:
Let:
- \theta = model parameters
- max_iters = number of epochs
- \eta = learning rate
- b = mini-batch size
For itr = 1, 2, 3, …, max_iters:
- Shuffle the training data. This is optional but commonly done so that mini-batches are drawn in a random order.
- Split the dataset into mini-batches of size b.
For each mini-batch (X_{mini}, y_{mini}):
1. Forward Pass on the batch X_{mini}:
Make predictions on the mini-batch:
\hat{y} = f(X_{mini}, \theta)
Compute the error of the predictions J(\theta) with the current parameter values:
J(\theta) = L(\hat{y}, y_{mini})
2. Backward Pass:
Compute gradient:
\nabla_{\theta} J(\theta) = \frac{\partial J(\theta)}{\partial \theta}
3. Update parameters:
Gradient descent rule:
\theta = \theta - \eta \nabla_{\theta} J(\theta)
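Before turning to the concrete linear-regression implementation, the procedure above can be sketched as a generic training loop. This is only a minimal illustration: minibatch_gd and its compute_gradient callback are assumed names for this sketch, not functions used later in the article.
Python
import numpy as np

def minibatch_gd(X, y, compute_gradient, lr=0.01, batch_size=32, max_iters=100):
    theta = np.zeros((X.shape[1], 1))            # initialize parameters to zeros
    n = X.shape[0]
    for _ in range(max_iters):                   # one epoch per outer iteration
        perm = np.random.permutation(n)          # shuffle the training data
        for start in range(0, n, batch_size):    # walk over the mini-batches
            idx = perm[start:start + batch_size]
            X_mini, y_mini = X[idx], y[idx]
            theta -= lr * compute_gradient(X_mini, y_mini, theta)  # update rule
    return theta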
Python Implementation
Here we will use Mini-Batch Gradient Descent for Linear Regression.
1. Importing Libraries
We begin by importing the required libraries: NumPy and matplotlib.pyplot.
Python
import numpy as np
import matplotlib.pyplot as plt
2. Generating Synthetic 2D Data
Here, we generate 8000 two-dimensional data points sampled from a multivariate normal distribution:
- The data is centered at the point (5.0, 6.0).
- The cov matrix defines the variance of each feature and the covariance between them. The off-diagonal value of 0.95 indicates a strong positive correlation between the two features.
Python
mean = np.array([5.0, 6.0])
cov = np.array([[1.0, 0.95], [0.95, 1.2]])
data = np.random.multivariate_normal(mean, cov, 8000)
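As an optional check that is not part of the original article, the strength of the relationship can be verified by computing the empirical correlation of the generated samples; for the covariance matrix above the theoretical correlation is 0.95 / sqrt(1.0 * 1.2) ≈ 0.87.
Python
# Empirical correlation between the two features (should be close to ~0.87)
corr = np.corrcoef(data[:, 0], data[:, 1])[0, 1]
print("Empirical correlation:", corr)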
3. Visualizing Generated Data
Python
plt.scatter(data[:500, 0], data[:500, 1], marker='.')
plt.title("Scatter Plot of First 500 Samples")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.grid(True)
plt.show()
Output:
Scatter plot of the first 500 samples
4. Splitting Data
We split the data into training and testing sets:
- Original data shape: (8000, 2)
- Shape after prepending a bias column of ones: (8000, 3)
- The first two columns (bias and Feature 1) form X; the last column (Feature 2) is used as the target y.
- 90% of the data is used for training and 10% for testing.
Python
data = np.hstack((np.ones((data.shape[0], 1)), data)) # shape: (8000, 3)
split_factor = 0.90
split = int(split_factor * data.shape[0])
X_train = data[:split, :-1]
y_train = data[:split, -1].reshape((-1, 1))
X_test = data[split:, :-1]
y_test = data[split:, -1].reshape((-1, 1))
5. Displaying Datasets
Python
print("Number of examples in training set = %d" % X_train.shape[0])
print("Number of examples in testing set = %d" % X_test.shape[0])
Output:
Number of examples in training set = 7200
Number of examples in testing set = 800
6. Defining Core Functions of Linear Regression
- hypothesis(X, theta): Computes the predicted output using the linear model h(X) = X⋅θ.
- gradient(X, y, theta): Calculates the gradient of the cost function, which is used to update the model parameters during training.
- cost(X, y, theta): Computes the squared-error cost (half the sum of squared errors over the batch), used to track training progress.
Python
# Hypothesis function
def hypothesis(X, theta):
    return np.dot(X, theta)

# Gradient of the cost function
def gradient(X, y, theta):
    h = hypothesis(X, theta)
    grad = np.dot(X.T, (h - y))
    return grad

# Squared-error cost
def cost(X, y, theta):
    h = hypothesis(X, theta)
    J = np.dot((h - y).T, (h - y)) / 2
    return J[0]
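As a quick, optional sanity check (not part of the original walkthrough), the shapes flowing through these functions can be verified on a tiny hand-made batch; X_demo, y_demo and theta_demo below are hypothetical names used only for this check.
Python
# Assumed shapes: X is (m, n), y is (m, 1), theta is (n, 1)
X_demo = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])  # 3 samples: bias + 1 feature
y_demo = np.array([[5.0], [7.0], [9.0]])
theta_demo = np.zeros((2, 1))

print(hypothesis(X_demo, theta_demo).shape)         # (3, 1) -> one prediction per sample
print(gradient(X_demo, y_demo, theta_demo).shape)   # (2, 1) -> same shape as theta
print(cost(X_demo, y_demo, theta_demo))             # length-1 array holding the current cost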
7. Creating Mini-Batches for Training
This function divides the dataset into random mini-batches used during training:
- Combines the feature matrix X and target vector y, then shuffles the data to introduce randomness.
- Splits the shuffled data into batches of size batch_size.
- Each mini-batch is a tuple (X_mini, Y_mini) used for one update step in mini-batch gradient descent.
- Also handles the case where data isn’t evenly divisible by the batch size by including the leftover samples in an extra batch.
Python
# Create mini-batches from the dataset
def create_mini_batches(X, y, batch_size):
    mini_batches = []
    data = np.hstack((X, y))
    np.random.shuffle(data)
    n_minibatches = data.shape[0] // batch_size

    for i in range(n_minibatches):
        mini_batch = data[i * batch_size:(i + 1) * batch_size, :]
        X_mini = mini_batch[:, :-1]
        Y_mini = mini_batch[:, -1].reshape((-1, 1))
        mini_batches.append((X_mini, Y_mini))

    # Put any leftover samples into one final, smaller batch
    if data.shape[0] % batch_size != 0:
        mini_batch = data[n_minibatches * batch_size:, :]
        X_mini = mini_batch[:, :-1]
        Y_mini = mini_batch[:, -1].reshape((-1, 1))
        mini_batches.append((X_mini, Y_mini))
    return mini_batches
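A small, optional check (an addition to the original article) confirms the batching logic on the training split, which has 7,200 rows:
Python
# 7200 // 32 = 225 full batches and no leftover batch, since 7200 is divisible by 32
batches = create_mini_batches(X_train, y_train, batch_size=32)
print(len(batches))         # 225
print(batches[0][0].shape)  # (32, 2) -> features of the first mini-batch
print(batches[0][1].shape)  # (32, 1) -> targets of the first mini-batch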
8. Mini-Batch Gradient Descent Function
This function performs mini-batch gradient descent to train the linear regression model:
- Initialization: The weights theta are initialized to zeros and an empty list error_list tracks the cost over time.
- Training Loop: For a fixed number of epochs (max_iters), the dataset is divided into mini-batches.
- For each mini-batch: compute the gradient, update theta to reduce the cost and record the current error to track training progress.
Python
# Mini-batch gradient descent
def gradientDescent(X, y, learning_rate=0.001, batch_size=32):
    theta = np.zeros((X.shape[1], 1))
    error_list = []
    max_iters = 3
    for itr in range(max_iters):
        mini_batches = create_mini_batches(X, y, batch_size)
        for X_mini, y_mini in mini_batches:
            theta = theta - learning_rate * gradient(X_mini, y_mini, theta)
            error_list.append(cost(X_mini, y_mini, theta))
    return theta, error_list
9. Training and Visualization
The model is trained by calling gradientDescent() on the training data. After training:
- theta[0] is the bias term (intercept).
- theta[1:] contains the feature weights (coefficients).
- The plot shows how the cost decreases after each mini-batch update, illustrating the convergence of the algorithm.
This provides visual and quantitative insight into how well mini-batch gradient descent is optimizing the regression model.
Python
theta, error_list = gradientDescent(X_train, y_train)
print("Bias = ", theta[0])
print("Coefficients = ", theta[1:])
# visualising gradient descent
plt.plot(error_list)
plt.xlabel("Number of iterations")
plt.ylabel("Cost")
plt.show()
Output:
Cost curve of mini-batch gradient descent on the regression model
10. Final Prediction and Evaluation
Prediction: The hypothesis() function is used to compute predicted values for the test set.
Visualization:
- A scatter plot shows actual test values.
- A line plot overlays the predicted values, helping to visually assess model performance.
Evaluation:
- Computes Mean Absolute Error (MAE) to measure average prediction deviation.
- A lower MAE indicates better accuracy of the model.
Python
# Predicting output for X_test
y_pred = hypothesis(X_test, theta)
# Visualizing predictions vs actual values
plt.scatter(X_test[:, 1], y_test, marker='.', label='Actual')
plt.plot(X_test[:, 1], y_pred, color='orange', label='Predicted')
plt.xlabel("Feature 1")
plt.ylabel("Target")
plt.title("Model Predictions vs Actual Values")
plt.legend()
plt.grid(True)
plt.show()
# Calculating mean absolute error
error = np.sum(np.abs(y_test - y_pred)) / y_test.shape[0]
print("Mean Absolute Error =", error)
Output:
Model predictions vs actual values
The orange line represents the final hypothesis function, i.e. ŷ = θ[0] + θ[1] * X_test[:, 1].
This is the linear equation learned by the model, where:
- θ[0] is the bias (intercept)
- θ[1] is the weight for Feature 1
Note that Feature 2 serves as the target y, so the model has only these two parameters.
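As an optional sanity check that is not part of the original walkthrough, the learned parameters can be compared with the ordinary least-squares solution from np.linalg.lstsq. The two should roughly agree, although with only 3 epochs of training the gradient-descent estimates may still differ somewhat from the exact solution.
Python
# Closed-form least-squares fit on the training data, for comparison
theta_ols, _, _, _ = np.linalg.lstsq(X_train, y_train, rcond=None)
print("OLS parameters:\n", theta_ols)
print("Mini-batch GD parameters:\n", theta)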
Comparison Between Gradient Descent Variants
Let's look at a quick comparison of Batch Gradient Descent, Stochastic Gradient Descent (SGD) and Mini-Batch Gradient Descent.
| Type | Update Strategy | Speed & Efficiency | Noise in Updates |
|---|---|---|---|
| Batch Gradient Descent | Updates parameters after computing the gradient over the entire training dataset | Slow, as it processes the full dataset before each update | Smooth and stable |
| Stochastic Gradient Descent (SGD) | Updates parameters after computing the gradient for one training example | Fast updates, but cannot fully exploit vectorized computation | Highly noisy updates |
| Mini-Batch Gradient Descent | Updates parameters using a small batch (subset) of training examples | Efficient; leverages vectorization for faster computation | Moderate noise; depends on batch size |
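Because gradientDescent() takes the batch size as a parameter, the same code can emulate all three variants. This is only a rough sketch: the learning rates below are illustrative guesses and would likely need tuning, especially for full-batch updates where gradient() sums (rather than averages) over the batch.
Python
# Mini-batch GD (as used above): batches of 32 examples
theta_mb, _ = gradientDescent(X_train, y_train, batch_size=32)

# Stochastic GD: one example per update (very noisy cost curve)
theta_sgd, _ = gradientDescent(X_train, y_train, batch_size=1)

# Batch GD: the whole training set per update; a much smaller learning rate
# keeps the summed gradient from blowing up the updates
theta_bgd, _ = gradientDescent(X_train, y_train, learning_rate=1e-6,
                               batch_size=X_train.shape[0])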