ML | Mini-Batch Gradient Descent with Python

Last Updated : 05 Jul, 2025

Gradient Descent is an optimization algorithm in machine learning used to find optimal model parameters such as weights and biases. The idea is to minimize the model's error by iteratively updating the parameters in the direction of steepest descent, as given by the gradient of the loss function.

Depending on how much data is used to compute the gradient during each update, gradient descent comes in three main variants:

  • Batch Gradient Descent
  • Stochastic Gradient Descent (SGD)
  • Mini-Batch Gradient Descent

Each variant has its own strengths and trade-offs in terms of speed, stability and convergence behavior.

[Figure: Convergence behavior of Batch, Stochastic and Mini-Batch Gradient Descent]
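
The sketch below is a minimal, self-contained illustration (the toy data and helper names are assumptions of this write-up, not code from later sections) of how the three variants differ in the amount of data used for a single parameter update:

Python
import numpy as np

# Toy linear-regression data: y is roughly 2x + 1 with noise
rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.uniform(0, 1, 100)]   # bias column + one feature
y = (2 * X[:, 1] + 1 + 0.1 * rng.standard_normal(100)).reshape(-1, 1)

def grad(X_b, y_b, theta):
    # Gradient of the mean squared error over the given batch
    return X_b.T @ (X_b @ theta - y_b) / len(y_b)

theta, lr = np.zeros((2, 1)), 0.1

# Batch gradient descent: one update per epoch, using the entire dataset
theta -= lr * grad(X, y, theta)

# Stochastic gradient descent: one update per single training example
i = rng.integers(len(y))
theta -= lr * grad(X[i:i + 1], y[i:i + 1], theta)

# Mini-batch gradient descent: one update per small random batch (here 16 samples)
idx = rng.choice(len(y), size=16, replace=False)
theta -= lr * grad(X[idx], y[idx], theta)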

Working of Mini-Batch Gradient Descent

Mini-batch gradient descent is an optimization method that updates model parameters using small subsets of the training data called mini-batches. It offers a middle path between the high variance of stochastic gradient descent and the high computational cost of batch gradient descent. Because only a small subset of samples is used for each update, training is faster and more memory-efficient, convergence is more stable and the slight randomness introduced during learning can be beneficial.

It is often preferred in modern machine learning applications because it combines the benefits of both batch and stochastic approaches.

Key advantages of mini-batch gradient descent:

  • Computational Efficiency: Supports parallelism and vectorized operations on GPUs or TPUs.
  • Faster Convergence: Provides more frequent updates than full-batch gradient descent, which improves training speed.
  • Noise Reduction: Less noisy than stochastic updates, which leads to smoother convergence.
  • Better Generalization: Introduces slight randomness that can help escape local minima.
  • Memory Efficiency: Doesn’t require loading the entire dataset into memory.

Algorithm:

Let:

  • \theta = model parameters
  • max_iters = number of epochs
  • \eta = learning rate

For itr=1,2,3,…,max_iters:

  • Shuffle the training data (optional, but commonly done so that each mini-batch is a random sample of the dataset).
  • Split the dataset into mini-batches of size b.

For each mini-batch (X_{\text{mini}},\ y_{\text{mini}}):

1. Forward Pass on the batch X_mini:

Make predictions on the mini-batch

\hat{y} = f(X_{\text{mini}},\ \theta)

Compute the loss J(\theta) with the current values of the parameters

J(\theta) = L(\hat{y},\ y_{\text{mini}})

2. Backward Pass:

Compute gradient:

\nabla_{\theta} J(\theta) = \frac{\partial J(\theta)}{\partial \theta}

3. Update parameters:

Gradient descent rule: 

\theta = \theta - \eta \nabla_{\theta} J(\theta)
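
To make these three steps concrete, here is a single update worked out in NumPy on a tiny, made-up mini-batch of two points for a linear model \hat{y} = \theta_0 + \theta_1 x with half squared-error loss (the numbers are purely illustrative):

Python
import numpy as np

# One tiny mini-batch: two samples with a bias column; targets roughly follow y = 2x
X_mini = np.array([[1.0, 1.0],
                   [1.0, 2.0]])        # columns: bias, x
y_mini = np.array([[2.1], [3.9]])
theta = np.zeros((2, 1))               # initial parameters
eta = 0.1                              # learning rate

# 1. Forward pass: predictions with the current parameters
y_hat = X_mini @ theta                 # [[0.], [0.]]

# 2. Backward pass: gradient of the half squared error w.r.t. theta
grad = X_mini.T @ (y_hat - y_mini)     # [[-6.0], [-9.9]]

# 3. Parameter update
theta = theta - eta * grad             # [[0.6], [0.99]]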

Python Implementation

Here we will use Mini-Batch Gradient Descent for Linear Regression.

1. Importing Libraries

We begin by importing NumPy for numerical computation and Matplotlib for plotting.

Python
import numpy as np
import matplotlib.pyplot as plt

2. Generating Synthetic 2D Data

Here, we generate 8000 two-dimensional data points sampled from a multivariate normal distribution:

  • The data is centered at the point (5.0, 6.0).
  • The covariance matrix cov defines the variance of each feature and the covariance between them; the off-diagonal value of 0.95 indicates a strong positive correlation between the two features.
Python
mean = np.array([5.0, 6.0])
cov = np.array([[1.0, 0.95], [0.95, 1.2]])
data = np.random.multivariate_normal(mean, cov, 8000)
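
If you want the numbers and plots below to be reproducible across runs, you can optionally fix NumPy's random seed before running the sampling code above (an optional addition, not part of the original listing):

Python
np.random.seed(42)  # optional: run before sampling so the data and shuffled mini-batches are reproducible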

3. Visualizing Generated Data

Python
plt.scatter(data[:500, 0], data[:500, 1], marker='.')
plt.title("Scatter Plot of First 500 Samples")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.grid(True)
plt.show()

Output:

[Scatter plot of the first 500 generated samples, showing the strong positive correlation between Feature 1 and Feature 2]

4. Splitting Data

We split the data into training and testing sets:

  • Original data shape: (8000, 2)
  • New shape after adding bias: (8000, 3)
  • 90% of the data is used for training and 10% for testing.
Python
# Prepend a column of ones as the bias (intercept) term
data = np.hstack((np.ones((data.shape[0], 1)), data))  # shape: (8000, 3)

split_factor = 0.90
split = int(split_factor * data.shape[0])

# X holds the bias column and Feature 1; the last column (Feature 2) is used as the target
X_train = data[:split, :-1]
y_train = data[:split, -1].reshape((-1, 1))
X_test = data[split:, :-1]
y_test = data[split:, -1].reshape((-1, 1))

5. Displaying Datasets

Python
print("Number of examples in training set = %d" % X_train.shape[0])
print("Number of examples in testing set = %d" % X_test.shape[0])

Output:

Number of examples in training set = 7200
Number of examples in testing set = 800

6. Defining Core Functions of Linear Regression

  • hypothesis(X, theta): Computes the predicted output using the linear model h(X) = X⋅θ
  • gradient(X, y, theta): Calculates the gradient of the cost function, which is used to update the model parameters during training.
  • cost(X, y, theta): Computes the cost as half the sum of squared errors over the batch (proportional to the Mean Squared Error).
Python
# Hypothesis function
def hypothesis(X, theta):
    return np.dot(X, theta)

# Gradient of the cost function
def gradient(X, y, theta):
    h = hypothesis(X, theta)
    grad = np.dot(X.T, (h - y))
    return grad

# Cost: half the sum of squared errors over the batch
def cost(X, y, theta):
    h = hypothesis(X, theta)
    J = np.dot((h - y).T, (h - y)) / 2
    return J[0]
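
As a quick sanity check (an addition to the original article, not required for training), the analytic gradient above can be compared against a numerical finite-difference approximation on a few samples; the two should agree closely:

Python
# Optional: verify the analytic gradient with a central finite-difference estimate
X_check = X_train[:5]
y_check = y_train[:5]
theta_check = np.random.randn(X_check.shape[1], 1)

eps = 1e-6
numeric_grad = np.zeros_like(theta_check)
for j in range(theta_check.shape[0]):
    theta_plus, theta_minus = theta_check.copy(), theta_check.copy()
    theta_plus[j] += eps
    theta_minus[j] -= eps
    numeric_grad[j] = (cost(X_check, y_check, theta_plus) - cost(X_check, y_check, theta_minus)) / (2 * eps)

print(np.allclose(numeric_grad, gradient(X_check, y_check, theta_check), atol=1e-4))  # expected: True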

7. Creating Mini-Batches for Training

This function divides the dataset into random mini-batches used during training:

  • Combines the feature matrix X and target vector y, then shuffles the data to introduce randomness.
  • Splits the shuffled data into batches of size batch_size.
  • Each mini-batch is a tuple (X_mini, Y_mini) used for one update step in mini-batch gradient descent.
  • Also handles the case where data isn’t evenly divisible by the batch size by including the leftover samples in an extra batch.
Python
# Create mini-batches from the dataset
def create_mini_batches(X, y, batch_size):
    mini_batches = []
    data = np.hstack((X, y))
    np.random.shuffle(data)
    n_minibatches = data.shape[0] // batch_size

    # Full-sized mini-batches
    for i in range(n_minibatches):
        mini_batch = data[i * batch_size:(i + 1) * batch_size, :]
        X_mini = mini_batch[:, :-1]
        Y_mini = mini_batch[:, -1].reshape((-1, 1))
        mini_batches.append((X_mini, Y_mini))

    # Leftover samples (when the data size is not a multiple of batch_size)
    if data.shape[0] % batch_size != 0:
        mini_batch = data[n_minibatches * batch_size:, :]
        X_mini = mini_batch[:, :-1]
        Y_mini = mini_batch[:, -1].reshape((-1, 1))
        mini_batches.append((X_mini, Y_mini))

    return mini_batches
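
For illustration (a quick check added here, not in the original article), you can inspect the batches this function produces for the training set:

Python
# Inspect the mini-batches produced for the training set
mini_batches = create_mini_batches(X_train, y_train, batch_size=32)
print("Number of mini-batches:", len(mini_batches))      # 7200 / 32 = 225
print("Shape of one X_mini:", mini_batches[0][0].shape)  # (32, 2)
print("Shape of one Y_mini:", mini_batches[0][1].shape)  # (32, 1)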

8. Mini-Batch Gradient Descent Function

This function performs mini-batch gradient descent to train the linear regression model:

  • Initialization: Weights theta are initialized to zeros and an empty list error_list tracks the cost over time.
  • Training Loop: For a fixed number of epochs (max_iters), the dataset is split into fresh mini-batches.
  • For each mini-batch: the gradient is computed, theta is updated to reduce the cost and the current error is recorded to track training progress.
Python
# Mini-batch gradient descent
def gradientDescent(X, y, learning_rate=0.001, batch_size=32):
    theta = np.zeros((X.shape[1], 1))
    error_list = []
    max_iters = 3

    for itr in range(max_iters):
        mini_batches = create_mini_batches(X, y, batch_size)
        for X_mini, y_mini in mini_batches:
            theta = theta - learning_rate * gradient(X_mini, y_mini, theta)
            error_list.append(cost(X_mini, y_mini, theta))

    return theta, error_list

9. Training and Visualization

The model is trained using gradientDescent() on the training data. After training:

  • theta[0] is the bias term (intercept).
  • theta[1:] contains the feature weight(s); here a single coefficient for Feature 1.
  • The plot shows how the cost decreases as the model learns, indicating convergence of the algorithm.

This provides visual and quantitative insight into how well mini-batch gradient descent is optimizing the regression model.

Python
theta, error_list = gradientDescent(X_train, y_train)
print("Bias = ", theta[0])
print("Coefficients = ", theta[1:])

# visualising gradient descent
plt.plot(error_list)
plt.xlabel("Number of iterations")
plt.ylabel("Cost")
plt.show()

Output:

[Printed bias and coefficient values, followed by a plot of the cost decreasing over mini-batch iterations]

10. Final Prediction and Evaluation

Prediction: The hypothesis() function is used to compute predicted values for the test set.

Visualization:

  • A scatter plot shows actual test values.
  • A line plot overlays the predicted values, helping to visually assess model performance.

Evaluation:

  • Computes Mean Absolute Error (MAE) to measure average prediction deviation.
  • A lower MAE indicates better accuracy of the model.
Python
# Predicting output for X_test
y_pred = hypothesis(X_test, theta)

# Visualizing predictions vs actual values
plt.scatter(X_test[:, 1], y_test, marker='.', label='Actual')
plt.plot(X_test[:, 1], y_pred, color='orange', label='Predicted')
plt.xlabel("Feature 1")
plt.ylabel("Target")
plt.title("Model Predictions vs Actual Values")
plt.legend()
plt.grid(True)
plt.show()

# Calculating mean absolute error
error = np.sum(np.abs(y_test - y_pred)) / y_test.shape[0]
print("Mean Absolute Error =", error)

Output:

[Scatter plot of actual test values with the predicted regression line overlaid, followed by the Mean Absolute Error]

The orange line represents the final hypothesis function, i.e. ŷ = θ[0] + θ[1] · X_test[:, 1]

This is the linear equation learned by the model where:

  • θ[0] is the bias (intercept)
  • θ[1] is the weight for Feature 1 (the single input feature, since X_test contains only the bias column and Feature 1)
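
As an optional cross-check (an addition to the original article), the parameters learned by mini-batch gradient descent can be compared with the closed-form ordinary least squares solution, which they should approach as training progresses:

Python
# Optional: compare with the closed-form least squares solution
theta_ols, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
print("Closed-form parameters:\n", theta_ols)
print("Mini-batch GD parameters:\n", theta)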

Comparison Between Gradient Descent Variants

Let's look at the key differences between Batch Gradient Descent, Stochastic Gradient Descent (SGD) and Mini-Batch Gradient Descent.

| Type | Update Strategy | Speed & Efficiency | Noise in Updates |
|---|---|---|---|
| Batch Gradient Descent | Updates parameters after computing the gradient on the entire training dataset | Slow, as it processes the full dataset before each update | Smooth and stable |
| Stochastic Gradient Descent (SGD) | Updates parameters after computing the gradient on one training example | Faster updates, but cannot fully utilize vectorized computations | Highly noisy updates |
| Mini-Batch Gradient Descent | Updates parameters using a small batch (subset) of training examples | Efficient; leverages vectorization for faster computation | Moderate noise, dependent on batch size |
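
As a side note (an illustration added here, not in the original article), the gradientDescent function defined above covers all three variants through its batch_size argument. Because the gradient is summed over the batch and max_iters is fixed at 3 inside the function, the learning rate and number of epochs would need retuning for the extreme settings to train well; the calls below only illustrate the mapping:

Python
# batch_size maps gradientDescent onto the three variants
n = X_train.shape[0]
theta_sgd, _  = gradientDescent(X_train, y_train, learning_rate=0.001, batch_size=1)        # stochastic: one sample per update
theta_mini, _ = gradientDescent(X_train, y_train, learning_rate=0.001, batch_size=32)       # mini-batch: as used above
theta_full, _ = gradientDescent(X_train, y_train, learning_rate=0.001 / n, batch_size=n)    # batch: scaled learning rate, one update per epoch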
