Mini-Batch Gradient Descent in Deep Learning

Last Updated : 23 Jul, 2025

Mini-batch gradient descent is a variant of the traditional gradient descent algorithm used to optimize the parameters i.e weights and biases of a neural network. It divides the training data into small subsets called mini-batches allowing the model to update its parameters more frequently compared to using the entire dataset at once.

The update rule for Mini-Batch Gradient Descent is:

\theta := \theta - \frac{\alpha}{m} \sum_{i=1}^{m} \nabla_{\theta} \mathcal{L}(\theta; x^{(i)}, y^{(i)})

Where:

x^{(i)}, y^{(i)} are the input features and labels of the i-th data point in the mini-batch.
\nabla_{\theta} \mathcal{L}(\theta; x^{(i)}, y^{(i)}) is the gradient of the loss function with respect to \theta for the i-th data point.

Instead of updating weights after calculating the error for each data point (in stochastic gradient descent) or after the entire dataset (in batch gradient descent), mini-batch gradient descent updates the model’s parameters after processing a mini-batch of data. This provides a balance between computational efficiency and convergence stability.

Why to use Mini-Batch Gradient Descent?

The primary reason for using mini-batches is to improve both the computational efficiency and the convergence rate of the training process. By processing smaller subsets of data at a time we can update the weights more frequently than batch gradient descent while avoiding the noisiness of stochastic gradient descent.

It can be summarized as a method that combines the benefits of both batch and stochastic gradient descent:

Batch gradient descent uses the entire dataset for each iteration, which can be computationally expensive.
Stochastic gradient descent updates the model after each training sample but this can lead to noisy updates and fluctuations in the training process.

Mini-batch gradient descent offers a middle ground making it more efficient and often leading to faster convergence.

How Mini-Batch Gradient Descent Works

Splitting the Data: The training dataset is divided into smaller mini-batches. Each mini-batch contains a subset of data points. For example if the dataset has 10,000 examples we might split it into 100 mini-batches each containing 100 data points.
Computing the Gradient: For each mini-batch the gradient of the loss function is computed and used to update the model’s parameters. The loss is averaged over the mini-batch which helps in reducing the noise compared to the SGD approach.
Updating the Parameters: Once the gradient is computed for a mini-batch the model’s parameters are updated using the learning rate and the gradient. This step is repeated for each mini-batch and the process continues until all mini-batches have been processed.
Epochs: An epoch refers to one complete pass through the entire dataset. After each epoch the mini-batches are typically reshuffled to ensure that the model does not overfit to the specific order of the data.

How to Choose the Mini-Batch Size

Choosing the right mini-batch size is crucial for the effectiveness of the model. Here are a few considerations:

Small Batch Size: Usually between 32 and 128. This may result in more frequent updates but it can also introduce noise into the optimization process. It is suitable for smaller models or when computational resources are limited.
Large Batch Size: Larger than 512. This results in more stable updates but can increase memory consumption. It is ideal when training on powerful hardware like GPUs or TPUs.
General Guideline: Start with a batch size of 64 or 128 and adjust based on the training behavior. If training is too slow or unstable then experiment with smaller or larger batch sizes.

Optimizing Mini-Batch Gradient Descent

Learning Rate Schedulers: Dynamic adjustment of the learning rate during training can improve the performance of mini-batch gradient descent by helping the model converge more smoothly.
Momentum: Using momentum with mini-batch gradient descent can accelerate convergence by helping the model overcome oscillations and reach the global minimum faster.
Adam Optimizer: Adam optimizer combines the benefits of both momentum and RMSProp optimizers and is widely used with mini-batch gradient descent for faster and more reliable convergence.

1. Implementing Mini Batch Gradient Descent in TensorFlow

In the following code mini-batch gradient descent is used by setting batch_size=32 in model.fit() meaning the model processes 32 data points at a time before updating the weights. We are using tensorflow here.

A custom callback BatchLossHistory is created to track the loss after each mini-batch.
The on_train_batch_end method captures the loss after each batch
While on_epoch_begin resets the loss list at the start of each epoch.
The recorded losses are then plotted to visualize the loss per mini-batch during training.

Python

import tensorflow as tf
from tensorflow.keras import layers, models
import numpy as np
import matplotlib.pyplot as plt

X_train = np.random.rand(1000, 20)
y_train = np.random.randint(0, 2, size=(1000, 1))

model = models.Sequential([
    layers.Dense(64, activation='relu', input_shape=(20,)),
    layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

class BatchLossHistory(tf.keras.callbacks.Callback):
    def on_train_batch_end(self, batch, logs=None):
        self.losses.append(logs['loss'])

    def on_epoch_begin(self, epoch, logs=None):
        self.losses = []

batch_loss_history = BatchLossHistory()

history = model.fit(X_train, y_train, epochs=10, batch_size=32, callbacks=[batch_loss_history], verbose=1)

plt.plot(batch_loss_history.losses)
plt.xlabel('Batch')
plt.ylabel('Loss')
plt.title('Mini-Batch Loss per Batch')
plt.show()

Output:

2. Implementing Mini Batch Gradient Descent in PyTorch

In this PyTorch code mini-batch gradient descent is used with a batch size of 32 set via the DataLoader (batch_size=32). This means that the model processes 32 samples at a time before updating the weights.

The training loop iterates over mini-batches where the loss for each batch is calculated using the BCELoss function.
The adam optimizer performs the backward pass and updates the weights.
Mini-batch loss for each batch is recorded in the batch_losses list.
After training these losses are plotted to visualize the loss per mini-batch during the 10 epochs.

Python

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import matplotlib.pyplot as plt

X_train = torch.randn(1000, 20)
y_train = torch.randint(0, 2, (1000, 1)).float()

class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(20, 64)
        self.fc2 = nn.Linear(64, 1)
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.sigmoid(self.fc2(x))
        return x

model = SimpleNN()

criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(dataset, batch_size=32, shuffle=True)

batch_losses = []
epochs = 10

for epoch in range(epochs):
    for batch_X, batch_y in train_loader:
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        batch_losses.append(loss.item())
    
    print(f"Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}")

plt.plot(batch_losses)
plt.xlabel('Batch')
plt.ylabel('Loss')
plt.title('Mini-Batch Loss per Batch')
plt.show()

Output:

Advantages of Mini-Batch Gradient Descent

Faster Convergence: Because the model updates its parameters more frequently i.e after each mini-batch hence converges faster than batch gradient descent which waits until the entire dataset is processed.
Reduced Memory Usage: Mini-batches allow training on large datasets without the need to load the entire dataset into memory at once. This makes it feasible to train deep learning models on large-scale data.
Smoother Gradient Estimation: Compared to stochastic gradient descent (SGD), mini-batch gradient descent offers a smoother convergence. While SGD updates after each data point resulting in noisy gradients hence mini-batch offers a more balanced and stable gradient update.
Parallelization and Speed: It can be parallelized on hardware like GPUs where the computation of gradients for each mini-batch can be executed simultaneously. This results in faster training especially for large neural networks.

Disadvantages of Mini-Batch Gradient Descent

While mini-batch gradient descent is widely used it has a few potential drawbacks:

Choice of Batch Size: Selecting the appropriate batch size is crucial. A batch that’s too small may lead to noisy updates, while a batch too large may negate the efficiency benefits.
Requires More Epochs: Since it doesn’t use the entire dataset at once it may require more epochs to converge compared to batch gradient descent.
Complexity: It introduces complexity in terms of managing multiple mini-batches which can increase the development effort and tuning requirements.

Introduction to Deep Learning

alka1974

Improve

Article Tags :

Mini-Batch Gradient Descent in Deep Learning

Why to use Mini-Batch Gradient Descent?

How Mini-Batch Gradient Descent Works

How to Choose the Mini-Batch Size

Optimizing Mini-Batch Gradient Descent

1. Implementing Mini Batch Gradient Descent in TensorFlow

2. Implementing Mini Batch Gradient Descent in PyTorch

Advantages of Mini-Batch Gradient Descent

Disadvantages of Mini-Batch Gradient Descent

Similar Reads

Deep Learning Basics

Neural Networks Basics

Deep Learning Models

Deep Learning Frameworks

Model Evaluation

Deep Learning Projects

Thank You!

What kind of Experience do you want to share?