
Incremental Learning with Scikit-learn

Last Updated : 31 Jul, 2025

Incremental learning is a technique where a machine learning model learns from data in small chunks or batches rather than all at once. This is useful when working with very large datasets or streaming data that cannot fit into memory. Scikit-learn, a popular machine learning library in Python, supports incremental learning through models that implement the partial_fit() method, which lets you train the model on one batch at a time, update it continuously with new data and avoid retraining from scratch.


Incremental Learning

  • Incremental learning is a machine learning technique where models are trained gradually using small batches of data instead of the entire dataset at once.
  • This approach is particularly useful when working with large scale or streaming data that cannot fit into memory all at once.
  • Rather than starting over every time new data becomes available, the model updates itself incrementally, learning from each new batch without forgetting what it has already learned (a minimal sketch of this pattern follows this list).
  • This makes incremental learning ideal for real-time applications such as fraud detection, recommendation systems and monitoring systems where data evolves continuously.
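
Below is a minimal sketch of the partial_fit() pattern on small synthetic data. The array names, chunk size and random data here are illustrative only and are unrelated to the credit card example implemented in the next section.
Python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Illustrative synthetic "stream": 1,000 samples with 5 features and a binary label
rng = np.random.RandomState(0)
X_stream = rng.randn(1000, 5)
y_stream = (X_stream[:, 0] + X_stream[:, 1] > 0).astype(int)

model = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])  # every possible label must be declared up front

# Feed the model one chunk of 100 samples at a time
for start in range(0, len(X_stream), 100):
    X_chunk = X_stream[start:start + 100]
    y_chunk = y_stream[start:start + 100]
    model.partial_fit(X_chunk, y_chunk, classes=classes)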

Implementation

Step 1: Import Required Libraries

  • This code imports the key Python libraries for building and evaluating the model: pandas and numpy handle data manipulation and numerical operations.
  • SGDClassifier from sklearn.linear_model is a fast linear classifier based on stochastic gradient descent, and StandardScaler normalizes the features to improve model training.
  • accuracy_score and classification_report measure the performance of the model, and shuffle randomizes the order of the samples to simulate a stream of incoming data.
Python
import pandas as pd
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
from sklearn.utils import shuffle

Step 2: Load Dataset

  • This line reads the creditcard.csv file into a pandas DataFrame named df.
  • It loads the dataset into memory so it can be processed and analyzed using pandas functions.
Python
df = pd.read_csv("creditcard.csv") 
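
Note that this step still loads the whole CSV into memory. If the file were genuinely too large for that, one option (a sketch only, assuming the same creditcard.csv file) is to let pandas read it in chunks and feed each chunk to partial_fit() directly.
Python
import pandas as pd

# Sketch: stream the CSV in chunks of 10,000 rows instead of loading it all at once
for chunk in pd.read_csv("creditcard.csv", chunksize=10000):
    X_chunk = chunk.drop("Class", axis=1).values
    y_chunk = chunk["Class"].values
    # ... scale the chunk and call model.partial_fit(X_chunk, y_chunk) here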

Step 3: Separate Features and Target

  • These lines separate the dataset into features and target labels.
  • X contains all the input features by dropping the "Class" column while y stores the target values from the "Class" column which typically indicates whether a transaction is fraudulent or not.
Python
X = df.drop("Class", axis=1).values
y = df["Class"].values

Step 4: Normalize Time and Amount Features

  • This code creates a StandardScaler to normalize the first two columns of X, which are usually "Time" and "Amount" in credit card datasets.
  • It scales them to have zero mean and unit variance, improving the performance of machine learning models.
Python
scaler = StandardScaler()
X[:, [0, 1]] = scaler.fit_transform(X[:, [0, 1]])
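
StandardScaler also implements partial_fit(), so when the data itself arrives in batches the scaling statistics can be accumulated incrementally instead of being computed on the full array at once. A minimal sketch of that variant is shown below (the 10,000-row batch size is just an illustration).
Python
scaler = StandardScaler()

# Sketch: accumulate the running mean and variance of "Time" and "Amount"
# one batch at a time instead of fitting on the full array
for start in range(0, X.shape[0], 10000):
    scaler.partial_fit(X[start:start + 10000, [0, 1]])

# the accumulated statistics can then be used to transform any batch
X[:, [0, 1]] = scaler.transform(X[:, [0, 1]])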

Step 5: Shuffle Data to Simulate Streaming

  • This line shuffles the feature matrix X and target vector y in unison to randomize the data order which helps prevent any patterns in the original order from affecting model training.
  • The random_state=42 ensures reproducibility.
Python
X, y = shuffle(X, y, random_state=42)

Step 6: Initialize the Incremental Model

  • This line initializes an SGDClassifier with logistic loss for binary classification.
  • Note that max_iter=1 and warm_start=True mainly affect repeated calls to fit(); with partial_fit(), which is used below, the model always keeps its learned weights between calls, so it can be updated batch by batch without reinitialization.
Python
model = SGDClassifier(loss='log_loss', max_iter=1, warm_start=True)
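
SGDClassifier is not the only scikit-learn estimator that supports incremental updates; PassiveAggressiveClassifier and Perceptron (and, for clustering, MiniBatchKMeans) also expose partial_fit(). As a sketch, swapping in one of them is a one-line change and the batch loop below stays the same.
Python
from sklearn.linear_model import PassiveAggressiveClassifier

# Sketch: any estimator that exposes partial_fit() can reuse the same batch loop below
model = PassiveAggressiveClassifier()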

Step 7: Define Classes for partial_fit

  • This line extracts and stores the unique class labels from the target array y using np.unique().
  • It ensures that the model is aware of all possible output classes, which is required by partial_fit() in incremental learning.
Python
classes = np.unique(y)
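
For the widely used Kaggle credit card fraud dataset the "Class" column contains 0 for legitimate transactions and 1 for fraud, so classes is simply array([0, 1]). Declaring this up front matters because an individual batch, especially one with very few fraud cases, may not contain both labels.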

Step 8: Define Batch Size and Number of Batches

  • This code sets a batch size of 10,000 and calculates the total number of full batches by dividing the total number of samples by the batch size.
  • It's used to split the data for incremental training in manageable chunks.
Python
batch_size = 10000
n_batches = X.shape[0] // batch_size
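
As a rough worked example, if the widely used Kaggle version of the dataset with 284,807 rows is loaded, 284807 // 10000 gives 28 full batches, and the remaining ~4,807 rows are never reached by the training loop below; an extra partial_fit() call could be added to cover them if desired.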

Step 9: Train Model Incrementally in Batches

  • This loop trains the model incrementally on batches of data. For each batch it selects a slice of features and targets then uses partial_fit to update the model.
  • The first batch includes the full list of classes to initialize the model properly. Every 5 batches it predicts on the current batch and prints the accuracy allowing you to monitor training progress batch by batch.
Python
for i in range(n_batches):
    start = i * batch_size
    end = start + batch_size
    X_batch = X[start:end]
    y_batch = y[start:end]
    
    if i == 0:
        model.partial_fit(X_batch, y_batch, classes=classes)
    else:
        model.partial_fit(X_batch, y_batch)

    if i % 5 == 0:
        y_pred = model.predict(X_batch)
        acc = accuracy_score(y_batch, y_pred)
        print(f"Batch {i + 1}, Accuracy: {acc:.4f}")

Step 10: Final Evaluation on Last Batch

  • This code predicts the labels for the last batch of data and then prints a detailed classification report.
  • The report includes metrics like precision, recall and F1 score which help evaluate the model’s performance on the final batch.
Python
y_pred = model.predict(X[-batch_size:])
print("\nFinal Batch Classification Report:\n")
print(classification_report(y[-batch_size:], y_pred))
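
Because fraud data is heavily imbalanced, accuracy alone can look high even when most fraud cases are missed. An optional addition (not part of the original example) is to also print a confusion matrix for the same final batch.
Python
from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predictions;
# the bottom-left cell counts fraud cases the model missed (false negatives)
print(confusion_matrix(y[-batch_size:], y_pred))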

Output:

The printed output shows the accuracy for every fifth batch during training, followed by the classification report (precision, recall and F1-score) for the final batch.

Applications

  1. Fraud Detection: Financial fraud is dynamic with new attack patterns emerging regularly. Incremental learning helps update models quickly with recent transactions to detect anomalies in real time without full retraining.
  2. Recommendation Systems: User interests change rapidly on platforms like e-commerce or streaming services. By learning incrementally from each user interaction, models stay up to date and deliver more relevant, personalized content.
  3. Sensor and IoT Analytics: Smart devices and industrial IoT generate massive continuous data streams. Incremental models can analyze this data on the fly, helping in tasks like predictive maintenance or real-time monitoring.
  4. Social Media Monitoring: Platforms like Twitter and Instagram evolve every second with new trends and opinions. Incremental learning allows sentiment analysis or topic classification models to stay current by processing recent posts in batches.
