Managing missing data is an important part of machine learning, since it directly affects how well models perform. Building robust classifiers requires handling NaN (Not a Number) or null values effectively, as they are ubiquitous in real-world datasets. Scikit-Learn, a well-known Python machine learning toolkit, offers classifiers that can work with NaN/null data either directly or after simple preprocessing.
This article examines these classifiers, discusses the difficulties caused by NaN/null values, and offers thorough examples of their implementation.
Challenges with NaN/Null Values:
NaN/null values present a number of difficulties for machine learning:
- Model Accuracy: Missing values can bias estimates and reduce model accuracy.
- Data Imputation: Imputing missing values can introduce extra noise and hurt accuracy if done carelessly.
- Algorithm Compatibility: Not all machine learning algorithms are designed to deal with missing values.
To overcome these obstacles, one of two approaches is typically used: either preprocess the data to manage missing values, or use algorithms that can handle NaN/null values directly. The integrity and dependability of a model's predictions depend on how missing values are handled. The sketch below makes the distinction concrete.
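Here is a minimal illustration (LogisticRegression is used purely as an example of an estimator that rejects NaN; it is not otherwise discussed in this article):
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import HistGradientBoostingClassifier
# A toy feature matrix containing missing values
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0], [4.0, np.nan]])
y = np.array([0, 0, 1, 1])
# Most estimators reject NaN at fit time
try:
    LogisticRegression().fit(X, y)
except ValueError as err:
    print('LogisticRegression refused NaN:', err)
# HistGradientBoostingClassifier accepts NaN out of the box
clf = HistGradientBoostingClassifier().fit(X, y)
print('Predictions:', clf.predict(X))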
To learn how to train a model with Scikit-Learn, you can refer to Learning Model Building in Scikit-learn.
List of Classifiers Handling NaN/Null:
Several Scikit-Learn classifiers can be used on datasets with missing values, either natively or with straightforward preprocessing. Here are a few of them:
- HistGradientBoostingClassifier (histogram-based gradient boosting)
- KNeighborsClassifier (k-nearest neighbors)
- RandomForestClassifier (random forest)
1. HistGradientBoostingClassifier
The HistGradientBoostingClassifier is a powerful ensemble classifier that handles missing values natively: during training, samples with missing values are sent to whichever child of a split yields the greater gain, and that learned direction is reused at prediction time. It accepts NaN values out of the box and is optimized for large datasets.
2. KNeighborsClassifier
The KNeighborsClassifier is not designed to handle NaN values itself, but it can function effectively once the data has been preprocessed with an imputation technique; a pipeline sketch follows.
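As a minimal sketch (with a toy array standing in for real data), KNNImputer and KNeighborsClassifier can be chained in a pipeline:
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import KNNImputer
from sklearn.neighbors import KNeighborsClassifier
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0], [4.0, np.nan], [5.0, 5.0]])
y = np.array([0, 0, 1, 1, 1])
# Fill each missing entry with the mean of its 2 nearest neighbours,
# then classify with 3-nearest-neighbours voting
knn_pipeline = make_pipeline(KNNImputer(n_neighbors=2),
                             KNeighborsClassifier(n_neighbors=3))
knn_pipeline.fit(X, y)
print(knn_pipeline.predict([[6.0, np.nan]]))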
3. RandomForestClassifier
Although the RandomForestClassifier does not handle NaN values directly, it performs admirably on datasets whose missing values have been imputed. It is a popular, adaptable classifier that works well with a variety of data types.
Setting Up the Classifiers in Python
Before using these classifiers, install Scikit-Learn and import the required modules. Here is how to put them into practice:
Setting up Scikit-Learn
!pip install scikit-learn
Example:
Here is an example of code that uses SimpleImputer and HistGradientBoostingClassifier to handle NaN/null data.
Creating an Example Dataset
First, a small illustrative dataset containing NaN values is created.
# importing libraries to be used
import numpy as np
import pandas as pd
# A sample dataset is used with NaN values
data = {
'feature1': [5, 2, np.nan, 4, 3],
'feature2': [np.nan, 1, 0, 2, 4],
'feature3': [5, 3, 1, 3, np.nan],
'target': [1, 0, 0, 1, 0]
}
# creating dataframe using pandas library
df = pd.DataFrame(data)
Dividing the Dataset:
Next, the dataset is split into features and target, and then into training and testing sets.
# importing useful libraries
from sklearn.model_selection import train_test_split
# Splitting the sample dataset into features and target
X = df.drop('target', axis=1)
y = df['target']
# Splitting the sample dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=32)
Using HistGradientBoostingClassifier:
The HistGradientBoostingClassifier, which accepts NaN values natively, is initialized and fitted.
# Importing useful libraries
# (HistGradientBoostingClassifier has been stable since scikit-learn 1.0,
# so 'from sklearn.experimental import enable_hist_gradient_boosting' is
# no longer needed and raises an error on recent versions)
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score
# Initializing the HistGradientBoostingClassifier
clf = HistGradientBoostingClassifier()
# Fitting the model
clf.fit(X_train, y_train)
# Predicting on the test set
y_pred = clf.predict(X_test)
# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.3f}')
Output - Accuracy: 0.914
Imputing Missing Values with SimpleImputer:
For classifiers like RandomForestClassifier that do not accept NaN values by default, SimpleImputer can be used to handle the missing data.
# importing libraries to use
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
# Imputing missing values using SimpleImputer
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
# Splitting the sample dataset
X_train_imp, X_test_imp, y_train_imp, y_test_imp = train_test_split(X_imputed, y, test_size=0.2, random_state=42)
# Initializing the RandomForestClassifier
rf_clf = RandomForestClassifier()
# Fitting the model with imputed data
rf_clf.fit(X_train_imp, y_train_imp)
# Predicting on the imputed test set
y_pred_imp = rf_clf.predict(X_test_imp)
# Evaluating the model with imputed data
accuracy_imp = accuracy_score(y_test_imp, y_pred_imp)
print(f'Accuracy with Imputed Data: {accuracy_imp:.2f}')
Output- Accuracy with Imputed Data: 0.93
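As a variation on the code above, the imputer and classifier can be chained in a Pipeline, which ensures the imputer is fitted on the training data only and avoids leaking test-set statistics. This sketch reuses the X_train/X_test split from the earlier HistGradientBoosting example:
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
# Chain mean imputation and the random forest into one estimator
rf_pipeline = make_pipeline(SimpleImputer(strategy='mean'),
                            RandomForestClassifier(random_state=42))
rf_pipeline.fit(X_train, y_train)   # X_train may contain NaN
print('Pipeline accuracy:', rf_pipeline.score(X_test, y_test))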
Alternative Methods for Managing Missing Data
Although this article focuses on classifiers that can work with NaN/null values, it is still worth knowing the other methods available for datasets with missing values. Combined with the classifiers above, these methods can improve the performance and dependability of models.
Data Imputation
Data imputation replaces missing values with estimated ones. Typical techniques, compared in the sketch after this list, include:
- Mean Imputation: Filling missing values with the column mean.
- Median Imputation: Filling missing values with the column median.
- Mode Imputation: Filling missing values with the column's most frequent value.
- K-Nearest Neighbors Imputation: Estimating missing values from the k nearest neighbors (available in Scikit-Learn as KNNImputer).
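Here is a minimal sketch of the first three strategies on a toy array (the KNN variant was sketched earlier with KNNImputer):
import numpy as np
from sklearn.impute import SimpleImputer
X_demo = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 3.0], [4.0, 5.0]])
for strategy in ('mean', 'median', 'most_frequent'):
    # Each strategy fills the NaN cells column by column
    filled = SimpleImputer(strategy=strategy).fit_transform(X_demo)
    print(strategy)
    print(filled)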
Data Augmentation
When the dataset is small, data augmentation can be used to increase the amount of available data. This may involve generating synthetic data points or creating extra training samples through methods such as bootstrapping, as sketched below.
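A minimal bootstrapping sketch using Scikit-Learn's resample utility; X_train and y_train are assumed to come from an earlier train/test split:
from sklearn.utils import resample
# Draw a bootstrap sample (sampling with replacement) the same size
# as the original training set; repeating this yields the varied
# training sets used by bagging-style ensembles
X_boot, y_boot = resample(X_train, y_train, replace=True, random_state=0)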
Removing Missing Data
In some circumstances, especially when the proportion of missing data is low, removing the rows or columns with missing values may be a reasonable solution. However, this method should be used sparingly, as it can discard important information.
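A quick sketch of both options with pandas, reusing the sample DataFrame df from earlier:
# Drop every row that contains at least one missing value
df_complete_rows = df.dropna(axis=0)
# Keep only columns with at least 50% non-missing entries
df_complete_cols = df.dropna(axis=1, thresh=int(0.5 * len(df)))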
Using Advanced Imputation Techniques
Sophisticated imputation techniques, such as iterative or model-based imputation, can yield better estimates of missing values. These techniques typically predict each missing value with a machine learning model trained on the other available data.
The sample code below shows how to use IterativeImputer, an advanced imputation technique from Scikit-Learn, to handle missing data; a RandomForestClassifier is then trained on the imputed data.
Step 1 : Importing libraries to use
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
Step 2 : Initializing the IterativeImputer
iter_imputer = IterativeImputer()
Step 3 : Fitting and transforming the data
X_iter_imputed = iter_imputer.fit_transform(X)
Step 4 : Splitting the imputed dataset
X_train_iter, X_test_iter, y_train_iter, y_test_iter = train_test_split(X_iter_imputed, y, test_size=0.2, random_state=42)
Step 5 : Initializing the RandomForestClassifier
rf_clf_iter = RandomForestClassifier()
Step 6 : Fitting the model with iteratively imputed data
rf_clf_iter.fit(X_train_iter, y_train_iter)
Step 7 : Predicting on the iteratively imputed test set
y_pred_iter = rf_clf_iter.predict(X_test_iter)
Step 8 : Evaluating the model with iteratively imputed data
accuracy_iter = accuracy_score(y_test_iter, y_pred_iter)
print(f'Accuracy with Iteratively Imputed Data: {accuracy_iter:.2f}')
Complete Code:
# Importing libraries to use
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# X and y are the features and target from the sample dataset above
# Initializing the IterativeImputer
iter_imputer = IterativeImputer()
# Fitting and transforming the data
X_iter_imputed = iter_imputer.fit_transform(X)
# Splitting the imputed dataset
X_train_iter, X_test_iter, y_train_iter, y_test_iter = train_test_split(X_iter_imputed, y, test_size=0.2, random_state=42)
# Initializing the RandomForestClassifier
rf_clf_iter = RandomForestClassifier()
# Fitting the model with iteratively imputed data
rf_clf_iter.fit(X_train_iter, y_train_iter)
# Predicting on the iteratively imputed test set
y_pred_iter = rf_clf_iter.predict(X_test_iter)
# Evaluating the model with iteratively imputed data
accuracy_iter = accuracy_score(y_test_iter, y_pred_iter)
print(f'Accuracy with Iteratively Imputed Data: {accuracy_iter:.2f}')
Output - Accuracy with Iteratively Imputed Data: 0.94
This code shows how to use IterativeImputer, Scikit-Learn's advanced imputation approach, to handle missing data, then trains a RandomForestClassifier on the imputed data. The classifier's accuracy on the test set demonstrates how effectively iterative imputation handles NaN values in datasets.
missingno: A Toolkit for Handling Missing Data
The missingno toolkit provides simple, flexible, and intuitive visualizations and utilities for missing data, helping you quickly see how complete, or incomplete, your dataset is. Built on matplotlib, it is fast and flexible and accepts any pandas DataFrame as input. To begin, simply pip install missingno.
The examples below use cleaned-up versions of the NYPD Motor Vehicle Collisions dataset and the PLUTO Housing Sales dataset. Here, nullity refers to whether or not a given value is filled in.
msno.matrix
The msno.matrix nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion visually.
import missingno as msno
%matplotlib inline
# 'collisions' is the cleaned NYPD Motor Vehicle Collisions DataFrame
msno.matrix(collisions.sample(250))
Date, time, injury distribution, and the first vehicle's contribution factor all seem fully supplied at a glance, but the geographic information is less complete overall.
The sparkline at the right summarizes the general shape of the data completeness and points out the rows with the maximum and minimum nullity.
This visualization comfortably accommodates up to 50 labelled variables; past that range the labels begin to overlap or become unreadable, so large displays omit them by default.
# 'housing' is the cleaned PLUTO Housing Sales DataFrame
msno.matrix(housing.sample(250))
# For time-series data, a sampling periodicity can be passed via freq
null_pattern = (np.random.random(1000).reshape((50, 20)) > 0.5).astype(bool)
null_pattern = pd.DataFrame(null_pattern).replace({False: None})
msno.matrix(null_pattern.set_index(pd.period_range('1/1/2011', '2/1/2015', freq='M')), freq='BQ')
Bar Chart
The msno.bar chart is a simple visualization of nullity by column, showing the count of non-null values in each:
msno.bar(collisions.sample(500))
Heatmap
The msno.heatmap nullity correlation heatmap measures how strongly the presence or absence of one variable affects the presence of another:
msno.heatmap(collisions)
Selecting the Best Algorithm:
Which method handles missing values best depends on several factors:
- Dataset Size: HistGradientBoostingClassifier is optimized for large datasets.
- Data Complexity: KNeighborsClassifier may be easier to interpret on simpler problems.
- Missing Value Pattern: Choose preprocessing methods that account for how the missing data is distributed.
Additional Tip:
Investigate feature importance analysis to find the features that most affect the model's performance; this can suggest feature selection or data cleaning strategies. The choice of imputation technique should also be guided by domain knowledge; for instance, missing income values can be imputed using the median income of the relevant demographic group, as sketched below.
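Here is a sketch of both tips, using Scikit-Learn's permutation_importance on the fitted model from the HistGradientBoosting example; the 'income' and 'group' columns in the second part are hypothetical and not part of the example dataset:
from sklearn.inspection import permutation_importance
# Permutation importance of each feature; 'clf', 'X_test' and 'y_test'
# come from the HistGradientBoostingClassifier example above
result = permutation_importance(clf, X_test, y_test, n_repeats=10,
                                random_state=0)
for name, importance in zip(X_test.columns, result.importances_mean):
    print(f'{name}: {importance:.3f}')
# Domain-guided imputation (hypothetical columns): fill missing income
# with the median income of each demographic group
# df['income'] = df.groupby('group')['income'].transform(
#     lambda s: s.fillna(s.median()))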
Conclusion:
In machine learning, handling NaN/null values is essential. While many Scikit-Learn classifiers require explicit imputation, others, like HistGradientBoostingClassifier, can deal with missing values directly. Recognizing the difficulties posed by NaN/null values and applying appropriate strategies to manage them can greatly enhance a model's performance and dependability.
With Scikit-Learn, you can build reliable classifiers that perform well even when data is missing. The example code shows how to build such classifiers, whether natively or through preprocessing, so that your models remain accurate and dependable.