Data Normalization with Python Scikit-Learn

Last Updated : 13 Jul, 2024

Data normalization is a crucial step in machine learning and data science. It involves transforming features to similar scales to improve the performance and stability of machine learning models. Python's Scikit-Learn library provides several techniques for data normalization, which are essential for ensuring that models are not biased towards features with large ranges. This article will delve into the different data normalization techniques available in Scikit-Learn, their applications, and how they can be implemented.

Table of Content

Why Data Normalization is Necessary?
Using Scikit-Learn for Normalization
Implementing Data Normalization with Scikit-Learn

1. MinMaxScaler
2. Standardization (Z-score Normalization)
3. Robust Scaling

Practical Example : Data Normalization with Python

Step 2: Performing the Techniques

1. Min-Max Scaling
2. Standardization (Z-score Normalization)
3. Robust Scaling
Combining Techniques

When to Normalize Data?
Common Normalization Pitfalls and Best Practices

Why Data Normalization is Necessary?

Many machine learning algorithms are sensitive to the scale of features. In models that rely on distances or gradient-based optimization, features with large ranges can dominate the others, leading to poor performance and inaccurate results. Data normalization mitigates this by transforming features to a common scale, so that each feature contributes on a comparable footing.
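To make the scale problem concrete, here is a minimal sketch using made-up values (an income column in dollars and an age column in years; both are assumptions for illustration, not data from this article). It compares Euclidean distances before and after rescaling each column to [0, 1] by hand.

```python
import numpy as np

# Hypothetical points: column 0 = income in dollars, column 1 = age in years.
# A and B differ only in income (by 1000); A and C differ only in age (by 35).
X = np.array([
    [50_000.0, 25.0],  # A
    [51_000.0, 25.0],  # B
    [50_000.0, 60.0],  # C
])

def euclidean(a, b):
    """Plain Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# On the raw data the income gap looks far larger than the age gap.
raw_d_ab = euclidean(X[0], X[1])
raw_d_ac = euclidean(X[0], X[2])

# Rescale each column to [0, 1] using (x - min) / (max - min).
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
scaled_d_ab = euclidean(X_scaled[0], X_scaled[1])
scaled_d_ac = euclidean(X_scaled[0], X_scaled[2])

print("Raw distances:   ", raw_d_ab, raw_d_ac)
print("Scaled distances:", scaled_d_ab, scaled_d_ac)
```

Before scaling, the distance is driven almost entirely by the unit chosen for income. After scaling, both gaps span the full range of their column and are treated alike.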

Using Scikit-Learn for Normalization

Scikit-Learn provides several transformers for normalization, including MinMaxScaler, StandardScaler, and RobustScaler. Let’s go through each of these with examples.

Min-Max Normalization (Feature Scaling): Min-max normalization, also known as feature scaling, is a widely used technique that rescales features to a common range, typically between 0 and 1. This technique is useful when the ranges of features vary significantly.
Z-Score Normalization (Standardization): Z-score normalization, also known as standardization, rescales each feature to have a mean of 0 and a standard deviation of 1. It does not change the shape of the distribution, so a skewed feature remains skewed after scaling. This technique is useful for algorithms that expect centered features with comparable variance.
Robust Scaling: Robust scaling uses the median and the interquartile range to scale features, making it robust to outliers.
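The three definitions above can be checked against Scikit-Learn directly. The following is a small sketch on a single toy column (the variable names are illustrative). It computes each transformation by hand with NumPy and compares the result with the corresponding scaler using its default settings.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

x = np.array([[1.0], [3.0], [5.0], [7.0]])  # one feature, four samples

# Min-max: (x - min) / (max - min)
manual_min_max = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# Z-score: (x - mean) / std, using the population standard deviation (ddof=0)
manual_z = (x - x.mean(axis=0)) / x.std(axis=0)

# Robust: (x - median) / IQR, where IQR = 75th percentile - 25th percentile
q1, q3 = np.percentile(x, [25, 75], axis=0)
manual_robust = (x - np.median(x, axis=0)) / (q3 - q1)

min_max_matches = np.allclose(manual_min_max, MinMaxScaler().fit_transform(x))
z_matches = np.allclose(manual_z, StandardScaler().fit_transform(x))
robust_matches = np.allclose(manual_robust, RobustScaler().fit_transform(x))

print(min_max_matches, z_matches, robust_matches)
```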

Implementing Data Normalization with Scikit-Learn

1. `MinMaxScaler`

The MinMaxScaler class performs min-max normalization. It can be used as follows:

Python

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Sample data
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# Initialize the scaler
scaler = MinMaxScaler()

# Fit and transform the data
normalized_data = scaler.fit_transform(data)
print("Normalized Data (Min-Max Scaling):")
print(normalized_data)

Output:

Normalized Data (Min-Max Scaling):
[[0.         0.        ]
 [0.33333333 0.33333333]
 [0.66666667 0.66666667]
 [1.         1.        ]]
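The default target range is [0, 1], but MinMaxScaler also accepts a feature_range argument. As a brief sketch on the same sample data, the following rescales each column to [-1, 1] instead; the choice of (-1, 1) is only an example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# Rescale each column to the range [-1, 1] instead of the default [0, 1]
scaler = MinMaxScaler(feature_range=(-1, 1))
scaled = scaler.fit_transform(data)

print(scaled)
# The per-column minimum and maximum learned during fit
print(scaler.data_min_, scaler.data_max_)
```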

2. Standardization (Z-score Normalization)

Standardization transforms features to have a mean of 0 and a standard deviation of 1. The formula used is z = (x - μ) / σ, where μ is the mean of the feature and σ is its standard deviation.

Python

from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Fit and transform the data
standardized_data = scaler.fit_transform(data)
print("Standardized Data (Z-score Normalization):")
print(standardized_data)

Output:

Standardized Data (Z-score Normalization):
[[-1.34164079 -1.34164079]
 [-0.4472136  -0.4472136 ]
 [ 0.4472136   0.4472136 ]
 [ 1.34164079  1.34164079]]

3. Robust Scaling

Robust Scaling subtracts the median from each feature and divides by the interquartile range (IQR). Because the median and IQR are barely affected by extreme values, the result is robust to outliers. Example Code:

Python

from sklearn.preprocessing import RobustScaler

# Initialize the scaler
scaler = RobustScaler()

# Fit and transform the data
robust_scaled_data = scaler.fit_transform(data)
print("Robust Scaled Data:")
print(robust_scaled_data)

Output:

Robust Scaled Data:
[[-1.         -1.        ]
 [-0.33333333 -0.33333333]
 [ 0.33333333  0.33333333]
 [ 1.          1.        ]]
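The sample data above contains no outliers, so it does not show what robustness means in practice. As a hedged sketch, the following uses a made-up column with one extreme value and compares how StandardScaler and RobustScaler treat the four ordinary values.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Four ordinary values plus one extreme value (hypothetical data)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(x)
robust = RobustScaler().fit_transform(x)

# Spread of the four ordinary values after each transformation
standard_spread = float(standard[3, 0] - standard[0, 0])
robust_spread = float(robust[3, 0] - robust[0, 0])

print("StandardScaler:", standard.ravel())
print("RobustScaler:  ", robust.ravel())
print("Spread of ordinary values:", standard_spread, robust_spread)
```

With StandardScaler, the outlier inflates the mean and standard deviation, so the ordinary values are squeezed into a narrow band. With RobustScaler, the median (3) and IQR (2) ignore the outlier, so the ordinary values keep a usable spread.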

Practical Example : Data Normalization with Python

Example with Boston House Price Dataset

Step 1: Data Loading and Exploration

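First, let's load and explore the Boston House Price dataset. The load_boston helper was deprecated in Scikit-Learn 1.0 and removed in version 1.2, so the code below reads the raw file from its original source instead.

Python

import numpy as np
import pandas as pd

# load_boston was removed in scikit-learn 1.2, so read the raw file from its original source
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)

# Each record spans two lines in the raw file: 11 values, then 3 (the last is the target)
features = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
df = pd.DataFrame(features, columns=feature_names)
df['PRICE'] = raw_df.values[1::2, 2]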

Step 2: Performing the Techniques

1. Min-Max Scaling

Min-Max Scaling transforms features by scaling them to a given range, typically [0, 1].

Python

from sklearn.preprocessing import MinMaxScaler

# Initialize the scaler
min_max_scaler = MinMaxScaler()

# Fit and transform the data
df_min_max_scaled = pd.DataFrame(min_max_scaler.fit_transform(df), columns=df.columns)

# Display the first few rows of the scaled dataset
print(df_min_max_scaled.head())

Output (illustrative; the exact values you see may differ):

       CRIM   ZN  INDUS  CHAS   NOX    RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT  PRICE
0  0.000000  0.18  0.067  0.0  0.314  0.577  0.641  0.269  0.000  0.172  0.167  0.999  0.089  0.283
1  0.000237  0.00  0.222  0.0  0.210  0.416  0.808  0.353  0.125  0.083  0.333  0.999  0.204  0.228
2  0.000237  0.00  0.222  0.0  0.210  0.720  0.588  0.353  0.125  0.083  0.333  0.992  0.069  0.627
3  0.000282  0.00  0.068  0.0  0.173  0.675  0.407  0.448  0.188  0.052  0.417  0.996  0.047  0.588
4  0.000602  0.00  0.068  0.0  0.173  0.708  0.482  0.448  0.188  0.052  0.417  0.999  0.101  0.692

2. Standardization (Z-score Normalization)

Standardization transforms features to have a mean of 0 and a standard deviation of 1.

Python

from sklearn.preprocessing import StandardScaler

# Initialize the scaler
standard_scaler = StandardScaler()

# Fit and transform the data
df_standard_scaled = pd.DataFrame(standard_scaler.fit_transform(df), columns=df.columns)

# Display the first few rows of the scaled dataset
print(df_standard_scaled.head())

Output:

       CRIM       ZN     INDUS     CHAS      NOX       RM      AGE      DIS      RAD      TAX  PTRATIO         B     LSTAT     PRICE
0 -0.419782  0.284830 -1.287909 -0.272329 -0.144217  0.413672 -0.120013  0.140214 -0.982843 -0.666608 -1.459000  0.441052 -1.075562  0.159686
1 -0.417339 -0.487722 -0.593381 -0.272329 -0.740262  0.194274  0.367166  0.557160 -0.867883 -0.987329 -0.303094  0.441052 -0.492439 -0.101525
2 -0.417342 -0.487722 -0.593381 -0.272329 -0.740262  1.282714 -0.265812  0.557160 -0.867883 -0.987329 -0.303094  0.396427 -1.208727  1.324247
3 -0.416750 -0.487722 -1.306878 -0.272329 -0.835284  1.016302 -0.809889  1.077737 -0.752922 -1.104446  0.113032  0.416163 -1.361517  1.182758
4 -0.412482 -0.487722 -1.306878 -0.272329 -0.835284  1.228577 -0.511180  1.077737 -0.752922 -1.104446  0.113032  0.441052 -1.026501  1.487503

3. Robust Scaling

Robust Scaling uses the median and the interquartile range to scale features, making it robust to outliers.

Python

from sklearn.preprocessing import RobustScaler

# Initialize the scaler
robust_scaler = RobustScaler()

# Fit and transform the data
df_robust_scaled = pd.DataFrame(robust_scaler.fit_transform(df), columns=df.columns)

# Display the first few rows of the scaled dataset
print(df_robust_scaled.head())

Output (illustrative; the exact values you see may differ):

       CRIM       ZN     INDUS  CHAS      NOX       RM      AGE      DIS  RAD      TAX  PTRATIO         B     LSTAT     PRICE
0 -0.365682  0.180000 -0.597701  0.0 -0.314286  0.204545 -0.159091 -0.327273  0.0 -0.425532 -0.666667  0.441052 -0.571429  0.0
1 -0.364885  0.000000 -0.298851  0.0 -0.628571  0.000000  0.204545  0.000000  0.0 -0.734043 -0.333333  0.441052 -0.285714 -0.176471
2 -0.364885  0.000000 -0.298851  0.0 -0.628571  0.590909 -0.068182  0.000000  0.0 -0.734043 -0.333333  0.415584 -0.642857  0.823529
3 -0.364688  0.000000 -0.597701  0.0 -0.742857  0.477273 -0.454545  0.327273  0.0 -0.808511  0.000000  0.428571 -0.714286  0.705882
4 -0.362991  0.000000 -0.597701  0.0 -0.742857  0.568182 -0.295455  0.327273  0.0 -0.808511  0.000000  0.441052 -0.571429  0.882353

Combining Techniques

In some cases, you might want to combine different normalization techniques for different features. For example, you might use Min-Max Scaling for some features and Standardization for others.

Python

# Initialize scalers
min_max_scaler = MinMaxScaler()
standard_scaler = StandardScaler()

# Apply Min-Max Scaling to selected columns
columns_min_max = ['CRIM', 'ZN', 'INDUS']
df_min_max = pd.DataFrame(min_max_scaler.fit_transform(df[columns_min_max]), columns=columns_min_max)

# Apply Standardization to other columns
columns_standard = ['NOX', 'RM', 'AGE']
df_standard = pd.DataFrame(standard_scaler.fit_transform(df[columns_standard]), columns=columns_standard)

# Combine the scaled data
df_combined = pd.concat([df_min_max, df_standard, df.drop(columns=columns_min_max + columns_standard)], axis=1)

# Display the first few rows of the combined dataset
print(df_combined.head())

Output (illustrative; the exact values and column set you see may differ):

       CRIM   ZN  INDUS      NOX       RM       AGE     DIS  RAD    TAX  PTRATIO         B     LSTAT  PRICE
0  0.000000  0.18  0.067 -0.144217  0.413672 -0.120013  4.0900    1  296.0    15.3  396.90   4.98  24.0
1  0.000237  0.00  0.222 -0.740262  0.194274  0.367166  4.9671    2  242.0    17.8  396.90   9.14  21.6
2  0.000237  0.00  0.222 -0.740262  1.282714 -0.265812  4.9671    2  242.0    17.8  392.83   4.03  34.7
3  0.000282  0.00  0.068 -0.835284  1.016302 -0.809889  6.0622    3  222.0    18.7  394.63   2.94  33.4
4  0.000602  0.00  0.068 -0.835284  1.228577 -0.511180  6.0622    3  222.0    18.7  396.90   5.33  36.2
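Scikit-Learn also offers ColumnTransformer, which applies different scalers to different columns in a single object. The following is a minimal sketch on a small made-up DataFrame (the column names a, b, and c are placeholders, not columns from the dataset above).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Small hypothetical frame standing in for a real dataset
df_demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [10.0, 20.0, 30.0, 40.0],
    "c": [5.0, 5.5, 6.0, 6.5],
})

transformer = ColumnTransformer(
    transformers=[
        ("min_max", MinMaxScaler(), ["a"]),     # scale column a to [0, 1]
        ("standard", StandardScaler(), ["b"]),  # standardize column b
    ],
    remainder="passthrough",  # leave every other column unchanged
)

# Output columns follow the transformer order, then the passthrough columns
scaled = transformer.fit_transform(df_demo)
df_scaled = pd.DataFrame(scaled, columns=["a", "b", "c"])
print(df_scaled)
```

Compared with concatenating separately scaled frames, this keeps the fitted scalers together, so the same transformation can later be applied to new data with transformer.transform.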

When to Normalize Data?

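Normalize your data when using machine learning algorithms that rely on distance calculations, such as KNN or SVM, or when your data has features with different units or scales. Normalization is also beneficial for algorithms trained with gradient descent, because features on comparable scales tend to make optimization converge faster. Tree-based models such as decision trees and random forests are largely insensitive to feature scale, so normalization matters less for them.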
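As a rough illustration for a distance-based model, the sketch below trains KNN on Scikit-Learn's built-in wine dataset with and without standardization. The dataset, the split, and the value of n_neighbors are arbitrary choices for this example, and the resulting scores depend on them.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# KNN on the raw features
unscaled_knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# KNN with standardization; the scaler is fitted on the training split only
scaled_knn = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=5)
).fit(X_train, y_train)

unscaled_acc = unscaled_knn.score(X_test, y_test)
scaled_acc = scaled_knn.score(X_test, y_test)

print("Accuracy without scaling:", unscaled_acc)
print("Accuracy with scaling:   ", scaled_acc)
```

Wrapping the scaler in a pipeline means it is fitted on the training split only, so no statistics from the test split leak into preprocessing.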

Common Normalization Pitfalls and Best Practices

Although powerful, normalization is not without its challenges. Here are some common pitfalls and best practices:

Handling Outliers: Outliers can significantly affect normalization. Consider using robust scaling techniques or preprocessing steps to handle outliers before normalization.
Choosing the Right Technique: The choice of normalization method depends on the data and context. Min-Max Scaling is a good fit when features must lie in a fixed, bounded range, whereas Z-Score Normalization is a good fit when features should be centered at 0 with unit standard deviation and no fixed bounds are required.
Reversibility: If you use transformers like MinMaxScaler, StandardScaler, or RobustScaler, you can reverse the normalization with the inverse_transform method provided by these classes.
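The reversibility point can be sketched as a simple round trip: scale the sample data with StandardScaler, then map it back with inverse_transform and check that the original values are recovered.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(data)

# Map the scaled values back to the original units
restored = scaler.inverse_transform(scaled)
round_trip_ok = np.allclose(restored, data)

print(restored)
print("Round trip recovers the original data:", round_trip_ok)
```

This is useful when a model is trained on scaled values but results need to be reported in the original units.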

Conclusion

Data normalization is a vital step in the preprocessing pipeline of any machine learning project. Using Scikit-Learn, we can easily apply different normalization techniques such as Min-Max Scaling, Standardization, and Robust Scaling. Choosing the right normalization method can significantly impact the performance of your machine learning models. By incorporating these normalization techniques, you can ensure that your data is well-prepared for modeling, leading to more accurate and reliable predictions.

lakshaymbnwg