Introduction To Data Science: Data Preprocessing In Python
Learn about different data preprocessing techniques using the Sklearn library.
Karan Patel
Published in Python in Plain English
6 min read · Aug 26, 2021
Fig 1. Model development phases
Data preprocessing is one of the most important steps in data science, along with data collection. In one of my previous posts, I talked about Web Scraping using Python, which is a common way to obtain data from the internet. But that data cannot be fed directly to a machine learning model; it needs to be preprocessed first.
What is Data Preprocessing?
Before we start analyzing our data and extracting insights from it, we need to process it, i.e., convert it into a form our model can understand, since machines cannot work directly with raw images, audio, and so on. Real-world data is rarely perfect: it is often incomplete, inconsistent (with outliers and noisy values), and unstructured. Preprocessing the raw data organizes, scales, cleans (removes outliers), and standardizes it, simplifying it enough to feed to a machine learning algorithm.
Preprocessing
In this post, I will walk through the implementation of data preprocessing methods in Python, covering the following topics:
• Missing values
• Standardization
• Normalization
• Encoding categorical features
• Discretization
For this preprocessing script, I have used Google Colab.
Importing the Libraries
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import KBinsDiscretizer
If you see any import errors, install the missing packages explicitly with pip as follows.
pip install <package-name>
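The imports above come from just three distributions, so one command covers them all:
pip install pandas numpy scikit-learn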
Dataset Used
The dataset I used is Auto MPG, provided by the UC Irvine Machine Learning Repository. It contains data on different car models and their fuel consumption in miles per gallon, which depends on factors like engine size, number of cylinders, horsepower, and acceleration.
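Loading the file is not shown in the post; here is a minimal sketch, assuming the CSV is saved as 'auto-mpg.csv' (an assumed file name). In the UCI distribution, the horsepower column marks missing values with '?', so they are parsed as NaN; the later snippets also operate on a feature table X, taken here to be a copy of the full frame:
# Minimal loading sketch; 'auto-mpg.csv' is an assumed file name.
# In the UCI data, the horsepower column uses '?' for missing values.
df = pd.read_csv('auto-mpg.csv', na_values='?')
X = df.copy()  # feature table referenced by the later snippets
df.head()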
Fig 2. A Glimpse of the Dataset Used
Handling Missing Values
Handling missing values is an essential preprocessing step because, done carelessly, it can drastically degrade your model. Before handling missing values, it is important to identify them and work out which values they can sensibly be replaced with. You can usually find this out by combining the metadata with some exploratory analysis.
Once you know a bit more about the missing data, you have to decide whether to keep the entries that contain it. Often a better strategy is to impute the missing values, i.e., to infer them from the known part of the data. The SimpleImputer class provides basic strategies for this: missing values can be replaced with a constant, or with a statistic (mean, median, or most frequent value) of the column in which they are located. The class also allows different encodings of missing values. Here we replace the missing values in the horsepower field with the mean of that column.
from sklearn.impute import MissingIndicator
# Flag which rows have missing values (here only horsepower has any)
indicator = MissingIndicator(missing_values=np.nan)
indicator = indicator.fit_transform(df)
indicator = pd.DataFrame(indicator, columns=['horsepower'])
# Replace the missing values in the numeric columns with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(df.iloc[:, 1:7])
df.iloc[:, 1:7] = imputer.transform(df.iloc[:, 1:7])
df
Fig 3. Imputation of Missing Values
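A quick sanity check after imputation is to confirm that no NaN values remain:
# No missing values should remain after imputation
print(df.isna().sum())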
Standardization
Standardization is a transformation that centers the data by removing the mean value of each feature and then scales it by dividing (non-constant) features by their standard deviation. After standardization, each feature has zero mean and unit standard deviation. In practice, we often ignore the shape of the distribution and simply apply this transformation. For this task, I have used StandardScaler; alternatives include MinMaxScaler, MaxAbsScaler, and RobustScaler.
# with_mean=False skips centering, so only the scaling half of
# standardization is applied; StandardScaler() would also subtract the mean.
sc_X = StandardScaler(with_mean=False)
X = sc_X.fit_transform(X.drop(['car name'], axis=1))
Fig 4. Standardizing the Dataset
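A quick check of the result: because with_mean=False skips centering, the column means are unchanged, but each column's standard deviation should now be close to 1:
# Per-column standard deviation should be ~1 after scaling
print(X.std(axis=0))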
Normalization
Normalization is the process of scaling individual samples so that they have unit norm. In basic terms, you need to normalize data when the algorithm predicts based on weighted relationships between data points. Scaling inputs to unit norm is a common operation for text classification or clustering.
One key difference between scaling (e.g., standardizing) and normalizing is that normalization is performed row-wise, whereas scaling is a column-wise operation.
from sklearn.preprocessing import Normalizer
# Scale each sample (row) to unit L2 norm
nm = Normalizer()
x_sc = nm.fit_transform(X)
X = pd.DataFrame(x_sc)
Fig 5. Normalizing the dataset
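Since Normalizer works row-wise, a simple check is that every sample now has unit L2 norm:
# Each row should have an L2 norm of 1 after normalization
print(np.linalg.norm(X.values, axis=1)[:5])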
Encoding categorical features
Managing categorical data is another essential process during data preprocessing. Most scikit-learn estimators expect purely numeric input and cannot consume string categories directly, so even for tree-based models it is necessary to convert categorical features to a numerical representation.
Label encoding converts labels into numeric, machine-readable form. Machine learning algorithms can then better decide how those labels should be treated. It is an important preprocessing step for structured datasets in supervised learning.
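As a small illustration (separate from the one-hot approach used below), scikit-learn's LabelEncoder maps each distinct string to an integer; this sketch assumes X still holds the original 'car name' column:
# Label encoding sketch: each distinct car name becomes one integer
le = LabelEncoder()
codes = le.fit_transform(X['car name'])
print(codes[:5])        # integer codes
print(le.classes_[:5])  # original names; index = code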
This dataset contains many car model names stored as strings. Label encoding alone would map each distinct name to a single integer; here we go one step further and use one-hot encoding, so each car model gets its own column, with a 1 in the column matching a row's model and 0 everywhere else. As you can see in the figure below, the car in row 3 is an 'AMC rebel sst'. Label encoding assigns 'AMC rebel sst' the number 14, so row 3 has a 1 in column 14 and 0 in all other columns.
from sklearn.preprocessing import OneHotEncoder
# One binary column per distinct car name; this assumes X still
# contains the original 'car name' column.
# Note: sparse_output replaces the sparse argument in scikit-learn 1.2+,
# and np.int was removed from NumPy, so plain int is used.
onehot = OneHotEncoder(dtype=int, sparse_output=True)
nominals = pd.DataFrame(
    onehot.fit_transform(X[['car name']]).toarray())
nominals
Fig 6. One-Hot Encoding of 'Car Names'
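To see which car model each binary column represents, the fitted encoder can report its output feature names (get_feature_names_out is available in scikit-learn 1.0 and later):
# Map one-hot columns back to the car names they encode
print(onehot.get_feature_names_out()[:5])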
Discretization
Data discretization converts a large number of continuous data values into a smaller set, so that evaluating and managing the data becomes easier. In other words, discretization maps the values of a continuous attribute into a finite set of intervals with minimal information loss. There are two forms of data discretization: supervised discretization, which uses the class labels to choose the intervals, and unsupervised discretization, which derives the intervals from the feature values alone.
Sklearn provides a KBinsDiscretizer class that can take care of this. The only things you have to specify are the number of bins (n_bins) for each feature and how to encode these bins ('ordinal', 'onehot', or 'onehot-dense').
from sklearn.preprocessing import KBinsDiscretizer
# Split each feature into 6 equal-width bins, one-hot encoding the bin index
disc = KBinsDiscretizer(
    n_bins=6, encode='onehot', strategy='uniform')
disc.fit_transform(X)
Fig 7. Discretization Of The Dataset Using KBins
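The learned bin boundaries can be inspected after fitting; with strategy='uniform', each feature's range is split into six equal-width intervals:
# Six equal-width bin edges per feature
print(disc.bin_edges_)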
Conclusion
After working through these steps, you'll have a basic grasp of how to preprocess different types of data before using them for machine learning.
THE IPYNB FILE CAN BE FOUND HERE