Introduction To Data Science: Data Preprocessing In Python
Learn about different data preprocessing techniques using the Sklearn library.
Karan Patel
Published in Python in Plain English
6 min read · Aug 26, 2021
Fig 1. Model development phases
Data preprocessing is one of the most important steps in data science, along with data collection. In one of my previous posts, I talked about Web Scraping using Python, which is a common way to obtain data from the internet. But that data cannot be fed directly to a machine learning model; it needs to be preprocessed first.
What is Data Preprocessing?
Before we start analyzing our data and extracting insights from it, we need to process it, i.e., convert it into a form our model can understand, since machines cannot work directly with raw images, audio, and so on. Real-world data is rarely perfect: it is often incomplete, inconsistent (with outliers and noisy values), and unstructured. Preprocessing the raw data organizes, scales, cleans (removes outliers), and standardizes it, simplifying it enough to feed to a machine learning algorithm.
Preprocessing
In this post, I will walk through the implementation of data preprocessing methods in Python, covering the following topics:
• Missing values
• Standardization
• Normalization
• Encoding categorical features
• Discretization
For this preprocessing script, I have used Google Colab.
Importing the Libraries
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import KBinsDiscretizer
If you see any import errors, install the missing packages explicitly with pip as follows.
pip install <package-name>
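The imports above come from just three distributions, so one command covers them all:
pip install pandas numpy scikit-learn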
Dataset Used
The dataset I used is Auto MPG, provided by the UC Irvine Machine Learning Repository. It contains data on different car models and their fuel consumption in miles per gallon, which depends on factors like engine size, number of cylinders, horsepower, and acceleration.
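Loading the file is not shown in the post; here is a minimal sketch, assuming the CSV is saved as 'auto-mpg.csv' (an assumed file name). In the UCI distribution, the horsepower column marks missing values with '?', so they are parsed as NaN; the later snippets also operate on a feature table X, taken here to be a copy of the full frame:
# Minimal loading sketch; 'auto-mpg.csv' is an assumed file name.
# In the UCI data, the horsepower column uses '?' for missing values.
df = pd.read_csv('auto-mpg.csv', na_values='?')
X = df.copy()  # feature table referenced by the later snippets
df.head()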
Fig 2. A Glimpse of the Dataset Used
Handling Missing Values
Handling missing values is an essential preprocessing step because, done carelessly, it can drastically degrade your model. Before handling missing values, it is important to identify them and work out which values they can sensibly be replaced with. You can usually find this out by combining the metadata with some exploratory analysis.
Once you know a bit more about the missing data, you have to decide whether to keep the entries that contain it. Often a better strategy is to impute the missing values, i.e., to infer them from the known part of the data. The SimpleImputer class provides basic strategies for this: missing values can be replaced with a constant, or with a statistic (mean, median, or most frequent value) of the column in which they are located. The class also allows different encodings of missing values. Here we replace the missing values in the horsepower field with the mean of that column.
from sklearn.impute import MissingIndicator
# Flag which rows have missing values (here only horsepower has any)
indicator = MissingIndicator(missing_values=np.nan)
indicator = indicator.fit_transform(df)
indicator = pd.DataFrame(indicator, columns=['horsepower'])
# Replace the missing values in the numeric columns with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(df.iloc[:, 1:7])
df.iloc[:, 1:7] = imputer.transform(df.iloc[:, 1:7])
df
Fig 3. Imputation of Missing Values
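A quick sanity check after imputation is to confirm that no NaN values remain:
# No missing values should remain after imputation
print(df.isna().sum())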
Standardization
Standardization is a transformation that centers the data by removing the mean value of each feature and then scales it by dividing (non-constant) features by their standard deviation. After standardization, each feature has zero mean and unit standard deviation. In practice, we often ignore the shape of the distribution and simply apply this transformation. For this task, I have used StandardScaler; alternatives include MinMaxScaler, MaxAbsScaler, and RobustScaler.
# with_mean=False skips centering, so only the scaling half of
# standardization is applied; StandardScaler() would also subtract the mean.
sc_X = StandardScaler(with_mean=False)
X = sc_X.fit_transform(X.drop(['car name'], axis=1))
Fig 4. Standardizing the Dataset
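A quick check of the result: because with_mean=False skips centering, the column means are unchanged, but each column's standard deviation should now be close to 1:
# Per-column standard deviation should be ~1 after scaling
print(X.std(axis=0))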
Normalization
Normalization is the process of scaling individual samples so that they have unit norm. In basic terms, you need to normalize data when the algorithm predicts based on weighted relationships between data points. Scaling inputs to unit norm is a common operation for text classification or clustering.
One key difference between scaling (e.g., standardizing) and normalizing is that normalization is performed row-wise, whereas scaling is a column-wise operation.
from sklearn.preprocessing import Normalizer
# Scale each sample (row) to unit L2 norm
nm = Normalizer()
x_sc = nm.fit_transform(X)
X = pd.DataFrame(x_sc)
Fig 5. Normalizing the dataset
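Since Normalizer works row-wise, a simple check is that every sample now has unit L2 norm:
# Each row should have an L2 norm of 1 after normalization
print(np.linalg.norm(X.values, axis=1)[:5])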
Encoding categorical features
Managing categorical data is another essential process during data preprocessing. Most scikit-learn estimators expect purely numeric input and cannot consume string categories directly, so even for tree-based models it is necessary to convert categorical features to a numerical representation.
Label encoding converts labels into numeric, machine-readable form. Machine learning algorithms can then better decide how those labels should be treated. It is an important preprocessing step for structured datasets in supervised learning.
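As a small illustration (separate from the one-hot approach used below), scikit-learn's LabelEncoder maps each distinct string to an integer; this sketch assumes X still holds the original 'car name' column:
# Label encoding sketch: each distinct car name becomes one integer
le = LabelEncoder()
codes = le.fit_transform(X['car name'])
print(codes[:5])        # integer codes
print(le.classes_[:5])  # original names; index = code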
This dataset contains many car model names stored as strings. Label encoding alone would map each distinct name to a single integer; here we go one step further and use one-hot encoding, so each car model gets its own column, with a 1 in the column matching a row's model and 0 everywhere else. As you can see in the figure below, the car in row 3 is an 'AMC rebel sst'. Label encoding assigns 'AMC rebel sst' the number 14, so row 3 has a 1 in column 14 and 0 in all other columns.
from sklearn.preprocessing import OneHotEncoder
# One binary column per distinct car name; this assumes X still
# contains the original 'car name' column.
# Note: sparse_output replaces the sparse argument in scikit-learn 1.2+,
# and np.int was removed from NumPy, so plain int is used.
onehot = OneHotEncoder(dtype=int, sparse_output=True)
nominals = pd.DataFrame(
    onehot.fit_transform(X[['car name']]).toarray())
nominals
Fig 6. One-Hot Encoding of 'Car Names'
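To see which car model each binary column represents, the fitted encoder can report its output feature names (get_feature_names_out is available in scikit-learn 1.0 and later):
# Map one-hot columns back to the car names they encode
print(onehot.get_feature_names_out()[:5])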
Discretization
Data discretization converts a large number of continuous data values into a smaller set, so that evaluating and managing the data becomes easier. In other words, discretization maps the values of a continuous attribute into a finite set of intervals with minimal information loss. There are two forms of data discretization: supervised discretization, which uses the class labels to choose the intervals, and unsupervised discretization, which derives the intervals from the feature values alone.
Sklearn provides a KBinsDiscretizer class that can take care of this. The only things you have to specify are the number of bins (n_bins) for each feature and how to encode these bins ('ordinal', 'onehot', or 'onehot-dense').
from sklearn.preprocessing import KBinsDiscretizer
# Split each feature into 6 equal-width bins, one-hot encoding the bin index
disc = KBinsDiscretizer(
    n_bins=6, encode='onehot', strategy='uniform')
disc.fit_transform(X)
Fig 7. Discretization Of The Dataset Using KBins
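The learned bin boundaries can be inspected after fitting; with strategy='uniform', each feature's range is split into six equal-width intervals:
# Six equal-width bin edges per feature
print(disc.bin_edges_)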
Conclusion
After working through these steps, you'll have a basic grasp of how to preprocess different types of data before using them for machine learning.
THE IPYNB FILE CAN BE FOUND HERE