Data Preprocessing using Python
Suneet Jain · Follow
8 min read · Nov 15, 2020
This article will take you through the basic concepts of Data Preprocessing and show how to implement them using Python. We'll start from the basics, so if you have no prior knowledge of machine learning or data preprocessing, there's no need to worry!
Use the ipynb file available here to follow along on the implementation that I have
performed below. Everything including the dataset is present in the repository.
Let’s begin!
What is Data Preprocessing?
Data Preprocessing is the process of making data suitable for use while training a machine learning model. The dataset initially provided for training might not be in a ready-to-use state: for example, it might not be formatted properly, or it may contain missing or null values.
Solving these problems using various methods is called Data Preprocessing. Using a properly processed dataset while training will not only make life easier for you but also improve the efficiency and accuracy of your model.
Steps in Data Preprocessing:
In this article, We’ll be covering the following steps:
• Importing the libraries
• Importing the dataset
• Taking care of missing data
• Encoding categorical data
• Normalizing the data
• Splitting the data into test and train
Steps of data preprocessing
Step 1: Importing the libraries
To begin, we'll import three basic libraries that are very common in machine learning and will be used almost every time you train a model (the import code is shown after this list):
1. NumPy: a library that allows us to work with arrays; since most machine learning models operate on arrays, NumPy makes this much easier.
2. Matplotlib: a library for plotting graphs and charts, which is very useful for visualizing the results of your model.
3. Pandas: allows us to import our dataset and create the matrix of features containing the dependent and independent variables.
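A minimal sketch of these imports is given below, using the conventional aliases np, plt, and pd (the rest of the code in this article assumes these aliases, although the original notebook may differ slightly):

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# np is used for array operations, plt for plotting, and pd for loading the dataset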
Step 2: Importing the dataset
The data that we’ll be using can be viewed and downloaded from here.
Sample dataset that we’ll be using
As you can see in the above image we are using a very simple dataset that contains
information about customers who have purchased a particular product from a
company.
It contains information about the customers such as their age, salary, and country, and it shows whether a particular customer has purchased the product or not. The dataset also contains some missing values, which we will handle in a later step.
Let’s begin by importing the data.
As the given data is in CSV format, we’ll be using the read_csv function from the
pandas library.
Now we'll display the imported data. Remember that data imported using the read_csv function is in DataFrame format; we'll later convert it into NumPy arrays to perform other operations and training.
data = pd.read_csv('Data.csv')
data
In any dataset used for machine learning, there are two types of variables:
• Independent variable
• Dependent variable
The independent variables are the columns that we use to predict the dependent variable; in other words, the independent variables affect the dependent variable.
Independent and Dependent variable
In our dataset, the country, age, and salary columns are the independent variables and will be used to predict the purchased column, which is the dependent variable.
Step 3: Handling the missing values
As you can see, our dataset has two missing values: one in the Salary column of the 5th row and another in the Age column of the 7th row.
Missing Values
There are multiple ways to handle missing values. One of them is to simply delete the entire entry/row; this is commonly done in datasets with a very large number of entries, where the missing values constitute only a tiny fraction (say 0.1%) of the total data, affect the model negligibly, and can therefore be removed.
In our case, however, the dataset is very small and we cannot just discard those rows. So we use another method: we take the mean of the column containing the missing values (in our case the age or salary column) and replace each missing value with that mean.
To perform this process we will use the SimpleImputer class from the scikit-learn library.
The Python implementation is given below:
from sklearn.impute import SimpleImputer

# 'missing_values=np.nan' means we are targeting missing (NaN) values
# and 'strategy="mean"' replaces them with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

imputer.fit(data.iloc[:, 1:3])
data.iloc[:, 1:3] = imputer.transform(data.iloc[:, 1:3])

# print the dataset
data
Here missing_values=np.nan means that we are targeting the missing values, and strategy='mean' means that we are replacing each missing value with the mean of its column.
Note that we have selected only the columns with numerical data (the age and salary columns), as a mean can only be calculated on numerical data.
After running the above code you’ll get the following output:
Output after replacing missing values with mean
As you can observe, all the missing values have been replaced by the mean of their respective columns.
Step 4: Encoding categorical data
In our case, we have two categorical columns: the country column and the purchased column.
• OneHot Encoding
In the country column, we have three different categories: France, Germany, and Spain.
We could simply label France as 0, Germany as 1, and Spain as 2, but doing this might lead our machine learning model to interpret that there is some correlation between these numbers and the outcome.
To avoid this, we apply OneHot Encoding.
OneHot Encoding consists of turning the country column into three separate columns, each consisting of 0s and 1s. Each country then gets a unique vector/code, so no artificial ordering or correlation between the vectors and the outcome can be formed.
You’ll understand more about it when we implement it below:
To perform this encoding we use the OneHotEncoder and ColumnTransformer classes from the same scikit-learn library.
The ColumnTransformer class allows us to select the column to apply encoding on
and leave the other columns untouched.
Note: The new columns created will be added at the front of the DataFrame and the original column will be deleted.
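The original gist is not reproduced here; below is a minimal sketch of this step. The column index [0] for the country column and the remainder='passthrough' option are assumptions based on the description above:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# apply OneHotEncoder to column 0 (country) and pass the remaining columns through untouched
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])],
                       remainder='passthrough')

# fit_transform returns a NumPy array; wrap it back into a DataFrame so we can keep using iloc
data = pd.DataFrame(ct.fit_transform(data))
data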
After performing the above implementation you’ll get the following output:
New columns created after OneHot encoding
Now we can see that each country has got a unique vector or code, for example,
France is 1 0 0, Spain 0 0 1, and Germany 0 1 0.
• Label Encoding
In the last column, i.e. the purchased column, the data is in binary form, meaning there are only two outcomes: Yes or No. Therefore, here we only need to perform Label Encoding.
In this case, we use the LabelEncoder class from the same scikit-learn library.
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# 'data.iloc[:, -1]' selects the column that we need to encode
data.iloc[:, -1] = le.fit_transform(data.iloc[:, -1])
data
We use data.iloc[:, -1] to select the column we are transforming (the last column).
After performing this our data will look something like this:
Label Encoding
As you can see the purchased column has been successfully transformed.
Now we have completed the encoding of all the categorical data in our dataset and
can move to the next step.
Step 5: Normalizing the dataset
Feature scaling means bringing all of the features in the dataset to the same scale. This is often necessary while training a machine learning model because, in some cases, features with large values dominate to the point that the other features are barely considered by the model.
When we normalize the dataset, the values of all the features are brought between 0 and 1, so all the columns are in the same range and no single feature dominates.
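Concretely, min-max normalization rescales each value using the column's minimum and maximum. Here is a quick illustration with a few made-up ages (not taken from the dataset):

import numpy as np

ages = np.array([27, 30, 38, 44])
# min-max formula: (x - min) / (max - min)
ages_scaled = (ages - ages.min()) / (ages.max() - ages.min())
print(ages_scaled)  # every value now lies between 0 and 1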
To normalize the dataset we use the MinMaxScaler class from the same scikit-learn library.
The implementation of MinMaxScaler is very simple:
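The original gist is not shown here; a minimal sketch of what this step likely looks like is below. Scaling the entire encoded DataFrame is an assumption based on the output described next, since every column is numerical after encoding:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# all columns are numerical after encoding, so we can scale the whole DataFrame
data = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
data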
After running the above code our data set will look something like this:
Dataset after scaling
As you can see in the above image all the values in the dataset are now between 0
and 1, so there are no dominant features, and all features will be considered equally.
Note: Feature scaling is not always necessary; it is required only for some machine learning models.
Step 6: Splitting the dataset
Before we begin training our model, there is one final step: splitting the data into training and testing sets. In machine learning, the larger part of the dataset is used to train the model, and a smaller part is used to test the trained model and measure its accuracy and efficiency.
Before splitting the dataset, we need to separate the dependent and independent variables, which we discussed earlier in the article.
The last (purchased) column is the dependent variable and the rest are independent variables, so we'll store the dependent variable in 'y' and the independent variables in 'X'.
Another important point to remember is that during training the model accepts data as arrays, so we need to convert the data to arrays. We do that while separating the dependent and independent variables, by adding .values when storing the data in 'X' and 'y'.
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values
# .values converts the DataFrame selection into NumPy arrays
print("Independent Variable\n")
print(X)
print("\nDependent Variable\n")
print(y)
After running the above code our data will look something like this:
X and y
Now let's split the dataset into training data and testing data.
To do this we'll use the train_test_split function from the same scikit-learn library.
Deciding the ratio between training and testing data is up to us and depends on what we are trying to achieve with our model. In our case, we'll go with an 80-20 train-test split: 80% training data and 20% testing data.
from sklearn.model_selection import train_test_split

# 'test_size=0.2' means 20% test data and 80% train data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Here test_size=0.2 signifies that we have selected 20% of the data as testing data; you can change that according to your needs.
After this, the X_train, X_test, y_train, and y_test variables will hold their respective data.
Now our data is finally ready for training!!
I hope this article was helpful in understanding data preprocessing. Be sure to drop your feedback and suggestions below.
Thank You for reading!
PS: Once again you can view the code and dataset I have used on this GitHub
repository.