Data Preprocessing using Python
Suneet Jain · Follow
8 min read · Nov 15, 2020
This article will take you through the basic concepts of Data Preprocessing and show how to implement them using Python. We'll start from the basics, so if you have no prior knowledge of machine learning or data preprocessing, there's no need to worry!
Use the ipynb file available here to follow along on the implementation that I have
performed below. Everything including the dataset is present in the repository.
Let’s begin!
What is Data Preprocessing?
Data Preprocessing is the process of making data suitable for use while training a machine learning model. The dataset initially provided for training might not be in a ready-to-use state: for example, it might not be formatted properly, or it may contain missing or null values.
Solving these problems using various methods is called Data Preprocessing. Using a properly processed dataset while training will not only make life easier for you but also improve the efficiency and accuracy of your model.
Steps in Data Preprocessing:
In this article, We’ll be covering the following steps:
• Importing the libraries
• Importing the dataset
• Taking care of missing data
• Encoding categorical data
• Normalizing the data
• Splitting the data into test and train
Steps of data preprocessing
Step 1: Importing the libraries
To begin, we'll import three basic libraries that are very common in machine learning and will be used almost every time you train a model (the import code is shown after this list):
1. NumPy: a library that allows us to work with arrays; since most machine learning models operate on arrays, NumPy makes this much easier.
2. Matplotlib: a library for plotting graphs and charts, which is very useful for visualizing the results of your model.
3. Pandas: allows us to import our dataset and create the matrix of features containing the dependent and independent variables.
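A minimal sketch of these imports is given below, using the conventional aliases np, plt, and pd (the rest of the code in this article assumes these aliases, although the original notebook may differ slightly):

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# np is used for array operations, plt for plotting, and pd for loading the dataset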
Step 2: Importing the dataset
The data that we’ll be using can be viewed and downloaded from here.
Sample dataset that we’ll be using
As you can see in the above image we are using a very simple dataset that contains
information about customers who have purchased a particular product from a
company.
It contains information about the customers such as their age, salary, and country, and it shows whether a particular customer has purchased the product or not. The dataset also contains some missing values, which we will handle in a later step.
Let’s begin by importing the data.
As the given data is in CSV format, we’ll be using the read_csv function from the
pandas library.
Now we'll display the imported data. Remember that data imported using the read_csv function is in DataFrame format; we'll later convert it into NumPy arrays to perform other operations and training.
data = pd.read_csv('Data.csv')
data
In any dataset used for machine learning, there are two types of variables:
• Independent variable
• Dependent variable
The independent variables are the columns that we use to predict the dependent variable; in other words, the independent variables affect the dependent variable.
Independent and Dependent variable
In our dataset, the country, age, and salary columns are the independent variables and will be used to predict the purchased column, which is the dependent variable.
Step 3: Handling the missing values
As you can see, our dataset has two missing values: one in the Salary column of the 5th row and another in the Age column of the 7th row.
Missing Values
There are multiple ways to handle missing values. One of them is to simply delete the entire entry/row; this is commonly done in datasets with a very large number of entries, where the missing values constitute only a tiny fraction (say 0.1%) of the total data, affect the model negligibly, and can therefore be removed.
In our case, however, the dataset is very small and we cannot just discard those rows. So we use another method: we take the mean of the column containing the missing values (in our case the age or salary column) and replace each missing value with that mean.
To perform this process we will use the SimpleImputer class from the scikit-learn library.
The Python implementation is given below:
from sklearn.impute import SimpleImputer

# 'missing_values=np.nan' means we are targeting missing (NaN) values
# and 'strategy="mean"' replaces them with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

imputer.fit(data.iloc[:, 1:3])
data.iloc[:, 1:3] = imputer.transform(data.iloc[:, 1:3])

# print the dataset
data
Here missing_values=np.nan means that we are targeting the missing values, and strategy='mean' means that we are replacing each missing value with the mean of its column.
Note that we have selected only the columns with numerical data (the age and salary columns), as a mean can only be calculated on numerical data.
After running the above code you’ll get the following output:
Output after replacing missing values with mean
As you can observe, all the missing values have been replaced by the mean of their respective columns.
Step 4: Encoding categorical data
In our case, we have two categorical columns: the country column and the purchased column.
• OneHot Encoding
In the country column, we have three different categories: France, Germany, and Spain.
We could simply label France as 0, Germany as 1, and Spain as 2, but doing this might lead our machine learning model to interpret that there is some correlation between these numbers and the outcome.
To avoid this, we apply OneHot Encoding.
OneHot Encoding consists of turning the country column into three separate columns, each consisting of 0s and 1s. Each country then gets a unique vector/code, so no artificial ordering or correlation between the vectors and the outcome can be formed.
You’ll understand more about it when we implement it below:
To perform this encoding we use the OneHotEncoder and ColumnTransformer classes from the same scikit-learn library.
The ColumnTransformer class allows us to select the column to apply encoding on
and leave the other columns untouched.
Note: The new columns created will be added at the front of the DataFrame and the original column will be deleted.
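The original gist is not reproduced here; below is a minimal sketch of this step. The column index [0] for the country column and the remainder='passthrough' option are assumptions based on the description above:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# apply OneHotEncoder to column 0 (country) and pass the remaining columns through untouched
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])],
                       remainder='passthrough')

# fit_transform returns a NumPy array; wrap it back into a DataFrame so we can keep using iloc
data = pd.DataFrame(ct.fit_transform(data))
data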
After performing the above implementation you’ll get the following output:
New columns created after OneHot encoding
Now we can see that each country has got a unique vector or code, for example,
France is 1 0 0, Spain 0 0 1, and Germany 0 1 0.
• Label Encoding
In the last column, i.e. the purchased column, the data is in binary form, meaning there are only two outcomes: Yes or No. Therefore, here we only need to perform Label Encoding.
In this case, we use the LabelEncoder class from the same scikit-learn library.
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# 'data.iloc[:, -1]' selects the column that we need to encode
data.iloc[:, -1] = le.fit_transform(data.iloc[:, -1])
data
We use data.iloc[:, -1] to select the column we are transforming (the last column).
After performing this our data will look something like this:
Label Encoding
As you can see the purchased column has been successfully transformed.
Now we have completed the encoding of all the categorical data in our dataset and
can move to the next step.
Step 5: Normalizing the dataset
Feature scaling means bringing all of the features in the dataset to the same scale. This is often necessary while training a machine learning model because, in some cases, features with large values dominate to the point that the other features are barely considered by the model.
When we normalize the dataset, the values of all the features are brought between 0 and 1, so all the columns are in the same range and no single feature dominates.
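Concretely, min-max normalization rescales each value using the column's minimum and maximum. Here is a quick illustration with a few made-up ages (not taken from the dataset):

import numpy as np

ages = np.array([27, 30, 38, 44])
# min-max formula: (x - min) / (max - min)
ages_scaled = (ages - ages.min()) / (ages.max() - ages.min())
print(ages_scaled)  # every value now lies between 0 and 1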
To normalize the dataset we use the MinMaxScaler class from the same scikit-learn library.
The implementation of MinMaxScaler is very simple:
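The original gist is not shown here; a minimal sketch of what this step likely looks like is below. Scaling the entire encoded DataFrame is an assumption based on the output described next, since every column is numerical after encoding:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# all columns are numerical after encoding, so we can scale the whole DataFrame
data = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
data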
After running the above code our data set will look something like this:
Dataset after scaling
As you can see in the above image all the values in the dataset are now between 0
and 1, so there are no dominant features, and all features will be considered equally.
Note: Feature scaling is not always necessary; it is required only for some machine learning models.
Step 6: Splitting the dataset
Before we begin training our model, there is one final step: splitting the data into training and testing sets. In machine learning, the larger part of the dataset is used to train the model, and a smaller part is used to test the trained model and measure its accuracy and efficiency.
Before splitting the dataset, we need to separate the dependent and independent variables, which we discussed earlier in the article.
The last (purchased) column is the dependent variable and the rest are independent variables, so we'll store the dependent variable in 'y' and the independent variables in 'X'.
Another important point to remember is that during training the model accepts data as arrays, so we need to convert the data to arrays. We do that while separating the dependent and independent variables, by adding .values when storing the data in 'X' and 'y'.
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values
# .values converts the DataFrame selection into NumPy arrays
print("Independent Variable\n")
print(X)
print("\nDependent Variable\n")
print(y)
After running the above code our data will look something like this:
X and y
Now let's split the dataset into training data and testing data.
To do this we'll use the train_test_split function from the same scikit-learn library.
Deciding the ratio between training and testing data is up to us and depends on what we are trying to achieve with our model. In our case, we'll go with an 80-20 train-test split: 80% training data and 20% testing data.
from sklearn.model_selection import train_test_split

# 'test_size=0.2' means 20% test data and 80% train data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Here test_size=0.2 signifies that we have selected 20% of the data as testing data; you can change that according to your needs.
After this, the X_train, X_test, y_train, and y_test variables will hold their respective data.
Now our data is finally ready for training!!
I hope this article was helpful in understanding data preprocessing. Be sure to drop your feedback and suggestions below.
Thank You for reading!
PS: Once again you can view the code and dataset I have used on this GitHub
repository.