
Data Preprocessing using Python


Suneet Jain

8 min read · Nov 15, 2020

This article will take you through the basic concepts of Data Preprocessing and implement them using Python. We'll be starting from the basics, so if you have no prior knowledge of machine learning or data preprocessing, there's no need to worry! Use the ipynb file available here to follow along with the implementation below; everything, including the dataset, is present in the repository.

Let’s begin!

What is Data Preprocessing?


Data Preprocessing is the process of making data suitable for use while training a machine learning model. The dataset initially provided for training might not be in a ready-to-use state; for example, it might not be formatted properly, or it may contain missing or null values.

Solving all of these problems using various methods is called Data Preprocessing. Using a properly processed dataset while training will not only make life easier for you but also increase the efficiency and accuracy of your model.

Steps in Data Preprocessing:


In this article, we'll be covering the following steps:

• Importing the libraries

• Importing the dataset

• Taking care of missing data

• Encoding categorical data

• Normalizing the data

• Splitting the data into training and test sets


Steps of data processing


Step 1: Importing the libraries


To begin, we'll import three basic libraries that are very common in machine learning and will be used every time you train a model (a sketch of the import cell follows the list):

1. NumPy: a library that lets us work with arrays; since most machine learning models operate on arrays, NumPy makes this much easier.

2. Matplotlib: a library for plotting graphs and charts, which is very useful for visualizing the results of your model.

3. Pandas: allows us to import our dataset and create the matrix of features containing the dependent and independent variables.
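The import cell itself isn't visible in this copy of the article; a minimal sketch, assuming the conventional aliases (the later snippets rely on np and pd), would be:

# conventional aliases; 'np' and 'pd' are used by the snippets below
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd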
Step 2: Importing the dataset

The data that we'll be using can be viewed and downloaded from here.

Sample dataset that we'll be using

As you can see in the above image, we are using a very simple dataset that contains information about customers who have purchased a particular product from a company.

It contains various details about the customers, such as their age, salary, and country, and it shows whether a particular customer has purchased the product or not.

It also contains a couple of null values, which we'll handle in Step 3.

Let’s begin by importing the data.

As the given data is in CSV format, we’ll be using the read_csv function from the
pandas library.

Now we'll display the imported data. Remember that data imported using the read_csv function is stored as a DataFrame; we'll later convert it into NumPy arrays to perform other operations and training.
data = pd.read_csv('Data.csv')
data

In any dataset used for machine learning, there are two types of variables:

• Independent variable

• Dependent variable

The independent variables are the columns that we are going to use to predict the dependent variable; in other words, the independent variables affect the dependent variable.

Independent and Dependent variable

In our dataset, the country, age, and salary columns are the independent variables and will be used to predict the purchased column, which is the dependent variable.

Step 3: Handling the missing values


As you can see, our dataset has two missing values: one in the Salary column in the 5th row and another in the Age column in the 7th row.
Missing Values

There are multiple ways to handle missing values. One of them is to ignore them and delete the entire entry/row; this is commonly done in datasets with a very large number of entries, where the missing values constitute only a tiny fraction (say 0.1%) of the total data. In that case they affect the model negligibly and can be removed.

But in our case the dataset is very small, so we cannot simply drop those rows. Instead we use another method: we take the mean of each column containing missing values (in our case the Age and Salary columns) and replace the missing values with that mean.
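For intuition, the same replacement can be sketched with plain pandas as an alternative to the scikit-learn approach below (this snippet is not from the original article and assumes the columns are named Age and Salary, as in the screenshots):

# replace NaNs in each numeric column with that column's mean
data['Age'] = data['Age'].fillna(data['Age'].mean())
data['Salary'] = data['Salary'].fillna(data['Salary'].mean())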

To perform this process we will use the SimpleImputer class from the scikit-learn library.

Code for the Python implementation is given below:


from sklearn.impute import SimpleImputer

# 'np.nan' signifies that we are targeting missing values
# and the strategy we choose is to replace them with the 'mean'
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

imputer.fit(data.iloc[:, 1:3])
data.iloc[:, 1:3] = imputer.transform(data.iloc[:, 1:3])

# print the dataset
data

Here "missing_values=np.nan" means that we are targeting the missing values, and "strategy='mean'" means that we are replacing each missing value with the mean of its column.

You can see that we have only selected the columns with numerical data, as a mean can only be calculated on numerical data.

After running the above code you’ll get the following output:
Output after replacing missing values with mean

As you can observe, all the missing values have been replaced by the mean of their respective columns.

Step 4: Encoding categorical data


In our case, we have two categorical columns: the country column and the purchased column.

• OneHot Encoding

In the country column, we have three different categories: France, Germany, and Spain. We could simply label France as 0, Germany as 1, and Spain as 2, but doing this might lead our machine learning model to interpret these numbers as an ordering and infer some correlation between the numbers and the outcome.

So to avoid this, we apply OneHot Encoding.

OneHot Encoding consists of turning the country column into three separate columns, each consisting of 0s and 1s. Each country therefore gets a unique binary vector, and no spurious ordering or correlation between the categories and the outcome can be inferred.

You’ll understand more about it when we implement it below:

To perform this encoding we use the OneHotEncoder and ColumnTransformer classes from the same scikit-learn library.

The ColumnTransformer class allows us to select the column to apply the encoding to and leave the other columns untouched.

Note: The new columns created will be added at the front of the data frame and the original column will be deleted.
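The code embed for this step did not survive in this copy; a minimal sketch of what it might look like, assuming the country column sits at index 0 and that we keep working with a DataFrame afterwards, is:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# encode column 0 (Country) and pass every other column through unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])],
                       remainder='passthrough')

# fit_transform returns an array, so wrap the result back into a DataFrame
data = pd.DataFrame(ct.fit_transform(data))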
After performing the above implementation you’ll get the following output:
New columns created after OneHot encoding

Now we can see that each country has a unique vector: for example, France is 1 0 0, Spain is 0 0 1, and Germany is 0 1 0.

• Label Encoding

In the last column, i.e. the purchased column, the data is binary: there are only two outcomes, Yes or No. Here we need to perform Label Encoding.

For this, we use the LabelEncoder class from the same scikit-learn library.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
data.iloc[:, -1] = le.fit_transform(data.iloc[:, -1])
# 'data.iloc[:,-1]' is used to select the column that we need encoded
data


We use 'data.iloc[:,-1]' to select the last column, which is the one we are transforming.

After performing this our data will look something like this:

Label Encoding

As you can see the purchased column has been successfully transformed.

Now we have completed the encoding of all the categorical data in our dataset and
can move to the next step.

Step 5: Normalizing the dataset


Feature scaling means bringing all of the features in the dataset onto the same scale. This matters while training a machine learning model because features with large numeric ranges can dominate, so that features with smaller ranges barely influence the model.

When we normalize the dataset, each value x of a feature is mapped to (x − min) / (max − min), which brings the values of every feature between 0 and 1. All columns end up in the same range, so there is no dominant feature.

To normalize the dataset we use the MinMaxScaler class from the same scikit-learn library.

The implementation of MinMaxScaler is very simple:
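The code embed is again missing here; a minimal sketch, assuming every column of data is numeric by this point, could be:

from sklearn.preprocessing import MinMaxScaler

# rescale every column so that its values fall between 0 and 1
scaler = MinMaxScaler()
data = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)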


After running the above code, our dataset will look something like this:
Dataset after scaling

As you can see in the above image, all the values in the dataset are now between 0 and 1, so there is no dominant feature and all features will be weighted comparably.

Note: Feature scaling is not always necessary; it is only required for some machine learning models.

Step 6: Splitting the dataset


Before we begin training our model there is one final step: splitting the dataset into training and testing sets. In machine learning, the larger part of the dataset is used to train the model, and a smaller part is used to test the trained model to estimate its accuracy and efficiency.

Now, before we split the dataset, we need to separate the dependent and independent variables, which we discussed earlier in the article.

The last (purchased) column is the dependent variable and the rest are independent
variables, so we’ll store the dependent variable in ‘y’ and the independent variables
in ‘X’.

Another important point to remember: during training the model accepts data as arrays, so we need to convert the data into NumPy arrays. We do that while separating the dependent and independent variables, by adding .values when storing the data in 'X' and 'y'.

X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values
# .values converts the data into NumPy arrays
print("Independent Variable\n")
print(X)
print("\nDependent Variable\n")
print(y)

After running the above code our data will look something like this:

X and y

Now let’s split the dataset between Testing data and Training data.

To do this we'll be using the train_test_split function from the same scikit-learn library. Deciding the ratio between testing data and training data is up to us and depends on what we are trying to achieve with our model. In our case, we are going to go with an 80-20 train-test split: 80% training data and 20% testing data.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# 'test_size=0.2' means 20% test data and 80% train data

Here test_size=0.2 signifies that we have selected 20% of the data as testing data; you can change that according to your needs.

After this, X_train, X_test, y_train, and y_test will hold their respective portions of the data.
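As a quick sanity check (not part of the original article), you can print the shapes of the resulting arrays; with an 80-20 split, X_train should hold roughly four times as many rows as X_test:

# verify the split sizes
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)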

Now our data is finally ready for training!!


I hope this article was helpful in understanding data preprocessing. Be sure to drop your feedback and suggestions below.

Thank You for reading!

PS: Once again you can view the code and dataset I have used on this GitHub
repository.
