Data Preprocessing using Python
Python implementation of data…
by Suneet Jain
This article will take you through the basic concepts of data preprocessing and
implement them using Python. We'll be starting from the basics, so if you have no
prior knowledge of machine learning or data preprocessing, there's no need to worry!
Use the .ipynb file available here to follow along with the implementation I have
performed below. Everything, including the dataset, is present in the repository.
Let’s begin!
Solving all these problems using various methods is called data preprocessing.
Using a properly processed dataset for training will not only make life easier for
you but also increase the efficiency and accuracy of your model.
Step 1: Importing the libraries
1. NumPy: a library that lets us work with arrays; since most machine learning
models operate on arrays, NumPy makes working with them much easier.
2. Matplotlib: this library helps in plotting graphs and charts, which are very
useful for visualizing the results of your model.
3. Pandas: pandas allows us to import our dataset and also creates the matrix of
features containing the dependent and independent variables.
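A minimal import sketch, using the conventional aliases for these three libraries:

# NumPy for arrays, Matplotlib for plotting, pandas for data handling
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd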
Step 2: Importing the dataset
The data that we’ll be using can be viewed and downloaded from here.
Sample dataset that we’ll be using
As you can see in the above image we are using a very simple dataset that contains
information about customers who have purchased a particular product from a
company.
It contains information about the customers, such as their age, salary, and country.
It also shows whether a particular customer has purchased the product or not.
As the given data is in CSV format, we’ll be using the read_csv function from the
pandas library.
Now we'll show the imported data. Keep in mind that data imported using the
read_csv function comes as a pandas DataFrame; we'll later convert it into NumPy
arrays to perform other operations and training.
data = pd.read_csv('Data.csv')
data
In any dataset used for machine learning, there are two types of variables:
• Independent variable
• Dependent variable
The independent variables are the columns that we are going to use to predict the
dependent variable; in other words, the independent variables affect the
dependent variable.

In our dataset, the country, age, and salary columns are the independent variables
and will be used to predict the purchased column, which is the dependent variable.
Step 3: Handling missing values

Now there are multiple ways to handle missing values. One of them is to ignore them
and delete the entire entry/row. This is commonly done in datasets containing a very
large number of entries, where the missing values constitute only about 0.1% of the
total data; they affect the model negligibly and can be removed.
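A one-line sketch of this approach (the variable name is just for illustration):

# Drop every row that contains at least one missing value
data_no_missing = data.dropna(axis=0)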
But in our case, the dataset is very small, and we cannot just ignore those rows. So
we use another method: we take the mean of the entire column containing the
missing values (in our case the age or salary column) and replace the missing values
with that mean.
To perform this process we will use the SimpleImputer class from the scikit-learn
library.
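The exact code cell isn't reproduced here, so this is a minimal sketch of the
imputation, assuming the numerical columns are named Age and Salary as in the
sample dataset:

import numpy as np
from sklearn.impute import SimpleImputer

# Replace each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
data[['Age', 'Salary']] = imputer.fit_transform(data[['Age', 'Salary']])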
Here missing_values=np.nan means that we are replacing values that are missing
(represented as NaN), and strategy='mean' means that we are replacing each missing
value with the mean of its column.
You can see that we have only selected the columns with numerical data, as the mean
can only be calculated on numerical data.
After running the above code you’ll get the following output:
Output after replacing missing values with mean
As you can observe all the missing values have been replaced by the mean of the
column.
Step 4: Encoding categorical data

• OneHot Encoding
In the country column, we have three different categories: France, Germany, Spain.
We could simply label France as 0, Germany as 1, and Spain as 2, but doing this
might lead our machine learning model to interpret that there is some numerical
order in these labels and some correlation between these numbers and the outcome.

OneHot encoding consists of turning the country column into three separate
columns, each consisting of 0s and 1s. Each country thereby gets a unique
vector/code, and no correlation between the vectors and the outcome can be
formed.
Note: The new columns created will be added at the front of the DataFrame, and the
original column will be deleted.
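One way to implement this is with scikit-learn's ColumnTransformer combined with
OneHotEncoder. This is a sketch, assuming the country column sits at index 0 and
the output column names listed below (OneHotEncoder orders categories
alphabetically):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode column 0 (Country); remainder='passthrough' keeps the
# other columns, and the new indicator columns land at the front
ct = ColumnTransformer([('encoder', OneHotEncoder(), [0])],
                       remainder='passthrough')
# fit_transform returns a plain array, so wrap it back into a DataFrame
data = pd.DataFrame(ct.fit_transform(data),
                    columns=['France', 'Germany', 'Spain',
                             'Age', 'Salary', 'Purchased'])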
After performing the above implementation you’ll get the following output:
New columns created after OneHot encoding
Now we can see that each country has a unique vector or code: for example,
France is 1 0 0, Spain is 0 0 1, and Germany is 0 1 0.
• Label Encoding
In the last column, i.e. the purchased column, the data is in binary form, meaning
that there are only two outcomes, either Yes or No. Therefore here we only need to
perform label encoding.

In this case, we use the LabelEncoder class from the same scikit-learn library.
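A minimal sketch, assuming the column is named Purchased as in the sample dataset:

from sklearn.preprocessing import LabelEncoder

# Map the two categories (No/Yes) to 0/1
le = LabelEncoder()
data['Purchased'] = le.fit_transform(data['Purchased'])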
After performing this our data will look something like this:
Label Encoding
As you can see the purchased column has been successfully transformed.
Now we have completed the encoding of all the categorical data in our dataset and
can move to the next step.
Step 5: Feature scaling

Now to normalize the dataset we use the MinMaxScaler class from the same
scikit-learn library.
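A sketch of this step, scaling the two numerical columns (the one-hot and label
encoded columns already contain only 0s and 1s):

from sklearn.preprocessing import MinMaxScaler

# Rescale Age and Salary into the [0, 1] range
scaler = MinMaxScaler()
data[['Age', 'Salary']] = scaler.fit_transform(data[['Age', 'Salary']])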
As you can see in the above image, all the values in the dataset are now between 0
and 1, so there are no dominant features, and all features will be considered equally.
Note: Feature scaling is not always necessary and is only required by some machine
learning models.
Step 6: Splitting the dataset

Now before we begin splitting the dataset, we need to separate the dependent and
independent variables, which we have already discussed above in the article.
The last (purchased) column is the dependent variable and the rest are independent
variables, so we’ll store the dependent variable in ‘y’ and the independent variables
in ‘X’.
Another important thing to remember is that the model accepts data as arrays
during training, so it is necessary to convert the data to arrays. We do that while
separating the dependent and independent variables, by adding .values when storing
the data in X and y.
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values
# The .values attribute converts the data into NumPy arrays
print("Independent Variable\n")
print(X)
print("\nDependent Variable\n")
print(y)
After running the above code our data will look something like this:
X and y
Now let's split the dataset into training data and testing data.

To do this we'll be using the train_test_split function from the same scikit-learn
library.

Deciding the ratio between testing and training data is up to us and depends on
what we are trying to achieve with our model. In our case, we are going to go with
an 80-20 split: 80% training data and 20% testing data.
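A minimal sketch of the split (the random_state value is an arbitrary choice added
here for reproducibility):

from sklearn.model_selection import train_test_split

# Hold out 20% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=1)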
Here test_size=0.2 signifies that we have selected 20% of the data as testing data;
you can change that according to your needs.
After this, the X_train, X_test, y_train, and y_test variables will hold their
respective portions of the data.
PS: Once again, you can view the code and dataset I have used in this GitHub
repository.