Day11 Machine Learning
For example, many implementations of the Random Forest algorithm do not support null values, so null values in the raw data set must be handled before the algorithm is run.
Another consideration is that the data set should be formatted so that more than one Machine Learning or Deep Learning algorithm can be run on the same data set and the best-performing one chosen.
The data must be in a proper format, and any missing values must be processed, before Machine Learning algorithms are applied.
Course: NIELIT ‘O’ Level (IT) / Machine Learning using Python
Module: M2-R5: Introduction to ICT Resources / Data Preprocessing for Machine Learning
Data Preprocessing Process
Formatting the data to make it suitable for ML (structured format).
Cleaning the data to remove incomplete variables.
Sampling the data further to reduce running times for algorithms and memory requirements.
Selecting data objects and attributes for the analysis.
Creating/changing the attributes.
Data Preprocessing Steps
Programmers usually perform data preprocessing in seven steps:
Step 1: Importing the libraries
Step 2: Loading the Dataset
Step 3: Identifying independent and dependent features
Step 4: Handling of Missing Data
Step 5: Handling of Categorical Data
Step 6: Feature Scaling
Step 7: Splitting the dataset into training and testing datasets
Step 1: Import Libraries
The first step is usually to import the libraries that the program needs. A library is essentially a collection of modules that can be imported and used. Libraries define ready-made functions that the programmer can call.
For example, import the pandas library and assign it the alias pd:
import pandas as pd
Step 2: Loading the Dataset
Load the dataset into a pandas DataFrame using the read_csv() function, which reads a comma-separated values (CSV) file into a DataFrame.
import pandas as pd
dataset = pd.read_csv('Data_for_preprocessing.csv')
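A minimal sketch of what loading looks like in practice. The rows below are hypothetical stand-ins for the contents of Data_for_preprocessing.csv (the real file is not shown in these slides), and they are read from an in-memory string so that the example runs without the file.

```python
import io
import pandas as pd

# Hypothetical sample rows; the real Data_for_preprocessing.csv may differ.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
"""

# read_csv accepts a file path or any file-like object.
dataset = pd.read_csv(io.StringIO(csv_text))

print(dataset.shape)   # number of rows and columns
print(dataset.head())  # first five rows
print(dataset.dtypes)  # data type inferred for each column
```

Inspecting shape, head() and dtypes right after loading is a quick way to confirm that the file was parsed as expected.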
Step 3: Identify Independent and Dependent Variables
The next step of data preprocessing is to identify the independent and dependent variables in the dataset.
Not all features of a dataset are important for a Machine Learning algorithm.
Separating the dependent feature from the independent features is very important in Machine Learning.
Independent Variables
Independent variables (also referred to as features) are the inputs to the process being analyzed.
Independent features/variables are also known as input features/variables and are usually represented as X.
Independent Variables (continued)
For example, in the Data_for_preprocessing dataset, the features Country, Age and Salary are independent features because they do not depend on the Purchased feature.
They must be extracted before starting the Machine Learning process.
They can be extracted from the dataset as follows:
X = dataset.drop(['Purchased'], axis=1)
This drops the ‘Purchased’ feature and assigns the remaining features to X. drop() returns a new DataFrame, so dataset itself is unchanged.
Here, axis=1 means that ‘Purchased’ is the name of a column rather than a row label.
Dependent Variable
Dependent variables/features are the output of the process.
Dependent features/variables are also known as output features/variables and are represented as y.
Dependent Variable (continued)
In the example dataset, the result (whether a user purchased or not) is the dependent variable.
It must be extracted before starting the Machine Learning process.
It can be extracted as follows:
y = dataset['Purchased']
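The separation of X and y can be sketched end to end as follows. The small DataFrame is hypothetical sample data with the same column names as the example dataset; only the column names are taken from the slides.

```python
import pandas as pd

# Hypothetical rows with the column names used in the slides.
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, 27.0, 30.0, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0],
    "Purchased": ["No", "Yes", "No", "No"],
})

# Independent features: every column except the target.
X = dataset.drop(columns=["Purchased"])

# Dependent feature: the target column on its own (a pandas Series).
y = dataset["Purchased"]

print(X.columns.tolist())
print(y.name, len(y))
```

Here drop(columns=[...]) is equivalent to drop([...], axis=1); both leave the original dataset untouched.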
Step 4: Handling of Missing Data
For example, in the Data_for_preprocessing.csv file, the missing values are represented by ‘#’.
The ‘#’ must be replaced with NaN so that pandas recognises these values as missing.
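The slides do not show the replacement code, so the following is only one possible approach: the na_values parameter of read_csv tells pandas to treat ‘#’ as NaN while the file is being read. The CSV text is hypothetical sample data read from memory.

```python
import io
import pandas as pd

# Hypothetical sample rows in which '#' marks a missing value.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,#,48000,Yes
#,30,#,No
Spain,38,61000,No
"""

# na_values="#" converts every '#' cell to NaN during parsing,
# so Age and Salary are still read as numeric columns.
dataset = pd.read_csv(io.StringIO(csv_text), na_values="#")

print(dataset)
print(dataset.isnull().sum())  # missing values per column
```

Replacing the marker at load time keeps the numeric columns numeric; replacing it after loading would leave Age and Salary as text columns that need a further conversion.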
Once the missing values are NaN, they can be detected with isnull().
Checking every column
print(dataset.isnull())
Checking the Age column only
print(dataset['Age'].isnull())
Counting the missing values in each column
print(dataset.isnull().sum())
Replacing Missing Values
A Simple Option: Drop Columns or Rows with Missing Values
To drop columns with missing values:
X = X.dropna(axis=1)
Now, every column that contains a NaN value is dropped from the X dataframe.
Alternatively, to drop rows with missing values:
X = X.dropna()
Now, every row that contains a NaN value is dropped from the X dataframe.
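The difference between the two options can be seen on a small example. The data below is hypothetical, and the two results are stored under separate names (cols_dropped and rows_dropped, chosen for this sketch) so that they can be compared.

```python
import numpy as np
import pandas as pd

# Hypothetical features with one missing Country and one missing Age.
X = pd.DataFrame({
    "Country": ["France", "Spain", np.nan, "Spain"],
    "Age": [44.0, np.nan, 30.0, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0],
})

# axis=1: drop every column that contains at least one NaN.
cols_dropped = X.dropna(axis=1)

# Default axis=0: drop every row that contains at least one NaN.
rows_dropped = X.dropna()

print(cols_dropped.columns.tolist())  # columns that survive
print(rows_dropped.index.tolist())    # row labels that survive
```

Either option discards information: two of three columns are lost in the first case and two of four rows in the second, which is why imputation is often preferred.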
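Replacing Missing Values (continued)
A Better Option: Imputation
The Imputer class was replaced by SimpleImputer (in sklearn.impute) in scikit-learn 0.20 and removed in 0.22. It can take a few parameters:
missing_values: The placeholder for the missing values to be imputed. The default is NaN.
strategy: How the replacement value is computed. The strategy argument can take the values ‘mean’ (default), ‘median’ and ‘most_frequent’; SimpleImputer also accepts ‘constant’.
axis: Accepted by the older Imputer class only: 0 to impute along columns and 1 to impute along rows. SimpleImputer has no axis argument and always imputes column by column.
With the ‘mean’ and ‘median’ strategies, the imputer works on numbers, not strings.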
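To illustrate the effect of the strategy parameter, the sketch below fills the same missing value with each of the three strategies. It uses SimpleImputer from sklearn.impute, and the ages array and results dictionary are hypothetical names and data introduced only for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One hypothetical numeric column with a single missing value (row 3).
ages = np.array([[20.0], [20.0], [30.0], [np.nan], [50.0]])

results = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    filled = imputer.fit_transform(ages)
    # Record the value that replaced the NaN in row 3.
    results[strategy] = float(filled[3, 0])

print(results)
# mean -> 30.0, median -> 25.0, most_frequent -> 20.0
```

The mean is pulled upwards by the single large value, whereas the median and the most frequent value are not, so the choice of strategy matters when a column contains outliers.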
Replacing Numerical Values
For numerical values, the simplest method is to replace the missing values with the column mean.
from sklearn.impute import SimpleImputer
fill_NaN = SimpleImputer(strategy='mean')  # fill_NaN is not defined on the slide; a mean imputer is assumed
X[['Age','Salary']] = fill_NaN.fit_transform(X[['Age','Salary']])
print(X)
Note: The ‘Age’ and ‘Salary’ columns contain numerical values, so the missing values in each column are replaced by that column's mean.
Replacing Categorical Values
For categorical values, count the occurrences of each category and replace the missing values with the most frequent value.
Count the frequency of each category:
X['Country'].value_counts()
Output:
France     4
Spain      3
Germany    1
Replacing Categorical Values (continued)
Replace the missing values with the most frequent value.
The output shows that the most frequent value is ‘France’, so replace the NaN values of the ‘Country’ column with ‘France’:
X['Country'] = X['Country'].fillna('France')
Checking for missing values again:
X.isnull().sum()
Output:
Country    0
Age        0
Salary     0
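Rather than reading the most frequent value off the value_counts() output and typing it by hand, it can be looked up programmatically. This is a sketch using the pandas mode() method on hypothetical data; most_frequent is a variable name chosen for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical Country column with two missing entries.
X = pd.DataFrame({
    "Country": ["France", "Spain", np.nan, "France", "Germany", np.nan],
})

# mode() ignores NaN by default and returns the most frequent value(s);
# [0] picks the first one if several values are tied.
most_frequent = X["Country"].mode()[0]

# Assign the result back instead of using inplace=True on a column.
X["Country"] = X["Country"].fillna(most_frequent)

print(most_frequent)
print(X["Country"].isnull().sum())
```

Assigning the result back to the column avoids the chained in-place modification, which recent pandas versions warn about.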
Step 5: Encoding Categorical Data
Most machine learning algorithms require numerical inputs.
Categorical data are variables that contain label values rather than numeric values.
The number of possible values is often limited to a fixed set.
Such algorithms cannot work directly with variables in text form.
Categorical values must therefore be transformed into numeric values before they are passed to a machine learning algorithm.
Encoding Categorical Data
Categorical values can be transformed into numeric values by:
Label Encoding
One Hot Encoding
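Label Encoding
LabelEncoder:
Encodes target labels with values between 0 and n_classes-1.
This transformer is intended for encoding target values, i.e. y, and not the input X. The example below applies it to the ‘Country’ column of X only to show the codes it produces.
Example:
from sklearn.preprocessing import LabelEncoder
lb_encode = LabelEncoder()
X['Country'] = lb_encode.fit_transform(X['Country'])
print(X.head())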
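To make the mapping visible, the following sketch encodes a short hypothetical list of countries and prints the classes the encoder learned. The countries Series is sample data introduced for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical Country values.
countries = pd.Series(["France", "Spain", "Germany", "Spain", "France"],
                      name="Country")

lb_encode = LabelEncoder()
codes = lb_encode.fit_transform(countries)

# classes_ holds the sorted unique labels; a label's position is its code.
print(list(lb_encode.classes_))  # ['France', 'Germany', 'Spain']
print(codes.tolist())            # [0, 2, 1, 2, 0]

# inverse_transform maps codes back to the original labels.
print(lb_encode.inverse_transform([2])[0])  # Spain
```

The codes follow the sorted order of the labels, so they carry no meaning beyond identifying the class.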
Limitation of Label Encoding
Label encoding converts the data into machine-readable form by assigning a unique number (starting from 0) to each class.
This may introduce a priority problem when training on the data set: a label with a higher value may be treated as having higher priority than a label with a lower value.
For example, on label encoding the ‘Country’ column, suppose France is replaced with 0, Germany with 1 and Spain with 2.
The model may then interpret Spain as having higher priority than Germany and France during training, although no such priority relation exists between these countries.
This can be overcome with One-Hot Encoding.
One Hot Encoding
The technique of converting categorical values into a numerical vector is known as one hot encoding.
It splits a column that contains categorical data into several columns, one for each category present in that column. Each new column contains 1 in the rows that belong to its category and 0 in all other rows.
For each row, the resulting vector has exactly one element equal to 1, and the rest are 0.
For example, in the given dataset the ‘Country’ column contains categorical data, so it must be converted into numerical values before starting the Machine Learning process.
One Hot Encoding (continued)
pd.get_dummies(): a pandas function used to encode categorical values as numerical indicator columns.
Syntax: pd.get_dummies(dataframe)
Example: X = pd.get_dummies(X)
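A short sketch of the resulting columns, using hypothetical data. The columns and dtype arguments are optional parameters of pd.get_dummies; they are used here to name the column to encode explicitly and to produce 0/1 integers rather than booleans. X_encoded is a name chosen for this example.

```python
import pandas as pd

# Hypothetical features: one categorical and one numerical column.
X = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
})

# Encode only the Country column; Age is passed through unchanged.
X_encoded = pd.get_dummies(X, columns=["Country"], dtype=int)

print(X_encoded.columns.tolist())
print(X_encoded)
```

Each row has exactly one 1 among the Country_* columns, so no ordering between the countries is implied.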
Step 6: Feature Scaling
Real-world datasets contain features that vary widely in magnitude, unit and range.
Differences in scale across input variables may increase the difficulty of the problem being modelled. For example, large input values (e.g. a spread of hundreds or thousands of units) can result in a model that learns large weight values.
A model with large weight values is often unstable: it may perform poorly during learning and be sensitive to input values, resulting in higher generalization error.
Feature scaling is a data preprocessing step applied to the independent variables, or features, of the data. Standardization is one common form of it.
It brings the data within a particular range. Sometimes it also speeds up the calculations in an algorithm.
StandardScaler
StandardScaler performs standardization: it rescales each feature to zero mean and unit variance. A dataset usually contains variables that differ in scale. For example, an Employee dataset may contain an AGE column with values in the range 20-70 and a SALARY column with values in the range 10000-80000.
As these two columns differ in scale, they are standardized to a common scale when building a machine learning model.
Scaling is done for numerical values only. Categorical values are not scaled.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X[['Age','Salary']] = scaler.fit_transform(X[['Age','Salary']])
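What StandardScaler computes can be checked by hand: each value has its column mean subtracted and is then divided by the column's population standard deviation. The sketch below compares the two on hypothetical data; scaled and manual are names chosen for this example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numerical features on very different scales.
X = pd.DataFrame({
    "Age": [20.0, 30.0, 40.0, 50.0],
    "Salary": [10000.0, 30000.0, 50000.0, 70000.0],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(X[["Age", "Salary"]])

# Same computation by hand: z = (x - mean) / std, with ddof=0
# (population standard deviation, which StandardScaler uses).
manual = (X - X.mean()) / X.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))  # True
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates purely because of its units.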
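MinMaxScaler
Transforms features by scaling each feature to a given range.
This estimator scales and translates each feature individually so that it lies within the given range on the training set, e.g. between zero and one.
Example:
from sklearn.preprocessing import MinMaxScaler
scalerX = MinMaxScaler()  # scalerX is not defined on the slide; the default range (0, 1) is assumed
X[['Age','Salary']] = scalerX.fit_transform(X[['Age','Salary']])
print(X)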
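For the default range of 0 to 1, the transformation is (x - min) / (max - min) for each column. The following sketch, on hypothetical data, shows the resulting values; scaled is a name chosen for this example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numerical features.
X = pd.DataFrame({
    "Age": [20.0, 30.0, 40.0, 50.0],
    "Salary": [10000.0, 30000.0, 50000.0, 70000.0],
})

scalerX = MinMaxScaler()  # default feature_range=(0, 1)
scaled = scalerX.fit_transform(X[["Age", "Salary"]])

# The smallest value in each column maps to 0 and the largest to 1.
print(scaled)
print(scaled.min(axis=0), scaled.max(axis=0))
```

Unlike standardization, min-max scaling fixes the output range, which makes it sensitive to extreme values in the column.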
Step 7: Splitting the Dataset into Training Set and Test Set
One important aspect of any machine learning model is determining its accuracy. One way to do this is to train the model on the given dataset, predict the response values for the same dataset with that model, and measure the accuracy of those predictions. Because the model has already seen every row, this tends to overestimate how well it will perform on new data.
A better option is to split the data into two parts: one for training the machine learning model and one for testing it.
Train the model on the training set.
Test the model on the testing set, and evaluate how well the model did.
Splitting the Dataset (continued)
train_test_split: splits the data into two sets, train and test.
When passed X and y, it returns four datasets: X_train, X_test, y_train, y_test.
Parameters:
test_size: The size of the test dataset. A float is interpreted as a fraction of the data. For example, if you pass 0.8, 80% of the dataset becomes the test dataset and the remaining 20% the training dataset.
random_state: An integer passed here acts as the seed for the random number generator used during the split, so the same value reproduces the same split.
Splitting the Dataset (continued)
Example:
Now X and y are ready, and the data can be split into two parts: train data and test data.
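The code for this example is not included on the slide, so the following is a sketch of a typical call. The data is hypothetical and assumed to be already preprocessed; test_size=0.2 and random_state=0 are illustrative choices rather than values taken from the slides.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical preprocessed data: 10 rows, 2 features, binary target.
X = pd.DataFrame({
    "Age": [0.1, 0.4, 0.2, 0.9, 0.5, 0.7, 0.3, 0.8, 0.6, 1.0],
    "Salary": [0.2, 0.5, 0.1, 0.8, 0.4, 0.9, 0.3, 0.7, 0.6, 1.0],
})
y = pd.Series([0, 1, 0, 1, 0, 1, 0, 1, 1, 1], name="Purchased")

# Hold out 20% of the rows for testing; fix the seed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
print(y_train.shape, y_test.shape)  # (8,) (2,)
```

The rows of X and y are shuffled together, so each X_train row keeps its matching y_train label, and no row appears in both the training and the test set.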