Day11 Machine Learning

The document provides a comprehensive guide on data preprocessing for machine learning using Python, detailing the importance and steps involved in transforming raw data into a clean dataset. It outlines seven key steps in the preprocessing process, including importing libraries, handling missing data, and encoding categorical data. The document emphasizes the necessity of proper data formatting to achieve better results in machine learning models.

Uploaded by

Rahul Singh

NATIONAL INSTITUTE OF ELECTRONICS AND INFORMATION TECHNOLOGY

Sumit Complex, A-1/9, Vibhuti Khand, Gomti Nagar, Lucknow,

Machine Learning using Python


1 Day 11
Course: Machine Learning using Python
Module: Data Preprocessing for Machine Learning
2 Index
 Data Preprocessing
 Why Data Preprocessing
 Data Preprocessing Process
 Data Preprocessing Steps
 Step 1: Import Libraries
 Step 2: Loading the dataset
 Step 3: Identify Independent and Dependent Variables
 Independent Variables
 Dependent Variable
 Step 4: Taking care of Missing Data in Dataset
 Replacing missing values
 Step 5: Encoding Categorical Data
 Label Encoding
 One Hot Encoding
 Step 6: Feature Scaling
 Step 7: Splitting the Dataset into Training set and Test Set
3 Data Preprocessing
 In any machine learning process, data preprocessing is the step in which data is transformed or
encoded so that a machine can process it easily. After preprocessing, the features of the data can be
interpreted directly by machine learning algorithms.
 Preprocessing refers to the transformations applied to the data before it is fed to the algorithm.
 It converts raw data into a clean dataset. Data gathered from different sources arrives in a raw
format that is not suitable for analysis.
4 Why Data Preprocessing
 To get better results from a model in a machine learning project, the data must be in a proper
format. Some machine learning models need their input in a specific format.

For example, many Random Forest implementations do not accept null values, so null values in the
raw dataset have to be handled before the algorithm is run.
 The dataset should also be formatted so that more than one machine learning or deep learning
algorithm can be run on the same data and the best of them chosen.
 In short, the data has to be in a proper format, and any missing values must be handled, before
machine learning algorithms are applied.
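
As a small illustration of why missing values must be handled first, the sketch below fits a model to data that contains one NaN. It uses scikit-learn's LinearRegression as a stand-in for the Random Forest mentioned above (an assumption made for this example, because LinearRegression rejects NaN inputs), and the numbers are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical feature matrix with one missing entry.
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

try:
    LinearRegression().fit(X, y)
    fit_failed = False
except ValueError as error:
    # scikit-learn validates its input and refuses NaN here.
    fit_failed = True
    print(error)

# Dropping the incomplete row lets the same model train.
mask = ~np.isnan(X).ravel()
model = LinearRegression().fit(X[mask], y[mask])
print(model.coef_)
```

Dropping the row is only one way to handle the gap; Step 4 of this module covers the alternatives.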
5 Data Preprocessing Process
 Formatting the data to make it suitable for ML (structured format).
 Cleaning the data to remove incomplete variables.
 Sampling the data further to reduce running times for algorithms and memory requirements.
 Selecting data objects and attributes for the analysis.
 Creating/changing the attributes.
6 Data Preprocessing Steps
Programmers usually perform data preprocessing in seven steps:
 Step 1: Importing the libraries
 Step 2: Loading the Dataset
 Step 3: Identify independent and dependent features
 Step 4: Handling of Missing Data
 Step 5: Handling of Categorical Data
 Step 6: Feature Scaling
 Step 7: Splitting the dataset into training and testing datasets
7 Step 1: Import Libraries
 The first step is usually to import the libraries that the program needs. A library is
essentially a collection of modules that can be imported and used. Libraries define ready-made
functions that the programmer can call.

For example, import the pandas library and give it the alias pd:

import pandas as pd
8 Step 2: Loading the dataset
 Load the dataset into a pandas DataFrame using the read_csv() function, which reads a
comma-separated values (CSV) file into a DataFrame.

import pandas as pd
dataset = pd.read_csv('Data_for_preprocessing.csv')
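
As a minimal sketch of what read_csv() returns, the snippet below loads a few rows from an in-memory string instead of a file. The rows are made-up sample values, assumed to mirror the Country, Age, Salary and Purchased columns used later in this module; the real Data_for_preprocessing.csv may differ.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for Data_for_preprocessing.csv.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
"""

# read_csv() accepts a file path or any file-like object.
dataset = pd.read_csv(io.StringIO(csv_text))

print(dataset.head())   # first rows of the DataFrame
print(dataset.shape)    # (number of rows, number of columns)
print(dataset.dtypes)   # column types inferred by pandas
```

Checking shape and dtypes right after loading is a quick way to confirm that the file was parsed as expected.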
9 Step 3: Identify Independent and Dependent Variables
 The next step of data preprocessing is to identify the independent and dependent variables in the
dataset.
 Not all features of a dataset are important for a machine learning algorithm.
 Separating the dependent feature from the independent features is essential in machine learning.
10 Independent Variables
 Independent variables (also referred to as features) are the inputs to the process that is being
analyzed.
 Independent features/variables are also known as input features/variables and are usually
represented as X.
11 Independent Variables (continued)
 For example, in the Data_for_preprocessing dataset, the features Country, Age and Salary are
independent features because they do not depend on the Purchased feature.
 They must be extracted before the machine learning process starts.
 They can be extracted from the dataset as follows:

X = dataset.drop(['Purchased'], axis=1)

This drops the 'Purchased' feature from a copy of the dataset and assigns the remaining features to X.
Here, axis=1 means that 'Purchased' names a column rather than a row.
12 Dependent Variable
 Dependent variables/features are the output of the process.
 Dependent features/variables are also known as output feature/variable and represented as y.
13 Dependent Variable (continued)
 The result (whether a user purchased or not) is the dependent variable.
 It must be extracted before the machine learning process starts.
 It can be extracted as follows:

y = dataset['Purchased']

Now the 'Purchased' column of the dataset is assigned to y.
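
The two extraction steps can be tried together on a small frame. This is a sketch with made-up rows; only the column names are taken from the module's example dataset.

```python
import pandas as pd

# Hypothetical rows; the column names follow the module's example dataset.
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44, 27, 30],
    "Salary": [72000, 48000, 54000],
    "Purchased": ["No", "Yes", "No"],
})

# Independent variables: every column except the target.
X = dataset.drop(["Purchased"], axis=1)

# Dependent variable: the target column as a Series.
y = dataset["Purchased"]

print(X.columns.tolist())
print(y.tolist())
print(X.shape, y.shape)
```

Note that drop() returns a new DataFrame, so dataset itself still contains the 'Purchased' column afterwards.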


14 Step 4: Taking care of Missing Data in Dataset
 In pandas, NumPy and scikit-learn, missing values are represented as NaN.
 pandas skips NaN values in operations such as sum() and count().
 By default, read_csv() treats empty fields and a small set of common markers (for example NA,
NaN and NULL) as NaN.
 Other placeholders such as a space, . (dot), *, $ or # are not recognized as missing values and
are read as ordinary data.
 Such placeholders are handled with the na_values parameter of read_csv().
15 Taking care of Missing Data (continued)
 na_values - lists additional strings that read_csv() should treat as NaN.

For example, in the Data_for_preprocessing.csv file the missing values are represented by '#'.
The '#' can be replaced with NaN as follows:

dataset = pd.read_csv('Data_for_preprocessing.csv', na_values=['#', 'NULL'])


 Here, na_values=['#', 'NULL'] specifies that # and NULL are treated as NaN.
 Any symbol can be listed in na_values. The right symbol depends on the dataset
being used.
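
The effect of na_values can be seen by reading the same text twice, once without and once with the parameter. The file contents below are made up for illustration and are read from an in-memory string rather than from Data_for_preprocessing.csv.

```python
import io

import pandas as pd

# Hypothetical file contents in which '#' marks a missing entry.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,#,48000,Yes
#,30,#,No
"""

# Without na_values, '#' is read as ordinary text.
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values, every '#' becomes NaN.
clean = pd.read_csv(io.StringIO(csv_text), na_values=["#"])

print(raw["Age"].tolist())
print(clean["Age"].tolist())
print(clean.isnull().sum())
```

In the first read the 'Age' column is stored as text because of the stray '#'. In the second read it becomes a numeric column with one NaN.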
16 Checking Missing Values
 isnull(): isnull() function is used to check missing values in a data frame. It returns Boolean values
which are True for NaN values.
 Checking entire data frame

print(dataset.isnull())
 Checking Age column only

print(dataset['Age'].isnull())
 Counting missing values from each column

print(dataset.isnull().sum())
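
The three checks above can be run on a small frame to see what each one returns. This sketch uses made-up values; the extra any(axis=1) step, which is not part of the slides, lists the rows that contain at least one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with three missing entries.
dataset = pd.DataFrame({
    "Country": ["France", None, "Germany", "Spain"],
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, np.nan],
})

missing_per_column = dataset.isnull().sum()     # missing count per column
total_missing = int(missing_per_column.sum())   # missing count overall
rows_with_missing = dataset[dataset.isnull().any(axis=1)]

print(missing_per_column)
print(total_missing)
print(rows_with_missing)
```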
17 Replacing missing values
 A Simple Option: Drop Columns or rows with Missing Values

dropna(): dropna() function is used to drop Rows/Columns with NaN values.


 To drop columns with missing values:

X=X.dropna(axis=1)

Now, the column which has NaN values will be dropped from the X dataframe.
 To drop rows with missing values:

X=X.dropna()

Now, all the rows with NaN values are dropped from the X dataframe.
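
How much data each dropna() call discards can be checked by comparing shapes. The frame below is a made-up example in which two of the three columns and two of the four rows contain a NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical feature frame: 'Age' has one NaN, 'Salary' has one NaN.
X = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, np.nan, 30.0, 38.0],
    "Salary": [72000.0, 48000.0, np.nan, 61000.0],
})

dropped_columns = X.dropna(axis=1)  # keeps only fully populated columns
dropped_rows = X.dropna()           # keeps only fully populated rows

print(X.shape)
print(dropped_columns.shape)
print(dropped_rows.shape)
```

Here dropping columns leaves only 'Country', and dropping rows leaves half of the data, which is why dropping is a simple option but often a costly one.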
18 Replacing missing values
 A Better Option: Imputation
 The SimpleImputer class from sklearn.impute (the replacement for the older Imputer class, which
was removed in scikit-learn 0.22) takes a few parameters:
 missing_values: The placeholder for the missing values that have to be imputed. The default is NaN.
 strategy: The statistic that replaces the missing values. It can take the values
'mean' (default), 'median', 'most_frequent' or 'constant'.
 SimpleImputer always imputes column by column. The axis parameter of the old Imputer class
(0 for columns, 1 for rows) no longer exists.
 The 'mean' and 'median' strategies work on numbers only, not strings.
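
To compare the strategies, the sketch below fills the same missing value three ways. It uses SimpleImputer from sklearn.impute, the imputation class that current scikit-learn releases provide, and a made-up 'Age' column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical 'Age' column with one missing value at row index 3.
ages = np.array([[20.0], [30.0], [30.0], [np.nan], [60.0]])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    # fit_transform learns the statistic and fills the NaN with it.
    filled[strategy] = imputer.fit_transform(ages)[3, 0]

print(filled)
```

The mean (35) is pulled upwards by the single large value 60, while the median and the most frequent value are both 30. For skewed columns the median is often the more robust choice.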
19 Replacing Numerical Values
 For numerical values, the simplest method is to replace the missing values with the column
mean.

import numpy as np
from sklearn.impute import SimpleImputer

# SimpleImputer replaces the Imputer class removed in scikit-learn 0.22
fill_NaN = SimpleImputer(missing_values=np.nan, strategy='mean')

X[['Age','Salary']] = fill_NaN.fit_transform(X[['Age','Salary']])

print(X)
 Note: The 'Age' and 'Salary' columns contain numerical values, so the missing values in each
column are replaced by that column's mean.
20 Replacing Categorical Values
 For categorical values, count the occurrences of each category and replace the missing
values with the most frequent value.
 Count the frequency of each category:

# Imputing missing values of categorical column 'Country'

# Counting frequency of each category in 'Country' column using value_counts()

X['Country'].value_counts()
Output:
France     4
Spain      3
Germany    1
21 Replacing Categorical Values (continued)
 Replace the missing values with the most frequent value

The output shows that the most frequent value is 'France', so replace the NaN values of the
'Country' column with 'France'.

# Replacing the NaN values with 'France'

X['Country'] = X['Country'].fillna('France')
 Check the missing values again:

X.isnull().sum()
Output:
Country    0
Age        0
Salary     0
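
Instead of reading the most frequent category off the value_counts() output and typing it in by hand, it can be looked up in code. The sketch below uses the pandas mode() method on a made-up 'Country' column; it is one possible approach, not the one shown on the slides.

```python
import numpy as np
import pandas as pd

# Hypothetical 'Country' column with two missing entries.
X = pd.DataFrame({
    "Country": ["France", "Spain", np.nan, "France", "Germany", np.nan],
})

# mode() returns the most frequent value(s); [0] takes the first one.
most_frequent = X["Country"].mode()[0]

# Assign the result back instead of relying on inplace=True.
X["Country"] = X["Country"].fillna(most_frequent)

print(most_frequent)
print(X["Country"].value_counts())
print(X["Country"].isnull().sum())
```

Assigning the filled column back to X avoids the chained-assignment warnings that inplace=True on a selected column can trigger in recent pandas versions.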
22 Step 5: Encoding Categorical Data
 Most machine learning algorithms require numerical inputs.
 Categorical data are variables that contain label values rather than numeric values.
 The number of possible values is often limited to a fixed set.
 Such algorithms cannot work directly with variables in text form.
 Categorical values must therefore be transformed into numeric values before they are passed to a
machine learning algorithm.
23 Encoding Categorical Data
Categorical values can be transformed into numeric values by:
 Label Encoding
 One Hot Encoding
24 Label Encoding
LabelEncoder:
 Encodes target labels with values between 0 and n_classes-1.
 This transformer is intended for encoding target values, i.e. y, and not the input X.
 Example (applied to the 'Country' feature column here only to illustrate the integer mapping):

from sklearn.preprocessing import LabelEncoder

lb_encode = LabelEncoder()

# Encode labels in column 'Country'.

X['Country']= lb_encode.fit_transform(X['Country'])

print(X.head())
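
Since LabelEncoder is meant for the target, the sketch below applies it to y. The 'Purchased' values are made up; classes_ and inverse_transform are standard attributes of a fitted LabelEncoder.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values for the 'Purchased' column.
y = ["No", "Yes", "No", "Yes", "Yes"]

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)

# classes_ lists the original labels in sorted order; the position of
# each label is the integer it was mapped to.
print(encoder.classes_)
print(y_encoded)

# inverse_transform maps the integers back to the original labels.
y_decoded = encoder.inverse_transform(y_encoded)
print(y_decoded)
```

Keeping the fitted encoder around makes it possible to turn model predictions back into readable labels later.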
26 Limitation of Label Encoding
 Label encoding converts the data into machine-readable form by assigning a unique number
(starting from 0) to each class.
 This can introduce a false ordering when a model is trained. A label with a higher value may
be treated as having higher priority than a label with a lower value.
 For example, label encoding the 'Country' column might replace France with 0, Germany with 1
and Spain with 2.
 A model could then interpret Spain as having higher priority than Germany and France, although
no such ordering exists between these countries.
 One-hot encoding overcomes this problem.
27 One Hot Encoding
 One hot encoding is a technique that converts categorical values into numerical vectors.
 It splits a column of categorical data into several columns, one for each category
present in that column. Each new column contains 1 in the rows that belong to its category and 0
in all other rows.
 For each row, the resulting vector has exactly one element equal to 1 and the rest equal to 0.

For example, in the given dataset the 'Country' column contains categorical data, so it must
be converted into numerical values before the machine learning process starts.
28 One Hot Encoding (continued)
get_dummies(): A pandas function that encodes categorical columns as one-hot (indicator) columns.

Syntax: pd.get_dummies(dataframe)

Example: X = pd.get_dummies(X)
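
The sketch below shows what the one-hot encoded frame looks like for a made-up 'Country' column. The columns and dtype arguments are optional choices made for this example: columns restricts the encoding to the named column, and dtype=int requests 0/1 values where recent pandas versions would otherwise produce True/False.

```python
import pandas as pd

# Hypothetical feature frame with one categorical column.
X = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44, 27, 30, 38],
})

# Only the 'Country' column is expanded; 'Age' passes through unchanged.
X_encoded = pd.get_dummies(X, columns=["Country"], dtype=int)

country_columns = ["Country_France", "Country_Germany", "Country_Spain"]

print(X_encoded.columns.tolist())
print(X_encoded)

# Exactly one indicator is set in every row.
print(X_encoded[country_columns].sum(axis=1).tolist())
```

Each country gets its own column, so no country is numerically "larger" than another, which addresses the priority issue of label encoding.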
29 Step 6: Feature Scaling
 Real-world datasets contain features that vary widely in magnitude, unit and range.
 Differences in scale across input variables can make the problem harder to model. For example,
large input values (e.g. a spread of hundreds or thousands of units) can result in a model that
learns large weight values.
 A model with large weight values is often unstable: it may learn poorly and be sensitive to
small changes in its inputs, which results in higher generalization error.
 Feature scaling is a data preprocessing step that is applied to the independent variables
(features) of the data. Standardization is one common form of it.
 It brings the data into a particular range. Sometimes it also speeds up the calculations in an
algorithm.
31 StandardScaler
 StandardScaler performs standardization: it rescales each feature to zero mean and unit
variance. A dataset usually contains variables on different scales. For example, an Employee
dataset may contain an AGE column with values in the range 20-70 and a SALARY column with values
in the range 10000-80000.
 Because these two columns differ in scale, they are standardized to a common scale when a
machine learning model is built.
 Scaling is done for numerical values only. Categorical values are not scaled.

Example:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Scaling the 'Age' and 'Salary' columns only

X[['Age','Salary']] = scaler.fit_transform(X[['Age','Salary']])
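
What StandardScaler computes can be checked by hand: for each column it subtracts the mean and divides by the standard deviation. The sketch below compares the two on made-up 'Age' and 'Salary' values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical 'Age' and 'Salary' columns on very different scales.
data = np.array([
    [20.0, 10000.0],
    [30.0, 40000.0],
    [50.0, 60000.0],
    [70.0, 80000.0],
])

scaled = StandardScaler().fit_transform(data)

# Same result by hand: subtract the column mean, divide by the
# population standard deviation (NumPy's default, ddof=0).
manual = (data - data.mean(axis=0)) / data.std(axis=0)

print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
print(np.allclose(scaled, manual))
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates the other because of its units.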
32 MinMaxScaler
 Transform features by scaling each feature to a given range.
 This estimator scales and translates each feature individually such that it is in the given range on
the training set, e.g. between zero and one.

Example:

from sklearn.preprocessing import MinMaxScaler

scalerX = MinMaxScaler(feature_range=(0, 1))

X[['Age','Salary']] = scalerX.fit_transform(X[['Age','Salary']])

X
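
With feature_range=(0, 1), MinMaxScaler applies (x - min) / (max - min) to each column. The sketch below confirms this on a made-up 'Age' column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical 'Age' values.
ages = np.array([[20.0], [30.0], [45.0], [70.0]])

scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(ages)

# Same result by hand: (x - min) / (max - min).
manual = (ages - ages.min()) / (ages.max() - ages.min())

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

The smallest value maps to 0 and the largest to 1. Unlike standardization, the result is bounded, but a single extreme value can squeeze all other values into a narrow part of the range.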
33 Step 7: Splitting the Dataset into Training set and Test Set
 An important part of building any machine learning model is measuring its accuracy. One way is to
train the model on the given dataset and then predict the response values for the same dataset,
but this tends to overestimate accuracy because the model has already seen those examples.
 A better option is to split the data into two parts: one for training the machine learning model
and one for testing it.
 Train the model on the training set.
 Test the model on the testing set, and evaluate how well the model did.
34 Step 7: Splitting the Dataset (continued)
 train_test_split: splits the data into two sets: train and test.
 It returns four datasets: X_train, X_test, y_train, y_test.

Parameters:
 test_size: The share of the data that goes into the test set, given as a fraction
between 0 and 1. For example, if you pass 0.2, 20% of the dataset becomes the test set and the
remaining 80% the training set.
 random_state: An integer that seeds the random number generator used for the split, so that
the same split can be reproduced.
35 Step 7: Splitting the Dataset (continued)
 Example:
 Now X and y are ready. Split the data into two parts, train data and test data:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)


 80% of the data is assigned as training data and the remaining 20% as
testing data.
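
The sizes of the four returned sets, and the effect of random_state, can be checked on a small array. The data below is made up so that each label can be matched to its feature row.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, and 10 labels.
# Row i of X is [2*i, 2*i + 1] and its label is i.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

print(X_train.shape, X_test.shape)  # 8 training rows, 2 test rows
print(y_train.shape, y_test.shape)

# The same random_state reproduces the same split.
_, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.20, random_state=42
)
print(np.array_equal(X_test, X_test_again))
```

The rows are shuffled before splitting, but features and labels stay aligned: every row in X_test still belongs to the label at the same position in y_test.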