Lab 08
Data Preprocessing
Objective:
The objective of this lab is to learn how to preprocess data before applying any machine learning technique.
Activity Outcomes:
On completion of this lab, students will be able to:
Fill in missing values
Handle categorical data
Normalize a dataset for improved results
Split a dataset into training and test sets
Perform feature scaling
Instructor Note:
As a pre-lab activity, read Chapter 2 of the textbook "Learning Data Mining with Python" by Robert Layton, Packt Publishing.
1) Useful Concepts
Data Preprocessing
In a real-world data science project, data preprocessing is one of the most important steps and a common factor in the success of a model: a model trained on correctly preprocessed and well-engineered data is likely to produce noticeably better results than one trained on data that has not been properly prepared.
There are four main steps in preprocessing data:
Taking care of missing values
Taking care of categorical features
Normalizing the dataset
Splitting the dataset into training and test sets
1. Taking Care of Missing Values
There is a famous machine learning phrase you may have heard: "garbage in, garbage out." If your dataset is full of NaNs and garbage values, your model will produce garbage too, so taking care of missing values is important.
You can use the following import for handling missing values:
from sklearn.impute import SimpleImputer
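As a quick illustration, here is a minimal sketch on a small made-up array (not the lab dataset) showing how SimpleImputer replaces NaNs with column means:
import numpy as np
from sklearn.impute import SimpleImputer

toy = np.array([[25.0, 50000.0],
                [np.nan, 62000.0],
                [40.0, np.nan]])
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
print(imputer.fit_transform(toy))
# each NaN becomes its column mean: 32.5 in the first column,
# 56000.0 in the second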
2. Taking care of Categorical Features
We can take care of categorical features by converting them to integers. There are two common ways to do so:
1. Label Encoding
2. One Hot Encoding
A label encoder converts categorical values into integer labels. A one-hot encoder creates a new column for each unique categorical value; the column holds a 1 for the rows where that value occurs, and a 0 otherwise.
You will use the following imports to handle categorical features:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
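As a quick illustration, here is a minimal sketch on a made-up list of city names (not the lab dataset):
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

cities = ['Paris', 'Tokyo', 'Paris', 'Delhi']
le = LabelEncoder()
print(le.fit_transform(cities))
# -> [1 2 1 0]; labels are assigned in alphabetical order

# OneHotEncoder expects a 2-D array, one row per sample
# (on scikit-learn versions before 1.2, use sparse=False instead)
ohe = OneHotEncoder(sparse_output=False)
print(ohe.fit_transform([[c] for c in cities]))
# -> one column per unique city, with a 1 marking each row's city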
3. Normalizing the Dataset
The next part of data preprocessing is the normalization of the dataset. Experiments have repeatedly shown that machine learning and deep learning models perform much better on a normalized dataset than on one that is not normalized.
The goal of normalization is to bring values onto a common scale without distorting the differences between their ranges.
There are several ways to do so; two common ones are min-max scaling and standardization. Here we focus on standardization.
Standard Scaler
Standardization:

$$z = \frac{x - \mu}{\sigma}$$

with mean

$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$

and standard deviation

$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(x_i - \mu\right)^2}$$
You will use the following import for normalizing datasets:
from sklearn.preprocessing import StandardScaler
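As a quick check of the formulas above, here is a minimal sketch on toy data; after scaling, each column should have mean 0 and standard deviation 1:
import numpy as np
from sklearn.preprocessing import StandardScaler

toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0]])
scaler = StandardScaler()
scaled = scaler.fit_transform(toy)
print(scaled.mean(axis=0))  # approximately [0. 0.]
print(scaled.std(axis=0))   # approximately [1. 1.]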
4. Train Test Split
The train-test split is one of the most important steps in machine learning: a model must be evaluated before it is deployed, and that evaluation must be done on unseen data, because once the model is deployed all incoming data is unseen.
The main idea behind the train-test split is to divide the original dataset into two parts:
train
test
where train consists of the training data and training labels, and test consists of the testing data and testing labels.
The following import is used for splitting data into training and testing sets:
from sklearn.model_selection import train_test_split
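As a quick illustration, here is a minimal sketch that splits ten toy samples 80/20; random_state makes the shuffle reproducible:
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)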
2) Solved Lab Activities
Sr. No Allocated Time Level of Complexity CLO Mapping
Activity 1 10 minutes Low CLO-5
Activity 2 10 minutes Medium CLO-5
Activity 3 10 minutes Medium CLO-5
Activity 4 10 minutes Medium CLO-5
Activity 5 10 minutes Medium CLO-5
Activity 1: Importing the libraries and Dataset
# libraries
import numpy as np  # used for handling numerical arrays
import pandas as pd  # used for handling the dataset
from sklearn.impute import SimpleImputer  # used for handling missing data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder  # used for encoding categorical data
from sklearn.model_selection import train_test_split  # used for splitting training and testing data
from sklearn.preprocessing import StandardScaler  # used for feature scaling
If you see any import errors, install the missing packages explicitly using pip, as follows.
pip install <package-name>
Importing the Dataset
First of all, let us have a look at the dataset we are going to use for this particular example. You can find
the dataset here.
Figure 8.1: Sample dataset
To import this dataset into our script, we use pandas as follows.
dataset = pd.read_csv('Data.csv')  # import the dataset into a variable

# split the attributes into independent and dependent attributes
X = dataset.iloc[:, :-1].values  # attributes used to determine the dependent variable / class
Y = dataset.iloc[:, -1].values  # dependent variable / class
When you run this code section, you should not see any errors; if you do, make sure the script and Data.csv are in the same folder. After successful execution, open the variable explorer in the Spyder UI and you will see the following three variables.
Figure 8.2: Description of variables created after running the code
When you double-click each of these variables, you should see something like the following.
Figure 8.3: Detailed description of dataset loaded
Figure 8.4: Detailed description of variables X and Y
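If you are not working in Spyder, you can inspect the same variables with a quick print:
print(dataset.head())
print(X)
print(Y)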
Activity 2: Handling of Missing Data
The first idea is to simply remove the rows that contain missing data. But that can be quite dangerous: if the dataset contains crucial information, removing observations may throw it away. So we need a better way to handle this problem, and the most common one is to replace a missing value with the mean of its column.
In our dataset, two values are missing: one in the Age column (7th data row) and one in the Income column (5th data row). Missing values should be handled during data analysis, so we do that as follows.
# handle missing data: replace np.nan entries with the mean of
# the other values in the same column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:])
X[:, 1:] = imputer.transform(X[:, 1:])
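As an optional sanity check, you can confirm that no NaNs remain in the numeric columns (Age and Income):
print(np.isnan(X[:, 1:].astype(float)).any())
# should print False once the imputer has filled both missing values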
After execution of this code, the independent variable X will transform into the following.
Figure 8.5: Description of independent variable X
Here you can see that the missing values have been replaced by the mean of the respective columns.
Activity 3: Handling of Categorical Data
In this dataset we have two categorical variables: Region and Online Shopper. They are categorical variables because they simply contain categories: Region contains three categories (India, USA, and Brazil), and Online Shopper contains two (Yes and No).
Since machine learning models are based on mathematical equations, you can intuitively understand that keeping text in the categorical variables would cause problems: we only want numbers in the equations. That is why we need to encode the categorical variables, that is, turn the text into numbers. To do this we use the following code snippet.
# encode categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

# one-hot encode the Region column (column 0) and keep the
# remaining columns unchanged
ct = ColumnTransformer([('region', OneHotEncoder(), [0])],
                       remainder='passthrough')
X = ct.fit_transform(X)

# label-encode the dependent variable (Yes/No -> 1/0)
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
Note: older tutorials pass categorical_features=[0] to OneHotEncoder; that parameter has been removed from recent versions of scikit-learn, so we select the column with ColumnTransformer instead.
After execution of this code, the independent variable X and dependent variable Y will transform into the
following.
Figure 8.6: Description of independent variable X and dependent variable Y
Here you can see that the Region variable has been replaced by three binary (dummy) columns, one per country. OneHotEncoder orders the categories alphabetically, so the columns correspond to Brazil, India, and USA; a 1 in a column means the row belongs to that country, otherwise it is 0. For the Online Shopper variable, 1 represents Yes and 0 represents No.
Activity 4: Splitting the dataset into training and testing datasets
Any machine learning algorithm needs to be tested for accuracy. To do that, we divide our dataset into two parts: a training set and a testing set. As the names suggest, we use the training set to make the algorithm learn the behaviours present in the data, and we check the correctness of the algorithm by testing it on the testing set. In Python, we do that as follows:
# splitting the dataset into training set and test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
test_size=0.2, random_state=0)
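As an optional check, print the shapes of the resulting arrays; for a 10-row dataset such as the sample used here, this gives 8 training rows and 2 test rows:
print(X_train.shape, X_test.shape)
print(Y_train.shape, Y_test.shape)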
Here, we take the training set to be 80% of the original dataset and the testing set to be 20%. This is the most common split ratio, though you will also come across 70-30% or 75-25% splits. You generally do not want to go as far as 50-50%, because that leaves too little data for training the model.
Activity 5: Feature Scaling
As you can see, we have two columns, Age and Income, that contain numeric values. Notice that the variables are not on the same scale: Age ranges from 32 to 55, while Income ranges from about 57.6K to 99.6K. Because Age and Income are on different scales, this can cause issues in your machine learning models, since many of them are based on the Euclidean distance between data points.
We use feature scaling to convert different scales to a standard scale to make it easier for Machine
Learning algorithms. We do this in Python as follows:
# feature scaling: fit the scaler on the training set only,
# then apply the same transformation to the test set
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
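As an optional check, the scaled training columns should now have approximately zero mean and unit variance:
print(X_train.mean(axis=0).round(2))  # approximately all zeros
print(X_train.std(axis=0).round(2))   # approximately all ones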
3) Graded Lab Tasks
Note: The instructor can design graded lab activities according to the level of difficulty and complexity of the solved lab activities. The lab tasks assigned by the instructor should be evaluated in the same lab.
1. Download any dataset from the following links
a. Office Supply Sales sample data workbook
b. get the hockey player data file
c. get the food sales data file
and apply the preprocessing steps on that data to make it ready for applying machine learning techniques.