Lecture 2
Data Types
Categorical Data
Text Data
Pre-processing in NLP
What is Data Preprocessing
• Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model.
• It is the first and crucial step when creating a machine learning model.
Why do we need Data Preprocessing
• Real-world data generally contains
• noise
• missing values
• It may be in an unusable format that cannot be used directly by machine learning models.
Steps for data preprocessing
• Acquire the dataset
• Import all the crucial libraries
• Import the dataset
• Identifying and handling the missing values
• Encoding the categorical data
• Splitting the dataset
• Feature scaling
Acquiring the dataset
• The first step in data preprocessing in machine learning
• The dataset is built from data gathered from multiple, disparate sources, which is then combined into a proper format to form a dataset.
• Dataset formats differ according to use cases.
• A business dataset will be entirely different from a medical dataset.
• A business dataset will contain relevant industry and business data
• A medical dataset will include healthcare-related data.
• Once the dataset is ready, save it in a CSV, HTML, or XLSX file format.
https://www.kaggle.com/datasets https://archive.ics.uci.edu/ml/index.php
Importing the libraries
• Numpy
• It is the fundamental package for scientific computation in Python.
• It is used to perform mathematical operations in the code.
• It also provides support for large multidimensional arrays and matrices.
• Pandas
• Pandas is an open-source Python library for data manipulation and analysis.
• It is used for importing and managing the datasets.
• Matplotlib
• Matplotlib is a Python 2D plotting library used to plot many kinds of charts.
Code:
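A minimal sketch of the imports used throughout this lecture, assuming the conventional aliases np, pd, and plt:

import numpy as np                 # scientific computation, arrays and matrices
import pandas as pd                # importing and managing datasets
import matplotlib.pyplot as plt    # plotting charts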
Sample dataset
• For our exercise the dataset is given in Data.csv file
• It has 10 instances/examples
• It has three independent variables
• Country
• Age
• Salary
• It has one dependent variable
• Purchased
• Two values are missing
• One in Age independent variable
• One in Salary independent variable
• One variable is categorical, i.e., Country
Importing the dataset
Code:
• Save your Python file in the directory containing the dataset.
• read_csv() is a function of the Pandas library. It reads a CSV file.
• For every machine learning model, it is necessary to separate the independent variables and the dependent variables in the dataset.
• To extract the independent variables, you can use the iloc[] indexer of the Pandas library.
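A minimal sketch of this step, assuming the file is named Data.csv as described above and that Purchased is the last column:

import pandas as pd

dataset = pd.read_csv('Data.csv')        # read the CSV file into a DataFrame

# Independent variables: every column except the last (Country, Age, Salary)
X = dataset.iloc[:, :-1].values

# Dependent variable: the last column (Purchased)
y = dataset.iloc[:, -1].values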
Identifying and handling missing values
• In data preprocessing, it is pivotal to identify and correctly handle missing values.
• If you fail to handle missing values, you may draw inaccurate and faulty conclusions and inferences from the data.
• There are two commonly used methods to handle missing data (ask a domain expert which method to use):
• Deleting a particular row
• Impute the data
• Replacing with the mean
• Replacing with the median
• Replacing with the most frequently occurring value
• Replacing with a constant value
Deleting a particular row
• You remove a specific row that has a null value for a feature, or a particular column in which more than 75% of the values are missing.
• However, this method is not 100% efficient, and it is recommended that you use it only when the dataset has adequate samples.
• You must ensure that deleting the data does not introduce any bias.
Code: Deleting rows with nan values
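A minimal sketch, assuming the DataFrame from the import step is called dataset:

# Drop every row that contains at least one NaN value
dataset_no_missing = dataset.dropna(axis=0)

# Alternatively, drop columns in which more than 75% of the values are missing
# (a column is kept only if at least 25% of its values are present)
dataset_no_missing = dataset.dropna(axis=1, thresh=int(0.25 * len(dataset)))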
Impute data
• This method can add variance to the dataset, and any loss of data can
be efficiently negated.
• Hence, it yields better results compared to the first method (omission
of rows/columns)
Code: Replacing nan values
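A minimal sketch using scikit-learn's SimpleImputer, assuming Age and Salary sit in columns 1 and 2 of X:

import numpy as np
from sklearn.impute import SimpleImputer

# Replace missing numeric values with the column mean
# (strategy can also be 'median', 'most_frequent', or 'constant')
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])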
Replacing nan values (most frequent)
[Example table: a Satisfaction column (very satisfied, satisfied, slightly satisfied, not satisfied) in which the missing entry is replaced with the most frequently occurring value.]
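A sketch of most-frequent imputation on a Satisfaction column; the values below are illustrative, not the exact slide data:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'Satisfaction': ['very satisfied', 'satisfied', 'slightly satisfied',
                                    'very satisfied', np.nan, 'not satisfied',
                                    'not satisfied', 'very satisfied', 'slightly satisfied']})

# Replace the missing entry with the most frequently occurring category
imputer = SimpleImputer(strategy='most_frequent')
df[['Satisfaction']] = imputer.fit_transform(df[['Satisfaction']])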
One hot encoding (Nominal Data)
• Nominal data is not ordered.
• If we map nominal data as ordinal data, the ML model may assume that there is some order or correlation between the nominal values, thereby producing faulty output.
• What is the solution?
• To eliminate this issue, we will now use Dummy Encoding.
• Dummy variables are those that take the values 0 or 1 to indicate the absence
or presence of a specific categorical effect that can shift the outcome.
• The value 1 indicates the presence of that variable in a particular column
while the other variables become of value 0.
• In dummy encoding, the number of columns equals the number of categories. This scheme is also known as one-hot encoding.
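A sketch of one-hot (dummy) encoding for the Country column with scikit-learn, assuming Country is column 0 of X:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode column 0 (Country); pass the remaining columns through unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])],
                       remainder='passthrough')
X = np.array(ct.fit_transform(X))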
One Hot encoding
• For the second categorical variable, that is, Purchased, you can use a labelencoder object of the LabelEncoder class.
• We do not use the OneHotEncoder class, since the Purchased variable has only two categories, yes and no, which are encoded into 0 and 1.
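A minimal sketch using LabelEncoder on the dependent variable:

from sklearn.preprocessing import LabelEncoder

# Encode the two categories of Purchased as 0 and 1
labelencoder = LabelEncoder()
y = labelencoder.fit_transform(y)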
Splitting the dataset
• Every dataset for a machine learning model must be split into two separate sets:
• training set
• test set
• This is one of the crucial steps of data preprocessing, as it lets us evaluate and improve the performance of our machine learning model.
• Suppose we train our machine learning model on one dataset and then test it on a completely different dataset. The model will then struggle to understand the correlations between the variables.
• Training Set
• Training set denotes the subset of a dataset that is used for training the machine learning model.
• In the training set, you are already aware of the output.
• Test Set
• A test set is the subset of the dataset that is used for testing the machine learning model.
• The ML model uses the test set to predict outcomes and evaluate the trained ML model
• Usually, the dataset is split in a 70:30 or an 80:20 ratio.
• 70:30 ratio
• This means that you take 70% of the data for training the model and leave the remaining 30% for testing.
• 80:20 ratio
• This means that you take 80% of the data for training the model and leave the remaining 20% for testing.
• The code includes four variables:
• X_train – features for the training data
• X_test – features for the test data
• y_train – dependent variable for the training data
• y_test – dependent variable for the test data
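A minimal sketch using scikit-learn's train_test_split with an 80:20 split (the random_state value is just an illustrative choice):

from sklearn.model_selection import train_test_split

# 80% of the samples go to the training set, 20% to the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)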
Feature scaling
• Feature scaling brings the numeric features onto a comparable scale; two common techniques are:
• Standardization
• Min-Max Normalization
Standardization code
• To standardize the data of the test set, the mean and standard deviation values of the training set are used, so there is no data leakage.
• Hence, we only use the transform() function for the test set instead of the fit_transform() function.
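A minimal sketch using StandardScaler, fitted on the training set only:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)   # learn mean and std from the training set
X_test = sc.transform(X_test)         # reuse the training mean and std: no leakage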
• Using the above code, we get the following standardized data for the train and test datasets:
Training data
Test data
Min Max normalization code
• To normalize the data of the test set, the maximum and minimum values of the training set are used, so there is no data leakage.
• Hence, we only use the transform() function for the test set instead of the fit_transform() function.
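A minimal sketch using MinMaxScaler, again fitted on the training set only:

from sklearn.preprocessing import MinMaxScaler

mm = MinMaxScaler()
X_train = mm.fit_transform(X_train)   # learn min and max from the training set
X_test = mm.transform(X_test)         # reuse the training min and max: no leakage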
• Using the above code, we get the following min-max normalized data for the train and test datasets:
Training data
Test data
Data Binning
• Data binning/bucketing groups data into bins/buckets: it replaces values that fall within a small interval with a single representative value for that interval.
• Sometimes binning improves accuracy in predictive models.
• Binning can be applied to
• convert numeric values to categorical values
• binning by distance
• binning by frequency
• Reduce numeric values
• quantization (or sampling)
• Binning is a technique for data smoothing.
• Data smoothing is employed to remove noise from data.
• Three techniques are used for data smoothing:
• binning
• regression
• outlier analysis
• We will cover only binning here
Example: cupcake
• Google Trends shows the search trend for cupcakes worldwide.
• Code:
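A minimal sketch of the import step, assuming the Google Trends export is saved as cupcake.csv with a column named Cupcake (both names are assumptions):

import pandas as pd

# Load the Google Trends export for "cupcake" searches
df = pd.read_csv('cupcake.csv')
print(df.head())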
Binning by distance
• Import the dataset
• Compute the range of values and find the edges of intervals/bins
• Define labels
• convert numeric values into categorical labels
• Plot the histogram to see the distribution
Binning by distance
• In this case we define the edges of each bin.
• We group the values of the column into three categories:
• small
• medium
• big
• We need to calculate the intervals within which each group falls.
• We calculate the interval range as the difference between the
maximum and minimum value and then we split this interval into
“N=3” parts, one for each group.
• Now we can calculate the range of each interval, i.e. the minimum
and maximum value of each interval.
• Since we have 3 groups, we need 4 edges of intervals (bins):
• small — (edge1, edge2)
• medium — (edge2, edge3)
• big — (edge3, edge4)
• Now we define the labels.
• Convert the numeric values of the column into categorical labels.
• We can plot the distribution of values with a histogram.
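A sketch of equal-width binning with pandas, assuming the df and Cupcake column from the example above:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Four equally spaced edges split the value range into N = 3 intervals
edges = np.linspace(df['Cupcake'].min(), df['Cupcake'].max(), 4)
labels = ['small', 'medium', 'big']

# Convert the numeric values into categorical labels
df['bins'] = pd.cut(df['Cupcake'], bins=edges, labels=labels, include_lowest=True)

# Plot the distribution of values across the three bins
df['bins'].value_counts().reindex(labels).plot(kind='bar')
plt.show()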