
Data Preprocessing

Agenda
• Introduction to data
• Different forms of data
• Different types of data in ML models
• Data preprocessing
Introduction to data
Data is a collection of facts, figures, observations, or descriptions of things in an organized or
unorganized form. Data can exist in the form of images, words, numbers, characters, videos,
audio, and so on.

Data Preprocessing
Real-world datasets are generally messy, raw, incomplete, inconsistent, and unusable. They can
contain manual entry errors, missing values, inconsistent schemas, etc. Data Preprocessing
is the process of converting raw data into a format that is understandable and usable. It is a
crucial step in any Data Science project to carry out an efficient and accurate analysis. It
ensures that data quality is consistent before applying any Machine Learning or Data
Mining techniques.
Why is Data Preprocessing Important?
Data Preprocessing is an important step in the Data Preparation stage of a Data Science
development lifecycle that will ensure reliable, robust, and consistent results. The main objective of
this step is to ensure and check the quality of data before applying any Machine Learning or Data
Mining methods. Let’s review some of its benefits –
• Accuracy - Data Preprocessing will ensure that input data is accurate and reliable by ensuring there
are no manual entry errors, no duplicates, etc.
• Completeness - It ensures that missing values are handled, and data is complete for further
analysis.
• Consistency - Data Preprocessing ensures that input data is consistent, i.e., the same data kept in
different places should match.
• Timeliness - It checks whether data is updated regularly and on a timely basis.
• Trustworthiness - It checks whether data comes from trustworthy sources.
• Interpretability - Raw data is generally unusable, and Data Preprocessing converts raw data into an
interpretable format.
The data is processed into an efficient format that can be easily interpreted by the algorithm to
produce the required output accurately.
Key Steps in Data Preprocessing
Data Cleaning
Data Cleaning uses methods to handle incorrect, incomplete, inconsistent, or missing
values. Some of the techniques for Data Cleaning include -
• Handling Missing Values
• Input data can contain missing or NULL values, which must be handled before
applying any Machine Learning or Data Mining techniques.
• Missing values can be handled by many techniques, such as removing rows/columns
containing NULL values or imputing NULL values using the mean, mode, regression,
etc. (see the sketch after this list).
• De-noising
• De-noising is a process of removing noise from the data. Noisy data is meaningless
data that is not interpretable or understandable by machines or humans. It can occur
due to data entry errors, faulty data collection, etc.
• De-noising can be performed by applying many techniques, such as binning the
features, using regression to smooth the features and reduce noise, clustering to
detect outliers, etc.
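Below is a minimal sketch of imputing missing values with scikit-learn's SimpleImputer; the column
names and values are made up for illustration.

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [22, np.nan, 35, 29], "city": ["Delhi", "Pune", None, "Pune"]})

# Numeric column: replace NaN with the column mean
df[["age"]] = SimpleImputer(strategy="mean").fit_transform(df[["age"]])

# Categorical column: replace NaN with the most frequent value (mode)
df[["city"]] = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]])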
Data Integration
Data Integration can be defined as combining data from multiple sources. A few of the
issues to be considered during Data Integration include the following -
• Entity Identification Problem - It can be defined as identifying objects/features from
multiple databases that correspond to the same entity. For example, customer_id in
database A and customer_number in database B may belong to the same entity.
• Schema Integration - It is used to merge two or more database schemas/metadata into a
single schema. It essentially takes two or more schemas as input and determines a
mapping between them. For example, the entity type CUSTOMER in one schema may be
called CLIENT in another schema.
• Detecting and Resolving Data Value Conflicts - The data can be stored in various ways in
different databases, and it needs to be taken care of while integrating them into a single
dataset. For example, dates can be stored in various formats such
as DD/MM/YYYY, YYYY/MM/DD, or MM/DD/YYYY, etc.
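A small illustrative sketch of these three concerns using pandas; the table and column names are
hypothetical.

import pandas as pd

# Database A uses customer_id and DD/MM/YYYY dates; database B uses customer_number and YYYY/MM/DD dates
a = pd.DataFrame({"customer_id": [1, 2], "signup": ["01/02/2020", "15/03/2021"]})
b = pd.DataFrame({"customer_number": [1, 2], "last_order": ["2022/01/05", "2022/02/10"]})

# Entity identification / schema integration: map both keys onto one column name
b = b.rename(columns={"customer_number": "customer_id"})

# Data value conflicts: parse the two date formats into one common representation
a["signup"] = pd.to_datetime(a["signup"], format="%d/%m/%Y")
b["last_order"] = pd.to_datetime(b["last_order"], format="%Y/%m/%d")

merged = a.merge(b, on="customer_id")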
Data Reduction
Data Reduction is used to reduce the volume or size of the input data. Its main objective is
to reduce storage and analysis costs and improve storage efficiency. A few of the popular
techniques to perform Data Reduction include -
• Dimensionality Reduction - It is the process of reducing the number of features in the
input dataset. It can be performed in various ways, such as selecting features with the
highest importance, Principal Component Analysis (PCA), etc.; see the sketch below.
• Numerosity Reduction - In this method, various techniques can be applied to reduce the
volume of data by choosing alternative smaller representations of the data. For example, a
variable can be approximated by a regression model, and instead of storing the entire
variable, we can store the regression model to approximate it.
• Data Compression - In this method, data is compressed. Data Compression can be
lossless or lossy depending on whether the information is lost or not during compression.
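A minimal PCA sketch for dimensionality reduction, using synthetic data.

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)        # 100 samples, 10 features
pca = PCA(n_components=3)          # keep the 3 strongest components
X_reduced = pca.fit_transform(X)   # shape becomes (100, 3)
print(pca.explained_variance_ratio_)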
Data Transformation
Data Transformation is a process of converting data into a format that helps in building efficient
ML models and deriving better insights. A few of the most common methods for Data
Transformation include -
• Smoothing - Data Smoothing is used to remove noise in the dataset, and it helps identify
important features and detect patterns. Therefore, it can help in predicting trends or future
events.
• Aggregation - Data Aggregation is the process of transforming large volumes of data into an
organized and summarized format that is more understandable and comprehensible. For
example, a company may look at monthly sales data of a product instead of raw sales data to
understand its performance better and forecast future sales.
• Discretization - Data Discretization is a process of converting numerical or continuous
variables into a set of intervals/bins, which makes the data easier to analyze. For example, an age
feature can be converted into intervals such as (0-10, 11-20, ...) or (child, young, ...); see the
sketch after this list.
• Normalization - Data Normalization is a process of converting a numeric variable into a
specified range such as [-1,1], [0,1], etc. A few of the most common approaches to performing
normalization are Min-Max Normalization, Data Standardization or Data Scaling, etc.
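A short sketch of discretization with pandas; the ages and bin edges are illustrative.

import pandas as pd

ages = pd.Series([4, 15, 27, 42, 68])
# Convert the continuous age values into labelled intervals/bins
age_groups = pd.cut(ages, bins=[0, 10, 20, 60, 100], labels=["child", "teen", "adult", "senior"])
print(age_groups)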
Conclusion
• Data Preprocessing is a process of converting raw datasets into a format that is
consumable, understandable, and usable for further analysis. It is an
important step in any project that will ensure the input
dataset's accuracy, consistency, and completeness.
Scikit-learn library for data preprocessing

Scikit-learn is a popular open-source machine learning library.


This library provides various essential tools, including algorithms for random forests,
classification, and regression, and of course for data preprocessing as well.
It is built on top of NumPy and SciPy and is easy to learn and understand.
We can use the following code to import the library in the workspace:
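import sklearn   # imports the scikit-learn library into the workspace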

For including the features for preprocessing we can use the following code:
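from sklearn import preprocessing   # brings in the preprocessing utilities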
Identifying and handling the missing values

What Is a Missing Value?


Missing data refers to values that are not stored (or not present) for some variable(s) in the
given dataset. Below is a sample of missing data from the Titanic dataset, where the columns
‘Age’ and ‘Cabin’ have some missing values.
How Is a Missing Value Represented in a Dataset?

In the dataset, blank cells show the missing values.


In Pandas, usually, missing values are represented by NaN. It stands for Not a
Number.
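A quick check with Pandas; the column names follow the Titanic example above, and the values are
illustrative.

import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [22.0, np.nan, 35.0], "Cabin": ["C85", None, np.nan]})
print(df.isnull().sum())   # counts the NaN (missing) values in each column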
Why Is Data Missing From the Dataset?
There can be multiple reasons why certain values are missing from the data. The reason data is
missing affects how it should be handled, so it is necessary to understand why the data could be
missing.
Some of the reasons are listed below:
• Past data might get corrupted due to improper maintenance.
• Observations are not recorded for certain fields; there might be a failure in recording the
values due to human error.
• The user has not provided the values intentionally.
• Item nonresponse: the participant refused to respond.
Why Do We Need to Care About Handling Missing Data?

It is important to handle the missing values appropriately.


• Many machine learning algorithms fail if the dataset contains missing values.
However, some algorithms, such as k-Nearest Neighbors and Naive Bayes, can support data
with missing values.
• You may end up building a biased machine learning model, leading to incorrect
results if the missing values are not handled properly.
• Missing data can lead to a lack of precision in the statistical analysis.
Feature Scaling
It is a step of Data Preprocessing that is applied to the independent variables or features of
data. It helps to normalize the data within a particular range. Sometimes, it also helps in
speeding up the calculations in an algorithm.
Why and Where to Apply Feature Scaling?
• Real-world datasets contain features that vary widely in magnitude, units, and range.
Normalization should be performed when the scale of a feature is irrelevant or misleading, and
should not be performed when the scale is meaningful.
• Algorithms that use Euclidean distance measures are sensitive to magnitudes. Here, feature
scaling helps weigh all the features equally.
• Formally, if a feature in the dataset is large in scale compared to others, then in algorithms
where Euclidean distance is measured, this large-scaled feature becomes dominant and needs
to be normalized.
Rescale data
• Using scikit-learn's MinMaxScaler, which converts values to lie between 0 and 1.
• from sklearn.preprocessing import MinMaxScaler
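A small worked example (the numbers are made up): after MinMaxScaler, each column's minimum
becomes 0 and its maximum becomes 1.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [20.0], [40.0]])
print(MinMaxScaler().fit_transform(X))   # [[0.], [0.333...], [1.]]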

Encoding the Categorical data


Label encoding: Label Encoding is a technique that is used to convert categorical columns
into numerical ones so that they can be fitted by machine learning models which only take
numerical data. It is an important pre-processing step in a machine-learning project.
Example of Label Encoding: Suppose we have a column Height in some dataset that has
elements Tall, Medium, and Short. To convert this categorical column into a numerical column,
we apply label encoding to it. After applying label encoding, the Height column is converted into
a numerical column having elements 0, 1, and 2, where 0 is the label for Tall, 1 is the label for
Medium, and 2 is the label for Short.
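A minimal LabelEncoder sketch based on the Height example; note that scikit-learn assigns codes
alphabetically, so the exact numbers may differ from the mapping described above.

from sklearn.preprocessing import LabelEncoder

heights = ["Tall", "Medium", "Short", "Medium"]
encoder = LabelEncoder()
print(encoder.fit_transform(heights))   # [2 0 1 0]
print(list(encoder.classes_))           # ['Medium', 'Short', 'Tall']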
Limitation of Label Encoding
Label encoding converts the categorical data into numerical data, but it assigns a unique
number (starting from 0) to each class of data. This may lead to priority issues during model
training: a label with a higher value may be considered to have higher priority than a label with a
lower value.
Example of limitation of label encoding
• Consider an attribute with output classes Mexico, Paris, and Dubai. On label encoding,
say Mexico is replaced with 0, Paris with 1, and Dubai with 2.
• It can then be interpreted that Dubai has higher priority than Mexico and Paris while
training the model, but in reality there is no such priority relation between these cities.
One Hot Encoding
One hot encoding is a technique that we use to represent categorical variables as numerical values in a machine
learning model.
The advantages of using one hot encoding include:
• It allows the use of categorical variables in models that require numerical input.
• It can improve model performance by providing more information to the model about the categorical
variable.
• It can help to avoid the problem of ordinality, which can occur when a categorical variable has a natural
ordering (e.g. “small”, “medium”, “large”).
The disadvantages of using one hot encoding include:
• It can lead to increased dimensionality, as a separate column is created for each category in the variable. This
can make the model more complex and slow to train.
• It can lead to sparse data, as most observations will have a value of 0 in most of the one-hot encoded
columns.
• It can lead to overfitting, especially if there are many categories in the variable and the sample size is
relatively small.
In summary, one-hot encoding is a powerful technique to treat categorical data, but it can lead to
increased dimensionality, sparsity, and overfitting. It is important to use it cautiously and consider
other methods such as ordinal encoding or binary encoding.
One Hot Encoding Examples
In one-hot encoding, a categorical feature such as gender is split into separate columns for the
Male and Female labels. So, wherever there is a Male, the value will be 1 in the Male column
and 0 in the Female column, and vice versa. Let's understand with an example:
Consider the data where fruits, their corresponding categorical values, and prices are given.
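A brief sketch with pandas get_dummies; the fruit names and prices are made up, since the original
table is not reproduced here.

import pandas as pd

df = pd.DataFrame({"Fruit": ["Apple", "Mango", "Apple", "Orange"], "Price": [5, 10, 5, 15]})
encoded = pd.get_dummies(df, columns=["Fruit"])
print(encoded)   # adds Fruit_Apple, Fruit_Mango and Fruit_Orange indicator columns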
Standardization
• Standardization is a technique used to scale
the data such that the mean of the data
becomes zero and the standard deviation
becomes one. Here the values are not
restricted to a particular range. We can use
standardization when features of input data set
have large differences between their ranges.
• It is another process of scaling down the data
and making it easier for the machine learning
model to learn from it. In this method, we will
try to reduce the mean to ‘0’ and the standard
deviation to ‘1’.
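A quick illustration with synthetic numbers: after standardization, the column has mean ~0 and
standard deviation ~1, computed as z = (x - mean) / std.

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[10.0], [20.0], [30.0], [40.0]])
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(), X_std.std())   # approximately 0.0 and 1.0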
Normalization
Normalization is scaling the data to be analyzed to a specific range such as [0.0, 1.0] to
provide better results.
from sklearn import preprocessing
# x is the input feature array to be rescaled into the [0, 1] range
min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
X_after_min_max_scaler = min_max_scaler.fit_transform(x)

from sklearn import preprocessing
# StandardScaler takes no data in its constructor; the data is passed to fit_transform
standard_scaler = preprocessing.StandardScaler()
X_after_standard_scaler = standard_scaler.fit_transform(x)
