Data Preprocessing: Concepts
Data is truly considered a resource in today's world. As per the World Economic Forum, by 2025 we will be generating about 463 exabytes of data globally per day. But is all this data fit to be used by machine learning algorithms? How do we decide that? Data preprocessing is the answer: transforming the data so that it becomes machine-readable.

What is Data Preprocessing?
We usually think of large datasets with a huge number of rows and columns. While that is a likely scenario, it is not always the case: data comes in many different forms, such as structured tables, images, audio files and videos. Machines do not understand free text, image or video data as it is; they understand 1s and 0s. In any machine learning process, data preprocessing is the step in which the data gets transformed, or encoded, to bring it to such a state that the machine can easily parse it. In other words, the features of the data can now be easily interpreted by the algorithm.

Features in Machine Learning
A dataset can be viewed as a collection of data objects, which are often also called records, points, vectors, patterns, events, cases, samples, observations, or entities. Data objects are described by a number of features that capture the basic characteristics of an object, such as the mass of a physical object or the time at which an event occurred. Features are often called variables, characteristics, fields, attributes, or dimensions. A feature is an individual measurable property or characteristic of a phenomenon being observed.

Statistical Data Types
Categorical: Features whose values are taken from a defined set of values. For instance, the days of the week {Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday} form a category, because the value is always taken from this set. Another example is the Boolean set {True, False}.
Numerical: Features whose values are continuous or integer-valued. They are represented by numbers and possess most of the properties of numbers. For instance, the number of steps you walk in a day, or the speed at which you are driving your car.
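To make the distinction concrete, here is a minimal sketch in pandas; the column names and values are made up purely for illustration.

```python
import pandas as pd

# A tiny, made-up dataset: one categorical feature and two numerical features.
df = pd.DataFrame({
    "day_of_week": ["Monday", "Tuesday", "Monday", "Sunday"],  # categorical
    "steps_walked": [4200, 8100, 6300, 12000],                 # numerical (integer-valued)
    "speed_kmh": [42.5, 38.0, 55.2, 47.8],                     # numerical (continuous)
})

# Marking the categorical column explicitly lets pandas track its defined set of values.
df["day_of_week"] = df["day_of_week"].astype("category")

print(df.dtypes)                         # category vs int64 vs float64
print(df["day_of_week"].cat.categories)  # the defined set of values
```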
STEPS OF DATA PREPROCESSING
• Data Quality Assessment
• Feature Aggregation
• Feature Sampling
• Dimensionality Reduction
• Feature Encoding

Data Quality Assessment
Because data is often taken from multiple sources, which are normally not too reliable and come in different formats, more than half our time is consumed in dealing with data quality issues when working on a machine learning problem. It is simply unrealistic to expect the data to be perfect. There may be problems due to human error, limitations of measuring devices, or flaws in the data collection process. Let's go over a few common issues and methods to deal with them.

MISSING VALUES
It is very usual to have missing values in a dataset. They may have appeared during data collection, or because of some data validation rule, but regardless, missing values must be taken into consideration.
• Eliminate rows with missing data: a simple and sometimes effective strategy. It fails if many objects have missing values. If a feature has mostly missing values, that feature itself can also be eliminated.
• Estimate missing values: if only a reasonable percentage of values are missing, we can run simple interpolation methods to fill them in. However, the most common method is to fill them in with the mean, median or mode of the respective feature.

INCONSISTENT VALUES
Data can contain inconsistent values, and most of us have faced this issue at some point. For instance, the 'Address' field may contain a 'Phone number'. This may be due to human error, or the information may have been misread while being scanned from a handwritten form. It is therefore always advised to perform a data assessment, for example checking what the data type of each feature should be and whether it is the same for all data objects.

DUPLICATE VALUES
A dataset may include data objects which are duplicates of one another. This can happen when, say, the same person submits a form more than once. The term deduplication is often used to refer to the process of dealing with duplicates. In most cases, the duplicates are removed so as not to give that particular data object an advantage or bias when running machine learning algorithms.
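As a rough sketch of these strategies (row elimination, imputation with the mean/median/mode, and deduplication), the snippet below uses pandas on a small made-up table; the column names and values are assumptions for illustration only.

```python
import pandas as pd

# Made-up dataset with missing values and a duplicated record.
df = pd.DataFrame({
    "age":    [25, None, 31, 31, 47],
    "city":   ["Pune", "Delhi", "Mumbai", "Mumbai", None],
    "income": [52000, 61000, None, None, 75000],
})

# Strategy 1: eliminate rows with missing data.
dropped = df.dropna()

# Strategy 2: estimate missing values, most commonly with the mean, median or mode.
df["age"] = df["age"].fillna(df["age"].median())         # numerical -> median
df["income"] = df["income"].fillna(df["income"].mean())  # numerical -> mean
df["city"] = df["city"].fillna(df["city"].mode()[0])     # categorical -> mode

# Deduplication: drop duplicate data objects so they do not bias the learning algorithm.
df = df.drop_duplicates()
```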
FEATURE AGGREGATION
Feature aggregations are performed to take aggregated values and put the data in a better perspective. Think of transactional data: suppose we have the day-to-day transactions of a product, obtained by recording its daily sales at various store locations over the year. Aggregating these into single store-wide monthly or yearly transactions reduces the hundreds or potentially thousands of transactions that occur daily at a specific store, thereby reducing the number of data objects. This results in a reduction of memory consumption and processing time. Aggregations also provide a high-level view of the data, as the behavior of groups or aggregates is more stable than that of individual data objects.

FEATURE SAMPLING
Sampling is a very common method for selecting a subset of the dataset we are analyzing. In most cases, working with the complete dataset can turn out to be too expensive considering the memory and time constraints. Using a sampling algorithm can help us reduce the size of the dataset to a point where we can use a better, but more expensive, machine learning algorithm.
The key principle here is that the sampling should be done in such a manner that the sample generated has approximately the same properties as the original dataset, meaning that the sample is representative. This involves choosing the correct sample size and sampling strategy.
Simple Random Sampling dictates that there is an equal probability of selecting any particular entity. It has two main variations:
• Sampling without replacement: as each item is selected, it is removed from the set of all objects that form the total dataset.
• Sampling with replacement: items are not removed from the total dataset after being selected, which means they can be selected more than once.
Simple random sampling can fail to output a representative sample when the dataset includes object types whose proportions vary drastically. This causes problems when the sample needs a proper representation of all object types, for example when we have an imbalanced dataset: it is critical that the rarer classes be adequately represented in the sample. In such cases there is another sampling technique we can use, called Stratified Sampling, which begins with predefined groups of objects. There are different versions of stratified sampling too, the simplest of which suggests drawing an equal number of objects from all the groups even though the groups are of different sizes.
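A minimal sketch of both ideas in pandas follows; the transactions table, its column names, and the 50% sample fraction are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical daily transactions recorded per store.
tx = pd.DataFrame({
    "store": ["A", "A", "B", "B", "A", "B"],
    "date":  pd.to_datetime(["2020-01-05", "2020-01-20", "2020-01-07",
                             "2020-02-11", "2020-02-03", "2020-02-25"]),
    "sales": [120, 80, 200, 150, 90, 170],
})

# Feature aggregation: roll daily transactions up to store-wide monthly totals.
monthly = (tx.groupby(["store", tx["date"].dt.to_period("M")])["sales"]
             .sum()
             .reset_index())

# Simple random sampling without replacement (each row can be picked at most once).
sample_wo = tx.sample(frac=0.5, replace=False, random_state=0)

# Simple random sampling with replacement (the same row may be picked more than once).
sample_w = tx.sample(frac=0.5, replace=True, random_state=0)

# Stratified sampling: draw from each predefined group (here, each store) separately.
stratified = tx.groupby("store").sample(frac=0.5, random_state=0)
```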
DIMENSIONALITY REDUCTION
Most real-world datasets have a large number of features. Consider an image processing problem, for example: we might have to deal with thousands of features, also called dimensions. As the name suggests, dimensionality reduction aims to reduce the number of features, but not simply by selecting a sample of features from the feature set; that is something else, called Feature Subset Selection or simply Feature Selection. Conceptually, dimension refers to the number of geometric planes the dataset lies in, which can be so high that it cannot be visualized with pen and paper. The more such planes, the greater the complexity of the dataset.

THE CURSE OF DIMENSIONALITY
This refers to the phenomenon that data analysis tasks generally become significantly harder as the dimensionality of the data increases. As the dimensionality increases, the number of planes occupied by the data increases, adding more and more sparsity to the data, which is difficult to model and visualize. What dimensionality reduction essentially does is map the dataset to a lower-dimensional space, which may very well be a number of planes that can now be visualized, say in 2D.
The basic objective of the techniques used for this purpose is to reduce the dimensionality of a dataset by creating new features which are combinations of the old features. In other words, the higher-dimensional feature space is mapped to a lower-dimensional feature space. Principal Component Analysis and Singular Value Decomposition are two widely accepted techniques.
A few major benefits of dimensionality reduction are:
• Data analysis algorithms work better if the dimensionality of the dataset is lower, mainly because irrelevant features and noise have been eliminated.
• Models built on top of lower-dimensional data are more understandable and explainable.
• The data may also become easier to visualize. Features can always be taken in pairs or triplets for visualization purposes, which makes more sense if the feature set is not that big.
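As a rough sketch of how such a mapping looks in practice, the snippet below runs scikit-learn's PCA on a synthetic 50-dimensional matrix; the shapes and the choice of two components are arbitrary, for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))  # 500 data objects described by 50 features (synthetic)

# Map the 50-dimensional feature space to 2 new features that are
# linear combinations of the old ones.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                     # (500, 2): now easy to visualize
print(pca.explained_variance_ratio_)  # variance retained by each new feature
```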
FEATURE ENCODING
As mentioned before, the whole purpose of data preprocessing is to encode the data in order to bring it to such a state that the machine can understand it. Feature encoding is basically performing transformations on the data such that it can easily be accepted as input for machine learning algorithms while still retaining its original meaning. There are some general norms or rules which are followed when performing feature encoding.
For categorical variables:
• Nominal: any one-to-one mapping can be done which retains the meaning, for instance a permutation of values as in One-Hot Encoding.
• Ordinal: an order-preserving change of values. The notion of small, medium and large can be represented equally well with the help of a new function, that is, <new_value = f(old_value)>, for example {0, 1, 2} or {1, 2, 3}.
For numeric variables:
• Interval: a simple mathematical transformation like <new_value = a*old_value + b>, with a and b being constants. For example, the Fahrenheit and Celsius scales, which differ in their zero value and the size of a unit, can be encoded in this manner.
• Ratio: these variables can be scaled to any particular measure, of course while still maintaining the meaning and ratio of their values. Simple mathematical transformations work in this case as well, like <new_value = a*old_value>. For example, length can be measured in meters or feet, and money can be taken in different currencies.
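The sketch below illustrates these norms with pandas: one-hot encoding for a nominal feature, an order-preserving mapping for an ordinal one, and a linear transformation for an interval-scaled one; the column names and the mapping are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "city":   ["Pune", "Delhi", "Mumbai"],   # nominal
    "size":   ["small", "large", "medium"],  # ordinal
    "temp_f": [98.6, 101.2, 99.5],           # interval (Fahrenheit)
})

# Nominal: any one-to-one mapping that retains the meaning, e.g. one-hot encoding.
df = pd.get_dummies(df, columns=["city"])

# Ordinal: an order-preserving change of values, new_value = f(old_value).
df["size"] = df["size"].map({"small": 0, "medium": 1, "large": 2})

# Interval: new_value = a*old_value + b, e.g. Fahrenheit to Celsius.
df["temp_c"] = (df["temp_f"] - 32) * 5.0 / 9.0
```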
TRAIN / VALIDATION / TEST SPLIT
After feature encoding is done, our dataset is ready for the exciting machine learning algorithms. But before we start deciding on the algorithm to use, it is always advisable to split the dataset into two, or sometimes three, parts. A machine learning algorithm, or any algorithm for that matter, has to first be trained on the available data distribution and then validated and tested before it can be deployed to deal with real-world data.
• Training data: the part on which the machine learning algorithm is actually trained to build a model. The model tries to learn the dataset and its various characteristics and intricacies, which also raises the issue of overfitting versus underfitting.
• Validation data: the part of the dataset used to validate our various model fits. In simpler words, we use the validation data to choose and improve the model hyperparameters. The model does not learn the validation set, but uses it to reach a better state of hyperparameters.
• Test data: the part of the dataset used to test our model hypothesis. It is left untouched and unseen until the model and hyperparameters are decided; only then is the model applied to the test data to get an accurate measure of how it would perform when deployed on real-world data.
Split ratio: data is split as per a split ratio which is highly dependent on the type of model we are building and on the dataset itself. If our dataset and model require a lot of training, we use a larger chunk of the data just for training purposes (usually the case); for instance, training on textual, image or video data usually involves thousands of features. If the model has many hyperparameters that can be tuned, keeping a higher percentage of data for the validation set is advisable, while models with fewer hyperparameters are easy to tune and update, so a smaller validation set suffices. Like many other things in machine learning, the split ratio is highly dependent on the problem we are trying to solve and must be decided after taking into account all the various details of the model and the dataset at hand.
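One common way to produce such a split is to call scikit-learn's train_test_split twice, as in the sketch below; the 70/15/15 ratio and the synthetic data are illustrative assumptions, not a recommendation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))    # synthetic features
y = rng.integers(0, 2, size=1000)  # synthetic binary labels

# First carve out the test set, then split the remainder into train and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.15 / 0.85, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # roughly 70% / 15% / 15%
```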
Assessing Classification Accuracy
Misclassification Error
• The basic metric for assessing the accuracy of a classification algorithm is the misclassification error: the number of samples misclassified by the model divided by the total number of samples.
• For binary classification problems with true label y_i and predicted label ŷ_i over N data points, error = (1/N) * (number of points where ŷ_i ≠ y_i).
• For 0% error, ŷ_i = y_i for all data points.
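In code, the error is just the fraction of disagreeing predictions; the labels below are made up for illustration.

```python
import numpy as np

# Illustrative true labels and model predictions for a binary problem.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Misclassification error = misclassified samples / total samples.
error = np.mean(y_pred != y_true)
print(error)  # 0.25 here; it is 0.0 only when y_pred equals y_true at every point
```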
Confusion Matrix
• Decisions based on the misclassification error rate alone lead to poor performance when the data is unbalanced.
• For example, in financial fraud detection the proportion of fraud cases is extremely small.
• In such classification problems, the interest is mainly in the minority cases.
• The class that the user is interested in is commonly called the positive class, and the rest the negative class.
• A single prediction on the test set has four possible outcomes:
1. True positives (TP) and true negatives (TN) are correct classifications.
2. A false positive (FP) occurs when the outcome is incorrectly predicted as positive when it is actually negative.
3. A false negative (FN) occurs when the outcome is incorrectly predicted as negative when it is actually positive.

                              Hypothesized class (prediction)
Actual class (observation)    Classified +ve    Classified -ve
Actual +ve                    TP                FN
Actual -ve                    FP                TN

Misclassification Rate
• Misclassification rate = (FP + FN) / (TP + TN + FP + FN)
• For 0% error, FP = FN = 0 is desired.
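The counts can be read off with scikit-learn's confusion_matrix, as sketched below on the same illustrative labels; with the labels ordered [0, 1], the returned matrix is [[TN, FP], [FN, TP]].

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# With labels=[0, 1] the matrix is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

misclassification_rate = (fp + fn) / (tp + tn + fp + fn)
print(tp, fn, fp, tn, misclassification_rate)
```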
True Positive Rate (tp rate)
• True positive rate (sensitivity) = TP / (TP + FN)
• Determines the sensitivity in detection of abnormal events.
• A classification method with high sensitivity would rarely miss an abnormal event.
• Determines the specificity in detection of the
abnormal event • High specificity results in low rate of false alarms caused by classification of a normal event as an abnormal one.
• Simultaneously high sensitivity and high specificity is
desired. Machine Learning algorithms, or any12/29/2020 Data Preprocessing : Concepts. Introduction to the concepts of Data… | by Pranjal Pandey | Towards Data Sciencehttps://towardsdatascience.com/data- preprocessing-concepts-fa946d11c825 12/14