
Data Preprocessing : Concepts

 Data is truly considered a resource in today's world.
 As per the World Economic Forum, by 2025 we will be generating about 463 exabytes of data globally per day!
 But is all this data fit enough to be used by machine learning algorithms? How do we decide that?
 Data preprocessing transforms the data so that it becomes machine-readable.
What is Data Preprocessing?
 When we think of data, we usually picture large datasets with a huge number of rows and columns. While that is a likely scenario, it is not always the case.
 Data can come in many different forms: structured tables, images, audio files, videos, etc.
 Machines do not understand free text, image or video data as it is; they understand 1s and 0s.
What is Data Preprocessing?
 In any machine learning process, data preprocessing is the step in which the data gets transformed, or encoded, to bring it to a state that the machine can easily parse.
 In other words, the features of the data can now be easily interpreted by the algorithm.
Features in Machine Learning
 A dataset can be viewed as a collection of data objects, which are often also called records, points, vectors, patterns, events, cases, samples, observations, or entities.
 Data objects are described by a number of features that capture the basic characteristics of an object, such as the mass of a physical object or the time at which an event occurred. Features are often called variables, characteristics, fields, attributes, or dimensions.
 A feature is an individual measurable property or characteristic of a phenomenon being observed.
Statistical Data Types
 Categorical : Features whose values are taken from a defined set of values.
 For instance, the days of the week {Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday} form a categorical feature because its value is always taken from this set.
 Another example could be the Boolean set : {True, False}.
Statistical Data Types
 Numerical : Features whose values are continuous or integer-valued.
 They are represented by numbers and possess most of the properties of numbers.
 For instance, the number of steps you walk in a day, or the speed at which you are driving your car.
STEPS OF DATA PREPROCESSING
 Data Quality Assessment
 Feature Aggregation
 Feature Sampling
 Dimensionality Reduction
 Feature Encoding
STEPS OF DATA PREPROCESSING
 Data Quality Assessment : Because data is often taken from multiple sources, which are normally not too reliable and come in different formats, more than half our time is consumed in dealing with data quality issues when working on a machine learning problem.
 It is simply unrealistic to expect that the data will be perfect.
 There may be problems due to human error, limitations of measuring devices, or flaws in the data collection process.
 Let's go over a few of them and methods to deal with them :
MISSING VALUES
 It is very common to have missing values in your dataset. They may have appeared during data collection, or maybe due to some data validation rule, but regardless, missing values must be taken into consideration.
 Eliminate rows with missing data : A simple and sometimes effective strategy. It fails if many objects have missing values. If a feature has mostly missing values, then that feature itself can also be eliminated.
 Estimate missing values : If only a reasonable percentage of values are missing, then we can also run simple interpolation methods to fill in those values.
 However, the most common method of dealing with missing values is filling them in with the mean, median or mode value of the respective feature, as in the sketch below.
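A minimal sketch of these fill strategies, assuming a pandas DataFrame with a hypothetical numeric column `age` and a hypothetical categorical column `city`:

```python
import pandas as pd

# Hypothetical toy dataset with missing values
df = pd.DataFrame({
    "age": [25, None, 31, 47, None],
    "city": ["Pune", "Delhi", None, "Delhi", "Pune"],
})

# Numeric feature: fill missing values with the mean (median works the same way)
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical feature: fill missing values with the mode (most frequent value)
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Alternatively, drop any rows that still contain missing values
df = df.dropna()

print(df)
```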
INCONSISTENT VALUES
 We know that data can contain inconsistent values. Most probably we have already faced this issue at some point.
 For instance, the ‘Address’ field might contain the ‘Phone number’.
 It may be due to human error, or maybe the information was misread while being scanned from a handwritten form.
 It is therefore always advised to perform data assessment, like checking what the data type of each feature should be and whether it is the same for all the data objects.
DUPLICATE VALUES
 A dataset may include data objects which are duplicates of one another.
 It may happen when, say, the same person submits a form more than once.
 The term deduplication is often used to refer to the process of dealing with duplicates.
 In most cases, the duplicates are removed so as to not give that particular data object an advantage or bias when running machine learning algorithms, as in the sketch below.
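A minimal deduplication sketch with pandas (the DataFrame and its columns are hypothetical):

```python
import pandas as pd

# Hypothetical form submissions, with one person submitting twice
df = pd.DataFrame({
    "name":  ["Asha", "Ravi", "Asha"],
    "email": ["asha@x.com", "ravi@x.com", "asha@x.com"],
})

# Drop exact duplicate rows, keeping the first occurrence
df = df.drop_duplicates(keep="first")

# Or deduplicate on a subset of columns that identifies an entity
df = df.drop_duplicates(subset=["email"], keep="first")

print(df)
```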
FEATURE AGGREGATION
 Feature aggregations are performed so as to take the aggregated values and put the data in a better perspective.
 Think of transactional data: suppose we have day-to-day transactions of a product, recorded as the daily sales of that product in various store locations over the year.
 Aggregating the transactions into single store-wide monthly or yearly transactions will help us reduce the hundreds or potentially thousands of transactions that occur daily at a specific store, thereby reducing the number of data objects.
 This results in a reduction of memory consumption and processing time. Aggregations also provide us with a high-level view of the data, as the behavior of groups or aggregates is more stable than that of individual data objects. A sketch follows below.
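A minimal aggregation sketch with pandas (the column names `store`, `date`, and `sales` are hypothetical):

```python
import pandas as pd

# Hypothetical daily sales transactions per store
df = pd.DataFrame({
    "store": ["S1", "S1", "S2", "S2"],
    "date":  pd.to_datetime(["2020-01-05", "2020-02-10", "2020-01-07", "2020-02-15"]),
    "sales": [120.0, 90.5, 200.0, 150.25],
})

# Aggregate daily transactions into store-wise monthly totals
monthly = (
    df.groupby(["store", df["date"].dt.to_period("M")])["sales"]
      .sum()
      .reset_index()
)

print(monthly)
```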
FEATURE SAMPLING
 Sampling is a very common method for
selecting a subset of the dataset that we
are analyzing.
 In most cases, working with the
complete dataset can turn out to be too
expensive considering the memory and
time constraints.
 Using a sampling algorithm can help us
reduce the size of the dataset to a point
where we can use a better, but more
expensive, machine learning algorithm.
FEATURE SAMPLING
 The key principle here is that the sampling should be done in such a manner that the sample generated has approximately the same properties as the original dataset, meaning that the sample is representative. This involves choosing the correct sample size and sampling strategy.
 Simple Random Sampling dictates that there is an equal probability of selecting any particular entity. It has two main variations :
 Sampling without Replacement : As each item is selected, it is removed from the set of all the objects that form the total dataset.
 Sampling with Replacement : Items are not removed from the total dataset after getting selected. This means they can get selected more than once. A sketch of both variations follows below.
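A minimal sketch of both variations using pandas' `sample` (the DataFrame and sample size are hypothetical):

```python
import pandas as pd

# Hypothetical dataset of 10 data objects
df = pd.DataFrame({"value": range(10)})

# Sampling without replacement: each object can appear at most once
without_replacement = df.sample(n=5, replace=False, random_state=42)

# Sampling with replacement: the same object may be selected more than once
with_replacement = df.sample(n=5, replace=True, random_state=42)

print(without_replacement)
print(with_replacement)
```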
 Simple Random Sampling can fail to output a representative sample when the dataset includes object types whose proportions vary drastically.
 This can cause problems when the sample needs to have a proper representation of all object types, for example, when we have an imbalanced dataset.
 It is critical that the rarer classes be adequately represented in the sample.
 In these cases, there is another sampling technique which we can use, called Stratified Sampling, which begins with predefined groups of objects.
 There are different versions of Stratified Sampling too, with the simplest version suggesting that an equal number of objects be drawn from all the groups even though the groups are of different sizes. A sketch follows below.
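A minimal stratified-sampling sketch using scikit-learn's `train_test_split` with its `stratify` argument (the feature matrix `X` and the imbalanced labels `y` are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: 90 negatives, 10 positives
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# Stratified sampling keeps the class ratio (roughly 9:1) in both parts
X_sample, X_rest, y_sample, y_rest = train_test_split(
    X, y, train_size=0.3, stratify=y, random_state=42
)

print(np.bincount(y_sample))  # approximately [27, 3]
```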
DIMENSIONALITY REDUCTION
 Most real-world datasets have a large number of features.
 For example, consider an image processing problem; we might have to deal with thousands of features, also called dimensions.
 As the name suggests, dimensionality reduction aims to reduce the number of features, but not simply by selecting a sample of features from the feature-set, which is something else (Feature Subset Selection, or simply Feature Selection).
 Conceptually, dimension refers to the number of geometric planes the dataset lies in, which could be so high that it cannot be visualized with pen and paper.
 The more such planes there are, the more complex the dataset is.
THE CURSE OF DIMENSIONALITY
 This refers to the phenomenon that data analysis tasks generally become significantly harder as the dimensionality of the data increases.
 As the dimensionality increases, the number of planes occupied by the data increases, thus adding more and more sparsity to the data, which is difficult to model and visualize.
 What dimensionality reduction essentially does is map the dataset to a lower-dimensional space, which may very well be a number of planes which can now be visualized, say 2D.
THE CURSE OF DIMENSIONALITY
 The basic objective of techniques which are used for this purpose is to reduce the dimensionality of a dataset by creating new features which are a combination of the old features.
 In other words, the higher-dimensional feature-space is mapped to a lower-dimensional feature-space.
 Principal Component Analysis and Singular Value Decomposition are two widely accepted techniques, sketched below.
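A minimal PCA sketch with scikit-learn (the toy data and the choice of 2 components are hypothetical):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset: 100 samples with 10 correlated features
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = base @ rng.normal(size=(2, 10)) + 0.05 * rng.normal(size=(100, 10))

# Map the 10-dimensional feature space down to 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # fraction of variance kept per component
```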
THE CURSE OF DIMENSIONALITY
 A few major benefits of dimensionality reduction are :
 Data analysis algorithms work better if the dimensionality of the dataset is lower.
 This is mainly because irrelevant features and noise have now been eliminated.
 The models which are built on top of lower-dimensional data are more understandable and explainable.
 The data may now also get easier to visualize! Features can always be taken in pairs or triplets for visualization purposes, which makes more sense when the feature set is not that big.
FEATURE ENCODING
 As mentioned before, the whole purpose of
data preprocessing is to encode the data in
order to bring it to such a state that the
machine now understands it.
 Feature encoding is basically performing
transformations on the data such that it can
be easily accepted as input for machine
learning algorithms while still retaining its
original meaning.
 There are some general norms or rules which
are followed when performing feature
encoding.
FEATURE ENCODING
 For Categorical variables :
 Nominal : Any one-to-one mapping can be done which retains the meaning. For instance, a permutation of values, as in One-Hot Encoding.
 Ordinal : An order-preserving change of values. The notion of small, medium and large can be represented equally well with the help of a new function, that is, <new_value = f(old_value)>.
 For example, {0, 1, 2} or maybe {1, 2, 3}. A sketch of both encodings follows below.
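A minimal sketch of both encodings with pandas (the column names `color` and `size` are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "green", "blue"],      # nominal: no natural order
    "size":  ["small", "large", "medium"],  # ordinal: small < medium < large
})

# Nominal feature: One-Hot Encoding (one binary column per category)
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal feature: an order-preserving mapping new_value = f(old_value)
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(size_order)

print(one_hot)
print(df[["size", "size_encoded"]])
```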
FEATURE ENCODING
For Numeric variables :
 Interval : A simple mathematical transformation, like using the equation <new_value = a*old_value + b>, a and b being constants. For example, the Fahrenheit and Celsius scales, which differ in their zero values and the size of a unit, can be encoded in this manner.
 Ratio : These variables can be scaled to any particular measure, of course while still maintaining the meaning and ratio of their values. Simple mathematical transformations work in this case as well, like the transformation <new_value = a*old_value>. For example, length can be measured in meters or feet, and money can be taken in different currencies. A sketch follows below.
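A minimal sketch of both transformation types in plain Python (the constants are the standard unit conversions; the function names are hypothetical):

```python
# Interval variable: new_value = a * old_value + b (Celsius -> Fahrenheit)
def celsius_to_fahrenheit(celsius: float) -> float:
    return 1.8 * celsius + 32.0

# Ratio variable: new_value = a * old_value (meters -> feet)
def meters_to_feet(meters: float) -> float:
    return 3.28084 * meters

print(celsius_to_fahrenheit(100.0))  # 212.0
print(meters_to_feet(2.0))           # 6.56168
```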
TRAIN / VALIDATION / TEST SPLIT
 After feature encoding is done, our dataset is ready for the exciting machine learning algorithms!
 But before we start deciding on the algorithm to be used, it is always advised to split the dataset into 2 or sometimes 3 parts.
 Machine learning algorithms, or any algorithm for that matter, have to be first trained on the available data distribution and then validated and tested, before they can be deployed to deal with real-world data.
TRAIN / VALIDATION / TEST SPLIT
 Training data : This is the part on which your machine learning algorithms are actually trained to build a model.
 The model tries to learn the dataset and its various characteristics and intricacies, which also raises the issue of overfitting vs. underfitting.
TRAIN / VALIDATION / TEST SPLIT
 Validation data : This is the part of the dataset which is used to validate our various model fits.
 In simpler words, we use the validation data to choose and improve our model hyperparameters.
 The model does not learn the validation set but uses it to get to a better state of hyperparameters.
 Test data : This part of the dataset is used to test our model hypothesis.
 It is left untouched and unseen until the model and hyperparameters are decided, and only after that is the model applied on the test data to get an accurate measure of how it would perform when deployed on real-world data.
TRAIN / VALIDATION / TEST SPLIT
 Split Ratio : Data is split as per a split ratio which is highly dependent on the type of model we are building and on the dataset itself.
 If our dataset and model are such that a lot of training is required, then we use a larger chunk of the data just for training purposes (usually the case). For instance, training on textual data, image data, or video data usually involves thousands of features!
 If the model has a lot of hyperparameters that can be tuned, then keeping a higher percentage of data for the validation set is advisable.
TRAIN / VALIDATION / TEST SPLIT

 Models with a smaller number of hyperparameters are easy to tune and update, and so we can keep a smaller validation set.
 Like many other things in machine learning, the split ratio is highly dependent on the problem we are trying to solve and must be decided after taking into account all the various details about the model and the dataset in hand. A split sketch follows below.
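A minimal 60/20/20 train/validation/test split sketch using two successive calls to scikit-learn's `train_test_split` (the data and the ratio are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical encoded feature matrix X and labels y
X = np.arange(200).reshape(100, 2)
y = np.random.default_rng(0).integers(0, 2, size=100)

# First split off the test set (20%), then split the rest into train/validation
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```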
Assessing Classification Accuracy
Misclassification Error
• A basic metric for assessing the accuracy of classification algorithms is the fraction of samples misclassified by the model.
• For binary classification problems, error = (1/N) * Σ I(y_pred_i ≠ y_i), i.e. the number of misclassified samples divided by the total number of samples N.
• For 0% error, y_pred_i = y_i for all data points. A sketch follows below.
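A minimal sketch computing the misclassification error from hypothetical true and predicted labels:

```python
import numpy as np

# Hypothetical true labels and model predictions for a binary problem
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Fraction of samples where the prediction disagrees with the truth
error = np.mean(y_true != y_pred)
accuracy = 1.0 - error

print(f"misclassification error = {error:.3f}, accuracy = {accuracy:.3f}")
```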


Confusion Matrix
• Decisions based only on the misclassification error rate lead to poor performance when the data is unbalanced.
• For example, in the case of financial fraud detection, the proportion of fraud cases is extremely small.
• In such classification problems, the interest is mainly in the minority cases.
• The class that the user is interested in is commonly called the positive class and the rest the negative class.
• A single prediction on the test set has four
possible outcomes.
1. The true positive (TP) and true negative
(TN) are correct classifications.
2. A false positive (FP) occurs when the
outcome is incorrectly predicted as
positive when it is actually negative.
3. A false negative (FN) occurs when the outcome is incorrectly predicted as negative when it is actually positive.

Actual Class (observation) vs. Hypothesized Class (prediction):

                 Classified +ve    Classified -ve
   Actual +ve         TP                FN
   Actual -ve         FP                TN
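A minimal sketch obtaining these four counts with scikit-learn's `confusion_matrix` (the labels are hypothetical; note that scikit-learn orders the matrix as [[TN, FP], [FN, TP]] for labels [0, 1]):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical imbalanced test labels (1 = positive / fraud, 0 = negative)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 0, 1, 0, 1])

# With labels=[0, 1] the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```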
Confusion Matrix
Misclassification Rate
• Misclassification rate = (FP + FN) / (TP + TN + FP + FN)
• FP = FN = 0 is desired.

True Positive Rate (tp rate)
• TP rate (sensitivity) = TP / (TP + FN)
• Determines sensitivity in the detection of abnormal events.
• A classification method with high sensitivity would rarely miss an abnormal event.
True Negative Rate
• TN rate (specificity) = TN / (TN + FP)
• Determines the specificity in the detection of the abnormal event.
• High specificity results in a low rate of false alarms caused by classifying a normal event as an abnormal one.
• Simultaneously high sensitivity and high specificity is desired. A sketch of these rates follows below.
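A minimal sketch computing these rates from the four confusion-matrix counts (the counts continue the hypothetical example above):

```python
# Hypothetical confusion-matrix counts
tp, tn, fp, fn = 2, 6, 1, 1

misclassification_rate = (fp + fn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate

print(f"misclassification rate = {misclassification_rate:.2f}")
print(f"sensitivity (TP rate)  = {sensitivity:.2f}")
print(f"specificity (TN rate)  = {specificity:.2f}")
```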
