
Data Pre-processing

Why Data Preprocessing?

Data in the real world is dirty:
• incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• noisy: containing errors or outliers
• inconsistent: containing discrepancies in codes or names

No quality data, no quality mining results!
• Quality decisions must be based on quality data
• A data warehouse needs consistent integration of quality data
Steps Involved

What is Data?

Categorical attributes : Features whose values are taken from a defined set of values. For instance, the days of the week {Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday} form a category because a value is always taken from this set. Another example is the Boolean set {True, False}.

Numerical attributes : Features whose values are continuous or integer-valued. They are represented by numbers and possess most of the properties of numbers.
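
A quick illustration in pandas (a minimal sketch; the column names and values are invented for this example): a categorical feature can be declared together with its defined value set, while a numerical feature is simply stored as numbers.

import pandas as pd

# Hypothetical frame mixing a categorical and a numerical feature.
df = pd.DataFrame({
    "day": ["Monday", "Friday", "Sunday", "Friday"],
    "temperature": [21.5, 19.0, 23.2, 20.1],
})

# Declaring the defined set of values makes the categorical nature explicit;
# values outside the set would become missing (NaN).
days = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]
df["day"] = pd.Categorical(df["day"], categories=days)

print(df.dtypes)  # day -> category, temperature -> float64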
Missing values :

Eliminate rows with missing data :
• A simple and sometimes effective strategy. It fails if many objects have missing values. If a feature has mostly missing values, that feature itself can also be eliminated.

Estimate missing values :
• If only a reasonable percentage of values are missing, simple interpolation methods can be run to fill in those values. However, the most common method of dealing with missing values is to fill them in with the mean, median or mode value of the respective feature.
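
Both strategies can be sketched in a few lines of pandas (a minimal example; the columns and values are made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 31, 40, np.nan],
    "income": [50000, 62000, np.nan, 58000, 61000],
})

# Elimination: drop every row that contains a missing value.
dropped = df.dropna()

# Estimation: fill missing values with the mean (or median/mode) of the feature.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())
filled["income"] = filled["income"].fillna(filled["income"].median())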
Inconsistent values :

Data can contain inconsistent values. For instance, an ‘Address’ field may contain a ‘Phone number’. This may be due to human error, or the information may have been misread while being scanned from a handwritten form.

It is therefore always advisable to perform data assessment, such as checking what the data type of each feature should be and whether it is the same for all the data objects.
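
Such an assessment can be sketched in pandas (a minimal, hypothetical example; the column names and the digits-and-dashes pattern are assumptions for illustration):

import pandas as pd

df = pd.DataFrame({
    "address": ["12 Oak St", "98765-43210", "5 Elm Rd"],  # a phone number slipped in
    "phone":   ["12345-67890", "23456-78901", "34567-89012"],
})

# Check the inferred type of every column, then flag 'address' entries
# that look like phone numbers (digits and dashes only).
print(df.dtypes)
suspect = df["address"].str.fullmatch(r"[\d\-]+")
print(df[suspect])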
Duplicate values :

A dataset may include data objects which are duplicates of one another. This can happen when, say, the same person submits a form more than once. The term deduplication is often used to refer to the process of dealing with duplicates.

In most cases, duplicates are removed so as not to give a particular data object an advantage or bias when running machine learning algorithms.
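
In pandas, deduplication is a one-liner (a minimal sketch with invented data):

import pandas as pd

df = pd.DataFrame({
    "name":  ["Asha", "Ravi", "Asha"],
    "email": ["asha@example.com", "ravi@example.com", "asha@example.com"],
})

# Keep only the first occurrence of each duplicated row.
deduped = df.drop_duplicates(keep="first")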
Imputation methods (as offered by the Orange Impute widget) :

• Don’t Impute does nothing with the missing values.
• Average/Most-frequent uses the average value (for continuous attributes) or the most common value (for discrete attributes).
• As a distinct value creates new values to substitute for the missing ones.
• Model-based imputer constructs a model for predicting the missing value, based on the values of other attributes; a separate model is constructed for each attribute. The default model is a 1-NN learner, which takes the value from the most similar example (this is sometimes referred to as hot deck imputation; see the sketch after this list). This algorithm can be substituted by one that the user connects to the input signal Learner for Imputation. Note, however, that if there are discrete and continuous attributes in the data, the algorithm needs to be capable of handling them both; at the moment only the 1-NN learner can do that. (In the future, when Orange has more regressors, the Impute widget may have separate input signals for discrete and continuous models.)
• Random values computes the distributions of values for each attribute and then imputes by picking random values from them.
• Remove examples with missing values removes the examples containing missing values. This check also applies to the class attribute if Impute class values is checked.
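
Outside Orange, the model-based 1-NN (hot deck) idea can be approximated with scikit-learn's KNNImputer; this is an analogue for illustration, not the Orange widget itself, and the data is made up:

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [1.1, 2.1, 3.0],
    [9.0, 8.0, 7.5],
])

# n_neighbors=1 takes the value from the single most similar example,
# mirroring the 1-NN behaviour described above. The missing entry in the
# first row is filled with 3.0, copied from the nearby second row.
imputer = KNNImputer(n_neighbors=1)
X_filled = imputer.fit_transform(X)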
Noisy data :

Noisy data are data containing a large amount of additional meaningless information, called noise.
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to:
  • faulty data collection instruments
  • data entry problems
  • data transmission problems
  • technology limitations
  • inconsistency in naming conventions
• Other data problems that require data cleaning:
  • duplicate records
  • incomplete data
  • inconsistent data
How to Handle Noisy Data?

• Binning method
• Clustering
• Combined computer and human inspection
• Regression

The binning method is used to smooth data or to handle noisy data. The data is first sorted, and the sorted values are then distributed into a number of buckets, or bins. Because binning methods consult the neighbourhood of values, they perform local smoothing.

There are three approaches to performing smoothing:

• Smoothing by bin means : each value in a bin is replaced by the mean value of the bin.
• Smoothing by bin medians : each value in a bin is replaced by the bin’s median value.
• Smoothing by bin boundaries : the minimum and maximum values in a given bin are identified as the bin boundaries; each bin value is then replaced by the closest boundary value.
Simple Discretization Methods: Binning

• Equal-width (distance) partitioning:
  • Divides the range into N intervals of equal size (a uniform grid).
  • If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A) / N.
  • The most straightforward approach.
  • But outliers may dominate the presentation, and skewed data is not handled well.
• Equal-depth (frequency) partitioning:
  • Divides the range into N intervals, each containing approximately the same number of samples.
  • Gives good data scaling.
  • Managing categorical attributes can be tricky.
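
Both partitioning schemes can be sketched with pandas (a minimal example reusing the price data from the worked example below; pd.cut gives equal-width bins, pd.qcut gives equal-depth bins):

import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width: 3 intervals of width W = (34 - 4) / 3 = 10.
equal_width = pd.cut(prices, bins=3)

# Equal-depth: 3 intervals each holding roughly the same number of samples.
equal_depth = pd.qcut(prices, q=3)

print(equal_width.value_counts())
print(equal_depth.value_counts())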
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
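The numbers above can be verified with a few lines of plain Python (a minimal sketch; ties in the boundary rule are broken toward the lower boundary):

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]  # equi-depth: 4 values per bin

for b in bins:
    mean = round(sum(b) / len(b))  # smoothing by bin means
    lo, hi = min(b), max(b)        # bin boundaries
    by_means = [mean] * len(b)
    by_boundaries = [lo if v - lo <= hi - v else hi for v in b]
    print(b, "->", by_means, "|", by_boundaries)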
Outlier Detection using Cluster Analysis
Outlier Detection using Linear Regression
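
The figures for these two slides are not reproduced here. As a minimal, self-contained sketch of both ideas (assuming scikit-learn, with synthetic data), a point can be flagged as an outlier when it lies far from its cluster centroid, or when its regression residual is unusually large:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Cluster analysis: points far from their assigned centroid are outliers.
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(8, 1, (50, 2)),
               [[20.0, 20.0]]])                # one planted outlier
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
print(X[dist > dist.mean() + 3 * dist.std()])  # flags the planted point

# Linear regression: points with unusually large residuals are outliers.
x = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2.0 * x.ravel() + rng.normal(0, 0.5, 50)
y[25] += 10.0                                  # one planted outlier
resid = np.abs(y - LinearRegression().fit(x, y).predict(x))
print(x[resid > resid.mean() + 3 * resid.std()])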
Continued…
