
UNIT-2 Data Preprocessing

Lecture Topic
**********************************************
Lecture-13 Why preprocess the data?
Lecture-14 Data cleaning
Lecture-15 Data integration and transformation
Lecture-16 Data reduction
Lecture-17 Discretization and concept
hierarchy generation
Lecture-13
Why preprocess the data?
Lecture-13 Why Data Preprocessing?
Data in the real world is:

incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate
data

noisy: containing errors or outliers

inconsistent: containing discrepancies in codes or
names
No quality data, no quality mining results!

Quality decisions must be based on quality data

Data warehouse needs consistent integration of
quality data

Lecture-13 Why Data Preprocessing?


Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view:

Accuracy

Completeness

Consistency

Timeliness

Believability

Value added

Interpretability

Accessibility
Broad categories:

intrinsic, contextual, representational, and
accessibility.

Lecture-13 Why Data Preprocessing?


Major Tasks in Data Preprocessing
Data cleaning

Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration

Integration of multiple databases, data cubes, or files
Data transformation

Normalization and aggregation
Data reduction

Obtains a reduced representation that is much smaller in volume but produces the
same or similar analytical results
Data discretization

Part of data reduction but with particular importance, especially
for numerical data

Lecture-13 Why Data Preprocessing?


Forms of data preprocessing

Lecture-13 Why Data Preprocessing?


Lecture-14
Data cleaning
Data Cleaning

Data cleaning tasks



Fill in missing values

Identify outliers and smooth out noisy data

Correct inconsistent data

Lecture-14 - Data cleaning


Missing Data
Data is not always available

E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
Missing data may be due to

equipment malfunction

inconsistent with other recorded data and thus deleted

data not entered due to misunderstanding

certain data may not be considered important at the time of
entry

history or changes of the data were not registered
Missing data may need to be inferred.
Lecture-14 - Data cleaning
How to Handle Missing Data?
Ignore the tuple: usually done when class label
is missing
Fill in the missing value manually
Use a global constant to fill in the missing value:
e.g., “unknown”

Lecture-14 - Data cleaning


How to Handle Missing Data?
Use the attribute mean to fill in the missing value
Use the attribute mean for all samples belonging to the
same class to fill in the missing value
Use the most probable value to fill in the missing value:
inference-based such as Bayesian formula or decision
tree

Lecture-14 - Data cleaning
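The fill strategies above can be sketched in a few lines of pandas; a minimal illustration, assuming a hypothetical table with columns income and class:

# Sketch (assumption): filling missing 'income' values in a toy pandas DataFrame.
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [30.0, None, 50.0, None, 70.0],
})

# Global constant fill, e.g. the label "unknown"
filled_const = df["income"].fillna("unknown")

# Attribute mean fill
filled_mean = df["income"].fillna(df["income"].mean())

# Mean of all samples belonging to the same class
filled_class_mean = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
print(filled_class_mean.tolist())   # [30.0, 30.0, 50.0, 60.0, 70.0]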


Noisy Data
Noise: random error or variance in a measured
variable
Incorrect attribute values may be due to

faulty data collection instruments

data entry problems

data transmission problems

technology limitation

inconsistency in naming conventions
Other data problems that require data cleaning

duplicate records

incomplete data

inconsistent data

Lecture-14 - Data cleaning


How to Handle Noisy Data?
Binning method:

first sort data and partition into (equal-
frequency) bins

then one can smooth by bin means, smooth
by bin median, smooth by bin boundaries
Clustering

detect and remove outliers
Regression

smooth by fitting the data to a regression
function, e.g., linear regression
Lecture-14 - Data cleaning
Simple Discretization Methods: Binning

Equal-width (distance) partitioning:



It divides the range into N intervals of equal size:
uniform grid

if A and B are the lowest and highest values of the
attribute, the width of intervals will be: W = (B-A)/N.

The most straightforward

But outliers may dominate presentation

Skewed data is not handled well.
Equal-depth (frequency) partitioning:

It divides the range into N intervals, each containing
approximately the same number of samples

Good data scaling

Managing categorical attributes can be tricky.

Lecture-14 - Data cleaning


Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25,
26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34

Lecture-14 - Data cleaning
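A minimal Python sketch of the equi-depth binning and smoothing shown above (plain lists, no external libraries assumed):

# Sketch: equi-depth binning and smoothing of the price data above.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted
n_bins = 3
depth = len(prices) // n_bins
bins = [prices[i * depth:(i + 1) * depth] for i in range(n_bins)]

# Smoothing by bin means: each value is replaced by its bin's mean
by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]

# Smoothing by bin boundaries: each value is replaced by the closer boundary
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print(by_means)    # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]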


Cluster Analysis

Lecture-14 - Data cleaning


Regression
(Figure: data points such as (X1, Y1) smoothed onto the fitted regression line y = x + 1)

Lecture-14 - Data cleaning


Lecture-15
Data integration and
transformation
Data Integration
Data integration:

combines data from multiple sources into a coherent
store
Schema integration

integrate metadata from different sources

Entity identification problem: identify real world entities
from multiple data sources, e.g., A.cust-id ≡ B.cust-#
Detecting and resolving data value conflicts

for the same real world entity, attribute values from
different sources are different

possible reasons: different representations, different
scales, e.g., metric vs. British units

Lecture-15 - Data integration and transformation


Handling Redundant Data in Data
Integration

Redundant data often occur when integrating multiple databases

The same attribute may have different names
in different databases

One attribute may be a “derived” attribute in
another table, e.g., annual revenue

Lecture-15 - Data integration and transformation


Handling Redundant Data in Data
Integration
Redundant data may be able to be
detected by correlation analysis
Careful integration of the data from
multiple sources may help reduce/avoid
redundancies and inconsistencies and
improve mining speed and quality

Lecture-15 - Data integration and transformation
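Correlation-based redundancy detection can be sketched with NumPy; a minimal illustration in which the attribute names and the cut-off threshold are made up for the example:

# Sketch: flag a pair of numeric attributes as redundant when their
# correlation coefficient is close to +/-1.
import numpy as np

annual_revenue  = np.array([100., 120., 150., 180., 210.])
monthly_revenue = annual_revenue / 12.0        # a "derived" attribute

r = np.corrcoef(annual_revenue, monthly_revenue)[0, 1]
if abs(r) > 0.95:                              # threshold is a design choice
    print(f"correlation {r:.2f}: likely redundant, consider dropping one attribute")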


Data Transformation

Smoothing: remove noise from data


Aggregation: summarization, data cube
construction
Generalization: concept hierarchy climbing

Lecture-15 - Data integration and transformation


Data Transformation

Normalization: scaled to fall within a small,


specified range

min-max normalization

z-score normalization

normalization by decimal scaling
Attribute/feature construction

New attributes constructed from the given ones

Lecture-15 - Data integration and transformation


Data Transformation: Normalization

min-max normalization

    v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

z-score normalization

    v' = (v - mean_A) / stand_dev_A

normalization by decimal scaling

    v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1

Lecture-15 - Data integration and transformation
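A minimal NumPy sketch of the three normalization formulas above, applied to one small attribute vector:

# Sketch: min-max, z-score, and decimal-scaling normalization of one attribute.
import numpy as np

v = np.array([200., 300., 400., 600., 1000.])

# min-max normalization into [new_min, new_max] = [0, 1]
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# z-score normalization
v_zscore = (v - v.mean()) / v.std()

# decimal scaling: divide by 10^j for the smallest j with max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
v_decimal = v / 10 ** j

print(v_minmax)    # [0.  0.125  0.25  0.5  1.]
print(v_decimal)   # [0.02  0.03  0.04  0.06  0.1]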


Lecture-16
Data reduction
Data Reduction

Warehouse may store terabytes of data:


Complex data analysis/mining may take a
very long time to run on the complete data
set
Data reduction

Obtains a reduced representation of the data set
that is much smaller in volume but yet produces
the same (or almost the same) analytical results

Lecture-16 - Data reduction


Data Reduction Strategies

Data reduction strategies



Data cube aggregation

Attribute subset selection

Dimensionality reduction

Numerosity reduction

Discretization and concept hierarchy
generation

Lecture-16 - Data reduction


Data Cube Aggregation
The lowest level of a data cube

the aggregated data for an individual entity of interest

e.g., a customer in a phone calling data warehouse.
Multiple levels of aggregation in data cubes

Further reduce the size of data to deal with
Reference appropriate levels

Use the smallest representation which is enough to
solve the task
Queries regarding aggregated information should
be answered using the data cube, when possible
Lecture-16 - Data reduction
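Rolling the lowest-level data up to a coarser cube level can be sketched with a pandas group-by; the table and column names here are made up for illustration:

# Sketch: aggregate per-call records up to (customer, year) and then to year only.
import pandas as pd

calls = pd.DataFrame({
    "customer": ["c1", "c1", "c2", "c2", "c2"],
    "year":     [2023, 2024, 2023, 2023, 2024],
    "minutes":  [30, 45, 10, 20, 25],
})

per_customer_year = calls.groupby(["customer", "year"], as_index=False)["minutes"].sum()
per_year = calls.groupby("year", as_index=False)["minutes"].sum()   # higher aggregation level
print(per_year)   # the smallest representation that still answers a per-year query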
Dimensionality Reduction
Feature selection (attribute subset selection):

Select a minimum set of features such that the
probability distribution of different classes given the
values for those features is as close as possible to the
original distribution given the values of all features

reduces the number of patterns, making them easier to understand
Heuristic methods

step-wise forward selection

step-wise backward elimination

combining forward selection and backward elimination

decision-tree induction

Lecture-16 - Data reduction


Wavelet Transforms
(Examples of wavelet families: Haar-2, Daubechies-4)

Discrete wavelet transform (DWT): linear signal


processing
Compressed approximation: store only a small fraction of
the strongest wavelet coefficients
Similar to discrete Fourier transform (DFT), but better
lossy compression, localized in space
Method:

Length, L, must be an integer power of 2 (padding with 0s, when
necessary)

Each transform has 2 functions: smoothing, difference

Applies to pairs of data, resulting in two sets of data of length L/2

Applies the two functions recursively until it reaches the desired length
Lecture-16 - Data reduction
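The smoothing/difference pair described above can be sketched in a few lines of NumPy; this is a simplified one-level Haar illustration under those assumptions, not a full DWT library:

# Sketch: one Haar step = pairwise smoothing (averages) and differences.
import numpy as np

def haar_step(x):
    x = np.asarray(x, dtype=float)            # length must be even (pad with 0s if not)
    smooth = (x[0::2] + x[1::2]) / 2.0        # length L/2
    detail = (x[0::2] - x[1::2]) / 2.0        # length L/2
    return smooth, detail

def haar_transform(x, target_len=1):
    coeffs = []
    smooth = np.asarray(x, dtype=float)
    while len(smooth) > target_len:           # apply recursively to the smoothed part
        smooth, detail = haar_step(smooth)
        coeffs.append(detail)
    return smooth, coeffs                     # keep only the strongest coefficients to compress

approx, details = haar_transform([2, 2, 0, 2, 3, 5, 4, 4])
print(approx, [d.tolist() for d in details])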
Principal Component Analysis

Given N data vectors from k dimensions, find c <= k orthogonal vectors that can best be used to represent the data

The original data set is reduced to one consisting of N
data vectors on c principal components (reduced
dimensions)
Each data vector is a linear combination of the c
principal component vectors
Works for numeric data only
Used when the number of dimensions is large

Lecture-16 - Data reduction
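A minimal NumPy sketch of PCA via eigendecomposition of the covariance matrix, keeping c principal components (the data values are made up for the example):

# Sketch: reduce k-dimensional numeric data to c principal components.
import numpy as np

def pca_reduce(X, c):
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)                  # center each attribute
    cov = np.cov(X_centered, rowvar=False)           # k x k covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:c]]  # the c strongest components
    return X_centered @ top                          # N x c reduced representation

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
print(pca_reduce(X, c=1))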


Principal Component Analysis

(Figure: principal components Y1 and Y2 of data plotted in the X1-X2 plane)

Lecture-16 - Data reduction


Attribute subset selection
Attribute subset selection reduces the data
set size by removing irrelevant or
redundant attributes.
The goal is to find a minimum set of attributes
Uses basic heuristic methods of attribute
selection

Lecture-16 - Data reduction


Example of Decision Tree Induction

Initial attribute set: {A1, A2, A3, A4, A5, A6}

(Decision tree: the root splits on A4?; its branches split on A1? and A6?; the leaves are Class 1 / Class 2)

> Reduced attribute set: {A1, A4, A6}

Lecture-16 - Data reduction


Numerosity Reduction
Parametric methods

Assume the data fits some model, estimate model
parameters, store only the parameters, and discard
the data (except possible outliers)

Log-linear models: obtain a value at a point in m-D
space as the product of values on appropriate marginal
subspaces
Non-parametric methods

Do not assume models

Major families: histograms, clustering, sampling

Lecture-16 - Data reduction


Regression and Log-Linear Models
Linear regression: Data are modeled to fit a
straight line

Often uses the least-squares method to fit the line

Multiple regression: allows a response variable


Y to be modeled as a linear function of
multidimensional feature vector

Log-linear model: approximates discrete


multidimensional probability distributions
Lecture-16 - Data reduction
Regress Analysis and Log-Linear
Models
Linear regression: Y = α + β X

The two parameters α and β specify the line and are to
be estimated using the data at hand,

by applying the least-squares criterion to the known values of
Y1, Y2, …, X1, X2, ….
Multiple regression: Y = b0 + b1 X1 + b2 X2.

Many nonlinear functions can be transformed into the
above.
Log-linear models:

The multi-way table of joint probabilities is
approximated by a product of lower-order tables.

Probability: p(a, b, c, d) = α_ab β_ac γ_ad δ_bcd
Lecture-16 - Data reduction
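A minimal NumPy sketch of estimating α and β by the least-squares criterion (the sample data are made up):

# Sketch: least-squares estimates of the line Y = alpha + beta * X.
import numpy as np

x = np.array([1., 2., 3., 4., 5.])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha = y.mean() - beta * x.mean()
print(f"Y = {alpha:.2f} + {beta:.2f} * X")   # the two stored parameters replace the raw data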
Histograms
A popular data reduction technique
Divide data into buckets and store the average (sum) for each bucket
Can be constructed optimally in one dimension using dynamic programming
Related to quantization problems.
(Figure: histogram of prices with buckets from 10,000 to 90,000)
Lecture-16 - Data reduction
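A minimal sketch of equal-width bucketing that stores one summary value (sum and count) per bucket, assuming prices in the 0-100,000 range as in the figure:

# Sketch: reduce a list of prices to per-bucket sums over equal-width buckets.
import numpy as np

prices = np.array([12000, 15000, 31000, 33000, 47000, 52000, 68000, 71000, 88000])
edges = np.arange(0, 100001, 20000)          # buckets: 0-20k, 20-40k, ..., 80-100k
bucket_sums, _ = np.histogram(prices, bins=edges, weights=prices)
bucket_counts, _ = np.histogram(prices, bins=edges)
print(bucket_sums, bucket_counts)            # store these instead of the raw values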
Clustering

Partition the data set into clusters, and store only the cluster
representation
Can be very effective if data is clustered but not if data
is “smeared”
Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
There are many choices of clustering definitions and
clustering algorithms.

Lecture-16 - Data reduction


Sampling
Allows a large data set to be represented
by a much smaller random sample (subset) of the data.
Let a large data set D contain N tuples.
Methods to reduce data set D:

Simple random sample without replacement
(SRSWOR)

Simple random sample with replacement
(SRSWR)

Cluster sample

Stratified sample
Lecture-16 - Data reduction
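The first two sampling methods can be sketched with the Python standard library; D is a toy stand-in data set here:

# Sketch: SRSWOR and SRSWR over a toy data set D of N tuples.
import random

D = list(range(100))     # stand-in for N = 100 tuples
n = 10                   # desired sample size

srswor = random.sample(D, n)                    # without replacement: no duplicates
srswr  = [random.choice(D) for _ in range(n)]   # with replacement: duplicates possible
print(srswor, srswr)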
Sampling

(Figure: raw data reduced by SRSWOR, simple random sampling without replacement, and by SRSWR, simple random sampling with replacement)
Lecture-16 - Data reduction
Sampling

(Figure: raw data vs. a cluster/stratified sample)

Lecture-16 - Data reduction


Lecture-17
Discretization and concept
hierarchy generation
Discretization

Three types of attributes:



Nominal — values from an unordered set

Ordinal — values from an ordered set

Continuous — real numbers
Discretization: divide the range of a continuous
attribute into intervals

Some classification algorithms only accept
categorical attributes.

Reduce data size by discretization

Prepare for further analysis

Lecture-17 - Discretization and concept hierarchy generation


Discretization and Concept hierarchy

Discretization

reduce the number of values for a given continuous
attribute by dividing the range of the attribute into
intervals. Interval labels can then be used to replace
actual data values.
Concept hierarchies

reduce the data by collecting and replacing low level
concepts (such as numeric values for the attribute
age) by higher level concepts (such as young,
middle-aged, or senior).

Lecture-17 - Discretization and concept hierarchy generation


Discretization and concept hierarchy
generation for numeric data

Binning

Histogram analysis

Clustering analysis

Entropy-based discretization

Discretization by intuitive partitioning

Lecture-17 - Discretization and concept hierarchy generation


Entropy-Based Discretization
Given a set of samples S, if S is partitioned into
two intervals S1 and S2 using boundary T, the
entropy after partitioning is

    E(S, T) = (|S1| / |S|) Ent(S1) + (|S2| / |S|) Ent(S2)

The boundary that minimizes the entropy function
over all possible boundaries is selected as a
binary discretization.
The process is recursively applied to the partitions
obtained until some stopping criterion is met, e.g.,
when the information gain Ent(S) - E(T, S) falls below a threshold.
Experiments show that it may reduce data size
and improve classification accuracy

Lecture-17 - Discretization and concept hierarchy generation
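A minimal Python sketch of choosing the single best boundary T by minimizing E(S, T); the values and class labels are made up for the example:

# Sketch: pick the boundary T that minimizes the weighted entropy E(S, T).
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_boundary(values, labels):
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2          # candidate boundary between points
        s1 = [l for v, l in pairs if v <= t]
        s2 = [l for v, l in pairs if v > t]
        e = len(s1) / len(pairs) * entropy(s1) + len(s2) / len(pairs) * entropy(s2)
        if best is None or e < best[1]:
            best = (t, e)
    return best

values = [1, 2, 3, 10, 11, 12]
labels = ["low", "low", "low", "high", "high", "high"]
print(best_boundary(values, labels))   # boundary 6.5, entropy 0.0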


Discretization by intuitive partitioning

3-4-5 rule can be used to segment numeric data into


relatively uniform, “natural” intervals.
* If an interval covers 3, 6, 7 or 9 distinct values at the most
significant digit, partition the range into 3 equal-width
intervals
* If it covers 2, 4, or 8 distinct values at the most significant
digit, partition the range into 4 intervals
* If it covers 1, 5, or 10 distinct values at the most
significant digit, partition the range into 5 intervals

Lecture-17 - Discretization and concept hierarchy generation


Example of 3-4-5 rule
Suppose profits at different branches range from Min = -$351 to Max = $4,700.

Step 1: take Low (the 5%-tile) = -$159 and High (the 95%-tile) = $1,838 as the working range.
Step 2: the most significant digit gives msd = 1,000, so round to Low' = -$1,000 and High' = $2,000.
Step 3: (-$1,000 - $2,000) covers 3 distinct values at the msd, so partition into 3 equal-width intervals: (-$1,000 - $0], ($0 - $1,000], ($1,000 - $2,000].
Step 4: adjust the boundary intervals to cover Min and Max: since Min = -$351, the first interval shrinks to (-$400 - $0]; since Max = $4,700, a new interval ($2,000 - $5,000] is added.
Each top-level interval can then be partitioned recursively: (-$400 - $0] into 4 sub-intervals of $100, ($0 - $1,000] and ($1,000 - $2,000] into 5 sub-intervals of $200 each, and ($2,000 - $5,000] into 3 sub-intervals of $1,000.
Lecture-17 - Discretization and concept hierarchy generation
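A minimal Python sketch of the top-level 3-4-5 split, using the rounded low, high, and msd from Step 2 of the example above; the function name and interface are illustrative, not from the source:

# Sketch: split [low, high) into 3, 4, or 5 equal-width intervals per the 3-4-5 rule.
def three_four_five(low, high, msd_unit):
    distinct = round((high - low) / msd_unit)      # distinct values at the most significant digit
    if distinct in (3, 6, 7, 9):
        n = 3
    elif distinct in (2, 4, 8):
        n = 4
    else:                                          # 1, 5, or 10 distinct values
        n = 5
    width = (high - low) / n
    return [(low + i * width, low + (i + 1) * width) for i in range(n)]

print(three_four_five(-1000, 2000, 1000))
# [(-1000.0, 0.0), (0.0, 1000.0), (1000.0, 2000.0)]  -- matches Step 3 of the example above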
Concept hierarchy generation for categorical
data
Specification of a partial ordering of attributes
explicitly at the schema level by users or experts
Specification of a portion of a hierarchy by
explicit data grouping
Specification of a set of attributes, but not of
their partial ordering
Specification of only a partial set of attributes

Lecture-17 - Discretization and concept hierarchy generation


Specification of a set of attributes

Concept hierarchy can be automatically


generated based on the number of distinct
values per attribute in the given attribute set.
The attribute with the most distinct values is
placed at the lowest level of the hierarchy.

country: 15 distinct values
province_or_state: 65 distinct values
city: 3,567 distinct values
street: 674,339 distinct values

Lecture-17 - Discretization and concept hierarchy generation
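A minimal sketch of ordering the attributes by their distinct-value counts (fewest values at the top of the hierarchy, most values at the bottom), using the counts listed above:

# Sketch: derive the hierarchy ordering from distinct-value counts per attribute.
distinct_counts = {
    "country": 15,
    "province_or_state": 65,
    "city": 3567,
    "street": 674339,
}
hierarchy = sorted(distinct_counts, key=distinct_counts.get)   # fewest values first = highest level
print(" < ".join(reversed(hierarchy)))   # street < city < province_or_state < country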
