DWDM UNIT 2
Lecture Topic
**********************************************
Lecture-13 Why preprocess the data?
Lecture-14 Data cleaning
Lecture-15 Data integration and transformation
Lecture-16 Data reduction
Lecture-17 Discretization and concept hierarchy generation
Lecture-13 Why preprocess the data?
Data in the real world is:
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
noisy: containing errors or outliers
inconsistent: containing discrepancies in codes or names
No quality data, no quality mining results!
Quality decisions must be based on quality data
A data warehouse needs consistent integration of quality data
[Figure: regression line y = x + 1 fitted to the data; the point (X1, Y1) is approximated by its predicted value Y1' on the line]
min-max normalization
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
z-score normalization
v' = (v - mean_A) / stand_dev_A
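The two normalization formulas above can be sketched directly in Python. The function names and the sample income figures in the usage note are illustrative assumptions, not from the lecture:

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v linearly from [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score_normalize(v, mean_a, std_a):
    """Center v on the attribute mean and scale by its standard deviation."""
    return (v - mean_a) / std_a
```

For example, an income of 73,600 with observed min 12,000 and max 98,000 maps to about 0.716 on [0, 1]; with mean 54,000 and standard deviation 16,000, its z-score is 1.225.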
[Figure: cluster analysis over attributes X1, X2 with clusters Y1, Y2; decision tree residue over attributes A1, A6]
Partition the data set into clusters, and store only the cluster representations
Can be very effective if the data is clustered, but not if the data is "smeared"
Hierarchical clusterings can be stored in multi-dimensional index tree structures
There are many choices of clustering definitions and clustering algorithms.
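The idea of storing a cluster representation instead of the raw values can be sketched as follows. This is a minimal one-dimensional illustration with assumed centroids, not a full clustering algorithm:

```python
# Minimal sketch: reduce a 1-D data set by keeping only (centroid, count)
# pairs in place of the raw values. The centroids are assumed to be given,
# e.g. by a prior run of k-means.

def reduce_by_clusters(values, centroids):
    """Assign each value to its nearest centroid; store (centroid, count)."""
    counts = {c: 0 for c in centroids}
    for v in values:
        nearest = min(centroids, key=lambda c: abs(v - c))
        counts[nearest] += 1
    return [(c, n) for c, n in counts.items() if n > 0]
```

This works well when the data really does fall into tight groups around the centroids; for "smeared" data the counts hide most of the structure, which is exactly the caveat above.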
[Figure: sampling from Raw Data — SRSWOR (simple random sample without replacement) and SRSWR (simple random sample with replacement)]
Lecture-16 Data reduction
Sampling
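Both sampling schemes can be sketched with the standard library. The function names and toy data are my own for illustration:

```python
import random

def srswor(data, n, seed=None):
    """Simple random sample WITHOUT replacement: each tuple drawn at most once."""
    rng = random.Random(seed)
    return rng.sample(data, n)

def srswr(data, n, seed=None):
    """Simple random sample WITH replacement: the same tuple may recur."""
    rng = random.Random(seed)
    return [rng.choice(data) for _ in range(n)]
```

SRSWOR can never repeat a tuple, so the sample size is bounded by the data size; SRSWR has no such bound and may draw the same tuple several times.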
Discretization
reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
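A minimal sketch of this replacement, using equal-width intervals (the bin count and the returned (low, high) labels are illustrative choices, not from the lecture):

```python
# Equal-width discretization: replace each continuous value with the label
# of the interval it falls into.

def equal_width_bins(values, k):
    """Split the value range into k equal-width intervals; return each
    value's interval label as a (low, high) pair."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    labels = []
    for v in values:
        i = min(int((v - lo) / width), k - 1)  # clamp the max value into the last bin
        labels.append((lo + i * width, lo + (i + 1) * width))
    return labels
```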
Concept hierarchies
reduce the data by collecting and replacing low level concepts (such as numeric values for the attribute age) by higher level concepts (such as young, middle-aged, or senior).
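The age example above amounts to a small mapping function. The cut points 30 and 50 are illustrative assumptions; the lecture does not fix them:

```python
# Climbing the concept hierarchy for "age": numeric values are replaced by
# the higher-level concepts young / middle-aged / senior.
# Assumed cut points: <30 young, 30-49 middle-aged, >=50 senior.

def age_concept(age):
    if age < 30:
        return "young"
    if age < 50:
        return "middle-aged"
    return "senior"
```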
Binning
Histogram analysis
Clustering analysis
Entropy-based discretization
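Of the methods listed, entropy-based discretization is the only supervised one: it picks the boundary that best separates the class labels. A minimal sketch of one split step, with made-up function names and a single-boundary search (the full method applies this recursively):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(points):
    """points: list of (value, class_label) pairs. Return the boundary T that
    minimizes the size-weighted entropy of the two intervals v < T and v >= T."""
    points = sorted(points)
    n = len(points)
    best_t, best_e = None, float("inf")
    for i in range(1, n):
        t = (points[i - 1][0] + points[i][0]) / 2  # candidate midpoint boundary
        left = [c for v, c in points if v < t]
        right = [c for v, c in points if v >= t]
        e = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if e < best_e:
            best_t, best_e = t, e
    return best_t
```

When the classes are cleanly separated along the attribute, the chosen boundary lands in the gap between them, giving zero entropy on both sides.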
[Example figure: stepwise interval generation for a numeric concept hierarchy; Step 3: (-$1,000, $2,000); Step 4: (-$4,000, $5,000)]