Ch2 Data Preprocessing Part3: Amit KR Upadhyay Sharda University
Amit Kr Upadhyay
Sharda University
Knowledge Discovery (KDD) Process
Data mining: the core of the knowledge discovery process
[Figure: KDD pipeline: Databases → Data Cleaning and Data Integration → Data Warehouse → Data Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Forms of Data Preprocessing
Data Transformation
Data transformation – the data are
transformed or consolidated into forms
appropriate for mining
Data Transformation
Data Transformation can involve the
following:
Smoothing: remove noise from the data; techniques include
binning, regression, and clustering
Aggregation
Generalization
Normalization
Attribute construction
Normalization
Min-max normalization
Z-score normalization
Normalization by decimal scaling
Min-max normalization
Min-max normalization: to [new_min_A, new_max_A]
$$v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$$
Ex. Let income range from $12,000 to $98,000,
normalized to [0.0, 1.0]. Then $73,600 is mapped to
$$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716$$
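The income example can be checked with a few lines of Python. This is a minimal sketch, not a reference implementation; the function name `min_max_normalize` and its parameter names are illustrative assumptions.

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from the range [min_a, max_a] onto [new_min, new_max]."""
    if max_a == min_a:
        # The formula divides by (max_a - min_a), so a constant attribute is rejected.
        raise ValueError("max_a and min_a must differ")
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


# Income example from the slide: range $12,000 to $98,000, target [0.0, 1.0].
print(round(min_max_normalize(73_600, 12_000, 98_000), 3))  # 0.716
```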
Z-score normalization
Z-score normalization (μ_A: mean of attribute A, for
example 54,000; σ_A: standard deviation of A):
$$v' = \frac{v - \mu_A}{\sigma_A}$$
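A small Python sketch of z-score normalization follows. The function name and the sample values are hypothetical, and the sketch assumes the population standard deviation (`statistics.pstdev`); the slide does not say whether the population or sample form is intended.

```python
from statistics import mean, pstdev


def z_score_normalize(values):
    """Return (v - mu) / sigma for every v in values."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(v - mu) / sigma for v in values]


# Hypothetical data with mean 5 and standard deviation 2.
print(z_score_normalize([2, 4, 4, 4, 5, 5, 7, 9]))
# [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```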
Normalization by decimal scaling:
$$v' = \frac{v}{10^{j}}$$
where j is the smallest integer such that max(|v'|) < 1
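Decimal scaling can be sketched as a search for j. The function name and sample values below are hypothetical, and the sketch assumes j is restricted to non-negative integers and that the input list is non-empty.

```python
def decimal_scale(values):
    """Divide every value by 10**j, with j the smallest j >= 0 giving max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    # Increase j until the largest absolute value drops strictly below 1.
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


# Hypothetical values: the largest magnitude is 986, so j = 3.
print(decimal_scale([-986, 917]))  # ([-0.986, 0.917], 3)
```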
Data Reduction Method:
Histograms
[Figure: histogram with the horizontal axis running from 20,000 to 100,000]
Divide data into buckets and store average (sum) for
each bucket
Partitioning rules:
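Equal-width: every bucket covers an equal value range
Equal-frequency (or equal-depth): every bucket holds roughly the same number of values
V-optimal: the histogram with the least variance (histogram variance is a
weighted sum of the original values that each bucket represents)
MaxDiff: place a bucket boundary between each pair of adjacent values whose
difference is among the β − 1 largest, where β is the number of buckets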
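The two simplest partitioning rules, equal-width and equal-frequency, can be sketched in plain Python. The function names and the sample data are hypothetical; V-optimal and MaxDiff are not covered here.

```python
def equal_width_buckets(values, n_buckets):
    """Split the value range [min, max] into n_buckets ranges of equal width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; bucket width would be zero")
    width = (hi - lo) / n_buckets
    buckets = [[] for _ in range(n_buckets)]
    for v in values:
        # The maximum value would land one past the end, so clamp the index.
        idx = min(int((v - lo) / width), n_buckets - 1)
        buckets[idx].append(v)
    return buckets


def equal_frequency_buckets(values, n_buckets):
    """Split the sorted values into n_buckets groups of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    bounds = [i * n // n_buckets for i in range(n_buckets + 1)]
    return [ordered[bounds[i]:bounds[i + 1]] for i in range(n_buckets)]


def bucket_averages(buckets):
    """Reduced representation: one average per bucket (None for an empty bucket)."""
    return [sum(b) / len(b) if b else None for b in buckets]


data = [1, 2, 3, 4, 10, 20, 30, 40, 100]  # hypothetical values
print(equal_width_buckets(data, 3))      # [[1, 2, 3, 4, 10, 20, 30], [40], [100]]
print(equal_frequency_buckets(data, 3))  # [[1, 2, 3], [4, 10, 20], [30, 40, 100]]
```

With skewed data such as this, equal-width puts most values in one bucket, while equal-frequency spreads them evenly.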
Data Reduction Method:
Clustering
Partition data set into clusters based on similarity,
and store cluster representation (e.g., centroid and
diameter) only
There are many choices of clustering definitions and
clustering algorithms
Cluster analysis will be studied in depth in Chapter 7
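To illustrate storing only a cluster representation, the sketch below reduces one cluster of points to a centroid and a diameter. It assumes the points have already been assigned to the cluster by some clustering algorithm, and it takes the diameter to be the largest pairwise Euclidean distance, which is one common definition among several. The function name and data are hypothetical.

```python
from itertools import combinations
from math import dist  # Python 3.8+


def summarize_cluster(points):
    """Reduce a non-empty list of equal-length point tuples to (centroid, diameter)."""
    n = len(points)
    dims = len(points[0])
    # Centroid: per-dimension mean of the points.
    centroid = tuple(sum(p[d] for p in points) / n for d in range(dims))
    # Diameter (assumed definition): maximum distance between any two points.
    diameter = max((dist(p, q) for p, q in combinations(points, 2)), default=0.0)
    return centroid, diameter


# Hypothetical cluster of three 2-D points.
print(summarize_cluster([(0, 0), (3, 4), (0, 4)]))
```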
Data Reduction Method:
Sampling
Sampling: obtaining a small sample s to
represent the whole data set N
[Figure: Raw Data sampled two ways: SRSWOR (simple random sample without replacement) and SRSWR (simple random sample with replacement)]
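Both simple random sampling schemes map directly onto Python's standard `random` module. This is a minimal sketch; the function names, the `seed` parameter, and the data are illustrative assumptions.

```python
import random


def srswor(data, s, seed=None):
    """Simple random sample WITHOUT replacement: no item is drawn twice."""
    rng = random.Random(seed)
    return rng.sample(data, s)  # requires s <= len(data)


def srswr(data, s, seed=None):
    """Simple random sample WITH replacement: an item may be drawn repeatedly."""
    rng = random.Random(seed)
    return rng.choices(data, k=s)


raw_data = list(range(1, 101))  # hypothetical data set of N = 100 records
print(srswor(raw_data, 8, seed=42))
print(srswr(raw_data, 8, seed=42))
```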
Sampling: Cluster or Stratified Sampling
[Figure: Raw Data and the resulting Cluster/Stratified Sample]
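Stratified sampling can be sketched as grouping the records into strata and then drawing a simple random sample without replacement from each one. The function name, the `key` and `fraction` parameters, and the age-group records below are hypothetical; drawing at least one record per stratum is a design choice of this sketch, not something the slide specifies.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=None):
    """Sample roughly `fraction` of each stratum, where key(record) names the stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for group in strata.values():
        # Assumption: keep at least one record so small strata stay represented.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical records: (age group, record id).
records = [("young", i) for i in range(80)] + [("senior", i) for i in range(20)]
picked = stratified_sample(records, key=lambda r: r[0], fraction=0.1, seed=1)
print(len(picked))  # 10
```

Because each stratum is sampled separately, the 80/20 split of the raw data is preserved in the sample (8 and 2 records here).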