Ch2 Data Preprocessing Part3: Amit KR Upadhyay Sharda University

This document discusses different techniques for data preprocessing, including data transformation, data reduction, and sampling. It describes how data transformation can involve smoothing, aggregation, generalization, normalization, and attribute construction. Common normalization techniques are min-max normalization, z-score normalization, and decimal normalization. The document also discusses why data reduction is important when dealing with large datasets, and covers strategies like data cube aggregation, attribute subset selection, dimensionality reduction, numerosity reduction using histograms and clustering, and different sampling methods.

Copyright
© Attribution Non-Commercial (BY-NC)

Ch2 Data Preprocessing part3

Amit Kr Upadhyay
Sharda University
Knowledge Discovery (KDD) Process
• Data mining: the core of the knowledge discovery process
[Figure: KDD pipeline. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Forms of Data Preprocessing
Data Transformation
• Data transformation: the data are transformed or consolidated into forms appropriate for mining
Data Transformation
• Data transformation can involve the following:
  - Smoothing: remove noise from the data, using techniques such as binning, regression, and clustering
  - Aggregation
  - Generalization
  - Normalization
  - Attribute construction
Normalization
• Min-max normalization
• Z-score normalization
• Normalization by decimal scaling
Min-max normalization
• Min-max normalization maps a value v of attribute A to v' in the range [new_min_A, new_max_A]:
\[ v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A \]
• Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to
\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]
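The min-max formula can be checked with a few lines of Python. This is a minimal sketch, not code from the slides: the function name `min_max_normalize` and its parameter names are illustrative assumptions, and the input 73,600 is the value used in the slide's arithmetic.

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from the range [min_a, max_a] onto [new_min, new_max]."""
    if max_a == min_a:
        # The formula divides by (max_a - min_a), so a constant attribute
        # cannot be rescaled this way.
        raise ValueError("min_a and max_a must differ")
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


# Income example: range 12,000 to 98,000, target range [0.0, 1.0].
income_scaled = min_max_normalize(73_600, 12_000, 98_000)
print(round(income_scaled, 3))  # 0.716
```

The endpoints of the original range map to the endpoints of the new range, which is a quick sanity check for any implementation.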
Z-score normalization
• Z-score normalization (μ_A: mean of attribute A, e.g., 54,000; σ_A: standard deviation of A):
\[ v' = \frac{v - \mu_A}{\sigma_A} \]
• Ex. Let μ = 54,000, σ = 16,000. Then 73,600 is mapped to
\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
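A short Python sketch of the z-score formula follows. The function names are illustrative assumptions, and `z_score_column` additionally assumes the population standard deviation (`statistics.pstdev`), which the slides do not specify; the small list passed to it is made-up data.

```python
import statistics


def z_score_normalize(v, mean_a, std_a):
    """Return how many standard deviations v lies from the mean."""
    if std_a == 0:
        raise ValueError("standard deviation must be non-zero")
    return (v - mean_a) / std_a


def z_score_column(values):
    """Normalize a whole column, estimating mean and std from the data.

    Assumption: the population standard deviation is used.
    """
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [z_score_normalize(v, mu, sigma) for v in values]


# Slide example: mean 54,000, standard deviation 16,000.
z = z_score_normalize(73_600, 54_000, 16_000)
print(round(z, 3))  # 1.225

# Hypothetical column with mean 5 and population standard deviation 2.
column_z = z_score_column([2, 4, 4, 4, 5, 5, 7, 9])
```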
Decimal normalization
• Normalization by decimal scaling:
\[ v' = \frac{v}{10^{j}} \]
where j is the smallest integer such that max(|v'|) < 1
• Ex. Suppose the recorded values of A range from -986 to 917. The maximum absolute value is 986, so j = 3: each value is divided by 1,000, and -986 normalizes to -0.986
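The search for j can be sketched as a simple loop. This is an illustrative sketch under one stated assumption: j starts at 0, so data whose absolute values are already below 1 are left unchanged. The function names are hypothetical.

```python
def decimal_scaling_exponent(values):
    """Smallest non-negative integer j with max(|v|) / 10**j < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j


def decimal_scale(values):
    """Divide every value by 10**j, with j chosen as above."""
    j = decimal_scaling_exponent(values)
    return [v / 10 ** j for v in values]


# Slide example: values of A range from -986 to 917.
j = decimal_scaling_exponent([-986, 917])
scaled = decimal_scale([-986, 917])
print(j)  # 3
```

Note the strict inequality: a maximum absolute value of exactly 1,000 needs j = 4, because 1,000 / 10^3 = 1 is not less than 1.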
Data Reduction
• Why data reduction?
  - A database or data warehouse may store terabytes of data
  - Complex data analysis or mining may take a very long time to run on the complete data set
Data Reduction
• Data reduction
  - Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
Data Reduction
• Data reduction strategies
  - Data cube aggregation
  - Attribute subset selection
  - Dimensionality reduction, e.g., remove unimportant attributes
  - Numerosity reduction, e.g., fit data into models
  - Discretization and concept hierarchy generation
Data cube aggregation
• Multiple levels of aggregation in data cubes
  - Further reduce the size of the data to deal with
• Reference appropriate levels
  - Use the smallest representation that is enough to solve the task
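As a small illustration of moving to a coarser aggregation level, the sketch below rolls quarterly sales up to annual totals. The sales figures and the function name `roll_up_to_year` are made up for this example and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical base-level rows: (year, quarter, sales).
quarterly_sales = [
    (2021, "Q1", 100), (2021, "Q2", 200), (2021, "Q3", 300), (2021, "Q4", 400),
    (2022, "Q1", 150), (2022, "Q2", 250), (2022, "Q3", 350), (2022, "Q4", 450),
]


def roll_up_to_year(rows):
    """Aggregate quarter-level rows to one total per year."""
    totals = defaultdict(int)
    for year, _quarter, amount in rows:
        totals[year] += amount
    return dict(totals)


annual_sales = roll_up_to_year(quarterly_sales)
print(annual_sales)  # {2021: 1000, 2022: 1200}
```

If the task only needs yearly figures, the two-row annual table is the smaller representation that still answers the question; the eight quarterly rows are not needed.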
Attribute subset selection
Dimensionality reduction
• Feature selection (i.e., attribute subset selection):
  - Select a minimum set of features such that the probability distribution of the classes given the values of those features is as close as possible to the original distribution given the values of all features
  - Reduces the number of attributes appearing in the discovered patterns, making the patterns easier to understand
Attribute subset selection
Dimensionality reduction
• Heuristic methods (needed because the number of possible attribute subsets grows exponentially with the number of attributes):
  - Step-wise forward selection
  - Step-wise backward elimination
  - Combining forward selection and backward elimination
  - Decision-tree induction
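Step-wise forward selection can be sketched as a greedy loop: start with no attributes and repeatedly add the one that improves a quality score the most. This sketch rests on several assumptions: the `score` callback stands in for whatever evaluation measure is used (for example, classifier accuracy), and the toy weights are invented so the example runs on its own.

```python
def forward_selection(attributes, score, min_gain=1e-9):
    """Greedy step-wise forward selection.

    `score` takes a list of attribute names and returns a number
    (higher is better). Selection stops when no remaining attribute
    improves the score by more than `min_gain`.
    """
    selected = []
    remaining = list(attributes)
    best_score = score(selected)
    while remaining:
        # Evaluate adding each remaining attribute to the current set.
        scored = [(score(selected + [a]), a) for a in remaining]
        top_score, top_attr = max(scored, key=lambda pair: pair[0])
        if top_score - best_score <= min_gain:
            break
        selected.append(top_attr)
        remaining.remove(top_attr)
        best_score = top_score
    return selected


# Hypothetical usefulness weights; attributes not listed contribute nothing.
toy_weights = {"A1": 0.3, "A4": 0.5, "A6": 0.15}


def toy_score(attrs):
    return sum(toy_weights.get(a, 0.0) for a in attrs)


chosen = forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], toy_score)
print(chosen)  # ['A4', 'A1', 'A6']
```

Each pass evaluates only the remaining attributes, so the loop makes at most on the order of n² score calls for n attributes instead of examining all 2^n subsets. Backward elimination is the mirror image: start from the full set and repeatedly drop the attribute whose removal hurts the score least.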
Attribute subset selection
Dimensionality reduction
Initial attribute set: {A1, A2, A3, A4, A5, A6}
[Figure: decision tree with root test A4?, child tests A1? and A6?, and leaves Class 1 and Class 2 under each child]
Reduced attribute set: {A1, A4, A6}


Numerosity reduction
• Reduce data volume by choosing alternative, smaller forms of data representation
• Major families: histograms, clustering, sampling
Data Reduction Method: Histograms
[Figure: example histogram; x-axis ticks from 10,000 to 100,000 in steps of 10,000, y-axis ticks from 0 to 40 in steps of 5]
• Divide data into buckets and store the average (or sum) for each bucket
• Partitioning rules:
  - Equal-width: every bucket spans the same value range
  - Equal-frequency (or equal-depth): every bucket holds roughly the same number of values
  - V-optimal: the histogram with the least variance, where histogram variance is a weighted sum of the original values that each bucket represents
  - MaxDiff: place a bucket boundary between each pair of adjacent values whose difference is among the β − 1 largest, where β is the number of buckets
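The first two partitioning rules are simple enough to sketch directly. The code below is an illustration only: the function names and the short price list are assumptions, and V-optimal and MaxDiff are not implemented here.

```python
def equal_width_buckets(values, num_buckets):
    """Split the value range [min, max] into num_buckets equal-width ranges."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return [list(values)] + [[] for _ in range(num_buckets - 1)]
    width = (hi - lo) / num_buckets
    buckets = [[] for _ in range(num_buckets)]
    for v in values:
        # The maximum value would land at index num_buckets, so clamp it.
        idx = min(int((v - lo) / width), num_buckets - 1)
        buckets[idx].append(v)
    return buckets


def equal_frequency_buckets(values, num_buckets):
    """Split the sorted values into buckets of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    bounds = [round(i * n / num_buckets) for i in range(num_buckets + 1)]
    return [ordered[bounds[i]:bounds[i + 1]] for i in range(num_buckets)]


def summarize(buckets):
    """Keep only (count, average) per bucket: the reduced representation."""
    return [(len(b), (sum(b) / len(b) if b else None)) for b in buckets]


# Hypothetical price data.
prices = [1, 1, 5, 5, 5, 8, 8, 10, 10, 12, 14, 15]
width_counts = [len(b) for b in equal_width_buckets(prices, 3)]
depth_counts = [len(b) for b in equal_frequency_buckets(prices, 3)]
print(width_counts)  # [5, 4, 3]
print(depth_counts)  # [4, 4, 4]
```

Equal-width buckets can end up with very uneven counts on skewed data, while equal-frequency buckets keep counts balanced at the cost of uneven value ranges.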
Data Reduction Method: Clustering
• Partition the data set into clusters based on similarity, and store only the cluster representation (e.g., centroid and diameter)
• There are many choices of clustering definitions and clustering algorithms
• Cluster analysis will be studied in depth in Chapter 7
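Once clusters exist, the reduction step itself is small: replace each cluster's points with a summary. The sketch below assumes the cluster assignment is already given (the sample clusters are made up) and takes "diameter" to mean the maximum pairwise Euclidean distance, which is one common definition and not necessarily the one the slides intend.

```python
import math
from itertools import combinations


def summarize_cluster(points):
    """Return (centroid, diameter) for one non-empty cluster of points.

    Assumption: diameter is the largest pairwise Euclidean distance.
    """
    n = len(points)
    dims = len(points[0])
    centroid = tuple(sum(p[d] for p in points) / n for d in range(dims))
    diameter = max(
        (math.dist(p, q) for p, q in combinations(points, 2)),
        default=0.0,  # a single-point cluster has diameter 0
    )
    return centroid, diameter


# Hypothetical, already-computed clusters of 2-D points.
clusters = {
    "c1": [(0.0, 0.0), (3.0, 4.0)],
    "c2": [(10.0, 10.0), (10.0, 12.0), (12.0, 10.0)],
}
reduced = {name: summarize_cluster(pts) for name, pts in clusters.items()}
print(reduced["c1"])  # ((1.5, 2.0), 5.0)
```

Only `reduced` needs to be stored; the original points can be discarded if the centroid and diameter are enough for the analysis at hand.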
Data Reduction Method: Sampling
• Sampling: obtain a small sample of s tuples to represent the whole data set D of N tuples
  - Simple random sample without replacement (SRSWOR)
  - Simple random sample with replacement (SRSWR)
  - Cluster sample: if the tuples in D are grouped into M mutually disjoint clusters, then a simple random sample of s clusters can be obtained, where s < M
  - Stratified sample
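The four sampling schemes can be sketched with Python's standard `random` module. All function names and data below are illustrative assumptions. In particular, `stratified_sample` assumes proportional allocation (the same fraction is drawn from every stratum, with at least one tuple per non-empty stratum), which is one common choice and not something the slides specify.

```python
import random


def srswor(data, s, rng):
    """Simple random sample without replacement: no tuple is drawn twice."""
    return rng.sample(data, s)


def srswr(data, s, rng):
    """Simple random sample with replacement: a tuple may be drawn again."""
    return rng.choices(data, k=s)


def cluster_sample(clusters, s, rng):
    """Draw s whole clusters (s < M) and keep every tuple in them."""
    chosen = rng.sample(clusters, s)
    return [t for cluster in chosen for t in cluster]


def stratified_sample(strata, fraction, rng):
    """Draw an SRSWOR of roughly `fraction` of the tuples in each stratum."""
    sample = []
    for members in strata.values():
        k = min(len(members), max(1, round(len(members) * fraction)))
        sample.extend(rng.sample(members, k))
    return sample


rng = random.Random(42)  # fixed seed so repeated runs give the same draw
data = list(range(100))
wor = srswor(data, 5, rng)
wr = srswr(data, 5, rng)

clusters = [list(range(i, i + 10)) for i in range(0, 40, 10)]  # M = 4
by_cluster = cluster_sample(clusters, 2, rng)

strata = {"young": list(range(0, 60)), "senior": list(range(100, 110))}
by_stratum = stratified_sample(strata, 0.1, rng)
```

With these inputs the stratified sample contains 6 tuples from the larger stratum and 1 from the smaller one, so a small group is still represented even though a simple random sample could miss it.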
Sampling: with or without Replacement
[Figure: Raw Data sampled two ways: SRSWOR (simple random sample without replacement) and SRSWR (simple random sample with replacement)]
Sampling: Cluster or Stratified Sampling
[Figure: Raw Data alongside the resulting Cluster/Stratified Sample]
