3 Prep
Concepts and
Techniques
— Slides for Textbook —
— Chapter 3 —
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Broad categories:
intrinsic, contextual, representational, and
accessibility.
Data Mining: Concepts and Techniques, December 17, 2024
Major Tasks in Data
Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains reduced representation in volume but produces
the same or similar analytical results
Data discretization
Part of data reduction but with particular importance,
especially for numerical data
Forms of data
preprocessing
technology limitation
incomplete data
inconsistent data
How to Handle Noisy Data?
Binning method:
first sort data and partition into (equi-depth) bins
Regression
smooth by fitting the data into regression
functions
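The binning method above can be sketched as smoothing by bin means: sort the values, cut them into equi-depth bins, and replace every value with the mean of its bin. This is a minimal sketch; the function name `smooth_by_bin_means` and the sample `prices` list are assumptions made for illustration, not part of the slides.

```python
def smooth_by_bin_means(values, n_bins):
    """Sort values, split them into equi-depth bins, replace each by its bin mean."""
    ordered = sorted(values)
    depth, extra = divmod(len(ordered), n_bins)
    smoothed, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one more value so that no value is dropped.
        end = start + depth + (1 if i < extra else 0)
        bin_values = ordered[start:end]
        if bin_values:
            mean = sum(bin_values) / len(bin_values)
            smoothed.extend([mean] * len(bin_values))
        start = end
    return smoothed


# Hypothetical sample data: twelve prices, three bins of four values each.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(smooth_by_bin_means(prices, 3))
# -> [9.0, 9.0, 9.0, 9.0, 22.75, 22.75, 22.75, 22.75, 29.25, 29.25, 29.25, 29.25]
```

Smoothing by bin medians or bin boundaries would follow the same pattern, with only the replacement value changed.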
Simple Discretization Methods:
Binning
Equal-width (distance) partitioning:
It divides the range into N intervals of equal size:
uniform grid
if A and B are the lowest and highest values of the attribute, the width of each interval is W = (B - A)/N
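Equal-width partitioning can be sketched as follows, assuming the interval width is the attribute range divided by N. The function name `equal_width_bins` and the sample values are hypothetical.

```python
def equal_width_bins(values, n):
    """Return (edges, labels): n + 1 interval edges and one bin index per value."""
    a, b = min(values), max(values)
    width = (b - a) / n  # W = (B - A) / N
    edges = [a + i * width for i in range(n + 1)]
    labels = []
    for v in values:
        if width == 0:
            idx = 0  # all values are identical, so everything goes in the first bin
        else:
            # The maximum value sits on the last edge; keep it in the last bin.
            idx = min(int((v - a) / width), n - 1)
        labels.append(idx)
    return edges, labels


edges, labels = equal_width_bins([0, 5, 10, 20], 4)
print(edges)   # -> [0.0, 5.0, 10.0, 15.0, 20.0]
print(labels)  # -> [0, 1, 2, 3]
```

Because the edges depend only on the minimum and maximum, a single outlier can stretch the grid and leave most bins nearly empty.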
[Figure: regression line y = x + 1 on axes x and y; the point Y1 at X1 is smoothed to Y1']
Data integration: combine data from multiple sources into a coherent store
Schema integration
integrate metadata from different sources
Dimensionality reduction
Numerosity reduction
understand
Heuristic methods (due to exponential # of choices):
step-wise forward selection
elimination
decision-tree induction
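Step-wise forward selection from the list above can be sketched as a greedy loop: start with no attributes and repeatedly add the single attribute that most improves a score. This is a sketch under stated assumptions: `score` is a caller-supplied evaluation function, and `weights` and `toy_score` are hypothetical stand-ins for a real measure of attribute quality.

```python
def forward_selection(attributes, score, max_size=None):
    """Greedy step-wise forward selection.

    `score` maps a tuple of attributes to a number (higher is better).
    """
    selected = []
    remaining = list(attributes)
    best_score = score(tuple(selected))
    limit = len(remaining) if max_size is None else max_size
    while remaining and len(selected) < limit:
        # Try each remaining attribute and keep the best single addition.
        candidate = max(remaining, key=lambda a: score(tuple(selected + [a])))
        candidate_score = score(tuple(selected + [candidate]))
        if candidate_score <= best_score:
            break  # no attribute improves the score, so stop
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected


# Hypothetical usefulness per attribute, with a small penalty per attribute kept.
weights = {"A1": 3, "A4": 5, "A6": 2}


def toy_score(subset):
    return sum(weights.get(a, 0) for a in subset) - 0.5 * len(subset)


print(forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], toy_score))
# -> ['A4', 'A1', 'A6']
```

The loop evaluates at most a quadratic number of subsets instead of all 2^d, which is why such heuristics are used when the number of choices is exponential.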
Example of Decision Tree Induction
[Figure: decision tree with test nodes A1? and A6?]
Typically lossy compression, with progressive
refinement
Sometimes small fragments of signal can be
reconstructed without reconstructing the whole
Time sequence is not audio
Typically short and vary slowly with time
Data Compression
[Figure: Original Data reduced by lossy compression to Original Data Approximated]
[Figure: data plotted on axes X1 and X2, with new axes Y1 and Y2]
A popular data reduction technique
Divide data into buckets and store average (sum) for each bucket
Can be constructed optimally in one dimension using dynamic programming
Related to quantization problems.
[Figure: histogram; vertical axis from 0 to 40, horizontal axis from 10000 to 100000]
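The bucket idea can be sketched as follows: each value is assigned to a fixed-width bucket, and only the count, sum, and average per bucket are stored. This uses simple equal-width buckets, not the optimal dynamic-programming construction mentioned above; the function name `bucket_summary` and the sample `prices` are assumptions.

```python
from collections import defaultdict


def bucket_summary(values, width):
    """Map each bucket's lower bound to (count, sum, average)."""
    totals = defaultdict(lambda: [0, 0])
    for v in values:
        low = (v // width) * width  # lower bound of the bucket holding v
        totals[low][0] += 1
        totals[low][1] += v
    return {low: (c, s, s / c) for low, (c, s) in sorted(totals.items())}


# Hypothetical prices, summarized in buckets of width 10000.
prices = [12000, 18000, 25000, 27000, 29000, 61000]
print(bucket_summary(prices, 10000))
# -> {10000: (2, 30000, 15000.0), 20000: (3, 81000, 27000.0), 60000: (1, 61000, 61000.0)}
```

Six values shrink to three bucket summaries here; the reduction grows with the amount of data per bucket.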
Clustering
Approximate the percentage of each class (or
subpopulation of interest) in the overall database
Used in conjunction with skewed data
Sampling may not reduce database I/Os (page at a
time).
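Sampling that preserves the percentage of each class can be sketched as stratified sampling: group the records by class, then draw the same fraction from every group. This is a minimal sketch; the function name `stratified_sample`, the `key` callback, and the rule of keeping at least one record per class are assumptions, and `fraction` is expected to lie between 0 and 1.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Draw about `fraction` of each stratum, keeping at least one record per stratum."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    strata = defaultdict(list)
    for r in records:
        strata[key(r)].append(r)
    sample = []
    for group in strata.values():
        k = max(1, round(fraction * len(group)))
        sample.extend(rng.sample(group, k))  # sample within the stratum, no repeats
    return sample


# Hypothetical skewed data: 90 "young" records and 10 "senior" records.
records = [("young", i) for i in range(90)] + [("senior", i) for i in range(10)]
sample = stratified_sample(records, key=lambda r: r[0], fraction=0.1)
print(len(sample))  # -> 10
```

With skewed data, a plain random sample of ten records could easily miss the small class; sampling per stratum guarantees it is represented.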
Sampling
[Figure: Raw Data sampled by SRSWOR (simple random sample without replacement) and by SRSWR (simple random sample with replacement)]
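The two simple random sampling schemes can be sketched with Python's standard `random` module. The `raw_data` list of record ids and the sample size of 10 are hypothetical.

```python
import random

rng = random.Random(42)          # fixed seed so the sketch is repeatable
raw_data = list(range(1, 101))   # hypothetical raw data: 100 record ids

# SRSWOR: without replacement, so no record can appear twice.
srswor = rng.sample(raw_data, 10)

# SRSWR: with replacement, so the same record may be drawn more than once.
srswr = rng.choices(raw_data, k=10)

print(srswor)
print(srswr)
```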
Sampling
Discretization:
divide the range of a continuous attribute into
intervals
Some classification algorithms only accept
categorical attributes.
Reduce data size by discretization
Discretization
reduce the number of values for a given
continuous attribute by dividing the range of
the attribute into intervals. Interval labels can
then be used to replace actual data values.
Concept hierarchies
reduce the data by collecting and replacing
low level concepts (such as numeric values for
the attribute age) by higher level concepts
(such as young, middle-aged, or senior).
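Both ideas can be sketched for the attribute age: discretization replaces a value with an interval label, and a concept hierarchy replaces it with a higher-level concept. The interval width and the cut-offs `young_max` and `middle_max` are assumptions chosen for illustration; the slides do not specify them.

```python
def age_interval(age, width=10):
    """Discretization: replace an age with the label of its interval."""
    low = (age // width) * width
    return f"[{low}, {low + width})"


def age_concept(age, young_max=34, middle_max=59):
    """Concept hierarchy: replace an age with a higher-level concept.

    The cut-offs are hypothetical defaults, not values from the source.
    """
    if age <= young_max:
        return "young"
    if age <= middle_max:
        return "middle-aged"
    return "senior"


for age in (27, 45, 71):
    print(age, age_interval(age), age_concept(age))
# -> 27 [20, 30) young
# -> 45 [40, 50) middle-aged
# -> 71 [70, 80) senior
```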
Discretization and concept
hierarchy generation for numeric
data
Entropy-based discretization
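As commonly described, entropy-based discretization picks the boundary that minimizes the class entropy of the two resulting partitions, weighted by partition size. The sketch below finds one such binary split; the function names `entropy` and `best_split` and the sample data are assumptions, and a full method would apply the split recursively until a stopping criterion is met.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_entropy) of the best binary split, or None."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary can separate two equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        cost = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint between neighbors
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best


# Hypothetical data: low values belong to class "a", high values to class "b".
split = best_split([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"])
print(split[0])  # -> 6.5
```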
(-$1,000 - $2,000)
Step 3:
(-$4,000 - $5,000)
Step 4: