Data Mining: Concepts and Techniques - Slides for Textbook - Chapter 3
names
No quality data, no quality mining results!
Quality decisions must be based on quality data
quality data
December 8, 2021 Data Mining: Concepts and Techniques 3
Multi-Dimensional Measure of Data Quality
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Broad categories:
intrinsic, contextual, representational, and
accessibility.
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains a reduced representation that is much smaller in volume yet
produces the same or similar analytical results
Data discretization
Part of data reduction but with particular importance, especially for
numerical data
technology limitation
incomplete data
inconsistent data
Regression
smooth by fitting the data to regression functions
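A minimal sketch of regression-based smoothing, assuming a single predictor and a straight-line model fitted by ordinary least squares. The helper names fit_line and smooth, and the sample values, are illustrative and do not come from the slides.

```python
def fit_line(xs, ys):
    """Return the least-squares slope w and intercept b of y = w * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


def smooth(xs, ys):
    """Replace every observed y by the value on the fitted line."""
    w, b = fit_line(xs, ys)
    return [w * x + b for x in xs]


# Illustrative noisy observations (made-up values).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 2.9, 4.2, 4.8]
smoothed = smooth(xs, ys)
```

Each noisy value is replaced by its projection onto the fitted line, so the smoothed series keeps the trend and drops the scatter around it.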
uniform grid
if A and B are the lowest and highest values of the
[Figure: regression smoothing. The observed value Y1 at X1 is replaced by Y1' on the fitted line y = x + 1.]
store
Schema integration
integrate metadata from different sources
Dimensionality reduction
Numerosity reduction
understand
Heuristic methods (due to exponential # of choices):
step-wise forward selection
decision-tree induction
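Step-wise forward selection can be sketched as a greedy loop: start with no attributes and repeatedly add the one that improves a scoring function the most, stopping when nothing helps. The score function, the attribute names, and the toy weights below are hypothetical. In practice the score would come from a statistical test or from a classifier's accuracy.

```python
def forward_selection(attributes, score):
    """Greedily grow an attribute subset while the score keeps improving."""
    selected = []
    best = score(selected)
    remaining = list(attributes)
    while remaining:
        # Score every one-attribute extension of the current subset.
        candidates = [(score(selected + [a]), a) for a in remaining]
        top_score, top_attr = max(candidates, key=lambda t: t[0])
        if top_score <= best:
            break  # no extension improves the score
        selected.append(top_attr)
        remaining.remove(top_attr)
        best = top_score
    return selected


# Hypothetical usefulness of each attribute; unlisted ones contribute 0.
USEFUL = {"A1": 3.0, "A4": 2.0, "A6": 1.0}


def toy_score(subset):
    """Toy score: total usefulness minus a penalty of 0.5 per attribute."""
    return sum(USEFUL.get(a, 0.0) for a in subset) - 0.5 * len(subset)


chosen = forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], toy_score)
print(chosen)  # ['A1', 'A4', 'A6']
```

The loop evaluates a number of subsets that is at most quadratic in the number of attributes, which is the point of using a heuristic instead of searching the exponential space of all subsets.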
Example of Decision Tree Induction
[Figure: decision tree with test nodes A1? and A6?]
Typically lossless
expansion
Audio/video compression
Typically lossy compression, with progressive
refinement
Sometimes small fragments of signal can be
[Figure: data compression; labels: lossy, Original Data, Approximated]
[Figure: plot with axes X1, X2 and Y1, Y2]
Parametric methods
Assume the data fits some model, estimate model
parameters, store only the parameters, and discard
the data (except possible outliers)
Log-linear models: obtain the value at a point in m-D
space as a product over appropriate marginal
subspaces
Non-parametric methods
Do not assume models
Major families: histograms, clustering, sampling
above.
Log-linear models:
The multi-way table of joint probabilities is
[Figure: chart with axis ticks from 10000 to 100000 in steps of 10000]
Clustering
Partition data set into clusters, and one can store cluster
representation only
Can be very effective if data is clustered but not if data
is “smeared”
Can use hierarchical clustering, with the clusters stored in
multi-dimensional index tree structures
There are many choices of clustering definitions and
clustering algorithms, further detailed in Chapter 8
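As a rough illustration of storing only the cluster representation, the sketch below runs a small one-dimensional k-means and keeps just each centroid and its member count. The choice of k-means, the initial centroids, and the sample values are assumptions made for this example. The slides do not prescribe a particular algorithm.

```python
def assign(values, centroids):
    """Return the index of the nearest centroid for every value."""
    return [
        min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
        for v in values
    ]


def kmeans_1d(values, centroids, iterations=20):
    """Plain Lloyd iterations in one dimension, from given start centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        labels = assign(values, centroids)
        for i in range(len(centroids)):
            members = [v for v, lab in zip(values, labels) if lab == i]
            if members:  # an empty cluster keeps its old centroid
                centroids[i] = sum(members) / len(members)
    return centroids


def summarize(values, centroids):
    """Reduced representation: one (centroid, count) pair per cluster."""
    labels = assign(values, centroids)
    return [(c, labels.count(i)) for i, c in enumerate(centroids)]


# Illustrative data with two well-separated groups (made-up values).
prices = [10, 12, 11, 50, 52, 51, 49]
centroids = kmeans_1d(prices, centroids=[0.0, 100.0])
summary = summarize(prices, centroids)
print(summary)  # [(11.0, 3), (50.5, 4)]
```

Seven values shrink to two pairs here because the data is well clustered. On "smeared" data the centroids would describe the values poorly.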
[Figure: sampling from raw data with SRSWOR (simple random sample without replacement) and SRSWR]
Sampling
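Simple random sampling without replacement (SRSWOR) and with replacement (SRSWR) can be sketched with Python's standard random module. The function names, the seed, and the raw data are illustrative assumptions.

```python
import random


def srswor(data, n, rng):
    """Simple random sample WITHOUT replacement: no item is drawn twice."""
    return rng.sample(data, n)


def srswr(data, n, rng):
    """Simple random sample WITH replacement: items may repeat."""
    return rng.choices(data, k=n)


rng = random.Random(0)       # fixed seed so the run is repeatable
raw = list(range(1, 101))    # stand-in for the raw data set
without = srswor(raw, 10, rng)
with_repl = srswr(raw, 10, rng)
```

Both samples hold 10 items instead of 100. Only the SRSWOR sample is guaranteed to contain distinct tuples.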
Discretization:
divide the range of a continuous attribute into
intervals
Some classification algorithms only accept categorical
attributes.
Reduce data size by discretization
Discretization
reduce the number of values for a given continuous
attribute by dividing the range of the attribute into
intervals. Interval labels can then be used to replace
actual data values.
Concept hierarchies
reduce the data by collecting and replacing low level
concepts (such as numeric values for the attribute
age) by higher level concepts (such as young,
middle-aged, or senior).
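A small sketch of replacing numeric ages by the higher-level concepts young, middle-aged, and senior. The cut points 35 and 60 are assumptions chosen for illustration. The slides do not specify the interval boundaries.

```python
from bisect import bisect_right

# Hypothetical interval boundaries: [..35) young, [35..60) middle-aged, [60..) senior.
CUTS = [35, 60]
LABELS = ["young", "middle-aged", "senior"]


def age_to_concept(age):
    """Map a numeric age to the label of the interval that contains it."""
    return LABELS[bisect_right(CUTS, age)]


ages = [23, 35, 47, 59, 60, 81]
concepts = [age_to_concept(a) for a in ages]
print(concepts)
# ['young', 'middle-aged', 'middle-aged', 'middle-aged', 'senior', 'senior']
```

Six distinct numeric values collapse to three labels, which is the data reduction that discretization and concept hierarchies aim for.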
Entropy-based discretization
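One common form of entropy-based discretization picks the boundary that minimizes the class-weighted entropy of the two resulting intervals. The sketch below finds a single such boundary. Recursive splitting and a stopping criterion are left out, and the function names and sample data are illustrative assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_entropy) of the best binary split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or weighted < best[1]:
            best = (threshold, weighted)
    return best


# Made-up attribute values with class labels.
values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
threshold, weighted = best_split(values, labels)
```

For this data the chosen threshold is 6.5, the midpoint between 3 and 10, because both resulting intervals are pure.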
(-$1,000 - $2,000)
Step 3:
(-$4,000 - $5,000)
Step 4: