Data Mining: Concepts and Techniques
noisy: containing errors or outliers
e.g., Salary=“-10”
inconsistent: containing discrepancies in codes or
names, e.g., Age=“42” vs. Birthday=“03/07/1997”
e.g., was rating “1, 2, 3”, now rating “A, B, C”
e.g., discrepancy between duplicate records
No quality data, no quality mining results!
Quality decisions must be based on quality data
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Broad categories:
intrinsic, contextual, representational, and
accessibility.
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains a reduced representation that is much smaller in volume yet
produces the same or similar analytical results
Data discretization
Part of data reduction but with particular importance, especially for
numerical data
Noise may stem from technology limitations; other data problems include
incomplete data and inconsistent data
How to Handle Noisy Data?
Binning method:
first sort the data and partition it into (equi-depth) bins
then smooth by bin means, bin medians, or bin boundaries
Combined computer and human inspection: suspicious values are detected
automatically, then checked by humans
Regression
smooth by fitting the data to regression functions
Simple Discretization Methods: Binning
Equi-width binning divides the attribute range into intervals of equal
size; equi-depth binning divides it so that each interval contains
approximately the same number of values.
Example: customer ages
Equi-width binning: 0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80
Equi-depth binning: 0-22 22-31 32-38 38-44 44-48 48-55 55-62 62-80
Smoothing using Binning Methods
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries: [4,15],[21,25],[26,34]
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
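The example above can be reproduced with a short script. Below is a minimal Python sketch (not from the book; the helper names are mine) of equi-depth binning with smoothing by bin means and by bin boundaries:

```python
# Sketch of the slide's binning example; helper names are illustrative.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

def equi_depth_bins(values, n_bins):
    """Split sorted values into n_bins bins of equal size
    (assumes len(values) is divisible by n_bins)."""
    values = sorted(values)
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value with its bin's (rounded) mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the nearer bin boundary (min or max)."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

bins = equi_depth_bins(prices, 3)
print(bins)                       # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))      # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins)) # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```

The same helpers apply unchanged to the second price list below.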
Sorted data for price (in dollars):
13, 15, 16, 16, 19, 20, 22, 25, 25, 25, 33, 33, 35, 35, 52, 70
Cluster Analysis
[Figure: scatter plot of salary vs. age; values falling outside the clusters are treated as outliers]
Regression
[Figure: regression line y = x + 1 fitted on axes x (age) and y (salary)]
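As an illustration of smoothing by regression, here is a minimal NumPy sketch; the data points are invented for the example, since the slide only shows the fitted line y = x + 1:

```python
import numpy as np

# Hypothetical (age, salary) data; only the fitted line appears on the slide.
age = np.array([20, 25, 30, 35, 40, 45, 50])
salary = np.array([22, 25, 32, 35, 42, 45, 52])

# Fit a straight line y = b1*x + b0 and replace each value with the fit.
b1, b0 = np.polyfit(age, salary, deg=1)
smoothed = b1 * age + b0
print(b1, b0)    # slope and intercept of the fitted line
print(smoothed)  # noise-free values lying exactly on the line
```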
Smoothing Noisy Data
The purpose of data smoothing is to eliminate noise
Methods: binning, clustering, regression
Binning: Bin 1: 4, 8, 15; Bin 2: 21, 21, 24; Bin 3: 25, 28, 34
Smoothing by bin means: Bin 1: 9, 9, 9; Bin 2: 22, 22, 22; Bin 3: 29, 29, 29
Smoothing by bin boundaries: Bin 1: 4, 4, 15; Bin 2: 21, 21, 24; Bin 3: 25, 25, 34
Chapter 3: Data Preprocessing
Data integration:
Data analysis may require combination of data from multiple
sources into a coherent data store
Schema integration
integrate metadata from different sources
metadata: data about the data (i.e., data descriptors)
Entity identification problem: identify real-world entities from
multiple data sources, e.g., A.cust-id ≡ B.cust-#
Detecting and resolving data value conflicts
for the same real-world entity, attribute values from different
sources may differ (e.g., “J. D. Smith” and “John Smith” may refer to
the same person)
possible reasons: different representations, different scales, e.g.,
(inches vs. cm)
Handling Redundant Data in Data Integration
Redundant data often arise when multiple databases are integrated
The same attribute may have different names in
different databases
One attribute may be a “derived” attribute in another
table, e.g., age derived from DOB
Redundant data may be detected by correlation analysis
Careful integration of the data from multiple sources may
help to reduce redundancies and inconsistencies and
improve mining speed and quality
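A minimal sketch of redundancy detection via correlation analysis, assuming NumPy; the 0.9 threshold and the sample arrays are illustrative choices, not values from the book:

```python
import numpy as np

# Flag a pair of attributes as potentially redundant when their Pearson
# correlation coefficient is close to +/-1. The 0.9 cutoff is illustrative.
def possibly_redundant(a, b, threshold=0.9):
    r = np.corrcoef(a, b)[0, 1]
    return abs(r) >= threshold, r

age = np.array([25, 32, 41, 47, 55, 63])
dob_years_ago = np.array([25, 32, 41, 48, 55, 63])  # "derived" near-copy of age
flag, r = possibly_redundant(age, dob_years_ago)
print(flag, round(r, 3))  # True, correlation ~1.0 -> candidates for removal
```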
Data Transformation
min-max normalization:
v' = (v − minA) / (maxA − minA) × (new_maxA − new_minA) + new_minA
e.g., convert age = 30 to the range 0-1, when min = 10 and max = 80:
new_age = (30 − 10) / (80 − 10) = 2/7 ≈ 0.29
z-score normalization:
v' = (v − meanA) / stand_devA
normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
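The three normalizations above can be written directly as Python functions; this is a minimal sketch, with the sample inputs chosen only to exercise the formulas:

```python
# Sketches of the three normalization formulas above.
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

def decimal_scaling(values):
    # j = number of digits of the largest absolute value, i.e. the smallest
    # integer such that max(|v'|) < 1 (assumes |v| >= 1 for the inputs).
    j = len(str(int(max(abs(v) for v in values))))
    return [v / 10 ** j for v in values]

print(min_max(30, 10, 80))               # 0.2857..., the slide's age example
print(z_score(73600, 54000, 16000))      # 1.225 (illustrative mean/std)
print(decimal_scaling([-991, 45, 917]))  # [-0.991, 0.045, 0.917]
```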
Chapter 3: Data Preprocessing
Data reduction strategies:
Dimensionality reduction
Numerosity reduction
Use heuristics: select the locally ‘best’ (most pertinent) attribute, e.g.,
using information gain. E.g., initial set = {A1, A2, A3, A4, A5, A6}:
step-wise forward selection: {} → {A1} → {A1, A4} → {A1, A4, A6}
step-wise backward elimination: {A1, A2, A3, A4, A5, A6} → {A1, A3, A4, A5, A6}
→ {A1, A4, A5, A6} → reduced set {A1, A4, A6}
combining forward selection and backward elimination
decision-tree induction
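A greedy step-wise forward selection loop might look like the sketch below; `score` is an assumed black-box rating of an attribute subset (e.g., the information gain of a model trained on it), not an API from the book:

```python
# Minimal sketch of step-wise forward selection (illustrative only).
def forward_selection(all_attrs, score, k):
    selected = []
    while len(selected) < k:
        # Greedily add the attribute that most improves the subset score.
        best = max((a for a in all_attrs if a not in selected),
                   key=lambda a: score(selected + [a]))
        selected.append(best)
    return selected

# With attributes A1..A6 and a suitable score, the selection could grow as
# {} -> {A1} -> {A1, A4} -> {A1, A4, A6}, matching the slide's example.
```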
Example of Decision Tree Induction
[Figure: decision tree with A4 at the root and A1, A6 as internal nodes; attributes not used in the tree are dropped, giving the reduced set {A1, A4, A6}]
String compression: typically lossless, but only limited manipulation is
possible without expansion
Audio/video compression: typically lossy, with progressive refinement
[Figure: original data compressed losslessly versus a lossy approximation of the original data]
Techniques:
DWT: Discrete Wavelet Transform
DFT: Discrete Fourier Transform
PCA: Principal Component Analysis
[Figure: Haar-2 and Daubechies-4 wavelet basis functions]
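Of the listed techniques, PCA is the easiest to sketch in a few lines; the following is an illustrative NumPy implementation via eigen-decomposition of the covariance matrix, not the book's code:

```python
import numpy as np

# Minimal PCA sketch: keep the k directions of largest variance.
def pca_reduce(X, k):
    Xc = X - X.mean(axis=0)            # center each attribute
    cov = np.cov(Xc, rowvar=False)     # covariance matrix of the attributes
    vals, vecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    top = vecs[:, np.argsort(vals)[::-1][:k]]  # top-k eigenvectors
    return Xc @ top                    # project onto the top-k components

X = np.random.rand(100, 5)             # toy data: 100 tuples, 5 attributes
print(pca_reduce(X, 2).shape)          # (100, 2): reduced representation
```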
Regression: e.g., multiple regression Y = b0 + b1·X1 + b2·X2, with
b0, b1, b2 as constants estimated from the data
Log-linear models:
a higher-dimensional data space is constructed from lower-dimensional spaces
the lower-dimensional representations together occupy less space than the
original data
Histograms: e.g., the data
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30
[Figure: equi-width histogram with buckets 1-10 (13 values), 11-20 (25 values), 21-30 (14 values)]
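The bucket counts in the figure can be recomputed from the data; a small Python check:

```python
# Recompute the bucket counts for the data above (equi-width, width 10).
data = [1,1,5,5,5,5,5,8,8,10,10,10,10,12,14,14,14,15,15,15,
        15,15,15,18,18,18,18,18,18,18,18,20,20,20,20,20,
        20,20,21,21,21,21,25,25,25,25,25,28,28,30,30,30]
buckets = {"1-10": 0, "11-20": 0, "21-30": 0}
for v in data:
    if v <= 10:
        buckets["1-10"] += 1
    elif v <= 20:
        buckets["11-20"] += 1
    else:
        buckets["21-30"] += 1
print(buckets)  # {'1-10': 13, '11-20': 25, '21-30': 14}
```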
Sampling
[Figure: SRSWOR (simple random sampling without replacement) and SRSWR (simple random sampling with replacement) drawn from the raw data]
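Both sampling schemes are one-liners with Python's standard library; a minimal sketch (the data and sample size are placeholders):

```python
import random

random.seed(0)            # for reproducibility
data = list(range(100))   # stand-in for the raw data

# SRSWOR: simple random sampling WithOut Replacement
srswor = random.sample(data, 10)

# SRSWR: simple random sampling With Replacement (duplicates possible)
srswr = [random.choice(data) for _ in range(10)]

print(srswor)
print(srswr)
```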
Discretization:
divide the range of a continuous attribute into
intervals
Some classification algorithms only accept categorical
attributes.
Reduce data size by discretization
Discretization
reduce the number of values for a given continuous
attribute by dividing the range of the attribute into
intervals. Interval labels can then be used to replace
actual data values.
Concept hierarchies
reduce the data by collecting and replacing low level
concepts (such as numeric values for the attribute
age) by higher level concepts (such as young,
middle-aged, or senior).
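A concept hierarchy for age can be applied as a simple mapping; in this sketch the cut-points 40 and 60 are illustrative assumptions, not values from the book:

```python
# Map numeric ages to the higher-level concepts named above.
# Cut-points 40 and 60 are assumed for illustration.
def age_concept(age):
    if age < 40:
        return "young"
    elif age < 60:
        return "middle-aged"
    return "senior"

print([age_concept(a) for a in [25, 42, 30, 67]])
# ['young', 'middle-aged', 'young', 'senior']
```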
Entropy-based discretization
Given a set of samples S split into intervals S1 and S2 by a boundary T,
the entropy after partitioning is
E(S,T) = (|S1|/|S|)·Ent(S1) + (|S2|/|S|)·Ent(S2);
the boundary T that minimizes this entropy is selected, and the procedure
is applied recursively to each resulting interval.
[Example residue: a stepwise partitioning of the ranges (-$1,000 to $2,000) and (-$4,000 to $5,000); Steps 3 and 4 of the original worked example are not recoverable]
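A simplified single split of entropy-based discretization can be sketched as follows (illustrative code, not the book's algorithm; it tries every midpoint boundary and keeps the one with minimum weighted entropy):

```python
import math

# Class entropy of a list of labels.
def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(l) for l in set(labels)))

# One entropy-based split: minimize E(S,T) over candidate boundaries T.
def best_split(values, labels):
    pairs = sorted(zip(values, labels))
    best_t, best_e = None, float("inf")
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2   # midpoint boundary
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        if not left or not right:
            continue
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if e < best_e:
            best_t, best_e = t, e
    return best_t

print(best_split([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"]))  # 6.5
```

In a full discretizer this split would be applied recursively to each interval until a stopping criterion (e.g., a minimum entropy gain) is met.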