03 Preprocessing
— Chapter 3 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
1
Chapter 3: Data Preprocessing
3
Major Tasks in Data Preprocessing
Data cleaning
Needed for trustworthiness: dirty data confuses the mining process and leads to unreliable output
Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Avoid redundancies
Data reduction
Dimensionality reduction – PCA
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization – needed by neural networks, nearest-neighbor classifiers, clustering, …
Concept hierarchy generation – e.g., mapping age to youth, adult, or senior
4
Data Cleaning
Data in the Real World Is Dirty: lots of potentially incorrect data,
e.g., due to faulty instruments, human or computer error, or transmission errors
incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
e.g., Occupation=“ ” (missing data)
noisy: containing noise, errors, or outliers
e.g., Salary=“−10” (an error)
inconsistent: containing discrepancies in codes or names, e.g.,
Age=“42”, Birthday=“03/07/2010”
Was rating “1, 2, 3”, now rating “A, B, C”
discrepancy between duplicate records
Intentional (e.g., disguised missing data)
Jan. 1 as everyone’s birthday?
5
Incomplete (Missing) Data
Missing data may be due to technology limitations, incomplete data collection, or inconsistency with other recorded data (and thus deleted)
8
How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency) bins, then smooth by bin means, bin median, or bin boundaries (see the sketch after this list)
Clustering
detect and remove outliers
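A minimal Python sketch of the binning idea above, smoothing by (equal-frequency) bin means; the price list and function name are illustrative only:

```python
import numpy as np

# Minimal sketch of smoothing by equal-frequency binning: sort the values,
# split them into bins of equal size, and replace each value by its bin mean.
def smooth_by_bin_means(values, n_bins):
    ordered = np.sort(np.asarray(values, dtype=float))
    return np.concatenate([
        np.full(len(b), b.mean()) for b in np.array_split(ordered, n_bins)
    ])

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # illustrative sorted price data
print(smooth_by_bin_means(prices, 3))
# [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
```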
9
Binning
10
Data Cleaning as a Process
Data discrepancy detection
Discrepancies may be caused by poorly designed data entry forms, human error in data entry, deliberate errors, or data decay (e.g., outdated addresses)
11
Data Cleaning as a Process
Data migration and integration
Data migration tools: allow simple transformations to be specified
ETL (extraction/transformation/loading) tools: allow users to specify transformations through a graphical user interface
12
Data Integration
Data integration:
Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
Detecting and resolving data value conflicts
For the same real world entity, attribute values from different
sources are different
Possible reasons: different representations, different scales, e.g.,
metric vs. British units
13
Handling Redundancy in Data Integration
15
Correlation Analysis (Numeric Data)
Correlation coefficient (Pearson's product-moment coefficient):
r(A,B) = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / ((n − 1) σ_A σ_B) = (Σᵢ aᵢ bᵢ − n Ā B̄) / ((n − 1) σ_A σ_B)
where n is the number of tuples, Ā and B̄ are the respective means, and σ_A and σ_B are the respective standard deviations of A and B
Scatter plots show the similarity, ranging from –1 to 1
18
Correlation (viewed as linear relationship)
Correlation measures the linear relationship
between objects
To compute correlation, we standardize data
objects, A and B, and then take their dot product
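A small numpy sketch of this idea, reusing the stock values from the covariance example on the next slide; dividing the dot product of the standardized objects by n gives the Pearson correlation coefficient (the use of the population standard deviation here is an assumption of this sketch):

```python
import numpy as np

# Standardize A and B, then a dot product (averaged over n) gives Pearson's r.
A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

A_std = (A - A.mean()) / A.std()       # standardize (population std)
B_std = (B - B.mean()) / B.std()
r = np.dot(A_std, B_std) / len(A)      # dot product averaged over n

print(r, np.corrcoef(A, B)[0, 1])      # both ≈ 0.94
```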
19
Covariance (Numeric Data)
Covariance is similar to correlation:
Cov(A, B) = E[(A − Ā)(B − B̄)] = E(A·B) − Ā·B̄
Correlation coefficient: r(A,B) = Cov(A, B) / (σ_A σ_B)
Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
Question: If the stocks are affected by the same industry trends, will
their prices rise or fall together?
E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
Thus, A and B rise together since Cov(A, B) > 0.
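A quick numpy check of this worked example (illustrative only):

```python
import numpy as np

# Verify the example above: Cov(A, B) = E(A·B) − E(A)·E(B).
A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

cov = np.mean(A * B) - A.mean() * B.mean()
print(cov)   # 4.0 > 0, so the two stocks tend to rise together
```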
Chapter 3: Data Preprocessing
Wavelet transforms
24
Mapping Data to a New Space
Fourier transform
Wavelet transform
25
What Is Wavelet Transform?
Decomposes a signal into different frequency subbands
Applicable to n-dimensional signals
Data are transformed to preserve relative distance between objects at different levels of resolution
Allows natural clusters to become more distinguishable
Used for image compression
26
Wavelet Transformation
(Figure: Haar-2 and Daubechies-4 wavelet basis functions)
Discrete wavelet transform (DWT) for linear signal
processing, multi-resolution analysis
Compressed approximation: store only a small fraction of
the strongest of the wavelet coefficients
Similar to discrete Fourier transform (DFT), but better
lossy compression, localized in space
Method:
Length, L, must be an integer power of 2 (pad with 0s when necessary)
Each transform has two functions: smoothing and difference
They are applied to pairs of data, producing two sets of data of length L/2
The two functions are applied recursively until the desired length is reached
27
Wavelet Decomposition
Wavelets: A math tool for space-efficient hierarchical
decomposition of functions
S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to Ŝ = [2.75, −1.25, 0.5, 0, 0, −1, −1, 0]
Compression: many small detail coefficients can be
replaced by 0’s, and only the significant coefficients are
retained
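A minimal Python sketch of the pairwise average/difference recursion described above; it reproduces the transform of S (the coefficient ordering, overall average first and then coarse-to-fine details, follows the example):

```python
# One-dimensional Haar decomposition: at each level, pairwise averages form
# the smoothed signal and half the pairwise differences form the details.
def haar_decompose(signal):
    data = list(signal)              # length must be a power of 2
    coeffs = []
    while len(data) > 1:
        averages = [(data[i] + data[i + 1]) / 2 for i in range(0, len(data), 2)]
        details  = [(data[i] - data[i + 1]) / 2 for i in range(0, len(data), 2)]
        coeffs = details + coeffs    # finer-level details go to the right
        data = averages
    return data + coeffs             # [overall average, coarse .. fine details]

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```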
28
Haar Wavelet Coefficients
(Figure: hierarchical decomposition structure, a.k.a. "error tree", for the original frequency distribution [2, 2, 0, 2, 3, 5, 4, 4]: the root holds the overall average 2.75, the internal nodes hold the detail coefficients −1.25, 0.5, 0, 0, −1, −1, 0, and the +/− signs mark which leaves each coefficient supports.)
29
Why Wavelet Transform?
Use hat-shape filters
Emphasize region where points cluster
Multi-resolution
Detect arbitrary shaped clusters at different scales
Efficient
Complexity O(N)
30
Principal Component Analysis (PCA)
Find a projection that captures the largest amount of variation in data
The original data are projected onto a much smaller space, resulting
in dimensionality reduction. We find the eigenvectors of the
covariance matrix, and these eigenvectors define the new space
(Figure: 2-D data points with the principal component directions shown in the x1–x2 plane.)
31
Principal Component Analysis (Steps)
Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
Normalize input data: Each attribute falls within the same range
Compute k orthonormal (unit) vectors, i.e., principal components
Each input data (vector) is a linear combination of the k principal
component vectors
The principal components are sorted in order of decreasing
“significance” or strength
Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with low
variance (i.e., using the strongest principal components, it is
possible to reconstruct a good approximation of the original data)
Works for numeric data only
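A minimal numpy sketch of these steps; the function name and the particular calls (np.cov, np.linalg.eigh) are implementation choices of this sketch, not prescribed by the slide:

```python
import numpy as np

# PCA via the covariance matrix: center the data, take the eigenvectors of the
# covariance matrix, keep the k strongest components, and project onto them.
def pca(X, k):
    X_centered = X - X.mean(axis=0)            # normalize input data
    cov = np.cov(X_centered, rowvar=False)     # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigen-decomposition (symmetric)
    order = np.argsort(eigvals)[::-1]          # sort by decreasing "significance"
    components = eigvecs[:, order[:k]]         # strongest k principal components
    return X_centered @ components             # project onto the new space

X = np.random.rand(100, 5)
print(pca(X, 2).shape)   # (100, 2)
```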
32
Principal Component Analysis
Can handle ordered and unordered attributes
33
Attribute Subset Selection
Another way to reduce dimensionality of data
Redundant attributes
E.g., purchase price of a product and the amount of sales tax paid
Irrelevant attributes
E.g., students' ID is often irrelevant to the task of predicting
students' GPA
34
Heuristic Search in Attribute Selection
There are 2^d possible attribute combinations of d attributes
Typical heuristic attribute selection methods:
Best single attribute under the attribute independence assumption: choose by significance tests
35
Attribute Creation (Feature Generation)
Create new attributes (features) that can capture the
important information in a data set more effectively than
the original ones
Three general methodologies
Attribute extraction: domain-specific
Mapping data to a new space: e.g., Fourier and wavelet transformation (see Mapping Data to a New Space)
Attribute construction: combining features (see discriminative frequent patterns in Chapter 7), data discretization
36
Data Reduction 2: Numerosity Reduction
Reduce data volume by choosing alternative, smaller
forms of data representation
Parametric methods (e.g., regression)
Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
37
Parametric Data Reduction: Regression
and Log-Linear Models
Linear regression
Data modeled to fit a straight line
Multiple regression
Allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
Log-linear model
Approximates discrete multidimensional probability distributions
38
Regression Analysis
Regression analysis: a collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (also called response variable or measurement) and of one or more independent variables (a.k.a. explanatory variables or predictors)
The parameters are estimated so as to give a "best fit" of the data
Most commonly the best fit is evaluated by using the least squares method, but other criteria have also been used
Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships
(Figure: data points in the x–y plane with the fitted line y = x + 1; for a point x = X1, Y1 is the observed value and Y1' is the value predicted by the line.)
39
Regression Analysis and Log-Linear Models
Linear regression: Y = w X + b
Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
Using the least squares criterion on the known values of Y1, Y2, …, X1,
X2, ….
Multiple regression: Y = b0 + b1 X1 + b2 X2
Many nonlinear functions can be transformed into the above
Log-linear models:
Approximate discrete multidimensional probability distributions
Estimate the probability of each point (tuple) in a multi-dimensional
space for a set of discretized attributes, based on a smaller subset of
dimensional combinations
Useful for dimensionality reduction and data smoothing
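A small numpy sketch of fitting Y = w X + b by least squares; the data points are made up around the line y = x + 1 from the previous slide:

```python
import numpy as np

# Least-squares fit of the linear model Y = w X + b.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x + 1 + np.random.normal(0, 0.1, size=x.size)   # noisy points near y = x + 1

# Solve for [w, b] minimizing the squared error ||A [w, b] - y||^2.
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"estimated w = {w:.2f}, b = {b:.2f}")   # close to w = 1, b = 1
```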
40
Histogram Analysis
Divide data into buckets and store
average (sum) for each bucket
Partitioning rules:
Equal-width: equal bucket
range
Equal-frequency (or equal-
depth)
Good for sparse, dense, highly skewed, or uniform data
Multidimensional histograms
41
Clustering
Partition data set into clusters based on similarity, and
store cluster representation (e.g., centroid and diameter)
only
Can be very effective if data is clustered, but not if data is "smeared"
Cluster quality can be measured by, e.g., the diameter of the cluster (maximum distance between any two cluster members) and the centroid distance (average distance of each member from the cluster centroid)
42
Sampling
Simple random sampling: equal probability of selecting any particular item
Sampling without replacement
Once an object is selected, it is removed from the
population
Sampling with replacement
A selected object is not removed from the population
Stratified sampling:
Partition the data set, and draw samples proportionally from each partition
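A minimal numpy sketch of the three schemes above; the population, strata labels, sampling fraction, and seed are made-up assumptions:

```python
import numpy as np

rng = np.random.default_rng(seed=42)          # arbitrary seed
population = np.arange(1, 101)                # hypothetical data set of 100 items

# SRSWOR: simple random sample without replacement (an item is drawn at most once)
without_replacement = rng.choice(population, size=10, replace=False)

# SRSWR: simple random sample with replacement (an item may be drawn repeatedly)
with_replacement = rng.choice(population, size=10, replace=True)

# Stratified sampling: partition into (assumed) strata, draw ~10% from each
strata = {"youth": population[:60], "adult": population[60:90], "senior": population[90:]}
stratified = {name: rng.choice(items, size=max(1, len(items) // 10), replace=False)
              for name, items in strata.items()}

print(without_replacement, with_replacement, stratified, sep="\n")
```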
44
Sampling: With or without Replacement
(Figure: a raw data set reduced by SRSWOR (simple random sampling without replacement) and by SRSWR (simple random sampling with replacement).)
45
Sampling: Cluster or Stratified Sampling
46
Data Cube Aggregation
47
Chapter 3: Data Preprocessing
Normalization
Min-max normalization to [new_min, new_max]:
v' = (v − min_A) / (max_A − min_A) × (new_max − new_min) + new_min
Z-score normalization (μ: mean, σ: standard deviation):
v' = (v − μ_A) / σ_A
Ex. Let μ = 54,000 and σ = 16,000; then v = 73,600 is normalized to (73,600 − 54,000) / 16,000 = 1.225
Normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
50
Normalization
Min-max normalization – preserves the relationships among the original data values
Z-score normalization – using the mean absolute deviation of A instead of the standard deviation is more robust to outliers
Decimal scaling
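A short numpy sketch of the three normalization methods above; the salary values and the [0, 1] target range for min-max scaling are assumptions:

```python
import numpy as np

# Toy salary attribute (made up, includes the 73,600 value from the example).
v = np.array([12_000, 54_000, 73_600, 98_000], dtype=float)

# Min-max normalization to [new_min, new_max] = [0, 1]
minmax = (v - v.min()) / (v.max() - v.min())

# Z-score normalization: (v - mean) / std
zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j, smallest j such that max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
decimal = v / 10**j

print(minmax, zscore, decimal, sep="\n")
```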
52
Data Discretization Methods
Typical methods: All the methods can be applied recursively
Binning
Top-down split, unsupervised
Histogram analysis
Top-down split, unsupervised
Clustering analysis (unsupervised, top-down split or
bottom-up merge)
Decision-tree analysis (supervised, top-down split)
Correlation (e.g., χ²) analysis (unsupervised, bottom-up merge)
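A minimal sketch of the simplest case above, unsupervised top-down discretization by equal-width binning; the age values and the choice of k = 3 bins are illustrative:

```python
import numpy as np

# Equal-width discretization: split the value range into k intervals of equal
# width and map each value to its interval label.
ages = np.array([13, 15, 16, 19, 20, 21, 22, 25, 30, 33, 35, 36, 40, 45, 46, 52, 70])
k = 3
edges = np.linspace(ages.min(), ages.max(), k + 1)   # equal-width interval edges
labels = np.digitize(ages, edges[1:-1])              # 0, 1, 2 = low, mid, high interval
print(edges)    # [13. 32. 51. 70.]
print(labels)
```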
53
Simple Discretization: Binning
57
Concept Hierarchy Generation
Concept hierarchy organizes concepts (i.e., attribute values)
hierarchically and is usually associated with each dimension in a data
warehouse
Concept hierarchies facilitate drilling and rolling in data warehouses to
view data in multiple granularity
Concept hierarchy formation: Recursively reduce the data by collecting
and replacing low level concepts (such as numeric values for age) by
higher level concepts (such as youth, adult, or senior)
Concept hierarchies can be explicitly specified by domain experts
and/or data warehouse designers
Concept hierarchy can be automatically formed for both numeric and
nominal data. For numeric data, use discretization methods shown.
58
Concept Hierarchy Generation
for Nominal Data
Specification of a partial/total ordering of attributes
explicitly at the schema level by users or experts
street < city < state < country
Specification of a hierarchy for a set of values by explicit
data grouping
{UP, MP, HP} ⊂ North India
Specification of only a partial set of attributes
E.g., only street < city, not others
Automatic generation of hierarchies (or attribute levels) by
the analysis of the number of distinct values
E.g., for a set of attributes: {street, city, state, country}
59
Automatic Concept Hierarchy Generation
Some hierarchies can be automatically generated based on the analysis of the
number of distinct values per attribute in the data set
The attribute with the most distinct values is placed at the lowest level of the
hierarchy
Exceptions exist, e.g., weekday, month, quarter, year, so the heuristic is not fool-proof
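A minimal sketch of this distinct-value heuristic; the column data are made up:

```python
# Attributes with more distinct values are placed at lower hierarchy levels.
columns = {
    "country": ["India", "India", "India", "USA", "USA"],
    "state":   ["UP", "MP", "UP", "IL", "CA"],
    "city":    ["Lucknow", "Bhopal", "Kanpur", "Chicago", "Fremont"],
    "street":  ["A St", "B St", "C St", "D St", "E St"],
}
# Sort by increasing number of distinct values: highest level first.
hierarchy = sorted(columns, key=lambda col: len(set(columns[col])))
print(" < ".join(reversed(hierarchy)))   # street < city < state < country
```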
Summary
Data integration: remove redundancies, detect inconsistencies
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
62
Image Credits
https://towardsdatascience.com/k-means-a-complete-introduction-1702af9cd8c
https://dinhanhthi.com/metrics-for-clustering/
https://stats.stackexchange.com/questions/74843/binning-by-equal-width
64