BIS 541 Ch03 20-21 S
Data Mining
2020/2021 Spring
Chapter 3
Data Preprocessing
1
Chapter 3: Data Preprocessing
2
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
6
Forms of data preprocessing
Chapter 3: Data Preprocessing
8
Data Cleaning
Data in the Real World Is Dirty: Lots of potentially incorrect data,
e.g., faulty instruments, human or computer error, transmission errors
incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
e.g., Occupation=“ ” (missing data)
noisy: containing noise, errors, or outliers
e.g., Salary=“−10” (an error)
inconsistent: containing discrepancies in codes or names, e.g.,
Age=“42”, Birthday=“03/07/2010”
Was rating “1, 2, 3”, now rating “A, B, C”
discrepancy between duplicate records
Intentional (e.g., disguised missing data)
Jan. 1 as everyone’s birthday?
9
Incomplete (Missing) Data
Possible reasons include:
technology limitation
incomplete data
inconsistent data
12
How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency or
equal-width) bins
then one can smooth by bin means, bin medians, or bin boundaries
Clustering
detect and remove outliers
13
Binning
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
15
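A small Python sketch of the partitioning and smoothing steps shown above; the helper functions are illustrative, not from any particular library:

```python
# Sketch: equal-frequency binning with smoothing by bin means / boundaries.
# Assumes the data is already sorted; function names are illustrative only.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

def equal_frequency_bins(values, n_bins):
    """Split sorted values into n_bins bins of (nearly) equal size."""
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closest bin boundary (min or max of the bin)."""
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

bins = equal_frequency_bins(prices, 3)
print(bins)                       # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))      # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins)) # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```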
Inconsistent Data
Inconsistent data may be due to
faulty data collection instruments
18
Data Integration
Data integration:
Combines data from multiple sources into a coherent store
Entity identification problem:
Identify real world entities from multiple data sources
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Object matching: e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts:
For the same real world entity, attribute values from different sources are
different
Possible reasons: different representations, different scales, e.g., metric vs.
British units
19
Handling Redundancy in Data Integration
21
Data Reduction Strategies
Data reduction: Obtain a reduced representation of the data set that
is much smaller in volume but yet produces the same (or almost the
same) analytical results
Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time to
run on the complete data set.
Data reduction strategies
Dimensionality reduction, e.g., remove unimportant attributes
Numerosity reduction, e.g., histograms, sampling
Data compression
22
Data Reduction Situations
Eliminate redundant attributes
Correlation coefficients show that
Watch and Magazine promotions are associated
Eliminate one (see the sketch below)
Combine variables to reduce the number of independent
variables
Principal component analysis
24
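A rough sketch of how such a redundancy check can be done: encode the yes/no promotion attributes as 1/0 and compute a Pearson correlation coefficient. The data values below are made up for illustration:

```python
# Sketch: flag a redundant attribute via the Pearson correlation coefficient.
# The 1/0 encoding of yes/no and the sample columns are illustrative assumptions.
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical yes/no promotion columns encoded as 1/0
magazine = [1, 1, 0, 1, 1, 0, 1, 0, 1, 1]
watch    = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]

print(round(pearson(magazine, watch), 2))   # about 0.80 for this toy data
# If |r| is close to 1, the two attributes carry largely the same
# information and one of them can be eliminated.
```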
Curse of Dimensionality
When dimensionality increases, data becomes
increasingly sparse
Density and distance between points, which are
critical to clustering and outlier analysis, become
less meaningful
The possible combinations of subspaces will
grow exponentially
Advantages of Dimensionality Reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce
noise
Reduce time and space required in data mining
Reduce # of patterns in the evaluation phase,
easier to understand the results
Allow easier visualization
Dimensionality Reduction Techniques
Feature (attribute) subset selection:
Find a subset of the original variables (or
features, attributes)
Feature (attribute) creation (generation):
Create new attributes (features) that can
capture the important information in a data set
more effectively than the original ones
Principal component analysis (PCA):
Transform the data in the high-dimensional
space to a space of fewer dimensions
Feature (Attribute) Subset Selection
Redundant attributes
Duplicate much or all of the information contained in
one or more other attributes
E.g., purchase price of a product and the amount of
sales tax paid
Irrelevant attributes
Contain no information that is useful for the data
mining task at hand
E.g., students' ID is often irrelevant to the task of
predicting students' GPA
28
Heuristic Feature Selection Methods
There are 2^d possible feature subsets of d features
Several heuristic feature selection methods:
Best single features under the feature
independence assumption:
Choose by significance tests.
Best step-wise feature selection:
The best single feature is picked first
then the next best feature conditioned on the first, and so on
Income Range | Magazine Promotion | Watch Promotion | Gender | Age | Life Insurance Promotion | Credit Card Insurance
40-50 K | Yes | No | Male | 45 | No | No
30-40 K | Yes | Yes | Female | 40 | No | Yes
40-50 K | No | No | Male | 42 | No | No
30-40 K | Yes | Yes | Male | 43 | Yes | Yes
50-60 K | Yes | No | Female | 38 | No | Yes
20-30 K | No | No | Female | 55 | No | No
30-40 K | Yes | No | Male | 35 | Yes | Yes
20-30 K | No | Yes | Male | 27 | No | No
30-40 K | Yes | No | Male | 43 | No | No
30-40 K | Yes | Yes | Female | 41 | No | Yes
40-50 K | No | Yes | Female | 43 | No | Yes
20-30 K | No | Yes | Male | 29 | No | Yes
50-60 K | Yes | Yes | Female | 39 | No | Yes
40-50 K | No | Yes | Male | 55 | No | No
20-30 K | No | No | Female | 19 | Yes | Yes
Example of Decision Tree Induction
[Figure: induced decision tree with splits on attributes such as A1 and A6; attributes that do not appear in the tree are eliminated from the reduced attribute set]
Example: Inflation Targeting Model
Suppose there are 45 candidate macro variables: which ones are the
best predictors of the inflation rate (three months ahead)?
Aim: develop a simple model to predict inflation by using
only a couple of those 45 macro variables
Best-stepwise feature selection:
Models with one independent variable
45 models
Select the best single variable – say X3
Keep X3 in the model
Find the second best variable
try 44 models
– second best variable – say X43
Continue until a stopping criterion is satisfied
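A minimal sketch of this forward step-wise procedure, using synthetic data and an ordinary least-squares R^2 as the scoring criterion; the actual inflation data and the 45 macro variables are not reproduced here:

```python
# Sketch of best step-wise (forward) feature selection as described above.
# Synthetic data: in the inflation example the columns would be the 45
# candidate macro variables and y the inflation rate three months ahead.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))            # 8 candidate predictors (stand-in for 45)
y = 2.0 * X[:, 3] - 1.0 * X[:, 5] + rng.normal(scale=0.1, size=200)

def fit_score(cols):
    """R^2 of an ordinary least-squares fit using the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - resid.var() / y.var()

selected, remaining = [], list(range(X.shape[1]))
best_score = -np.inf
while remaining:
    # Try adding each remaining variable and keep the best one
    scores = {c: fit_score(selected + [c]) for c in remaining}
    c_best = max(scores, key=scores.get)
    if scores[c_best] - best_score < 1e-3:   # stopping criterion
        break
    best_score = scores[c_best]
    selected.append(c_best)
    remaining.remove(c_best)

print("selected variables:", selected, "R^2 =", round(best_score, 3))
```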
Example: Inflation Targeting Model (cont.)
Best feature elimination:
Develop 45 models, each omitting one of the variables; eliminate the
variable whose removal degrades the model least, then repeat
Attribute construction (domain-specific):
combining existing features into new ones
34
Example: Attribute/Feature Construction
Automatic attribute generation
using product or AND operations (see the sketch below)
[Figure: example data plotted on axes x1 and x2]
38
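A tiny sketch of the idea: on XOR-like toy data, neither x1 nor x2 alone separates the two classes, but a constructed product attribute does. The data is made up for illustration:

```python
# Sketch: constructing a new attribute as the product of two existing ones.
# Neither x1 nor x2 alone separates classes A and B in this toy data,
# but the constructed attribute x1 * x2 does (its sign matches the class).
points = [(+1, +1, "A"), (-1, -1, "A"), (+1, -1, "B"), (-1, +1, "B")]
for x1, x2, cls in points:
    print(x1, x2, "->", x1 * x2, cls)   # product is +1 for class A, -1 for class B
```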
Principal Component Analysis (cont.)
Given N data objects in n dimensions, find k < n
orthogonal vectors that can best be used to represent the data
The original data set is reduced to one consisting of N data
vectors on k principal components (reduced dimensions)
Each data vector (object) is a linear combination of the k
principal component vectors
Works for numeric data only
Used when the number of dimensions is large
39
Dimensionality Reduction from 2 to 1
40
Change of coordinates
[Figure: a point P with coordinates in the original axes X1, X2 and new coordinates (y1, y2) along rotated axes Y1, Y2, expressed in unit vectors a1, a2; removing a2 reduces the dimensionality from 2 to 1]
41
Principal Component Analysis (Steps)
Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
Normalize input data: Each attribute falls within the same range
Compute k orthonormal (unit) vectors, i.e., principal components
Each input data (vector) is a linear combination of the k principal
component vectors
The principal components are sorted in order of decreasing
“significance” or strength
Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with low
variance (i.e., using the strongest principal components, it is
possible to reconstruct a good approximation of the original data)
Works for numeric data only
42
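A compact numpy sketch of these steps; the function name pca_reduce and the sample points are ours, not from the slides:

```python
# Sketch of the PCA steps above: normalize, compute orthonormal component
# vectors, sort them by variance, and keep the k strongest.
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto the k strongest principal components."""
    # 1. Normalize input data (z-scores) so attributes share the same range
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Principal components = eigenvectors of the covariance matrix
    eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    # 3. Sort components by decreasing "significance" (variance explained)
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:k]]
    # 4. Each reduced data vector is a linear combination of the k components
    return Z @ components

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])
print(pca_reduce(X, k=1))   # 2-D points reduced to 1 dimension
```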
Positively and Negatively Correlated Data
43
A Nonlinear Relationship Among Variables
44
Not Correlated Data
If the variables are not correlated, the dimensionality
cannot be reduced from 2 to 1
45
How to perform PCA
46
Is Data Appropriate for PCA
1 – Correlation matrix
presents correlation coefficients among pairs of variables
(bivariate)
High correlations among variables are likely to form
components
higher values indicate that the data is appropriate for
PCA
47
Communality
the proportion of each variable's variance accounted for by
the analysis (the extracted components)
48
How It Works
Use mean-corrected or normalized (z-score) values:
if the original variables are in different units, some would otherwise
dominate the total variability
total variability =
Var(X1) + Var(X2) + ... + Var(Xn) = n (# of variables)
total variability is not: Var(X1 + X2 + ... + Xn)
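A quick numeric check of this claim on made-up data: after z-score normalization each variable has variance 1, so the total variability sums to n:

```python
# Check: after z-score normalization the total variability
# Var(Z1) + ... + Var(Zn) equals n, the number of variables.
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(10, 2, 500),      # variables in different units
                     rng.normal(0, 50, 500),
                     rng.uniform(0, 1, 500)])
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(Z.var(axis=0).sum())   # ~3.0, i.e., n = number of variables
```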
How It Works (cont.)
Note that the total variance is the same as n (the total # of
original variables)
First principal component:
is along the direction with the highest variability
it captures as much variability as possible
...
Number of components
components with an eigenvalue over 1.0 (> 1.0) are taken
(other criteria are also possible)
there is no correlation among the extracted factors/components
Naming of components
name factors or components by examining the variables that
load highly on each
53
Data Reduction 2: Numerosity Reduction
Reduce data volume by choosing alternative, smaller
forms of data representation
Parametric methods (e.g., regression)
Assume the data fits some model, estimate the model parameters,
and store only the parameters
Non-parametric methods: do not assume models, e.g., histograms, sampling
54
Histogram Analysis
Divide data into buckets and store the average (sum) for each bucket
Partitioning rules:
Equal-width: equal bucket range
Equal-frequency (or equal-depth)
[Figure: histogram with buckets of width 10,000 covering values from 10,000 to 100,000]
55
Histogram Analysis: Methods
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
Max Difference
Specify number of buckets: k
Sort the data in descending order
Make a bucket boundary between adjacent values x_i and x_{i+1}
if their difference is one of the k-1 largest adjacent differences,
i.e., x_{i+1} - x_i is at least the (k-1)-th largest adjacent
difference among x_1, ..., x_N
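A short sketch of this MaxDiff rule, reusing the price list from the binning example; the function name is ours:

```python
# Sketch of MaxDiff partitioning: place the k-1 bucket boundaries at the
# k-1 largest gaps between adjacent sorted values.
def maxdiff_buckets(values, k):
    xs = sorted(values)
    # gaps between adjacent values, ranked from largest to smallest
    gaps = sorted(range(len(xs) - 1), key=lambda i: xs[i + 1] - xs[i], reverse=True)
    cut_points = sorted(gaps[:k - 1])          # indices after which we cut
    buckets, start = [], 0
    for cut in cut_points:
        buckets.append(xs[start:cut + 1])
        start = cut + 1
    buckets.append(xs[start:])
    return buckets

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(maxdiff_buckets(prices, k=3))
# [[4, 8, 9], [15], [21, 21, 24, 25, 26, 28, 29, 34]]
```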
V-optimal Design
Sort the values
Assign an equal number of values to each bucket
Compute variance
repeat
move boundary values between adjacent buckets
compute the new variance
until no reduction in variance
variance = (n1*Var1 + n2*Var2 + ... + nk*Vark) / N
where N = n1 + n2 + ... + nk and
Var_i = sum_{j=1..n_i} (x_j - mean_i)^2 / n_i
Types of Sampling
Sampling without replacement
Once an object is selected, it is removed from the
population
Sampling with replacement
A selected object is not removed from the population
Stratified sampling:
Partition the data set, and draw samples from each partition
Cluster sampling:
Draw samples of whole clusters (groups of objects) rather than individual objects
61
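A standard-library sketch of these sampling schemes; the even/odd strata are a toy stand-in for real strata such as customer segments:

```python
# Sketch of the sampling schemes above using only the standard library.
import random
from collections import defaultdict

data = list(range(100))                    # stand-in for N data objects
random.seed(0)

srswor = random.sample(data, 10)           # without replacement: no repeats
srswr = random.choices(data, k=10)         # with replacement: repeats possible

# Stratified sampling: partition the data, then draw from each stratum
strata = defaultdict(list)
for x in data:
    strata["even" if x % 2 == 0 else "odd"].append(x)
stratified = [x for group in strata.values() for x in random.sample(group, 5)]

print(srswor, srswr, stratified, sep="\n")
```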
Sampling Methods
Simple random sample without replacement (SRSWOR) of size n
Simple random sample with replacement (SRSWR) of size n
[Figure: SRSWOR and SRSWR samples drawn from the raw data]
64
Sampling: Cluster or Stratified Sampling
65
Chapter 3: Data Preprocessing
66
Data Transformation
A function that maps the entire set of values of a given attribute to a
new set of replacement values s.t. each old value can be identified
with one of the new values
Methods
Smoothing: Remove noise from data
Attribute/feature construction
New attributes constructed from the given ones
Aggregation: Summarization, data cube construction
Normalization: Scaled to fall within a smaller, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Discretization: Concept hierarchy climbing
67
Normalization
min-max normalization
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
When a future value falls outside [min_A, max_A], the normalized score
falls outside the new range and the algorithm must handle this out-of-range case
z-score normalization
v' = (v - mean_A) / stand_dev_A
70
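A small sketch of the two formulas above; the income figures (min 12,000, max 98,000, mean 54,000, standard deviation 16,000) are hypothetical:

```python
# Sketch of min-max and z-score normalization (function names are ours).
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

# Hypothetical income attribute: min 12,000, max 98,000, mean 54,000, std 16,000
print(min_max(73600, 12000, 98000))   # ~0.716, mapped into [0, 1]
print(z_score(73600, 54000, 16000))   # ~1.225 standard deviations above the mean
```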
Normalization (cont.)
Decimal scaling
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
Logarithmic transformations:
Y' = log Y
Used in some distance based methods
Clustering
For ratio scaled variables
E.g., weight, TL/dollar
Distance between two persons is related to
percentage changes rather than actual differences
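A similar sketch for decimal scaling and the logarithmic transformation; the sample values are made up:

```python
# Sketch of decimal scaling and the logarithmic transformation above.
import math

def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer making max(|v'|) < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(decimal_scaling([-991, 23, 450]))   # j = 3 -> [-0.991, 0.023, 0.45]

# Log transform for a ratio-scaled variable (e.g., weight): differences in
# log(Y) correspond to percentage changes rather than absolute differences.
weights = [50, 55, 100, 110]
print([round(math.log(w), 3) for w in weights])
# log(55) - log(50) == log(110) - log(100): a 10% change gives the same distance
```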
Linear Transformations
74
Data Discretization Methods
Typical methods: All the methods can be applied recursively
Binning
Top-down split, unsupervised
Histogram analysis
Top-down split, unsupervised
Clustering analysis (unsupervised, top-down split or
bottom-up merge)
Decision-tree analysis (supervised, top-down split)
Correlation (e.g., χ²) analysis (unsupervised, bottom-up
merge)
75
Discretization by Binning
each bin is represented by its mean/median or by a symbolic value
numerical to numerical
numerical to categorical: discretization
categorical to categorical: aggregation
categorical to numerical
Numerical to Numerical
min-max normalization
between 0 and 1 or -1 and 1
z-score normalization
Discretization – binning
represented by bin mean, bin median, …
Numerical to categorical
Discretization – binning
bins are represented by symbolic values
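A brief sketch of this numerical-to-categorical discretization, mapping ages to symbolic bin labels; the boundaries and labels are chosen only for illustration:

```python
# Sketch: numerical-to-categorical discretization where each bin gets a
# symbolic label instead of a numeric representative (labels are ours).
def discretize(value, boundaries, labels):
    """Map a numeric value to the label of the bin it falls into."""
    for upper, label in zip(boundaries, labels):
        if value <= upper:
            return label
    return labels[-1]

ages = [19, 27, 35, 42, 55]
labels = ["young", "middle", "senior"]
print([discretize(a, boundaries=[30, 50], labels=labels) for a in ages])
# ['young', 'young', 'middle', 'middle', 'senior']
```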
Symbolic Variables
Gender | Education | Marital Status | Buy
male | high school | single | yes
female | no school | married | no
female | graduate | divorced | yes
male | high school | single | yes
male | undergraduate | married | no
female | primary school | divorced | no
Numerical Variables
Gender | Education | Is_Single | Is_Married | Is_Divorced | Buy
0 | 0.50 | 1 | 0 | 0 | 1
1 | 0.00 | 0 | 1 | 0 | 0
1 | 1.00 | 0 | 0 | 1 | 1
0 | 0.50 | 1 | 0 | 0 | 1
0 | 0.75 | 0 | 1 | 0 | 0
1 | 0.25 | 0 | 0 | 1 | 0
Example: Thermometer Encoding for an Ordinal
Variable
Education | I1 | I2 | I3 | I4
high school | 0 | 0 | 1 | 1
no school | 0 | 0 | 0 | 0
graduate | 1 | 1 | 1 | 1
high school | 0 | 0 | 1 | 1
undergraduate | 0 | 1 | 1 | 1
primary school | 0 | 0 | 0 | 1
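A sketch reproducing the thermometer encoding in the table above; the assumed ordering of education levels is no school < primary school < high school < undergraduate < graduate:

```python
# Sketch of thermometer encoding: an ordinal level r (out of m indicator
# columns) is encoded so that the top r indicators I1..I4 switched on from
# the right match the table above. The level order is an assumption.
levels = ["no school", "primary school", "high school", "undergraduate", "graduate"]

def thermometer(value):
    r = levels.index(value)                # ordinal rank 0 .. 4
    m = len(levels) - 1                    # 4 indicator columns I1..I4
    return [1 if i >= m - r else 0 for i in range(m)]

for edu in ["high school", "no school", "graduate", "undergraduate", "primary school"]:
    print(edu, thermometer(edu))
# high school -> [0, 0, 1, 1], no school -> [0, 0, 0, 0], graduate -> [1, 1, 1, 1], ...
```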
Chapter 3: Data Preprocessing
85
Summary
Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
Data cleaning: e.g. missing/noisy values, outliers
Data integration from multiple sources:
Entity identification problem
Remove redundancies
Detect inconsistencies
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
86