CH 3
Data cleaning
Data reduction
[Figure: a regression line (y = x + 1) fitted through data points, and cluster analysis used to detect and remove outliers]
Data Preprocessing
Why preprocess the data?
Data cleaning
Data reduction
r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}
where n is the number of tuples, \bar{A} and \bar{B} are the respective
means of A and B, σ_A and σ_B are the respective standard deviations
of A and B, and Σ a_i b_i is the sum of the AB cross-products.
\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}
Expected = count(A = a_i) \times count(B = b_j) / n
Chi-Square Calculation: An Example
E11=count(male)*count(fiction)/n=300*450/1500=90
It shows that gender and preference for science fiction are correlated in the given group
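A sketch of the calculation with NumPy; the observed 2×2 counts below are assumptions chosen to be consistent with the stated totals (300 males, 450 fiction readers, n = 1500):

```python
import numpy as np

# Assumed observed contingency table:
#                fiction  non-fiction
observed = np.array([[250.0,   50.0],   # male
                     [200.0, 1000.0]])  # female

n = observed.sum()
row_totals = observed.sum(axis=1, keepdims=True)  # count(A = a_i)
col_totals = observed.sum(axis=0, keepdims=True)  # count(B = b_j)
# Expected = count(A = a_i) * count(B = b_j) / n
expected = row_totals * col_totals / n            # expected[0,0] = 300*450/1500 = 90

chi2 = ((observed - expected) ** 2 / expected).sum()
```

A large chi-square value relative to the critical value for 1 degree of freedom indicates the two attributes are correlated.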
Covariance of Numeric Data
Correlation and covariance are two similar measures for assessing how much
two attributes change together. The mean values of A and B are
also known as the expected values of A and B, that is, E(A) = \bar{A} and E(B) = \bar{B}, and
Cov(A, B) = E\big[(A - \bar{A})(B - \bar{B})\big] = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n}
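A small sketch contrasting the two measures; the sample arrays are illustrative only:

```python
import numpy as np

# Illustrative attribute samples
A = np.array([6.0, 5.0, 4.0, 3.0, 2.0])
B = np.array([20.0, 10.0, 14.0, 5.0, 5.0])

n = len(A)
# Cov(A, B) = E[(A - mean(A)) * (B - mean(B))], here the population form (divide by n)
cov = np.sum((A - A.mean()) * (B - B.mean())) / n
# Correlation standardizes the covariance by both standard deviations
r = cov / (A.std() * B.std())
```

A positive covariance (here 7.0) indicates the attributes tend to rise together; correlation additionally bounds the measure to [-1, 1].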
Data Transformation
Smoothing: remove noise from the data. Techniques include binning, regression, and
clustering
Aggregation: summarization. E.g., annual sales amounts can be aggregated
from monthly sales; data cube construction supports analysis at multiple
abstraction levels
Concept hierarchy generation for nominal data: where attributes such
as street can be generalized to higher-level concepts, like city or country
Normalization: scaled to fall within a small, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Attribute/feature construction
New attributes constructed from the given ones
Data Transformation: Normalization
Min-max normalization: to [new_min_A, new_max_A]
v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A
Z-score normalization (\mu_A: mean, \sigma_A: standard deviation of A):
v' = \frac{v - \mu_A}{\sigma_A}
Normalization by decimal scaling:
v' = \frac{v}{10^j}
where j is the smallest integer such that \max(|v'|) < 1
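The three normalization methods can be sketched as follows; the income-like values in v are an assumed example:

```python
import numpy as np

# Assumed sample values of one attribute A
v = np.array([73600.0, 12000.0, 54000.0, 98000.0])

# Min-max normalization to [new_min, new_max] = [0, 1]
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization: subtract the mean, divide by the standard deviation
v_zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10**j, with j the smallest integer
# such that max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
v_decimal = v / 10 ** j
```

Min-max preserves the relationships among the original values but is sensitive to outliers; z-score is preferable when the actual minimum and maximum are unknown.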
[Figure: original data vs. its wavelet approximation]
Dimensionality Reduction:
Wavelet Transformation
The discrete wavelet transform (DWT) is a linear signal processing technique
that, when applied to a data vector X, transforms it to a numerically different
vector, X’, of wavelet coefficients
Method:
Length, L, of the input data vector must be an integer power of 2. (padding
with 0’s, when necessary)
Each transform applies two functions: a smoothing function and a difference
function, which brings out the detailed features of the data.
The functions are applied to pairs of data points in X, producing two sets
of data of length L/2. In general, these represent a smoothed (low-frequency)
version of the input data and its high-frequency content, respectively.
The two functions are applied recursively until the result reaches the desired length.
Selected values from the data sets obtained in the previous iterations are
designated the wavelet coefficients of the transformed data.
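The method above can be sketched for the Haar wavelet, using pairwise averaging as the smoothing function and pairwise half-differences as the difference function (one common normalization; others exist). The input vector is an assumed example:

```python
import numpy as np

def haar_dwt(x):
    """Recursive Haar DWT sketch: at each level, smoothing (pairwise average)
    and difference functions are applied to pairs of points, halving the
    length until a single overall average remains."""
    x = np.asarray(x, dtype=float)
    assert len(x) & (len(x) - 1) == 0, "length must be a power of 2 (pad with 0s)"
    coeffs = []
    while len(x) > 1:
        smooth = (x[0::2] + x[1::2]) / 2   # low-frequency version of the data
        detail = (x[0::2] - x[1::2]) / 2   # high-frequency content
        coeffs.append(detail)              # keep details as wavelet coefficients
        x = smooth                         # recurse on the smoothed data
    coeffs.append(x)                       # final overall average
    return coeffs

c = haar_dwt([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])
```

The returned list holds the detail coefficients from each level plus the final average; the original vector can be reconstructed exactly from these coefficients.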
Dimensionality Reduction: Principal
Component Analysis (PCA)
Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
Steps
Normalize input data: Each attribute falls within the same range
Compute k orthonormal (unit) vectors, i.e., principal components
Each input data (vector) is a linear combination of the k principal
component vectors
The principal components are sorted in order of decreasing
“significance” or strength
Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with low
variance (using the strongest principal components, it is
possible to reconstruct a good approximation of the original data)
Works for numeric data only
Used when the number of dimensions is large
Numerosity Reduction
Reduce data volume by choosing alternative, smaller forms of data
representation
These techniques may be:
Parametric --A model is used to estimate the data, so that typically
only the data parameters need to be stored, instead of the actual
data
Regression and log-linear models
Nonparametric
Histograms
Clustering
Sampling
Data cube aggregation
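As a sketch of a nonparametric technique, a histogram replaces the raw values with a handful of bucket counts; the price list below is an assumed example:

```python
import numpy as np

# Assumed raw attribute values (29 prices)
prices = np.array([1, 1, 5, 5, 5, 8, 8, 10, 10, 10, 12, 14, 14, 15, 15,
                   15, 15, 18, 18, 20, 20, 20, 21, 21, 25, 25, 25, 28, 30],
                  dtype=float)

# Equal-width histogram: store only 3 bucket counts plus 4 bin edges
counts, edges = np.histogram(prices, bins=3)
```

The 29 raw values are reduced to 3 counts and 4 boundaries, at the cost of losing the exact values within each bucket.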
Regression and log-linear models
Regression and log-linear models can be used to approximate the given
data.
In linear regression, the data are modeled to fit a straight line. For
example, a random variable, y (called a response variable), can be
modeled as a linear function of another random variable, x (called a
predictor variable), with the equation
y = wx + b
In the context of data mining, x and y are numeric database attributes.
The coefficients, w and b, are called regression coefficients
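Fitting y = wx + b by least squares can be sketched with the closed-form estimates; the (x, y) samples are illustrative:

```python
import numpy as np

# Illustrative data, roughly following y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# Closed-form least-squares estimates of the regression coefficients:
# w = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2), b = mean(y) - w*mean(x)
w = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - w * x.mean()
```

Once fitted, only the two coefficients w and b need to be stored instead of the full data.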
Discretization:
Divide the range of a continuous attribute into intervals
[Example figure: stepwise partitioning — the interval (−$1,000 – $2,000) at Step 3 is adjusted to cover (−$400 – $5,000) at Step 4]
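A minimal discretization sketch using equal-width binning (a simpler alternative to the stepwise partitioning in the example); the values are assumed:

```python
import numpy as np

# Assumed continuous attribute values (e.g., profits)
values = np.array([-351.0, 120.0, 900.0, 1500.0, 2100.0, 3400.0, 4700.0])

# Divide the attribute's range into 5 equal-width intervals
n_bins = 5
edges = np.linspace(values.min(), values.max(), n_bins + 1)  # interval boundaries
labels = np.digitize(values, edges[1:-1])                    # interval index per value
```

Each continuous value is replaced by the index of the interval it falls in, turning a numeric attribute into an ordinal one.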