Lecture 9 - Data Reduction


Data Preprocessing
- Data Reduction
Data Preprocessing

• Data Preprocessing: An Overview
• Data Quality
• Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization

Data Reduction Strategies

• Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.

• Why data reduction? A database or data warehouse may store terabytes of data; complex data analysis may take a very long time to run on the complete data set.

• Data reduction strategies:
  • Dimensionality reduction, e.g., remove unimportant attributes
  • Numerosity reduction (some simply call it: data reduction)
  • Data compression

Data Reduction Strategies

• Data reduction strategies:
  • Dimensionality reduction, e.g., remove unimportant attributes
    • Wavelet transforms
    • Principal Components Analysis (PCA)
    • Feature subset selection, feature creation
  • Numerosity reduction (some simply call it: data reduction)
    • Regression and log-linear models
    • Histograms, clustering, sampling
    • Data cube aggregation
  • Data compression

Data Reduction: Dimensionality Reduction

• Curse of dimensionality
  • When dimensionality increases, data becomes increasingly sparse
  • Density and the distances between points, which are critical to clustering and outlier analysis, become less meaningful
• Dimensionality reduction
  • Avoids the curse of dimensionality
  • Helps eliminate irrelevant features and reduce noise
  • Reduces the time and space required in data mining
  • Allows easier visualization
• Dimensionality reduction techniques
  • Wavelet transforms
  • Principal Component Analysis
  • Supervised and nonlinear techniques (e.g., feature selection)

Visualization Problem
• It is not easy to visualize multivariate data:
  • 1D: dot plot
  • 2D: bivariate plot (i.e., the X-Y plane)
  • 3D: X-Y-Z plot
  • 4D: ternary plot with a color code, or a tetrahedron
  • 5D, 6D, etc.: ???
Motivation

• Given data points in d dimensions
• Convert them to data points in r < d dimensions
• With minimal loss of information
Basics of PCA
• PCA is useful when we need to extract the essential information from multivariate data sets.

• The technique works by reducing the dimensionality of the data while retaining as much of its variation as possible.


What Is a Principal Component?

• A principal component can be defined as a linear combination of optimally weighted observed variables.
What are the new axes?

[Figure: the principal component axes PC1 and PC2 drawn over the original variables A and B]

• Orthogonal directions of greatest variance in the data
• Projections along PC1 discriminate the data most along any one axis
Principal Component Analysis

PCA: an orthogonal projection of the data onto a lower-dimensional linear space that...
• maximizes the variance of the projected data (purple line)
• minimizes the mean squared distance between each data point and its projection (sum of the blue lines)
The Principal Components
• Vectors originating from the center of mass
• Principal component #1 points in the direction of the largest variance.
• Each subsequent principal component...
  • is orthogonal to the previous ones, and
  • points in the direction of the largest variance of the residual subspace
[Figures: a 2D Gaussian dataset, shown first with its 1st PCA axis and then with its 2nd PCA axis]
Principal Component Analysis
• Principal component analysis (PCA) is a procedure that uses the correlations between the variables to identify which combinations of variables capture the most information about the dataset.

• Mathematically, it determines the eigenvectors of the covariance matrix and sorts them by importance according to their corresponding eigenvalues.
Basics for Principal Component Analysis

• Orthogonal/Orthonormal

• Standard deviation, Variance, Covariance

• The Covariance matrix

• Eigenvalues and Eigenvectors


Covariance

• Standard deviation and variance are one-dimensional measures.

• How much do the dimensions vary from the mean with respect to each other?

• Covariance measures this joint variability between two dimensions:

  cov(X, Y) = E[(X − E[X])(Y − E[Y])]

• It is easy to see that if X = Y we end up with the variance.


Covariance Matrix

• Let X be a random vector with mean vector μ.

• Then the covariance matrix of X, denoted by Cov(X), is

  Cov(X) = E[(X − μ)(X − μ)^T]

• The diagonal entries of Cov(X) are the variances of the individual components; the off-diagonal entries are the pairwise covariances.

• The covariance matrix is symmetric.
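As an illustration, here is a minimal sketch of computing a sample covariance matrix, assuming NumPy and a small made-up 2D dataset (the slides do not prescribe a language or a normalization convention; this uses the 1/(n−1) convention that np.cov defaults to):

```python
import numpy as np

# Small made-up dataset: rows are observations, columns are features.
X = np.array([[1.0, 2.0],
              [3.0, 3.0],
              [5.0, 6.0],
              [8.0, 7.0]])

# Center each feature at zero.
X_centered = X - X.mean(axis=0)

# Sample covariance matrix (normalized by n - 1, as np.cov does by default).
n = X.shape[0]
cov_manual = X_centered.T @ X_centered / (n - 1)

# Cross-check against NumPy's built-in estimator (rowvar=False: columns are variables).
cov_numpy = np.cov(X, rowvar=False)
print(cov_manual)
print(np.allclose(cov_manual, cov_numpy))  # True
```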


Orthogonality/Orthonormality

[Figure: the unit vectors (1, 0) and (0, 1) plotted in the plane]

  ⟨v1, v2⟩ = ⟨(1, 0), (0, 1)⟩ = 0

• Two vectors v1 and v2 for which ⟨v1, v2⟩ = 0 holds are said to be orthogonal.

• Unit vectors which are orthogonal are said to be orthonormal.


Eigenvalues/Eigenvectors

• Let A be an n×n square matrix and x an n×1 column vector. Then a (right) eigenvector of A is a nonzero vector x such that

  A x = λ x      (λ: eigenvalue, x: eigenvector)

Procedure:
• Find the eigenvalues: solve det(A − λI) = 0 for the λ's.
• Find the corresponding eigenvectors: for each λ, solve (A − λI) x = 0.
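A minimal sketch of this procedure, assuming NumPy and a small made-up 2×2 matrix; in practice the characteristic polynomial and the linear systems are solved for us by the library:

```python
import numpy as np

# A small example matrix (made up for illustration).
A = np.array([[4.0, 2.0],
              [1.0, 3.0]])

# numpy solves det(A - lambda*I) = 0 and (A - lambda*I) x = 0 internally.
eigvals, eigvecs = np.linalg.eig(A)

# Each column of eigvecs is an eigenvector; check A x = lambda x for each pair.
for lam, x in zip(eigvals, eigvecs.T):
    print(lam, np.allclose(A @ x, lam * x))  # True for every pair
```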


Transformation

• We are looking for a transformation of the data matrix X (p×n) such that

  Y = α^T X = α1·X1 + α2·X2 + ... + αp·Xp


Transformation

• What is a reasonable choice for the α's?

• Remember: we wanted a transformation that maximizes information, i.e., one that captures the variance in the data.

• So: maximize the variance of the projection of the observations onto the Y variables!

• Find α such that Var(α^T X) is maximal.

• The matrix C = Var(X) is the covariance matrix of the Xi variables.


Transformation
Can we intuitively see that in a picture?

[Figure: two candidate projection directions, labeled "Good" and "Better"]

           | v(x1)     c(x1,x2)  ...  c(x1,xp) |
  Cov(X) = | c(x1,x2)  v(x2)     ...  c(x2,xp) |
           | ...       ...       ...  ...      |
           | c(x1,xp)  c(x2,xp)  ...  v(xp)    |
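To make the variance-maximization view concrete, here is a minimal sketch (assuming NumPy and a made-up correlated 2D dataset): the variance of the projection α^T X equals α^T·Cov(X)·α, and among unit vectors it is maximized by the leading eigenvector of the covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up correlated 2D data: rows are observations, columns are variables.
X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 1.5], [1.5, 1.0]], size=500)

C = np.cov(X, rowvar=False)           # sample covariance matrix
alpha = np.array([1.0, 0.0])          # an arbitrary unit-length direction

# Var(alpha^T X) computed two ways: empirically and as alpha^T C alpha.
proj = X @ alpha
print(np.var(proj, ddof=1), alpha @ C @ alpha)   # (essentially) equal

# Among all unit vectors, the projection variance is maximized by the
# eigenvector of C with the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(C)
best_direction = eigvecs[:, -1]                  # leading eigenvector
print(best_direction, eigvals[-1])               # max achievable projection variance
```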
PCA algorithm
(based on sample covariance matrix)
• Given data {x1, …, xm}, compute the covariance matrix Σ:

  Σ = (1/m) · Σ_{i=1..m} (xi − x̄)(xi − x̄)^T,   where x̄ = (1/m) · Σ_{i=1..m} xi

• PCA basis vectors = the eigenvectors of Σ

• Larger eigenvalue ⇒ more important eigenvector
PCA – zero mean
• Suppose we are given M vectors x1, x2, ..., xM, each of size N×1
  (N: number of features, M: number of data points)

Step 1: compute the sample mean

  x̄ = (1/M) · Σ_{i=1..M} xi

Step 2: subtract the sample mean (i.e., center the data at zero)

  Φi = xi − x̄

Step 3: compute the sample covariance matrix Σx

  Σx = (1/M) · Σ_{i=1..M} (xi − x̄)(xi − x̄)^T = (1/M) · Σ_{i=1..M} Φi Φi^T = (1/M) · A A^T

  where A = [Φ1 Φ2 ... ΦM] is the N×M matrix whose columns are the Φi.
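A minimal sketch of Steps 1–3, assuming NumPy and a made-up data matrix, storing the centered vectors Φi as the columns of A so that Σx = (1/M)·A·A^T as on the slide:

```python
import numpy as np

# Made-up data: M = 5 samples with N = 3 features each, one sample per column.
X = np.array([[2.0, 4.0, 6.0, 8.0, 10.0],
              [1.0, 3.0, 2.0, 5.0,  4.0],
              [0.5, 1.5, 1.0, 2.0,  2.5]])
N, M = X.shape

# Step 1: sample mean (an N x 1 column vector).
x_bar = X.mean(axis=1, keepdims=True)

# Step 2: center the data; the columns of A are Phi_i = x_i - x_bar.
A = X - x_bar

# Step 3: sample covariance matrix (1/M normalization, as on the slide).
Sigma_x = A @ A.T / M
print(Sigma_x.shape)   # (N, N)
```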
PCA - Steps
Step 4: compute the eigenvalues/eigenvectors of Σx

  Σx ui = λi ui,   where we assume λ1 ≥ λ2 ≥ ... ≥ λN

  (Note: most software packages return the eigenvalues, and corresponding eigenvectors,
  in decreasing order – if not, you can explicitly put them in this order.)

Since Σx is symmetric, u1, u2, ..., uN form an orthogonal basis in R^N,
and we can represent any x ∈ R^N as

  x − x̄ = Σ_{i=1..N} yi·ui = y1·u1 + y2·u2 + ... + yN·uN

with coefficients

  yi = (x − x̄)^T ui / (ui^T ui) = (x − x̄)^T ui   if ||ui|| = 1

i.e., this is just a “change” of basis: the vector x = (x1, ..., xN)^T is re-expressed
in the new coordinates y = (y1, ..., yN)^T.

  (Note: most software packages normalize the ui to unit length to simplify calculations; if
  not, you can explicitly normalize them.)
PCA - Steps
Step 5: dimensionality reduction step – approximate x using only the first K eigenvectors (K << N), i.e., those corresponding to the K largest eigenvalues, where K is a parameter:

  x̂ = x̄ + Σ_{i=1..K} yi·ui
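Putting Steps 4 and 5 together, here is a minimal sketch (assuming NumPy; the function name and data are illustrative, not part of the slides) that eigendecomposes Σx, keeps the K leading eigenvectors, and projects and reconstructs the data:

```python
import numpy as np

def pca_reduce(X, K):
    """X: N x M data matrix (one sample per column); K: number of components kept."""
    x_bar = X.mean(axis=1, keepdims=True)
    A = X - x_bar                          # centered data (columns are Phi_i)
    Sigma_x = A @ A.T / X.shape[1]         # sample covariance (1/M normalization)

    # Step 4: eigenvalues/eigenvectors; eigh returns them in ascending order,
    # so flip to get lambda_1 >= lambda_2 >= ... >= lambda_N.
    eigvals, eigvecs = np.linalg.eigh(Sigma_x)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

    # Step 5: keep only the K leading eigenvectors and project the centered data.
    U_K = eigvecs[:, :K]                   # N x K
    Y = U_K.T @ A                          # K x M matrix of coefficients y_i
    X_hat = x_bar + U_K @ Y                # reconstruction (approximation of X)
    return Y, X_hat, eigvals

# Usage on a small made-up dataset (3 features, 5 samples), keeping K = 2 components.
X = np.array([[2.0, 4.0, 6.0, 8.0, 10.0],
              [1.0, 3.0, 2.0, 5.0,  4.0],
              [0.5, 1.5, 1.0, 2.0,  2.5]])
Y, X_hat, eigvals = pca_reduce(X, K=2)
print(Y.shape, np.linalg.norm(X - X_hat))  # (2, 5) and a small reconstruction error
```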

Example
• Compute the PCA of the following dataset:

(1,2),(3,3),(3,5),(5,4),(5,6),(6,5),(8,7),(9,8)

• Compute the sample covariance matrix.

• The eigenvalues can be computed by finding the roots of the characteristic polynomial det(Σx − λI) = 0.
Example (cont’d)
• The eigenvectors are the solutions of the systems:

  Σx ui = λi ui

  Note: if ui is a solution, then c·ui is also a solution for any c ≠ 0.

• Eigenvectors can be normalized to unit length using:

  v̂i = vi / ||vi||
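A minimal sketch (assuming NumPy) that carries out the example numerically; the covariance normalization is an assumption, since the slides do not show which convention is used (np.cov defaults to 1/(n−1)):

```python
import numpy as np

# The dataset from the example, one point per row.
points = np.array([[1, 2], [3, 3], [3, 5], [5, 4],
                   [5, 6], [6, 5], [8, 7], [9, 8]], dtype=float)

# Sample covariance matrix (here with the 1/(n-1) convention used by np.cov).
Sigma = np.cov(points, rowvar=False)
print(Sigma)

# Eigenvalues are the roots of det(Sigma - lambda*I) = 0; eigh solves this for us
# and also returns unit-length eigenvectors (as the columns of eigvecs).
eigvals, eigvecs = np.linalg.eigh(Sigma)
print(eigvals)   # in ascending order
print(eigvecs)   # each column already satisfies ||v|| = 1
```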
Choosing the projection dimension K?

• K is typically chosen based on how much information (variance) we want to preserve:

  Choose the smallest K that satisfies

    (Σ_{i=1..K} λi) / (Σ_{i=1..N} λi) ≥ T,   where T is a threshold (e.g., 0.9)

• If T = 0.9, for example, we “preserve” 90% of the information (variance) in the data.

• If K = N, then we “preserve” 100% of the information in the data (i.e., it is just a “change” of basis and x̂ = x).
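A minimal sketch of this rule (assuming NumPy; the function name and eigenvalues are made up for illustration): sort the eigenvalues in decreasing order and pick the smallest K whose cumulative share of the total variance reaches the threshold T:

```python
import numpy as np

def choose_k(eigvals, T=0.9):
    """Return the smallest K whose leading eigenvalues preserve a fraction T of the variance."""
    lam = np.sort(eigvals)[::-1]                 # lambda_1 >= lambda_2 >= ...
    cumulative = np.cumsum(lam) / lam.sum()      # variance fraction kept by the first K
    return int(np.searchsorted(cumulative, T) + 1)

# Usage with made-up eigenvalues.
print(choose_k(np.array([5.0, 2.0, 1.0, 0.5, 0.5]), T=0.9))
# -> 4: the first four eigenvalues preserve about 94% >= 90% of the variance.
```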
Data Normalization

• The principal components depend on the units used to measure the original variables, as well as on the range of values they assume.

• Data should always be normalized prior to using PCA.

• A common normalization method is to transform all the data to have zero mean and unit standard deviation:

  (xi − μ) / σ,   where μ and σ are the mean and standard deviation of the i-th feature xi
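A minimal sketch of this z-score normalization, assuming NumPy and a made-up data matrix whose two features have very different scales; it is applied per feature before running PCA:

```python
import numpy as np

# Made-up data matrix: rows are observations, columns are features with different scales.
X = np.array([[170.0, 65000.0],
              [160.0, 48000.0],
              [180.0, 52000.0],
              [175.0, 80000.0]])

# z-score normalization: subtract each feature's mean and divide by its standard deviation.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_normalized = (X - mu) / sigma

print(X_normalized.mean(axis=0))  # approximately [0, 0]
print(X_normalized.std(axis=0))   # [1, 1]
```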
