
CS434a/541a: Pattern Recognition

Prof. Olga Veksler

Lecture 7
Today
Problems of high dimensional data, “the curse of dimensionality”
  running time
  overfitting
  number of samples required
Dimensionality Reduction Methods
  Principal Component Analysis (today)
  Fisher Linear Discriminant (next time)
Dimensionality on the Course Road Map
1. Bayesian Decision theory (rare case, a lot is known)
   Know probability distribution of the categories
   Do not even need training data
   Can design optimal classifier
2. ML and Bayesian parameter estimation
   Need to estimate parameters of probability distribution
   Need training data
3. Non-Parametric Methods
   No probability distribution, labeled data
4. Linear discriminant functions and Neural Nets
   The shape of discriminant functions is known
   Need to estimate parameters of discriminant functions
5. Unsupervised Learning and Clustering
   No probability distribution and unlabeled data (little is known)
The curse of dimensionality affects all these methods
Curse of Dimensionality: Complexity
Complexity (running time) increases with dimension d
A lot of methods have at least O(nd²) complexity, where n is the number of samples
For example, this is the case if we need to estimate the covariance matrix
So as d becomes large, O(nd²) complexity may be too costly
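
As a rough illustration of this scaling, here is a small Matlab timing sketch (not part of the original slides; the values of n and d are arbitrary choices):

% Estimating a covariance matrix touches all d^2 entries for each of the
% n samples, so the cost grows roughly as O(n*d^2).
n = 5000;
for d = [50 100 200 400]
    X = randn(n, d);              % n samples in d dimensions
    tic;
    S = cov(X);                   % d x d sample covariance matrix
    fprintf('d = %3d: cov took %.3f seconds\n', d, toc);
end

Doubling d roughly quadruples the work, which is exactly the d² factor in O(nd²).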
Curse of Dimensionality: Overfitting
If d is large, n, the number of samples, may be too small for accurate parameter estimation
For example, the covariance matrix has d² parameters:

\Sigma = \begin{bmatrix} \sigma_1^2 & \cdots & \sigma_{1d} \\ \vdots & & \vdots \\ \sigma_{d1} & \cdots & \sigma_d^2 \end{bmatrix}

For accurate estimation, n should be much bigger than d²
Otherwise the model is too complicated for the data, and we get overfitting
Curse of Dimensionality: Overfitting
Paradox: if n < d², we are better off assuming that the features are uncorrelated, even if we know this assumption is wrong
In this case, the covariance matrix has only d parameters:

\Sigma = \begin{bmatrix} \sigma_1^2 & & 0 \\ & \ddots & \\ 0 & & \sigma_d^2 \end{bmatrix}

We are likely to avoid overfitting because we fit a model with fewer parameters
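
To make the paradox concrete, here is a small Matlab sketch (an assumed example, not from the slides; the sizes are arbitrary). With n smaller than d, let alone d², the full covariance estimate is not even invertible, while the diagonal (uncorrelated-features) estimate with only d parameters stays well behaved:

n = 15; d = 20;                   % here n < d < d^2
X = randn(n, d);                  % n samples in d dimensions
S_full = cov(X);                  % full covariance, d(d+1)/2 free parameters
S_diag = diag(var(X));            % diagonal covariance, only d parameters
fprintf('rank of full estimate: %d (need %d to invert)\n', rank(S_full), d);
fprintf('rank of diagonal estimate: %d\n', rank(S_diag));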
Curse of Dimensionality: Number of Samples
Suppose we want to use the nearest neighbor approach with k = 1 (1NN)
Suppose we start with only one feature, with values in the interval [0, 1]
This feature is not discriminative, i.e. it does not separate the classes well
We decide to use 2 features. For the 1NN method to work well, we need a lot of samples, i.e. the samples have to be dense
To maintain the same density as in 1D (9 samples per unit length), how many samples do we need?
Curse of Dimensionality: Number of Samples
We need 9² = 81 samples to maintain the same density as in 1D (a 9 by 9 grid over the unit square)
Curse of Dimensionality: Number of Samples
Of course, when we go from 1 feature to 2, no one gives us more samples; we still have 9
This is way too sparse for 1NN to work well
Curse of Dimensionality: Number of Samples
Things go from bad to worse if we decide to use 3 features
If 9 samples were dense enough in 1D, in 3D we need 9³ = 729 samples!
Curse of Dimensionality: Number of Samples
In general, if n samples are dense enough in 1D, then in d dimensions we need n^d samples!
And n^d grows really fast as a function of d
Common pitfall:
If we can’t solve a problem with a few features, adding more features seems like a good idea
However, the number of samples usually stays the same
The method with more features is then likely to perform worse instead of better
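
A quick Matlab calculation (not from the slides) shows how fast n^d explodes for n = 9:

n = 9;
for d = 1:6
    fprintf('d = %d: need n^d = %g samples\n', d, n^d);
end

Already at d = 6 this asks for over half a million samples.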
Curse of Dimensionality: Number of Samples
For a fixed number of samples, as we add features, the classification error first decreases and then starts to increase again
[figure: classification error vs. number of features, with a minimum at the optimal number of features]
Thus for each fixed sample size n, there is an optimal number of features to use
The Curse of Dimensionality
We should try to avoid creating lots of features
Often we have no choice: the problem starts with many features
Example: Face Detection
One sample point is a k by m array of pixels
Feature extraction is not trivial, so usually every pixel is taken as a feature
Typical dimension is 20 by 20 = 400
Suppose 10 samples are dense enough for 1 dimension. Then we need “only” 10^400 samples
The Curse of Dimensionality
Face Detection: the dimension of one sample point is km
The fact that we set up the problem with km dimensions (features) does not mean it is really a km-dimensional problem
The space of all k by m images has km dimensions
The space of all k by m faces must be much smaller, since faces form a tiny fraction of all possible images
Most likely we are not setting the problem up with the right features
If we used better features, we would likely need much fewer than km dimensions
Dimensionality Reduction
High dimensionality is challenging and redundant
It is natural to try to reduce dimensionality
Reduce dimensionality by feature combination: combine old features x to create new features y

x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} \rightarrow f\left( \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} \right) = \begin{bmatrix} y_1 \\ \vdots \\ y_k \end{bmatrix} = y, with k < d

For example,

x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} \rightarrow \begin{bmatrix} x_1 + x_2 \\ x_3 + x_4 \end{bmatrix} = y

Ideally, the new vector y should retain from x all information important for classification
Dimensionality Reduction
The best f(x) is most likely a non-linear function
Linear functions are easier to find, though
For now, assume that f(x) is a linear mapping
Thus it can be represented by a k × d matrix W:

W \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} = \begin{bmatrix} w_{11} & \cdots & w_{1d} \\ \vdots & & \vdots \\ w_{k1} & \cdots & w_{kd} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} = \begin{bmatrix} y_1 \\ \vdots \\ y_k \end{bmatrix}, with k < d
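
As a small sketch (not from the slides), the earlier feature-combination example y = [x_1 + x_2; x_3 + x_4] written as a linear map y = W x in Matlab, with an arbitrary input vector:

W = [1 1 0 0;
     0 0 1 1];            % 2 x 4 matrix: each row sums a pair of old features
x = [3; 5; 2; 7];         % an arbitrary 4-dimensional feature vector
y = W * x                 % gives [8; 9], the 2-dimensional combined features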
Feature Combination
We will look at 2 methods for feature combination:
Principal Component Analysis (PCA)
Fisher Linear Discriminant (next lecture)
Principal Component Analysis (PCA)
Main idea: seek the most accurate data representation in a lower-dimensional space
Example in 2-D: project the data to a 1-D subspace (a line) which minimizes the projection error
[figure: two candidate lines; one gives large projection errors (a bad line to project to), the other gives small projection errors (a good line to project to)]
Notice that the good line to use for projection lies in the direction of largest variance
PCA
After the data is projected onto the best line, we need to transform the coordinate system to get a 1-D representation for vector y
Note that the new data y has the same variance as the old data x in the direction of the projection line
PCA preserves the largest variances in the data. We will prove this statement; for now it is just an intuition of what PCA will do
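
A quick numerical check of this intuition (an assumed example, not from the slides): project correlated 2-D data onto the direction of largest variance and compare the variance of the 1-D projection with the largest eigenvalue of the data's covariance matrix. The two numbers agree, since var(X e1) = e1' cov(X) e1 = lambda_1.

X = randn(500, 2) * [2 1; 0 0.3];     % 500 correlated 2-D samples (values vary run to run)
[V, D] = eig(cov(X));                 % eigenvectors/eigenvalues of the covariance
[~, idx] = max(diag(D));              % index of the largest eigenvalue
e1 = V(:, idx);                       % direction of largest variance
y = X * e1;                           % 1-D projection onto that direction
fprintf('variance of projection: %.3f, largest eigenvalue: %.3f\n', var(y), D(idx, idx));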
PCA: Approximation of Elliptical Cloud in 3D
[figure: the best 2D approximation and the best 1D approximation of an elliptical cloud of points in 3D]
PCA
What is the direction of largest variance in the data?
Recall that if x has multivariate distribution N(µ, Σ), the direction of largest variance is given by the eigenvector corresponding to the largest eigenvalue of Σ
This is a hint that we should be looking at the covariance matrix of the data (note that PCA can be applied to distributions other than Gaussian)
PCA: Linear Algebra for Derivation
Let V be a d-dimensional linear space, and W be a k-dimensional linear subspace of V
We can always find a set of d-dimensional vectors {e_1, e_2, ..., e_k} which forms an orthonormal basis for W:
\langle e_i, e_j \rangle = 0 if i \neq j and \langle e_i, e_i \rangle = 1
Thus any vector in W can be written as

\alpha_1 e_1 + \alpha_2 e_2 + \dots + \alpha_k e_k = \sum_{i=1}^{k} \alpha_i e_i  for scalars \alpha_1, ..., \alpha_k

Example: let V = R² and let W be the line x - 2y = 0. The points on this line satisfy x = 2y, so an orthonormal basis for W is the single vector

\begin{bmatrix} 2/\sqrt{5} \\ 1/\sqrt{5} \end{bmatrix}
PCA: Linear Algebra for Derivation
Recall that a subspace W contains the zero vector, i.e. it goes through the origin
[figure: a line that does not pass through the origin is not a subspace of R²; a line through the origin is a subspace of R²]
For the derivation, it will be convenient to project to a subspace W; thus we need to shift everything to the origin first
PCA Derivation: Shift by the Mean Vector
Before PCA, subtract the sample mean from the data:

x - \frac{1}{n} \sum_{i=1}^{n} x_i = x - \hat{\mu}

The new data has zero mean: E(X - E(X)) = E(X) - E(X) = 0
All we did is change the coordinate system
[figure: the same data in the original coordinates and in coordinates centered at the sample mean \hat{\mu}]
Another way to look at it: the first step of getting y is to subtract the mean of x:

x \rightarrow y = f(x) = g(x - \hat{\mu})
PCA: Derivation
We want to find the most accurate representation of data D = {x_1, x_2, ..., x_n} in some subspace W which has dimension k < d
Let {e_1, e_2, ..., e_k} be the orthonormal basis for W. Any vector in W can be written as \sum_{i=1}^{k} \alpha_i e_i
Thus x_1 will be represented by some vector in W:

\sum_{i=1}^{k} \alpha_{1i} e_i

The error of this representation is:

error = \left\| x_1 - \sum_{i=1}^{k} \alpha_{1i} e_i \right\|^2
PCA: Derivation
To find the total error, we need to sum over all x_j
Any x_j can be written as \sum_{i=1}^{k} \alpha_{ji} e_i
Thus the total error for the representation of all data D is the sum over all data points of the error at each point:

J(e_1, ..., e_k, \alpha_{11}, ..., \alpha_{nk}) = \sum_{j=1}^{n} \left\| x_j - \sum_{i=1}^{k} \alpha_{ji} e_i \right\|^2

The unknowns are the basis vectors e_i and the coefficients \alpha_{ji}
PCA: Derivation
To minimize J, we need to take partial derivatives and also enforce the constraint that {e_1, e_2, ..., e_k} are orthonormal

J(e_1, ..., e_k, \alpha_{11}, ..., \alpha_{nk}) = \sum_{j=1}^{n} \left\| x_j - \sum_{i=1}^{k} \alpha_{ji} e_i \right\|^2

Let us simplify J first:

J(e_1, ..., e_k, \alpha_{11}, ..., \alpha_{nk}) = \sum_{j=1}^{n} \| x_j \|^2 - 2 \sum_{j=1}^{n} x_j^t \sum_{i=1}^{k} \alpha_{ji} e_i + \sum_{j=1}^{n} \sum_{i=1}^{k} \alpha_{ji}^2
 = \sum_{j=1}^{n} \| x_j \|^2 - 2 \sum_{j=1}^{n} \sum_{i=1}^{k} \alpha_{ji} x_j^t e_i + \sum_{j=1}^{n} \sum_{i=1}^{k} \alpha_{ji}^2
PCA: Derivation

J(e_1, ..., e_k, \alpha_{11}, ..., \alpha_{nk}) = \sum_{j=1}^{n} \| x_j \|^2 - 2 \sum_{j=1}^{n} \sum_{i=1}^{k} \alpha_{ji} x_j^t e_i + \sum_{j=1}^{n} \sum_{i=1}^{k} \alpha_{ji}^2

First take partial derivatives with respect to \alpha_{ml}:

\frac{\partial}{\partial \alpha_{ml}} J(e_1, ..., e_k, \alpha_{11}, ..., \alpha_{nk}) = -2 x_m^t e_l + 2 \alpha_{ml}

Thus the optimal value for \alpha_{ml} is:

-2 x_m^t e_l + 2 \alpha_{ml} = 0 \quad \Rightarrow \quad \alpha_{ml} = x_m^t e_l
PCA: Derivation

J(e_1, ..., e_k, \alpha_{11}, ..., \alpha_{nk}) = \sum_{j=1}^{n} \| x_j \|^2 - 2 \sum_{j=1}^{n} \sum_{i=1}^{k} \alpha_{ji} x_j^t e_i + \sum_{j=1}^{n} \sum_{i=1}^{k} \alpha_{ji}^2

Plug the optimal value \alpha_{ml} = x_m^t e_l back into J:

J(e_1, ..., e_k) = \sum_{j=1}^{n} \| x_j \|^2 - 2 \sum_{j=1}^{n} \sum_{i=1}^{k} (x_j^t e_i)(x_j^t e_i) + \sum_{j=1}^{n} \sum_{i=1}^{k} (x_j^t e_i)^2

This simplifies to:

J(e_1, ..., e_k) = \sum_{j=1}^{n} \| x_j \|^2 - \sum_{j=1}^{n} \sum_{i=1}^{k} (x_j^t e_i)^2
PCA: Derivation

J(e_1, ..., e_k) = \sum_{j=1}^{n} \| x_j \|^2 - \sum_{j=1}^{n} \sum_{i=1}^{k} (x_j^t e_i)^2

Rewrite J using (a^t b)^2 = (a^t b)(a^t b) = (b^t a)(a^t b) = b^t (a a^t) b:

J(e_1, ..., e_k) = \sum_{j=1}^{n} \| x_j \|^2 - \sum_{i=1}^{k} e_i^t \left( \sum_{j=1}^{n} x_j x_j^t \right) e_i = \sum_{j=1}^{n} \| x_j \|^2 - \sum_{i=1}^{k} e_i^t S e_i

where S = \sum_{j=1}^{n} x_j x_j^t

S is called the scatter matrix; it is just n - 1 times the sample covariance matrix we have seen before,

\hat{\Sigma} = \frac{1}{n-1} \sum_{j=1}^{n} (x_j - \hat{\mu})(x_j - \hat{\mu})^t

(recall that the data has already been shifted so that \hat{\mu} = 0)
PCA: Derivation

J(e_1, ..., e_k) = \sum_{j=1}^{n} \| x_j \|^2 - \sum_{i=1}^{k} e_i^t S e_i

The first term is a constant, so minimizing J is equivalent to maximizing \sum_{i=1}^{k} e_i^t S e_i
We should also enforce the constraints e_i^t e_i = 1 for all i
Use the method of Lagrange multipliers: incorporate the constraints with undetermined multipliers \lambda_1, ..., \lambda_k
We need to maximize the new function u:

u(e_1, ..., e_k) = \sum_{i=1}^{k} e_i^t S e_i - \sum_{j=1}^{k} \lambda_j (e_j^t e_j - 1)
PCA: Derivation
If x is a vector and f(x) = f(x_1, ..., x_d) is a function, to simplify notation define

\frac{d}{dx} f(x) = \begin{bmatrix} \partial f / \partial x_1 \\ \vdots \\ \partial f / \partial x_d \end{bmatrix}

It can be shown that \frac{d}{dx} (x^t x) = 2x

If A is a symmetric matrix, it can be shown that \frac{d}{dx} (x^t A x) = 2Ax
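
A numeric spot check of these two identities (an assumed example, not from the slides), comparing the analytic gradients 2x and 2Ax with finite-difference approximations:

d = 4;
x = randn(d, 1);
A = randn(d); A = (A + A') / 2;                 % make A symmetric
g1 = zeros(d, 1); g2 = zeros(d, 1);
h = 1e-6;                                       % finite-difference step
for i = 1:d
    dx = zeros(d, 1); dx(i) = h;
    g1(i) = ((x+dx)'*(x+dx) - x'*x) / h;        % approximates d/dx_i of x'x
    g2(i) = ((x+dx)'*A*(x+dx) - x'*A*x) / h;    % approximates d/dx_i of x'Ax
end
disp([g1, 2*x]);          % the two columns should nearly match
disp([g2, 2*A*x]);        % the two columns should nearly match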
PCA: Derivation

u(e_1, ..., e_k) = \sum_{i=1}^{k} e_i^t S e_i - \sum_{j=1}^{k} \lambda_j (e_j^t e_j - 1)

Compute the partial derivatives with respect to e_m:

\frac{\partial}{\partial e_m} u(e_1, ..., e_k) = 2 S e_m - 2 \lambda_m e_m = 0

Note: e_m is a vector; what we are really doing here is taking partial derivatives with respect to each element of e_m and then arranging them in a linear equation
Thus \lambda_m and e_m are an eigenvalue and eigenvector of the scatter matrix S:

S e_m = \lambda_m e_m
PCA: Derivation

J(e_1, ..., e_k) = \sum_{j=1}^{n} \| x_j \|^2 - \sum_{i=1}^{k} e_i^t S e_i

Let’s plug e_m back into J and use S e_m = \lambda_m e_m:

J(e_1, ..., e_k) = \sum_{j=1}^{n} \| x_j \|^2 - \sum_{i=1}^{k} \lambda_i \| e_i \|^2 = \sum_{j=1}^{n} \| x_j \|^2 - \sum_{i=1}^{k} \lambda_i

The first term is a constant. Thus to minimize J, take for the basis of W the k eigenvectors of S corresponding to the k largest eigenvalues
PCA
The larger the eigenvalue of S, the larger is the variance in the direction of the corresponding eigenvector
[figure: a 2D data cloud with \lambda_1 = 30 along the long axis and \lambda_2 = 0.8 along the short axis]
This result is exactly what we expected: project x into the subspace of dimension k which has the largest variance
This is very intuitive: restrict attention to the directions where the scatter is the greatest
PCA
Thus PCA can be thought of as finding a new orthogonal basis by rotating the old axes until the directions of maximum variance are found
PCA as Data Approximation
Let {e_1, e_2, ..., e_d} be all d eigenvectors of the scatter matrix S, sorted in order of decreasing corresponding eigenvalue
Without any approximation, for any sample x_i:

x_i = \sum_{j=1}^{d} \alpha_j e_j = \underbrace{\alpha_1 e_1 + \dots + \alpha_k e_k}_{\text{approximation of } x_i} + \underbrace{\alpha_{k+1} e_{k+1} + \dots + \alpha_d e_d}_{\text{error of approximation}}

The coefficients \alpha_m = x_i^t e_m are called principal components
The larger k is, the better the approximation
Components are arranged in order of importance; more important components come first
Thus PCA takes the first k most important components of x_i as an approximation to x_i
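
A brief Matlab sketch of this approximation view (an assumed example, not from the slides): reconstruct zero-mean data from its first k principal components and watch the total squared error J shrink as k grows.

n = 200; d = 5;
X = randn(n, d) * diag([3 2 1 0.5 0.1]);   % synthetic data with unequal variances
X = X - repmat(mean(X), n, 1);             % shift to zero mean
S = X' * X;                                % scatter matrix
[V, D] = eig(S);
[~, order] = sort(diag(D), 'descend');     % sort eigenvectors by eigenvalue
V = V(:, order);
for k = 1:d
    E = V(:, 1:k);                         % first k principal directions
    Xhat = (X * E) * E';                   % project onto W and reconstruct
    fprintf('k = %d: total squared error J = %.3f\n', k, sum(sum((X - Xhat).^2)));
end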
PCA: Last Step
Now we know how to project the data
The last step is to change the coordinates to get the final k-dimensional vector y
Let matrix E = [e_1 \; \cdots \; e_k]
Then the coordinate transformation is y = E^t x
Under E^t, the eigenvectors become the standard basis:

E^t e_i = \begin{bmatrix} e_1^t e_i \\ \vdots \\ e_i^t e_i \\ \vdots \\ e_k^t e_i \end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix}
Recipe for Dimension Reduction with PCA
Data D = {x_1, x_2, ..., x_n}. Each x_i is a d-dimensional vector. We wish to use PCA to reduce the dimension to k.
1. Find the sample mean \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i
2. Subtract the sample mean from the data: z_i = x_i - \hat{\mu}
3. Compute the scatter matrix S = \sum_{i=1}^{n} z_i z_i^t
4. Compute the eigenvectors e_1, e_2, ..., e_k corresponding to the k largest eigenvalues of S
5. Let e_1, e_2, ..., e_k be the columns of matrix E = [e_1 \; \cdots \; e_k]
6. The desired y, which is the closest k-dimensional approximation to x, is y = E^t z
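
The recipe can be written as a short Matlab function; this is a sketch, and pca_reduce is a hypothetical name, not something defined in the lecture. X is n x d with one sample per row, Y is the n x k reduced data, and the columns of E are the chosen eigenvectors:

function [Y, E] = pca_reduce(X, k)
    [n, d] = size(X);
    mu = mean(X);                          % step 1: sample mean (1 x d)
    Z = X - repmat(mu, n, 1);              % step 2: subtract the sample mean
    S = Z' * Z;                            % step 3: scatter matrix (d x d)
    [V, D] = eig(S);                       % eigenvectors and eigenvalues of S
    [~, order] = sort(diag(D), 'descend');
    E = V(:, order(1:k));                  % steps 4-5: k largest eigenvectors as columns of E
    Y = Z * E;                             % step 6: y = E' * z for each sample (row of Z)
end

For example, [Y, E] = pca_reduce(X, 2) would reduce the data to 2 dimensions.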
PCA Example Using Matlab
Let D = {(1,2), (2,3), (3,2), (4,4), (5,4), (6,7), (7,6), (9,7)}
It is convenient to arrange the data in an array, one sample per row:

X = \begin{bmatrix} 1 & 2 \\ \vdots & \vdots \\ 9 & 7 \end{bmatrix} = \begin{bmatrix} x_1 \\ \vdots \\ x_8 \end{bmatrix}

Mean: \mu = mean(X) = [4.6 \;\; 4.4]
Subtract the mean from the data to get the new data array Z:

Z = X - repmat(\mu, 8, 1) = \begin{bmatrix} -3.6 & -2.4 \\ \vdots & \vdots \\ 4.4 & 2.6 \end{bmatrix}

Compute the scatter matrix S:

S = 7 * cov(Z) = \begin{bmatrix} -3.6 \\ -2.4 \end{bmatrix} [-3.6 \;\; -2.4] + \dots + \begin{bmatrix} 4.4 \\ 2.6 \end{bmatrix} [4.4 \;\; 2.6] \approx \begin{bmatrix} 49.9 & 35.1 \\ 35.1 & 29.9 \end{bmatrix}

(Matlab uses the unbiased estimate for covariance, so S = (n-1) * cov(Z))
PCA Example Using Matlab
Use [V, D] = eig(S) to get the eigenvalues and eigenvectors of S:

\lambda_1 \approx 76.4 with e_1 \approx \begin{bmatrix} -0.8 \\ -0.6 \end{bmatrix}

\lambda_2 \approx 3.4 with e_2 \approx \begin{bmatrix} 0.6 \\ -0.8 \end{bmatrix}

Projection to 1D space in the direction of e_1:

Y = e_1^t Z^t = [-0.8 \;\; -0.6] \begin{bmatrix} -3.6 & \cdots & 4.4 \\ -2.4 & \cdots & 2.6 \end{bmatrix} \approx [4.3 \;\; \cdots \;\; -5.1] = [y_1 \;\; \cdots \;\; y_8]

(the sign of each eigenvector returned by eig is arbitrary, so the projected values may come out negated)
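
For completeness, a short script (a sketch, not part of the original slides) that re-runs this example end-to-end; as noted above, the eigenvector signs returned by eig are arbitrary, so the projected values may be negated:

X = [1 2; 2 3; 3 2; 4 4; 5 4; 6 7; 7 6; 9 7];
mu = mean(X);
Z = X - repmat(mu, 8, 1);          % subtract the sample mean
S = 7 * cov(Z);                    % scatter matrix, (n-1)*cov(Z)
[V, D] = eig(S);
[~, idx] = max(diag(D));           % eigenvector with the largest eigenvalue
e1 = V(:, idx);
Y = (e1' * Z')'                    % 1-D projection of the 8 samples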
Drawbacks of PCA
PCA was designed for accurate data representation, not for data classification
It preserves as much variance in the data as possible
If the directions of maximum variance are important for classification, PCA will work
However, the directions of maximum variance may be useless for classification
[figure: apply PCA to each class of a two-class data set; the projection direction it finds is not useful for separating the classes]
Next Lecture: Fisher Linear Discriminant, which preserves the direction useful for discrimination
