CS434a/541a: Pattern Recognition Prof. Olga Veksler
Lecture 7
Today
Problems of high dimensional data, "the curse of dimensionality"
running time
overfitting
number of samples required
Dimensionality Reduction Methods
Principal Component Analysis (today)
Fisher Linear Discriminant (next time)
Dimensionality on the Course Road Map
1. Bayesian Decision theory (rare case: a lot is known)
Know probability distribution of the categories
Do not even need training data
Can design optimal classifier
2. ML and Bayesian parameter estimation
The curse of dimensionality affects all of these methods
Curse of Dimensionality: Number of Samples
Of course, when we go from 1 feature to 2, no one gives us more samples; we still have only 9
Common pitfall:
If we can’t solve a problem with a few features, adding
more features seems like a good idea
However, the number of samples usually stays the same
The method with more features is then likely to perform worse, not better as expected
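To make the fixed-sample picture concrete, here is a small numpy sketch (the 9 samples and the 3-bins-per-axis grid are illustrative assumptions, not from the slides) showing how quickly the same sample set becomes sparse as features are added:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 9          # fixed number of samples, as in the slides
bins_per_axis = 3      # illustrative resolution along each feature

for d in (1, 2, 3):
    # place the same number of samples in the unit hypercube [0,1]^d
    x = rng.uniform(size=(n_samples, d))
    # assign each sample to one of bins_per_axis**d cells
    occupied = {tuple((x[i] * bins_per_axis).astype(int)) for i in range(n_samples)}
    total_cells = bins_per_axis ** d
    print(f"d={d}: {total_cells} cells, at most {len(occupied)} occupied "
          f"({len(occupied) / total_cells:.0%} coverage)")
```

With 1 feature the 9 samples cover every cell; with 3 features at most a third of the cells contain any data at all.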
Curse of Dimensionality: Number of Samples
For a fixed number of samples, as we add features, the classification error first decreases and then increases again:
[Figure: classification error vs. number of features, with a minimum at the optimal number of features]
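A hedged numpy sketch of this curve (the two-Gaussian data model, the 0.8^i feature separations, and the nearest-class-mean classifier are my own illustrative assumptions, not from the lecture): with a small, fixed training set, test error first drops as informative features are added, then rises as noisy estimates in many dimensions take over.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n_per_class, d):
    """Two Gaussian classes; feature i separates the classes by +/- 0.8**i,
    so later features carry less and less information."""
    shift = 0.8 ** np.arange(d)
    x0 = rng.normal(size=(n_per_class, d)) - shift
    x1 = rng.normal(size=(n_per_class, d)) + shift
    X = np.vstack([x0, x1])
    y = np.repeat([0, 1], n_per_class)
    return X, y

def nearest_mean_error(d, n_train=10, n_test=2000):
    """Train a nearest-class-mean classifier on a small fixed training set."""
    Xtr, ytr = make_data(n_train, d)
    Xte, yte = make_data(n_test, d)
    mu0, mu1 = Xtr[ytr == 0].mean(axis=0), Xtr[ytr == 1].mean(axis=0)
    pred = (np.linalg.norm(Xte - mu1, axis=1) <
            np.linalg.norm(Xte - mu0, axis=1)).astype(int)
    return (pred != yte).mean()

# test error first drops, then rises again: the curse of dimensionality
for d in (1, 2, 5, 10, 50, 200, 500):
    errs = [nearest_mean_error(d) for _ in range(30)]
    print(f"# features d = {d:3d}   test error ~ {np.mean(errs):.3f}")
```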
For example,
$$x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} \;\rightarrow\; y = \begin{bmatrix} x_1 + x_2 \\ x_3 + x_4 \end{bmatrix}$$
Ideally, the new vector y should retain all the information in x that is important for classification
Dimensionality Reduction
The best mapping y = f(x) is most likely a non-linear function
However, linear functions are easier to find
For now, assume that f(x) is a linear mapping
Thus it can be represented by a matrix W:
$$y = W x = \begin{bmatrix} w_{11} & \cdots & w_{1d} \\ \vdots & \ddots & \vdots \\ w_{k1} & \cdots & w_{kd} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix} = \begin{bmatrix} y_1 \\ \vdots \\ y_k \end{bmatrix}, \qquad \text{with } k < d$$
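As a concrete sketch, here is a W that reproduces the earlier example y = (x1 + x2, x3 + x4); this particular W and the sample x are just illustrations:

```python
import numpy as np

# W is k x d with k < d; this choice reproduces y = (x1 + x2, x3 + x4)
W = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]])

x = np.array([1.0, 2.0, 3.0, 4.0])   # a sample with d = 4 features
y = W @ x                            # reduced representation with k = 2
print(y)                             # [3. 7.]
```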
Feature Combination
[Figure: samples plotted in dimension 1 vs. dimension 2 and projected onto a line; this line is a one-dimensional subspace of R²]
PCA Derivation: Shift by the Mean Vector
Before PCA, subtract sample mean from the data
$$x \;\rightarrow\; x - \frac{1}{n}\sum_{i=1}^{n} x_i = x - \hat{\mu}$$
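A short numpy version of this centering step, on toy data (the matrix X below, holding one sample per row, is made up):

```python
import numpy as np

X = np.array([[2.0, 0.0],
              [4.0, 2.0],
              [6.0, 4.0]])          # n = 3 samples, d = 2 features, one per row

mu_hat = X.mean(axis=0)             # sample mean, (1/n) * sum_i x_i
X_centered = X - mu_hat             # subtract the mean from every sample
print(mu_hat)                       # [4. 2.]
print(X_centered.mean(axis=0))      # ~[0. 0.]: centered data has zero mean
```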
Approximate each (mean-subtracted) sample by a linear combination of k orthonormal basis vectors $e_1, \dots, e_k$.
Error of this representation for $x_1$:
$$\text{error} = \left\| x_1 - \sum_{i=1}^{k} \alpha_{1i} e_i \right\|^2$$
PCA: Derivation
To find the total error, we need to sum over all $x_j$'s
Any $x_j$ can be approximated as $\sum_{i=1}^{k} \alpha_{ji} e_i$
$$J(e_1, \dots, e_k, \alpha_{11}, \dots, \alpha_{nk}) = \sum_{j=1}^{n} \left\| x_j - \sum_{i=1}^{k} \alpha_{ji} e_i \right\|^2$$
The $e_i$ and $\alpha_{ji}$ are the unknowns; each summand is the error at one point
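A small numpy sketch of this objective on made-up data: J sums, over all points, the squared error of approximating x_j by its expansion in the directions e_i (the choice of n, d, k and the random orthonormal directions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 5, 2

X = rng.normal(size=(n, d))
X -= X.mean(axis=0)                       # mean-subtracted data, rows are x_j

# some orthonormal directions e_1, ..., e_k (columns of E)
E, _ = np.linalg.qr(rng.normal(size=(d, k)))

def J(E, alpha):
    """Total squared reconstruction error J(e_1..e_k, alpha_11..alpha_nk)."""
    reconstruction = alpha @ E.T          # row j is sum_i alpha_ji * e_i
    return np.sum((X - reconstruction) ** 2)

alpha = rng.normal(size=(n, k))           # arbitrary coefficients, to be optimized
print(J(E, alpha))
```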
PCA: Derivation
To minimize J, take partial derivatives and also enforce the constraint that $\{e_1, e_2, \dots, e_k\}$ are orthonormal
$$J(e_1, \dots, e_k, \alpha_{11}, \dots, \alpha_{nk}) = \sum_{j=1}^{n} \left\| x_j - \sum_{i=1}^{k} \alpha_{ji} e_i \right\|^2$$
Setting the partial derivative with respect to $\alpha_{ml}$ to zero:
$$-2\, x_m^t e_l + 2\, \alpha_{ml} = 0 \quad\Rightarrow\quad \alpha_{ml} = x_m^t e_l$$
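A quick numeric check of this result on made-up data: for orthonormal directions, the error-minimizing coefficients are exactly the projections x_m^t e_l, matching a generic least-squares solve:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 2

x_m = rng.normal(size=d)                      # one (mean-subtracted) sample
E, _ = np.linalg.qr(rng.normal(size=(d, k)))  # orthonormal e_1, ..., e_k as columns

alpha_projection = E.T @ x_m                  # alpha_ml = x_m^t e_l
alpha_lstsq, *_ = np.linalg.lstsq(E, x_m, rcond=None)  # generic least squares

print(np.allclose(alpha_projection, alpha_lstsq))       # True
```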
PCA: Derivation
Expanding the squared norm (and using $e_i^t e_j = 0$ for $i \ne j$, $e_i^t e_i = 1$):
$$J(e_1, \dots, e_k, \alpha_{11}, \dots, \alpha_{nk}) = \sum_{j=1}^{n} \|x_j\|^2 - 2\sum_{j=1}^{n}\sum_{i=1}^{k} \alpha_{ji}\, x_j^t e_i + \sum_{j=1}^{n}\sum_{i=1}^{k} \alpha_{ji}^2$$
Substituting $\alpha_{ji} = x_j^t e_i$:
$$J(e_1, \dots, e_k) = \sum_{j=1}^{n} \|x_j\|^2 - 2\sum_{j=1}^{n}\sum_{i=1}^{k} \left(x_j^t e_i\right)^2 + \sum_{j=1}^{n}\sum_{i=1}^{k} \left(x_j^t e_i\right)^2$$
Can simplify J:
$$J(e_1, \dots, e_k) = \sum_{j=1}^{n} \|x_j\|^2 - \sum_{j=1}^{n}\sum_{i=1}^{k} \left(x_j^t e_i\right)^2$$
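A numeric check of this simplification (random centered data and random orthonormal directions, purely illustrative): with α_ji = x_j^t e_i substituted, the direct objective and the simplified form agree:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 5, 2

X = rng.normal(size=(n, d))
X -= X.mean(axis=0)
E, _ = np.linalg.qr(rng.normal(size=(d, k)))   # orthonormal e_i as columns

alpha = X @ E                                  # alpha_ji = x_j^t e_i
J_direct = np.sum((X - alpha @ E.T) ** 2)      # sum_j ||x_j - sum_i alpha_ji e_i||^2
J_simplified = np.sum(X ** 2) - np.sum((X @ E) ** 2)  # sum ||x_j||^2 - sum (x_j^t e_i)^2

print(np.allclose(J_direct, J_simplified))     # True
```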
PCA: Derivation
$$J(e_1, \dots, e_k) = \sum_{j=1}^{n} \|x_j\|^2 - \sum_{j=1}^{n}\sum_{i=1}^{k} \left(x_j^t e_i\right)^2 = \sum_{j=1}^{n} \|x_j\|^2 - \sum_{i=1}^{k} e_i^t S\, e_i$$
where $S = \sum_{j=1}^{n} x_j x_j^t$ is the scatter matrix of the (mean-subtracted) data, equal to $n-1$ times the sample covariance matrix.
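A short sketch of the identity behind this step, on made-up data: for centered samples stacked as the rows of a matrix X, the scatter matrix is S = XᵀX, it equals (n−1) times numpy's sample covariance, and Σ_j (x_j^t e)² = e^t S e:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5

X = rng.normal(size=(n, d))
X -= X.mean(axis=0)                       # rows are the mean-subtracted x_j

S = X.T @ X                               # scatter matrix, sum_j x_j x_j^t
print(np.allclose(S, (n - 1) * np.cov(X, rowvar=False)))   # True

e = rng.normal(size=d)
e /= np.linalg.norm(e)                    # a unit-length direction
print(np.allclose(np.sum((X @ e) ** 2), e @ S @ e))        # True
```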
PCA: Derivation
$$J(e_1, \dots, e_k) = \sum_{j=1}^{n} \|x_j\|^2 - \sum_{i=1}^{k} e_i^t S\, e_i$$
The first term is constant (it does not depend on the $e_i$)
Minimizing J is therefore equivalent to maximizing $\sum_{i=1}^{k} e_i^t S\, e_i$
We should also enforce the constraints $e_i^t e_i = 1$ for all $i$
Use the method of Lagrange multipliers: incorporate the constraints with undetermined multipliers $\lambda_1, \dots, \lambda_k$
Need to maximize the new function
$$u(e_1, \dots, e_k) = \sum_{i=1}^{k} e_i^t S\, e_i - \sum_{j=1}^{k} \lambda_j \left( e_j^t e_j - 1 \right)$$
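As a sanity check on where this constrained problem is heading (the eigenvector solution is the standard result, asserted here rather than derived, and the data is made up), the top-k eigenvectors of S give a larger value of Σ_i e_i^t S e_i than random orthonormal directions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 6, 2

X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))   # correlated toy data
X -= X.mean(axis=0)
S = X.T @ X                                              # scatter matrix

def objective(E):
    """sum_i e_i^t S e_i for orthonormal columns e_i of E."""
    return np.trace(E.T @ S @ E)

eigvals, eigvecs = np.linalg.eigh(S)                # eigenvalues in ascending order
E_pca = eigvecs[:, -k:]                             # top-k eigenvectors of S
E_rand, _ = np.linalg.qr(rng.normal(size=(d, k)))   # random orthonormal directions

print(objective(E_pca), ">=", objective(E_rand))          # PCA directions win
print(np.isclose(objective(E_pca), eigvals[-k:].sum()))   # equals sum of top eigenvalues
```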
PCA: Derivation
$$u(e_1, \dots, e_k) = \sum_{i=1}^{k} e_i^t S\, e_i - \sum_{j=1}^{k} \lambda_j \left( e_j^t e_j - 1 \right)$$
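Setting $\partial u / \partial e_i = 0$ gives $2 S e_i - 2 \lambda_i e_i = 0$, i.e. $S e_i = \lambda_i e_i$, so the optimal directions are eigenvectors of S, and J is minimized by keeping the k eigenvectors with the largest eigenvalues. This completion of the derivation is standard PCA, stated here only to ground the sketch below; a minimal end-to-end numpy version of the resulting procedure (function name, variable names, and toy data are my own):

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto the top-k principal directions.

    Follows the derivation above: subtract the sample mean, form the
    scatter matrix S = sum_j x_j x_j^t, take the k eigenvectors of S
    with the largest eigenvalues, and use them as the rows of W.
    """
    mu_hat = X.mean(axis=0)
    Xc = X - mu_hat                          # shift by the mean vector
    S = Xc.T @ Xc                            # scatter matrix
    eigvals, eigvecs = np.linalg.eigh(S)     # ascending eigenvalues
    E = eigvecs[:, ::-1][:, :k]              # top-k eigenvectors e_1, ..., e_k
    W = E.T                                  # k x d linear mapping y = W(x - mu)
    return Xc @ E, W, mu_hat                 # y_j = W (x_j - mu_hat)

# toy data: 3-d samples that mostly vary along one direction
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1)) @ np.array([[2.0, 1.0, 0.5]]) \
    + 0.1 * rng.normal(size=(200, 3))
Y, W, mu_hat = pca_reduce(X, k=1)
print(Y.shape, W.shape)                      # (200, 1) (1, 3)
```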