
CS434a/541a: Pattern Recognition

Prof. Olga Veksler

Lecture 7
Today
Problems of high dimensional data, “the curse of dimensionality”
  running time
  overfitting
  number of samples required
Dimensionality Reduction Methods
  Principal Component Analysis (today)
  Fisher Linear Discriminant (next time)
Dimensionality on the Course Road Map
1. Bayesian Decision theory (rare case, a lot is known)
   Know probability distribution of the categories
   Do not even need training data
   Can design optimal classifier
2. ML and Bayesian parameter estimation
   Need to estimate parameters of probability distribution
   Need training data
3. Non-Parametric Methods
   No probability distribution, labeled data
4. Linear discriminant functions and Neural Nets
   The shape of discriminant functions is known
   Need to estimate parameters of discriminant functions
5. Unsupervised Learning and Clustering
   No probability distribution and unlabeled data (little is known)
The curse of dimensionality affects all these methods
Curse of Dimensionality: Complexity
Complexity (running time) increases with dimension d
A lot of methods have at least O(nd²) complexity, where n is the number of samples
For example, this is the case if we need to estimate the covariance matrix
So as d becomes large, O(nd²) complexity may be too costly
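
As a rough illustration of this scaling, here is a small Matlab timing sketch (not part of the original slides; the values of n and d are arbitrary choices):

% Estimating a covariance matrix touches all d^2 entries for each of the
% n samples, so the cost grows roughly as O(n*d^2).
n = 5000;
for d = [50 100 200 400]
    X = randn(n, d);              % n samples in d dimensions
    tic;
    S = cov(X);                   % d x d sample covariance matrix
    fprintf('d = %3d: cov took %.3f seconds\n', d, toc);
end

Doubling d roughly quadruples the work, which is exactly the d² factor in O(nd²).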
Curse of Dimensionality: Overfitting
If d is large, n, the number of samples, may be too small for accurate parameter estimation
For example, the covariance matrix has d² parameters:

\Sigma = \begin{bmatrix} \sigma_1^2 & \cdots & \sigma_{1d} \\ \vdots & & \vdots \\ \sigma_{d1} & \cdots & \sigma_d^2 \end{bmatrix}

For accurate estimation, n should be much bigger than d²
Otherwise the model is too complicated for the data, and we get overfitting
Curse of Dimensionality: Overfitting
Paradox: if n < d², we are better off assuming that the features are uncorrelated, even if we know this assumption is wrong
In this case, the covariance matrix has only d parameters:

\Sigma = \begin{bmatrix} \sigma_1^2 & & 0 \\ & \ddots & \\ 0 & & \sigma_d^2 \end{bmatrix}

We are likely to avoid overfitting because we fit a model with fewer parameters
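
To make the paradox concrete, here is a small Matlab sketch (an assumed example, not from the slides; the sizes are arbitrary). With n smaller than d, let alone d², the full covariance estimate is not even invertible, while the diagonal (uncorrelated-features) estimate with only d parameters stays well behaved:

n = 15; d = 20;                   % here n < d < d^2
X = randn(n, d);                  % n samples in d dimensions
S_full = cov(X);                  % full covariance, d(d+1)/2 free parameters
S_diag = diag(var(X));            % diagonal covariance, only d parameters
fprintf('rank of full estimate: %d (need %d to invert)\n', rank(S_full), d);
fprintf('rank of diagonal estimate: %d\n', rank(S_diag));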
Curse of Dimensionality: Number of Samples
Suppose we want to use the nearest neighbor approach with k = 1 (1NN)
Suppose we start with only one feature, with values in the interval [0, 1]
This feature is not discriminative, i.e. it does not separate the classes well
We decide to use 2 features. For the 1NN method to work well, we need a lot of samples, i.e. the samples have to be dense
To maintain the same density as in 1D (9 samples per unit length), how many samples do we need?
Curse of Dimensionality: Number of Samples
We need 9² = 81 samples to maintain the same density as in 1D (a 9 by 9 grid over the unit square)
Curse of Dimensionality: Number of Samples
Of course, when we go from 1 feature to 2, no one gives us more samples; we still have 9
This is way too sparse for 1NN to work well
Curse of Dimensionality: Number of Samples
Things go from bad to worse if we decide to use 3 features
If 9 samples were dense enough in 1D, in 3D we need 9³ = 729 samples!
Curse of Dimensionality: Number of Samples
In general, if n samples are dense enough in 1D, then in d dimensions we need n^d samples!
And n^d grows really fast as a function of d
Common pitfall:
If we can’t solve a problem with a few features, adding more features seems like a good idea
However, the number of samples usually stays the same
The method with more features is then likely to perform worse instead of better
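
A quick Matlab calculation (not from the slides) shows how fast n^d explodes for n = 9:

n = 9;
for d = 1:6
    fprintf('d = %d: need n^d = %g samples\n', d, n^d);
end

Already at d = 6 this asks for over half a million samples.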
Curse of Dimensionality: Number of Samples
For a fixed number of samples, as we add features, the classification error first decreases and then starts to increase again
[figure: classification error vs. number of features, with a minimum at the optimal number of features]
Thus for each fixed sample size n, there is an optimal number of features to use
The Curse of Dimensionality
We should try to avoid creating lots of features
Often we have no choice: the problem starts with many features
Example: Face Detection
One sample point is a k by m array of pixels
Feature extraction is not trivial, so usually every pixel is taken as a feature
Typical dimension is 20 by 20 = 400
Suppose 10 samples are dense enough for 1 dimension. Then we need “only” 10^400 samples
The Curse of Dimensionality
Face Detection: the dimension of one sample point is km
The fact that we set up the problem with km dimensions (features) does not mean it is really a km-dimensional problem
The space of all k by m images has km dimensions
The space of all k by m faces must be much smaller, since faces form a tiny fraction of all possible images
Most likely we are not setting the problem up with the right features
If we used better features, we would likely need much fewer than km dimensions
Dimensionality Reduction
High dimensionality is challenging and redundant
It is natural to try to reduce dimensionality
Reduce dimensionality by feature combination: combine old features x to create new features y

x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} \rightarrow f\left( \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} \right) = \begin{bmatrix} y_1 \\ \vdots \\ y_k \end{bmatrix} = y, with k < d

For example,

x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} \rightarrow \begin{bmatrix} x_1 + x_2 \\ x_3 + x_4 \end{bmatrix} = y

Ideally, the new vector y should retain from x all information important for classification
Dimensionality Reduction
The best f(x) is most likely a non-linear function
Linear functions are easier to find, though
For now, assume that f(x) is a linear mapping
Thus it can be represented by a k × d matrix W:

W \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} = \begin{bmatrix} w_{11} & \cdots & w_{1d} \\ \vdots & & \vdots \\ w_{k1} & \cdots & w_{kd} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} = \begin{bmatrix} y_1 \\ \vdots \\ y_k \end{bmatrix}, with k < d
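
As a small sketch (not from the slides), the earlier feature-combination example y = [x_1 + x_2; x_3 + x_4] written as a linear map y = W x in Matlab, with an arbitrary input vector:

W = [1 1 0 0;
     0 0 1 1];            % 2 x 4 matrix: each row sums a pair of old features
x = [3; 5; 2; 7];         % an arbitrary 4-dimensional feature vector
y = W * x                 % gives [8; 9], the 2-dimensional combined features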
Feature Combination
We will look at 2 methods for feature combination:
Principal Component Analysis (PCA)
Fisher Linear Discriminant (next lecture)
Principal Component Analysis (PCA)
Main idea: seek the most accurate data representation in a lower-dimensional space
Example in 2-D: project the data to a 1-D subspace (a line) which minimizes the projection error
[figure: two candidate lines; one gives large projection errors (a bad line to project to), the other gives small projection errors (a good line to project to)]
Notice that the good line to use for projection lies in the direction of largest variance
PCA
After the data is projected onto the best line, we need to transform the coordinate system to get a 1-D representation for vector y
Note that the new data y has the same variance as the old data x in the direction of the projection line
PCA preserves the largest variances in the data. We will prove this statement; for now it is just an intuition of what PCA will do
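
A quick numerical check of this intuition (an assumed example, not from the slides): project correlated 2-D data onto the direction of largest variance and compare the variance of the 1-D projection with the largest eigenvalue of the data's covariance matrix. The two numbers agree, since var(X e1) = e1' cov(X) e1 = lambda_1.

X = randn(500, 2) * [2 1; 0 0.3];     % 500 correlated 2-D samples (values vary run to run)
[V, D] = eig(cov(X));                 % eigenvectors/eigenvalues of the covariance
[~, idx] = max(diag(D));              % index of the largest eigenvalue
e1 = V(:, idx);                       % direction of largest variance
y = X * e1;                           % 1-D projection onto that direction
fprintf('variance of projection: %.3f, largest eigenvalue: %.3f\n', var(y), D(idx, idx));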
PCA: Approximation of Elliptical Cloud in 3D
[figure: the best 2D approximation and the best 1D approximation of an elliptical cloud of points in 3D]
PCA
What is the direction of largest variance in the data?
Recall that if x has multivariate distribution N(µ, Σ), the direction of largest variance is given by the eigenvector corresponding to the largest eigenvalue of Σ
This is a hint that we should be looking at the covariance matrix of the data (note that PCA can be applied to distributions other than Gaussian)
PCA: Linear Algebra for Derivation
Let V be a d-dimensional linear space, and W be a k-dimensional linear subspace of V
We can always find a set of d-dimensional vectors {e_1, e_2, ..., e_k} which forms an orthonormal basis for W:
\langle e_i, e_j \rangle = 0 if i \neq j and \langle e_i, e_i \rangle = 1
Thus any vector in W can be written as

\alpha_1 e_1 + \alpha_2 e_2 + \dots + \alpha_k e_k = \sum_{i=1}^{k} \alpha_i e_i  for scalars \alpha_1, ..., \alpha_k

Example: let V = R² and let W be the line x - 2y = 0. The points on this line satisfy x = 2y, so an orthonormal basis for W is the single vector

\begin{bmatrix} 2/\sqrt{5} \\ 1/\sqrt{5} \end{bmatrix}
PCA: Linear Algebra for Derivation
Recall that a subspace W contains the zero vector, i.e. it goes through the origin
[figure: a line that does not pass through the origin is not a subspace of R²; a line through the origin is a subspace of R²]
For the derivation, it will be convenient to project to a subspace W; thus we need to shift everything to the origin first
PCA Derivation: Shift by the Mean Vector
Before PCA, subtract the sample mean from the data:

x - \frac{1}{n} \sum_{i=1}^{n} x_i = x - \hat{\mu}

The new data has zero mean: E(X - E(X)) = E(X) - E(X) = 0
All we did is change the coordinate system
[figure: the same data in the original coordinates and in coordinates centered at the sample mean \hat{\mu}]
Another way to look at it: the first step of getting y is to subtract the mean of x:

x \rightarrow y = f(x) = g(x - \hat{\mu})
PCA: Derivation
We want to find the most accurate representation of data D = {x_1, x_2, ..., x_n} in some subspace W which has dimension k < d
Let {e_1, e_2, ..., e_k} be the orthonormal basis for W. Any vector in W can be written as \sum_{i=1}^{k} \alpha_i e_i
Thus x_1 will be represented by some vector in W:

\sum_{i=1}^{k} \alpha_{1i} e_i

The error of this representation is:

error = \left\| x_1 - \sum_{i=1}^{k} \alpha_{1i} e_i \right\|^2
PCA: Derivation
To find the total error, we need to sum over all x_j
Any x_j can be written as \sum_{i=1}^{k} \alpha_{ji} e_i
Thus the total error for the representation of all data D is the sum over all data points of the error at each point:

J(e_1, ..., e_k, \alpha_{11}, ..., \alpha_{nk}) = \sum_{j=1}^{n} \left\| x_j - \sum_{i=1}^{k} \alpha_{ji} e_i \right\|^2

The unknowns are the basis vectors e_i and the coefficients \alpha_{ji}
PCA: Derivation
To minimize J, we need to take partial derivatives and also enforce the constraint that {e_1, e_2, ..., e_k} are orthonormal

J(e_1, ..., e_k, \alpha_{11}, ..., \alpha_{nk}) = \sum_{j=1}^{n} \left\| x_j - \sum_{i=1}^{k} \alpha_{ji} e_i \right\|^2

Let us simplify J first:

J(e_1, ..., e_k, \alpha_{11}, ..., \alpha_{nk}) = \sum_{j=1}^{n} \| x_j \|^2 - 2 \sum_{j=1}^{n} x_j^t \sum_{i=1}^{k} \alpha_{ji} e_i + \sum_{j=1}^{n} \sum_{i=1}^{k} \alpha_{ji}^2
 = \sum_{j=1}^{n} \| x_j \|^2 - 2 \sum_{j=1}^{n} \sum_{i=1}^{k} \alpha_{ji} x_j^t e_i + \sum_{j=1}^{n} \sum_{i=1}^{k} \alpha_{ji}^2
PCA: Derivation

J(e_1, ..., e_k, \alpha_{11}, ..., \alpha_{nk}) = \sum_{j=1}^{n} \| x_j \|^2 - 2 \sum_{j=1}^{n} \sum_{i=1}^{k} \alpha_{ji} x_j^t e_i + \sum_{j=1}^{n} \sum_{i=1}^{k} \alpha_{ji}^2

First take partial derivatives with respect to \alpha_{ml}:

\frac{\partial}{\partial \alpha_{ml}} J(e_1, ..., e_k, \alpha_{11}, ..., \alpha_{nk}) = -2 x_m^t e_l + 2 \alpha_{ml}

Thus the optimal value for \alpha_{ml} is:

-2 x_m^t e_l + 2 \alpha_{ml} = 0 \quad \Rightarrow \quad \alpha_{ml} = x_m^t e_l
PCA: Derivation

J(e_1, ..., e_k, \alpha_{11}, ..., \alpha_{nk}) = \sum_{j=1}^{n} \| x_j \|^2 - 2 \sum_{j=1}^{n} \sum_{i=1}^{k} \alpha_{ji} x_j^t e_i + \sum_{j=1}^{n} \sum_{i=1}^{k} \alpha_{ji}^2

Plug the optimal value \alpha_{ml} = x_m^t e_l back into J:

J(e_1, ..., e_k) = \sum_{j=1}^{n} \| x_j \|^2 - 2 \sum_{j=1}^{n} \sum_{i=1}^{k} (x_j^t e_i)(x_j^t e_i) + \sum_{j=1}^{n} \sum_{i=1}^{k} (x_j^t e_i)^2

This simplifies to:

J(e_1, ..., e_k) = \sum_{j=1}^{n} \| x_j \|^2 - \sum_{j=1}^{n} \sum_{i=1}^{k} (x_j^t e_i)^2
PCA: Derivation

J(e_1, ..., e_k) = \sum_{j=1}^{n} \| x_j \|^2 - \sum_{j=1}^{n} \sum_{i=1}^{k} (x_j^t e_i)^2

Rewrite J using (a^t b)^2 = (a^t b)(a^t b) = (b^t a)(a^t b) = b^t (a a^t) b:

J(e_1, ..., e_k) = \sum_{j=1}^{n} \| x_j \|^2 - \sum_{i=1}^{k} e_i^t \left( \sum_{j=1}^{n} x_j x_j^t \right) e_i = \sum_{j=1}^{n} \| x_j \|^2 - \sum_{i=1}^{k} e_i^t S e_i

where S = \sum_{j=1}^{n} x_j x_j^t

S is called the scatter matrix; it is just n - 1 times the sample covariance matrix we have seen before,

\hat{\Sigma} = \frac{1}{n-1} \sum_{j=1}^{n} (x_j - \hat{\mu})(x_j - \hat{\mu})^t

(recall that the data has already been shifted so that \hat{\mu} = 0)
PCA: Derivation

J(e_1, ..., e_k) = \sum_{j=1}^{n} \| x_j \|^2 - \sum_{i=1}^{k} e_i^t S e_i

The first term is a constant, so minimizing J is equivalent to maximizing \sum_{i=1}^{k} e_i^t S e_i
We should also enforce the constraints e_i^t e_i = 1 for all i
Use the method of Lagrange multipliers: incorporate the constraints with undetermined multipliers \lambda_1, ..., \lambda_k
We need to maximize the new function u:

u(e_1, ..., e_k) = \sum_{i=1}^{k} e_i^t S e_i - \sum_{j=1}^{k} \lambda_j (e_j^t e_j - 1)
PCA: Derivation
If x is a vector and f(x) = f(x_1, ..., x_d) is a function, to simplify notation define

\frac{d}{dx} f(x) = \begin{bmatrix} \partial f / \partial x_1 \\ \vdots \\ \partial f / \partial x_d \end{bmatrix}

It can be shown that \frac{d}{dx} (x^t x) = 2x

If A is a symmetric matrix, it can be shown that \frac{d}{dx} (x^t A x) = 2Ax
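
A numeric spot check of these two identities (an assumed example, not from the slides), comparing the analytic gradients 2x and 2Ax with finite-difference approximations:

d = 4;
x = randn(d, 1);
A = randn(d); A = (A + A') / 2;                 % make A symmetric
g1 = zeros(d, 1); g2 = zeros(d, 1);
h = 1e-6;                                       % finite-difference step
for i = 1:d
    dx = zeros(d, 1); dx(i) = h;
    g1(i) = ((x+dx)'*(x+dx) - x'*x) / h;        % approximates d/dx_i of x'x
    g2(i) = ((x+dx)'*A*(x+dx) - x'*A*x) / h;    % approximates d/dx_i of x'Ax
end
disp([g1, 2*x]);          % the two columns should nearly match
disp([g2, 2*A*x]);        % the two columns should nearly match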
PCA: Derivation

u(e_1, ..., e_k) = \sum_{i=1}^{k} e_i^t S e_i - \sum_{j=1}^{k} \lambda_j (e_j^t e_j - 1)

Compute the partial derivatives with respect to e_m:

\frac{\partial}{\partial e_m} u(e_1, ..., e_k) = 2 S e_m - 2 \lambda_m e_m = 0

Note: e_m is a vector; what we are really doing here is taking partial derivatives with respect to each element of e_m and then arranging them in a linear equation
Thus \lambda_m and e_m are an eigenvalue and eigenvector of the scatter matrix S:

S e_m = \lambda_m e_m
PCA: Derivation

J(e_1, ..., e_k) = \sum_{j=1}^{n} \| x_j \|^2 - \sum_{i=1}^{k} e_i^t S e_i

Let’s plug e_m back into J and use S e_m = \lambda_m e_m:

J(e_1, ..., e_k) = \sum_{j=1}^{n} \| x_j \|^2 - \sum_{i=1}^{k} \lambda_i \| e_i \|^2 = \sum_{j=1}^{n} \| x_j \|^2 - \sum_{i=1}^{k} \lambda_i

The first term is a constant. Thus to minimize J, take for the basis of W the k eigenvectors of S corresponding to the k largest eigenvalues
PCA
The larger the eigenvalue of S, the larger is the variance in the direction of the corresponding eigenvector
[figure: a 2D data cloud with \lambda_1 = 30 along the long axis and \lambda_2 = 0.8 along the short axis]
This result is exactly what we expected: project x into the subspace of dimension k which has the largest variance
This is very intuitive: restrict attention to the directions where the scatter is the greatest
PCA
Thus PCA can be thought of as finding a new orthogonal basis by rotating the old axes until the directions of maximum variance are found
PCA as Data Approximation
Let {e_1, e_2, ..., e_d} be all d eigenvectors of the scatter matrix S, sorted in order of decreasing corresponding eigenvalue
Without any approximation, for any sample x_i:

x_i = \sum_{j=1}^{d} \alpha_j e_j = \underbrace{\alpha_1 e_1 + \dots + \alpha_k e_k}_{\text{approximation of } x_i} + \underbrace{\alpha_{k+1} e_{k+1} + \dots + \alpha_d e_d}_{\text{error of approximation}}

The coefficients \alpha_m = x_i^t e_m are called principal components
The larger k is, the better the approximation
Components are arranged in order of importance; more important components come first
Thus PCA takes the first k most important components of x_i as an approximation to x_i
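
A brief Matlab sketch of this approximation view (an assumed example, not from the slides): reconstruct zero-mean data from its first k principal components and watch the total squared error J shrink as k grows.

n = 200; d = 5;
X = randn(n, d) * diag([3 2 1 0.5 0.1]);   % synthetic data with unequal variances
X = X - repmat(mean(X), n, 1);             % shift to zero mean
S = X' * X;                                % scatter matrix
[V, D] = eig(S);
[~, order] = sort(diag(D), 'descend');     % sort eigenvectors by eigenvalue
V = V(:, order);
for k = 1:d
    E = V(:, 1:k);                         % first k principal directions
    Xhat = (X * E) * E';                   % project onto W and reconstruct
    fprintf('k = %d: total squared error J = %.3f\n', k, sum(sum((X - Xhat).^2)));
end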
PCA: Last Step
Now we know how to project the data
The last step is to change the coordinates to get the final k-dimensional vector y
Let matrix E = [e_1 \; \cdots \; e_k]
Then the coordinate transformation is y = E^t x
Under E^t, the eigenvectors become the standard basis:

E^t e_i = \begin{bmatrix} e_1^t e_i \\ \vdots \\ e_i^t e_i \\ \vdots \\ e_k^t e_i \end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix}
Recipe for Dimension Reduction with PCA
Data D = {x_1, x_2, ..., x_n}. Each x_i is a d-dimensional vector. We wish to use PCA to reduce the dimension to k.
1. Find the sample mean \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i
2. Subtract the sample mean from the data: z_i = x_i - \hat{\mu}
3. Compute the scatter matrix S = \sum_{i=1}^{n} z_i z_i^t
4. Compute the eigenvectors e_1, e_2, ..., e_k corresponding to the k largest eigenvalues of S
5. Let e_1, e_2, ..., e_k be the columns of matrix E = [e_1 \; \cdots \; e_k]
6. The desired y, which is the closest k-dimensional approximation to x, is y = E^t z
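
The recipe can be written as a short Matlab function; this is a sketch, and pca_reduce is a hypothetical name, not something defined in the lecture. X is n x d with one sample per row, Y is the n x k reduced data, and the columns of E are the chosen eigenvectors:

function [Y, E] = pca_reduce(X, k)
    [n, d] = size(X);
    mu = mean(X);                          % step 1: sample mean (1 x d)
    Z = X - repmat(mu, n, 1);              % step 2: subtract the sample mean
    S = Z' * Z;                            % step 3: scatter matrix (d x d)
    [V, D] = eig(S);                       % eigenvectors and eigenvalues of S
    [~, order] = sort(diag(D), 'descend');
    E = V(:, order(1:k));                  % steps 4-5: k largest eigenvectors as columns of E
    Y = Z * E;                             % step 6: y = E' * z for each sample (row of Z)
end

For example, [Y, E] = pca_reduce(X, 2) would reduce the data to 2 dimensions.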
PCA Example Using Matlab
Let D = {(1,2), (2,3), (3,2), (4,4), (5,4), (6,7), (7,6), (9,7)}
It is convenient to arrange the data in an array, one sample per row:

X = \begin{bmatrix} 1 & 2 \\ \vdots & \vdots \\ 9 & 7 \end{bmatrix} = \begin{bmatrix} x_1 \\ \vdots \\ x_8 \end{bmatrix}

Mean: \mu = mean(X) = [4.6 \;\; 4.4]
Subtract the mean from the data to get the new data array Z:

Z = X - repmat(\mu, 8, 1) = \begin{bmatrix} -3.6 & -2.4 \\ \vdots & \vdots \\ 4.4 & 2.6 \end{bmatrix}

Compute the scatter matrix S:

S = 7 * cov(Z) = \begin{bmatrix} -3.6 \\ -2.4 \end{bmatrix} [-3.6 \;\; -2.4] + \dots + \begin{bmatrix} 4.4 \\ 2.6 \end{bmatrix} [4.4 \;\; 2.6] \approx \begin{bmatrix} 49.9 & 35.1 \\ 35.1 & 29.9 \end{bmatrix}

(Matlab uses the unbiased estimate for covariance, so S = (n-1) * cov(Z))
PCA Example Using Matlab
Use [V, D] = eig(S) to get the eigenvalues and eigenvectors of S:

\lambda_1 \approx 76.4 with e_1 \approx \begin{bmatrix} -0.8 \\ -0.6 \end{bmatrix}

\lambda_2 \approx 3.4 with e_2 \approx \begin{bmatrix} 0.6 \\ -0.8 \end{bmatrix}

Projection to 1D space in the direction of e_1:

Y = e_1^t Z^t = [-0.8 \;\; -0.6] \begin{bmatrix} -3.6 & \cdots & 4.4 \\ -2.4 & \cdots & 2.6 \end{bmatrix} \approx [4.3 \;\; \cdots \;\; -5.1] = [y_1 \;\; \cdots \;\; y_8]

(the sign of each eigenvector returned by eig is arbitrary, so the projected values may come out negated)
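
For completeness, a short script (a sketch, not part of the original slides) that re-runs this example end-to-end; as noted above, the eigenvector signs returned by eig are arbitrary, so the projected values may be negated:

X = [1 2; 2 3; 3 2; 4 4; 5 4; 6 7; 7 6; 9 7];
mu = mean(X);
Z = X - repmat(mu, 8, 1);          % subtract the sample mean
S = 7 * cov(Z);                    % scatter matrix, (n-1)*cov(Z)
[V, D] = eig(S);
[~, idx] = max(diag(D));           % eigenvector with the largest eigenvalue
e1 = V(:, idx);
Y = (e1' * Z')'                    % 1-D projection of the 8 samples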
Drawbacks of PCA
PCA was designed for accurate data representation, not for data classification
It preserves as much variance in the data as possible
If the directions of maximum variance are important for classification, PCA will work
However, the directions of maximum variance may be useless for classification
[figure: apply PCA to each class of a two-class data set; the projection direction it finds is not useful for separating the classes]
Next Lecture: Fisher Linear Discriminant, which preserves the direction useful for discrimination
