Lec 15

The document discusses Principal Component Analysis (PCA) and its applications, including dimensionality reduction, noise removal, and data representation. It highlights the differences between PCA and regression, the importance of mean centering, and various use cases such as image processing and recommendation systems. Additionally, it covers probabilistic PCA and its variants, emphasizing the ability to recover latent variables and handle missing data.
Applications of PCA

PCA – the inside story


Recap: given a matrix X ∈ ℝ^(n×d), find the top-k singular triplets of its SVD.
In other words, find U_k ∈ ℝ^(n×k) and V_k ∈ ℝ^(d×k) whose columns are orthonormal and
contain the k largest singular vectors (note: the rows of U_k, V_k are not necessarily
orthonormal), and a diagonal Σ_k ∈ ℝ^(k×k) that contains the k largest singular values
Note: this gives X̂ = U_k Σ_k V_kᵀ which has rank k (Σ_k has only k non-zero entries)
Turns out that this matrix X̂ has many other nice properties too
Storing X̂ requires only O((n+d)·k) space (X requires O(n·d) space)
X̂ is the best approximation to X among all rank-k matrices i.e. it
is the global optimum to the following problem:
X̂ = argmin ‖X − B‖²_F over all matrices B with rank(B) ≤ k
For a matrix A, the Frobenius norm ‖A‖_F is obtained by either
stretching A into a long vector and taking its L2 norm or else
taking the L2 norm of the vector formed out of the singular
values of A, i.e. ‖A‖²_F = Σ_{i,j} A_{ij}² = Σ_j σ_j²(A)
Low-dimensional Modelling
We may suspect that our data points, although presented as d-dimensional feature
vectors, are really lying on/close to some k-dim subspace. Given the data, can we
recover that subspace? In other words, given x_1, …, x_n ∈ ℝ^d, recover a matrix
W ∈ ℝ^(d×k) and vectors z_1, …, z_n ∈ ℝ^k such that x_i ≈ W·z_i
[figure: x_i = W·z_i, with W called the dictionary / factor-loading matrix]
PCA can help you solve this problem. Find the top-k SVD U_k Σ_k V_kᵀ of X and set
Z = U_k Σ_k and W = V_k. As noted earlier, it will give us the best
possible approximation X ≈ Z·Wᵀ among all Z with only k columns
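A minimal numpy sketch of this recipe (my own toy example, not part of the slides; the names X, Z, W follow the notation above):

```python
import numpy as np

# Toy data: n points in d dimensions that really live near a k-dim subspace
rng = np.random.default_rng(0)
n, d, k = 500, 50, 5
Z_true = rng.normal(size=(n, k))
W_true = rng.normal(size=(d, k))
X = Z_true @ W_true.T + 0.01 * rng.normal(size=(n, d))   # small noise

# Top-k SVD: U_k (n x k), S_k (k,), Vt_k (k x d)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
U_k, S_k, Vt_k = U[:, :k], S[:k], Vt[:k, :]

Z = U_k * S_k          # low-dim representations (n x k)
W = Vt_k.T             # dictionary / factor-loading matrix (d x k)
X_hat = Z @ W.T        # best rank-k approximation of X

# Frobenius-norm error equals the sum of squared discarded singular values
err = np.linalg.norm(X - X_hat, 'fro')**2
print(err, np.sum(S[k:]**2))   # the two numbers agree
```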
PCA vs Regression
The previous setup may seem like regression where the
“labels” are vectors and the model is a matrix instead of a
vector
Linear Regression: model y_i ≈ x_iᵀ·w; observed data is (x_1, y_1), …, (x_n, y_n)
Low-rank Modelling: model x_i ≈ W·z_i; observed data is x_1, …, x_n only
The most important difference is that in linear
regression, the “features” x_i are visible; in the low-rank modelling
setting, the corresponding quantities z_i are absent (latent)
Shortcomings of PCA
PCA will reveal hidden structure within data if that
hidden structure is a linear subspace
PCA fails to reveal hidden structure in data if data is lying
on curved (hyper)surfaces, e.g. the “Swiss Roll” data – used to
be very popular in ML. There, what PCA will give us is a flat subspace
whereas what we really want is the curved surface itself
PCA may also fail if data is lying on an affine subspace.
However, this can be easily overcome by mean centering the
data i.e. find μ = (1/n)·Σ_i x_i and do PCA with x_i − μ
Mean centering removes the displacement
Mean centering and PCA
[figure: mean-centered data cloud with the first (leading) component and the
second component drawn through it]
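A small sketch (mine, not from the slides) showing why centering matters for data lying on an affine subspace; `mu` denotes the empirical mean:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
# 2-D data on an affine line: x = offset + t * direction
t = rng.normal(size=(n, 1))
X = np.array([5.0, 5.0]) + t * np.array([1.0, 2.0]) + 0.05 * rng.normal(size=(n, 2))

# PCA without centering: the leading direction is pulled towards the offset
_, _, Vt_raw = np.linalg.svd(X, full_matrices=False)

# Mean centering removes the displacement; PCA then finds the true direction
mu = X.mean(axis=0)
_, _, Vt_cen = np.linalg.svd(X - mu, full_matrices=False)

print("uncentred leading component:", Vt_raw[0])
print("centred leading component:  ", Vt_cen[0])   # ~ [1, 2]/sqrt(5) up to sign
```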
Applications of PCA – Space Savings
Given n data points with d features, X takes O(n·d) space to store, and it takes
O(n·d) time to apply a linear model to all data points i.e. compute X·w
Perform PCA and approximate X using its top-k singular pairs
Find them using the power + peeling method
Approximate X as a rank-k matrix X̂ = U_k Σ_k V_kᵀ. PCA gives the best such rank-k
approximation
Takes only O((n+d)·k) space to store U_k Σ_k and V_k
Warning: the above benefit is lost if we compute and store X̂ itself (instead,
store U_k Σ_k and V_k)
Applying a linear model now takes only O((n+d)·k) time: X̂·w = U_k·(Σ_k·(V_kᵀ·w))
Warning: the above benefit is lost if we first compute X̂ and then compute X̂·w
How to choose k? Various considerations that pose a
tradeoff
Time/Space Budget: choose k small enough s.t. O((n+d)·k) fits the budget
Finding the right value of k
Accuracy: we also want the error ‖X − X̂‖²_F to be small. Is there a quick way to
find out whether it is small or not, without forming X̂ explicitly?
Fact: ‖X − X̂‖²_F = Σ_{j > k} σ_j², where the σ_j are the singular values of X
So, first find ‖X‖²_F = Σ_j σ_j² = trace(XᵀX) (sum of the diagonal). Then, once you
have a PCA for some value of k, compute the squared sum of the singular values you
have got, i.e. Σ_{j ≤ k} σ_j². Use this to get, ignoring peeling errors,
‖X − X̂‖²_F = ‖X‖²_F − Σ_{j ≤ k} σ_j²
Proof sketch: X − X̂ = Σ_{j > k} σ_j·u_j·v_jᵀ. Now the u_j are orthonormal and so are
the v_j, i.e. u_iᵀu_j = v_iᵀv_j = 0 for i ≠ j, so ‖X − X̂‖²_F = Σ_{j > k} σ_j²
(used linearity of trace and ‖A‖²_F = trace(AᵀA) to show this)
Similarly, can show that ‖X̂‖²_F = Σ_{j ≤ k} σ_j²
Often, if you plot how the error ‖X − X̂‖²_F goes down as k increases, you will find
that after some golden value of k, the error drops much more slowly. This “knee”
point is a good place to stop to get good bang-for-buck in terms of the
accuracy-speed tradeoff
A prominent knee point, if one exists, gives us a good idea of
the true intrinsic dimensionality of the data
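A sketch of this bookkeeping (my own toy data; the knee-detection heuristic is just one illustrative choice), computing ‖X − X̂‖²_F for every k from the singular values alone, without ever forming X̂:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k_true = 400, 100, 8
X = rng.normal(size=(n, k_true)) @ rng.normal(size=(k_true, d)) \
    + 0.1 * rng.normal(size=(n, d))

S = np.linalg.svd(X, compute_uv=False)          # all singular values
total = np.sum(S**2)                            # = ||X||_F^2 = trace(X^T X)

# Residual error for each k: ||X - X_hat_k||_F^2 = ||X||_F^2 - sum_{j<=k} sigma_j^2
residual = total - np.cumsum(S**2)

# Crude knee heuristic: first k whose marginal improvement is tiny
drops = residual[:-1] - residual[1:]            # drop when going from k to k+1
knee = int(np.argmax(drops < 0.01 * drops[0])) + 1
print("suggested k near the knee:", knee)       # close to k_true
```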
Applications of PCA – Noise Removal
Treat an image as a matrix and “smoothen” the image by taking a
low-rank approximation
If data features are d-dim but the data is really lying
close to a k-dim subspace then it may be
noise that is making the data d-dim
PCA can extract the important (hidden) data
features – can then learn ML models on them
Given X, compute its PCA and use Z = U_k Σ_k as a set of new k-
dimensional features for the data points
Training algos would speed up if used with k-dim features instead
of d-dim ones
Testing may not speed up since for a given test point x, we will
first have to find out its k-dim representation – would need to
compute z = V_kᵀ·x
Notice that we have X·V_k = U_k Σ_k (since V_k has orthonormal columns) so even for
training features we can compute the k-dim rep by just hitting X with V_k
http://personales.upv.es/jmanjon/
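An illustrative sketch (not from the slides) of smoothening a single image, treated as a matrix, by a rank-k approximation; the synthetic “image” stands in for a real one:

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic low-rank "image" (smooth gradients) plus pixel noise
h, w, k = 128, 128, 10
clean = np.outer(np.linspace(0, 1, h), np.linspace(0, 1, w)) \
        + 0.5 * np.outer(np.sin(np.linspace(0, 6, h)), np.cos(np.linspace(0, 6, w)))
noisy = clean + 0.2 * rng.normal(size=(h, w))

U, S, Vt = np.linalg.svd(noisy, full_matrices=False)
denoised = (U[:, :k] * S[:k]) @ Vt[:k, :]        # rank-k smoothened image

print("noisy error:   ", np.linalg.norm(noisy - clean))
print("denoised error:", np.linalg.norm(denoised - clean))  # typically smaller
```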
Foreground-Background Separation
Denoising, Foreground Extraction
[figure: a video frame decomposed as background + foreground]
Note: here we are treating images as vectors and a
group of images (say a video) is treated as a matrix –
this is different from the noise removal example where
every image was treated as a matrix itself
Make every frame a vector x_i ∈ ℝ^d where d is the number of pixels
Thus, we have the n frames represented as a matrix X ∈ ℝ^(n×d)
Background is a constant vector b ∈ ℝ^d. Let B = [b, b, …, b]ᵀ ∈ ℝ^(n×d)
Foreground is treated as noise n_i. Let N = [n_1, …, n_n]ᵀ
This gives us X = B + N i.e. if the noise is not too much then X is
approximately rank 1
We can do PCA, recover B as the rank-1 approximation and treat X − B as noise
Netrapalli et al, Non-convex Robust PCA, NIPS 2014
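A toy sketch of this idea (my own synthetic "video", not from the slides); the Netrapalli et al. reference treats the foreground explicitly as sparse corruption (robust PCA), whereas plain rank-1 PCA is the simplest version:

```python
import numpy as np

rng = np.random.default_rng(4)
n_frames, n_pixels = 60, 1000

# Static background plus a few sparse "foreground" pixels per frame
background = rng.uniform(size=n_pixels)
X = np.tile(background, (n_frames, 1))
for i in range(n_frames):
    idx = rng.choice(n_pixels, size=20, replace=False)
    X[i, idx] += rng.uniform(0.5, 1.0, size=20)

# Rank-1 PCA recovers (approximately) the constant background
U, S, Vt = np.linalg.svd(X, full_matrices=False)
B_hat = S[0] * np.outer(U[:, 0], Vt[0])    # rank-1 background estimate
F_hat = X - B_hat                          # residual treated as foreground

rel_err = np.linalg.norm(B_hat[0] - background) / np.linalg.norm(background)
print("relative background error:", rel_err)   # small relative error
```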


Applications of PCA – Learning Prototypes
Given X, compute its PCA X̂ = U_k Σ_k V_kᵀ
Notice that X̂ = (U_k Σ_k)·V_kᵀ i.e. we can think of the k rows of V_kᵀ as a dataset
of k prototypes
All points in the dataset can be approximated well as a linear
combination of these prototypes – the linear combinations
are given by the rows of U_k Σ_k
Specifically, the i-th data point (the i-th row of X) can be
approximated as x_i ≈ Σ_j (U_k Σ_k)_{ij}·v_j
Thus, PCA gives us a new way to get good prototypes to
explain data
GMMs earlier had given us one way to get good
prototypes (the component means)
However, GMM did not assure us that data features could be
reconstructed well as combinations of those prototypes
Eigenfaces [Sirovich and Kirby; Turk and Pentland]
Note: here again, we are treating images as vectors
and a group of images is treated as a matrix
An iconic application of PCA – given face images, the
prototypes given by the leading few right singular
vectors are called “eigenfaces”
Images are treated as vectors for this (and many other)
applications
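A sketch of the prototype view (mine, not from the slides); with real face images stacked as rows of X, the rows of `prototypes` below would be the eigenfaces:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, k = 200, 32 * 32, 12           # n "images", each flattened to d pixels
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, d))   # stand-in for face data

U, S, Vt = np.linalg.svd(X, full_matrices=False)
prototypes = Vt[:k]                  # k prototype "images" (eigenfaces), each d-dim
weights = U[:, :k] * S[:k]           # how much each image uses each prototype

# The i-th image is (approximately) a weighted combination of the prototypes
i = 7
recon = weights[i] @ prototypes
print(np.allclose(recon, X[i]))
```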
Latent Semantic Analysis (LSA/LSI)
Used to be a very popular course project
Given n documents, each document as a bag-of-words representation
with the dictionary size d very large, discover k “topics”
about which the documents are talking
Topics could be sports, education, politics, science,
entertainment
Each topic here is represented by a prototypical
document for that topic. All documents are linear
combinations of the topics. The amount of weight a
document places on a certain topic in its representation
tells us a lot about what that document is talking
about, e.g. if a document places a large weight on the j-th topic then that
topic is really core to that document
Word of caution: PCA itself will not tell you whether
blah prototype is about sports or bleh prototype is
about politics. You have to take a look at the prototype
vector, also look at documents that place lots of
weight on that prototype, and make these deductions
separately
[figure: X (n docs × d words) ≈ (n docs × k topics) · (k topics × d words)]
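A tiny illustration on a toy corpus of my own (not from the slides) of discovering topics from a bag-of-words matrix; interpreting what each topic means is still left to the human, as the caution above says:

```python
import numpy as np

# Toy term-document counts: 6 documents over an 8-word dictionary.
# Documents 0-2 use "sports" words, documents 3-5 use "politics" words.
vocab = ["goal", "team", "match", "score", "vote", "party", "law", "election"]
X = np.array([
    [4, 3, 2, 3, 0, 0, 0, 0],
    [3, 4, 3, 2, 0, 0, 0, 1],
    [2, 2, 4, 3, 1, 0, 0, 0],
    [0, 0, 0, 1, 3, 4, 2, 3],
    [0, 1, 0, 0, 4, 3, 3, 2],
    [0, 0, 0, 0, 2, 3, 4, 4],
], dtype=float)

k = 2
U, S, Vt = np.linalg.svd(X, full_matrices=False)
doc_topics = U[:, :k] * S[:k]     # weight each document places on each topic
topic_words = Vt[:k]              # each topic as a vector over words

for t in range(k):
    top = np.argsort(-np.abs(topic_words[t]))[:3]
    print(f"topic {t}: " + ", ".join(vocab[j] for j in top))
```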
Recommendation Systems
Word of caution: a recommendation system where users have
features is called content-based filtering. However,
in the setting in this slide, users have no features.
A popular technique in RecSys is Collaborative Filtering
A drawback of collaborative filtering is that it gets more difficult to
add new users to the system
Have data for n users and their interactions with d items
For these users, predict other items that they would also
like
Done by discovering k “user types” and representing each
user as a combination of these user types
[figure: X (n users × d items) ≈ (n users × k types) · (k types × d items)]
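A deliberately simple sketch (my own toy setup, not from the slides) of discovering "user types" from a fully observed interaction matrix; handling genuinely missing ratings needs matrix completion rather than plain SVD:

```python
import numpy as np

rng = np.random.default_rng(6)
n_users, n_items, k = 100, 40, 3

# Hidden "user types": each user is a mixture of k types,
# each type has its own preference over the items
user_types = rng.dirichlet(np.ones(k) * 0.3, size=n_users)   # n_users x k
type_prefs = rng.uniform(0, 5, size=(k, n_items))            # k x n_items
X = user_types @ type_prefs + 0.1 * rng.normal(size=(n_users, n_items))

# Rank-k SVD rediscovers a k-dim "type space"
U, S, Vt = np.linalg.svd(X, full_matrices=False)
user_rep = U[:, :k] * S[:k]        # each user as a combination of k discovered types
item_rep = Vt[:k]                  # each discovered type as a vector over items

# Users with similar representations tend to like similar items
u, v = 0, 1
sim = user_rep[u] @ user_rep[v] / (np.linalg.norm(user_rep[u]) * np.linalg.norm(user_rep[v]))
print("similarity of users 0 and 1 in type space:", round(float(sim), 3))
```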
The Many Faces of PCA
Has been “discovered” several times [Pearson, 1901; Hotelling, 1930]
[figure caption: this line captures most of the information in the data – why not
get rid of the other information? Save space, reduce noise!]
Gives us the best possible (in terms of
Frob. norm error) low-rank approx of
our data
Can be thought of as giving low-dim
reps of our feature vectors so that
pairwise L2 distances among them are
preserved – see multidimensional
scaling (MDS)
Also gives us a new basis with smaller
dim (V_k is orthonormal) s.t. in that basis,
data can be reconstructed with little
error
Can also be thought of as giving us the
directions along which the data has
maximum variance
PCA Minimizes Reconstruction Error
Already seen that for any matrix X, its best rank-k approximation is
obtained by taking the k leading singular triplets
(We can show that for k > 1 as well, PCA offers the best reconstruction error.
In that case, instead of optimizing over a unit vector v, we would have to
optimize over an orthonormal matrix V ∈ ℝ^(d×k).)
Proof for k = 1: want the best rank-1 approximation for X i.e. want to fit all
data points on a 1D subspace i.e. find a unit vector v s.t. the
data is better represented along the subspace spanned by v
than along any other vector
Claim: any vector x is best represented in span(v) as (xᵀv)·v
Proof: we want argmin_c ‖x − c·v‖₂². Apply first order optimality now: c = xᵀv
(using vᵀv = 1)
Thus, the reconstruction error for the entire dataset is
Σ_i ‖x_i − (x_iᵀv)·v‖₂² = Σ_i ‖x_i‖₂² − Σ_i (x_iᵀv)² = ‖X‖²_F − vᵀXᵀXv
Minimizing this over unit vectors v is the same as maximizing vᵀXᵀXv.
This matches exactly the definition of the leading eigenvector of XᵀX
PCA Preserves Maximum Data Variance
Data may take similar values along one direction, varied
values along another. Projecting onto a low-variance line throws
away a lot of information about the data 
Directional variance: given a unit vector v, the directional variance along v
is defined as the variance of the r.v. vᵀx where we choose each data point
w.p. 1/n
Assume the data is mean-centered, i.e. Σ_i x_i = 0. Then the directional variance
along v is (1/n)·Σ_i (vᵀx_i)² = (1/n)·vᵀXᵀXv
Directions with more directional variance preserve more info
Thus, finding the direction offering the maximum directional
variance is the same as finding the leading right singular vector of X
(equivalently, the leading eigenvector of XᵀX)
We can show that for k > 1 as well, PCA offers a set of k orthonormal
directions such that the total directional variance captured
along those directions is the maximum.
Thus, PCA does give us orthonormal directions with the largest directional
variance
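A quick numerical check of the two views above (my own sketch): the leading right singular vector simultaneously minimizes rank-1 reconstruction error and maximizes directional variance, compared with random unit vectors:

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 1000, 10
X = rng.normal(size=(n, d)) @ np.diag(np.linspace(3, 0.3, d))   # anisotropic data
X = X - X.mean(axis=0)                                          # mean-center

_, _, Vt = np.linalg.svd(X, full_matrices=False)
v_pca = Vt[0]                        # leading right singular vector

def recon_error(v):                  # sum_i ||x_i - (x_i^T v) v||^2 for unit v
    return np.sum(X**2) - np.sum((X @ v)**2)

def dir_var(v):                      # directional variance along unit vector v
    return np.mean((X @ v)**2)

for _ in range(3):
    v = rng.normal(size=d); v /= np.linalg.norm(v)
    assert recon_error(v_pca) <= recon_error(v)
    assert dir_var(v_pca) >= dir_var(v)
print("leading singular vector wins on both criteria")
```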
Probabilistic PCA
The real data z_i was actually sampled from a k-dim
standard Gaussian, but got linearly mapped to a d-dim
space and some noise got added: x_i = W·z_i + ε_i with z_i ∼ N(0, I_k)
and ε_i ∼ N(0, σ²·I_d)
Given x_1, …, x_n, can we recover W? In other words, given the x_i,
recover W and z_i such that x_i ≈ W·z_i
[figure: x_i = W·z_i, with W called the dictionary / factor-loading matrix]
Probabilistic PCA [Tipping and Bishop, 1999]
(It is very unusual for latent variables to simply
integrate out like this leaving behind a nice
Gaussian density. We got very lucky here .
Usually latent variables mean AltOpt/EM.)
Given samples x_1, …, x_n, we wish to recover W
Note: this is a generative problem, i.e. it deals with the generation of
feature vectors
As discussed before, the original low-dim vectors z_i are “latent” – not seen
Also clear from the noise model that x_i | z_i ∼ N(W·z_i, σ²·I_d)
More flexible models possible e.g. Factor Analysis – will see later
Will first see how to recover W and then head into recovering the z_i
Some mildly painful integrals later we can get
P(x_i | W) = ∫ P(x_i | z, W)·P(z) dz = N(x_i | 0, W·Wᵀ + σ²·I_d)
The above also treats the noise variance σ² as known for now
Thus, to simplify life, we can pretend for a moment that our
samples were really generated from N(0, W·Wᵀ + σ²·I_d) and there are no z_i in the
picture.
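A sketch of this generative story and the resulting marginal (my own parameter choices, zero mean assumed as above); the sample covariance should approach W·Wᵀ + σ²·I:

```python
import numpy as np

rng = np.random.default_rng(8)
n, d, k, sigma = 20000, 6, 2, 0.5

W = rng.normal(size=(d, k))                      # factor-loading matrix
Z = rng.normal(size=(n, k))                      # latent z_i ~ N(0, I_k)
X = Z @ W.T + sigma * rng.normal(size=(n, d))    # x_i = W z_i + noise

# Marginally, x_i ~ N(0, W W^T + sigma^2 I): check via the sample covariance
C_model = W @ W.T + sigma**2 * np.eye(d)
C_sample = X.T @ X / n
print(np.max(np.abs(C_sample - C_model)))        # small for large n
```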
An MLE For W in PPCA
We get n samples from N(0, C) where C = W·Wᵀ + σ²·I_d. The log-likelihood is
LL(W) = −(n/2)·( d·ln(2π) + ln det(C) + trace(C⁻¹·S) )
where S = (1/n)·Σ_i x_i·x_iᵀ is the sample covariance
Note: C is always invertible because of the σ²·I_d term (imp. since W·Wᵀ has rank
at most k < d)
Can apply first order optimality to get the MLE (painful
derivatives though)
Thankfully, the end result is very familiar. Let S = Q·Λ·Qᵀ be the
eigendecomposition of S where Q = [q_1, …, q_d] is orthonormal and
Λ = diag(λ_1 ≥ … ≥ λ_d)
An ED always exists for S since it is square symmetric
W_MLE = Q_k·(Λ_k − σ²·I_k)^(1/2), where Q_k collects the k leading eigenvectors
and Λ_k the k leading eigenvalues
(If we decide to set σ = 0 (and not estimate it either) then we get
W_MLE = Q_k·Λ_k^(1/2) which, apart from the scaling with the eigenvalues, is just
V_k in PCA! Thus, PCA ≈ PPCA (apart from a scaling factor) under the
noiseless assumption )
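Continuing the synthetic example above, a sketch of the closed-form MLE (following Tipping and Bishop; W is only identifiable up to a rotation, so the implied covariances are compared rather than W itself):

```python
import numpy as np

rng = np.random.default_rng(8)
n, d, k, sigma = 20000, 6, 2, 0.5
W = rng.normal(size=(d, k))
Z = rng.normal(size=(n, k))
X = Z @ W.T + sigma * rng.normal(size=(n, d))

# Eigendecomposition of the sample covariance S (square symmetric, so it exists)
S = X.T @ X / n
lam, Q = np.linalg.eigh(S)                       # ascending order
lam, Q = lam[::-1], Q[:, ::-1]                   # descending

sigma2_hat = lam[k:].mean()                      # MLE of the noise variance
W_hat = Q[:, :k] @ np.diag(np.sqrt(lam[:k] - sigma2_hat))

C_true = W @ W.T + sigma**2 * np.eye(d)
C_hat = W_hat @ W_hat.T + sigma2_hat * np.eye(d)
print("max covariance error:", np.max(np.abs(C_hat - C_true)))   # small for large n
print("sigma^2 estimate:", sigma2_hat, "(true:", sigma**2, ")")
```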
PPCA Variants
We can tweak several moving parts in the PPCA generative
process
Can instead assume that z ∼ N(μ, Σ) and estimate μ, Σ as well
Can assume non-spherical noise and estimate its covariance
A technique called Factor Analysis actually uses a non-spherical noise
model
Since PPCA is a generative model, it can model missing data too
Suppose we have already found out W, σ² using clean training data
If test data has missing features, use the fact that marginals of Gaussians
are Gaussian. Since we know x ∼ N(0, W·Wᵀ + σ²·I), we can see that
x_obs ∼ N(0, W_obs·W_obsᵀ + σ²·I) (W_obs has only the observed rows of W)
Missing test data is easier to handle, missing training data is more
challenging
Need to apply the above trick when defining the likelihood of training data
points
Each training data point may have different coordinates missing. If we do not …
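A small sketch (my own construction) of the marginalization trick for missing test features, using scipy's Gaussian density; `obs` indexes the observed coordinates:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(9)
d, k, sigma2 = 6, 2, 0.25
W = rng.normal(size=(d, k))                      # assume W, sigma^2 already learnt

x = W @ rng.normal(size=k) + np.sqrt(sigma2) * rng.normal(size=d)
obs = np.array([0, 2, 3, 5])                     # indices of observed features

# Marginal of a Gaussian is Gaussian: keep only the observed rows of W
W_obs = W[obs]
C_obs = W_obs @ W_obs.T + sigma2 * np.eye(len(obs))
log_lik = multivariate_normal(mean=np.zeros(len(obs)), cov=C_obs).logpdf(x[obs])
print("log-likelihood of the observed part:", log_lik)
```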
Dimensionality Reduction using PPCA
With PCA we got low-dim features easily: Z = U_k Σ_k = X·V_k
These made sense because these are k-dim features which are just a
rotation away (using V_k) from features X̂ which we know approximate X very
well
With PPCA too we can recover the original low-dim features by
treating them as the latent variables z_i and applying AltOpt or EM
Need to be careful since these latent variables are (continuous) vectors
now!
Earlier, we used a shortcut to get the MLE for W. To get hold of the z_i we
need proper AltOpt/EM
AltOpt will approximate the integral using a single term (a single value for z_i)
EM will lower bound the integral using another (easier to compute)
integral
Need to replace the “sum” over possible values of z_i with an “integral” since
the z_i are continuous
PPCA – Alternating Optimization
(These derivations are routine but tedious. You
can check [BIS] Chapter 12 for details.
Warning: equation 12.42 in that book has an
error. The correct expression is given in this
slide.)
We wish to get Ŵ = argmax_W Σ_i ln ∫ P(x_i | z, W)·P(z) dz
Same old pesky sum-log-sum (actually
sum-log-integral) form – difficult 
Approximate the integral by its most dominant term, i.e.
approx. the integral by a single term: ∫ P(x_i | z, W)·P(z) dz ≈ max_z P(x_i | z, W)·P(z)
Approximating the integral by a single term may not be bad if the dist.
has small variance – advantage of being cheaper than EM in
computation time
Thus, we wish to solve argmax over W and z_1, …, z_n of Σ_i { ln P(x_i | z_i, W) + ln P(z_i) }
where P(x_i | z_i, W) = N(W·z_i, σ²·I_d) and P(z_i) = N(0, I_k)
Since for Gaussians, the mode is the mean, we have ẑ_i = (WᵀW + σ²·I_k)⁻¹·Wᵀ·x_i
(Why does this expression look like least squares?
Because of some beautiful coincidences: 1) it turns
out that P(z | x_i, W) is a Gaussian, 2) for Gaussians, the mean is the
mode, and 3) the mode is the MLE solution to the
problem x_i ≈ W·z, which is indeed a vector least squares
regression problem)
(We know that the mean of z is 0. Then how come ẑ_i ≠ 0?
Because ẑ_i is the mean of z conditioned on x_i. It is the
marginal (unconditional) mean of z that is 0. Suppose
we know that x_i is a vector far off from the origin. Then
it is likely that z_i is far from the origin as well, which immediately tells us
that the conditional mean cannot be 0.)
This gives us one of the alternating updates, let's derive the
other
PPCA – Alternating Optimization
Thus, if the z_i are fixed, we can obtain W using first order
optimality on argmax_W Σ_i ln P(x_i | z_i, W)
where P(x_i | z_i, W) = N(W·z_i, σ²·I_d)
See [BIS] Chap 12 for detailed derivations – check that the
dimensionalities match
Apply first order optimality to get Ŵ = (Σ_i x_i·z_iᵀ)·(Σ_i z_i·z_iᵀ)⁻¹
Once W is known, σ² can also be found using first order
optimality
PPCA – Alternating Optimization
ALTERNATING OPTIMIZATION
1. Initialize W (e.g. randomly)
2. For t = 1, 2, …
   1. Update the z_i fixing W
      1. Let M = (WᵀW + σ²·I_k)⁻¹; for i = 1, …, n set z_i = M·Wᵀ·x_i
   2. Update W (and σ²) fixing the z_i
      1. Calculate A = Σ_i x_i·z_iᵀ and B = Σ_i z_i·z_iᵀ
      2. Update W = A·B⁻¹
      3. Calculate the residuals x_i − W·z_i
      4. Update σ² = (1/(n·d))·Σ_i ‖x_i − W·z_i‖₂²
(Roughly O(d·k² + k³) time to calculate the inverse term and O(n·d·k) time to
calculate all the z_i afterward; O(n·d·k) time for A, O(n·k²) time for B and
finally O(d·k² + k³) time for the W update.)
Total of O(n·d·k + (n+d)·k² + k³) time taken per iteration. In contrast, PCA using
the power + peeling method needs only O(n·d) time per power step if we do the power
updates as Xᵀ·(X·v)
In practice, PCA is usually faster than AltOpt PPCA –
no inverses required to solve PCA, just simple
mat-vec multiplication steps 
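A compact numpy sketch of the alternating updates on synthetic data (my own code; σ² is treated as known and fixed here to keep the sketch short):

```python
import numpy as np

rng = np.random.default_rng(10)
n, d, k, sigma2 = 2000, 8, 2, 0.1
W_true = rng.normal(size=(d, k))
X = rng.normal(size=(n, k)) @ W_true.T + np.sqrt(sigma2) * rng.normal(size=(n, d))

W = rng.normal(size=(d, k))                          # 1. initialize
for t in range(100):                                 # 2. alternate
    # 2.1 update z_i fixing W:  z_i = (W^T W + sigma^2 I)^{-1} W^T x_i
    M_inv = np.linalg.inv(W.T @ W + sigma2 * np.eye(k))
    Z = X @ W @ M_inv                                # rows are z_i^T
    # 2.2 update W fixing z_i:  W = (sum_i x_i z_i^T)(sum_i z_i z_i^T)^{-1}
    W = (X.T @ Z) @ np.linalg.inv(Z.T @ Z)

# W is identifiable only up to an invertible k x k transform,
# so compare the subspace spanned by its columns with the true one
P_hat = W @ np.linalg.pinv(W)
P_true = W_true @ np.linalg.pinv(W_true)
print("subspace error:", np.max(np.abs(P_hat - P_true)))   # small if recovery worked
```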
PPCA – Expectation Maximization
We wish to get Ŵ = argmax_W Σ_i ln ∫ P(x_i | z, W)·P(z) dz
As before, given a current model W^t, let q_i(z) = P(z | x_i, W^t) and lower bound
the integral using Jensen’s inequality (the entropy term does not depend on W)
Some simple (but non-trivial) calculations show that the
resulting EM algorithm looks very similar to the AltOpt with a
few simple changes
Replace z_i with E[z_i | x_i, W^t] = (WᵀW + σ²·I_k)⁻¹·Wᵀ·x_i
This is the same ẑ_i as we used in AltOpt since for a Gaussian, the mode is the
same as the mean
Replace z_i·z_iᵀ with E[z_i·z_iᵀ | x_i, W^t] = σ²·(WᵀW + σ²·I_k)⁻¹ + E[z_i]·E[z_i]ᵀ
The rest of the algorithm (can be shown to) remains the same (see
[BIS] Chap 12)
PPCA – Expectation Maximization
(Time complexity of PPCA using EM is roughly the same as that
of PPCA using AltOpt.)
EXPECTATION MAXIMIZATION
1. Initialize W, σ² (e.g. randomly)
2. For t = 1, 2, …
   1. Update the posterior over the z_i fixing W, σ² (E-step)
      1. Let M = (WᵀW + σ²·I_k)⁻¹; for i = 1, …, n set E[z_i] = M·Wᵀ·x_i
      2. Let E[z_i·z_iᵀ] = σ²·M + E[z_i]·E[z_i]ᵀ
   2. Update W, σ² fixing the posterior (M-step)
      1. Update W = (Σ_i x_i·E[z_i]ᵀ)·(Σ_i E[z_i·z_iᵀ])⁻¹
      2. Calculate the terms ‖x_i‖₂² − 2·E[z_i]ᵀ·Wᵀ·x_i + trace(E[z_i·z_iᵀ]·WᵀW)
      3. Update σ² as (1/(n·d)) times the sum of these terms over i
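A matching numpy sketch of the EM updates (again my own code, following the changes listed above; here σ² is re-estimated in the M-step):

```python
import numpy as np

rng = np.random.default_rng(11)
n, d, k = 2000, 8, 2
sigma2_true = 0.1
W_true = rng.normal(size=(d, k))
X = rng.normal(size=(n, k)) @ W_true.T + np.sqrt(sigma2_true) * rng.normal(size=(n, d))

W, sigma2 = rng.normal(size=(d, k)), 1.0             # 1. initialize
for t in range(200):                                 # 2. iterate
    # E-step: posterior moments of each z_i
    M_inv = np.linalg.inv(W.T @ W + sigma2 * np.eye(k))
    Ez = X @ W @ M_inv                               # rows are E[z_i]^T
    sumEzz = n * sigma2 * M_inv + Ez.T @ Ez          # sum_i E[z_i z_i^T]
    # M-step: update W and sigma^2
    W_new = (X.T @ Ez) @ np.linalg.inv(sumEzz)
    sigma2 = (np.sum(X**2) - 2 * np.sum((X @ W_new) * Ez)
              + np.trace(sumEzz @ W_new.T @ W_new)) / (n * d)
    W = W_new

C_hat = W @ W.T + sigma2 * np.eye(d)
C_true = W_true @ W_true.T + sigma2_true * np.eye(d)
print("covariance error:", np.max(np.abs(C_hat - C_true)))   # close up to sampling error
print("sigma^2 estimate:", sigma2)
```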
