Lec 15
[Figure: $\mathbf{x}_i = W\mathbf{z}_i$, where $W$ is the dictionary / factor loading matrix]
PCA vs Regression
The previous setup may seem like regression where “labels” are vectors and the model is a matrix instead of a vector
[Table: side-by-side comparison of Linear Regression and Low-rank Modelling]
If you plot the reconstruction error $\|X - \hat{X}\|_F^2$ as you increase $k$, you will find that after some golden value of $k$, the error drops much more slowly. This “knee” point is a good place to stop to get good bang-for-buck in terms of the accuracy-speed tradeoff.
Now … are orthonormal so … i.e. …
Used linearity of trace and … to show this
Similarly, can show that …
A prominent knee point, if one exists, gives us a good idea of
the true intrinsic dimensionality of the data
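A minimal NumPy sketch (not from the slides) of how such a knee can be spotted, assuming the data matrix X has one point per row and using the fact that the rank-$k$ error equals the sum of squared singular values beyond the $k$-th:

```python
import numpy as np

# Synthetic data: points near a 5-dim subspace of R^50, plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 50)) + 0.1 * rng.normal(size=(500, 50))

# || X - X_k ||_F^2 = sum of squared singular values beyond the k-th
s = np.linalg.svd(X, compute_uv=False)
tail = np.cumsum((s ** 2)[::-1])[::-1]        # tail[k] = sum_{j >= k} s_j^2 (0-indexed)
for k in range(1, 11):
    print(k, float(tail[k]))
# The printed error drops sharply up to k = 5 and only slowly afterwards --
# that "knee" is the intrinsic dimensionality of this synthetic data.
```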
Applications of PCA – Noise Removal
Treat images as matrices and “smoothen” the image by taking a low-rank approximation (a minimal sketch follows)
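A minimal sketch of that idea on a synthetic image (a stand-in for a real one), keeping only the top few singular triplets:

```python
import numpy as np

# A synthetic low-rank "image" corrupted with noise (stand-in for a real image)
rng = np.random.default_rng(0)
clean = np.outer(np.sin(np.linspace(0, 3, 128)), np.cos(np.linspace(0, 3, 128)))
noisy = clean + 0.2 * rng.normal(size=clean.shape)

# Smoothen by keeping only the top-k singular triplets of the image matrix
k = 5
U, s, Vt = np.linalg.svd(noisy, full_matrices=False)
smoothed = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("noise before:", np.linalg.norm(noisy - clean))
print("noise after: ", np.linalg.norm(smoothed - clean))
```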
If data features are high-dimensional but the data really lies close to a $k$-dim subspace, then it may be noise that is making the data look full-dimensional
PCA can extract the important (hidden) data features – can then learn ML models on those
Given the data, compute PCA and use the $\mathbf{z}_i$'s as a set of new $k$-dimensional features for the data points
Training algos would speed up if used with $k$-dim features instead of the original high-dim features
Testing may not speed up since for a given test point, we will first have to find out its $k$-dim representation – would need to compute it
Notice that we have $\mathbf{z}_i = W^\top\mathbf{x}_i$ (since $W$ is orthonormal), so even for training features we can compute the $k$-dim rep by just hitting $\mathbf{x}_i$ with $W^\top$ (sketched below)
http://personales.upv.es/jmanjon/
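A minimal NumPy sketch of this feature-extraction recipe, assuming rows of X are data points and taking W to be the top-$k$ right singular vectors:

```python
import numpy as np

# Data with low-dim structure: n points, each 100-dimensional
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10)) @ rng.normal(size=(10, 100)) + 0.1 * rng.normal(size=(1000, 100))
k = 10

# W = top-k right singular vectors (columns); acts as the orthonormal "dictionary"
_, _, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:k].T                               # shape (100, k), W^T W = I_k

# k-dim training features: just hit each point with W^T
Z = X @ W                                  # shape (1000, k)

# Same for a new test point
x_test = rng.normal(size=100)
z_test = W.T @ x_test                      # shape (k,)
print(Z.shape, z_test.shape)
```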
Foreground-Background Separation
[Figure: a video frame decomposed as a background image plus a foreground image]
Note: here we are treating images as vectors and a group of images is treated as a matrix
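A minimal sketch of this idea on a synthetic “video” whose frames are stacked as rows; the low-rank part plays the role of the static background and the residual the foreground (real systems often use robust variants of PCA, but a plain truncated SVD conveys the point):

```python
import numpy as np

# Synthetic "video": 200 frames of a fixed background plus a small moving patch
rng = np.random.default_rng(0)
h, w, T = 32, 32, 200
background = rng.uniform(size=(h, w))
frames = np.tile(background.ravel(), (T, 1))          # one flattened frame per row
for t in range(T):
    start = t % (h * w - 16)
    frames[t, start:start + 16] += 1.0                # crude moving "foreground"

# Rank-1 approximation ~ background; what is left over ~ foreground
U, s, Vt = np.linalg.svd(frames, full_matrices=False)
if Vt[0].sum() < 0:                                   # fix SVD sign ambiguity
    U[:, 0], Vt[0] = -U[:, 0], -Vt[0]
bg = np.outer(U[:, 0] * s[0], Vt[0])
fg = frames - bg
rel_err = np.linalg.norm(bg[0] - background.ravel()) / np.linalg.norm(background)
print("relative error of recovered background:", rel_err)
```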
Eigenfaces [Sirovich and Kirby; Turk and Pentland]
An iconic application of PCA – given face images, the
prototypes given by the leading few right singular
vectors are called “eigenfaces”
Images are treated as vectors for this application (and many others)
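A minimal NumPy sketch, using random stand-in vectors in place of a real face dataset (the images are commonly mean-centered first, an assumption here):

```python
import numpy as np

# Stand-in for a face dataset: n images, each flattened to a d-dim vector
rng = np.random.default_rng(0)
n, side = 400, 24
faces = rng.uniform(size=(n, side * side))

# Center and take the leading right singular vectors: these are the "eigenfaces"
mean_face = faces.mean(axis=0)
_, _, Vt = np.linalg.svd(faces - mean_face, full_matrices=False)
k = 10
eigenfaces = Vt[:k]                        # each row reshapes to a side x side image

# Any face can then be summarized by its k coefficients on the eigenfaces
coeffs = (faces - mean_face) @ eigenfaces.T
print(eigenfaces.shape, coeffs.shape)
```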
Latent Semantic Analysis (LSA/LSI)
Used to be a very popular course project
Given $n$ documents, each document as a bag-of-words representation with a very large dictionary size, discover “topics” about which the documents are talking
Topics could be sports, education, politics, science, entertainment
Each topic here is represented by a prototypical document for that topic. All documents are linear combinations of the topics. The amount of weight a document places on a certain topic in its representation tells us a lot about what that document is talking about, e.g. if … then the …-th topic is really core to the …-th document.
Word of caution: PCA itself will not tell you whether blah prototype is about sports or bleh prototype is about politics. You have to take a look at the prototype vector, also look at documents that place lots of weight on that prototype, and make these deductions.
[Figure: $X$ ($n$ docs $\times$ words) $\approx$ (docs $\times$ topics) $\times$ (topics $\times$ words)]
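A minimal sketch on a toy corpus (the documents and vocabulary here are made up for illustration):

```python
import numpy as np

# Tiny bag-of-words matrix: rows = documents, columns = dictionary words
# (a real corpus would be n docs x a very large vocabulary)
docs = [
    "the team won the match",
    "the election results were announced",
    "the player scored a goal",
    "the minister gave a speech",
]
vocab = sorted({w for d in docs for w in d.split()})
X = np.array([[d.split().count(w) for w in vocab] for d in docs], dtype=float)

# Truncated SVD: X ~ (docs x topics) @ diag(strengths) @ (topics x words)
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
doc_topic = U[:, :k] * s[:k]          # how much each document loads on each topic
topic_word = Vt[:k]                   # each topic as a weighting over dictionary words

for t in range(k):
    top = np.argsort(-np.abs(topic_word[t]))[:3]
    print("topic", t, "->", [vocab[i] for i in top])
```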
Recommendation Systems
A popular RecSys technique is Collaborative Filtering
Have data for users and their interactions with items
For these users, predict other items that they would also like
Done by discovering “user types” and representing each user as a combination of these user types
Word of caution: a recommendation system where users have features is called content-based filtering. However, in the setting in this slide, users have no features.
A drawback of collaborative filtering is that it gets more difficult to add new users to the system.
[Figure: ($n$ users $\times$ items) $\approx$ (users $\times$ types) $\times$ (types $\times$ items)]
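A minimal sketch of the low-rank factorization idea, assuming (unrealistically) a fully observed interaction matrix; handling missing entries, e.g. via alternating least squares, is what real collaborative filtering adds on top:

```python
import numpy as np

# Synthetic user-item interaction matrix generated from a few "user types"
rng = np.random.default_rng(0)
n_users, n_items, n_types = 300, 50, 4
user_types = rng.uniform(size=(n_users, n_types))
type_items = rng.uniform(size=(n_types, n_items))
ratings = user_types @ type_items + 0.05 * rng.normal(size=(n_users, n_items))

# Factorize: each user becomes a combination of k discovered "user types"
k = 4
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
user_rep = U[:, :k] * s[:k]            # users x types
item_rep = Vt[:k]                      # types x items

# Predicted affinity of user 0 for every item; recommend the top-scoring ones
pred = user_rep[0] @ item_rep
print("top items for user 0:", np.argsort(-pred)[:5])
```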
The Many Faces of PCA
Has been “discovered” several times
[Pearson, 1901; Hotelling, 1930]
Gives us the best possible (in terms of Frobenius norm error) low-rank approximation of our data
This line captures most of the information in the data. Why not get rid of the other information? Save space, reduce noise!
Can be thought of as giving low-dim reps of our feature vectors so that pairwise L2 distances among them are preserved – see multidimensional scaling (MDS)
Also gives us a new basis with smaller dim ($W$ is orthonormal) s.t. in that basis, data can be reconstructed with little error
Can also be thought of as giving us the directions along which the data has maximum variance
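A standard way to connect the reconstruction-error view and the maximum-variance view (the symbols here are assumed, not taken from the slide): for mean-centered points $\mathbf{x}_1,\dots,\mathbf{x}_n$ and an orthonormal $W \in \mathbb{R}^{d\times k}$ (so $W^\top W = I_k$),
$$\sum_{i=1}^{n}\bigl\|\mathbf{x}_i - WW^\top\mathbf{x}_i\bigr\|_2^2 \;=\; \sum_{i=1}^{n}\|\mathbf{x}_i\|_2^2 \;-\; \sum_{i=1}^{n}\bigl\|W^\top\mathbf{x}_i\bigr\|_2^2,$$
so minimizing the reconstruction error over orthonormal $W$ is the same as maximizing $\sum_i \|W^\top\mathbf{x}_i\|_2^2$, i.e. the variance captured along the chosen directions.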
PCA Minimizes Reconstruction Error
Already seen that for any matrix, its best rank-$k$ approximation is obtained by taking the leading $k$ singular triplets
We can show that for $k > 1$ as well, PCA offers the best reconstruction error. In that case, instead of optimizing over a unit vector, we would have to optimize over an orthonormal matrix.
Proof for $k = 1$: want the best rank-1 approximation, i.e. want to fit all data points on a 1D subspace, i.e. find a unit vector $\mathbf{w}$ s.t. the data is better represented along the subspace spanned by $\mathbf{w}$ than along any other vector
Claim: any vector $\mathbf{x}$ is best represented in that subspace as $(\mathbf{w}^\top\mathbf{x})\,\mathbf{w}$
Proof: we want $\min_{\alpha}\|\mathbf{x}-\alpha\mathbf{w}\|_2^2$; apply first-order optimality now
Thus, the reconstruction error for the entire dataset is as worked out below
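The elided steps, written out in standard notation (a sketch, assuming rows of $X$ are the data points):
$$\|\mathbf{x}-\alpha\mathbf{w}\|_2^2 = \|\mathbf{x}\|_2^2 - 2\alpha\,\mathbf{w}^\top\mathbf{x} + \alpha^2 \;\;\Rightarrow\;\; \alpha^\ast = \mathbf{w}^\top\mathbf{x} \quad\text{(first-order optimality, using } \|\mathbf{w}\|_2 = 1),$$
$$\sum_{i=1}^{n}\bigl\|\mathbf{x}_i - (\mathbf{w}^\top\mathbf{x}_i)\mathbf{w}\bigr\|_2^2 \;=\; \sum_{i=1}^{n}\|\mathbf{x}_i\|_2^2 \;-\; \sum_{i=1}^{n}(\mathbf{w}^\top\mathbf{x}_i)^2,$$
so the best unit vector maximizes $\sum_i(\mathbf{w}^\top\mathbf{x}_i)^2 = \mathbf{w}^\top X^\top X\,\mathbf{w}$, which is achieved by the leading right singular vector of $X$.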
[Figure: $\mathbf{x}_i = W\mathbf{z}_i$, where $W$ is the dictionary / factor loading matrix]
Probabilistic PCA [Tipping and Bishop, 1999]
It is very unusual for latent variables to simply integrate out like this, leaving behind a nice Gaussian density. We got very lucky here. Usually latent variables mean AltOpt/EM.
Given samples, we wish to recover …
Note: this is a generative problem, i.e. deals with generation of
feature vectors
As discussed before, the original data are “latent” – not seen
Also clear from the noise model that …
More flexible models possible e.g. Factor Analysis – will see later
Will first see how to recover the parameters and then head into recovering the latent variables
Some mildly painful integrals later we can get …, where … and … (the standard form is given below)
Note: … is always invertible because of … (important since …)
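For reference, the standard PPCA model and the marginal that these integrals produce (notation as in Tipping and Bishop; the slide's own symbols are not visible here):
$$\mathbf{z}_i \sim \mathcal{N}(\mathbf{0},\ I_k), \qquad \mathbf{x}_i \mid \mathbf{z}_i \sim \mathcal{N}(W\mathbf{z}_i,\ \sigma^2 I_d) \;\;\Longrightarrow\;\; \mathbf{x}_i \sim \mathcal{N}(\mathbf{0},\ C), \quad C = WW^\top + \sigma^2 I_d,$$
and $C$ is always invertible thanks to the $\sigma^2 I_d$ term (important since $WW^\top$ has rank at most $k < d$).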
Can apply first-order optimality to get the MLE (painful derivatives though)
Thankfully, the end result is very familiar. Let … be the eigendecomposition of …, where … and … with …
An eigendecomposition always exists here since the matrix is square symmetric
The MLE is … where … (see below)
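The familiar closed form referred to above, in standard notation (again an assumption about the slide's symbols): if the sample covariance $S = \frac{1}{n}\sum_i \mathbf{x}_i\mathbf{x}_i^\top$ has eigendecomposition $S = Q\Lambda Q^\top$ with eigenvalues $\lambda_1 \ge \dots \ge \lambda_d$, then
$$W_{\mathrm{MLE}} = Q_k\,(\Lambda_k - \sigma^2 I_k)^{1/2} R, \qquad \sigma^2_{\mathrm{MLE}} = \frac{1}{d-k}\sum_{j=k+1}^{d}\lambda_j,$$
where $Q_k$ and $\Lambda_k$ keep the top $k$ eigenvectors/eigenvalues and $R$ is an arbitrary $k\times k$ rotation.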
PPCA Variants
We can tweak several moving parts in the PPCA generative
process
Can instead assume that … and estimate …
Can assume non-spherical noise and estimate …
A technique called Factor Analysis actually uses a non-spherical noise model
Since PPCA is a generative model, it can model missing data too
Suppose we have already found out the parameters using clean training data
If test data has missing features, use the fact that marginals of Gaussians are Gaussian. Since we know …, we can see … (… has only the observed rows); this is made concrete below
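Concretely (with assumed notation): split a test point into observed and missing parts, $\mathbf{x} = (\mathbf{x}_o, \mathbf{x}_m)$. Since $\mathbf{x} \sim \mathcal{N}(\mathbf{0}, C)$ with $C = WW^\top + \sigma^2 I$,
$$\mathbf{x}_o \sim \mathcal{N}\bigl(\mathbf{0},\ W_o W_o^\top + \sigma^2 I\bigr),$$
where $W_o$ keeps only the observed rows of $W$, so the likelihood of the observed coordinates can be evaluated directly.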
Missing test data is easier to handle, missing training data more
challenging
Need to apply the above trick when defining the likelihood of training data
points
Each training data point may have different coordinates missing. If we do not …
Dimensionality Reduction using PPCA
With PCA we got low-dim feat. easily
These made sense because they are $k$-dim features which are just a rotation away (using $W$) from features which we know approximate the original data very well
With PPCA too we can recover the original low-dim features by treating them as latent variables and applying AltOpt or EM
Need to be careful since these latent variables are (continuous) vectors
now!
Earlier, we used a shortcut to get the MLE for the parameters. To get hold of the latent variables we need proper AltOpt/EM
AltOpt will approximate the integral using a single term (a single value for the latent vector)
EM will lower bound the integral using another (easier to compute)
integral
Need to replace “sum” over possible values of the latent vector with “integral” since it is continuous
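The key quantity these updates need is the posterior over the latent vector, which is again Gaussian (the standard PPCA result, notation assumed):
$$p(\mathbf{z}\mid\mathbf{x}) = \mathcal{N}\bigl(M^{-1}W^\top\mathbf{x},\ \sigma^2 M^{-1}\bigr), \qquad M = W^\top W + \sigma^2 I_k,$$
so AltOpt can plug in the posterior mean $M^{-1}W^\top\mathbf{x}$ as its single value of $\mathbf{z}$, while EM averages over the whole posterior.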
These derivations are routine but tedious (see [BIS] Chap 12)
The rest of the algorithm (can be shown to) remains the same (see [BIS] Chap 12)
PPCA – Expectation Maximization
Time complexity of PPCA using EM is roughly the same as that of PPCA using AltOpt.
EXPECTATION MAXIMIZATION
1. Initialize the parameters …
2. For t = 1, 2, …
   1. Update the latent-variable posteriors, fixing the parameters
      1. Let …, for …
      2. Let …
   2. Update the parameters, fixing the latent-variable posteriors
      1. Update …
      2. Calculate …
      3. Update …
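A minimal NumPy sketch of these EM updates (following the standard PPCA EM equations as in [BIS] Chap 12; the data is assumed centered, one point per row):

```python
import numpy as np

def ppca_em(X, k, n_iters=100, seed=0):
    """EM for probabilistic PCA on centered data X (n x d). Returns W (d x k) and sigma^2."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(d, k))
    sigma2 = 1.0
    for _ in range(n_iters):
        # E-step: posterior moments of each latent z_i
        M = W.T @ W + sigma2 * np.eye(k)                 # k x k
        Minv = np.linalg.inv(M)
        Ez = X @ W @ Minv                                # n x k, rows are E[z_i]
        sumEzz = n * sigma2 * Minv + Ez.T @ Ez           # sum_i E[z_i z_i^T]
        # M-step: update W and sigma^2 fixing the posterior moments
        W_new = (X.T @ Ez) @ np.linalg.inv(sumEzz)       # d x k
        sigma2 = (np.sum(X ** 2)
                  - 2.0 * np.sum(Ez * (X @ W_new))
                  + np.trace(sumEzz @ W_new.T @ W_new)) / (n * d)
        W = W_new
    return W, sigma2

# Toy check: data generated from a 3-dim latent space embedded in 20 dims
rng = np.random.default_rng(1)
Z = rng.normal(size=(2000, 3))
W_true = rng.normal(size=(20, 3))
X = Z @ W_true.T + 0.1 * rng.normal(size=(2000, 20))
W_hat, s2_hat = ppca_em(X - X.mean(axis=0), k=3)
print("estimated noise variance:", s2_hat)   # should be close to 0.01
```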