Outline
• Mathematical background
• PCA
• SVD
• Some PCA and SVD applications
• Case study: LSI
Mathematical Background
Variance
If we have one dimension:
• English: the average squared distance from the data points to the mean of the
data set.
• Definition: Var(X) = E[(X - μ)^2], where μ = E[X]
• Empirical: Var(x) = (1/(N-1)) * Σ_i (x_i - x̄)^2
Mathematical Background
Covariance
Always measured between two dimensions.
• English: for each data item, multiply the difference between the x
value and the mean of x by the difference between the y value and the
mean of y, then average these products.
• Empirical:
the inner product of the values on the X dimension with the values on the
Y dimension (after subtracting the means):
cov(X,Y) = (1/(N-1)) * Σ_i (x_i - x̄)(y_i - ȳ)
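As a quick sanity check, here is a small NumPy sketch (with made-up numbers, not from these slides) of the empirical variance and covariance using the 1/(N-1) convention:

import numpy as np

x = np.array([2.5, 0.5, 2.2, 1.9, 3.1])   # example values on dimension X
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0])   # example values on dimension Y

# Empirical variance: average squared distance from the mean.
var_x = np.sum((x - x.mean())**2) / (len(x) - 1)

# Empirical covariance: inner product of the mean-centered X and Y values.
cov_xy = np.dot(x - x.mean(), y - y.mean()) / (len(x) - 1)

print(var_x, np.var(x, ddof=1))     # the two agree
print(cov_xy, np.cov(x, y)[0, 1])   # matches NumPy's built-in covariance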
Mathematical Background
Covariance properties
• cov(X,X)=Var(X)
• cov(X,Y)=cov(Y,X)
• If X and Y are independent, then cov(X,Y) = 0 (they are uncorrelated).
• If X and Y are correlated (both dimensions increase together),
cov(X,Y) > 0.
• If X and Y are anti-correlated (one dimension increases while the other
decreases), cov(X,Y) < 0.
Correlation
Is a scaled version of covariance:
cor(X,Y) = cov(X,Y) / (σ_X * σ_Y), where σ_X and σ_Y are the standard deviations.
Always -1 <= cor(X,Y) <= 1.
Mathematical Background
Covariance Matrix
Recall that covariance is a measure between two dimensions.
For example, if we have a 3-dimensional data set (dimensions x, y, z),
we should calculate cov(x,y), cov(y,z), and cov(x,z).
For n-dimensional data, we calculate n!/((n-2)! * 2) = n(n-1)/2 different
covariance values.
The covariance matrix for a set of data with n dimensions is
C(n x n), where c[i,j] = c[j,i] = cov(Dim[i], Dim[j]).
The covariance matrix for 3-dimensional data is:
C = | cov(x,x)  cov(x,y)  cov(x,z) |
    | cov(y,x)  cov(y,y)  cov(y,z) |
    | cov(z,x)  cov(z,y)  cov(z,z) |
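A minimal NumPy sketch (with made-up 3-dimensional data) of building this matrix:

import numpy as np

# Made-up data: 5 points with dimensions x, y, z (one column per dimension).
data = np.array([[2.5, 2.4, 0.5],
                 [0.5, 0.7, 1.8],
                 [2.2, 2.9, 1.1],
                 [1.9, 2.2, 0.9],
                 [3.1, 3.0, 1.4]])

# Covariance matrix: c[i,j] = cov(Dim[i], Dim[j]).
# rowvar=False because each column (not each row) is a dimension.
C = np.cov(data, rowvar=False)

print(C.shape)               # (3, 3)
print(np.allclose(C, C.T))   # True: the covariance matrix is symmetric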
Mathematical Background
Covariance Matrix Properties
The covariance matrix is symmetric (c[i,j] = c[j,i]), and its diagonal entries
are the variances of the individual dimensions: c[i,i] = Var(Dim[i]).
Mathematical Background
Eigenvectors
A vector v is an eigenvector of a square matrix S if S x v = λ v, where the
scalar λ is the corresponding eigenvalue.
The length of a vector doesn’t affect whether it is an eigenvector or not,
whereas its direction does. So, to keep eigenvectors standard, we scale
them to have length 1:
v → v / |v|
Eigenvector properties
Mathematical Background
Eigenvector properties (continued)
3. If we scale an eigenvector by any amount, we still get the same
eigenvalue (as we saw).
4. A symmetric matrix S(n x n) also satisfies two properties:
I. It has exactly n eigenvectors.
II. All the eigenvectors are orthogonal (perpendicular). This is
important because we can express the data in terms of the
eigenvectors, instead of expressing it in the original space.
We say that the eigenvectors are orthonormal, which means they are
orthogonal and have length 1.
What are the eigenvectors of the identity matrix?
Any vector is an eigenvector of the identity matrix.
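As a quick check of these properties, here is a NumPy sketch (with a made-up symmetric matrix) showing that its eigenvectors are orthonormal and satisfy S x v = λ v:

import numpy as np

# A real symmetric 2 x 2 matrix.
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is NumPy's eigensolver for symmetric matrices; the eigenvectors are the columns of U.
eigvals, U = np.linalg.eigh(S)

print(eigvals)                                          # the eigenvalues: [1. 3.]
print(np.allclose(U.T @ U, np.eye(2)))                  # True: columns are orthonormal
print(np.allclose(S @ U[:, 0], eigvals[0] * U[:, 0]))   # True: S v = λ v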
Outline
• Mathematical background
• PCA
• SVD
• Some PCA and SVD applications
• Case study: LSI
PCA and SVD
PCA: Principal Components Analysis, also known as KLT
(Karhunen-Loeve Transform).
SVD: Singular Value Decomposition.
SVD and PCA are closely related.
Why do we use SVD and PCA?
• They are powerful tools for analyzing data and finding patterns.
• They are used for compression: we can reduce the number of
dimensions without much loss of information.
PCA
Objective: project the data onto a lower dimensional linear space
such that the variance of the projected data is maximized.
Equivalently, it is the linear projection that minimizes the average
projection cost (mean squared distance between the data points and
their projections).
Note that this is different from feature subset selection!
Problem to solve: in a high dimensional space, we need to learn a large
number of parameters. Thus, if the dataset is small, this will result in
large variance and over-fitting.
We want to represent the vector x in a different (p-dimensional) space
using a set of orthonormal vectors U (where each ui is a principal
component).
PCA
Method to perform PCA on a data set
Step 1: get some data
Let A(N, n) be the data matrix: N is the number of data points and n is the
number of dimensions. It could represent N patients with n numerical
symptoms each (blood pressure, cholesterol level, etc.) or N documents
with n terms each (as used in IR).
PCA
Method to perform PCA on a data set
Step 2: Subtract the mean
Intuitively, we translate the origin to the center of gravity. We obtain the
matrix B(N, n) by subtracting the column mean from each column of A.
This produces zero-mean data (the column averages are zero).
Step 3: Calculate the covariance matrix C.
C(n, n) = B^T(n, N) x B(N, n), where T denotes the transpose. (The 1/(N-1)
scaling factor is omitted; it only scales the eigenvalues and does not
change the eigenvectors.)
Since our data is 2D, the covariance matrix will be (2 x 2).
Step 4: Calculate the eigenvectors and eigenvalues of C.
For our 2D example we get two orthonormal eigenvectors:
V1 . V2 = 0, |V1| = 1, |V2| = 1
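A minimal NumPy sketch of steps 2-4 on a made-up 2D data matrix (the numbers and variable names are illustrative, not from the slides):

import numpy as np

# Hypothetical data matrix A: N = 5 points, n = 2 dimensions.
A = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])

B = A - A.mean(axis=0)        # Step 2: subtract the column means (zero-mean data)
C = B.T @ B / (len(A) - 1)    # Step 3: covariance matrix (2 x 2)

eigvals, eigvecs = np.linalg.eigh(C)   # Step 4: eigenvalues and eigenvectors of C
print(eigvals)     # the variances along the principal directions
print(eigvecs)     # the orthonormal eigenvectors (one per column)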
PCA
Method to perform PCA on a data set
Step 5: Choosing components and forming a feature vector
Objective: project the n-dimensional data onto a p-dimensional sub-
space (p ≤ n), minimizing the error of the projections (sum of
squared differences).
Here is where we reduce the dimensionality of the data (for example,
to do compression).
How: order the eigenvalues from highest to lowest to get the
components in order of significance, then project onto the p eigenvectors
that correspond to the highest p eigenvalues, as in the sketch below.
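A short sketch of this selection step, assuming the eigenvalues and eigenvectors have already been computed (the numbers are made up):

import numpy as np

# Hypothetical output of np.linalg.eigh on a 2 x 2 covariance matrix:
eigvals = np.array([0.049, 1.284])
eigvecs = np.array([[-0.735, 0.678],
                    [ 0.678, 0.735]])   # eigenvector i is column i

p = 1
order = np.argsort(eigvals)[::-1]    # indices of eigenvalues, highest first
U_p = eigvecs[:, order[:p]]          # (n x p) feature matrix: top-p principal components
print(U_p)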
PCA
Method to perform PCA on a data set
Step 5 (continued)
If we are allowed to pick only one dimension to project the data onto,
then the first principal component is the best direction.
PCA
Method to perform PCA on a data set
Step 6: Derive the new data set
Let’s denote the feature-space matrix by U(n x p), where the columns are
the selected eigenvectors, and let the final data be F:
F(N, p) = B(N, n) x U(n, p)
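A small sketch of this projection, with hypothetical values for the zero-mean data B and the feature matrix U_p (continuing the made-up example above):

import numpy as np

B = np.array([[ 0.46,  0.16],
              [-1.54, -1.54],
              [ 0.16,  0.66],
              [-0.14, -0.04],
              [ 1.06,  0.76]])      # hypothetical zero-mean data (N x n)
U_p = np.array([[0.678],
                [0.735]])           # hypothetical feature matrix (n x p)

F = B @ U_p        # F(N, p): each row is a data point in the new p-dimensional space
print(F.shape)     # (5, 1)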
PCA
Method to perform PCA on a data set
Step 6 (continued)
For our example, if we keep both eigenvectors, we get the original data,
rotated so that the eigenvectors are the axes (we have lost no information).
PCA
Method to perform PCA on a data set
Step 7 (optional): getting the old data back (if we are doing compression).
If we keep all the eigenvectors, we get back exactly the same data.
We know F = B x U, so
B = F x U^-1
B = F x U^T (proof later, since U is orthonormal)
To get A, we add the mean vector back to B.
Notice: if we kept only some of the components, the variance along the
discarded components is gone (a lossy compression).
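A self-contained sketch of this lossy round trip on the made-up data used above (keeping only the first principal component):

import numpy as np

A = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])
mean = A.mean(axis=0)
B = A - mean

C = B.T @ B / (len(A) - 1)
eigvals, eigvecs = np.linalg.eigh(C)
U_p = eigvecs[:, [np.argmax(eigvals)]]   # keep only the top eigenvector

F = B @ U_p                    # compressed data (N x 1)
A_rec = F @ U_p.T + mean       # B ≈ F x U^T, then add the mean back

print(np.round(A_rec, 2))      # close to A, but the variance along the
                               # discarded component is gone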
PCA and FLD
FLD: Fisher’s Linear Discriminant.
FLD is a supervised learning method (it uses the class labels) that selects
the projection that maximizes the class separation.
Specifically: given y = w^T x, FLD adjusts the components of w to
maximize the distance between the projected class means, while
minimizing the variance within each class.
PCA and FLD
PCA is an unsupervised learning technique that finds the dimensions
useful for representing the data, but these dimensions may be bad for
discriminating between different classes.
Example: Project 2D data into 1D. We have two classes: red and
green.
Outline
• Mathematical background
• PCA
• SVD
• Some PCA and SVD applications
• Case study: LSI
SVD
The eigenvalues and eigenvectors are defined for square matrices.
For rectangular matrices, a closely related concept is Singular Value
Decomposition (SVD).
Theorem: Given an N x n real matrix A, we can express it as:
A = U x Λ x VT
where U(N x r) and V(n x r) are column-orthonormal:
UT x U = I and VT x V = I
and Λ(r x r) is a diagonal matrix holding the singular values λ1 ≥ λ2 ≥ ... ≥ λr.
A can also be written using the spectral decomposition as:
A = λ1 U1 V1T + λ2 U2 V2T + … + λr Ur VrT
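A small NumPy sketch (on a made-up matrix) verifying the decomposition and the orthonormality of U and V:

import numpy as np

A = np.array([[1.0, 1.0, 0.0],
              [2.0, 2.0, 0.0],
              [0.0, 0.0, 3.0],
              [0.0, 0.0, 4.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(np.allclose(A, U @ np.diag(s) @ Vt))       # A = U x Λ x V^T
print(np.allclose(U.T @ U, np.eye(len(s))))      # U^T x U = I
print(np.allclose(Vt @ Vt.T, np.eye(len(s))))    # V^T x V = I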
SVD
Theorem: if S is a real and symmetric matrix (S = ST), then
S = U x Λ x UT, where the columns of U are the eigenvectors and Λ is a
diagonal matrix whose entries are the corresponding eigenvalues.
Proof: let U be the matrix with the eigenvectors placed in the columns:
U = [u1 u2 … un]
We can write: S x U = U x Λ
[S x u1  S x u2 … S x un] = [λ1.u1  λ2.u2 … λn.un], which is the definition
of the eigenvectors.
Therefore: S = U x Λ x U-1
Because U is orthonormal, U-1 = UT, so
S = U x Λ x UT
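A quick numerical check of this factorization (made-up symmetric matrix):

import numpy as np

S = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

eigvals, U = np.linalg.eigh(S)        # eigenvectors in the columns of U
Lam = np.diag(eigvals)

print(np.allclose(S, U @ Lam @ U.T))    # True: S = U x Λ x U^T
print(np.allclose(U @ U.T, np.eye(3)))  # True: U is orthonormal, so U^-1 = U^T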
SVD
In the LSI view of the decomposition A = U x Λ x VT (where A is the
document-term matrix): U is the doc-to-concept similarity matrix, Λ holds
the strengths of the concepts, and V is the term-to-concept similarity matrix.
Let D = A x AT be the (N x N) document-to-document similarity matrix.
The eigenvectors of D are the columns of the U matrix of the SVD of A. Proof?
A x AT = U x Λ x VT x V x Λ x UT (remember that (A x B)T = BT x AT)
= U x Λ x Λ x UT : VT x V = I because V is column-orthonormal
= U x Λ2 x UT : because Λ is a diagonal matrix.
Notice that D is symmetric because (U x Λ2 x UT)T = U x Λ2 x UT,
or simply because (A x AT)T = (AT)T x AT = A x AT.
Because D is symmetric, it can be written as D = U x Λ' x UT, where the
columns of U are D’s eigenvectors and Λ' = Λ2 holds D’s eigenvalues.
SVD
Symmetrically, the n x n matrix T = AT x A gives the term-to-term
similarities (the covariance matrix, when A is zero-mean).
The eigenvectors of the T matrix are the columns of the V matrix of the
SVD of A.
T is a symmetric matrix because (AT x A)T = AT x (AT)T = AT x A
AT x A = V x Λ x UT x U x Λ x VT = V x Λ x Λ x VT = V x Λ2 x VT
Notice that both D and T have the same (non-zero) eigenvalues, which are
the squares of the λi elements (the singular values of A).
These observations show the close relation between SVD and PCA, which
uses the eigenvectors of the covariance matrix.
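A short sketch (made-up matrix) checking that the eigenvalues of T = A^T x A are the squared singular values of A, and that its eigenvectors span the same directions as V:

import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [2.0, 4.0, 1.0],
              [0.0, 1.0, 3.0],
              [1.0, 0.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

T = A.T @ A
eig_T, V_eig = np.linalg.eigh(T)      # eigenvalues returned in ascending order

print(np.sort(eig_T)[::-1])   # eigenvalues of T, descending
print(s**2)                   # squared singular values of A: the same values
# The columns of V_eig are, up to sign and ordering, the right singular
# vectors of A (the columns of Vt.T).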
SVD
Very important property: (we will see it again in Kleinberg's algorithm)
(AT x A)^k x v' ≈ (constant) v1, where k >> 1, v' is a random vector, and v1 is
the first right singular vector of A, or equivalently the first eigenvector of
AT x A (as we already proved). Proof?
(AT x A)^k = (AT x A) x (AT x A) x … = (V x Λ2 x VT) x (V x Λ2 x VT) x …
= (V x Λ4 x VT) x … = V x Λ^2k x VT
Using the spectral decomposition:
(AT x A)^k = V x Λ^2k x VT = λ1^2k v1 v1T + λ2^2k v2 v2T + … + λn^2k vn vnT
Because λ1 > λi for all i ≠ 1, we have λ1^2k >> λi^2k.
Thus (AT x A)^k ≈ λ1^2k v1 v1T
Now (AT x A)^k x v' = λ1^2k v1 (v1T x v') = (constant) v1,
because v1T x v' is a scalar.
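This is exactly the power-iteration idea; a small sketch on a random made-up matrix:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))     # made-up 6 x 4 matrix

M = A.T @ A
v = rng.standard_normal(4)          # random starting vector v'
for _ in range(50):                 # repeatedly apply M (k iterations)
    v = M @ v
    v = v / np.linalg.norm(v)       # normalize to keep the numbers bounded

_, _, Vt = np.linalg.svd(A)
v1 = Vt[0]                          # first right singular vector of A
print(np.abs(v @ v1))               # ≈ 1.0: v is parallel to v1 (up to sign)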
SVD
Geometrically, this means that if we multiply any vector by the matrix
(AT x A)^k, the result is a vector that is (almost) parallel to the first
eigenvector.
PCA and SVD
Summary for PCA and SVD
Objective: find the principal components P of a data matrix A(N, n).
1. First zero-mean the columns of A (translate the origin to the center of gravity).
2. Apply PCA or SVD to find the principal components P of A.
PCA:
I. Calculate the covariance matrix C = AT x A (A is already zero-mean).
II. P = the eigenvectors of C.
III. The variances in each new dimension are given by the eigenvalues.
SVD:
I. Calculate the SVD of A.
II. P = V: the right singular vectors.
III. The variances are given by squaring the singular values.
3. Project the data onto the feature space: F = A x P.
4. Optional: reconstruct A’ from F, where A’ is the compressed version of A.
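A sketch comparing the two routes on made-up data (they give the same principal directions, up to sign):

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 3))
A = A - A.mean(axis=0)              # step 1: zero-mean the columns

# PCA route: eigenvectors of the covariance matrix.
C = A.T @ A
eigvals, P_pca = np.linalg.eigh(C)

# SVD route: right singular vectors of A.
_, s, Vt = np.linalg.svd(A, full_matrices=False)

print(np.allclose(np.sort(eigvals)[::-1], s**2))           # eigenvalues = squared singular values
print(np.allclose(np.abs(P_pca[:, ::-1]), np.abs(Vt.T)))   # same directions, up to sign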
Outline
• Mathematical background
• PCA
• SVD
• Some PCA and SVD applications
• Case study: LSI
SVD and PCA applications
LSI: Latent Semantic Indexing.
Solving over-specified (no solution: least-squares-error solution) and
under-specified (infinitely many solutions: shortest-length solution)
systems of linear equations.
Ratio rules (computing quantifiable association rules like
bread:milk:butter = 2:4:3).
Google/PageRank algorithm (random walk with restart).
Kleinberg/HITS algorithm (computing hub and authority scores for
nodes).
Query feedback (learning to estimate the selectivity of queries: a
regression problem).
Image compression (other methods: DCT used in JPEG, and
wavelet compression)
Data visualization (by projecting the data on 2D).
Variations: ICA
ICA (Independent Components Analysis)
ICA relaxes the constraint of orthogonality but keeps the linearity. Thus,
it can be more flexible than PCA in finding patterns.
Outline
• Mathematical background
• PCA
• SVD
• Some PCA and SVD applications
• Case study: LSI
SVD and LSI
LSI: Latent Semantic Indexing.
Idea: try to group similar terms together, to form a few concepts,
then map the documents into vectors in the concept-space, as
opposed to vectors in the n-dimensional space, where n is the
vocabulary size of the document collection.
This approach automatically groups terms that occur together into
concepts. Then, every time the user asks for a term, the system
determines the relevant concepts and searches for them.
In order to map documents or queries into the concept space, we
need the term-to-concept similarity matrix V.
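A toy sketch of this mapping with a made-up document-term matrix (none of the numbers or terms come from the slides):

import numpy as np

# Rows = documents, columns = terms.
terms = ["data", "information", "retrieval", "brain", "lung"]
A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [0, 0, 0, 1, 1],
              [0, 0, 0, 2, 2]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
V_k = Vt[:k].T       # term-to-concept similarity matrix (n terms x k concepts)

# Map a query for the single term 'data' into concept space: q_concept = q x V_k.
q = np.array([1.0, 0, 0, 0, 0])
print(q @ V_k)       # loads on the "data / information / retrieval" concept

# A document containing only 'retrieval' loads on the same concept, so LSI
# can return it even though it does not contain the term 'data'.
d = np.array([0, 0, 1.0, 0, 0])
print(d @ V_k)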
SVD and LSI
Example: find the documents containing the term ‘data’.
The query is first mapped into the concept space, so an LSI system may
return documents that do not necessarily contain the term ‘data’:
for example, a document d whose only word is ‘retrieval’ (because
‘retrieval’ belongs to the same concept as ‘data’).
Questions or Thoughts??