
Outline

• Mathematical background
• PCA
• SVD
• Some PCA and SVD applications
• Case study: LSI

Mathematical Background
Variance
If we have one dimension:

• English: the average squared distance from the mean of the data set to its points.

• Definition: Var(X) = E[(X - E(X))^2] = E(X^2) - E(X)^2

• Empirical: Var(x) = Σ_i (x_i - mean(x))^2 / (N-1)

Many datasets have more than one dimension. For example, our dataset might contain both the height of all students and the mark they received, and we want to see whether height has an effect on the mark.

Mathematical Background
Covariance
Covariance is always measured between two dimensions.

• English: for each data item, multiply the difference between the x value and the mean of x by the difference between the y value and the mean of y, then average these products.

• Definition: cov(X,Y)= E[(X-E(X))(Y-E(Y))] = E(X.Y)-E(X).E(Y)

• Empirical: cov(x,y) = Σ_i (x_i - mean(x)) (y_i - mean(y)) / (N-1), i.e. the inner product of the values on the x dimension with the values on the y dimension (after subtracting the means), scaled by 1/(N-1).
Mathematical Background
Covariance properties
• cov(X,X)=Var(X)
• cov(X,Y)=cov(Y,X)
• If X and Y are independent (uncorrelated) ⇒ cov(X,Y) = 0
• If X and Y are correlated (both dimensions increase together) ⇒ cov(X,Y) > 0
• If X and Y are anti-correlated (one dimension increases while the other decreases) ⇒ cov(X,Y) < 0
Correlation
Correlation is a scaled version of covariance: cor(X,Y) = cov(X,Y) / (σX σY), where σX and σY are the standard deviations of X and Y.
Always -1 <= cor(X,Y) <= 1
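To make the empirical formulas above concrete, here is a minimal NumPy sketch (the height/mark numbers are made up, not from the slides) that computes the sample variance, covariance, and correlation and checks them against NumPy's built-ins:

```python
import numpy as np

height = np.array([1.65, 1.70, 1.80, 1.75, 1.60])   # hypothetical heights (m)
mark = np.array([62.0, 70.0, 85.0, 78.0, 60.0])     # hypothetical marks

n = len(height)
var_h = np.sum((height - height.mean()) ** 2) / (n - 1)                    # Var(x)
cov_hm = np.sum((height - height.mean()) * (mark - mark.mean())) / (n - 1)
cor_hm = cov_hm / (height.std(ddof=1) * mark.std(ddof=1))                  # scaled to [-1, 1]

print(var_h, np.var(height, ddof=1))         # matches NumPy's sample variance
print(cov_hm, np.cov(height, mark)[0, 1])    # matches the off-diagonal of np.cov
print(cor_hm, np.corrcoef(height, mark)[0, 1])
```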
Mathematical Background
Covariance Matrix
Recall that covariance is a measure between two dimensions.
For example, if we have a 3-dimensional data set (dimensions x, y, z),
we should calculate cov(x,y), cov(y,z), and cov(x,z).
For n-dimensional data, we calculate n!/((n-2)! 2!) = n(n-1)/2 different covariance values.
The covariance matrix for a data set with n dimensions is:
C(n x n), where c[i,j] = c[j,i] = cov(Dim[i], Dim[j])
The covariance matrix for 3-dimensional data is:
C = | cov(x,x)  cov(x,y)  cov(x,z) |
    | cov(y,x)  cov(y,y)  cov(y,z) |
    | cov(z,x)  cov(z,y)  cov(z,z) |
Mathematical Background
Covariance Matrix Properties

• C is a square, symmetric matrix.

• The diagonal values are the variances of the individual dimensions, and the off-diagonal values are the covariances between pairs of dimensions.
• Large values on the diagonal correspond to interesting dimensions, whereas large values off the diagonal correspond to high correlations (redundancy).
Because we want to minimize the correlation (redundancy) and
maximize the variance, we would like to have a diagonal covariance
matrix.
Mathematical Background
eigenvectors

Example: if we think of a square matrix as a transformation, then multiplying it by one of its eigenvectors does not change the vector's direction; the matrix only scales it.
Eigenvalues and eigenvectors always come in pairs; in the example, 4 is the eigenvalue associated with our eigenvector.
No matter what (non-zero) multiple of the eigenvector we take, we get the
same eigenvalue.
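The slide's worked matrix did not survive extraction, so here is a small NumPy sketch with an assumed 2 x 2 stand-in that happens to have eigenvalue 4, illustrating both points: A·v keeps the direction of v, and rescaling v leaves the eigenvalue unchanged.

```python
import numpy as np

A = np.array([[2.0, 3.0],
              [2.0, 1.0]])
v = np.array([3.0, 2.0])              # an eigenvector of A with eigenvalue 4

print(A @ v)                          # [12. 8.] = 4 * v: same direction, scaled by 4
print(A @ (10 * v))                   # [120. 80.] = 4 * (10 * v): same eigenvalue

eigvals, eigvecs = np.linalg.eig(A)   # eigenvalues 4 and -1 (order not guaranteed)
print(eigvals)
print(eigvecs)                        # columns are unit-length eigenvectors, paired with eigvals
```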
Mathematical Background
eigenvectors
The length of a vector doesn’t affect whether it’s an eigenvector or
not, whereas the direction does. So to keep eigenvectors standard,
we scale them to have length 1.
So we scale our vector to have length 1 (divide it by its norm).

eigenvectors properties

1. eigenvectors can only be found for square matrices.

2. Not every square matrix has (real) eigenvectors.

Mathematical Background
eigenvectors properties (continued)
3. If we scale an eigenvector by any amount, we still get the same eigenvalue (as we saw).
4. A symmetric matrix S(n x n) also satisfies two properties:
I. It has exactly n eigenvectors.
II. All the eigenvectors are orthogonal (perpendicular). This is important because it means we can express the data in terms of the eigenvectors instead of in the original space.
We say that the eigenvectors are orthonormal, which means orthogonal and of length 1.
What are the eigenvectors of the identity matrix?
Any vector is an eigenvector of the identity matrix.
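As a quick sanity check of property 4, here is a sketch (with a random symmetric matrix, not the slides' data) showing that the n eigenvectors returned by NumPy's eigh are orthonormal and diagonalize S:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
S = (M + M.T) / 2                                   # a random symmetric 4x4 matrix

eigvals, U = np.linalg.eigh(S)                      # eigh handles symmetric matrices
print(np.allclose(U.T @ U, np.eye(4)))              # True: n orthonormal eigenvectors
print(np.allclose(U @ np.diag(eigvals) @ U.T, S))   # True: S = U Λ U^T
```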
Outline

• Mathematical background
• PCA
• SVD
• Some PCA and SVD applications
• Case study: LSI

PCA and SVD
PCA: Principal Components Analysis, also known as the KLT
(Karhunen-Loève Transform).
SVD: Singular Value Decomposition.
SVD and PCA are closely related.
Why do we use SVD and PCA?
• A powerful tool for analyzing data and finding patterns.
• Used for compression: we can reduce the number of dimensions without much loss of information.

PCA
Objective: project the data onto a lower dimensional linear space
such that the variance of the projected data is maximized.
Equivalently, it is the linear projection that minimizes the average
projection cost (mean squared distance between the data points and
their projections).
This is different from feature subset selection!
Problem to solve: In high dimensional space, we need to learn a large
number of parameters. Thus if the dataset is small, this will result in
large variance and over-fitting.
We want to represent the vector x in a different (p-dimensional) space using a set of orthonormal vectors U (where each ui is a principal component).
PCA
Method to perform PCA on a dataset
• Step 1: Get some data.
Let A (N,n) be the data matrix: N is the number of data points and n is the number of dimensions. It could represent N patients with n numerical symptoms each (blood pressure, cholesterol level, etc.) or N documents with n terms each (as used in IR).

PCA
Method to perform PCA on a dataset
• Step 2: Subtract the mean.
Intuitively, we translate the origin to the center of gravity. We obtain the matrix B (N,n):

This produces zero-mean data (the column averages are zero).
• Step 3: Calculate the covariance matrix C.
C(n,n) = BT(n,N) x B(N,n) / (N-1): T is transpose (the 1/(N-1) factor matches the empirical covariance definition given earlier).
Since our data is 2-D, the covariance matrix will be (2 x 2).

Notice that the non-diagonal elements are positive, why?
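Here is a minimal sketch of Steps 2-3 on an assumed small 2-D dataset (the slide's actual numbers were not preserved in this text):

```python
import numpy as np

A = np.array([[2.0, 1.9],
              [0.5, 0.8],
              [1.4, 1.6],
              [3.1, 2.7],
              [2.2, 2.4]])                      # N=5 points, n=2 dimensions (hypothetical)

B = A - A.mean(axis=0)                          # Step 2: subtract the column means
C = B.T @ B / (len(B) - 1)                      # Step 3: covariance matrix, shape (2, 2)

print(np.allclose(C, np.cov(A, rowvar=False)))  # True: matches NumPy's covariance
print(C)                                        # off-diagonal entries are positive here
```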


PCA
Method to perform PCA on a dataset
• Step 4: Calculate the eigenvectors and eigenvalues of the covariance matrix.
Since the covariance matrix is square, we can calculate its eigenvectors and eigenvalues:
λ1 ≈ 1.28, V1 ≈ [-0.677 -0.735]T ; λ2 ≈ 0.049, V2 ≈ [-0.735 0.677]T

Notice that V1 and V2 are orthonormal. Why?

V1 . V2 = 0
|V1| = 1
|V2| = 1

PCA
Method to perform PCA on a dataset
• Step 5: Choose components and form a feature vector.
Objective: project the n-dimensional data onto a p-dimensional subspace (p ≤ n), minimizing the error of the projections (sum of squared differences).
Here is where we reduce the dimensionality of the data (for example,
to do compression).
How: order the eigenvalues from highest to lowest to get the components in order of significance, then project onto the p eigenvectors that correspond to the p highest eigenvalues.
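A sketch of Steps 4-5 on the same assumed data as before (eigendecompose C, sort by eigenvalue, keep the top p components):

```python
import numpy as np

A = np.array([[2.0, 1.9], [0.5, 0.8], [1.4, 1.6], [3.1, 2.7], [2.2, 2.4]])  # hypothetical data
B = A - A.mean(axis=0)
C = B.T @ B / (len(B) - 1)

eigvals, eigvecs = np.linalg.eigh(C)     # Step 4: C is symmetric, so eigh is appropriate
order = np.argsort(eigvals)[::-1]        # Step 5: sort eigenvalues from highest to lowest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

p = 1                                    # keep only the top principal component
U = eigvecs[:, :p]                       # feature matrix U (n x p)
print(eigvals)                           # variance captured along each component
print(U)                                 # the principal direction
```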

PCA
Method to perform PCA on a dataset
• Step 5 (continued)

The eigenvector with the highest eigenvalue is the principal component of the data.

If we are allowed to pick only one dimension to project the data onto, then the principal component is the best direction.

The principal component of our example is V1 ≈ [-0.677 -0.735]T.

PCA
Method to perform PCA on a dataset
• Step 6: Derive the new data set.
Let's denote the feature-space matrix by U(n x p), where the columns are the chosen eigenvectors, and let the final data be F.

F(N,p) = B(N,n) x U(n,p)

The final data F resides in a p-dimensional feature space.

PCA
Method to perform PCA on a dataset
• Step 6 (continued)
For our example, if we keep both eigenvectors, we get the original data rotated so that the eigenvectors are the axes (we have lost no information).

PCA
Method to perform PCA on a dataset
• Step 7 (optional): Get the old data back (useful if we are doing compression).
If we keep all the eigenvectors, we get back exactly the same data.
We know F = B x U
B = F x U-1

B = F x UT (proof later; exact when all eigenvectors are kept, so that U is square)
To get A, we add the mean vector back to B.
Notice: the variance along the other component is gone (a lossy compression).
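A sketch of Steps 6-7 on the assumed data, keeping only the top component and then reconstructing a (lossy) approximation of the original data:

```python
import numpy as np

A = np.array([[2.0, 1.9], [0.5, 0.8], [1.4, 1.6], [3.1, 2.7], [2.2, 2.4]])  # hypothetical data
mean = A.mean(axis=0)
B = A - mean
C = B.T @ B / (len(B) - 1)

eigvals, eigvecs = np.linalg.eigh(C)
U = eigvecs[:, np.argsort(eigvals)[::-1]][:, :1]   # keep p = 1 eigenvector

F = B @ U                            # Step 6: final data, shape (N, p)
A_approx = F @ U.T + mean            # Step 7: reconstruction (lossy, since p < n)
print(A_approx)                      # close to A; the second component's variance is gone
```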

PCA and FLD
FLD: Fisher's Linear Discriminant.
FLD is a supervised learning method (it utilizes the class labels) that selects the projection maximizing the separation between classes.
Specifically, for the projection y = wT x, FLD adjusts the components of w to maximize the distance between the projected class means while minimizing the variance within each class. A standard way to write this criterion is:

J(w) = (m2 - m1)^2 / (s1^2 + s2^2)

where m1 and m2 are the projected class means and s1^2, s2^2 are the within-class variances of the projected data.
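A minimal sketch of FLD (assumed toy data mimicking the two-class scenario discussed next; the closed-form direction w ∝ Sw^-1 (m2 - m1) is the standard solution that maximizes a criterion of this form):

```python
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal(loc=[0.0, 0.0], scale=[2.0, 0.3], size=(50, 2))   # class 1 (e.g. red)
X2 = rng.normal(loc=[0.0, 2.0], scale=[2.0, 0.3], size=(50, 2))   # class 2 (e.g. green)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
# within-class scatter matrix Sw (sum of the two class scatters)
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

w = np.linalg.solve(Sw, m2 - m1)     # FLD direction: w ∝ Sw^-1 (m2 - m1)
w /= np.linalg.norm(w)
print(w)                             # close to [0, 1]: separates the two classes
# PCA, by contrast, would pick the x-axis here (largest overall variance),
# which mixes the two classes together.
```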

PCA and FLD
PCA is an unsupervised learning technique that finds the dimensions useful for representing the data, but these dimensions may be bad for discriminating between different classes.
Example: Project 2D data into 1D. We have two classes: red and
green.

What direction will PCA choose?

Outline

• Mathematical background
• PCA
• SVD
• Some PCA and SVD applications
• Case study: LSI

SVD
Eigenvalues and eigenvectors are defined only for square matrices.
For rectangular matrices, a closely related concept is Singular Value
Decomposition (SVD).
Theorem: Given an N x n real matrix A, we can express it as:

A = U x Λ x VT

where U is a column-orthonormal N x r matrix, r is the rank of the matrix A (the number of linearly independent rows or columns), Λ is a diagonal r x r matrix whose elements are sorted in descending order, and V is a column-orthonormal n x r matrix.
The SVD of a matrix is unique (up to the signs of the singular vectors when the singular values are distinct).
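A sketch of the decomposition with NumPy (a random full-rank matrix stands in for A; full_matrices=False gives the economy-size factorization whose shapes match the theorem when r = n):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((7, 5))                     # an N x n matrix (here full rank, r = 5)

U, s, Vt = np.linalg.svd(A, full_matrices=False)    # "economy-size" SVD
print(U.shape, s.shape, Vt.shape)                   # (7, 5) (5,) (5, 5)
print(np.allclose(A, U @ np.diag(s) @ Vt))          # True: A = U Λ V^T
print(np.allclose(U.T @ U, np.eye(5)))              # True: U is column-orthonormal
print(s)                                            # singular values, in descending order
```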
SVD
The values on the diagonal of Λ are called singular values (we will see later that they correspond to the square roots of the eigenvalues of the covariance matrix).
Theorem: the inverse of an orthonormal matrix is its transpose.
Proof: we know that the columns of A are orthonormal, so entry (i,j) of AT x A is the inner product of columns i and j, which is 1 if i = j and 0 otherwise.
Therefore AT x A = I, where I is the identity matrix.

From the definition of A-1: A-1 x A = I

⇒ A-1 = AT

In particular, UT x U = I and VT x V = I
A can also be written using the spectral decomposition as:
A= λ1 U1 V1T + λ2 U2 V2T+ … + λr Ur VrT
SVD
Theorem: if S is a real and symmetric (S = ST) matrix, then S = U x Λ x UT, where the columns of U are the eigenvectors and Λ is a diagonal matrix whose values are the corresponding eigenvalues.
Proof: let U be the matrix of eigenvectors placed in the columns:
U=[u1 u2 … un]
We can write: S x U=U x Λ
[S x u1 S x u2 … S x un]=[λ1.u1 λ2.u2 … λn.un] which is the definition
of the eigenvectors.
Therefore: S= U x Λ x U-1
Because U is orthonormal ⇒ U-1 = UT
S = U x Λ x UT

SVD
(Figure: A = U x Λ x VT, where U is the doc-to-concept similarity matrix, Λ holds the concept strengths, and V is the term-to-concept similarity matrix.)
Example: the matrix contains 7 documents with the corresponding frequencies of each term.
In real IR applications, we take into consideration the normalized TF and IDF when calculating term weights.
The rank of this matrix is r = 2 because we have 2 types of documents (CS and Medical documents), i.e. 2 concepts.
SVD

U can be thought of as the document-to-concept similarity matrix, while V is the term-to-concept similarity matrix.
For example, U1,1 is the weight of the CS concept in document d1, λ1 is the strength of the CS concept, and V1,1 is the weight of the first term 'data' in the CS concept; V2,1 = 0 means 'data' has zero similarity with the 2nd (Medical) concept.
What does U4,1 mean?
SVD
The N x N matrix D = A x AT will intuitively give the document-to-document
similarities.

The eigenvectors of the D matrix will be the columns of the U matrix of the
SVD of A. Proof?
A x AT = U x Λ x VT x V x Λ x UT (remember that (A x B)T= BT x AT)
= U x Λ x Λ x UT : VT x V=I because V is orthonormal
= U x Λ2 x UT : because Λ is a diagonal matrix.
Notice that D is symmetric because (U x Λ2 x UT )T = U x Λ2 x UT
or simply because (A x AT)T = (AT)T x AT = A x AT
Because D is symmetric, it can be written, like S above, as D = U x Λ′ x UT, where the columns of U are D's eigenvectors and Λ′ (= Λ2) holds D's eigenvalues.
SVD
Symmetrically, the n x n matrix T = AT x A will give the term-to-term
similarities (the covariance matrix).

The eigenvectors of the T matrix are the columns of the V matrix of the
SVD of A.
T is a symmetric matrix because (AT x A)T=AT x ATT=AT x A
AT x A = V x Λ x UT x U x Λ x VT= V x Λ x Λ x VT = V x Λ2 x VT
Notice that both D and T have the same eigenvalues, which are the squares of the λi elements (the singular values of A).
These observations show the close relation between SVD and PCA, which uses the eigenvectors of the covariance matrix.
SVD
Very important property (we will see it again in Kleinberg's algorithm):
(AT x A)^k x v′ ≈ (constant) v1 for k >> 1, where v′ is a random vector and v1 is the first right singular vector of A, or equivalently the first eigenvector of AT x A (as we already proved). Proof?
(AT x A)^k = (AT x A) x (AT x A) x … = (V x Λ^2 x VT) x (V x Λ^2 x VT) x …
= (V x Λ^4 x VT) x … = V x Λ^(2k) x VT
Using the spectral decomposition:
(AT x A)^k = V x Λ^(2k) x VT = λ1^(2k) v1 v1T + λ2^(2k) v2 v2T + … + λn^(2k) vn vnT
Because λ1 > λi for i ≠ 1 ⇒ λ1^(2k) >> λi^(2k)
Thus (AT x A)^k ≈ λ1^(2k) v1 v1T
Now (AT x A)^k x v′ = λ1^(2k) v1 v1T x v′ = (const) v1,
because v1T x v′ is a scalar.
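This is exactly the classic power-iteration idea; here is a sketch with a random matrix (renormalizing at each step to keep the numbers bounded, which does not change the direction):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((7, 5))
v = rng.standard_normal(5)                 # random starting vector v'

for _ in range(50):                        # apply (A^T A) repeatedly
    v = A.T @ (A @ v)
    v /= np.linalg.norm(v)                 # renormalize to avoid overflow

v1 = np.linalg.svd(A)[2][0]                # first right singular vector of A
print(np.abs(v @ v1))                      # ≈ 1.0: v is parallel to v1 (up to sign)
```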

SVD
Geometrically, this means that if we multiply (almost) any vector by the matrix (AT x A)^k, the result is a vector that is approximately parallel to the first eigenvector.

PCA and SVD
Summary for PCA and SVD
Objective: find the principal components P of a data matrix A(N,n).
1. First zero-mean the columns of A (translate the origin to the center of gravity).
2. Apply PCA or SVD to find the principal components P of A.
PCA:
I. Calculate the covariance matrix C = AT x A / (N-1).
II. P = the eigenvectors of C.
III. The variance along each new dimension is given by the corresponding eigenvalue.
SVD:
I. Calculate the SVD of A.
II. P = V: the right singular vectors.
III. The variances are given by the squared singular values (scaled by the same 1/(N-1) factor).
3. Project the data onto the feature space: F = A x P.
4. Optional: reconstruct A' from F, where A' is the compressed version of A (see the sketch below).
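A sketch of both routes on assumed correlated data, checking that they give the same principal directions and variances:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((100, 3)) @ np.array([[3.0, 0.0, 0.0],
                                              [1.0, 1.0, 0.0],
                                              [0.0, 0.5, 0.2]])   # correlated toy data
A = A - A.mean(axis=0)                       # 1. zero-mean the columns
N = len(A)

# 2a. PCA route: eigenvectors of the covariance matrix
C = A.T @ A / (N - 1)
eigvals, P_pca = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, P_pca = eigvals[order], P_pca[:, order]

# 2b. SVD route: right singular vectors of A
U, s, Vt = np.linalg.svd(A, full_matrices=False)
P_svd = Vt.T

print(np.allclose(eigvals, s ** 2 / (N - 1)))        # True: same variances
print(np.allclose(np.abs(P_pca), np.abs(P_svd)))     # True: same directions (up to sign)
F = A @ P_pca[:, :2]                                 # 3. project onto the top 2 components
```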
Outline

• Mathematical background
• PCA
• SVD
• Some PCA and SVD applications
• Case study: LSI

SVD and PCA applications
• LSI: Latent Semantic Indexing.
• Solving over-specified (no solution: least-squares-error solution) and under-specified (infinite number of solutions: shortest-length solution) linear equations.
• Ratio rules (compute quantifiable association rules like bread:milk:butter = 2:4:3).
• Google/PageRank algorithm (random walk with restart).
• Kleinberg/HITS algorithm (compute hub and authority scores for nodes).
• Query feedback (learn to estimate the selectivity of queries: a regression problem).
• Image compression (other methods: the DCT used in JPEG, and wavelet compression).
• Data visualization (by projecting the data onto 2D).
Variations: ICA
ICA (Independent Components Analysis)
ICA relaxes the constraint of orthogonality but keeps the linearity; thus, it can be more flexible than PCA in finding patterns.

X (N,n) = H (N,n) x B (n,n)

where X is the data set, H holds the hidden variables, and B holds the basis vectors; hi,j can be understood as the weight of bj in the instance Xi.
Variations: ICA
Linearity: Xi = hi,1 b1 + hi,2 b2
Problem definition: Knowing X, find H and B.
Make the hidden variables hi mutually independent:
p(hi, hj) = p(hi) * p(hj)

Which figure satisfies this independence requirement?
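As an illustration (not from the slides), scikit-learn's FastICA, one standard ICA algorithm, can recover the hidden variables H and the basis matrix B from assumed toy mixtures X = H x B:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(5)
t = np.linspace(0, 8, 2000)
H_true = np.c_[np.sin(2 * t), np.sign(np.cos(3 * t))]        # two independent sources
B_true = np.array([[1.0, 0.5],
                   [0.4, 1.2]])                              # mixing / basis matrix
X = H_true @ B_true + 0.01 * rng.standard_normal((2000, 2))  # observed data X = H x B (+ noise)

ica = FastICA(n_components=2, random_state=0)
H_est = ica.fit_transform(X)       # estimated hidden variables (independent sources)
B_est = ica.mixing_.T              # estimated basis matrix (up to permutation and scale)
print(B_est)
```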

Outline

• Mathematical background
• PCA
• SVD
• Some PCA and SVD applications
• Case study: LSI

SVD and LSI
LSI: Latent Semantic Indexing.
Idea: try to group similar terms together, to form a few concepts,
then map the documents into vectors in the concept-space, as
opposed to vectors in the n-dimensional space, where n is the
vocabulary size of the document collection.
This approach automatically groups terms that occur together into concepts. Then, every time the user asks for a term, the system determines the relevant concepts and searches for them.
In order to map documents or queries into the concept space, we
need the term-to-concept similarity matrix V.

SVD and LSI
Example: find the documents containing the term 'data', i.e. the query q is the term vector with a 1 in the position of 'data' and 0 elsewhere.

To translate q to a vector qc in the concept space:

qc = VT x q

This means that q is related to the CS group of terms (with strength = 0.58) and unrelated to the Medical group of terms.
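A sketch of this mapping on a small term-count matrix in the spirit of the CS/Medical example (the exact counts are assumed, not taken from the slides); with this matrix the query's weight on the CS concept comes out to 1/√3 ≈ 0.58, consistent with the value quoted above:

```python
import numpy as np

# rows = documents, columns = terms: [data, information, retrieval, brain, lung]
A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt[:2].T                              # term-to-concept matrix for the 2 strongest concepts

q = np.array([1.0, 0, 0, 0, 0])           # query: the single term 'data'
d = np.array([0.0, 0, 1, 0, 0])           # document with the single word 'retrieval'

qc, dc = V.T @ q, V.T @ d                 # map query and document into concept space
cos = qc @ dc / (np.linalg.norm(qc) * np.linalg.norm(dc))
print(qc)                                 # loads only on the CS concept (≈ ±0.58, 0)
print(cos)                                # 1.0: a perfect match despite sharing no term
```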
SVD and LSI
More importantly, qc now involves the terms 'information' and 'retrieval', so the LSI system may return documents that do not necessarily contain the term 'data'.
For example, consider a document d containing the single word 'retrieval': d will be mapped into the concept space and will be a perfect match for the query.
Cosine similarity is one way to measure the similarity between the
query and the documents.
Experiments showed that LSI outperforms standard vector-space methods, with improvements of as much as 30% in terms of precision and recall.
Thank you for listening

Questions or Thoughts??

CS3550 Project: Iyad Batal
