
Principal Component Analysis (PCA) applied to images

Václav Hlaváč

Czech Technical University in Prague


Czech Institute of Informatics, Robotics and Cybernetics
160 00 Prague 6, Jugoslávských partyzánů 1580/3, Czech Republic
http://people.ciirc.cvut.cz/hlavac, [email protected]
also Center for Machine Perception, http://cmp.felk.cvut.cz
Courtesy: Václav Voráček jr.

Outline of the lecture:


Principal components, informal idea. PCA derivation, PCA for images.
Needed linear algebra. Drawbacks. Interesting behaviors live in manifolds.
Least-squares approximation. Subspace methods, LDA, CCA, . . .



PCA, the instance of the eigen-analysis
2/27

PCA seeks to represent observations (or signals, images, and general data) in a form that
enhances the mutual independence of contributory components.

One observation is assumed to be a point in a p-dimensional linear space.

This linear space has some 'natural' orthogonal basis vectors. It is advantageous to express an
observation as a linear combination with regard to this 'natural' basis (given by eigen-vectors,
as we will see later).

PCA is mathematically defined as an orthogonal linear transformation that transforms the
data to a new coordinate system such that the greatest variance by some projection of the
data comes to lie on the first coordinate (called the first principal component), the second
greatest variance on the second coordinate, and so on.
Geometric rationale of PCA
3/27

The PCA objective is to rotate rigidly the coordinate axes of the p-dimensional linear space to
new 'natural' positions (principal axes) such that:

Coordinate axes are ordered such that principal axis 1 corresponds to the highest variance in
the data, axis 2 has the next highest variance, . . . , and axis p has the lowest variance.

The covariance among each pair of principal axes is zero, i.e. they are uncorrelated.


Geometric motivation, principal components (1)
4/27

Consider a two-dimensional vector space of observations, (x1, x2).

Each observation corresponds to a single point in the vector space.

The goal: find another basis of the vector space which treats the variations of the data better.

We will see later: data points (observations) are represented in a rotated orthogonal
coordinate system. The origin is the mean of the data points and the axes are provided by
the eigen-vectors.
Geometric motivation, principal components (2)
5/27

Assume a single straight line approximating best the observations in the (total) least-squares
sense, i.e. by minimizing the sum of perpendicular distances between the data points and
the line.

The first principal direction (component) is the direction of this line. Let it be a new basis
vector z1.

The second principal direction (component, basis vector) z2 is a direction perpendicular to z1
that minimizes the distances of the data points to a corresponding straight line.

For higher-dimensional observation spaces, this construction is repeated.
Eigen-values, eigen-vectors of matrices
6/27

Assume a finite-dimensional vector space and a square, regular n × n matrix A.

Eigen-vectors are solutions of the eigen-equation A v = λ v, where the (column) eigen-vector v
is one of the eigen-vectors of the matrix A and λ is one of its eigen-values (which may be
complex). The matrix A has n eigen-values λi and n eigen-vectors vi, i = 1, . . . , n.

Let us derive: A v = λ v ⇒ A v − λ v = 0 ⇒ (A − λ I) v = 0, where I is the identity
matrix. The equation (A − λ I) v = 0 has a non-zero solution v if and only if
det(A − λ I) = 0.

The polynomial det(A − λ I) is called the characteristic polynomial of the matrix A. The
fundamental theorem of algebra implies that the characteristic polynomial can be factored,
i.e. det(A − λ I) = (λ1 − λ)(λ2 − λ) . . . (λn − λ).

Eigen-values λi are not necessarily distinct. Multiple eigen-values arise from multiple roots of
the characteristic polynomial.
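The eigen-equation and the characteristic polynomial can be checked numerically. Below is a minimal Python/NumPy sketch (not part of the slides; the 2 × 2 example matrix is made up): it verifies A v = λ v for each eigen-pair and that the eigen-values are the roots of det(A − λ I).

```python
# A rough numerical check, assuming NumPy; the 2x2 matrix is made up for illustration.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# np.linalg.eig returns the eigen-values and the eigen-vectors (as columns).
eigvals, eigvecs = np.linalg.eig(A)

for i in range(A.shape[0]):
    v, lam = eigvecs[:, i], eigvals[i]
    assert np.allclose(A @ v, lam * v)        # the eigen-equation A v = lambda v holds

# The eigen-values are exactly the roots of the characteristic polynomial det(A - lambda I).
char_poly = np.poly(A)                        # coefficients of det(A - lambda I)
print(np.sort(eigvals), np.sort(np.roots(char_poly)))
```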


Deterministic view first, statistical view later
7/27

We start by reviewing eigen-analysis from a deterministic, linear-algebra standpoint.

Later, we will develop a statistical view based on covariance matrices and principal component
analysis.
A system of linear equations, a reminder
8/27
A system of linear equations can be expressed in matrix form as A x = b, where A is the
matrix of the system.

Example:

x + 3y − 2z = 5
3x + 5y + 6z = 7
2x + 4y + 3z = 8

⟹  A = \begin{pmatrix} 1 & 3 & -2 \\ 3 & 5 & 6 \\ 2 & 4 & 3 \end{pmatrix}, \qquad
    b = \begin{pmatrix} 5 \\ 7 \\ 8 \end{pmatrix}.

The augmented matrix of the system is created by concatenating the column vector b to the
matrix A, i.e., [A|b].

Example: [A|b] = \begin{pmatrix} 1 & 3 & -2 & 5 \\ 3 & 5 & 6 & 7 \\ 2 & 4 & 3 & 8 \end{pmatrix}.

This system has a solution if and only if the rank of the matrix A is equal to the rank of the
augmented matrix [A|b]. The solution is unique if the rank of A equals the number of
unknowns, or equivalently if the null space of A is trivial, null(A) = {0}.
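As a small aside (not in the slides), the rank condition can be checked numerically for the example system above; this sketch assumes NumPy and uses only its standard matrix_rank and solve routines.

```python
# A minimal sketch: checking solvability of A x = b via ranks of A and [A|b].
import numpy as np

A = np.array([[1.0, 3.0, -2.0],
              [3.0, 5.0,  6.0],
              [2.0, 4.0,  3.0]])
b = np.array([5.0, 7.0, 8.0])

Ab = np.column_stack([A, b])          # augmented matrix [A|b]
rank_A = np.linalg.matrix_rank(A)
rank_Ab = np.linalg.matrix_rank(Ab)

if rank_A == rank_Ab == A.shape[1]:   # equal ranks and full column rank -> unique solution
    x = np.linalg.solve(A, b)
    print("unique solution:", x)
elif rank_A == rank_Ab:
    print("infinitely many solutions")
else:
    print("no exact solution (system is inconsistent)")
```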
Similarity transformations of a matrix
9/27

Let A be a regular matrix.

Matrices A and B with real or complex entries are called similar if there exists an invertible
square matrix P such that P⁻¹ A P = B.

The matrix P is called the change-of-basis matrix.

A similarity transformation refers to a matrix transformation that results in similar matrices.

Similar matrices have useful properties: they have the same rank, determinant, trace,
characteristic polynomial, minimal polynomial and eigen-values (but not necessarily the same
eigen-vectors).

Similarity transformations allow us to express regular matrices in several useful forms, e.g.,
the Jordan canonical form or the Frobenius normal form (also called the rational canonical
form).
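A quick numerical illustration (not from the slides; the random matrices are placeholders): a similarity transformation preserves eigen-values, trace and determinant, while the eigen-vectors change.

```python
# A minimal sketch: similar matrices B = P^{-1} A P share eigen-values.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
P = rng.normal(size=(4, 4))           # a random matrix is invertible with probability 1

B = np.linalg.inv(P) @ A @ P          # similarity transformation

eig_A = np.sort_complex(np.linalg.eigvals(A))
eig_B = np.sort_complex(np.linalg.eigvals(B))
print(np.allclose(eig_A, eig_B))      # True: the spectra coincide
print(np.isclose(np.trace(A), np.trace(B)),
      np.isclose(np.linalg.det(A), np.linalg.det(B)))
```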
Jordan canonical form of a matrix
10/27

Any complex square matrix is similar to a matrix in the Jordan canonical form

J = \begin{pmatrix} J_1 & & 0 \\ & \ddots & \\ 0 & & J_p \end{pmatrix},
\qquad \text{where the Jordan blocks are} \qquad
J_i = \begin{pmatrix} \lambda_i & 1 & & 0 \\ & \lambda_i & \ddots & \\ & & \ddots & 1 \\ 0 & & & \lambda_i \end{pmatrix},

in which λi are the (possibly multiple) eigen-values.

The multiplicity of the eigen-value gives the size of the Jordan block.

If the eigen-value is not multiple, then the Jordan block degenerates to the eigen-value itself.

Least-square approximation
11/27
Assume that abundant data comes from many observations or measurements. This case is
very common in practice.

We intend to approximate the data by a linear model, i.e. a system of linear equations, e.g. a
straight line in particular.

Strictly speaking, the observations are likely to be in contradiction with respect to the
system of linear equations.

In the deterministic world, the conclusion would be that the system of linear equations has no
solution.

There is an interest in finding a solution to the system which is in some sense 'closest' to
the observations, perhaps compensating for noise in the observations.

We will usually adopt a statistical approach by minimizing the least-squares error.
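A minimal sketch of such an over-determined system (not from the slides; the noisy data are synthetic): a straight line y = a x + c is fitted to many observations in the least-squares sense with NumPy's standard lstsq routine.

```python
# Fit a line to noisy observations: each observation gives one equation a*x_i + c = y_i.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)   # noisy observations

A = np.column_stack([x, np.ones_like(x)])                # system matrix, unknowns p = (a, c)
p, residuals, rank, _ = np.linalg.lstsq(A, y, rcond=None)
a, c = p
print(f"estimated a = {a:.3f}, c = {c:.3f}")             # close to the true 2.0 and 1.0
```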

Principal component analysis, introduction
12/27
PCA is a powerful and widely used linear technique in statistics, signal processing, image
processing, and elsewhere.

It goes by several names: the (discrete) Karhunen-Loève transform (KLT, after Kari Karhunen,
1915-1992, and Michel Loève, 1907-1979) or the Hotelling transform (after Harold Hotelling,
1895-1973). It was invented by Pearson (1901) and H. Hotelling (1933).

In statistics, PCA is a method for simplifying a multidimensional dataset to lower dimensions
for analysis, visualization or data compression.

PCA represents the data in a new coordinate system in which the basis vectors follow the modes
of greatest variance in the data.

Thus, new basis vectors are calculated for the particular data set.

The price to be paid for PCA's flexibility is higher computational requirements as compared
to, e.g., the fast Fourier transform.


Derivation, M -dimensional case (1)
13/27

Suppose a data set comprising N observations, each of M variables (dimensions). Usually
N ≫ M.

The aim: to reduce the dimensionality of the data so that each observation can be usefully
represented with only L variables, 1 ≤ L < M.

Data are arranged as a set of N column data vectors, each representing a single observation
of M variables: the n-th observation is a column vector xn = (x1, . . . , xM)⊤, n = 1, . . . , N.

We thus have an M × N data matrix X. Such matrices are often huge because N may be
very large; this is in fact good, since many observations imply better statistics.
Data normalization is needed first
14/27

This procedure is not applied to the raw data R but to normalized data X, obtained as follows.

The raw observed data are arranged in a matrix R and the empirical mean is calculated along
each row of R. The result is stored in a vector u, the elements of which are the scalars

u(m) = \frac{1}{N} \sum_{n=1}^{N} R(m, n) , \qquad m = 1, \ldots, M .

The empirical mean is subtracted from each column of R: if e is a 1 × N row vector of ones,
we write

X = R − u e .
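A small sketch of this normalization step (not from the slides, assuming NumPy; the random matrix R stands in for real measurements):

```python
# Arrange data as an M x N matrix R and subtract the per-row empirical mean, X = R - u e.
import numpy as np

M, N = 5, 100                          # M variables, N observations
rng = np.random.default_rng(2)
R = rng.normal(loc=3.0, size=(M, N))   # raw data, one observation per column

u = R.mean(axis=1, keepdims=True)      # empirical mean of each row, shape (M, 1)
e = np.ones((1, N))                    # row vector of ones
X = R - u @ e                          # equivalently: X = R - u (broadcasting)

print(np.allclose(X.mean(axis=1), 0.0))   # each row of X now has zero mean
```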
Derivation, M -dimensional case (2)
15/27
If we approximate the higher-dimensional data X (of dimension M) by a lower-dimensional
representation Y (of dimension L), then the mean square error ε² of this approximation is
given by

\varepsilon^2 = \frac{1}{N} \sum_{n=1}^{N} |x_n|^2 - \sum_{i=1}^{L} b_i^\top \left( \frac{1}{N} \sum_{n=1}^{N} x_n x_n^\top \right) b_i ,

where bi, i = 1, . . . , L, are the basis vectors of the linear space of dimension L.

If ε² is to be minimal, then the following term has to be maximal:

\sum_{i=1}^{L} b_i^\top \operatorname{cov}(x)\, b_i , \qquad \text{where} \quad \operatorname{cov}(x) = \frac{1}{N} \sum_{n=1}^{N} x_n x_n^\top

is the covariance matrix.
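The following NumPy sketch (not from the slides; the data are synthetic) illustrates the claim: choosing the bi as the leading eigen-vectors of cov(x) maximizes the sum above, and the maximal value equals the sum of the L largest eigen-values.

```python
# Maximizing sum_i b_i^T cov(x) b_i by taking the leading eigen-vectors of cov(x).
import numpy as np

rng = np.random.default_rng(3)
M, N, L = 10, 500, 3
X = rng.normal(size=(M, N))
X -= X.mean(axis=1, keepdims=True)      # normalized data, zero mean in every row

cov = (X @ X.T) / N                     # cov(x) = (1/N) sum_n x_n x_n^T
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: real symmetric matrix, ascending eigen-values
order = np.argsort(eigvals)[::-1]
B = eigvecs[:, order[:L]]               # b_1, ..., b_L: the L principal directions

captured = sum(B[:, i] @ cov @ B[:, i] for i in range(L))
print(captured, eigvals[order[:L]].sum())   # equal: the maximal attainable value
```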


Approximation error
16/27
The covariance matrix cov(x) has special properties: it is real, symmetric and positive
semi-definite.

The covariance matrix is therefore guaranteed to have real eigen-values.

Matrix theory tells us that these eigen-values may be sorted (largest to smallest) and the
associated eigen-vectors taken as the basis vectors that provide the maximum we seek.

In the data approximation, the dimensions corresponding to the smallest eigen-values are
omitted. The mean square error ε² of keeping only the first L components is given by

\varepsilon^2 = \operatorname{trace}\bigl(\operatorname{cov}(x)\bigr) - \sum_{i=1}^{L} \lambda_i = \sum_{i=L+1}^{M} \lambda_i ,

where trace(A) is the trace of the matrix A, i.e. the sum of its diagonal elements. The trace
equals the sum of all eigen-values.
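A numerical check of this identity (a sketch, not from the slides; the zero-mean data are synthetic):

```python
# The approximation error equals the sum of the discarded eigen-values.
import numpy as np

rng = np.random.default_rng(4)
M, N, L = 8, 400, 2
X = rng.normal(size=(M, N))
X -= X.mean(axis=1, keepdims=True)

cov = (X @ X.T) / N
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
B = eigvecs[:, order[:L]]                        # keep the L leading eigen-vectors

X_hat = B @ (B.T @ X)                            # rank-L approximation of the data
mse = np.mean(np.sum((X - X_hat) ** 2, axis=0))  # empirical mean square error

print(mse)
print(np.trace(cov) - eigvals[order[:L]].sum())  # trace(cov) minus the kept eigen-values
print(eigvals[order[L:]].sum())                  # sum of discarded eigen-values; all equal
```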
Can we use PCA for images?
17/27
It took a while to realize (Turk, Pentland, 1991), but yes.

Let us consider a 321 × 261 image.

The image is considered as a very long 1D vector obtained by concatenating the image pixels
column by column (or alternatively row by row), i.e. a vector of length 321 × 261 = 83781.

The huge number 83781 is the dimensionality of our vector space.

The intensity is assumed to vary in each pixel of the image.
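A minimal sketch of this vectorization (not from the slides; a random array stands in for the 321 × 261 grey-level image), assuming NumPy:

```python
# Turn an image into a long 1D column vector and back.
import numpy as np

H, W = 321, 261
image = np.random.default_rng(5).integers(0, 256, size=(H, W)).astype(float)

x = image.flatten(order="F")              # column-by-column concatenation, length 83781
print(x.shape)                            # (83781,): the dimensionality of the vector space

restored = x.reshape((H, W), order="F")   # rearranging the long vector back to an image
print(np.array_equal(image, restored))    # True
```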



What if we have 32 instances of images?
18/27
Fewer observations than unknowns, and what?
19/27

We have only 32 observations and 83781 unknowns in our example!

The induced system of linear equations is not over-constrained but under-constrained.

PCA is still applicable.

The number of principal components is less than or equal to the number of observations
available (32 in our particular case). This is because the (square) covariance matrix has a size
corresponding to the number of observations.

The eigen-vectors we derive are called eigen-images, after rearranging the 1D vectors back
into rectangular images.

Let us perform the dimensionality reduction from 32 to 4 in our example.
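The remark about the size of the covariance matrix is commonly exploited through the small N × N matrix X⊤X instead of the huge M × M covariance matrix. The sketch below (not from the slides; random data stand in for the 32 mean-subtracted image vectors) follows that standard trick as one possible realization:

```python
# Eigen-images from the small N x N matrix X^T X (if X^T X v = mu v, then X v is an
# eigen-vector of X X^T with the same eigen-value).
import numpy as np

M, N, L = 83781, 32, 4                      # pixels, images, kept components
rng = np.random.default_rng(6)
X = rng.normal(size=(M, N))                 # columns: vectorized images (stand-ins)
X -= X.mean(axis=1, keepdims=True)          # subtract the mean image

G = (X.T @ X) / N                           # small N x N matrix, cheap to decompose
eigvals, V = np.linalg.eigh(G)
order = np.argsort(eigvals)[::-1][:L]       # indices of the L largest eigen-values

B = X @ V[:, order]                         # map back: eigen-images of length M
B /= np.linalg.norm(B, axis=0)              # normalize each eigen-image

Q = B.T @ X                                 # L coefficients for each of the N images
print(B.shape, Q.shape)                     # (83781, 4) and (4, 32)
```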

PCA, graphical illustration
20/27

[Figure: the data matrix of N observed images (one image per column) is approximated in the
PCA representation by L basis vectors shared by all N images; each observed image is then
represented by its L PCA coefficients.]
Approximation by 4 principal components only
21/27

Reconstruction of the image from four basis vectors bi, i = 1, . . . , 4, which can be displayed
as images by rearranging the (long) vectors back to matrix form.

The linear combination was computed as q1 b1 + q2 b2 + q3 b3 + q4 b4 = 0.078 b1 +
0.062 b2 − 0.182 b3 + 0.179 b4.

The mean image, which was subtracted when the data were normalized earlier, has to be added
back, cf. slide 14.

[Figure: the reconstructed image displayed as the weighted sum q1 b1 + q2 b2 + q3 b3 + q4 b4
of the four eigen-images.]
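A sketch of this reconstruction step (not from the slides; the eigen-images and mean image below are random stand-ins, only the four coefficients are the ones quoted above):

```python
# Reconstruct one image from L = 4 eigen-images, adding the mean image back.
import numpy as np

M, L = 83781, 4
rng = np.random.default_rng(7)
B = rng.normal(size=(M, L))                   # eigen-images as columns (stand-ins)
mean_image = rng.normal(size=M)               # mean of the training images (stand-in)

q = np.array([0.078, 0.062, -0.182, 0.179])   # coefficients quoted on the slide
reconstruction = mean_image + B @ q           # mean + q1 b1 + q2 b2 + q3 b3 + q4 b4

image_2d = reconstruction.reshape((321, 261), order="F")   # back to a displayable image
print(image_2d.shape)
```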
Reconstruction fidelity, 4 components
22/27
Reconstruction fidelity, original
23/27
PCA drawbacks, the images case
24/27

By rearranging pixels column by column into a 1D vector, relations of a given pixel to pixels in
neighboring rows are not taken into account.

Another disadvantage is the global nature of the representation; a small change or error in
the input images influences the whole eigen-representation. However, this property is inherent
in all linear integral transforms.
Data (images) representations
25/27

Reconstructive (also generative) representation

Enables (partial) reconstruction of the input images (hallucinations).
It is general; it is not tuned to a specific task.
Enables closing the feedback loop, i.e. bidirectional processing.

Discriminative representation

Does not allow partial reconstruction.
Less general; specific to a particular task.
Stores only the information needed for the decision task.



Dimensionality issues, low-dimensional manifolds
26/27
Images, as we saw, lead to enormous dimensionality.

The data of interest often live in a much lower-dimensional subspace called the manifold.

Example (courtesy Thomas Brox): a 100 × 100 image of the digit 3 that is only shifted and
rotated, i.e. there are only 3 degrees of variation.

All data points live in a 3-dimensional manifold of the 10,000-dimensional observation space.

The difficulty of the task is to find out empirically from the data in which manifold the data
vary.
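To make the example concrete, here is a rough sketch (not from the lecture; it assumes scipy.ndimage and a crude stand-in pattern rather than an actual digit): every generated image is fully determined by three parameters, even though it lives in the 10,000-dimensional pixel space.

```python
# Images that only shift in x, y and rotate form a 3-parameter family,
# i.e. a 3-dimensional manifold inside the 10,000-dimensional pixel space.
import numpy as np
from scipy import ndimage

base = np.zeros((100, 100))
base[40:60, 45:55] = 1.0                      # a crude stand-in for the digit "3"

def generate(dx, dy, angle_deg):
    """One observation, fully determined by the 3 parameters (dx, dy, angle)."""
    img = ndimage.rotate(base, angle_deg, reshape=False, order=1)
    img = ndimage.shift(img, (dy, dx), order=1)
    return img.flatten()                      # a point in the 10,000-dimensional space

samples = np.stack([generate(dx, dy, a)
                    for dx in (-5, 0, 5) for dy in (-5, 0, 5) for a in (0, 15, 30)])
print(samples.shape)                          # (27, 10000): 27 points on a 3-D manifold
```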
Subspace methods
27/27
Subspace methods exploit the fact that data (images) can be represented in a subspace of the
original vector space in which the data live.

Examples of different methods:

Principal Component Analysis (PCA): reconstructive, unsupervised, optimal reconstruction;
minimizes the squared reconstruction error, maximizes the variance of the projected input
vectors.
Linear Discriminant Analysis (LDA): discriminative, supervised, optimal separation; maximizes
the distance between projection vectors.
Canonical Correlation Analysis (CCA): supervised, optimal correlation; motivated by a
regression task, e.g. robot localization.
Independent Component Analysis (ICA): independent factors.
Non-negative Matrix Factorization (NMF): non-negative factors.
Kernel methods for nonlinear extension: local straightening by kernel functions.
