15PCA
Václav Hlaváč
Outline: needed linear algebra, least-squares approximation, PCA as an instance of eigen-analysis, drawbacks, interesting behaviors live in manifolds, subspace methods, LDA, CCA, . . .
PCA, the instance of the eigen-analysis (2/27)
PCA seeks to represent observations (or signals, images, and general data) in a form that captures as much of their variability as possible in a few components. Each observation is regarded as a point (vector) in a linear space. This linear space has some ‘natural’ orthogonal basis vectors, and it is of advantage to express an observation as a linear combination with regard to this ‘natural’ basis (given by eigen-vectors, as we will see later).
PCA is mathematically defined as an orthogonal linear transformation that transforms the
data to a new coordinate system such that the greatest variance by some projection of the
data comes to lie on the first coordinate (called the first principal component), the second
greatest variance on the second coordinate, and so on.
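To make the definition concrete, here is a minimal NumPy sketch (not part of the original slides; the function name pca and the toy data are illustrative). It finds the new coordinate system by eigen-decomposition of the covariance matrix and orders the axes by decreasing variance.

import numpy as np

def pca(data, n_components):
    """Project rows of `data` (observations x dimensions) onto the
    directions of greatest variance."""
    mean = data.mean(axis=0)
    centered = data - mean                      # remove the empirical mean
    cov = np.cov(centered, rowvar=False)        # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # symmetric -> real eigen-pairs
    order = np.argsort(eigvals)[::-1]           # sort by decreasing variance
    basis = eigvecs[:, order[:n_components]]    # leading principal directions
    return centered @ basis, basis, mean

# Example: 2-D points stretched along the first axis.
rng = np.random.default_rng(0)
points = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])
scores, basis, mean = pca(points, n_components=1)
print(basis)   # first column is approximately the direction of greatest variance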
Geometric rationale of PCA (3/27)
Each observation is a point in an M-dimensional vector space.
The goal: find the direction in which the projected observations have the largest variance and use it as the first basis vector. The remaining basis vectors are sought in the orthogonal complement, where the same construction is repeated.
Eigen-values, eigen-vectors of matrices (6/27)
A non-zero vector v satisfying A v = λ v is one of matrix A’s eigen-vectors and λ is one of its eigen-values (which may be complex). An n × n matrix A has n eigen-values λi and n eigen-vectors vi, i = 1, . . . , n.
Let us derive: A v = λ v ⇒ A v − λ v = 0 ⇒ (A − λ I) v = 0, where I is the identity matrix. A non-trivial solution exists only if det(A − λ I) = 0. The fundamental theorem of algebra implies that this characteristic polynomial can be factored, i.e. det(A − λ I) = (λ1 − λ)(λ2 − λ) · · · (λn − λ) = 0.
Eigen-values λi are not necessarily distinct. Multiple eigen-values arise from multiple roots of the characteristic polynomial.
Later, we will develop a statistical view based on covariance matrices and principal component
analysis.
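A quick numerical illustration of the relations A v = λ v and det(A − λ I) = 0 (a NumPy sketch with an arbitrary 2 × 2 matrix, not from the slides):

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eig(A)      # n eigen-values and eigen-vectors

for lam, v in zip(eigvals, eigvecs.T):   # eigen-vectors are the columns of eigvecs
    print(np.allclose(A @ v, lam * v))   # True: (A - lambda I) v = 0

# The eigen-values are the roots of the characteristic polynomial det(A - lambda I).
print(np.roots(np.poly(A)))              # same values, obtained from the polynomial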
A system of linear equations, a reminder (8/27)
A system of linear equations can be expressed in matrix form as A x = b, where A is the coefficient matrix.
The augmented matrix of the system is created by concatenating the column vector b to the matrix A.
Example:
$$[A|b] = \left[\begin{array}{ccc|c} 3 & 5 & 6 & 7 \\ 2 & 4 & 3 & 8 \end{array}\right].$$
This system has a solution if and only if the rank of the matrix A is equal to the rank of the augmented matrix [A|b]. The solution is unique if the rank of A equals the number of unknowns or, equivalently, if the null space of A is trivial (nullity zero).
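The rank test can be checked numerically; below is a NumPy sketch using the example system above (the reading of the augmented matrix as a 2 × 3 system is our reconstruction):

import numpy as np

A = np.array([[3.0, 5.0, 6.0],
              [2.0, 4.0, 3.0]])
b = np.array([7.0, 8.0])

Ab = np.column_stack([A, b])                          # augmented matrix [A|b]
rank_A = np.linalg.matrix_rank(A)
rank_Ab = np.linalg.matrix_rank(Ab)

print(rank_A == rank_Ab)                              # solvable?
print(rank_A == A.shape[1])                           # unique? (rank = number of unknowns)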
Similarity transformations of a matrix (9/27)
Matrices A and B with real or complex entries are called similar if there exists an invertible matrix P such that B = P⁻¹ A P.
The similarity transformation refers to a matrix transformation that results in similar matrices.
Similar matrices have useful properties: they have the same rank, determinant, trace,
characteristic polynomial, minimal polynomial and eigen-values (but not necessarily the same
eigen-vectors).
Similarity transformations allow us to express square matrices in several useful forms, e.g., the Jordan canonical form or the Frobenius normal form (also called the rational canonical form).
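A small numerical sanity check (a NumPy sketch with arbitrary random matrices, not from the slides) that a similarity transformation preserves the characteristic polynomial, trace and determinant:

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
P = rng.normal(size=(4, 4))                  # almost surely invertible
B = np.linalg.inv(P) @ A @ P                 # similarity transformation

print(np.allclose(np.poly(A), np.poly(B)))   # same characteristic polynomial (so same eigen-values)
print(np.isclose(np.trace(A), np.trace(B)),
      np.isclose(np.linalg.det(A), np.linalg.det(B)))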
Jordan canonical form of a matrix (10/27)
Any complex square matrix is similar to a matrix in the Jordan canonical form
$$J = \begin{bmatrix} J_1 & & 0 \\ & \ddots & \\ 0 & & J_p \end{bmatrix}, \qquad \text{where the Jordan blocks are } J_i = \begin{bmatrix} \lambda_i & 1 & & 0 \\ & \lambda_i & \ddots & \\ & & \ddots & 1 \\ 0 & & & \lambda_i \end{bmatrix}.$$
If an eigen-value is not multiple, then the corresponding Jordan block degenerates to a 1 × 1 block containing the eigen-value itself.
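Computing a Jordan canonical form by hand is tedious; the sketch below assumes SymPy and its Matrix.jordan_form method. The example matrix is our own, constructed with a known block structure so the result can be verified.

import sympy as sp

# Build a matrix with a known Jordan structure: eigen-value 2 with a 2x2 block
# and a simple eigen-value 3.
J0 = sp.Matrix([[2, 1, 0],
                [0, 2, 0],
                [0, 0, 3]])
P0 = sp.Matrix([[1, 2, 0],
                [0, 1, 1],
                [1, 0, 1]])
A = P0 * J0 * P0.inv()        # similar to J0 by construction

P, J = A.jordan_form()        # SymPy returns (P, J) with A = P * J * P**-1
sp.pprint(J)                  # recovers the Jordan blocks (up to block ordering)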
Least-squares approximation (11/27)
Assume that abundant data comes from many observations or measurements. This case is described by an overdetermined system of linear equations, which in general has no exact solution.
There is an interest in finding a solution to the system which is in some sense ‘closest’ to the right-hand side b, e.g., the least-squares solution minimizing ‖A x − b‖; see the sketch below.
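In NumPy such a ‘closest’ solution is obtained with np.linalg.lstsq; the following sketch uses synthetic data (not from the slides) and minimizes ‖A x − b‖ for an overdetermined system.

import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(100, 3))                 # 100 equations, 3 unknowns
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true + 0.01 * rng.normal(size=100)  # noisy right-hand side

x_ls, residual, rank, _ = np.linalg.lstsq(A, b, rcond=None)
print(x_ls)                                   # least-squares solution: minimizes ||A x - b||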
PCA is also known as the Karhunen-Loève transform (after Kari Karhunen, 1915-1992, and Michel Loève, 1907-1979) or the Hotelling transform (after Harold Hotelling, 1895-1973). It was invented by K. Pearson (1901) and H. Hotelling (1933).
In statistics, PCA is a method for simplifying a multidimensional dataset to lower dimensions for analysis or visualization.
The price to be paid for PCA’s flexibility is in higher computational requirements as compared to fixed transforms such as the fast Fourier transform. In what follows, M denotes the dimensionality of each observation and N the number of observations.
The aim: to reduce the dimensionality of the data so that each observation can be usefully represented with only a few parameters. The number of observations N can be very large: this is in fact good, since many observations imply better statistics.
Data normalization is needed first (14/27)
This procedure is not applied to the raw data R but to normalized data X as follows.
The raw observed data is arranged in a matrix R and the empirical mean is calculated along
each row of R. The result is stored in a vector u whose elements are the scalars
$$u(m) = \frac{1}{N}\sum_{n=1}^{N} R(m,n)\,, \qquad \text{where } m = 1, \ldots, M\,.$$
The empirical mean is subtracted from each column of R: if e is a 1 × N row vector of ones, then
$$X = R - u\,e\,.$$
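The normalization step in NumPy (a sketch; R here is a random stand-in for the observed data, with one observation per column):

import numpy as np

M, N = 5, 100                                # dimension M, number of observations N
rng = np.random.default_rng(3)
R = rng.normal(loc=10.0, size=(M, N))        # raw data, one observation per column

u = R.mean(axis=1, keepdims=True)            # u(m) = (1/N) sum_n R(m, n)
e = np.ones((1, N))                          # row vector of ones
X = R - u @ e                                # X = R - u e  (equivalently R - u, by broadcasting)

print(np.allclose(X.mean(axis=1), 0.0))      # each row of X now has zero mean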
Derivation, M-dimensional case (2) (15/27)
If we approximate the higher dimensional data X (of dimension M) by the lower dimensional representation Y (of dimension L), then the mean square error ε² of this approximation is given by
$$\varepsilon^2 \;=\; \frac{1}{N}\sum_{n=1}^{N} |x_n|^2 \;-\; \sum_{i=1}^{L} b_i^\top\!\left(\frac{1}{N}\sum_{n=1}^{N} x_n x_n^\top\right)\! b_i \;=\; \frac{1}{N}\sum_{n=1}^{N} |x_n|^2 \;-\; \sum_{i=1}^{L} b_i^\top\, \mathrm{cov}(x)\, b_i\,, \qquad \text{where } \mathrm{cov}(x) = \frac{1}{N}\sum_{n=1}^{N} x_n x_n^\top\,.$$
The covariance matrix cov(x) is symmetric and positive semi-definite.
So the covariance matrix can be guaranteed to have real eigen-values.
Matrix theory tells us that these eigen-values may be sorted (largest to smallest) and the
associated eigen-vectors taken as the basis vectors that provide the maximum we seek.
In the data approximation, dimensions corresponding to the smallest eigen-values are omitted.
$$\varepsilon^2 \;=\; \operatorname{trace}\big(\mathrm{cov}(x)\big) \;-\; \sum_{i=1}^{L} \lambda_i \;=\; \sum_{i=L+1}^{M} \lambda_i\,,$$
where trace(A) is the trace (the sum of the diagonal elements) of the matrix A. The trace equals the sum of all eigen-values.
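The result can be verified numerically; the sketch below (NumPy, synthetic zero-mean data, our own variable names) keeps the L leading eigen-vectors and checks that the mean square error equals the sum of the discarded eigen-values.

import numpy as np

rng = np.random.default_rng(4)
M, N, L = 6, 2000, 2
X = rng.normal(size=(M, N)) * np.arange(1, M + 1)[:, None]   # different variance per row
X -= X.mean(axis=1, keepdims=True)                           # zero-mean data

cov = (X @ X.T) / N                                          # cov(x) = (1/N) sum_n x_n x_n^T
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]           # largest eigen-values first

B = eigvecs[:, :L]                                           # keep the L leading eigen-vectors
X_hat = B @ (B.T @ X)                                        # project and reconstruct
mse = np.mean(np.sum((X - X_hat) ** 2, axis=0))              # (1/N) sum_n |x_n - x_hat_n|^2

print(np.isclose(mse, eigvals[L:].sum()))                    # True: eps^2 = sum of discarded lambda_i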
Can we use PCA for images? (17/27)
It took a while to realize (Turk, Pentland, 1991), but yes.
The image is considered as a very long 1D vector created by concatenating image pixels column by column.
The number of principal components is less than or equal to the number of observations available (32 in our particular case). This is because the (square) covariance matrix has a size corresponding to the number of observations.
The eigen-vectors we derive are called eigen-images, after rearranging them back from 1D vectors into the 2D image shape.
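A sketch of the computation (NumPy; the ‘images’ are random stand-ins and the 64 × 64 size is our choice). With only 32 observations there are at most 32 non-trivial principal components, so the eigen-images can be obtained from an SVD of the data matrix instead of forming the huge pixel-by-pixel covariance matrix.

import numpy as np

n_images, h, w = 32, 64, 64
rng = np.random.default_rng(5)
images = rng.random(size=(n_images, h, w))   # stand-in data

X = images.reshape(n_images, -1).T           # each column: one image as a long 1D vector
X = X - X.mean(axis=1, keepdims=True)        # subtract the mean image

# The SVD yields at most n_images non-trivial principal directions.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
eigen_images = U.T.reshape(-1, h, w)         # rearrange eigen-vectors back to 2D

print(eigen_images.shape)                    # (32, 64, 64)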
[Figure: one PCA-represented image expressed as a linear combination of eigen-images.]
Reconstruction of the image from four basis vectors bi, i = 1, . . . , 4, which can be displayed as images (the data were normalized beforehand, cf. slide 14). The reconstructed image is the linear combination
$$q_1 b_1 + q_2 b_2 + q_3 b_3 + q_4 b_4\,,$$
where qi are the coordinates (projection coefficients) of the image with respect to the basis vectors bi.
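A sketch of the reconstruction itself (NumPy, random stand-in data; the coefficients qi are computed as projections onto the first four basis vectors):

import numpy as np

n_images, h, w = 32, 64, 64
rng = np.random.default_rng(6)
images = rng.random(size=(n_images, h * w))      # stand-in data, one image per row

mean_image = images.mean(axis=0)
X = (images - mean_image).T                      # centered images as columns
U, _, _ = np.linalg.svd(X, full_matrices=False)  # columns of U: eigen-images as 1D vectors

B = U[:, :4]                                     # basis vectors b_1, ..., b_4
x = X[:, 0]                                      # one centered image
q = B.T @ x                                      # coefficients q_i = b_i^T x
x_rec = B @ q + mean_image                       # q_1 b_1 + q_2 b_2 + q_3 b_3 + q_4 b_4 + mean
print(x_rec.reshape(h, w).shape)                 # rearranged back to a 2D image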
Reconstruction fidelity, 4 components (22/27)
Reconstruction fidelity, original (23/27)
PCA drawbacks, the images case (24/27)
A small change in any of the input images influences the whole eigen-representation. However, this property is inherent in all linear integral transforms.
Data (images) representations (25/27)
Discriminative representation: does not allow partial reconstruction.
The data of interest often live on a much lower-dimensional subset of the observation space, called a manifold.
Example: 100 × 100 images of the number 3 that are only shifted and rotated, i.e. there are only 3 degrees of variation (two shifts and one rotation).
All data points live on a 3-dimensional manifold embedded in the 10,000-dimensional observation space.
The difficulty of the task is to find out empirically from the data in which manifold the data
vary.
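A sketch of how such data could be generated (NumPy and SciPy's ndimage module assumed; the ‘digit’ is a synthetic blob rather than a real image of the number 3). Every observation is a point in a 10,000-dimensional space, yet the whole set is parameterized by just three numbers.

import numpy as np
from scipy.ndimage import rotate, shift

base = np.zeros((100, 100))
base[35:65, 45:55] = 1.0                     # crude stand-in for a digit

def sample(dx, dy, angle):
    """One observation: the base image shifted by (dx, dy) and rotated by angle."""
    img = shift(base, (dy, dx), order=1)
    return rotate(img, angle, reshape=False, order=1).ravel()   # a point in R^10000

rng = np.random.default_rng(7)
data = np.stack([sample(rng.uniform(-10, 10), rng.uniform(-10, 10),
                        rng.uniform(-30, 30)) for _ in range(200)])
print(data.shape)                            # (200, 10000): 3 degrees of freedom, 10,000-D ambient space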
Subspace methods (27/27)
Subspace methods exploit the fact that data (images) can be represented in a subspace of the original vector space in which the data live.