
Lecture Notes on

Principal Component Analysis


Laurenz Wiskott∗

Revision history (selection):


11 March 2004, first version
14 February 2006, major revision
27 October 2009, added the SVD section
21 February 2013, minor revision

Contents

1 Intuition
  1.1 Problem statement
  1.2 Projection and reconstruction error
  1.3 Reconstruction error and variance
  1.4 Covariance matrix
  1.5 Covariance matrix and higher order structure
  1.6 PCA by diagonalizing the covariance matrix

2 Formalism
  2.1 Definition of the PCA-optimization problem
  2.2 Matrix V^T: Mapping from high-dimensional old coordinate system to low-dimensional new coordinate system
  2.3 Matrix V: Mapping from low-dimensional new coordinate system to subspace in old coordinate system
  2.4 Matrix (V^T V): Identity mapping within new coordinate system
  2.5 Matrix (V V^T): Projection from high- to low-dimensional (sub)space within old coordinate system
  2.6 Variance
  2.7 Reconstruction error
  2.8 Covariance matrix
  2.9 Eigenvalue equation of the covariance matrix
  2.10 Total variance of the data x
  2.11 Diagonalizing the covariance matrix
  2.12 Variance of y for a diagonalized covariance matrix
  2.13 Constraints of matrix V'
  2.14 Finding the optimal subspace
  2.15 Interpretation of the result
  2.16 PCA algorithm
  2.17 Intuition of the results
  2.18 Whitening or sphering
  2.19 Singular value decomposition +

3 Application
  3.1 Face processing

4 Acknowledgment

∗ © 2004–2010 Laurenz Wiskott, Institut für Neuroinformatik, Ruhr-Universität Bochum, http://www.ini.rub.de/PEOPLE/wiskott/.

1 Intuition

1.1 Problem statement


Experimental data to be analyzed is often represented as a number of vectors of fixed dimen-
sionality. A single vector could for example be a set of temperature measurements across
Germany. Taking such a vector of measurements at different times results in a number of
vectors that altogether constitute the data. Each vector can also be interpreted as a point
in a high dimensional space. Then the data are simply a cloud of points in this space (if one
ignores the temporal order, otherwise it would be a trajectory).
When analyzing such data one often encounters the problem that the dimensionality of the
data points is too high to be visualized or analyzed with some particular technique. Thus
the problem arises to reduce the dimensionality of the data in some optimal way.
To keep things simple we insist that the dimensionality reduction is done linearly, i.e. we
are looking for a low-dimensional linear subspace of the data space, onto which the data can
be projected. As a criterion for what the optimal subspace might be it seems reasonable to
require that it should be possible to reconstruct the original data points from the reduced
ones as well as possible. Thus if one were to project the data back from the low-dimensional
space into the original high-dimensional space, the reconstructed data points should lie as
close as possible to the original ones, with the mean squared distance between original and
reconstructed data points being the reconstruction error. The question is how we can find
the linear subspace that minimizes this reconstruction error.

It is useful and common practice to remove the mean value from the data before doing
the dimensionality reduction as stated above. Thus, we assume zero-mean data throughout.
As a result, variances and 2nd moments are the same. This justifies the slightly confusing
common practice of speaking of variances while writing the equations for 2nd moments. Please
keep that in mind.

1.2 Projection and reconstruction error


The task of principal component analysis (PCA) is to reduce the dimensionality of some
high-dimensional data points by linearly projecting them onto a lower-dimensional space
in such a way that the reconstruction error made by this projection is minimal. In order
to develop an intuition for PCA we first take a closer look at what it means to project
the data points and to reconstruct them. Figure 1 illustrates the process.

[Figure 1: Projection of 2D data points onto a 1D subspace and their reconstruction: (a) data points in 2D, (b) projection onto 1D, (c) data points in 1D, (d) reconstruction in 2D.]

(a) A few data points are given in a two-dimensional space and are represented by two-dimensional
vectors x = (x_1, x_2). (b) In order to reduce the dimensionality down to one, we have to choose a
one-dimensional subspace defined by a unit vector v and project the data points onto it, which can
be done by

    x_|| := v v^T x .                                                         (1)
(c) The points can now be represented by just one number,

    y := v^T x ,                                                              (2)

and we do not care that they originally came from a two-dimensional space. (d) If we want
to reconstruct the original two-dimensional positions of the data points as well as possible,
we have to embed the one-dimensional space in the original two-dimensional space in exactly
the orientation used during the projection,
    x_|| = v y .                                                              (3)   [by (1),(2)]

However, we cannot recover the accurate 2D-position; the points remain on the one-
dimensional subspace. The reconstruction error is therefore the average distance of the

original 2D-positions from the one-dimensional subspace (the length of the projection ar-
rows in (b)). For mathematical convenience one actually takes the average squared distance

    E := ⟨ ‖x^µ − x^µ_||‖² ⟩_µ                                                (4)

       = (1/M) Σ_{µ=1}^{M} Σ_{i=1}^{I} (x^µ_i − x^µ_{||,i})² ,                (5)

where µ indicates the different data points, M the number of data points, and I the dimen-
sionality of the data vectors.
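
To make the projection and the reconstruction error concrete, here is a minimal NumPy sketch (an addition to these notes, not part of the original text); the data cloud and the direction v are invented for illustration:

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy zero-mean 2D data cloud: M = 100 points of dimensionality I = 2 (invented).
    X = rng.normal(size=(100, 2)) @ np.array([[2.0, 0.0], [1.0, 0.5]])
    X -= X.mean(axis=0)

    # Unit vector v defining the 1D subspace.
    v = np.array([1.0, 0.5])
    v /= np.linalg.norm(v)

    y = X @ v                      # y = v^T x for every data point, Eq. (2)
    X_par = np.outer(y, v)         # x_|| = v y, reconstruction in 2D, Eqs. (1), (3)

    # Mean squared reconstruction error, Eqs. (4)-(5).
    E = np.mean(np.sum((X - X_par) ** 2, axis=1))
    print(E)

Rotating v and recomputing E shows how the reconstruction error depends on the chosen subspace, which is the subject of the next section.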

1.3 Reconstruction error and variance


The question now is how we can find the direction of the one-dimensional subspace that
minimizes the reconstruction error. For that it is interesting to inspect more closely what
happens as we rotate the subspace. Figure 2 illustrates the projection onto two different

x2 x2
(a) (b)
y
x||

r v
d x1 x1
x x d
r v
x||

Figure 2: Variance of the projected data and reconstruction error as the linear subspace
is rotated.

subspaces. Focus just on the one point x and its projection x|| . d is the distance of x from
the origin, r is the distance of x from x_|| in the subspace, and v is the distance of x_|| from
the origin. r and v depend on the direction of the subspace while d does not. Interestingly,
since the triangles formed by x, x_||, and the origin are right-angled, r and v are related by
Pythagoras' theorem, i.e. r² + v² = d². We know that r² contributes to the reconstruction
error. v², on the other hand, contributes to the variance of the projected data within the
subspace. Thus we see that the sum of the reconstruction error and the variance of the
projected data is constant and equals the variance of the original data. Therefore, minimizing
the reconstruction error is equivalent to maximizing the variance of the projected data.

1.4 Covariance matrix


How can we determine the direction of maximal variance? The first thing we can do is to determine
the variances of the individual components. If the data points (or vectors) are written as
x = (x_1, x_2)^T (^T indicates transposition), then the variances of the first and second component
can be written as C_11 := ⟨x_1 x_1⟩ and C_22 := ⟨x_2 x_2⟩, where angle brackets indicate averaging
over all data points. Please remember that these are strictly speaking 2nd moments and not
variances, but since we assume zero-mean data that does not make a difference. If C_11 is
large compared to C_22, then the direction of maximal variance is close to (1, 0)^T, while if C_11
is small, the direction of maximal variance is close to (0, 1)^T. (Notice that variance does not
have a polarity, so that one could equally well use the inverse vector (−1, 0)^T instead of (1, 0)^T
for indicating the direction of maximal variance.)
But what if C_11 is of similar value as C_22, like in the example of Figure 1? Then the co-
variance between the two components, C_12 := ⟨x_1 x_2⟩, can give us additional information
(notice that C_21 := ⟨x_2 x_1⟩ is equal to C_12). A large positive value of C_12 indicates a strong
correlation between x_1 and x_2 and that the data cloud is extended along the (1, 1)^T direc-
tion. A negative value would indicate anti-correlation and an extension along the (−1, 1)^T
direction. A small value of C_12 would indicate no correlation and thus little structure of the
data, i.e. no prominent direction of maximal variance. The variances and covariances are
conveniently arranged in a matrix with components

    •   C_ij := ⟨x_i x_j⟩ ,                                                   (6)

which is called covariance matrix (remember, assuming zero-mean data)¹. It can
easily be shown that the components obey the relation

    •   C_ij² ≤ C_ii C_jj .                                                   (7)

It is also easy to see that scaling the data by a factor α scales the covariance matrix by a
factor α². Figure 3 shows several data clouds and the corresponding covariance matrices.

[Figure 3: Several data distributions and their covariance matrices: (a) a cloud aligned with the coordinate axes, C = (0.2 0; 0 1); (b) an oblique cloud, C = (1 −0.5; −0.5 0.3); (c) an isotropic cloud, C = (1 0; 0 1).]
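
As a small illustration (an added sketch with invented data, not from the original notes), the covariance matrix of Eq. (6) can be estimated directly from a zero-mean data matrix:

    import numpy as np

    rng = np.random.default_rng(1)

    # M = 1000 two-dimensional samples forming an oblique cloud (invented).
    X = rng.normal(size=(1000, 2)) @ np.array([[1.0, -0.4], [0.0, 0.4]])
    X -= X.mean(axis=0)                # enforce zero mean

    C = X.T @ X / X.shape[0]           # C_ij = <x_i x_j>, Eq. (6)
    print(C)                           # symmetric, C_12 = C_21, and C_12^2 <= C_11 C_22, Eq. (7)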

1.5 Covariance matrix and higher order structure


Notice that the covariance matrix only gives you information about the general
extent of the data (the second order moments). It does not give you any infor-
mation about the higher-order structure of the data cloud. Figure 4 shows different
data distributions that all have the same covariance matrix. Thus, as long as we consider
only the covariance matrix, i.e. second order moments, we can just as well assume a Gaussian
data distribution with an ellipsoid shape, because the covariance matrix does not capture
any structure beyond second order in any case.
¹ Important text (but not inline formulas) is set bold face; • marks important formulas worth remembering; ◦ marks less important formulas, which I also discuss in the lecture.

[Figure 4: Different data distributions with identical covariance matrices, here C = (1 0; 0 1) in all cases.]

1.6 PCA by diagonalizing the covariance matrix


Now that we have learned that the covariance matrix in principle contains the information
about the direction of maximal variance the question arises how we can get at this infor-
mation. From Figure 3 (a) and (b) we can see that there are two fundamentally different
situations: in (a) the data cloud is aligned with the axes of the coordinate system and the co-
variance matrix is diagonal; in (b) the data cloud is oblique to the axes and the matrix is not
diagonal. In the former case the direction of maximal variance is simply the axis belonging
to the largest value on the diagonal of the covariance matrix. In the latter case, we cannot
directly say what the direction of maximal variance might be. Thus, since the case of a
diagonal covariance matrix is so much simpler, the strategy we are going to take is to make a
non-diagonal covariance matrix diagonal by rotating the coordinate system accordingly. This
is illustrated in Figure 5. From linear algebra we know that diagonalizing a matrix can be
done by solving the corresponding eigenvalue equation. It will turn out that the eigenvectors
of the covariance matrix point into the directions of maximal (and minimal) variance and
that the eigenvalues are equal to the variances along these directions. Projecting the data
onto the eigenvectors with largest eigenvalues is therefore the optimal linear dimensionality
reduction.

[Figure 5: Diagonalizing the covariance matrix by rotating the coordinate system: the covariance matrix, (1 −0.5; −0.5 0.3) in the original coordinates, becomes the diagonal matrix (1.1 0; 0 0.1) in the rotated coordinates (x'_1, x'_2).]
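
The following NumPy sketch (an addition for illustration, with invented data) shows this diagonalization at work: the eigenvectors of the covariance matrix define the rotated coordinate system, and in that system the covariance matrix of the data is diagonal:

    import numpy as np

    rng = np.random.default_rng(2)

    # Zero-mean 2D data with an oblique covariance structure (invented numbers).
    X = rng.multivariate_normal(mean=[0.0, 0.0],
                                cov=[[1.0, -0.5], [-0.5, 0.3]], size=2000)
    X -= X.mean(axis=0)

    C = X.T @ X / X.shape[0]              # covariance matrix of the original data
    lam, U = np.linalg.eigh(C)            # eigenvalues and eigenvectors of C

    X_rot = X @ U                         # rotate every data point: x' = U^T x
    C_rot = X_rot.T @ X_rot / X.shape[0]  # covariance matrix in the rotated system
    print(np.round(C_rot, 3))             # approximately diag(lam)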

2 Formalism

2.1 Definition of the PCA-optimization problem


The problem of principal component analysis (PCA) can be formally stated as follows.

Principal Component Analysis (PCA): Given a set {x^µ : µ = 1, ..., M} of
I-dimensional data points x^µ = (x^µ_1, x^µ_2, ..., x^µ_I)^T with zero mean,
⟨x^µ⟩_µ = 0_I, find an orthogonal matrix U with determinant |U| = +1
generating the transformed data points x'^µ := U^T x^µ such that for
any given dimensionality P the data projected onto the first P axes,
x'^µ_|| := (x'^µ_1, x'^µ_2, ..., x'^µ_P, 0, ..., 0)^T, have the smallest

    reconstruction error E := ⟨ ‖x'^µ − x'^µ_||‖² ⟩_µ                         (8)

among all possible projections onto a P-dimensional subspace. The
column vectors of matrix U define the new axes and are called the
principal components.

Some remarks: (i) ⟨x^µ⟩_µ indicates the mean over all M data points indexed with µ. To
simplify the notation we will from now on drop the index µ and indicate averages over the
data points by ⟨·⟩. (ii) If one has non-zero-mean data, one typically removes the mean before
applying PCA. Even though all the math is valid also for non-zero-mean data, the results
would typically be undesired and nonintuitive. (iii) Since matrix U is orthogonal and has
determinant value +1, it corresponds simply to a rotation of the data x. Thus, the ’shape’
of the data cloud remains the same, just the ’perspective’ changes. |U| = −1 would imply a
mirror reflection of the data distribution and is often permitted, too. Note also that one can
interpret the multiplication with matrix UT either as a rotation of the data or as a rotation
of the coordinate system. Either interpretation is valid. (iv) Projecting the data x0 onto
the P -dimensional linear subspace spanned by the first P axes is simply done by setting all
components higher than P to zero. This can be done, because we still have an orthonormal
coordinate system. If U, and therefore the new coordinate system, were not orthogonal, then
the projection would become a mathematically more complex operation. (v) The reconstruction
error has to be minimal for any P. This has the advantage that we do not need to decide
on P before performing PCA. Often P is actually chosen based on information obtained
during PCA and governed by a constraint, such as that the reconstruction error should be
below a certain threshold.

2.2 Matrix V^T: Mapping from high-dimensional old coordinate system to low-dimensional new coordinate system
Assume some data points x are given in an I-dimensional space and a linear
subspace is spanned by P orthonormal vectors

    ◦   v_p := (v_1p, v_2p, ..., v_Ip)^T                                      (9)

    ◦   with v_p^T v_q = δ_pq := { 1 if p = q ; 0 otherwise } .               (10)

We will typically assume P < I and speak of a high(I)-dimensional space and a low(P )-
dimensional (sub)space. However, P = I may be possible as a limiting case as well.
Arranging these vectors in a matrix yields

    •   V := (v_1, v_2, ..., v_P)                                             (11)
          = ( v_11 v_12 ... v_1P ; v_21 v_22 ... v_2P ; ... ; v_I1 v_I2 ... v_IP ) .   (12)   [by (9)]

This matrix can be used to map the data points x into the subspace spanned by
the vectors v_p, yielding

    ◦   y := V^T x ,                                                          (13)

see also Figure 6. If P < I then the dimensionality is reduced and some information is lost;

[Figure 6: The effect of matrices V^T and V and combinations thereof for an example of a mapping from 2D to 1D: V^T maps data points x (2D) to y (1D), V maps y back to x_|| (2D), V V^T projects x onto x_|| within the 2D space, and V^T V is the identity on the 1D representation.]

if P = I all information is preserved. In any case the mapped data are now represented in
a new coordinate system, the axes of which are given by the vectors v_p. With P = 2 and
I = 3, for example, we have

    y = ( v_1^T x ; v_2^T x ) = ( v_1^T ; v_2^T ) x = V^T x

or

    y = ( y_1 ; y_2 ) = ( v_11 v_21 v_31 ; v_12 v_22 v_32 ) ( x_1 ; x_2 ; x_3 ) = V^T x .

Note that y is P -dimensional while x is I-dimensional.
It is important to realize that we have done two things here: firstly, we have moved
the points from the high-dimensional space onto the low-dimensional subspace
(the points that were already in the subspace have not been moved, of course) and secondly,
we have represented the moved points in a new coordinate system that is partic-
ularly suitable for the low-dimensional subspace. Thus, we went from the high-dimensional
space and the old coordinate system to the low-dimensional subspace and a new coordinate
system. Note also that points in the high-dimensional space can generally not be represented
accurately in the new coordinate system, because it does not have enough dimensions.

2.3 Matrix V: Mapping from low-dimensional new coordinate system to subspace in old coordinate system
Interestingly, since the vectors v_p are orthonormal, matrix V can also be used to trans-
form the points back from the new to the old coordinate system, although the
lost dimensions cannot be recovered, of course. Thus the mapped points y in the new
coordinate system become points x_|| in the old coordinate system and are given by

    ◦   x_|| := V y                                                           (14)
    ◦         = V V^T x .                                                     (15)   [by (13)]

y and x_|| are equivalent representations, i.e. they contain the same information, just
in different coordinate systems.

2.4 Matrix (V^T V): Identity mapping within new coordinate system
Before we look at the combined matrix V V^T, consider V^T V. The latter is obviously
a P × P-matrix and performs a transformation from the new (low-dimensional)
coordinate system to the old (high-dimensional) coordinate system (14) and back
again (13). The back-transformation implies a dimensionality reduction, but since all
points in the old coordinate system come from the new coordinate system and therefore lie
within the low-dimensional subspace already, the mapping onto the low-dimensional
space does not discard any information. Thus, only the back and forth (or rather forth and
back) transformation between the two coordinate systems remains, and that in combination
is without any effect either. This means that V^T V is the identity matrix, which can
be easily verified:

    (V^T V)_pq = v_p^T v_q = δ_pq                                             (16)   [by (10)]
    ⇔   V^T V = 1_P                                                          (17)

with 1_P indicating the identity matrix of dimensionality P. With P = 2, for example, we have

    V^T V = ( v_1^T ; v_2^T ) (v_1, v_2) = ( v_1^T v_1  v_1^T v_2 ; v_2^T v_1  v_2^T v_2 ) = ( 1 0 ; 0 1 ) .   [by (10)]

2.5 Matrix (V V^T): Projection from high- to low-dimensional (sub)space within old coordinate system
As we have seen above (15), the combined matrix V V^T maps the points x onto the
low-dimensional subspace, but in contrast to matrix V^T alone the mapped points are
represented within the old coordinate system and not the new one. It turns out that
this is a projection operation with the characterizing property that it does not
make a difference whether you apply it once or twice, i.e. PP = P. Let us therefore define
the projection matrix

    •   P := V V^T                                                            (18)

and verify that

    ◦   PP = V (V^T V) V^T                                                    (19)   [by (18)]
    ◦      = V V^T                                                            (20)   [by (17), since V^T V = 1_P]
    ◦      = P .                                                              (21)   [by (18)]

A closer look at P shows that

    ◦   P := V V^T                                                            (22)
    ◦     = (v_1, ..., v_P) ( v_1^T ; ... ; v_P^T )                           (23)
          = ( v_11 ... v_1P ; ... ; v_I1 ... v_IP ) ( v_11 ... v_I1 ; ... ; v_1P ... v_IP )      (24)   [by (9)]
          = ( Σ_p v_1p v_1p  Σ_p v_1p v_2p  ... ; Σ_p v_2p v_1p  ... ; ...  Σ_p v_Ip v_Ip )      (25)
          = Σ_{p=1}^{P} ( v_1p v_1p  v_1p v_2p  ... ; v_2p v_1p  ... ; ...  v_Ip v_Ip )          (26)
    ◦     = Σ_{p=1}^{P} v_p v_p^T .                                           (27)

P is obviously an I × I-matrix. If P = I then projecting from the old to the new and back
to the old coordinate system causes no information loss and P = 1_I. The smaller P, the
more information is lost and the more P differs from the identity matrix. Consider,
for example

    v_1 := (1/2) (√2, −1, 1)^T   ⇒   v_1 v_1^T = (1/4) ( 2  −√2  √2 ; −√2  1  −1 ; √2  −1  1 ) ,
    v_2 := (1/2) (0, √2, √2)^T   ⇒   v_2 v_2^T = (1/4) ( 0  0  0 ; 0  2  2 ; 0  2  2 ) ,   and
    v_3 := (1/2) (−√2, −1, 1)^T  ⇒   v_3 v_3^T = (1/4) ( 2  √2  −√2 ; √2  1  −1 ; −√2  −1  1 ) ,

for which you can easily verify that P (27) successively becomes the identity matrix as you
take more of the v_p v_p^T terms.
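
A quick numerical check of this example (an added sketch, not part of the original notes):

    import numpy as np

    s = np.sqrt(2.0)
    v1 = np.array([  s, -1.0, 1.0]) / 2.0
    v2 = np.array([0.0,    s,   s]) / 2.0
    v3 = np.array([ -s, -1.0, 1.0]) / 2.0

    # Partial sums of v_p v_p^T, Eq. (27): rank 1, rank 2, and finally the identity.
    P1 = np.outer(v1, v1)
    P2 = P1 + np.outer(v2, v2)
    P3 = P2 + np.outer(v3, v3)
    print(np.allclose(P2 @ P2, P2))   # P2 is a projection matrix, Eqs. (19)-(21)
    print(np.round(P3, 12))           # the identity matrix 1_3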

2.6 Variance
The variance of a multi-dimensional data set is defined as the sum over the
variances of its components. Since we assume zero-mean data, we have
    •   var(x) := Σ_{i=1}^{I} ⟨x_i²⟩                                          (28)
    ◦           = ⟨ Σ_{i=1}^{I} x_i² ⟩                                        (29)
    •           = ⟨x^T x⟩ .                                                   (30)

This also holds for the projected data, of course, var(y) = ⟨y^T y⟩.

2.7 Reconstruction error


The reconstruction error E is defined as the mean of the squared distances between
the original data points x and the projected ones x_||. If we define the orthogonal vectors (German: Lotvektoren)

    ◦   x_⊥ = x − x_||                                                        (31)

(in contrast to the projected vectors x_||) we can write the reconstruction error as the
variance of the orthogonal vectors and find

    •|◦  E = ⟨x_⊥^T x_⊥⟩                                                      (32)   [by (8),(31)]
           = ⟨(x − x_||)^T (x − x_||)⟩                                        (33)   [by (31)]
    ◦      = ⟨(x − V V^T x)^T (x − V V^T x)⟩                                  (34)   [by (15)]
    ◦      = ⟨x^T x − 2 x^T V V^T x + x^T V (V^T V) V^T x⟩                    (35)
    ◦      = ⟨x^T x⟩ − ⟨x^T V (V^T V) V^T x⟩                                  (36)   [by (17), since V^T V = 1_P]
    ◦      = ⟨x^T x⟩ − ⟨x_||^T x_||⟩                                          (37)   [by (15)]
    •      = ⟨x^T x⟩ − ⟨y^T y⟩ .                                              (38)   [by (36),(13)]

This means that the reconstruction error equals the variance
of the data minus the variance of the projected data. This verifies our intuition
that minimizing the reconstruction error is equivalent to maximizing the variance of the
projected data.

2.8 Covariance matrix


We have already argued heuristically that the covariance matrix C_x with C_xij := ⟨x_i x_j⟩ plays
an important role in performing PCA. It is convenient to write the covariance matrix in
vector notation:

    •   C_x := ⟨x x^T⟩ = (1/M) Σ_µ x^µ x^µT .                                 (39)

It is an easy exercise to show that this definition is equivalent to the componentwise one
given above. Since (x x^T)^T = x x^T (remember (AB)^T = B^T A^T), one can also see that C_x is
symmetric, i.e. C_x^T = C_x.

2.9 Eigenvalue equation of the covariance matrix


Since the covariance matrix is symmetric, its eigenvalues are real and a set of orthogonal
eigenvectors always exists. In mathematical terms, for a given covariance matrix C_x we
can always find a complete set of real eigenvalues λ_i and corresponding eigen-
vectors u_i such that

    ◦   C_x u_i = u_i λ_i        (eigenvalue equation) ,                      (40)
    •   λ_i ≥ λ_{i+1}            (eigenvalues are ordered) ,                  (41)
    ◦   u_i^T u_j = δ_ij         (eigenvectors are orthonormal) .             (42)

If we combine the eigenvectors into an orthogonal matrix U and the eigenvalues into a
diagonal matrix Λ,

    •   U := (u_1, u_2, ..., u_I) ,                                           (43)
    •   Λ := diag(λ_1, λ_2, ..., λ_I) ,                                       (44)

then we can rewrite (42) and (40) as

    •   U^T U = 1_I          (matrix U is orthogonal) ,                       (45)   [by (42),(43)]
    ◦   ⇔  U U^T = 1_I      (since U⁻¹ = U^T and U is square) ,              (46)
    •   C_x U = U Λ          (eigenvalue equation) ,                          (47)   [by (40),(43),(44)]
    ◦   ⇔  U^T C_x U = Λ                                                     (48)   [by (45)]
    ◦   ⇔  C_x = U Λ U^T .                                                   (49)   [by (45),(46)]
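
In NumPy such an eigendecomposition of the symmetric matrix C_x is provided by numpy.linalg.eigh. Note that eigh returns the eigenvalues in ascending order, so they have to be reordered to match the convention λ_i ≥ λ_{i+1} of Eq. (41). A brief sketch (an addition, with an invented C_x):

    import numpy as np

    # Invented symmetric covariance matrix for illustration.
    C = np.array([[ 1.0, -0.5],
                  [-0.5,  0.3]])

    lam, U = np.linalg.eigh(C)       # eigenvalues ascending, eigenvectors as columns
    lam, U = lam[::-1], U[:, ::-1]   # reorder so that lam[0] >= lam[1] >= ..., Eq. (41)

    print(np.allclose(C @ U, U * lam))             # C_x U = U Lambda, Eq. (47)
    print(np.allclose(U @ np.diag(lam) @ U.T, C))  # C_x = U Lambda U^T, Eq. (49)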

2.10 Total variance of the data x


Given the eigenvector matrix U and the eigenvalue matrix Λ it is easy to compute the total
variance of the data

    •|◦  ⟨x^T x⟩ = ⟨tr(x^T x)⟩     (since s = tr(s) for any scalar s)             (50)
    ◦            = ⟨tr(x x^T)⟩     (since tr(AB) = tr(BA) for any matrices A, B)  (51)
    ◦            = tr(⟨x x^T⟩)     (since tr(·) and ⟨·⟩ commute)                  (52)
    ◦            = tr(C_x)                                                        (53)   [by (39)]
    ◦            = tr(U U^T C_x)                                                  (54)   [by (46)]
    ◦            = tr(U^T C_x U)   (since tr(AB) = tr(BA))                        (55)
    ◦            = tr(Λ)                                                          (56)   [by (48)]
    •            = Σ_i λ_i .                                                      (57)   [by (44)]

Thus the total variance of the data is simply the sum of the eigenvalues of its
covariance matrix.
Notice that along the way we have also shown some very general properties. From
line (50) to (53) we have shown that the total variance of some multi-dimensional data
equals the trace of its covariance matrix. From line (53) to (55) we have shown that the
trace remains invariant under any orthogonal transformation of the coordinate system. This
implies that the total variance of some multi-dimensional data is invariant under any or-
thogonal transformation such as a rotation. This is intuitively clear.

2.11 Diagonalizing the covariance matrix


We can now use matrix U to transform the data such that the covariance matrix
becomes diagonal. Define x' := U^T x and denote the new covariance matrix by C'_x. We have

    •   x' := U^T x                                                           (58)
    •|◦ C'_x := ⟨x' x'^T⟩                                                     (59)
    ◦        = ⟨(U^T x)(U^T x)^T⟩                                             (60)   [by (58)]
    ◦        = U^T ⟨x x^T⟩ U                                                  (61)
    ◦        = U^T C_x U                                                      (62)   [by (39)]
    •        = Λ                                                              (63)   [by (48)]

and find that the transformed data x' have a diagonal covariance matrix. Working with x'
instead of x will simplify further analysis without loss of generality.

2.12 Variance of y for a diagonalized covariance matrix


Now that we have the data represented in a coordinate system in which the covariance matrix
is diagonal, we can try to answer the question which P-dimensional subspace
minimizes the reconstruction error. Our intuition would predict that it is simply the space
spanned by the first P eigenvectors. To show this analytically, we take an arbitrary set
of P orthonormal vectors v'_p, and with V' := (v'_1, v'_2, ..., v'_P) we compute the variance
of y:

    ◦   y := V'^T x'                                                          (64)
    ◦   ⇒  ⟨y^T y⟩ = ⟨x'^T V' V'^T x'⟩                                        (65)   [by (64)]
                   = ⟨tr(x'^T V' V'^T x')⟩    (since s = tr(s) for any scalar s)           (66)
                   = ⟨tr(V'^T x' x'^T V')⟩    (since tr(ABC) = tr(BCA) if defined)         (67)
    ◦              = tr(V'^T C'_x V')         (since tr(·) and ⟨·⟩ commute)                (68)   [by (59)]
    ◦              = tr(V'^T Λ V')                                                         (69)   [by (63)]
    ◦              = Σ_i Σ_p λ_i (v'_ip)² .   (as one can work out on a sheet of paper)    (70)

2.13 Constraints of matrix V'

Note that, since the vectors v'_p are orthonormal, V' can always be completed to an
orthogonal I × I-matrix by adding I − P additional orthonormal vectors. Since we know
that an orthogonal matrix has normalized row as well as column vectors, we see that, by
taking away the I − P additional column vectors, we are left with the constraints

    ◦   Σ_i (v'_ip)² = 1       (column vectors of V' have norm one) ,                     (71)
    ◦   ⇒  Σ_ip (v'_ip)² = P   (square sum over all matrix elements equals P) ,           (72)
    ◦   Σ_p (v'_ip)² ≤ 1       (row vectors of V' have norm less or equal one) .          (73)

Notice that Constraint (72) is a direct consequence of Constraint (71) and does not need to
be verified separately in the following considerations.

2.14 Finding the optimal subspace


Since the variance (70) of y as well as the constraints (71, 72, 73) of matrix V'
are linear in (v'_ip)², maximization of the variance ⟨y^T y⟩ is obviously achieved by
putting as much 'weight' as possible on the large eigenvalues, which are the first
ones. The simplest way of doing that is to set

    ◦   v'_ip := δ_ip := { 1 if i = p ; 0 otherwise } ,                       (74)

with the Kronecker symbol δ_ip.

Since I ≥ P we can verify the constraints

    Σ_p (v'_ip)² = Σ_p δ_ip² = { 1 if i ≤ P ; 0 otherwise }  ≤ 1 ,            (75)   [by (74)]
    Σ_i (v'_ip)² = Σ_i δ_ip² = δ_pp² = 1 ,                                    (76)   [by (74)]

and see from (75) that there is actually as much 'weight' on the first, i.e. large, eigenvalues
as Constraint (73) permits.

2.15 Interpretation of the result


What does it mean to set v'_ip := δ_ip? It means that V' projects the data x' onto the
first P axes, which in fact is a projection onto the first P eigenvectors of the
covariance matrix C_x. Thus, if we define

    •|◦  V := U V'                                                            (77)
    •      = (u_1, u_2, ..., u_P) ,                                           (78)   [by (43),(74)]

we can go back to the original coordinate system and find

    •|◦  y = V'^T x'                                                          (79)   [by (64)]
    ◦      = V'^T U^T x                                                       (80)   [by (58)]
    •      = V^T x ,                                                          (81)   [by (77)]

which we know has maximal variance. Thus, if we start from the original data x we
would set v_p := u_p.

The variance of y is

    •|◦  ⟨y^T y⟩ = Σ_{i=1}^{I} Σ_{p=1}^{P} λ_i (v'_ip)²                       (82)   [by (70)]
    ◦            = Σ_{i=1}^{I} Σ_{p=1}^{P} λ_i δ_ip²                          (83)   [by (74)]
    •            = Σ_{i=1}^{P} λ_i ,                                          (84)

which is the sum over the first P largest eigenvalues of the covariance matrix. Likewise one
can determine the reconstruction error as

    •|◦  E = ⟨x^T x⟩ − ⟨y^T y⟩                                                (85)   [by (38)]
    ◦      = Σ_{i=1}^{I} λ_i − Σ_{j=1}^{P} λ_j                                (86)   [by (57),(84)]
    •      = Σ_{i=P+1}^{I} λ_i .                                              (87)

Notice that this is just one optimal set of weights. We have seen above that the projected
data, like any multi-dimensional data, can be rotated arbitrarily without changing
its variance and therefore without changing its reconstruction error. This is equivalent to
a rotation of the projection vectors v_p within the space spanned by the first eigenvectors.

2.16 PCA Algorithm
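
As an illustrative summary (an added sketch, not the author's original algorithm listing; all names are invented), the steps derived above can be collected into a few lines of NumPy:

    import numpy as np

    def pca(X, P):
        # X: (M, I) data matrix with one data point per row; P: target dimensionality.
        X = X - X.mean(axis=0)           # zero-mean data (Sec. 1.1)
        M = X.shape[0]
        C = X.T @ X / M                  # covariance matrix, Eq. (39)
        lam, U = np.linalg.eigh(C)       # eigenvalue decomposition, Eq. (47)
        order = np.argsort(lam)[::-1]    # sort eigenvalues in descending order, Eq. (41)
        lam, U = lam[order], U[:, order]
        V = U[:, :P]                     # first P eigenvectors, Eq. (78)
        Y = X @ V                        # projected data y = V^T x, Eq. (81)
        E = lam[P:].sum()                # reconstruction error, Eq. (87)
        return Y, V, lam, E

    # Usage with invented data:
    rng = np.random.default_rng(3)
    X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))
    Y, V, lam, E = pca(X, P=3)
    print(lam[:3].sum(), E)              # retained variance, Eq. (84), and discarded variance, Eq. (87)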

2.17 Intuition of the Results


Eigenvalue spectrum.
Projection onto a low-dimensional eigenspace.
Visualization of eigenvectors.

2.18 Whitening or sphering


Sometimes it is desirable to transform a data set such that it has variance one in
all directions. Such a normalization operation is called whitening or sphering. The
latter term is quite intuitive, because a spherical data distribution has the same variance
in all directions. Intuitively speaking, sphering requires stretching and compressing the data
distribution along the axes of the principal components such that they have variance one. Tech-
nically speaking one first rotates the data into a coordinate system where the covariance
matrix is diagonal, then performs the stretching along the axes, and then rotates the data
back into the original coordinate system. Principal component analysis obviously gives all
the required information. The eigenvectors of the covariance matrix provide the axes of the

new coordinate system and the eigenvalues λ_i indicate the variances and therefore how much
one has to stretch the data. If the original variance is λ_i, then one obviously has to stretch
by a factor of 1/√λ_i to get variance one. Thus, sphering is achieved by multiplying
the data with a sphering matrix

    •   W := U diag(1/√λ_1, 1/√λ_2, ..., 1/√λ_I) U^T ,                        (88)
    •   x̂ := W x .                                                            (89)

If the final orientation of the data does not matter, the sphering matrix is often defined
without the first U. It is easy to verify that the sphering matrix is symmetrical, the
sphered data x̂ have a unit covariance matrix,

    C_x̂ := ⟨x̂ x̂^T⟩                                                            (90)
         = W ⟨x x^T⟩ W^T                                                      (91)   [by (89)]
         = U diag(1/√λ_i) U^T C_x U diag(1/√λ_i) U^T                          (92)   [by (39),(88)]
         = U diag(1/√λ_i) Λ diag(1/√λ_i) U^T                                  (93)   [by (48)]
         = U 1 U^T                                                            (94)   [by (44)]
         = 1 ,                                                                (95)   [by (46)]

and they have variance one in all directions, since for any projection vector n of norm
one the variance ⟨(n^T x̂)²⟩ of the projected data is

    ⟨(n^T x̂)²⟩ = n^T ⟨x̂ x̂^T⟩ n                                                (96)
               = n^T n                                                        (97)   [by (95)]
               = 1 .                                                          (98)

Similarly one can show that the sphered data projected onto two orthogonal vectors
are uncorrelated.
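
A compact NumPy sketch of sphering (an added illustration with invented data; the symmetric form of Eq. (88) is used):

    import numpy as np

    rng = np.random.default_rng(4)
    A = np.array([[2.0, 0.0, 0.0],
                  [1.0, 1.0, 0.0],
                  [0.0, 0.5, 0.3]])              # invented full-rank mixing matrix
    X = rng.normal(size=(1000, 3)) @ A.T         # data, one point per row
    X -= X.mean(axis=0)

    C = X.T @ X / X.shape[0]
    lam, U = np.linalg.eigh(C)

    W = U @ np.diag(1.0 / np.sqrt(lam)) @ U.T    # sphering matrix, Eq. (88)
    X_hat = X @ W                                # x_hat = W x for every point (W is symmetric), Eq. (89)

    C_hat = X_hat.T @ X_hat / X.shape[0]
    print(np.round(C_hat, 6))                    # approximately the identity matrix, Eq. (95)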

2.19 Singular value decomposition +


Sometimes one has fewer data points than dimensions. For instance one might have 100
images with 10000 pixels each. Then doing direct PCA is very inefficient and the following
method, known as singular value decomposition (SVD), is helpful.
Let x^µ, µ = 1, ..., M be the I-dimensional data with M < I. For convenience we write the
data in one I × M-matrix

    X := (x^1, ..., x^M) .                                                    (99)

The second-moment matrix can then be written as

    C_1 := X X^T / M ,                                                        (100)

and its eigenvalue equation and decomposition read

    C_1 U_1 = U_1 Λ_1                                                         (101)
    ⇔  C_1 = U_1 Λ_1 U_1^T .                                                  (102)

The data represented in the coordinate system of the eigenvectors is

    Y_1 := U_1^T X ,                                                          (103)

which is still high-dimensional.


Now interpret the data matrix X transposed, i.e. swap the data point index for the di-
mension index. In our example this would correspond to having 10000 data points in a
100-dimensional space, which is, of course, much easier to deal with. We get the same
equations as above just with X transposed.

    C_2 := X^T X / I ,                                                        (104)
    C_2 U_2 = U_2 Λ_2                                                         (105)
    ⇔  C_2 = U_2 Λ_2 U_2^T ,                                                  (106)
    Y_2 := U_2^T X^T .                                                        (107)

The interesting property of matrix Y_2 now is that its rows are eigenvectors of matrix C_1, as
can be shown easily:

    C_1 Y_2^T = (X X^T / M)(X U_2)                                            (108)   [by (100),(107)]
              = X (X^T X / I) U_2 · I/M                                       (109)
              = X C_2 U_2 · I/M                                               (110)   [by (104)]
              = X U_2 Λ_2 · I/M                                               (111)   [by (105)]
              = Y_2^T Λ_2 · I/M .                                             (112)   [by (107)]

The corresponding eigenvalues are the eigenvalues of C_2 scaled by I/M.

However, Y_2 yields only M eigenvectors and eigenvalues. The other eigenvalues are all
zero, because M data points can only produce M non-zero variance dimensions or, in other
words, M data points together with the origin can only span an M -dimensional subspace.
The missing (I − M ) eigenvectors must all be orthogonal to the first M ones and orthogonal
to each other but can otherwise be quite arbitrary, since their eigenvalues are all equal. A
Gram-Schmidt orthogonalization procedure can be used to generate them.
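
A sketch of this trick in NumPy (an addition for illustration; the data and all variable names are invented). It obtains eigenvectors of the large matrix C_1 from the small M × M matrix C_2 and normalizes them:

    import numpy as np

    rng = np.random.default_rng(5)
    I_dim, M = 10000, 100                    # many dimensions, few data points
    X = rng.normal(size=(I_dim, M))          # data matrix, one data point per column, Eq. (99)
    X -= X.mean(axis=1, keepdims=True)       # zero mean (this reduces the rank to M - 1)

    C2 = X.T @ X / I_dim                     # small M x M second-moment matrix, Eq. (104)
    lam2, U2 = np.linalg.eigh(C2)
    lam2, U2 = lam2[::-1], U2[:, ::-1]       # sort in descending order
    keep = lam2 > 1e-12                      # drop the numerically zero eigenvalue(s)
    lam2, U2 = lam2[keep], U2[:, keep]

    U1 = X @ U2                              # columns are eigenvectors of C_1, Eqs. (108)-(112)
    U1 /= np.linalg.norm(U1, axis=0)         # normalize them to unit length
    lam1 = lam2 * I_dim / M                  # corresponding eigenvalues of C_1 = X X^T / M

    # Spot check of the eigenvalue equation without forming the huge I x I matrix C_1:
    u = U1[:, 0]
    print(np.allclose(X @ (X.T @ u) / M, lam1[0] * u))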

3 Application

3.1 Face processing

[Figure: Eigenfaces. Source: http://www-white.media.mit.edu/vismod/demos/facerec/basic.html, 2002-12-17.]

If one carefully shifts and scales many face images such that the eyes are in register, i.e. at
identical positions, and performs PCA (or SVD) on them, meaningful eigenvectors can be
calculated and are called eigenfaces. These are the principal grey value variations that
distinguish faces from each other. The figure shows first the mean and then the eigenfaces
ordered in rows. The first eigenface obviously accounts for the mean grey value, the second
one for the difference in color between hair and face, the third one for illumination from the
side, the fourth and seventh one at least partially for a beard. The higher components become
increasingly difficult to interpret. The projection of face images onto the first eigenfaces is a
suitable representation for face recognition (cf. Turk and Pentland, 1991).

[Figure: Eigenfaces – Texture. Source: Lanitis, Taylor, & Cootes, 1995, Image and Vision Computing 13(5):393–401.]

In this example (Lanitis et al., 1995) many faces were warped to a standard geometry and
then PCA was applied to calculate the eigenvectors, which are called eigenfaces. The figure
visualizes the first four eigenfaces by varying the average face (middle) along the direction of
the eigenvectors by up to ±3 standard deviations. The first eigenface (mode) accounts for
overall illumination, the other three for some combination of beard, gender, and facial
expression.

[Figure: Eigenfaces – Shape. Source: Lanitis, Taylor, & Cootes, 1995, Image and Vision Computing 13(5):393–401.]

In this example (Lanitis et al., 1995) PCA was applied to the geometry of faces. A graph with
some standard structure was mapped onto many faces, and the concatenated vectors of the
xy-positions of the graph nodes (not shown in the figure) serve as the data for PCA.
Visualization is done relative to the average geometry (middle) by adding the eigenvectors up
to ±2 standard deviations. The first three components mainly account for the orientation of
the head along the three rotational axes. The fourth component accounts for some variation
in width and facial expression.

4 Acknowledgment
I thank Agnieszka Grabska-Barwinska for working out the proof of singular value decompo-
sition.

References
Lanitis, A., Taylor, C. J., and Cootes, T. F. (1995). An automatic face identification system
using flexible appearance models. Image and Vision Computing, 13(5):393–401.

Turk, M. and Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–86.
