LectureNotes-PCA
Contents
1 Intuition
  1.1 Problem statement
  1.2 Projection and reconstruction error
  1.3 Reconstruction error and variance
  1.4 Covariance matrix
  1.5 Covariance matrix and higher order structure
  1.6 PCA by diagonalizing the covariance matrix
2 Formalism
  2.1 Definition of the PCA-optimization problem
  2.2 Matrix V^T: Mapping from high-dimensional old coordinate system to low-dimensional new coordinate system
  2.3 Matrix V: Mapping from low-dimensional new coordinate system to subspace in old coordinate system
  2.4 Matrix (V^T V): Identity mapping within new coordinate system
  2.5 Matrix (V V^T): Projection from high- to low-dimensional (sub)space within old coordinate system
  2.6 Variance
  2.7 Reconstruction error
  2.8 Covariance matrix
  2.9 Eigenvalue equation of the covariance matrix
  2.10 Total variance of the data x
  2.11 Diagonalizing the covariance matrix
  2.12 Variance of y for a diagonalized covariance matrix
  2.13 Constraints of matrix V'
  2.14 Finding the optimal subspace
  2.15 Interpretation of the result
  2.16 PCA Algorithm
  2.17 Intuition of the Results
  2.18 Whitening or sphering
  2.19 Singular value decomposition +
3 Application
  3.1 Face processing
4 Acknowledgment

© 2004–2010 Laurenz Wiskott
Institut für Neuroinformatik, Ruhr-Universität Bochum, http://www.ini.rub.de/PEOPLE/wiskott/.
1 Intuition
reconstructed data points being the reconstruction error. The question is: how can we find
the linear subspace that minimizes this reconstruction error?
It is useful and common practice to remove the mean value from the data before doing
the dimensionality reduction as stated above. Thus, we assume zero-mean data throughout.
As a result, variances and second moments are the same. This justifies the slightly confusing
common practice of speaking of variances but writing the equations for second moments. Please
keep that in mind.
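As a small added illustration (not part of the original notes; the random data below are purely hypothetical), one can check numerically that after removing the mean, second moments and variances coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(2, 500))  # 2-dimensional data, 500 points as columns

X = X - X.mean(axis=1, keepdims=True)              # remove the mean from each component

# For zero-mean data the second moment <x_i^2> equals the variance of x_i.
second_moments = (X ** 2).mean(axis=1)
variances = X.var(axis=1)
print(np.allclose(second_moments, variances))      # True
```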
Figure 1: Dimensionality reduction of two-dimensional data by projection onto a one-dimensional subspace, panels (a)–(d). (a) The data
points are given in a two-dimensional space and are represented by two-dimensional vectors
x = (x1 , x2 ). (b) In order to reduce the dimensionality down to one, we have to choose a
one-dimensional subspace defined by a unit vector v and project the data points onto it,
which can be done by
    x|| := v v^T x .    (1)
(c) The points can now be represented by just one number,
    y := v^T x ,    (2)
and we do not care that they originally came from a two-dimensional space. (d) If we want
to reconstruct the original two-dimensional positions of the data points as well as possible,
we have to embed the one-dimensional space in the original two-dimensional space in exactly
the orientation used during the projection,
    x|| = v y .    [by (1), (2)]    (3)
However, we cannot recover the accurate 2D-position; the points remain on the one-
dimensional subspace. The reconstruction error is therefore the average distance of the
original 2D-positions from the one-dimensional subspace (the length of the projection ar-
rows in (b)). For mathematical convenience one actually takes the average squared distance,

    E := (1/M) Σ_{µ=1}^{M} Σ_{i=1}^{I} (x_{µi} − x_{||,µi})^2 ,

where µ indicates the different data points, M the number of data points, and I the dimensionality of the data vectors.
Figure 2: Variance of the projected data and reconstruction error as the linear subspace
is rotated.
Figure 2 illustrates this for two different one-dimensional subspaces. Focus just on the one point x and its projection x|| . d is the distance of x from
the origin, r is the distance of x from x|| in the subspace, and v is the distance of x|| from
the origin. r and v depend on the direction of the subspace while d does not. Interestingly,
since the triangles between x, x|| , and the origin are right-angled, r and v are related by
Pythagoras’ theorem, i.e. r2 + v 2 = d2 . We know that r2 contributes to the reconstruction
error. v 2 on the other hand contributes to the variance of the projected data within the
subspace. Thus we see that the reconstruction error plus the variance of the
projected data is constant and equals the variance of the original data. Therefore, minimizing
the reconstruction error is equivalent to maximizing the variance of the projected data.
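The following numerical sketch (an added illustration, not part of the original notes) checks this decomposition for random two-dimensional data and an arbitrary unit vector v: the mean squared reconstruction error plus the variance of the projected data equals the total variance of the data.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0], [[3.0, 1.0], [1.0, 0.5]], size=1000).T  # 2 x M data matrix
X = X - X.mean(axis=1, keepdims=True)        # zero-mean data

theta = 0.7                                  # arbitrary orientation of the 1D subspace
v = np.array([np.cos(theta), np.sin(theta)]) # unit vector spanning the subspace

y = v @ X                                    # projected coordinates v^T x
X_par = np.outer(v, y)                       # reconstructions x_par = v v^T x
r2 = ((X - X_par) ** 2).sum(axis=0).mean()   # mean squared reconstruction error (r^2 terms)
v2 = (y ** 2).mean()                         # variance of the projected data (v^2 terms)
d2 = (X ** 2).sum(axis=0).mean()             # total variance of the data (d^2 terms)

print(np.isclose(r2 + v2, d2))               # True for any angle theta
```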
variances, but since we assume zero-mean data that does not make a difference. If C11 is
large compared to C22, then the direction of maximal variance is close to (1, 0)^T, while if C11
is small, the direction of maximal variance is close to (0, 1)^T. (Notice that variance does not
have a polarity, so one could equally well use the opposite vector (−1, 0)^T instead of (1, 0)^T
to indicate the direction of maximal variance.)
But what if C11 is of similar value to C22, like in the example of Figure 1? Then the covariance
between the two components, C12 := ⟨x1 x2⟩, can give us additional information
(notice that C21 := ⟨x2 x1⟩ is equal to C12). A large positive value of C12 indicates a strong
correlation between x1 and x2 and that the data cloud is extended along the (1, 1)^T direction.
A negative value would indicate anti-correlation and an extension along the (−1, 1)^T
direction. A small value of C12 would indicate no correlation and thus little structure of the
data, i.e. no prominent direction of maximal variance. The variances and covariances are
conveniently arranged in a matrix with components

    C_ij := ⟨x_i x_j⟩ ,

which is called the covariance matrix (remember, we assume zero-mean data). It can
easily be shown that the components obey the relation
It is also easy to see that scaling the data by a factor α scales the covariance matrix by a
factor α^2. Figure 3 shows several data clouds and the corresponding covariance matrices.
Figure 3: Data clouds and their covariance matrices, among them (0.2, 0; 0, 1), (1, −0.5; −0.5, 0.3), and (1, 0; 0, 1).
[Figure: data clouds that all have the covariance matrix (1, 0; 0, 1).]
[Figure: a data cloud with covariance matrix (1, −0.5; −0.5, 0.3) in the original coordinates (x1, x2) and with the diagonal covariance matrix (1.1, 0; 0, 0.1) in rotated coordinates (x'1, x'2).]
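As a brief numerical sketch (added illustration, not part of the original notes; the data cloud is hypothetical), one can estimate the covariance matrix of zero-mean data as C_ij = ⟨x_i x_j⟩ and check the α^2 scaling behaviour:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0, 0], [[1.0, -0.5], [-0.5, 0.3]], size=2000).T  # 2 x M
X = X - X.mean(axis=1, keepdims=True)         # enforce zero mean

M = X.shape[1]
C = X @ X.T / M                               # C_ij = <x_i x_j>
print(np.round(C, 2))                         # roughly [[1, -0.5], [-0.5, 0.3]]

alpha = 3.0
C_scaled = (alpha * X) @ (alpha * X).T / M
print(np.allclose(C_scaled, alpha ** 2 * C))  # scaling the data by alpha scales C by alpha^2
```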
2 Formalism
Some remarks: (i) ⟨x_µ⟩_µ indicates the mean over all M data points indexed with µ. To
simplify the notation we will from now on drop the index µ and indicate averages over the
data points by ⟨·⟩. (ii) If one has non-zero-mean data, one typically removes the mean before
applying PCA. Even though all the math is valid also for non-zero-mean data, the results
would typically be undesired and nonintuitive. (iii) Since matrix U is orthogonal and has
determinant value +1, it corresponds simply to a rotation of the data x. Thus, the ’shape’
of the data cloud remains the same, just the ’perspective’ changes. |U| = −1 would imply a
mirror reflection of the data distribution and is often permitted, too. Note also that one can
interpret the multiplication with matrix U^T either as a rotation of the data or as a rotation
of the coordinate system. Either interpretation is valid. (iv) Projecting the data x' onto
the P-dimensional linear subspace spanned by the first P axes is simply done by setting all
components higher than P to zero. This can be done because we still have an orthonormal
coordinate system. If U and therefore the new coordinate system were not orthogonal, then
the projection would become a mathematically more complex operation. (v) The reconstruction
error has to be minimal for any P . This has the advantage that we do not need to decide
on P before performing PCA. Often P is actually chosen based on information obtained
during PCA and governed by a constraint, such as the requirement that the reconstruction
error be below a certain threshold.
2.2 Matrix V^T: Mapping from high-dimensional old coordinate system to low-dimensional new coordinate system
Assume some data points x are given in an I-dimensional space and a linear
subspace is spanned by P orthonormal vectors

    v_p := (v_1p, v_2p, ..., v_Ip)^T ,    p = 1, ..., P ,    (9)
    with    v_p^T v_q = δ_pq .    (10)

We will typically assume P < I and speak of a high(I)-dimensional space and a low(P)-
dimensional (sub)space. However, P = I may be possible as a limiting case as well.
Arranging these vectors in a matrix yields

    V := (v_1, v_2, ..., v_P) .

This matrix can be used to map the data points x into the subspace spanned by
the vectors v_p, yielding

    ◦    y := V^T x ,    (13)
see also figure 6. If P < I then the dimensionality is reduced and some information is lost;
Figure 6: The effect of matrices V^T and V and combinations thereof for an example of a
mapping from 2D to 1D.
if P = I all information is preserved. In any case the mapped data are now represented in
a new coordinate system, the axes of which are given by the vectors v_p. With P = 2 and
I = 3, for example, we have

    y = ( v_1^T x ; v_2^T x ) = ( v_1^T ; v_2^T ) x = V^T x ,

or, written out in components,

    ( y_1 ; y_2 ) = ( v_11  v_21  v_31 ; v_12  v_22  v_32 ) ( x_1 ; x_2 ; x_3 ) = V^T x ,

where the semicolons separate matrix rows.
Note that y is P -dimensional while x is I-dimensional.
It is important to realize that we have done two things here: firstly, we have moved
the points from the high-dimensional space onto the low-dimensional subspace
(the points that were already in the subspace have not been moved, of course) and secondly,
we have represented the moved points in a new coordinate system that is partic-
ularly suitable for the low-dimensional subspace. Thus, we went from the high-dimensional
space and the old coordinate system to the low-dimensional subspace and a new coordinate
system. Note also that points in the high-dimensional space can generally not be represented
accurately in the new coordinate system, because it does not have enough dimensions.
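As an added illustration (not from the original notes), the following sketch constructs P = 2 orthonormal vectors in an I = 3 dimensional space, e.g. via a QR decomposition, and maps data points into the new coordinate system with V^T:

```python
import numpy as np

rng = np.random.default_rng(3)
I, P, M = 3, 2, 5

# P orthonormal vectors v_p in I dimensions, arranged as the columns of V (I x P).
V, _ = np.linalg.qr(rng.normal(size=(I, P)))
print(np.allclose(V.T @ V, np.eye(P)))  # orthonormality: v_p^T v_q = delta_pq

X = rng.normal(size=(I, M))             # M data points as columns
Y = V.T @ X                             # new P-dimensional coordinates y = V^T x
print(X.shape, Y.shape)                 # (3, 5) and (2, 5): x is I-dimensional, y is P-dimensional
```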
2.3 Matrix V: Mapping from low-dimensional new coordinate system to subspace in old coordinate system

The matrix V in turn maps the low-dimensional representation y back into the high-dimensional space, onto the subspace spanned by the vectors v_p:

    ◦    x|| := V y    (14)
    ◦        = V V^T x .    [by (13)]    (15)

y and x|| are equivalent representations, i.e. they contain the same information, just
in different coordinate systems.
2.4 Matrix (V^T V): Identity mapping within new coordinate system

Before we look at the combined matrix V V^T, consider V^T V. The latter is obviously
a P × P matrix and performs a transformation from the new (low-dimensional)
coordinate system to the old (high-dimensional) coordinate system (14) and back
again (13). The back-transformation implies a dimensionality reduction, but since all
points mapped into the old coordinate system come from the new coordinate system and therefore lie
within the low-dimensional subspace already, the mapping onto the low-dimensional
space does not discard any information. Thus, only the back and forth (or rather forth and
back) transformation between the two coordinate systems remains, and that in combination
has no effect either. This means that V^T V is the identity matrix, which can
easily be verified:

    (V^T V)_pq = v_p^T v_q = δ_pq    [by (10)]    (16)
    ⟺    V^T V = 1_P .    (17)
2.5 Matrix (V V^T): Projection from high- to low-dimensional (sub)space within old coordinate system
As we have seen above (15), the combined matrix V V^T maps the points x onto the
low-dimensional subspace, but in contrast to matrix V^T alone the mapped points are
represented within the old coordinate system and not the new one. It turns out that
this is a projection operation, with the characteristic property that it does not
make a difference whether you apply it once or twice, i.e. P P = P. Let us therefore define
the projection matrix

    •    P := V V^T    (18)

and verify that
    ◦    P P = V (V^T V) V^T    [by (18), with V^T V = 1_P]    (19)
    ◦        = V V^T    [by (17)]    (20)
    ◦        = P .    [by (18)]    (21)

Written out in terms of the vectors v_p, the projection matrix becomes

    ◦    P := V V^T    (22)
    ◦       = (v_1, ..., v_P) ( v_1^T ; ... ; v_P^T )    (23)
    ◦       = ( v_11 ... v_1P ; ... ; v_I1 ... v_IP ) ( v_11 ... v_I1 ; ... ; v_1P ... v_IP )    [by (9)]    (24)
    ◦       = ( Σ_p v_1p v_1p   Σ_p v_1p v_2p   ··· ;   Σ_p v_2p v_1p   ··· ;   ···   Σ_p v_Ip v_Ip )    (25)
    ◦       = Σ_{p=1}^{P} ( v_1p v_1p   v_1p v_2p   ··· ;   v_2p v_1p   ··· ;   ···   v_Ip v_Ip )    (26)
    ◦       = Σ_{p=1}^{P} v_p v_p^T .    (27)
P is obviously an I × I matrix. If P = I, then projecting from the old to the new and back
to the old coordinate system causes no information loss and P = 1_I. The smaller P, the
more information is lost and the more P differs from the identity matrix. Consider,
for example, the three orthonormal vectors
    v_1 := (1/2) (√2, −1, 1)^T    ⇒    v_1 v_1^T = (1/4) ( 2  −√2  √2 ;   −√2  1  −1 ;   √2  −1  1 ) ,
    v_2 := (1/2) (0, √2, √2)^T    ⇒    v_2 v_2^T = (1/4) ( 0  0  0 ;   0  2  2 ;   0  2  2 ) ,    and
    v_3 := (1/2) (−√2, −1, 1)^T    ⇒    v_3 v_3^T = (1/4) ( 2  √2  −√2 ;   √2  1  −1 ;   −√2  −1  1 ) ,
for which you can easily verify that P (27) successively becomes the identity matrix as you
take more of the v_p v_p^T terms.
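The following sketch (an added numerical check, not part of the original notes) uses exactly these three vectors and confirms that each partial sum of v_p v_p^T terms is idempotent (P P = P) and that the full sum over all three terms equals the identity matrix:

```python
import numpy as np

s = np.sqrt(2.0)
v = [np.array([  s, -1.0, 1.0]) / 2.0,   # v_1
     np.array([0.0,    s,   s]) / 2.0,   # v_2
     np.array([ -s, -1.0, 1.0]) / 2.0]   # v_3

P = np.zeros((3, 3))
for k, vp in enumerate(v, start=1):
    P += np.outer(vp, vp)                # add the term v_p v_p^T
    print(k,
          np.allclose(P @ P, P),         # projection property P P = P
          np.allclose(P, np.eye(3)))     # identity once all three terms are summed
```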
2.6 Variance
The variance of a multi-dimensional data set is defined as the sum over the
variances of its components. Since we assume zero-mean data, we have
    •    var(x) := Σ_{i=1}^{I} ⟨x_i^2⟩    (28)
    ◦            = ⟨ Σ_{i=1}^{I} x_i^2 ⟩    (29)
    •            = ⟨ x^T x ⟩ .    (30)
This also holds for the projected data, of course: var(y) = ⟨y^T y⟩.
2.7 Reconstruction error

If we define the orthogonal vectors

    x⊥ := x − x||    (31)

(in contrast to the projected vectors x||), we can write the reconstruction error as the
variance of the orthogonal vectors and find
    •|◦    E = ⟨ x⊥^T x⊥ ⟩    [by (8), (31)]    (32)
             = ⟨ (x − x||)^T (x − x||) ⟩    [by (31)]    (33)
    ◦        = ⟨ (x − V V^T x)^T (x − V V^T x) ⟩    [by (15)]    (34)
    ◦        = ⟨ x^T x − 2 x^T V V^T x + x^T V (V^T V) V^T x ⟩    [with V^T V = 1_P]    (35)
    ◦        = ⟨ x^T x ⟩ − ⟨ x^T V (V^T V) V^T x ⟩    [by (17)]    (36)
    ◦        = ⟨ x^T x ⟩ − ⟨ x||^T x|| ⟩    [by (15)]    (37)
    •        = ⟨ x^T x ⟩ − ⟨ y^T y ⟩ .    [by (36), (13)]    (38)
This means that the reconstruction error equals the variance of the data minus the variance
of the projected data. This verifies our intuition that minimizing the reconstruction error
is equivalent to maximizing the variance of the projected data.
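A quick numerical check of (38) (added illustration, not from the original notes): for an arbitrary orthonormal matrix V the reconstruction error equals the total variance minus the variance of the projected data.

```python
import numpy as np

rng = np.random.default_rng(4)
I, P, M = 5, 2, 1000

X = rng.normal(size=(I, M))
X = X - X.mean(axis=1, keepdims=True)          # zero-mean data

V, _ = np.linalg.qr(rng.normal(size=(I, P)))   # P orthonormal projection vectors
Y = V.T @ X                                    # projected data y = V^T x
X_par = V @ Y                                  # reconstruction x_par = V V^T x

E = ((X - X_par) ** 2).sum(axis=0).mean()      # reconstruction error <x_perp^T x_perp>
total_var = (X ** 2).sum(axis=0).mean()        # <x^T x>
proj_var = (Y ** 2).sum(axis=0).mean()         # <y^T y>
print(np.isclose(E, total_var - proj_var))     # True, cf. (38)
```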
2.8 Covariance matrix

In matrix notation the covariance matrix of the zero-mean data can be written as

    •    C_x := ⟨ x x^T ⟩ .    (39)

It is an easy exercise to show that this definition is equivalent to the componentwise one
given above. Since (x x^T)^T = x x^T (remember (A B)^T = B^T A^T), one can also see that C_x is
symmetric, i.e. C_x^T = C_x.
2.9 Eigenvalue equation of the covariance matrix

Since C_x is symmetric, it has a complete set of orthonormal eigenvectors u_i with real eigenvalues λ_i, given by the eigenvalue equation C_x u_i = λ_i u_i. If we combine the eigenvectors into an orthogonal matrix U and the eigenvalues into a diagonal matrix Λ, the eigenvalue equation can be written as C_x U = U Λ, or equivalently

    U^T C_x U = Λ ,    (48)

where we have used the orthogonality of U, i.e. U U^T = U^T U = 1_I (46).

2.10 Total variance of the data x

The total variance of the data x can now be computed as
    •|◦    ⟨x^T x⟩ = ⟨ tr(x^T x) ⟩    (since s = tr(s) for any scalar s)    (50)
    ◦              = ⟨ tr(x x^T) ⟩    (since tr(A B) = tr(B A) for any matrices A, B)    (51)
    ◦              = tr( ⟨x x^T⟩ )    (since tr(·) and ⟨·⟩ commute)    (52)
    ◦              = tr( C_x )    [by (39)]    (53)
    ◦              = tr( U U^T C_x )    [by (46)]    (54)
    ◦              = tr( U^T C_x U )    (55)
    ◦              = tr( Λ )    [by (48)]    (56)
    •              = Σ_i λ_i .    [by (44)]    (57)
i
Thus the total variance of the data is simply the sum of the eigenvalues of its
covariance matrix.
Notice that in the course of this proof we have shown some very general properties. From
line (50) to (53) we have shown that the total variance of some multi-dimensional data
equals the trace of its covariance matrix. From line (53) to (55) we have shown that the
trace remains invariant under any orthogonal transformation of the coordinate system. This
implies that the total variance of some multi-dimensional data is invariant under any or-
thogonal transformation such as a rotation. This is intuitively clear.
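As an added numerical illustration (not part of the original notes), (57) can be verified directly with an eigenvalue decomposition of an estimated covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(5)
I, M = 4, 2000
A = rng.normal(size=(I, I))                # mixing matrix to create correlated data
X = A @ rng.normal(size=(I, M))
X = X - X.mean(axis=1, keepdims=True)      # zero-mean data

C = X @ X.T / M                            # covariance matrix C_x
lam, U = np.linalg.eigh(C)                 # eigenvalues lambda_i and orthogonal U

total_var = (X ** 2).sum(axis=0).mean()    # total variance <x^T x>
print(np.isclose(total_var, np.trace(C)))  # total variance = tr(C_x)
print(np.isclose(total_var, lam.sum()))    # total variance = sum of eigenvalues, cf. (57)
```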
2.11 Diagonalizing the covariance matrix

If we rotate the data into the coordinate system spanned by the eigenvectors of the covariance matrix, i.e. multiply them with U^T, we have
    •     x' := U^T x    (58)
    •|◦    C_x' := ⟨ x' x'^T ⟩    (59)
    ◦           = ⟨ (U^T x)(U^T x)^T ⟩    [by (58)]    (60)
    ◦           = U^T ⟨ x x^T ⟩ U    (61)
    ◦           = U^T C_x U    [by (39)]    (62)
    •           = Λ ,    [by (48)]    (63)
and find that the transformed data x' have a diagonal covariance matrix. Working with x'
instead of x will simplify the further analysis without loss of generality.
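A minimal numerical sketch of this step (added illustration, not from the original notes): rotating the data with U^T yields a diagonal covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.multivariate_normal([0, 0], [[1.0, -0.5], [-0.5, 0.3]], size=5000).T
X = X - X.mean(axis=1, keepdims=True)      # zero-mean data

M = X.shape[1]
C = X @ X.T / M                            # covariance matrix C_x
lam, U = np.linalg.eigh(C)                 # C_x = U diag(lambda) U^T

X_prime = U.T @ X                          # rotated data x' = U^T x
C_prime = X_prime @ X_prime.T / M
print(np.allclose(C_prime, np.diag(lam)))  # diagonal covariance matrix, cf. (63)
```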
Notice that Constraint (72) is a direct consequence of Constraint (71) and does not need to
be verified separately in the following considerations.
and see from (75) that there is actually as much ’weight’ on the first, i.e. large, eigenvalues
as Constraint (73) permits.
which we know has maximal variance. Thus, if we start from the original data x we
would set v_p := u_p.
The variance of y is
    •|◦    ⟨y^T y⟩ = Σ_{i=1}^{I} Σ_{p=1}^{P} λ_i (v'_ip)^2    [by (70)]    (82)
    ◦              = Σ_{i=1}^{I} Σ_{p=1}^{P} λ_i δ_ip    [by (74)]    (83)
    •              = Σ_{i=1}^{P} λ_i ,    (84)
which is the sum over the P largest eigenvalues of the covariance matrix. Likewise one
can determine the reconstruction error as
    •|◦    E = ⟨x^T x⟩ − ⟨y^T y⟩    [by (38)]    (85)
    ◦        = Σ_{i=1}^{I} λ_i − Σ_{j=1}^{P} λ_j    [by (57), (84)]    (86)
    •        = Σ_{i=P+1}^{I} λ_i .    (87)
Notice that this is just one optimal set of weights. We have seen above that the projected
data, like any multi-dimensional data, can be rotated arbitrarily without changing
its variance and therefore without changing its reconstruction error. This is equivalent to
a rotation of the projection vectors v_p within the space spanned by the first P eigenvectors.
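The following sketch (an added illustration, not part of the original notes) combines the steps above into a small PCA routine: it picks the eigenvectors belonging to the P largest eigenvalues as projection vectors (note that numpy.linalg.eigh returns eigenvalues in ascending order, hence the reordering) and checks (84) and (87).

```python
import numpy as np

rng = np.random.default_rng(7)
I, P, M = 6, 2, 5000
A = rng.normal(size=(I, I))
X = A @ rng.normal(size=(I, M))            # correlated data
X = X - X.mean(axis=1, keepdims=True)      # zero-mean data

C = X @ X.T / M                            # covariance matrix
lam, U = np.linalg.eigh(C)
order = np.argsort(lam)[::-1]              # sort eigenvalues in decreasing order
lam, U = lam[order], U[:, order]

V = U[:, :P]                               # v_p := u_p for the P largest eigenvalues
Y = V.T @ X                                # projected data
proj_var = (Y ** 2).sum(axis=0).mean()
E = ((X - V @ Y) ** 2).sum(axis=0).mean()  # reconstruction error

print(np.isclose(proj_var, lam[:P].sum())) # variance = sum of the P largest eigenvalues, cf. (84)
print(np.isclose(E, lam[P:].sum()))        # error = sum of the remaining eigenvalues, cf. (87)
```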
new coordinate system and the eigenvalues λ_i indicate the variances and therefore how much
one has to stretch the data. If the original variance is λ_i then one obviously has to stretch
by a factor of 1/√λ_i to get variance one. Thus, sphering is achieved by multiplying
the data with a sphering matrix,

    •    W := U diag( 1/√λ_1 , 1/√λ_2 , ..., 1/√λ_I ) U^T ,    (88)
    •    x̂ := W x .    (89)
If the final orientation of the data does not matter, the sphering matrix is often defined
without the first U. It is easy to verify that the sphering matrix is symmetric and that the
sphered data x̂ have a unit covariance matrix. They also have variance one in all directions,
since for any projection vector n of norm one the variance ⟨(n^T x̂)^2⟩ of the projected data is

    ⟨ (n^T x̂)^2 ⟩ = n^T ⟨ x̂ x̂^T ⟩ n = n^T n = 1 .

Similarly one can show that the sphered data projected onto two orthogonal vectors
are uncorrelated.
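A minimal sketch of (88)–(89) (added as an illustration, not part of the original notes); it assumes all eigenvalues are strictly positive so that 1/√λ_i exists:

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.multivariate_normal([0.0, 0.0, 0.0],
                            [[2.0, 0.6, 0.0],
                             [0.6, 1.0, 0.3],
                             [0.0, 0.3, 0.5]], size=10000).T
X = X - X.mean(axis=1, keepdims=True)               # zero-mean data

M = X.shape[1]
C = X @ X.T / M
lam, U = np.linalg.eigh(C)
W = U @ np.diag(1.0 / np.sqrt(lam)) @ U.T           # sphering matrix (88)
X_hat = W @ X                                       # sphered data (89)

print(np.allclose(W, W.T))                          # the sphering matrix is symmetric
print(np.allclose(X_hat @ X_hat.T / M, np.eye(3)))  # unit covariance matrix

n = np.array([1.0, 2.0, -1.0])
n = n / np.linalg.norm(n)                           # arbitrary unit projection vector
print(np.isclose((n @ X_hat).var(), 1.0))           # variance one in any direction
```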
2.19 Singular value decomposition +

Let X := (x_1, x_2, ..., x_M) be the I × M matrix that contains the (zero-mean) data points as its columns. The covariance matrix can then be written as

    C_1 := X X^T / M ,    (100)
and its eigenvalue equation and decomposition read

    C_1 U_1 = U_1 Λ_1    (101)
    ⟺    C_1 = U_1 Λ_1 U_1^T .    (102)

The data represented in the coordinate system of the eigenvectors of C_1 are

    Y_1 := U_1^T X .    (103)

Analogously we can define a second, M × M matrix

    C_2 := X^T X / I ,    (104)

with eigenvalue equation and decomposition

    C_2 U_2 = U_2 Λ_2    (105)
    ⟺    C_2 = U_2 Λ_2 U_2^T ,    (106)

and

    Y_2 := U_2^T X^T .    (107)
The interesting property of matrix Y2 now is that its rows are eigenvectors of matrix C1 , as
can be shown easily.
    C_1 Y_2^T = (X X^T / M)(X U_2)    [by (100), (107)]    (108)
              = X (X^T X / I) U_2 · I/M    (109)
              = X C_2 U_2 · I/M    [by (104)]    (110)
              = X U_2 Λ_2 · I/M    [by (105)]    (111)
              = Y_2^T Λ_2 · I/M .    [by (107)]    (112)
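A numerical check of this result (added illustration, not from the original notes), in the typical situation with more dimensions than data points (I > M), where working with the small M × M matrix C_2 is cheaper:

```python
import numpy as np

rng = np.random.default_rng(9)
I, M = 50, 10                              # more dimensions than data points
X = rng.normal(size=(I, M))
X = X - X.mean(axis=1, keepdims=True)      # zero-mean data, columns are data points

C1 = X @ X.T / M                           # large I x I covariance matrix (100)
C2 = X.T @ X / I                           # small M x M matrix (104)

lam2, U2 = np.linalg.eigh(C2)              # eigenvalue decomposition of C2, cf. (105)
Y2T = X @ U2                               # Y_2^T = X U_2, cf. (107)

# Verify (112): C_1 Y_2^T = Y_2^T Lambda_2 I/M, i.e. the columns of X U_2 are
# (unnormalized) eigenvectors of C_1 with eigenvalues lambda_2,j * I/M.
print(np.allclose(C1 @ Y2T, Y2T @ np.diag(lam2) * (I / M)))
```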
3 Application
4 Acknowledgment
I thank Agnieszka Grabska-Barwinska for working out the proof of singular value decompo-
sition.