Principal Component Analysis
— Lecture Notes —
Laurenz Wiskott
Institut für Neuroinformatik
Ruhr-Universität Bochum, Germany, EU
14 December 2016
— Summary —
Principal component analysis (PCA) can be applied to vectorial data and is probably the most common
method to reduce the dimensionality of data for compression and visualization. It determines the dimensions
of largest and smallest variance of the data, referred to as the principal components, which can then be used to
discard the small variance dimensions for dimensionality reduction or select the two or three largest variance
dimensions for visualization. For instance, if you have one thousand 100-dimensional data points, PCA
might be used to reduce the dimensionality of the data down to 10 without losing too much information,
which corresponds to a compression by 90%. In some cases it is also the small-variance directions that are of
interest. PCA can also be used to normalize data such that it has unit variance in all directions and no
correlations, a process called whitening or sphering.
1 Intuition introduces the problem statement of principal component analysis and the general idea
of the algorithm without much formalism. → Video ’In a nutshell’, no 1, Exercises, Solutions
© 2004–2006, 2009, 2010, 2013, 2016 Laurenz Wiskott (ORCID http://orcid.org/0000-0001-6237-740X, homepage
https://www.ini.rub.de/PEOPLE/wiskott/). This work (except for all figures from other sources, if present) is licensed under
the Creative Commons Attribution-ShareAlike 4.0 International License, see http://creativecommons.org/licenses/by-sa/
4.0/. If figures are not included for copyright reasons, they are uniformly colored, but the word 'Figure', 'Image', or the like in the
reference is often linked to a freely available copy.
Core text and formulas are set in dark red, one can repeat the lecture notes quickly by just reading these; marks important
formulas or items worth remembering and learning for an exam; ♦ marks less important formulas or items that I would usually
also present in a lecture; + marks sections that I would usually skip in a lecture.
More teaching material is available at https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/.
2 Formalism provides a detailed mathematical treatment of principal component analysis, including
whitening and singular value decomposition, an extension for very high-dimensional spaces.
→ Exercises, Solutions
3 Application presents an application to face processing. A large number of images of faces are
aligned at the eyes and then treated as data vectors by concatenating the rows (or columns). The
mean represents the average face and the principal components the dominant modes of variation.
This can be used for face recognition and analysis.
Contents
Introductory remarks for the lecture
1 Intuition
1.1 Problem statement
1.2 Projection and reconstruction error
1.3 Reconstruction error and variance
1.4 Covariance matrix
1.5 Covariance matrix and higher order structure
1.6 PCA by diagonalizing the covariance matrix
1.7 Whitening
2 Formalism
2.1 Definition of the PCA-optimization problem
2.2 Matrix V^T: Mapping from high-dimensional old coordinate system to low-dimensional new coordinate system
2.3 Matrix V: Mapping from low-dimensional new coordinate system to subspace in old coordinate system
2.4 Matrix (V^T V): Identity mapping within new coordinate system
2.5 Matrix (V V^T): Projection from high- to low-dimensional (sub)space within old coordinate system
2.6 Variance
2.7 Reconstruction error
2.8 Covariance matrix
2.9 Eigenvalue equation of the covariance matrix
2.10 Total variance of the data x
2.11 Diagonalizing the covariance matrix
2.12 Variance of y for a diagonalized covariance matrix
2.13 Constraints of matrix V′
2.14 Finding the optimal subspace
2.15 Interpretation of the result
2.16 PCA Algorithm
2.17 Intuition of the Results
2.18 Whitening or sphering
2.19 Singular value decomposition +
3 Application
3.1 Face processing
4 Other resources
4.1 Written material
4.2 Visualizations
4.3 Videos
4.4 Software
4.5 Exercises
1 Intuition
This section is partly based on (Hertz et al., 1991).
1.2 Projection and reconstruction error
The task of principal component analysis (PCA) is to reduce the dimensionality of some high-dimensional
data points by linearly projecting them onto a lower-dimensional space in such a way that the reconstruction
error made by this projection is minimal. In order to develop an intuition for PCA we first take a closer
look at what it means to project the data points and to reconstruct them. Figure 1.1 illustrates the process.
Figure 1.1: Projection of 2D data points onto a 1D subspace and their reconstruction.
© CC BY-SA 4.0
(a) A few data points are given in a two-dimensional space and are represented by two-dimensional vectors
x = (x1, x2). (b) In order to reduce the dimensionality down to one, we have to choose a one-dimensional
subspace defined by a unit vector v and project the data points onto it, which can be done by

x_|| := v v^T x .   (1.1)

(c) The points can now be represented by just one number,

y := v^T x ,   (1.2)

and we do not care that they originally came from a two-dimensional space. (d) If we want to reconstruct
the original two-dimensional positions of the data points as well as possible, we have to embed the one-
dimensional space in the original two-dimensional space in exactly the orientation used during the projection,

x_|| = v y .   (1.3)   [by (1.1), (1.2)]
However, we cannot recover the accurate 2D-position; the points remain on the one-dimensional sub-
space. The reconstruction error is therefore the average distance of the original 2D-positions from the
one-dimensional subspace (the length of the projection arrows in (b)). For mathematical convenience one
actually takes the average squared distance
E := ⟨ ||x^µ − x_||^µ||² ⟩_µ   (1.4)
  = (1/M) Σ_{µ=1}^{M} Σ_{i=1}^{I} (x_i^µ − x_{||,i}^µ)² ,   (1.5)
where µ indicates the different data points, M the number of data points, and I the dimensionality of the
data vectors.
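To make this concrete, here is a minimal NumPy sketch (not part of the original notes) that projects synthetic 2D data onto a unit vector and evaluates the reconstruction error of equations (1.1)-(1.5); the data and the direction v are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 0.0], [1.0, 0.5]])  # M = 100 points, one per row
X -= X.mean(axis=0)                       # PCA assumes zero-mean data

v = np.array([1.0, 0.3])
v /= np.linalg.norm(v)                    # unit vector defining the 1D subspace

y = X @ v                                 # (1.2): y = v^T x for every data point
X_par = np.outer(y, v)                    # (1.1)/(1.3): x_|| = v y
E = np.mean(np.sum((X - X_par) ** 2, axis=1))   # (1.4)/(1.5): mean squared distance
print(E)
```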
Figure 1.2: Variance of the projected data and reconstruction error as the linear subspace is rotated.
© CC BY-SA 4.0
1.3 Reconstruction error and variance
Figure 1.2 illustrates the projection onto two different subspaces. Focus just on the one point x
and its projection x_|| . d is the distance of x from the origin, r is the distance of x from its projection x_|| in the subspace,
and v is the distance of x_|| from the origin. r and v depend on the direction of the subspace while d does
not. Interestingly, since the triangles formed by x, x_||, and the origin are right-angled, r and v are related by
Pythagoras' theorem, i.e. r² + v² = d². We know that r² contributes to the reconstruction error. v² on the
other hand contributes to the variance of the projected data within the subspace. Thus we see that the sum
of the reconstruction error and the variance of the projected data is constant and equals the variance of
the original data. Therefore, minimizing the reconstruction error is equivalent to maximizing the variance
of the projected data.
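This Pythagorean argument can be checked numerically; the following sketch (my own illustration, with arbitrary synthetic data) shows that projected variance plus reconstruction error is the same for every direction of the subspace and equals the total variance.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2)) @ np.array([[1.5, 0.0], [0.8, 0.4]])
X -= X.mean(axis=0)

total_var = np.mean(np.sum(X ** 2, axis=1))         # average d^2 over all points
for angle in np.linspace(0.0, np.pi, 5):
    v = np.array([np.cos(angle), np.sin(angle)])    # unit direction of the 1D subspace
    y = X @ v
    proj_var = np.mean(y ** 2)                      # average v^2: variance in the subspace
    rec_err = np.mean(np.sum((X - np.outer(y, v)) ** 2, axis=1))  # average r^2
    print(f"{angle:4.2f}: {proj_var + rec_err:.6f} vs {total_var:.6f}")
```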
Figure 1.4: Different data distributions with identical covariance matrices. What is the scale of the axes
in order to get the identity matrix?
© CC BY-SA 4.0
It will turn out that the eigenvectors of the covariance matrix point into the directions of maximal (and minimal)
variance and that the eigenvalues are equal to the variances along these directions. Projecting the data onto
the eigenvectors with the largest eigenvalues is therefore the optimal linear dimensionality reduction.
Figure 1.5: Diagonalizing the covariance matrix by rotating the coordinate system.
© CC BY-SA 4.0
1.7 Whitening
Whitening data means to first remove the mean and then stretch the data along the principal axes such
that it has unit variance in all directions, i.e. the stretching must be done by one over the square root of
the variance. The principal axes as well as the corresponding variances can be obtained by PCA. If the original
data is an anisotropic Gaussian to begin with, e.g. having the shape of a flying sausage or a cigar, it has a
spherical shape after whitening, which is why the process is also called sphering.
Whitened data has the advantage that you can project it onto any unit vector and it has unit variance. More
formally, if n is an arbitrary unit vector and the x^µ are the whitened data vectors, the projected data x_|| := n^T x
has zero mean ⟨x_||⟩ = 0, since ⟨x^µ⟩ = 0, and unit variance ⟨x_||²⟩ = 1. Projected onto two orthogonal unit
vectors, the two projected data sets are uncorrelated. More formally, if n_1 and n_2 are two arbitrary but
orthogonal unit vectors, i.e. n_i^T n_j = δ_ij (Kronecker symbol), the two projected data sets x_||1 := n_1^T x and
x_||2 := n_2^T x are uncorrelated, i.e. ⟨x_||1 x_||2⟩ = 0.
2 Formalism
This section is partly based on (Hertz et al., 1991).
Figure 1.6: Whitening of some data. Left: Original data. The ellipse represents the second moment matrix
of the data distribution. Right: The mean is removed and the data stretched such that it has variance one
in all directions, indicated by a circle representing the identity second moment matrix. The graphs to the
top and right represent the marginals of the first and second component, respectively.
© CC BY-SA 4.0
among all possible projections onto a P-dimensional subspace. The row vectors of
matrix U define the new axes and are called the principal components.
Some remarks: (i) ⟨x^µ⟩_µ indicates the mean over all M data points indexed with µ. To simplify the notation
we will from now on drop the index µ and indicate averages over the data points by ⟨·⟩. (ii) If one has
non-zero-mean data, one typically removes the mean before applying PCA. Even though all the math is
valid also for non-zero-mean data, the results would typically be undesired and nonintuitive. (iii) Since
matrix U is orthogonal and has determinant value +1, it corresponds simply to a rotation of the data x.
Thus, the 'shape' of the data cloud remains the same, just the 'perspective' changes. |U| = −1 would imply
a mirror reflection of the data distribution and is often permitted, too. Notice also that one can interpret
the multiplication with matrix U^T either as a rotation of the data or as a rotation of the coordinate system.
Either interpretation is valid. (iv) Projecting the data x′ onto the P-dimensional linear subspace spanned by
the first P axes is simply done by setting all components higher than P to zero. This can be done because
we still have an orthonormal coordinate system. If U, and therefore the new coordinate system, were not
orthogonal, then the projection would become a mathematically more complex operation. (v) The reconstruction
error has to be minimal for any P. This has the advantage that we do not need to decide on P before
performing PCA. Often P is actually chosen based on information obtained during PCA and governed by
a constraint, such as that the reconstruction error should be below a certain threshold.
2.2 Matrix V^T: Mapping from high-dimensional old coordinate system to low-dimensional new coordinate system

Assume some data points x are given in an I-dimensional space and a linear subspace is
spanned by P orthonormal vectors

v_p := (v_1p, v_2p, ..., v_Ip)^T   with   v_p^T v_q = δ_pq .   (2.2, 2.3)

We will typically assume P < I and speak of a high(I)-dimensional space and a low(P)-dimensional
(sub)space. However, P = I may be possible as a limiting case as well.
Arranging these vectors in a matrix yields

V := (v_1, v_2, ..., v_P) .   (2.4, 2.5)

This matrix can be used to map the data points x into the subspace spanned by the vectors v_p,
yielding

♦ y := V^T x ,   (2.6)
see also figure 2.1. If P < I then the dimensionality is reduced and some information is lost; if P = I all information is preserved.
Figure 2.1: The effect of matrices V^T and V and combinations thereof for an example of a mapping from 2D to 1D.
© CC BY-SA 4.0
In any case the mapped data are now represented in a new coordinate system, the
axes of which are given by the vectors v_p. With P = 2 and I = 3, for example, we have

y = \begin{pmatrix} v_1^T x \\ v_2^T x \end{pmatrix} = \begin{pmatrix} v_1^T \\ v_2^T \end{pmatrix} x = V^T x

or

\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} v_{11} & v_{21} & v_{31} \\ v_{12} & v_{22} & v_{32} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = V^T x .
It is important to realize that we have done two things here: firstly, we have moved the points from
the high-dimensional space onto the low-dimensional subspace (the points that were already in the
subspace have not been moved, of course) and secondly, we have represented the moved points in
a new coordinate system that is particularly suitable for the low-dimensional subspace. Thus, we went
from the high-dimensional space and the old coordinate system to the low-dimensional subspace and a new
coordinate system. Notice also that points in the high-dimensional space can generally not be represented
accurately in the new coordinate system, because it does not have enough dimensions.
2.3 Matrix V: Mapping from low-dimensional new coordinate system to subspace in old coordinate system

The mapping back from the new coordinate system into the old one is given by

♦ x_|| := V y   (2.7)
♦      = V V^T x .   (2.8)   [by (2.6)]

y and x_|| are equivalent representations, i.e. they contain the same information, just in different
coordinate systems.
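As a small numerical illustration (my own, with an arbitrary orthonormal pair v_1, v_2 in R^3), the following sketch applies (2.6) and (2.7):

```python
import numpy as np

# Two orthonormal vectors spanning a 2D subspace of R^3 (I = 3, P = 2).
v1 = np.array([1.0, 1.0, 0.0]) / np.sqrt(2)
v2 = np.array([0.0, 0.0, 1.0])
V = np.column_stack([v1, v2])             # I x P matrix V = (v_1, v_2)

x = np.array([2.0, 1.0, -1.0])            # a data point in the old coordinates
y = V.T @ x                               # (2.6): coordinates in the new system
x_par = V @ y                             # (2.7): the projected point back in R^3
print(y, x_par)
```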
2.4 Matrix (V^T V): Identity mapping within new coordinate system

Before we look at the combined matrix V V^T, consider V^T V. The latter is obviously a P × P matrix
and performs a transformation from the new (low-dimensional) coordinate system to the old
(high-dimensional) coordinate system (2.7) and back again (2.6). The back-transformation implies
a dimensionality reduction, but since all points in the old coordinate system come from the new coordinate
system and therefore lie within the low-dimensional subspace already, the mapping onto the low-
dimensional space does not discard any information. Thus, only the back and forth (or rather forth and
back) transformation between the two coordinate systems remains, and that in combination is without
any effect either. This means that V^T V is the identity matrix, which can be easily verified:

(V^T V)_pq = v_p^T v_q   [by (2.3)]
           = δ_pq   (2.9)
⟺  V^T V = 1_P   (2.10)

with 1_P indicating the identity matrix of dimensionality P. With P = 2, for example, we have

V^T V = \begin{pmatrix} v_1^T \\ v_2^T \end{pmatrix} (v_1, v_2) = \begin{pmatrix} v_1^T v_1 & v_1^T v_2 \\ v_2^T v_1 & v_2^T v_2 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} ,

where the last step uses the orthonormality (2.3).
2.5 Matrix (V V^T): Projection from high- to low-dimensional (sub)space within old coordinate system

A projection matrix has the characterizing property that it does not make a difference whether you apply it once or twice, i.e.
P P = P. Let us therefore define the projection matrix

P := V V^T .   (2.11)
♦ P := V V^T   (2.15)
     = (v_1, ..., v_P) \begin{pmatrix} v_1^T \\ \vdots \\ v_P^T \end{pmatrix}   (2.16)
     = \begin{pmatrix} v_{11} & \cdots & v_{1P} \\ v_{21} & & \vdots \\ \vdots & & \\ v_{I1} & \cdots & v_{IP} \end{pmatrix} \begin{pmatrix} v_{11} & v_{21} & \cdots & v_{I1} \\ \vdots & & & \\ v_{1P} & \cdots & & v_{IP} \end{pmatrix}   (2.17)   [by (2.2)]
     = \begin{pmatrix} \sum_p v_{1p} v_{1p} & \sum_p v_{1p} v_{2p} & \cdots \\ \sum_p v_{2p} v_{1p} & \ddots & \\ \vdots & & \sum_p v_{Ip} v_{Ip} \end{pmatrix}   (2.18)
     = \sum_{p=1}^{P} \begin{pmatrix} v_{1p} v_{1p} & v_{1p} v_{2p} & \cdots \\ v_{2p} v_{1p} & \ddots & \\ \vdots & & v_{Ip} v_{Ip} \end{pmatrix}   (2.19)
♦    = \sum_{p=1}^{P} v_p v_p^T .   (2.20)
P is obviously an I × I matrix. If P = I then projecting from the old to the new and back to the old
coordinate system causes no information loss and P = 1_I. The smaller P, the more information is lost
and the more P differs from the identity matrix. Consider, for example,
v_1 := \frac{1}{2}(\sqrt{2}, -1, 1)^T  ⇒  v_1 v_1^T = \frac{1}{4}\begin{pmatrix} 2 & -\sqrt{2} & \sqrt{2} \\ -\sqrt{2} & 1 & -1 \\ \sqrt{2} & -1 & 1 \end{pmatrix} ,

v_2 := \frac{1}{2}(0, \sqrt{2}, \sqrt{2})^T  ⇒  v_2 v_2^T = \frac{1}{4}\begin{pmatrix} 0 & 0 & 0 \\ 0 & 2 & 2 \\ 0 & 2 & 2 \end{pmatrix} ,  and

v_3 := \frac{1}{2}(-\sqrt{2}, -1, 1)^T  ⇒  v_3 v_3^T = \frac{1}{4}\begin{pmatrix} 2 & \sqrt{2} & -\sqrt{2} \\ \sqrt{2} & 1 & -1 \\ -\sqrt{2} & -1 & 1 \end{pmatrix}
for which you can easily verify that P (2.20) successively becomes the identity matrix as you take more of
the v_p v_p^T terms.
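This example is easy to reproduce numerically; the sketch below (my own) sums the v_p v_p^T terms of (2.20) and checks the idempotence P P = P:

```python
import numpy as np

s = np.sqrt(2.0)
v1 = np.array([  s, -1.0, 1.0]) / 2.0     # the three orthonormal vectors of the example
v2 = np.array([0.0,    s,   s]) / 2.0
v3 = np.array([ -s, -1.0, 1.0]) / 2.0

P = np.outer(v1, v1)                      # projection onto the space spanned by v1 alone
for v in (v2, v3):
    P += np.outer(v, v)                   # (2.20): adding further v_p v_p^T terms
    print(np.round(P, 3))                 # approaches the 3x3 identity matrix

V = np.column_stack([v1, v2])
P12 = V @ V.T                             # projection onto span(v1, v2)
print(np.allclose(P12 @ P12, P12))        # idempotence: P P = P
```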
2.6 Variance
The variance of a multi-dimensional data set is defined as the sum over the variances of its
components. Since we assume zero-mean data, we have
var(x) := Σ_{i=1}^{I} ⟨x_i²⟩   (2.21)
♦       = ⟨ Σ_{i=1}^{I} x_i² ⟩   (2.22)
        = ⟨x^T x⟩ .   (2.23)

This also holds for the projected data, of course: var(y) = ⟨y^T y⟩.
2.7 Reconstruction error

Introducing the orthogonal vectors

♦ x_⊥ = x − x_||   (2.24)

(in contrast to the projected vectors x_||), we can write the reconstruction error as the variance of the
orthogonal vectors and find
|♦ E = ⟨x_⊥^T x_⊥⟩   (2.25)   [by (2.1), (2.24)]
     = ⟨(x − x_||)^T (x − x_||)⟩   (2.26)   [by (2.24)]
♦    = ⟨(x − V V^T x)^T (x − V V^T x)⟩   (2.27)   [by (2.8)]
♦    = ⟨x^T x − 2 x^T V V^T x + x^T V (V^T V) V^T x⟩   (2.28)   [V^T V = 1_P by (2.10)]
♦    = ⟨x^T x⟩ − ⟨x^T V (V^T V) V^T x⟩   (2.29)
♦    = ⟨x^T x⟩ − ⟨x_||^T x_||⟩   (2.30)   [by (2.8)]
     = ⟨x^T x⟩ − ⟨y^T y⟩ .   (2.31)   [by (2.29), (2.6)]
This means that the reconstruction error equals the variance of the data minus the variance of the
projected data. Thus, this verifies our intuition that minimizing the reconstruction error is equivalent to
maximizing the variance of the projected data.
2.8 Covariance matrix

For zero-mean data the covariance matrix can be written compactly as

C_x := ⟨x x^T⟩ .   (2.32)

It is an easy exercise to show that this definition is equivalent to the componentwise one, (C_x)_ij = ⟨x_i x_j⟩. Since
(x x^T)^T = x x^T (remember (A B)^T = B^T A^T), one can also see that C_x is symmetric, i.e. C_x^T = C_x.
Keep in mind that C_x is strictly speaking a 2nd-moment matrix and only identical to the covariance matrix
because we assume zero-mean data.
Thus the total variance of the data is simply the sum of the eigenvalues of its covariance matrix.
Notice that along the way of this proof we have shown some very general properties. From line (2.43) to (2.46)
we have shown that the total variance of some multi-dimensional data equals the trace of its covariance
matrix. From line (2.46) to (2.48) we have shown that the trace remains invariant under any orthogonal
transformation of the coordinate system. This implies that the total variance of some multi-dimensional
data is invariant under any orthogonal transformation such as a rotation, which is intuitively clear.
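These relations are easy to check numerically; the following sketch (my own, with arbitrary synthetic data) compares the total variance with the trace of the covariance matrix and with the sum of its eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 3))   # 1000 data points in 3D
X -= X.mean(axis=0)                        # make the data zero mean

C = X.T @ X / len(X)                       # covariance matrix C_x = <x x^T>
eigenvalues = np.linalg.eigvalsh(C)        # eigenvalues of the symmetric matrix C_x

total_var = np.mean(np.sum(X ** 2, axis=1))            # <x^T x>
print(total_var, np.trace(C), eigenvalues.sum())       # all three agree up to rounding
```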
2.11 Diagonalizing the covariance matrix
We can now use matrix U to transform the data such that the covariance matrix becomes
diagonal. Define x′ := U^T x and denote the new covariance matrix by C_x′. We have

x′ := U^T x   (2.51)
|♦ C_x′ := ⟨x′ x′^T⟩   (2.52)
♦       = ⟨(U^T x)(U^T x)^T⟩   (2.53)   [by (2.51)]
♦       = ⟨U^T x x^T U⟩   (2.54)
♦       = U^T C_x U   (2.55)   [by (2.32)]
        = Λ ,   (2.56)   [by (2.41)]

and find that the transformed data x′ have a diagonal covariance matrix. Working with x′ instead of x will
simplify further analysis without loss of generality.
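Taken together, this already yields the complete PCA procedure. The following sketch (my own; function and variable names are not from the notes) removes the mean, diagonalizes the covariance matrix, sorts the eigenvalues in decreasing order, and checks that U^T C_x U is diagonal as in (2.55), (2.56):

```python
import numpy as np

def pca(X):
    """PCA of an (M, I) data array with one data point per row; returns (U, lambdas)."""
    X = X - X.mean(axis=0)                     # remove the mean
    C = X.T @ X / len(X)                       # covariance matrix C_x
    lambdas, U = np.linalg.eigh(C)             # eigenvalues (ascending) and eigenvectors
    order = np.argsort(lambdas)[::-1]          # sort by decreasing variance
    return U[:, order], lambdas[order]

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 5))
U, lambdas = pca(X)

Xc = X - X.mean(axis=0)
C_prime = U.T @ (Xc.T @ Xc / len(Xc)) @ U      # (2.55): U^T C_x U
print(np.round(C_prime, 3))                    # diagonal matrix Lambda, see (2.56)

Y = Xc @ U[:, :2]                              # coordinates on the first two principal components
```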
V′ is a rectangular I × P matrix and projects the data into a lower-dimensional space. Since its columns are
orthonormal to each other, it only performs a rotation and an elimination of some dimensions (not necessarily
along the axes) but no scaling of any sort or distortion beyond the elimination of dimensions.
Since matrix V′ could be completed to an orthogonal I × I matrix, which has
normalized row as well as column vectors, we see that, by taking away the I − P additional column vectors,
we are left with the constraints
♦ Σ_i (v′_ip)² = 1   (column vectors of V′ have norm one) ,   (2.64)
♦ ⟹ Σ_ip (v′_ip)² = P   (square sum over all matrix elements equals P) ,   (2.65)
♦ Σ_p (v′_ip)² ≤ 1   (row vectors of V′ have norm less or equal one) .   (2.66)
p
Notice that Constraint (2.65) is a direct consequence of Constraint (2.64) and does not need to be verified
separately in the following considerations.
and see from (2.68) that there is actually as much 'weight' on the first, i.e. large, eigenvalues as Con-
straint (2.66) permits.

|♦ V := U V′   (2.70)
      = (u_1, u_2, ..., u_P) ,   (2.71)   [by (2.36), (2.67)]

which we know has maximal variance. Thus, if we start from the original data x we would set v_p := u_p.
The variance of y is

|♦ ⟨y^T y⟩ = Σ_{i=1}^{I} λ_i Σ_{p=1}^{P} (v′_ip)²   (2.75)   [by (2.63)]
♦         = Σ_{i=1}^{I} λ_i Σ_{p=1}^{P} δ_ip²   (2.76)   [by (2.67)]
          = Σ_{i=1}^{P} λ_i ,   (2.77)

which is the sum over the first P, i.e. the largest, eigenvalues of the covariance matrix. Likewise one can determine
the reconstruction error as

|♦ E = ⟨x^T x⟩ − ⟨y^T y⟩   (2.78)   [by (2.31)]
♦    = Σ_{i=1}^{I} λ_i − Σ_{j=1}^{P} λ_j   (2.79)   [by (2.50), (2.77)]
     = Σ_{i=P+1}^{I} λ_i .   (2.80)
Notice that this is just one optimal set of weights. We have seen above that the projected data, like
any multi-dimensional data, can be rotated arbitrarily without changing its variance and therefore
without changing its reconstruction error. This is equivalent to a rotation of the projection vectors v_p within
the space spanned by the first P eigenvectors.
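Both results can be verified numerically; this sketch (my own, on synthetic data) compares the variance of the projected data and the reconstruction error with the corresponding sums of eigenvalues from (2.77) and (2.80):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(2000, 6)) @ rng.normal(size=(6, 6))
X -= X.mean(axis=0)

C = X.T @ X / len(X)
lambdas, U = np.linalg.eigh(C)
lambdas, U = lambdas[::-1], U[:, ::-1]         # eigenvalues in decreasing order

P = 2
Y = X @ U[:, :P]                               # projected data y
X_rec = Y @ U[:, :P].T                         # reconstruction x_||

var_y = np.mean(np.sum(Y ** 2, axis=1))        # should equal lambda_1 + ... + lambda_P (2.77)
E = np.mean(np.sum((X - X_rec) ** 2, axis=1))  # should equal lambda_{P+1} + ... + lambda_I (2.80)
print(var_y, lambdas[:P].sum())
print(E, lambdas[P:].sum())
```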
2.18 Whitening or sphering

If the variance along a principal axis is λ_i, then one obviously has to stretch along this axis by a factor of 1/√λ_i to get variance one. Thus, sphering
is achieved by multiplying the data with a sphering matrix

W := U diag(1/√λ_1, 1/√λ_2, ..., 1/√λ_I) U^T ,   (2.81)
x̂ := W x .   (2.82)
If the final orientation of the data does not matter, the sphering matrix is often defined without the first
U. It is easy to verify that the sphering matrix is symmetric and that the sphered data x̂ have a unit
covariance matrix.
They also have variance one in all directions, since for any projection vector n of norm one the variance
⟨(n^T x̂)²⟩ of the projected data is ⟨(n^T x̂)²⟩ = n^T ⟨x̂ x̂^T⟩ n = n^T n = 1.
Similarly one can show that the sphered data projected onto two orthogonal vectors are uncorrelated.
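A whitening sketch along the lines of (2.81), (2.82) (my own illustration, with arbitrary synthetic data) could look like this:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(2000, 3)) @ np.array([[3.0, 0.0, 0.0],
                                           [1.0, 0.5, 0.0],
                                           [0.0, 0.2, 0.1]])
X -= X.mean(axis=0)                            # remove the mean first

C = X.T @ X / len(X)
lambdas, U = np.linalg.eigh(C)
W = U @ np.diag(1.0 / np.sqrt(lambdas)) @ U.T  # (2.81): sphering matrix (symmetric)
X_hat = X @ W                                  # (2.82): x_hat = W x for every data point

print(np.round(X_hat.T @ X_hat / len(X_hat), 3))   # unit covariance matrix
n = rng.normal(size=3)
n /= np.linalg.norm(n)
print(np.mean((X_hat @ n) ** 2))                   # variance close to one along any unit vector
```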
2.19 Singular value decomposition +

Let X := (x^1, ..., x^M) be the I × M matrix holding one data point per column. The covariance matrix and its eigenvalue equation then read

C_1 := X X^T / M ,   (2.93)
C_1 U_1 = U_1 Λ_1   (2.94)
⟺  C_1 = U_1 Λ_1 U_1^T .   (2.95)
The data represented in the coordinate system of the eigenvectors is

Y_1 := U_1^T X .   (2.96)

Analogously define, for the transposed data matrix,

C_2 := X^T X / I ,   (2.97)
C_2 U_2 = U_2 Λ_2   (2.98)
⟺  C_2 = U_2 Λ_2 U_2^T ,   (2.99)
Y_2 := U_2^T X^T .   (2.100)

The interesting property of matrix Y_2 now is that its rows are eigenvectors of matrix C_1, as can be shown
easily:

C_1 Y_2^T = (X X^T / M)(X U_2)   (2.101)   [by (2.93), (2.100)]
          = X (X^T X / I) U_2 I/M   (2.102)
          = X C_2 U_2 I/M   (2.103)   [by (2.97)]
          = X U_2 Λ_2 I/M   (2.104)   [by (2.98)]
          = Y_2^T Λ_2 I/M .   (2.105)   [by (2.100)]
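For very high-dimensional data with few samples (I much larger than M), this is exactly the trick that makes PCA feasible: diagonalize the small M × M matrix C_2 and map its eigenvectors back with X. A sketch of this (my own, on random data) is shown below; it avoids ever forming the huge I × I matrix C_1.

```python
import numpy as np

rng = np.random.default_rng(6)
I, M = 10000, 50                          # very high-dimensional space, few data points
X = rng.normal(size=(I, M))               # data matrix with one data point per column
X -= X.mean(axis=1, keepdims=True)

C2 = X.T @ X / I                          # (2.97): small M x M matrix
lam2, U2 = np.linalg.eigh(C2)             # (2.98): its eigenvalues and eigenvectors

U1 = X @ U2                               # (2.101)-(2.105): eigenvectors of C_1 (unnormalized)
U1 /= np.linalg.norm(U1, axis=0)          # normalize each column to unit length

v = U1[:, -1]                             # eigenvector belonging to the largest eigenvalue
C1_v = X @ (X.T @ v) / M                  # C_1 v without forming the I x I matrix C_1
print(np.allclose(C1_v, (lam2[-1] * I / M) * v))   # eigenvalue of C_1 is lambda_2 * I / M
```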
3 Application
3.1 Face processing
Eigenfaces - Texture
In this example many faces were warped to a standard geometry and then PCA was applied to calculate
the eigenvectors, which are called eigenfaces. The figure visualizes the first four eigenfaces by varying the
average face (middle) along the direction of the eigenvectors by up to ±3 standard deviations. The first
eigenface (mode) accounts for overall illumination, the other three for some combination of beard, gender,
and facial expression.
Figure: (Lanitis, Taylor, et al., 1995, Image and Vision Computing 13(5):393–401, Fig. 11; URL in note 3.2)
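For completeness, here is a minimal eigenface sketch with scikit-learn (the library linked in Section 4.4); it is my own illustration, not the pipeline of Lanitis et al., and it assumes a hypothetical file faces.npy containing aligned face images of equal size.

```python
import numpy as np
from sklearn.decomposition import PCA

faces = np.load("faces.npy")              # hypothetical array of shape (n_images, height, width)
M, h, w = faces.shape
X = faces.reshape(M, h * w)               # concatenate rows: one vector per face

pca = PCA(n_components=4)                 # keep the four largest-variance directions
Y = pca.fit_transform(X)                  # coordinates of each face in eigenface space

average_face = pca.mean_.reshape(h, w)                # the average face
eigenfaces = pca.components_.reshape(4, h, w)         # the first four eigenfaces

# Vary the average face along the first eigenface by +3 standard deviations.
sigma = np.sqrt(pca.explained_variance_[0])
variant = average_face + 3.0 * sigma * eigenfaces[0]
```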
4 Other resources
Numbers in square brackets indicate sections of these lecture notes to which the corresponding item is related.
4.2 Visualizations
• 2D, 3D, and 17D examples of PCA
http://setosa.io/ev/principal-component-analysis/
4.3 Videos
• Abstract conceptual introduction to PCA from Georgia Tech.
1. Part 1
https://www.youtube.com/watch?v=kw9R0nD69OU (4:22) [1.1, 1.2, 1.3]
2. Part 2
https://www.youtube.com/watch?v=_nZUhV-qhZA (5:27) [1.2, 1.3]
3. Part 3
https://www.youtube.com/watch?v=kuzJJgPBrqc (5:01) [1.2, 1.3]
• Introduction to PCA with several practical examples by Rasmus Bro from the University of Copenhagen
Note: In these videos the principal components are called loadings, and the projected values are called
scores.
1. Conceptual introduction
https://www.youtube.com/watch?v=UUxIXU_Ob6E (12:32)
– 00:00–00:38 Introduction
– 00:38–07:52 Examples of multi-variate data
– 07:52–12:32 Concept of PCA and bi-plot by means of a simple 3D example
2. Example: Analysis of demographic data of many countries in the world with PCA
https://www.youtube.com/watch?v=26YhtSJi1qc (11:36)
– 00:11–02:26 Outlier detection
– 02:26–05:47 Loadings help with interpreting the projected data dimensions
– 05:47–08:11 Exploration after removing outliers
– 08:11–09:40 Using external information as markers
– 09:40–10:36 From plots to numbers
– 10:36–11:36 Summary
3. A bit more in-depth discussion
https://www.youtube.com/watch?v=2s-a62zSWL4 (14:31)
– 00:00–00:50 Introduction and historical remarks
– 00:50–03:18 Basic equation for data generation
– 03:18–06:32 Example: Three-variable data set
∗ 04:30–05:40 Removing the mean before PCA
∗ 05:40–06:11 PCA as variations along loadings
∗ 06:11–06:32 PCA as a rotation and truncation
– 06:32–06:59 Scores are projections of the data onto the loadings
– 06:59–14:26 PCA as finding common profiles and profile variations
∗ 07:12–11:58 Example 1: Continuous spectral data of sugar samples
∗ 12:22–14:26 Example 2: Discrete physical and chemical measurements of the same sugar
samples
· 12:48–14:26 Scaling the variables before PCA (covariance- vs. correlation-matrix)
4. Continuation of the previous video
https://www.youtube.com/watch?v=sRsdF3rcAJc (08:22)
– 00:05–14:26 PCA as finding common profiles and profile variations (cont.)
∗ 00:05–04:02 Example 2: Discrete physical and chemical measurements of the same sugar
samples (cont.)
∗ 01:31–02:37 Relating results from Examples 1 and 2
∗ 02:37–04:02 Bi-plot: Understanding the data by visualizing the loadings
– 04:02–08:18 PCA reviewed
• Introduction to PCA in Geoscience by Matthew E. Clapham from UC Santa Cruz
https://www.youtube.com/watch?v=TSYL-oHx4T0 (15:27)
– 00:00–00:33 Introduction
– 00:33–01:17 Data is often multi-variate / high-dimensional
– 01:17–02:50 Covariation allows one to reduce dimensionality and remove redundancy
– 02:50–03:36 skip (Indirect gradient analysis)
– 03:36–04:44 skip (Types of ordination methods)
– 04:44–06:10 Idea of PCA
– 06:10–06:29 Covariance matrix
– 06:29–09:22 Eigenvectors define a new coordinate system ordered by variance
∗ 07:53–08:43 Eigenvalues measure the amount of variance along the axes of the new coordinate
system
– 09:22–10:42 Loadings
– 10:42–12:17 Covariance- vs. correlation-matrix (scaling the variables before PCA)
– 12:17–13:50 Using the eigenvalue spectrum to decide how many PCs to keep
– 13:50–15:27 When is PCA applicable?
• Three lectures on singular value decomposition and PCA
(See also the closely related tutorial by Jonathon Shlens https://arxiv.org/pdf/1404.1100v1.pdf.)
1. Singular value decomposition
https://www.youtube.com/watch?v=EokL7E6o1AE (44:35)
Explains singular value decomposition in general, which is an interesting piece of linear algebra in its
own right but not really necessary here. It is also not entirely obvious how this relates to Section 2.19.
I list it here because it belongs to the series.
2. PCA in relatively simple mathematical terms
https://www.youtube.com/watch?v=a9jdQGybYmE (51:12)
– 00:00–15:38 Why would you like to do PCA? [1.1]
– 15:38–23:32 Variance and covariance [1.4]
∗ 18:18–18:30 The vectors introduced at 16:52–17:10 are row vectors, so that a b^T is indeed an
inner product. I always use column vectors, and then I write the inner product as a^T b.
∗ 22:26–22:54 The statement made here is not always true. Variables can be uncorrelated but
statistically dependent, a simple example being cos(t) and sin(t) for t ∈ [0, 2π]. He uses the
term statistical independence for what is really only zero covariance or no correlation.
– 23:32–25:06 Relation of variance and covariance to the motivational example
– 25:06–35:49 Covariance matrix (assuming zero mean data) [2.8]
– 35:49–39:44 Diagonalizing the covariance matrix [1.7]
– 39:44–47:20 Diagonalization with eigenvalues and -vectors [2.11]
– 47:20–49:43 Diagonalization with singular value decomposition
– 49:43–51:12 Summary
3. Application of PCA to face recognition
https://www.youtube.com/watch?v=8BTv-KZ2Bh8 (48:02)
4.4 Software
• General list of software for PCA
https://en.wikipedia.org/wiki/Principal_component_analysis#Software.2Fsource_code
• PCA and variants thereof in scikit-learn, a python library for machine learning
http://scikit-learn.org/stable/modules/decomposition.html#principal-component-analysis-pca
• Examples using PCA in scikit-learn
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA
4.5 Exercises
• Analytical exercises by Laurenz Wiskott
https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/PrincipalComponentAnalysis-ExercisesPublic.pdf
https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/PrincipalComponentAnalysis-SolutionsPublic.pdf
• Python exercises by Laurenz Wiskott
https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/PrincipalComponentAnalysis-PythonExercisesPubl
zip
https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/PrincipalComponentAnalysis-PythonSolutionsPubl
zip
References
Hertz, J., Krogh, A., and Palmer, R. G. (1991). Introduction to the Theory of Neural Computation. Addison-
Wesley, Redwood City, CA.
Lanitis, A., Taylor, C. J., and Cootes, T. F. (1995). An automatic face identification system using flexible
appearance models. Image and Vision Computing, 13(5):393–401.
Turk, M. and Pentland, A. P. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–
86.
Notes
3.1 http://www-white.media.mit.edu/ 2002-12-17, http://www-white.media.mit.edu/vismod/demos/facerec/basic.html
3.2 Lanitis, Taylor, et al., 1995, Image and Vision Computing 13(5):393–401, Fig. 11, https://www.researchgate.net/profile/Andreas_Lanitis/publication/223308678_Automatic_face_identification_system_using_flexible_appearance_models/links/00463533817d5b2dd9000000.pdf
3.3 Lanitis, Taylor, et al., 1995, Image and Vision Computing 13(5):393–401, Fig. 6, https://www.researchgate.net/profile/Andreas_Lanitis/publication/223308678_Automatic_face_identification_system_using_flexible_appearance_models/links/00463533817d5b2dd9000000.pdf
Copyright protection level: 2/2