Principal Component Analysis
— Lecture Notes —
Laurenz Wiskott
Institut für Neuroinformatik
Ruhr-Universität Bochum, Germany, EU
14 December 2016
— Summary —
Principal component analysis (PCA) can be applied to vectorial data and is probably the most common
method to reduce the dimensionality of data for compression and visualization. It determines the dimensions
of largest and smallest variance of the data, referred to as the principal components, which can then be used to
discard the small variance dimensions for dimensionality reduction or select the two or three largest variance
dimensions for visualization. For instance, if you have one thousand 100-dimensional data points, PCA
might be used to reduce the dimensionality of the data down to 10 without losing too much information,
which corresponds to a compression by 90%. In some cases it is also the small-variance directions that are of
interest. PCA can also be used to normalize data such that it has unit variance in all directions and no
correlations, a process called whitening or sphering.
1 Intuition introduces the problem statement of principal component analysis and the general idea
of the algorithm without much formalism. → Video ’In a nutshell’, no 1, Exercises, Solutions
© 2004–2006, 2009, 2010, 2013, 2016 Laurenz Wiskott (ORCID http://orcid.org/0000-0001-6237-740X, homepage
https://www.ini.rub.de/PEOPLE/wiskott/). This work (except for all figures from other sources, if present) is licensed under
the Creative Commons Attribution-ShareAlike 4.0 International License, see http://creativecommons.org/licenses/by-sa/
4.0/. If figures are not included for copyright reasons, they are uniformly colored, but the word 'Figure', 'Image', or the like in the
reference is often linked to a freely available copy.
Core text and formulas are set in dark red, one can repeat the lecture notes quickly by just reading these; marks important
formulas or items worth remembering and learning for an exam; ♦ marks less important formulas or items that I would usually
also present in a lecture; + marks sections that I would usually skip in a lecture.
More teaching material is available at https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/.
2 Formalism provides a detailed mathematical treatment of principal component analysis, including
whitening and singular value decomposition, an extension for very high-dimensional spaces.
→ Exercises, Solutions
3 Application presents an application to face processing. A large number of images of faces are
aligned at the eyes and then treated as data vectors by concatenating the rows (or columns). The
mean represents the average face and the principal components the dominant modes of variation.
This can be used for face recognition and analysis.
Contents
Introductory remarks for the lecture
1 Intuition
1.1 Problem statement
1.2 Projection and reconstruction error
1.3 Reconstruction error and variance
1.4 Covariance matrix
1.5 Covariance matrix and higher order structure
1.6 PCA by diagonalizing the covariance matrix
1.7 Whitening
2 Formalism
2.1 Definition of the PCA-optimization problem
2.2 Matrix V^T: Mapping from high-dimensional old coordinate system to low-dimensional new coordinate system
2.3 Matrix V: Mapping from low-dimensional new coordinate system to subspace in old coordinate system
2.4 Matrix (V^T V): Identity mapping within new coordinate system
2.5 Matrix (V V^T): Projection from high- to low-dimensional (sub)space within old coordinate system
2.6 Variance
2.7 Reconstruction error
2.8 Covariance matrix
2.9 Eigenvalue equation of the covariance matrix
2.10 Total variance of the data x
2.11 Diagonalizing the covariance matrix
2.12 Variance of y for a diagonalized covariance matrix
2.13 Constraints of matrix V′
2.14 Finding the optimal subspace
2.15 Interpretation of the result
2.16 PCA Algorithm
2.17 Intuition of the Results
2.18 Whitening or sphering
2.19 Singular value decomposition +
3 Application
3.1 Face processing
4 Other resources
4.1 Written material
4.2 Visualizations
4.3 Videos
4.4 Software
4.5 Exercises
1 Intuition
This section is partly based on (Hertz et al., 1991).
1.2 Projection and reconstruction error
The task of principal component analysis (PCA) is to reduce the dimensionality of some high-dimensional
data points by linearly projecting them onto a lower-dimensional space in such a way that the reconstruction
error made by this projection is minimal. In order to develop an intuition for PCA we first take a closer
look at what it means to project the data points and to reconstruct them. Figure 1.1 illustrates the process.
Figure 1.1: Projection of 2D data points onto a 1D subspace and their reconstruction.
© CC BY-SA 4.0
(a) A few data points are given in a two-dimensional space and are represented by two-dimensional vectors
x = (x1, x2). (b) In order to reduce the dimensionality down to one, we have to choose a one-dimensional
subspace defined by a unit vector v and project the data points onto it, which can be done by

x_|| := v v^T x .   (1.1)

(c) The points can now be represented by just one number,

y := v^T x ,   (1.2)

and we do not care that they originally came from a two-dimensional space. (d) If we want to reconstruct
the original two-dimensional positions of the data points as well as possible, we have to embed the one-
dimensional space in the original two-dimensional space in exactly the orientation used during the projection,

x_|| = v y .   (1.3)   [by (1.1), (1.2)]
However, we cannot recover the accurate 2D-position; the points remain on the one-dimensional sub-
space. The reconstruction error is therefore the average distance of the original 2D-positions from the
one-dimensional subspace (the length of the projection arrows in (b)). For mathematical convenience one
actually takes the average squared distance
E := ⟨ ||x^µ − x_||^µ||² ⟩_µ   (1.4)
  = (1/M) Σ_{µ=1}^{M} Σ_{i=1}^{I} (x_i^µ − x_{||,i}^µ)² ,   (1.5)
where µ indicates the different data points, M the number of data points, and I the dimensionality of the
data vectors.
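To make this concrete, here is a minimal NumPy sketch (not part of the original notes) that projects synthetic 2D data onto a unit vector and evaluates the reconstruction error of equations (1.1)-(1.5); the data and the direction v are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 0.0], [1.0, 0.5]])  # M = 100 points, one per row
X -= X.mean(axis=0)                       # PCA assumes zero-mean data

v = np.array([1.0, 0.3])
v /= np.linalg.norm(v)                    # unit vector defining the 1D subspace

y = X @ v                                 # (1.2): y = v^T x for every data point
X_par = np.outer(y, v)                    # (1.1)/(1.3): x_|| = v y
E = np.mean(np.sum((X - X_par) ** 2, axis=1))   # (1.4)/(1.5): mean squared distance
print(E)
```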
Figure 1.2: Variance of the projected data and reconstruction error as the linear subspace is rotated.
© CC BY-SA 4.0
1.3 Reconstruction error and variance
Figure 1.2 illustrates the projection onto two different subspaces. Focus just on the one point x
and its projection x_|| . d is the distance of x from the origin, r is the distance of x from its projection x_|| in the subspace,
and v is the distance of x_|| from the origin. r and v depend on the direction of the subspace while d does
not. Interestingly, since the triangles formed by x, x_||, and the origin are right-angled, r and v are related by
Pythagoras' theorem, i.e. r² + v² = d². We know that r² contributes to the reconstruction error. v² on the
other hand contributes to the variance of the projected data within the subspace. Thus we see that the sum
of the reconstruction error and the variance of the projected data is constant and equals the variance of
the original data. Therefore, minimizing the reconstruction error is equivalent to maximizing the variance
of the projected data.
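This Pythagorean argument can be checked numerically; the following sketch (my own illustration, with arbitrary synthetic data) shows that projected variance plus reconstruction error is the same for every direction of the subspace and equals the total variance.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2)) @ np.array([[1.5, 0.0], [0.8, 0.4]])
X -= X.mean(axis=0)

total_var = np.mean(np.sum(X ** 2, axis=1))         # average d^2 over all points
for angle in np.linspace(0.0, np.pi, 5):
    v = np.array([np.cos(angle), np.sin(angle)])    # unit direction of the 1D subspace
    y = X @ v
    proj_var = np.mean(y ** 2)                      # average v^2: variance in the subspace
    rec_err = np.mean(np.sum((X - np.outer(y, v)) ** 2, axis=1))  # average r^2
    print(f"{angle:4.2f}: {proj_var + rec_err:.6f} vs {total_var:.6f}")
```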
Figure 1.4: Different data distributions with identical covariance matrices. What is the scale of the axes
in order to get the identity matrix?
© CC BY-SA 4.0
It will turn out that the eigenvectors of the covariance matrix point into the directions of maximal (and minimal)
variance and that the eigenvalues are equal to the variances along these directions. Projecting the data onto
the eigenvectors with the largest eigenvalues is therefore the optimal linear dimensionality reduction.
Figure 1.5: Diagonalizing the covariance matrix by rotating the coordinate system.
© CC BY-SA 4.0
1.7 Whitening
Whitening data means to first remove the mean and then stretch the data along the principal axes such
that it has unit variance in all directions, i.e. the stretching must be done by one over the square root of
the variance. The principal axes as well as the corresponding variances can be obtained by PCA. If the original
data is an anisotropic Gaussian to begin with, e.g. having the shape of a flying sausage or a cigar, it has a
spherical shape after whitening, which is why the process is also called sphering.
Whitened data has the advantage that you can project it onto any unit vector and it has unit variance. More
formally, if n is an arbitrary unit vector and the x^µ are the whitened data vectors, the projected data x_|| := n^T x
has zero mean ⟨x_||⟩ = 0, since ⟨x^µ⟩ = 0, and unit variance ⟨x_||²⟩ = 1. Projected onto two orthogonal unit
vectors, the two projected data sets are uncorrelated. More formally, if n_1 and n_2 are two arbitrary but
orthogonal unit vectors, i.e. n_i^T n_j = δ_ij (Kronecker symbol), the two projected data sets x_||1 := n_1^T x and
x_||2 := n_2^T x are uncorrelated, i.e. ⟨x_||1 x_||2⟩ = 0.
2 Formalism
This section is partly based on (Hertz et al., 1991).
Figure 1.6: Whitening of some data. Left: Original data. The ellipse represents the second moment matrix
of the data distribution. Right: The mean is removed and the data stretched such that it has variance one
in all directions, indicated by a circle representing the identity second moment matrix. The graphs to the
top and right represent the marginals of the first and second component, respectively.
© CC BY-SA 4.0
among all possible projections onto a P-dimensional subspace. The row vectors of
matrix U define the new axes and are called the principal components.
Some remarks: (i) ⟨x^µ⟩_µ indicates the mean over all M data points indexed with µ. To simplify the notation
we will from now on drop the index µ and indicate averages over the data points by ⟨·⟩. (ii) If one has
non-zero-mean data, one typically removes the mean before applying PCA. Even though all the math is
valid also for non-zero-mean data, the results would typically be undesired and nonintuitive. (iii) Since
matrix U is orthogonal and has determinant value +1, it corresponds simply to a rotation of the data x.
Thus, the 'shape' of the data cloud remains the same, just the 'perspective' changes. |U| = −1 would imply
a mirror reflection of the data distribution and is often permitted, too. Notice also that one can interpret
the multiplication with matrix U^T either as a rotation of the data or as a rotation of the coordinate system.
Either interpretation is valid. (iv) Projecting the data x′ onto the P-dimensional linear subspace spanned by
the first P axes is simply done by setting all components higher than P to zero. This can be done because
we still have an orthonormal coordinate system. If U, and therefore the new coordinate system, were not
orthogonal, then the projection would become a mathematically more complex operation. (v) The reconstruction
error has to be minimal for any P. This has the advantage that we do not need to decide on P before
performing PCA. Often P is actually chosen based on information obtained during PCA and governed by
a constraint, such as that the reconstruction error should be below a certain threshold.
2.2 Matrix V^T: Mapping from high-dimensional old coordinate system to low-dimensional new coordinate system

Assume some data points x are given in an I-dimensional space and a linear subspace is
spanned by P orthonormal vectors

v_p := (v_1p, v_2p, ..., v_Ip)^T   with   v_p^T v_q = δ_pq .   (2.2, 2.3)

We will typically assume P < I and speak of a high(I)-dimensional space and a low(P)-dimensional
(sub)space. However, P = I may be possible as a limiting case as well.
Arranging these vectors in a matrix yields

V := (v_1, v_2, ..., v_P) .   (2.4, 2.5)

This matrix can be used to map the data points x into the subspace spanned by the vectors v_p,
yielding

♦ y := V^T x ,   (2.6)
see also figure 2.1. If P < I then the dimensionality is reduced and some information is lost; if P = I all information is preserved.
Figure 2.1: The effect of matrices V^T and V and combinations thereof for an example of a mapping from 2D to 1D.
© CC BY-SA 4.0
In any case the mapped data are now represented in a new coordinate system, the
axes of which are given by the vectors v_p. With P = 2 and I = 3, for example, we have

y = \begin{pmatrix} v_1^T x \\ v_2^T x \end{pmatrix} = \begin{pmatrix} v_1^T \\ v_2^T \end{pmatrix} x = V^T x

or

\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} v_{11} & v_{21} & v_{31} \\ v_{12} & v_{22} & v_{32} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = V^T x .
It is important to realize that we have done two things here: firstly, we have moved the points from
the high-dimensional space onto the low-dimensional subspace (the points that were already in the
subspace have not been moved, of course) and secondly, we have represented the moved points in
a new coordinate system that is particularly suitable for the low-dimensional subspace. Thus, we went
from the high-dimensional space and the old coordinate system to the low-dimensional subspace and a new
coordinate system. Notice also that points in the high-dimensional space can generally not be represented
accurately in the new coordinate system, because it does not have enough dimensions.
2.3 Matrix V: Mapping from low-dimensional new coordinate system to subspace in old coordinate system

The mapping back from the new coordinate system into the old one is given by

♦ x_|| := V y   (2.7)
♦      = V V^T x .   (2.8)   [by (2.6)]

y and x_|| are equivalent representations, i.e. they contain the same information, just in different
coordinate systems.
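As a small numerical illustration (my own, with an arbitrary orthonormal pair v_1, v_2 in R^3), the following sketch applies (2.6) and (2.7):

```python
import numpy as np

# Two orthonormal vectors spanning a 2D subspace of R^3 (I = 3, P = 2).
v1 = np.array([1.0, 1.0, 0.0]) / np.sqrt(2)
v2 = np.array([0.0, 0.0, 1.0])
V = np.column_stack([v1, v2])             # I x P matrix V = (v_1, v_2)

x = np.array([2.0, 1.0, -1.0])            # a data point in the old coordinates
y = V.T @ x                               # (2.6): coordinates in the new system
x_par = V @ y                             # (2.7): the projected point back in R^3
print(y, x_par)
```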
2.4 Matrix (V^T V): Identity mapping within new coordinate system

Before we look at the combined matrix V V^T, consider V^T V. The latter is obviously a P × P matrix
and performs a transformation from the new (low-dimensional) coordinate system to the old
(high-dimensional) coordinate system (2.7) and back again (2.6). The back-transformation implies
a dimensionality reduction, but since all points in the old coordinate system come from the new coordinate
system and therefore lie within the low-dimensional subspace already, the mapping onto the low-
dimensional space does not discard any information. Thus, only the back and forth (or rather forth and
back) transformation between the two coordinate systems remains, and that in combination is without
any effect either. This means that V^T V is the identity matrix, which can be easily verified:

(V^T V)_pq = v_p^T v_q   [by (2.3)]
           = δ_pq   (2.9)
⟺  V^T V = 1_P   (2.10)

with 1_P indicating the identity matrix of dimensionality P. With P = 2, for example, we have

V^T V = \begin{pmatrix} v_1^T \\ v_2^T \end{pmatrix} (v_1, v_2) = \begin{pmatrix} v_1^T v_1 & v_1^T v_2 \\ v_2^T v_1 & v_2^T v_2 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} ,

where the last step uses the orthonormality (2.3).
2.5 Matrix (V V^T): Projection from high- to low-dimensional (sub)space within old coordinate system

A projection matrix has the characterizing property that it does not make a difference whether you apply it once or twice, i.e.
P P = P. Let us therefore define the projection matrix

P := V V^T .   (2.11)
♦ P := V V^T   (2.15)
     = (v_1, ..., v_P) \begin{pmatrix} v_1^T \\ \vdots \\ v_P^T \end{pmatrix}   (2.16)
     = \begin{pmatrix} v_{11} & \cdots & v_{1P} \\ v_{21} & & \vdots \\ \vdots & & \\ v_{I1} & \cdots & v_{IP} \end{pmatrix} \begin{pmatrix} v_{11} & v_{21} & \cdots & v_{I1} \\ \vdots & & & \\ v_{1P} & \cdots & & v_{IP} \end{pmatrix}   (2.17)   [by (2.2)]
     = \begin{pmatrix} \sum_p v_{1p} v_{1p} & \sum_p v_{1p} v_{2p} & \cdots \\ \sum_p v_{2p} v_{1p} & \ddots & \\ \vdots & & \sum_p v_{Ip} v_{Ip} \end{pmatrix}   (2.18)
     = \sum_{p=1}^{P} \begin{pmatrix} v_{1p} v_{1p} & v_{1p} v_{2p} & \cdots \\ v_{2p} v_{1p} & \ddots & \\ \vdots & & v_{Ip} v_{Ip} \end{pmatrix}   (2.19)
♦    = \sum_{p=1}^{P} v_p v_p^T .   (2.20)
P is obviously an I × I matrix. If P = I then projecting from the old to the new and back to the old
coordinate system causes no information loss and P = 1_I. The smaller P, the more information is lost
and the more P differs from the identity matrix. Consider, for example,
v_1 := \frac{1}{2}(\sqrt{2}, -1, 1)^T  ⇒  v_1 v_1^T = \frac{1}{4}\begin{pmatrix} 2 & -\sqrt{2} & \sqrt{2} \\ -\sqrt{2} & 1 & -1 \\ \sqrt{2} & -1 & 1 \end{pmatrix} ,

v_2 := \frac{1}{2}(0, \sqrt{2}, \sqrt{2})^T  ⇒  v_2 v_2^T = \frac{1}{4}\begin{pmatrix} 0 & 0 & 0 \\ 0 & 2 & 2 \\ 0 & 2 & 2 \end{pmatrix} ,  and

v_3 := \frac{1}{2}(-\sqrt{2}, -1, 1)^T  ⇒  v_3 v_3^T = \frac{1}{4}\begin{pmatrix} 2 & \sqrt{2} & -\sqrt{2} \\ \sqrt{2} & 1 & -1 \\ -\sqrt{2} & -1 & 1 \end{pmatrix}
for which you can easily verify that P (2.20) successively becomes the identity matrix as you take more of
the v_p v_p^T terms.
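This example is easy to reproduce numerically; the sketch below (my own) sums the v_p v_p^T terms of (2.20) and checks the idempotence P P = P:

```python
import numpy as np

s = np.sqrt(2.0)
v1 = np.array([  s, -1.0, 1.0]) / 2.0     # the three orthonormal vectors of the example
v2 = np.array([0.0,    s,   s]) / 2.0
v3 = np.array([ -s, -1.0, 1.0]) / 2.0

P = np.outer(v1, v1)                      # projection onto the space spanned by v1 alone
for v in (v2, v3):
    P += np.outer(v, v)                   # (2.20): adding further v_p v_p^T terms
    print(np.round(P, 3))                 # approaches the 3x3 identity matrix

V = np.column_stack([v1, v2])
P12 = V @ V.T                             # projection onto span(v1, v2)
print(np.allclose(P12 @ P12, P12))        # idempotence: P P = P
```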
2.6 Variance
The variance of a multi-dimensional data set is defined as the sum over the variances of its
components. Since we assume zero-mean data, we have
var(x) := Σ_{i=1}^{I} ⟨x_i²⟩   (2.21)
♦       = ⟨ Σ_{i=1}^{I} x_i² ⟩   (2.22)
        = ⟨x^T x⟩ .   (2.23)

This also holds for the projected data, of course: var(y) = ⟨y^T y⟩.
2.7 Reconstruction error

Introducing the orthogonal vectors

♦ x_⊥ = x − x_||   (2.24)

(in contrast to the projected vectors x_||), we can write the reconstruction error as the variance of the
orthogonal vectors and find
|♦ E = ⟨x_⊥^T x_⊥⟩   (2.25)   [by (2.1), (2.24)]
     = ⟨(x − x_||)^T (x − x_||)⟩   (2.26)   [by (2.24)]
♦    = ⟨(x − V V^T x)^T (x − V V^T x)⟩   (2.27)   [by (2.8)]
♦    = ⟨x^T x − 2 x^T V V^T x + x^T V (V^T V) V^T x⟩   (2.28)   [V^T V = 1_P by (2.10)]
♦    = ⟨x^T x⟩ − ⟨x^T V (V^T V) V^T x⟩   (2.29)
♦    = ⟨x^T x⟩ − ⟨x_||^T x_||⟩   (2.30)   [by (2.8)]
     = ⟨x^T x⟩ − ⟨y^T y⟩ .   (2.31)   [by (2.29), (2.6)]
This means that the reconstruction error equals the variance of the data minus the variance of the
projected data. Thus, this verifies our intuition that minimizing the reconstruction error is equivalent to
maximizing the variance of the projected data.
2.8 Covariance matrix

For zero-mean data the covariance matrix can be written compactly as

C_x := ⟨x x^T⟩ .   (2.32)

It is an easy exercise to show that this definition is equivalent to the componentwise one, (C_x)_ij = ⟨x_i x_j⟩. Since
(x x^T)^T = x x^T (remember (A B)^T = B^T A^T), one can also see that C_x is symmetric, i.e. C_x^T = C_x.
Keep in mind that C_x is strictly speaking a 2nd-moment matrix and only identical to the covariance matrix
because we assume zero-mean data.
Thus the total variance of the data is simply the sum of the eigenvalues of its covariance matrix.
Notice that along the way of this proof we have shown some very general properties. From line (2.43) to (2.46)
we have shown that the total variance of some multi-dimensional data equals the trace of its covariance
matrix. From line (2.46) to (2.48) we have shown that the trace remains invariant under any orthogonal
transformation of the coordinate system. This implies that the total variance of some multi-dimensional
data is invariant under any orthogonal transformation such as a rotation, which is intuitively clear.
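These relations are easy to check numerically; the following sketch (my own, with arbitrary synthetic data) compares the total variance with the trace of the covariance matrix and with the sum of its eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 3))   # 1000 data points in 3D
X -= X.mean(axis=0)                        # make the data zero mean

C = X.T @ X / len(X)                       # covariance matrix C_x = <x x^T>
eigenvalues = np.linalg.eigvalsh(C)        # eigenvalues of the symmetric matrix C_x

total_var = np.mean(np.sum(X ** 2, axis=1))            # <x^T x>
print(total_var, np.trace(C), eigenvalues.sum())       # all three agree up to rounding
```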
2.11 Diagonalizing the covariance matrix
We can now use matrix U to transform the data such that the covariance matrix becomes
diagonal. Define x′ := U^T x and denote the new covariance matrix by C_x′. We have

x′ := U^T x   (2.51)
|♦ C_x′ := ⟨x′ x′^T⟩   (2.52)
♦       = ⟨(U^T x)(U^T x)^T⟩   (2.53)   [by (2.51)]
♦       = ⟨U^T x x^T U⟩   (2.54)
♦       = U^T C_x U   (2.55)   [by (2.32)]
        = Λ ,   (2.56)   [by (2.41)]

and find that the transformed data x′ have a diagonal covariance matrix. Working with x′ instead of x will
simplify further analysis without loss of generality.
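Taken together, this already yields the complete PCA procedure. The following sketch (my own; function and variable names are not from the notes) removes the mean, diagonalizes the covariance matrix, sorts the eigenvalues in decreasing order, and checks that U^T C_x U is diagonal as in (2.55), (2.56):

```python
import numpy as np

def pca(X):
    """PCA of an (M, I) data array with one data point per row; returns (U, lambdas)."""
    X = X - X.mean(axis=0)                     # remove the mean
    C = X.T @ X / len(X)                       # covariance matrix C_x
    lambdas, U = np.linalg.eigh(C)             # eigenvalues (ascending) and eigenvectors
    order = np.argsort(lambdas)[::-1]          # sort by decreasing variance
    return U[:, order], lambdas[order]

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 5))
U, lambdas = pca(X)

Xc = X - X.mean(axis=0)
C_prime = U.T @ (Xc.T @ Xc / len(Xc)) @ U      # (2.55): U^T C_x U
print(np.round(C_prime, 3))                    # diagonal matrix Lambda, see (2.56)

Y = Xc @ U[:, :2]                              # coordinates on the first two principal components
```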
V′ is a rectangular I × P matrix and projects the data into a lower-dimensional space. Since its columns are
orthonormal to each other, it only performs a rotation and an elimination of some dimensions (not necessarily
along the axes) but no scaling of any sort or distortion beyond the elimination of dimensions.
Since matrix V′ could be completed to an orthogonal I × I matrix, which has
normalized row as well as column vectors, we see that, by taking away the I − P additional column vectors,
we are left with the constraints
♦ Σ_i (v′_ip)² = 1   (column vectors of V′ have norm one) ,   (2.64)
♦ ⟹ Σ_ip (v′_ip)² = P   (square sum over all matrix elements equals P) ,   (2.65)
♦ Σ_p (v′_ip)² ≤ 1   (row vectors of V′ have norm less or equal one) .   (2.66)
p
Notice that Constraint (2.65) is a direct consequence of Constraint (2.64) and does not need to be verified
separately in the following considerations.
and see from (2.68) that there is actually as much 'weight' on the first, i.e. large, eigenvalues as Con-
straint (2.66) permits.

|♦ V := U V′   (2.70)
      = (u_1, u_2, ..., u_P) ,   (2.71)   [by (2.36), (2.67)]

which we know has maximal variance. Thus, if we start from the original data x we would set v_p := u_p.
The variance of y is

|♦ ⟨y^T y⟩ = Σ_{i=1}^{I} λ_i Σ_{p=1}^{P} (v′_ip)²   (2.75)   [by (2.63)]
♦         = Σ_{i=1}^{I} λ_i Σ_{p=1}^{P} δ_ip²   (2.76)   [by (2.67)]
          = Σ_{i=1}^{P} λ_i ,   (2.77)

which is the sum over the first P, i.e. the largest, eigenvalues of the covariance matrix. Likewise one can determine
the reconstruction error as

|♦ E = ⟨x^T x⟩ − ⟨y^T y⟩   (2.78)   [by (2.31)]
♦    = Σ_{i=1}^{I} λ_i − Σ_{j=1}^{P} λ_j   (2.79)   [by (2.50), (2.77)]
     = Σ_{i=P+1}^{I} λ_i .   (2.80)
Notice that this is just one optimal set of weights. We have seen above that the projected data, like
any multi-dimensional data, can be rotated arbitrarily without changing its variance and therefore
without changing its reconstruction error. This is equivalent to a rotation of the projection vectors v_p within
the space spanned by the first P eigenvectors.
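Both results can be verified numerically; this sketch (my own, on synthetic data) compares the variance of the projected data and the reconstruction error with the corresponding sums of eigenvalues from (2.77) and (2.80):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(2000, 6)) @ rng.normal(size=(6, 6))
X -= X.mean(axis=0)

C = X.T @ X / len(X)
lambdas, U = np.linalg.eigh(C)
lambdas, U = lambdas[::-1], U[:, ::-1]         # eigenvalues in decreasing order

P = 2
Y = X @ U[:, :P]                               # projected data y
X_rec = Y @ U[:, :P].T                         # reconstruction x_||

var_y = np.mean(np.sum(Y ** 2, axis=1))        # should equal lambda_1 + ... + lambda_P (2.77)
E = np.mean(np.sum((X - X_rec) ** 2, axis=1))  # should equal lambda_{P+1} + ... + lambda_I (2.80)
print(var_y, lambdas[:P].sum())
print(E, lambdas[P:].sum())
```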
2.18 Whitening or sphering

If the variance along a principal axis is λ_i, then one obviously has to stretch along this axis by a factor of 1/√λ_i to get variance one. Thus, sphering
is achieved by multiplying the data with a sphering matrix

W := U diag(1/√λ_1, 1/√λ_2, ..., 1/√λ_I) U^T ,   (2.81)
x̂ := W x .   (2.82)
If the final orientation of the data does not matter, the sphering matrix is often defined without the first
U. It is easy to verify that the sphering matrix is symmetric and that the sphered data x̂ have a unit
covariance matrix.
They also have variance one in all directions, since for any projection vector n of norm one the variance
⟨(n^T x̂)²⟩ of the projected data is ⟨(n^T x̂)²⟩ = n^T ⟨x̂ x̂^T⟩ n = n^T n = 1.
Similarly one can show that the sphered data projected onto two orthogonal vectors are uncorrelated.
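A whitening sketch along the lines of (2.81), (2.82) (my own illustration, with arbitrary synthetic data) could look like this:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(2000, 3)) @ np.array([[3.0, 0.0, 0.0],
                                           [1.0, 0.5, 0.0],
                                           [0.0, 0.2, 0.1]])
X -= X.mean(axis=0)                            # remove the mean first

C = X.T @ X / len(X)
lambdas, U = np.linalg.eigh(C)
W = U @ np.diag(1.0 / np.sqrt(lambdas)) @ U.T  # (2.81): sphering matrix (symmetric)
X_hat = X @ W                                  # (2.82): x_hat = W x for every data point

print(np.round(X_hat.T @ X_hat / len(X_hat), 3))   # unit covariance matrix
n = rng.normal(size=3)
n /= np.linalg.norm(n)
print(np.mean((X_hat @ n) ** 2))                   # variance close to one along any unit vector
```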
2.19 Singular value decomposition +

Let X := (x^1, ..., x^M) be the I × M matrix holding one data point per column. The covariance matrix and its eigenvalue equation then read

C_1 := X X^T / M ,   (2.93)
C_1 U_1 = U_1 Λ_1   (2.94)
⟺  C_1 = U_1 Λ_1 U_1^T .   (2.95)
The data represented in the coordinate system of the eigenvectors is

Y_1 := U_1^T X .   (2.96)

Analogously define, for the transposed data matrix,

C_2 := X^T X / I ,   (2.97)
C_2 U_2 = U_2 Λ_2   (2.98)
⟺  C_2 = U_2 Λ_2 U_2^T ,   (2.99)
Y_2 := U_2^T X^T .   (2.100)

The interesting property of matrix Y_2 now is that its rows are eigenvectors of matrix C_1, as can be shown
easily:

C_1 Y_2^T = (X X^T / M)(X U_2)   (2.101)   [by (2.93), (2.100)]
          = X (X^T X / I) U_2 I/M   (2.102)
          = X C_2 U_2 I/M   (2.103)   [by (2.97)]
          = X U_2 Λ_2 I/M   (2.104)   [by (2.98)]
          = Y_2^T Λ_2 I/M .   (2.105)   [by (2.100)]
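For very high-dimensional data with few samples (I much larger than M), this is exactly the trick that makes PCA feasible: diagonalize the small M × M matrix C_2 and map its eigenvectors back with X. A sketch of this (my own, on random data) is shown below; it avoids ever forming the huge I × I matrix C_1.

```python
import numpy as np

rng = np.random.default_rng(6)
I, M = 10000, 50                          # very high-dimensional space, few data points
X = rng.normal(size=(I, M))               # data matrix with one data point per column
X -= X.mean(axis=1, keepdims=True)

C2 = X.T @ X / I                          # (2.97): small M x M matrix
lam2, U2 = np.linalg.eigh(C2)             # (2.98): its eigenvalues and eigenvectors

U1 = X @ U2                               # (2.101)-(2.105): eigenvectors of C_1 (unnormalized)
U1 /= np.linalg.norm(U1, axis=0)          # normalize each column to unit length

v = U1[:, -1]                             # eigenvector belonging to the largest eigenvalue
C1_v = X @ (X.T @ v) / M                  # C_1 v without forming the I x I matrix C_1
print(np.allclose(C1_v, (lam2[-1] * I / M) * v))   # eigenvalue of C_1 is lambda_2 * I / M
```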
3 Application
3.1 Face processing
Eigenfaces - Texture
In this example many faces were warped to a standard geometry and then PCA was applied to calculate
the eigenvectors, which are called eigenfaces. The figure visualizes the first four eigenfaces by varying the
average face (middle) along the direction of the eigenvectors by up to ±3 standard deviations. The first
eigenface (mode) accounts for overall illumination, the other three for some combination of beard, gender,
and facial expression.
Figure: (Lanitis, Taylor, et al., 1995, Image and Vision Computing 13(5):393–401, Fig. 11; URL in note 3.2)
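For completeness, here is a minimal eigenface sketch with scikit-learn (the library linked in Section 4.4); it is my own illustration, not the pipeline of Lanitis et al., and it assumes a hypothetical file faces.npy containing aligned face images of equal size.

```python
import numpy as np
from sklearn.decomposition import PCA

faces = np.load("faces.npy")              # hypothetical array of shape (n_images, height, width)
M, h, w = faces.shape
X = faces.reshape(M, h * w)               # concatenate rows: one vector per face

pca = PCA(n_components=4)                 # keep the four largest-variance directions
Y = pca.fit_transform(X)                  # coordinates of each face in eigenface space

average_face = pca.mean_.reshape(h, w)                # the average face
eigenfaces = pca.components_.reshape(4, h, w)         # the first four eigenfaces

# Vary the average face along the first eigenface by +3 standard deviations.
sigma = np.sqrt(pca.explained_variance_[0])
variant = average_face + 3.0 * sigma * eigenfaces[0]
```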
4 Other resources
Numbers in square brackets indicate sections of these lecture notes to which the corresponding item is related.
4.2 Visualizations
• 2D, 3D, and 17D examples of PCA
http://setosa.io/ev/principal-component-analysis/
4.3 Videos
• Abstract conceptual introduction to PCA from Georgia Tech.
1. Part 1
https://www.youtube.com/watch?v=kw9R0nD69OU (4:22) [1.1, 1.2, 1.3]
2. Part 2
https://www.youtube.com/watch?v=_nZUhV-qhZA (5:27) [1.2, 1.3]
3. Part 3
https://www.youtube.com/watch?v=kuzJJgPBrqc (5:01) [1.2, 1.3]
• Introduction to PCA with several practical examples by Rasmus Bro from the University of Copenhagen
Note: In these videos the principal components are called loadings, and the projected values are called
scores.
1. Conceptual introduction
https://www.youtube.com/watch?v=UUxIXU_Ob6E (12:32)
– 00:00–00:38 Introduction
– 00:38–07:52 Examples of multi-variate data
– 07:52–12:32 Concept of PCA and bi-plot by means of a simple 3D example
2. Example: Analysis of demographic data of many countries in the world with PCA
https://www.youtube.com/watch?v=26YhtSJi1qc (11:36)
– 00:11–02:26 Outlier detection
– 02:26–05:47 Loadings help with interpreting the projected data dimensions
– 05:47–08:11 Exploration after removing outliers
– 08:11–09:40 Using external information as markers
– 09:40–10:36 From plots to numbers
– 10:36–11:36 Summary
3. A bit more in-depth discussion
https://www.youtube.com/watch?v=2s-a62zSWL4 (14:31)
– 00:00–00:50 Introduction and historical remarks
– 00:50–03:18 Basic equation for data generation
– 03:18–06:32 Example: Three-variable data set
∗ 04:30–05:40 Removing the mean before PCA
∗ 05:40–06:11 PCA as variations along loadings
∗ 06:11–06:32 PCA as a rotation and truncation
– 06:32–06:59 Scores are projections of the data onto the loadings
– 06:59–14:26 PCA as finding common profiles and profile variations
∗ 07:12–11:58 Example 1: Continuous spectral data of sugar samples
∗ 12:22–14:26 Example 2: Discrete physical and chemical measurements of the same sugar
samples
· 12:48–14:26 Scaling the variables before PCA (covariance- vs. correlation-matrix)
4. Continuation of the previous video
https://www.youtube.com/watch?v=sRsdF3rcAJc (08:22)
– 00:05–14:26 PCA as finding common profiles and profile variations (cont.)
∗ 00:05–04:02 Example 2: Discrete physical and chemical measurements of the same sugar
samples (cont.)
∗ 01:31–02:37 Relating results from Examples 1 and 2
∗ 02:37–04:02 Bi-plot: Understanding the data by visualizing the loadings
– 04:02–08:18 PCA reviewed
• Introduction to PCA in Geoscience by Matthew E. Clapham from UC Santa Cruz
https://www.youtube.com/watch?v=TSYL-oHx4T0 (15:27)
– 00:00–00:33 Introduction
– 00:33–01:17 Data is often multi-variate / high-dimensional
– 01:17–02:50 Covariation allows one to reduce dimensionality and remove redundancy
– 02:50–03:36 skip (Indirect gradient analysis)
– 03:36–04:44 skip (Types of ordination methods)
– 04:44–06:10 Idea of PCA
– 06:10–06:29 Covariance matrix
– 06:29–09:22 Eigenvectors define a new coordinate system ordered by variance
∗ 07:53–08:43 Eigenvalues measure the amount of variance along the axes of the new coordinate
system
– 09:22–10:42 Loadings
– 10:42–12:17 Covariance- vs. correlation-matrix (scaling the variables before PCA)
– 12:17–13:50 Using the eigenvalue spectrum to decide how many PCs to keep
– 13:50–15:27 When is PCA applicable?
• Three lectures on singular value decomposition and PCA
(See also the closely related tutorial by Jonathon Shlens https://arxiv.org/pdf/1404.1100v1.pdf.)
1. Singular value decomposition
https://www.youtube.com/watch?v=EokL7E6o1AE (44:35)
Explains singular value decomposition in general, which is an interesting piece of linear algebra in its
own right but not really necessary here. It is also not entirely obvious how this relates to Section 2.19.
I list it here because it belongs to the series.
2. PCA in relatively simple mathematical terms
https://www.youtube.com/watch?v=a9jdQGybYmE (51:12)
– 00:00–15:38 Why would you like to do PCA? [1.1]
– 15:38–23:32 Variance and covariance [1.4]
∗ 18:18–18:30 The vectors introduced at 16:52–17:10 are row vectors, so that a b^T is indeed an
inner product. I always use column vectors, and then I write the inner product as a^T b.
∗ 22:26–22:54 The statement made here is not always true. Variables can be uncorrelated but
statistically dependent, a simple example being cos(t) and sin(t) for t ∈ [0, 2π]. He uses the
term statistical independence for what is really only zero covariance or no correlation.
– 23:32–25:06 Relation of variance and covariance to the motivational example
– 25:06–35:49 Covariance matrix (assuming zero mean data) [2.8]
– 35:49–39:44 Diagonalizing the covariance matrix [1.7]
– 39:44–47:20 Diagonalization with eigenvalues and -vectors [2.11]
– 47:20–49:43 Diagonalization with singular value decomposition
– 49:43–51:12 Summary
3. Application of PCA to face recognition
https://www.youtube.com/watch?v=8BTv-KZ2Bh8 (48:02)
4.4 Software
• General list of software for PCA
https://en.wikipedia.org/wiki/Principal_component_analysis#Software.2Fsource_code
• PCA and variants thereof in scikit-learn, a python library for machine learning
http://scikit-learn.org/stable/modules/decomposition.html#principal-component-analysis-pca
• Examples using PCA in scikit-learn
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA
4.5 Exercises
• Analytical exercises by Laurenz Wiskott
https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/PrincipalComponentAnalysis-ExercisesPublic.pdf
https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/PrincipalComponentAnalysis-SolutionsPublic.pdf
• Python exercises by Laurenz Wiskott
https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/PrincipalComponentAnalysis-PythonExercisesPubl
zip
https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/PrincipalComponentAnalysis-PythonSolutionsPubl
zip
References
Hertz, J., Krogh, A., and Palmer, R. G. (1991). Introduction to the Theory of Neural Computation. Addison-
Wesley, Redwood City, CA.
Lanitis, A., Taylor, C. J., and Cootes, T. F. (1995). An automatic face identification system using flexible
appearance models. Image and Vision Computing, 13(5):393–401.
Turk, M. and Pentland, A. P. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–
86.
Notes
3.1 http://www-white.media.mit.edu/ 2002-12-17, http://www-white.media.mit.edu/vismod/demos/facerec/basic.html
3.2 Lanitis, Taylor, et al., 1995, Image and Vision Computing 13(5):393–401, Fig. 11, https://www.researchgate.net/profile/Andreas_Lanitis/publication/223308678_Automatic_face_identification_system_using_flexible_appearance_models/links/00463533817d5b2dd9000000.pdf
3.3 Lanitis, Taylor, et al., 1995, Image and Vision Computing 13(5):393–401, Fig. 6, https://www.researchgate.net/profile/Andreas_Lanitis/publication/223308678_Automatic_face_identification_system_using_flexible_appearance_models/links/00463533817d5b2dd9000000.pdf
Copyright protection level: 2/2