
Principal Component Analysis

— Lecture Notes —

Laurenz Wiskott
Institut für Neuroinformatik
Ruhr-Universität Bochum, Germany, EU

14 December 2016

— Summary —

Principal component analysis (PCA) can be applied to vectorial data and is probably the most common
method to reduce the dimensionality of data for compression and visualization. It determines the dimensions
of largest and smallest variance of the data, referred to as the principal components, which can then be used to
discard the small variance dimensions for dimensionality reduction or select the two or three largest variance
dimensions for visualization. For instance, if you have one thousand 100-dimensional data points, PCA
might be used to reduce the dimensionality of the data down to 10 without losing too much information,
which corresponds to a compression by 90%. In some cases it is also the small variance directions that are of
interest. PCA can also be used to normalize data such that it has unit variance in all directions and no
correlations between components, a process called whitening or sphering.

1 Intuition introduces the problem statement of principal component analysis and the general idea
of the algorithm without much formalism. → Video ’In a nutshell’, no 1, Exercises, Solutions

© 2004–2006, 2009, 2010, 2013, 2016 Laurenz Wiskott (ORCID http://orcid.org/0000-0001-6237-740X, homepage
https://www.ini.rub.de/PEOPLE/wiskott/). This work (except for all figures from other sources, if present) is licensed under
the Creative Commons Attribution-ShareAlike 4.0 International License, see http://creativecommons.org/licenses/by-sa/
4.0/. If figures are not included for copyright reasons, they are uni colored, but the word ’Figure’, ’Image’, or the like in the
reference is often linked to a freely available copy.
Core text and formulas are set in dark red, one can repeat the lecture notes quickly by just reading these;  marks important
formulas or items worth remembering and learning for an exam; ♦ marks less important formulas or items that I would usually
also present in a lecture; + marks sections that I would usually skip in a lecture.
More teaching material is available at https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/.

2 Formalism provides a detailed mathematical treatment of principal component analysis, including
whitening and singular value decomposition, an extension for very high-dimensional spaces.
→ Exercises, Solutions

3 Application presents an application to face processing. A large number of images of faces are
aligned at the eyes and then treated as data vectors by concatenating the rows (or columns). The
mean represents the average face and the principal components the dominant modes of variation.
This can be used for face recognition and analysis.

Contents

Introductory remarks for the lecture

1 Intuition
  1.1 Problem statement
  1.2 Projection and reconstruction error
  1.3 Reconstruction error and variance
  1.4 Covariance matrix
  1.5 Covariance matrix and higher order structure
  1.6 PCA by diagonalizing the covariance matrix
  1.7 Whitening

2 Formalism
  2.1 Definition of the PCA-optimization problem
  2.2 Matrix V^T: Mapping from high-dimensional old coordinate system to low-dimensional new coordinate system
  2.3 Matrix V: Mapping from low-dimensional new coordinate system to subspace in old coordinate system
  2.4 Matrix (V^T V): Identity mapping within new coordinate system
  2.5 Matrix (V V^T): Projection from high- to low-dimensional (sub)space within old coordinate system
  2.6 Variance
  2.7 Reconstruction error
  2.8 Covariance matrix
  2.9 Eigenvalue equation of the covariance matrix
  2.10 Total variance of the data x
  2.11 Diagonalizing the covariance matrix
  2.12 Variance of y for a diagonalized covariance matrix
  2.13 Constraints of matrix V′
  2.14 Finding the optimal subspace
  2.15 Interpretation of the result
  2.16 PCA Algorithm
  2.17 Intuition of the Results
  2.18 Whitening or sphering
  2.19 Singular value decomposition +

3 Application
  3.1 Face processing

4 Other resources
  4.1 Written material
  4.2 Visualizations
  4.3 Videos
  4.4 Software
  4.5 Exercises

Introductory remarks for the lecture

1 Intuition
This section is partly based on (Hertz et al., 1991).

1.1 Problem statement


Experimental data to be analyzed is often represented as a number of vectors of fixed dimensionality. A
single vector could for example be a set of temperature measurements across Germany. Taking such a vector
of measurements at different times results in a number of vectors that altogether constitute the data. Each
vector can also be interpreted as a point in a high dimensional space. Then the data are simply a cloud of
points in this space (if one ignores the temporal order, otherwise it would be a trajectory).
When analyzing such data one often encounters the problem that the dimensionality of the data points is
too high to be visualized or analyzed with some particular technique. Thus the problem arises to reduce the
dimensionality of the data in some optimal way.
To keep things simple we insist that the dimensionality reduction is done linearly, i.e. we are looking for a
low-dimensional linear subspace of the data space, onto which the data can be projected. As a criterion for
what the optimal subspace might be it seems reasonable to require that it should be possible to reconstruct
the original data points from the reduced ones as well as possible. Thus if one were to project the data
back from the low-dimensional space into the original high-dimensional space, the reconstructed data points
should lie as close as possible to the original ones, with the mean squared distance between original and
reconstructed data points being the reconstruction error. The question is how we can find the linear subspace
that minimizes this reconstruction error.
It is useful and common practice to remove the mean value from the data first before doing the dimensionality
reduction as stated above. Thus, we assume zero mean data throughout. As a result variances and 2nd
moments are the same. This justifies the slightly confusing common practice to speak of variances but write
the equations for 2nd moments. Please keep that in mind.
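
A minimal sketch of this centering step and of why variances and 2nd moments then coincide (assuming NumPy; all names are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(loc=5.0, scale=2.0, size=(1000, 10))   # M = 1000 data points, I = 10 dimensions

    X = X - X.mean(axis=0)                  # remove the mean from each component

    # for zero-mean data the 2nd moment <x_i^2> equals the variance of x_i
    second_moment = (X ** 2).mean(axis=0)
    variance = X.var(axis=0)
    assert np.allclose(second_moment, variance)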

1.2 Projection and reconstruction error
The task of principal component analysis (PCA) is to reduce the dimensionality of some high-dimensional
data points by linearly projecting them onto a lower-dimensional space in such a way that the reconstruction
error made by this projection is minimal. In order to develop an intuition for PCA we first take a closer
look at what it means to project the data points and to reconstruct them. Figure 1.1 illustrates the process.

Figure 1.1: Projection of 2D data points onto a 1D subspace and their reconstruction. Panels: (a) data points in 2D, (b) projection onto 1D, (c) data points in 1D, (d) reconstruction in 2D.
© CC BY-SA 4.0

(a) A few data points are given in a two-dimensional space and are represented by two-dimensional vectors
x = (x_1, x_2). (b) In order to reduce the dimensionality down to one, we have to choose a one-dimensional
subspace defined by a unit vector v and project the data points onto it, which can be done by

    x_∥ := v v^T x .                                                  (1.1)

(c) The points can now be represented by just one number,

    y := v^T x ,                                                      (1.2)

and we do not care that they originally came from a two-dimensional space. (d) If we want to reconstruct
the original two-dimensional positions of the data points as well as possible, we have to embed the one-
dimensional space in the original two-dimensional space in exactly the orientation used during the projection,

    x_∥ = v y .                                    [by (1.1, 1.2)]    (1.3)

However, we cannot recover the accurate 2D-position; the points remain on the one-dimensional subspace.
The reconstruction error is therefore the average distance of the original 2D-positions from the
one-dimensional subspace (the length of the projection arrows in (b)). For mathematical convenience one
actually takes the average squared distance

    E := ⟨ ‖x^µ − x^µ_∥‖² ⟩_µ                                         (1.4)
       = (1/M) Σ_{µ=1}^{M} Σ_{i=1}^{I} (x^µ_i − x^µ_{∥,i})² ,         (1.5)

where µ indicates the different data points, M the number of data points, and I the dimensionality of the
data vectors.
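
As a quick numerical illustration of the projection (1.1)-(1.3) and the reconstruction error (1.4, 1.5), here is a minimal sketch (assuming NumPy; the data and the direction v are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 1.5], [1.5, 1.0]], size=500)  # M = 500, I = 2

    v = np.array([1.0, 1.0])
    v = v / np.linalg.norm(v)              # unit vector defining the 1D subspace

    Y = X @ v                              # y = v^T x, one number per data point      (1.2)
    X_par = np.outer(Y, v)                 # x_par = v y, reconstruction in 2D         (1.3)

    E = np.mean(np.sum((X - X_par) ** 2, axis=1))   # reconstruction error             (1.4, 1.5)
    print(E)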

1.3 Reconstruction error and variance


The question now is how we can find the direction of the one-dimensional subspace that minimizes the
reconstruction error. For that it is interesting to inspect more closely what happens as we rotate the

Figure 1.2: Variance of the projected data and reconstruction error as the linear subspace is rotated.
© CC BY-SA 4.0

subspace. Figure 1.2 illustrates the projection onto two different subspaces. Focus just on the one point x
and its projection x_∥. d is the distance of x from the origin, r is the distance of x from x_∥ in the subspace,
and v is the distance of x_∥ from the origin. r and v depend on the direction of the subspace while d does
not. Interestingly, since the triangles between x, x_∥, and the origin are right-angled, r and v are related by
Pythagoras' theorem, i.e. r² + v² = d². We know that r² contributes to the reconstruction error. v² on the
other hand contributes to the variance of the projected data within the subspace. Thus we see that the sum
over the reconstruction error plus the variance of the projected data is constant and equals the variance of
the original data. Therefore, minimizing the reconstruction error is equivalent to maximizing the variance
of the projected data.
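
This decomposition can be checked numerically for any direction; a minimal sketch (assuming NumPy; data and direction are illustrative):

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.multivariate_normal([0, 0], [[3.0, 1.5], [1.5, 1.0]], size=500)
    v = np.array([0.6, 0.8])                          # arbitrary unit vector

    X_par = np.outer(X @ v, v)                        # projections onto the 1D subspace
    recon_error = np.mean(np.sum((X - X_par) ** 2, axis=1))   # mean r^2
    proj_var = np.mean((X @ v) ** 2)                  # mean v^2, variance of the projected data
    total_var = np.mean(np.sum(X ** 2, axis=1))       # mean d^2, variance of the original data

    # reconstruction error + projected variance = total variance, whatever v is
    assert np.isclose(recon_error + proj_var, total_var)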

1.4 Covariance matrix


How can we determine the direction of maximal variance? The first thing we can do is to determine the variances
of the individual components. If the data points (or vectors) are written as x = (x_1, x_2)^T (T indicates
transpose), then the variances of the first and second component can be written as C_11 := ⟨x_1 x_1⟩ and
C_22 := ⟨x_2 x_2⟩, where angle brackets indicate averaging over all data points. Please remember that these are
strictly speaking 2nd moments and not variances, but since we assume zero mean data that does not make a
difference. If C_11 is large compared to C_22, then the direction of maximal variance is close to (1, 0)^T, while
if C_11 is small, the direction of maximal variance is close to (0, 1)^T. (Notice that variance doesn't have a
polarity, so that one could use the inverse vector (−1, 0)^T instead of (1, 0)^T equally well for indicating the
direction of maximal variance.)
But what if C_11 is of similar value as C_22, like in the example of Figure 1.1? Then the covariance between
the two components, C_12 := ⟨x_1 x_2⟩, can give us additional information (notice that C_21 := ⟨x_2 x_1⟩ is equal
to C_12). A large positive value of C_12 (large relative to √(C_11 C_22)) indicates a strong correlation between x_1
and x_2 and that the data cloud is extended along the (1, 1)^T direction. A negative value would indicate anti-
correlation and an extension along the (−1, 1)^T direction. A small value of C_12 (again relative to √(C_11 C_22))
would indicate no correlation and thus little structure of the data, i.e. no prominent direction of maximal
variance. The variances and covariances are conveniently arranged in a matrix with components

     C_ij := ⟨x_i x_j⟩ ,                                              (1.6)

which is called covariance matrix (remember, assuming zero mean data). It can easily be shown
that the components obey the relation

     C_ij² ≤ C_ii C_jj .                                              (1.7)
It is also easy to see that scaling the data by a factor α scales the covariance matrix by a factor α2 . Figure 1.3
shows several data clouds and the corresponding covariance matrices.

Figure 1.3: Several data distributions and their covariance matrices: (a) C = (0.2, 0; 0, 1), (b) C = (1, −0.5; −0.5, 0.3), (c) C = (1, 0; 0, 1).
© CC BY-SA 4.0
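
A minimal sketch of estimating such a covariance matrix from zero-mean data (assuming NumPy; the chosen matrix follows Figure 1.3(b)):

    import numpy as np

    rng = np.random.default_rng(3)
    C_true = np.array([[1.0, -0.5],
                       [-0.5, 0.3]])
    X = rng.multivariate_normal(mean=[0.0, 0.0], cov=C_true, size=10000)   # shape (M, I)

    X = X - X.mean(axis=0)          # remove the mean, so 2nd moments are variances
    C = (X.T @ X) / len(X)          # C_ij = <x_i x_j>, cf. (1.6)

    print(np.round(C, 2))           # close to C_true for large M
    # scaling the data by a factor alpha scales the covariance matrix by alpha**2
    assert np.allclose(((2 * X).T @ (2 * X)) / len(X), 4 * C)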

1.5 Covariance matrix and higher order structure


Notice that the covariance matrix only gives you information about the general extent of the
data (the second order moments). It does not give you any information about the higher-
order structure of the data cloud. Figure 1.4 shows different data distributions that all have the same
covariance matrix. Thus as long as we consider only the covariance matrix, i.e. second order moments, we
can always assume a Gaussian data distribution with an ellipsoid shape, because the covariance matrix does
not represent any more structure in any case.

Figure 1.4: Different data distributions with identical covariance matrices, here the identity matrix C = (1, 0; 0, 1) in all three panels. What is the scale of the axes in order to get the identity matrix?
© CC BY-SA 4.0

1.6 PCA by diagonalizing the covariance matrix


Now that we have learned that the covariance matrix in principle contains the information about the direction
of maximal variance the question arises how we can get at this information. From Figure 1.3 (a) and (b)
we can see that there are two fundamentally different situations: in (a) the data cloud is aligned with the
axes of the coordinate system and the covariance matrix is diagonal; in (b) the data cloud is oblique to the
axes and the matrix is not diagonal. In the former case the direction of maximal variance is simply the
axis belonging to the largest value on the diagonal of the covariance matrix. In the latter case, we cannot
directly say what the direction of maximal variance might be. Thus, since the case of a diagonal covariance
matrix is so much simpler, the strategy we are going to take is to make a non-diagonal covariance matrix
diagonal by rotating the coordinate system accordingly. This is illustrated in Figure 1.5. From linear algebra
we know that diagonalizing a matrix can be done by solving the corresponding eigenvalue equation. It will

turn out that the eigenvectors of the covariance matrix point into the directions of maximal (and minimal)
variance and that the eigenvalues are equal to the variances along these directions. Projecting the data onto
the eigenvectors with largest eigenvalues is therefore the optimal linear dimensionality reduction.

Figure 1.5: Diagonalizing the covariance matrix by rotating the coordinate system: the covariance matrix (1, −0.5; −0.5, 0.3) in the original coordinates (x_1, x_2) becomes a diagonal matrix (1.1, 0; 0, 0.1) in the rotated coordinates (x′_1, x′_2).
© CC BY-SA 4.0
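
A minimal sketch of this diagonalization via an eigenvalue decomposition (assuming NumPy; the numbers follow Figure 1.5):

    import numpy as np

    C = np.array([[1.0, -0.5],
                  [-0.5, 0.3]])            # non-diagonal covariance matrix, cf. Figure 1.5

    eigvals, eigvecs = np.linalg.eigh(C)   # eigh: symmetric matrix, real eigenvalues
    order = np.argsort(eigvals)[::-1]      # sort by decreasing variance
    eigvals, U = eigvals[order], eigvecs[:, order]

    C_rotated = U.T @ C @ U                # rotating into the eigenvector coordinate system
    print(np.round(C_rotated, 3))          # diagonal, eigenvalues on the diagonal
    print(np.round(eigvals, 3))            # variances along the principal directions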

1.7 Whitening
Whitening data means to first remove the mean and then stretch the data along the principal axes such
that it has unit variance in all directions, i.e. the stretching must be done by one over the square root of
the variance. Principal axes as well as the corresponding variances can be obtained by PCA. If the original
data is an anisotropic Gaussian to begin with, e.g. having the shape of a flying sausage or a cigar, it has a
spherical shape after whitening, which is the reason to also call it sphering.
Whitened data has the advantage that you can project it onto any unit vector and it has unit variance. More
formally, if n is an arbitrary unit vector and x^µ are the whitened data vectors, the projected data x_∥ := n^T x
has zero mean ⟨x_∥⟩ = 0, since ⟨x^µ⟩ = 0, and unit variance ⟨x_∥²⟩ = 1. Projected onto two orthogonal unit
vectors, the two projected data sets are uncorrelated. More formally, if n_1 and n_2 are two arbitrary but
orthogonal unit vectors, i.e. n_i^T n_j = δ_ij (Kronecker symbol), the two projected data sets x_∥1 := n_1^T x and
x_∥2 := n_2^T x are uncorrelated, i.e. ⟨x_∥1 x_∥2⟩ = 0.
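
A minimal numerical check of these two properties on whitened data (assuming NumPy; the whitening matrix itself is derived in Section 2.18):

    import numpy as np

    rng = np.random.default_rng(4)
    X = rng.multivariate_normal([0, 0], [[3.0, 1.5], [1.5, 1.0]], size=20000)
    X = X - X.mean(axis=0)

    # whiten: rotate to the principal axes, rescale by 1/sqrt(variance), rotate back
    lam, U = np.linalg.eigh((X.T @ X) / len(X))
    X_white = X @ (U @ np.diag(1.0 / np.sqrt(lam)) @ U.T)

    n1 = np.array([0.6, 0.8])                          # arbitrary unit vector
    n2 = np.array([-0.8, 0.6])                         # orthogonal to n1
    print(np.var(X_white @ n1))                        # ~1: unit variance in any direction
    print(np.mean((X_white @ n1) * (X_white @ n2)))    # ~0: uncorrelated projections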

2 Formalism
This section is partly based on (Hertz et al., 1991).

2.1 Definition of the PCA-optimization problem


The problem of principal component analysis (PCA) can be formally stated as follows.


Figure 1.6: Whitening of some data. Left: Original data. The ellipse represents the second moment matrix
of the data distribution. Right: The mean is removed and the data stretched such that it has variance one
in all directions, indicated by a circle representing the identity second moment matrix. The graphs to the
top and right represent the marginals of the first and second component, respectively.
© CC BY-SA 4.0

Principal Component Analysis (PCA): Given a set {x^µ : µ = 1, ..., M} of I-
dimensional data points x^µ = (x^µ_1, x^µ_2, ..., x^µ_I)^T with zero mean, ⟨x^µ⟩_µ = 0_I, find an
orthogonal matrix U with determinant |U| = +1 generating the transformed data
points x′^µ := U^T x^µ such that for any given dimensionality P the data projected
onto the first P axes, x′^µ_∥ := (x′^µ_1, x′^µ_2, ..., x′^µ_P, 0, ..., 0)^T, have the smallest

    reconstruction error E := ⟨ ‖x′^µ − x′^µ_∥‖² ⟩_µ                  (2.1)

among all possible projections onto a P-dimensional subspace. The row vectors of
matrix U define the new axes and are called the principal components.

Some remarks: (i) ⟨x^µ⟩_µ indicates the mean over all M data points indexed with µ. To simplify the notation
we will from now on drop the index µ and indicate averages over the data points by ⟨·⟩. (ii) If one has
non-zero-mean data, one typically removes the mean before applying PCA. Even though all the math is
valid also for non-zero-mean data, the results would typically be undesired and nonintuitive. (iii) Since
matrix U is orthogonal and has determinant value +1, it corresponds simply to a rotation of the data x.
Thus, the 'shape' of the data cloud remains the same, just the 'perspective' changes. |U| = −1 would imply
a mirror reflection of the data distribution and is often permitted, too. Notice also that one can interpret
the multiplication with matrix U^T either as a rotation of the data or as a rotation of the coordinate system.
Either interpretation is valid. (iv) Projecting the data x′ onto the P-dimensional linear subspace spanned by
the first P axes is simply done by setting all components higher than P to zero. This can be done because
we still have an orthonormal coordinate system. If U and therefore the new coordinate system were not
orthogonal, then the projection would become a mathematically more complex operation. (v) The reconstruction
error has to be minimal for any P. This has the advantage that we do not need to decide on P before
performing PCA. Often P is actually chosen based on information obtained during PCA and governed by
a constraint, such as that the reconstruction error should be below a certain threshold.

2.2 Matrix V^T: Mapping from high-dimensional old coordinate system to low-dimensional new coordinate system
Assume some data points x are given in an I-dimensional space and a linear subspace is
spanned by P orthonormal vectors

♦   v_p := (v_1p, v_2p, ..., v_Ip)^T                                  (2.2)

♦   with  v_p^T v_q = δ_pq := { 1 if p = q
                              { 0 otherwise .                         (2.3)

We will typically assume P < I and speak of a high(I)-dimensional space and a low(P)-dimensional
(sub)space. However, P = I may be possible as a limiting case as well.
Arranging these vectors in a matrix yields

    V := (v_1, v_2, ..., v_P)                                        (2.4)

       ( v_11  v_12  ...  v_1P )
     = ( v_21  v_22  ...  v_2P )                   [by (2.2)]         (2.5)
       (  ⋮     ⋮           ⋮  )
       ( v_I1  v_I2  ...  v_IP )

This matrix can be used to map the data points x into the subspace spanned by the vectors v_p,
yielding

♦   y := V^T x ,                                                      (2.6)

see also Figure 2.1.

Figure 2.1: The effect of matrices V^T and V and combinations thereof for an example of a mapping from 2D to 1D.
© CC BY-SA 4.0

If P < I then the dimensionality is reduced and some information is lost; if P = I all information is
preserved. In any case the mapped data are now represented in a new coordinate system, the axes of which
are given by the vectors v_p. With P = 2 and I = 3, for example, we have

    y = ( v_1^T x ) = ( v_1^T ) x = V^T x
        ( v_2^T x )   ( v_2^T )

    or  y = ( y_1 ) = ( v_11  v_21  v_31 ) ( x_1 )
            ( y_2 )   ( v_12  v_22  v_32 ) ( x_2 )  = V^T x .
                                           ( x_3 )

Notice that y is P-dimensional while x is I-dimensional.

It is important to realize that we have done two things here: firstly, we have moved the points from
the high-dimensional space onto the low-dimensional subspace (the points that were already in the
subspace have not been moved, of course) and secondly, we have represented the moved points in
a new coordinate system that is particularly suitable for the low-dimensional subspace. Thus, we went
from the high-dimensional space and the old coordinate system to the low-dimensional subspace and a new
coordinate system. Notice also that points in the high-dimensional space can generally not be represented
accurately in the new coordinate system, because it does not have enough dimensions.
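
A minimal sketch of this mapping for the P = 2, I = 3 example above (assuming NumPy; the particular orthonormal vectors are an illustrative choice):

    import numpy as np

    v1 = np.array([1.0, 1.0, 0.0]) / np.sqrt(2)   # two orthonormal vectors spanning
    v2 = np.array([0.0, 0.0, 1.0])                # a 2D subspace of a 3D space
    V = np.column_stack([v1, v2])                 # I x P matrix, here 3 x 2

    x = np.array([2.0, 0.0, 3.0])   # a data point in the old 3D coordinate system
    y = V.T @ x                     # its representation in the new 2D coordinate system (2.6)
    x_par = V @ y                   # back in the old coordinate system, cf. (2.7)

    print(y)        # [1.414..., 3.0]
    print(x_par)    # [1.0, 1.0, 3.0]: the projection of x onto the subspace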

2.3 Matrix V: Mapping from low-dimensional new coordinate system to subspace in old coordinate system
Interestingly, since the vectors v_p are orthonormal, matrix V can also be used to transform the
points back from the new to the old coordinate system, although the lost dimensions cannot
be recovered, of course. Thus the mapped points y in the new coordinate system become points x_∥ in the
old coordinate system and are given by

♦   x_∥ := V y                                                        (2.7)
♦       = V V^T x .                                [by (2.6)]         (2.8)

y and x_∥ are equivalent representations, i.e. they contain the same information, just in different
coordinate systems.

2.4 Matrix (V^T V): Identity mapping within new coordinate system
Before we look at the combined matrix V V^T, consider V^T V. The latter is obviously a P × P-matrix
and performs a transformation from the new (low-dimensional) coordinate system to the old
(high-dimensional) coordinate system (2.7) and back again (2.6). The back-transformation implies
a dimensionality reduction, but since all points in the old coordinate system come from the new coordinate
system and therefore lie within the low-dimensional subspace already, the mapping onto the low-
dimensional space does not discard any information. Thus, only the back and forth (or rather forth and
back) transformation between the two coordinate systems remains, and that in combination is without
any effect either. This means that V^T V is the identity matrix, which can be easily verified:

    (V^T V)_pq = v_p^T v_q = δ_pq                  [by (2.3)]         (2.9)
    ⇐⇒  V^T V = 1_P                                                   (2.10)

with 1_P indicating the identity matrix of dimensionality P. With P = 2, for example, we have

    V^T V = ( v_1^T ) (v_1  v_2) = ( v_1^T v_1   v_1^T v_2 ) = ( 1  0 )     [by (2.3)]
            ( v_2^T )              ( v_2^T v_1   v_2^T v_2 )   ( 0  1 )

2.5 Matrix (V V^T): Projection from high- to low-dimensional (sub)space within old coordinate system
As we have seen above (2.8), the combined matrix V V^T maps the points x onto the low-dimensional
subspace, but in contrast to matrix V^T alone the mapped points are represented within the old coordinate
system and not the new one. It turns out that this is a projection operation with the
characterizing property that it does not make a difference whether you apply it once or twice, i.e.
P P = P. Let us therefore define the projection matrix

    P := V V^T                                                       (2.11)

and verify that

♦   P P = V (V^T V) V^T                            [by (2.11)]        (2.12)
♦       = V V^T                                    [by (2.10), V^T V = 1_P]  (2.13)
♦       = P .                                      [by (2.11)]        (2.14)

A closer look at P shows that

♦   P := V V^T                                                        (2.15)

♦     = (v_1, ..., v_P) ( v_1^T )                                     (2.16)
                        (   ⋮   )
                        ( v_P^T )

♦     = ( v_11  v_12  ⋯ )  ( v_11  v_21  v_31  ⋯ )   [by (2.2)]       (2.17)
        ( v_21   ⋱      )  ( v_12   ⋱            )
        ( v_31      ⋱   )  (  ⋮        ⋱  v_IP   )
        (  ⋮      v_IP  )

♦     = ( Σ_p v_1p v_1p   Σ_p v_1p v_2p   ⋯   )                       (2.18)
        ( Σ_p v_2p v_1p        ⋱              )
        (      ⋮               Σ_p v_Ip v_Ip  )

      = Σ_{p=1}^{P} ( v_1p v_1p   v_1p v_2p   ⋯ )                     (2.19)
                    ( v_2p v_1p       ⋱         )
                    (     ⋮           v_Ip v_Ip )

♦     = Σ_{p=1}^{P} v_p v_p^T .                                       (2.20)

P is obviously an I × I-matrix. If P = I then projecting from the old to the new and back to the old
coordinate system causes no information loss and P = 1_I. The smaller P, the more information is lost
and the more does P differ from the identity matrix. Consider, for example,

    v_1 := (1/2) (√2, −1, 1)^T   ⇒  v_1 v_1^T = (1/4) (  2   −√2   √2 )
                                                      ( −√2    1   −1 )
                                                      (  √2   −1    1 ) ,

    v_2 := (1/2) (0, √2, √2)^T   ⇒  v_2 v_2^T = (1/4) (  0    0    0 )
                                                      (  0    2    2 )
                                                      (  0    2    2 ) ,  and

    v_3 := (1/2) (−√2, −1, 1)^T  ⇒  v_3 v_3^T = (1/4) (  2    √2  −√2 )
                                                      (  √2    1   −1 )
                                                      ( −√2   −1    1 )

for which you can easily verify that P (2.20) successively becomes the identity matrix as you take more of
the v_p v_p^T-terms.
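
A minimal numerical check of (2.12)-(2.14) and of the example above (assuming NumPy):

    import numpy as np

    s = np.sqrt(2)
    v1 = 0.5 * np.array([s, -1.0, 1.0])
    v2 = 0.5 * np.array([0.0, s, s])
    v3 = 0.5 * np.array([-s, -1.0, 1.0])

    for k in range(1, 4):
        V = np.column_stack([v1, v2, v3][:k])    # keep the first k of the vectors
        P = V @ V.T                              # projection matrix (2.11)
        assert np.allclose(P @ P, P)             # idempotent: P P = P (2.12-2.14)
        print(f"k={k}:\n{np.round(P, 3)}")       # approaches the 3x3 identity as k grows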

2.6 Variance
The variance of a multi-dimensional data set is defined as the sum over the variances of its
components. Since we assume zero-mean data, we have

    var(x) := Σ_{i=1}^{I} ⟨x_i²⟩                                     (2.21)
♦           = ⟨ Σ_{i=1}^{I} x_i² ⟩                                    (2.22)
            = ⟨x^T x⟩ .                                              (2.23)

This also holds for the projected data, of course, var(y) = ⟨y^T y⟩.

2.7 Reconstruction error


The reconstruction error E is defined as the mean square sum over the distances between the original data
points x and the projected ones x_∥. If we define the orthogonal vectors (D: Lotvektoren)

♦   x_⊥ = x − x_∥                                                     (2.24)

(in contrast to the projected vectors x_∥) we can write the reconstruction error as the variance of the
orthogonal vectors and find

♦   E = ⟨x_⊥^T x_⊥⟩                                [by (2.1, 2.24)]   (2.25)
      = ⟨(x − x_∥)^T (x − x_∥)⟩                    [by (2.24)]        (2.26)
♦     = ⟨(x − V V^T x)^T (x − V V^T x)⟩            [by (2.8)]         (2.27)
♦     = ⟨x^T x − 2 x^T V V^T x + x^T V (V^T V) V^T x⟩                 (2.28)
♦     = ⟨x^T x⟩ − ⟨x^T V (V^T V) V^T x⟩            [by (2.10), V^T V = 1_P]  (2.29)
♦     = ⟨x^T x⟩ − ⟨x_∥^T x_∥⟩                      [by (2.8)]         (2.30)
      = ⟨x^T x⟩ − ⟨y^T y⟩ .                        [by (2.29, 2.6)]   (2.31)

This means that the reconstruction error equals the variance of the data minus the variance of the
projected data. Thus, this verifies our intuition that minimizing the reconstruction error is equivalent to
maximizing the variance of the projected data.

2.8 Covariance matrix


We have already argued heuristically that the covariance matrix C_x with C_x,ij := ⟨x_i x_j⟩ plays an important
role in performing PCA. It is convenient to write the covariance matrix in vector notation:

    C_x := ⟨x x^T⟩ = (1/M) Σ_µ x^µ x^µT .                             (2.32)

It is an easy exercise to show that this definition is equivalent to the componentwise one given above. Since
(x x^T)^T = x x^T (remember (AB)^T = B^T A^T), one can also see that C_x is symmetric, i.e. C_x^T = C_x.
Keep in mind that C_x is strictly speaking a 2nd moment matrix and only identical to the covariance matrix
since we assume zero mean data.

2.9 Eigenvalue equation of the covariance matrix


Since the covariance matrix is symmetric, its eigenvalues are real and a set of orthogonal eigenvectors always
exists. In mathematical terms, for a given covariance matrix C_x we can always find a complete set
of real eigenvalues λ_i and corresponding eigenvectors u_i such that

♦   C_x u_i = u_i λ_i      (eigenvalue equation) ,                    (2.33)
    λ_i ≥ λ_{i+1}          (eigenvalues are ordered) ,                (2.34)
♦   u_i^T u_j = δ_ij       (eigenvectors are orthonormal) .           (2.35)

If we combine the eigenvectors into an orthogonal matrix U and the eigenvalues into a diagonal matrix Λ,

    U := (u_1, u_2, ..., u_I) ,                                      (2.36)
    Λ := diag(λ_1, λ_2, ..., λ_I) ,                                  (2.37)

then we can rewrite (2.35) and (2.33) as

    U^T U = 1_I       (matrix U is orthogonal)     [by (2.35, 2.36)]  (2.38)
♦   ⇐⇒ U U^T = 1_I    (since U^{−1} = U^T and U is quadratic)         (2.39)
    C_x U = U Λ       (eigenvalue equation)        [by (2.33, 2.36, 2.37)]  (2.40)
♦   ⇐⇒ U^T C_x U = Λ                               [by (2.38)]        (2.41)
♦   ⇐⇒ C_x = U Λ U^T .                             [by (2.38, 2.39)]  (2.42)
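
A minimal sketch of solving this eigenvalue problem numerically (assuming NumPy; numpy.linalg.eigh is a standard routine for symmetric matrices):

    import numpy as np

    rng = np.random.default_rng(5)
    X = rng.normal(size=(1000, 5))
    X = X - X.mean(axis=0)
    C = (X.T @ X) / len(X)                 # covariance matrix (2.32)

    lam, U = np.linalg.eigh(C)             # real eigenvalues, orthonormal eigenvectors (2.33, 2.35)
    order = np.argsort(lam)[::-1]          # enforce the ordering convention (2.34)
    lam, U = lam[order], U[:, order]

    assert np.allclose(U.T @ U, np.eye(5))          # (2.38)
    assert np.allclose(U.T @ C @ U, np.diag(lam))   # (2.41)
    assert np.allclose(C, U @ np.diag(lam) @ U.T)   # (2.42)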

2.10 Total variance of the data x


Given the eigenvector matrix U and the eigenvalue matrix Λ it is easy to compute the total variance of the
data:

♦   ⟨x^T x⟩ = ⟨tr(x^T x)⟩    (since s = tr(s) for any scalar s)        (2.43)
♦           = ⟨tr(x x^T)⟩    (since tr(AB) = tr(BA) for any matrices A, B)  (2.44)
♦           = tr(⟨x x^T⟩)    (since tr(·) and ⟨·⟩ commute)             (2.45)
♦           = tr(C_x)                              [by (2.32)]        (2.46)
♦           = tr(U U^T C_x)                        [by (2.39)]        (2.47)
♦           = tr(U^T C_x U)                                           (2.48)
♦           = tr(Λ)                                [by (2.41)]        (2.49)
            = Σ_i λ_i .                            [by (2.37)]        (2.50)

Thus the total variance of the data is simply the sum of the eigenvalues of its covariance matrix.
Notice that along the way of this proof we have shown some very general properties. From line (2.43) to (2.46)
we have shown that the total variance of some multi-dimensional data equals the trace of its covariance
matrix. From line (2.46) to (2.48) we have shown that the trace remains invariant under any orthogonal
transformation of the coordinate system. This implies that the total variance of some multi-dimensional
data is invariant under any orthogonal transformation such as a rotation. This is intuitively clear.
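
A quick numerical confirmation of (2.50) and of the invariance under rotations (assuming NumPy; the data are made up):

    import numpy as np

    rng = np.random.default_rng(6)
    X = rng.normal(size=(2000, 4)) @ np.diag([3.0, 2.0, 1.0, 0.5])
    X = X - X.mean(axis=0)
    C = (X.T @ X) / len(X)

    lam = np.linalg.eigvalsh(C)
    total_variance = np.mean(np.sum(X ** 2, axis=1))        # <x^T x>

    assert np.isclose(total_variance, np.trace(C))          # (2.43)-(2.46)
    assert np.isclose(total_variance, lam.sum())            # (2.50)

    # the trace, and hence the total variance, is invariant under any rotation
    Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))            # a random orthogonal matrix
    assert np.isclose(np.trace(Q.T @ C @ Q), np.trace(C))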

2.11 Diagonalizing the covariance matrix
We can now use matrix U to transform the data such that the covariance matrix becomes
diagonal. Define x′ := U^T x and denote the new covariance matrix by C_x′. We have

    x′ := U^T x                                                      (2.51)
♦   C_x′ := ⟨x′ x′^T⟩                                                 (2.52)
♦         = ⟨(U^T x)(U^T x)^T⟩                     [by (2.51)]        (2.53)
♦         = U^T ⟨x x^T⟩ U                                             (2.54)
♦         = U^T C_x U                              [by (2.32)]        (2.55)
          = Λ                                      [by (2.41)]        (2.56)

and find that the transformed data x′ have a diagonal covariance matrix. Working with x′ instead of x will
simplify further analysis without loss of generality.

2.12 Variance of y for a diagonalized covariance matrix


Now that we have the data represented in a coordinate system in which the covariance matrix is diagonal, we
can try to answer the question which P-dimensional subspace minimizes the reconstruction error.
Our intuition would predict that it is simply the space spanned by the first P eigenvectors. To show this
analytically, we take an arbitrary set of P orthonormal vectors v′_p, and with V′ := (v′_1, v′_2, ..., v′_P)
we compute the variance of y:

♦   y := V′^T x′                                                      (2.57)
♦   ⇒  ⟨y^T y⟩ = ⟨x′^T V′ V′^T x′⟩                 [by (2.57)]        (2.58)
               = ⟨tr(x′^T V′ V′^T x′)⟩   (since s = tr(s) for any scalar s)     (2.59)
               = ⟨tr(V′^T x′ x′^T V′)⟩   (since tr(ABC) = tr(BCA) if defined)   (2.60)
♦              = tr(V′^T C_x′ V′)        (since tr(·) and ⟨·⟩ commute)  [by (2.52)]  (2.61)
♦              = tr(V′^T Λ V′)                     [by (2.56)]        (2.62)
♦              = Σ_i λ_i Σ_p (v′_ip)² .  (as one can work out on a sheet of paper)   (2.63)

V′ is a rectangular I × P-matrix and projects the data into a lower-dimensional space. Since its columns are
orthonormal to each other, it only performs a rotation and elimination of some dimensions (not necessarily
along the axes) but no scaling of any sort or distortion beyond the elimination of dimensions.

2.13 Constraints of matrix V′
Notice that, since the vectors v′_p are orthonormal, V′ can always be completed to an orthogonal
I × I-matrix by adding I − P additional orthonormal vectors. Since we know that an orthogonal matrix has
normalized row as well as column vectors, we see that, by taking away the I − P additional column vectors,
we are left with the constraints

♦   Σ_i (v′_ip)² = 1       (column vectors of V′ have norm one) ,     (2.64)
♦   ⇒  Σ_ip (v′_ip)² = P   (square sum over all matrix elements equals P) ,  (2.65)
♦   Σ_p (v′_ip)² ≤ 1       (row vectors of V′ have norm less or equal one) .  (2.66)

Notice that Constraint (2.65) is a direct consequence of Constraint (2.64) and does not need to be verified
separately in the following considerations.

2.14 Finding the optimal subspace


Since the variance (2.63) of y as well as the constraints (2.64, 2.65, 2.66) of matrix V′ are
linear in (v′_ip)², maximization of the variance ⟨y^T y⟩ is obviously achieved by putting as much
'weight' as possible on the large eigenvalues, which are the first ones. The simplest way of doing that
is to set

♦   v′_ip := δ_ip := { 1 if i = p
                     { 0 otherwise ,                                  (2.67)

with the Kronecker symbol δ_ip.
Since I ≥ P we can verify the constraints

    Σ_p (v′_ip)² = Σ_p δ_ip² = (1 if i ≤ P, 0 otherwise) ≤ 1 ,   [by (2.67)]   (2.68)
    Σ_i (v′_ip)² = Σ_i δ_ip² = δ_pp² = 1 ,                       [by (2.67)]   (2.69)

and see from (2.68) that there is actually as much 'weight' on the first, i.e. large, eigenvalues as Con-
straint (2.66) permits.

2.15 Interpretation of the result


What does it mean to set v′_ip := δ_ip? It means that V′ projects the data x′ onto the first P axes,
which in fact is a projection onto the first P eigenvectors of the covariance matrix C_x. Thus, if
we define

♦   V := U V′                                                         (2.70)
      = (u_1, u_2, ..., u_P)                       [by (2.36, 2.67)]  (2.71)

we can go back to the original coordinate system and find

♦   y = V′^T x′                                    [by (2.57)]        (2.72)
♦     = V′^T U^T x                                 [by (2.51)]        (2.73)
      = V^T x ,                                    [by (2.70)]        (2.74)

which we know has maximal variance. Thus, if we start from the original data x we would set v_p := u_p.
The variance of y is

♦   ⟨y^T y⟩ = Σ_{i=1}^{I} λ_i Σ_{p=1}^{P} (v′_ip)²   [by (2.63)]      (2.75)
♦           = Σ_{i=1}^{I} λ_i Σ_{p=1}^{P} δ_ip²      [by (2.67)]      (2.76)
            = Σ_{i=1}^{P} λ_i ,                                       (2.77)

which is the sum over the first P largest eigenvalues of the covariance matrix. Likewise one can determine
the reconstruction error as

♦   E = ⟨x^T x⟩ − ⟨y^T y⟩                          [by (2.31)]        (2.78)
♦     = Σ_{i=1}^{I} λ_i − Σ_{j=1}^{P} λ_j          [by (2.50, 2.77)]  (2.79)
      = Σ_{i=P+1}^{I} λ_i .                                           (2.80)

Notice that this is just one optimal set of weights. We have seen above that the projected data, like
any multi-dimensional data, can be rotated arbitrarily without changing its variance and therefore
without changing its reconstruction error. This is equivalent to a rotation of the projection vectors v_p within
the space spanned by the first eigenvectors.

2.16 PCA Algorithm
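
A minimal sketch of the complete procedure developed in Sections 2.8-2.15 (assuming NumPy; the function name and interface are illustrative, not part of the notes):

    import numpy as np

    def pca(X, P):
        """Sketch of PCA: X is an (M, I) data matrix, P the number of components to keep."""
        X = X - X.mean(axis=0)                 # 1. remove the mean
        C = (X.T @ X) / len(X)                 # 2. covariance matrix (2.32)
        lam, U = np.linalg.eigh(C)             # 3. eigenvalues and orthonormal eigenvectors
        order = np.argsort(lam)[::-1]          # 4. order by decreasing variance (2.34)
        lam, U = lam[order], U[:, order]
        V = U[:, :P]                           # 5. first P eigenvectors = principal components (2.71)
        Y = X @ V                              # 6. projected data, y = V^T x (2.74)
        return Y, V, lam

    # the reconstruction error equals the sum of the discarded eigenvalues (2.80)
    X = np.random.default_rng(7).normal(size=(500, 6)) @ np.diag([4, 3, 2, 1, 0.5, 0.1])
    Y, V, lam = pca(X, P=2)
    X_rec = Y @ V.T + X.mean(axis=0)
    print(np.mean(np.sum((X - X_rec) ** 2, axis=1)), lam[2:].sum())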

2.17 Intuition of the Results


Eigenvalue spectrum.
Projection onto a low-dimensional eigenspace.
Visualization of eigenvectors.

2.18 Whitening or sphering


Sometimes it is desirable to transform a data set such that it has variance one in all directions.
Such a normalization operation is called whitening or sphering, see Fig. 1.6. The latter term is quite
intuitive, because a spherical data distribution has the same variance in all directions. Intuitively speaking,
sphering requires stretching and compressing the data distribution along the axes of the principal components
such that it has variance one. Technically speaking, one first rotates the data into a coordinate system where
the covariance matrix is diagonal, then performs the stretching along the axes, and then rotates the data
back into the original coordinate system. Principal component analysis obviously gives all the required
information. The eigenvectors of the covariance matrix provide the axes of the new coordinate system and
the eigenvalues λ_i indicate the variances and therefore how much one has to stretch the data. If the original
variance is λ_i then one obviously has to stretch by a factor of 1/√λ_i to get variance one. Thus, sphering
is achieved by multiplying the data with a sphering matrix

    W := U diag(1/√λ_1, 1/√λ_2, ..., 1/√λ_I) U^T                     (2.81)
    x̂ := W x .                                                       (2.82)

If the final orientation of the data does not matter, the sphering matrix is often defined without the first
U. It is easy to verify that the sphering matrix is symmetrical, the sphered data x̂ have a unit
covariance matrix,

    C_x̂ := ⟨x̂ x̂^T⟩                                                   (2.83)
          = W ⟨x x^T⟩ W^T                          [by (2.82)]        (2.84)
          = U diag(1/√λ_i) U^T C_x U diag(1/√λ_i) U^T   [by (2.32, 2.81)]  (2.85)
          = U diag(1/√λ_i) Λ diag(1/√λ_i) U^T      [by (2.41)]        (2.86)
          = U 1 U^T                                [by (2.37)]        (2.87)
          = 1 ,                                    [by (2.39)]        (2.88)

and they have variance one in all directions, since for any projection vector n of norm one the variance
⟨(n^T x̂)²⟩ of the projected data is

    ⟨(n^T x̂)²⟩ = n^T ⟨x̂ x̂^T⟩ n                                        (2.89)
               = n^T n                             [by (2.88)]        (2.90)
               = 1 .                                                  (2.91)

Similarly one can show that the sphered data projected onto two orthogonal vectors are uncorre-
lated.
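
A minimal sketch constructing the sphering matrix (2.81) and checking (2.88) numerically (assuming NumPy; the data are made up):

    import numpy as np

    rng = np.random.default_rng(8)
    X = rng.multivariate_normal([0, 0, 0], [[4, 1, 0], [1, 2, 0.5], [0, 0.5, 1]], size=50000)
    X = X - X.mean(axis=0)

    C = (X.T @ X) / len(X)
    lam, U = np.linalg.eigh(C)

    W = U @ np.diag(1.0 / np.sqrt(lam)) @ U.T     # sphering matrix (2.81), symmetric
    X_hat = X @ W.T                               # x_hat = W x for every data point (2.82)

    C_hat = (X_hat.T @ X_hat) / len(X_hat)
    print(np.round(C_hat, 3))                     # ~ identity matrix, cf. (2.83)-(2.88)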

2.19 Singular value decomposition +


Sometimes one has fewer data points than dimensions. For instance one might have 100 images with 10000
pixels each. Then doing direct PCA is very inefficient and the following method, known as singular value
decomposition (SVD), is helpful.
Let x^µ, µ = 1, ..., M be the I-dimensional data with M < I. For convenience we write the data in one
I × M-matrix

    X := (x^1, ..., x^M) .                                           (2.92)

The second-moment matrix can then be written as

    C_1 := X X^T / M ,                                               (2.93)

and its eigenvalue equation and decomposition read

    C_1 U_1 = U_1 Λ_1                                                (2.94)
    ⇐⇒ C_1 = U_1 Λ_1 U_1^T .                                         (2.95)

The data represented in the coordinate system of the eigenvectors is

    Y_1 := U_1^T X ,                                                 (2.96)

which is still high-dimensional.

Now interpret the data matrix X transposed, i.e. swap the data point index for the dimension index. In our
example this would correspond to having 10000 data points in a 100-dimensional space, which is, of course,
much easier to deal with. We get the same equations as above, just with X transposed:

    C_2 := X^T X / I ,                                               (2.97)
    C_2 U_2 = U_2 Λ_2                                                (2.98)
    ⇐⇒ C_2 = U_2 Λ_2 U_2^T ,                                         (2.99)
    Y_2 := U_2^T X^T .                                               (2.100)

The interesting property of matrix Y_2 now is that its rows are eigenvectors of matrix C_1, as can be shown
easily:

    C_1 Y_2^T = (X X^T / M)(X U_2)                 [by (2.93, 2.100)] (2.101)
              = X (X^T X / I) U_2 I/M                                 (2.102)
              = X C_2 U_2 I/M                      [by (2.97)]        (2.103)
              = X U_2 Λ_2 I/M                      [by (2.98)]        (2.104)
              = Y_2^T Λ_2 I/M .                    [by (2.100)]       (2.105)

The corresponding eigenvalues are eigenvalues of C_2 scaled by I/M.

However, Y_2 yields only M eigenvectors and eigenvalues. The other eigenvalues are all zero, because M
data points can only produce M non-zero variance dimensions or, in other words, M data points together
with the origin can only span an M-dimensional subspace. The missing (I − M) eigenvectors must all be
orthogonal to the first M ones and orthogonal to each other but can otherwise be quite arbitrary, since their
eigenvalues are all equal. A Gram-Schmidt orthogonalization procedure can be used to generate them.
Acknowledgment: I thank Agnieszka Grabska-Barwinska for working out the proof for singular value
decomposition.
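
A minimal sketch of this trick (assuming NumPy; here 100 random data points in 1000 dimensions stand in for the image example, and all names are illustrative):

    import numpy as np

    rng = np.random.default_rng(9)
    M, I = 100, 1000                       # fewer data points than dimensions
    X = rng.normal(size=(I, M))            # data matrix, one column per data point (2.92)

    # solve the small M x M eigenvalue problem instead of the I x I one
    C2 = (X.T @ X) / I                                     # (2.97)
    lam2, U2 = np.linalg.eigh(C2)
    lam2, U2 = lam2[::-1], U2[:, ::-1]                     # decreasing order

    U1 = X @ U2                                            # columns: eigenvectors of C1 (2.101-2.105)
    U1 = U1 / np.linalg.norm(U1, axis=0)                   # normalize them
    lam1 = lam2 * I / M                                    # corresponding eigenvalues of C1

    C1 = (X @ X.T) / M                                     # (2.93), built here only for checking
    assert np.allclose(C1 @ U1, U1 * lam1)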

3 Application
3.1 Face processing

Eigenfaces. If one carefully shifts and scales many face images such that the eyes are in register, i.e. at
identical positions, and performs PCA (or SVD) on them, meaningful eigenvectors can be calculated and are
called eigenfaces. These are the principal grey value variations that distinguish faces from each other. The
figure shows first the mean and then the eigenfaces ordered in rows. The first eigenface obviously accounts
for the mean grey value, the second one for the difference in color between hair and face, the third one for
illumination from the side, the fourth and seventh one at least partially for a beard. The higher components
become increasingly difficult to interpret. The projection of face images onto the first eigenfaces is a suitable
representation for face recognition (cf. Turk and Pentland, 1991).
Figure: (http://www-white.media.mit.edu/ 2002-12-17, URL)3.1

Eigenfaces - Texture. In this example many faces were warped to a standard geometry and then PCA was
applied to calculate the eigenvectors, which are called eigenfaces. The figure visualizes the first four eigenfaces
by varying the average face (middle) along the direction of the eigenvectors by up to ±3 standard deviations.
The first eigenface (mode) accounts for overall illumination, the other three for some combination of beard,
gender, and mimic expression.
Figure: (Lanitis et al., 1995, Fig. 11, URL)3.2

Eigenfaces - Shape. In this example PCA was applied to the geometry of faces. A graph with some standard
structure was mapped onto many faces, and the concatenated vector of xy-positions of the nodes (not shown
in the graphs) of the graphs serves as the data for PCA. Visualization is done relative to the average geometry
(middle) by adding the eigenvectors up to ±2 standard deviations. The first three components mainly account
for the orientation of the head along the three rotational axes. The fourth component accounts for some
variation in width and mimic expression.
Figure: (Lanitis et al., 1995, Fig. 6, URL)3.3

4 Other resources
Numbers in square brackets indicate sections of these lecture notes to which the corresponding item is related.

4.1 Written material


• PCA in Wikipedia
https://en.wikipedia.org/wiki/Principal_component_analysis

• A tutorial by Jonathon Shlens


https://arxiv.org/pdf/1404.1100v1.pdf
• An in depth book by I.T.Jolliffe
http://wpage.unina.it/cafiero/books/pc.pdf

4.2 Visualizations
• 2D, 3D, and 17D examples of PCA
http://setosa.io/ev/principal-component-analysis/

4.3 Videos
• Abstract conceptual introduction to PCA from Georgia Tech.
1. Part 1
https://www.youtube.com/watch?v=kw9R0nD69OU (4:22) [1.1, 1.2, 1.3]
2. Part 2
https://www.youtube.com/watch?v=_nZUhV-qhZA (5:27) [1.2, 1.3]
3. Part 3
https://www.youtube.com/watch?v=kuzJJgPBrqc (5:01) [1.2, 1.3]

• Introduction to PCA with several practical examples by Rasmus Bro from the University of Copenhagen
Note: In these videos the principal components are called loadings, and the projected values are called
scores.
1. Conceptual introduction
https://www.youtube.com/watch?v=UUxIXU_Ob6E (12:32)
– 00:00–00:38 Introduction
– 00:38–07:52 Examples of multi-variate data
– 07:52–12:32 Concept of PCA and bi-plot by means of a simple 3D example
2. Example: Analysis of demographic data of many countries in the world with PCA
https://www.youtube.com/watch?v=26YhtSJi1qc (11:36)
– 00:11–02:26 Outlier detection
– 02:26–05:47 Loadings help interpreting the projected data dimensions
– 05:47–08:11 Exploration after removing outliers
– 08:11–09:40 Using external information as markers
– 09:40–10:36 From plots to numbers
– 10:36–11:36 Summary

3. A bit more in depth discussion
https://www.youtube.com/watch?v=2s-a62zSWL4 (14:31)
– 00:00–00:50 Introduction and historical remarks
– 00:50–03:18 Basic equation for data generation
– 03:18–06:32 Example: Three-variable data set
∗ 04:30–05:40 Removing the mean before PCA
∗ 05:40–06:11 PCA as variations along loadings
∗ 06:11–06:32 PCA as a rotation and truncation
– 06:32–06:59 Scores are projections of the data onto the loadings
– 06:59–14:26 PCA as finding common profiles and profile variations
∗ 07:12–11:58 Example 1: Continuous spectral data of sugar samples
∗ 12:22–14:26 Example 2: Discrete physical and chemical measurements of the same sugar
samples
· 12:48–14:26 Scaling the variables before PCA (covariance- vs. correlation-matrix)
4. Continuation of the previous video
https://www.youtube.com/watch?v=sRsdF3rcAJc (08:22)
– 00:05–14:26 PCA as finding common profiles and profile variations (cont.)
∗ 00:05–04:02 Example 2: Discrete physical and chemical measurements of the same sugar
samples (cont.)
∗ 01:31–02:37 Relating results from Examples 1 and 2
∗ 02:37–04:02 Bi-plot: Understanding the data by visualizing the loadings
– 04:02–08:18 PCA reviewed
• Introduction to PCA in Geoscience by Matthew E. Clapham from the UC Santa Cruz
https://www.youtube.com/watch?v=TSYL-oHx4T0 (15:27)
– 00:00–00:33 Introduction
– 00:33–01:17 Data is often multi-variate / high-dimensional
– 01:17–02:50 Covariation allows to reduce dimensionality and remove redundancy
– 02:50–03:36 skip (Indirect gradient analysis)
– 03:36–04:44 skip (Types of ordination methods)
– 04:44–06:10 Idea of PCA
– 06:10–06:29 Covariance matrix
– 06:29–09:22 Eigenvectors define a new coordinate system ordered by variance
∗ 07:53–08:43 Eigenvalues measure the amount of variance along the axes of the new coordinate
system
– 09:22–10:42 Loadings
– 10:42–12:17 Covariance- vs. correlation-matrix (scaling the variables before PCA)
– 12:17–13:50 Using the eigenvalue spectrum to decide how many PCs to keep
– 13:50–15:27 When is PCA applicable?
• Three lectures on singular value decomposition and PCA
(See also the closely related tutorial by Jonathon Shlens https://arxiv.org/pdf/1404.1100v1.pdf.)

1. Singular value decomposition
https://www.youtube.com/watch?v=EokL7E6o1AE (44:35)
Explains singular value decomposition in general, which is quite an interesting fact in linear algebra,
but not really necessary here. It is also not entirely obvious how this relates to Section 2.19. I list
this here, because it belongs to the series.
2. PCA in relatively simple mathematical terms
https://www.youtube.com/watch?v=a9jdQGybYmE (51:12)
– 00:00–15:38 Why would you like to do PCA? [1.1]
– 15:38–23:32 Variance and covariance [1.4]
∗ 18:18–18:30 The vectors introduced 16:52-17:10 are row vectors, so that abT is indeed an
inner product. I always use column vectors, and then I write the inner product as aT b
∗ 22:26–22:54 The statement made here is not always true. Variables can be uncorrelated but
statistically dependent, a simple example being cos(t) and sin(t) for t ∈ [0, 2π]. He uses the
term statistical independence for what really only is zero covariance or no correlation.
– 23:32–25:06 Relation of variance and covariance to the motivational example
– 25:06–35:49 Covariance matrix (assuming zero mean data) [2.8]
– 35:49–39:44 Diagonalizing the covariance matrix [1.7]
– 39:44–47:20 Diagonalization with eigenvalues and -vectors [2.11]
– 47:20–49:43 Diagonalization with singular value decomposition
– 49:43–51:12 Summary
3. Application of PCA to face recognition
https://www.youtube.com/watch?v=8BTv-KZ2Bh8 (48:02)

4.4 Software
• General list of software for PCA
https://en.wikipedia.org/wiki/Principal_component_analysis#Software.2Fsource_code

• PCA and variants thereof in scikit-learn, a python library for machine learning
http://scikit-learn.org/stable/modules/decomposition.html#principal-component-analysis-pca
• Examples using PCA in scikit-learn
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#
sklearn.decomposition.PCA

4.5 Exercises
• Analytical exercises by Laurenz Wiskott
https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/PrincipalComponentAnalysis-ExercisesPublic.
pdf
https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/PrincipalComponentAnalysis-SolutionsPublic.
pdf
• Python exercises by Laurenz Wiskott
https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/PrincipalComponentAnalysis-PythonExercisesPubl
zip
https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/PrincipalComponentAnalysis-PythonSolutionsPubl
zip

References
Hertz, J., Krogh, A., and Palmer, R. G. (1991). Introduction to the Theory of Neural Computation. Addison-
Wesley, Redwood City, CA.
Lanitis, A., Taylor, C. J., and Cootes, T. F. (1995). An automatic face identification system using flexible
appearance models. Image and Vision Computing, 13(5):393–401.
Turk, M. and Pentland, A. P. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–
86.

Notes
3.1 http://www-white.media.mit.edu/ 2002-12-17, http://www-white.media.mit.edu/vismod/demos/facerec/basic.html
3.2 Lanitis,
Taylor, et al., 1995, Image and Vision Computing 13(5):393–401, Fig. 11, https://www.researchgate.net/
profile/Andreas_Lanitis/publication/223308678_Automatic_face_identification_system_using_flexible_appearance_
models/links/00463533817d5b2dd9000000.pdf
3.3 Lanitis, Taylor, et al., 1995, Image and Vision Computing 13(5):393–401, Fig. 6, https://www.researchgate.net/profile/

Andreas_Lanitis/publication/223308678_Automatic_face_identification_system_using_flexible_appearance_models/
links/00463533817d5b2dd9000000.pdf

Copyright protection level: 2/2

