
Lectures on Machine Learning (Fall 2017)

Hyeong In Choi, Seoul National University

Lecture 7: Principal Component Analysis (PCA)
(Draft: version 0.9.1)

Topics to be covered:

• Basic setup of PCA

• SVD formulation

• Projection in terms of SVD

• Mathematics of singular value decomposition (SVD)

1 PCA: Principal Component Analysis


Principal component analysis is one of the most important tools in machine learning. At its core, it is a tool in unsupervised learning, but it is also used as a feature extraction tool in supervised learning.
1.1 Basics of PCA
Let $\{x^{(i)}\}_{i=1}^{N}$ be a dataset with no labels whose data matrix $X$ is denoted by
$$
X = \begin{pmatrix} (x^{(1)})^T \\ \vdots \\ (x^{(i)})^T \\ \vdots \\ (x^{(N)})^T \end{pmatrix}
  = \begin{pmatrix}
      x^{(1)}_1 & \cdots & x^{(1)}_j & \cdots & x^{(1)}_d \\
      \vdots    &        & \vdots    &        & \vdots    \\
      x^{(i)}_1 & \cdots & x^{(i)}_j & \cdots & x^{(i)}_d \\
      \vdots    &        & \vdots    &        & \vdots    \\
      x^{(N)}_1 & \cdots & x^{(N)}_j & \cdots & x^{(N)}_d
    \end{pmatrix}. \tag{1}
$$
Lowering the superscript, let $x_{ij} = x^{(i)}_j$. Then the data matrix $X$ is written as
$$
X = \begin{pmatrix}
      x_{11} & \cdots & x_{1j} & \cdots & x_{1d} \\
      \vdots &        & \vdots &        & \vdots \\
      x_{i1} & \cdots & x_{ij} & \cdots & x_{id} \\
      \vdots &        & \vdots &        & \vdots \\
      x_{N1} & \cdots & x_{Nj} & \cdots & x_{Nd}
    \end{pmatrix}. \tag{2}
$$
Note that the rows of $X$ represent data points and the columns values of features. In particular, we may view the $j$-th column as a random variable $Z_j$ whose IID samples are the entries of the $j$-th column. This way, we identify
$$Z_j = [x_{1j}, \cdots, x_{Nj}]^T.$$
Let $\mu_j = E[Z_j]$, i.e.
$$\mu_j = \frac{1}{N} \sum_{i=1}^{N} x_{ij}$$
for $j = 1, \cdots, d$, and let
$$\mu = [\mu_1, \cdots, \mu_d]^T.$$
If we think of each data point $x^{(i)}$ as a point in $\mathbb{R}^d$, this $\mu$ represents the centroid of all data points in $\mathbb{R}^d$. Define the normalized centered data matrix $\tilde{X}$ by
$$
\tilde{X} = \frac{1}{\sqrt{N}}
  \begin{pmatrix} (x^{(1)} - \mu)^T \\ \vdots \\ (x^{(i)} - \mu)^T \\ \vdots \\ (x^{(N)} - \mu)^T \end{pmatrix}
  = \frac{1}{\sqrt{N}}
  \begin{pmatrix}
    x_{11} - \mu_1 & \cdots & x_{1j} - \mu_j & \cdots & x_{1d} - \mu_d \\
    \vdots         &        & \vdots         &        & \vdots         \\
    x_{i1} - \mu_1 & \cdots & x_{ij} - \mu_j & \cdots & x_{id} - \mu_d \\
    \vdots         &        & \vdots         &        & \vdots         \\
    x_{N1} - \mu_1 & \cdots & x_{Nj} - \mu_j & \cdots & x_{Nd} - \mu_d
  \end{pmatrix}. \tag{3}
$$
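As a concrete illustration, here is a minimal NumPy sketch (the toy data matrix `X` below is hypothetical, with rows as data points) that forms the centroid $\mu$, the normalized centered data matrix $\tilde{X}$ of (3), and the matrix $\tilde{X}^T \tilde{X}$ that will reappear below as the covariance matrix $C$:

```python
import numpy as np

# Hypothetical toy data: N = 5 points in d = 3 dimensions (rows are data points).
X = np.array([[2.0, 0.1, 1.0],
              [1.5, 0.3, 0.8],
              [3.1, 0.2, 1.9],
              [2.7, 0.4, 1.4],
              [1.9, 0.5, 1.1]])
N, d = X.shape

mu = X.mean(axis=0)                 # centroid mu = [mu_1, ..., mu_d]^T
X_tilde = (X - mu) / np.sqrt(N)     # normalized centered data matrix, eq. (3)
C = X_tilde.T @ X_tilde             # will reappear as the covariance matrix C

# C agrees with the 1/N-normalized empirical covariance of the columns of X.
print(np.allclose(C, np.cov(X, rowvar=False, bias=True)))   # True
```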
The gist of the idea of PCA is to successively find the directions along which the data spread out the most. In particular, our objective is to find a unit vector $v \in \mathbb{R}^d$ that maximizes
$$
F(v) = \frac{1}{N} \sum_{\ell=1}^{N} |\mathrm{Proj}_v(x^{(\ell)} - \mu)|^2
     = \frac{1}{N} \sum_{\ell=1}^{N} |v^T (x^{(\ell)} - \mu)|^2. \tag{4}
$$
Now $v^T (x^{(\ell)} - \mu)$ is a scalar, which is equal to $(x^{(\ell)} - \mu)^T v$. Thus we can write
$$
\frac{1}{N} \sum_{\ell=1}^{N} |v^T (x^{(\ell)} - \mu)|^2
  = v^T \Big\{ \frac{1}{N} \sum_{\ell=1}^{N} (x^{(\ell)} - \mu)(x^{(\ell)} - \mu)^T \Big\} v.
$$
Define a $d \times d$ symmetric matrix $C$ by
$$
C = \frac{1}{N} \sum_{\ell=1}^{N} (x^{(\ell)} - \mu)(x^{(\ell)} - \mu)^T.
$$
Then (4) can be written as
$$
F(v) = v^T C v. \tag{5}
$$
Note that the $(i, j)$-th entry $C_{ij}$ of $C$ is
$$
C_{ij} = \frac{1}{N} \sum_{\ell=1}^{N} (x_{\ell i} - \mu_i)(x_{\ell j} - \mu_j).
$$

Therefore the following is easy to check.


Lemma 1. $C$ is a $d \times d$ (empirical) covariance matrix satisfying

(i) $C_{ij} = \mathrm{Cov}(Z_i, Z_j)$

(ii) $C = \tilde{X}^T \tilde{X}$.

The way to maximize (4) subject to $|v| = 1$ is to use the Lagrange multiplier
$$
\begin{aligned}
L(v, \lambda) &= \frac{1}{N} \sum_{\ell=1}^{N} |v^T (x^{(\ell)} - \mu)|^2 - \lambda (|v|^2 - 1) \\
              &= v^T C v - \lambda (v^T v - 1).
\end{aligned}
$$
Thus by computing the derivative and setting it equal to zero, we have
$$
\frac{\partial L}{\partial v} = 2Cv - 2\lambda v = 0,
$$
i.e.,
$$
Cv = \lambda v.
$$
This equation says $v$ must be an eigenvector of $C$. Since $C$ is a symmetric matrix, it is diagonalizable, and the diagonalization theorem says $C$ can be written as
$$
C = \lambda_1 v_1 v_1^T + \cdots + \lambda_d v_d v_d^T,
$$
where $\{v_1, \cdots, v_d\}$ is an orthonormal basis of $\mathbb{R}^d$ and $\lambda_1, \cdots, \lambda_d$ are the eigenvalues of $C$ ordered in descending order: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \ge 0$. Check that
$$
F(v_i) = v_i^T C v_i = \lambda_i.
$$
Thus $F(v)$ is the biggest when $v = v_1$.
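A short, self-contained NumPy check of this fact on hypothetical random data: among the eigenvectors of $C$, the one with the largest eigenvalue gives the largest value of $F(v) = v^T C v$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.3])   # hypothetical data
N = X.shape[0]
X_tilde = (X - X.mean(axis=0)) / np.sqrt(N)
C = X_tilde.T @ X_tilde

# eigh returns eigenvalues in ascending order; flip to the descending order used here.
lam, vecs = np.linalg.eigh(C)
lam, vecs = lam[::-1], vecs[:, ::-1]

F = lambda v: float(v @ C @ v)                        # F(v) = v^T C v, eq. (5)
print([round(F(vecs[:, i]), 4) for i in range(3)])    # equals lambda_1, lambda_2, lambda_3
print(np.round(lam, 4))
```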
Let us now look at the multi-dimensional version. So let $W$ be a linear subspace of $\mathbb{R}^d$ and let $\mathrm{Proj}_W$ be the orthogonal projection of $\mathbb{R}^d$ onto $W$. We want to find the subspace in which the data spreads out the most.
Proposition 1. Let
$$
\mathrm{Proj}_W(x^{(i)} - \mu)
$$
be the orthogonal projection of the centered $i$-th data point onto the subspace $W$. Define
$$
F(W) = \frac{1}{N} \sum_{\ell=1}^{N} |\mathrm{Proj}_W(x^{(\ell)} - \mu)|^2.
$$
Then $F(W)$ is maximal among all $k$-dimensional linear subspaces of $\mathbb{R}^d$ if $W$ is the subspace spanned by the $k$ highest eigenvectors, i.e., the eigenvectors corresponding to the $k$ highest eigenvalues.

Proof. We give a proof for $k = 2$, as it is trivial to generalize it to any $k \ge 2$. Let $v$ and $w$ be orthonormal vectors and let $W$ be the two-dimensional linear subspace spanned by $v$ and $w$. Then
$$
\frac{1}{N} \sum_{\ell=1}^{N} |\mathrm{Proj}_W(x^{(\ell)} - \mu)|^2
  = \frac{1}{N} \sum_{\ell=1}^{N} \Big\{ |v^T (x^{(\ell)} - \mu)|^2 + |w^T (x^{(\ell)} - \mu)|^2 \Big\}
  = v^T C v + w^T C w.
$$
Use the Lagrange multiplier
$$
L(v, w, \alpha, \beta) = v^T C v + w^T C w - \alpha\{|v|^2 - 1\} - \beta\{|w|^2 - 1\}.
$$
Upon setting
$$
\frac{\partial L}{\partial v} = 2Cv - 2\alpha v = 0, \qquad
\frac{\partial L}{\partial w} = 2Cw - 2\beta w = 0,
$$
we get
$$
Cv = \alpha v, \qquad Cw = \beta w.
$$
Thus $v$ and $w$ are orthonormal eigenvectors of $C$. As before, write
$$
C = \lambda_1 v_1 v_1^T + \cdots + \lambda_d v_d v_d^T,
$$
where $v_1, \cdots, v_d$ are orthonormal eigenvectors with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \ge 0$. Since $v$ and $w$ must be two of $v_1, \cdots, v_d$, let $v = v_i$ and $w = v_j$ for $i \ne j$. Then
$$
F(v, w) = F(v_i, v_j) = \lambda_i + \lambda_j.
$$
This is maximal when $v = v_1$ and $w = v_2$, in which case
$$
F(v, w) = \lambda_1 + \lambda_2.
$$
For general $k$, it is trivial to see that $W$ must be the $k$-dimensional subspace spanned by $v_1, \cdots, v_k$. $\square$

1.2 SVD formulation


Proposition 1 shows how to find the $k$-dimensional linear subspace in which the data spreads out the most. The remaining task is how to do it most efficiently, for which the singular value decomposition comes in handy. By SVD, the $N \times d$ matrix $\tilde{X}$ can be written as
$$
\tilde{X} = U D V^T, \tag{6}
$$
where $U$ is an $N \times N$ orthogonal matrix, $V$ a $d \times d$ orthogonal matrix, and $D$ an $N \times d$ "diagonal" matrix. Therefore
$$
C = \tilde{X}^T \tilde{X} = V D^T U^T U D V^T = V D^T D V^T = V \Lambda V^T, \tag{7}
$$
where $\Lambda = D^T D$ is a $d \times d$ diagonal matrix. Note that (7) can be rewritten as
$$
C V = V \Lambda,
$$
which can be interpreted as saying
$$
C v_j = \lambda_j v_j,
$$
for $j = 1, \cdots, d$, where $v_1, \cdots, v_d$ are the column vectors of $V$. Namely, the column vectors of $V$ are precisely the eigenvectors of $C$. Assuming the eigenvalues of $C$ are ordered in descending order, the basis of the $k$-dimensional subspace $W$ found in Proposition 1 can be readily read off as the set of the first $k$ column vectors of the matrix $V$.

These $k$ vectors $v_1, \cdots, v_k$ become the new features to be used in lieu of the old ones. This process of finding new features is called feature extraction.
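A minimal NumPy sketch of this correspondence on hypothetical random data: the right singular vectors of $\tilde{X}$ (the columns of $V$) are eigenvectors of $C$, and the squared singular values are the eigenvalues $\lambda_j$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4)) @ np.diag([4.0, 2.0, 1.0, 0.5])   # hypothetical data
N, d = X.shape
X_tilde = (X - X.mean(axis=0)) / np.sqrt(N)
C = X_tilde.T @ X_tilde

# Thin SVD as in (6): X_tilde = U D V^T; singular values come out in descending order.
U, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)
V = Vt.T
lam = s**2                                      # Lambda = D^T D, eq. (7)

# Columns of V satisfy C v_j = lambda_j v_j.
print(np.allclose(C @ V, V * lam))              # True (column j scaled by lam[j])
```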
The next question is how the data point $x^{(i)}$ is written in terms of these newly extracted features. Let $R$ be the $N \times d$ matrix defined by
$$
R = U D = \begin{pmatrix} r_{11} & \cdots & r_{1d} \\ \vdots & & \vdots \\ r_{N1} & \cdots & r_{Nd} \end{pmatrix}.
$$
Then, from (6), we have $\tilde{X}^T = V R^T$. Let $e_i$ be the $N$-dimensional standard basis vector all of whose components are zero except the $i$-th, which is 1. Then it is easy to see that
$$
R^T e_i = \sum_{j=1}^{d} (R^T)_{ji} e_j = \sum_{j=1}^{d} r_{ij} e_j,
$$
where on the right-hand side $e_j$ denotes the $j$-th standard basis vector of $\mathbb{R}^d$. Thus
$$
V R^T e_i = \sum_{j=1}^{d} r_{ij} V e_j = \sum_{j=1}^{d} r_{ij} v_j.
$$
Therefore, by (3), we have
$$
\frac{1}{\sqrt{N}} (x^{(i)} - \mu) = \tilde{X}^T e_i = V R^T e_i = \sum_{j=1}^{d} r_{ij} v_j.
$$
This gives a convenient way of expressing the projection onto the subspace $W = \mathrm{span}\{v_1, \cdots, v_k\}$ of $\mathbb{R}^d$ as
$$
\mathrm{Proj}_W (x^{(i)} - \mu) = \sqrt{N} \sum_{j=1}^{k} r_{ij} v_j. \tag{8}
$$
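A hedged NumPy sketch of (8) on hypothetical data: the $i$-th row of $R = UD$, scaled by $\sqrt{N}$, gives the coordinates of the centered $x^{(i)}$ in the basis $v_1, \cdots, v_d$, and keeping the first $k$ of them yields the projection onto $W$.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5)) @ np.diag([5.0, 3.0, 1.0, 0.4, 0.1])   # hypothetical data
N, d = X.shape
mu = X.mean(axis=0)
X_tilde = (X - mu) / np.sqrt(N)

U, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)
R = U * s                                   # R = U D; row i holds r_{i1}, ..., r_{id}

k, i = 2, 0                                 # project the first data point onto span{v_1, ..., v_k}
proj = np.sqrt(N) * R[i, :k] @ Vt[:k, :]    # eq. (8): sqrt(N) * sum_{j<=k} r_{ij} v_j

# Sanity check against a direct orthogonal projection of x^(i) - mu onto W.
P = Vt[:k, :].T @ Vt[:k, :]                 # projection matrix onto W
print(np.allclose(proj, P @ (X[i] - mu)))   # True
```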

It now remains to determine $k$. In general, as we can see in Figure 1, the data does not spread out evenly in all directions. In some directions it spreads out quite a lot, while it changes little in others. The $W$ depicted there is the linear subspace $W$ drawn at (or translated to) the center point $\mu$ of the data. One can readily see that the data spreads out more in the direction of $W$, while it is relatively unchanged in the perpendicular direction $W^{\perp}$. As we saw above, the degree of spreading in each direction is measured in terms of the eigenvalues. So the strategy is to choose the first $k$ directions (eigenvectors) that capture most of the spreading. To do that, note that $\mathrm{Tr}(C) = \lambda_1 + \cdots + \lambda_d$. Choose $k$ so that $\lambda_1 + \cdots + \lambda_k$ is close to $\mathrm{Tr}(C)$ within a given degree of closeness. Say, if we want to capture 90% of $\mathrm{Tr}(C)$, choose the smallest $k$ such that $\lambda_1 + \cdots + \lambda_k \ge 0.9\,\mathrm{Tr}(C)$.
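A small sketch of this rule, using the squared singular values of $\tilde{X}$ as the $\lambda_j$ (the function name and the 90% threshold are just the running example, not part of the lecture):

```python
import numpy as np

def choose_k(X, threshold=0.9):
    """Smallest k with lambda_1 + ... + lambda_k >= threshold * Tr(C)."""
    N = X.shape[0]
    X_tilde = (X - X.mean(axis=0)) / np.sqrt(N)
    s = np.linalg.svd(X_tilde, compute_uv=False)
    lam = s**2                                    # eigenvalues of C, descending
    ratio = np.cumsum(lam) / lam.sum()            # cumulative share of Tr(C)
    return int(np.searchsorted(ratio, threshold) + 1)

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 6)) @ np.diag([6.0, 4.0, 2.0, 0.5, 0.2, 0.1])
print(choose_k(X))   # typically 2 for this hypothetical data
```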

1.3 Summary: Procedure of PCA


In view of what we have done so far, the PCA procedure can be summarized as follows (a code sketch follows the list):

(i) (Preparation) Compute $\tilde{X}$ as in (3).

(ii) (SVD) Compute the SVD of $\tilde{X}$ as in (6).

(iii) (New features) The new features are the first $k$ column vectors $v_1, \cdots, v_k$ of $V$, and $W = \mathrm{span}\{v_1, \cdots, v_k\}$.

(iv) (Data) The $i$-th data point projected onto $W$ is written as in (8).
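Putting steps (i)–(iv) together, a minimal end-to-end sketch (the function and variable names are mine, not from the lecture):

```python
import numpy as np

def pca(X, k):
    """Steps (i)-(iv): normalize/center, SVD, new features, projected coordinates."""
    N = X.shape[0]
    mu = X.mean(axis=0)
    X_tilde = (X - mu) / np.sqrt(N)                            # (i)  eq. (3)
    U, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)     # (ii) eq. (6)
    V_k = Vt[:k].T                                             # (iii) new features v_1, ..., v_k
    coords = np.sqrt(N) * (U * s)[:, :k]                       # (iv) coefficients in eq. (8)
    return mu, V_k, coords

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5)) @ np.diag([5.0, 2.0, 1.0, 0.3, 0.1])
mu, V_k, coords = pca(X, k=2)
proj = coords @ V_k.T       # row i is Proj_W(x^(i) - mu), as in (8)
print(proj.shape)           # (100, 5)
```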

[Figure 1: Subspace $W$]

Remark. One must exercise judicious caution when applying PCA. The most salient issue is the problem of scale. For example, look at the regression problem of estimating a person's weight in terms of bodily measurements. Suppose one uses the millimeter as the unit of measurement for the waist girth instead of the customary centimeter. Then the quantity representing the girth will be exaggerated tenfold in the PCA analysis. Or, if one uses the kilometer as the unit of measurement, the girth data may look like it changes so little from one person to another that it could be tossed out as an irrelevant feature.

In order to prevent such discrepancies, people frequently normalize the data before embarking on PCA. For instance, the random variable representing each feature is normalized to have mean zero and variance one. Namely, $Z_j$ is replaced with $\dfrac{Z_j - \mu_j}{\sigma_j}$. It really amounts to using the correlation matrix rather than the covariance matrix in our PCA.

This cure, however, is not a panacea. For example, if $\sigma_j$ is very small, the random variable $Z_j$ may receive a disproportionate boost through this procedure. So one must exercise judicious caution and be aware of the relative importance of each feature in the first place, so that PCA can truly suppress unimportant features while boosting the more important ones.
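A minimal sketch of this normalization (my own illustration; the small `eps` guards against the tiny-$\sigma_j$ issue just mentioned):

```python
import numpy as np

def standardize(X, eps=1e-12):
    """Replace each feature Z_j by (Z_j - mu_j) / sigma_j before running PCA."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / np.maximum(sigma, eps)

rng = np.random.default_rng(5)
waist_mm = rng.normal(800.0, 60.0, size=100)    # hypothetical waist girth in millimeters
height_m = rng.normal(1.7, 0.1, size=100)       # hypothetical height in meters
X = np.column_stack([waist_mm, height_m])

Z = standardize(X)
print(np.round(Z.std(axis=0), 3))    # both features now have (nearly) unit variance
```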

2 Mathematical supplements
2.1 Singular value decomposition (SVD)
The singular value decomposition (SVD) of a matrix is perhaps one of the most useful techniques in numerical linear algebra. Its impact is far-reaching, and one cannot do without it in many applications. Here, we introduce what it is and give an intuitive, geometric proof of it.
Let $A$ be an $m \times n$ matrix. $A$ can be viewed as a linear map $A : \mathbb{R}^n \to \mathbb{R}^m$. Since diagonal matrices are the easiest to deal with, it would have been great if $A$ were a diagonal matrix. Of course, this cannot be true in general. So the next best thing to hope for is to find some orthonormal bases $\{v_1, \cdots, v_n\}$ of $\mathbb{R}^n$ and $\{u_1, \cdots, u_m\}$ of $\mathbb{R}^m$ with regard to which $A$ is diagonal, i.e.
$$
A v_i = \sigma_i u_i,
$$
for some $\sigma_i$, for $i = 1, \cdots, n$. If so, the first thing that must hold is that there is an orthonormal basis $\{v_1, \cdots, v_n\}$ such that $A v_1, \cdots, A v_n$ are mutually orthogonal, i.e.,
$$
(A v_j)^T (A v_i) = 0,
$$
for all $j \ne i$. Thus for any fixed $i$ and for all $j \ne i$,
$$
v_j^T (A^T A v_i) = 0,
$$
which must necessarily mean that $A^T A v_i$ has no component along any $v_j$ with $j \ne i$, i.e., it is a scalar multiple of $v_i$: there exists some $\alpha_i \in \mathbb{R}$ such that $A^T A v_i = \alpha_i v_i$. Therefore $v_1, \cdots, v_n$ must be eigenvectors of the $n \times n$ symmetric matrix $A^T A$, which are well known to exist and are relatively easy to find numerically. So the idea is to define
$$
u_i = \frac{A v_i}{|A v_i|}
$$
and declare victory, because $A v_i = |A v_i|\, u_i$ then holds. The gist of the proof is making this line of thought mathematically correct.

Let $\alpha_1, \cdots, \alpha_n$ be the eigenvalues of $A^T A$ and let $r = \mathrm{rank}(A) \le n$. Since $\mathrm{rank}(A^T A) = \mathrm{rank}(A)$ by Lemma 2, there are exactly $r$ non-zero eigenvalues $\alpha_1, \cdots, \alpha_r$ of $A^T A$. We may assume that $\alpha_1, \cdots, \alpha_n$ are ordered in descending order as
$$
\alpha_1 \ge \alpha_2 \ge \cdots \ge \alpha_r > 0 = \alpha_{r+1} = \cdots = \alpha_n.
$$
Define, for $i = 1, \cdots, r$,
$$
u_i = \frac{A v_i}{|A v_i|} = \frac{A v_i}{\sqrt{\alpha_i}},
$$
where the last equality uses the fact that $|A v_i| = \sqrt{\alpha_i}$ (since $|A v_i|^2 = v_i^T A^T A v_i = \alpha_i |v_i|^2 = \alpha_i$) for $i = 1, \cdots, n$. Thus we have
$$
A v_i = \sqrt{\alpha_i}\, u_i,
$$
for $i = 1, \cdots, r$. Now extend $\{u_1, \cdots, u_r\}$ to an orthonormal basis $\{u_1, \cdots, u_m\}$ of $\mathbb{R}^m$. For $i \ge r + 1$, $\alpha_i = 0$ and $A v_i = 0$. Then we can write
$$
A v_j = \sqrt{\alpha_j}\, u_j
$$
for $j = 1, \cdots, n$. In other words,
$$
u_i^T A v_j = \sqrt{\alpha_j}\, \delta_{ij}. \tag{9}
$$
Let us write these relations in matrix form. Define an $n \times n$ orthogonal matrix $V$ by
$$
V = \begin{pmatrix} | & & | \\ v_1 & \cdots & v_n \\ | & & | \end{pmatrix},
$$
where $v_j$ is written as the $j$-th column vector. Define also an $m \times m$ orthogonal matrix $U$ by
$$
U = \begin{pmatrix} | & & | \\ u_1 & \cdots & u_m \\ | & & | \end{pmatrix}.
$$
Therefore the relation (9) is written in matrix form as
$$
U^T A V = D, \tag{10}
$$
where $D$ is the $m \times n$ "diagonal" matrix
$$
D = \begin{pmatrix}
\sqrt{\alpha_1} &        &                 &   \\
                & \ddots &                 &   \\
                &        & \sqrt{\alpha_r} &   \\
                &        &                 & O
\end{pmatrix},
$$
i.e., the matrix whose first $r$ diagonal entries are $\sqrt{\alpha_1}, \cdots, \sqrt{\alpha_r}$ and whose remaining entries are all zero, the zero block $O$ being shaped so that $D$ has $m$ rows and $n$ columns (whether $m \ge n$ or $m \le n$). From (10) we get
$$
A = U D V^T,
$$
which is called the singular value decomposition of $A$.
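The construction above can be traced numerically. Here is a sketch of my own (building the factors from `numpy.linalg.eigh` applied to $A^T A$, rather than calling a library SVD) that follows the proof and checks $A = U D V^T$:

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 5, 3
A = rng.normal(size=(m, n))                        # hypothetical matrix (rank n here)

# Step 1: eigenvectors of the symmetric matrix A^T A, sorted by descending eigenvalue.
alphas, V = np.linalg.eigh(A.T @ A)
order = np.argsort(alphas)[::-1]
alphas, V = alphas[order], V[:, order]
r = int(np.sum(alphas > 1e-10))                    # r = rank(A)

# Step 2: u_i = A v_i / |A v_i| = A v_i / sqrt(alpha_i), for i = 1, ..., r.
U = np.zeros((m, m))
U[:, :r] = (A @ V[:, :r]) / np.sqrt(alphas[:r])

# Step 3: extend u_1, ..., u_r to an orthonormal basis of R^m (orthogonal complement).
P = np.eye(m) - U[:, :r] @ U[:, :r].T
U[:, r:] = np.linalg.svd(P)[0][:, :m - r]

# Step 4: the m x n "diagonal" matrix D with sqrt(alpha_i) on the diagonal.
D = np.zeros((m, n))
D[:r, :r] = np.diag(np.sqrt(alphas[:r]))

print(np.allclose(A, U @ D @ V.T))                 # True: A = U D V^T
print(np.allclose(U.T @ U, np.eye(m)), np.allclose(V.T @ V, np.eye(n)))
```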

Lemma 2. For any $m \times n$ matrix $A$,
$$
\mathrm{rank}(A^T A) = \mathrm{rank}(A).
$$

Proof. Since $x^T A^T A x = |Ax|^2$, it is easy to see that $Ax = 0$ if and only if $A^T A x = 0$. Thus $\ker(A) = \ker(A^T A)$, i.e. $\mathrm{nullity}(A) = \mathrm{nullity}(A^T A)$. Since $\mathrm{rank} + \mathrm{nullity} = n$ for both matrices, the claim follows. $\square$
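A quick numerical illustration of Lemma 2, on a deliberately rank-deficient hypothetical matrix:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 4))   # a 6 x 4 matrix of rank 2
print(np.linalg.matrix_rank(A), np.linalg.matrix_rank(A.T @ A))   # 2 2
```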
