Lecture 7: Principal Component Analysis (PCA) (Draft: Version 0.9.1)
Topics to be covered:
• SVD formulation
1.1 Basics of PCA
Let $\{x^{(i)}\}_{i=1}^N$ be a dataset with no labels whose data matrix $X$ is denoted by
$$
X = \begin{pmatrix} (x^{(1)})^T \\ \vdots \\ (x^{(i)})^T \\ \vdots \\ (x^{(N)})^T \end{pmatrix}
= \begin{pmatrix}
x^{(1)}_1 & \cdots & x^{(1)}_j & \cdots & x^{(1)}_d \\
\vdots & & \vdots & & \vdots \\
x^{(i)}_1 & \cdots & x^{(i)}_j & \cdots & x^{(i)}_d \\
\vdots & & \vdots & & \vdots \\
x^{(N)}_1 & \cdots & x^{(N)}_j & \cdots & x^{(N)}_d
\end{pmatrix}. \tag{1}
$$
Lowering the superscript, let $x_{ij} = x^{(i)}_j$. Then the data matrix $X$ is written as
$$
X = \begin{pmatrix}
x_{11} & \cdots & x_{1j} & \cdots & x_{1d} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \cdots & x_{ij} & \cdots & x_{id} \\
\vdots & & \vdots & & \vdots \\
x_{N1} & \cdots & x_{Nj} & \cdots & x_{Nd}
\end{pmatrix}. \tag{2}
$$
Note that the rows of $X$ represent data points and the columns the values of features. In particular, we may view the $j$-th column as a random variable $Z_j$ whose IID samples are the entries of that column. This way, we identify
$$
Z_j = [x_{1j}, \cdots, x_{Nj}]^T.
$$
Let $\mu_j = E[Z_j]$, i.e.,
$$
\mu_j = \frac{1}{N}\sum_{i=1}^N x_{ij}
$$
for $j = 1, \cdots, d$, and let
$$
\mu = [\mu_1, \cdots, \mu_d]^T.
$$
If we think of each data point $x^{(i)}$ as a point in $\mathbb{R}^d$, this $\mu$ represents the centroid of all data points in $\mathbb{R}^d$. Define the normalized centered data matrix $\tilde{X}$ by
$$
\tilde{X} = \frac{1}{\sqrt{N}}\begin{pmatrix} (x^{(1)} - \mu)^T \\ \vdots \\ (x^{(i)} - \mu)^T \\ \vdots \\ (x^{(N)} - \mu)^T \end{pmatrix}
= \frac{1}{\sqrt{N}}\begin{pmatrix}
x_{11} - \mu_1 & \cdots & x_{1j} - \mu_j & \cdots & x_{1d} - \mu_d \\
\vdots & & \vdots & & \vdots \\
x_{i1} - \mu_1 & \cdots & x_{ij} - \mu_j & \cdots & x_{id} - \mu_d \\
\vdots & & \vdots & & \vdots \\
x_{N1} - \mu_1 & \cdots & x_{Nj} - \mu_j & \cdots & x_{Nd} - \mu_d
\end{pmatrix}. \tag{3}
$$
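As a concrete illustration, here is a minimal NumPy sketch of how $\mu$ and $\tilde{X}$ in (3) might be formed; the randomly generated matrix `X` simply stands in for an actual dataset, and the variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3                      # number of data points and number of features
X = rng.normal(size=(N, d))        # stand-in data matrix; rows are data points x^(i)

mu = X.mean(axis=0)                # centroid mu = [mu_1, ..., mu_d]
X_tilde = (X - mu) / np.sqrt(N)    # normalized centered data matrix, as in (3)

print(mu.shape)                    # (d,)
print(X_tilde.shape)               # (N, d)
```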
The gist of PCA is to successively find the directions along which the data spread out the most. In particular, our objective is to find a unit vector $v \in \mathbb{R}^d$ that maximizes
$$
F(v) = \frac{1}{N}\sum_{\ell=1}^N |\mathrm{Proj}_v(x^{(\ell)} - \mu)|^2
= \frac{1}{N}\sum_{\ell=1}^N |v^T(x^{(\ell)} - \mu)|^2. \tag{4}
$$
Now $v^T(x^{(\ell)} - \mu)$ is a scalar, which is equal to $(x^{(\ell)} - \mu)^Tv$. Thus we can write
$$
\frac{1}{N}\sum_{\ell=1}^N |v^T(x^{(\ell)} - \mu)|^2
= v^T\Big\{\frac{1}{N}\sum_{\ell=1}^N (x^{(\ell)} - \mu)(x^{(\ell)} - \mu)^T\Big\}v.
$$
Define the covariance matrix
$$
C = \frac{1}{N}\sum_{\ell=1}^N (x^{(\ell)} - \mu)(x^{(\ell)} - \mu)^T,
$$
so that $F(v) = v^TCv$. Note also that, by (3),
$$
C = \tilde{X}^T\tilde{X}.
$$
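The two expressions for $C$ above (the averaged outer products and $\tilde{X}^T\tilde{X}$) can be checked to agree numerically; this is only a sketch on the same kind of toy data as before.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
X = rng.normal(size=(N, d))
mu = X.mean(axis=0)
X_tilde = (X - mu) / np.sqrt(N)

# Covariance matrix as the average of outer products of centered data points
C_outer = sum(np.outer(x - mu, x - mu) for x in X) / N

# Covariance matrix as X_tilde^T X_tilde
C_tilde = X_tilde.T @ X_tilde

print(np.allclose(C_outer, C_tilde))   # True
```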
The way to maximize (4) subject to $|v| = 1$ is to use the Lagrange multiplier
$$
L(v, \lambda) = \frac{1}{N}\sum_{\ell=1}^N |v^T(x^{(\ell)} - \mu)|^2 - \lambda(|v|^2 - 1)
= v^TCv - \lambda(v^Tv - 1).
$$
Thus by computing the derivative and setting it equal to zero, we have
$$
\frac{\partial L}{\partial v} = 2Cv - 2\lambda v = 0,
$$
i.e.,
$$
Cv = \lambda v.
$$
This equation says $v$ must be an eigenvector of $C$. Since $C$ is a symmetric matrix, it is diagonalizable, and the diagonalization theorem says $C$ can be written as
$$
C = \lambda_1 v_1v_1^T + \cdots + \lambda_d v_dv_d^T,
$$
where $\{v_1, \cdots, v_d\}$ is an orthonormal basis of $\mathbb{R}^d$ and $\lambda_1, \cdots, \lambda_d$ are the eigenvalues of $C$ ordered in descending order: $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0$. Check that
$$
F(v_i) = v_i^TCv_i = \lambda_i.
$$
Thus $F(v)$ is largest when $v = v_1$.
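A small sketch of this conclusion: compute the eigen-decomposition of $C$ with `numpy.linalg.eigh` (which returns eigenvalues in ascending order, so we reverse them) and check that $F(v_1) = \lambda_1$ and that $v_1$ beats an arbitrary unit vector. The toy data here are random and only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 4
X = rng.normal(size=(N, d)) @ rng.normal(size=(d, d))   # correlated toy data
mu = X.mean(axis=0)
C = (X - mu).T @ (X - mu) / N                           # covariance matrix C

lam, V = np.linalg.eigh(C)          # ascending eigenvalues, orthonormal eigenvectors
lam, V = lam[::-1], V[:, ::-1]      # reorder so that lambda_1 >= ... >= lambda_d

def F(v):
    # Average squared length of the projections of the centered data onto v
    return np.mean(((X - mu) @ v) ** 2)

print(np.isclose(F(V[:, 0]), lam[0]))   # F(v_1) = lambda_1

u = rng.normal(size=d)
u /= np.linalg.norm(u)                  # an arbitrary unit vector, for comparison
print(F(V[:, 0]) >= F(u))               # True: v_1 maximizes F over unit vectors
```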
Let us now look at the multi-dimensional version. So let $W$ be a linear subspace of $\mathbb{R}^d$ and let $\mathrm{Proj}_W$ be the orthogonal projection of $\mathbb{R}^d$ onto $W$. We want to find the subspace in which the data spread out the most.

Proposition 1. Let
$$
\mathrm{Proj}_W(x^{(i)} - \mu)
$$
be the orthogonal projection of the centered $i$-th data point onto the subspace $W$. Define
$$
F(W) = \frac{1}{N}\sum_{\ell=1}^N |\mathrm{Proj}_W(x^{(\ell)} - \mu)|^2.
$$
Then, among all $k$-dimensional subspaces $W$, $F(W)$ is maximized when $W = \mathrm{span}\{v_1, \cdots, v_k\}$, and the maximum value is $\lambda_1 + \cdots + \lambda_k$.

To see why, consider the case $\dim W = 2$ and let $\{v, w\}$ be an orthonormal basis of $W$. Then
$$
\frac{1}{N}\sum_{\ell=1}^N |\mathrm{Proj}_W(x^{(\ell)} - \mu)|^2
= \frac{1}{N}\sum_{\ell=1}^N \Big\{ |v^T(x^{(\ell)} - \mu)|^2 + |w^T(x^{(\ell)} - \mu)|^2 \Big\}
= v^TCv + w^TCw.
$$
To maximize this subject to $|v| = |w| = 1$, use the Lagrange multiplier
$$
L(v, w, \alpha, \beta) = v^TCv + w^TCw - \alpha(v^Tv - 1) - \beta(w^Tw - 1).
$$
Upon setting
$$
\frac{\partial L}{\partial v} = 2Cv - 2\alpha v = 0, \qquad
\frac{\partial L}{\partial w} = 2Cw - 2\beta w = 0,
$$
we get
$$
Cv = \alpha v, \qquad Cw = \beta w.
$$
Thus $v$ and $w$ must again be eigenvectors of $C$, and $v^TCv + w^TCw$ is largest when they are eigenvectors corresponding to the two largest eigenvalues, in which case
$$
F(v, w) = \lambda_1 + \lambda_2.
$$
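A quick numerical check of the two-dimensional statement, on the same kind of toy data as before: the average squared length of the projections onto $W = \mathrm{span}\{v_1, v_2\}$ equals $\lambda_1 + \lambda_2$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 4
X = rng.normal(size=(N, d)) @ rng.normal(size=(d, d))
mu = X.mean(axis=0)
C = (X - mu).T @ (X - mu) / N

lam, V = np.linalg.eigh(C)
lam, V = lam[::-1], V[:, ::-1]                 # lambda_1 >= lambda_2 >= ...

# F(W) for W = span{v_1, v_2}: the columns of P form an orthonormal basis of W
P = V[:, :2]
F_W = np.mean(np.sum(((X - mu) @ P) ** 2, axis=1))

print(np.isclose(F_W, lam[0] + lam[1]))        # True: F(v_1, v_2) = lambda_1 + lambda_2
```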
Let us now reformulate PCA in terms of the singular value decomposition (SVD) of $\tilde{X}$ (see Section 2.1):
$$
\tilde{X} = UDV^T, \tag{6}
$$
where $U$ is an $N \times N$ orthogonal matrix, $V$ a $d \times d$ orthogonal matrix, and $D$ an $N \times d$ "diagonal" matrix. Therefore
$$
C = \tilde{X}^T\tilde{X} = VD^TU^TUDV^T = VD^TDV^T = V\Lambda V^T, \tag{7}
$$
where $\Lambda = D^TD$ is a $d \times d$ diagonal matrix. Hence
$$
CV = V\Lambda, \qquad \text{i.e.,} \quad Cv_j = \lambda_j v_j,
$$
so the columns of $V$ are eigenvectors of $C$ and the diagonal entries of $\Lambda$ are the eigenvalues $\lambda_j$.
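In code, (6)-(7) say that the right singular vectors of $\tilde{X}$ are eigenvectors of $C$ and the squared singular values are its eigenvalues. The sketch below uses the thin SVD (`full_matrices=False`), which suffices here since only the first $d$ columns of $U$ ever meet a non-zero entry of $D$; `np.linalg.svd` returns $V^T$ and the singular values in descending order.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 4
X = rng.normal(size=(N, d)) @ rng.normal(size=(d, d))
mu = X.mean(axis=0)
X_tilde = (X - mu) / np.sqrt(N)
C = X_tilde.T @ X_tilde

U, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)   # X_tilde = U diag(s) Vt
V = Vt.T

lam = np.linalg.eigvalsh(C)[::-1]          # eigenvalues of C, descending
print(np.allclose(s ** 2, lam))            # Lambda = D^T D: squared singular values
print(np.allclose(C @ V, V * s ** 2))      # C v_j = lambda_j v_j, column by column
```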
Write $R = UD$ (equivalently, $R = \tilde{X}V$), an $N \times d$ matrix with entries $r_{ij}$, so that $\tilde{X} = RV^T$ and hence $\tilde{X}^T = VR^T$. Let $e_i$ be the $i$-th standard basis vector of $\mathbb{R}^N$. Thus
$$
VR^Te_i = \sum_{j=1}^d r_{ij}Ve_j = \sum_{j=1}^d r_{ij}v_j.
$$
Therefore, by (3), we have
$$
\frac{1}{\sqrt{N}}(x^{(i)} - \mu) = \tilde{X}^Te_i = VR^Te_i = \sum_{j=1}^d r_{ij}v_j.
$$
(iii) (New features) The new features are given by the first $k$ eigenvectors $v_1, \cdots, v_k$ (the first $k$ columns of $V$), and $W = \mathrm{span}\{v_1, \cdots, v_k\}$; see the sketch below.
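A sketch of this reduction step (with $k = 2$ chosen arbitrarily and variable names ours): writing $R = UD$ as above, row $i$ of $R$ holds the coordinates $r_{i1}, \cdots, r_{id}$ of $\frac{1}{\sqrt{N}}(x^{(i)} - \mu)$ with respect to $v_1, \cdots, v_d$, and keeping only the first $k$ coordinates amounts to projecting onto $W = \mathrm{span}\{v_1, \cdots, v_k\}$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, k = 200, 4, 2
X = rng.normal(size=(N, d)) @ rng.normal(size=(d, d))
mu = X.mean(axis=0)
X_tilde = (X - mu) / np.sqrt(N)

U, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)
V = Vt.T
R = U * s                          # R = U D; row i holds the coordinates r_{i1}, ..., r_{id}

# The expansion (1/sqrt(N)) (x^(i) - mu) = sum_j r_{ij} v_j, for every i at once
print(np.allclose(X_tilde, R @ Vt))            # True

# New k-dimensional features: coordinates with respect to v_1, ..., v_k
Z = (X - mu) @ V[:, :k]                        # equivalently sqrt(N) * R[:, :k]
print(np.allclose(Z, np.sqrt(N) * R[:, :k]))   # True
print(Z.shape)                                 # (N, k)
```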
Figure 1: Subspace W
Remark. One must exercise judicious caution when applying PCA. The most salient issue is the problem of scale. For example, consider the regression problem of estimating the weight of a person from bodily measurements. Suppose one uses the millimeter as the unit of measurement for the waist girth instead of the customary centimeters. Then the quantity representing the girth will be exaggerated tenfold in the PCA. Or, if one uses the kilometer as the unit of measurement, the girth data may look like it changes so little from one person to another that it could be tossed out as an irrelevant feature.
In order to prevent such discrepancies, people frequently normalize the data before embarking on PCA. For instance, the random variable representing each feature is normalized to have mean zero and variance one. Namely, $Z_j$ is replaced with $\dfrac{Z_j - \mu_j}{\sigma_j}$. It really amounts to using the correlation matrix rather than the covariance matrix in our PCA.

This cure, however, is not a panacea. For example, if $\sigma_j$ is very small, the random variable $Z_j$ may receive a disproportionate boost through this procedure. So one must exercise judicious caution and be aware of the relative importance of each feature in the first place, so that PCA can truly suppress unimportant features while boosting more important ones.
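A sketch of the standardization just described, on two made-up features whose units differ wildly (the names `waist_mm` and `height_m` are purely illustrative): each column is replaced by $(Z_j - \mu_j)/\sigma_j$, and PCA on the standardized data amounts to PCA on the correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500
waist_mm = rng.normal(800.0, 100.0, size=N)   # waist girth, in millimeters
height_m = rng.normal(1.70, 0.08, size=N)     # height, in meters
X = np.column_stack([waist_mm, height_m])

# Standardize each feature: Z_j -> (Z_j - mu_j) / sigma_j
mu = X.mean(axis=0)
sigma = X.std(axis=0, ddof=1)
X_std = (X - mu) / sigma

# Raw covariance is dominated by whichever feature has the larger numerical scale;
# the covariance of the standardized data is exactly the correlation matrix.
print(np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1])   # scale-dependent spectrum
print(np.allclose(np.cov(X_std, rowvar=False),
                  np.corrcoef(X, rowvar=False)))           # True
```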
2 Mathematical supplements
2.1 Singular value decomposition (SVD)
The singular value decomposition (SVD) of a matrix is perhaps one of the
most useful techniques in numerical linear algebra. Its impact is far-reaching
and one cannot do without it in many applications. Here we introduce what it is and give an intuitive, geometric proof.
Let $A$ be an $m \times n$ matrix. $A$ can be viewed as a linear map $A : \mathbb{R}^n \to \mathbb{R}^m$. Since diagonal matrices are the easiest to deal with, it would be great if $A$ were a diagonal matrix. Of course, this cannot be true in general. So the next best thing to hope for is to find some orthonormal bases $\{v_1, \cdots, v_n\}$ of $\mathbb{R}^n$ and $\{u_1, \cdots, u_m\}$ of $\mathbb{R}^m$ with regard to which $A$ is diagonal, i.e.,
$$
Av_i = \sigma_i u_i,
$$
for some $\sigma_i$, for $i = 1, \cdots, n$. If so, the first thing that must hold is that there is an orthonormal basis $\{v_1, \cdots, v_n\}$ such that $Av_1, \cdots, Av_n$ are mutually orthogonal, i.e.,
$$
(Av_j)^T(Av_i) = 0,
$$
for all $j \neq i$. Thus for any fixed $i$ and for all $j \neq i$, $v_j^T(A^TAv_i) = 0$, which must necessarily mean that $A^TAv_i$ is a scalar multiple of $v_i$, i.e., there exists some $\alpha_i \in \mathbb{R}$ such that $A^TAv_i = \alpha_i v_i$. Therefore $v_1, \cdots, v_n$ must be eigenvectors of the $n \times n$ symmetric matrix $A^TA$, which are well known to exist and are relatively easy to find numerically. So the idea is to define
$$
u_i = \frac{Av_i}{|Av_i|}
$$
and declare victory, because $Av_i = |Av_i|\,u_i$ then holds. The gist of the proof is making this line of thought mathematically correct.
Let $\alpha_1, \cdots, \alpha_n$ be the eigenvalues of $A^TA$, with corresponding orthonormal eigenvectors $v_1, \cdots, v_n$. Let $r = \mathrm{rank}(A) \leq n$. Since $\mathrm{rank}(A^TA) = \mathrm{rank}(A)$ by Lemma 2, there are exactly $r$ non-zero eigenvalues $\alpha_1, \cdots, \alpha_r$ of $A^TA$. We may assume that $\alpha_1, \cdots, \alpha_n$ are ordered in descending order as
$$
\alpha_1 \geq \alpha_2 \geq \cdots \geq \alpha_n \geq 0.
$$
Define, for $i = 1, \cdots, r$,
$$
u_i = \frac{Av_i}{|Av_i|} = \frac{Av_i}{\sqrt{\alpha_i}},
$$
where the last equality uses the fact that $|Av_i| = \sqrt{\alpha_i}$ for $i = 1, \cdots, n$ (indeed, $|Av_i|^2 = v_i^TA^TAv_i = \alpha_i$). Thus we have
$$
Av_i = \sqrt{\alpha_i}\,u_i,
$$
for $i = 1, \cdots, r$. Now extend $\{u_1, \cdots, u_r\}$ to an orthonormal basis $\{u_1, \cdots, u_m\}$ of $\mathbb{R}^m$. For $i \geq r + 1$, $\alpha_i = 0$, so $Av_i = 0$. Then we can write
$$
Av_j = \sqrt{\alpha_j}\,u_j
$$
for $j = 1, \cdots, n$. In other words,
$$
u_i^TAv_j = \sqrt{\alpha_j}\,\delta_{ij}. \tag{9}
$$
Let us write these relations in matrix form. Define an $n \times n$ orthogonal matrix $V$ by
$$
V = \begin{pmatrix} | & & | \\ v_1 & \cdots & v_n \\ | & & | \end{pmatrix},
$$
where $v_j$ is written as the $j$-th column vector. Define also an $m \times m$ orthogonal matrix $U$ by
$$
U = \begin{pmatrix} | & & | \\ u_1 & \cdots & u_m \\ | & & | \end{pmatrix}.
$$
Therefore the relation (9) is written in matrix form as
$$
U^TAV = D, \tag{10}
$$
where $D$ is an $m \times n$ diagonal matrix of the form
$$
D = \begin{pmatrix}
\sqrt{\alpha_1} & 0 & \cdots & \cdots & 0 \\
0 & \ddots & 0 & \cdots & \vdots \\
\vdots & 0 & \sqrt{\alpha_r} & & 0 \\
0 & \cdots & 0 & \ddots & \vdots \\
\vdots & & & & 0 \\
0 & \cdots & & \cdots & 0 \\
\vdots & & & & \vdots \\
0 & \cdots & & \cdots & 0
\end{pmatrix}
$$
when $m \geq n$, or
$$
D = \begin{pmatrix}
\sqrt{\alpha_1} & 0 & \cdots & 0 & 0 & \cdots & 0 \\
0 & \ddots & & \vdots & \vdots & & \vdots \\
\vdots & & \sqrt{\alpha_r} & 0 & \vdots & & \vdots \\
\vdots & & 0 & \ddots & & & \vdots \\
0 & \cdots & & \cdots & 0 & \cdots & 0
\end{pmatrix}
$$
when $m < n$; in both cases the only non-zero entries are $D_{ii} = \sqrt{\alpha_i}$ for $i = 1, \cdots, r$.
Multiplying (10) by $U$ on the left and by $V^T$ on the right, we obtain the singular value decomposition
$$
A = UDV^T,
$$
and the diagonal entries $\sqrt{\alpha_1}, \cdots, \sqrt{\alpha_r}$ of $D$ are the non-zero singular values of $A$.
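The construction above can be carried out numerically almost verbatim; the sketch below assumes a random $m \times n$ matrix with $m \geq n$ (so that $r = n$ almost surely) and uses `np.linalg.svd` only as an independent check.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 4
A = rng.normal(size=(m, n))

# Lemma 2, numerically: rank(A^T A) = rank(A)
print(np.linalg.matrix_rank(A.T @ A) == np.linalg.matrix_rank(A))   # True

# Eigen-decomposition of A^T A: eigenvalues alpha_i >= 0, orthonormal eigenvectors v_i
alpha, V = np.linalg.eigh(A.T @ A)
alpha, V = alpha[::-1], V[:, ::-1]       # descending order
alpha = np.clip(alpha, 0.0, None)        # guard against tiny negative round-off
r = np.linalg.matrix_rank(A)             # here r = n for a random A

# u_i = A v_i / |A v_i| = A v_i / sqrt(alpha_i), for i = 1, ..., r
U_r = (A @ V[:, :r]) / np.sqrt(alpha[:r])

# Extend {u_1, ..., u_r} to an orthonormal basis of R^m: project random vectors
# onto the orthogonal complement of span{u_1, ..., u_r} and orthonormalize them.
M = rng.normal(size=(m, m - r))
M -= U_r @ (U_r.T @ M)
U_extra, _ = np.linalg.qr(M)
U = np.column_stack([U_r, U_extra])

# Assemble the m x n "diagonal" matrix D and verify (10) and the SVD itself
D = np.zeros((m, n))
D[:r, :r] = np.diag(np.sqrt(alpha[:r]))
print(np.allclose(U.T @ A @ V, D))       # relation (10): U^T A V = D
print(np.allclose(A, U @ D @ V.T))       # A = U D V^T
print(np.allclose(np.sqrt(alpha[:r]),
                  np.linalg.svd(A, compute_uv=False)[:r]))   # same singular values
```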
Lemma 2. $\mathrm{rank}(A^TA) = \mathrm{rank}(A)$.