Dimensionality Reduction with Principal Component Analysis

Liang Zheng
Australian National University
[email protected]
Discovering faster matrix multiplication algorithms with reinforcement learning (Fawzi et al., Nature 2022)

• a neural network architecture that incorporates problem-specific inductive biases
• a procedure to generate useful synthetic data
• a recipe to leverage symmetries of the problem

• The traditional algorithm taught in school multiplies a 4×5 by a 5×5 matrix using 100 multiplications.
• This number was reduced to 80 with human ingenuity.
• AlphaTensor has found algorithms that perform the same operation using just 76 multiplications.
Idea of PCA

House prices and areas of five houses (plotted against price and area axes in the figure):

  House   Price (million)   Area (100 m²)
  a       10                10
  b       2                 2
  c       7                 7
  d       1                 1
  e       5                 5

We subtract the means from the data points:

  House   Price (normalised)   Area (normalised)
  a       5                    5
  b       -3                   -3
  c       2                    2
  d       -4                   -4
  e       0                    0

We rotate the price and area axes to obtain the first and second principal component axes:

  House   Price (norm.)   Area (norm.)   First principal component   Second principal component
  a       5               5              7.07                        0
  b       -3              -3             -4.24                       0
  c       2               2              2.82                        0
  d       -4              -4             -5.66                       0
  e       0               0              0                           0
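A minimal NumPy sketch (not part of the slides) that reproduces the rotation above on the mean-subtracted house data; note that the sign of an eigenvector is arbitrary, so the projected coordinates may come out negated.

```python
import numpy as np

# Mean-subtracted house data from the table: rows are houses a-e, columns are (price, area).
X = np.array([[5, 5], [-3, -3], [2, 2], [-4, -4], [0, 0]], dtype=float)

# Data covariance matrix S = (1/N) sum_n x_n x_n^T (the data already has zero mean).
N = X.shape[0]
S = X.T @ X / N

# Eigendecomposition; eigh returns eigenvalues in ascending order for symmetric matrices.
eigvals, eigvecs = np.linalg.eigh(S)
b1 = eigvecs[:, -1]     # eigenvector with the largest eigenvalue = first principal component

# Coordinates of the houses along the first principal component.
print(X @ b1)           # approx. [ 7.07 -4.24  2.83 -5.66  0.  ] (up to sign)
```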
Motivation
• High-dimensional data, such as images, is hard to analyze, interpret, and visualize, and expensive to store.
• Good news:
  • High-dimensional data is often overcomplete, i.e., many dimensions are redundant and can be explained by a combination of other dimensions.
  • Furthermore, dimensions in high-dimensional data are often correlated, so that the data possesses an intrinsic lower-dimensional structure.
• The data in (a) does not vary much in the $x_2$-direction, so we can express it as if it were on a line, with nearly no loss; see (b).
• To describe the data in (b), only the $x_1$-coordinate is required, and the data lies in a one-dimensional subspace of $\mathbb{R}^2$.
10.1 Problem Setting
• In PCA, we are interested in finding projections $\tilde{\mathbf{x}}_n$ of data points $\mathbf{x}_n$ that are as similar to the original data points as possible, but which have a significantly lower intrinsic dimensionality.
• We consider an i.i.d. dataset $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, $\mathbf{x}_n \in \mathbb{R}^D$, with mean $\mathbf{0}$, that possesses the data covariance matrix
$$\mathbf{S} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^\top$$
• We assume there exists a low-dimensional compressed representation (code)
$$\mathbf{z}_n = \mathbf{B}^\top \mathbf{x}_n \in \mathbb{R}^M$$
of $\mathbf{x}_n$, where we define the projection matrix
$$\mathbf{B} := [\mathbf{b}_1, \ldots, \mathbf{b}_M] \in \mathbb{R}^{D \times M}$$
• Example (coordinate representation/code)
  • Consider $\mathbb{R}^2$ with the canonical basis $\mathbf{e}_1 = [1, 0]^\top$, $\mathbf{e}_2 = [0, 1]^\top$.
  • Any $\mathbf{x} \in \mathbb{R}^2$ can be represented as a linear combination of these basis vectors, e.g.,
$$\begin{bmatrix} 5 \\ 3 \end{bmatrix} = 5\mathbf{e}_1 + 3\mathbf{e}_2$$
  • However, vectors of the form
$$\tilde{\mathbf{x}} = \begin{bmatrix} 0 \\ z \end{bmatrix} \in \mathbb{R}^2, \quad z \in \mathbb{R},$$
    can always be written as $0\mathbf{e}_1 + z\mathbf{e}_2$.
  • To represent these vectors, it is sufficient to store the coordinate/code $z$ of $\tilde{\mathbf{x}}$ with respect to the basis vector $\mathbf{e}_2$.
10.2 PCA from Maximum Variance Perspective
• We ignored the $x_2$-coordinate of the data because it did not add much information: the compressed data in (b) is similar to the original data in (a).
• We derive PCA so as to maximize the variance in the low-dimensional representation of the data, to retain as much information as possible.
• Retaining most information after data compression is equivalent to capturing the largest amount of variance in the low-dimensional code (Hotelling, 1933).
10.2.1 Direction with Maximal Variance
• Data centering
  • In the data covariance matrix, we assume centered data:
$$\mathbf{S} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^\top$$
  • Let us assume that $\boldsymbol{\mu}$ is the mean of the data. Using the properties of the variance (recall $\mathbb{V}[\mathbf{x} + \mathbf{y}] = \mathbb{V}[\mathbf{x}] + \mathbb{V}[\mathbf{y}] + \mathrm{Cov}[\mathbf{x}, \mathbf{y}] + \mathrm{Cov}[\mathbf{y}, \mathbf{x}]$, and constant shifts do not change the variance), we obtain
$$\mathbb{V}_{\mathbf{z}}[\mathbf{z}] = \mathbb{V}_{\mathbf{x}}[\mathbf{B}^\top(\mathbf{x} - \boldsymbol{\mu})] = \mathbb{V}_{\mathbf{x}}[\mathbf{B}^\top\mathbf{x} - \mathbf{B}^\top\boldsymbol{\mu}] = \mathbb{V}_{\mathbf{x}}[\mathbf{B}^\top\mathbf{x}]$$
  • That is, the variance of the low-dimensional code does not depend on the mean of the data.
  • With this assumption, the mean of the low-dimensional code is also $\mathbf{0}$, since
$$\mathbb{E}_{\mathbf{z}}[\mathbf{z}] = \mathbb{E}_{\mathbf{x}}[\mathbf{B}^\top\mathbf{x}] = \mathbf{B}^\top\mathbb{E}_{\mathbf{x}}[\mathbf{x}] = \mathbf{0}$$
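A quick numerical check (illustrative only, with hypothetical random data) that the variance of the code $\mathbf{z} = \mathbf{B}^\top\mathbf{x}$ does not depend on the data mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: N = 500 points in D = 3 dimensions with a nonzero mean.
X = rng.normal(loc=5.0, scale=2.0, size=(500, 3))
B = np.linalg.qr(rng.normal(size=(3, 2)))[0]     # orthonormal columns, M = 2

z_raw = X @ B                                    # code computed from the raw data
z_centred = (X - X.mean(axis=0)) @ B             # code computed from the centred data

# Shifting the data by its mean shifts the code by a constant, so the covariance is unchanged.
print(np.allclose(np.cov(z_raw.T), np.cov(z_centred.T)))   # True
```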
• To maximize the variance of the low-dimensional code, we first seek a single vector $\mathbf{b}_1 \in \mathbb{R}^D$ that maximizes the variance of the projected data, i.e., we aim to maximize the variance of the first coordinate $z_1$ of $\mathbf{z} \in \mathbb{R}^M$, so that
$$V_1 := \mathbb{V}[z_1] = \frac{1}{N} \sum_{n=1}^{N} z_{1n}^2$$
is maximized, where we defined $z_{1n}$ as the first coordinate of the low-dimensional representation $\mathbf{z}_n \in \mathbb{R}^M$ of $\mathbf{x}_n \in \mathbb{R}^D$. $z_{1n}$ is given by
$$z_{1n} = \mathbf{b}_1^\top \mathbf{x}_n,$$
i.e., it is the coordinate of the orthogonal projection of $\mathbf{x}_n$ onto the one-dimensional subspace spanned by $\mathbf{b}_1$. We substitute $z_{1n}$ into $V_1$ and obtain
$$V_1 = \frac{1}{N} \sum_{n=1}^{N} (\mathbf{b}_1^\top \mathbf{x}_n)^2 = \frac{1}{N} \sum_{n=1}^{N} \mathbf{b}_1^\top \mathbf{x}_n \mathbf{x}_n^\top \mathbf{b}_1 = \mathbf{b}_1^\top \Big( \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^\top \Big) \mathbf{b}_1 = \mathbf{b}_1^\top \mathbf{S} \mathbf{b}_1,$$
where $\mathbf{S}$ is the data covariance matrix.
• We further restrict all solutions to $\|\mathbf{b}_1\|^2 = 1$.
• We have the following constrained optimization problem:
$$\max_{\mathbf{b}_1} \; \mathbf{b}_1^\top \mathbf{S} \mathbf{b}_1 \quad \text{subject to} \quad \|\mathbf{b}_1\|^2 = 1$$
• We obtain the Lagrangian (not required in this course)
$$\mathfrak{L}(\mathbf{b}_1, \lambda_1) = \mathbf{b}_1^\top \mathbf{S} \mathbf{b}_1 + \lambda_1 (1 - \mathbf{b}_1^\top \mathbf{b}_1)$$
• The partial derivatives of $\mathfrak{L}$ with respect to $\mathbf{b}_1$ and $\lambda_1$ are
$$\frac{\partial \mathfrak{L}}{\partial \mathbf{b}_1} = 2\mathbf{b}_1^\top \mathbf{S} - 2\lambda_1 \mathbf{b}_1^\top, \qquad \frac{\partial \mathfrak{L}}{\partial \lambda_1} = 1 - \mathbf{b}_1^\top \mathbf{b}_1$$
• Setting these partial derivatives to $\mathbf{0}$ gives us the relations
$$\mathbf{S}\mathbf{b}_1 = \lambda_1 \mathbf{b}_1, \qquad \mathbf{b}_1^\top \mathbf{b}_1 = 1$$
• We see that $\mathbf{b}_1$ is an eigenvector of $\mathbf{S}$, and $\lambda_1$ is the corresponding eigenvalue. We rewrite our objective as
$$V_1 = \mathbf{b}_1^\top \mathbf{S} \mathbf{b}_1 = \lambda_1 \mathbf{b}_1^\top \mathbf{b}_1 = \lambda_1,$$
i.e., the variance of the data projected onto a one-dimensional subspace equals the eigenvalue associated with the basis vector $\mathbf{b}_1$ that spans this subspace.
• To maximize the variance of the low-dimensional code, we choose the basis vector associated with the largest eigenvalue of the data covariance matrix. This eigenvector is called the first principal component.
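A short NumPy sketch (hypothetical data, not from the slides) confirming that the first principal component is the top eigenvector of $\mathbf{S}$ and that the projected variance equals $\lambda_1$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical centred dataset: N = 1000 points in D = 4 dimensions.
X = rng.normal(size=(1000, 4)) @ np.diag([3.0, 2.0, 1.0, 0.5])
X -= X.mean(axis=0)
N = X.shape[0]

S = X.T @ X / N                          # data covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)     # ascending eigenvalues for symmetric S
b1, lam1 = eigvecs[:, -1], eigvals[-1]   # first principal component and its eigenvalue

# Variance of the projected data z_1n = b1^T x_n equals the largest eigenvalue: V1 = b1^T S b1 = lambda_1.
V1 = (X @ b1).var()                      # np.var divides by N, matching the slides
print(np.isclose(V1, lam1), np.isclose(b1 @ S @ b1, lam1))   # True True
```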
10.2.2 M-dimensional Subspace with Maximal Variance
• Assume we have found the $m-1$ eigenvectors of $\mathbf{S}$ that are associated with the largest $m-1$ eigenvalues.
• We want to find the $m$th principal component.
• We subtract the effect of the first $m-1$ principal components $\mathbf{b}_1, \ldots, \mathbf{b}_{m-1}$ from the data, and find principal components that compress the remaining information. We then arrive at the new data matrix
$$\hat{\mathbf{X}} := \mathbf{X} - \sum_{i=1}^{m-1} \mathbf{b}_i \mathbf{b}_i^\top \mathbf{X} = \mathbf{X} - \mathbf{B}_{m-1}\mathbf{X},$$
where $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_N] \in \mathbb{R}^{D \times N}$ contains the data points as column vectors and $\mathbf{B}_{m-1} := \sum_{i=1}^{m-1} \mathbf{b}_i \mathbf{b}_i^\top$ is a projection matrix that projects onto the subspace spanned by $\mathbf{b}_1, \ldots, \mathbf{b}_{m-1}$.
• To find the $m$th principal component, we maximize the variance
$$V_m = \mathbb{V}[z_m] = \frac{1}{N} \sum_{n=1}^{N} z_{mn}^2 = \frac{1}{N} \sum_{n=1}^{N} (\mathbf{b}_m^\top \hat{\mathbf{x}}_n)^2 = \mathbf{b}_m^\top \hat{\mathbf{S}} \mathbf{b}_m,$$
subject to $\|\mathbf{b}_m\|^2 = 1$, where we define $\hat{\mathbf{S}}$ as the data covariance matrix of the transformed dataset $\hat{\mathcal{X}} := \{\hat{\mathbf{x}}_1, \ldots, \hat{\mathbf{x}}_N\}$.
• The optimal solution $\mathbf{b}_m$ is the eigenvector of $\hat{\mathbf{S}}$ that is associated with the largest eigenvalue of $\hat{\mathbf{S}}$.
• In fact, we can derive that
$$\hat{\mathbf{S}}\mathbf{b}_m = \mathbf{S}\mathbf{b}_m = \lambda_m \mathbf{b}_m, \tag{1}$$
i.e., $\mathbf{b}_m$ is not only an eigenvector of $\mathbf{S}$ but also of $\hat{\mathbf{S}}$.
• Specifically, $\lambda_m$ is the largest eigenvalue of $\hat{\mathbf{S}}$ and the $m$th largest eigenvalue of $\mathbf{S}$, and both have the associated eigenvector $\mathbf{b}_m$.
• Considering (1) and $\mathbf{b}_m^\top \mathbf{b}_m = 1$, the variance of the data projected onto the $m$th principal component is
$$V_m = \mathbf{b}_m^\top \mathbf{S} \mathbf{b}_m = \lambda_m \mathbf{b}_m^\top \mathbf{b}_m = \lambda_m$$
• This means that the variance of the data, when projected onto an $M$-dimensional subspace, equals the sum of the eigenvalues that are associated with the corresponding eigenvectors of the data covariance matrix.
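A small sketch (hypothetical data) of the deflation step above: after removing the first $m-1$ principal components, the top eigenvector of $\hat{\mathbf{S}}$ coincides with the $m$th eigenvector of $\mathbf{S}$ and carries the same eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical centred data matrix X whose columns are data points (D = 5, N = 300).
X = rng.normal(size=(5, 300)) * np.array([[4.0], [3.0], [2.0], [1.0], [0.5]])
X -= X.mean(axis=1, keepdims=True)
N = X.shape[1]

S = X @ X.T / N
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]                 # sort eigenpairs in descending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

m = 3                                             # look for the 3rd principal component
B_prev = eigvecs[:, :m-1] @ eigvecs[:, :m-1].T    # projection onto the first m-1 PCs
X_hat = X - B_prev @ X                            # deflated data matrix
S_hat = X_hat @ X_hat.T / N

lam_hat, vecs_hat = np.linalg.eigh(S_hat)
b_m_hat = vecs_hat[:, -1]                         # top eigenvector of the deflated covariance

# It matches the m-th eigenvector of S (up to sign) and has the same eigenvalue.
print(np.isclose(abs(b_m_hat @ eigvecs[:, m-1]), 1.0))   # True
print(np.isclose(lam_hat[-1], eigvals[m-1]))             # True
```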
MNIST dataset
• 60,000 examples of handwritten digits 0 through 9.
• Each digit is a grayscale image of size 28×28, i.e., it contains 784 pixels.
• We can interpret every image in this dataset as a vector $\mathbf{x} \in \mathbb{R}^{784}$.

Example: eigenvalues of the MNIST digit "8"
(Figure: (a) the 200 largest eigenvalues; (b) variance captured by the principal components.)
• A 784-dimensional vector is used to represent an image.
• Taking all images of "8" in MNIST, we compute the eigenvalues of the data covariance matrix.
• We see that only a few of them have a value that differs significantly from 0.
• Most of the variance, when projecting data onto the subspace spanned by the corresponding eigenvectors, is captured by only a few principal components.
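A sketch of this experiment, assuming scikit-learn is installed and the OpenML "mnist_784" dataset can be downloaded (network access required); the exact numbers will depend on the data returned:

```python
import numpy as np
from sklearn.datasets import fetch_openml

# Fetch MNIST from OpenML (70,000 images of size 28x28, flattened to 784-dim vectors).
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X_all, y_all = mnist.data, mnist.target

X8 = X_all[y_all == "8"]                 # all images of the digit "8", shape (N, 784)
X8 = X8 - X8.mean(axis=0)                # centre the data

S = X8.T @ X8 / X8.shape[0]              # 784 x 784 data covariance matrix
eigvals = np.linalg.eigvalsh(S)[::-1]    # eigenvalues in descending order

# Only a few eigenvalues differ significantly from 0; a few PCs capture most of the variance.
captured = np.cumsum(eigvals) / eigvals.sum()
print(eigvals[:5])
print("PCs needed for 90% of the variance:", np.searchsorted(captured, 0.9) + 1)
```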
Overall
• To find an $M$-dimensional subspace of $\mathbb{R}^D$ that retains as much information as possible, we choose the columns of $\mathbf{B} = [\mathbf{b}_1, \ldots, \mathbf{b}_M] \in \mathbb{R}^{D \times M}$ as the $M$ eigenvectors of the data covariance matrix $\mathbf{S}$ that are associated with the $M$ largest eigenvalues.
• The maximum amount of variance PCA can capture with the first $M$ principal components is
$$V_M = \sum_{m=1}^{M} \lambda_m,$$
where the $\lambda_m$ are the $M$ largest eigenvalues of the data covariance matrix $\mathbf{S}$.
• The variance lost by data compression via PCA is
$$J_M = \sum_{j=M+1}^{D} \lambda_j = V_D - V_M$$
• Instead of these absolute quantities, we can define the relative variance captured as $V_M / V_D$, and the relative variance lost by compression as $1 - V_M / V_D$.
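An illustrative helper (hypothetical data and function name) for computing the relative variance captured and lost for a given $M$:

```python
import numpy as np

def variance_report(X, M):
    """Relative variance captured/lost when compressing the rows of X (N x D) to M dimensions."""
    X = X - X.mean(axis=0)                            # centre the data
    S = X.T @ X / X.shape[0]                          # data covariance matrix
    eigvals = np.linalg.eigvalsh(S)[::-1]             # eigenvalues in descending order
    V_D, V_M = eigvals.sum(), eigvals[:M].sum()
    return V_M / V_D, 1.0 - V_M / V_D                 # captured, lost

# Usage on hypothetical data with 10 dimensions.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10)) @ np.diag(np.linspace(3.0, 0.1, 10))
print(variance_report(X, M=2))
```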
Domain Adaptation

http://download.europe.naverlabs.com//ECCV-DA-Tutorial/Gabriela-Csurka-Part-2.pdf
http://download.europe.naverlabs.com//ECCV-DA-Tutorial/Tatiana-Tommasi-Part-3.pdf
10.3 PCA from Projection Perspective
• Previously, we derived PCA by maximizing the variance in the projected space to retain as much information as possible:
$$\max_{\mathbf{b}_1} \; \mathbf{b}_1^\top \mathbf{S} \mathbf{b}_1 \quad \text{subject to} \quad \|\mathbf{b}_1\|^2 = 1$$
• Alternatively, we can derive PCA as an algorithm that directly minimizes the average reconstruction error.
10.3.1 Setting and Objective

(Figure: (a) a vector $\mathbf{x} \in \mathbb{R}^2$ (red cross) shall be projected onto a one-dimensional subspace $U \subseteq \mathbb{R}^2$ spanned by $\mathbf{b}$; (b) differences $\mathbf{x} - \tilde{\mathbf{x}}_i$ for 50 different candidates $\tilde{\mathbf{x}}_i$ are shown by the red lines.)

• We wish to project $\mathbf{x}$ to $\tilde{\mathbf{x}}$ in a lower-dimensional space, such that $\tilde{\mathbf{x}}$ is similar to the original data point $\mathbf{x}$.
• That is, we aim to minimize the (Euclidean) distance $\|\mathbf{x} - \tilde{\mathbf{x}}\|$.
• Given an orthonormal basis $(\mathbf{b}_1, \ldots, \mathbf{b}_D)$ of $\mathbb{R}^D$, any $\mathbf{x} \in \mathbb{R}^D$ can be written as a linear combination of the basis vectors of $\mathbb{R}^D$:
$$\mathbf{x} = \sum_{d=1}^{D} \zeta_d \mathbf{b}_d = \sum_{m=1}^{M} \zeta_m \mathbf{b}_m + \sum_{j=M+1}^{D} \zeta_j \mathbf{b}_j$$
for suitable coordinates $\zeta_d \in \mathbb{R}$.
• We aim to find vectors $\tilde{\mathbf{x}} \in \mathbb{R}^D$, which live in an intrinsically lower-dimensional subspace $U \subseteq \mathbb{R}^D$ with $\dim(U) = M$, so that
$$\tilde{\mathbf{x}} = \sum_{m=1}^{M} z_m \mathbf{b}_m \in U \subseteq \mathbb{R}^D$$
is as similar to $\mathbf{x}$ as possible.
• We have a dataset $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, $\mathbf{x}_n \in \mathbb{R}^D$, centered at $\mathbf{0}$, i.e., $\mathbb{E}[\mathcal{X}] = \mathbf{0}$.
• We want to find the best linear projection of $\mathcal{X}$ onto a lower-dimensional subspace $U \subseteq \mathbb{R}^D$ with $\dim(U) = M$ and orthonormal basis vectors $\mathbf{b}_1, \ldots, \mathbf{b}_M$.
• We call this subspace $U$ the principal subspace.
• The projections of the data points are denoted by
$$\tilde{\mathbf{x}}_n := \sum_{m=1}^{M} z_{mn} \mathbf{b}_m = \mathbf{B}\mathbf{z}_n \in \mathbb{R}^D,$$
where $\mathbf{z}_n := [z_{1n}, \ldots, z_{Mn}]^\top \in \mathbb{R}^M$ is the coordinate vector of $\tilde{\mathbf{x}}_n$ with respect to the basis $(\mathbf{b}_1, \ldots, \mathbf{b}_M)$.
• We want $\tilde{\mathbf{x}}_n$ to be as similar to $\mathbf{x}_n$ as possible.
• We define our objective as minimizing the average squared Euclidean distance (reconstruction error)
$$J_M := \frac{1}{N} \sum_{n=1}^{N} \|\mathbf{x}_n - \tilde{\mathbf{x}}_n\|^2$$
• We need to find the orthonormal basis of the principal subspace and the coordinates $\mathbf{z}_n \in \mathbb{R}^M$ of the projections with respect to this basis.
10.3.2 Finding Optimal Coordinates

(Figure: (a) a vector $\mathbf{x} \in \mathbb{R}^2$ (red cross) shall be projected onto a one-dimensional subspace $U \subseteq \mathbb{R}^2$ spanned by $\mathbf{b}$; (b) differences $\mathbf{x} - \tilde{\mathbf{x}}_i$ for 50 different candidates $\tilde{\mathbf{x}}_i$ are shown by the red lines.)

• We want to find the $\tilde{\mathbf{x}}$ in the subspace spanned by $\mathbf{b}$ that minimizes $\|\mathbf{x} - \tilde{\mathbf{x}}\|$.
• Apparently, this will be the orthogonal projection.
• Given an ONB $(\mathbf{b}_1, \ldots, \mathbf{b}_M)$ of $U \subseteq \mathbb{R}^D$, to find the optimal coordinates $z_{in}$ with respect to this basis, we calculate the partial derivatives
$$\frac{\partial J_M}{\partial z_{in}} = \frac{\partial J_M}{\partial \tilde{\mathbf{x}}_n} \frac{\partial \tilde{\mathbf{x}}_n}{\partial z_{in}},$$
$$\frac{\partial J_M}{\partial \tilde{\mathbf{x}}_n} = -\frac{2}{N} (\mathbf{x}_n - \tilde{\mathbf{x}}_n)^\top \in \mathbb{R}^{1 \times D},$$
$$\frac{\partial \tilde{\mathbf{x}}_n}{\partial z_{in}} = \frac{\partial}{\partial z_{in}} \sum_{m=1}^{M} z_{mn} \mathbf{b}_m = \mathbf{b}_i$$
for $i = 1, \ldots, M$, such that we obtain
$$\frac{\partial J_M}{\partial z_{in}} = -\frac{2}{N} (\mathbf{x}_n - \tilde{\mathbf{x}}_n)^\top \mathbf{b}_i = -\frac{2}{N} \Big( \mathbf{x}_n - \sum_{m=1}^{M} z_{mn} \mathbf{b}_m \Big)^\top \mathbf{b}_i \overset{\text{ONB}}{=} -\frac{2}{N} (\mathbf{x}_n^\top \mathbf{b}_i - z_{in} \mathbf{b}_i^\top \mathbf{b}_i) = -\frac{2}{N} (\mathbf{x}_n^\top \mathbf{b}_i - z_{in})$$
$$\frac{\partial J_M}{\partial z_{in}} = -\frac{2}{N} (\mathbf{x}_n^\top \mathbf{b}_i - z_{in})$$
• Setting this partial derivative to 0 immediately yields the optimal coordinates
$$z_{in} = \mathbf{x}_n^\top \mathbf{b}_i = \mathbf{b}_i^\top \mathbf{x}_n$$
for $i = 1, \ldots, M$ and $n = 1, \ldots, N$.
• The optimal coordinates $z_{in}$ of the projection $\tilde{\mathbf{x}}_n$ are the coordinates of the orthogonal projection of the original data point $\mathbf{x}_n$ onto the one-dimensional subspace spanned by $\mathbf{b}_i$.
• The optimal linear projection $\tilde{\mathbf{x}}_n$ of $\mathbf{x}_n$ is an orthogonal projection.
• The coordinates of $\tilde{\mathbf{x}}_n$ with respect to the basis $(\mathbf{b}_1, \ldots, \mathbf{b}_M)$ are the coordinates of the orthogonal projection of $\mathbf{x}_n$ onto the principal subspace.
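A brief check (hypothetical data and basis) that the coordinates $z_{in} = \mathbf{b}_i^\top\mathbf{x}_n$ indeed minimize the average reconstruction error for a fixed ONB:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical centred data (N = 200, D = 6) and an orthonormal basis of an M = 2 dim subspace.
X = rng.normal(size=(200, 6)) @ np.diag([3.0, 2.5, 1.0, 0.5, 0.3, 0.1])
X -= X.mean(axis=0)
B = np.linalg.qr(rng.normal(size=(6, 2)))[0]

def avg_recon_error(Z):
    """Average squared reconstruction error J for coordinates Z (N x M)."""
    return np.mean(np.sum((X - Z @ B.T) ** 2, axis=1))

Z_opt = X @ B                                        # optimal coordinates z_n = B^T x_n
Z_perturbed = Z_opt + 0.1 * rng.normal(size=Z_opt.shape)

# The orthogonal-projection coordinates give the smallest reconstruction error.
print(avg_recon_error(Z_opt) < avg_recon_error(Z_perturbed))   # True
```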
(Figure: (a) a vector $\mathbf{x} \in \mathbb{R}^2$ (red cross) shall be projected onto a one-dimensional subspace $U \subseteq \mathbb{R}^2$ spanned by $\mathbf{b}$; (b) differences $\mathbf{x} - \tilde{\mathbf{x}}_i$ for 50 different candidates $\tilde{\mathbf{x}}_i$ are shown by the red lines; (c) distances $\|\mathbf{x} - \tilde{\mathbf{x}}_i\|$ for some $\tilde{\mathbf{x}}_i = z_1\mathbf{b} \in U = \mathrm{span}[\mathbf{b}]$; (d) the vector $\tilde{\mathbf{x}}$ that minimizes $\|\mathbf{x} - \tilde{\mathbf{x}}_i\|$ is the orthogonal projection of $\mathbf{x}$ onto $U$.)

• We briefly recap orthogonal projections from Section 3.8 (Analytic Geometry).
• If $(\mathbf{b}_1, \ldots, \mathbf{b}_D)$ is an orthonormal basis of $\mathbb{R}^D$, then
$$\tilde{\mathbf{x}} = \frac{\mathbf{b}_j^\top \mathbf{x}}{\|\mathbf{b}_j\|^2}\, \mathbf{b}_j = \mathbf{b}_j \mathbf{b}_j^\top \mathbf{x} \in \mathbb{R}^D$$
is the orthogonal projection of $\mathbf{x}$ onto the subspace spanned by the $j$th basis vector, and $z_j = \mathbf{b}_j^\top \mathbf{x}$ is the coordinate of this projection with respect to the basis vector $\mathbf{b}_j$ that spans that subspace, since $z_j \mathbf{b}_j = \tilde{\mathbf{x}}$.
• More generally, if we aim to project onto an $M$-dimensional subspace of $\mathbb{R}^D$, we obtain the orthogonal projection of $\mathbf{x}$ onto the $M$-dimensional subspace with orthonormal basis vectors $\mathbf{b}_1, \ldots, \mathbf{b}_M$ as
$$\tilde{\mathbf{x}} = \mathbf{B}(\underbrace{\mathbf{B}^\top\mathbf{B}}_{=\mathbf{I}})^{-1}\mathbf{B}^\top\mathbf{x} = \mathbf{B}\mathbf{B}^\top\mathbf{x},$$
where we defined $\mathbf{B} := [\mathbf{b}_1, \ldots, \mathbf{b}_M] \in \mathbb{R}^{D \times M}$. The coordinates of this projection with respect to the ordered basis $(\mathbf{b}_1, \ldots, \mathbf{b}_M)$ are $\mathbf{z} := \mathbf{B}^\top\mathbf{x}$.
• Although $\tilde{\mathbf{x}} \in \mathbb{R}^D$, we only need $M$ coordinates to represent $\tilde{\mathbf{x}}$; the other $D - M$ coordinates with respect to the basis vectors $(\mathbf{b}_{M+1}, \ldots, \mathbf{b}_D)$ are always 0.
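A small numerical illustration (hypothetical basis and vector) of the identities above: for orthonormal columns, the general projection formula reduces to $\mathbf{B}\mathbf{B}^\top\mathbf{x}$, the projection is idempotent, and the residual is orthogonal to the subspace.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical ONB of an M = 3 dimensional subspace of R^7, and a vector to project.
B = np.linalg.qr(rng.normal(size=(7, 3)))[0]
x = rng.normal(size=7)

# B^T B = I, so B (B^T B)^{-1} B^T x reduces to B B^T x.
proj_general = B @ np.linalg.inv(B.T @ B) @ B.T @ x
proj_onb = B @ (B.T @ x)
print(np.allclose(proj_general, proj_onb))           # True

# Projecting twice changes nothing, and the residual lies in the orthogonal complement.
print(np.allclose(B @ (B.T @ proj_onb), proj_onb))   # True
print(np.allclose(B.T @ (x - proj_onb), 0.0))        # True
```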
10.3.3 Finding the Basis of the Principal Subspace
• So far we have shown that, for a given ONB, we can find the optimal coordinates of $\tilde{\mathbf{x}}_n$ by an orthogonal projection onto the principal subspace. In the following, we will determine what the best basis is.
• Recall that the optimal coordinates of $\tilde{\mathbf{x}}_n$ for a given ONB are
$$z_{mn} = \mathbf{x}_n^\top \mathbf{b}_m = \mathbf{b}_m^\top \mathbf{x}_n$$
• We have
$$\tilde{\mathbf{x}}_n = \sum_{m=1}^{M} z_{mn} \mathbf{b}_m = \sum_{m=1}^{M} (\mathbf{x}_n^\top \mathbf{b}_m)\,\mathbf{b}_m$$
• We rearrange this equation, which yields
$$\tilde{\mathbf{x}}_n = \sum_{m=1}^{M} (\mathbf{b}_m^\top \mathbf{x}_n)\,\mathbf{b}_m = \sum_{m=1}^{M} \mathbf{b}_m(\mathbf{b}_m^\top \mathbf{x}_n) = \Big( \sum_{m=1}^{M} \mathbf{b}_m \mathbf{b}_m^\top \Big) \mathbf{x}_n$$
• Since we can generally write the original data point $\mathbf{x}_n$ as a linear combination of all basis vectors, it holds that
$$\mathbf{x}_n = \sum_{d=1}^{D} z_{dn}\mathbf{b}_d = \sum_{d=1}^{D} (\mathbf{x}_n^\top \mathbf{b}_d)\,\mathbf{b}_d = \Big( \sum_{d=1}^{D} \mathbf{b}_d \mathbf{b}_d^\top \Big) \mathbf{x}_n = \Big( \sum_{m=1}^{M} \mathbf{b}_m \mathbf{b}_m^\top \Big) \mathbf{x}_n + \Big( \sum_{j=M+1}^{D} \mathbf{b}_j \mathbf{b}_j^\top \Big) \mathbf{x}_n,$$
where we split the sum with $D$ terms into a sum over $M$ terms and a sum over $D - M$ terms.
• With these results, the displacement vector $\mathbf{x}_n - \tilde{\mathbf{x}}_n$, i.e., the difference vector between the original data point and its projection, is
$$\mathbf{x}_n - \tilde{\mathbf{x}}_n = \Big( \sum_{j=M+1}^{D} \mathbf{b}_j \mathbf{b}_j^\top \Big) \mathbf{x}_n = \sum_{j=M+1}^{D} (\mathbf{x}_n^\top \mathbf{b}_j)\,\mathbf{b}_j$$
• The displacement vector $\mathbf{x}_n - \tilde{\mathbf{x}}_n$ is exactly the projection of the data point onto the orthogonal complement of the principal subspace.
• $\mathbf{x}_n - \tilde{\mathbf{x}}_n$ lies in the subspace that is orthogonal to the principal subspace.
• We identify the matrix $\sum_{j=M+1}^{D} \mathbf{b}_j \mathbf{b}_j^\top$ in the equation above as the projection matrix that performs this projection.

(Figure: orthogonal projection and displacement vectors. When projecting data points $\mathbf{x}_n$ (blue) onto the subspace $U_1$, we obtain $\tilde{\mathbf{x}}_n$ (orange). The displacement vector $\mathbf{x}_n - \tilde{\mathbf{x}}_n$ lies completely in the orthogonal complement $U_2$ of $U_1$.)
• Now we reformulate the loss function:
$$J_M = \frac{1}{N} \sum_{n=1}^{N} \|\mathbf{x}_n - \tilde{\mathbf{x}}_n\|^2 = \frac{1}{N} \sum_{n=1}^{N} \Big\| \sum_{j=M+1}^{D} (\mathbf{b}_j^\top \mathbf{x}_n)\,\mathbf{b}_j \Big\|^2$$
• We explicitly compute the squared norm and exploit the fact that the $\mathbf{b}_j$ form an ONB:
$$J_M = \frac{1}{N} \sum_{n=1}^{N} \sum_{j=M+1}^{D} (\mathbf{b}_j^\top \mathbf{x}_n)^2 = \frac{1}{N} \sum_{n=1}^{N} \sum_{j=M+1}^{D} \mathbf{b}_j^\top \mathbf{x}_n \mathbf{b}_j^\top \mathbf{x}_n = \frac{1}{N} \sum_{n=1}^{N} \sum_{j=M+1}^{D} \mathbf{b}_j^\top \mathbf{x}_n \mathbf{x}_n^\top \mathbf{b}_j,$$
where we exploited the symmetry of the dot product in the last step to write $\mathbf{b}_j^\top \mathbf{x}_n = \mathbf{x}_n^\top \mathbf{b}_j$. We now swap the sums and obtain
$$J_M = \sum_{j=M+1}^{D} \mathbf{b}_j^\top \underbrace{\Big( \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^\top \Big)}_{=:\,\mathbf{S}} \mathbf{b}_j = \sum_{j=M+1}^{D} \mathbf{b}_j^\top \mathbf{S} \mathbf{b}_j = \sum_{j=M+1}^{D} \mathrm{tr}(\mathbf{b}_j^\top \mathbf{S} \mathbf{b}_j) = \sum_{j=M+1}^{D} \mathrm{tr}(\mathbf{S} \mathbf{b}_j \mathbf{b}_j^\top) = \mathrm{tr}\Big( \underbrace{\Big( \sum_{j=M+1}^{D} \mathbf{b}_j \mathbf{b}_j^\top \Big)}_{\text{projection matrix}} \mathbf{S} \Big),$$
where we exploited the property that the trace operator tr(·) is linear and invariant to cyclic permutations of its arguments.
$$J_M = \sum_{j=M+1}^{D} \mathbf{b}_j^\top \mathbf{S} \mathbf{b}_j = \mathrm{tr}\Big( \underbrace{\Big( \sum_{j=M+1}^{D} \mathbf{b}_j \mathbf{b}_j^\top \Big)}_{\text{projection matrix}} \mathbf{S} \Big)$$
• The loss is thus formulated in terms of the covariance matrix of the data, projected onto the orthogonal complement of the principal subspace.
• Minimizing the average squared reconstruction error is therefore equivalent to minimizing the variance of the data when projected onto the subspace we ignore, i.e., the orthogonal complement of the principal subspace.
• Equivalently, we maximize the variance of the projection that we retain in the principal subspace, which links the projection loss immediately to the maximum-variance formulation of PCA in Section 10.2.
• As stated on the earlier "Overall" slide, the average squared reconstruction error, when projecting onto the $M$-dimensional principal subspace, is
$$J_M = \sum_{j=M+1}^{D} \lambda_j,$$
where the $\lambda_j$ are the eigenvalues of the data covariance matrix.
$$J_M = \sum_{j=M+1}^{D} \lambda_j$$
• To minimize this loss, we need to select the smallest $D - M$ eigenvalues; their corresponding eigenvectors are the basis of the orthogonal complement of the principal subspace.
• Consequently, this means that the basis of the principal subspace comprises the eigenvectors $\mathbf{b}_1, \ldots, \mathbf{b}_M$ that are associated with the largest $M$ eigenvalues of the data covariance matrix.
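A numerical sanity check (hypothetical data) that the average squared reconstruction error of the rank-$M$ PCA projection equals the sum of the discarded eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical centred dataset (N = 500, D = 8) and target dimension M = 3.
X = rng.normal(size=(500, 8)) @ np.diag(np.linspace(3.0, 0.2, 8))
X -= X.mean(axis=0)
N, D = X.shape
M = 3

S = X.T @ X / N
eigvals, eigvecs = np.linalg.eigh(S)              # eigh returns ascending eigenvalues
B = eigvecs[:, ::-1][:, :M]                       # eigenvectors of the M largest eigenvalues

X_tilde = X @ B @ B.T                             # reconstructions x_tilde_n = B B^T x_n
J_M = np.mean(np.sum((X - X_tilde) ** 2, axis=1))

# Average squared reconstruction error equals the sum of the D - M smallest eigenvalues.
print(np.isclose(J_M, eigvals[:D - M].sum()))     # True
```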
10.5 PCA in High Dimensions
• In order to do PCA, we need to compute the data covariance matrix $\mathbf{S}$.
• In $D$ dimensions, $\mathbf{S}$ is a $D \times D$ matrix.
• Computing the eigenvalues and eigenvectors of this matrix is computationally expensive, as it scales cubically in $D$.
• Therefore, PCA, as we discussed it so far, will be infeasible in very high dimensions.
• For example, if the $\mathbf{x}_n$ are images with 10,000 pixels, we would need to compute the eigendecomposition of a 10,000×10,000 matrix.
• We provide a solution to this problem for the case that we have substantially fewer data points than dimensions, i.e., $N \ll D$.
• Assume we have a centered dataset $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, $\mathbf{x}_n \in \mathbb{R}^D$. Then the data covariance matrix is given as
$$\mathbf{S} = \frac{1}{N} \mathbf{X}\mathbf{X}^\top \in \mathbb{R}^{D \times D},$$
where $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_N]$ is a $D \times N$ matrix whose columns are the data points.
• We now assume that $N \ll D$, i.e., the number of data points is smaller than the dimensionality of the data.
• With $N \ll D$ data points, the rank of the covariance matrix $\mathbf{S}$ is at most $N$, so it has at least $D - N$ eigenvalues that are 0.
• Intuitively, this means that there are some redundancies. In the following, we will exploit this and turn the $D \times D$ covariance matrix into an $N \times N$ covariance matrix whose eigenvalues are all positive.
• In PCA, we ended up with the eigenvector equation
$$\mathbf{S}\mathbf{b}_m = \lambda_m \mathbf{b}_m, \quad m = 1, \ldots, M,$$
where $\mathbf{b}_m$ is a basis vector of the principal subspace. Let us rewrite this equation a bit: with $\mathbf{S} = \frac{1}{N}\mathbf{X}\mathbf{X}^\top \in \mathbb{R}^{D \times D}$, we obtain
$$\mathbf{S}\mathbf{b}_m = \frac{1}{N}\mathbf{X}\mathbf{X}^\top \mathbf{b}_m = \lambda_m \mathbf{b}_m$$
• We now multiply by $\mathbf{X}^\top \in \mathbb{R}^{N \times D}$ from the left-hand side, which yields
$$\frac{1}{N}\mathbf{X}^\top\mathbf{X}\underbrace{\mathbf{X}^\top\mathbf{b}_m}_{=:\,\mathbf{c}_m} = \lambda_m \mathbf{X}^\top\mathbf{b}_m \;\iff\; \frac{1}{N}\mathbf{X}^\top\mathbf{X}\,\mathbf{c}_m = \lambda_m \mathbf{c}_m,$$
where $\frac{1}{N}\mathbf{X}^\top\mathbf{X}$ is an $N \times N$ matrix.
$$\frac{1}{N}\mathbf{X}^\top\mathbf{X}\,\mathbf{c}_m = \lambda_m \mathbf{c}_m$$
• We get a new eigenvector/eigenvalue equation: $\lambda_m$ remains an eigenvalue.
• We obtain the eigenvector of the matrix $\frac{1}{N}\mathbf{X}^\top\mathbf{X} \in \mathbb{R}^{N \times N}$ associated with $\lambda_m$ as $\mathbf{c}_m := \mathbf{X}^\top\mathbf{b}_m$.
• This also implies that $\frac{1}{N}\mathbf{X}^\top\mathbf{X}$ has the same (nonzero) eigenvalues as the data covariance matrix $\mathbf{S}$.
• But $\frac{1}{N}\mathbf{X}^\top\mathbf{X}$ is now an $N \times N$ matrix, so we can compute its eigenvalues and eigenvectors much more efficiently than for the original $D \times D$ data covariance matrix.
• Now that we have the eigenvectors of $\frac{1}{N}\mathbf{X}^\top\mathbf{X}$, we are going to recover the original eigenvectors, which we still need for PCA. Currently, we know the eigenvectors of $\frac{1}{N}\mathbf{X}^\top\mathbf{X}$. If we left-multiply our eigenvalue/eigenvector equation with $\mathbf{X}$, we get
$$\underbrace{\frac{1}{N}\mathbf{X}\mathbf{X}^\top}_{\mathbf{S}}\mathbf{X}\mathbf{c}_m = \lambda_m \mathbf{X}\mathbf{c}_m,$$
and we recover the data covariance matrix again. This now also means that we recover $\mathbf{X}\mathbf{c}_m$ as an eigenvector of $\mathbf{S}$ (after normalizing it to unit length).
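A sketch (hypothetical sizes) of this $N \ll D$ trick: eigendecompose the small $N \times N$ matrix $\frac{1}{N}\mathbf{X}^\top\mathbf{X}$ and recover eigenvectors of $\mathbf{S}$ as $\mathbf{X}\mathbf{c}_m$.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical setting with far fewer points than dimensions: D = 1000, N = 20.
D, N = 1000, 20
X = rng.normal(size=(D, N))                    # columns are data points
X -= X.mean(axis=1, keepdims=True)             # centre the data

# Eigendecompose the small N x N matrix (1/N) X^T X instead of the D x D covariance S.
K = X.T @ X / N
lam, C = np.linalg.eigh(K)                     # ascending eigenvalues
lam, C = lam[::-1], C[:, ::-1]                 # reorder to descending

M = 3
B = X @ C[:, :M]                               # recover X c_m as eigenvectors of S
B /= np.linalg.norm(B, axis=0)                 # normalize columns to unit length

# Check against the (expensive) D x D covariance matrix: S b_m = lambda_m b_m.
S = X @ X.T / N
print(np.allclose(S @ B, B * lam[:M]))         # True
```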
10.6 Key Steps of PCA in Practice

(Figure: overview of the key steps of PCA in practice, with the eigendecomposition of the data covariance matrix as the central step.)
• Step 1: Mean subtraction. We center the data by computing the mean $\boldsymbol{\mu}$ of the dataset and subtracting it from every single data point. This ensures that the dataset has mean $\mathbf{0}$.
• Step 2: Standardization. We divide the data points by the standard deviation $\sigma_d$ of the dataset for every dimension $d = 1, \ldots, D$. Now the data has variance 1 along each axis.
• Step 3: Eigendecomposition of the covariance matrix. We compute the data covariance matrix and its eigenvalues and corresponding eigenvectors. In the figure, the longer vector (larger eigenvalue) spans the principal subspace $U$.
• Step 4: Projection. We can project any data point $\mathbf{x}_* \in \mathbb{R}^D$ onto the principal subspace. To get this right, we need to standardize $\mathbf{x}_*$ using the mean $\mu_d$ and standard deviation $\sigma_d$ of the training data in the $d$th dimension, respectively, so that
$$x_*^{(d)} \leftarrow \frac{x_*^{(d)} - \mu_d}{\sigma_d}, \quad d = 1, \ldots, D,$$
where $x_*^{(d)}$ is the $d$th component of $\mathbf{x}_*$.
• We obtain the projection as
$$\tilde{\mathbf{x}}_* = \mathbf{B}\mathbf{B}^\top\mathbf{x}_*$$
with coordinates
$$\mathbf{z}_* = \mathbf{B}^\top\mathbf{x}_*$$
with respect to the basis of the principal subspace. Here, $\mathbf{B}$ is the matrix that contains, as columns, the eigenvectors that are associated with the largest eigenvalues of the data covariance matrix.
• Having standardized our dataset, $\tilde{\mathbf{x}}_* = \mathbf{B}\mathbf{B}^\top\mathbf{x}_*$ only yields the projection in the context of the standardized dataset.
• To obtain our projection in the original data space (i.e., before standardization), we need to undo the standardization: multiply by the standard deviation before adding the mean, so that we obtain
$$\tilde{x}_*^{(d)} \leftarrow \tilde{x}_*^{(d)}\,\sigma_d + \mu_d, \quad d = 1, \ldots, D$$
• Figure 10.10(f) illustrates the projection in the original data space.
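A compact sketch (hypothetical data and function names) putting the four steps together: fit the mean, standard deviations, and principal subspace on training data, then project a new point and map it back to the original data space.

```python
import numpy as np

def pca_fit(X_train, M):
    """Steps 1-3: centre, standardize, and eigendecompose the covariance of X_train (N x D)."""
    mu = X_train.mean(axis=0)                      # Step 1: per-dimension mean
    sigma = X_train.std(axis=0)                    # Step 2: per-dimension standard deviation
    Xs = (X_train - mu) / sigma
    S = Xs.T @ Xs / Xs.shape[0]                    # covariance matrix of the standardized data
    eigvals, eigvecs = np.linalg.eigh(S)           # Step 3: eigendecomposition (ascending)
    B = eigvecs[:, ::-1][:, :M]                    # eigenvectors of the M largest eigenvalues
    return mu, sigma, B

def pca_project(x_star, mu, sigma, B):
    """Step 4: standardize, project onto the principal subspace, then undo the standardization."""
    xs = (x_star - mu) / sigma                     # standardize with the training statistics
    z = B.T @ xs                                   # coordinates in the principal subspace
    x_tilde = B @ z                                # projection in the standardized space
    return x_tilde * sigma + mu                    # back to the original data space

# Usage on hypothetical training data.
rng = np.random.default_rng(8)
X = rng.normal(size=(300, 5)) @ np.diag([5.0, 3.0, 1.0, 0.5, 0.2]) + 10.0
mu, sigma, B = pca_fit(X, M=2)
print(pca_project(X[0], mu, sigma, B))
```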
