Dimensionality Reduction With Principal Component Analysis
Liang Zheng
Australian National University
[email protected]
[Figure: "Discovering faster matrix multiplication algorithms with reinforcement learning", Fawzi et al., Nature 2022]
[Figure: we rotate our price and area axes.]
Motivation
• High-dimensional data, such as images, is hard to analyze, interpret, and
visualize, and expensive to store.
• Good news
• high-dimensional data is often overcomplete, i.e., many dimensions are
redundant and can be explained by a combination of other dimensions
• Furthermore, dimensions in high-dimensional data are often correlated so
that the data possesses an intrinsic lower-dimensional structure.
The data in (a) does not vary much in the $x_2$-direction, so we can express it as if it were on a line, with nearly no loss of information; see (b).
• Example (Coordinate Representation/Code)
• Consider $\mathbb{R}^2$ with the canonical basis $e_1 = [1, 0]^\top$, $e_2 = [0, 1]^\top$.
• Any $x \in \mathbb{R}^2$ can be represented as a linear combination of these basis vectors, e.g.,
$$[5, 3]^\top = 5 e_1 + 3 e_2$$
• However, when we consider vectors of the form
$$\tilde{x} = [0, z]^\top \in \mathbb{R}^2, \qquad z \in \mathbb{R},$$
they can always be written as $0 e_1 + z e_2$.
• To represent these vectors it is sufficient to store the coordinate/code $z$ of $\tilde{x}$ with respect to the $e_2$ vector.
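To make the coding/decoding idea concrete, here is a tiny NumPy sketch (illustrative, not part of the original slides; all variable names are made up):

import numpy as np

e1 = np.array([1.0, 0.0])   # canonical basis vector e_1
e2 = np.array([0.0, 1.0])   # canonical basis vector e_2

x = np.array([0.0, 3.0])    # a vector of the form [0, z]^T
z = e2 @ x                  # the code: coordinate of x with respect to e_2 (here z = 3.0)
x_decoded = z * e2          # decoding: rebuild the full 2D vector from the 1D code

assert np.allclose(x, x_decoded)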
10.2 PCA from Maximum Variance Perspective
• We ignore the $x_2$-coordinate of the data because it does not add much information: the compressed data in (b) is similar to the original data in (a).
• We derive PCA so as to maximize the variance in the low-dimensional
representation of the data to retain as much information as possible
• Retaining most information after data compression is equivalent to capturing
the largest amount of variance in the low-dimensional code (Hotelling, 1933)
10.2.1 Direction with Maximal Variance
• Data centering
• In the data covariance matrix, we assume centered (zero-mean) data:
$$S = \frac{1}{N} \sum_{n=1}^{N} x_n x_n^\top$$
• Let $\mu$ be the mean of the data and let $z = B^\top (x - \mu)$ denote the low-dimensional code, where $B := [b_1, \dots, b_M] \in \mathbb{R}^{D \times M}$ collects the basis vectors of the lower-dimensional subspace. Using the properties of the variance, we obtain
$$\mathbb{V}_z[z] = \mathbb{V}_x[B^\top (x - \mu)] = \mathbb{V}_x[B^\top x - B^\top \mu] = \mathbb{V}_x[B^\top x]$$
since the constant shift $B^\top \mu$ does not change the variance (recall $\mathbb{V}[x + y] = \mathbb{V}[x] + \mathbb{V}[y] + \mathrm{Cov}[x, y] + \mathrm{Cov}[y, x]$, and the covariance of anything with a constant vanishes).
• That is, the variance of the low-dimensional code does not depend on the mean of the data, so we may assume centered data.
• With this assumption, the mean of the low-dimensional code is also $0$, since
$$\mathbb{E}_z[z] = \mathbb{E}_x[B^\top x] = B^\top \mathbb{E}_x[x] = 0$$
• To maximize the variance of the low-dimensional code, we first seek a single vector $b_1 \in \mathbb{R}^D$ that maximizes the variance of the projected data, i.e., we aim to maximize the variance of the first coordinate $z_1$ of $z \in \mathbb{R}^M$, so that
$$V_1 := \mathbb{V}[z_1] = \frac{1}{N} \sum_{n=1}^{N} z_{1n}^2$$
is maximized, where $z_{1n}$ denotes the first coordinate of the low-dimensional representation $z_n \in \mathbb{R}^M$ of $x_n \in \mathbb{R}^D$. It is given by
$$z_{1n} = b_1^\top x_n,$$
i.e., it is the coordinate of the orthogonal projection of $x_n$ onto the one-dimensional subspace spanned by $b_1$. We substitute $z_{1n}$ into $V_1$ and obtain
$$V_1 = \frac{1}{N} \sum_{n=1}^{N} (b_1^\top x_n)^2 = \frac{1}{N} \sum_{n=1}^{N} b_1^\top x_n x_n^\top b_1 = b_1^\top \left( \frac{1}{N} \sum_{n=1}^{N} x_n x_n^\top \right) b_1 = b_1^\top S b_1,$$
where $S$ is the data covariance matrix.
• We further restrict all solutions to $\lVert b_1 \rVert^2 = 1$.
• We have the following constrained optimization problem:
$$\max_{b_1} \; b_1^\top S b_1 \quad \text{subject to} \quad \lVert b_1 \rVert^2 = 1$$
• We obtain the Lagrangian (not required in this course)
$$\mathfrak{L}(b_1, \lambda_1) = b_1^\top S b_1 + \lambda_1 (1 - b_1^\top b_1)$$
• The partial derivatives of $\mathfrak{L}$ with respect to $b_1$ and $\lambda_1$ are
$$\frac{\partial \mathfrak{L}}{\partial b_1} = 2 b_1^\top S - 2 \lambda_1 b_1^\top, \qquad \frac{\partial \mathfrak{L}}{\partial \lambda_1} = 1 - b_1^\top b_1$$
• Setting these partial derivatives to $0$ gives us the relations
$$S b_1 = \lambda_1 b_1, \qquad b_1^\top b_1 = 1$$
• We see that $b_1$ is an eigenvector of $S$, and $\lambda_1$ is the corresponding eigenvalue. We can therefore rewrite our objective as
$$V_1 = b_1^\top S b_1 = \lambda_1 b_1^\top b_1 = \lambda_1,$$
i.e., the variance of the data projected onto a one-dimensional subspace equals the eigenvalue associated with the basis vector $b_1$ that spans this subspace.
• To maximize the variance of the low-dimensional code, we choose the basis vector associated with the largest eigenvalue of the data covariance matrix. This eigenvector is called the first principal component.
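To make this concrete, the following NumPy sketch (illustrative, not from the slides; the toy data and variable names are made up) computes the first principal component of centered data as the eigenvector of $S$ with the largest eigenvalue:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])   # toy data, shape (N, D)

X_centered = X - X.mean(axis=0)                       # center the data
S = X_centered.T @ X_centered / X_centered.shape[0]   # S = (1/N) sum_n x_n x_n^T

eigvals, eigvecs = np.linalg.eigh(S)                  # eigh: S is symmetric
b1 = eigvecs[:, np.argmax(eigvals)]                   # first principal component (unit norm)
V1 = eigvals.max()                                    # variance of the data projected onto b1

print(b1, V1)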
10.2.2 𝑀-dimensional Subspace with Maximal Variance
• Assume we have already found the $m-1$ eigenvectors of $S$ that are associated with the largest $m-1$ eigenvalues.
• We now want to find the $m$-th principal component.
• To do so, we subtract the effect of the first $m-1$ principal components $b_1, \dots, b_{m-1}$ from the data and then look for the direction that captures the most variance of the remaining information (see the sketch below). This gives the new data matrix
$$\hat{X} := X - \sum_{i=1}^{m-1} b_i b_i^\top X = X - B_{m-1} X,$$
where $X = [x_1, \dots, x_N] \in \mathbb{R}^{D \times N}$ contains the data points as columns and $B_{m-1} := \sum_{i=1}^{m-1} b_i b_i^\top$ is the projection matrix onto the subspace spanned by $b_1, \dots, b_{m-1}$.
• Maximizing the variance of this remaining data yields, by the same argument as before, the $m$-th principal component: the eigenvector of $S$ associated with the $m$-th largest eigenvalue $\lambda_m$.
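The following NumPy sketch (again illustrative, not the slides' own code) finds principal components one at a time using exactly this deflation idea: after each component is found, its contribution is subtracted from the data before the next direction of maximal variance is sought.

import numpy as np

def pca_by_deflation(X, M):
    """X: data matrix of shape (D, N) with columns x_n, assumed centered. Returns B of shape (D, M)."""
    D, N = X.shape
    X_hat = X.copy()
    components = []
    for _ in range(M):
        S_hat = X_hat @ X_hat.T / N             # covariance of the remaining data
        eigvals, eigvecs = np.linalg.eigh(S_hat)
        b = eigvecs[:, -1]                      # direction of maximal remaining variance
        components.append(b)
        X_hat = X_hat - np.outer(b, b) @ X_hat  # subtract the effect of this component
    return np.stack(components, axis=1)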
MNIST dataset
• 60,000 examples of handwritten digits 0 through 9.
• Each digit is a grayscale image of size 28×28, i.e., it contains 784 pixels.
• We can interpret every image in this dataset as a vector $x \in \mathbb{R}^{784}$.
Example - Eigenvalues of MNIST digit “8”
http://download.europe.naverlabs.com//ECCV-DA-Tutorial/Gabriela-Csurka-Part-2.pdf
http://download.europe.naverlabs.com//ECCV-DA-Tutorial/Tatiana-Tommasi-Part-3.pdf
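As a rough sketch of how such an eigenvalue spectrum could be computed (illustrative only; the array images_8 is a made-up placeholder that would hold the actual digit-"8" images):

import numpy as np
import matplotlib.pyplot as plt

# Placeholder data; in practice this would be the N grayscale 28x28 images of the digit "8".
images_8 = np.random.rand(500, 28, 28)

X = images_8.reshape(len(images_8), -1)   # each image becomes a vector in R^784
X = X - X.mean(axis=0)                    # center the data

S = X.T @ X / X.shape[0]                  # 784 x 784 data covariance matrix
eigvals = np.linalg.eigvalsh(S)[::-1]     # eigenvalue spectrum, largest first

plt.plot(eigvals)
plt.xlabel("index of eigenvalue")
plt.ylabel("eigenvalue")
plt.show()

For the real digit-"8" images, the spectrum decays quickly, so a few principal components already capture most of the variance.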
10.3 PCA from Projection Perspective
• Previously, we derived PCA by maximizing the variance in the projected
space to retain as much information as possible
$$\max_{b_1} \; b_1^\top S b_1 \quad \text{subject to} \quad \lVert b_1 \rVert^2 = 1$$
• Alternatively, we derive PCA as an algorithm that directly minimizes the
average reconstruction error
10.3.1 Setting and Objective
• We have a dataset $\mathcal{X} = \{x_1, \dots, x_N\}$, $x_n \in \mathbb{R}^D$, centered at $0$, i.e., $\mathbb{E}[\mathcal{X}] = 0$.
• We want to find the best linear projection of $\mathcal{X}$ onto a lower-dimensional subspace $U \subseteq \mathbb{R}^D$ with $\dim(U) = M$ and orthonormal basis vectors $b_1, \dots, b_M$.
• We call this subspace $U$ the principal subspace.
• The projections of the data points are denoted by
$$\tilde{x}_n := \sum_{m=1}^{M} z_{mn} b_m = B z_n \in \mathbb{R}^D,$$
where $B := [b_1, \dots, b_M] \in \mathbb{R}^{D \times M}$ and $z_n := [z_{1n}, \dots, z_{Mn}]^\top \in \mathbb{R}^M$ is the coordinate vector of $\tilde{x}_n$ with respect to the basis $(b_1, \dots, b_M)$.
• We want $\tilde{x}_n$ to be as similar to $x_n$ as possible.
• We define our objective as minimizing the average squared Euclidean distance (reconstruction error)
$$J_M := \frac{1}{N} \sum_{n=1}^{N} \lVert x_n - \tilde{x}_n \rVert^2$$
• We need to find the orthonormal basis of the principal subspace and the coordinates $z_n \in \mathbb{R}^M$ of the projections with respect to this basis (a short code sketch of evaluating this objective follows below).
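A minimal NumPy sketch (illustrative; the function name is made up) of this objective for a given orthonormal basis $B$, using the coordinates $z_n = B^\top x_n$ that Section 10.3.2 will show to be optimal:

import numpy as np

def reconstruction_error(X, B):
    """Average squared reconstruction error J_M.

    X: centered data matrix of shape (N, D), rows are the x_n.
    B: matrix of shape (D, M) with orthonormal columns spanning the principal subspace.
    """
    Z = X @ B          # coordinates z_n = B^T x_n (as rows)
    X_tilde = Z @ B.T  # projections x_tilde_n = B z_n (as rows)
    return np.mean(np.sum((X - X_tilde) ** 2, axis=1))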
10.3.2 Finding Optimal Coordinates
• Given an ONB $(b_1, \dots, b_M)$ of $U \subseteq \mathbb{R}^D$, to find the optimal coordinates $z_n$ with respect to this basis, we calculate the partial derivatives
$$\frac{\partial J_M}{\partial z_{in}} = \frac{\partial J_M}{\partial \tilde{x}_n} \frac{\partial \tilde{x}_n}{\partial z_{in}},$$
$$\frac{\partial J_M}{\partial \tilde{x}_n} = -\frac{2}{N} (x_n - \tilde{x}_n)^\top \in \mathbb{R}^{1 \times D},$$
$$\frac{\partial \tilde{x}_n}{\partial z_{in}} = \frac{\partial}{\partial z_{in}} \left( \sum_{m=1}^{M} z_{mn} b_m \right) = b_i$$
for $i = 1, \dots, M$, such that we obtain
$$\frac{\partial J_M}{\partial z_{in}} = -\frac{2}{N} (x_n - \tilde{x}_n)^\top b_i = -\frac{2}{N} \left( x_n - \sum_{m=1}^{M} z_{mn} b_m \right)^{\!\top} b_i \overset{\text{ONB}}{=} -\frac{2}{N} \left( x_n^\top b_i - z_{in} b_i^\top b_i \right) = -\frac{2}{N} \left( x_n^\top b_i - z_{in} \right)$$
• Setting this partial derivative to $0$ yields immediately the optimal coordinates
$$z_{in} = x_n^\top b_i = b_i^\top x_n$$
for $i = 1, \dots, M$ and $n = 1, \dots, N$.
• The optimal coordinates $z_{in}$ of the projection $\tilde{x}_n$ are the coordinates of the orthogonal projection of the original data point $x_n$ onto the one-dimensional subspace spanned by $b_i$.
• The optimal linear projection $\tilde{x}_n$ of $x_n$ is therefore an orthogonal projection.
• The coordinates of $\tilde{x}_n$ with respect to the basis $(b_1, \dots, b_M)$ are the coordinates of the orthogonal projection of $x_n$ onto the principal subspace.
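A short NumPy sketch (illustrative; the numbers are made up) of these optimal coordinates and the resulting orthogonal projection for a single data point:

import numpy as np

# Assumed example: D = 3, M = 2, with an orthonormal basis of the principal subspace.
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])       # columns b_1, b_2 are orthonormal
x = np.array([2.0, -1.0, 4.0])   # a data point x_n in R^3

z = B.T @ x                      # optimal coordinates z_in = b_i^T x_n
x_tilde = B @ z                  # projection x_tilde_n = B z_n onto the principal subspace

# The residual is orthogonal to the principal subspace:
assert np.allclose(B.T @ (x - x_tilde), 0.0)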
(a) A vector $x \in \mathbb{R}^2$ (red cross) shall be projected onto a one-dimensional subspace $U \subseteq \mathbb{R}^2$ spanned by $b$. (b) Differences $x - \tilde{x}_i$ for 50 different candidates $\tilde{x}_i$ are shown by the red lines. (c) Distances $\lVert x - \tilde{x} \rVert$ for some candidates $\tilde{x}$. (d) The vector $\tilde{x} = z_1 b \in U = \mathrm{span}[b]$ that minimizes $\lVert x - \tilde{x} \rVert$ is the orthogonal projection of $x$ onto $U$.
• We briefly recap orthogonal projections from Section 3.8 (Analytic geometry).
• If $(b_1, \dots, b_D)$ is an orthonormal basis of $\mathbb{R}^D$, then
$$\tilde{x} = \frac{b_j^\top x}{\lVert b_j \rVert^2} \, b_j = b_j b_j^\top x \in \mathbb{R}^D$$
is the orthogonal projection of $x$ onto the subspace spanned by the $j$-th basis vector, and $z_j = b_j^\top x$ is the coordinate of this projection with respect to the basis vector $b_j$ that spans that subspace, since $z_j b_j = \tilde{x}$.
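As a tiny worked instance (reusing the earlier coordinate-representation example), projecting $x = [5, 3]^\top$ onto the basis vector $b_j = e_2$ gives
$$\tilde{x} = e_2 e_2^\top x = \begin{bmatrix} 0 \\ 1 \end{bmatrix} \begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} 5 \\ 3 \end{bmatrix} = \begin{bmatrix} 0 \\ 3 \end{bmatrix}, \qquad z_2 = e_2^\top x = 3.$$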
• With the optimal coordinates, the projection reads $\tilde{x}_n = \sum_{m=1}^{M} z_{mn} b_m = \sum_{m=1}^{M} (b_m^\top x_n) b_m$. Extending $(b_1, \dots, b_M)$ to an ONB $(b_1, \dots, b_D)$ of $\mathbb{R}^D$, we can also write the original data point as
$$x_n = \sum_{d=1}^{D} z_{dn} b_d = \sum_{m=1}^{M} (b_m^\top x_n) b_m + \sum_{j=M+1}^{D} (b_j^\top x_n) b_j,$$
where we split the sum with $D$ terms into a sum over $M$ terms and a sum over $D - M$ terms.
• With these results, the displacement vector $x_n - \tilde{x}_n$, i.e., the difference vector between the original data point and its projection, is
$$x_n - \tilde{x}_n = \sum_{j=M+1}^{D} b_j (b_j^\top x_n) = \sum_{j=M+1}^{D} (x_n^\top b_j) b_j$$
• We explicitly compute the squared norm and exploit the fact that the $b_j$ form an ONB:
$$J_M = \frac{1}{N} \sum_{n=1}^{N} \sum_{j=M+1}^{D} (b_j^\top x_n)^2 = \frac{1}{N} \sum_{n=1}^{N} \sum_{j=M+1}^{D} b_j^\top x_n \, b_j^\top x_n = \frac{1}{N} \sum_{n=1}^{N} \sum_{j=M+1}^{D} b_j^\top x_n x_n^\top b_j,$$
where we exploited the symmetry of the dot product in the last step to write $b_j^\top x_n = x_n^\top b_j$. We now swap the sums and obtain
$$J_M = \sum_{j=M+1}^{D} b_j^\top \underbrace{\left( \frac{1}{N} \sum_{n=1}^{N} x_n x_n^\top \right)}_{=:\,S} b_j = \sum_{j=M+1}^{D} b_j^\top S b_j = \sum_{j=M+1}^{D} \mathrm{tr}\!\left( b_j^\top S b_j \right) = \sum_{j=M+1}^{D} \mathrm{tr}\!\left( S b_j b_j^\top \right) = \mathrm{tr}\Bigg( \underbrace{\Bigg( \sum_{j=M+1}^{D} b_j b_j^\top \Bigg)}_{\text{projection matrix}} S \Bigg),$$
where we exploited the property that the trace operator $\mathrm{tr}(\cdot)$ is linear and invariant to cyclic permutations of its arguments.
• The loss is therefore the variance of the data projected onto the orthogonal complement of the principal subspace (the trace of the data covariance matrix projected onto that complement).
• Minimizing the average squared reconstruction error is therefore equivalent to
minimizing the variance of the data when projected onto the subspace we
ignore, i.e., the orthogonal complement of the principal subspace.
• Equivalently, we maximize the variance of the projection that we retain in the
principal subspace, which links the projection loss immediately to the
maximum-variance formulation of PCA in Section 10.2.
• Combining this with the maximum-variance result of Section 10.2 (the optimal basis vectors are eigenvectors of $S$), the average squared reconstruction error, when projecting onto the $M$-dimensional principal subspace, is
$$J_M = \sum_{j=M+1}^{D} \lambda_j,$$
where the $\lambda_j$ are the $D - M$ smallest eigenvalues of the data covariance matrix.
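A small NumPy check (illustrative; the toy data is made up) that, for the eigenvector basis, the average squared reconstruction error indeed equals the sum of the discarded eigenvalues:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # toy data, shape (N, D)
X = X - X.mean(axis=0)                                    # center the data
N, D = X.shape
M = 2                                                     # dimension of the principal subspace

S = X.T @ X / N
eigvals, eigvecs = np.linalg.eigh(S)                      # ascending order
order = np.argsort(eigvals)[::-1]                         # sort eigenpairs by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

B = eigvecs[:, :M]                                        # top-M eigenvectors as columns
X_tilde = X @ B @ B.T                                     # reconstructions B B^T x_n (as rows)
J_M = np.mean(np.sum((X - X_tilde) ** 2, axis=1))         # average squared reconstruction error

assert np.isclose(J_M, eigvals[M:].sum())                 # equals the sum of discarded eigenvalues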
10.5 PCA in High Dimensions
• If the number of data points $N$ is much smaller than the dimensionality $D$, eigendecomposing the $D \times D$ data covariance matrix $S = \frac{1}{N} X X^\top$ (with $X = [x_1, \dots, x_N] \in \mathbb{R}^{D \times N}$) is expensive.
• Instead, we can perform an eigendecomposition of the much smaller $N \times N$ matrix $\frac{1}{N} X^\top X$: it has the same nonzero eigenvalues as $S$, and if $c$ is one of its eigenvectors with eigenvalue $\lambda > 0$, then $X c$ is an eigenvector of $S$ with the same eigenvalue (normalize it to obtain a unit-length principal component).
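A sketch of this trick under the assumptions above (illustrative, not the slides' own code):

import numpy as np

rng = np.random.default_rng(2)
N, D = 50, 10000                        # many more dimensions than data points
X = rng.normal(size=(D, N))             # data matrix with columns x_n
X = X - X.mean(axis=1, keepdims=True)   # center the data

K = X.T @ X / N                         # small N x N matrix (1/N) X^T X
eigvals, C = np.linalg.eigh(K)          # same nonzero eigenvalues as S = (1/N) X X^T

keep = eigvals > 1e-10 * eigvals.max()  # discard numerically zero eigenvalues
V = X @ C[:, keep]                      # columns X c_i are (unnormalized) eigenvectors of S
V = V / np.linalg.norm(V, axis=0)       # normalize to obtain unit-length principal components

# Check the top eigenpair: S v = lambda v, without ever forming the D x D matrix S.
v, lam = V[:, -1], eigvals[keep][-1]
assert np.allclose(X @ (X.T @ v) / N, lam * v)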
10.6 Key Steps of PCA in Practice
• Step 1. Mean subtraction
• We center the data by computing the mean $\mu$ of the dataset and subtracting it from every single data point. This ensures that the dataset has mean $0$.
• Step 2. Standardization
• We divide every dimension of the centered data by the standard deviation $\sigma_d$ of the dataset in that dimension, $d = 1, \dots, D$, so that the data is unit-free and has variance $1$ along each axis.
• Step 3. Eigendecomposition of the covariance matrix
• Compute the data covariance matrix and its eigenvalues and corresponding eigenvectors. The eigenvector associated with the larger eigenvalue (the longer vector in the illustration) spans the principal subspace $U$.
• Step 4. Projection
• We can project any data point $x_* \in \mathbb{R}^D$ onto the principal subspace. To get this right, we first need to standardize $x_*$ using the mean $\mu_d$ and standard deviation $\sigma_d$ of the training data in the $d$-th dimension, respectively, so that
$$x_*^{(d)} \leftarrow \frac{x_*^{(d)} - \mu_d}{\sigma_d}, \qquad d = 1, \dots, D,$$
where $x_*^{(d)}$ is the $d$-th component of $x_*$.
• We obtain the projection as
$$\tilde{x}_* = B B^\top x_*$$
with coordinates
$$z_* = B^\top x_*$$
with respect to the basis of the principal subspace. Here, $B$ is the matrix that contains the eigenvectors associated with the largest eigenvalues of the data covariance matrix as columns.
• Having standardized our dataset, $\tilde{x}_* = B B^\top x_*$ only yields the projection in the coordinates of the standardized dataset.
• To obtain the projection in the original data space (i.e., before standardization), we need to undo the standardization: multiply by the standard deviation before adding the mean. We obtain
$$\tilde{x}_*^{(d)} \leftarrow \tilde{x}_*^{(d)} \sigma_d + \mu_d, \qquad d = 1, \dots, D.$$
• Figure 10.10(f) illustrates the projection in the original data space. A sketch of the full pipeline follows below.
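Putting the four steps together, a compact NumPy sketch of the whole pipeline (an illustration of the steps above with made-up function and variable names, assuming every dimension has nonzero variance):

import numpy as np

def pca_fit(X_train, M):
    """Learn the PCA parameters. X_train: array of shape (N, D), rows are data points."""
    mu = X_train.mean(axis=0)                      # Step 1: mean of each dimension
    sigma = X_train.std(axis=0)                    # Step 2: standard deviation of each dimension
    X_std = (X_train - mu) / sigma                 # centered, standardized data
    S = X_std.T @ X_std / X_std.shape[0]           # data covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)           # Step 3: eigendecomposition (ascending order)
    B = eigvecs[:, np.argsort(eigvals)[::-1][:M]]  # top-M eigenvectors as columns
    return mu, sigma, B

def pca_project(x_star, mu, sigma, B):
    """Step 4: project a new point and map the result back to the original data space."""
    x_std = (x_star - mu) / sigma                  # standardize with the training statistics
    z_star = B.T @ x_std                           # coordinates in the principal subspace
    x_tilde_std = B @ z_star                       # projection in the standardized space
    return x_tilde_std * sigma + mu                # undo the standardization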