Dimensionality Reduction With Principal Component Analysis
Liang Zheng
Australian National University
[email protected]
[Figure: "Discovering faster matrix multiplication algorithms with reinforcement learning", Fawzi et al., Nature 2022]
[Figure: we rotate our price and area axes.]
Motivation
• High-dimensional data, such as images, is hard to analyze, interpret, and
visualize, and expensive to store.
• Good news
• high-dimensional data is often overcomplete, i.e., many dimensions are
redundant and can be explained by a combination of other dimensions
• Furthermore, dimensions in high-dimensional data are often correlated so
that the data possesses an intrinsic lower-dimensional structure.
The data in (a) does not vary much in the $x_2$-direction, so we can express it as if it were on a line, with nearly no loss of information; see (b).
• Example (Coordinate Representation/Code)
• Consider $\mathbb{R}^2$ with the canonical basis $e_1 = [1, 0]^\top$, $e_2 = [0, 1]^\top$.
• Any $x \in \mathbb{R}^2$ can be represented as a linear combination of these basis vectors, e.g.,
$$[5, 3]^\top = 5 e_1 + 3 e_2$$
• However, when we consider vectors of the form
$$\tilde{x} = [0, z]^\top \in \mathbb{R}^2, \qquad z \in \mathbb{R},$$
they can always be written as $0 e_1 + z e_2$.
• To represent these vectors it is sufficient to store the coordinate/code $z$ of $\tilde{x}$ with respect to the $e_2$ vector.
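To make the coding/decoding idea concrete, here is a tiny NumPy sketch (illustrative, not part of the original slides; all variable names are made up):

import numpy as np

e1 = np.array([1.0, 0.0])   # canonical basis vector e_1
e2 = np.array([0.0, 1.0])   # canonical basis vector e_2

x = np.array([0.0, 3.0])    # a vector of the form [0, z]^T
z = e2 @ x                  # the code: coordinate of x with respect to e_2 (here z = 3.0)
x_decoded = z * e2          # decoding: rebuild the full 2D vector from the 1D code

assert np.allclose(x, x_decoded)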
10.2 PCA from Maximum Variance Perspective
• We ignore the $x_2$-coordinate of the data because it does not add much information: the compressed data in (b) is similar to the original data in (a).
• We derive PCA so as to maximize the variance in the low-dimensional
representation of the data to retain as much information as possible
• Retaining most information after data compression is equivalent to capturing
the largest amount of variance in the low-dimensional code (Hotelling, 1933)
10.2.1 Direction with Maximal Variance
• Data centering
• In the data covariance matrix, we assume centered (zero-mean) data:
$$S = \frac{1}{N} \sum_{n=1}^{N} x_n x_n^\top$$
• Let $\mu$ be the mean of the data and let $z = B^\top (x - \mu)$ denote the low-dimensional code, where $B := [b_1, \dots, b_M] \in \mathbb{R}^{D \times M}$ collects the basis vectors of the lower-dimensional subspace. Using the properties of the variance, we obtain
$$\mathbb{V}_z[z] = \mathbb{V}_x[B^\top (x - \mu)] = \mathbb{V}_x[B^\top x - B^\top \mu] = \mathbb{V}_x[B^\top x]$$
since the constant shift $B^\top \mu$ does not change the variance (recall $\mathbb{V}[x + y] = \mathbb{V}[x] + \mathbb{V}[y] + \mathrm{Cov}[x, y] + \mathrm{Cov}[y, x]$, and the covariance of anything with a constant vanishes).
• That is, the variance of the low-dimensional code does not depend on the mean of the data, so we may assume centered data.
• With this assumption, the mean of the low-dimensional code is also $0$, since
$$\mathbb{E}_z[z] = \mathbb{E}_x[B^\top x] = B^\top \mathbb{E}_x[x] = 0$$
• To maximize the variance of the low-dimensional code, we first seek a single vector $b_1 \in \mathbb{R}^D$ that maximizes the variance of the projected data, i.e., we aim to maximize the variance of the first coordinate $z_1$ of $z \in \mathbb{R}^M$, so that
$$V_1 := \mathbb{V}[z_1] = \frac{1}{N} \sum_{n=1}^{N} z_{1n}^2$$
is maximized, where $z_{1n}$ denotes the first coordinate of the low-dimensional representation $z_n \in \mathbb{R}^M$ of $x_n \in \mathbb{R}^D$. It is given by
$$z_{1n} = b_1^\top x_n,$$
i.e., it is the coordinate of the orthogonal projection of $x_n$ onto the one-dimensional subspace spanned by $b_1$. We substitute $z_{1n}$ into $V_1$ and obtain
$$V_1 = \frac{1}{N} \sum_{n=1}^{N} (b_1^\top x_n)^2 = \frac{1}{N} \sum_{n=1}^{N} b_1^\top x_n x_n^\top b_1 = b_1^\top \left( \frac{1}{N} \sum_{n=1}^{N} x_n x_n^\top \right) b_1 = b_1^\top S b_1,$$
where $S$ is the data covariance matrix.
• We further restrict all solutions to $\lVert b_1 \rVert^2 = 1$.
• We have the following constrained optimization problem:
$$\max_{b_1} \; b_1^\top S b_1 \quad \text{subject to} \quad \lVert b_1 \rVert^2 = 1$$
• We obtain the Lagrangian (not required in this course)
$$\mathfrak{L}(b_1, \lambda_1) = b_1^\top S b_1 + \lambda_1 (1 - b_1^\top b_1)$$
• The partial derivatives of $\mathfrak{L}$ with respect to $b_1$ and $\lambda_1$ are
$$\frac{\partial \mathfrak{L}}{\partial b_1} = 2 b_1^\top S - 2 \lambda_1 b_1^\top, \qquad \frac{\partial \mathfrak{L}}{\partial \lambda_1} = 1 - b_1^\top b_1$$
• Setting these partial derivatives to $0$ gives us the relations
$$S b_1 = \lambda_1 b_1, \qquad b_1^\top b_1 = 1$$
• We see that $b_1$ is an eigenvector of $S$, and $\lambda_1$ is the corresponding eigenvalue. We can therefore rewrite our objective as
$$V_1 = b_1^\top S b_1 = \lambda_1 b_1^\top b_1 = \lambda_1,$$
i.e., the variance of the data projected onto a one-dimensional subspace equals the eigenvalue associated with the basis vector $b_1$ that spans this subspace.
• To maximize the variance of the low-dimensional code, we choose the basis vector associated with the largest eigenvalue of the data covariance matrix. This eigenvector is called the first principal component.
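To make this concrete, the following NumPy sketch (illustrative, not from the slides; the toy data and variable names are made up) computes the first principal component of centered data as the eigenvector of $S$ with the largest eigenvalue:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])   # toy data, shape (N, D)

X_centered = X - X.mean(axis=0)                       # center the data
S = X_centered.T @ X_centered / X_centered.shape[0]   # S = (1/N) sum_n x_n x_n^T

eigvals, eigvecs = np.linalg.eigh(S)                  # eigh: S is symmetric
b1 = eigvecs[:, np.argmax(eigvals)]                   # first principal component (unit norm)
V1 = eigvals.max()                                    # variance of the data projected onto b1

print(b1, V1)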
10.2.2 𝑀-dimensional Subspace with Maximal Variance
• Assume we have already found the $m-1$ eigenvectors of $S$ that are associated with the largest $m-1$ eigenvalues.
• We now want to find the $m$-th principal component.
• To do so, we subtract the effect of the first $m-1$ principal components $b_1, \dots, b_{m-1}$ from the data and then look for the direction that captures the most variance of the remaining information (see the sketch below). This gives the new data matrix
$$\hat{X} := X - \sum_{i=1}^{m-1} b_i b_i^\top X = X - B_{m-1} X,$$
where $X = [x_1, \dots, x_N] \in \mathbb{R}^{D \times N}$ contains the data points as columns and $B_{m-1} := \sum_{i=1}^{m-1} b_i b_i^\top$ is the projection matrix onto the subspace spanned by $b_1, \dots, b_{m-1}$.
• Maximizing the variance of this remaining data yields, by the same argument as before, the $m$-th principal component: the eigenvector of $S$ associated with the $m$-th largest eigenvalue $\lambda_m$.
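The following NumPy sketch (again illustrative, not the slides' own code) finds principal components one at a time using exactly this deflation idea: after each component is found, its contribution is subtracted from the data before the next direction of maximal variance is sought.

import numpy as np

def pca_by_deflation(X, M):
    """X: data matrix of shape (D, N) with columns x_n, assumed centered. Returns B of shape (D, M)."""
    D, N = X.shape
    X_hat = X.copy()
    components = []
    for _ in range(M):
        S_hat = X_hat @ X_hat.T / N             # covariance of the remaining data
        eigvals, eigvecs = np.linalg.eigh(S_hat)
        b = eigvecs[:, -1]                      # direction of maximal remaining variance
        components.append(b)
        X_hat = X_hat - np.outer(b, b) @ X_hat  # subtract the effect of this component
    return np.stack(components, axis=1)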
MNIST dataset
• 60,000 examples of handwritten digits 0 through 9.
• Each digit is a grayscale image of size 28×28, i.e., it contains 784 pixels.
• We can interpret every image in this dataset as a vector $x \in \mathbb{R}^{784}$.
Example - Eigenvalues of MNIST digit “8”
http://download.europe.naverlabs.com//ECCV-DA-Tutorial/Gabriela-Csurka-Part-2.pdf
http://download.europe.naverlabs.com//ECCV-DA-Tutorial/Tatiana-Tommasi-Part-3.pdf
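As a rough sketch of how such an eigenvalue spectrum could be computed (illustrative only; the array images_8 is a made-up placeholder that would hold the actual digit-"8" images):

import numpy as np
import matplotlib.pyplot as plt

# Placeholder data; in practice this would be the N grayscale 28x28 images of the digit "8".
images_8 = np.random.rand(500, 28, 28)

X = images_8.reshape(len(images_8), -1)   # each image becomes a vector in R^784
X = X - X.mean(axis=0)                    # center the data

S = X.T @ X / X.shape[0]                  # 784 x 784 data covariance matrix
eigvals = np.linalg.eigvalsh(S)[::-1]     # eigenvalue spectrum, largest first

plt.plot(eigvals)
plt.xlabel("index of eigenvalue")
plt.ylabel("eigenvalue")
plt.show()

For the real digit-"8" images, the spectrum decays quickly, so a few principal components already capture most of the variance.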
10.3 PCA from Projection Perspective
• Previously, we derived PCA by maximizing the variance in the projected
space to retain as much information as possible
$$\max_{b_1} \; b_1^\top S b_1 \quad \text{subject to} \quad \lVert b_1 \rVert^2 = 1$$
• Alternatively, we derive PCA as an algorithm that directly minimizes the
average reconstruction error
10.3.1 Setting and Objective
• We have a dataset $\mathcal{X} = \{x_1, \dots, x_N\}$, $x_n \in \mathbb{R}^D$, centered at $0$, i.e., $\mathbb{E}[\mathcal{X}] = 0$.
• We want to find the best linear projection of $\mathcal{X}$ onto a lower-dimensional subspace $U \subseteq \mathbb{R}^D$ with $\dim(U) = M$ and orthonormal basis vectors $b_1, \dots, b_M$.
• We call this subspace $U$ the principal subspace.
• The projections of the data points are denoted by
$$\tilde{x}_n := \sum_{m=1}^{M} z_{mn} b_m = B z_n \in \mathbb{R}^D,$$
where $B := [b_1, \dots, b_M] \in \mathbb{R}^{D \times M}$ and $z_n := [z_{1n}, \dots, z_{Mn}]^\top \in \mathbb{R}^M$ is the coordinate vector of $\tilde{x}_n$ with respect to the basis $(b_1, \dots, b_M)$.
• We want $\tilde{x}_n$ to be as similar to $x_n$ as possible.
• We define our objective as minimizing the average squared Euclidean distance (reconstruction error)
$$J_M := \frac{1}{N} \sum_{n=1}^{N} \lVert x_n - \tilde{x}_n \rVert^2$$
• We need to find the orthonormal basis of the principal subspace and the coordinates $z_n \in \mathbb{R}^M$ of the projections with respect to this basis (a short code sketch of evaluating this objective follows below).
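A minimal NumPy sketch (illustrative; the function name is made up) of this objective for a given orthonormal basis $B$, using the coordinates $z_n = B^\top x_n$ that Section 10.3.2 will show to be optimal:

import numpy as np

def reconstruction_error(X, B):
    """Average squared reconstruction error J_M.

    X: centered data matrix of shape (N, D), rows are the x_n.
    B: matrix of shape (D, M) with orthonormal columns spanning the principal subspace.
    """
    Z = X @ B          # coordinates z_n = B^T x_n (as rows)
    X_tilde = Z @ B.T  # projections x_tilde_n = B z_n (as rows)
    return np.mean(np.sum((X - X_tilde) ** 2, axis=1))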
10.3.2 Finding Optimal Coordinates
• Given an ONB $(b_1, \dots, b_M)$ of $U \subseteq \mathbb{R}^D$, to find the optimal coordinates $z_n$ with respect to this basis, we calculate the partial derivatives
$$\frac{\partial J_M}{\partial z_{in}} = \frac{\partial J_M}{\partial \tilde{x}_n} \frac{\partial \tilde{x}_n}{\partial z_{in}},$$
$$\frac{\partial J_M}{\partial \tilde{x}_n} = -\frac{2}{N} (x_n - \tilde{x}_n)^\top \in \mathbb{R}^{1 \times D},$$
$$\frac{\partial \tilde{x}_n}{\partial z_{in}} = \frac{\partial}{\partial z_{in}} \left( \sum_{m=1}^{M} z_{mn} b_m \right) = b_i$$
for $i = 1, \dots, M$, such that we obtain
$$\frac{\partial J_M}{\partial z_{in}} = -\frac{2}{N} (x_n - \tilde{x}_n)^\top b_i = -\frac{2}{N} \left( x_n - \sum_{m=1}^{M} z_{mn} b_m \right)^{\!\top} b_i \overset{\text{ONB}}{=} -\frac{2}{N} \left( x_n^\top b_i - z_{in} b_i^\top b_i \right) = -\frac{2}{N} \left( x_n^\top b_i - z_{in} \right)$$
• Setting this partial derivative to $0$ yields immediately the optimal coordinates
$$z_{in} = x_n^\top b_i = b_i^\top x_n$$
for $i = 1, \dots, M$ and $n = 1, \dots, N$.
• The optimal coordinates $z_{in}$ of the projection $\tilde{x}_n$ are the coordinates of the orthogonal projection of the original data point $x_n$ onto the one-dimensional subspace spanned by $b_i$.
• The optimal linear projection $\tilde{x}_n$ of $x_n$ is therefore an orthogonal projection.
• The coordinates of $\tilde{x}_n$ with respect to the basis $(b_1, \dots, b_M)$ are the coordinates of the orthogonal projection of $x_n$ onto the principal subspace.
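A short NumPy sketch (illustrative; the numbers are made up) of these optimal coordinates and the resulting orthogonal projection for a single data point:

import numpy as np

# Assumed example: D = 3, M = 2, with an orthonormal basis of the principal subspace.
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])       # columns b_1, b_2 are orthonormal
x = np.array([2.0, -1.0, 4.0])   # a data point x_n in R^3

z = B.T @ x                      # optimal coordinates z_in = b_i^T x_n
x_tilde = B @ z                  # projection x_tilde_n = B z_n onto the principal subspace

# The residual is orthogonal to the principal subspace:
assert np.allclose(B.T @ (x - x_tilde), 0.0)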
(a) A vector $x \in \mathbb{R}^2$ (red cross) shall be projected onto a one-dimensional subspace $U \subseteq \mathbb{R}^2$ spanned by $b$. (b) Differences $x - \tilde{x}_i$ for 50 different candidates $\tilde{x}_i$ are shown by the red lines. (c) Distances $\lVert x - \tilde{x} \rVert$ for some candidates $\tilde{x}$. (d) The vector $\tilde{x} = z_1 b \in U = \mathrm{span}[b]$ that minimizes $\lVert x - \tilde{x} \rVert$ is the orthogonal projection of $x$ onto $U$.
• We briefly recap orthogonal projections from Section 3.8 (Analytic geometry).
• If $(b_1, \dots, b_D)$ is an orthonormal basis of $\mathbb{R}^D$, then
$$\tilde{x} = \frac{b_j^\top x}{\lVert b_j \rVert^2} \, b_j = b_j b_j^\top x \in \mathbb{R}^D$$
is the orthogonal projection of $x$ onto the subspace spanned by the $j$-th basis vector, and $z_j = b_j^\top x$ is the coordinate of this projection with respect to the basis vector $b_j$ that spans that subspace, since $z_j b_j = \tilde{x}$.
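As a tiny worked instance (reusing the earlier coordinate-representation example), projecting $x = [5, 3]^\top$ onto the basis vector $b_j = e_2$ gives
$$\tilde{x} = e_2 e_2^\top x = \begin{bmatrix} 0 \\ 1 \end{bmatrix} \begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} 5 \\ 3 \end{bmatrix} = \begin{bmatrix} 0 \\ 3 \end{bmatrix}, \qquad z_2 = e_2^\top x = 3.$$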
• With the optimal coordinates, the projection reads $\tilde{x}_n = \sum_{m=1}^{M} z_{mn} b_m = \sum_{m=1}^{M} (b_m^\top x_n) b_m$. Extending $(b_1, \dots, b_M)$ to an ONB $(b_1, \dots, b_D)$ of $\mathbb{R}^D$, we can also write the original data point as
$$x_n = \sum_{d=1}^{D} z_{dn} b_d = \sum_{m=1}^{M} (b_m^\top x_n) b_m + \sum_{j=M+1}^{D} (b_j^\top x_n) b_j,$$
where we split the sum with $D$ terms into a sum over $M$ terms and a sum over $D - M$ terms.
• With these results, the displacement vector $x_n - \tilde{x}_n$, i.e., the difference vector between the original data point and its projection, is
$$x_n - \tilde{x}_n = \sum_{j=M+1}^{D} b_j (b_j^\top x_n) = \sum_{j=M+1}^{D} (x_n^\top b_j) b_j$$
• We explicitly compute the squared norm and exploit the fact that the $b_j$ form an ONB:
$$J_M = \frac{1}{N} \sum_{n=1}^{N} \sum_{j=M+1}^{D} (b_j^\top x_n)^2 = \frac{1}{N} \sum_{n=1}^{N} \sum_{j=M+1}^{D} b_j^\top x_n \, b_j^\top x_n = \frac{1}{N} \sum_{n=1}^{N} \sum_{j=M+1}^{D} b_j^\top x_n x_n^\top b_j,$$
where we exploited the symmetry of the dot product in the last step to write $b_j^\top x_n = x_n^\top b_j$. We now swap the sums and obtain
$$J_M = \sum_{j=M+1}^{D} b_j^\top \underbrace{\left( \frac{1}{N} \sum_{n=1}^{N} x_n x_n^\top \right)}_{=:\,S} b_j = \sum_{j=M+1}^{D} b_j^\top S b_j = \sum_{j=M+1}^{D} \mathrm{tr}\!\left( b_j^\top S b_j \right) = \sum_{j=M+1}^{D} \mathrm{tr}\!\left( S b_j b_j^\top \right) = \mathrm{tr}\Bigg( \underbrace{\Bigg( \sum_{j=M+1}^{D} b_j b_j^\top \Bigg)}_{\text{projection matrix}} S \Bigg),$$
where we exploited the property that the trace operator $\mathrm{tr}(\cdot)$ is linear and invariant to cyclic permutations of its arguments.
• The loss is therefore the variance of the data projected onto the orthogonal complement of the principal subspace (the trace of the data covariance matrix projected onto that complement).
• Minimizing the average squared reconstruction error is therefore equivalent to
minimizing the variance of the data when projected onto the subspace we
ignore, i.e., the orthogonal complement of the principal subspace.
• Equivalently, we maximize the variance of the projection that we retain in the
principal subspace, which links the projection loss immediately to the
maximum-variance formulation of PCA in Section 10.2.
• Combining this with the maximum-variance result of Section 10.2 (the optimal basis vectors are eigenvectors of $S$), the average squared reconstruction error, when projecting onto the $M$-dimensional principal subspace, is
$$J_M = \sum_{j=M+1}^{D} \lambda_j,$$
where the $\lambda_j$ are the $D - M$ smallest eigenvalues of the data covariance matrix.
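A small NumPy check (illustrative; the toy data is made up) that, for the eigenvector basis, the average squared reconstruction error indeed equals the sum of the discarded eigenvalues:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # toy data, shape (N, D)
X = X - X.mean(axis=0)                                    # center the data
N, D = X.shape
M = 2                                                     # dimension of the principal subspace

S = X.T @ X / N
eigvals, eigvecs = np.linalg.eigh(S)                      # ascending order
order = np.argsort(eigvals)[::-1]                         # sort eigenpairs by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

B = eigvecs[:, :M]                                        # top-M eigenvectors as columns
X_tilde = X @ B @ B.T                                     # reconstructions B B^T x_n (as rows)
J_M = np.mean(np.sum((X - X_tilde) ** 2, axis=1))         # average squared reconstruction error

assert np.isclose(J_M, eigvals[M:].sum())                 # equals the sum of discarded eigenvalues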
10.5 PCA in High Dimensions
• If the number of data points $N$ is much smaller than the dimensionality $D$, eigendecomposing the $D \times D$ data covariance matrix $S = \frac{1}{N} X X^\top$ (with $X = [x_1, \dots, x_N] \in \mathbb{R}^{D \times N}$) is expensive.
• Instead, we can perform an eigendecomposition of the much smaller $N \times N$ matrix $\frac{1}{N} X^\top X$: it has the same nonzero eigenvalues as $S$, and if $c$ is one of its eigenvectors with eigenvalue $\lambda > 0$, then $X c$ is an eigenvector of $S$ with the same eigenvalue (normalize it to obtain a unit-length principal component).
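A sketch of this trick under the assumptions above (illustrative, not the slides' own code):

import numpy as np

rng = np.random.default_rng(2)
N, D = 50, 10000                        # many more dimensions than data points
X = rng.normal(size=(D, N))             # data matrix with columns x_n
X = X - X.mean(axis=1, keepdims=True)   # center the data

K = X.T @ X / N                         # small N x N matrix (1/N) X^T X
eigvals, C = np.linalg.eigh(K)          # same nonzero eigenvalues as S = (1/N) X X^T

keep = eigvals > 1e-10 * eigvals.max()  # discard numerically zero eigenvalues
V = X @ C[:, keep]                      # columns X c_i are (unnormalized) eigenvectors of S
V = V / np.linalg.norm(V, axis=0)       # normalize to obtain unit-length principal components

# Check the top eigenpair: S v = lambda v, without ever forming the D x D matrix S.
v, lam = V[:, -1], eigvals[keep][-1]
assert np.allclose(X @ (X.T @ v) / N, lam * v)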
10.6 Key Steps of PCA in Practice
• Step 1. Mean subtraction
• We center the data by computing the mean $\mu$ of the dataset and subtracting it from every single data point. This ensures that the dataset has mean $0$.
• Step 2. Standardization
• We divide every dimension of the centered data by the standard deviation $\sigma_d$ of the dataset in that dimension, $d = 1, \dots, D$, so that the data is unit-free and has variance $1$ along each axis.
• Step 3. Eigendecomposition of the covariance matrix
• Compute the data covariance matrix and its eigenvalues and corresponding eigenvectors. The eigenvector associated with the larger eigenvalue (the longer vector in the illustration) spans the principal subspace $U$.
• Step 4. Projection
• We can project any data point $x_* \in \mathbb{R}^D$ onto the principal subspace. To get this right, we first need to standardize $x_*$ using the mean $\mu_d$ and standard deviation $\sigma_d$ of the training data in the $d$-th dimension, respectively, so that
$$x_*^{(d)} \leftarrow \frac{x_*^{(d)} - \mu_d}{\sigma_d}, \qquad d = 1, \dots, D,$$
where $x_*^{(d)}$ is the $d$-th component of $x_*$.
• We obtain the projection as
$$\tilde{x}_* = B B^\top x_*$$
with coordinates
$$z_* = B^\top x_*$$
with respect to the basis of the principal subspace. Here, $B$ is the matrix that contains the eigenvectors associated with the largest eigenvalues of the data covariance matrix as columns.
• Having standardized our dataset, $\tilde{x}_* = B B^\top x_*$ only yields the projection in the coordinates of the standardized dataset.
• To obtain the projection in the original data space (i.e., before standardization), we need to undo the standardization: multiply by the standard deviation before adding the mean. We obtain
$$\tilde{x}_*^{(d)} \leftarrow \tilde{x}_*^{(d)} \sigma_d + \mu_d, \qquad d = 1, \dots, D.$$
• Figure 10.10(f) illustrates the projection in the original data space. A sketch of the full pipeline follows below.
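Putting the four steps together, a compact NumPy sketch of the whole pipeline (an illustration of the steps above with made-up function and variable names, assuming every dimension has nonzero variance):

import numpy as np

def pca_fit(X_train, M):
    """Learn the PCA parameters. X_train: array of shape (N, D), rows are data points."""
    mu = X_train.mean(axis=0)                      # Step 1: mean of each dimension
    sigma = X_train.std(axis=0)                    # Step 2: standard deviation of each dimension
    X_std = (X_train - mu) / sigma                 # centered, standardized data
    S = X_std.T @ X_std / X_std.shape[0]           # data covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)           # Step 3: eigendecomposition (ascending order)
    B = eigvecs[:, np.argsort(eigvals)[::-1][:M]]  # top-M eigenvectors as columns
    return mu, sigma, B

def pca_project(x_star, mu, sigma, B):
    """Step 4: project a new point and map the result back to the original data space."""
    x_std = (x_star - mu) / sigma                  # standardize with the training statistics
    z_star = B.T @ x_std                           # coordinates in the principal subspace
    x_tilde_std = B @ z_star                       # projection in the standardized space
    return x_tilde_std * sigma + mu                # undo the standardization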