
Lecture 12

Math Foundations Team


Introduction

▶ In this lecture we will look at principal components analysis (PCA) and dimensionality reduction.
▶ High-dimensional data is hard to visualize and interpret. Can we project this data into lower dimensions while preserving its semantics, so that we can draw the same conclusions as if we had interpreted the higher-dimensional data?
▶ Higher-dimensional data is often overcomplete, in that there are redundant dimensions which can be explained by a combination of other dimensions.
▶ Dimensions in higher-dimensional data might be correlated, so the actual data may have an intrinsic lower-dimensional structure.
Principal components analysis

▶ PCA is a technique for linear dimensionality reduction. It was first proposed by Pearson in 1901 and was independently rediscovered by Hotelling in 1933.
Problem setting

▶ We are interested in finding projections $\tilde{x}_n$ of datapoints $x_n$ which are as similar as possible to the original datapoints but have lower dimensionality.
▶ Consider an independent, identically distributed dataset $\{x_1, x_2, \ldots, x_N\}$, $x_n \in \mathbb{R}^D$, with mean 0, which possesses the data covariance matrix $S = \frac{1}{N}\sum_{n=1}^{N} x_n x_n^T$.
▶ We assume there exists a lower-dimensional compressed representation $z_n$ of $x_n$ such that $z_n = B^T x_n$, where the projection matrix $B = [b_1, \ldots, b_M] \in \mathbb{R}^{D \times M}$.
▶ The columns of $B$ are orthonormal, which means $b_i^T b_j = 0$ when $i \neq j$ and $b_i^T b_i = 1$.
▶ We seek an $M$-dimensional subspace $U \subseteq \mathbb{R}^D$ with $\dim(U) = M < D$ onto which we can project the given data.
Problem setting

▶ The figure below shows how $z$ represents the lower-dimensional representation of the compressed data $\tilde{x}$ and plays the role of a bottleneck which controls the information flow between $x$ and $\tilde{x}$.
▶ There exists a linear relationship between the original data $x$, its low-dimensional code $z$ and the compressed data $\tilde{x}$: $z = B^T x$ and $\tilde{x} = Bz$ for a suitable matrix $B$.
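To make the setup concrete, here is a minimal NumPy sketch of compressing a data point with an orthonormal $B$ and reconstructing it. The matrix, dimensions and data are illustrative assumptions of this sketch, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

D, M = 5, 2                      # original and reduced dimensions (illustrative)
x = rng.normal(size=D)           # a single data point in R^D

# Build a D x M matrix B with orthonormal columns via QR decomposition.
B, _ = np.linalg.qr(rng.normal(size=(D, M)))

z = B.T @ x                      # low-dimensional code, z = B^T x
x_tilde = B @ z                  # reconstruction, x_tilde = B z

print(z.shape, x_tilde.shape)            # (2,) (5,)
print(np.allclose(B.T @ B, np.eye(M)))   # columns of B are orthonormal
```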
Maximum variance perspective

▶ We can interpret the information content in the data as how "space-filling" it is, and describe the information contained in the data by looking at its spread.
▶ We can capture the spread of the data using the concept of variance.
▶ PCA can then be viewed as a dimensionality reduction algorithm that maximizes the variance in the low-dimensional representation of the data, so as to retain as much information as possible.
▶ Mathematically, our aim is to find a matrix $B$ so that we retain as much information as possible when projecting the data onto its columns $b_1, b_2, \ldots, b_M$.
Centred data

▶ For the data covariance matrix $S = \frac{1}{N}\sum_{n=1}^{N} x_n x_n^T$ we assume centred data, and we can make this assumption without loss of generality.
▶ Let $\mu$ be the mean of the data. Centred data means that we work with the data vectors $x - \mu$ rather than the original vectors $x$, but this does not change the variance.
▶ To see this, note that $\mathbb{V}_z(z) = \mathbb{V}_x(B^T(x - \mu)) = \mathbb{V}_x(B^T x - B^T \mu) = \mathbb{V}_x(B^T x)$.
▶ Therefore we assume that the data has a mean of 0 for this lecture.
▶ Letting the mean be $\mathbb{E}_x(x) = 0$ gives $\mathbb{E}_z(z) = \mathbb{E}_x(B^T x) = B^T \mathbb{E}_x(x) = 0$.
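A quick numerical check of this claim, under the illustrative assumption of random Gaussian data with a nonzero mean: the covariance of the projected code is the same whether or not we subtract the mean first.

```python
import numpy as np

rng = np.random.default_rng(1)

N, D, M = 500, 4, 2
X = rng.normal(size=(N, D)) + 5.0             # rows are data points, deliberately not centred
B, _ = np.linalg.qr(rng.normal(size=(D, M)))  # orthonormal projection directions

Z_raw = X @ B                                 # codes from uncentred data
Z_centred = (X - X.mean(axis=0)) @ B          # codes from centred data

# The (co)variance of the code is unchanged by centring the data first.
print(np.allclose(np.cov(Z_raw, rowvar=False), np.cov(Z_centred, rowvar=False)))
```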
Direction with maximal variance

▶ We maximize the variance of the low-dimensional code by following a sequential approach.
▶ First we aim to maximize the variance of the first coordinate $z_{1n}$ of $z \in \mathbb{R}^M$, so that $V_1 = \mathbb{V}(z_1) = \frac{1}{N}\sum_{n=1}^{N} z_{1n}^2$.
▶ In the above expression for the variance we have used the i.i.d. assumption on the data and the fact that the data is centred, so the variance is simply the mean of the squared coordinates.
▶ We can rewrite $z_{1n} = b_1^T x_n$, which can be viewed as the orthogonal projection of $x_n$ onto the one-dimensional subspace spanned by $b_1$.
▶ Then we have
$V_1 = \frac{1}{N}\sum_{n=1}^{N} (b_1^T x_n)^2 = \frac{1}{N}\sum_{n=1}^{N} b_1^T x_n x_n^T b_1 = b_1^T \left(\frac{1}{N}\sum_{n=1}^{N} x_n x_n^T\right) b_1 = b_1^T S b_1.$
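As a sanity check, here is a short NumPy sketch (random data and an arbitrary unit vector $b_1$, both chosen for illustration) confirming that the variance of the projected coordinates equals the quadratic form $b_1^T S b_1$.

```python
import numpy as np

rng = np.random.default_rng(2)

N, D = 1000, 3
X = rng.normal(size=(N, D))
X -= X.mean(axis=0)                 # centre the data

b1 = rng.normal(size=D)
b1 /= np.linalg.norm(b1)            # arbitrary unit-norm direction

S = (X.T @ X) / N                   # data covariance matrix, S = (1/N) sum x_n x_n^T
z1 = X @ b1                         # projected coordinates z_{1n} = b1^T x_n

V1_empirical = np.mean(z1 ** 2)     # (1/N) sum z_{1n}^2
V1_quadratic = b1 @ S @ b1          # b1^T S b1

print(np.allclose(V1_empirical, V1_quadratic))  # True
```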
Direction with maximal variance

▶ Arbitrarily increasing the magnitude of the vector $b_1$ will increase the variance, so we seek to maximize the variance subject to $\|b_1\| = 1$.
▶ Finding the direction $b_1$ that maximizes the variance can be set up as the constrained optimization problem

$\max_{b_1} \; b_1^T S b_1 \quad \text{subject to} \quad \|b_1\| = 1$

▶ To solve this problem we set up the Lagrangian
$L(b_1, \lambda) = b_1^T S b_1 + \lambda(1 - b_1^T b_1).$
▶ How do we solve this Lagrangian?
Solving the Lagrangian

▶ To solve the Lagrangian, we set the partial derivatives with respect to $\lambda$ and $b_1$ to zero, i.e. $\frac{\partial L}{\partial \lambda} = 0$ and $\frac{\partial L}{\partial b_1} = 0$.
▶ The partial derivatives can be calculated as follows:

$\frac{\partial L}{\partial \lambda} = 1 - b_1^T b_1$
$\frac{\partial L}{\partial b_1} = 2 b_1^T S - 2 \lambda b_1^T$

▶ Setting these partial derivatives to zero we get the two equations $S b_1 = \lambda b_1$ and $b_1^T b_1 = 1$.
▶ Thus we find that the direction $b_1$ we seek is an eigenvector of the covariance matrix $S$, and $\lambda$ is its corresponding eigenvalue.
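A minimal sketch of this result, assuming synthetic Gaussian data of my own choosing: the top eigenvector of $S$ (computed with np.linalg.eigh) satisfies $S b_1 = \lambda_1 b_1$ and attains a variance at least as large as any randomly drawn unit direction.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic centred data with unequal variances along different axes.
N, D = 2000, 3
X = rng.normal(size=(N, D)) * np.array([3.0, 1.0, 0.5])
X -= X.mean(axis=0)

S = (X.T @ X) / N                       # data covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)    # eigh returns eigenvalues in ascending order
lam1 = eigvals[-1]                      # largest eigenvalue
b1 = eigvecs[:, -1]                     # corresponding eigenvector (unit norm)

print(np.allclose(S @ b1, lam1 * b1))   # b1 is an eigenvector: S b1 = lambda1 b1

# No random unit vector should beat b1's variance b1^T S b1 (= lambda1).
random_dirs = rng.normal(size=(1000, D))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
print((np.einsum('id,de,ie->i', random_dirs, S, random_dirs) <= lam1 + 1e-12).all())
```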
First principal component

▶ Substituting the result of the previous slide into the objective function of the constrained optimization problem, i.e. $\max b_1^T S b_1$, we have $b_1^T S b_1 = b_1^T \lambda b_1 = \lambda$.
▶ Our objective therefore boils down to maximizing $\lambda$, which means we are looking for the eigenvector of $S$ that corresponds to its largest eigenvalue.
▶ This is the first principal component.
▶ Let us now examine the inner workings of the Lagrangian method.
Why does this method work?

▶ Suppose we have the following constrained optimization problem:

$\max f(x, y) \quad \text{subject to} \quad g(x, y) = c$

▶ We note that at the optimal solution $(x_0, y_0)$, if we move a small distance $(\delta x, \delta y)$, we must continue to remain on the surface $g(x, y) = c$. This means $g(x_0 + \delta x, y_0 + \delta y) = c = g(x_0, y_0)$.
▶ But $dg = g(x_0 + \delta x, y_0 + \delta y) - g(x_0, y_0) = \nabla g \cdot (\delta x, \delta y) = 0$.
▶ Therefore $\nabla g$ is orthogonal to the displacement vector $(\delta x, \delta y)$.
Why does this method work?

▶ As we move along the displacement vector $(\delta x, \delta y)$, the value of the objective function also cannot change, because otherwise we could get a better solution by moving along the displacement vector or its negative.
▶ Thus we have $\nabla f \cdot (\delta x, \delta y) = 0$, so $\nabla f$ is orthogonal to the displacement vector $(\delta x, \delta y)$.
▶ Keeping in mind the result from the previous slide, we note that both $\nabla f$ and $\nabla g$ are orthogonal to the displacement vector $(\delta x, \delta y)$ and hence must be parallel.
▶ This leads us to the equation $\nabla f = \lambda \nabla g$, which is what we get when we set $\frac{\partial L}{\partial b_1} = 0$.
▶ The other partial derivative, $\frac{\partial L}{\partial \lambda} = 0$, merely enforces the constraint of the original constrained optimization problem.
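To illustrate, here is a small sketch on a toy problem of this form (maximize $f(x,y) = x + y$ subject to $x^2 + y^2 = 1$; the choice of $f$ and $g$ is mine, not from the lecture): at the optimum the two gradients are indeed parallel.

```python
import numpy as np

# Toy problem: maximize f(x, y) = x + y subject to g(x, y) = x^2 + y^2 = 1.
# The maximizer is (1/sqrt(2), 1/sqrt(2)).
x0, y0 = 1 / np.sqrt(2), 1 / np.sqrt(2)

grad_f = np.array([1.0, 1.0])          # gradient of f at (x0, y0)
grad_g = np.array([2 * x0, 2 * y0])    # gradient of g at (x0, y0)

# Parallel gradients: the 2D cross product is zero and grad f = lambda * grad g.
cross = grad_f[0] * grad_g[1] - grad_f[1] * grad_g[0]
lam = grad_f[0] / grad_g[0]

print(np.isclose(cross, 0.0))             # True: gradients are parallel
print(np.allclose(grad_f, lam * grad_g))  # True: grad f = lambda * grad g
```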
M-dimensional subspace with maximum variance

▶ Assume that we have found the first $m - 1$ principal components as the $m - 1$ eigenvectors of $S$ that are associated with the largest $m - 1$ eigenvalues of $S$.
▶ Since $S$ is symmetric, the spectral theorem lets us use these $m - 1$ eigenvectors to construct an orthonormal basis of an $(m - 1)$-dimensional subspace of $\mathbb{R}^D$.
▶ The $m$th principal component can be found by subtracting from the data the contribution of the first $m - 1$ components $b_1, b_2, \ldots, b_{m-1}$. Essentially we are trying to find principal components that compress the remaining information.
M-dimensional subspace with maximum variance

▶ We then arrive at a new data matrix $\hat{X} = X - \sum_{i=1}^{m-1} b_i b_i^T X = X - B_{m-1} X$, where $X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{D \times N}$ contains the data points as column vectors and $B_{m-1} = \sum_{i=1}^{m-1} b_i b_i^T$ is a projection matrix that projects $X$ onto the subspace spanned by $b_1, b_2, \ldots, b_{m-1}$.
▶ Note that we are collecting the data vectors $x_1, x_2, \ldots, x_N$ as column vectors in the data matrix, rather than as row vectors as is done conventionally.
▶ To find the $m$th principal component we maximize $V_m = \mathbb{V}[z_m] = \frac{1}{N}\sum_{n=1}^{N} z_{mn}^2 = \frac{1}{N}\sum_{n=1}^{N} (b_m^T \hat{x}_n)^2 = b_m^T \hat{S} b_m$.
▶ What is $\hat{S}$ in the above equation?


M-dimensional subspace with maximum variance

▶ $\hat{S}$ is the data covariance matrix of the transformed dataset $\hat{X} = [\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_N]$.
▶ As we did for the first principal component, we set up a constrained optimization problem and establish that the optimal solution $b_m$ is the eigenvector of $\hat{S}$ that corresponds to its largest eigenvalue.
▶ We now establish that $b_m$ is also an eigenvector of the original data covariance matrix $S$.
▶ More generally, the sets of eigenvectors of $\hat{S}$ and $S$ are the same.
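The following sketch (again with synthetic data of my own choosing) illustrates this deflation step numerically: after projecting out the first $m-1$ eigenvectors of $S$, the top eigenvector of the deflated covariance $\hat{S}$ matches the $m$th eigenvector of $S$.

```python
import numpy as np

rng = np.random.default_rng(4)

N, D, m = 2000, 4, 3                      # find the 3rd principal component by deflation
X = rng.normal(size=(N, D)) @ np.diag([4.0, 3.0, 2.0, 1.0])
X = (X - X.mean(axis=0)).T                # columns are data points, as in the lecture

S = (X @ X.T) / N
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]         # sort eigenpairs by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

B_prev = eigvecs[:, :m - 1] @ eigvecs[:, :m - 1].T   # B_{m-1} = sum_i b_i b_i^T
X_hat = X - B_prev @ X                                # deflated data
S_hat = (X_hat @ X_hat.T) / N

hat_vals, hat_vecs = np.linalg.eigh(S_hat)
b_m_hat = hat_vecs[:, -1]                 # top eigenvector of S_hat

# It coincides with the m-th eigenvector of S (up to sign) ...
print(np.allclose(np.abs(b_m_hat @ eigvecs[:, m - 1]), 1.0))
# ... and its eigenvalue is lambda_m, the m-th largest eigenvalue of S.
print(np.isclose(hat_vals[-1], eigvals[m - 1]))
```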
Eigenvectors of S and Ŝ

▶ We now show that the eigenvectors of $S$ and $\hat{S}$ are the same.
▶ Let $b_i$ be an eigenvector of $S$, i.e. $S b_i = \lambda_i b_i$.
▶ Now we can write

$\hat{S} b_i = \frac{1}{N}(X - B_{m-1} X)(X - B_{m-1} X)^T b_i$
$\quad\ = \frac{1}{N}\left(X X^T - X X^T B_{m-1}^T - B_{m-1} X X^T + B_{m-1} X X^T B_{m-1}^T\right) b_i$
$\quad\ = \left(S - S B_{m-1}^T - B_{m-1} S + B_{m-1} S B_{m-1}^T\right) b_i$
$\quad\ = \left(S - S B_{m-1} - B_{m-1} S + B_{m-1} S B_{m-1}\right) b_i$

▶ Note that in the last line we have used the fact that $B_{m-1}$ is a projection matrix and is therefore symmetric.
Eigenvectors of S and Ŝ

▶ We have two cases: $i \geq m$ and $i \leq m - 1$.
▶ When $i \geq m$, $b_i$ is an eigenvector that is not among the first $m - 1$ components.
▶ Since $B_{m-1} = \sum_{j=1}^{m-1} b_j b_j^T$ and $b_i$ (for $i \geq m$) is orthogonal to $b_1, \ldots, b_{m-1}$, we have $B_{m-1} b_i = 0$.
▶ Plugging this into the last equation on the previous slide, we see that $\hat{S} b_i = (S - B_{m-1} S) b_i = S b_i = \lambda_i b_i$.
▶ In particular, $\hat{S} b_m = S b_m = \lambda_m b_m$; here $\lambda_m$ is the $m$th largest eigenvalue of $S$ and is also the largest eigenvalue of $\hat{S}$, because of the way the constrained optimization problem is set up.
▶ On the other hand, when $i \leq m - 1$, we can see that $B_{m-1} b_i = b_i$.
Eigenvectors of S and Ŝ

▶ From the previous slide, when $i \leq m - 1$ we have $B_{m-1} b_i = b_i$.
▶ Plugging this into $\hat{S} b_i = (S - S B_{m-1} - B_{m-1} S + B_{m-1} S B_{m-1}) b_i$, we get $\hat{S} b_i = 0$.
▶ Thus the vectors $b_1, b_2, \ldots, b_{m-1}$ are eigenvectors of $\hat{S}$ associated with the eigenvalue 0.
▶ Since $V_m = b_m^T \hat{S} b_m = b_m^T S b_m = \lambda_m$, we see that the variance of the data projected onto the $m$th principal component is $\lambda_m$.
▶ To find an $M$-dimensional subspace that retains as much information as possible, PCA tells us to choose the columns of the matrix $B$ to be the $M$ eigenvectors of the data covariance matrix $S$ that have the largest eigenvalues.
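Putting the maximum variance perspective together, here is a compact NumPy sketch of PCA as described so far. The synthetic data and the function name pca_max_variance are my own choices for illustration: centre the data, form $S$, and keep the $M$ eigenvectors with the largest eigenvalues as the columns of $B$.

```python
import numpy as np

def pca_max_variance(X, M):
    """PCA via eigendecomposition of the data covariance matrix.

    X : (N, D) array with data points as rows.
    M : number of principal components to keep.
    Returns (B, Z): B is (D, M) with principal components as columns,
    Z is (N, M) with the low-dimensional codes z_n = B^T x_n.
    """
    Xc = X - X.mean(axis=0)                  # centre the data
    S = (Xc.T @ Xc) / X.shape[0]             # data covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)     # ascending eigenvalues
    B = eigvecs[:, ::-1][:, :M]              # top-M eigenvectors
    Z = Xc @ B                               # codes z_n = B^T x_n
    return B, Z

# Illustrative usage on synthetic data.
rng = np.random.default_rng(5)
X = rng.normal(size=(500, 5)) @ np.diag([5.0, 3.0, 1.0, 0.5, 0.1])
B, Z = pca_max_variance(X, M=2)
print(B.shape, Z.shape)                      # (5, 2) (500, 2)
print(np.var(Z, axis=0))                     # variances ≈ the two largest eigenvalues of S
```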
Projection perspective

▶ We derived PCA as an algorithm that maximizes the variance in the projected space in order to retain as much information as possible.
▶ Now we shall derive PCA from a projection perspective, by minimizing the average reconstruction error. The original data is denoted $x_n$ and the reconstruction is denoted $\tilde{x}_n$; we seek to minimize the distance between $x_n$ and $\tilde{x}_n$.
▶ Assume that we have an orthonormal basis $B = (b_1, b_2, \ldots, b_D)$ of $\mathbb{R}^D$.
▶ We can write any $x \in \mathbb{R}^D$ as $x = \sum_{d=1}^{D} \xi_d b_d = \sum_{m=1}^{M} \xi_m b_m + \sum_{j=M+1}^{D} \xi_j b_j$ for suitable coordinates $\xi_d \in \mathbb{R}$.
coordinates ξd ∈ R.
Projection perspective

▶ We are interested in finding vectors $\tilde{x} \in \mathbb{R}^D$ which live in a lower-dimensional subspace $U \subset \mathbb{R}^D$ with $\dim(U) = M$, so that $\tilde{x} = \sum_{m=1}^{M} z_m b_m \in U \subset \mathbb{R}^D$ is as similar to $x$ as possible, by which we mean that the distance $\|x - \tilde{x}\|$ is as small as possible.
▶ We assume that the dataset $X = \{x_1, x_2, \ldots, x_N\}$ is centred at 0, i.e. $\mathbb{E}(X) = 0$. This is not a restrictive assumption, in that we get the same results with or without it, but it considerably simplifies the mathematical development.
▶ We call the subspace $U \subset \mathbb{R}^D$ onto which we project the vectors $x$ the principal subspace. We can write $\tilde{x}_n = \sum_{m=1}^{M} z_{mn} b_m = B z_n \in \mathbb{R}^D$, where $B = [b_1, \ldots, b_M]$ and $z_n = [z_{1n}, \ldots, z_{Mn}]^T$ is the coordinate vector of $\tilde{x}_n$ with respect to the basis $(b_1, \ldots, b_M)$.
Finding optimal coordinates

▶ The reconstruction error can be written as $J_M = \frac{1}{N}\sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2$, where we use the subscript $M$ to denote the dimension of the subspace onto which we project the data.
▶ We would like to find optimal coordinates $z_{1n}, z_{2n}, \ldots, z_{Mn}$ with respect to the basis vectors $b_1, \ldots, b_M$ for $\tilde{x}_n$, $n = 1, \ldots, N$.
▶ In geometrical terms, finding the optimal coordinates boils down to finding the representation with respect to $(b_1, \ldots, b_M)$ that minimizes the distance between $x$ and $\tilde{x}$.
▶ To do this, we need to find the orthogonal projection of $x$ onto the subspace spanned by $(b_1, \ldots, b_M)$. The concept is illustrated on the next slide.
Finding optimal coordinates

[Figure: orthogonal projection of a data point onto the subspace spanned by the basis vectors.]

Finding optimal coordinates

▶ Assume an orthonormal basis (ONB) $(b_1, \ldots, b_M)$ of the subspace $U \subset \mathbb{R}^D$.
▶ To find the optimal coordinates, we take the derivative of the reconstruction error $J_M = \frac{1}{N}\sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2$ with respect to the coordinates $z_{in}$ and set the derivative to zero, i.e. $\frac{\partial J_M}{\partial z_{in}} = 0$.
▶ We can write $\frac{\partial J_M}{\partial z_{in}} = \frac{\partial J_M}{\partial \tilde{x}_n} \frac{\partial \tilde{x}_n}{\partial z_{in}}$ using the chain rule.
▶ Note that only $\tilde{x}_n$ is a function of the $z_{in}$, and the given $x_n$ is a fixed vector independent of the coordinates $z_{in}$.
▶ We shall now find expressions for each of the partial derivatives on the right-hand side.
Finding optimal coordinates

$\frac{\partial J_M}{\partial \tilde{x}_n} = -\frac{2}{N}(x_n - \tilde{x}_n)^T$

$\frac{\partial \tilde{x}_n}{\partial z_{in}} = \frac{\partial}{\partial z_{in}}\left(\sum_{m=1}^{M} z_{mn} b_m\right) = b_i$

$\frac{\partial J_M}{\partial z_{in}} = -\frac{2}{N}(x_n - \tilde{x}_n)^T b_i = -\frac{2}{N}\left(x_n - \sum_{m=1}^{M} z_{mn} b_m\right)^T b_i = -\frac{2}{N}\left(x_n^T b_i - z_{in}\right)$
Finding optimal coordinates

▶ Note that in the preceding sequence of equations we used the orthonormality of the basis vectors $(b_1, \ldots, b_M)$ to write $\left(\sum_{m=1}^{M} z_{mn} b_m\right)^T b_i = z_{in} b_i^T b_i = z_{in}$.
▶ Setting the right-hand side of the last equation on the previous slide to zero yields $z_{in} = x_n^T b_i = b_i^T x_n$.
▶ The optimal coordinates of $\tilde{x}_n$ with respect to the basis $(b_1, \ldots, b_M)$ are therefore the coordinates of the orthogonal projection of $x_n$ onto the principal subspace.
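As a numerical illustration (random data and a random orthonormal basis, both assumptions of this sketch), the coordinates $z_{in} = b_i^T x_n$ coincide with the least-squares solution that minimizes $\|x_n - B z_n\|^2$.

```python
import numpy as np

rng = np.random.default_rng(6)

D, M = 6, 3
x = rng.normal(size=D)                        # a single data point
B, _ = np.linalg.qr(rng.normal(size=(D, M)))  # orthonormal basis of the subspace

z_projection = B.T @ x                        # z_in = b_i^T x_n
z_least_squares, *_ = np.linalg.lstsq(B, x, rcond=None)  # argmin_z ||x - B z||^2

print(np.allclose(z_projection, z_least_squares))  # True: same coordinates
```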
Finding a basis of the principal subspace

▶ To find the basis vectors $b_1, \ldots, b_M$ of the principal subspace we have to reformulate the loss function.
▶ We can write $\tilde{x}_n = \sum_{m=1}^{M} z_{mn} b_m = \sum_{m=1}^{M} (x_n^T b_m) b_m$.
▶ We can rewrite this as $\tilde{x}_n = \left(\sum_{m=1}^{M} b_m b_m^T\right) x_n$.
▶ The original data point $x_n$ can also be written as a linear combination of all the basis vectors: $x_n = \sum_{d=1}^{D} (x_n^T b_d) b_d$.
▶ How can we do this? First write $x_n = \sum_{d=1}^{D} z_{dn} b_d$ and take the inner product with $b_d$ on both sides to obtain $z_{dn} = x_n^T b_d$; orthonormality of the basis $b_1, \ldots, b_D$ gives the result.
▶ By rearranging the expression for $x_n$, we can write $x_n = \sum_{d=1}^{D} (b_d b_d^T) x_n$.
Finding a basis of the principal subspace

▶ We can split the expression for $x_n$ as follows: $x_n = \left(\sum_{m=1}^{M} b_m b_m^T\right) x_n + \left(\sum_{j=M+1}^{D} b_j b_j^T\right) x_n$.
▶ Then we find that the displacement vector $x_n - \tilde{x}_n$ can be written as $x_n - \tilde{x}_n = \left(\sum_{j=M+1}^{D} b_j b_j^T\right) x_n = \sum_{j=M+1}^{D} (x_n^T b_j) b_j$.
▶ The last expression shows that the displacement vector is exactly the projection of the original data point $x_n$ onto the orthogonal complement of the principal subspace.
▶ We can now express the loss function as follows: $J_M = \frac{1}{N}\sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2 = \frac{1}{N}\sum_{n=1}^{N} \left\|\sum_{j=M+1}^{D} (b_j^T x_n) b_j\right\|^2$.
▶ We now expand the squared norm and exploit the fact that the $b_j$ form an orthonormal basis to rewrite the loss function as shown on the next slide.
Finding a basis of the principal subspace

$J_M = \frac{1}{N}\sum_{n=1}^{N} \sum_{j=M+1}^{D} (b_j^T x_n)^2$
$\quad\ = \frac{1}{N}\sum_{n=1}^{N} \sum_{j=M+1}^{D} b_j^T x_n \, b_j^T x_n$
$\quad\ = \frac{1}{N}\sum_{n=1}^{N} \sum_{j=M+1}^{D} b_j^T x_n x_n^T b_j$
$\quad\ = \sum_{j=M+1}^{D} b_j^T \left(\frac{1}{N}\sum_{n=1}^{N} x_n x_n^T\right) b_j$
$\quad\ = \sum_{j=M+1}^{D} b_j^T S \, b_j$
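A short sketch (with synthetic data and an arbitrary orthonormal basis, both my own choices) that checks this identity: the average squared reconstruction error equals $\sum_{j=M+1}^{D} b_j^T S b_j$ even when the basis is not the eigenbasis of $S$.

```python
import numpy as np

rng = np.random.default_rng(7)

N, D, M = 400, 5, 2
X = rng.normal(size=(N, D))
X -= X.mean(axis=0)                           # centred data, rows are data points
S = (X.T @ X) / N

Q, _ = np.linalg.qr(rng.normal(size=(D, D)))  # arbitrary orthonormal basis of R^D
B_principal = Q[:, :M]                        # basis of the candidate principal subspace
B_complement = Q[:, M:]                       # basis of its orthogonal complement

X_tilde = X @ B_principal @ B_principal.T             # reconstructions x_tilde_n
J_M = np.mean(np.sum((X - X_tilde) ** 2, axis=1))     # average reconstruction error

identity_rhs = np.sum([b @ S @ b for b in B_complement.T])  # sum_j b_j^T S b_j

print(np.allclose(J_M, identity_rhs))          # True
```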
Finding a basis of the principal subspace

▶ We note that $J_M = \sum_{j=M+1}^{D} b_j^T S b_j = \mathrm{tr}\left(\sum_{j=M+1}^{D} b_j^T S b_j\right) = \mathrm{tr}\left(\sum_{j=M+1}^{D} S b_j b_j^T\right) = \mathrm{tr}\left(\left(\sum_{j=M+1}^{D} b_j b_j^T\right) S\right)$.
▶ In the above we exploit the fact that the trace operator is invariant with respect to cyclic permutation of its arguments.
▶ $S$ in the above equations is the data covariance matrix, since we assume that $\mathbb{E}(X) = 0$.
▶ We can therefore interpret the average reconstruction error as the variance of the data projected onto the orthogonal complement of the principal subspace.
Finding a basis of the principal subspace

▶ To minimize the average reconstruction error we need to minimize the variance of the data when projected onto the space we ignore, which is the orthogonal complement of the principal subspace.
▶ This is equivalent to maximizing the variance of the data when projected onto the principal subspace, which leads us to the same solution as the maximum variance perspective.
▶ The smallest value of the average squared reconstruction error turns out to be $J_M = \sum_{j=M+1}^{D} \lambda_j$, where the $\lambda_j$ are the $D - M$ smallest eigenvalues of the data covariance matrix.
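To close the loop, here is a final sketch (synthetic data, my own setup) checking that when $B$ is chosen as the top $M$ eigenvectors of $S$, the average squared reconstruction error equals the sum of the $D - M$ smallest eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(8)

N, D, M = 1000, 6, 2
X = rng.normal(size=(N, D)) @ np.diag([4.0, 3.0, 2.0, 1.0, 0.5, 0.25])
X -= X.mean(axis=0)
S = (X.T @ X) / N

eigvals, eigvecs = np.linalg.eigh(S)     # ascending order
B = eigvecs[:, -M:]                      # top-M eigenvectors span the principal subspace

X_tilde = X @ B @ B.T                    # project and reconstruct
J_M = np.mean(np.sum((X - X_tilde) ** 2, axis=1))

print(np.allclose(J_M, eigvals[:D - M].sum()))   # J_M = sum of the D - M smallest eigenvalues
```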
