
Lecture 12

Math Foundations Team


Introduction

▶ In this lecture we will look at principal components analysis (PCA) and dimensionality reduction.
▶ High-dimensional data is hard to visualize and interpret. Can we project this data into lower dimensions while preserving its semantics, so that we can draw the same conclusions as if we had interpreted the higher-dimensional data?
▶ Higher-dimensional data is often overcomplete, in that there are redundant dimensions which can be explained by a combination of other dimensions.
▶ Dimensions in higher-dimensional data might be correlated, so the actual data may have an intrinsic lower-dimensional structure.
Principal components analysis

▶ PCA is a technique for linear dimensionality reduction. It was first proposed by Pearson in 1901 and was independently rediscovered by Hotelling in 1933.
Problem setting

▶ We are interested in finding projections $\tilde{x}_n$ of datapoints $x_n$ which are as similar as possible to the original datapoints but have lower dimensionality.
▶ Consider an independent, identically distributed dataset $\{x_1, x_2, \ldots, x_N\}$, $x_n \in \mathbb{R}^D$, with mean 0, which possesses the data covariance matrix $S = \frac{1}{N}\sum_{n=1}^{N} x_n x_n^T$.
▶ We assume there exists a lower-dimensional compressed representation $z_n$ of $x_n$ such that $z_n = B^T x_n$, where the projection matrix $B = [b_1, \ldots, b_M] \in \mathbb{R}^{D \times M}$.
▶ The columns of $B$ are orthonormal, which means $b_i^T b_j = 0$ when $i \neq j$ and $b_i^T b_i = 1$.
▶ We seek an $M$-dimensional subspace $U \subseteq \mathbb{R}^D$ with $\dim(U) = M < D$ onto which we can project the given data.
Problem setting

▶ The figure below shows how $z$ represents the lower-dimensional representation of the compressed data $\tilde{x}$ and plays the role of a bottleneck which controls the information flow between $x$ and $\tilde{x}$.
▶ There exists a linear relationship between the original data $x$, its low-dimensional code $z$ and the compressed data $\tilde{x}$: $z = B^T x$ and $\tilde{x} = Bz$ for a suitable matrix $B$.
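To make the setup concrete, here is a minimal NumPy sketch of compressing a data point with an orthonormal $B$ and reconstructing it. The matrix, dimensions and data are illustrative assumptions of this sketch, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

D, M = 5, 2                      # original and reduced dimensions (illustrative)
x = rng.normal(size=D)           # a single data point in R^D

# Build a D x M matrix B with orthonormal columns via QR decomposition.
B, _ = np.linalg.qr(rng.normal(size=(D, M)))

z = B.T @ x                      # low-dimensional code, z = B^T x
x_tilde = B @ z                  # reconstruction, x_tilde = B z

print(z.shape, x_tilde.shape)            # (2,) (5,)
print(np.allclose(B.T @ B, np.eye(M)))   # columns of B are orthonormal
```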
Maximum variance perspective

▶ We can interpret the information content in the data as how "space-filling" it is, and describe the information contained in the data by looking at its spread.
▶ We can capture the spread of the data using the concept of variance.
▶ PCA can then be viewed as a dimensionality reduction algorithm that maximizes the variance in the low-dimensional representation of the data, so as to retain as much information as possible.
▶ Mathematically, our aim is to find a matrix $B$ so that we retain as much information as possible when projecting the data onto its columns $b_1, b_2, \ldots, b_M$.
Centred data

▶ For the data covariance matrix $S = \frac{1}{N}\sum_{n=1}^{N} x_n x_n^T$ we assume centred data, and we can make this assumption without loss of generality.
▶ Let $\mu$ be the mean of the data. Centred data means that we work with the data vectors $x - \mu$ rather than the original vectors $x$, but this does not change the variance.
▶ To see this, note that $\mathbb{V}_z(z) = \mathbb{V}_x(B^T(x - \mu)) = \mathbb{V}_x(B^T x - B^T \mu) = \mathbb{V}_x(B^T x)$.
▶ Therefore we assume that the data has a mean of 0 for this lecture.
▶ Letting the mean be $\mathbb{E}_x(x) = 0$ gives $\mathbb{E}_z(z) = \mathbb{E}_x(B^T x) = B^T \mathbb{E}_x(x) = 0$.
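A quick numerical check of this claim, under the illustrative assumption of random Gaussian data with a nonzero mean: the covariance of the projected code is the same whether or not we subtract the mean first.

```python
import numpy as np

rng = np.random.default_rng(1)

N, D, M = 500, 4, 2
X = rng.normal(size=(N, D)) + 5.0             # rows are data points, deliberately not centred
B, _ = np.linalg.qr(rng.normal(size=(D, M)))  # orthonormal projection directions

Z_raw = X @ B                                 # codes from uncentred data
Z_centred = (X - X.mean(axis=0)) @ B          # codes from centred data

# The (co)variance of the code is unchanged by centring the data first.
print(np.allclose(np.cov(Z_raw, rowvar=False), np.cov(Z_centred, rowvar=False)))
```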
Direction with maximal variance

▶ We maximize the variance of the low-dimensional code by following a sequential approach.
▶ First we aim to maximize the variance of the first coordinate $z_{1n}$ of $z \in \mathbb{R}^M$, so that $V_1 = \mathbb{V}(z_1) = \frac{1}{N}\sum_{n=1}^{N} z_{1n}^2$.
▶ In the above expression for the variance we have used the i.i.d. assumption on the data and the fact that the data is centred, so the variance is simply the mean of the squared coordinates.
▶ We can rewrite $z_{1n} = b_1^T x_n$, which can be viewed as the orthogonal projection of $x_n$ onto the one-dimensional subspace spanned by $b_1$.
▶ Then we have
$V_1 = \frac{1}{N}\sum_{n=1}^{N} (b_1^T x_n)^2 = \frac{1}{N}\sum_{n=1}^{N} b_1^T x_n x_n^T b_1 = b_1^T \left(\frac{1}{N}\sum_{n=1}^{N} x_n x_n^T\right) b_1 = b_1^T S b_1.$
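As a sanity check, here is a short NumPy sketch (random data and an arbitrary unit vector $b_1$, both chosen for illustration) confirming that the variance of the projected coordinates equals the quadratic form $b_1^T S b_1$.

```python
import numpy as np

rng = np.random.default_rng(2)

N, D = 1000, 3
X = rng.normal(size=(N, D))
X -= X.mean(axis=0)                 # centre the data

b1 = rng.normal(size=D)
b1 /= np.linalg.norm(b1)            # arbitrary unit-norm direction

S = (X.T @ X) / N                   # data covariance matrix, S = (1/N) sum x_n x_n^T
z1 = X @ b1                         # projected coordinates z_{1n} = b1^T x_n

V1_empirical = np.mean(z1 ** 2)     # (1/N) sum z_{1n}^2
V1_quadratic = b1 @ S @ b1          # b1^T S b1

print(np.allclose(V1_empirical, V1_quadratic))  # True
```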
Direction with maximal variance

▶ Arbitrarily increasing the magnitude of the vector $b_1$ will increase the variance, so we seek to maximize the variance subject to $\|b_1\| = 1$.
▶ Finding the direction $b_1$ that maximizes the variance can be set up as the constrained optimization problem

$\max_{b_1} \; b_1^T S b_1 \quad \text{subject to} \quad \|b_1\| = 1$

▶ To solve this problem we set up the Lagrangian
$L(b_1, \lambda) = b_1^T S b_1 + \lambda(1 - b_1^T b_1).$
▶ How do we solve this Lagrangian?
Solving the Lagrangian

▶ To solve the Lagrangian, we set the partial derivatives with respect to $\lambda$ and $b_1$ to zero, i.e. $\frac{\partial L}{\partial \lambda} = 0$ and $\frac{\partial L}{\partial b_1} = 0$.
▶ The partial derivatives can be calculated as follows:

$\frac{\partial L}{\partial \lambda} = 1 - b_1^T b_1$
$\frac{\partial L}{\partial b_1} = 2 b_1^T S - 2 \lambda b_1^T$

▶ Setting these partial derivatives to zero we get the two equations $S b_1 = \lambda b_1$ and $b_1^T b_1 = 1$.
▶ Thus we find that the direction $b_1$ we seek is an eigenvector of the covariance matrix $S$, and $\lambda$ is its corresponding eigenvalue.
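A minimal sketch of this result, assuming synthetic Gaussian data of my own choosing: the top eigenvector of $S$ (computed with np.linalg.eigh) satisfies $S b_1 = \lambda_1 b_1$ and attains a variance at least as large as any randomly drawn unit direction.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic centred data with unequal variances along different axes.
N, D = 2000, 3
X = rng.normal(size=(N, D)) * np.array([3.0, 1.0, 0.5])
X -= X.mean(axis=0)

S = (X.T @ X) / N                       # data covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)    # eigh returns eigenvalues in ascending order
lam1 = eigvals[-1]                      # largest eigenvalue
b1 = eigvecs[:, -1]                     # corresponding eigenvector (unit norm)

print(np.allclose(S @ b1, lam1 * b1))   # b1 is an eigenvector: S b1 = lambda1 b1

# No random unit vector should beat b1's variance b1^T S b1 (= lambda1).
random_dirs = rng.normal(size=(1000, D))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
print((np.einsum('id,de,ie->i', random_dirs, S, random_dirs) <= lam1 + 1e-12).all())
```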
First principal component

▶ Substituting the result of the previous slide into the objective function of the constrained optimization problem, i.e. $\max b_1^T S b_1$, we have $b_1^T S b_1 = b_1^T \lambda b_1 = \lambda$.
▶ Our objective therefore boils down to maximizing $\lambda$, which means we are looking for the eigenvector of $S$ that corresponds to its largest eigenvalue.
▶ This is the first principal component.
▶ Let us now examine the inner workings of the Lagrangian method.
Why does this method work?

▶ Suppose we have the following constrained optimization problem:

$\max f(x, y) \quad \text{subject to} \quad g(x, y) = c$

▶ We note that at the optimal solution $(x_0, y_0)$, if we move a small distance $(\delta x, \delta y)$, we must continue to remain on the surface $g(x, y) = c$. This means $g(x_0 + \delta x, y_0 + \delta y) = c = g(x_0, y_0)$.
▶ But $dg = g(x_0 + \delta x, y_0 + \delta y) - g(x_0, y_0) = \nabla g \cdot (\delta x, \delta y) = 0$.
▶ Therefore $\nabla g$ is orthogonal to the displacement vector $(\delta x, \delta y)$.
Why does this method work?

▶ As we move along the displacement vector $(\delta x, \delta y)$, the value of the objective function also cannot change, because otherwise we could get a better solution by moving along the displacement vector or its negative.
▶ Thus we have $\nabla f \cdot (\delta x, \delta y) = 0$, so $\nabla f$ is orthogonal to the displacement vector $(\delta x, \delta y)$.
▶ Keeping in mind the result from the previous slide, we note that both $\nabla f$ and $\nabla g$ are orthogonal to the displacement vector $(\delta x, \delta y)$ and hence must be parallel.
▶ This leads us to the equation $\nabla f = \lambda \nabla g$, which is what we get when we set $\frac{\partial L}{\partial b_1} = 0$.
▶ The other partial derivative, $\frac{\partial L}{\partial \lambda} = 0$, merely enforces the constraint of the original constrained optimization problem.
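To illustrate, here is a small sketch on a toy problem of this form (maximize $f(x,y) = x + y$ subject to $x^2 + y^2 = 1$; the choice of $f$ and $g$ is mine, not from the lecture): at the optimum the two gradients are indeed parallel.

```python
import numpy as np

# Toy problem: maximize f(x, y) = x + y subject to g(x, y) = x^2 + y^2 = 1.
# The maximizer is (1/sqrt(2), 1/sqrt(2)).
x0, y0 = 1 / np.sqrt(2), 1 / np.sqrt(2)

grad_f = np.array([1.0, 1.0])          # gradient of f at (x0, y0)
grad_g = np.array([2 * x0, 2 * y0])    # gradient of g at (x0, y0)

# Parallel gradients: the 2D cross product is zero and grad f = lambda * grad g.
cross = grad_f[0] * grad_g[1] - grad_f[1] * grad_g[0]
lam = grad_f[0] / grad_g[0]

print(np.isclose(cross, 0.0))             # True: gradients are parallel
print(np.allclose(grad_f, lam * grad_g))  # True: grad f = lambda * grad g
```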
M-dimensional subspace with maximum variance

▶ Assume that we have found the first $m - 1$ principal components as the $m - 1$ eigenvectors of $S$ that are associated with the largest $m - 1$ eigenvalues of $S$.
▶ Since $S$ is symmetric, the spectral theorem lets us use these $m - 1$ eigenvectors to construct an orthonormal basis of an $(m - 1)$-dimensional subspace of $\mathbb{R}^D$.
▶ The $m$th principal component can be found by subtracting from the data the contribution of the first $m - 1$ components $b_1, b_2, \ldots, b_{m-1}$. Essentially we are trying to find principal components that compress the remaining information.
M-dimensional subspace with maximum variance

▶ We then arrive at a new data matrix $\hat{X} = X - \sum_{i=1}^{m-1} b_i b_i^T X = X - B_{m-1} X$, where $X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{D \times N}$ contains the data points as column vectors and $B_{m-1} = \sum_{i=1}^{m-1} b_i b_i^T$ is a projection matrix that projects $X$ onto the subspace spanned by $b_1, b_2, \ldots, b_{m-1}$.
▶ Note that we are collecting the data vectors $x_1, x_2, \ldots, x_N$ as column vectors in the data matrix, rather than as row vectors as is done conventionally.
▶ To find the $m$th principal component we maximize $V_m = \mathbb{V}[z_m] = \frac{1}{N}\sum_{n=1}^{N} z_{mn}^2 = \frac{1}{N}\sum_{n=1}^{N} (b_m^T \hat{x}_n)^2 = b_m^T \hat{S} b_m$.
▶ What is $\hat{S}$ in the above equation?


M-dimensional subspace with maximum variance

▶ $\hat{S}$ is the data covariance matrix of the transformed dataset $\hat{X} = [\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_N]$.
▶ As we did for the first principal component, we set up a constrained optimization problem and establish that the optimal solution $b_m$ is the eigenvector of $\hat{S}$ that corresponds to its largest eigenvalue.
▶ We now establish that $b_m$ is also an eigenvector of the original data covariance matrix $S$.
▶ More generally, the sets of eigenvectors of $\hat{S}$ and $S$ are the same.
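The following sketch (again with synthetic data of my own choosing) illustrates this deflation step numerically: after projecting out the first $m-1$ eigenvectors of $S$, the top eigenvector of the deflated covariance $\hat{S}$ matches the $m$th eigenvector of $S$.

```python
import numpy as np

rng = np.random.default_rng(4)

N, D, m = 2000, 4, 3                      # find the 3rd principal component by deflation
X = rng.normal(size=(N, D)) @ np.diag([4.0, 3.0, 2.0, 1.0])
X = (X - X.mean(axis=0)).T                # columns are data points, as in the lecture

S = (X @ X.T) / N
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]         # sort eigenpairs by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

B_prev = eigvecs[:, :m - 1] @ eigvecs[:, :m - 1].T   # B_{m-1} = sum_i b_i b_i^T
X_hat = X - B_prev @ X                                # deflated data
S_hat = (X_hat @ X_hat.T) / N

hat_vals, hat_vecs = np.linalg.eigh(S_hat)
b_m_hat = hat_vecs[:, -1]                 # top eigenvector of S_hat

# It coincides with the m-th eigenvector of S (up to sign) ...
print(np.allclose(np.abs(b_m_hat @ eigvecs[:, m - 1]), 1.0))
# ... and its eigenvalue is lambda_m, the m-th largest eigenvalue of S.
print(np.isclose(hat_vals[-1], eigvals[m - 1]))
```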
Eigenvectors of S and Ŝ

▶ We now show that the eigenvectors of $S$ and $\hat{S}$ are the same.
▶ Let $b_i$ be an eigenvector of $S$, i.e. $S b_i = \lambda_i b_i$.
▶ Now we can write

$\hat{S} b_i = \frac{1}{N}(X - B_{m-1} X)(X - B_{m-1} X)^T b_i$
$\quad\ = \frac{1}{N}\left(X X^T - X X^T B_{m-1}^T - B_{m-1} X X^T + B_{m-1} X X^T B_{m-1}^T\right) b_i$
$\quad\ = \left(S - S B_{m-1}^T - B_{m-1} S + B_{m-1} S B_{m-1}^T\right) b_i$
$\quad\ = \left(S - S B_{m-1} - B_{m-1} S + B_{m-1} S B_{m-1}\right) b_i$

▶ Note that in the last line we have used the fact that $B_{m-1}$ is a projection matrix and is therefore symmetric.
Eigenvectors of S and Ŝ

▶ We have two cases: $i \geq m$ and $i \leq m - 1$.
▶ When $i \geq m$, $b_i$ is an eigenvector that is not among the first $m - 1$ components.
▶ Since $B_{m-1} = \sum_{j=1}^{m-1} b_j b_j^T$ and $b_i$ (for $i \geq m$) is orthogonal to $b_1, \ldots, b_{m-1}$, we have $B_{m-1} b_i = 0$.
▶ Plugging this into the last equation on the previous slide, we see that $\hat{S} b_i = (S - B_{m-1} S) b_i = S b_i = \lambda_i b_i$.
▶ In particular, $\hat{S} b_m = S b_m = \lambda_m b_m$; here $\lambda_m$ is the $m$th largest eigenvalue of $S$ and is also the largest eigenvalue of $\hat{S}$, because of the way the constrained optimization problem is set up.
▶ On the other hand, when $i \leq m - 1$, we can see that $B_{m-1} b_i = b_i$.
Eigenvectors of S and Ŝ

▶ From the previous slide, when $i \leq m - 1$ we have $B_{m-1} b_i = b_i$.
▶ Plugging this into $\hat{S} b_i = (S - S B_{m-1} - B_{m-1} S + B_{m-1} S B_{m-1}) b_i$, we get $\hat{S} b_i = 0$.
▶ Thus the vectors $b_1, b_2, \ldots, b_{m-1}$ are eigenvectors of $\hat{S}$ associated with the eigenvalue 0.
▶ Since $V_m = b_m^T \hat{S} b_m = b_m^T S b_m = \lambda_m$, we see that the variance of the data projected onto the $m$th principal component is $\lambda_m$.
▶ To find an $M$-dimensional subspace that retains as much information as possible, PCA tells us to choose the columns of the matrix $B$ to be the $M$ eigenvectors of the data covariance matrix $S$ that have the largest eigenvalues.
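Putting the maximum variance perspective together, here is a compact NumPy sketch of PCA as described so far. The synthetic data and the function name pca_max_variance are my own choices for illustration: centre the data, form $S$, and keep the $M$ eigenvectors with the largest eigenvalues as the columns of $B$.

```python
import numpy as np

def pca_max_variance(X, M):
    """PCA via eigendecomposition of the data covariance matrix.

    X : (N, D) array with data points as rows.
    M : number of principal components to keep.
    Returns (B, Z): B is (D, M) with principal components as columns,
    Z is (N, M) with the low-dimensional codes z_n = B^T x_n.
    """
    Xc = X - X.mean(axis=0)                  # centre the data
    S = (Xc.T @ Xc) / X.shape[0]             # data covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)     # ascending eigenvalues
    B = eigvecs[:, ::-1][:, :M]              # top-M eigenvectors
    Z = Xc @ B                               # codes z_n = B^T x_n
    return B, Z

# Illustrative usage on synthetic data.
rng = np.random.default_rng(5)
X = rng.normal(size=(500, 5)) @ np.diag([5.0, 3.0, 1.0, 0.5, 0.1])
B, Z = pca_max_variance(X, M=2)
print(B.shape, Z.shape)                      # (5, 2) (500, 2)
print(np.var(Z, axis=0))                     # variances ≈ the two largest eigenvalues of S
```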
Projection perspective

▶ We derived PCA as an algorithm that maximizes the variance in the projected space in order to retain as much information as possible.
▶ Now we shall derive PCA from a projection perspective, by minimizing the average reconstruction error. The original data is denoted $x_n$ and the reconstruction is denoted $\tilde{x}_n$; we seek to minimize the distance between $x_n$ and $\tilde{x}_n$.
▶ Assume that we have an orthonormal basis $B = (b_1, b_2, \ldots, b_D)$ of $\mathbb{R}^D$.
▶ We can write any $x \in \mathbb{R}^D$ as $x = \sum_{d=1}^{D} \xi_d b_d = \sum_{m=1}^{M} \xi_m b_m + \sum_{j=M+1}^{D} \xi_j b_j$ for suitable coordinates $\xi_d \in \mathbb{R}$.
coordinates ξd ∈ R.
Projection perspective

▶ We are interested in finding vectors $\tilde{x} \in \mathbb{R}^D$ which live in a lower-dimensional subspace $U \subset \mathbb{R}^D$ with $\dim(U) = M$, so that $\tilde{x} = \sum_{m=1}^{M} z_m b_m \in U \subset \mathbb{R}^D$ is as similar to $x$ as possible, by which we mean that the distance $\|x - \tilde{x}\|$ is as small as possible.
▶ We assume that the dataset $X = \{x_1, x_2, \ldots, x_N\}$ is centred at 0, i.e. $\mathbb{E}(X) = 0$. This is not a restrictive assumption, in that we get the same results with or without it, but it considerably simplifies the mathematical development.
▶ We call the subspace $U \subset \mathbb{R}^D$ onto which we project the vectors $x$ the principal subspace. We can write $\tilde{x}_n = \sum_{m=1}^{M} z_{mn} b_m = B z_n \in \mathbb{R}^D$, where $B = [b_1, \ldots, b_M]$ and $z_n = [z_{1n}, \ldots, z_{Mn}]^T$ is the coordinate vector of $\tilde{x}_n$ with respect to the basis $(b_1, \ldots, b_M)$.
Finding optimal coordinates

▶ The reconstruction error can be written as $J_M = \frac{1}{N}\sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2$, where we use the subscript $M$ to denote the dimension of the subspace onto which we project the data.
▶ We would like to find optimal coordinates $z_{1n}, z_{2n}, \ldots, z_{Mn}$ with respect to the basis vectors $b_1, \ldots, b_M$ for $\tilde{x}_n$, $n = 1, \ldots, N$.
▶ In geometrical terms, finding the optimal coordinates boils down to finding the representation with respect to $(b_1, \ldots, b_M)$ that minimizes the distance between $x$ and $\tilde{x}$.
▶ To do this, we need to find the orthogonal projection of $x$ onto the subspace spanned by $(b_1, \ldots, b_M)$. The concept is illustrated on the next slide.
Finding optimal coordinates

[Figure: orthogonal projection of a data point onto the subspace spanned by the basis vectors.]

Finding optimal coordinates

▶ Assume an orthonormal basis (ONB) $(b_1, \ldots, b_M)$ of the subspace $U \subset \mathbb{R}^D$.
▶ To find the optimal coordinates, we take the derivative of the reconstruction error $J_M = \frac{1}{N}\sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2$ with respect to the coordinates $z_{in}$ and set the derivative to zero, i.e. $\frac{\partial J_M}{\partial z_{in}} = 0$.
▶ We can write $\frac{\partial J_M}{\partial z_{in}} = \frac{\partial J_M}{\partial \tilde{x}_n} \frac{\partial \tilde{x}_n}{\partial z_{in}}$ using the chain rule.
▶ Note that only $\tilde{x}_n$ is a function of the $z_{in}$, and the given $x_n$ is a fixed vector independent of the coordinates $z_{in}$.
▶ We shall now find expressions for each of the partial derivatives on the right-hand side.
Finding optimal coordinates

$\frac{\partial J_M}{\partial \tilde{x}_n} = -\frac{2}{N}(x_n - \tilde{x}_n)^T$

$\frac{\partial \tilde{x}_n}{\partial z_{in}} = \frac{\partial}{\partial z_{in}}\left(\sum_{m=1}^{M} z_{mn} b_m\right) = b_i$

$\frac{\partial J_M}{\partial z_{in}} = -\frac{2}{N}(x_n - \tilde{x}_n)^T b_i = -\frac{2}{N}\left(x_n - \sum_{m=1}^{M} z_{mn} b_m\right)^T b_i = -\frac{2}{N}\left(x_n^T b_i - z_{in}\right)$
Finding optimal coordinates

▶ Note that in the preceding sequence of equations we used the orthonormality of the basis vectors $(b_1, \ldots, b_M)$ to write $\left(\sum_{m=1}^{M} z_{mn} b_m\right)^T b_i = z_{in} b_i^T b_i = z_{in}$.
▶ Setting the right-hand side of the last equation on the previous slide to zero yields $z_{in} = x_n^T b_i = b_i^T x_n$.
▶ The optimal coordinates of $\tilde{x}_n$ with respect to the basis $(b_1, \ldots, b_M)$ are therefore the coordinates of the orthogonal projection of $x_n$ onto the principal subspace.
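As a numerical illustration (random data and a random orthonormal basis, both assumptions of this sketch), the coordinates $z_{in} = b_i^T x_n$ coincide with the least-squares solution that minimizes $\|x_n - B z_n\|^2$.

```python
import numpy as np

rng = np.random.default_rng(6)

D, M = 6, 3
x = rng.normal(size=D)                        # a single data point
B, _ = np.linalg.qr(rng.normal(size=(D, M)))  # orthonormal basis of the subspace

z_projection = B.T @ x                        # z_in = b_i^T x_n
z_least_squares, *_ = np.linalg.lstsq(B, x, rcond=None)  # argmin_z ||x - B z||^2

print(np.allclose(z_projection, z_least_squares))  # True: same coordinates
```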
Finding a basis of the principal subspace

▶ To find the basis vectors $b_1, \ldots, b_M$ of the principal subspace we have to reformulate the loss function.
▶ We can write $\tilde{x}_n = \sum_{m=1}^{M} z_{mn} b_m = \sum_{m=1}^{M} (x_n^T b_m) b_m$.
▶ We can rewrite this as $\tilde{x}_n = \left(\sum_{m=1}^{M} b_m b_m^T\right) x_n$.
▶ The original data point $x_n$ can also be written as a linear combination of all the basis vectors: $x_n = \sum_{d=1}^{D} (x_n^T b_d) b_d$.
▶ How can we do this? First write $x_n = \sum_{d=1}^{D} z_{dn} b_d$ and take the inner product with $b_d$ on both sides to obtain $z_{dn} = x_n^T b_d$; orthonormality of the basis $b_1, \ldots, b_D$ gives the result.
▶ By rearranging the expression for $x_n$, we can write $x_n = \sum_{d=1}^{D} (b_d b_d^T) x_n$.
Finding a basis of the principal subspace

▶ We can split the expression for $x_n$ as follows: $x_n = \left(\sum_{m=1}^{M} b_m b_m^T\right) x_n + \left(\sum_{j=M+1}^{D} b_j b_j^T\right) x_n$.
▶ Then we find that the displacement vector $x_n - \tilde{x}_n$ can be written as $x_n - \tilde{x}_n = \left(\sum_{j=M+1}^{D} b_j b_j^T\right) x_n = \sum_{j=M+1}^{D} (x_n^T b_j) b_j$.
▶ The last expression shows that the displacement vector is exactly the projection of the original data point $x_n$ onto the orthogonal complement of the principal subspace.
▶ We can now express the loss function as follows: $J_M = \frac{1}{N}\sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2 = \frac{1}{N}\sum_{n=1}^{N} \left\|\sum_{j=M+1}^{D} (b_j^T x_n) b_j\right\|^2$.
▶ We now expand the squared norm and exploit the fact that the $b_j$ form an orthonormal basis to rewrite the loss function as shown on the next slide.
Finding a basis of the principal subspace

$J_M = \frac{1}{N}\sum_{n=1}^{N} \sum_{j=M+1}^{D} (b_j^T x_n)^2$
$\quad\ = \frac{1}{N}\sum_{n=1}^{N} \sum_{j=M+1}^{D} b_j^T x_n \, b_j^T x_n$
$\quad\ = \frac{1}{N}\sum_{n=1}^{N} \sum_{j=M+1}^{D} b_j^T x_n x_n^T b_j$
$\quad\ = \sum_{j=M+1}^{D} b_j^T \left(\frac{1}{N}\sum_{n=1}^{N} x_n x_n^T\right) b_j$
$\quad\ = \sum_{j=M+1}^{D} b_j^T S \, b_j$
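A short sketch (with synthetic data and an arbitrary orthonormal basis, both my own choices) that checks this identity: the average squared reconstruction error equals $\sum_{j=M+1}^{D} b_j^T S b_j$ even when the basis is not the eigenbasis of $S$.

```python
import numpy as np

rng = np.random.default_rng(7)

N, D, M = 400, 5, 2
X = rng.normal(size=(N, D))
X -= X.mean(axis=0)                           # centred data, rows are data points
S = (X.T @ X) / N

Q, _ = np.linalg.qr(rng.normal(size=(D, D)))  # arbitrary orthonormal basis of R^D
B_principal = Q[:, :M]                        # basis of the candidate principal subspace
B_complement = Q[:, M:]                       # basis of its orthogonal complement

X_tilde = X @ B_principal @ B_principal.T             # reconstructions x_tilde_n
J_M = np.mean(np.sum((X - X_tilde) ** 2, axis=1))     # average reconstruction error

identity_rhs = np.sum([b @ S @ b for b in B_complement.T])  # sum_j b_j^T S b_j

print(np.allclose(J_M, identity_rhs))          # True
```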
Finding a basis of the principal subspace

▶ We note that $J_M = \sum_{j=M+1}^{D} b_j^T S b_j = \mathrm{tr}\left(\sum_{j=M+1}^{D} b_j^T S b_j\right) = \mathrm{tr}\left(\sum_{j=M+1}^{D} S b_j b_j^T\right) = \mathrm{tr}\left(\left(\sum_{j=M+1}^{D} b_j b_j^T\right) S\right)$.
▶ In the above we exploit the fact that the trace operator is invariant with respect to cyclic permutation of its arguments.
▶ $S$ in the above equations is the data covariance matrix, since we assume that $\mathbb{E}(X) = 0$.
▶ We can therefore interpret the average reconstruction error as the variance of the data projected onto the orthogonal complement of the principal subspace.
Finding a basis of the principal subspace

▶ To minimize the average reconstruction error we need to minimize the variance of the data when projected onto the space we ignore, which is the orthogonal complement of the principal subspace.
▶ This is equivalent to maximizing the variance of the data when projected onto the principal subspace, which leads us to the same solution as the maximum variance perspective.
▶ The smallest value of the average squared reconstruction error turns out to be $J_M = \sum_{j=M+1}^{D} \lambda_j$, where the $\lambda_j$ are the $D - M$ smallest eigenvalues of the data covariance matrix.
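To close the loop, here is a final sketch (synthetic data, my own setup) checking that when $B$ is chosen as the top $M$ eigenvectors of $S$, the average squared reconstruction error equals the sum of the $D - M$ smallest eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(8)

N, D, M = 1000, 6, 2
X = rng.normal(size=(N, D)) @ np.diag([4.0, 3.0, 2.0, 1.0, 0.5, 0.25])
X -= X.mean(axis=0)
S = (X.T @ X) / N

eigvals, eigvecs = np.linalg.eigh(S)     # ascending order
B = eigvecs[:, -M:]                      # top-M eigenvectors span the principal subspace

X_tilde = X @ B @ B.T                    # project and reconstruct
J_M = np.mean(np.sum((X - X_tilde) ** 2, axis=1))

print(np.allclose(J_M, eigvals[:D - M].sum()))   # J_M = sum of the D - M smallest eigenvalues
```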
