Pattern Recognition and Machine Learning
Dimensionality Reduction
Dipanjan Roy
Associate Professor
School of AIDE
Indian Institute of Technology Jodhpur
Numerous examples of high-dimensional data:
• Documents (e.g., news articles)
• Face images
• Neural population recordings
• MEG readings
• Gene expression data
• High-dimensional brain fMRI data
Motivation and context
Why do dimensionality reduction?
• Computational: compress data ⇒ time/space efficiency
• Statistical: fewer dimensions ⇒ better generalization
• Visualization: understand structure of data
• Anomaly detection: describe normal data, detect outliers
Dimensionality reduction in this course:
• Linear methods (this week)
• Nonlinear methods (later)
Why reduce dimensions?
High dimensionality has many costs:
– Redundant and irrelevant features degrade performance of some ML algorithms
– Difficulty in interpretation and visualization
– Computation may become infeasible: what if your algorithm scales as O(n^3)?
– Curse of dimensionality
Types of problems
• Prediction x → y: classification, regression
  Applications: face recognition, gene expression prediction
  Techniques: kNN, SVM, least squares (+ dimensionality reduction preprocessing)
• Structure discovery x → z: find an alternative representation z of the data x
  Applications: visualization
  Techniques: clustering, linear dimensionality reduction
• Density estimation p(x): model the data
  Applications: anomaly detection, language modeling
  Techniques: clustering, linear dimensionality reduction
Linear dimensionality reduction
Best k-dimensional subspace for projection depends on the task:
– Unsupervised: retain as much data variance as possible
  Example: principal component analysis (PCA)
– Classification: maximize separation among classes
  Example: linear discriminant analysis (LDA)
– Regression: maximize correlation between projected data and response variable
  Example: partial least squares (PLS)
Basic idea of linear dimensionality reduction
Represent each face as a high-dimensional vector x ∈ R^361
Encode each face as z = U^T x, with z ∈ R^10
How do we choose U?
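To make the encoding concrete before answering that question, here is a minimal sketch using an arbitrary orthonormal U (the dimensions 361 and 10 come from the slide; the face vector and U are random placeholders, whereas PCA will choose U from the data):

    import numpy as np

    d, k = 361, 10                     # face vector dimension, target dimension
    x = np.random.rand(d)              # a face image flattened to x in R^361 (illustrative data)

    # Any matrix U with orthonormal columns defines an encoding z = U^T x.
    # Here U is random; PCA (next slides) chooses U to preserve the data best.
    U, _ = np.linalg.qr(np.random.randn(d, k))   # d x k, orthonormal columns

    z = U.T @ x                        # encoded representation, z in R^10
    print(z.shape)                     # (10,)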
Outline
• Principal component analysis (PCA)
– Basic principles
– Case studies
• Linear discriminant analysis (LDA)
• Fisher discriminant analysis (FDA)
• Canonical correlation analysis (CCA)
• Independent Component Analysis (ICA)
• Summary
Dimensionality reduction setup
Given n data points in d dimensions: x_1, . . . , x_n ∈ R^d
X = (x_1 · · · x_n) ∈ R^{d×n}
Want to reduce dimensionality from d to k
Choose k directions u_1, . . . , u_k
U = (u_1 · · · u_k) ∈ R^{d×k}
For each u_j, compute "similarity" z_j = u_j^T x
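In matrix form, the encodings of all points can be computed at once; a minimal sketch with synthetic data, following the column-per-point convention above:

    import numpy as np

    n, d, k = 100, 20, 3                 # n points, d dimensions, target dimension k
    X = np.random.randn(d, n)            # data matrix, one column per point (d x n)

    U, _ = np.linalg.qr(np.random.randn(d, k))   # k directions u_1..u_k as columns (d x k)

    Z = U.T @ X                          # all similarities at once: Z[j, i] = u_j^T x_i
    print(Z.shape)                       # (k, n) = (3, 100)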
PCA objective 1: reconstruction error
U serves two functions:
• Encode: z = U^T x, z_j = u_j^T x
• Decode: x̃ = U z = Σ_{j=1}^k z_j u_j
Want the reconstruction error ‖x − x̃‖ to be small
Objective: minimize total squared reconstruction error
min_{U ∈ R^{d×k}} Σ_{i=1}^n ‖x_i − U U^T x_i‖^2
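A minimal numerical sketch of this objective on synthetic data; it takes as given the standard result (developed in the following slides) that the top-k eigenvectors of the empirical covariance minimize this error:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, k = 500, 10, 3
    X = rng.normal(size=(d, n))
    X -= X.mean(axis=1, keepdims=True)            # center the data

    def recon_error(U, X):
        # total squared reconstruction error: sum_i ||x_i - U U^T x_i||^2
        return np.sum((X - U @ (U.T @ X)) ** 2)

    # PCA choice of U: top-k eigenvectors of the empirical covariance
    C = X @ X.T / n
    eigvals, eigvecs = np.linalg.eigh(C)          # eigenvalues in ascending order
    U_pca = eigvecs[:, -k:]

    # A random orthonormal U for comparison
    U_rand, _ = np.linalg.qr(rng.normal(size=(d, k)))

    print(recon_error(U_pca, X) <= recon_error(U_rand, X))   # True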
PCA objective 2: projected variance
Empirical distribution: uniform over x_1, . . . , x_n
Expectation (think sum over data points): Ê[f(x)] = (1/n) Σ_{i=1}^n f(x_i)
Variance (think sum of squares if centered): v̂ar[f(x)] = Ê[f(x)^2] − Ê[f(x)]^2
Assume the data is centered: Ê[x] = 0
Objective: maximize the variance of the projected data, max_{U} Ê[‖U^T x‖^2] = (1/n) Σ_{i=1}^n ‖U^T x_i‖^2
Dimensionality reduction from Multi Trial Recordings
Trial averaged and concatenated PCA
Equivalence in two objectives
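The two objectives are equivalent because, for U with orthonormal columns, ‖x − U U^T x‖^2 = ‖x‖^2 − ‖U^T x‖^2, so reconstruction error and projected variance sum to the total variance. A quick numerical check of this (a minimal sketch with synthetic centered data):

    import numpy as np

    rng = np.random.default_rng(1)
    n, d, k = 200, 8, 2
    X = rng.normal(size=(d, n))
    X -= X.mean(axis=1, keepdims=True)                 # centered data

    U, _ = np.linalg.qr(rng.normal(size=(d, k)))       # any orthonormal U

    recon_err = np.sum((X - U @ (U.T @ X)) ** 2) / n   # average reconstruction error
    proj_var  = np.sum((U.T @ X) ** 2) / n             # average projected variance
    total_var = np.sum(X ** 2) / n                     # total variance of the data

    print(np.isclose(recon_err + proj_var, total_var)) # True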
Finding one principal component
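The leading principal component is the top eigenvector of the empirical covariance; one standard way to find it is power iteration. A minimal sketch on synthetic data (this may differ from the exact derivation on the slide):

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(5, 1000))
    X -= X.mean(axis=1, keepdims=True)      # center

    C = X @ X.T / X.shape[1]                # empirical covariance (d x d)

    # Power iteration: repeatedly apply C and renormalize to find the top eigenvector
    u = rng.normal(size=C.shape[0])
    for _ in range(100):
        u = C @ u
        u /= np.linalg.norm(u)

    # Compare with the eigenvector from a direct eigendecomposition
    _, V = np.linalg.eigh(C)
    u_top = V[:, -1]
    print(np.isclose(abs(u @ u_top), 1.0))  # True: same direction up to sign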
How many principal components?
• Similar to the question of "How many clusters?"
• The magnitude of the eigenvalues indicates the fraction of variance captured.
• Eigenvalues on a face image dataset (figure: eigenvalue spectrum λ_i versus component index i, dropping from 1353.2 for the largest component to 287.1 by around the tenth)
• Eigenvalues typically drop off sharply, so we don't need that many.
• Of course, variance isn't everything...
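A common heuristic is to keep enough components to explain a chosen fraction of the variance; a minimal sketch (the 95% threshold and the synthetic data are illustrative):

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(50, 300)) * np.linspace(5, 0.1, 50)[:, None]  # d x n, decaying scales
    X -= X.mean(axis=1, keepdims=True)

    eigvals = np.linalg.eigvalsh(X @ X.T / X.shape[1])[::-1]   # eigenvalues, largest first
    explained = np.cumsum(eigvals) / eigvals.sum()             # cumulative fraction of variance

    k = int(np.searchsorted(explained, 0.95)) + 1              # smallest k capturing 95% variance
    print(k, explained[k - 1])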
Summary of PCA
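Putting the pieces together, a minimal end-to-end PCA sketch (center, eigendecompose the covariance, project; function and variable names are illustrative, not a reference implementation):

    import numpy as np

    def pca_fit(X, k):
        """X: d x n data matrix (columns are points). Returns (U, mean)."""
        mu = X.mean(axis=1, keepdims=True)
        Xc = X - mu                                  # center the data
        C = Xc @ Xc.T / X.shape[1]                   # empirical covariance
        _, V = np.linalg.eigh(C)                     # eigenvectors, ascending eigenvalues
        U = V[:, -k:][:, ::-1]                       # top-k eigenvectors, largest first
        return U, mu

    def pca_encode(X, U, mu):
        return U.T @ (X - mu)                        # z = U^T (x - mean)

    def pca_decode(Z, U, mu):
        return U @ Z + mu                            # reconstruction x~ = U z + mean

    X = np.random.default_rng(4).normal(size=(20, 500))
    U, mu = pca_fit(X, k=5)
    Z = pca_encode(X, U, mu)
    print(Z.shape, np.sum((X - pca_decode(Z, U, mu)) ** 2))   # (5, 500), total reconstruction error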
Reducing Matrix Dimensions
◾ Often, our data can be represented by an m-by-n matrix
◾ And this matrix can be closely approximated by the product of three matrices that share a small common dimension r:
  A (m × n) ≈ U (m × r) · Σ (r × r) · V^T (r × n)
Jure Leskovec & Mina Ghashami
SVD Definition
A (m × n) = U (m × r) · Σ (r × r) · V^T (r × n)
◾ A: Input data matrix
  m × n matrix (e.g., m documents, n terms)
◾ U: Left singular vectors
  m × r matrix (m documents, r concepts)
◾ Σ: Singular values
  r × r diagonal matrix (strength of each 'concept')
  (r: rank of the matrix A)
◾ V: Right singular vectors
  n × r matrix (n terms, r concepts)
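A minimal sketch of these shapes with NumPy's SVD (the reduced form returns r = min(m, n) factors; keeping only the leading ones gives a low-rank approximation):

    import numpy as np

    m, n = 6, 4
    A = np.random.default_rng(5).normal(size=(m, n))     # input data matrix (m x n)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)     # reduced SVD
    print(U.shape, s.shape, Vt.shape)                    # (6, 4) (4,) (4, 4)

    # Reconstruct A from the factors: A = U Sigma V^T
    A_rec = U @ np.diag(s) @ Vt
    print(np.allclose(A, A_rec))                         # True

    # Singular values are non-negative and sorted in decreasing order
    print(np.all(s[:-1] >= s[1:]), np.all(s >= 0))       # True True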
SVD steps for estimation of eigenvectors
A = σ_1 u_1 v_1^T + σ_2 u_2 v_2^T + · · ·
σ_i … scalar, u_i … vector, v_i … vector
If we set σ_2 = 0, the second term drops out: the corresponding columns of U and V no longer contribute to A.
It is always possible to decompose a real matrix A into A = U Σ V^T, where
◾ U, Σ, V: unique
◾ U, V: column orthonormal
  U^T U = I; V^T V = I (I: identity matrix)
  (Columns are orthogonal unit vectors)
◾ Σ: diagonal
  Entries (singular values) are non-negative and sorted in decreasing order (σ_1 ≥ σ_2 ≥ . . . ≥ 0)
Nice proof of uniqueness: https://www.cs.cornell.edu/courses/cs322/2008sp/stuff/TrefethenBau_Lec4_SVD.pdf
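Consistent with the slide title, the singular vectors of a centered data matrix are exactly the PCA eigenvectors; a minimal check, assuming the column-per-point d × n convention used earlier (so the left singular vectors match the covariance eigenvectors):

    import numpy as np

    rng = np.random.default_rng(6)
    X = rng.normal(size=(5, 200))
    X -= X.mean(axis=1, keepdims=True)              # center (d x n, columns are points)

    # Eigenvectors of the empirical covariance (PCA directions)
    C = X @ X.T / X.shape[1]
    _, V = np.linalg.eigh(C)
    u_pca = V[:, -1]                                # top principal component

    # Left singular vectors of the centered data matrix
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    u_svd = U[:, 0]                                 # leading left singular vector

    print(np.isclose(abs(u_pca @ u_svd), 1.0))      # True: same direction up to sign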
Large-scale Brain Networks in M/EEG and dimension reduction?
• What is happening at faster time-scales?
• What are the specific neuronal interactions?
• Can we use MEG to answer these questions?
  – Excellent temporal resolution (milliseconds)
  – Good spatial resolution
  – Non-invasive
SVD on High-dimensional Brain Data
(Figure: spectrum, cross-spectral matrix, Karhunen-Loève transform, global coherence)
Sahoo et al. (2020), NeuroImage
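For context: one common summary of a cross-spectral matrix is its global coherence, often defined at each frequency as the largest eigenvalue divided by the sum of all eigenvalues. A hedged sketch on a simulated cross-spectral matrix (the exact pipeline in Sahoo et al. (2020) may differ):

    import numpy as np

    rng = np.random.default_rng(7)

    # Simulated cross-spectral matrix at one frequency: Hermitian, positive semi-definite
    n_sensors = 32
    Z = rng.normal(size=(n_sensors, 100)) + 1j * rng.normal(size=(n_sensors, 100))
    S = Z @ Z.conj().T / Z.shape[1]

    # Global coherence: fraction of total cross-spectral power captured by the
    # leading eigenvector (i.e., the first Karhunen-Loeve / SVD component)
    eigvals = np.linalg.eigvalsh(S)            # real, ascending (S is Hermitian)
    global_coherence = eigvals[-1] / eigvals.sum()
    print(global_coherence)                    # between 1/n_sensors and 1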
Linear dimensionality reduction
Best k-dimensional subspace for projection depends on the task:
– Unsupervised: retain as much data variance as possible
  Example: principal component analysis (PCA)
– Classification: maximize separation among classes
  Example: linear discriminant analysis (LDA)
– Regression: maximize correlation between projected data and response variable
  Example: partial least squares (PLS)
LDA for two classes
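Before the formal development, a minimal preview sketch of the standard two-class LDA direction w ∝ S_W^{-1}(μ_2 − μ_1), on synthetic data (the data and variable names are illustrative):

    import numpy as np

    rng = np.random.default_rng(8)

    # Two synthetic classes in d = 2 dimensions
    X1 = rng.normal(loc=[0, 0], scale=1.0, size=(100, 2))
    X2 = rng.normal(loc=[3, 1], scale=1.0, size=(100, 2))

    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

    # Within-class scatter matrix
    S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)

    # LDA projection direction: separates class means relative to within-class scatter
    w = np.linalg.solve(S_W, mu2 - mu1)
    w /= np.linalg.norm(w)

    # Project both classes onto w: their 1-D projections should be well separated
    print((X1 @ w).mean(), (X2 @ w).mean())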