CHBE413CDS Lecture 12 Unsupervised DimRed
Relevant Readings:
• ISL Ch. 12.1-12.3
Why Should We Use Dimensionality Reduction?
Features/Input: [x1, x2, …, x10,000]  →  Output: [y1]
With thousands of input features per observation and only a single output, the data cannot be visualized directly and models are prone to overfitting; dimensionality reduction compresses the feature space while retaining most of the relevant information.
Types of Dimensionality Reduction Techniques
Most dimensionality reduction techniques fall into a few categories: feature selection/elimination (keeping a subset of the original features) versus feature extraction (constructing new, lower-dimensional features), with extraction methods further divided into linear and nonlinear approaches.
Feature Elimination and Extraction
From a feature set, we are looking for a subset of features to use:
• Remove features with too many missing values (Missing Value Ratio)
• Remove highly correlated features (e.g., using Pearson’s r) (High Correlation filter)
These are all very simple yet powerful strategies that may be complementary to other methods → always worth considering! A minimal sketch of both filters is given below.
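A minimal pandas sketch of these two filters, assuming a DataFrame `df` of features (the function name and threshold values are illustrative, not from the lecture):

```python
import numpy as np
import pandas as pd

def drop_sparse_and_correlated(df: pd.DataFrame, max_missing_ratio=0.4, max_abs_corr=0.95):
    """Missing Value Ratio filter followed by a High Correlation filter (Pearson's r)."""
    # Drop columns whose fraction of missing values exceeds the threshold
    missing_ratio = df.isna().mean()
    df = df.loc[:, missing_ratio <= max_missing_ratio]

    # For each highly correlated pair, drop one member (inspect the upper triangle of |r|)
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > max_abs_corr).any()]
    return df.drop(columns=to_drop)
```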
Principal Components Analysis (PCA)
PCA is the “canonical” and most common DR technique
PCA: How to Pick Basis Vectors
Idea: Find the directions along which the data is maximally spread (highest variance).
https://www.cs.cornell.edu/courses/cs4786/2020sp/
PCA: How to Pick Basis Vectors
• Mean-subtracted data: $\tilde{x}^{(i)} = x^{(i)} - \mu$
• Covariance matrix: $\Sigma = \frac{1}{n} \sum_{i=1}^{n} \tilde{x}^{(i)} \tilde{x}^{(i)\,T}$
PCA: Additional Details and Realization
• We want to maximize the projected variance $w^T \Sigma w$ subject to the constraint that the norm of $w$ is 1. We can do this with Lagrange multipliers:
$$w_1 = \arg\max_{w} \left( w^T \Sigma w - \lambda \|w\|_2^2 \right)$$
• We now take the derivative with respect to $w$, set it equal to 0, and arrive at
$$\Sigma w_1 - \lambda w_1 = 0$$
• This is an eigenvalue equation! The direction $w_1$ obtained by maximizing the variance is a unit vector that satisfies this eigenvalue equation. Since we want to maximize $w^T \Sigma w$, plugging the eigenvalue solution back in gives $w_1^T \Sigma w_1 = \lambda w_1^T w_1 = \lambda$: the variance along $w_1$ is the eigenvalue itself, so the best choice is the eigenvector with the largest eigenvalue.
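To make this concrete, a small NumPy check (toy data, not from the lecture) that the top eigenvector of the covariance matrix satisfies the eigenvalue equation and that the variance along it equals its eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [0.5, 1.0, 0.0],
                                          [0.0, 0.2, 0.3]])   # correlated toy data
Xc = X - X.mean(axis=0)                      # mean-subtract
Sigma = Xc.T @ Xc / len(Xc)                  # covariance matrix

eigvals, eigvecs = np.linalg.eigh(Sigma)     # eigh: symmetric matrix, eigenvalues ascending
w1, lam1 = eigvecs[:, -1], eigvals[-1]       # direction of maximum variance

print(np.allclose(Sigma @ w1, lam1 * w1))    # eigenvalue equation holds -> True
print(np.isclose(w1 @ Sigma @ w1, lam1))     # projected variance equals the eigenvalue -> True
```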
PCA: The Basic Algorithm
1. Standardize the d-dimensional dataset.
2. Construct the covariance matrix: $\sigma_{jk} = \frac{1}{n-1} \sum_{i=1}^{n} \left( x_j^{(i)} - \mu_j \right)\left( x_k^{(i)} - \mu_k \right)$
3. Decompose the covariance matrix into its eigenvectors and eigenvalues: $\Sigma v = \lambda v$
4. Sort the eigenvalues in decreasing order and select the k eigenvectors corresponding to the k largest eigenvalues, where k is the dimensionality of the new feature subspace (k ≤ d).
5. Project the data onto the new subspace: $X' = XW$, where W is the d × k matrix whose columns are the selected eigenvectors.
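A minimal NumPy sketch of these five steps (the function and variable names are illustrative; in practice sklearn.decomposition.PCA implements the same procedure):

```python
import numpy as np

def pca_fit_transform(X, k):
    """PCA via eigendecomposition of the covariance matrix; X has shape (n, d)."""
    # 1. Standardize the data
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix, d x d (np.cov uses the 1/(n-1) convention)
    Sigma = np.cov(Xs, rowvar=False)
    # 3. Eigendecomposition (eigh returns ascending eigenvalues for symmetric matrices)
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    # 4. Sort eigenvalues in decreasing order and keep the top-k eigenvectors
    order = np.argsort(eigvals)[::-1]
    W = eigvecs[:, order[:k]]                # d x k projection matrix
    # 5. Project onto the new subspace
    return Xs @ W, eigvals[order]

# X_reduced, sorted_eigvals = pca_fit_transform(X, k=2)
```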
The Proportion of Variance Explained
It is natural to ask how much of the information in a data set is lost by projecting the observations onto a small
subset of principal components.
The amount of variance explained can be analytically related to the eigenvalues of the covariance matrix via:
$$\text{Explained Variance Ratio} = \frac{\lambda_j}{\sum_{j=1}^{d} \lambda_j}$$
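In scikit-learn this ratio is available directly on a fitted PCA object; a short sketch (X stands for any (n, d) feature matrix):

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)         # lambda_j / sum_j lambda_j for each kept component
print(pca.explained_variance_ratio_.sum())   # total fraction of variance retained
```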
Application of PCA to Visualizing Chemical
Space
ACS Polym. Au 2023, 3, 4, 318–330
Problems with PCA
PCA generally fails to detect low-dimensional nonlinear manifolds.
$$\Sigma = \frac{1}{n} \sum_{i=1}^{n} x^{(i)} x^{(i)\,T} \qquad \text{(data must have zero mean)}$$
Kernel PCA addresses this by replacing the inner products in this construction with a nonlinear kernel function, e.g., the RBF kernel:
$$\kappa(x^{(i)}, x^{(j)}) = e^{-\gamma \| x^{(i)} - x^{(j)} \|^2}$$
1. Schölkopf, B.; Smola, A. & Müller, K.-R. Kernel principal component analysis, 583-588, 1997.
Using Kernel PCA
https://scikit-learn.org/stable/auto_examples/decomposition/plot_kernel_pca.html
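A brief sketch in the spirit of the scikit-learn example linked above (the dataset and parameter values are illustrative):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Concentric circles: a low-dimensional nonlinear structure that linear PCA cannot unfold
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

X_pca = PCA(n_components=2).fit_transform(X)
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)
# X_kpca separates the two circles along its leading component; X_pca does not.
```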
Using Kernel PCA
Other Greatest Hits of Dimensionality Reduction
• Linear Discriminant Analysis
• Generalized Discriminant Analysis
• UMAP
• Non-negative Matrix Factorization
• Classical Scaling
• Maximum Variance Unfolding
• Diffusion Maps
• Locally Linear Embedding
• Laplacian Eigenmaps
• Hessian LLE
• Local Tangent Space Analysis
• Sammon Mapping
• Multilayer Autoencoders
• Locally Linear Coordination
• Manifold Charting
• ISOMAP
• Kernel PCA
Other Greatest Hits of Dimensionality Reduction
Variational Autoencoders: Shmilovich, K.; Mansbach, R. A.; Sidky, H.; Dunne, O. E.; Panda, S. S.; Tovar, J. D. & Ferguson, A. L. Discovery of Self-Assembling pi-Conjugated Peptides by Active Learning-Directed Coarse-Grained Molecular Simulation. The Journal of Physical Chemistry B 2020, 124, 3873-3891.
Diffusion Maps: Ferguson, A. L.; Panagiotopoulos, A. Z.; Kevrekidis, I. G. & Debenedetti, P. G. Nonlinear dimensionality reduction in molecular simulation: The diffusion map approach. Chemical Physics Letters 2011, 509, 1-11.
UMAP: Reis, M.; Gusev, F.; Taylor, N. G.; Chung, S. H.; Verber, M. D.; Lee, Y. Z.; Isayev, O. & Leibfarth, F. A. Machine-Learning-Guided Discovery of 19F MRI Agents Enabled by Automated Copolymer Synthesis. Journal of the American Chemical Society 2021, 143, 17677-17689.
Standard DR Techniques in Scikit-learn
sklearn.decomposition
https://scikit-learn.org/stable/modules/decomposition.html
• Kernel PCA
• Independent Components Analysis
• Linear Factor Analysis
• Non-negative Matrix Factorization
• Truncated SVD
sklearn.manifold
• Isomap
• Locally Linear Embedding
• Spectral Embedding
• t-SNE
• Multidimensional Scaling
https://scikit-learn.org/stable/modules/manifold.html
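All of these estimators share the same fit/fit_transform interface, so swapping techniques is essentially a one-line change; a brief sketch (X stands for any (n, d) feature matrix):

```python
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import Isomap, TSNE

# Instantiate, then fit_transform -- the pattern is identical across sklearn reducers.
for Reducer in (PCA, KernelPCA, Isomap, TSNE):
    X_2d = Reducer(n_components=2).fit_transform(X)
    print(Reducer.__name__, X_2d.shape)
```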
State-of-the-Art Nonlinear Dimensionality
Reduction Algorithms
t-Distributed Stochastic Neighbor Embedding (t-SNE)
Given a set of high-dimensional data points $x_1, \ldots, x_n$, t-SNE computes conditional probabilities $p_{j|i}$ that are proportional to the similarity of data points $x_i$ and $x_j$:
$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$
Why can we call the similarity a conditional probability? As explained in the original article:
"The similarity of datapoint $x_j$ to datapoint $x_i$ is the conditional probability, $p_{j|i}$, that $x_i$ would pick $x_j$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$."
van der Maaten, L. J. P.; Hinton, G. E. Visualizing Data Using t-SNE. Journal of Machine Learning Research 2008, 9, 2579-2605.
t-Distributed Stochastic Neighbor Embedding (t-SNE)
Given a set of high-dimensional data points $x_1, \ldots, x_n$, t-SNE computes conditional probabilities $p_{j|i}$ that are proportional to the similarity of data points $x_i$ and $x_j$.
So how the heck do we pick $\sigma$? $\sigma_i$ effectively determines the number of nearest neighbors that any given point "feels". One defines a target perplexity k, computed from the Shannon entropy of $P_i$ as $\mathrm{Perp}(P_i) = 2^{H(P_i)}$ with $H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}$, and then searches (e.g., by bisection) for the $\sigma_i$ that reproduces this perplexity at each point, as sketched in the code below.
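A minimal NumPy sketch of this perplexity-matched construction of the $p_{j|i}$ (a simplified re-implementation for illustration, not the reference code):

```python
import numpy as np

def conditional_probs(X, target_perplexity=30.0, tol=1e-4, max_iter=50):
    """Row i of the returned matrix holds p_{j|i}; sigma_i is found by bisection on
    beta_i = 1 / (2 sigma_i^2) so that 2**H(P_i) matches the target perplexity."""
    n = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # ||x_i - x_j||^2
    P = np.zeros((n, n))
    for i in range(n):
        lo, hi, beta = 0.0, np.inf, 1.0
        for _ in range(max_iter):
            p = np.exp(-beta * sq_dists[i])
            p[i] = 0.0                                   # a point is not its own neighbor
            p /= p.sum()
            H = -np.sum(p[p > 0] * np.log2(p[p > 0]))    # Shannon entropy of P_i
            if abs(2 ** H - target_perplexity) < tol:
                break
            if 2 ** H > target_perplexity:               # too many effective neighbors: shrink sigma_i
                lo, beta = beta, (beta * 2 if np.isinf(hi) else (beta + hi) / 2)
            else:                                        # too few: grow sigma_i
                hi, beta = beta, (beta + lo) / 2
        P[i] = p
    return P
```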
Ok, So What Do I Do With All of These $p_{j|i}$'s?
Now we find a lower-dimensional space to which we can map the high-dimensional data while preserving these conditional probabilities as well as possible. We refer to the conditional probabilities in the low-dimensional space as $q_{j|i}$.
How the heck do we actually do this? We measure the mismatch between the high-dimensional and low-dimensional probabilities using the Kullback-Leibler divergence as a loss function for each data point.
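As a worked equation, the total cost being minimized is the sum of per-point KL divergences, where $q_{j|i}$ is defined analogously to $p_{j|i}$ but between the low-dimensional points $y_i$:

$$C = \sum_i \mathrm{KL}(P_i \,\|\, Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$$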
The Algorithm: compute the $p_{j|i}$ for a chosen perplexity, initialize the low-dimensional points $y_i$ (e.g., from a small random Gaussian), and minimize this cost by gradient descent on the $y_i$.
Ok – I Lied. This isn’t Exactly t-SNE. To Get to
Real t-SNE You Modify a Couple of Things:
1. Symmetrize the conditional probabilities, $p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$ (makes the KL-divergence calculation faster).
2. Use a Student t-distribution with one degree of freedom, rather than a Gaussian, for the low-dimensional similarities $q_{ij}$; this alleviates the crowding problem and is what puts the "t" in t-SNE.
How to Use t-SNE Effectively, Including How
to Not, and How to Pick Perplexity
Scikit-learn recommends this link, which is quite helpful https://distill.pub/2016/misread-tsne/. Read this
before you ever use t-SNE in an important task.
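A short sketch of running t-SNE in scikit-learn while scanning the perplexity, the main knob discussed at the link above (values are illustrative):

```python
from sklearn.manifold import TSNE

# t-SNE embeddings can change qualitatively with perplexity and random seed,
# so scan a few values rather than trusting a single picture.
embeddings = {}
for perplexity in (5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X)   # X: any (n, d) feature matrix
```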
How to Use t-SNE Effectively, Including How
to Not, and How to Pick Perplexity
Uniform Manifold Approximation and
Projection (UMAP)
The original paper is a pain to understand and read through, but it suffices to say that people like it because it has
a more rigorous mathematical justification than t-SNE. In terms of function, it has many similarities to t-SNE.
https://umap-learn.readthedocs.io/en/latest/
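A minimal sketch of the umap-learn interface documented at the link above (parameter values are illustrative; n_neighbors plays a role loosely analogous to t-SNE's perplexity):

```python
import umap   # pip install umap-learn

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=0)
X_umap = reducer.fit_transform(X)   # X: any (n, d) feature matrix
```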
UMAP and t-SNE
If you would like to better understand UMAP I would recommend:
https://pair-code.github.io/understanding-umap/
• UMAP has a firmer mathematical foundation than t-SNE, and many folks like it for this reason alone.
• UMAP is often better at preserving global structure than t-SNE: the inter-cluster relationships in a UMAP embedding are potentially more meaningful than those in a t-SNE embedding.
Jupyter Notebook Example