CHBE413CDS Lecture 12 Unsupervised DimRed


CHBE 413/594 and CHEM 452/590

Chemical Data Science and Engineering


Lecture 12: Unsupervised Learning - Dimensionality
Reduction
Topics you will learn in this lecture:

• Why dimensionality reduction is important


• Principal Components Analysis
• A tour of advanced dimensionality reduction techniques
• How t-SNE works

Relevant Readings:
• ISL Ch. 12.1-12.3

1
Why Should We Use Dimensionality
Reduction?
Features/Input: [x1, x2, …, x10,000]   →   Output: [y1]

Do we need all these features to accurately predict the output?

Can we get away with only using a subset of the feature space?

Considerations:

• The model trains faster since it has fewer dimensions.
• The model is simpler for researchers to interpret/visualize.
• Model accuracy can improve because there is less misleading or noisy data.

"Everything should be made as simple as possible, but no simpler."

"The number of dependent variables you can ignore says how much you understand a problem."

2
Types of Dimensionality Reduction
Techniques
Most dimensionality reduction techniques fall into a few categories:

• Feature Elimination and Extraction
  We have been working with this a bit already. In general, the goal is to systematically remove uninformative or redundant variables.

• Linear Transformations/Components
  Here the strategy is to find a linear transformation/projection of your feature space that results in a more "useful" set of coordinates.

• Non-linear Transformations & Manifolds
  In this case, we usually perform some non-linear transformation and project the features onto a lower-dimensional manifold.

3
Feature Elimination and Extraction
From a feature set, we are looking for a subset of features to use:
• Remove features with too many missing values (Missing Value Ratio)

• Remove features that exhibit small variance (Low-Variance Filter)

• Remove highly correlated features, e.g., using Pearson's r (High-Correlation Filter)

• Assess feature importance to model predictions
  • by using, e.g., random forest feature importance or the Lasso
  • by systematically removing features (Backward Feature Elimination)
  • by systematically adding features (Forward Feature Selection)

These are all very simple, powerful strategies that may be complementary to
other methods → always worth considering!
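
A minimal scikit-learn/pandas sketch of two of the filters above (the low-variance filter and the high-correlation filter); the data, column names, and thresholds here are hypothetical placeholders.

```python
# Low-variance and high-correlation filters on a toy feature table.
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((100, 6)), columns=[f"f{i}" for i in range(6)])
X["f5"] = 1.0                                   # a constant (zero-variance) feature

# Low-variance filter: drop features whose variance falls below a threshold.
selector = VarianceThreshold(threshold=1e-3)
X_lowvar = X.loc[:, selector.fit(X).get_support()]

# High-correlation filter: drop one feature from every pair with |Pearson r| > 0.9.
corr = X_lowvar.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_filtered = X_lowvar.drop(columns=to_drop)
```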

4
Principal Components Analysis (PCA)
PCA is the “canonical” and most common DR technique

Important features of PCA


• PCA assumes linear relationships between variables.
• PCA is scale dependent (features with larger values look more important)
• PCA looks at variance in the feature data (it is important to preprocess the data via normalization,
mean-centering, or scaling).
• PCA is one of the central applications of SVD (Singular Value Decomposition)

The basic goal (illustrated schematically on the slide): represent the data in a lower-dimensional subspace while retaining as much of the original variance as possible.

5
Principal Components Analysis (PCA)

The big idea:

1. We want to identify an orthonormal basis to represent x_i − μ → a d-dimensional basis (exact reconstruction).

2. We will only pick k of the d basis vectors.

3. All of the points will then be represented in a k-dimensional subspace spanned by w_1, …, w_k (approximate reconstruction).

Algorithm → how to choose W and which basis vectors to keep.

6
PCA: How to Pick Basis Vectors
Idea: Find the directions along which the data is maximally spread (highest variance).

w_1 captures the most variance of the data. w_2 captures the 2nd most variance of the data and is orthogonal to w_1.

7
https://www.cs.cornell.edu/courses/cs4786/2020sp/
PCA: How to Pick Basis Vectors
Idea: Find the directions along which the data is maximally spread (highest variance).

• Mean-subtracted data: $\tilde{x}_i = x_i - \mu$, with $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$

• Covariance matrix: $\Sigma = \frac{1}{n}\sum_{i=1}^{n} \tilde{x}_i \tilde{x}_i^T$

• First principal component: $w_1 = \arg\max_{\|w\|=1} w^T \Sigma w$

Subsequent components can be determined by orthogonalization and repetition.

8
PCA: Additional Details and Realization
• We want to maximize this function subject to the constraint that the norm of w is 1. We can do this with Lagrange multipliers:

$$w_1 = \arg\max_{w} \; w^T \Sigma w - \lambda \left( \|w\|_2^2 - 1 \right)$$

• We now take the derivative of this with respect to w, set it equal to 0, and arrive at

$$\Sigma w_1 - \lambda w_1 = 0$$

• This is an eigenvalue equation! The direction w_1 that we obtain by maximizing the variance is a unit vector that satisfies this eigenvalue equation. We want to maximize $w^T \Sigma w$, so if we plug in the eigenvalue solution we get

$$w_1^T \Sigma w_1 = \lambda \|w_1\|_2^2 = \lambda$$

So the eigenvector w_1 that maximizes the variance is the one with the largest eigenvalue!

• We can proceed in a similar fashion to pick the eigenvectors with the second, third, etc. largest eigenvalues to form this basis. Principal components analysis then amounts to simply finding the eigenvectors of the covariance matrix Σ!

9
PCA: The Basic Algorithm
1. Standardize the d-dimensional dataset.

2. Construct the covariance matrix:

$$\sigma_{jk} = \frac{1}{n-1} \sum_{i=1}^{n} \left( x_j^{(i)} - \mu_j \right) \left( x_k^{(i)} - \mu_k \right)$$

3. Decompose the covariance matrix into its eigenvectors and eigenvalues: $\Sigma v = \lambda v$

4. Sort the eigenvalues in decreasing order to rank the corresponding eigenvectors.

5. Select the k eigenvectors that correspond to the k largest eigenvalues, where k is the dimensionality of the new feature subspace (k ≤ d).

6. Construct a projection (feature transformation) matrix, W, from the "top" k eigenvectors and project the data:

$$x' = W^T x$$

where W is the d × k matrix whose columns are the selected eigenvectors (equivalently, X' = XW for the n × d data matrix X).
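
A minimal NumPy sketch of steps 1-6 above, cross-checked against scikit-learn's PCA; the data is synthetic and k = 2 is an arbitrary choice.

```python
# PCA via eigendecomposition of the covariance matrix.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # n x d data matrix
k = 2

# 1. Standardize (mean-center and scale to unit variance).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix (d x d), using the 1/(n-1) convention from the slide.
cov = np.cov(X_std, rowvar=False)

# 3.-5. Eigendecompose, sort by decreasing eigenvalue, keep the top k eigenvectors.
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh: symmetric matrix, ascending order
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order[:k]]                     # d x k projection matrix

# 6. Project the data onto the new k-dimensional subspace.
X_proj = X_std @ W                            # n x k

# Cross-check (component signs may differ between implementations).
X_sklearn = PCA(n_components=k).fit_transform(X_std)
```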

10
The Proportion of Variance Explained
It is natural to ask how much of the information in a data set is lost by projecting the observations onto a small
subset of principal components.

The amount of variance explained can be analytically related to the eigenvalues of the covariance matrix via:

$$\text{Explained Variance Ratio}_j = \frac{\lambda_j}{\sum_{j'=1}^{d} \lambda_{j'}}$$
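
A short sketch of the ratio above: the covariance eigenvalues compared against scikit-learn's explained_variance_ratio_, on synthetic data with unequal variances.

```python
# Explained variance ratio from the eigenvalues of the covariance matrix.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])
Xc = X - X.mean(axis=0)                                        # mean-centered

eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]   # descending order
print(eigvals / eigvals.sum())                                 # lambda_j / sum_j lambda_j

print(PCA().fit(Xc).explained_variance_ratio_)                 # agrees up to numerical noise
```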

11
Application of PCA to Visualizing Chemical
Space

12
ACS Polym. Au 2023, 3, 4, 318–330
Problems with PCA
PCA generally fails to detect low-dimensional manifolds

Tenenbaum, J. B.; de Silva, V.; Langford, J. C. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 2000, 290, 2319–2323.
13
The Kernel Trick Applied to PCA
Consider the following…

The Kernel PCA Trick: Tackle nonlinear problems by projecting them onto a new feature space of higher dimensionality, and then use PCA there.

Ex: $x = (x_1, x_2)^T \;\xrightarrow{\phi}\; z = (x_1, x_2, x_1^2 + x_2^2)^T$. What happens if we do PCA here?

For data with zero mean, the covariance matrix is

$$\Sigma = \frac{1}{n} \sum_{i=1}^{n} x^{(i)} x^{(i)T}$$

It can be proved that the dot products can be generalized to nonlinear feature combinations¹:

$$\Sigma = \frac{1}{n} \sum_{i=1}^{n} \phi(x^{(i)}) \phi(x^{(i)})^T$$

Solving for the eigensystem of this covariance matrix is equivalent to solving the eigensystem of a defined kernel function, K. With $K = \phi(X)\phi(X)^T$, solving $Kv = \lambda v$ is referred to as Kernel PCA.
1. Kernel principal component analysis, B. Scholkopf, A. Smola, and K.R. Muller, 583-588, 1997. 14
Using Kernel PCA
The challenging task in kernel PCA is picking which kernel to use. The most common is the radial basis function (RBF) kernel:

$$\kappa\!\left(x^{(i)}, x^{(j)}\right) = e^{-\gamma \left\| x^{(i)} - x^{(j)} \right\|^2}$$

Algorithm to implement Kernel PCA:

1. Compute the kernel matrix, K:

$$K = \begin{bmatrix} \kappa(x^{(1)}, x^{(1)}) & \cdots & \kappa(x^{(1)}, x^{(n)}) \\ \vdots & \ddots & \vdots \\ \kappa(x^{(n)}, x^{(1)}) & \cdots & \kappa(x^{(n)}, x^{(n)}) \end{bmatrix}$$

2. Center the kernel matrix (like normal feature scaling, but in kernel space):

$$K' = K - \mathbf{1}_n K - K \mathbf{1}_n + \mathbf{1}_n K \mathbf{1}_n$$

where $\mathbf{1}_n$ is the n × n matrix with all elements equal to 1/n.

3. Solve $K'v = \lambda v$ and extract the top k eigenvectors.
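
A minimal NumPy sketch of the three steps above with an RBF kernel, cross-checked against scikit-learn's KernelPCA; the concentric-circles dataset and the value of gamma are illustrative choices.

```python
# Kernel PCA by hand: RBF kernel, centering, eigendecomposition.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, _ = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
gamma, k = 15.0, 2

# 1. Kernel matrix: kappa(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2).
K = np.exp(-gamma * cdist(X, X, "sqeuclidean"))

# 2. Center the kernel matrix: K' = K - 1n K - K 1n + 1n K 1n.
n = K.shape[0]
one_n = np.full((n, n), 1.0 / n)
K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n

# 3. Eigendecompose and keep the top k eigenvectors; scaling each by sqrt(lambda)
#    gives the projected coordinates of the training points.
eigvals, eigvecs = np.linalg.eigh(K_centered)
X_kpca = eigvecs[:, ::-1][:, :k] * np.sqrt(eigvals[::-1][:k])

# Cross-check (sign and ordering conventions may differ slightly).
X_sklearn = KernelPCA(n_components=k, kernel="rbf", gamma=gamma).fit_transform(X)
```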

1. Kernel principal component analysis, B. Scholkopf, A. Smola, and K.R. Muller, 583-588, 1997. 15
Using Kernel PCA

https://scikit-learn.org/stable/auto_examples/decomposition/plot_kernel_pca.html 16
Using Kernel PCA

17
Other Greatest Hits of Dimensionality
Reduction
• Linear Discriminant Analysis
• Generalized Discriminant Analysis
• UMAP
• Non-negative Matrix Factorization
• Classical Scaling
• Maximum Variance Unfolding
• Diffusion Maps
• Locally Linear Embedding
• t-distributed stochastic neighbor embedding (t-SNE)
• Laplacian Eigenmaps
• Hessian LLE
• Local Tangent Space Analysis
• Sammon Mapping
• Multilayer Autoencoders
• Locally Linear Coordination
• Manifold Charting
• ISOMAP
• Kernel PCA

There are many more beyond this!

18
Other Greatest Hits of Dimensionality
Reduction
Variational Autoencoders:
Shmilovich, K.; Mansbach, R. A.; Sidky, H.; Dunne, O. E.; Panda, S. S.; Tovar, J. D.; Ferguson, A. L. Discovery of Self-Assembling π-Conjugated Peptides by Active Learning-Directed Coarse-Grained Molecular Simulation. The Journal of Physical Chemistry B 2020, 124, 3873–3891.

Diffusion Maps:
Ferguson, A. L.; Panagiotopoulos, A. Z.; Kevrekidis, I. G.; Debenedetti, P. G. Nonlinear Dimensionality Reduction in Molecular Simulation: The Diffusion Map Approach. Chemical Physics Letters 2011, 509, 1–11.

UMAP:
Reis, M.; Gusev, F.; Taylor, N. G.; Chung, S. H.; Verber, M. D.; Lee, Y. Z.; Isayev, O.; Leibfarth, F. A. Machine-Learning-Guided Discovery of 19F MRI Agents Enabled by Automated Copolymer Synthesis. Journal of the American Chemical Society 2021, 143, 17677–17689.

19
Standard DR Techniques in Scikit-learn
sklearn.decomposition
https://scikit-learn.org/stable/modules/decomposition.html
• Kernel PCA
• Independent Components Analysis
• Linear Factor Analysis
• Non-negative Matrix Factorization
• Truncated SVD

sklearn.manifold

• Isomap
• Locally Linear Embedding
• Spectral Embedding
• t-SNE
• Multidimensional Scaling

https://scikit-learn.org/stable/modules/manifold.html
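
All of these transformers share the same fit_transform estimator interface; here is a minimal sketch on scikit-learn's toy S-curve dataset (parameter values are illustrative, not tuned).

```python
# A few of the listed dimensionality reduction transformers, used the same way.
from sklearn.datasets import make_s_curve
from sklearn.decomposition import KernelPCA, TruncatedSVD
from sklearn.manifold import Isomap, TSNE

X, color = make_s_curve(n_samples=500, random_state=0)   # 3-D points on a 2-D manifold

X_svd = TruncatedSVD(n_components=2).fit_transform(X)
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=2.0).fit_transform(X)
X_iso = Isomap(n_components=2, n_neighbors=10).fit_transform(X)
X_tsne = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(X)
```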

20
State-of-the-Art Nonlinear Dimensionality
Reduction Algorithms

• In the time that remains we will discuss a very


powerful and popular non-linear dimensionality
reduction algorithm called t-SNE.

• t-SNE and UMAP are the two state-of-the-art methods commonly employed. We will only talk about t-SNE, but luckily UMAP is quite similar, should you ever want to learn about it.

Not linearly separable!

21
t-Distributed Stochastic Neighbor Embedding (t-SNE)
Given a set of high-dimensional data points x_1, …, x_n, t-SNE computes conditional probabilities, p_{j|i}, that are proportional to the similarity of data points x_i and x_j:

$$p_{j|i} = \frac{\exp\!\left( -\| x_i - x_j \|^2 / 2\sigma_i^2 \right)}{\sum_{k \neq i} \exp\!\left( -\| x_i - x_k \|^2 / 2\sigma_i^2 \right)}$$

Why can we call the similarity a conditional probability? As explained in the original article:

"The similarity of datapoint x_j to datapoint x_i is the conditional probability, p_{j|i}, that x_i would pick x_j as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at x_i."

van der Maaten, L. J. P.; Hinton, G. E. Visualizing Data Using t-SNE. Journal of Machine Learning Research 2008, 9, 2579–2605.

22
t-Distributed Stochastic Neighbor Embedding (t-SNE)
Given a set of high-dimensional data points x_1, …, x_n, t-SNE computes conditional probabilities, p_{j|i}, that are proportional to the similarity of data points x_i and x_j.

So how the heck do we pick σ? σ effectively determines the number of nearest neighbors that any given point "feels". In practice, this value of σ is determined in a fairly complicated and mathematically involved fashion…

One defines a target perplexity, "k", which is computed from the conditional probability distribution P_i using

$$\mathrm{Perp}(P_i) = 2^{H(P_i)}, \qquad H(P_i) = -\sum_{j} p_{j|i} \log_2 p_{j|i}$$

One then finds the optimal σ_i by finding the value that satisfies this prespecified equation. High k → high σ.
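
A minimal sketch of the σ search described above: for a single point x_i, compute the Gaussian conditional probabilities p_{j|i} and adjust σ_i by bisection until the perplexity 2^H(P_i) hits a target value. The data and the target perplexity of 15 are hypothetical.

```python
# Finding sigma_i for one point so that its perplexity matches a target value.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
i, target_perplexity = 0, 15.0

sq_dists = np.sum((X - X[i]) ** 2, axis=1)               # ||x_i - x_j||^2 for all j

def conditional_p(sigma):
    # Shift by the nearest-neighbor distance for numerical stability (cancels on normalization).
    shifted = sq_dists - np.partition(sq_dists, 1)[1]
    p = np.exp(-shifted / (2.0 * sigma**2))
    p[i] = 0.0                                           # a point is not its own neighbor
    return p / p.sum()

def perplexity(p):
    p = p[p > 0]
    return 2.0 ** (-np.sum(p * np.log2(p)))              # 2^H(P_i), entropy in bits

lo, hi = 1e-3, 1e3                                       # perplexity increases with sigma
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if perplexity(conditional_p(mid)) < target_perplexity:
        lo = mid
    else:
        hi = mid
sigma_i = 0.5 * (lo + hi)
```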

23
OK, So What Do I Do With All of These p_{j|i}'s?
Now we find a lower-dimensional space into which we can map the high-dimensional points while preserving these conditional probabilities as well as possible. We refer to the conditional probabilities in the low-dimensional space as q_{j|i}.

How the heck do we actually do this? We can measure the mismatch between the high-dimensional probabilities and the low-dimensional probabilities using the Kullback-Leibler divergence as a loss function for each data point:

$$C = \sum_i \mathrm{KL}(P_i \,\|\, Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$$

The Algorithm:

• Randomly place the low-dimensional points, and then minimize the cost function using gradient descent to find improved points. This is messy to do in practice, but simple conceptually.
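
A minimal sketch of the cost described above for the "plain SNE" setup on this slide: Gaussian conditional probabilities q_{j|i} in the low-dimensional space and the summed per-point KL divergence. The P matrix here is a random stand-in for the high-dimensional p_{j|i} values; gradient descent on Y would reduce this cost.

```python
# Low-dimensional conditional probabilities and the KL-divergence cost.
import numpy as np

rng = np.random.default_rng(0)
n = 50
P = rng.random((n, n))
np.fill_diagonal(P, 0.0)
P /= P.sum(axis=1, keepdims=True)                 # rows sum to 1, like p_{j|i}

Y = rng.normal(scale=1e-2, size=(n, 2))           # randomly placed low-dimensional points

def q_conditional(Y):
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    q = np.exp(-d2)
    np.fill_diagonal(q, 0.0)
    return q / q.sum(axis=1, keepdims=True)

def kl_cost(P, Q, eps=1e-12):
    # sum_i KL(P_i || Q_i) = sum_i sum_j p_{j|i} log(p_{j|i} / q_{j|i})
    return float(np.sum(P * np.log((P + eps) / (Q + eps))))

print(kl_cost(P, q_conditional(Y)))
```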

24
OK, I Lied. This Isn't Exactly t-SNE. To Get to Real t-SNE You Modify a Couple of Things:

1. Symmetrize the conditional probabilities (this makes the KL-divergence calculation faster):

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$

where n is the total number of high-dimensional data points.

2. Use a Student's t-distribution instead of a Gaussian for the low-dimensional probability distribution (this helps prevent crowding in the projection).

And then you do the same gradient descent on the KL divergence, etc.
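
A minimal sketch of the two modifications above: symmetrizing the conditional probabilities and using a Student's t-distribution (one degree of freedom) for the low-dimensional q_ij, following the cited van der Maaten & Hinton paper.

```python
# The two modifications that turn plain SNE into t-SNE.
import numpy as np

def symmetrize(P_conditional):
    """P_conditional[i, j] = p_{j|i}; returns the joint p_ij, which sums to 1 over all pairs."""
    n = P_conditional.shape[0]
    return (P_conditional + P_conditional.T) / (2.0 * n)

def q_student_t(Y):
    """Joint low-dimensional similarities q_ij; heavy tails reduce crowding."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    q = 1.0 / (1.0 + d2)
    np.fill_diagonal(q, 0.0)
    return q / q.sum()                                   # joint, not conditional
```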

25
How to Use t-SNE Effectively (and How Not To), Including How to Pick Perplexity

Scikit-learn recommends this link, which is quite helpful: https://distill.pub/2016/misread-tsne/. Read it before you ever use t-SNE for an important task.

Another useful t-SNE tip:

• Try initializing your low-dimensional representation with a PCA-derived projection instead of random points. Usually this helps, and scikit-learn has a flag that will do this for you (see the sketch below).
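
A minimal scikit-learn sketch of the tip above, using init="pca"; the digits dataset and perplexity value are illustrative choices.

```python
# t-SNE with PCA initialization in scikit-learn.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)          # 1797 samples, 64 features
X_embedded = TSNE(
    n_components=2,
    perplexity=30,
    init="pca",          # start from a PCA projection instead of random points
    random_state=0,
).fit_transform(X)
print(X_embedded.shape)                      # (1797, 2)
```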

26
How to Use t-SNE Effectively (and How Not To), Including How to Pick Perplexity

1. Cluster sizes in a t-SNE plot mean nothing.

2. Distances between clusters might not mean anything.

3. Random noise doesn't always look random.

27
Uniform Manifold Approximation and
Projection (UMAP)
The original paper is a pain to understand and read through, but suffice it to say that people like it because it has a more rigorous mathematical justification than t-SNE. In terms of function, it has many similarities to t-SNE:

• Gaussian similarities between high-dimensional data points
• Symmetrization of the similarities
• t-distribution similarities between low-dimensional data points
• A more nuanced definition of "perplexity"

Other points of interest about UMAP:

• Uses cross-entropy instead of KL divergence.
• Uses stochastic gradient descent instead of plain gradient descent.
• Initializes the low-dimensional space with a spectral method instead of random points or PCA.
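
A minimal sketch using the umap-learn package (documentation linked below); it assumes `pip install umap-learn`, and the parameter values are illustrative.

```python
# UMAP embedding of the digits dataset with umap-learn.
import umap
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
reducer = umap.UMAP(
    n_neighbors=15,      # plays a role loosely analogous to perplexity in t-SNE
    min_dist=0.1,        # how tightly points may be packed in the embedding
    n_components=2,
    random_state=0,
)
X_embedded = reducer.fit_transform(X)
```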

28
https://umap-learn.readthedocs.io/en/latest/
UMAP and t-SNE
If you would like to better understand UMAP I would recommend:

https://pair-code.github.io/understanding-umap/

The biggest differences between UMAP and t-SNE are:

• UMAP has a firmer mathematical foundation than t-SNE, and thus many folks like it simply
for this reason.

• UMAP uses a few clever tricks to improve numerical stability.

• UMAP is often better at preserving global structure than t-SNE - the inter-cluster relationships are potentially more meaningful than they are in t-SNE.

• But both are very widely used!

29
Jupyter Notebook Example

30
