
5. Dimensionality Reduction

COMP3314 Machine Learning

Motivation
● Many ML problems have thousands or even millions of features
● As a result, the problem can become intractable
○ Training is slow
○ Finding a solution is difficult (Curse of Dimensionality)
○ Data visualization is impossible
● Solution
○ Dimensionality Reduction using feature extraction
○ Often possible without losing much relevant information
■ E.g., Merge neighboring pixels of the MNIST dataset

The Curse of Dimensionality



High Dimensional Weirdness


● 2D
○ A point picked at random in a unit square has a <0.4% chance of
being located <0.001 from a border
● 10,000D
○ A point picked at random in a unit hypercube has a >99.99999%
chance of being located <0.001 from a border
● I.e., the high-dimensional unit hypercube can be said to consist
almost entirely of borders with almost no middle

High Dimensional Weirdness


● 2D
○ Pick two random points in a unit square
○ The distance between them is about 0.52 on average
● 1,000,000D
○ Pick two random points in a unit hypercube
○ The distance between them is about 408.25 on average (a quick
simulation below illustrates this)
● How can two points be so far apart when they both lie within the
same unit hypercube?
● As a result, new test samples will likely be far away from training
samples in high dimensional space
○ Overfitting risk is much higher in high-dimensional space
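The average distances quoted above follow from the geometry of the unit hypercube (the mean grows roughly as sqrt(d/6), which gives 408.25 for d = 1,000,000). A minimal Monte Carlo sketch, not part of the course notebooks, can check this for smaller d (d = 1,000,000 is too large to simulate directly here):

```python
# Estimate the average distance between two uniformly random points
# in a d-dimensional unit hypercube.
import numpy as np

rng = np.random.default_rng(42)

def avg_pair_distance(d, n_pairs=100_000):
    """Mean Euclidean distance between two random points in [0, 1]^d."""
    a = rng.random((n_pairs, d))
    b = rng.random((n_pairs, d))
    return np.linalg.norm(a - b, axis=1).mean()

print(avg_pair_distance(2))                      # ~0.52
print(avg_pair_distance(10_000, n_pairs=1_000))  # ~40.8, i.e. roughly sqrt(d / 6)
```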

Idea: Projection
● In most problems, training
instances are not spread out
uniformly across all dimensions
● Many features are almost
constant, while others are highly
correlated
● As a result, all training instances
lie within (or close to) a much
lower-dimensional subspace of
the high-dimensional space

When Projection Fails


● Projection is not always the
best approach
● Consider the following toy
dataset to illustrate this
problem
○ The Swiss roll
● Simply projecting onto a plane
(e.g., dropping x3) would squash
the different layers of the roll together

Solution: Manifold Learning


● The Swiss roll is an example of a 2D manifold
○ A 2D manifold is a 2D shape that can be bent and twisted in a
higher-dimensional space
● It is possible to learn the manifold on which the training instances
lie and then to unroll the Swiss roll

Manifold Learning
● Note
○ The decision boundary may not always be simpler in lower
dimensions

Outline
● PCA
○ Principal Component Analysis
○ Projects data points onto (few) principal components
● LLE
○ Locally Linear Embedding
○ Powerful nonlinear dimensionality reduction technique
○ Manifold Learning technique that does not rely on projections

PCA - Principal Component Analysis


● By far the most popular dimensionality reduction algorithm
● Identifies a hyperplane and then projects data onto it
How to choose the hyperplane?

Preserving the Variance


● Select the axis that preserves the maximum amount of variance
○ I.e., the one that loses the least information compared with other projections
● Example: projecting the same 2D data onto three different 1D hyperplanes (axes)
○ One axis preserves the maximum variance
○ A second preserves an intermediate amount of variance
○ The third preserves very little variance

Principal Components (PC)


● The first PC is the axis that accounts for the
largest amount of variance
○ E.g., PC1 in the figure
● The second PC is orthogonal to the first one
and accounts for the largest amount of
remaining variance
○ E.g., PC2 in the figure
○ In this 2D example there is no choice
● In a higher-dimensional dataset, the third PC
would be orthogonal to both previous axes,
then a fourth, a fifth, and so on, with as many
PCs as there are dimensions in the dataset

How to find PCs?


● There is a standard matrix factorization technique
called Singular Value Decomposition (SVD)
● It decomposes the training set matrix X into the
matrix multiplication of three matrices
X = U Σ V⊺, where V contains the unit vectors
that define all the principal components that we
are looking for
● Note that PCs are highly sensitive to data scaling
● We need to standardize the features prior to PCA
if the features were measured on different scales

PCA - Principal Component Analysis


● An unsupervised linear transformation technique
○ Finds PCs
■ Using e.g., SVD
○ Projects data onto a subspace with fewer (or equal) dimensions using
some (or all) of the found PCs
■ Multiply the original data by a transformation matrix whose
columns are some (or all) of the PCs

Projecting Down to k Dimensions


● Once you have identified all the principal components, you can reduce the
dimensionality of the dataset down to k dimensions by projecting it onto
the hyperplane defined by the first k principal components
● To project the training set onto the hyperplane and obtain a reduced
dataset of dimensionality k, compute the matrix multiplication of the
training set vector (or matrix) x (or X) by the matrix W, defined as the
matrix containing the first k columns of V
● W is a d × k transformation matrix
○ Maps a d-dimensional row vector x to a
k-dimensional vector z = x W (or Z = X W for the whole training set)
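A minimal sketch of this projection step (the toy data and variable names here are illustrative, not taken from the course notebook):

```python
import numpy as np

# Toy data: any (m x d) matrix works; PCA assumes the data is centered
X = np.random.rand(100, 5)
X_centered = X - X.mean(axis=0)

# SVD: X_centered = U @ np.diag(s) @ Vt, where the rows of Vt are the unit
# vectors defining the principal components
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

k = 2
W = Vt.T[:, :k]        # d x k matrix whose columns are the first k PCs
Z = X_centered @ W     # m x k projection of the training set
```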

Code - PCA.ipynb
● Available here on Colab

Load and Standardize Data


● Let’s apply PCA on the wine dataset
○ Load the wine dataset and split it into separate train and test sets
○ Standardize the (d=13)-dimensional dataset
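A hedged sketch of these two steps, assuming the copy of the Wine dataset bundled with scikit-learn and a 70/30 split (the course notebook may load the data and split it differently):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)      # 178 samples, d = 13 features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)   # fit the scaler on the training data only
X_test_std = sc.transform(X_test)         # reuse the same scaling for the test data
```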

Projecting Down

(notebook code: manual projection of the standardized wine data onto the first principal components; see the SVD sketch above)

Using Scikit-Learn’s PCA


● Scikit-Learn’s PCA class uses SVD decomposition to implement
PCA
○ Just like we did manually
● The following code applies PCA to reduce the dimensionality of the
dataset down to two dimensions
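A minimal sketch of that step, reusing X_train_std and X_test_std from the standardization sketch above (variable names are assumptions, not the notebook's exact code):

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_std)   # fit the PCs on the training data, then project
X_test_pca = pca.transform(X_test_std)         # project the test data with the same PCs
```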

Explained Variance Ratio


● Another useful piece of information is the explained variance ratio
of each principal component
○ Available via the explained_variance_ratio_ variable
● The ratio indicates the proportion of the dataset’s variance that lies
along each principal component

● This output tells us that about 36.9% of the dataset's variance lies along
the first PC and about 18.4% lies along the second PC, and so on
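A short sketch of inspecting the ratios, continuing the wine example above (the approximate values in the comment are the ones quoted on the slide):

```python
from sklearn.decomposition import PCA

pca_full = PCA()              # keep all 13 components
pca_full.fit(X_train_std)     # X_train_std: standardized wine training data from above
print(pca_full.explained_variance_ratio_)
# Per the slide, the first two entries are roughly 0.369 and 0.184 for this dataset
```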

Choosing the Right Number of Dimensions


● Choose the number of dimensions that add up to a sufficiently large
portion of the variance (e.g., 95%)
○ Unless, of course, you are reducing dimensionality for data
visualization—in that case you will want to reduce the dimensionality
down to 2 or 3
● The following code performs PCA without reducing dimensionality, then
computes the minimum number of dimensions required to preserve 90%
of the training set’s variance
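A hedged sketch of that procedure (X_train_std is the standardized training data from the earlier sketches; names are assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

pca = PCA()                        # no dimensionality reduction yet
pca.fit(X_train_std)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.90) + 1  # smallest number of dims preserving 90% of the variance
print(d)
```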

Choosing the Right Number of Dimensions


● You could then set n_components=k and run PCA again
○ But there is a much better option: instead of specifying the
number of principal components you want to preserve, you can
set n_components to be a float between 0.0 and 1.0, indicating
the ratio of variance you wish to preserve
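For example, a minimal sketch of that option (again assuming the standardized training data from above):

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=0.90)               # preserve 90% of the variance
X_reduced = pca.fit_transform(X_train_std)
print(pca.n_components_)                   # number of components actually kept
```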

Choosing the Right Number of Dimensions


● Yet another option is to plot the explained variance as a function of
the number of dimensions
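A minimal plotting sketch, reusing cumsum from the sketch above (styling details are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

plt.plot(np.arange(1, len(cumsum) + 1), cumsum, marker="o")
plt.axhline(y=0.90, linestyle="--")            # target variance threshold
plt.xlabel("Number of dimensions")
plt.ylabel("Cumulative explained variance")
plt.show()
```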

PCA for Compression


● Let’s apply PCA to the MNIST dataset while preserving 90% of its variance
○ 87 features instead of the original 784 features
○ This size reduction can speed up a classification algorithm (such as an SVM
classifier) tremendously
● It is also possible to decompress the reduced dataset back to 784 dimensions
○ This won’t give you back the original data, since the projection lost a bit of
information (within the 10% variance that was dropped)
○ The following code compresses the MNIST dataset down to 87 dimensions, then
uses the inverse_transform() method to decompress it back to 784 dimensions
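A hedged sketch of that round trip (MNIST is fetched via OpenML here; the notebook may load it differently, and the exact dimension count can vary slightly):

```python
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA

mnist = fetch_openml("mnist_784", as_frame=False)
X_train = mnist.data[:60_000]

pca = PCA(n_components=0.90)                    # keep 90% of the variance (~87 dims per the slide)
X_reduced = pca.fit_transform(X_train)          # compress: 784 -> ~87 dimensions
X_recovered = pca.inverse_transform(X_reduced)  # decompress back to 784 dimensions (lossy)
```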

PCA for Compression

(figure: original MNIST digits compared with digits decompressed after the 90%-variance projection)

Randomized PCA
● If you set the svd_solver hyperparameter to "randomized", Scikit-Learn uses a stochastic
algorithm called Randomized PCA that quickly finds an approximation of the first k
principal components
○ It is dramatically faster than full SVD when k is much smaller than d
● By default, svd_solver is actually set to "auto"
○ Scikit-Learn automatically uses the randomized algorithm if d is greater than
500 and k is less than 80% of d; otherwise it uses the full SVD approach
○ If you want to force Scikit-Learn to use full SVD, you can set the svd_solver
hyperparameter to "full"
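A one-line sketch of forcing the randomized solver (X_train as in the MNIST sketch above; the random_state value is an assumption):

```python
from sklearn.decomposition import PCA

rnd_pca = PCA(n_components=87, svd_solver="randomized", random_state=42)
X_reduced = rnd_pca.fit_transform(X_train)   # X_train: e.g., the MNIST images from above
```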

Incremental PCA
● The previous PCA implementations require the whole training set to fit in memory
● Incremental PCA (IPCA) allows you to feed the algorithm one mini-batch at a time
○ Useful for large training sets and online training (i.e., on the fly, as new data arrive)
● The following code splits the MNIST dataset into 100 mini-batches (using NumPy’s
array_split() function) and feeds them to Scikit-Learn’s IncrementalPCA class
○ Note that you must call the partial_fit() method with each mini-batch, rather than the
fit() method with the whole training set
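A minimal sketch along those lines, reusing the MNIST X_train from the sketches above:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

n_batches = 100
inc_pca = IncrementalPCA(n_components=87)
for X_batch in np.array_split(X_train, n_batches):
    inc_pca.partial_fit(X_batch)        # feed one mini-batch at a time

X_reduced = inc_pca.transform(X_train)  # project the full set once the PCs are learned
```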

Outline
● PCA
○ Principal Component Analysis
○ Projects data points onto (few) principal components
● LLE
○ Locally Linear Embedding
○ Powerful nonlinear dimensionality reduction technique
○ Manifold Learning technique that does not rely on projections

Code - LLE.ipynb
● Available here on Colab

LLE
● How it works
○ Measures how each training instance linearly relates to its
closest neighbors
○ Then looks for a low-dimensional representation of the training
set where these local relationships are best preserved
● This approach makes it particularly good at unrolling twisted
manifolds, especially when there is not too much noise

Example: Unrolling the Swiss roll
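A minimal sketch of this example (the dataset generation and parameter values are assumptions, not the course notebook's exact code):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)

lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_unrolled = lle.fit_transform(X)   # 3D Swiss roll -> 2D "unrolled" embedding
```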



LLE - Details
● For each training sample x(i), the algorithm identifies its
n_neighbors closest neighbors
○ E.g., n_neighbors = 10
● Then it tries to reconstruct x(i) as a linear function of these
neighbors
● More specifically, it finds the weights wi,j such that the squared
distance between x(i) and ∑j wi,j x(j) is as small as possible,
assuming wi,j = 0 if x(j) is not one of the n_neighbors closest neighbors of x(i)

LLE - Details
● Thus the first step of LLE is the constrained optimization problem
below, where W is the weight matrix containing all the weights wi,j
● The second constraint simply normalizes the weights for each
training instance x(i)
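The equation itself did not survive the export; reconstructed from the standard LLE formulation (matching the description above), the first step is:

```latex
\hat{W} = \underset{W}{\operatorname{argmin}}
          \sum_{i=1}^{m} \Bigl\| \mathbf{x}^{(i)} - \sum_{j=1}^{m} w_{i,j}\,\mathbf{x}^{(j)} \Bigr\|^{2}
\quad \text{subject to} \quad
\begin{cases}
  w_{i,j} = 0 & \text{if } \mathbf{x}^{(j)} \text{ is not one of the closest neighbors of } \mathbf{x}^{(i)} \\
  \sum_{j=1}^{m} w_{i,j} = 1 & \text{for } i = 1, 2, \dots, m
\end{cases}
```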

LLE - Details
● After this step, the weight matrix Ŵ (containing the weights ŵi,j)
encodes the local linear relationships between the training instances
● The second step is to map the training instances into a k-
dimensional space (where k < d) while preserving these local
relationships as much as possible
● If z(i) is the image of x(i) in this k-dimensional space, then we want
the squared distance between z(i) and ∑j ŵi,j z(j) to be as small as possible

LLE - Details
● This idea leads to the following unconstrained optimization
problem
● It looks very similar to the first step, but instead of keeping the
instances fixed and finding the optimal weights, we are doing the
reverse
○ Keeping the weights fixed and finding the optimal position of
the instances’ images in the low-dimensional space
● Note that Z is the matrix containing all z(i)
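Reconstructed in the same way as the first-step equation, the second-step objective is:

```latex
\hat{Z} = \underset{Z}{\operatorname{argmin}}
          \sum_{i=1}^{m} \Bigl\| \mathbf{z}^{(i)} - \sum_{j=1}^{m} \hat{w}_{i,j}\,\mathbf{z}^{(j)} \Bigr\|^{2}
```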

Other Dimensionality Reduction Techniques


● There are many other dimensionality reduction techniques, several
of which are available in Scikit-Learn
● Here are some of the most popular ones
○ Random Projections
○ Multidimensional Scaling (MDS)
○ Isomap
○ t-Distributed Stochastic Neighbor Embedding (t-SNE)
○ Linear Discriminant Analysis (LDA)

References
● Most materials in this chapter are
based on
○ Book
○ Code

References
● Some materials in this chapter
are based on
○ Book
○ Code

Exercise 1
● What are the main motivations for reducing a dataset’s dimensionality?
○ What are the main drawbacks?
● What is the curse of dimensionality?
● Once a dataset’s dimensionality has been reduced, is it possible to reverse
the operation?
○ If so, how? If not, why?
● Can PCA be used to reduce the dimensionality of a highly nonlinear
dataset?
● Suppose you perform PCA on a 1,000-dimensional dataset, setting the
explained variance ratio to 95%
○ How many dimensions will the resulting dataset have?

Exercise 2
● In what cases would you use vanilla PCA, Incremental PCA,
Randomized PCA?
● How can you evaluate the performance of a dimensionality
reduction algorithm on your dataset?
● Does it make any sense to chain two different dimensionality
reduction algorithms?

Exercise 3
● Load the MNIST dataset and split it into a training set and a test set (take the first
60,000 instances for training, and the remaining 10,000 for testing)
● Train a Random Forest classifier on the dataset and time how long it takes, then
evaluate the resulting model on the test set
● Next, use PCA to reduce the dataset’s dimensionality, with an explained variance
ratio of 95%
● Train a new Random Forest classifier on the reduced dataset and see how long it
takes
● Was training much faster?
● Next, evaluate the classifier on the test set
● How does it compare to the previous classifier?
