Dimensionality Reduction
COMP3314
Machine Learning
Motivation
● Many ML problems have thousands or even millions of features
● As a result, the problem becomes intractable
○ Training is slow
○ Finding a solution is difficult (Curse of Dimensionality)
○ Data visualization is impossible
● Solution
○ Dimensionality Reduction using feature extraction
○ Often possible without losing much relevant information
■ E.g., Merge neighboring pixels of the MNIST dataset
Idea: Projection
● In most problems, training
instances are not spread out
uniformly across all dimensions
● Many features are almost
constant, while others are highly
correlated
● As a result, all training instances
lie within (or close to) a much
lower-dimensional subspace of
the high-dimensional space
Manifold Learning
● Note
○ The decision boundary may not always be simpler in lower
dimensions
Outline
● PCA
○ Principal Component Analysis
○ Projects data points onto (few) principal components
● LLE
○ Locally Linear Embedding
○ Powerful nonlinear dimensionality reduction technique
○ Manifold Learning technique that does not rely on projections
Code - PCA.ipynb
● Available here on Colab
Projecting Down
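Projecting onto the first principal components takes only a few lines in Scikit-Learn. Below is a minimal sketch on toy data (the notebook uses its own dataset, so the variable names and values here are illustrative only):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data standing in for the notebook's dataset: 100 points in 3D that lie
# close to a 2D plane, so two principal components capture most of the variance.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 3)) + 0.05 * rng.normal(size=(100, 3))

# Project the data down to 2 dimensions (PCA centers the data automatically).
pca = PCA(n_components=2)
X2D = pca.fit_transform(X)
print(X2D.shape)  # (100, 2)
```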
● The output of explained_variance_ratio_ tells us that 36.9% of the dataset’s variance lies along the first PC, 18.4% lies along the second PC, and so on
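Those ratios come from the course notebook's dataset; in general they can be read directly off the fitted estimator. Continuing the sketch above:

```python
# explained_variance_ratio_ holds the fraction of the dataset's variance
# that lies along each principal component (one entry per component kept).
print(pca.explained_variance_ratio_)
# The course notebook's dataset gives roughly [0.369, 0.184, ...];
# the toy data above will give different values.
```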
Randomized PCA
● If you set the svd_solver hyperparameter to "randomized", Scikit-Learn uses a stochastic algorithm called Randomized PCA that quickly finds an approximation of the first k principal components (see the sketch below)
○ It is dramatically faster than full SVD when k is much smaller than d
● By default, svd_solver is actually set to "auto"
○ Scikit-Learn automatically uses the randomized PCA algorithm if d is greater than 500 and k is less than 80% of d; otherwise it uses the full SVD approach
○ If you want to force Scikit-Learn to use full SVD, set the svd_solver hyperparameter to "full"
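A hedged sketch of forcing the randomized solver (the data is a random stand-in; the 154-component choice mirrors the usual MNIST example, where it captures roughly 95% of the variance):

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for an MNIST-sized training set (m = 2,000 samples, d = 784 features).
X_train = np.random.default_rng(42).normal(size=(2000, 784))

# Force the stochastic solver: much faster than full SVD when k << d.
rnd_pca = PCA(n_components=154, svd_solver="randomized", random_state=42)
X_reduced = rnd_pca.fit_transform(X_train)
print(X_reduced.shape)  # (2000, 154)
```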
Incremental PCA
● The previous PCA implementations require the whole training set to fit in memory
● Incremental PCA (IPCA) lets you split the training set into mini-batches and feed them to the algorithm one at a time
○ Useful for large training sets and online training (i.e., on the fly, as new data arrive)
● The following code splits the MNIST dataset into 100 mini-batches (using NumPy’s
array_split() function) and feeds them to Scikit-Learn’s IncrementalPCA class
○ Note that you must call the partial_fit() method with each mini-batch, rather than the
fit() method with the whole training set
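The code itself is not reproduced on the slide; the sketch below reconstructs it along the lines described, with a small random stand-in for MNIST so it runs quickly (the notebook uses the real 60,000 × 784 data with n_components=154):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Small random stand-in for the MNIST training set.
X_train = np.random.default_rng(0).normal(size=(5000, 784))

n_batches = 100
inc_pca = IncrementalPCA(n_components=40)   # each mini-batch must contain at least n_components samples
for X_batch in np.array_split(X_train, n_batches):
    inc_pca.partial_fit(X_batch)            # partial_fit() per mini-batch, not fit() on the whole set

X_reduced = inc_pca.transform(X_train)
print(X_reduced.shape)                      # (5000, 40)
```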
Outline
● PCA
○ Principal Component Analysis
○ Projects data points onto (few) principal components
● LLE
○ Locally Linear Embedding
○ Powerful nonlinear dimensionality reduction technique
○ Manifold Learning technique that does not rely on projections
Code - LLE.ipynb
● Available here on Colab
LLE
● How it works
○ Measures how each training instance linearly relates to its
closest neighbors
○ Then looks for a low-dimensional representation of the training
set where these local relationships are best preserved
● This approach makes it particularly good at unrolling twisted
manifolds, especially when there is not too much noise
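In Scikit-Learn this is available as LocallyLinearEmbedding. The sketch below unrolls a Swiss roll, the classic example for this technique (the parameter values are common illustrative choices, not necessarily the notebook's):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# A noisy Swiss roll: a 2D manifold rolled up inside 3D space.
X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=41)

# LLE: preserve each point's linear relationship to its 10 nearest neighbors
# while flattening the data into 2 dimensions.
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_unrolled = lle.fit_transform(X)
print(X_unrolled.shape)  # (1000, 2)
```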
LLE - Details
● For each training sample x(i), the algorithm identifies its
n_neighbors closest neighbors
○ E.g., n_neighbors = 10
● Then it tries to reconstruct x(i) as a linear function of these
neighbors
● More specifically, it finds the weights wi,j such that the squared distance between x(i) and its reconstruction ∑j wi,j x(j) is as small as possible
LLE - Details
● Thus the first step of LLE is the constrained optimization problem below, where W is the weight matrix containing all the weights wi,j

$$\hat{W} = \underset{W}{\operatorname{argmin}} \; \sum_{i=1}^{m} \Bigl\| \, x^{(i)} - \sum_{j=1}^{m} w_{i,j}\, x^{(j)} \Bigr\|^{2}$$

subject to
○ wi,j = 0 if x(j) is not one of the n_neighbors closest neighbors of x(i)
○ ∑j wi,j = 1 for i = 1, 2, …, m

● The second constraint simply normalizes the weights for each training instance x(i) (a small NumPy sketch of this step follows)
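For intuition, here is a minimal NumPy sketch of this first step for a single instance (this is not Scikit-Learn's implementation; the function name, the regularization constant, and the toy data are illustrative). Because the weights sum to 1, the reconstruction error equals wᵀGw, where G is the Gram matrix of the centered neighbors, so the constrained minimizer is proportional to G⁻¹1:

```python
import numpy as np

def reconstruction_weights(x, neighbors, reg=1e-3):
    """Weights (summing to 1) that best reconstruct x from its closest neighbors.

    x: (d,) array; neighbors: (n_neighbors, d) array of its closest neighbors.
    """
    # Center the neighbors on x and form the local Gram matrix G.
    D = neighbors - x
    G = D @ D.T
    # Small regularization keeps G invertible when n_neighbors > d.
    G += reg * np.trace(G) * np.eye(len(neighbors))
    # Minimizing w^T G w subject to sum(w) = 1 gives w proportional to G^{-1} 1.
    w = np.linalg.solve(G, np.ones(len(neighbors)))
    return w / w.sum()

# Tiny usage example: weights that reconstruct a 3D point from 4 nearby points.
rng = np.random.default_rng(0)
x = rng.normal(size=3)
neighbors = x + 0.1 * rng.normal(size=(4, 3))
w = reconstruction_weights(x, neighbors)
print(w, w.sum())  # the weights sum to 1
```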
LLE - Details
● After this step, the weight matrix Ŵ (containing the weights ŵi,j) encodes the local linear relationships between the training instances
● The second step is to map the training instances into a k-dimensional space (where k < d) while preserving these local relationships as much as possible
● If z(i) is the image of x(i) in this k-dimensional space, then we want the squared distance between z(i) and ∑j ŵi,j z(j) to be as small as possible
LLE - Details
● This idea leads to the unconstrained optimization problem below, where Z is the matrix containing all the z(i)

$$\hat{Z} = \underset{Z}{\operatorname{argmin}} \; \sum_{i=1}^{m} \Bigl\| \, z^{(i)} - \sum_{j=1}^{m} \hat{w}_{i,j}\, z^{(j)} \Bigr\|^{2}$$

● It looks very similar to the first step, but instead of keeping the instances fixed and finding the optimal weights, we are doing the reverse
○ Keeping the weights fixed and finding the optimal positions of the instances’ images in the low-dimensional space
References
● Most materials in this chapter are
based on
○ Book
○ Code
References
● Some materials in this chapter
are based on
○ Book
○ Code
Exercise 1
● What are the main motivations for reducing a dataset’s dimensionality?
○ What are the main drawbacks?
● What is the curse of dimensionality?
● Once a dataset’s dimensionality has been reduced, is it possible to reverse
the operation?
○ If so, how? If not, why?
● Can PCA be used to reduce the dimensionality of a highly nonlinear
dataset?
● Suppose you perform PCA on a 1,000-dimensional dataset, setting the
explained variance ratio to 95%
○ How many dimensions will the resulting dataset have?
Exercise 2
● In what cases would you use vanilla PCA, Incremental PCA, or Randomized PCA?
● How can you evaluate the performance of a dimensionality
reduction algorithm on your dataset?
● Does it make any sense to chain two different dimensionality
reduction algorithms?
Exercise 3
● Load the MNIST dataset and split it into a training set and a test set (take the first
60,000 instances for training, and the remaining 10,000 for testing)
● Train a Random Forest classifier on the dataset and time how long it takes, then
evaluate the resulting model on the test set
● Next, use PCA to reduce the dataset’s dimensionality, with an explained variance
ratio of 95%
● Train a new Random Forest classifier on the reduced dataset and see how long it
takes
● Was training much faster?
● Next, evaluate the classifier on the test set
● How does it compare to the previous classifier?