Autoencoders and Principal Components Analysis (PCA)
One of the purposes of machine learning is to automatically learn how to use data, without
writing code by hand. When we started the course with linear regression, we saw that we
could represent complicated functions if we hand-engineered features (or basis functions).
Those functions can then be turned into “neural networks”, where — given enough labelled
data — we can learn the features that are useful for classification automatically.
For some data science tasks the amount of labelled data is small. In these situations it is
useful to have pre-existing basis functions that were fitted as part of solving some other task.
We can then fit a linear regression model on top of these basis functions. Or perhaps use the
basis functions to initialize a neural network, and only train for a short time.
The basis functions could come from fitting another supervised task. For example, neural
networks trained on the large ImageNet dataset are often used to initialize the training of
image recognition models for tasks with only a few labels.
We may also wish to use completely unlabelled data, such as relevant but unannotated
images or text. Recently (2018–), there has been an explosion of interest in Natural Language
Processing in using pre-trained deep neural networks based on unlabelled data. See the
Further Reading for papers.
1 Autoencoders
Autoencoders solve an “unsupervised” task: find a representation of feature vectors, without
any labels. This representation might be useful for other tasks. An autoencoder is a neural
network representing a vector-valued function, which when fitted well, approximately
returns its input:
f(x) ≈ x. (1)
If we were allowed to set up the network arbitrarily, this function is easy to represent. For
example, we could use a single "weight matrix" set to the identity:
f(x) = Wx, with W = I. (2)
The task only becomes interesting if the network is restricted. A common choice is to force
the data through a narrow hidden layer of K < D units:
h = g^(1)(W^(1) x + b^(1)) (3)
f = g^(2)(W^(2) h + b^(2)), (4)
where W^(1) is a K × D weight matrix, and the g's are element-wise functions. If the function
output manages to closely match its inputs, then we have a good lossy compressor. The
network can compress D numbers down into K numbers, and then decode them again,
approximately reconstructing the original input.
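As a rough sketch of equations (3)–(4) in NumPy, with random placeholder weights standing in for fitted parameters and tanh as one possible choice of non-linearity:
import numpy as np

D, K = 10, 2                                        # input size and bottleneck size
rng = np.random.default_rng(0)

# Placeholder parameters; fitting would tune these to minimize the
# reconstruction error sum((f - x)**2) over the training inputs.
W1, b1 = rng.standard_normal((K, D)), np.zeros(K)   # encoder parameters, eq. (3)
W2, b2 = rng.standard_normal((D, K)), np.zeros(D)   # decoder parameters, eq. (4)

x = rng.standard_normal(D)                          # an example input vector
h = np.tanh(W1 @ x + b1)                            # K-dimensional code, eq. (3)
f = W2 @ h + b2                                     # reconstruction, eq. (4) with identity g^(2)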
One application of dimensionality reduction is visualization. When K = 2 we can plot our
transformed data as a 2-dimensional scatter-plot.
When an autoencoder works well, the transformed values h contain most of the information
from the original input. We should therefore be able to use these transformed vectors as
input to a classifier instead of our original data. It might then be possible to fit a classifier
using less labelled data, because we are fitting a function with lower-dimensional inputs.
Σ = QΛQ⊤, (5)
where Λ is a diagonal matrix containing the eigenvalues of Σ, and the columns of Q contain
the eigenvectors of Σ.
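As a quick numerical check of this decomposition, using a made-up data matrix:
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 3))              # made-up N x D data
Sigma = np.cov(X, rowvar=False)                # D x D covariance matrix

evals, Q = np.linalg.eigh(Sigma)               # eigenvalues (ascending) and eigenvectors
assert np.allclose(Sigma, Q @ np.diag(evals) @ Q.T)   # Sigma = Q Lambda Q^T, eq. (5)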
[Figure: scatter plot for K = 1. The '+' markers show the data X, the '·' markers show the reconstructions X_proj, and the red line shows the principal direction V[:,0]; both axes run from −1 to 1.]
The two-dimensional coordinate of each + is reduced to one number, giving the position
along the red line that it has been projected onto (the principal component). Transforming
back up to two dimensions gives the coordinates of the •’s in the full 2-dimensional space.
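A similar picture can be produced with a few lines of NumPy; the data here are made up, and the principal direction is taken from the eigendecomposition of the covariance:
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((20, 2)) @ np.array([[1.0, 0.6], [0.0, 0.3]])  # made-up 2D data
x_bar = X.mean(0)

evals, V = np.linalg.eigh(np.cov(X - x_bar, rowvar=False))
v1 = V[:, -1]                        # principal direction (largest eigenvalue)

z = (X - x_bar) @ v1                 # each point reduced to one number (K = 1)
X_proj = np.outer(z, v1) + x_bar     # coordinates mapped back into 2 dimensions

Plotting X as '+', X_proj as '·', and the line through x_bar along v1 gives a figure like the one above.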
3. The data might not be Gaussian distributed, so this summary could be misleading, just as the standard deviation
can be a misleading indicator of width for a 1D distribution.
4 PCA Examples
PCA is widely used across many different types of data. It can give a quick first visualization
of a dataset, or reduce the number of dimensions of a data matrix if overfitting or
computational cost are concerns.
An example where we expect data to be largely controlled by a few numbers is body
shape. The location of a point on a triangular mesh representing a human body is strongly
constrained by the surrounding mesh-points, and could be accurately predicted with linear
regression. PCA describes the principal ways in which variables can jointly change when
moving away from the mean object.4 The principal components are often interpretable,
and can be animated. Starting at a mean body mesh, one can move along each of the
principal components, showing taller/shorter people, and then thinner/fatter people. The
later principal components will correspond to more subtle, less interpretable combinations
of features that covary.
A striking PCA visualization was obtained by reducing the dimensionality of ≈ 200,000
features of people’s DNA to two dimensions (Novembre et al., 2008).5 The coordinates along
the two principal axes closely correspond to a map of Europe showing where the people
came from(!). The people were carefully chosen.
As is often the case with useful algorithms, we can choose how to put our data into them, and
solve different tasks with the same code. Given an N × D matrix, we can run PCA to visualize
the N rows. Or we can transpose the matrix and instead visualize the D columns. As an
example, we took a binary S × C matrix M relating students and courses, with Msc = 1 if student
s was taking course c. In terms of these features, each course is a length-S vector, or each
student is a length-C vector. We can reduce either of these sets of vectors to 2-dimensions
and visualize them.
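A minimal sketch of both uses, with a random binary matrix standing in for the real student–course data:
import numpy as np

def pca_2d(A):
    # 2D embedding of the rows of A: centre, then keep the top two SVD components.
    U, s, VT = np.linalg.svd(A - A.mean(0), full_matrices=False)
    return U[:, :2] * s[:2]

rng = np.random.default_rng(3)
M = (rng.random((200, 30)) < 0.2).astype(float)  # made-up S x C student-course matrix

student_points = pca_2d(M)     # one 2D point per student (rows of M)
course_points = pca_2d(M.T)    # one 2D point per course (columns of M)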
The 2D scatter plot of courses was somewhat interpretable:
[Figure: 2D scatter plot of courses along the first two principal directions, with points labelled by course acronym: CPSLP, MT, ANLP, NLU, SProc, ASR, ALE1, TCM, CCS, MASWS, MI, CCN.]
4. While they’re doing something a little more complicated, you can get an idea of what the principal components
of body shape look like from the figures in the paper: Lie bodies: a manifold representation of 3D human shape,
Freifeld and Black, ECCV 2012.
5. Genes mirror geography within Europe. https://www.nature.com/articles/nature07331
Finally, PCA doesn’t always work well. One of the papers that helped convince people that
it was feasible to fit deep neural networks showed impressive results with non-linear deep
autoencoders in cases where PCA worked poorly: Reducing the dimensionality of data with
neural networks, Hinton and Salakhutdinov (2006). Science, Vol. 313, no. 5786, pp. 504–507,
28 July 2006. Available from https://www.cs.utoronto.ca/~hinton/papers.html
5 Pre-processing matters
The units that data are measured in affect the principal components. Given the ages, weights,
and heights of people, it matters if their height is measured in centimeters or meters. The
numbers are 100 times bigger if we use centimeters, making the square error for an equivalent
mistake in reconstructing height 10,000 times bigger. Therefore, if we use centimeters the
principal component will be more aligned with height to reduce overall square error than
if we use meters. To give each feature similar importance, it’s common to standardize all
features so they have unit standard deviation, but the best scaling could depend on the
application.
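As a small illustration with made-up numbers, dividing each column by its standard deviation removes the dependence on the choice of units:
import numpy as np

def standardize(X):
    # Scale each feature (column) to have unit standard deviation.
    return X / X.std(0)

# Columns: age (years), weight (kg), height in metres vs centimetres.
X_m = np.array([[25, 70.0, 1.80], [40, 55.0, 1.65], [31, 80.0, 1.75]])
X_cm = X_m * np.array([1.0, 1.0, 100.0])
assert np.allclose(standardize(X_m), standardize(X_cm))  # same features either way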
Given DNA data, xd ∈ {A, C, G, T}, we have to decide how to encode categorical data. We
could use one-hot encoding. In the example above, Novembre et al. used a lossy binary
encoding indicating if the subject had the most common letter or not.
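A sketch of the two encodings for a single DNA position, using made-up letters:
import numpy as np

letters = np.array(list("ACGT"))
x = np.array(list("AACGTAAC"))                   # made-up letters for one position

# One-hot: each letter becomes a length-4 binary vector (N x 4 matrix).
one_hot = (x[:, None] == letters[None, :]).astype(float)

# Lossy binary encoding: 1 if the subject has the most common letter here.
values, counts = np.unique(x, return_counts=True)
binary = (x == values[np.argmax(counts)]).astype(float)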
As usual, given positive data, we may wish to take logarithms. There are lots of free choices
in data analysis.
The truncated singular value decomposition (SVD) factorizes an N × D matrix as X ≈ USV⊤,
where U has size N × K, S is a diagonal K × K matrix, and V⊤ has size K × D. The columns of
the V matrix (or the rows of V⊤) contain eigenvectors of X⊤X. The columns of U contain
eigenvectors of XX⊤. The rows of U give a K-dimensional embedding of the rows of X. The
columns of V⊤ (or the rows of V) give a K-dimensional embedding of the columns of X.
When K = min(N, D), the SVD exactly reconstructs the matrix. For smaller K, the truncated SVD is
known to be the best low-rank approximation of a matrix, as measured by square error.
When applied to centred data (the mean feature vector has been subtracted from every row of
X, so that ∑n Xnd = 0 for each feature d), SVD gives the same solution as PCA. The V matrix
contains the eigenvectors of the covariance (Σ = (1/N) X⊤X, where the 1/N scaling makes no
difference to the directions). The U matrix contains the eigenvectors of the covariance if we
were to transpose our data matrix before applying PCA.
Python demo:
import numpy as np

# PCA via SVD, for an NxD matrix X reduced to K dimensions.
# (Random example inputs, assumed here so the snippet runs on its own.)
N, D, K = 100, 5, 2
X = np.random.randn(N, D)

x_bar = np.mean(X, 0)                             # mean feature vector
U, vecS, VT = np.linalg.svd(X - x_bar, full_matrices=False)  # SVD of centred data
U = U[:, :K]         # NxK "datapoints" transformed into K-dims
vecS = vecS[:K]      # The diagonal elements of diagonal matrix S, in a vector
V = VT[:K, :].T      # DxK "features" transformed into K-dims
X_kdim = U * vecS    # = np.dot(U, np.diag(vecS))
X_proj = np.dot(X_kdim, V.T) + x_bar              # SVD approx USV' + mean
8 Further reading
Different tutorials will focus on different use-cases of PCA. Some practitioners are mostly
interested in reducing the dimensionality of their data. Others are interested in inspecting
and interpreting the components.
You may also find that different tutorials put different emphasis on the two different
principles from which PCA can be derived: 1) Auto-encoding / error minimization: PCA gives the K-dimensional linear projection that minimizes the square error of the reconstructed data. 2) Variance maximization: the principal components are the orthogonal directions along which the projected data have the largest variance.
6. https://homepages.inf.ed.ac.uk/imurray2/pub/14dnade/
7. https://arxiv.org/abs/1810.04805
8. https://thegradient.pub/nlp-imagenet/
9. https://blog.google/products/search/search-language-understanding-bert