Deep Learning Tutorial: Autoencoders
Jen-Tzung Chien
Department of Electrical and Computer Engineering
Department of Computer Science
National Chiao Tung University, Hsinchu
September 26, 2016
Outline
• Introduction
• Undercomplete Autoencoders
• Representational Power, Layer Size and Depth
• Stochastic Encoders and Decoders
• Denoising Autoencoders
• Learning Manifolds with Autoencoders
• Predictive Sparse Decomposition
• Applications of Autoencoders
Introduction
• An autoencoder is a neural network that is trained to attempt to copy its input
to its output
• Internally, it has a hidden layer h that describes a code used to represent the
input
• The network may be viewed as consisting of two parts: an encoder function
h = f(x) and a decoder that produces a reconstruction r = g(h)
• If an autoencoder succeeds in simply learning to set g(f(x)) = x everywhere,
then it is not especially useful
• Instead, autoencoders are designed to be unable to learn to copy perfectly
Autoencoder Graphical Model
[Figure: graphical model of an autoencoder, with an encoder f mapping the input x to the code h and a decoder g mapping h to the reconstruction r]
Introduction
• Modern autoencoders have generalized the idea of an encoder and a decoder
beyond deterministic functions to stochastic mappings pencoder(h|x) and
pdecoder(x|h)
• Traditionally, autoencoders were used for dimensionality reduction or feature
learning
• Recently, theoretical connections between autoencoders and latent variable
models have brought autoencoders to the forefront of generative modeling
• Unlike general feedforward networks, autoencoders may also be trained using
recirculation (Hinton and McClelland, 1988), a learning algorithm based on
comparing the activations of the network on the original input to the activations
on the reconstructed input
Undercomplete Autoencoders
• We hope that training the autoencoder to perform the input copying task will
result in h taking on useful properties
• One way to obtain useful features from the autoencoder is to constrain h to
have smaller dimension than x
• An autoencoder whose code dimension is less than the input dimension is
called undercomplete
• Learning an undercomplete representation forces the autoencoder to capture
the most salient features of the training data
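As a concrete illustration (my own sketch, not taken from the slides), an undercomplete autoencoder can be written in PyTorch as an encoder f and decoder g in which the code layer is smaller than the input; the layer sizes and activation below are arbitrary choices for the example.

import torch
import torch.nn as nn

class UndercompleteAutoencoder(nn.Module):
    """Autoencoder whose code dimension is smaller than the input dimension."""
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, code_dim), nn.ReLU())  # h = f(x)
        self.decoder = nn.Linear(code_dim, input_dim)                            # r = g(h)

    def forward(self, x):
        h = self.encoder(x)
        return self.decoder(h)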
Learning Process
• The learning process is described simply as minimizing a loss function
L(x, g(f(x)))
where L is a loss function penalizing g(f(x)) for being dissimilar from x, such
as the mean squared error
• When the decoder is linear and L is the mean squared error, an undercomplete
autoencoder learns to span the same subspace as PCA
• In this case, an autoencoder trained to perform the copying task has learned
the principal subspace of the training data as a side-effect
• If the encoder and decoder are allowed too much capacity, the autoencoder
can learn to perform the copying task without extracting useful information
about the distribution of the data
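A minimal training loop for this copying objective might look as follows (a sketch with placeholder data and arbitrary sizes, not a definitive implementation); the sequential model plays the role of g(f(x)) with a 32-unit code.

import torch
import torch.nn as nn

# f followed by g: the 32-unit code layer is smaller than the 784-dim input
model = nn.Sequential(nn.Linear(784, 32), nn.ReLU(), nn.Linear(32, 784))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                    # L(x, g(f(x))) as mean squared error

data = torch.rand(256, 784)               # placeholder standing in for real training data
for epoch in range(10):
    recon = model(data)
    loss = loss_fn(recon, data)           # penalize g(f(x)) for being dissimilar from x
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()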
Regularized Autoencoders
• A similar problem occurs if the hidden code is allowed to have dimension
equal to the input, and in the overcomplete case in which the hidden code has
dimension greater than the input
• In these cases, even a linear encoder and linear decoder can learn to copy
the input to the output without learning anything useful about the data
distribution
• Rather than limiting the model capacity by keeping the encoder and decoder
shallow and the code size small, regularized autoencoders use a loss function
that encourages the model to have other properties besides the ability to copy
its input to its output
• A regularized autoencoder can be nonlinear and overcomplete but still learn
something useful about the data distribution even if the model capacity is
great enough to learn a trivial identity function
Sparse Autoencoders
• A sparse autoencoder is simply an autoencoder whose training criterion involves
a sparsity penalty Ω(h) on the code layer h, in addition to the reconstruction
error:
L(x, g(f(x))) + Ω(h)
where g(h) is the decoder output and typically we have h = f(x), the encoder
output
• Sparse autoencoders are typically used to learn features for another task such
as classification
• An autoencoder that has been regularized to be sparse must respond to unique
statistical features of the dataset it has been trained on, rather than simply
acting as an identity function
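A hedged sketch of the penalized criterion L(x, g(f(x))) + Ω(h) with an absolute value (L1) penalty on the code; the weight λ and layer sizes are illustrative only.

import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU())    # f
decoder = nn.Linear(128, 784)                              # g
lam = 1e-3                                                 # sparsity weight λ

def sparse_ae_loss(x):
    h = encoder(x)                                         # code h = f(x)
    recon = decoder(h)                                     # g(h)
    reconstruction_error = ((recon - x) ** 2).mean()       # L(x, g(f(x)))
    sparsity_penalty = lam * h.abs().sum(dim=1).mean()     # Ω(h) = λ Σ_i |h_i|
    return reconstruction_error + sparsity_penalty

x = torch.rand(64, 784)
print(sparse_ae_loss(x))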
Sparse Autoencoders
• We can think of the penalty Ω(h) simply as a regularizer term added
to a feedforward network whose primary task is to copy the input to
the output (unsupervised learning objective) and possibly also perform some
supervised task (with a supervised learning objective) that depends on these
sparse features
• Training with weight decay and other regularization penalties can be interpreted
as a MAP approximation to Bayesian inference, with the added regularizing
penalty corresponding to a prior probability distribution over the model
parameters
• Regularized autoencoders defy such an interpretation because the regularizer
depends on the data and is therefore by definition not a prior in the formal
sense of the word
Sparse Autoencoders
• Rather than thinking of the sparsity penalty as a regularizer for the copying
task, we can think of the entire sparse autoencoder framework as approximating
maximum likelihood training of a generative model that has latent variables
• Suppose we have a model with visible variables x and latent variables h, with
an explicit joint distribution pmodel(x, h) = pmodel(h)pmodel(x|h)
• We refer to pmodel(h) as the model’s prior distribution over the latent variables,
representing the model’s beliefs prior to seeing x
log pmodel(x, h) = log pmodel(h) + log pmodel(x|h)
Sparse Autoencoders
• The log pmodel(h) term can be sparsity-inducing. For example, the Laplace
prior
pmodel(hi) = (λ/2) e^(−λ|hi|)
corresponds to an absolute value sparsity penalty
−log pmodel(h) = Σ_i ( λ|hi| − log(λ/2) ) = Ω(h) + const
Ω(h) = λ Σ_i |hi|
where the constant term depends only on λ and not on h
• We typically treat λ as a hyperparameter and discard the constant term since
it does not affect the parameter learning
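As a quick numerical check of the algebra above (my own verification, not part of the slides), the negative log of a factorized Laplace prior does split into λ Σ_i |hi| plus a term that depends only on λ.

import math
import torch

lam = 0.5
h = torch.tensor([-1.2, 0.0, 3.4])                     # an arbitrary code vector

# log pmodel(h) for a factorized Laplace prior: Σ_i [log(λ/2) − λ|h_i|]
log_prior = len(h) * math.log(lam / 2) - lam * h.abs().sum()
omega = lam * h.abs().sum()                            # Ω(h) = λ Σ_i |h_i|
const = -len(h) * math.log(lam / 2)                    # depends only on λ, not on h
print(torch.isclose(-log_prior, omega + const))        # tensor(True)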
Denoising Autoencoders
• A denoising autoencoder or DAE instead minimizes
L(x, g(f(x̃)))
where x̃ is a copy of x that has been corrupted by some form of noise
• Denoising training forces f and g to implicitly learn the structure of pdata(x),
as shown by Alain and Bengio (2014) and Bengio et al. (2013)
Regularizing by Penalizing Derivatives
• Another strategy for regularizing an autoencoder is to use a penalty Ω as in
sparse autoencoders
L(x, g(f(x))) + Ω(x, h)
but with a different form of Ω:
Ω(x, h) = λ Σ_i ||∇x hi||²
• This forces the model to learn a function that does not change much when x
changes slightly
• An autoencoder regularized in this way is called a contractive autoencoder or
CAE
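One way to compute the derivative penalty Ω(x, h) = λ Σ_i ||∇x hi||² in practice is with automatic differentiation; the sketch below (illustrative sizes and an untrained stand-in encoder, not the authors' code) differentiates each code unit with respect to the input and sums the squared gradients over a batch.

import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(10, 4), nn.Sigmoid())   # hypothetical f(x)

def derivative_penalty(x, lam=0.1):
    # Ω(x, h) = λ Σ_i ||∇_x h_i||², summed over the batch
    x = x.clone().requires_grad_(True)
    h = encoder(x)
    penalty = torch.zeros(())
    for i in range(h.shape[-1]):
        grad_i, = torch.autograd.grad(h[..., i].sum(), x, create_graph=True)
        penalty = penalty + (grad_i ** 2).sum()
    return lam * penalty

x = torch.randn(8, 10)
print(derivative_penalty(x))   # can be added to the reconstruction loss and backpropagated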
Representational Power, Layer Size and Depth
• An autoencoder with a single hidden layer is able to represent the identity
function along the domain of the data arbitrarily well
• A deep autoencoder, with at least one additional hidden layer inside the encoder
itself, can approximate any mapping from input to code arbitrarily well, given
enough hidden units
• A common strategy for training a deep autoencoder is to greedily pretrain the
deep architecture by training a stack of shallow autoencoders
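A rough sketch of this greedy pretraining strategy (my example, with placeholder data and sizes): each shallow autoencoder is trained to reconstruct the codes produced by the previous one, and the trained encoders are then stacked into a deep encoder that can be fine-tuned with another criterion.

import torch
import torch.nn as nn

def train_shallow_ae(data, code_dim, epochs=20):
    # Train one shallow autoencoder on `data` and return its encoder.
    in_dim = data.shape[1]
    enc = nn.Sequential(nn.Linear(in_dim, code_dim), nn.ReLU())
    dec = nn.Linear(code_dim, in_dim)
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(epochs):
        recon = dec(enc(data))
        loss = ((recon - data) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return enc

data = torch.rand(256, 784)                  # placeholder training data
enc1 = train_shallow_ae(data, 128)           # first shallow autoencoder on the raw input
codes1 = enc1(data).detach()
enc2 = train_shallow_ae(codes1, 32)          # second autoencoder trained on the codes
deep_encoder = nn.Sequential(enc1, enc2)     # stacked deep encoder, ready for fine-tuning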
Stochastic Encoders and Decoders
• Any latent variable model pmodel(h, x) defines a stochastic encoder
pencoder(h|x) = pmodel(h|x)
• and a stochastic decoder
pdecoder(x|h) = pmodel(x|h)
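As an illustration (my example, not from the slides): if the stochastic decoder pdecoder(x|h) is chosen to be a Gaussian with mean g(h) and identity covariance, its negative log-likelihood is squared reconstruction error plus a constant, which links the stochastic view back to mean squared error training. The decoder below is an untrained stand-in.

import math
import torch
import torch.nn as nn

decoder_mean = nn.Linear(32, 784)             # g(h), a hypothetical decoder

def decoder_nll(x, h):
    # -log pdecoder(x|h) for pdecoder(x|h) = N(x; g(h), I)
    mean = decoder_mean(h)
    d = x.shape[-1]
    return 0.5 * ((x - mean) ** 2).sum(dim=-1) + 0.5 * d * math.log(2 * math.pi)

x = torch.rand(4, 784)
h = torch.randn(4, 32)
print(decoder_nll(x, h))                      # per-example negative log-likelihood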
Denoising Autoencoders
• The denoising autoencoder (DAE) is an autoencoder that receives a corrupted
data point as input and is trained to predict the original, uncorrupted data
point as its output
• We introduce a corruption process C(x̂|x) which represents a conditional
distribution over corrupted samples x̂, given a data sample x
DAE Training Procedure
Algorithm 1 DAE training procedure
1: Sample a training example x from the training data
2: Sample a corrupted version x̂ from C(x̂|x)
3: Use (x̂, x) as a training example for estimating the autoencoder reconstruction
distribution preconstruct(x|x̂) = pdecoder(x|h), where h is the output of the
encoder f(x̂) and pdecoder is typically defined by a decoder g(h)
• We can therefore view the DAE as performing stochastic gradient descent on
the following expectation
−E_{x∼p̂data(x)} E_{x̂∼C(x̂|x)} [log pdecoder(x | h = f(x̂))]
where p̂data(x) is the training distribution
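A minimal sketch of one DAE training step, with additive Gaussian noise standing in for the corruption process C(x̂|x) and mean squared error standing in for −log pdecoder(x|h); the noise level and layer sizes are arbitrary choices for the example.

import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU())     # f
decoder = nn.Linear(128, 784)                               # g
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(64, 784)                                     # clean training examples
x_corrupted = x + 0.3 * torch.randn_like(x)                 # sample from C(x̂ | x)
recon = decoder(encoder(x_corrupted))                       # g(f(x̂))
loss = ((recon - x) ** 2).mean()                            # reconstruct the clean x, not x̂
optimizer.zero_grad()
loss.backward()
optimizer.step()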
Vector Field g(f(x)) − x
[Figure: vector field of g(f(x)) − x learned by a denoising autoencoder; the reconstruction direction points from each input x back toward the data manifold]
Historical Perspective
• The name "denoising autoencoder" refers to a model that is intended not
merely to learn to denoise its input but to learn a good internal representation
as a side effect of learning to denoise (Vincent et al., 2008; Vincent et al.,
2010)
• Prior to the introduction of the modern DAE, Inayoshi and Kurita (2005)
explored some of the same goals with some of the same methods
• Their approach minimizes reconstruction error in addition to a supervised
objective while injecting noise into the hidden layer of a supervised MLP, with
the aim of improving generalization through the added reconstruction error and
injected noise
Learning Manifolds with Autoencoders
• Like many other machine learning algorithms, autoencoders exploit the idea
that data concentrates around a low-dimensional manifold or a small set of
such manifolds
• At a point x on a d-dimensional manifold, the tangent plane is given by d basis
vectors that span the local directions of variation allowed on the manifold
• These local directions specify how one can change x infinitesimally while
staying on the manifold
Tangent Planes
[Figure: a tangent plane at a point x on a low-dimensional manifold, spanned by the local directions of variation that keep x on the manifold]
Learning Manifolds with Autoencoders
• All autoencoder training procedures involve a compromise between two forces:
– Learning a representation h of a training example x such that x can be
approximately recovered from h through a decoder
– Satisfying the constraint or regularization penalty
• The two forces together are useful because they force the hidden representation
to capture information about the structure of the data generating distribution
One-Dimensional Example
Figure 1: If the autoencoder learns a reconstruction function that is invariant to
small perturbations near the data points, it captures the manifold structure of
the data
Non-parametric Manifold Learning
Figure 2: Non-parametric manifold learning procedures build a nearest neighbor
graph in which nodes represent training examples and directed edges indicate
nearest neighbor relationships
Global Coordinate System
Figure 3: Each local patch can be thought of as a local Euclidean coordinate
system or as a locally flat Gaussian, or "pancake", with a very small variance in
the directions orthogonal to the pancake and a very large variance in the directions
defining the coordinate system on the pancake
Contractive Autoencoders
• The contractive autoencoder (Rifai et al., 2011a; Rifai et al., 2011b)
introduces an explicit regularizer on the code h = f(x), encouraging the
derivatives of f to be as small as possible:
Ω(h) = λ ||∂f(x)/∂x||²_F
• Denoising autoencoders make the reconstruction function resist small but
finite-sized perturbations of the input, while contractive autoencoders make
the feature extraction function resist infinitesimal perturbations of the input
Tangent Vectors
• The goal of the CAE is to learn the manifold structure of the data
• Directions x with large Jx, where J = ∂f(x)/∂x is the Jacobian of the encoder
at a point x, rapidly change h, so these are likely to be directions that
approximate the tangent planes of the manifold
• The directions corresponding to the largest singular values are interpreted as
the tangent directions that the contractive autoencoder has learned
• Ideally, these tangent directions should correspond to real variations in the
data
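A sketch of extracting candidate tangent directions from an encoder (illustrative only; the encoder here is an untrained stand-in for a trained CAE): compute the Jacobian J = ∂f(x)/∂x at a point and take the right-singular vectors with the largest singular values.

import torch
import torch.nn as nn
from torch.autograd.functional import jacobian

encoder = nn.Sequential(nn.Linear(10, 4), nn.Tanh())   # stand-in for a trained f

x0 = torch.randn(10)                                   # a single data point
J = jacobian(encoder, x0)                              # Jacobian of h = f(x) at x0, shape (4, 10)
U, S, Vh = torch.linalg.svd(J)
tangent_directions = Vh[:2]                            # input directions with the largest effect on h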
Tangent Vectors of the Manifold
Figure 4: Although both local PCA and the CAE can capture local tangents, the
CAE is able to form more accurate estimates from limited training data because
it exploits parameter sharing across different locations that share a subset of
active hidden units. The CAE tangent directions typically correspond to moving
or changing parts of the object (such as the head or legs)
Predictive Sparse Decomposition
• Predictive sparse decomposition (PSD) is a model that is a hybrid of
sparse coding and parametric autoencoders (Kavukcuoglu et al., 2010)
• Training proceeds by minimizing
||x − g(h)||² + λ|h|₁ + ||h − f(x)||²
• Predictive sparse decomposition is an example of learned approximate inference
• PSD models may be stacked and used to initialize a deep network to be trained
with another criterion
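A minimal sketch of the PSD objective above (illustrative sizes and untrained modules, not the authors' code); here h is treated as a free code variable, as in sparse coding, that would be optimized alongside the parameters of f and g during training.

import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(10, 4), nn.Tanh())   # parametric encoder f(x)
g = nn.Linear(4, 10)                             # decoder g(h)
lam = 0.1

def psd_objective(x, h):
    # ||x - g(h)||^2 + λ|h|_1 + ||h - f(x)||^2
    return (((x - g(h)) ** 2).sum()
            + lam * h.abs().sum()
            + ((h - f(x)) ** 2).sum())

x = torch.rand(8, 10)
h = torch.zeros(8, 4, requires_grad=True)        # code treated as a free variable
print(psd_objective(x, h))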
Applications of Autoencoders
• Autoencoders have been successfully applied to dimensionality reduction and
information retrieval tasks
• Hinton and Salakhutdinov (2006) trained a stack of RBMs and then used their
weights to initialize a deep autoencoder
• The resulting code yielded lower reconstruction error than reducing the data
to 30 dimensions with PCA
• One task that benefits even more than usual from dimensionality reduction is
information retrieval, the task of finding entries in a database that resemble a
query entry
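One way such retrieval can work (a generic sketch, not the specific method from the slides): encode every database entry with a trained encoder and return the entries whose low-dimensional codes are closest to the code of the query. The encoder below is an untrained stand-in.

import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 32), nn.ReLU())   # stand-in for a trained encoder

database = torch.rand(1000, 784)                         # entries to index
query = torch.rand(1, 784)

with torch.no_grad():
    db_codes = encoder(database)
    q_code = encoder(query)
    distances = torch.cdist(q_code, db_codes)            # (1, 1000) pairwise distances in code space
    nearest = distances.argsort(dim=1)[0, :5]            # indices of the 5 closest entries
print(nearest)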
References
[1] G. E. Hinton and J. L. McClelland, “Learning representations by recirculation,” in Neural information processing
systems. New York: American Institute of Physics, 1988, pp. 358–366.
[2] G. Alain and Y. Bengio, “What regularized auto-encoders learn from the data-generating distribution.” Journal
of Machine Learning Research, vol. 15, no. 1, pp. 3563–3593, 2014.
[3] Y. Bengio, L. Yao, G. Alain, and P. Vincent, “Generalized denoising auto-encoders as generative models,” in
Advances in Neural Information Processing Systems, 2013, pp. 899–907.
[4] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with
denoising autoencoders,” in Proceedings of the 25th international conference on Machine learning. ACM,
2008, pp. 1096–1103.
[5] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning
useful representations in a deep network with a local denoising criterion,” Journal of Machine Learning Research,
vol. 11, no. Dec, pp. 3371–3408, 2010.
[6] H. Inayoshi and T. Kurita, “Improved generalization by adding both auto-association and hidden-layer-noise to
neural-network-based-classifiers,” in 2005 IEEE Workshop on Machine Learning for Signal Processing. IEEE,
2005, pp. 141–146.
[7] S. Rifai, G. Mesnil, P. Vincent, X. Muller, Y. Bengio, Y. Dauphin, and X. Glorot, “Higher order contractive
auto-encoder,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases.
Springer, 2011a, pp. 645–660.
[8] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio, “Contractive auto-encoders: Explicit invariance
during feature extraction,” in Proceedings of the 28th international conference on machine learning (ICML-11),
2011b, pp. 833–840.
[9] K. Kavukcuoglu, M. Ranzato, and Y. LeCun, “Fast inference in sparse coding algorithms with applications to
object recognition,” arXiv preprint arXiv:1010.3467, 2010.
[10] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science,
vol. 313, no. 5786, pp. 504–507, 2006.