Ch14: Autoencoders
Jen-Tzung Chien
Outline
• Introduction
• Undercomplete Autoencoders
• Representational Power, Layer Size and Depth
• Stochastic Encoders and Decoders
• Denoising Autoencoders
• Learning Manifolds with Autoencoders
• Predictive Sparse Decomposition
• Applications of Autoencoders
Introduction
• An autoencoder is a neural network that is trained to attempt to copy its input to its output
• Internally, it has a hidden layer h that describes a code used to represent the input
• The network may be viewed as two parts: an encoder function h = f(x) and a decoder that produces a reconstruction r = g(h)
Undercomplete Autoencoders
• We hope that training the autoencoder to perform the input copying task will result in h taking on useful properties
• One way to obtain useful features is to constrain h to have a smaller dimension than x; an autoencoder whose code dimension is less than the input dimension is called undercomplete
Learning Process
• The learning process is described simply as minimizing a loss function L(x, g(f(x))), where L is a loss function penalizing g(f(x)) for being dissimilar from x, such as the mean squared error (a minimal training sketch in code follows this list)
• When the decoder is linear and L is the mean squared error, an undercomplete
autoencoder learns to span the same subspace as PCA
• In this case, an autoencoder trained to perform the copying task has learned
the principal subspace of the training data as a side-effect
• If the encoder and decoder are allowed too much capacity, the autoencoder
can learn to perform the copying task without extracting useful information
about the distribution of the data
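A minimal sketch of this training loop, assuming PyTorch, illustrative layer sizes (784-dimensional inputs, a 32-dimensional code), and synthetic data; none of these choices come from the slides:

```python
import torch
import torch.nn as nn

# Undercomplete autoencoder: code dimension (32) smaller than input dimension (784)
encoder = nn.Sequential(nn.Linear(784, 32), nn.ReLU())   # h = f(x)
decoder = nn.Linear(32, 784)                              # r = g(h)

loss_fn = nn.MSELoss()                                    # L(x, g(f(x)))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.randn(128, 784)        # stand-in batch; real data would go here
for step in range(100):
    r = decoder(encoder(x))      # reconstruction g(f(x))
    loss = loss_fn(r, x)         # minimize L(x, g(f(x)))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Replacing the ReLU encoder with a purely linear one recovers the PCA-like setting described above, where the learned code spans the principal subspace of the data.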
Regularized Autoencoders
• A similar problem occurs if the hidden code is allowed to have dimension equal to the input, and in the overcomplete case, in which the hidden code has dimension greater than the input
• In these cases, even a linear encoder and linear decoder can learn to copy
the input to the output without learning anything useful about the data
distribution
• Rather than limiting the model capacity by keeping the encoder and decoder
shallow and the code size small, regularized autoencoders use a loss function
that encourages the model to have other properties besides the ability to copy
its input to its output
Sparse Autoencoders
• A sparse autoencoder adds a sparsity penalty Ω(h) on the code layer to the reconstruction error, training on L(x, g(f(x))) + Ω(h), where g(h) is the decoder output and typically we have h = f(x), the encoder output
• Sparse autoencoders are typically used to learn features for another task such as classification (a sketch with an L1 penalty follows this list)
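A hedged sketch of the sparse criterion L(x, g(f(x))) + Ω(h) with an L1 penalty on the code; the layer sizes, penalty weight λ, and random data are assumptions for illustration:

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU())   # code h = f(x)
decoder = nn.Linear(256, 784)                              # g(h)
lam = 1e-3                                                 # sparsity weight (assumption)

opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
x = torch.randn(128, 784)                                  # stand-in batch

for step in range(100):
    h = encoder(x)
    r = decoder(h)
    recon = nn.functional.mse_loss(r, x)       # L(x, g(f(x)))
    omega = lam * h.abs().sum(dim=1).mean()    # Omega(h) = lam * sum_i |h_i|, averaged over the batch
    loss = recon + omega
    opt.zero_grad(); loss.backward(); opt.step()
```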
Sparse Autoencoders
• Training with weight decay and other regularization penalties can be interpreted
as a MAP approximation to Bayesian inference, with the added regularizing
penalty corresponding to a prior probability distribution over the model
parameters
Sparse Autoencoders
• Rather than thinking of the sparsity penalty as a regularizer for the copying
task, we can think of the entire sparse autoencoder framework as approximating
maximum likelihood training of a generative model that has latent variables
• Suppose we have a model with visible variables x and latent variables h, with
an explicit joint distribution pmodel(x, h) = pmodel(h)pmodel(x|h)
• We refer to pmodel(h) as the model’s prior distribution over the latent variables,
representing the model’s beliefs prior to seeing x
Sparse Autoencoder
• The log pmodel(h) term can be sparsity-inducing. For example, the Laplace
prior
pmodel(hi) = (λ/2) exp(−λ|hi|)
corresponds to an absolute value sparsity penalty:
−log pmodel(h) = Σi (λ|hi| − log(λ/2)) = Ω(h) + const,   Ω(h) = λ Σi |hi|
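As a quick numeric check of the derivation above (with an arbitrarily chosen λ), the negative log-density of the factorial Laplace prior differs from λ Σi |hi| only by a constant that does not depend on h:

```python
import numpy as np

lam = 0.5                              # illustrative Laplace scale (assumption)
h = np.array([0.5, -1.0, 0.0, 3.0])    # example code vector

# Negative log-density of the factorial Laplace prior p(h_i) = (lam/2) * exp(-lam * |h_i|)
neg_log_p = np.sum(lam * np.abs(h) - np.log(lam / 2.0))

# L1 sparsity penalty Omega(h) = lam * sum_i |h_i|
omega = lam * np.sum(np.abs(h))

# The two differ only by a constant, here -len(h) * log(lam / 2)
const = -len(h) * np.log(lam / 2.0)
assert np.isclose(neg_log_p, omega + const)
print(neg_log_p, omega, const)
```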
Denoising Autoencoders
• Rather than adding a penalty Ω to the cost function, a denoising autoencoder (DAE) changes the reconstruction error term: it minimizes L(x, g(f(x̃))), where x̃ is a copy of x that has been corrupted by some form of noise, so the autoencoder must undo the corruption rather than simply copy its input
Regularizing by Penalizing Derivatives
• Another strategy penalizes derivatives of the code with Ω(h, x) = λ Σi ‖∇x hi‖²
• This forces the model to learn a function that does not change much when x changes slightly
Representational Power, Layer Size and Depth
• An autoencoder with a single hidden layer is able to represent the identity function along the domain of the data arbitrarily well
• A deep autoencoder, with at least one additional hidden layer inside the encoder itself, can approximate any mapping from input to code arbitrarily well, given enough hidden units
Stochastic Encoders and Decoders
• Any latent-variable model pmodel(h, x) defines a stochastic encoder and a stochastic decoder
pencode(h|x) = pmodel(h|x)
pdecode(x|h) = pmodel(x|h)
• The decoder outputs the parameters of a conditional distribution over x, and training minimizes −log pdecode(x|h), as illustrated in the sketch below
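A small sketch of a stochastic decoder under the assumption of binary inputs: the decoder outputs the parameters (logits) of a Bernoulli pdecode(x|h), and minimizing −log pdecode(x|h) reduces to binary cross-entropy; sizes and data are illustrative:

```python
import torch
import torch.nn as nn

# Decoder outputs the logits of a Bernoulli distribution p_decode(x | h);
# minimizing -log p_decode(x | h) is then binary cross-entropy on the reconstruction.
encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())   # h = f(x) (deterministic encoder here)
decoder = nn.Linear(64, 784)                              # produces logits of p_decode(x | h)

x = torch.rand(16, 784).round()                           # stand-in binary inputs
h = encoder(x)
logits = decoder(h)
# Average negative log-likelihood over the batch
neg_log_likelihood = nn.functional.binary_cross_entropy_with_logits(
    logits, x, reduction='sum') / x.shape[0]
```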
Denoising Autoencoders
• The DAE training procedure introduces a corruption process C(x̃|x), which gives the probability of observing a corrupted sample x̃ given the original data point x
• The autoencoder learns a reconstruction distribution from training pairs (x, x̃): sample x from the training data, sample x̃ from C(x̃|x), then estimate pdecode(x|h) with h the output of the encoder f(x̃)
• A minimal training sketch follows below
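A minimal DAE training sketch, assuming additive Gaussian noise as the corruption process C(x̃|x) and mean squared error as the reconstruction loss; sizes, noise level, and data are illustrative assumptions:

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU())
decoder = nn.Linear(128, 784)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.randn(128, 784)                      # stand-in clean batch
for step in range(100):
    x_tilde = x + 0.3 * torch.randn_like(x)    # corruption C(x_tilde | x): additive Gaussian noise
    r = decoder(encoder(x_tilde))              # g(f(x_tilde))
    loss = nn.functional.mse_loss(r, x)        # L(x, g(f(x_tilde))): reconstruct the *clean* input
    opt.zero_grad(); loss.backward(); opt.step()
```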
Historical Perspective
• Prior to the introduction of the modern DAE, Inayoshi and Kurita (2005)
explored some of the same goals with some of the same methods
Learning Manifolds with Autoencoders
• Like many other machine learning algorithms, autoencoders exploit the idea that data concentrates around a low-dimensional manifold or a small set of such manifolds
• An important characterization of a manifold is the set of its tangent planes: at a point x on a d-dimensional manifold, the tangent plane is given by d basis vectors that span the local directions of variation allowed on the manifold
• These local directions specify how one can change x infinitesimally while staying on the manifold
• Like all autoencoder training procedures, learning a manifold involves a compromise between two forces: learning a representation h of x that allows x to be approximately recovered through the decoder, and satisfying a constraint or regularization penalty
• The two forces together are useful because they force the hidden representation to capture information about the structure of the data-generating distribution
Contractive Autoencoders
• The contractive autoencoder (CAE) (Rifai et al., 2011) adds an explicit regularizer on the code h = f(x), encouraging the derivatives of f to be as small as possible:
Ω(h) = λ ‖∂f(x)/∂x‖²_F
• The penalty is the squared Frobenius norm of the encoder's Jacobian; a sketch of computing it follows below
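A sketch of computing the CAE penalty for a single example with PyTorch's torch.autograd.functional.jacobian; the tiny layer sizes and penalty weight are assumptions, and in practice the Jacobian is usually computed more cheaply with layer-specific, batched formulas:

```python
import torch
import torch.nn as nn
from torch.autograd.functional import jacobian

encoder = nn.Sequential(nn.Linear(20, 5), nn.Tanh())   # f(x), small sizes for illustration
lam = 0.1                                              # penalty weight (assumption)

def cae_penalty(x_single):
    # Omega(h) = lam * || d f(x) / d x ||_F^2 for one example x
    # create_graph=True so the penalty can be backpropagated into the encoder's parameters
    J = jacobian(encoder, x_single, create_graph=True)  # shape: (code_dim, input_dim)
    return lam * (J ** 2).sum()

x = torch.randn(20)
penalty = cae_penalty(x)   # would be added to the reconstruction loss during training
```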
Tangent Vectors
• The goal of the CAE is to learn the manifold structure of the data
Figure 4: Although both local PCA and the CAE can capture local tangents, the
CAE is able to form more accurate estimates from limited training data because
it exploits parameter sharing across different locations that share a subset of
active hidden units. The CAE tangent directions typically correspond to moving
or changing parts of the object (such as the head or legs)
Predictive Sparse Decomposition
• Predictive sparse decomposition (PSD) is a hybrid of sparse coding and parametric autoencoders (Kavukcuoglu et al., 2010): training minimizes ‖x − g(h)‖² + λ|h|₁ + γ‖h − f(x)‖² with respect to both the code h and the parameters of the encoder f and decoder g
• PSD models may be stacked and used to initialize a deep network to be trained with another criterion (a sketch of the alternating optimization follows below)
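A rough sketch of the alternating optimization, under the assumption of linear f and g, a handful of gradient steps for the code inference, and synthetic data; real PSD implementations use more careful optimizers:

```python
import torch
import torch.nn as nn

# Sizes, learning rates, step counts, and data are illustrative assumptions.
f = nn.Linear(100, 25)       # parametric encoder f(x): predicts the code
g = nn.Linear(25, 100)       # decoder g(h)
lam, gamma = 0.1, 1.0
param_opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)

def psd_objective(x, h):
    # ||x - g(h)||^2 + lam * |h|_1 + gamma * ||h - f(x)||^2
    return ((x - g(h)) ** 2).sum() + lam * h.abs().sum() + gamma * ((h - f(x)) ** 2).sum()

x = torch.randn(32, 100)                        # stand-in batch
h = f(x).detach().requires_grad_(True)          # initialize codes from the encoder's prediction

# Inference: minimize the objective with respect to the codes h, parameters held fixed
code_opt = torch.optim.SGD([h], lr=0.05)
for _ in range(20):
    code_opt.zero_grad()
    psd_objective(x, h).backward()
    code_opt.step()

# Learning: one gradient step on the parameters of f and g, codes held fixed
param_opt.zero_grad()                           # clears gradients accumulated during inference
psd_objective(x, h.detach()).backward()
param_opt.step()
```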
Applications of Autoencoders
• Hinton and Salakhutdinov (2006) trained a stack of RBMs and then used their weights to initialize a deep autoencoder with gradually decreasing layer sizes, culminating in a 30-dimensional code
• The resulting code yielded less reconstruction error than PCA into 30 dimensions
• One task that benefits even more than usual from dimensionality reduction is information retrieval, the task of finding entries in a database that resemble a query entry; a sketch of retrieval in code space follows below
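A hedged sketch of retrieval in code space: encode every database entry once with the trained encoder (an untrained stand-in encoder is used here purely for illustration), then return the entries whose codes are nearest to the query's code:

```python
import torch
import torch.nn as nn

# Stand-in encoder (assumption); in practice this would be the encoder of a trained deep autoencoder
encoder = nn.Sequential(nn.Linear(784, 30), nn.ReLU())

database = torch.randn(10000, 784)             # stand-in database entries
with torch.no_grad():
    codes = encoder(database)                  # precompute a 30-D code for every entry
    q = encoder(torch.randn(1, 784))           # code for the query entry

# Retrieve the database entries whose codes lie closest to the query code
dists = torch.cdist(q, codes).squeeze(0)       # Euclidean distances in code space
top10 = torch.topk(dists, k=10, largest=False).indices
```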
References
[1] G. E. Hinton and J. L. McClelland, “Learning representations by recirculation,” in Neural information processing
systems. New York: American Institute of Physics, 1988, pp. 358–366.
[2] G. Alain and Y. Bengio, “What regularized auto-encoders learn from the data-generating distribution.” Journal
of Machine Learning Research, vol. 15, no. 1, pp. 3563–3593, 2014.
[3] Y. Bengio, L. Yao, G. Alain, and P. Vincent, “Generalized denoising auto-encoders as generative models,” in
Advances in Neural Information Processing Systems, 2013, pp. 899–907.
[4] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with
denoising autoencoders,” in Proceedings of the 25th international conference on Machine learning. ACM,
2008, pp. 1096–1103.
[5] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning
useful representations in a deep network with a local denoising criterion,” Journal of Machine Learning Research,
vol. 11, no. Dec, pp. 3371–3408, 2010.
[6] H. Inayoshi and T. Kurita, “Improved generalization by adding both auto-association and hidden-layer-noise to
neural-network-based-classifiers,” in 2005 IEEE Workshop on Machine Learning for Signal Processing. IEEE,
2005, pp. 141–146.
[7] S. Rifai, G. Mesnil, P. Vincent, X. Muller, Y. Bengio, Y. Dauphin, and X. Glorot, “Higher order contractive
auto-encoder,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases.
Springer, 2011a, pp. 645–660.
[8] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio, “Contractive auto-encoders: Explicit invariance
during feature extraction,” in Proceedings of the 28th international conference on machine learning (ICML-11),
2011b, pp. 833–840.
[9] K. Kavukcuoglu, M. Ranzato, and Y. LeCun, “Fast inference in sparse coding algorithms with applications to
object recognition,” arXiv preprint arXiv:1010.3467, 2010.
[10] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science,
vol. 313, no. 5786, pp. 504–507, 2006.