
Deep Learning Tutorial: Autoencoders

Jen-Tzung Chien

Department of Electrical and Computer Engineering


Department of Computer Science
National Chiao Tung University, Hsinchu

September 26, 2016


Outline

• Introduction
• Undercomplete Autoencoders
• Representational Power, Layer Size and Depth
• Stochastic Encoders and Decoders
• Denoising Autoencoders
• Learning Manifolds with Autoencoders
• Predictive Sparse Decomposition
• Applications of Autoencoders


Introduction

• An autoencoder is a neural network that is trained to attempt to copy its input to its output

• Internally, it has a hidden layer h that describes a code used to represent the input

• The network may be viewed as consisting of two parts: an encoder function h = f(x) and a decoder that produces a reconstruction r = g(h) (see the code sketch below)

• If an autoencoder succeeds in simply learning to set g(f(x)) = x everywhere, then it is not especially useful

• Instead, autoencoders are designed to be unable to learn to copy perfectly
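
A minimal sketch of this encoder/decoder structure, assuming PyTorch; the layer sizes (784 inputs, 32 code units) and the dummy batch are illustrative assumptions, not taken from the slides:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_in=784, n_code=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, n_code), nn.ReLU())  # h = f(x)
        self.decoder = nn.Linear(n_code, n_in)                            # r = g(h)

    def forward(self, x):
        h = self.encoder(x)   # code
        r = self.decoder(h)   # reconstruction
        return r, h

model = Autoencoder()
x = torch.randn(8, 784)                # dummy batch standing in for real data
r, h = model(x)
loss = nn.functional.mse_loss(r, x)    # L(x, g(f(x)))
loss.backward()
```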


Autoencoder Graphical model


Introduction

• Modern autoencoders have generalized the idea of an encoder and a decoder beyond deterministic functions to stochastic mappings p_encoder(h|x) and p_decoder(x|h)

• Traditionally, autoencoders were used for dimensionality reduction or feature learning

• Recently, theoretical connections between autoencoders and latent variable models have brought autoencoders to the forefront of generative modeling

• Unlike general feedforward networks, autoencoders may also be trained using recirculation (Hinton and McClelland, 1988), a learning algorithm based on comparing the activations of the network on the original input to the activations on the reconstructed input


Undercomplete Autoencoders

• We hope that training the autoencoder to perform the input copying task will
result in h taking on useful properties

• One way to obtain useful features from the autoencoder is to constrain h to have smaller dimension than x

• An autoencoder whose code dimension is less than the input dimension is called undercomplete

• Learning an undercomplete representation forces the autoencoder to capture the most salient features of the training data


Learning Process
• The learning process is described simply as minimizing a loss function

L(x, g(f(x)))

• where L is a loss function penalizing g(f(x)) for being dissimilar from x, such as the mean squared error

• When the decoder is linear and L is the mean squared error, an undercomplete autoencoder learns to span the same subspace as PCA (see the numerical sketch below)

• In this case, an autoencoder trained to perform the copying task has learned the principal subspace of the training data as a side-effect

• If the encoder and decoder are allowed too much capacity, the autoencoder can learn to perform the copying task without extracting useful information about the distribution of the data
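
A small numerical check of the PCA connection, assuming PyTorch; the synthetic low-rank data, training schedule, and subspace comparison via principal angles are illustrative assumptions:

```python
import torch

torch.manual_seed(0)
n, d, k = 2000, 10, 3
X = torch.randn(n, k) @ torch.randn(k, d) + 0.05 * torch.randn(n, d)
X = X - X.mean(0)                                # center the data

W_enc = torch.randn(d, k, requires_grad=True)    # linear encoder: f(x) = x W_enc
W_dec = torch.randn(k, d, requires_grad=True)    # linear decoder: g(h) = h W_dec
opt = torch.optim.Adam([W_enc, W_dec], lr=1e-2)
for _ in range(2000):
    opt.zero_grad()
    loss = ((X @ W_enc @ W_dec - X) ** 2).mean() # mean squared reconstruction error
    loss.backward()
    opt.step()

# Compare the learned decoder subspace with the top-k principal directions
U_pca = torch.linalg.svd(X, full_matrices=False).Vh[:k].T   # d x k principal directions
U_ae = torch.linalg.qr(W_dec.detach().T).Q                  # d x k orthonormal basis of the AE subspace
# Cosines of the principal angles between the two subspaces; values near 1
# indicate that the linear autoencoder spans (approximately) the PCA subspace
cosines = torch.linalg.svdvals(U_pca.T @ U_ae)
print(cosines)
```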


Regularized Autoencoders
• A similar problem occurs if the hidden code is allowed to have dimension equal to the input, and in the overcomplete case in which the hidden code has dimension greater than the input

• In these cases, even a linear encoder and linear decoder can learn to copy
the input to the output without learning anything useful about the data
distribution

• Rather than limiting the model capacity by keeping the encoder and decoder
shallow and the code size small, regularized autoencoders use a loss function
that encourages the model to have other properties besides the ability to copy
its input to its output

• A regularized autoencoder can be nonlinear and overcomplete but still learn something useful about the data distribution even if the model capacity is great enough to learn a trivial identity function


Sparse Autoencoders

• A sparse autoencoder is simply an autoencoder whose training criterion involves a sparsity penalty Ω(h) on the code layer h, in addition to the reconstruction error (sketched in code below):

L(x, g(f(x))) + Ω(h)

• where g(h) is the decoder output and typically we have h = f(x), the encoder output

• Sparse autoencoders are typically used to learn features for another task such as classification

• An autoencoder that has been regularized to be sparse must respond to unique statistical features of the dataset it has been trained on, rather than simply acting as an identity function
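
A sketch of this training criterion with an L1 (absolute value) penalty on the code, assuming PyTorch; the layer sizes and the value of λ are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Sequential(nn.Linear(784, 64), nn.ReLU())    # f
dec = nn.Linear(64, 784)                               # g
x = torch.randn(8, 784)                                # dummy batch
lam = 1e-3                                             # sparsity weight (hyperparameter)

h = enc(x)                                             # code h = f(x)
r = dec(h)                                             # reconstruction g(h)
loss = F.mse_loss(r, x) + lam * h.abs().sum(1).mean()  # L(x, g(f(x))) + Ω(h), Ω(h) = λ Σ_i |h_i|
loss.backward()
```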


Sparse Autoencoders

• We can think of the penalty Ω(h) simply as a regularizer term added to a feedforward network whose primary task is to copy the input to the output (unsupervised learning objective) and possibly also perform some supervised task (with a supervised learning objective) that depends on these sparse features

• Training with weight decay and other regularization penalties can be interpreted as a MAP approximation to Bayesian inference, with the added regularizing penalty corresponding to a prior probability distribution over the model parameters

• Regularized autoencoders defy such an interpretation because the regularizer depends on the data and is therefore by definition not a prior in the formal sense of the word


Sparse Autoencoders

• Rather than thinking of the sparsity penalty as a regularizer for the copying
task, we can think of the entire sparse autoencoder framework as approximating
maximum likelihood training of a generative model that has latent variables

• Suppose we have a model with visible variables x and latent variables h, with an explicit joint distribution p_model(x, h) = p_model(h) p_model(x|h)

• We refer to p_model(h) as the model's prior distribution over the latent variables, representing the model's beliefs prior to seeing x

log p_model(x, h) = log p_model(h) + log p_model(x|h)


Sparse Autoencoder
• The log p_model(h) term can be sparsity-inducing. For example, the Laplace prior

p_model(h_i) = (λ/2) exp(−λ|h_i|)

corresponds to an absolute value sparsity penalty

−log p_model(h) = Σ_i ( λ|h_i| − log(λ/2) ) = Ω(h) + const

Ω(h) = λ Σ_i |h_i|

where the constant term depends only on λ and not h

• We typically treat λ as a hyperparameter and discard the constant term since it does not affect the parameter learning


Denoising Autoencoders

• A denoising autoencoder or DAE instead minimizes

L(x, g(f(x̃)))

where x̃ is a copy of x that has been corrupted by some form of noise

• Denoising training forces f and g to implicitly learn the structure of p_data(x), as shown by Alain and Bengio (2014) and Bengio et al. (2013)


Regularizing by Penalizing Derivatives

• Another strategy for regularizing an autoencoder is to use a penalty Ω as in sparse autoencoders

L(x, g(f(x))) + Ω(x, h)

but with a different form of Ω:

Ω(x, h) = λ Σ_i ||∇_x h_i||²

• This forces the model to learn a function that does not change much when x changes slightly (see the autograd sketch below)

• An autoencoder regularized in this way is called a contractive autoencoder or CAE
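
A sketch of computing this penalty with automatic differentiation, assuming PyTorch; the tiny stand-in encoder and the value of λ are illustrative assumptions:

```python
import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(20, 5), nn.Sigmoid())   # stand-in encoder h = f(x)
x = torch.randn(20)
lam = 0.1                                            # penalty weight (assumed)

# Row i of the Jacobian is the gradient of h_i with respect to x
J = torch.autograd.functional.jacobian(f, x)         # shape (5, 20)
omega = lam * (J ** 2).sum()                         # λ Σ_i ||∇_x h_i||²
# To train with this penalty, pass create_graph=True to jacobian()
# so that gradients can flow through omega into the encoder parameters.
```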


Representational Power, Layer Size and Depth

• An autoencoder with a single hidden layer is able to represent the identity function along the domain of the data arbitrarily well

• A deep autoencoder, with at least one additional hidden layer inside the encoder itself, can approximate any mapping from input to code arbitrarily well, given enough hidden units

• A common strategy for training a deep autoencoder is to greedily pretrain the deep architecture by training a stack of shallow autoencoders (a sketch of this stacking follows below)
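
A sketch of the greedy stacking strategy, assuming PyTorch; the two-layer stack, layer sizes, optimizer, step counts, and dummy data are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_shallow_ae(data, n_in, n_code, steps=500, lr=1e-3):
    """Train a one-hidden-layer autoencoder on `data` and return its encoder."""
    enc = nn.Sequential(nn.Linear(n_in, n_code), nn.ReLU())
    dec = nn.Linear(n_code, n_in)
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(dec(enc(data)), data)   # reconstruction error
        loss.backward()
        opt.step()
    return enc

X = torch.randn(1024, 784)               # stand-in for real training data
enc1 = train_shallow_ae(X, 784, 256)     # first shallow autoencoder on the raw input
H1 = enc1(X).detach()                    # codes produced by the first encoder
enc2 = train_shallow_ae(H1, 256, 64)     # second shallow autoencoder on those codes

deep_encoder = nn.Sequential(enc1, enc2) # stacked encoders form a deep encoder,
                                         # which can then be fine-tuned end to end
```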


Stochastic Encoders and Decoders


Stochastic Encoders and Decoders

• Any latent variable model p_model(h, x) defines a stochastic encoder

p_encoder(h|x) = p_model(h|x)

• and a stochastic decoder

p_decoder(x|h) = p_model(x|h)
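
A sketch of one common choice of stochastic decoder, assuming PyTorch and binary-valued inputs: a factorized Bernoulli p_decoder(x|h) whose negative log-likelihood is the binary cross-entropy. The layer sizes and dummy data are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Sequential(nn.Linear(784, 64), nn.ReLU())
dec = nn.Linear(64, 784)                        # produces logits, not probabilities

x = torch.bernoulli(torch.full((8, 784), 0.3))  # dummy binary data
logits = dec(enc(x))
# Per-example average of -log p_decoder(x | h) for a factorized Bernoulli decoder:
nll = F.binary_cross_entropy_with_logits(logits, x, reduction='sum') / x.size(0)
```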


Denoising Autoencoders


Denoising Autoencoders

• The denoising autoencoder (DAE) is an autoencoder that receives a corrupted data point as input and is trained to predict the original, uncorrupted data point as its output

• We introduce a corruption process C(x̂|x), which represents a conditional distribution over corrupted samples x̂, given a data sample x


DAE Training Procedure

Algorithm 1 DAE training procedure

1: Sample a training example x from the training data
2: Sample a corrupted version x̂ from C(x̂|x)
3: Use (x̂, x) as a training example for estimating the autoencoder reconstruction distribution p_reconstruct(x|x̂) = p_decoder(x|h), with h the output of the encoder f(x̂) and p_decoder typically defined by a decoder g(h)

• We can therefore view the DAE as performing stochastic gradient descent on the following expectation

−E_{x ∼ p̂_data(x)} E_{x̂ ∼ C(x̂|x)} [ log p_decoder(x | h = f(x̂)) ]

where p̂_data(x) is the training distribution
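
A sketch of this training procedure, assuming PyTorch, a Gaussian corruption process C(x̂|x), and a Gaussian decoder so that −log p_decoder(x|h) reduces to a squared error; layer sizes, noise level, and dummy data are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Sequential(nn.Linear(784, 128), nn.ReLU())  # h = f(x_hat)
dec = nn.Linear(128, 784)                             # mean of p_decoder(x | h)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
sigma = 0.3                                           # corruption noise level (assumed)

X = torch.rand(1024, 784)                             # stand-in for the training set
for step in range(1000):
    x = X[torch.randint(0, X.size(0), (64,))]         # 1: sample training examples x
    x_hat = x + sigma * torch.randn_like(x)           # 2: sample a corrupted x_hat from C
    h = enc(x_hat)                                    # 3: encode the corrupted input
    loss = F.mse_loss(dec(h), x)                      #    and reconstruct the clean x
    opt.zero_grad()
    loss.backward()
    opt.step()
```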


Vector Field g(f(x)) − x


Historical Perspective

• The name "denoising autoencoder" refers to a model that is intended not merely to learn to denoise its input but to learn a good internal representation as a side effect of learning to denoise (Vincent et al., 2008; Vincent et al., 2010)

• Prior to the introduction of the modern DAE, Inayoshi and Kurita (2005) explored some of the same goals with some of the same methods

• Their approach minimizes reconstruction error in addition to a supervised objective while injecting noise in the hidden layer of a supervised MLP, with the aim of improving generalization by introducing the reconstruction error and the injected noise


Learning Manifolds with Autoencoders

• Like many other machine learning algorithms, autoencoders exploit the idea
that data concentrates around a low-dimensional manifold or a small set of
such manifolds

• At a point x on a d-dimensional manifold, the tangent plane is given by d basis vectors that span the local directions of variation allowed on the manifold

• These local directions specify how one can change x infinitesimally while
staying on the manifold


Tangent Planes


Learning Manifolds with Autoencoders

• All autoencoder training procedures involve a compromise between two forces:

– Learning a representation h of a training example x such that x can be approximately recovered from h through a decoder
– Satisfying the constraint or regularization penalty

• The two forces together are useful because they force the hidden representation
to capture information about the structure of the data generating distribution


One-Dimensional Example

Figure 1: If the autoencoder learns a reconstruction function that is invariant to small perturbations near the data points, it captures the manifold structure of the data


Non-parametric Manifold Learning

Figure 2: Non-parametric manifold learning procedures build a nearest-neighbor graph in which nodes represent training examples and directed edges indicate nearest-neighbor relationships


Global Coordinate System

Figure 3: Each local patch can be thought of as a local Euclidean coordinate system or as a locally flat Gaussian, or "pancake", with a very small variance in the directions orthogonal to the pancake and a very large variance in the directions defining the coordinate system on the pancake


Contractive Autoencoders

• The contractive autoencoder (Rifai et al., 2011a; Rifai et al., 2011b) introduces an explicit regularizer on the code h = f(x), encouraging the derivatives of f to be as small as possible:

Ω(h) = λ || ∂f(x)/∂x ||²_F

• Denoising autoencoders make the reconstruction function resist small but finite-sized perturbations of the input, while contractive autoencoders make the feature extraction function resist infinitesimal perturbations of the input


Tangent Vectors

• The goal of the CAE is to learn the manifold structure of the data

• Directions x with large Jx, where J is the Jacobian matrix of the encoder at a point x, rapidly change h, so these are likely to be directions which approximate the tangent planes of the manifold

• The directions corresponding to the largest singular values of J are interpreted as the tangent directions that the contractive autoencoder has learned (see the sketch below)

• Ideally, these tangent directions should correspond to real variations in the data
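
A sketch of extracting such tangent directions, assuming PyTorch: compute the encoder Jacobian at a point and take the right-singular vectors with the largest singular values. The tiny stand-in encoder is an illustrative assumption:

```python
import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(20, 5), nn.Sigmoid())   # stand-in for a trained encoder
x = torch.randn(20)                                  # a point at which to probe the manifold

J = torch.autograd.functional.jacobian(f, x)         # J = ∂f(x)/∂x, shape (5, 20)
U, S, Vh = torch.linalg.svd(J, full_matrices=False)
tangents = Vh[:2]   # input-space directions along which the code changes fastest;
                    # these are the candidate tangent directions of the manifold
```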


Tangent Vectors of the Manifold

Figure 4: Although both local PCA and the CAE can capture local tangents, the
CAE is able to form more accurate estimates from limited training data because
it exploits parameter sharing across different locations that share a subset of
active hidden units. The CAE tangent directions typically correspond to moving
or changing parts of the object (such as the head or legs)


Predictive Sparse Decomposition

• Predictive sparse decomposition (PSD) is a model that is a hybrid of sparse coding and parametric autoencoders (Kavukcuoglu et al., 2010)

• Training proceeds by minimizing (a sketch follows below)

||x − g(h)||² + λ|h|₁ + ||h − f(x)||²

• Predictive sparse coding is an example of learned approximate inference

• PSD models may be stacked and used to initialize a deep network to be trained with another criterion
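
A sketch of minimizing this objective, assuming PyTorch; the alternation between a few inner gradient steps on the free code h and one outer step on the parameters of f and g, along with all sizes and learning rates, are illustrative assumptions (a more careful implementation would use a dedicated sparse-coding solver for h):

```python
import torch
import torch.nn as nn

f = nn.Linear(100, 25)           # parametric encoder (the predictor)
g = nn.Linear(25, 100)           # decoder / dictionary
opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)
lam = 0.1                        # sparsity weight (assumed)

def psd_objective(x, h):
    # ||x - g(h)||² + λ|h|₁ + ||h - f(x)||²
    return ((x - g(h)) ** 2).sum() + lam * h.abs().sum() + ((h - f(x)) ** 2).sum()

X = torch.randn(512, 100)                       # stand-in for training data
for step in range(200):
    x = X[torch.randint(0, X.size(0), (32,))]
    h = f(x).detach().requires_grad_()          # initialize the free code from the predictor
    h_opt = torch.optim.SGD([h], lr=1e-2)
    for _ in range(20):                          # inner loop: minimize over h only
        h_opt.zero_grad()
        psd_objective(x, h).backward()
        h_opt.step()
    opt.zero_grad()                              # outer step: minimize over f, g with h fixed
    psd_objective(x, h.detach()).backward()
    opt.step()
```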


Applications of Autoencoders

• Autoencoders have been successfully applied to dimensionality reduction and information retrieval tasks

• Hinton and Salakhutdinov (2006) trained a stack of RBMs and then used their weights to initialize a deep autoencoder

• The resulting 30-dimensional code yielded less reconstruction error than reducing the data to 30 dimensions with PCA

• One task that benefits even more than usual from dimensionality reduction is
information retrieval, the task of finding entries in a database that resemble a
query entry
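
A sketch of retrieval in code space, assuming PyTorch and an already-trained encoder; the stand-in encoder, database, and Euclidean ranking are illustrative assumptions:

```python
import torch
import torch.nn as nn

enc = nn.Sequential(nn.Linear(784, 30), nn.ReLU())  # stand-in for a trained encoder
database = torch.rand(10000, 784)                   # stand-in for the database entries
query = torch.rand(1, 784)

with torch.no_grad():
    db_codes = enc(database)                        # low-dimensional codes for all entries
    q_code = enc(query)

dists = torch.cdist(q_code, db_codes).squeeze(0)    # distances to the query in code space
topk = dists.topk(5, largest=False).indices         # indices of the 5 most similar entries
```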


References
[1] G. E. Hinton and J. L. McClelland, “Learning representations by recirculation,” in Neural information processing
systems. New York: American Institute of Physics, 1988, pp. 358–366.
[2] G. Alain and Y. Bengio, “What regularized auto-encoders learn from the data-generating distribution.” Journal
of Machine Learning Research, vol. 15, no. 1, pp. 3563–3593, 2014.
[3] Y. Bengio, L. Yao, G. Alain, and P. Vincent, “Generalized denoising auto-encoders as generative models,” in
Advances in Neural Information Processing Systems, 2013, pp. 899–907.
[4] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with
denoising autoencoders,” in Proceedings of the 25th international conference on Machine learning. ACM,
2008, pp. 1096–1103.
[5] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning
useful representations in a deep network with a local denoising criterion,” Journal of Machine Learning Research,
vol. 11, no. Dec, pp. 3371–3408, 2010.
[6] H. Inayoshi and T. Kurita, “Improved generalization by adding both auto-association and hidden-layer-noise to
neural-network-based-classifiers,” in 2005 IEEE Workshop on Machine Learning for Signal Processing. IEEE,
2005, pp. 141–146.
[7] S. Rifai, G. Mesnil, P. Vincent, X. Muller, Y. Bengio, Y. Dauphin, and X. Glorot, “Higher order contractive
auto-encoder,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases.
Springer, 2011a, pp. 645–660.
[8] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio, “Contractive auto-encoders: Explicit invariance
during feature extraction,” in Proceedings of the 28th international conference on machine learning (ICML-11),
2011b, pp. 833–840.
[9] K. Kavukcuoglu, M. Ranzato, and Y. LeCun, “Fast inference in sparse coding algorithms with applications to
object recognition,” arXiv preprint arXiv:1010.3467, 2010.
[10] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science,
vol. 313, no. 5786, pp. 504–507, 2006.
