Unit 5: Autoencoders

Autoencoders are neural networks designed for unsupervised learning that aim to copy their input to their output while learning a low-dimensional embedding of the data. They can be categorized into various types, including undercomplete, regularized, and denoising autoencoders, each with unique properties and applications. The document discusses the architecture, training methods, and challenges associated with autoencoders, as well as their connections to generative models and practical uses in feature learning and data compression.

Autoencoders

1
Contents
• What is an autoencoder?
1. Undercomplete Autoencoders
2. Regularized Autoencoders
3. Representational Power, Layer Size and Depth
4. Stochastic Encoders and Decoders
5. Denoising Autoencoders
6. Learning Manifolds and Autoencoders
7. Contractive Autoencoders
8. Predictive Sparse Decomposition
9. Applications of Autoencoders

3
What is an Autoencoder?

• A neural network trained using unsupervised learning


• Trained to copy its input to its output
• Learns an embedding

4
Embedding is a point on a manifold
• An embedding is a low-dimensional vector
  • With fewer dimensions than the ambient space of which the manifold is a low-dimensional subset
• Embedding algorithm
  • Maps any point in ambient space x to its embedding h
  • Embeddings of related inputs form a manifold

5
A manifold in ambient space
• Embedding: map x to a lower-dimensional h
[Figure: a 1-D manifold in 2-D space, derived from the 28x28 = 784-dimensional space. Example: Age Progression/Regression by Conditional Adversarial Autoencoder (CAAE). GitHub: https://github.com/ZZUTK/Face-Aging-CAAE]
6
General structure of an autoencoder
• Maps an input x to an output r (called the reconstruction) through an internal representation code h
• It has a hidden layer h that describes a code used to represent the input
• The network has two parts
  • The encoder function h = f(x)
  • A decoder that produces a reconstruction r = g(h)

7
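To make the two-part structure concrete, here is a minimal sketch in PyTorch (not from the slides; the layer sizes and activations are illustrative assumptions): the encoder f produces the code h and the decoder g produces the reconstruction r = g(f(x)).

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal autoencoder: r = g(f(x)) with code h = f(x)."""
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        # Encoder f: maps input x to code h
        self.encoder = nn.Sequential(nn.Linear(input_dim, code_dim), nn.ReLU())
        # Decoder g: maps code h back to a reconstruction r
        self.decoder = nn.Sequential(nn.Linear(code_dim, input_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)      # h = f(x)
        r = self.decoder(h)      # r = g(h)
        return r, h

# Example: reconstruct a batch of flattened 28x28 "images" (random stand-in data)
model = Autoencoder()
x = torch.rand(16, 784)
r, h = model(x)
print(h.shape, r.shape)          # torch.Size([16, 32]) torch.Size([16, 784])
```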
Autoencoders differ from General Data Compression
• Autoencoders are data-specific
  • i.e., only able to compress data similar to what they have been trained on
  • This is different from, say, an MP3 or JPEG compression algorithm, which makes general assumptions about "sound/images", but not about specific types of sounds/images
  • An autoencoder trained on pictures of cats would do poorly at compressing pictures of trees, because the features it learns would be cat-specific
• Autoencoders are lossy
  • The decompressed outputs will be degraded compared to the original inputs (similar to MP3 or JPEG compression)
  • This differs from lossless arithmetic compression
• Autoencoders are learned from data rather than hand-designed
8

9
What does an Autoencoder Learn?
• Learning g(f(x)) = x everywhere is not useful
• Autoencoders are designed to be unable to copy
perfectly
• Restricted to copy only approximately
• Autoencoders learn useful properties of the data
• Being forced to prioritize which aspects of input should be
copied
• Can learn stochastic mappings
• Go beyond deterministic functions to mappings pencoder(h|x) and
pdecoder(x|h)

10
Autoencoder History

• Part of neural network landscape for decades


• Used for dimensionality reduction and feature learning
• Theoretical connections to latent variable models have brought them to the forefront of generative models
  • e.g., Variational Autoencoders

11
An autoencoder architecture
[Figure: encoder f maps input x to code h; decoder g maps h to the reconstruction. Weights W are learned using: (1) training samples, and (2) a loss function, as discussed next.]
12
Two Autoencoder Training Methods

1. An autoencoder is a feed-forward, non-recurrent neural net
   • With an input layer, an output layer, and one or more hidden layers
   • It can be trained using the same techniques as other feedforward networks:
     • Compute gradients using back-propagation
     • Followed by minibatch gradient descent
2. Unlike feedforward networks, autoencoders can also be trained using recirculation
• Compare activations on the input to activations of the
reconstructed input
• More biologically plausible than back-prop but rarely used in ML

13
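A hedged sketch of method 1, back-propagation followed by minibatch gradient descent, using PyTorch and a random stand-in dataset (the architecture and hyperparameters are illustrative assumptions, not the slides' own); recirculation is omitted since it is rarely used in ML.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: replace with real flattened images in practice
data = torch.rand(1024, 784)
loader = DataLoader(TensorDataset(data), batch_size=64, shuffle=True)

# A small undercomplete autoencoder (assumed sizes)
model = nn.Sequential(nn.Linear(784, 32), nn.ReLU(),
                      nn.Linear(32, 784), nn.Sigmoid())
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

for epoch in range(5):
    for (x,) in loader:
        r = model(x)
        loss = loss_fn(r, x)     # reconstruction error L(x, g(f(x)))
        opt.zero_grad()
        loss.backward()          # gradients via back-propagation
        opt.step()               # minibatch gradient descent update
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```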
1. Undercomplete Autoencoder

• Copying input to output sounds useless


• But we have no interest in decoder output
• We hope h takes on useful properties
• Undercomplete autoencoder
• Constrain h to have lower dimension than x
• Force it to capture most salient features of training
data

14
Autoencoder with linear decoder + MSE is PCA
• Learning process is that of minimizing a loss function L(x, g(f(x)))
  • where L is a loss function penalizing g(f(x)) for being dissimilar from x
  • such as the squared L2 norm of the difference: mean squared error
• When the decoder g is linear and L is the mean squared error,
an undercomplete autoencoder learns to span the same
subspace as PCA
• In this case the autoencoder trained to perform the copying task
has learned the principal subspace of the training data as a side-
effect
• Autoencoders with nonlinear f and g can learn more
powerful nonlinear generalizations of PCA
• But high capacity is not desirable as seen next

15
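The claim can be checked numerically. The sketch below (my own illustration on an assumed synthetic dataset) trains a linear autoencoder with MSE and compares the subspace spanned by its decoder weights against the top principal components; cosines of the principal angles near 1 indicate the two subspaces coincide, even though the individual weight vectors need not equal the principal directions.

```python
import torch

torch.manual_seed(0)
# Synthetic data with a dominant 2-D principal subspace inside R^10
X = torch.randn(500, 2) @ torch.randn(2, 10) + 0.05 * torch.randn(500, 10)
X = X - X.mean(0)

# Linear autoencoder: 2-D code, no nonlinearity, MSE loss
enc = torch.nn.Linear(10, 2, bias=False)
dec = torch.nn.Linear(2, 10, bias=False)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-2)
for _ in range(2000):
    loss = ((dec(enc(X)) - X) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Principal subspace from PCA (top-2 right singular vectors of X)
U, S, Vt = torch.linalg.svd(X, full_matrices=False)
pca_basis = Vt[:2]                                    # shape (2, 10)

# Orthonormal basis of the decoder's column space, shape (10, 2)
dec_basis = torch.linalg.qr(dec.weight.detach()).Q

# Cosines of principal angles between the two subspaces
overlap = torch.linalg.svdvals(pca_basis @ dec_basis)
print(overlap)   # both values should be close to 1 => same subspace as PCA
```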
Autoencoder training using a loss function
• Encoder f and decoder g:
  f : X → h
  g : h → X
  arg min over f, g of ||X − g(f(X))||²
[Figure: an autoencoder with 3 fully connected hidden layers; encoder f, code h, decoder g]
• One hidden layer
• Non-linear encoder
  • Takes input x ∈ R^d
  • Maps it into output h ∈ R^p
    h = σ1(Wx + b)
    x' = σ2(W'h + b')
  • σ is an element-wise activation function such as a sigmoid or ReLU
16
Trained to minimize reconstruction error (such as the sum of squared errors)

17
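The equations above can be written out directly. The following NumPy sketch (weights random and untrained, purely for illustration) computes h = σ1(Wx + b), the reconstruction x' = σ2(W'h + b'), and the sum-of-squared-errors reconstruction loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d, p = 784, 64                     # input dimension d, code dimension p (p < d)
rng = np.random.default_rng(0)
W,  b  = rng.normal(0, 0.01, (p, d)), np.zeros(p)    # encoder parameters
W2, b2 = rng.normal(0, 0.01, (d, p)), np.zeros(d)    # decoder parameters

x = rng.random(d)                  # a single input in R^d
h = sigmoid(W @ x + b)             # h  = sigma_1(W x + b)
x_rec = sigmoid(W2 @ h + b2)       # x' = sigma_2(W' h + b')

loss = np.sum((x - x_rec) ** 2)    # reconstruction error: sum of squared errors
print(h.shape, x_rec.shape, loss)
```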

• If encoder f and decoder g are allowed too much capacity
  • the autoencoder can learn to perform the copying task without learning any useful information about the distribution of the data
• An autoencoder with a one-dimensional code and a very powerful nonlinear encoder can learn to map each training example x^(i) to the code i
  • The decoder can learn to map these integer indices back to the values of specific training examples
• Autoencoder trained for copying task fails to learn
anything useful if f/g capacity is too great

18
Cases when Autoencoder Learning Fails

• Cases where autoencoders fail to learn anything useful:
1. Capacity of encoder/decoder f/g is too high
• Capacity controlled by depth
2. Hidden code h has dimension equal to input x
3. Overcomplete case: where hidden code h has
dimension greater than input x
• Even a linear encoder/decoder can learn to copy input
to output without learning anything useful about data
distribution

19
Right Autoencoder Design: Use regularization

• Ideally, choose the code size (dimension of h) to be small and the capacity of encoder f and decoder g based on the complexity of the distribution being modeled
• Regularized autoencoders provide the ability to do so
• Rather than limiting model capacity by keeping
encoder/decoder shallow and code size small
• They use a loss function that encourages the model to have properties other than copying its input to its output

20
2. Regularized Autoencoder Properties
• Regularized AEs have properties beyond copying
input to output:
• Sparsity of representation
• Smallness of the derivative of the representation
• Robustness to noise
• Robustness to missing inputs
• Regularized autoencoder can be nonlinear and
overcomplete
• But still learn something useful about the data distribution
even if model capacity is great enough to learn trivial identity
function

21
Generative Models Viewed as Autoencoders

• Beyond regularized autoencoders


• Generative models with latent variables and an
inference procedure (for computing latent
representations given input) can be viewed as a
particular form of autoencoder
• Generative modeling approaches which emphasize
connection with autoencoders are descendants of
Helmholtz machine:
1. Variational autoencoder
2. Generative stochastic networks

22
[Figure: latent variables treated as distributions. Source: https://www.jeremyjordan.me/variational-autoencoders/]
23
Variational Autoencoder
• VAE is a generative model
• able to generate samples that look like samples from training
data
• With MNIST, these fake samples would be synthetic images
of digits

• Due to the random variable between input and output, it cannot be trained directly using backprop
  • Instead, backprop proceeds through the parameters of the latent distribution
  • This is called the reparameterization trick: a sample from N(μ, Σ) is written as μ + Σ^(1/2) ε with ε ~ N(0, I), where Σ is diagonal
24
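A minimal sketch of the reparameterization trick, assuming a diagonal covariance and the common convention (not stated on the slide) that the encoder outputs μ and log σ²: the sample is a deterministic function of μ and σ plus independent noise, so gradients can flow back through μ and σ.

```python
import torch

def reparameterize(mu, logvar):
    """Sample z ~ N(mu, diag(sigma^2)) as z = mu + sigma * eps, eps ~ N(0, I)."""
    sigma = torch.exp(0.5 * logvar)       # sigma = sqrt of the diagonal of Sigma
    eps = torch.randn_like(sigma)         # eps ~ N(0, I); no gradient flows into eps
    return mu + sigma * eps               # differentiable w.r.t. mu and sigma

# Example: a batch of 8 latent codes of dimension 4
mu = torch.zeros(8, 4, requires_grad=True)
logvar = torch.zeros(8, 4, requires_grad=True)
z = reparameterize(mu, logvar)
z.sum().backward()                        # gradients reach mu and logvar
print(mu.grad.shape, logvar.grad.shape)
```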

25
Sparse Autoencoder
Only a few nodes are encouraged to activate when a single sample is fed into the network.
Fewer nodes activating while still keeping its performance would guarantee that the autoencoder is actually learning latent representations instead of redundant information in our input data.
26

27
Sparse Autoencoder Loss Function

• A sparse autoencoder is an autoencoder whose training criterion includes a sparsity penalty Ω(h) on the code layer h, in addition to the reconstruction error:
  L(x, g(f(x))) + Ω(h)
• where g (h) is the decoder output and typically we have h = f (x)
• Sparse autoencoders are typically used to learn features for another task, such as classification
• An autoencoder that has been trained to be sparse
must respond to unique statistical features of the
dataset rather than simply perform the copying task
• Thus sparsity penalty can yield a model that has learned
useful features as a byproduct
28
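As an illustration, one common choice of Ω (an assumption here; the slide does not fix its form) is an L1 penalty on the code, Ω(h) = λ Σi |hi|. A hedged PyTorch sketch of the resulting training objective:

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU())
decoder = nn.Sequential(nn.Linear(128, 784), nn.Sigmoid())
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
lam = 1e-3                                       # sparsity weight (illustrative value)

x = torch.rand(64, 784)                          # stand-in minibatch
for _ in range(100):
    h = encoder(x)
    r = decoder(h)
    recon = ((r - x) ** 2).mean()                # L(x, g(f(x)))
    sparsity = lam * h.abs().sum(dim=1).mean()   # Omega(h) = lambda * sum_i |h_i|
    loss = recon + sparsity
    opt.zero_grad(); loss.backward(); opt.step()

print(f"fraction of near-zero code units: {(h.abs() < 1e-3).float().mean():.2f}")
```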
Sparse Encoders don't have a Bayesian Interpretation
• Penalty term Ω(h) is a regularizer term added to a
feedforward network whose
• Primary task: copy input to output (with Unsupervised learning
objective)
• Also perform some supervised task (with Supervised learning
objective) that depends on the sparse features
• In supervised learning regularization term
corresponds to prior probabilities over model
parameters
• Regularized MLE corresponds to maximizing p(θ|x), which is
equivalent to maximizing log p(x|θ)+log p(θ)
• First term is data log-likelihood and second term is log-prior over
parameters
29
• Regularizer depends on data and thus is not a prior
• Instead, regularization terms express a preference over functions

30
Generative Model view of Sparse Autoencoder

• Rather than thinking of the sparsity penalty as a regularizer for the copying task, think of a sparse autoencoder as approximating maximum likelihood training of a generative model that has latent variables
• Suppose model has visible/latent variables x and h
• Explicit joint distribution is pmodel(x,h) = pmodel(h) pmodel(x|h)
• where pmodel(h) is model’s prior distribution over latent variables
• Different from p(θ) being distribution of parameters
• The log-likelihood can be decomposed as log pmodel(x) = log Σh pmodel(h, x)

• The autoencoder approximates the sum with a point estimate for just one highly likely value of h, the output of a parametric encoder
• With that chosen h, we are maximizing log pmodel(x, h) = log pmodel(h) + log pmodel(x|h)
Denoising Autoencoders (DAE)

• Rather than adding a penalty Ω to the cost function, we can obtain an autoencoder that learns something useful by changing the reconstruction error term of the cost function
• Traditional autoencoders minimize L(x, g ( f (x)))
• where L is a loss function penalizing g( f (x)) for being
dissimilar from x, such as L2 norm of difference: mean
squared error
• A DAE minimizes L(x, g(f(x̃)))
  • where x̃ is a copy of x that has been corrupted by some form of noise
• The autoencoder must undo this corruption rather than simply copy its input
• Denoising training forces f and g to implicitly learn the
structure of pdata(x)
• Another example of how useful properties can emerge as a by-product of minimizing reconstruction error
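A sketch of denoising training, assuming additive Gaussian corruption (the slide leaves the noise form open) and an illustrative architecture: the loss compares the reconstruction of the corrupted copy x̃ against the clean x.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(),
                      nn.Linear(128, 784), nn.Sigmoid())
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, 784)                       # clean stand-in minibatch
for _ in range(100):
    x_tilde = x + 0.3 * torch.randn_like(x)   # corrupted copy of x (Gaussian noise)
    r = model(x_tilde)                        # reconstruct from the corrupted input
    loss = ((r - x) ** 2).mean()              # L(x, g(f(x_tilde))): target is the CLEAN x
    opt.zero_grad(); loss.backward(); opt.step()
```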
Regularizing by Penalizing Derivatives
• Another strategy for regularizing an autoencoder
• Use penalty as in sparse autoencoders
L(x, g ( f (x))) + Ω(h,x)
• But with a different form of Ω:
  Ω(h, x) = λ Σi ||∇x hi||²
• Forces the model to learn a function that does not
change much when x changes slightly
• Called a Contractive Auto Encoder (CAE)
• This model has theoretical connections to
• Denoising autoencoders
• Manifold learning
• Probabilistic modeling
28
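For a one-layer sigmoid encoder (an assumed form chosen so the Jacobian has a cheap closed form), ∇x hi = hi(1 − hi) Wi, so the penalty Σi ||∇x hi||² can be computed without automatic differentiation. A sketch:

```python
import torch
import torch.nn as nn

enc = nn.Linear(784, 64)                   # one-layer encoder with sigmoid units (assumed form)
dec = nn.Linear(64, 784)
lam = 1e-4                                 # illustrative penalty weight

def contractive_loss(x):
    h = torch.sigmoid(enc(x))                          # h = sigma(Wx + b)
    r = dec(h)
    recon = ((r - x) ** 2).mean()
    # For sigmoid units, d h_i / d x = h_i (1 - h_i) * W_i (row i of the weight matrix),
    # so sum_i ||grad_x h_i||^2 = sum_i (h_i(1-h_i))^2 * ||W_i||^2 (averaged over the batch).
    dh = (h * (1 - h)) ** 2                            # shape (batch, 64)
    w_norms = (enc.weight ** 2).sum(dim=1)             # ||W_i||^2, shape (64,)
    penalty = (dh * w_norms).sum(dim=1).mean()
    return recon + lam * penalty

x = torch.rand(32, 784)
print(contractive_loss(x))
```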
3. Representational Power, Layer Size and Depth

• Autoencoders are often trained with a single-layer encoder and a single-layer decoder
• However, using a deep encoder offers many advantages
• Recall: although the universal approximation theorem states that a single hidden layer is sufficient, there are disadvantages:
  1. The number of units needed may be too large
  2. It may not generalize well
• Common strategy: greedily pretrain a stack of shallow autoencoders (sketched below)

29
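A sketch of the greedy pretraining strategy mentioned above, with assumed layer sizes and stand-in data: each shallow autoencoder is trained on the codes produced by the previous one, and the trained encoders are then stacked into a deep encoder.

```python
import torch
import torch.nn as nn

def train_shallow_ae(data, code_dim, steps=200):
    """Train one shallow autoencoder on `data`; return its encoder and the resulting codes."""
    in_dim = data.shape[1]
    enc = nn.Sequential(nn.Linear(in_dim, code_dim), nn.ReLU())
    dec = nn.Linear(code_dim, in_dim)
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(steps):
        loss = ((dec(enc(data)) - data) ** 2).mean()   # reconstruction error
        opt.zero_grad(); loss.backward(); opt.step()
    return enc, enc(data).detach()

x = torch.rand(256, 784)                  # stand-in data
enc1, h1 = train_shallow_ae(x, 256)       # first shallow autoencoder on the raw input
enc2, h2 = train_shallow_ae(h1, 64)       # second one trained on the first one's codes
deep_encoder = nn.Sequential(enc1, enc2)  # stacked (pretrained) deep encoder
print(deep_encoder(x).shape)              # torch.Size([256, 64])
```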
4. Stochastic Encoders and Decoders
• General strategy for designing the output units
and loss function of a feedforward network is to
• Define the output distribution p(y|x)
• Minimize the negative log-likelihood –log p(y|x)
• In this setting y is a vector of targets such as class labels
• In an autoencoder x is the target as well as the input
• Yet we can apply the same machinery as before, as we see next

30
Loss function for Stochastic Decoder

• Given a hidden code h, we may think of the decoder as providing a conditional distribution pdecoder(x|h)
• We train the autoencoder by minimizing –log pdecoder(x|h)
• The exact form of this loss function will change
depending on the form of pdecoder(x|h)
• As with feedforward networks we use linear output
units to parameterize the mean of the Gaussian
distribution if x is real
• In this case negative log-likelihood is the mean-squared error
• Binary values of x correspond to a Bernoulli distribution with parameters given by a sigmoid output unit
• Discrete x values correspond to a softmax distribution
• The output variables are treated as being conditionally independent given h
31
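The correspondence between pdecoder(x|h) and the loss can be made concrete. The sketch below (illustrative shapes and modules, not the slides' own code) shows the Gaussian case, whose negative log-likelihood reduces to mean squared error, and the Bernoulli case, whose negative log-likelihood is the cross-entropy with sigmoid-parameterized probabilities.

```python
import torch
import torch.nn.functional as F

h = torch.randn(16, 32)                         # a batch of codes
decoder = torch.nn.Linear(32, 784)              # linear output units

# Real-valued x, Gaussian p_decoder(x|h) with unit variance:
# -log p(x|h) reduces to mean squared error (up to an additive constant).
x_real = torch.randn(16, 784)
mu = decoder(h)                                 # mean of the Gaussian
nll_gaussian = F.mse_loss(mu, x_real)

# Binary x, Bernoulli p_decoder(x|h) with sigmoid-parameterized probabilities:
# -log p(x|h) is the binary cross-entropy.
x_bin = torch.randint(0, 2, (16, 784)).float()
logits = decoder(h)
nll_bernoulli = F.binary_cross_entropy_with_logits(logits, x_bin)

print(nll_gaussian.item(), nll_bernoulli.item())
```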
Stochastic encoder

• We can also generalize the notion of an encoding function f(x) to an encoding distribution pencoder(h|x)

32
Structure of stochastic autoencoder

• The encoder and decoder are not simple functions; each involves a distribution
• The output is sampled from a distribution: pencoder(h|x) for the encoder and pdecoder(x|h) for the decoder

33
Relationship to joint distribution

• Any latent variable model pmodel(h, x) defines a stochastic encoder pencoder(h|x) = pmodel(h|x)
• And a stochastic decoder pdecoder(x|h) = pmodel(x|h)
• In general the encoder and decoder distributions
are not conditional distributions compatible with
a unique joint distribution pmodel(x,h)
• Training the autoencoder as a denoising autoencoder
will tend to make them compatible asymptotically
• With enough capacity and examples

34
Sampling pmodel(h|x)
[Figure: sampling via the stochastic encoder pencoder(h|x) and stochastic decoder pdecoder(x|h)]
35
Ex: Sampling p(x|h): Deepstyle
• Boils images down to a representation that relates to style
  • By iterating a neural network over a set of images, it learns efficient representations
• Choosing a random numerical description in the encoded space will generate new images of styles not seen
• Using one input image and changing values along different dimensions of the feature space, you can see how the generated image changes (patterning, color, texture) in style space
36
