Variational Autoencoder
Supervised Learning
[Figure: labeled image examples — PLANE, CAR]
Y LeCun
Obstacles to Progress in AI
Predicting any part of the past, present or future percepts from whatever
information is available.
The number of samples required to train a large learning machine (for any
task) depends on the amount of information that we ask it to predict.
The more you ask of the machine, the larger it can be.
“The brain has about 10^14 synapses and we only live for about 10^9
seconds. So we have a lot more parameters than data. This motivates the
idea that we must do a lot of unsupervised learning since the perceptual
input (including proprioception) is the only place we can get 10^5
dimensions of constraint per second.”
Geoffrey Hinton (in his 2014 AMA on Reddit)
(but he has been saying that since the late 1970s)
(Yes, I know, this picture is slightly offensive to RL folks. But I’ll make it up)
The Architecture of an Intelligent System
AI System: Learning Agent + Immutable Objective
[Diagram: the Agent, with an internal State, acts on the Objective, which returns a Cost.]
[Diagram: inside the Agent, a World Simulator receives Percepts from the World and produces Predicted Percepts and an Inferred World State; an Actor (with its own Actor State) turns Action Proposals into Actions/Outputs; a Critic produces a Predicted Cost for the Objective Cost.]
What we need is Model-Based Reinforcement Learning
[Diagram: the Agent unrolls its World Simulator over several time steps, starting from Perception.]
Unsupervised Learning
Energy-Based Unsupervised Learning
[Figure: energy surface over (Y1, Y2)]
Capturing Dependencies Between Variables
with an Energy Function
The energy surface is a “contrast function” that takes low values on the data
manifold, and higher values everywhere else
Special case: energy = negative log density
Example: the samples live on the manifold Y2 = (Y1)^2
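A minimal numeric sketch of this example (the specific energy function below is one natural choice, not taken from the slides): squared deviation from the parabola gives zero energy on the manifold and positive energy everywhere else.

```python
# Energy function whose zero set is exactly the manifold Y2 = (Y1)^2:
# low on the manifold, higher everywhere else.
def energy(y1, y2):
    return (y2 - y1 ** 2) ** 2

on_manifold = energy(1.5, 2.25)   # 2.25 == 1.5**2, so zero energy
off_manifold = energy(1.5, 0.0)   # same Y1, pulled off the parabola
assert on_manifold == 0.0 and off_manifold > 0.0
```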
[Figure: video prediction; implausible futures get high energy, plausible futures get low energy.]
Learning the Energy Function
1. build the machine so that the volume of low energy stuff is constant
PCA, K-means, GMM, square ICA
2. push down the energy of data points, push up everywhere else
Max likelihood (needs tractable partition function)
3. push down the energy of data points, push up on chosen locations
contrastive divergence, Ratio Matching, Noise Contrastive Estimation,
Minimum Probability Flow
4. minimize the gradient and maximize the curvature around data points
score matching
5. train a dynamical system so that the dynamics go to the manifold
denoising auto-encoder
6. use a regularizer that limits the volume of space that has low energy
Sparse coding, sparse auto-encoder, PSD
7. if E(Y) = ||Y - G(Y)||^2, make G(Y) as "constant" as possible.
Contracting auto-encoder, saturating auto-encoder
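As a toy illustration of strategy 5, the sketch below uses a hand-written stand-in for a learned denoiser: gradient descent on the parabola energy E(y) = (y2 - y1^2)^2 is a dynamical system whose flow lands on the manifold y2 = y1^2. (A real denoising auto-encoder would learn such a vector field from corrupted data; this energy and step size are assumptions for illustration.)

```python
import numpy as np

# E(y) = (y2 - y1^2)^2 is zero exactly on the manifold y2 = y1^2.
# Gradient descent on E is a dynamical system whose flow goes to
# the manifold -- a stand-in for what a denoising auto-encoder learns.
def step(y, lr=0.05):
    r = y[1] - y[0] ** 2                         # residual off the manifold
    grad = np.array([-4.0 * y[0] * r, 2.0 * r])  # dE/dy1, dE/dy2
    return y - lr * grad

y = np.array([1.0, 3.0])   # start away from the manifold
for _ in range(500):
    y = step(y)
assert abs(y[1] - y[0] ** 2) < 1e-3   # the dynamics reached the manifold
```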
#1: constant volume of low energy
Energy surface for PCA and K-means
1. build the machine so that the volume of low energy stuff is constant
PCA, K-means, GMM, square ICA...
PCA: E(Y) = ||W^T W Y - Y||^2
K-Means (Z constrained to a 1-of-K code): E(Y) = min_Z sum_i ||Y - W_i Z_i||^2
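These two energies can be checked numerically. The sketch below (toy data and prototypes are assumptions for illustration) fits a 1-D PCA subspace and a set of K-means-style prototypes to points near a line, and verifies that both energies are lower on the data manifold than off it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data living near a 1-D line (the "manifold") inside R^2.
X = rng.normal(size=(500, 1)) @ np.array([[2.0, 1.0]])
X += 0.05 * rng.normal(size=X.shape)
Xc = X - X.mean(axis=0)

# PCA energy: E(Y) = ||W^T W Y - Y||^2 with W the top principal direction.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:1]                                   # 1 x 2, unit-norm row

def pca_energy(Y):
    return float(np.sum((W.T @ (W @ Y) - Y) ** 2))

# K-means energy: E(Y) = min_Z sum_i ||Y - W_i Z_i||^2 with Z a 1-of-K
# code, i.e. squared distance to the nearest prototype.
protos = Xc[rng.choice(len(Xc), 32, replace=False)]

def kmeans_energy(Y):
    return float(np.min(np.sum((protos - Y) ** 2, axis=1)))

on = Xc[0]                                      # a data point (on manifold)
off = on + 3.0 * np.array([-W[0, 1], W[0, 0]])  # shifted orthogonally off it
assert pca_energy(on) < pca_energy(off)
assert kmeans_energy(on) < kmeans_energy(off)
```

Both models give a constant volume of low-energy space (strategy 1): a fixed-dimension subspace for PCA, K points for K-means.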
#6. use a regularizer that limits
the volume of space that has low energy
Probabilistic interpretation:
• The “decoder” of the VAE can be seen as a deep (high
representational power) probabilistic model that can give us explicit
likelihoods
• The “encoder” of the VAE can be seen as a variational distribution
used to help train the decoder
2. From importance sampling to VAEs
• Selected slides from Shakir Mohamed’s talk at the Deep Learning
Summer School 2016
Importance Sampling

Integral problem:   p(x) = ∫ p(x|z) p(z) dz

Proposal q(z):      p(x) = ∫ p(x|z) [p(z) / q(z)] q(z) dz

Importance weight:  p(z) / q(z)

Notation: always think of q(z) as q(z|x).
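The proposal trick can be checked on a toy model where p(x) is known in closed form (the model and proposal below are assumptions for illustration): z ~ N(0,1) and x|z ~ N(z,1), so the marginal is p(x) = N(x; 0, 2).

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(v, mu, var):
    return np.exp(-(v - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Assumed toy model: z ~ N(0, 1), x|z ~ N(z, 1)  =>  p(x) = N(x; 0, 2).
x = 1.5
exact = normal_pdf(x, 0.0, 2.0)

# Proposal q(z) with mass where p(x|z)p(z) is large (deliberately not
# the exact posterior N(x/2, 1/2), which would give zero variance).
z = rng.normal(x / 2, 1.0, size=100_000)
q = normal_pdf(z, x / 2, 1.0)

# p(x) = E_q[ p(x|z) p(z) / q(z) ]  -- average of importance weights.
est = np.mean(normal_pdf(x, z, 1.0) * normal_pdf(z, 0.0, 1.0) / q)
assert abs(est - exact) / exact < 0.01
```

With a proposal far from the posterior, the weights become heavy-tailed and the estimate degrades; this is exactly why the encoder q(z|x) is worth training.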
• Penalty: Ensures that the explanation of the data q(z|x) doesn’t deviate
too far from your beliefs p(z). A mechanism for realising Ockham’s razor.
Rényi Variational Objective:

F(x, q) = 1/(1-α) · E_q(z)[ log (1/S) Σ_s ( p(x|z) p(z) / q(z) )^(1-α) ]

Other generalised families exist. The optimal solution is the same for all objectives.
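A quick numeric sanity check of the Monte Carlo form, on an assumed toy model where log p(x) is known exactly (z ~ N(0,1), x|z ~ N(z,1)): for α in (0,1) and a reasonable proposal, the estimate should come out below log p(x).

```python
import numpy as np

rng = np.random.default_rng(1)

def normal_pdf(v, mu, var):
    return np.exp(-(v - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Assumed toy model: z ~ N(0,1), x|z ~ N(z,1)  =>  log p(x) = log N(x; 0, 2).
x, alpha, S = 1.5, 0.5, 100_000
log_px = np.log(normal_pdf(x, 0.0, 2.0))

z = rng.normal(x / 2, 1.0, size=S)   # proposal q(z), close to the posterior
w = normal_pdf(x, z, 1.0) * normal_pdf(z, 0.0, 1.0) / normal_pdf(z, x / 2, 1.0)

# Monte Carlo Renyi objective: 1/(1-a) * log( (1/S) * sum_s w_s^(1-a) )
F = np.log(np.mean(w ** (1 - alpha))) / (1 - alpha)
assert F < log_px            # lower-bounds the log-likelihood...
assert log_px - F < 0.2      # ...fairly tightly for this proposal
```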
“From importance sampling to VAE” takeaways:
• The VAE objective function can be derived in a way that I think is
pretty unobjectionable to Bayesians and frequentists alike.
• Treat the decoder as a likelihood model we wish to train with
maximum likelihood. We want to use importance sampling as p(x|z) is
low for most z.
• The encoder is a trainable importance sampling distribution, and the
VAE objective is a lower bound to the likelihood by Jensen’s
inequality.
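The Jensen bound is easy to verify on a toy model where log p(x) is available in closed form (the model and the "encoder" below are assumptions for illustration): the Monte Carlo ELBO with an imperfect Gaussian q(z|x) stays below log p(x), and the gap equals KL(q(z|x) || p(z|x)).

```python
import numpy as np

rng = np.random.default_rng(2)

def log_normal_pdf(v, mu, var):
    return -(v - mu) ** 2 / (2 * var) - 0.5 * np.log(2 * np.pi * var)

# Assumed toy model: z ~ N(0,1), x|z ~ N(z,1)  =>  log p(x) = log N(x; 0, 2).
x = 1.5
log_px = log_normal_pdf(x, 0.0, 2.0)

# "Encoder" q(z|x): a deliberately imperfect Gaussian (the true posterior
# is N(x/2, 1/2); variance 1 leaves a visible KL gap in the bound).
mu_q, var_q = x / 2, 1.0
z = rng.normal(mu_q, np.sqrt(var_q), size=100_000)

# ELBO = E_q[ log p(x|z) + log p(z) - log q(z|x) ]  <=  log p(x)  (Jensen)
elbo = np.mean(
    log_normal_pdf(x, z, 1.0)
    + log_normal_pdf(z, 0.0, 1.0)
    - log_normal_pdf(z, mu_q, var_q)
)
assert elbo < log_px          # the lower bound holds
assert log_px - elbo < 0.5    # gap = KL(q || p(z|x)), small here
```

Making q(z|x) match the true posterior closes the gap, which is exactly what training the encoder does.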