Presentation - Deeplearning2015 Courville Autoencoder Extension 01

The document outlines the variational autoencoder (VAE) model for learning latent variable generative models. The VAE introduces an inference model to approximate the intractable true posterior distribution. This inference model, combined with a reparameterization trick, allows the generative and inference models to be trained simultaneously by optimizing a variational lower bound on the data likelihood using stochastic gradient descent. Extensions of the VAE discussed include its use for semi-supervised learning and sequential data modeling with variational recurrent neural networks.

Variational Autoencoder and Extensions
Aaron Courville
Deep Learning Summer School 2015
Outline
Variational autoencoder (VAE)

Semi-supervised learning with the VAE

Sequential application of VAE: the VRNN

DRAW model

Incorporating normalizing flows

Incorporating MCMC in the VAE inference

Deep directed graphical models
The Variational Autoencoder model:
- Kingma and Welling, Auto-Encoding Variational Bayes, International Conference on Learning Representations (ICLR) 2014.
- Rezende, Mohamed and Wierstra, Stochastic back-propagation and variational inference in deep latent Gaussian models. ICML 2014.

Unlike the RBM and DBM, here we are interested in deep directed graphical models:

(Schematic: a deep directed model in which latent layers z and y generate the observed layer x.)
Latent variable generative model

latent variable model: learn a mapping from some latent variable z


to a complicated distribution on x.
  p(x) = ∫ p(x, z) dz,   where p(x, z) = p(x | z) p(z)
  p(z) = something simple;   p(x | z) = g(z)

Can we learn to decouple the true explanatory factors underlying


the data distribution? E.g. separate identity and expression in face images
(Schematic: a low-dimensional latent space z mapped by g onto a curved data manifold in x-space.)
Image from: Ward, A. D., Hamarneh, G.: 3D Surface Parameterization Using Manifold Learning for Medial Shape Representation, Conference on Image Processing, Proc. of SPIE Medical Imaging, 2007
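To make the generative story concrete, here is a minimal NumPy sketch of ancestral sampling from such a model. The decoder g is a hypothetical, randomly initialized MLP used purely for illustration; it is not the model discussed on the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical decoder g(z): a fixed, randomly initialized 2-layer MLP
# mapping a 2-D latent code to the mean of a 784-D observation x.
W1, b1 = rng.normal(size=(2, 128)), np.zeros(128)
W2, b2 = rng.normal(size=(128, 784)), np.zeros(784)

def g(z):
    h = np.tanh(z @ W1 + b1)                         # non-linear feature layer
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))      # mean of p(x | z)

# Ancestral sampling: z from the simple prior, then x from p(x | z).
z = rng.standard_normal(size=(16, 2))                # p(z) = N(0, I), "something simple"
x_mean = g(z)                                        # parameters of p(x | z)
x = rng.binomial(1, x_mean)                          # e.g. a Bernoulli observation model
print(x.shape)                                       # (16, 784)
```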
Variational autoencoder (VAE) approach

Leverage neural networks to learn a latent variable model.


  p(x) = ∫ p(x, z) dz,   where p(x, z) = p(x | z) p(z)
  p(z) = something simple;   p(x | z) = g(z)

(Schematic: a neural network g maps each latent code z to a point g(z) on the data manifold in x-space.)
What can the VAE do?

MNIST:  Frey Face dataset:

(Figure 4 from Kingma & Welling: visualisations of learned data manifolds for generative models with a two-dimensional latent space, learned with AEVB. Since the prior of the latent space is Gaussian, linearly spaced coordinates were transformed through the inverse CDF of the Gaussian to produce values of the latent variables z; for each of these values z, the corresponding generative p(x | z) is plotted. (a) Learned Frey Face manifold, with pose and expression varying along the two latent axes; (b) learned MNIST manifold.)
The inference / learning challenge

Where does z come from? The classic directed model dilemma.

Computing the posterior p(z | x) is intractable.

We need it to train the directed model.

(Schematic: the generative network g maps z to the data manifold in x-space; the open question is where z comes from for a given x.)
Variational Autoencoder (VAE)
Where does z come from? The classic DAG problem.

The VAE approach: introduce an inference machine q_φ(z | x) that learns to approximate the posterior p_θ(z | x).

- Define a variational lower bound on the data likelihood: log p_θ(x) ≥ L(θ, φ, x)

  L(θ, φ, x) = E_{q_φ(z|x)} [ log p_θ(x, z) − log q_φ(z | x) ]
             = E_{q_φ(z|x)} [ log p_θ(x | z) + log p_θ(z) − log q_φ(z | x) ]
             = −D_KL( q_φ(z | x) || p_θ(z) ) + E_{q_φ(z|x)} [ log p_θ(x | z) ]
                regularization term              reconstruction term
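To see why L is a lower bound, and how tight it is, one extra line of algebra (standard, not from the slides) makes the gap explicit:

```latex
\log p_\theta(x)
  = \mathbb{E}_{q_\phi(z\mid x)}\!\left[\log \frac{p_\theta(x,z)}{q_\phi(z\mid x)}\right]
    + D_{\mathrm{KL}}\!\left(q_\phi(z\mid x)\,\|\,p_\theta(z\mid x)\right)
  = \mathcal{L}(\theta,\phi,x)
    + D_{\mathrm{KL}}\!\left(q_\phi(z\mid x)\,\|\,p_\theta(z\mid x)\right)
  \;\ge\; \mathcal{L}(\theta,\phi,x).
```

So maximizing L with respect to φ pushes q_φ(z | x) toward the true posterior, and the bound is tight exactly when they match.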

What is q_φ(z | x)?

VAE Inference model
The VAE approach: introduce an inference model q_φ(z | x) that learns to approximate the intractable posterior p_θ(z | x) by optimizing the variational lower bound:

  L(θ, φ, x) = −D_KL( q_φ(z | x) || p_θ(z) ) + E_{q_φ(z|x)} [ log p_θ(x | z) ]

We parameterize q_φ(z | x) with another neural network:

  q_φ(z | x) = q(z; f(x, φ))        p_θ(x | z) = p(x; g(z, θ))

(Schematic: an encoder network f(x) maps x to the parameters of q_φ(z | x); a decoder network g(z) maps z to the parameters of p_θ(x | z).)
Reparametrization trick
Adding a few details + one really important trick
Let's consider z to be real-valued and q_φ(z | x) = N(z; μ_z(x), σ_z(x)).

Parametrize z as  z = μ_z(x) + σ_z(x) ⊙ ε_z,  where ε_z ∼ N(0, I)

(optional) Parametrize x as  x = μ_x(z) + σ_x(z) ⊙ ε_x,  where ε_x ∼ N(0, I)

(Schematic: the encoder f(x) outputs μ_z(x) and σ_z(x); the decoder g(z) outputs μ_x(z) and σ_x(z).)
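A minimal PyTorch-style sketch of the reparameterization trick, assuming a hypothetical encoder that outputs μ_z(x) and log σ_z(x). The point is that the sample is a deterministic, differentiable function of (x, φ) plus parameter-free noise.

```python
import torch

def reparameterize(mu_z, log_sigma_z):
    """Draw z = mu_z(x) + sigma_z(x) * eps with eps ~ N(0, I).

    Gradients flow through mu_z and log_sigma_z; the randomness lives
    only in eps, which does not depend on the encoder parameters.
    """
    eps = torch.randn_like(mu_z)                 # eps ~ N(0, I)
    return mu_z + torch.exp(log_sigma_z) * eps

# Example with hypothetical encoder outputs for a batch of 8 inputs:
mu_z = torch.zeros(8, 20, requires_grad=True)
log_sigma_z = torch.zeros(8, 20, requires_grad=True)
z = reparameterize(mu_z, log_sigma_z)
z.sum().backward()                               # gradients reach mu_z and log_sigma_z
print(mu_z.grad.shape, log_sigma_z.grad.shape)
```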

Training with backpropagation!
Thanks to the reparametrization trick, we can simultaneously train both the generative model p_θ(x | z) and the inference model q_φ(z | x) by optimizing the variational bound using gradient backpropagation.

Objective function:  L(θ, φ, x) = −D_KL( q_φ(z | x) || p_θ(z) ) + E_{q_φ(z|x)} [ log p_θ(x | z) ]

(Schematic: the forward pass runs x through q_φ(z | x) and then p_θ(x | z); the backward pass propagates gradients of the bound through both networks.)
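A compact sketch of one such training step under common assumptions (Gaussian q_φ with analytic KL against a standard normal prior, Bernoulli decoder). Layer sizes and module names are illustrative, not taken from the slides.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)
        self.enc_mu = nn.Linear(h_dim, z_dim)        # mu_z(x)
        self.enc_logvar = nn.Linear(h_dim, z_dim)    # log sigma_z(x)^2
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = torch.tanh(self.enc(x))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterize
        x_logits = self.dec(z)
        # Reconstruction term E_q[log p(x|z)] for a Bernoulli decoder
        recon = -F.binary_cross_entropy_with_logits(x_logits, x, reduction='sum')
        # Analytic KL( q_phi(z|x) || N(0, I) )
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return -(recon - kl)          # negative ELBO, to be minimized

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784).round()       # stand-in binary batch
loss = model(x)
opt.zero_grad(); loss.backward(); opt.step()
```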
Relative performance of VAE

(Figure 3 from Kingma & Welling: comparison of AEVB to the wake-sleep algorithm and Monte Carlo EM in terms of the estimated marginal likelihood, for different numbers of training points. Monte Carlo EM is not an on-line algorithm and, unlike AEVB and the wake-sleep method, cannot be applied efficiently to the full MNIST dataset.)

Note: MCEM is Expectation Maximization where p(z | x) is sampled using Hybrid (Hamiltonian) Monte Carlo.

For more see: Markov Chain Monte Carlo and Variational Inference: Bridging the Gap, Tim Salimans, Diederik P. Kingma, Max Welling.

Figure from Diederik P. Kingma & Max Welling
Effect of the KL term: component collapse

Figure from Laurent Dinh & Vincent Dumoulin


Component collapse & decoder weights
Effects on the model

Decoder weight norms

Figure from Laurent Dinh & Vincent Dumoulin


Semi-supervised Learning with
Deep Generative Models
Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed, Max Welling (NIPS 2014)

Semi-supervised Learning with Deep Generative Models
Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed, Max Welling (NIPS 2014)

We refer to the labelled and unlabelled subsets of the data as p̃_l(x, y) and p̃_u(x), and develop models for semi-supervised learning that exploit generative descriptions of the data to improve upon the classification performance that would be obtained using the labelled data alone. They study two basic approaches:

M1: Standard unsupervised feature learning (self-taught learning)
- Train a latent-variable model on unlabeled data; train a classifier to map from the latent features z to the label y.
- Generative model:

    p(z) = N(z | 0, I);   p_θ(x | z) = f(x; z, θ)        (1)

  where f(x; z, θ) is a suitable likelihood function (e.g. a Gaussian or Bernoulli distribution) whose parameters are formed by a non-linear transformation, with parameters θ, of the latent variables z. This non-linear transformation, chosen to be a deep neural network, is essential to allow for higher moments of the data to be captured by the density model.
- Approximate samples from the posterior distribution over the latent variables p(z | x) are used as features to train a classifier that predicts the class labels y, such as a (transductive) SVM or multinomial regression. Classification is then performed in a lower-dimensional space, since z typically has much lower dimensionality than the observations, and these low-dimensional embeddings should be more easily separable.

M2: Generative semi-supervised model
- A probabilistic model that describes the data as being generated by a latent class variable y in addition to a continuous latent variable z:

    p(y) = Cat(y | π);   p(z) = N(z | 0, I);   p_θ(x | y, z) = f(x; y, z, θ)        (2)

  where Cat(y | π) is the multinomial distribution, and the class label y is treated as a latent variable when no label is available. The independent latent variables y and z allow the model, e.g. in digit generation, to separate the class specification from the writing style of the digit. As before, f(x; y, z, θ) is a suitable likelihood function parameterised by a deep neural network. Since most labels y are unobserved, we integrate over the class of any unlabelled data during inference, thus performing classification as inference; predictions for any missing labels are obtained from the inferred posterior distribution p_θ(y | x). The model can also be seen as a hybrid continuous-discrete mixture model where the different mixture components share parameters.

M1+M2: Combination (stacked) semi-supervised model
- Train the generative model M1 on unlabeled data to learn a new latent representation z1, then train the generative semi-supervised model M2 using the embeddings z1 instead of the raw data x.
- The result is a deep generative model with two layers of stochastic variables:

    p_θ(x, y, z1, z2) = p(y) p(z2) p_θ(z1 | y, z2) p_θ(x | z1)

  where the priors p(y) and p(z2) equal those of y and z above, and both p_θ(z1 | y, z2) and p_θ(x | z1) are parameterised as deep neural networks.

Scalable variational inference
- In all these models, computation of the exact posterior distribution is intractable due to the nonlinear, non-conjugate dependencies between the random variables. To allow for tractable and scalable inference and parameter learning, we exploit recent advances in variational inference (Kingma and Welling, 2014; Rezende et al., 2014): we introduce a fixed-form distribution q_φ(z | x) with parameters φ that approximates the true posterior p(z | x), and follow the variational principle to derive a lower bound on the marginal likelihood of the model. This bound forms the objective function and ensures that the approximate posterior is as close as possible to the true posterior.

Approximate posterior (encoder model)
- Following the VAE strategy, we parametrize the approximate posterior with a high-capacity model such as an MLP or some other deep model (convnet, RNN, etc.). Using an inference network, we avoid the need to compute per-data-point variational parameters and instead compute a set of global variational parameters φ, amortising the cost of inference by generalising between the posterior estimates for all latent variables through the parameters of the inference network; this allows fast inference at both training and test time (unlike variational EM, in which the E-step optimisation is repeated for every test data point).
- An inference network is introduced for each of the latent variables, parameterised as deep neural networks whose outputs form the parameters of the distribution q_φ(·). For the latent-feature discriminative model (M1), we use a Gaussian inference network q_φ(z | x) for the latent variable z. For the generative semi-supervised model (M2), we introduce an inference model for each of the latent variables z and y, with the factorised form q_φ(z, y | x) = q_φ(z | x) q_φ(y | x), specified as Gaussian and multinomial distributions respectively:

    M1:  q_φ(z | x)    = N(z | μ_φ(x), diag(σ²_φ(x)))                                      (3)
    M2:  q_φ(z | y, x) = N(z | μ_φ(y, x), diag(σ²_φ(x)));   q_φ(y | x) = Cat(y | π_φ(x))    (4)

  where μ_φ, σ_φ and π_φ are parameterized by deep MLPs that can share parameters.
Latent-feature discriminative model (M1) objective
For this model, the variational bound J(x) on the marginal likelihood for a single data point is:

  log p_θ(x) ≥ E_{q_φ(z|x)} [ log p_θ(x | z) ] − KL[ q_φ(z | x) || p_θ(z) ] = −J(x)        (5)

The inference network q_φ(z | x) (3) is used during training of the model on both the labelled and unlabelled data sets. This approximate posterior is then used as a feature extractor for the labelled data set, and the features are used for training the classifier.

Generative semi-supervised model (M2) objective
For this model, we have two cases to consider.

- Objective with labeled data: the label corresponding to a data point is observed, and the variational bound is a simple extension of equation (5):

    log p_θ(x, y) ≥ E_{q_φ(z|x,y)} [ log p_θ(x | y, z) + log p_θ(y) + log p(z) − log q_φ(z | x, y) ] = −L(x, y)        (6)

- Objective without labels: the label is missing, so it is treated as a latent variable over which we perform posterior inference; the resulting bound for a data point with an unobserved label y is:

    log p_θ(x) ≥ E_{q_φ(y,z|x)} [ log p_θ(x | y, z) + log p_θ(y) + log p(z) − log q_φ(y, z | x) ]
               = Σ_y q_φ(y | x) (−L(x, y)) + H(q_φ(y | x)) = −U(x)        (7)

- The bound on the marginal likelihood for the entire dataset is now:

    J = Σ_{(x,y)∼p̃_l} L(x, y) + Σ_{x∼p̃_u} U(x)        (8)

- The distribution q_φ(y | x) (4) for the missing labels has the form of a discriminative classifier, and we can use this knowledge to construct the best classifier possible as our inference model; this distribution is also used at test time for predictions on unseen data.
- In the objective (8), the label-predictive distribution q_φ(y | x) contributes only to the second term, relating to the unlabelled data, which is undesirable if we wish to use it as a classifier. Ideally, all model and variational parameters should learn in all cases. To remedy this, we add a classification loss to (8), so that q_φ(y | x) also learns from labelled data. The extended objective function is:

    J^α = J + α · E_{(x,y)∼p̃_l} [ −log q_φ(y | x) ]

  where the hyper-parameter α controls the relative weight between generative and purely discriminative learning; the paper uses α = 0.1·N in all experiments. While motivated here by the need for all model components to learn at all times, this objective can also be derived directly using the variational principle.
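A schematic PyTorch sketch of how the M2 objective J^α could be assembled from hypothetical networks encoder_z, encoder_y and decoder. Shapes, names, and the Gaussian/Bernoulli choices are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def labeled_bound(x, y_onehot, encoder_z, decoder):
    """-L(x, y): ELBO for a labelled example (Eq. 6), assuming a uniform p(y)."""
    mu, logvar = encoder_z(x, y_onehot)                      # q_phi(z | x, y)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
    x_logits = decoder(z, y_onehot)                          # p_theta(x | y, z)
    log_px = -F.binary_cross_entropy_with_logits(x_logits, x, reduction='none').sum(-1)
    log_py = torch.log(torch.tensor(1.0 / y_onehot.size(-1)))
    kl_z = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
    return log_px + log_py - kl_z                            # = -L(x, y)

def unlabeled_bound(x, encoder_z, encoder_y, decoder, n_classes=10):
    """-U(x): marginalize the labelled bound over q_phi(y | x), plus its entropy (Eq. 7)."""
    logits_y = encoder_y(x)                                  # q_phi(y | x)
    q_y = F.softmax(logits_y, dim=-1)
    bound = 0.0
    for c in range(n_classes):                               # enumerate every class label
        y = F.one_hot(torch.full((x.size(0),), c, dtype=torch.long), n_classes).float()
        bound = bound + q_y[:, c] * labeled_bound(x, y, encoder_z, decoder)
    entropy = -(q_y * F.log_softmax(logits_y, dim=-1)).sum(-1)
    return bound + entropy                                   # = -U(x)

def loss_J_alpha(x_l, y_l, x_u, nets, alpha):
    """J^alpha = J + alpha * cross-entropy of q_phi(y | x) on labelled data."""
    encoder_z, encoder_y, decoder = nets
    elbo_l = labeled_bound(x_l, y_l, encoder_z, decoder).mean()
    elbo_u = unlabeled_bound(x_u, encoder_z, encoder_y, decoder).mean()
    clf = F.cross_entropy(encoder_y(x_l), y_l.argmax(-1))
    return -(elbo_l + elbo_u) + alpha * clf
```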
Semi-supervised MNIST classification results
Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed, Max Welling (NIPS 2014)

Combination model M1+M2 shows dramatic improvement:

Table 1: Benchmark results of semi-supervised classification on MNIST with few labels (test error %).

  N     NN     CNN    TSVM   CAE    MTC    AtlasRBF       M1+TSVM        M2             M1+M2
  100   25.81  22.98  16.81  13.47  12.03  8.10 (±0.95)   11.82 (±0.25)  11.97 (±1.71)  3.33 (±0.14)
  600   11.44  7.68   6.16   6.3    5.13   -              5.72 (±0.049)  4.94 (±0.13)   2.59 (±0.05)
  1000  10.7   6.45   5.38   4.77   3.64   3.68 (±0.12)   4.24 (±0.07)   3.60 (±0.56)   2.40 (±0.02)
  3000  6.04   3.35   3.45   3.22   2.57   -              3.49 (±0.04)   3.92 (±0.63)   2.18 (±0.04)

Full MNIST test error (non-convolutional): 0.96%
- for comparison, current SOTA: 0.61%

Open source code, with which the most important results and figures can be reproduced, is available at http://github.com/dpkingma/nips14-ssl. For the latest experimental results, please see http://arxiv.org/abs/1406.5298.

(From the paper's experimental setup: the semi-supervised data sets are created by splitting the 50,000 MNIST training points into labelled and unlabelled sets, varying the labelled set size from 100 to 3000 with balanced classes, and using a number of randomised splits to obtain confidence bounds on the mean performance under repeated draws of data sets.)

Conditional generation using M2
Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed, Max Welling (NIPS 2014)

Figure 1: (a) Visualisation of handwriting styles learned by the model with 2D z-space, obtained by fixing the class label and varying the 2D latent variable z. (b) MNIST analogies and (c) SVHN analogies: analogical reasoning with generative semi-supervised models using a high-dimensional z-space. The leftmost columns show images from the test set; the other columns show analogical fantasies of x by the generative model, where the latent variable z of each row is set to the value inferred from the test-set image on the left by the inference network, and each column corresponds to a class label y.

(The paper's Tables 2 and 3 report semi-supervised classification on the SVHN and NORB datasets with 1000 labels.)
A Recurrent Latent Variable Model
for Sequential Data
Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron Courville, Yoshua Bengio
(arXiv, 2015)

VRNN: Model Structure
Variational recurrent neural network (VRNN) is a recurrent
(conditional) application of the VAE at every time-step.

Recurrence is mediated through the recurrent hidden layer.

Motivation: latent variables are a more natural space in which to encode stochasticity; standard RNNs encode noise only in the input.

(Schematic: the VRNN is a VAE with recurrence added through the hidden state.)
VRNN: Model Structure

Variational recurrent neural network (VRNN) is a recurrent


(conditional) application of the VAE at every time-step.

Recurrence is mediated through the recurrent hidden layer.

VRNN:

Figure 1: Graphical illustrations of each operation in the proposed VRNN: (a) computing the conditional prior using Eq. (5); (b) generating function using Eq. (6); (c) updating the recurrence of the RNN using Eq. (7); (d) inference of the approximate posterior using Eq. (9); (e) overall computational paths of the VRNN.
VRNN: Prior on zt

At time step t, the latent variable zt is generated as a


function of the recurrent state at time step t-1.

VRNN: Generation

Generation of xt uses the current latent variable zt and the


previous recurrent state ht-1.

Generative model
factorizes over time

VRNN: Recurrence

Recurrent state h_t is a function of the previous recurrent state h_{t-1}, the current observation x_t, and the current latent variable z_t.

VRNN: Inference

The approximate posterior is a function not only of x_t but also of h_{t-1}:

  z_t | x_t ∼ N(μ_z,t, diag(σ²_z,t)),   where [μ_z,t, σ_z,t] = φ^enc_τ( φ^x_τ(x_t), h_{t-1} )        (9)

where μ_z,t and σ_z,t denote the parameters of the approximate posterior. The encoding of the approximate posterior and the decoding for generation are tied through the hidden state h_{t-1}, which results in the factorization

  q(z_{≤T} | x_{≤T}) = ∏_{t=1}^{T} q(z_t | x_{≤t}, z_{<t})        (10)

where the history is summarized by the recurrent hidden state h_{t-1}. This factorization is crucial in breaking the variational lower bound into timestep-wise terms.
VRNN: Learning

Learning is accomplished via gradient backpropagation:
- through the decoder and encoder, as in the standard VAE,
- and through the recurrent connections, as in the standard RNN.

Objective function: the factorization (10) breaks the variational lower bound into timestep-wise terms,

  E_{q(z_{≤T} | x_{≤T})} [ Σ_{t=1}^{T} ( −KL( q(z_t | x_{≤t}, z_{<t}) || p(z_t | x_{<t}, z_{<t}) ) + log p(x_t | z_{≤t}, x_{<t}) ) ]

As in the standard VAE, we learn the generative and inference models jointly by maximizing the variational lower bound with respect to their parameters. The schematic view in Fig. 1 shows operations (a)-(d), corresponding to Eqs. (5), (6), (7), (9). The proposed network applies operation (a), hence it has a sequential prior (VRNN, Eq. (5)); the variant that does not apply operation (a) has a prior that is independent across timesteps (VRNN-I). The STORN model can be considered an instance of the VRNN-I family (with further restrictions on the dependency structure of the approximate inference model). VRNN-I is included in the experimental evaluation in order to directly study the impact of the temporal dependency structure in the prior (sequential prior) over the latent random variables.
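A minimal sketch of one VRNN timestep under these equations, with hypothetical MLPs standing in for the feature extractors φ^x, φ^z and the prior/encoder/decoder networks. It mirrors operations (a)-(d) of Figure 1, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class VRNNCell(nn.Module):
    def __init__(self, x_dim, z_dim, h_dim):
        super().__init__()
        self.phi_x = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.phi_z = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU())
        self.prior = nn.Linear(h_dim, 2 * z_dim)              # (a) p(z_t | h_{t-1})
        self.enc = nn.Linear(2 * h_dim, 2 * z_dim)             # (d) q(z_t | x_t, h_{t-1})
        self.dec = nn.Linear(2 * h_dim, x_dim)                 # (b) p(x_t | z_t, h_{t-1})
        self.rnn = nn.GRUCell(2 * h_dim, h_dim)                # (c) recurrence

    def forward(self, x_t, h):
        fx = self.phi_x(x_t)
        prior_mu, prior_logvar = self.prior(h).chunk(2, dim=-1)
        enc_mu, enc_logvar = self.enc(torch.cat([fx, h], -1)).chunk(2, dim=-1)
        z_t = enc_mu + torch.exp(0.5 * enc_logvar) * torch.randn_like(enc_mu)
        fz = self.phi_z(z_t)
        x_logits = self.dec(torch.cat([fz, h], -1))            # decoder parameters
        h_next = self.rnn(torch.cat([fx, fz], -1), h)          # h_t = f(h_{t-1}, x_t, z_t)
        return x_logits, (enc_mu, enc_logvar), (prior_mu, prior_logvar), h_next

cell = VRNNCell(x_dim=200, z_dim=16, h_dim=64)
h = torch.zeros(8, 64)
for t in range(5):                    # run the cell over a short dummy sequence
    x_t = torch.randn(8, 200)
    x_logits, q_params, p_params, h = cell(x_t, h)
```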
VRNN: Results
Results on speech synthesis and handwriting synthesis.

Using stochastic latent variables allows for a more effective model than adding the stochasticity in the input.

Table 1: Average log-probability on the test (or validation) set of each task. Two values are reported for each VRNN variant.

                    Speech modelling                                  Handwriting
  Models          Blizzard   TIMIT    Onomatopoeia   Accent   IAM-OnDB
  RNN-Gauss       3539       -1900    -984           -1293    1016
  RNN-GMM         7413       26643    18865          3453     1358
  VRNN-I-Gauss    8933       28340    19053          3843     1332
                  9188       29639    19638          4180     1353
  VRNN-Gauss      9223       28805    20721          3952     1337
                  9516       30235    21332          4223     1354
  VRNN-GMM        9107       28982    20849          4140     1384
                  9392       29604    21219          4319     1384

Blizzard: this text-to-speech dataset, made available by the Blizzard Challenge 2013, contains 300 hours of English spoken by a single female speaker.
VRNN: Speech synthesis

(a) Ground Truth   (b) RNN-GMM   (c) VRNN-Gauss

Figure 3: Typical training examples and generated samples from RNN-GMM and VRNN-Gauss. The top three rows show the global waveforms while the bottom three rows show more zoomed-in waveforms. Samples from (b) RNN-GMM contain high-frequency noise, while (c) VRNN-Gauss generates samples with less noise; RNN-Gauss is excluded because its samples are almost pure noise.
VRNN: KL Divergence

The KL divergence tends to be fairly sparse and seems to be most active at motif transitions.

(Plot rows: |μ_z,t − μ_z,t−1|, KL divergence, input waveform.)

Figure 2: The top row represents the difference δ_t between μ_z,t and μ_z,t−1. The middle row represents the dominant KL divergence values shown in temporal order. The bottom row shows the corresponding waveforms.
VRNN: Writing synthesis
Predicting a sequence of (x, y) locations of the next destination of the pen.

Figure 4: Handwriting samples: (a) ground-truth examples from the training set; unconditionally generated handwriting from (b) RNN-Gauss, (c) RNN-GMM and (d) VRNN-GMM. The VRNN-GMM retains writing styles from beginning to end, while RNN-Gauss and RNN-GMM tend to change style during the generation process. This is possibly because the sequential latent random variables guide the model to generate samples with a consistent writing style.
DRAW
Deep Recurrent Attentive Writer
Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, Daan Wierstra
Google Deepmind - (ICML, 2015)

DRAW: Deep Recurrent Attentive Writer

Augments the encoder and decoder with recurrent neural networks.

Inference and generation are defined by a sequential process, even for non-sequential data.

Adds an attention mechanism over the input to define the sequential process.
Variational Autoencoder Recap
(Schematic: the generative (decoder) model is an MLP that maps a sampled z to p(x | z); the inference (encoder) model is an MLP that reads x and produces q(z | x).)
DRAW Model
(Schematic: at each time-step, the encoder RNN reads from x and its previous state to produce q(z_t | x, z_{1:t−1}); a sample z_t feeds the decoder RNN, which writes to the canvas c_t; after the final step, the canvas c_T defines the output distribution p(x | z_{1:T}). The read could be a subset of x (with attention). The decoder RNN and writes form the generative model; the encoder RNN and reads form the inference model.)
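A schematic sketch of DRAW's sequential generation loop (decoder side only), assuming hypothetical decoder_rnn and write modules. It shows how the canvas is accumulated over T steps and squashed into the final output distribution; it is not the paper's full architecture.

```python
import torch
import torch.nn as nn

T, z_dim, h_dim, x_dim = 10, 20, 100, 784
decoder_rnn = nn.GRUCell(z_dim, h_dim)     # hypothetical decoder RNN
write = nn.Linear(h_dim, x_dim)            # write() without attention: W(h_dec)

def generate(batch_size=16):
    """Ancestral sampling: z_t ~ N(0, I); the canvas accumulates the writes."""
    h_dec = torch.zeros(batch_size, h_dim)
    canvas = torch.zeros(batch_size, x_dim)
    for t in range(T):
        z_t = torch.randn(batch_size, z_dim)        # prior sample at step t
        h_dec = decoder_rnn(z_t, h_dec)              # update decoder state
        canvas = canvas + write(h_dec)               # c_t = c_{t-1} + write(h_dec_t)
    return torch.sigmoid(canvas)                     # mean of Bernoulli p(x | z_{1:T})

samples = generate()
print(samples.shape)   # (16, 784)
```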
DRAW MNIST Generation

Reading and writing without attention
- The simplest instantiation of DRAW is without an attention mechanism.
- The entire input image is passed to the encoder at every time-step, and the decoder modifies the entire canvas matrix at every time-step.
- In this case the read and write operations reduce to

    read(x, x̂_t, h_dec_{t−1}) = [x, x̂_t]        (17)
    write(h_dec_t) = W(h_dec_t)                  (18)

- However, this approach does not allow the encoder to focus on only part of the input when creating the latent distribution, nor does it allow the decoder to modify only a part of the canvas vector. In other words it does not provide the network with an explicit selective attention mechanism, which the authors believe to be crucial to large-scale image generation.

(Figure 7: MNIST generation sequences for DRAW without attention. Notice how the network first generates a very blurry image that is refined over time.)
DRAW Attention Mechanism

- DRAW can use a differentiable attention mechanism.
- Attention uses recurrence (via the decoder) to select subsets of x for reading and writing.
- Attention controls the extracted patch location, scale and blur.

Figure 3: Left: a 3 x 3 grid of filters superimposed on an image; the stride (δ) and centre location (g_X, g_Y) are indicated. Right: three N x N patches extracted from the image (N = 12). The green rectangles on the left indicate the boundary and precision (σ) of the patches, while the patches themselves are shown to the right; the top patch has a small δ and high σ, giving a zoomed-in view.

Gaussian grid filters:

Figure 4: Zooming. Top left: the original 100 x 75 image. Top middle: a 12 x 12 patch extracted with 144 2D Gaussian filters. Top right: the reconstructed image when applying transposed filters on the patch. Bottom: only two of the 2D Gaussian filters are displayed.

DRAW: Cluttered MNIST Classification

- DRAW with attention applied to a classification task: cluttered translated MNIST.
- Attention learns to focus on the digit in the scene.

Figure 5: Cluttered MNIST classification with attention. Each sequence shows a succession of four glimpses taken by the network while classifying cluttered translated MNIST. The green rectangle indicates the size and location of the attention patch, while the line width represents the variance of the filters.

Table 1: Classification test error on 100 x 100 Cluttered Translated MNIST.

  Model                                      Error
  Convolutional, 2 layers                    14.35%
  RAM, 4 glimpses, 12 x 12, 4 scales         9.41%
  RAM, 8 glimpses, 12 x 12, 4 scales         8.11%
  Differentiable RAM, 4 glimpses, 12 x 12    4.18%
  Differentiable RAM, 8 glimpses, 12 x 12    3.36%

(DRAW used a single glimpse per time-step, whereas RAM used four, at different zooms.)

DRAW MNIST Generation with Attention

Samples from DRAW with attention. Figure 6: generated MNIST images; all digits were generated by DRAW except those in the rightmost column, which shows the training-set images closest to those in the column second to the right (pixelwise L2 is the distance measure).

NLL of MNIST test samples:

Table 2: Negative log-likelihood (in nats) per test-set example on the binarised MNIST data set. The right-hand column, where present, gives an upper bound (Eq. 12) on the negative log-likelihood. Previous results are from [1] Salakhutdinov & Hinton (2009), [2] Murray & Salakhutdinov (2009), [3] Uria et al. (2014), [4] Raiko et al. (2014), [5] Rezende et al. (2014), [6] Salimans et al. (2014), [7] Gregor et al. (2014).

  Model                              -log p   upper bound
  DBM 2hl [1]                        84.62
  DBN 2hl [2]                        84.55
  NADE [3]                           88.33
  EoNADE 2hl (128 orderings) [3]     85.10
  EoNADE-5 2hl (128 orderings) [4]   84.68
  DLGM [5]                           86.60
  DLGM 8 leapfrog steps [6]          85.51    88.30
  DARN 1hl [7]                       84.13    88.30
  DARN 12hl [7]                      -        87.72
  DRAW without attention             -        87.40
  DRAW                               -        80.97

Image from Jörg Bornschein

This is really low!
Recent Innovations in VAE
Inference

Inference in the VAE
VAE inference approximates the posterior p_θ(z | x) with a distribution that is conditionally independent across the dimensions of z:  q_φ(z | x) = ∏_i q_φ(z_i | x)

- Consequence: a non-multimodal, i.e. unimodal, approximate posterior.

(Schematic: a factorized q in z-space versus the curved true posterior induced by the mapping to x-space.)

We can parametrize some richer distribution (e.g. a full-covariance Gaussian), but what is the right distribution?

Can we lessen this restriction? How can we get closer to p_θ(z | x)?
Variational Inference with
Normalizing Flows
Danilo Jimenez Rezende, Shakir Mohamed
Google Deepmind - (ICML, 2015)

Normalizing Flows
How do we specify a complicated joint distribution over z?

Normalizing flows: the transformation of a probability density through


a sequence of invertible mappings.
- By repeated application of the rule for random variable transformations, the initial
density flows through the sequence of invertible mappings.

- At the end of the sequence, we have a valid (maybe complex) probability distribution.

Transformation of random variables: z′ = f(z), with inverse f⁻¹(z′) = z:

  q(z′) = q(z) | det ∂f⁻¹/∂z′ | = q(z) | det ∂f/∂z |⁻¹

Chaining together a sequence  z_K = f_K ∘ f_{K−1} ∘ ... ∘ f_2 ∘ f_1 (z_0):

  log q_K(z_K) = log q_0(z_0) − Σ_{k=1}^{K} log | det ∂f_k/∂z_{k−1} |
Normalizing Flows
Law of the unconscious statistician: expectations w.r.t. the transformed density q_K(z_K) can be written as expectations w.r.t. the original q_0(z_0). For z_K = f_K ∘ ... ∘ f_1(z_0),

  E_{q_K}[ g(z_K) ] = E_{q_0}[ g( f_K ∘ f_{K−1} ∘ ... ∘ f_2 ∘ f_1 (z_0) ) ]

The variational lower bound:

  L(θ, φ, x) = E_{q_φ(z|x)} [ log p_θ(x, z) − log q_φ(z | x) ]
             = E_{q_K(z_K)} [ log p(x, z_K) − log q_K(z_K) ]
             = E_{q_0(z_0)} [ log p(x, z_K) − log q_0(z_0) + Σ_{k=1}^{K} log | det ∂f_k/∂z_{k−1} | ]
Normalizing Flows for VAE posteriors

Consider the (planar) family of transformations:  f(z) = z + u h(wᵀz + b)

  ψ(z) = h′(wᵀz + b) w,    | det ∂f/∂z | = | 1 + uᵀψ(z) |

Chaining these transformations gives us a rich family of posteriors,

  log q_K(z_K) = log q_0(z_0) − Σ_{k=1}^{K} log | 1 + u_kᵀ ψ_k(z_{k−1}) |

that remain cheap to evaluate: each log-det term costs O(D), with no full Jacobian determinant to compute.

(Figure 1 from Rezende & Mohamed: the effect of planar flows of length K = 1, 2, 10 applied to a unit-Gaussian and a uniform initial density q_0.)
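A small NumPy sketch of a planar flow step and its log-det term under these formulas, with tanh as the assumed nonlinearity h; the parameters are random placeholders rather than learned values.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 2                                   # latent dimensionality

# Placeholder parameters for one planar transformation f(z) = z + u*h(w^T z + b)
u, w, b = rng.normal(size=D), rng.normal(size=D), 0.1

def planar_flow(z):
    """Apply f and return (f(z), log|det df/dz|) for a batch of points z: (N, D)."""
    a = z @ w + b                                   # w^T z + b, shape (N,)
    f_z = z + np.outer(np.tanh(a), u)               # z + u * h(w^T z + b)
    psi = np.outer(1.0 - np.tanh(a) ** 2, w)        # psi(z) = h'(w^T z + b) * w
    log_det = np.log(np.abs(1.0 + psi @ u))         # |det df/dz| = |1 + u^T psi(z)|
    return f_z, log_det

# Push samples from q0 = N(0, I) through K flows and track log q_K(z_K).
z = rng.standard_normal(size=(1000, D))
log_q = -0.5 * np.sum(z ** 2, axis=1) - 0.5 * D * np.log(2 * np.pi)   # log q0(z0)
for _ in range(3):                      # K = 3 identical flows, for illustration only
    z, log_det = planar_flow(z)
    log_q = log_q - log_det             # log qK(zK) = log q0(z0) - sum of log-dets
print(z.shape, log_q.shape)
```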
Normalizing Flows for VAE posteriors

Normalizing flow integration into the VAE, with f(z) = z + u h(wᵀz + b):

(Figure 2 from Rezende & Mohamed: inference and generative models. Left: the inference network maps the observation x to the parameters of the flow; right: the generative model receives the posterior samples from the flow.)

Normalizing flows are fully differentiable, so learning via gradient backpropagation can proceed as before.

Quantitative comparison to other methods shows the benefit of the normalizing flows.

Table 2 (Rezende & Mohamed): comparison of negative log-probabilities on the test set for the binarised MNIST data.

  Model                             -ln p(x)
  DLGM diagonal covariance          89.9
  DLGM+NF (k = 10)                  87.5
  DLGM+NF (k = 20)                  86.5
  DLGM+NF (k = 40)                  85.7
  DLGM+NF (k = 80)                  85.1
  DLGM+NICE (k = 10)                88.6
  DLGM+NICE (k = 20)                87.9
  DLGM+NICE (k = 40)                87.3
  DLGM+NICE (k = 80)                87.2

  Results below from (Salimans et al., 2015):
  DLGM + HVI (1 leapfrog step)      88.08
  DLGM + HVI (4 leapfrog steps)     86.40
  DLGM + HVI (8 leapfrog steps)     85.51

  Results below from (Gregor et al., 2014):
  DARN nh = 500                     84.71
  DARN nh = 500, adaNoise           84.13

Recall that DRAW achieved <= 80.97.

(Figure 4: effect of the flow length on MNIST; panels show (a) the bound F(x), (b) D_KL(q; p(z|x)), (c) ln p(x).)

(Experimental note from the paper: binarised 28 x 28 MNIST digits, binarised as in Uria et al. (2014); DLGMs with 40 latent variables trained for 500,000 parameter updates.)
Markov Chain Monte Carlo and
Variational Inference: Bridging the Gap
Tim Salimans, Diederik P. Kingma, Max Welling
(ICML, 2015)

Variational and MCMC inference
Variational inference (a la VAE) and MCMC inference have different
properties
- Variational inference (VI) is efficient / MCMC is computationally intensive
- VI has a fixed parametric form / MCMC asymptotically approaches p(z | x)

Can we combine these two approaches to find a good compromise?

  q_φ(z | x) = q(z; f(x, φ))        p_θ(x | z) = p(x; g(z, θ))

(Schematic: the encoder f(x) and decoder g(z) networks of the VAE.)
Hamiltonian (Hybrid) Monte Carlo

Basic idea:
- Consider sampling from the posterior p(z | x) as a physics simulation: a frictionless ball rolling on the potential energy surface E(x, z) = −log p(x, z).
- Augment the state with a velocity v, with kinetic energy K(v) = vᵀv / 2.
- Total energy = Hamiltonian: H(x, z, v) = E(x, z) + K(v).

The algorithm:
- Gibbs sample the velocity v ∼ N(0, I)
- Simulate leapfrog dynamics for L steps
- Accept the new position with probability min[1, exp(H(x, z, v) − H(x, z′, v′))]

HMC innovation: if gradients of H(x, z, v) are available, then we can use that information to move around the surface more effectively.

The original name is Hybrid Monte Carlo, with reference to the hybrid dynamical simulation method on which it was based.
Hamiltonian (Hybrid) Monte Carlo

The HMC algorithm:
- Gibbs sample the velocity v ∼ N(0, I)
- Simulate leapfrog dynamics for T steps
- Accept the new position with probability min[1, exp(H(x, z_0, v_0) − H(x, z_T, v_T))]

Leapfrog dynamics:

  v_{t+ε/2} = v_t + (ε/2) ∇_z log p(x, z_t)
  z_{t+ε}   = z_t + ε v_{t+ε/2}
  v_{t+ε}   = v_{t+ε/2} + (ε/2) ∇_z log p(x, z_{t+ε})

The original name is Hybrid Monte Carlo, with reference to the hybrid dynamical simulation method on which it was based.
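A NumPy sketch of these updates for a toy target, assuming a hypothetical log_p(z) that returns both the log-density and its gradient; the step size and trajectory length are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p(z):
    """Toy unnormalized target: a standard Gaussian. Returns (log p, grad log p)."""
    return -0.5 * np.sum(z ** 2), -z

def hmc_step(z, eps=0.1, T=20):
    """One HMC transition: sample velocity, run leapfrog, Metropolis accept."""
    v = rng.standard_normal(size=z.shape)             # v ~ N(0, I)
    logp0, grad = log_p(z)
    H0 = -logp0 + 0.5 * np.sum(v ** 2)                # H = E + K, with E = -log p
    z_new, v_new = z.copy(), v + 0.5 * eps * grad     # half-step for velocity
    for t in range(T):
        z_new = z_new + eps * v_new                   # full position step
        logp, grad = log_p(z_new)
        if t < T - 1:
            v_new = v_new + eps * grad                # full velocity step
    v_new = v_new + 0.5 * eps * grad                  # final half-step for velocity
    H1 = -logp + 0.5 * np.sum(v_new ** 2)
    if rng.random() < min(1.0, np.exp(H0 - H1)):      # accept with prob min[1, exp(H0 - H1)]
        return z_new
    return z

z = np.zeros(5)
samples = []
for _ in range(1000):
    z = hmc_step(z)
    samples.append(z.copy())
```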
HMC for Deep Generative Models
The HMC algorithm:
- Gibbs sample the velocity v ∼ N(0, I)
- Simulate leapfrog dynamics for T steps
- Accept the new position with probability min[1, exp(H(x, z_0, v_0) − H(x, z_T, v_T))]

For a deep generative model, the leapfrog dynamics require ∇_z log p_θ(x, z), which is obtained by forward and backward propagation through the generative network p_θ(x | z).
Hamiltonian Variational Inference (HVI)
Fusing the VAE and HMC:

Central idea: interpret the stochastic Markov chain (from HMC)

  q(z | x) = q(z_0 | x) ∏_{t=1}^{T} q(z_t | z_{t−1}, x)

as a variational approximation in an expanded space.

Consider y = z_0, z_1, ..., z_{T−1} to be a set of auxiliary random variables.

We obtain a new (lower) lower bound on the log-likelihood:

  L_aux = E_{q(y, z_T | x)} [ log ( p(x, z_T) r(y | x, z_T) ) − log q(y, z_T | x) ]
        = L − E_{q(z_T | x)} { D_KL[ q(y | z_T, x) || r(y | z_T, x) ] }
        ≤ L ≤ log p(x)

where r(y | z_T, x) is an auxiliary inference distribution (we choose it).
Hamiltonian Variational Inference (HVI)
Assume the auxiliary inference distribution has a Markov structure:

  r(z_0, ..., z_{T−1} | x, z_T) = ∏_{t=1}^{T} r_t(z_{t−1} | x, z_t)

With this, the lower bound becomes

  L_aux = E_{q(y, z_T | x)} [ log p(x, z_T) + log ( r(z_0, ..., z_{T−1} | x, z_T) / q(z_0, ..., z_T | x) ) ]
        = E_{q(y, z_T | x)} [ log ( p(x, z_T) / q(z_0 | x) ) + Σ_{t=1}^{T} log ( r_t(z_{t−1} | x, z_t) / q_t(z_t | x, z_{t−1}) ) ]

This will work for any MCMC method. Specializing to HMC involves some details (like accounting for the velocity); see the paper for details.
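A schematic Python sketch of the HVI lower-bound estimator summarized in Algorithm 3 on the next slide. All of log_p, draw_q0, log_q0, draw_momentum, log_qt, hamiltonian_dynamics and log_rt are hypothetical callables standing in for model components; the Metropolis-Hastings correction is omitted, as in the algorithm.

```python
def hvi_lower_bound(x, T, log_p, draw_q0, log_q0, draw_momentum, log_qt,
                    hamiltonian_dynamics, log_rt):
    """Stochastic estimate of the HVI lower bound (cf. Algorithm 3).

    Hypothetical callables:
      log_p(x, z)                -> unnormalized log posterior log p(x, z)
      draw_q0(x), log_q0(z, x)   -> sample / log-density of the initial q(z0 | x)
      draw_momentum(z, x)        -> sample v'_t ~ q_t(v'_t | z_{t-1}, x)
      log_qt(v, z, x)            -> log q_t(v'_t | z_{t-1}, x)
      hamiltonian_dynamics(z, v) -> leapfrog update giving (z_t, v_t)
      log_rt(v, z, x)            -> log r_t(v_t | z_t, x), the inverse model
    """
    z = draw_q0(x)
    L = log_p(x, z) - log_q0(z, x)                   # init: log p(x, z0) - log q(z0 | x)
    for t in range(T):
        v0 = draw_momentum(z, x)                     # initial momentum
        z_new, v_new = hamiltonian_dynamics(z, v0)   # no MH rejection step here
        # log ratio: p(x, z_t) r_t(v_t | x, z_t) / (p(x, z_{t-1}) q_t(v'_t | x, z_{t-1}))
        log_alpha = (log_p(x, z_new) + log_rt(v_new, z_new, x)
                     - log_p(x, z) - log_qt(v0, z, x))
        L = L + log_alpha
        z = z_new
    return L, z      # lower-bound estimate and approximate posterior draw z_T
```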
Hamiltonian Variational Inference (HVI)
From Markov Chain Monte Carlo and Variational Inference: Bridging the Gap (Salimans, Kingma, Welling):

Algorithm 3: Hamiltonian variational inference (HVI)
  Require: unnormalized log posterior log p(x, z)
  Require: number of iterations T
  Require: momentum initialization distribution(s) q_t(v'_t | z_{t−1}, x) and inverse model(s) r_t(v_t | z_t, x)
  Require: HMC stepsize and mass matrix ε, M
  Draw an initial random variable z_0 ∼ q(z_0 | x)
  Initialize the lower bound L = log p(x, z_0) − log q(z_0 | x)
  for t = 1 : T do
    Draw initial momentum v'_t ∼ q_t(v'_t | x, z_{t−1})
    Set z_t, v_t = Hamiltonian_Dynamics(z_{t−1}, v'_t)
    Calculate the ratio α_t = [ p(x, z_t) r_t(v_t | x, z_t) ] / [ p(x, z_{t−1}) q_t(v'_t | x, z_{t−1}) ]
    Update the lower bound L = L + log α_t
  end for
  return lower bound L, approximate posterior draw z_T

Here we omit the Metropolis-Hastings step that is typically used with Hamiltonian Monte Carlo. Compared to regular HMC, the algorithm has a number of advantages: samples drawn from q(z | x) are independent, the parameters of the Hamiltonian dynamics can be optimized, and since we optimize a lower bound we can assess convergence and the quality of the approximation using the techniques discussed for variational inference, rather than relying on MCMC convergence diagnostics or a good initialisation.

HVI: Generative model of MNIST

Table 1 (Salimans et al.): comparison to other recent methods in the literature, measured as the average marginal log-likelihood (in nats) of the digits in the MNIST test set.

  Model                               -log p(x) (bound)   -log p(x) (estimate)
  HVI + fully-connected VAE:
  Without inference network:
    5 leapfrog steps                  90.86               87.16
    10 leapfrog steps                 87.60               85.56
  With inference network:
    No leapfrog steps                 94.18               88.95
    1 leapfrog step                   91.70               88.08
    4 leapfrog steps                  89.82               86.40
    8 leapfrog steps                  88.30               85.51
  HVI + convolutional VAE:
    No leapfrog steps                 86.66               83.20
    1 leapfrog step                   85.40               82.98
    2 leapfrog steps                  85.17               82.96
    4 leapfrog steps                  84.94               82.78
    8 leapfrog steps                  84.81               82.72
    16 leapfrog steps                 84.11               82.22
    16 leapfrog steps, nh = 800       83.49               81.94

  From (Gregor et al., 2015), where two values are given they are bound and estimate:
    DBN 2hl                           84.55
    EoNADE                            85.10
    DARN 1hl                          88.30 / 84.13
    DARN 12hl                         87.72
    DRAW                              80.97

(The best result uses convolutional networks for inference and generation with 16 leapfrog steps; this is slightly better than the previously reported number with DRAW, which uses recurrent neural networks with attention for both inference and generation. The approaches are complementary and could be combined. The model can also be trained with a 2D latent space to obtain a low-dimensional visualisation of the data; see the paper's Figure 4.)
The end.
