Presentation - Deeplearning2015 Courville Autoencoder Extension 01
Extensions
Aaron Courville
Outline
Variational autoencoder (VAE)
DRAW model
[Figure: data x lying on a low-dimensional manifold, with latent coordinates z mapped onto the manifold by a function g(z). Image from: Ward, A. D., Hamarneh, G.: 3D Surface Parameterization Using Manifold Learning for Medial Shape Representation, Conference on Image Processing, Proc. of SPIE Medical Imaging, 2007]
[Figure 4 (AEVB paper): Visualisations of learned data manifolds for generative models with two-dimensional latent space, learned with AEVB. Since the prior of the latent space is Gaussian, linearly spaced coordinates were transformed through the inverse CDF of the Gaussian to produce values of the latent variables z; for each of these values z, the corresponding generative output is plotted. (a) Learned Frey Face manifold (pose and expression axes). (b) Learned MNIST manifold.]
The inference / learning challenge
[Figure: the same manifold picture, now with the mapping g(z) from latent coordinates z to data x unknown: given only the data x, we must infer z and learn g.]
Variational Autoencoder (VAE)
Where does z come from? The classic DAG problem.
What is q_φ(z | x)?
[Figure: encoder f(x) outputs μ_z(x) and σ_z(x), defining q_φ(z | x); decoder g(z) outputs μ_x(z) and σ_x(z), defining p_θ(x | z). Forward propagation runs through the encoder and decoder; backward propagation carries gradients back through both.]
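A minimal numpy sketch of the computation the figure depicts (encoder → reparameterized sample → decoder → ELBO); the layer sizes, Bernoulli likelihood, and single-sample Monte Carlo estimate are illustrative assumptions, not the exact architecture from the lecture.

```python
# Minimal numpy sketch of one VAE forward pass: encoder q(z|x) -> sample z
# via the reparameterization trick -> decoder p(x|z) -> Monte Carlo ELBO.
import numpy as np

rng = np.random.default_rng(0)
x_dim, h_dim, z_dim = 784, 256, 2

# Illustrative parameters (in practice learned by backpropagation).
W_enc = rng.normal(0, 0.01, (h_dim, x_dim)); b_enc = np.zeros(h_dim)
W_mu = rng.normal(0, 0.01, (z_dim, h_dim)); W_logvar = rng.normal(0, 0.01, (z_dim, h_dim))
W_dec = rng.normal(0, 0.01, (h_dim, z_dim)); b_dec = np.zeros(h_dim)
W_out = rng.normal(0, 0.01, (x_dim, h_dim)); b_out = np.zeros(x_dim)

def elbo(x):
    # Encoder f(x): produces mu_z(x) and log sigma^2_z(x), i.e. q(z|x).
    h = np.tanh(W_enc @ x + b_enc)
    mu_z, logvar_z = W_mu @ h, W_logvar @ h
    # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I).
    z = mu_z + np.exp(0.5 * logvar_z) * rng.normal(size=z_dim)
    # Decoder g(z): Bernoulli means for p(x|z).
    mu_x = 1 / (1 + np.exp(-(W_out @ np.tanh(W_dec @ z + b_dec) + b_out)))
    log_px_given_z = np.sum(x * np.log(mu_x + 1e-9) + (1 - x) * np.log(1 - mu_x + 1e-9))
    # KL(q(z|x) || N(0, I)) in closed form for a diagonal Gaussian.
    kl = 0.5 * np.sum(np.exp(logvar_z) + mu_z**2 - 1 - logvar_z)
    return log_px_given_z - kl

print(elbo(rng.integers(0, 2, x_dim).astype(float)))
```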
Relative performance of VAE
[Figure 3 (AEVB paper): Comparison of AEVB to the wake-sleep algorithm and Monte Carlo EM, in terms of the estimated marginal likelihood, for a different number of training points. Monte Carlo EM is not an on-line algorithm, and (unlike AEVB and the wake-sleep method) can't be applied efficiently for the full MNIST dataset.]
Semi-supervised Learning with Deep Generative Models
Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed, Max Welling (NIPS 2014)
- Develops models for semi-supervised learning that exploit generative descriptions of the data to improve upon the classification performance that would be obtained using the labelled data alone (the labelled and unlabelled subsets are written p̃_l(x, y) and p̃_u(x)).
They study two basic approaches:
M1: Standard unsupervised feature learning (self-taught learning)
- Train latent features z on unlabeled data, then train a classifier to map from z to the label y.
- Generative model (recall x = data, z = latent variables / features):
  p(z) = N(z | 0, I);   p_θ(x | z) = f(x; z, θ)   (1)
  where f(x; z, θ) is a suitable likelihood function (e.g. a Gaussian or Bernoulli distribution) whose probabilities are formed by a non-linear transformation, with parameters θ, of the latent variables z; these non-linear functions are chosen to be deep neural networks.
- Approximate samples from the posterior distribution over the latent variables p(z | x) are used as features to train a classifier that predicts class labels y, such as a (transductive) SVM or multinomial regression. Since the latent variables typically have much lower dimensionality than the observations, classification is performed in a lower-dimensional space, and these low-dimensional embeddings should also be more easily separable.
M2: Generative semi-supervised model
- A probabilistic model that describes the data as being generated by a latent class variable y in addition to a continuous latent variable z:
  p(y) = Cat(y | π);   p(z) = N(z | 0, I);   p_θ(x | y, z) = f(x; y, z, θ)   (2)
  where Cat(y | π) is the multinomial distribution, the class labels y are treated as latent variables if no class label is available, and z are additional latent variables.
- y and z are marginally independent, which allows, e.g. in digit generation, the class specification to be separated from the writing style of the digit.
- Since most labels y are unobserved, the class of any unlabelled data is integrated over during inference, thus performing classification as inference; predictions for missing labels are obtained from the inferred posterior p_θ(y | x).
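A rough sketch of ancestral sampling from the M2 generative model p(y) p(z) p_θ(x | y, z); the one-layer decoder and Bernoulli likelihood are stand-ins for the deep networks used in the paper.

```python
# Sketch of ancestral sampling from M2: draw a class y, a style z, then x.
import numpy as np

rng = np.random.default_rng(0)
n_classes, z_dim, x_dim = 10, 50, 784
pi = np.full(n_classes, 1.0 / n_classes)          # p(y) = Cat(y | pi)
W_y = rng.normal(0, 0.01, (x_dim, n_classes))     # illustrative decoder weights
W_z = rng.normal(0, 0.01, (x_dim, z_dim))
b = np.zeros(x_dim)

def sample_x():
    y = rng.choice(n_classes, p=pi)               # class label (latent if unobserved)
    z = rng.normal(size=z_dim)                    # p(z) = N(0, I): style variable
    y_onehot = np.eye(n_classes)[y]
    probs = 1 / (1 + np.exp(-(W_y @ y_onehot + W_z @ z + b)))  # f(x; y, z, theta)
    x = rng.binomial(1, probs)                    # Bernoulli likelihood p(x | y, z)
    return x, y, z

x, y, z = sample_x()
print(x.shape, y)
```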
Semi-supervised Learning with Deep Generative Models
Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed, Max Welling (NIPS 2014)
M1+M2: Combination semi-supervised model
- Stacked generative semi-supervised model: first learn a new latent representation z1 using the generative model from M1, and subsequently learn the generative semi-supervised model M2 using the embeddings from z1 instead of the raw data x.
- The result is a deep generative model with two layers of stochastic variables:
  p_θ(x, y, z1, z2) = p(y) p(z2) p_θ(z1 | y, z2) p_θ(x | z1)
  where the priors p(y) and p(z2) equal those of y and z above, and both p_θ(z1 | y, z2) and p_θ(x | z1) are parameterised as deep neural networks.
Scalable variational inference (lower bound objective)
- In all these models, computation of the exact posterior distribution is intractable due to the nonlinear, non-conjugate dependencies between the random variables.
- To allow for tractable and scalable inference and parameter learning, recent advances in variational inference are exploited (Kingma and Welling, 2014; Rezende et al., 2014): a fixed-form distribution q_φ(z | x) with parameters φ approximates the true posterior p(z | x), and the variational principle gives a lower bound on the marginal likelihood of the model; this bound forms the objective function and ensures that the approximate posterior is as close as possible to the true posterior.
Semi-supervised Learning with Deep Generative Models
Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed, Max Welling (NIPS 2014)
Approximate posterior (encoder model)
- Inference networks are a popular approach for efficient variational inference (Dayan, 2000; Kingma and Welling, 2014; Rezende et al., 2014; Stuhlmuller et al., 2013). Using an inference network, we avoid the need to compute per-data-point variational parameters and instead compute a set of global variational parameters φ. This amortises the cost of inference by generalising between the posterior estimates for all latent variables through the parameters of the inference network, and allows for fast inference at both training and testing time (unlike with variational EM, in which the E-step optimisation is repeated for every test data point).
- Following the VAE strategy, we parametrize the approximate posterior with a high capacity model, like an MLP or some other deep model (convnet, RNN, etc).
- The inference networks are deep neural networks whose outputs form the parameters of the distribution q_φ(·). For the latent-feature discriminative model (M1), a Gaussian inference network q_φ(z | x) is used for the latent variable z. For the generative semi-supervised model (M2), an inference model is introduced for each of the latent variables z and y, assumed to have the factorised form q_φ(z, y | x) = q_φ(z | x) q_φ(y | x), specified as Gaussian and multinomial distributions respectively:
  M1: q_φ(z | x) = N(z | μ_φ(x), diag(σ²_φ(x)))   (3)
  M2: q_φ(z | y, x) = N(z | μ_φ(y, x), diag(σ²_φ(x)));   q_φ(y | x) = Cat(y | π_φ(x))   (4)
- μ_φ and σ_φ are parameterized by deep MLPs that can share parameters.
[Diagrams: the graphical structure of M1 (latent z, observation x) and M2 (latents z and y, observation x).]
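A sketch of what Eqs. (3)-(4) could look like as code for M2: q_φ(y | x) as a softmax classifier and q_φ(z | y, x) as a diagonal Gaussian whose parameters come from MLPs. The shared trunk, layer sizes, and tanh nonlinearities are illustrative assumptions, not the paper's exact networks.

```python
# Sketch of the M2 inference networks: q(y|x) and q(z|y,x) with shared trunk.
import numpy as np

rng = np.random.default_rng(0)
x_dim, n_classes, z_dim, h_dim = 784, 10, 50, 128
W_h = rng.normal(0, 0.01, (h_dim, x_dim))         # illustrative shared trunk
W_pi = rng.normal(0, 0.01, (n_classes, h_dim))
W_mu = rng.normal(0, 0.01, (z_dim, h_dim + n_classes))
W_logvar = rng.normal(0, 0.01, (z_dim, h_dim))

def q_y_given_x(x):
    h = np.tanh(W_h @ x)
    logits = W_pi @ h
    e = np.exp(logits - logits.max())
    return e / e.sum()                            # Cat(y | pi_phi(x))

def q_z_given_xy(x, y):
    h = np.tanh(W_h @ x)                          # parameters shared with q(y|x)
    y_onehot = np.eye(n_classes)[y]
    mu = W_mu @ np.concatenate([h, y_onehot])     # mu_phi(y, x)
    logvar = W_logvar @ h                         # log sigma^2_phi(x)
    z = mu + np.exp(0.5 * logvar) * rng.normal(size=z_dim)   # reparameterized sample
    return z, mu, logvar

x = rng.integers(0, 2, x_dim).astype(float)
pi = q_y_given_x(x)
z, mu, logvar = q_z_given_xy(x, int(pi.argmax()))
print(pi.shape, z.shape)
```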
- The overall objective sums a term L(x, y) for each labelled example and a term U(x) for each unlabelled example (each derived from the variational bound):
  J = Σ_{(x,y)~p̃_l} L(x, y) + Σ_{x~p̃_u} U(x)   (8)
- The distribution q_φ(y | x) (Eq. 4) for the missing labels has the form of a discriminative classifier, and we can use this knowledge to construct the best classifier possible as our inference model. This distribution is also what is actually used at test time for predictions on any unseen data.
- In the objective function (8), however, the label predictive distribution q_φ(y | x) contributes only to the second term, relating to the unlabelled data, which is an undesirable property if we wish to use this distribution as a classifier; ideally, all model and variational parameters should learn in all cases. To remedy this, a classification loss is added to (8), such that q_φ(y | x) also learns from labelled data. The extended objective function is:
  J^α = J + α · E_{p̃_l(x,y)}[ −log q_φ(y | x) ]
  where the hyper-parameter α controls the relative weight between generative and discriminative learning; α = 0.1·N is used in all experiments. Beyond the motivation that all model components should learn at all times, this extended objective can also be derived directly using the variational principle.
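A hedged sketch of how the extended objective J^α is assembled. The per-example bound terms L(x, y) and U(x) and the classifier q_φ(y | x) are stand-in functions here (the real ones come from the paper's variational bounds); the point is only how the α-weighted classification loss is added so that q_φ(y | x) also learns from labelled data.

```python
# Sketch of the extended semi-supervised objective J^alpha with stand-in terms.
import numpy as np

def L_bound(x, y):               # stand-in for the labelled-data term L(x, y)
    return 0.0

def U_bound(x):                  # stand-in for the unlabelled-data term U(x)
    return 0.0

def log_q_y_given_x(x, y):       # stand-in for the classifier log q_phi(y | x)
    return np.log(0.1)

def J_alpha(labelled, unlabelled, alpha):
    J = sum(L_bound(x, y) for x, y in labelled) + sum(U_bound(x) for x in unlabelled)
    # Extra classification loss, weighted by alpha (the paper uses alpha = 0.1 * N):
    return J + alpha * sum(-log_q_y_given_x(x, y) for x, y in labelled)

labelled = [(np.zeros(784), 3)]
unlabelled = [np.zeros(784)]
print(J_alpha(labelled, unlabelled, alpha=1.0))
```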
Semi-supervised MNIST classification results
Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed, Max Welling (NIPS 2014)
- Full MNIST test error (non-convolutional): 0.96%
- for comparison, current SOTA: 0.61%
- Open source code, with which the most important results and figures can be reproduced, is available at http://github.com/dpkingma/nips14-ssl. For the latest experimental results, please see http://arxiv.org/abs/1406.5298.
[Figure 1: (a) Visualisation of handwriting styles learned by the model with 2D z-space, obtained by fixing the class label and varying the 2D latent variable z. (b, c) Analogical reasoning with generative semi-supervised models using a high-dimensional z-space. The leftmost columns show images from the test set. The other columns show analogical fantasies of x by the generative model, where the latent variable z of each row is set to the value inferred from the test-set image on the left by the inference network. Each column corresponds to a class label y.]
VRNN: Model Structure
Variational recurrent neural network (VRNN) is a recurrent
(conditional) application of the VAE at every time-step.
[Figure: VAE vs. VRNN = VAE + recurrence.]
[Figure 1 (VRNN paper): Graphical illustrations of each operation in the proposed VRNN: (a) computing the prior using Eq. (5); (b) generating function using Eq. (6); (c) updating the recurrence part using Eq. (7); (d) inference of the approximate posterior using Eq. (9); (e) overall computational paths of the VRNN.]
VRNN: Prior on z_t
- Generative model factorizes over time.
VRNN: Inference
- In a similar fashion, the approximate posterior will not only be a function of x_t but also of h_{t-1}, following the equation:
  z_t | x_t ~ N(μ_{z,t}, diag(σ²_{z,t})), where [μ_{z,t}, σ_{z,t}] = φ^enc_τ(φ^x_τ(x_t), h_{t-1})   (9)
  where μ_{z,t} and σ_{z,t} denote the parameter set of the approximate posterior.
VRNN: Learning
- Learning is accomplished via gradient backpropagation:
  - through the decoder and encoder, as in the standard VAE,
  - and through the recurrent connections, as in the standard RNN.
- The encoding of the approximate posterior and the decoding for generation are tied through the hidden state h_{t-1}; this results in the factorization
  q(z_≤T | x_≤T) = ∏_{t=1}^{T} q(z_t | x_≤t, z_<t)   (10)
- Objective function (a factored version of the variational lower bound): the factorization (10) is crucial in breaking the variational lower bound into timestep-wise terms,
  ∫ log [ p(x_≤T, z_≤T) / q(z_≤T | x_≤T) ] dq(z_≤T | x_≤T) = Σ_{t=1}^{T} { −KL(q(z_t | x_≤t, z_<t) ‖ p(z_t | x_<t, z_<t)) + E_{q(z_t | x_≤t, z_<t)}[ log p(x_t | z_≤t, x_<t) ] }
- As in the standard VAE, the generative and inference models are learned jointly by maximizing the variational lower bound with respect to their parameters.
- The schematic view of the VRNN is shown in Fig. 1; each of the operations (a)-(d) corresponds to one of Eqs. (5), (6), (7), (9). The proposed network applies operation (a), hence it has a sequential prior (VRNN, see Eq. (5)); the variant that does not apply operation (a) has a prior that is independent across timesteps (VRNN-I). The STORN [2] model can be considered an instance of the VRNN-I model family; in fact, STORN makes further restrictions on the dependency structure of the approximate inference model. This version (VRNN-I) is included in the experimental evaluation in order to directly study the impact of including the temporal dependency structure in the prior (sequential prior) over the latent random variables.
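A toy one-timestep VRNN sketch (sequential prior, encoder, decoder, recurrence, and the per-step term of the factored bound); collapsing all feature extractors into single linear/tanh layers is an illustrative simplification, not the paper's architecture.

```python
# Sketch of one VRNN time-step and the per-step ELBO term.
import numpy as np

rng = np.random.default_rng(0)
x_dim, z_dim, h_dim = 20, 8, 32
P = {k: rng.normal(0, 0.1, s) for k, s in {
    "prior": (2 * z_dim, h_dim), "enc": (2 * z_dim, x_dim + h_dim),
    "dec": (x_dim, z_dim + h_dim), "rnn": (h_dim, x_dim + z_dim + h_dim)}.items()}

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    return 0.5 * np.sum(logvar_p - logvar_q
                        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p) - 1)

def vrnn_step(x_t, h_prev):
    # (a) sequential prior p(z_t | h_{t-1})
    mu_p, logvar_p = np.split(P["prior"] @ h_prev, 2)
    # (d) approximate posterior q(z_t | x_t, h_{t-1}), cf. Eq. (9)
    mu_q, logvar_q = np.split(P["enc"] @ np.concatenate([x_t, h_prev]), 2)
    z_t = mu_q + np.exp(0.5 * logvar_q) * rng.normal(size=z_dim)
    # (b) decoder p(x_t | z_t, h_{t-1}); unit-variance Gaussian here
    mu_x = P["dec"] @ np.concatenate([z_t, h_prev])
    log_px = -0.5 * np.sum((x_t - mu_x) ** 2 + np.log(2 * np.pi))
    # (c) recurrence: h_t depends on x_t, z_t and h_{t-1}
    h_t = np.tanh(P["rnn"] @ np.concatenate([x_t, z_t, h_prev]))
    # per-step term of the factored lower bound
    return h_t, log_px - gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)

h, elbo = np.zeros(h_dim), 0.0
for x_t in rng.normal(size=(5, x_dim)):           # a length-5 toy sequence
    h, term = vrnn_step(x_t, h)
    elbo += term
print(elbo)
```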
Table 1: Average log-probability on the test (or validation) set of each task
(for the VRNN models, "≥" is the variational lower bound and "≈" the approximated marginal log-likelihood).

               |           Speech modelling               | Handwriting
  Models       | Blizzard   TIMIT    Onomatopoeia  Accent | IAM-OnDB
  RNN-Gauss    |   3539     -1900        -984       -1293 |   1016
  RNN-GMM      |   7413     26643       18865        3453 |   1358
  VRNN-I-Gauss | ≥ 8933   ≥ 28340     ≥ 19053      ≥ 3843 | ≥ 1332
               | ≈ 9188   ≈ 29639     ≈ 19638      ≈ 4180 | ≈ 1353
  VRNN-Gauss   | ≥ 9223   ≥ 28805     ≥ 20721      ≥ 3952 | ≥ 1337
               | ≈ 9516   ≈ 30235     ≈ 21332      ≈ 4223 | ≈ 1354
  VRNN-GMM     | ≥ 9107   ≥ 28982     ≥ 20849      ≥ 4140 | ≥ 1384
               | ≈ 9392   ≈ 29604     ≈ 21219      ≈ 4319 | ≈ 1384
1. Blizzard: This text-to-speech dataset made available by the Blizzard Challenge 2013 contains 300 hours of English spoken by a single female speaker [9].
VRNN: Speech synthesis
[Figure 2: The top row represents the difference δ_t = |μ_{z,t} − μ_{z,t-1}| between μ_{z,t} and μ_{z,t-1}. The middle row represents the dominant KL divergence values shown in temporal order. The bottom row shows the corresponding input waveforms.]
- Hidden layers belonging to either φ^x_τ or φ^dec_τ use 800 units for Blizzard; the models using GMM outputs (RNN-GMM & VRNN-GMM) have 20 mixture components.
- For qualitative analysis of speech, larger models are trained to generate sequences, but again controlling the number of parameters; for all models, stacked RNNs with three recurrent hidden layers are used.
VRNN: Writing synthesis
- Predicting a sequence of (x, y) locations of the next destination of the pen.
[Figure: (a) Ground Truth (b) RNN-GMM (c) VRNN-Gauss]
[Figure 3 caption: Typical training examples and generated samples from RNN-GMM and VRNN-Gauss. Top three rows show the global waveforms while the bottom three rows show more zoomed-in waveforms. Samples from (b) RNN-GMM contain high-frequency noise, while (c) VRNN-Gauss generates samples with less noise. RNN-Gauss is excluded because its samples are almost close to pure noise.]
DRAW
Deep Recurrent Attentive Writer
Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, Daan Wierstra
Google Deepmind - (ICML, 2015)
DRAW: Deep Recurrent Attentive Writer
[Figure: the VAE building block used by DRAW: a read operation extracts information from the input x, an encoder MLP maps it to the latent z, and a decoder MLP maps z back toward the output.]
DRAW Model
[Figure: the DRAW model unrolled over time: read operations, latents z_t and z_{t+1}, decoder RNNs with hidden state h^dec_{t-1}, and the canvas at t.]
DRAW Without Attention
- In the simplest instantiation of DRAW, the entire input image is passed to the encoder at every time-step, and the decoder modifies the entire canvas matrix at every time-step. The read and write operations reduce to:
  read(x, x̂_t, h^dec_{t-1}) = [x, x̂_t]   (17)
  write(h^dec_t) = W(h^dec_t)   (18)
- This does not allow the encoder to focus on only part of the input when creating the latent distribution, nor does it allow the decoder to modify only a part of the canvas vector. In other words, it does not provide an explicit selective attention mechanism, which is believed to be crucial to large scale image generation. The above configuration is referred to as DRAW without attention.
- DRAW can use a differentiable attention mechanism.
- Attention uses recurrence (via the decoder) to select subsets of x for reading and writing.
- Attention controls the extracted patch location, scale and blur.
[Figure 3 (DRAW paper): Left: A 3×3 grid of filters superimposed on an image. The stride (δ) and centre location (gX, gY) are indicated. Right: Three N×N patches extracted from the image (N = 12). The green rectangles on the left indicate the boundary and precision (σ) of the patches, while the patches themselves are shown to the right. The top patch has a small δ and high σ, giving a zoomed-in view.]
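A sketch of the DRAW-without-attention loop built around Eqs. (17)-(18); the tanh recurrences stand in for the LSTM encoder/decoder of the paper, and the error image x̂_t = x − σ(c_{t−1}) follows the paper's formulation. Sizes and weights are illustrative.

```python
# Sketch of the DRAW-without-attention loop: read/write are Eqs. (17)-(18),
# and the canvas c is accumulated additively over T steps.
import numpy as np

rng = np.random.default_rng(0)
x_dim, h_dim, z_dim, T = 784, 128, 10, 8
W = {k: rng.normal(0, 0.01, s) for k, s in {
    "enc": (h_dim, 2 * x_dim + 2 * h_dim), "mu": (z_dim, h_dim),
    "logvar": (z_dim, h_dim), "dec": (h_dim, z_dim + h_dim),
    "write": (x_dim, h_dim)}.items()}

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def draw_forward(x):
    c = np.zeros(x_dim)                        # canvas
    h_enc, h_dec = np.zeros(h_dim), np.zeros(h_dim)
    for _ in range(T):
        x_hat = x - sigmoid(c)                 # error image
        r = np.concatenate([x, x_hat])         # read(x, x_hat, h_dec) = [x, x_hat]  (17)
        h_enc = np.tanh(W["enc"] @ np.concatenate([r, h_dec, h_enc]))
        mu, logvar = W["mu"] @ h_enc, W["logvar"] @ h_enc
        z = mu + np.exp(0.5 * logvar) * rng.normal(size=z_dim)
        h_dec = np.tanh(W["dec"] @ np.concatenate([z, h_dec]))
        c = c + W["write"] @ h_dec             # write(h_dec) = W(h_dec)  (18)
    return sigmoid(c)                          # final reconstruction means

x = rng.integers(0, 2, x_dim).astype(float)
print(draw_forward(x).shape)
```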
DRAW Attention Mechanism
- Read and write use Gaussian grid filters: an N × N grid of 2D Gaussian filters positioned on the image, parameterised by the centre location (gX, gY), stride δ, and precision σ (see Figure 3).
[Figure 4 (DRAW paper): Zooming. Top Left: The original 100×75 image. Top Middle: A 12×12 patch extracted with 144 2D Gaussian filters. Top Right: The reconstructed image when applying transposed filters on the patch. Bottom: Only two of the 2D Gaussian filters are shown.]
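A rough numpy sketch of the Gaussian filterbank read described by Figures 3-4: an N × N grid of 1D Gaussian filters with centre (gX, gY), stride δ and variance σ² extracts an N × N patch as F_Y · image · F_Xᵀ. The intensity scalar and the learned mapping from the decoder state to the attention parameters are omitted; the example values are arbitrary.

```python
# Sketch of the DRAW read attention with a grid of Gaussian filters.
import numpy as np

def filterbank(g, delta, sigma2, N, size):
    # Filter centres spaced by delta around the grid centre g (grid index i = 1..N).
    mu = g + (np.arange(1, N + 1) - N / 2 - 0.5) * delta
    a = np.arange(size)
    F = np.exp(-((a[None, :] - mu[:, None]) ** 2) / (2 * sigma2))
    return F / (F.sum(axis=1, keepdims=True) + 1e-8)   # normalize each filter row

def read_patch(img, gX, gY, delta, sigma2, N):
    B, A = img.shape
    F_X = filterbank(gX, delta, sigma2, N, A)           # N x A
    F_Y = filterbank(gY, delta, sigma2, N, B)           # N x B
    return F_Y @ img @ F_X.T                             # N x N patch

rng = np.random.default_rng(0)
img = rng.random((75, 100))                              # a 100 x 75 image (rows x cols)
patch = read_patch(img, gX=50.0, gY=35.0, delta=3.0, sigma2=2.0, N=12)
print(patch.shape)                                       # (12, 12)
```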
Inference in the VAE
VAE inference approximates the posterior p_θ(z | x) with a distribution that is conditionally independent in z:
  q_φ(z | x) = ∏_i q_φ(z_i | x)
[Figure: graphical model with latent variables z1, z2 and observations x1, x2, x3.]
Can parametrize some distribution (e.g. a full covariance Gaussian), but what is the right distribution?
Normalizing Flows
How do we specify a complicated joint distribution over z?
- Idea: transform a simple initial distribution q_0(z_0) through a sequence of invertible transformations f_1, ..., f_K.
- At the end of the sequence, we have a valid (maybe complex) probability distribution:
  E_{q_K}[g(z_K)] = E_{q_0}[g(f_K ∘ f_{K-1} ∘ ··· ∘ f_2 ∘ f_1(z_0))]
- Computing Jacobian determinants can be numerically unstable, so we want flows that allow for low-cost computation of the determinant, or where the Jacobian is not needed at all.
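For completeness, the standard change-of-variables identity that goes with the expectation rule above: the log-density of the transformed sample is the base log-density minus the accumulated log-Jacobian determinants of the individual maps.

```latex
\ln q_K(z_K) = \ln q_0(z_0) - \sum_{k=1}^{K} \ln \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|,
\qquad z_k = f_k(z_{k-1}), \quad k = 1, \dots, K.
```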
Normalizing Flows for VAE posteriors
! "
Normalizing flow integration into the VAE: f (z) = z + uh w z + b
Figure 1. Effect of normalizing flow on two distributions.
Normalizing Flow
(10)
e pa-
arity,
mpute
matrix
(11)
. (12) Inference network Generative model
52
Friday, August 14, 15
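A small numpy sketch of a stack of planar flow layers with their log-det-Jacobian terms (Eqs. 10-12). The random, unconstrained parameters are purely illustrative; in practice u is constrained so each layer stays invertible, and the flow parameters are produced by the inference network.

```python
# Planar flow layers f(z) = z + u * h(w^T z + b) applied to a base Gaussian sample.
import numpy as np

rng = np.random.default_rng(0)
D, K = 2, 8                                   # latent dim, number of flow layers
flows = [(rng.normal(size=D), rng.normal(size=D), rng.normal()) for _ in range(K)]

def planar_forward(z, u, w, b):
    a = w @ z + b
    f_z = z + u * np.tanh(a)                  # f(z) = z + u h(w^T z + b)      (10)
    psi = (1 - np.tanh(a) ** 2) * w           # psi(z) = h'(w^T z + b) w       (11)
    log_det = np.log(np.abs(1 + u @ psi))     # |det df/dz| = |1 + u^T psi(z)| (12)
    return f_z, log_det

z = rng.normal(size=D)                        # z_0 ~ q_0 = N(0, I)
log_q = -0.5 * (z @ z + D * np.log(2 * np.pi))
for u, w, b in flows:
    z, log_det = planar_forward(z, u, w, b)
    log_q -= log_det                          # ln q_K(z_K) = ln q_0(z_0) - sum ln|det|
print(z, log_q)
```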
Variational and MCMC inference
Variational inference (à la VAE) and MCMC inference have different properties:
- Variational inference (VI) is efficient / MCMC is computationally intensive
- VI has a fixed parametric form / MCMC asymptotically approaches p(z | x)
Hamiltonian (Hybrid) Monte Carlo
Basic Idea:
- Consider sampling from the posterior p(z | x) as a physics simulation of a frictionless ball rolling on the potential energy surface E(x, z) = −log p(x, z).
The algorithm:
- Augment with a velocity v with kinetic energy K(v) = vᵀv / 2
  - Gibbs sample the velocity: v ~ N(0, I)
- Total energy = Hamiltonian: H(x, z, v) = E(x, z) + K(v)
  - Simulate leapfrog dynamics for L steps
Leapfrog dynamics:
  v_{t+ε/2} = v_t + (ε/2) ∇_z log p(x, z_t)
  z_{t+ε} = z_t + ε v_{t+ε/2}
  v_{t+ε} = v_{t+ε/2} + (ε/2) ∇_z log p(x, z_{t+ε})
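A sketch of the leapfrog update above on a toy target where ∇_z log p(x, z) = −z, followed by the usual Metropolis accept/reject on the change in the Hamiltonian; the target, step size and number of steps are placeholders.

```python
# Leapfrog integration + Metropolis correction for a standard-normal potential.
import numpy as np

def grad_log_p(z):                             # stand-in for grad_z log p(x, z)
    return -z

def energy(z):                                 # E(x, z) = -log p(x, z), up to a constant
    return 0.5 * z @ z

def leapfrog(z, v, eps, L):
    v = v + 0.5 * eps * grad_log_p(z)          # initial half-step for the velocity
    for _ in range(L - 1):
        z = z + eps * v                        # full position step
        v = v + eps * grad_log_p(z)            # full velocity step
    z = z + eps * v
    v = v + 0.5 * eps * grad_log_p(z)          # final half-step
    return z, v

rng = np.random.default_rng(0)
z = rng.normal(size=2)
v = rng.normal(size=2)                         # Gibbs-sampled velocity, v ~ N(0, I)
H0 = energy(z) + 0.5 * v @ v                   # Hamiltonian H = E(x, z) + K(v)
z_new, v_new = leapfrog(z, v, eps=0.1, L=20)
H1 = energy(z_new) + 0.5 * v_new @ v_new
accept = rng.random() < np.exp(H0 - H1)        # Metropolis accept/reject step
z = z_new if accept else z
print(accept, H0, H1)
```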
Hamiltonian Variational Inference (HVI)
Fusing the VAE and HMC:
Central Idea: Interpret the stochastic Markov chain produced by HMC as the variational posterior:
  q(z | x) = q(z_0 | x) ∏_{t=1}^{T} q(z_t | z_{t-1}, x)
This will work for any MCMC method. Specializing to HMC involves some details (like considering the velocity). See paper for details.