
Variational Auto-Encoders

Diederik P. Kingma
Introduction and
Motivation
Motivation and applications
Versatile framework for unsupervised and semi-supervised
deep learning

Representation Learning. E.g.:

2D visualisation

Data-efficient learning. E.g.: semi-supervised learning

Artificial Creativity. E.g.:

Image/text resynthesis, Molecule design


Sad Kanye -> Happy Kanye

“Smile vector”. Tom White, 2016, twitter: @dribnet
Background
Probabilistic Models
x: Observed random variables

p*(x): underlying unknown process (the true data distribution)

pθ(x): model distribution

Goal: pθ(x) ≈ p*(x)

We want pθ(x) to be flexible

Conditional modeling goal: pθ(x|y) ≈ p*(x|y)


Concept 1:
Parameterization of conditional distributions
with Neural Networks
Common example: a neural network maps an input x to the parameters of a categorical distribution over class labels y (Cat, Dog, Mouse, ...):

pθ(y|x) = Categorical(y; π), π = NeuralNet(x)

[Figure: bar chart of the class probabilities output by NeuralNet(x)]
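A minimal sketch of this idea, assuming PyTorch; the layer sizes, input dimensionality and the three class labels are illustrative:

import torch
import torch.nn as nn

# Neural network that maps an input x to the parameters (class probabilities)
# of a categorical distribution p_theta(y|x).
class ConditionalCategorical(nn.Module):
    def __init__(self, x_dim=784, num_classes=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),  # unnormalized log-probabilities (logits)
        )

    def forward(self, x):
        logits = self.net(x)
        # Categorical distribution over e.g. {Cat, Dog, Mouse}
        return torch.distributions.Categorical(logits=logits)

model = ConditionalCategorical()
x = torch.randn(8, 784)          # a batch of 8 inputs
p_y_given_x = model(x)
print(p_y_given_x.probs.shape)   # (8, 3): class probabilities per input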
Concept 2:
Generalization into Directed Models (Bayesian networks)
parameterized with Neural Networks
Directed graphical models / Bayesian networks

Joint distribution factorizes as: pθ(x1, …, xM) = ∏j pθ(xj | Pa(xj))

We parameterize the conditionals using neural networks: pθ(xj | Pa(xj)) = pθ(xj | ηj), where ηj = NeuralNet(Pa(xj))

Traditionally: parameterized using probability tables


Maximum Likelihood (ML)
Log-probability of a datapoint x: log pθ(x) = ∑j log pθ(xj | Pa(xj))

Log-likelihood of an i.i.d. dataset D: log pθ(D) = ∑x∈D log pθ(x)

Optimizable with (minibatch) SGD
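A minimal training-loop sketch, assuming PyTorch; the data is random and the architecture illustrative, and maximizing the log-likelihood is implemented as minimizing the negative log-likelihood (cross-entropy) on minibatches:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical classification dataset; in practice use a real DataLoader.
X = torch.randn(1000, 784)
Y = torch.randint(0, 3, (1000,))

# p_theta(y|x) parameterized by a small neural network (logits of a categorical)
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 3))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

for step in range(100):
    idx = torch.randint(0, X.shape[0], (32,))   # sample a random minibatch
    logits = model(X[idx])
    # Negative log-likelihood of the minibatch: cross_entropy = -log p_theta(y|x)
    loss = F.cross_entropy(logits, Y[idx])
    opt.zero_grad()
    loss.backward()                             # unbiased minibatch gradient of the NLL
    opt.step()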


Concept 3:
Generalization into
Deep Latent-Variable Models
Deep Latent-Variable Model (DLVM)
Introduction of latent variables z into the graph

Latent-variable model pθ(x, z), e.g. pθ(x, z) = pθ(z) pθ(x|z),

where the conditionals are parameterized with neural networks

Advantages:

Extremely flexible: even if each conditional is simple (e.g. conditional Gaussian), the marginal likelihood can be arbitrarily complex

Disadvantage:

the marginal likelihood pθ(x) = ∫ pθ(x, z) dz is intractable
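A minimal sketch of such a generative model, assuming PyTorch: a standard-Gaussian prior p(z) and a decoder network that outputs the parameters of a conditional Gaussian pθ(x|z); dimensions are illustrative:

import torch
import torch.nn as nn

class Decoder(nn.Module):
    """p_theta(x|z): conditional Gaussian whose mean and log-variance come from a neural net."""
    def __init__(self, z_dim=2, x_dim=784):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU())
        self.mean = nn.Linear(256, x_dim)
        self.logvar = nn.Linear(256, x_dim)

    def forward(self, z):
        h = self.net(z)
        return torch.distributions.Normal(self.mean(h), torch.exp(0.5 * self.logvar(h)))

# Ancestral sampling from p_theta(x, z) = p(z) p_theta(x|z)
prior = torch.distributions.Normal(torch.zeros(2), torch.ones(2))
decoder = Decoder()
z = prior.sample()        # z ~ p(z)
x = decoder(z).sample()   # x ~ p_theta(x|z)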
DLVM: Optimization is non-trivial
By direct optimization of log p(x) ?

Intractable marg. likelihood

With expectation maximization (EM)?

Intractable posterior: p(z|x) = p(x,z)/p(x)

With MAP: point estimate of p(z|x)?

Overfits

With trad. variational EM and MCMC-EM?

Slow

And none tells us how to do fast posterior inference


Variational Autoencoders
(VAEs)
Solution: Variational Autoencoder (VAE)
Introduce qφ(z|x): a parametric approximation to the true posterior pθ(z|x)

Parameterized by another neural network

Joint optimization of q(z|x) and p(x,z)

Remarkably simple objective:

L(x; θ,φ) = E_qφ(z|x) [log pθ(x, z) − log qφ(z|x)]

the evidence lower bound (ELBO) [MacKay, 1992]
Encoder / Approximate Posterior
qφ(z|x): parametric model of the posterior
φ: variational parameters

We optimize the variational parameters φ such that: qφ(z|x) ≈ pθ(z|x)

Like a DLVM, the inference model can be (almost) any directed graphical model: qφ(z|x) = qφ(z1, …, zM|x) = ∏j qφ(zj | Pa(zj), x)

Note that traditionally, variational methods employ local (per-datapoint) variational parameters. Here we only have global φ, shared across all datapoints (amortized inference)
Evidence Lower Bound / ELBO
Objective (ELBO):
L(x; θ,φ) = E_qφ(z|x) [log pθ(x, z) − log qφ(z|x)]

Can be rewritten as:


L(x; θ,φ) = log pθ(x) − DKL(qφ(z|x) || pθ(z|x))
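A short derivation of this identity, using pθ(x, z) = pθ(x) pθ(z|x) inside the expectation:

\begin{aligned}
\mathcal{L}(x;\theta,\phi)
&= \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x) + \log p_\theta(z|x) - \log q_\phi(z|x)\big] \\
&= \log p_\theta(x) - \mathbb{E}_{q_\phi(z|x)}\big[\log q_\phi(z|x) - \log p_\theta(z|x)\big] \\
&= \log p_\theta(x) - D_{\mathrm{KL}}\big(q_\phi(z|x)\,\|\,p_\theta(z|x)\big)
\end{aligned}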

Since DKL ≥ 0, maximizing the ELBO does two things at once:

1. Maximization of log pθ(x)
=> Good marginal likelihood

2. Minimization of DKL(qφ(z|x)||pθ(z|x))
=> Accurate (and fast) posterior inference

[Figure: graphical model with latent z generating observed x, plate over N datapoints, generative parameters θ]
Stochastic Gradient Descent (SGD)
Minibatch SGD: requires unbiased gradient estimates

Reparameterization trick for continuous latent variables


[Kingma and Welling, 2013]

REINFORCE for discrete latent variables

Adam optimizer: adaptively preconditioned SGD


[Kingma and Ba, 2014]

Weight normalisation for faster convergence


[Salimans and Kingma, 2015]
ELBO as KL Divergence
Gradients
An unbiased gradient estimator of the ELBO w.r.t. the generative model parameters θ is straightforwardly obtained:

∇θ L(x; θ,φ) = E_qφ(z|x)[∇θ log pθ(x, z)] ≃ ∇θ log pθ(x, z), with z ∼ qφ(z|x)

A gradient estimator of the ELBO w.r.t. the variational parameters φ is more difficult to obtain, because the expectation is taken w.r.t. qφ(z|x), which itself depends on φ:

∇φ E_qφ(z|x)[f(z)] ≠ E_qφ(z|x)[∇φ f(z)] in general
Reparameterization Trick
Construct the following Monte Carlo estimator:

ε ∼ p(ε), z = gφ(x, ε)
L̃(x; θ,φ) = log pθ(x, z) − log qφ(z|x)

where p(ε) and g() are chosen such that z ∼ qφ(z|x)

Which has a simple Monte Carlo gradient: ∇θ,φ L̃(x; θ,φ) = ∇θ,φ [log pθ(x, z) − log qφ(z|x)]


Reparameterization Trick
This is an unbiased estimator of the exact single-datapoint ELBO gradient: E_p(ε)[∇θ,φ L̃(x; θ,φ)] = ∇θ,φ L(x; θ,φ)
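A minimal sketch of the trick for a one-dimensional Gaussian posterior, assuming PyTorch; sampling the noise ε from a parameter-free distribution and transforming it deterministically lets gradients flow back into the variational parameters:

import torch

# Variational parameters of q(z|x) = N(mu, sigma^2) for a single datapoint
mu = torch.tensor(0.5, requires_grad=True)
log_sigma = torch.tensor(-1.0, requires_grad=True)

eps = torch.randn(())                 # eps ~ p(eps) = N(0, 1), no parameters
z = mu + torch.exp(log_sigma) * eps   # z = g(eps, phi)  =>  z ~ q_phi(z|x)

# Any differentiable function of z (stand-in for log p(x,z) - log q(z|x))
objective = -0.5 * z**2
objective.backward()
print(mu.grad, log_sigma.grad)        # unbiased single-sample gradient estimates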
Reparameterization Trick
Under reparameterization, the density is given by: log qφ(z|x) = log p(ε) − log |det(∂z/∂ε)|

Important: choose transformations g() for which the log-determinant is computationally affordable/simple
Factorized Gaussian Posterior
A common choice is a simple factorized Gaussian encoder: qφ(z|x) = N(z; μ, diag(σ²)), where (μ, log σ) = EncoderNeuralNet_φ(x)

After reparameterization, we can write: ε ∼ N(0, I), z = μ + σ ⊙ ε


Factorized Gaussian Posterior
The Jacobian of the transformation is: ∂z/∂ε = diag(σ)

The determinant of a diagonal matrix is the product of its diagonal entries, so log |det(∂z/∂ε)| = ∑i log σi.

So the posterior density is:

log qφ(z|x) = log p(ε) − log |det(∂z/∂ε)| = ∑i [−½ log(2π) − ½ εi² − log σi]
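A sketch of this density computation under the factorized Gaussian reparameterization, assuming PyTorch; the hand-computed log-density matches the library's diagonal Gaussian evaluated at z = μ + σ ⊙ ε:

import math
import torch

mu = torch.randn(4)                   # encoder outputs for one datapoint (illustrative)
log_sigma = torch.randn(4) * 0.1

eps = torch.randn(4)
z = mu + torch.exp(log_sigma) * eps   # reparameterized sample, z ~ N(mu, diag(sigma^2))

# log q(z|x) = sum_i [ -1/2 log(2*pi) - 1/2 eps_i^2 - log sigma_i ]
log_q = torch.sum(-0.5 * math.log(2 * math.pi) - 0.5 * eps**2 - log_sigma)

# Sanity check against the library density
ref = torch.distributions.Normal(mu, torch.exp(log_sigma)).log_prob(z).sum()
print(torch.allclose(log_q, ref))     # True (up to numerical precision)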


Full-covariance Gaussian posterior
The factorized Gaussian posterior can be extended to a Gaussian with full covariance: qφ(z|x) = N(z; μ, Σ)

A reparameterization of this distribution with a surprisingly simple Jacobian determinant is:

ε ∼ N(0, I), z = μ + Lε

where L is a lower (or upper) triangular matrix with non-zero entries on the diagonal. The off-diagonal elements define the correlations (covariance) of the elements of z.
Full-covariance Gaussian posterior
The reason for this parameterization of the full-covariance Gaussian is that the Jacobian determinant is remarkably simple. The Jacobian is trivial: ∂z/∂ε = L

And the determinant of a triangular matrix is simply the product of its diagonal entries, so: log |det(∂z/∂ε)| = ∑i log Lii
Full-covariance Gaussian posterior
This parameterization corresponds to the Cholesky decomposition of the covariance of z: Cov(z) = E[(z − μ)(z − μ)ᵀ] = L E[ε εᵀ] Lᵀ = L Lᵀ
Full-covariance Gaussian posterior
One way to construct the matrix L is as follows:

(μ, log σ, L′) ← EncoderNeuralNet_φ(x)
L ← Lmask ⊙ L′ + diag(σ)

Lmask is a masking matrix that keeps only the strictly lower-triangular entries.

The log-determinant is identical to the factorized Gaussian case: log |det(∂z/∂ε)| = ∑i log σi
Full-covariance Gaussian posterior
Therefore, the density has the same form as in the diagonal Gaussian case: log qφ(z|x) = log p(ε) − ∑i log σi
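A sketch of this construction, assuming PyTorch; L is assembled from an unconstrained matrix by masking out everything except the strictly lower triangle and placing σ on the diagonal, so the log-determinant reduces to ∑i log σi exactly as in the factorized case:

import torch

d = 4
mu = torch.randn(d)                  # encoder outputs (illustrative)
log_sigma = torch.randn(d) * 0.1
L_raw = torch.randn(d, d)            # unconstrained matrix from the encoder

# L <- Lmask * L_raw + diag(sigma): strictly lower-triangular part from L_raw,
# positive diagonal from sigma
mask = torch.tril(torch.ones(d, d), diagonal=-1)
L = mask * L_raw + torch.diag(torch.exp(log_sigma))

eps = torch.randn(d)
z = mu + L @ eps                     # z ~ N(mu, L L^T)

# log-determinant of the (triangular) Jacobian dz/deps = L
log_det = log_sigma.sum()            # = sum of log of diagonal entries of L

# log q(z|x): same form as the diagonal Gaussian case
log_q = torch.distributions.Normal(0.0, 1.0).log_prob(eps).sum() - log_det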
Beyond Gaussian
posteriors
Normalizing Flows
Full-covariance Gaussian:

One transformation operation: ft(ε, x) = Lε

Normalizing flows:

Multiple transformation steps


Normalizing Flows
Define z ∼ qφ(z|x) as a chain of transformations: z0 = ε ∼ p(ε), zt = ft(zt−1, x) for t = 1…T, z = zT

The Jacobian of the transformation factorizes: ∂z/∂ε = ∏t ∂zt/∂zt−1, so log |det(∂z/∂ε)| = ∑t log |det(∂zt/∂zt−1)|

And the density is: log qφ(z|x) = log p(ε) − ∑t log |det(∂zt/∂zt−1)|

[Rezende and Mohamed, 2015]
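A sketch of chaining transformation steps, assuming PyTorch; each step returns its log-Jacobian-determinant so the change-of-variables terms simply add up (affine_step is an illustrative stand-in for a learned transformation such as the masked-L transform above):

import torch

def run_flow(eps, flow_steps):
    """Apply a chain of invertible transforms z0=eps -> z1 -> ... -> zT
    and accumulate log q(z|x) via the change-of-variables formula."""
    log_q = torch.distributions.Normal(0.0, 1.0).log_prob(eps).sum()
    z = eps
    for step in flow_steps:
        z, log_det = step(z)        # each step returns (new z, log |det dz_t/dz_{t-1}|)
        log_q = log_q - log_det
    return z, log_q

# Example with two toy affine steps
def affine_step(z):
    scale = torch.tensor([0.5, 2.0])
    return z * scale, torch.log(scale.abs()).sum()

z, log_q = run_flow(torch.randn(2), [affine_step, affine_step])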


Inverse Autoregressive Flows
Probably the most flexible type of transformation with a simple determinant that can be chained.

Each transformation is given by an autoregressive neural net with a triangular Jacobian

Best known way to construct arbitrarily flexible posteriors
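A sketch of one IAF step in its simple affine form, assuming PyTorch; the strictly lower-triangular mask makes the shift and log-scale autoregressive in z, giving a triangular Jacobian whose log-determinant is the sum of the log-scales (the published IAF uses a numerically stabler sigmoid-gated update and MADE/LSTM-style autoregressive networks):

import torch
import torch.nn as nn

class IAFStep(nn.Module):
    """One inverse autoregressive flow step: z_new = m(z) + s(z) * z,
    where m_i and s_i may depend only on z_{<i} (strictly lower-triangular mask),
    so the Jacobian is triangular and its log-determinant is sum(log s)."""
    def __init__(self, d):
        super().__init__()
        self.w_m = nn.Parameter(torch.randn(d, d) * 0.01)
        self.w_s = nn.Parameter(torch.zeros(d, d))
        self.b_m = nn.Parameter(torch.zeros(d))
        self.b_s = nn.Parameter(torch.zeros(d))
        self.register_buffer("mask", torch.tril(torch.ones(d, d), diagonal=-1))

    def forward(self, z):
        m = z @ (self.mask * self.w_m).T + self.b_m       # autoregressive shift
        log_s = z @ (self.mask * self.w_s).T + self.b_s   # autoregressive log-scale
        z_new = m + torch.exp(log_s) * z
        log_det = log_s.sum()                             # triangular Jacobian
        return z_new, log_det

step = IAFStep(d=4)
z, log_det = step(torch.randn(4))

A step like this returns (z, log_det) and so composes directly with the run_flow sketch above.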


Inverse Autoregressive Flow
Posteriors in 2D space
Deep IAF helps towards better likelihoods

[Kingma, Salimans and Welling, 2014]


Optimization Issues
Overpruning (latent dimensions going unused):

Solution 1: KL annealing

Solution 2: Free bits (see IAF paper); a sketch of both mitigations follows this list

‘Blurriness’ of samples

Solution: better Q or P models
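A minimal sketch combining the two mitigations mentioned above, assuming PyTorch; the free-bits threshold and warm-up length are illustrative, and the IAF paper applies free bits to groups of latent dimensions rather than to single dimensions:

import torch

def annealed_free_bits_kl(kl_per_dim, step, warmup_steps=10000, free_bits=0.5):
    """kl_per_dim: tensor of KL(q(z_j|x) || p(z_j)) per latent dimension (nats).
    Returns the KL penalty actually added to the loss."""
    # KL annealing: linearly warm the KL weight beta from 0 to 1
    beta = min(1.0, step / warmup_steps)
    # Free bits: don't penalize dimensions below the free_bits threshold,
    # which discourages the optimizer from pruning them early
    clamped = torch.clamp(kl_per_dim, min=free_bits)
    return beta * clamped.sum()

# Usage inside a training loop (kl_per_dim computed analytically for Gaussian q and p)
kl_penalty = annealed_free_bits_kl(torch.tensor([0.1, 1.3, 0.02]), step=2500)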


Better generative
models
Improving Q versus improving P
PixelVAE
Use PixelCNN models as p(x|z) and p(z) models

No need for complicated q(z|x): just factorized Gaussian

[Gulrajani et al, 2016]


PixelVAE

[Gulrajani et al, 2016]


Applications
Visualisation
of Data in 2D
Representation learning

[Figure: data x encoded into a 2D latent space z]
Semi-supervised
learning
SSL With Auxiliary VAE

[Maaløe et al, 2016]


Data-efficient learning on ImageNet:

from 10% to 60% accuracy with only 1% of the data labeled

[Pu et al, “Variational Autoencoder for Deep Learning of Images, Labels and Captions”, 2016]
(Re)Synthesis
Analogies
Analogy-making
Automatic chemical design
VAE trained on text representation of 250K molecules

Uses latent space to design new drugs and organic LEDs

[Gómez-Bombarelli et al, 2016]


Semantic Editing
“Smile vector”. Tom White, 2016, twitter: @dribnet
Semantic Editing
“Neural Photo Editing”. Andrew Brock et al, 2016
Questions?
