05 VAE

Recap: Auto-encoders and PCA
[Figure: a linear auto-encoder — the encoder $w$ maps the input $(x_1, x_2)$ to the code $(y_1, y_2)$, and the DECODER $w^T$ produces the reconstruction $\hat{x}$.]
• The reconstruction is $\hat{x} = w^T w x$
• The divergence between reconstruction and input is
$$\mathrm{div}(\hat{x}, x) = \|x - \hat{x}\|^2 = \|x - w^T w x\|^2$$
• The weights are chosen to minimize the expected divergence:
$$W = \underset{W}{\arg\min}\ \mathbb{E}\big[\mathrm{div}(\hat{x}, x)\big] = \underset{W}{\arg\min}\ \mathbb{E}\big[\|x - w^T w x\|^2\big]$$
Why generative models? Intrinsic to task
https://arxiv.org/abs/1611.07004
Example: Super resolution
https://arxiv.org/abs/1609.04802
Why generative models? Insight
manifolds?
Factor Analysis
• Generative model: Assumes that the data are generated from real-valued latent variables
$$p(x_i \mid W, \mu, \Psi) = \int \mathcal{N}(x_i \mid W z_i + \mu,\ \Psi)\ \mathcal{N}(z_i \mid \mu_0, \Sigma_0)\,\mathrm{d}z_i = \mathcal{N}(x_i \mid W\mu_0 + \mu,\ \Psi + W \Sigma_0 W^T)$$
Note that we can rewrite this as:
$$p(x_i \mid \widetilde{W}, \hat{\mu}, \Psi) = \mathcal{N}(x_i \mid \hat{\mu},\ \Psi + \widetilde{W}\widetilde{W}^T)$$
where $\hat{\mu} = W\mu_0 + \mu$ and $\widetilde{W} = W\Sigma_0^{1/2}$.
Thus without loss of generality (since $\mu_0, \Sigma_0$ are absorbed into learnable parameters) we let:
$$p(z_i) = \mathcal{N}(z_i \mid 0, I)$$
and find:
$$p(x_i \mid W, \mu, \Psi) = \mathcal{N}(x_i \mid \mu,\ \Psi + WW^T)$$
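To make the generative story concrete, here is a small sampling sketch (assuming NumPy; the dimensions and parameter values are illustrative, not from the slides) that draws from the factor-analysis model and checks that the empirical covariance of $x$ approaches $\Psi + WW^T$:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, N = 5, 2, 200_000                         # observed dim, latent dim, sample count (illustrative)

W = rng.normal(size=(D, K))                     # factor loadings
mu = rng.normal(size=D)                         # mean offset
psi = np.diag(rng.uniform(0.1, 0.5, size=D))    # diagonal (non-shared) noise covariance

# Generative process: z ~ N(0, I), x = W z + mu + eps with eps ~ N(0, Psi)
z = rng.normal(size=(N, K))
eps = rng.multivariate_normal(np.zeros(D), psi, size=N)
x = z @ W.T + mu + eps

# The marginal covariance should be close to Psi + W W^T
empirical_cov = np.cov(x, rowvar=False)
print(np.max(np.abs(empirical_cov - (psi + W @ W.T))))  # small, and shrinks as N grows
```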
Marginal distribution interpretation
• We can see from $p(x_i \mid W, \mu, \Psi) = \mathcal{N}(x_i \mid \mu,\ \Psi + WW^T)$ that the covariance matrix of the data distribution is broken into two terms:
• A diagonal part $\Psi$: variance not shared between variables
• A low-rank matrix $WW^T$: shared variance due to latent factors
Special Case: Probabilistic PCA (PPCA)
• Probabilistic PCA is a special case of Factor Analysis
• We further restrict $\Psi = \sigma^2 I$ (assume isotropic, independent variance)
• It is possible to show that when the data are centered ($\mu = 0$), the limiting case where $\sigma \to 0$ gives back the same solution for $W$ as PCA
• Factor analysis is a generalization of PCA that models non-shared
variance (can think of this as noise in some situations, or individual
variation in others)
Inference in FA
• To find the parameters of the FA model, we use the Expectation
Maximization (EM) algorithm
• EM is very similar to variational inference
• We’ll derive EM by first finding a lower bound on the log-likelihood
we want to maximize, and then maximizing this lower bound
Evidence Lower Bound decomposition
• For any distributions $q(z)$, $p(z)$ we have:
$$\mathrm{KL}\big(q(z)\,\|\,p(z)\big) \triangleq \int q(z) \log \frac{q(z)}{p(z)}\,\mathrm{d}z$$
• Consider the KL divergence of an arbitrary weighting distribution $q(z)$ from a conditional distribution $p(z \mid x, \theta)$:
$$\mathrm{KL}\big(q(z)\,\|\,p(z \mid x, \theta)\big) \triangleq \int q(z) \log \frac{q(z)}{p(z \mid x, \theta)}\,\mathrm{d}z = \int q(z) \log \frac{q(z)}{p(x \mid z, \theta)\,p(z \mid \theta)}\,\mathrm{d}z + \log p(x \mid \theta) = \int q(z) \log \frac{q(z)}{p(x, z \mid \theta)}\,\mathrm{d}z + \log p(x \mid \theta)$$
Then we have:
$$\mathrm{KL}\big(q(z)\,\|\,p(z \mid x, \theta)\big) = \mathrm{KL}\big(q(z)\,\|\,p(x, z \mid \theta)\big) + \log p(x \mid \theta)$$
Evidence Lower Bound
• From basic probability we have:
$$\mathrm{KL}\big(q(z)\,\|\,p(z \mid x, \theta)\big) = \mathrm{KL}\big(q(z)\,\|\,p(x, z \mid \theta)\big) + \log p(x \mid \theta)$$
• We can rearrange the terms to get the following decomposition:
$$\log p(x \mid \theta) = \mathrm{KL}\big(q(z)\,\|\,p(z \mid x, \theta)\big) - \mathrm{KL}\big(q(z)\,\|\,p(x, z \mid \theta)\big)$$
• We define the evidence lower bound (ELBO) as:
$$\mathcal{L}(q, \theta) \triangleq -\mathrm{KL}\big(q(z)\,\|\,p(x, z \mid \theta)\big)$$
Then:
$$\log p(x \mid \theta) = \mathrm{KL}\big(q(z)\,\|\,p(z \mid x, \theta)\big) + \mathcal{L}(q, \theta)$$
Why the name evidence lower bound?
• Rearranging the decomposition
$$\log p(x \mid \theta) = \mathrm{KL}\big(q(z)\,\|\,p(z \mid x, \theta)\big) + \mathcal{L}(q, \theta)$$
• we have
$$\mathcal{L}(q, \theta) = \log p(x \mid \theta) - \mathrm{KL}\big(q(z)\,\|\,p(z \mid x, \theta)\big)$$
• Since $\mathrm{KL}\big(q(z)\,\|\,p(z \mid x, \theta)\big) \ge 0$, $\mathcal{L}(q, \theta)$ is a lower bound on the log-likelihood we want to maximize
• $p(x \mid \theta)$ is sometimes called the evidence
• When is this bound tight? When $q(z) = p(z \mid x, \theta)$
• The ELBO is also sometimes called the variational bound
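A small numerical check can make this decomposition concrete. The sketch below (assuming NumPy; the discrete toy model is made up for illustration) verifies that $\log p(x \mid \theta) = \mathrm{KL}(q(z)\,\|\,p(z \mid x, \theta)) + \mathcal{L}(q, \theta)$ for a model with a single discrete latent variable:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: z takes 3 discrete values; p(x, z | theta) for one fixed observation x
p_joint = rng.dirichlet(np.ones(3)) * 0.7   # p(x, z | theta); sums to p(x | theta) = 0.7
p_x = p_joint.sum()                          # evidence p(x | theta)
p_post = p_joint / p_x                       # posterior p(z | x, theta)

q = rng.dirichlet(np.ones(3))                # arbitrary weighting distribution q(z)

kl_q_post = np.sum(q * np.log(q / p_post))   # KL(q || p(z | x, theta))
elbo = -np.sum(q * np.log(q / p_joint))      # L(q, theta) = -KL(q || p(x, z | theta))

# log p(x | theta) = KL(q || posterior) + ELBO
print(np.log(p_x), kl_q_post + elbo)         # the two numbers agree
```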
Visualizing ELBO decomposition
• Initialize $\theta^{(0)}$
• At each iteration $t = 1, \dots$
  • E step: Hold $\theta^{(t-1)}$ fixed, find $q^{(t)}$ which maximizes $\mathcal{L}(q, \theta^{(t-1)})$
  • M step: Hold $q^{(t)}$ fixed, find $\theta^{(t)}$ which maximizes $\mathcal{L}(q^{(t)}, \theta)$
The E step
• After applying the E step, we increase the likelihood of the data by finding better parameters according to:
$$\theta^{(t)} \leftarrow \underset{\theta}{\arg\max}\ \mathbb{E}_{q^{(t)}(z)}\big[\log p(x, z \mid \theta)\big]$$
EM algorithm
• Initialize $\theta^{(0)}$
• At each iteration $t = 1, \dots$
  • E step: Update $q^{(t)}(z) \leftarrow p(z \mid x, \theta^{(t-1)})$
  • M step: Update $\theta^{(t)} \leftarrow \underset{\theta}{\arg\max}\ \mathbb{E}_{q^{(t)}(z)}\big[\log p(x, z \mid \theta)\big]$
Why does EM work?
• EM does coordinate ascent on the ELBO, $\mathcal{L}(q, \theta)$
• Each iteration increases the log-likelihood until $q^{(t)}$ converges (i.e. we reach a local maximum)!
• Simple to prove. Notice that after the E step the ELBO is tight:
$$\mathcal{L}(q^{(t)}, \theta^{(t-1)}) = \log p(x \mid \theta^{(t-1)}) - \mathrm{KL}\big(p(z \mid x, \theta^{(t-1)})\,\|\,p(z \mid x, \theta^{(t-1)})\big) = \log p(x \mid \theta^{(t-1)})$$
By definition of argmax in the M step:
$$\mathcal{L}(q^{(t)}, \theta^{(t)}) \ge \mathcal{L}(q^{(t)}, \theta^{(t-1)})$$
By simple substitution:
$$\mathcal{L}(q^{(t)}, \theta^{(t)}) \ge \log p(x \mid \theta^{(t-1)})$$
Rewriting the left-hand side:
$$\log p(x \mid \theta^{(t)}) - \mathrm{KL}\big(p(z \mid x, \theta^{(t-1)})\,\|\,p(z \mid x, \theta^{(t)})\big) \ge \log p(x \mid \theta^{(t-1)})$$
Noting that KL is non-negative:
$$\log p(x \mid \theta^{(t)}) \ge \log p(x \mid \theta^{(t-1)})$$
Why does EM work?
• This proof is saying the same thing we saw in pictures: make the KL zero, then improve our parameter estimates to get a better likelihood
A different perspective
• Consider the log-likelihood of the marginal distribution of the data $x$ in a generic latent variable model with latent variables $z$ parameterized by $\theta$:
$$\ell(\theta) = \sum_{i=1}^{N} \log p(x_i \mid \theta) = \sum_{i=1}^{N} \log \int p(x_i, z_i \mid \theta)\,\mathrm{d}z_i$$
• Compare this with the expected complete-data log-likelihood under a weighting distribution $q$:
$$\mathbb{E}_{q(z)}\big[\ell_c(\theta)\big] = \sum_{i=1}^{N} \int q(z_i) \log p(x_i, z_i \mid \theta)\,\mathrm{d}z_i$$
• This looks similar to marginalizing, but now the log is inside the integral, so it's easier to deal with
• We can treat the latent variables as observed and solve this more easily than directly solving the log-likelihood
• Finding the $q$ that maximizes this is the E step of EM
• Finding the $\theta$ that maximizes this is the M step of EM
Back to Factor Analysis
• For simplicity, assume the data are centered. We want:
E step — expectations of the latent variables under $q^{(t)}(z_i) = p(z_i \mid x_i, \theta^{(t-1)})$:
$$\mathbb{E}_{q^{(t)}(z_i)}[z_i] = G\,(W^{(t-1)})^T (\Psi^{(t-1)})^{-1} x_i$$
$$\mathbb{E}_{q^{(t)}(z_i)}[z_i z_i^T] = G + \mathbb{E}_{q^{(t)}(z_i)}[z_i]\ \mathbb{E}_{q^{(t)}(z_i)}[z_i]^T$$
where
$$G = \big(I + (W^{(t-1)})^T (\Psi^{(t-1)})^{-1} W^{(t-1)}\big)^{-1}$$
M step — update the loadings with the standard factor-analysis update, then the diagonal noise covariance:
$$W^{(t)} \leftarrow \Big[\sum_{i=1}^{N} x_i\,\mathbb{E}_{q^{(t)}(z_i)}[z_i]^T\Big]\Big[\sum_{i=1}^{N} \mathbb{E}_{q^{(t)}(z_i)}[z_i z_i^T]\Big]^{-1}$$
$$\Psi^{(t)} \leftarrow \mathrm{diag}\Big[\frac{1}{N}\sum_{i=1}^{N} x_i x_i^T - \frac{1}{N}\,W^{(t)} \sum_{i=1}^{N} \mathbb{E}_{q^{(t)}(z_i)}[z_i]\,x_i^T\Big]$$
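To make these updates concrete, here is a minimal EM loop for factor analysis on centered data (a sketch assuming NumPy; the function name, initialization, and iteration count are illustrative, not from the slides):

```python
import numpy as np

def factor_analysis_em(X, K, n_iter=100):
    """EM for factor analysis on centered data X (N x D) with K latent factors."""
    N, D = X.shape
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(D, K))       # initial loadings
    psi = np.var(X, axis=0)                      # diagonal noise variances, length D

    for _ in range(n_iter):
        # E step: posterior moments of z_i given x_i and the current (W, Psi)
        G = np.linalg.inv(np.eye(K) + W.T @ (W / psi[:, None]))   # (I + W^T Psi^-1 W)^-1
        Ez = X @ (W / psi[:, None]) @ G                            # row i = (G W^T Psi^-1 x_i)^T
        Ezz_sum = N * G + Ez.T @ Ez                                # sum_i E[z_i z_i^T]

        # M step: update W first, then the diagonal noise covariance Psi
        W = (X.T @ Ez) @ np.linalg.inv(Ezz_sum)
        psi = np.mean(X * X, axis=0) - np.sum(W * (X.T @ Ez), axis=1) / N
        psi = np.maximum(psi, 1e-8)              # numerical guard for the diagonal

    return W, psi
```

Running `factor_analysis_em(X - X.mean(axis=0), K)` on data drawn from the earlier sampling sketch should recover a covariance $\Psi + WW^T$ close to the true one (the loadings themselves are identifiable only up to a rotation).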
From EM to Variational Inference
• In EM we alternately maximize the ELBO with respect to $\theta$ and the probability distribution (a functional) $q$
• In variational inference, we drop the distinction between hidden
variables and parameters of a distribution
• I.e. we replace $p(x, z \mid \theta)$ with $p(x, z)$. Effectively this puts a probability distribution on the parameters $\theta$, then absorbs them into $z$
• Fully Bayesian treatment instead of a point estimate for the
parameters
Variational Inference
• Now the ELBO is just a function of our weighting distribution ℒ(𝑞)
• We assume a form for 𝑞 that we can optimize
• For example, the mean field approximation assumes $q$ factorizes:
$$q(Z) = \prod_{i=1}^{M} q_i(Z_i)$$
• Then we optimize ℒ(𝑞) with respect to one of the terms while
holding the others constant, and repeat for all terms
• By assuming a form for 𝑞 we approximate a (typically) intractable true
posterior
Mean Field update derivation
$$\mathcal{L}(q) = \int q(Z)\log\frac{p(X,Z)}{q(Z)}\,\mathrm{d}Z = \int \big[q(Z)\log p(X,Z) - q(Z)\log q(Z)\big]\,\mathrm{d}Z$$
$$= \int q_j(Z_j)\left[\int \log p(X,Z)\prod_{i\neq j} q_i(Z_i)\,\mathrm{d}Z_i - \log q_j(Z_j)\int \prod_{i\neq j} q_i(Z_i)\,\mathrm{d}Z_i\right]\mathrm{d}Z_j + \mathrm{const}$$
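Carrying the derivation one step further (the second inner integral is just 1, since each $q_i$ is normalized), maximizing $\mathcal{L}(q)$ with respect to $q_j$ alone gives the standard mean-field update:
$$\log q_j^{\ast}(Z_j) = \mathbb{E}_{i \neq j}\big[\log p(X, Z)\big] + \mathrm{const}$$
where $\mathbb{E}_{i \neq j}[\cdot]$ denotes the expectation with respect to $\prod_{i \neq j} q_i(Z_i)$.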
• The point of this is not the update equations themselves, but the
general idea:
• freeze some of the variables, compute expectations over those
• update the rest using these expectations
Why does Variational Inference work?
• The argument is similar to the argument for EM
• When expectations are computed using the current values for the
variables not being updated, we implicitly set the KL divergence
between the weighting distributions and the posterior distributions to
0
• The update then pushes up the data likelihood
[Figure: the VAE with the reparameterization trick — the encoder $q(z_i \mid x_i, \phi)$ takes $x_i$; noise $\epsilon_i \sim p(\epsilon)$ is transformed to $z_i = g(\epsilon_i, x_i, \phi)$, so that $z_i \sim q(z_i \mid x_i, \phi)$; the decoder computes $p(x_i \mid z_i, \theta)$.]
• Final simplification: update all of the parameters at the same time instead of using separate E, M steps
• This is standard back propagation. Just use $-\tilde{\mathcal{L}}_A$ or $-\tilde{\mathcal{L}}_B$ as the loss, and run your favorite SGD variant
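As an illustration of this training recipe, here is a minimal VAE sketch (assuming PyTorch, a Gaussian encoder, and a Bernoulli decoder; the architecture, dimensions, and placeholder batch are illustrative, not from the slides). The loss is the negative ELBO in the form discussed under "Regularization by a prior" below, with the reparameterization $z = \mu + \sigma\epsilon$:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)        # encoder mean
        self.logvar = nn.Linear(h_dim, z_dim)    # encoder log-variance
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def encode(self, x):
        h = self.enc(x)
        return self.mu(h), self.logvar(h)

    def decode(self, z):
        return torch.sigmoid(self.dec(z))        # Bernoulli parameters for p(x|z, theta)

    def forward(self, x):
        mu, logvar = self.encode(x)
        eps = torch.randn_like(mu)               # eps ~ p(eps) = N(0, I)
        z = mu + torch.exp(0.5 * logvar) * eps   # reparameterization: z = g(eps, x, phi)
        return self.decode(z), mu, logvar

def loss_fn(x, x_recon, mu, logvar):
    # Negative ELBO: reconstruction term plus KL(q(z|x, phi) || p(z)) with p(z) = N(0, I)
    recon = F.binary_cross_entropy(x_recon, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# One SGD-style update on a (hypothetical) batch of flattened binary images
model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784).round()                  # placeholder batch
x_recon, mu, logvar = model(x)
loss = loss_fn(x, x_recon, mu, logvar)
opt.zero_grad(); loss.backward(); opt.step()
```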
Running the model on new data
• To get a MAP estimate of the latent variables, just use the mean
output by the encoder (for a Gaussian distribution)
• No need to take a sample
• Give the mean to the decoder
• At test time, this is used just as an auto-encoder
• You can optionally take multiple samples of the latent variables to
estimate the uncertainty
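Continuing the hypothetical sketch above, test-time use as an auto-encoder just passes the encoder mean to the decoder, and can optionally draw several samples of $z$ to probe uncertainty:

```python
with torch.no_grad():
    mu, logvar = model.encode(x)
    x_map = model.decode(mu)                     # use the mean directly, no sampling

    # Optional: multiple samples of z to estimate uncertainty in the reconstruction
    samples = torch.stack([model.decode(mu + torch.exp(0.5 * logvar) * torch.randn_like(mu))
                           for _ in range(10)])
    recon_std = samples.std(dim=0)
```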
Relationship to Factor Analysis
[Figure: the VAE graphical model — encoder $q(z_i \mid x_i, \phi)$, sample $z_i \sim q(z_i \mid x_i, \phi)$, decoder $p(x_i \mid z_i, \theta)$.]
• VAE performs probabilistic, non-linear dimensionality reduction
• It uses a generative model with a latent variable distributed according to some prior distribution $p(z_i)$
• The observed variable is distributed according to a conditional distribution $p(x_i \mid z_i, \theta)$
• Training is approximately running expectation maximization to maximize the data likelihood
• This can be seen as a non-linear version of Factor Analysis
Regularization by a prior
• Looking at the form of $\mathcal{L}$ we used to justify $\tilde{\mathcal{L}}_B$ gives us additional insight:
$$\mathcal{L}(\phi, \theta, x) = -\mathrm{KL}\big(q(z \mid x, \phi)\,\|\,p(z)\big) + \mathbb{E}_{q(z \mid x, \phi)}\big[\log p(x \mid z, \theta)\big]$$
• We are making the latent distribution as close as possible to a prior on $z$
• While maximizing the conditional likelihood of the data under our model
• In other words, this is an approximation to Maximum Likelihood Estimation regularized by a prior on the latent space (for a Gaussian encoder and prior the KL term has a closed form; see below)
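For reference (a standard result, not from the slides): when $q(z \mid x, \phi) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ and $p(z) = \mathcal{N}(0, I)$, the KL term is available in closed form:
$$\mathrm{KL}\big(\mathcal{N}(\mu, \mathrm{diag}(\sigma^2))\,\|\,\mathcal{N}(0, I)\big) = \frac{1}{2}\sum_{j}\big(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\big)$$
This is exactly the `kl` term in the training sketch above.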
Practical advantages of a VAE vs. an AE
• The prior on the latent space:
• Allows you to inject domain knowledge
• Can make the latent space more interpretable
• The VAE also makes it possible to estimate the variance/uncertainty in
the predictions
Interpreting the latent space
https://arxiv.org/pdf/1610.00291.pdf
Requirements of the VAE
• Note that the VAE requires two tractable distributions to be used:
  • The prior distribution $p(z)$ must be easy to sample from
  • The conditional likelihood $p(x \mid z, \theta)$ must be computable
• In practice this means that the two distributions of interest are often simple, for example uniform, Gaussian, or even isotropic Gaussian
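Because the prior is easy to sample from, generating new data with the hypothetical sketch above amounts to drawing $z$ from $p(z)$ and decoding:

```python
with torch.no_grad():
    z = torch.randn(16, 20)          # 16 draws from the N(0, I) prior (z_dim = 20 as above)
    generated = model.decode(z)      # decoder outputs, viewable as images
```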
The blurry image problem
• The samples from the VAE look blurry
• Three plausible explanations for this:
  • Maximizing the likelihood
  • Restrictions on the family of distributions
  • The lower bound approximation
https://blog.openai.com/generative-models/
The maximum likelihood explanation
• Recent evidence suggests that this is not actually the problem
• GANs can be trained with maximum likelihood and still generate sharp examples
https://arxiv.org/pdf/1701.00160.pdf
Investigations of blurriness
• Recent investigations suggest that both the simple probability distributions and the variational approximation lead to blurry images
• Kingma & colleagues: Improving Variational Inference with Inverse Autoregressive Flow
• Zhao & colleagues: Towards a Deeper Understanding of Variational Autoencoding Models
• Nowozin & colleagues: f-GAN: Training Generative Neural Samplers Using Variational Divergence Minimization