cs236_lecture5
Stefano Ermon
Stanford University
Lecture 5
1 Autoregressive models:
Chain rule based factorization is fully general
Compact representation via conditional independence and/or neural
parameterizations
2 Autoregressive models Pros:
Easy to evaluate likelihoods
Easy to train
3 Autoregressive models Cons:
Requires an ordering
Generation is sequential
Cannot learn features in an unsupervised way
Generative process
1 Pick a mixture component k by sampling z
2 Generate a data point by sampling from that Gaussian
Mixture of Gaussians:
1 z ∼ Categorical(1, · · · , K)
2 p(x | z = k) = N(µ_k, Σ_k)
Shown is the posterior probability that a data point was generated by the
i-th mixture component, P(z = i|x)
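As a concrete illustration of the two-step generative process above (not from the slides; all parameter values are made-up assumptions), a minimal NumPy sketch:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical 1D mixture with K = 2 components (parameters chosen only for illustration)
    pi = np.array([0.3, 0.7])       # mixing weights p(z = k)
    mu = np.array([-2.0, 3.0])      # component means mu_k
    sigma = np.array([0.5, 1.0])    # component standard deviations (Sigma_k in 1D)

    def sample(n):
        # 1. Pick a mixture component k by sampling z ~ Categorical(pi)
        z = rng.choice(len(pi), size=n, p=pi)
        # 2. Generate a data point by sampling from that Gaussian, x | z = k ~ N(mu_k, sigma_k^2)
        x = rng.normal(mu[z], sigma[z])
        return x, z

    x, z = sample(5)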
Unsupervised learning
p(x) = Σ_z p(x, z) = Σ_z p(z) p(x | z) = Σ_{k=1}^{K} p(z = k) N(x; µ_k, Σ_k), where each N(x; µ_k, Σ_k) is a mixture component
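A companion sketch (same hypothetical 1D mixture as above) that evaluates the marginal p(x) as this weighted sum of component densities and recovers the posterior P(z = k | x) from the previous slide via Bayes' rule:

    import numpy as np
    from scipy.stats import norm

    pi = np.array([0.3, 0.7])       # p(z = k)
    mu = np.array([-2.0, 3.0])      # mu_k
    sigma = np.array([0.5, 1.0])    # sigma_k

    def marginal_and_posterior(x):
        # p(z = k) * N(x; mu_k, sigma_k) for every component k
        weighted = pi * norm.pdf(x, loc=mu, scale=sigma)
        p_x = weighted.sum()        # p(x) = sum_k p(z = k) N(x; mu_k, Sigma_k)
        posterior = weighted / p_x  # P(z = k | x) by Bayes' rule
        return p_x, posterior

    p_x, post = marginal_and_posterior(0.5)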
Suppose some pixel values are missing at train time (e.g., top half)
Let X denote observed random variables, and Z the unobserved ones (also
called hidden or latent)
Suppose we have a model for the joint distribution (e.g., PixelCNN)
p(X, Z; θ)
What is the probability p(X = x̄; θ) of observing a training data point x̄?
p(X = x̄; θ) = Σ_z p(X = x̄, Z = z; θ) = Σ_z p(x̄, z; θ)
Need to consider all possible ways to complete the image (fill in the missing, green-highlighted part)
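To make the marginalization concrete, here is a toy sketch (my own construction, not the lecture's PixelCNN; the 4-pixel probability table is a stand-in for any learned joint model) that sums over every possible completion z of the missing pixels:

    import itertools
    import numpy as np

    # Toy joint model over 4 binary pixels, stored as a probability table
    # (a stand-in for a learned joint model p(X, Z; theta) such as a PixelCNN)
    rng = np.random.default_rng(0)
    table = rng.dirichlet(np.ones(16))              # one probability per 4-bit "image"

    def p_joint(pixels):                            # pixels: tuple of 4 bits (top half, bottom half)
        return table[int("".join(str(int(b)) for b in pixels), 2)]

    x_bar = (1, 0)                                  # observed bottom half
    # p(X = x_bar; theta) = sum_z p(x_bar, z; theta), enumerating every completion z of the top half
    p_obs = sum(p_joint(z + x_bar) for z in itertools.product([0, 1], repeat=2))
    # The sum has 2^(number of missing pixels) terms, which is why exact
    # enumeration quickly becomes intractable for real images.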
Variational Autoencoder Marginal Likelihood
We have a dataset D, where for each datapoint the X variables are observed
(e.g., pixel values) and the variables Z are never observed (e.g., cluster or
class id). D = {x^(1), · · · , x^(M)}.
Maximum likelihood learning:
log ∏_{x∈D} p(x; θ) = Σ_{x∈D} log p(x; θ) = Σ_{x∈D} log Σ_z p(x, z; θ)
Evaluating log Σ_z p(x, z; θ) can be intractable. Suppose we have 30 binary latent features, z ∈ {0, 1}^30. Evaluating Σ_z p(x, z; θ) involves a sum with 2^30 terms. For continuous variables, log ∫_z p(x, z; θ) dz is often intractable. Gradients ∇_θ are also hard to compute.
Need approximations. One gradient evaluation per training data point
x ∈ D, so approximation needs to be cheap.
First attempt: Naive Monte Carlo
Works in theory but not in practice. For most z, pθ(x, z) is very low (most completions don't make sense). Some completions have large pθ(x, z), but we will never "hit" likely completions by uniform random sampling. Need a clever way to select z^(j) to reduce the variance of the estimator.
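A minimal sketch of this naive estimator under the same toy 4-pixel model used above (names and sizes are illustrative assumptions): completions z are drawn uniformly, giving an unbiased but high-variance estimate of p(x̄):

    import numpy as np

    rng = np.random.default_rng(0)
    table = rng.dirichlet(np.ones(16))              # same toy 4-pixel joint model as above
    def p_joint(pixels):
        return table[int("".join(str(int(b)) for b in pixels), 2)]

    x_bar = (1, 0)                                  # observed bottom half

    # Naive Monte Carlo: p(x_bar) = |Z| * E_{z ~ Uniform(Z)}[p(x_bar, z)],
    # estimated with k completions z sampled uniformly at random.
    def naive_mc_estimate(x_bar, k=1000):
        zs = rng.integers(0, 2, size=(k, 2))        # uniform samples over {0, 1}^2
        vals = [p_joint(tuple(z) + x_bar) for z in zs]
        return (2 ** 2) * np.mean(vals)             # |Z| = 2^2 possible completions here

    est = naive_mc_estimate(x_bar)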
Recall that for training, we need the log-likelihood log p_θ(x). We could estimate it as

log p_θ(x) ≈ log ( (1/k) Σ_{j=1}^{k} p_θ(x, z^(j)) / q(z^(j)) ) ≈ log ( p_θ(x, z^(1)) / q(z^(1)) )   (taking k = 1)

However, it’s clear that E_{z^(1) ∼ q(z)} [ log ( p_θ(x, z^(1)) / q(z^(1)) ) ] ≠ log E_{z^(1) ∼ q(z)} [ p_θ(x, z^(1)) / q(z^(1)) ]
Choosing f(z) = p_θ(x, z) / q(z) and applying Jensen's inequality (log is concave, so the log of an expectation is at least the expectation of the log):

log E_{z∼q(z)} [ p_θ(x, z) / q(z) ] ≥ E_{z∼q(z)} [ log ( p_θ(x, z) / q(z) ) ]
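A small numerical sketch of both points, continuing the toy binary-pixel setup (the proposal q below is a made-up product of Bernoullis): the importance-weighted average is an unbiased estimate of p_θ(x), while averaging the log of the ratio only lower-bounds log p_θ(x), as Jensen's inequality predicts:

    import numpy as np

    rng = np.random.default_rng(0)
    table = rng.dirichlet(np.ones(16))              # toy joint over 4 binary pixels, as above
    def p_joint(pixels):
        return table[int("".join(str(int(b)) for b in pixels), 2)]

    x_bar = (1, 0)                                  # observed bottom half
    phi = np.array([0.6, 0.4])                      # made-up proposal q(z): independent Bernoullis

    def q_prob(z):
        return np.prod(np.where(np.asarray(z) == 1, phi, 1 - phi))

    zs = (rng.random((10000, 2)) < phi).astype(int)                        # z^(j) ~ q(z)
    ratios = np.array([p_joint(tuple(z) + x_bar) / q_prob(z) for z in zs])

    p_true = sum(p_joint((a, b) + x_bar) for a in (0, 1) for b in (0, 1))
    print(np.log(ratios.mean()), np.log(p_true))    # log of the average ratio: close to log p(x)
    print(np.log(ratios).mean())                    # average of the log ratio: a lower bound (Jensen)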
Assume p(x_top, x_bottom; θ) assigns high probability to images that look like digits. In this example, we assume z = x_top are unobserved (latent).
Suppose q(x_top; ϕ) is a (tractable) probability distribution over the hidden variables x_top (the missing pixels in this example), parameterized by ϕ (variational parameters):
q(x_top; ϕ) = ∏_i (ϕ_i)^{x_i^top} (1 − ϕ_i)^{1 − x_i^top}, where the product runs over the unobserved variables x_i^top
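A minimal sketch of this fully factorized Bernoulli q (the ϕ values are placeholders and the helper names are my own): sampling completions of the missing pixels and evaluating their log-probability under q:

    import numpy as np

    rng = np.random.default_rng(0)
    phi = np.array([0.9, 0.2, 0.7])                 # one variational parameter phi_i per missing pixel

    def sample_q(phi, k=1):
        # each missing pixel i is an independent Bernoulli(phi_i)
        return (rng.random((k, len(phi))) < phi).astype(int)

    def log_q(x_top, phi):
        # log q(x_top; phi) = sum_i [x_i log phi_i + (1 - x_i) log(1 - phi_i)]
        return np.sum(x_top * np.log(phi) + (1 - x_top) * np.log(1 - phi), axis=-1)

    samples = sample_q(phi, k=4)
    print(log_q(samples, phi))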
log p(x; θ) ≥ Σ_z q(z; ϕ) log p(z, x; θ) + H(q(z; ϕ)) = L(x; θ, ϕ), the evidence lower bound (ELBO)

log p(x; θ) = L(x; θ, ϕ) + D_KL(q(z; ϕ) ∥ p(z|x; θ))
The better q(z; ϕ) approximates the posterior p(z|x; θ), the smaller D_KL(q(z; ϕ) ∥ p(z|x; θ)) becomes, and the closer the ELBO gets to log p(x; θ). Next: jointly optimize over θ and ϕ to maximize the ELBO over a dataset.
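To see the decomposition numerically, here is a toy sketch (my own construction with a small discrete latent z; all distributions are randomly generated) that computes the ELBO and the KL term and checks that they sum to log p(x; θ):

    import numpy as np

    rng = np.random.default_rng(0)
    K = 4                                           # small discrete latent space for z
    p_z = rng.dirichlet(np.ones(K))                 # p(z; theta)
    p_x_given_z = rng.uniform(0.1, 0.9, size=K)     # p(x = x_obs | z; theta) for one fixed x_obs
    q_z = rng.dirichlet(np.ones(K))                 # variational q(z; phi)

    p_joint = p_z * p_x_given_z                     # p(z, x_obs; theta)
    log_p_x = np.log(p_joint.sum())                 # log p(x_obs; theta)

    # ELBO = sum_z q(z) log p(z, x) + H(q) = E_q[log p(z, x) - log q(z)]
    elbo = np.sum(q_z * (np.log(p_joint) - np.log(q_z)))
    # KL(q(z) || p(z | x)) with posterior p(z | x) = p(z, x) / p(x)
    posterior = p_joint / p_joint.sum()
    kl = np.sum(q_z * (np.log(q_z) - np.log(posterior)))

    print(elbo + kl, log_p_x)                       # these match: log p(x) = ELBO + KL
    print(elbo <= log_p_x)                          # ELBO is a lower bound (True)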
Summary