
Latent Variable Models

Stefano Ermon

Stanford University

Lecture 5



Recap of last lecture

1 Autoregressive models:
Chain rule based factorization is fully general
Compact representation via conditional independence and/or neural parameterizations
2 Autoregressive models Pros:
Easy to evaluate likelihoods
Easy to train
3 Autoregressive models Cons:
Requires an ordering
Generation is sequential
Cannot learn features in an unsupervised way



Plan for today

1 Latent Variable Models


Mixture models
Variational autoencoder
Variational inference and learning



Latent Variable Models: Motivation

1 Lots of variability in images x due to gender, eye color, hair color, pose, etc. However, unless images are annotated, these factors of variation are not explicitly available (latent).
2 Idea: explicitly model these factors using latent variables z



Latent Variable Models: Motivation

1 Only shaded variables x are observed in the data (pixel values)
2 Latent variables z correspond to high level features
If z chosen properly, p(x | z) could be much simpler than p(x)
If we had trained this model, then we could identify features via p(z | x), e.g., p(EyeColor = Blue | x)
3 Challenge: Very difficult to specify these conditionals by hand



Deep Latent Variable Models

Use neural networks to model the conditionals (deep latent variable models):
1 z ∼ N (0, I)
2 p(x | z) = N (µ_θ(z), Σ_θ(z)) where µ_θ, Σ_θ are neural networks
Hope that after training, z will correspond to meaningful latent factors of variation (features). Unsupervised representation learning.
As before, features can be computed via p(z | x)
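To make the generative process concrete, here is a minimal NumPy sketch. It is an illustration only: the one-hidden-layer networks, their sizes, and the random, untrained weights standing in for µ_θ and Σ_θ are all assumptions, not the parameterization used later in the lecture.

import numpy as np

rng = np.random.default_rng(0)
z_dim, h_dim, x_dim = 2, 16, 4

# Randomly initialized one-hidden-layer MLPs standing in for mu_theta and Sigma_theta.
W1, b1 = rng.standard_normal((h_dim, z_dim)), np.zeros(h_dim)
W_mu, b_mu = rng.standard_normal((x_dim, h_dim)), np.zeros(x_dim)
W_ls, b_ls = rng.standard_normal((x_dim, h_dim)), np.zeros(x_dim)

def decoder(z):
    h = np.tanh(W1 @ z + b1)
    mu = W_mu @ h + b_mu                     # mean of p(x | z)
    sigma = np.exp(0.5 * (W_ls @ h + b_ls))  # diagonal std of p(x | z)
    return mu, sigma

# Generative process: z ~ N(0, I), then x ~ N(mu_theta(z), Sigma_theta(z)).
z = rng.standard_normal(z_dim)
mu, sigma = decoder(z)
x = mu + sigma * rng.standard_normal(x_dim)
print("z:", z, "\nx:", x)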



Mixture of Gaussians: a Shallow Latent Variable Model

Mixture of Gaussians. Bayes net: z → x.
1 z ∼ Categorical(1, · · · , K)
2 p(x | z = k) = N (µ_k, Σ_k)

Generative process (see the sketch below):
1 Pick a mixture component k by sampling z
2 Generate a data point by sampling from that Gaussian
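A minimal sketch of this two-step sampling procedure; the mixture weights, means, and covariances below are made-up toy values, not anything from the lecture.

import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D mixture with K = 3 components (assumed parameters).
pi = np.array([0.5, 0.3, 0.2])                        # p(z = k)
mus = np.array([[0.0, 0.0], [4.0, 4.0], [-4.0, 2.0]]) # mu_k
covs = [np.eye(2), 0.5 * np.eye(2), 2.0 * np.eye(2)]  # Sigma_k

def sample_mog(n):
    xs = np.empty((n, 2))
    for i in range(n):
        k = rng.choice(len(pi), p=pi)                      # 1. pick a component by sampling z
        xs[i] = rng.multivariate_normal(mus[k], covs[k])   # 2. sample x from that Gaussian
    return xs

print(sample_mog(5))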


Clustering: The posterior p(z | x) identifies the mixture component
Unsupervised learning: We are hoping to learn from unlabeled data (ill-posed problem)





Unsupervised learning

Shown is the posterior probability that a data point was generated by the i-th mixture component, p(z = i | x).
Unsupervised learning

Unsupervised clustering of handwritten digits.



Mixture models

Alternative motivation: Combine simple models into a more complex and expressive one

$$p(x) = \sum_z p(x, z) = \sum_z p(z)\, p(x \mid z) = \sum_{k=1}^{K} p(z = k)\, \underbrace{\mathcal{N}(x; \mu_k, \Sigma_k)}_{\text{component}}$$
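Evaluating the marginal p(x) is then just this finite sum over components; a short sketch with the same kind of made-up toy parameters as above (assumed values, purely for illustration):

import numpy as np
from scipy.stats import multivariate_normal

# Assumed toy parameters.
pi = np.array([0.5, 0.3, 0.2])
mus = [np.zeros(2), np.array([4.0, 4.0]), np.array([-4.0, 2.0])]
covs = [np.eye(2), 0.5 * np.eye(2), 2.0 * np.eye(2)]

def mog_density(x):
    # p(x) = sum_k p(z = k) * N(x; mu_k, Sigma_k)
    return sum(w * multivariate_normal(m, c).pdf(x) for w, m, c in zip(pi, mus, covs))

print(mog_density(np.array([0.5, -0.2])))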



Variational Autoencoder

A mixture of an infinite number of Gaussians:
1 z ∼ N (0, I)
2 p(x | z) = N (µ_θ(z), Σ_θ(z)) where µ_θ, Σ_θ are neural networks:
$\mu_\theta(z) = \sigma(Az + c) = (\sigma(a_1 z + c_1), \sigma(a_2 z + c_2)) = (\mu_1(z), \mu_2(z))$
$\Sigma_\theta(z) = \mathrm{diag}(\exp(\sigma(Bz + d))) = \begin{pmatrix} \exp(\sigma(b_1 z + d_1)) & 0 \\ 0 & \exp(\sigma(b_2 z + d_2)) \end{pmatrix}$
$\theta = (A, B, c, d)$
3 Even though p(x | z) is simple, the marginal p(x) is very complex/flexible
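A direct transcription of this small example into NumPy, using a scalar latent and a 2-D observation for simplicity; the particular numbers chosen for A, B, c, d are arbitrary, only the functional form follows the slide.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# theta = (A, B, c, d); arbitrary values for illustration.
A, c = np.array([1.0, -2.0]), np.array([0.5, 0.0])
B, d = np.array([0.3, 1.5]), np.array([0.0, -1.0])

def decoder(z):                              # z is a scalar latent here
    mu = sigmoid(A * z + c)                  # (sigma(a1*z + c1), sigma(a2*z + c2))
    Sigma = np.diag(np.exp(sigmoid(B * z + d)))
    return mu, Sigma

z = rng.standard_normal()                    # z ~ N(0, 1)
mu, Sigma = decoder(z)
x = rng.multivariate_normal(mu, Sigma)       # x ~ N(mu_theta(z), Sigma_theta(z))
print("mu:", mu, "\nSigma:\n", Sigma, "\nx:", x)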
Recap

Latent Variable Models


Allow us to define complex models p(x) in terms of simpler building
blocks p(x | z)
Natural for unsupervised learning tasks (clustering, unsupervised
representation learning, etc.)
No free lunch: much more difficult to learn compared to fully observed,
autoregressive models



Marginal Likelihood

Suppose some pixel values are missing at train time (e.g., top half)
Let X denote observed random variables, and Z the unobserved ones (also
called hidden or latent)
Suppose we have a model for the joint distribution (e.g., PixelCNN)
p(X, Z; θ)
What is the probability p(X = x̄; θ) of observing a training data point x̄?
$$p(X = \bar{x}; \theta) = \sum_z p(X = \bar{x}, Z = z; \theta) = \sum_z p(\bar{x}, z; \theta)$$

Need to consider all possible ways to complete the image (fill in the unobserved part)
Variational Autoencoder Marginal Likelihood

A mixture of an infinite number of Gaussians:
1 z ∼ N (0, I)
2 p(x | z) = N (µ_θ(z), Σ_θ(z)) where µ_θ, Σ_θ are neural networks
3 Z are unobserved at train time (also called hidden or latent)
4 Suppose we have a model for the joint distribution. What is the probability p(X = x̄; θ) of observing a training data point x̄?

$$p(X = \bar{x}; \theta) = \int_z p(X = \bar{x}, Z = z; \theta)\, dz = \int_z p(\bar{x}, z; \theta)\, dz$$



Partially observed data
Suppose that our joint distribution is
p(X, Z; θ)

We have a dataset D, where for each datapoint the X variables are observed
(e.g., pixel values) and the variables Z are never observed (e.g., cluster or
class id.). D = {x^(1), · · · , x^(M)}.
Maximum likelihood learning:

$$\log \prod_{x \in \mathcal{D}} p(x; \theta) = \sum_{x \in \mathcal{D}} \log p(x; \theta) = \sum_{x \in \mathcal{D}} \log \sum_z p(x, z; \theta)$$

Evaluating log Σ_z p(x, z; θ) can be intractable. Suppose we have 30 binary latent features, z ∈ {0, 1}^30. Evaluating Σ_z p(x, z; θ) involves a sum with 2^30 terms. For continuous variables, log ∫_z p(x, z; θ) dz is often intractable. Gradients ∇_θ also hard to compute.
Need approximations. One gradient evaluation per training data point x ∈ D, so approximation needs to be cheap.
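To see the blow-up concretely, here is brute-force marginalization over n binary latent features for a toy joint (the factorized toy model below is an assumption made purely to have something to sum); the loop visits 2^n completions, which is fine for n = 10 and hopeless for n = 30.

import itertools
import numpy as np

n = 10                                  # number of binary latent features
rng = np.random.default_rng(0)
w = rng.standard_normal(n)              # parameters of a made-up joint p(x, z)

def joint(x, z):
    # Toy joint: z_i ~ Bernoulli(0.6) independently, x | z ~ N(w . z, 1).
    prior = np.prod(np.where(z == 1, 0.6, 0.4))
    lik = np.exp(-0.5 * (x - w @ z) ** 2) / np.sqrt(2 * np.pi)
    return prior * lik

x = 1.3
p_x = sum(joint(x, np.array(zc)) for zc in itertools.product([0, 1], repeat=n))
print(f"p(x) via brute force over 2^{n} = {2**n} completions:", p_x)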
First attempt: Naive Monte Carlo

Likelihood function p_θ(x) for Partially Observed Data is hard to compute:

$$p_\theta(x) = \sum_{\text{all values of } z} p_\theta(x, z) = |\mathcal{Z}| \sum_{z \in \mathcal{Z}} \frac{1}{|\mathcal{Z}|}\, p_\theta(x, z) = |\mathcal{Z}|\; \mathbb{E}_{z \sim \mathrm{Uniform}(\mathcal{Z})}\left[p_\theta(x, z)\right]$$

We can think of it as an (intractable) expectation. Monte Carlo to the rescue:
1 Sample z^(1), · · · , z^(k) uniformly at random
2 Approximate expectation with sample average:

$$\sum_z p_\theta(x, z) \approx |\mathcal{Z}|\, \frac{1}{k} \sum_{j=1}^{k} p_\theta(x, z^{(j)})$$

Works in theory but not in practice. For most z, p_θ(x, z) is very low (most completions don't make sense). Some completions have large p_θ(x, z) but we will never "hit" likely completions by uniform random sampling. Need a clever way to select z^(j) to reduce variance of the estimator.
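A sketch of the naive estimator on the same kind of made-up toy joint as before (again, the model is an assumption, not part of the lecture); across repeated trials the estimate fluctuates wildly because most uniformly drawn completions contribute almost nothing.

import numpy as np

n, k = 10, 100
rng = np.random.default_rng(0)
w = rng.standard_normal(n)

def joint(x, z):                        # same toy joint as in the previous sketch
    prior = np.prod(np.where(z == 1, 0.6, 0.4))
    lik = np.exp(-0.5 * (x - w @ z) ** 2) / np.sqrt(2 * np.pi)
    return prior * lik

x, Z_size = 1.3, 2 ** n
for trial in range(3):
    zs = rng.integers(0, 2, size=(k, n))                     # z^(1..k) ~ Uniform({0, 1}^n)
    estimate = Z_size * np.mean([joint(x, z) for z in zs])   # |Z| * sample average
    print("naive Monte Carlo estimate:", estimate)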



Second attempt: Importance Sampling
Likelihood function p_θ(x) for Partially Observed Data is hard to compute:

$$p_\theta(x) = \sum_{\text{all possible values of } z} p_\theta(x, z) = \sum_{z \in \mathcal{Z}} \frac{q(z)}{q(z)}\, p_\theta(x, z) = \mathbb{E}_{z \sim q(z)}\left[\frac{p_\theta(x, z)}{q(z)}\right]$$

Monte Carlo to the rescue:
1 Sample z^(1), · · · , z^(k) from q(z)
2 Approximate expectation with sample average:

$$p_\theta(x) \approx \frac{1}{k} \sum_{j=1}^{k} \frac{p_\theta(x, z^{(j)})}{q(z^{(j)})}$$

What is a good choice for q(z)? Intuitively, frequently sample z (completions) that are likely given x under p_θ(x, z).
3 This is an unbiased estimator of p_θ(x):

$$\mathbb{E}_{z^{(j)} \sim q(z)}\left[\frac{1}{k} \sum_{j=1}^{k} \frac{p_\theta(x, z^{(j)})}{q(z^{(j)})}\right] = p_\theta(x)$$
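A matching sketch of the importance sampling estimator on the same made-up toy joint. Here the proposal q(z) is simply the Bernoulli prior over completions, which is already far better than uniform sampling; a q even closer to p(z | x) would reduce the variance further. (Both the model and the proposal are assumptions for illustration.)

import numpy as np

n, k = 10, 100
rng = np.random.default_rng(0)
w = rng.standard_normal(n)

def joint(x, z):                        # same toy joint as in the previous sketches
    prior = np.prod(np.where(z == 1, 0.6, 0.4))
    lik = np.exp(-0.5 * (x - w @ z) ** 2) / np.sqrt(2 * np.pi)
    return prior * lik

def q_prob(z):                          # proposal: z_i ~ Bernoulli(0.6), i.e. the prior
    return np.prod(np.where(z == 1, 0.6, 0.4))

x = 1.3
for trial in range(3):
    zs = (rng.random((k, n)) < 0.6).astype(int)                  # z^(1..k) ~ q(z)
    weights = [joint(x, z) / q_prob(z) for z in zs]              # p_theta(x, z) / q(z)
    print("importance sampling estimate:", np.mean(weights))     # unbiased for p_theta(x)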



Estimating log-likelihoods
Recall that for training we need the log-likelihood log p_θ(x). Using the importance sampling estimator above, we could estimate it as:

$$\log p_\theta(x) \approx \log\left(\frac{1}{k} \sum_{j=1}^{k} \frac{p_\theta(x, z^{(j)})}{q(z^{(j)})}\right) \;\overset{k=1}{\approx}\; \log\left(\frac{p_\theta(x, z^{(1)})}{q(z^{(1)})}\right)$$

However, it's clear that

$$\mathbb{E}_{z^{(1)} \sim q(z)}\left[\log\left(\frac{p_\theta(x, z^{(1)})}{q(z^{(1)})}\right)\right] \neq \log\left(\mathbb{E}_{z^{(1)} \sim q(z)}\left[\frac{p_\theta(x, z^{(1)})}{q(z^{(1)})}\right]\right)$$

so taking the log of the (unbiased) importance sampling estimate gives a biased estimate of log p_θ(x).
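The gap is easy to see numerically. A tiny sketch with an arbitrary positive random variable standing in for the k = 1 importance weight (purely illustrative; any non-degenerate positive variable shows the same effect):

import numpy as np

rng = np.random.default_rng(0)
weights = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # stand-in for p(x, z)/q(z), k = 1

log_of_expectation = np.log(weights.mean())   # approximates log E[w]  (what we want)
expectation_of_log = np.log(weights).mean()   # approximates E[log w]  (what the estimator gives)
print(log_of_expectation, expectation_of_log) # the second is noticeably smaller (Jensen)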



Evidence Lower Bound
Log-likelihood function for Partially Observed Data is hard to compute:

$$\log\left(\sum_{z \in \mathcal{Z}} p_\theta(x, z)\right) = \log\left(\sum_{z \in \mathcal{Z}} \frac{q(z)}{q(z)}\, p_\theta(x, z)\right) = \log\left(\mathbb{E}_{z \sim q(z)}\left[\frac{p_\theta(x, z)}{q(z)}\right]\right)$$

log(·) is a concave function: log(p x + (1 − p) x′) ≥ p log(x) + (1 − p) log(x′).
Idea: use Jensen's inequality (for concave functions):

$$\log\left(\mathbb{E}_{z \sim q(z)}[f(z)]\right) = \log\left(\sum_z q(z) f(z)\right) \geq \sum_z q(z) \log f(z) = \mathbb{E}_{z \sim q(z)}[\log f(z)]$$


Choosing f(z) = p_θ(x, z)/q(z):

$$\log\left(\mathbb{E}_{z \sim q(z)}\left[\frac{p_\theta(x, z)}{q(z)}\right]\right) \geq \mathbb{E}_{z \sim q(z)}\left[\log\left(\frac{p_\theta(x, z)}{q(z)}\right)\right]$$

Called the Evidence Lower Bound (ELBO).


Variational inference
Suppose q(z) is any probability distribution over the hidden variables.
The evidence lower bound (ELBO) holds for any q:

$$\log p(x; \theta) \;\geq\; \sum_z q(z) \log\left(\frac{p_\theta(x, z)}{q(z)}\right) \;=\; \sum_z q(z) \log p_\theta(x, z) \underbrace{- \sum_z q(z) \log q(z)}_{\text{Entropy } H(q) \text{ of } q} \;=\; \sum_z q(z) \log p_\theta(x, z) + H(q)$$

Equality holds if q = p(z|x; θ):

$$\log p(x; \theta) = \sum_z q(z) \log p(z, x; \theta) + H(q)$$

(Aside: This is what we compute in the E-step of the EM algorithm)


Why is the bound tight
We derived this lower bound, which holds for any choice of q(z):

$$\log p(x; \theta) \geq \sum_z q(z) \log\left(\frac{p(x, z; \theta)}{q(z)}\right)$$

If q(z) = p(z|x; θ), the bound becomes:

$$\sum_z p(z \mid x; \theta) \log\frac{p(x, z; \theta)}{p(z \mid x; \theta)} = \sum_z p(z \mid x; \theta) \log\frac{p(z \mid x; \theta)\, p(x; \theta)}{p(z \mid x; \theta)} = \sum_z p(z \mid x; \theta) \log p(x; \theta) = \log p(x; \theta) \underbrace{\sum_z p(z \mid x; \theta)}_{=1} = \log p(x; \theta)$$

Confirms our previous importance sampling intuition: we should choose likely completions.
What if the posterior p(z|x; θ) is intractable to compute? How loose is the bound?
Variational inference continued
Suppose q(z) is any probability distribution over the hidden variables.
A little bit of algebra reveals

$$D_{KL}(q(z)\,\|\,p(z \mid x; \theta)) = -\sum_z q(z) \log p(z, x; \theta) + \log p(x; \theta) - H(q) \;\geq\; 0$$

Rearranging, we re-derived the Evidence Lower Bound (ELBO):

$$\log p(x; \theta) \geq \sum_z q(z) \log p(z, x; \theta) + H(q)$$

Equality holds if q = p(z|x; θ), because D_KL(q(z) ∥ p(z|x; θ)) = 0:

$$\log p(x; \theta) = \sum_z q(z) \log p(z, x; \theta) + H(q)$$

In general, log p(x; θ) = ELBO + D_KL(q(z) ∥ p(z|x; θ)). The closer q(z) is to p(z|x; θ), the closer the ELBO is to the true log-likelihood.
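This identity is easy to verify numerically on a tiny model with a single binary latent variable (all numbers below are made up): for any q, ELBO + KL equals log p(x), and the ELBO touches log p(x) exactly when q is the true posterior.

import numpy as np
from scipy.stats import norm

# Toy model (assumed): p(z = 1) = 0.3, p(x | z = k) = N(x; mu_k, 1).
prior = np.array([0.7, 0.3])
mus = np.array([-2.0, 2.0])
x = 0.5

joint = prior * norm.pdf(x, loc=mus, scale=1.0)   # p(x, z) for z = 0, 1
log_px = np.log(joint.sum())                      # exact log p(x; theta)
posterior = joint / joint.sum()                   # p(z | x; theta)

def elbo(q):
    q = np.asarray(q, dtype=float)
    return np.sum(q * np.log(joint)) - np.sum(q * np.log(q))  # E_q[log p(x, z)] + H(q)

def kl(q, p):
    q = np.asarray(q, dtype=float)
    return np.sum(q * np.log(q / p))

for q in ([0.5, 0.5], [0.95, 0.05], posterior):
    print(f"ELBO = {elbo(q):.4f}, ELBO + KL = {elbo(q) + kl(q, posterior):.4f}, log p(x) = {log_px:.4f}")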
The Evidence Lower bound

What if the posterior p(z|x; θ) is intractable to compute?
Suppose q(z; ϕ) is a (tractable) probability distribution over the hidden variables, parameterized by ϕ (variational parameters).
For example, a Gaussian with mean and covariance specified by ϕ: q(z; ϕ) = N (ϕ_1, ϕ_2)
Variational inference: pick ϕ so that q(z; ϕ) is as close as possible to p(z|x; θ). In the lecture's figure, the posterior p(z|x; θ) (blue) is better approximated by N (2, 2) (orange) than by N (−4, 0.75) (green).
A variational approximation to the posterior

Assume p(x_top, x_bottom; θ) assigns high probability to images that look like digits. In this example, we assume z = x_top are unobserved (latent).
Suppose q(x_top; ϕ) is a (tractable) probability distribution over the hidden variables (missing pixels in this example) x_top, parameterized by ϕ (variational parameters):

$$q(x^{\text{top}}; \phi) = \prod_{\text{unobserved variables } x_i^{\text{top}}} (\phi_i)^{x_i^{\text{top}}} (1 - \phi_i)^{(1 - x_i^{\text{top}})}$$

Is ϕ_i = 0.5 ∀i a good approximation to the posterior p(x_top | x_bottom; θ)? No
Is ϕ_i = 1 ∀i a good approximation to the posterior p(x_top | x_bottom; θ)? No
Is ϕ_i ≈ 1 for pixels i corresponding to the top part of digit 9 a good approximation? Yes
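A minimal sketch of such a factorized Bernoulli variational family (the pixel count and the "top of a 9" pattern of ϕ values are invented purely for illustration): a peaked ϕ puts far more probability on a plausible completion than the uninformative ϕ_i = 0.5 choice.

import numpy as np

rng = np.random.default_rng(0)
n_missing = 8                                   # number of unobserved top pixels (toy size)

def log_q(x_top, phi):
    # log of prod_i phi_i^(x_i) * (1 - phi_i)^(1 - x_i)
    return np.sum(x_top * np.log(phi) + (1 - x_top) * np.log(1 - phi))

def sample_q(phi, k):
    return (rng.random((k, n_missing)) < phi).astype(int)

phi_uniform = np.full(n_missing, 0.5)           # "know nothing" approximation
phi_nine = np.array([0.05, 0.9, 0.95, 0.9, 0.05, 0.9, 0.05, 0.9])  # invented "top of a 9" pattern

completion = np.round(phi_nine)                 # one plausible completion under phi_nine
print("samples from q:", sample_q(phi_nine, 3))
print("log q(completion) under phi_nine:   ", log_q(completion, phi_nine))
print("log q(completion) under phi_uniform:", log_q(completion, phi_uniform))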
The Evidence Lower bound

$$\log p(x; \theta) \;\geq\; \underbrace{\sum_z q(z; \phi) \log p(z, x; \theta) + H(q(z; \phi))}_{\text{ELBO}} \;=\; \mathcal{L}(x; \theta, \phi)$$

$$\log p(x; \theta) \;=\; \mathcal{L}(x; \theta, \phi) + D_{KL}(q(z; \phi)\,\|\,p(z \mid x; \theta))$$

The better q(z; ϕ) can approximate the posterior p(z|x; θ), the smaller D_KL(q(z; ϕ) ∥ p(z|x; θ)) we can achieve, the closer the ELBO will be to log p(x; θ). Next: jointly optimize over θ and ϕ to maximize the ELBO over a dataset.
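A minimal end-to-end sketch of that joint optimization, on a deliberately tiny model where the ELBO is available in closed form: p(z) = N(0, 1), p(x | z) = N(θz, 1), and a Gaussian q(z; ϕ) = N(m, s²) per data point. Everything here (the model, the three made-up data points, and the use of scipy's generic optimizer instead of stochastic gradients) is an assumption chosen only to keep the example self-contained.

import numpy as np
from scipy.optimize import minimize

data = np.array([2.0, 2.5, -1.0])            # made-up "dataset"

def elbo(x, theta, m, log_s):
    # Closed-form ELBO for p(z)=N(0,1), p(x|z)=N(theta*z, 1), q(z)=N(m, s^2).
    s2 = np.exp(2.0 * log_s)
    e_loglik = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - theta * m) ** 2 + theta ** 2 * s2)
    e_logprior = -0.5 * np.log(2 * np.pi) - 0.5 * (m ** 2 + s2)
    entropy = 0.5 * np.log(2 * np.pi * np.e * s2)
    return e_loglik + e_logprior + entropy

def neg_total_elbo(params):
    theta = params[0]
    ms, log_ss = params[1::2], params[2::2]   # one (m, log s) pair per data point
    return -sum(elbo(x, theta, m, ls) for x, m, ls in zip(data, ms, log_ss))

init = np.concatenate([[1.0], np.zeros(2 * len(data))])   # theta, then variational parameters
result = minimize(neg_total_elbo, init)

print("learned theta:", result.x[0])
print("per-point posterior means (q):", result.x[1::2])

Because this toy model is conjugate (the true posterior over z is Gaussian), q can match it exactly and maximizing the ELBO recovers the exact maximum likelihood fit; with neural networks in the model, that is no longer guaranteed.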
Summary

Latent Variable Models Pros:


Easy to build flexible models
Suitable for unsupervised learning
Latent Variable Models Cons:
Hard to evaluate likelihoods
Hard to train via maximum-likelihood
Fundamentally, the challenge is that posterior inference p(z | x) is hard.
Typically requires variational approximations
Alternative: give up on KL-divergence and likelihood (GANs)

