
Mod6_Slides

The document discusses various deep generative models, including autoregressive models, normalizing flows, variational autoencoders, and generative adversarial networks (GANs), highlighting their architectures, advantages, and challenges. It introduces energy-based models (EBMs) as flexible architectures with stable training and high sample quality, while also addressing the difficulties in sampling and likelihood evaluation. Additionally, it covers applications of EBMs, such as anomaly detection and image restoration, and provides examples like the Ising model and Restricted Boltzmann Machines.


Recap.

Autoregressive models: pθ(x1, x2, · · · , xn) = ∏_{i=1}^n pθ(xi | x<i).
Normalizing flow models: pθ(x) = p(z) |det Jfθ(x)|, where z = fθ(x).
Variational autoencoders: pθ(x) = ∫ p(z) pθ(x | z) dz.
Cons: Model architectures are restricted.

Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 2/1


Recap.

Generative Adversarial Networks (GANs).


minθ maxϕ Ex∼pdata [log Dϕ (x)] + Ez∼p(z) [log(1 − Dϕ (Gθ (z)))].
Two-sample tests: can (approximately) optimize f-divergences and the
Wasserstein distance.
Very flexible model architectures. But likelihood is intractable, training
is unstable, hard to evaluate, and has mode collapse issues.

Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 3/1


Today’s lecture

Energy-based models (EBMs).


Very flexible model architectures.
Stable training.
Relatively high sample quality.
Flexible composition.

Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 4/1


Parameterizing probability distributions

Probability distributions p(x) are a key building block in generative modeling.
1 non-negative: p(x) ≥ 0
2 sum-to-one: Σ_x p(x) = 1 (or ∫ p(x) dx = 1 for continuous variables)
Coming up with a non-negative function pθ (x) is not hard.
Given any function fθ (x), we can choose
gθ (x) = fθ (x)2
gθ (x) = exp(fθ (x))
gθ (x) = |fθ (x)|
gθ (x) = log(1 + exp(fθ (x)))
etc.

Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 5/1


Parameterizing probability distributions
Probability distributions p(x) are a key building block in generative modeling.
1 non-negative: p(x) ≥ 0
2 sum-to-one: Σ_x p(x) = 1 (or ∫ p(x) dx = 1 for continuous variables)
Sum-to-one is key:
Total “volume” is fixed: increasing p(xtrain) guarantees that xtrain becomes relatively more likely (compared to the rest).
Problem:
gθ(x) ≥ 0 is easy, but gθ(x) might not sum-to-one.
Σ_x gθ(x) = Z(θ) ≠ 1 in general, so gθ(x) is not a valid probability mass function or density (for the continuous case, ∫ gθ(x) dx ≠ 1).
Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 6/1
Parameterizing probability distributions
Problem: gθ (x) ≥ 0 is easy, but gθ (x) might not be normalized
Solution:

pθ(x) = (1/Z(θ)) gθ(x) = (1/∫ gθ(x) dx) gθ(x) = (1/Volume(gθ)) gθ(x)

Then by definition, ∫ pθ(x) dx = ∫ gθ(x)/Z(θ) dx = Z(θ)/Z(θ) = 1.
Example: choose gθ(x) so that we know the volume analytically as a function of θ.

1 g(µ,σ)(x) = e^{−(x−µ)²/(2σ²)}. Volume: ∫ e^{−(x−µ)²/(2σ²)} dx = √(2πσ²). → Gaussian
2 gλ(x) = e^{−λx}. Volume: ∫_0^{+∞} e^{−λx} dx = 1/λ. → Exponential
3 gθ(x) = h(x) exp{θ · T(x)}. Volume: exp{A(θ)}, where A(θ) = log ∫ h(x) exp{θ · T(x)} dx. → Exponential family
Normal, Poisson, exponential, Bernoulli, beta, gamma, Dirichlet, Wishart, etc.
Functional forms gθ(x) need to allow analytic integration. Despite being restrictive, they are very useful as building blocks for more complex distributions.
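To make example 3 concrete, here is the Bernoulli distribution written in exponential-family form (a standard derivation added here for illustration; it is not on the slide):

```latex
\begin{aligned}
p(x) &= p^{x}(1-p)^{1-x}, \qquad x \in \{0, 1\} \\
     &= \exp\Big\{ x \log\tfrac{p}{1-p} + \log(1-p) \Big\} \\
     &= h(x)\,\exp\{\theta\, T(x) - A(\theta)\},
\end{aligned}
```

with h(x) = 1, T(x) = x, θ = log(p/(1−p)), and A(θ) = log(1 + e^θ). The volume of gθ(x) = exp{θ x} is Σ_{x∈{0,1}} e^{θx} = 1 + e^θ = exp{A(θ)}, as claimed.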
Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 7/1
Likelihood based learning

Problem: gθ (x) ≥ 0 is easy, but gθ (x) might not be normalized


Solution:

pθ(x) = (1/Volume(gθ)) gθ(x) = (1/∫ gθ(x) dx) gθ(x) = (1/Z(θ)) gθ(x)

Typically, choose gθ (x) so that we know the volume analytically. More complex
models can be obtained by combining these building blocks.
1 Autoregressive: products of normalized objects pθ(x) pθ′(x)(y), where the parameters θ′(x) of the second factor depend on x:
∫_x ∫_y pθ(x) pθ′(x)(y) dy dx = ∫_x pθ(x) ( ∫_y pθ′(x)(y) dy ) dx = ∫_x pθ(x) dx = 1,
since the inner integral ∫_y pθ′(x)(y) dy = 1.
2 Latent variables: mixtures of normalized objects α pθ(x) + (1 − α) pθ′(x):
∫_x ( α pθ(x) + (1 − α) pθ′(x) ) dx = α + (1 − α) = 1
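As a quick numerical sanity check of these two constructions, a minimal sketch using 1-D Gaussian building blocks (chosen purely for illustration):

```python
import numpy as np

# Check that products of normalized conditionals and mixtures stay normalized.
def gauss(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

xs = np.linspace(-10, 10, 2001)
ys = np.linspace(-10, 10, 2001)
dx = xs[1] - xs[0]

# Autoregressive-style product p(x) * p(y | x); here the conditional mean is x / 2.
joint = gauss(xs[:, None], 0.0, 1.0) * gauss(ys[None, :], xs[:, None] / 2, 1.0)
print(np.sum(joint) * dx * dx)   # ~1.0

# Mixture alpha * p1(x) + (1 - alpha) * p2(x).
alpha = 0.3
mix = alpha * gauss(xs, -2.0, 1.0) + (1 - alpha) * gauss(xs, 3.0, 0.5)
print(np.sum(mix) * dx)          # ~1.0
```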
How about using models where the “volume”/normalization constant of gθ (x) is
not easy to compute analytically?

Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 8/1


Energy-based model

pθ(x) = (1/∫ exp(fθ(x)) dx) exp(fθ(x)) = (1/Z(θ)) exp(fθ(x))

The volume/normalization constant

Z(θ) = ∫ exp(fθ(x)) dx

is also called the partition function. Why exponential (and not e.g. fθ(x)²)?
1 Want to capture very large variations in probability. log-probability is the
natural scale we want to work with. Otherwise need highly non-smooth fθ .
2 Exponential families. Many common distributions can be written in this
form.
3 These distributions arise under fairly general assumptions in statistical
physics (maximum entropy, second law of thermodynamics).
−fθ (x) is called the energy, hence the name.
Intuitively, configurations x with low energy (high fθ (x)) are more likely.
Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 9/1
Energy-based model

pθ(x) = (1/∫ exp(fθ(x)) dx) exp(fθ(x)) = (1/Z(θ)) exp(fθ(x))
Pros:
1 extreme flexibility: can use pretty much any function fθ (x) you want
Cons:
1 Sampling from pθ (x) is hard
2 Evaluating and optimizing likelihood pθ (x) is hard (learning is hard)
3 No feature learning (but can add latent variables)
Curse of dimensionality: The fundamental issue is that computing Z (θ)
numerically (when no analytic solution is available) scales exponentially in
the number of dimensions of x.
Nevertheless, some tasks do not require knowing Z (θ)

Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 10 / 1


Applications of Energy-based models

pθ(x) = (1/∫ exp(fθ(x)) dx) exp(fθ(x)) = (1/Z(θ)) exp(fθ(x))

Given x, x′, evaluating pθ(x) or pθ(x′) requires Z(θ). However, their ratio

pθ(x) / pθ(x′) = exp(fθ(x) − fθ(x′))

does not involve Z(θ).


This means we can easily check which one is more likely. Applications:
1 anomaly detection
2 denoising
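For instance, a minimal sketch of such a relative comparison; the quadratic fθ below is a toy stand-in for a trained energy network:

```python
import numpy as np

def f_theta(x):
    # Unnormalized log-probability: a Gaussian-shaped bump around the origin.
    return -0.5 * np.sum(x ** 2)

x_normal = np.array([0.1, -0.2])
x_weird = np.array([4.0, 5.0])

# p(x) / p(x') = exp(f(x) - f(x')) -- Z(theta) cancels out.
ratio = np.exp(f_theta(x_normal) - f_theta(x_weird))
print(f"p(x_normal) / p(x_weird) = {ratio:.3e}")   # >> 1: x_weird flagged as anomalous
```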

Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 11 / 1


Applications of Energy-based models

[Figure: three conditional energy-based models E(Y, X) illustrating object recognition, sequence labeling, and image restoration.]

Given a trained model, many applications require relative comparisons. Hence Z(θ) is not needed.

Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 12 / 1


Example: Ising Model
There is a true image y ∈ {0, 1}^{3×3} and a corrupted image x ∈ {0, 1}^{3×3}. We know x, and want to somehow recover y.
[Figure: Markov Random Field on a 3×3 grid — each noisy pixel Xi is connected to its “true” pixel Yi, and neighboring Yi are connected to each other.]

Xi: noisy pixels
Yi: “true” pixels

We model the joint probability distribution p(y, x) as

p(y, x) = (1/Z) exp( Σ_i ψi(xi, yi) + Σ_{(i,j)∈E} ψij(yi, yj) )

ψi (xi , yi ): the i-th corrupted pixel depends on the i-th original pixel
ψij (yi , yj ): neighboring pixels tend to have the same value
What did the original image y look like? Solution: maximize p(y | x), or
equivalently, maximize p(y, x).
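A minimal brute-force sketch of this denoising setup; the potential values and the corrupted image below are illustrative assumptions, not from the slide:

```python
import itertools
import numpy as np

PSI_DATA = 2.0    # reward when y_i agrees with the observed x_i
PSI_PAIR = 1.0    # reward when neighboring y_i, y_j agree

def unnormalized_log_p(y, x):
    """sum_i psi_i(x_i, y_i) + sum_{(i,j) in E} psi_ij(y_i, y_j), up to log Z."""
    score = PSI_DATA * np.sum(y == x)
    # Edges between horizontal and vertical neighbors on the 3x3 grid.
    score += PSI_PAIR * np.sum(y[:, :-1] == y[:, 1:])
    score += PSI_PAIR * np.sum(y[:-1, :] == y[1:, :])
    return score

x = np.array([[1, 1, 0],
              [1, 0, 1],   # a few pixels look corrupted
              [1, 1, 1]])

# Brute-force MAP: maximize p(y | x), i.e. maximize the unnormalized p(y, x).
best = max((np.array(bits).reshape(3, 3)
            for bits in itertools.product([0, 1], repeat=9)),
           key=lambda y: unnormalized_log_p(y, x))
print(best)
```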
Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 13 / 1
Example: Product of Experts
Suppose you have trained several models qθ1 (x), rθ2 (x), tθ3 (x). They
can be different models (PixelCNN, Flow, etc.)
Each one is like an expert that can be used to score how likely an
input x is.
Assuming the experts make their judgments independently, it is
tempting to ensemble them as
qθ1(x) rθ2(x) tθ3(x)

To get a valid probability distribution, we need to normalize:

pθ1,θ2,θ3(x) = (1/Z(θ1, θ2, θ3)) qθ1(x) rθ2(x) tθ3(x)

Note: similar to an AND operation (e.g., probability is zero as long as one model gives zero probability), unlike mixture models, which behave more like OR.
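A minimal sketch of combining experts into an unnormalized energy; the 1-D log-densities below are toy stand-ins for trained models such as a PixelCNN or a flow:

```python
import numpy as np

def log_q(x): return -0.5 * (x - 1.0) ** 2          # expert 1
def log_r(x): return -0.5 * ((x + 0.5) / 2.0) ** 2  # expert 2
def log_t(x): return -np.abs(x)                     # expert 3

def f_theta(x):
    # Unnormalized log-probability of the product: log q + log r + log t.
    return log_q(x) + log_r(x) + log_t(x)

# Relative comparisons remain easy; Z(theta1, theta2, theta3) cancels.
xa, xb = 0.3, 4.0
print(np.exp(f_theta(xa) - f_theta(xb)))  # how much more likely xa is than xb
```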
Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 14 / 1
Example: Product of Experts

[Figure omitted. Image source: Du et al., 2020.]


Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 15 / 1
Example: Restricted Boltzmann machine (RBM)
RBM: energy-based model with latent variables
Two types of variables:
1 x ∈ {0, 1}^n are visible variables (e.g., pixel values)
2 z ∈ {0, 1}^m are latent ones

The joint distribution is

pW,b,c(x, z) = (1/Z) exp( xᵀWz + bᵀx + cᵀz ) = (1/Z) exp( Σ_{i=1}^n Σ_{j=1}^m wij xi zj + bᵀx + cᵀz )

[Figure: bipartite graph connecting visible units x and hidden units z.]

Restricted because there are no visible–visible and hidden–hidden connections, i.e., no xi xj or zi zj terms in the objective.
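A minimal sketch of the RBM’s un-normalized log-probability and its conditionals (toy dimensions and random parameters are assumptions; the sigmoid form of the conditionals follows from the bipartite structure):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 4                     # visible / hidden dimensions (toy sizes)
W = rng.normal(scale=0.1, size=(n, m))
b = rng.normal(scale=0.1, size=n)
c = rng.normal(scale=0.1, size=m)

def unnormalized_log_p(x, z):
    # x^T W z + b^T x + c^T z, i.e. log p(x, z) + log Z.
    return x @ W @ z + b @ x + c @ z

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Because there are no visible-visible or hidden-hidden terms, the conditionals
# factorize: p(z_j = 1 | x) = sigmoid(c_j + (W^T x)_j), and symmetrically for x.
def sample_z_given_x(x):
    return (rng.random(m) < sigmoid(c + W.T @ x)).astype(float)

def sample_x_given_z(z):
    return (rng.random(n) < sigmoid(b + W @ z)).astype(float)

x = rng.integers(0, 2, size=n).astype(float)
z = sample_z_given_x(x)
print(unnormalized_log_p(x, z))
```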
Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 16 / 1
Example: Deep Boltzmann Machines
Stacked RBMs are one of the first deep generative models:

Deep Boltzmann machine
[Figure: visible units v connected through weight matrices W(1), W(2), W(3) to stacked hidden layers h(1), h(2), h(3).]
Bottom layer variables v are pixel values. Layers above (h) represent
“higher-level” features (corners, edges, etc).
Early deep neural networks for supervised learning had to be
pre-trained like this to make them work.
Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 17 / 1
Deep Boltzmann Machines: samples

Image source: Salakhutdinov and Hinton, 2009.

Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 18 / 1


Energy-based models: learning and inference

pθ(x) = (1/∫ exp(fθ(x)) dx) exp(fθ(x)) = (1/Z(θ)) exp(fθ(x))
Pros:
1 can plug in pretty much any function fθ (x) you want
Cons (lots of them):
1 Sampling is hard
2 Evaluating likelihood (learning) is hard
3 No feature learning
Curse of dimensionality: The fundamental issue is that computing Z (θ)
numerically (when no analytic solution is available) scales exponentially in
the number of dimensions of x.

Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 19 / 1


Computing the normalization constant is hard
As an example, the RBM joint distribution is
pW,b,c(x, z) = (1/Z) exp( xᵀWz + bᵀx + cᵀz )

where
1 x ∈ {0, 1}^n are visible variables (e.g., pixel values)
2 z ∈ {0, 1}^m are latent ones
The normalization constant (the “volume”) is

Z(W, b, c) = Σ_{x∈{0,1}^n} Σ_{z∈{0,1}^m} exp( xᵀWz + bᵀx + cᵀz )

Note: Z is a well-defined function of the parameters W, b, c, but has no simple closed form and takes time exponential in n, m to compute. This means that evaluating the objective pW,b,c(x, z) for likelihood-based learning is hard.
Observation: optimizing the likelihood pW,b,c(x, z) is difficult, but optimizing the un-normalized probability exp( xᵀWz + bᵀx + cᵀz ) (w.r.t. trainable parameters W, b, c) is easy.
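A minimal sketch of this brute-force computation for a toy RBM (illustrative sizes; the double sum below is exactly the 2^n · 2^m enumeration that becomes infeasible as n, m grow):

```python
import itertools
import numpy as np

n, m = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(n, m))
b = rng.normal(scale=0.1, size=n)
c = rng.normal(scale=0.1, size=m)

Z = 0.0
for x_bits in itertools.product([0, 1], repeat=n):
    for z_bits in itertools.product([0, 1], repeat=m):
        x, z = np.array(x_bits), np.array(z_bits)
        Z += np.exp(x @ W @ z + b @ x + c @ z)
print(Z)   # 2^4 * 2^3 = 128 terms here; 2^n * 2^m in general
```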
Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 20 / 1
Training intuition

Goal: maximize exp{fθ(xtrain)} / Z(θ). Increase numerator, decrease denominator.
Intuition: because the model is not normalized, increasing the
un-normalized log-probability fθ (xtrain ) by changing θ does not guarantee
that xtrain becomes relatively more likely (compared to the rest).
We also need to take into account the effect on other “wrong points” and
try to “push them down” to also make Z (θ) small.

Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 21 / 1


Contrastive Divergence

Goal: maximize exp{fθ(xtrain)} / Z(θ)

Idea: Instead of evaluating Z (θ) exactly, use a Monte Carlo estimate.


Contrastive divergence algorithm: sample xsample ∼ pθ , take step on
∇θ (fθ (xtrain ) − fθ (xsample )). Make training data more likely than typical
sample from the model.

Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 22 / 1


Contrastive Divergence

Maximize log-likelihood: maxθ fθ (xtrain ) − log Z (θ).


Gradient of log-likelihood:

∇θ fθ(xtrain) − ∇θ log Z(θ)
= ∇θ fθ(xtrain) − ∇θ Z(θ) / Z(θ)
= ∇θ fθ(xtrain) − (1/Z(θ)) ∫ ∇θ exp{fθ(x)} dx
= ∇θ fθ(xtrain) − (1/Z(θ)) ∫ exp{fθ(x)} ∇θ fθ(x) dx
= ∇θ fθ(xtrain) − ∫ (exp{fθ(x)} / Z(θ)) ∇θ fθ(x) dx
= ∇θ fθ(xtrain) − E_{xsample}[∇θ fθ(xsample)]
≈ ∇θ fθ(xtrain) − ∇θ fθ(xsample),

where xsample ∼ exp{fθ(x)}/Z(θ) = pθ(x).
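A minimal contrastive-divergence training step might look like the following sketch; the MLP energy, the optimizer choice, and the placeholder negative samples are all assumptions for illustration (real negatives would come from the MCMC samplers discussed next):

```python
import torch

# Small energy network f_theta: R^2 -> R.
f_theta = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.SiLU(), torch.nn.Linear(64, 1)
)
optimizer = torch.optim.Adam(f_theta.parameters(), lr=1e-3)

def cd_step(x_train, x_sample):
    # Maximize f(x_train) - f(x_sample)  <=>  minimize the negated difference.
    loss = -(f_theta(x_train).mean() - f_theta(x_sample).mean())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

x_train = torch.randn(32, 2)     # a batch of data (placeholder)
x_sample = torch.randn(32, 2)    # stand-in for MCMC samples from p_theta
print(cd_step(x_train, x_sample))
```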


How to sample?

Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 23 / 1


Sampling from energy-based models

pθ(x) = (1/∫ exp(fθ(x)) dx) exp(fθ(x)) = (1/Z(θ)) exp(fθ(x))

No direct way to sample like in autoregressive or flow models. Main issue: we cannot easily compute how likely each possible sample is.
However, we can easily compare two samples x, x′ .
Use an iterative approach called Markov Chain Monte Carlo:
1 Initialize x0 randomly, t = 0
2 Let x′ = xt + noise
1 If fθ(x′) > fθ(xt), let xt+1 = x′
2 Else let xt+1 = x′ with probability exp(fθ(x′) − fθ(xt)), otherwise keep xt+1 = xt
3 Go to step 2
Works in theory, but can take a very long time to converge
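A minimal sketch of this procedure, with a toy quadratic fθ whose target distribution is a standard Gaussian; a learned energy network would be plugged in the same way:

```python
import numpy as np

rng = np.random.default_rng(0)

def f_theta(x):
    return -0.5 * np.sum(x ** 2)

def mh_sample(n_steps=500, noise_scale=0.5, dim=2):
    x = rng.normal(size=dim)                           # x0 ~ pi(x)
    for _ in range(n_steps):
        x_prop = x + noise_scale * rng.normal(size=dim)
        # Accept if f increases, else accept with probability exp(f(x') - f(x)).
        if np.log(rng.random()) < f_theta(x_prop) - f_theta(x):
            x = x_prop
    return x

samples = np.stack([mh_sample() for _ in range(200)])
print(samples.mean(axis=0), samples.std(axis=0))   # roughly zero mean, unit std
```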

Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 24 / 1


Sampling from energy-based models

For any continuous distribution pθ(x), suppose we can compute its gradient (the score function) ∇x log pθ(x).
Let π(x) be a prior distribution that is easy to sample from.
Langevin MCMC.
x0 ∼ π(x)
Repeat xt+1 = xt + ϵ ∇x log pθ(xt) + √(2ϵ) zt for t = 0, 1, 2, · · · , T − 1, where zt ∼ N(0, I).
If ϵ → 0 and T → ∞, we have xT ∼ pθ (x).
Note that for energy-based models, the score function is tractable:

∇x log pθ(x) = ∇x fθ(x) − ∇x log Z(θ) = ∇x fθ(x),

since ∇x log Z(θ) = 0 (Z(θ) does not depend on x).
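A minimal sketch of Langevin sampling from an EBM; the quadratic fθ is again a toy assumption, and ∇x fθ(x) is obtained by automatic differentiation, as is typical for neural energies:

```python
import torch

def f_theta(x):
    return -0.5 * (x ** 2).sum(dim=-1)

def langevin_sample(n_chains=1000, dim=2, T=500, eps=1e-2):
    x = torch.randn(n_chains, dim)                            # x0 ~ pi(x)
    for _ in range(T):
        x = x.detach().requires_grad_(True)
        score = torch.autograd.grad(f_theta(x).sum(), x)[0]   # grad_x log p = grad_x f
        x = x + eps * score + (2 * eps) ** 0.5 * torch.randn_like(x)
    return x.detach()

samples = langevin_sample()
print(samples.mean(dim=0), samples.std(dim=0))   # roughly zero mean, unit std
```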

Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 25 / 1


Recap. of last lecture

Energy-based models: pθ(x) = exp{fθ(x)} / Z(θ).
Z (θ) is intractable, so no access to likelihood.
Comparing the probability of two points is easy:
pθ (x′ )/pθ (x) = exp(fθ (x′ ) − fθ (x)).
Maximum likelihood training: maxθ {fθ (xtrain ) − log Z (θ)}.
Contrastive divergence:

∇θ fθ (xtrain ) − ∇θ log Z (θ) ≈ ∇θ fθ (xtrain ) − ∇θ fθ (xsample ),

where xsample ∼ pθ (x).


Stefano Ermon (AI Lab) Deep Generative Models Lecture 12 2/1
Sampling from EBMs: MH-MCMC

Metropolis-Hastings Markov chain Monte Carlo (MCMC).


1 x0 ∼ π(x)
2 Repeat for t = 0, 1, 2, · · · , T − 1:
x′ = xt + noise
xt+1 = x′ if fθ (x′ ) ≥ fθ (xt )
If fθ (x′ ) < fθ (xt ), set xt+1 = x′ with probability exp{fθ (x′ ) − fθ (xt )},
otherwise set xt+1 = xt .
Properties:
In theory, xT converges to pθ (x) when T → ∞. Why?
Satisfies detailed balance condition: pθ (x)Tx→x′ = pθ (x′ )Tx′ →x where
Tx→x′ is the probability of transitioning from x to x′
If xt is distributed as pθ , then xt+1 is distributed as pθ .
In practice, need a large number of iterations and convergence slows
down exponentially in dimensionality.

Stefano Ermon (AI Lab) Deep Generative Models Lecture 12 3/1


Sampling from EBMs: unadjusted Langevin MCMC

Unadjusted Langevin MCMC:


1 x0 ∼ π(x)
2 Repeat for t = 0, 1, 2, · · · , T − 1:
zt ∼ N(0, I)
xt+1 = xt + ϵ ∇x log pθ(x)|x=xt + √(2ϵ) zt
Properties:
xT converges to a sample from pθ (x) when T → ∞ and ϵ → 0.
∇x log pθ (x) = ∇x fθ (x) for continuous energy-based models.
Convergence slows down as dimensionality grows.

Sampling converges slowly in high-dimensional spaces and is thus very expensive, yet we need sampling for each training iteration in contrastive divergence.

Stefano Ermon (AI Lab) Deep Generative Models Lecture 12 4/1
