Mod6_Slides
Autoregressive models. $p_\theta(x_1, x_2, \dots, x_n) = \prod_{i=1}^{n} p_\theta(x_i \mid x_{<i})$
Normalizing flow models. $p_\theta(x) = p(z)\,|\det J_{f_\theta}(x)|$, where $z = f_\theta(x)$.
Variational autoencoders: $p_\theta(x) = \int p(z)\, p_\theta(x \mid z)\, dz$.
Cons: Model architectures are restricted.
Typically, choose the unnormalized non-negative function $g_\theta(x)$ so that we know its volume (normalization constant) analytically. More complex models can be obtained by combining these building blocks:
1 Autoregressive: products of normalized objects $p_\theta(x)\, p_{\theta'(x)}(y)$:
$\int_x \int_y p_\theta(x)\, p_{\theta'(x)}(y)\, dy\, dx = \int_x p_\theta(x) \underbrace{\Big( \int_y p_{\theta'(x)}(y)\, dy \Big)}_{=1} dx = \int_x p_\theta(x)\, dx = 1$
2 Mixture models: convex combinations of normalized objects:
$\int_x \alpha\, p_\theta(x) + (1 - \alpha)\, p_{\theta'}(x)\, dx = \alpha + (1 - \alpha) = 1$
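As a quick numerical sanity check of the two constructions above (an illustrative sketch, not from the slides; the Gaussian building blocks and their parameters are arbitrary choices), both the product-with-a-conditional and the convex mixture integrate to 1:

import numpy as np
from scipy.stats import norm

# Two normalized 1D densities serving as building blocks
x = np.linspace(-10, 10, 2001)
p1 = norm.pdf(x, loc=-1.0, scale=0.5)
p2 = norm.pdf(x, loc=2.0, scale=1.5)

# Mixture: a convex combination of normalized densities stays normalized
alpha = 0.3
print(np.trapz(alpha * p1 + (1 - alpha) * p2, x))  # ~ 1.0

# "Autoregressive" product: p(x) * p_{theta'(x)}(y), where the conditional's
# mean depends on x; the joint integrates to 1 over the (x, y) plane
y = np.linspace(-15, 15, 2001)
X, Y = np.meshgrid(x, y, indexing="ij")
joint = norm.pdf(X, loc=0.0, scale=1.0) * norm.pdf(Y, loc=0.8 * X, scale=1.0)
print(np.trapz(np.trapz(joint, y, axis=1), x))  # ~ 1.0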
How about using models where the “volume”/normalization constant of gθ (x) is
not easy to compute analytically?
$p_\theta(x) = \frac{1}{\int \exp(f_\theta(x))\, dx} \exp(f_\theta(x)) = \frac{1}{Z(\theta)} \exp(f_\theta(x))$
The volume/normalization constant
$Z(\theta) = \int \exp(f_\theta(x))\, dx$
is also called the partition function. Why exponential (and not e.g. $f_\theta(x)^2$)?
1 We want to capture very large variations in probability; log-probability is the natural scale to work with. Otherwise we would need a highly non-smooth $f_\theta$.
2 Exponential families. Many common distributions can be written in this
form.
3 These distributions arise under fairly general assumptions in statistical
physics (maximum entropy, second law of thermodynamics).
−fθ (x) is called the energy, hence the name.
Intuitively, configurations x with low energy (high fθ (x)) are more likely.
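As a concrete 1D sketch (purely illustrative; the quartic form of f_theta and its parameters are arbitrary choices, not from the slides): any scalar function defines an unnormalized density exp(f_theta(x)), and in one dimension the partition function can still be approximated numerically.

import numpy as np

def f_theta(x, a=1.0, b=2.0):
    # -f_theta(x) is the energy; any scalar function is allowed
    return -a * x**4 + b * x**2

grid = np.linspace(-5, 5, 10001)
unnorm = np.exp(f_theta(grid))   # unnormalized density exp(f_theta(x))
Z = np.trapz(unnorm, grid)       # partition function, feasible only in low dimension
p = unnorm / Z                   # normalized density p_theta(x)
print(Z, np.trapz(p, grid))      # second value ~ 1.0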
Energy-based model
$p_\theta(x) = \frac{1}{\int \exp(f_\theta(x))\, dx} \exp(f_\theta(x)) = \frac{1}{Z(\theta)} \exp(f_\theta(x))$
Pros:
1 extreme flexibility: can use pretty much any function fθ (x) you want
Cons:
1 Sampling from pθ (x) is hard
2 Evaluating and optimizing likelihood pθ (x) is hard (learning is hard)
3 No feature learning (but can add latent variables)
Curse of dimensionality: The fundamental issue is that computing Z (θ)
numerically (when no analytic solution is available) scales exponentially in
the number of dimensions of x.
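A rough illustration of the scaling (a sketch under the assumption of a naive uniform-grid quadrature; k and d are illustrative values): with k grid points per dimension, estimating $Z(\theta)$ requires $k^d$ evaluations of $f_\theta$.

# Naive grid-based estimate of Z(theta): k points per dimension, d dimensions
# -> k**d evaluations of f_theta, i.e. exponential in d
k = 100                        # grid resolution per dimension
for d in (1, 2, 3, 10, 784):   # 784 = 28x28, a small grayscale image
    print(f"d = {d}: {k}^{d} = 10^{2 * d} evaluations")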
Nevertheless, some tasks do not require knowing Z (θ)
$p_\theta(x) = \frac{1}{\int \exp(f_\theta(x))\, dx} \exp(f_\theta(x)) = \frac{1}{Z(\theta)} \exp(f_\theta(x))$
[Figure: grid-structured Markov random field for image denoising; each observed corrupted pixel $x_i$ is connected to its latent original pixel $y_i$, and neighboring pixels $y_i$, $y_j$ are connected to each other.]
$\psi_i(x_i, y_i)$: the $i$-th corrupted pixel depends on the $i$-th original pixel
$\psi_{ij}(y_i, y_j)$: neighboring pixels tend to have the same value
What did the original image $y$ look like? Solution: maximize $p(y \mid x)$, or equivalently, maximize $p(y, x)$.
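Spelled out with the potentials above (a sketch of the implied model), the joint distribution is an energy-based model, and the maximization over $y$ never needs the partition function because $Z(\theta)$ does not depend on $y$:

$p(y, x) = \frac{1}{Z(\theta)} \exp\Big( \sum_i \psi_i(x_i, y_i) + \sum_{(i,j) \in E} \psi_{ij}(y_i, y_j) \Big)$
$\arg\max_y p(y \mid x) = \arg\max_y \Big( \sum_i \psi_i(x_i, y_i) + \sum_{(i,j) \in E} \psi_{ij}(y_i, y_j) \Big)$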
Example: Product of Experts
Suppose you have trained several models $q_{\theta_1}(x)$, $r_{\theta_2}(x)$, $t_{\theta_3}(x)$. They can be different models (PixelCNN, Flow, etc.).
Each one is like an expert that can be used to score how likely an input $x$ is.
Assuming the experts make their judgments independently, it is tempting to ensemble them as
$q_{\theta_1}(x)\, r_{\theta_2}(x)\, t_{\theta_3}(x)$
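Since a product of normalized densities is generally no longer normalized, renormalizing it yields an energy-based model (written out here as a sketch, with $f_\theta(x) = \log q_{\theta_1}(x) + \log r_{\theta_2}(x) + \log t_{\theta_3}(x)$):

$p_{\theta_1,\theta_2,\theta_3}(x) = \frac{q_{\theta_1}(x)\, r_{\theta_2}(x)\, t_{\theta_3}(x)}{Z(\theta_1,\theta_2,\theta_3)}, \qquad Z(\theta_1,\theta_2,\theta_3) = \int q_{\theta_1}(x)\, r_{\theta_2}(x)\, t_{\theta_3}(x)\, dx$

Intuitively, the product behaves like an "and" of the experts: a point $x$ gets high probability only if every expert assigns it a reasonably high score.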
[Figure: deep Boltzmann machine with visible units $v$ at the bottom and stacked hidden layers $h^{(1)}$, $h^{(2)}$, $h^{(3)}$, connected by weight matrices $W^{(1)}$, $W^{(2)}$, $W^{(3)}$.]
Bottom layer variables v are pixel values. Layers above (h) represent
“higher-level” features (corners, edges, etc).
Early deep neural networks for supervised learning had to be
pre-trained like this to make them work.
Deep Boltzmann Machines: samples
$p_\theta(x) = \frac{1}{\int \exp(f_\theta(x))\, dx} \exp(f_\theta(x)) = \frac{1}{Z(\theta)} \exp(f_\theta(x))$
Pros:
1 can plug in pretty much any function fθ (x) you want
Cons (lots of them):
1 Sampling is hard
2 Evaluating likelihood (learning) is hard
3 No feature learning
Curse of dimensionality: The fundamental issue is that computing Z (θ)
numerically (when no analytic solution is available) scales exponentially in
the number of dimensions of x.
Goal: maximize $\frac{\exp\{f_\theta(x_{\text{train}})\}}{Z(\theta)}$. Increase the numerator, decrease the denominator.
Intuition: because the model is not normalized, increasing the
un-normalized log-probability fθ (xtrain ) by changing θ does not guarantee
that xtrain becomes relatively more likely (compared to the rest).
We also need to take into account the effect on other “wrong points” and
try to “push them down” to also make Z (θ) small.
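A minimal sketch of this intuition in code (assuming PyTorch; the names f_theta, x_sample, and push_up_push_down_step are hypothetical, and obtaining the "wrong points" x_sample is exactly the hard part that contrastive-divergence-style methods address):

import torch

def push_up_push_down_step(f_theta: torch.nn.Module, optimizer, x_train, x_sample):
    # Push f_theta(x_train) up and f_theta(x_sample) down; x_sample stands in
    # for the "wrong points" whose unnormalized probability we want to shrink.
    loss = -(f_theta(x_train).mean() - f_theta(x_sample).mean())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()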
Goal: maximize $\frac{\exp\{f_\theta(x_{\text{train}})\}}{Z(\theta)}$
$p_\theta(x) = \frac{1}{\int \exp(f_\theta(x))\, dx} \exp(f_\theta(x)) = \frac{1}{Z(\theta)} \exp(f_\theta(x))$
Energy-based models: $p_\theta(x) = \frac{\exp\{f_\theta(x)\}}{Z(\theta)}$.
$Z(\theta)$ is intractable, so no access to likelihood.
Comparing the probability of two points is easy: $p_\theta(x')/p_\theta(x) = \exp(f_\theta(x') - f_\theta(x))$.
Maximum likelihood training: $\max_\theta \{ f_\theta(x_{\text{train}}) - \log Z(\theta) \}$.
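Differentiating this objective (a standard derivation, included here to fill in the step that motivates what comes next) shows that the troublesome term is an expectation under the model itself:

$\nabla_\theta \big( f_\theta(x_{\text{train}}) - \log Z(\theta) \big) = \nabla_\theta f_\theta(x_{\text{train}}) - \frac{1}{Z(\theta)} \int \exp(f_\theta(x))\, \nabla_\theta f_\theta(x)\, dx = \nabla_\theta f_\theta(x_{\text{train}}) - \mathbb{E}_{x \sim p_\theta}\big[\nabla_\theta f_\theta(x)\big]$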
Contrastive divergence: