
Mod6_Slides

The document discusses various deep generative models, including autoregressive models, normalizing flows, variational autoencoders, and generative adversarial networks (GANs), highlighting their architectures, advantages, and challenges. It introduces energy-based models (EBMs) as flexible architectures with stable training and high sample quality, while also addressing the difficulties in sampling and likelihood evaluation. Additionally, it covers applications of EBMs, such as anomaly detection and image restoration, and provides examples like the Ising model and Restricted Boltzmann Machines.


Recap.

Autoregressive models: pθ(x1, x2, · · · , xn) = ∏_{i=1}^n pθ(xi | x<i).
Normalizing flow models: pθ(x) = p(z) |det Jfθ(x)|, where z = fθ(x).
Variational autoencoders: pθ(x) = ∫ p(z) pθ(x | z) dz.
Cons: Model architectures are restricted.

Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 2/1


Recap.

Generative Adversarial Networks (GANs).


minθ maxϕ Ex∼pdata [log Dϕ (x)] + Ez∼p(z) [log(1 − Dϕ (Gθ (z)))].
Two-sample tests: can (approximately) optimize f-divergences and the
Wasserstein distance.
Very flexible model architectures. But likelihood is intractable, training
is unstable, hard to evaluate, and has mode collapse issues.

Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 3/1


Today’s lecture

Energy-based models (EBMs).


Very flexible model architectures.
Stable training.
Relatively high sample quality.
Flexible composition.

Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 4/1


Parameterizing probability distributions

Probability distributions p(x) are a key building block in generative modeling.
1 non-negative: p(x) ≥ 0
2 sum-to-one: Σ_x p(x) = 1 (or ∫ p(x) dx = 1 for continuous variables)
Coming up with a non-negative function pθ (x) is not hard.
Given any function fθ (x), we can choose
gθ (x) = fθ (x)2
gθ (x) = exp(fθ (x))
gθ (x) = |fθ (x)|
gθ (x) = log(1 + exp(fθ (x)))
etc.

Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 5/1


Parameterizing probability distributions
Probability distributions p(x) are a key building block in generative modeling.
1 non-negative: p(x) ≥ 0
2 sum-to-one: Σ_x p(x) = 1 (or ∫ p(x) dx = 1 for continuous variables)
Sum-to-one is key:
Total “volume” is fixed: increasing p(xtrain) guarantees that xtrain becomes relatively more likely (compared to the rest).
Problem:
gθ(x) ≥ 0 is easy, but gθ(x) might not sum-to-one.
Σ_x gθ(x) = Z(θ) ≠ 1 in general, so gθ(x) is not a valid probability mass function or density (for the continuous case, ∫ gθ(x) dx ≠ 1).
Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 6/1
Parameterizing probability distributions
Problem: gθ (x) ≥ 0 is easy, but gθ (x) might not be normalized
Solution:

pθ(x) = (1/Z(θ)) gθ(x) = (1/∫ gθ(x) dx) gθ(x) = (1/Volume(gθ)) gθ(x)

Then by definition, ∫ pθ(x) dx = ∫ gθ(x)/Z(θ) dx = Z(θ)/Z(θ) = 1.
Example: choose gθ(x) so that we know the volume analytically as a function of θ.

1 g(µ,σ)(x) = e^{−(x−µ)²/(2σ²)}. Volume: ∫ e^{−(x−µ)²/(2σ²)} dx = √(2πσ²). → Gaussian
2 gλ(x) = e^{−λx}. Volume: ∫_0^{+∞} e^{−λx} dx = 1/λ. → Exponential
3 gθ(x) = h(x) exp{θ · T(x)}. Volume: exp{A(θ)}, where A(θ) = log ∫ h(x) exp{θ · T(x)} dx. → Exponential family
Normal, Poisson, exponential, Bernoulli, beta, gamma, Dirichlet, Wishart, etc.
Functional forms gθ(x) need to allow analytic integration. Despite being restrictive, they are very useful as building blocks for more complex distributions.
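To make example 3 concrete, here is the Bernoulli distribution written in exponential-family form (a standard derivation added here for illustration; it is not on the slide):

```latex
\begin{aligned}
p(x) &= p^{x}(1-p)^{1-x}, \qquad x \in \{0, 1\} \\
     &= \exp\Big\{ x \log\tfrac{p}{1-p} + \log(1-p) \Big\} \\
     &= h(x)\,\exp\{\theta\, T(x) - A(\theta)\},
\end{aligned}
```

with h(x) = 1, T(x) = x, θ = log(p/(1−p)), and A(θ) = log(1 + e^θ). The volume of gθ(x) = exp{θ x} is Σ_{x∈{0,1}} e^{θx} = 1 + e^θ = exp{A(θ)}, as claimed.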
Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 7/1
Likelihood based learning

Problem: gθ (x) ≥ 0 is easy, but gθ (x) might not be normalized


Solution:

pθ(x) = (1/Volume(gθ)) gθ(x) = (1/∫ gθ(x) dx) gθ(x) = (1/Z(θ)) gθ(x)

Typically, choose gθ (x) so that we know the volume analytically. More complex
models can be obtained by combining these building blocks.
1 Autoregressive: products of normalized objects pθ(x) pθ′(x)(y), where the parameters θ′(x) of the second factor depend on x:
∫_x ∫_y pθ(x) pθ′(x)(y) dy dx = ∫_x pθ(x) ( ∫_y pθ′(x)(y) dy ) dx = ∫_x pθ(x) dx = 1,
since the inner integral ∫_y pθ′(x)(y) dy = 1.
2 Latent variables: mixtures of normalized objects α pθ(x) + (1 − α) pθ′(x):
∫_x ( α pθ(x) + (1 − α) pθ′(x) ) dx = α + (1 − α) = 1
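As a quick numerical sanity check of these two constructions, a minimal sketch using 1-D Gaussian building blocks (chosen purely for illustration):

```python
import numpy as np

# Check that products of normalized conditionals and mixtures stay normalized.
def gauss(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

xs = np.linspace(-10, 10, 2001)
ys = np.linspace(-10, 10, 2001)
dx = xs[1] - xs[0]

# Autoregressive-style product p(x) * p(y | x); here the conditional mean is x / 2.
joint = gauss(xs[:, None], 0.0, 1.0) * gauss(ys[None, :], xs[:, None] / 2, 1.0)
print(np.sum(joint) * dx * dx)   # ~1.0

# Mixture alpha * p1(x) + (1 - alpha) * p2(x).
alpha = 0.3
mix = alpha * gauss(xs, -2.0, 1.0) + (1 - alpha) * gauss(xs, 3.0, 0.5)
print(np.sum(mix) * dx)          # ~1.0
```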
How about using models where the “volume”/normalization constant of gθ (x) is
not easy to compute analytically?

Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 8/1


Energy-based model

pθ(x) = (1/∫ exp(fθ(x)) dx) exp(fθ(x)) = (1/Z(θ)) exp(fθ(x))

The volume/normalization constant

Z(θ) = ∫ exp(fθ(x)) dx

is also called the partition function. Why exponential (and not e.g. fθ(x)²)?
1 Want to capture very large variations in probability. log-probability is the
natural scale we want to work with. Otherwise need highly non-smooth fθ .
2 Exponential families. Many common distributions can be written in this
form.
3 These distributions arise under fairly general assumptions in statistical
physics (maximum entropy, second law of thermodynamics).
−fθ (x) is called the energy, hence the name.
Intuitively, configurations x with low energy (high fθ (x)) are more likely.
Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 9/1
Energy-based model

pθ(x) = (1/∫ exp(fθ(x)) dx) exp(fθ(x)) = (1/Z(θ)) exp(fθ(x))
Pros:
1 extreme flexibility: can use pretty much any function fθ (x) you want
Cons:
1 Sampling from pθ (x) is hard
2 Evaluating and optimizing likelihood pθ (x) is hard (learning is hard)
3 No feature learning (but can add latent variables)
Curse of dimensionality: The fundamental issue is that computing Z (θ)
numerically (when no analytic solution is available) scales exponentially in
the number of dimensions of x.
Nevertheless, some tasks do not require knowing Z (θ)

Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 10 / 1


Applications of Energy-based models

pθ(x) = (1/∫ exp(fθ(x)) dx) exp(fθ(x)) = (1/Z(θ)) exp(fθ(x))

Given x, x′, evaluating pθ(x) or pθ(x′) requires Z(θ). However, their ratio

pθ(x) / pθ(x′) = exp(fθ(x) − fθ(x′))

does not involve Z(θ).


This means we can easily check which one is more likely. Applications:
1 anomaly detection
2 denoising
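For instance, a minimal sketch of such a relative comparison; the quadratic fθ below is a toy stand-in for a trained energy network:

```python
import numpy as np

def f_theta(x):
    # Unnormalized log-probability: a Gaussian-shaped bump around the origin.
    return -0.5 * np.sum(x ** 2)

x_normal = np.array([0.1, -0.2])
x_weird = np.array([4.0, 5.0])

# p(x) / p(x') = exp(f(x) - f(x')) -- Z(theta) cancels out.
ratio = np.exp(f_theta(x_normal) - f_theta(x_weird))
print(f"p(x_normal) / p(x_weird) = {ratio:.3e}")   # >> 1: x_weird flagged as anomalous
```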

Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 11 / 1


Applications of Energy-based models

[Figure: three conditional energy-based models E(Y, X) illustrating object recognition, sequence labeling, and image restoration.]

Given a trained model, many applications require relative comparisons. Hence Z(θ) is not needed.

Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 12 / 1


Example: Ising Model
There is a true image y ∈ {0, 1}^{3×3} and a corrupted image x ∈ {0, 1}^{3×3}. We know x, and want to somehow recover y.
[Figure: Markov Random Field on a 3×3 grid — each noisy pixel Xi is connected to its “true” pixel Yi, and neighboring Yi are connected to each other.]

Xi: noisy pixels
Yi: “true” pixels

We model the joint probability distribution p(y, x) as

p(y, x) = (1/Z) exp( Σ_i ψi(xi, yi) + Σ_{(i,j)∈E} ψij(yi, yj) )

ψi (xi , yi ): the i-th corrupted pixel depends on the i-th original pixel
ψij (yi , yj ): neighboring pixels tend to have the same value
What did the original image y look like? Solution: maximize p(y | x), or
equivalently, maximize p(y, x).
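A minimal brute-force sketch of this denoising setup; the potential values and the corrupted image below are illustrative assumptions, not from the slide:

```python
import itertools
import numpy as np

PSI_DATA = 2.0    # reward when y_i agrees with the observed x_i
PSI_PAIR = 1.0    # reward when neighboring y_i, y_j agree

def unnormalized_log_p(y, x):
    """sum_i psi_i(x_i, y_i) + sum_{(i,j) in E} psi_ij(y_i, y_j), up to log Z."""
    score = PSI_DATA * np.sum(y == x)
    # Edges between horizontal and vertical neighbors on the 3x3 grid.
    score += PSI_PAIR * np.sum(y[:, :-1] == y[:, 1:])
    score += PSI_PAIR * np.sum(y[:-1, :] == y[1:, :])
    return score

x = np.array([[1, 1, 0],
              [1, 0, 1],   # a few pixels look corrupted
              [1, 1, 1]])

# Brute-force MAP: maximize p(y | x), i.e. maximize the unnormalized p(y, x).
best = max((np.array(bits).reshape(3, 3)
            for bits in itertools.product([0, 1], repeat=9)),
           key=lambda y: unnormalized_log_p(y, x))
print(best)
```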
Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 13 / 1
Example: Product of Experts
Suppose you have trained several models qθ1 (x), rθ2 (x), tθ3 (x). They
can be different models (PixelCNN, Flow, etc.)
Each one is like an expert that can be used to score how likely an
input x is.
Assuming the experts make their judgments independently, it is
tempting to ensemble them as
qθ1(x) rθ2(x) tθ3(x)

To get a valid probability distribution, we need to normalize:

pθ1,θ2,θ3(x) = (1/Z(θ1, θ2, θ3)) qθ1(x) rθ2(x) tθ3(x)

Note: similar to an AND operation (e.g., probability is zero as long as one model gives zero probability), unlike mixture models, which behave more like OR.
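A minimal sketch of combining experts into an unnormalized energy; the 1-D log-densities below are toy stand-ins for trained models such as a PixelCNN or a flow:

```python
import numpy as np

def log_q(x): return -0.5 * (x - 1.0) ** 2          # expert 1
def log_r(x): return -0.5 * ((x + 0.5) / 2.0) ** 2  # expert 2
def log_t(x): return -np.abs(x)                     # expert 3

def f_theta(x):
    # Unnormalized log-probability of the product: log q + log r + log t.
    return log_q(x) + log_r(x) + log_t(x)

# Relative comparisons remain easy; Z(theta1, theta2, theta3) cancels.
xa, xb = 0.3, 4.0
print(np.exp(f_theta(xa) - f_theta(xb)))  # how much more likely xa is than xb
```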
Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 14 / 1
Example: Product of Experts

[Figure omitted. Image source: Du et al., 2020.]


Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 15 / 1
Example: Restricted Boltzmann machine (RBM)
RBM: energy-based model with latent variables
Two types of variables:
1 x ∈ {0, 1}^n are visible variables (e.g., pixel values)
2 z ∈ {0, 1}^m are latent ones

The joint distribution is

pW,b,c(x, z) = (1/Z) exp( xᵀWz + bᵀx + cᵀz ) = (1/Z) exp( Σ_{i=1}^n Σ_{j=1}^m wij xi zj + bᵀx + cᵀz )

[Figure: bipartite graph connecting visible units x and hidden units z.]

Restricted because there are no visible–visible and hidden–hidden connections, i.e., no xi xj or zi zj terms in the objective.
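A minimal sketch of the RBM’s un-normalized log-probability and its conditionals (toy dimensions and random parameters are assumptions; the sigmoid form of the conditionals follows from the bipartite structure):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 4                     # visible / hidden dimensions (toy sizes)
W = rng.normal(scale=0.1, size=(n, m))
b = rng.normal(scale=0.1, size=n)
c = rng.normal(scale=0.1, size=m)

def unnormalized_log_p(x, z):
    # x^T W z + b^T x + c^T z, i.e. log p(x, z) + log Z.
    return x @ W @ z + b @ x + c @ z

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Because there are no visible-visible or hidden-hidden terms, the conditionals
# factorize: p(z_j = 1 | x) = sigmoid(c_j + (W^T x)_j), and symmetrically for x.
def sample_z_given_x(x):
    return (rng.random(m) < sigmoid(c + W.T @ x)).astype(float)

def sample_x_given_z(z):
    return (rng.random(n) < sigmoid(b + W @ z)).astype(float)

x = rng.integers(0, 2, size=n).astype(float)
z = sample_z_given_x(x)
print(unnormalized_log_p(x, z))
```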
Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 16 / 1
Example: Deep Boltzmann Machines
Stacked RBMs are one of the first deep generative models:

Deep Boltzmann machine
[Figure: visible units v connected through weight matrices W(1), W(2), W(3) to stacked hidden layers h(1), h(2), h(3).]
Bottom layer variables v are pixel values. Layers above (h) represent
“higher-level” features (corners, edges, etc).
Early deep neural networks for supervised learning had to be
pre-trained like this to make them work.
Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 17 / 1
Deep Boltzmann Machines: samples

Image source: Salakhutdinov and Hinton, 2009.

Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 18 / 1


Energy-based models: learning and inference

pθ(x) = (1/∫ exp(fθ(x)) dx) exp(fθ(x)) = (1/Z(θ)) exp(fθ(x))
Pros:
1 can plug in pretty much any function fθ (x) you want
Cons (lots of them):
1 Sampling is hard
2 Evaluating likelihood (learning) is hard
3 No feature learning
Curse of dimensionality: The fundamental issue is that computing Z (θ)
numerically (when no analytic solution is available) scales exponentially in
the number of dimensions of x.

Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 19 / 1


Computing the normalization constant is hard
As an example, the RBM joint distribution is
pW,b,c(x, z) = (1/Z) exp( xᵀWz + bᵀx + cᵀz )

where
1 x ∈ {0, 1}^n are visible variables (e.g., pixel values)
2 z ∈ {0, 1}^m are latent ones
The normalization constant (the “volume”) is

Z(W, b, c) = Σ_{x∈{0,1}^n} Σ_{z∈{0,1}^m} exp( xᵀWz + bᵀx + cᵀz )

Note: Z is a well-defined function of the parameters W, b, c, but has no simple closed form and takes time exponential in n, m to compute. This means that evaluating the objective pW,b,c(x, z) for likelihood-based learning is hard.
Observation: optimizing the likelihood pW,b,c(x, z) is difficult, but optimizing the un-normalized probability exp( xᵀWz + bᵀx + cᵀz ) (w.r.t. trainable parameters W, b, c) is easy.
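A minimal sketch of this brute-force computation for a toy RBM (illustrative sizes; the double sum below is exactly the 2^n · 2^m enumeration that becomes infeasible as n, m grow):

```python
import itertools
import numpy as np

n, m = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(n, m))
b = rng.normal(scale=0.1, size=n)
c = rng.normal(scale=0.1, size=m)

Z = 0.0
for x_bits in itertools.product([0, 1], repeat=n):
    for z_bits in itertools.product([0, 1], repeat=m):
        x, z = np.array(x_bits), np.array(z_bits)
        Z += np.exp(x @ W @ z + b @ x + c @ z)
print(Z)   # 2^4 * 2^3 = 128 terms here; 2^n * 2^m in general
```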
Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 20 / 1
Training intuition

Goal: maximize exp{fθ(xtrain)} / Z(θ). Increase numerator, decrease denominator.
Intuition: because the model is not normalized, increasing the
un-normalized log-probability fθ (xtrain ) by changing θ does not guarantee
that xtrain becomes relatively more likely (compared to the rest).
We also need to take into account the effect on other “wrong points” and
try to “push them down” to also make Z (θ) small.

Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 21 / 1


Contrastive Divergence

Goal: maximize exp{fθ(xtrain)} / Z(θ)

Idea: Instead of evaluating Z (θ) exactly, use a Monte Carlo estimate.


Contrastive divergence algorithm: sample xsample ∼ pθ , take step on
∇θ (fθ (xtrain ) − fθ (xsample )). Make training data more likely than typical
sample from the model.

Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 22 / 1


Contrastive Divergence

Maximize log-likelihood: maxθ fθ (xtrain ) − log Z (θ).


Gradient of log-likelihood:

∇θ fθ(xtrain) − ∇θ log Z(θ)
= ∇θ fθ(xtrain) − ∇θ Z(θ) / Z(θ)
= ∇θ fθ(xtrain) − (1/Z(θ)) ∫ ∇θ exp{fθ(x)} dx
= ∇θ fθ(xtrain) − (1/Z(θ)) ∫ exp{fθ(x)} ∇θ fθ(x) dx
= ∇θ fθ(xtrain) − ∫ (exp{fθ(x)} / Z(θ)) ∇θ fθ(x) dx
= ∇θ fθ(xtrain) − E_{xsample}[∇θ fθ(xsample)]
≈ ∇θ fθ(xtrain) − ∇θ fθ(xsample),

where xsample ∼ exp{fθ(x)}/Z(θ) = pθ(x).
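A minimal contrastive-divergence training step might look like the following sketch; the MLP energy, the optimizer choice, and the placeholder negative samples are all assumptions for illustration (real negatives would come from the MCMC samplers discussed next):

```python
import torch

# Small energy network f_theta: R^2 -> R.
f_theta = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.SiLU(), torch.nn.Linear(64, 1)
)
optimizer = torch.optim.Adam(f_theta.parameters(), lr=1e-3)

def cd_step(x_train, x_sample):
    # Maximize f(x_train) - f(x_sample)  <=>  minimize the negated difference.
    loss = -(f_theta(x_train).mean() - f_theta(x_sample).mean())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

x_train = torch.randn(32, 2)     # a batch of data (placeholder)
x_sample = torch.randn(32, 2)    # stand-in for MCMC samples from p_theta
print(cd_step(x_train, x_sample))
```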


How to sample?

Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 23 / 1


Sampling from energy-based models

pθ(x) = (1/∫ exp(fθ(x)) dx) exp(fθ(x)) = (1/Z(θ)) exp(fθ(x))

No direct way to sample like in autoregressive or flow models. Main issue: we cannot easily compute how likely each possible sample is.
However, we can easily compare two samples x, x′ .
Use an iterative approach called Markov Chain Monte Carlo:
1 Initialize x0 randomly, t = 0
2 Let x′ = xt + noise
1 If fθ(x′) > fθ(xt), let xt+1 = x′
2 Else let xt+1 = x′ with probability exp(fθ(x′) − fθ(xt)), otherwise keep xt+1 = xt
3 Go to step 2
Works in theory, but can take a very long time to converge
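A minimal sketch of this procedure, with a toy quadratic fθ whose target distribution is a standard Gaussian; a learned energy network would be plugged in the same way:

```python
import numpy as np

rng = np.random.default_rng(0)

def f_theta(x):
    return -0.5 * np.sum(x ** 2)

def mh_sample(n_steps=500, noise_scale=0.5, dim=2):
    x = rng.normal(size=dim)                           # x0 ~ pi(x)
    for _ in range(n_steps):
        x_prop = x + noise_scale * rng.normal(size=dim)
        # Accept if f increases, else accept with probability exp(f(x') - f(x)).
        if np.log(rng.random()) < f_theta(x_prop) - f_theta(x):
            x = x_prop
    return x

samples = np.stack([mh_sample() for _ in range(200)])
print(samples.mean(axis=0), samples.std(axis=0))   # roughly zero mean, unit std
```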

Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 24 / 1


Sampling from energy-based models

For any continuous distribution pθ(x), suppose we can compute its gradient (the score function) ∇x log pθ(x).
Let π(x) be a prior distribution that is easy to sample from.
Langevin MCMC.
x0 ∼ π(x)
Repeat xt+1 = xt + ϵ ∇x log pθ(xt) + √(2ϵ) zt for t = 0, 1, 2, · · · , T − 1, where zt ∼ N(0, I).
If ϵ → 0 and T → ∞, we have xT ∼ pθ (x).
Note that for energy-based models, the score function is tractable:

∇x log pθ(x) = ∇x fθ(x) − ∇x log Z(θ) = ∇x fθ(x),

since ∇x log Z(θ) = 0 (Z(θ) does not depend on x).
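A minimal sketch of Langevin sampling from an EBM; the quadratic fθ is again a toy assumption, and ∇x fθ(x) is obtained by automatic differentiation, as is typical for neural energies:

```python
import torch

def f_theta(x):
    return -0.5 * (x ** 2).sum(dim=-1)

def langevin_sample(n_chains=1000, dim=2, T=500, eps=1e-2):
    x = torch.randn(n_chains, dim)                            # x0 ~ pi(x)
    for _ in range(T):
        x = x.detach().requires_grad_(True)
        score = torch.autograd.grad(f_theta(x).sum(), x)[0]   # grad_x log p = grad_x f
        x = x + eps * score + (2 * eps) ** 0.5 * torch.randn_like(x)
    return x.detach()

samples = langevin_sample()
print(samples.mean(dim=0), samples.std(dim=0))   # roughly zero mean, unit std
```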

Stefano Ermon (AI Lab) Deep Generative Models Lecture 11 25 / 1


Recap. of last lecture

Energy-based models: pθ(x) = exp{fθ(x)} / Z(θ).
Z (θ) is intractable, so no access to likelihood.
Comparing the probability of two points is easy:
pθ (x′ )/pθ (x) = exp(fθ (x′ ) − fθ (x)).
Maximum likelihood training: maxθ {fθ (xtrain ) − log Z (θ)}.
Contrastive divergence:

∇θ fθ (xtrain ) − ∇θ log Z (θ) ≈ ∇θ fθ (xtrain ) − ∇θ fθ (xsample ),

where xsample ∼ pθ (x).


Stefano Ermon (AI Lab) Deep Generative Models Lecture 12 2/1
Sampling from EBMs: MH-MCMC

Metropolis-Hastings Markov chain Monte Carlo (MCMC).


1 x0 ∼ π(x)
2 Repeat for t = 0, 1, 2, · · · , T − 1:
x′ = xt + noise
xt+1 = x′ if fθ (x′ ) ≥ fθ (xt )
If fθ (x′ ) < fθ (xt ), set xt+1 = x′ with probability exp{fθ (x′ ) − fθ (xt )},
otherwise set xt+1 = xt .
Properties:
In theory, xT converges to pθ (x) when T → ∞. Why?
Satisfies detailed balance condition: pθ (x)Tx→x′ = pθ (x′ )Tx′ →x where
Tx→x′ is the probability of transitioning from x to x′
If xt is distributed as pθ , then xt+1 is distributed as pθ .
In practice, need a large number of iterations and convergence slows
down exponentially in dimensionality.

Stefano Ermon (AI Lab) Deep Generative Models Lecture 12 3/1


Sampling from EBMs: unadjusted Langevin MCMC

Unadjusted Langevin MCMC:


1 x0 ∼ π(x)
2 Repeat for t = 0, 1, 2, · · · , T − 1:
zt ∼ N(0, I)
xt+1 = xt + ϵ ∇x log pθ(x)|x=xt + √(2ϵ) zt
Properties:
xT converges to a sample from pθ (x) when T → ∞ and ϵ → 0.
∇x log pθ (x) = ∇x fθ (x) for continuous energy-based models.
Convergence slows down as dimensionality grows.

Sampling converges slowly in high-dimensional spaces and is thus very expensive, yet we need sampling for each training iteration in contrastive divergence.

Stefano Ermon (AI Lab) Deep Generative Models Lecture 12 4/1
