
Maximum Likelihood Learning

Stefano Ermon

Stanford University

Lecture 4



Learning a generative model
We are given a training set of examples, e.g., images of dogs

We want to learn a probability distribution p(x) over images x such that


Generation: If we sample x_new ∼ p(x), x_new should look like a dog (sampling)
Density estimation: p(x) should be high if x looks like a dog, and low otherwise (anomaly detection)
Unsupervised representation learning: We should be able to learn what these images have in common, e.g., ears, tail, etc. (features)
First question: how to represent p_θ(x). Second question: how to learn it.
Setting

Let's assume that the domain is governed by some underlying distribution P_data
We are given a dataset D of m samples from P_data
Each sample is an assignment of values to (a subset of) the variables, e.g., (X_bank = 1, X_dollar = 0, ..., Y = 1) or pixel intensities.
The standard assumption is that the data instances are independent and
identically distributed (IID)
We are also given a family of models M, and our task is to learn some
“good” distribution in this set:
For example, M could be all Bayes nets with a given graph structure,
for all possible choices of the CPD tables
For example, an FVSBN for all possible choices of the logistic regression parameters, θ = concatenation of all logistic regression coefficients



Goal of learning

The goal of learning is to return a model P_θ that precisely captures the distribution P_data from which our data was sampled
This is in general not achievable because of:
limited data, which only provides a rough approximation of the true underlying distribution
computational reasons
Example. Suppose we represent each image with a vector X of 784 binary variables (black vs. white pixel). How many possible states (= possible images) in the model? 2^784 ≈ 10^236. Even 10^7 training examples provide extremely sparse coverage!
We want to select P_θ to construct the "best" approximation to the underlying distribution P_data
What is "best"?



What is “best”?

This depends on what we want to do


1 Density estimation: we are interested in the full distribution (so later we can
compute whatever conditional probabilities we want)
2 Specific prediction tasks: we are using the distribution to make a prediction
Is this email spam or not?
Structured prediction: Predict next frame in a video, or caption given
an image
3 Structure or knowledge discovery: we are interested in the model itself
How do some genes interact with each other?
What causes cancer?
Take CS 228



Learning as density estimation
We want to learn the full distribution so that later we can answer any
probabilistic inference query
In this setting we can view the learning problem as density estimation
We want to construct P_θ as "close" as possible to P_data (recall we assume we are given a dataset D of samples from P_data)

How do we evaluate "closeness"?


KL-divergence

How should we measure distance between distributions?


The Kullback-Leibler divergence (KL-divergence) between two distributions p and q is defined as

$$D(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$

D(p ‖ q) ≥ 0 for all p, q, with equality if and only if p = q. Proof: by Jensen's inequality, since −log is convex,

$$E_{x \sim p}\left[-\log \frac{q(x)}{p(x)}\right] \ge -\log\left(E_{x \sim p}\left[\frac{q(x)}{p(x)}\right]\right) = -\log\left(\sum_x p(x)\,\frac{q(x)}{p(x)}\right) = -\log 1 = 0$$

Notice that KL-divergence is asymmetric, i.e., D(p‖q) ≠ D(q‖p)


Measures the expected number of extra bits required to describe
samples from p(x) using a compression code based on q instead of p
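As a side illustration (not from the slides), here is a minimal numpy sketch of the KL-divergence between two discrete distributions; the example values of p and q are assumptions:

```python
import numpy as np

# Two discrete distributions over the same finite support (illustrative values).
p = np.array([0.8, 0.1, 0.1])
q = np.array([0.4, 0.3, 0.3])

def kl_divergence(p, q):
    """D(p||q) = sum_x p(x) log(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    mask = p > 0  # terms with p(x) = 0 contribute 0 to the sum
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

print(kl_divergence(p, q))  # ≈ 0.335, in nats (use np.log2 for bits)
print(kl_divergence(q, p))  # ≈ 0.382: KL-divergence is asymmetric
print(kl_divergence(p, p))  # 0.0: zero iff the two distributions are equal
```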
Detour on KL-divergence
To compress, it is useful to know the probability distribution the data
is sampled from
For example, let X_1, ⋯, X_100 be samples of an unbiased coin. Roughly 50 heads and 50 tails. Optimal compression scheme is to record heads as 0 and tails as 1. In expectation, use 1 bit per sample, and cannot do better
Suppose the coin is biased, and P[H] ≫ P[T]. Then it's more efficient to use fewer bits on average to represent heads and more bits to represent tails, e.g.
Batch multiple samples together
Use a short sequence of bits to encode HHHH (common) and a long
sequence for TTTT (rare).
Like Morse code: E = •, A = •−, Q = − − •−
KL-divergence: if your data comes from p, but you use a scheme optimized for q, the divergence D_KL(p‖q) is the number of extra bits you'll need on average
Learning as density estimation

We want to learn the full distribution so that later we can answer any
probabilistic inference query
In this setting we can view the learning problem as density estimation
We want to construct P_θ as "close" as possible to P_data (recall we assume we are given a dataset D of samples from P_data)
How do we evaluate "closeness"?
KL-divergence is one possibility:

$$D(P_{\text{data}} \| P_\theta) = E_{x \sim P_{\text{data}}}\left[\log \frac{P_{\text{data}}(x)}{P_\theta(x)}\right] = \sum_x P_{\text{data}}(x) \log \frac{P_{\text{data}}(x)}{P_\theta(x)}$$

D(P_data ‖ P_θ) = 0 if and only if the two distributions are the same.


It measures the "compression loss" (in bits) of using P_θ instead of P_data.



Expected log-likelihood
We can simplify this somewhat:

$$D(P_{\text{data}} \| P_\theta) = E_{x \sim P_{\text{data}}}\left[\log \frac{P_{\text{data}}(x)}{P_\theta(x)}\right] = E_{x \sim P_{\text{data}}}[\log P_{\text{data}}(x)] - E_{x \sim P_{\text{data}}}[\log P_\theta(x)]$$

The first term does not depend on P_θ.
Then, minimizing KL divergence is equivalent to maximizing the expected
log-likelihood
$$\arg\min_{P_\theta} D(P_{\text{data}} \| P_\theta) = \arg\min_{P_\theta} -E_{x \sim P_{\text{data}}}[\log P_\theta(x)] = \arg\max_{P_\theta} E_{x \sim P_{\text{data}}}[\log P_\theta(x)]$$

This asks that P_θ assign high probability to instances sampled from P_data, so as to reflect the true distribution
Because of the log, samples x where P_θ(x) ≈ 0 weigh heavily in the objective
Although we can now compare models, since we are ignoring H(P_data) = −E_{x∼P_data}[log P_data(x)], we don't know how close we are to the optimum
Problem: In general we do not know P_data.
Maximum likelihood

Approximate the expected log-likelihood

$$E_{x \sim P_{\text{data}}}[\log P_\theta(x)]$$

with the empirical log-likelihood:

$$E_D[\log P_\theta(x)] = \frac{1}{|D|} \sum_{x \in D} \log P_\theta(x)$$

Maximum likelihood learning is then:


$$\max_{P_\theta} \frac{1}{|D|} \sum_{x \in D} \log P_\theta(x)$$

Equivalently, maximize the likelihood of the data:

$$P_\theta(x^{(1)}, \cdots, x^{(m)}) = \prod_{x \in D} P_\theta(x)$$



Main idea in Monte Carlo Estimation

1 Express the quantity of interest as the expected value of a random variable:

$$E_{x \sim P}[g(x)] = \sum_x g(x) P(x)$$

2 Generate T samples x^1, …, x^T from the distribution P with respect to which the expectation was taken.
3 Estimate the expected value from the samples using:

$$\hat{g}(x^1, \cdots, x^T) \triangleq \frac{1}{T} \sum_{t=1}^{T} g(x^t)$$

where x^1, …, x^T are independent samples from P. Note: ĝ is a random variable. Why?



Properties of the Monte Carlo Estimate
Unbiased:
$$E_P[\hat{g}] = E_P[g(x)]$$

Convergence: By the law of large numbers,

$$\hat{g} = \frac{1}{T} \sum_{t=1}^{T} g(x^t) \to E_P[g(x)] \quad \text{for } T \to \infty$$

Variance:

$$V_P[\hat{g}] = V_P\left[\frac{1}{T} \sum_{t=1}^{T} g(x^t)\right] = \frac{V_P[g(x)]}{T}$$

Thus, the variance of the estimator can be reduced by increasing the number of samples.
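A small numpy sketch of these properties (the target distribution and g are assumptions chosen for illustration): the estimate stays centered on the true expectation, and its variance shrinks like 1/T.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choice: x ~ Bernoulli(0.3) and g(x) = x, so E_P[g(x)] = 0.3
# and V_P[g(x)] = 0.21 (values assumed for this sketch).
def mc_estimate(T):
    samples = rng.binomial(1, 0.3, size=T)  # T iid samples from P
    return samples.mean()                   # the Monte Carlo estimate g_hat

for T in [10, 100, 1000]:
    estimates = [mc_estimate(T) for _ in range(2000)]
    # Mean stays near 0.3 (unbiased); variance is close to 0.21 / T.
    print(T, np.mean(estimates), np.var(estimates))
```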
Example

Single variable example: A biased coin


Two outcomes: heads (H) and tails (T )
Data set: Tosses of the biased coin, e.g., D = {H, H, T , H, T }
Assumption: the process is controlled by a probability distribution P_data(x) where x ∈ {H, T}
Class of models M: all probability distributions over x ∈ {H, T}.
Example learning task: How should we choose Pθ (x) from M if 3 out
of 5 tosses are heads in D?



MLE scoring for the coin example

We represent our model: P_θ(x = H) = θ and P_θ(x = T) = 1 − θ


Example data: D = {H, H, T , H, T }
Likelihood of data = ∏_i P_θ(x_i) = θ · θ · (1 − θ) · θ · (1 − θ) = θ^3 (1 − θ)^2

Optimize for the θ that makes D most likely. What is the solution in this case? θ = 0.6; the optimization problem can be solved in closed form.
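A short numpy sketch of this example, checking the closed-form answer θ = 0.6 against a grid search over the log-likelihood:

```python
import numpy as np

# Data from the slide: D = {H, H, T, H, T}, encoded as 1 = heads, 0 = tails.
D = np.array([1, 1, 0, 1, 0])

def log_likelihood(theta, data):
    """log of theta^(#heads) * (1 - theta)^(#tails)."""
    n_heads = data.sum()
    n_tails = len(data) - n_heads
    return n_heads * np.log(theta) + n_tails * np.log(1 - theta)

theta_mle = D.mean()  # closed-form MLE: the fraction of heads
print(theta_mle)      # 0.6

# Sanity check: the closed-form solution also maximizes over a grid.
grid = np.linspace(0.01, 0.99, 99)
print(grid[np.argmax(log_likelihood(grid, D))])  # 0.6
```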



Extending the MLE principle to autoregressive models

Given an autoregressive model with n variables and factorization

$$P_\theta(x) = \prod_{i=1}^{n} p_{\text{neural}}(x_i \mid x_{<i}; \theta_i)$$

θ = (θ_1, ⋯, θ_n) are the parameters of all the conditionals. Training data D = {x^(1), ⋯, x^(m)}. Maximum likelihood estimate of the parameters θ?
Decomposition of the likelihood function:

$$L(\theta, D) = \prod_{j=1}^{m} P_\theta(x^{(j)}) = \prod_{j=1}^{m} \prod_{i=1}^{n} p_{\text{neural}}(x_i^{(j)} \mid x_{<i}^{(j)}; \theta_i)$$

Goal: arg max_θ L(θ, D) = arg max_θ log L(θ, D)


We no longer have a closed form solution



MLE Learning: Gradient Descent

$$L(\theta, D) = \prod_{j=1}^{m} P_\theta(x^{(j)}) = \prod_{j=1}^{m} \prod_{i=1}^{n} p_{\text{neural}}(x_i^{(j)} \mid x_{<i}^{(j)}; \theta_i)$$

Goal: arg max_θ L(θ, D) = arg max_θ log L(θ, D)


$$\ell(\theta) = \log L(\theta, D) = \sum_{j=1}^{m} \sum_{i=1}^{n} \log p_{\text{neural}}(x_i^{(j)} \mid x_{<i}^{(j)}; \theta_i)$$

1 Initialize θ^0 = (θ_1, ⋯, θ_n) at random
2 Compute ∇_θ ℓ(θ) (by backpropagation)
3 θ^{t+1} = θ^t + α_t ∇_θ ℓ(θ)
Non-convex optimization problem, but often works well in practice (a sketch of this loop follows below)
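A minimal PyTorch sketch of this loop. Assumptions: an FVSBN-style model with logistic-regression conditionals (as introduced on an earlier slide), synthetic random binary data, and full-batch gradient ascent; none of these specifics come from the slides.

```python
import torch

torch.manual_seed(0)
n, m = 10, 500
data = (torch.rand(m, n) < 0.5).float()   # synthetic binary training data

# FVSBN-style conditionals: p(x_i = 1 | x_<i) = sigmoid(W[i, :i] @ x_<i + b[i]).
W = torch.zeros(n, n, requires_grad=True)
b = torch.zeros(n, requires_grad=True)

def log_likelihood(x):
    """sum_j sum_i log p(x_i^(j) | x_<i^(j)); a strictly lower-triangular
    mask on W enforces the dependence on x_<i only."""
    mask = torch.tril(torch.ones(n, n), diagonal=-1)
    logits = x @ (W * mask).T + b         # conditional logits, shape (batch, n)
    return -torch.nn.functional.binary_cross_entropy_with_logits(
        logits, x, reduction="sum")

alpha = 0.1
for step in range(100):
    ll = log_likelihood(data)             # step 2: compute ell(theta) ...
    ll.backward()                         # ... and its gradient by backprop
    with torch.no_grad():                 # step 3: theta <- theta + alpha * grad
        W += alpha * W.grad / m
        b += alpha * b.grad / m
        W.grad.zero_(); b.grad.zero_()
```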



MLE Learning: Stochastic Gradient Descent

$$\ell(\theta) = \log L(\theta, D) = \sum_{j=1}^{m} \sum_{i=1}^{n} \log p_{\text{neural}}(x_i^{(j)} \mid x_{<i}^{(j)}; \theta_i)$$

1 Initialize θ^0 at random
2 Compute ∇_θ ℓ(θ) (by backpropagation)
3 θ^{t+1} = θ^t + α_t ∇_θ ℓ(θ)
What is the gradient with respect to θ_i?

$$\nabla_{\theta_i} \ell(\theta) = \nabla_{\theta_i} \sum_{j=1}^{m} \sum_{i=1}^{n} \log p_{\text{neural}}(x_i^{(j)} \mid x_{<i}^{(j)}; \theta_i) = \sum_{j=1}^{m} \nabla_{\theta_i} \log p_{\text{neural}}(x_i^{(j)} \mid x_{<i}^{(j)}; \theta_i)$$

Each conditional p_neural(x_i | x_{<i}; θ_i) can be optimized separately if there is no parameter sharing. In practice, parameters θ_i are shared (e.g., NADE, PixelRNN, PixelCNN, etc.)



MLE Learning: Stochastic Gradient Descent
$$\ell(\theta) = \log L(\theta, D) = \sum_{j=1}^{m} \sum_{i=1}^{n} \log p_{\text{neural}}(x_i^{(j)} \mid x_{<i}^{(j)}; \theta_i)$$

1 Initialize θ^0 at random
2 Compute ∇_θ ℓ(θ) (by backpropagation)
3 θ^{t+1} = θ^t + α_t ∇_θ ℓ(θ)

$$\nabla_\theta \ell(\theta) = \sum_{j=1}^{m} \sum_{i=1}^{n} \nabla_\theta \log p_{\text{neural}}(x_i^{(j)} \mid x_{<i}^{(j)}; \theta_i)$$

What if m = |D| is huge?

$$\nabla_\theta \ell(\theta) = m \sum_{j=1}^{m} \frac{1}{m} \sum_{i=1}^{n} \nabla_\theta \log p_{\text{neural}}(x_i^{(j)} \mid x_{<i}^{(j)}; \theta_i) = m\, E_{x^{(j)} \sim D}\left[\sum_{i=1}^{n} \nabla_\theta \log p_{\text{neural}}(x_i^{(j)} \mid x_{<i}^{(j)}; \theta_i)\right]$$

Monte Carlo: Sample x^{(j)} ∼ D; then

$$\nabla_\theta \ell(\theta) \approx m \sum_{i=1}^{n} \nabla_\theta \log p_{\text{neural}}(x_i^{(j)} \mid x_{<i}^{(j)}; \theta_i)$$
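Continuing the FVSBN sketch from the gradient-descent slide (it reuses log_likelihood, W, b, data, alpha, and m defined there), stochastic gradient descent replaces the full-batch gradient with a minibatch Monte Carlo estimate. The slide uses a single sampled x^(j); a small batch is the common practical variant and is an assumption here.

```python
# Continuing the sketch above: estimate the gradient from a random minibatch
# instead of all m examples.
batch_size = 32
for step in range(1000):
    idx = torch.randint(0, m, (batch_size,))   # sample x^(j) ~ D
    ll = log_likelihood(data[idx])             # minibatch log-likelihood
    ll.backward()
    with torch.no_grad():
        # Dividing by batch_size gives an unbiased estimate of the average
        # gradient (1/m) * grad ell(theta); the factor m only rescales alpha.
        W += alpha * W.grad / batch_size
        b += alpha * b.grad / batch_size
        W.grad.zero_(); b.grad.zero_()
```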
Empirical Risk and Overfitting

Empirical risk minimization can easily overfit the data


Extreme example: The data is the model (remember all training data).
Generalization: the data is a sample; usually there is a vast number of samples that you have never seen. Your model should generalize well to these "never-seen" samples.
Thus, we typically restrict the hypothesis space of distributions that we
search over



Bias-Variance trade off

If the hypothesis space is very limited, it might not be able to represent P_data, even with unlimited data
This type of limitation is called bias, as the learning is limited in how closely it can approximate the target distribution
If we select a highly expressive hypothesis class, we might represent the data better
When we have a small amount of data, multiple models can fit well, or even better than the true model. Moreover, small perturbations of D will result in very different estimates
This limitation is called variance.



Bias-Variance trade off

There is an inherent bias-variance trade-off when selecting the hypothesis class. Error in learning is due to both bias and variance.
Hypothesis space: linear relationship. Does it fit well? Underfits
Hypothesis space: high-degree polynomial. Overfits
Hypothesis space: low-degree polynomial. Right tradeoff (illustrated in the sketch below)
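A small numpy illustration of this tradeoff; the cubic ground truth, noise level, and polynomial degrees are assumptions chosen for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground truth: a cubic, observed with noise at 15 training points.
x = np.linspace(-1, 1, 15)
y = (x ** 3 - x) + 0.1 * rng.standard_normal(x.shape)

x_test = np.linspace(-1, 1, 200)
y_test = x_test ** 3 - x_test

for degree in [1, 3, 10]:  # underfit / right tradeoff / overfit
    coeffs = np.polyfit(x, y, degree)
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, train_err, test_err)
# degree 1: both errors high (bias); degree 10: tiny train error but larger
# test error (variance); degree 3: the right tradeoff.
```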



How to avoid overfitting?
Hard constraints, e.g., by selecting a less expressive model family:
Smaller neural networks with fewer parameters
Weight sharing

Soft preference for "simpler" models: Occam's Razor.

Augment the objective function with regularization:

objective(x, M) = loss(x, M) + R(M)

Evaluate generalization performance on a held-out validation set
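As a one-step sketch of the regularized objective (reusing log_likelihood and W from the FVSBN example; choosing R(M) as an L2 penalty is an assumption, one common option among many):

```python
def regularized_objective(x, lam=1e-3):
    """objective = loss + R(M): average NLL plus R(M) = lam * ||W||^2,
    a soft preference for 'simpler' models with small weights.
    lam would be tuned on the held-out validation set."""
    return -log_likelihood(x) / len(x) + lam * (W ** 2).sum()
```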


Conditional generative models

Suppose we want to generate a set of variables Y given some others


X, e.g., text to speech
We concentrate on modeling p(Y | X), and use a conditional loss function −log P_θ(y | x).

Since the loss function only depends on P_θ(y | x), it suffices to estimate the conditional distribution, not the joint
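A minimal PyTorch sketch of this idea; the linear model and all shapes are illustrative assumptions. Training the conditional model amounts to minimizing the cross-entropy −log P_θ(y | x), with no model of P(X) required:

```python
import torch

model = torch.nn.Linear(16, 5)   # maps a feature vector x to logits over 5 values of y
x = torch.randn(8, 16)           # a batch of conditioning inputs X
y = torch.randint(0, 5, (8,))    # the corresponding targets Y

logits = model(x)
loss = torch.nn.functional.cross_entropy(logits, y)  # average of -log P_theta(y|x)
loss.backward()                  # ready for a gradient step on theta
```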



Recap

For autoregressive models, it is easy to compute p_θ(x)

Ideally, evaluate each conditional log p_neural(x_i | x_{<i}; θ_i) in parallel, unlike RNNs.
Natural to train them via maximum likelihood
Higher log-likelihood doesn’t necessarily mean better looking samples
Other ways of measuring similarity are possible (Generative Adversarial
Networks, GANs)
