CS236: Deep Generative Models
Lecture 2
Stefano Ermon, Aditya Grover (AI Lab)
Stanford University
Learning a generative model
We are given a training set of examples, e.g., images of dogs, and want to learn a probability distribution p(x) over images x that assigns high probability to dog-like images and that we can sample from to generate new ones.
Example of joint distribution
Modeling a single pixel’s color. Three discrete random variables:
Red Channel R. Val(R) = {0, · · · , 255}
Green Channel G . Val(G ) = {0, · · · , 255}
Blue Channel B. Val(B) = {0, · · · , 255}
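To make the cost of a fully tabular representation concrete, here is a small illustrative sketch (mine, not from the slides): storing p(R, G, B) as an explicit table requires 256^3 − 1 = 16,777,215 free parameters, and generation amounts to drawing an index from that table. The toy table below is uniform and shrunk to K values per channel purely so it runs quickly.

import numpy as np

# The full joint over three channels with 256 values each needs
# 256**3 - 1 = 16,777,215 free parameters (the entries must sum to 1):
print("parameters for p(R, G, B):", 256**3 - 1)

# For illustration, build a tiny joint table with only K values per channel.
# Any non-negative table that sums to 1 is a valid joint; we use a uniform one.
K = 4
p_rgb = np.full((K, K, K), 1.0 / K**3)

# Generation = drawing an (r, g, b) index from the table.
rng = np.random.default_rng(0)
idx = rng.choice(p_rgb.size, p=p_rgb.ravel())
r, g, b = np.unravel_index(idx, p_rgb.shape)
print("sampled color:", (r, g, b))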
Key notion: conditional independence
Two events A, B are conditionally independent given event C if
p(A ∩ B | C) = p(A | C) p(B | C)
or, equivalently (when p(B ∩ C) > 0), p(A | B, C) = p(A | C).
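As a quick numeric sanity check of this definition (an illustrative sketch with a made-up joint, not from the slides), the snippet below constructs p(A, B, C) so that A and B are conditionally independent given C, and verifies p(A, B | C) = p(A | C) p(B | C).

import numpy as np

# Made-up joint p(A, B, C) over binary events, constructed so that
# A and B are conditionally independent given C (but not marginally).
p_c = np.array([0.5, 0.5])                    # p(C)
p_a_given_c = np.array([[0.9, 0.1],           # p(A | C = 0)
                        [0.1, 0.9]])          # p(A | C = 1)
p_b_given_c = np.array([[0.8, 0.2],           # p(B | C = 0)
                        [0.2, 0.8]])          # p(B | C = 1)

# p(a, b, c) = p(c) p(a | c) p(b | c)
p = np.einsum('c,ca,cb->abc', p_c, p_a_given_c, p_b_given_c)

# Check the definition: p(A, B | C) == p(A | C) p(B | C) for every value of C.
p_ab_given_c = p / p.sum(axis=(0, 1), keepdims=True)
for c in range(2):
    lhs = p_ab_given_c[:, :, c]
    rhs = np.outer(p_a_given_c[c], p_b_given_c[c])
    assert np.allclose(lhs, rhs)
print("A and B are conditionally independent given C")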
Structure through conditional independence
Using Chain Rule
p(x1, . . . , xn) = p(x1) p(x2 | x1) p(x3 | x1, x2) · · · p(xn | x1, · · · , xn−1)
Using the chain rule with a different ordering, we can always also write
p(x1, . . . , xn) = p(xn) p(xn−1 | xn) p(xn−2 | xn−1, xn) · · · p(x1 | x2, · · · , xn)
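As a sketch of why this factorization is fully general (made-up joint, illustrative code only): starting from any joint over three binary variables, the chain-rule factors multiply back to the original joint exactly.

import numpy as np

rng = np.random.default_rng(0)

# A made-up joint distribution over three binary variables x1, x2, x3.
p = rng.random((2, 2, 2))
p /= p.sum()

# Chain-rule factors: p(x1), p(x2 | x1), p(x3 | x1, x2).
p_x1 = p.sum(axis=(1, 2))
p_x2_given_x1 = p.sum(axis=2) / p_x1[:, None]
p_x3_given_x12 = p / p.sum(axis=2, keepdims=True)

# Multiplying the factors back recovers the joint exactly.
reconstructed = (p_x1[:, None, None]
                 * p_x2_given_x1[:, :, None]
                 * p_x3_given_x12)
assert np.allclose(reconstructed, p)
print("chain rule reconstruction OK")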
Bayesian networks
A Bayesian network is specified by a directed acyclic graph (DAG) with one node per random variable, together with a conditional probability distribution (CPD) p(xi | xPa(i)) for each variable xi given its parents Pa(i). The joint distribution then factorizes as
p(x1, . . . , xn) = ∏_{i} p(xi | xPa(i))
Example
Consider the following Bayesian network:
[Figure: an example Bayesian network (a DAG over several variables) and the corresponding factorization of the joint distribution into one conditional per node.]
Bayesian network structure implies conditional independencies!
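Concretely, each variable in a Bayesian network is conditionally independent of its non-descendants given its parents, which is what keeps the conditionals small. The sketch below uses a hypothetical chain-structured network X1 → X2 → X3 → X4 over binary variables (not the network from the lecture's figures) to show the factorization and the resulting parameter savings.

import numpy as np

# Hypothetical chain-structured Bayesian network over binary variables:
#   X1 -> X2 -> X3 -> X4
# Conditional probability tables: cpds['child|parent'][parent_value] = p(child = 1 | parent).
p_x1 = 0.6                                  # p(X1 = 1)
cpds = {'x2|x1': [0.2, 0.9],                # p(X2 = 1 | X1 = 0), p(X2 = 1 | X1 = 1)
        'x3|x2': [0.3, 0.7],
        'x4|x3': [0.5, 0.8]}

def bernoulli(p1, value):
    """p(value) for a binary variable with p(1) = p1."""
    return p1 if value == 1 else 1.0 - p1

# Joint via the Bayes net factorization:
# p(x1, x2, x3, x4) = p(x1) p(x2 | x1) p(x3 | x2) p(x4 | x3)
joint = np.zeros((2, 2, 2, 2))
for x1 in (0, 1):
    for x2 in (0, 1):
        for x3 in (0, 1):
            for x4 in (0, 1):
                joint[x1, x2, x3, x4] = (bernoulli(p_x1, x1)
                                         * bernoulli(cpds['x2|x1'][x1], x2)
                                         * bernoulli(cpds['x3|x2'][x2], x3)
                                         * bernoulli(cpds['x4|x3'][x3], x4))
assert np.isclose(joint.sum(), 1.0)

# Parameter counts: 1 + 2 + 2 + 2 = 7 for the chain,
# versus 2**4 - 1 = 15 for a fully general table.
print("Bayes net parameters:", 1 + 2 + 2 + 2)
print("fully general parameters:", 2**4 - 1)

For a chain over n binary variables the count is 2n − 1 instead of 2^n − 1, so the savings from conditional independence grow exponentially with n.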
Naive Bayes for single label prediction
Classify e-mails as spam (Y = 1) or not spam (Y = 0)
Let 1 : n index the words in our vocabulary (e.g., English)
Xi = 1 if word i appears in an e-mail, and 0 otherwise
E-mails are drawn according to some distribution p(Y , X1 , . . . , Xn )
Words are conditionally independent given Y. Then
p(y, x1, . . . , xn) = p(y) ∏_{i=1}^{n} p(xi | y)
Example: naive Bayes for classification
As before, classify e-mails as spam (Y = 1) or not spam (Y = 0), where Xi = 1 if word i appears in the e-mail and e-mails are drawn from some distribution p(Y, X1, . . . , Xn). Suppose that the words are conditionally independent given Y. Then,
p(y, x1, . . . , xn) = p(y) ∏_{i=1}^{n} p(xi | y)
To classify a new e-mail, compute the posterior probability of spam with Bayes rule:
p(Y = 1 | x1, . . . , xn) = p(Y = 1) ∏_{i=1}^{n} p(xi | Y = 1) / ∑_{y} p(y) ∏_{i=1}^{n} p(xi | y)
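A minimal sketch of this classifier (illustrative code with a made-up four e-mail dataset; the helper names and the add-one Laplace smoothing are my choices, not from the slides):

import numpy as np

def fit_naive_bayes(X, y, alpha=1.0):
    """Estimate p(y) and p(x_i = 1 | y) from binary data with Laplace smoothing."""
    X, y = np.asarray(X, float), np.asarray(y)
    p_y1 = y.mean()
    p_xi_given_y = {}
    for label in (0, 1):
        rows = X[y == label]
        p_xi_given_y[label] = (rows.sum(axis=0) + alpha) / (len(rows) + 2 * alpha)
    return p_y1, p_xi_given_y

def predict_spam_prob(x, p_y1, p_xi_given_y):
    """p(Y = 1 | x) = p(Y=1) prod_i p(x_i | Y=1) / sum_y p(y) prod_i p(x_i | y)."""
    x = np.asarray(x, float)
    def joint(label, prior):
        p = p_xi_given_y[label]
        return prior * np.prod(np.where(x == 1, p, 1.0 - p))
    spam, ham = joint(1, p_y1), joint(0, 1.0 - p_y1)
    return spam / (spam + ham)

# Tiny made-up data: columns = word indicators, rows = e-mails; y = 1 means spam.
X = [[1, 1, 0], [1, 0, 1], [0, 0, 1], [0, 1, 0]]
y = [1, 1, 0, 0]
p_y1, p_xi = fit_naive_bayes(X, y)
print("p(spam | x = [1, 1, 1]) =", predict_spam_prob([1, 1, 1], p_y1, p_xi))

Because naive Bayes models the full joint p(y, x1, . . . , xn), the same fitted tables could also be used generatively, e.g., to sample the word indicators of a synthetic spam e-mail.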
Naive Bayes
Logistic regression
1 For the discriminative model, assume that
p(Y = 1 | x; α) = f (x, α)
2 Not represented as a table anymore; it is a parameterized function of x (regression):
Has to be between 0 and 1
Depends in some simple but reasonable way on x1, · · · , xn
Completely specified by a vector α of n + 1 parameters (compact representation)
Linear dependence: let z(α, x) = α0 + ∑_{i=1}^{n} αi xi. Then,
p(Y = 1 | x; α) = σ(z(α, x)), where σ(z) = 1/(1 + e^{−z}) is called the logistic function.
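A short sketch of evaluating this model (illustrative code only; the parameter values α are arbitrary placeholders):

import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def p_y1_given_x(x, alpha):
    """p(Y = 1 | x; alpha) = sigma(alpha_0 + sum_i alpha_i x_i)."""
    x = np.asarray(x, float)
    z = alpha[0] + np.dot(alpha[1:], x)   # z(alpha, x)
    return sigmoid(z)

# Made-up parameters: alpha = (alpha_0, alpha_1, ..., alpha_n) with n = 3.
alpha = np.array([-1.0, 2.0, 0.5, -0.7])
x = np.array([1.0, 0.0, 1.0])
print("p(Y = 1 | x) =", p_y1_given_x(x, alpha))   # a value in (0, 1)

In practice α would be fit by maximizing the conditional likelihood of the labels; unlike naive Bayes, nothing about p(x) is modeled.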
Logistic regression
Discriminative models are powerful
Neural Models
1 For the discriminative model, assume that
p(Y = 1 | x; α) = f (x, α)
2 Linear dependence: let z(α, x) = α0 + ∑_{i=1}^{n} αi xi. Then p(Y = 1 | x; α) = σ(z(α, x)), where σ(z) = 1/(1 + e^{−z}) is the logistic function. This dependence might be too simple.
3 Non-linear dependence: let h(A, b, x) = f (Ax + b) be a non-linear transformation of the inputs (features). Then
pNeural(Y = 1 | x; α, A, b) = σ(α0 + ∑_{i=1}^{h} αi hi)
More flexible
More parameters: A, b, α
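A sketch of the non-linear version (illustrative code; A, b, α are random placeholders, and f is taken to be a ReLU here, which is a common choice but is not specified on the slide):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_neural_y1_given_x(x, A, b, alpha):
    """p_Neural(Y = 1 | x; alpha, A, b) = sigma(alpha_0 + sum_i alpha_i h_i),
    with features h = f(Ax + b); f is a ReLU in this sketch."""
    h = np.maximum(A @ x + b, 0.0)            # h(A, b, x) = f(Ax + b)
    return sigmoid(alpha[0] + np.dot(alpha[1:], h))

# Made-up shapes/parameters: n = 3 inputs, h = 4 hidden features.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))
b = rng.normal(size=4)
alpha = rng.normal(size=5)                    # alpha_0, ..., alpha_4
x = np.array([1.0, 0.0, 1.0])
print("p_Neural(Y = 1 | x) =", p_neural_y1_given_x(x, A, b, alpha))

Compared with the linear model, the extra parameters A and b let the features themselves be learned rather than fixed to the raw inputs.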
Bayesian networks vs neural models
[Figure: the chain-rule factorization compared across modeling choices: fully general conditionals, a Bayes net with conditional independence assumptions, and neural models that assume a specific functional form for the conditionals.]
Continuous variables
Chain rule, Bayes rule, etc. all still apply, with probability density functions in place of probability mass functions. For example,
pX,Y(x, y) = pX(x) pY|X(y | x)
Continuous variables
This means we can still use Bayesian networks with continuous (and
discrete) variables. Examples:
Mixture of 2 Gaussians: Network Z → X with factorization pZ,X(z, x) = pZ(z)pX|Z(x | z) and
Z ∼ Bernoulli(p)
X | (Z = 0) ∼ N(µ0, σ0), X | (Z = 1) ∼ N(µ1, σ1)
The parameters are p, µ0, σ0, µ1, σ1 (see the sampling sketch after this list)
Network Z → X with factorization pZ,X(z, x) = pZ(z)pX|Z(x | z)
Z ∼ U(a, b)
X | (Z = z) ∼ N(z, σ)
The parameters are a, b, σ
Variational autoencoder: Network Z → X with factorization pZ,X(z, x) = pZ(z)pX|Z(x | z) and
Z ∼ N(0, 1)
X | (Z = z) ∼ N(µθ(z), e^{σφ(z)}), where µθ : R → R and σφ : R → R are neural networks with parameters (weights) θ and φ respectively
Note: Even if µθ and σφ are very deep (flexible), the functional form of p(x | z) is still Gaussian
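Since every factorization above has the form pZ,X(z, x) = pZ(z)pX|Z(x | z), sampling is ancestral: draw z from the prior, then x from the conditional. A sketch for the mixture of 2 Gaussians, with made-up parameter values and σ interpreted as a standard deviation:

import numpy as np

rng = np.random.default_rng(0)

# Made-up parameters for the mixture of 2 Gaussians: Z -> X.
p = 0.3                          # p(Z = 1)
mu = [-2.0, 3.0]                 # mu_0, mu_1
sigma = [0.5, 1.0]               # sigma_0, sigma_1 (standard deviations)

# Ancestral sampling from p_{Z,X}(z, x) = p_Z(z) p_{X|Z}(x | z).
def sample():
    z = rng.binomial(1, p)                  # Z ~ Bernoulli(p)
    x = rng.normal(mu[z], sigma[z])         # X | (Z = z) ~ N(mu_z, sigma_z)
    return z, x

samples = [sample() for _ in range(5)]
print(samples)

The variational autoencoder case is sampled the same way, with z drawn from N(0, 1) and the parameters of p(x | z) computed by the neural networks µθ and σφ.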