
Representation

Stefano Ermon, Aditya Grover

Stanford University

Lecture 2

Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 1 / 30
Learning a generative model
We are given a training set of examples, e.g., images of dogs.

We want to learn a probability distribution p(x) over images x such that:

- Generation: if we sample x_new ∼ p(x), then x_new should look like a dog (sampling)
- Density estimation: p(x) should be high if x looks like a dog, and low otherwise (anomaly detection)
- Unsupervised representation learning: we should be able to learn what these images have in common, e.g., ears, tail, etc. (features)

First question: how to represent p(x)?
Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 2 / 30
Basic discrete distributions

Bernoulli distribution: (biased) coin flip

- D = {Heads, Tails}
- Specify P(X = Heads) = p. Then P(X = Tails) = 1 − p.
- Write: X ∼ Ber(p)
- Sampling: flip a (biased) coin

Categorical distribution: (biased) m-sided die

- D = {1, · · · , m}
- Specify P(Y = i) = p_i, such that ∑_i p_i = 1
- Write: Y ∼ Cat(p_1, · · · , p_m)
- Sampling: roll a (biased) die

Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 3 / 30
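
A minimal NumPy sketch of the two sampling procedures above; the probability values are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Bernoulli(p): biased coin flip with P(X = Heads) = p
p = 0.7
heads = rng.random() < p                    # True with probability p

# Categorical(p_1, ..., p_m): biased m-sided die
probs = np.array([0.1, 0.2, 0.3, 0.4])      # placeholder values; must sum to 1
face = rng.choice(len(probs), p=probs)      # index in {0, ..., m-1}

print(heads, face)
```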
Example of joint distribution
Modeling a single pixel's color. Three discrete random variables:

- Red channel R: Val(R) = {0, · · · , 255}
- Green channel G: Val(G) = {0, · · · , 255}
- Blue channel B: Val(B) = {0, · · · , 255}

Sampling from the joint distribution (r, g, b) ∼ p(R, G, B) randomly generates a color for the pixel.

How many parameters do we need to specify the joint distribution p(R = r, G = g, B = b)?

256 · 256 · 256 − 1
Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 4 / 30
Example of joint distribution

Suppose X_1, . . . , X_n are binary (Bernoulli) random variables, i.e., Val(X_i) = {0, 1} = {Black, White}.

How many possible states?

2 × 2 × · · · × 2 (n times) = 2^n

Sampling from p(x_1, . . . , x_n) generates an image.

How many parameters to specify the joint distribution p(x_1, . . . , x_n) over n binary pixels?

2^n − 1
Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 5 / 30
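
A small sketch, assuming binary pixels, of how quickly the full joint table grows; the image sizes are arbitrary examples:

```python
def full_joint_params(n: int) -> int:
    # A joint table over n binary variables has 2^n entries; one is
    # determined by the sum-to-one constraint, leaving 2^n - 1 parameters.
    return 2**n - 1

for n in [4, 16, 28 * 28]:
    print(n, full_joint_params(n))
# Already for a 28x28 binary image this is 2^784 - 1 parameters.
```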
Structure through independence

If X_1, . . . , X_n are independent, then

p(x_1, . . . , x_n) = p(x_1) p(x_2) · · · p(x_n)

How many possible states? 2^n

How many parameters to specify the joint distribution p(x_1, . . . , x_n)? How many to specify the marginal distribution p(x_1)? Just 1 each, so n in total.

2^n entries can be described by just n numbers (if |Val(X_i)| = 2)!

The independence assumption is too strong; such a model is not likely to be useful. For example, each pixel is chosen independently of the others when we sample from it.

Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 6 / 30
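
A minimal sketch of sampling under the full independence assumption, assuming binary pixels and placeholder marginals p_i = 0.5; the samples look like noise rather than natural images:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 28 * 28
p = np.full(n, 0.5)            # one parameter per pixel: n numbers in total
# Each pixel is drawn independently of the others.
image = (rng.random(n) < p).astype(np.uint8).reshape(28, 28)
```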
Key notion: conditional independence
Two events A, B are conditionally independent given event C if

p(A ∩ B | C) = p(A | C) p(B | C)

Random variables X, Y are conditionally independent given Z if for all values x ∈ Val(X), y ∈ Val(Y), z ∈ Val(Z)

p(X = x ∩ Y = y | Z = z) = p(X = x | Z = z) p(Y = y | Z = z)

We will also write p(X, Y | Z) = p(X | Z) p(Y | Z). Note the more compact notation.

Equivalent definition: p(X | Y, Z) = p(X | Z).

We write X ⊥ Y | Z. Similarly, for sets of random variables we write X ⊥ Y | Z.
Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 7 / 30
Two important rules

1. Chain rule: Let S_1, . . . , S_n be events, p(S_i) > 0.

   p(S_1 ∩ S_2 ∩ · · · ∩ S_n) = p(S_1) p(S_2 | S_1) · · · p(S_n | S_1 ∩ . . . ∩ S_{n−1})

2. Bayes' rule: Let S_1, S_2 be events, p(S_1) > 0 and p(S_2) > 0.

   p(S_1 | S_2) = p(S_1 ∩ S_2) / p(S_2) = p(S_2 | S_1) p(S_1) / p(S_2)

Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 8 / 30
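
A small numeric sanity check of both rules on a made-up joint distribution over two binary events (the table values are placeholders):

```python
import numpy as np

# joint[s1, s2] = p(S1 = s1, S2 = s2); placeholder values that sum to 1
joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])

p_s1 = joint.sum(axis=1)                   # marginal p(S1)
p_s2 = joint.sum(axis=0)                   # marginal p(S2)
p_s2_given_s1 = joint / p_s1[:, None]      # p(S2 | S1)

# Chain rule: p(S1, S2) = p(S1) p(S2 | S1)
assert np.allclose(joint, p_s1[:, None] * p_s2_given_s1)

# Bayes' rule: p(S1 | S2) = p(S2 | S1) p(S1) / p(S2)
p_s1_given_s2 = p_s2_given_s1 * p_s1[:, None] / p_s2[None, :]
assert np.allclose(p_s1_given_s2, joint / p_s2[None, :])
```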
Structure through conditional independence
Using the chain rule,

p(x_1, . . . , x_n) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) · · · p(x_n | x_1, · · · , x_{n−1})

How many parameters? 1 + 2 + 4 + · · · + 2^{n−1} = 2^n − 1

- p(x_1) requires 1 parameter
- p(x_2 | x_1 = 0) requires 1 parameter, p(x_2 | x_1 = 1) requires 1 parameter. Total 2 parameters.
- · · ·

2^n − 1 is still exponential; the chain rule alone does not buy us anything.

Now suppose X_{i+1} ⊥ X_1, . . . , X_{i−1} | X_i. Then the extra conditioning variables drop out of each factor:

p(x_1, . . . , x_n) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) · · · p(x_n | x_1, · · · , x_{n−1})
                   = p(x_1) p(x_2 | x_1) p(x_3 | x_2) · · · p(x_n | x_{n−1})

How many parameters? 2n − 1 (linear in n). Exponential reduction!

Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 9 / 30
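
A minimal sketch of the Markov-chain factorization for binary variables: counting its 2n − 1 parameters and drawing a sample by ancestral sampling (all conditional probabilities are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10

p_x1 = 0.5                               # 1 parameter for p(x_1 = 1)
p_next = rng.random(size=(n - 1, 2))     # p_next[i, v] = p(x_{i+2} = 1 | x_{i+1} = v)
assert 1 + 2 * (n - 1) == 2 * n - 1      # total parameter count

# Ancestral sampling: draw x_1, then each x_{i+1} given x_i.
x = np.zeros(n, dtype=int)
x[0] = rng.random() < p_x1
for i in range(1, n):
    x[i] = rng.random() < p_next[i - 1, x[i - 1]]
print(x)
```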
Structure through conditional independence
Suppose we have 4 random variables X_1, · · · , X_4.

Using the chain rule we can always write

p(x_1, . . . , x_4) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) p(x_4 | x_1, x_2, x_3)

If X_4 ⊥ X_2 | {X_1, X_3}, we can simplify this to

p(x_1, . . . , x_4) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) p(x_4 | x_1, x_3)

Using the chain rule with a different ordering we can always also write

p(x_1, . . . , x_4) = p(x_4) p(x_3 | x_4) p(x_2 | x_3, x_4) p(x_1 | x_2, x_3, x_4)

If X_1 ⊥ {X_2, X_3} | X_4, we can simplify this to

p(x_1, . . . , x_4) = p(x_4) p(x_3 | x_4) p(x_2 | x_3, x_4) p(x_1 | x_4)

Bayesian networks: assume an ordering and a set of conditional independencies to get a compact representation.
Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 10 / 30
Bayes Network: General Idea

Use conditional parameterization (instead of joint parameterization):

For each random variable X_i, specify p(x_i | x_{A_i}) for a set X_{A_i} of random variables.

Then get the joint parameterization as

p(x_1, . . . , x_n) = ∏_i p(x_i | x_{A_i})

We need to guarantee that this is a legal probability distribution. It has to correspond to a chain-rule factorization, with factors simplified due to the assumed conditional independencies.

Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 11 / 30
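
A minimal sketch of this idea for the chain X_1 → X_2 → X_3, i.e., p(x_1, x_2, x_3) = p(x_1) p(x_2 | x_1) p(x_3 | x_2); the CPD tables are made-up placeholders:

```python
p_x1 = {0: 0.6, 1: 0.4}                        # p(x1)
p_x2_given_x1 = {0: {0: 0.7, 1: 0.3},          # p(x2 | x1)
                 1: {0: 0.2, 1: 0.8}}
p_x3_given_x2 = {0: {0: 0.9, 1: 0.1},          # p(x3 | x2)
                 1: {0: 0.5, 1: 0.5}}

def joint(x1, x2, x3):
    # The joint is just the product of the local conditionals.
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]

# Because every CPD sums to 1 over its own variable, the product sums to 1 too,
# so this is a legal probability distribution.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
assert abs(total - 1.0) < 1e-9
```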
Bayesian networks

A Bayesian network is specified by a directed acyclic graph G = (V, E) with:

1. One node i ∈ V for each random variable X_i
2. One conditional probability distribution (CPD) per node, p(x_i | x_Pa(i)), specifying the variable's probability conditioned on its parents' values

The graph G = (V, E) is called the structure of the Bayesian network. It defines a joint distribution:

p(x_1, . . . , x_n) = ∏_{i∈V} p(x_i | x_Pa(i))

Claim: p(x_1, . . . , x_n) is a valid probability distribution.

Economical representation: the number of parameters is exponential in |Pa(i)|, not in |V|.

Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 12 / 30
Example

DAG stands for Directed Acyclic Graph

Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 13 / 30
Example
Consider the following Bayesian network over Difficulty D, Intelligence I, Grade G, SAT score S, and recommendation Letter L:

What is its joint distribution?

p(x_1, . . . , x_n) = ∏_{i∈V} p(x_i | x_Pa(i))

p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)

Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 14 / 30
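
A minimal sketch of evaluating this joint distribution for binary-valued variables. The CPD tables below are made-up placeholders, not the values from the lecture:

```python
p_d = [0.6, 0.4]                                  # p(d)
p_i = [0.7, 0.3]                                  # p(i)
p_g_given_id = {(0, 0): [0.3, 0.7],               # p(g | i, d)
                (0, 1): [0.6, 0.4],
                (1, 0): [0.1, 0.9],
                (1, 1): [0.4, 0.6]}
p_s_given_i = {0: [0.95, 0.05], 1: [0.2, 0.8]}    # p(s | i)
p_l_given_g = {0: [0.9, 0.1], 1: [0.4, 0.6]}      # p(l | g)

def joint(d, i, g, s, l):
    # p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)
    return (p_d[d] * p_i[i] * p_g_given_id[(i, d)][g]
            * p_s_given_i[i][s] * p_l_given_g[g][l])

print(joint(d=1, i=0, g=1, s=0, l=1))
```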
Bayesian network structure implies conditional independencies!

The joint distribution corresponding to the above BN factors as

p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)

However, by the chain rule, any distribution can be written as

p(d, i, g, s, l) = p(d) p(i | d) p(g | i, d) p(s | i, d, g) p(l | g, d, i, s)

Thus, we are assuming the following additional independencies: D ⊥ I, S ⊥ {D, G} | I, L ⊥ {I, D, S} | G.
Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 15 / 30
Summary

- Bayesian networks are given by (G, P), where P is specified as a set of local conditional probability distributions associated with G's nodes
- Efficient representation using a graph-based data structure
- The probability of any assignment is obtained by multiplying CPDs
- Some conditional independence properties can be identified by looking at graph properties
- Next: generative vs. discriminative; functional parameterizations

Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 16 / 30
Naive Bayes for single label prediction
Classify e-mails as spam (Y = 1) or not spam (Y = 0)

- Let 1 : n index the words in our vocabulary (e.g., English)
- X_i = 1 if word i appears in an e-mail, and 0 otherwise
- E-mails are drawn according to some distribution p(Y, X_1, . . . , X_n)
- Words are conditionally independent given Y. Then

p(y, x_1, . . . , x_n) = p(y) ∏_{i=1}^{n} p(x_i | y)

Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 17 / 30
Example: naive Bayes for classification
Classify e-mails as spam (Y = 1) or not spam (Y = 0)

- Let 1 : n index the words in our vocabulary (e.g., English)
- X_i = 1 if word i appears in an e-mail, and 0 otherwise
- E-mails are drawn according to some distribution p(Y, X_1, . . . , X_n)
- Suppose that the words are conditionally independent given Y. Then,

p(y, x_1, . . . , x_n) = p(y) ∏_{i=1}^{n} p(x_i | y)

Estimate the parameters from training data. Predict with Bayes' rule:

p(Y = 1 | x_1, . . . , x_n) = p(Y = 1) ∏_{i=1}^{n} p(x_i | Y = 1) / ∑_{y∈{0,1}} p(Y = y) ∏_{i=1}^{n} p(x_i | Y = y)

Are the independence assumptions made here reasonable?

Philosophy: Nearly all probabilistic models are "wrong", but many are nonetheless useful.
Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 18 / 30
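
A minimal sketch of the naive Bayes prediction rule above. The prior and per-word conditionals are made-up placeholders; in practice they are estimated from training data by counting:

```python
import numpy as np

p_y1 = 0.3                                   # p(Y = 1): prior probability of spam
p_word_given_y = np.array([                  # rows: words; columns: [p(X_i=1|Y=0), p(X_i=1|Y=1)]
    [0.01, 0.40],
    [0.05, 0.30],
    [0.10, 0.10],
])

def p_spam_given_x(x):
    # Bayes' rule with the naive Bayes factorization of the likelihood.
    x = np.asarray(x)
    lik = lambda y: np.prod(np.where(x == 1,
                                     p_word_given_y[:, y],
                                     1 - p_word_given_y[:, y]))
    num = p_y1 * lik(1)
    return num / (num + (1 - p_y1) * lik(0))

print(p_spam_given_x([1, 1, 0]))             # posterior probability of spam
```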
Discriminative versus generative models
Using the chain rule, p(Y, X) = p(X | Y) p(Y) = p(Y | X) p(X). Corresponding Bayesian networks: Y → X (generative) and X → Y (discriminative).

However, suppose all we need for prediction is p(Y | X).

- In the left model (Y → X), we need to specify/learn both p(Y) and p(X | Y), then compute p(Y | X) via Bayes' rule
- In the right model (X → Y), it suffices to estimate just the conditional distribution p(Y | X)
- We never need to model/learn/use p(X)!

This is called a discriminative model because it is only useful for discriminating Y's label when given X.
Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 19 / 30
Discriminative versus generative models

Since X is a random vector, the chain rule gives

p(Y, X) = p(Y) p(X_1 | Y) p(X_2 | Y, X_1) · · · p(X_n | Y, X_1, · · · , X_{n−1})
p(Y, X) = p(X_1) p(X_2 | X_1) p(X_3 | X_1, X_2) · · · p(Y | X_1, · · · , X_{n−1}, X_n)

We must make the following choices:

1. In the generative model, p(Y) is simple, but how do we parameterize p(X_i | X_{Pa(i)}, Y)?
2. In the discriminative model, how do we parameterize p(Y | X)? Here we assume we don't care about modeling p(X), because X is always given to us in a classification problem.

Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 20 / 30
Naive Bayes

1. For the generative model, assume that X_i ⊥ X_{−i} | Y (naive Bayes)

Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 21 / 30
Logistic regression
1. For the discriminative model, assume that p(Y = 1 | x; α) = f(x, α)
2. This is not represented as a table anymore. It is a parameterized function of x (regression):
   - It has to be between 0 and 1
   - It should depend in some simple but reasonable way on x_1, · · · , x_n
   - It is completely specified by a vector α of n + 1 parameters (compact representation)

Linear dependence: let z(α, x) = α_0 + ∑_{i=1}^{n} α_i x_i. Then p(Y = 1 | x; α) = σ(z(α, x)), where

σ(z) = 1 / (1 + e^{−z})

is called the logistic function.

Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 22 / 30
Logistic regression

Linear dependence: let z(α, x) = α_0 + ∑_{i=1}^{n} α_i x_i. Then p(Y = 1 | x; α) = σ(z(α, x)), where σ(z) = 1/(1 + e^{−z}) is called the logistic function.

1. The decision boundary p(Y = 1 | x; α) > 0.5 is linear in x
2. Equal-probability contours are straight lines
3. The rate of change of the probability has a very specific form (third plot)

Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 23 / 30
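
A minimal sketch of the logistic model; the weights are arbitrary placeholders rather than learned values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

alpha0 = -1.0                     # bias alpha_0
alpha = np.array([2.0, -0.5])     # weights alpha_1, ..., alpha_n

def p_y1_given_x(x):
    z = alpha0 + alpha @ np.asarray(x)     # z(alpha, x) = alpha_0 + sum_i alpha_i x_i
    return sigmoid(z)

print(p_y1_given_x([1.0, 3.0]))
# The decision boundary p = 0.5 is exactly the hyperplane
# alpha_0 + alpha . x = 0, i.e., linear in x.
```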
Discriminative models are powerful

The logistic model does not assume X_i ⊥ X_{−i} | Y, unlike naive Bayes. This can make a big difference in many applications.

- For example, in spam classification, let X_1 = 1["bank" in e-mail] and X_2 = 1["account" in e-mail]
- Suppose that regardless of whether the e-mail is spam, these words always appear together, i.e., X_1 = X_2
- Learning in naive Bayes gives p(X_1 | Y) = p(X_2 | Y). Thus, naive Bayes double counts the evidence
- Learning with logistic regression sets α_1 = 0 or α_2 = 0, in effect ignoring the duplicate feature
Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 24 / 30
Generative models are still very useful

Using the chain rule, p(Y, X) = p(X | Y) p(Y) = p(Y | X) p(X). Corresponding Bayesian networks: Y → X (generative) and X → Y (discriminative).

1. Using a conditional model is only possible when X is always observed
   - When some X_i variables are unobserved, the generative model allows us to compute p(Y | X_evidence) by marginalizing over the unseen variables

Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 25 / 30
Neural Models

1. In discriminative models, we assume that p(Y = 1 | x; α) = f(x, α)
2. Linear dependence: let z(α, x) = α_0 + ∑_{i=1}^{n} α_i x_i. Then p(Y = 1 | x; α) = σ(z(α, x)), where σ(z) = 1/(1 + e^{−z}) is the logistic function.
   - This dependence might be too simple
3. Non-linear dependence: let h(A, b, x) = f(Ax + b) be a non-linear transformation of the inputs (features). Then

   p_Neural(Y = 1 | x; α, A, b) = σ(α_0 + ∑_{i=1}^{h} α_i h_i)

   - More flexible
   - More parameters: A, b, α

Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 26 / 30
Neural Models
1. In discriminative models, we assume that p(Y = 1 | x; α) = f(x, α)
2. Linear dependence: let z(α, x) = α_0 + ∑_{i=1}^{n} α_i x_i. Then p(Y = 1 | x; α) = f(z(α, x)), where f(z) = 1/(1 + e^{−z}) is the logistic function.
   - This dependence might be too simple
3. Non-linear dependence: let h(A, b, x) = f(Ax + b) be a non-linear transformation of the inputs (features). Then

   p_Neural(Y = 1 | x; α, A, b) = f(α_0 + ∑_{i=1}^{h} α_i h_i)

   - More flexible
   - More parameters: A, b, α
   - Can repeat multiple times to get a neural network

Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 27 / 30
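
A minimal sketch of this neural parameterization: one non-linear feature layer h = f(Ax + b) followed by a logistic output. The weights are random placeholders; in practice they are learned:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, h_dim = 4, 8                        # input size and number of hidden features
A = rng.normal(size=(h_dim, n))        # parameters A, b of the feature map
b = rng.normal(size=h_dim)
alpha0 = 0.0
alpha = rng.normal(size=h_dim)         # output weights alpha_1, ..., alpha_h

def p_neural_y1_given_x(x):
    h = np.tanh(A @ np.asarray(x) + b)     # h(A, b, x) = f(Ax + b)
    return sigmoid(alpha0 + alpha @ h)     # sigma(alpha_0 + sum_i alpha_i h_i)

print(p_neural_y1_given_x([0.5, -1.0, 2.0, 0.0]))
# Stacking more such layers gives a deeper neural network.
```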
Bayesian networks vs neural models

Using the chain rule (fully general):

p(x_1, x_2, x_3, x_4) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) p(x_4 | x_1, x_2, x_3)

Bayes net:

p(x_1, x_2, x_3, x_4) ≈ p(x_1) p(x_2 | x_1) p(x_3 | x_2) p(x_4 | x_1, x_3)

Assumes conditional independencies.

Neural models:

p(x_1, x_2, x_3, x_4) ≈ p(x_1) p(x_2 | x_1) p_Neural(x_3 | x_1, x_2) p_Neural(x_4 | x_1, x_2, x_3)

Assume a specific functional form for the conditionals. A sufficiently deep neural net can approximate any function.

Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 28 / 30
Continuous variables

If X is a continuous random variable, we can usually represent it using its probability density function p_X : R → R^+. However, we cannot represent this function as a table anymore. We typically consider parameterized densities:

- Gaussian: X ∼ N(µ, σ) if p_X(x) = (1 / (σ√(2π))) e^{−(x−µ)² / 2σ²}
- Uniform: X ∼ U(a, b) if p_X(x) = (1 / (b − a)) 1[a ≤ x ≤ b]
- Etc.

If X is a continuous random vector, we can usually represent it using its joint probability density function:

- Gaussian: p_X(x) = (1 / √((2π)^n |Σ|)) exp(−(1/2)(x − µ)^T Σ^{−1}(x − µ))

Chain rule, Bayes' rule, etc. all still apply. For example,

p_{X,Y,Z}(x, y, z) = p_X(x) p_{Y|X}(y | x) p_{Z|X,Y}(z | x, y)

Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 29 / 30
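
A minimal sketch of evaluating the 1-D Gaussian density and a factorized joint density p(x, y) = p_X(x) p_{Y|X}(y | x); the particular conditional Y | X = x ∼ N(2x, 1) is a made-up example:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    # p(x) = (1 / (sigma * sqrt(2 pi))) * exp(-(x - mu)^2 / (2 sigma^2))
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def joint_density(x, y):
    # p(x, y) = p_X(x) * p_{Y|X}(y | x), with X ~ N(0, 1) and Y | X = x ~ N(2x, 1)
    return gaussian_pdf(x, mu=0.0, sigma=1.0) * gaussian_pdf(y, mu=2.0 * x, sigma=1.0)

print(gaussian_pdf(0.5, mu=0.0, sigma=1.0))
print(joint_density(0.5, 1.2))
```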
Continuous variables
This means we can still use Bayesian networks with continuous (and discrete) variables. Examples:

- Mixture of 2 Gaussians: network Z → X with factorization p_{Z,X}(z, x) = p_Z(z) p_{X|Z}(x | z) and
  - Z ∼ Bernoulli(p)
  - X | (Z = 0) ∼ N(µ_0, σ_0), X | (Z = 1) ∼ N(µ_1, σ_1)
  - The parameters are p, µ_0, σ_0, µ_1, σ_1
- Network Z → X with factorization p_{Z,X}(z, x) = p_Z(z) p_{X|Z}(x | z) and
  - Z ∼ U(a, b)
  - X | (Z = z) ∼ N(z, σ)
  - The parameters are a, b, σ
- Variational autoencoder: network Z → X with factorization p_{Z,X}(z, x) = p_Z(z) p_{X|Z}(x | z) and
  - Z ∼ N(0, 1)
  - X | (Z = z) ∼ N(µ_θ(z), e^{σ_φ(z)}), where µ_θ : R → R and σ_φ are neural networks with parameters (weights) θ, φ respectively
  - Note: even if µ_θ, σ_φ are very deep (flexible), the functional form is still Gaussian
Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 30 / 30
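
A minimal sketch of ancestral sampling from the first and last examples above (mixture of 2 Gaussians, and a VAE-style model). The parameter values and the stand-ins for the neural networks µ_θ, σ_φ are made-up placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Mixture of 2 Gaussians: Z ~ Bernoulli(p), X | (Z = k) ~ N(mu_k, sigma_k)
p, mu, sigma = 0.3, [-2.0, 2.0], [0.5, 1.0]
z = int(rng.random() < p)
x = rng.normal(mu[z], sigma[z])

# VAE-style model: Z ~ N(0, 1), X | (Z = z) ~ N(mu_theta(z), exp(sigma_phi(z)))
def mu_theta(z):       # stand-in for a neural network with parameters theta
    return np.tanh(1.5 * z)

def sigma_phi(z):      # stand-in for a neural network with parameters phi
    return -1.0 + 0.1 * z

z2 = rng.normal(0.0, 1.0)
x2 = rng.normal(mu_theta(z2), np.exp(sigma_phi(z2)))
print(x, x2)
```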
