
Introduction to Probabilistic Modeling

Volodymyr Kuleshov

Cornell Tech

Lecture 2



Announcements

Recorded lectures will appear in Canvas under “Zoom”


Good luck with ICML deadline!



Learning a Generative Model

We are given a training set of examples, e.g., images of dogs

We want to learn a probability distribution p(x) over images x such that


Generation: If we sample xnew ∼ p(x), xnew should look like a dog
(sampling)
Representation learning: We should be able to learn what these
images have in common, e.g., ears, tail, etc. (features)
First step: how to represent p(x)



Lecture Outline

1 Defining Probabilistic Models of the Data


Examples of Probabilistic Models
The Curse of Dimensionality
Parameter-Efficient Models Through Conditional Independence
Bayesian Networks: An Example of Shallow Generative Models
2 Discriminative vs. Generative Models
Naive Bayes vs. Logistic Regression
Which One to Use?
3 A First Glimpse of Deep Generative Models



Probabilistic Models: Basic Discrete Distributions

Bernoulli distribution: (biased) coin flip


Domain: {Heads, Tails}
Specify P(X = Heads) = p. Then P(X = Tails) = 1 − p.
Write: X ∼ Ber (p)
Sampling: flip a (biased) coin
Categorical distribution: (biased) m-sided die
Domain: {1, · · · , m}
Specify P(Y = i) = pi, such that p1 + · · · + pm = 1
Write: Y ∼ Cat(p1, · · · , pm)
Sampling: roll a (biased) die



Probabilistic Models: A Multi-Variate Joint Distribution
Suppose we want to define a distribution over one pixel in an image.
We use three discrete random variables:
Red Channel R. Val(R) = {0, · · · , 255}
Green Channel G . Val(G ) = {0, · · · , 255}
Blue Channel B. Val(B) = {0, · · · , 255}

Sampling from the joint distribution (r , g , b) ∼ p(R, G , B) randomly generates a color for the pixel.
How many parameters do we need to specify the joint distribution p(R = r , G = g , B = b)?
256 ∗ 256 ∗ 256 − 1 = 16,777,215
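A sketch of the counting argument and of sampling from an explicit joint table (assuming Python with NumPy; a coarse 8-level discretization per channel is used so the toy table fits in memory, and its entries are random placeholders rather than a real color model):

import numpy as np

rng = np.random.default_rng(0)

# Full 8-bit joint table would need 256*256*256 - 1 free parameters
print(256**3 - 1)  # 16777215

# Toy version: 8 levels per channel -> 8^3 = 512 joint probabilities
levels = 8
table = rng.random((levels, levels, levels))
table /= table.sum()                      # normalize so the entries sum to 1

# Sample (r, g, b) ~ p(R, G, B) by drawing a flat index and unraveling it
flat_idx = rng.choice(table.size, p=table.ravel())
r, g, b = np.unravel_index(flat_idx, table.shape)
print(r, g, b)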
The Curse of Dimensionality in Probabilistic Models
Suppose we want to model a black-and-white image of a digit with n = 28 · 28 = 784 pixels.

Pixels X1 , . . . , Xn are modeled as binary (Bernoulli) random variables, i.e., Val(Xi ) = {0, 1} = {Black, White}.
How many possible states?
2 × 2 × · · · × 2 (n times) = 2^n

Sampling from p(x1 , . . . , xn ) generates an image
How many parameters to specify the joint distribution p(x1 , . . . , xn ) over n binary pixels? 2^n − 1
Parameter-Efficient Models Through Conditional
Independence
If X1 , . . . , Xn are independent, then
p(x1 , . . . , xn ) = p(x1 )p(x2 ) · · · p(xn )

How many possible states? 2^n


How many parameters to specify the joint distribution p(x1 , . . . , xn )? n
How many to specify the marginal distribution p(x1 )? 1
2^n entries can be described by just n numbers (if |Val(Xi )| = 2)!
The independence assumption is too strong, and the resulting model is not likely to be useful:
for example, each pixel is chosen independently of all the others when we sample from the model.
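A sketch of the fully factorized model (assuming Python with NumPy; the per-pixel parameters are random placeholders): n = 784 independent Bernoulli pixels, each with its own parameter. Samples look like unstructured noise, which illustrates why full independence is too strong.

import numpy as np

rng = np.random.default_rng(0)

n = 28 * 28
p = rng.random(n)                 # n parameters, one marginal p(x_i = 1) per pixel

# Sampling: each pixel is drawn independently of all the others
image = (rng.random(n) < p).astype(int).reshape(28, 28)

# Evaluating the joint: p(x_1, ..., x_n) = prod_i p(x_i)^{x_i} (1 - p(x_i))^{1 - x_i}
x = image.ravel()
log_joint = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
print(log_joint)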



Key Notion: Conditional Independence

Two events A, B are conditionally independent given event C if

p(A ∩ B|C ) = p(A|C )p(B|C )

Random variables X , Y are conditionally independent given Z if for all values x ∈ Val(X ), y ∈ Val(Y ), z ∈ Val(Z )

p(X = x ∩ Y = y | Z = z) = p(X = x | Z = z)p(Y = y | Z = z)

We will also write p(X , Y | Z ) = p(X | Z )p(Y | Z ). Note the more compact notation.
Equivalent definition: p(X | Y , Z ) = p(X | Z ).
We write X ⊥ Y | Z
Similarly for sets of random variables, X ⊥ Y | Z
Two Important Rules in Probability

1 Chain rule Let S1 , . . . , Sn be events, p(Si ) > 0.

p(S1 ∩ S2 ∩ · · · ∩ Sn ) = p(S1 )p(S2 | S1 ) · · · p(Sn | S1 ∩ · · · ∩ Sn−1 )

2 Bayes’ rule Let S1 , S2 be events, p(S1 ) > 0 and p(S2 ) > 0.

p(S1 | S2 ) = p(S1 ∩ S2 ) / p(S2 ) = p(S2 | S1 )p(S1 ) / p(S2 )
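A quick numeric sanity check of Bayes’ rule on a small joint table (assuming Python with NumPy; the table values are made up for illustration):

import numpy as np

# Joint p(S1, S2) over two binary events, rows = S1, cols = S2 (made-up values)
joint = np.array([[0.30, 0.10],
                  [0.20, 0.40]])

p_s1 = joint.sum(axis=1)          # marginal p(S1)
p_s2 = joint.sum(axis=0)          # marginal p(S2)

# p(S1=1 | S2=1) directly from the joint ...
direct = joint[1, 1] / p_s2[1]

# ... and via Bayes' rule: p(S2=1 | S1=1) p(S1=1) / p(S2=1)
bayes = (joint[1, 1] / p_s1[1]) * p_s1[1] / p_s2[1]

print(direct, bayes)              # identical, as the rule promises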



Structure through conditional independence
Using Chain Rule

p(x1 , . . . , xn ) = p(x1 )p(x2 | x1 )p(x3 | x1 , x2 ) · · · p(xn | x1 , · · · , xn−1 )

How many parameters? 1 + 2 + 4 + · · · + 2^(n−1) = 2^n − 1

p(x1 ) requires 1 parameter
p(x2 | x1 = 0) requires 1 parameter, p(x2 | x1 = 1) requires 1 parameter
Total: 2 parameters for p(x2 | x1 )
···
2^n − 1 is still exponential; the chain rule alone does not buy us anything.
Now suppose Xi+1 ⊥ X1 , . . . , Xi−1 | Xi . Then the extra conditioning variables drop out of each factor:

p(x1 , . . . , xn ) = p(x1 )p(x2 | x1 )p(x3 | x2 ) · · · p(xn | xn−1 )

How many parameters? 1 + 2(n − 1) = 2n − 1. Exponential reduction!
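A sketch of this Markov model in code (assuming Python with NumPy; the parameter values are random placeholders): one parameter for p(x1 = 1) and two per subsequent variable, 2n − 1 in total, yet we can still sample a full joint configuration.

import numpy as np

rng = np.random.default_rng(0)
n = 28 * 28

p1 = 0.5                          # 1 parameter: p(x1 = 1)
# trans[i, a] = p(x_{i+2} = 1 | x_{i+1} = a): 2 parameters per remaining variable
trans = rng.random((n - 1, 2))    # total parameters: 1 + 2*(n-1) = 2n - 1

def sample_chain():
    x = np.empty(n, dtype=int)
    x[0] = rng.random() < p1
    for i in range(n - 1):
        x[i + 1] = rng.random() < trans[i, x[i]]
    return x

print(sample_chain()[:10])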


Bayesian Networks: General Idea

Use conditional parameterization (instead of joint parameterization)


For each random variable Xi , specify p(xi |xAi ) for set XAi of random
variables
Then get the joint parametrization as

p(x1 , . . . , xn ) = ∏i p(xi | xAi )

Need to guarantee it is a legal probability distribution. It has to correspond to a chain-rule factorization, with factors simplified due to the assumed conditional independencies.



Bayesian Networks: Formal Definition

A Bayesian network is specified by a directed acyclic graph


G = (V , E ) with:
1 One node i ∈ V for each random variable Xi
2 One conditional probability distribution (CPD) per node, p(xi | xPa(i) ),
specifying the variable’s probability conditioned on its parents’ values
Graph G = (V , E ) is called the structure of the Bayesian Network
Defines a joint distribution:

p(x1 , . . . , xn ) = ∏i∈V p(xi | xPa(i) )

Claim: p(x1 , . . . xn ) is a valid probability distribution


Economical representation: exponential in |Pa(i)|, not |V |



What is a Directed Acyclic Graph?

DAG stands for Directed Acyclic Graph



Bayesian Networks: An Example
Consider the following Bayesian network over five variables d, i, g , s, l (its edges, implied by the factorization below, are d → g , i → g , i → s, g → l):

What is its joint distribution?

p(x1 , . . . , xn ) = ∏i∈V p(xi | xPa(i) )
p(d, i, g , s, l) = p(d)p(i)p(g | i, d)p(s | i)p(l | g )
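A minimal sketch of computing this joint by multiplying local CPDs (assuming Python with NumPy; every variable is treated as binary for brevity and all CPD values are made-up placeholders, not the classic example's tables):

import numpy as np

p_d = np.array([0.6, 0.4])                      # p(D)
p_i = np.array([0.7, 0.3])                      # p(I)
p_g = np.array([[[0.9, 0.1],                    # p(G | I=0, D=0)
                 [0.6, 0.4]],                   # p(G | I=0, D=1)
                [[0.4, 0.6],                    # p(G | I=1, D=0)
                 [0.2, 0.8]]])                  # p(G | I=1, D=1)
p_s = np.array([[0.95, 0.05],                   # p(S | I=0)
                [0.20, 0.80]])                  # p(S | I=1)
p_l = np.array([[0.9, 0.1],                     # p(L | G=0)
                [0.3, 0.7]])                    # p(L | G=1)

def joint(d, i, g, s, l):
    # Probability of a full assignment = product of the local CPDs
    return p_d[d] * p_i[i] * p_g[i, d, g] * p_s[i, s] * p_l[g, l]

# Sanity check: the product of CPDs sums to 1 over all 2^5 assignments
total = sum(joint(d, i, g, s, l)
            for d in (0, 1) for i in (0, 1) for g in (0, 1)
            for s in (0, 1) for l in (0, 1))
print(joint(1, 1, 1, 0, 1), total)              # total == 1.0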
Graph Structure Encodes Conditional Independencies

The joint distribution corresponding to the above BN factors as


p(d, i, g , s, l) = p(d)p(i)p(g | i, d)p(s | i)p(l | g )

However, by the chain rule, any distribution can be written as


p(d, i, g , s, l) = p(d)p(i | d)p(g | i, d)p(s | i, d, g )p(l | g , d, i, s)

Thus, we are assuming the following extra independencies:


D ⊥ I, S ⊥ {D, G } | I , L ⊥ {I , D, S} | G .
Summary so Far

Bayesian networks are given by (G , P), where P is specified as a set of local conditional probability distributions associated with G ’s nodes
Efficient representation using a graph-based data structure
The probability of any full assignment is obtained by multiplying CPDs
Can identify some conditional independence properties by looking at
graph properties
Next: generative vs. discriminative; functional parameterizations



Lecture Outline

1 Defining Probabilistic Models of the Data


Examples of Probabilistic Models
The Curse of Dimensionality
Parameter-Efficient Models Through Conditional Independence
Bayesian Networks: An Example of Shallow Generative Models
2 Discriminative vs. Generative Models
Naive Bayes vs. Logistic Regression
Which One to Use?
3 A First Glimpse of Deep Generative Models





Naive Bayes: A Generative Classification Algorithm
Classify e-mails as spam (Y = 1) or not spam (Y = 0)
Let 1 : n index the words in our vocabulary (e.g., English)
Xi = 1 if word i appears in an e-mail, and 0 otherwise
E-mails are drawn according to some distribution p(Y , X1 , . . . , Xn )
Suppose that the words are conditionally independent given Y . Then,
p(y , x1 , . . . , xn ) = p(y ) ∏i=1..n p(xi | y )

Estimate parameters from data. Predict with Bayes rule:

p(Y = 1 | x1 , . . . , xn ) = [ p(Y = 1) ∏i=1..n p(xi | Y = 1) ] / [ ∑y∈{0,1} p(Y = y ) ∏i=1..n p(xi | Y = y ) ]

Are the independence assumptions made here reasonable? Nearly all probabilistic models are “wrong”, but many are nonetheless useful.
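A minimal Bernoulli naive Bayes sketch (assuming Python with NumPy; the toy dataset and the Laplace smoothing constant are illustrative additions, not from the slides):

import numpy as np

def fit_naive_bayes(X, y, alpha=1.0):
    """Estimate p(Y) and p(X_i = 1 | Y) from binary data, with Laplace smoothing."""
    p_y = np.array([np.mean(y == 0), np.mean(y == 1)])
    p_x_given_y = np.vstack([
        (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
        for c in (0, 1)
    ])                                    # shape (2, n): p(X_i = 1 | Y = c)
    return p_y, p_x_given_y

def predict_proba(x, p_y, p_x_given_y):
    """p(Y = 1 | x) via Bayes' rule, using logs for numerical stability."""
    log_post = np.log(p_y) + (x * np.log(p_x_given_y)
                              + (1 - x) * np.log(1 - p_x_given_y)).sum(axis=1)
    post = np.exp(log_post - log_post.max())
    return post[1] / post.sum()

# Tiny made-up dataset: columns = words, rows = e-mails, y = spam labels
X = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]])
y = np.array([1, 1, 0, 0])
p_y, p_x_given_y = fit_naive_bayes(X, y)
print(predict_proba(np.array([1, 1, 0]), p_y, p_x_given_y))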
Discriminative Models

However, suppose all we need for prediction is p(Y | X)

In the left model (the generative network Y → X), we need to specify/learn both p(Y ) and p(X | Y ), then compute p(Y | X) via Bayes rule
In the right model (the discriminative network X → Y ), it suffices to estimate just the conditional distribution p(Y | X)
We never need to model/learn/use p(X)!
Called a discriminative model because it is only useful for discriminating Y ’s label when given X
Logistic Regression: Discriminative Classification Algorithm
1 For the discriminative model, assume that

p(Y = 1 | x; α) = f (x, α)

2 Not represented as a table anymore. It is a parameterized function of x (regression)
It has to be between 0 and 1
It should depend in some simple but reasonable way on x1 , · · · , xn
It is completely specified by a vector α of n + 1 parameters (compact representation)
Linear dependence: let z(α, x) = α0 + ∑i=1..n αi xi . Then p(Y = 1 | x; α) = σ(z(α, x)), where σ(z) = 1/(1 + e^(−z)) is called the logistic function:

(Plot: the logistic function σ(z) = 1/(1 + e^(−z)), an S-shaped curve increasing from 0 to 1.)
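A sketch of the logistic model in code (assuming Python with NumPy; the toy dataset, learning rate, and the gradient-ascent fitting loop are illustrative additions, since the slide only defines the model):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, alpha):
    # p(Y = 1 | x; alpha) = sigma(alpha_0 + sum_i alpha_i x_i)
    return sigmoid(alpha[0] + X @ alpha[1:])

# Tiny made-up dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

alpha = np.zeros(3)               # n + 1 = 3 parameters
lr = 0.1
for _ in range(200):              # gradient ascent on the log-likelihood
    p = predict_proba(X, alpha)
    grad_bias = np.sum(y - p)
    grad_w = X.T @ (y - p)
    alpha += lr * np.concatenate(([grad_bias], grad_w)) / len(y)

print(alpha, predict_proba(np.array([[1.0, 1.0]]), alpha))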



Logistic Regression: Discriminative Classification Algorithm

Linear dependence: let z(α, x) = α0 + ∑i=1..n αi xi . Then p(Y = 1 | x; α) = σ(z(α, x)), where σ(z) = 1/(1 + e^(−z)) is called the logistic function

1 Decision boundary p(Y = 1 | x; α) > 0.5 is linear in x


2 Equal probability contours are straight lines
3 Probability rate of change has very specific form (third plot)



Discriminative models are powerful

Logistic model does not assume Xi ⊥ X−i | Y , unlike naive Bayes


This can make a big difference in many applications
For example, in spam classification, let X1 = 1[“bank” in e-mail] and
X2 = 1[“account” in e-mail]
Regardless of whether the e-mail is spam, these always appear together, i.e., X1 = X2
Learning in naive Bayes results in p(X1 | Y ) = p(X2 | Y ). Thus, naive Bayes double counts the evidence
Learning with logistic regression sets α1 = 0 or α2 = 0, in effect ignoring the duplicate feature
Generative models are still very useful

Using the chain rule, p(Y , X) = p(X | Y )p(Y ) = p(Y | X)p(X). The corresponding Bayesian networks are Y → X (generative factorization) and X → Y (discriminative factorization):

1 Using a conditional model is only possible when X is always observed


When some Xi variables are unobserved, the generative model allows us
to compute p(Y | Xevidence ) by marginalizing over the unseen variables



Lecture Outline

1 Defining Probabilistic Models of the Data


Examples of Probabilistic Models
The Curse of Dimensionality
Parameter-Efficient Models Through Conditional Independence
Bayesian Networks: An Example of Shallow Generative Models
2 Discriminative vs. Generative Models
Naive Bayes vs. Logistic Regression
Which One to Use?
3 A First Glimpse of Deep Generative Models





Neural Models
1 In discriminative models, we assume that

p(Y = 1 | x; α) = f (x, α)

2 Linear dependence: let z(α, x) = α0 + ∑i=1..n αi xi . Then p(Y = 1 | x; α) = σ(z(α, x)), where σ(z) = 1/(1 + e^(−z)) is the logistic function
The dependence might be too simple
3 Non-linear dependence: let h(A, b, x) = g (Ax + b) be a non-linear transformation of the inputs (features). Then

pNeural (Y = 1 | x; α, A, b) = σ(α0 + ∑i=1..h αi hi )

More flexible
More parameters: A, b, α
Can repeat multiple times to get a neural network
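A sketch of the non-linear parameterization (assuming Python with NumPy; the weights are random placeholders rather than trained values, and tanh stands in for the generic non-linearity g):

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, h = 10, 4                      # input dimension and number of hidden features
A = rng.normal(size=(h, n))       # extra parameters of the non-linear features
b = rng.normal(size=h)
alpha = rng.normal(size=h + 1)    # alpha_0 plus one weight per hidden feature

def p_neural(x):
    hidden = np.tanh(A @ x + b)               # h(A, b, x) = g(Ax + b)
    return sigmoid(alpha[0] + alpha[1:] @ hidden)

x = rng.normal(size=n)
print(p_neural(x))                # a value in (0, 1), as required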



Bayesian Networks vs Neural Models

Using Chain Rule

p(x1 , x2 , x3 , x4 ) = p(x1 )p(x2 | x1 )p(x3 | x1 , x2 )p(x4 | x1 , x2 , x3 )

Fully General

Bayes Net

p(x1 , x2 , x3 , x4 ) ≈ p(x1 )p(x2 | x1 )p(x3 | x2 )p(x4 | x1 , x3 )

Assumes conditional independencies (here x1 is dropped from the third factor and x2 from the fourth)


Neural Models

p(x1 , x2 , x3 , x4 ) ≈ p(x1 )p(x2 | x1 )pNeural (x3 | x1 , x2 )pNeural (x4 | x1 , x2 , x3 )

Assumes a specific functional form for the conditionals. A sufficiently deep neural net can approximate any function.
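A sketch of the neural variant of this factorization over four binary variables (assuming Python with NumPy): p(x1) and p(x2 | x1) are kept as small tables, while p(x3 | x1, x2) and p(x4 | x1, x2, x3) are tiny one-hidden-layer networks with random placeholder weights, standing in for the learned conditionals.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_conditional(n_inputs, h=4):
    """A tiny neural net mapping the previous x's to p(next x = 1)."""
    A = rng.normal(size=(h, n_inputs))
    b = rng.normal(size=h)
    alpha = rng.normal(size=h + 1)
    return lambda prev: sigmoid(alpha[0] + alpha[1:] @ np.tanh(A @ prev + b))

p_x1 = 0.5                                    # table for p(x1 = 1)
p_x2_given_x1 = np.array([0.3, 0.8])          # table for p(x2 = 1 | x1)
p_x3 = make_conditional(2)                    # neural p(x3 = 1 | x1, x2)
p_x4 = make_conditional(3)                    # neural p(x4 = 1 | x1, x2, x3)

def sample():
    # Sample variables one at a time, each conditioned on the previous ones
    x1 = int(rng.random() < p_x1)
    x2 = int(rng.random() < p_x2_given_x1[x1])
    x3 = int(rng.random() < p_x3(np.array([x1, x2], dtype=float)))
    x4 = int(rng.random() < p_x4(np.array([x1, x2, x3], dtype=float)))
    return x1, x2, x3, x4

print(sample())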



Continuous Variables

If X is a continuous random variable, we can usually represent it using its probability density function pX : R → R+ . However, we cannot represent this function as a table anymore. Typically we consider parameterized densities:

Gaussian: X ∼ N (µ, σ) if pX (x) = (1/(σ√(2π))) e^(−(x−µ)^2 / (2σ^2))
Uniform: X ∼ U(a, b) if pX (x) = (1/(b − a)) 1[a ≤ x ≤ b]
Etc.
If X is a continuous random vector, we can usually represent it using its joint probability density function:

Gaussian: if pX (x) = (1/√((2π)^n |Σ|)) exp(−½ (x − µ)^T Σ^(−1) (x − µ))

Chain rule, Bayes rule, etc all still apply. For example,

pX ,Y ,Z (x, y , z) = pX (x)pY |X (y | x)pZ |{X ,Y } (z | x, y )
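A sketch of this continuous chain-rule factorization (assuming Python with NumPy; the conditional means and standard deviations are made-up illustration values): each conditional is a simple parameterized density, here a Gaussian, and sampling proceeds variable by variable just as in the discrete case.

import numpy as np

rng = np.random.default_rng(0)

def gaussian_pdf(x, mu, sigma):
    # p(x) = 1 / (sigma * sqrt(2*pi)) * exp(-(x - mu)^2 / (2 sigma^2))
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# p(x, y, z) = p(x) p(y | x) p(z | x, y)
x = rng.normal(0.0, 1.0)              # X ~ N(0, 1)
y = rng.normal(2.0 * x, 1.0)          # Y | X = x ~ N(2x, 1)
z = rng.normal(x + y, 0.5)            # Z | X = x, Y = y ~ N(x + y, 0.5)

density = (gaussian_pdf(x, 0.0, 1.0)
           * gaussian_pdf(y, 2.0 * x, 1.0)
           * gaussian_pdf(z, x + y, 0.5))
print(x, y, z, density)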



Continuous Variables
This means we can still use Bayesian networks with continuous (and
discrete) variables. Examples:
Mixture of 2 Gaussians: Network Z → X with factorization
pZ ,X (z, x) = pZ (z)pX |Z (x | z) and
Z ∼ Bernoulli(p)
X | (Z = 0) ∼ N (µ0 , σ0 ) , X | (Z = 1) ∼ N (µ1 , σ1 )
The parameters are p, µ0 , σ0 , µ1 , σ1
Infinite Mixture of Gaussians: Network Z → X with factorization
pZ ,X (z, x) = pZ (z)pX |Z (x | z)
Z ∼ U(a, b)
X | (Z = z) ∼ N (z, σ)
The parameters are a, b, σ
Neural Infinite Mixture of Gaussians: Network Z → X with
factorization pZ ,X (z, x) = pZ (z)pX |Z (x | z) and
Z ∼ N (0, 1)
X | (Z = z) ∼ N (µθ (z), e^(σφ (z)) ), where µθ : R → R and σφ : R → R are neural networks with parameters (weights) θ, φ respectively
Note: Even if µθ , σφ are deep nets, the functional form of p(x | z) is still Gaussian
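A sketch of sampling from the first and third examples (assuming Python with NumPy; all parameters are made up, and tiny random-weight networks stand in for µθ and σφ rather than trained models):

import numpy as np

rng = np.random.default_rng(0)

# Mixture of 2 Gaussians: Z ~ Bernoulli(p), X | Z = k ~ N(mu_k, sigma_k)
p, mu, sigma = 0.3, np.array([-2.0, 2.0]), np.array([0.5, 1.0])
z = int(rng.random() < p)
x = rng.normal(mu[z], sigma[z])
print(z, x)

# Neural infinite mixture: Z ~ N(0, 1), X | Z = z ~ N(mu_theta(z), exp(sigma_phi(z)))
h = 8
W1, b1 = rng.normal(size=(h, 1)), rng.normal(size=h)   # shared hidden layer (placeholder weights)
w_mu, w_sigma = rng.normal(size=h), rng.normal(size=h)

def mu_theta(z):
    return w_mu @ np.tanh(W1 @ np.array([z]) + b1)

def sigma_phi(z):
    return w_sigma @ np.tanh(W1 @ np.array([z]) + b1)

z = rng.normal()
x = rng.normal(mu_theta(z), np.exp(sigma_phi(z)))
print(z, x)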
