
Introduction to Probabilistic Modeling

Volodymyr Kuleshov

Cornell Tech

Lecture 2



Announcements

Recorded lectures will appear in Canvas under “Zoom”


Good luck with ICML deadline!



Learning a Generative Model

We are given a training set of examples, e.g., images of dogs

We want to learn a probability distribution p(x) over images x such that


Generation: If we sample xnew ∼ p(x), xnew should look like a dog
(sampling)
Representation learning: We should be able to learn what these
images have in common, e.g., ears, tail, etc. (features)
First step: how to represent p(x)



Lecture Outline

1 Defining Probabilistic Models of the Data


Examples of Probabilistic Models
The Curse of Dimensionality
Parameter-Efficient Models Through Conditional Independence
Bayesian Networks: An Example of Shallow Generative Models
2 Discriminative vs. Generative Models
Naive Bayes vs. Logistic Regression
Which One to Use?
3 A First Glimpse of Deep Generative Models



Probabilistic Models: Basic Discrete Distributions

Bernoulli distribution: (biased) coin flip


Domain: {Heads, Tails}
Specify P(X = Heads) = p. Then P(X = Tails) = 1 − p.
Write: X ∼ Ber (p)
Sampling: flip a (biased) coin
Categorical distribution: (biased) m-sided die
Domain: {1, · · · , m}
Specify P(Y = i) = pi, such that p1 + · · · + pm = 1
Write: Y ∼ Cat(p1, · · · , pm)
Sampling: roll a (biased) die



Probabilistic Models: A Multi-Variate Joint Distribution
Suppose we want to define a distribution over one pixel in an image.
We use three discrete random variables:
Red Channel R. Val(R) = {0, · · · , 255}
Green Channel G . Val(G ) = {0, · · · , 255}
Blue Channel B. Val(B) = {0, · · · , 255}

Sampling from the joint distribution (r , g , b) ∼ p(R, G , B) randomly generates a color for the pixel.
How many parameters do we need to specify the joint distribution p(R = r , G = g , B = b)?
256 ∗ 256 ∗ 256 − 1 = 16,777,215
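A sketch of the counting argument and of sampling from an explicit joint table (assuming Python with NumPy; a coarse 8-level discretization per channel is used so the toy table fits in memory, and its entries are random placeholders rather than a real color model):

import numpy as np

rng = np.random.default_rng(0)

# Full 8-bit joint table would need 256*256*256 - 1 free parameters
print(256**3 - 1)  # 16777215

# Toy version: 8 levels per channel -> 8^3 = 512 joint probabilities
levels = 8
table = rng.random((levels, levels, levels))
table /= table.sum()                      # normalize so the entries sum to 1

# Sample (r, g, b) ~ p(R, G, B) by drawing a flat index and unraveling it
flat_idx = rng.choice(table.size, p=table.ravel())
r, g, b = np.unravel_index(flat_idx, table.shape)
print(r, g, b)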
The Curse of Dimensionality in Probabilistic Models
Suppose we want to model a black-and-white image of a digit with n = 28 · 28 = 784 pixels.

Pixels X1 , . . . , Xn are modeled as binary (Bernoulli) random variables, i.e., Val(Xi ) = {0, 1} = {Black, White}.
How many possible states?
2 × 2 × · · · × 2 (n times) = 2^n

Sampling from p(x1 , . . . , xn ) generates an image
How many parameters to specify the joint distribution p(x1 , . . . , xn ) over n binary pixels? 2^n − 1
Parameter-Efficient Models Through Conditional
Independence
If X1 , . . . , Xn are independent, then
p(x1 , . . . , xn ) = p(x1 )p(x2 ) · · · p(xn )

How many possible states? 2^n


How many parameters to specify the joint distribution p(x1 , . . . , xn )? n
How many to specify the marginal distribution p(x1 )? 1
2^n entries can be described by just n numbers (if |Val(Xi )| = 2)!
The independence assumption is too strong, and the resulting model is not likely to be useful:
for example, each pixel is chosen independently of all the others when we sample from the model.
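A sketch of the fully factorized model (assuming Python with NumPy; the per-pixel parameters are random placeholders): n = 784 independent Bernoulli pixels, each with its own parameter. Samples look like unstructured noise, which illustrates why full independence is too strong.

import numpy as np

rng = np.random.default_rng(0)

n = 28 * 28
p = rng.random(n)                 # n parameters, one marginal p(x_i = 1) per pixel

# Sampling: each pixel is drawn independently of all the others
image = (rng.random(n) < p).astype(int).reshape(28, 28)

# Evaluating the joint: p(x_1, ..., x_n) = prod_i p(x_i)^{x_i} (1 - p(x_i))^{1 - x_i}
x = image.ravel()
log_joint = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
print(log_joint)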



Key Notion: Conditional Independence

Two events A, B are conditionally independent given event C if

p(A ∩ B|C ) = p(A|C )p(B|C )

Random variables X , Y are conditionally independent given Z if for all values x ∈ Val(X ), y ∈ Val(Y ), z ∈ Val(Z )

p(X = x ∩ Y = y | Z = z) = p(X = x | Z = z)p(Y = y | Z = z)

We will also write p(X , Y | Z ) = p(X | Z )p(Y | Z ). Note the more compact notation.
Equivalent definition: p(X | Y , Z ) = p(X | Z ).
We write X ⊥ Y | Z
Similarly for sets of random variables, X ⊥ Y | Z
Two Important Rules in Probability

1 Chain rule Let S1 , . . . , Sn be events, p(Si ) > 0.

p(S1 ∩ S2 ∩ · · · ∩ Sn ) = p(S1 )p(S2 | S1 ) · · · p(Sn | S1 ∩ · · · ∩ Sn−1 )

2 Bayes’ rule Let S1 , S2 be events, p(S1 ) > 0 and p(S2 ) > 0.

p(S1 | S2 ) = p(S1 ∩ S2 ) / p(S2 ) = p(S2 | S1 )p(S1 ) / p(S2 )
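A quick numeric sanity check of Bayes’ rule on a small joint table (assuming Python with NumPy; the table values are made up for illustration):

import numpy as np

# Joint p(S1, S2) over two binary events, rows = S1, cols = S2 (made-up values)
joint = np.array([[0.30, 0.10],
                  [0.20, 0.40]])

p_s1 = joint.sum(axis=1)          # marginal p(S1)
p_s2 = joint.sum(axis=0)          # marginal p(S2)

# p(S1=1 | S2=1) directly from the joint ...
direct = joint[1, 1] / p_s2[1]

# ... and via Bayes' rule: p(S2=1 | S1=1) p(S1=1) / p(S2=1)
bayes = (joint[1, 1] / p_s1[1]) * p_s1[1] / p_s2[1]

print(direct, bayes)              # identical, as the rule promises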



Structure through conditional independence
Using Chain Rule

p(x1 , . . . , xn ) = p(x1 )p(x2 | x1 )p(x3 | x1 , x2 ) · · · p(xn | x1 , · · · , xn−1 )

How many parameters? 1 + 2 + 4 + · · · + 2^(n−1) = 2^n − 1

p(x1 ) requires 1 parameter
p(x2 | x1 = 0) requires 1 parameter, p(x2 | x1 = 1) requires 1 parameter
Total: 2 parameters for p(x2 | x1 )
···
2^n − 1 is still exponential; the chain rule alone does not buy us anything.
Now suppose Xi+1 ⊥ X1 , . . . , Xi−1 | Xi . Then the extra conditioning variables drop out of each factor:

p(x1 , . . . , xn ) = p(x1 )p(x2 | x1 )p(x3 | x2 ) · · · p(xn | xn−1 )

How many parameters? 1 + 2(n − 1) = 2n − 1. Exponential reduction!
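A sketch of this Markov model in code (assuming Python with NumPy; the parameter values are random placeholders): one parameter for p(x1 = 1) and two per subsequent variable, 2n − 1 in total, yet we can still sample a full joint configuration.

import numpy as np

rng = np.random.default_rng(0)
n = 28 * 28

p1 = 0.5                          # 1 parameter: p(x1 = 1)
# trans[i, a] = p(x_{i+2} = 1 | x_{i+1} = a): 2 parameters per remaining variable
trans = rng.random((n - 1, 2))    # total parameters: 1 + 2*(n-1) = 2n - 1

def sample_chain():
    x = np.empty(n, dtype=int)
    x[0] = rng.random() < p1
    for i in range(n - 1):
        x[i + 1] = rng.random() < trans[i, x[i]]
    return x

print(sample_chain()[:10])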


Bayesian Networks: General Idea

Use conditional parameterization (instead of joint parameterization)


For each random variable Xi , specify p(xi |xAi ) for set XAi of random
variables
Then get the joint parametrization as

p(x1 , . . . , xn ) = ∏i p(xi | xAi )

Need to guarantee it is a legal probability distribution. It has to correspond to a chain-rule factorization, with factors simplified due to the assumed conditional independencies.



Bayesian Networks: Formal Definition

A Bayesian network is specified by a directed acyclic graph


G = (V , E ) with:
1 One node i ∈ V for each random variable Xi
2 One conditional probability distribution (CPD) per node, p(xi | xPa(i) ),
specifying the variable’s probability conditioned on its parents’ values
Graph G = (V , E ) is called the structure of the Bayesian Network
Defines a joint distribution:

p(x1 , . . . , xn ) = ∏i∈V p(xi | xPa(i) )

Claim: p(x1 , . . . xn ) is a valid probability distribution


Economical representation: exponential in |Pa(i)|, not |V |



What is a Directed Acyclic Graph?

DAG stands for Directed Acyclic Graph



Bayesian Networks: An Example
Consider the following Bayesian network over five variables d, i, g , s, l (its edges, implied by the factorization below, are d → g , i → g , i → s, g → l):

What is its joint distribution?

p(x1 , . . . , xn ) = ∏i∈V p(xi | xPa(i) )
p(d, i, g , s, l) = p(d)p(i)p(g | i, d)p(s | i)p(l | g )
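A minimal sketch of computing this joint by multiplying local CPDs (assuming Python with NumPy; every variable is treated as binary for brevity and all CPD values are made-up placeholders, not the classic example's tables):

import numpy as np

p_d = np.array([0.6, 0.4])                      # p(D)
p_i = np.array([0.7, 0.3])                      # p(I)
p_g = np.array([[[0.9, 0.1],                    # p(G | I=0, D=0)
                 [0.6, 0.4]],                   # p(G | I=0, D=1)
                [[0.4, 0.6],                    # p(G | I=1, D=0)
                 [0.2, 0.8]]])                  # p(G | I=1, D=1)
p_s = np.array([[0.95, 0.05],                   # p(S | I=0)
                [0.20, 0.80]])                  # p(S | I=1)
p_l = np.array([[0.9, 0.1],                     # p(L | G=0)
                [0.3, 0.7]])                    # p(L | G=1)

def joint(d, i, g, s, l):
    # Probability of a full assignment = product of the local CPDs
    return p_d[d] * p_i[i] * p_g[i, d, g] * p_s[i, s] * p_l[g, l]

# Sanity check: the product of CPDs sums to 1 over all 2^5 assignments
total = sum(joint(d, i, g, s, l)
            for d in (0, 1) for i in (0, 1) for g in (0, 1)
            for s in (0, 1) for l in (0, 1))
print(joint(1, 1, 1, 0, 1), total)              # total == 1.0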
Graph Structure Encodes Conditional Independencies

The joint distribution corresponding to the above BN factors as


p(d, i, g , s, l) = p(d)p(i)p(g | i, d)p(s | i)p(l | g )

However, by the chain rule, any distribution can be written as


p(d, i, g , s, l) = p(d)p(i | d)p(g | i, d)p(s | i, d, g )p(l | g , d, i, s)

Thus, we are assuming the following extra independencies:


D ⊥ I, S ⊥ {D, G } | I , L ⊥ {I , D, S} | G .
Summary so Far

Bayesian networks are given by (G , P), where P is specified as a set of local conditional probability distributions associated with G ’s nodes
Efficient representation using a graph-based data structure
The probability of any full assignment is obtained by multiplying CPDs
Can identify some conditional independence properties by looking at
graph properties
Next: generative vs. discriminative; functional parameterizations



Lecture Outline

1 Defining Probabilistic Models of the Data


Examples of Probabilistic Models
The Curse of Dimensionality
Parameter-Efficient Models Through Conditional Independence
Bayesian Networks: An Example of Shallow Generative Models
2 Discriminative vs. Generative Models
Naive Bayes vs. Logistic Regression
Which One to Use?
3 A First Glimpse of Deep Generative Models





Naive Bayes: A Generative Classification Algorithm
Classify e-mails as spam (Y = 1) or not spam (Y = 0)
Let 1 : n index the words in our vocabulary (e.g., English)
Xi = 1 if word i appears in an e-mail, and 0 otherwise
E-mails are drawn according to some distribution p(Y , X1 , . . . , Xn )
Suppose that the words are conditionally independent given Y . Then,
p(y , x1 , . . . , xn ) = p(y ) ∏i=1..n p(xi | y )

Estimate parameters from data. Predict with Bayes rule:

p(Y = 1 | x1 , . . . , xn ) = [ p(Y = 1) ∏i=1..n p(xi | Y = 1) ] / [ ∑y∈{0,1} p(Y = y ) ∏i=1..n p(xi | Y = y ) ]

Are the independence assumptions made here reasonable? Nearly all probabilistic models are “wrong”, but many are nonetheless useful.
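A minimal Bernoulli naive Bayes sketch (assuming Python with NumPy; the toy dataset and the Laplace smoothing constant are illustrative additions, not from the slides):

import numpy as np

def fit_naive_bayes(X, y, alpha=1.0):
    """Estimate p(Y) and p(X_i = 1 | Y) from binary data, with Laplace smoothing."""
    p_y = np.array([np.mean(y == 0), np.mean(y == 1)])
    p_x_given_y = np.vstack([
        (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
        for c in (0, 1)
    ])                                    # shape (2, n): p(X_i = 1 | Y = c)
    return p_y, p_x_given_y

def predict_proba(x, p_y, p_x_given_y):
    """p(Y = 1 | x) via Bayes' rule, using logs for numerical stability."""
    log_post = np.log(p_y) + (x * np.log(p_x_given_y)
                              + (1 - x) * np.log(1 - p_x_given_y)).sum(axis=1)
    post = np.exp(log_post - log_post.max())
    return post[1] / post.sum()

# Tiny made-up dataset: columns = words, rows = e-mails, y = spam labels
X = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]])
y = np.array([1, 1, 0, 0])
p_y, p_x_given_y = fit_naive_bayes(X, y)
print(predict_proba(np.array([1, 1, 0]), p_y, p_x_given_y))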
Discriminative Models

However, suppose all we need for prediction is p(Y | X)

In the left model (the generative network Y → X), we need to specify/learn both p(Y ) and p(X | Y ), then compute p(Y | X) via Bayes rule
In the right model (the discriminative network X → Y ), it suffices to estimate just the conditional distribution p(Y | X)
We never need to model/learn/use p(X)!
Called a discriminative model because it is only useful for discriminating Y ’s label when given X
Logistic Regression: Discriminative Classification Algorithm
1 For the discriminative model, assume that

p(Y = 1 | x; α) = f (x, α)

2 Not represented as a table anymore. It is a parameterized function of x (regression)
It has to be between 0 and 1
It should depend in some simple but reasonable way on x1 , · · · , xn
It is completely specified by a vector α of n + 1 parameters (compact representation)
Linear dependence: let z(α, x) = α0 + ∑i=1..n αi xi . Then p(Y = 1 | x; α) = σ(z(α, x)), where σ(z) = 1/(1 + e^(−z)) is called the logistic function:

(Plot: the logistic function σ(z) = 1/(1 + e^(−z)), an S-shaped curve increasing from 0 to 1.)
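A sketch of the logistic model in code (assuming Python with NumPy; the toy dataset, learning rate, and the gradient-ascent fitting loop are illustrative additions, since the slide only defines the model):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, alpha):
    # p(Y = 1 | x; alpha) = sigma(alpha_0 + sum_i alpha_i x_i)
    return sigmoid(alpha[0] + X @ alpha[1:])

# Tiny made-up dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

alpha = np.zeros(3)               # n + 1 = 3 parameters
lr = 0.1
for _ in range(200):              # gradient ascent on the log-likelihood
    p = predict_proba(X, alpha)
    grad_bias = np.sum(y - p)
    grad_w = X.T @ (y - p)
    alpha += lr * np.concatenate(([grad_bias], grad_w)) / len(y)

print(alpha, predict_proba(np.array([[1.0, 1.0]]), alpha))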



Logistic Regression: Discriminative Classification Algorithm

Linear dependence: let z(α, x) = α0 + ∑i=1..n αi xi . Then p(Y = 1 | x; α) = σ(z(α, x)), where σ(z) = 1/(1 + e^(−z)) is called the logistic function

1 Decision boundary p(Y = 1 | x; α) > 0.5 is linear in x


2 Equal probability contours are straight lines
3 Probability rate of change has very specific form (third plot)



Discriminative models are powerful

Logistic model does not assume Xi ⊥ X−i | Y , unlike naive Bayes


This can make a big difference in many applications
For example, in spam classification, let X1 = 1[“bank” in e-mail] and
X2 = 1[“account” in e-mail]
Regardless of whether the e-mail is spam, these always appear together, i.e., X1 = X2
Learning in naive Bayes results in p(X1 | Y ) = p(X2 | Y ). Thus, naive Bayes double counts the evidence
Learning with logistic regression sets α1 = 0 or α2 = 0, in effect ignoring the duplicate feature
Generative models are still very useful

Using the chain rule, p(Y , X) = p(X | Y )p(Y ) = p(Y | X)p(X). The corresponding Bayesian networks are Y → X (generative factorization) and X → Y (discriminative factorization):

1 Using a conditional model is only possible when X is always observed


When some Xi variables are unobserved, the generative model allows us
to compute p(Y | Xevidence ) by marginalizing over the unseen variables



Lecture Outline

1 Defining Probabilistic Models of the Data


Examples of Probabilistic Models
The Curse of Dimensionality
Parameter-Efficient Models Through Conditional Independence
Bayesian Networks: An Example of Shallow Generative Models
2 Discriminative vs. Generative Models
Naive Bayes vs. Logistic Regression
Which One to Use?
3 A First Glimpse of Deep Generative Models





Neural Models
1 In discriminative models, we assume that

p(Y = 1 | x; α) = f (x, α)

2 Linear dependence: let z(α, x) = α0 + ∑i=1..n αi xi . Then p(Y = 1 | x; α) = σ(z(α, x)), where σ(z) = 1/(1 + e^(−z)) is the logistic function
The dependence might be too simple
3 Non-linear dependence: let h(A, b, x) = g (Ax + b) be a non-linear transformation of the inputs (features). Then

pNeural (Y = 1 | x; α, A, b) = σ(α0 + ∑i=1..h αi hi )

More flexible
More parameters: A, b, α
Can repeat multiple times to get a neural network
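A sketch of the non-linear parameterization (assuming Python with NumPy; the weights are random placeholders rather than trained values, and tanh stands in for the generic non-linearity g):

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, h = 10, 4                      # input dimension and number of hidden features
A = rng.normal(size=(h, n))       # extra parameters of the non-linear features
b = rng.normal(size=h)
alpha = rng.normal(size=h + 1)    # alpha_0 plus one weight per hidden feature

def p_neural(x):
    hidden = np.tanh(A @ x + b)               # h(A, b, x) = g(Ax + b)
    return sigmoid(alpha[0] + alpha[1:] @ hidden)

x = rng.normal(size=n)
print(p_neural(x))                # a value in (0, 1), as required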



Bayesian Networks vs Neural Models

Using Chain Rule

p(x1 , x2 , x3 , x4 ) = p(x1 )p(x2 | x1 )p(x3 | x1 , x2 )p(x4 | x1 , x2 , x3 )

Fully General

Bayes Net

p(x1 , x2 , x3 , x4 ) ≈ p(x1 )p(x2 | x1 )p(x3 | x2 )p(x4 | x1 , x3 )

Assumes conditional independencies (here x1 is dropped from the third factor and x2 from the fourth)


Neural Models

p(x1 , x2 , x3 , x4 ) ≈ p(x1 )p(x2 | x1 )pNeural (x3 | x1 , x2 )pNeural (x4 | x1 , x2 , x3 )

Assumes a specific functional form for the conditionals. A sufficiently deep neural net can approximate any function.
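A sketch of the neural variant of this factorization over four binary variables (assuming Python with NumPy): p(x1) and p(x2 | x1) are kept as small tables, while p(x3 | x1, x2) and p(x4 | x1, x2, x3) are tiny one-hidden-layer networks with random placeholder weights, standing in for the learned conditionals.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_conditional(n_inputs, h=4):
    """A tiny neural net mapping the previous x's to p(next x = 1)."""
    A = rng.normal(size=(h, n_inputs))
    b = rng.normal(size=h)
    alpha = rng.normal(size=h + 1)
    return lambda prev: sigmoid(alpha[0] + alpha[1:] @ np.tanh(A @ prev + b))

p_x1 = 0.5                                    # table for p(x1 = 1)
p_x2_given_x1 = np.array([0.3, 0.8])          # table for p(x2 = 1 | x1)
p_x3 = make_conditional(2)                    # neural p(x3 = 1 | x1, x2)
p_x4 = make_conditional(3)                    # neural p(x4 = 1 | x1, x2, x3)

def sample():
    # Sample variables one at a time, each conditioned on the previous ones
    x1 = int(rng.random() < p_x1)
    x2 = int(rng.random() < p_x2_given_x1[x1])
    x3 = int(rng.random() < p_x3(np.array([x1, x2], dtype=float)))
    x4 = int(rng.random() < p_x4(np.array([x1, x2, x3], dtype=float)))
    return x1, x2, x3, x4

print(sample())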



Continuous Variables

If X is a continuous random variable, we can usually represent it using its probability density function pX : R → R+ . However, we cannot represent this function as a table anymore. Typically we consider parameterized densities:

Gaussian: X ∼ N (µ, σ) if pX (x) = (1/(σ√(2π))) e^(−(x−µ)^2 / (2σ^2))
Uniform: X ∼ U(a, b) if pX (x) = (1/(b − a)) 1[a ≤ x ≤ b]
Etc.
If X is a continuous random vector, we can usually represent it using its joint probability density function:

Gaussian: if pX (x) = (1/√((2π)^n |Σ|)) exp(−½ (x − µ)^T Σ^(−1) (x − µ))

Chain rule, Bayes rule, etc all still apply. For example,

pX ,Y ,Z (x, y , z) = pX (x)pY |X (y | x)pZ |{X ,Y } (z | x, y )
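A sketch of this continuous chain-rule factorization (assuming Python with NumPy; the conditional means and standard deviations are made-up illustration values): each conditional is a simple parameterized density, here a Gaussian, and sampling proceeds variable by variable just as in the discrete case.

import numpy as np

rng = np.random.default_rng(0)

def gaussian_pdf(x, mu, sigma):
    # p(x) = 1 / (sigma * sqrt(2*pi)) * exp(-(x - mu)^2 / (2 sigma^2))
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# p(x, y, z) = p(x) p(y | x) p(z | x, y)
x = rng.normal(0.0, 1.0)              # X ~ N(0, 1)
y = rng.normal(2.0 * x, 1.0)          # Y | X = x ~ N(2x, 1)
z = rng.normal(x + y, 0.5)            # Z | X = x, Y = y ~ N(x + y, 0.5)

density = (gaussian_pdf(x, 0.0, 1.0)
           * gaussian_pdf(y, 2.0 * x, 1.0)
           * gaussian_pdf(z, x + y, 0.5))
print(x, y, z, density)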



Continuous Variables
This means we can still use Bayesian networks with continuous (and
discrete) variables. Examples:
Mixture of 2 Gaussians: Network Z → X with factorization
pZ ,X (z, x) = pZ (z)pX |Z (x | z) and
Z ∼ Bernoulli(p)
X | (Z = 0) ∼ N (µ0 , σ0 ) , X | (Z = 1) ∼ N (µ1 , σ1 )
The parameters are p, µ0 , σ0 , µ1 , σ1
Infinite Mixture of Gaussians: Network Z → X with factorization
pZ ,X (z, x) = pZ (z)pX |Z (x | z)
Z ∼ U(a, b)
X | (Z = z) ∼ N (z, σ)
The parameters are a, b, σ
Neural Infinite Mixture of Gaussians: Network Z → X with
factorization pZ ,X (z, x) = pZ (z)pX |Z (x | z) and
Z ∼ N (0, 1)
X | (Z = z) ∼ N (µθ (z), e^(σφ (z)) ), where µθ : R → R and σφ : R → R are neural networks with parameters (weights) θ, φ respectively
Note: Even if µθ , σφ are deep nets, the functional form of p(x | z) is still Gaussian
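A sketch of sampling from the first and third examples (assuming Python with NumPy; all parameters are made up, and tiny random-weight networks stand in for µθ and σφ rather than trained models):

import numpy as np

rng = np.random.default_rng(0)

# Mixture of 2 Gaussians: Z ~ Bernoulli(p), X | Z = k ~ N(mu_k, sigma_k)
p, mu, sigma = 0.3, np.array([-2.0, 2.0]), np.array([0.5, 1.0])
z = int(rng.random() < p)
x = rng.normal(mu[z], sigma[z])
print(z, x)

# Neural infinite mixture: Z ~ N(0, 1), X | Z = z ~ N(mu_theta(z), exp(sigma_phi(z)))
h = 8
W1, b1 = rng.normal(size=(h, 1)), rng.normal(size=h)   # shared hidden layer (placeholder weights)
w_mu, w_sigma = rng.normal(size=h), rng.normal(size=h)

def mu_theta(z):
    return w_mu @ np.tanh(W1 @ np.array([z]) + b1)

def sigma_phi(z):
    return w_sigma @ np.tanh(W1 @ np.array([z]) + b1)

z = rng.normal()
x = rng.normal(mu_theta(z), np.exp(sigma_phi(z)))
print(z, x)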
