03 Multivariate Probability
Winter 2024
January 16 2024
Announcements
• Please read FAQ document on course webpage.
• Course information at https://nidhihegde.github.io/mlbasics
• Assignment due dates
• TA Office hours - updated
• Participation - Reading Exercises
• on eClass;
• open for a 48 hour period; one hour to complete
• first one is a practice one
• second one opens Monday, closes (due) Tuesday 11:59 pm
Outline
1. Recap
2. Random Variables
3. Independence
• The sample space Ω : the set of all possible outcomes of the experiment.
• The event space ℰ : ℰ ⊆ 𝒫(Ω), the set of events (subsets of outcomes) to which we can assign probability.
• A probability distribution is defined on a measurable space consisting of a sample space and an event space: any function P : ℰ → [0,1] that is a probability measure.
• Discrete sample spaces (and random variables) are defined in terms of probability mass
functions (PMFs)
• Continuous sample spaces (and random variables) are defined in terms of probability
density functions (PDFs)
Discrete vs. Continuous Sample Spaces

Discrete (countable) outcomes              Continuous (uncountable) outcomes
Ω = {1, 2, 3, 4, 5, 6}                      Ω = [0, 1]
Ω = {person, woman, man, camera, TV, …}     Ω = ℝ
Ω = ℕ                                       Ω = ℝ^k
We could ask about P(X ≥ 4), where X = the number that comes up.
Random variables are a way of reasoning about a complicated underlying
probability space in a more straightforward way.
Random Variables, Formally
Given a probability space (Ω, ℰ, P), a random variable is a function
X : Ω → Ω_X (where Ω_X is some other outcome space), satisfying
{ω ∈ Ω ∣ X(ω) ∈ A} ∈ ℰ  ∀A ∈ ℬ(Ω_X).
It follows that P_X(A) = P({ω ∈ Ω ∣ X(ω) ∈ A}).
Example: Let Ω be a population of people, and X(ω) = height, and
A = [5′1′′,5′2′′].
P(X ∈ A) = P(5′1′′ ≤ X ≤ 5′2′′) = P({ω ∈ Ω : X(ω) ∈ A}).
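The definition above can be made concrete with a tiny sketch: a hypothetical five-person population Ω (names and heights are made up), a uniform probability over Ω, and X mapping each person to a height in inches.

```python
# A random variable is just a function on the sample space. Here Omega is a
# hypothetical five-person population and X maps each person to a height (inches).
heights = {"ann": 61, "bob": 70, "cam": 62, "dee": 68, "eve": 61.5}

def prob_X_in(lo, hi, omega=heights):
    """P(X in [lo, hi]) under a uniform distribution over Omega:
    the probability of the event {omega : X(omega) in A}."""
    favorable = [w for w, h in omega.items() if lo <= h <= hi]
    return len(favorable) / len(omega)

# P(5'1'' <= X <= 5'2'') = P({omega in Omega : X(omega) in [61, 62]})
p = prob_X_in(61, 62)
```

The event on the right-hand side lives in the original event space ℰ; the function `prob_X_in` just counts which underlying outcomes land in A.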
Random Variables and Events
Y = { 1 if event A occurred
      0 otherwise.

Example:
Y(ω) = { 1 if ω = left   (i.e., Y = 1 if landed on left)
         0 otherwise.
∑_{x∈𝒳} ∑_{y∈𝒴} p(x, y) = 1

Example: 𝒳 = {0,1} (young, old) and 𝒴 = {0,1} (no arthritis, arthritis)

              Y = 0                       Y = 1
X = 0    P(X=0, Y=0) = 50/100       P(X=0, Y=1) = 1/100
X = 1    P(X=1, Y=0) = 10/100       P(X=1, Y=1) = 39/100

Exercise: Check that ∑_{x∈{0,1}} ∑_{y∈{0,1}} p(x, y) = 1:

p(0,0) + p(0,1) + p(1,0) + p(1,1) = 1/2 + 1/100 + 1/10 + 39/100 = 1 ✓
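The sum-to-one check can be done mechanically. A minimal sketch using exact fractions, with the table values from the example above:

```python
from fractions import Fraction

# Joint PMF for the age/arthritis example:
# X = 0/1 (young/old), Y = 0/1 (no arthritis/arthritis).
p = {(0, 0): Fraction(50, 100), (0, 1): Fraction(1, 100),
     (1, 0): Fraction(10, 100), (1, 1): Fraction(39, 100)}

# A valid joint PMF must sum to 1 over all (x, y) pairs.
total = sum(p.values())
```

Using `Fraction` avoids floating-point round-off, so the check is exact.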
Questions About Multiple Variables
Example: 𝒳 = {0,1} (young, old) and 𝒴 = {0,1} (no arthritis, arthritis)
              Y = 0                    Y = 1
X = 0    P(X=0, Y=0) = 1/2       P(X=0, Y=1) = 1/100
X = 1    P(X=1, Y=0) = 1/10      P(X=1, Y=1) = 39/100

More generically:

p(y) = ∑_{x∈𝒳} p(x, y)
Back to our example
Example: 𝒳 = {0,1} (young, old) and 𝒴 = {0,1} (no arthritis, arthritis)
              Y = 0                    Y = 1
X = 0    p = 1/2                 p = 1/100
X = 1    p = 1/10                p = 39/100

• Exercise: Compute marginal p(x) = ∑_{y∈{0,1}} p(x, y)
Back to our example (cont)
Example: 𝒳 = {0,1} (young, old) and 𝒴 = {0,1} (no arthritis, arthritis)
              Y = 0                    Y = 1
X = 0    p = 1/2                 p = 1/100
X = 1    p = 1/10                p = 39/100

• Exercise: Compute marginal p(x = 1) = ∑_{y∈{0,1}} p(x = 1, y) = 10/100 + 39/100 = 49/100,
  p(x = 0) = 1 − p(x = 1) = 51/100
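The marginalization exercise, sketched in Python with the table values from the example:

```python
# Marginalizing the joint table: p(x) = sum over y of p(x, y).
p = {(0, 0): 0.50, (0, 1): 0.01,
     (1, 0): 0.10, (1, 1): 0.39}

def marginal_x(x):
    return sum(p[(x, y)] for y in (0, 1))

def marginal_y(y):
    return sum(p[(x, y)] for x in (0, 1))

px1 = marginal_x(1)   # p(x = 1) = 10/100 + 39/100 = 49/100
px0 = marginal_x(0)   # p(x = 0) = 50/100 + 1/100  = 51/100
```

The same helper with the roles swapped gives p(y), e.g. p(y = 1) = 1/100 + 39/100 = 40/100.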
Marginal distributions
• For two random variables X, Y:
  • If both are discrete: p(x) = ∑_{y∈𝒴} p(x, y)
  • If both are continuous: p(x) = ∫_𝒴 p(x, y) dy
  • If X is discrete and Y is continuous: p(x) = ∫_𝒴 p(x, y) dy
  • If X is continuous and Y is discrete: p(x) = ∑_{y∈𝒴} p(x, y)
Marginal Distributions
A marginal distribution is defined for a subset of the variables in X⃗ by summing or integrating out the remaining variables. (We will often say that we are "marginalizing over" or "marginalizing out" the remaining variables.)

Discrete case: p(xi) = ∑_{x1∈𝒳1} ⋯ ∑_{xi−1∈𝒳i−1} ∑_{xi+1∈𝒳i+1} ⋯ ∑_{xd∈𝒳d} p(x1, …, xd)

Continuous case: p(xi) = ∫_{𝒳1} ⋯ ∫_{𝒳i−1} ∫_{𝒳i+1} ⋯ ∫_{𝒳d} p(x1, …, xd) dx1 ⋯ dxi−1 dxi+1 ⋯ dxd
Discrete case:
p : 𝒳1 × 𝒳2 × … × 𝒳d → [0,1] is a (joint) probability mass function if

∑_{x1∈𝒳1} ∑_{x2∈𝒳2} ⋯ ∑_{xd∈𝒳d} p(x1, x2, …, xd) = 1

Continuous case:
p : 𝒳1 × 𝒳2 × … × 𝒳d → [0,∞) is a (joint) probability density function if

∫_{𝒳1} ∫_{𝒳2} ⋯ ∫_{𝒳d} p(x1, x2, …, xd) dx1 dx2 ⋯ dxd = 1
Rules of Probability Already Cover the Multidimensional Case

Outcome space is 𝒳 = 𝒳1 × 𝒳2 × … × 𝒳d
Outcomes are multidimensional variables x = [x1, x2, …, xd]

Discrete case: p : 𝒳 → [0,1] is a (joint) probability mass function if ∑_{x∈𝒳} p(x) = 1

Continuous case: p : 𝒳 → [0,∞) is a (joint) probability density function if ∫_𝒳 p(x) dx = 1

But it is useful to recognize that we have multiple variables.
Conditional Distribution
Definition: Conditional probability distribution
P(Y = y ∣ X = x) = P(X = x, Y = y) / P(X = x)

This same equation will hold for the corresponding PDF or PMF:

p(y ∣ x) = p(x, y) / p(x)
Question: if p(x, y) is small, does that imply that p(y ∣ x) is small?
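To the question above: no. A small p(x, y) does not force a small p(y ∣ x), because the marginal p(x) may itself be small. A sketch with hypothetical numbers (not from the slides):

```python
# p(y | x) = p(x, y) / p(x). A small joint probability does NOT imply a small
# conditional: if p(x) itself is small, the ratio can be large.
p_x = 0.01     # hypothetical marginal p(x): x is a rare event
p_xy = 0.009   # hypothetical joint p(x, y): also small

p_y_given_x = p_xy / p_x   # large, despite p(x, y) being small
```

Here both p(x) and p(x, y) are tiny, yet given that x occurred, y is very likely.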
Visualizing the conditional distribution

[Figure: visualization of a conditional distribution]

p(x1, …, xd) = p(xd) ∏_{i=1}^{d−1} p(xi ∣ xi+1, …, xd)

            = p(x1) ∏_{i=2}^{d} p(xi ∣ x1, …, xi−1)
The Order Does Not Matter
The RVs are not ordered, so we can write
p(x, y, z) = p(x ∣ y, z)p(y | z)p(z)
= p(x ∣ y, z)p(z | y)p(y)
= p(y ∣ x, z)p(x | z)p(z)
= p(y ∣ x, z)p(z | x)p(x)
= p(z ∣ x, y)p(y | x)p(x)
= p(z ∣ x, y)p(x | y)p(y)
All of these factorizations are equal; the order in which we condition does not matter.
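The equality of the factorizations can be checked numerically. A sketch with a randomly generated joint PMF over three binary variables; the helper `marg` is ours, not from the slides:

```python
import itertools
import random

random.seed(0)
# Build a random joint PMF over three binary variables (x, y, z).
vals = [random.random() for _ in range(8)]
total = sum(vals)
p = {xyz: v / total for xyz, v in zip(itertools.product((0, 1), repeat=3), vals)}

def marg(**fixed):
    """Marginal probability that the named variables (x, y, z) take the given values."""
    names = ("x", "y", "z")
    return sum(pr for key, pr in p.items()
               if all(dict(zip(names, key))[k] == v for k, v in fixed.items()))

# Two different factorization orders of p(x=1, y=0, z=1):
joint = p[(1, 0, 1)]
# p(x | y, z) p(y | z) p(z)
order1 = (p[(1, 0, 1)] / marg(y=0, z=1)) * (marg(y=0, z=1) / marg(z=1)) * marg(z=1)
# p(y | x, z) p(x | z) p(z)  written as  p(y | x, z) p(z | x) p(x)
order2 = (p[(1, 0, 1)] / marg(x=1, z=1)) * (marg(x=1, z=1) / marg(x=1)) * marg(x=1)
```

Each factorization telescopes back to the same joint probability, so the results agree up to floating-point error.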
Bayes' Rule
From the chain rule, we have:
p(x, y) = p(y ∣ x)p(x)
= p(x ∣ y)p(y)
• Often, p(x ∣ y) is easier to compute than p(y ∣ x)
• e.g., where x is features and y is label
p(y ∣ x) = p(x ∣ y) p(y) / p(x)
Example:
p(Test = pos ∣ Drug = T) = 0.99
p(Test = pos ∣ Drug = F) = 0.01
p(Drug = T) = 0.005

Questions:
1. What is p(Drug = F)?
2. What is p(Drug = T ∣ Test = pos)?
p(Test = pos) = ∑_{d∈{T,F}} p(Test = pos, Drug = d)
= p(Test = pos, Drug = F) + p(Test = pos, Drug = T)
= p(Test = pos | Drug = F) p(Drug = F) + p(Test = pos | Drug = T) p(Drug = T)
= 0.01 × 0.995 + 0.99 × 0.005 = 0.0149
Example: Posterior, Likelihood, Prior

p(y ∣ x) = p(x ∣ y) p(y) / p(x)

where p(y ∣ x) is the posterior, p(x ∣ y) is the likelihood, and p(y) is the prior.
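Putting the pieces together for the drug-test example (using the stated value p(Test = pos ∣ Drug = F) = 0.01):

```python
# Bayes' rule with the example's numbers.
p_pos_given_T = 0.99
p_pos_given_F = 0.01
p_T = 0.005
p_F = 1 - p_T                      # question 1: p(Drug = F) = 0.995

# Evidence via the law of total probability.
p_pos = p_pos_given_F * p_F + p_pos_given_T * p_T   # = 0.0149

# Question 2: posterior p(Drug = T | Test = pos) = likelihood * prior / evidence.
posterior = p_pos_given_T * p_T / p_pos
```

Even with a 99% accurate test, the posterior probability of drug use given a positive test is only about a third, because the prior p(Drug = T) = 0.005 is so small.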
P(X = Heads) = ∑_{z∈{0.3,0.5,0.8}} P(X = Heads | Z = z) p(Z = z)
= P(X = Heads | Z = 0.3) p(Z = 0.3)
+ P(X = Heads | Z = 0.5) p(Z = 0.5)
+ P(X = Heads | Z = 0.8) p(Z = 0.8)
= 0.3 × 0.7 + 0.5 × 0.2 + 0.8 × 0.1 = 0.39
Example: Coins (4)
• Now imagine we do not know Z
• e.g., you randomly grabbed it from a bin of coins with probabilities
P(Z = 0.3) = 0.7, P(Z = 0.5) = 0.2 and P(Z = 0.8) = 0.1
• Is P(X = Heads, Y = Heads) = P(X = Heads)p(Y = Heads)?
• For brevity, let's use h for Heads

P(X = h, Y = h) = ∑_{z∈{0.3,0.5,0.8}} P(X = h, Y = h | Z = z) p(Z = z)
= ∑_{z∈{0.3,0.5,0.8}} P(X = h | Z = z) P(Y = h | Z = z) p(Z = z)
= P(X = h | Z = 0.3) P(Y = h | Z = 0.3) p(Z = 0.3)
+ P(X = h | Z = 0.5) P(Y = h | Z = 0.5) p(Z = 0.5)
+ P(X = h | Z = 0.8) P(Y = h | Z = 0.8) p(Z = 0.8)
= 0.3 × 0.3 × 0.7 + 0.5 × 0.5 × 0.2 + 0.8 × 0.8 × 0.1
= 0.177 ≠ 0.39 × 0.39 = 0.1521
Example: Coins (4)
• Let Z be the bias of the coin, with 𝒵 = {0.3,0.5,0.8} and probabilities
P(Z = 0.3) = 0.7, P(Z = 0.5) = 0.2 and P(Z = 0.8) = 0.1.
• Let X and Y be two consecutive flips of the coin
• Question: Are X and Y conditionally independent given Z?
• i.e., P(X = x, Y = y | Z = z) = P(X = x | Z = z)P(Y = y | Z = z)
• Question: Are X and Y independent?
• i.e. P(X = x, Y = y) = P(X = x)P(Y = y)
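Both questions can be answered numerically with the slide's mixture of coin biases: the flips are independent given Z, but not marginally.

```python
# Mixture-of-coins example: Z is the unknown bias; X and Y are two flips that
# are independent GIVEN Z but dependent marginally.
p_z = {0.3: 0.7, 0.5: 0.2, 0.8: 0.1}

# P(X = h) by total probability: sum_z P(X = h | Z = z) p(Z = z).
p_heads = sum(z * pz for z, pz in p_z.items())        # 0.39

# P(X = h, Y = h): conditional independence given Z lets us multiply z * z.
p_both = sum(z * z * pz for z, pz in p_z.items())     # 0.177

# Marginally, the flips are NOT independent:
independent = abs(p_both - p_heads ** 2) < 1e-9
```

Observing the first flip tells us something about the hidden bias Z, which in turn shifts our prediction for the second flip.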
The Distribution Changes Based on
What We Know
• The coin has some true bias z
• If we do not know that bias, then from our perspective the coin outcomes follow the probabilities P(X = x, Y = y)
• and X and Y are correlated
𝔼[X] = ∑_{x∈{0,1}} p(x) x = 0 × p(x = 0) + 1 × p(x = 1) = 0 × 0.25 + 1 × 0.75 = 0.75
Expected Value with Functions
The expected value of a function f : 𝒳 → ℝ of a random variable is the
weighted average of that function's value over the domain of the variable.
Example:
Suppose you get $10 if heads is flipped, or lose $3 if tails is flipped.
What are your winnings on expectation?
Expected Value Example
Example:
Suppose you get $10 if heads is flipped, or lose $3 if tails is flipped.
What are your winnings on expectation?
X is the outcome of the coin flip, 1 for heads and 0 for tails
f(x) = { 10 if x = 1
         −3 if x = 0

Another example: let X be the outcome of a die roll, with

f(x) = { −1 if x ≤ 3
          1 if x ≥ 4
Y = f(X) is a new random variable. We see Y = − 1 each time we observe 1, 2 or 3.
We see Y = 1 each time we observe 4, 5, or 6.
𝔼[Y] = 𝔼[f(X)] = ∑_{x∈𝒳} f(x) p(x)
𝔼[Y] = 𝔼[f(X)] = ∑_{x∈𝒳} f(x) p(x) = ∑_{y∈{−1,1}} y p(y)

p(Y = −1) = p(X = 1) + p(X = 2) + p(X = 3) = 0.5
p(Y = 1) = p(X = 4) + p(X = 5) + p(X = 6) = 0.5

Summing over x with p(x) is equivalent, and simpler (no need to infer p(y)).
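Both ways of computing the expectation, sketched for the die example with exact fractions (assuming a fair die, consistent with the 0.5 probabilities above):

```python
from fractions import Fraction

# Fair die: p(x) = 1/6 for x in 1..6; f maps low rolls to -1, high rolls to +1.
p = {x: Fraction(1, 6) for x in range(1, 7)}
f = lambda x: -1 if x <= 3 else 1

# Way 1: sum over x with p(x) -- no need to derive p(y).
E_fX = sum(f(x) * px for x, px in p.items())

# Way 2: derive p(y) for Y = f(X), then sum over y.
p_y = {-1: p[1] + p[2] + p[3], 1: p[4] + p[5] + p[6]}
E_Y = sum(y * py for y, py in p_y.items())
```

Both routes give the same number; way 1 skips the intermediate distribution p(y).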
Expected Value is a Lossy Summary
[Figure: two distributions P(X) over X ∈ {1, 2, 3, 4, 5}, both with 𝔼[X] = 3, but with 𝔼[X²] ≃ 10 for the first and 𝔼[X²] ≃ 12 for the second.]
Conditional Expectations
Definition:
The expected value of Y conditional on X = x is
𝔼[Y ∣ X = x] = { ∑_{y∈𝒴} y p(y ∣ x)      if Y is discrete,
                 ∫_𝒴 y p(y ∣ x) dy       if Y is continuous.
• Another example: 𝔼[X | Z = 0.3] the expected outcome of the coin flip
given that the bias is 0.3 (𝔼[X | Z = 0.3] = 0 × 0.7 + 1 × 0.3 = 0.3)
Conditional Expectation Example (cont)
• What do we mean by p(y | X = 0)? How might it differ from p(y | X = 1)?

[Figure: p(y) for X = 0 (fiction books) vs. p(y) for X = 1 (nonfiction books)]
• 𝔼 [𝔼 [Y ∣ X]] = 𝔼[Y]
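The tower property 𝔼[𝔼[Y ∣ X]] = 𝔼[Y] can be checked on the coin-bias example, where 𝔼[X ∣ Z = z] = 0 × (1 − z) + 1 × z = z:

```python
# Tower property E[E[X | Z]] = E[X] on the coin-bias example.
p_z = {0.3: 0.7, 0.5: 0.2, 0.8: 0.1}

# For a coin with bias z, the conditional expectation is E[X | Z = z] = z.
E_X_given_Z = {z: z for z in p_z}

# Averaging the conditional expectation over Z recovers E[X] = P(X = heads).
E_X = sum(E_X_given_Z[z] * pz for z, pz in p_z.items())
```

The result matches P(X = Heads) = 0.39 computed earlier by total probability.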
𝔼[X + Y] = ∑_{(x,y)∈𝒳×𝒴} p(x, y)(x + y)

= ∑_{y∈𝒴} ∑_{x∈𝒳} p(x, y)(x + y)

= ∑_{y∈𝒴} ∑_{x∈𝒳} p(x, y) x + ∑_{y∈𝒴} ∑_{x∈𝒳} p(x, y) y

= 𝔼[X] + 𝔼[Y]

where, for the first term,

∑_{y∈𝒴} ∑_{x∈𝒳} p(x, y) x = ∑_{x∈𝒳} ∑_{y∈𝒴} p(x, y) x

= ∑_{x∈𝒳} x ∑_{y∈𝒴} p(x, y)     ▹ p(x) = ∑_{y∈𝒴} p(x, y)

= ∑_{x∈𝒳} x p(x)

= 𝔼[X]

and similarly the second term equals 𝔼[Y].
What if the RVs are continuous?

𝔼[X + Y] = ∫_{𝒳×𝒴} p(x, y)(x + y) d(x, y)

= ∫_𝒴 ∫_𝒳 p(x, y)(x + y) dx dy

= ∫_𝒴 ∫_𝒳 p(x, y) x dx dy + ∫_𝒴 ∫_𝒳 p(x, y) y dx dy

= ∫_𝒳 x ∫_𝒴 p(x, y) dy dx + ∫_𝒴 y ∫_𝒳 p(x, y) dx dy

= ∫_𝒳 x p(x) dx + ∫_𝒴 y p(y) dy

= 𝔼[X] + 𝔼[Y]
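Linearity needs no independence; a quick numerical check on the age/arthritis joint table (where X and Y are in fact dependent):

```python
# Check E[X + Y] = E[X] + E[Y] on the age/arthritis joint table.
p = {(0, 0): 0.50, (0, 1): 0.01, (1, 0): 0.10, (1, 1): 0.39}

E_X = sum(pr * x for (x, y), pr in p.items())          # 0.49
E_Y = sum(pr * y for (x, y), pr in p.items())          # 0.40
E_sum = sum(pr * (x + y) for (x, y), pr in p.items())  # expectation of X + Y directly
```

The direct computation of 𝔼[X + Y] matches 𝔼[X] + 𝔼[Y] even though the variables are correlated.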
Properties of Expectations

• Linearity of expectation:
  • 𝔼[cX] = c𝔼[X] for all constants c
  • 𝔼[X + Y] = 𝔼[X] + 𝔼[Y]
• Products of expectations of independent RVs: if X and Y are independent, then 𝔼[XY] = 𝔼[X]𝔼[Y]

Aside (rewriting 𝔼[Y] in terms of the joint):

𝔼[Y] = ∑_{y∈𝒴} y p(y)                   ▹ def. 𝔼[Y]
     = ∑_{y∈𝒴} y ∑_{x∈𝒳} p(x, y)        ▹ def. marginal distribution
     = ∑_{x∈𝒳} ∑_{y∈𝒴} y p(x, y)        ▹ rearrange sums
Variance

Definition: The variance of a random variable is

Var(X) = 𝔼[(X − 𝔼[X])²]

i.e., 𝔼[f(X)] where f(x) = (x − 𝔼[X])².

Equivalently,

Var(X) = 𝔼[X²] − (𝔼[X])²

(why?)
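Expanding the square answers the "(why?)": it is a standard derivation using linearity and the fact that 𝔼[X] is a constant.

```latex
\begin{align*}
\mathrm{Var}(X) &= \mathbb{E}\!\left[(X - \mathbb{E}[X])^2\right] \\
&= \mathbb{E}\!\left[X^2 - 2X\,\mathbb{E}[X] + (\mathbb{E}[X])^2\right] \\
&= \mathbb{E}[X^2] - 2\,\mathbb{E}[X]\,\mathbb{E}[X] + (\mathbb{E}[X])^2
   && \text{linearity; } \mathbb{E}[X] \text{ is a constant} \\
&= \mathbb{E}[X^2] - (\mathbb{E}[X])^2
\end{align*}
```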
Covariance
Definition: The covariance of two random variables is

Cov(X, Y) = 𝔼[(X − 𝔼[X])(Y − 𝔼[Y])] = 𝔼[XY] − 𝔼[X]𝔼[Y].
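Both forms of the covariance can be evaluated on the age/arthritis table; a sketch (note that since X and Y are binary here, 𝔼[XY] = P(X = 1, Y = 1)):

```python
# Covariance on the age/arthritis joint table.
p = {(0, 0): 0.50, (0, 1): 0.01, (1, 0): 0.10, (1, 1): 0.39}

E_X = sum(pr * x for (x, y), pr in p.items())         # 0.49
E_Y = sum(pr * y for (x, y), pr in p.items())         # 0.40
E_XY = sum(pr * x * y for (x, y), pr in p.items())    # only (1,1) contributes: 0.39

# Direct definition and the shortcut form agree:
cov_def = sum(pr * (x - E_X) * (y - E_Y) for (x, y), pr in p.items())
cov_short = E_XY - E_X * E_Y                          # 0.39 - 0.196 = 0.194
```

The covariance is positive, reflecting that age and arthritis co-occur in this table.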
The Gaussian distribution is denoted by 𝒩.