
CMPUT 267 Basics of Machine Learning

Winter 2024
January 16 2024
Announcements
• Please read FAQ document on course webpage.
• Course information at https://nidhihegde.github.io/mlbasics
• Assignment due dates
• TA Office hours - updated
• Participation - Reading Exercises
• on eClass;
• open for a 48 hour period; one hour to complete
• the first one is a practice run
• Second one open Monday, closes (due) Tuesday 11:59 pm
Outline

1. Recap

2. Random Variables

3. Multiple Random Variables

4. Independence

5. Expectations and Moments


Recap
• Probabilities are a means of quantifying uncertainty
• The probability space models an experiment, or a real world process.

• The sample space Ω : the set of all possible outcomes of the experiment.

• The event space ℰ : ℰ ⊆ 𝒫(Ω), the collection of events (sets of outcomes) to which probabilities are assigned.
• A probability distribution is defined on a measurable space consisting of a sample space
and an event space: any function P : ℰ → [0,1] that is a probability measure.
• Discrete sample spaces (and random variables) are defined in terms of probability mass
functions (PMFs)
• Continuous sample spaces (and random variables) are defined in terms of probability
density functions (PDFs)
Discrete vs. Continuous Sample Spaces

Discrete (countable) outcomes:
• Ω = {1,2,3,4,5,6}
• Ω = {person, woman, man, camera, TV, …}
• Ω = ℕ
• ℰ = {∅, {1,2}, {3,4,5,6}, {1,2,3,4,5,6}}
• Typically: ℰ = 𝒫(Ω)

Continuous (uncountable) outcomes:
• Ω = [0,1]
• Ω = ℝ
• Ω = ℝ^k
• ℰ = {∅, [0,0.5], (0.5,1.0], [0,1]}
• Typically: ℰ = B(Ω) ("Borel field"); note: not 𝒫(Ω)

Question: Is ℰ = {{1}, {2}, {3}, {4}, {5}, {6}} a valid event space?
Random Variables
Rather than referring to the probability space, we refer to probabilities on
quantities of interest.

Example: Suppose we observe both a die's number, and where it lands.

Ω = {(left,1), (right,1), (left,2), (right,2), …, (right,6)}


We might want to think about the probability that we get a large number, without
thinking about where it landed.

We could ask about P(X ≥ 4), where X = the number that comes up.
Random variables are a way of reasoning about a complicated underlying
probability space in a more straightforward way.
Random Variables, Formally
Given a probability space (Ω, ℰ, P), a random variable is a function
X : Ω → ΩX (where ΩX is some other outcome space), satisfying
{ω ∈ Ω ∣ X(ω) ∈ A} ∈ ℰ ∀A ∈ B(ΩX).
It follows that PX(A) = P({ω ∈ Ω ∣ X(ω) ∈ A}).
Example: Let Ω be a population of people, and X(ω) = height, and
A = [5′1′′,5′2′′].
P(X ∈ A) = P(5′1′′ ≤ X ≤ 5′2′′) = P({ω ∈ Ω : X(ω) ∈ A}).
Random Variables and Events

• A Boolean expression involving random variables defines an event:


E.g., P(X ≥ 4) = P({ω ∈ Ω ∣ X(ω) ≥ 4})
• Similarly, every event can be understood as a Boolean random variable:

Y = { 1 if event A occurred
    { 0 otherwise.

• From this point onwards, we will exclusively reason in terms of random


variables rather than probability spaces.
Example: Histograms
Consider the continuous commuting example again, with observations 12.345
minutes, 11.78213 minutes, etc.

[Figure: normalized histogram of observed commute times (t from 4 to 24 minutes), overlaid with a fitted Gamma(31.3, 0.352) density; densities range up to about 0.25]

• Question: What is the random variable?


• Question: How could we turn our observations into a histogram?
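One way to answer the second question, sketched in Python (the observations here are made-up; numpy and matplotlib are assumed available):

    import numpy as np
    import matplotlib.pyplot as plt

    # hypothetical observed commute times, in minutes
    times = np.array([12.345, 11.78213, 14.2, 9.8, 16.1, 13.0, 10.5, 12.9])

    # bin the observations; density=True normalizes so the bar areas sum to 1,
    # making the histogram comparable to a fitted density like Gamma(31.3, 0.352)
    counts, bin_edges = np.histogram(times, bins=6, density=True)

    plt.bar(bin_edges[:-1], counts, width=np.diff(bin_edges), align='edge')
    plt.xlabel('commute time t (minutes)')
    plt.ylabel('estimated density')
    plt.show()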
What About Multiple Variables?
• So far, we've really been thinking about a single random variable at a time
• Straightforward to define multiple random variables on a single probability space

Example: Suppose we observe both a die's number, and where it lands.

Ω = {(left,1), (right,1), (left,2), (right,2), …, (right,6)}


X(ω) = ω_2 = the number that comes up

Y(ω) = { 1 if ω_1 = left      (i.e., Y = 1 if the die landed on the left)
       { 0 otherwise.

P(Y = 1) = P({ω ∣ Y(ω) = 1})


P(X ≥ 4 ∧ Y = 1) = P({ω ∣ X(ω) ≥ 4 ∧ Y(ω) = 1})
Joint Distribution
We typically model the interactions of different random variables.

Joint probability mass function: p(x, y) = P(X = x, Y = y)

∑_{x∈𝒳} ∑_{y∈𝒴} p(x, y) = 1

Example: 𝒳 = {0,1} (young, old) and 𝒴 = {0,1} (no arthritis, arthritis)

            Y = 0                    Y = 1
X = 0       P(X=0, Y=0) = 1/2        P(X=0, Y=1) = 1/100
X = 1       P(X=1, Y=0) = 1/10       P(X=1, Y=1) = 39/100
Is this joint distribution valid?
Visualizing the Table
Example: 𝒳 = {0,1} (young, old) and 𝒴 = {0,1} (no arthritis, arthritis)

            Y = 0                      Y = 1
X = 0       P(X=0, Y=0) = 50/100       P(X=0, Y=1) = 1/100
X = 1       P(X=1, Y=0) = 10/100       P(X=1, Y=1) = 39/100

Exercise: Check that ∑_{x∈{0,1}} ∑_{y∈{0,1}} p(x, y) = 1:

∑_{x∈{0,1}} ∑_{y∈{0,1}} p(x, y) = 1/2 + 1/100 + 1/10 + 39/100 = 1 ✓
Questions About Multiple Variables
Example: 𝒳 = {0,1} (young, old) and 𝒴 = {0,1} (no arthritis, arthritis)

            Y = 0                    Y = 1
X = 0       P(X=0, Y=0) = 1/2        P(X=0, Y=1) = 1/100
X = 1       P(X=1, Y=0) = 1/10       P(X=1, Y=1) = 39/100

• Are these two variables related at all? Or do they change independently?


• Given this distribution, can we determine the distribution over just Y?
I.e., what is P(Y = 1)? (marginal distribution)
• If we knew something about one variable, does that tell us something about the distribution
over the other? E.g., if I know X = 0 (the person is young), does that tell me the probability
that this young person has arthritis? (conditional probability P(Y = 1 ∣ X = 0))
Marginal Distribution for Y
p(Y = 0) = ∑_{x∈𝒳} p(x, 0) = ∑_{x∈{young,old}} p(x, 0)
p(Y = 1) = ∑_{x∈𝒳} p(x, 1) = ∑_{x∈{young,old}} p(x, 1)

            Y = 0                    Y = 1
X = 0       P(X=0, Y=0) = 1/2        P(X=0, Y=1) = 1/100
X = 1       P(X=1, Y=0) = 1/10       P(X=1, Y=1) = 39/100

More generically:

p(y) = ∑_{x∈𝒳} p(x, y)
Back to our example
Example: 𝒳 = {0,1} (young, old) and 𝒴 = {0,1} (no arthritis, arthritis)

            Y = 0                      Y = 1
X = 0       P(X=0, Y=0) = 50/100       P(X=0, Y=1) = 1/100
X = 1       P(X=1, Y=0) = 10/100       P(X=1, Y=1) = 39/100

Exercise: Compute the marginal p(x) = ∑_{y∈{0,1}} p(x, y)
Back to our example (cont)
Example: 𝒳 = {0,1} (young, old) and 𝒴 = {0,1} (no arthritis, arthritis)

            Y = 0                            Y = 1
X = 0       P(X=0, Y=0) = 50/100 = 1/2       P(X=0, Y=1) = 1/100
X = 1       P(X=1, Y=0) = 10/100 = 1/10      P(X=1, Y=1) = 39/100

Exercise: Compute the marginal p(x = 1) = ∑_{y∈{0,1}} p(x = 1, y) = 49/100,
p(x = 0) = 1 − p(x = 1) = 51/100
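A small sketch of these computations in Python (the table values are from the example; only numpy is assumed):

    import numpy as np

    # joint PMF p(x, y): rows index X (0=young, 1=old),
    # columns index Y (0=no arthritis, 1=arthritis)
    joint = np.array([[50/100, 1/100],
                      [10/100, 39/100]])

    assert np.isclose(joint.sum(), 1.0)   # a valid joint distribution sums to 1

    p_x = joint.sum(axis=1)   # marginalize out Y: p(x) = sum over y of p(x, y)
    p_y = joint.sum(axis=0)   # marginalize out X: p(y) = sum over x of p(x, y)

    print(p_x)   # [0.51 0.49]
    print(p_y)   # [0.6 0.4]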
Marginal distributions
• For two random variables X, Y:

If both are discrete, we have p(x) = ∑_{y∈𝒴} p(x, y)

If both are continuous, we have p(x) = ∫_𝒴 p(x, y) dy

If X is discrete and Y is continuous, then p(x) = ∫_𝒴 p(x, y) dy

If X is continuous and Y is discrete, then p(x) = ∑_{y∈𝒴} p(x, y)
Marginal Distributions
A marginal distribution is defined for a subset of X⃗ = (X_1, …, X_d) by summing or integrating
out the remaining variables. (We will often say that we are "marginalizing over"
or "marginalizing out" the remaining variables.)

Discrete case: p(x_i) = ∑_{x_1∈𝒳_1} ⋯ ∑_{x_{i−1}∈𝒳_{i−1}} ∑_{x_{i+1}∈𝒳_{i+1}} ⋯ ∑_{x_d∈𝒳_d} p(x_1, …, x_d)

Continuous case: p(x_i) = ∫_{𝒳_1} ⋯ ∫_{𝒳_{i−1}} ∫_{𝒳_{i+1}} ⋯ ∫_{𝒳_d} p(x_1, …, x_d) dx_1 … dx_{i−1} dx_{i+1} … dx_d

Question: Why do we use the same symbol p for both p(x_i) and p(x_1, …, x_d)?


• They can't be the same function, they have different domains!
Are these really the same function?

• No. They're not the same function.


• But they are derived from the same joint distribution.
• So for brevity we will write p(x, y), p(x) and p(y)
• Even though it would be more precise to write something like
p(x, y), px(x) and py(y)
• We can tell which function we're talking about from context (i.e., arguments)
PMFs and PDFs of Many Variables
In general, we can consider a d-dimensional random variable X⃗ = (X_1, …, X_d) with vector-valued outcomes x⃗ = (x_1, …, x_d), with each x_i chosen from some 𝒳_i. Then,

Discrete case:
p : 𝒳_1 × 𝒳_2 × … × 𝒳_d → [0,1] is a (joint) probability mass function if

∑_{x_1∈𝒳_1} ∑_{x_2∈𝒳_2} ⋯ ∑_{x_d∈𝒳_d} p(x_1, x_2, …, x_d) = 1

Continuous case:
p : 𝒳_1 × 𝒳_2 × … × 𝒳_d → [0,∞) is a (joint) probability density function if

∫_{𝒳_1} ∫_{𝒳_2} ⋯ ∫_{𝒳_d} p(x_1, x_2, …, x_d) dx_1 dx_2 … dx_d = 1
The Rules of Probability Already Cover the Multidimensional Case
The outcome space is 𝒳 = 𝒳_1 × 𝒳_2 × … × 𝒳_d
Outcomes are multidimensional variables x = [x_1, x_2, …, x_d]

Discrete case:
p : 𝒳 → [0,1] is a (joint) probability mass function if ∑_{x∈𝒳} p(x) = 1

Continuous case:
p : 𝒳 → [0,∞) is a (joint) probability density function if ∫_𝒳 p(x) dx = 1

But it is useful to recognize that we have multiple variables.
Conditional Distribution
Definition: Conditional probability distribution
P(Y = y ∣ X = x) = P(X = x, Y = y) / P(X = x)

This same equation holds for the corresponding PDF or PMF:

p(y ∣ x) = p(x, y) / p(x)
Question: if p(x, y) is small, does that imply that p(y ∣ x) is small?
Visualizing the Conditional Distribution

P(X = young ∣ Y = 0) = P(X = young, Y = 0)/P(Y = 0) = (50/100)/(60/100) = 50/60
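The same calculation in Python, reusing the joint table sketched above:

    import numpy as np

    joint = np.array([[50/100, 1/100],
                      [10/100, 39/100]])   # rows: X (0=young, 1=old); cols: Y

    p_y = joint.sum(axis=0)               # marginal p(y)
    p_x_given_y0 = joint[:, 0] / p_y[0]   # p(x | Y=0) = p(x, Y=0) / p(Y=0)

    print(p_x_given_y0[0])                # P(X=young | Y=0) = 50/60 ≈ 0.833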


Chain Rule

From the definition of conditional probability:


p(y ∣ x) = p(x, y) / p(x)

⟺ p(y ∣ x)p(x) = (p(x, y) / p(x)) p(x)

⟺ p(y ∣ x)p(x) = p(x, y)

This is called the Chain Rule.


Multiple Variable Chain Rule
The chain rule generalizes to multiple variables:

p(x, y, z) = p(x, y ∣ z)p(z) = p(x ∣ y, z)p(y ∣ z)p(z)

(note that p(y ∣ z)p(z) = p(y, z))

Definition: Chain rule

p(x_1, …, x_d) = p(x_d) ∏_{i=1}^{d−1} p(x_i ∣ x_{i+1}, …, x_d)

              = p(x_1) ∏_{i=2}^{d} p(x_i ∣ x_1, …, x_{i−1})
The Order Does Not Matter
The RVs are not ordered, so we can write
p(x, y, z) = p(x ∣ y, z)p(y | z)p(z)
= p(x ∣ y, z)p(z | y)p(y)
= p(y ∣ x, z)p(x | z)p(z)
= p(y ∣ x, z)p(z | x)p(x)
= p(z ∣ x, y)p(y | x)p(x)
= p(z ∣ x, y)p(x | y)p(y)
All of these probabilities are equal
In short: the order of factorization does not matter.
Bayes' Rule
From the chain rule, we have:
p(x, y) = p(y ∣ x)p(x)
= p(x ∣ y)p(y)
• Often, p(x ∣ y) is easier to compute than p(y ∣ x)
• e.g., where x is features and y is label

Definition: Bayes' rule


p(y ∣ x) = p(x ∣ y)p(y) / p(x)
Bayes' Rule
• Bayes’ rule is typically used to reason about our beliefs, given new
information
• Example: a scientist might have a belief about the prevalence of cancer in
smokers (Y), and update with new evidence (X)
• In ML: we have a belief over our estimator (Y), and we update with new
data that is like new evidence (X)

Definition: Bayes' rule

p(y ∣ x) = p(x ∣ y)p(y) / p(x)

posterior = likelihood × prior / evidence
Example: Drug Test

Recall Bayes' rule: p(y ∣ x) = p(x ∣ y)p(y) / p(x)   (posterior = likelihood × prior / evidence)

Given:
p(Test = pos ∣ Drug = T) = 0.99
p(Test = pos ∣ Drug = F) = 0.01
p(Drug = T) = 0.005

Questions:
1. What is p(Drug = F)?
2. What is p(Drug = T ∣ Test = pos)?

Mapping to the formula, let
X be Test
Y be presence of the drug
Example: Drug Test (continued)

Question 1: p(Drug = F) = 1 − p(Drug = T) = 1 − 0.005 = 0.995
Question 2: By Bayes' rule,

p(Drug = T ∣ Test = pos) = p(Test = pos ∣ Drug = T)p(Drug = T) / p(Test = pos)

We still need to compute the denominator, p(Test = pos).
By the law of total probability,

p(Test = pos) = ∑_{d∈{T,F}} p(Test = pos, Drug = d)
= p(Test = pos, Drug = F) + p(Test = pos, Drug = T)
= p(Test = pos ∣ Drug = F)p(Drug = F) + p(Test = pos ∣ Drug = T)p(Drug = T)
= 0.01 × 0.995 + 0.99 × 0.005 = 0.0149
Putting it together, with p(Test = pos) = 0.0149:

p(Drug = T ∣ Test = pos) = p(Test = pos ∣ Drug = T)p(Drug = T) / p(Test = pos)
= (0.99 × 0.005) / 0.0149 ≈ 0.332

Even with an accurate test, a positive result leaves only about a one-in-three chance that the drug is present, because the prior p(Drug = T) is so small.
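A direct transcription of this calculation in Python (plain arithmetic, no libraries):

    # givens
    p_pos_given_drug = 0.99
    p_pos_given_clean = 0.01
    p_drug = 0.005
    p_clean = 1 - p_drug                  # question 1: 0.995

    # evidence, via the law of total probability
    p_pos = p_pos_given_clean * p_clean + p_pos_given_drug * p_drug   # 0.0149

    # Bayes' rule: posterior = likelihood * prior / evidence
    p_drug_given_pos = p_pos_given_drug * p_drug / p_pos
    print(p_drug_given_pos)               # approximately 0.332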
Independence of Random Variables

Definition: X and Y are independent if:


p(x, y) = p(x)p(y)
X and Y are conditionally independent given Z if:
p(x, y ∣ z) = p(x ∣ z)p(y ∣ z)
Example: Coins
(Ex. 9 in the course text)
• Suppose you have a biased coin: the probability that it comes up heads is not
0.5. Instead, it comes up heads with some other, possibly unknown, probability.
• Let Z be the bias of the coin, with 𝒵 = {0.3,0.5,0.8} and probabilities
P(Z = 0.3) = 0.7, P(Z = 0.5) = 0.2 and P(Z = 0.8) = 0.1.
• Question: What other outcome space could we consider?
• Question: What kind of distribution is this?
• Question: What other kinds of distribution could we consider?
• Let X and Y be two consecutive flips of the coin
• Question: Are X and Y independent?
• Question: Are X and Y conditionally independent given Z?
Example: Coins (2)

• Now imagine I told you Z = 0.3 (i.e., probability of heads is 0.3)


• Let X and Y be two consecutive flips of the coin
• What is P(X = Heads | Z = 0.3)? What about P(X = Tails | Z = 0.3)?
• What is P(Y = Heads | Z = 0.3)? What about P(Y = Tails | Z = 0.3)?
• Is P(X = x, Y = y | Z = 0.3) = P(X = x | Z = 0.3)P(Y = y | Z = 0.3)?
Example: Coins (3)
• Now imagine we do not know Z
• e.g., you randomly grabbed it from a bin of coins with probabilities
P(Z = 0.3) = 0.7, P(Z = 0.5) = 0.2 and P(Z = 0.8) = 0.1
• What is P(X = Heads)?


P(X = Heads) = ∑_{z∈{0.3,0.5,0.8}} P(X = Heads ∣ Z = z)P(Z = z)
= P(X = Heads ∣ Z = 0.3)P(Z = 0.3)
+ P(X = Heads ∣ Z = 0.5)P(Z = 0.5)
+ P(X = Heads ∣ Z = 0.8)P(Z = 0.8)
= 0.3 × 0.7 + 0.5 × 0.2 + 0.8 × 0.1 = 0.39
Example: Coins (4)
• Now imagine we do not know Z
• e.g., you randomly grabbed it from a bin of coins with probabilities
P(Z = 0.3) = 0.7, P(Z = 0.5) = 0.2 and P(Z = 0.8) = 0.1
• Is P(X = Heads, Y = Heads) = P(X = Heads)p(Y = Heads)?
• For brevity, let's use h for Heads

P(X = h, Y = h) = ∑_{z∈{0.3,0.5,0.8}} P(X = h, Y = h ∣ Z = z)P(Z = z)

= ∑_{z∈{0.3,0.5,0.8}} P(X = h ∣ Z = z)P(Y = h ∣ Z = z)P(Z = z)

(using conditional independence given Z in the second step)
Example: Coins (4, cont.)
• P(Z = 0.3) = 0.7, P(Z = 0.5) = 0.2 and P(Z = 0.8) = 0.1
• Is P(X = Heads, Y = Heads) = P(X = Heads)p(Y = Heads)?


P(X = h, Y = h) = ∑_{z∈{0.3,0.5,0.8}} P(X = h, Y = h ∣ Z = z)P(Z = z)
= ∑_{z∈{0.3,0.5,0.8}} P(X = h ∣ Z = z)P(Y = h ∣ Z = z)P(Z = z)
= P(X = h ∣ Z = 0.3)P(Y = h ∣ Z = 0.3)P(Z = 0.3)
+ P(X = h ∣ Z = 0.5)P(Y = h ∣ Z = 0.5)P(Z = 0.5)
+ P(X = h ∣ Z = 0.8)P(Y = h ∣ Z = 0.8)P(Z = 0.8)
= 0.3 × 0.3 × 0.7 + 0.5 × 0.5 × 0.2 + 0.8 × 0.8 × 0.1
= 0.177 ≠ 0.39 × 0.39 = 0.1521

So X and Y are not independent.
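A quick numerical check of both quantities in Python (a direct transcription of the sums above):

    biases = {0.3: 0.7, 0.5: 0.2, 0.8: 0.1}   # P(Z = z) for each possible bias z

    # marginal P(X = h), marginalizing out the unknown bias Z
    p_h = sum(z * p_z for z, p_z in biases.items())        # 0.39

    # joint P(X = h, Y = h), using conditional independence given Z
    p_hh = sum(z * z * p_z for z, p_z in biases.items())   # 0.177

    print(p_hh, p_h * p_h)   # 0.177 vs 0.1521: X and Y are dependent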
Example: Coins (5)
• Let Z be the bias of the coin, with 𝒵 = {0.3,0.5,0.8} and probabilities
P(Z = 0.3) = 0.7, P(Z = 0.5) = 0.2 and P(Z = 0.8) = 0.1.
• Let X and Y be two consecutive flips of the coin
• Question: Are X and Y conditionally independent given Z?
• i.e., P(X = x, Y = y | Z = z) = P(X = x | Z = z)P(Y = y | Z = z)
• Question: Are X and Y independent?
• i.e. P(X = x, Y = y) = P(X = x)P(Y = y)
The Distribution Changes Based on
What We Know
• The coin has some true bias z

• If we know that bias, we reason about P(X = x | Z = z)


• Namely, the probability of x given we know the bias is z
• If we do not know that bias, then from our perspective the coin
outcomes follow the probabilities P(X = x)
• The world still flips the coin with bias z
• Conditional independence is a property of the distribution we are reasoning
about, not an objective truth about outcomes
A bit more intuition

• If we do not know that bias, then from our perspective the coin
outcomes follow the probabilities P(X = x, Y = y)
• and X and Y are correlated

• If we know X= h, do we think it’s more likely Y = h? i.e., is


P(X = h, Y = h) > P(X = h, Y = t)?
Why Are Independence and
Conditional Independence Important?
• i.e., how is this relevant?
• Let’s imagine you want to infer (or learn) the bias of the coin, from data
• data in this case corresponds to a sequence of flips X1, X2, …, Xn

• You can ask: P(Z = z | X1 = H, X2 = H, X3 = T, …, Xn = H)


[Figure: the prior p(z) over {0.3, 0.5, 0.8}, and the posterior p(z ∣ data) after seeing 10 Heads and 2 Tails]
More uses for independence
and conditional independence
• If I told you X = roof type was independent of Y = house price, would you
use X as a feature to predict Y?
• Imagine you want to predict Y = Has Lung Cancer and you have an indirect
correlation with X = Location since in Location 1 more people smoke on
average. If you could measure Z = Smokes, then X and Y would be
conditionally independent given Z.
• Suggests you could look for such causal variables, that explain these
correlations
• We will see the utility of conditional independence for learning models
Expected Value

The expected value of a random variable is the weighted average of that


variable over its domain.

Definition: Expected value of a random variable


𝔼[X] = { ∑_{x∈𝒳} x p(x)     if X is discrete
       { ∫_𝒳 x p(x) dx      if X is continuous.
Relationship to Population Average
and Sample Average
• Or Population Mean and Sample Mean
• Population Mean = Expected Value, Sample Mean estimates this number
• e.g., Population Mean = average height of the entire population
• For RV X = height, p(x) gives the probability that a randomly selected person
has height x
• Sample average: you randomly sample n heights from the population
• implicitly you are sampling heights proportionally to p
• As n gets bigger, the sample average approaches the true expected value
Connection to Sample Average
• Imagine we have a biased coin, p(x = 1) = 0.75, p(x = 0) = 0.25
• Imagine we flip this coin 1000 times, and see (x = 1) 700 times
• The sample average is
(1/1000) ∑_{i=1}^{1000} x_i = (1/1000) ( ∑_{i : x_i=0} x_i + ∑_{i : x_i=1} x_i ) = 0 × (300/1000) + 1 × (700/1000) = 0 × 0.3 + 1 × 0.7 = 0.7

• The true expected value is

∑_{x∈{0,1}} p(x) x = 0 × p(x = 0) + 1 × p(x = 1) = 0 × 0.25 + 1 × 0.75 = 0.75
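A small simulation sketch in Python showing the sample average approaching the expected value (numpy assumed):

    import numpy as np

    rng = np.random.default_rng(0)

    flips = rng.random(1000) < 0.75       # biased coin: p(x = 1) = 0.75
    print(flips.mean())                   # sample average, close to 0.75

    more_flips = rng.random(1_000_000) < 0.75
    print(more_flips.mean())              # closer still with more flips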
Expected Value with Functions
The expected value of a function f : 𝒳 → ℝ of a random variable is the
weighted average of that function's value over the domain of the variable.

Definition: Expected value of a function of a random variable


𝔼[f(X)] = { ∑_{x∈𝒳} f(x)p(x)     if X is discrete
          { ∫_𝒳 f(x)p(x) dx      if X is continuous.

Example:
Suppose you get $10 if heads is flipped, or lose $3 if tails is flipped.
What are your winnings on expectation?
Expected Value Example
Example:
Suppose you get $10 if heads is flipped, or lose $3 if tails is flipped.
What are your winnings on expectation?
X is the outcome of the coin flip, 1 for heads and 0 for tails

f(x) = { 10   if x = 1
       { −3   if x = 0

Y = f(X) is a new random variable

𝔼[Y] = 𝔼[f(X)] = ∑_{x∈𝒳} f(x)p(x) = f(0)p(0) + f(1)p(1) = 0.5 × (−3) + 0.5 × 10 = 3.5
x∈𝒳
One More Example
Suppose X is the outcome of a die roll

f(x) = { −1   if x ≤ 3
       {  1   if x ≥ 4

Y = f(X) is a new random variable. We see Y = −1 each time we observe 1, 2 or 3.
We see Y = 1 each time we observe 4, 5, or 6.

𝔼[Y] = 𝔼[f(X)] = ∑_{x∈𝒳} f(x)p(x)

= (−1)(p(X = 1) + p(X = 2) + p(X = 3))

+ (1)(p(X = 4) + p(X = 5) + p(X = 6))


One More Example (cont)
Suppose X is the outcome of a die roll

f(x) = { −1   if x ≤ 3
       {  1   if x ≥ 4

Y = f(X) is a new random variable. We see Y = −1 each time we observe 1, 2 or 3.
We see Y = 1 each time we observe 4, 5, or 6. Note that
p(Y = −1) = p(X = 1) + p(X = 2) + p(X = 3) = 0.5
p(Y = 1) = p(X = 4) + p(X = 5) + p(X = 6) = 0.5

𝔼[Y] = 𝔼[f(X)] = ∑_{x∈𝒳} f(x)p(x) = ∑_{y∈{−1,1}} y p(y)
= (−1)(p(X = 1) + p(X = 2) + p(X = 3)) + (1)(p(X = 4) + p(X = 5) + p(X = 6))
= −1(0.5) + 1(0.5) = 0

Summing over x with p(x) is equivalent, and simpler (no need to infer p(y)).
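A tiny sketch of both computations in Python, confirming they agree:

    from fractions import Fraction

    p_x = {x: Fraction(1, 6) for x in range(1, 7)}   # fair die
    f = lambda x: -1 if x <= 3 else 1

    # sum over x with p(x)
    e_fx = sum(f(x) * p for x, p in p_x.items())

    # sum over y with the induced distribution p(y)
    p_y = {-1: Fraction(1, 2), 1: Fraction(1, 2)}
    e_y = sum(y * p for y, p in p_y.items())

    print(e_fx, e_y)   # both 0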
Expected Value is a Lossy Summary

[Figure: two different distributions P(X) over X ∈ {1, 2, 3, 4, 5}; both have 𝔼[X] = 3, but one has 𝔼[X²] ≃ 10 and the other 𝔼[X²] ≃ 12]
Conditional Expectations

Definition:
The expected value of Y conditional on X = x is

𝔼[Y ∣ X = x] = { ∑_{y∈𝒴} y p(y ∣ x)     if Y is discrete,
               { ∫_𝒴 y p(y ∣ x) dy      if Y is continuous.

Question: What is 𝔼[Y ∣ X]?
Conditional Expectation Example
• X is the type of a book, 0 for fiction and 1 for non-fiction
• p(X = 1) is the proportion of all books that are non-fiction
• Y is the number of pages
• p(Y = 100) is the proportion of all books with 100 pages
• 𝔼[Y | X = 0] is different from 𝔼[Y | X = 1]
• e.g. 𝔼[Y | X = 0] = 70 is different from 𝔼[Y | X = 1] = 150

• Another example: 𝔼[X | Z = 0.3] the expected outcome of the coin flip
given that the bias is 0.3 (𝔼[X | Z = 0.3] = 0 × 0.7 + 1 × 0.3 = 0.3)
Conditional Expectation Example (cont)
• What do we mean by p(y ∣ X = 0)? How might it differ from p(y ∣ X = 1)?

[Figure: p(y) for X = 0 (fiction books): lots of shorter books and lots of medium-length books; p(y) for X = 1 (nonfiction books): a long tail, with a few very long books]
Conditional Expectation Example (cont)
• What do we mean by p(y ∣ X = 0)? How might it differ from p(y ∣ X = 1)?

• 𝔼[Y | X = 0] is the expectation over Y under distribution p(y | X = 0)


• 𝔼[Y | X = 1] is the expectation over Y under distribution p(y | X = 1)
Conditional Expectations (revisited)

Definition:
The expected value of Y conditional on X = x is

𝔼[Y ∣ X = x] = { ∑_{y∈𝒴} y p(y ∣ x)     if Y is discrete,
               { ∫_𝒴 y p(y ∣ x) dy      if Y is continuous.

Question: What is 𝔼[Y ∣ X]?

Answer: Z = 𝔼[Y ∣ X] is a random variable; z = 𝔼[Y ∣ X = x] is an outcome.
Properties of Expectations
• Linearity of expectation:
• 𝔼[cX] = c𝔼[X] for all constant c
• 𝔼[X + Y] = 𝔼[X] + 𝔼[Y]
• Products of expectations of independent
random variables X, Y:
• 𝔼[XY] = 𝔼[X]𝔼[Y]
• Law of Total Expectation:

• 𝔼 [𝔼 [Y ∣ X]] = 𝔼[Y]

• Question: How would you prove these?


Linearity of Expectation

𝔼[X + Y] = ∑_{(x,y)∈𝒳×𝒴} p(x, y)(x + y)

= ∑_{y∈𝒴} ∑_{x∈𝒳} p(x, y)(x + y)

= ∑_{y∈𝒴} ∑_{x∈𝒳} p(x, y)x + ∑_{y∈𝒴} ∑_{x∈𝒳} p(x, y)y

= 𝔼[X] + 𝔼[Y]

where the last step uses, for the first term (the second is symmetric):

∑_{y∈𝒴} ∑_{x∈𝒳} p(x, y)x = ∑_{x∈𝒳} x ∑_{y∈𝒴} p(x, y)     ▹ p(x) = ∑_{y∈𝒴} p(x, y)
= ∑_{x∈𝒳} x p(x)
= 𝔼[X]
What if the RVs are continuous?

𝔼[X + Y] = ∫_{𝒳×𝒴} p(x, y)(x + y) d(x, y)

= ∫_𝒴 ∫_𝒳 p(x, y)(x + y) dx dy

= ∫_𝒴 ∫_𝒳 p(x, y)x dx dy + ∫_𝒴 ∫_𝒳 p(x, y)y dx dy

= ∫_𝒳 x ∫_𝒴 p(x, y) dy dx + ∫_𝒴 y ∫_𝒳 p(x, y) dx dy

= ∫_𝒳 x p(x) dx + ∫_𝒴 y p(y) dy

= 𝔼[X] + 𝔼[Y]
Properties of Expectations: Proof of the Law of Total Expectation

𝔼[Y] = ∑_{y∈𝒴} y p(y)                          (def. of 𝔼[Y])

= ∑_{y∈𝒴} y ∑_{x∈𝒳} p(x, y)                    (def. of the marginal distribution)

= ∑_{x∈𝒳} ∑_{y∈𝒴} y p(x, y)                    (rearrange sums)

= ∑_{x∈𝒳} ∑_{y∈𝒴} y p(y ∣ x) p(x)              (chain rule)

= ∑_{x∈𝒳} ( ∑_{y∈𝒴} y p(y ∣ x) ) p(x)

= ∑_{x∈𝒳} 𝔼[Y ∣ X = x] p(x)                    (def. of 𝔼[Y ∣ X = x])

= 𝔼[𝔼[Y ∣ X]] ∎                                (def. of the expected value of a function)
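A quick numerical sanity check of this law in Python, reusing the age/arthritis joint table from earlier (numpy assumed):

    import numpy as np

    joint = np.array([[50/100, 1/100],
                      [10/100, 39/100]])   # rows: x in {0,1}; cols: y in {0,1}
    ys = np.array([0, 1])

    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)

    e_y = (ys * p_y).sum()                         # E[Y] directly

    # E[Y | X = x] for each x, then average over p(x)
    e_y_given_x = (joint * ys).sum(axis=1) / p_x
    print(e_y, (e_y_given_x * p_x).sum())          # both 0.4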
Variance
Definition: The variance of a random variable is

Var(X) = 𝔼[(X − 𝔼[X])²],

i.e., 𝔼[f(X)] where f(x) = (x − 𝔼[X])².

Equivalently,

Var(X) = 𝔼[X²] − (𝔼[X])²

(why?)
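One way to see the equivalence: expand the square and use linearity of expectation, writing μ = 𝔼[X] (a constant):

Var(X) = 𝔼[(X − μ)²]
= 𝔼[X² − 2μX + μ²]
= 𝔼[X²] − 2μ𝔼[X] + μ²     (linearity; μ is a constant)
= 𝔼[X²] − 2μ² + μ²
= 𝔼[X²] − (𝔼[X])²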
Covariance
Definition: The covariance of two random variables is

Cov(X, Y) = 𝔼[(X − 𝔼[X])(Y − 𝔼[Y])]
= 𝔼[XY] − 𝔼[X]𝔼[Y].

Question: What is the range of Cov(X, Y)?


Correlation
Definition: The correlation of two random variables is

Corr(X, Y) = Cov(X, Y) / √(Var(X)Var(Y))

Question: What is the range of Corr(X, Y)?

hint: Var(X) = Cov(X, X)
Properties of Variances

• Var[c] = 0 for constant c
• Var[cX] = c² Var[X] for constant c
• Var[X + Y] = Var[X] + Var[Y] + 2Cov[X, Y]
• For independent X, Y:
Var[X + Y] = Var[X] + Var[Y] (why?)
Independence and Decorrelation
• Independent RVs have zero correlation (why?)

hint: Cov[X, Y] = 𝔼[XY] − 𝔼[X]𝔼[Y]


• Uncorrelated RVs (i.e., Cov(X, Y) = 0) might be dependent
(i.e., p(x, y) ≠ p(x)p(y)).
• Correlation (Pearson's correlation coefficient) shows linear relationships, but can
miss nonlinear relationships
• Example: X ∼ Uniform{−2, −1, 0, 1, 2}, Y = X²
• 𝔼[XY] = 0.2(−2 × 4) + 0.2(2 × 4) + 0.2(−1 × 1) + 0.2(1 × 1) + 0.2(0 × 0) = 0
• 𝔼[X] = 0
• So Cov(X, Y) = 𝔼[XY] − 𝔼[X]𝔼[Y] = 0 − 0 ⋅ 𝔼[Y] = 0, even though Y is a deterministic function of X
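A short check in Python that the covariance is zero even though Y is completely determined by X:

    import numpy as np

    xs = np.array([-2, -1, 0, 1, 2])   # X ~ Uniform over these five values
    ys = xs ** 2                        # Y = X^2, fully dependent on X

    e_x = xs.mean()                     # each value has probability 0.2
    e_y = ys.mean()
    cov = (xs * ys).mean() - e_x * e_y

    print(cov)   # 0.0: uncorrelated, yet knowing X pins down Y exactly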
Summary
• Random variables are functions from the sample space to some value
• Upshot: A random variable takes different values with some probability
• The value of one variable can be informative about the value of another
(because they are both functions of the same sample)
• Distributions of multiple random variables are described by the joint probability
distribution (joint PMF or joint PDF)
• You can have a new distribution over one variable when you condition on the other
• The expected value of a random variable is an average over its values, weighted by the
probability of each value
• The variance of a random variable is the expected squared distance from the mean
• The covariance and correlation of two random variables can summarize how changes in
one are informative about changes in the other.
Exercise applying your knowledge
• Let’s revisit the commuting example, and assume we collect continuous
commute times
• We want to model commute time as a Gaussian:

p(x) = (1 / √(2πσ²)) exp( −(x − μ)² / (2σ²) )

• What parameters do I have to specify (or learn) to model commute times


with a Gaussian?
• Is a Gaussian a good choice?
Exercise applying your knowledge

• A better choice is actually what is called a Gamma distribution


Exercise applying your knowledge

• We can also consider conditional distributions p(y | x)

• Y is the commute time, let X be the month


• Why is it useful to know p(y | X = Feb) and p(y | X = Sept)?
• What else could we use for X and why pick it?
Exercise applying your knowledge

• Let's use a simple X, where it is 1 if it is slippery out and 0 otherwise

• Then we could model two Gaussians, one for each of the two conditions:

p(y ∣ X = 0) = N(μ_0, σ_0²)
p(y ∣ X = 1) = N(μ_1, σ_1²)

where N denotes a Gaussian distribution.
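A minimal sketch of fitting such a conditional model from data, assuming paired observations (x_i, y_i) with binary x; the data values and variable names here are illustrative, not from the course:

    import numpy as np

    # hypothetical data: x = 1 if slippery, y = commute time in minutes
    x = np.array([0, 0, 1, 1, 0, 1, 0, 1])
    y = np.array([12.0, 11.5, 18.2, 20.1, 13.1, 17.5, 12.4, 19.0])

    # fit one Gaussian per condition: sample mean and variance of each group
    for value in (0, 1):
        group = y[x == value]
        mu, sigma2 = group.mean(), group.var()
        print(f"p(y | X={value}) is approximately N({mu:.2f}, {sigma2:.2f})")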
