
CMPUT 267 Basics of Machine Learning

Winter 2024
January 16 2024
Announcements
• Please read FAQ document on course webpage.
• Course information at https://nidhihegde.github.io/mlbasics
• Assignment due dates
• TA Office hours - updated
• Participation - Reading Exercises
• on eClass;
• open for a 48 hour period; one hour to complete
• the first one is a practice run
• Second one open Monday, closes (due) Tuesday 11:59 pm
Outline

1. Recap

2. Random Variables

3. Multiple Random Variables

4. Independence

5. Expectations and Moments


Recap
• Probabilities are a means of quantifying uncertainty
• The probability space models an experiment, or a real world process.

• The sample space Ω : the set of all possible outcomes of the experiment.

• The event space ℰ : ℰ ⊆ 𝒫(Ω), the collection of events (sets of outcomes) to which probabilities are assigned.
• A probability distribution is defined on a measurable space consisting of a sample space
and an event space: any function P : ℰ → [0,1] that is a probability measure.
• Discrete sample spaces (and random variables) are defined in terms of probability mass
functions (PMFs)
• Continuous sample spaces (and random variables) are defined in terms of probability
density functions (PDFs)
Discrete vs. Continuous Sample Spaces

Discrete (countable) outcomes:
• Ω = {1,2,3,4,5,6}
• Ω = {person, woman, man, camera, TV, …}
• Ω = ℕ
• ℰ = {∅, {1,2}, {3,4,5,6}, {1,2,3,4,5,6}}
• Typically: ℰ = 𝒫(Ω)

Continuous (uncountable) outcomes:
• Ω = [0,1]
• Ω = ℝ
• Ω = ℝ^k
• ℰ = {∅, [0,0.5], (0.5,1.0], [0,1]}
• Typically: ℰ = B(Ω) ("Borel field"); note: not 𝒫(Ω)

Question: Is ℰ = {{1}, {2}, {3}, {4}, {5}, {6}} a valid event space?
Random Variables
Rather than referring to the probability space, we refer to probabilities on
quantities of interest.

Example: Suppose we observe both a die's number, and where it lands.

Ω = {(left,1), (right,1), (left,2), (right,2), …, (right,6)}


We might want to think about the probability that we get a large number, without
thinking about where it landed.

We could ask about P(X ≥ 4), where X = the number that comes up.
Random variables are a way of reasoning about a complicated underlying
probability space in a more straightforward way.
Random Variables, Formally
Given a probability space (Ω, ℰ, P), a random variable is a function
X : Ω → ΩX (where ΩX is some other outcome space), satisfying
{ω ∈ Ω ∣ X(ω) ∈ A} ∈ ℰ ∀A ∈ B(ΩX).
It follows that PX(A) = P({ω ∈ Ω ∣ X(ω) ∈ A}).
Example: Let Ω be a population of people, and X(ω) = height, and
A = [5′1′′,5′2′′].
P(X ∈ A) = P(5′1′′ ≤ X ≤ 5′2′′) = P({ω ∈ Ω : X(ω) ∈ A}).
Random Variables and Events

• A Boolean expression involving random variables defines an event:


E.g., P(X ≥ 4) = P({ω ∈ Ω ∣ X(ω) ≥ 4})
• Similarly, every event can be understood as a Boolean random variable:

Y = { 1 if event A occurred
    { 0 otherwise.

• From this point onwards, we will exclusively reason in terms of random


variables rather than probability spaces.
Example: Histograms
Consider the continuous commuting example again, with observations 12.345
minutes, 11.78213 minutes, etc.

[Figure: normalized histogram of observed commute times (t from 4 to 24 minutes), overlaid with a fitted Gamma(31.3, 0.352) density; densities range up to about 0.25]

• Question: What is the random variable?


• Question: How could we turn our observations into a histogram?
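One way to answer the second question, sketched in Python (the observations here are made-up; numpy and matplotlib are assumed available):

    import numpy as np
    import matplotlib.pyplot as plt

    # hypothetical observed commute times, in minutes
    times = np.array([12.345, 11.78213, 14.2, 9.8, 16.1, 13.0, 10.5, 12.9])

    # bin the observations; density=True normalizes so the bar areas sum to 1,
    # making the histogram comparable to a fitted density like Gamma(31.3, 0.352)
    counts, bin_edges = np.histogram(times, bins=6, density=True)

    plt.bar(bin_edges[:-1], counts, width=np.diff(bin_edges), align='edge')
    plt.xlabel('commute time t (minutes)')
    plt.ylabel('estimated density')
    plt.show()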
What About Multiple Variables?
• So far, we've really been thinking about a single random variable at a time
• Straightforward to define multiple random variables on a single probability space

Example: Suppose we observe both a die's number, and where it lands.

Ω = {(left,1), (right,1), (left,2), (right,2), …, (right,6)}


X(ω) = ω_2 = the number that comes up

Y(ω) = { 1 if ω_1 = left      (i.e., Y = 1 if the die landed on the left)
       { 0 otherwise.

P(Y = 1) = P({ω ∣ Y(ω) = 1})


P(X ≥ 4 ∧ Y = 1) = P({ω ∣ X(ω) ≥ 4 ∧ Y(ω) = 1})
Joint Distribution
We typically model the interactions of different random variables.

Joint probability mass function: p(x, y) = P(X = x, Y = y)

∑_{x∈𝒳} ∑_{y∈𝒴} p(x, y) = 1

Example: 𝒳 = {0,1} (young, old) and 𝒴 = {0,1} (no arthritis, arthritis)

            Y = 0                    Y = 1
X = 0       P(X=0, Y=0) = 1/2        P(X=0, Y=1) = 1/100
X = 1       P(X=1, Y=0) = 1/10       P(X=1, Y=1) = 39/100
Is this joint distribution valid?
Visualizing the Table
Example: 𝒳 = {0,1} (young, old) and 𝒴 = {0,1} (no arthritis, arthritis)

            Y = 0                      Y = 1
X = 0       P(X=0, Y=0) = 50/100       P(X=0, Y=1) = 1/100
X = 1       P(X=1, Y=0) = 10/100       P(X=1, Y=1) = 39/100

Exercise: Check that ∑_{x∈{0,1}} ∑_{y∈{0,1}} p(x, y) = 1:

∑_{x∈{0,1}} ∑_{y∈{0,1}} p(x, y) = 1/2 + 1/100 + 1/10 + 39/100 = 1 ✓
Questions About Multiple Variables
Example: 𝒳 = {0,1} (young, old) and 𝒴 = {0,1} (no arthritis, arthritis)

            Y = 0                    Y = 1
X = 0       P(X=0, Y=0) = 1/2        P(X=0, Y=1) = 1/100
X = 1       P(X=1, Y=0) = 1/10       P(X=1, Y=1) = 39/100

• Are these two variables related at all? Or do they change independently?


• Given this distribution, can we determine the distribution over just Y?
I.e., what is P(Y = 1)? (marginal distribution)
• If we knew something about one variable, does that tell us something about the distribution
over the other? E.g., if I know X = 0 (the person is young), does that tell me the probability
that this young person has arthritis? (conditional probability P(Y = 1 ∣ X = 0))
Marginal Distribution for Y
p(Y = 0) = ∑_{x∈𝒳} p(x, 0) = ∑_{x∈{young,old}} p(x, 0)
p(Y = 1) = ∑_{x∈𝒳} p(x, 1) = ∑_{x∈{young,old}} p(x, 1)

            Y = 0                    Y = 1
X = 0       P(X=0, Y=0) = 1/2        P(X=0, Y=1) = 1/100
X = 1       P(X=1, Y=0) = 1/10       P(X=1, Y=1) = 39/100

More generically:

p(y) = ∑_{x∈𝒳} p(x, y)
Back to our example
Example: 𝒳 = {0,1} (young, old) and 𝒴 = {0,1} (no arthritis, arthritis)

            Y = 0                      Y = 1
X = 0       P(X=0, Y=0) = 50/100       P(X=0, Y=1) = 1/100
X = 1       P(X=1, Y=0) = 10/100       P(X=1, Y=1) = 39/100

Exercise: Compute the marginal p(x) = ∑_{y∈{0,1}} p(x, y)
Back to our example (cont)
Example: 𝒳 = {0,1} (young, old) and 𝒴 = {0,1} (no arthritis, arthritis)

            Y = 0                            Y = 1
X = 0       P(X=0, Y=0) = 50/100 = 1/2       P(X=0, Y=1) = 1/100
X = 1       P(X=1, Y=0) = 10/100 = 1/10      P(X=1, Y=1) = 39/100

Exercise: Compute the marginal p(x = 1) = ∑_{y∈{0,1}} p(x = 1, y) = 49/100,
p(x = 0) = 1 − p(x = 1) = 51/100
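A small sketch of these computations in Python (the table values are from the example; only numpy is assumed):

    import numpy as np

    # joint PMF p(x, y): rows index X (0=young, 1=old),
    # columns index Y (0=no arthritis, 1=arthritis)
    joint = np.array([[50/100, 1/100],
                      [10/100, 39/100]])

    assert np.isclose(joint.sum(), 1.0)   # a valid joint distribution sums to 1

    p_x = joint.sum(axis=1)   # marginalize out Y: p(x) = sum over y of p(x, y)
    p_y = joint.sum(axis=0)   # marginalize out X: p(y) = sum over x of p(x, y)

    print(p_x)   # [0.51 0.49]
    print(p_y)   # [0.6 0.4]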
Marginal distributions
• For two random variables X, Y:

If both are discrete, we have p(x) = ∑_{y∈𝒴} p(x, y)

If both are continuous, we have p(x) = ∫_𝒴 p(x, y) dy

If X is discrete and Y is continuous, then p(x) = ∫_𝒴 p(x, y) dy

If X is continuous and Y is discrete, then p(x) = ∑_{y∈𝒴} p(x, y)
Marginal Distributions
A marginal distribution is defined for a subset of X⃗ = (X_1, …, X_d) by summing or integrating
out the remaining variables. (We will often say that we are "marginalizing over"
or "marginalizing out" the remaining variables.)

Discrete case: p(x_i) = ∑_{x_1∈𝒳_1} ⋯ ∑_{x_{i−1}∈𝒳_{i−1}} ∑_{x_{i+1}∈𝒳_{i+1}} ⋯ ∑_{x_d∈𝒳_d} p(x_1, …, x_d)

Continuous case: p(x_i) = ∫_{𝒳_1} ⋯ ∫_{𝒳_{i−1}} ∫_{𝒳_{i+1}} ⋯ ∫_{𝒳_d} p(x_1, …, x_d) dx_1 … dx_{i−1} dx_{i+1} … dx_d

Question: Why do we use the same symbol p for both p(x_i) and p(x_1, …, x_d)?


• They can't be the same function, they have different domains!
Are these really the same function?

• No. They're not the same function.


• But they are derived from the same joint distribution.
• So for brevity we will write p(x, y), p(x) and p(y)
• Even though it would be more precise to write something like
p(x, y), px(x) and py(y)
• We can tell which function we're talking about from context (i.e., arguments)
PMFs and PDFs of Many Variables
In general, we can consider a d-dimensional random variable X⃗ = (X_1, …, X_d) with vector-valued outcomes x⃗ = (x_1, …, x_d), with each x_i chosen from some 𝒳_i. Then,

Discrete case:
p : 𝒳_1 × 𝒳_2 × … × 𝒳_d → [0,1] is a (joint) probability mass function if

∑_{x_1∈𝒳_1} ∑_{x_2∈𝒳_2} ⋯ ∑_{x_d∈𝒳_d} p(x_1, x_2, …, x_d) = 1

Continuous case:
p : 𝒳_1 × 𝒳_2 × … × 𝒳_d → [0,∞) is a (joint) probability density function if

∫_{𝒳_1} ∫_{𝒳_2} ⋯ ∫_{𝒳_d} p(x_1, x_2, …, x_d) dx_1 dx_2 … dx_d = 1
The Rules of Probability Already Cover the Multidimensional Case
The outcome space is 𝒳 = 𝒳_1 × 𝒳_2 × … × 𝒳_d
Outcomes are multidimensional variables x = [x_1, x_2, …, x_d]

Discrete case:
p : 𝒳 → [0,1] is a (joint) probability mass function if ∑_{x∈𝒳} p(x) = 1

Continuous case:
p : 𝒳 → [0,∞) is a (joint) probability density function if ∫_𝒳 p(x) dx = 1

But it is useful to recognize that we have multiple variables.
Conditional Distribution
Definition: Conditional probability distribution
P(Y = y ∣ X = x) = P(X = x, Y = y) / P(X = x)

This same equation holds for the corresponding PDF or PMF:

p(y ∣ x) = p(x, y) / p(x)
Question: if p(x, y) is small, does that imply that p(y ∣ x) is small?
Visualizing the Conditional Distribution

P(X = young ∣ Y = 0) = P(X = young, Y = 0)/P(Y = 0) = (50/100)/(60/100) = 50/60
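The same calculation in Python, reusing the joint table sketched above:

    import numpy as np

    joint = np.array([[50/100, 1/100],
                      [10/100, 39/100]])   # rows: X (0=young, 1=old); cols: Y

    p_y = joint.sum(axis=0)               # marginal p(y)
    p_x_given_y0 = joint[:, 0] / p_y[0]   # p(x | Y=0) = p(x, Y=0) / p(Y=0)

    print(p_x_given_y0[0])                # P(X=young | Y=0) = 50/60 ≈ 0.833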


Chain Rule

From the definition of conditional probability:


p(y ∣ x) = p(x, y) / p(x)

⟺ p(y ∣ x)p(x) = (p(x, y) / p(x)) p(x)

⟺ p(y ∣ x)p(x) = p(x, y)

This is called the Chain Rule.


Multiple Variable Chain Rule
The chain rule generalizes to multiple variables:

p(x, y, z) = p(x, y ∣ z)p(z) = p(x ∣ y, z)p(y ∣ z)p(z)

(note that p(y ∣ z)p(z) = p(y, z))

Definition: Chain rule

p(x_1, …, x_d) = p(x_d) ∏_{i=1}^{d−1} p(x_i ∣ x_{i+1}, …, x_d)

              = p(x_1) ∏_{i=2}^{d} p(x_i ∣ x_1, …, x_{i−1})
The Order Does Not Matter
The RVs are not ordered, so we can write
p(x, y, z) = p(x ∣ y, z)p(y | z)p(z)
= p(x ∣ y, z)p(z | y)p(y)
= p(y ∣ x, z)p(x | z)p(z)
= p(y ∣ x, z)p(z | x)p(x)
= p(z ∣ x, y)p(y | x)p(x)
= p(z ∣ x, y)p(x | y)p(y)
All of these probabilities are equal
In short: the order of factorization does not matter.
Bayes' Rule
From the chain rule, we have:
p(x, y) = p(y ∣ x)p(x)
= p(x ∣ y)p(y)
• Often, p(x ∣ y) is easier to compute than p(y ∣ x)
• e.g., where x is features and y is label

Definition: Bayes' rule


p(y ∣ x) = p(x ∣ y)p(y) / p(x)
Bayes' Rule
• Bayes’ rule is typically used to reason about our beliefs, given new
information
• Example: a scientist might have a belief about the prevalence of cancer in
smokers (Y), and update with new evidence (X)
• In ML: we have a belief over our estimator (Y), and we update with new
data that is like new evidence (X)

Definition: Bayes' rule

p(y ∣ x) = p(x ∣ y)p(y) / p(x)

posterior = likelihood × prior / evidence
Example: Drug Test

Recall Bayes' rule: p(y ∣ x) = p(x ∣ y)p(y) / p(x)   (posterior = likelihood × prior / evidence)

Given:
p(Test = pos ∣ Drug = T) = 0.99
p(Test = pos ∣ Drug = F) = 0.01
p(Drug = T) = 0.005

Questions:
1. What is p(Drug = F)?
2. What is p(Drug = T ∣ Test = pos)?

Mapping to the formula, let
X be Test
Y be presence of the drug
Example: Drug Test (continued)

Question 1: p(Drug = F) = 1 − p(Drug = T) = 1 − 0.005 = 0.995
Question 2: By Bayes' rule,

p(Drug = T ∣ Test = pos) = p(Test = pos ∣ Drug = T)p(Drug = T) / p(Test = pos)

We still need to compute the denominator, p(Test = pos).
By the law of total probability,

p(Test = pos) = ∑_{d∈{T,F}} p(Test = pos, Drug = d)
= p(Test = pos, Drug = F) + p(Test = pos, Drug = T)
= p(Test = pos ∣ Drug = F)p(Drug = F) + p(Test = pos ∣ Drug = T)p(Drug = T)
= 0.01 × 0.995 + 0.99 × 0.005 = 0.0149
Putting it together, with p(Test = pos) = 0.0149:

p(Drug = T ∣ Test = pos) = p(Test = pos ∣ Drug = T)p(Drug = T) / p(Test = pos)
= (0.99 × 0.005) / 0.0149 ≈ 0.332

Even with an accurate test, a positive result leaves only about a one-in-three chance that the drug is present, because the prior p(Drug = T) is so small.
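A direct transcription of this calculation in Python (plain arithmetic, no libraries):

    # givens
    p_pos_given_drug = 0.99
    p_pos_given_clean = 0.01
    p_drug = 0.005
    p_clean = 1 - p_drug                  # question 1: 0.995

    # evidence, via the law of total probability
    p_pos = p_pos_given_clean * p_clean + p_pos_given_drug * p_drug   # 0.0149

    # Bayes' rule: posterior = likelihood * prior / evidence
    p_drug_given_pos = p_pos_given_drug * p_drug / p_pos
    print(p_drug_given_pos)               # approximately 0.332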
Independence of Random Variables

Definition: X and Y are independent if:


p(x, y) = p(x)p(y)
X and Y are conditionally independent given Z if:
p(x, y ∣ z) = p(x ∣ z)p(y ∣ z)
Example: Coins
(Ex. 9 in the course text)
• Suppose you have a biased coin: the probability that it comes up heads is not
0.5. Instead, it comes up heads with some other, possibly unknown, probability.
• Let Z be the bias of the coin, with 𝒵 = {0.3,0.5,0.8} and probabilities
P(Z = 0.3) = 0.7, P(Z = 0.5) = 0.2 and P(Z = 0.8) = 0.1.
• Question: What other outcome space could we consider?
• Question: What kind of distribution is this?
• Question: What other kinds of distribution could we consider?
• Let X and Y be two consecutive flips of the coin
• Question: Are X and Y independent?
• Question: Are X and Y conditionally independent given Z?
Example: Coins (2)

• Now imagine I told you Z = 0.3 (i.e., probability of heads is 0.3)


• Let X and Y be two consecutive flips of the coin
• What is P(X = Heads | Z = 0.3)? What about P(X = Tails | Z = 0.3)?
• What is P(Y = Heads | Z = 0.3)? What about P(Y = Tails | Z = 0.3)?
• Is P(X = x, Y = y | Z = 0.3) = P(X = x | Z = 0.3)P(Y = y | Z = 0.3)?
Example: Coins (3)
• Now imagine we do not know Z
• e.g., you randomly grabbed it from a bin of coins with probabilities
P(Z = 0.3) = 0.7, P(Z = 0.5) = 0.2 and P(Z = 0.8) = 0.1
• What is P(X = Heads)?


P(X = Heads) = ∑_{z∈{0.3,0.5,0.8}} P(X = Heads ∣ Z = z)P(Z = z)
= P(X = Heads ∣ Z = 0.3)P(Z = 0.3)
+ P(X = Heads ∣ Z = 0.5)P(Z = 0.5)
+ P(X = Heads ∣ Z = 0.8)P(Z = 0.8)
= 0.3 × 0.7 + 0.5 × 0.2 + 0.8 × 0.1 = 0.39
Example: Coins (4)
• Now imagine we do not know Z
• e.g., you randomly grabbed it from a bin of coins with probabilities
P(Z = 0.3) = 0.7, P(Z = 0.5) = 0.2 and P(Z = 0.8) = 0.1
• Is P(X = Heads, Y = Heads) = P(X = Heads)p(Y = Heads)?
• For brevity, let's use h for Heads

P(X = h, Y = h) = ∑_{z∈{0.3,0.5,0.8}} P(X = h, Y = h ∣ Z = z)P(Z = z)

= ∑_{z∈{0.3,0.5,0.8}} P(X = h ∣ Z = z)P(Y = h ∣ Z = z)P(Z = z)

(using conditional independence given Z in the second step)
Example: Coins (4, cont.)
• P(Z = 0.3) = 0.7, P(Z = 0.5) = 0.2 and P(Z = 0.8) = 0.1
• Is P(X = Heads, Y = Heads) = P(X = Heads)p(Y = Heads)?


P(X = h, Y = h) = ∑_{z∈{0.3,0.5,0.8}} P(X = h, Y = h ∣ Z = z)P(Z = z)
= ∑_{z∈{0.3,0.5,0.8}} P(X = h ∣ Z = z)P(Y = h ∣ Z = z)P(Z = z)
= P(X = h ∣ Z = 0.3)P(Y = h ∣ Z = 0.3)P(Z = 0.3)
+ P(X = h ∣ Z = 0.5)P(Y = h ∣ Z = 0.5)P(Z = 0.5)
+ P(X = h ∣ Z = 0.8)P(Y = h ∣ Z = 0.8)P(Z = 0.8)
= 0.3 × 0.3 × 0.7 + 0.5 × 0.5 × 0.2 + 0.8 × 0.8 × 0.1
= 0.177 ≠ 0.39 × 0.39 = 0.1521

So X and Y are not independent.
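A quick numerical check of both quantities in Python (a direct transcription of the sums above):

    biases = {0.3: 0.7, 0.5: 0.2, 0.8: 0.1}   # P(Z = z) for each possible bias z

    # marginal P(X = h), marginalizing out the unknown bias Z
    p_h = sum(z * p_z for z, p_z in biases.items())        # 0.39

    # joint P(X = h, Y = h), using conditional independence given Z
    p_hh = sum(z * z * p_z for z, p_z in biases.items())   # 0.177

    print(p_hh, p_h * p_h)   # 0.177 vs 0.1521: X and Y are dependent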
Example: Coins (5)
• Let Z be the bias of the coin, with 𝒵 = {0.3,0.5,0.8} and probabilities
P(Z = 0.3) = 0.7, P(Z = 0.5) = 0.2 and P(Z = 0.8) = 0.1.
• Let X and Y be two consecutive flips of the coin
• Question: Are X and Y conditionally independent given Z?
• i.e., P(X = x, Y = y | Z = z) = P(X = x | Z = z)P(Y = y | Z = z)
• Question: Are X and Y independent?
• i.e. P(X = x, Y = y) = P(X = x)P(Y = y)
The Distribution Changes Based on
What We Know
• The coin has some true bias z

• If we know that bias, we reason about P(X = x | Z = z)


• Namely, the probability of x given we know the bias is z
• If we do not know that bias, then from our perspective the coin
outcomes follow the probabilities P(X = x)
• The world still flips the coin with bias z
• Conditional independence is a property of the distribution we are reasoning
about, not an objective truth about outcomes
A bit more intuition

• If we do not know that bias, then from our perspective the coin
outcomes follow the probabilities P(X = x, Y = y)
• and X and Y are correlated

• If we know X= h, do we think it’s more likely Y = h? i.e., is


P(X = h, Y = h) > P(X = h, Y = t)?
Why Are Independence and
Conditional Independence Important?
• i.e., how is this relevant?
• Let’s imagine you want to infer (or learn) the bias of the coin, from data
• data in this case corresponds to a sequence of flips X1, X2, …, Xn

• You can ask: P(Z = z | X1 = H, X2 = H, X3 = T, …, Xn = H)


[Figure: the prior p(z) over {0.3, 0.5, 0.8}, and the posterior p(z ∣ data) after seeing 10 Heads and 2 Tails]
More uses for independence
and conditional independence
• If I told you X = roof type was independent of Y = house price, would you
use X as a feature to predict Y?
• Imagine you want to predict Y = Has Lung Cancer and you have an indirect
correlation with X = Location since in Location 1 more people smoke on
average. If you could measure Z = Smokes, then X and Y would be
conditionally independent given Z.
• Suggests you could look for such causal variables, that explain these
correlations
• We will see the utility of conditional independence for learning models
Expected Value

The expected value of a random variable is the weighted average of that


variable over its domain.

Definition: Expected value of a random variable


𝔼[X] = { ∑_{x∈𝒳} x p(x)     if X is discrete
       { ∫_𝒳 x p(x) dx      if X is continuous.
Relationship to Population Average
and Sample Average
• Or Population Mean and Sample Mean
• Population Mean = Expected Value, Sample Mean estimates this number
• e.g., Population Mean = average height of the entire population
• For RV X = height, p(x) gives the probability that a randomly selected person
has height x
• Sample average: you randomly sample n heights from the population
• implicitly you are sampling heights proportionally to p
• As n gets bigger, the sample average approaches the true expected value
Connection to Sample Average
• Imagine we have a biased coin, p(x = 1) = 0.75, p(x = 0) = 0.25
• Imagine we flip this coin 1000 times, and see (x = 1) 700 times
• The sample average is
(1/1000) ∑_{i=1}^{1000} x_i = (1/1000) ( ∑_{i : x_i=0} x_i + ∑_{i : x_i=1} x_i ) = 0 × (300/1000) + 1 × (700/1000) = 0 × 0.3 + 1 × 0.7 = 0.7

• The true expected value is

∑_{x∈{0,1}} p(x) x = 0 × p(x = 0) + 1 × p(x = 1) = 0 × 0.25 + 1 × 0.75 = 0.75
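A small simulation sketch in Python showing the sample average approaching the expected value (numpy assumed):

    import numpy as np

    rng = np.random.default_rng(0)

    flips = rng.random(1000) < 0.75       # biased coin: p(x = 1) = 0.75
    print(flips.mean())                   # sample average, close to 0.75

    more_flips = rng.random(1_000_000) < 0.75
    print(more_flips.mean())              # closer still with more flips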
Expected Value with Functions
The expected value of a function f : 𝒳 → ℝ of a random variable is the
weighted average of that function's value over the domain of the variable.

Definition: Expected value of a function of a random variable


𝔼[f(X)] = { ∑_{x∈𝒳} f(x)p(x)     if X is discrete
          { ∫_𝒳 f(x)p(x) dx      if X is continuous.

Example:
Suppose you get $10 if heads is flipped, or lose $3 if tails is flipped.
What are your winnings on expectation?
Expected Value Example
Example:
Suppose you get $10 if heads is flipped, or lose $3 if tails is flipped.
What are your winnings on expectation?
X is the outcome of the coin flip, 1 for heads and 0 for tails

f(x) = { 10   if x = 1
       { −3   if x = 0

Y = f(X) is a new random variable

𝔼[Y] = 𝔼[f(X)] = ∑_{x∈𝒳} f(x)p(x) = f(0)p(0) + f(1)p(1) = 0.5 × (−3) + 0.5 × 10 = 3.5
x∈𝒳
One More Example
Suppose X is the outcome of a die roll

f(x) = { −1   if x ≤ 3
       {  1   if x ≥ 4

Y = f(X) is a new random variable. We see Y = −1 each time we observe 1, 2 or 3.
We see Y = 1 each time we observe 4, 5, or 6.

𝔼[Y] = 𝔼[f(X)] = ∑_{x∈𝒳} f(x)p(x)

= (−1)(p(X = 1) + p(X = 2) + p(X = 3))

+ (1)(p(X = 4) + p(X = 5) + p(X = 6))


One More Example (cont)
Suppose X is the outcome of a die roll

f(x) = { −1   if x ≤ 3
       {  1   if x ≥ 4

Y = f(X) is a new random variable. We see Y = −1 each time we observe 1, 2 or 3.
We see Y = 1 each time we observe 4, 5, or 6. Note that
p(Y = −1) = p(X = 1) + p(X = 2) + p(X = 3) = 0.5
p(Y = 1) = p(X = 4) + p(X = 5) + p(X = 6) = 0.5

𝔼[Y] = 𝔼[f(X)] = ∑_{x∈𝒳} f(x)p(x) = ∑_{y∈{−1,1}} y p(y)
= (−1)(p(X = 1) + p(X = 2) + p(X = 3)) + (1)(p(X = 4) + p(X = 5) + p(X = 6))
= −1(0.5) + 1(0.5) = 0

Summing over x with p(x) is equivalent, and simpler (no need to infer p(y)).
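A tiny sketch of both computations in Python, confirming they agree:

    from fractions import Fraction

    p_x = {x: Fraction(1, 6) for x in range(1, 7)}   # fair die
    f = lambda x: -1 if x <= 3 else 1

    # sum over x with p(x)
    e_fx = sum(f(x) * p for x, p in p_x.items())

    # sum over y with the induced distribution p(y)
    p_y = {-1: Fraction(1, 2), 1: Fraction(1, 2)}
    e_y = sum(y * p for y, p in p_y.items())

    print(e_fx, e_y)   # both 0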
Expected Value is a Lossy Summary

[Figure: two different distributions P(X) over X ∈ {1, 2, 3, 4, 5}; both have 𝔼[X] = 3, but one has 𝔼[X²] ≃ 10 and the other 𝔼[X²] ≃ 12]
Conditional Expectations

Definition:
The expected value of Y conditional on X = x is

𝔼[Y ∣ X = x] = { ∑_{y∈𝒴} y p(y ∣ x)     if Y is discrete,
               { ∫_𝒴 y p(y ∣ x) dy      if Y is continuous.

Question: What is 𝔼[Y ∣ X]?
Conditional Expectation Example
• X is the type of a book, 0 for fiction and 1 for non-fiction
• p(X = 1) is the proportion of all books that are non-fiction
• Y is the number of pages
• p(Y = 100) is the proportion of all books with 100 pages
• 𝔼[Y | X = 0] is different from 𝔼[Y | X = 1]
• e.g. 𝔼[Y | X = 0] = 70 is different from 𝔼[Y | X = 1] = 150

• Another example: 𝔼[X | Z = 0.3] the expected outcome of the coin flip
given that the bias is 0.3 (𝔼[X | Z = 0.3] = 0 × 0.7 + 1 × 0.3 = 0.3)
Conditional Expectation Example (cont)
• What do we mean by p(y ∣ X = 0)? How might it differ from p(y ∣ X = 1)?

[Figure: p(y) for X = 0 (fiction books): lots of shorter books and lots of medium-length books; p(y) for X = 1 (nonfiction books): a long tail, with a few very long books]
Conditional Expectation Example (cont)
• What do we mean by p(y ∣ X = 0)? How might it differ from p(y ∣ X = 1)?

• 𝔼[Y | X = 0] is the expectation over Y under distribution p(y | X = 0)


• 𝔼[Y | X = 1] is the expectation over Y under distribution p(y | X = 1)
Conditional Expectations (revisited)

Definition:
The expected value of Y conditional on X = x is

𝔼[Y ∣ X = x] = { ∑_{y∈𝒴} y p(y ∣ x)     if Y is discrete,
               { ∫_𝒴 y p(y ∣ x) dy      if Y is continuous.

Question: What is 𝔼[Y ∣ X]?

Answer: Z = 𝔼[Y ∣ X] is a random variable; z = 𝔼[Y ∣ X = x] is an outcome.
Properties of Expectations
• Linearity of expectation:
• 𝔼[cX] = c𝔼[X] for all constant c
• 𝔼[X + Y] = 𝔼[X] + 𝔼[Y]
• Products of expectations of independent
random variables X, Y:
• 𝔼[XY] = 𝔼[X]𝔼[Y]
• Law of Total Expectation:

• 𝔼 [𝔼 [Y ∣ X]] = 𝔼[Y]

• Question: How would you prove these?


Linearity of Expectation

𝔼[X + Y] = ∑_{(x,y)∈𝒳×𝒴} p(x, y)(x + y)

= ∑_{y∈𝒴} ∑_{x∈𝒳} p(x, y)(x + y)

= ∑_{y∈𝒴} ∑_{x∈𝒳} p(x, y)x + ∑_{y∈𝒴} ∑_{x∈𝒳} p(x, y)y

= 𝔼[X] + 𝔼[Y]

where the last step uses, for the first term (the second is symmetric):

∑_{y∈𝒴} ∑_{x∈𝒳} p(x, y)x = ∑_{x∈𝒳} x ∑_{y∈𝒴} p(x, y)     ▹ p(x) = ∑_{y∈𝒴} p(x, y)
= ∑_{x∈𝒳} x p(x)
= 𝔼[X]
What if the RVs are continuous?

𝔼[X + Y] = ∫_{𝒳×𝒴} p(x, y)(x + y) d(x, y)

= ∫_𝒴 ∫_𝒳 p(x, y)(x + y) dx dy

= ∫_𝒴 ∫_𝒳 p(x, y)x dx dy + ∫_𝒴 ∫_𝒳 p(x, y)y dx dy

= ∫_𝒳 x ∫_𝒴 p(x, y) dy dx + ∫_𝒴 y ∫_𝒳 p(x, y) dx dy

= ∫_𝒳 x p(x) dx + ∫_𝒴 y p(y) dy

= 𝔼[X] + 𝔼[Y]
Properties of Expectations: Proof of the Law of Total Expectation

𝔼[Y] = ∑_{y∈𝒴} y p(y)                          (def. of 𝔼[Y])

= ∑_{y∈𝒴} y ∑_{x∈𝒳} p(x, y)                    (def. of the marginal distribution)

= ∑_{x∈𝒳} ∑_{y∈𝒴} y p(x, y)                    (rearrange sums)

= ∑_{x∈𝒳} ∑_{y∈𝒴} y p(y ∣ x) p(x)              (chain rule)

= ∑_{x∈𝒳} ( ∑_{y∈𝒴} y p(y ∣ x) ) p(x)

= ∑_{x∈𝒳} 𝔼[Y ∣ X = x] p(x)                    (def. of 𝔼[Y ∣ X = x])

= 𝔼[𝔼[Y ∣ X]] ∎                                (def. of the expected value of a function)
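A quick numerical sanity check of this law in Python, reusing the age/arthritis joint table from earlier (numpy assumed):

    import numpy as np

    joint = np.array([[50/100, 1/100],
                      [10/100, 39/100]])   # rows: x in {0,1}; cols: y in {0,1}
    ys = np.array([0, 1])

    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)

    e_y = (ys * p_y).sum()                         # E[Y] directly

    # E[Y | X = x] for each x, then average over p(x)
    e_y_given_x = (joint * ys).sum(axis=1) / p_x
    print(e_y, (e_y_given_x * p_x).sum())          # both 0.4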
Variance
Definition: The variance of a random variable is

Var(X) = 𝔼[(X − 𝔼[X])²],

i.e., 𝔼[f(X)] where f(x) = (x − 𝔼[X])².

Equivalently,

Var(X) = 𝔼[X²] − (𝔼[X])²

(why?)
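One way to see the equivalence: expand the square and use linearity of expectation, writing μ = 𝔼[X] (a constant):

Var(X) = 𝔼[(X − μ)²]
= 𝔼[X² − 2μX + μ²]
= 𝔼[X²] − 2μ𝔼[X] + μ²     (linearity; μ is a constant)
= 𝔼[X²] − 2μ² + μ²
= 𝔼[X²] − (𝔼[X])²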
Covariance
Definition: The covariance of two random variables is

Cov(X, Y) = 𝔼[(X − 𝔼[X])(Y − 𝔼[Y])]
= 𝔼[XY] − 𝔼[X]𝔼[Y].

Question: What is the range of Cov(X, Y)?


Correlation
Definition: The correlation of two random variables is

Corr(X, Y) = Cov(X, Y) / √(Var(X)Var(Y))

Question: What is the range of Corr(X, Y)?

hint: Var(X) = Cov(X, X)
Properties of Variances

• Var[c] = 0 for constant c
• Var[cX] = c² Var[X] for constant c
• Var[X + Y] = Var[X] + Var[Y] + 2Cov[X, Y]
• For independent X, Y:
Var[X + Y] = Var[X] + Var[Y] (why?)
Independence and Decorrelation
• Independent RVs have zero correlation (why?)

hint: Cov[X, Y] = 𝔼[XY] − 𝔼[X]𝔼[Y]


• Uncorrelated RVs (i.e., Cov(X, Y) = 0) might be dependent
(i.e., p(x, y) ≠ p(x)p(y)).
• Correlation (Pearson's correlation coefficient) shows linear relationships, but can
miss nonlinear relationships
• Example: X ∼ Uniform{−2, −1, 0, 1, 2}, Y = X²
• 𝔼[XY] = 0.2(−2 × 4) + 0.2(2 × 4) + 0.2(−1 × 1) + 0.2(1 × 1) + 0.2(0 × 0) = 0
• 𝔼[X] = 0
• So Cov(X, Y) = 𝔼[XY] − 𝔼[X]𝔼[Y] = 0 − 0 ⋅ 𝔼[Y] = 0, even though Y is a deterministic function of X
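A short check in Python that the covariance is zero even though Y is completely determined by X:

    import numpy as np

    xs = np.array([-2, -1, 0, 1, 2])   # X ~ Uniform over these five values
    ys = xs ** 2                        # Y = X^2, fully dependent on X

    e_x = xs.mean()                     # each value has probability 0.2
    e_y = ys.mean()
    cov = (xs * ys).mean() - e_x * e_y

    print(cov)   # 0.0: uncorrelated, yet knowing X pins down Y exactly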
Summary
• Random variables are functions from the sample space to some value
• Upshot: A random variable takes different values with some probability
• The value of one variable can be informative about the value of another
(because they are both functions of the same sample)
• Distributions of multiple random variables are described by the joint probability
distribution (joint PMF or joint PDF)
• You can have a new distribution over one variable when you condition on the other
• The expected value of a random variable is an average over its values, weighted by the
probability of each value
• The variance of a random variable is the expected squared distance from the mean
• The covariance and correlation of two random variables can summarize how changes in
one are informative about changes in the other.
Exercise applying your knowledge
• Let’s revisit the commuting example, and assume we collect continuous
commute times
• We want to model commute time as a Gaussian:

p(x) = (1 / √(2πσ²)) exp( −(x − μ)² / (2σ²) )

• What parameters do I have to specify (or learn) to model commute times


with a Gaussian?
• Is a Gaussian a good choice?
Exercise applying your knowledge

• A better choice is actually what is called a Gamma distribution


Exercise applying your knowledge

• We can also consider conditional distributions p(y | x)

• Y is the commute time, let X be the month


• Why is it useful to know p(y | X = Feb) and p(y | X = Sept)?
• What else could we use for X and why pick it?
Exercise applying your knowledge

• Let's use a simple X, where it is 1 if it is slippery out and 0 otherwise

• Then we could model two Gaussians, one for each of the two conditions:

p(y ∣ X = 0) = N(μ_0, σ_0²)
p(y ∣ X = 1) = N(μ_1, σ_1²)

where N denotes a Gaussian distribution.
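A minimal sketch of fitting such a conditional model from data, assuming paired observations (x_i, y_i) with binary x; the data values and variable names here are illustrative, not from the course:

    import numpy as np

    # hypothetical data: x = 1 if slippery, y = commute time in minutes
    x = np.array([0, 0, 1, 1, 0, 1, 0, 1])
    y = np.array([12.0, 11.5, 18.2, 20.1, 13.1, 17.5, 12.4, 19.0])

    # fit one Gaussian per condition: sample mean and variance of each group
    for value in (0, 1):
        group = y[x == value]
        mu, sigma2 = group.mean(), group.var()
        print(f"p(y | X={value}) is approximately N({mu:.2f}, {sigma2:.2f})")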
