
Probability Review

Rob Hall

September 9, 2010
What is Probability?

- Probability reasons about a sample, knowing the population.
- The goal of statistics is to estimate the population based on a sample.
- Both provide invaluable tools to modern machine learning.
Plan

- Facts about sets (to get our brains in gear).
- Definitions and facts about probabilities.
- Random variables and joint distributions.
- Characteristics of distributions (mean, variance, entropy).
- Some asymptotic results (a "high level" perspective).

Goals: get some intuition about probability, learn how to formulate a simple proof, and lay out some useful identities for use as a reference.
Non-goal: supplant an entire semester-long course in probability.
Set Basics

A set is just a collection of elements, denoted e.g.
S = {s1, s2, s3},  R = {r : some condition holds on r}.

- Intersection: the elements that are in both sets:
  A ∩ B = {x : x ∈ A and x ∈ B}
- Union: the elements that are in either set, or both:
  A ∪ B = {x : x ∈ A or x ∈ B}
- Complementation: all the elements that aren't in the set:
  Aᶜ = {x : x ∉ A}

[Venn diagrams: Aᶜ; A ∩ B; A ∪ B]
Properties of Set Operations

- Commutativity: A ∪ B = B ∪ A.
- Associativity: A ∪ (B ∪ C) = (A ∪ B) ∪ C.
- Likewise for intersection.
- Proof? Follows easily from the commutative and associative properties of "and" and "or" in the definitions.
- Distributive properties: A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C) and A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C).
- Proof? Show each side of the equality contains the other.
- De Morgan's laws: ...see book.
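These identities are easy to spot-check numerically; a minimal sketch using Python's built-in set type (the particular sets A, B, C are arbitrary choices, not from the slides):

```python
# Spot-check the distributive properties on concrete (arbitrary) sets.
A = {1, 2, 3}
B = {2, 3, 4}
C = {3, 4, 5}

# A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
assert A & (B | C) == (A & B) | (A & C)
# A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)
assert A | (B & C) == (A | B) & (A | C)
```

Of course this checks only one instance; the proof sketch above covers all sets.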
Disjointness and Partitions

- A sequence of sets A1, A2, ... is called pairwise disjoint or mutually exclusive if for all i ≠ j, Ai ∩ Aj = {}.
- If the sequence is pairwise disjoint and ∪_{i=1}^∞ Ai = S, then the sequence forms a partition of S.

Partitions are useful in probability theory and in life:

B ∩ S = B ∩ (∪_{i=1}^∞ Ai)    (def. of partition)
      = ∪_{i=1}^∞ (B ∩ Ai)    (distributive property)

Note that the sets B ∩ Ai are also pairwise disjoint (proof?).

If S is the whole space, what have we constructed?
Probability Terminology

Name                 What it is              Common symbols   What it means
Sample Space         Set                     Ω, S             "Possible outcomes."
Event Space          Collection of subsets   F, E             "The things that have probabilities."
Probability Measure  Measure                 P, π             Assigns probabilities to events.
Probability Space    A triple (Ω, F, P)

Remark: we may consider the event space to be the power set of the sample space (for a discrete sample space; more later). e.g., rolling a fair die:

Ω = {1, 2, 3, 4, 5, 6}
F = 2^Ω = {{1}, {2}, ..., {1, 2}, ..., {1, 2, 3}, ..., {1, 2, 3, 4, 5, 6}, {}}
P({1}) = P({2}) = ... = 1/6 (i.e., a fair die)
P({1, 3, 5}) = 1/2 (i.e., half chance of an odd result)
P({1, 2, 3, 4, 5, 6}) = 1 (i.e., the result is "almost surely" one of the faces).
Axioms for Probability

A set of conditions imposed on probability measures (due to Kolmogorov):

- P(A) ≥ 0, for all A ∈ F
- P(Ω) = 1
- P(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai), where {Ai}_{i=1}^∞ ∈ F are pairwise disjoint.

These quickly lead to:

- P(Aᶜ) = 1 − P(A) (since P(A) + P(Aᶜ) = P(A ∪ Aᶜ) = P(Ω) = 1).
- P(A) ≤ 1 (since P(Aᶜ) ≥ 0).
- P({}) = 0 (since P(Ω) = 1).
P(A ∪ B) – General Unions

Recall that A, Aᶜ form a partition of Ω:

B = B ∩ Ω = B ∩ (A ∪ Aᶜ) = (B ∩ A) ∪ (B ∩ Aᶜ)

And so: P(B) = P(B ∩ A) + P(B ∩ Aᶜ).
For a general partition this is called the "law of total probability."

[Venn diagram: A, A ∩ B, B]

P(A ∪ B) = P(A ∪ (B ∩ Aᶜ))
         = P(A) + P(B ∩ Aᶜ)
         = P(A) + P(B) − P(B ∩ A)
         ≤ P(A) + P(B)

There is a very important difference between disjoint and non-disjoint unions.
The same idea yields the so-called "union bound," aka Boole's inequality.
Conditional Probabilities

For events A, B ∈ F with P(B) > 0, we may write the conditional probability of A given B:

P(A|B) = P(A ∩ B) / P(B)

[Venn diagram: A, A ∩ B, B]

Interpretation: the outcome is definitely in B, so treat B as the entire sample space and find the probability that the outcome is also in A.

This rapidly leads to P(A|B)P(B) = P(A ∩ B), aka the "chain rule for probabilities." (why?)

When A1, A2, ... are a partition of Ω:

P(B) = Σ_{i=1}^∞ P(B ∩ Ai) = Σ_{i=1}^∞ P(B|Ai)P(Ai)

This is also referred to as the "law of total probability."


Conditional Probability Example

Suppose we throw a fair die:
Ω = {1, 2, 3, 4, 5, 6}, F = 2^Ω, P({i}) = 1/6, i = 1 ... 6
A = {1, 2, 3, 4} i.e., "result is less than 5,"
B = {1, 3, 5} i.e., "result is odd."

P(A) = 2/3
P(B) = 1/2

P(A|B) = P(A ∩ B) / P(B)          P(B|A) = P(A ∩ B) / P(A)
       = P({1, 3}) / P(B)                = 1/2
       = 2/3

Note that in general P(A|B) ≠ P(B|A); however, we may quantify their relationship.
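The worked example can also be checked by brute-force enumeration of outcomes; a small sketch with exact fractions:

```python
from fractions import Fraction

# Fair die: probabilities by counting outcomes.
omega = {1, 2, 3, 4, 5, 6}
A = {1, 2, 3, 4}   # "result is less than 5"
B = {1, 3, 5}      # "result is odd"

def P(event):
    return Fraction(len(event & omega), len(omega))

assert P(A) == Fraction(2, 3)
assert P(B) == Fraction(1, 2)
assert P(A & B) / P(B) == Fraction(2, 3)  # P(A|B)
assert P(A & B) / P(A) == Fraction(1, 2)  # P(B|A)
```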
Bayes' Rule

Using the chain rule we may see:

P(A|B)P(B) = P(A ∩ B) = P(B|A)P(A)

Rearranging this yields Bayes' rule:

P(B|A) = P(A|B)P(B) / P(A)

Often this is written as:

P(Bi|A) = P(A|Bi)P(Bi) / Σ_i P(A|Bi)P(Bi)

where the Bi are a partition of Ω (note the denominator is just the law of total probability).
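As a sketch of the partition form, here is Bayes' rule on a tiny two-block partition; the priors and likelihoods below are made-up illustrative numbers, not from the slides:

```python
from fractions import Fraction

# Made-up example: a partition B1, B2 of Ω with priors P(Bi)
# and likelihoods P(A|Bi).
prior = {"B1": Fraction(1, 4), "B2": Fraction(3, 4)}
lik   = {"B1": Fraction(4, 5), "B2": Fraction(1, 5)}  # P(A | Bi)

# Denominator: law of total probability, P(A) = sum_i P(A|Bi) P(Bi).
p_a = sum(lik[b] * prior[b] for b in prior)
posterior = {b: lik[b] * prior[b] / p_a for b in prior}

assert p_a == Fraction(7, 20)
assert posterior["B1"] == Fraction(4, 7)
assert sum(posterior.values()) == 1  # posteriors form a distribution
```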
Independence

Two events A, B are called independent if P(A ∩ B) = P(A)P(B).
When P(A) > 0 this may be written P(B|A) = P(B) (why?)
e.g., rolling two dice, flipping n coins, etc.

Two events A, B are called conditionally independent given C when P(A ∩ B|C) = P(A|C)P(B|C).
When P(A) > 0 we may write P(B|A, C) = P(B|C).
e.g., "the weather tomorrow is independent of the weather yesterday, knowing the weather today."
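A quick check of the definition on the two-dice example; the specific events below are illustrative choices:

```python
from fractions import Fraction
from itertools import product

# Two fair dice: "first die is even" and "second die shows 6"
# are independent, exactly as the definition predicts.
omega = list(product(range(1, 7), repeat=2))

def P(pred):
    return Fraction(sum(1 for w in omega if pred(w)), len(omega))

A = lambda w: w[0] % 2 == 0  # first die even
B = lambda w: w[1] == 6      # second die is a six

assert P(A) == Fraction(1, 2)
assert P(B) == Fraction(1, 6)
assert P(lambda w: A(w) and B(w)) == P(A) * P(B)  # P(A ∩ B) = P(A)P(B)
```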
Random Variables – caution: hand waving

A random variable is a function X : Ω → R^d. e.g.,

- Roll some dice, X = sum of the numbers.
- Indicators of events: X(ω) = 1_A(ω). e.g., toss a coin, X = 1 if it came up heads, 0 otherwise. Note the relationship between the set-theoretic constructions and binary RVs.
- Give a few monkeys a typewriter, X = fraction of overlap with the complete works of Shakespeare.
- Throw a dart at a board, X ∈ R² are the coordinates which are hit.
Distributions

- By considering random variables, we may think of probability measures as functions on the real numbers.
- Then, the probability measure associated with the RV is completely characterized by its cumulative distribution function (CDF): F_X(x) = P(X ≤ x).
- If two RVs have the same CDF we call them identically distributed.
- We say X ∼ F_X or X ∼ f_X (f_X coming soon) to indicate that X has the distribution specified by F_X (resp. f_X).

[Figures: a step-function CDF of a discrete RV and a continuous CDF; in both, F_X(x) rises from 0.0 to 1.0.]
Discrete Distributions

- If X takes on only a countable number of values, then we may characterize it by a probability mass function (PMF), which describes the probability of each value: f_X(x) = P(X = x).
- We have Σ_x f_X(x) = 1 (why?) – since each ω maps to one x, and P(Ω) = 1.
- e.g., a general discrete PMF: f_X(xi) = θi, Σ_i θi = 1, θi ≥ 0.
- e.g., the Bernoulli distribution: X ∈ {0, 1}, f_X(x) = θ^x (1 − θ)^(1−x)
  - A general model of binary outcomes (coin flips etc.).
Discrete Distributions

- Rather than specifying each probability for each event, we may consider a more restrictive parametric form, which will be easier to specify and manipulate (but sometimes less general).
- e.g., the multinomial distribution:
  X ∈ N^d, Σ_{i=1}^d xi = n, f_X(x) = (n! / (x1! x2! ··· xd!)) Π_{i=1}^d θi^{xi}.
- Sometimes used in text processing (dimensions correspond to words, n is the length of a document).
- What have we lost in going from a general form to a multinomial?
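The multinomial PMF above can be transcribed directly using only the standard library; the values of θ, n, and d below are arbitrary illustrative choices:

```python
from math import factorial, prod, isclose

def multinomial_pmf(x, theta):
    # f_X(x) = (n! / (x1! ... xd!)) * prod_i theta_i^{x_i}
    n = sum(x)
    coef = factorial(n) // prod(factorial(xi) for xi in x)
    return coef * prod(t ** xi for t, xi in zip(theta, x))

theta = [0.2, 0.3, 0.5]
# All outcomes with d = 3 dimensions and n = 2 trials; the PMF must sum to 1.
outcomes = [(i, j, 2 - i - j) for i in range(3) for j in range(3 - i)]
assert isclose(sum(multinomial_pmf(x, theta) for x in outcomes), 1.0)
```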
Continuous Distributions

- When the CDF is continuous we may consider its derivative f_X(x) = (d/dx) F_X(x).
- This is called the probability density function (PDF).
- The probability of an interval (a, b) is given by P(a < X < b) = ∫_a^b f_X(x) dx.
- The probability of any specific point c is zero: P(X = c) = 0 (why?).
- e.g., the uniform distribution: f_X(x) = (1/(b − a)) · 1_(a,b)(x)
- e.g., the Gaussian aka "normal": f_X(x) = (1/(√(2π)σ)) exp{−(x − µ)²/(2σ²)}
- Note that both families give probabilities for every interval on the real line, yet are specified by only two numbers.

[Figure: the standard normal density curve on (−4, 4), peaking at 0.4 near x = 0.]
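As a sketch of the interval formula, one can numerically integrate the standard normal PDF (µ = 0, σ = 1) and compare against the closed form from the error function; the interval (−1, 1) is an arbitrary choice:

```python
from math import exp, pi, sqrt, erf, isclose

def phi(x):
    # Standard normal PDF.
    return exp(-x * x / 2) / sqrt(2 * pi)

def integrate(f, a, b, steps=100_000):
    # Midpoint rule: crude but plenty accurate here.
    h = (b - a) / steps
    return sum(f(a + (k + 0.5) * h) for k in range(steps)) * h

a, b = -1.0, 1.0
exact = 0.5 * (erf(b / sqrt(2)) - erf(a / sqrt(2)))  # P(a < X < b), ~0.6827
assert isclose(integrate(phi, a, b), exact, rel_tol=1e-6)
```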
Multiple Random Variables

We may consider multiple functions of the same sample space, e.g., X(ω) = 1_A(ω), Y(ω) = 1_B(ω).

We may represent the joint distribution as a table:

        X=0    X=1
Y=0     0.25   0.15
Y=1     0.35   0.25

We write the joint PMF or PDF as f_{X,Y}(x, y).
Multiple Random Variables

Two random variables are called independent when the joint PDF factorizes:

f_{X,Y}(x, y) = f_X(x) f_Y(y)

When RVs are independent and identically distributed this is usually abbreviated to "i.i.d."

Relationship to independent events: X, Y are independent iff {ω : X(ω) ≤ x}, {ω : Y(ω) ≤ y} are independent events for all x, y.
Working with a Joint Distribution

We have similar constructions as we did in abstract probability spaces:

- Marginalizing: f_X(x) = ∫_Y f_{X,Y}(x, y) dy.
  Similar idea to the law of total probability (identical for a discrete distribution).
- Conditioning: f_{X|Y}(x, y) = f_{X,Y}(x, y) / f_Y(y) = f_{X,Y}(x, y) / ∫ f_{X,Y}(x, y) dx.
  Similar to the previous definition.

Old?  Blood pressure?  Heart attack?  P
0     0                0              0.22
0     0                1              0.01
0     1                0              0.15
0     1                1              0.01
1     0                0              0.18
...   ...              ...            ...

How to compute P(heart attack | old)?
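For the discrete case, marginalizing and conditioning are just sums over the joint table; a sketch using the 2×2 table from the "Multiple Random Variables" slide:

```python
# Joint PMF of binary X, Y from the earlier slide: joint[(x, y)] = f_{X,Y}(x, y).
joint = {(0, 0): 0.25, (1, 0): 0.15, (0, 1): 0.35, (1, 1): 0.25}

# Marginal of X: sum the joint over y (discrete analogue of the integral).
f_x = {x: sum(p for (xx, y), p in joint.items() if xx == x) for x in (0, 1)}
assert abs(f_x[0] - 0.60) < 1e-9 and abs(f_x[1] - 0.40) < 1e-9

# Conditional f_{X|Y}(x, 1): renormalize the y = 1 row by f_Y(1).
f_y1 = sum(p for (x, y), p in joint.items() if y == 1)
f_x_given_y1 = {x: joint[(x, 1)] / f_y1 for x in (0, 1)}
assert abs(sum(f_x_given_y1.values()) - 1.0) < 1e-9
```

The heart-attack question works the same way: restrict to the "old = 1" rows and renormalize.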
Characteristics of Distributions

We may consider the expectation (or "mean") of a distribution:

E(X) = Σ_x x f_X(x)             if X is discrete
E(X) = ∫_{−∞}^{∞} x f_X(x) dx   if X is continuous

Expectation is linear:

E(aX + bY + c) = Σ_{x,y} (ax + by + c) f_{X,Y}(x, y)
              = Σ_{x,y} ax f_{X,Y}(x, y) + Σ_{x,y} by f_{X,Y}(x, y) + Σ_{x,y} c f_{X,Y}(x, y)
              = a Σ_{x,y} x f_{X,Y}(x, y) + b Σ_{x,y} y f_{X,Y}(x, y) + c Σ_{x,y} f_{X,Y}(x, y)
              = a Σ_x x Σ_y f_{X,Y}(x, y) + b Σ_y y Σ_x f_{X,Y}(x, y) + c
              = a Σ_x x f_X(x) + b Σ_y y f_Y(y) + c
              = a E(X) + b E(Y) + c
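The derivation above can be verified numerically on the joint table from the earlier slide; the constants a, b, c below are arbitrary:

```python
# Joint PMF from the "Multiple Random Variables" slide.
joint = {(0, 0): 0.25, (1, 0): 0.15, (0, 1): 0.35, (1, 1): 0.25}
a, b, c = 2.0, -3.0, 0.5  # arbitrary constants

# Left-hand side: expectation of aX + bY + c under the joint.
E_lhs = sum((a * x + b * y + c) * p for (x, y), p in joint.items())

# Right-hand side: a E(X) + b E(Y) + c via the marginals.
EX = sum(x * p for (x, y), p in joint.items())
EY = sum(y * p for (x, y), p in joint.items())

assert abs(E_lhs - (a * EX + b * EY + c)) < 1e-12
```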
Characteristics of Distributions

Questions:

1. E[EX] = Σ_x (EX) f_X(x) = (EX) Σ_x f_X(x) = EX
2. E(X · Y) = E(X)E(Y)?
   Not in general, although when f_{X,Y} = f_X f_Y:
   E(X · Y) = Σ_{x,y} xy f_X(x) f_Y(y) = Σ_x x f_X(x) Σ_y y f_Y(y) = EX · EY
Characteristics of Distributions

We may consider the variance of a distribution:

Var(X) = E(X − EX)²

This may give an idea of how "spread out" a distribution is.
A useful alternate form is:

E(X − EX)² = E[X² − 2X E(X) + (EX)²]
           = E(X²) − 2E(X)E(X) + (EX)²
           = E(X²) − (EX)²

Variance of a coin toss?
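To answer the question: for a Bernoulli(θ) coin toss the identity gives Var(X) = θ − θ² = θ(1 − θ); a numeric check (θ = 0.3 is an arbitrary choice):

```python
theta = 0.3  # arbitrary coin bias
pmf = {0: 1 - theta, 1: theta}

EX  = sum(x * p for x, p in pmf.items())      # = theta
EX2 = sum(x * x * p for x, p in pmf.items())  # = theta, since 0² = 0 and 1² = 1
var = EX2 - EX ** 2                           # Var(X) = E(X²) − (EX)²

assert abs(var - theta * (1 - theta)) < 1e-12
```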


Characteristics of Distributions

Variance is non-linear, but the following holds:

Var(aX) = E(aX − E(aX))² = E(aX − aEX)² = a² E(X − EX)² = a² Var(X)

Var(X + c) = E(X + c − E(X + c))² = E(X − EX + c − c)² = E(X − EX)² = Var(X)

Var(X + Y) = E(X − EX + Y − EY)²
           = E(X − EX)² + E(Y − EY)² + 2 E[(X − EX)(Y − EY)]
           = Var(X) + Var(Y) + 2 Cov(X, Y)

So when X, Y are independent we have:

Var(X + Y) = Var(X) + Var(Y)

(why?)
Putting it all together

Say we have X1 ... Xn i.i.d., where E(Xi) = µ and Var(Xi) = σ².
We want to know the expectation and variance of X̄n = (1/n) Σ_{i=1}^n Xi.

E(X̄n) = E[(1/n) Σ_{i=1}^n Xi] = (1/n) Σ_{i=1}^n E(Xi) = (1/n) nµ = µ

Var(X̄n) = Var((1/n) Σ_{i=1}^n Xi) = (1/n²) nσ² = σ²/n
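A simulation sketch of these two facts for fair-die rolls (µ = 3.5, σ² = 35/12); the sample size, replicate count, and seed are arbitrary choices:

```python
import random
from statistics import mean, pvariance

random.seed(0)
n, reps = 100, 2000

# Each replicate is the sample mean of n fair-die rolls.
means = [mean(random.randint(1, 6) for _ in range(n)) for _ in range(reps)]

sigma2 = 35 / 12  # variance of one fair die
assert abs(mean(means) - 3.5) < 0.05          # E(X̄n) ≈ µ
assert abs(pvariance(means) - sigma2 / n) < 0.01  # Var(X̄n) ≈ σ²/n
```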
Entropy of a Distribution

Entropy is a measure of uniformity in a distribution:

H(X) = − Σ_x f_X(x) log₂ f_X(x)

Imagine you had to transmit a sample from f_X, so you construct the optimal encoding scheme:

[Figure: a prefix-code tree.]

Entropy gives the mean depth in the tree (= mean number of bits).
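A minimal entropy computation; the fair die attains log₂6 ≈ 2.585 bits, the maximum over six outcomes, while a deterministic outcome carries no information:

```python
from math import log2, isclose

def entropy(pmf):
    # H(X) = -sum_x f(x) log2 f(x); zero-probability terms contribute 0.
    return -sum(p * log2(p) for p in pmf if p > 0)

assert isclose(entropy([1 / 6] * 6), log2(6))  # fair die: ~2.585 bits
assert isclose(entropy([0.5, 0.5]), 1.0)       # fair coin: 1 bit
assert entropy([1.0]) == 0.0                   # no uncertainty
```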
Law of Large Numbers (LLN)

Recall our variable X̄n = (1/n) Σ_{i=1}^n Xi.
We may wonder about its behavior as n → ∞.
We had: E(X̄n) = µ, Var(X̄n) = σ²/n.

The distribution appears to be "contracting": as n increases, the variance goes to 0.

Using Chebyshev's inequality:

P(|X̄n − µ| ≥ ε) ≤ σ²/(nε²) → 0

for any fixed ε, as n → ∞.
Law of Large Numbers (LLN)

Recall our variable X̄n = (1/n) Σ_{i=1}^n Xi.
We may wonder about its behavior as n → ∞.

The weak law of large numbers:

lim_{n→∞} P(|X̄n − µ| < ε) = 1

In English: choose ε and a probability that |X̄n − µ| < ε, and I can find you an n so your probability is achieved.

The strong law of large numbers:

P(lim_{n→∞} X̄n = µ) = 1

In English: the mean converges to the expectation "almost surely" as n increases.

There are two different versions, each holding under different conditions, but i.i.d. and finite variance is enough for either.
Central Limit Theorem (CLT)

The distribution of X̄n also converges weakly to a Gaussian:

lim_{n→∞} F_{X̄n}(x) = Φ((x − µ) / (σ/√n))

Simulated n dice rolls and took the average, 5000 times:

[Figure: four histograms of the simulated averages for n = 1, 2, 10, 75; the density narrows around 3.5 and approaches a Gaussian shape as n grows.]

Two kinds of convergence went into this picture (why 5000?):

1. The true distribution converges to a Gaussian (CLT).
2. The empirical distribution converges to the true distribution (Glivenko-Cantelli).
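A simulation in the spirit of the figure: if the Gaussian approximation is good, the sample mean should fall within one standard deviation of µ roughly 68% of the time. The values of n, the replicate count, and the seed below are arbitrary choices:

```python
import random
from statistics import mean

random.seed(1)
n, reps = 30, 5000
mu = 3.5
sd = (35 / 12) ** 0.5 / n ** 0.5  # sd of the mean = sigma / sqrt(n)

means = [mean(random.randint(1, 6) for _ in range(n)) for _ in range(reps)]
within_1sd = sum(1 for m in means if abs(m - mu) <= sd) / reps

# Gaussian predicts ~68.3%; allow slack for sampling and discretization.
assert abs(within_1sd - 0.6827) < 0.03
```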
Asymptotics Opinion

Ideas like these are crucial to machine learning:

- We want to minimize error on a whole population (e.g., classify text documents as well as possible).
- We minimize error on a training set of size n.
- What happens as n → ∞?
- How does the complexity of the model, or the dimension of the problem, affect convergence?
