Stat 210 Notes
Eric K. Zhang
[email protected]
Fall 2020
Abstract
These are notes for Harvard’s Statistics 210, a graduate-level probability class providing
foundational material for statistics PhD students, as taught by Joe Blitzstein in Fall 2020. The
course has a long history as a statistics requirement at Harvard. We will focus on probability
topics applicable to statistics, with a lesser focus on measure theory.
Course description: Random variables, measure theory, reasoning by representation.
Families of distributions: Multivariate Normal, conjugate, marginals, mixtures. Conditional
distributions and expectation. Convergence, laws of large numbers, central limit theorems, and
martingales.
Contents
1 September 3rd, 2020
1.1 Course Logistics
1.2 Breakout Puzzle: Random Walks
1.3 Representations of Distributions
7 September 24th, 2020
7.1 The Beta-Gamma Calculus
7.2 The Normal Distribution and Box-Muller
7.3 Order Statistics
18 November 5th, 2020
18.1 Major Tools in Asymptotics
18.2 Natural Exponential Families
1 September 3rd, 2020
We start with an overview of the course. The class has roughly 80 students, ranging from first-year
PhD students in statistics and other areas to advanced undergraduates. We will cover many aspects
of probability from a rigorous standpoint. Topics include:
• Central limit theorem, including variants that don’t assume i.i.d. variables.
The course material is drawn from Joe Blitzstein and Carl Morris’s forthcoming textbook, Prob-
ability for Statistical Science [BM20]. Key philosophies of the course: reasoning with conditional
distributions, and balancing “coin-flipping” intuition against analysis.
Exercise 1.1 (Simple symmetric random walk). Suppose that you have a simple, symmetric ran-
dom walk on the real line, moving either +1 or −1 on each step with independent probability 1/2. If
you start at 0, what is the expected number of times you reach 10^100 before returning to 0?
Proof. The answer is 1. Let b = 10^100, and we can proceed in either of a couple of ways:
• Let p be the probability that we reach b at least once before returning to 0.² Then, the
distribution of the number of visits N to b before returning is
$$[N \mid N \ge 1] \sim \mathrm{FS}(p),$$
the first success distribution, which has mean 1/p. Therefore E [N ] = P (N ≥ 1) · E [N | N ≥ 1] =
p · (1/p) = 1.
• Imagine that during your random walk, you decide to write down an infinite sequence of
letters: ‘A’ whenever you hit the number 0, and ‘B’ whenever you hit the number b. This
²We can actually compute that p = 1/(2 · 10^100) with martingales, but this is irrelevant.
creates some long string AAAAAABBBBAA . . .. By symmetry, we can start from the point
b/2 and generate this string. Since the random walk is memoryless, we simply want to know
the expected number of B’s we hit between any two adjacent A’s.
By symmetry, the expected number of A’s and B’s in any finite subsegment is equal. Since
all but finitely many B’s lie between a pair of A’s with probability 1, we have that
$$E[N] = \lim_{n \to \infty} \frac{\#(\text{B's among the first } n \text{ letters})}{\#(\text{A's among the first } n \text{ letters})} = 1.$$
Definition 1.1 (Weibull distribution). The Weibull distribution is given by the power X^c of an
exponential random variable X.
Joe notes that entire books have been written on the Weibull distribution. Here’s another
interesting distribution.
Definition 1.2 (Cauchy distribution). The Cauchy distribution has probability density function
$$C : p(x) = \frac{1}{\pi(1 + x^2)}.$$
Example 1.3. There are several interesting properties of the Cauchy distribution in terms of
representations by other distributions: for instance, C ∼ Z1 /Z2 where Z1 , Z2 are i.i.d. ∼ N (0, 1),
and C ∼ tan(2πU ) for U ∼ Unif (see Example 7.5).
Another neat fact is that if X, Y are i.i.d. ∼ N (0, 1), then X + Y is independent from X − Y !³
Finally, we’ll give some intuition for our forays into measure theory, starting next lecture.
Example 1.4 (Banach-Tarski Paradox). Assuming the axiom of choice, there exists a way to
decompose a 3-ball B³ into finitely many pieces that can be reassembled into two balls, each
congruent to the original. However, at least one of the pieces must not be measurable.
In some sense, measure theory allows you to rigorously define an intuitive concept of mass.
This can also help axiomatize concepts to get at the core of problems. We’ll see that measure
theory lets us unify many proofs for different distributions into a single general proof.
³This is a special property of the normal distribution, not a general fact.
2 September 8th, 2020
Today is our first real lecture, where we introduce measure theory and its applications to continuous
distributions.
Example 2.1. Suppose that we had a continuous random variable X varying uniformly on [0, 1].
Then, how can we calculate
Pr(X = 0 | X ∈ {0, 1})?
We would expect, intuitively, for the answer to be 1/2. However, if we naively apply the definition
of conditional probability, we get something like
$$\Pr(X = 0 \mid X \in \{0, 1\}) = \frac{\Pr(X = 0)}{\Pr(X = 0 \cup X = 1)} = \frac{0}{0}.$$
Similarly, we want “fundamental” laws of probability like Bayes’ Rule to be formalized over
continuous probability distributions like this. The core concept that will allow this to be possible
is called a σ-algebra.
Definition 2.2 (σ-algebra). Given a set X, a σ-algebra on X is a collection Σ ⊆ 2^X of subsets
satisfying:
• X ∈ Σ,
• If A ∈ Σ, then Ac = X \ A ∈ Σ,
• If A1 , A2 , . . . ∈ Σ, then A1 ∪ A2 ∪ · · · ∈ Σ.
Unlike a typical set algebra, which is a collection of subsets that is closed under finite unions
and intersections, a σ-algebra is closed under countable unions and intersections (hence the letter
σ). The important takeaway from this definition is that it’s fine enough to talk about probability
in a reasonable way, but coarse enough so that we don’t have Banach-Tarski and friends.
Now we can define the core concept of a probability measure.
Definition 2.3 (Probability measure). Let Ω be a set of samples, and F a σ-algebra on Ω, called
the events. A function P : F → [0, 1] is called a probability measure if it satisfies the following
axioms:
• (Countable additivity) If A1 , A2 , . . . ∈ F are pairwise disjoint, then P (A1 ∪ A2 ∪ · · · ) =
P (A1 ) + P (A2 ) + · · · .
• P (Ω) = 1.
Note. Since this isn’t a measure theory course (Math 114), we don’t usually care about measures
in general. A general measure space is defined the same way as a probability space, except we call it
(X, Σ, µ) instead of (Ω, F, P ) by convention, and we also do not require the last axiom P (Ω) = 1.
Indeed, probability measures are the special case where the total measure is 1.
Let’s do a couple of examples to visualize σ-algebras.
Example 2.4 (Finite σ-algebra). Suppose that you partition Ω into four disjoint subsets,
$$\Omega = A \sqcup B \sqcup C \sqcup D.$$
Then, the σ-algebra generated by {A, B, C, D} has 16 elements, and can be written as
F = {∅, A, B, C, D,
A ∪ B, A ∪ C, A ∪ D, B ∪ C, B ∪ D, C ∪ D,
A ∪ B ∪ C, A ∪ B ∪ D, A ∪ C ∪ D, B ∪ C ∪ D,
A ∪ B ∪ C ∪ D}.
It turns out that all finite σ-algebras basically look like this. They all have a power-of-two size,
and they consist of all subcollections of some finite collection of events.
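To make this concrete, here is a small Python sketch (our own illustration) that generates the σ-algebra from a four-part partition by taking all unions of parts, confirming the power-of-two count.

from itertools import combinations

# Parts of a partition of Omega = {1, 2, 3, 4}.
parts = [frozenset({1}), frozenset({2}), frozenset({3}), frozenset({4})]

sigma = set()
for r in range(len(parts) + 1):
    for combo in combinations(parts, r):
        sigma.add(frozenset().union(*combo))  # union of a subcollection of parts

print(len(sigma))  # 16 = 2**4 events, as claimed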
Essentially, finite σ-algebras are uninteresting because they’re too coarse, but it does help lend
some intuition for the general case. We can think of σ-algebras as offering some kind of information
about the events that we have observed so far. This lends itself to the following definition:
Definition 2.5 (Filtration). Given a probability space (Ω, F, P ), a filtration is a sequence of sub-
σ-algebras F1 , F2 , . . . where for all k ≤ ℓ,
$$\mathcal{F}_k \subseteq \mathcal{F}_\ell \subseteq \mathcal{F}.$$
Proposition 2.6. There is no countably infinite σ-algebra; every σ-algebra is either finite or
uncountable.
Proof. Suppose for contradiction that F is a countably infinite σ-algebra on X. For each x ∈ X,
define the atom Bx to be the intersection of all measurable sets containing x, which is measurable
since F is countable. In other words, Bx is the smallest measurable set containing x. We claim
that all distinct atoms are disjoint. Indeed, if Bx ∩ By ≠ ∅, then there exists some z ∈ Bx ∩ By ,
so Bz ⊆ Bx ∩ By . Assume for the sake of contradiction that x ∉ Bz . Then, Bx \ Bz is a measurable
set containing x, so by minimality Bx ⊆ Bx \ Bz , which implies z ∉ Bx , a contradiction. Therefore
x ∈ Bz , so Bx ⊆ Bz ⊆ Bx and hence Bx = Bz ; by symmetry y ∈ Bz gives By = Bz as well, so
Bx = By = Bz .
Finally, consider the set of all atoms {Bx }x∈X . If this set is finite, then F must be finite as
well, which is a contradiction. Therefore there must be at least countably many distinct atoms
B1 , B2 , . . .. We can define an injective map f : 2^ℕ → F by
$$f(\{n_1, n_2, \ldots\}) = B_{n_1} \cup B_{n_2} \cup \cdots,$$
so #(F) ≥ #(2^ℕ) = 2^{ℵ₀}, contradicting countability.
Now we know the axioms of probability, and everything starts from here!
3 September 10th, 2020
Last lecture, we broke off after defining the foundations of probability: two axioms that define
everything from basics to modern research. We won’t cover too much more about this, as that is
the topic of measure theory classes (Math 114, Statistics 212). Instead we’ll shift gears and start
defining higher-level concepts.
Proposition 3.1. If {Fi } is any collection of σ-algebras on Ω, then the intersection ∩i Fi is also
a σ-algebra on Ω.
Be careful! The above proposition does not work for unions of σ-algebras.
Definition 3.2 (σ-algebra generated by subsets). Given a collection of subsets A ⊆ 2^Ω , we define
the σ-algebra generated by A to be the smallest σ-algebra containing A, i.e.,
$$\sigma(\mathcal{A}) = \bigcap_{\substack{\mathcal{A} \subseteq \mathcal{F} \\ \mathcal{F} \text{ is a } \sigma\text{-algebra}}} \mathcal{F}.$$
With this machinery, we can now define the Borel σ-algebra on the real numbers.
Definition 3.3 (Borel sets). Consider the set of closed intervals [a, b] ⊂ R. The Borel sets are
members of the σ-algebra generated by closed intervals.
Note. We can actually construct a stratified Borel hierarchy as follows. Start from F0 , the set
of closed intervals in R. Then, let F1 be the collection of all sets formed as countable unions or
intersections of sets in F0 , or their complements. This is already very complex, but we can similarly
let F2 be the collection of all sets formed as countable unions, intersections, or complements of sets
in F1 . It turns out that F0 ⊊ F1 ⊊ F2 ⊊ · · · , and even the limit Fω is not a σ-algebra. You have
to keep going up to the first uncountable ordinal, and then you reach the Borel σ-algebra B = Fω₁ .
Definition 3.4 (Lebesgue measurable sets). These exist and are more general than the Borel sets,
but we won’t talk too much more about them.
Note. These definitions are really general, which raises the question: are there sets that are not
measurable? The answer is yes (assuming the axiom of choice), for example, the Vitali sets.
Definition 3.5 (Measurable function). Let (X, Σ) be a measurable space. A function f : X → R
is called measurable if, for every Borel set⁴ B ⊆ R,
$$f^{-1}(B) \in \Sigma.$$
⁴Technically, we can define this more generally for sets in the Lebesgue measure, but the difference is unimportant.
Note. For the rest of this course, we may implicitly assume that functions are measurable if not
specified. It is extremely difficult to construct a non-measurable function, and they almost never
occur in practice.
Definition 3.6 (Random variable). A random variable X is a measurable function X : Ω → R.
Random variables are so useful that we give them special notation. In particular, suppose that
you have a random variable X, and you want to know the probability that its value lies between 1
and 3. We could write this rigorously in terms of events, i.e.,
$$P(\{\omega \in \Omega : 1 \le X(\omega) \le 3\}) = P(X^{-1}([1, 3])).$$
However, this is a bit cumbersome, so we use the notation “X ∈ B” to mean the same thing as
X⁻¹(B). We can then write the above as
$$P(X \in [1, 3]).$$
Proposition 3.9 (Dynkin’s π-λ theorem). Call a collection of subsets P ⊆ 2Ω a π-system if it is
closed under set intersection. If two measures P , Q agree on a π-system P , then they also agree
on all subsets in σ(P ), the σ-algebra generated by P .
Proof. This involves some complicated analysis wizardry. See Section 2.10 of the book.
Corollary 3.9.1 (CDFs are all you need). Any distribution is uniquely determined by its cumulative
distribution function F (x) = P (X ≤ x).
4 September 15th, 2020
Today is our final day focused primarily on measure theory, before we move on to random variables
and representations.
Note. Also, Joe mentions that we should not be intimidated by his use of the words trivial or
obvious in class. These words indicate that the ideas are simple enough to not require further
justification once you understand them, not that you should feel bad if you don’t immediately see
the justification.
Proposition 4.1. A function X : Ω → R is a random variable (i.e., measurable) if and only if
X⁻¹((−∞, x]) ∈ F for every x ∈ R.
Proof. Let X : Ω → R be a function satisfying this condition. Let A be the collection of all Borel
sets whose preimages are measurable,
$$\mathcal{A} = \{B \in \mathcal{B} \mid X^{-1}(B) \in \mathcal{F}\}.$$
We know that (−∞, x] ∈ A for all x ∈ R. The key observation is that A is a σ-algebra, which we
can directly verify by checking the three properties and mapping them back to properties of F.
Since the intervals (−∞, x] generate B, we conclude that A = B.
Definition 4.2 (Random vector). A random vector is a collection of n random variables, which
may or may not be independent. You can also see it as a measurable function X : Ω → Rn , which
defines the joint distribution of these variables. The marginal distribution of each variable is simply
the composition of X with the projection map.
Marginal distributions give us some information, but this is lossy. Only the joint distribution
gives us the full story of a random vector.
Now let’s go back to talking about π-λ. What does the letter λ mean?
Definition 4.3 (λ-system). A collection of subsets L ⊆ 2^Ω is called a λ-system if it satisfies:
• (Whole set) Ω ∈ L,
• (Complements) If A ∈ L, then Ac ∈ L,
• (Disjoint unions) If A1 , A2 , . . . ∈ L are pairwise disjoint, then A1 ∪ A2 ∪ · · · ∈ L.
It turns out that there is only one example of a λ-system that we really care about.
Example 4.4. Let P1 , P2 be probability measures on (Ω, F). Let
L = {A ∈ F | P1 (A) = P2 (A)}.
Then, L is a λ-system.
The above example can be easily checked by verifying the axioms; Joe skips this justification
in class. Anyway, with this context, we can provide the general statement of Dynkin’s theorem.
Proposition 4.6 (Dynkin’s π-λ, full form). If S is a π-system and L is a λ-system, and S ⊆ L,
then σ(S) ⊆ L.
Proof. Once again, the same tricky proof. We’ll outline it in the next lecture.
Some intuition for π-λ is that you can take a finite non π-system such as S = {{1, 2}, {2, 3}},
and this is not enough to guarantee uniqueness on the σ-algebra generated by S, which includes
sets like {2}, {1, 2, 3}. But, at least in the countable case, you can use the π-system property to do
disjointification/partitioning on Ω, which finishes the proof.
5 September 17th, 2020
We’ll first go through the proof of π-λ, then finally begin talking about distributions.
Proof of Proposition 4.6. Without loss of generality, let L be the smallest λ-system containing S.⁶
The key idea will be to show that L is a σ-algebra, by showing that it is a π-system. In other
words, it suffices to show that for all A, B ∈ L, we have A ∩ B ∈ L.
To prove this result, we will rely on the following key claim. For some fixed A0 ∈ L, we define
a collection of sets L(A0 ) = {B ∈ L | A0 ∩ B ∈ L}. Then L(A0 ) is a λ-system for any A0 .
The proof of the above claim is completely mechanical; just verify the axioms. Then, by the
assumption that S is a π-system, we know that S ⊆ L(A0 ) whenever A0 ∈ S, and since L is the
smallest λ-system containing S, we in fact have L(A0 ) = L. This means that whenever A ∈ S and
B ∈ L, we can conclude A ∩ B ∈ L.
With this stronger fact, we can apply the lemma once again to get the stronger result that
S ⊆ L(A0 ) whenever A0 ∈ L. Then, applying the same logic again, this means that L(A0 ) = L for
any A0 ∈ L, as desired.
Lemma 5.1. Recall that the definition of two random variables X, Y being independent is that for
all Borel sets A, B ∈ B, you have P (X ∈ A, Y ∈ B) = P (X ∈ A)P (Y ∈ B). You can show that
this is equivalent to, for all x, y ∈ R,
P (X ≤ x, Y ≤ y) = P (X ≤ x)P (Y ≤ y).
Proof. Apply π-λ twice, judiciously. This can be generalized to n random variables.
Definition 5.3 (Bernoulli distribution). The simplest distribution is the Bernoulli, which models
a weighted coin toss. If 0 ≤ p ≤ 1 and Y ∼ Bern(p), then P (Y = 1) = p and P (Y = 0) = 1 − p.
The expected value of the Bernoulli distribution is p, while the variance is p(1 − p).
⁶This is valid because λ-systems are closed under intersection.
Definition 5.4 (Rademacher distribution). The Rademacher distribution takes values {−1, +1}
with equal probabilities 1/2 each.
Example 5.5. The position of a random walk on the real line, after n steps, can be modeled as a
sum of n i.i.d. Rademacher random variables.
Definition 5.6 (Binomial distribution). The binomial distribution Bin(n, p) is the sum of n inde-
pendent and identically distributed Bern(p) random variables.
The mean of a binomial distribution is np, while the variance is np(1 − p).
Definition 5.7 (Uniform distribution). The uniform distribution, written as U ∼ Unif, is the
distribution of equal density on the unit interval [0, 1]. It has the property that P (U ∈ [a, b]) = b−a
whenever 0 ≤ a ≤ b ≤ 1. It can be represented in terms of i.i.d. Y1 , Y2 , . . . ∼ Bern(1/2) by
$$U = \sum_{k=1}^{\infty} \frac{Y_k}{2^k}.$$
For brevity, we omit the measure theoretic details that the above dyadic construction is valid.
Note that many sources represent uniform distributions on intervals as Unif(a, b) instead, but Joe
prefers to write (b − a)U + a. The uniform distribution satisfies E [U ] = 1/2 and Var [U ] = 1/12.
Definition 5.8 (Exponential distribution). The exponential distribution, written Expo, is the
distribution of − log U for U ∼ Unif; equivalently, it has density f (x) = e^{−x} on [0, ∞).
The mean and variance are both 1. Note that in the above definition, we’re using the syntac-
tic convention of doing arithmetic on a distribution. This actually means that we draw random
variables from that distribution, and do arithmetic on the values. Although unambiguous in most
cases, Joe mentions that we should not write things like Expo + Expo, where the joint distribution
is unclear.
Definition 5.9 (Gamma distribution). The gamma distribution is the sum of independent expo-
nentially distributed random variables. We call r the (integer) shape parameter. Then, Gamma(r)
is the distribution of
$$X_1 + X_2 + \cdots + X_r,$$
where Xj are i.i.d. and drawn from Expo. The probability density function can be written as
$$f(x) = \frac{1}{\Gamma(r)} x^{r-1} e^{-x},$$
for x > 0.
6 September 22nd, 2020
Today we discuss reasoning by representation in more depth, and we introduce a fair number of
useful, common distributions.
Definition 6.1 (Quantile function). The quantile function of a distribution with CDF F is
$$F^{-1}(u) = \inf\{x \in \mathbb{R} : F(x) \ge u\}.$$
When F is continuous and strictly increasing, F −1 is identical to the inverse. Otherwise, it serves
as a sort of proxy that skips over regions with zero probability.
Proposition 6.2 (Probability integral transform). Let F be any CDF, with quantile function F −1 .
If we sample U ∼ Unif, then it follows that F −1 (U ) ∼ F .⁷
Proof. Note that u ≤ F (y) is the same as F −1 (u) ≤ y, since F is a non-decreasing function.
Therefore, the events U ≤ F (y) and F −1 (U ) ≤ y are the same for any y ∈ R, so
P (F −1 (U ) ≤ y) = P (U ≤ F (y)) = F (y).
Notice that this reminds us of the exponential distribution, which is in fact defined in a manner
similar to the probability integral transform, as a function of a uniform random variable.
Example 6.3. To generate a Bernoulli random variable with probability p, we can generate a
uniform random variable and pass it through the quantile function
$$F^{-1}(u) = \begin{cases} 0 & \text{if } u \le 1 - p, \\ 1 & \text{if } u > 1 - p. \end{cases}$$
It’s worth mentioning that the uniform distribution is not necessarily special. Using a variant
of the probability integral transform, we can generate a uniform from any continuous probability
distribution, and by extension, we can generate any probability distribution from any continuous
probability distribution.
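Here is a minimal Python sketch (our own illustration, not from the lecture) of the probability integral transform in action, generating Expo and Bern(p) samples by passing uniforms through quantile functions.

import math
import random

def sample_expo():
    # Quantile function of Expo is F^{-1}(u) = -log(1 - u); we feed it
    # 1 - U, which lies in (0, 1] and so avoids log(0).
    return -math.log(1.0 - random.random())

def sample_bern(p):
    # Quantile function from Example 6.3: 0 if u <= 1 - p, else 1.
    return 0 if random.random() <= 1 - p else 1

print(sum(sample_expo() for _ in range(100_000)) / 100_000)     # approx 1
print(sum(sample_bern(0.3) for _ in range(100_000)) / 100_000)  # approx 0.3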
Note. We can generate a normal distribution this way as well, by taking Φ⁻¹(U ) for U ∼ Unif,
where Φ is the standard normal CDF (closely related to the error function erf). However, this is
not terribly appealing because Φ⁻¹ is not expressible in terms of elementary functions.
⁷This is also sometimes called universality of the uniform.
[Figure: a web of representations radiating from the Uniform distribution: log(U/(1 − U )) gives
the Logistic, − log U gives the Exponential, and further transformations (floors, powers, and ratios
such as Z1 /Z2 , e^X , and N (0, 1)/√(χ²ₙ/n)) lead to the Arcsine, Normal, χ²ₙ, and related
distributions.]
Example 6.4 (Logistic distribution). The logistic distribution has representation log(U/(1 − U )),
where
we sample U ∼ Unif. The quantile function of the distribution is called the logit function, which is
$$\operatorname{logit}(p) = \log \frac{p}{1 - p}.$$
This maps a probability in (0, 1) to R, and it can be thought of as the log-odds of a probability. For
example, you can imagine predicting logits with a linear model (logistic regression), or a neural
network (softmax and cross entropy). The CDF is the sigmoid function,
$$\sigma(y) = \operatorname{logit}^{-1}(y) = \frac{e^y}{1 + e^y},$$
which can also be used as a nonlinearity in neural networks!
Finally, in a somewhat roundabout manner, we finally arrive at a definition of the normal
distribution from the χ2 distribution!
Definition 6.7 (Normal distribution). The celebrated standard normal distribution is defined by
N (0, 1) ∼ S · χ1 , where S ∼ Rad. We can scale this standard normal distribution to define a family
of distributions with various means and variances, which we denote N (µ, σ 2 ) ∼ σN (0, 1) + µ.
Example 6.8. $\chi_2^2 \sim Z_1^2 + Z_2^2$, where Z1 , Z2 are i.i.d. ∼ N (0, 1). Also, $\chi_2^2 \sim 2\,\mathrm{Expo}$.
Finally, here’s our last collector’s item today, which is often used in hypothesis testing.
Definition 6.9 (Student’s t-distribution). The Student t-distribution with n degrees of freedom has
the representation
$$t_n \sim \frac{N(0, 1)}{\sqrt{\chi_n^2 / n}},$$
where the numerator and denominator are independent.
We now work on an exercise in breakout rooms. Joe mentions that this exercise is very difficult
to solve with calculus, involving messy integrals, but it is surprisingly elegant when you attack it
by means of representations!
Exercise 6.10. For T ∼ tn , find E [|T |].
Solution. By the representation above,
$$|T| \sim \frac{|N(0, 1)|}{\sqrt{\chi_n^2 / n}} \sim \frac{\chi_1}{\chi_n} \sqrt{n}.$$
From here, since the numerator and denominator are independent (sorry for the sloppy notation),
we end up with a simpler expression:
$$E[|T|] = E[\chi_1] \cdot E\left[\frac{1}{\chi_n}\right] \cdot \sqrt{n}.$$
This is still kind of messy, but it’s broken down into much more manageable parts. For example,
we can find E [χ1 ] by a quick search on Wikipedia.
Definition 6.12 (Beta distribution). The beta distribution with shape parameters α, β > 0 is
supported on [0, 1]. Its representation is
$$\mathrm{Beta}(\alpha, \beta) \sim \frac{G_\alpha}{G_\alpha + G_\beta},$$
where Gα ∼ Gamma(α) and Gβ ∼ Gamma(β) are independent.
The beta distribution is often used as a conjugate prior for an unknown probability parameter.
Its probability density function is proportional to xα−1 (1−x)β−1 , and we’ll see some nice properties
connecting it to the gamma distribution next lecture.
7 September 24th, 2020
We continue where we left off, with the beta distribution, and we also talk about basic properties
of the normal distribution.
7.1 The Beta-Gamma Calculus
Proposition 7.1 (Beta-Gamma). Suppose that you have independent random variables Gα ∼
Gamma(α) and Gβ ∼ Gamma(β). Then by representations, we have that Gα +Gβ ∼ Gamma(α+β),
and Gα /(Gα + Gβ ) ∼ Beta(α, β). The interesting fact is that
$$\frac{G_\alpha}{G_\alpha + G_\beta} \;\perp\!\!\!\perp\; G_\alpha + G_\beta.$$
Proof. This fact comes from a straightforward calculation with Jacobians. Alternatively, you can
also reason about this by relating both variables to a Poisson process and order statistics, which
might help provide additional intuition.
Note. Surprisingly, this fact actually completely characterizes the beta and gamma distributions,
though proving this is nontrivial. It was formalized and proven in a theorem of Lukacs.
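As a quick numerical probe of Proposition 7.1 (our own sketch; the shape parameters are arbitrary choices), we can check both the independence claim and the mean of the ratio with NumPy.

import numpy as np

rng = np.random.default_rng(0)
alpha, beta, n = 2.0, 3.0, 200_000
ga = rng.gamma(alpha, 1.0, n)  # G_alpha ~ Gamma(alpha)
gb = rng.gamma(beta, 1.0, n)   # G_beta ~ Gamma(beta), independent of ga
ratio, total = ga / (ga + gb), ga + gb
print(np.corrcoef(ratio, total)[0, 1])       # approx 0 (uncorrelated)
print(ratio.mean(), alpha / (alpha + beta))  # both approx 0.4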
Joe remarks that he wants to emphasize the “Choose Your Favorite (CYF)” methodology.
Whenever you start working on a new problem about distributions, try to pick whichever repre-
sentation of said distributions gives you the most salient properties.
Example 7.2. Let B1 ∼ Beta(α, β) and B2 ∼ Beta(α + β, δ), such that B1 ⊥⊥ B2 . Using Proposi-
tion 7.1, we can choose the following construction for B1 and B2 :
$$B_1 = \frac{G_\alpha}{G_\alpha + G_\beta}, \qquad B_2 = \frac{G_\alpha + G_\beta}{G_\alpha + G_\beta + G_\delta},$$
where Gα , Gβ , Gδ are independent gamma random variables. We can verify in the above repre-
sentation that indeed, B1 ⊥⊥ B2 . Therefore the fractions cancel, and we can write
$$B_1 B_2 = \frac{G_\alpha}{G_\alpha + G_\beta + G_\delta} \sim \mathrm{Beta}(\alpha, \beta + \delta).$$
Example 7.3. Let’s try to compute the mean of the beta distribution. Note that independent
random variables are uncorrelated, so by definition
$$\frac{G_\alpha}{G_\alpha + G_\beta} \perp\!\!\!\perp G_\alpha + G_\beta \implies E\left[\frac{G_\alpha}{G_\alpha + G_\beta}\right] E[G_\alpha + G_\beta] = E[G_\alpha].$$
Therefore, the mean of Beta(α, β) is E [Gα ] /E [Gα + Gβ ] = α/(α + β).
7.2 The Normal Distribution and Box-Muller
Recall some basic properties of the normal distribution. If Z1 ∼ N (µ1 , σ1²) and Z2 ∼ N (µ2 , σ2²),
then Z1 ⊥⊥ Z2 =⇒ Z1 + Z2 ∼ N (µ1 + µ2 , σ1² + σ2²). Other useful properties are that the normal
distribution is invariant under rotations (e.g., Z1 + Z2 ⊥⊥ Z1 − Z2 ), and it is symmetric.
Proposition 7.4 (Box-Muller transform). If U1 , U2 ∼ Unif and U1 ⊥⊥ U2 , then define
$$Z_1 = \sqrt{-2 \log U_1} \cos(2\pi U_2),$$
$$Z_2 = \sqrt{-2 \log U_1} \sin(2\pi U_2).$$
It follows that Z1 , Z2 are i.i.d. ∼ N (0, 1).
Proof. Note that (Z1 , Z2 ) has support on R². Since the multivariate normal distribution is centrally
symmetric, we can sample the angle θ ∼ 2π Unif, which is what U2 is used for. Meanwhile, to get
the radius, observe that Z1² + Z2² ∼ χ²₂ ∼ 2 Gamma(1), which motivates the use of $\sqrt{-2 \log U_1}$.
This transformation gives us an efficient way to sample i.i.d. normal random variables in the
special case of a parallel processor (SIMD or GPU). However, the Ziggurat algorithm, a variant of
rejection sampling, is more efficient on common processors. Taking NumPy’s implementation as
an example, see the current Ziggurat version, or the old Box-Muller version.
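For reference, here is a short, vectorized Box-Muller sketch in NumPy (our own illustration, not NumPy’s internal code), following Proposition 7.4 directly.

import numpy as np

def box_muller(n, seed=0):
    # Turn 2n i.i.d. uniforms into 2n i.i.d. N(0, 1) samples.
    rng = np.random.default_rng(seed)
    u1 = 1.0 - rng.random(n)  # shift into (0, 1] so log(u1) is finite
    u2 = rng.random(n)
    r = np.sqrt(-2 * np.log(u1))     # radius: Z1^2 + Z2^2 ~ 2 Gamma(1)
    z1 = r * np.cos(2 * np.pi * u2)  # angle supplied by u2
    z2 = r * np.sin(2 * np.pi * u2)
    return np.concatenate([z1, z2])

z = box_muller(100_000)
print(z.mean(), z.var())  # approx 0 and 1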
In addition to being computationally nice in some cases (avoiding code branches), the Box-
Muller transform is also useful as a representation, which transforms many problems about the
normal distribution into ones about trigonometric functions.
Example 7.5. If U ∼ Unif, then tan(2πU ) ∼ Cauchy.
7.3 Order Statistics
Given random variables X1 , . . . , Xn , their order statistics are the sorted values X(1) ≤ X(2) ≤
· · · ≤ X(n) . The nicest case is the exponential distribution.
Proposition 7.7 (Rényi representation). If X1 , . . . , Xn are i.i.d. ∼ Expo, then their order statistics
can be jointly represented as
$$X_{(j)} \sim \frac{Y_1}{n} + \frac{Y_2}{n - 1} + \cdots + \frac{Y_j}{n - j + 1},$$
where Y1 , Y2 , . . . , Yn are i.i.d. ∼ Expo.
Proof. This follows from induction and the memoryless property of the exponential distribution.
One other interesting case to consider is when the distributions are uniform. In some sense,
the exponential and the uniform are the two nicest distributions whose order statistics we can
work with.
Proposition 7.8 (Uniform order statistics). If U1 , . . . , Un are i.i.d. ∼ Unif, then their order statis-
tics are jointly distributed as
$$U_{(j)} = \frac{X_1 + \cdots + X_j}{X_1 + \cdots + X_{n+1}},$$
where X1 , . . . , Xn+1 are i.i.d. ∼ Expo. It immediately follows that the marginal distributions of the
order statistics are U(j) ∼ Beta(j, n + 1 − j).
Proof. Joe notes that there’s a nice proof of this due to Franklyn Wang, when viewed as related
to the Rényi representation. Essentially, you map this to a transformed Poisson process. See the
textbook for details.
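The representation is easy to test empirically; below is a short NumPy sketch (our own illustration) comparing the mean of U_(j) built from exponential gaps against the Beta(j, n + 1 − j) mean.

import numpy as np

rng = np.random.default_rng(1)
n, j, trials = 5, 2, 100_000
x = rng.exponential(1.0, (trials, n + 1))   # X_1, ..., X_{n+1} i.i.d. Expo
u_j = x[:, :j].sum(axis=1) / x.sum(axis=1)  # (X_1+...+X_j)/(X_1+...+X_{n+1})
print(u_j.mean(), j / (n + 1))  # Beta(j, n+1-j) has mean j/(n+1)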
When dealing with order statistics for exponential distributions with different rates λ1 , . . . , λn ,
the first order statistic is nice.9 However, all of the other order statistics are unfortunately messy.
⁹It’s not hard to show that this is distributed according to $\frac{1}{\lambda_1 + \cdots + \lambda_n}$ Expo.
8 September 29th, 2020
Today we formally introduce the Poisson distribution and the related Poisson process.
Definition 8.1 (Poisson distribution). The Poisson distribution Pois(λ) with rate parameter λ,
supported on {0, 1, 2, . . .}, is defined by the probability mass function
$$P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}.$$
This distribution is deeply connected to the exponential and gamma distributions.
Definition 8.2 (Poisson process). The Poisson process refers to the sequence of arrival times
T1 , T2 , . . . ≥ 0, where the successive time differences X1 = T1 , X2 = T2 − T1 , X3 = T3 − T2 , . . . are
i.i.d. ∼ λ−1 Expo. The marginal distribution of arrival times is
Tn = X1 + X2 + · · · + Xn ∼ λ−1 Gamma(n).
Furthermore, if Nt = #(arrivals in [0, t]), then the two events {Nt ≥ n} = {Tn ≤ t} are equivalent.
This holds for general arrival processes, and we sometimes call this count-time duality.10
Proposition 8.3. If Nt is the number of arrivals in [0, t] of a Poisson process with rate λ, then
Nt ∼ Pois(λt).
Proof. By count-time duality, P (Nt = k) = P (Nt ≥ k) − P (Nt ≥ k + 1) = P (Tk ≤ t) − P (Tk+1 ≤ t).
Both of these latter probabilities can be expressed as a CDF of the gamma distribution. Although
the incomplete gamma function is messy, applying integration by parts cracks the problem:
$$P(T_k \le t) - P(T_{k+1} \le t) = \frac{1}{\Gamma(k)} \int_0^{\lambda t} e^{-x} x^{k-1}\,dx - \frac{1}{\Gamma(k+1)} \int_0^{\lambda t} e^{-x} x^k\,dx$$
$$= \frac{1}{\Gamma(k)} \int_0^{\lambda t} e^{-x} x^{k-1}\,dx + \frac{1}{\Gamma(k+1)} e^{-\lambda t} (\lambda t)^k - \frac{k}{\Gamma(k+1)} \int_0^{\lambda t} e^{-x} x^{k-1}\,dx$$
$$= \frac{e^{-\lambda t} (\lambda t)^k}{k!}.$$
Corollary 8.3.1. Given any fixed time interval of length t, the number of Poisson arrival events
in that interval is distributed ∼ Pois(λt). Furthermore, given two disjoint time intervals of any
lengths, the number of Poisson arrival events in those intervals are independent.
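Here is a small simulation sketch (our own illustration) that builds arrival times from i.i.d. exponential gaps and checks that the count in [0, t] behaves like Pois(λt), whose mean and variance are both λt.

import numpy as np

rng = np.random.default_rng(2)
lam, t, trials = 3.0, 2.0, 50_000
gaps = rng.exponential(1.0 / lam, (trials, 100))  # 100 gaps easily covers [0, t]
arrivals = gaps.cumsum(axis=1)                    # T_n = X_1 + ... + X_n
counts = (arrivals <= t).sum(axis=1)              # N_t via count-time duality
print(counts.mean(), counts.var())  # both approx lam * t = 6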
We previously mentioned Poisson processes through a connection with the order statistics of
the uniform distribution. We formalize this below.
¹⁰This is Joe’s invented name for the fact.
Proposition 8.4 (Conditional arrival times). If Tn+1 = t, then the conditional joint distribution
of (T1 , T2 , . . . , Tn ) is that of the order statistics of i.i.d. uniform random variables multiplied by t,
i.e.,
$$(T_1, \ldots, T_n) \mid T_{n+1} = t \;\sim\; t \cdot (U_{(1)}, \ldots, U_{(n)}),$$
where U1 , . . . , Un ∼ Unif.
Proof. This stems from distribution representations and the Beta-Gamma calculus. Observe that
$$\frac{T_k}{T_{n+1}} = \frac{X_1 + X_2 + \cdots + X_k}{X_1 + X_2 + \cdots + X_{n+1}}.$$
The right-hand side is precisely the representation from Proposition 7.8 for the joint distribution
of uniform order statistics U(k) .
The above fact also yields a nice proof that Beta(1, 1) ∼ Unif.
Exercise (Broken stick). If we cut the unit interval [0, 1] at n i.i.d. uniform points, what is the
expected length of the shortest of the n + 1 resulting segments?
Proof. Use the order statistics of the uniform distribution. This tells us that in the joint distribu-
tion, the k-th cut point can be represented as
$$\frac{X_1 + \cdots + X_k}{X_1 + \cdots + X_{n+1}},$$
where X1 , . . . , Xn+1 are i.i.d. ∼ Expo. Then, apply the Rényi representation of the exponential
distribution, which tells us that
$$X_{(1)} \sim \frac{Y_1}{n+1}; \quad X_{(2)} - X_{(1)} \sim \frac{Y_2}{n}; \quad \ldots; \quad X_{(n+1)} - X_{(n)} \sim Y_{n+1};$$
where Y1 , . . . , Yn+1 are also i.i.d. ∼ Expo. Finally, we can conclude that the length of the shortest
segment is simply distributed as
$$\frac{X_{(1)}}{X_{(1)} + \cdots + X_{(n+1)}} = \frac{\frac{1}{n+1} Y_1}{Y_1 + \cdots + Y_{n+1}} = \frac{1}{n+1} \mathrm{Beta}(1, n).$$
This has mean $\frac{1}{(n+1)^2}$.
Note. By a slight modification of the above argument, using linearity of expectation, we can see
that the expected value of the length of the k-th largest segment is simply
$$\frac{1}{n+1} \left( \frac{1}{k} + \frac{1}{k+1} + \cdots + \frac{1}{n+1} \right).$$
In the next lecture, we will begin discussing expected value through Lebesgue integration!
9 October 1st, 2020
Today we will continue discussing Poisson processes and some of their nice properties. Then, we
introduce the notion of expected value, which is defined in Chapter 4 of the textbook.
Definition 9.1 (Poisson point process). A Poisson point process on a measure space (X, µ)¹¹ with
rate λ has the property that the number of points in a bounded region U ⊂ X is distributed
according to a Poisson random variable with parameter λµ(U ).
Note that with this definition, we lose the interpretation of a Poisson process as having expo-
nential arrival times. This only works for Poisson point processes on R+ , which is what we have
been working with so far. When we take Poisson processes over other measure spaces, there is no
longer any notion of arrival time.
Example 9.2 (Poisson process on a circle). We can define a Poisson process with rate λ on the
unit circle S¹. For any arc subtending an angle θ, written in radians, the number of points in that
arc is distributed according to Pois(λθ). The expected total number of points on the circle is 2πλ.
Example 9.3 (2D Poisson process). Consider the special case where X is a compact subset of R2 ,
and µ is the Lebesgue (or Borel) measure. Then, we call this a 2D Poisson process, and it has the
property that the number of points in any two separate regions are independent, and the mean is
proportional to the area of those regions.
We can arrive at an approximation for a 2D Poisson point process by subdividing our region
into many small squares, then giving each square an independent Bernoulli chance of containing
a point. As the number of squares gets larger, and each square gets smaller, our approximation
gets closer to a true Poisson process.
Lemma 9.4 (#1). If X ∼ Pois(λ1 ), Y ∼ Pois(λ2 ), and X ⊥⊥ Y , then X + Y ∼ Pois(λ1 + λ2 ).
Proof. Consider a Poisson point process with rate λ1 , and another Poisson point process with rate
λ2 . Then we can simply superimpose these processes together into a single process, combining the
arrival times from both. It’s easy to see that X is the number of arrivals in [0, 1] for the first
process, Y is the number of arrivals in [0, 1] for the second process, and X + Y becomes the number
of arrivals in the superimposed process, which has rate λ1 + λ2 .
Lemma 9.5 (#2). If X ∼ Pois(λ1 ), Y ∼ Pois(λ2 ), and X ⊥⊥ Y , then the conditional distribution
of X given X + Y = n is Bin(n, λ1 /(λ1 + λ2 )).
¹¹Technically, these need to be Radon measures for mathematical reasons.
¹²Joe calls these his “favorite” properties, particularly #3.
Proof. This is equivalent to the following fact about a Poisson process. Given a Poisson process
with rate λ, condition on the number of arrivals in the interval [0, t] being N (where N ∼ Pois(λt)).
Then the arrival times T1 , . . . , TN are jointly distributed as the order statistics of N i.i.d. uniform
random variables multiplied by t.
Lemma 9.6 (#3). Consider the chicken-egg story, where a chicken lays N ∼ Pois(λ) eggs, which
each hatch independently with probability p. Let X be the number of eggs that hatch, and let Y be
the number of eggs that do not hatch. Then, X ⊥⊥ Y , X ∼ Pois(λp), and Y ∼ Pois(λ(1 − p)).
Proof. The proof of this result comes from LOTP, where we compute
P (X = x, Y = y) = P (X = x, Y = y | N = x + y) · P (N = x + y).
After some algebraic manipulation, this eventually shows independence of X ⊥⊥ Y . As an inter-
pretation in the corresponding Poisson point process, you can imagine starting with a process of
rate λ, then thinning the process by coloring each point independently with probability p. The
colored and uncolored points then form their own, independent, Poisson processes with rates λp
and λ(1 − p). This can be seen as the reverse of superposition.
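Thinning is also easy to see numerically; this short sketch (our own illustration) hatches each of N ∼ Pois(λ) eggs independently with probability p and checks the claimed marginals and independence.

import numpy as np

rng = np.random.default_rng(3)
lam, p, trials = 10.0, 0.3, 200_000
n = rng.poisson(lam, trials)
x = rng.binomial(n, p)  # hatched eggs, given N = n
y = n - x               # unhatched eggs
print(x.mean(), x.var())        # both approx lam * p = 3
print(y.mean(), y.var())        # both approx lam * (1 - p) = 7
print(np.corrcoef(x, y)[0, 1])  # approx 0, consistent with independence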
Definition 9.7 (Riemann integral). The Riemann integral of a function f : [a, b] → R is
$$\int_a^b f(x)\,dx = \lim_{n \to \infty} \sum_{i=0}^{n-1} f(t_i)(x_{i+1} - x_i),$$
where a = x0 < x1 < x2 < · · · < xn = b, and for each i, ti ∈ [xi , xi+1 ].
This definition of Riemann integral clearly does not work for computing the expected value of a
discrete distribution, which does not have a PDF. The Riemann sums will not converge in this
case, so we need something slightly more powerful.
Definition 9.8 (Riemann-Stieltjes integral). The Riemann-Stieltjes integral of f : [a, b] → R with
respect to a non-decreasing integrator function g : [a, b] → R is
$$\int_a^b f(x)\,dg(x) = \lim_{n \to \infty} \sum_{i=0}^{n-1} f(t_i)(g(x_{i+1}) - g(x_i)),$$
where a = x0 < x1 < x2 < · · · < xn = b, and for each i, ti ∈ [xi , xi+1 ].
Note that this coincides with the ordinary Riemann integral when g(x) = x. This integral has
the property that it works for computing the expected value of discrete distributions, since you can
simply plug in the CDF (which is well-defined) as the integrator. It’s also fairly easy to compute by
hand, which makes it useful in practice. However, the strongest and most general integral, which
is often used in proofs, is as follows.13
¹³We will only define the Lebesgue integral for random variables in probability spaces here, but you can generalize
the definition to other functions on measure spaces.
Definition 9.9 (Lebesgue integral). Let (Ω, F, P ) be a probability space, and let X : Ω → R be a
random variable. Then the expected value of X, denoted E [X] is defined by the following three-step
construction:14
1. For indicator random variables, which are simply 1 on a measurable set S ∈ F and 0
otherwise: their expectation is defined to be the measure P (S).
2. Extending to non-negative weighted sums of indicator random variables, called simple random
variables. We do this by linearity of expectation.
3. Defining for non-negative random variables by taking the supremum over all dominated simple
random variables X ∗ ,
$$E[X] = \sup_{X^* \le X} E[X^*].$$
We omit the remaining details (in particular, a general X is handled by writing X = X⁺ − X⁻ and
setting E [X] = E [X⁺] − E [X⁻]), but these are the key ideas of the Lebesgue integral construction.
¹⁴Joe refers to this as InSiPoD, short for Indicator-Simple-Positive-Difference.
10 October 6th, 2020
Last week we introduced the Riemann-Stieltjes and Lebesgue integrals, for the purpose of defining
the expected value of a random variable. Today we’ll continue by discussing them in more detail.
The first is a mild generalization of our familiar Riemann integral from high school Calculus, while
the second is the venerable Lebesgue integral, which is general enough to work on any measurable
domain (not necessarily just the reals!). In general, when an integral is written, you can choose
whichever definition as they are consistent where defined.
Example 10.1 (Indicator of Q). Consider the indicator function IQ : R → {0, 1}, which is 1 on all
the rationals and 0 everywhere else. This function is not Riemann integrable (non-convergent) on
any nonzero interval of the reals, yet it is Lebesgue integrable. In fact, because the rationals are
countable,
$$\int_{\mathbb{R}} I_{\mathbb{Q}}(x)\,\lambda(dx) = 0.$$
Proposition 10.2 (Linearity). For random variables X and Y on the same probability space,
$$E[X + Y] = E[X] + E[Y].$$
This is not an obvious statement, and proving it requires some work. We can also generalize to
countably infinite sums of random variables, as linearity still holds under some mild regularity
assumptions. Joe uses this as an example of the difference between the Riemann and Lebesgue
definitions of expected value. Compare the statement of linearity in both senses:
$$\int_{-\infty}^{\infty} t f_{X+Y}(t)\,dt = \int_{-\infty}^{\infty} x f_X(x)\,dx + \int_{-\infty}^{\infty} y f_Y(y)\,dy,$$
$$\int_\Omega (X + Y)(\omega)\,P(d\omega) = \int_\Omega X(\omega)\,P(d\omega) + \int_\Omega Y(\omega)\,P(d\omega).$$
Either statement requires a formal mathematical proof, but the second statement (in terms of the
Lebesgue integral) is much more intuitive to read, as the variable of integration ω is the same.
Example 10.3 (Simple random variables). Consider a simple random variable $X = \sum_{j=1}^{n} a_j I_{A_j}$.
We can usually write such a variable in canonical form by assuming that the subsets Aj are pairwise
disjoint, which makes X essentially a collection of disjoint positive rectangles over the sample
space Ω.
¹⁵Joe notes that we won’t care much about weird cases like Q, as they don’t come up in practice. For example, in
the real world, all of your measurements will be in Q due to finite precision. Here’s a pun: “In this course, we care
about hard work, not IQ .”
For an additional clarification about Definition 9.9, consider the following equivalent description
of the nonnegative case. This isn’t written in the book yet, but there’s a really clean formula for
the simple random variables approximating any nonnegative random variable X. We can just take
a monotone sequence of random variables:
$$X_n = \min\left( \frac{\lfloor 2^n X \rfloor}{2^n},\; n \right).$$
It’s not hard to show that this is equivalent to the step in the definition of the Lebesgue integral
that uses a supremum over simple random variables. Basically, all this does is cut off the values of
X at n, then quantize it to the first n digits of its binary representation. However, this definition
can be much easier to use in an actual computation.
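As a quick illustration (our own sketch), we can watch this quantized sequence converge to the mean of a standard exponential from below.

import numpy as np

def quantized_mean(samples, n):
    # X_n = min(floor(2^n X) / 2^n, n): cut off at n, keep n binary digits.
    xn = np.minimum(np.floor((2.0 ** n) * samples) / (2.0 ** n), n)
    return xn.mean()

rng = np.random.default_rng(4)
x = rng.exponential(1.0, 100_000)
for n in (1, 2, 4, 8):
    print(n, quantized_mean(x, n))  # increases toward E[X] = 1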
Example 10.4 (Darth Vader rule). For any nonnegative random variable Y , the following formula
for the expectation holds:
$$E[Y] = \int_0^\infty P(Y > y)\,dy.$$
Proof. First we will show this for the Lebesgue definition of expected value. If Y is an indicator
random variable IA , then the right-hand side integral just becomes P (A), which follows immediately.
Next, if Y is simple, then we proceed simply by breaking up the variable into its canonical form
and writing a double sum. After some manipulation (swapping the order of sums), this works.
Finally, we can generalize to all nonnegative random variables by using the monotone convergence
theorem, which lets us swap the order of lim and E.
For completeness, we also sketch the proof when the left-hand side has E [Y ] defined according
to the Riemann-Stieltjes definition. Recall by definition that
$$E[Y] = \int_{-\infty}^{\infty} y\,dF(y) = \int_{-\infty}^{\infty} \int_0^y dx\,dF(y).$$
Writing it in this form, it’s clear that this statement just becomes a consequence of Fubini’s theorem.
Swapping the order of the integrals yields our desired result:
$$\int_{-\infty}^{\infty} \int_0^y dx\,dF(y) = \int_0^\infty \int_x^\infty dF(y)\,dx = \int_0^\infty P(Y > x)\,dx.$$
A natural question in analysis is when taking limits commutes with taking expectations:
[Diagram: a commuting square — limj→∞ sends the sequence {Xj } to X, applying E to each side
gives {E[Xj ]} and E[X], and we ask whether the square commutes.]
In other words, when does limj→∞ E [Xj ] = E [limj→∞ Xj ]? It turns out that this intuitive state-
ment holds most of the time, but there are counterexamples where it fails to hold.
Example 10.5 (Failure of convergence). Consider the sequence of discrete random variables Xn ,
where Xn is n² with probability 1/n² and 0 otherwise. By the Borel-Cantelli lemma, note that
$$\sum_{n=1}^{\infty} P(X_n > 0) = \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty,$$
so with probability 1, at most finitely many of the Xn will be nonzero. This means that Xn → X al-
most surely, where X = 0 is a point mass at zero. However, now we have a counterexample,
as E [Xn ] = 1 for all n, yet E [X] = 0.
Despite this pessimistic example, under some mild assumptions we can prove that expected
values and limits commute, using three so-called convergence theorems.
Proposition 10.7 (Dominated convergence theorem). If there exists a random variable W such
that |Xn | ≤ W for all n, E [W ] < ∞, and X1 , X2 , . . . → X in probability, then limn→∞ E [Xn ] =
E [X].
See Section 4.6 of the book for proofs of each of these theorems.
11 October 8th, 2020
Today we review convergence theorems a bit, for the purpose of providing us some analysis intuition.
Definition 11.2 (Almost sure convergence). If X1 , X2 , . . . and X are random variables, then we
say that X1 , X2 , . . . → X strongly, or almost surely converges, if
$$P\left( \lim_{n \to \infty} X_n = X \right) = 1.$$
Proof of Corollary 10.7.1. In the statement of the bounded convergence theorem, we assumed that
|Xn | ≤ c for all n. Let’s first try to bound X as well. To do this, we will take a strategy of adding
some slack to the variable.16 For any n and ε > 0, note that by a union bound,
$$P(|X| > c + \varepsilon) \le P(|X_n| > c) + P(|X_n - X| > \varepsilon) = P(|X_n - X| > \varepsilon).$$
However, as n → ∞, this probability just goes immediately to zero. We can view this as “taking
the limit” on both sides, except the left-hand side doesn’t actually contain the variable n in it.
Therefore, by the squeeze theorem,
$$P(|X| > c + \varepsilon) = 0 \quad \text{for all } \varepsilon > 0,$$
where the last step follows from the definition of convergence in probability; letting ε → 0 shows
|X| ≤ c almost surely. For the next part of our proof, consider E [|Xn − X|] for varying n. By the
triangle inequality, since |Xn |, |X| ≤ c, we must have |Xn − X| ≤ 2c. Then,
$$E[|X_n - X|] \le 2c \cdot P(|X_n - X| > \varepsilon) + \varepsilon.$$
For any ε, as n → ∞, the first term on the right-hand side approaches zero. Therefore,
$$\lim_{n \to \infty} |E[X_n] - E[X]| \le \lim_{n \to \infty} E[|X_n - X|] \le \varepsilon$$
for every ε > 0, and so limn→∞ E [Xn ] = E [X].
Note that we choose to present the proof above, instead of the more general dominated conver-
gence theorem, as that proof requires using machinery such as Fatou’s lemma.
Exercise 11.1. Does the bounded convergence theorem still hold if we replace “converges in
probability” with “converges in distribution” for X1 , X2 , . . .? Joe mentions that he’s not sure if
this is true, but he can’t think of an easy counterexample at the moment.
Definition (Conditional expectation). Given random variables X and Y , let g(x) = E [Y | X = x].
Then the conditional expectation of Y given X is the random variable
E [Y | X] = g(X).
Note. To be very rigorous about definitions, assume that Y : Ω → R is a random variable, and
G ⊆ F is a σ-subalgebra. Then E [Y | G] is also a function Ω → R, defined in terms of an averaging
operator across all atomic sets in G. In other words, we already have that Y is a F-measurable
function, but by applying a certain averaging map, we can make it G-measurable, which is a stronger
condition because G is coarser than F. Mathematically, we have for all G ∈ G that
$$\int_G E[Y \mid \mathcal{G}]\,dP = \int_G Y\,dP.$$
Therefore, the equation that E [Y | X] = g(X) is actually somewhat of an abuse of notation ac-
cording to this σ-algebra definition, but it makes the definition much easier to think about!
For more intuition about conditional expectation, you can also think of it as a form of projection.
This is reflected in the (albeit nonconstructive) definition below.
Definition (Conditional expectation as projection). E [Y | X] is the random variable g(X) such
that for every (bounded, measurable) function h,
$$E[(Y - g(X))\, h(X)] = 0.$$
This makes sense intuitively, as you can pushforward the Lebesgue integral from the underlying
σ-algebra F to the σ-subalgebra σ(X) ⊂ F, on which h(X) is locally constant and Y − g(X)
averages to zero.
Anyway, this is the property we’d really like for conditional expectation to have. Let’s now see
if this definition is actually valid, i.e., showing existence and uniqueness! For what follows, let’s
assume (for the sake of convenience) that all r.v.s have finite variance.
¹⁷Therefore, you may see some papers writing this with the notation E [Y | σ(X)].
Definition 11.4 (Hilbert space). A Hilbert space is a real or complex inner product space that is
also a complete metric space with respect to its norm.
We care about this completeness condition because in function spaces, which are infinite-
dimensional real vector spaces, we don’t actually get completeness for free. Anyway, we could
spend more time talking about Hilbert and Banach spaces, but that’s the content of Math 114.
Instead, we’ll just state the theorem.
Proposition 11.5. Zero-centered random variables, i.e., such that E [X] = 0, form a Hilbert space
under the covariance inner product
$$\langle X, Y \rangle = \operatorname{Cov}(X, Y) = E[XY].$$
This assumes that we consider two random variables to be equivalent if they are almost surely equal.
It’s a well-known fact that quotient Hilbert spaces exist. Using some kind of argument along
this form, you can essentially show with relative ease that conditional expectations exist and are
unique. The details are omitted here in the lecture, as it’s all measure theory.
Proposition 11.6 (Adam’s law). For any random variables X and Y ,
E [E [Y | X]] = E [Y ] .
Proof. This follows immediately from the conditional expectation property written above. In par-
ticular, if we set h(X) = 1, then the property reduces to
E [Y − E [Y | X]] = 0,
which rearranges to the desired result by linearity. For instance, if E [Y ] = 0, this machinery gives
$$\operatorname{Var}[Y] = E[Y^2] = E[E[Y^2 \mid X]].$$
That concludes a very brief foray into conditional expectation and some of its properties.
12 October 13th, 2020
First, let’s talk about the midterm exam. The test-taking window will start on October 22nd, and
it will last for 60 hours. As a result, there will be no class next Thursday. The test can be taken
in any 3-hour block, although it has been written to be “reasonable” as a 75-80 minute exam to
somewhat alleviate time pressure. The exam is open-book, open-note, and open-internet, but you
may not ask questions or consult any other students. Submissions are in PDF format and can be
handwritten or in LaTeX.
Proof. We will show this by simply applying Adam’s law and linearity of expectation.
Alternatively, note that the above argument could be slightly simplified by assuming, without
essential loss of generality, that E [X] = E [Y ] = 0.
As an aside, note that everything stated above has been about conditional expectations. This
is because conditional expectations (which are just random variables) are much easier to rigorously
talk about than conditional distributions. When we write X | Z ∼ N (Z, 1), this is a statement
about the conditional distribution of X, not a random variable called “X | Z” (which does not make
sense). Defining conditional distributions rigorously requires some measure theory machinery,18
which is not the focus of this course.
An interesting generalization of Adam’s law, Eve’s law, and ECCE is the law of total cumulance.
This is not included in the textbook at the moment, but Joe might add it later. Anyway, it’s beyond
the scope of this course, as the laws written above cover 95% of cases.
Note. Borel’s paradox, as mentioned in the book, is an issue when trying to define conditional
probability when conditioning on events of probability zero. It happens with continuous random
variables, for example, conditioning on the event X = Y through the equivalent formulations
X − Y = 0 and X/Y = 1. The issue is that we are conditioning on an event of measure zero in
both cases. Because of the obvious issues, conditioning on events is outside the scope of this course,
and we will only condition on r.v.s and σ-algebras, which are well-defined.
¹⁸For more info and some juicy pushforwards, see regular conditional probability.
12.2 Moment Generating Functions
In your typical undergraduate-level probability class, you probably talked about moment generating
functions, but perhaps not about generating functions in general.¹⁹ We’ll talk about this briefly.
Example 12.3 (Making change). Suppose that you wanted to know how many ways you can make
change for 50 cents, given coins of denominations 1, 5, 10, 25, 50 cents. You could do this by writing
out the possibilities, but this is tedious. You could also use dynamic programming (Knapsack).
One formalism from combinatorics that might help here though, is a generating function. Write
$$p(t) = \frac{1}{(1-t)(1-t^5)(1-t^{10})(1-t^{25})(1-t^{50})}.$$
Then, in the formal power series expansion for p(t), the coefficient of t^k is precisely the number of
ways to make k cents, using these types of coins. Mathematically, this is also $\frac{1}{k!} p^{(k)}(0)$. Here t is
just a “bookkeeping device” for the sequence of values in the coefficients.
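Multiplying out the generating function as a truncated polynomial is exactly the dynamic programming solution; here is a minimal sketch (our own illustration).

target = 50
coeffs = [1] + [0] * target  # start with the constant polynomial 1
for coin in (1, 5, 10, 25, 50):
    # Multiply by 1/(1 - t^coin) = 1 + t^coin + t^(2*coin) + ...,
    # truncating at degree `target`.
    for k in range(coin, target + 1):
        coeffs[k] += coeffs[k - coin]
print(coeffs[target])  # 50 ways to make 50 cents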
With that informative example, which also shows the two-pronged interpretation of generating
functions as either formal power series or convergent functions, we are now ready to define the
notion of a generating function.
Definition 12.4 (Generating function). Given a sequence (a0 , a1 , a2 , . . .), a generating function
for the sequence is a power series containing this sequence in its coefficients. There are two kinds:
the ordinary generating function $\sum_{n \ge 0} a_n t^n$, and the exponential generating function
$\sum_{n \ge 0} a_n t^n / n!$.
Generally we prefer working with the exponential kind in statistics, as it is more likely to converge.
Definition 12.5 (Moment generating function). The moment generating function of a random
variable X is the exponential generating function of the moments. We denote this by
$$M_X(t) = E\left[e^{tX}\right] = E\left[\sum_{n=0}^{\infty} \frac{(tX)^n}{n!}\right] = \sum_{n=0}^{\infty} \frac{E[X^n]\, t^n}{n!}.$$
This function only exists when MX (t) < ∞ for all t in some open neighborhood of 0. Under this
assumption, we are allowed to swap the expectation and sum in the last step, due to dominated
convergence. This is because
$$\left| \sum_{n=0}^{m} \frac{X^n t^n}{n!} \right| \le \sum_{n=0}^{m} \frac{|X|^n |t|^n}{n!} \le e^{|tX|} \le e^{tX} + e^{-tX}.$$
The last expression above has finite expectation, by our assumption, so it meets the necessary
requirements for dominated convergence.
Now that we’ve rigorously defined MGFs, let’s see some useful properties.
¹⁹For the definitive text on generating functions, see Herbert Wilf’s generatingfunctionology.
Proposition 12.6 (MGF of independent sum). If X, Y are r.v.s and X ⊥⊥ Y , then
$$M_{X+Y}(t) = M_X(t)\, M_Y(t).$$
Proposition 12.7 (Uniqueness of MGFs). If X, Y are r.v.s with moment generating functions
and MX (t) = MY (t) on some open neighborhood of the origin, then X ∼ Y .
Example 12.8 (MGF of normal). If X ∼ N (µ, σ²), then $M_X(t) = e^{\mu t + \sigma^2 t^2/2}$, which
exists for all t.
Combined with the previous two propositions, this immediately implies that the sum of independent
normals is also normally distributed.
Example 12.9 (No MGF for log-normal distribution). If Y = e^Z where Z ∼ N (0, 1), then all of
the moments of Y are defined, as
$$E[Y^n] = E[e^{nZ}] = M_Z(n) = e^{n^2/2}.$$
However, the MGF of Y does not exist, since E [e^{tY}] = ∞ for every t > 0.
Definition 12.10 (Joint moment generating function). Given two random variables X and Y , or
in general n variables, the joint MGF of X and Y is defined as the function
$$M_{X,Y}(s, t) = E\left[e^{sX + tY}\right].$$
In addition to the usual moments, the joint MGF also generates joint moments such as the
covariance when X and Y are centered. It also fully describes the joint distribution of (X, Y ),
meaning that X ⊥⊥ Y if and only if their joint MGF factors into the marginal variants.
Finally, one problem with moment generating functions illustrated by an earlier example is
convergence. To fix this issue somewhat, there is a variant of MGFs based on the Fourier transform
rather than the Laplace transform, which is guaranteed to always exist.
Definition 12.11 (Characteristic function). The characteristic function of a random variable X is
$$\varphi_X(t) = E[e^{itX}],$$
which is defined for every t ∈ R and every distribution of X.
This also has uniqueness properties, as we will describe next lecture, but there is also the nice
fact that its value always lies within the unit disk (as it’s really a convex combination of points on
the unit circle).
13 October 15th, 2020
Today, class is starting 15 minutes early due to a last-minute schedule conflict for Joe. Therefore,
we will have a brief self-contained topic (cumulants) for the first 15 minutes, before going back to
the main topic.
13.1 Cumulants
As a brief aside, recall the definition of the characteristic function. This has the nice property that
it always exists, but it is not always smooth. The moment generating function has a redeeming
property that it is infinitely differentiable, i.e., C∞ at 0, and its existence implies that all the
moments of the distribution exist. This makes it very useful in many cases.
Proposition 13.1. If a random variable X has moment generating function MX (t), then the k-th
moment of X exists for all k, and
$$E[X^k] = M_X^{(k)}(0).$$
Proof. This can be justified by dominated convergence, which can be used to do differentiation
under the integral sign.²⁰ Essentially,
$$M_X'(t) = \frac{d}{dt} E\left[e^{Xt}\right] = E\left[X e^{Xt}\right].$$
Doing this repeatedly lets you illustrate that
$$M_X^{(k)}(t) = \frac{d^k}{dt^k} E\left[e^{Xt}\right] = E\left[X^k e^{Xt}\right],$$
and the result follows. In general, the MGF is always infinitely differentiable, arguing from domi-
nated convergence once again (i.e., MGF existence is stronger than existence of all moments), so
this proposition is valid.
Recall that if two random variables X and Y are independent, then the moment generating
function of their sum X + Y is simply the product of their individual moment generating functions.
In other words, MX+Y (t) = MX (t)MY (t). What if we wanted to turn this product back into a
sum?
Definition 13.2 (Cumulant generating function (CGF)). The cumulant generating function of a
random variable X, defined whenever the MGF exists, is defined by
$$K_X(t) = \log M_X(t) = \sum_{r=1}^{\infty} \frac{\kappa_r}{r!} t^r.$$
You can derive formulas for the first few cumulants by using the power series expansion for log(1+x),
since the moment generating function satisfies MX (0) = 1. This power series looks like
$$\log(1 + x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} \pm \cdots.$$
²⁰Joe calls this by the acronym DUThIS.
Example 13.3 (Cumulants). The first four cumulants are:
1. κ1 = E [X] = µ, the mean.
2. κ2 = Var [X], the variance.
3. κ3 = E [(X − µ)³], the third central moment.
4. κ4 = E [(X − µ)⁴] − 3 Var [X]². This is the excess kurtosis multiplied by Var [X]².
We’ll come back to cumulants and generating functions later, when we discuss the central limit
theorem. However, we can still state some basic facts. One nice property of cumulants is that they
are easy to compute, partially due to the following fact.
Proposition 13.4 (Cumulants are additive). If X ⊥⊥ Y , then KX+Y (t) = KX (t) + KY (t).
This is a vast generalization of the fact that variances are additive for independent random
variables, as the variance is just the second cumulant. Cumulants also give you a good way to find
central moments of distributions like the Poisson, as their generating functions are simple.
Example 13.5 (Cumulants of the Poisson). The MGF of the Poisson distribution is $e^{\lambda(e^t - 1)}$, and
therefore the CGF is
$$K(t) = \lambda(e^t - 1) = \lambda\left(t + \frac{t^2}{2!} + \frac{t^3}{3!} + \cdots\right).$$
This means that all of the cumulants of the Poisson distribution are equal to λ.
Example 13.6 (Cumulants of the Normal). The MGF of the normal distribution is $e^{\mu t + \sigma^2 t^2/2}$, and
therefore the CGF is
$$K(t) = \mu t + \frac{\sigma^2 t^2}{2}.$$
Therefore, the first two cumulants are nonzero, equal to the mean µ and variance σ². The rest of
the cumulants are all zero.
Note. As an aside, it turns out that the normal distribution is the only nontrivial distribution
that has a finite number of nonzero cumulants.
In particular, any characteristic function satisfies $|\varphi_X(t)| = |E[e^{itX}]| \le E[|e^{itX}|] = 1$,
as the complex magnitude function $|x| = \sqrt{x \bar{x}}$ is convex. This should be consistent with your
intuitions about the Fourier transform, if you are familiar with that operator. We will not compute
many characteristic functions by hand in this course, as the integrals can involve complex analysis
machinery like the residue theorem. Still, it can be useful to see some examples.
Example 13.7 (Characteristic of Cauchy). The characteristic function of the Cauchy distribution,
with PDF f (x) ∝ 1/(1 + x²), is
$$\varphi_X(t) = e^{-|t|}.$$
This should remind you of the PDF of the Laplace distribution. Indeed, the characteristic function
of the Laplace distribution is also a scaled version of the Cauchy PDF; this is a consequence of the
inversion property of the Fourier transform.
Example 13.8 (Characteristic of Normal). The characteristic function of the normal distribution,
where X ∼ N (µ, σ²), is
$$\varphi_X(t) = M_X(it) = e^{i\mu t - \sigma^2 t^2/2}.$$
Note that the above is a slight abuse of notation, as the moment generating function has real
domain, but it works out anyway if we pretend that it’s a Laplace transform and extend to C.
Definition 13.10 (Covariance matrix). Given a random vector Y, we define the covariance matrix
to be the n × n matrix of variances and covariances between pairwise components. In other words,
$$\operatorname{Cov}(\mathbf{Y}, \mathbf{Y}) = E\left[(\mathbf{Y} - E[\mathbf{Y}])(\mathbf{Y} - E[\mathbf{Y}])^T\right] = \begin{pmatrix} \operatorname{Var}[Y_1] & \operatorname{Cov}(Y_1, Y_2) & \cdots & \operatorname{Cov}(Y_1, Y_n) \\ \operatorname{Cov}(Y_2, Y_1) & \operatorname{Var}[Y_2] & \cdots & \operatorname{Cov}(Y_2, Y_n) \\ \vdots & \vdots & \ddots & \vdots \\ \operatorname{Cov}(Y_n, Y_1) & \operatorname{Cov}(Y_n, Y_2) & \cdots & \operatorname{Var}[Y_n] \end{pmatrix}.$$
Proposition 13.11 (Semidefinite covariance matrix). The covariance matrix is always positive
semidefinite. In other words, it is symmetric, and all eigenvalues are nonnegative.
We can also define Cov(X, Y) similarly. One can show that this is a bilinear operator.
Proposition 13.12 (Affine transformation). If Y is a random vector, A is a fixed matrix, and b is
a fixed vector, then
$$E[A\mathbf{Y} + b] = A\,E[\mathbf{Y}] + b, \qquad \operatorname{Cov}(A\mathbf{Y} + b,\, A\mathbf{Y} + b) = A \operatorname{Cov}(\mathbf{Y}, \mathbf{Y})\, A^T.$$
The proof of this is left as an exercise. Generally, you want to stick to vector and matrix
notation whenever possible when proving facts about random vectors, as it will make arguments
much cleaner (and more natural). You should avoid explicit sums over bases whenever possible.
With all of this machinery for talking about multivariate distributions, it’s time to actually
create some instances. The nicest multivariate distribution is unambiguously the multivariate
normal.21 It turns out that there are many ways you might attempt to construct a multivariate
normal, but all of them will end up generating the same distribution.
Definition 13.14 (Matrix square root). If Σ ⪰ 0, then there exists at least one matrix A such
that Σ = Aᵀ A = A Aᵀ. In general, there can exist many A, but they will all be equivalent up to
multiplication by an orthogonal matrix.
One can construct matrix square roots explicitly by using the Cholesky decomposition algorithm.
Definition 13.15 (Multivariate normal). Given µ ∈ Rⁿ and Σ ⪰ 0 with matrix square root A, the
multivariate normal distribution N (µ, Σ) is the distribution of
X = AZ + µ,
where Z = (Z1 , . . . , Zn ) has i.i.d. N (0, 1) entries.
Note. Observe that since the standard multivariate normal is rotationally symmetric, multiplying
Z by any orthogonal rotation matrix does not affect the joint distribution. This means that the
above definition is unambiguous with respect to the matrix square root, and multivariate normals
are indeed characterized by their covariance matrices.
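Concretely, sampling from N (µ, Σ) via a Cholesky square root looks like the following sketch (our own illustration; µ and Σ are arbitrary choices):

import numpy as np

rng = np.random.default_rng(6)
mu = np.array([1.0, -2.0])
cov = np.array([[2.0, 0.6],
                [0.6, 1.0]])
a = np.linalg.cholesky(cov)            # lower-triangular A with A @ A.T == cov
z = rng.standard_normal((100_000, 2))  # rows of i.i.d. N(0, 1) entries
x = z @ a.T + mu                       # each row is A z + mu
print(np.cov(x, rowvar=False))         # approx cov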
Another interesting fact is that within a multivariate normal distribution, uncorrelated implies
independent. This can be most easily shown through representations.
²¹The second-nicest one is the multinomial distribution, but this isn’t too much different from the binomial.
38
14 October 21st, 2020
Today we will finish discussing some useful properties of the multivariate normal. The midterm is
on Thursday (so no class then), and it will be timed. Generally the problems will require some tricky
thinking, but there should always be a clever solution that does not require tedious calculation.
In particular, tᵀX occurs in the exponent, so the values of the characteristic function are completely determined by the marginal distributions of projections of X.
Example 14.2 (MGF of multivariate normal). Recall that the moment generating function of a univariate N(µ, σ²) normal distribution is

e^{tµ + σ²t²/2}.

Each projection of a multivariate normal is also normal, so we can compute the joint MGF from the univariate MGF.
Example 14.3 (Closure properties of MVN). The multivariate normal distribution has many nice
closure properties, such as:
• If you take a linear combination or shift of multivariate normals, it is also multivariate normal.
It turns out that these closure properties are really useful for applications like Kalman filtering
(Branislav’s favorite!), where we can exactly compute posteriors due to closure.
Here’s a really important fact about multivariate normal distributions.
Proposition 14.4. Within a multivariate normal distribution, consider any two (possibly vector)
projections Y1 and Y2 . Then, if Y1 and Y2 are uncorrelated, they are also independent.
Proof. Consider the multivariate normal random vector Y = (Y1, Y2)ᵀ ∼ N(µ, V). We have the block matrix

V = \begin{pmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{pmatrix},

where V11 and V22 are the covariance matrices of Y1 and Y2, respectively. Now we can simply observe that V12 = Cov(Y1, Y2) = 0 and V21 = Cov(Y2, Y1) = 0, which is the assumption we made about the vectors being uncorrelated. Then, the matrix is block diagonal and factorizes into a direct sum of invariant subspaces, as desired.
Proposition 14.5. Suppose that Y = (Y1, Y2)ᵀ is multivariate normal with covariance matrix

V = \begin{pmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{pmatrix}.

Also assume that E[Y1] = µ1 and E[Y2] = µ2. Then,

Y2 | Y1 ∼ N(µ2 + V21 V11⁻¹ (Y1 − µ1), V22 − V21 V11⁻¹ V12).

In particular, the conditional distribution is still normal, its mean is linear with respect to Y1, and its variance is constant! This is related to the formulas from linear regression.
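As a quick numerical sketch (assuming NumPy; the helper function mvn_conditional and its parameter names are hypothetical, not from the book):

```python
import numpy as np

def mvn_conditional(mu1, mu2, V11, V12, V21, V22, y1):
    """Mean and covariance of Y2 | Y1 = y1, per Proposition 14.5."""
    # Solve V11 x = (y1 - mu1) rather than forming V11^{-1} explicitly.
    cond_mean = mu2 + V21 @ np.linalg.solve(V11, y1 - mu1)
    cond_cov = V22 - V21 @ np.linalg.solve(V11, V12)
    return cond_mean, cond_cov

# Tiny example with 1x1 blocks: a correlation-rho bivariate standard normal.
rho = 0.8
m, C = mvn_conditional(np.zeros(1), np.zeros(1),
                       np.eye(1), rho * np.eye(1), rho * np.eye(1), np.eye(1),
                       np.array([1.0]))
print(m, C)  # mean rho * y1 = 0.8, variance 1 - rho^2 = 0.36
```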
We’re going to find T in a slightly different way. For each color i ∈ {1, . . . , n}, let Ti be the time
when we complete the pair of socks with color i, so that T = min(T1 , . . . , Tn ). Note that each
individual sock is independent, so all of the Ti are i.i.d. distributed
√ according to the maximum of
two independent uniforms. Recall that √ this is ∼
√ Beta(2,p 1) ∼ Unif.
We can then write that T = min( U1 , . . .p , Un ) = min(U1 , . . . , Un ), where the Ui are jointly
distributed as i.i.d. uniform. Therefore, T ∼ Beta(1, n). We can verify by LOTUS that
E[T] = ∫₀¹ x^{1/2}(1 − x)^{n−1} / B(1, n) dx
     = B(3/2, n) / B(1, n)
     = Γ(3/2)Γ(n + 1) / (Γ(1)Γ(n + 3/2))
     = n! / ((3/2)(5/2) · · · ((2n + 1)/2))
     = 4^n (n!)² / ((2n)!(2n + 1)).
Therefore, matching up our two expressions for E[T], we get E[T] = 4^n (n!)² / ((2n)!(2n + 1)).
Note that the above problem does not appear to be related to continuous distributions at all (it is very combinatorial), yet we found a very natural solution by using a continuous embedding!
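A quick Monte Carlo sanity check of the closed form (a sketch assuming NumPy; n = 5 is an arbitrary choice):

```python
import math
import numpy as np

rng = np.random.default_rng(1)
n = 5

# T = sqrt(min of n i.i.d. uniforms), i.e., T ~ sqrt(Beta(1, n)).
samples = np.sqrt(rng.uniform(size=(1_000_000, n)).min(axis=1))
print(samples.mean())

# Closed form: 4^n (n!)^2 / ((2n)! (2n + 1)).
print(4**n * math.factorial(n)**2 / (math.factorial(2 * n) * (2 * n + 1)))
```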
15 October 27th, 2020
Today marks a little over the halfway point of the course, as we’re done with the midterm. Grading
will take some time due to logistics. We’ll begin Chapter 9 today, on inequalities, and Chapters 10–
14 are all posted on Canvas.24 Overall, the chapters are fairly short, but they’re also analysis-heavy
and packed with information.
In many problems, approximation can be useful. For example, we can make a “linear” or “quadratic” approximation, but these asymptotic bounds don't tell us exactly how close we are to the answer. Convergence in the n → ∞ case isn't immediately applicable when discussing what a distribution looks like when n = 5, or even n = 30.
We are going to develop a few inequalities, which we can apply to make statements like “p is within ε of the true value.” Let's get started with one of the most famous inequalities in math.[25]

[24] This includes things like exponential families and natural exponential families, convergence theorems, the central limit theorem, and martingales. If we had more time, Joe mentions that he would have also liked to discuss Markov chains. Those are covered in Stat 212 and Stat 171.
[25] For more on this, Joe recommends J. Michael Steele's book The Cauchy-Schwarz Master Class.

Proposition 15.1 (Cauchy–Schwarz). For any random variables X and Y with finite second moments,

E[XY]² ≤ E[X²] E[Y²].

Proof. This proof is particularly nice, though slightly algebra-heavy. The key idea is that variances are sums of squares, so for any value of β ∈ ℝ,

E[(Y − βX)²] ≥ 0.

This is an infinite family of inequalities. We can take the derivative to find the value of β that gives us the strongest bound. This is a neat problem-solving idea because we added complexity with this
additional variable, but it actually makes the solution easier. In any case, the optimal value of β
is given by the projection of Y onto X in the Hilbert space, which is
β = ⟨X, Y⟩/⟨X, X⟩ = E[XY]/E[X²].
Substituting this into the inequality and expanding yields

E[Y²] + β² E[X²] ≥ 2β E[XY],
E[X²] E[Y²] / E[XY] + E[XY] ≥ 2 E[XY],
E[X²] E[Y²] ≥ E[XY]².

The result follows after assuming, without loss of generality, that X and Y are nonnegative.
In the probability setting, this means that we can bound E[XY], a two-variable expectation, by the product of the second moments of the marginal distributions. Marginal distributions are often easier to calculate.
Corollary 15.1.1 (Covariance inequality). For any random variables X and Y, we have

|Corr(X, Y)| = |Cov(X, Y)| / √(Var[X] Var[Y]) ≤ 1.

Proof. This corollary is almost equivalent to Cauchy-Schwarz, but it admits a particularly elegant direct proof. Assume without loss of generality that X and Y are standardized to have mean 0 and variance 1, and let ρ = Corr(X, Y). Since variances are nonnegative,

0 ≤ Var[X ± Y] = Var[X] + Var[Y] ± 2 Cov(X, Y) = 2 ± 2ρ,

so −1 ≤ ρ ≤ 1.

Proposition 15.2 (Markov's inequality). For any random variable Y and any a > 0,

P(Y ≥ a) ≤ E[|Y|] / a.
Proof. This is an extremely crude bound. Observe that

a · I_{Y ≥ a} ≤ |Y|.

This is obviously true: a · I_{Y ≥ a} is simply equal to a when Y ≥ a and to 0 when Y < a. Now we can take expectations on both sides, and the result immediately follows.
Although Markov’s inequality seems really obvious, it’s the starting point for pretty much all
concentration bounds, as it makes very few assumptions about the random variable Y . If we
additionally assume that Y has a second moment, then we can extend our bound slightly.
Proposition 15.3 (Chebyshev's inequality). If Y is a random variable with finite variance, then

P(|Y − E[Y]| ≥ c) ≤ Var[Y] / c².
Proof. Apply Markov’s inequality to the random variable X = (Y − E [Y ])2 .
Note that even though Chebyshev’s inequality is a trivial extension of Markov, you often get
much better tail bounds using it, as they are quadratic in the deviation. This idea of applying an
increasing function (such as x 7→ x2 ) to both sides of Markov’s inequality can be used to get even
better tail bounds in general, such as the celebrated Chernoff bound.26
Proposition 15.4 (Chernoff bound). Let Y be a nonnegative random variable, and let t > 0 be a constant. Then,

P(Y ≥ a) = P(e^{tY} ≥ e^{ta}) ≤ E[e^{tY}] / e^{ta},

where the last step follows by Markov's inequality.
Notice the coincidental appearance of the moment generating function M_Y(t) = E[e^{tY}] above.
This means that for Chernoff bounds to be applied, you essentially need all of the moments to be
defined. Intuitively, it is the limit case of many concentration inequalities based on moments, as
it makes a strong assumption of the MGF existing. The Chernoff bound is also intuitively useful
because it lets you optimize for any value of t by taking the derivative.
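Here is a small numerical sketch of that optimization (assuming NumPy; the standard normal and a = 3 are our example choices): for Y ∼ N(0, 1), M_Y(t) = e^{t²/2}, and minimizing the bound over t gives t = a and the bound e^{−a²/2}.

```python
import numpy as np

a = 3.0
ts = np.linspace(0.01, 10, 2000)

# Chernoff bound exp(t^2/2 - t*a) for a standard normal, as a function of t.
bounds = np.exp(ts**2 / 2 - ts * a)

print(ts[bounds.argmin()])   # optimal t, approximately a = 3
print(bounds.min())          # approximately exp(-a^2 / 2)
print(np.exp(-a**2 / 2))
```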
Proof. Jensen’s inequality is interesting because it does not rely on smoothness properties of g,
and it also works in any number of dimensions. A proof is given in the book using the supporting
hyperplane theorem for convex sets.
Definition 15.6 (p-norms of random variables). The Lᵖ norm of a random variable X, for some fixed p ≥ 1, is defined by

‖X‖_p = E[|X|ᵖ]^{1/p}.

This is a valid norm for two reasons. First, if ‖X‖_p = 0, then X is almost surely zero. Second, the norm satisfies the triangle inequality, which is a fact called Minkowski's inequality.
Now let’s ask the question of how the r-norm compares to the s-norm, when 1 ≤ r < s. The
following result actually holds for any values of r and s, including negative values and zero (in the
limit, which is called the geometric mean).
[26] Named after Herman Chernoff, who is faculty emeritus at Harvard.
Proposition 15.7 (Monotonicity of norms). If 1 ≤ r < s, then

‖X‖_r ≤ ‖X‖_s.

Proof. This follows from Jensen's inequality on the convex function x ↦ x^{s/r}. Assume without loss of generality that X is nonnegative. Then X^r is also nonnegative, so

E[(X^r)^{s/r}] ≥ (E[X^r])^{s/r} ⟹ E[X^s]^{1/s} ≥ E[X^r]^{1/r}.
Finally, we write down one of the most famous and classical inequalities, which is a special case of the power-mean inequality mentioned above!

Proposition 15.8 (Weighted AM-GM). For nonnegative reals x1, . . . , xn and weights w1, . . . , wn ≥ 0 with w1 + · · · + wn = 1,

w1 x1 + · · · + wn xn ≥ x1^{w1} · · · xn^{wn}.

The left-hand side is called the (weighted) arithmetic mean, and the right-hand side is called the (weighted) geometric mean. Equality holds if and only if x1 = x2 = · · · = xn.

Proof. Assume without loss of generality that the xi are distinct. Let W be a random variable supported on {x1, . . . , xn}, with P(W = xi) = wi for each i. By Jensen's inequality on the concave function log,

∑_{i=1}^n wi log xi = E[log W] ≤ log E[W] = log(∑_{i=1}^n wi xi).

Exponentiating both sides gives the result.
Corollary 15.8.1 (Young's inequality). In the special n = 2 case of weighted AM-GM, we have

a^p b^q ≤ pa + qb,

where a, b, p, q ≥ 0 and p + q = 1.
16 October 29th, 2020
Today we continue our discussion of inequalities and norms.
Hopefully that was an inspiring, short proof of a classic inequality in analysis. Hoping to outdo
himself, Joe will now attempt to present an even more inspiring proof of another inequality.
Proposition 16.2 (Nonnegative covariance27 ). If g and h are non-decreasing functions, then
Cov(g(X), h(X)) ≥ 0.
Proof. The key idea is to choose i.i.d. X1 , X2 ∼ X. Then, observe that
(g(X1 ) − g(X2 ))(h(X1 ) − h(X2 )) ≥ 0.
What happens when we take the expectation of the above expression? Well,

E[(g(X1) − g(X2))(h(X1) − h(X2))]
= E[g(X1)h(X1)] − E[g(X1)h(X2)] − E[g(X2)h(X1)] + E[g(X2)h(X2)]
= 2 E[g(X)h(X)] − 2 E[g(X)] E[h(X)]
= 2 Cov(g(X), h(X)).

The result follows from the nonnegativity of that expression.
[27] This is a special case of the FKG inequality in correlation theory. Amusingly enough, it's also a continuous version of Chebyshev's sum inequality from olympiad mathematics — even having essentially the same proof!
16.2 Convergence and the Borel-Cantelli Lemma
Recall in an earlier lecture that we introduced the notions of almost-sure convergence (Defini-
tion 11.2) and convergence in probability (Definition 11.1). The first is stronger than the second.
Let’s rigorously introduce one more useful notion of convergence, the weakest so far.
Definition 16.3 (Convergence in distribution). We say Xn → X in distribution if F_{Xn}(x) → F_X(x) as n → ∞, at every point x where F_X is continuous.
In some sense, convergence in distribution is much weaker than the other two, as it only talks about the marginal distributions of the random variables. Meanwhile, convergence in probability is only slightly weaker than almost sure convergence.
Example 16.4 (Convergence in distribution but not in probability). Consider an infinite sequence of i.i.d. U, U1, U2, . . . ∼ Unif. Then clearly Un → U in distribution, as all of the marginal distributions are the same (uniform). However, Un does not converge to U in probability, since for any 0 < ε < 1/2,

P(|Un − U| > ε) ≥ 1 − 2ε.
Example 16.5 (Convergence in probability but not almost surely). Let Xn ∼ Bern(1/n), and assume that all of them are independent. Then X1, X2, . . . → 0 in probability, since for any 0 < ε < 1,

P(|Xn| > ε) = P(Xn = 1) = 1/n → 0.
How can we show that the above example does not converge almost surely? It’s true that in
the sequence X1 , X2 , . . ., the 1 values get rarer and rarer as n → ∞. If there is a finite number of
1’s, then we have almost sure convergence, but if there are an infinite number of 1’s, then we do
not have convergence.
With this motivation in mind, we will now deliver a two-part lemma that very elegantly describes the above as a dichotomy — useful both for proving and disproving almost sure convergence.

Proposition 16.6 (Borel–Cantelli). Let A1, A2, . . . be events, and let p = P(lim sup_{n→∞} An) be the probability that infinitely many of the An occur.
1. (Borel–Cantelli lemma).[28] If ∑_{n=1}^∞ P(An) < ∞, then p = 0.
2. (Second Borel–Cantelli lemma). If the An are independent and ∑_{n=1}^∞ P(An) = ∞, then p = 1.
The second lemma immediately shows why Example 16.5 does not have almost sure convergence, as the harmonic series diverges. However, if we had changed it slightly to Xn ∼ Bern(1/n^{1.001}) instead, it would converge almost surely by the first Borel–Cantelli lemma.
[28] This first version of the lemma also holds when the An are not necessarily independent.
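To make the dichotomy concrete, here is a finite-horizon simulation sketch (assuming NumPy; we swap in Bern(1/n²) as the convergent example, since the partial sums of 1/n^{1.001} converge far too slowly to distinguish at any practical horizon):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 10_000_000
n = np.arange(1, N + 1)

# X_n ~ Bern(1/n): the probabilities sum to infinity, so infinitely many 1's occur.
print((rng.random(N) < 1 / n).sum())      # roughly log(N), and grows with N

# X_n ~ Bern(1/n^2): the probabilities sum to pi^2/6, so only finitely many 1's occur.
print((rng.random(N) < 1 / n**2).sum())   # typically 1 or 2, regardless of N
```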
17 November 3rd, 2020
Last time, we were talking about convergence. Let’s pick up where we left off.
Exercise 17.1. Construct an infinite, random sequence of coin tosses such that for any n consecutive coin tosses, the probability of all n tosses coming up heads is 1/(n + 1).
Now let’s prove the Borel-Cantelli lemma. Joe mentions that this is interesting not just for
completeness, but also because it illustrates many useful ideas in analysis — short yet instructive.
Proof of Proposition 16.6. We’ll prove each of the two parts separately.
1. Assume that ∑_{n=1}^∞ P(An) < ∞. Then, by the definition of lim sup and a union bound,

P(lim sup_{n→∞} An) = P(⋂_{n≥1} ⋃_{m≥n} Am) ≤ P(⋃_{m≥n} Am) ≤ ∑_{m≥n} P(Am).
This is the tail of the series, but by the definition of convergence of an infinite series, its partial
sums must converge. Therefore, the tail of the series P (A1 ) + P (A2 ) + · · · must converge to
zero, so we conclude.
2. Our strategy in this case will be slightly different. Instead of trying to directly prove that
something will happen infinitely often, we’re going to show that the complement (event hap-
pens finitely often) has zero probability. In other words, we want
P(⋃_{n≥1} ⋂_{m≥n} Am^C) = 0.
A useful fact from measure theory is that the countable union of measure-zero sets also has
measure zero. Therefore, it’s equivalent to show that the inner intersection has measure zero
for any n. Since the Am are independent, we have
P(⋂_{m≥n} Am^C) = ∏_{m=n}^∞ P(Am^C) = ∏_{m=n}^∞ (1 − P(Am)) ≤ e^{−∑_{m=n}^∞ P(Am)} = 0,

where the inequality uses 1 − x ≤ e^{−x}, and the final expression is zero because the series ∑ P(Am) diverges. This completes the proof.
The next topic is an example of a zero-one law similar to the Borel-Cantelli lemma.
Proposition 17.1 (Kolmogorov zero-one law). Let A1 , A2 , . . . be independent events. Recall that
we can generate a σ-algebra from a collection of sets (i.e., events) by taking the smallest σ-algebra
containing those events. Then, the “tail field” of the Ai is
A = ⋂_{n=1}^∞ σ(An, An+1, An+2, . . .).
You can think of the tail field as the set of events that depend only on the limiting tail of the event sequence. Then, for any A ∈ A, we have P(A) ∈ {0, 1}.
Proof. Omitted, but the key idea in this proof is very “cute” — it is to show that A ⊥⊥ A.
This generalizes part of the Borel–Cantelli lemma, since lim sup_{n→∞} An is an example of something that depends only on the limiting values of the An, so it is in the tail field.
This is zero precisely when Xn → X converges almost surely. By our first inequality above, we
have proven the proposition.
Next on our menu is a theorem that Joe calls both “beautiful and useful,” which lets you go
from convergence in distribution back to convergence in probability. However, there has to be a
catch, since convergence in distribution is obviously weaker. We will need to move to a different
probability space.
Proposition 17.3 (Skorokhod’s representation theorem). Suppose that Xn → X in distribution.
Then, there exists a new probability space (Ω∗ , F ∗ , P ∗ ), with random variables Xn∗ , X ∗ : Ω∗ → R,
such that Xn∗ ∼ Xn , X ∗ ∼ X, and Xn∗ → X ∗ almost surely.
Proof. The proof is omitted because of hard technical details. However, in principle, the key intuition is that you can apply the probability integral transform (PIT) to all of the Xn variables, which couples them to the same uniform. This fixes the issue where the Xn may be totally independent of each other in a sequence that converges in distribution.
Skorokhod’s theorem is somewhat of a useful hammer. One neat application is that you can
really easily prove the “in distribution” case of the continuous mapping theorem, by reducing it to
the “almost sure” case using Skorokhod.
18 November 5th, 2020
Today we will start talking about asymptotics. In other words, how do distributions change in
some limit where their parameters go to infinity? Some of these theorems are quite beautiful,29
but we will specifically focus on facts that have practical applications.
• Taylor’s theorem: Taylor approximations are also called the Delta method in statistics.
For the rest of the semester, we will focus on a few high-level goals. One topic is natural exponential
families, which unify a lot of distributions that we’ve seen this semester.30 This includes the special
NEF-QVF families. We’ll also talk about martingales, which are useful for concentration bounds
and for modeling financial markets.
Proposition 18.2 (Slutsky's theorem). Assume that we have two sequences of random variables X1, X2, . . . and Y1, Y2, . . ., not necessarily independent, such that Xn → X and Yn → c in distribution, where c is a constant. Then,
• Xn + Yn converges in distribution to X + c,
• Xn − Yn converges in distribution to X − c,
Proof. This is a somewhat technical fact from analysis, so we omit the proof.
Proposition 18.3 (Delta method). Assume that you have a sequence of random variables T1, T2, . . . such that √n(Tn − θ0) → Z in distribution, where θ0 is some constant. If g is a real function that is continuously differentiable (C¹) at θ0, then

√n(g(Tn) − g(θ0)) → g′(θ0) Z

in distribution. In particular, the special case Z ∼ N(0, 1) is particularly nice because of connections with the central limit theorem.
[29] Joe cites the law of the iterated logarithm as an example.
[30] Around this time, Carl Morris walked into our class and said hello. He is the “originator” of the NEF.
Proof. The proof uses the mean value theorem, which tells us that g(Tn) − g(θ0) = g′(θ̃n)(Tn − θ0) for some θ̃n between Tn and θ0; one then shows that g′(θ̃n) → g′(θ0) in probability and applies Slutsky's theorem.
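As a quick simulation sketch (assuming NumPy; the choice of Tn as the sample mean of Exp(1) draws and g(x) = x² is ours): here √n(Tn − 1) → N(0, 1), so the delta method predicts √n(g(Tn) − g(1)) → N(0, g′(1)²) = N(0, 4).

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 10_000, 100_000

# T_n = mean of n i.i.d. Exp(1) draws; equivalently Gamma(n, 1) / n.
Tn = rng.gamma(shape=n, scale=1.0, size=reps) / n

# theta_0 = 1, g(x) = x^2, g'(1) = 2, so sqrt(n)(T_n^2 - 1) is approximately N(0, 4).
W = np.sqrt(n) * (Tn**2 - 1.0)
print(W.mean(), W.var())  # approximately 0 and 4
```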
Definition 18.4 (Natural exponential family). A natural exponential family with natural parameter η is a family of distributions with CDFs Fη taking the form

dFη(y) = e^{ηy − ψ(η)} dF0(y).

We give the condition that F0(y) does not depend on η; in particular, it's just the η = 0 case.
Here, we can see that ψ(η) is a normalizing factor for the rest of the density. The rough idea is
that we just shift probabilities by weighting with pointwise multiplication by some exponential of
the value. In particular,
∫ dFη(y) = ∫ e^{ηy − ψ(η)} dF0(y) = 1 ⟹ e^{ψ(η)} = ∫ e^{ηy} dF0(y) = E_{Y∼F0}[e^{ηY}].
The last step above follows from LOTUS. In particular, this means that ψ(t) is just the cumulant
generating function of Y ∼ F0 . It’s also easy to show that the cumulant generating function of Fη
for any η is ψ(t + η) − ψ(η). For this reason, we call ψ the cumulant function.
Example 18.5. The binomial distribution Bin(n, p) is a natural exponential family for any fixed value of n, where we vary p. In this case, the natural parameter is given by the logit function logit(p) = log(p/(1 − p)).
Example 18.6. The normal distribution with unit variance, N (µ, 1), is a natural exponential
family with natural parameter µ and cumulant function µ2 /2. Notice how this aligns with the
cumulant generating function of the standard normal N (0, 1), which is t2 /2.
Another useful fact, which falls out of the cumulant function, is that if we let ψ′(η) = µ and ψ″(η) = σ², then Fη ∼ [µ, σ²], i.e., Fη has mean µ and variance σ². Note that since variances are positive (except in the degenerate case), this tells us that ψ′(η) = µ is a strictly increasing function of η, so we can invert it.
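As a concrete numeric sketch (assuming NumPy; the base choice F0 = Pois(1) and η = 0.7 are ours): tilting Pois(1) by η gives Pois(e^η), with ψ(η) = e^η − 1 and ψ′(η) = ψ″(η) = e^η, so the mean and variance of Fη are both e^η.

```python
import math
import numpy as np

eta = 0.7
y = np.arange(60)
p0 = np.array([math.exp(-1) / math.factorial(k) for k in y])  # Pois(1) pmf

# Exponential tilting: p_eta(y) is proportional to e^{eta y} p0(y),
# and the normalizing constant is e^{psi(eta)}.
w = np.exp(eta * y) * p0
psi = math.log(w.sum())
p_eta = w / w.sum()

mean = (y * p_eta).sum()
var = ((y - mean) ** 2 * p_eta).sum()
print(psi, math.exp(eta) - 1)       # cumulant function: e^eta - 1
print(mean, var, math.exp(eta))     # mean = variance = e^eta, so V(mu) = mu here
```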
Definition 18.7 (Variance function). The variance function of a natural exponential family Fη is V(µ) = σ². In other words, for any η we have V(ψ′(η)) = ψ″(η).
Definition 18.8 (NEF-QVF). An NEF-QVF is a natural exponential family with a quadratic variance function, i.e., of the form V(µ) = v0 + v1µ + v2µ².
It is a theorem from Carl Morris that there are only six NEF-QVF distribution families.
19 November 10th, 2020
Today we will continue talking about NEF-QVFs and asymptotics (delta method), in preparation
for the law of large numbers and central limit theorem.
• Weak laws of large numbers deal with convergence of the sample mean in probability.
• Strong laws of large numbers deal with convergence of the sample mean almost surely.
This terminology is unique to the law of large numbers, as the central limit theorem only applies to convergence in distribution. Generally, we will see that weak laws of large numbers have a slightly weaker result, but also require fewer assumptions.[31]
[31] For practical purposes, Joe suggests using simulation to find the smallest value of n for which the distribution of the mean becomes approximately normal, i.e., X̄n ≈ N(µ, σ²/n). This is not useful for rigorous proofs, though.
Proposition 19.3 (Weak LLN, basic version). Suppose that X̄n is the mean of n i.i.d. random variables with mean µ and finite variance σ². Then, as n → ∞, X̄n → µ in probability.

Proof. By Chebyshev's inequality, for any ε > 0,

P(|X̄n − µ| ≥ ε) ≤ Var[X̄n]/ε² = σ²/(nε²).
This goes to zero as n → ∞, so we’re done.
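Here is a small simulation sketch of the statement (assuming NumPy; Unif(0, 1) samples, so µ = 1/2, are an arbitrary example choice):

```python
import numpy as np

rng = np.random.default_rng(4)

# Empirical P(|X_bar_n - mu| >= 0.01) for Unif(0, 1) samples (mu = 1/2).
for n in [100, 1_000, 10_000]:
    means = rng.uniform(size=(2_000, n)).mean(axis=1)
    print(n, (np.abs(means - 0.5) >= 0.01).mean())  # tends to zero as n grows
```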
The proof above was really simple. Let's see how we can relax the assumptions a bit, by being a little more sophisticated in our argument. In particular, what if the random samples were not independent? One strategy to deal with this is to consider

Var[X̄n] = (1/n²) ∑_{i,j} Cov(Xi, Xj).
If the above value goes to zero as n → ∞, then we have an equivalent to the weak law of large
numbers. However, sometimes this strategy does not work, and we need to try something else.
For example, consider the characteristic function, which always exists and can be approximated by
derivatives. It turns out that with a linear approximation (first term) of the characteristic function,
we get LLN, and with a quadratic approximation, we get CLT.
In the case when random variables are guaranteed to be independent, we can prove incredibly
strong LLNs and CLTs due to the multiplicativity of the characteristic function. However, the
dependent case is harder, and we might give a couple of examples, later on, where we relax the
independence assumptions.
Proposition 19.4 (Strong LLN). Assume that X1, X2, . . . are i.i.d., with E[Xj] = µ and a finite first absolute moment, i.e., E[|Xj|] < ∞. Then X̄n → µ almost surely.
The above version of the strong LLN is hard to prove and fairly technical. This is because
it only assumes first moments. For now, we will prove an easier version with a different set of
assumptions — more moments, but also not necessarily i.i.d. this time.
Proposition 19.5 (Strong LLN, fourth moments). Assume that the Xj are independent with mean zero, and that E[Xj⁴] ≤ b < ∞ for some bound b. Then X̄n → 0 almost surely.
Proof. By the Borel–Cantelli lemma (Proposition 16.6), it suffices to check that for any ε > 0,

∑_{n=1}^∞ P(|X̄n| > ε) < ∞.

However, we have P(|X̄n| > ε) = P(X̄n⁴ > ε⁴), so applying Markov's inequality tells us that

∑_{n=1}^∞ P(|X̄n| > ε) ≤ (1/ε⁴) ∑_{n=1}^∞ E[X̄n⁴].
This is fair enough, but how do we get rid of the fourth moment of the mean? One way to deal with this is by brute-forcing through the multinomial theorem on X̄n = (1/n)(X1 + · · · + Xn). However, a nicer approach is to use cumulants, which are additive. Note that

E[X̄n⁴] = κ4(X̄n) + 3 Var[X̄n]² = (1/n⁴)(κ4(X1) + · · · + κ4(Xn) + 3(Var[X1] + · · · + Var[Xn])²).
Now we just need to bound κ4(Xj) and Var[Xj] for each j. This turns out to be very simple. First, the fourth cumulant is at most the fourth moment (smaller by three times the squared variance), so κ4(Xj) ≤ E[Xj⁴] ≤ b. Also, by Jensen's inequality, Var[Xj]² ≤ E[Xj⁴] ≤ b, so Var[Xj] ≤ √b. Therefore, the last summation above is bounded by

(1/ε⁴) ∑_{n=1}^∞ E[X̄n⁴] ≤ (1/ε⁴) ∑_{n=1}^∞ (b/n³ + 3b/n²).

This summation converges because sums of 1/n^s are finite for s > 1, so we are done.
Note. Even when b is not bounded by a constant, we can still use the argument above as long as
the summation converges, i.e., when b = o(n). For example, b = n0.999 would work just as well.
It’s instructive to ask why we need finite fourth moments in the law of large numbers above,
versus two or six or some other number. This is because only having finite variances gives you
linear falloff similar to the above, and the harmonic series diverges, so we end up on the wrong side
of Borel-Cantelli for almost sure convergence.
20 November 12th, 2020
Today we will discuss the central limit theorem.
1. Cumulants: Suppose that X1, X2, . . . are i.i.d. with mean 0 and variance 1. Then, the r-th cumulant of the sum of these variables, divided by √n, is

κ_r((X1 + · · · + Xn)/√n) = (n/n^{r/2}) κ_r(X1).

This just follows from the additivity of cumulants. Notice that because of the exponent, this fraction approaches 0 as n → ∞ for any r > 2. Therefore, we should expect the limiting distribution to have all cumulants beyond the first two equal to zero, which makes it a normal distribution!
2. Entropy: The normal distribution maximizes entropy for a given mean and variance. When
you add independent random variables together, their entropy increases, which is a statistical
analogue of the second law of thermodynamics. Andrew Barron has a paper in The Annals
of Probability where he proves CLT using an entropy-type argument.
3. Stability: Let Sn = X1 + · · · + Xn, and suppose that Sn/√n converges in distribution to some distribution Z. Why must Z be normal? Well, note that in convergence, we can replace n by 2n, so

S_{2n}/√(2n) = (X1 + · · · + Xn)/√(2n) + (X_{n+1} + · · · + X_{2n})/√(2n) → Z in distribution.

However, we can also write the second expression above as converging in distribution to Z1/√2 + Z2/√2, where Z1 and Z2 are i.i.d. ∼ Z. Since a sequence can't converge to two different distributions, these must be the same, so

Z ∼ (Z1 + Z2)/√2.
This is a stable law, and the only stable law with finite variance is the normal distribution.
Actually, although we only promised to give some intuition above, we can also formalize the third
point to produce a rigorous proof as well. First, a quick lemma.
Lemma 20.1 (Taylor approximation for characteristic function). If X is a random variable with finite m-th moment E[|X|^m] < ∞, and X has characteristic function ϕ, then as t → 0,

ϕ(t) = ∑_{k=0}^m (it)^k E[X^k]/k! + o(|t|^m).
Proof. This follows almost immediately from the Peano form of the Taylor series remainder for ex .
The only slight hiccup is that we need to apply dominated convergence, due to the expected value.
This is also why we need to assume finite m-th moment.
Proposition 20.2 (Stable law with finite variance). Let Z1, Z2 be i.i.d. random variables with mean 0 and variance 1. If Z1 + Z2 ∼ √2 Z1, then Z1 ∼ N(0, 1).

Proof. We'll turn the condition into a functional equation of the characteristic function. Let ϕ be the characteristic function of Z1. Then, the characteristic function of (Z1 + Z2)/√2 ∼ Z1 is

E[e^{it(Z1 + Z2)/√2}] = ϕ(t/√2)² = ϕ(t).
By iterating this functional equation, we get

ϕ(t) = ϕ(t/√2)² = ϕ(t/(√2)²)^{2²} = · · · = ϕ(t/2^{n/2})^{2^n}.

Since κ1(Z1) = 0 and κ2(Z1) = 1, we have by Lemma 20.1 that

lim_{n→∞} ϕ(t/2^{n/2})^{2^n} = lim_{n→∞} (1 − t²/(2 · 2^n) + o(t²/2^n))^{2^n} = e^{−t²/2}.
Therefore, by uniqueness of characteristic functions (Fourier transform), we conclude.
Therefore, from the intuition at the start of this section, this stable law immediately implies
the basic result of the central limit theorem itself.
Proposition 20.3 (Classical CLT (Lindeberg–Lévy)). If X1, X2, . . . is a sequence of i.i.d. random variables with mean µ and variance σ² < ∞, then as n → ∞, we have in distribution that

((X1 + X2 + · · · + Xn) − nµ)/√n → N(0, σ²).
Definition 20.4 (UAN). The uniform asymptotic negligibility condition is that none of the n terms in Sn has a large asymptotic variance in comparison to the total variance s²n. In general, UAN holds when

u_n := max_{1≤j≤n} σ_j / s_n → 0.

We can interpret u²n as the largest fraction of the variance contributed by any single term Xj of the entire sum Sn.
It turns out that the UAN is almost a sufficient condition to prove a central limit theorem.32
As a consequence, we will primarily focus on the setting in which UAN holds, which allows us to
prove two central limit theorems that turn out to be equivalent. The first is due to Morris and
Blitzstein, while the second is very famous.
Proposition 20.5 (Fundamental bound). Define the “fundamental bound” FBn by

FB_n = ∑_{j=1}^n E[(X_j/s_n)² min(1, |X_j|/s_n)].
Proposition 20.8 (Fourth cumulant CLT). If the UAN condition holds and |κ4 (Zn )| → 0, then
Zn → N (0, 1).
Proof. This turns out to be equivalent to the r = 4 case of Lyapunov's CLT. Observe that

Lyap_{4,n} = ∑_{j=1}^n κ4(Xj/sn) + 3 ∑_{j=1}^n Var[Xj/sn]² = κ4(Zn) + 3 ∑_{j=1}^n (σj/sn)⁴.

If we assume the UAN condition, then the latter term definitely tends to zero, so we are done.
[32] In fact, the UAN condition is also almost necessary, in the sense that if any term contributes an asymptotically nontrivial portion to the variance, then the CLT only holds when that term is already normally distributed itself.
The proofs of all of these CLTs are given in the textbook. They are all pretty technical arguments
involving analysis on the characteristic function. Anyway, we can now do an illustrative example.
Example 20.9. Assume that Y1, . . . , Yn are i.i.d. ∼ [0, 1] (mean 0, variance 1), and let Sn = c1Y1 + · · · + cnYn. Then, the UAN condition can be written as

max_{1≤j≤n} c²j / ∑_{j=1}^n c²j → 0,

and it turns out that this condition alone is enough to prove that Sn/√(c²1 + · · · + c²n) converges in distribution to N(0, 1). Let's see how to do this with the κ4 method. Observe that

|κ4(Zn)| = (∑_{j=1}^n c⁴j) |κ4(Y1)| / (∑_{j=1}^n c²j)².

To bound this, we use a really neat trick. Note that c⁴j = c²j · c²j. So, we can write

∑_{j=1}^n c⁴j / (∑_{j=1}^n c²j)² ≤ (max_{1≤j≤n} c²j)(∑_{j=1}^n c²j) / (∑_{j=1}^n c²j)² = max_{1≤j≤n} c²j / ∑_{j=1}^n c²j → 0,

which is simply the UAN condition. Therefore, assuming that the fourth cumulant of the Yi exists, we are done by Proposition 20.8.
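A simulation sketch of this example (assuming NumPy; the weights cj = √j and Rademacher ±1 summands are our choices, both of which satisfy the conditions):

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 2_000, 10_000

c = np.sqrt(np.arange(1, n + 1))             # max c_j^2 / sum c_j^2 = 2/(n+1) -> 0
Y = rng.choice([-1.0, 1.0], size=(reps, n))  # i.i.d. with mean 0, variance 1

Z = (Y * c).sum(axis=1) / np.sqrt((c**2).sum())
print(Z.mean(), Z.var())   # approximately 0 and 1
print((Z <= 1.0).mean())   # approximately Phi(1) = 0.8413
```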
21 November 17th, 2020
First, some administrative information. The midterm grades will be posted tonight, and information
about the final project will be added shortly. We’re getting close to the end of the course, and today
we’ll continue discussing the central limit theorem. The last topic after this will be martingales.
Future courses to consider include Stat 171 and Stat 212.
Therefore, we’ve shown that CLT holds. The only remaining things to check are that E [Rn ] =
log n + O(1) and Var [Rn ] = s2n = log n + O(1), so we conclude by Slutsky’s theorem.
[33] This leads to some interesting behavior. For example, what's the expected number of variables before the first value greater than X1, i.e., the first record? This actually turns out to be ∞!
21.2 The Replacement Method
We’re going to now discuss an interesting method used by Lindeberg and Lyapunov in the past.
We will use this strategy to prove an i.i.d. version of the central limit theorem, but it can also be
used more generally to prove Lindeberg’s CLT.
Let X1, X2, . . . be i.i.d. random variables with mean 0 and variance 1, and let Sn = X1 + · · · + Xn. Also, suppose that we have i.i.d. standard normals Z1, Z2, . . . ∼ N(0, 1). The idea of the replacement method is to simply “install” Zj by replacing Xj in the sum, swapping in the normals one by one.[34] It turns out that each of these steps has a negligible effect, both individually and in total, on the final distribution, which implies X1 + · · · + Xn ∼ Z1 + · · · + Zn ∼ N(0, n) asymptotically.
Let’s go over each step of this argument in detail. We initially start by letting T0 = Sn , and
define Tj = Z1 + · · · + Zj + Xj+1 + · · · + Xn for each 1 ≤ j ≤ n. To show convergence in distribution, we will use an equivalent definition of convergence in terms of expectations: Xn → X in distribution if and only if E[g(Xn)] → E[g(X)] for all bounded continuous functions g.
In this equivalence, using indicator test functions recovers the original definition of convergence in distribution, but indicators create big issues at discontinuities. Therefore, we prefer to apply smoother test functions, which are continuous and simpler to use in an argument. Motivated by this, we'll show for all C³ functions g such that g, g′, g″, and g‴ are bounded that

E[g(Sn/√n) − g(Tn/√n)] → 0.
Each of the differences in this telescoping sum can be thought of as replacing Xj by Zj. Anyway, using a third-order Taylor series expansion with error term, we can show that each of these differences is bounded by O(n^{−3/2}); since there are n of them, the entire sum is O(n^{−1/2}) and vanishes as n → ∞.
[34] Joe compares this argument to the ship of Theseus thought experiment.
22 November 19th, 2020
Today we cover some central limit theorems on sums of dependent random variables, and we begin
our discussion of martingales.
The motivation for the m-dependence definition is that it can be really useful for time series
data. For example, when m = 0, we get ordinary independence. For larger values of m, we can
imagine a “horizon” of observations in the past, which can influence our future observations.
Proposition 22.3 (m-dependent CLT). Let (Xn)n be a stationary, m-dependent sequence of random variables, such that E[Xj] = µ and Var[Xj] = σ² < ∞ for all j. Then,

√n(X̄n − µ) → N(0, ν) in distribution,

where ν = σ² + 2 ∑_{k=1}^m Cov(X1, X1+k).
Proof. The full proof of this theorem is very technical. However, we will provide an outline of the
key idea, which involves an argument where we split up the series into two parts. This is called a
“big block-little block” strategy.
Choose some value k > 2m, and divide the sequence of variables X1, . . . , Xn into alternating blocks of length k − m (big blocks) and m (little blocks). The idea is that, because of m-dependence, the sums of the big blocks are i.i.d. random variables. Meanwhile, as k grows larger, the little blocks contribute a negligible amount to the total sum of the series. Therefore, after many technical details, one can show convergence by piggybacking off of the standard CLT (Proposition 20.3) for the big blocks, and concentration-bounding the little blocks.
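A quick simulation sketch of the theorem (assuming NumPy; the 1-dependent moving-average sequence Xn = (Un + Un+1)/2 with Un i.i.d. Unif(0, 1) is our example): here µ = 1/2, and ν = Var(X1) + 2 Cov(X1, X2) = 1/24 + 2/48 = 1/12.

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 2_000, 5_000

U = rng.uniform(size=(reps, n + 1))
X = (U[:, :-1] + U[:, 1:]) / 2   # stationary, 1-dependent sequence

W = np.sqrt(n) * (X.mean(axis=1) - 0.5)
print(W.var())                   # approximately nu = 1/12 = 0.0833
```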
Another common case of dependent random variables is when we sample n elements without replacement from a finite population of size N. Since we don't have replacement, the samples Y1, . . . , Yn must be dependent. We can prove finite population central limit theorems in this case, showing that Ȳ is approximately normal as N, n → ∞ while maintaining n ≪ N.
One interesting duality in the finite population case is between Ȳn and ȲN−n. The distributions of these two means, for samples of size n and N − n, have the exact same shape (just reversed). This leads to the interesting observation that as you increase the sample size in a finite population, the normal approximation gets more accurate, but once you increase it too far, the normal approximation once again decreases in accuracy, until finally you get a full census when n = N.
Other CLTs of interest are Markov chain CLTs, which are useful for proving facts about MCMC
algorithms, and martingale CLTs, which we may cover at the end of the course if time permits.
Anyway, that concludes our unit on central limit theorems!
22.2 Martingales
Now we discuss discrete-time martingales, which are a useful model of stochastic processes that
maintain a certain “fairness” property.
Definition 22.4 (Discrete-time martingale). We say that X1 , X2 , X3 , . . . is a martingale with
respect to another sequence Y1 , Y2 , Y3 , . . . if for all n,
1. (Regularity) E [|Xn |] < ∞,
2. (Measurability) Xn ∈ σ(Y1 , . . . , Yn ),
3. (Fairness) E [Xn+1 | Y1 , . . . , Yn ] = Xn .
You can more generally think of (Xn )n as a martingale with respect to the filtration F1 ⊆ F2 ⊆ · · · ,
where Fn = σ(Y1 , . . . , Yn ). This is a more general definition, but it’s also more abstract.
Note. The etymology of the word “martingale” is complicated. In the context of gambling, the martingale was a risky betting strategy where you double your bet after each loss. This would theoretically lead to a +$1 payoff if you had infinite money, but in practice, you will eventually run out of money after enough consecutive losses.
Note. There are other interesting models of stochastic processes like Brownian motion, which we won't cover in this course. Brownian motion is a special kind of continuous-time martingale that is Markov with multivariate normal increments. This gives it particularly nice properties, but it also has some weird properties like being continuous everywhere yet differentiable nowhere.
Many stochastic processes will be both Markov chains and martingales. However, in general
the Markov property (memorylessness) and martingale property are different, as martingales are
allowed to depend on all previous events.
Definition 22.5 (Submartingale and supermartingale). We call a sequence of random variables a
submartingale if the third property above is replaced by E [Xn+1 | Y1 , . . . , Yn ] ≥ Xn . On the other
hand, it is a supermartingale if E [Xn+1 | Y1 , . . . , Yn ] ≤ Xn .
Oftentimes we will just write “Xn is a martingale” without specifying the sequence Yn . The
following proposition justifies why this is unambiguous.
Proposition 22.6. If (Xn )n is a martingale with respect to (Yn )n , then (Xn )n is also a martingale
with respect to (Xn )n .
Proof. We can simply check the properties. The first two properties are trivial, while the third
property can be verified by using Adam’s law, since σ(X1 , . . . , Xn ) ⊆ σ(Y1 , . . . , Yn ).
This is a relatively simple definition, and we’ll see how it can be used to prove really nice facts
about various processes, with machinery like Doob’s optional stopping theorem, Azuma’s inequality,
Kolmogorov’s inequality, and others.
Example 22.7 (Random walks are martingales). If X1, X2, . . . is a sequence of independent random variables with mean 0, then Sn = X1 + · · · + Xn is a martingale. Similarly, if E[Xj] ≥ 0, then Sn is a submartingale, and if E[Xj] ≤ 0, then Sn is a supermartingale.
23 November 24th, 2020
Today we continue to discuss martingales and their applications. The two key theorems in this
area are the martingale convergence theorem and the optional stopping theorem.
For example, the fraction of white balls in a Pólya urn started with a white balls and b black balls is a martingale Mn, and its almost sure limit satisfies M∞ ∼ Beta(a, b).
Example 23.6 (Branching process). Suppose that you have a process that spreads to many in-
dividuals through a tree structure, such as a viral disease or a family tree. We can write down
the number of members in the process at time t as a stochastic process. Although this is not a
martingale (it’s increasing), it becomes a martingale after an appropriate rescaling.
Example 23.7 (Doob martingale). Suppose that we have a random variable Y with E [|Y |] < ∞.
Then, Zt = E [Y | Ft ] is a martingale with respect to filtration {F0 , F1 , F2 , . . .}.
Proposition 23.8 (Martingale convergence theorem). Suppose that (Mn)n is a submartingale with

sup_n E[|Mn|] ≤ c

for some constant c < ∞. Then, there exists a random variable M∞ such that Mn → M∞ almost surely, and moreover, E[|M∞|] < ∞.
Proof. This is technical but interesting, and we’ll try to cover it in the next lecture.
The intuition behind this theorem is that submartingales are similar to a monotone increasing
sequence in their convergence properties. Bounded monotone sequences must converge. Although
submartingales are not strictly increasing because they have bumpiness, these bumps are not enough
to significantly impact the convergence properties.
The next theorem is very useful for generalizing the intuition that martingales do not drift in expectation, to the case where our stopping time may be unbounded. If Mn is a martingale, then it's easy to show that E[Mt] = E[M0] for any fixed positive integer t. However, what if our stopping time is a random variable instead?

Proposition 23.9 (Optional stopping theorem). Let (Mn)n be a martingale and let T be a stopping time.[35] If either (1) T is almost surely bounded, i.e., T ≤ n almost surely for some fixed n, or (2) T is almost surely finite and |Mn| ≤ c almost surely for all n, then E[MT] = E[M0].
Example 23.10. To show that at least one of these conditions is necessary, consider the SSRW
Sn with stopping time T = inf{n ∈ N | Sn = 1}. This stopping time is almost surely finite by the
martingale convergence theorem, so we have E [ST ] = 1, but S0 = 0. An interesting conclusion is
that by the contrapositive of the optional stopping theorem, E [T ] = ∞.
[35] In other words, you can't use “psychic powers” to see into the future when deciding whether to stop at time t.
24 December 1st, 2020
This is the last week of classes. Today, we continue discussing martingales, and we’ll prove the
optional stopping theorem. This will give us some reusable tools that we can use more generally in
martingale problems.
Proof of Proposition 23.9. First consider the bounded time condition. If T ≤ n almost surely, then
we can write MT in terms of a telescoping series with indicators,
MT = M0 + ∑_{j=1}^T (Mj − Mj−1) = M0 + ∑_{j=1}^n (Mj − Mj−1) I_{T≥j},
j=1 j=1
where the second equality holds almost surely. If we show that the summation above has expectation
zero, then we’re done. To do this, we use Adam’s law to get
We can factor I_{T≥j} out of the conditional expectation because T is a stopping time, so the event {T ≥ j} = {T ≤ j − 1}^C lies in the sigma-algebra generated by events up to time j − 1, and the last equality is just the definition of a martingale. This finishes the proof for the bounded-time case.
What about the second condition, where |Mn | ≤ c almost surely for all n? For this, we will use
a technique called truncation. Let Tn = min(T, n) for all n, so Tn is clearly a bounded stopping
time. Furthermore, as n → ∞, we have Tn → T almost surely. The idea of truncation is that
we have the result E [MTn ] = E [M0 ] in the truncated case, and we use a convergence theorem to
deduce the same result in the general case. In this case, the bounded convergence theorem yields

E[MT] = lim_{n→∞} E[M_{Tn}] = E[M0].
Note that with minor modifications to the above proof, we can get variants of the optional
stopping theorem for submartingales and supermartingales, where E [MT ] ≥ E [M0 ] and E [MT ] ≤
E [M0 ] respectively.
Example 24.1 (Gambler’s ruin). Consider a simple symmetric random walk on Z, starting at 0,
with absorbing barriers at −a and b. The position St at time t is a bounded martingale. Therefore,
by the optional stopping theorem, the probability of being absorbed at a is b/(a + b), while the
probability of being absorbed at b is a/(a + b).
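A simulation sketch checking those absorption probabilities (assuming NumPy; a = 3 and b = 7 are arbitrary example barriers):

```python
import numpy as np

rng = np.random.default_rng(7)
a, b, reps = 3, 7, 20_000

hits_b = 0
for _ in range(reps):
    s = 0
    while -a < s < b:                     # walk until absorbed at -a or b
        s += 1 if rng.random() < 0.5 else -1
    hits_b += (s == b)

print(hits_b / reps, a / (a + b))         # both approximately 0.3
```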
Example 24.2 (Asymmetric random walk). Consider the same problem as the previous example, but we instead have an unfair game where we win with probability p ≠ 1/2 and lose with probability q = 1 − p. Then, (q/p)^{St} is a martingale.
Example 24.3 (“Say red”). Consider a deck of cards in random order, with 26 red cards and 26 black cards. A dealer is flipping over cards, one at a time, and after each step they give you the option to stop. When you stop, the next card in the deck is revealed, and you win if that card is red. It turns out that no strategy for this game achieves a success probability different from 50%. This is because the fraction of red cards Mn left in the deck after n draws is a martingale, and it is also your success probability when stopping.
Joe mentions that the above example has shown up in many job interviews. Indeed, I remember Paul Christiano giving us this exercise as a brain teaser at SPARC. Another slick argument is to use the exchangeability of the cards, which says that choosing the top card of the deck is completely interchangeable with choosing the last card of the deck, and therefore none of your actions matter!
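A simulation sketch of this (assuming NumPy; the greedy rule of stopping as soon as the remaining fraction of red exceeds 1/2 is our example strategy, and it still wins exactly half the time):

```python
import numpy as np

rng = np.random.default_rng(8)
reps, wins = 20_000, 0

for _ in range(reps):
    deck = rng.permutation(np.array([1] * 26 + [0] * 26))  # 1 = red
    for i in range(52):
        red_left = 26 - deck[:i].sum()                     # reds not yet flipped
        if red_left / (52 - i) > 0.5 or i == 51:           # stop (forced at the end)
            wins += deck[i]                                # next card is revealed
            break

print(wins / reps)  # approximately 0.5, regardless of strategy
```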
25 December 3rd, 2020
Today is the last lecture of the course, before reading period! We will talk about a selection of
topics, as requested by the students.
When η = 0, this means that ∫ h⁺(y) f0(y) dy = ∫ h⁻(y) f0(y) dy = c, so we can divide both sides of the above equation by c to turn this into a probability distribution. Then, the above integrals are precisely the moment generating functions of two distributions (according to LOTUS), so

h⁺(y) f0(y) = h⁻(y) f0(y)

almost everywhere, by the uniqueness of moment generating functions. Therefore, h⁺(y) = h⁻(y) almost everywhere on the support of y, where f0(y) ≠ 0, and therefore h(y) = 0.
as n → ∞, for any fixed value of ε. However, note that since |Xj| ≤ c, the indicator random variable must be zero for all sufficiently large values of n. Therefore, the Lindeberg condition converges to zero for any choice of ε, as desired.
This is a generalization of the so-called coupon collector's problem, where b = 1 and the expected amount of time is simply equal to the n-th harmonic number Hn = ∑_{j=1}^n 1/j. However, while the basic coupon collector's problem is easy, this problem is significantly more difficult. The really interesting thing is that although the problem is ostensibly discrete, adding the continuous-time Poisson process greatly simplifies the solution, as this distribution has nice properties.
Probability is a vast subject, and there are plenty of courses like Stat 212 that go further. Joe mentions that his book is quite different from the standard courses on the material, emphasizing probabilistic thinking and not just measure theory. That's it for the semester!
References
[BM20] J.K. Blitzstein and C. Morris. Probability for Statistical Science. Unpublished draft, 2020.