Advanced Probability and Statistics (MAFS5020, HKUST)
Kani Chen (Instructor)
PART I
PROBABILITY THEORY
We first review some basic concepts about probability space through examples. Then we summarize
the structure of probability space and present axioms and theory.
Example 0.1. De Méré's Problem. (For the purpose of reviewing the discrete probability space.)
The Chevalier de Méré was a French nobleman and gambler of the 17th century. He was puzzled by the apparent equality of the probabilities of two events: at least one ace turns up in four rolls of a die, and at least one double-ace turns up in 24 rolls of two dice. His reasoning:
In one roll of a die, there is a 1/6 chance of getting an ace, so in 4 rolls there is a 4 × (1/6) = 2/3 chance of getting at least one ace.
In one roll of two dice there is a 1/36 chance of getting a double-ace, so in 24 rolls there is a 24 × (1/36) = 2/3 chance of getting at least one double-ace.
De Méré turned to Blaise Pascal (1623-1662) and Pierre de Fermat (1601-1665) for help, and the two mathematicians/physicists gave the right answer:
P(getting at least one ace in 4 rolls of a die) = 1 − (1 − 1/6)^4 ≈ 0.518,
so four rolls make a favorable bet while three rolls do not;
P(getting at least one double-ace in 24 rolls of two dice) = 1 − (1 − 1/36)^24 ≈ 0.491,
so 25 rolls make a favorable bet while 24 rolls still make an unfavorable bet.
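A quick numerical check of these two probabilities, plus a small simulation of the four-roll game (an illustrative sketch, not part of the original notes; it only assumes the Python standard library):

```python
# Exact probabilities from Pascal and Fermat's argument, plus a Monte Carlo check.
import random

p_four_rolls = 1 - (5 / 6) ** 4        # at least one ace in 4 rolls of one die
p_24_double = 1 - (35 / 36) ** 24      # at least one double-ace in 24 rolls of two dice
print(round(p_four_rolls, 3), round(p_24_double, 3))   # 0.518 0.491

random.seed(0)
trials = 200_000
hits = sum(any(random.randint(1, 6) == 1 for _ in range(4)) for _ in range(trials))
print(hits / trials)                   # simulated value, close to 0.518
```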
Historical remark: (Cardano's mistake) Probability theory has an infamous birthplace: the gambling room. The earliest publications on probability date back to "Liber de Ludo Aleae" (The Book on Games of Chance) by Gerolamo Cardano (1501-1576), an Italian mathematician/physician/astrologer/gambler of the time. Cardano made the important discovery of the product law of independent events, and it is believed that the word "probability" was first coined and used by Cardano. Even though he made several serious mistakes that seem elementary nowadays, he is considered a pioneer who first systematically computed probabilities of events. Pascal, Fermat, Bernoulli and de Moivre are among many other prominent developers of probability theory. The axioms of probability were first formulated rigorously by Kolmogorov in the last century.
Exercise 0.1'. Galileo's problem. Galileo (1564-1642), the famous physicist and astronomer, was also involved in a calculation of probabilities of a similar nature. Italian gamblers used to bet on the total number of spots in a roll of three dice. The question: is the chance of 9 total dots the same as that of 10 total dots? There are altogether 6 combinations summing to 9 (126, 135, 144, 234, 225, 333) and 6 combinations summing to 10 (145, 136, 226, 235, 244, 334). This can give the false impression that the two chances are equal. Galileo gave the correct answer: 25/216 for one and 27/216 for the other. The key point here is to lay out all 6^3 = 216 outcomes/elements of the probability space and realize that each of these 216 outcomes has the same chance 1/216. Then the chance of an event is the sum of the probabilities of the outcomes in the event.
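Enumerating the 216 equally likely outcomes confirms Galileo's count (a sketch, not part of the original notes):

```python
# Count three-dice outcomes with total 9 and total 10 among all 6^3 = 216 outcomes.
from itertools import product

counts = {9: 0, 10: 0}
for dice in product(range(1, 7), repeat=3):
    s = sum(dice)
    if s in counts:
        counts[s] += 1
print(counts)  # {9: 25, 10: 27}, i.e., 25/216 vs 27/216
```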
Example 0.2. The St. Petersburg Paradox. (For the purpose of reviewing the concept of expectation.)
A gambler pays an entry fee of M dollars to play the following game: a fair coin is tossed repeatedly until the first head occurs, and the gambler wins 2^{n−1} dollars, where n is the total number of tosses. Question: what is the "fair" amount of M?
Here n is a random number with P(n = k) = 2^{−k} for k = 1, 2, .... Therefore the "expected winning" is
E(2^{n−1}) = Σ_{k=1}^∞ 2^{k−1} × (1/2^k) = ∞.
Notice that here I have used the expectation of a function of a random variable. It appears that a "fair", but indeed naive, M should be ∞. However, by common sense, this game, despite its infinite expected payoff, should not be worth an infinite entry fee.
Daniel Bernoulli (1700-1782), a Dutch-born Swiss mathematician, provided one solution in 1738. In
his own words: The determination of the value of an item must not be based on the price, but rather
on the utility it yields. There is no doubt that a gain of one thousand ducats is more significant to
the pauper than to a rich man though both gain the same amount. Using a utility function, e.g., as
suggested by Bernoulli himself, the logarithmic function u(x) = log(x) (known as log utility), the
expected utility of the payoff (for simplicity assuming an initial wealth of zero) becomes finite:
E(U) = Σ_{k=1}^∞ u(2^{k−1}) P(n = k) = Σ_{k=1}^∞ log(2^{k−1})/2^k = log(2) = u(2) < ∞.
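The partial sums of this series can be checked numerically (a minimal sketch, not part of the original notes, using only the standard library):

```python
# Partial sums of sum_{k>=1} log(2^{k-1}) / 2^k, which converge to log 2.
import math

partial = sum(math.log(2 ** (k - 1)) / 2 ** k for k in range(1, 60))
print(partial, math.log(2))  # both approximately 0.6931
```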
Example 0.3. The dice game called "craps". (This is to review conditional probability.)
Two dice are rolled repeatedly; let the total number of dots on the n-th roll be Zn. If Z1 is 2, 3 or 12, the game is an immediate loss. If Z1 is 7 or 11, it is an immediate win. Otherwise, continue rolling the two dice until either Z1 occurs again, meaning a win, or 7 occurs, meaning a loss. What is the chance of winning this game?
Solution. Write
P(Win) = Σ_{k=2}^{12} P(Win and Z1 = k) = Σ_{k=2}^{12} P(Win | Z1 = k) P(Z1 = k)
= P(Z1 = 7) + P(Z1 = 11) + Σ_{k=4,5,6,8,9,10} P(Win | Z1 = k) P(Z1 = k)
= 6/36 + 2/36 + Σ_{k=4,5,6,8,9,10} P(A_k | Z1 = k) P(Z1 = k)
= 2/9 + Σ_{k=4,5,6,8,9,10} P(A_k) P(Z1 = k),
where A_k is the event that, starting from the second roll, k dots occur before 7 dots. Now,
P(A_k) = Σ_{j=2}^{12} P(A_k ∩ {Z2 = j}) = Σ_{j=2}^{12} P(A_k | {Z2 = j}) P(Z2 = j)
= P(Z2 = k) + Σ_{j=2, j≠k, j≠7}^{12} P(starting from the 3rd roll, k occurs before 7) P(Z2 = j).
As a result,
P(A_k) = P(Z1 = k)/[P(Z1 = k) + P(Z1 = 7)].
And,
P(Win) = 2/9 + Σ_{k=4,5,6,8,9,10} {P(Z1 = k)/[P(Z1 = k) + P(Z1 = 7)]} P(Z1 = k) = 244/495 ≈ 0.492929.
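The decomposition above can be evaluated exactly (an illustrative sketch, not part of the original notes; `p_total` is a helper introduced here):

```python
# Exact craps winning probability from the conditional-probability decomposition.
from fractions import Fraction

def p_total(k):
    """P(total of two fair dice equals k)."""
    return Fraction(6 - abs(k - 7), 36)

p_win = p_total(7) + p_total(11)
for k in (4, 5, 6, 8, 9, 10):
    p_win += p_total(k) / (p_total(k) + p_total(7)) * p_total(k)
print(p_win, float(p_win))  # 244/495 ≈ 0.492929
```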
This section lays the necessary rigorous foundation for probability as a mathematical theory. It
begins with sets, relations among sets, measurement of sets and functions defined on the sets.
Example 1.1. (A prototype of probability space.) Drop a needle blindly on the interval
[0, 1]. The needle hits interval [a, b], a sub-interval of [0, 1] with chance b − a. Suppose A is any
subset of [0, 1]. What’s the chance or length of A?
Here, we might interpret the largest set Ω = [0, 1] as the “universe”. Note that not all subsets
are “nice” in the sense that their volume/length can be properly assigned. So we first focus our
attention on certain class of “nice” subsets.
To begin with, the "basic" subsets are all the sub-intervals of [0, 1], denoted [a, b] with 0 ≤ a ≤ b ≤ 1. Denote by B the collection of all subsets of [0, 1] that are generated by the basic sets via finitely many set operations. B is called an algebra of Ω.
It can be proved that any set in B is a finite union of disjoint intervals (closed, open or half-closed). Still, B is not rich enough. For example, it does not contain the set of all rational numbers. More importantly, the limits of sets in B are often not in B. This is a serious restriction for mathematical analysis.
Let A be the collection of all subsets of [0, 1], which are generated by all “basic” sets after countably
many set operations. A is called Borel σ-algebra of Ω. Sets in A are called Borel sets. Limits of
sets in A are still in A. (Ω, A) is a measurable space.
Borel measure: any set A in A can be assigned a volume, denoted as µ(A), such that
(i). µ([a, b]) = b − a.
(ii). µ(A) = lim µ(An ) for any sequence of Borel sets An ↑ A.
Lebesgue measure (1901): Completion of Borel σ-algebra by adding all subsets of Borel measure
0 sets, denoted as F . Sets with measure 0 are called null sets.
Why should Borel measure or Lebesgue measure exist in general?
Caratheodory’s extension theorem: extending a (σ-finite) measure on an algebra B to the σ-algebra
A = σ(B).
Ω = [0, 1] (the universe).
B: an algebra (closed under finite set operations), generated by the subintervals.
A: the Borel σ-algebra (closed under countable set operations), generated by the subintervals.
F: the completion of A, a σ-algebra generated by A and the null sets.
Relation: A ⊂ B, if ω ∈ A ensures ω ∈ B.
A sequence of sets {An : n ≥ 1} is called increasing (decreasing) if An ⊂ An+1 (An ⊃ An+1 .)
A = B if and only if A ⊂ B and B ⊂ A.
Indicator functions. (A very useful tool to translate set operations into numerical operations.)
The relations and operations of sets are equivalent to those of their indicator functions. For any subset A ⊂ Ω, define its indicator function as
1_A(ω) = 1 if ω ∈ A, and 1_A(ω) = 0 otherwise.
The indicator function is a function defined on Ω.
Set operations vs. function operations:
A ⊂ B ⟺ 1_A ≤ 1_B.
A ∩ B: 1_{A∩B} = 1_A × 1_B = min(1_A, 1_B).
A^c = Ω \ A: 1_{A^c} = 1 − 1_A.
A ∪ B: 1_{A∪B} = 1_A + 1_B if A ∩ B = ∅; in general, 1_{A∪B} = max(1_A, 1_B).
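These identities can be checked numerically by representing indicator functions as 0/1 arrays over a finite universe (a sketch, not part of the original notes; the sets A and B below are chosen only for illustration):

```python
# Indicator functions turn set operations into arithmetic, with Omega = {0, 1, ..., 9}.
import numpy as np

omega = np.arange(10)
ind_A = (omega % 2 == 0).astype(int)   # 1_A, A = even numbers
ind_B = (omega < 5).astype(int)        # 1_B, B = {0, ..., 4}

assert np.array_equal(ind_A * ind_B, np.minimum(ind_A, ind_B))                  # 1_{A∩B}
assert np.array_equal(np.maximum(ind_A, ind_B), 1 - (1 - ind_A) * (1 - ind_B))  # 1_{A∪B}
assert np.array_equal(1 - ind_A, (omega % 2 == 1).astype(int))                  # 1_{A^c}
```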
Set limits.
There are two limits of sets: the upper limit and the lower limit.
lim sup_n A_n ≡ ∩_{n=1}^∞ ∪_{k=n}^∞ A_k = {A_n occurs infinitely often}.
1_{lim sup A_n} = lim sup 1_{A_n}.
lim inf_n A_n ≡ ∪_{n=1}^∞ ∩_{k=n}^∞ A_k = {A_n occurs for all but finitely many n}.
1_{lim inf A_n} = lim inf 1_{A_n}.
since f −1 (B) ∈ B p .
Proposition 1.2 If X1, X2, ... are r.v.s, then so are inf_n Xn, sup_n Xn, lim inf_n Xn and lim sup_n Xn.
Proposition 1.3 Suppose X is a map from a measurable space (Ω, A) to another measurable space (S, S). If X^{−1}(C) ∈ A for every C ∈ C, where S = σ(C), then X is a measurable map, i.e., X^{−1}(S) ∈ A for every S ∈ S. In particular, when (S, S) = ([−∞, ∞], B), X^{−1}([−∞, x]) ∈ A for every x is enough to ensure that X is a r.v..
Proof. Note that σ(C), the σ-algebra generated by C, is defined mathematically as the smallest
σ-algebra containing C.
Set B ∗ = {B ∈ S : X −1 (B) ∈ A}.
We first show B ∗ is a σ-algebra. Observe that
(i). for any B ∈ B ∗ , X −1 (B) ∈ A and, therefore, X −1 (B c ) = (X −1 (B))c ∈ A;
(ii). for any Bn ∈ B ∗ , X −1 (Bn ) ∈ A and X −1 (∪n Bn ) = ∪n X −1 (Bn ) ∈ A.
Consequently, B∗ is a σ-algebra containing C. Since S = σ(C) is the smallest σ-algebra containing C and B∗ ⊂ S, it follows that B∗ = S.
(Recall that a σ-algebra is a collection of subsets of Ω closed under countably many set operations.) A probability space is a triple (Ω, A, P), where A is a σ-algebra of Ω and P is a set function on A such that P(Ω) = 1, P(A) ≥ 0 for every A ∈ A, and P(∪_n A_n) = Σ_n P(A_n) for countably many disjoint A_n ∈ A.
DIY Exercises:
Exercise 1.1 ⋆⋆ Show 1_{lim inf A_n} = lim inf 1_{A_n} and De Morgan's identities.
Exercise 1.2 ⋆⋆ Show that the so-called "countable additivity" or "σ-additivity" (P(∪_n A_n) = Σ_n P(A_n) for countably many disjoint A_n ∈ A) is equivalent to "finite additivity" plus "continuity" (if A_n ↓ ∅, then P(A_n) → 0).
Exercise 1.3 ⋆⋆⋆ (Completion of a probability space) Let (Ω, F, P) be a probability space. Define
F̄ = {A : P(A \ B) + P(B \ A) = 0 for some B ∈ F},
and for each A ∈ F̄ define P(A) as P(B) for the B given above. Prove that (Ω, F̄, P) is also a probability space. (Hint: one needs to show that F̄ is a σ-algebra and that P is a probability measure on F̄.)
Exercise 1.4 ⋆ ⋆ ⋆ If X1 and X2 are two r.v.s, so is X1 + X2 . (Hint: cite Propositions 1.1 and
1.3)
The expectation, also called the mean, of a random variable is often referred to as the location or center of the random variable or its distribution. To avoid some non-essential trivialities, unless otherwise stated, the random variables will usually be assumed to take finite values, and those taking the values −∞ and ∞ are considered r.v.s in the extended sense.
(i). Distribution.
Recall that, given a probability space (Ω, F, P), a random variable (r.v.) X is defined as a real-valued function on Ω satisfying a certain measurability condition. The cumulative distribution function of X is then
F(x) = P(X ≤ x), x ∈ (−∞, ∞).
F(·) is then a right-continuous function defined on the real line (−∞, ∞).
Remark. The distribution function of a single r.v. may be considered as a complete profile/description of the r.v.. The distribution function F(·) defines a probability measure on (−∞, ∞). This is the induced measure, induced by the random variable as a map from the probability space (Ω, F, P) to ((−∞, ∞), B, F). In this sense, the original probability space is often left unspecified or seemingly irrelevant when dealing with one single random variable.
We often call a random variable a discrete random variable if it takes a countable number of values, and call it a continuous random variable if the chance that it takes any particular value is 0. In statistics, a continuous random variable is often, by default, given a density function. In general, a continuous random variable may not have a density function (with respect to Lebesgue measure). An example is the Cantor measure.
For two random variables X and Y, their joint c.d.f. is F(x, y) = P(X ≤ x, Y ≤ y). The joint c.d.f. can be extended to a finite number of variables in a straightforward fashion. If the (joint) c.d.f. is differentiable, the derivative is then called the (joint) density.
(ii). Expectation.
Definitions. For a nonnegative random variable X with c.d.f. F, its expectation is defined as
E(X) ≡ ∫_0^∞ x dF(x).
For a general r.v. X, write X = X^+ − X^− with X^+ = max(X, 0) and X^− = max(−X, 0); then E(X) ≡ E(X^+) − E(X^−), provided at least one of the two is finite. If X takes the value ∞ with positive probability, E(X^+) = ∞. Note that X having a finite mean is equivalent to E|X| < ∞, and the mean of X not existing is the same as E(X^+) = E(X^−) = ∞.
The expectation defined above is mathematically an integral or summation with respect to certain
probability measure induced by the random variable. In layman’s words, it is the weighted ”average”
of the values taken by the r.v., weighted by chances which sum up to 1.
Geometric: X ∼ G(p): the number of independent Bernoulli(p) trials until the first success; P(X = k) = (1 − p)^{k−1} p, k = 1, 2, ....
Hypergeometric: X ∼ HG(r, n, m): the number of black balls when r balls are taken without replacement from an urn containing n black balls and m white balls;
P(X = k) = C(n, k) C(m, r − k)/C(n + m, r), k = 0 ∨ (r − m), ..., r ∧ n.
Beta: X ∼ B(α, β): density
f(x) = [Γ(α + β)/(Γ(α)Γ(β))] x^{α−1} (1 − x)^{β−1}, x ∈ [0, 1].
ξ/(ξ + η) ∼ B(α, β) where ξ ∼ Γ(α, γ) and η ∼ Γ(β, γ) are independent. X_{(k)} ∼ B(k, n − k + 1), where X_{(k)} is the k-th smallest of X1, ..., Xn iid ∼ Unif[0, 1].
Cauchy: density f(x) = 1/[π(1 + x²)]. Symmetric about 0, but the expectation and variance do not exist.
χ²_n (with d.f. n): the sum of squares of n i.i.d. standard normal r.v.s. χ²_2 is the exponential distribution with mean 2.
t_n (with d.f. n): ξ/√(η/n), where ξ ∼ N(0, 1), η ∼ χ²_n, and ξ and η are independent.
Fm,n (with d.f. (m, n)): (ξ/m)/(η/n) where ξ ∼ χ2m , η ∼ χ2n and ξ and η are independent.
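Two of the Beta facts above can be verified by simulation (a sketch, not part of the original notes; the parameter values are arbitrary choices for illustration):

```python
# Check xi/(xi+eta) ~ B(alpha, beta) for independent Gammas with a common scale,
# and that the k-th smallest of n iid Unif[0,1] has the Beta(k, n-k+1) mean k/(n+1).
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, scale = 2.0, 3.0, 1.5
xi = rng.gamma(alpha, scale, size=100_000)
eta = rng.gamma(beta, scale, size=100_000)
print((xi / (xi + eta)).mean(), alpha / (alpha + beta))   # both ≈ 0.4

n, k = 10, 3
u = np.sort(rng.uniform(size=(100_000, n)), axis=1)
print(u[:, k - 1].mean(), k / (n + 1))                    # both ≈ 0.2727
```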
Inequalities are extremely useful tools in theoretical development of probability theory. For simplicity of notation, we use ‖X‖_p, which is also called the L_p norm if p ≥ 1, to denote [E(|X|^p)]^{1/p} for a r.v. X. In what follows, X and Y are two random variables.
(1) the Jensen inequality: Suppose ψ(·) is a convex function and X and ψ(X) have finite expectation.
Then ψ(E(X)) ≤ E(ψ(X)).
Proof. Convexity implies that for every a there exists a constant c such that ψ(x) − ψ(a) ≥ c(x − a) for all x. Let a = E(X) and x = X; the right-hand side then has mean 0, and taking expectations gives Jensen's inequality.
(2) the Markov inequality: for any a > 0, P(|X| ≥ a) ≤ E(|X|)/a. (This follows since a 1_{{|X|≥a}} ≤ |X|.)
(3) the Chebyshev inequality: for any a > 0, P(|X − E(X)| ≥ a) ≤ var(X)/a².
Proof. The inequality holds trivially if var(X) = ∞. Assume var(X) < ∞; then E(X) is finite and Y ≡ (X − E(X))² is well defined. It follows from the Markov inequality that P(|X − E(X)| ≥ a) = P(Y ≥ a²) ≤ E(Y)/a² = var(X)/a².
(4) the Hölder inequality: for 1/p + 1/q = 1 with p > 0 and q > 0,
E|XY| ≤ ‖X‖_p ‖Y‖_q.
Proof. Observe that for any two nonnegative numbers a and b, ab ≤ a^p/p + b^q/q. (This is a result of the concavity of the log function; please DIY.) Let a = |X|/‖X‖_p and b = |Y|/‖Y‖_q and take expectations on both sides. The Hölder inequality follows.
(5) the Minkowski inequality: for p ≥ 1, ‖X + Y‖_p ≤ ‖X‖_p + ‖Y‖_p.
Proof. If p = 1, the inequality is trivial. Assume p > 1. Let q = p/(p − 1), so that 1/p + 1/q = 1. By the Hölder inequality,
E[|X||X + Y|^{p−1}] ≤ ‖X‖_p ‖|X + Y|^{p−1}‖_q = ‖X‖_p {E[|X + Y|^{(p−1)q}]}^{1/q} = ‖X‖_p {E[|X + Y|^p]}^{(p−1)/p}.
Likewise,
E[|Y||X + Y|^{p−1}] ≤ ‖Y‖_p {E[|X + Y|^p]}^{(p−1)/p}.
Summing up the above two inequalities leads to E[|X + Y|^p] ≤ (‖X‖_p + ‖Y‖_p) {E[|X + Y|^p]}^{(p−1)/p}, and dividing both sides by {E[|X + Y|^p]}^{(p−1)/p} gives the Minkowski inequality.
Remark. Jensen's inequality is a powerful tool. For example, straightforward applications include, for 0 < p < q,
[E(|X|^p)]^{q/p} ≤ E(|X|^q),
which implies
‖X‖_p ≤ ‖X‖_q, for 0 < p < q.
Moreover,
E(log(|X|)) ≤ log(E(|X|)).
If E(X) exists,
E(e^X) ≥ e^{E(X)}.
These inequalities are all very commonly used. For example, the validity of maximum likelihood estimation essentially rests on the fact that
E[log(f_θ(X)/f_{θ0}(X))] ≤ log E[f_θ(X)/f_{θ0}(X)] = log ∫ [f_θ(x)/f_{θ0}(x)] f_{θ0}(x) dx = log ∫ f_θ(x) dx = log(1) = 0,
which is a result of Jensen's inequality. Here f_θ(·) is a parametric family of densities of X with θ0 being the true value of θ.
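A Monte Carlo check of this inequality in a simple case (a sketch, not part of the original notes; the normal family and the values of θ0 and θ are assumptions chosen only for illustration):

```python
# With X ~ N(theta0, 1) and a normal family f_theta, E log[f_theta(X)/f_theta0(X)] <= 0.
import numpy as np

def log_density(x, theta):
    return -0.5 * (x - theta) ** 2 - 0.5 * np.log(2 * np.pi)

rng = np.random.default_rng(1)
theta0, theta = 0.0, 0.7
x = rng.normal(theta0, 1.0, size=200_000)
print(np.mean(log_density(x, theta) - log_density(x, theta0)))  # ≈ -(0.7**2)/2 < 0
```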
The Markov inequality, despite its simplicity, is frequently used to establish the order of a sequence of random variables, especially when coupled with the technique of truncation. The Chebyshev inequality is so mighty that, as an example, it directly proves the weak law of large numbers.
The Schwarz inequality shows that covariance is an inner product and, furthermore, that the space of mean-0 r.v.s with finite variances forms a Hilbert space. The Minkowski inequality is the triangle inequality for the L_p norm, without which ‖·‖_p could not be a norm.
DIY Exercises.
Exercise 2.1. ⋆ Suppose X is a r.v. taking values in the set of rational numbers in [0, 1]. Specifically, P(X = q_i) = p_i > 0, where q_1, q_2, ... enumerate all rational numbers in [0, 1]. Show that the c.d.f. of X is continuous at irrational numbers and discontinuous at rational numbers.
Exercise 2.2. ⋆ ⋆ ⋆ Show var(X + ) ≤ var(X) and var(min(X, c)) ≤ var(X) where c is any constant.
Exercise 2.3. ⋆ ⋆ ⋆ (Generalizing Jensen’s inequality). Suppose g(·) is a convex function and X is
a random variable with finite mean. Then, for any constant c,
Exercise 2.4. ⋆⋆⋆ Lyapunov (Liapounov) : Show that the function log E(|X|p ) is a convex function
of p on [0, ∞). Or, equivalently, for any 0 < s < m < l, show
Chapter 3. Convergence
Unlike the convergence of a sequence of numbers, the convergence of a sequence of r.v.s has at least four commonly used modes: almost sure convergence, convergence in probability, L_p convergence and convergence in distribution. The first is sometimes called convergence almost everywhere or almost certain convergence, and the last is also called convergence in law.
(i). Definitions
In what follows, we give definitions. Suppose X1 , X2 , ... are a sequence of r.v.s.
Xn → X almost surely, (a.s.) if P ({ω : Xn (ω) → X(ω)}) = P (Xn → X) = 1. Namely, a.s.
convergence is a point-wise convergence “everywhere” except for a null set.
Xn → X in probability, if P (|Xn − X| > ǫ) → 0 for any ǫ > 0.
Xn → X in Lp , if E(|Xn − X|p ) → 0.
Xn → X in distribution. There are four equivalent definitions:
1). For every continuity point t of F , Fn (t) → F (t), where Fn and F are c.d.f of Xn and X.
2). For every closed set B, lim supn P (Xn ∈ B) ≤ P (X ∈ B).
3). For every open set B, lim inf n P (Xn ∈ B) ≥ P (X ∈ B).
4). For every continuous bounded function g(·), E(g(Xn )) → E(g(X)).
Remark. L_p convergence precludes the limit X taking the values ±∞ with positive chance. Sometimes, in some textbooks, a sequence of numbers going to infinity is called convergence to infinity rather than divergence to infinity. If this is the case, the limit X can be ∞ or −∞ for a.s. convergence and, by slightly modifying the definition, for convergence in probability. For example, Xn → ∞ in probability is naturally defined as: for any M > 0, P(Xn > M) → 1. Convergence in distribution only has to do with distributions.
(2). Fatou's lemma. If Xn ≥ 0, then E(lim inf Xn) ≤ lim inf E(Xn).
Proof. Let Xn* = inf(Xk : k ≥ n); then Xn* ↑ lim inf Xn, so by the monotone convergence theorem, E(Xn*) ↑ E(lim inf Xn). On the other hand, Xn* ≤ Xn, so E(Xn*) ≤ E(Xn). As a result, E(lim inf Xn) ≤ lim inf E(Xn).
(3). Dominated convergence theorem. If |Xn | ≤ Y , E(Y ) < ∞, and Xn → X a.s., then E(Xn ) →
E(X).
Proof. Observe that Y − Xn ≥ 0 and Y + Xn ≥ 0. By Fatou's lemma, E(Y − lim Xn) ≤ lim inf E(Y − Xn), leading to E(X) ≥ lim sup E(Xn). Likewise, E(Y + lim Xn) ≤ lim inf E(Y + Xn), leading to E(X) ≤ lim inf E(Xn). Consequently, E(Xn) → E(X).
The essence of the above convergence theorems is to use a bound, upper or lower, to ensure the desired convergence of expectations. These bounds, namely the lower bound 0 in the monotone convergence theorem and Fatou's lemma, and both the lower and upper bounds in the dominated convergence theorem, can actually be relaxed; see the DIY exercises. The most general extension is through the concept of uniformly integrable r.v.s, which shall be introduced later if necessary.
(Diagram: implication relations among a.s. convergence, convergence in probability, L_p convergence and convergence in distribution.)
6◦. Suppose |Xn| ≤ c for some constant c > 0 a.s. Then convergence in probability ⟺ L_p convergence for all (any) p > 0.
Proof. "⟸" follows from 2◦. And "⟹" follows from the dominated convergence theorem.
lim inf_n P(Xn ∈ (−∞, t)) ≥ lim inf_n P(Xn ∈ (−∞, t_k]) = P(X ∈ (−∞, t_k]) → P(X ∈ (−∞, t)), as continuity points t_k ↑ t.
The result can be extended to general open sets. We omit the proof.
3) =⇒ 1). Suppose t is a continuity point of F. Then lim sup_n Fn(t) ≤ F(t) by 2) and the equivalence of 2) and 3); and lim inf_n Fn(t) ≥ lim inf_n P(Xn < t) ≥ P(X < t) = F(t), as t is a continuity point. So 1) follows.
4) =⇒ 1). Let t be a continuity point of F. For any small ǫ > 0, choose a non-increasing continuous function f which is 1 for x ≤ t and 0 for x > t + ǫ. Then, P(Xn ≤ t) ≤ E(f(Xn)) → E(f(X)) ≤ P(X ≤ t + ǫ). Therefore, letting ǫ ↓ 0, lim sup P(Xn ≤ t) ≤ P(X ≤ t). Likewise (how?), one can show lim inf P(Xn ≤ t) ≥ P(X ≤ t). The desired convergence follows.
1) =⇒ 4). Continuity points of the c.d.f. of X are dense (why?). Suppose |f(t)| ≤ c for all t. Choose continuity points −∞ = t0 < t1 < ... < tK < tK+1 = ∞ such that F(t1) < ǫ, 1 − F(tK) < ǫ, and |f(t) − f(s)| < ǫ for any t, s ∈ [tj, tj+1], j = 1, ..., K − 1. Then,
|E(f(Xn)) − E(f(X))| = |∫ f(t) dFn(t) − ∫ f(t) dF(t)|
≤ Σ_{j=0}^{K} |∫_{tj}^{tj+1} f(t) [dFn(t) − dF(t)]|
≤ 2cǫ + Σ_{j=1}^{K−1} |∫_{tj}^{tj+1} f(t) [dFn(t) − dF(t)]|
≤ 2cǫ + Σ_{j=1}^{K−1} |∫_{tj}^{tj+1} f(tj) [dFn(t) − dF(t)]| + Σ_{j=1}^{K−1} |∫_{tj}^{tj+1} [f(t) − f(tj)] [dFn(t) − dF(t)]|
≤ 2cǫ + c Σ_{j=1}^{K−1} |∫_{tj}^{tj+1} [dFn(t) − dF(t)]| + ǫ Σ_{j=1}^{K−1} ∫_{tj}^{tj+1} [dFn(t) + dF(t)]
→ 2cǫ + 2ǫ ∫_{t1}^{tK} dF(t) ≤ (2c + 2)ǫ, as n → ∞,
where the middle term vanishes because Fn(tj) → F(tj) at the continuity points tj. Since ǫ > 0 is arbitrary, E(f(Xn)) → E(f(X)).
DIY Exercises.
Exercise 3.1 ⋆⋆ Suppose Xn ≥ η, with E(η − ) < ∞. Show E(lim inf Xn ) ≤ lim inf E(Xn ).
Exercise 3.2 ⋆⋆ Show the dominated convergence theorem still holds if Xn → X in probability or
in distribution.
Exercise 3.3 ⋆⋆⋆ Let Sn = Σ_{i=1}^n Xi. Raise a counter-example to show that Sn/n ↛ 0 in probability while Xn → 0 in probability.
Exercise 3.4 ⋆⋆⋆ Let Sn = Σ_{i=1}^n Xi. Show that Sn/n → 0 a.s. if Xn → 0 a.s., and Sn/n → 0 in L_p if Xn → 0 in L_p for p ≥ 1.
Corollary (Borel's 0-1 law) If A1, ..., An, ... are independent, then P(An i.o.) = 1 or 0 according as Σ_n P(An) = ∞ or < ∞.
Even though the above 0-1 law appears to be simple, its impact and implications are profound. More generally, suppose A ∈ ∩_{n=1}^∞ σ(A_j, j ≥ n), the so-called tail σ-algebra; A is called a tail event. Then the independence of A1, ..., An, ... implies P(A) = 0 or 1. The key fact here is that A is independent of A_n for any n ≥ 1; examples of tail events are {A_n i.o.} or {Σ_{i=1}^n 1_{A_i}/log(n) → ∞}. A more general result, involving independent random variables, is Kolmogorov's 0-1 law to be introduced later.
The following example can be viewed as a strengthening of the Borel-Cantelli lemma.
Example 4.1 Suppose A1, ..., An, ... are independent events with Σ_n p_n = ∞, where p_n = P(A_n). Then,
X_n ≡ (Σ_{i=1}^n 1_{A_i})/(Σ_{i=1}^n p_i) → 1 a.s..
Proof Since
E(X_n − 1)² = [Σ_{i=1}^n p_i(1 − p_i)]/(Σ_{i=1}^n p_i)² ≤ 1/(Σ_{i=1}^n p_i) → 0,
it follows that X_n → 1 in L² and therefore also in probability by the Chebyshev inequality:
P(|X_n − 1| > ǫ) ≤ E(X_n − 1)²/ǫ² ≤ 1/(ǫ² Σ_{i=1}^n p_i) → 0.
Choose a subsequence n_k = min{n : Σ_{i=1}^n p_i ≥ k²}. Then,
Σ_{k=1}^∞ P(|X_{n_k} − 1| > ǫ) ≤ Σ_{k=1}^∞ 1/(ǫ² k²) < ∞.
The Borel-Cantelli lemma implies X_{n_k} → 1 a.s.. Observe that, for n_k ≤ n ≤ n_{k+1},
(Σ_{i=1}^{n_k} 1_{A_i})/(Σ_{i=1}^{n_{k+1}} p_i) ≤ X_n = (Σ_{i=1}^n 1_{A_i})/(Σ_{i=1}^n p_i) ≤ (Σ_{i=1}^{n_{k+1}} 1_{A_i})/(Σ_{i=1}^{n_k} p_i),
and both bounds converge to 1 a.s..
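A small simulation of Example 4.1 (a sketch, not part of the original notes; the choice p_n = n^{−1/2} is an assumption made so that Σ p_n grows quickly enough for the convergence to be visible at moderate n):

```python
# Running ratio of sum 1_{A_i} over sum p_i for independent A_i with P(A_i) = i^{-1/2}.
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
p = 1.0 / np.sqrt(np.arange(1, n + 1))   # sum p_n = infinity
hits = rng.uniform(size=n) < p           # indicators 1_{A_i}
ratio = np.cumsum(hits) / np.cumsum(p)
print(ratio[[999, 19_999, n - 1]])       # values drifting toward 1
```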
If the joint density exists, this is the same as f_{X,Y}(x, y) = f_X(x) f_Y(y).
Roughly speaking, independence between two r.v.s X and Y is interpreted as X taking any value "having nothing to do with" Y taking any value, and vice versa.
Interpretation: E(X|Y) is the weighted "average" (expected value) of X over the set {Y = y}, for each y. It is a function of Y and therefore is a r.v. measurable with respect to σ(Y).
If their joint density f(x, y) exists, then the conditional density of X given Y = y is f_{X|Y}(x|y) ≡ f(x, y)/f_Y(y), and
E(X|Y = y) ≡ ∫ x f_{X|Y}(x|y) dx.
E(X1A ) = E(E(X|A)1A ),
which is a r.v. that, on each A_i, takes the conditional average of X, i.e., E(X|A_i), as its value. Motivated by this simple case, we obtain an important understanding of the conditional expectation of X w.r.t. a σ-algebra A: a new r.v. defined as the "average" of the r.v. X on each "un-splittable" or "smallest" set of the σ-algebra A.
Conditional mean/expectation with respect to a σ-algebra shares many properties with the ordinary expectation.
Properties:
(1). E(aX + bY |A) = aE(X|A) + bE(Y |A)
(2). If X ∈ A, then E(X|A) = X.
(4). E(E(X|F)|A) = E(X|A) for two σ-algebras A ⊆ F .
Further properties, such as the dominated convergence theorem, Fatou’s lemma and monotone
convergence theorem also hold for conditional mean w.r.t. a σ-algebra. (See DIY exercises.)
The martingale convergence theorem, even in its simplified version, has broad applications. For example, one of the most basic 0-1 laws, the Kolmogorov 0-1 law, can be established from it.
Corollary (Kolmogorov 0-1 law) Suppose X1, ..., Xn, ... are a sequence of independent r.v.s. Then all tail events have probability 0 or 1.
Proof. Suppose A is a tail event. Then A is independent of X1, ..., Xn for any fixed n. Therefore E(1_A|F_n) = P(A), where F_n is the σ-algebra generated by X1, ..., Xn. But, by Theorem 1.2, E(1_A|F_n) → 1_A a.s.. Hence 1_A = P(A) a.s., and P(A) can only be 0 or 1.
A heuristic interpretation of Kolmogorov’s 0-1 law could be in the perspective of information. When
σ-algebras A1 , ..., An , ... are independent, the information carried by each Ai are independent or
unrelated or non-overlapping. Then, the information carried by {An , An+1 , ...} shall shrink to 0 as
n → ∞, as, if otherwise, An , An+1 , ... would have something in common.
As straightforward applications of Kolmogorov’s 0-1 law:
Corollary Suppose X1, ..., Xn, ... are a sequence of independent random variables. Then,
lim inf_n Xn, lim sup_n Xn, lim sup_n Sn/an and lim inf_n Sn/an
must each be a.s. equal to a constant, possibly ∞ or −∞, where Sn = Σ_{i=1}^n Xi and an ↑ ∞.
Proof. Consider A = {ω : lim inf n Xn (ω) > a}. Try to show A is a tail event. (DIY).
Remark. Without invoking martingale convergence theorem, Kolmogorov’s 0-1 law can be shown
through π − λ theorem, which we do not plan to cover.
DIY Exercises.
Exercise 4.1 ⋆⋆ Suppose Xn are iid random variables. Then Xn /n1/p → 0 a.s. if and only if
E(|Xn |p ) < ∞ for p > 0. Hint: Borel-Cantelli lemma.
Exercise 4.2 ⋆ ⋆ ⋆ Let Xn be iid r.v.s with E(Xn ) = ∞. Show that lim supn |Sn |/n = ∞ a.s. where
Sn = X 1 + · · · + X n .
Exercise 4.3 ⋆⋆⋆ Suppose Xn are iid nonnegative random variables such that Σ_{k=1}^∞ k P(X1 > a_k) < ∞ for some a_k ↑ ∞. Show that lim sup_n max_{1≤i≤n} X_i/a_n ≤ 1 a.s.
Exercise 4.4 ⋆ ⋆ ⋆⋆ (Empirical Approximation) For every fixed t ∈ [0, 1], Sn (t) is a sequence
of random variables such that, with probability 1 for some p > 0,
for all n ≥ 1 and all t, s ∈ [0, 1]. Suppose for every constant C > 0, there exists a c > 0 such that
P (|Sn (t)| > C(n log n)1/2 ) ≤ e−cn for all n ≥ 1 and t ∈ [0, 1].
For a sequence of independent r.v.s X1 , X2 , ..., classical law of large numbers is typically about the
convergence of partial sums
(Sn − E(Sn))/n = Σ_{i=1}^n [Xi − E(Xi)]/n,
where Sn = Σ_{i=1}^n Xi here and throughout this chapter. A more general form is the convergence of
(Sn − an)/bn
for some constants an and bn . Weak law is convergence in probability and strong law is convergence
a.s..
The following proposition may be called L2 weak law of large numbers which implies the weak law
of large numbers.
Proposition Suppose X1 , ..., Xn , ... are iid with mean µ and finite variance σ 2 . Then,
Sn /n → µ in probability and in L2 .
Proof. Write
E(Sn /n − µ)2 = (1/n)σ 2 → 0.
Therefore L2 convergence holds. And convergence in probability is implied by the Chebyshev
inequality.
The above proposition implies that the classical weak law of large numbers holds quite trivially in the standard setup with the r.v.s being iid with finite variance. In fact, in such a standard setup the strong law of large numbers also holds, as will be shown in Chapter 6. However, the fact that convergence in probability is implied by L² convergence plays a central role in establishing weak laws of large numbers. For example, a straightforward extension of the above proposition is:
For independent r.v.s X1, X2, ..., (Sn − E(Sn))/bn → 0 in probability if (1/bn²) Σ_{i=1}^n var(Xi) → 0, for some bn ↑ ∞.
The following theorem about general weak law of large numbers is a combination of the above
extension and the technique of truncation.
Theorem 5.1. Weak Law of Large Numbers Suppose X1, X2, ... are independent. Assume
(1). Σ_{i=1}^n P(|Xi| > bn) → 0,
(2). bn^{−2} Σ_{i=1}^n E(Xi² 1_{{|Xi|≤bn}}) → 0,
where 0 < bn ↑ ∞. Then (Sn − an)/bn → 0 in probability, where an = Σ_{j=1}^n E(Xj 1_{{|Xj|≤bn}}).
Proof. Let Yj = Xj 1_{{|Xj|≤bn}}. Consider
(Σ_{j=1}^n Yj − an)/bn = Σ_{j=1}^n [Yj − E(Yj)]/bn,
which converges to 0 in probability by the Chebyshev inequality and condition (2). Moreover,
P(Sn = Σ_{j=1}^n Yj) ≥ P(Xj = Yj for all 1 ≤ j ≤ n) = Π_{j=1}^n P(Xj = Yj) by independence
= Π_{j=1}^n P(|Xj| ≤ bn) = Π_{j=1}^n [1 − P(|Xj| > bn)] = e^{Σ_{j=1}^n log[1 − P(|Xj|>bn)]}
≈ e^{−Σ_{j=1}^n P(|Xj|>bn)} → 1 by (1).
Theorem 5.2. Suppose X, X1, X2, ... are iid. Then, Sn/n − µn → 0 in probability for some µn if and only if
x P(|X| > x) → 0 as x → ∞,
in which case µn = E(X 1_{{|X|≤n}}) + o(1).
Proof. "⇐=" Let an = nµn and bn = n in Theorem 5.1. Condition (1) follows. To check Condition (2), write, as n → ∞,
bn^{−2} Σ_{i=1}^n E(Xi² 1_{{|Xi|≤bn}}) = (1/n) E(X² 1_{{|X|≤n}}) ≤ (1/n) E(min(|X|, n)²)
= (1/n) ∫_0^∞ 2x P(min(|X|, n) > x) dx = (1/n) ∫_0^n 2x P(|X| > x) dx
= (1/n) ∫_M^n 2x P(|X| > x) dx + o(1), for any fixed M > 0,
≤ 2 sup_{x≥M} x P(|X| > x) + o(1),
as n → ∞. Since M is arbitrary, Condition (2) holds. And the WLLN follows from Theorem 5.1.
"=⇒" Let X*, X1*, X2*, ... be iid following the same distribution as X and independent of X, X1, X2, .... Set ξi = Xi − Xi* (symmetrization) and S̃n = Σ_{i=1}^n ξi. Then, S̃n/n → 0 in probability. The Levy inequality in Exercise 5.1 implies max{|S̃j| : 1 ≤ j ≤ n}/n → 0 in probability, which further ensures max{|ξj| : 1 ≤ j ≤ n}/n → 0 in probability. For any ǫ > 0,
Example 5.1. Suppose X1, X2, ... are i.i.d. with common density f symmetric about 0 and c.d.f. such that 1 − F(t) = 1/(t log t) for t > 3. Then, Sn/n → 0 in probability, but Sn/n does not converge to 0 a.s.. The convergence in probability is a consequence of Theorem 5.2 with µn = 0, after checking the condition x P(|X| > x) → 0 as x → ∞. The a.s. convergence fails because Σ_n P(|Xn| > n) = ∞, so that, by the Borel-Cantelli lemma and independence, |Xn| > n infinitely often and hence Xn/n ↛ 0 a.s..
Corollary. Suppose X1, ..., Xn, ... are i.i.d. with E(|Xi|) < ∞. Then, Sn/n → E(X1) in probability.
Proof. Since E(|Xi|) < ∞, as x → ∞,
x P(|Xi| > x) ≤ E(|Xi| 1_{{|Xi|>x}}) = o(1),
and µn = E(X1 1_{{|X1|≤n}}) → E(X1). The corollary then follows from Theorem 5.2.
Proof. Notice that P(X ≥ 2^k) = 2^{−k+1}. Let kn ≈ log log n/log 2, mn = log n/log 2 + kn and bn = 2^{mn} = 2^{kn} n ≈ n log n, with mn taken to be an integer. Then
n P(X > bn) ≤ n 2^{−mn+1} = 2/2^{kn} → 0.
And
E(X² 1_{{X≤bn}}) = Σ_{k=1}^{mn} 2^{2k} 2^{−k} = Σ_{k=1}^{mn} 2^k ≤ 2 × 2^{mn} = 2bn.
Then,
n E(X² 1_{{X≤bn}})/bn² ≤ 2nbn/bn² = 2n/bn = 2n/(2^{kn} n) = 2/2^{kn} → 0.
Let an = n E(X 1_{{X≤bn}}). Then
an = n Σ_{k=1}^{mn} 2^k 2^{−k} = n mn = n log n/log 2 + n kn ≈ bn/log 2.
Example. Compute lim_n ∫_0^1 ··· ∫_0^1 [(x1² + ··· + xn²)/(x1 + ··· + xn)] dx1 ··· dxn. The integral equals E[(X1² + ··· + Xn²)/(X1 + ··· + Xn)], where X1, ..., Xn, ... are iid ∼ Unif[0, 1]. Since, by the WLLN,
(1/n) Σ_{i=1}^n Xi² → E(X1²) = ∫_0^1 x² dx = 1/3 and (1/n) Σ_{i=1}^n Xi → E(X1) = 1/2,
we have
(X1² + ··· + Xn²)/(X1 + ··· + Xn) → 2/3 in probability.
The r.v. on the left-hand side is bounded by 1. By dominated convergence, its mean also converges to 2/3. Hence the limit of the integral is 2/3.
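A quick simulation of this ratio (a sketch, not part of the original notes; the sample size and number of replications are arbitrary choices):

```python
# The ratio (X_1^2+...+X_n^2)/(X_1+...+X_n) for iid Unif[0,1] concentrates near 2/3.
import numpy as np

rng = np.random.default_rng(3)
n, reps = 10_000, 2_000
x = rng.uniform(size=(reps, n))
ratios = (x ** 2).sum(axis=1) / x.sum(axis=1)
print(ratios.mean())   # ≈ 2/3, so the integral tends to 2/3
```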
Remark. The following WLLN for array of r.v.s. is a slight generalization of Theorem 5.1.
Suppose X_{n,1}, ..., X_{n,n} are independent r.v.s. If
Σ_{i=1}^n P(|X_{n,i}| > bn) → 0 and (1/bn²) Σ_{i=1}^n E(X_{n,i}² 1_{{|X_{n,i}|≤bn}}) → 0,
then
(Σ_{i=1}^n X_{n,i} − an)/bn → 0 in probability,
where an = Σ_{i=1}^n E(X_{n,i} 1_{{|X_{n,i}|≤bn}}).
DIY Exercises.
Exercise 5.1 (Levy's Inequality) Suppose X1, X2, ... are independent and symmetric about 0. Then,
P(max_{1≤j≤n} |Sj| ≥ ǫ) ≤ 2 P(|Sn| ≥ ǫ).
Exercise 5.2 Show Sn/(n log n) → −log 2 in probability in Example 5.4. Hint: Choose bn = 2^{mn} with mn = {k : 2^{−k} k^{−3/2} ≤ 1/n} and proceed as in Example 5.2.
Exercise 5.3 For Example 1.4, prove that Sn/bn → 0 in probability if bn/(n/log n) ↑ ∞.
Exercise 5.4 (Marcinkiewicz-Zygmund weak law of large numbers) Suppose x^p P(|X| > x) → 0 as x → ∞ for some 0 < p < 2. Prove that
[Sn − n E(X 1_{{|X|≤n^{1/p}}})]/n^{1/p} → 0 in probability.
Kolmogorov's inequality. Suppose X1, ..., Xn are independent with mean 0 and finite variances, and Sj = X1 + ··· + Xj. Then, for any ǫ > 0,
P(max_{1≤j≤n} |Sj| ≥ ǫ) ≤ var(Sn)/ǫ².
Proof. Let T = min{j ≤ n : |Sj| ≥ ǫ}, with the minimum of the empty set being ∞, i.e., T = ∞ if |Sj| < ǫ for all 1 ≤ j ≤ n. Then, {T ≤ j} or {T = j} only depends on X1, ..., Xj. And, as a result,
{T ≥ j} = {T ≤ j − 1}^c = {|Si| < ǫ, 1 ≤ i ≤ j − 1}
only depends on X1, ..., X_{j−1} and is therefore independent of Xj, X_{j+1}, .... Write
var(Sn) = E(Sn²) ≥ Σ_{j=1}^n E(Sn² 1_{{T=j}}) = Σ_{j=1}^n E[(Sj + (Sn − Sj))² 1_{{T=j}}]
≥ Σ_{j=1}^n E(Sj² 1_{{T=j}}) ≥ ǫ² Σ_{j=1}^n P(T = j) = ǫ² P(max_{1≤j≤n} |Sj| ≥ ǫ),
where the cross term vanishes since Sn − Sj is independent of Sj 1_{{T=j}} and has mean 0. Hence P(max_{1≤j≤n} |Sj| ≥ ǫ) ≤ var(Sn)/ǫ².
Example 6.1. (Extension to continuous-time processes.) Suppose {St : t ∈ [0, ∞)} is a process with increments that are independent, of zero mean and finite variance. If the path of St is right continuous, then, for example,
P(max_{t∈[0,τ]} |St| > ǫ) ≤ var(S_τ)/ǫ².
The examples of such processes are, e.g., compensated Poisson process and Brownian Motion.
Kolmogorov’s inequality will later on be seen as a special case of martingale inequality. In the proof
of Kolmogorov inequality, we have used a stopping time T , which is a r.v. associated with a process
Sn or, more generally, a filtration, such that T = k only depends on past and current values of the
process: S1 , ..., Sk . Stopping time is one of the most important concepts and tools in martingale
theory or stochastic processes.
Theorem 6.2 (Khintchine-Kolmogorov convergence of series). Suppose X1, X2, ... are independent with mean 0 and Σ_{n=1}^∞ var(Xn) < ∞. Then Σ_{n=1}^∞ Xn converges a.s..
Proof. Define A_{m,ǫ} = {max_{j>m} |Sj − Sm| ≤ ǫ}. Then, {Σ_{n=1}^∞ Xn converges} = ∩_{ǫ>0} ∪_m A_{m,ǫ}. By Kolmogorov's inequality,
P(max_{m<j≤n} |Sj − Sm| > ǫ) ≤ var(Sn − Sm)/ǫ² = (1/ǫ²) Σ_{i=m+1}^n var(Xi) ≤ (1/ǫ²) Σ_{i=m+1}^∞ var(Xi).
“⇐=” is a direct consequence of Theorem 6.2.. “=⇒” follows from the central limit theorem to be
shown in Chapter 8.
Corollary. Suppose X, X1, X2, ... are iid with E(|X|^p) < ∞ for some 0 < p < 2. Then, Σ_{n=1}^∞ [Xn − E(X)]/n^{1/p} converges a.s. for 1 < p < 2, and Σ_{n=1}^∞ Xn/n^{1/p} converges a.s. for 0 < p < 1.
We leave the proof as Exercise 6.2.
The strong law of large numbers (SLLN) is a central result in classical probability theory. The convergence of series established in Section 1.6 paves the way toward proving the SLLN using the Kronecker lemma.
The following proposition is an immediate application of the Kronecker lemma and the Khintchine-Kolmogorov convergence of series.
Proposition (Kolmogorov's criterion of SLLN). Suppose X1, X2, ... are independent such that E(Xn) = 0 and Σ_n var(Xn)/n² < ∞. Then, Sn/n → 0 a.s..
Proof. Consider the series Σ_{i=1}^n Xi/i, n ≥ 1. Since Σ_n var(Xn/n) = Σ_n var(Xn)/n² < ∞, Theorem 6.2 implies that Σ_n Xn/n converges a.s.. And the Kronecker lemma then ensures Sn/n → 0 a.s..
Obviously, if X, X1, X2, ... are iid with finite variance, the above proposition implies the SLLN: Sn/n → E(X) a.s.. In fact, a stronger result than the above SLLN is also straightforward:
Corollary. If X1, X2, ... are iid with mean µ and finite variance, then, for any δ > 1,
(Sn − nµ)/√(n (log n)^δ) → 0 a.s..
A much sharper result is Kolmogorov's law of the iterated logarithm, lim sup_n (Sn − nµ)/√(2σ² n log log n) = 1 a.s., for iid r.v.s with mean µ and finite variance σ². We do not intend to cover the proof of Kolmogorov's law of iterated logarithm.
The above SLLN requires finite second moments. The most standard classical SLLN, established by Kolmogorov, for iid r.v.s holds as long as the population mean exists. In the statistical view, the sample mean shall always converge to the population mean as long as the population mean exists, without any further moment condition. In fact, the sample mean converges to a finite limit if and only if the population mean is finite, in which case the limit is the population mean.
Theorem 6.4. Kolmogorov’s strong law of large numbers. Suppose X, X1 , X2 , ... are iid and
E(X) exists. Then,
Sn /n → E(X), a.s..
Conversely, if Sn /n → µ which is finite, then µ = E(X).
Proof. Suppose first E(X1 ) = 0. We shall utilize the above proposition of Kolmogorov’s criterion
of SLLN. Consider
Yn = Xn 1{|Xn |≤n} − E(Xn 1{|Xn |≤n} ).
Write
Σ_{n=1}^∞ var(Yn)/n² ≤ Σ_{n=1}^∞ (1/n²) E(X² 1_{{|X|≤n}}) = E(X² Σ_{n=1}^∞ (1/n²) 1_{{|X|≤n}})
≤ E(X² Σ_{n≥|X|∨1} 2/(n(n + 1))) ≤ 2 E(|X| + 1) < ∞.
Remark Kolmogorov's SLLN also holds for r.v.s that are pairwise independent and identically distributed, which is slightly more general. We have chosen to follow the historical development of the classical probability theory.
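An illustration of why the existence of the mean matters (a sketch, not part of the original notes; the Exp(1) and standard Cauchy choices are assumptions made only for contrast):

```python
# Running means of iid Exp(1) variables settle at 1, while running means of
# Cauchy variables (whose mean does not exist) keep fluctuating.
import numpy as np

rng = np.random.default_rng(4)
n = np.arange(1, 100_001)
exp_means = np.cumsum(rng.exponential(1.0, size=n.size)) / n
cauchy_means = np.cumsum(rng.standard_cauchy(size=n.size)) / n
print(exp_means[[999, 99_999]])     # both close to 1
print(cauchy_means[[999, 99_999]])  # erratic, no stabilization
```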
(vi). Strong law of large numbers when E(X) does not exist.
Kolmogorov’s SLLN in Theorem 6.4 already shows that the classical SLLN does not hold if E(X)
does not exist, i.e., E(X + ) = E(X − ) = ∞. The SLLN becomes quite complicated. We introduce
the theorem proved by W. Feller:
Proposition Suppose X, X1, ... are iid with E|X| = ∞. Suppose an > 0 and an/n is nondecreasing. Then,
lim sup |Sn|/an = 0 a.s. if Σ_n P(|X| ≥ an) < ∞;
lim sup |Sn|/an = ∞ a.s. if Σ_n P(|X| ≥ an) = ∞.
The proof is somewhat technical but still along the same lines as that of Kolmogorov's SLLN. Interested students may refer to the textbook by Durrett (page 67). We omit the details.
Example 6.2. (The St. Petersburg Paradox) See Example 1.5, in which we have shown
Sn/(n log n) → 1/log 2 in probability.
By the above proposition with an = n(log n)^δ for δ > 1,
lim sup Sn/(n (log n)^δ) = 0 a.s..
The following Marcinkiewicz-Zygmund SLLN is useful in connecting the rate of convergence with
the moments of the iid r.v.s.
Theorem 6.5. (Marcinkiewicz-Zygmund strong law of large numbers). Suppose X, X1, X2, ... are iid and E(|X|^p) < ∞ for some 0 < p < 2. Then,
(Sn − nE(X))/n^{1/p} → 0 a.s. for 1 ≤ p < 2;
Sn/n^{1/p} → 0 a.s. for 0 < p < 1.
Proof. The case p = 1 is Kolmogorov's SLLN. The cases 0 < p < 1 and 1 < p < 2 are consequences of the corollary following Theorem 6.2 and the Kronecker lemma.
Example 6.3 Suppose X, X1, X2, ... are iid and X is symmetric with P(X > t) = t^{−α} for some α > 0 and all large t.
(1). α > 2: Then, E(X 2 ) < ∞, Sn /n → 0 a.s. and, moreover, Kolmogorov’s law of iterated
logarithm gives the sharp rate of the a.s. convergence.
(2). 1 < α ≤ 2: for any 0 < p < α,
Sn/n^{1/p} → 0, a.s..
It implies that Sn/n converges to 0 a.s. at a rate faster than n^{−1+1/p}, but not at the rate of n^{−1+1/α}. In particular, if α = 2, Sn/n converges to E(X) a.s. at a rate faster than n^{−β} for any 0 < β < 1/2, but not at the rate of n^{−1/2}.
(3). 0 < α ≤ 1: E(X) does not exist. For any 0 < p < α,
Sn/n^{1/p} → 0, a.s..
Moreover, for any δ > 1,
lim sup |Sn|/n^{1/α} = ∞ a.s. and Sn/[n^{1/α} (log n)^{δ/α}] → 0 a.s..
DIY Exercises.
Exercise 6.1. ⋆ ⋆ ⋆ Suppose S0 ≡ 0, S1 , S2 , ... form a square integrable martingale, i.e., for
k = 0, 1, ..., n, E(Sk2 ) < ∞ and E(Sk+1 |Fk ) = Sk where Fk is the σ-algebra generated by S1 , ..., Sk .
Show that Kolmogorov’s inequality still holds.
Exercise 6.2. ⋆ ⋆ ⋆ Prove the Corollary following Theorem 6.2.
Exercise 6.3. ⋆⋆⋆⋆ For positive independent r.v.s X1, X2, ..., show that the following three statements are equivalent: (a). Σ_n Xn < ∞ a.s.; (b). Σ_n E(Xn ∧ 1) < ∞; (c). Σ_n E(Xn/(1 + Xn)) < ∞.
Exercise 6.4. ⋆⋆⋆⋆ Raise a counterexample to show that there exist X1, X2, ... iid with E(X) = 0 but such that Σ_n Xn/n does not converge a.s..
Exercise 6.5 ⋆⋆⋆⋆ If X1, X2, ... are iid with mean µ and finite variance, show that, for any δ > 1,
(Sn − nµ)/√(n (log n)^δ) → 0 a.s..
In the above proof, we have used Fatou lemma with Lebesgue measure. In fact, the monotone
convergence theorem, Fatou lemma and dominated convergence theorem that we have established
with probability measure all hold with σ-finite measures, including Lebesgue measure.
Remark. (Slutsky's Theorem) Suppose Xn → X∞ in distribution and Yn → c in probability, where c is a constant. Then, Xn Yn → cX∞ in distribution and Xn + Yn → X∞ + c in distribution.
We leave the proof as an exercise.
In the following, we provide some classical examples about convergence in distribution, only to show
that there are a variety of important limiting distributions besides the normal distribution as the
limiting distribution in CLT.
Example 7.1. (Convergence of maxima and extreme value distributions) Let Mn =
max1≤i≤n Xi where Xi are iid r.v.s with c.d.f. F (·). Then,
(b). F(x) = 1 − |x|^β for x ∈ [−1, 0] and some β > 0. Then, for any t < 0,
P(n^{1/β} Mn ≤ t) = (1 − n^{−1}|t|^β)^n → e^{−|t|^β}.
(c). F(x) = 1 − e^{−x} for x > 0, i.e., Xi follows the exponential distribution. Then, for all t,
P(Mn − log n ≤ t) → e^{−e^{−t}}.
These limiting distributions are called extreme value distributions.
Example 7.2. (Birthday problem) Suppose X1, X2, ... are iid with uniform distribution on the integers {1, 2, ..., N}. Let
T_N = min{k : there exists a j < k such that Xj = Xk}.
Then, for k ≤ N,
P(T_N > k) = P(X1, ..., Xk all take different values)
= Π_{j=2}^k [1 − P(Xj takes one of the values of X1, ..., X_{j−1})]
= Π_{j=2}^k (1 − (j − 1)/N) = exp{Σ_{j=1}^{k−1} log(1 − j/N)} ≈ exp{−Σ_{j=1}^{k−1} j/N} ≈ exp{−k²/(2N)}.
In other words, T_N/N^{1/2} converges in distribution to the distribution F(t) = 1 − exp(−t²/2), t ≥ 0. Suppose now N = 365. By this approximation, P(T365 > 22) ≈ 0.5153 and P(T365 > 50) ≈ 0.0326, meaning that with 22 (respectively 50) people there is about a half (respectively 3%) probability that all of them have different birthdays.
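The exact product and the exponential approximation can be compared directly (a sketch, not part of the original notes):

```python
# Exact P(T_365 > k) versus the approximation exp(-k^2 / (2*365)).
import math

def exact(k, N=365):
    p = 1.0
    for j in range(1, k):
        p *= 1 - j / N
    return p

for k in (22, 50):
    print(k, round(exact(k), 4), round(math.exp(-k * k / (2 * 365)), 4))
# k=22: exact 0.5243 vs approx 0.5153;  k=50: exact 0.0296 vs approx 0.0326
```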
Example 7.3. (Law of rare events) Suppose there are in total n flights worldwide each year, and each flight has chance pn of having an accident, independently of the other flights. There are on average λ accidents a year worldwide. The distribution of the number of accidents is B(n, pn) with npn close to λ, and this distribution approximates the Poisson distribution with mean λ; namely,
Bin(n, pn) → P(λ) if n → ∞ and npn → λ > 0.
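A direct numerical comparison of the two distributions (a sketch, not part of the original notes; the values n = 5000 and λ = 3 are arbitrary choices for illustration):

```python
# Binomial(n, lambda/n) probabilities against Poisson(lambda) probabilities.
import math

n, lam = 5000, 3.0
p = lam / n
for k in range(6):
    binom = math.comb(n, k) * p ** k * (1 - p) ** (n - k)
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    print(k, round(binom, 4), round(poisson, 4))   # nearly identical columns
```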
Example 7.4. (The secretary/marriage problem) Suppose there are n secretaries to be interviewed one by one and, right after each interview, you must make an immediate decision to "hire or fire" the interviewee. You observe only the relative ranks of the interviewed candidates. What is the optimal strategy to maximize the chance of hiring the best of the n candidates? (Assume no ties in performance.)
One type of strategy is to give up the first m candidates, whatever their performance in the interview. Afterwards, the first candidate who outperforms all previous candidates is hired. In other words, starting from the (m + 1)-th interview, the first candidate who outperforms the first m candidates is hired. Otherwise you settle for the last candidate. The chance that the k-th best among all n candidates is hired is
P_k = Σ_{j=m+1}^n P(the k-th best is the j-th interviewee and is hired)
= Σ_{j=m+1}^n P(the best among the first j − 1 appears in the first m, the j-th candidate is the k-th best, and the k − 1 best all appear after the j-th candidate)
≈ Σ_{j=m+1}^n [m/(j − 1)] × (1/n) × [(n − j)/n]^{k−1}.
Let n → ∞ and m ≈ nc, where c is the fraction of the interviews to be given up. Then the probability of hiring the k-th best is
P_k ≈ c Σ_{j=m}^n (1/j)(1 − j/n)^{k−1} ≈ c ∫_c^1 [(1 − x)^{k−1}/x] dx = c A_k, say.
Hence
P_k → c[−log c − Σ_{j=1}^{k−1} (1 − c)^j/j], as n → ∞.
In particular, P1 → −c log c. The function −c log c is maximized at c = 1/e ≈ 0.368. The best strategy is therefore to give up the first 36.8% of the interviews and then hire the first candidate who is best so far. The chance of hiring the best overall is also 36.8%, and the chance of ending up with the last person is also about c. This phenomenon is also called the 1/e law.
You may wish to formulate this problem in terms of a sequence of random variables.
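A simulation of the 1/e strategy (a sketch, not part of the original notes; `hire_best` is a helper introduced here, and ranks are encoded so that 0 is the best candidate):

```python
# Skip the first c*n candidates, then hire the first one better than all seen so far;
# count how often the overall best is hired.
import numpy as np

rng = np.random.default_rng(5)

def hire_best(n, c, reps=20_000):
    m = int(c * n)
    wins = 0
    for _ in range(reps):
        ranks = rng.permutation(n)                 # 0 = best candidate
        threshold = ranks[:m].min() if m > 0 else n
        hired = next((r for r in ranks[m:] if r < threshold), ranks[-1])
        wins += int(hired == 0)
    return wins / reps

print(hire_best(100, 1 / np.e))   # ≈ 0.37, the 1/e law
```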
The dominated convergence theorem also holds with convergence in distribution, which is left as
an exercise.
(b). Continuous mapping theorem: Suppose Xn → X∞ in distribution and g(·) is a continuous function. Then, g(Xn) → g(X∞) in distribution.
Proof. For any bounded continuous function f, f(g(·)) is still a bounded continuous function. Hence E(f(g(Xn))) → E(f(g(X∞))), proving that g(Xn) → g(X∞) in distribution.
(c). Tightness and convergent subsequences.
In studying the convergence of a sequence of numbers, it is very useful that boundedness of the sequence guarantees a convergent subsequence. The same is true for uniformly bounded monotone functions, such as, for example, distribution functions. This is the following Helly's selection theorem, which is useful in studying weak convergence of distributions.
Helly's Selection Theorem. A sequence of cumulative distribution functions Fn always contains a subsequence, say F_{n_k}, that converges, at every continuity point of F∞, to a nondecreasing, right-continuous function F∞. If F∞(−∞) = 0 and F∞(∞) = 1, then F∞ is a distribution function and F_{n_k} converges to F∞ weakly.
Proof Let t1, t2, ... be an enumeration of the rational numbers. In the sequence Fn(t1), n ≥ 1, there is always a convergent subsequence; denote it by F_{n_k^{(1)}}(t1), k = 1, 2, .... Among this subsequence there is again a further subsequence, denoted n_k^{(2)}, k = 1, 2, ..., with n_1^{(2)} ≥ n_1^{(1)}, such that F_{n_k^{(2)}}(t2) is convergent. Repeat this process of selection indefinitely. Let n_k = n_1^{(k)} be the first element of the k-th subsequence (the diagonal). Then, for any fixed m, {n_k : k ≥ m} is a subsequence of {n_k^{(l)} : k ≥ 1} for all l ≤ m. Hence F_{n_k} converges at every rational number; denote the limit by F*(t_l) at each rational t_l. Monotonicity of the F_{n_k} implies the monotonicity of F* on the rationals. Define, for all t, F∞(t) = inf{F*(t_l) : t_l > t, t_l rational}. Then, F∞ is right continuous and nondecreasing, and if s is a continuity point of F∞, F_{n_k}(s) → F∞(s).
Not every sequence of distribution functions Fn converges weakly to a distribution function. The easiest example is Fn({n}) = Fn(n) − Fn(n−) = 1, i.e., P(Xn = n) = 1; then Fn(t) → 0 for all t ∈ (−∞, ∞). If the Fn all have little probability mass near ∞ or −∞, then convergence to a function which is not a distribution function can be avoided. A sequence of distribution functions Fn is called tight if, for any ǫ > 0, there exists an M > 0 such that lim sup_{n→∞} (1 − Fn(M) + Fn(−M)) < ǫ; or, in other words,
sup_n (1 − Fn(x) + Fn(−x)) → 0 as x → ∞.
Proposition. Every tight sequence of distribution functions contains a subsequence that converges weakly to a distribution function.
Proof Repeat the proof of Helly's Selection Theorem. The tightness ensures that the limit is a distribution function.
The characteristic function is one of the most useful tools in developing the theory of convergence in distribution. The technical details of characteristic functions involve some knowledge of complex analysis. We shall view them only as a tool and try not to elaborate on the technicalities.
1◦ . Definition and examples.
For a r.v. X with distribution F, its characteristic function is
ψ(t) = E(e^{itX}) = E(cos(tX) + i sin(tX)) = ∫ e^{itx} dF(x), t ∈ (−∞, ∞),
where i = √(−1).
Some basic properties are:
A product of characteristic functions is still a characteristic function. And the characteristic function of X1 + ... + Xn, for independent X1, ..., Xn, is the product of those of X1, ..., Xn.
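The product property can be illustrated numerically with empirical characteristic functions (a sketch, not part of the original notes; the exponential and normal samples and the value of t are arbitrary choices):

```python
# Empirical cf of X + Y for independent X, Y matches the product of individual cfs.
import numpy as np

rng = np.random.default_rng(6)
t = 1.3
x = rng.exponential(1.0, size=400_000)
y = rng.normal(0.0, 1.0, size=400_000)

def ecf(sample, t):
    return np.mean(np.exp(1j * t * sample))

print(ecf(x + y, t))
print(ecf(x, t) * ecf(y, t))   # nearly identical complex numbers
```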
The following table lists some characteristic functions for some commonly used distributions:
Proof. The proof uses Fubini's theorem to interchange the expectation with the integration, and the fact that ∫_0^∞ [sin(x)/x] dx = π/2. We omit the details.
The above theorem clearly implies that two different distributions cannot have the same characteristic function, as formally stated in the following corollary.
Corollary. There is a one-to-one correspondence between distribution functions and characteristic functions.
DIY Exercises:
Exercise 7.1. ⋆ ⋆ ⋆ Prove Slutsky’s Theorem.
Exercise 7.2. ⋆ ⋆ ⋆ (Dominated convergence theorem) Suppose Xn → X∞ in distribution
and |Xn | ≤ Y with E(Y ) < ∞. Show that E(Xn ) → E(X∞ ).
Exercise 7.3. ⋆⋆ Suppose Xn is independent of Yn, and X is independent of Y. Use characteristic functions to show that, if Xn converges to X in distribution and Yn converges to Y in distribution, then Xn + Yn converges in distribution to X + Y.
The most ideal case of the CLT is that the random variables are iid with finite variance. Although
it is a special case of the more general Lindeberg-Feller CLT, it is most standard and its proof
contains the essential ingredients to establish more general CLT. Throughout the chapter, Φ(·) is
the cdf of standard normal distribution N (0, 1).
(i). Central limit theorem (CLT) for iid r.v.s.
The following lemma plays a key role in the proof of CLT.
Lemma 8.1 For any real x and n ≥ 1,
|e^{ix} − Σ_{j=0}^n (ix)^j/j!| ≤ min(|x|^{n+1}/(n + 1)!, 2|x|^n/n!).
Consequently, for any r.v. X with characteristic function ψ and finite second moment,
|ψ(t) − [1 + itE(X) − (t²/2)E(X²)]| ≤ (|t|²/6) E(min(|t||X|³, 6|X|²)).   (8.1)
which can be shown by induction and by taking derivatives: the middle term is bounded by |x|^{n+1}/(n + 1)!, and the last by 2|x|^n/n!.
Theorem 8.2 Suppose X, X1, ..., Xn, ... are iid with mean µ and finite variance σ² > 0. Then,
(Sn − nµ)/√(nσ²) → N(0, 1) in distribution.
Proof. Without loss of generality, let µ = 0. Let ψ be the common characteristic function of the Xi. Observe that, by (8.1) and dominated convergence,
E(e^{itSn/(σ√n)}) = [ψ(t/(σ√n))]^n = [1 − t²/(2n) + o(1/n)]^n → e^{−t²/2},
which is the characteristic function of N(0, 1). Then, Levy's continuity theorem implies the above CLT.
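A simulation of the CLT in the iid case (a sketch, not part of the original notes; the Exp(1) distribution, with µ = σ² = 1, and the sample sizes are arbitrary choices):

```python
# Standardized sums of iid Exp(1) variables compared with N(0,1) tail probabilities.
import numpy as np

rng = np.random.default_rng(7)
n, reps = 400, 20_000
x = rng.exponential(1.0, size=(reps, n))
z = (x.sum(axis=1) - n) / np.sqrt(n)
print(np.mean(z <= 1.0))            # ≈ Phi(1) ≈ 0.8413
print(np.mean(np.abs(z) <= 1.96))   # ≈ 0.95
```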
In the case where the common variance is not finite, the partial sum, after proper normalization, may or may not converge to a normal distribution. The following theorem provides a sufficient and necessary condition. The key point here is whether there exists an appropriate truncation, which is a trick that we have used many times before.
Theorem 8.3 Suppose X, X1, X2, ... are iid and nondegenerate. Then, (Sn − an)/bn converges to a normal distribution for some constants an and 0 < bn → ∞, if and only if
x² P(|X| > x)/E(X² 1_{{|X|≤x}}) → 0, as x → ∞.   (8.2)
The proof is omitted. We note that (8.2) holds if Xi has finite variance σ² > 0, in which case the CLT of Theorem 8.2 holds with an = nE(X) and bn = √n σ. Theorem 8.3 is of interest when E(X²) = ∞. In this case, one can choose to truncate the Xi's at a suitable level cn. Separate Sn into two parts, one with the Xi beyond ±cn and the other bounded by ±cn. The former takes the value 0 with chance going to 1. The latter, when standardized by
an = nE(X 1_{{|X|≤cn}}) and bn = √(n E(X² 1_{{|X|≤cn}})) ≈ cn,
converges to N(0, 1).
Example 8.1 Recall Example 6.3, in which X, X1, X2, ... are iid and symmetric with P(|X| > x) = x^{−α} for some α > 0 and all large x. Then, Theorem 8.3 implies that (Sn − an)/bn → N(0, 1) for suitable an, bn if and only if α ≥ 2. Indeed, when α > 2, the common variance is finite and the CLT applies. When α = 2, E(X²) = ∞ but E(X² 1_{{|X|≤x}}) grows like 2 log x, so (8.2) still holds.
When α < 2, condition (8.2) in Theorem 8.3 cannot hold. In fact, Sn, when properly normalized, converges to a non-normal distribution.
Theorem 8.4 (Lindeberg-Feller CLT) Suppose X1, X2, ... are independent with mean 0 and finite variances σj² = E(Xj²), and let sn² = Σ_{j=1}^n σj². If, for every ǫ > 0,
(1/sn²) Σ_{j=1}^n E(Xj² 1_{{|Xj|>ǫsn}}) → 0,   (8.3)
then Sn/sn → N(0, 1). Conversely, if max_{j≤n} σj²/sn² → 0 and Sn/sn → N(0, 1), then (8.3) holds.
Proof. "⇐=" The Lindeberg condition (8.3) implies
max_{1≤j≤n} σj²/sn² ≤ ǫ² + (1/sn²) max_{1≤j≤n} E(Xj² 1_{{|Xj|>ǫsn}}) → 0,   (8.4)
by letting n → ∞ and then ǫ ↓ 0. Observe that for every real x > 0, |e^{−x} − 1 + x| ≤ x²/2. Moreover, for complex zj and wj with |zj| ≤ 1 and |wj| ≤ 1,
|Π_{j=1}^n zj − Π_{j=1}^n wj| ≤ Σ_{j=1}^n |zj − wj|,   (8.5)
which can be proved by induction. With Lemma 8.1, it follows that, for any ǫ > 0,
|E(e^{itXj/sn}) − e^{−t²σj²/(2sn²)}|
≤ |E[1 + itXj/sn − (tXj)²/(2sn²)] − [1 − t²σj²/(2sn²)]| + E[min(t²Xj²/sn², |tXj|³/(6sn³))] + t⁴σj⁴/(8sn⁴)
≤ (t²/sn²) E(Xj² 1_{{|Xj|>ǫsn}}) + (|t|³/(6sn³)) E(|Xj|³ 1_{{|Xj|≤ǫsn}}) + t⁴σj⁴/(8sn⁴)
≤ (t²/sn²) E(Xj² 1_{{|Xj|>ǫsn}}) + (ǫ|t|³/sn²) E(Xj²) + (t⁴σj²/sn²) max_{1≤k≤n} σk²/sn².
Then, for any fixed t,
|E(e^{itSn/sn}) − e^{−t²/2}|
= |Π_{j=1}^n E(e^{itXj/sn}) − Π_{j=1}^n e^{−t²σj²/(2sn²)}|
≤ Σ_{j=1}^n |E(e^{itXj/sn}) − e^{−t²σj²/(2sn²)}|   by (8.5)
≤ Σ_{j=1}^n [(t²/sn²) E(Xj² 1_{{|Xj|>ǫsn}}) + (ǫ|t|³/sn²) E(Xj²) + (t⁴σj²/sn²) max_{1≤k≤n} σk²/sn²]
≤ (t²/sn²) Σ_{j=1}^n E(Xj² 1_{{|Xj|>ǫsn}}) + ǫ|t|³ + t⁴ max_{1≤j≤n} σj²/sn²
→ ǫ|t|³, as n → ∞, by (8.3) and (8.4).
Since ǫ > 0 is arbitrary, it follows that E(e^{itSn/sn}) → e^{−t²/2} for all t. Levy's continuity theorem then implies Sn/sn → N(0, 1).
"=⇒" Let ψj be the characteristic function of Xj. The asymptotic normality is equivalent to Π_{j=1}^n ψj(t/sn) → e^{−t²/2}. Notice that Lemma 8.1 (with n = 1) implies
|ψj(t/sn) − 1| ≤ t²σj²/sn².   (8.6)
Write, as n → ∞,
|Σ_{j=1}^n [ψj(t/sn) − 1] + t²/2|
= |Σ_{j=1}^n [ψj(t/sn) − 1 − log ψj(t/sn)] + Σ_{j=1}^n log ψj(t/sn) + t²/2|
≤ Σ_{j=1}^n |ψj(t/sn) − 1 − log ψj(t/sn)| + o(1)
≤ Σ_{j=1}^n |ψj(t/sn) − 1|² + o(1)
≤ max_{1≤k≤n} |ψk(t/sn) − 1| × Σ_{j=1}^n |ψj(t/sn) − 1| + o(1)
≤ 4 max_{1≤k≤n} (t²σk²/sn²) × Σ_{j=1}^n (t²σj²/sn²) + o(1)   by (8.6)
= 4t⁴ max_{1≤k≤n} (σk²/sn²) + o(1) → 0,
since Σ_{j=1}^n σj²/sn² = 1 and max_{1≤k≤n} σk²/sn² → 0.
On the other hand, by the definition of the characteristic function, the above gives, as n → ∞,
o(1) = Σ_{j=1}^n [ψj(t/sn) − 1] + t²/2
= Σ_{j=1}^n E(e^{itXj/sn} − 1) + t²/2
= Σ_{j=1}^n E[(cos(tXj/sn) − 1) 1_{{|Xj|>ǫsn}}] + Σ_{j=1}^n E[(cos(tXj/sn) − 1) 1_{{|Xj|≤ǫsn}}] + t²/2 + i Σ_{j=1}^n E(sin(tXj/sn)),
where the imaginary part is immaterial. Taking the real part and using cos(x) − 1 ≥ −2 on {|Xj| > ǫsn}, cos(x) − 1 ≥ −x²/2 on {|Xj| ≤ ǫsn}, and (1/sn²) Σ_{j=1}^n E(Xj²) = 1, we obtain
(t²/(2sn²)) Σ_{j=1}^n E(Xj² 1_{{|Xj|>ǫsn}}) ≤ 2 Σ_{j=1}^n P(|Xj| > ǫsn) + o(1),
so that
(1/sn²) Σ_{j=1}^n E(Xj² 1_{{|Xj|>ǫsn}}) ≤ (2/t²) Σ_{j=1}^n 2P(|Xj| > ǫsn) + o(1)
≤ (4/t²) Σ_{j=1}^n σj²/(ǫsn)² + o(1)   by the Chebyshev inequality
≤ 4/(t²ǫ²) + o(1).
Since t can be chosen arbitrarily large, Lindeberg condition holds.
Remark. Sufficiency was proved by Lindeberg in 1922 and necessity by Feller in 1935. The Lindeberg-Feller CLT is one of the most far-reaching results in probability theory. Nearly all generalizations of the various types of central limit theorems spring from the Lindeberg-Feller CLT, such as, for example, CLTs for martingales, for renewal processes, or for weakly dependent processes. The insight of the Lindeberg condition (8.3) is that the "wild" values of the random variables, compared with sn, the standard deviation of Sn used as the normalizing constant, are insignificant and can be truncated off without affecting the general behavior of the partial sum Sn.
Example 8.2. Suppose Xn are independent with σn² = E(Xn²) = n^{2−α}/2, where 0 < α < 3, so that sn² = Σ_{j=1}^n j^{2−α}/2 increases to ∞ at the order of n^{3−α}. For the particular Xn considered, the Lindeberg condition (8.3) is equivalent to n²/n^{3−α} → 0, i.e., 0 < α < 1. On the other hand, max_{1≤j≤n} σj²/sn² → 0. Therefore, it follows from Theorem 8.4 that Sn/sn → N(0, 1) if and only if 0 < α < 1.
Example 8.3 Suppose Xn are independent and P(Xn = 1) = 1/n = 1 − P(Xn = 0). Then,
[Sn − log(n)]/√(log(n)) → N(0, 1) in distribution.
It is clear that E(Xn) = 1/n and var(Xn) = (1 − 1/n)/n. So E(Sn) = Σ_{i=1}^n 1/i and var(Sn) = Σ_{i=1}^n (1 − 1/i)/i ≈ log(n). As the Xn are all bounded by 1 and var(Sn) ↑ ∞, the Lindeberg condition holds. Then, [Sn − log(n)]/√(log(n)) → N(0, 1) in distribution, since |log(n) − Σ_{i=1}^n 1/i| ≤ 1 and var(Sn)/log(n) → 1.
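A small simulation of Example 8.3 (a sketch, not part of the original notes; the values of n and the number of replications are arbitrary, and the centering at log n leaves a bounded shift of about Euler's constant over √(log n)):

```python
# X_k ~ Bernoulli(1/k) independent; [S_n - log n]/sqrt(log n) is roughly N(0,1).
import numpy as np

rng = np.random.default_rng(8)
n, reps = 50_000, 2_000
inv_k = 1.0 / np.arange(1, n + 1)
z = np.empty(reps)
for i in range(reps):
    s = (rng.uniform(size=n) < inv_k).sum()
    z[i] = (s - np.log(n)) / np.sqrt(np.log(n))
print(z.mean(), z.std())   # roughly 0 and 1
```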
Theorem 8.2 as well as the following Lyapunov CLT are both special cases of the Lindeberg-Feller CLT. Nevertheless, they are convenient in applications.
Corollary (Lyapunov CLT) Suppose Xn are independent with mean 0 and Σ_{j=1}^n E(|Xj|^δ)/sn^δ → 0 for some δ > 2. Then Sn/sn → N(0, 1).
Proof. For any ǫ > 0, as n → ∞,
(1/sn²) Σ_{j=1}^n E(Xj² 1_{{|Xj|>ǫsn}}) = Σ_{j=1}^n E[(Xj²/sn²) 1_{{|Xj|/sn>ǫ}}] ≤ (1/ǫ^{δ−2}) Σ_{j=1}^n E(|Xj|^δ/sn^δ) → 0.
DIY Exercises
Exercise 8.1 ⋆⋆⋆ Suppose Xn are independent with
P(Xn = n^α) = P(Xn = −n^α) = 1/(2n^β) and P(Xn = 0) = 1 − 1/n^β,
with 2α > β − 1. Show that the Lindeberg condition holds if and only if 0 ≤ β < 1.
Exercise 8.2 ⋆⋆⋆ Suppose Xn are iid with mean 0 and variance 1. Let an > 0 be such that sn² = Σ_{i=1}^n ai² → ∞ and an/sn → 0. Show that Σ_{i=1}^n ai Xi/sn → N(0, 1).
Exercise 8.3 ⋆⋆⋆ Suppose X1, X2, ... are independent and Xn = Yn + Zn, where Yn takes values 1 and −1 with chance 1/2 each, and P(Zn = ±n) = 1/(2n²) = (1 − P(Zn = 0))/2. Show that the Lindeberg condition does not hold, yet Sn/√n → N(0, 1).
Exercise 8.4 ⋆⋆⋆ Suppose X1, X2, ... are iid nonnegative r.v.s with mean 1 and finite variance σ² > 0. Show that 2(√Sn − √n) → N(0, σ²) in distribution.
Review of Probability Theory
1. Probability Calculation.
Calculation of probabilities of events for discrete outcomes (e.g., coin tossing, roll of dice, etc.)
Calculation of the probability of certain events for given density functions of 1 or 2 dimension.
2. Probability space.
(1). Set operations: ∪, ∩, complement.
(2). σ-algebra. Definition and implications (the collection of sets which is closed on set operations).
(3). Kolmogorov’s trio of probability space.
(4). Independence of events and conditional probabilities of events.
(5). Borel-Cantelli lemma.
(6). Sets and set-index functions. 1A∩B = 1A 1B and 1Ac = 1 − 1A . 1{An ,i.o.} = lim supn 1An . etc.
{An, i.o.} = ∩_{n=1}^∞ ∪_{k=n}^∞ Ak = lim_{n→∞} ∪_{k=n}^∞ Ak = lim sup_n An (with indicator lim sup_n 1_{An}(ω)).
ω ∈ {An , i.o.} means ω ∈ An for infinitely many An . (Mathematically precisely, there exists a
subsequence nk → ∞, such that ω ∈ Ank for all nk .)
∪_{n=1}^∞ ∩_{k=n}^∞ Ak = lim_{n→∞} ∩_{k=n}^∞ Ak = lim inf_n An (with indicator lim inf_n 1_{An}(ω)).
ω ∈ lim inf n An means ω ∈ An for all large n. (Mathematically precisely, there exists an N , such
that ω ∈ An for all n ≥ N .)
(7). ⋆ completion of probability space.
3. Random variables.
(1). Definitions.
(2). c.d.f., density or probability functions.
(3). Expectation: definition, interpretation as weighted (by chance) average.
(4). Properties:
(i). Dominated convergence. (Proof and Application).
(ii). ⋆ Fatou’s lemma and monotone convergence.
(iii). Jensen’s inequalities
(iv) Chebyshev inequalities.
(5). Independence of r.v.s.
(6). ⋆ Conditional distribution and expectation given a σ-algebra (Definition and simple properties.)
(7). Commonly used distributions and r.v.s.
4. Convergence.
(1). Definition of a.e., in prob, Lp and in distribution convergence.
(2). ⋆ Equivalence of four definitions of in distribution convergence.
(3). The diagram about the relation among convergence modes.
(4). The technique of truncating off r.v.s
5. LLN.
(1). WLLN. Application of the theorem. (⋆ the proof)
(2). Kolmogorov’s inequality (⋆ the proof).
(3). Corollary (⋆ the proof)