
An Introduction to Advanced

Probability and Statistics

Junhui Qian

© September 7, 2012
Preface

This booklet introduces advanced probability and statistics to first-year Ph.D. stu-
dents in economics.

In preparation of this text, I borrow heavily from the lecture notes of Yoosoon Chang
and Joon Y. Park, who taught me econometrics at Rice University. All errors are
mine.

Shanghai, China, October 2011
Junhui Qian

Contents

Preface

1 Introduction to Probability
1.1 Probability Triple
1.2 Conditional Probability and Independence
1.3 Limits of Events
1.4 Construction of Probability Measure
1.5 Exercises

2 Random Variable
2.1 Measurable Functions
2.2 Random Variables
2.3 Random Vectors
2.4 Density
2.5 Independence
2.6 Exercises

3 Expectations
3.1 Integration
3.2 Expectation
3.3 Moment Inequalities
3.4 Conditional Expectation
3.5 Conditional Distribution
3.6 Exercises

4 Distributions and Transformations
4.1 Alternative Characterizations of Distribution
4.1.1 Moment Generating Function
4.1.2 Characteristic Function
4.1.3 Quantile Function
4.2 Common Families of Distributions
4.3 Transformed Random Variables
4.3.1 Distribution Function Technique
4.3.2 MGF Technique
4.3.3 Change-of-Variable Transformation
4.4 Multivariate Normal Distribution
4.4.1 Introduction
4.4.2 Marginals and Conditionals
4.4.3 Quadratic Forms
4.5 Exercises

5 Introduction to Statistics
5.1 General Settings
5.2 Estimation
5.2.1 Method of Moment
5.2.2 Maximum Likelihood
5.2.3 Unbiasedness and Efficiency
5.2.4 Lehmann-Scheffé Theorem
5.2.5 Efficiency Bound
5.3 Hypothesis Testing
5.3.1 Basic Concepts
5.3.2 Likelihood Ratio Tests
5.4 Exercises

6 Asymptotic Theory
6.1 Introduction
6.1.1 Modes of Convergence
6.1.2 Small o and Big O Notations
6.1.3 Delta Method
6.2 Limit Theorems
6.2.1 Law of Large Numbers
6.2.2 Central Limit Theorem
6.3 Asymptotics for Maximum Likelihood Estimation
6.3.1 Consistency of MLE
6.3.2 Asymptotic Normality of MLE
6.3.3 MLE-Based Tests
6.4 Exercises

References

Chapter 1

Introduction to Probability

In this chapter we lay down the measure-theoretic foundation of probability.

1.1 Probability Triple


We first introduce the well-known probability triple (Ω, F, P), where Ω is the sample space, F is a sigma-field of subsets of Ω, and P is a probability measure. We define and characterize each element of the probability triple in the following.
The sample space Ω is a set of outcomes from a random experiment. For instance,
in a coin tossing experiment, the sample space is obviously {H, T }, where H denotes
head and T denotes tail. For another example, the sample space may be an interval,
say Ω = [0, 1], on the real line, and any outcome ω ∈ Ω is a real number randomly
selected from the interval.
To introduce sigma-field, we first define

Definition 1.1.1 (Field (or Algebra)) A collection of subsets F is called a field


or an algebra, if the following holds.

(a) Ω ∈ F

(b) E ∈ F ⇒ E c ∈ F
(c) E1, . . . , Em ∈ F ⇒ ∪_{n=1}^m En ∈ F

Note that (c) says that a field is closed under finite union. In contrast, a sigma-field,
which is defined as follows, is closed under countable union.

Definition 1.1.2 (sigma-field (or sigma-algebra)) A collection of subsets F is
called a σ-field or a σ-algebra, if the following holds.

(a) Ω ∈ F

(b) E ∈ F ⇒ E c ∈ F
(c) E1, E2, . . . ∈ F ⇒ ∪_{n=1}^∞ En ∈ F

Remarks:

• In both definitions, (a) and (b) imply that the empty set ∅ ∈ F

• (b) and (c) imply that if E1, E2, . . . ∈ F, then ∩_{n=1}^∞ En ∈ F, since ∩n En = (∪n Enᶜ)ᶜ.

• A σ-field is a field. A field, however, need not be a σ-field; it is automatically one when Ω is finite.

• An arbitrary intersection of σ-fields is still a σ-field. (Exercise 1)

In the following, we may interchangeably write sigma-field as σ-field. An element


E of the σ-field F in the probability triple is called an event. For an example, if
we toss a coin twice, then the sample space would be Ω = {HH, HT, T H, T T }. A
σ-field (or field) would be

F = {∅, Ω, {HH}, {HT }, {T H}, {T T },


{HH, HT }, {HH, T H}, {HH, T T }, {HT, T H}, {HT, T T }, {T H, T T },
{HH, HT, T H}, {HH, HT, T T }, {HH, T H, T T }, {HT, T H, T T }}.

The event {HH} would be described as “two heads in a row”. The event {HT, T T }
would be described as “the second throw obtains tail”.
For an example of infinite sample space, we may consider a thought experiment
of tossing a coin for infinitely many times. The sample space would be Ω =
{(r1 , r2 , . . . , )|ri = 1 or 0}, where 1 stands for head and 0 stands for tail. One
example of an event would be {r1 = 1, r2 = 1}, which says that the first two throws
give heads in a row.
A sigma-field can be generated from a collection of subsets of Ω, a field for example.
We define

Definition 1.1.3 (Generated σ-field) Let S be a collection of subsets of Ω. The


σ-field generated by S, σ(S), is defined to be the intersection of all the σ-fields
containing S.

In other words, σ(S) is the smallest σ-field containing S.
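As a purely illustrative aside (these notes contain no code), the generated σ-field can be computed by brute force when Ω is finite: start from S and repeatedly close the collection under complements and pairwise unions until nothing new appears. The Python sketch below does this for the two-coin sample space; the function name sigma_field and the use of frozenset are my own choices, not anything from the text.

    from itertools import combinations

    def sigma_field(omega, S):
        """Smallest sigma-field on the finite set `omega` containing the sets in `S`.

        On a finite sample space, closure under complement and finite (hence
        countable) union is enough, so we iterate until a fixed point."""
        omega = frozenset(omega)
        F = {frozenset(), omega} | {frozenset(A) for A in S}
        while True:
            new = set(F)
            new |= {omega - A for A in F}                    # complements
            new |= {A | B for A, B in combinations(F, 2)}    # pairwise unions
            if new == F:
                return F
            F = new

    omega = {"HH", "HT", "TH", "TT"}
    S = [{"HH"}]                     # generator: the event "two heads in a row"
    F = sigma_field(omega, S)
    print(len(F))                    # 4 sets: the empty set, Omega, {HH}, {HT, TH, TT}
    print(sorted(tuple(sorted(A)) for A in F))

With the three generators S = [{"HH"}, {"HT"}, {"TH"}] the same routine returns all 16 subsets of Ω, that is, the power set listed in the example above.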
Now we introduce the axiomatic definition of probability measure.

Definition 1.1.4 (Probability Measure) A set function P on a σ-field F is a


probability measure if it satisfies:

(1) P(E) ≥ 0 ∀E ∈ F
(2) P(Ω) = 1
(3) If E1, E2, . . . ∈ F are disjoint, then P(∪n En) = Σn P(En).

Properties of Probability Measure

(a) P(∅) = 0
(b) P(Ac ) = 1 − P(A)
(c) A ⊂ B ⇒ P(A) ≤ P(B)
(d) P(A ∪ B) ≤ P(A) + P(B)
(e) An ⊂ An+1 for n = 1, 2, . . . ⇒ P(An) ↑ P(∪_{n=1}^∞ An)

(f) An ⊃ An+1 for n = 1, 2, . . . ⇒ P(An) ↓ P(∩_{n=1}^∞ An)

(g) P(∪_{n=1}^∞ An) ≤ Σ_{n=1}^∞ P(An)

Proof: (a)-(c) are trivial.

(d) Write A∪B = (A∩B c )∪(A∩B)∪(Ac ∩B), a union of disjoint sets. By adding
and subtracting P(A ∩ B), we have P(A ∪ B) = P(A) + P(B) − P(A ∩ B), using
the fact that A = (A ∩ B) ∪ (A ∩ B c ), also a disjoint union.
(e) Define B1 = A1 and Bn = An − An−1 for n ≥ 2. We have An = ∪_{j=1}^n Bj and ∪_{j=1}^∞ Aj = ∪_{j=1}^∞ Bj. Then it follows from

P(An) = Σ_{j=1}^n P(Bj) = Σ_{j=1}^∞ P(Bj) − Σ_{j=n+1}^∞ P(Bj) = P(∪_{n=1}^∞ An) − Σ_{j=n+1}^∞ P(Bj),

and the fact that the tail sum Σ_{j=n+1}^∞ P(Bj) → 0 as n → ∞, that P(An) ↑ P(∪_{n=1}^∞ An).

(f) Note that Anᶜ ⊂ An+1ᶜ, and use (e).


(g) Extend (d).

Note that we may write lim_{n→∞} An = ∪_{n=1}^∞ An if An is monotone increasing, and lim_{n→∞} An = ∩_{n=1}^∞ An if An is monotone decreasing.

1.2 Conditional Probability and Independence
Definition 1.2.1 (Conditional Probability) For an event F ∈ F that satisfies
P (F ) > 0, we define the conditional probability of another event E given F by

P(E|F) = P(E ∩ F) / P(F).

• For a fixed event F , the function Q(·) = P (·|F ) is a probability. All properties
of probability measure hold for Q.

• The probability of intersection can be defined via conditional probability:

P (E ∩ F ) = P (E|F ) P (F ) ,

and
P (E ∩ F ∩ G) = P (E|F ∩ G) P (F |G) P (G) .

• If {Fn} is a partition of Ω, ie, the Fn's are disjoint and ∪n Fn = Ω, then the following theorem of total probability holds:

P(E) = Σn P(E|Fn) P(Fn), for every event E.

• The Bayes Formula follows from P (E ∩ F ) = P (E|F ) P (F ) = P (F |E) P (E),

P(F|E) = P(E|F) P(F) / P(E),

and

P(Fk|E) = P(E|Fk) P(Fk) / Σn P(E|Fn) P(Fn).

Definition 1.2.2 (Independence of Events) Events E and F are called inde-


pendent if P (E ∩ F ) = P (E) P (F ).

• We may equivalently define independence as

P(E|F) = P(E), when P(F) > 0.

• E1, E2, . . . are said to be independent if, for any (i1, . . . , ik),

P(Ei1 ∩ Ei2 ∩ · · · ∩ Eik) = Π_{j=1}^k P(Eij).

• Let E, E1 , E2 , . . . be independent events. Then E and σ(E1 , E2 , . . .) are inde-


pendent, ie, for any S ∈ σ(E1 , E2 , . . .), P (E ∩ S) = P (E) P (S).

• Let E1 , E2 , . . . , F1 , F2 , . . . be independent events. If E ∈ σ(E1 , E2 , . . .), then


E, F1 , F2 , . . . are independent; furthermore, σ(E1 , E2 , . . .) and σ(F1 , F2 , . . .) are
independent.
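As an illustration of the total probability and Bayes formulas above, here is a short Python check (my own toy two-state numbers, not anything from the text):

    # Hypothetical example: a partition {F1, F2} of Omega with P(F1) = 0.3, P(F2) = 0.7,
    # and conditional probabilities P(E|F1) = 0.9, P(E|F2) = 0.2.
    P_F = [0.3, 0.7]
    P_E_given_F = [0.9, 0.2]

    # Theorem of total probability: P(E) = sum_n P(E|Fn) P(Fn)
    P_E = sum(pe * pf for pe, pf in zip(P_E_given_F, P_F))

    # Bayes formula: P(Fk|E) = P(E|Fk) P(Fk) / sum_n P(E|Fn) P(Fn)
    P_F_given_E = [pe * pf / P_E for pe, pf in zip(P_E_given_F, P_F)]

    print(P_E)            # 0.41
    print(P_F_given_E)    # [0.6585..., 0.3414...]; the posterior probabilities sum to 1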

1.3 Limits of Events


limsup and liminf First recall that for a sequence of real numbers {xn}, we define

lim sup_{n→∞} xn = inf_k { sup_{n≥k} xn }

lim inf_{n→∞} xn = sup_k { inf_{n≥k} xn }.

And we say that xn → x ∈ [−∞, ∞] if lim sup xn = lim inf xn = x.

Definition 1.3.1 (limsup of Events) For a sequence of events (En ), we define


lim sup_{n→∞} En = ∩_{k=1}^∞ ∪_{n=k}^∞ En
= {ω | ∀k, ∃n(ω) ≥ k s.t. ω ∈ En}
= {ω | ω ∈ En for infinitely many n}
= {ω | En i.o.},

where i.o. denotes “infinitely often”.

We may intuitively interpret lim supn→∞ En as the event that En occurs infinitely
often.

Definition 1.3.2 (liminf of Events) We define

lim inf_{n→∞} En = ∪_{k=1}^∞ ∩_{n=k}^∞ En
= {ω | ∃k(ω) s.t. ω ∈ En ∀n ≥ k}
= {ω | ω ∈ En for all large n}
= {ω | En e.v.},
where e.v. denotes “eventually”.

It is obvious that (lim inf En)ᶜ = lim sup Enᶜ and (lim sup En)ᶜ = lim inf Enᶜ. When lim sup En = lim inf En, we say (En) has a limit lim En.

Lemma 1.3.3 (Fatou’s Lemma) We have


P(lim inf En ) ≤ lim inf P(En ) ≤ lim sup P(En ) ≤ P(lim sup En ).
Proof: Note that ∩_{n=k}^∞ En is monotone increasing in k, and ∩_{n=k}^∞ En ↑ ∪_{k=1}^∞ ∩_{n=k}^∞ En = lim inf En. Hence P(Ek) ≥ P(∩_{n=k}^∞ En) ↑ P(lim inf En). The third inequality can be similarly proved. And the second inequality is obvious.

Lemma 1.3.4 (Borel-Cantelli Lemma) Let E1 , E2 , . . . ∈ F , then


(i) Σ_{n=1}^∞ P(En) < ∞ ⇒ P(lim sup En) = 0;

(ii) if Σ_{n=1}^∞ P(En) = ∞, and if {En} are independent, then P(lim sup En) = 1.

Proof: (i) P(lim sup En) ≤ P(∪_{n≥k} En) ≤ Σ_{n=k}^∞ P(En) → 0 as k → ∞.

(ii) For k, m ∈ N, using 1 − x ≤ exp(−x), ∀x ∈ R, we have

P(∩_{n=k}^∞ Enᶜ) ≤ P(∩_{n=k}^{k+m} Enᶜ) = Π_{n=k}^{k+m} P(Enᶜ) = Π_{n=k}^{k+m} (1 − P(En)) ≤ exp( − Σ_{n=k}^{k+m} P(En) ) → 0,

as m → ∞. Since P(∪_k ∩_{n≥k} Enᶜ) ≤ Σ_k P(∩_{n≥k} Enᶜ) = 0, P(lim sup En) = 1 − P(∪_{k≥1} ∩_{n≥k} Enᶜ) = 1.

Remarks:

• (ii) does not hold if {En} are not independent. To give a counterexample, consider infinite coin tossing. Let E1 = E2 = · · · = {r1 = 1}, the event that the first toss comes up heads; then {En} are not independent and P(lim sup En) = P(r1 = 1) = 1/2, even though Σn P(En) = ∞.

• Let Hn be the event that the n-th toss comes up heads. We have P(Hn) = 1/2 and Σn P(Hn) = ∞. Hence P(Hn i.o.) = 1, and P(Hnᶜ e.v.) = 1 − P(Hn i.o.) = 0.

• Let Bn = H_{2^n+1} ∩ H_{2^n+2} ∩ · · · ∩ H_{2^n+log2 n}. The Bn are independent, and since P(Bn) = (1/2)^{log2 n} = 1/n, Σn P(Bn) = ∞. Hence P(Bn i.o.) = 1: runs of heads of length log2 n occur infinitely often.

• But if Bn = H_{2^n+1} ∩ H_{2^n+2} ∩ · · · ∩ H_{2^n+2 log2 n}, then P(Bn) = 1/n² and Σn P(Bn) < ∞, so P(Bn i.o.) = 0.

• Let Bn = Hn ∩ Hn+1; we also have P(Bn i.o.) = 1. To show this, consider the subsequence B_{2k}, which consists of independent events.
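A purely illustrative simulation of the last remarks (not part of the original notes): along one long sequence of fair coin tosses, runs of heads of length about log2 n starting at position 2^n keep occurring, while runs twice that long essentially stop. With finitely many tosses this can only suggest, not prove, the zero-one behaviour.

    import math
    import random

    random.seed(0)
    N_BLOCKS = 14                     # examine blocks starting at 2^2, ..., 2^14
    tosses = [random.random() < 0.5 for _ in range(2 ** (N_BLOCKS + 1))]  # True = head

    def block_all_heads(n, length):
        """Event Bn: tosses numbered 2^n + 1, ..., 2^n + length are all heads."""
        start = 2 ** n                # 0-based index of toss number 2^n + 1
        length = max(1, int(length))
        return all(tosses[start:start + length])

    short = [block_all_heads(n, math.log2(n)) for n in range(2, N_BLOCKS + 1)]
    long_ = [block_all_heads(n, 2 * math.log2(n)) for n in range(2, N_BLOCKS + 1)]

    print("runs of length log2(n) observed:   ", sum(short), "out of", len(short))
    print("runs of length 2*log2(n) observed: ", sum(long_), "out of", len(long_))
    # With most seeds a few of the short runs occur while the long runs hardly ever do,
    # in line with P(Bn) = 1/n versus P(Bn) = 1/n^2; the i.o. statements themselves
    # concern the limit n -> infinity.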

Why σ-field? You may already see that events such as lim sup En and lim inf En
are very interesting events. To make meaningful probabilistic statements about
these events, we need to make sure that they are contained in F, on which P is
defined. This is why we require F to be a σ-field, which is closed under countable unions and intersections.

Definition 1.3.5 (Tail Fields) For a sequence of events E1 , E2 , . . ., the tail field
is given by
T = ∩_{n=1}^∞ σ(En, En+1, . . .).

• For any n, an event E ∈ T depends only on the events En, En+1, . . .; any finite number of the events are irrelevant.

• In the infinite coin tossing experiment,

– lim sup Hn, obtain infinitely many heads

– lim inf Hn, obtain heads for all but finitely many tosses (ie, only finitely many tails)

– lim sup H_{2^n}, infinitely many heads on tosses 2, 4, 8, . . .

– {lim_{n→∞} (1/n) Σ_{i=1}^n ri ≤ 1/3}

– {rn = rn+1 = · · · = rn+m for infinitely many n}, m fixed.

Theorem 1.3.6 (Kolmogorov Zero-One Law) Let a sequence of events E1, E2, . . . be independent with tail field T. If an event E ∈ T, then P(E) = 0 or 1.

Proof: Since E ∈ T ⊂ σ(En , En+1 , . . .), E, E1 , E2 , . . . , En−1 are independent. This


is true for all n, so E, E1 , E2 , . . . are independent. Hence E and σ(E1 , E2 , . . .) are
independent, ie, for all S ∈ σ(E1 , E2 , . . .), S and E are independent. On the other
hand, E ∈ T ⊂ σ(E1 , E2 , . . .). It follows that E is independent of itself! So
P(E ∩ E) = P(E)² = P(E), which implies P(E) = 0 or 1.

1.4 Construction of Probability Measure


A σ-field is typically an extremely complicated collection of sets, hence the difficulty of directly assigning probability to its elements, the events. Instead, we work on simpler classes of sets.

Definition 1.4.1 (π-system) A class of subsets of Ω, P, is a π-system if the fol-


lowing holds:
E, F ∈ P ⇒ E ∩ F ∈ P.

For example, the collection {(−∞, x] : x ∈ R} is a π-system.

Definition 1.4.2 (λ-system) A class of subsets of Ω, L, is a λ-system if

(a) Ω ∈ L
(b) If E, F ∈ L, and E ⊂ F , then F − E ∈ L
(c) If E1 , E2 , . . . ∈ L and En ↑ E, then E ∈ L.

• If E ∈ L, then Eᶜ ∈ L. This follows from (a) and (b).

• L is closed under countable unions only of monotone increasing sequences of events.

Theorem 1.4.3 A class F of subsets of Ω is a σ-field if and only if F is both a


π-system and a λ-system.

Proof: "Only if" is trivial. To show "if", it suffices to show that for any E1, E2, . . . ∈ F, ∪n En ∈ F. We indeed have:

∪_{k=1}^n Ek = (∩_{k=1}^n Ekᶜ)ᶜ ↑ ∪n En.

Notation: Let S be a class of subsets of Ω. σ(S) is the σ-field generated by S.
π(S) is the π-system generated by S, meaning that π(S) is the intersection of all π-systems that contain S. λ(S) is similarly defined as the λ-system generated by S.
We have
π(S) ⊂ σ(S) and λ(S) ⊂ σ(S).

Lemma 1.4.4 (Dynkin’s Lemma) Let P be a π-system, then λ(P) = σ(P).

Proof: It suffices to show that λ(P) is a π-system.

• For an arbitrary C ∈ P, define

DC = {B ∈ λ(P)|B ∩ C ∈ λ(P) } .

• We have P ⊂ DC , since for any E ∈ P ⊂ λ(P), E ∩ C ∈ P ⊂ λ(P), hence


E ∈ DC .

• For any C ∈ P, DC is a λ-system.

– Ω ∈ DC
– If B1 , B2 ∈ DC and B1 ⊂ B2 , then (B2 −B1 )∩C = B2 ∩C −B1 ∩C. Since
B1 ∩ C, B2 ∩ C ∈ λ(P) and (B1 ∩ C) ⊂ (B2 ∩ C), (B2 − B1 ) ∩ C ∈ λ(P).
Hence (B2 − B1 ) ∈ DC .
– If B1 , B2 , . . . ∈ DC , and Bn ↑ B, then (Bn ∩ C) ↑ (B ∩ C) ∈ λ(P).
Hence B ∈ DC .

• Thus, for any C ∈ P, DC is a λ-system containing P. And it is obvious that


λ(P) ⊂ DC .

• Now for any A ∈ λ(P), we define

DA = {B ∈ λ(P)|B ∩ A ∈ λ(P)} .

By definition, DA ⊂ λ(P).

• We have P ⊂ DA, since if E ∈ P, then E ∩ A ∈ λ(P); this is because A ∈ λ(P) ⊂ DE (take C = E above), so that A ∩ E ∈ λ(P).

• We can check that DA is a λ-system that contains P, hence λ(P) ⊂ DA . We


thus have DA = λ(P), which means that for any A, B ∈ λ(P), A ∩ B ∈ λ(P).
Thus λ(P) is a π-system. Q.E.D.

Remark: If P is a π-system, and L is a λ-system that contains P, then σ(P) ⊂ L.
To see why, note that λ(P) = σ(P) is the smallest λ-system that contains P.

Theorem 1.4.5 (Uniqueness of Extension) Let P be a π-system on Ω, and P1


and P2 be probability measures on σ(P). If P1 and P2 agree on P, then they agree
on σ(P).

Proof: Let D = {E ∈ σ(P)|P1 (E) = P2 (E)}. D is a λ-system, since

• Ω ∈ D,

• E, F ∈ D and E ⊂ F imply F − E ∈ D, since

P1 (F − E) = P1 (F ) − P1 (E) = P2 (F ) − P2 (E) = P2 (F − E).

• If E1 , E2 , . . . ∈ D and En ↑ E, then E ∈ D, since

P1 (E) = lim P1 (En ) = lim P2 (En ) = P2 (E).

The fact that P1 and P2 agree on P implies that P ⊂ D. The remark following
Dynkin’s lemma shows that σ(P) ⊂ D. On the other hand, by definition, D ⊂ σ(P).
Hence D = σ(P). Q.E.D.

Borel σ-field The Borel σ-field is the σ-field generated by the family of open subsets (of a topological space). For probability theory, the most important Borel σ-field is the σ-field generated by the open subsets of the real line R, which we denote B(R).
Almost every subset of R that we can think of is in B(R), the elements of which may be quite complicated. As it is difficult for economic agents to assign probabilities to complicated sets, we often have to consider "simpler" systems of sets, a π-system, for example.
Define
P = {(−∞, x] : x ∈ R}.
It can be easily verified that P is a π-system. And we show in the following that P generates B(R).

Proof: It is clear from

(−∞, x] = ∩_n (−∞, x + 1/n), ∀x ∈ R

that σ(P) ⊂ B(R). To show σ(P) ⊃ B(R), note that every open set of R is
a countable union of open intervals. It therefore suffices to show that the open
intervals of the form (a, b) are in σ(P). This is indeed the case, since
(a, b) = (−∞, a]ᶜ ∩ ( ∪_n (−∞, b − 1/n] ).

Note that the above holds even when b ≤ a, in which case (a, b) = ∅.

Theorem 1.4.6 (Extension Theorem) Let F0 be a field on Ω, and let F =


σ(F0 ). If P0 is a countably additive set function P0 : F0 → [0, 1] with P0 (∅) = 0
and P0 (Ω) = 1, then there exists a probability measure on (Ω, F) such that

P = P0 on F0 .

Proof: We first define, for any E ⊂ Ω,

P(E) = inf_{ {An} } { Σn P0(An) : An ∈ F0, E ⊂ ∪n An }.

We next prove that

(a) P is an outer measure.

(b) P is a probability measure on (Ω, M), where M is a σ-field of P-measurable


sets in F.

(c) F0 ⊂ M

(d) P = P0 on F0 .

Note that (c) immediately implies that F ⊂ M. If we restrict P to the domain F, we obtain a probability measure on (Ω, F) that coincides with P0 on F0. The theorem is then proved. In the following we prove (a)-(d).

(a) We first define outer measure. A set function µ on (Ω, F) is an outer measure
if

(i) µ(∅) = 0.
(ii) E ⊂ F implies µ(E) ≤ µ(F ). (monotonicity)
(iii) µ(∪n En) ≤ Σn µ(En), where E1, E2, . . . ∈ F. (countable subadditivity)

• It is obvious that P(∅) = 0, since we may choose An = ∅ ∀n.

• For E ⊂ F, every collection {An} ⊂ F0 that covers F also covers E, so the infimum defining P(E) is taken over a larger family than that defining P(F). Monotonicity is now obvious.

• To show countable subadditivity, note that for each n we can find a collection {Cnk}_{k=1}^∞ such that Cnk ∈ F0, En ⊂ ∪k Cnk, and Σk P0(Cnk) ≤ P(En) + ϵ2^{−n}, where ϵ > 0. Since ∪n En ⊂ ∪n ∪k Cnk, P(∪n En) ≤ Σ_{n,k} P0(Cnk) ≤ Σn P(En) + ϵ. Since ϵ is arbitrarily chosen, countable subadditivity is proved.

(b) Now we define M as

M = {A ⊂ Ω|P (A ∩ E) + P (Ac ∩ E) = P (E) , ∀E ⊂ Ω}.

M contains sets that “split” every set E ⊂ Ω well. We call these sets P-
measurable. M has an equivalent definition,

M = {A ⊂ Ω|P (A ∩ E) + P (Ac ∩ E) ≤ P (E) , ∀E ⊂ Ω},

since E = (A ∩ E) ∪ (Ac ∩ E) and the countable subadditivity of P dictates


that P(A ∩ E) + P(Aᶜ ∩ E) ≥ P(E). To prove that P is a probability measure on (Ω, M), where M is a σ-field of P-measurable sets, we first establish:
• Lemma 1. If A1, A2, . . . ∈ M are disjoint, then P(∪n An) = Σn P(An).
Proof: First note that

P(A1 ∪ A2) = P(A1 ∩ (A1 ∪ A2)) + P(A1ᶜ ∩ (A1 ∪ A2)) = P(A1) + P(A2).

Induction thus obtains finite additivity. Now for any m ∈ N, we have by monotonicity,

Σ_{n≤m} P(An) = P(∪_{n≤m} An) ≤ P(∪n An).

Since m is arbitrarily chosen, we have Σn P(An) ≤ P(∪n An). Combining this with subadditivity, we obtain Lemma 1. Next we prove that M is a field.

• Lemma 2. M is a field on Ω.
Proof: It is trivial that Ω ∈ M and that A ∈ M ⇒ Ac ∈ M. It remains to
prove that A, B ∈ M ⇒ A ∩ B ∈ M. We first write,

(A ∩ B)c = (Ac ∩ B) ∪ (A ∩ B c ) ∪ (Ac ∩ B c ) .

Then
P ((A ∩ B) ∩ E) + P ((A ∩ B)c ∩ E)
= P (A ∩ B ∩ E) + P {[(Ac ∩ B) ∩ E] ∪ [(A ∩ B c ) ∩ E] ∪ [(Ac ∩ B c ) ∩ E]}
≤ P (A ∩ (B ∩ E)) + P (Ac ∩ (B ∩ E)) + P (A ∩ (B c ∩ E)) + P (Ac ∩ (B c ∩ E))
= P (B ∩ E) + P (B c ∩ E) = P (E) .
Using the second definition of M, we have A ∩ B ∈ M. Hence M is a field.
Next we establish that M is a σ-field. To show this we only need to show that M is closed under countable unions. We first prove two technical lemmas.
• Lemma 3. Let A1, A2, . . . ∈ M be disjoint. For each m ∈ N, let Bm = ∪_{n≤m} An. Then for all m and E ⊂ Ω, we have

P(E ∩ Bm) = Σ_{n≤m} P(E ∩ An).

Proof: We prove by induction. First, note that the lemma holds trivially when m = 1. Now suppose it holds for some m; we show that P(E ∩ Bm+1) = Σ_{n≤m+1} P(E ∩ An). Note that Bm ∩ Bm+1 = Bm and Bmᶜ ∩ Bm+1 = Am+1. So

P(E ∩ Bm+1) = P(Bm ∩ E ∩ Bm+1) + P(Bmᶜ ∩ E ∩ Bm+1)
            = P(E ∩ Bm) + P(E ∩ Am+1)
            = Σ_{n≤m+1} P(E ∩ An).

• Lemma 4. Let A1, A2, . . . ∈ M be disjoint; then ∪n An ∈ M.
Proof: For any m ∈ N, we have

P(E) = P(E ∩ Bm) + P(E ∩ Bmᶜ)
     = Σ_{n≤m} P(E ∩ An) + P(E ∩ Bmᶜ)
     ≥ Σ_{n≤m} P(E ∩ An) + P(E ∩ (∪n An)ᶜ),

since (∪n An)ᶜ ⊂ Bmᶜ. Since m is arbitrary, we have

P(E) ≥ Σn P(E ∩ An) + P(E ∩ (∪n An)ᶜ)
     ≥ P(E ∩ (∪n An)) + P(E ∩ (∪n An)ᶜ).

Hence ∪n An ∈ M. Now we are ready to prove:

• Lemma 5. M is a σ-field of subsets of Ω.
Proof: It suffices to show that if E1, E2, . . . ∈ M, then ∪n En ∈ M. Define A1 = E1 and Ai = Ei ∩ E1ᶜ ∩ E2ᶜ ∩ · · · ∩ Ei−1ᶜ for i ≥ 2. Then A1, A2, . . . ∈ M are disjoint and ∪n En = ∪n An ∈ M by Lemma 4.

(c) We now prove F0 ⊂ M.


Proof: Let A ∈ F0; we need to show that A ∈ M. For any E ⊂ Ω and any ϵ > 0, we can find a sequence E1, E2, . . . ∈ F0 such that E ⊂ ∪n En and

Σn P0(En) ≤ P(E) + ϵ.

By countable additivity of P0 on F0, we have P0(En) = P0(En ∩ A) + P0(En ∩ Aᶜ). Hence

Σn P0(En) = Σn P0(En ∩ A) + Σn P0(En ∩ Aᶜ)
          ≥ P((∪n En) ∩ A) + P((∪n En) ∩ Aᶜ)
          ≥ P(E ∩ A) + P(E ∩ Aᶜ).

Since ϵ is arbitrarily chosen, we have P(E) ≥ P(E ∩ A) + P(E ∩ Aᶜ). Hence A ∈ M.

(d) Finally, we prove that P = P0 on F0 .


Proof: Let E ∈ F0. It is obvious from the definition of P that P(E) ≤ P0(E). Let A1, A2, . . . ∈ F0 and E ⊂ ∪n An. Define a disjoint sequence of subsets {Bn} such that B1 = A1 and Bi = Ai ∩ A1ᶜ ∩ A2ᶜ ∩ · · · ∩ Ai−1ᶜ for i ≥ 2. We have Bn ⊂ An for all n and ∪n An = ∪n Bn. Using countable additivity of P0,

P0(E) = P0(E ∩ (∪n Bn)) = Σn P0(E ∩ Bn).

Hence

P0(E) ≤ Σn P0(Bn) ≤ Σn P0(An).

Now it is obvious that P(E) ≥ P0(E). The proof is now complete.

1.5 Exercises
1. Prove that an arbitrary intersection of σ-fields is a σ-field.

2. Show that

lim_{n→∞} ( −1/n, 1 − 1/n ] = [0, 1).
3. Let R be the sample space. We define a sequence En of subsets of R by

En = ( −1/n, 1/2 − 1/n ]   if n is odd,
En = [ 1/3 − 1/n, 2/3 + 1/n )   if n is even.

Find lim inf En and lim sup En . Let the probability P be given by the Lebesgue
measure on the unit interval [0, 1] (that is, the length of interval). Compare
P(lim inf En ), lim inf P(En ), P(lim sup En ), and lim sup P(En ).

4. Prove the following:


(a) If the events E and F are independent, then so are E c and F c .
(b) The events Ω and ∅ are independent of any event E.
(c) In addition to Ω and ∅, is there any event that is independent of itself?

5. Show that σ({[a, b]|∀a ≤ b, a, b ∈ R}) = B(R).

Chapter 2

Random Variable

2.1 Measurable Functions


Random variables are measurable functions from Ω to R. We first define measurable
functions and examine their properties. Let (S, G) be a general measurable space,
where G is a σ-field on a set S. For example, (Ω, F) is a measurable space, on which
random variables are defined.

Definition 2.1.1 (Measurable function) A function f : S → R is G-measurable


if, for any A ∈ B(R),

f −1 (A) ≡ {s ∈ S|f (s) ∈ A} ∈ G.

We simply call a function measurable if there is no possibility for confusion.

Remarks:

• For a G-measurable function f , f −1 is a mapping from B to G, while f is a


mapping from S to R.
• For some set E ∈ G, the indicator function IE is G-measurable.
• The mapping f −1 preserves all set operations:
f⁻¹(∪n An) = ∪n f⁻¹(An),   f⁻¹(Aᶜ) = (f⁻¹(A))ᶜ,   etc.

{f −1 (A)|A ∈ B} is thus a σ-field. It may be called the σ-field generated by f .

Properties:

(a) If C ⊂ B and σ(C) = B, then f −1 (A) ∈ G ∀A ∈ C implies that f is G-


measurable.
Proof: Let E = {B ∈ B|f −1 (B) ∈ G}. By definition E ⊂ B. Now it suffices
to show that B ⊂ E. First, E is a σ-field, since inverse mapping preserves
all set operations. And since f −1 (A) ∈ G ∀A ∈ C, we have C ⊂ E. Hence
σ(C) = B ⊂ E.

(b) f is G-measurable if

{s ∈ S|f (s) ≤ c} ∈ G ∀c ∈ R.

Proof: Let C = {(−∞, c]}, apply (a).

(c) (b) also holds if we replace f (s) ≤ c by f (s) ≥ c, f (s) > c, etc.

(d) If f is measurable and a is a constant, then af and f + a are measurable.

(e) If both f and g are measurable, then f + g is also measurable.


Proof: Note that we can always find a rational number r ∈ (f(s), c − g(s)) if f(s) + g(s) < c. We can therefore represent

{f(s) + g(s) < c} = ∪_{r∈Q} ({f(s) < r} ∩ {g(s) < c − r}),

which is in G for all c ∈ R, since the set of rational numbers is countable. Measurability of f + g then follows from (c).

(f) If both f and g are measurable, then f g is also measurable.


Proof: It suffices to prove that if f is measurable, then f² is measurable, since fg = ((f + g)² − f² − g²)/2. But {f(s)² ≤ c} = {f(s) ∈ [−√c, √c]} ∈ G for all c ≥ 0, and {f(s)² ≤ c} = ∅ ∈ G for c < 0.
all c ≥ 0 and {f (s)2 ≤ c} = ∅ ∈ G for c < 0.

(g) Let {fn } be a sequence of measurable functions. Then sup fn , inf fn , lim inf fn ,
and lim sup fn are all measurable (sup fn and inf fn may be infinite, though,
hence we should consider Borel sets on the extended real line).
Proof: Note that {sup fn(s) ≤ c} = ∩n {fn(s) ≤ c} ∈ G and {inf fn(s) ≥ c} = ∩n {fn(s) ≥ c} ∈ G. Now the rest is obvious.

(h) If {fn } are measurable, then {lim fn exists in R} ∈ G.


Proof: Note that the set on which the limit exists is

{lim sup fn < ∞} ∩ {lim inf fn > −∞} ∩ g −1 (0),

where g = lim sup fn − lim inf fn is measurable.

(i) If {fn } are measurable and f = lim fn exists, then f is measurable.
Proof: Note that for all c ∈ R,
{f ≤ c} = ∩_{m≥1} ∪_k ∩_{n≥k} {fn ≤ c + 1/m}.


(j) A simple function f, which takes the form f(s) = Σ_{i=1}^n ci I_{Ai}, where (Ai ∈ G) are disjoint and (ci) are constants, is measurable.
Proof: Use (d) and (e) and the fact that indicator functions are measurable.

Definition 2.1.2 (Borel Functions) If f is B(R)-measurable, it is called a Borel function.

Borel functions can be more general. For example, a B(S)-measurable function,


where S is a general topological space, may be referred to as a Borel function.

• If f is G-measurable and g is a Borel function, then the composite function g ◦ f is G-measurable.

• If g is a continuous real function, then g is Borel. It is well known that a real function g is continuous if and only if the inverse image of every open set is an open set, hence in B(R). Since {A ∈ B(R) | g⁻¹(A) ∈ B(R)} is a σ-field containing all open sets, it must contain B(R), so g⁻¹(A) ∈ B(R) for every A ∈ B(R).

2.2 Random Variables


Definition 2.2.1 (Random Variable) Given a probability space (Ω, F, P), we de-
fine a random variable X as an F-measurable function from Ω to R, ie, X⁻¹(B) ∈ F
for all B ∈ B(R).

Remarks:

• A random variable X is degenerate if X(ω) = c, a constant, for all ω. For all B ∈ B(R), if c ∈ B, then X⁻¹(B) = Ω ∈ F, and if c ∉ B, then X⁻¹(B) = ∅ ∈ F.

• From Property (b) of measurable functions, if {ω ∈ Ω|X(ω) ≤ c} ∈ F ∀c ∈ R,


then X is a random variable.

• If X and Y are random variables defined on a same probability space, then
cX, X + c, X 2 , X + Y , and XY are all random variables.
• If {Xn } is a sequence of random variables, then sup Xn , inf Xn , lim sup Xn ,
lim inf Xn , and lim Xn (if it exists), are all random variables (possibly un-
bounded).
• If X is a random variable on (Ω, F, P) and f is a Borel function, then f (X) is
also a random variable on the same probability space.
• The concept of random variable may be more general. For example, X may
be a mapping from Ω to a separable Banach space with an appropriate σ-field.

Example 2.2.2 For the coin tossing experiment, we may define a random variable by X(H) = 1 and X(T) = 0, where H and T are the outcomes of the experiment, ie, head and tail, respectively. If we toss the coin n times, X̄n = (1/n) Σ_{i=1}^n Xi is also a random variable. As n → ∞, X̄n becomes a degenerate random variable, as we know by the law of large numbers. lim X̄n is still a random variable, since the following event is in F:

{ number of heads / number of tosses → 1/2 } = {lim sup X̄n = 1/2} ∩ {lim inf X̄n = 1/2}.
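As a small illustration (not from the text; it relies on Python with NumPy, an assumption of mine), one can simulate X̄n for a fair coin and watch it settle near 1/2, in line with the law of large numbers:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    tosses = rng.integers(0, 2, size=n)             # X_i: 1 for head, 0 for tail
    xbar = np.cumsum(tosses) / np.arange(1, n + 1)  # running sample mean of the X_i

    for k in (10, 100, 1_000, 10_000, 100_000):
        print(k, xbar[k - 1])
    # The printed values approach 0.5 as the number of tosses grows; for any fixed
    # finite n, X̄_n is of course still a non-degenerate random variable.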

Definition 2.2.3 (Distribution of Random Variable) The distribution PX of


a random variable X is the probability measure on (R, B(R)) induced by X. Specif-
ically,
PX (A) = P(X −1 (A)) for all A ∈ B(R).

• We may write the distribution as the composite function PX = P ◦ X⁻¹.


When there is no ambiguity about the underlying random variable, we write
P in place of PX for simplicity.
• P is indeed a probability measure (verify this). Hence all properties of the
probability measure apply to P . P is often called the law of a random variable
X.

Definition 2.2.4 (Distribution Function) The distribution function FX of a ran-


dom variable is defined by

FX (x) = PX {(−∞, x]} for all x ∈ R.

We may omit the subscript of FX for simplicity. Note that since {(−∞, x], x ∈ R}
is a π-system that generates B(R), F uniquely determines P .

Properties:

(a) limx→−∞ F (x) = 0 and limx→∞ F (x) = 1.

(b) F (x) ≤ F (y) if x ≤ y.

(c) F is right continuous.

Proof: (a) Let xn → −∞. Since (−∞, xn ] ↓ ∅, we have F (xn ) = P {(−∞, xn ]} →


P (∅) = 0. The other statement is similarly established. (b) It follows from (−∞, x] ⊂
(−∞, y] if x ≤ y. (c) Fix an x, it suffices to show that F (xn ) → F (x) for
any sequence {xn } such that xn ↓ x. It follows, however, from the fact that
(−∞, xn ] ↓ (−∞, x] and the monotone convergence of probability measure.

Remark: If P ({x}) = 0, we say that P does not have point probability mass at x,
in which case F is also left-continuous. For any sequence {xn } such that xn ↑ x, we
have
F (xn ) = P ((−∞, xn ]) → P ((−∞, x)) = F (x) − P ({x}) = F (x).

2.3 Random Vectors


An n-dimensional random vector is a measurable function from (Ω, F) to (Rn , B(Rn )).
We may write a random vector X as X(ω) = (X1 (ω), . . . , Xn (ω))′ .

Example 2.3.1 Consider the coin tossing experiment. Define a r.v. X(H) = 1 and X(T) = 0, and another r.v. Y(H) = 0 and Y(T) = 1. We may define a random vector Z = (X, Y)′. Z is obviously a mapping from Ω = {H, T} to R². Specifically,

Z(H) = (1, 0)′, and Z(T) = (0, 1)′.

Example 2.3.2 Consider tossing the coin twice. Let X1 be a random variable that
takes 1 if the first toss gives Head and 0 otherwise, and let X2 be a random variable
that takes 1 if the second toss gives Head and 0 otherwise. Then the random vector
X = (X1 , X2 )′ is a function from Ω = {HH, HT, T H, T T } to R2 :
X(HH) = (1, 1)′,  X(HT) = (1, 0)′,  X(TH) = (0, 1)′,  X(TT) = (0, 0)′.

Definition 2.3.3 (Distribution of Random Vector) The distribution of an n-
dimensional random vector X = (X1 , . . . , Xn )′ is a probability measure on Rn ,
PX (A) = P{ω|X(ω) ∈ A} ∀A ∈ B(Rn ).

The distribution of a random vector X = (X1 , . . . , Xn ) is conventionally called the


joint distribution of X1 , . . . , Xn . The distribution of a subvector of X is called the
marginal distribution.
The marginal distribution is a projection of the joint distribution. Consider a ran-
dom vector Z = (X ′ , Y ′ )′ with two subvectors X ∈ Rm and Y ∈ Rn . Let PX (A) be
the marginal distribution of X for A ∈ B(Rm ). We have
PX (A) = PZ (A × Rn ) = P{ω|Z(ω) ∈ A × Rn },
where the cylinder set A × Rn is obviously an element in B(Rm+n ).

Definition 2.3.4 (Joint Distribution Function) The distribution function of a


random vector X = (X1 , . . . , Xn )′ is defined by
FX (x1 , . . . , xn ) = P{ω|X1 (ω) ≤ x1 , . . . , Xn (ω) ≤ xn }.
The n-dimensional real function FX is conventionally called the joint distribution
function of X1, . . . , Xn.

2.4 Density

Let µ be a measure on (S, G), and let fn be a simple function of the form fn(s) = Σ_{k=1}^n ck I_{Ak}, where (Ak ∈ G) are disjoint and (ck) are real nonnegative constants. We have

Definition 2.4.1 The Lebesgue integral of the simple function fn with respect to µ is defined by

∫ fn dµ = Σ_{k=1}^n ck µ(Ak).

For a general nonnegative function f , we have

Definition 2.4.2 The Lebesgue integral of f with respect to µ is defined by

∫ f dµ = sup_{fn ≤ f} ∫ fn dµ,

where the supremum is taken over simple functions fn with fn ≤ f.

22
In words, the Lebesgue integral of a general function f is the sup of the integrals of
simple functions that are below f. For example, we may choose fn = αn ◦ f, where

αn(t) = 0                        if t = 0,
αn(t) = 2^{−n}(k − 1)      if 2^{−n}(k − 1) < t ≤ 2^{−n}k, for k = 1, . . . , n2^n,
αn(t) = n                        if t > n.
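To make the construction concrete, here is a small numerical sketch (mine, not the text's; it takes µ to be Lebesgue measure on [0, 1] and stands in for it with a fine grid, which is itself only an approximation):

    import numpy as np

    def alpha_n(t, n):
        """Discretize the nonnegative values t onto the dyadic grid used by alpha_n."""
        t = np.asarray(t, dtype=float)
        out = np.floor(t * 2 ** n) / 2 ** n      # round down to a multiple of 2^{-n}
        out = np.minimum(out, n)                 # cap at n where the function exceeds n
        out[t == 0] = 0.0
        return out

    f = lambda x: x ** 2                         # integrand on [0, 1]
    x = np.linspace(0, 1, 200_001)               # grid standing in for Lebesgue measure

    for n in (1, 2, 4, 8, 12):
        fn = alpha_n(f(x), n)                    # f_n = alpha_n o f, a simple function
        print(n, fn.mean())                      # approximate integral of f_n

    # The values increase with n toward the integral of x^2 on [0, 1], which equals 1/3,
    # illustrating the monotone approximation of the integral by simple functions.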

For functions that are not necessarily nonnegative, we define


f⁺(x) = max(f(x), 0)
f⁻(x) = max(−f(x), 0).

Then we have

f(x) = f⁺(x) − f⁻(x).

The Lebesgue integral of f is now defined by

∫ f dµ = ∫ f⁺ dµ − ∫ f⁻ dµ.

If both ∫ f⁺ dµ and ∫ f⁻ dµ are finite, then we call f integrable with respect to µ.

Remarks:

• The function f is called the integrand. The notation ∫ f dµ is a simplified form of ∫_S f(x)µ(dx).

• The summation Σn cn is a special case of the Lebesgue integral, taken with respect to the counting measure. The counting measure on R assigns measure 1 to each point in Z.

• The Lebesgue integral generalizes the Riemann integral. It exists and coincides with the Riemann integral whenever the latter exists.

Definition 2.4.3 (Absolute Continuity of Measures) Let µ and ν be two mea-


sures on (S, G). ν is absolutely continuous with respect to µ if
ν(A) = 0 whenever µ(A) = 0, A ∈ G.

For example, given µ, we may construct a measure ν by

ν(A) = ∫_A f dµ,  A ∈ G,

where f is nonnegative. It is obvious that ν, so constructed, is absolutely continuous


with respect to µ.

Theorem 2.4.4 (Radon-Nikodym Theorem) Let µ and ν be two measures on
a measurable space (S, G). If ν is absolutely continuous with respect to µ, then there
exists a nonnegative measurable function f such that ν can be represented as

ν(A) = ∫_A f dµ,  A ∈ G.

The function f is called the Radon-Nikodym derivative of ν with respect to µ. It is


uniquely determined up to µ-null sets. We may denote f = ∂ν/∂µ.

Density Recall that PX is a probability measure on (R, B(R)). If PX is absolutely


continuous with respect to a measure µ, then there exists a nonnegative function
pX such that

PX(A) = ∫_A pX dµ,  ∀A ∈ B(R). (2.1)

• If the measure µ in (2.1) is a Lebesgue measure, the function pX is conven-


tionally called the probability density function of X. If such a pdf exists, we
say that X is a continuous random variable.

• If PX is absolutely continuous with respect to the counting measure µ, then


pX is conventionally called the discrete probabilities and X is called a discrete
random variable.

2.5 Independence
The independence of random variables is defined in terms of σ-fields they generate.
We first define

Definition 2.5.1 (σ-field Generated by Random Variable) Let X be a ran-


dom variable. The σ-field generated by X, denoted by σ(X), is defined by
σ(X) = { X⁻¹(A) | A ∈ B(R) }.

• σ(X) is the smallest σ-field to which X is measurable.

• The σ-field generated by a random vector X = (X1 , . . . , Xn )′ is similarly


defined: σ(X) = σ(X1 , . . . , Xn ) = {X −1 (A)|A ∈ B(Rn )} .

• σ(X) may be understood as the information that the random variable X contains about the state of the world. Put differently, σ(X) is the collection of events E such that, for a given outcome, we can tell whether E has happened based on observing X.

Definition 2.5.2 (Independence of Random Variables) Random variables X1 , . . . , Xn


are independent if the σ-fields, σ(X1 ), . . . , σ(Xn ), are independent.

Let p(xik ) be the Radon-Nikodym density of the distribution of Xik with respect to
Lebesgue or counting measure. And let, with some abuse of notation, p(xi1 , . . . , xin )
be the Radon-Nikodym density of the distribution of Xi1 , . . . , Xin , with respect to
the product of the measures to which the marginal densities p(xi1 ), . . . , p(xin ) are
defined. The density p may be pdf or discrete probabilities, depending on whether
the corresponding random variable is continuous or discrete. We have the following
theorem.

Theorem 2.5.3 The random variables X1 , X2 , . . . are independent if and only if for
any (i1, . . . , in),

p(x_{i1}, . . . , x_{in}) = Π_{k=1}^n p(x_{ik})

almost everywhere with respect to the measure for which p is defined.

Proof: It suffices to prove the case of two random variables. Let Z = (X, Y)′ be a two-dimensional random vector, and let µ(dx) and µ(dy) be the measures with respect to which p(x) and p(y) are defined. The joint density p(x, y) is then defined with respect to the measure µ(dx)µ(dy) on R². For any A, B ∈ B(R), we have

PZ (A × B) = P{Z −1 (A × B)} = P{X −1 (A) ∩ Y −1 (B)}.

X and Y are independent iff

PZ (A × B) = P{X −1 (A) ∩ Y −1 (B)} = P{X −1 (A)}P{Y −1 (B)} = PX (A)PY (B).

And PZ (A × B) = PX (A)PY (B) holds iff


∫∫_{A×B} p(x, y)µ(dx)µ(dy) = ∫_A p(x)µ(dx) ∫_B p(y)µ(dy) = ∫∫_{A×B} p(x)p(y)µ(dx)µ(dy),

where the second equality follows from Fubini's theorem. Since this must hold for all A and B, it is equivalent to p(x, y) = p(x)p(y) almost everywhere.
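A quick discrete illustration of Theorem 2.5.3 (my own sketch, not from the text), using the numbers of Exercise 2 in the next section: with P(E) = 1/2, P(F) = 2/3 and P(E ∩ F) = 1/3, the joint pmf of X = I(E) and Y = I(F) factorizes into the product of its marginals, so these two indicators happen to be independent.

    import numpy as np

    # Events E, F with P(E) = 1/2, P(F) = 2/3, P(E ∩ F) = 1/3; X = I(E), Y = I(F).
    P_E, P_F, P_EF = 1/2, 2/3, 1/3

    # Joint pmf of (X, Y): rows indexed by x in {0, 1}, columns by y in {0, 1}.
    joint = np.array([
        [1 - P_E - P_F + P_EF, P_F - P_EF],   # x = 0: (y = 0, y = 1)
        [P_E - P_EF,           P_EF],         # x = 1: (y = 0, y = 1)
    ])

    px = joint.sum(axis=1)     # marginal pmf of X
    py = joint.sum(axis=0)     # marginal pmf of Y

    print(joint)
    print(np.outer(px, py))
    print(np.allclose(joint, np.outer(px, py)))   # True: p(x, y) = p(x)p(y) here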

2.6 Exercises
1. Verify that PX (·) = P (X −1 (·)) is a probability measure on B(R).

2. Let E and F be two events with probabilities P(E) = 1/2, P(F ) = 2/3 and
P(E ∩ F ) = 1/3. Define random variables X = I(E) and Y = I(F ). Find the
joint distribution of X and Y . Also, obtain the conditional distribution of X
given Y .

3. If a random variable X is endowed with the following density function,

p(x) = (x²/18) I{−3 < x < 3},
compute P{ω||X(ω)| < 1}.

4. Suppose the joint probability density function of X and Y is given by

p(x, y) = 3(x + y) I{0 ≤ x + y ≤ 1, 0 ≤ x, y ≤ 1}.

(a) Find the marginal density of X.


(b) Find P{ω|X(ω) + Y (ω) < 1/2}.

Chapter 3

Expectations

3.1 Integration
Expectation is integration. Before studying expectation, therefore, we first dig
deeper into the theory of integration.

Notations Let µ be a measure on (S, G).


• We denote µ(f) = ∫ f dµ and µ(f; A) = ∫_A f dµ = ∫ f IA dµ, where A ∈ G.

• We say that f is µ-integrable if µ(|f |) = µ(f + ) + µ(f − ) < ∞, in which case


we write f ∈ L1 (S, G, µ).

• If in addition, f is nonnegative, then we write f ∈ L1 (S, G, µ)+ .

• E ∈ G is µ-null if µ(E) = 0.

• A statement is said to hold almost everywhere (a.e.) if the set E on which the
statement is false is µ-null.

Properties of Integration

• If f ∈ L1 (S, G, µ), then |µ(f )| ≤ µ(|f |).

• If f, g ∈ L1 (S, G, µ), then af + bg ∈ L1 (S, G, µ), where a, b ∈ R. Furthermore,


µ(af + bg) = aµ(f ) + bµ(g).

• µ(f ; A) is a measure on (S, G).

Theorem 3.1.1 (Monotone Convergence Theorem) If fn is a sequence of non-
negative measurable functions such that, except on a µ-null set, fn ↑ f , then
µ(fn ) ↑ µ(f ).

Note that the monotone convergence of probability is implied by the monotone


convergence theorem. Take fn = IAn and f = IA , where An is a monotone increasing
sequence of sets in G that converge to A, and let µ = P be a probability measure.
Then µ(fn ) = P(An ) ↑ P(A) = µ(f ).

Theorem 3.1.2 (Fatou’s Lemma) For a sequence of nonnegative measurable func-


tions fn , we have
µ(lim inf fn ) ≤ lim inf µ(fn ).

Proof: Note that inf n≥k fn is monotone increasing and inf n≥k fn ↑ lim inf fn . In
addition, since fk ≥ inf n≥k fn for all k, we have µ(fk ) ≥ µ(inf n≥k fn ) ↑ µ(lim inf fn )
by Monotone Convergence Theorem.

Theorem 3.1.3 (Reverse Fatou’s Lemma) If a sequence of nonnegative mea-


surable functions fn are bounded by a measurable nonnegative function g for all n
and µ(g) < ∞, then
µ(lim sup fn ) ≥ lim sup µ(fn ).

Proof: Apply Fatou Lemma to (g − fn ).

Theorem 3.1.4 (Dominated Convergence Theorem) Suppose that fn and f


are measurable, that fn (s) → f (s) for every s ∈ S, and that (fn ) is dominated by
some g ∈ L1 (S, G, µ)+ , ie,
|fn (s)| ≤ g(s), ∀s ∈ S, ∀n,
then
µ(|fn − f |) → 0,
so that
µ(fn ) → µ(f ).
In addition, f ∈ L1 (S, G, µ).

Proof: It is obvious that |f (s)| ≤ g(s) ∀s ∈ S. Hence |fn − f | ≤ 2g, where


µ(2g) < ∞. We apply the reverse Fatou Lemma to (|fn − f |) and obtain
lim sup µ(|fn − f |) ≤ µ(lim sup |fn − f |) = µ(0) = 0.
Since |µ(fn ) − µ(f )| = |µ(fn − f )| ≤ µ(|fn − f |), we have
lim_{n→∞} |µ(fn) − µ(f)| ≤ lim sup µ(|fn − f|) = 0.

3.2 Expectation
Definition 3.2.1 (Expectation) Let X be a random variable on the probability
space (Ω, F, P) and X ∈ L1(Ω, F, P). The expectation of X, EX, is defined by

EX = ∫ X dP.

More generally, let f be a Borel function,

Ef(X) = ∫ f(X) dP.

EX is also called the mean of X, and Ef (X) is called the f -moment of X.

Theorem 3.2.2 (Change of Variable) We have


Ef(X) = ∫ f dPX = ∫ f pX dµ, (3.1)

where pX is the density of X with respect to measure µ.

Proof: First consider indicator functions of the form f (X) = IA (X), where A ∈ B.
We have f (X)(ω) = IA ◦ X(ω) = IX −1 (A) (ω). Then

Ef (X) = EIA ◦ X = P(X −1 (A)) = PX (A).

And we have
PX(A) = ∫ IA dPX = ∫ f dPX and PX(A) = ∫ IA pX dµ = ∫ f pX dµ.

Hence the theorem holds for indicator functions. Similarly we can show that it is
true for simple functions. For a general nonnegative function f , we can choose a
sequence of simple functions (fn ) such that fn ↑ f . The monotone convergence
theorem is then applied to obtain the same result. For general functions, note that
f = f + − f −.

All properties of integration apply to the expectation. In addition, we have the


following convergence theorems.

• (Monotone Convergence Theorem) If 0 ≤ Xn ↑ X, then E(Xn ) ↑ E(X).

• (Fatou’s Lemma) If Xn ≥ 0, then E(lim inf Xn ) ≤ lim inf E(Xn ).

• (Reverse Fatou’s Lemma) If Xn ≤ X for all n and EX < ∞, then E lim sup Xn ≥
lim sup EXn .

• (Dominated Convergence Theorem) If |Xn (ω)| ≤ Y (ω) ∀(n, ω), where EY <
∞, then
E(|Xn − X|) → 0,
which implies that
EXn → EX.

• (Bounded Convergence Theorem) If |Xn (ω)| ≤ K ∀(n, ω), where K < ∞ is a


constant, then
E(|Xn − X|) → 0.

3.3 Moment Inequalities


Definitions: Moments Let X and Y be random variables defined on (Ω, F, P).
Recall that we call Ef(X) the f-moment of X. In particular, if f(x) = x^k, µk ≡ EX^k is called the k-th moment of X. If f(x) = (x − µ1)^k, we call E(X − µ1)^k the k-th central moment of X. Particularly, the second central moment is called the variance.

The covariance of X and Y is defined as

cov(X, Y ) = E(X − µx )(Y − µy ),

where µx and µy are the means of X and Y, respectively. cov(X, X) is of course the variance of X. Let σX² and σY² denote the variances of X and Y, respectively; we define the correlation of X and Y by

ρX,Y = cov(X, Y) / (σX σY).

For a random vector X = (X1 , . . . , Xn )′ , the second moment is given by EXX ′ ,


a symmetric matrix. Let µ = EX, then ΣX = E(X − µ)(X − µ)′ is called the
variance-covariance matrix, or simply the covariance matrix. If Y = AX, where
A is a conformable constant matrix, then ΣY = AΣX A′ . This relation reduces to
σY² = a²σX², if X and Y are scalar random variables and Y = aX, where a is a constant.

The moments of a random variable X contain the same information as the distribution (or the law) does. We have

Theorem 3.3.1 Let X and Y be two random variables (possibly defined on different
probability spaces). Then PX = PY if and only if Ef (X) = Ef (Y ) for all Borel
functions whenever the expectation is finite.

Proof: If PX = PY , then we have Ef (X) = Ef (Y ) by (3.1). Conversely, set f = IB ,


where B is any Borel set. Then Ef (X) = Ef (Y ) implies that P(X ∈ B) = P(Y ∈
B), ie, PX = PY .

In the following, we prove a set of well-known inequalities.

Theorem 3.3.2 (Chebyshev Inequality) P{|X| ≥ ε} ≤ E|X|^k / ε^k, for any ε > 0 and k > 0.

Proof: It follows from the fact that ε^k I{|X| ≥ ε} ≤ |X|^k.

Remarks:

• We have as a special case of Chebyshev's inequality,

P{|X − µ| ≥ ε} ≤ σ²/ε²,

where µ and σ² are the mean and the variance of X, respectively. If a random variable has a finite variance, this inequality states that its tail probabilities are bounded.

• Another special case concerns nonnegative random variables. In this case, we


have Markov’s Inequality, which states that for a nonnegative random variable
X,
1
P(X ≥ a) ≤ EX, for all a > 0.
a

Theorem 3.3.3 (Cauchy-Schwartz Inequality) (EXY )2 ≤ (EX 2 )(EY 2 )

Proof: Without loss of generality, we consider the case when X ≥ 0, Y ≥ 0.


Note first that if E(X 2 ) = 0, then X = 0 a.s., in which case the inequality holds
with equality. Now we consider the case when E(X 2 ) > 0 and E(Y 2 ) > 0. Let

X∗ = X/(E(X²))^{1/2} and Y∗ = Y/(E(Y²))^{1/2}. Then we have EX∗² = EY∗² = 1, and

0 ≤ E(X∗ − Y∗)² = E(X∗² + Y∗² − 2X∗Y∗) = 1 + 1 − 2E(X∗Y∗),

which results in E(X∗ Y∗ ) ≤ 1. The Cauchy-Schwartz inequality then follows.

Remarks:

• It is obvious that equality holds only when Y is a linear function of X.

• If we apply Cauchy-Schwartz Inequality to X − µX and Y − µY , then we have

cov(X, Y )2 ≤ var(X)var(Y ).

To introduce Jensen’s inequality, recall that f : R → R is convex if f (αx+(1−α)y) ≤


αf (x) + (1 − α)f (y), where α ∈ [0, 1]. If f is twice differentiable, then f is convex
if and only if f ′′ ≥ 0. Finally, if f is convex, it is automatically continuous.

Theorem 3.3.4 (Jensen’s Inequality) If f is convex, then f (EX) ≤ Ef (X).

Proof: Since f is convex, there exists a linear function ℓ such that

ℓ≤f and ℓ(EX) = f (EX).

It follows that
Ef (X) ≥ Eℓ(X) = ℓ(EX) = f (EX).

Remarks:

• Functions such as |x|, x2 , and exp(θx) are all convex functions of x.

• The inequality is reversed for concave functions such as log(x), x1/2 , etc.

Definition 3.3.5 (Lp Norm) Let 1 ≤ p < ∞. The Lp norm of a random variable
X is defined by
∥X∥p ≡ (E|X|p )1/p .

Note that Lp ≡ Lp (Ω, F, P) denotes a normed space of random variables that satisfies
E|X|p < ∞.

Theorem 3.3.6 (Monotonicity of Lp Norms) If 1 ≤ p ≤ q < ∞ and X ∈ Lq ,
then X ∈ Lp , and
∥X∥p ≤ ∥X∥q

Proof: Define Yn = {min(|X|, n)}^p. For any n ∈ N, Yn is bounded, hence both Yn and Yn^{q/p} are in L1. Since x^{q/p} is a convex function of x, we use Jensen's inequality to obtain

(EYn)^{q/p} ≤ E(Yn^{q/p}) = E({min(|X|, n)}^q) ≤ E(|X|^q).

Now the monotone convergence theorem obtains the desired result.

3.4 Conditional Expectation


Let X be a random variable in L1(Ω, F, P) and let G ⊂ F be a sub-σ-field.

Definition 3.4.1 (Conditional Expectation) The conditional expectation of X


given G, denoted by E(X|G), is a G-measurable random variable such that for every
A ∈ G,

∫_A E(X|G) dP = ∫_A X dP. (3.2)

In particular, if G = σ(Y ), where Y is a random variable, we write E(X|σ(Y ))


simply as E(X|Y ).
The conditional expectation is a local average. To see this, let {Fk } be a partition
of Ω with P(Fk ) > 0 for all k. Let G = σ({Fk }). According to the definition in (3.2),
we have

∫_{Fk} E(X|G) dP = ∫_{Fk} X dP,

and E(X|G), being G-measurable, is constant on each Fk.

Thus E(X|G) can be written as

E(X|G) = Σk ck I_{Fk},

where

ck = ( ∫_{Fk} X dP ) / P(Fk).
The conditional expectation E(X|G) may thus be viewed as a random variable that takes values that are local averages of X over the partition induced by G. If G1 ⊂ G, G is said to be "finer" than G1. In other words, E(X|G) is more "random" than E(X|G1), since the former can take more values. The following example gives two extreme cases.

Example 3.4.2 If G = {∅, Ω}, then E(X|G) = EX, which is a degenerate random
variable. If G = F, then E(X|G) = X.

Example 3.4.3 Let E and F be two events that satisfy P(E) = P(F ) = 1/2 and
P(E ∩ F ) = 1/3. E and F are obviously not independent. We define two random
variables, X = IE and Y = IF . It is obvious that {F, F c } is a partition of Ω and
σ({F, F c }) = σ(Y ) = {∅, Ω, F, F c }. The conditional expectation of E(X|Y ) may be
written as
E(X|Y) = c1* I_F + c2* I_{Fᶜ},

where c1* = P(F)⁻¹ ∫_F X dP = P(F)⁻¹ P(F ∩ E) = 2/3, and c2* = P(Fᶜ)⁻¹ ∫_{Fᶜ} X dP = P(Fᶜ)⁻¹ P(Fᶜ ∩ E) = 1/3.

Existence of Conditional Expectation Note that

µ(A) = ∫_A X dP,  A ∈ G

defines a measure on (Ω, G) and that µ is absolutely continuous with respect to P.


By the Radon-Nikodym theorem, there exists a G-measurable random variable Y
such that

µ(A) = ∫_A Y dP.

The random variable Y is exactly E(X|G). It is unique up to P-null sets.

Definition 3.4.4 (Conditional Probability) The conditional probability may be


defined as a random variable P(E|G) such that

∫_A P(E|G) dP = P(A ∩ E),  for every A ∈ G.

Check that the conditional probability behaves like ordinary probabilities, in that
it satisfies the axioms of the probability, at least in a.s. sense.

Properties:

• (Linearity) E(aX + bY |G) = aE(X|G) + bE(Y |G).

• (Law of Iterative Expectation) The definition of conditional expectation di-


rectly implies EX = E [E(X|G)].

• If X is G-measurable, then E(XY |G) = XE(Y |G) with probability 1.

Proof: First, XE(Y |G) is G-measurable. Now let X = IF , where F ∈ G. For


any A ∈ G, we have
∫_A E(I_F Y|G) dP = ∫_A I_F Y dP = ∫_{A∩F} Y dP = ∫_{A∩F} E(Y|G) dP = ∫_A I_F E(Y|G) dP.

Hence the statement holds for X = IF . For general random variables, use
linearity and monotone convergence theorem.

• Using the above two results, it is trivial to show that X and Y are independent
if and only if Ef (X)g(Y ) = Ef (X)Eg(Y ) for all Borel functions f and g.
• Let G1 and G2 be sub-σ-fields and G1 ⊂ G2 . Then, with probability 1,
E [E(X|G2 )|G1 ] = E(X|G1 ).

Proof: It follows from, for any A ∈ G1 ⊂ G2 ,


∫_A E[E(X|G2)|G1] dP = ∫_A E(X|G2) dP = ∫_A X dP = ∫_A E(X|G1) dP.

• (Doob-Dynkin) There exists a measurable function f such that E(X|Y ) =


f (Y ).

Conditional Expectation as Projection The last property implies that


E [E(X|G)|G] = E(X|G),
which suggest that the conditional expectation is a projection operator, projecting
a random variable onto a sub-σ-field. This is indeed the case. It is well known that
H = L2 (Ω, F, P) is a Hilbert space with inner product defined by ⟨X, Y ⟩ = EXY ,
where X, Y ∈ L2 . Consider a subspace H0 = L2 (Ω, G, P), where G ⊂ F . The
projection theorem in functional analysis guarantees that for any random variable
X ∈ H, there exists a G-measurable random variable Y such that
E(X − Y )W = 0 for all W ∈ H0 . (3.3)
Y is called the (orthogonal) projection of X on H0 . Write W = IA for any A ∈ G,
the equation (3.3) implies that
∫_A X dP = ∫_A Y dP for all A ∈ G.

It follows that Y is indeed a version of E(X|G).

Conditional Expectation as the Best Predictor Consider the problem of
predicting Y given X. We call ϕ(X) a predictor, where ϕ is a Borel function. We
have the following theorem,

Theorem 3.4.5 If Y ∈ L2 , then E(Y |X) solves the following problem,

min_ϕ E(Y − ϕ(X))².

Proof: We have

E(Y − ϕ(X))² = E([Y − E(Y|X)] + [E(Y|X) − ϕ(X)])²
             = E{[Y − E(Y|X)]² + [E(Y|X) − ϕ(X)]² + 2[Y − E(Y|X)][E(Y|X) − ϕ(X)]}.

By the law of iterative expectation, E[Y − E(Y |X)][E(Y |X) − ϕ(X)] = 0. Hence

E(Y − ϕ(X))² = E[Y − E(Y|X)]² + E[E(Y|X) − ϕ(X)]².

Since ϕ appears only in the second term, which is minimized (at zero) when ϕ(X) = E(Y|X), it is now clear that E(Y|X) minimizes E(Y − ϕ(X))².

Hence the conditional expectation is the best predictor in the sense of minimizing
mean squared forecast error (MSFE). This fact is the basis of regression analysis
and time series forecasting.
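A simulation sketch of this point (mine, not the text's; the model Y = X² + noise is an arbitrary choice): among several predictors ϕ(X), the conditional mean E(Y|X) = X² attains the smallest mean squared forecast error.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500_000
    x = rng.normal(size=n)
    y = x ** 2 + rng.normal(size=n)        # so that E(Y|X) = X^2 by construction

    predictors = {
        "conditional mean x^2": x ** 2,
        "best constant (EY=1)": np.ones(n),
        "some linear rule    ": 1 + 0.5 * x,
        "biased x^2 + 0.3    ": x ** 2 + 0.3,
    }
    for name, pred in predictors.items():
        print(name, np.mean((y - pred) ** 2))
    # The first line shows the smallest MSFE (about 1.0, the noise variance),
    # as Theorem 3.4.5 predicts.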

3.5 Conditional Distribution


Suppose that X and Y are two random variables with joint density p(x, y).

Definition 3.5.1 (Conditional Density) The conditional density of X given Y =


y is obtained by
p(x|y) = p(x, y) / ∫ p(x, y)µ(dx).

The conditional expectation E(X|Y = y) may then be represented by

E(X|Y = y) = ∫ x p(x|y)µ(dx).

For any Borel function f such that f(X) ∈ L1, we may show that E(f(X)|Y = y) = ∫ f(x)p(x|y)µ(dx) solves the following problem:

min_ϕ ∫∫ (ϕ(y) − f(x))² p(x, y)µ(dx)µ(dy).

It is clear that E(X|Y = y) is a deterministic function of y. Thus we write g(y) =


E(X|Y = y). Recall that E(X|Y ) is a Borel function of Y . Indeed, here we have
g(Y ) = E(X|Y ).
To show this, first note that for all F ∈ σ(Y ), there exists A ∈ B such that F =
Y −1 (A). We now have
∫_F g(Y) dP = ∫_A g(y)p(y)µ(dy)
            = ∫_A ( ∫ x p(x|y)µ(dx) ) p(y)µ(dy)
            = ∫_{R×A} x p(x|y)p(y)µ(dx)µ(dy)
            = ∫_F X dP
            = ∫_F E(X|Y) dP.

Example 3.5.2 Let p(x, y) = (x + y) I{0 ≤ x, y ≤ 1}. To obtain E(X|Y), we calculate

E(X|Y = y) = ∫_0^1 x p(x|y) dx = ∫_0^1 x (x + y)/(1/2 + y) dx = (1/3 + y/2)/(1/2 + y).

Then E(X|Y) = (1/3 + Y/2)/(1/2 + Y).
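A quick Monte Carlo sanity check of this formula (my own sketch; it draws from the density by rejection sampling, one of several possible choices):

    import numpy as np

    rng = np.random.default_rng(0)

    # Rejection sampling from p(x, y) = x + y on [0, 1]^2 (the density is bounded by 2).
    m = 2_000_000
    xs, ys, us = rng.random(m), rng.random(m), rng.random(m)
    keep = us < (xs + ys) / 2.0
    x, y = xs[keep], ys[keep]

    # Compare the local average of X near a few values of y with (1/3 + y/2)/(1/2 + y).
    for y0 in (0.1, 0.5, 0.9):
        band = np.abs(y - y0) < 0.01
        print(y0, x[band].mean(), (1/3 + y0/2) / (1/2 + y0))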

3.6 Exercises
1. Let the sample space Ω = R and the probability P on Ω be given by
P{1/3} = 1/3 and P{2/3} = 2/3.
Define a sequence of random variables by
Xn = (3 − 1/n) I(An) and X = 3 I(lim_{n→∞} An),

where

An = [ 1/3 + 1/n, 2/3 + 1/n )
for n = 1, 2, . . ..
(a) Show that lim An exists so that X is well defined.
n→∞
(b) Compare lim E(Xn ) with E(X).
n→∞
(c) Is it true that lim E(Xn − X)2 = 0?
n→∞

2. Let X1 and X2 be two zero-mean random variables with correlation ρ. Suppose


the variances of X1 and X2 are the same, say σ 2 . Prove that

P(|X1 + X2| ≥ kσ) ≤ 2(1 + ρ)/k².

3. Prove Cantelli’s inequality, which states that if a random variable X has mean
µ and variance σ 2 < ∞, then for all a > 0,

P(X − µ ≥ a) ≤ σ²/(σ² + a²).
[Hint: You may first show P(X − µ ≥ a) ≤ P ((X − µ + y)2 ≥ (a + y)2 ), use
Markov’s inequality, and then minimize the resulting bound over the choice of
y. ]

4. Let the sample space Ω = [0, 1] and the probability on Ω be given by the
density
p(x) = 2x
over [0, 1]. We define random variables X and Y by

X(ω) = 1 for 0 ≤ ω < 1/4,  0 for 1/4 ≤ ω < 1/2,  −1 for 1/2 ≤ ω < 3/4,  0 for 3/4 ≤ ω ≤ 1;

Y(ω) = 1 for 0 ≤ ω < 1/2,  0 for 1/2 ≤ ω ≤ 1.

(a) Find the conditional expectation E(X 2 |Y )


(b) Show that E(E(X 2 |Y )) = E(X 2 ).

Chapter 4

Distributions and Transformations

4.1 Alternative Characterizations of Distribution

4.1.1 Moment Generating Function

Let X be a random variable with density p. The moment generating function (MGF)
of X is given by

m(t) = E exp(tX) = ∫ exp(tx) p(x) dµ(x).

Note that the moment generating function is the Laplace transform of the density.
The name of MGF is due to the fact that

(d^k m / dt^k)(0) = EX^k.

4.1.2 Characteristic Function

The MGF may not exist, but we can always define the characteristic function, which is given by

ϕ(t) = E exp(itX) = ∫ exp(itx) p(x) dµ(x).

Note that the characteristic function is the Fourier transform of the density. Since
| exp(itx)| is bounded, ϕ(t) is always defined.

4.1.3 Quantile Function
We define the τ -quantile or fractile of X (with distribution function F ) by
Qτ = inf{x|F (x) ≥ τ }, 0 < τ < 1.
In particular, if τ = 1/2, Q1/2 is conventionally called the median of X.
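As an illustration (mine, not the text's), the same inf-definition can be applied to the empirical distribution function of a sample; note that NumPy's built-in quantile routine interpolates by default, so it can differ slightly in small samples.

    import numpy as np

    def quantile_inf(sample, tau):
        """Q_tau = inf{x : F_hat(x) >= tau} for the empirical distribution F_hat."""
        x = np.sort(np.asarray(sample))
        n = len(x)
        # F_hat(x_(k)) = k/n, so the smallest order statistic with k/n >= tau works.
        k = int(np.ceil(tau * n))
        return x[max(k, 1) - 1]

    rng = np.random.default_rng(0)
    sample = rng.normal(size=100_000)
    for tau in (0.25, 0.5, 0.975):
        print(tau, quantile_inf(sample, tau))
    # For a large N(0, 1) sample these are close to -0.674, 0, and 1.96 respectively.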

4.2 Common Families of Distributions


In the following we get familiar with some families of distributions that are frequently
used in practice. Given a family of distributions {Pθ } indexed by θ, we call the
index θ parameter. If θ is finite dimensional, we call {Pθ } a parametric family of
distributions.

Uniform The uniform distribution is a continuous distribution with the following


density with respect to the Lebesgue measure,
p_{a,b}(x) = [1/(b − a)] I_{[a,b]}(x),  a < b.
We denote the uniform distribution with parameters a and b by Uniform(a, b).

Bernoulli The Bernoulli distribution is a discrete distribution with the following


density with respect to the counting measure,
pθ (x) = θx (1 − θ)1−x , x ∈ {0, 1}, and θ ∈ [0, 1].
The Bernoulli distribution, denoted by Bernoulli(θ), usually describes random ex-
periments with binary outcomes such as success (x = 1) or failure (x = 0). The
parameter θ is then interpreted as the probability of success, P{x = 1}.

Binomial The Binomial distribution, corresponding to n-consecutive coin tossing,


is a discrete distribution with the following density with respect to counting measure,
p_{n,θ}(x) = (n choose x) θ^x (1 − θ)^{n−x},  x ∈ {0, 1, . . . , n}.
We may use Binomial distribution, denoted by Binomial(n, θ), to describe the out-
comes of repeated trials, in which case n is the number of trials and θ is the proba-
bility of success for each trial.
Note that if X ∼ Binomial(n, θ), it can be represented by a sum of n i.i.d. (inde-
pendently and identically distributed) Bernoulli(θ) random variables.

Poisson The Poisson distribution is a discrete distribution with the following den-
sity,
p_λ(x) = exp(−λ) λ^x / x!,  x ∈ {0, 1, 2, . . .}.
The Poisson distribution typically describes the probability of the number of events
occurring in a fixed period of time. For example, the number of phone calls in a given
time interval may be modeled by a Poisson(λ) distribution, where the parameter λ
is the expected number of calls. Note that the Poisson(λ) density is a limiting case
of the Binomial(n, λ/n) density,
(n choose x) (λ/n)^x (1 − λ/n)^{n−x} = [n!/((n − x)! n^x)] (1 − λ/n)^{−x} (1 − λ/n)^n (λ^x/x!) → e^{−λ} λ^x/x!.
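The convergence above is easy to see numerically. The following sketch is an illustration only (it is not part of the text) and assumes NumPy and SciPy are available.

    # Compare the Binomial(n, lambda/n) pmf with the Poisson(lambda) pmf as n grows
    import numpy as np
    from scipy import stats

    lam = 3.0
    x = np.arange(0, 11)
    for n in (10, 100, 1000):
        diff = stats.binom.pmf(x, n, lam / n) - stats.poisson.pmf(x, lam)
        print(n, np.max(np.abs(diff)))   # the maximum discrepancy shrinks toward 0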

Normal The normal (or Gaussian) distribution, denoted by N (µ, σ 2 ) is a contin-


uous distribution with the following density with respect to Lebesgue measure,
p_{µ,σ²}(x) = [1/(√(2π) σ)] exp( −(x − µ)²/(2σ²) ).

The parameters µ and σ² are the mean and the variance of the distribution, respec-
tively. In particular, N (0, 1) is called standard normal. The normal distribution
was invented for the modeling of observation error, and is now the most important
distribution in probability and statistics.

Exponential The exponential distribution, denoted by Exponential(λ) is a con-


tinuous distribution with the following density with respect to Lebesgue measure,

pλ (x) = λe^{−λx},  x ≥ 0.

The cdf of the Exponential(λ) distribution is given by

F (x) = 1 − e−λx .

The exponential distribution typically describes the waiting time before the arrival
of next Poisson event.

Gamma The Gamma distribution, denoted by Gamma(k, λ) is a continuous dis-


tribution with the following density,

p_{k,λ}(x) = [λ/Γ(k)] (λx)^{k−1} e^{−λx},  x ∈ [0, ∞),

where Γ(·) is the gamma function defined as follows,

Γ(z) = ∫_0^∞ t^{z−1} e^{−t} dt.

The parameter k > 0 is called the shape parameter and λ > 0 the rate (inverse scale) parameter.

• Special cases
– Let k = 1, then Gamma(1, λ) reduces to Exponential(λ).
– If k is an integer, Gamma(k, λ) reduces to an Erlang distribution, i.e., the
distribution of the sum of k independent Exponential(λ) random variables, each
of which has mean 1/λ.
– Let ℓ be an integer and λ = 1/2, then Gamma(ℓ/2, 1/2) reduces to χ2ℓ ,
chi-square distribution with ℓ degrees of freedom.
• The gamma function generalizes the factorial function. To see this, note that
Γ(1) = 1 and that by integration by parts, we have

Γ(z + 1) = zΓ(z).

Hence for positive integer n, we have Γ(n + 1) = n!.

Beta The Beta distribution, denoted by Beta(a, b), is a continuous distribution on


[0, 1] with the following density,
p_{a,b}(x) = [1/B(a, b)] x^{a−1} (1 − x)^{b−1},  x ∈ [0, 1],
where B(a, b) is the beta function defined by

B(a, b) = ∫_0^1 x^{a−1} (1 − x)^{b−1} dx,  a, b > 0.

Both a > 0 and b > 0 are shape parameters. Since the support of Beta distributions
is [0, 1], it is often used to describe unknown probability value such as the probability
of success in a Bernoulli distribution.

• The beta function is related to the gamma function by


B(a, b) = Γ(a)Γ(b)/Γ(a + b).

• Beta(a, b) reduces to Uniform[0, 1] if a = b = 1.

Table 4.1: Mean, Variance, and Moment Generating Function

Distribution      Mean       Variance                MGF
Uniform[a, b]     (a+b)/2    (b−a)²/12               (e^{bt} − e^{at})/((b−a)t)
Bernoulli(θ)      θ          θ(1−θ)                  (1−θ) + θe^t
Poisson(λ)        λ          λ                       exp(λ(e^t − 1))
Normal(µ, σ²)     µ          σ²                      exp(µt + σ²t²/2)
Exponential(λ)    1/λ        1/λ²                    (1 − t/λ)^{−1}
Gamma(k, λ)       k/λ        k/λ²                    (λ/(λ−t))^k
Beta(a, b)        a/(a+b)    ab/((a+b)²(a+b+1))      1 + Σ_{k=1}^∞ ( Π_{r=0}^{k−1} (a+r)/(a+b+r) ) t^k/k!

Cauchy The Cauchy distribution, denoted by Cauchy(a, b), is a continuous dis-


tribution with the following density,
p_{a,b}(x) = 1 / ( πb [1 + ((x − a)/b)²] ),  b > 0.

The parameter a is called the location parameter and b is called the scale parame-
ter. Cauchy(0, 1) is called the standard Cauchy distribution, which coincides with
Student’s t-distribution with one degree of freedom.

• The Cauchy distribution is a heavy-tail distribution. It does not have any


finite moment. Its mode and median are well defined and are both equal to a.

• When U and V are two independent standard normal random variables, then
the ratio U/V has the standard Cauchy distribution.

• Like the normal distribution, the Cauchy distribution is (strictly) stable, ie, if X1 , X2 , and X
are i.i.d. Cauchy, then for any constants a1 and a2 , the random variable
a1 X1 + a2 X2 has the same distribution as cX for some constant c.

Multinomial The multinomial distribution generalizes the binomial distribution


to describe more than two categories. Let X = (X1 , . . . , Xm ). For the experiment
of tossing a coin for n times, X would take (k, n − k)′ , ie, there are k heads and
n−k tails.
For the experiment of rolling a die for n times, X would take (x1 , ..., xm ), where Σ_{k=1}^m xk = n. The multinomial density is given by

p(x1 , . . . , xm ; p1 , ..., pm ) = [n!/(x1 ! · · · xm !)] p1^{x1} · · · pm^{xm},  xk ∈ {0, 1, . . . , n},  Σ_{k=1}^m xk = n,

where the parameter pk , k = 1, . . . , m, is the probability of the k-th outcome in
each coin tossing or die rolling. When m = 2, the multinomial distribution reduces
to binomial distribution. The continuous analogue of multinomial distribution is
multivariate normal distribution.

4.3 Transformed Random Variables


In this section, we study three commonly used techniques to derive the distributions
of transformed random variables Y = g(X), given the distribution of X. We denote
by FX the distribution function of X.

4.3.1 Distribution Function Technique


By the definition of distribution function, we may directly calculate FY (y) = P(Y ≤
y) = P(g(X) ≤ y).

Example 4.3.1 Let X ∼ Uniform[0, 1] and Y = − log(1 − X). It is obvious that,


in a.s. sense, Y ≥ 0 and 1 − exp(−Y ) ∈ [0, 1]. Thus, for y ≥ 0, the distribution
function of Y is given by

FY (y) = P(− log(1 − X) ≤ y)


= P(X ≤ 1 − e−y )
= 1 − e−y ,

since FX (x) = x for x ∈ [0, 1]. FY (y) = 0 for y < 0. Note that Y ∼ Exponential(1).

Example 4.3.2 Let Xi be independent random variables with distribution function


Fi , i = 1, . . . , n. Then the distribution of Y = max{X1 , . . . , Xn } is given by
FY (y) = P( ∩_{i=1}^n {Xi ≤ y} )
       = Π_{i=1}^n P(Xi ≤ y)
       = Π_{i=1}^n Fi (y).

Example 4.3.3 Let X = (X1 , X2 )′ be a random vector with distribution P and
density p(x1 , x2 ) with respect to measure µ. Then the distribution of Y = X1 + X2
is given by

FY (y) = P{X1 + X2 ≤ y}
       = P{(x1 , x2 ) | x1 + x2 ≤ y}
       = ∫_{−∞}^{∞} ∫_{−∞}^{y−x2} p(x1 , x2 ) µ(dx1) µ(dx2).

4.3.2 MGF Technique


The moment generating function (MGF) uniquely determines distributions. When
MGF of Y = g(X) is easily obtained, we may identify the distribution of Y by
writing the MGF into a form that corresponds to some particular distribution. For
example,
if (Xi ) are independent random variables with MGF mi , then the MGF of Y = Σ_{i=1}^n Xi is given by

m(t) = E e^{t(X1 + ··· + Xn)} = Π_{i=1}^n mi (t).

Example 4.3.4 Let Xi ∼ Poisson(λi ) be independent over i. Then the MGF of Y = Σ_{i=1}^n Xi is

m(t) = Π_{i=1}^n exp( λi (e^t − 1) ) = exp( (e^t − 1) Σ_{i=1}^n λi ).

This suggests that Y ∼ Poisson(Σ_i λi ).

Example 4.3.5 Let Xi ∼ N(µi , σi²) be independent over i. Then the MGF of Y = Σ_{i=1}^n ci Xi is

m(t) = Π_{i=1}^n exp( ci µi t + (1/2) ci² σi² t² ) = exp( t Σ_{i=1}^n ci µi + (t²/2) Σ_{i=1}^n ci² σi² ).

This suggests that Y ∼ N( Σ_i ci µi , Σ_{i=1}^n ci² σi² ).

4.3.3 Change-of-Variable Transformation


If the transformation function g is one-to-one, we may find the density of Y = g(X)
from that of X by the change-of-variable transformation. Let g = (g1 , . . . , gn )′

and x = (x1 , . . . , xn )′ . And let PX and PY denote the distributions of X and Y ,
respectively. Assume PX and PY admit density pX and pY with respect to µ, the
counting or the Lebesgue measure on Rn .
For any B ∈ B(Rn ), we define A = g −1 (B). We have A ∈ B(Rn ) since g is measurable.
It is clear that {X ∈ A} = {Y ∈ B}. We therefore have

PY (B) = PX (A) = ∫_A pX (x) µ(dx).

If µ is counting measure, we have


∫_A pX (x) µ(dx) = Σ_{x∈A} pX (x) = Σ_{y∈B} pX (g⁻¹(y)).

Hence the density pY of Y is given by


pY (y) = pX (g −1 (y)).

If µ is Lebesgue measure and g is differentiable, we use the change-of-variable formula


to obtain,
∫_A pX (x) µ(dx) = ∫_A pX (x) dx = ∫_B pX (g⁻¹(y)) |det ġ(g⁻¹(y))|⁻¹ dy,

where ġ is the Jacobian matrix of g, ie, the matrix [∂gi /∂xj ] of the first partial derivatives of g. Then we obtain the density of Y,

pY (y) = pX (g⁻¹(y)) |det ġ(g⁻¹(y))|⁻¹.

Example 4.3.6 Suppose we have two random variables X1 and X2 with joint den-
sity
p(x1 , x2 ) = 4x1 x2 if 0 < x1 , x2 < 1,
= 0 otherwise
Define Y1 = X1 /X2 and Y2 = X1 X2 . The problem is to obtain the joint density of
(Y1 , Y2 ) from that of (X1 , X2 ). First note that the inverse transformation is
x1 = (y1 y2 )1/2 and x2 = (y2 /y1 )1/2 .
Let X = {(x1 , x2 )|0 < x1 , x2 < 1} denote the support of the joint density of (X1 , X2 ).
Then the support of the joint density of (Y1 , Y2 ) is given by Y = {(y1 , y2 )|y1 , y2 >
0, y1 y2 < 1, y2 < y1 }. Then
|det ġ(x)| = | det ( 1/x2    −x1/x2²
                     x2       x1    ) |  =  2 x1/x2  =  2 y1 .

Hence the joint density of (Y1 , Y2 ) is given by

p(y1 , y2 ) = 4 (y1 y2)^{1/2} (y2 /y1)^{1/2} / (2 y1) = 2 y2 / y1 .
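A Monte Carlo check of this result is straightforward. The sketch below is illustrative only, assuming NumPy and SciPy are available; it compares P(Y2 ≤ 1/2) computed from simulated draws of X with the same probability computed from the derived density p(y1, y2) = 2y2/y1 on the support Y.

    import numpy as np
    from scipy import integrate

    rng = np.random.default_rng(0)
    u = rng.uniform(size=(200_000, 2))
    x1, x2 = np.sqrt(u[:, 0]), np.sqrt(u[:, 1])   # X1, X2 have joint density 4*x1*x2 on (0,1)^2
    y2 = x1 * x2

    # Integrate 2*y2/y1 over {y2 <= 1/2} within Y: for fixed y2, y1 runs from y2 to 1/y2
    prob, _ = integrate.dblquad(lambda y1, y2: 2 * y2 / y1, 0.0, 0.5,
                                lambda y2: y2, lambda y2: 1.0 / y2)
    print((y2 <= 0.5).mean(), prob)               # the two numbers should nearly coincide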

4.4 Multivariate Normal Distribution

4.4.1 Introduction
Definition 4.4.1 (Multivariate Normal) A random vector X = (X1 , . . . , Xn )′ is
said to be multivariate normally distributed if for all a ∈ Rn , a′ X has a univariate
normal distribution.

Let Z = (Z1 , . . . , Zn )′ be a n-dimensional random vector, where (Zi ) are i.i.d.


N (0, 1). We have EZ = 0 and var(Z) = In . For all a ∈ Rn , we have

E e^{it(a′Z)} = Π_{k=1}^n E e^{it a_k Z_k} = Π_{k=1}^n ϕZ (a_k t) = Π_{k=1}^n e^{−a_k² t²/2} = e^{−(t²/2) Σ_{k=1}^n a_k²},

which is the characteristic function of a N(0, Σ_{k=1}^n a_k²) random variable. Hence Z is
multivariate normal. We may write Z ∼ N (0, In ), and call it standard multivariate
normal.
Using similar argument, we can show that X is multivariate normal if it can be
written as
X = µ + Σ1/2 Z,
where Z is standard multivariate normal, µ is an n-vector, and Σ is a symmetric
and positive definite matrix. It is easy to see that EX = µ and var(X) = Σ. We
write X ∼ N (µ, Σ).
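The construction X = µ + Σ^{1/2} Z also gives a direct way to simulate N(µ, Σ). The following sketch is an illustration only, assuming NumPy is available; the particular µ and Σ are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.8],
                      [0.8, 1.0]])

    # Symmetric square root of Sigma via its eigendecomposition
    vals, vecs = np.linalg.eigh(Sigma)
    Sigma_half = vecs @ np.diag(np.sqrt(vals)) @ vecs.T

    Z = rng.standard_normal((100_000, 2))   # rows are i.i.d. N(0, I_2) draws
    X = mu + Z @ Sigma_half                 # Sigma_half is symmetric, so right-multiplication works

    print(X.mean(axis=0))                   # close to mu
    print(np.cov(X, rowvar=False))          # close to Sigma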

Characteristic Function for Random Vectors For a random vector X, the


characteristic function may be defined as ϕX (t) = E exp(it′ X), where t ∈ Rn . The
characteristic function of Z (defined above) is obviously
ϕZ (t) = exp( −t′t/2 ).

Let X ∼ N (µ, Σ). It follows that

ϕX (t) = E e^{it′X} = e^{it′µ} ϕZ (Σ^{1/2} t) = exp( it′µ − t′Σt/2 ).

Joint Density The joint density of Z is given by,
p(z) = Π_{i=1}^n p(zi ) = (2π)^{−n/2} exp( −(1/2) Σ_{i=1}^n zi² ) = (2π)^{−n/2} exp( −z′z/2 ).

The Jacobian matrix of the affine transformation X = µ + Σ^{1/2} Z is Σ^{1/2}, hence

p(x) = (2π)^{−n/2} |Σ|^{−1/2} exp( −(1/2)(x − µ)′ Σ⁻¹ (x − µ) ).

Remarks:

• A vector of univariate normal random variables is not necessarily a multivariate


normal random vector. A counter example is (X, Y )′ , where X ∼ N (0, 1) and
Y = X if |X| > c and Y = −X if |X| < c, where c is about 1.54.

• If Σ is singular, then there exists some a ∈ Rn such that var(a′ X) = a′ Σa = 0.


This implies that X is random only on a subspace of Rn . We may say that
the joint distribution of X is degenerate in this case.

4.4.2 Marginals and Conditionals


Throughout this section, let X ∼ N (µ, Σ).

Lemma 4.4.2 (Affine Transformation) If Y = AX + b, then Y ∼ N (Aµ +


b, AΣA′ ).

Proof: Exercise. (Hint: use c.f. arguments.)

To introduce marginal distributions, we partition X conformably into


X = ( X1 )  ∼  N( ( µ1 ),  ( Σ11  Σ12 ) ),
    ( X2 )        ( µ2 )   ( Σ21  Σ22 )

where X1 ∈ Rn1 and X2 ∈ Rn2 .

Marginal Distribution Applying Lemma 4.4.2 with A = (In1 , 0) and b = 0, we have


X1 ∼ N (µ1 , Σ11 ). In other words, the marginal distributions of a multivariate
normal distribution are also multivariate normal.

Lemma 4.4.3 (Independence) X1 and X2 are independent if and only if Σ12 = 0.

Proof: The “only if” part is obvious. If Σ12 = 0, then Σ is a block diagonal,
Σ = ( Σ11    0
      0     Σ22 ).

Hence

Σ⁻¹ = ( Σ11⁻¹    0
        0       Σ22⁻¹ ),

and
|Σ| = |Σ11 | · |Σ22 |.
Then the joint density of x1 and x2 , can be factored as
p(x) = p(x1 , x2 ) = (2π)^{−n/2} |Σ|^{−1/2} exp( −(1/2)(x − µ)′ Σ⁻¹ (x − µ) )
     = (2π)^{−n1/2} |Σ11|^{−1/2} exp( −(1/2)(x1 − µ1)′ Σ11⁻¹ (x1 − µ1) )
       · (2π)^{−n2/2} |Σ22|^{−1/2} exp( −(1/2)(x2 − µ2)′ Σ22⁻¹ (x2 − µ2) )
     = p(x1) p(x2).

Hence X1 and X2 are independent.

Theorem 4.4.4 (Conditional Distribution) The conditional distribution of X1


given X2 is N (µ1|2 , Σ11|2 ), where

µ1|2 = µ1 + Σ12 Σ22⁻¹ (X2 − µ2),
Σ11|2 = Σ11 − Σ12 Σ22⁻¹ Σ21.

Proof: First note that


( X1 − Σ12 Σ22⁻¹ X2 )   ( I   −Σ12 Σ22⁻¹ ) ( X1 )
(        X2         ) = ( 0        I     ) ( X2 ).

Since

( I  −Σ12 Σ22⁻¹ ) ( Σ11  Σ12 ) ( I            0 )   ( Σ11 − Σ12 Σ22⁻¹ Σ21    0   )
( 0       I     ) ( Σ21  Σ22 ) ( −Σ22⁻¹ Σ21   I ) = ( 0                     Σ22 ),

X1 − Σ12 Σ22⁻¹ X2 and X2 are independent. We write

X1 = A1 + A2 ,

where

A1 = X1 − Σ12 Σ22⁻¹ X2,   A2 = Σ12 Σ22⁻¹ X2.

Since A1 is independent of X2 , the conditional distribution of A1 given X2 is the


unconditional distribution of A1 , which is
N( µ1 − Σ12 Σ22⁻¹ µ2 , Σ11 − Σ12 Σ22⁻¹ Σ21 ).

A2 may be treated as a constant given X2 , which only shifts the mean of the conditional distribution of X1 given X2 . We have thus obtained the desired result.

From the above result, we may see that the conditional mean of X1 given X2 is
linear in X2 , and that the conditional variance of X1 given X2 does not depend on
X2 . Of course the conditional variance of X1 given X2 is less than the unconditional
variance of X1 , in the sense that Σ11 − Σ11|2 is a positive semi-definite matrix.
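The conditional mean and variance in Theorem 4.4.4 are easy to evaluate numerically. The sketch below is illustrative only, assuming NumPy is available; the partition here is 1 + 1 dimensional and the numbers are arbitrary.

    import numpy as np

    mu1, mu2 = np.array([0.0]), np.array([1.0])
    S11, S12 = np.array([[2.0]]), np.array([[0.5]])
    S21, S22 = np.array([[0.5]]), np.array([[1.0]])

    x2 = np.array([2.0])                                    # observed value of X2
    mu_cond = mu1 + S12 @ np.linalg.solve(S22, x2 - mu2)    # conditional mean of X1 given X2 = x2
    S_cond = S11 - S12 @ np.linalg.solve(S22, S21)          # conditional variance of X1 given X2
    print(mu_cond, S_cond)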

4.4.3 Quadratic Forms


Let X be an n-by-1 random vector and A be an n-by-n deterministic matrix, the
quantity X ′ AX is called the quadratic form of X with respect to A. In this section
we consider the distribution of the quadratic forms of X when X is multivariate
normal. First we introduce a few important distributions that are related with the
quadratic forms of normal vectors.

chi-square distribution If Z = (Z1 , . . . , Zn )′ ∼ N (0, In ), it is well known that


Z′Z = Σ_{i=1}^n Zi² ∼ χ²_n ,

which is called chi-square distribution with n degrees of freedom.

Student t distribution Let T = Z/√(V/m), where Z ∼ N (0, 1), V ∼ χ²_m , and Z and V are independent. Then T ∼ t_m , the Student t distribution with m degrees of freedom.

F distribution Let F = (V1/m1)/(V2/m2), where V1 and V2 are independent χ²_{m1} and χ²_{m2} , respectively. Then F ∼ F_{m1,m2} , the F distribution with degrees of freedom m1 and m2 .

Theorem 4.4.5 Let X ∼ N (0, Σ), where Σ is nonsingular. Then

X ′ Σ−1 X ∼ χ2n .

Proof: Note that Σ−1/2 X ∼ N (0, In ).

To get to the next theorem, recall that a square matrix is a projection if and only if
P 2 = P .1 If, in addition, P is symmetric, then P is an orthogonal projection.

Theorem 4.4.6 Let Z ∼ N (0, In ) and P be an m-dimensional orthogonal projec-


tion in Rn , then we have
Z ′ P Z ∼ χ2m .

Proof: It is well known that P may be decomposed into



P = Hm Hm′ ,

where Hm is an n × m orthogonal matrix such that Hm′ Hm = Im . Note that Hm′ Z ∼ N (0, Im ) and Z′PZ = (Hm′ Z)′(Hm′ Z).

Theorem 4.4.7 Let Z ∼ N (0, In ), and let A and B be deterministic matrices, then
A′ Z and B ′ Z are independent if and only if A′ B = 0.

Proof: Let C = (A, B). Without loss of generality, we assume that C is full rank
(if it is not, then throw away linearly dependent columns). We have
C′Z = ( A′Z )  ∼  N( 0, ( A′A   A′B ) ).
      ( B′Z )           ( B′A   B′B )

It is now clear that A′ Z and B ′ Z are independent if and only if the covariance A′ B
is null.

It is immediate that we have

Corollary 4.4.8 Let Z ∼ N (0, In ), and let P and Q be orthogonal projections such
that P Q = 0, then Z ′ P Z and Z ′ QZ are independent.

Proof: Note that since P Q = 0, then P Z and QZ are independent. Hence the
independence of Z ′ P Z = (P Z)′ (P Z) and Z ′ QZ = (QZ)′ (QZ).
¹Matrices that satisfy this property are said to be idempotent.

Using the above results, we can easily prove

Theorem 4.4.9 Let Z ∼ N (0, In ), and let P and Q be orthogonal projections of


dimensions m1 and m2 , respectively. If P Q = 0, then
(Z′PZ/m1) / (Z′QZ/m2) ∼ F_{m1,m2} .

Finally, we prove a useful theorem.

Theorem 4.4.10 Let (Xi ) be i.i.d. N (µ, σ 2 ), and define


X̄n = (1/n) Σ_{i=1}^n Xi ,
Sn² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄n)².
We have

(a) X̄n ∼ N (µ, σ²/n).
(b) (n − 1)Sn²/σ² ∼ χ²_{n−1} .
(c) X̄n and Sn² are independent.
(d) √n (X̄n − µ)/Sn ∼ t_{n−1} .

Proof: Let X = (X1 , . . . , Xn)′ and ι be an n × 1 vector of ones; then X ∼ N (µι, σ²In).
(a) follows from X̄n = (1/n) ι′X. Define Pι = ιι′/n = ι(ι′ι)⁻¹ι′, which is the orthogonal projection on the span of ι. Then we have

Σ_{i=1}^n (Xi − X̄n)² = (X − ιι′X/n)′(X − ιι′X/n) = X′(I − Pι)X.

Hence

(n − 1)Sn²/σ² = ((X − µι)/σ)′ (In − Pι) ((X − µι)/σ).

(b) follows from the fact that (X − µι)/σ ∼ N (0, In) and that (In − Pι) is an (n − 1)-dimensional orthogonal projection. To prove (c), we note that X̄n = (1/n) ι′Pι X and Sn² = (1/(n − 1)) ((I − Pι)X)′((I − Pι)X), and that Pι X and (I − Pι)X are independent by Theorem 4.4.7. Finally, (d) follows from
√n (X̄n − µ)/Sn = [√n (X̄n − µ)/σ] / √[ ((n − 1)Sn²/σ²) / (n − 1) ].
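The claims of Theorem 4.4.10 can also be checked by simulation. The sketch below is illustrative only, assuming NumPy is available; the parameter values are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma, n, reps = 1.0, 2.0, 10, 50_000
    X = rng.normal(mu, sigma, size=(reps, n))
    xbar = X.mean(axis=1)
    s2 = X.var(axis=1, ddof=1)

    print(xbar.mean(), xbar.var())            # close to mu and sigma^2/n, as in (a)
    print(((n - 1) * s2 / sigma**2).mean())   # close to n - 1, the chi-square mean, as in (b)
    print(np.corrcoef(xbar, s2)[0, 1])        # close to 0, consistent with independence in (c)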

4.5 Exercises
1. Derive the characteristic function of the distribution with density

p(x) = exp(−|x|)/2.

2. Let X and Y be independent standard normal variables. Find the density of


a random variable defined by
X
U= .
Y
[Hint: Let V = Y and first find the joint density of U and V .]

3. Let X and Y have bivariate normal distribution with mean and variance
( 1 )        ( 1  1 )
( 2 )   and  ( 1  2 ).

(a) Find a constant α∗ such that Y − α∗ X is independent of X. Show that


var(Y − αX) ≥ var(Y − α∗ X) for any constant α.
(b) Find the conditional distribution of X + Y given X − Y .
(c) Obtain E(X|X + Y ).

4. Let X = (X1 , . . . , Xn )′ be a random vector with mean µι and variance Σ,


where µ is a scalar, ι is the n-vector of ones and Σ is an n by n symmetric
matrix. We define
X̄n = Σ_{i=1}^n Xi / n   and   Sn² = Σ_{i=1}^n (Xi − X̄n)² / (n − 1).
Consider the following assumptions:
(A1) X has multivariate normal distribution,
(A2) Σ = σ 2 I,
(A3) µ = 0.
We claim:
(a) X n and Sn2 are uncorrelated.
(b) E(X n ) = µ.
(c) E(Sn2 ) = σ 2 .
(d) X n ∼ N (µ, σ 2 /n).
(e) (n − 1)Sn²/σ² ∼ χ²_{n−1} .
(f) √n (X̄n − µ)/Sn ∼ t_{n−1} .
What assumptions in (A1), (A2), and (A3) are needed for each of (a) – (f) to
hold. Prove (a) – (f) using the assumptions you specified.

Chapter 5

Introduction to Statistics

5.1 General Settings


The fundamental postulate of statistical analysis is that the observed data are real-
ized values of a vector of random variables defined on a common probability space.
This postulate is not verifiable. It is a philosophical view of the world that we choose
to take, and we call it the probabilistic view. An alternative view would be that the
seemingly random data are generated from a deterministic but chaotic law. We only
consider the probabilistic view, which is mainstream among economists.
Let X = (X1 , . . . , Xn ) be variables of interest, where for each i, Xi may be a vector.
The objective of statistical inference is to study the joint distribution of X based
on the observed sample.

The First Example: For example, we may study the relationship between indi-
vidual income (income) and the characteristics of the individual such as education
level (edu), work experience (expr), gender, etc. The variables of interest may then
be Xi = (incomei , edui , expri , genderi ). We may reasonably postulate that (Xi )
are independently and identically distributed (i.i.d.). Hence the study of the joint
distribution of X reduces to that of the joint distribution of Xi . To achieve this,
we take a sample of the whole population, and observe (Xi , i = 1, . . . , n), where i
denotes individuals. In this example in particular, we may focus on the conditional
distribution of income given edu, expr, and gender.

The Second Example: For another example, in macroeconomics, we may be


interested in the relationship among government expenditure (gt ), GDP growth
(yt ), inflation (πt ), and unemployment (ut ). The variables of interest may be Xt =

(gt , yt , πt , ut ). One of the objectives of empirical analysis, in this example, may be
to study the conditional distribution of unemployment given past observations on
government expenditure, GDP growth, inflation, as well as unemployment itself. The problem with
this example is, first, that the i.i.d. assumption on Xt is untenable, and second, that
we can observe each Xt only once. In other words, an economic data generating
process is nonindependent and time-irreversible. It is clear that the statistical study
would go nowhere unless we impose (sometimes strong) assumptions on the evolution
of Xt , stationarity for example.
In this chapter, for simplicity, we have the first example in mind. In most cases, we
assume that X1 , . . . , Xn are i.i.d. with a distribution Pθ that belongs to a family of
distributions {Pθ |θ ∈ Θ} where θ is called parameter and Θ a parameter set. In this
course we restrict θ to be finite-dimensional. This is called the parametric approach
to statistical analysis. The nonparametric approach refers to the case where we do
not restrict the distribution to any family of distributions, which is in a sense to
allow θ to be infinite-dimensional. In this course we mainly consider the parametric
approach.

Definition 5.1.1 (Statistic) A statistic is a real-valued (or vector-valued) mea-


surable function τ (X) of a random sample X = (X1 , . . . , Xn ).

Note that the statistic is a random variable (or vector) itself.


Statistical inference consists of two procedures: estimation of and hypothesis testing
on θ. For the purpose of estimating θ, we need to construct a vector-valued statistic
called estimator, θ̂(X) : X → T , where X is called the state space (the range of X),
and where T includes Θ. It is customary to omit X in θ̂(X) and to write θ̂.
For the purpose of hypothesis testing on θ, we need to construct a statistic called
test statistic, τ (X) : X → T , where T is a subset of R. A hypothesis divides Θ into
two disjoint and exhaustive subsets. We rely on the value of τ to decide whether θ0 ,
the true parameter, is in one of them.

Sufficient Statistic Let τ = τ (X) be a statistic, and P = {Pθ |θ ∈ Θ} be a family


of distributions of X.

Definition 5.1.2 (Sufficient Statistic) We define that τ is sufficient for P (or


more precisely θ) if the conditional distribution of X given τ does not depend on θ.

The distribution of X can be any member of the family P. Therefore, the conditional
distribution of X given τ would depend on θ in general. τ is sufficient in the sense
that the distribution of X is uniquely determined by the value of τ .

Sufficient statistics are useful in data reduction. It is less costly to infer θ from a
statistic τ than from X, since the former, being a function of the latter, is of lower
dimension. The sufficiency of τ guarantees that τ contains all information about θ
in X.

Example 5.1.3 Suppose that X ∼ N (0, σ 2 ) and τ = |X|. Conditional on τ = t,


X can take t or −t. Since the distribution of X is symmetric about the origin, each
has a conditional probability of 1/2, regardless of the value of σ 2 . The statistic τ is
thus sufficient.

Example 5.1.4 Let X1 and X2 be independent Poisson(λ). τ = X1 + X2 is a


sufficient statistic. First, the joint density of X1 and X2 is
pλ (x1 , x2 ) = exp(−2λ) λ^{x1+x2} / (x1 ! x2 !),  x1 , x2 = 0, 1, 2, . . . .
We may show that p(x1 | τ = t) = pλ (x1 , t − x1)/pλ,τ (t), where pλ,τ denotes the density of τ, is λ-free (it is in fact the Binomial(t, 1/2) density). The same is of course
true for p(x2 | τ = t). Hence τ is sufficient.

Theorem 5.1.5 (Fisher-Neyman Factorization) A statistic τ = τ (X) is suffi-


cient if and only if there exist two functions f and g such that the density of X is
factorized as
pθ (x) = f (τ (x), θ)g(x).

This theorem implies that if two samples give the same value for a sufficient statistic,
then the MLE based on the two samples yield the same estimate of the parameters.

Example 5.1.6 Let X1 , . . . , Xn be i.i.d. Poisson(λ). We may write the joint dis-
tribution of X = (X1 , . . . , Xn ) as
pλ (x) = e^{−nλ} λ^{x1+···+xn} / Π_{i=1}^n xi ! = f (τ (x), λ) g(x),

where τ (x) = Σ_{i=1}^n xi , f (t, λ) = exp(−nλ) λ^t , and g(x) = (Π_{i=1}^n xi !)⁻¹. Hence τ (x) is sufficient for λ.

Example 5.1.7 Let X1 , . . . , Xn be i.i.d. N(µ, σ 2 ). The joint density is


p_{µ,σ²}(x) = (2πσ²)^{−n/2} exp( −(1/(2σ²)) Σ_{i=1}^n (xi − µ)² )
            = (2πσ²)^{−n/2} exp( −(1/(2σ²)) Σ_{i=1}^n xi² + (µ/σ²) Σ_{i=1}^n xi − nµ²/(2σ²) ).

It is clear that τ (x) = ( Σ_{i=1}^n xi , Σ_{i=1}^n xi² ) is sufficient for (µ, σ²)′.

Minimal Sufficient Statistic Sufficient statistic is by no means unique. τ (x) =
(x1 , . . . , xn )′ , for example, is always sufficient. Let τ and κ be two statistics and κ
is sufficient. It follows immediately from the Fisher-Neyman factorization theorem
that if τ = h(κ) for some function h, then τ is also sufficient. If h is a many-to-one
function, then τ provides further data reduction than κ. We call a sufficient statistic
minimal if it is a function of every sufficient statistic. A minimal sufficient statistic
thus achieves data reduction to the best extent.

Definition 5.1.8 (Exponential Family) The exponential family refers to the family of
distributions that have densities of the form

pθ (x) = exp[ Σ_{i=1}^m ai (θ) τi (x) + b(θ) ] g(x),

where m is a positive integer.

To emphasize the dependence on m, we may call the above family m-parameter


exponential family.

• Note that for the m-parameter exponential family, by the factorization theo-
rem, τ (x) = (τ1 (x), . . . , τm (x))′ is a sufficient statistic.

• If X1 , . . . , Xn are i.i.d. with density

  pθ (xi ) = exp[ a(θ) τ (xi ) + b(θ) ] g(xi ),

  then the joint density of X = (X1 , . . . , Xn )′ is

  pθ (x) = exp[ a(θ) Σ_{i=1}^n τ (xi ) + n b(θ) ] Π_{i=1}^n g(xi ).

  This implies that Σ_i τ (xi ) is a sufficient statistic.

The exponential family includes many distributions that are in frequent use.

Example 5.1.9 (One-parameter exponential family) • Poisson(λ)


pλ (x) = e^{−λ} λ^x / x! = e^{x log λ − λ} (1/x!).

• Bernoulli(θ)

pθ (x) = θx (1 − θ)1−x = exp (x log(θ/(1 − θ)) + log(1 − θ)) .

Example 5.1.10 (Two-parameter exponential family) • N(µ, σ²)

p_{µ,σ²}(x) = [1/(√(2π) σ)] exp( −(x − µ)²/(2σ²) )
            = [1/√(2π)] exp( −x²/(2σ²) + µx/σ² − (µ²/(2σ²) + log σ) ).

• Gamma(α, β)

p_{α,β}(x) = [1/(Γ(α) β^α)] x^{α−1} e^{−x/β}
           = exp( (α − 1) log x − x/β − (log Γ(α) + α log β) ).

Remark on Bayesian Approach The Bayesian approach to probability is one of


the different interpretations of the concept of probability. Bayesians view probability
as an extension of logic that enables reasoning with uncertainty. Bayesians do not
reject or accept a hypothesis, but evaluate the probability of a hypothesis. To
achieve this, Bayesians specify some prior distribution p(θ), which is then updated
in the light of new relevant data by the Bayes’ rule,

p(θ|x) = p(θ) p(x|θ)/p(x),

where p(x) = ∫ p(x|θ) p(θ) dθ. Note that Bayesians treat θ as random, hence the
conditional-density notation of p(θ|x), which is called posterior density.

5.2 Estimation

5.2.1 Method of Moment


Let X1 , . . . , Xn be i.i.d. random variables with a common distribution Pθ , where
the parameter vector θ is to be estimated. And let x1 , . . . , xn be a realized sam-
ple. We call the underlying distribution Pθ the population, the moments of which
we call population moments. Let f be a vector of measurable functions, f(x) =
(f1 (x), . . . , fm (x))′; the f-population moments of Pθ are given by

Eθ f = ∫ f dPθ .
In contrast, we call the sample average of (f(xi )) the sample moments. Note that
the sample average may be regarded as the moment of the distribution that assigns
probability mass 1/n to each realization xi . This distribution is called the empir-
ical distribution, which we denote Pn . Obviously, the moments of the empirical
distribution equal the corresponding sample moments

En f = ∫ f dPn = (1/n) Σ_{i=1}^n f(xi ).

The method of moment (MM) equates population moment to sample moment so


that the parameter vector θ may be solved. In other words, the MM estimation
solves the following set of equations for the parameter vector θ,

Eθ f = En f. (5.1)

This set of equations are called the moment conditions.

Example 5.2.1 Let Xi be i.i.d. Poisson(λ). To estimate λ, we may solve the


following equation,
Eλ Xi = (1/n) Σ_{i=1}^n xi .

It is immediate that the MM estimator of λ is exactly x̄ = (1/n) Σ_{i=1}^n xi .

Example 5.2.2 Let Xi be i.i.d. N(µ, σ 2 ). To estimate µ and σ 2 , we may solve the
following system of equations

E_{µ,σ²} X = (1/n) Σ_{i=1}^n xi ,
E_{µ,σ²} (X − µ)² = (1/n) Σ_{i=1}^n (xi − µ)².

Solving these yields

µ̂ = x̄,  and  σ̂² = (1/n) Σ_{i=1}^n (xi − x̄)².
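In this example the moment conditions have a closed-form solution, so the MM estimates are computed directly from the sample moments. The sketch below is illustrative only, assuming NumPy is available; the sample is simulated for demonstration.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(3.0, 2.0, size=1_000)        # i.i.d. sample from N(3, 4)

    mu_hat = x.mean()                           # matches E X with the sample mean
    sigma2_hat = np.mean((x - mu_hat) ** 2)     # matches E(X - mu)^2 with its sample counterpart
    print(mu_hat, sigma2_hat)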

A Remark on GMM If the number of equations (moment conditions) in (5.1)


exceeds the number of parameters to be estimated, then the parameter θ is over-
identified. In such cases, we may use the generalized method of moments (GMM)

to estimate θ. The basic idea of GMM is to minimize some distance measure be-
tween the population moments and their corresponding sample moments. A popular
approach is to solve the following quadratic programming problem,

min_{θ∈Θ} d(θ; x)′ W d(θ; x),

where d(θ; x) = Eθ f − En f and W is a positive definite weighting matrix. The


detailed properties of GMM are out of the scope of this text.

5.2.2 Maximum Likelihood

Let p(x, θ) be the density of the distribution Pθ . We write p(x, θ), instead of pθ (x),
to emphasize that the density is a function of θ as well as that of x. We define
likelihood function as
p(θ; x) = p(x, θ).

The likelihood function is a function of the parameter θ given a sample x. Obvi-


ously, it is intuitively appealing to assume that if θ = θ0 , the true parameter, then
the likelihood function p(θ; x) achieves the maximum. This is indeed the fundamen-
tal assumption of the maximum likelihood estimation (MLE), which is defined as
follows,

Definition 5.2.3 (MLE) The maximum likelihood estimator (MLE) of θ is given


by
θ̂M L = arg max p(θ; x).
θ∈Θ

Remark: Let τ be any sufficient statistic for the parameter θ. According to the
factorization theorem, we have p(x, θ) = f (τ (x), θ)g(x). Then θ̂M L maximizes
f (τ (x), θ) with respect to θ. Therefore, θ̂M L is always a function of τ (X). This
implies that if MLE is a sufficient statistic, then it is always minimal.

Log Likelihood It is often easier to maximize the logarithm of the likelihood


function,
ℓ(θ; x) = log(p(θ; x)).

Since the log function is monotone increasing, maximizing log likelihood yields the
same estimates.

First Order Condition If the log likelihood function ℓ(θ; x) is differentiable and
globally concave for all x, then the ML estimator can be obtained by solving the
first order condition (FOC),
∂ℓ/∂θ (θ; x) = 0.

Note that s(θ; x) = ∂ℓ/∂θ (θ; x) is called the score function.

Theorem 5.2.4 (Invariance Theorem) If θ̂ is an ML estimator of θ and π =


g(θ) be a function of θ, then g(θ̂) is an ML estimator of π.

Proof: If g is one-to-one, then

p(θ; x) = p(g −1 g(θ); x) = p∗ (g(θ); x).


Both ML estimators, θ̂ and the estimator of π (denote it π̂), maximize the likelihood function, and it is obvious that

θ̂ = g⁻¹(π̂).

This implies g(θ̂) = π̂. If g is many-to-one, π̂ = g(θ̂) still corresponds to θ̂
that maximizes p(θ; x). Any other value of π would correspond to θ that results in
lower likelihood. Q.E.D.

Example 5.2.5 (Bernoulli(θ)) Let (Xi , i = 1, . . . , n) be i.i.d. Bernoulli(θ), then


the log likelihood function is given by
ℓ(θ; x) = ( Σ_{i=1}^n xi ) log θ + ( n − Σ_{i=1}^n xi ) log(1 − θ).

The FOC yields



θ̂⁻¹ Σ_{i=1}^n xi − (1 − θ̂)⁻¹ ( n − Σ_{i=1}^n xi ) = 0,

which is solved to obtain θ̂ = x̄ = n⁻¹ Σ_{i=1}^n xi . Note that to estimate the variance of
Xi , we need to estimate v = θ(1 − θ), a function of θ. By the invariance theorem,
we obtain v̂ = θ̂(1 − θ̂).

Example 5.2.6 (N (µ, σ 2 )) Let Xi be i.i.d. N(µ, σ 2 ), then the log-likelihood func-
tion is given by

ℓ(µ, σ²; x) = −(n/2) log(2π) − (n/2) log σ² − (1/(2σ²)) Σ_{i=1}^n (xi − µ)².

Solving the FOC gives

µ̂ = x̄,
σ̂² = (1/n) Σ_{i=1}^n (xi − x̄)².

Note that the ML estimators are identical to the MM estimators.
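When the FOC has no closed form, the MLE can be computed by numerical optimization. The sketch below is illustrative only; it assumes NumPy/SciPy are available, and the log-scale parameterization of σ is a choice made here for unconstrained optimization, not something used in the text.

    import numpy as np
    from scipy import optimize, stats

    rng = np.random.default_rng(0)
    x = rng.normal(1.0, 1.5, size=500)

    def neg_loglik(theta):
        mu, log_sigma = theta
        return -np.sum(stats.norm.logpdf(x, mu, np.exp(log_sigma)))

    res = optimize.minimize(neg_loglik, x0=np.array([0.0, 0.0]))
    mu_ml, sigma2_ml = res.x[0], np.exp(res.x[1]) ** 2
    print(mu_ml, sigma2_ml)                          # numerical MLE
    print(x.mean(), np.mean((x - x.mean()) ** 2))    # closed-form MLE from the FOC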

Example 5.2.7 (Uniform[0, θ]) Let Xi be i.i.d. Uniform([0, θ]). Then

p(θ; x) = (1/θ^n) Π_{i=1}^n I{0 ≤ xi ≤ θ}
        = (1/θ^n) I{min_{1≤i≤n} xi ≥ 0} I{max_{1≤i≤n} xi ≤ θ}.
It follows that θ̂ = max{x1 , . . . , xn }.

5.2.3 Unbiasedness and Efficiency


Let Pθ denote the probability measure in Ω corresponding to Pθ in X , and let Eθ
denote the expectation taken with respect to Pθ .

Definition 5.2.8 (Unbiasedness) An estimator θ̂ is unbiased if for all θ ∈ Θ,


Eθ θ̂ = θ.

Unbiasedness is a desirable property. Loosely speaking, it refers to the description


that “the estimation is correct in average”. To describe how “varied” an estimator
would be, we often use the mean squared error, which is defined as
MSE(θ̂) = Eθ (θ̂ − θ)2 .
We may decompose the MSE as
MSE(θ̂) = Eθ (θ̂ − Eθ θ̂)2 + (Eθ θ̂ − θ)2 .
For an unbiased estimator θ̂, the second term vanishes, then the MSE is equal to
the variance.
In general, MSE is a function of the unknown parameter θ and it is impossible to
find an estimator that has the smallest MSE for all θ ∈ Θ. However, if we restrict
our attention to the class of unbiased estimators, we may find an estimator that
enjoys the smallest variance (hence MSE) for all θ ∈ Θ. This property is known as
uniformly minimum variance unbiasedness (UMVU). More precisely, we have

Definition 5.2.9 (UMVU Estimator) An estimator θ̂ is called an UMVU esti-
mator if it satisfies

(1) θ̂ is unbiased,

(2) Eθ (θ̂ − θ)2 ≤ Eθ (θ̃ − θ)2 for any unbiased estimator θ̃.

5.2.4 Lehmann-Scheffé Theorem


The prominent Lehmann-Scheffé Theorem helps to find UMVU estimators. First,
we introduce some basic concepts in the decision-theoretic approach of statistical
estimation.

Definition 5.2.10 (Loss Function) Loss function is any function ℓ(t, θ) that as-
signs disutility to each pair of estimate t and parameter value θ.

Examples of Loss Function

• ℓ(t, θ) = (t − θ)2 , squared error.

• ℓ(t, θ) = |t − θ|, absolute error.

• ℓ(t, θ) = cI{|t − θ| > ϵ}, fixed loss out of bound.

Definition 5.2.11 (Risk Function) For an estimator T = τ (X), the risk func-
tion is defined by
r(τ, θ) = Eθ ℓ(T, θ).

It can be observed that risk function is the expected loss of an estimator for each
value of θ. Risk functions corresponding to the loss functions in the above examples
are

Examples of Risk Function

• r(τ, θ) = Eθ (τ (X) − θ)2 , mean squared error.

• r(τ, θ) = Eθ |τ (X) − θ|, mean absolute error.

• r(τ, θ) = cPθ {|τ − θ| > ϵ}

In the decision-theoretic approach of statistical inference, estimators are constructed
by minimizing some appropriate loss or risk functions.

Definition 5.2.12 (Minimax Estimator) An estimator τ∗ is called minimax if


sup r(τ∗ , θ) ≤ sup r(τ, θ)
θ∈Θ θ∈Θ

for every other estimator τ .

Note that supθ∈Θ r(τ, θ) measures the maximum risk of an estimator τ .

Theorem 5.2.13 (Rao-Blackwell Theorem) Suppose that the loss function ℓ(t, θ)
is convex in t and that S is a sufficient statistic. Let T = τ (X) be an estimator for
θ with finite mean and risk. If we define T∗ = Eθ (T |S) and write T∗ = τ∗ (X), then
we have
r(τ∗ , θ) ≤ r(τ, θ).

Proof: Since ℓ(t, θ) is convex in t, Jensen’s inequality gives


ℓ(T∗ , θ) = ℓ(Eθ (T |S), θ) ≤ Eθ (ℓ(T, θ)|S).
We conclude by taking expectations on both sides and applying the law of iterative
expectations.

Note that Eθ (T |S) is not a function of θ, since S is sufficient.

Definition 5.2.14 (Complete Statistic) A statistic T is complete if Eθ f (T ) = 0


for all θ ∈ Θ implies f = 0 a.s. Pθ .

Theorem 5.2.15 (Lehmann-Scheffé Theorem) If S is complete and sufficient


and T = τ (X) is an unbiased estimator of g(θ), then f (S) = Eθ (T |S) is a UMVU
estimator.

Proof: Apply Rao-Blackwell Theorem with the squared loss function ℓ(t, θ) = (t −
θ)2 .

Note that f (S) is also a unique unbiased estimator. Suppose there exists another
unbiased estimator f˜(S), then Eθ (f (S) − f˜(S)) = 0. But the completeness of S
guarantees that f = f˜.
Given a complete and sufficient statistic, it is then straightforward to obtain a
UMVU estimator. What we have to do is to take any unbiased estimator T and
obtain the desired UMVU estimator as T ∗ = Eθ (T |S).

Example 5.2.16 Let (Xi , i = 1, . . . , n) be i.i.d. Uniform(0, θ), and let S = maxi Xi . S is sufficient and complete. To see the completeness, note that

Pθ (S ≤ s) = (Pθ (Xi ≤ s))^n = (s/θ)^n .

The density of S is thus

pθ (s) = [n s^{n−1}/θ^n] I{0 ≤ s ≤ θ}.

Eθ f (S) = 0 for all θ implies

∫_0^θ s^{n−1} f (s) ds = 0, for all θ.

This is only possible when f = 0.


Now we proceed to find a UMVU estimator. Let T = 2X1 , which is an unbiased
estimator for θ. Suppose S = s, then X1 can take s with probability 1/n, since every
member of (Xi , i = 1, . . . , n) is equally likely to be the maximum. When X1 ̸= s,
which is of probability (n − 1)/n, X1 is uniformly distributed on (0, s). Thus we have

Eθ (T |S = s) = 2 Eθ (X1 |S = s)
             = 2 ( (1/n) s + ((n − 1)/n)(s/2) )
             = ((n + 1)/n) s.

The UMVU estimator of θ is thus obtained as

T∗ = ((n + 1)/n) max_{1≤i≤n} Xi .
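A small simulation illustrates the variance reduction achieved by conditioning on the sufficient statistic. The sketch below is illustrative only, assuming NumPy is available; the value of θ and the sample size are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    theta, n, reps = 2.0, 10, 100_000
    X = rng.uniform(0.0, theta, size=(reps, n))

    T = 2 * X[:, 0]                          # unbiased but crude estimator
    T_star = (n + 1) / n * X.max(axis=1)     # the UMVU estimator derived above

    print(T.mean(), T_star.mean())           # both close to theta
    print(T.var(), T_star.var())             # T_star has a much smaller variance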

5.2.5 Efficiency Bound


It is generally not possible to construct an UMVU estimator. However, we show in
this section that there exists a lower bound for the variance of unbiased estimators,
which we call efficiency bound. If an unbiased estimator achieves the efficiency
bound, we say that it is an efficient estimator.
Let ℓ(θ; x) be the log-likelihood function. Recall that we have defined score function
s(θ; x) = ∂ℓ/∂θ(θ; x). We further define:

(a) Hessian: h(θ; x) = ∂²ℓ/∂θ∂θ′ (θ; x).
(b) Fisher Information: I(θ) = Eθ s(θ; X)s(θ; X)′ .

(c) Expected Hessian: H(θ) = Eθ h(θ; X).

Note that for a vector of independent variables, the scores and Hessians are additive.
Specifically, let X1 and X2 be independent random vectors, let X = (X1′ , X2′ )′ . De-
note the scores and the Hessians of Xi , i = 1, 2, by s(θ; xi ) and H(θ; xi ) respectively,
and denote the score and the Hessian of X by s(θ; x) and H(θ; x), respectively. Then
it is clear that

s(θ; x) = s(θ; x1 ) + s(θ; x2 )


h(θ; x) = h(θ; x1 ) + h(θ; x2 ).

We can also show that

I(θ) = I1 (θ) + I2 (θ)


H(θ) = H1 (θ) + H2 (θ),

where I(θ), I1 (θ), and I2 (θ) denote the information matrix of X, X1 , X2 , respectively,
and the notations of H, H1 , and H2 are analogous.
From now on, we assume that a random vector X has joint density p(x, θ) with
respect to Lebesgue measure µ. Note that the notation p(x, θ) emphasizes the fact
that the joint density of X is a function of both x and θ. We let θ̂ (or more precisely,
θ̂(X)) be an unbiased estimator for θ. And we impose the following regularity
conditions on p(x, θ),

Regularity Conditions


(a) ∂/∂θ ∫ p(x, θ) dµ(x) = ∫ ∂/∂θ p(x, θ) dµ(x),
(b) ∂²/∂θ∂θ′ ∫ p(x, θ) dµ(x) = ∫ ∂²/∂θ∂θ′ p(x, θ) dµ(x),
(c) ∫ θ̂(x) ∂/∂θ′ p(x, θ) dµ(x) = ∂/∂θ′ ∫ θ̂(x) p(x, θ) dµ(x).

Under these regularity conditions, we have a few results that are both useful in
proving subsequent theorems and interesting in themselves.

Lemma 5.2.17 Suppose that Condition (a) holds, then

Eθ s(θ; X) = 0.

Proof: We have

Eθ s(θ; X) = ∫ s(θ; x) p(x, θ) dµ(x)
           = ∫ [∂/∂θ ℓ(θ; x)] p(x, θ) dµ(x)
           = ∫ [ (∂/∂θ p(x, θ)) / p(x, θ) ] p(x, θ) dµ(x)
           = ∂/∂θ ∫ p(x, θ) dµ(x)
           = 0.

Lemma 5.2.18 Suppose that Condition (b) holds, then

I(θ) = −H(θ).

Proof: We have
∂²ℓ/∂θ∂θ′ (θ; x) = [∂²p(x, θ)/∂θ∂θ′] / p(x, θ) − [∂ log p(x, θ)/∂θ] [∂ log p(x, θ)/∂θ′].

Then

H(θ) = ∫ [∂²ℓ/∂θ∂θ′ (θ; x)] p(x, θ) dµ(x)
     = ∫ ∂²p(x, θ)/∂θ∂θ′ dµ(x) − I(θ)
     = −I(θ).

Lemma 5.2.19 Let θ̂(X) be an unbiased estimator for θ, and suppose the Condition
(c) holds, then
Eθ θ̂(X)s(θ; X)′ = I.

Proof: We have
Eθ θ̂(X) s(θ; X)′ = ∫ θ̂(x) [ (∂p(x, θ)/∂θ′) / p(x, θ) ] p(x, θ) dµ(x)
                  = ∂/∂θ′ ∫ θ̂(x) p(x, θ) dµ(x)
                  = I.

Theorem 5.2.20 (Cramer-Rao Bound) Let θ̂(X) be an unbiased estimator of
θ, and if Conditions (a) and (c) hold, then,
varθ( θ̂(X) ) ≥ I(θ)⁻¹.

Proof: Using the above lemmas, we have


varθ( ( θ̂(X) )  =  ( varθ(θ̂(X))   I    )  ≡  A.
      ( s(θ; X) )    ( I            I(θ) )

Recall that the covariance matrix A must be positive semi-definite. Choosing B′ = (I, −I(θ)⁻¹), we must have B′AB = varθ(θ̂(X)) − I(θ)⁻¹ ≥ 0. The conclusion follows.

Example 5.2.21 Let X1 , . . . , Xn be i.i.d. Poisson(λ). The log-likelihood, the


score, and the Fisher’s information of each Xi are given by

ℓ(λ; xi ) = −λ + xi log λ − log xi !


s(λ; xi ) = −1 + xi /λ
Ii (λ) = 1/λ.

Then the information matrix I(λ) of X = (X1 , . . . , Xn )′ is I(λ) = nI1 (λ) = n/λ.
Recall that λ̂ = X̄ = (1/n) Σ_{i=1}^n Xi is an unbiased estimator for λ. And we have

varλ (X̄) = varλ (X1 )/n = λ/n.

Hence the estimator X̄ is an UMVU estimator.
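The bound can be seen in a simulation: the variance of X̄ should be indistinguishable from I(λ)⁻¹ = λ/n. The sketch below is illustrative only, assuming NumPy is available; λ, n, and the number of replications are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    lam, n, reps = 4.0, 50, 100_000
    X = rng.poisson(lam, size=(reps, n))
    lam_hat = X.mean(axis=1)

    print(lam_hat.var())   # simulated variance of the sample mean
    print(lam / n)         # the Cramer-Rao bound I(lambda)^{-1} = lambda/n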

5.3 Hypothesis Testing

5.3.1 Basic Concepts


Suppose a random sample X = (X1 , . . . , Xn )′ is drawn from a population charac-
terized by a parametric family P = {Pθ |θ ∈ Θ}. We partition the parameter set Θ
as
Θ = Θ0 ∪ Θ1 .
A statistical hypothesis is of the following form:

H0 : θ ∈ Θ0 H1 : θ ∈ Θ1 ,

where H0 is called the null hypothesis and H1 is called the alternative hypothesis.

A test statistic, say τ , is used to partition the state space X into the disjoint union
of the critical region C and the acceptance region A,

X = C ∪ A.

The critical region is conventionally given as

C = {x ∈ X |τ (x) ≥ c},

where c is a constant that is called critical value. If the observed sample is within
the critical region, we reject the null hypothesis. Otherwise, we say that we fail to
reject the null and thus accept the alternative hypothesis. Note that different tests
differ in their critical regions. In the following, we denote tests using their critical
regions.

For θ ∈ Θ0 , Pθ (C) is the probability of rejecting H0 when it is true. We thus define

Definition 5.3.1 (Size) The size of a test C is

max_{θ∈Θ0} Pθ (C).

Obviously, it is desirable to have a small size. For θ ∈ Θ1 , Pθ (C) is the probability


of rejecting H0 when it is false. If this probability is large, we say that the test
is powerful. Conventionally, we call π(θ) = Pθ (C) the power function. The power
function restricted to the domain Θ1 characterizes the power of the test.
Given two tests with a same size, C1 and C2 , if Pθ (C1 ) > Pθ (C2 ) at θ ∈ Θ1 , we say
that C1 is more powerful than C2 . If there is a test C∗ that satisfies Pθ (C∗ ) ≥ Pθ (C)
at θ ∈ Θ1 for any test C of the same size, then we say that C∗ is the most powerful
test. Furthermore, if the test C∗ is such that Pθ (C∗ ) ≥ Pθ (C) for all θ ∈ Θ1 for any
test C of the same size, then we say that C∗ is the uniformly most powerful.
If Θ0 (or Θ1 ) is a singleton set, ie, Θ0 = {θ0 }, we call the hypothesis H0 : θ = θ0
simple. Otherwise, we call it composite hypothesis.
In particular, when both H0 and H1 are simple hypotheses, say, Θ0 = {θ0 } and
Θ1 = {θ1 }, P consists of two distributions Pθ0 and Pθ1 , which we denote as P0
and P1 , respectively. It is clear that P0 (C) and P1 (C) are the size and the power
of the test C, respectively. Note that both P0 (C) and P1 (A) are probabilities of
making mistakes. P0 (C) is the probability of rejecting the true null, and P1 (A) is
the probability of accepting the false null. Rejecting the true null is often called the
type-I error, and accepting the false null is called the type-II error.

5.3.2 Likelihood Ratio Tests
Assume that both the null and the alternative hypotheses are simple, Θ0 = {θ0 }
and Θ1 = {θ1 }. Let p(x, θ0 ) and p(x, θ1 ) be the densities of P0 and P1 , respectively.
We have

Theorem 5.3.2 (Neyman-Pearson Lemma) Let c be a constant. The test


C∗ = { x : λ(x) = p(x, θ1)/p(x, θ0) ≥ c }
is the most powerful test among all tests of the same size.

Proof: Suppose C is any test with the same size as C∗ . Assume without loss of
generality that C and C∗ are disjoint. It follows that
p(x, θ1 ) ≥ cp(x, θ0 ) on C∗
p(x, θ1 ) < cp(x, θ0 ) on C.
Hence we have
P1 (C∗) = ∫_{C∗} p(x, θ1) dµ(x) ≥ c ∫_{C∗} p(x, θ0) dµ(x) = c P0 (C∗),

and

P1 (C) = ∫_C p(x, θ1) dµ(x) < c ∫_C p(x, θ0) dµ(x) = c P0 (C).
Since P0 (C∗ ) = P0 (C) (the same size), we have P1 (C∗ ) ≥ P1 (C). Q.E.D.

Remarks:

• For obvious reasons, test of the same form as C∗ is also called likelihood ratio
(LR) test. The constant c is to be determined by pre-specifying a size, ie, by
solving for c the equation P0 (C) = α, where α is prescribed small number.
• We may view p(x, θ1 ) (or p(x, θ0 )) as marginal increases of power (size) when
the point x is added to the critical region C. The Neyman-Pearson Lemma
shows that those points contributing more power increase per unit increase in
size should be included in C for an optimal test.
• For any monotone increasing function f , the test {x ∈ X |(f ◦ λ)(x) ≥ c′ } is
identical to that is based on λ(x). It is hence also an LR test. Indeed, the LR
tests are often based on monotone increasing transformations of λ whose null
distributions are easier to obtain.

For composite hypotheses, we have the generalized LR test based on the ratio
λ(x) = sup_{θ∈Θ1} p(x, θ) / sup_{θ∈Θ0} p(x, θ).

The Neyman-Pearson Lemma does not apply to the generalized LR test. However,
it performs well in many contexts.

Example 1: Simple Student-t Test

First consider a simple example. Let X1 , . . . , Xn be i.i.d. N(µ, 1), and we test

H0 : µ = 0 against H1 : µ = 1

Since both the null and the alternative are simple, Neyman-Pearson Lemma ensures
that the likelihood ratio test is the best test. The likelihood ratio is
λ(x) = p(x, 1)/p(x, 0)
     = [ (2π)^{−n/2} exp( −(1/2) Σ_{i=1}^n (xi − 1)² ) ] / [ (2π)^{−n/2} exp( −(1/2) Σ_{i=1}^n (xi − 0)² ) ]
     = exp( Σ_{i=1}^n xi − n/2 ).

We know that τ (X) = n^{−1/2} Σ_{i=1}^n Xi is distributed as N(0, 1) under the null. We
may use this to construct a test. Note that we can write τ (x) = f ◦ λ(x), where
f (z) = n−1/2 (log z + n/2) is a monotone increasing function. The test

C = {x|τ (x) ≥ c}

is then an LR test. It remains to determine c. Suppose we allow the probability


of type-I error to be 5%, that is a size of 0.05, we may solve for c the equation
P0 (C) = 0.05. Since τ (X) ∼ N (0, 1) under the null, we can look up the N (0, 1)
table and find that
P0 (x|τ (x) ≥ 1.645) = 0.05.
This implies c = 1.645.
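The critical value and the test decision are easily reproduced numerically. The sketch below is illustrative only; it assumes NumPy/SciPy are available, and the data are generated under the null purely for demonstration.

    import numpy as np
    from scipy import stats

    c = stats.norm.ppf(0.95)            # about 1.645, the upper 5% point of N(0, 1)

    rng = np.random.default_rng(0)
    x = rng.normal(0.0, 1.0, size=25)   # a sample drawn under H0
    tau = np.sqrt(len(x)) * x.mean()    # tau(x) = n^{-1/2} * sum of x_i
    print(c, tau, tau >= c)             # reject H0 when tau >= c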

Example 2: One-Sided Student-t Test

Now we test
H0 : µ = 0 against H1 : µ > 0. (5.2)

The alternative hypothesis is now composite. From the preceding analysis, however,
it is clear that for any µ1 > 0, C is the most powerful test for

H0 : µ = 0 against H1 : µ = µ1 .

We conclude that C is the uniformly most powerful test.

Example 3: Two-Sided F Test

Next we let X1 , . . . , Xn be i.i.d. N (µ, σ 2 ), and test

H0 : µ = µ0 against H1 : µ ̸= µ0 .

Here we have two unknown parameters, µ and σ 2 , but the null and the alternative
hypotheses are concerned with the parameter µ only. We consider the generalized
LR test with the following generalized likelihood ratio
λ(x) = sup_{µ,σ²} (2πσ²)^{−n/2} exp( −(1/(2σ²)) Σ_{i=1}^n (xi − µ)² ) / sup_{σ²} (2πσ²)^{−n/2} exp( −(1/(2σ²)) Σ_{i=1}^n (xi − µ0)² ).

Recall that the ML estimators of µ and σ² are

µ̂ = x̄,   σ̂² = (1/n) Σ_{i=1}^n (xi − x̄)².

Hence µ̂ and σ̂ 2 achieve the sup on the numerator. On the denominator,

σ̃² = (1/n) Σ_{i=1}^n (xi − µ0)²

achieves the sup. Then we have


λ(x) = (2πσ̂²)^{−n/2} exp( −(1/(2σ̂²)) Σ_{i=1}^n (xi − µ̂)² ) / [ (2πσ̃²)^{−n/2} exp( −(1/(2σ̃²)) Σ_{i=1}^n (xi − µ0)² ) ]
     = ( Σ_{i=1}^n (xi − µ0)² / Σ_{i=1}^n (xi − x̄)² )^{n/2}
     = ( 1 + n(x̄ − µ0)² / Σ_{i=1}^n (xi − x̄)² )^{n/2}.

We define

τ (x) = (n − 1) n(x̄ − µ0)² / Σ_{i=1}^n (xi − x̄)².
It is clear that τ is a monotone increasing transformation of λ. Hence the generalized
LR test is given by C = {x|τ (x) ≥ c} for a constant c. Note that

τ (X) = (V1 /1) / (V2 /(n − 1)),

where

V1 = ( √n (X̄ − µ0)/σ )²   and   V2 = Σ_{i=1}^n (Xi − X̄)² / σ².
Under H0 , we can show that V1 ∼ χ21 , V2 ∼ χ2n−1 , and V1 and V2 are independent.
Hence, under H0 ,
τ (X) ∼ F1,n−1 .
To find the critical value c for a size-α test, we look up the F table and find the constant
F1,n−1 (α) such that

P0 {x | τ (x) ≥ F1,n−1 (α)} = α,

and set c = F1,n−1 (α).

From the preceding examples, we may see that the hypothesis testing problem con-
sists of three steps in practice: first, forming an appropriate test statistic, second,
finding the distribution of this statistic under H0 , and finally making a decision. If
the outcome of the test statistic is deemed as unlikely under H0 , the null hypothe-
sis H0 is rejected, in which case we accept H1 . The Neyman-Peason Lemma gives
important insights on how to form a test statistic that leads to a powerful test. In
the following example, we illustrate a direct approach that is not built on likelihood
ratio.

Example 4: Two-Sided Student-t Test

For the testing problem of Example 3, we may construct a Student-t test statistic
as follows,

τ̃ (x) = √n (x̄ − µ0) / √( Σ_{i=1}^n (xi − x̄)² / (n − 1) ).

However, τ̃ is not a monotone increasing transformation of λ. Hence the test based


on τ̃ is not a generalized LR test any more. However, we can easily derive the
distribution of τ̃ if the null hypothesis is true. Indeed, we have
τ̃ (X) = Z / √( V/(n − 1) ),

where

Z = √n (X̄ − µ0)/σ   and   V = Σ_{i=1}^n (Xi − X̄)² / σ².
Under H0 , we can show that Z ∼ N (0, 1), V ∼ χ2n−1 , and Z and V are independent.
Hence, under H0 ,
τ̃ (X) ∼ tn−1 .
To find the critical value c for a size-α test, we look up the t table and find a constant
tn−1 (1 − α/2) > 0 such that

P0 {x| − tn−1 (1 − α/2) ≤ τ̃ (x) ≤ tn−1 (1 − α/2)} = 1 − α.

Finally, to see the connection between this test and the F test in Example 3, note
that the F1,n−1 distribution is the distribution of the square of a tn−1 random variable; indeed, τ (x) = τ̃ (x)².

5.4 Exercises
1. Let X1 and X2 be independent Poisson(λ). Show that τ = X1 + X2 is a
sufficient statistic.

2. Let (Xi , i = 1, . . . , n) be a random sample from the underlying distribution


given by the density
p(x, θ) = (2x/θ²) I{0 ≤ x ≤ θ}.
(a) Find the MLE of θ.
(b) Show that T = max{X1 , . . . , Xn } is sufficient.
(c) Let

S1 = (max{X1 , . . . , Xm }, max{Xm+1 , . . . , Xn }),


S2 = (max{X1 , . . . , Xm }, min{Xm+1 , . . . , Xn }),

where 1 < m < n. Discuss the sufficiency of S1 and S2 .

3. Let (Xi , i = 1, . . . , n) be i.i.d. Uniform(α − β, α + β), where β > 0, and let


θ = (α, β).
(a) Find a minimal sufficient statistic τ for θ.
(b) Find the ML estimator θ̂ML of θ. (Hint: Graph the region for θ such that
the joint density p(x, θ) > 0.)
(c) Given the fact that τ in (a) is complete, find the UMVU estimator of α.
(Hint: Note that Eθ (X1 ) = α.)

4. Let (Xi , i = 1, . . . , n) be a random sample from a normal distribution with


mean µ and variance σ 2 . Define
X̄n = Σ_{i=1}^n Xi / n   and   Sn² = Σ_{i=1}^n (Xi − X̄n)² / (n − 1).

(a) Obtain the Cramer-Rao lower bound.
(b) See whether X n and Sn2 attain the lower bound.
(c) Show that X n and Sn2 are jointly sufficient for µ and σ 2 .
(d) Are X n and Sn2 the UMVU estimators?

5. Let X1 and X2 be independent and uniformly distributed on (θ, θ+1). Consider


the two tests with critical regions C1 and C2 given by

C1 = {(x1 , x2 )|x1 ≥ 0.95} ,


C2 = {(x1 , x2 )|x1 + x2 ≥ c} ,

to test H0 : θ = 0 versus H1 : θ = 1/2.


(a) Find the value of c so that C2 has the same size as C1 .
(b) Find and compare the powers of C1 and C2 .
(c) Show how to get a test that has the same size, but is more powerful than
C2 .

Chapter 6

Asymptotic Theory

6.1 Introduction
Let X1 , . . . , Xn be a sequence of random variables, and let β̂n = β̂(X1 , . . . , Xn ) be
an estimator for the population parameter β. For β̂n to be a good estimator, it
must be asymptotically consistent, ie, β̂n converges to β in some sense as n → ∞.
Furthermore, it is desirable to have an asymptotic distribution of β̂n , if properly
standardized. That is, there may be a sequence of number an such that an (β̂n − β)
converges in some sense to a random variable Z with a known distribution. If in
particular Z is normal (or Gaussian), we say β̂n is asymptotically normal.
Asymptotic distribution is also important for hypothesis testing. If we can show
that a test statistic has an asymptotic distribution, then we may relax assumptions
on the finite sample distribution of X1 , . . . , Xn . This would make our test more
robust to mis-specifications of the model.
We study basic asymptotic theories in this chapter. They are essential tools for
proving asymptotic consistency and deriving asymptotic distributions. In this sec-
tion we first study the convergence of a sequence of random variables. As a sequence
of measurable functions, the converging behavior of random variables is much richer
than that of real numbers.

6.1.1 Modes of Convergence

Let (Xn ) and X be random variables defined on a common probability space (Ω, F, P).

Definition 6.1.1 (a.s. Convergence) Xn converges almost surely (a.s.) to X,

written as Xn →a.s. X, if

P{ω|Xn (ω) → X(ω)} = 1.

Equivalently, the a.s. convergence can be defined as

P{ω| |Xn (ω) − X(ω)| > ϵ i.o.} = 0.

or
P{ω| |Xn (ω) − X(ω)| < ϵ e.v.} = 1.

Definition 6.1.2 (Convergence in Probability) Xn converges in probability to


X, written as Xn →p X, if

P{ω| |Xn (ω) − X(ω)| > ϵ} → 0.

Remarks:

• The convergence in probability may be equivalently defined as

P{ω| |Xn (ω) − X(ω)| ≤ ϵ} → 1.

• Most commonly, X in the definition is a degenerate random variable (or simply,


a constant).

• The definition carries over to the case where Xn is a sequence of random


vectors. In this case the distance measure | · | should be replaced by the
Euclidian norm.

Definition 6.1.3 (Lp Convergence) Xn converges in Lp to X, written as Xn →Lp


X, if
E |Xn (ω) − X(ω)|p → 0, p > 0.

In particular, if p = 2, L2 convergence is also called the mean squared error conver-


gence.

Definition 6.1.4 (Convergence in Distribution) Xn converges in distribution


to X, written as Xn →d X, if for every function f that is bounded and continuous
a.s. in PX ,
Ef (Xn ) → Ef (X).

Remarks:

• Note that for the convergence in distribution, (Xn ) and X need not be defined
on a common probability space. It is not a convergence of Xn , but that of
probability measure induced by Xn , ie, PXn (B) = P ◦ Xn (B), B ∈ B(R).

• Recall that we may also call PXn the law of Xn . Thus the convergence in distri-
bution is also called convergence in law. More technically, we may call conver-
gence in distribution as weak convergence, as opposed to strong convergence
in the set of probability measures. Strong convergence refers to convergence
in the distance metric of probability measure (e.g., total variation metric).

• In the definition of convergence in distribution, the function f need not be


continuous at every point. The requirement of a.s. continuity allows f to be
discontinuous on a set S ⊂ R that PX (S) = 0.

Without proof, we give the following three lemmas, each of which supplies an equiv-
alent definition of convergence in distribution.

Lemma 6.1.5 Let Fn and F be the distribution function of Xn and X, respectively.


Xn →d X if and only if

Fn (x) → F (x) for every continuous point x of F.

Lemma 6.1.6 Let ϕn and ϕ be the characteristic function of Xn and X, respectively.


Xn →d X if and only if
ϕn (t) → ϕ(t) for all t.

Lemma 6.1.7 Xn →d X if and only if Ef (Xn ) → Ef (X) for every bounded and
uniformly continuous function f . 1

We have

Theorem 6.1.8 Both a.s. convergence and Lp convergence imply convergence in


probability, which implies convergence in distribution.

1
A function f : D → R is uniformly continuous on D if for every ϵ > 0, there exists δ > 0 such
that |f (x1 ) − f (x2 )| < ϵ for x1 , x2 ∈ D that satisfy |x1 − x2 | < δ.

Proof: (a) To show that a.s. convergence implies convergence in probability, we let
En = {|Xn − X| > ϵ}. By Fatou’s lemma,

lim sup_{n→∞} P{En } ≤ P{lim sup_{n→∞} En } = P{En i.o.} = 0.

The conclusion follows.


(b) The fact that Lp convergence implies convergence in probability follows
from the Chebyshev inequality

P{|Xn − X| > ϵ} ≤ E|Xn − X|^p / ϵ^p.

(c) To show that convergence in probability implies convergence in distribution,


we first note that for any ϵ > 0, if X > z + ϵ and |Xn − X| < ϵ, then we must have
Xn > z. That is to say, {Xn > z} ⊃ {X > z + ϵ} ∩ {|Xn − X| < ϵ}. Taking
complements, we have

{Xn ≤ z} ⊂ {X ≤ z + ϵ} ∪ {|Xn − X| ≥ ϵ}.

Then we have

P{Xn ≤ z} ≤ P{X ≤ z + ϵ} + P{|Xn − X| ≥ ϵ}.

Since Xn →p X, lim sup P{Xn ≤ z} ≤ P{X ≤ z + ϵ}. Letting ϵ ↓ 0, we have

lim sup P{Xn ≤ z} ≤ P{X ≤ z}.

Similarly, using the fact that X < z − ϵ and |Xn − X| < ϵ imply Xn < z, we can
show that
lim inf P{Xn ≤ z} ≥ P{X < z}.
If P{X = z} = 0, then P{X ≤ z} = P{X < z}. Hence

lim sup P{Xn ≤ z} = lim inf P{Xn ≤ z} = P{X ≤ z}.

This establishes

lim_{n→∞} Fn (z) = F (z) for every continuity point z of F.

The reverse implications of the theorem do not hold in general. Moreover, a.s. con-
vergence does not imply Lp convergence, nor does the latter imply the former. Here
are a couple of counterexamples:

Counter Examples Consider the probability space ([0, 1], B([0, 1]), µ), where µ
is Lebesgue measure and B([0, 1]) is the Borel field on [0, 1]. Define Xn by

Xn (ω) = n^{1/p} I_{0≤ω≤1/n} , p > 0,

and define Yn by

Yn (ω) = I_{(b−1)/a ≤ ω ≤ b/a} , n = a(a − 1)/2 + b, 1 ≤ b ≤ a, a = 1, 2, . . . .

It can be shown that Xn → 0 a.s., but E Xn^p = 1 for all n. In contrast,
E Yn^p = 1/a → 0, but Yn (ω) does not converge for any ω ∈ [0, 1].
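The first counterexample is easy to visualize by simulation. The following sketch (Python; the seed, the sample sizes, and the choice p = 1 are illustrative assumptions of ours) draws a single ω uniformly on [0, 1] and tracks the path Xn (ω), which is eventually zero, while E Xn^p = n · (1/n) = 1 for every n.

import numpy as np

rng = np.random.default_rng(0)
p = 1.0
omega = rng.uniform(0.0, 1.0)                        # one fixed point of the sample space [0, 1]

for n in [10, 100, 1000, 10000]:
    x_n = n ** (1.0 / p) if omega <= 1.0 / n else 0.0   # X_n(omega) = n^{1/p} 1{omega <= 1/n}
    e_xnp = n * (1.0 / n)                               # E X_n^p = n * P(omega <= 1/n) = 1
    print(n, x_n, e_xnp)                                # the path hits 0 while the moment stays at 1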

It also follows from the above counterexamples that convergence in probability does
not imply a.s. convergence. If it did, we would have →Lp ⇒ →p ⇒ →a.s. , contradicting
the second counterexample. Nevertheless, we have

Theorem 6.1.9 If Xn →p X, then there exists a subsequence Xnk such that Xnk →a.s.
X.

Proof: Choose an increasing sequence (nk ) such that

P{|Xnk − X| > 2^{−k}} ≤ 2^{−k} .

Since
∑_{k=1}^∞ P{|Xnk − X| > 2^{−k}} ≤ ∑_{k=1}^∞ 2^{−k} < ∞,

the Borel-Cantelli lemma dictates that

P{|Xnk − X| > 2^{−k} i.o.} = 0.

Hence, with probability one, |Xnk − X| ≤ 2^{−k} for all but finitely many k, so that
Xnk →a.s. X.

It is clear that convergence in distribution does not imply convergence in probability,


since the former does not even require that Xn be defined on a common probability
space. However, we have

Theorem 6.1.10 Let Xn be defined on a common probability space and let c be a
constant. If Xn →d c, then Xn →p c.

Proof: Fix ϵ > 0 and let f (x) = I_{|x−c|>ϵ} . Since f is bounded and continuous a.s.
under the degenerate law at c, and Xn →d c, we have

Ef (Xn ) = P{|Xn − c| > ϵ} → f (c) = 0.

Theorem 6.1.11 Let f be a continuous function. We have,

(a) if Xn →a.s. X, then f (Xn ) →a.s. f (X),


(b) if Xn →p X, then f (Xn ) →p f (X),
(c) if Xn →d X, then f (Xn ) →d f (X). (Continuous Mapping Theorem)

Proof: (a) Omitted.


(b) Assume for simplicity that f is uniformly continuous. Then for any ϵ > 0, there
exists δ > 0 such that |x − y| ≤ δ implies |f (x) − f (y)| ≤ ϵ. So we have
{|Xn − X| ≤ δ} ⊂ {|f (Xn ) − f (X)| ≤ ϵ},
which implies, taking complements,
{|Xn − X| > δ} ⊃ {|f (Xn ) − f (X)| > ϵ}.
Hence
P{|Xn − X| > δ} ≥ P{|f (Xn ) − f (X)| > ϵ}.
The theorem follows.
(c) It suffices to show that for any bounded and continuous function g,
Eg(f (Xn )) → Eg(f (X)).
But this is guaranteed by Xn →d X, since g ◦ f is also bounded and continuous.

Using the above results, we easily obtain,

Theorem 6.1.12 (Slutsky Theorem) If Xn →d c and Yn →p Y , where c is a


constant, then

(a) Xn Yn →d cY ,
(b) Xn + Yn →d c + Y .

6.1.2 Small o and Big O Notations


We first introduce small o and big O notations for sequences of real numbers.

Definition 6.1.13 (Small o and Big O) Let (an ) and (bn ) be sequences of real
numbers. We write xn = o(an ) and yn = O(bn ), respectively, when

xn /an → 0   and   |yn /bn | < M

for some constant M > 0.

Remarks:

• In particular, if we take an = bn = 1 for all n, the sequence xn = o(1) converges
to zero and the sequence yn = O(1) is bounded.

• We may write o(an ) = an o(1) and O(bn ) = bn O(1). However, these are not
equalities in the usual sense. It is understood that o(1) = O(1) but O(1) ̸=
o(1).

• For yn = O(1), it suffices to have |yn | < M for large n. If |yn | < M
for all n > N , then we have |yn | < M ∗ for all n, where M ∗ =
max{|y1 |, |y2 |, . . . , |yN |, M }.

• O(o(1)) = o(1)
Proof: Let xn = o(1) and yn = O(xn ). It follows from |yn /xn | < M that
|yn | < M |xn | → 0.

• o(O(1)) = o(1)
Proof: Let xn = O(1) and yn = o(xn ). It follows from |xn | < M that
|yn | = |xn | · |yn |/|xn | ≤ M |yn |/|xn | → 0.

• o(1)O(1) = o(1)
Proof: Let xn = o(1) and yn = O(1). Then |xn yn | ≤ M |xn | → 0.

• In general, we have

O(o(an )) = O(an o(1)) = an O(o(1)) = an o(1) = o(an ).

In probability, we have

Definition 6.1.14 (Small op and Big Op ) Let Xn and Yn be sequences of random
variables. We say Xn = op (an ) if Xn /an →p 0, and Yn = Op (bn ) if for any ϵ > 0,
there exists M > 0 such that P(|Yn /bn | > M ) < ϵ for all n.

If we take an = bn = 1 for all n, then Xn = op (1) means Xn →p 0, and Yn = Op (1)
means that for any ϵ > 0 there exists M > 0 such that P(|Yn | > M ) < ϵ for all n. In
the latter case, we say that Yn is stochastically bounded.
In analogy with the results for real sequences, we have the following.

Lemma 6.1.15 We have

(a) Op (op (1)) = op (1),

(b) op (Op (1)) = op (1),
(c) op (1)Op (1) = op (1).

Proof: (a) Let Xn = op (1) and Yn = Op (Xn ), we show that Yn = op (1). For any
ϵ > 0, since |Yn |/|Xn | ≤ M and |Xn | ≤ M −1 ϵ imply |Yn | ≤ ϵ, we have {|Yn | ≤ ϵ} ⊃
{|Yn | ≤ |Xn |M } ∩ {|Xn | ≤ M −1 ϵ}. Taking complements, we have
{|Yn | > ϵ} ⊂ {|Yn | > |Xn |M } ∪ {|Xn | > M −1 ϵ}.
Thus
P{|Yn | > ϵ} ≤ P{|Yn |/|Xn | > M } + P{|Xn | > M −1 ϵ}.
This holds for any M > 0. We can choose M such that the first term on the right
is arbitrarily small, uniformly in n; and with M fixed, the second term goes to zero
since Xn = op (1). Thus P{|Yn | > ϵ} → 0, i.e., Yn = op (1).
(b) Let Xn = Op (1) and Yn = op (Xn ), we show that Yn = op (1). For any ϵ > 0 and
M > 0, we have
P{|Yn | > M ϵ} ≤ P{|Yn |/|Xn | > ϵ} + P{|Xn | > M }.
The first term on the right goes to zero, and the second term can be made arbitrarily
small by choosing a large M .
(c) Left for exercise.

In addition, we have

Theorem 6.1.16 If Xn →d X, then

(a) Xn = Op (1), and


(b) Xn + op (1) →d X.

Proof: (a) For any ϵ > 0, there exists a sufficiently large M such that P(|X| > M ) < ϵ,
since {|X| > M } ↓ ∅ as M ↑ ∞; we may also pick M so that P(|X| = M ) = 0. Let
f (x) = I_{|x|>M} . Since Xn →d X and f is bounded and continuous a.s. in PX , we have
E(f (Xn )) = P(|Xn | > M ) → Ef (X) = P(|X| > M ) < ϵ. Therefore, P(|Xn | > M ) < ϵ
for large n.
(b) Let Yn = op (1). And let f be any uniformly continuous and bounded function
and let M = sup |f (x)|. For any ϵ > 0, there exists a δ such that |Yn | ≤ δ implies
|f (Xn + Yn ) − f (Xn )| ≤ ϵ. Hence
|f (Xn + Yn ) − f (Xn )|
= |f (Xn + Yn ) − f (Xn )| · I|Yn |≤δ + |f (Xn + Yn ) − f (Xn )| · I|Yn |>δ
≤ ϵ + 2M I|Yn |>δ

Hence
E|f (Xn + Yn ) − f (Xn )| ≤ ϵ + 2M P{|Yn | > δ}.
Then we have
|Ef (Xn + Yn ) − Ef (X)| = |E[f (Xn + Yn ) − f (Xn ) + f (Xn ) − f (X)]|
≤ E|f (Xn + Yn ) − f (Xn )| + |Ef (Xn ) − Ef (X)|
≤ ϵ + 2M P{|Yn | > δ} + |Ef (Xn ) − Ef (X)|.
The third term goes to zero since Xn →d X, the second term goes to zero since
Yn = op (1), and ϵ > 0 is arbitrary. Hence Ef (Xn + Yn ) → Ef (X).

Corollary 6.1.17 If Xn →d X and Yn →p c, then Xn Yn →d cX.

Proof: We have
Xn Yn = Xn (c + op (1)) = cXn + Op (1)op (1) = cXn + op (1).
Then the conclusion follows from the CMT and Theorem 6.1.16(b).
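As an illustration of the corollary together with the CMT (and anticipating the CLT of Section 6.2), the familiar t-type statistic satisfies √n(X̄n − µ)/Sn →d N(0, 1), because √n(X̄n − µ) →d N(0, σ^2) while Sn →p σ. A simulation sketch follows (Python; the Exponential(1) population, the sample size, and the number of replications are illustrative assumptions of ours):

import numpy as np

rng = np.random.default_rng(0)
n, reps, mu = 200, 5000, 1.0                  # Exponential(1) has mean 1 and variance 1

t_stats = np.empty(reps)
for r in range(reps):
    x = rng.exponential(scale=1.0, size=n)
    xbar, s = x.mean(), x.std(ddof=1)         # s ->p sigma = 1
    t_stats[r] = np.sqrt(n) * (xbar - mu) / s

# If the Slutsky-type argument applies, these should be close to the N(0, 1) values 0 and 1.
print(t_stats.mean().round(3), t_stats.var().round(3))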

6.1.3 Delta Method


Let θ̂n be an estimator of the parameter θ with true value θ0 . If θ̂n is consistent,
then we may write
θ̂n = θ0 + op (1).
If, in addition, θ̂n has an asymptotic distribution with convergence rate an , then
θ̂n = θ0 + Op (1/an ).

The delta method is used to derive the asymptotic distribution of f (θ̂n ), when f is
differentiable and θ̂n is asymptotically normal,

√n (θ̂n − θ0 ) →d N (0, Σ).
Let ∆(θ) = ∂f (θ)/∂θ′ . The Taylor expansion of f (θ) around θ0 gives
f (θ̂n ) = f (θ0 ) + ∆(θ0 )(θ̂n − θ0 ) + o(∥θ̂n − θ0 ∥)
        = f (θ0 ) + ∆(θ0 )(θ̂n − θ0 ) + o(Op (1/√n))
        = f (θ0 ) + ∆(θ0 )(θ̂n − θ0 ) + op (1/√n).
This implies
√n ( f (θ̂n ) − f (θ0 ) ) = ∆(θ0 ) √n ( θ̂n − θ0 ) + op (1) →d N (0, ∆(θ0 )Σ∆(θ0 )′ ).

Example 6.1.18 Let θ = (α, β)′ and f (θ) = α/β. If √n(θ̂n − θ0 ) →d N (0, Σ), then

√n ( α̂n /β̂n − α0 /β0 ) →d N (0, ∆(θ0 )Σ∆(θ0 )′ ),   where   ∆(θ) = (∂f /∂α, ∂f /∂β) = (1/β, −α/β^2 ).
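A simulation sketch of this example (Python; the values of θ0, the matrix Σ, and the idealized assumption that θ̂n is exactly N(θ0, Σ/n) are ours, for illustration only) compares the Monte Carlo variance of √n(α̂n/β̂n − α0/β0) with the delta-method variance ∆(θ0)Σ∆(θ0)′:

import numpy as np

rng = np.random.default_rng(0)
alpha0, beta0 = 2.0, 1.5
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])                # assumed asymptotic covariance of sqrt(n)(theta_hat - theta0)
n, reps = 500, 5000

# Pretend theta_hat is exactly normal: theta_hat = theta0 + N(0, Sigma / n)
theta_hat = rng.multivariate_normal([alpha0, beta0], Sigma / n, size=reps)
ratio = theta_hat[:, 0] / theta_hat[:, 1]
mc_var = n * ratio.var()                      # Monte Carlo variance of sqrt(n)(ratio - alpha0/beta0)

Delta = np.array([1.0 / beta0, -alpha0 / beta0 ** 2])
dm_var = Delta @ Sigma @ Delta                # delta-method variance Delta Sigma Delta'
print(round(mc_var, 3), round(dm_var, 3))     # the two numbers should be close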

6.2 Limit Theorems

6.2.1 Law of Large Numbers


The law of large numbers (LLN) states that the sample average converges, in some
sense, to the population mean. In this section we state three LLN's for independent
random variables. It is more difficult to establish LLN's for sequences of dependent
random variables; intuitively, each additional observation of a dependent sequence
brings less information to the sample mean than an observation of an independent
sequence.

Theorem 6.2.1 (Weak LLN (Khinchin)) If X1 , . . . , Xn are i.i.d. with mean


µ < ∞, then
(1/n) ∑_{i=1}^n Xi →p µ.

Proof: We only prove the case when var(Xi ) < ∞. The general proof is more
involved. The theorem follows easily from
E ( (1/n) ∑_{i=1}^n Xi − µ )^2 = E ( (1/n) ∑_{i=1}^n (Xi − µ) )^2 = (1/n) E(X1 − µ)^2 → 0,

since L2 convergence implies convergence in probability.

Theorem 6.2.2 (Strong LLN) If X1 , . . . , Xn are i.i.d. with mean µ < ∞, then

(1/n) ∑_{i=1}^n Xi →a.s. µ.

Proof: Since the mean exists, we may assume µ = 0 and prove

(1/n) ∑_{i=1}^n Xi →a.s. 0.

The general proof is involved. Here we prove the case when EXi4 < ∞. We have
E ( (1/n) ∑_{i=1}^n Xi )^4 = (1/n^4) ( ∑_{i=1}^n EXi^4 + 6 ∑_{i<j} EXi^2 Xj^2 )
                           = n^{−3} EX1^4 + 3 (n(n − 1)/n^4) (EX1^2)^2
                           = O(n^{−2}).

This implies E ∑_{n=1}^∞ ( (1/n) ∑_{i=1}^n Xi )^4 < ∞, which further implies
∑_{n=1}^∞ ( (1/n) ∑_{i=1}^n Xi )^4 < ∞ a.s. Then we have

(1/n) ∑_{i=1}^n Xi →a.s. 0.

Without proof, we also give a strong LLN that only requires independence,

Theorem 6.2.3 (Kolmogorov's Strong LLN) If X1 , . . . , Xn are independent with
EXi = µi and var(Xi ) = σi^2 , and if ∑_{i=1}^∞ σi^2 /i^2 < ∞, then

(1/n) ∑_{i=1}^n Xi − (1/n) ∑_{i=1}^n µi →a.s. 0.

The first application of the LLN is in recovering the probability p of getting a head in
the coin-tossing experiment. Define Xi = 0 when we get a tail in the i-th toss and
Xi = 1 when we get a head. Then the LLN guarantees that (1/n) ∑_{i=1}^n Xi converges
to EXi = p · 1 + (1 − p) · 0 = p. This convergence to the probability is, indeed, the basis
of the "frequentist" interpretation of probability.
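A simulation sketch of this coin-tossing experiment (Python; the value p = 0.3, the seed, and the sample sizes are arbitrary illustrative choices of ours):

import numpy as np

rng = np.random.default_rng(0)
p = 0.3                                       # true probability of a head

for n in [10, 100, 10_000, 1_000_000]:
    tosses = rng.binomial(1, p, size=n)       # X_i = 1 for a head, 0 for a tail
    print(n, tosses.mean())                   # the sample frequency of heads approaches p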
Sometimes we need an LLN for measurable functions of random variables, say, g(Xi , θ),
where θ is a non-random parameter vector taking values in Θ. Uniform LLN's
establish that (1/n) ∑_{i=1}^n g(Xi , θ) converges in some sense uniformly in θ ∈ Θ. More
precisely, we have

Theorem 6.2.4 (Uniform Weak LLN) Let X1 , . . . , Xn be i.i.d., Θ be compact,
and g(x, θ) be measurable in x for every θ ∈ Θ and continuous in θ for every x. If
E sup_{θ∈Θ} |g(X1 , θ)| < ∞, then

sup_{θ∈Θ} | (1/n) ∑_{i=1}^n g(Xi , θ) − Eg(X1 , θ) | →p 0.
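A simulation sketch of this uniform convergence (Python; the choice g(x, θ) = (x − θ)^2 with X ~ N(0, 1), so that Eg(X1, θ) = 1 + θ^2, the grid approximating Θ = [−1, 1], and the sample sizes are illustrative assumptions of ours):

import numpy as np

rng = np.random.default_rng(0)
theta_grid = np.linspace(-1.0, 1.0, 201)          # a grid approximating the compact set Theta

for n in [100, 1000, 10000]:
    x = rng.standard_normal(n)
    sample_avg = ((x[:, None] - theta_grid[None, :]) ** 2).mean(axis=0)  # (1/n) sum g(X_i, theta)
    expected = 1.0 + theta_grid ** 2                                     # E g(X_1, theta)
    print(n, np.max(np.abs(sample_avg - expected)))   # sup over the grid shrinks as n grows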

6.2.2 Central Limit Theorem
The central limit theorem states that the sample average, under suitable scaling, con-
verges in distribution to a normal (Gaussian) random variable.
We consider the sequence {Xin }, i = 1, . . . , n. Note that the sequence has the double
subscript in, with n denoting the sample size and i the index within the sample. We
call such a data structure a double array. We first state without proof the celebrated

Theorem 6.2.5 (Lindeberg-Feller CLT) Let X1n , . . . , Xnn be independent with
EXin = µi and var(Xin ) = σi^2 < ∞. Define σn^2 = ∑_{i=1}^n σi^2 . If for any ϵ > 0,

(1/σn^2) ∑_{i=1}^n E(Xin − µi )^2 I_{|Xin −µi |>ϵσn} → 0,        (6.1)

then
∑_{i=1}^n (Xin − µi ) / σn →d N (0, 1).

The condition in (6.1) is called the Lindeberg condition. As it is often difficult to
check, we often use the Liapounov condition, which implies the Lindeberg condition.
The Liapounov condition states that for some δ > 0,

∑_{i=1}^n E | (Xin − µi )/σn |^{2+δ} → 0.        (6.2)

To see that Liapounov is stronger than Lindeberg, let ξin = (Xin − µi )/σn . On the set
{|ξin | > ϵ} we have ξin^2 ≤ |ξin |^{2+δ} /ϵ^δ , so that

∑_{i=1}^n E ξin^2 I_{|ξin |>ϵ} ≤ (1/ϵ^δ) ∑_{i=1}^n E|ξin |^{2+δ} → 0.

Using the Lindeberg-Feller CLT, we obtain

Theorem 6.2.6 (Lindeberg-Levy CLT) If X1 , . . . , Xn are i.i.d. with mean zero
and variance σ^2 < ∞, then

(1/√n) ∑_{i=1}^n Xi →d N (0, σ^2 ).

Proof: Let Yin = Xi /√n. Then (Yin ) is an independent double array with µi = 0,
σi^2 = σ^2 /n, and σn^2 = σ^2 . It suffices to check the Lindeberg condition:

(1/σn^2) ∑_{i=1}^n E Yin^2 I_{|Yin |>ϵσn } = (1/σ^2) E X1^2 I_{|X1 |>ϵσ√n} → 0

by the dominated convergence theorem. Note that Zn = X1^2 I_{|X1 |>ϵσ√n} ≤ X1^2 , with
EX1^2 < ∞, and Zn (ω) → 0 for all ω ∈ Ω.
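A simulation sketch of the Lindeberg-Levy CLT (Python; the Uniform(0, 1) population, which has mean 1/2 and variance 1/12, the seed, and the grid of sample sizes are illustrative choices of ours):

import numpy as np

rng = np.random.default_rng(0)
reps = 20000
sigma2 = 1.0 / 12.0                               # variance of Uniform(0, 1)

for n in [5, 50, 500]:
    x = rng.uniform(size=(reps, n))
    z = np.sqrt(n) * (x.mean(axis=1) - 0.5) / np.sqrt(sigma2)
    # Under the CLT, z is approximately N(0, 1): variance near 1, upper tail near 0.025
    print(n, z.var().round(3), np.mean(z > 1.96).round(4))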

6.3 Asymptotics for Maximum Likelihood Estimation

As an application of the asymptotic theory we have learned, we present in this section
the asymptotic properties of the maximum likelihood estimator (MLE). The tests
based on the MLE, such as the likelihood ratio (LR) test, the Wald test, and the
Lagrange multiplier (LM) test, are also discussed.
Throughout the section, we assume that X1 , . . . , Xn are i.i.d. random variables
with a common distribution that belongs to a parametric family. We assume that
each distribution in the parametric family admits a density p(x, θ) with respect to
a measure µ. Let θ0 ∈ Θ denote the true value of θ, let P0 denote the distribution with
density p(x, θ0 ), and let E0 (·) ≡ ∫ · p(x, θ0 ) dµ(x), the expectation operator with
respect to P0 .

6.3.1 Consistency of MLE


We first show that the expected log likelihood with respect to P0 is maximized at
θ0 . Let p(xi , θ) and ℓ(xi , θ) = log p(xi , θ) denote the likelihood and the log likelihood,
respectively. We consider the function of θ,

E0 ℓ(·, θ) = ∫ ℓ(x, θ) p(x, θ0 ) dµ(x).

Lemma 6.3.1 We have for all θ ∈ Θ,

E0 ℓ(·, θ0 ) ≥ E0 ℓ(·, θ).

Proof: Note that log(·) is a concave function. Hence by Jensen's inequality,

E0 ℓ(·, θ) − E0 ℓ(·, θ0 ) = E0 log [ p(·, θ)/p(·, θ0 ) ]
                          ≤ log E0 [ p(·, θ)/p(·, θ0 ) ]
                          = log ∫ [ p(x, θ)/p(x, θ0 ) ] p(x, θ0 ) dµ(x) = log 1 = 0.

Under our assumptions, the MLE of θ0 is defined by

θ̂ = argmax_{θ∈Θ} (1/n) ∑_{i=1}^n ℓ(Xi , θ).

We have

Theorem 6.3.2 (Consistency of MLE) Under certain regularity conditions, we


have
θ̂ →p θ0 .

Proof: The regularity conditions ensure that the uniform weak LLN applies to
ℓ(Xi , θ), so that

(1/n) ∑_{i=1}^n ℓ(Xi , θ) →p E0 ℓ(·, θ)

uniformly in θ ∈ Θ. Combined with Lemma 6.3.1 and an identification condition
guaranteeing that θ0 is the unique maximizer of E0 ℓ(·, θ), the conclusion then follows.

6.3.2 Asymptotic Normality of MLE


Theorem 6.3.3 Under certain regularity conditions, we have

√n (θ̂ − θ0 ) →d N (0, I(θ0 )^{−1} ),
where I(·) is the Fisher information.

Proof: The regularity conditions are to ensure:

(a) n^{−1/2} ∑_{i=1}^n s(Xi , θ0 ) →d N (0, I(θ0 )), where s(x, θ) = ∂ℓ(x, θ)/∂θ is the score.

(b) n^{−1} ∑_{i=1}^n h(Xi , θ0 ) →p E0 h(·, θ0 ) = H(θ0 ) = −I(θ0 ), where h(x, θ) = ∂^2 ℓ(x, θ)/∂θ∂θ′ is the Hessian.

(c) s̄(x, θ) ≡ n^{−1} ∑_{i=1}^n s(xi , θ) is differentiable at θ0 for all x.

(d) θ̂ = θ0 + Op (n^{−1/2} ).

By Taylor's expansion,

s̄(x, θ) = s̄(x, θ0 ) + h̄(x, θ0 )(θ − θ0 ) + o(∥θ − θ0 ∥).
We have

(1/√n) ∑_{i=1}^n s(Xi , θ̂) = (1/√n) ∑_{i=1}^n s(Xi , θ0 ) + ( (1/n) ∑_{i=1}^n h(Xi , θ0 ) ) √n (θ̂ − θ0 ) + op (1).
Since the left-hand side is zero by the first-order condition, we obtain

√n (θ̂ − θ0 ) = − ( (1/n) ∑_{i=1}^n h(Xi , θ0 ) )^{−1} (1/√n) ∑_{i=1}^n s(Xi , θ0 ) + op (1)
             →d N (0, I(θ0 )^{−1} ).
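A simulation sketch of Theorems 6.3.2 and 6.3.3 in a concrete model (Python; the Exponential(θ) family with density p(x, θ) = θ e^{−θx}, true rate θ0 = 2, and the sample sizes are illustrative assumptions of ours). For this model the MLE is θ̂ = 1/X̄n and the Fisher information is I(θ) = 1/θ^2, so √n(θ̂ − θ0) should be approximately N(0, θ0^2) = N(0, 4):

import numpy as np

rng = np.random.default_rng(0)
theta0, reps = 2.0, 10000

for n in [20, 200, 2000]:
    x = rng.exponential(scale=1.0 / theta0, size=(reps, n))
    theta_hat = 1.0 / x.mean(axis=1)              # MLE of the exponential rate
    z = np.sqrt(n) * (theta_hat - theta0)
    # I(theta0) = 1/theta0^2, so the asymptotic variance is theta0^2 = 4
    print(n, z.mean().round(3), z.var().round(3))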

6.3.3 MLE-Based Tests


Suppose θ ∈ R^m . For simplicity, let the hypotheses be

H0 : θ = θ0    versus    H1 : θ ̸= θ0 .

We consider the following three celebrated test statistics:

LR = 2 ( ∑_{i=1}^n ℓ(Xi , θ̂) − ∑_{i=1}^n ℓ(Xi , θ0 ) ),
Wald = √n (θ̂ − θ0 )′ I(θ̂) √n (θ̂ − θ0 ),
LM = ( (1/√n) ∑_{i=1}^n s(Xi , θ0 ) )′ I(θ0 )^{−1} ( (1/√n) ∑_{i=1}^n s(Xi , θ0 ) ).

The LR statistic measures the difference between the restricted and the unrestricted
log likelihoods. The Wald statistic measures the difference between the estimated
and the hypothesized values of the parameter. And the LM statistic measures the
first derivative (score) of the log likelihood at the hypothesized value of the parameter.
Intuitively, if the null hypothesis holds, all three quantities should be small.

For the Wald statistic, we may replace I(θ̂) by (1/n) ∑_{i=1}^n s(Xi , θ̂)s(Xi , θ̂)′ , by −H(θ̂), or
by −(1/n) ∑_{i=1}^n h(Xi , θ̂); the asymptotic distribution of the Wald statistic is not affected.

Theorem 6.3.4 Suppose the conditions in Theorem 6.3.3 hold. We have


LR, Wald, LM →d χ^2_m .

Proof: Using Taylor's expansion,

ℓ̄(x, θ) = ℓ̄(x, θ0 ) + s̄(x, θ0 )′ (θ − θ0 ) + (1/2)(θ − θ0 )′ h̄(x, θ0 )(θ − θ0 ) + o(∥θ − θ0 ∥^2 ),
s̄(x, θ) = s̄(x, θ0 ) + h̄(x, θ0 )(θ − θ0 ) + o(∥θ − θ0 ∥).
Plugging s̄(x, θ0 ) = s̄(x, θ) − h̄(x, θ0 )(θ − θ0 ) − o(∥θ − θ0 ∥) into the first equation above,
we obtain

ℓ̄(x, θ) = ℓ̄(x, θ0 ) + s̄(x, θ)′ (θ − θ0 ) − (1/2)(θ − θ0 )′ h̄(x, θ0 )(θ − θ0 ) + o(∥θ − θ0 ∥^2 ).
We then have

∑_{i=1}^n ℓ(Xi , θ̂) − ∑_{i=1}^n ℓ(Xi , θ0 ) = −(1/2) √n (θ̂ − θ0 )′ ( (1/n) ∑_{i=1}^n h(Xi , θ0 ) ) √n (θ̂ − θ0 ) + op (1),

since (1/n) ∑_{i=1}^n s(Xi , θ̂) = 0 by the first-order condition. The asymptotic distribution
of LR then follows.

For the Wald statistic, we have under regularity conditions that I(θ) is continuous
at θ = θ0 , so that I(θ̂) = I(θ0 ) + op (1). Then the asymptotic distribution follows
from √n (θ̂ − θ0 ) →d N (0, I(θ0 )^{−1} ).
The asymptotic distribution of the LM statistic follows from
(1/√n) ∑_{i=1}^n s(Xi , θ0 ) →d N (0, I(θ0 )).
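A sketch of the three statistics in the same Exponential(θ) model used above (Python; the data are generated under the null with θ0 = 2, and the sample size is an arbitrary choice of ours, so each statistic should behave like a draw from a χ^2 distribution with one degree of freedom):

import numpy as np

rng = np.random.default_rng(0)
theta0, n = 2.0, 200
x = rng.exponential(scale=1.0 / theta0, size=n)       # data generated under H0

theta_hat = 1.0 / x.mean()                             # MLE of the exponential rate

def loglik(th):
    return np.sum(np.log(th) - th * x)                 # l(x, theta) = log(theta) - theta x

score0 = np.sum(1.0 / theta0 - x)                      # sum of scores s(X_i, theta0)
info = lambda th: 1.0 / th ** 2                        # Fisher information I(theta) = 1/theta^2

LR = 2.0 * (loglik(theta_hat) - loglik(theta0))
Wald = n * (theta_hat - theta0) ** 2 * info(theta_hat)
LM = score0 ** 2 / (n * info(theta0))
print(round(LR, 3), round(Wald, 3), round(LM, 3))      # all three approximately chi-square(1) under H0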

6.4 Exercises
1. Suppose X1 , . . . , Xn are i.i.d. Exponential(1), and define X̄n = n^{−1} ∑_{i=1}^n Xi .
(a) Find the characteristic function of X1 .
(b) Find the characteristic function of Yn = √n(X̄n − 1).
(c) Find the limiting distribution of Yn .

2. Prove the following statements from the definition of convergence in probabil-


ity,
(a) op (1)op (1) = op (1)
(b) op (1)Op (1) = op (1).

3. Let X1 , . . . , Xn be a random sample from a N (0, σ^2 ) distribution. Let X̄n be
the sample mean and let Sn be the second sample moment ∑_{i=1}^n Xi^2 /n. Using
the asymptotic theory, find an approximation to the distribution of each of
the following statistics:
(a) Sn .
(b) log Sn .
(c) X̄n /Sn .
(d) log(1 + X̄n ).
(e) X̄n^2 /Sn .

4. A random sample of size n is drawn from a normal population with mean θ and
variance θ, i.e., the mean and variance are known to be equal but the common
value θ is not known. Let X̄n = ∑_{i=1}^n Xi /n, Sn^2 = ∑_{i=1}^n (Xi − X̄n )^2 /(n − 1), and
Tn = ∑_{i=1}^n Xi^2 /n.

(a) Calculate π = plimn→∞ Tn .
(b) Find the maximum-likelihood estimator of θ and show that it is a differ-
entiable function of Tn .
(c) Find the asymptotic distribution of Tn , i.e., find the limit distribution of
√n(Tn − π).
(d) Derive the asymptotic distribution of the ML estimator by using the delta
method.
(e) Check your answer to part (d) by using the information to calculate the
asymptotic variance of the ML estimator.
(f) Compare the asymptotic efficiencies of the ML estimator, the sample mean
X n , and the sample variance Sn2 .

References

Bierens, Herman J. (2005), Introduction to the Mathematical and Statistical Foun-
dations of Econometrics, Cambridge University Press.

Chang, Yoosoon & Park, Joon Y. (1997), Advanced Probability and Statistics for
Economists, Lecture Notes.

Dudley, R. M. (2003), Real Analysis and Probability (2nd Ed.), Cambridge Univer-
sity Press.

Rosenthal, Jeffrey S. (2006), A First Look at Rigorous Probability Theory (2nd
Ed.), World Scientific.

Williams, David (2001), Probability with Martingales, Cambridge University Press.

Su, Liangjun (2007), Advanced Mathematical Statistics (in Chinese), Peking Uni-
versity Press.

Index

L1 , 27 Chebyshev inequality, 31
Lp convergence, 78 chi-square distribution, 42
Lp norm, 32 CMT, 82
λ-system, 8 coin tossing, 2
π-system, 8 coin tossing
σ-algebra, 2 infinite, 2
σ-field, 2 complete statistic, 65
σ-field composite, 70
generated by random variable, 24 conditional density, 36
f -moment, 29 conditional distribution
multivariate normal, 49
a.e., almost everywhere, 27 conditional expectation, 33
a.s. convergence, 77 conditional probability, 4, 34
absolutely continuous, 23
consistency
algebra, 1
MLE, 90
almost sure convergence, 77
continuous function, 19
alternative hypothesis, 69
continuous mapping theorem, 82
asymptotic normality, 90
convergence in distribution, 78
Bayes formula, 4 convergence in probability, 78
Bernoulli distribution, 40 convex, 32
beta distribution, 42 correlation, 30
big O, 82, 83 countable subadditivity, 11
binomial distribution, 40 covariance, 30
Borel-Cantelli lemma, 6 covariance matrix, 30
bounded convergence theorem, 30 Cramer-Rao bound, 69
cylinder set, 22
Cauchy distribution, 43
Cauchy-Schwartz inequality, 31 delta method, 85
central limit theorem, 88 density, 24
central moment, 30 distribution, 20
change-of-variable theorem, 29 distribution
characteristic function, 39 random vector, 22
characteristic function distribution function, 20
random vector, 47 dominated convergence theorem, 28, 30

double array, 88 Jensen’s inequality, 32
Dynkin’s lemma, 9 joint distribution, 22
joint distribution function, 22
empirical distribution, 60
Erlang distribution, 42 Khinchin, 86
estimator, 56 Kolmogorov, 87
event, 2 Kolmogorov zero-one law, 8
expectation, 29
exponential distribution, 41 law of large number
exponential family, 58 Kolmogorov’s strong, 87
extension strong, 86
theorem, 11 uniform weak, 87
uniqueness, 10 weak, 86
law of random variable, 20
F test, 73 Lebesgue integral
factorial, 42 counting measure, 23
Fatou’s lemma, 28, 29 nonnegative function, 22
Fatou’s lemma simple function, 22
probability, 6 Lehmann-Scheffé theorem, 65
field, 1 Liapounov condition, 88
first order condition, 62 likelihood function, 61
Fisher Information, 67 likelihood ratio, 71
Fisher-Neyman factorization, 57 liminf, 5
limsup, 5
gamma distribution, 42 Lindeberg condition, 88
Gaussian distribution, 41 Lindeberg-Feller CLT, 88
generalized likelihood ratio, 72 Lindeberg-Levy CLT, 88
generalized method of moments, 60 LM, 91
generated sigma-field, 2 log likelihood, 61
GMM, 60 loss function, 64
LR, 91
Hessian, 66
marginal distribution, 22, 48
independence, 49 Markov’s inequality, 31
independence maximum likelihood estimator, 61
events, 4 measurable function, 17
random variables, 25 median, 40
sigma fields, 5 minimal sufficient statistic, 58
information matrix, 67 minimax estimator, 65
integrable, 27 MLE, 61
integrand, 23 moment, 30
invariance theorem, 62 moment condition, 60

moment generating function, 39, 45 sigma-algebra, 2
monotone convergence theorem, 28, 29 sigma-field, 2
monotonicity simple, 70
Lp norm, 33 simple function, 19
outer measure, 11 size, 70
probability, 3 Slutsky theorem, 82
multinomial distribution, 44 small o, 82
multivariate normal, 47 small op , 83
stable, 43
Neyman-Pearson lemma, 71 standard Cauchy, 43
normal distribution, 41 standard multivariate normal, 47
null hypothesis, 69 statistic, 56
stochastically bounded, 83
orthogonal projection, 51
Student-t test, 72
outer measure, 11
sufficient, 56
point probability mass, 21
t test, 72
Poisson distribution, 41
t test
population, 59
one-sided, 72
population moments, 59
two sided, 74
power, 70
tail field, 7
power function, 70
test statistic, 56
probability
theorem of total probability, 4
measure, 3
type-I error, 70
triple, 1
type-II error, 70
probability density function, 24
projection, 51 UMVU, 64
unbiasedness, 63
quantile, 40
uniform distribution, 40
random variable, 19 uniformly minimum variance unbiased es-
random variable timator, 64
continuous, 24 uniformly most powerful, 70
degenerate, 19
variance, 30
discrete, 24
random vector, 21 Wald, 91
Rao-Blackwell theorem, 65
reverse Fatou’s lemma, 28, 30
Riemann integral, 23
risk function, 64

sample moment, 60
score function, 62
