Book
Junhui Qian
© September 28, 2020
Preface
This booklet introduces advanced probability and statistics to first-year Ph.D. stu-
dents in economics.
In preparation of this text, I borrow heavily from the lecture notes of Yoosoon Chang
and Joon Y. Park, who taught me econometrics at Rice University. All errors are
mine.
Contents
Preface i
1 Introduction to Probability 1
1.1 Probability Triple . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Conditional Probability and Independence . . . . . . . . . . . . . . . 4
1.3 Limits of Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Construction of Probability Measure . . . . . . . . . . . . . . . . . . 8
1.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2 Random Variable 17
2.1 Measurable Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Random Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Density . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3 Expectations 27
3.1 Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Moment Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4 Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5 Conditional Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5 Introduction to Statistics 59
5.1 General Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.2 Statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.3 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.3.1 Method of Moment . . . . . . . . . . . . . . . . . . . . . . . . 63
5.3.2 Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . 65
5.3.3 Unbiasedness and Efficiency . . . . . . . . . . . . . . . . . . . 67
5.3.4 Lehmann-Scheffé Theorem . . . . . . . . . . . . . . . . . . . . 68
5.3.5 Efficiency Bound . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.4 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.4.1 Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.4.2 Likelihood Ratio Tests . . . . . . . . . . . . . . . . . . . . . . 75
5.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6 Asymptotic Theory 83
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.1.1 Modes of Convergence . . . . . . . . . . . . . . . . . . . . . . 83
6.1.2 Small o and Big O Notations . . . . . . . . . . . . . . . . . . 88
6.2 Limit Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.2.1 Law of Large Numbers . . . . . . . . . . . . . . . . . . . . . . 91
6.2.2 Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . 93
6.2.3 Delta Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.3 Asymptotics for Maximum Likelihood Estimation . . . . . . . . . . . 95
6.3.1 Consistency of MLE . . . . . . . . . . . . . . . . . . . . . . . 96
6.3.2 Asymptotic Normality of MLE . . . . . . . . . . . . . . . . . 97
6.3.3 MLE-Based Tests . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
References 101
Chapter 1
Introduction to Probability
Definition 1.1.1 (field (or algebra)) A collection F of subsets of Ω is called a field (or an algebra) if the following holds.
(a) Ω ∈ F
(b) E ∈ F ⇒ E^c ∈ F
(c) E1, ..., Em ∈ F ⇒ ∪_{n=1}^m En ∈ F
Note that (c) says that a field is closed under finite union. In contrast, a sigma-field,
which is defined as follows, is closed under countable union.
1
Definition 1.1.2 (sigma-field (or sigma-algebra)) A collection of subsets F is
called a σ-field or a σ-algebra, if the following holds.
(a) Ω ∈ F
(b) E ∈ F ⇒ E c ∈ F
(c) E1, E2, ... ∈ F ⇒ ∪_{n=1}^∞ En ∈ F
Remarks:
• In both definitions, (a) and (b) imply that the empty set ∅ ∈ F
Example 1.1.3 If we toss a coin twice, then the sample space would be Ω =
{HH, HT, TH, TT}. A σ-field (or field) would be the collection of all subsets of Ω.
The event {HH} would be described as “two heads in a row”. The event {HT, T T }
would be described as “the second throw obtains tail”.
F in the above example contains all subsets of Ω. It is often called the power set of
Ω, denoted by 2^Ω.
Example 1.1.4 For an example of infinite sample space, we may consider a thought
experiment of tossing a coin for infinitely many times. The sample space would be
Ω = {(r1 , r2 , . . . , )|ri = 1 or 0}, where 1 stands for head and 0 stands for tail. One
example of an event would be {r1 = 1, r2 = 1}, which says that the first two throws
give heads in a row.
2
A sigma-field can be generated from a collection of subsets of Ω, a field for example.
We define
(1) P(E) ≥ 0 ∀E ∈ F
(2) P(Ω) = 1
(3) If E1, E2, ... ∈ F are disjoint, then P(∪_n En) = Σ_n P(En).
(a) P(∅) = 0
(d) Write A∪B = (A∩B c )∪(A∩B)∪(Ac ∩B), a union of disjoint sets. By adding
and subtracting P(A ∩ B), we have P(A ∪ B) = P(A) + P(B) − P(A ∩ B), using
the fact that A = (A ∩ B) ∪ (A ∩ B c ), also a disjoint union.
(e) Define B1 = A1 and Bn = An ∩ A_{n−1}^c for n ≥ 2. We have An = ∪_{j=1}^n Bj and ∪_{j=1}^∞ Aj = ∪_{j=1}^∞ Bj. Then it follows from

P(An) = Σ_{j=1}^n P(Bj) = Σ_{j=1}^∞ P(Bj) − Σ_{j=n+1}^∞ P(Bj) = P(∪_{n=1}^∞ An) − Σ_{j=n+1}^∞ P(Bj),

that P(An) ↑ P(∪_{n=1}^∞ An), since the tail sum Σ_{j=n+1}^∞ P(Bj) goes to zero as n → ∞.
P(E|F) = P(E ∩ F) / P(F).
• For a fixed event F , the function Q(·) = P (·|F ) is a probability. All properties
of probability measure hold for Q.
P (E ∩ F ) = P (E|F ) P (F ) ,
and
P (E ∩ F ∩ G) = P (E|F ∩ G) P (F |G) P (G) .
• The Bayes Formula follows from P(E ∩ F) = P(E|F)P(F) = P(F|E)P(E):

P(F|E) = P(E|F)P(F) / P(E),
and
P(Fk|E) = P(E|Fk)P(Fk) / Σ_n P(E|Fn)P(Fn).
It is obvious that lim inf xn ≤ lim sup xn . And we say that xn → x ∈ [−∞, ∞] if
lim sup xn = lim inf xn = x.
Definition 1.3.1 (limsup of Events) For a sequence of events (En), we define

lim sup_{n→∞} En = ∩_{k=1}^∞ ∪_{n=k}^∞ En
                 = {ω | ∀k, ∃n(ω) ≥ k s.t. ω ∈ En}
                 = {ω | ω ∈ En for infinitely many n}
                 = {ω | En i.o.}.
We may intuitively interpret lim supn→∞ En as the event that En occurs infinitely
often.
It is obvious that (lim inf En)^c = lim sup En^c and (lim sup En)^c = lim inf En^c. When lim sup En = lim inf En, we say that (En) has a limit lim En.
P(lim inf En ) ≤ lim inf P(En ) ≤ lim sup P(En ) ≤ P(lim sup En ).
(ii) if Σ_{n=1}^∞ P(En) = ∞, and if {En} are independent, then P(lim sup En) = 1.

as m → ∞. Since P(∪_k ∩_{n≥k} En^c) ≤ Σ_k P(∩_{n≥k} En^c) = 0, we have P(lim sup En) = 1 − P(∪_{k≥1} ∩_{n≥k} En^c) = 1.
Remarks:
• (ii) does not hold if {En } are not independent. To give a counter example,
consider infinite coin tossing. Let E1 = E2 = · · · = {r1 = 1}, the events
that the first coin is head, then {En } is not independent and P (lim sup En ) =
P (r1 = 1) = 1/2.
• Let Hn bePthe event that the n-th tossing comes up head. We have P (Hn ) =
c
1/2 and n P (Hn ) = ∞. Hence P (Hn i.o.) = 1, and P (Hn e.v.) = 1 −
P (Hn i.o.) = 0.
• Let Bn = Hn ∩ H_{n+1}; we also have P(Bn i.o.) = 1. To show this, consider the subsequence (B_{2k}), whose elements are independent.
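A hedged numerical illustration of the two directions of the Borel–Cantelli lemma may help here; it is not part of the original text, and it assumes numpy is available. It simulates events En = {Un ≤ n^{−2}}, whose probabilities are summable (so only finitely many occur), alongside fair-coin heads Hn, whose probabilities are not summable and which are independent (so they keep occurring).

```python
# Hypothetical illustration of the Borel-Cantelli lemma (a sketch, not the author's code).
# Events En = {Un <= 1/n^2} have summable probabilities: only finitely many occur.
# Events Hn = {n-th toss is a head} have sum of probabilities = infinity and are
# independent: heads keep occurring along the whole simulated sequence.
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

u = rng.uniform(size=N)
n = np.arange(1, N + 1)
summable_events = u <= 1.0 / n**2          # En with sum P(En) < infinity
coin_heads = rng.uniform(size=N) < 0.5     # Hn with sum P(Hn) = infinity

print("last index at which an En occurs:", np.max(np.nonzero(summable_events)[0]) + 1)
print("total number of En that occur:   ", summable_events.sum())
print("heads among the last 1000 tosses:", coin_heads[-1000:].sum())
```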
Why σ-field? You may already see that events such as lim sup En and lim inf En
are very interesting events. To make meaningful probabilistic statements about
these events, we need to make sure that they are contained in F, on which P is
defined. This is why we require F to be a σ-field, which is closed under countable unions and intersections.
Definition 1.3.5 (Tail Fields) For a sequence of events E1, E2, ..., the tail field is given by

T = ∩_{n=1}^∞ σ(En, E_{n+1}, ...).
(a) Ω ∈ L
(b) If E, F ∈ L, and E ⊂ F, then F − E ∈ L,
(c) If E1, E2, ... ∈ L and En ↑ E, then E ∈ L.
• L is closed under countable union only for monotone increasing events. Note that E = ∪_{n=1}^∞ En.
Proof: "only if" is trivial. To show "if", it suffices to show that for any E1, E2, ... ∈ F, ∪_n En ∈ F. We indeed have:

(∩_{k=1}^n Ek^c)^c = ∪_{k=1}^n Ek ↑ ∪_n En.
DC = {B ∈ λ(P)|B ∩ C ∈ λ(P) } .
– Ω ∈ DC
– If B1 , B2 ∈ DC and B1 ⊂ B2 , then (B2 −B1 )∩C = B2 ∩C −B1 ∩C. Since
B1 ∩ C, B2 ∩ C ∈ λ(P) and (B1 ∩ C) ⊂ (B2 ∩ C), (B2 − B1 ) ∩ C ∈ λ(P).
Hence (B2 − B1 ) ∈ DC .
– If B1 , B2 , . . . ∈ DC , and Bn ↑ B, then (Bn ∩ C) ↑ (B ∩ C) ∈ λ(P).
Hence B ∈ DC .
• Thus, for any C ∈ P, DC is a λ-system containing P. And it is obvious that
λ(P) ⊂ DC .
• Now for any A ∈ λ(P) ⊂ DC , we define
DA = {B ∈ λ(P)|B ∩ A ∈ λ(P)} .
By definition, DA ⊂ λ(P).
• We have P ⊂ DA , since if E ∈ P, then E ∩ A ∈ λ(P), since A ∈ λ(P) ⊂ DC
for all C ∈ P.
• We can check that DA is a λ-system that contains P, hence λ(P) ⊂ DA . We
thus have DA = λ(P), which means that for any A, B ∈ λ(P), A ∩ B ∈ λ(P).
Thus λ(P) is a π-system. Q.E.D.
• Ω ∈ D,
• E, F ∈ D and E ⊂ F imply F − E ∈ D, since
P1 (F − E) = P1 (F ) − P1 (E) = P2 (F ) − P2 (E) = P2 (F − E).
The fact that P1 and P2 agree on P implies that P ⊂ D. The remark following
Dynkin’s lemma shows that σ(P) ⊂ D. On the other hand, by definition, D ⊂ σ(P).
Hence D = σ(P). Q.E.D.
Borel σ-field The Borel σ-field is the σ-field generated by the family of open subsets (of a topological space). For probability theory, the most important Borel σ-field is the one generated by the open subsets of the real line R, which we denote by B(R).
Almost every subset of R that we can think of is in B(R), the elements of which may
be quite complicated. As it is difficult for economic agents to assign probabilities to
complicated sets, we often have to consider “simpler” systems of sets, π-system, for
example.
Define
P = {(−∞, x] | x ∈ R}.
It can be easily verified that P is a π-system. And we show in the following that P
generates B(R).
Since (−∞, x] = ∩_n (−∞, x + 1/n) is a countable intersection of open sets, each element of P is a Borel set, so that σ(P) ⊂ B(R). To show σ(P) ⊃ B(R), note that every open set of R is a countable union of open intervals. It therefore suffices to show that the open intervals of the form (a, b) are in σ(P). This is indeed the case, since

(a, b) = (−∞, a]^c ∩ (∪_n (−∞, b − 1/n]).
Note that the above holds even when b ≤ a, in which case (a, b) = ∅.
P = P0 on F0 .
(b) P is a probability measure on (Ω, M), where M is a σ-field of P-measurable
sets in F.
(c) F0 ⊂ M
(d) P = P0 on F0 .
(a) We first define outer measure. A set function µ on (Ω, F) is an outer measure
if
(i) µ(∅) = 0.
(ii) E ⊂ F implies µ(E) ≤ µ(F ). (monotonicity)
(iii) µ(∪_n En) ≤ Σ_n µ(En), where E1, E2, ... ∈ F. (countable subadditivity)
M contains sets that “split” every set E ⊂ Ω well. We call these sets P-
measurable. M has an equivalent definition,
• Lemma 1. If A1, A2, ... ∈ M are disjoint, then P(∪_n An) = Σ_n P(An).
Proof: First note that
P (A1 ∪ A2 ) = P (A1 ∩ (A1 ∪ A2 )) + P (Ac1 ∩ (A1 ∪ A2 )) = P (A1 ) + P (A2 ) .
Induction thus obtains finite additivity. Now for any m ∈ N, we have by
monotonicity,
Σ_{n≤m} P(An) = P(∪_{n≤m} An) ≤ P(∪_n An).

Since m is arbitrarily chosen, we have Σ_n P(An) ≤ P(∪_n An). Combining
this with subadditivity, we obtain Lemma 1. Next we prove that M is a field.
• Lemma 2. M is a field on Ω.
Proof: It is trivial that Ω ∈ M and that A ∈ M ⇒ Ac ∈ M. It remains to
prove that A, B ∈ M ⇒ A ∩ B ∈ M. We first write,
(A ∩ B)c = (Ac ∩ B) ∪ (A ∩ B c ) ∪ (Ac ∩ B c ) .
Then
P ((A ∩ B) ∩ E) + P ((A ∩ B)c ∩ E)
= P (A ∩ B ∩ E) + P {[(Ac ∩ B) ∩ E] ∪ [(A ∩ B c ) ∩ E] ∪ [(Ac ∩ B c ) ∩ E]}
≤ P (A ∩ (B ∩ E)) + P (Ac ∩ (B ∩ E)) + P (A ∩ (B c ∩ E)) + P (Ac ∩ (B c ∩ E))
= P (B ∩ E) + P (B c ∩ E) = P (E) .
Using the second definition of M, we have A ∩ B ∈ M. Hence M is a field.
Next we establish that M is a σ-field. To show this we only need to show that M is closed under countable union. We first prove two technical lemmas.
• Lemma 3. Let A1, A2, ... ∈ M be disjoint. For each m ∈ N, let Bm = ∪_{n≤m} An. Then for all m and E ⊂ Ω, we have

P(E ∩ Bm) = Σ_{n≤m} P(E ∩ An).
Proof: We prove by induction. First, note that the lemma holds trivially when m = 1. Now suppose it holds for some m; we show that P(E ∩ B_{m+1}) = Σ_{n≤m+1} P(E ∩ An). Note that Bm ∩ B_{m+1} = Bm and Bm^c ∩ B_{m+1} = A_{m+1}. So

P(E ∩ B_{m+1}) = P(Bm ∩ E ∩ B_{m+1}) + P(Bm^c ∩ E ∩ B_{m+1})
             = P(E ∩ Bm) + P(E ∩ A_{m+1})
             = Σ_{n≤m+1} P(E ∩ An).
• Lemma 4. Let A1, A2, ... ∈ M be disjoint. Then ∪_n An ∈ M.
Proof: For any m ∈ N, we have

P(E) = P(E ∩ Bm) + P(E ∩ Bm^c)
     = Σ_{n≤m} P(E ∩ An) + P(E ∩ Bm^c)
     ≥ Σ_{n≤m} P(E ∩ An) + P(E ∩ (∪_n An)^c),

since (∪_n An)^c ⊂ Bm^c. Since m is arbitrary, we have

P(E) ≥ Σ_n P(E ∩ An) + P(E ∩ (∪_n An)^c)
     ≥ P(E ∩ (∪_n An)) + P(E ∩ (∪_n An)^c).

Hence ∪_n An ∈ M. Now we are ready to prove:
(d) Finally, we prove that P = P0 on F0.
Proof: Let E ∈ F0. It is obvious from the definition of P that P(E) ≤ P0(E). Let A1, A2, ... ∈ F0 and E ⊂ ∪_n An. Define a disjoint sequence of subsets {Bn} such that B1 = A1 and Bi = Ai ∩ A1^c ∩ A2^c ∩ · · · ∩ A_{i−1}^c for i ≥ 2. We have Bn ⊂ An for all n and ∪_n An = ∪_n Bn. Using countable additivity of P0,

P0(E) = P0(E ∩ (∪_n Bn)) = Σ_n P0(E ∩ Bn).

Hence
P0(E) ≤ Σ_n P0(Bn) ≤ Σ_n P0(An).
1.5 Exercises
1. Prove that an arbitrary intersection of σ-fields is a σ-field.
2. Show that
lim_{n→∞} [−1/n, 1 − 1/n] = [0, 1).

3. Define, for n = 1, 2, ...,
En = (−1/n, 1/2 − 1/n) if n is odd,
En = (1/3 − 1/n, 2/3 + 1/n) if n is even.
Find lim inf En and lim sup En . Let the probability P be given by the Lebesgue
measure on the unit interval [0, 1] (that is, the length of interval). Compare
P(lim inf En ), lim inf P(En ), P(lim sup En ), and lim sup P(En ).
Chapter 2
Random Variable
Remarks:
• Given a σ-field G, the inverse of a measurable function maps a Borel set into an element of G. Suppose we have two σ-fields G1 and G2 with G1 ⊂ G2. If a function is G1-measurable, then it is also G2-measurable. The converse is not true.
• The mapping f^{−1} preserves all set operations:

f^{−1}(∪_n An) = ∪_n f^{−1}(An), f^{−1}(A^c) = (f^{−1}(A))^c, etc.
Properties:
(h) If {fn } are measurable, then {lim fn exists in R} ∈ G.
Proof: Note that the set on which the limit exists is
{f ≤ c} = ∩_{m≥1} ∪_k ∩_{n≥k} {fn ≤ c + 1/m}.
(j) A simple function f, which takes the form f(s) = Σ_{i=1}^n ci I_{Ai}(s), where (Ai ∈ G) are disjoint and (ci) are constants, is measurable.
Proof: Use (d) and (e) and the fact that indicator functions are measurable.
Remarks:
Example 2.2.2 For the coin tossing experiments, we may define a random variable
by X(H) = 1 and X(T ) = 0, where H and T are the outcomes of the experiment,
ie, head and tail, respectively. If we toss the coin n times, X̄n = n^{−1} Σ_{i=1}^n Xi is also a random variable. As n → ∞, X̄n becomes a degenerate random variable, as we know by the law of large numbers. lim X̄n is still a random variable, since the following event is in F:

{ number of heads / number of tosses → 1/2 } = {lim sup X̄n = 1/2} ∩ {lim inf X̄n = 1/2}.
Definition 2.2.4 (Distribution Function) The distribution function FX of a ran-
dom variable is defined by
FX (x) = PX {(−∞, x]} for all x ∈ R.
We may omit the subscript of FX for simplicity. Note that since {(−∞, x], x ∈ R}
is a π-system that generates B(R), F uniquely determines P .
Properties:
Remark: If P ({x}) = 0, we say that P does not have point probability mass at x,
in which case F is also left-continuous. For any sequence {xn } such that xn ↑ x, we
have
F (xn ) = P ((−∞, xn ]) → P ((−∞, x)) = F (x) − P ({x}) = F (x).
Example 2.3.1 Consider tossing the coin twice. Let X1 be a random variable that
takes 1 if the first toss gives Head and 0 otherwise, and let X2 be a random variable
that takes 1 if the second toss gives Head and 0 otherwise. Then the random vector
X = (X1 , X2 )0 is a function from Ω = {HH, HT, T H, T T } to R2 :
X(HH) = (1, 1)′, X(HT) = (1, 0)′, X(TH) = (0, 1)′, X(TT) = (0, 0)′.
Definition 2.3.2 (Distribution of Random Vector) The distribution of an n-
dimensional random vector X = (X1 , . . . , Xn )0 is a probability measure on Rn ,
PX (A) = PZ (A × Rn ) = P{ω|Z(ω) ∈ A × Rn },
2.4 Density
Let µ be a measure on (S, G). A measure is a countably additive1 function from a
σ-field (e.g., G) to [0, ∞). A classic example of measure is the length of intervals.
Equipped with the measure µ, we now have a measure space (S, G, µ). On (S, G, µ),
a statement holds almost everywhere (a.e.) if the set A ∈ G on which the statement
is false is µ-null (µ(A) = 0). The probability triple (Ω, F, P) is of course a special
measure space. A statement on (Ω, F, P) holds almost surely (a.s.) if the event
E ∈ F in which the statement is false has zero probability (P(E) = 0).
We first introduce a more general concept of density in the Lebesgue integration theory. Let fn be a simple function of the form fn(s) = Σ_{k=1}^n ck I_{Ak}(s), where (Ak ∈ G) are disjoint and (ck) are real nonnegative constants. We have
¹µ is countably additive if whenever {Ak} are disjoint, µ(∪_{k≥1} Ak) = Σ_{k≥1} µ(Ak).
Definition 2.4.1 The Lebesgue integral of a simple function fn with respect to µ is defined by

∫ fn dµ = Σ_{k=1}^n ck µ(Ak).
In words, the Lebesgue integral of a general function f is the sup of the integrals of
simple functions that are below f . For example, we may choose fn = αn ◦ f , where
αn(x) = 0 if f(x) = 0,
      = 2^{−n}(k − 1) if 2^{−n}(k − 1) < f(x) ≤ 2^{−n}k, for k = 1, ..., n2^n,
      = n if f(x) > n.
Then we have fn ↑ f. For a general function f, we write
f(x) = f⁺(x) − f⁻(x),
where f⁺ = max(f, 0) and f⁻ = max(−f, 0). The Lebesgue integral of f is now defined by

∫ f dµ = ∫ f⁺ dµ − ∫ f⁻ dµ.
Remarks:
• We may also write the integral ∫ f dµ as µ(f) or ∫ f(x)µ(dx).
• The summation Σ_n cn is a special case of the Lebesgue integral, taken with respect to the counting measure. The counting measure on R assigns measure 1 to each point in N, the set of natural numbers.
• The Lebesgue integral generalizes the Riemann integral. It exists and coincides with the Riemann integral whenever the latter exists.
• If the measure µ in (2.1) is the Lebesgue measure, which gives the length
of intervals, the function pX is conventionally called the probability density
function of X. If such a pdf exists, we say that X is a continuous random
variable.
2.5 Independence
The independence of random variables is defined in terms of σ-fields they generate.
We first define
• σ(X) may be understood as the set of information that the random variable X contains about the state of the world. Put differently, σ(X) is the collection of events E such that, for a given outcome, we can tell whether E has happened based on the observation of X.
Let p(x_{ik}) be the Radon-Nikodym density of the distribution of X_{ik} with respect to the Lebesgue or the counting measure. And let, with some abuse of notation, p(x_{i1}, ..., x_{in}) be the Radon-Nikodym density of the distribution of (X_{i1}, ..., X_{in}) with respect to the product of the measures with respect to which the marginal densities p(x_{i1}), ..., p(x_{in}) are defined. The density p may be a pdf or discrete probabilities, depending on whether the corresponding random variable is continuous or discrete. We have the following theorem.
Theorem 2.5.3 The random variables X1 , X2 , . . . are independent if and only if for
any (i1 , . . . , in ),
p(x_{i1}, ..., x_{in}) = ∏_{k=1}^n p(x_{ik})
Proof: It suffices to prove the case of two random variables. Let Z = (X, Y)′ be a two-dimensional random vector, and let µ(dx) and µ(dy) be the measures with respect to which p(x) and p(y) are defined. The joint density p(x, y) is then defined with respect to the measure µ(dx)µ(dy) on R². For any A, B ∈ B, we have
2.6 Exercises
1. Verify that PX (·) = P (X −1 (·)) is a probability measure on B(R).
2. Let E and F be two events with probabilities P(E) = 1/2, P(F ) = 2/3 and
P(E ∩ F ) = 1/3. Define random variables X = I(E) and Y = I(F ). Find the
joint distribution of X and Y . Also, obtain the conditional distribution of X
given Y .
3. For a random variable X with density
p(x) = (x²/18) I{−3 < x < 3},
compute P{ω | |X(ω)| < 1}.
Chapter 3
Expectations
3.1 Integration
Expectation is integration. Before studying expectation, therefore, we first dig
deeper into the theory of integration.
Properties of Integration
Note that the monotone convergence of probability is implied by the monotone
convergence theorem. Take fn = IAn and f = IA , where An is a monotone increasing
sequence of sets in G that converge to A, and let µ = P be a probability measure.
Then µ(fn ) = P(An ) ↑ P(A) = µ(f ).
Proof: Note that inf_{n≥k} fn is monotone increasing and inf_{n≥k} fn ↑ lim inf fn. In addition, since fk ≥ inf_{n≥k} fn for all k, we have µ(fk) ≥ µ(inf_{n≥k} fn) ↑ µ(lim inf fn) by the Monotone Convergence Theorem. Taking lim inf on both sides then gives lim inf_k µ(fk) ≥ µ(lim inf fn).
The theorem can be extended to the case where fn →a.e. f only. The condition of the
existence of a dominating function g can also be relaxed to the uniform integrability
of fn .
3.2 Expectation
Now we have
EX is also called the mean of X, and Ef (X) can be called the f -moment of X.
Proof: First consider indicator functions of the form f (X) = IA (X), where A ∈ B.
We have f(X)(ω) = I_A ∘ X(ω) = I_{X^{−1}(A)}(ω). Then
And we have
PX(A) = ∫ I_A dPX = ∫ f dPX and PX(A) = ∫ I_A pX dµ = ∫ f pX dµ.
Hence the theorem holds for indicator functions. Similarly we can show that it is
true for simple functions. For a general nonnegative function f , we can choose a
sequence of simple functions (fn ) such that fn ↑ f . The monotone convergence
theorem is then applied to obtain the same result. For general functions, note that
f = f + − f −.
• (Monotone Convergence Theorem) If 0 ≤ Xn ↑ X a.s., then E(Xn ) ↑ E(X).
• (Fatou’s Lemma) If Xn ≥ 0 a.s. for all n, then E(lim inf Xn ) ≤ lim inf E(Xn ).
E(|Xn − X|) → 0.
where µX and µY are the means of X and Y, respectively. cov(X, X) is of course the variance of X. Letting σX² and σY² denote the variances of X and Y, respectively, we define the correlation of X and Y by

ρ_{X,Y} = cov(X, Y) / (σX σY).
For a random vector X = (X1 , . . . , Xn )0 , the second moment is given by EXX 0 ,
a symmetric matrix. Let µ = EX, then ΣX = E(X − µ)(X − µ)0 is called the
variance-covariance matrix, or simply the covariance matrix. If Y = AX, where
A is a conformable constant matrix, then ΣY = AΣX A′. This relation reduces to σY² = a²σX² if X and Y are scalar random variables and Y = aX, where a is a constant.
The moments of a random variable X contain the same information as the distribution (or the law) does. We have
Theorem 3.3.1 Let X and Y be two random variables (possibly defined on different
probability spaces). Then PX = PY if and only if Ef (X) = Ef (Y ) for all Borel
functions whenever the expectation is finite.
Theorem 3.3.2 (Chebyshev Inequality) P{|X| ≥ ε} ≤ E|X|^k / ε^k, for any ε > 0 and k > 0.
Remarks:
• There is also an exponential form of the Markov Inequality: for any t > 0,
P{X ≥ ε} ≤ e^{−tε} E exp(tX).
Remarks:
cov(X, Y )2 ≤ var(X)var(Y ).
It follows that
Ef (X) ≥ E`(X) = `(EX) = f (EX).
Remarks:
Definition 3.3.5 (Lp Norm) Let 1 ≤ p < ∞. The Lp norm of a random variable
X is defined by
‖X‖_p ≡ (E|X|^p)^{1/p}.
The L∞ norm, ‖X‖_∞, may be interpreted as the lowest upper bound for |X| that holds almost surely. (L^p(Ω, F, P), ‖·‖_p) with 1 ≤ p ≤ ∞ is a complete normed (Banach) space of random variables. In particular, when p = 2, if we define the inner product
⟨X, Y⟩ = EXY,
(L²(Ω, F, P), ⟨·, ·⟩) is a complete inner product (Hilbert) space.
In particular, if G = σ(Y ), where Y is a random variable, we write E(X|σ(Y ))
simply as E(X|Y ).
The conditional expectation is a local average. To see this, let {Fk } be a partition
of Ω with P(Fk ) > 0 for all k. Let G = σ({Fk }). Note that we can write
E(X|G) = Σ_k ck I_{Fk},
which obtains
ck = ∫_{Fk} X dP / P(Fk).
The conditional expectation E(X|G) may be viewed as a random variable that takes values that are local averages of X over the cells of the partition that generates G. If G1 ⊂ G, G is said to be "finer" than G1. In other words, E(X|G) is more "random" than E(X|G1), since the former can take more values. The following example gives two extreme cases.
Example 3.4.2 If G = {∅, Ω}, then E(X|G) = EX, which is a degenerate random
variable. If G = F, then E(X|G) = X.
Example 3.4.3 Let E and F be two events that satisfy P(E) = P(F ) = 1/2 and
P(E ∩ F ) = 1/3. E and F are obviously not independent. We define two random
variables, X = IE and Y = IF . It is obvious that {F, F c } is a partition of Ω and
σ({F, F c }) = σ(Y ) = {∅, Ω, F, F c }. The conditional expectation of E(X|Y ) may be
written as
E(X|Y) = c₁* I_F + c₂* I_{F^c},
where c₁* = P(F)^{−1} ∫_F X dP = P(F)^{−1} P(F ∩ E) = 2/3, and c₂* = P(F^c)^{−1} ∫_{F^c} X dP = P(F^c)^{−1} P(F^c ∩ E) = 1/3.
Definition 3.4.4 (Conditional Probability) The conditional probability may be
defined as a random variable P(E|G) such that
∫_A P(E|G) dP = P(A ∩ E).
Check that the conditional probability behaves like ordinary probabilities, in that
it satisfies the axioms of the probability, at least in a.s. sense.
Properties:
Hence the statement holds for X = IF . For general random variables, use
linearity and monotone convergence theorem.
• Using the above two results, it is trivial to show that X and Y are independent
if and only if Ef (X)g(Y ) = Ef (X)Eg(Y ) for all Borel functions f and g.
Conditional Expectation as Projection The last property implies that
E [E(X|G)|G] = E(X|G),
which suggests that the conditional expectation is a projection operator, projecting
H = L2 (Ω, F, P) is a Hilbert space with inner product defined by hX, Y i = EXY ,
where X, Y ∈ L2 . Consider a subspace H0 = L2 (Ω, G, P), where G ⊂ F. The
projection theorem in functional analysis guarantees that for any random variable
X ∈ H, there exists a G-measurable random variable Y such that
E(X − Y )W = 0 for all W ∈ H0 . (3.3)
Y is called the (orthogonal) projection of X on H0. Writing W = I_A for any A ∈ G, equation (3.3) implies that

∫_A X dP = ∫_A Y dP for all A ∈ G.
Proof: We have
E(Y − φ(X))² = E([Y − E(Y|X)] + [E(Y|X) − φ(X)])²
            = E[Y − E(Y|X)]² + E[E(Y|X) − φ(X)]²
            ≥ E[Y − E(Y|X)]²,
since the cross term vanishes by the law of iterated expectations.
Hence the conditional expectation is the best predictor in the sense of minimizing
mean squared forecast error (MSFE). This fact is the basis of regression analysis
and time series forecasting.
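The following is a minimal numerical sketch of this point, not taken from the text; it assumes numpy and a simple data-generating process of our choosing, Y = X² + e with e independent of X, so that E(Y|X) = X².

```python
# Hedged illustration: E(Y|X) minimizes mean squared forecast error among predictors of Y.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
x = rng.normal(size=n)
y = x**2 + rng.normal(size=n)            # E(Y|X) = X^2, forecast-error variance = 1

mse_conditional_mean = np.mean((y - x**2) ** 2)   # MSFE of the conditional expectation
mse_unconditional_mean = np.mean((y - y.mean()) ** 2)  # MSFE of a cruder predictor

print(mse_conditional_mean)    # close to 1
print(mse_unconditional_mean)  # close to var(Y) = 3
```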
3.5 Conditional Distribution
Suppose that X and Y are two random variables with joint density p(x, y).
To show this, first note that for all F ∈ σ(Y ), there exists A ∈ B such that F =
Y −1 (A). We now have
∫_F g(Y) dP = ∫_A g(y) p(y) µ(dy)
           = ∫_A ( ∫ x p(x|y) µ(dx) ) p(y) µ(dy)
           = ∫_{R×A} x p(x, y) µ(dx) µ(dy)
           = ∫_F X dP
           = ∫_F E(X|Y) dP.
E(X|Y = y) = ∫_0^1 x p(x|y) dx = ∫_0^1 x (x + y)/(1/2 + y) dx = (1/3 + y/2)/(1/2 + y).
3.6 Exercises
1. Let the sample space Ω = R and the probability P on Ω be given by
P{1/3} = 1/3 and P{2/3} = 2/3.
Define a sequence of random variables by
Xn = (3 − 1/n) I(An) and X = 3 I(lim_{n→∞} An),
where
An = (1/3 + 1/n, 2/3 + 1/n]
for n = 1, 2, ....
(a) Show that lim_{n→∞} An exists so that X is well defined.
(b) Compare lim_{n→∞} E(Xn) with E(X).
(c) Is it true that lim_{n→∞} E(Xn − X)² = 0?
(a) Find the conditional expectation E(X 2 |Y )
(b) Show that E(E(X 2 |Y )) = E(X 2 ).
Chapter 4
Let X be a random variable with density p. The moment generating function (MGF)
of X is given by
m(t) = E exp(tX) = ∫ exp(tx) p(x) dµ(x).
Note that the moment generating function is the Laplace transform of the density.
The name of MGF is due to the fact that
d^k m/dt^k (0) = EX^k.
The MGF may not exist, but we can always define characteristic function, which is
given by
φ(t) = E exp(itX) = ∫ exp(itx) p(x) dµ(x).
Note that the characteristic function is the Fourier transform of the density. Since
| exp(itx)| is bounded, φ(t) is always defined.
4.1.3 Quantile Function
We define the τ -quantile or fractile of X (with distribution function F ) by
Qτ = inf{x|F (x) ≥ τ }, 0 < τ < 1.
If F is strictly monotone, then Qτ is nothing but F −1 (τ ). If τ = 1/2, Q1/2 is
conventionally called the median of X.
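As a hedged illustration of the definition Qτ = inf{x | F(x) ≥ τ} (not from the text), the sketch below evaluates this generalized inverse for the empirical distribution function of a sample; the helper name quantile_inf and the use of numpy are our own choices.

```python
# Hypothetical helper: Q_tau = inf{x : F_n(x) >= tau} for the empirical CDF F_n of a sample.
import numpy as np

def quantile_inf(sample, tau):
    """Generalized inverse of the empirical distribution function at level tau."""
    xs = np.sort(sample)
    n = len(xs)
    # F_n(xs[k]) = (k+1)/n, so the smallest order statistic with F_n >= tau has index ceil(n*tau)-1.
    k = int(np.ceil(n * tau)) - 1
    return xs[max(k, 0)]

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
print(quantile_inf(x, 0.5))    # close to the N(0,1) median, 0
print(quantile_inf(x, 0.975))  # close to 1.96
```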
Note that if X ∼ Binomial(n, θ), it can be represented by a sum of n i.i.d. (inde-
pendently and identically distributed) Bernoulli(θ) random variables.
Poisson The Poisson distribution is a discrete distribution with the following den-
sity,
pλ(x) = e^{−λ} λ^x / x!, x ∈ {0, 1, 2, ...}.
The Poisson distribution typically describes the probability of the number of events
occurring in a fixed period of time. For example, the number of phone calls in a given
time interval may be modeled by a Poisson(λ) distribution, where the parameter λ
is the expected number of calls. Note that the Poisson(λ) density is a limiting case
of the Binomial(n, λ/n) density,
(n choose x) (λ/n)^x (1 − λ/n)^{n−x} = [n!/((n − x)! n^x)] (1 − λ/n)^{−x} (1 − λ/n)^n (λ^x/x!) → e^{−λ} λ^x / x!.
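A quick numerical check of this limit, offered as a hedged sketch rather than part of the text, assuming scipy is available:

```python
# The Binomial(n, lambda/n) pmf approaches the Poisson(lambda) pmf as n grows.
from scipy import stats

lam, x = 3.0, 4
for n in (10, 100, 1000, 10_000):
    print(n, stats.binom.pmf(x, n, lam / n))
print("Poisson limit:", stats.poisson.pmf(x, lam))
```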
Normal The normal distribution, denoted N(µ, σ²), is a continuous distribution with the following density,

p_{µ,σ²}(x) = (1/(√(2π)σ)) exp(−(x − µ)²/(2σ²)).

The parameters µ and σ² are the mean and the variance of the distribution, respectively. In particular, N(0, 1) is called the standard normal. The normal distribution
was invented for the modeling of observation error, and is now the most important
distribution in probability and statistics.
Exponential The Exponential(λ) distribution is a continuous distribution with density
pλ(x) = λe^{−λx}, x ≥ 0,
and distribution function
F(x) = 1 − e^{−λx}.
The exponential distribution typically describes the waiting time before the arrival
of next Poisson event.
Gamma The Gamma distribution, denoted by Gamma(k, λ), is a continuous distribution with the following density,

p_{k,λ}(x) = (λ^k/Γ(k)) x^{k−1} e^{−λx}, x ∈ [0, ∞),

where Γ(·) is the gamma function defined as follows,

Γ(z) = ∫_0^∞ t^{z−1} e^{−t} dt.
The parameter k is called the shape parameter and λ > 0 the rate (inverse scale) parameter.
• Special cases
– Let k = 1, then Gamma(1, λ) reduces to Exponential(λ).
– If k is an integer, Gamma(k, λ) reduces to an Erlang distribution, i.e., the distribution of the sum of k independent exponentially distributed random variables, each of which has mean 1/λ.
– Let ℓ be an integer and λ = 1/2; then Gamma(ℓ/2, 1/2) reduces to χ²_ℓ, the chi-square distribution with ℓ degrees of freedom.
• The gamma function generalizes the factorial function. To see this, note that
Γ(1) = 1 and that by integration by parts, we have
Γ(z + 1) = zΓ(z).
Hence for positive integer n, we have Γ(n + 1) = n!.
Table 4.1: Mean, Variance, and Moment Generating Function

Distribution      Mean      Variance               MGF
Uniform[a, b]     (a+b)/2   (b−a)²/12              (e^{bt} − e^{at})/((b−a)t)
Bernoulli(θ)      θ         θ(1−θ)                 (1−θ) + θe^t
Poisson(λ)        λ         λ                      exp(λ(e^t − 1))
Normal(µ, σ²)     µ         σ²                     exp(µt + σ²t²/2)
Exponential(λ)    λ^{−1}    λ^{−2}                 (1 − t/λ)^{−1}
Gamma(k, λ)       k/λ       k/λ²                   (λ/(λ−t))^k
Beta(a, b)        a/(a+b)   ab/((a+b)²(a+b+1))     1 + Σ_{k=1}^∞ (∏_{r=0}^{k−1} (a+r)/(a+b+r)) t^k/k!
The parameter a is called the location parameter and b is called the scale parame-
ter. Cauchy(0, 1) is called the standard Cauchy distribution, which coincides with
Student’s t-distribution with one degree of freedom.
• When U and V are two independent standard normal random variables, then
the ratio U/V has the standard Cauchy distribution.
where the parameter pk, k = 1, ..., m, is the probability of obtaining the k-th outcome in each trial (each coin toss or die roll, say). When m = 2, the multinomial distribution reduces to the binomial distribution. The continuous analogue of the multinomial distribution is the multivariate normal distribution.
FY(y) = P(−log(1 − X) ≤ y)
      = P(X ≤ 1 − e^{−y})
      = 1 − e^{−y},
Let Y = max_{1≤i≤n} Xi, where X1, ..., Xn are independent with distribution functions F1, ..., Fn. Then

FY(y) = P(∩_{i=1}^n {Xi ≤ y})
      = ∏_{i=1}^n P(Xi ≤ y)
      = ∏_{i=1}^n Fi(y).
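A simple simulation check of this product formula, offered as a hedged sketch (numpy assumed, the i.i.d. Uniform(0, 1) example is our choice, for which FY(y) = y^n):

```python
# Check F_Y(y) = prod_i F_i(y) for Y = max(X_1, ..., X_n), X_i i.i.d. Uniform(0, 1).
import numpy as np

rng = np.random.default_rng(0)
n, reps, y = 5, 200_000, 0.8
x = rng.uniform(size=(reps, n))
y_max = x.max(axis=1)
print("simulated P(Y <= y):", np.mean(y_max <= y))
print("theoretical y**n:   ", y**n)
```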
Example 4.3.3 Let X = (X1 , X2 )0 be a random vector with distribution P and
density p(x1 , x2 ) with respect to measure µ. Then the distribution of Y = X1 + X2
is given by
FY(y) = P{X1 + X2 ≤ y}
      = P{(x1, x2) | x1 + x2 ≤ y}
      = ∫_{−∞}^∞ ∫_{−∞}^{y−x2} p(x1, x2) µ(dx1) µ(dx2).
Example 4.3.5 Let Xi ∼ N(µi, σi²) be independent over i. Then the MGF of Y = Σ_{i=1}^n ci Xi is

m(t) = ∏_{i=1}^n exp(ci µi t + (1/2) ci² σi² t²) = exp(t Σ_{i=1}^n ci µi + (t²/2) Σ_{i=1}^n ci² σi²).

This suggests that Y ∼ N(Σ_i ci µi, Σ_{i=1}^n ci² σi²).
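A hedged Monte Carlo check of this conclusion (not from the text; numpy assumed, the numerical values of µi, σi, ci are arbitrary choices):

```python
# Y = sum_i c_i X_i with independent X_i ~ N(mu_i, sigma_i^2) has
# mean sum_i c_i mu_i and variance sum_i c_i^2 sigma_i^2.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])
sigma = np.array([1.0, 2.0, 0.5])
c = np.array([2.0, 1.0, -3.0])

x = rng.normal(loc=mu, scale=sigma, size=(500_000, 3))
y = x @ c
print(y.mean(), c @ mu)                   # simulated vs. theoretical mean
print(y.var(), np.sum(c**2 * sigma**2))   # simulated vs. theoretical variance
```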
and x = (x1 , . . . , xn )0 . And let PX and PY denote the distributions of X and Y ,
respectively. Assume PX and PY admit density pX and pY with respect to µ, the
counting or the Lebesgue measure on Rn .
For any B ∈ B(R), we define A = g −1 (B). We have A ∈ B(R) since g is measurable.
It is clear that {X ∈ A} = {Y ∈ B}. We therefore have
PY(B) = PX(A) = ∫_A pX(x) µ(dx).
pY (y) = pX (g −1 (y)).
where ġ is the Jacobian matrix of g, ie, the matrix of the first partial derivatives of g, [∂gi/∂xj]. Then we obtain the density of Y,

pY(y) = pX(g^{−1}(y)) |det ġ(g^{−1}(y))|^{−1}.
Example 4.3.6 Suppose we have two random variables X1 and X2 with joint density pX(x1, x2), and consider the transformation Y1 = X1/X2 and Y2 = X1X2, so that x1 = √(y1y2) and x2 = √(y2/y1).
Let X = {(x1, x2) | 0 < x1, x2 < 1} denote the support of the joint density of (X1, X2). Then the support of the joint density of (Y1, Y2) is given by Y = {(y1, y2) | y1, y2 > 0, y1y2 < 1, y2 < y1}. Then

|det ġ(x)| = |det [ 1/x2  −x1/x2² ; x2  x1 ]| = 2x1/x2 = 2y1.
4.4.1 Introduction
Definition 4.4.1 (Multivariate Normal) A random vector X = (X1 , . . . , Xn )0 is
said to be multivariate normally distributed if for all a ∈ Rn , a0 X has a univariate
normal distribution.
characteristic function of Z (defined above) is obviously

φZ(t) = exp(−(1/2) t′t).
Note that | · | denotes determinant, and that, for a square matrix A, we have |A2 | =
|A|2 .
Remarks:
Proof: Exercise. (Hint: use c.f. arguments.)
Proof: The "only if" part is obvious. If Σ12 = 0, then Σ is block diagonal,

Σ = [ Σ11  0 ; 0  Σ22 ].
Hence
Σ^{−1} = [ Σ11^{−1}  0 ; 0  Σ22^{−1} ],
and
|Σ| = |Σ11| · |Σ22|.
Then the joint density of x1 and x2 can be factored as

p(x) = p(x1, x2) = (2π)^{−n/2} |Σ|^{−1/2} exp(−(1/2)(x − µ)′Σ^{−1}(x − µ))
     = (2π)^{−n1/2} |Σ11|^{−1/2} exp(−(1/2)(x1 − µ1)′Σ11^{−1}(x1 − µ1))
       · (2π)^{−n2/2} |Σ22|^{−1/2} exp(−(1/2)(x2 − µ2)′Σ22^{−1}(x2 − µ2))
     = p(x1) p(x2).
Hence X1 and X2 are independent.
Proof: First note that

[ X1 − Σ12Σ22^{−1}X2 ; X2 ] = [ I  −Σ12Σ22^{−1} ; 0  I ] [ X1 ; X2 ].

Since

[ I  −Σ12Σ22^{−1} ; 0  I ] [ Σ11  Σ12 ; Σ21  Σ22 ] [ I  0 ; −Σ22^{−1}Σ21  I ] = [ Σ11 − Σ12Σ22^{−1}Σ21  0 ; 0  Σ22 ],

X1 − Σ12Σ22^{−1}X2 and X2 are independent. We write
X1 = A1 + A2,
where
A1 = X1 − Σ12Σ22^{−1}X2,  A2 = Σ12Σ22^{−1}X2.
Note that A1 is distributed as
N(µ1 − Σ12Σ22^{−1}µ2, Σ11 − Σ12Σ22^{−1}Σ21).
A2 may be treated as a constant given X2, which only shifts the mean of the conditional distribution of X1 given X2. We have thus obtained the desired result.
From the above result, we may see that the conditional mean of X1 given X2 is
linear in X2 , and that the conditional variance of X1 given X2 does not depend on
X2 . Of course the conditional variance of X1 given X2 is less than the unconditional
variance of X1 , in the sense that Σ11 − Σ11|2 is a positive semi-definite matrix.
Student t distribution Let T = Z/√(V/m), where Z ∼ N(0, 1), V ∼ χ²_m, and Z and V are independent; then T ∼ t_m, the Student t distribution with m degrees of freedom.
If X ∼ N(0, Σ) with Σ nonsingular, then X′Σ^{−1}X ∼ χ²_n.
To get to the next theorem, recall that a square matrix is a projection if and only if
P 2 = P .1 If, in addition, P is symmetric, then P is an orthogonal projection.
Theorem 4.4.7 Let Z ∼ N (0, In ), and let A and B be deterministic matrices, then
A0 Z and B 0 Z are independent if and only if A0 B = 0.
Proof: Let C = (A, B). Without loss of generality, we assume that C is full rank
(if it is not, then throw away linearly dependent columns). We have
C′Z = [ A′Z ; B′Z ] ∼ N( 0, [ A′A  A′B ; B′A  B′B ] ).
It is now clear that A0 Z and B 0 Z are independent if and only if the covariance A0 B
is null.
¹Matrices that satisfy this property are said to be idempotent.
It is immediate that we have
Corollary 4.4.8 Let Z ∼ N (0, In ), and let P and Q be orthogonal projections such
that P Q = 0, then Z 0 P Z and Z 0 QZ are independent.
Proof: Note that since P Q = 0, then P Z and QZ are independent. Hence the
independence of Z 0 P Z = (P Z)0 (P Z) and Z 0 QZ = (QZ)0 (QZ).
We have
54
Hence
(n − 1)Sn²/σ² = ((X − µι)/σ)′ (In − Pι) ((X − µι)/σ).
(b) follows from the fact that (X − µι)/σ ∼ N(0, In) and that (In − Pι) is an (n − 1)-dimensional orthogonal projection. To prove (c), we note that X̄n = n^{−1}ι′PιX and Sn² = (n − 1)^{−1}((I − Pι)X)′((I − Pι)X), and that PιX and (I − Pι)X are independent by Theorem 4.4.7. Finally, (d) follows from

√n(X̄n − µ)/Sn = [√n(X̄n − µ)/σ] / √[((n − 1)Sn²/σ²)/(n − 1)].
4.5 Exercises
1. Derive the characteristic function of the distribution with density
p(x) = exp(−|x|)/2.
(A2) Σ = σ 2 I,
(A3) µ = 0.
We claim:
(a) X̄n and Sn² are uncorrelated.
(b) E(X̄n) = µ.
(c) E(Sn²) = σ².
(d) X̄n ∼ N(µ, σ²/n).
(e) (n − 1)Sn²/σ² ∼ χ²_{n−1}.
(f) √n(X̄n − µ)/Sn ∼ t_{n−1}.
What assumptions in (A1), (A2), and (A3) are needed for each of (a) – (f) to
hold. Prove (a) – (f) using the assumptions you specified.
Span: The span of a set of vectors is the set of all linear combinations of the
vectors. For example, the x-y plane is spanned by (1, 0) and (0, 1).
Range: Given a matrix A, the range of A is defined as the span of its columns, R(A) = {Ax | x}.
Null space: The null space of A is the set of all column vectors x such that
Ax = 0,
N (A) = {x|Ax = 0}.
It can be easily shown that for any matrix A, R(A)⊥ = N (A0 ).
Basis: An independent subset of a vector space X that spans X is called a basis
of X . Independence here means that any vector in the set cannot be written as a
linear combination of other vectors in the set. The number of vectors in a basis of
X is the dimension of X .
λ²x = λPx = P(λx) = P(Px) = Px = λx,
which implies λ = λ².
An orthogonal projection P has the following eigen-decomposition,
P = QΛQ0 ,
where Λ is a diagonal matrix with eigenvalues (1 or 0) on the diagonal, and Q is
orthonormal, ie, Q0 Q = I. The i-th column of Q is the eigenvector corresponding
to the i-th eigenvalues on the diagonal. Suppose there are n1 ones and n2 zeros on
the diagonal of Λ. We may conveniently order the eigenvalues and eigenvectors such
that
Λ = [ I_{n1}  0 ; 0  0_{n2} ],  Q = (Q_{n1}  Q_{n2}),
where the subscript n1 and n2 denotes number of columns. Then we may represent
P by
P = Qn1 Q0n1 .
It is now clear that the range of P has n1 dimensions. In other words, P is n1 -
dimensional. And since I − P = Qn2 Q0n2 , (I − P ) is an n2 -dimensional orthogonal
projection. P is an orthogonal projection on the subspace spanned by the eigenvec-
tors corresponding to eigenvalue ones. And I − P is an orthogonal projection on the
subspace spanned by the eigenvectors corresponding to eigenvalue zeros.
Since the eigenvalues of an orthogonal projection P are either 1 or 0, P is positive
semidefinite, ie, P ≥ 0. And we also have A0 P A ≥ 0, since for any x, x0 A0 P Ax =
(Ax)0 P (Ax) ≥ 0.
Chapter 5
Introduction to Statistics
The First Example: We may study the relationship between individual income (income) and the characteristics of the individual, such as education level (edu), work experience (expr), gender, etc. The variables of interest may then be Xi = (incomei, edui, expri, genderi). We may reasonably postulate that (Xi)
are independently and identically distributed (i.i.d.). Hence the study of the joint
distribution of X reduces to that of the joint distribution of Xi . To achieve this,
we take a sample of the whole population, and observe (Xi , i = 1, . . . , n), where i
denotes individuals. In this example in particular, we may focus on the conditional
distribution of income given edu, expr, and gender.
(gt, yt, πt, ut). One of the objectives of empirical analysis, in this example, may be to study the conditional distribution of unemployment given past observations on government expenditure, GDP growth, inflation, as well as itself. The problem with this example lies in, first, the fact that the i.i.d. assumption on Xt is untenable, and second, the fact that we can observe each Xt only once. In other words, an economic data generating process is nonindependent and time-irreversible. It is clear that the statistical study would go nowhere unless we impose (sometimes strong) assumptions on the evolution of Xt, stationarity for example.
In this chapter, for simplicity, we have the first example in mind. In most cases, we
assume that X1 , . . . , Xn are i.i.d. with a distribution Pθ that belongs to a family of
distributions {Pθ |θ ∈ Θ} where θ is called parameter and Θ a parameter set. In this
course we restrict θ to be finite-dimensional. This is called the parametric approach
to statistical analysis. The nonparametric approach refers to the case where we do
not restrict the distribution to any family of distributions, which is in a sense to
allow θ to be infinite-dimensional. In this course we mainly consider the parametric
approach.
5.2 Statistic
Recall that random variables are mappings from the sample space to real numbers.
We say that the random vector X = (X1 , . . . , Xn )0 is a mapping from the sample
space to a state space X , which is usually Rn in this text. We may write, X : Ω → X .
Now we introduce
Sufficient Statistic Let τ = τ (X) be a statistic, and P = {Pθ |θ ∈ Θ} be a family
of distributions of X.
The distribution of X can be any member of the family P. Therefore, the conditional distribution of X given τ would depend on θ in general. τ is sufficient when this conditional distribution does not depend on θ, so that once the value of τ is known, the remaining variation in X carries no information about θ. Bayesians may interpret sufficiency as P(θ|X) = P(θ|τ(X)).
Sufficient statistics are useful in data reduction. It is less costly to infer θ from a
statistic τ than from X, since the former, being a function of the latter, is of lower
dimension. The sufficiency of τ guarantees that τ contains all information about θ
in X.
This theorem implies that if two samples give the same value for a sufficient statistic,
then the MLE based on the two samples yield the same estimate of the parameters.
Example 5.2.5 Let X1 , . . . , Xn be i.i.d. Poisson(λ). We may write the joint dis-
tribution of X = (X1 , . . . , Xn ) as
pλ(x) = e^{−nλ} λ^{x1+···+xn} / ∏_{i=1}^n xi! = f(τ(x), λ) g(x),

where τ(x) = Σ_{i=1}^n xi, f(t, λ) = exp(−nλ)λ^t, and g(x) = (∏_{i=1}^n xi!)^{−1}. Hence τ(x) is sufficient for λ.
It is clear that τ(x) = (Σ_{i=1}^n xi, Σ_{i=1}^n xi²)′ is sufficient for (µ, σ²)′.
Definition 5.2.7 (Exponential Family) The exponential family refers to the family of distributions that have densities of the form

pθ(x) = exp[ Σ_{i=1}^m ai(θ)τi(x) + b(θ) ] g(x),
To emphasize the dependence on m, we may call the above family m-parameter expo-
nential family. By the factorization theorem, it is obvious that τ (x) = (τ1 (x), . . . , τm (x))0
is a sufficient statistic.
In the case of m = 1, let X1 , . . . , Xn be i.i.d. with density
• Bernoulli(θ)
Example 5.2.9 (Two-parameter exponential family)
• N(µ, σ²)

p_{µ,σ²}(x) = (1/(√(2π)σ)) exp(−(x − µ)²/(2σ²))
           = (1/√(2π)) exp(−x²/(2σ²) + µx/σ² − (µ²/(2σ²) + log σ))

• Gamma(α, β)

p_{α,β}(x) = (1/(Γ(α)β^α)) x^{α−1} e^{−x/β}
          = exp[ (α − 1) log x − x/β − (log Γ(α) + α log β) ].
p(θ|x) = p(θ) p(x|θ)/p(x),

where p(x) = ∫ p(x|θ)p(θ)dθ. Note that Bayesians treat θ as random; hence the conditional-density notation p(θ|x), which is called the posterior density.
5.3 Estimation
In contrast, we call the sample average of (f(xi )) the sample moments. Note that
the sample average may be regarded as the moment of the distribution that assigns
probability mass 1/n to each realization xi . This distribution is called the empir-
ical distribution, which we denote Pn . Obviously, the moments of the empirical
distribution equal the corresponding sample moments
En f = ∫ f dPn = (1/n) Σ_{i=1}^n f(xi).
Eθ f = En f. (5.1)
Example 5.3.2 Let Xi be i.i.d. N(µ, σ 2 ). To estimate µ and σ 2 , we may solve the
following system of equations
E_{µ,σ²} X = (1/n) Σ_{i=1}^n xi,
E_{µ,σ²} (X − µ)² = (1/n) Σ_{i=1}^n (xi − µ)².
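A minimal sketch of this method-of-moments calculation in code (not from the text; numpy assumed, true parameter values chosen arbitrarily):

```python
# Method of moments for (mu, sigma^2) in the normal example:
# equate the first two population moments to their sample counterparts.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=10_000)

mu_hat = x.mean()                        # solves E X = sample mean
sigma2_hat = np.mean((x - mu_hat)**2)    # solves E(X - mu)^2 = sample second central moment
print(mu_hat, sigma2_hat)                # close to 2 and 9
```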
to estimate θ. The basic idea of GMM is to minimize some distance measure be-
tween the population moments and their corresponding sample moments. A popular
approach is to solve the following quadratic programming problem,
Remark: Let τ be any sufficient statistic for the parameter θ. According to the factorization theorem, we have p(x, θ) = f(τ(x), θ)g(x). Then θ̂_ML maximizes f(τ(x), θ) with respect to θ. Therefore, θ̂_ML is always a function of τ(X). This implies that if the MLE is itself a sufficient statistic, then it is a minimal sufficient statistic.
First Order Condition If the log likelihood function `(θ, x) is differentiable and
globally concave for all x, then the ML estimator can be obtained by solving the
first order condition (FOC),
∂ℓ/∂θ (θ, x) = 0.

Note that s(θ, x) = ∂ℓ/∂θ (θ, x) is called the score function.
which is solved to obtain θ̂ = x̄ = n^{−1} Σ_{i=1}^n xi. Note that to estimate the variance of Xi, we need to estimate v = θ(1 − θ), a function of θ. By the invariance theorem, we obtain v̂ = θ̂(1 − θ̂).
Example 5.3.6 (N (µ, σ 2 )) Let Xi be i.i.d. N(µ, σ 2 ), then the log-likelihood func-
tion is given by
ℓ(µ, σ², x) = −(n/2) log(2π) − (n/2) log σ² − (1/(2σ²)) Σ_{i=1}^n (xi − µ)².
Solving the FOC gives
µ̂ = x̄,
σ̂² = (1/n) Σ_{i=1}^n (xi − x̄)².
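The following hedged sketch (not the author's code; numpy/scipy assumed, and the reparameterization of σ² through its logarithm is our own device to keep it positive) maximizes this log-likelihood numerically and compares the result with the closed-form MLE just derived.

```python
# Numerical maximization of the normal log-likelihood vs. the closed-form MLE.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=5_000)

def negative_loglik(params):
    mu, log_sigma2 = params              # sigma^2 = exp(log_sigma2) > 0 by construction
    sigma2 = np.exp(log_sigma2)
    n = len(x)
    return 0.5 * n * np.log(2 * np.pi) + 0.5 * n * np.log(sigma2) \
        + np.sum((x - mu) ** 2) / (2 * sigma2)

res = minimize(negative_loglik, x0=np.array([0.0, 0.0]))
print(res.x[0], np.exp(res.x[1]))                 # numerical MLE of (mu, sigma^2)
print(x.mean(), np.mean((x - x.mean())**2))       # closed-form MLE
```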
Definition 5.3.9 (UMVU Estimator) An estimator T∗ = τ∗(X) is called a UMVU (uniformly minimum variance unbiased) estimator if it satisfies
(1) T∗ is unbiased,
(2) varθ(T∗) ≤ varθ(T) for all θ ∈ Θ and every unbiased estimator T.
Definition 5.3.10 (Loss Function) Loss function is any function `(t, θ) that as-
signs disutility to each pair of estimate t and parameter value θ.
Definition 5.3.11 (Risk Function) For an estimator T = τ (X), the risk func-
tion is defined by
r(τ, θ) = Eθ `(T, θ).
It can be observed that risk function is the expected loss of an estimator for each
value of θ. Risk functions corresponding to the loss functions in the above examples
are
In the decision-theoretic approach of statistical inference, estimators are constructed
by minimizing some appropriate loss or risk functions.
Theorem 5.3.13 (Rao-Blackwell Theorem) Suppose that the loss function `(t, θ)
is convex in t and that S is a sufficient statistic. Let T = τ (X) be an estimator for
θ with finite mean and risk. If we define T∗ = Eθ (T |S) and write T∗ = τ∗ (X), then
we have
r(τ∗ , θ) ≤ r(τ, θ).
Theorem 5.3.16 (Lehmann-Scheffé Theorem) If S is complete and sufficient
and T = τ (X) is an unbiased estimator of g(θ), then f (S) = Eθ (T |S) is a UMVU
estimator.
Example 5.3.17 We continue with the previous example and proceed to find a
UMVU estimator. Let T = 2X1 , which is an unbiased estimator for θ. Suppose
S = s, then X1 can take s with probability 1/n, since every member of (Xi , i =
1, . . . , n) is equally likely to be the maximum. When X1 6= s, which is of probability
(n − 1)/n, X1 is uniformly distributed on (0, s). Thus we have
Eθ(T|S = s) = 2Eθ(X1|S = s)
            = 2[ (1/n)s + ((n − 1)/n)(s/2) ]
            = ((n + 1)/n) s.
The UMVU estimator of θ is thus obtained as
T∗ = ((n + 1)/n) max_{1≤i≤n} Xi.
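A hedged simulation comparing T∗ with another unbiased estimator (the naive 2X̄, which we add purely for comparison) illustrates the variance reduction; numpy assumed, parameter values arbitrary.

```python
# Compare two unbiased estimators of theta for X_i ~ Uniform(0, theta):
# T* = (n+1)/n * max X_i (UMVU) and the naive 2 * Xbar.
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 2.0, 20, 100_000
x = rng.uniform(0, theta, size=(reps, n))

t_star = (n + 1) / n * x.max(axis=1)
t_naive = 2 * x.mean(axis=1)
print("T*:     mean", t_star.mean(), " variance", t_star.var())
print("2*Xbar: mean", t_naive.mean(), " variance", t_naive.var())
# Both means are close to theta = 2; T* has a much smaller variance.
```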
(a) Hessian: h(θ, x) = ∂²ℓ/∂θ∂θ′ (θ, x).
(b) Fisher Information: I(θ) = Eθ s(θ, X)s(θ, X)0 .
where I(θ), I1 (θ), and I2 (θ) denote the information matrix of X, X1 , X2 , respectively,
and the notations of H, H1 , and H2 are analogous.
From now on, we assume that a random vector X has joint density p(x, θ) with respect to Lebesgue measure µ. Note that the notation p(x, θ) emphasizes the fact that the joint density of X is a function of both x and θ. We let θ̂ (or more precisely, τ(X)) be an unbiased estimator for θ. And we impose the following regularity conditions on p(x, θ).
Regularity Conditions
(a) ∂/∂θ ∫ p(x, θ) dµ(x) = ∫ ∂p(x, θ)/∂θ dµ(x)
(b) ∂²/∂θ∂θ′ ∫ p(x, θ) dµ(x) = ∫ ∂²p(x, θ)/∂θ∂θ′ dµ(x)
(c) ∂/∂θ′ ∫ τ(x) p(x, θ) dµ(x) = ∫ τ(x) ∂p(x, θ)/∂θ′ dµ(x)
Under these regularity conditions, we have a few results that are both useful in
proving subsequent theorems and interesting in themselves.
Eθ s(θ, X) = 0.
Proof: We have

Eθ s(θ, X) = ∫ s(θ, x) p(x, θ) dµ(x)
           = ∫ [∂ℓ(θ, x)/∂θ] p(x, θ) dµ(x)
           = ∫ [∂p(x, θ)/∂θ / p(x, θ)] p(x, θ) dµ(x)
           = ∫ ∂p(x, θ)/∂θ dµ(x)
           = 0.
Proof: We have

∂²ℓ(θ, x)/∂θ∂θ′ = [∂²p(x, θ)/∂θ∂θ′]/p(x, θ) − [∂ log p(x, θ)/∂θ][∂ log p(x, θ)/∂θ′].

Then

H(θ) = ∫ [∂²ℓ(θ, x)/∂θ∂θ′] p(x, θ) dµ(x)
     = ∫ ∂²p(x, θ)/∂θ∂θ′ dµ(x) − I(θ)
     = −I(θ).
Lemma 5.3.20 Let τ (X) be an unbiased estimator for θ, and suppose the Condition
(c) holds, then
Eθ τ (X)s(θ, X)0 = I.
Proof: We have

Eθ θ̂(X)s(θ, X)′ = ∫ θ̂(x) [∂p(x, θ)/∂θ′ / p(x, θ)] p(x, θ) dµ(x)
               = ∂/∂θ′ ∫ θ̂(x) p(x, θ) dµ(x)
               = I.
Since Eθ s(θ, X) = 0, the lemma implies that the covariance matrix between an
unbiased estimator and the random score is identity for all θ.
Theorem 5.3.21 (Cramer-Rao Bound) Let θ̂(X) be an unbiased estimator of θ, and suppose Conditions (a) and (c) hold. Then
varθ θ̂(X) ≥ I(θ)^{−1}.
Then the information matrix I(λ) of X = (X1, ..., Xn)′ is I(λ) = nI1(λ) = n/λ. Recall that λ̂ = X̄ = (1/n) Σ_{i=1}^n Xi is an unbiased estimator for λ. And we have var_λ(λ̂) = λ/n = I(λ)^{−1}, so that X̄ attains the Cramer-Rao bound.
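A hedged simulation check of this bound (not from the text; numpy assumed, λ and n arbitrary):

```python
# For X_i i.i.d. Poisson(lambda), the variance of Xbar should equal the
# Cramer-Rao bound I(lambda)^{-1} = lambda/n.
import numpy as np

rng = np.random.default_rng(0)
lam, n, reps = 4.0, 50, 200_000
x = rng.poisson(lam, size=(reps, n))
lam_hat = x.mean(axis=1)
print("simulated variance of Xbar:", lam_hat.var())
print("Cramer-Rao bound lambda/n: ", lam / n)
```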
H0 : θ ∈ Θ0 H1 : θ ∈ Θ1 ,
where H0 is called the null hypothesis and H1 is called the alternative hypothesis.
A test statistic, say τ , is used to partition the state space X into the disjoint union
of the critical region C and the acceptance region A,
X = C ∪ A.
The critical region is conventionally given as
C = {x ∈ X |τ (x) ≥ c},
where c is a constant that is called the critical value. If the observed sample falls within the critical region, we reject the null hypothesis and thus accept the alternative. Otherwise, we say that we fail to reject the null. Note that different tests differ in their critical regions. In the following, we denote tests by their critical regions.
Example 5.4.2 Let X be a random variable from Uniform(0, θ), and we want to
test
H0 : θ ≤ 1 H1 : θ > 1.
Consider first the following test,
C1 = {x|x ≥ 1}.
The power function of C1 is given by

π(θ) = ∫_1^θ (1/θ) dx = 1 − 1/θ.
Since π(θ) is monotone increasing, the size of C1 is π(1) = 0. Another test may be
C2 = {x|x ≥ 1/2}.
The power function of C2 is 1 − 1/(2θ), and the size is 1/2. Note that the power function
of C2 is higher than that of C1 on Θ1 , but at the cost of higher size.
Given two tests with a same size, C1 and C2 , if Pθ (C1 ) > Pθ (C2 ) at θ ∈ Θ1 , we say
that C1 is more powerful than C2 . If there is a test C∗ that satisfies Pθ (C∗ ) ≥ Pθ (C)
at θ ∈ Θ1 for any test C of the same size, then we say that C∗ is the most powerful
test. Furthermore, if the test C∗ is such that Pθ (C∗ ) ≥ Pθ (C) for all θ ∈ Θ1 for any
test C of the same size, then we say that C∗ is the uniformly most powerful.
If Θ0 (or Θ1 ) is a singleton set, ie, Θ0 = {θ0 }, we call the hypothesis H0 : θ = θ0
simple. Otherwise, we call it composite hypothesis. In particular, when both H0
and H1 are simple hypotheses, say, Θ0 = {θ0 } and Θ1 = {θ1 }, P consists of two
distributions Pθ0 and Pθ1 , which we denote as P0 and P1 , respectively. It is clear that
P0 (C) and P1 (C) are the size and the power of the test C, respectively. Note that
both P0 (C) and P1 (A) are probabilities of making mistakes. P0 (C) is the probability
of rejecting the true null, and P1 (A) is the probability of accepting the false null.
Rejecting the true null is often called the type-I error, and accepting the false null
is called the type-II error.
Proof: Suppose C is any test with the same size as C∗ . Assume without loss of
generality that C and C∗ are disjoint. It follows that
p(x, θ1 ) ≥ cp(x, θ0 ) on C∗
p(x, θ1 ) < cp(x, θ0 ) on C.
Hence we have
P1(C∗) = ∫_{C∗} p(x, θ1) dµ(x) ≥ c ∫_{C∗} p(x, θ0) dµ(x) = cP0(C∗),
and
P1(C) = ∫_C p(x, θ1) dµ(x) < c ∫_C p(x, θ0) dµ(x) = cP0(C).
Since P0 (C∗ ) = P0 (C) (the same size), we have P1 (C∗ ) ≥ P1 (C). Q.E.D.
Remarks:
• For obvious reasons, a test of the same form as C∗ is also called a likelihood ratio (LR) test. The constant c is to be determined by pre-specifying a size, ie, by solving for c the equation P0(C) = α, where α is a prescribed small number.
• We may view p(x, θ1 ) (or p(x, θ0 )) as marginal increases of power (size) when
the point x is added to the critical region C. The Neyman-Pearson Lemma
shows that those points contributing more power increase per unit increase in
size should be included in C for an optimal test.
Example 5.4.4 Let X be a random variable with density p(x, θ) = θxθ−1 , x ∈ (0, 1),
θ > 0. The most powerful test for the hypothesis H0 : θ = 2 versus H1 : θ = 1 is
given by
C = { x : p(x, 1)/p(x, 2) = 1/(2x) ≥ c }.
This is equivalent to
C = {x ≤ c}.
To determine c for a size-α test, we solve the following for c,

∫_0^c p(x, 2) dx = α,

which obtains c = √α. Hence C = {x ≤ √α} is the most powerful test.
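A hedged numerical check of the size and power of this test (not from the text; numpy assumed, the inverse-CDF draw from p(x, 2) = 2x is our own device):

```python
# Most powerful size-alpha test C = {x <= sqrt(alpha)} for H0: theta = 2 vs H1: theta = 1.
import numpy as np

rng = np.random.default_rng(0)
alpha, reps = 0.05, 500_000
c = np.sqrt(alpha)

x0 = np.sqrt(rng.uniform(size=reps))   # draws from p(x, 2) = 2x, since F(x) = x^2
x1 = rng.uniform(size=reps)            # p(x, 1) = 1 is Uniform(0, 1)
print("size  P0(X <= c):", np.mean(x0 <= c), " target:", alpha)
print("power P1(X <= c):", np.mean(x1 <= c), " theory:", c)
```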
For composite hypotheses, we may use the generalized LR test based on the ratio

λ(x) = sup_{θ∈Θ1} p(x, θ) / sup_{θ∈Θ0} p(x, θ).

The Neyman-Pearson Lemma does not apply to the generalized LR test. However, it is intuitively appealing and leads to satisfactory tests in many contexts.
Example 5.4.5 We continue with the previous example. Suppose we want to test
H0 : θ = 1 versus H1 : θ ≠ 1. The generalized LR statistic is given by

λ(x) = sup_{θ∈Θ1} p(x, θ) / sup_{θ∈Θ0} p(x, θ) = sup_θ θx^{θ−1} / 1.

The sup in the numerator is attained at the ML estimator θ_ML = −1/log(x). So we have

λ(x) = −(1/log x) x^{−1/log x − 1}.

Let t = log x; we have

λ(e^t) = −(1/t) e^{−(t+1)} = f(t).
The generalized LR test is thus given by
C = {x|λ(x) ≥ c}
= {x|f (t) ≥ c}
= {x|t ≤ c1 or t ≥ c2 },
where c1 and c2 are constants that satisfy f (c1 ) = f (c2 ) = c. To determine c for a
size-α test, we solve P0 (C) = α.
First consider a simple example. Let X1 , . . . , Xn be i.i.d. N(µ, 1), and we test
H0 : µ = 0 against H1 : µ = 1
Since both the null and the alternative are simple, the Neyman-Pearson Lemma ensures that the likelihood ratio test is the best test. The likelihood ratio is

λ(x) = p(x, 1)/p(x, 0)
     = [(2π)^{−n/2} exp(−(1/2) Σ_{i=1}^n (xi − 1)²)] / [(2π)^{−n/2} exp(−(1/2) Σ_{i=1}^n (xi − 0)²)]
     = exp( Σ_{i=1}^n xi − n/2 ).
We know that τ(X) = n^{−1/2} Σ_{i=1}^n Xi is distributed as N(0, 1) under the null. We may use this to construct a test. Note that we can write τ(x) = f ∘ λ(x), where f(z) = n^{−1/2}(log z + n/2) is a monotone increasing function. The test
C = {x | τ(x) ≥ c}
is then an LR test. It remains to determine c. Suppose we allow the probability
of type-I error to be 5%, that is a size of 0.05, we may solve for c the equation
P0 (C) = 0.05. Since τ (X) ∼ N (0, 1) under the null, we can look up the N (0, 1)
table and find that
P0 (x|τ (x) ≥ 1.645) = 0.05.
This implies c = 1.645.
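A minimal sketch of this test in code (not from the text; numpy/scipy assumed, sample size and seed arbitrary):

```python
# One-sided test of H0: mu = 0 based on tau(x) = n^{-1/2} * sum(x); critical value
# is the 95% quantile of N(0, 1), approximately 1.645.
import numpy as np
from scipy import stats

c = stats.norm.ppf(0.95)
print(c)  # about 1.645

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=25)   # data generated under H0
tau = x.sum() / np.sqrt(len(x))
print(tau, tau >= c)                          # reject H0 if tau >= c
```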
Example 2: One-Sided Student-t Test
Now we test
H0 : µ = 0 against H1 : µ > 0. (5.2)
The alternative hypothesis is now composite. From the preceding analysis, however,
it is clear that for any µ1 > 0, C is the most powerful test for
H0 : µ = 0 against H1 : µ = µ1 .
H0 : µ = µ0 against H1 : µ ≠ µ0.
Here we have two unknown parameters, µ and σ 2 , but the null and the alternative
hypotheses are concerned with the parameter µ only. We consider the generalized
LR test with the following generalized likelihood ratio
λ(x) = sup_{µ,σ²} (2πσ²)^{−n/2} exp(−(1/(2σ²)) Σ_{i=1}^n (xi − µ)²) / sup_{σ²} (2πσ²)^{−n/2} exp(−(1/(2σ²)) Σ_{i=1}^n (xi − µ0)²).
We define

τ(x) = (n − 1) n(x̄ − µ0)² / Σ_{i=1}^n (xi − x̄)².
τ(X) = (V1/1) / (V2/(n − 1)),
where
V1 = n(X̄ − µ0)²/σ²  and  V2 = Σ_{i=1}^n (Xi − X̄)²/σ².
Under H0, we can show that V1 ∼ χ²_1, V2 ∼ χ²_{n−1}, and V1 and V2 are independent. Hence, under H0,
τ(X) ∼ F_{1,n−1}.
To find the critical value c for a size-α test, we look up the F table and find constant
c such that
P0 {x|τ (x) ≥ c} = α.
From the preceding examples, we may see that the hypothesis testing problem con-
sists of three steps in practice: first, forming an appropriate test statistic, second,
finding the distribution of this statistic under H0 , and finally making a decision. If
the outcome of the test statistic is deemed as unlikely under H0 , the null hypothe-
sis H0 is rejected, in which case we accept H1. The Neyman-Pearson Lemma gives important insights on how to form a test statistic that leads to a powerful test. In the following example, we illustrate a direct approach that is not built on the likelihood ratio.
For the testing problem of Example 3, we may construct a Student-t test statistic
as follows,

τ̃(x) = √n(x̄ − µ0) / √( Σ_{i=1}^n (xi − x̄)²/(n − 1) ).
where
Z = √n(X̄ − µ0)/σ  and  V = Σ_{i=1}^n (Xi − X̄)²/σ².
Under H0 , we can show that Z ∼ N (0, 1), V ∼ χ2n−1 , and Z and V are independent.
Hence, under H0 ,
τ̃ (X) ∼ tn−1 .
To find the critical value c for a size-α test, we look up the t table and find a constant
c > 0 such that
P0 {x| |τ̃ (x)| ≥ c} = α.
Finally, to see the connection between this test and the F test in Example 3, note that F_{1,n−1} ≡ t²_{n−1}. In words, taking the square of a t_{n−1} random variable results in an F_{1,n−1} random variable.
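The two-sided Student-t test above is easy to carry out in code; the following hedged sketch (not from the text; numpy/scipy assumed, data simulated with an arbitrary true mean) computes τ̃ by hand and compares it with scipy's one-sample t-test.

```python
# Two-sided Student-t test of H0: mu = mu0, by hand and via scipy.stats.ttest_1samp.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu0 = 0.0
x = rng.normal(loc=0.3, scale=1.0, size=30)
n = len(x)

tau = np.sqrt(n) * (x.mean() - mu0) / np.sqrt(np.sum((x - x.mean())**2) / (n - 1))
c = stats.t.ppf(0.975, df=n - 1)          # size-0.05 two-sided critical value
print(tau, abs(tau) >= c)

t_stat, p_value = stats.ttest_1samp(x, popmean=mu0)
print(t_stat, p_value)                    # t_stat matches the hand computation
```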
5.5 Exercises
1. Let X1 and X2 be independent Poisson(λ). Show that τ = X1 + X2 is a
sufficient statistic.
4. Let (Xi , i = 1, . . . , n) be a random sample from a normal distribution with
mean µ and variance σ 2 . Define
X̄n = Σ_{i=1}^n Xi / n  and  Sn² = Σ_{i=1}^n (Xi − X̄n)² / (n − 1).
(a) Obtain the Cramer-Rao lower bound.
(b) See whether X n and Sn2 attain the lower bound.
(c) Show that X n and Sn2 are jointly sufficient for µ and σ 2 .
(d) Are X n and Sn2 the UMVU estimators?
Chapter 6
Asymptotic Theory
6.1 Introduction
Let X1 , . . . , Xn be a sequence of random variables, and let β̂n = β̂(X1 , . . . , Xn ) be
an estimator for the population parameter β. For β̂n to be a good estimator, it
must be asymptotically consistent, ie, β̂n converges to β in some sense as n → ∞.
Furthermore, it is desirable to have an asymptotic distribution of βn , if properly
standardized. That is, there may be a sequence of number an such that an (β̂n − β)
converges in some sense to a random variable Z with a known distribution. If in
particular Z is normal (or Gaussian), we say β̂n is asymptotically normal.
Asymptotic distribution is also important for hypothesis testing. If we can show
that a test statistic has an asymptotic distribution, then we may relax assumptions
on the finite sample distribution of X1 , . . . , Xn . This would make our test more
robust to mis-specifications of the model.
We study basic asymptotic theory in this chapter. It provides essential tools for
proving consistency and deriving asymptotic distributions. In this section we first
study the convergence of a sequence of random variables. As random variables are
measurable functions, their convergence behavior is much richer than that of a
sequence of real numbers.
6.1.1 Modes of Convergence
Let (Xn ) and X be random variables defined on a common probability space (Ω, F, P).
We say that Xn converges to X almost surely, written as Xn →a.s. X, if
P{ω | Xn (ω) → X(ω)} = 1,
or, equivalently, if for every ε > 0,
P{ω | |Xn (ω) − X(ω)| < ε e.v.} = 1,
where “e.v.” (eventually) means for all n beyond some finite N .
We say that Xn converges to X in probability, written as Xn →p X, if for every ε > 0,
P{|Xn − X| > ε} → 0 as n → ∞.
We say that Xn converges to X in Lp (p ≥ 1), written as Xn →Lp X, if
E|Xn − X|p → 0 as n → ∞.
Finally, we say that Xn converges to X in distribution, written as Xn →d X, if
FXn (z) → FX (z) at every continuity point z of FX .
Remarks:
• Note that for the convergence in distribution, (Xn ) and X need not be defined
on a common probability space. It is not a convergence of Xn itself, but of the
probability measure induced by Xn , i.e., PXn (B) = P ◦ Xn−1 (B), B ∈ B(R).
• Recall that we may also call PXn the law of Xn . Thus convergence in distri-
bution is also called convergence in law. More technically, we may refer to conver-
gence in distribution as weak convergence, as opposed to strong convergence
in the set of probability measures. Strong convergence refers to convergence
in a metric on the set of probability measures (e.g., the total variation metric).
Without proof, we give the following three portmanteau theorems, each of which
supplies an equivalent definition of convergence in distribution.
Lemma 6.1.5 Xn →d X if and only if for every function f that is bounded and
continuous a.s. in PX ,
Ef (Xn ) → Ef (X).
The function f need not be continuous at every point. The requirement of a.s.
continuity allows f to be discontinuous on a set S ⊂ R such that PX (S) = 0.
Lemma 6.1.6 Xn →d X if and only if Ef (Xn ) → Ef (X) for every bounded and
uniformly continuous function f .1
1 A function f : D → R is uniformly continuous on D if for every ε > 0, there exists δ > 0 such
that |f (x1 ) − f (x2 )| < ε for x1 , x2 ∈ D that satisfy |x1 − x2 | < δ.
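To make Lemma 6.1.5 concrete, one can check Ef(Xn) → Ef(X) by simulation for a particular bounded continuous f. The Python sketch below is only an illustration: the choices Xn = standardized sample mean of exponentials (so that Xn →d N(0,1)) and f = cos are assumptions for the example, and E cos(Z) = e^{-1/2} for Z ∼ N(0,1).

# Illustrate Ef(X_n) -> Ef(X) for bounded continuous f when X_n ->d N(0,1).
import numpy as np

rng = np.random.default_rng(3)
f = np.cos                                   # bounded and continuous everywhere

def draw_Xn(n, reps=20_000):
    # X_n = sqrt(n) * (mean of Exponential(1) sample - 1), which ->d N(0,1)
    x = rng.exponential(1.0, size=(reps, n))
    return np.sqrt(n) * (x.mean(axis=1) - 1.0)

Ef_limit = np.exp(-0.5)                      # E cos(Z) for Z ~ N(0,1)
for n in (5, 50, 500):
    print(n, f(draw_Xn(n)).mean(), Ef_limit)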
We have the following relationships among the different modes of convergence:
(a) if Xn →a.s. X, then Xn →p X;
(b) if Xn →Lp X, then Xn →p X;
(c) if Xn →p X, then Xn →d X.
Proof: (a) To show that a.s. convergence implies convergence in probability, we let
En = {|Xn − X| > ε} for an arbitrary ε > 0. Since Xn →a.s. X, we have P(lim sup En ) = 0.
By Fatou’s lemma,
lim sup P(En ) ≤ P(lim sup En ) = 0.
The conclusion follows.
(b) The fact that Lp convergence implies convergence in probability follows
from the Chebyshev inequality
\[
P\{|X_n - X| > \varepsilon\} \le \frac{E|X_n - X|^p}{\varepsilon^p}.
\]
(c) To show that convergence in probability implies convergence in distribution, fix
z ∈ R and ε > 0. Since Xn ≤ z and |Xn − X| < ε imply X ≤ z + ε, we have
P{Xn ≤ z} ≤ P{X ≤ z + ε} + P{|Xn − X| ≥ ε}.
Since Xn →p X, lim sup P{Xn ≤ z} ≤ P{X ≤ z + ε}. Letting ε ↓ 0, we have
lim sup P{Xn ≤ z} ≤ P{X ≤ z}.
Similarly, using the fact that X < z − ε and |Xn − X| < ε imply Xn < z, we can
show that
lim inf P{Xn ≤ z} ≥ P{X < z}.
If P{X = z} = 0, then P{X ≤ z} = P{X < z}. Hence
lim P{Xn ≤ z} = P{X ≤ z}
at every continuity point z of FX . This establishes Xn →d X.
The other directions of the theorem do not hold. Moreover, a.s. convergence does not
imply Lp convergence, nor does the latter imply the former. Here are a couple of
counterexamples.
Counter Examples Consider the probability space ([0, 1], B([0, 1]), µ), where µ
is the Lebesgue measure and B([0, 1]) is the Borel field on [0, 1]. On this space one can
construct random variables Xn and Yn such that Xn → 0 a.s. while EXnp = 1 for all n,
and, in contrast, EYnp → 0 while Yn (ω) does not converge for any ω ∈ [0, 1].
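One standard pair of constructions with these properties is Xn = n^{1/p} 1_{[0,1/n]} and the "moving indicator" sequence Yn = 1_{[j2^{-k},(j+1)2^{-k}]} with n = 2^k + j. The Python sketch below simulates this particular choice (an assumption for illustration, not necessarily the book's construction) with p = 1.

# Counterexamples on ([0,1], Borel, Lebesgue): draw omega ~ Uniform[0,1].
import numpy as np

rng = np.random.default_rng(4)
omega = rng.uniform(size=100_000)
p = 1.0

def X(n, w):
    # X_n = n^{1/p} on [0, 1/n], zero elsewhere: X_n -> 0 a.s., but E X_n^p = 1.
    return n ** (1.0 / p) * (w <= 1.0 / n)

def Y(n, w):
    # "Moving indicator": n = 2^k + j, Y_n = 1 on [j/2^k, (j+1)/2^k]:
    # E Y_n^p = 2^{-k} -> 0, but Y_n(w) = 1 infinitely often for every w.
    k = int(np.floor(np.log2(n)))
    j = n - 2 ** k
    return ((j / 2 ** k <= w) & (w <= (j + 1) / 2 ** k)).astype(float)

for n in (10, 100, 1000):
    print(n, np.mean(X(n, omega) ** p), np.mean(Y(n, omega) ** p))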
It also follows from the above counterexamples that convergence in probability does
not imply a.s. convergence. If it did, we would have →Lp ⇒ →p ⇒ →a.s. , which the
second counterexample rules out. But we do have the following partial converse.
Theorem 6.1.9 If Xn →p X, then there exists a subsequence Xnk such that Xnk →a.s.
X.
Proof: Since Xn →p X, we may choose a subsequence (Xnk ) such that
P{|Xnk − X| > 2−k } ≤ 2−k for every k.
For any ε > 0, P{|Xnk − X| > ε} ≤ 2−k once 2−k ≤ ε, so that
\[
\sum_{k=1}^{\infty} P\{|X_{n_k} - X| > \varepsilon\} < \infty.
\]
By the Borel-Cantelli lemma, P{|Xnk − X| > ε i.o.} = 0 for every ε > 0, and hence
Xnk →a.s. X.
Convergence in distribution to a constant, moreover, implies convergence in probability:
if Xn →d c, where c is a constant, then Xn →p c.
Proof: Let f (x) = I|x−c|>ε for any ε > 0. Since f is bounded and continuous at c (and
hence continuous a.s. in the point mass Pc ) and Xn →d c, we have
Ef (Xn ) = P{|Xn − c| > ε} → Ef (c) = 0.
Theorem (Continuous Mapping) Let f be continuous a.s. in PX . Then
(a) if Xn →a.s. X, then f (Xn ) →a.s. f (X);
(b) if Xn →p X, then f (Xn ) →p f (X);
(c) if Xn →d X, then f (Xn ) →d f (X).
Theorem (Slutsky) If Xn →p c, where c is a constant, and Yn →d Y , then
(a) Xn Yn →d cY ,
(b) Xn + Yn →d c + Y .
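A simulation can make the Slutsky theorem concrete: if Sn →p σ and √n(X̄n − µ)/σ →d N(0,1), then the studentized mean √n(X̄n − µ)/Sn also converges in distribution to N(0,1). The Python sketch below compares a few simulated quantiles with the standard normal; the exponential sampling design is an assumption for illustration.

# Slutsky: sqrt(n)*(xbar - mu)/S ->d N(0,1) because S ->p sigma.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
mu, n, reps = 2.0, 200, 20_000
x = rng.exponential(scale=mu, size=(reps, n))     # mean mu, standard deviation mu

xbar = x.mean(axis=1)
s = x.std(axis=1, ddof=1)                         # S ->p sigma = mu
t = np.sqrt(n) * (xbar - mu) / s

for q in (0.05, 0.5, 0.95):                       # compare quantiles with N(0,1)
    print(q, np.quantile(t, q), stats.norm.ppf(q))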
6.1.2 Small o and Big O Notations
Definition 6.1.13 (Small o and Big O) Let (xn ) and (yn ) be sequences of real
numbers, and let (an ) and (bn ) be sequences of positive real numbers. We write
(a) xn = o(an ) if xn /an → 0 as n → ∞;
(b) yn = O(bn ) if there exists a constant M > 0 such that |yn /bn | < M for all
large n.
Remarks:
• We may write o(an ) = an o(1) and O(bn ) = bn O(1). However, these are not
equalities in the usual sense. It is understood that o(1) = O(1) but O(1) ≠ o(1).
• For yn = O(1), it suffices to have |yn | < M for all large n. If |yn | < M for all
n > N , then we have |yn | < M ∗ for all n, where M ∗ = max{|y1 |, . . . , |yN |, M }.
• O(o(1)) = o(1).
Proof: Let xn = o(1) and yn = O(xn ). It follows from |yn /xn | < M that
|yn | < M |xn | → 0.
• o(O(1)) = o(1).
Proof: Let xn = O(1) and yn = o(xn ). Since |xn | < M , we have
|yn | = |xn | · |yn /xn | ≤ M |yn /xn | → 0.
• o(1)O(1) = o(1).
Proof: Let xn = o(1) and yn = O(1). Then |xn yn | ≤ M |xn | → 0.
• In general, analogous rules hold for arbitrary rates, e.g., o(an )O(bn ) = o(an bn )
and O(an )O(bn ) = O(an bn ).
In probability, we have the following analogues. Let (Xn ) and (Yn ) be sequences of
random variables, and let (an ) and (bn ) be sequences of positive real numbers. We write
(a) Xn = op (an ) if Xn /an →p 0;
(b) Yn = Op (bn ) if for any ε > 0, there exist a constant M > 0 and n0 (ε) such
that P(|Yn /bn | > M ) < ε for all n ≥ n0 (ε).
If we take an = bn = 1 for all n, then Xn = op (1) means Xn →p 0, and Yn = Op (1)
means that for any ε > 0, there exists M > 0 such that P(|Yn | > M ) < ε for all large
n. In the latter case, we say that Yn is stochastically bounded.
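For instance, the LLN and CLT below imply X̄n − µ = op(1) and X̄n − µ = Op(n^{-1/2}). The Python sketch below (a Bernoulli design chosen only for illustration) checks stochastic boundedness of √n(X̄n − µ) by tracking a high quantile of its simulated distribution across n.

# X_bar_n - mu = O_p(n^{-1/2}): sqrt(n)*(X_bar_n - mu) stays stochastically bounded.
import numpy as np

rng = np.random.default_rng(6)
mu, reps = 0.5, 2_000
for n in (10, 100, 1000, 10_000):
    xbar = rng.binomial(1, mu, size=(reps, n)).mean(axis=1)   # Bernoulli(mu) sample means
    scaled = np.sqrt(n) * (xbar - mu)
    # |xbar - mu| shrinks to zero, while the 99th percentile of |sqrt(n)(xbar - mu)|
    # stabilizes instead of diverging with n.
    print(n, np.quantile(np.abs(xbar - mu), 0.99), np.quantile(np.abs(scaled), 0.99))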
Analogous to the case of real sequences, we have the following results.
Lemma 6.1.15 We have (a) Op (op (1)) = op (1), (b) op (Op (1)) = op (1), and (c)
op (1)Op (1) = op (1).
Proof: (a) Let Xn = op (1) and Yn = Op (Xn ); we show that Yn = op (1). For any
ε > 0 and M > 0, since |Yn |/|Xn | ≤ M and |Xn | ≤ εM −1 imply |Yn | ≤ ε, we have
{|Yn | ≤ ε} ⊃ {|Yn | ≤ |Xn |M } ∩ {|Xn | ≤ εM −1 }. Taking complements, we have
{|Yn | > ε} ⊂ {|Yn |/|Xn | > M } ∪ {|Xn | > εM −1 }.
Thus
P{|Yn | > ε} ≤ P{|Yn |/|Xn | > M } + P{|Xn | > εM −1 }.
This holds for any M > 0. Since Yn = Op (Xn ), we may choose M so that the first
term on the right is arbitrarily small for all large n; and since εM −1 is then a fixed
constant and Xn = op (1), the second term goes to zero. Thus P{|Yn | > ε} → 0,
i.e., Yn = op (1).
(b) Let Xn = Op (1) and Yn = op (Xn ); we show that Yn = op (1). By a similar
argument as above, we have for any ε > 0 and M > 0,
P{|Yn | > ε} ≤ P{|Yn |/|Xn | > εM −1 } + P{|Xn | > M }.
The first term on the right goes to zero since Yn = op (Xn ), and the second term can
be made arbitrarily small by choosing a large M .
(c) Left for exercise.
In addition, we have the following results. Suppose Xn →d X. Then
(a) Xn = Op (1), and
(b) Xn + op (1) →d X.
Proof: (a) For any ε > 0, we can find a sufficiently large M such that P(|X| > M ) < ε,
since {|X| > M } ↓ ∅ as M ↑ ∞. Let f (x) = I|x|>M , with M chosen so that
P{|X| = M } = 0. Since Xn →d X and f is bounded and continuous a.s., we have
E(f (Xn )) = P(|Xn | > M ) → Ef (X) = P(|X| > M ) < ε. Therefore, P(|Xn | > M ) < ε
for all large n.
(b) Let Yn = op (1), and let f be any uniformly continuous and bounded function,
with M = sup |f (x)|. For any ε > 0, there exists a δ > 0 such that |Yn | ≤ δ implies
|f (Xn + Yn ) − f (Xn )| ≤ ε. Hence
|f (Xn + Yn ) − f (Xn )|
= |f (Xn + Yn ) − f (Xn )| · I|Yn |≤δ + |f (Xn + Yn ) − f (Xn )| · I|Yn |>δ
≤ ε + 2M I|Yn |>δ ,
and therefore
E|f (Xn + Yn ) − f (Xn )| ≤ ε + 2M P{|Yn | > δ}.
Then we have
|Ef (Xn + Yn ) − Ef (X)| ≤ ε + 2M P{|Yn | > δ} + |Ef (Xn ) − Ef (X)|.
The third term goes to zero since Xn →d X, the second term goes to zero since
Yn = op (1), and ε > 0 is arbitrary. Hence Ef (Xn + Yn ) → Ef (X), and the conclusion
follows from Lemma 6.1.6.
6.2 Limit Theorems
6.2.1 Law of Large Numbers
Theorem 6.2.1 (Weak LLN) If X1 , . . . , Xn are i.i.d. with mean µ < ∞, then
\[
\frac{1}{n}\sum_{i=1}^n X_i \to_p \mu.
\]
Proof: We only prove the case when var(Xi ) < ∞. The general proof is more
involved. The theorem follows easily from
\[
E\left(\frac{1}{n}\sum_{i=1}^n X_i - \mu\right)^2
= E\left(\frac{1}{n}\sum_{i=1}^n (X_i - \mu)\right)^2
= \frac{1}{n}E(X_i - \mu)^2 \to 0,
\]
which shows that the sample mean converges to µ in L2 and hence in probability.
Theorem 6.2.2 (Strong LLN) If X1 , . . . , Xn are i.i.d. with mean µ < ∞, then
\[
\frac{1}{n}\sum_{i=1}^n X_i \to_{a.s.} \mu.
\]
The general proof is involved. Here we prove the case when EXi4 < ∞. Assume
without loss of generality that µ = 0. We have
\[
E\left(\frac{1}{n}\sum_{i=1}^n X_i\right)^4
= \frac{1}{n^4}\left(\sum_{i=1}^n EX_i^4 + 3\sum_{i \ne j} EX_i^2\, EX_j^2\right)
= n^{-3}EX_i^4 + 3\,\frac{n(n-1)}{n^4}\,EX_i^2\, EX_j^2
= O(n^{-2}).
\]
This implies $E\sum_{n=1}^{\infty}\left(\frac{1}{n}\sum_{i=1}^n X_i\right)^4 < \infty$, which further implies
$\sum_{n=1}^{\infty}\left(\frac{1}{n}\sum_{i=1}^n X_i\right)^4 < \infty$ a.s. Then we have
\[
\frac{1}{n}\sum_{i=1}^n X_i \to_{a.s.} 0.
\]
Without proof, we also state a strong LLN, due to Kolmogorov, that requires only
independence rather than i.i.d. sampling.
The first application of the LLN is in recovering the probability p of getting a head
in the coin-tossing experiment. Define Xi = 0 when we get a tail in the i-th toss and
Xi = 1 when we get a head. Then the LLN guarantees that $\frac{1}{n}\sum_{i=1}^n X_i$ converges
to EXi = p · 1 + (1 − p) · 0 = p. This convergence of the sample frequency to the
probability is, indeed, the basis of the “frequentist” interpretation of probability.
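A minimal simulation of this frequentist idea is sketched below in Python; the true p and the sample sizes are arbitrary choices for illustration.

# LLN for coin tossing: the sample frequency of heads converges to p.
import numpy as np

rng = np.random.default_rng(7)
p = 0.3
for n in (10, 100, 10_000, 1_000_000):
    tosses = rng.binomial(1, p, size=n)   # X_i = 1 for head, 0 for tail
    print(n, tosses.mean())               # sample frequency approaches p = 0.3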
Sometimes we need an LLN for measurable functions of random variables, say, g(Xi , θ),
where θ is a non-random parameter vector taking values in Θ. The uniform LLN
establishes that $\frac{1}{n}\sum_{i=1}^n g(X_i, \theta)$ converges in some sense uniformly in θ ∈ Θ.
6.2.2 Central Limit Theorem
The Lindeberg-Feller CLT states that if (Xin ) is an independent double array with
µi = EXin and $\sigma_n^2 = \sum_{i=1}^n \mathrm{var}(X_{in})$, and if the Lindeberg condition holds, then
\[
\frac{\sum_{i=1}^n (X_{in} - \mu_i)}{\sigma_n} \to_d N(0, 1).
\]
To see that the Liapounov condition is stronger than the Lindeberg condition, let
ξin = (Xin − µi )/σn . We have, for any ε > 0,
\[
\sum_{i=1}^n E\xi_{in}^2 I_{|\xi_{in}|>\varepsilon} \le \frac{\sum_{i=1}^n E|\xi_{in}|^3}{\varepsilon}.
\]
Theorem (Lindeberg-Levy CLT) If X1 , . . . , Xn are i.i.d. with mean zero and
variance σ 2 < ∞, then $\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i \to_d N(0, \sigma^2)$.
Proof: Let Yin = Xi /√n. Then (Yin ) is an independent double array with µi = 0,
σi2 = σ 2 /n, and σn2 = σ 2 . It suffices to check the Lindeberg condition:
\[
\frac{1}{\sigma_n^2}\sum_{i=1}^n EY_{in}^2 I_{|Y_{in}|>\varepsilon\sigma_n}
= \frac{1}{\sigma^2}\, EX_i^2 I_{|X_i|>\varepsilon\sigma\sqrt{n}} \to 0
\]
by the dominated convergence theorem. Note that $Z_n = X_i^2 I_{|X_i|>\varepsilon\sigma\sqrt{n}} \le X_i^2$, EXi2 < ∞,
and Zn (ω) → 0 for all ω ∈ Ω.
6.2.3 Delta Method
Suppose that √n(Tn − θ) →d N (0, Σ) and that f is continuously differentiable at θ.
A first-order Taylor expansion of f (Tn ) around θ implies
\[
\sqrt{n}\,(f(T_n) - f(\theta)) = \nabla f(\theta)'\sqrt{n}\,(T_n - \theta) + o_p(1)
\to_d N\!\left(0, \nabla f(\theta)'\,\Sigma\,\nabla f(\theta)\right).
\]
Example 6.2.7 Let X1 , . . . , Xn be i.i.d. with mean µ and variance σ 2 . By the
central limit theorem, we have
\[
\sqrt{n}\,(\bar{X} - \mu) \to_d N(0, \sigma^2).
\]
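As an illustration of the CLT in Example 6.2.7 together with the delta method, the Python sketch below uses g(x) = x² (an arbitrary choice of smooth function, not from the text) and compares the simulated standard deviation of √n(g(X̄) − g(µ)) with the delta-method value |g′(µ)|σ = 2|µ|σ.

# CLT and delta method: sqrt(n)*(g(xbar) - g(mu)) approx N(0, g'(mu)^2 * sigma^2), g(x) = x^2.
import numpy as np

rng = np.random.default_rng(8)
mu, sigma, n, reps = 1.5, 2.0, 400, 20_000
x = rng.normal(mu, sigma, size=(reps, n))
xbar = x.mean(axis=1)

clt = np.sqrt(n) * (xbar - mu)                 # approx N(0, sigma^2)
delta = np.sqrt(n) * (xbar ** 2 - mu ** 2)     # approx N(0, (2*mu)^2 * sigma^2)
print(clt.std(), sigma)                        # should be close
print(delta.std(), abs(2 * mu) * sigma)        # should be close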
6.3 Asymptotics for Maximum Likelihood Estimation
6.3.1 Consistency of MLE
We first show that the expected log likelihood with respect to P0 is maximized at
θ0 . Let p(xi , θ) and ℓ(xi , θ) denote the likelihood and the log likelihood, respectively.
We consider the following function of θ,
\[
E_0\, \ell(\cdot, \theta) = \int \ell(x, \theta)\, p(x, \theta_0)\, d\mu(x).
\]
We have, by Jensen’s inequality, E0 ℓ(·, θ) ≤ E0 ℓ(·, θ0 ) for all θ ∈ Θ. Under certain
regularity conditions, the MLE θ̂ is consistent, i.e., θ̂ →p θ0 .
Proof: The regularity conditions ensure that the uniform weak LLN applies to
ℓ(Xi , θ), so that
\[
\frac{1}{n}\sum_{i=1}^n \ell(X_i, \theta) \to_p E_0\, \ell(\cdot, \theta)
\]
uniformly in θ ∈ Θ. Since the limit is maximized at θ0 , the conclusion follows.
In the op notation, the consistency of the MLE may be written as
θ̂ = θ0 + op (1).
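As a numerical illustration of these two facts, the Python sketch below uses an assumed exponential-distribution example (not from the text): it evaluates the sample average log likelihood over a grid of θ and locates its maximizer, which settles near the true θ0 as n grows.

# Consistency of MLE: the average log likelihood is maximized near theta_0 for large n.
# Model (assumed for illustration): X_i ~ Exponential with density p(x, theta) = theta * exp(-theta * x).
import numpy as np

rng = np.random.default_rng(9)
theta0 = 2.0
grid = np.linspace(0.5, 5.0, 1000)

def avg_loglik(x, theta):
    return np.log(theta) - theta * x.mean()      # (1/n) sum of log p(x_i, theta)

for n in (20, 200, 20_000):
    x = rng.exponential(scale=1.0 / theta0, size=n)
    theta_hat = grid[np.argmax([avg_loglik(x, t) for t in grid])]
    print(n, theta_hat)                          # approaches theta0 = 2.0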
6.3.2 Asymptotic Normality of MLE
Theorem 6.3.3 Under certain regularity conditions, we have
\[
\sqrt{n}\,(\hat{\theta} - \theta_0) \to_d N\!\left(0, I(\theta_0)^{-1}\right),
\]
where I(θ0 ) is the information matrix.
The rate of the above convergence is √n. Using the Op notation, we may write
θ̂ = θ0 + Op (n−1/2 ).
By Taylor’s expansion of the score around θ0 , together with the first-order condition
$\frac{1}{\sqrt{n}}\sum_{i=1}^n s(X_i, \hat{\theta}) = 0$, we have
\[
0 = \frac{1}{\sqrt{n}}\sum_{i=1}^n s(X_i, \hat{\theta})
= \frac{1}{\sqrt{n}}\sum_{i=1}^n s(X_i, \theta_0)
+ \left(\frac{1}{n}\sum_{i=1}^n h(X_i, \theta_0)\right)\sqrt{n}\,(\hat{\theta} - \theta_0) + o_p(1).
\]
Then
\[
\sqrt{n}\,(\hat{\theta} - \theta_0)
= -\left(\frac{1}{n}\sum_{i=1}^n h(X_i, \theta_0)\right)^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^n s(X_i, \theta_0) + o_p(1)
\to_d N\!\left(0, I(\theta_0)^{-1}\right).
\]
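A simulation can illustrate the theorem. For the exponential model used in the earlier sketch (again an assumption for illustration), I(θ0) = 1/θ0², so √n(θ̂ − θ0) should be approximately N(0, θ0²) for large n.

# Asymptotic normality of the MLE in the Exponential(theta) model: I(theta) = 1/theta^2.
import numpy as np

rng = np.random.default_rng(10)
theta0, n, reps = 2.0, 500, 20_000

x = rng.exponential(scale=1.0 / theta0, size=(reps, n))
theta_hat = 1.0 / x.mean(axis=1)                  # MLE of the rate parameter
z = np.sqrt(n) * (theta_hat - theta0)

print(z.mean(), z.std())                          # roughly 0 and theta0 = 2.0
print(np.quantile(z, 0.975), 1.96 * theta0)       # upper quantile vs normal benchmark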
We now consider testing
H0 : θ = θ0 against H1 : θ ≠ θ0 .
We consider the following three celebrated test statistics:
\[
LR = 2\left(\sum_{i=1}^n \ell(x_i, \hat{\theta}) - \sum_{i=1}^n \ell(x_i, \theta_0)\right),
\]
\[
Wald = \sqrt{n}\,(\hat{\theta} - \theta_0)'\, I(\hat{\theta})\, \sqrt{n}\,(\hat{\theta} - \theta_0),
\]
\[
LM = \left(\frac{1}{\sqrt{n}}\sum_{i=1}^n s(x_i, \theta_0)\right)' I(\theta_0)^{-1}
\left(\frac{1}{\sqrt{n}}\sum_{i=1}^n s(x_i, \theta_0)\right).
\]
LR measures the difference between the restricted and the unrestricted log likelihoods.
Wald measures the difference between the estimated and the hypothesized values of
the parameter. And LM measures the first derivative (the score) of the log likelihood
at the hypothesized value of the parameter. Intuitively, if the null hypothesis holds,
all three quantities should be small. Under H0 (and the regularity conditions above),
all three statistics are asymptotically χ2 -distributed with dim(θ) degrees of freedom.
For the Wald statistic, we may replace I(θ̂) by $\frac{1}{n}\sum_{i=1}^n s(X_i, \hat{\theta})s(X_i, \hat{\theta})'$, by −H(θ̂), or
by $-\frac{1}{n}\sum_{i=1}^n h(X_i, \hat{\theta})$. The asymptotic distribution of the Wald statistic would not be affected.
For the Wald statistic, under regularity conditions I(θ) is continuous at θ = θ0 ,
so that I(θ̂) = I(θ0 ) + op (1). The asymptotic distribution then follows from
$\sqrt{n}\,(\hat{\theta} - \theta_0) \to_d N(0, I(\theta_0)^{-1})$.
The asymptotic distribution of the LM statistic follows from
$\frac{1}{\sqrt{n}}\sum_{i=1}^n s(X_i, \theta_0) \to_d N(0, I(\theta_0))$.
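To make the trio concrete, the Python sketch below computes LR, Wald, and LM for the scalar exponential model used above (again an assumed example, not the text's); under H0 all three values should be comparable and roughly χ²-distributed with one degree of freedom.

# LR, Wald, and LM statistics for H0: theta = theta0 in the Exponential(theta) model.
# Log likelihood l(x, theta) = log(theta) - theta*x; score s = 1/theta - x; info I = 1/theta^2.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
theta0, n = 2.0, 200
x = rng.exponential(scale=1.0 / theta0, size=n)   # data generated under H0
theta_hat = 1.0 / x.mean()                        # MLE

loglik = lambda th: np.sum(np.log(th) - th * x)
LR = 2.0 * (loglik(theta_hat) - loglik(theta0))
Wald = n * (theta_hat - theta0) ** 2 * (1.0 / theta_hat ** 2)
score = np.sum(1.0 / theta0 - x) / np.sqrt(n)
LM = score ** 2 * theta0 ** 2                     # score' * I(theta0)^{-1} * score

crit = stats.chi2.ppf(0.95, df=1)
print(LR, Wald, LM, crit)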
6.4 Exercises
1. Suppose X1 , . . . , Xn are i.i.d. Exponential(1), and define $\bar{X}_n = n^{-1}\sum_{i=1}^n X_i$.
(a) Find the characteristic function of X1 .
(b) Find the characteristic function of $Y_n = \sqrt{n}(\bar{X}_n - 1)$.
(c) Find the limiting distribution of Yn .
4. A random sample of size n is drawn from a normal population with mean θ and
variance θ, i.e., the mean and variance are known to be equal but the common
value is not known. Let $\bar{X}_n = \sum_{i=1}^n X_i/n$, $S_n^2 = \sum_{i=1}^n (X_i - \bar{X}_n)^2/(n-1)$, and
$T_n = \sum_{i=1}^n X_i^2/n$.
(a) Calculate π = plimn→∞ Tn .
(b) Find the maximum-likelihood estimator of θ and show that it is a differentiable
function of Tn .
(c) Find the asymptotic distribution of Tn , i.e., find the limiting distribution of
$\sqrt{n}(T_n - \pi)$.
(d) Derive the asymptotic distribution of the ML estimator by using the delta
method.
(e) Check your answer to part (d) by using the information to calculate the
asymptotic variance of the ML estimator.
(f) Compare the asymptotic efficiencies of the ML estimator, the sample mean
$\bar{X}_n$, and the sample variance $S_n^2$.
Index
L1 , 27
Lp convergence, 84
Lp norm, 32
λ-system, 8
π-system, 8
σ-algebra, 2
σ-field, 2
  generated by random variable, 25
f-moment, 29
a.e., 22
a.s., 22
a.s. convergence, 83
absolutely continuous, 24
algebra, 1
almost everywhere, 22
almost sure convergence, 83
almost surely, 22
alternative hypothesis, 74
asymptotic normality, 97
basis, 57
Bayes formula, 5
Bernoulli distribution, 42
beta distribution, 44
big O, 88, 89
binomial distribution, 42
Borel-Cantelli lemma, 6
bounded convergence theorem, 30
Cauchy distribution, 45
Cauchy-Schwartz inequality, 32
central limit theorem, 93
central moment, 30
change-of-variable theorem, 29
characteristic function, 41
  random vector, 50
Chebyshev inequality, 31
chi-square distribution, 44
CMT, 88
coin tossing, 2
  infinite, 2
complete statistic, 69
composite, 75
conditional density, 37
conditional distribution
  multivariate normal, 51
conditional expectation, 33
conditional probability, 4, 34
consistency
  MLE, 96
continuous function, 19
continuous mapping theorem, 88
convergence in distribution, 84
convergence in probability, 84
convex, 32
correlation, 30
countable subadditivity, 12
covariance, 30
covariance matrix, 31
Cramer-Rao bound, 73
cylinder set, 22
delta method, 94
density, 24
dimension of projection, 58
distribution, 20
  random vector, 22
distribution function, 21
dominated convergence theorem, 28, 30
double array, 93
Dynkin's lemma, 9
empirical distribution, 64
Erlang distribution, 44
estimator, 60
event, 2
expectation, 29
exponential distribution, 43
exponential family, 62
extension
  theorem, 11
  uniqueness, 10
F test, 79
factorial, 44
Fatou's lemma, 28, 30
  probability, 6
field, 1
first order condition, 66
Fisher Information, 71
Fisher-Neyman factorization, 61
gamma distribution, 44
Gaussian distribution, 43
generalized likelihood ratio, 77
generalized method of moments, 65
generated sigma-field, 3
GMM, 65
Hessian, 71
information matrix, 71
integrable, 27
integrand, 23
invariance theorem, 66
Jensen's inequality, 32
joint distribution, 22
joint distribution function, 22
Khinchin, 91
Kolmogorov, 92
Kolmogorov zero-one law, 8
law of large numbers
  Kolmogorov's strong, 92
  strong, 92
  uniform weak, 93
  weak, 91
law of random variable, 20
Lebesgue integral
  counting measure, 23
  nonnegative function, 23
  simple function, 23
Lehmann-Scheffé theorem, 70
Liapounov condition, 93
likelihood function, 65
likelihood ratio, 76
liminf, 6
limsup, 6
Lindeberg condition, 93
Lindeberg-Feller CLT, 93
Lindeberg-Levy CLT, 94
LM, 98
log likelihood, 66
loss function, 68
LR, 98
minimal sufficient statistic, 62
minimax estimator, 69
MLE, 65
moment, 30
moment condition, 64
moment generating function, 41, 47
monotone convergence theorem, 27, 30
monotonicity
  Lp norm, 33
  outer measure, 12
  probability, 3
multinomial distribution, 46
multivariate normal, 49
Neyman-Pearson lemma, 76
normal distribution, 43
null hypothesis, 74
null space, 56
orthogonal projection, 53, 57
outer measure, 12
point probability mass, 21
Poisson distribution, 43
population, 64
population moments, 64
positive semidefinite, 57
power, 74
power function, 74
probability
  measure, 3
  triple, 1
probability density function, 24
projection, 53, 57
quantile, 42
range, 56
Rao-Blackwell theorem, 69
reverse Fatou's lemma, 28, 30
Riemann integral, 24
risk function, 68
sample moment, 64
score function, 66
sigma-algebra, 2
sigma-field, 2
simple, 75
simple function, 19
size, 74
Slutsky theorem, 88
small o, 88
small op, 89
span, 56
stable, 45
standard Cauchy, 45
standard multivariate normal, 49
state space, 60
statistic, 60
stochastically bounded, 89
Student-t test, 78
sufficient, 61
t test, 78
  one-sided, 78
  two-sided, 80
tail field, 8
test statistic, 60
theorem of total probability, 4
type-I error, 75
type-II error, 75
variance, 30
vector space, 56
vector subspace, 56
Wald, 98