Mathematical Statistics II
Spring Semester 2013
Logan, UT 84322–3900
e-mail: [email protected]
Contents

Acknowledgements

6 Limit Theorems
6.1 Modes of Convergence
6.2 Weak Laws of Large Numbers
6.3 Strong Laws of Large Numbers
6.4 Central Limit Theorems

7 Sample Moments
7.1 Random Sampling
7.2 Sample Moments and the Normal Distribution

9 Hypothesis Testing
9.1 Fundamental Notions
9.2 The Neyman–Pearson Lemma
9.3 Monotone Likelihood Ratios
9.4 Unbiased and Invariant Tests

11 Confidence Estimation
11.1 Fundamental Notions
11.2 Shortest–Length Confidence Intervals
11.3 Confidence Intervals and Hypothesis Tests
11.4 Bayes Confidence Intervals

Index
Acknowledgements
I would like to thank my students, Hanadi B. Eltahir, Rich Madsen, and Bill Morphet, who helped during the Fall 1999 and Spring 2000 semesters in typesetting these lecture notes using LaTeX, and for their suggestions on how to improve some of the material presented in class. Thanks are also due to the more than 60 students who took Stat 6710/20 with me since the Fall 2000 semester for their valuable comments that helped to improve and correct these lecture notes.
In addition, I would particularly like to thank Mike Minnotte and Dan Coster, who previously taught this course at Utah State University, for providing me with their lecture notes and other materials related to this course. Their lecture notes, combined with additional material from Casella/Berger (2002), Rohatgi (1976), and other sources listed below, form the basis of the script presented here.
The primary textbook required for this class is:
• Casella, G., and Berger, R. L. (2002): Statistical Inference (Second Edition), Duxbury
Press/Thomson Learning, Pacific Grove, CA.
http://www.math.usu.edu/~symanzik/teaching/2013_stat6720/stat6720.html
This course closely follows Casella and Berger (2002) as described in the syllabus. Additional
material originates from the lectures from Professors Hering, Trenkler, and Gather I have
attended while studying at the Universität Dortmund, Germany, the collection of Masters and
PhD Preliminary Exam questions from Iowa State University, Ames, Iowa, and the following
textbooks:
• Casella, G., and Berger, R. L. (1990): Statistical Inference, Wadsworth & Brooks/Cole,
Pacific Grove, CA.
• Johnson, N. L., and Kotz, S., and Balakrishnan, N. (1994): Continuous Univariate
Distributions, Volume 1 (Second Edition), Wiley, New York, NY.
• Johnson, N. L., and Kotz, S., and Balakrishnan, N. (1995): Continuous Univariate
Distributions, Volume 2 (Second Edition), Wiley, New York, NY.
• Mood, A. M., and Graybill, F. A., and Boes, D. C. (1974): Introduction to the Theory
of Statistics (Third Edition), McGraw-Hill, Singapore.
• Parzen, E. (1960): Modern Probability Theory and Its Applications, Wiley, New York,
NY.
• Tamhane, A. C., and Dunlop, D. D. (2000): Statistics and Data Analysis – From Ele-
mentary to Intermediate, Prentice Hall, Upper Saddle River, NJ.
Additional definitions, integrals, sums, etc. originate from the following formula collections:
6 Limit Theorems
6.1 Modes of Convergence
Definition 6.1.1:
Let X1 , . . . , Xn be iid rv’s with common cdf FX (x). Let T = T (X) be any statistic, i.e., a
Borel–measurable function of X that does not involve the population parameter(s) ϑ, defined
on the support X of X. The induced probability distribution of T (X) is called the sampling
distribution of T (X).
Note:
(ii) Recall that if X_1, ..., X_n are iid and if E(X) and Var(X) exist, then E(X̄_n) = µ = E(X), E(S_n²) = σ² = Var(X), and Var(X̄_n) = σ²/n.

(iii) Recall that if X_1, ..., X_n are iid and if X has mgf M_X(t) or characteristic function Φ_X(t), then M_{X̄_n}(t) = (M_X(t/n))^n or Φ_{X̄_n}(t) = (Φ_X(t/n))^n.
Note:
Let {X_n}_{n=1}^∞ be a sequence of rv's on some probability space (Ω, L, P). Is there any meaning behind the expression lim_{n→∞} X_n = X? Not immediately under the usual definitions of limits. We first need to define modes of convergence for rv's and probabilities.
Definition 6.1.2:
Let {X_n}_{n=1}^∞ be a sequence of rv's with cdf's {F_n}_{n=1}^∞ and let X be a rv with cdf F. If F_n(x) → F(x) at all continuity points of F, we say that X_n converges in distribution to X (X_n →d X), or X_n converges in law to X (X_n →L X), or F_n converges weakly to F (F_n →w F).
Example 6.1.3:
Let X_n ∼ N(0, 1/n). Then

F_n(x) = ∫_{−∞}^x exp(−(1/2) n t²) / √(2π/n) dt
       = ∫_{−∞}^{√n x} exp(−(1/2) s²) / √(2π) ds     (substituting s = √n t)
       = Φ(√n x)

⟹ F_n(x) → 0 for x < 0, F_n(0) = 1/2 for all n, and F_n(x) → 1 for x > 0.

If F_X(x) = { 1, x ≥ 0; 0, x < 0 }, the only point of discontinuity is at x = 0. Everywhere else, Φ(√n x) = F_n(x) → F_X(x), where Φ(z) = P(Z ≤ z) with Z ∼ N(0, 1).

So X_n →d X, where P(X = 0) = 1, or X_n →d 0 since the limiting rv here is degenerate, i.e., it has a Dirac(0) distribution.
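As a quick numerical check of this example (our addition, not part of the original notes), the exact cdf F_n(x) = Φ(√n x) can be evaluated with Python's standard library via math.erf; no simulation is needed:

```python
import math

def std_normal_cdf(z):
    # Phi(z) = (1 + erf(z / sqrt(2))) / 2
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def F_n(x, n):
    # cdf of X_n ~ N(0, 1/n): F_n(x) = Phi(sqrt(n) * x)
    return std_normal_cdf(math.sqrt(n) * x)

# F_n(-0.1, .), F_n(0, .), F_n(0.1, .) approach 0, 1/2, 1 as n grows
for n in [1, 100, 10000]:
    print(n, round(F_n(-0.1, n), 4), F_n(0.0, n), round(F_n(0.1, n), 4))
```

For n = 10000 the values at x = ∓0.1 are already numerically 0 and 1, matching the degenerate Dirac(0) limit away from the discontinuity at x = 0.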
Example 6.1.4:
In this example, the sequence {F_n}_{n=1}^∞ converges pointwise to something that is not a cdf:
Example 6.1.5:
Let {X_n}_{n=1}^∞ be a sequence of rv's such that P(X_n = 0) = 1 − 1/n and P(X_n = n) = 1/n, and let X ∼ Dirac(0), i.e., P(X = 0) = 1.
It is

F_n(x) = { 0, x < 0;  1 − 1/n, 0 ≤ x < n;  1, x ≥ n }

F_X(x) = { 0, x < 0;  1, x ≥ 0 }

It holds that F_n →w F_X, but for k ≥ 1,

E(X_n^k) = n^k · (1/n) = n^{k−1} ↛ E(X^k) = 0.
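A one-line check of the moment computation in this example (our addition): for k ≥ 1, E(X_n^k) = n^k · (1/n) = n^{k−1}, which does not tend to E(X^k) = 0 even though F_n →w F_X:

```python
def moment(n, k):
    # E(X_n^k) for P(X_n = 0) = 1 - 1/n, P(X_n = n) = 1/n, and k >= 1
    return (n ** k) * (1.0 / n)

for n in [10, 100, 1000]:
    print(n, moment(n, 1), moment(n, 2))  # k = 1 stays at 1, k = 2 grows like n
```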
Note:
Convergence in distribution does not say that the Xi ’s are close to each other or to X. It only
means that their cdf’s are (eventually) close to some cdf F . The Xi ’s do not even have to be
defined on the same probability space.
Example 6.1.6:
Let X and {X_n}_{n=1}^∞ be iid N(0, 1). Obviously, X_n →d X but lim_{n→∞} X_n ≠ X.
Theorem 6.1.7:
Let X and {X_n}_{n=1}^∞ be discrete rv's with supports 𝒳 and {𝒳_n}_{n=1}^∞, respectively. Define the countable set A = 𝒳 ∪ ⋃_{n=1}^∞ 𝒳_n = {a_k : k = 1, 2, 3, ...}. Let p_k = P(X = a_k) and p_{nk} = P(X_n = a_k). Then it holds that p_{nk} → p_k ∀k iff X_n →d X.
Theorem 6.1.8:
Let X and {X_n}_{n=1}^∞ be continuous rv's with pdf's f and {f_n}_{n=1}^∞, respectively. If f_n(x) → f(x) for almost all x as n → ∞, then X_n →d X.
Theorem 6.1.9:
Let X and {X_n}_{n=1}^∞ be rv's such that X_n →d X. Let c ∈ IR be a constant. Then it holds:

(i) X_n + c →d X + c.
(ii) cX_n →d cX.
(iii) If a_n → a and b_n → b, then a_n X_n + b_n →d aX + b.

Proof:
Part (iii):
Suppose that a > 0, a_n > 0. Let Y_n = a_n X_n + b_n and Y = aX + b. It is

F_Y(y) = P(Y ≤ y) = P(aX + b ≤ y) = P(X ≤ (y − b)/a) = F_X((y − b)/a).

Likewise,

F_{Y_n}(y) = F_{X_n}((y − b_n)/a_n).

If y is a continuity point of F_Y, then (y − b)/a is a continuity point of F_X. Since a_n → a, b_n → b, and F_{X_n}(x) → F_X(x), it follows that F_{Y_n}(y) → F_Y(y) for every continuity point y of F_Y. Thus, a_n X_n + b_n →d aX + b.
Definition 6.1.10:
Let {X_n}_{n=1}^∞ be a sequence of rv's defined on a probability space (Ω, L, P). We say that X_n converges in probability to a rv X (X_n →p X, P-lim_{n→∞} X_n = X) if

lim_{n→∞} P(|X_n − X| > ε) = 0 ∀ε > 0.

Note:
The following are equivalent:

lim_{n→∞} P(|X_n − X| > ε) = 0 ⟺ lim_{n→∞} P(|X_n − X| ≤ ε) = 1
Theorem 6.1.11:

(i) X_n →p X ⟺ X_n − X →p 0.
(ii) X_n →p X, X_n →p Y ⟹ P(X = Y) = 1.
(iii) X_n →p X, X_m →p X ⟹ X_n − X_m →p 0 as n, m → ∞.
(iv) X_n →p X, Y_n →p Y ⟹ X_n ± Y_n →p X ± Y.
(v) X_n →p X, k ∈ IR a constant ⟹ kX_n →p kX.
(vi) X_n →p k, k ∈ IR a constant ⟹ X_n^r →p k^r ∀r ∈ IN.
(vii) X_n →p a, Y_n →p b, a, b ∈ IR ⟹ X_n Y_n →p ab.
(viii) X_n →p 1 ⟹ X_n^{−1} →p 1.
(ix) X_n →p a, Y_n →p b, a ∈ IR, b ∈ IR − {0} ⟹ X_n / Y_n →p a/b.
(x) X_n →p X, Y an arbitrary rv ⟹ X_n Y →p XY.
(xi) X_n →p X, Y_n →p Y ⟹ X_n Y_n →p XY.
Proof:
See Rohatgi, page 244–245, and Rohatgi/Saleh, page 260–261 for partial proofs.
Theorem 6.1.12:
Let X_n →p X and let g be a continuous function on IR. Then g(X_n) →p g(X).

Proof:
Preconditions:
1.) X rv ⟹ ∀ε > 0 ∃k = k(ε): P(|X| > k) < ε/2
2.) g is continuous on IR

Let
A = {|X| ≤ k} = {ω : |X(ω)| ≤ k}
B = {|X_n − X| < δ} = {ω : |X_n(ω) − X(ω)| < δ}
C = {|g(X_n) − g(X)| < ε} = {ω : |g(X_n(ω)) − g(X(ω))| < ε}
Corollary 6.1.13:

(i) Let X_n →p c, c ∈ IR, and let g be a continuous function on IR. Then g(X_n) →p g(c).
(ii) Let X_n →d X and let g be a continuous function on IR. Then g(X_n) →d g(X).
(iii) Let X_n →d c, c ∈ IR, and let g be a continuous function on IR. Then g(X_n) →d g(c).

Theorem 6.1.14:
X_n →p X ⟹ X_n →d X.
Proof:
X_n →p X ⟺ P(|X_n − X| > ε) → 0 as n → ∞ ∀ε > 0.
It holds:
Theorem 6.1.15:
Let c ∈ IR be a constant. Then it holds:

X_n →d c ⟺ X_n →p c.
Example 6.1.16:
In this example, we will see that

X_n →d X ⇏ X_n →p X

for some rv X. Let X_n be identically distributed rv's and let (X_n, X) have the following joint distribution:

            X_n = 0    X_n = 1
X = 0          0         1/2      | 1/2
X = 1         1/2         0       | 1/2
              1/2        1/2      |  1

Since X_n and X have the same marginal distribution, X_n →d X. But |X_n − X| = 1 with probability 1, so P(|X_n − X| > ε) = 1 for any 0 < ε < 1, and therefore X_n ↛p X.
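A simulation sketch of this construction (our addition): take X ∼ Bin(1, 1/2) and X_n = 1 − X, which realizes exactly the joint distribution above — identical marginals, but |X_n − X| = 1 with probability 1:

```python
import random

rng = random.Random(1)
N = 100_000
xs = [rng.randint(0, 1) for _ in range(N)]   # X ~ Bin(1, 1/2)
xns = [1 - x for x in xs]                    # X_n = 1 - X: same marginal distribution

p_hat_x = sum(xs) / N
p_hat_xn = sum(xns) / N
frac_far = sum(1 for a, b in zip(xns, xs) if abs(a - b) > 0.5) / N

print(round(p_hat_x, 3), round(p_hat_xn, 3), frac_far)
```

The two estimated marginal means agree (≈ 1/2), so X_n →d X trivially, yet |X_n − X| > 1/2 in every single realization: convergence in distribution without convergence in probability.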
Theorem 6.1.17:
Let {X_n}_{n=1}^∞ and {Y_n}_{n=1}^∞ be sequences of rv's and X be a rv defined on a probability space (Ω, L, P). Then it holds:

Y_n →d X, |X_n − Y_n| →p 0 ⟹ X_n →d X.

Proof:
Similar to the proof of Theorem 6.1.14. See also Rohatgi, page 253, Theorem 14, and Rohatgi/Saleh, page 269, Theorem 14.
Theorem 6.1.18:

(i) X_n →d X, Y_n →p c ⟹ X_n + Y_n →d X + c.
(ii) X_n →d X, Y_n →p c ⟹ X_n Y_n →d cX. If c = 0, then also X_n Y_n →p 0.
(iii) X_n →d X, Y_n →p c ⟹ X_n / Y_n →d X/c if c ≠ 0.
Proof:
(i) Y_n →p c ⟺ (Th. 6.1.11(i)) Y_n − c →p 0
⟹ (X_n + Y_n) − (X_n + c) = Y_n + (X_n − X_n) − c = Y_n − c →p 0 (A)
X_n →d X ⟹ (Th. 6.1.9(i)) X_n + c →d X + c (B)
By Theorem 6.1.17, (A) and (B) give X_n + Y_n →d X + c.

(ii) Case c = 0:
∀ε > 0 ∀k > 0, it is

P(|X_n Y_n| > ε) = P(|X_n Y_n| > ε, |Y_n| ≤ ε/k) + P(|X_n Y_n| > ε, |Y_n| > ε/k)
                 ≤ P(|X_n| > ε/(ε/k)) + P(|Y_n| > ε/k)
                 = P(|X_n| > k) + P(|Y_n| > ε/k)

Since X_n →d X and Y_n →p 0, it follows for any fixed k > 0:

Case c ≠ 0:
Since X_n →d X and Y_n →p c, it follows from (ii), case c = 0, that X_n Y_n − cX_n = X_n(Y_n − c) →p 0.
Since cX_n →d cX by Theorem 6.1.9(ii), it follows from Theorem 6.1.17 (applied with Y_n replaced by cX_n):

X_n Y_n →d cX

(iii) Let Z_n →p 1 and let Y_n = cZ_n.
⟹ (c ≠ 0) 1/Y_n = (1/Z_n) · (1/c)
⟹ (Th. 6.1.11(v),(viii)) 1/Y_n →p 1/c
Now apply (ii) to X_n and 1/Y_n.
Definition 6.1.19:
Let {X_n}_{n=1}^∞ be a sequence of rv's such that E(|X_n|^r) < ∞ for some r > 0. We say that X_n converges in the r-th mean to a rv X (X_n →r X) if E(|X|^r) < ∞ and

lim_{n→∞} E(|X_n − X|^r) = 0.
Example 6.1.20:
Let {X_n}_{n=1}^∞ be a sequence of rv's defined by P(X_n = 0) = 1 − 1/n and P(X_n = 1) = 1/n.
It is E(|X_n|^r) = 1/n → 0 ∀r > 0. Therefore, X_n →r 0 ∀r > 0.
Note:
The special cases r = 1 and r = 2 are called convergence in absolute mean for r = 1 (X_n →1 X) and convergence in mean square for r = 2 (X_n →ms X or X_n →2 X).
Theorem 6.1.21:
Assume that X_n →r X for some r > 0. Then X_n →p X.
Proof:
Using Markov's Inequality (Corollary 3.5.2), it holds for any ε > 0:
Example 6.1.22:
Let {X_n}_{n=1}^∞ be a sequence of rv's defined by P(X_n = 0) = 1 − 1/n^r and P(X_n = n) = 1/n^r for some r > 0.
For any ε > 0, P(|X_n| > ε) → 0 as n → ∞; so X_n →p 0.
For 0 < s < r, E(|X_n|^s) = n^s/n^r = 1/n^{r−s} → 0 as n → ∞; so X_n →s 0. But E(|X_n|^r) = n^r/n^r = 1 ↛ 0 as n → ∞; so X_n ↛r 0.
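The moments in this example can be tabulated exactly (our addition): E(|X_n|^s) = n^s · n^{−r} = n^{s−r}, so the s-th absolute moment vanishes for s < r but stays at 1 for s = r:

```python
def abs_moment(n, r, s):
    # E(|X_n|^s) for P(X_n = 0) = 1 - n^(-r), P(X_n = n) = n^(-r)
    return (n ** s) * (n ** (-r))

r = 2.0
for n in [10, 100, 1000]:
    # s = 1 < r: E|X_n| = 1/n -> 0;  s = r = 2: E|X_n|^2 = 1 for every n
    print(n, abs_moment(n, r, s=1.0), abs_moment(n, r, s=2.0))
```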
Theorem 6.1.23:
If X_n →r X, then it holds:

(i) lim_{n→∞} E(|X_n|^r) = E(|X|^r).
(ii) X_n →s X for 0 < s < r.
Proof:
For r > 1, it follows from Minkowski's Inequality (Theorem 4.8.3):

[E(|X − X_n + X_n|^r)]^{1/r} ≤ [E(|X − X_n|^r)]^{1/r} + [E(|X_n|^r)]^{1/r}
⟹ [E(|X|^r)]^{1/r} − [E(|X_n|^r)]^{1/r} ≤ [E(|X − X_n|^r)]^{1/r}
⟹ [E(|X|^r)]^{1/r} − lim_{n→∞} [E(|X_n|^r)]^{1/r} ≤ lim_{n→∞} [E(|X_n − X|^r)]^{1/r} = 0 since X_n →r X
⟹ [E(|X|^r)]^{1/r} ≤ lim_{n→∞} [E(|X_n|^r)]^{1/r} (C)

Similarly,

[E(|X_n − X + X|^r)]^{1/r} ≤ [E(|X_n − X|^r)]^{1/r} + [E(|X|^r)]^{1/r}
⟹ lim_{n→∞} [E(|X_n|^r)]^{1/r} − [E(|X|^r)]^{1/r} ≤ lim_{n→∞} [E(|X_n − X|^r)]^{1/r} = 0 since X_n →r X
⟹ lim_{n→∞} [E(|X_n|^r)]^{1/r} ≤ [E(|X|^r)]^{1/r} (D)

(ii) For 0 < s < r, it follows from Lyapunov's Inequality (Theorem 3.5.4):

[E(|X_n − X|^s)]^{1/s} ≤ [E(|X_n − X|^r)]^{1/r}
⟹ E(|X_n − X|^s) ≤ [E(|X_n − X|^r)]^{s/r}
⟹ lim_{n→∞} E(|X_n − X|^s) ≤ lim_{n→∞} [E(|X_n − X|^r)]^{s/r} = 0 since X_n →r X
⟹ X_n →s X
Note that our proof of Theorem 3.5.4 only covers the case 1 ≤ s < r, but an alternative
proof shows that the result generally holds for 0 < s < r.
Definition 6.1.24:
Let {X_n}_{n=1}^∞ be a sequence of rv's on (Ω, L, P). We say that X_n converges almost surely to a rv X (X_n →a.s. X), or X_n converges with probability 1 to X (X_n →w.p.1 X), or X_n converges strongly to X, iff

P({ω : lim_{n→∞} X_n(ω) = X(ω)}) = 1.

Note:
An interesting characterization of convergence with probability 1 and convergence in probability can be found in Parzen (1960), "Modern Probability Theory and Its Applications", on page 416 (see Handout).
Example 6.1.25:
Let Ω = [0, 1] and P a uniform distribution on Ω. Let X_n(ω) = ω + ω^n and X(ω) = ω.
Theorem 6.1.26:
X_n →a.s. X ⟹ X_n →p X.

Proof:
Choose ε > 0 and δ > 0. Find n_0 = n_0(ε, δ) such that

P( ⋂_{n=n_0}^∞ {|X_n − X| ≤ ε} ) ≥ 1 − δ.
Example 6.1.27:
X_n →p X ⇏ X_n →a.s. X:
Let Ω = (0, 1] with P the uniform distribution, and let X_n = I_{A_n} for the intervals

A_1 = (0, 1/2], A_2 = (1/2, 1]
A_3 = (0, 1/4], A_4 = (1/4, 1/2], A_5 = (1/2, 3/4], A_6 = (3/4, 1]
A_7 = (0, 1/8], A_8 = (1/8, 1/4], ...

Since P(A_n) → 0, it holds that X_n →p 0. But P({ω : X_n(ω) → 0}) = 0 (and not 1) because any ω keeps being in some A_n beyond any n_0, i.e., X_n(ω) looks like 0...010...010...010..., so X_n ↛a.s. 0.
Example 6.1.28:
X_n →r X ⇏ X_n →a.s. X:
Let X_n be independent rv's such that P(X_n = 0) = 1 − 1/n and P(X_n = 1) = 1/n.
It is E(|X_n − 0|^r) = E(|X_n|^r) = E(|X_n|) = 1/n → 0 as n → ∞, so X_n →r 0 ∀r > 0 (and due to Theorem 6.1.21, also X_n →p 0).
But

P(X_n = 0 ∀m ≤ n ≤ n_0) = ∏_{n=m}^{n_0} (1 − 1/n) = ((m−1)/m)(m/(m+1))((m+1)/(m+2)) ··· ((n_0−2)/(n_0−1))((n_0−1)/n_0) = (m−1)/n_0

As n_0 → ∞, it is P(X_n = 0 ∀m ≤ n ≤ n_0) → 0 ∀m, so X_n ↛a.s. 0.
Example 6.1.29:
X_n →a.s. X ⇏ X_n →r X:

But E(|X_n − 0|^r) = n^r / ln n → ∞ ∀r > 0, so X_n ↛r X.
6.2 Weak Laws of Large Numbers

Theorem 6.2.1:
Let {X_i}_{i=1}^∞ be a sequence of iid rv's with E(X_i) = µ and Var(X_i) = σ² < ∞. Then it holds:

lim_{n→∞} P(|X̄_n − µ| ≥ ε) = 0 ∀ε > 0,

i.e., X̄_n →p µ.
Proof:
Note:
For iid rv’s with finite variance, X n is consistent for µ.
Definition 6.2.2:
Let {X_i}_{i=1}^∞ be a sequence of rv's. Let T_n = Σ_{i=1}^n X_i. We say that {X_i} obeys the WLLN with respect to a sequence of norming constants {B_i}_{i=1}^∞, B_i > 0, B_i ↑ ∞, if there exists a sequence of centering constants {A_i}_{i=1}^∞ such that

B_n^{−1}(T_n − A_n) →p 0.
Theorem 6.2.3:
Let {X_i}_{i=1}^∞ be a sequence of pairwise uncorrelated rv's with E(X_i) = µ_i and Var(X_i) = σ_i², i ∈ IN. If Σ_{i=1}^n σ_i² → ∞ as n → ∞, we can choose A_n = Σ_{i=1}^n µ_i and B_n = Σ_{i=1}^n σ_i² and get

Σ_{i=1}^n (X_i − µ_i) / Σ_{i=1}^n σ_i² →p 0.

Proof:
By Markov's Inequality (Corollary 3.5.2), it holds for all ε > 0:

P( |Σ_{i=1}^n X_i − Σ_{i=1}^n µ_i| > ε Σ_{i=1}^n σ_i² ) ≤ E((Σ_{i=1}^n (X_i − µ_i))²) / (ε² (Σ_{i=1}^n σ_i²)²) = Σ_{i=1}^n σ_i² / (ε² (Σ_{i=1}^n σ_i²)²) = 1 / (ε² Σ_{i=1}^n σ_i²) → 0 as n → ∞
Note:
To obtain Theorem 6.2.1, we choose An = nµ and Bn = nσ 2 .
Theorem 6.2.4:
Let {X_i}_{i=1}^∞ be a sequence of rv's. Let X̄_n = (1/n) Σ_{i=1}^n X_i. A necessary and sufficient condition for {X_i} to obey the WLLN with respect to B_n = n is that

E( X̄_n² / (1 + X̄_n²) ) → 0

as n → ∞.

Proof:
Rohatgi, page 258, Theorem 2, and Rohatgi/Saleh, page 275, Theorem 2.
Example 6.2.5:
Let (X_1, ..., X_n) be jointly normal with E(X_i) = 0, E(X_i²) = 1 for all i, and Cov(X_i, X_j) = ρ if |i − j| = 1 and Cov(X_i, X_j) = 0 if |i − j| > 1. Let T_n = Σ_{i=1}^n X_i. Then T_n ∼ N(0, n + 2(n−1)ρ) = N(0, σ²). It is

E( X̄_n² / (1 + X̄_n²) ) = E( T_n² / (n² + T_n²) )
= (2 / (√(2π) σ)) ∫_0^∞ e^{−x²/(2σ²)} x² / (n² + x²) dx     | y = x/σ, dy = dx/σ
= (2 / √(2π)) ∫_0^∞ e^{−y²/2} σ²y² / (n² + σ²y²) dy
= (2 / √(2π)) ∫_0^∞ e^{−y²/2} (n + 2(n−1)ρ)y² / (n² + (n + 2(n−1)ρ)y²) dy
≤ ((n + 2(n−1)ρ) / n²) ∫_0^∞ (2/√(2π)) y² e^{−y²/2} dy     [the integral equals 1, being the variance of a N(0, 1) distribution]
→ 0 as n → ∞

⟹ X̄_n →p 0
Note:
We would like to have a WLLN that just depends on means but does not depend on the existence of finite variances. To approach this, we consider the following:

Let {X_i}_{i=1}^∞ be a sequence of rv's. Let T_n = Σ_{i=1}^n X_i. We truncate each |X_i| at c > 0 and get

X_i^c = { X_i, |X_i| ≤ c;  0, otherwise }

Let T_n^c = Σ_{i=1}^n X_i^c and m_n = Σ_{i=1}^n E(X_i^c).

Lemma 6.2.6:
For T_n, T_n^c and m_n as defined in the Note above, it holds:

P(|T_n − m_n| > ε) ≤ P(|T_n^c − m_n| > ε) + Σ_{i=1}^n P(|X_i| > c) ∀ε > 0
Proof:
Note:
If the Xi ’s are identically distributed, then
Theorem 6.2.7: Khintchine's WLLN
Let {X_i}_{i=1}^∞ be a sequence of iid rv's with finite mean E(X_i) = µ. Then it holds:

X̄_n = (1/n) T_n →p µ

Proof:
If we take c = n and replace ε by nε in (∗) in the Note above, we get

P( |T_n − m_n| / n > ε ) = P( |T_n − m_n| > nε ) ≤ E((X_1^n)²) / (n ε²) + n P(|X_1| > n).
Note:
Theorem 6.2.7 meets the previously stated goal of not having a finite variance requirement.
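A Monte Carlo sketch of Khintchine's WLLN (our addition; the Exp(1) population with µ = 1 is an arbitrary choice, not from the notes). The estimated probability P(|X̄_n − µ| ≥ ε) shrinks toward 0 as n grows:

```python
import random

def prob_deviation(n, mu, eps, reps, rng):
    # Monte Carlo estimate of P(|X_bar_n - mu| >= eps) for Exp(1) samples (mu = 1)
    count = 0
    for _ in range(reps):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        if abs(xbar - mu) >= eps:
            count += 1
    return count / reps

rng = random.Random(42)
for n in [10, 100, 1000]:
    print(n, prob_deviation(n, mu=1.0, eps=0.2, reps=1000, rng=rng))
```

The printed probabilities decrease toward 0, which is exactly the statement X̄_n →p µ.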
6.3 Strong Laws of Large Numbers
Definition 6.3.1:
Let {X_i}_{i=1}^∞ be a sequence of rv's. Let T_n = Σ_{i=1}^n X_i. We say that {X_i} obeys the SLLN with respect to a sequence of norming constants {B_i}_{i=1}^∞, B_i > 0, B_i ↑ ∞, if there exists a sequence of centering constants {A_i}_{i=1}^∞ such that

B_n^{−1}(T_n − A_n) →a.s. 0.
Note:
Unless otherwise specified, we will only use the case that Bn = n in this section.
Theorem 6.3.2:
X_n →a.s. X ⟺ lim_{n→∞} P( sup_{m≥n} |X_m − X| > ε ) = 0 ∀ε > 0.

Proof: (see also Rohatgi, page 249, Theorem 11)
WLOG, we can assume that X = 0 since X_n →a.s. X implies X_n − X →a.s. 0. Thus, we have to prove:

X_n →a.s. 0 ⟺ lim_{n→∞} P( sup_{m≥n} |X_m| > ε ) = 0 ∀ε > 0

Choose ε > 0 and define

A_n(ε) = { sup_{m≥n} |X_m| > ε }
C = { lim_{n→∞} X_n = 0 }

"⟹":
Since X_n →a.s. 0, we know that P(C) = 1 and therefore P(C^c) = 0.
Let B_n(ε) = C ∩ A_n(ε). Note that B_{n+1}(ε) ⊆ B_n(ε) and for the limit set ⋂_{n=1}^∞ B_n(ε) = Ø. It follows that

lim_{n→∞} P(B_n(ε)) = P( ⋂_{n=1}^∞ B_n(ε) ) = 0.

We also have

P(B_n(ε)) = P(A_n ∩ C) = 1 − P(C^c ∪ A_n^c) = 1 − P(C^c) − P(A_n^c) + P(C^c ∩ A_n^c) = P(A_n),

since P(C^c) = 0 and P(C^c ∩ A_n^c) = 0.

⟹ lim_{n→∞} P(A_n(ε)) = 0

"⟸":
Assume that lim_{n→∞} P(A_n(ε)) = 0 ∀ε > 0 and define D(ε) = { limsup_{n→∞} |X_n| > ε }.

⟹ 1 − P(C) ≤ Σ_{k=1}^∞ P(D(1/k)) = 0
⟹ X_n →a.s. 0
Note:
(i) X_n →a.s. 0 implies that ∀ε > 0 ∀δ > 0 ∃n_0 ∈ IN: P( sup_{n≥n_0} |X_n| > ε ) < δ.

(ii) For a sequence of events {A_n}_{n=1}^∞,

A = limsup_{n→∞} A_n = lim_{n→∞} ⋃_{k=n}^∞ A_k = ⋂_{n=1}^∞ ⋃_{k=n}^∞ A_k

is the event that infinitely many of the A_n occur. We write P(A) = P(A_n i.o.), where i.o. stands for "infinitely often".

(iii) Using the terminology defined in (ii) above, we can rewrite Theorem 6.3.2 as

X_n →a.s. 0 ⟺ P(|X_n| > ε i.o.) = 0 ∀ε > 0.
Theorem 6.3.3: Borel–Cantelli Lemma
Let A be defined as in (ii) of the previous Note.

(i) If Σ_{n=1}^∞ P(A_n) < ∞, then P(A) = 0.
(ii) If the A_n are independent and Σ_{n=1}^∞ P(A_n) = ∞, then P(A) = 1.

Proof:
(i):

P(A) = P( lim_{n→∞} ⋃_{k=n}^∞ A_k )
     = lim_{n→∞} P( ⋃_{k=n}^∞ A_k )
     ≤ lim_{n→∞} Σ_{k=n}^∞ P(A_k)
     = lim_{n→∞} ( Σ_{k=1}^∞ P(A_k) − Σ_{k=1}^{n−1} P(A_k) )
     = 0

(ii): We have A^c = ⋃_{n=1}^∞ ⋂_{k=n}^∞ A_k^c. Therefore,

P(A^c) = P( lim_{n→∞} ⋂_{k=n}^∞ A_k^c ) = lim_{n→∞} P( ⋂_{k=n}^∞ A_k^c ).

Therefore,

P( ⋂_{k=n}^∞ A_k^c ) ≤ lim_{n_0→∞} P( ⋂_{k=n}^{n_0} A_k^c )
                     = (indep.) lim_{n_0→∞} ∏_{k=n}^{n_0} (1 − P(A_k))
                     ≤ lim_{n_0→∞} exp( − Σ_{k=n}^{n_0} P(A_k) )     (since 1 − x ≤ e^{−x})
                     = 0

⟹ P(A) = 1
Example 6.3.4:
Independence is necessary for the 2nd BC Lemma: let Ω = (0, 1], P the uniform distribution on Ω, and A_n = (0, 1/n]. Then

Σ_{n=1}^∞ P(A_n) = Σ_{n=1}^∞ 1/n = ∞.

But for any ω ∈ Ω, A_n occurs only for n = 1, 2, ..., ⌊1/ω⌋, where ⌊1/ω⌋ denotes the largest integer ("floor") that is ≤ 1/ω. Therefore, P(A) = P(A_n i.o.) = 0.
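A small simulation contrasting the two situations (our addition; the interval construction A_n = (0, 1/n] follows the example above, while the independent events with P(A_n) = 1/n are the standard second Borel–Cantelli setting):

```python
import math
import random

def count_independent_occurrences(n_max, rng):
    # independent events A_n with P(A_n) = 1/n; the count up to n_max has
    # expectation H_{n_max} ~ ln(n_max) and keeps growing: A_n i.o. w.p. 1
    return sum(1 for n in range(1, n_max + 1) if rng.random() < 1 / n)

rng = random.Random(0)
counts = [count_independent_occurrences(100_000, rng) for _ in range(50)]
mean_count = sum(counts) / len(counts)
print(round(mean_count, 2), round(math.log(100_000) + 0.5772, 2))

# dependent construction A_n = (0, 1/n] with one fixed omega ~ U(0, 1):
# A_n occurs exactly for n <= floor(1/omega), i.e., only finitely often
omega = rng.random()
print(math.floor(1 / omega))
```

Both constructions have the same marginal probabilities P(A_n) = 1/n, but only the independent one produces occurrences beyond every fixed index.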
Proof:
See Rohatgi, page 268, Lemma 2, and Rohatgi/Saleh, page 284, Lemma 1.
Proof:
See Rohatgi, page 269, Lemma 3, and Rohatgi/Saleh, page 285, Lemma 2.
Proof:
See Rohatgi, page 270, Theorem 5.
Theorem 6.3.8:
Let {X_n}_{n=1}^∞ be a sequence of independent rv's. If Σ_{n=1}^∞ Var(X_n) < ∞, then Σ_{n=1}^∞ (X_n − E(X_n)) converges almost surely.

Proof:
See Rohatgi, page 272, Theorem 6, and Rohatgi/Saleh, page 286, Theorem 4.

Corollary 6.3.9:
Let {X_i}_{i=1}^∞ be a sequence of independent rv's. Let {B_i}_{i=1}^∞, B_i > 0, B_i ↑ ∞, be a sequence of norming constants. Let T_n = Σ_{i=1}^n X_i. If Σ_{i=1}^∞ Var(X_i)/B_i² < ∞, then it holds:

(T_n − E(T_n)) / B_n →a.s. 0

Proof:
This Corollary follows directly from Theorem 6.3.8 and Lemma 6.3.6.
Lemma 6.3.11:
Let X be a rv with E(|X|) < ∞. Then it holds:

Σ_{n=1}^∞ P(|X| ≥ n) ≤ E(|X|) ≤ 1 + Σ_{n=1}^∞ P(|X| ≥ n)

Proof:
Continuous case only:
Let X have a pdf f. Then it holds:

E(|X|) = ∫_{−∞}^∞ |x| f(x) dx = Σ_{k=0}^∞ ∫_{k≤|x|≤k+1} |x| f(x) dx

⟹ Σ_{k=0}^∞ k P(k ≤ |X| ≤ k+1) ≤ E(|X|) ≤ Σ_{k=0}^∞ (k+1) P(k ≤ |X| ≤ k+1)

It is

Σ_{k=0}^∞ k P(k ≤ |X| ≤ k+1) = Σ_{k=0}^∞ Σ_{n=1}^k P(k ≤ |X| ≤ k+1)
                             = Σ_{n=1}^∞ Σ_{k=n}^∞ P(k ≤ |X| ≤ k+1)
                             = Σ_{n=1}^∞ P(|X| ≥ n)

Similarly,

Σ_{k=0}^∞ (k+1) P(k ≤ |X| ≤ k+1) = Σ_{n=1}^∞ P(|X| ≥ n) + Σ_{k=0}^∞ P(k ≤ |X| ≤ k+1)
                                 = Σ_{n=1}^∞ P(|X| ≥ n) + 1
Theorem 6.3.12:
Let {X_i}_{i=1}^∞ be a sequence of independent rv's. Then it holds:

X_n →a.s. 0 ⟺ Σ_{n=1}^∞ P(|X_n| > ε) < ∞ ∀ε > 0

Proof:
See Rohatgi, page 265, Theorem 3.
Theorem 6.3.13: Kolmogorov's SLLN
Let {X_i}_{i=1}^∞ be a sequence of iid rv's. Let T_n = Σ_{i=1}^n X_i. Then it holds:

T_n / n = X̄_n →a.s. µ < ∞ ⟺ E(|X|) < ∞ (and then µ = E(X))

Proof:
"⟹":
Suppose that X̄_n →a.s. µ < ∞. It is

"⟸":
Let E(|X|) < ∞.
It is

Σ_{n=k}^∞ 1/n² = 1/k² + 1/(k+1)² + 1/(k+2)² + ...
              ≤ 1/k² + 1/(k(k+1)) + 1/((k+1)(k+2)) + ...
              = 1/k² + Σ_{n=k+1}^∞ 1/(n(n−1))

⟹ Σ_{n=k+1}^∞ 1/(n(n−1)) = 1 − 1/(1·2) − 1/(2·3) − 1/(3·4) − ... − 1/((k−1)·k)
                          = 1/2 − 1/(2·3) − 1/(3·4) − ... − 1/((k−1)·k)
                          = 1/3 − 1/(3·4) − ... − 1/((k−1)·k)
                          = 1/4 − ... − 1/((k−1)·k)
                          = ...
                          = 1/k

⟹ Σ_{n=k}^∞ 1/n² ≤ 1/k² + Σ_{n=k+1}^∞ 1/(n(n−1))
                 = 1/k² + 1/k
                 ≤ 2/k
6.4 Central Limit Theorems

Let {X_n}_{n=1}^∞ be a sequence of rv's with cdf's {F_n}_{n=1}^∞. Suppose that the mgf M_n(t) of X_n exists.
Questions: Does M_n(t) converge? Does it converge to an mgf M(t)? If it does converge, does it hold that X_n →d X for some rv X?

Example 6.4.1:
Let {X_n}_{n=1}^∞ be a sequence of rv's such that P(X_n = −n) = 1. Then the mgf is M_n(t) = E(e^{tX_n}) = e^{−tn}. So

lim_{n→∞} M_n(t) = { 0, t > 0;  1, t = 0;  ∞, t < 0 }

So M_n(t) does not converge to an mgf, and F_n(x) → F(x) = 1 ∀x. But F(x) is not a cdf.
Note:
Due to Example 6.4.1, the existence of mgf's M_n(t) that converge to something is not enough to conclude convergence in distribution.
Conversely, suppose that X_n has mgf M_n(t), X has mgf M(t), and X_n →d X. Does it hold that M_n(t) → M(t)?
Not necessarily! See Rohatgi, page 277, Example 2, and Rohatgi/Saleh, page 289, Example 2, as a counterexample. Thus, convergence in distribution of rv's that all have mgf's does not imply the convergence of mgf's.
Example 6.4.3:
Let X_n ∼ Bin(n, λ/n). Recall (e.g., from Theorem 3.3.12 and related Theorems) that for X ∼ Bin(n, p) the mgf is M_X(t) = (1 − p + pe^t)^n. Thus,
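The limit behind this example — M_n(t) = (1 − λ/n + (λ/n)e^t)^n → e^{λ(e^t−1)}, the Poisson(λ) mgf — can be checked numerically (our addition; λ = 3 and t = 0.5 are arbitrary choices):

```python
import math

def binom_mgf(t, n, p):
    # M_X(t) = (1 - p + p e^t)^n for X ~ Bin(n, p)
    return (1 - p + p * math.exp(t)) ** n

def poisson_mgf(t, lam):
    # M_X(t) = exp(lam (e^t - 1)) for X ~ Poisson(lam)
    return math.exp(lam * (math.exp(t) - 1))

lam, t = 3.0, 0.5
for n in [10, 100, 10000]:
    print(n, round(binom_mgf(t, n, lam / n), 5), round(poisson_mgf(t, lam), 5))
```

Together with a continuity theorem for mgf's, this pointwise convergence gives Bin(n, λ/n) →d Poisson(λ).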
Note:
Recall Theorem 3.3.11: Suppose that {X_n}_{n=1}^∞ is a sequence of rv's with characteristic functions {Φ_n(t)}_{n=1}^∞. Suppose that

lim_{n→∞} Φ_n(t) = Φ(t) ∀t

and Φ(t) is the characteristic function of a rv X. Then X_n →d X.

Let Φ(t) be the characteristic function of X_i. We now determine the characteristic function Φ_n(t) of √n(X̄_n − µ)/σ:
Here we make use of the Landau symbol "o". In general, if we write u(x) = o(v(x)) for x → L, this implies lim_{x→L} u(x)/v(x) = 0, i.e., u(x) goes to 0 faster than v(x), or v(x) goes to ∞ faster than u(x). We say that u(x) is of smaller order than v(x) as x → L. Examples are 1/x³ = o(1/x²) and x² = o(x³) for x → ∞. See Rohatgi, page 6, for more details on the Landau symbols "O" and "o".
Definition 6.4.5:
Let X1 , X2 be iid non–degenerate rv’s with common cdf F . Let a1 , a2 > 0. We say that F is
stable if there exist constants A and B (depending on a1 and a2 ) such that
B −1 (a1 X1 + a2 X2 − A) also has cdf F .
Note:
When generalizing the previous definition to sequences of rv's, we have the following examples of stable distributions:

• X_i iid Cauchy. Then (1/n) Σ_{i=1}^n X_i ∼ Cauchy (here B_n = n, A_n = 0).

• X_i iid N(0, 1). Then (1/√n) Σ_{i=1}^n X_i ∼ N(0, 1) (here B_n = √n, A_n = 0).
Definition 6.4.6:
Let {X_i}_{i=1}^∞ be a sequence of iid rv's with common cdf F. Let T_n = Σ_{i=1}^n X_i. F belongs to the domain of attraction of a distribution V if there exist norming and centering constants {B_n}_{n=1}^∞, B_n > 0, and {A_n}_{n=1}^∞ such that the cdf of B_n^{−1}(T_n − A_n) converges weakly to V.
Note:
A very general Theorem from Loève states that only stable distributions can have domains of attraction. From the practical point of view, a wide class of distributions F belongs to the domain of attraction of the Normal distribution.
Theorem 6.4.7: Lindeberg Central Limit Theorem
Let {X_i}_{i=1}^∞ be a sequence of independent non-degenerate rv's with cdf's {F_i}_{i=1}^∞. Assume that E(X_k) = µ_k and Var(X_k) = σ_k² < ∞. Let s_n² = Σ_{k=1}^n σ_k².

If the F_k are absolutely continuous with pdf's f_k = F_k′, assume that it holds for all ε > 0 that

(A) lim_{n→∞} (1/s_n²) Σ_{k=1}^n ∫_{{|x−µ_k| > ε s_n}} (x − µ_k)² f_k(x) dx = 0.

If the X_k are discrete rv's with support {x_{kl}} and probabilities {p_{kl}}, l = 1, 2, ..., assume that it holds for all ε > 0 that

(B) lim_{n→∞} (1/s_n²) Σ_{k=1}^n Σ_{|x_{kl}−µ_k| > ε s_n} (x_{kl} − µ_k)² p_{kl} = 0.

The conditions (A) and (B) are called the Lindeberg Condition (LC). If either LC holds, then

Σ_{k=1}^n (X_k − µ_k) / s_n →d Z,

where Z ∼ N(0, 1).
Proof:
Similar to the proof of Theorem 6.4.4, we can use characteristic functions again. An alterna-
tive proof is given in Rohatgi, pages 282–288.
Note:
Feller shows that the LC is a necessary condition if σ_n²/s_n² → 0 and s_n² → ∞ as n → ∞.
Corollary 6.4.8:
Let {X_i}_{i=1}^∞ be a sequence of iid rv's such that (1/√n) Σ_{i=1}^n X_i has the same distribution for all n. If E(X_i) = 0 and Var(X_i) = 1, then X_i ∼ N(0, 1).

Proof:
Let F be the common cdf of (1/√n) Σ_{i=1}^n X_i for all n (including n = 1). By the CLT,

lim_{n→∞} P( (1/√n) Σ_{i=1}^n X_i ≤ x ) = Φ(x),

where Φ(x) denotes P(Z ≤ x) for Z ∼ N(0, 1). Also, P( (1/√n) Σ_{i=1}^n X_i ≤ x ) = F(x) for each n. Therefore, we must have F(x) = Φ(x).
Note:
In general, if X1 , X2 , . . ., are independent rv’s such that there exists a constant A with
P (| Xn |≤ A) = 1 ∀n, then the LC is satisfied if s2n → ∞ as n → ∞. Why??
Suppose that s2n → ∞ as n → ∞. Since the | Xk |’s are uniformly bounded (by A), so are the
rv’s (Xk − E(Xk )). Thus, for every > 0 there exists an N such that if n ≥ N then
This implies that the LC holds since we would integrate (or sum) over the empty set, i.e., the
set {| x − µk |> sn } = Ø.
The converse also holds. For a sequence of uniformly bounded independent rv’s, a necessary
and sufficient condition for the CLT to hold is that s2n → ∞ as n → ∞.
Example 6.4.9:
Let {X_i}_{i=1}^∞ be a sequence of independent rv's such that E(X_k) = 0, α_k = E(|X_k|^{2+δ}) < ∞ for some δ > 0, and Σ_{k=1}^n α_k = o(s_n^{2+δ}).

Does the LC hold? It is:

(1/s_n²) Σ_{k=1}^n ∫_{{|x| > ε s_n}} x² f_k(x) dx
  ≤ (A) (1/s_n²) Σ_{k=1}^n ∫_{{|x| > ε s_n}} ( |x|^{2+δ} / (ε^δ s_n^δ) ) f_k(x) dx
  ≤ (1/(ε^δ s_n^{2+δ})) Σ_{k=1}^n ∫_{−∞}^∞ |x|^{2+δ} f_k(x) dx
  = (1/(ε^δ s_n^{2+δ})) Σ_{k=1}^n α_k
  = (1/ε^δ) · ( Σ_{k=1}^n α_k ) / s_n^{2+δ}
  → (B) 0 as n → ∞

(A) holds since for |x| > ε s_n, it is |x|^δ / (ε^δ s_n^δ) > 1. (B) holds since Σ_{k=1}^n α_k = o(s_n^{2+δ}).
Thus, the LC is satisfied and the CLT holds.
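A simulation sketch of the CLT for independent, non-identically distributed rv's (our addition; the uniformly bounded Uniform(−c_k, c_k) design is an arbitrary choice for which s_n² → ∞, so the LC holds by the boundedness argument in the Note above):

```python
import math
import random

def standardized_sum(n, rng):
    # independent X_k ~ Uniform(-c_k, c_k), c_k = 1 + (k % 3): non-identical but
    # uniformly bounded, so the LC holds as soon as s_n^2 -> infinity
    total, var = 0.0, 0.0
    for k in range(1, n + 1):
        c = 1 + (k % 3)
        total += rng.uniform(-c, c)
        var += c * c / 3.0        # Var(Uniform(-c, c)) = c^2 / 3
    return total / math.sqrt(var)

rng = random.Random(7)
zs = [standardized_sum(300, rng) for _ in range(4000)]
frac_within = sum(1 for z in zs if abs(z) <= 1.96) / len(zs)
print(round(frac_within, 3))  # near 0.95 under a N(0, 1) limit
```

The fraction of standardized sums inside [−1.96, 1.96] matches the standard normal probability 0.95, as Theorem 6.4.7 predicts.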
Note:
(ii) Both the CLT and the WLLN hold for a large class of sequences of rv's {X_i}. If the {X_i}'s are independent uniformly bounded rv's, i.e., if P(|X_n| ≤ M) = 1 ∀n, the WLLN (as formulated in Theorem 6.2.3) holds. The CLT holds provided that s_n² → ∞ as n → ∞.
If the rv's {X_i} are iid, then the CLT is a stronger result than the WLLN since the CLT provides an estimate of the probability P( (1/n) |Σ_{i=1}^n X_i − nµ| ≥ ε ) ≈ 1 − P(|Z| ≤ ε√n/σ), where Z ∼ N(0, 1), and the WLLN follows. However, note that the CLT requires the existence of a 2nd moment while the WLLN does not.
(iii) If the {Xi } are independent (but not identically distributed) rv’s, the CLT may apply
while the WLLN does not.
(iv) See Rohatgi, pages 289–293, and Rohatgi/Saleh, pages 299–303, for additional details
and examples.
7 Sample Moments

7.1 Random Sampling
Definition 7.1.2:
Let X_1, ..., X_n be a sample of size n from a population with distribution F. Then

X̄ = (1/n) Σ_{i=1}^n X_i

is called the sample mean.
Definition 7.1.3:
Let X_1, ..., X_n be a sample of size n from a population with distribution F. The function

F̂_n(x) = (1/n) Σ_{i=1}^n I_{(−∞,x]}(X_i)

is called the empirical cumulative distribution function (empirical cdf).

Note:
For any fixed x ∈ IR, F̂_n(x) is a rv.
Theorem 7.1.4:
The rv F̂_n(x) has pmf

P( F̂_n(x) = j/n ) = (n choose j) (F(x))^j (1 − F(x))^{n−j}, j ∈ {0, 1, ..., n},

with E(F̂_n(x)) = F(x) and Var(F̂_n(x)) = F(x)(1 − F(x))/n.

Proof:
It is I_{(−∞,x]}(X_i) ∼ Bin(1, F(x)). Then nF̂_n(x) ∼ Bin(n, F(x)).
Corollary 7.1.5:
By the WLLN, it follows that

F̂_n(x) →p F(x).

Corollary 7.1.6:
By the CLT, it follows that

√n (F̂_n(x) − F(x)) / √( F(x)(1 − F(x)) ) →d Z,

where Z ∼ N(0, 1).
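A simulation check of Theorem 7.1.4 (our addition; the Uniform(0, 1) population is an arbitrary choice, so F(x) = x):

```python
import random

rng = random.Random(123)

def ecdf_at(x, sample):
    # F_hat_n(x) = (1/n) #{i : X_i <= x}
    return sum(1 for xi in sample if xi <= x) / len(sample)

n, x = 200, 0.3                    # F(x) = 0.3 for Uniform(0, 1)
vals = [ecdf_at(x, [rng.random() for _ in range(n)]) for _ in range(5000)]
mean = sum(vals) / len(vals)
var = sum((v - mean) ** 2 for v in vals) / len(vals)
print(round(mean, 3), round(var, 5), round(x * (1 - x) / n, 5))
```

The empirical mean and variance of F̂_n(0.3) match F(x) and F(x)(1 − F(x))/n, as the Bin(n, F(x)) representation predicts.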
Definition 7.1.8:
Let X_1, ..., X_n be a sample of size n from a population with distribution F. We call

a_k = (1/n) Σ_{i=1}^n X_i^k

the sample moment of order k, and

b_k = (1/n) Σ_{i=1}^n (X_i − X̄)^k

the sample central moment of order k.

Note:
It is b_1 = 0 and b_2 = ((n−1)/n) S².
Theorem 7.1.9:
Let X_1, ..., X_n be a sample of size n from a population with distribution F. Assume that E(X) = µ, Var(X) = σ², and E((X − µ)^k) = µ_k exist. Then it holds:

(i) E(a_1) = E(X̄) = µ
(ii) Var(a_1) = Var(X̄) = σ²/n
(iii) E(b_2) = ((n−1)/n) σ²
(v) E(S²) = σ²
(vi) Var(S²) = µ_4/n − ((n−3)/(n(n−1))) µ_2²

Proof:
(i)

See Casella/Berger, page 214, and Rohatgi, pages 303–306, for the proof of parts (iv) through (vi) and results regarding the 3rd and 4th moments and covariances.
7.2 Sample Moments and the Normal Distribution

(1):

From (1) and (2), it follows:
Corollary 7.2.2:
X̄ and S² are independent.

Proof:
This can be seen since S² is a function of the vector (X_1 − X̄, ..., X_n − X̄), and (X_1 − X̄, ..., X_n − X̄) is independent of X̄, as previously shown in Theorem 7.2.1. We can use Theorem 4.2.7 to formally complete this proof.

Corollary 7.2.3:

(n − 1)S² / σ² ∼ χ²_{n−1}.

Proof:
Recall the following facts:

Now consider
Corollary 7.2.4:

√n (X̄ − µ) / S ∼ t_{n−1}.

Proof:
Recall the following facts:

Therefore,

√n (X̄ − µ) / S = [ (X̄ − µ)/(σ/√n) ] / (S/σ) = [ (X̄ − µ)/(σ/√n) ] / √( S²(n−1) / (σ²(n−1)) ) = Z_1 / √( Y_{n−1}/(n−1) ) ∼ t_{n−1}.
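A quick Monte Carlo sanity check of Corollary 7.2.4 (our addition; µ = 5, σ = 2, n = 10 are arbitrary choices, and 2.262 is the t_9 quantile at level 0.975 quoted from standard tables):

```python
import math
import random
import statistics

rng = random.Random(2013)

def t_stat(n, mu, sigma):
    xs = [rng.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    s = statistics.stdev(xs)       # sample standard deviation, divisor n - 1
    return math.sqrt(n) * (xbar - mu) / s

n = 10
ts = [t_stat(n, mu=5.0, sigma=2.0) for _ in range(20000)]
frac = sum(1 for t in ts if abs(t) <= 2.262) / len(ts)
print(round(frac, 3))  # near 0.95 if sqrt(n)(Xbar - mu)/S ~ t_9
```

The simulated coverage agrees with the t_{n−1} distribution; note that it would not match a N(0, 1) reference for n this small (the N(0, 1) quantile 1.96 gives visibly less than 95% coverage).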
Corollary 7.2.5:
Let (X_1, ..., X_m) ∼ iid N(µ_1, σ_1²) and (Y_1, ..., Y_n) ∼ iid N(µ_2, σ_2²). Let X_i, Y_j be independent ∀i, j.
Then it holds:

( X̄ − Ȳ − (µ_1 − µ_2) ) / √( [(m−1)S_1²/σ_1²] + [(n−1)S_2²/σ_2²] ) · √( (m+n−2) / (σ_1²/m + σ_2²/n) ) ∼ t_{m+n−2}

In particular, if σ_1 = σ_2, then:

( X̄ − Ȳ − (µ_1 − µ_2) ) / √( (m−1)S_1² + (n−1)S_2² ) · √( mn(m+n−2) / (m+n) ) ∼ t_{m+n−2}

Proof:
Homework.
Corollary 7.2.6:
Let (X_1, ..., X_m) ∼ iid N(µ_1, σ_1²) and (Y_1, ..., Y_n) ∼ iid N(µ_2, σ_2²). Let X_i, Y_j be independent ∀i, j.
Then it holds:

(S_1²/σ_1²) / (S_2²/σ_2²) ∼ F_{m−1,n−1}

In particular, if σ_1 = σ_2, then:

S_1² / S_2² ∼ F_{m−1,n−1}

Proof:
Recall that, if Y_1 ∼ χ²_m and Y_2 ∼ χ²_n are independent, then

F = (Y_1/m) / (Y_2/n) ∼ F_{m,n}.

Now, C_1 = (m−1)S_1²/σ_1² ∼ χ²_{m−1} and C_2 = (n−1)S_2²/σ_2² ∼ χ²_{n−1}. Therefore,

( C_1/(m−1) ) / ( C_2/(n−1) ) = (S_1²/σ_1²) / (S_2²/σ_2²) ∼ F_{m−1,n−1}.

If σ_1 = σ_2, then

S_1² / S_2² ∼ F_{m−1,n−1}.
8 The Theory of Point Estimation
Let X be a rv defined on a probability space (Ω, L, P ). Suppose that the cdf F of X depends
on some set of parameters and that the functional form of F is known except for a finite
number of these parameters.
Definition 8.1.1:
The set of admissible values of θ is called the parameter space Θ. If Fθ is the cdf of X
when θ is the parameter, the set {Fθ : θ ∈ Θ} is the family of cdf ’s. Likewise, we speak of
the family of pdf ’s if X is continuous, and the family of pmf ’s if X is discrete.
Example 8.1.2:
X ∼ Bin(n, p), p unknown. Then θ = p and Θ = {p : 0 < p < 1}.
X ∼ N (µ, σ 2 ), (µ, σ 2 ) unknown. Then θ = (µ, σ 2 ) and Θ = {(µ, σ 2 ) : −∞ < µ < ∞, σ 2 > 0}.
Definition 8.1.3:
Let X be a sample from F_θ, θ ∈ Θ ⊆ IR. Let a statistic T(X) map IR^n to Θ. We call T(X) an estimator of θ, and T(x), for a realization x of X, a (point) estimate of θ. In practice, the term estimate is used for both.
Example 8.1.4:
Let X_1, ..., X_n be iid Bin(1, p), p unknown. Estimates of p include:

T_1(X) = X̄,  T_2(X) = X_1,  T_3(X) = 1/2,  T_4(X) = (X_1 + X_2)/3

Obviously, not all estimates are equally good.
8.2 Properties of Estimates

Definition 8.2.1:
Let {X_i}_{i=1}^∞ be a sequence of iid rv's with cdf F_θ, θ ∈ Θ. A sequence of point estimates T_n(X_1, ..., X_n) = T_n is called

• (weakly) consistent for θ if T_n →p θ as n → ∞ ∀θ ∈ Θ,
• strongly consistent for θ if T_n →a.s. θ as n → ∞ ∀θ ∈ Θ,
• consistent in the r-th mean for θ if T_n →r θ as n → ∞ ∀θ ∈ Θ.
Example 8.2.2:
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of iid Bin(1, p) rv's. Let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$. Since $E(X_i) = p$, it follows by the WLLN that $\bar{X}_n \stackrel{p}{\longrightarrow} p$, i.e., consistency, and by the SLLN that $\bar{X}_n \stackrel{a.s.}{\longrightarrow} p$, i.e., strong consistency.
However, a consistent estimate may not be unique. We may even have infinitely many consistent estimates, e.g.,
$$\frac{\sum_{i=1}^n X_i + a}{n + b} \stackrel{p}{\longrightarrow} p \quad \forall \text{ finite } a, b \in I\!R.$$
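A small simulation sketch of these consistency claims; the values of p, a, b, the seed, and the sample sizes are arbitrary illustrative choices.

```python
import random

# Sketch of the consistency claims: X_bar_n and (sum X_i + a)/(n + b)
# both approach p as n grows.
random.seed(2)
p, a, b = 0.3, 5.0, 7.0

def both_estimates(n):
    s = sum(1 for _ in range(n) if random.random() < p)  # iid Bin(1, p) successes
    return s / n, (s + a) / (n + b)

xbar_small, alt_small = both_estimates(100)
xbar_big, alt_big = both_estimates(200_000)
```

For large n the two estimates are nearly indistinguishable, since their difference is of order 1/n.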
Theorem 8.2.3:
If Tn is a sequence of estimates such that E(Tn ) → θ and V ar(Tn ) → 0 as n → ∞, then Tn is
consistent for θ.
Proof:
Definition 8.2.4:
Let G be a group of Borel–measurable functions of IRn onto itself which is closed under com-
position and inverse. A family of distributions {Pθ : θ ∈ Θ} is invariant under G if for
each g ∈ G and for all θ ∈ Θ, there exists a unique θ0 = g(θ) such that the distribution of
g(X) is Pθ0 whenever the distribution of X is Pθ . We call g the induced function on θ since
Pθ (g(X) ∈ A) = Pg(θ) (X ∈ A).
Example 8.2.5:
Let (X1 , . . . , Xn ) be iid N (µ, σ 2 ) with pdf
$$f(x_1, \ldots, x_n) = \frac{1}{(\sqrt{2\pi}\sigma)^n} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2\right).$$
So {f : −∞ < µ < ∞, σ 2 > 0} is invariant under this group G, with g(µ, σ 2 ) = (aµ+b, a2 σ 2 ),
where −∞ < aµ + b < ∞ and a2 σ 2 > 0.
Definition 8.2.6:
Let G be a group of transformations that leaves {Fθ : θ ∈ Θ} invariant. An estimate T is
invariant under G if
Definition 8.2.7:
An estimate T is:
Example 8.2.8:
Let Fθ ∼ N (µ, σ 2 ).
S 2 is location invariant.
Note:
Different sources make different use of the term invariant. Mood, Graybill & Boes (1974)
for example define location invariant as T (X1 + a, . . . , Xn + a) = T (X1 , . . . , Xn ) + a (page
332) and scale invariant as T (cX1 , . . . , cXn ) = cT (X1 , . . . , Xn ) (page 336). According to their
definition, X is location invariant and scale invariant.
8.3 Sufficient Statistics
Note:
(i) The sample X is always sufficient but this is not particularly interesting and usually is
excluded from further considerations.
(ii) Idea: Once we have “reduced” from X to T (X), we have captured all the information
in X about θ.
(iii) Usually, there are several sufficient statistics for a given family of distributions.
Example 8.3.2:
Let X = (X1 , . . . , Xn ) be iid Bin(1, p) rv’s. To estimate p, can we ignore the order and simply
count the number of “successes”?
Let $T(X) = \sum_{i=1}^n X_i$. It is
Example 8.3.3:
Let X = (X1, . . . , Xn) be iid Poisson(λ). Is $T = \sum_{i=1}^n X_i$ sufficient for λ? It is
Example 8.3.4:
Let X1 , X2 be iid Poisson(λ). Is T = X1 + 2X2 sufficient for λ? It is
Note:
Definition 8.3.1 can be difficult to check. In addition, it requires a candidate statistic. We
need something constructive that helps in finding sufficient statistics without having to check
Definition 8.3.1. The next Theorem helps in finding such statistics.
where h does not depend on θ and g does not depend on x1 , . . . , xn except as a function of T .
Proof:
Discrete case only.
“=⇒”:
Suppose T (X) is sufficient for θ. Let
“⇐=”:
Suppose the factorization holds. For fixed t0 , it is
Note:
(ii) If T is sufficient for θ, then also any 1–to–1 mapping of T is sufficient for θ. However,
this does not hold for arbitrary functions of T .
Example 8.3.6:
Let X1 , . . . , Xn be iid Bin(1, p). It is
Example 8.3.7:
Let X1 , . . . , Xn be iid Poisson(λ). It is
Example 8.3.8:
Let X1 , . . . , Xn be iid N (µ, σ 2 ) where µ ∈ IR and σ 2 > 0 are both unknown. It is
Example 8.3.9:
Let X1 , . . . , Xn be iid U (θ, θ + 1) where −∞ < θ < ∞. It is
Definition 8.3.10:
Let {fθ (x) : θ ∈ Θ} be a family of pdf’s (or pmf’s). We say the family is complete if
Eθ (g(X)) = 0 ∀θ ∈ Θ
implies that
Pθ (g(X) = 0) = 1 ∀θ ∈ Θ.
Example 8.3.11:
Let X1, . . . , Xn be iid Bin(1, p). We have seen in Example 8.3.6 that $T = \sum_{i=1}^n X_i$ is sufficient for p. Is it also complete?
Example 8.3.12:
Let X1, . . . , Xn be iid N(θ, θ²). We know from Example 8.3.8 that $T = \left(\sum_{i=1}^n X_i, \sum_{i=1}^n X_i^2\right)$ is sufficient for θ. Is it also complete?
Note:
Recall from Section 5.2 what it means if we say the family of distributions {fθ : θ ∈ Θ} is a
one–parameter (or k–parameter) exponential family.
Theorem 8.3.13:
Let {fθ : θ ∈ Θ} be a k–parameter exponential family. Let T1 , . . . , Tk be statistics. Then the
family of distributions of (T1 (X), . . . , Tk (X)) is also a k–parameter exponential family given
by
$$g_\theta(t) = \exp\left(\sum_{i=1}^k t_i Q_i(\theta) + D(\theta) + S^*(t)\right)$$
Theorem 8.3.14:
Let {fθ : θ ∈ Θ} be a k–parameter exponential family with k ≤ n and let T1 , . . . , Tk be
statistics as in Theorem 8.3.13. Suppose that the range of Q = (Q1 , . . . , Qk ) contains an open
set in IRk . Then T = (T1 (X), . . . , Tk (X)) is a complete sufficient statistic.
Proof:
Discrete case and k = 1 only.
We have to show that
$$\sum_t g(t) \exp(\theta t + D(\theta) + S^*(t)) = 0 \quad \forall\theta \qquad (A)$$
implies
$$g(t) = 0 \quad \forall t. \qquad (B)$$
Note that in (A) we make use of a result established in Theorem 8.3.13.
It is $g(t) = g^+(t) - g^-(t)$, where both functions, $g^+$ and $g^-$, are non-negative. Using $g^+$ and $g^-$, it turns out that (A) is equivalent to
$$\sum_t g^+(t) \exp(\theta t + S^*(t)) = \sum_t g^-(t) \exp(\theta t + S^*(t)) \quad \forall\theta,$$
where the term $\exp(D(\theta))$ in (A) drops out as a constant on both sides. Fixing some $\theta_0 \in (a, b)$ and normalizing both sides into pmf's $p^+(t)$ and $p^-(t)$ proportional to $g^{\pm}(t)\exp(\theta_0 t + S^*(t))$, this becomes a statement about moment generating functions:
$$\sum_t e^{\delta t} p^+(t) = M^+(\delta) = M^-(\delta) = \sum_t e^{\delta t} p^-(t) \quad \forall \delta \in (\underbrace{a - \theta_0}_{<0}, \underbrace{b - \theta_0}_{>0}).$$
By the uniqueness of mgf's,
$$\Longrightarrow g^+(t) = g^-(t) \;\; \forall t$$
=⇒ g(t) = 0 ∀t
=⇒ T is complete
Definition 8.3.15:
Let X = (X1 , . . . , Xn ) be a sample from {Fθ : θ ∈ Θ ⊆ IRk } and let T = T (X) be a sufficient
statistic for θ. T = T (X) is called a minimal sufficient statistic for θ if, for any other
sufficient statistic T 0 = T 0 (X), T (x) is a function of T 0 (x).
Note:
(i) A minimal sufficient statistic achieves the greatest possible data reduction for a sufficient
statistic.
(ii) If T is minimal sufficient for θ, then also any 1–to–1 mapping of T is minimal sufficient
for θ. However, this does not hold for arbitrary functions of T .
Definition 8.3.16:
Let X = (X1 , . . . , Xn ) be a sample from {Fθ : θ ∈ Θ ⊆ IRk }. A statistic T = T (X) is called
ancillary if its distribution does not depend on the parameter θ.
Example 8.3.17:
Let X1 , . . . , Xn be iid U (θ, θ + 1) where −∞ < θ < ∞. As shown in Example 8.3.9,
T = (X(1) , X(n) ) is sufficient for θ. Define
Rn = X(n) − X(1) .
Use the result from Stat 6710, Homework Assignment 5, Question (viii) (a) to obtain
$$f_{R_n}(r \mid \theta) = f_{R_n}(r) = n(n-1) r^{n-2} (1-r) \, I_{(0,1)}(r).$$
This means that $R_n \sim Beta(n-1, 2)$. Moreover, the distribution of $R_n$ does not depend on θ and, therefore, $R_n$ is ancillary.
Theorem 8.3.19:
Let X = (X1 , . . . , Xn ) be a sample from {Fθ : θ ∈ Θ ⊆ IRk }. If any minimal sufficient statis-
tic T = T (X) exists for θ, then any complete statistic is also a minimal sufficient statistic.
Note:
(i) Due to the last Theorem, Basu’s Theorem often only is stated in terms of a complete
sufficient statistic (which automatically is also a minimal sufficient statistic).
(ii) As already shown in Corollary 7.2.2, X and S 2 are independent when sampling from a
N (µ, σ 2 ) population. As outlined in Casella/Berger, page 289, we could also use Basu’s
Theorem to obtain the same result.
(iii) The converse of Basu’s Theorem is false, i.e., if T (X) is independent of any ancillary
statistic, it does not necessarily follow that T (X) is a complete, minimal sufficient statis-
tic.
(iv) As seen in Examples 8.3.8 and 8.3.12, $T = \left(\sum_{i=1}^n X_i, \sum_{i=1}^n X_i^2\right)$ is sufficient for θ but it is not complete when X1, . . . , Xn are iid N(θ, θ²). However, it can be shown that T is minimal sufficient. So, there may be distributions where a minimal sufficient statistic exists but a complete statistic does not exist.
(v) As with invariance, there exist several different definitions of ancillarity within the lit-
erature — the one defined in this chapter being the most commonly used.
8.4 Unbiased Estimation
Eθ (T ) = θ ∀θ ∈ Θ.
Any function d(θ) for which an unbiased estimate T exists is called an estimable function.
If T is biased,
b(θ, T ) = Eθ (T ) − θ
Example 8.4.2:
If the $k^{th}$ population moment exists, the $k^{th}$ sample moment is an unbiased estimate. If $Var(X) = \sigma^2$, the sample variance $S^2$ is an unbiased estimate of $\sigma^2$.
$$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1} = Gamma\left(\frac{n-1}{2}, 2\right)$$
$$\Longrightarrow E\left(\sqrt{\frac{(n-1)S^2}{\sigma^2}}\right) = \int_0^\infty \sqrt{x} \; \frac{x^{\frac{n-1}{2}-1} e^{-\frac{x}{2}}}{2^{\frac{n-1}{2}} \Gamma(\frac{n-1}{2})} \, dx = \frac{\sqrt{2}\,\Gamma(\frac{n}{2})}{\Gamma(\frac{n-1}{2})} \int_0^\infty \frac{x^{\frac{n}{2}-1} e^{-\frac{x}{2}}}{2^{\frac{n}{2}} \Gamma(\frac{n}{2})} \, dx \stackrel{(*)}{=} \frac{\sqrt{2}\,\Gamma(\frac{n}{2})}{\Gamma(\frac{n-1}{2})}$$
$$\Longrightarrow E(S) = \sigma \sqrt{\frac{2}{n-1}} \, \frac{\Gamma(\frac{n}{2})}{\Gamma(\frac{n-1}{2})}$$
(∗) holds since $\frac{x^{\frac{n}{2}-1} e^{-\frac{x}{2}}}{2^{\frac{n}{2}} \Gamma(\frac{n}{2})}$ is the pdf of a $Gamma(\frac{n}{2}, 2)$ distribution and thus the integral is 1.
So S is biased for σ and
$$b(\sigma, S) = \sigma \left( \sqrt{\frac{2}{n-1}} \, \frac{\Gamma(\frac{n}{2})}{\Gamma(\frac{n-1}{2})} - 1 \right).$$
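A numerical sketch of this bias result; n, σ, the seed, and the replication count are arbitrary illustrative choices.

```python
import math
import random
import statistics

# Numerical check of the bias of S:
# E(S) = sigma * sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2) < sigma.
random.seed(3)
n, sigma = 5, 2.0
c_n = math.sqrt(2 / (n - 1)) * math.gamma(n / 2) / math.gamma((n - 1) / 2)
theo_ES = sigma * c_n  # exact E(S) from the derivation above

sims = []
for _ in range(40000):
    xs = [random.gauss(0.0, sigma) for _ in range(n)]
    sims.append(statistics.stdev(xs))  # S for this sample
emp_ES = statistics.mean(sims)
```

For n = 5 the constant is about 0.94, so S systematically underestimates σ.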
Note:
If T is unbiased for θ, g(T ) is not necessarily unbiased for g(θ) (unless g is a linear function).
Example 8.4.3:
Unbiased estimates may not exist (see Rohatgi, page 351, Example 2) or they may be absurd as in the following case:
Let X ∼ Poisson(λ) and let $d(\lambda) = e^{-2\lambda}$. Consider $T(X) = (-1)^X$ as an estimate. It is
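A simulation sketch of why this estimate is unbiased for $e^{-2\lambda}$ yet absurd: T only ever takes the values +1 and −1, while $e^{-2\lambda} \in (0,1)$. The value of λ, the seed, the replication count, and the helper sampler are illustrative choices.

```python
import math
import random

# T(X) = (-1)^X is unbiased for e^{-2 lambda} when X ~ Poisson(lambda),
# but it only takes the values +1 and -1.
random.seed(4)
lam = 1.5

def rpois(lam):
    # inverse-cdf sampling of a Poisson rv (adequate for small lambda)
    u, k = random.random(), 0
    p = cdf = math.exp(-lam)
    while u > cdf and k < 200:
        k += 1
        p *= lam / k
        cdf += p
    return k

vals = [(-1) ** rpois(lam) for _ in range(200000)]
emp = sum(vals) / len(vals)
target = math.exp(-2 * lam)  # the estimand d(lambda)
```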
Note:
If there exist 2 unbiased estimates T1 and T2 of θ, then any estimate of the form αT1 +(1−α)T2
for 0 < α < 1 will also be an unbiased estimate of θ. Which one should we choose?
Definition 8.4.4:
The mean square error of an estimate T of θ is defined as
Let $\{T_i\}_{i=1}^{\infty}$ be a sequence of estimates of θ. If
$$\lim_{i \to \infty} MSE(\theta, T_i) = 0 \quad \forall \theta \in \Theta,$$
Note:
(i) If we allow all estimates and compare their MSE, generally it will depend on θ which
estimate is better. For example θ̂ = 17 is perfect if θ = 17, but it is lousy otherwise.
(ii) If we restrict ourselves to the class of unbiased estimates, then M SE(θ, T ) = V arθ (T ).
(iii) MSE–consistency means that both the bias and the variance of Ti approach 0 as i → ∞.
Definition 8.4.5:
Let θ0 ∈ Θ and let U (θ0 ) be the class of all unbiased estimates T of θ0 such that Eθ0 (T 2 ) < ∞.
Then T0 ∈ U (θ0 ) is called a locally minimum variance unbiased estimate (LMVUE)
at θ0 if
Eθ0 ((T0 − θ0 )2 ) ≤ Eθ0 ((T − θ0 )2 ) ∀T ∈ U (θ0 ).
Definition 8.4.6:
Let U be the class of all unbiased estimates T of θ ∈ Θ such that Eθ (T 2 ) < ∞ ∀θ ∈ Θ. Then
T0 ∈ U is called a uniformly minimum variance unbiased estimate (UMVUE) of θ if
An Excursion into Logic II
In our first “Excursion into Logic” in Stat 6710 Mathematical Statistics I, we have established
the following results:
A ⇒ B is equivalent to ¬B ⇒ ¬A is equivalent to ¬A ∨ B:
A B A⇒B ¬A ¬B ¬B ⇒ ¬A ¬A ∨ B
1 1 1 0 0 1 1
1 0 0 0 1 0 0
0 1 1 1 0 1 1
0 0 1 1 1 1 1
When dealing with formal proofs, there exists one more technique to show A ⇒ B. Equiva-
lently, we can show (A ∧ ¬B) ⇒ 0, a technique called Proof by Contradiction. This means,
assuming that A and ¬B hold, we show that this implies 0, i.e., something that is always
false, i.e., a contradiction. And here is the corresponding truth table:
A B A⇒B ¬B A ∧ ¬B (A ∧ ¬B) ⇒ 0
1 1
1 0
0 1
0 0
Note:
We make use of this proof technique in the Proof of the next Theorem.
Example:
Let A : x = 5 and B : x2 = 25. Obviously A ⇒ B.
A: x = 5 and ¬B: x² ≠ 25
=⇒ x² = 25 ∧ x² ≠ 25
Theorem 8.4.7:
Let U be the class of all unbiased estimates T of θ ∈ Θ with Eθ (T 2 ) < ∞ ∀θ, and suppose
that U is non–empty. Let U0 be the set of all unbiased estimates of 0, i.e.,
Eθ (νT0 ) = 0 ∀θ ∈ Θ ∀ν ∈ U0 .
Proof:
Note that Eθ (νT0 ) always exists.
Theorem 8.4.8:
Let U be the non–empty class of unbiased estimates of θ ∈ Θ as defined in Theorem 8.4.7.
Then there exists at most one UMVUE T ∈ U for θ.
Proof:
Suppose T0 , T1 ∈ U are both UMVUE.
$\Longrightarrow E_\theta(T_0^2) = E_\theta(T_0 T_1)$
$\Longrightarrow Cov_\theta(T_0, T_1) = Var_\theta(T_0) = Var_\theta(T_1)$ $\forall\theta \in \Theta$
$\Longrightarrow \rho_{T_0 T_1} = 1$ $\forall\theta \in \Theta$
$\Longrightarrow \theta = E_\theta(T_0) = E_\theta(-\tfrac{a}{b} T_1) = E_\theta(T_1)$ $\forall\theta \in \Theta$
$\Longrightarrow -\tfrac{a}{b} = 1$
$\Longrightarrow P_\theta(T_0 = T_1) = 1$ $\forall\theta \in \Theta$
Theorem 8.4.9:
(i) If a UMVUE T exists for a real function d(θ), then λT is the UMVUE for λd(θ), λ ∈ IR.
(ii) If UMVUE’s T1 and T2 exist for real functions d1 (θ) and d2 (θ), respectively, then T1 +T2
is the UMVUE for d1 (θ) + d2 (θ).
Proof:
Homework.
Theorem 8.4.10:
If a sample consists of n independent observations X1 , . . . , Xn from the same distribution, the
UMVUE, if it exists, is permutation invariant.
Proof:
Homework.
Theorem 8.4.12: Lehmann–Scheffé
If T is a complete sufficient statistic and if there exists an unbiased estimate h of θ, then
E(h | T ) is the (unique) UMVUE.
Proof:
Note:
We can use Theorem 8.4.12 to find the UMVUE in two ways if we have a complete sufficient
statistic T :
(i) If we can find an unbiased estimate h(T ), it will be the UMVUE since E(h(T ) | T ) =
h(T ).
(ii) If we have any unbiased estimate h and if we can calculate E(h | T ), then E(h | T )
will be the UMVUE. The process of determining the UMVUE this way often is called
Rao–Blackwellization.
(iii) Even if a complete sufficient statistic does not exist, the UMVUE may still exist (see
Rohatgi, page 357–358, Example 10).
Example 8.4.13:
Let X1, . . . , Xn be iid Bin(1, p). Then $T = \sum_{i=1}^n X_i$ is a complete sufficient statistic as seen in Examples 8.3.6 and 8.3.11.
Since E(X1 ) = p, X1 is an unbiased estimate of p. However, due to part (i) of the Note above,
since X1 is not a function of T , X1 is not the UMVUE.
We can use part (ii) of the Note above to construct the UMVUE. It is
If we are interested in the UMVUE for d(p) = p(1 − p) = p − p2 = V ar(X), we can find it in
the following way:
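For reference, this construction leads to the well-known closed forms $T/n$ as UMVUE of p and $T(n-T)/(n(n-1))$ as UMVUE of $p(1-p)$; an exact unbiasedness check of both (n and the grid of p values below are arbitrary choices):

```python
import math

# Exact unbiasedness check: with T ~ Bin(n, p), E(T/n) = p and
# E(T(n - T)/(n(n - 1))) = p(1 - p).
n = 6

def binom_pmf(t, n, p):
    return math.comb(n, t) * p**t * (1 - p)**(n - t)

def expect(h, n, p):
    # E_p[h(T)] by enumeration over the Bin(n, p) pmf
    return sum(h(t) * binom_pmf(t, n, p) for t in range(n + 1))

for p in (0.1, 0.37, 0.5, 0.9):
    assert abs(expect(lambda t: t / n, n, p) - p) < 1e-12
    assert abs(expect(lambda t: t * (n - t) / (n * (n - 1)), n, p) - p * (1 - p)) < 1e-12
```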
8.5 Lower Bounds for the Variance of an Estimate
Theorem 8.5.1: Cramér–Rao Inequality
Let ψ(θ) be defined on Θ and let it be differentiable for all θ ∈ Θ. Let T be an unbiased estimate of ψ(θ) such that $E_\theta(T^2) < \infty$ $\forall\theta \in \Theta$. Suppose that
(i) $\frac{\partial f_\theta(x)}{\partial\theta}$ is defined for all θ ∈ Θ,
$$(\psi'(\theta))^2 \leq E_\theta\left((T(X) - \chi(\theta))^2\right) \; E_\theta\left(\left(\frac{\partial \log f_\theta(X)}{\partial\theta}\right)^2\right) \quad \forall\theta \in \Theta \qquad (A).$$
Further, for any θ0 ∈ Θ, either ψ′(θ0) = 0 and equality holds in (A) for θ = θ0, or we have
$$E_{\theta_0}\left((T(X) - \chi(\theta_0))^2\right) \geq \frac{(\psi'(\theta_0))^2}{E_{\theta_0}\left(\left(\frac{\partial \log f_\theta(X)}{\partial\theta}\right)^2\right)} \qquad (B).$$
Finally, if equality holds in (B), then there exists a real number K(θ0) ≠ 0 such that
$$T(X) - \chi(\theta_0) = K(\theta_0) \left.\frac{\partial \log f_\theta(X)}{\partial\theta}\right|_{\theta=\theta_0} \qquad (C)$$
Note:
(i) Conditions (i), (ii), and (iii) are called regularity conditions. Conditions under which
they hold can be found in Rohatgi, page 11–13, Parts 12 and 13.
(ii) The right hand side of inequality (B) is called Cramér–Rao Lower Bound of θ0 , or, in
symbols CRLB(θ0 ).
(iii) The expression $E_\theta\left(\left(\frac{\partial \log f_\theta(X)}{\partial\theta}\right)^2\right)$ is called the Fisher Information in X.
Proof:
From (ii), we get
$$E_\theta\left(\frac{\partial}{\partial\theta} \log f_\theta(X)\right) = \int \frac{\partial}{\partial\theta} \log f_\theta(x) \, f_\theta(x) \, dx = \int \frac{\partial f_\theta(x)}{\partial\theta} \frac{1}{f_\theta(x)} f_\theta(x) \, dx = \int \frac{\partial f_\theta(x)}{\partial\theta} \, dx = 0$$
$$\Longrightarrow E_\theta\left(\chi(\theta) \frac{\partial}{\partial\theta} \log f_\theta(X)\right) = 0$$
From (iii), we get
$$E_\theta\left(T(X) \frac{\partial}{\partial\theta} \log f_\theta(X)\right) = \int T(x) \frac{\partial}{\partial\theta} \log f_\theta(x) \, f_\theta(x) \, dx = \int T(x) \frac{\partial f_\theta(x)}{\partial\theta} \, dx \stackrel{(iii)}{=} \frac{\partial}{\partial\theta} \int T(x) f_\theta(x) \, dx = \frac{\partial}{\partial\theta} E(T(X)) = \psi'(\theta)$$
$$\Longrightarrow E_\theta\left((T(X) - \chi(\theta)) \frac{\partial}{\partial\theta} \log f_\theta(X)\right) = \psi'(\theta) \qquad (+)$$
$$\Longrightarrow (\psi'(\theta))^2 = \left(E_\theta\left((T(X) - \chi(\theta)) \frac{\partial}{\partial\theta} \log f_\theta(X)\right)\right)^2 \stackrel{(*)}{\leq} E_\theta\left((T(X) - \chi(\theta))^2\right) E_\theta\left(\left(\frac{\partial}{\partial\theta} \log f_\theta(X)\right)^2\right),$$
i.e., (A) holds. (∗) follows from the Cauchy–Schwarz–Inequality (Theorem 4.5.7 (ii)).
If ψ′(θ0) ≠ 0, then the left–hand side of (A) is > 0. Therefore, the right–hand side is > 0. Thus,
$$E_{\theta_0}\left(\left(\frac{\partial}{\partial\theta} \log f_\theta(X)\right)^2\right) > 0,$$
and (B) follows directly from (A).
Finally, if equality holds in (B), then ψ′(θ0) ≠ 0 (because T is not constant). Thus, MSE(χ(θ0), T(X)) > 0. The Cauchy–Schwarz–Inequality (Theorem 4.5.7 (iii)) gives equality iff there exist constants (α, β) ∈ IR² − {(0, 0)} such that
$$P\left(\alpha(T(X) - \chi(\theta_0)) + \beta \left.\frac{\partial}{\partial\theta} \log f_\theta(X)\right|_{\theta=\theta_0} = 0\right) = 1.$$
This implies $K(\theta_0) = -\frac{\beta}{\alpha}$ and (C) holds. Since T is not a constant, it also holds that K(θ0) ≠ 0.
Example 8.5.2:
If we take χ(θ) = ψ(θ), we get from (B)
$$Var_\theta(T(X)) \geq \frac{(\psi'(\theta))^2}{E_\theta\left(\left(\frac{\partial \log f_\theta(X)}{\partial\theta}\right)^2\right)} \qquad (*).$$
Finally, if X = (X1, . . . , Xn) iid with identical fθ(x), the inequality (∗) reduces to
$$Var_\theta(T(X)) \geq \frac{(\psi'(\theta))^2}{n \, E_\theta\left(\left(\frac{\partial \log f_\theta(X_1)}{\partial\theta}\right)^2\right)}.$$
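A sketch of the iid form of the bound for Bin(1, p): the per-observation Fisher Information is $1/(p(1-p))$, so with ψ(p) = p the bound is $p(1-p)/n$, which $Var(\bar{X})$ attains. The check below approximates the score by a numerical derivative; p, n, and the step size are arbitrary choices.

```python
import math

# For Bin(1, p): Fisher Information per observation is 1/(p(1-p)), so with
# psi(p) = p the bound is p(1-p)/n, and Var(X_bar) = p(1-p)/n attains it.
p, n, h = 0.3, 25, 1e-6

def log_f(p, x):
    # log of the Bin(1, p) pmf
    return x * math.log(p) + (1 - x) * math.log(1 - p)

def score(p, x):
    # numerical derivative of log f_p(x) with respect to p
    return (log_f(p + h, x) - log_f(p - h, x)) / (2 * h)

# Fisher Information per observation: E[(score)^2]
info = score(p, 1) ** 2 * p + score(p, 0) ** 2 * (1 - p)
crlb = 1 / (n * info)       # lower bound for unbiased estimates of p
var_xbar = p * (1 - p) / n  # variance of X_bar, which attains the bound
```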
Example 8.5.3:
Let X1 , . . . , Xn be iid Bin(1, p). Let X ∼ Bin(n, p), p ∈ Θ = (0, 1) ⊂ IR. Let
$$\psi(p) = E(T(X)) = \sum_{x=0}^n T(x) \binom{n}{x} p^x (1-p)^{n-x}.$$
ψ(p) is differentiable with respect to p under the summation sign since it is a finite polynomial
in p.
Example 8.5.4:
Let X ∼ U (0, θ), θ ∈ Θ = (0, ∞) ⊂ IR.
Theorem 8.5.5: Chapman, Robbins, Kiefer Inequality (CRK Inequality)
Let Θ ⊆ IR. Let {fθ : θ ∈ Θ} be a family of pdf’s or pmf’s. Let ψ(θ) be defined on Θ. Let
T be an unbiased estimate of ψ(θ) such that Eθ (T 2 ) < ∞ ∀θ ∈ Θ.
If θ 6= ϑ, θ and ϑ ∈ Θ, assume that fθ (x) and fϑ (x) are different. Also assume that there
exists such a ϑ ∈ Θ such that θ 6= ϑ and
Proof:
Since T is unbiased, it follows
Eϑ (T (X)) = ψ(ϑ) ∀ϑ ∈ Θ.
Finally, we take the supremum of the right–hand side with respect to {ϑ : S(ϑ) ⊂ S(θ),
ϑ 6= θ}, which completes the proof.
Note:
(i) The CRK inequality holds without the previous regularity conditions.
(iii) The CRK inequality works for discrete Θ, the CRLB does not work in such cases.
Example 8.5.6:
Let X ∼ U(0, θ), θ > 0. The required conditions for the CRLB are not met. Recall from Example 8.5.4 that $\frac{n+1}{n} X_{(n)}$ is UMVUE with $Var\left(\frac{n+1}{n} X_{(n)}\right) = \frac{\theta^2}{n(n+2)} < \frac{\theta^2}{n} = CRLB$.
Definition 8.5.7:
Let T1 , T2 be unbiased estimates of θ with Eθ (T12 ) < ∞ and Eθ (T22 ) < ∞ ∀θ ∈ Θ. We define
the efficiency of T1 relative to T2 by
$$eff_\theta(T_1, T_2) = \frac{Var_\theta(T_1)}{Var_\theta(T_2)}$$
Definition 8.5.8:
Assume the regularity conditions of Theorem 8.5.1 are satisfied by a family of cdf’s {Fθ : θ ∈
Θ}. An unbiased estimate T for θ is most efficient for {Fθ } if
$$Var_\theta(T) = \left(E_\theta\left(\left(\frac{\partial \log f_\theta(X)}{\partial\theta}\right)^2\right)\right)^{-1}$$
Definition 8.5.9:
Let T be the most efficient estimate for the family of cdf’s {Fθ : θ ∈ Θ}, Θ ⊆ IR. Then the
efficiency of any unbiased T1 of θ is defined as
$$eff_\theta(T_1) = eff_\theta(T_1, T) = \frac{Var_\theta(T_1)}{Var_\theta(T)}.$$
Definition 8.5.10:
T1 is asymptotically (most) efficient if T1 is asymptotically unbiased, i.e., $\lim_{n\to\infty} E_\theta(T_1) = \theta$, and $\lim_{n\to\infty} eff_\theta(T_1) = 1$, where n is the sample size.
Theorem 8.5.11:
A necessary and sufficient condition for an estimate T of θ to be most efficient is that T is
sufficient and
$$\frac{1}{K(\theta)}(T(x) - \theta) = \frac{\partial \log f_\theta(x)}{\partial\theta} \quad \forall\theta \in \Theta \qquad (*),$$
where K(θ) is defined as in Theorem 8.5.1 and the regularity conditions for Theorem 8.5.1
hold.
Proof:
“=⇒:”
Theorem 8.5.1 says that if T is most efficient, then (∗) holds.
Therefore,
fθ0 (x) = exp(T (x)C(θ0 ) − ψ(θ0 ) + λ(x))
“⇐=:”
From (∗), we get
$$E_\theta\left(\left(\frac{\partial \log f_\theta(X)}{\partial\theta}\right)^2\right) = \frac{1}{(K(\theta))^2} Var_\theta(T(X)).$$
Additionally, it holds
$$E_\theta\left((T(X) - \theta) \frac{\partial \log f_\theta(X)}{\partial\theta}\right) = 1$$
as shown in the Proof of Theorem 8.5.1 (let χ(θ) = θ in (+)), i.e.,
$$K(\theta) = \left(E_\theta\left(\left(\frac{\partial \log f_\theta(X)}{\partial\theta}\right)^2\right)\right)^{-1}.$$
Therefore,
$$Var_\theta(T(X)) = \left(E_\theta\left(\left(\frac{\partial \log f_\theta(X)}{\partial\theta}\right)^2\right)\right)^{-1},$$
i.e., T is most efficient for θ.
Note:
Instead of saying “a necessary and sufficient condition for an estimate T of θ to be most
efficient ...” in the previous Theorem, we could say that “an estimate T of θ is most efficient
iff ...”, i.e., “necessary and sufficient” means the same as “iff”.
A is necessary for B means: B ⇒ A (because ¬A ⇒ ¬B)
A is sufficient for B means: A ⇒ B
8.6 The Method of Moments
θ = h(m1 , . . . , mk ),
Note:
(i) The Definition above can also be used to estimate joint moments. For example, we use
n
X
1
n Xi Yi to estimate E(XY ).
i=1
(ii) Since $E\left(\frac{1}{n}\sum_{i=1}^n X_i^j\right) = m_j$, method of moments estimates are unbiased for the population moments. The WLLN and the CLT say that these estimates are consistent and asymptotically Normal as well.
(iii) If θ is not a linear function of the population moments, θ̂mom will, in general, not be
unbiased. However, it will be consistent and (usually) asymptotically Normal.
(iv) Method of moments estimates do not exist if the related moments do not exist.
(v) Method of moments estimates may not be unique. If there exist multiple choices for the
mom, one usually takes the estimate involving the lowest–order sample moment.
(vi) Alternative method of moment estimates can be obtained from central moments (rather
than from raw moments) or by using moments other than the first k moments.
Example 8.6.2:
Let X1 , . . . , Xn be iid N (µ, σ 2 ).
Since µ = m1 , it is µ̂mom = X.
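A simulation sketch of the method of moments for this example; the second-moment equation gives the standard estimate $\hat{\sigma}^2_{mom} = m_2 - m_1^2$. The values of µ, σ, n, and the seed below are arbitrary illustrative choices.

```python
import random

# Method of moments for N(mu, sigma^2): mu_hat = m_1, and the
# second-moment equation gives sigma2_hat = m_2 - m_1^2.
random.seed(8)
mu, sigma, n = 1.5, 2.0, 200000

xs = [random.gauss(mu, sigma) for _ in range(n)]
m1 = sum(xs) / n                  # first sample moment
m2 = sum(x * x for x in xs) / n   # second sample moment
mu_mom = m1
sigma2_mom = m2 - m1 * m1
```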
Example 8.6.3:
Let X1 , . . . , Xn be iid Poisson(λ).
8.7 Maximum Likelihood Estimation
Note:
Definition 8.7.2:
A maximum likelihood estimate (MLE) is a non–constant estimate θ̂M L such that
Note:
It is often convenient to work with log L when determining the maximum likelihood estimate.
Since the log is monotone, the maximum is the same.
Example 8.7.3:
Let X1 , . . . , Xn be iid N (µ, σ 2 ), where µ and σ 2 are unknown.
$$L(\mu, \sigma^2; x_1, \ldots, x_n) = \frac{1}{\sigma^n (2\pi)^{n/2}} \exp\left(-\sum_{i=1}^n \frac{(x_i - \mu)^2}{2\sigma^2}\right)$$
Formally, we still have to verify that we found a maximum (and not a minimum) and that there is no parameter θ at the edge of the parameter space Θ at which the likelihood function takes its absolute maximum, which would not be detectable by our approach for local extrema.
Example 8.7.4:
Let X1, . . . , Xn be iid U(θ − 1/2, θ + 1/2).
Example 8.7.5:
Let X ∼ Bin(1, p), p ∈ [1/4, 3/4].
$$L(p; x) = p^x (1-p)^{1-x} = \begin{cases} p, & \text{if } x = 1 \\ 1 - p, & \text{if } x = 0 \end{cases}$$
Theorem 8.7.6:
Let T be a sufficient statistic for fθ (x), θ ∈ Θ. If a unique MLE of θ exists, it is a function
of T .
Proof:
Since T is sufficient, we can write
due to the Factorization Criterion (Theorem 8.3.5). Maximizing the likelihood function with
respect to θ takes h(x) as a constant and therefore is equivalent to maximizing gθ (x) with
respect to θ. But gθ (x) involves x only through T .
Note:
(v) Often (but not always), the MLE will be a sufficient statistic itself.
Theorem 8.7.7:
Suppose the regularity conditions of Theorem 8.5.1 hold and θ belongs to an open interval in
IR. If an estimate θ̂ of θ attains the CRLB, it is the unique MLE.
Proof:
If θ̂ attains the CRLB, it follows by Theorem 8.5.1 that
$$\frac{\partial \log f_\theta(X)}{\partial\theta} = \frac{1}{K(\theta)}(\hat{\theta}(X) - \theta) \quad \text{w.p. } 1.$$
Writing $A(\theta) = \frac{1}{K(\theta)}$ and differentiating with respect to θ gives
$$\frac{\partial^2 \log f_\theta(X)}{\partial\theta^2} = A'(\theta)(\hat{\theta}(X) - \theta) - A(\theta).$$
The Proof of Theorem 8.5.11 gives us
$$A(\theta) = E_\theta\left(\left(\frac{\partial \log f_\theta(X)}{\partial\theta}\right)^2\right) > 0.$$
So
$$\left.\frac{\partial^2 \log f_\theta(X)}{\partial\theta^2}\right|_{\theta=\hat{\theta}} = -A(\hat{\theta}) < 0,$$
Note:
The previous Theorem does not imply that every MLE is most efficient.
Theorem 8.7.8:
Let {fθ : θ ∈ Θ} be a family of pdf’s (or pmf’s) with Θ ⊆ IRk , k ≥ 1. Let h : Θ → ∆ be a
mapping of Θ onto ∆ ⊆ IRp , 1 ≤ p ≤ k. If θ̂ is an MLE of θ, then h(θ̂) is an MLE of h(θ).
Proof:
For each δ ∈ ∆, we define
Θδ = {θ : θ ∈ Θ, h(θ) = δ}
and
M (δ; x) = sup L(θ; x),
θ∈Θδ
but also
$$M(\hat{\delta}; x) \leq \sup_{\delta \in \Delta} M(\delta; x) = \sup_{\delta \in \Delta} \left(\sup_{\theta \in \Theta_\delta} L(\theta; x)\right) = \sup_{\theta \in \Theta} L(\theta; x) = L(\hat{\theta}; x).$$
Therefore,
$$M(\hat{\delta}; x) = L(\hat{\theta}; x) = \sup_{\delta \in \Delta} M(\delta; x).$$
Example 8.7.9:
Let X1 , . . . , Xn be iid Bin(1, p). Let h(p) = p(1 − p).
Theorem 8.7.10:
Consider the following conditions a pdf fθ can fulfill:
(iii) $-\infty < \displaystyle\int_{-\infty}^{\infty} \frac{\partial^2 \log f_\theta(x)}{\partial\theta^2} f_\theta(x) \, dx < 0$ $\forall\theta \in \Theta$.
(iv) There exists a function H(x) such that for all θ ∈ Θ:
$$\left|\frac{\partial^3 \log f_\theta(x)}{\partial\theta^3}\right| < H(x) \quad \text{and} \quad \int_{-\infty}^{\infty} H(x) f_\theta(x) \, dx = M(\theta) < \infty.$$
(v) There exists a function g(θ) that is positive and twice differentiable for every θ ∈ Θ and there exists a function H(x) such that for all θ ∈ Θ:
$$\left|\frac{\partial^2}{\partial\theta^2}\left(g(\theta) \frac{\partial \log f_\theta(x)}{\partial\theta}\right)\right| < H(x) \quad \text{and} \quad \int_{-\infty}^{\infty} H(x) f_\theta(x) \, dx = M(\theta) < \infty.$$
In case that multiple of these conditions are fulfilled, we can make the following statements:
(i) (Cramér) Conditions (i), (iii), and (iv) imply that, with probability approaching 1, as
n → ∞, the likelihood equation has a consistent solution.
(ii) (Cramér) Conditions (i), (ii), (iii), and (iv) imply that a consistent solution θ̂n of the
likelihood equation is asymptotically Normal, i.e.,
$$\frac{\sqrt{n}}{\sigma}(\hat{\theta}_n - \theta) \stackrel{d}{\longrightarrow} Z$$
where Z ∼ N(0, 1) and $\sigma^2 = \left(E_\theta\left(\left(\frac{\partial \log f_\theta(X)}{\partial\theta}\right)^2\right)\right)^{-1}$.
(iii) (Kulldorf) Conditions (i), (iii), and (v) imply that, with probability approaching 1, as
n → ∞, the likelihood equation has a consistent solution.
(iv) (Kulldorf) Conditions (i), (ii), (iii), and (v) imply that a consistent solution θ̂n of the
likelihood equation is asymptotically Normal.
Note:
In case of a pmf fθ , we can define similar conditions as in Theorem 8.7.10.
8.8 Decision Theory — Bayes and Minimax Estimation
A = Θ (Estimation)
Definition 8.8.1:
A decision function d is a statistic, i.e., a Borel–measurable function, that maps IRn into
A. If X = x is observed, the statistician takes action d(x) ∈ A.
Note:
For the remainder of this Section, we are restricting ourselves to A = Θ, i.e., we are facing
the problem of estimation.
Definition 8.8.2:
A non–negative function L that maps Θ × A into IR is called a loss function. The value
L(θ, a) is the loss incurred to the statistician if he/she takes action a when θ is the true pa-
rameter value.
Definition 8.8.3:
Let D be a class of decision functions that map IRn into A. Let L be a loss function on Θ × A.
The function R that maps Θ × D into IR is defined as
Example 8.8.4:
Let A = Θ ⊆ IR. Let L(θ, a) = (θ − a)2 . Then it holds that
Note that this is just the MSE. If θ̂ is unbiased, this would just be V ar(θ̂).
Note:
The basic problem of decision theory is that we would like to find a decision function d ∈ D
such that R(θ, d) is minimized for all θ ∈ Θ. Unfortunately, this is usually not possible.
Definition 8.8.5:
The minimax principle is to choose the decision function d∗ ∈ D such that
Note:
If the problem of interest is an estimation problem, we call a d∗ that satisfies the condition in Definition 8.8.5 a minimax estimate of θ.
Example 8.8.6:
Let X ∼ Bin(1, p), p ∈ Θ = { 14 , 34 } = A.
p       a       L(p, a)
1/4     1/4     0
1/4     3/4     2
3/4     1/4     5
3/4     3/4     0
L(3/4, d1(1)) = L(3/4, 1/4) =
L(1/4, d2(0)) = L(1/4, 1/4) =
L(1/4, d2(1)) = L(1/4, 3/4) =
L(3/4, d2(0)) = L(3/4, 1/4) =
L(3/4, d2(1)) = L(3/4, 3/4) =
L(1/4, d3(0)) = L(1/4, 3/4) =
L(1/4, d3(1)) = L(1/4, 1/4) =
L(3/4, d3(0)) = L(3/4, 3/4) =
L(3/4, d3(1)) = L(3/4, 1/4) =
L(1/4, d4(0)) = L(1/4, 3/4) =
L(1/4, d4(1)) = L(1/4, 3/4) =
L(3/4, d4(0)) = L(3/4, 3/4) =
L(3/4, d4(1)) = L(3/4, 3/4) =
Then, the risk function values are:

i       p = 1/4: R(1/4, di)     p = 3/4: R(3/4, di)     max_{p ∈ {1/4, 3/4}} R(p, di)
1
2
3
4

Hence,
$$\min_{i \in \{1, 2, 3, 4\}} \; \max_{p \in \{1/4, 3/4\}} R(p, d_i) = .$$
Note:
Minimax estimation does not require any unusual assumptions. However, it tends to be very conservative.
Definition 8.8.7:
Suppose we consider θ to be a rv with pdf π(θ) on Θ. We call π the a priori distribution
(or prior distribution).
Note:
f (x | θ) is the conditional density of x given a fixed θ. The joint density of x and θ is
and the a posteriori distribution (or posterior distribution), which gives the distribution
of θ after sampling, has pdf (or pmf)
f (x, θ)
h(θ | x) = .
g(x)
Definition 8.8.8:
The Bayes risk of a decision function d is defined as
Note:
If θ is a continuous rv and X is of continuous type, then
Definition 8.8.9:
A decision function d∗ is called a Bayes rule if d∗ minimizes the Bayes risk, i.e., if
Theorem 8.8.10:
Let A = Θ ⊆ IR. Let L(θ, d(x)) = (θ − d(x))2 . In this case, a Bayes rule is
Proof:
Minimizing
$$R(\pi, d) = \int g(x) \int (\theta - d(x))^2 h(\theta \mid x) \, d\theta \, dx,$$
where g is the marginal pdf of X and h is the conditional pdf of θ given x, is the same as minimizing
$$\int (\theta - d(x))^2 h(\theta \mid x) \, d\theta.$$
However, this is minimized when d(x) = E(θ | X = x) as shown in Stat 6710, Homework 3,
Question (ii), for the unconditional case.
Note:
Under the conditions of Theorem 8.8.10, d(x) = E(θ | X = x) is called the Bayes estimate.
Example 8.8.11:
Let X ∼ Bin(n, p). Let L(p, d(x)) = (p − d(x))2 .
Let π(p) = 1 ∀p ∈ (0, 1), i.e., π ∼ U (0, 1), be the a priori distribution of p.
Then it holds:
$$f(x, p) = \binom{n}{x} p^x (1-p)^{n-x}$$
$$g(x) = \int f(x, p) \, dp = \int_0^1 \binom{n}{x} p^x (1-p)^{n-x} \, dp$$
$$h(p \mid x) = \frac{f(x, p)}{g(x)} = \frac{\binom{n}{x} p^x (1-p)^{n-x}}{\int_0^1 \binom{n}{x} p^x (1-p)^{n-x} \, dp} = \frac{p^x (1-p)^{n-x}}{\int_0^1 p^x (1-p)^{n-x} \, dp}$$
$$E(p \mid x) =$$
$$\hat{p}_{Bayes} =$$
$$= \frac{1}{(n+2)^2} \int_0^1 (1 - 4p + np - np^2 + 4p^2) \, dp$$
$$= \frac{1}{(n+2)^2} \int_0^1 (1 + (n-4)p + (4-n)p^2) \, dp$$
$$= \frac{1}{(n+2)^2} \left[ p + \frac{n-4}{2} p^2 + \frac{4-n}{3} p^3 \right]_0^1$$
$$= \frac{1}{(n+2)^2} \left( 1 + \frac{n-4}{2} + \frac{4-n}{3} \right)$$
$$= \frac{1}{(n+2)^2} \cdot \frac{6 + 3n - 12 + 8 - 2n}{6}$$
$$= \frac{1}{(n+2)^2} \cdot \frac{n+2}{6}$$
$$= \frac{1}{6(n+2)}$$
Now we compare the Bayes rule d∗(X) with the MLE $\hat{p}_{ML} = \frac{X}{n}$. This estimate has Bayes risk
$$R\left(\pi, \frac{X}{n}\right) =$$
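For a numerical cross-check: the Bayes rule here is the posterior mean $d^*(x) = \frac{x+1}{n+2}$ (the posterior is Beta(x + 1, n − x + 1)), its Bayes risk is the value $\frac{1}{6(n+2)}$ derived above, and the Bayes risk of X/n works out to $\frac{1}{6n}$. The value of n and the integration grid below are arbitrary choices.

```python
import math

# Bayes-risk comparison under the U(0, 1) prior: the Bayes rule
# d*(x) = (x+1)/(n+2) has Bayes risk 1/(6(n+2)); the MLE X/n has 1/(6n).
n = 10
N = 20000  # midpoint-rule grid over p in (0, 1)

def risk(d, p):
    # frequentist risk R(p, d) = E_p[(p - d(X))^2] for X ~ Bin(n, p)
    return sum((p - d(x)) ** 2 * math.comb(n, x) * p**x * (1 - p)**(n - x)
               for x in range(n + 1))

def bayes_risk(d):
    # integrate R(p, d) over the uniform prior
    return sum(risk(d, (i + 0.5) / N) for i in range(N)) / N

br_star = bayes_risk(lambda x: (x + 1) / (n + 2))
br_mle = bayes_risk(lambda x: x / n)
```

So the Bayes rule improves on the MLE in Bayes risk for every n.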
Theorem 8.8.12:
Let {fθ : θ ∈ Θ} be a family of pdf’s (or pmf’s). Suppose that an estimate d∗ of θ is a
Bayes estimate corresponding to some prior distribution π on Θ. If the risk function R(θ, d∗ )
is constant on Θ, then d∗ is a minimax estimate of θ.
Proof:
Homework.
Definition 8.8.13:
Let F denote the class of pdf’s (or pmf’s) fθ (x). A class Π of prior distributions is a conju-
gate family for F if the posterior distribution is in the class Π for all f ∈ F , all priors in Π,
and all x ∈ X.
Note:
The beta family is conjugate for the binomial family. Thus, if we start with a beta prior, we
will end up with a beta posterior. (See Homework.)
9 Hypothesis Testing
Definition 9.1.1:
A parametric hypothesis is an assumption about the unknown parameter θ.
H0 : θ ∈ Θ0 ⊂ Θ.
H1 : θ ∈ Θ1 = Θ − Θ0 .
Definition 9.1.2:
If Θ0 (or Θ1 ) contains only one point, we say that H0 and Θ0 (or H1 and Θ1 ) are simple. In
this case, the distribution of X is completely specified under the null (or alternative) hypoth-
esis.
If Θ0 (or Θ1 ) contains more than one point, we say that H0 and Θ0 (or H1 and Θ1 ) are
composite.
Example 9.1.3:
Let X1, . . . , Xn be iid Bin(1, p). Examples for hypotheses are p = 1/2 (simple), p ≥ 1/2 (composite), p ≠ 1/4 (composite), etc.
Note:
The problem of testing a hypothesis can be described as follows: Given a sample point x, find
a decision rule that will lead to a decision to accept or reject the null hypothesis. This means,
we partition the space IRn into two disjoint sets C and C c such that, if x ∈ C, we reject
H0 : θ ∈ Θ0 (and we accept H1 ). Otherwise, if x ∈ C c , we accept H0 that X ∼ Fθ , θ ∈ Θ0 .
Definition 9.1.4:
Let X ∼ Fθ , θ ∈ Θ. Let C be a subset of IRn such that, if x ∈ C, then H0 is rejected (with
probability 1), i.e.,
C = {x ∈ IRn : H0 is rejected for this x}.
Definition 9.1.5:
If we reject H0 when it is true, we call this a Type I error. If we fail to reject H0 when it
is false, we call this a Type II error. Usually, H0 and H1 are chosen such that the Type I
error is considered more serious.
Example 9.1.6:
We first consider a non–statistical example, in this case a jury trial. Our hypotheses are that
the defendant is innocent or guilty. Our possible decisions are guilty or not guilty. Since it is
considered worse to punish the innocent than to let the guilty go free, we make innocence the
null hypothesis. Thus, we have
Truth (unknown)
Innocent (H0 ) Guilty (H1 )
Decision (known)
Not Guilty (H0 ) Correct Type II Error
Guilty (H1 ) Type I Error Correct
The jury tries to make a decision “beyond a reasonable doubt”, i.e., it tries to make the
probability of a Type I error small.
Definition 9.1.7:
If C is the critical region, then Pθ (C), θ ∈ Θ0 , is a probability of Type I error, and
Pθ (C c ), θ ∈ Θ1 , is a probability of Type II error.
Note:
We would like both error probabilities to be 0, but this is usually not possible. We usually
settle for fixing the probability of Type I error to be small, e.g., 0.05 or 0.01, and minimizing
the Type II error.
Definition 9.1.8:
Every Borel–measurable mapping φ of IRn → [0, 1] is called a test function. φ(x) is the
probability of rejecting H0 when x is observed.
If φ is the indicator function of a subset C ⊆ IRn , φ is called a nonrandomized test and C
is the critical region of this test function.
Definition 9.1.9:
Let φ be a test function of the hypothesis H0 : θ ∈ Θ0 against the alternative H1 : θ ∈ Θ1 .
We say that φ has a level of significance of α (or φ is a level–α–test or φ is of size α) if
Eθ (φ(X)) = Pθ (reject H0 ) ≤ α ∀θ ∈ Θ0 .
Definition 9.1.10:
Let φ be a test for the problem (α, Θ0 , Θ1 ). For every θ ∈ Θ, we define
We call βφ (θ) the power function of φ. For any θ ∈ Θ1 , βφ (θ) is called the power of φ
against the alternative θ.
Definition 9.1.11:
Let Φα be the class of all tests for (α, Θ0 , Θ1 ). A test φ0 ∈ Φα is called a most powerful
(MP) test against an alternative θ ∈ Θ1 if
Definition 9.1.12:
Let Φα be the class of all tests for (α, Θ0 , Θ1 ). A test φ0 ∈ Φα is called a uniformly most
powerful (UMP) test if
Example 9.1.13:
Let X1 , . . . , Xn be iid N (µ, 1), µ ∈ Θ = {µ0 , µ1 }, µ0 < µ1 .
Under H0 it holds that X ∼ N (µ0 , 1/n). For a given α, we can solve the following equation
for k:

Pµ0 (X > k) = P( (X − µ0 )/(1/√n) > (k − µ0 )/(1/√n) ) = P (Z > zα ) = α

Here, (X − µ0 )/(1/√n) = Z ∼ N (0, 1) and zα is defined in such a way that P (Z > zα ) = α,
i.e., zα is the upper α–quantile of the N (0, 1) distribution. It follows that
(k − µ0 )/(1/√n) = zα and therefore k = µ0 + zα /√n.
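This computation is easy to check numerically. The sketch below uses only the Python standard library; the values µ0 = 0, n = 25, α = 0.05 are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Example 9.1.13 numerically: X1,...,Xn iid N(mu,1), H0: mu = mu0;
# reject H0 when the sample mean exceeds k = mu0 + z_alpha/sqrt(n).
mu0, n, alpha = 0.0, 25, 0.05              # illustrative values
z_alpha = NormalDist().inv_cdf(1 - alpha)  # upper alpha-quantile of N(0,1)
k = mu0 + z_alpha / sqrt(n)

# Under H0, Xbar ~ N(mu0, 1/n), so the size P(Xbar > k) should equal alpha.
size = 1 - NormalDist(mu0, 1 / sqrt(n)).cdf(k)
```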
Example 9.1.14:
Let X ∼ Bin(6, p), p ∈ Θ = (0, 1).
H0 : p = 1/2, H1 : p ≠ 1/2.
Reasonable plan: Since Ep=1/2 (X) = 3, reject H0 when |X − 3| ≥ c for some constant c. But
how should we select c?

x       c = |x − 3|     Pp=1/2 (X = x)      Pp=1/2 (|X − 3| ≥ c)
0, 6    3               1/64 each           2/64 ≈ 0.031
1, 5    2               6/64 each           14/64 ≈ 0.219
2, 4    1               15/64 each          44/64 ≈ 0.688
3       0               20/64               64/64 = 1
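The tail probabilities in this table can be reproduced with a few lines of standard-library Python:

```python
from math import comb

# Example 9.1.14: X ~ Bin(6, 1/2); the test rejects H0 when |X - 3| >= c.
n = 6
pmf = {x: comb(n, x) / 2 ** n for x in range(n + 1)}
size = {c: sum(p for x, p in pmf.items() if abs(x - 3) >= c) for c in range(4)}
# size[c] is the probability of a Type I error for the test with cutoff c
```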
9.2 The Neyman–Pearson Lemma
Theorem 9.2.1: Neyman–Pearson Lemma
Let fθ0 and fθ1 be the densities (pmf’s) corresponding to the simple hypotheses H0 : θ = θ0
and H1 : θ = θ1 .
(i) Any test of the form

φ(x) = 1,    if fθ1 (x) > k fθ0 (x)
       γ(x), if fθ1 (x) = k fθ0 (x)          (∗)
       0,    if fθ1 (x) < k fθ0 (x)

for some k ≥ 0 and 0 ≤ γ(x) ≤ 1, is most powerful of its significance level for testing
H0 vs. H1 .
If k = ∞, the test

φ(x) = 1, if fθ0 (x) = 0
       0, if fθ0 (x) > 0                     (∗∗)

is most powerful of size (or significance level) 0 for testing H0 vs. H1 .
(ii) Given 0 ≤ α ≤ 1, there exists a test of the form (∗) or (∗∗) with γ(x) = γ (i.e., a
constant) such that
Eθ0 (φ(X)) = α.
Proof:
We prove the continuous case only.
(i):
Theorem 9.2.2:
If a sufficient statistic T exists for the family {fθ : θ ∈ Θ = {θ0 , θ1 }}, then the Neyman–
Pearson most powerful test is a function of T .
Proof:
Homework
Example 9.2.3:
We want to test H0 : X ∼ N (0, 1) vs. H1 : X ∼ Cauchy(1, 0), based on a single observation.
It is

f1 (x)/f0 (x) = [ (1/π) · 1/(1 + x²) ] / [ (1/√(2π)) exp(−x²/2) ] = √(2/π) · exp(x²/2)/(1 + x²).

The MP test is

φ(x) = 1, if √(2/π) · exp(x²/2)/(1 + x²) > k
       0, otherwise
where k is determined such that EH0 (φ(X)) = α.
If α < 0.113, we reject H0 if |x| > zα/2 , where zα/2 is the upper α/2 quantile of a N (0, 1)
distribution.
If α > 0.113, we reject H0 if |x| > k1 or if |x| < k2 , where k1 > 0, k2 > 0 are such that

exp(k1²/2)/(1 + k1²) = exp(k2²/2)/(1 + k2²)   and   ∫_{k2}^{k1} (1/√(2π)) exp(−x²/2) dx = (1 − α)/2.
Why is α = 0.113 so interesting?
For x = 0, it is

f1 (x)/f0 (x) = √(2/π) ≈ 0.7979.

Similarly, for x ≈ −1.585 and x ≈ 1.585, it is

f1 (x)/f0 (x) = √(2/π) · exp((±1.585)²/2)/(1 + (±1.585)²) ≈ 0.7979 ≈ f1 (0)/f0 (0).

So for k = 0.7979 the rejection region is exactly {x : |x| > 1.585}, which under H0 has
probability 2(1 − Φ(1.585)) ≈ 0.113.
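These numbers can be confirmed with a short standard-library check; the cutoff 1.585 is taken from the example above.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

# Example 9.2.3: likelihood ratio f1/f0 of Cauchy(1,0) against N(0,1).
def lr(x):
    return sqrt(2 / pi) * exp(x * x / 2) / (1 + x * x)

r0 = lr(0.0)        # sqrt(2/pi), roughly 0.7979
r1 = lr(1.585)      # approximately the same value again
# With k = lr(0), the rejection region is |x| > 1.585, whose size under
# H0 (i.e., under N(0,1)) is about 0.113.
size = 2 * (1 - NormalDist().cdf(1.585))
```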
9.3 Monotone Likelihood Ratios
Definition 9.3.1:
Let {fθ : θ ∈ Θ ⊆ IR} be a family of pdf’s (pmf’s) on a one–dimensional parameter space.
We say the family {fθ } has a monotone likelihood ratio (MLR) in statistic T (X) if for
θ1 < θ2 , whenever fθ1 and fθ2 are distinct, the ratio fθ2 (x)/fθ1 (x) is a nondecreasing function
of T (x) on the set of values x for which at least one of fθ1 (x) and fθ2 (x) is > 0.
Note:
We can also define families of densities with nonincreasing MLR in T (X), but such families
can be treated by symmetry.
Example 9.3.2:
Let X1 , . . . , Xn ∼ U [0, θ], θ > 0. Then the joint pdf is

fθ (x) = 1/θⁿ, if 0 ≤ x(1) ≤ x(n) ≤ θ;  0, otherwise
       = (1/θⁿ) I[0,∞) (x(1) ) I[0,θ] (x(n) ).

For θ1 < θ2 , the ratio fθ2 (x)/fθ1 (x) equals (θ1 /θ2 )ⁿ for x(n) ≤ θ1 and is ∞ for
θ1 < x(n) ≤ θ2 , so it is nondecreasing in x(n) . Thus, this family has a MLR in T (X) = X(n) .
Theorem 9.3.3:
The one–parameter exponential family fθ (x) = exp(Q(θ)T (x) + D(θ) + S(x)), where Q(θ) is
nondecreasing, has a MLR in T (X).
Proof:
Homework.
Example 9.3.4:
Let X = (X1 , · · · , Xn ) be a random sample from the Poisson family with parameter λ > 0.
Then the joint pmf is

fλ (x) = Π_{i=1}^n e^{−λ} λ^{xi} /xi ! = e^{−nλ} λ^{Σ xi} Π_{i=1}^n (1/xi !)
       = exp( −nλ + Σ_{i=1}^n xi · log(λ) − Σ_{i=1}^n log(xi !) ).

Since Q(λ) = log(λ) is a nondecreasing function of λ, it follows by Theorem 9.3.3 that the
Poisson family with parameter λ > 0 has a MLR in T (X) = Σ_{i=1}^n Xi .
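For a sample of size n the likelihood ratio reduces to f_{λ2}(x)/f_{λ1}(x) = e^{−n(λ2−λ1)} (λ2/λ1)^t with t = Σ xi, so the MLR property can be verified directly; the rates and sample size below are illustrative.

```python
from math import exp

# Example 9.3.4: for Poisson samples the ratio f_{l2}/f_{l1} depends on x
# only through t = sum(x_i); check it is nondecreasing in t when l1 < l2.
def ratio(t, n, l1, l2):
    return exp(-n * (l2 - l1)) * (l2 / l1) ** t

n, l1, l2 = 5, 1.0, 2.5      # illustrative sample size and rates
vals = [ratio(t, n, l1, l2) for t in range(30)]
nondecreasing = all(a <= b for a, b in zip(vals, vals[1:]))
```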
Theorem 9.3.5:
Let X ∼ fθ , θ ∈ Θ ⊆ IR, where the family {fθ } has a MLR in T (X). For testing
H0 : θ ≤ θ0 vs. H1 : θ > θ0 , any test of the form

φ(x) = 1, if T (x) > t0 ;  γ, if T (x) = t0 ;  0, if T (x) < t0

has a nondecreasing power function and is UMP of its size Eθ0 (φ(X)) = α, if the size is not 0.
Proof:
“=⇒”:
“⇐=”:
Use the Neyman–Pearson Lemma (Theorem 9.2.1).
Note:
By interchanging inequalities throughout Theorem 9.3.5 and its proof, we see that this
Theorem also provides a solution of the dual problem H0′ : θ ≥ θ0 vs. H1′ : θ < θ0 .
Theorem 9.3.6:
For the one–parameter exponential family, there exists a UMP two–sided test of H0 : θ ≤ θ1
or θ ≥ θ2 (where θ1 < θ2 ) vs. H1 : θ1 < θ < θ2 of the form

φ(x) = 1,  if c1 < T (x) < c2
       γi , if T (x) = ci , i = 1, 2
       0,  if T (x) < c1 or T (x) > c2
Note:
UMP tests for H0 : θ1 ≤ θ ≤ θ2 and H0′ : θ = θ0 do not exist for one–parameter exponential
families.
9.4 Unbiased and Invariant Tests
Definition 9.4.1:
A size α test φ of H0 : θ ∈ Θ0 vs. H1 : θ ∈ Θ1 is unbiased if

Eθ (φ(X)) ≥ α ∀θ ∈ Θ1 .
Note:
This condition means that βφ (θ) ≤ α ∀θ ∈ Θ0 and βφ (θ) ≥ α ∀θ ∈ Θ1 . In other words, the
power of this test is never less than α.
Definition 9.4.2:
Let Uα be the class of all unbiased size α tests of H0 vs H1 . If there exists a test φ ∈ Uα
that has maximal power for all θ ∈ Θ1 , we call φ a UMP unbiased (UMPU) size α test.
Note:
It holds that Uα ⊆ Φα . A UMP test φα ∈ Φα will have βφα (θ) ≥ α ∀θ ∈ Θ1 since it must be
at least as powerful as the trivial test φ(x) ≡ α. Thus, if a UMP test exists in Φα , it is
also a UMPU test in Uα .
Example 9.4.3:
Let X1 , . . . , Xn be iid N (µ, σ²), where σ² > 0 is known. Consider H0 : µ = µ0 vs H1 : µ ≠ µ0 .
From the Neyman–Pearson Lemma, we know that for µ1 > µ0 , the MP test is of the form

φ1 (x) = 1, if x > µ0 + (σ/√n) zα ;  0, otherwise.

Similarly, for µ1 < µ0 , the MP test φ2 rejects H0 if x < µ0 − (σ/√n) zα .
If a test is UMP, it must have the same rejection region as φ1 and φ2 . However, these 2
rejection regions are different (actually, their intersection is empty). Thus, there exists no
UMP test.
We next state a helpful Theorem and then continue with this example and see how we can
find a UMPU test.
Theorem 9.4.4:
Let c1 , . . . , cn ∈ IR be constants and f1 (x), . . . , fn+1 (x) be real–valued functions. Let C be the
class of functions φ(x) satisfying 0 ≤ φ(x) ≤ 1 and

∫_{−∞}^{∞} φ(x) fi (x) dx = ci  ∀i = 1, . . . , n.

If φ∗ ∈ C satisfies

φ∗ (x) = 1, if fn+1 (x) > Σ_{i=1}^n ki fi (x)
         0, if fn+1 (x) < Σ_{i=1}^n ki fi (x)

for some constants k1 , . . . , kn ∈ IR, then φ∗ maximizes ∫_{−∞}^{∞} φ(x) fn+1 (x) dx among all φ ∈ C.
Proof:
Let φ∗ (x) be as above. Let φ(x) be any other function in C. Since 0 ≤ φ(x) ≤ 1 ∀x, it is

(φ∗ (x) − φ(x)) ( fn+1 (x) − Σ_{i=1}^n ki fi (x) ) ≥ 0  ∀x.

This holds since if φ∗ (x) = 1, the left factor is ≥ 0 and the right factor is ≥ 0. If φ∗ (x) = 0,
the left factor is ≤ 0 and the right factor is ≤ 0.
Therefore,

0 ≤ ∫ (φ∗ (x) − φ(x)) ( fn+1 (x) − Σ_{i=1}^n ki fi (x) ) dx
  = ∫ φ∗ (x) fn+1 (x) dx − ∫ φ(x) fn+1 (x) dx − Σ_{i=1}^n ki ( ∫ φ∗ (x) fi (x) dx − ∫ φ(x) fi (x) dx ),

where each term in the last sum is ci − ci = 0.
Thus,

∫ φ∗ (x) fn+1 (x) dx ≥ ∫ φ(x) fn+1 (x) dx.
Note:
(ii) The Theorem above is the Neyman–Pearson Lemma if n = 1, f1 = fθ0 , f2 = fθ1 , and
c1 = α.
Example 9.4.3: (continued)
So far, we have seen that there exists no UMP test for H0 : µ = µ0 vs H1 : µ ≠ µ0 .
Due to Theorem 9.2.2, we only have to consider functions of the sufficient statistic T (X) = X.
Let τ² = σ²/n. We restrict attention to tests φ(t) satisfying
(i) Eµ0 (φ(T )) = ∫ φ(t) fµ0 (t) dt = α, and
(ii) ∂/∂µ ∫ φ(t) fµ (t) dt |µ=µ0 = ∫ φ(t) [∂fµ (t)/∂µ]µ=µ0 dt = 0, i.e., we have a minimum at µ0 .

We want to maximize ∫ φ(t) fµ (t) dt, µ ≠ µ0 , such that conditions (i) and (ii) hold.
Note that the left hand side of this inequality is increasing in x if µ1 > µ0 and decreasing in
x if µ1 < µ0 . Either way, we can choose k1 and k2 such that the linear function in x crosses
the exponential function in x at the two points

µL = µ0 − (σ/√n) zα/2 ,  µU = µ0 + (σ/√n) zα/2 .

The resulting test φ3 rejects H0 if x < µL or x > µU . Obviously, φ3 satisfies (i). We still
need to check that φ3 satisfies (ii) and that βφ3 (µ) has a minimum at µ0 , but we omit this
part from our proof here.
φ3 is of the form φ∗ in Theorem 9.4.4 and therefore φ3 is UMP in C. But the trivial test
φt (x) = α also satisfies (i) and (ii) above. Therefore, βφ3 (µ) ≥ α ∀µ ≠ µ0 . This means that
φ3 is unbiased.
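The unbiasedness of φ3 can also be seen numerically: the power function of the two-sided test has its minimum α at µ0. A standard-library sketch with illustrative values:

```python
from math import sqrt
from statistics import NormalDist

# Example 9.4.3: power of the test that rejects when |Xbar - mu0| exceeds
# (sigma/sqrt(n)) z_{alpha/2}, for X1,...,Xn iid N(mu, sigma^2).
mu0, sigma, n, alpha = 0.0, 1.0, 16, 0.05     # illustrative values
half_width = NormalDist().inv_cdf(1 - alpha / 2) * sigma / sqrt(n)

def power(mu):
    xbar = NormalDist(mu, sigma / sqrt(n))    # distribution of Xbar
    return xbar.cdf(mu0 - half_width) + 1 - xbar.cdf(mu0 + half_width)

grid = [mu0 + d / 10 for d in range(-20, 21)]
betas = [power(m) for m in grid]
```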
Definition 9.4.5:
A test φ is said to be α–similar on a subset Θ∗ of Θ if
βφ (θ) = Eθ (φ(X)) = α ∀θ ∈ Θ∗ .
Note:
The trivial test φ(x) = α is α–similar on every Θ∗ ⊆ Θ.
Theorem 9.4.6:
Let φ be an unbiased test of size α for H0 : θ ∈ Θ0 vs H1 : θ ∈ Θ1 such that βφ (θ) is a
continuous function in θ. Then φ is α–similar on the boundary Λ = Θ̄0 ∩ Θ̄1 , where Θ̄0 and
Θ̄1 are the closures of Θ0 and Θ1 , respectively.
Proof:
Let θ ∈ Λ. There exist sequences {θn } and {θn′ } with θn ∈ Θ0 and θn′ ∈ Θ1 such that
lim_{n→∞} θn = θ and lim_{n→∞} θn′ = θ.
Since βφ (θn ) ≤ α ∀n implies (by continuity of βφ ) that βφ (θ) ≤ α, and since βφ (θn′ ) ≥ α ∀n
implies that βφ (θ) ≥ α, it must hold that βφ (θ) = α ∀θ ∈ Λ.
Definition 9.4.7:
A test φ that is UMP among all α–similar tests on the boundary Λ = Θ̄0 ∩ Θ̄1 is called a
UMP α–similar test.
Theorem 9.4.8:
Suppose βφ (θ) is continuous in θ for all tests φ of H0 : θ ∈ Θ0 vs H1 : θ ∈ Θ1 . If a size α
test of H0 vs H1 is UMP α–similar, then it is UMP unbiased.
Proof:
Let φ0 be UMP α–similar and of size α. This means that Eθ (φ(X)) ≤ α ∀θ ∈ Θ0 .
Since the trivial test φ(x) = α is α–similar, it must hold for φ0 that βφ0 (θ) ≥ α ∀θ ∈ Θ1 since
φ0 is UMP α–similar. This implies that φ0 is unbiased.
Since βφ (θ) is continuous in θ, we see from Theorem 9.4.6 that the class of unbiased tests is
a subclass of the class of α–similar tests. Since φ0 is UMP in the larger class, it is also UMP
in the subclass. Thus, φ0 is UMPU.
Note:
The continuity of the power function βφ (θ) cannot always be checked easily.
Example 9.4.9:
Let X1 , . . . , Xn ∼ N (µ, 1).
Let H0 : µ ≤ 0 vs H1 : µ > 0.
Since the family of densities has a MLR in Σ_{i=1}^n Xi , we could use Theorem 9.3.5 to find
a UMP test. However, we want to illustrate the use of Theorem 9.4.8 here.
The power function βφ (µ) = Eµ (φ(X)) of any test φ is continuous in µ. Thus, due to Theorem
9.4.6, any unbiased size α test of H0 is α–similar on the boundary Λ = {0}.
A UMP α–similar test rejects H0 for large values of the sample, or equivalently, by Theorem
9.2.2,

φ(x) = 1, if T = Σ_{i=1}^n Xi > k
       0, otherwise,

where k = √n zα . Indeed, for µ ≤ 0,

Eµ (φ(X)) = Pµ (T > √n zα ) = P( (T − nµ)/√n > zα − √n µ ) (∗)= P (Z > zα − √n µ) ≤ P (Z > zα ) = α,

with equality for µ = 0.
(∗) holds since (T − nµ)/√n ∼ N (0, 1) for µ ≤ 0, and zα − √n µ ≥ zα for µ ≤ 0.
Thus all the requirements are met for Theorem 9.4.8, i.e., βφ is continuous and φ is UMP
α–similar and of size α, and thus φ is UMPU.
Note:
Rohatgi, page 428–430, lists Theorems (without proofs), stating that for Normal data, one–
and two–tailed t–tests, one– and two–tailed χ2 –tests, two–sample t–tests, and F –tests are all
UMPU.
Note:
Recall from Definition 8.2.4 that a class of distributions is invariant under a group G of
transformations, if for each g ∈ G and for each θ ∈ Θ there exists a unique θ′ ∈ Θ such that
if X ∼ Pθ , then g(X) ∼ Pθ′ .
Definition 9.4.10:
A group G of transformations on X leaves a hypothesis testing problem invariant if G
leaves both {Pθ : θ ∈ Θ0 } and {Pθ : θ ∈ Θ1 } invariant, i.e., if y = g(x) ∼ hθ (y), then
{fθ (x) : θ ∈ Θ0 } ≡ {hθ (y) : θ ∈ Θ0 } and {fθ (x) : θ ∈ Θ1 } ≡ {hθ (y) : θ ∈ Θ1 }.
Note:
We want two types of invariance for our tests:
Formal Invariance: If two tests have the same structure, i.e, the same Θ, the same pdf’s (or
pmf’s), and the same hypotheses, then we should use the same test in both problems.
So, if the transformed problem in terms of y has the same formal structure as that of
the problem in terms of x, we must have that φ∗ (y) = φ(x) = φ∗ (g(x)).
Definition 9.4.11:
An invariant test with respect to a group G of transformations is any test φ such that
φ(x) = φ(g(x)) ∀x ∀g ∈ G.
Example 9.4.12:
Let X ∼ Bin(n, p). Let H0 : p = 1/2 vs. H1 : p ≠ 1/2.
Consider the group G = {g1 , g2 }, where g1 (x) = n − x and g2 (x) = x (the identity).
If φ is invariant, then φ(x) = φ(n − x). Is the test problem invariant? For g2 , the answer is
obvious.
For g1 , we get:

g1 (X) = n − X ∼ Bin(n, 1 − p)

H0 : p = 1/2 : {fp (x) : p = 1/2} = {hp (g1 (x)) : p = 1/2} = Bin(n, 1/2)
H1 : p ≠ 1/2 : {fp (x) : p ≠ 1/2} = {hp (g1 (x)) : p ≠ 1/2}, both classes being {Bin(n, p) : p ≠ 1/2}.

So all the requirements in Definition 9.4.10 are met. If, for example, n = 10, the test

φ(x) = 1, if x = 0, 1, 2, 8, 9, 10
       0, otherwise

is invariant, since φ(x) = φ(10 − x).
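Invariance and size of this test are easy to verify in a few lines of standard-library Python:

```python
from math import comb

# Example 9.4.12 with n = 10: reject H0 for x in {0,1,2,8,9,10}.
n = 10
reject = {0, 1, 2, 8, 9, 10}
invariant = all((x in reject) == ((n - x) in reject) for x in range(n + 1))
size = sum(comb(n, x) for x in reject) / 2 ** n   # size under p = 1/2
```

The size works out to 112/1024 ≈ 0.109.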
Example 9.4.13:
Let X1 , . . . , Xn ∼ N (µ, σ²) where both µ and σ² > 0 are unknown. It is X ∼ N (µ, σ²/n)
and (n − 1)S²/σ² ∼ χ²n−1 , with X and S² independent.
So, this is the same family of distributions and Definition 9.4.10 holds because µ ≤ 0 implies
that cµ ≤ 0 (for c > 0).
If x1 /s1 ≠ x2 /s2 , then there exists no c > 0 such that (x2 , s2²) ≡ (cx1 , c²s1²). So invariance
places no restrictions on φ across points with different values of x/s, but forces φ to be
constant whenever x1 /s1 = x2 /s2 . Thus, invariant tests are exactly those that depend only
on x/s, which are equivalent to tests that are based only on t = √n x/s. Since this mapping
is 1–to–1, the invariant test will use T = √n X/S ∼ tn−1 if µ = 0. Note that this test does
not depend on the nuisance parameter σ². Invariance often produces such results.
Definition 9.4.14:
Let G be a group of transformations on the space of X. We say a statistic T (x) is maximal
invariant under G if
(i) T is invariant, i.e., T (g(x)) = T (x) ∀x ∀g ∈ G, and
(ii) T is maximal, i.e., T (x1 ) = T (x2 ) implies that x1 = g(x2 ) for some g ∈ G.
Example 9.4.15:
Let x = (x1 , . . . , xn ) and gc (x) = (x1 + c, . . . , xn + c), c ∈ IR. Then T (x) = (x1 − xn , . . . ,
xn−1 − xn ) is maximal invariant under G = {gc : c ∈ IR}: T (gc (x)) = T (x) since differences
are unchanged by a common shift, and if T (x) = T (y), then y = gc (x) for c = yn − xn .
Definition 9.4.16:
Let Iα be the class of all invariant tests of size α of H0 : θ ∈ Θ0 vs. H1 : θ ∈ Θ1 . If there
exists a UMP member in Iα , it is called the UMP invariant test of H0 vs H1 .
Theorem 9.4.17:
Let T (x) be maximal invariant with respect to G. A test φ is invariant under G iff φ is a
function of T .
Proof:
“=⇒”:
Let φ be invariant under G. If T (x1 ) = T (x2 ), then there exists a g ∈ G such that x1 = g(x2 ).
Thus, it follows from invariance that φ(x1 ) = φ(g(x2 )) = φ(x2 ). Since φ is the same whenever
T (x1 ) = T (x2 ), φ must be a function of T .
“⇐=”:
Let φ be a function of T , i.e., φ(x) = h(T (x)). It follows that

φ(g(x)) = h(T (g(x))) (∗)= h(T (x)) = φ(x),

where (∗) holds since T is invariant under G.
Example 9.4.18:
Consider the test problem
where θ ∈ IR.
where c ∈ IR and n ≥ 2.
10 More on Hypothesis Testing
10.1 Likelihood Ratio Tests
Definition 10.1.1:
The likelihood ratio test (LRT) statistic for testing

H0 : θ ∈ Θ0 vs. H1 : θ ∈ Θ1 = Θ − Θ0

is

λ(x) = sup_{θ∈Θ0 } fθ (x) / sup_{θ∈Θ} fθ (x).

An LRT rejects H0 iff λ(x) ≤ c
for some constant c ∈ [0, 1], where c is usually chosen in such a way to make φ a test of size
α.
Note:
(ii) LRT’s are strongly related to MLE’s. If θ̂ is the unrestricted MLE of θ over Θ and θ̂0 is
the MLE of θ over Θ0 , then λ(x) = fθ̂0 (x)/fθ̂ (x).
Example 10.1.2:
Let X1 , . . . , Xn be a sample from N (µ, 1). We want to construct a LRT for
H0 : µ = µ0 vs. H1 : µ 6= µ0 .
Theorem 10.1.3:
If T (X) is sufficient for θ and λ∗ (t) and λ(x) are LRT statistics based on T and X respectively,
then
λ∗ (T (x)) = λ(x) ∀x,
i.e., the LRT can be expressed as a function of every sufficient statistic for θ.
Proof:
Since T is sufficient, it follows from Theorem 8.3.5 that its pdf (or pmf) factorizes as fθ (x) =
gθ (T )h(x). Therefore we get:

λ(x) = sup_{θ∈Θ0 } fθ (x) / sup_{θ∈Θ} fθ (x)
     = sup_{θ∈Θ0 } gθ (T )h(x) / sup_{θ∈Θ} gθ (T )h(x)
     = sup_{θ∈Θ0 } gθ (T ) / sup_{θ∈Θ} gθ (T )
     = λ∗ (T (x))

Thus, our simplified expression for λ(x) indeed only depends on a sufficient statistic T .
Theorem 10.1.4:
If for a given α, 0 ≤ α ≤ 1, and for a simple hypothesis H0 and a simple alternative H1 a
non–randomized test based on the NP Lemma and LRT’s exist, then these tests are equivalent.
Proof:
See Homework.
Note:
Usually, LRT’s perform well since they are often UMP or UMPU size α tests. However, this
does not always hold. Rohatgi, Example 4, page 440–441, cites an example where the LRT is
not unbiased and it is even worse than the trivial test φ(x) = α.
Theorem 10.1.5:
Under some regularity conditions on fθ (x), the rv −2 log λ(X) under H0 has asymptotically
a chi–squared distribution with ν degrees of freedom, where ν equals the difference between
the number of independent parameters in Θ and Θ0 , i.e.,
−2 log λ(X) →d χ²ν under H0 .
Note:
The regularity conditions required for Theorem 10.1.5 are basically the same as for Theorem
8.7.10. Under “independent” parameters we understand parameters that are unspecified, i.e.,
free to vary.
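Theorem 10.1.5 can be illustrated by simulation. For N(µ,1) data and H0 : µ = µ0 (the setting of Example 10.1.2, with µ0 = 0 here), −2 log λ(X) works out to n X̄², so its rejection rate at the χ²₁ critical value 3.841 should be close to 0.05. A standard-library sketch:

```python
import random
from statistics import NormalDist

# Simulate -2 log lambda = n * Xbar^2 under H0: mu = 0 for N(mu,1) data.
random.seed(1)
n, reps = 50, 4000
crit = NormalDist().inv_cdf(0.975) ** 2   # chi^2_1 upper 5% point, 3.841
hits = 0
for _ in range(reps):
    xbar = sum(random.gauss(0, 1) for _ in range(n)) / n
    hits += n * xbar * xbar > crit
rate = hits / reps                        # should be close to 0.05
```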
Example 10.1.6:
Let X1 , . . . , Xn ∼ N (µ, σ 2 ) where µ ∈ IR and σ 2 > 0 are both unknown.
Let H0 : µ = µ0 vs. H1 : µ 6= µ0 .
Note that

f1,n−1 (f ) = [ Γ(n/2) / ( Γ(1/2) Γ((n − 1)/2) (n − 1)^{1/2} ) ] · f^{−1/2} ( 1 + f /(n − 1) )^{−n/2} I[0,∞) (f )

is the pdf of a F1,n−1 distribution.
Let y = (1 + f /(n − 1))^{−1} ; then f /(n − 1) = (1 − y)/y and df = −((n − 1)/y²) dy.
Thus,
As n → ∞, we can apply Stirling’s formula, which states that

Γ(α(n) + 1) = (α(n))! ≈ √(2π) (α(n))^{α(n)+1/2} exp(−α(n)).

So,

Mn (t) ≈ [ √(2π) ((n − 2)/2)^{(n−1)/2} exp(−(n − 2)/2) · √(2π) ((n(1 − 2t) − 3)/2)^{(n(1−2t)−2)/2} exp(−(n(1 − 2t) − 3)/2) ]
       / [ √(2π) ((n − 3)/2)^{(n−2)/2} exp(−(n − 3)/2) · √(2π) ((n(1 − 2t) − 2)/2)^{(n(1−2t)−1)/2} exp(−(n(1 − 2t) − 2)/2) ]
10.2 Parametric Chi–Squared Tests
Reject H0 at level α if

      H0        H1        µ known                              µ unknown
I     σ ≥ σ0    σ < σ0    Σ(xi − µ)² ≤ σ0² χ²n;1−α             s² ≤ (σ0²/(n − 1)) χ²n−1;1−α
II    σ ≤ σ0    σ > σ0    Σ(xi − µ)² ≥ σ0² χ²n;α               s² ≥ (σ0²/(n − 1)) χ²n−1;α
III   σ = σ0    σ ≠ σ0    Σ(xi − µ)² ≤ σ0² χ²n;1−α/2           s² ≤ (σ0²/(n − 1)) χ²n−1;1−α/2
                          or Σ(xi − µ)² ≥ σ0² χ²n;α/2          or s² ≥ (σ0²/(n − 1)) χ²n−1;α/2
Note:
(iii) In test III, the constants have been chosen in such a way to give equal probability to
each tail. This is the usual approach. However, this may result in a biased test.
(iv) χ2n;1−α is the (lower) α quantile and χ2n;α is the (upper) 1 − α quantile, i.e., for X ∼ χ2n ,
it holds that P (X ≤ χ2n;1−α ) = α and P (X ≤ χ2n;α ) = 1 − α.
(v) We can also use χ2 tests to test for equality of binomial probabilities as shown in the
next few Theorems.
Theorem 10.2.2:
Let X1 , . . . , Xk be independent rv’s with Xi ∼ Bin(ni , pi ), i = 1, . . . , k. Then it holds that

T = Σ_{i=1}^k ( (Xi − ni pi ) / √(ni pi (1 − pi )) )² →d χ²k

as n1 , . . . , nk → ∞.
Proof:
Homework
Corollary 10.2.3:
Let X1 , . . . , Xk be as in Theorem 10.2.2 above. We want to test the hypothesis H0 : p1 =
p2 = . . . = pk = p, where p is a known constant (vs. the alternative H1 that at least one of
the pi ’s is different from the other ones). An approximate level–α test rejects H0 if

y = Σ_{i=1}^k ( (xi − ni p) / √(ni p(1 − p)) )² ≥ χ²k;α .
Theorem 10.2.4:
Let X1 , . . . , Xk be independent rv’s with Xi ∼ Bin(ni , p), i = 1, . . . , k. Then the MLE of p is

p̂ = Σ_{i=1}^k xi / Σ_{i=1}^k ni .

Proof:
This can be shown by using the joint likelihood function or by the fact that Σ Xi ∼ Bin(Σ ni , p).
Theorem 10.2.5:
Let X1 , . . . , Xk be independent rv’s with Xi ∼ Bin(ni , pi ), i = 1, . . . , k. An approximate
level–α test of H0 : p1 = p2 = . . . = pk = p, where p is unknown (vs. the alternative H1 that
at least one of the pi ’s is different from the other ones), rejects H0 if

y = Σ_{i=1}^k ( (xi − ni p̂) / √(ni p̂(1 − p̂)) )² ≥ χ²k−1;α ,

where p̂ = Σ xi / Σ ni .
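A quick sketch of this test in Python; the counts below are made up for illustration, and the critical value χ²₂;0.05 = 5.991 is hard-coded since the standard library has no χ² quantile function.

```python
# Theorem 10.2.5: test H0: p1 = p2 = p3 = p (p unknown) from independent
# binomial counts.
xs = [45, 52, 60]       # illustrative successes
ns = [100, 100, 120]    # illustrative sample sizes
p_hat = sum(xs) / sum(ns)
y = sum((x - m * p_hat) ** 2 / (m * p_hat * (1 - p_hat))
        for x, m in zip(xs, ns))
reject = y >= 5.991     # chi^2_{k-1;alpha} with k = 3, alpha = 0.05
```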
Theorem 10.2.6:
Let (X1 , . . . , Xk ) be a multinomial rv with parameters n, p1 , p2 , . . . , pk , where Σ_{i=1}^k pi = 1
and Σ_{i=1}^k Xi = n. Then it holds that

Uk = Σ_{i=1}^k (Xi − npi )²/(npi ) →d χ²k−1
as n → ∞.
Proof:
Case k = 2 only:
Theorem 10.2.7:
Let X1 , . . . , Xn be a sample from X. Let H0 : X ∼ F , where the functional form of F is
known completely. We partition the real line into k disjoint Borel sets A1 , . . . , Ak and let
P (X ∈ Ai ) = pi , where pi > 0 ∀i = 1, . . . , k.
Let Yj = #{Xi ’s in Aj } = Σ_{i=1}^n IAj (Xi ), ∀j = 1, . . . , k.
Then, (Y1 , . . . , Yk ) has multinomial distribution with parameters n, p1 , p2 , . . . , pk .
Theorem 10.2.8:
Let X1 , . . . , Xn be a sample from X. Let H0 : X ∼ Fθ , where θ = (θ1 , . . . , θr ) is unknown.
Let the MLE θ̂ exist. We partition the real line into k disjoint Borel sets A1 , . . . , Ak and let
Pθ̂ (X ∈ Ai ) = p̂i , where p̂i > 0 ∀i = 1, . . . , k.
Let Yj = #{Xi ’s in Aj } = Σ_{i=1}^n IAj (Xi ), ∀j = 1, . . . , k.
Then it holds that

Vk = Σ_{i=1}^k (Yi − np̂i )²/(np̂i ) →d χ²k−r−1 .
An approximate level–α test of H0 : X ∼ Fθ rejects H0 if

Σ_{i=1}^k (yi − np̂i )²/(np̂i ) > χ²k−r−1;α .
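A goodness-of-fit sketch along the lines of Theorem 10.2.8, fitting a Poisson(λ) with λ estimated from the data. The counts, the cell partition, and the simplification in the MLE step are all made up for illustration; with k = 5 cells and r = 1 estimated parameter, the statistic is compared with χ²₃ quantiles.

```python
from math import exp, factorial

# H0: X ~ Poisson(lambda); cells A_0,...,A_3 = {0},...,{3}, A_4 = {4,5,...}.
counts = {0: 18, 1: 30, 2: 26, 3: 16, 4: 10}   # observed cell counts
n = sum(counts.values())
# MLE of lambda (treating the last cell's observations as the value 4,
# a simplification for this sketch):
lam = sum(x * c for x, c in counts.items()) / n

def cell_prob(x):
    if x < 4:
        return exp(-lam) * lam ** x / factorial(x)
    return 1 - sum(cell_prob(j) for j in range(4))

v = sum((c - n * cell_prob(x)) ** 2 / (n * cell_prob(x))
        for x, c in counts.items())
df = len(counts) - 1 - 1   # k - r - 1 = 3; compare v with chi^2_{3;alpha}
```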
10.3 t–Tests and F –Tests
(Based on Rohatgi, Section 10.4 & 10.5 & Rohatgi/Saleh, Section 10.4 & 10.5)
Definition 10.3.1: One– and Two–Tailed t–Tests
Let X1 , . . . , Xn be a sample from a N (µ, σ²) distribution where σ² > 0 may be known or
unknown and µ is unknown. Let X = (1/n) Σ_{i=1}^n Xi and S² = (1/(n − 1)) Σ_{i=1}^n (Xi − X)².
The following table summarizes the z– and t–tests that are typically being used:

Reject H0 at level α if

      H0        H1        σ² known                       σ² unknown
I     µ ≤ µ0    µ > µ0    x ≥ µ0 + (σ/√n) zα             x ≥ µ0 + (s/√n) tn−1;α
II    µ ≥ µ0    µ < µ0    x ≤ µ0 + (σ/√n) z1−α           x ≤ µ0 + (s/√n) tn−1;1−α
III   µ = µ0    µ ≠ µ0    |x − µ0 | ≥ (σ/√n) zα/2        |x − µ0 | ≥ (s/√n) tn−1;α/2
Note:
(ii) These tests are based on just one sample and are often called one sample t–tests.
(iii) Tests I and II are UMP and test III is UMPU if σ 2 is known. Tests I, II, and III are
UMPU and UMP invariant if σ 2 is unknown.
(iv) For large n (≥ 30), we can use z–tables instead of t-tables. Also, for large n we can
drop the Normality assumption due to the CLT. However, for small n, none of these
simplifications is justified.
The following table summarizes the z– and t–tests that are typically being used:
Reject H0 at level α if

      H0              H1              σ1², σ2² known                            σ1², σ2² unknown, σ1 = σ2
I     µ1 − µ2 ≤ δ     µ1 − µ2 > δ     x − y ≥ δ + zα √(σ1²/m + σ2²/n)           x − y ≥ δ + tm+n−2;α sp √(1/m + 1/n)
II    µ1 − µ2 ≥ δ     µ1 − µ2 < δ     x − y ≤ δ + z1−α √(σ1²/m + σ2²/n)         x − y ≤ δ + tm+n−2;1−α sp √(1/m + 1/n)
III   µ1 − µ2 = δ     µ1 − µ2 ≠ δ     |x − y − δ| ≥ zα/2 √(σ1²/m + σ2²/n)       |x − y − δ| ≥ tm+n−2;α/2 sp √(1/m + 1/n)
Note:
(iii) If σ1² = σ2² = σ² (which is unknown), then Sp² is an unbiased estimate of σ². We should
check that σ1² = σ2² with an F –test.
(iv) For large m + n, we can use z–tables instead of t-tables. Also, for large m and large n
we can drop the Normality assumption due to the CLT. However, for small m or small
n, none of these simplifications is justified.
      H0              H1              Reject H0 at level α if
I     µ1 − µ2 ≤ δ     µ1 − µ2 > δ     d ≥ δ + (sd /√n) tn−1;α
II    µ1 − µ2 ≥ δ     µ1 − µ2 < δ     d ≤ δ + (sd /√n) tn−1;1−α
III   µ1 − µ2 = δ     µ1 − µ2 ≠ δ     |d − δ| ≥ (sd /√n) tn−1;α/2
Note:
(ii) These tests are special cases of one–sample tests. All the properties stated in the Note
following Definition 10.3.1 hold.
(iii) We could do a test based on Normality assumptions if σ 2 = σ12 + σ22 − 2ρσ1 σ2 were
known, but that is a very unrealistic assumption.
Recall that

Σ_{i=1}^m (Xi − X)² / σ1² ∼ χ²m−1 ,   Σ_{i=1}^n (Yi − Y )² / σ2² ∼ χ²n−1 ,

and

[ Σ_{i=1}^m (Xi − X)² / ((m − 1)σ1²) ] / [ Σ_{i=1}^n (Yi − Y )² / ((n − 1)σ2²) ] = σ2² S1² / (σ1² S2²) ∼ Fm−1,n−1 .
The following table summarizes the F –tests that are typically being used:
Reject H0 at level α if

      H0           H1           µ1 , µ2 known                                              µ1 , µ2 unknown
I     σ1² ≤ σ2²    σ1² > σ2²    [(1/m)Σ(xi − µ1 )²] / [(1/n)Σ(yi − µ2 )²] ≥ Fm,n;α         s1²/s2² ≥ Fm−1,n−1;α
II    σ1² ≥ σ2²    σ1² < σ2²    [(1/n)Σ(yi − µ2 )²] / [(1/m)Σ(xi − µ1 )²] ≥ Fn,m;α         s2²/s1² ≥ Fn−1,m−1;α
III   σ1² = σ2²    σ1² ≠ σ2²    [(1/m)Σ(xi − µ1 )²] / [(1/n)Σ(yi − µ2 )²] ≥ Fm,n;α/2       s1²/s2² ≥ Fm−1,n−1;α/2 if s1² ≥ s2²
                                or [(1/n)Σ(yi − µ2 )²] / [(1/m)Σ(xi − µ1 )²] ≥ Fn,m;α/2    or s2²/s1² ≥ Fn−1,m−1;α/2 if s1² < s2²
Note:
(i) Tests I and II are UMPU and UMP invariant if µ1 and µ2 are unknown.
(ii) Test III uses equal tails and therefore may not be unbiased.
(iii) If an F –test (at level α1 ) and a t–test (at level α2 ) are both performed, the combined
test has level α = 1 − (1 − α1 )(1 − α2 ) ≥ max(α1 , α2 ) (≡ α1 + α2 if both are small).
(iv) Fm,n;1−α = 1 / Fn,m;α .
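A minimal two-sample variance sketch (means unknown) on simulated data; since the standard library has no F quantile function, the code only computes the statistic s1²/s2², which would then be compared with F tables or software.

```python
import random
from statistics import variance

# Variance-ratio statistic for test I (mu1, mu2 unknown).
random.seed(7)
m, n = 30, 25
xs = [random.gauss(0.0, 2.0) for _ in range(m)]   # simulated, sigma1 = 2
ys = [random.gauss(5.0, 2.0) for _ in range(n)]   # simulated, sigma2 = 2
f_stat = variance(xs) / variance(ys)   # compare with F_{m-1,n-1;alpha}
```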
10.4 Bayes and Minimax Tests
0–1 loss: a(θ) = b(θ) = 1, i.e., all errors are equally bad.
Generalized 0–1 loss: a(θ) = cII , b(θ) = cI , i.e., all Type I errors are equally costly, all
Type II errors are equally costly, and one type of error may be costlier than the other.
Theorem 10.4.1:
The minimax rule d for testing

H0 : θ = θ0 vs. H1 : θ = θ1

is of the form: reject H0 if

fθ1 (x)/fθ0 (x) ≥ k,

where k is chosen such that d has equal risks under θ0 and θ1 .
Proof:
Let d∗ be any other rule.
Example 10.4.2:
Let X1 , . . . , Xn be iid N (µ, 1). Let H0 : µ = µ0 vs. H1 : µ = µ1 > µ0 .
Note:
Now suppose we have a prior distribution π(θ) on Θ. Then the Bayes risk of a decision rule
d (under the loss function introduced before) is

R(π, d) = ∫_Θ R(θ, d) π(θ) dθ

if π is a pdf.
The Bayes risk for a pmf π looks similar (see Rohatgi, page 461).
Theorem 10.4.3:
The Bayes rule for testing H0 : θ = θ0 vs. H1 : θ = θ1 under the prior π(θ0 ) = π0 and
π(θ1 ) = π1 = 1 − π0 and the generalized 0–1 loss function is to reject H0 if

fθ1 (x)/fθ0 (x) ≥ cI π0 / (cII π1 ).
Proof:
Note:
For minimax rules and Bayes rules, the significance level α is no longer predetermined.
Example 10.4.4:
Let X1 , . . . , Xn be iid N (µ, 1). Let H0 : µ = µ0 vs. H1 : µ = µ1 > µ0 . Let cI = cII .
Note:
We can generalize Theorem 10.4.3 to the case of classifying among k options θ1 , . . . , θk . If we
use the 0–1 loss function

L(θi , d) = 1, if d(X) = θj for some j ≠ i;  0, if d(X) = θi ,
Example 10.4.5:
Let X1 , . . . , Xn be iid N (µ, 1). Let µ1 < µ2 < µ3 and let π1 = π2 = π3 .
Choose µ = µi if

πi exp( −Σ(xk − µi )²/2 ) ≥ πj exp( −Σ(xk − µj )²/2 ),  j ≠ i, j = 1, 2, 3,

which, since the priors are equal, simplifies to

x (µi − µj ) ≥ (µi − µj )(µi + µj )/2,  j ≠ i, j = 1, 2, 3.

In our particular example, we get the following decision rules:
(i) Choose µ1 if x ≤ (µ1 + µ2 )/2 (and x ≤ (µ1 + µ3 )/2).
(ii) Choose µ2 if x ≥ (µ1 + µ2 )/2 and x ≤ (µ2 + µ3 )/2.
(iii) Choose µ3 if x ≥ (µ2 + µ3 )/2 (and x ≥ (µ1 + µ3 )/2).
Note that in (i) and (iii) the condition in parentheses automatically holds when the other
condition holds.
With means whose midpoints are (µ1 + µ2 )/2 = 1 and (µ2 + µ3 )/2 = 3, the rules become:
(i) Choose µ1 if x ≤ 1.
(ii) Choose µ2 if 1 ≤ x ≤ 3.
(iii) Choose µ3 if x ≥ 3.
We do not have to worry how to handle the boundary since the probability that the rv will
realize on any of the two boundary points is 0.
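With equal priors, the rule derived above is just "pick the nearest mean", with cutpoints at the midpoints. A sketch; the means (0, 2, 4) are illustrative values chosen so that the cutpoints are 1 and 3:

```python
# Example 10.4.5 sketch: with equal priors, maximizing
# pi_i * exp(-sum((x_k - mu_i)^2)/2) reduces to choosing the mean
# closest to the sample mean xbar.
mus = (0.0, 2.0, 4.0)   # illustrative mu1 < mu2 < mu3

def choose(xbar):
    return min(mus, key=lambda m: abs(xbar - m))
```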
11 Confidence Estimation
11.1 Fundamental Notions
Consider a rv X and constants 0 < a < b for which P (a < X < b) is known.
The interval I(X) = (aX/b, X) is an example of a random interval. I(X) contains the value
a with the fixed probability P (a < X < b).
For example, if X ∼ U (0, 1), a = 1/4, and b = 3/4, then the interval I(X) = (X/3, X)
contains 1/4 with probability 1/2.
Definition 11.1.1:
Let Pθ , θ ∈ Θ ⊆ IRk , be a set of probability distributions of a rv X. A family of subsets
S(x) of Θ, where S(x) depends on x but not on θ, is called a family of random sets. In
particular, if θ ∈ Θ ⊆ IR and S(x) is an interval (θ(x), θ̄(x)), where θ(x) and θ̄(x) depend on
x but not on θ, we call S(X) a random interval, with θ(X) and θ̄(X) as lower and upper
bounds, respectively. θ(X) may be −∞ and θ̄(X) may be +∞.
Note:
Frequently in inference, we are not interested in estimating a parameter or testing a hypoth-
esis about it. Instead, we are interested in establishing a lower or upper bound (or both) for
one or multiple parameters.
Definition 11.1.2:
A family of subsets S(x) of Θ ⊆ IRk is called a family of confidence sets at confidence
level 1 − α if
Pθ (S(X) ∋ θ) ≥ 1 − α ∀θ ∈ Θ.
The quantity

inf_{θ∈Θ} Pθ (S(X) ∋ θ) = 1 − α

is called the confidence coefficient (i.e., the smallest probability of true coverage is 1 − α).
Definition 11.1.3:
For k = 1, we use the following names for some of the confidence sets defined in Definition
11.1.2:
(i) If S(x) = (θ(x), ∞), then θ(x) is called a level 1 − α lower confidence bound.
(ii) If S(x) = (−∞, θ̄(x)), then θ̄(x) is called a level 1 − α upper confidence bound.
Definition 11.1.4:
A family of 1 − α level confidence sets {S(x)} is called uniformly most accurate (UMA)
if

Pθ (S(X) ∋ θ0 ) ≤ Pθ (S 0 (X) ∋ θ0 )  ∀θ, θ0 ∈ Θ, θ ≠ θ0 ,

for any 1 − α level family of confidence sets S 0 (X) (i.e., S(x) minimizes the probability
of false (or incorrect) coverage).
Theorem 11.1.5:
Let X1 , . . . , Xn ∼ Fθ , θ ∈ Θ, where Θ is an interval on IR. Let T (X, θ) be a function on
IRn × Θ such that for each θ, T (X, θ) is a statistic, and as a function of θ, T is strictly
monotone (either increasing or decreasing) in θ at every value of x ∈ IRn .
Let Λ ⊆ IR be the range of T and let the equation λ = T (x, θ) be solvable for θ for every
λ ∈ Λ and every x ∈ IRn .
Choose λ1 (α), λ2 (α) ∈ Λ such that Pθ (λ1 (α) < T (X, θ) < λ2 (α)) = 1 − α.
Since the distribution of T (X, θ) is independent of θ, λ1 (α) and λ2 (α) also do not depend on θ.
If T (X, θ) is increasing in θ, solve the equation λ1 (α) = T (X, θ) for the lower bound θ(X)
and λ2 (α) = T (X, θ) for the upper bound θ̄(X).
If T (X, θ) is decreasing in θ, solve the equation λ1 (α) = T (X, θ) for θ̄(X) and λ2 (α) = T (X, θ)
for θ(X).
Then (θ(X), θ̄(X)) is a 1 − α level confidence interval for θ.
Note:
(ii) If T is not monotone, we can still use this Theorem to get confidence sets that may not
be confidence intervals.
Example 11.1.6:
Let X1 , . . . , Xn ∼ N (µ, σ 2 ), where µ and σ 2 > 0 are both unknown. We seek a 1 − α level
confidence interval for µ.
Example 11.1.7:
Let X1 , . . . , Xn ∼ U (0, θ).
We know that θ̂ = max(Xi ) = M axn is the MLE for θ and sufficient for θ.
Then the rv Tn = M axn /θ has the pdf fTn (t) = n t^{n−1} I(0,1) (t).
11.2 Shortest–Length Confidence Intervals
Definition 11.2.1:
A rv T (X, θ) whose distribution is independent of θ is called a pivot.
Note:
The methods we will discuss here can provide the shortest interval based on a given pivot.
They will not guarantee that there is no other pivot with a shorter minimal interval.
Example 11.2.2:
Let X1 , . . . , Xn ∼ N (µ, σ²), where σ² is known. The obvious pivot for µ is

Tµ (X) = (X − µ) / (σ/√n) ∼ N (0, 1).

Suppose that (a, b) is an interval such that P (a < Z < b) = 1 − α, where Z ∼ N (0, 1). Then

1 − α = P( a < (X − µ)/(σ/√n) < b ) = P( X − b σ/√n < µ < X − a σ/√n ),

so the resulting interval has length L = (b − a) σ/√n. We minimize L subject to
Φ(b) − Φ(a) = 1 − α. Differentiating the constraint with respect to a, we get

(d/da)(Φ(b) − Φ(a)) = φ(b) (db/da) − φ(a) = 0,

and therefore

dL/da = (σ/√n) (db/da − 1) = (σ/√n) (φ(a)/φ(b) − 1).

The minimum occurs when φ(a) = φ(b), which happens when a = b or a = −b. If we select
a = b, then Φ(b) − Φ(a) = 0 ≠ 1 − α. Thus, we must have b = −a = zα/2 .
Thus, the shortest CI based on Tµ is

( X − zα/2 σ/√n , X + zα/2 σ/√n ).
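In code (standard library; the data and σ are illustrative):

```python
from math import sqrt
from statistics import NormalDist, mean

# Example 11.2.2: shortest 1 - alpha CI for mu when sigma is known.
sigma, alpha = 2.0, 0.05
data = [4.1, 5.3, 3.8, 4.9, 5.6, 4.4, 5.0, 4.7]   # illustrative sample
n = len(data)
xbar = mean(data)
z = NormalDist().inv_cdf(1 - alpha / 2)           # z_{alpha/2}
ci = (xbar - z * sigma / sqrt(n), xbar + z * sigma / sqrt(n))
length = ci[1] - ci[0]
```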
Definition 11.2.3:
A pdf f (x) is unimodal iff there exists an x∗ such that f (x) is nondecreasing for x ≤ x∗ and
f (x) is nonincreasing for x ≥ x∗ .
Theorem 11.2.4:
Let f (x) be a unimodal pdf. If the interval [a, b] satisfies
(i) ∫_a^b f (x) dx = 1 − α,
(ii) f (a) = f (b) > 0, and
(iii) a ≤ x∗ ≤ b, where x∗ is a mode of f (x),
then the interval [a, b] is the shortest of all intervals which satisfy condition (i).
Proof:
Let [a0 , b0 ] be any interval with b0 − a0 < b − a. We will show that this implies
∫_{a0}^{b0} f (x) dx < 1 − α, i.e., a contradiction.
• Suppose b0 ≤ a. Since f is nondecreasing on (−∞, x∗ ] and f (x) ≥ f (a) for x ∈ [a, b], we get

∫_{a0}^{b0} f (x) dx ≤ f (b0 )(b0 − a0 ) ≤ f (a)(b0 − a0 ) < f (a)(b − a) ≤ ∫_a^b f (x) dx = 1 − α.

• Suppose b0 > a. We can immediately exclude that b0 > b, since then b0 − a0 > b − a, i.e.,
[a0 , b0 ] wouldn’t be of shorter length than [a, b]. Thus, we have to consider the case that
a0 ≤ a < b0 < b. It holds that

∫_{a0}^{b0} f (x) dx = ∫_a^b f (x) dx + ∫_{a0}^{a} f (x) dx − ∫_{b0}^{b} f (x) dx.

Note that ∫_{a0}^{a} f (x) dx ≤ f (a)(a − a0 ) and ∫_{b0}^{b} f (x) dx ≥ f (b)(b − b0 ). Therefore, we get

∫_{a0}^{a} f (x) dx − ∫_{b0}^{b} f (x) dx ≤ f (a)(a − a0 ) − f (b)(b − b0 )
                                        = f (a)((b0 − a0 ) − (b − a))   (since f (a) = f (b))
                                        < 0.

Thus,

∫_{a0}^{b0} f (x) dx < 1 − α.
Note:
Example 11.2.2 is a special case of Theorem 11.2.4. However, Theorem 11.2.4 is not
immediately applicable in the following example since the length of that interval is
proportional to 1/a − 1/b (and not to b − a).
Example 11.2.5:
Let X1 , . . . , Xn ∼ N (µ, σ²), where µ is known. The obvious pivot for σ² is

Tσ² (X) = Σ(Xi − µ)² / σ² ∼ χ²n .

So

P( a < Σ(Xi − µ)²/σ² < b ) = 1 − α
⇐⇒ P( Σ(Xi − µ)²/b < σ² < Σ(Xi − µ)²/a ) = 1 − α.

We wish to minimize

L = (1/a − 1/b) Σ(Xi − µ)²

such that ∫_a^b fn (t) dt = 1 − α, where fn (t) is the pdf of a χ²n distribution.
Differentiating the constraint with respect to a, we get

fn (b) (db/da) − fn (a) = 0,

and

dL/da = ( −1/a² + (1/b²)(db/da) ) Σ(Xi − µ)² = ( −1/a² + fn (a)/(b² fn (b)) ) Σ(Xi − µ)².
We obtain a minimum if a2 fn (a) = b2 fn (b).
Note that in practice equal tails χ2n;α/2 and χ2n;1−α/2 are used, which do not result in shortest–
length CI’s. The reason for this selection is simple: When these tests were developed, com-
puters did not exist that could solve these equations numerically. People in general had to
rely on tabulated values. Manually solving the equation above for each case obviously wasn’t
a feasible solution.
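Today the equation is easy to attack numerically. A crude standard-library sketch (n = 10 and α = 0.05 are illustrative; grid/Riemann-sum accuracy only) confirming that the equal-tails interval is not the shortest in the 1/a − 1/b sense:

```python
from bisect import bisect_left
from math import exp, gamma

n, alpha = 10, 0.05

def pdf(t):                       # chi^2_n density
    return t ** (n / 2 - 1) * exp(-t / 2) / (2 ** (n / 2) * gamma(n / 2))

h = 0.002
ts = [i * h for i in range(1, 30001)]     # grid on (0, 60]
cum, acc = [], 0.0
for t in ts:                              # crude cdf by Riemann sum
    acc += pdf(t) * h
    cum.append(acc)

def interval(i):
    # interval starting at grid point i with coverage 1 - alpha, or None
    j = bisect_left(cum, cum[i] + (1 - alpha))
    if j >= len(ts):
        return None
    return 1 / ts[i] - 1 / ts[j], ts[i], ts[j]

cands = [interval(i) for i in range(500, 2500)]           # a in (1, 5]
L_min, a_star, b_star = min(c for c in cands if c is not None)
L_et, a_et, b_et = interval(bisect_left(cum, alpha / 2))  # equal tails
```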
Example 11.2.6:
Let X1 , . . . , Xn ∼ U (0, θ). Let M axn = max Xi = X(n) . Since Tn = M axn /θ has pdf
n t^{n−1} I(0,1) (t), which does not depend on θ, Tn can be selected as our pivot. The density
of Tn is strictly increasing for n ≥ 2, so we cannot find constants a and b as in Example
11.2.5.
We wish to minimize

L = M axn (1/a − 1/b)

such that ∫_a^b n t^{n−1} dt = bⁿ − aⁿ = 1 − α.
Differentiating the constraint with respect to b, we get

n b^{n−1} − n a^{n−1} (da/db) = 0 =⇒ da/db = b^{n−1}/a^{n−1},

and

dL/db = M axn ( −(1/a²)(da/db) + 1/b² ) = M axn ( −b^{n−1}/a^{n+1} + 1/b² )
      = M axn (a^{n+1} − b^{n+1})/(b² a^{n+1}) < 0 for 0 ≤ a < b ≤ 1.

Thus, L does not have a local minimum. However, since dL/db < 0, L is strictly decreasing
as a function of b. It is minimized when b = 1, i.e., when b is as large as possible. The
corresponding a is selected as a = α^{1/n} .
The shortest 1 − α level CI based on Tn is (M axn , α^{−1/n} M axn ). This is the same CI that
was already obtained in Example 11.1.7.
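A quick simulation check of the coverage of (Maxₙ, α^{−1/n} Maxₙ); θ and n are illustrative:

```python
import random

# Example 11.2.6: the interval (Max_n, alpha**(-1/n) * Max_n) should cover
# theta with probability exactly 1 - alpha.
random.seed(3)
theta, n, alpha, reps = 5.0, 8, 0.1, 20000
hits = 0
for _ in range(reps):
    mx = max(random.uniform(0, theta) for _ in range(n))
    hits += mx < theta < alpha ** (-1 / n) * mx
coverage = hits / reps
```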
11.3 Confidence Intervals and Hypothesis Tests
Conversely, if φ(x, µ0 ) is a family of size α tests of H0 : µ = µ0 , the set {µ0 | φ(x, µ0 ) fails to reject H0 }
is a level 1 − α confidence set for µ0 .
Theorem 11.3.2:
Denote H0 (θ0 ) for H0 : θ = θ0 , and H1 (θ0 ) for the alternative. Let A(θ0 ), θ0 ∈ Θ, denote the
acceptance region of a level–α test of H0 (θ0 ). For each possible observation x, define
S(x) = {θ0 : x ∈ A(θ0 )}. Then S(x) is a family of confidence sets at level 1 − α.
If, moreover, A(θ0 ) is UMP for (α, H0 (θ0 ), H1 (θ0 )), then S(x) minimizes Pθ (S(X) ∋ θ0 ) ∀θ ∈
H1 (θ0 ) among all 1 − α level families of confidence sets, i.e., S(x) is UMA.
Proof:
Example 11.3.3:
Let X be a rv that belongs to a one–parameter exponential family with pdf fθ(x) = exp( Q(θ)T(x) + S′(x) + D(θ) ), where Q(θ) is non–decreasing. We consider a test of H0 : θ = θ0 vs. H1 : θ < θ0. The acceptance region of a UMP size α test of H0 has the form A(θ0) = {x : T(x) > c(θ0)}.
Example 11.3.4:
Let X ∼ Exp(θ) with fθ(x) = (1/θ) e^(−x/θ) I(0,∞)(x), which belongs to a one–parameter exponential family. Then Q(θ) = −1/θ is non–decreasing and T(x) = x.
Note:
Just as we frequently restrict the class of tests (when UMP tests don’t exist), we can make
the same sorts of restrictions on CI’s.
Definition 11.3.5:
A family S(x) of confidence sets for a parameter θ is said to be unbiased at level 1 − α if
If S(x) is unbiased and minimizes Pθ(S(X) ∋ θ0) among all unbiased CI's at level 1 − α, it is called uniformly most accurate unbiased (UMAU).
Theorem 11.3.6:
Let A(θ0) be the acceptance region of a UMPU size α test of H0 : θ = θ0 vs. H1 : θ ≠ θ0 (for all θ0). Then S(x) = {θ : x ∈ A(θ)} is a UMAU family of confidence sets at level 1 − α.
Proof:
Theorem 11.3.7:
Let Θ be an interval on IR and fθ be the pdf of X. Let S(X) be a family of 1 − α level CI's, where S(X) = (θ̲(X), θ̄(X)), with θ̲ and θ̄ increasing functions of X and θ̄(X) − θ̲(X) a finite rv. Then it holds for all θ ∈ Θ that
Eθ(θ̄(X) − θ̲(X)) = ∫_{θ′≠θ} Pθ(S(X) ∋ θ′) dθ′.
Proof:
It holds that θ̄ − θ̲ = ∫_θ̲^θ̄ dθ′. Thus, for all θ ∈ Θ,
Eθ(θ̄(X) − θ̲(X)) = ∫_{IR^n} (θ̄(x) − θ̲(x)) fθ(x) dx
= ∫_{IR^n} ( ∫_{θ̲(x)}^{θ̄(x)} dθ′ ) fθ(x) dx
= ∫_{IR} ( ∫_{θ̄^(−1)(θ′)}^{θ̲^(−1)(θ′)} fθ(x) dx ) dθ′
= ∫_{IR} Pθ( X ∈ [θ̄^(−1)(θ′), θ̲^(−1)(θ′)] ) dθ′
= ∫_{IR} Pθ(S(X) ∋ θ′) dθ′
= ∫_{θ′≠θ} Pθ(S(X) ∋ θ′) dθ′.
Note:
Theorem 11.3.7 says that the expected length of the CI equals the probability that S(X) covers the false value θ′, integrated over all false values θ′.
Corollary 11.3.8:
If S(X) is UMAU, then Eθ (θ(X) − θ(X)) is minimized among all unbiased families of CI’s.
Proof:
In Theorem 11.3.7 we have shown that
Eθ(θ̄(X) − θ̲(X)) = ∫_{θ′≠θ} Pθ(S(X) ∋ θ′) dθ′.
Since a UMAU CI minimizes this probability for all θ′, the entire integral is minimized.
Example 11.3.9:
Let X1 , . . . , Xn ∼ N (µ, σ 2 ), where σ 2 > 0 is known.
By Example 11.2.2, (X̄ − z_{α/2} σ/√n, X̄ + z_{α/2} σ/√n) is the shortest 1 − α level CI for µ.
By Example 9.4.3, the equivalent test is UMPU. So by Theorem 11.3.6 this interval is UMAU
and by Corollary 11.3.8 it has shortest expected length as well.
Example 11.3.10:
Let X1 , . . . , Xn ∼ N (µ, σ 2 ), where µ and σ 2 > 0 are both unknown.
Note that
T(X, σ²) = (n − 1)S² / σ² = Tσ ∼ χ²_{n−1}.
Thus,
Rohatgi, Theorem 4(b), page 428–429, states that the related test is UMPU. Therefore, by
Theorem 11.3.6 and Corollary 11.3.8, our CI is UMAU with shortest expected length among
all unbiased intervals.
Note that this CI is different from the equal–tail CI based on Definition 10.2.1, III, and from
the shortest–length CI obtained in Example 11.2.5.
11.4 Bayes Confidence Intervals
Example 11.4.2:
Let X ∼ Bin(n, p) and π(p) ∼ U (0, 1).
h(p | x) = ( p^x (1 − p)^(n−x) / ∫_0^1 p^x (1 − p)^(n−x) dp ) I(0,1)(p)
= ( Γ(n + 2) / (Γ(x + 1) Γ(n − x + 1)) ) p^x (1 − p)^(n−x) I(0,1)(p)
⟹ p | x ∼ Beta(x + 1, n − x + 1).
Using the observed value for x and tables for incomplete beta integrals or a numerical ap-
proach, we can find λ1 and λ2 such that Pp|x (λ1 < p < λ2 ) = 1 − α. So (λ1 , λ2 ) is a credible
interval for p.
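Assuming scipy is available, λ1 and λ2 can be found directly from the Beta(x + 1, n − x + 1) posterior instead of from tables; n = 20 and x = 6 below are hypothetical illustration values:

```python
from scipy.stats import beta

n, x, alpha = 20, 6, 0.05
posterior = beta(x + 1, n - x + 1)   # p | x ~ Beta(x+1, n-x+1)

# equal-tail credible interval (lam1, lam2) with P(lam1 < p < lam2 | x) = 1 - alpha
lam1 = posterior.ppf(alpha / 2)
lam2 = posterior.ppf(1 - alpha / 2)
mass = posterior.cdf(lam2) - posterior.cdf(lam1)
```

By construction `mass` equals 1 − α up to numerical error.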
Note:
(i) The definitions and interpretations of credible intervals and confidence intervals are quite
different. Therefore, very different intervals may result.
(ii) We can often use Theorem 11.2.4 to find the shortest credible interval (if the precondi-
tions hold).
Example 11.4.3:
Let X1 , . . . , Xn be iid N (µ, 1) and π(µ) ∼ N (0, 1). We want to construct a Bayesian level
1 − α CI for µ.
By Definition 8.8.7, the posterior distribution of µ given x is
h(µ | x) = π(µ) f(x | µ) / g(x),
where
g(x) =
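The notes leave g(x) and the posterior to be derived. As a hedged sanity check, the standard conjugate result for this setting — µ | x ∼ N(n x̄/(n + 1), 1/(n + 1)) — can be verified by normalizing π(µ)f(x | µ) numerically on a grid (the data below are simulated illustration values, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
x = rng.normal(1.0, 1.0, size=n)
xbar = x.mean()

# unnormalized log posterior: log pi(mu) + log f(x | mu), evaluated on a fine grid
mu = np.linspace(-10, 10, 200_001)
log_post = -0.5 * mu**2 - 0.5 * ((x[:, None] - mu) ** 2).sum(axis=0)
w = np.exp(log_post - log_post.max())
w /= w.sum()                          # normalize (g(x) cancels)

post_mean = (w * mu).sum()
post_var = (w * mu**2).sum() - post_mean**2
# conjugate result: mu | x ~ N(n*xbar/(n+1), 1/(n+1))
```

The grid-based mean and variance agree with the conjugate formulas to high precision.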
12 Nonparametric Inference
Definition 12.1.1:
A statistical method which does not rely on assumptions about the distributional form of a rv (except, perhaps, that it is absolutely continuous, or purely discrete) is called a nonparametric or distribution–free method.
Note:
Unless otherwise specified, we make the following assumptions for the remainder of this chapter: Let X1, . . . , Xn be iid ∼ F, where F is unknown. Let P be the class of all possible distributions of X.
Definition 12.1.2:
A statistic T(X) is sufficient for a family of distributions P if the conditional distribution of X given T = t is the same for all F ∈ P.
Example 12.1.3:
Let X1 , . . . , Xn be absolutely continuous. Let T = (X(1) , . . . , X(n) ) be the order statistics.
It holds that
f(x | T = t) = 1/n!,
so T is sufficient for the family of absolutely continuous distributions on IR.
Definition 12.1.4:
A family of distributions P is complete if the only unbiased estimate of 0 is 0 itself, i.e.,
Definition 12.1.5:
A statistic T (X) is complete in relation to P if the class of induced distributions of T is
complete.
Theorem 12.1.6:
The order statistic (X(1), . . . , X(n)) is a complete sufficient statistic, provided that X1, . . . , Xn are of either (pure) discrete or (pure) continuous type.
Definition 12.1.7:
A parameter g(F ) is called estimable if it has an unbiased estimate, i.e., if there exists a
T (X) such that
EF (T (X)) = g(F ) ∀F ∈ P.
Example 12.1.8:
Let P be the class of distributions for which second moments exist. Then X̄ is unbiased for µ(F) = ∫ x dF(x). Thus, µ(F) is estimable.
Definition 12.1.9:
The degree m of an estimable parameter g(F) is the smallest sample size for which an unbiased estimate exists for all F ∈ P.
Lemma 12.1.10:
There exists a symmetric kernel for every estimable parameter.
Proof:
Let T(X1, . . . , Xm) be a kernel of g(F). Define
Ts(X1, . . . , Xm) = (1/m!) Σ T(Xi1, . . . , Xim),
where the summation is over all m! permutations (i1, . . . , im) of {1, . . . , m}.
Example 12.1.11:
(ii) E(I(c,∞) (X1 )) = PF (X > c), where c is a known constant. So g(F ) = PF (X > c) has
degree 1 with kernel I(c,∞) (X1 ).
(iii) There exists no T(X1) such that E(T(X1)) = σ²(F) = ∫ (x − µ(F))² dF(x).
Definition 12.1.12:
Let g(F) be an estimable parameter of degree m. Let X1, . . . , Xn be a sample of size n, n ≥ m. Given a kernel T(Xi1, . . . , Xim) of g(F), we define a U–statistic by
U(X1, . . . , Xn) = (1 / C(n,m)) Σ_c Ts(Xi1, . . . , Xim),
where Ts is defined as in Lemma 12.1.10 and the summation Σ_c is over all C(n,m) combinations of m integers i1 < . . . < im from {1, . . . , n}. U(X1, . . . , Xn) is symmetric in the Xi's and EF(U(X)) = g(F) for all F.
Example 12.1.13:
For estimating µ(F), with degree m = 1:
Symmetric kernel: Ts(Xi) = Xi, i = 1, . . . , n.
U–statistic:
Uµ(X) = (1 / C(n,1)) Σ_c Xi
= (1 · (n − 1)! / n!) Σ_c Xi
= (1/n) Σ_{i=1}^n Xi
= X̄.
For estimating σ²(F), with degree m = 2:
Symmetric kernel: Ts(Xi1, Xi2) = (1/2)(Xi1 − Xi2)², i1, i2 = 1, . . . , n, i1 ≠ i2.
U–statistic:
Uσ²(X) = (1 / C(n,2)) Σ_{i1<i2} (1/2)(Xi1 − Xi2)²
= (1 / C(n,2)) (1/4) Σ_{i1≠i2} (Xi1 − Xi2)²
= ((n − 2)! · 2! / n!) (1/4) Σ_{i1≠i2} (Xi1 − Xi2)²
= (1 / (2n(n − 1))) Σ_{i1} Σ_{i2≠i1} ( Xi1² − 2 Xi1 Xi2 + Xi2² )
= (1 / (2n(n − 1))) [ (n − 1) Σ_{i1=1}^n Xi1² − 2 (Σ_{i1=1}^n Xi1)(Σ_{i2=1}^n Xi2) + 2 Σ_{i=1}^n Xi² + (n − 1) Σ_{i2=1}^n Xi2² ]
= (1 / (2n(n − 1))) [ n Σ Xi1² − Σ Xi1² − 2 (Σ Xi1)² + 2 Σ Xi² + n Σ Xi2² − Σ Xi2² ]
= (1 / (n(n − 1))) [ n Σ_{i=1}^n Xi² − (Σ_{i=1}^n Xi)² ]
= (1 / (n − 1)) Σ_{i=1}^n (Xi − X̄)²
= S².
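The identity Uσ²(X) = S² derived above can be checked numerically; this is a sketch with arbitrary simulated data, not part of the notes:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
x = rng.normal(size=12)
n = len(x)

# U-statistic with symmetric kernel Ts(x1, x2) = (x1 - x2)^2 / 2
pairs = list(combinations(range(n), 2))
u = sum(0.5 * (x[i] - x[j]) ** 2 for i, j in pairs) / len(pairs)

s2 = x.var(ddof=1)   # sample variance S^2
```

Up to floating-point error, `u` and `s2` coincide for any data vector.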
Theorem 12.1.14:
Let P be the class of all absolutely continuous or all purely discrete distribution functions on
IR. Any estimable function g(F), F ∈ P, has a unique estimate that is unbiased and symmetric in the observations and has uniformly minimum variance among all unbiased estimates.
Proof:
Let X1, . . . , Xn iid ∼ F ∈ P, with T(X1, . . . , Xn) an unbiased estimate of g(F). We define
Ti = Ti(X1, . . . , Xn) = T(Xi1, Xi2, . . . , Xin), i = 1, 2, . . . , n!,
where (i1, . . . , in) runs through all n! permutations of {1, . . . , n},
and set T̄ = (1/n!) Σ_{i=1}^{n!} Ti. Then
EF(T̄) = g(F)
and
Var(T̄) = E(T̄²) − (E(T̄))²
= E[ ((1/n!) Σ_{i=1}^{n!} Ti)² ] − [g(F)]²
= (1/n!)² Σ_{i=1}^{n!} Σ_{j=1}^{n!} E(Ti Tj) − [g(F)]²
≤ (1/n!)² Σ_{i=1}^{n!} Σ_{j=1}^{n!} sqrt(E(Ti²)) sqrt(E(Tj²)) − [g(F)]²   (by the Cauchy–Schwarz inequality)
= E(T²) − [g(F)]²
= Var(T).
Corollary 12.1.15:
If T (X1 , . . . , Xn ) is unbiased for g(F ), F ∈ P, the corresponding U –statistic is an essentially
unique UMVUE.
Definition 12.1.16:
Suppose we have independent samples X1, . . . , Xm iid ∼ F ∈ P and Y1, . . . , Yn iid ∼ G ∈ P (G may or may not equal F). Let g(F, G) be an estimable function with unbiased estimator T(X1, . . . , Xk, Y1, . . . , Yl). Define
Ts(X1, . . . , Xk, Y1, . . . , Yl) = (1/(k! l!)) Σ_{PX} Σ_{PY} T(Xi1, . . . , Xik, Yj1, . . . , Yjl)
and the generalized U–statistic
U(X, Y) = (1 / (C(m,k) C(n,l))) Σ_{CX} Σ_{CY} Ts(Xi1, . . . , Xik, Yj1, . . . , Yjl),
where PX, PY denote the permutations and CX, CY the combinations of the X– and Y–indices, respectively.
Example 12.1.17:
Let X1 , . . . , Xm and Y1 , . . . , Yn be independent random samples from F and G, respectively,
with F, G ∈ P. We wish to estimate
g(F, G) = PF,G (X ≤ Y ).
Let us define
Zij = 1 if Xi ≤ Yj, and Zij = 0 if Xi > Yj,
for each pair Xi , Yj , i = 1, 2, . . . , m, j = 1, 2, . . . , n.
Then Σ_{i=1}^m Zij is the number of X's ≤ Yj, and Σ_{j=1}^n Zij is the number of Y's > Xi.
With k = l = 1 and kernel T(Xi, Yj) = Zij = I(Xi ≤ Yj), the corresponding U–statistic is
U(X, Y) = (1 / (C(m,1) C(n,1))) Σ_{CX} Σ_{CY} (1/(1! 1!)) Σ_{PX} Σ_{PY} T(Xi, Yj)
= ((m − 1)!(n − 1)! / (m! n!)) Σ Σ T(Xi, Yj)
= (1/(mn)) Σ_{i=1}^m Σ_{j=1}^n I(Xi ≤ Yj),
the Mann–Whitney estimator of g(F, G).
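A minimal numerical sketch of this estimator (the data are made-up illustration values):

```python
import numpy as np

x = np.array([1.0, 3.0])
y = np.array([2.0, 4.0])
m, n = len(x), len(y)

# U = (1/(mn)) sum_i sum_j I(X_i <= Y_j), the Mann-Whitney estimate of P(X <= Y)
u = (x[:, None] <= y[None, :]).mean()

# same thing with an explicit double loop
u_loop = sum((xi <= yj) for xi in x for yj in y) / (m * n)
```

Here three of the four pairs satisfy Xi ≤ Yj, so both computations give 3/4.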
12.2 Single-Sample Hypothesis Tests
Let X1, . . . , Xn be a sample from a distribution F. The problem of fit is to test the hypothesis that the sample X1, . . . , Xn is from some specified distribution against the alternative that it is from some other distribution, i.e., H0 : F = F0 vs. H1 : F(x) ≠ F0(x) for some x.
Definition 12.2.1:
Let X1, . . . , Xn iid ∼ F, and let the corresponding empirical cdf be
Fn*(x) = (1/n) Σ_{i=1}^n I(−∞,x](Xi).
The statistic
Dn = sup_x | Fn*(x) − F(x) |
is called the (two–sided) Kolmogorov–Smirnov (K–S) statistic; the corresponding one–sided statistics are
Dn+ = sup_x [Fn*(x) − F(x)] and Dn− = sup_x [F(x) − Fn*(x)].
Theorem 12.2.2:
For any continuous distribution F , the K–S statistics Dn , Dn− , Dn+ are distribution free.
Proof:
Let X(1) , . . . , X(n) be the order statistics of X1 , . . . , Xn , i.e., X(1) ≤ X(2) ≤ . . . ≤ X(n) , and
define X(0) = −∞ and X(n+1) = +∞.
Then,
Fn*(x) = i/n for X(i) ≤ x < X(i+1), i = 0, . . . , n.
Therefore,
Dn+ = max_{0≤i≤n} { sup_{X(i)≤x<X(i+1)} [ i/n − F(x) ] }
= max_{0≤i≤n} { i/n − inf_{X(i)≤x<X(i+1)} F(x) }
=(∗) max_{0≤i≤n} { i/n − F(X(i)) }
= max { max_{1≤i≤n} [ i/n − F(X(i)) ], 0 }.
(∗) holds since F is nondecreasing on [X(i), X(i+1)).
Note that Dn+ is a function of the F(X(i)) only. In order to make some inference about Dn+, the distribution of F(X(i)) must be known. We know from the Probability Integral Transformation (see Rohatgi, page 203, Theorem 1) that for a rv X with continuous cdf FX, it holds that FX(X) ∼ U(0, 1). Thus, F(X(i)) is the ith order statistic of a sample from U(0, 1), whatever F is. Therefore, the distribution of Dn+ does not depend on F.
Since
Dn = sup_x | Fn*(x) − F(x) | = max {Dn+, Dn−},
the distribution of Dn is also independent of F.
Theorem 12.2.3:
If F is continuous, then
P(Dn ≤ ν + 1/(2n)) =
  0, if ν ≤ 0;
  ∫_{1/(2n)−ν}^{1/(2n)+ν} ∫_{3/(2n)−ν}^{3/(2n)+ν} · · · ∫_{(2n−1)/(2n)−ν}^{(2n−1)/(2n)+ν} f(u) du_n · · · du_1, if 0 < ν < (2n−1)/(2n);
  1, if ν ≥ (2n−1)/(2n);
where
f(u) = f(u1, . . . , un) = n! if 0 < u1 < u2 < . . . < un < 1, and 0 otherwise,
is the joint pdf of the order statistics of a sample of size n from U(0, 1).
Note:
As Gibbons & Chakraborti (1992), page 108–109, point out, this result must be interpreted
carefully. Consider the case n = 2.
When 0 < ν < 1/4, the condition 0 < u1 < u2 < 1 holds automatically on the region of integration. Thus, for 0 < ν < 1/4, it holds that
P(D2 ≤ ν + 1/4) = 2! ∫_{1/4−ν}^{1/4+ν} ∫_{3/4−ν}^{3/4+ν} du2 du1
= 2! ∫_{1/4−ν}^{1/4+ν} [u2]_{3/4−ν}^{3/4+ν} du1
= 2! ∫_{1/4−ν}^{1/4+ν} 2ν du1
= 2! (2ν) [u1]_{1/4−ν}^{1/4+ν}
= 2! (2ν)².
For 1/4 ≤ ν < 3/4, the constraint 0 < u1 < u2 < 1 cuts into the rectangle of integration. Thus, for 1/4 ≤ ν < 3/4, it holds that
P(D2 ≤ ν + 1/4) = 2! ∫_{1/4−ν}^{1/4+ν} ∫_{3/4−ν}^{3/4+ν} I(0 < u1 < u2 < 1) du2 du1
= 2! ∫_{3/4−ν}^{1/4+ν} ∫_{u1}^{1} du2 du1 + 2! ∫_{0}^{3/4−ν} ∫_{3/4−ν}^{1} du2 du1
= 2 [ ∫_{3/4−ν}^{1/4+ν} (1 − u1) du1 + ∫_{0}^{3/4−ν} (1 − 3/4 + ν) du1 ]
= 2 [ (u1 − u1²/2) |_{3/4−ν}^{1/4+ν} + (1/4 + ν)(3/4 − ν) ]
= 2 [ (ν − 1/4) + (3/16 + ν/2 − ν²) ]
= 2 [ −ν² + (3/2)ν − 1/16 ]
= −2ν² + 3ν − 1/8.
Combining these results gives
P(D2 ≤ ν + 1/4) =
  0, if ν ≤ 0;
  2! (2ν)², if 0 < ν < 1/4;
  −2ν² + 3ν − 1/8, if 1/4 ≤ ν < 3/4;
  1, if ν ≥ 3/4.
Theorem 12.2.4:
Let F be a continuous cdf. Then it holds ∀z ≥ 0:
lim_{n→∞} P(Dn ≤ z/√n) = L1(z) = 1 − 2 Σ_{i=1}^∞ (−1)^(i−1) exp(−2 i² z²).
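Assuming scipy is available, the series L1(z) can be compared against scipy's implementation of the same limiting law (scipy.stats.kstwobign is the limiting distribution of √n · Dn):

```python
import numpy as np
from scipy.stats import kstwobign

def L1(z, terms=100):
    # L1(z) = 1 - 2 * sum_{i>=1} (-1)^(i-1) exp(-2 i^2 z^2)
    i = np.arange(1, terms + 1)
    return 1 - 2 * np.sum((-1.0) ** (i - 1) * np.exp(-2 * i**2 * z**2))

diffs = [abs(L1(z) - kstwobign.cdf(z)) for z in (0.5, 1.0, 1.36, 2.0)]
```

The two agree to many decimal places; z = 1.36 is the familiar asymptotic 5% critical point of √n · Dn.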
Theorem 12.2.5:
Let F be a continuous cdf. Then it holds:
P(Dn+ ≤ z) = P(Dn− ≤ z) =
  0, if z ≤ 0;
  ∫_{1−z}^{1} ∫_{(n−1)/n − z}^{u_n} · · · ∫_{2/n − z}^{u_3} ∫_{1/n − z}^{u_2} f(u) du_1 · · · du_n, if 0 < z < 1;
  1, if z ≥ 1.
Note:
It should be obvious that the statistics Dn+ and Dn− have the same distribution because of
symmetry.
Theorem 12.2.6:
Let F be a continuous cdf. Then it holds ∀z ≥ 0:
lim_{n→∞} P(Dn+ ≤ z/√n) = lim_{n→∞} P(Dn− ≤ z/√n) = L2(z) = 1 − exp(−2z²).
Corollary 12.2.7:
Let Vn = 4n(Dn+)². Then it holds that Vn →d χ²₂, i.e., this transformation of Dn+ has an asymptotic χ²₂ distribution.
Proof:
Let x ≥ 0 and write x = 4z² with z ≥ 0. Then it follows:
lim_{n→∞} P(Vn ≤ x) = lim_{n→∞} P(Vn ≤ 4z²)
= lim_{n→∞} P(4n(Dn+)² ≤ 4z²)
= lim_{n→∞} P(√n Dn+ ≤ z)
= 1 − exp(−2z²)   (by Theorem 12.2.6)
= 1 − exp(−x/2).
Thus, lim_{n→∞} P(Vn ≤ x) = 1 − exp(−x/2) for x ≥ 0. Note that this is the cdf of a χ²₂ distribution.
Definition 12.2.8:
Let Dn;α be the smallest value such that P(Dn > Dn;α) ≤ α. Likewise, let Dn;α+ be the smallest value such that P(Dn+ > Dn;α+) ≤ α.
Note:
Rohatgi, Table 7, page 661, gives values of Dn;α and Dn;α+ for selected values of α and small n. Theorems 12.2.4 and 12.2.6 allow the approximation of Dn;α and Dn;α+ for large n.
Example 12.2.9:
Let X1, . . . , X10 ∼ C(1, 0). We want to test whether H0 : X ∼ N(0, 1). The ordered sample is
−1.42, −0.43, −0.19, 0.26, 0.30, 0.45, 0.64, 0.96, 1.97, and 4.68.
The results for the K–S test have been obtained through the following S–Plus session, i.e., D10+ = 0.02219616, D10− = 0.3025681, and D10 = 0.3025681:
> x _ c(-1.42, -0.43, -0.19, 0.26, 0.30, 0.45, 0.64, 0.96, 1.97, 4.68)
> FX _ pnorm(x)
> FX
[1] 0.07780384 0.33359782 0.42465457 0.60256811 0.61791142 0.67364478
[7] 0.73891370 0.83147239 0.97558081 0.99999857
> Dp _ (1:10)/10 - FX
> Dp
[1] 2.219616e-02 -1.335978e-01 -1.246546e-01 -2.025681e-01 -1.179114e-01
[6] -7.364478e-02 -3.891370e-02 -3.147239e-02 -7.558081e-02 1.434375e-06
> Dm _ FX - (0:9)/10
> Dm
[1] 0.07780384 0.23359782 0.22465457 0.30256811 0.21791142 0.17364478
[7] 0.13891370 0.13147239 0.17558081 0.09999857
> max(Dp)
[1] 0.02219616
> max(Dm)
[1] 0.3025681
> max(max(Dp), max(Dm))
[1] 0.3025681
>
> ks.gof(x, alternative = "two.sided", mean = 0, sd = 1)
data: x
ks = 0.3026, p-value = 0.2617
alternative hypothesis:
True cdf is not the normal distn. with the specified parameters
Using Rohatgi, Table 7, page 661, we have to use D10;0.20 = 0.323 for α = 0.20. Since D10 = 0.3026 < 0.323 = D10;0.20, it follows that p > 0.20. The K–S test does not reject H0 at level α = 0.20. As S–Plus shows, the precise p–value is even p = 0.2617.
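The same computation can be replicated with scipy's kstest (a hedged modern equivalent of the S–Plus call above; the p–value approximation mode may differ slightly between implementations):

```python
from scipy.stats import kstest

x = [-1.42, -0.43, -0.19, 0.26, 0.30, 0.45, 0.64, 0.96, 1.97, 4.68]
res = kstest(x, 'norm')   # two-sided K-S test of H0: X ~ N(0,1)
# res.statistic reproduces D10 = 0.3025681; res.pvalue is close to 0.2617
```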
Note:
Comparison between χ2 and K–S goodness of fit tests:
• K–S uses all available data; χ2 bins the data and loses information
• K–S works for all sample sizes; χ2 requires large sample sizes
• it is more difficult to modify K–S for estimated parameters; χ2 can be easily adapted
for estimated parameters
• K–S is “conservative” for discrete data, i.e., it tends to accept H0 for such data
• the order matters for K–S; χ2 is better for unordered categorical data
12.3 More on Order Statistics
Definition 12.3.1:
Let F be a continuous cdf. A tolerance interval for F with tolerance coefficient γ is
a random interval such that the probability is γ that this random interval covers at least a
specified percentage 100p% of the distribution.
Theorem 12.3.2:
If order statistics X(r) < X(s) are used as the endpoints for a tolerance interval for a continuous
cdf F , it holds that
γ = Σ_{i=0}^{s−r−1} C(n,i) p^i (1 − p)^(n−i).
Proof:
According to Definition 12.3.1, it holds that
γ = P_{X(r),X(s)}( P_X(X(r) < X < X(s)) ≥ p ).
Now,
P_X(X(r) < X < X(s)) = F(X(s)) − F(X(r)) = U(s) − U(r),
where U(s) and U(r) are the order statistics of a U(0, 1) distribution. Thus,
γ = P_{X(r),X(s)}( P_X(X(r) < X < X(s)) ≥ p ) = P(U(s) − U(r) ≥ p).
By Theorem 4.4.4, we can determine the joint distribution of order statistics and calculate γ as
γ = ∫_p^1 ∫_0^{y−p} ( n! / ((r − 1)!(s − r − 1)!(n − s)!) ) x^(r−1) (y − x)^(s−r−1) (1 − y)^(n−s) dx dy.
Rather than solving this integral directly, we make the transformation
U = U(s) − U(r)
V = U(s) .
and the marginal pdf of U is
fU(u) = ∫ fU,V(u, v) dv
= ( n! / ((r − 1)!(s − r − 1)!(n − s)!) ) u^(s−r−1) I(0,1)(u) ∫_u^1 (v − u)^(r−1) (1 − v)^(n−s) dv
=(A) ( n! / ((r − 1)!(s − r − 1)!(n − s)!) ) u^(s−r−1) (1 − u)^(n−s+r) I(0,1)(u) ∫_0^1 t^(r−1) (1 − t)^(n−s) dt
= ( n! / ((r − 1)!(s − r − 1)!(n − s)!) ) ( (r − 1)!(n − s)! / (n − s + r)! ) u^(s−r−1) (1 − u)^(n−s+r) I(0,1)(u)
= ( n! / ((n − s + r)!(s − r − 1)!) ) u^(s−r−1) (1 − u)^(n−s+r) I(0,1)(u)
= n C(n−1, s−r−1) u^(s−r−1) (1 − u)^(n−s+r) I(0,1)(u).
Here ∫_0^1 t^(r−1) (1 − t)^(n−s) dt = B(r, n − s + 1), and (A) is based on the transformation t = (v − u)/(1 − u), i.e., v − u = (1 − u)t, 1 − v = 1 − u − (1 − u)t = (1 − u)(1 − t), and dv = (1 − u) dt.
It follows that
γ = P(U(s) − U(r) ≥ p)
= P(U ≥ p)
= ∫_p^1 n C(n−1, s−r−1) u^(s−r−1) (1 − u)^(n−s+r) du
=(B) P(Y < s − r), where Y ∼ Bin(n, p)
= Σ_{i=0}^{s−r−1} C(n,i) p^i (1 − p)^(n−i).
(B) holds due to Rohatgi, Remark 3 after Theorem 5.3.18, page 216, since for X ∼ Bin(n, p) it holds that
P(X < k) = n ∫_p^1 C(n−1, k−1) x^(k−1) (1 − x)^(n−k) dx.
Example 12.3.3:
Let s = n and r = 1. Then,
γ = Σ_{i=0}^{n−2} C(n,i) p^i (1 − p)^(n−i) = 1 − p^n − n p^(n−1) (1 − p).
For n = 10 and p = 0.8, this gives γ = 1 − 0.8^10 − 10 · 0.8^9 · 0.2 ≈ 0.624, i.e., (X(1), X(10)) defines a 62.4% tolerance interval for 80% probability.
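A small numerical sketch of the tolerance coefficient formula (using n = 10 and p = 0.8 from the example above):

```python
from math import comb

def tolerance_gamma(n, p, r, s):
    # gamma = sum_{i=0}^{s-r-1} C(n,i) p^i (1-p)^(n-i)
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(s - r))

g = tolerance_gamma(10, 0.8, 1, 10)   # tolerance interval (X_(1), X_(10))
```

For s = n and r = 1 this matches the closed form 1 − p^n − n p^(n−1)(1 − p) ≈ 0.624.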
Theorem 12.3.4:
Let kp be the pth quantile of a continuous cdf F. Let X(1), . . . , X(n) be the order statistics of a sample of size n from F. Then it holds that
P(X(r) ≤ kp ≤ X(s)) = Σ_{i=r}^{s−1} C(n,i) p^i (1 − p)^(n−i).
Proof:
It holds that
Therefore,
Corollary 12.3.5:
(X(r), X(s)) is a level Σ_{i=r}^{s−1} C(n,i) p^i (1 − p)^(n−i) confidence interval for kp.
Example 12.3.6:
Let n = 10. We want a 95% confidence interval for the median, i.e., kp where p = 1/2.
We get the following probabilities pr,s = Σ_{i=r}^{s−1} C(n,i) p^i (1 − p)^(n−i) that (X(r), X(s)) covers k0.5:

         s=2    s=3    s=4    s=5    s=6    s=7    s=8    s=9    s=10
  r=1    0.01   0.05   0.17   0.38   0.62   0.83   0.94   0.99   0.998
  r=2           0.04   0.16   0.37   0.61   0.82   0.93   0.98   0.99
  r=3                  0.12   0.32   0.57   0.77   0.89   0.93   0.94
  r=4                         0.21   0.45   0.66   0.77   0.82   0.83
  r=5                                0.25   0.45   0.57   0.61   0.62
  r=6                                       0.21   0.32   0.37   0.38
  r=7                                              0.12   0.16   0.17
  r=8                                                     0.04   0.05
  r=9                                                            0.01
Only the random intervals (X(1) , X(9) ), (X(1) , X(10) ), (X(2) , X(9) ), and (X(2) , X(10) ) give the
desired coverage probability. Therefore, we use the one that comes closest to 95%, i.e.,
(X(2) , X(9) ), as the 95% confidence interval for the median.
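The entries of the table above can be reproduced with a short computation (a sketch; only two entries are checked here):

```python
from math import comb

def coverage(n, r, s, p=0.5):
    # P(X_(r) <= k_p <= X_(s)) = sum_{i=r}^{s-1} C(n,i) p^i (1-p)^(n-i)
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(r, s))

p29 = coverage(10, 2, 9)     # entry r=2, s=9  -> about 0.98
p110 = coverage(10, 1, 10)   # entry r=1, s=10 -> about 0.998
```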
13 Some Results from Sampling
Definition 13.1.1:
Let Ω be a population of size N with mean µ and variance σ². A sampling method (of size n) is called simple if the set S of possible samples contains all combinations of n elements of Ω (without repetition) and the probability for each sample s ∈ S to be selected depends only on n, i.e., p(s) = 1/C(N,n) ∀s ∈ S. Then we call s ∈ S a simple random sample (SRS) of size n.
Theorem 13.1.2:
Let Ω be a population of size N with mean µ and variance σ². Let Y : Ω → IR be a measurable function. Let ni be the total number of times the value ỹi occurs in the population and pi = ni/N the relative frequency with which ỹi occurs in the population. Let (y1, . . . , yn) be a SRS of size n with respect to Y, where P(Y = ỹi) = pi = ni/N.
Note:
(i) In Sampling, many authors use capital letters to denote properties of the population
and small letters to denote properties of the random sample. In particular, xi ’s and yi ’s
are considered as random variables related to the sample. They are not seen as specific
realizations.
In this notation,
µ = (1/N) Σ_i ni ỹi and
σ² = (1/N) Σ_i ni (ỹi − µ)² = (1/N) Σ_i ni ỹi² − µ².
Theorem 13.1.3:
Let the same conditions hold as in Theorem 13.1.2. Let ȳ = (1/n) Σ_{i=1}^n yi be the sample mean of a SRS of size n. Then it holds:
(i) E(ȳ) = µ, i.e., the sample mean is unbiased for the population mean µ.
(ii) Var(ȳ) = (1/n) ((N − n)/(N − 1)) σ² = ((1 − f)/n) (N/(N − 1)) σ², where f = n/N.
Proof:
(i)
E(ȳ) = (1/n) Σ_{i=1}^n E(yi) = µ, since E(yi) = µ ∀i.
(ii)
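Both parts of Theorem 13.1.3 can be verified exactly by enumerating all C(N, n) samples of a tiny made-up population (the population values below are arbitrary):

```python
from itertools import combinations
from statistics import mean

pop = [3.0, 7.0, 8.0, 12.0, 15.0]      # small population, N = 5
N, n = len(pop), 2
mu = mean(pop)
sigma2 = sum((y - mu) ** 2 for y in pop) / N

# all C(N, n) equally likely SRS's and their sample means
means = [mean(s) for s in combinations(pop, n)]
e_ybar = mean(means)
var_ybar = sum((m - e_ybar) ** 2 for m in means) / len(means)

formula = (1 / n) * (N - n) / (N - 1) * sigma2   # Theorem 13.1.3 (ii)
```

The enumerated E(ȳ) equals µ, and the enumerated variance equals the finite-population formula exactly.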
Theorem 13.1.4:
Let ȳn be the sample mean of a SRS of size n. Then it holds that
(ȳn − µ) / sqrt( ((1 − f)/n) (N/(N − 1)) σ² ) →d N(0, 1),
where N → ∞ and f = n/N is a constant.
In particular, when the yi's are 0–1–distributed with E(yi) = P(yi = 1) = p ∀i, then it holds that
(ȳn − p) / sqrt( ((1 − f)/n) (N/(N − 1)) p(1 − p) ) →d N(0, 1),
where N → ∞ and f = n/N is a constant.
13.2 Stratified Random Samples
Definition 13.2.1:
Let Ω be a population of size N that is split into m disjoint sets Ωj, called strata, of sizes Nj, j = 1, . . . , m, where N = Σ_{j=1}^m Nj. If we independently draw a random sample of size nj in each stratum, we speak of a stratified random sample.
Note:
(i) The random samples in each stratum are not always SRS's.
(ii) Stratified random samples are used in practice as a means to reduce the sample variance in the case that the data within each stratum are homogeneous and the data among different strata are heterogeneous.
(iii) Frequently used strata in practice are gender, state (or county), income range, ethnic
background, etc.
Definition 13.2.2:
Let Y : Ω → IR be a measurable function. In case of a stratified random sample, we use the
following notation:
(i) Yj = Σ_{k=1}^{Nj} Yjk the total in the jth stratum,
(ii) µj = (1/Nj) Yj the mean in the jth stratum,
(iii) µ = (1/N) Σ_{j=1}^m Nj µj the expectation (or grand mean),
(iv) Nµ = Σ_{j=1}^m Yj = Σ_{j=1}^m Σ_{k=1}^{Nj} Yjk the total,
(v) σj² = (1/Nj) Σ_{k=1}^{Nj} (Yjk − µj)² the variance in the jth stratum, and
(vi) σ² = (1/N) Σ_{j=1}^m Σ_{k=1}^{Nj} (Yjk − µ)² the variance.
(vii) We denote an (ordered) sample in Ωj of size nj as (yj1, . . . , yjnj) and ȳj = (1/nj) Σ_{k=1}^{nj} yjk the sample mean in the jth stratum.
Theorem 13.2.3:
Let the same conditions hold as in Definitions 13.2.1 and 13.2.2. Let µ̂j be an unbiased estimate of µj and V̂ar(µ̂j) be an unbiased estimate of Var(µ̂j). Then it holds:
(i) µ̂ = (1/N) Σ_{j=1}^m Nj µ̂j is unbiased for µ, and
Var(µ̂) = (1/N²) Σ_{j=1}^m Nj² Var(µ̂j).
(ii) V̂ar(µ̂) = (1/N²) Σ_{j=1}^m Nj² V̂ar(µ̂j) is unbiased for Var(µ̂).
Proof:
(i)
Theorem 13.2.4:
Let the same conditions hold as in Theorem 13.2.3. If we draw a SRS in each stratum, then it holds:
(i) µ̂ = (1/N) Σ_{j=1}^m Nj ȳj is unbiased for µ, where ȳj = (1/nj) Σ_{k=1}^{nj} yjk, j = 1, . . . , m, and
Var(µ̂) = (1/N²) Σ_{j=1}^m Nj² ((1 − fj)/nj) (Nj/(Nj − 1)) σj², where fj = nj/Nj.
(ii) V̂ar(µ̂) = (1/N²) Σ_{j=1}^m Nj² ((1 − fj)/nj) sj² is unbiased for Var(µ̂), where
sj² = (1/(nj − 1)) Σ_{k=1}^{nj} (yjk − ȳj)².
Proof:
Definition 13.2.5:
Let the same conditions hold as in Definitions 13.2.1 and 13.2.2. If the sample in each stratum is of size nj = n Nj/N, j = 1, . . . , m, where n is the total sample size, then we speak of proportional selection.
Note:
(i) In the case of proportional selection, it holds that fj = nj/Nj = n/N = f, j = 1, . . . , m.
(ii) Proportional strata cannot always be obtained for each combination of m, n, and N.
Theorem 13.2.6:
Let the same conditions hold as in Definition 13.2.5. If we draw a SRS in each stratum, then it holds in case of proportional selection that
Var(µ̂) = (1/N²) ((1 − f)/f) Σ_{j=1}^m Nj σ̃j²,
where σ̃j² = (Nj/(Nj − 1)) σj².
Proof:
The proof follows directly from Theorem 13.2.4 (i).
Theorem 13.2.7:
If we draw (1) a stratified random sample that consists of SRS's of sizes nj under proportional selection and (2) a SRS of size n = Σ_{j=1}^m nj from the same population, then it holds that
Var(ȳ) − Var(µ̂) = (1/n) ((N − n)/(N(N − 1))) [ Σ_{j=1}^m Nj (µj − µ)² − (1/N) Σ_{j=1}^m (N − Nj) σ̃j² ].
Proof:
See Homework.
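The homework claim can be checked by exact enumeration on a small artificial population (the strata values below are arbitrary illustration values; equal stratum sizes keep the proportional allocation nj = 1 exact):

```python
from itertools import combinations, product
from statistics import mean

strata = [[0.0, 4.0, 5.0], [9.0, 10.0, 14.0]]   # m = 2 strata, N_j = 3 each
pop = [y for s in strata for y in s]
N, m, n = len(pop), len(strata), 2              # proportional: n_j = n*N_j/N = 1
mu = mean(pop)

def var_of(vals):
    mv = mean(vals)
    return sum((v - mv) ** 2 for v in vals) / len(vals)

# exact Var(ybar) for a SRS of size n from the whole population
var_srs = var_of([mean(s) for s in combinations(pop, n)])

# exact Var(mu_hat) for the stratified sample: one draw per stratum,
# mu_hat = (1/N) sum N_j ybar_j = mean of the two draws here (equal N_j)
var_strat = var_of([mean(draw) for draw in product(*strata)])

mu_j = [mean(s) for s in strata]
sig2_j = [sum((y - mj) ** 2 for y in s) / len(s) for s, mj in zip(strata, mu_j)]
sig2t_j = [len(s) / (len(s) - 1) * v for s, v in zip(strata, sig2_j)]
Nj = [len(s) for s in strata]

rhs = (1 / n) * (N - n) / (N * (N - 1)) * (
    sum(Nj[j] * (mu_j[j] - mu) ** 2 for j in range(m))
    - (1 / N) * sum((N - Nj[j]) * sig2t_j[j] for j in range(m))
)
```

The enumerated difference Var(ȳ) − Var(µ̂) matches the right-hand side exactly (here it is positive, since the strata means differ strongly).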
14 Some Results from Sequential Statistical Inference
Example 14.1.1:
A particular machine produces a large number of items every day. Each item can be either
“defective” or “non–defective”. The unknown proportion of defective items in the production
of a particular day is p.
Let (X1, . . . , Xm) be a sample from the daily production, where xi = 1 when the item is defective and xi = 0 when the item is non–defective. Obviously, Sm = Σ_{i=1}^m Xi ∼ Bin(m, p) denotes the total number of defective items in the sample (assuming that m is small compared to the daily production).
However, it might be more beneficial to sample the items sequentially (e.g., take items # 57, 623, 1005, 1286, 2663, etc.) and stop the machine as soon as it becomes obvious that it produces too many defective items. (Alternatively, we could also stop the time–consuming and expensive process of determining whether items are defective or non–defective once it has become impossible to surpass a certain proportion of defectives.) For example, if for some j < m it already holds that sj > c, then we could stop (and immediately call maintenance) and reject H0 after only j observations.
More formally, let us define T = min{j | Sj > c} and T 0 = min{T, m}. We can now con-
sider a decision rule that stops with the sampling process at random time T 0 and rejects H0 if
T ≤ m. Thus, if we consider R0 = {(x1 , . . . , xm ) | t ≤ m} and R1 = {(x1 , . . . , xm ) | sm > c}
as critical regions of two tests Φ0 and Φ1 , then these two tests are equivalent.
Definition 14.1.2:
Let Θ be the parameter space and A the set of actions the statistician can take. We assume
that the rv’s X1 , X2 , . . . are observed sequentially and iid with common pdf (or pmf) fθ (x).
A sequential decision procedure is defined as follows:
(i) A stopping rule specifies whether an element of A should be chosen without taking
any further observation. If at least one observation is taken, this rule specifies for every
set of observed values (x1 , x2 , . . . , xn ), n ≥ 1, whether to stop sampling and choose an
action in A or to take another observation xn+1 .
(ii) A decision rule specifies the decision to be taken. If no observation has been taken, then we take action d0 ∈ A. If n ≥ 1 observations have been taken, then we take action dn(x1, . . . , xn) ∈ A, where dn(x1, . . . , xn) specifies the action that has to be taken for the set (x1, . . . , xn) of observed values. Once an action has been taken, the sampling process is stopped.
Note:
In the remainder of this chapter, we assume that the statistician takes at least one observation.
Definition 14.1.3:
Let Rn ⊆ IR^n, n = 1, 2, . . ., be a sequence of Borel–measurable sets such that the sampling process is stopped after observing X1 = x1, X2 = x2, . . . , Xn = xn if (x1, . . . , xn) ∈ Rn. If (x1, . . . , xn) ∉ Rn, then another observation xn+1 is taken. The sets Rn, n = 1, 2, . . ., are called stopping regions.
Definition 14.1.4:
With every sequential stopping rule we associate a stopping random variable N which
takes on the values 1, 2, 3, . . .. Thus, N is a rv that indicates the total number of observations
taken before the sampling is stopped.
Note:
We use the (sloppy) notation {N = n} to denote the event that sampling is stopped after observing exactly n values x1, . . . , xn (i.e., sampling is not stopped before taking n samples). Then the following equalities hold:
{N = 1} = R1
{N = n} = {(x1, . . . , xn) ∈ IR^n | sampling is stopped after n observations but not before}
= (R1 ∪ R2 ∪ . . . ∪ Rn−1)^c ∩ Rn
= R1^c ∩ R2^c ∩ . . . ∩ R(n−1)^c ∩ Rn
{N ≤ n} = ∪_{k=1}^n {N = k}
Here we will only consider closed sequential sampling procedures, i.e., procedures where sampling eventually stops with probability 1, i.e.,
P(N < ∞) = 1, or equivalently P(N = ∞) = 1 − P(N < ∞) = 0.
Wald's Equation: If X1, X2, . . . are iid rv's with E(| X1 |) < ∞, N is a stopping rv with E(N) < ∞, and SN = Σ_{n=1}^N Xn, then E(SN) = E(X1) E(N).
Proof:
Define a sequence of rv's Yi, i = 1, 2, . . ., where
Yi = 1 if no decision is reached up to the (i − 1)th stage, i.e., N > i − 1, and Yi = 0 otherwise.
Consider the rv Σ_{n=1}^∞ Xn Yn. Obviously, it holds that
SN = Σ_{n=1}^∞ Xn Yn.
Thus, it follows that
E(SN) = E( Σ_{n=1}^∞ Xn Yn ).   (∗)
It holds that
Σ_{n=1}^∞ E(| Xn Yn |) = Σ_{n=1}^∞ E(| Xn |) E(| Yn |)   (Xn and Yn are independent, since Yn depends only on X1, . . . , Xn−1)
= E(| X1 |) Σ_{n=1}^∞ P(N ≥ n)
= E(| X1 |) Σ_{n=1}^∞ Σ_{k=n}^∞ P(N = k)
=(A) E(| X1 |) Σ_{n=1}^∞ n P(N = n)
= E(| X1 |) E(N)
< ∞.
(A) holds by interchanging the order of summation: the pairs (n, k) with k ≥ n are
n = 1: k = 1, 2, 3, . . .
n = 2: k = 2, 3, . . .
n = 3: k = 3, . . .
so each k is counted exactly k times.
We may therefore interchange the expectation and summation signs in (∗) and get
E(SN) = E( Σ_{n=1}^∞ Xn Yn )
= Σ_{n=1}^∞ E(Xn Yn)
= Σ_{n=1}^∞ E(Xn) E(Yn)
= E(X1) Σ_{n=1}^∞ P(N ≥ n)
= E(X1) E(N).
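Wald's equation can be illustrated by simulation; here Xi ∼ Bernoulli(p) and N is the first time the partial sum reaches k, so SN = k by construction and E(N) = k/p (a sketch with an arbitrary seed and parameters):

```python
import numpy as np

rng = np.random.default_rng(4)
p, k, reps = 0.5, 5, 20_000

# N = k + (number of failures before the k-th success) ~ k + NegBin(k, p)
Ns = rng.negative_binomial(k, p, size=reps) + k
E_N = Ns.mean()
E_X1 = p
E_SN = float(k)          # S_N = k in every replication, by construction

# Wald: E(S_N) = E(X_1) * E(N), and here E(N) = k/p
```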
14.2 Sequential Probability Ratio Tests
Definition 14.2.1:
Let X1 , X2 , . . . be a sequence of iid rv’s with common pdf (or pmf) fθ (x). We want to test a
simple hypothesis H0 : X ∼ fθ0 vs. a simple alternative H1 : X ∼ fθ1 when the observations
are taken sequentially.
Let f0n and f1n denote the joint pdf's (or pmf's) of X1, . . . , Xn under H0 and H1, respectively, i.e.,
f0n(x1, . . . , xn) = Π_{i=1}^n fθ0(xi) and f1n(x1, . . . , xn) = Π_{i=1}^n fθ1(xi).
Finally, let
λn(x1, . . . , xn) = f1n(x) / f0n(x),
where x = (x1, . . . , xn). Then a sequential probability ratio test (SPRT) for testing H0 vs. H1 is the following decision rule:
(i) If λn(x) ≥ A, then stop sampling and reject H0.
(ii) If λn(x) ≤ B, then stop sampling and accept H0.
(iii) If B < λn(x) < A, then continue sampling by taking another observation xn+1.
Note:
(i) In practice it is often more convenient to work with log λn(x) instead of using λn(x). Obviously, we now have to use constants b = log B and a = log A instead of the original constants B and A.
(ii) A and B (where A > B) are constants such that the SPRT will have strength (α, β),
where
α = P (Type I error) = P (Reject H0 | H0 )
and
β = P (Type II error) = P (Accept H0 | H1 ).
If N is the stopping rv, then
Example 14.2.2:
Let X1 , X2 , . . . be iid N (µ, σ 2 ), where µ is unknown and σ 2 > 0 is known. We want to test
H0 : µ = µ0 vs. H1 : µ = µ1 , where µ0 < µ1 .
log λn(x) = (1/(2σ²)) Σ_{i=1}^n ( −2xi µ0 + µ0² + 2xi µ1 − µ1² )
= (1/(2σ²)) ( 2(µ1 − µ0) Σ_{i=1}^n xi + n(µ0² − µ1²) )
= ((µ1 − µ0)/σ²) ( Σ_{i=1}^n xi − n (µ0 + µ1)/2 ).
We decide for H0 if
log λn(x) ≤ b
⟺ ((µ1 − µ0)/σ²) ( Σ_{i=1}^n xi − n (µ0 + µ1)/2 ) ≤ b
⟺ Σ_{i=1}^n xi ≤ n (µ0 + µ1)/2 + b*,
where b* = (σ²/(µ1 − µ0)) b.
We decide for H1 if
log λn(x) ≥ a
⟺ ((µ1 − µ0)/σ²) ( Σ_{i=1}^n xi − n (µ0 + µ1)/2 ) ≥ a
⟺ Σ_{i=1}^n xi ≥ n (µ0 + µ1)/2 + a*,
where a* = (σ²/(µ1 − µ0)) a.
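A minimal SPRT implementation following the derivation above (a sketch, not the notes' code; µ0, µ1, σ, α, β, the seed, and the cap of 500 observations per run are illustration choices, with the bounds A′ = (1 − β)/α and B′ = β/(1 − α) of Theorem 14.2.4 below):

```python
import math
import numpy as np

def sprt_normal(xs, mu0, mu1, sigma, alpha, beta):
    # a = log A', b = log B' with A' = (1-beta)/alpha, B' = beta/(1-alpha)
    a = math.log((1 - beta) / alpha)
    b = math.log(beta / (1 - alpha))
    s, mid = 0.0, (mu0 + mu1) / 2
    for n, x in enumerate(xs, start=1):
        s += x
        logl = (mu1 - mu0) / sigma**2 * (s - n * mid)   # log lambda_n(x)
        if logl >= a:
            return 'reject H0', n
        if logl <= b:
            return 'accept H0', n
    return 'no decision', len(xs)

rng = np.random.default_rng(5)
mu0, mu1, sigma, alpha, beta = 0.0, 1.0, 1.0, 0.05, 0.05
reps = 2000
results = [sprt_normal(rng.normal(mu0, sigma, 500), mu0, mu1, sigma, alpha, beta)
           for _ in range(reps)]
reject_rate = sum(dec == 'reject H0' for dec, _ in results) / reps
avg_n = sum(n for _, n in results) / reps
```

Under H0 the simulated rejection rate stays below the bound α′ ≤ α/(1 − β) ≈ 0.053, and the test typically stops after only a handful of observations.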
Theorem 14.2.3:
For a SPRT with stopping bounds A and B, A > B, and strength (α, β), we have
A ≤ (1 − β)/α and B ≥ β/(1 − α),
where 0 < α < 1 and 0 < β < 1.
where 0 < α < 1 and 0 < β < 1.
Theorem 14.2.4:
Assume we select, for given α, β ∈ (0, 1) with α + β ≤ 1, the stopping bounds
A′ = (1 − β)/α and B′ = β/(1 − α).
Then it holds that the SPRT with stopping bounds A′ and B′ has strength (α′, β′), where
α′ ≤ α/(1 − β), β′ ≤ β/(1 − α), and α′ + β′ ≤ α + β.
Note:
(ii) A0 and B 0 are functions of α and β only and do not depend on the pdf’s (or pmf’s) fθ0
and fθ1 . Therefore, they can be computed once and for all fθi ’s, i = 0, 1.
Index
α–similar, 105 Efficient, More, 71
0–1 Loss, 125 Efficient, Most, 72
Empirical Cumulative Distribution Function, 36
A Posteriori Distribution, 84 Error, Type I, 90
A Priori Distribution, 84 Error, Type II, 90
Action, 81 Estimable, 146
Alternative Hypothesis, 89 Estimable Function, 57
Ancillary, 55 Estimate, Bayes, 85
Asymptotically (Most) Efficient, 72 Estimate, Maximum Likelihood, 76
Estimate, Method of Moments, 74
Basu’s Theorem, 55 Estimate, Minimax, 82
Bayes Estimate, 85 Estimate, Point, 44
Bayes Risk, 84 Estimator, 44
Bayes Rule, 85 Estimator, Mann–Whitney, 150
Bayesian Confidence Set, 143 Estimator, Wilcoxin 2–Sample, 150
Bias, 57 Exponential Family, One–Parameter, 53
177
Level of Significance, 91
Level–α–Test, 91
Likelihood Function, 76
Likelihood Ratio Test, 112
Likelihood Ratio Test Statistic, 112
Lindeberg Central Limit Theorem, 33
Lindeberg Condition, 33
Lindeberg–Lévy Central Limit Theorem, 30
LMVUE, 59
Locally Minimum Variance Unbiased Estimate, 59
Location Invariant, 46
Logic, 60
Loss Function, 81
Lower Confidence Bound, 131
LRT, 112
Mann–Whitney Estimator, 150
Maximal Invariant, 109
Maximum Likelihood Estimate, 76
Mean Square Error, 58
Mean–Squared–Error Consistent, 58
Measurement Invariance, 108
Method of Moments Estimate, 74
Minimax Estimate, 82
Minimax Principle, 82
Minimal Sufficient, 55
MLE, 76
MLR, 98
MOM, 74
Monotone Likelihood Ratio, 98
More Efficient, 71
Most Efficient, 72
Most Powerful Test, 91
MP, 91
MSE–Consistent, 58
Neyman–Pearson Lemma, 94
Nonparametric, 145
Nonrandomized Test, 91
Normal Variance Tests, 117
NP Lemma, 94
Null Hypothesis, 89
One Sample t–Test, 121
One–Tailed t-Test, 121
Paired t-Test, 122
Parameter Space, 44
Parametric Hypothesis, 89
Permutation Invariant, 46
Pivot, 134
Point Estimate, 44
Point Estimation, 44
Population Distribution, 36
Power, 91
Power Function, 91
Probability Integral Transformation, 152
Probability Ratio Test, Sequential, 173
Problem of Fit, 151
Proof by Contradiction, 60
Proportional Selection, 167
Random Interval, 130
Random Sample, 36
Random Sets, 130
Random Variable, Stopping, 170
Randomized Test, 91
Rao–Blackwell, 63
Rao–Blackwellization, 64
Realization, 36
Regularity Conditions, 67
Risk Function, 81
Risk, Bayes, 84
Sample, 36
Sample Central Moment of Order k, 37
Sample Mean, 36
Sample Moment of Order k, 37
Sample Statistic, 36
Sample Variance, 36
Scale Invariant, 46
Selection, Proportional, 167
Sequential Decision Procedure, 170
Sequential Probability Ratio Test, 173
Significance Level, 91
Similar, 105
Similar, α, 105
Simple, 89, 162
Simple Random Sample, 162
Size, 91
SPRT, 173
SRS, 162
Stable, 32
Statistic, 36
Statistic, Kolmogorov–Smirnov, 151
Statistic, Likelihood Ratio Test, 112
Stopping Random Variable, 170
Stopping Regions, 170
Stopping Rule, 170
Strata, 165
Stratified Random Sample, 165
Strongly Consistent, 45
Sufficient, 48, 145
Sufficient, Minimal, 55
Symmetric Kernel, 146
t–Test, 121
Test Function, 90
Test, Invariant, 108
Test, Kolmogorov–Smirnov, 155
Test, Likelihood Ratio, 112
Test, Most Powerful, 91
Test, Nonrandomized, 91
Test, Randomized, 91
Test, Uniformly Most Powerful, 91
Tolerance Coefficient, 158
Tolerance Interval, 158
Two–Sample t-Test, 121
Two–Tailed t-Test, 121
Type I Error, 90
Type II Error, 90
U–Statistic, 147
U–Statistic, Generalized, 150
UMA, 131
UMAU, 140
UMP, 91
UMP α–similar, 106
UMP Invariant, 110
UMP Unbiased, 102
UMPU, 102
UMVUE, 59
Unbiased, 57, 102, 140
Uniformly Minimum Variance Unbiased Estimate, 59
Uniformly Most Accurate, 131
Uniformly Most Accurate Unbiased, 140
Uniformly Most Powerful Test, 91
Unimodal, 135
Upper Confidence Bound, 131