Mathematical Statistics II
Spring Semester 2013
Logan, UT 84322–3900
e-mail: [email protected]
Contents

Acknowledgements

6 Limit Theorems
6.1 Modes of Convergence
6.2 Weak Laws of Large Numbers
6.3 Strong Laws of Large Numbers
6.4 Central Limit Theorems

7 Sample Moments
7.1 Random Sampling
7.2 Sample Moments and the Normal Distribution

9 Hypothesis Testing
9.1 Fundamental Notions
9.2 The Neyman–Pearson Lemma
9.3 Monotone Likelihood Ratios
9.4 Unbiased and Invariant Tests

11 Confidence Estimation
11.1 Fundamental Notions
11.2 Shortest–Length Confidence Intervals
11.3 Confidence Intervals and Hypothesis Tests
11.4 Bayes Confidence Intervals

Index
Acknowledgements
I would like to thank my students, Hanadi B. Eltahir, Rich Madsen, and Bill Morphet, who helped during the Fall 1999 and Spring 2000 semesters in typesetting these lecture notes using LaTeX, and for their suggestions on how to improve some of the material presented in class. Thanks are also due to the more than 60 students who took Stat 6710/20 with me since the Fall 2000 semester for their valuable comments that helped to improve and correct these lecture notes.
In addition, I would particularly like to thank Mike Minnotte and Dan Coster, who previously taught this course at Utah State University, for providing me with their lecture notes and other materials related to this course. Their lecture notes, combined with additional material from Casella/Berger (2002), Rohatgi (1976), and other sources listed below, form the basis of the script presented here.
The primary textbook required for this class is:
• Casella, G., and Berger, R. L. (2002): Statistical Inference (Second Edition), Duxbury
Press/Thomson Learning, Pacific Grove, CA.
http://www.math.usu.edu/~symanzik/teaching/2013_stat6720/stat6720.html
This course closely follows Casella and Berger (2002) as described in the syllabus. Additional
material originates from the lectures from Professors Hering, Trenkler, and Gather I have
attended while studying at the Universität Dortmund, Germany, the collection of Masters and
PhD Preliminary Exam questions from Iowa State University, Ames, Iowa, and the following
textbooks:
• Casella, G., and Berger, R. L. (1990): Statistical Inference, Wadsworth & Brooks/Cole,
Pacific Grove, CA.
• Johnson, N. L., and Kotz, S., and Balakrishnan, N. (1994): Continuous Univariate
Distributions, Volume 1 (Second Edition), Wiley, New York, NY.
• Johnson, N. L., and Kotz, S., and Balakrishnan, N. (1995): Continuous Univariate
Distributions, Volume 2 (Second Edition), Wiley, New York, NY.
• Mood, A. M., and Graybill, F. A., and Boes, D. C. (1974): Introduction to the Theory
of Statistics (Third Edition), McGraw-Hill, Singapore.
• Parzen, E. (1960): Modern Probability Theory and Its Applications, Wiley, New York,
NY.
• Tamhane, A. C., and Dunlop, D. D. (2000): Statistics and Data Analysis – From Ele-
mentary to Intermediate, Prentice Hall, Upper Saddle River, NJ.
Additional definitions, integrals, sums, etc. originate from the following formula collections:
6 Limit Theorems
6.1 Modes of Convergence
Definition 6.1.1:
Let X1 , . . . , Xn be iid rv’s with common cdf FX (x). Let T = T (X) be any statistic, i.e., a
Borel–measurable function of X that does not involve the population parameter(s) ϑ, defined
on the support X of X. The induced probability distribution of T (X) is called the sampling
distribution of T (X).
Note:
(ii) Recall that if X_1, ..., X_n are iid and if E(X) and Var(X) exist, then E(X̄_n) = µ = E(X), E(S_n²) = σ² = Var(X), and Var(X̄_n) = σ²/n.

(iii) Recall that if X_1, ..., X_n are iid and if X has mgf M_X(t) or characteristic function Φ_X(t), then M_{X̄_n}(t) = (M_X(t/n))^n or Φ_{X̄_n}(t) = (Φ_X(t/n))^n.
Note:
Let {X_n}_{n=1}^∞ be a sequence of rv's on some probability space (Ω, L, P). Is there any meaning behind the expression lim_{n→∞} X_n = X? Not immediately under the usual definitions of limits. We first need to define modes of convergence for rv's and probabilities.
Definition 6.1.2:
Let {X_n}_{n=1}^∞ be a sequence of rv's with cdf's {F_n}_{n=1}^∞ and let X be a rv with cdf F. If F_n(x) → F(x) at all continuity points of F, we say that X_n converges in distribution to X (X_n →d X), or X_n converges in law to X (X_n →L X), or F_n converges weakly to F (F_n →w F).
Example 6.1.3:
Let X_n ∼ N(0, 1/n). Then

F_n(x) = ∫_{−∞}^x exp(−(1/2) n t²) / √(2π/n) dt
       = ∫_{−∞}^{√n x} exp(−(1/2) s²) / √(2π) ds     (substituting s = √n t)
       = Φ(√n x)

⟹ F_n(x) → 0 for x < 0, F_n(0) = 1/2 for all n, and F_n(x) → 1 for x > 0.

If F_X(x) = { 1, x ≥ 0; 0, x < 0 }, the only point of discontinuity is at x = 0. Everywhere else, Φ(√n x) = F_n(x) → F_X(x), where Φ(z) = P(Z ≤ z) with Z ∼ N(0, 1).

So X_n →d X, where P(X = 0) = 1, or X_n →d 0 since the limiting rv here is degenerate, i.e., it has a Dirac(0) distribution.
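As a quick numerical check of this example (our addition, not part of the original notes), the exact cdf F_n(x) = Φ(√n x) can be evaluated with Python's standard library via math.erf; no simulation is needed:

```python
import math

def std_normal_cdf(z):
    # Phi(z) = (1 + erf(z / sqrt(2))) / 2
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def F_n(x, n):
    # cdf of X_n ~ N(0, 1/n): F_n(x) = Phi(sqrt(n) * x)
    return std_normal_cdf(math.sqrt(n) * x)

# F_n(-0.1, .), F_n(0, .), F_n(0.1, .) approach 0, 1/2, 1 as n grows
for n in [1, 100, 10000]:
    print(n, round(F_n(-0.1, n), 4), F_n(0.0, n), round(F_n(0.1, n), 4))
```

For n = 10000 the values at x = ∓0.1 are already numerically 0 and 1, matching the degenerate Dirac(0) limit away from the discontinuity at x = 0.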
Example 6.1.4:
In this example, the sequence {F_n}_{n=1}^∞ converges pointwise to something that is not a cdf:
Example 6.1.5:
Let {X_n}_{n=1}^∞ be a sequence of rv's such that P(X_n = 0) = 1 − 1/n and P(X_n = n) = 1/n, and let X ∼ Dirac(0), i.e., P(X = 0) = 1.
It is

F_n(x) = { 0, x < 0;  1 − 1/n, 0 ≤ x < n;  1, x ≥ n }

F_X(x) = { 0, x < 0;  1, x ≥ 0 }

It holds that F_n →w F_X, but for k ≥ 1,

E(X_n^k) = n^k · (1/n) = n^{k−1} ↛ E(X^k) = 0.
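A one-line check of the moment computation in this example (our addition): for k ≥ 1, E(X_n^k) = n^k · (1/n) = n^{k−1}, which does not tend to E(X^k) = 0 even though F_n →w F_X:

```python
def moment(n, k):
    # E(X_n^k) for P(X_n = 0) = 1 - 1/n, P(X_n = n) = 1/n, and k >= 1
    return (n ** k) * (1.0 / n)

for n in [10, 100, 1000]:
    print(n, moment(n, 1), moment(n, 2))  # k = 1 stays at 1, k = 2 grows like n
```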
Note:
Convergence in distribution does not say that the Xi ’s are close to each other or to X. It only
means that their cdf’s are (eventually) close to some cdf F . The Xi ’s do not even have to be
defined on the same probability space.
Example 6.1.6:
Let X and {X_n}_{n=1}^∞ be iid N(0, 1). Obviously, X_n →d X but lim_{n→∞} X_n ≠ X.
Theorem 6.1.7:
Let X and {X_n}_{n=1}^∞ be discrete rv's with supports 𝒳 and {𝒳_n}_{n=1}^∞, respectively. Define the countable set A = 𝒳 ∪ ⋃_{n=1}^∞ 𝒳_n = {a_k : k = 1, 2, 3, ...}. Let p_k = P(X = a_k) and p_{nk} = P(X_n = a_k). Then it holds that p_{nk} → p_k ∀k iff X_n →d X.
Theorem 6.1.8:
Let X and {X_n}_{n=1}^∞ be continuous rv's with pdf's f and {f_n}_{n=1}^∞, respectively. If f_n(x) → f(x) for almost all x as n → ∞, then X_n →d X.
Theorem 6.1.9:
Let X and {X_n}_{n=1}^∞ be rv's such that X_n →d X. Let c ∈ IR be a constant. Then it holds:

(i) X_n + c →d X + c.
(ii) cX_n →d cX.
(iii) If a_n → a and b_n → b, then a_n X_n + b_n →d aX + b.

Proof:
Part (iii):
Suppose that a > 0, a_n > 0. Let Y_n = a_n X_n + b_n and Y = aX + b. It is

F_Y(y) = P(Y ≤ y) = P(aX + b ≤ y) = P(X ≤ (y − b)/a) = F_X((y − b)/a).

Likewise,

F_{Y_n}(y) = F_{X_n}((y − b_n)/a_n).

If y is a continuity point of F_Y, then (y − b)/a is a continuity point of F_X. Since a_n → a, b_n → b, and F_{X_n}(x) → F_X(x), it follows that F_{Y_n}(y) → F_Y(y) for every continuity point y of F_Y. Thus, a_n X_n + b_n →d aX + b.
Definition 6.1.10:
Let {X_n}_{n=1}^∞ be a sequence of rv's defined on a probability space (Ω, L, P). We say that X_n converges in probability to a rv X (X_n →p X, P-lim_{n→∞} X_n = X) if

lim_{n→∞} P(|X_n − X| > ε) = 0 ∀ε > 0.

Note:
The following are equivalent:

lim_{n→∞} P(|X_n − X| > ε) = 0 ⟺ lim_{n→∞} P(|X_n − X| ≤ ε) = 1
Theorem 6.1.11:

(i) X_n →p X ⟺ X_n − X →p 0.
(ii) X_n →p X, X_n →p Y ⟹ P(X = Y) = 1.
(iii) X_n →p X, X_m →p X ⟹ X_n − X_m →p 0 as n, m → ∞.
(iv) X_n →p X, Y_n →p Y ⟹ X_n ± Y_n →p X ± Y.
(v) X_n →p X, k ∈ IR a constant ⟹ kX_n →p kX.
(vi) X_n →p k, k ∈ IR a constant ⟹ X_n^r →p k^r ∀r ∈ IN.
(vii) X_n →p a, Y_n →p b, a, b ∈ IR ⟹ X_n Y_n →p ab.
(viii) X_n →p 1 ⟹ X_n^{−1} →p 1.
(ix) X_n →p a, Y_n →p b, a ∈ IR, b ∈ IR − {0} ⟹ X_n / Y_n →p a/b.
(x) X_n →p X, Y an arbitrary rv ⟹ X_n Y →p XY.
(xi) X_n →p X, Y_n →p Y ⟹ X_n Y_n →p XY.
Proof:
See Rohatgi, page 244–245, and Rohatgi/Saleh, page 260–261 for partial proofs.
Theorem 6.1.12:
Let X_n →p X and let g be a continuous function on IR. Then g(X_n) →p g(X).

Proof:
Preconditions:
1.) X rv ⟹ ∀ε > 0 ∃k = k(ε): P(|X| > k) < ε/2
2.) g is continuous on IR

Let
A = {|X| ≤ k} = {ω : |X(ω)| ≤ k}
B = {|X_n − X| < δ} = {ω : |X_n(ω) − X(ω)| < δ}
C = {|g(X_n) − g(X)| < ε} = {ω : |g(X_n(ω)) − g(X(ω))| < ε}
Corollary 6.1.13:

(i) Let X_n →p c, c ∈ IR, and let g be a continuous function on IR. Then g(X_n) →p g(c).
(ii) Let X_n →d X and let g be a continuous function on IR. Then g(X_n) →d g(X).
(iii) Let X_n →d c, c ∈ IR, and let g be a continuous function on IR. Then g(X_n) →d g(c).

Theorem 6.1.14:
X_n →p X ⟹ X_n →d X.
Proof:
X_n →p X ⟺ P(|X_n − X| > ε) → 0 as n → ∞ ∀ε > 0.
It holds:
Theorem 6.1.15:
Let c ∈ IR be a constant. Then it holds:

X_n →d c ⟺ X_n →p c.
Example 6.1.16:
In this example, we will see that

X_n →d X ⇏ X_n →p X

for some rv X. Let X_n be identically distributed rv's and let (X_n, X) have the following joint distribution:

            X_n = 0    X_n = 1
X = 0          0         1/2      | 1/2
X = 1         1/2         0       | 1/2
              1/2        1/2      |  1

Since X_n and X have the same marginal distribution, X_n →d X. But |X_n − X| = 1 with probability 1, so P(|X_n − X| > ε) = 1 for any 0 < ε < 1, and therefore X_n ↛p X.
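A simulation sketch of this construction (our addition): take X ∼ Bin(1, 1/2) and X_n = 1 − X, which realizes exactly the joint distribution above — identical marginals, but |X_n − X| = 1 with probability 1:

```python
import random

rng = random.Random(1)
N = 100_000
xs = [rng.randint(0, 1) for _ in range(N)]   # X ~ Bin(1, 1/2)
xns = [1 - x for x in xs]                    # X_n = 1 - X: same marginal distribution

p_hat_x = sum(xs) / N
p_hat_xn = sum(xns) / N
frac_far = sum(1 for a, b in zip(xns, xs) if abs(a - b) > 0.5) / N

print(round(p_hat_x, 3), round(p_hat_xn, 3), frac_far)
```

The two estimated marginal means agree (≈ 1/2), so X_n →d X trivially, yet |X_n − X| > 1/2 in every single realization: convergence in distribution without convergence in probability.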
Theorem 6.1.17:
Let {X_n}_{n=1}^∞ and {Y_n}_{n=1}^∞ be sequences of rv's and X be a rv defined on a probability space (Ω, L, P). Then it holds:

Y_n →d X, |X_n − Y_n| →p 0 ⟹ X_n →d X.

Proof:
Similar to the proof of Theorem 6.1.14. See also Rohatgi, page 253, Theorem 14, and Rohatgi/Saleh, page 269, Theorem 14.
Theorem 6.1.18:

(i) X_n →d X, Y_n →p c ⟹ X_n + Y_n →d X + c.
(ii) X_n →d X, Y_n →p c ⟹ X_n Y_n →d cX. If c = 0, then also X_n Y_n →p 0.
(iii) X_n →d X, Y_n →p c ⟹ X_n / Y_n →d X/c if c ≠ 0.
Proof:
(i) Y_n →p c ⟺ (Th. 6.1.11(i)) Y_n − c →p 0
⟹ (X_n + Y_n) − (X_n + c) = Y_n + (X_n − X_n) − c = Y_n − c →p 0 (A)
X_n →d X ⟹ (Th. 6.1.9(i)) X_n + c →d X + c (B)
By Theorem 6.1.17, (A) and (B) give X_n + Y_n →d X + c.

(ii) Case c = 0:
∀ε > 0 ∀k > 0, it is

P(|X_n Y_n| > ε) = P(|X_n Y_n| > ε, |Y_n| ≤ ε/k) + P(|X_n Y_n| > ε, |Y_n| > ε/k)
                 ≤ P(|X_n| > ε/(ε/k)) + P(|Y_n| > ε/k)
                 = P(|X_n| > k) + P(|Y_n| > ε/k)

Since X_n →d X and Y_n →p 0, it follows for any fixed k > 0:

Case c ≠ 0:
Since X_n →d X and Y_n →p c, it follows from (ii), case c = 0, that X_n Y_n − cX_n = X_n(Y_n − c) →p 0.
Since cX_n →d cX by Theorem 6.1.9(ii), it follows from Theorem 6.1.17 (applied with Y_n replaced by cX_n):

X_n Y_n →d cX

(iii) Let Z_n →p 1 and let Y_n = cZ_n.
⟹ (c ≠ 0) 1/Y_n = (1/Z_n) · (1/c)
⟹ (Th. 6.1.11(v),(viii)) 1/Y_n →p 1/c
Now apply (ii) to X_n and 1/Y_n.
Definition 6.1.19:
Let {X_n}_{n=1}^∞ be a sequence of rv's such that E(|X_n|^r) < ∞ for some r > 0. We say that X_n converges in the r-th mean to a rv X (X_n →r X) if E(|X|^r) < ∞ and

lim_{n→∞} E(|X_n − X|^r) = 0.
Example 6.1.20:
Let {X_n}_{n=1}^∞ be a sequence of rv's defined by P(X_n = 0) = 1 − 1/n and P(X_n = 1) = 1/n.
It is E(|X_n|^r) = 1/n → 0 ∀r > 0. Therefore, X_n →r 0 ∀r > 0.
Note:
The special cases r = 1 and r = 2 are called convergence in absolute mean for r = 1 (X_n →1 X) and convergence in mean square for r = 2 (X_n →ms X or X_n →2 X).
Theorem 6.1.21:
Assume that X_n →r X for some r > 0. Then X_n →p X.
Proof:
Using Markov's Inequality (Corollary 3.5.2), it holds for any ε > 0:
Example 6.1.22:
Let {X_n}_{n=1}^∞ be a sequence of rv's defined by P(X_n = 0) = 1 − 1/n^r and P(X_n = n) = 1/n^r for some r > 0.
For any ε > 0, P(|X_n| > ε) → 0 as n → ∞; so X_n →p 0.
For 0 < s < r, E(|X_n|^s) = n^s/n^r = 1/n^{r−s} → 0 as n → ∞; so X_n →s 0. But E(|X_n|^r) = n^r/n^r = 1 ↛ 0 as n → ∞; so X_n ↛r 0.
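The moments in this example can be tabulated exactly (our addition): E(|X_n|^s) = n^s · n^{−r} = n^{s−r}, so the s-th absolute moment vanishes for s < r but stays at 1 for s = r:

```python
def abs_moment(n, r, s):
    # E(|X_n|^s) for P(X_n = 0) = 1 - n^(-r), P(X_n = n) = n^(-r)
    return (n ** s) * (n ** (-r))

r = 2.0
for n in [10, 100, 1000]:
    # s = 1 < r: E|X_n| = 1/n -> 0;  s = r = 2: E|X_n|^2 = 1 for every n
    print(n, abs_moment(n, r, s=1.0), abs_moment(n, r, s=2.0))
```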
Theorem 6.1.23:
If X_n →r X, then it holds:

(i) lim_{n→∞} E(|X_n|^r) = E(|X|^r).
(ii) X_n →s X for 0 < s < r.
Proof:
For r > 1, it follows from Minkowski's Inequality (Theorem 4.8.3):

[E(|X − X_n + X_n|^r)]^{1/r} ≤ [E(|X − X_n|^r)]^{1/r} + [E(|X_n|^r)]^{1/r}
⟹ [E(|X|^r)]^{1/r} − [E(|X_n|^r)]^{1/r} ≤ [E(|X − X_n|^r)]^{1/r}
⟹ [E(|X|^r)]^{1/r} − lim_{n→∞} [E(|X_n|^r)]^{1/r} ≤ lim_{n→∞} [E(|X_n − X|^r)]^{1/r} = 0 since X_n →r X
⟹ [E(|X|^r)]^{1/r} ≤ lim_{n→∞} [E(|X_n|^r)]^{1/r} (C)

Similarly,

[E(|X_n − X + X|^r)]^{1/r} ≤ [E(|X_n − X|^r)]^{1/r} + [E(|X|^r)]^{1/r}
⟹ lim_{n→∞} [E(|X_n|^r)]^{1/r} − [E(|X|^r)]^{1/r} ≤ lim_{n→∞} [E(|X_n − X|^r)]^{1/r} = 0 since X_n →r X
⟹ lim_{n→∞} [E(|X_n|^r)]^{1/r} ≤ [E(|X|^r)]^{1/r} (D)

(ii) For 0 < s < r, it follows from Lyapunov's Inequality (Theorem 3.5.4):

[E(|X_n − X|^s)]^{1/s} ≤ [E(|X_n − X|^r)]^{1/r}
⟹ E(|X_n − X|^s) ≤ [E(|X_n − X|^r)]^{s/r}
⟹ lim_{n→∞} E(|X_n − X|^s) ≤ lim_{n→∞} [E(|X_n − X|^r)]^{s/r} = 0 since X_n →r X
⟹ X_n →s X
Note that our proof of Theorem 3.5.4 only covers the case 1 ≤ s < r, but an alternative
proof shows that the result generally holds for 0 < s < r.
Definition 6.1.24:
Let {X_n}_{n=1}^∞ be a sequence of rv's on (Ω, L, P). We say that X_n converges almost surely to a rv X (X_n →a.s. X), or X_n converges with probability 1 to X (X_n →w.p.1 X), or X_n converges strongly to X, iff

P({ω : lim_{n→∞} X_n(ω) = X(ω)}) = 1.

Note:
An interesting characterization of convergence with probability 1 and convergence in probability can be found in Parzen (1960), "Modern Probability Theory and Its Applications", on page 416 (see Handout).
Example 6.1.25:
Let Ω = [0, 1] and P a uniform distribution on Ω. Let X_n(ω) = ω + ω^n and X(ω) = ω.
Theorem 6.1.26:
X_n →a.s. X ⟹ X_n →p X.

Proof:
Choose ε > 0 and δ > 0. Find n_0 = n_0(ε, δ) such that

P( ⋂_{n=n_0}^∞ {|X_n − X| ≤ ε} ) ≥ 1 − δ.
Example 6.1.27:
X_n →p X ⇏ X_n →a.s. X:
Let Ω = (0, 1] with P the uniform distribution, and let X_n = I_{A_n} for the intervals

A_1 = (0, 1/2], A_2 = (1/2, 1]
A_3 = (0, 1/4], A_4 = (1/4, 1/2], A_5 = (1/2, 3/4], A_6 = (3/4, 1]
A_7 = (0, 1/8], A_8 = (1/8, 1/4], ...

Since P(A_n) → 0, it holds that X_n →p 0. But P({ω : X_n(ω) → 0}) = 0 (and not 1) because any ω keeps being in some A_n beyond any n_0, i.e., X_n(ω) looks like 0...010...010...010..., so X_n ↛a.s. 0.
Example 6.1.28:
X_n →r X ⇏ X_n →a.s. X:
Let X_n be independent rv's such that P(X_n = 0) = 1 − 1/n and P(X_n = 1) = 1/n.
It is E(|X_n − 0|^r) = E(|X_n|^r) = E(|X_n|) = 1/n → 0 as n → ∞, so X_n →r 0 ∀r > 0 (and due to Theorem 6.1.21, also X_n →p 0).
But

P(X_n = 0 ∀m ≤ n ≤ n_0) = ∏_{n=m}^{n_0} (1 − 1/n) = ((m−1)/m)(m/(m+1))((m+1)/(m+2)) ··· ((n_0−2)/(n_0−1))((n_0−1)/n_0) = (m−1)/n_0

As n_0 → ∞, it is P(X_n = 0 ∀m ≤ n ≤ n_0) → 0 ∀m, so X_n ↛a.s. 0.
Example 6.1.29:
X_n →a.s. X ⇏ X_n →r X:

But E(|X_n − 0|^r) = n^r / ln n → ∞ ∀r > 0, so X_n ↛r X.
6.2 Weak Laws of Large Numbers

Theorem 6.2.1:
Let {X_i}_{i=1}^∞ be a sequence of iid rv's with E(X_i) = µ and Var(X_i) = σ² < ∞. Then it holds:

lim_{n→∞} P(|X̄_n − µ| ≥ ε) = 0 ∀ε > 0,

i.e., X̄_n →p µ.
Proof:
Note:
For iid rv’s with finite variance, X n is consistent for µ.
Definition 6.2.2:
Let {X_i}_{i=1}^∞ be a sequence of rv's. Let T_n = Σ_{i=1}^n X_i. We say that {X_i} obeys the WLLN with respect to a sequence of norming constants {B_i}_{i=1}^∞, B_i > 0, B_i ↑ ∞, if there exists a sequence of centering constants {A_i}_{i=1}^∞ such that

B_n^{−1}(T_n − A_n) →p 0.
Theorem 6.2.3:
Let {X_i}_{i=1}^∞ be a sequence of pairwise uncorrelated rv's with E(X_i) = µ_i and Var(X_i) = σ_i², i ∈ IN. If Σ_{i=1}^n σ_i² → ∞ as n → ∞, we can choose A_n = Σ_{i=1}^n µ_i and B_n = Σ_{i=1}^n σ_i² and get

Σ_{i=1}^n (X_i − µ_i) / Σ_{i=1}^n σ_i² →p 0.

Proof:
By Markov's Inequality (Corollary 3.5.2), it holds for all ε > 0:

P( |Σ_{i=1}^n X_i − Σ_{i=1}^n µ_i| > ε Σ_{i=1}^n σ_i² ) ≤ E((Σ_{i=1}^n (X_i − µ_i))²) / (ε² (Σ_{i=1}^n σ_i²)²) = Σ_{i=1}^n σ_i² / (ε² (Σ_{i=1}^n σ_i²)²) = 1 / (ε² Σ_{i=1}^n σ_i²) → 0 as n → ∞
Note:
To obtain Theorem 6.2.1, we choose An = nµ and Bn = nσ 2 .
Theorem 6.2.4:
Let {X_i}_{i=1}^∞ be a sequence of rv's. Let X̄_n = (1/n) Σ_{i=1}^n X_i. A necessary and sufficient condition for {X_i} to obey the WLLN with respect to B_n = n is that

E( X̄_n² / (1 + X̄_n²) ) → 0

as n → ∞.

Proof:
Rohatgi, page 258, Theorem 2, and Rohatgi/Saleh, page 275, Theorem 2.
Example 6.2.5:
Let (X_1, ..., X_n) be jointly normal with E(X_i) = 0, E(X_i²) = 1 for all i, and Cov(X_i, X_j) = ρ if |i − j| = 1 and Cov(X_i, X_j) = 0 if |i − j| > 1. Let T_n = Σ_{i=1}^n X_i. Then T_n ∼ N(0, n + 2(n−1)ρ) = N(0, σ²). It is

E( X̄_n² / (1 + X̄_n²) ) = E( T_n² / (n² + T_n²) )
= (2 / (√(2π) σ)) ∫_0^∞ e^{−x²/(2σ²)} x² / (n² + x²) dx     | y = x/σ, dy = dx/σ
= (2 / √(2π)) ∫_0^∞ e^{−y²/2} σ²y² / (n² + σ²y²) dy
= (2 / √(2π)) ∫_0^∞ e^{−y²/2} (n + 2(n−1)ρ)y² / (n² + (n + 2(n−1)ρ)y²) dy
≤ ((n + 2(n−1)ρ) / n²) ∫_0^∞ (2/√(2π)) y² e^{−y²/2} dy     [the integral equals 1, being the variance of a N(0, 1) distribution]
→ 0 as n → ∞

⟹ X̄_n →p 0
Note:
We would like to have a WLLN that just depends on means but does not depend on the existence of finite variances. To approach this, we consider the following:

Let {X_i}_{i=1}^∞ be a sequence of rv's. Let T_n = Σ_{i=1}^n X_i. We truncate each |X_i| at c > 0 and get

X_i^c = { X_i, |X_i| ≤ c;  0, otherwise }

Let T_n^c = Σ_{i=1}^n X_i^c and m_n = Σ_{i=1}^n E(X_i^c).

Lemma 6.2.6:
For T_n, T_n^c and m_n as defined in the Note above, it holds:

P(|T_n − m_n| > ε) ≤ P(|T_n^c − m_n| > ε) + Σ_{i=1}^n P(|X_i| > c) ∀ε > 0
Proof:
Note:
If the Xi ’s are identically distributed, then
Theorem 6.2.7: Khintchine's WLLN
Let {X_i}_{i=1}^∞ be a sequence of iid rv's with finite mean E(X_i) = µ. Then it holds:

X̄_n = (1/n) T_n →p µ

Proof:
If we take c = n and replace ε by nε in (∗) in the Note above, we get

P( |T_n − m_n| / n > ε ) = P( |T_n − m_n| > nε ) ≤ E((X_1^n)²) / (n ε²) + n P(|X_1| > n).
Note:
Theorem 6.2.7 meets the previously stated goal of not having a finite variance requirement.
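A Monte Carlo sketch of Khintchine's WLLN (our addition; the Exp(1) population with µ = 1 is an arbitrary choice, not from the notes). The estimated probability P(|X̄_n − µ| ≥ ε) shrinks toward 0 as n grows:

```python
import random

def prob_deviation(n, mu, eps, reps, rng):
    # Monte Carlo estimate of P(|X_bar_n - mu| >= eps) for Exp(1) samples (mu = 1)
    count = 0
    for _ in range(reps):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        if abs(xbar - mu) >= eps:
            count += 1
    return count / reps

rng = random.Random(42)
for n in [10, 100, 1000]:
    print(n, prob_deviation(n, mu=1.0, eps=0.2, reps=1000, rng=rng))
```

The printed probabilities decrease toward 0, which is exactly the statement X̄_n →p µ.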
6.3 Strong Laws of Large Numbers
Definition 6.3.1:
Let {X_i}_{i=1}^∞ be a sequence of rv's. Let T_n = Σ_{i=1}^n X_i. We say that {X_i} obeys the SLLN with respect to a sequence of norming constants {B_i}_{i=1}^∞, B_i > 0, B_i ↑ ∞, if there exists a sequence of centering constants {A_i}_{i=1}^∞ such that

B_n^{−1}(T_n − A_n) →a.s. 0.
Note:
Unless otherwise specified, we will only use the case that Bn = n in this section.
Theorem 6.3.2:
X_n →a.s. X ⟺ lim_{n→∞} P( sup_{m≥n} |X_m − X| > ε ) = 0 ∀ε > 0.

Proof: (see also Rohatgi, page 249, Theorem 11)
WLOG, we can assume that X = 0 since X_n →a.s. X implies X_n − X →a.s. 0. Thus, we have to prove:

X_n →a.s. 0 ⟺ lim_{n→∞} P( sup_{m≥n} |X_m| > ε ) = 0 ∀ε > 0

Choose ε > 0 and define

A_n(ε) = { sup_{m≥n} |X_m| > ε }
C = { lim_{n→∞} X_n = 0 }

"⟹":
Since X_n →a.s. 0, we know that P(C) = 1 and therefore P(C^c) = 0.
Let B_n(ε) = C ∩ A_n(ε). Note that B_{n+1}(ε) ⊆ B_n(ε) and for the limit set ⋂_{n=1}^∞ B_n(ε) = Ø. It follows that

lim_{n→∞} P(B_n(ε)) = P( ⋂_{n=1}^∞ B_n(ε) ) = 0.

We also have

P(B_n(ε)) = P(A_n ∩ C) = 1 − P(C^c ∪ A_n^c) = 1 − P(C^c) − P(A_n^c) + P(C^c ∩ A_n^c) = P(A_n),

since P(C^c) = 0 and P(C^c ∩ A_n^c) = 0.

⟹ lim_{n→∞} P(A_n(ε)) = 0

"⟸":
Assume that lim_{n→∞} P(A_n(ε)) = 0 ∀ε > 0 and define D(ε) = { limsup_{n→∞} |X_n| > ε }.

⟹ 1 − P(C) ≤ Σ_{k=1}^∞ P(D(1/k)) = 0
⟹ X_n →a.s. 0
Note:
(i) X_n →a.s. 0 implies that ∀ε > 0 ∀δ > 0 ∃n_0 ∈ IN: P( sup_{n≥n_0} |X_n| > ε ) < δ.

(ii) For a sequence of events {A_n}_{n=1}^∞,

A = limsup_{n→∞} A_n = lim_{n→∞} ⋃_{k=n}^∞ A_k = ⋂_{n=1}^∞ ⋃_{k=n}^∞ A_k

is the event that infinitely many of the A_n occur. We write P(A) = P(A_n i.o.), where i.o. stands for "infinitely often".

(iii) Using the terminology defined in (ii) above, we can rewrite Theorem 6.3.2 as

X_n →a.s. 0 ⟺ P(|X_n| > ε i.o.) = 0 ∀ε > 0.
Theorem 6.3.3: Borel–Cantelli Lemma
Let A be defined as in (ii) of the previous Note.

(i) If Σ_{n=1}^∞ P(A_n) < ∞, then P(A) = 0.
(ii) If the A_n are independent and Σ_{n=1}^∞ P(A_n) = ∞, then P(A) = 1.

Proof:
(i):

P(A) = P( lim_{n→∞} ⋃_{k=n}^∞ A_k )
     = lim_{n→∞} P( ⋃_{k=n}^∞ A_k )
     ≤ lim_{n→∞} Σ_{k=n}^∞ P(A_k)
     = lim_{n→∞} ( Σ_{k=1}^∞ P(A_k) − Σ_{k=1}^{n−1} P(A_k) )
     = 0

(ii): We have A^c = ⋃_{n=1}^∞ ⋂_{k=n}^∞ A_k^c. Therefore,

P(A^c) = P( lim_{n→∞} ⋂_{k=n}^∞ A_k^c ) = lim_{n→∞} P( ⋂_{k=n}^∞ A_k^c ).

Therefore,

P( ⋂_{k=n}^∞ A_k^c ) ≤ lim_{n_0→∞} P( ⋂_{k=n}^{n_0} A_k^c )
                     = (indep.) lim_{n_0→∞} ∏_{k=n}^{n_0} (1 − P(A_k))
                     ≤ lim_{n_0→∞} exp( − Σ_{k=n}^{n_0} P(A_k) )     (since 1 − x ≤ e^{−x})
                     = 0

⟹ P(A) = 1
Example 6.3.4:
Independence is necessary for the 2nd BC Lemma: let Ω = (0, 1], P the uniform distribution on Ω, and A_n = (0, 1/n]. Then

Σ_{n=1}^∞ P(A_n) = Σ_{n=1}^∞ 1/n = ∞.

But for any ω ∈ Ω, A_n occurs only for n = 1, 2, ..., ⌊1/ω⌋, where ⌊1/ω⌋ denotes the largest integer ("floor") that is ≤ 1/ω. Therefore, P(A) = P(A_n i.o.) = 0.
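A small simulation contrasting the two situations (our addition; the interval construction A_n = (0, 1/n] follows the example above, while the independent events with P(A_n) = 1/n are the standard second Borel–Cantelli setting):

```python
import math
import random

def count_independent_occurrences(n_max, rng):
    # independent events A_n with P(A_n) = 1/n; the count up to n_max has
    # expectation H_{n_max} ~ ln(n_max) and keeps growing: A_n i.o. w.p. 1
    return sum(1 for n in range(1, n_max + 1) if rng.random() < 1 / n)

rng = random.Random(0)
counts = [count_independent_occurrences(100_000, rng) for _ in range(50)]
mean_count = sum(counts) / len(counts)
print(round(mean_count, 2), round(math.log(100_000) + 0.5772, 2))

# dependent construction A_n = (0, 1/n] with one fixed omega ~ U(0, 1):
# A_n occurs exactly for n <= floor(1/omega), i.e., only finitely often
omega = rng.random()
print(math.floor(1 / omega))
```

Both constructions have the same marginal probabilities P(A_n) = 1/n, but only the independent one produces occurrences beyond every fixed index.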
Proof:
See Rohatgi, page 268, Lemma 2, and Rohatgi/Saleh, page 284, Lemma 1.
Proof:
See Rohatgi, page 269, Lemma 3, and Rohatgi/Saleh, page 285, Lemma 2.
Proof:
See Rohatgi, page 270, Theorem 5.
Theorem 6.3.8:
Let {X_n}_{n=1}^∞ be a sequence of independent rv's. If Σ_{n=1}^∞ Var(X_n) < ∞, then Σ_{n=1}^∞ (X_n − E(X_n)) converges almost surely.

Proof:
See Rohatgi, page 272, Theorem 6, and Rohatgi/Saleh, page 286, Theorem 4.

Corollary 6.3.9:
Let {X_i}_{i=1}^∞ be a sequence of independent rv's. Let {B_i}_{i=1}^∞, B_i > 0, B_i ↑ ∞, be a sequence of norming constants. Let T_n = Σ_{i=1}^n X_i. If Σ_{i=1}^∞ Var(X_i)/B_i² < ∞, then it holds:

(T_n − E(T_n)) / B_n →a.s. 0

Proof:
This Corollary follows directly from Theorem 6.3.8 and Lemma 6.3.6.
Lemma 6.3.11:
Let X be a rv with E(|X|) < ∞. Then it holds:

Σ_{n=1}^∞ P(|X| ≥ n) ≤ E(|X|) ≤ 1 + Σ_{n=1}^∞ P(|X| ≥ n)

Proof:
Continuous case only:
Let X have a pdf f. Then it holds:

E(|X|) = ∫_{−∞}^∞ |x| f(x) dx = Σ_{k=0}^∞ ∫_{k≤|x|≤k+1} |x| f(x) dx

⟹ Σ_{k=0}^∞ k P(k ≤ |X| ≤ k+1) ≤ E(|X|) ≤ Σ_{k=0}^∞ (k+1) P(k ≤ |X| ≤ k+1)

It is

Σ_{k=0}^∞ k P(k ≤ |X| ≤ k+1) = Σ_{k=0}^∞ Σ_{n=1}^k P(k ≤ |X| ≤ k+1)
                             = Σ_{n=1}^∞ Σ_{k=n}^∞ P(k ≤ |X| ≤ k+1)
                             = Σ_{n=1}^∞ P(|X| ≥ n)

Similarly,

Σ_{k=0}^∞ (k+1) P(k ≤ |X| ≤ k+1) = Σ_{n=1}^∞ P(|X| ≥ n) + Σ_{k=0}^∞ P(k ≤ |X| ≤ k+1)
                                 = Σ_{n=1}^∞ P(|X| ≥ n) + 1
Theorem 6.3.12:
Let {X_i}_{i=1}^∞ be a sequence of independent rv's. Then it holds:

X_n →a.s. 0 ⟺ Σ_{n=1}^∞ P(|X_n| > ε) < ∞ ∀ε > 0

Proof:
See Rohatgi, page 265, Theorem 3.
Theorem 6.3.13: Kolmogorov's SLLN
Let {X_i}_{i=1}^∞ be a sequence of iid rv's. Let T_n = Σ_{i=1}^n X_i. Then it holds:

T_n / n = X̄_n →a.s. µ < ∞ ⟺ E(|X|) < ∞ (and then µ = E(X))

Proof:
"⟹":
Suppose that X̄_n →a.s. µ < ∞. It is

"⟸":
Let E(|X|) < ∞.
It is

Σ_{n=k}^∞ 1/n² = 1/k² + 1/(k+1)² + 1/(k+2)² + ...
              ≤ 1/k² + 1/(k(k+1)) + 1/((k+1)(k+2)) + ...
              = 1/k² + Σ_{n=k+1}^∞ 1/(n(n−1))

⟹ Σ_{n=k+1}^∞ 1/(n(n−1)) = 1 − 1/(1·2) − 1/(2·3) − 1/(3·4) − ... − 1/((k−1)·k)
                          = 1/2 − 1/(2·3) − 1/(3·4) − ... − 1/((k−1)·k)
                          = 1/3 − 1/(3·4) − ... − 1/((k−1)·k)
                          = 1/4 − ... − 1/((k−1)·k)
                          = ...
                          = 1/k

⟹ Σ_{n=k}^∞ 1/n² ≤ 1/k² + Σ_{n=k+1}^∞ 1/(n(n−1))
                 = 1/k² + 1/k
                 ≤ 2/k
6.4 Central Limit Theorems

Let {X_n}_{n=1}^∞ be a sequence of rv's with cdf's {F_n}_{n=1}^∞. Suppose that the mgf M_n(t) of X_n exists.
Questions: Does M_n(t) converge? Does it converge to an mgf M(t)? If it does converge, does it hold that X_n →d X for some rv X?

Example 6.4.1:
Let {X_n}_{n=1}^∞ be a sequence of rv's such that P(X_n = −n) = 1. Then the mgf is M_n(t) = E(e^{tX_n}) = e^{−tn}. So

lim_{n→∞} M_n(t) = { 0, t > 0;  1, t = 0;  ∞, t < 0 }

So M_n(t) does not converge to an mgf, and F_n(x) → F(x) = 1 ∀x. But F(x) is not a cdf.
Note:
Due to Example 6.4.1, the existence of mgf's M_n(t) that converge to something is not enough to conclude convergence in distribution.
Conversely, suppose that X_n has mgf M_n(t), X has mgf M(t), and X_n →d X. Does it hold that M_n(t) → M(t)?
Not necessarily! See Rohatgi, page 277, Example 2, and Rohatgi/Saleh, page 289, Example 2, as a counterexample. Thus, convergence in distribution of rv's that all have mgf's does not imply the convergence of mgf's.
Example 6.4.3:
Let X_n ∼ Bin(n, λ/n). Recall (e.g., from Theorem 3.3.12 and related Theorems) that for X ∼ Bin(n, p) the mgf is M_X(t) = (1 − p + pe^t)^n. Thus,
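The limit behind this example — M_n(t) = (1 − λ/n + (λ/n)e^t)^n → e^{λ(e^t−1)}, the Poisson(λ) mgf — can be checked numerically (our addition; λ = 3 and t = 0.5 are arbitrary choices):

```python
import math

def binom_mgf(t, n, p):
    # M_X(t) = (1 - p + p e^t)^n for X ~ Bin(n, p)
    return (1 - p + p * math.exp(t)) ** n

def poisson_mgf(t, lam):
    # M_X(t) = exp(lam (e^t - 1)) for X ~ Poisson(lam)
    return math.exp(lam * (math.exp(t) - 1))

lam, t = 3.0, 0.5
for n in [10, 100, 10000]:
    print(n, round(binom_mgf(t, n, lam / n), 5), round(poisson_mgf(t, lam), 5))
```

Together with a continuity theorem for mgf's, this pointwise convergence gives Bin(n, λ/n) →d Poisson(λ).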
Note:
Recall Theorem 3.3.11: Suppose that {X_n}_{n=1}^∞ is a sequence of rv's with characteristic functions {Φ_n(t)}_{n=1}^∞. Suppose that

lim_{n→∞} Φ_n(t) = Φ(t) ∀t

and Φ(t) is the characteristic function of a rv X. Then X_n →d X.

Let Φ(t) be the characteristic function of X_i. We now determine the characteristic function Φ_n(t) of √n(X̄_n − µ)/σ:
Here we make use of the Landau symbol "o". In general, if we write u(x) = o(v(x)) for x → L, this implies lim_{x→L} u(x)/v(x) = 0, i.e., u(x) goes to 0 faster than v(x), or v(x) goes to ∞ faster than u(x). We say that u(x) is of smaller order than v(x) as x → L. Examples are 1/x³ = o(1/x²) and x² = o(x³) for x → ∞. See Rohatgi, page 6, for more details on the Landau symbols "O" and "o".
Definition 6.4.5:
Let X1 , X2 be iid non–degenerate rv’s with common cdf F . Let a1 , a2 > 0. We say that F is
stable if there exist constants A and B (depending on a1 and a2 ) such that
B −1 (a1 X1 + a2 X2 − A) also has cdf F .
Note:
When generalizing the previous definition to sequences of rv's, we have the following examples of stable distributions:

• X_i iid Cauchy. Then (1/n) Σ_{i=1}^n X_i ∼ Cauchy (here B_n = n, A_n = 0).

• X_i iid N(0, 1). Then (1/√n) Σ_{i=1}^n X_i ∼ N(0, 1) (here B_n = √n, A_n = 0).
Definition 6.4.6:
Let {X_i}_{i=1}^∞ be a sequence of iid rv's with common cdf F. Let T_n = Σ_{i=1}^n X_i. F belongs to the domain of attraction of a distribution V if there exist norming and centering constants {B_n}_{n=1}^∞, B_n > 0, and {A_n}_{n=1}^∞ such that the cdf of B_n^{−1}(T_n − A_n) converges weakly to V.
Note:
A very general Theorem from Loève states that only stable distributions can have domains of attraction. From the practical point of view, a wide class of distributions F belongs to the domain of attraction of the Normal distribution.
Theorem 6.4.7: Lindeberg Central Limit Theorem
Let {X_i}_{i=1}^∞ be a sequence of independent non-degenerate rv's with cdf's {F_i}_{i=1}^∞. Assume that E(X_k) = µ_k and Var(X_k) = σ_k² < ∞. Let s_n² = Σ_{k=1}^n σ_k².

If the F_k are absolutely continuous with pdf's f_k = F_k′, assume that it holds for all ε > 0 that

(A) lim_{n→∞} (1/s_n²) Σ_{k=1}^n ∫_{{|x−µ_k| > ε s_n}} (x − µ_k)² f_k(x) dx = 0.

If the X_k are discrete rv's with support {x_{kl}} and probabilities {p_{kl}}, l = 1, 2, ..., assume that it holds for all ε > 0 that

(B) lim_{n→∞} (1/s_n²) Σ_{k=1}^n Σ_{|x_{kl}−µ_k| > ε s_n} (x_{kl} − µ_k)² p_{kl} = 0.

The conditions (A) and (B) are called the Lindeberg Condition (LC). If either LC holds, then

Σ_{k=1}^n (X_k − µ_k) / s_n →d Z,

where Z ∼ N(0, 1).
Proof:
Similar to the proof of Theorem 6.4.4, we can use characteristic functions again. An alterna-
tive proof is given in Rohatgi, pages 282–288.
Note:
Feller shows that the LC is a necessary condition if σ_n²/s_n² → 0 and s_n² → ∞ as n → ∞.
Corollary 6.4.8:
Let {X_i}_{i=1}^∞ be a sequence of iid rv's such that (1/√n) Σ_{i=1}^n X_i has the same distribution for all n. If E(X_i) = 0 and Var(X_i) = 1, then X_i ∼ N(0, 1).

Proof:
Let F be the common cdf of (1/√n) Σ_{i=1}^n X_i for all n (including n = 1). By the CLT,

lim_{n→∞} P( (1/√n) Σ_{i=1}^n X_i ≤ x ) = Φ(x),

where Φ(x) denotes P(Z ≤ x) for Z ∼ N(0, 1). Also, P( (1/√n) Σ_{i=1}^n X_i ≤ x ) = F(x) for each n. Therefore, we must have F(x) = Φ(x).
Note:
In general, if X1 , X2 , . . ., are independent rv’s such that there exists a constant A with
P (| Xn |≤ A) = 1 ∀n, then the LC is satisfied if s2n → ∞ as n → ∞. Why??
Suppose that s2n → ∞ as n → ∞. Since the | Xk |’s are uniformly bounded (by A), so are the
rv’s (Xk − E(Xk )). Thus, for every > 0 there exists an N such that if n ≥ N then
This implies that the LC holds since we would integrate (or sum) over the empty set, i.e., the
set {| x − µk |> sn } = Ø.
The converse also holds. For a sequence of uniformly bounded independent rv’s, a necessary
and sufficient condition for the CLT to hold is that s2n → ∞ as n → ∞.
Example 6.4.9:
Let {X_i}_{i=1}^∞ be a sequence of independent rv's such that E(X_k) = 0, α_k = E(|X_k|^{2+δ}) < ∞ for some δ > 0, and Σ_{k=1}^n α_k = o(s_n^{2+δ}).

Does the LC hold? It is:

(1/s_n²) Σ_{k=1}^n ∫_{{|x| > ε s_n}} x² f_k(x) dx
  ≤ (A) (1/s_n²) Σ_{k=1}^n ∫_{{|x| > ε s_n}} ( |x|^{2+δ} / (ε^δ s_n^δ) ) f_k(x) dx
  ≤ (1/(ε^δ s_n^{2+δ})) Σ_{k=1}^n ∫_{−∞}^∞ |x|^{2+δ} f_k(x) dx
  = (1/(ε^δ s_n^{2+δ})) Σ_{k=1}^n α_k
  = (1/ε^δ) · ( Σ_{k=1}^n α_k ) / s_n^{2+δ}
  → (B) 0 as n → ∞

(A) holds since for |x| > ε s_n, it is |x|^δ / (ε^δ s_n^δ) > 1. (B) holds since Σ_{k=1}^n α_k = o(s_n^{2+δ}).
Thus, the LC is satisfied and the CLT holds.
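A simulation sketch of the CLT for independent, non-identically distributed rv's (our addition; the uniformly bounded Uniform(−c_k, c_k) design is an arbitrary choice for which s_n² → ∞, so the LC holds by the boundedness argument in the Note above):

```python
import math
import random

def standardized_sum(n, rng):
    # independent X_k ~ Uniform(-c_k, c_k), c_k = 1 + (k % 3): non-identical but
    # uniformly bounded, so the LC holds as soon as s_n^2 -> infinity
    total, var = 0.0, 0.0
    for k in range(1, n + 1):
        c = 1 + (k % 3)
        total += rng.uniform(-c, c)
        var += c * c / 3.0        # Var(Uniform(-c, c)) = c^2 / 3
    return total / math.sqrt(var)

rng = random.Random(7)
zs = [standardized_sum(300, rng) for _ in range(4000)]
frac_within = sum(1 for z in zs if abs(z) <= 1.96) / len(zs)
print(round(frac_within, 3))  # near 0.95 under a N(0, 1) limit
```

The fraction of standardized sums inside [−1.96, 1.96] matches the standard normal probability 0.95, as Theorem 6.4.7 predicts.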
Note:
(ii) Both the CLT and the WLLN hold for a large class of sequences of rv's {X_i}. If the {X_i}'s are independent uniformly bounded rv's, i.e., if P(|X_n| ≤ M) = 1 ∀n, the WLLN (as formulated in Theorem 6.2.3) holds. The CLT holds provided that s_n² → ∞ as n → ∞.
If the rv's {X_i} are iid, then the CLT is a stronger result than the WLLN since the CLT provides an estimate of the probability P( (1/n) |Σ_{i=1}^n X_i − nµ| ≥ ε ) ≈ 1 − P(|Z| ≤ ε√n/σ), where Z ∼ N(0, 1), and the WLLN follows. However, note that the CLT requires the existence of a 2nd moment while the WLLN does not.
(iii) If the {Xi } are independent (but not identically distributed) rv’s, the CLT may apply
while the WLLN does not.
(iv) See Rohatgi, pages 289–293, and Rohatgi/Saleh, pages 299–303, for additional details
and examples.
7 Sample Moments

7.1 Random Sampling
Definition 7.1.2:
Let X_1, ..., X_n be a sample of size n from a population with distribution F. Then

X̄ = (1/n) Σ_{i=1}^n X_i

is called the sample mean.
Definition 7.1.3:
Let X_1, ..., X_n be a sample of size n from a population with distribution F. The function

F̂_n(x) = (1/n) Σ_{i=1}^n I_{(−∞,x]}(X_i)

is called the empirical cumulative distribution function (empirical cdf).

Note:
For any fixed x ∈ IR, F̂_n(x) is a rv.
Theorem 7.1.4:
The rv F̂_n(x) has pmf

P( F̂_n(x) = j/n ) = (n choose j) (F(x))^j (1 − F(x))^{n−j}, j ∈ {0, 1, ..., n},

with E(F̂_n(x)) = F(x) and Var(F̂_n(x)) = F(x)(1 − F(x))/n.

Proof:
It is I_{(−∞,x]}(X_i) ∼ Bin(1, F(x)). Then nF̂_n(x) ∼ Bin(n, F(x)).
Corollary 7.1.5:
By the WLLN, it follows that

F̂_n(x) →p F(x).

Corollary 7.1.6:
By the CLT, it follows that

√n (F̂_n(x) − F(x)) / √( F(x)(1 − F(x)) ) →d Z,

where Z ∼ N(0, 1).
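A simulation check of Theorem 7.1.4 (our addition; the Uniform(0, 1) population is an arbitrary choice, so F(x) = x):

```python
import random

rng = random.Random(123)

def ecdf_at(x, sample):
    # F_hat_n(x) = (1/n) #{i : X_i <= x}
    return sum(1 for xi in sample if xi <= x) / len(sample)

n, x = 200, 0.3                    # F(x) = 0.3 for Uniform(0, 1)
vals = [ecdf_at(x, [rng.random() for _ in range(n)]) for _ in range(5000)]
mean = sum(vals) / len(vals)
var = sum((v - mean) ** 2 for v in vals) / len(vals)
print(round(mean, 3), round(var, 5), round(x * (1 - x) / n, 5))
```

The empirical mean and variance of F̂_n(0.3) match F(x) and F(x)(1 − F(x))/n, as the Bin(n, F(x)) representation predicts.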
Definition 7.1.8:
Let X_1, ..., X_n be a sample of size n from a population with distribution F. We call

a_k = (1/n) Σ_{i=1}^n X_i^k

the sample moment of order k, and

b_k = (1/n) Σ_{i=1}^n (X_i − X̄)^k

the sample central moment of order k.

Note:
It is b_1 = 0 and b_2 = ((n−1)/n) S².
Theorem 7.1.9:
Let X_1, ..., X_n be a sample of size n from a population with distribution F. Assume that E(X) = µ, Var(X) = σ², and E((X − µ)^k) = µ_k exist. Then it holds:

(i) E(a_1) = E(X̄) = µ
(ii) Var(a_1) = Var(X̄) = σ²/n
(iii) E(b_2) = ((n−1)/n) σ²
(v) E(S²) = σ²
(vi) Var(S²) = µ_4/n − ((n−3)/(n(n−1))) µ_2²

Proof:
(i)

See Casella/Berger, page 214, and Rohatgi, pages 303–306, for the proof of parts (iv) through (vi) and results regarding the 3rd and 4th moments and covariances.
7.2 Sample Moments and the Normal Distribution

(1):

From (1) and (2), it follows:
Corollary 7.2.2:
X̄ and S² are independent.

Proof:
This can be seen since S² is a function of the vector (X_1 − X̄, ..., X_n − X̄), and (X_1 − X̄, ..., X_n − X̄) is independent of X̄, as previously shown in Theorem 7.2.1. We can use Theorem 4.2.7 to formally complete this proof.

Corollary 7.2.3:

(n − 1)S² / σ² ∼ χ²_{n−1}.

Proof:
Recall the following facts:

Now consider
Corollary 7.2.4:

√n (X̄ − µ) / S ∼ t_{n−1}.

Proof:
Recall the following facts:

Therefore,

√n (X̄ − µ) / S = [ (X̄ − µ)/(σ/√n) ] / (S/σ) = [ (X̄ − µ)/(σ/√n) ] / √( S²(n−1) / (σ²(n−1)) ) = Z_1 / √( Y_{n−1}/(n−1) ) ∼ t_{n−1}.
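A quick Monte Carlo sanity check of Corollary 7.2.4 (our addition; µ = 5, σ = 2, n = 10 are arbitrary choices, and 2.262 is the t_9 quantile at level 0.975 quoted from standard tables):

```python
import math
import random
import statistics

rng = random.Random(2013)

def t_stat(n, mu, sigma):
    xs = [rng.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    s = statistics.stdev(xs)       # sample standard deviation, divisor n - 1
    return math.sqrt(n) * (xbar - mu) / s

n = 10
ts = [t_stat(n, mu=5.0, sigma=2.0) for _ in range(20000)]
frac = sum(1 for t in ts if abs(t) <= 2.262) / len(ts)
print(round(frac, 3))  # near 0.95 if sqrt(n)(Xbar - mu)/S ~ t_9
```

The simulated coverage agrees with the t_{n−1} distribution; note that it would not match a N(0, 1) reference for n this small (the N(0, 1) quantile 1.96 gives visibly less than 95% coverage).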
Corollary 7.2.5:
Let (X_1, ..., X_m) ∼ iid N(µ_1, σ_1²) and (Y_1, ..., Y_n) ∼ iid N(µ_2, σ_2²). Let X_i, Y_j be independent ∀i, j.
Then it holds:

( X̄ − Ȳ − (µ_1 − µ_2) ) / √( [(m−1)S_1²/σ_1²] + [(n−1)S_2²/σ_2²] ) · √( (m+n−2) / (σ_1²/m + σ_2²/n) ) ∼ t_{m+n−2}

In particular, if σ_1 = σ_2, then:

( X̄ − Ȳ − (µ_1 − µ_2) ) / √( (m−1)S_1² + (n−1)S_2² ) · √( mn(m+n−2) / (m+n) ) ∼ t_{m+n−2}

Proof:
Homework.
Corollary 7.2.6:
Let (X_1, ..., X_m) ∼ iid N(µ_1, σ_1²) and (Y_1, ..., Y_n) ∼ iid N(µ_2, σ_2²). Let X_i, Y_j be independent ∀i, j.
Then it holds:

(S_1²/σ_1²) / (S_2²/σ_2²) ∼ F_{m−1,n−1}

In particular, if σ_1 = σ_2, then:

S_1² / S_2² ∼ F_{m−1,n−1}

Proof:
Recall that, if Y_1 ∼ χ²_m and Y_2 ∼ χ²_n are independent, then

F = (Y_1/m) / (Y_2/n) ∼ F_{m,n}.

Now, C_1 = (m−1)S_1²/σ_1² ∼ χ²_{m−1} and C_2 = (n−1)S_2²/σ_2² ∼ χ²_{n−1}. Therefore,

( C_1/(m−1) ) / ( C_2/(n−1) ) = (S_1²/σ_1²) / (S_2²/σ_2²) ∼ F_{m−1,n−1}.

If σ_1 = σ_2, then

S_1² / S_2² ∼ F_{m−1,n−1}.
8 The Theory of Point Estimation
Let X be a rv defined on a probability space (Ω, L, P ). Suppose that the cdf F of X depends
on some set of parameters and that the functional form of F is known except for a finite
number of these parameters.
Definition 8.1.1:
The set of admissible values of θ is called the parameter space Θ. If Fθ is the cdf of X
when θ is the parameter, the set {Fθ : θ ∈ Θ} is the family of cdf ’s. Likewise, we speak of
the family of pdf ’s if X is continuous, and the family of pmf ’s if X is discrete.
Example 8.1.2:
X ∼ Bin(n, p), p unknown. Then θ = p and Θ = {p : 0 < p < 1}.
X ∼ N (µ, σ 2 ), (µ, σ 2 ) unknown. Then θ = (µ, σ 2 ) and Θ = {(µ, σ 2 ) : −∞ < µ < ∞, σ 2 > 0}.
Definition 8.1.3:
Let X be a sample from F_θ, θ ∈ Θ ⊆ IR. Let a statistic T(X) map IR^n to Θ. We call T(X) an estimator of θ, and T(x), for a realization x of X, a (point) estimate of θ. In practice, the term estimate is used for both.
Example 8.1.4:
Let X_1, ..., X_n be iid Bin(1, p), p unknown. Estimates of p include:

T_1(X) = X̄,  T_2(X) = X_1,  T_3(X) = 1/2,  T_4(X) = (X_1 + X_2)/3

Obviously, not all estimates are equally good.
8.2 Properties of Estimates

Definition 8.2.1:
Let {X_i}_{i=1}^∞ be a sequence of iid rv's with cdf F_θ, θ ∈ Θ. A sequence of point estimates T_n(X_1, ..., X_n) = T_n is called

• (weakly) consistent for θ if T_n →p θ as n → ∞ ∀θ ∈ Θ,
• strongly consistent for θ if T_n →a.s. θ as n → ∞ ∀θ ∈ Θ,
• consistent in the r-th mean for θ if T_n →r θ as n → ∞ ∀θ ∈ Θ.
Example 8.2.2:
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of iid Bin(1, p) rv's. Let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$. Since $E(X_i) = p$, it follows by the WLLN that $\bar{X}_n \stackrel{p}{\longrightarrow} p$, i.e., consistency, and by the SLLN that $\bar{X}_n \stackrel{a.s.}{\longrightarrow} p$, i.e., strong consistency.
However, a consistent estimate may not be unique. We may even have infinitely many consistent estimates, e.g.,
$$\frac{\sum_{i=1}^n X_i + a}{n + b} \stackrel{p}{\longrightarrow} p \quad \forall \text{ finite } a, b \in I\!R.$$
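A small simulation sketch of these consistency claims; the values of p, a, b, the seed, and the sample sizes are arbitrary illustrative choices.

```python
import random

# Sketch of the consistency claims: X_bar_n and (sum X_i + a)/(n + b)
# both approach p as n grows.
random.seed(2)
p, a, b = 0.3, 5.0, 7.0

def both_estimates(n):
    s = sum(1 for _ in range(n) if random.random() < p)  # iid Bin(1, p) successes
    return s / n, (s + a) / (n + b)

xbar_small, alt_small = both_estimates(100)
xbar_big, alt_big = both_estimates(200_000)
```

For large n the two estimates are nearly indistinguishable, since their difference is of order 1/n.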
Theorem 8.2.3:
If Tn is a sequence of estimates such that E(Tn ) → θ and V ar(Tn ) → 0 as n → ∞, then Tn is
consistent for θ.
Proof:
Definition 8.2.4:
Let G be a group of Borel–measurable functions of IRn onto itself which is closed under com-
position and inverse. A family of distributions {Pθ : θ ∈ Θ} is invariant under G if for
each g ∈ G and for all θ ∈ Θ, there exists a unique θ0 = g(θ) such that the distribution of
g(X) is Pθ0 whenever the distribution of X is Pθ . We call g the induced function on θ since
Pθ (g(X) ∈ A) = Pg(θ) (X ∈ A).
Example 8.2.5:
Let (X1 , . . . , Xn ) be iid N (µ, σ 2 ) with pdf
$$f(x_1, \ldots, x_n) = \frac{1}{(\sqrt{2\pi}\sigma)^n} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2\right).$$
So {f : −∞ < µ < ∞, σ 2 > 0} is invariant under this group G, with g(µ, σ 2 ) = (aµ+b, a2 σ 2 ),
where −∞ < aµ + b < ∞ and a2 σ 2 > 0.
Definition 8.2.6:
Let G be a group of transformations that leaves {Fθ : θ ∈ Θ} invariant. An estimate T is
invariant under G if
Definition 8.2.7:
An estimate T is:
Example 8.2.8:
Let Fθ ∼ N (µ, σ 2 ).
S 2 is location invariant.
Note:
Different sources make different use of the term invariant. Mood, Graybill & Boes (1974)
for example define location invariant as T (X1 + a, . . . , Xn + a) = T (X1 , . . . , Xn ) + a (page
332) and scale invariant as T (cX1 , . . . , cXn ) = cT (X1 , . . . , Xn ) (page 336). According to their
definition, X is location invariant and scale invariant.
8.3 Sufficient Statistics
Note:
(i) The sample X is always sufficient but this is not particularly interesting and usually is
excluded from further considerations.
(ii) Idea: Once we have “reduced” from X to T (X), we have captured all the information
in X about θ.
(iii) Usually, there are several sufficient statistics for a given family of distributions.
Example 8.3.2:
Let X = (X1 , . . . , Xn ) be iid Bin(1, p) rv’s. To estimate p, can we ignore the order and simply
count the number of “successes”?
Let $T(X) = \sum_{i=1}^n X_i$. It is
Example 8.3.3:
Let X = (X1, . . . , Xn) be iid Poisson(λ). Is $T = \sum_{i=1}^n X_i$ sufficient for λ? It is
Example 8.3.4:
Let X1 , X2 be iid Poisson(λ). Is T = X1 + 2X2 sufficient for λ? It is
Note:
Definition 8.3.1 can be difficult to check. In addition, it requires a candidate statistic. We
need something constructive that helps in finding sufficient statistics without having to check
Definition 8.3.1. The next Theorem helps in finding such statistics.
where h does not depend on θ and g does not depend on x1 , . . . , xn except as a function of T .
Proof:
Discrete case only.
“=⇒”:
Suppose T (X) is sufficient for θ. Let
“⇐=”:
Suppose the factorization holds. For fixed t0 , it is
Note:
(ii) If T is sufficient for θ, then also any 1–to–1 mapping of T is sufficient for θ. However,
this does not hold for arbitrary functions of T .
Example 8.3.6:
Let X1 , . . . , Xn be iid Bin(1, p). It is
Example 8.3.7:
Let X1 , . . . , Xn be iid Poisson(λ). It is
Example 8.3.8:
Let X1 , . . . , Xn be iid N (µ, σ 2 ) where µ ∈ IR and σ 2 > 0 are both unknown. It is
Example 8.3.9:
Let X1 , . . . , Xn be iid U (θ, θ + 1) where −∞ < θ < ∞. It is
Definition 8.3.10:
Let {fθ (x) : θ ∈ Θ} be a family of pdf’s (or pmf’s). We say the family is complete if
Eθ (g(X)) = 0 ∀θ ∈ Θ
implies that
Pθ (g(X) = 0) = 1 ∀θ ∈ Θ.
Example 8.3.11:
Let X1, . . . , Xn be iid Bin(1, p). We have seen in Example 8.3.6 that $T = \sum_{i=1}^n X_i$ is sufficient for p. Is it also complete?
Example 8.3.12:
Let X1, . . . , Xn be iid N(θ, θ²). We know from Example 8.3.8 that $T = \left(\sum_{i=1}^n X_i, \sum_{i=1}^n X_i^2\right)$ is sufficient for θ. Is it also complete?
Note:
Recall from Section 5.2 what it means if we say the family of distributions {fθ : θ ∈ Θ} is a
one–parameter (or k–parameter) exponential family.
Theorem 8.3.13:
Let {fθ : θ ∈ Θ} be a k–parameter exponential family. Let T1 , . . . , Tk be statistics. Then the
family of distributions of (T1 (X), . . . , Tk (X)) is also a k–parameter exponential family given
by
$$g_\theta(t) = \exp\left(\sum_{i=1}^k t_i Q_i(\theta) + D(\theta) + S^*(t)\right)$$
Theorem 8.3.14:
Let {fθ : θ ∈ Θ} be a k–parameter exponential family with k ≤ n and let T1 , . . . , Tk be
statistics as in Theorem 8.3.13. Suppose that the range of Q = (Q1 , . . . , Qk ) contains an open
set in IRk . Then T = (T1 (X), . . . , Tk (X)) is a complete sufficient statistic.
Proof:
Discrete case and k = 1 only.
We have to show that
$$\sum_t g(t) \exp(\theta t + D(\theta) + S^*(t)) = 0 \quad \forall\theta \qquad (A)$$
implies
$$g(t) = 0 \quad \forall t. \qquad (B)$$
Note that in (A) we make use of a result established in Theorem 8.3.13.
It is $g(t) = g^+(t) - g^-(t)$, where both functions, $g^+$ and $g^-$, are non-negative. Using $g^+$ and $g^-$, it turns out that (A) is equivalent to
$$\sum_t g^+(t) \exp(\theta t + S^*(t)) = \sum_t g^-(t) \exp(\theta t + S^*(t)) \quad \forall\theta,$$
where the term $\exp(D(\theta))$ in (A) drops out as a constant on both sides. Fixing some $\theta_0 \in (a, b)$ and normalizing both sides into pmf's $p^+(t)$ and $p^-(t)$ proportional to $g^{\pm}(t)\exp(\theta_0 t + S^*(t))$, this becomes a statement about moment generating functions:
$$\sum_t e^{\delta t} p^+(t) = M^+(\delta) = M^-(\delta) = \sum_t e^{\delta t} p^-(t) \quad \forall \delta \in (\underbrace{a - \theta_0}_{<0}, \underbrace{b - \theta_0}_{>0}).$$
By the uniqueness of mgf's,
$$\Longrightarrow g^+(t) = g^-(t) \;\; \forall t$$
=⇒ g(t) = 0 ∀t
=⇒ T is complete
Definition 8.3.15:
Let X = (X1 , . . . , Xn ) be a sample from {Fθ : θ ∈ Θ ⊆ IRk } and let T = T (X) be a sufficient
statistic for θ. T = T (X) is called a minimal sufficient statistic for θ if, for any other
sufficient statistic T 0 = T 0 (X), T (x) is a function of T 0 (x).
Note:
(i) A minimal sufficient statistic achieves the greatest possible data reduction for a sufficient
statistic.
(ii) If T is minimal sufficient for θ, then also any 1–to–1 mapping of T is minimal sufficient
for θ. However, this does not hold for arbitrary functions of T .
Definition 8.3.16:
Let X = (X1 , . . . , Xn ) be a sample from {Fθ : θ ∈ Θ ⊆ IRk }. A statistic T = T (X) is called
ancillary if its distribution does not depend on the parameter θ.
Example 8.3.17:
Let X1 , . . . , Xn be iid U (θ, θ + 1) where −∞ < θ < ∞. As shown in Example 8.3.9,
T = (X(1) , X(n) ) is sufficient for θ. Define
Rn = X(n) − X(1) .
Use the result from Stat 6710, Homework Assignment 5, Question (viii) (a) to obtain
$$f_{R_n}(r \mid \theta) = f_{R_n}(r) = n(n-1) r^{n-2} (1-r) \, I_{(0,1)}(r).$$
This means that $R_n \sim Beta(n-1, 2)$. Moreover, the distribution of $R_n$ does not depend on θ and, therefore, $R_n$ is ancillary.
Theorem 8.3.19:
Let X = (X1 , . . . , Xn ) be a sample from {Fθ : θ ∈ Θ ⊆ IRk }. If any minimal sufficient statis-
tic T = T (X) exists for θ, then any complete statistic is also a minimal sufficient statistic.
Note:
(i) Due to the last Theorem, Basu’s Theorem often only is stated in terms of a complete
sufficient statistic (which automatically is also a minimal sufficient statistic).
(ii) As already shown in Corollary 7.2.2, X and S 2 are independent when sampling from a
N (µ, σ 2 ) population. As outlined in Casella/Berger, page 289, we could also use Basu’s
Theorem to obtain the same result.
(iii) The converse of Basu’s Theorem is false, i.e., if T (X) is independent of any ancillary
statistic, it does not necessarily follow that T (X) is a complete, minimal sufficient statis-
tic.
(iv) As seen in Examples 8.3.8 and 8.3.12, $T = \left(\sum_{i=1}^n X_i, \sum_{i=1}^n X_i^2\right)$ is sufficient for θ but it is not complete when X1, . . . , Xn are iid N(θ, θ²). However, it can be shown that T is minimal sufficient. So, there may be distributions where a minimal sufficient statistic exists but a complete statistic does not exist.
(v) As with invariance, there exist several different definitions of ancillarity within the lit-
erature — the one defined in this chapter being the most commonly used.
8.4 Unbiased Estimation
Eθ (T ) = θ ∀θ ∈ Θ.
Any function d(θ) for which an unbiased estimate T exists is called an estimable function.
If T is biased,
b(θ, T ) = Eθ (T ) − θ
Example 8.4.2:
If the $k^{th}$ population moment exists, the $k^{th}$ sample moment is an unbiased estimate. If $Var(X) = \sigma^2$, the sample variance $S^2$ is an unbiased estimate of $\sigma^2$.
$$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1} = Gamma\left(\frac{n-1}{2}, 2\right)$$
$$\Longrightarrow E\left(\sqrt{\frac{(n-1)S^2}{\sigma^2}}\right) = \int_0^\infty \sqrt{x} \; \frac{x^{\frac{n-1}{2}-1} e^{-\frac{x}{2}}}{2^{\frac{n-1}{2}} \Gamma(\frac{n-1}{2})} \, dx = \frac{\sqrt{2}\,\Gamma(\frac{n}{2})}{\Gamma(\frac{n-1}{2})} \int_0^\infty \frac{x^{\frac{n}{2}-1} e^{-\frac{x}{2}}}{2^{\frac{n}{2}} \Gamma(\frac{n}{2})} \, dx \stackrel{(*)}{=} \frac{\sqrt{2}\,\Gamma(\frac{n}{2})}{\Gamma(\frac{n-1}{2})}$$
$$\Longrightarrow E(S) = \sigma \sqrt{\frac{2}{n-1}} \, \frac{\Gamma(\frac{n}{2})}{\Gamma(\frac{n-1}{2})}$$
(∗) holds since $\frac{x^{\frac{n}{2}-1} e^{-\frac{x}{2}}}{2^{\frac{n}{2}} \Gamma(\frac{n}{2})}$ is the pdf of a $Gamma(\frac{n}{2}, 2)$ distribution and thus the integral is 1.
So S is biased for σ and
$$b(\sigma, S) = \sigma \left( \sqrt{\frac{2}{n-1}} \, \frac{\Gamma(\frac{n}{2})}{\Gamma(\frac{n-1}{2})} - 1 \right).$$
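A numerical sketch of this bias result; n, σ, the seed, and the replication count are arbitrary illustrative choices.

```python
import math
import random
import statistics

# Numerical check of the bias of S:
# E(S) = sigma * sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2) < sigma.
random.seed(3)
n, sigma = 5, 2.0
c_n = math.sqrt(2 / (n - 1)) * math.gamma(n / 2) / math.gamma((n - 1) / 2)
theo_ES = sigma * c_n  # exact E(S) from the derivation above

sims = []
for _ in range(40000):
    xs = [random.gauss(0.0, sigma) for _ in range(n)]
    sims.append(statistics.stdev(xs))  # S for this sample
emp_ES = statistics.mean(sims)
```

For n = 5 the constant is about 0.94, so S systematically underestimates σ.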
Note:
If T is unbiased for θ, g(T ) is not necessarily unbiased for g(θ) (unless g is a linear function).
Example 8.4.3:
Unbiased estimates may not exist (see Rohatgi, page 351, Example 2) or they may be absurd as in the following case:
Let X ∼ Poisson(λ) and let $d(\lambda) = e^{-2\lambda}$. Consider $T(X) = (-1)^X$ as an estimate. It is
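A simulation sketch of why this estimate is unbiased for $e^{-2\lambda}$ yet absurd: T only ever takes the values +1 and −1, while $e^{-2\lambda} \in (0,1)$. The value of λ, the seed, the replication count, and the helper sampler are illustrative choices.

```python
import math
import random

# T(X) = (-1)^X is unbiased for e^{-2 lambda} when X ~ Poisson(lambda),
# but it only takes the values +1 and -1.
random.seed(4)
lam = 1.5

def rpois(lam):
    # inverse-cdf sampling of a Poisson rv (adequate for small lambda)
    u, k = random.random(), 0
    p = cdf = math.exp(-lam)
    while u > cdf and k < 200:
        k += 1
        p *= lam / k
        cdf += p
    return k

vals = [(-1) ** rpois(lam) for _ in range(200000)]
emp = sum(vals) / len(vals)
target = math.exp(-2 * lam)  # the estimand d(lambda)
```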
Note:
If there exist 2 unbiased estimates T1 and T2 of θ, then any estimate of the form αT1 +(1−α)T2
for 0 < α < 1 will also be an unbiased estimate of θ. Which one should we choose?
Definition 8.4.4:
The mean square error of an estimate T of θ is defined as
Let $\{T_i\}_{i=1}^{\infty}$ be a sequence of estimates of θ. If
$$\lim_{i \to \infty} MSE(\theta, T_i) = 0 \quad \forall \theta \in \Theta,$$
Note:
(i) If we allow all estimates and compare their MSE, generally it will depend on θ which
estimate is better. For example θ̂ = 17 is perfect if θ = 17, but it is lousy otherwise.
(ii) If we restrict ourselves to the class of unbiased estimates, then M SE(θ, T ) = V arθ (T ).
(iii) MSE–consistency means that both the bias and the variance of Ti approach 0 as i → ∞.
Definition 8.4.5:
Let θ0 ∈ Θ and let U (θ0 ) be the class of all unbiased estimates T of θ0 such that Eθ0 (T 2 ) < ∞.
Then T0 ∈ U (θ0 ) is called a locally minimum variance unbiased estimate (LMVUE)
at θ0 if
Eθ0 ((T0 − θ0 )2 ) ≤ Eθ0 ((T − θ0 )2 ) ∀T ∈ U (θ0 ).
Definition 8.4.6:
Let U be the class of all unbiased estimates T of θ ∈ Θ such that Eθ (T 2 ) < ∞ ∀θ ∈ Θ. Then
T0 ∈ U is called a uniformly minimum variance unbiased estimate (UMVUE) of θ if
An Excursion into Logic II
In our first “Excursion into Logic” in Stat 6710 Mathematical Statistics I, we have established
the following results:
A ⇒ B is equivalent to ¬B ⇒ ¬A is equivalent to ¬A ∨ B:
A B A⇒B ¬A ¬B ¬B ⇒ ¬A ¬A ∨ B
1 1 1 0 0 1 1
1 0 0 0 1 0 0
0 1 1 1 0 1 1
0 0 1 1 1 1 1
When dealing with formal proofs, there exists one more technique to show A ⇒ B. Equiva-
lently, we can show (A ∧ ¬B) ⇒ 0, a technique called Proof by Contradiction. This means,
assuming that A and ¬B hold, we show that this implies 0, i.e., something that is always
false, i.e., a contradiction. And here is the corresponding truth table:
A B A⇒B ¬B A ∧ ¬B (A ∧ ¬B) ⇒ 0
1 1
1 0
0 1
0 0
Note:
We make use of this proof technique in the Proof of the next Theorem.
Example:
Let A : x = 5 and B : x2 = 25. Obviously A ⇒ B.
A: x = 5 and ¬B: x² ≠ 25
=⇒ x² = 25 ∧ x² ≠ 25
Theorem 8.4.7:
Let U be the class of all unbiased estimates T of θ ∈ Θ with Eθ (T 2 ) < ∞ ∀θ, and suppose
that U is non–empty. Let U0 be the set of all unbiased estimates of 0, i.e.,
Eθ (νT0 ) = 0 ∀θ ∈ Θ ∀ν ∈ U0 .
Proof:
Note that Eθ (νT0 ) always exists.
Theorem 8.4.8:
Let U be the non–empty class of unbiased estimates of θ ∈ Θ as defined in Theorem 8.4.7.
Then there exists at most one UMVUE T ∈ U for θ.
Proof:
Suppose T0 , T1 ∈ U are both UMVUE.
$\Longrightarrow E_\theta(T_0^2) = E_\theta(T_0 T_1)$
$\Longrightarrow Cov_\theta(T_0, T_1) = Var_\theta(T_0) = Var_\theta(T_1)$ $\forall\theta \in \Theta$
$\Longrightarrow \rho_{T_0 T_1} = 1$ $\forall\theta \in \Theta$
$\Longrightarrow \theta = E_\theta(T_0) = E_\theta(-\tfrac{a}{b} T_1) = E_\theta(T_1)$ $\forall\theta \in \Theta$
$\Longrightarrow -\tfrac{a}{b} = 1$
$\Longrightarrow P_\theta(T_0 = T_1) = 1$ $\forall\theta \in \Theta$
Theorem 8.4.9:
(i) If a UMVUE T exists for a real function d(θ), then λT is the UMVUE for λd(θ), λ ∈ IR.
(ii) If UMVUE’s T1 and T2 exist for real functions d1 (θ) and d2 (θ), respectively, then T1 +T2
is the UMVUE for d1 (θ) + d2 (θ).
Proof:
Homework.
Theorem 8.4.10:
If a sample consists of n independent observations X1 , . . . , Xn from the same distribution, the
UMVUE, if it exists, is permutation invariant.
Proof:
Homework.
Theorem 8.4.12: Lehmann–Scheffé
If T is a complete sufficient statistic and if there exists an unbiased estimate h of θ, then
E(h | T ) is the (unique) UMVUE.
Proof:
Note:
We can use Theorem 8.4.12 to find the UMVUE in two ways if we have a complete sufficient
statistic T :
(i) If we can find an unbiased estimate h(T ), it will be the UMVUE since E(h(T ) | T ) =
h(T ).
(ii) If we have any unbiased estimate h and if we can calculate E(h | T ), then E(h | T )
will be the UMVUE. The process of determining the UMVUE this way often is called
Rao–Blackwellization.
(iii) Even if a complete sufficient statistic does not exist, the UMVUE may still exist (see
Rohatgi, page 357–358, Example 10).
Example 8.4.13:
Let X1, . . . , Xn be iid Bin(1, p). Then $T = \sum_{i=1}^n X_i$ is a complete sufficient statistic as seen in Examples 8.3.6 and 8.3.11.
Since E(X1 ) = p, X1 is an unbiased estimate of p. However, due to part (i) of the Note above,
since X1 is not a function of T , X1 is not the UMVUE.
We can use part (ii) of the Note above to construct the UMVUE. It is
If we are interested in the UMVUE for d(p) = p(1 − p) = p − p2 = V ar(X), we can find it in
the following way:
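For reference, this construction leads to the well-known closed forms $T/n$ as UMVUE of p and $T(n-T)/(n(n-1))$ as UMVUE of $p(1-p)$; an exact unbiasedness check of both (n and the grid of p values below are arbitrary choices):

```python
import math

# Exact unbiasedness check: with T ~ Bin(n, p), E(T/n) = p and
# E(T(n - T)/(n(n - 1))) = p(1 - p).
n = 6

def binom_pmf(t, n, p):
    return math.comb(n, t) * p**t * (1 - p)**(n - t)

def expect(h, n, p):
    # E_p[h(T)] by enumeration over the Bin(n, p) pmf
    return sum(h(t) * binom_pmf(t, n, p) for t in range(n + 1))

for p in (0.1, 0.37, 0.5, 0.9):
    assert abs(expect(lambda t: t / n, n, p) - p) < 1e-12
    assert abs(expect(lambda t: t * (n - t) / (n * (n - 1)), n, p) - p * (1 - p)) < 1e-12
```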
8.5 Lower Bounds for the Variance of an Estimate
Theorem 8.5.1: Cramér–Rao Inequality
Let ψ(θ) be defined on Θ and let it be differentiable for all θ ∈ Θ. Let T be an unbiased estimate of ψ(θ) such that $E_\theta(T^2) < \infty$ $\forall\theta \in \Theta$. Suppose that
(i) $\frac{\partial f_\theta(x)}{\partial\theta}$ is defined for all θ ∈ Θ,
$$(\psi'(\theta))^2 \leq E_\theta\left((T(X) - \chi(\theta))^2\right) \; E_\theta\left(\left(\frac{\partial \log f_\theta(X)}{\partial\theta}\right)^2\right) \quad \forall\theta \in \Theta \qquad (A).$$
Further, for any θ0 ∈ Θ, either ψ′(θ0) = 0 and equality holds in (A) for θ = θ0, or we have
$$E_{\theta_0}\left((T(X) - \chi(\theta_0))^2\right) \geq \frac{(\psi'(\theta_0))^2}{E_{\theta_0}\left(\left(\frac{\partial \log f_\theta(X)}{\partial\theta}\right)^2\right)} \qquad (B).$$
Finally, if equality holds in (B), then there exists a real number K(θ0) ≠ 0 such that
$$T(X) - \chi(\theta_0) = K(\theta_0) \left.\frac{\partial \log f_\theta(X)}{\partial\theta}\right|_{\theta=\theta_0} \qquad (C)$$
Note:
(i) Conditions (i), (ii), and (iii) are called regularity conditions. Conditions under which
they hold can be found in Rohatgi, page 11–13, Parts 12 and 13.
(ii) The right hand side of inequality (B) is called Cramér–Rao Lower Bound of θ0 , or, in
symbols CRLB(θ0 ).
(iii) The expression $E_\theta\left(\left(\frac{\partial \log f_\theta(X)}{\partial\theta}\right)^2\right)$ is called the Fisher Information in X.
Proof:
From (ii), we get
$$E_\theta\left(\frac{\partial}{\partial\theta} \log f_\theta(X)\right) = \int \frac{\partial}{\partial\theta} \log f_\theta(x) \, f_\theta(x) \, dx = \int \frac{\partial f_\theta(x)}{\partial\theta} \frac{1}{f_\theta(x)} f_\theta(x) \, dx = \int \frac{\partial f_\theta(x)}{\partial\theta} \, dx = 0$$
$$\Longrightarrow E_\theta\left(\chi(\theta) \frac{\partial}{\partial\theta} \log f_\theta(X)\right) = 0$$
From (iii), we get
$$E_\theta\left(T(X) \frac{\partial}{\partial\theta} \log f_\theta(X)\right) = \int T(x) \frac{\partial}{\partial\theta} \log f_\theta(x) \, f_\theta(x) \, dx = \int T(x) \frac{\partial f_\theta(x)}{\partial\theta} \, dx \stackrel{(iii)}{=} \frac{\partial}{\partial\theta} \int T(x) f_\theta(x) \, dx = \frac{\partial}{\partial\theta} E(T(X)) = \psi'(\theta)$$
$$\Longrightarrow E_\theta\left((T(X) - \chi(\theta)) \frac{\partial}{\partial\theta} \log f_\theta(X)\right) = \psi'(\theta) \qquad (+)$$
$$\Longrightarrow (\psi'(\theta))^2 = \left(E_\theta\left((T(X) - \chi(\theta)) \frac{\partial}{\partial\theta} \log f_\theta(X)\right)\right)^2 \stackrel{(*)}{\leq} E_\theta\left((T(X) - \chi(\theta))^2\right) E_\theta\left(\left(\frac{\partial}{\partial\theta} \log f_\theta(X)\right)^2\right),$$
i.e., (A) holds. (∗) follows from the Cauchy–Schwarz–Inequality (Theorem 4.5.7 (ii)).
If ψ′(θ0) ≠ 0, then the left–hand side of (A) is > 0. Therefore, the right–hand side is > 0. Thus,
$$E_{\theta_0}\left(\left(\frac{\partial}{\partial\theta} \log f_\theta(X)\right)^2\right) > 0,$$
and (B) follows directly from (A).
Finally, if equality holds in (B), then ψ′(θ0) ≠ 0 (because T is not constant). Thus, MSE(χ(θ0), T(X)) > 0. The Cauchy–Schwarz–Inequality (Theorem 4.5.7 (iii)) gives equality iff there exist constants (α, β) ∈ IR² − {(0, 0)} such that
$$P\left(\alpha(T(X) - \chi(\theta_0)) + \beta \left.\frac{\partial}{\partial\theta} \log f_\theta(X)\right|_{\theta=\theta_0} = 0\right) = 1.$$
This implies $K(\theta_0) = -\frac{\beta}{\alpha}$ and (C) holds. Since T is not a constant, it also holds that K(θ0) ≠ 0.
Example 8.5.2:
If we take χ(θ) = ψ(θ), we get from (B)
$$Var_\theta(T(X)) \geq \frac{(\psi'(\theta))^2}{E_\theta\left(\left(\frac{\partial \log f_\theta(X)}{\partial\theta}\right)^2\right)} \qquad (*).$$
Finally, if X = (X1, . . . , Xn) iid with identical fθ(x), the inequality (∗) reduces to
$$Var_\theta(T(X)) \geq \frac{(\psi'(\theta))^2}{n \, E_\theta\left(\left(\frac{\partial \log f_\theta(X_1)}{\partial\theta}\right)^2\right)}.$$
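A sketch of the iid form of the bound for Bin(1, p): the per-observation Fisher Information is $1/(p(1-p))$, so with ψ(p) = p the bound is $p(1-p)/n$, which $Var(\bar{X})$ attains. The check below approximates the score by a numerical derivative; p, n, and the step size are arbitrary choices.

```python
import math

# For Bin(1, p): Fisher Information per observation is 1/(p(1-p)), so with
# psi(p) = p the bound is p(1-p)/n, and Var(X_bar) = p(1-p)/n attains it.
p, n, h = 0.3, 25, 1e-6

def log_f(p, x):
    # log of the Bin(1, p) pmf
    return x * math.log(p) + (1 - x) * math.log(1 - p)

def score(p, x):
    # numerical derivative of log f_p(x) with respect to p
    return (log_f(p + h, x) - log_f(p - h, x)) / (2 * h)

# Fisher Information per observation: E[(score)^2]
info = score(p, 1) ** 2 * p + score(p, 0) ** 2 * (1 - p)
crlb = 1 / (n * info)       # lower bound for unbiased estimates of p
var_xbar = p * (1 - p) / n  # variance of X_bar, which attains the bound
```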
Example 8.5.3:
Let X1 , . . . , Xn be iid Bin(1, p). Let X ∼ Bin(n, p), p ∈ Θ = (0, 1) ⊂ IR. Let
$$\psi(p) = E(T(X)) = \sum_{x=0}^n T(x) \binom{n}{x} p^x (1-p)^{n-x}.$$
ψ(p) is differentiable with respect to p under the summation sign since it is a finite polynomial
in p.
Example 8.5.4:
Let X ∼ U (0, θ), θ ∈ Θ = (0, ∞) ⊂ IR.
Theorem 8.5.5: Chapman, Robbins, Kiefer Inequality (CRK Inequality)
Let Θ ⊆ IR. Let {fθ : θ ∈ Θ} be a family of pdf’s or pmf’s. Let ψ(θ) be defined on Θ. Let
T be an unbiased estimate of ψ(θ) such that Eθ (T 2 ) < ∞ ∀θ ∈ Θ.
If θ 6= ϑ, θ and ϑ ∈ Θ, assume that fθ (x) and fϑ (x) are different. Also assume that there
exists such a ϑ ∈ Θ such that θ 6= ϑ and
Proof:
Since T is unbiased, it follows
Eϑ (T (X)) = ψ(ϑ) ∀ϑ ∈ Θ.
Finally, we take the supremum of the right–hand side with respect to {ϑ : S(ϑ) ⊂ S(θ),
ϑ 6= θ}, which completes the proof.
Note:
(i) The CRK inequality holds without the previous regularity conditions.
(iii) The CRK inequality works for discrete Θ, the CRLB does not work in such cases.
Example 8.5.6:
Let X ∼ U(0, θ), θ > 0. The required conditions for the CRLB are not met. Recall from Example 8.5.4 that $\frac{n+1}{n} X_{(n)}$ is UMVUE with $Var\left(\frac{n+1}{n} X_{(n)}\right) = \frac{\theta^2}{n(n+2)} < \frac{\theta^2}{n} = CRLB$.
Definition 8.5.7:
Let T1 , T2 be unbiased estimates of θ with Eθ (T12 ) < ∞ and Eθ (T22 ) < ∞ ∀θ ∈ Θ. We define
the efficiency of T1 relative to T2 by
$$eff_\theta(T_1, T_2) = \frac{Var_\theta(T_1)}{Var_\theta(T_2)}$$
Definition 8.5.8:
Assume the regularity conditions of Theorem 8.5.1 are satisfied by a family of cdf’s {Fθ : θ ∈
Θ}. An unbiased estimate T for θ is most efficient for {Fθ } if
$$Var_\theta(T) = \left(E_\theta\left(\left(\frac{\partial \log f_\theta(X)}{\partial\theta}\right)^2\right)\right)^{-1}$$
Definition 8.5.9:
Let T be the most efficient estimate for the family of cdf’s {Fθ : θ ∈ Θ}, Θ ⊆ IR. Then the
efficiency of any unbiased T1 of θ is defined as
$$eff_\theta(T_1) = eff_\theta(T_1, T) = \frac{Var_\theta(T_1)}{Var_\theta(T)}.$$
Definition 8.5.10:
T1 is asymptotically (most) efficient if T1 is asymptotically unbiased, i.e., $\lim_{n\to\infty} E_\theta(T_1) = \theta$, and $\lim_{n\to\infty} eff_\theta(T_1) = 1$, where n is the sample size.
Theorem 8.5.11:
A necessary and sufficient condition for an estimate T of θ to be most efficient is that T is
sufficient and
$$\frac{1}{K(\theta)}(T(x) - \theta) = \frac{\partial \log f_\theta(x)}{\partial\theta} \quad \forall\theta \in \Theta \qquad (*),$$
where K(θ) is defined as in Theorem 8.5.1 and the regularity conditions for Theorem 8.5.1
hold.
Proof:
“=⇒:”
Theorem 8.5.1 says that if T is most efficient, then (∗) holds.
Therefore,
fθ0 (x) = exp(T (x)C(θ0 ) − ψ(θ0 ) + λ(x))
“⇐=:”
From (∗), we get
$$E_\theta\left(\left(\frac{\partial \log f_\theta(X)}{\partial\theta}\right)^2\right) = \frac{1}{(K(\theta))^2} Var_\theta(T(X)).$$
Additionally, it holds
$$E_\theta\left((T(X) - \theta) \frac{\partial \log f_\theta(X)}{\partial\theta}\right) = 1$$
as shown in the Proof of Theorem 8.5.1 (let χ(θ) = θ in (+)), i.e.,
$$K(\theta) = \left(E_\theta\left(\left(\frac{\partial \log f_\theta(X)}{\partial\theta}\right)^2\right)\right)^{-1}.$$
Therefore,
$$Var_\theta(T(X)) = \left(E_\theta\left(\left(\frac{\partial \log f_\theta(X)}{\partial\theta}\right)^2\right)\right)^{-1},$$
i.e., T is most efficient for θ.
Note:
Instead of saying “a necessary and sufficient condition for an estimate T of θ to be most
efficient ...” in the previous Theorem, we could say that “an estimate T of θ is most efficient
iff ...”, i.e., “necessary and sufficient” means the same as “iff”.
A is necessary for B means: B ⇒ A (because ¬A ⇒ ¬B)
A is sufficient for B means: A ⇒ B
8.6 The Method of Moments
θ = h(m1 , . . . , mk ),
Note:
(i) The Definition above can also be used to estimate joint moments. For example, we use
n
X
1
n Xi Yi to estimate E(XY ).
i=1
(ii) Since $E\left(\frac{1}{n}\sum_{i=1}^n X_i^j\right) = m_j$, method of moments estimates are unbiased for the population moments. The WLLN and the CLT say that these estimates are consistent and asymptotically Normal as well.
(iii) If θ is not a linear function of the population moments, θ̂mom will, in general, not be
unbiased. However, it will be consistent and (usually) asymptotically Normal.
(iv) Method of moments estimates do not exist if the related moments do not exist.
(v) Method of moments estimates may not be unique. If there exist multiple choices for the
mom, one usually takes the estimate involving the lowest–order sample moment.
(vi) Alternative method of moment estimates can be obtained from central moments (rather
than from raw moments) or by using moments other than the first k moments.
Example 8.6.2:
Let X1 , . . . , Xn be iid N (µ, σ 2 ).
Since µ = m1 , it is µ̂mom = X.
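A simulation sketch of the method of moments for this example; the second-moment equation gives the standard estimate $\hat{\sigma}^2_{mom} = m_2 - m_1^2$. The values of µ, σ, n, and the seed below are arbitrary illustrative choices.

```python
import random

# Method of moments for N(mu, sigma^2): mu_hat = m_1, and the
# second-moment equation gives sigma2_hat = m_2 - m_1^2.
random.seed(8)
mu, sigma, n = 1.5, 2.0, 200000

xs = [random.gauss(mu, sigma) for _ in range(n)]
m1 = sum(xs) / n                  # first sample moment
m2 = sum(x * x for x in xs) / n   # second sample moment
mu_mom = m1
sigma2_mom = m2 - m1 * m1
```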
Example 8.6.3:
Let X1 , . . . , Xn be iid Poisson(λ).
8.7 Maximum Likelihood Estimation
Note:
Definition 8.7.2:
A maximum likelihood estimate (MLE) is a non–constant estimate θ̂M L such that
Note:
It is often convenient to work with log L when determining the maximum likelihood estimate.
Since the log is monotone, the maximum is the same.
Example 8.7.3:
Let X1 , . . . , Xn be iid N (µ, σ 2 ), where µ and σ 2 are unknown.
$$L(\mu, \sigma^2; x_1, \ldots, x_n) = \frac{1}{\sigma^n (2\pi)^{n/2}} \exp\left(-\sum_{i=1}^n \frac{(x_i - \mu)^2}{2\sigma^2}\right)$$
Formally, we still have to verify that we found a maximum (and not a minimum) and that there is no parameter θ at the edge of the parameter space Θ at which the likelihood function takes its absolute maximum, which would not be detectable by our approach for local extrema.
Example 8.7.4:
Let X1, . . . , Xn be iid U(θ − 1/2, θ + 1/2).
Example 8.7.5:
Let X ∼ Bin(1, p), p ∈ [1/4, 3/4].
$$L(p; x) = p^x (1-p)^{1-x} = \begin{cases} p, & \text{if } x = 1 \\ 1 - p, & \text{if } x = 0 \end{cases}$$
Theorem 8.7.6:
Let T be a sufficient statistic for fθ (x), θ ∈ Θ. If a unique MLE of θ exists, it is a function
of T .
Proof:
Since T is sufficient, we can write
due to the Factorization Criterion (Theorem 8.3.5). Maximizing the likelihood function with
respect to θ takes h(x) as a constant and therefore is equivalent to maximizing gθ (x) with
respect to θ. But gθ (x) involves x only through T .
Note:
(v) Often (but not always), the MLE will be a sufficient statistic itself.
Theorem 8.7.7:
Suppose the regularity conditions of Theorem 8.5.1 hold and θ belongs to an open interval in
IR. If an estimate θ̂ of θ attains the CRLB, it is the unique MLE.
Proof:
If θ̂ attains the CRLB, it follows by Theorem 8.5.1 that
$$\frac{\partial \log f_\theta(X)}{\partial\theta} = \frac{1}{K(\theta)}(\hat{\theta}(X) - \theta) \quad \text{w.p. } 1.$$
Writing $A(\theta) = \frac{1}{K(\theta)}$ and differentiating with respect to θ gives
$$\frac{\partial^2 \log f_\theta(X)}{\partial\theta^2} = A'(\theta)(\hat{\theta}(X) - \theta) - A(\theta).$$
The Proof of Theorem 8.5.11 gives us
$$A(\theta) = E_\theta\left(\left(\frac{\partial \log f_\theta(X)}{\partial\theta}\right)^2\right) > 0.$$
So
$$\left.\frac{\partial^2 \log f_\theta(X)}{\partial\theta^2}\right|_{\theta=\hat{\theta}} = -A(\hat{\theta}) < 0,$$
Note:
The previous Theorem does not imply that every MLE is most efficient.
Theorem 8.7.8:
Let {fθ : θ ∈ Θ} be a family of pdf’s (or pmf’s) with Θ ⊆ IRk , k ≥ 1. Let h : Θ → ∆ be a
mapping of Θ onto ∆ ⊆ IRp , 1 ≤ p ≤ k. If θ̂ is an MLE of θ, then h(θ̂) is an MLE of h(θ).
Proof:
For each δ ∈ ∆, we define
Θδ = {θ : θ ∈ Θ, h(θ) = δ}
and
M (δ; x) = sup L(θ; x),
θ∈Θδ
but also
$$M(\hat{\delta}; x) \leq \sup_{\delta \in \Delta} M(\delta; x) = \sup_{\delta \in \Delta} \left(\sup_{\theta \in \Theta_\delta} L(\theta; x)\right) = \sup_{\theta \in \Theta} L(\theta; x) = L(\hat{\theta}; x).$$
Therefore,
$$M(\hat{\delta}; x) = L(\hat{\theta}; x) = \sup_{\delta \in \Delta} M(\delta; x).$$
Example 8.7.9:
Let X1 , . . . , Xn be iid Bin(1, p). Let h(p) = p(1 − p).
Theorem 8.7.10:
Consider the following conditions a pdf fθ can fulfill:
(iii) $-\infty < \displaystyle\int_{-\infty}^{\infty} \frac{\partial^2 \log f_\theta(x)}{\partial\theta^2} f_\theta(x) \, dx < 0$ $\forall\theta \in \Theta$.
(iv) There exists a function H(x) such that for all θ ∈ Θ:
$$\left|\frac{\partial^3 \log f_\theta(x)}{\partial\theta^3}\right| < H(x) \quad \text{and} \quad \int_{-\infty}^{\infty} H(x) f_\theta(x) \, dx = M(\theta) < \infty.$$
(v) There exists a function g(θ) that is positive and twice differentiable for every θ ∈ Θ and there exists a function H(x) such that for all θ ∈ Θ:
$$\left|\frac{\partial^2}{\partial\theta^2}\left(g(\theta) \frac{\partial \log f_\theta(x)}{\partial\theta}\right)\right| < H(x) \quad \text{and} \quad \int_{-\infty}^{\infty} H(x) f_\theta(x) \, dx = M(\theta) < \infty.$$
In case that multiple of these conditions are fulfilled, we can make the following statements:
(i) (Cramér) Conditions (i), (iii), and (iv) imply that, with probability approaching 1, as
n → ∞, the likelihood equation has a consistent solution.
(ii) (Cramér) Conditions (i), (ii), (iii), and (iv) imply that a consistent solution θ̂n of the
likelihood equation is asymptotically Normal, i.e.,
$$\frac{\sqrt{n}}{\sigma}(\hat{\theta}_n - \theta) \stackrel{d}{\longrightarrow} Z$$
where Z ∼ N(0, 1) and $\sigma^2 = \left(E_\theta\left(\left(\frac{\partial \log f_\theta(X)}{\partial\theta}\right)^2\right)\right)^{-1}$.
(iii) (Kulldorf) Conditions (i), (iii), and (v) imply that, with probability approaching 1, as
n → ∞, the likelihood equation has a consistent solution.
(iv) (Kulldorf) Conditions (i), (ii), (iii), and (v) imply that a consistent solution θ̂n of the
likelihood equation is asymptotically Normal.
Note:
In case of a pmf fθ , we can define similar conditions as in Theorem 8.7.10.
8.8 Decision Theory — Bayes and Minimax Estimation
A = Θ (Estimation)
Definition 8.8.1:
A decision function d is a statistic, i.e., a Borel–measurable function, that maps IRn into
A. If X = x is observed, the statistician takes action d(x) ∈ A.
Note:
For the remainder of this Section, we are restricting ourselves to A = Θ, i.e., we are facing
the problem of estimation.
Definition 8.8.2:
A non–negative function L that maps Θ × A into IR is called a loss function. The value
L(θ, a) is the loss incurred to the statistician if he/she takes action a when θ is the true pa-
rameter value.
Definition 8.8.3:
Let D be a class of decision functions that map IRn into A. Let L be a loss function on Θ × A.
The function R that maps Θ × D into IR is defined as
Example 8.8.4:
Let A = Θ ⊆ IR. Let L(θ, a) = (θ − a)2 . Then it holds that
Note that this is just the MSE. If θ̂ is unbiased, this would just be V ar(θ̂).
Note:
The basic problem of decision theory is that we would like to find a decision function d ∈ D
such that R(θ, d) is minimized for all θ ∈ Θ. Unfortunately, this is usually not possible.
Definition 8.8.5:
The minimax principle is to choose the decision function d∗ ∈ D such that
Note:
If the problem of interest is an estimation problem, we call a d∗ that satisfies the condition in Definition 8.8.5 a minimax estimate of θ.
Example 8.8.6:
Let X ∼ Bin(1, p), p ∈ Θ = { 14 , 34 } = A.
p       a       L(p, a)
1/4     1/4     0
1/4     3/4     2
3/4     1/4     5
3/4     3/4     0
L(3/4, d1(1)) = L(3/4, 1/4) =
L(1/4, d2(0)) = L(1/4, 1/4) =
L(1/4, d2(1)) = L(1/4, 3/4) =
L(3/4, d2(0)) = L(3/4, 1/4) =
L(3/4, d2(1)) = L(3/4, 3/4) =
L(1/4, d3(0)) = L(1/4, 3/4) =
L(1/4, d3(1)) = L(1/4, 1/4) =
L(3/4, d3(0)) = L(3/4, 3/4) =
L(3/4, d3(1)) = L(3/4, 1/4) =
L(1/4, d4(0)) = L(1/4, 3/4) =
L(1/4, d4(1)) = L(1/4, 3/4) =
L(3/4, d4(0)) = L(3/4, 3/4) =
L(3/4, d4(1)) = L(3/4, 3/4) =
Then, the risk function values are:

i       p = 1/4: R(1/4, di)     p = 3/4: R(3/4, di)     max_{p ∈ {1/4, 3/4}} R(p, di)
1
2
3
4

Hence,
$$\min_{i \in \{1, 2, 3, 4\}} \; \max_{p \in \{1/4, 3/4\}} R(p, d_i) = .$$
Note:
Minimax estimation does not require any unusual assumptions. However, it tends to be very conservative.
Definition 8.8.7:
Suppose we consider θ to be a rv with pdf π(θ) on Θ. We call π the a priori distribution
(or prior distribution).
Note:
f (x | θ) is the conditional density of x given a fixed θ. The joint density of x and θ is
and the a posteriori distribution (or posterior distribution), which gives the distribution
of θ after sampling, has pdf (or pmf)
f (x, θ)
h(θ | x) = .
g(x)
Definition 8.8.8:
The Bayes risk of a decision function d is defined as
Note:
If θ is a continuous rv and X is of continuous type, then
Definition 8.8.9:
A decision function d∗ is called a Bayes rule if d∗ minimizes the Bayes risk, i.e., if
Theorem 8.8.10:
Let A = Θ ⊆ IR. Let L(θ, d(x)) = (θ − d(x))2 . In this case, a Bayes rule is
Proof:
Minimizing
$$R(\pi, d) = \int g(x) \int (\theta - d(x))^2 h(\theta \mid x) \, d\theta \, dx,$$
where g is the marginal pdf of X and h is the conditional pdf of θ given x, is the same as minimizing
$$\int (\theta - d(x))^2 h(\theta \mid x) \, d\theta.$$
However, this is minimized when d(x) = E(θ | X = x) as shown in Stat 6710, Homework 3,
Question (ii), for the unconditional case.
Note:
Under the conditions of Theorem 8.8.10, d(x) = E(θ | X = x) is called the Bayes estimate.
Example 8.8.11:
Let X ∼ Bin(n, p). Let L(p, d(x)) = (p − d(x))2 .
Let π(p) = 1 ∀p ∈ (0, 1), i.e., π ∼ U (0, 1), be the a priori distribution of p.
Then it holds:
$$f(x, p) = \binom{n}{x} p^x (1-p)^{n-x}$$
$$g(x) = \int f(x, p) \, dp = \int_0^1 \binom{n}{x} p^x (1-p)^{n-x} \, dp$$
$$h(p \mid x) = \frac{f(x, p)}{g(x)} = \frac{\binom{n}{x} p^x (1-p)^{n-x}}{\int_0^1 \binom{n}{x} p^x (1-p)^{n-x} \, dp} = \frac{p^x (1-p)^{n-x}}{\int_0^1 p^x (1-p)^{n-x} \, dp}$$
$$E(p \mid x) =$$
$$\hat{p}_{Bayes} =$$
$$= \frac{1}{(n+2)^2} \int_0^1 (1 - 4p + np - np^2 + 4p^2) \, dp$$
$$= \frac{1}{(n+2)^2} \int_0^1 (1 + (n-4)p + (4-n)p^2) \, dp$$
$$= \frac{1}{(n+2)^2} \left[ p + \frac{n-4}{2} p^2 + \frac{4-n}{3} p^3 \right]_0^1$$
$$= \frac{1}{(n+2)^2} \left( 1 + \frac{n-4}{2} + \frac{4-n}{3} \right)$$
$$= \frac{1}{(n+2)^2} \cdot \frac{6 + 3n - 12 + 8 - 2n}{6}$$
$$= \frac{1}{(n+2)^2} \cdot \frac{n+2}{6}$$
$$= \frac{1}{6(n+2)}$$
Now we compare the Bayes rule d∗(X) with the MLE $\hat{p}_{ML} = \frac{X}{n}$. This estimate has Bayes risk
$$R\left(\pi, \frac{X}{n}\right) =$$
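For a numerical cross-check: the Bayes rule here is the posterior mean $d^*(x) = \frac{x+1}{n+2}$ (the posterior is Beta(x + 1, n − x + 1)), its Bayes risk is the value $\frac{1}{6(n+2)}$ derived above, and the Bayes risk of X/n works out to $\frac{1}{6n}$. The value of n and the integration grid below are arbitrary choices.

```python
import math

# Bayes-risk comparison under the U(0, 1) prior: the Bayes rule
# d*(x) = (x+1)/(n+2) has Bayes risk 1/(6(n+2)); the MLE X/n has 1/(6n).
n = 10
N = 20000  # midpoint-rule grid over p in (0, 1)

def risk(d, p):
    # frequentist risk R(p, d) = E_p[(p - d(X))^2] for X ~ Bin(n, p)
    return sum((p - d(x)) ** 2 * math.comb(n, x) * p**x * (1 - p)**(n - x)
               for x in range(n + 1))

def bayes_risk(d):
    # integrate R(p, d) over the uniform prior
    return sum(risk(d, (i + 0.5) / N) for i in range(N)) / N

br_star = bayes_risk(lambda x: (x + 1) / (n + 2))
br_mle = bayes_risk(lambda x: x / n)
```

So the Bayes rule improves on the MLE in Bayes risk for every n.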
Theorem 8.8.12:
Let {fθ : θ ∈ Θ} be a family of pdf’s (or pmf’s). Suppose that an estimate d∗ of θ is a
Bayes estimate corresponding to some prior distribution π on Θ. If the risk function R(θ, d∗ )
is constant on Θ, then d∗ is a minimax estimate of θ.
Proof:
Homework.
Definition 8.8.13:
Let F denote the class of pdf’s (or pmf’s) fθ (x). A class Π of prior distributions is a conju-
gate family for F if the posterior distribution is in the class Π for all f ∈ F , all priors in Π,
and all x ∈ X.
Note:
The beta family is conjugate for the binomial family. Thus, if we start with a beta prior, we
will end up with a beta posterior. (See Homework.)
9 Hypothesis Testing
Definition 9.1.1:
A parametric hypothesis is an assumption about the unknown parameter θ.
H0 : θ ∈ Θ0 ⊂ Θ.
H1 : θ ∈ Θ1 = Θ − Θ0 .
Definition 9.1.2:
If Θ0 (or Θ1 ) contains only one point, we say that H0 and Θ0 (or H1 and Θ1 ) are simple. In
this case, the distribution of X is completely specified under the null (or alternative) hypoth-
esis.
If Θ0 (or Θ1 ) contains more than one point, we say that H0 and Θ0 (or H1 and Θ1 ) are
composite.
Example 9.1.3:
Let X1, . . . , Xn be iid Bin(1, p). Examples for hypotheses are p = 1/2 (simple), p ≥ 1/2 (composite), p ≠ 1/4 (composite), etc.
Note:
The problem of testing a hypothesis can be described as follows: Given a sample point x, find
a decision rule that will lead to a decision to accept or reject the null hypothesis. This means,
we partition the space IRn into two disjoint sets C and C c such that, if x ∈ C, we reject
H0 : θ ∈ Θ0 (and we accept H1 ). Otherwise, if x ∈ C c , we accept H0 that X ∼ Fθ , θ ∈ Θ0 .
Definition 9.1.4:
Let X ∼ Fθ , θ ∈ Θ. Let C be a subset of IRn such that, if x ∈ C, then H0 is rejected (with
probability 1), i.e.,
C = {x ∈ IRn : H0 is rejected for this x}.
Definition 9.1.5:
If we reject H0 when it is true, we call this a Type I error. If we fail to reject H0 when it
is false, we call this a Type II error. Usually, H0 and H1 are chosen such that the Type I
error is considered more serious.
Example 9.1.6:
We first consider a non–statistical example, in this case a jury trial. Our hypotheses are that
the defendant is innocent or guilty. Our possible decisions are guilty or not guilty. Since it is
considered worse to punish the innocent than to let the guilty go free, we make innocence the
null hypothesis. Thus, we have
Truth (unknown)
Innocent (H0 ) Guilty (H1 )
Decision (known)
Not Guilty (H0 ) Correct Type II Error
Guilty (H1 ) Type I Error Correct
The jury tries to make a decision “beyond a reasonable doubt”, i.e., it tries to make the
probability of a Type I error small.
Definition 9.1.7:
If C is the critical region, then Pθ (C), θ ∈ Θ0 , is a probability of Type I error, and
Pθ (C c ), θ ∈ Θ1 , is a probability of Type II error.
Note:
We would like both error probabilities to be 0, but this is usually not possible. We usually
settle for fixing the probability of Type I error to be small, e.g., 0.05 or 0.01, and minimizing
the Type II error.
Definition 9.1.8:
Every Borel–measurable mapping φ of IRn → [0, 1] is called a test function. φ(x) is the
probability of rejecting H0 when x is observed.
If φ is the indicator function of a subset C ⊆ IRn , φ is called a nonrandomized test and C
is the critical region of this test function.
Definition 9.1.9:
Let φ be a test function of the hypothesis H0 : θ ∈ Θ0 against the alternative H1 : θ ∈ Θ1 .
We say that φ has a level of significance of α (or φ is a level–α–test or φ is of size α) if
Eθ (φ(X)) = Pθ (reject H0 ) ≤ α ∀θ ∈ Θ0 .
Definition 9.1.10:
Let φ be a test for the problem (α, Θ0 , Θ1 ). For every θ ∈ Θ, we define
We call βφ (θ) the power function of φ. For any θ ∈ Θ1 , βφ (θ) is called the power of φ
against the alternative θ.
Definition 9.1.11:
Let Φα be the class of all tests for (α, Θ0 , Θ1 ). A test φ0 ∈ Φα is called a most powerful
(MP) test against an alternative θ ∈ Θ1 if
Definition 9.1.12:
Let Φα be the class of all tests for (α, Θ0 , Θ1 ). A test φ0 ∈ Φα is called a uniformly most
powerful (UMP) test if
Example 9.1.13:
Let X1 , . . . , Xn be iid N (µ, 1), µ ∈ Θ = {µ0 , µ1 }, µ0 < µ1 .
Under H0 it holds that X ∼ N (µ0 , 1/n). For a given α, we can solve the following equation
for k:

Pµ0 (X > k) = P( (X − µ0 )/(1/√n) > (k − µ0 )/(1/√n) ) = P (Z > zα ) = α

Here, (X − µ0 )/(1/√n) = Z ∼ N (0, 1) and zα is defined in such a way that P (Z > zα ) = α,
i.e., zα is the upper α–quantile of the N (0, 1) distribution. It follows that
(k − µ0 )/(1/√n) = zα and therefore k = µ0 + zα /√n.
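This computation is easy to check numerically. The sketch below uses only the Python standard library; the values µ0 = 0, n = 25, α = 0.05 are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Example 9.1.13 numerically: X1,...,Xn iid N(mu,1), H0: mu = mu0;
# reject H0 when the sample mean exceeds k = mu0 + z_alpha/sqrt(n).
mu0, n, alpha = 0.0, 25, 0.05              # illustrative values
z_alpha = NormalDist().inv_cdf(1 - alpha)  # upper alpha-quantile of N(0,1)
k = mu0 + z_alpha / sqrt(n)

# Under H0, Xbar ~ N(mu0, 1/n), so the size P(Xbar > k) should equal alpha.
size = 1 - NormalDist(mu0, 1 / sqrt(n)).cdf(k)
```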
Example 9.1.14:
Let X ∼ Bin(6, p), p ∈ Θ = (0, 1).
H0 : p = 1/2, H1 : p ≠ 1/2.
Reasonable plan: Since Ep=1/2 (X) = 3, reject H0 when |X − 3| ≥ c for some constant c. But
how should we select c?

x       c = |x − 3|     Pp=1/2 (X = x)      Pp=1/2 (|X − 3| ≥ c)
0, 6    3               1/64 each           2/64 ≈ 0.031
1, 5    2               6/64 each           14/64 ≈ 0.219
2, 4    1               15/64 each          44/64 ≈ 0.688
3       0               20/64               64/64 = 1
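The tail probabilities in this table can be reproduced with a few lines of standard-library Python:

```python
from math import comb

# Example 9.1.14: X ~ Bin(6, 1/2); the test rejects H0 when |X - 3| >= c.
n = 6
pmf = {x: comb(n, x) / 2 ** n for x in range(n + 1)}
size = {c: sum(p for x, p in pmf.items() if abs(x - 3) >= c) for c in range(4)}
# size[c] is the probability of a Type I error for the test with cutoff c
```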
9.2 The Neyman–Pearson Lemma
Theorem 9.2.1: Neyman–Pearson Lemma
Let fθ0 and fθ1 be the densities (pmf’s) corresponding to the simple hypotheses H0 : θ = θ0
and H1 : θ = θ1 .
(i) Any test of the form

φ(x) = 1,    if fθ1 (x) > k fθ0 (x)
       γ(x), if fθ1 (x) = k fθ0 (x)          (∗)
       0,    if fθ1 (x) < k fθ0 (x)

for some k ≥ 0 and 0 ≤ γ(x) ≤ 1, is most powerful of its significance level for testing
H0 vs. H1 .
If k = ∞, the test

φ(x) = 1, if fθ0 (x) = 0
       0, if fθ0 (x) > 0                     (∗∗)

is most powerful of size (or significance level) 0 for testing H0 vs. H1 .
(ii) Given 0 ≤ α ≤ 1, there exists a test of the form (∗) or (∗∗) with γ(x) = γ (i.e., a
constant) such that
Eθ0 (φ(X)) = α.
Proof:
We prove the continuous case only.
(i):
Theorem 9.2.2:
If a sufficient statistic T exists for the family {fθ : θ ∈ Θ = {θ0 , θ1 }}, then the Neyman–
Pearson most powerful test is a function of T .
Proof:
Homework
Example 9.2.3:
We want to test H0 : X ∼ N (0, 1) vs. H1 : X ∼ Cauchy(1, 0), based on a single observation.
It is

f1 (x)/f0 (x) = [ (1/π) · 1/(1 + x²) ] / [ (1/√(2π)) exp(−x²/2) ] = √(2/π) · exp(x²/2)/(1 + x²).

The MP test is

φ(x) = 1, if √(2/π) · exp(x²/2)/(1 + x²) > k
       0, otherwise
where k is determined such that EH0 (φ(X)) = α.
If α < 0.113, we reject H0 if |x| > zα/2 , where zα/2 is the upper α/2 quantile of a N (0, 1)
distribution.
If α > 0.113, we reject H0 if |x| > k1 or if |x| < k2 , where k1 > 0, k2 > 0 are such that

exp(k1²/2)/(1 + k1²) = exp(k2²/2)/(1 + k2²)   and   ∫_{k2}^{k1} (1/√(2π)) exp(−x²/2) dx = (1 − α)/2.
Why is α = 0.113 so interesting?
For x = 0, it is

f1 (x)/f0 (x) = √(2/π) ≈ 0.7979.

Similarly, for x ≈ −1.585 and x ≈ 1.585, it is

f1 (x)/f0 (x) = √(2/π) · exp((±1.585)²/2)/(1 + (±1.585)²) ≈ 0.7979 ≈ f1 (0)/f0 (0).

So for k = 0.7979 the rejection region is exactly {x : |x| > 1.585}, which under H0 has
probability 2(1 − Φ(1.585)) ≈ 0.113.
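These numbers can be confirmed with a short standard-library check; the cutoff 1.585 is taken from the example above.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

# Example 9.2.3: likelihood ratio f1/f0 of Cauchy(1,0) against N(0,1).
def lr(x):
    return sqrt(2 / pi) * exp(x * x / 2) / (1 + x * x)

r0 = lr(0.0)        # sqrt(2/pi), roughly 0.7979
r1 = lr(1.585)      # approximately the same value again
# With k = lr(0), the rejection region is |x| > 1.585, whose size under
# H0 (i.e., under N(0,1)) is about 0.113.
size = 2 * (1 - NormalDist().cdf(1.585))
```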
9.3 Monotone Likelihood Ratios
Definition 9.3.1:
Let {fθ : θ ∈ Θ ⊆ IR} be a family of pdf’s (pmf’s) on a one–dimensional parameter space.
We say the family {fθ } has a monotone likelihood ratio (MLR) in statistic T (X) if for
θ1 < θ2 , whenever fθ1 and fθ2 are distinct, the ratio fθ2 (x)/fθ1 (x) is a nondecreasing function
of T (x) on the set of values x for which at least one of fθ1 (x) and fθ2 (x) is > 0.
Note:
We can also define families of densities with nonincreasing MLR in T (X), but such families
can be treated by symmetry.
Example 9.3.2:
Let X1 , . . . , Xn ∼ U [0, θ], θ > 0. Then the joint pdf is

fθ (x) = 1/θⁿ, if 0 ≤ x(1) ≤ x(n) ≤ θ;  0, otherwise
       = (1/θⁿ) I[0,∞) (x(1) ) I[0,θ] (x(n) ).

For θ1 < θ2 , the ratio fθ2 (x)/fθ1 (x) equals (θ1 /θ2 )ⁿ for x(n) ≤ θ1 and is ∞ for
θ1 < x(n) ≤ θ2 , so it is nondecreasing in x(n) . Thus, this family has a MLR in T (X) = X(n) .
Theorem 9.3.3:
The one–parameter exponential family fθ (x) = exp(Q(θ)T (x) + D(θ) + S(x)), where Q(θ) is
nondecreasing, has a MLR in T (X).
Proof:
Homework.
Example 9.3.4:
Let X = (X1 , · · · , Xn ) be a random sample from the Poisson family with parameter λ > 0.
Then the joint pmf is

fλ (x) = Π_{i=1}^n e^{−λ} λ^{xi} /xi ! = e^{−nλ} λ^{Σ xi} Π_{i=1}^n (1/xi !)
       = exp( −nλ + Σ_{i=1}^n xi · log(λ) − Σ_{i=1}^n log(xi !) ).

Since Q(λ) = log(λ) is a nondecreasing function of λ, it follows by Theorem 9.3.3 that the
Poisson family with parameter λ > 0 has a MLR in T (X) = Σ_{i=1}^n Xi .
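For a sample of size n the likelihood ratio reduces to f_{λ2}(x)/f_{λ1}(x) = e^{−n(λ2−λ1)} (λ2/λ1)^t with t = Σ xi, so the MLR property can be verified directly; the rates and sample size below are illustrative.

```python
from math import exp

# Example 9.3.4: for Poisson samples the ratio f_{l2}/f_{l1} depends on x
# only through t = sum(x_i); check it is nondecreasing in t when l1 < l2.
def ratio(t, n, l1, l2):
    return exp(-n * (l2 - l1)) * (l2 / l1) ** t

n, l1, l2 = 5, 1.0, 2.5      # illustrative sample size and rates
vals = [ratio(t, n, l1, l2) for t in range(30)]
nondecreasing = all(a <= b for a, b in zip(vals, vals[1:]))
```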
Theorem 9.3.5:
Let X ∼ fθ , θ ∈ Θ ⊆ IR, where the family {fθ } has a MLR in T (X). For testing
H0 : θ ≤ θ0 vs. H1 : θ > θ0 , any test of the form

φ(x) = 1, if T (x) > t0 ;  γ, if T (x) = t0 ;  0, if T (x) < t0

has a nondecreasing power function and is UMP of its size Eθ0 (φ(X)) = α, if the size is not 0.
Proof:
“=⇒”:
“⇐=”:
Use the Neyman–Pearson Lemma (Theorem 9.2.1).
Note:
By interchanging inequalities throughout Theorem 9.3.5 and its proof, we see that this
Theorem also provides a solution of the dual problem H0′ : θ ≥ θ0 vs. H1′ : θ < θ0 .
Theorem 9.3.6:
For the one–parameter exponential family, there exists a UMP two–sided test of H0 : θ ≤ θ1
or θ ≥ θ2 (where θ1 < θ2 ) vs. H1 : θ1 < θ < θ2 of the form

φ(x) = 1,  if c1 < T (x) < c2
       γi , if T (x) = ci , i = 1, 2
       0,  if T (x) < c1 or T (x) > c2
Note:
UMP tests for H0 : θ1 ≤ θ ≤ θ2 and H0′ : θ = θ0 do not exist for one–parameter exponential
families.
9.4 Unbiased and Invariant Tests
Definition 9.4.1:
A size α test φ of H0 : θ ∈ Θ0 vs. H1 : θ ∈ Θ1 is unbiased if

Eθ (φ(X)) ≥ α ∀θ ∈ Θ1 .
Note:
This condition means that βφ (θ) ≤ α ∀θ ∈ Θ0 and βφ (θ) ≥ α ∀θ ∈ Θ1 . In other words, the
power of this test is never less than α.
Definition 9.4.2:
Let Uα be the class of all unbiased size α tests of H0 vs H1 . If there exists a test φ ∈ Uα
that has maximal power for all θ ∈ Θ1 , we call φ a UMP unbiased (UMPU) size α test.
Note:
It holds that Uα ⊆ Φα . A UMP test φα ∈ Φα will have βφα (θ) ≥ α ∀θ ∈ Θ1 since it must be
at least as powerful as the trivial test φ(x) ≡ α. Thus, if a UMP test exists in Φα , it is
also a UMPU test in Uα .
Example 9.4.3:
Let X1 , . . . , Xn be iid N (µ, σ²), where σ² > 0 is known. Consider H0 : µ = µ0 vs H1 : µ ≠ µ0 .
From the Neyman–Pearson Lemma, we know that for µ1 > µ0 , the MP test is of the form

φ1 (x) = 1, if x > µ0 + (σ/√n) zα ;  0, otherwise.

Similarly, for µ1 < µ0 , the MP test φ2 rejects H0 if x < µ0 − (σ/√n) zα .
If a test is UMP, it must have the same rejection region as φ1 and φ2 . However, these 2
rejection regions are different (actually, their intersection is empty). Thus, there exists no
UMP test.
We next state a helpful Theorem and then continue with this example and see how we can
find a UMPU test.
Theorem 9.4.4:
Let c1 , . . . , cn ∈ IR be constants and f1 (x), . . . , fn+1 (x) be real–valued functions. Let C be the
class of functions φ(x) satisfying 0 ≤ φ(x) ≤ 1 and

∫_{−∞}^{∞} φ(x) fi (x) dx = ci  ∀i = 1, . . . , n.

If φ∗ ∈ C satisfies

φ∗ (x) = 1, if fn+1 (x) > Σ_{i=1}^n ki fi (x)
         0, if fn+1 (x) < Σ_{i=1}^n ki fi (x)

for some constants k1 , . . . , kn ∈ IR, then φ∗ maximizes ∫_{−∞}^{∞} φ(x) fn+1 (x) dx among all φ ∈ C.
Proof:
Let φ∗ (x) be as above. Let φ(x) be any other function in C. Since 0 ≤ φ(x) ≤ 1 ∀x, it is

(φ∗ (x) − φ(x)) ( fn+1 (x) − Σ_{i=1}^n ki fi (x) ) ≥ 0  ∀x.

This holds since if φ∗ (x) = 1, the left factor is ≥ 0 and the right factor is ≥ 0. If φ∗ (x) = 0,
the left factor is ≤ 0 and the right factor is ≤ 0.
Therefore,

0 ≤ ∫ (φ∗ (x) − φ(x)) ( fn+1 (x) − Σ_{i=1}^n ki fi (x) ) dx
  = ∫ φ∗ (x) fn+1 (x) dx − ∫ φ(x) fn+1 (x) dx − Σ_{i=1}^n ki ( ∫ φ∗ (x) fi (x) dx − ∫ φ(x) fi (x) dx ),

where each term in the last sum is ci − ci = 0.
Thus,

∫ φ∗ (x) fn+1 (x) dx ≥ ∫ φ(x) fn+1 (x) dx.
Note:
(ii) The Theorem above is the Neyman–Pearson Lemma if n = 1, f1 = fθ0 , f2 = fθ1 , and
c1 = α.
Example 9.4.3: (continued)
So far, we have seen that there exists no UMP test for H0 : µ = µ0 vs H1 : µ ≠ µ0 .
Due to Theorem 9.2.2, we only have to consider functions of the sufficient statistic T (X) = X.
Let τ² = σ²/n. We restrict attention to tests φ(t) satisfying
(i) Eµ0 (φ(T )) = ∫ φ(t) fµ0 (t) dt = α, and
(ii) ∂/∂µ ∫ φ(t) fµ (t) dt |µ=µ0 = ∫ φ(t) [∂fµ (t)/∂µ]µ=µ0 dt = 0, i.e., we have a minimum at µ0 .

We want to maximize ∫ φ(t) fµ (t) dt, µ ≠ µ0 , such that conditions (i) and (ii) hold.
Note that the left hand side of this inequality is increasing in x if µ1 > µ0 and decreasing in
x if µ1 < µ0 . Either way, we can choose k1 and k2 such that the linear function in x crosses
the exponential function in x at the two points

µL = µ0 − (σ/√n) zα/2 ,  µU = µ0 + (σ/√n) zα/2 .

The resulting test φ3 rejects H0 if x < µL or x > µU . Obviously, φ3 satisfies (i). We still
need to check that φ3 satisfies (ii) and that βφ3 (µ) has a minimum at µ0 , but we omit this
part from our proof here.
φ3 is of the form φ∗ in Theorem 9.4.4 and therefore φ3 is UMP in C. But the trivial test
φt (x) = α also satisfies (i) and (ii) above. Therefore, βφ3 (µ) ≥ α ∀µ ≠ µ0 . This means that
φ3 is unbiased.
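The unbiasedness of φ3 can also be seen numerically: the power function of the two-sided test has its minimum α at µ0. A standard-library sketch with illustrative values:

```python
from math import sqrt
from statistics import NormalDist

# Example 9.4.3: power of the test that rejects when |Xbar - mu0| exceeds
# (sigma/sqrt(n)) z_{alpha/2}, for X1,...,Xn iid N(mu, sigma^2).
mu0, sigma, n, alpha = 0.0, 1.0, 16, 0.05     # illustrative values
half_width = NormalDist().inv_cdf(1 - alpha / 2) * sigma / sqrt(n)

def power(mu):
    xbar = NormalDist(mu, sigma / sqrt(n))    # distribution of Xbar
    return xbar.cdf(mu0 - half_width) + 1 - xbar.cdf(mu0 + half_width)

grid = [mu0 + d / 10 for d in range(-20, 21)]
betas = [power(m) for m in grid]
```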
Definition 9.4.5:
A test φ is said to be α–similar on a subset Θ∗ of Θ if
βφ (θ) = Eθ (φ(X)) = α ∀θ ∈ Θ∗ .
Note:
The trivial test φ(x) = α is α–similar on every Θ∗ ⊆ Θ.
Theorem 9.4.6:
Let φ be an unbiased test of size α for H0 : θ ∈ Θ0 vs H1 : θ ∈ Θ1 such that βφ (θ) is a
continuous function in θ. Then φ is α–similar on the boundary Λ = Θ̄0 ∩ Θ̄1 , where Θ̄0 and
Θ̄1 are the closures of Θ0 and Θ1 , respectively.
Proof:
Let θ ∈ Λ. There exist sequences {θn } and {θn′ } with θn ∈ Θ0 and θn′ ∈ Θ1 such that
lim_{n→∞} θn = θ and lim_{n→∞} θn′ = θ.
Since βφ (θn ) ≤ α ∀n implies (by continuity of βφ ) that βφ (θ) ≤ α, and since βφ (θn′ ) ≥ α ∀n
implies that βφ (θ) ≥ α, it must hold that βφ (θ) = α ∀θ ∈ Λ.
Definition 9.4.7:
A test φ that is UMP among all α–similar tests on the boundary Λ = Θ̄0 ∩ Θ̄1 is called a
UMP α–similar test.
Theorem 9.4.8:
Suppose βφ (θ) is continuous in θ for all tests φ of H0 : θ ∈ Θ0 vs H1 : θ ∈ Θ1 . If a size α
test of H0 vs H1 is UMP α–similar, then it is UMP unbiased.
Proof:
Let φ0 be UMP α–similar and of size α. This means that Eθ (φ(X)) ≤ α ∀θ ∈ Θ0 .
Since the trivial test φ(x) = α is α–similar, it must hold for φ0 that βφ0 (θ) ≥ α ∀θ ∈ Θ1 since
φ0 is UMP α–similar. This implies that φ0 is unbiased.
Since βφ (θ) is continuous in θ, we see from Theorem 9.4.6 that the class of unbiased tests is
a subclass of the class of α–similar tests. Since φ0 is UMP in the larger class, it is also UMP
in the subclass. Thus, φ0 is UMPU.
Note:
The continuity of the power function βφ (θ) cannot always be checked easily.
Example 9.4.9:
Let X1 , . . . , Xn ∼ N (µ, 1).
Let H0 : µ ≤ 0 vs H1 : µ > 0.
Since the family of densities has a MLR in Σ_{i=1}^n Xi , we could use Theorem 9.3.5 to find
a UMP test. However, we want to illustrate the use of Theorem 9.4.8 here.
The power function βφ (µ) = Eµ (φ(X)) of any test φ is continuous in µ. Thus, due to Theorem
9.4.6, any unbiased size α test of H0 is α–similar on the boundary Λ = {0}.
A UMP α–similar test rejects H0 for large values of the sample, or equivalently, by Theorem
9.2.2,

φ(x) = 1, if T = Σ_{i=1}^n Xi > k
       0, otherwise,

where k = √n zα . Indeed, for µ ≤ 0,

Eµ (φ(X)) = Pµ (T > √n zα ) = P( (T − nµ)/√n > zα − √n µ ) (∗)= P (Z > zα − √n µ) ≤ P (Z > zα ) = α,

with equality for µ = 0.
(∗) holds since (T − nµ)/√n ∼ N (0, 1) for µ ≤ 0, and zα − √n µ ≥ zα for µ ≤ 0.
Thus all the requirements are met for Theorem 9.4.8, i.e., βφ is continuous and φ is UMP
α–similar and of size α, and thus φ is UMPU.
Note:
Rohatgi, page 428–430, lists Theorems (without proofs), stating that for Normal data, one–
and two–tailed t–tests, one– and two–tailed χ2 –tests, two–sample t–tests, and F –tests are all
UMPU.
Note:
Recall from Definition 8.2.4 that a class of distributions is invariant under a group G of
transformations, if for each g ∈ G and for each θ ∈ Θ there exists a unique θ′ ∈ Θ such that
if X ∼ Pθ , then g(X) ∼ Pθ′ .
Definition 9.4.10:
A group G of transformations on X leaves a hypothesis testing problem invariant if G
leaves both {Pθ : θ ∈ Θ0 } and {Pθ : θ ∈ Θ1 } invariant, i.e., if y = g(x) ∼ hθ (y), then
{fθ (x) : θ ∈ Θ0 } ≡ {hθ (y) : θ ∈ Θ0 } and {fθ (x) : θ ∈ Θ1 } ≡ {hθ (y) : θ ∈ Θ1 }.
Note:
We want two types of invariance for our tests:
Formal Invariance: If two tests have the same structure, i.e, the same Θ, the same pdf’s (or
pmf’s), and the same hypotheses, then we should use the same test in both problems.
So, if the transformed problem in terms of y has the same formal structure as that of
the problem in terms of x, we must have that φ∗ (y) = φ(x) = φ∗ (g(x)).
Definition 9.4.11:
An invariant test with respect to a group G of transformations is any test φ such that
φ(x) = φ(g(x)) ∀x ∀g ∈ G.
Example 9.4.12:
Let X ∼ Bin(n, p). Let H0 : p = 1/2 vs. H1 : p ≠ 1/2.
Consider the group G = {g1 , g2 }, where g1 (x) = n − x and g2 (x) = x (the identity).
If φ is invariant, then φ(x) = φ(n − x). Is the test problem invariant? For g2 , the answer is
obvious.
For g1 , we get:

g1 (X) = n − X ∼ Bin(n, 1 − p)

H0 : p = 1/2 : {fp (x) : p = 1/2} = {hp (g1 (x)) : p = 1/2} = Bin(n, 1/2)
H1 : p ≠ 1/2 : {fp (x) : p ≠ 1/2} = {hp (g1 (x)) : p ≠ 1/2}, both classes being {Bin(n, p) : p ≠ 1/2}.

So all the requirements in Definition 9.4.10 are met. If, for example, n = 10, the test

φ(x) = 1, if x = 0, 1, 2, 8, 9, 10
       0, otherwise

is invariant, since φ(x) = φ(10 − x).
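Invariance and size of this test are easy to verify in a few lines of standard-library Python:

```python
from math import comb

# Example 9.4.12 with n = 10: reject H0 for x in {0,1,2,8,9,10}.
n = 10
reject = {0, 1, 2, 8, 9, 10}
invariant = all((x in reject) == ((n - x) in reject) for x in range(n + 1))
size = sum(comb(n, x) for x in reject) / 2 ** n   # size under p = 1/2
```

The size works out to 112/1024 ≈ 0.109.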
Example 9.4.13:
Let X1 , . . . , Xn ∼ N (µ, σ²) where both µ and σ² > 0 are unknown. It is X ∼ N (µ, σ²/n)
and (n − 1)S²/σ² ∼ χ²n−1 , with X and S² independent.
So, this is the same family of distributions and Definition 9.4.10 holds because µ ≤ 0 implies
that cµ ≤ 0 (for c > 0).
If x1 /s1 ≠ x2 /s2 , then there exists no c > 0 such that (x2 , s2²) ≡ (cx1 , c²s1²). So invariance
places no restrictions on φ across points with different values of x/s, but forces φ to be
constant whenever x1 /s1 = x2 /s2 . Thus, invariant tests are exactly those that depend only
on x/s, which are equivalent to tests that are based only on t = √n x/s. Since this mapping
is 1–to–1, the invariant test will use T = √n X/S ∼ tn−1 if µ = 0. Note that this test does
not depend on the nuisance parameter σ². Invariance often produces such results.
Definition 9.4.14:
Let G be a group of transformations on the space of X. We say a statistic T (x) is maximal
invariant under G if
(i) T is invariant, i.e., T (g(x)) = T (x) ∀x ∀g ∈ G, and
(ii) T is maximal, i.e., T (x1 ) = T (x2 ) implies that x1 = g(x2 ) for some g ∈ G.
Example 9.4.15:
Let x = (x1 , . . . , xn ) and gc (x) = (x1 + c, . . . , xn + c), c ∈ IR. Then T (x) = (x1 − xn , . . . ,
xn−1 − xn ) is maximal invariant under G = {gc : c ∈ IR}: T (gc (x)) = T (x) since differences
are unchanged by a common shift, and if T (x) = T (y), then y = gc (x) for c = yn − xn .
Definition 9.4.16:
Let Iα be the class of all invariant tests of size α of H0 : θ ∈ Θ0 vs. H1 : θ ∈ Θ1 . If there
exists a UMP member in Iα , it is called the UMP invariant test of H0 vs H1 .
Theorem 9.4.17:
Let T (x) be maximal invariant with respect to G. A test φ is invariant under G iff φ is a
function of T .
Proof:
“=⇒”:
Let φ be invariant under G. If T (x1 ) = T (x2 ), then there exists a g ∈ G such that x1 = g(x2 ).
Thus, it follows from invariance that φ(x1 ) = φ(g(x2 )) = φ(x2 ). Since φ is the same whenever
T (x1 ) = T (x2 ), φ must be a function of T .
“⇐=”:
Let φ be a function of T , i.e., φ(x) = h(T (x)). It follows that

φ(g(x)) = h(T (g(x))) (∗)= h(T (x)) = φ(x),

where (∗) holds since T is invariant under G.
Example 9.4.18:
Consider the test problem
where θ ∈ IR.
where c ∈ IR and n ≥ 2.
10 More on Hypothesis Testing
10.1 Likelihood Ratio Tests
Definition 10.1.1:
The likelihood ratio test (LRT) statistic for testing

H0 : θ ∈ Θ0 vs. H1 : θ ∈ Θ1 = Θ − Θ0

is

λ(x) = sup_{θ∈Θ0 } fθ (x) / sup_{θ∈Θ} fθ (x).

An LRT rejects H0 iff λ(x) ≤ c
for some constant c ∈ [0, 1], where c is usually chosen in such a way to make φ a test of size
α.
Note:
(ii) LRT’s are strongly related to MLE’s. If θ̂ is the unrestricted MLE of θ over Θ and θ̂0 is
the MLE of θ over Θ0 , then λ(x) = fθ̂0 (x)/fθ̂ (x).
Example 10.1.2:
Let X1 , . . . , Xn be a sample from N (µ, 1). We want to construct a LRT for
H0 : µ = µ0 vs. H1 : µ 6= µ0 .
Theorem 10.1.3:
If T (X) is sufficient for θ and λ∗ (t) and λ(x) are LRT statistics based on T and X respectively,
then
λ∗ (T (x)) = λ(x) ∀x,
i.e., the LRT can be expressed as a function of every sufficient statistic for θ.
Proof:
Since T is sufficient, it follows from Theorem 8.3.5 that its pdf (or pmf) factorizes as fθ (x) =
gθ (T )h(x). Therefore we get:

λ(x) = sup_{θ∈Θ0 } fθ (x) / sup_{θ∈Θ} fθ (x)
     = sup_{θ∈Θ0 } gθ (T )h(x) / sup_{θ∈Θ} gθ (T )h(x)
     = sup_{θ∈Θ0 } gθ (T ) / sup_{θ∈Θ} gθ (T )
     = λ∗ (T (x))

Thus, our simplified expression for λ(x) indeed only depends on a sufficient statistic T .
Theorem 10.1.4:
If for a given α, 0 ≤ α ≤ 1, and for a simple hypothesis H0 and a simple alternative H1 a
non–randomized test based on the NP Lemma and LRT’s exist, then these tests are equivalent.
Proof:
See Homework.
Note:
Usually, LRT’s perform well since they are often UMP or UMPU size α tests. However, this
does not always hold. Rohatgi, Example 4, page 440–441, cites an example where the LRT is
not unbiased and it is even worse than the trivial test φ(x) = α.
Theorem 10.1.5:
Under some regularity conditions on fθ (x), the rv −2 log λ(X) under H0 has asymptotically
a chi–squared distribution with ν degrees of freedom, where ν equals the difference between
the number of independent parameters in Θ and Θ0 , i.e.,
−2 log λ(X) →d χ²ν under H0 .
Note:
The regularity conditions required for Theorem 10.1.5 are basically the same as for Theorem
8.7.10. Under “independent” parameters we understand parameters that are unspecified, i.e.,
free to vary.
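Theorem 10.1.5 can be illustrated by simulation. For N(µ,1) data and H0 : µ = µ0 (the setting of Example 10.1.2, with µ0 = 0 here), −2 log λ(X) works out to n X̄², so its rejection rate at the χ²₁ critical value 3.841 should be close to 0.05. A standard-library sketch:

```python
import random
from statistics import NormalDist

# Simulate -2 log lambda = n * Xbar^2 under H0: mu = 0 for N(mu,1) data.
random.seed(1)
n, reps = 50, 4000
crit = NormalDist().inv_cdf(0.975) ** 2   # chi^2_1 upper 5% point, 3.841
hits = 0
for _ in range(reps):
    xbar = sum(random.gauss(0, 1) for _ in range(n)) / n
    hits += n * xbar * xbar > crit
rate = hits / reps                        # should be close to 0.05
```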
Example 10.1.6:
Let X1 , . . . , Xn ∼ N (µ, σ 2 ) where µ ∈ IR and σ 2 > 0 are both unknown.
Let H0 : µ = µ0 vs. H1 : µ 6= µ0 .
Note that

f1,n−1 (f ) = [ Γ(n/2) / ( Γ(1/2) Γ((n − 1)/2) (n − 1)^{1/2} ) ] · f^{−1/2} ( 1 + f /(n − 1) )^{−n/2} I[0,∞) (f )

is the pdf of a F1,n−1 distribution.
Let y = (1 + f /(n − 1))^{−1} ; then f /(n − 1) = (1 − y)/y and df = −((n − 1)/y²) dy.
Thus,
As n → ∞, we can apply Stirling’s formula, which states that

Γ(α(n) + 1) = (α(n))! ≈ √(2π) (α(n))^{α(n)+1/2} exp(−α(n)).

So,

Mn (t) ≈ [ √(2π) ((n − 2)/2)^{(n−1)/2} exp(−(n − 2)/2) · √(2π) ((n(1 − 2t) − 3)/2)^{(n(1−2t)−2)/2} exp(−(n(1 − 2t) − 3)/2) ]
       / [ √(2π) ((n − 3)/2)^{(n−2)/2} exp(−(n − 3)/2) · √(2π) ((n(1 − 2t) − 2)/2)^{(n(1−2t)−1)/2} exp(−(n(1 − 2t) − 2)/2) ]
10.2 Parametric Chi–Squared Tests
Reject H0 at level α if

      H0        H1        µ known                              µ unknown
I     σ ≥ σ0    σ < σ0    Σ(xi − µ)² ≤ σ0² χ²n;1−α             s² ≤ (σ0²/(n − 1)) χ²n−1;1−α
II    σ ≤ σ0    σ > σ0    Σ(xi − µ)² ≥ σ0² χ²n;α               s² ≥ (σ0²/(n − 1)) χ²n−1;α
III   σ = σ0    σ ≠ σ0    Σ(xi − µ)² ≤ σ0² χ²n;1−α/2           s² ≤ (σ0²/(n − 1)) χ²n−1;1−α/2
                          or Σ(xi − µ)² ≥ σ0² χ²n;α/2          or s² ≥ (σ0²/(n − 1)) χ²n−1;α/2
Note:
(iii) In test III, the constants have been chosen in such a way to give equal probability to
each tail. This is the usual approach. However, this may result in a biased test.
(iv) χ2n;1−α is the (lower) α quantile and χ2n;α is the (upper) 1 − α quantile, i.e., for X ∼ χ2n ,
it holds that P (X ≤ χ2n;1−α ) = α and P (X ≤ χ2n;α ) = 1 − α.
(v) We can also use χ2 tests to test for equality of binomial probabilities as shown in the
next few Theorems.
Theorem 10.2.2:
Let X1 , . . . , Xk be independent rv’s with Xi ∼ Bin(ni , pi ), i = 1, . . . , k. Then it holds that

T = Σ_{i=1}^k ( (Xi − ni pi ) / √(ni pi (1 − pi )) )² →d χ²k

as n1 , . . . , nk → ∞.
Proof:
Homework
Corollary 10.2.3:
Let X1 , . . . , Xk be as in Theorem 10.2.2 above. We want to test the hypothesis H0 : p1 =
p2 = . . . = pk = p, where p is a known constant (vs. the alternative H1 that at least one of
the pi ’s is different from the other ones). An approximate level–α test rejects H0 if

y = Σ_{i=1}^k ( (xi − ni p) / √(ni p(1 − p)) )² ≥ χ²k;α .
Theorem 10.2.4:
Let X1 , . . . , Xk be independent rv’s with Xi ∼ Bin(ni , p), i = 1, . . . , k. Then the MLE of p is

p̂ = Σ_{i=1}^k xi / Σ_{i=1}^k ni .

Proof:
This can be shown by using the joint likelihood function or by the fact that Σ Xi ∼ Bin(Σ ni , p).
Theorem 10.2.5:
Let X1 , . . . , Xk be independent rv’s with Xi ∼ Bin(ni , pi ), i = 1, . . . , k. An approximate
level–α test of H0 : p1 = p2 = . . . = pk = p, where p is unknown (vs. the alternative H1 that
at least one of the pi ’s is different from the other ones), rejects H0 if

y = Σ_{i=1}^k ( (xi − ni p̂) / √(ni p̂(1 − p̂)) )² ≥ χ²k−1;α ,

where p̂ = Σ xi / Σ ni .
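A quick sketch of this test in Python; the counts below are made up for illustration, and the critical value χ²₂;0.05 = 5.991 is hard-coded since the standard library has no χ² quantile function.

```python
# Theorem 10.2.5: test H0: p1 = p2 = p3 = p (p unknown) from independent
# binomial counts.
xs = [45, 52, 60]       # illustrative successes
ns = [100, 100, 120]    # illustrative sample sizes
p_hat = sum(xs) / sum(ns)
y = sum((x - m * p_hat) ** 2 / (m * p_hat * (1 - p_hat))
        for x, m in zip(xs, ns))
reject = y >= 5.991     # chi^2_{k-1;alpha} with k = 3, alpha = 0.05
```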
Theorem 10.2.6:
Let (X1 , . . . , Xk ) be a multinomial rv with parameters n, p1 , p2 , . . . , pk , where Σ_{i=1}^k pi = 1
and Σ_{i=1}^k Xi = n. Then it holds that

Uk = Σ_{i=1}^k (Xi − npi )²/(npi ) →d χ²k−1
as n → ∞.
Proof:
Case k = 2 only:
Theorem 10.2.7:
Let X1 , . . . , Xn be a sample from X. Let H0 : X ∼ F , where the functional form of F is
known completely. We partition the real line into k disjoint Borel sets A1 , . . . , Ak and let
P (X ∈ Ai ) = pi , where pi > 0 ∀i = 1, . . . , k.
Let Yj = #{Xi ’s in Aj } = Σ_{i=1}^n IAj (Xi ), ∀j = 1, . . . , k.
Then, (Y1 , . . . , Yk ) has multinomial distribution with parameters n, p1 , p2 , . . . , pk .
Theorem 10.2.8:
Let X1 , . . . , Xn be a sample from X. Let H0 : X ∼ Fθ , where θ = (θ1 , . . . , θr ) is unknown.
Let the MLE θ̂ exist. We partition the real line into k disjoint Borel sets A1 , . . . , Ak and let
Pθ̂ (X ∈ Ai ) = p̂i , where p̂i > 0 ∀i = 1, . . . , k.
Let Yj = #{Xi ’s in Aj } = Σ_{i=1}^n IAj (Xi ), ∀j = 1, . . . , k.
Then it holds that

Vk = Σ_{i=1}^k (Yi − np̂i )²/(np̂i ) →d χ²k−r−1 .
An approximate level–α test of H0 : X ∼ Fθ rejects H0 if

Σ_{i=1}^k (yi − np̂i )²/(np̂i ) > χ²k−r−1;α .
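A goodness-of-fit sketch along the lines of Theorem 10.2.8, fitting a Poisson(λ) with λ estimated from the data. The counts, the cell partition, and the simplification in the MLE step are all made up for illustration; with k = 5 cells and r = 1 estimated parameter, the statistic is compared with χ²₃ quantiles.

```python
from math import exp, factorial

# H0: X ~ Poisson(lambda); cells A_0,...,A_3 = {0},...,{3}, A_4 = {4,5,...}.
counts = {0: 18, 1: 30, 2: 26, 3: 16, 4: 10}   # observed cell counts
n = sum(counts.values())
# MLE of lambda (treating the last cell's observations as the value 4,
# a simplification for this sketch):
lam = sum(x * c for x, c in counts.items()) / n

def cell_prob(x):
    if x < 4:
        return exp(-lam) * lam ** x / factorial(x)
    return 1 - sum(cell_prob(j) for j in range(4))

v = sum((c - n * cell_prob(x)) ** 2 / (n * cell_prob(x))
        for x, c in counts.items())
df = len(counts) - 1 - 1   # k - r - 1 = 3; compare v with chi^2_{3;alpha}
```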
10.3 t–Tests and F –Tests
(Based on Rohatgi, Section 10.4 & 10.5 & Rohatgi/Saleh, Section 10.4 & 10.5)
Definition 10.3.1: One– and Two–Tailed t–Tests
Let X1 , . . . , Xn be a sample from a N (µ, σ²) distribution where σ² > 0 may be known or
unknown and µ is unknown. Let X = (1/n) Σ_{i=1}^n Xi and S² = (1/(n − 1)) Σ_{i=1}^n (Xi − X)².
The following table summarizes the z– and t–tests that are typically being used:

Reject H0 at level α if

      H0        H1        σ² known                       σ² unknown
I     µ ≤ µ0    µ > µ0    x ≥ µ0 + (σ/√n) zα             x ≥ µ0 + (s/√n) tn−1;α
II    µ ≥ µ0    µ < µ0    x ≤ µ0 + (σ/√n) z1−α           x ≤ µ0 + (s/√n) tn−1;1−α
III   µ = µ0    µ ≠ µ0    |x − µ0 | ≥ (σ/√n) zα/2        |x − µ0 | ≥ (s/√n) tn−1;α/2
Note:
(ii) These tests are based on just one sample and are often called one sample t–tests.
(iii) Tests I and II are UMP and test III is UMPU if σ 2 is known. Tests I, II, and III are
UMPU and UMP invariant if σ 2 is unknown.
(iv) For large n (≥ 30), we can use z–tables instead of t-tables. Also, for large n we can
drop the Normality assumption due to the CLT. However, for small n, none of these
simplifications is justified.
The following table summarizes the z– and t–tests that are typically being used:
Reject H0 at level α if

      H0              H1              σ1², σ2² known                            σ1², σ2² unknown, σ1 = σ2
I     µ1 − µ2 ≤ δ     µ1 − µ2 > δ     x − y ≥ δ + zα √(σ1²/m + σ2²/n)           x − y ≥ δ + tm+n−2;α sp √(1/m + 1/n)
II    µ1 − µ2 ≥ δ     µ1 − µ2 < δ     x − y ≤ δ + z1−α √(σ1²/m + σ2²/n)         x − y ≤ δ + tm+n−2;1−α sp √(1/m + 1/n)
III   µ1 − µ2 = δ     µ1 − µ2 ≠ δ     |x − y − δ| ≥ zα/2 √(σ1²/m + σ2²/n)       |x − y − δ| ≥ tm+n−2;α/2 sp √(1/m + 1/n)
Note:
(iii) If σ1² = σ2² = σ² (which is unknown), then Sp² is an unbiased estimate of σ². We should
check that σ1² = σ2² with an F –test.
(iv) For large m + n, we can use z–tables instead of t-tables. Also, for large m and large n
we can drop the Normality assumption due to the CLT. However, for small m or small
n, none of these simplifications is justified.
      H0              H1              Reject H0 at level α if
I     µ1 − µ2 ≤ δ     µ1 − µ2 > δ     d ≥ δ + (sd /√n) tn−1;α
II    µ1 − µ2 ≥ δ     µ1 − µ2 < δ     d ≤ δ + (sd /√n) tn−1;1−α
III   µ1 − µ2 = δ     µ1 − µ2 ≠ δ     |d − δ| ≥ (sd /√n) tn−1;α/2
Note:
(ii) These tests are special cases of one–sample tests. All the properties stated in the Note
following Definition 10.3.1 hold.
(iii) We could do a test based on Normality assumptions if σ 2 = σ12 + σ22 − 2ρσ1 σ2 were
known, but that is a very unrealistic assumption.
Recall that

Σ_{i=1}^m (Xi − X)² / σ1² ∼ χ²m−1 ,   Σ_{i=1}^n (Yi − Y )² / σ2² ∼ χ²n−1 ,

and

[ Σ_{i=1}^m (Xi − X)² / ((m − 1)σ1²) ] / [ Σ_{i=1}^n (Yi − Y )² / ((n − 1)σ2²) ] = σ2² S1² / (σ1² S2²) ∼ Fm−1,n−1 .
The following table summarizes the F –tests that are typically being used:
Reject H0 at level α if

      H0           H1           µ1 , µ2 known                                              µ1 , µ2 unknown
I     σ1² ≤ σ2²    σ1² > σ2²    [(1/m)Σ(xi − µ1 )²] / [(1/n)Σ(yi − µ2 )²] ≥ Fm,n;α         s1²/s2² ≥ Fm−1,n−1;α
II    σ1² ≥ σ2²    σ1² < σ2²    [(1/n)Σ(yi − µ2 )²] / [(1/m)Σ(xi − µ1 )²] ≥ Fn,m;α         s2²/s1² ≥ Fn−1,m−1;α
III   σ1² = σ2²    σ1² ≠ σ2²    [(1/m)Σ(xi − µ1 )²] / [(1/n)Σ(yi − µ2 )²] ≥ Fm,n;α/2       s1²/s2² ≥ Fm−1,n−1;α/2 if s1² ≥ s2²
                                or [(1/n)Σ(yi − µ2 )²] / [(1/m)Σ(xi − µ1 )²] ≥ Fn,m;α/2    or s2²/s1² ≥ Fn−1,m−1;α/2 if s1² < s2²
Note:
(i) Tests I and II are UMPU and UMP invariant if µ1 and µ2 are unknown.
(ii) Test III uses equal tails and therefore may not be unbiased.
(iii) If an F –test (at level α1 ) and a t–test (at level α2 ) are both performed, the combined
test has level α = 1 − (1 − α1 )(1 − α2 ) ≥ max(α1 , α2 ) (≡ α1 + α2 if both are small).
(iv) Fm,n;1−α = 1 / Fn,m;α .
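A minimal two-sample variance sketch (means unknown) on simulated data; since the standard library has no F quantile function, the code only computes the statistic s1²/s2², which would then be compared with F tables or software.

```python
import random
from statistics import variance

# Variance-ratio statistic for test I (mu1, mu2 unknown).
random.seed(7)
m, n = 30, 25
xs = [random.gauss(0.0, 2.0) for _ in range(m)]   # simulated, sigma1 = 2
ys = [random.gauss(5.0, 2.0) for _ in range(n)]   # simulated, sigma2 = 2
f_stat = variance(xs) / variance(ys)   # compare with F_{m-1,n-1;alpha}
```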
10.4 Bayes and Minimax Tests
0–1 loss: a(θ) = b(θ) = 1, i.e., all errors are equally bad.
Generalized 0–1 loss: a(θ) = cII , b(θ) = cI , i.e., all Type I errors are equally costly, all
Type II errors are equally costly, and one type of error may be costlier than the other.
Theorem 10.4.1:
The minimax rule d for testing

H0 : θ = θ0 vs. H1 : θ = θ1

is of the form: reject H0 if

fθ1 (x)/fθ0 (x) ≥ k,

where k is chosen such that d has equal risks under θ0 and θ1 .
Proof:
Let d∗ be any other rule.
Example 10.4.2:
Let X1 , . . . , Xn be iid N (µ, 1). Let H0 : µ = µ0 vs. H1 : µ = µ1 > µ0 .
Note:
Now suppose we have a prior distribution π(θ) on Θ. Then the Bayes risk of a decision rule
d (under the loss function introduced before) is

R(π, d) = ∫_Θ R(θ, d) π(θ) dθ

if π is a pdf.
The Bayes risk for a pmf π looks similar (see Rohatgi, page 461).
Theorem 10.4.3:
The Bayes rule for testing H0 : θ = θ0 vs. H1 : θ = θ1 under the prior π(θ0 ) = π0 and
π(θ1 ) = π1 = 1 − π0 and the generalized 0–1 loss function is to reject H0 if

fθ1 (x)/fθ0 (x) ≥ cI π0 / (cII π1 ).
Proof:
Note:
For minimax rules and Bayes rules, the significance level α is no longer predetermined.
Example 10.4.4:
Let X1 , . . . , Xn be iid N (µ, 1). Let H0 : µ = µ0 vs. H1 : µ = µ1 > µ0 . Let cI = cII .
Note:
We can generalize Theorem 10.4.3 to the case of classifying among k options θ1 , . . . , θk . If we
use the 0–1 loss function

L(θi , d) = 1, if d(X) = θj for some j ≠ i;  0, if d(X) = θi ,
Example 10.4.5:
Let X1 , . . . , Xn be iid N (µ, 1). Let µ1 < µ2 < µ3 and let π1 = π2 = π3 .
Choose µ = µi if

πi exp( −Σ(xk − µi )²/2 ) ≥ πj exp( −Σ(xk − µj )²/2 ),  j ≠ i, j = 1, 2, 3,

which, since the priors are equal, simplifies to

x (µi − µj ) ≥ (µi − µj )(µi + µj )/2,  j ≠ i, j = 1, 2, 3.

In our particular example, we get the following decision rules:
(i) Choose µ1 if x ≤ (µ1 + µ2 )/2 (and x ≤ (µ1 + µ3 )/2).
(ii) Choose µ2 if x ≥ (µ1 + µ2 )/2 and x ≤ (µ2 + µ3 )/2.
(iii) Choose µ3 if x ≥ (µ2 + µ3 )/2 (and x ≥ (µ1 + µ3 )/2).
Note that in (i) and (iii) the condition in parentheses automatically holds when the other
condition holds.
With means whose midpoints are (µ1 + µ2 )/2 = 1 and (µ2 + µ3 )/2 = 3, the rules become:
(i) Choose µ1 if x ≤ 1.
(ii) Choose µ2 if 1 ≤ x ≤ 3.
(iii) Choose µ3 if x ≥ 3.
We do not have to worry how to handle the boundary since the probability that the rv will
realize on any of the two boundary points is 0.
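With equal priors, the rule derived above is just "pick the nearest mean", with cutpoints at the midpoints. A sketch; the means (0, 2, 4) are illustrative values chosen so that the cutpoints are 1 and 3:

```python
# Example 10.4.5 sketch: with equal priors, maximizing
# pi_i * exp(-sum((x_k - mu_i)^2)/2) reduces to choosing the mean
# closest to the sample mean xbar.
mus = (0.0, 2.0, 4.0)   # illustrative mu1 < mu2 < mu3

def choose(xbar):
    return min(mus, key=lambda m: abs(xbar - m))
```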
11 Confidence Estimation
11.1 Fundamental Notions
Consider a rv X and constants 0 < a < b for which P (a < X < b) is known.
The interval I(X) = (aX/b, X) is an example of a random interval. I(X) contains the value
a with the fixed probability P (a < X < b).
For example, if X ∼ U (0, 1), a = 1/4, and b = 3/4, then the interval I(X) = (X/3, X)
contains 1/4 with probability 1/2.
Definition 11.1.1:
Let Pθ , θ ∈ Θ ⊆ IRk , be a set of probability distributions of a rv X. A family of subsets
S(x) of Θ, where S(x) depends on x but not on θ, is called a family of random sets. In
particular, if θ ∈ Θ ⊆ IR and S(x) is an interval (θ(x), θ̄(x)), where θ(x) and θ̄(x) depend on
x but not on θ, we call S(X) a random interval, with θ(X) and θ̄(X) as lower and upper
bounds, respectively. θ(X) may be −∞ and θ̄(X) may be +∞.
Note:
Frequently in inference, we are not interested in estimating a parameter or testing a hypoth-
esis about it. Instead, we are interested in establishing a lower or upper bound (or both) for
one or multiple parameters.
Definition 11.1.2:
A family of subsets S(x) of Θ ⊆ IRk is called a family of confidence sets at confidence
level 1 − α if
Pθ (S(X) ∋ θ) ≥ 1 − α ∀θ ∈ Θ.
The quantity

inf_{θ∈Θ} Pθ (S(X) ∋ θ) = 1 − α

is called the confidence coefficient (i.e., the smallest probability of true coverage is 1 − α).
Definition 11.1.3:
For k = 1, we use the following names for some of the confidence sets defined in Definition
11.1.2:
(i) If S(x) = (θ(x), ∞), then θ(x) is called a level 1 − α lower confidence bound.
(ii) If S(x) = (−∞, θ̄(x)), then θ̄(x) is called a level 1 − α upper confidence bound.
Definition 11.1.4:
A family of 1 − α level confidence sets {S(x)} is called uniformly most accurate (UMA)
if

Pθ (S(X) ∋ θ0 ) ≤ Pθ (S 0 (X) ∋ θ0 )  ∀θ, θ0 ∈ Θ, θ ≠ θ0 ,

for any 1 − α level family of confidence sets S 0 (X) (i.e., S(x) minimizes the probability
of false (or incorrect) coverage).
Theorem 11.1.5:
Let X1 , . . . , Xn ∼ Fθ , θ ∈ Θ, where Θ is an interval on IR. Let T (X, θ) be a function on
IRn × Θ such that for each θ, T (X, θ) is a statistic, and as a function of θ, T is strictly
monotone (either increasing or decreasing) in θ at every value of x ∈ IRn .
Let Λ ⊆ IR be the range of T and let the equation λ = T (x, θ) be solvable for θ for every
λ ∈ Λ and every x ∈ IRn .
Choose λ1 (α), λ2 (α) ∈ Λ such that Pθ (λ1 (α) < T (X, θ) < λ2 (α)) = 1 − α.
Since the distribution of T (X, θ) is independent of θ, λ1 (α) and λ2 (α) also do not depend on θ.
If T (X, θ) is increasing in θ, solve the equation λ1 (α) = T (X, θ) for the lower bound θ(X)
and λ2 (α) = T (X, θ) for the upper bound θ̄(X).
If T (X, θ) is decreasing in θ, solve the equation λ1 (α) = T (X, θ) for θ̄(X) and λ2 (α) = T (X, θ)
for θ(X).
Then (θ(X), θ̄(X)) is a 1 − α level confidence interval for θ.
Note:
(ii) If T is not monotone, we can still use this Theorem to get confidence sets that may not
be confidence intervals.
Example 11.1.6:
Let X1 , . . . , Xn ∼ N (µ, σ 2 ), where µ and σ 2 > 0 are both unknown. We seek a 1 − α level
confidence interval for µ.
Example 11.1.7:
Let X1 , . . . , Xn ∼ U (0, θ).
We know that θ̂ = max(Xi ) = M axn is the MLE for θ and sufficient for θ.
Then the rv Tn = M axn /θ has the pdf fTn (t) = n t^{n−1} I(0,1) (t).
11.2 Shortest–Length Confidence Intervals
Definition 11.2.1:
A rv T (X, θ) whose distribution is independent of θ is called a pivot.
Note:
The methods we will discuss here can provide the shortest interval based on a given pivot.
They will not guarantee that there is no other pivot with a shorter minimal interval.
Example 11.2.2:
Let X1 , . . . , Xn ∼ N (µ, σ²), where σ² is known. The obvious pivot for µ is

Tµ (X) = (X − µ) / (σ/√n) ∼ N (0, 1).

Suppose that (a, b) is an interval such that P (a < Z < b) = 1 − α, where Z ∼ N (0, 1). Then

1 − α = P( a < (X − µ)/(σ/√n) < b ) = P( X − b σ/√n < µ < X − a σ/√n ),

so the resulting interval has length L = (b − a) σ/√n. We minimize L subject to
Φ(b) − Φ(a) = 1 − α. Differentiating the constraint with respect to a, we get

(d/da)(Φ(b) − Φ(a)) = φ(b) (db/da) − φ(a) = 0,

and therefore

dL/da = (σ/√n) (db/da − 1) = (σ/√n) (φ(a)/φ(b) − 1).

The minimum occurs when φ(a) = φ(b), which happens when a = b or a = −b. If we select
a = b, then Φ(b) − Φ(a) = 0 ≠ 1 − α. Thus, we must have b = −a = zα/2 .
Thus, the shortest CI based on Tµ is

( X − zα/2 σ/√n , X + zα/2 σ/√n ).
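In code (standard library; the data and σ are illustrative):

```python
from math import sqrt
from statistics import NormalDist, mean

# Example 11.2.2: shortest 1 - alpha CI for mu when sigma is known.
sigma, alpha = 2.0, 0.05
data = [4.1, 5.3, 3.8, 4.9, 5.6, 4.4, 5.0, 4.7]   # illustrative sample
n = len(data)
xbar = mean(data)
z = NormalDist().inv_cdf(1 - alpha / 2)           # z_{alpha/2}
ci = (xbar - z * sigma / sqrt(n), xbar + z * sigma / sqrt(n))
length = ci[1] - ci[0]
```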
Definition 11.2.3:
A pdf f (x) is unimodal iff there exists an x∗ such that f (x) is nondecreasing for x ≤ x∗ and
f (x) is nonincreasing for x ≥ x∗ .
Theorem 11.2.4:
Let f (x) be a unimodal pdf. If the interval [a, b] satisfies
(i) ∫_a^b f (x) dx = 1 − α,
(ii) f (a) = f (b) > 0, and
(iii) a ≤ x∗ ≤ b, where x∗ is a mode of f (x),
then the interval [a, b] is the shortest of all intervals which satisfy condition (i).
Proof:
Let [a0 , b0 ] be any interval with b0 − a0 < b − a. We will show that this implies
∫_{a0}^{b0} f (x) dx < 1 − α, i.e., a contradiction.
• Suppose b0 ≤ a. Since f is nondecreasing on (−∞, x∗ ] and f (x) ≥ f (a) for x ∈ [a, b], we get

∫_{a0}^{b0} f (x) dx ≤ f (b0 )(b0 − a0 ) ≤ f (a)(b0 − a0 ) < f (a)(b − a) ≤ ∫_a^b f (x) dx = 1 − α.

• Suppose b0 > a. We can immediately exclude that b0 > b, since then b0 − a0 > b − a, i.e.,
[a0 , b0 ] wouldn’t be of shorter length than [a, b]. Thus, we have to consider the case that
a0 ≤ a < b0 < b. It holds that

∫_{a0}^{b0} f (x) dx = ∫_a^b f (x) dx + ∫_{a0}^{a} f (x) dx − ∫_{b0}^{b} f (x) dx.

Note that ∫_{a0}^{a} f (x) dx ≤ f (a)(a − a0 ) and ∫_{b0}^{b} f (x) dx ≥ f (b)(b − b0 ). Therefore, we get

∫_{a0}^{a} f (x) dx − ∫_{b0}^{b} f (x) dx ≤ f (a)(a − a0 ) − f (b)(b − b0 )
                                        = f (a)((b0 − a0 ) − (b − a))   (since f (a) = f (b))
                                        < 0.

Thus,

∫_{a0}^{b0} f (x) dx < 1 − α.
Note:
Example 11.2.2 is a special case of Theorem 11.2.4. However, Theorem 11.2.4 is not
immediately applicable in the following example since the length of that interval is
proportional to 1/a − 1/b (and not to b − a).
Example 11.2.5:
Let X1 , . . . , Xn ∼ N (µ, σ²), where µ is known. The obvious pivot for σ² is

Tσ² (X) = Σ(Xi − µ)² / σ² ∼ χ²n .

So

P( a < Σ(Xi − µ)²/σ² < b ) = 1 − α
⇐⇒ P( Σ(Xi − µ)²/b < σ² < Σ(Xi − µ)²/a ) = 1 − α.

We wish to minimize

L = (1/a − 1/b) Σ(Xi − µ)²

such that ∫_a^b fn (t) dt = 1 − α, where fn (t) is the pdf of a χ²n distribution.
Differentiating the constraint with respect to a, we get

fn (b) (db/da) − fn (a) = 0,

and

dL/da = ( −1/a² + (1/b²)(db/da) ) Σ(Xi − µ)² = ( −1/a² + fn (a)/(b² fn (b)) ) Σ(Xi − µ)².
We obtain a minimum if a2 fn (a) = b2 fn (b).
Note that in practice equal tails χ2n;α/2 and χ2n;1−α/2 are used, which do not result in shortest–
length CI’s. The reason for this selection is simple: When these tests were developed, com-
puters did not exist that could solve these equations numerically. People in general had to
rely on tabulated values. Manually solving the equation above for each case obviously wasn’t
a feasible solution.
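Today the equation is easy to attack numerically. A crude standard-library sketch (n = 10 and α = 0.05 are illustrative; grid/Riemann-sum accuracy only) confirming that the equal-tails interval is not the shortest in the 1/a − 1/b sense:

```python
from bisect import bisect_left
from math import exp, gamma

n, alpha = 10, 0.05

def pdf(t):                       # chi^2_n density
    return t ** (n / 2 - 1) * exp(-t / 2) / (2 ** (n / 2) * gamma(n / 2))

h = 0.002
ts = [i * h for i in range(1, 30001)]     # grid on (0, 60]
cum, acc = [], 0.0
for t in ts:                              # crude cdf by Riemann sum
    acc += pdf(t) * h
    cum.append(acc)

def interval(i):
    # interval starting at grid point i with coverage 1 - alpha, or None
    j = bisect_left(cum, cum[i] + (1 - alpha))
    if j >= len(ts):
        return None
    return 1 / ts[i] - 1 / ts[j], ts[i], ts[j]

cands = [interval(i) for i in range(500, 2500)]           # a in (1, 5]
L_min, a_star, b_star = min(c for c in cands if c is not None)
L_et, a_et, b_et = interval(bisect_left(cum, alpha / 2))  # equal tails
```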
Example 11.2.6:
Let X1 , . . . , Xn ∼ U (0, θ). Let M axn = max Xi = X(n) . Since Tn = M axn /θ has pdf
n t^{n−1} I(0,1) (t), which does not depend on θ, Tn can be selected as our pivot. The density
of Tn is strictly increasing for n ≥ 2, so we cannot find constants a and b as in Example
11.2.5.
We wish to minimize

L = M axn (1/a − 1/b)

such that ∫_a^b n t^{n−1} dt = bⁿ − aⁿ = 1 − α.
Differentiating the constraint with respect to b, we get

n b^{n−1} − n a^{n−1} (da/db) = 0 =⇒ da/db = b^{n−1}/a^{n−1},

and

dL/db = M axn ( −(1/a²)(da/db) + 1/b² ) = M axn ( −b^{n−1}/a^{n+1} + 1/b² )
      = M axn (a^{n+1} − b^{n+1})/(b² a^{n+1}) < 0 for 0 ≤ a < b ≤ 1.

Thus, L does not have a local minimum. However, since dL/db < 0, L is strictly decreasing
as a function of b. It is minimized when b = 1, i.e., when b is as large as possible. The
corresponding a is selected as a = α^{1/n} .
The shortest 1 − α level CI based on Tn is (M axn , α^{−1/n} M axn ). This is the same CI that
was already obtained in Example 11.1.7.
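A quick simulation check of the coverage of (Maxₙ, α^{−1/n} Maxₙ); θ and n are illustrative:

```python
import random

# Example 11.2.6: the interval (Max_n, alpha**(-1/n) * Max_n) should cover
# theta with probability exactly 1 - alpha.
random.seed(3)
theta, n, alpha, reps = 5.0, 8, 0.1, 20000
hits = 0
for _ in range(reps):
    mx = max(random.uniform(0, theta) for _ in range(n))
    hits += mx < theta < alpha ** (-1 / n) * mx
coverage = hits / reps
```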
11.3 Confidence Intervals and Hypothesis Tests
Conversely, if φ(x, µ0 ) is a family of size α tests of H0 : µ = µ0 , the set {µ0 | φ(x, µ0 ) fails to reject H0 }
is a level 1 − α confidence set for µ0 .
Theorem 11.3.2:
Denote H0 (θ0 ) for H0 : θ = θ0 , and H1 (θ0 ) for the alternative. Let A(θ0 ), θ0 ∈ Θ, denote the
acceptance region of a level–α test of H0 (θ0 ). For each possible observation x, define
S(x) = {θ0 : x ∈ A(θ0 )}. Then S(x) is a family of confidence sets at level 1 − α.
If, moreover, A(θ0 ) is UMP for (α, H0 (θ0 ), H1 (θ0 )), then S(x) minimizes Pθ (S(X) ∋ θ0 ) ∀θ ∈
H1 (θ0 ) among all 1 − α level families of confidence sets, i.e., S(x) is UMA.
Proof:
Example 11.3.3:
Let X be a rv that belongs to a one–parameter exponential family with pdf fθ(x) = exp( Q(θ)T(x) + S′(x) + D(θ) ), where Q(θ) is non–decreasing. We consider a test of H0 : θ = θ0 vs. H1 : θ < θ0. The acceptance region of a UMP size α test of H0 has the form A(θ0) = {x : T(x) > c(θ0)}.
Example 11.3.4:
Let X ∼ Exp(θ) with fθ(x) = (1/θ) e^(−x/θ) I(0,∞)(x), which belongs to a one–parameter exponential family. Then Q(θ) = −1/θ is non–decreasing and T(x) = x.
Note:
Just as we frequently restrict the class of tests (when UMP tests don’t exist), we can make
the same sorts of restrictions on CI’s.
Definition 11.3.5:
A family S(x) of confidence sets for a parameter θ is said to be unbiased at level 1 − α if
If S(x) is unbiased and minimizes Pθ(S(X) ∋ θ0) among all unbiased CI's at level 1 − α, it is called uniformly most accurate unbiased (UMAU).
Theorem 11.3.6:
Let A(θ0) be the acceptance region of a UMPU size α test of H0 : θ = θ0 vs. H1 : θ ≠ θ0 (for all θ0). Then S(x) = {θ : x ∈ A(θ)} is a UMAU family of confidence sets at level 1 − α.
Proof:
Theorem 11.3.7:
Let Θ be an interval on IR and fθ be the pdf of X. Let S(X) be a family of 1 − α level CI's, where S(X) = (θ̲(X), θ̄(X)), with θ̲ and θ̄ increasing functions of X and θ̄(X) − θ̲(X) a finite rv. Then it holds for all θ ∈ Θ that
Eθ(θ̄(X) − θ̲(X)) = ∫_{θ′≠θ} Pθ(S(X) ∋ θ′) dθ′.
Proof:
It holds that θ̄ − θ̲ = ∫_θ̲^θ̄ dθ′. Thus, for all θ ∈ Θ,
Eθ(θ̄(X) − θ̲(X)) = ∫_{IR^n} (θ̄(x) − θ̲(x)) fθ(x) dx
= ∫_{IR^n} ( ∫_{θ̲(x)}^{θ̄(x)} dθ′ ) fθ(x) dx
= ∫_{IR} ( ∫_{θ̄^(−1)(θ′)}^{θ̲^(−1)(θ′)} fθ(x) dx ) dθ′
= ∫_{IR} Pθ( X ∈ [θ̄^(−1)(θ′), θ̲^(−1)(θ′)] ) dθ′
= ∫_{IR} Pθ(S(X) ∋ θ′) dθ′
= ∫_{θ′≠θ} Pθ(S(X) ∋ θ′) dθ′.
Note:
Theorem 11.3.7 says that the expected length of the CI equals the probability that S(X) covers the false value θ′, integrated over all false values θ′.
Corollary 11.3.8:
If S(X) is UMAU, then Eθ (θ(X) − θ(X)) is minimized among all unbiased families of CI’s.
Proof:
In Theorem 11.3.7 we have shown that
Eθ(θ̄(X) − θ̲(X)) = ∫_{θ′≠θ} Pθ(S(X) ∋ θ′) dθ′.
Since a UMAU CI minimizes this probability for all θ′, the entire integral is minimized.
Example 11.3.9:
Let X1 , . . . , Xn ∼ N (µ, σ 2 ), where σ 2 > 0 is known.
By Example 11.2.2, (X̄ − z_{α/2} σ/√n, X̄ + z_{α/2} σ/√n) is the shortest 1 − α level CI for µ.
By Example 9.4.3, the equivalent test is UMPU. So by Theorem 11.3.6 this interval is UMAU
and by Corollary 11.3.8 it has shortest expected length as well.
Example 11.3.10:
Let X1 , . . . , Xn ∼ N (µ, σ 2 ), where µ and σ 2 > 0 are both unknown.
Note that
T(X, σ²) = (n − 1)S² / σ² = Tσ ∼ χ²_{n−1}.
Thus,
Rohatgi, Theorem 4(b), page 428–429, states that the related test is UMPU. Therefore, by
Theorem 11.3.6 and Corollary 11.3.8, our CI is UMAU with shortest expected length among
all unbiased intervals.
Note that this CI is different from the equal–tail CI based on Definition 10.2.1, III, and from
the shortest–length CI obtained in Example 11.2.5.
11.4 Bayes Confidence Intervals
Example 11.4.2:
Let X ∼ Bin(n, p) and π(p) ∼ U (0, 1).
h(p | x) = ( p^x (1 − p)^(n−x) / ∫_0^1 p^x (1 − p)^(n−x) dp ) I(0,1)(p)
= ( Γ(n + 2) / (Γ(x + 1) Γ(n − x + 1)) ) p^x (1 − p)^(n−x) I(0,1)(p)
⟹ p | x ∼ Beta(x + 1, n − x + 1).
Using the observed value for x and tables for incomplete beta integrals or a numerical ap-
proach, we can find λ1 and λ2 such that Pp|x (λ1 < p < λ2 ) = 1 − α. So (λ1 , λ2 ) is a credible
interval for p.
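Assuming scipy is available, λ1 and λ2 can be found directly from the Beta(x + 1, n − x + 1) posterior instead of from tables; n = 20 and x = 6 below are hypothetical illustration values:

```python
from scipy.stats import beta

n, x, alpha = 20, 6, 0.05
posterior = beta(x + 1, n - x + 1)   # p | x ~ Beta(x+1, n-x+1)

# equal-tail credible interval (lam1, lam2) with P(lam1 < p < lam2 | x) = 1 - alpha
lam1 = posterior.ppf(alpha / 2)
lam2 = posterior.ppf(1 - alpha / 2)
mass = posterior.cdf(lam2) - posterior.cdf(lam1)
```

By construction `mass` equals 1 − α up to numerical error.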
Note:
(i) The definitions and interpretations of credible intervals and confidence intervals are quite
different. Therefore, very different intervals may result.
(ii) We can often use Theorem 11.2.4 to find the shortest credible interval (if the precondi-
tions hold).
Example 11.4.3:
Let X1 , . . . , Xn be iid N (µ, 1) and π(µ) ∼ N (0, 1). We want to construct a Bayesian level
1 − α CI for µ.
By Definition 8.8.7, the posterior distribution of µ given x is
h(µ | x) = π(µ) f(x | µ) / g(x),
where
g(x) =
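The notes leave g(x) and the posterior to be derived. As a hedged sanity check, the standard conjugate result for this setting — µ | x ∼ N(n x̄/(n + 1), 1/(n + 1)) — can be verified by normalizing π(µ)f(x | µ) numerically on a grid (the data below are simulated illustration values, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
x = rng.normal(1.0, 1.0, size=n)
xbar = x.mean()

# unnormalized log posterior: log pi(mu) + log f(x | mu), evaluated on a fine grid
mu = np.linspace(-10, 10, 200_001)
log_post = -0.5 * mu**2 - 0.5 * ((x[:, None] - mu) ** 2).sum(axis=0)
w = np.exp(log_post - log_post.max())
w /= w.sum()                          # normalize (g(x) cancels)

post_mean = (w * mu).sum()
post_var = (w * mu**2).sum() - post_mean**2
# conjugate result: mu | x ~ N(n*xbar/(n+1), 1/(n+1))
```

The grid-based mean and variance agree with the conjugate formulas to high precision.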
12 Nonparametric Inference
Definition 12.1.1:
A statistical method which does not rely on assumptions about the distributional form of a rv (except, perhaps, that it is absolutely continuous, or purely discrete) is called a nonparametric or distribution–free method.
Note:
Unless otherwise specified, we make the following assumptions for the remainder of this chapter: Let X1, . . . , Xn be iid ∼ F, where F is unknown. Let P be the class of all possible distributions of X.
Definition 12.1.2:
A statistic T(X) is sufficient for a family of distributions P if the conditional distribution of X given T = t is the same for all F ∈ P.
Example 12.1.3:
Let X1 , . . . , Xn be absolutely continuous. Let T = (X(1) , . . . , X(n) ) be the order statistics.
It holds that
f(x | T = t) = 1/n!,
so T is sufficient for the family of absolutely continuous distributions on IR.
Definition 12.1.4:
A family of distributions P is complete if the only unbiased estimate of 0 is 0 itself, i.e.,
Definition 12.1.5:
A statistic T (X) is complete in relation to P if the class of induced distributions of T is
complete.
Theorem 12.1.6:
The order statistic (X(1), . . . , X(n)) is a complete sufficient statistic, provided that X1, . . . , Xn are of either (pure) discrete or (pure) continuous type.
Definition 12.1.7:
A parameter g(F ) is called estimable if it has an unbiased estimate, i.e., if there exists a
T (X) such that
EF (T (X)) = g(F ) ∀F ∈ P.
Example 12.1.8:
Let P be the class of distributions for which second moments exist. Then X̄ is unbiased for µ(F) = ∫ x dF(x). Thus, µ(F) is estimable.
Definition 12.1.9:
The degree m of an estimable parameter g(F) is the smallest sample size for which an unbiased estimate exists for all F ∈ P.
Lemma 12.1.10:
There exists a symmetric kernel for every estimable parameter.
Proof:
Let T(X1, . . . , Xm) be a kernel of g(F). Define
Ts(X1, . . . , Xm) = (1/m!) Σ T(Xi1, . . . , Xim),
where the summation is over all m! permutations (i1, . . . , im) of {1, . . . , m}.
Example 12.1.11:
(ii) E(I(c,∞) (X1 )) = PF (X > c), where c is a known constant. So g(F ) = PF (X > c) has
degree 1 with kernel I(c,∞) (X1 ).
(iii) There exists no T(X1) such that E(T(X1)) = σ²(F) = ∫ (x − µ(F))² dF(x).
Definition 12.1.12:
Let g(F) be an estimable parameter of degree m. Let X1, . . . , Xn be a sample of size n, n ≥ m. Given a kernel T(Xi1, . . . , Xim) of g(F), we define a U–statistic by
U(X1, . . . , Xn) = (1 / C(n,m)) Σ_c Ts(Xi1, . . . , Xim),
where Ts is defined as in Lemma 12.1.10 and the summation Σ_c is over all C(n,m) combinations of m integers i1 < . . . < im from {1, . . . , n}. U(X1, . . . , Xn) is symmetric in the Xi's and EF(U(X)) = g(F) for all F.
Example 12.1.13:
For estimating µ(F), with degree m = 1:
Symmetric kernel: Ts(Xi) = Xi, i = 1, . . . , n.
U–statistic:
Uµ(X) = (1 / C(n,1)) Σ_c Xi
= (1 · (n − 1)! / n!) Σ_c Xi
= (1/n) Σ_{i=1}^n Xi
= X̄.
For estimating σ²(F), with degree m = 2:
Symmetric kernel: Ts(Xi1, Xi2) = (1/2)(Xi1 − Xi2)², i1, i2 = 1, . . . , n, i1 ≠ i2.
U–statistic:
Uσ²(X) = (1 / C(n,2)) Σ_{i1<i2} (1/2)(Xi1 − Xi2)²
= (1 / C(n,2)) (1/4) Σ_{i1≠i2} (Xi1 − Xi2)²
= ((n − 2)! · 2! / n!) (1/4) Σ_{i1≠i2} (Xi1 − Xi2)²
= (1 / (2n(n − 1))) Σ_{i1} Σ_{i2≠i1} ( Xi1² − 2 Xi1 Xi2 + Xi2² )
= (1 / (2n(n − 1))) [ (n − 1) Σ_{i1=1}^n Xi1² − 2 (Σ_{i1=1}^n Xi1)(Σ_{i2=1}^n Xi2) + 2 Σ_{i=1}^n Xi² + (n − 1) Σ_{i2=1}^n Xi2² ]
= (1 / (2n(n − 1))) [ n Σ Xi1² − Σ Xi1² − 2 (Σ Xi1)² + 2 Σ Xi² + n Σ Xi2² − Σ Xi2² ]
= (1 / (n(n − 1))) [ n Σ_{i=1}^n Xi² − (Σ_{i=1}^n Xi)² ]
= (1 / (n − 1)) Σ_{i=1}^n (Xi − X̄)²
= S².
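The identity Uσ²(X) = S² derived above can be checked numerically; this is a sketch with arbitrary simulated data, not part of the notes:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
x = rng.normal(size=12)
n = len(x)

# U-statistic with symmetric kernel Ts(x1, x2) = (x1 - x2)^2 / 2
pairs = list(combinations(range(n), 2))
u = sum(0.5 * (x[i] - x[j]) ** 2 for i, j in pairs) / len(pairs)

s2 = x.var(ddof=1)   # sample variance S^2
```

Up to floating-point error, `u` and `s2` coincide for any data vector.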
Theorem 12.1.14:
Let P be the class of all absolutely continuous or all purely discrete distribution functions on
IR. Any estimable function g(F), F ∈ P, has a unique estimate that is unbiased and symmetric in the observations and has uniformly minimum variance among all unbiased estimates.
Proof:
Let X1, . . . , Xn iid ∼ F ∈ P, with T(X1, . . . , Xn) an unbiased estimate of g(F). We define
Ti = Ti(X1, . . . , Xn) = T(Xi1, Xi2, . . . , Xin), i = 1, 2, . . . , n!,
where (i1, . . . , in) runs through all n! permutations of {1, . . . , n},
and set T̄ = (1/n!) Σ_{i=1}^{n!} Ti. Then
EF(T̄) = g(F)
and
Var(T̄) = E(T̄²) − (E(T̄))²
= E[ ((1/n!) Σ_{i=1}^{n!} Ti)² ] − [g(F)]²
= (1/n!)² Σ_{i=1}^{n!} Σ_{j=1}^{n!} E(Ti Tj) − [g(F)]²
≤ (1/n!)² Σ_{i=1}^{n!} Σ_{j=1}^{n!} sqrt(E(Ti²)) sqrt(E(Tj²)) − [g(F)]²   (by the Cauchy–Schwarz inequality)
= E(T²) − [g(F)]²
= Var(T).
Corollary 12.1.15:
If T (X1 , . . . , Xn ) is unbiased for g(F ), F ∈ P, the corresponding U –statistic is an essentially
unique UMVUE.
Definition 12.1.16:
Suppose we have independent samples X1, . . . , Xm iid ∼ F ∈ P and Y1, . . . , Yn iid ∼ G ∈ P (G may or may not equal F). Let g(F, G) be an estimable function with unbiased estimator T(X1, . . . , Xk, Y1, . . . , Yl). Define
Ts(X1, . . . , Xk, Y1, . . . , Yl) = (1/(k! l!)) Σ_{PX} Σ_{PY} T(Xi1, . . . , Xik, Yj1, . . . , Yjl)
and the generalized U–statistic
U(X, Y) = (1 / (C(m,k) C(n,l))) Σ_{CX} Σ_{CY} Ts(Xi1, . . . , Xik, Yj1, . . . , Yjl),
where PX, PY denote the permutations and CX, CY the combinations of the X– and Y–indices, respectively.
Example 12.1.17:
Let X1 , . . . , Xm and Y1 , . . . , Yn be independent random samples from F and G, respectively,
with F, G ∈ P. We wish to estimate
g(F, G) = PF,G (X ≤ Y ).
Let us define
Zij = 1 if Xi ≤ Yj, and Zij = 0 if Xi > Yj,
for each pair Xi , Yj , i = 1, 2, . . . , m, j = 1, 2, . . . , n.
Then Σ_{i=1}^m Zij is the number of X's ≤ Yj, and Σ_{j=1}^n Zij is the number of Y's > Xi.
With k = l = 1 and kernel T(Xi, Yj) = Zij = I(Xi ≤ Yj), the corresponding U–statistic is
U(X, Y) = (1 / (C(m,1) C(n,1))) Σ_{CX} Σ_{CY} (1/(1! 1!)) Σ_{PX} Σ_{PY} T(Xi, Yj)
= ((m − 1)!(n − 1)! / (m! n!)) Σ Σ T(Xi, Yj)
= (1/(mn)) Σ_{i=1}^m Σ_{j=1}^n I(Xi ≤ Yj),
the Mann–Whitney estimator of g(F, G).
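A minimal numerical sketch of this estimator (the data are made-up illustration values):

```python
import numpy as np

x = np.array([1.0, 3.0])
y = np.array([2.0, 4.0])
m, n = len(x), len(y)

# U = (1/(mn)) sum_i sum_j I(X_i <= Y_j), the Mann-Whitney estimate of P(X <= Y)
u = (x[:, None] <= y[None, :]).mean()

# same thing with an explicit double loop
u_loop = sum((xi <= yj) for xi in x for yj in y) / (m * n)
```

Here three of the four pairs satisfy Xi ≤ Yj, so both computations give 3/4.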
12.2 Single-Sample Hypothesis Tests
Let X1, . . . , Xn be a sample from a distribution F. The problem of fit is to test the hypothesis that the sample X1, . . . , Xn is from some specified distribution against the alternative that it is from some other distribution, i.e., H0 : F = F0 vs. H1 : F(x) ≠ F0(x) for some x.
Definition 12.2.1:
Let X1, . . . , Xn iid ∼ F, and let the corresponding empirical cdf be
Fn*(x) = (1/n) Σ_{i=1}^n I(−∞,x](Xi).
The statistic
Dn = sup_x | Fn*(x) − F(x) |
is called the (two–sided) Kolmogorov–Smirnov (K–S) statistic; the corresponding one–sided statistics are
Dn+ = sup_x [Fn*(x) − F(x)] and Dn− = sup_x [F(x) − Fn*(x)].
Theorem 12.2.2:
For any continuous distribution F , the K–S statistics Dn , Dn− , Dn+ are distribution free.
Proof:
Let X(1) , . . . , X(n) be the order statistics of X1 , . . . , Xn , i.e., X(1) ≤ X(2) ≤ . . . ≤ X(n) , and
define X(0) = −∞ and X(n+1) = +∞.
Then,
Fn*(x) = i/n for X(i) ≤ x < X(i+1), i = 0, . . . , n.
Therefore,
Dn+ = max_{0≤i≤n} { sup_{X(i)≤x<X(i+1)} [ i/n − F(x) ] }
= max_{0≤i≤n} { i/n − inf_{X(i)≤x<X(i+1)} F(x) }
=(∗) max_{0≤i≤n} { i/n − F(X(i)) }
= max { max_{1≤i≤n} [ i/n − F(X(i)) ], 0 }.
(∗) holds since F is nondecreasing on [X(i), X(i+1)).
Note that Dn+ is a function of the F(X(i)) only. In order to make some inference about Dn+, the distribution of F(X(i)) must be known. We know from the Probability Integral Transformation (see Rohatgi, page 203, Theorem 1) that for a rv X with continuous cdf FX, it holds that FX(X) ∼ U(0, 1). Thus, F(X(i)) is the ith order statistic of a sample from U(0, 1), whatever F is. Therefore, the distribution of Dn+ does not depend on F.
Since
Dn = sup_x | Fn*(x) − F(x) | = max {Dn+, Dn−},
the distribution of Dn is also independent of F.
Theorem 12.2.3:
If F is continuous, then
P(Dn ≤ ν + 1/(2n)) =
  0, if ν ≤ 0;
  ∫_{1/(2n)−ν}^{1/(2n)+ν} ∫_{3/(2n)−ν}^{3/(2n)+ν} · · · ∫_{(2n−1)/(2n)−ν}^{(2n−1)/(2n)+ν} f(u) du_n · · · du_1, if 0 < ν < (2n−1)/(2n);
  1, if ν ≥ (2n−1)/(2n);
where
f(u) = f(u1, . . . , un) = n! if 0 < u1 < u2 < . . . < un < 1, and 0 otherwise,
is the joint pdf of the order statistics of a sample of size n from U(0, 1).
Note:
As Gibbons & Chakraborti (1992), page 108–109, point out, this result must be interpreted
carefully. Consider the case n = 2.
When 0 < ν < 1/4, the condition 0 < u1 < u2 < 1 holds automatically on the region of integration. Thus, for 0 < ν < 1/4, it holds that
P(D2 ≤ ν + 1/4) = 2! ∫_{1/4−ν}^{1/4+ν} ∫_{3/4−ν}^{3/4+ν} du2 du1
= 2! ∫_{1/4−ν}^{1/4+ν} [u2]_{3/4−ν}^{3/4+ν} du1
= 2! ∫_{1/4−ν}^{1/4+ν} 2ν du1
= 2! (2ν) [u1]_{1/4−ν}^{1/4+ν}
= 2! (2ν)².
For 1/4 ≤ ν < 3/4, the constraint 0 < u1 < u2 < 1 cuts into the rectangle of integration. Thus, for 1/4 ≤ ν < 3/4, it holds that
P(D2 ≤ ν + 1/4) = 2! ∫_{1/4−ν}^{1/4+ν} ∫_{3/4−ν}^{3/4+ν} I(0 < u1 < u2 < 1) du2 du1
= 2! ∫_{3/4−ν}^{1/4+ν} ∫_{u1}^{1} du2 du1 + 2! ∫_{0}^{3/4−ν} ∫_{3/4−ν}^{1} du2 du1
= 2 [ ∫_{3/4−ν}^{1/4+ν} (1 − u1) du1 + ∫_{0}^{3/4−ν} (1 − 3/4 + ν) du1 ]
= 2 [ (u1 − u1²/2) |_{3/4−ν}^{1/4+ν} + (1/4 + ν)(3/4 − ν) ]
= 2 [ (ν − 1/4) + (3/16 + ν/2 − ν²) ]
= 2 [ −ν² + (3/2)ν − 1/16 ]
= −2ν² + 3ν − 1/8.
Combining these results gives
P(D2 ≤ ν + 1/4) =
  0, if ν ≤ 0;
  2! (2ν)², if 0 < ν < 1/4;
  −2ν² + 3ν − 1/8, if 1/4 ≤ ν < 3/4;
  1, if ν ≥ 3/4.
Theorem 12.2.4:
Let F be a continuous cdf. Then it holds ∀z ≥ 0:
lim_{n→∞} P(Dn ≤ z/√n) = L1(z) = 1 − 2 Σ_{i=1}^∞ (−1)^(i−1) exp(−2 i² z²).
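Assuming scipy is available, the series L1(z) can be compared against scipy's implementation of the same limiting law (scipy.stats.kstwobign is the limiting distribution of √n · Dn):

```python
import numpy as np
from scipy.stats import kstwobign

def L1(z, terms=100):
    # L1(z) = 1 - 2 * sum_{i>=1} (-1)^(i-1) exp(-2 i^2 z^2)
    i = np.arange(1, terms + 1)
    return 1 - 2 * np.sum((-1.0) ** (i - 1) * np.exp(-2 * i**2 * z**2))

diffs = [abs(L1(z) - kstwobign.cdf(z)) for z in (0.5, 1.0, 1.36, 2.0)]
```

The two agree to many decimal places; z = 1.36 is the familiar asymptotic 5% critical point of √n · Dn.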
Theorem 12.2.5:
Let F be a continuous cdf. Then it holds:
P(Dn+ ≤ z) = P(Dn− ≤ z) =
  0, if z ≤ 0;
  ∫_{1−z}^{1} ∫_{(n−1)/n − z}^{u_n} · · · ∫_{2/n − z}^{u_3} ∫_{1/n − z}^{u_2} f(u) du_1 · · · du_n, if 0 < z < 1;
  1, if z ≥ 1.
Note:
It should be obvious that the statistics Dn+ and Dn− have the same distribution because of
symmetry.
Theorem 12.2.6:
Let F be a continuous cdf. Then it holds ∀z ≥ 0:
lim_{n→∞} P(Dn+ ≤ z/√n) = lim_{n→∞} P(Dn− ≤ z/√n) = L2(z) = 1 − exp(−2z²).
Corollary 12.2.7:
Let Vn = 4n(Dn+)². Then it holds that Vn →d χ²₂, i.e., this transformation of Dn+ has an asymptotic χ²₂ distribution.
Proof:
Let x ≥ 0 and write x = 4z² with z ≥ 0. Then it follows:
lim_{n→∞} P(Vn ≤ x) = lim_{n→∞} P(Vn ≤ 4z²)
= lim_{n→∞} P(4n(Dn+)² ≤ 4z²)
= lim_{n→∞} P(√n Dn+ ≤ z)
= 1 − exp(−2z²)   (by Theorem 12.2.6)
= 1 − exp(−x/2).
Thus, lim_{n→∞} P(Vn ≤ x) = 1 − exp(−x/2) for x ≥ 0. Note that this is the cdf of a χ²₂ distribution.
Definition 12.2.8:
Let Dn;α be the smallest value such that P(Dn > Dn;α) ≤ α. Likewise, let Dn;α+ be the smallest value such that P(Dn+ > Dn;α+) ≤ α.
Note:
Rohatgi, Table 7, page 661, gives values of Dn;α and Dn;α+ for selected values of α and small n. Theorems 12.2.4 and 12.2.6 allow the approximation of Dn;α and Dn;α+ for large n.
Example 12.2.9:
Let X1, . . . , X10 ∼ C(1, 0). We want to test whether H0 : X ∼ N(0, 1). The ordered sample is
−1.42, −0.43, −0.19, 0.26, 0.30, 0.45, 0.64, 0.96, 1.97, and 4.68.
The results for the K–S test have been obtained through the following S–Plus session, i.e., D10+ = 0.02219616, D10− = 0.3025681, and D10 = 0.3025681:
> x _ c(-1.42, -0.43, -0.19, 0.26, 0.30, 0.45, 0.64, 0.96, 1.97, 4.68)
> FX _ pnorm(x)
> FX
[1] 0.07780384 0.33359782 0.42465457 0.60256811 0.61791142 0.67364478
[7] 0.73891370 0.83147239 0.97558081 0.99999857
> Dp _ (1:10)/10 - FX
> Dp
[1] 2.219616e-02 -1.335978e-01 -1.246546e-01 -2.025681e-01 -1.179114e-01
[6] -7.364478e-02 -3.891370e-02 -3.147239e-02 -7.558081e-02 1.434375e-06
> Dm _ FX - (0:9)/10
> Dm
[1] 0.07780384 0.23359782 0.22465457 0.30256811 0.21791142 0.17364478
[7] 0.13891370 0.13147239 0.17558081 0.09999857
> max(Dp)
[1] 0.02219616
> max(Dm)
[1] 0.3025681
> max(max(Dp), max(Dm))
[1] 0.3025681
>
> ks.gof(x, alternative = "two.sided", mean = 0, sd = 1)
data: x
ks = 0.3026, p-value = 0.2617
alternative hypothesis:
True cdf is not the normal distn. with the specified parameters
Using Rohatgi, Table 7, page 661, we have to use D10;0.20 = 0.323 for α = 0.20. Since D10 = 0.3026 < 0.323 = D10;0.20, it follows that p > 0.20. The K–S test does not reject H0 at level α = 0.20. As S–Plus shows, the precise p–value is even p = 0.2617.
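The same computation can be replicated with scipy's kstest (a hedged modern equivalent of the S–Plus call above; the p–value approximation mode may differ slightly between implementations):

```python
from scipy.stats import kstest

x = [-1.42, -0.43, -0.19, 0.26, 0.30, 0.45, 0.64, 0.96, 1.97, 4.68]
res = kstest(x, 'norm')   # two-sided K-S test of H0: X ~ N(0,1)
# res.statistic reproduces D10 = 0.3025681; res.pvalue is close to 0.2617
```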
Note:
Comparison between χ2 and K–S goodness of fit tests:
• K–S uses all available data; χ2 bins the data and loses information
• K–S works for all sample sizes; χ2 requires large sample sizes
• it is more difficult to modify K–S for estimated parameters; χ2 can be easily adapted
for estimated parameters
• K–S is “conservative” for discrete data, i.e., it tends to accept H0 for such data
• the order matters for K–S; χ2 is better for unordered categorical data
12.3 More on Order Statistics
Definition 12.3.1:
Let F be a continuous cdf. A tolerance interval for F with tolerance coefficient γ is
a random interval such that the probability is γ that this random interval covers at least a
specified percentage 100p% of the distribution.
Theorem 12.3.2:
If order statistics X(r) < X(s) are used as the endpoints for a tolerance interval for a continuous
cdf F , it holds that
γ = Σ_{i=0}^{s−r−1} C(n,i) p^i (1 − p)^(n−i).
Proof:
According to Definition 12.3.1, it holds that
γ = P_{X(r),X(s)}( P_X(X(r) < X < X(s)) ≥ p ).
Now,
P_X(X(r) < X < X(s)) = F(X(s)) − F(X(r)) = U(s) − U(r),
where U(s) and U(r) are the order statistics of a U(0, 1) distribution. Thus,
γ = P_{X(r),X(s)}( P_X(X(r) < X < X(s)) ≥ p ) = P(U(s) − U(r) ≥ p).
By Theorem 4.4.4, we can determine the joint distribution of order statistics and calculate γ as
γ = ∫_p^1 ∫_0^{y−p} ( n! / ((r − 1)!(s − r − 1)!(n − s)!) ) x^(r−1) (y − x)^(s−r−1) (1 − y)^(n−s) dx dy.
Rather than solving this integral directly, we make the transformation
U = U(s) − U(r)
V = U(s) .
and the marginal pdf of U is
fU(u) = ∫ fU,V(u, v) dv
= ( n! / ((r − 1)!(s − r − 1)!(n − s)!) ) u^(s−r−1) I(0,1)(u) ∫_u^1 (v − u)^(r−1) (1 − v)^(n−s) dv
=(A) ( n! / ((r − 1)!(s − r − 1)!(n − s)!) ) u^(s−r−1) (1 − u)^(n−s+r) I(0,1)(u) ∫_0^1 t^(r−1) (1 − t)^(n−s) dt
= ( n! / ((r − 1)!(s − r − 1)!(n − s)!) ) ( (r − 1)!(n − s)! / (n − s + r)! ) u^(s−r−1) (1 − u)^(n−s+r) I(0,1)(u)
= ( n! / ((n − s + r)!(s − r − 1)!) ) u^(s−r−1) (1 − u)^(n−s+r) I(0,1)(u)
= n C(n−1, s−r−1) u^(s−r−1) (1 − u)^(n−s+r) I(0,1)(u).
Here ∫_0^1 t^(r−1) (1 − t)^(n−s) dt = B(r, n − s + 1), and (A) is based on the transformation t = (v − u)/(1 − u), i.e., v − u = (1 − u)t, 1 − v = 1 − u − (1 − u)t = (1 − u)(1 − t), and dv = (1 − u) dt.
It follows that
γ = P(U(s) − U(r) ≥ p)
= P(U ≥ p)
= ∫_p^1 n C(n−1, s−r−1) u^(s−r−1) (1 − u)^(n−s+r) du
=(B) P(Y < s − r), where Y ∼ Bin(n, p)
= Σ_{i=0}^{s−r−1} C(n,i) p^i (1 − p)^(n−i).
(B) holds due to Rohatgi, Remark 3 after Theorem 5.3.18, page 216, since for X ∼ Bin(n, p) it holds that
P(X < k) = n ∫_p^1 C(n−1, k−1) x^(k−1) (1 − x)^(n−k) dx.
Example 12.3.3:
Let s = n and r = 1. Then,
γ = Σ_{i=0}^{n−2} C(n,i) p^i (1 − p)^(n−i) = 1 − p^n − n p^(n−1) (1 − p).
For n = 10 and p = 0.8, this gives γ = 1 − 0.8^10 − 10 · 0.8^9 · 0.2 ≈ 0.624, i.e., (X(1), X(10)) defines a 62.4% tolerance interval for 80% probability.
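A small numerical sketch of the tolerance coefficient formula (using n = 10 and p = 0.8 from the example above):

```python
from math import comb

def tolerance_gamma(n, p, r, s):
    # gamma = sum_{i=0}^{s-r-1} C(n,i) p^i (1-p)^(n-i)
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(s - r))

g = tolerance_gamma(10, 0.8, 1, 10)   # tolerance interval (X_(1), X_(10))
```

For s = n and r = 1 this matches the closed form 1 − p^n − n p^(n−1)(1 − p) ≈ 0.624.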
Theorem 12.3.4:
Let kp be the pth quantile of a continuous cdf F. Let X(1), . . . , X(n) be the order statistics of a sample of size n from F. Then it holds that
P(X(r) ≤ kp ≤ X(s)) = Σ_{i=r}^{s−1} C(n,i) p^i (1 − p)^(n−i).
Proof:
It holds that
Therefore,
Corollary 12.3.5:
(X(r), X(s)) is a level Σ_{i=r}^{s−1} C(n,i) p^i (1 − p)^(n−i) confidence interval for kp.
Example 12.3.6:
Let n = 10. We want a 95% confidence interval for the median, i.e., kp where p = 1/2.
We get the following probabilities pr,s = Σ_{i=r}^{s−1} C(n,i) p^i (1 − p)^(n−i) that (X(r), X(s)) covers k0.5:

         s=2    s=3    s=4    s=5    s=6    s=7    s=8    s=9    s=10
  r=1    0.01   0.05   0.17   0.38   0.62   0.83   0.94   0.99   0.998
  r=2           0.04   0.16   0.37   0.61   0.82   0.93   0.98   0.99
  r=3                  0.12   0.32   0.57   0.77   0.89   0.93   0.94
  r=4                         0.21   0.45   0.66   0.77   0.82   0.83
  r=5                                0.25   0.45   0.57   0.61   0.62
  r=6                                       0.21   0.32   0.37   0.38
  r=7                                              0.12   0.16   0.17
  r=8                                                     0.04   0.05
  r=9                                                            0.01
Only the random intervals (X(1) , X(9) ), (X(1) , X(10) ), (X(2) , X(9) ), and (X(2) , X(10) ) give the
desired coverage probability. Therefore, we use the one that comes closest to 95%, i.e.,
(X(2) , X(9) ), as the 95% confidence interval for the median.
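The entries of the table above can be reproduced with a short computation (a sketch; only two entries are checked here):

```python
from math import comb

def coverage(n, r, s, p=0.5):
    # P(X_(r) <= k_p <= X_(s)) = sum_{i=r}^{s-1} C(n,i) p^i (1-p)^(n-i)
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(r, s))

p29 = coverage(10, 2, 9)     # entry r=2, s=9  -> about 0.98
p110 = coverage(10, 1, 10)   # entry r=1, s=10 -> about 0.998
```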
13 Some Results from Sampling
Definition 13.1.1:
Let Ω be a population of size N with mean µ and variance σ². A sampling method (of size n) is called simple if the set S of possible samples contains all combinations of n elements of Ω (without repetition) and the probability for each sample s ∈ S to be selected depends only on n, i.e., p(s) = 1/C(N,n) ∀s ∈ S. Then we call s ∈ S a simple random sample (SRS) of size n.
Theorem 13.1.2:
Let Ω be a population of size N with mean µ and variance σ². Let Y : Ω → IR be a measurable function. Let ni be the total number of times the value ỹi occurs in the population and pi = ni/N the relative frequency with which ỹi occurs in the population. Let (y1, . . . , yn) be a SRS of size n with respect to Y, where P(Y = ỹi) = pi = ni/N.
Note:
(i) In Sampling, many authors use capital letters to denote properties of the population
and small letters to denote properties of the random sample. In particular, xi ’s and yi ’s
are considered as random variables related to the sample. They are not seen as specific
realizations.
In this notation,
µ = (1/N) Σ_i ni ỹi and
σ² = (1/N) Σ_i ni (ỹi − µ)² = (1/N) Σ_i ni ỹi² − µ².
Theorem 13.1.3:
Let the same conditions hold as in Theorem 13.1.2. Let ȳ = (1/n) Σ_{i=1}^n yi be the sample mean of a SRS of size n. Then it holds:
(i) E(ȳ) = µ, i.e., the sample mean is unbiased for the population mean µ.
(ii) Var(ȳ) = (1/n) ((N − n)/(N − 1)) σ² = ((1 − f)/n) (N/(N − 1)) σ², where f = n/N.
Proof:
(i)
E(ȳ) = (1/n) Σ_{i=1}^n E(yi) = µ, since E(yi) = µ ∀i.
(ii)
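Both parts of Theorem 13.1.3 can be verified exactly by enumerating all C(N, n) samples of a tiny made-up population (the population values below are arbitrary):

```python
from itertools import combinations
from statistics import mean

pop = [3.0, 7.0, 8.0, 12.0, 15.0]      # small population, N = 5
N, n = len(pop), 2
mu = mean(pop)
sigma2 = sum((y - mu) ** 2 for y in pop) / N

# all C(N, n) equally likely SRS's and their sample means
means = [mean(s) for s in combinations(pop, n)]
e_ybar = mean(means)
var_ybar = sum((m - e_ybar) ** 2 for m in means) / len(means)

formula = (1 / n) * (N - n) / (N - 1) * sigma2   # Theorem 13.1.3 (ii)
```

The enumerated E(ȳ) equals µ, and the enumerated variance equals the finite-population formula exactly.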
Theorem 13.1.4:
Let ȳn be the sample mean of a SRS of size n. Then it holds that
(ȳn − µ) / sqrt( ((1 − f)/n) (N/(N − 1)) σ² ) →d N(0, 1),
where N → ∞ and f = n/N is a constant.
In particular, when the yi's are 0–1–distributed with E(yi) = P(yi = 1) = p ∀i, then it holds that
(ȳn − p) / sqrt( ((1 − f)/n) (N/(N − 1)) p(1 − p) ) →d N(0, 1),
where N → ∞ and f = n/N is a constant.
13.2 Stratified Random Samples
Definition 13.2.1:
Let Ω be a population of size N that is split into m disjoint sets Ωj, called strata, of sizes Nj, j = 1, . . . , m, where N = Σ_{j=1}^m Nj. If we independently draw a random sample of size nj in each stratum, we speak of a stratified random sample.
Note:
(i) The random samples in each stratum are not always SRS's.
(ii) Stratified random samples are used in practice as a means to reduce the sample variance in the case that the data within each stratum are homogeneous and the data among different strata are heterogeneous.
(iii) Frequently used strata in practice are gender, state (or county), income range, ethnic
background, etc.
Definition 13.2.2:
Let Y : Ω → IR be a measurable function. In case of a stratified random sample, we use the
following notation:
(i) Yj = Σ_{k=1}^{Nj} Yjk the total in the jth stratum,
(ii) µj = (1/Nj) Yj the mean in the jth stratum,
(iii) µ = (1/N) Σ_{j=1}^m Nj µj the expectation (or grand mean),
(iv) Nµ = Σ_{j=1}^m Yj = Σ_{j=1}^m Σ_{k=1}^{Nj} Yjk the total,
(v) σj² = (1/Nj) Σ_{k=1}^{Nj} (Yjk − µj)² the variance in the jth stratum, and
(vi) σ² = (1/N) Σ_{j=1}^m Σ_{k=1}^{Nj} (Yjk − µ)² the variance.
(vii) We denote an (ordered) sample in Ωj of size nj as (yj1, . . . , yjnj) and ȳj = (1/nj) Σ_{k=1}^{nj} yjk the sample mean in the jth stratum.
Theorem 13.2.3:
Let the same conditions hold as in Definitions 13.2.1 and 13.2.2. Let µ̂j be an unbiased estimate of µj and V̂ar(µ̂j) be an unbiased estimate of Var(µ̂j). Then it holds:
(i) µ̂ = (1/N) Σ_{j=1}^m Nj µ̂j is unbiased for µ, and
Var(µ̂) = (1/N²) Σ_{j=1}^m Nj² Var(µ̂j).
(ii) V̂ar(µ̂) = (1/N²) Σ_{j=1}^m Nj² V̂ar(µ̂j) is unbiased for Var(µ̂).
Proof:
(i)
Theorem 13.2.4:
Let the same conditions hold as in Theorem 13.2.3. If we draw a SRS in each stratum, then it holds:
(i) µ̂ = (1/N) Σ_{j=1}^m Nj ȳj is unbiased for µ, where ȳj = (1/nj) Σ_{k=1}^{nj} yjk, j = 1, . . . , m, and
Var(µ̂) = (1/N²) Σ_{j=1}^m Nj² ((1 − fj)/nj) (Nj/(Nj − 1)) σj², where fj = nj/Nj.
(ii) V̂ar(µ̂) = (1/N²) Σ_{j=1}^m Nj² ((1 − fj)/nj) sj² is unbiased for Var(µ̂), where
sj² = (1/(nj − 1)) Σ_{k=1}^{nj} (yjk − ȳj)².
Proof:
Definition 13.2.5:
Let the same conditions hold as in Definitions 13.2.1 and 13.2.2. If the sample in each stratum is of size nj = n Nj/N, j = 1, . . . , m, where n is the total sample size, then we speak of proportional selection.
Note:
(i) In the case of proportional selection, it holds that fj = nj/Nj = n/N = f, j = 1, . . . , m.
(ii) Proportional strata cannot always be obtained for each combination of m, n, and N.
Theorem 13.2.6:
Let the same conditions hold as in Definition 13.2.5. If we draw a SRS in each stratum, then it holds in case of proportional selection that
Var(µ̂) = (1/N²) ((1 − f)/f) Σ_{j=1}^m Nj σ̃j²,
where σ̃j² = (Nj/(Nj − 1)) σj².
Proof:
The proof follows directly from Theorem 13.2.4 (i).
Theorem 13.2.7:
If we draw (1) a stratified random sample that consists of SRS's of sizes nj under proportional selection and (2) a SRS of size n = Σ_{j=1}^m nj from the same population, then it holds that
Var(ȳ) − Var(µ̂) = (1/n) ((N − n)/(N(N − 1))) [ Σ_{j=1}^m Nj (µj − µ)² − (1/N) Σ_{j=1}^m (N − Nj) σ̃j² ].
Proof:
See Homework.
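The homework claim can be checked by exact enumeration on a small artificial population (the strata values below are arbitrary illustration values; equal stratum sizes keep the proportional allocation nj = 1 exact):

```python
from itertools import combinations, product
from statistics import mean

strata = [[0.0, 4.0, 5.0], [9.0, 10.0, 14.0]]   # m = 2 strata, N_j = 3 each
pop = [y for s in strata for y in s]
N, m, n = len(pop), len(strata), 2              # proportional: n_j = n*N_j/N = 1
mu = mean(pop)

def var_of(vals):
    mv = mean(vals)
    return sum((v - mv) ** 2 for v in vals) / len(vals)

# exact Var(ybar) for a SRS of size n from the whole population
var_srs = var_of([mean(s) for s in combinations(pop, n)])

# exact Var(mu_hat) for the stratified sample: one draw per stratum,
# mu_hat = (1/N) sum N_j ybar_j = mean of the two draws here (equal N_j)
var_strat = var_of([mean(draw) for draw in product(*strata)])

mu_j = [mean(s) for s in strata]
sig2_j = [sum((y - mj) ** 2 for y in s) / len(s) for s, mj in zip(strata, mu_j)]
sig2t_j = [len(s) / (len(s) - 1) * v for s, v in zip(strata, sig2_j)]
Nj = [len(s) for s in strata]

rhs = (1 / n) * (N - n) / (N * (N - 1)) * (
    sum(Nj[j] * (mu_j[j] - mu) ** 2 for j in range(m))
    - (1 / N) * sum((N - Nj[j]) * sig2t_j[j] for j in range(m))
)
```

The enumerated difference Var(ȳ) − Var(µ̂) matches the right-hand side exactly (here it is positive, since the strata means differ strongly).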
14 Some Results from Sequential Statistical Inference
Example 14.1.1:
A particular machine produces a large number of items every day. Each item can be either
“defective” or “non–defective”. The unknown proportion of defective items in the production
of a particular day is p.
Let (X1, . . . , Xm) be a sample from the daily production, where xi = 1 when the item is defective and xi = 0 when the item is non–defective. Obviously, Sm = Σ_{i=1}^m Xi ∼ Bin(m, p) denotes the total number of defective items in the sample (assuming that m is small compared to the daily production).
However, it might be more beneficial to sample the items sequentially (e.g., take items # 57, 623, 1005, 1286, 2663, etc.) and stop the machine as soon as it becomes obvious that it produces too many defective items. (Alternatively, we could also stop the time–consuming and expensive process of determining whether items are defective or non–defective once it has become impossible to surpass a certain proportion of defectives.) For example, if for some j < m it already holds that sj > c, then we could stop (and immediately call maintenance) and reject H0 after only j observations.
More formally, let us define T = min{j | Sj > c} and T 0 = min{T, m}. We can now con-
sider a decision rule that stops with the sampling process at random time T 0 and rejects H0 if
T ≤ m. Thus, if we consider R0 = {(x1 , . . . , xm ) | t ≤ m} and R1 = {(x1 , . . . , xm ) | sm > c}
as critical regions of two tests Φ0 and Φ1 , then these two tests are equivalent.
Definition 14.1.2:
Let Θ be the parameter space and A the set of actions the statistician can take. We assume
that the rv’s X1 , X2 , . . . are observed sequentially and iid with common pdf (or pmf) fθ (x).
A sequential decision procedure is defined as follows:
(i) A stopping rule specifies whether an element of A should be chosen without taking
any further observation. If at least one observation is taken, this rule specifies for every
set of observed values (x1 , x2 , . . . , xn ), n ≥ 1, whether to stop sampling and choose an
action in A or to take another observation xn+1 .
(ii) A decision rule specifies the decision to be taken. If no observation has been taken, then we take action d0 ∈ A. If n ≥ 1 observations have been taken, then we take action dn(x1, . . . , xn) ∈ A, where dn(x1, . . . , xn) specifies the action that has to be taken for the set (x1, . . . , xn) of observed values. Once an action has been taken, the sampling process is stopped.
Note:
In the remainder of this chapter, we assume that the statistician takes at least one observation.
Definition 14.1.3:
Let Rn ⊆ IR^n, n = 1, 2, . . ., be a sequence of Borel–measurable sets such that the sampling process is stopped after observing X1 = x1, X2 = x2, . . . , Xn = xn if (x1, . . . , xn) ∈ Rn. If (x1, . . . , xn) ∉ Rn, then another observation xn+1 is taken. The sets Rn, n = 1, 2, . . ., are called stopping regions.
Definition 14.1.4:
With every sequential stopping rule we associate a stopping random variable N which
takes on the values 1, 2, 3, . . .. Thus, N is a rv that indicates the total number of observations
taken before the sampling is stopped.
Note:
We use the (sloppy) notation {N = n} to denote the event that sampling is stopped after observing exactly n values x1, . . . , xn (i.e., sampling is not stopped before taking n samples). Then the following equalities hold:
{N = 1} = R1
{N = n} = {(x1, . . . , xn) ∈ IR^n | sampling is stopped after n observations but not before}
= (R1 ∪ R2 ∪ . . . ∪ Rn−1)^c ∩ Rn
= R1^c ∩ R2^c ∩ . . . ∩ R(n−1)^c ∩ Rn
{N ≤ n} = ∪_{k=1}^n {N = k}
Here we will only consider closed sequential sampling procedures, i.e., procedures where sampling eventually stops with probability 1, i.e.,
P(N < ∞) = 1, or equivalently P(N = ∞) = 1 − P(N < ∞) = 0.
Wald's Equation: If X1, X2, . . . are iid rv's with E(| X1 |) < ∞, N is a stopping rv with E(N) < ∞, and SN = Σ_{n=1}^N Xn, then E(SN) = E(X1) E(N).
Proof:
Define a sequence of rv's Yi, i = 1, 2, . . ., where
Yi = 1 if no decision is reached up to the (i − 1)th stage, i.e., N > i − 1, and Yi = 0 otherwise.
Consider the rv Σ_{n=1}^∞ Xn Yn. Obviously, it holds that
SN = Σ_{n=1}^∞ Xn Yn.
Thus, it follows that
E(SN) = E( Σ_{n=1}^∞ Xn Yn ).   (∗)
It holds that
Σ_{n=1}^∞ E(| Xn Yn |) = Σ_{n=1}^∞ E(| Xn |) E(| Yn |)   (Xn and Yn are independent, since Yn depends only on X1, . . . , Xn−1)
= E(| X1 |) Σ_{n=1}^∞ P(N ≥ n)
= E(| X1 |) Σ_{n=1}^∞ Σ_{k=n}^∞ P(N = k)
=(A) E(| X1 |) Σ_{n=1}^∞ n P(N = n)
= E(| X1 |) E(N)
< ∞.
(A) holds by interchanging the order of summation: the pairs (n, k) with k ≥ n are
n = 1: k = 1, 2, 3, . . .
n = 2: k = 2, 3, . . .
n = 3: k = 3, . . .
so each k is counted exactly k times.
We may therefore interchange the expectation and summation signs in (∗) and get
E(SN) = E( Σ_{n=1}^∞ Xn Yn )
= Σ_{n=1}^∞ E(Xn Yn)
= Σ_{n=1}^∞ E(Xn) E(Yn)
= E(X1) Σ_{n=1}^∞ P(N ≥ n)
= E(X1) E(N).
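Wald's equation can be illustrated by simulation; here Xi ∼ Bernoulli(p) and N is the first time the partial sum reaches k, so SN = k by construction and E(N) = k/p (a sketch with an arbitrary seed and parameters):

```python
import numpy as np

rng = np.random.default_rng(4)
p, k, reps = 0.5, 5, 20_000

# N = k + (number of failures before the k-th success) ~ k + NegBin(k, p)
Ns = rng.negative_binomial(k, p, size=reps) + k
E_N = Ns.mean()
E_X1 = p
E_SN = float(k)          # S_N = k in every replication, by construction

# Wald: E(S_N) = E(X_1) * E(N), and here E(N) = k/p
```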
14.2 Sequential Probability Ratio Tests
Definition 14.2.1:
Let X1 , X2 , . . . be a sequence of iid rv’s with common pdf (or pmf) fθ (x). We want to test a
simple hypothesis H0 : X ∼ fθ0 vs. a simple alternative H1 : X ∼ fθ1 when the observations
are taken sequentially.
Let f0n and f1n denote the joint pdf's (or pmf's) of X1, . . . , Xn under H0 and H1, respectively, i.e.,
f0n(x1, . . . , xn) = Π_{i=1}^n fθ0(xi) and f1n(x1, . . . , xn) = Π_{i=1}^n fθ1(xi).
Finally, let
λn(x1, . . . , xn) = f1n(x) / f0n(x),
where x = (x1, . . . , xn). Then a sequential probability ratio test (SPRT) for testing H0 vs. H1 is the following decision rule:
(i) If λn(x) ≥ A, then stop sampling and reject H0.
(ii) If λn(x) ≤ B, then stop sampling and accept H0.
(iii) If B < λn(x) < A, then continue sampling by taking another observation xn+1.
Note:
(i) In practice it is often more convenient to work with log λn(x) instead of using λn(x). Obviously, we now have to use constants b = log B and a = log A instead of the original constants B and A.
(ii) A and B (where A > B) are constants such that the SPRT will have strength (α, β),
where
α = P (Type I error) = P (Reject H0 | H0 )
and
β = P (Type II error) = P (Accept H0 | H1 ).
If N is the stopping rv, then
Example 14.2.2:
Let X1 , X2 , . . . be iid N (µ, σ 2 ), where µ is unknown and σ 2 > 0 is known. We want to test
H0 : µ = µ0 vs. H1 : µ = µ1 , where µ0 < µ1 .
log λn(x) = (1/(2σ²)) Σ_{i=1}^n ( −2xi µ0 + µ0² + 2xi µ1 − µ1² )
= (1/(2σ²)) ( 2(µ1 − µ0) Σ_{i=1}^n xi + n(µ0² − µ1²) )
= ((µ1 − µ0)/σ²) ( Σ_{i=1}^n xi − n (µ0 + µ1)/2 ).
We decide for H0 if
log λn(x) ≤ b
⟺ ((µ1 − µ0)/σ²) ( Σ_{i=1}^n xi − n (µ0 + µ1)/2 ) ≤ b
⟺ Σ_{i=1}^n xi ≤ n (µ0 + µ1)/2 + b*,
where b* = (σ²/(µ1 − µ0)) b.
We decide for H1 if
log λn(x) ≥ a
⟺ ((µ1 − µ0)/σ²) ( Σ_{i=1}^n xi − n (µ0 + µ1)/2 ) ≥ a
⟺ Σ_{i=1}^n xi ≥ n (µ0 + µ1)/2 + a*,
where a* = (σ²/(µ1 − µ0)) a.
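A minimal SPRT implementation following the derivation above (a sketch, not the notes' code; µ0, µ1, σ, α, β, the seed, and the cap of 500 observations per run are illustration choices, with the bounds A′ = (1 − β)/α and B′ = β/(1 − α) of Theorem 14.2.4 below):

```python
import math
import numpy as np

def sprt_normal(xs, mu0, mu1, sigma, alpha, beta):
    # a = log A', b = log B' with A' = (1-beta)/alpha, B' = beta/(1-alpha)
    a = math.log((1 - beta) / alpha)
    b = math.log(beta / (1 - alpha))
    s, mid = 0.0, (mu0 + mu1) / 2
    for n, x in enumerate(xs, start=1):
        s += x
        logl = (mu1 - mu0) / sigma**2 * (s - n * mid)   # log lambda_n(x)
        if logl >= a:
            return 'reject H0', n
        if logl <= b:
            return 'accept H0', n
    return 'no decision', len(xs)

rng = np.random.default_rng(5)
mu0, mu1, sigma, alpha, beta = 0.0, 1.0, 1.0, 0.05, 0.05
reps = 2000
results = [sprt_normal(rng.normal(mu0, sigma, 500), mu0, mu1, sigma, alpha, beta)
           for _ in range(reps)]
reject_rate = sum(dec == 'reject H0' for dec, _ in results) / reps
avg_n = sum(n for _, n in results) / reps
```

Under H0 the simulated rejection rate stays below the bound α′ ≤ α/(1 − β) ≈ 0.053, and the test typically stops after only a handful of observations.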
Theorem 14.2.3:
For a SPRT with stopping bounds A and B, A > B, and strength (α, β), we have
A ≤ (1 − β)/α and B ≥ β/(1 − α),
where 0 < α < 1 and 0 < β < 1.
where 0 < α < 1 and 0 < β < 1.
Theorem 14.2.4:
Assume we select, for given α, β ∈ (0, 1) with α + β ≤ 1, the stopping bounds
A′ = (1 − β)/α and B′ = β/(1 − α).
Then it holds that the SPRT with stopping bounds A′ and B′ has strength (α′, β′), where
α′ ≤ α/(1 − β), β′ ≤ β/(1 − α), and α′ + β′ ≤ α + β.
Note:
(ii) A0 and B 0 are functions of α and β only and do not depend on the pdf’s (or pmf’s) fθ0
and fθ1 . Therefore, they can be computed once and for all fθi ’s, i = 0, 1.
Index
α–similar, 105 Efficient, More, 71
0–1 Loss, 125 Efficient, Most, 72
Empirical Cumulative Distribution Function, 36
A Posteriori Distribution, 84 Error, Type I, 90
A Priori Distribution, 84 Error, Type II, 90
Action, 81 Estimable, 146
Alternative Hypothesis, 89 Estimable Function, 57
Ancillary, 55 Estimate, Bayes, 85
Asymptotically (Most) Efficient, 72 Estimate, Maximum Likelihood, 76
Estimate, Method of Moments, 74
Basu’s Theorem, 55 Estimate, Minimax, 82
Bayes Estimate, 85 Estimate, Point, 44
Bayes Risk, 84 Estimator, 44
Bayes Rule, 85 Estimator, Mann–Whitney, 150
Bayesian Confidence Set, 143 Estimator, Wilcoxin 2–Sample, 150
Bias, 57 Exponential Family, One–Parameter, 53
177
Level of Significance, 91
Level–α–Test, 91
Likelihood Function, 76
Likelihood Ratio Test, 112
Likelihood Ratio Test Statistic, 112
Lindeberg Central Limit Theorem, 33
Lindeberg Condition, 33
Lindeberg–Lévy Central Limit Theorem, 30
LMVUE, 59
Locally Minimum Variance Unbiased Estimate, 59
Location Invariant, 46
Logic, 60
Loss Function, 81
Lower Confidence Bound, 131
LRT, 112
Mann–Whitney Estimator, 150
Maximal Invariant, 109
Maximum Likelihood Estimate, 76
Mean Square Error, 58
Mean–Squared–Error Consistent, 58
Measurement Invariance, 108
Method of Moments Estimate, 74
Minimax Estimate, 82
Minimax Principle, 82
Minimal Sufficient, 55
MLE, 76
MLR, 98
MOM, 74
Monotone Likelihood Ratio, 98
More Efficient, 71
Most Efficient, 72
Most Powerful Test, 91
MP, 91
MSE–Consistent, 58
Neyman–Pearson Lemma, 94
Nonparametric, 145
Nonrandomized Test, 91
Normal Variance Tests, 117
NP Lemma, 94
Null Hypothesis, 89
One Sample t–Test, 121
One–Tailed t-Test, 121
Paired t-Test, 122
Parameter Space, 44
Parametric Hypothesis, 89
Permutation Invariant, 46
Pivot, 134
Point Estimate, 44
Point Estimation, 44
Population Distribution, 36
Power, 91
Power Function, 91
Probability Integral Transformation, 152
Probability Ratio Test, Sequential, 173
Problem of Fit, 151
Proof by Contradiction, 60
Proportional Selection, 167
Random Interval, 130
Random Sample, 36
Random Sets, 130
Random Variable, Stopping, 170
Randomized Test, 91
Rao–Blackwell, 63
Rao–Blackwellization, 64
Realization, 36
Regularity Conditions, 67
Risk Function, 81
Risk, Bayes, 84
Sample, 36
Sample Central Moment of Order k, 37
Sample Mean, 36
Sample Moment of Order k, 37
Sample Statistic, 36
Sample Variance, 36
Scale Invariant, 46
Selection, Proportional, 167
Sequential Decision Procedure, 170
Sequential Probability Ratio Test, 173
Significance Level, 91
Similar, 105
Similar, α, 105
Simple, 89, 162
Simple Random Sample, 162
Size, 91
SPRT, 173
SRS, 162
Stable, 32
Statistic, 36
Statistic, Kolmogorov–Smirnov, 151
Statistic, Likelihood Ratio Test, 112
Stopping Random Variable, 170
Stopping Regions, 170
Stopping Rule, 170
Strata, 165
Stratified Random Sample, 165
Strongly Consistent, 45
Sufficient, 48, 145
Sufficient, Minimal, 55
Symmetric Kernel, 146
t–Test, 121
Test Function, 90
Test, Invariant, 108
Test, Kolmogorov–Smirnov, 155
Test, Likelihood Ratio, 112
Test, Most Powerful, 91
Test, Nonrandomized, 91
Test, Randomized, 91
Test, Uniformly Most Powerful, 91
Tolerance Coefficient, 158
Tolerance Interval, 158
Two–Sample t-Test, 121
Two–Tailed t-Test, 121
Type I Error, 90
Type II Error, 90
U–Statistic, 147
U–Statistic, Generalized, 150
UMA, 131
UMAU, 140
UMP, 91
UMP α–similar, 106
UMP Invariant, 110
UMP Unbiased, 102
UMPU, 102
UMVUE, 59
Unbiased, 57, 102, 140
Uniformly Minimum Variance Unbiased Estimate, 59
Uniformly Most Accurate, 131
Uniformly Most Accurate Unbiased, 140
Uniformly Most Powerful Test, 91
Unimodal, 135
Upper Confidence Bound, 131