Math 152 Notes Fall 09
Lecture Notes
Adolfo J. Rumbos
© Draft date December 18, 2009
Contents
1 Introduction 5
  1.1 Introduction to statistical inference 5
    1.1.1 An Introductory Example 5
    1.1.2 Sampling: Concepts and Terminology 7
2 Estimation 13
  2.1 Estimating the Mean of a Distribution 13
  2.2 Interval Estimate for Proportions 15
  2.3 Interval Estimates for the Mean 17
    2.3.1 The χ² Distribution 18
    2.3.2 The t Distribution 26
    2.3.3 Sampling from a normal distribution 29
    2.3.4 Distribution of the Sample Variance from a Normal Distribution 33
    2.3.5 The Distribution of T_n 39
3 Hypothesis Testing 43
  3.1 Chi–Square Goodness of Fit Test 44
    3.1.1 The Multinomial Distribution 46
    3.1.2 The Pearson Chi-Square Statistic 48
    3.1.3 Goodness of Fit Test 49
  3.2 The Language and Logic of Hypothesis Tests 50
  3.3 Hypothesis Tests in General 54
  3.4 Likelihood Ratio Test 61
  3.5 The Neyman–Pearson Lemma 73
4 Evaluating Estimators 77
  4.1 Mean Squared Error 77
  4.2 Cramér–Rao Theorem 80
Chapter 1
Introduction
air, a few of the kernels begin to pop. Pressure from the circulating air and other kernels popping and bouncing around inside the cylinder forces kernels to the top of the cylinder, then to the spout, and finally into the container. Once you start eating the popcorn, you realize that not all the kernels popped. You also notice that there are two kinds of unpopped kernels: those that just didn't pop and those that were kicked out of the container before they could get warm enough to pop. In any case, after you are done eating the popped kernels, you cannot resist the temptation to count how many kernels did not pop.
Table 1.1 shows the results of 27 popping sessions performed under nearly the same conditions. Each popping session represents a random experiment.¹

¹A random experiment is a process that can be repeated under the same conditions, and whose outcomes cannot be predicted with certainty before the experiment is performed.
where
\[
\binom{N}{k} = \frac{N!}{k!\,(N-k)!}, \quad k = 0, 1, 2, \ldots, N.
\]
This is the underlying probability model that we may postulate for this situation. The probability of a failure to pop for a given kernel, p, and the number of kernels, N, in one-quarter cup are unknown parameters. The challenge before us is to use the data in Table 1.1 on page 6 to estimate the parameter p. Notice that N is also unknown, so we will have to estimate N as well; however, the data in Table 1.1 do not give enough information to do so. We will therefore have to design a new experiment to obtain data that will allow us to estimate N. This will be done in the next chapter. Before we proceed further, we will lay out the sampling notions and terminology that are at the foundation of statistical inference.
\[
f_N(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad \text{for } -\infty < x < \infty.
\]
²A random variable is a numerical outcome of a random experiment whose value cannot be predicted with certainty before the experiment is performed.
Hence, the probability that the number of kernels in one quarter cup of popcorn lies within a certain range of values, a ⩽ N < b, is
\[
P(a \leqslant N < b) = \int_a^b f_N(x)\, dx.
\]
\[
\overline{X}_n = \frac{X_1 + X_2 + \cdots + X_n}{n}.
\]
$\overline{X}_n$ is an example of a statistic.
\[
E(\overline{X}_n) = \mu.
\]
\[
\sum_{k=1}^{n} (X_k - \mu)^2 = \sum_{k=1}^{n} X_k^2 - 2\mu \sum_{k=1}^{n} X_k + n\mu^2
= \sum_{k=1}^{n} X_k^2 - 2\mu\, n\overline{X}_n + n\mu^2.
\]
Similarly,
\[
\sum_{k=1}^{n} (X_k - \overline{X}_n)^2
= \sum_{k=1}^{n} X_k^2 - 2\overline{X}_n \sum_{k=1}^{n} X_k + n\overline{X}_n^{\,2}
= \sum_{k=1}^{n} X_k^2 - 2n\overline{X}_n^{\,2} + n\overline{X}_n^{\,2}
= \sum_{k=1}^{n} X_k^2 - n\overline{X}_n^{\,2}.
\]
𝑘=1
Consequently,
\[
\sum_{k=1}^{n} (X_k - \mu)^2 - \sum_{k=1}^{n} (X_k - \overline{X}_n)^2
= n\overline{X}_n^{\,2} - 2\mu\, n\overline{X}_n + n\mu^2
= n\left(\overline{X}_n - \mu\right)^2.
\]
Taking expectations, it follows that
\[
E\left(\sum_{k=1}^{n} (X_k - \overline{X}_n)^2\right)
= \sum_{k=1}^{n} \sigma^2 - n\,\mathrm{var}(\overline{X}_n)
= n\sigma^2 - n\,\frac{\sigma^2}{n}
= (n-1)\,\sigma^2.
\]
Thus, dividing by n − 1,
\[
E\left(\frac{1}{n-1}\sum_{k=1}^{n} (X_k - \overline{X}_n)^2\right) = \sigma^2.
\]
\[
F_X(x) = P(X \leqslant x).
\]
If X is a continuous random variable with density $f_X(x)$, then the joint density of the sample is
\[
f_{(X_1,\ldots,X_n)}(x_1,\ldots,x_n) = f_X(x_1)\, f_X(x_2) \cdots f_X(x_n).
\]
which is the mgf of a normal(μ, σ²/n) distribution. It then follows that $\overline{N}_n$ has a normal distribution with mean $E(\overline{N}_n) = \mu$ and variance $\mathrm{var}(\overline{N}_n) = \sigma^2/n$.
Example 1.1.5 shows that the sample mean, $\overline{X}_n$, for a random sample from a normal(μ, σ²) distribution follows a normal(μ, σ²/n) distribution. A surprising, and extremely useful, result from the theory of probability states that, for large values of n, the sample mean for samples from any distribution is approximately normal(μ, σ²/n). This is the essence of the Central Limit Theorem:
Theorem 1.1.6 (Central Limit Theorem). [HCM04, Theorem 4.4.1, p. 220] Suppose $X_1, X_2, X_3, \ldots$ are independent, identically distributed random variables with $E(X_i) = \mu$ and finite variance $\mathrm{var}(X_i) = \sigma^2$, for all i. Then
\[
\lim_{n\to\infty} P\left(\frac{\overline{X}_n - \mu}{\sigma/\sqrt{n}} \leqslant z\right) = P(Z \leqslant z),
\]
where $Z \sim \text{normal}(0, 1)$.
Thus, for large values of n, the distribution function of $\dfrac{\overline{X}_n - \mu}{\sigma/\sqrt{n}}$ can be approximated by the standard normal distribution. We write
\[
\frac{\overline{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{\ D\ } Z \sim \text{normal}(0, 1)
\]
and say that $\dfrac{\overline{X}_n - \mu}{\sigma/\sqrt{n}}$ converges in distribution to Z. In general, we have
Definition 1.1.7 (Convergence in Distribution). A sequence, $(Y_n)$, of random variables is said to converge in distribution to a random variable Y if
\[
\lim_{n\to\infty} F_{Y_n}(y) = F_Y(y)
\]
at every point y where $F_Y$ is continuous.

In other words, for large sample sizes, n, the distribution of the sample mean is approximately normal(μ, σ²/n).
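As an aside (not part of the original notes), the Central Limit Theorem is easy to see numerically. The following Python sketch estimates $P\left(\frac{\overline{X}_n - \mu}{\sigma/\sqrt{n}} \leqslant 1\right)$ for samples from a uniform(0, 1) distribution, where μ = 1/2 and σ² = 1/12, and compares it with $P(Z \leqslant 1) \approx 0.8413$; the sample size n = 30 and the number of replications are illustrative choices.

```python
import math
import random
import statistics

random.seed(1)

n, reps = 30, 5000                  # illustrative sample size and replications
mu, sigma = 0.5, math.sqrt(1 / 12)  # mean and sd of uniform(0, 1)

# Count how often the standardized sample mean falls at or below 1.
hits = 0
for _ in range(reps):
    xbar = statistics.fmean(random.random() for _ in range(n))
    if (xbar - mu) / (sigma / math.sqrt(n)) <= 1.0:
        hits += 1

estimate = hits / reps              # should be close to P(Z <= 1)
print(estimate)
```

Even for a distribution as non-normal as the uniform, the estimate is close to the standard normal value 0.8413, as the theorem predicts.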
Chapter 2
Estimation
We write
\[
Y_n \xrightarrow{\ P\ } b \quad \text{as } n \to \infty.
\]
The fact that $\overline{X}_n$ converges to μ in probability is known as the weak Law of Large Numbers. We will prove this fact under the assumption that the distribution being sampled has finite variance, σ². The weak Law of Large Numbers will then follow from the inequality:
Theorem 2.1.2 (Chebyshev Inequality). Let X be a random variable with mean μ and variance var(X). Then, for every ε > 0,
\[
P(|X - \mu| \geqslant \varepsilon) \leqslant \frac{\mathrm{var}(X)}{\varepsilon^2}.
\]
Proof: We shall prove this inequality for the case in which X is continuous with pdf $f_X$. Observe that
\[
\mathrm{var}(X) = E[(X-\mu)^2] = \int_{-\infty}^{\infty} |x-\mu|^2 f_X(x)\, dx.
\]
Thus,
\[
\mathrm{var}(X) \geqslant \int_{A_\varepsilon} |x-\mu|^2 f_X(x)\, dx,
\]
where $A_\varepsilon = \{x \in \mathbb{R} \mid |x - \mu| \geqslant \varepsilon\}$.
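A quick numerical check of Chebyshev's inequality (our sketch, not part of the notes): for X exponentially distributed with mean 1, var(X) = 1, so with ε = 2 the bound gives $P(|X - 1| \geqslant 2) \leqslant 1/4$, while the exact probability is $e^{-3} \approx 0.0498$.

```python
import math
import random

random.seed(2)

draws = [random.expovariate(1.0) for _ in range(20000)]  # mean 1, variance 1
eps = 2.0

# Empirical P(|X - mu| >= eps) versus the Chebyshev bound var(X)/eps^2 = 0.25.
tail = sum(abs(x - 1.0) >= eps for x in draws) / len(draws)
print(tail)
```

The empirical tail probability sits near $e^{-3}$, well below the Chebyshev bound of 0.25, illustrating that the inequality is valid but often far from tight.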
Later in these notes we will need the fact that a continuous function of a sequence which converges in probability will also converge in probability:

Theorem 2.1.3 (Slutsky's Theorem). Suppose that $(Y_n)$ converges in probability to b as n → ∞ and that g is a function which is continuous at b. Then, $(g(Y_n))$ converges in probability to g(b) as n → ∞.
Proof: Let ε > 0 be given. Since g is continuous at b, there exists δ > 0 such that
\[
|y - b| < \delta \Rightarrow |g(y) - g(b)| < \varepsilon.
\]
It then follows that the event $A_\delta = \{y \mid |y - b| < \delta\}$ is a subset of the event $B_\varepsilon = \{y \mid |g(y) - g(b)| < \varepsilon\}$. Consequently,
\[
P(A_\delta) \leqslant P(B_\varepsilon).
\]
It then follows from Equation (2.1) and the Squeeze or Sandwich Theorem that
\[
\lim_{n\to\infty} P(|g(Y_n) - g(b)| < \varepsilon) = 1.
\]
have the property that the probability that the true proportion p lies in them is at least 95%. For this reason, the interval
\[
\left[\hat{p}_n - z\,\frac{\sqrt{\hat{p}_n(1-\hat{p}_n)}}{\sqrt{n}},\ \hat{p}_n + z\,\frac{\sqrt{\hat{p}_n(1-\hat{p}_n)}}{\sqrt{n}}\right)
\]
or about [0.146 − 0.037, 0.146 + 0.037), or [0.109, 0.183). Thus, the failure-to-pop rate is between 10.9% and 18.3% with a 95% confidence level. The confidence level here indicates the probability that the method used to produce the interval estimate from the data will contain the true value of the parameter being estimated.
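The interval estimate above is straightforward to compute. A minimal Python sketch follows; the function name is our own, not from the notes:

```python
import math

def proportion_interval(p_hat, n, z=1.96):
    """Interval estimate for a proportion: p_hat +/- z * sqrt(p_hat(1-p_hat)/n)."""
    margin = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Illustrative example: sample proportion 0.5 from n = 100 observations,
# at the 95% level (z = 1.96).
lo, hi = proportion_interval(0.5, 100)
print(round(lo, 3), round(hi, 3))   # 0.402 0.598
```

With the popcorn estimate $\hat{p}_n \approx 0.146$ one would call the same function with the actual sample size from the data.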
\[
\frac{\overline{X}_n - \mu}{\sigma/\sqrt{n}},
\]
\[
T_n = \frac{\overline{X}_n - \mu}{S_n/\sqrt{n}}. \tag{2.4}
\]
\[
f_Z(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}, \quad \text{for } -\infty < z < \infty.
\]
\[
P(X \leqslant x) = P(Z^2 \leqslant x) \quad \text{for } x \geqslant 0,
\]
\[
\begin{aligned}
P(X \leqslant x) &= P(-\sqrt{x} \leqslant Z \leqslant \sqrt{x}) \\
&= P(-\sqrt{x} < Z \leqslant \sqrt{x}), \quad \text{since } Z \text{ is continuous.}
\end{aligned}
\]
Thus,
\[
P(X \leqslant x) = P(Z \leqslant \sqrt{x}) - P(Z \leqslant -\sqrt{x}) = F_Z(\sqrt{x}) - F_Z(-\sqrt{x}) \quad \text{for } x > 0.
\]
We then have that the cdf of X is
\[
F_X(x) = F_Z(\sqrt{x}) - F_Z(-\sqrt{x}) \quad \text{for } x > 0,
\]
for x > 0. □

$Y \sim \chi^2(1)$.
\[
E(X) = E(Z^2) = 1,
\]
since $\mathrm{var}(Z) = E(Z^2) - (E(Z))^2$, $E(Z) = 0$, and $\mathrm{var}(Z) = 1$. To compute the second moment of X, $E(X^2) = E(Z^4)$, we need the fourth moment of Z. In order to do this, we first note that the mgf of Z is
\[
M_Z(t) = e^{t^2/2} \quad \text{for all } t \in \mathbb{R}.
\]
Thus,
\[
E(Z^4) = M_Z^{(4)}(0) = 3.
\]
We then have that the variance of X is
\[
\mathrm{var}(X) = E(X^2) - (E(X))^2 = 3 - 1 = 2.
\]
where
\[
f_X(x) = \begin{cases} \dfrac{1}{\sqrt{2\pi}\,\sqrt{x}}\, e^{-x/2} & \text{if } x > 0, \\[4pt] 0 & \text{elsewhere}, \end{cases}
\qquad
f_Y(y) = \begin{cases} \dfrac{1}{\sqrt{2\pi}\,\sqrt{y}}\, e^{-y/2} & \text{if } y > 0, \\[4pt] 0 & \text{otherwise}. \end{cases}
\]
\[
f_W(w) = \frac{e^{-w/2}}{2\pi} \int_0^1 \frac{1}{\sqrt{t}\,\sqrt{1-t}}\, dt.
\]
Making a second change of variables $s = \sqrt{t}$, we get that $t = s^2$ and $dt = 2s\,ds$, so that
\[
f_W(w) = \frac{e^{-w/2}}{\pi} \int_0^1 \frac{1}{\sqrt{1-s^2}}\, ds
= \frac{e^{-w/2}}{\pi} \left[\arcsin(s)\right]_0^1
= \frac{1}{2}\, e^{-w/2} \quad \text{for } w > 0,
\]
and zero otherwise. It then follows that W = X + Y has the pdf of an exponential(2) random variable.
Our goal in the following set of examples is to come up with the formula for the
pdf of a 𝜒2 (𝑛) random variable.
Example 2.3.5 (Three degrees of freedom). Let 𝑋 ∼ exponential(2) and 𝑌 ∼
𝜒2 (1) be independent random variables and define 𝑊 = 𝑋 + 𝑌 . Give the
distribution of 𝑊 .
Solution: Since X and Y are independent, by Problem 1 in Assignment #3, $f_W$ is the convolution of $f_X$ and $f_Y$:
\[
f_W(w) = f_X * f_Y(w) = \int_{-\infty}^{\infty} f_X(u)\, f_Y(w-u)\, du,
\]
where
\[
f_X(x) = \begin{cases} \dfrac{1}{2}\, e^{-x/2} & \text{if } x > 0; \\[4pt] 0 & \text{otherwise}; \end{cases}
\qquad
f_Y(y) = \begin{cases} \dfrac{1}{\sqrt{2\pi}\,\sqrt{y}}\, e^{-y/2} & \text{if } y > 0; \\[4pt] 0 & \text{otherwise}. \end{cases}
\]
It then follows that, for w > 0,
\[
\begin{aligned}
f_W(w) &= \int_0^{\infty} \frac{1}{2}\, e^{-u/2}\, f_Y(w-u)\, du \\
&= \int_0^{w} \frac{1}{2}\, e^{-u/2}\, \frac{1}{\sqrt{2\pi}\,\sqrt{w-u}}\, e^{-(w-u)/2}\, du \\
&= \frac{e^{-w/2}}{2\sqrt{2\pi}} \int_0^{w} \frac{1}{\sqrt{w-u}}\, du \\
&= \frac{1}{\sqrt{2\pi}}\, \sqrt{w}\, e^{-w/2},
\end{aligned}
\]
for w > 0. It then follows that
\[
f_W(w) = \begin{cases} \dfrac{1}{\sqrt{2\pi}}\, \sqrt{w}\, e^{-w/2} & \text{if } w > 0; \\[4pt] 0 & \text{otherwise}. \end{cases}
\]
\[
f_W(w) = f_X * f_Y(w) = \int_{-\infty}^{\infty} f_X(u)\, f_Y(w-u)\, du,
\]
where
\[
f_X(x) = \begin{cases} \dfrac{1}{2}\, e^{-x/2} & \text{if } x > 0; \\[4pt] 0 & \text{otherwise}; \end{cases}
\qquad
f_Y(y) = \begin{cases} \dfrac{1}{2}\, e^{-y/2} & \text{if } y > 0; \\[4pt] 0 & \text{otherwise}. \end{cases}
\]
\[
= \frac{w\, e^{-w/2}}{4},
\]
for w > 0. It then follows that
\[
f_W(w) = \begin{cases} \dfrac{1}{4}\, w\, e^{-w/2} & \text{if } w > 0; \\[4pt] 0 & \text{otherwise}. \end{cases}
\]
We are now ready to derive the general formula for the pdf of a 𝜒2 (𝑛) random
variable.
\[
f_W(w) = \frac{1}{\Gamma(1/2)\, 2^{1/2}}\, w^{\frac{1}{2}-1}\, e^{-w/2} = \frac{1}{\sqrt{2\pi}\,\sqrt{w}}\, e^{-w/2},
\]
which is the pdf of a $\chi^2(1)$ random variable. Thus, the formula in (2.5) holds true for n = 1.
Next, assume that a $\chi^2(n)$ random variable has pdf given by (2.5). We will show that if $W \sim \chi^2(n+1)$, then its pdf is given by
\[
f_W(w) = \begin{cases} \dfrac{1}{\Gamma((n+1)/2)\, 2^{(n+1)/2}}\, w^{\frac{n-1}{2}}\, e^{-w/2} & \text{if } w > 0; \\[4pt] 0 & \text{otherwise}. \end{cases} \tag{2.6}
\]
and
\[
f_Y(y) = \begin{cases} \dfrac{1}{\sqrt{2\pi}\,\sqrt{y}}\, e^{-y/2} & \text{if } y > 0; \\[4pt] 0 & \text{otherwise}. \end{cases}
\]
Consequently, for w > 0,
\[
\begin{aligned}
f_W(w) &= \int_0^{w} \frac{1}{\Gamma(n/2)\, 2^{n/2}}\, u^{\frac{n}{2}-1}\, e^{-u/2}\, \frac{1}{\sqrt{2\pi}\,\sqrt{w-u}}\, e^{-(w-u)/2}\, du \\
&= \frac{e^{-w/2}}{\Gamma(n/2)\,\sqrt{\pi}\; 2^{(n+1)/2}} \int_0^{w} \frac{u^{\frac{n}{2}-1}}{\sqrt{w-u}}\, du.
\end{aligned}
\]
Next, make the change of variables $t = u/w$; we then have that $u = wt$, $du = w\,dt$, and
\[
f_W(w) = \frac{w^{\frac{n-1}{2}}\, e^{-w/2}}{\Gamma(n/2)\,\sqrt{\pi}\; 2^{(n+1)/2}} \int_0^{1} \frac{t^{\frac{n}{2}-1}}{\sqrt{1-t}}\, dt.
\]
Making a further change of variables $t = z^2$, so that $dt = 2z\,dz$, we obtain
\[
f_W(w) = \frac{2\, w^{\frac{n-1}{2}}\, e^{-w/2}}{\Gamma(n/2)\,\sqrt{\pi}\; 2^{(n+1)/2}} \int_0^{1} \frac{z^{n-1}}{\sqrt{1-z^2}}\, dz. \tag{2.7}
\]
Looking up the last integral in a table of integrals we find that, if n is even and n ⩾ 4, then
\[
\int_0^{\pi/2} \sin^{n-1}\theta\, d\theta = \frac{2 \cdot 4 \cdot 6 \cdots (n-2)}{1 \cdot 3 \cdot 5 \cdots (n-1)},
\]
which can be written in terms of the Gamma function as
\[
\int_0^{\pi/2} \sin^{n-1}\theta\, d\theta = \frac{2^{n-2}\left[\Gamma\!\left(\frac{n}{2}\right)\right]^2}{\Gamma(n)}. \tag{2.8}
\]
Note that this formula also works for n = 2.
Similarly, we obtain for odd n with n ⩾ 1 that
\[
\int_0^{\pi/2} \sin^{n-1}\theta\, d\theta = \frac{\Gamma(n)\,\pi}{2^{n}\left[\Gamma\!\left(\frac{n+1}{2}\right)\right]^2}. \tag{2.9}
\]
\[
f_W(w) = \frac{w^{\frac{n-1}{2}}\, e^{-w/2}\;\Gamma(n)\,\sqrt{\pi}}{\Gamma(n/2)\; 2^{(n+1)/2}\; 2^{n-1}\left[\Gamma\!\left(\frac{n+1}{2}\right)\right]^2}
\]
for w > 0, which simplifies, by the duplication formula for the Gamma function, to (2.6). This completes the inductive step, and the proof is now complete. That is, if $W \sim \chi^2(n)$ then the pdf of W is given by
\[
f_W(w) = \begin{cases} \dfrac{1}{\Gamma(n/2)\, 2^{n/2}}\, w^{\frac{n}{2}-1}\, e^{-w/2} & \text{if } w > 0; \\[4pt] 0 & \text{otherwise}, \end{cases}
\]
for n = 1, 2, 3, …
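The formula can be sanity-checked by simulation (our sketch, not part of the notes). For n = 4 the pdf above integrates to the cdf $1 - e^{-w/2}(1 + w/2)$, so $P(W \leqslant 2) = 1 - 2e^{-1} \approx 0.2642$; summing four squared standard normals should reproduce this:

```python
import math
import random

random.seed(3)

reps = 20000
# W = Z1^2 + Z2^2 + Z3^2 + Z4^2 has a chi-square distribution with n = 4
# degrees of freedom.
hits = sum(
    sum(random.gauss(0.0, 1.0) ** 2 for _ in range(4)) <= 2.0
    for _ in range(reps)
)
estimate = hits / reps
exact = 1.0 - 2.0 * math.exp(-1.0)   # cdf of chi-square(4) at w = 2
print(estimate, exact)
```

The Monte Carlo estimate agrees with the closed-form cdf to within sampling error, which is consistent with the derivation above.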
\[
\begin{aligned}
F_T(t) = P(T \leqslant t) &= P\left(\frac{Z}{\sqrt{X/(n-1)}} \leqslant t\right) \\
&= \iint_R f_{(X,Z)}(x,z)\, dx\, dz,
\end{aligned}
\]
and
\[
f_Z(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}, \quad \text{for } -\infty < z < \infty.
\]
We then have that
\[
F_T(t) = \int_0^{\infty} \int_{-\infty}^{t\sqrt{x/(n-1)}} \frac{x^{\frac{n-3}{2}}\, e^{-(x+z^2)/2}}{\Gamma\!\left(\frac{n-1}{2}\right)\sqrt{\pi}\; 2^{n/2}}\, dz\, dx.
\]
\[
u = x, \qquad v = \frac{z}{\sqrt{x/(n-1)}},
\]
so that
\[
x = u, \qquad z = v\,\sqrt{u/(n-1)}.
\]
Consequently,
\[
F_T(t) = \frac{1}{\Gamma\!\left(\frac{n-1}{2}\right)\sqrt{\pi}\; 2^{n/2}} \int_{-\infty}^{t} \int_0^{\infty} u^{\frac{n-3}{2}}\, e^{-(u + uv^2/(n-1))/2}\, \left|\frac{\partial(x,z)}{\partial(u,v)}\right|\, du\, dv,
\]
where
\[
\frac{\partial(x,z)}{\partial(u,v)} = \frac{u^{1/2}}{\sqrt{n-1}}.
\]
Differentiating with respect to t, the pdf of T is then
\[
f_T(t) = \frac{1}{\Gamma\!\left(\frac{n-1}{2}\right)\sqrt{(n-1)\pi}\; 2^{n/2}} \int_0^{\infty} u^{\frac{n}{2}-1}\, e^{-\left(1+\frac{t^2}{n-1}\right)u/2}\, du.
\]
Put $\alpha = \dfrac{n}{2}$ and $\beta = \dfrac{2}{1 + \dfrac{t^2}{n-1}}$. Then,
\[
\begin{aligned}
f_T(t) &= \frac{1}{\Gamma\!\left(\frac{n-1}{2}\right)\sqrt{(n-1)\pi}\; 2^{\alpha}} \int_0^{\infty} u^{\alpha-1}\, e^{-u/\beta}\, du \\
&= \frac{\Gamma(\alpha)\,\beta^{\alpha}}{\Gamma\!\left(\frac{n-1}{2}\right)\sqrt{(n-1)\pi}\; 2^{\alpha}} \int_0^{\infty} \frac{u^{\alpha-1}\, e^{-u/\beta}}{\Gamma(\alpha)\,\beta^{\alpha}}\, du,
\end{aligned}
\]
where
\[
f_U(u) = \begin{cases} \dfrac{u^{\alpha-1}\, e^{-u/\beta}}{\Gamma(\alpha)\,\beta^{\alpha}} & \text{if } u > 0 \\[6pt] 0 & \text{if } u \leqslant 0 \end{cases}
\]
is the pdf of a Γ(α, β) random variable (see Problem 5 in Assignment #3). We then have that
\[
f_T(t) = \frac{\Gamma(\alpha)\,\beta^{\alpha}}{\Gamma\!\left(\frac{n-1}{2}\right)\sqrt{(n-1)\pi}\; 2^{\alpha}} \quad \text{for } t \in \mathbb{R}.
\]
Using the definitions of α and β we obtain
\[
f_T(t) = \frac{\Gamma\!\left(\frac{n}{2}\right)}{\Gamma\!\left(\frac{n-1}{2}\right)\sqrt{(n-1)\pi}} \cdot \frac{1}{\left(1+\dfrac{t^2}{n-1}\right)^{n/2}} \quad \text{for } t \in \mathbb{R}.
\]
This is the pdf of a random variable with a t distribution with n − 1 degrees of freedom. In general, a random variable, T, is said to have a t distribution with r degrees of freedom, for r ⩾ 1, if its pdf is given by
\[
f_T(t) = \frac{\Gamma\!\left(\frac{r+1}{2}\right)}{\Gamma\!\left(\frac{r}{2}\right)\sqrt{r\pi}} \cdot \frac{1}{\left(1+\dfrac{t^2}{r}\right)^{(r+1)/2}} \quad \text{for } t \in \mathbb{R}.
\]
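As a small consistency check (ours, not in the notes): for r = 1 the normalizing constant $\Gamma((r+1)/2)/(\Gamma(r/2)\sqrt{r\pi})$ reduces to 1/π, so the t(1) density is the Cauchy density $1/(\pi(1+t^2))$. This can be verified numerically with the log-Gamma function:

```python
import math

def t_pdf(t, r):
    """pdf of the t distribution with r degrees of freedom."""
    coef = math.exp(math.lgamma((r + 1) / 2) - math.lgamma(r / 2)) / math.sqrt(r * math.pi)
    return coef / (1.0 + t * t / r) ** ((r + 1) / 2)

# For r = 1 this is the Cauchy pdf 1/(pi*(1 + t^2)).
print(t_pdf(0.0, 1), 1 / math.pi)
```

Working with `lgamma` rather than `gamma` keeps the ratio numerically stable when r is large.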
\[
\frac{Z}{\sqrt{X/(n-1)}} \sim t(n-1).
\]
We will see the relevance of this example in the next section, when we continue our discussion of estimating the mean of a normal distribution.
\[
P(|Z| < z) = P(-z < Z \leqslant z) = F_Z(z) - F_Z(-z).
\]
Suppose that 0 < α < 1 and let $z_{\alpha/2}$ be the value of z for which $P(|Z| < z) = 1 - \alpha$. We then have that $z_{\alpha/2}$ satisfies the equation
\[
F_Z(z) = 1 - \frac{\alpha}{2}.
\]
Thus,
\[
z_{\alpha/2} = F_Z^{-1}\!\left(1 - \frac{\alpha}{2}\right), \tag{2.12}
\]
where $F_Z^{-1}$ denotes the inverse of the cdf of Z. Then, setting
\[
\frac{\sqrt{n}\, b}{\sigma} = z_{\alpha/2},
\]
we see from (2.11) that
\[
P\left(|\overline{X}_n - \mu| < z_{\alpha/2}\, \frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha,
\]
The probability that the interval in (2.14) captures the parameter μ is 1 − α. The interval in (2.14) is called the 100(1 − α)% confidence interval for the mean, μ, based on the sample mean. Notice that this interval assumes that the variance, σ², is known, which is not the case in general. So, in practice it is not very useful (we will see later how to remedy this situation); however, it is a good example to illustrate the concept of a confidence interval.
For a more concrete example, let α = 0.05. Then, to find $z_{\alpha/2}$ we may use the NORMINV function in MS Excel, which gives the inverse of the cumulative distribution function of a normal random variable. The format for this function is

NORMINV(probability, mean, standard_dev)

In this case the probability is 1 − α/2 = 0.975, the mean is 0, and the standard deviation is 1. Thus, according to (2.12), $z_{\alpha/2}$ is given by

NORMINV(0.975, 0, 1) ≈ 1.959963985
or about 1.96.
In R, the inverse cdf for a normal random variable is the qnorm function, whose format is

qnorm(probability, mean, sd)
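The same quantile can also be obtained with Python's standard library (our aside; `NormalDist` lives in the `statistics` module, Python 3.8 and later):

```python
from statistics import NormalDist

# Same query as NORMINV(0.975, 0, 1) in Excel or qnorm(0.975, 0, 1) in R.
z = NormalDist(mu=0, sigma=1).inv_cdf(0.975)
print(round(z, 6))   # 1.959964
```

All three tools are computing $F_Z^{-1}(0.975)$ from (2.12), so they agree to many decimal places.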
Hence the 95% confidence interval for the mean, μ, of a normal(μ, σ²) distribution based on the sample mean, $\overline{X}_n$, is
\[
\left(\overline{X}_n - 1.96\, \frac{\sigma}{\sqrt{n}},\ \overline{X}_n + 1.96\, \frac{\sigma}{\sqrt{n}}\right), \tag{2.15}
\]
\[
T_n = \frac{\overline{X}_n - \mu}{S_n/\sqrt{n}},
\]
Starting with
\[
\sum_{i=1}^{n} \left(\frac{X_i - \mu}{\sigma}\right)^2 = W_n + \left(\frac{\overline{X}_n - \mu}{\sigma/\sqrt{n}}\right)^2, \tag{2.19}
\]
where we have used the definition of the random variable $W_n$ in (2.16). Observe that the random variable
\[
\sum_{i=1}^{n} \left(\frac{X_i - \mu}{\sigma}\right)^2
\]
has a $\chi^2(n)$ distribution, since
\[
\frac{X_i - \mu}{\sigma} \sim \text{normal}(0,1),
\]
and, consequently,
\[
\left(\frac{X_i - \mu}{\sigma}\right)^2 \sim \chi^2(1).
\]
Similarly,
\[
\left(\frac{\overline{X}_n - \mu}{\sigma/\sqrt{n}}\right)^2 \sim \chi^2(1),
\]
\[
Y = W_n + X, \tag{2.20}
\]
where $Y \sim \chi^2(n)$ and $X \sim \chi^2(1)$. If we can prove that $W_n$ and X are independent random variables, we will then be able to conclude that
\[
W_n \sim \chi^2(n-1). \tag{2.21}
\]
To see why the assertion in (2.21) is true if $W_n$ and X are independent, note that from (2.20) we get that the mgf of Y is
The justification for the last assertion is given in the following two examples.

Example 2.3.9. Suppose that X and Y are independent random variables. Show that X and Y² are also independent.

Solution: Compute, for $x \in \mathbb{R}$ and u ⩾ 0,
\[
\begin{aligned}
P(X \leqslant x,\ Y^2 \leqslant u) &= P(X \leqslant x,\ |Y| \leqslant \sqrt{u}) \\
&= P(X \leqslant x,\ -\sqrt{u} \leqslant Y \leqslant \sqrt{u}) \\
&= P(X \leqslant x) \cdot P(-\sqrt{u} \leqslant Y \leqslant \sqrt{u}),
\end{aligned}
\]
(𝑋2 − 𝑋 𝑛 , 𝑋3 − 𝑋 𝑛 , . . . , 𝑋𝑛 − 𝑋 𝑛 ).
The proof of the claim in (2.25) relies on the assumption that the random variables $X_1, X_2, \ldots, X_n$ are iid normal random variables. By considering the standardized variables
\[
\frac{X_i - \mu}{\sigma} \quad \text{for } i = 1, 2, \ldots, n,
\]
we may assume from the outset that $X_1, X_2, \ldots, X_n$ are iid normal(0, 1) random variables. We illustrate the argument first for the case n = 2.
\[
r = x_1 + x_2 \quad \text{and} \quad w = x_2 - x_1,
\]
so that
\[
x_1 = \frac{r - w}{2} \quad \text{and} \quad x_2 = \frac{r + w}{2},
\]
and therefore
\[
x_1^2 + x_2^2 = \frac{1}{2}\left(r^2 + w^2\right).
\]
where
\[
\frac{\partial(x_1, x_2)}{\partial(r, w)} = \det \begin{pmatrix} 1/2 & -1/2 \\ 1/2 & 1/2 \end{pmatrix} = \frac{1}{2}.
\]
Thus,
\[
F_{(U,V)}(u, v) = \frac{1}{4\pi} \int_{-\infty}^{2u} \int_{-\infty}^{2v} e^{-r^2/4} \cdot e^{-w^2/4}\, dw\, dr,
\]
\[
\begin{aligned}
&= P(\overline{X}_n \leqslant u,\ X_2 - \overline{X}_n \leqslant v_2,\ X_3 - \overline{X}_n \leqslant v_3,\ \ldots,\ X_n - \overline{X}_n \leqslant v_n) \\
&= \int\!\!\!\int \cdots \int_{R_{u,v_2,v_3,\ldots,v_n}} f_{(X_1,X_2,\ldots,X_n)}(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \cdots dx_n,
\end{aligned}
\]
where $R_{u,v_2,\ldots,v_n}$ denotes the corresponding region in $\mathbb{R}^n$, with
\[
\overline{x} = \frac{x_1 + x_2 + \cdots + x_n}{n},
\]
and the joint pdf of $X_1, X_2, \ldots, X_n$ is
\[
f_{(X_1,X_2,\ldots,X_n)}(x_1, x_2, \ldots, x_n) = \frac{1}{(2\pi)^{n/2}}\, e^{-\left(\sum_{i=1}^{n} x_i^2\right)/2} \quad \text{for all } (x_1, x_2, \ldots, x_n) \in \mathbb{R}^n,
\]
\[
\begin{aligned}
y_1 &= \overline{x} \\
y_2 &= x_2 - \overline{x} \\
y_3 &= x_3 - \overline{x} \\
&\ \ \vdots \\
y_n &= x_n - \overline{x},
\end{aligned}
\]
so that
\[
\begin{aligned}
x_1 &= y_1 - \sum_{i=2}^{n} y_i \\
x_2 &= y_1 + y_2 \\
x_3 &= y_1 + y_3 \\
&\ \ \vdots \\
x_n &= y_1 + y_n,
\end{aligned}
\]
and therefore
\[
\begin{aligned}
\sum_{i=1}^{n} x_i^2 &= \left(y_1 - \sum_{i=2}^{n} y_i\right)^2 + \sum_{i=2}^{n} (y_1 + y_i)^2 \\
&= n y_1^2 + \left(\sum_{i=2}^{n} y_i\right)^2 + \sum_{i=2}^{n} y_i^2 \\
&= n y_1^2 + C(y_2, y_3, \ldots, y_n),
\end{aligned}
\]
\[
F_{(\overline{X}_n, Y)}(u, v_2, \ldots, v_n)
= \int_{-\infty}^{u} \int_{-\infty}^{v_2} \cdots \int_{-\infty}^{v_n} \frac{e^{-(n y_1^2 + C(y_2,\ldots,y_n))/2}}{(2\pi)^{n/2}}\, \left|\frac{\partial(x_1, x_2, \ldots, x_n)}{\partial(y_1, y_2, \ldots, y_n)}\right|\, dy_n \cdots dy_1,
\]
where
\[
\frac{\partial(x_1, x_2, \ldots, x_n)}{\partial(y_1, y_2, \ldots, y_n)} = \det \begin{pmatrix}
1 & -1 & -1 & \cdots & -1 \\
1 & 1 & 0 & \cdots & 0 \\
1 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
1 & 0 & 0 & \cdots & 1
\end{pmatrix}.
\]
\[
\begin{aligned}
n y_1 &= x_1 + x_2 + x_3 + \cdots + x_n \\
n y_2 &= -x_1 + (n-1)x_2 - x_3 - \cdots - x_n \\
n y_3 &= -x_1 - x_2 + (n-1)x_3 - \cdots - x_n \\
&\ \ \vdots \\
n y_n &= -x_1 - x_2 - x_3 - \cdots + (n-1)x_n,
\end{aligned}
\]
whose determinant is
\[
\det A = \det \begin{pmatrix}
1 & 1 & 1 & \cdots & 1 \\
0 & n & 0 & \cdots & 0 \\
0 & 0 & n & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
0 & 0 & 0 & \cdots & n
\end{pmatrix} = n^{n-1}.
\]
Thus, since
\[
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix} = n A^{-1} \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix},
\]
it follows that
\[
\frac{\partial(x_1, x_2, \ldots, x_n)}{\partial(y_1, y_2, \ldots, y_n)} = \det(n A^{-1}) = n^n \cdot \frac{1}{n^{n-1}} = n.
\]
Consequently,
\[
F_{(\overline{X}_n, Y)}(u, v_2, \ldots, v_n)
= \int_{-\infty}^{u} \int_{-\infty}^{v_2} \cdots \int_{-\infty}^{v_n} \frac{n\, e^{-n y_1^2/2}\, e^{-C(y_2,\ldots,y_n)/2}}{(2\pi)^{n/2}}\, dy_n \cdots dy_1,
\]
which can be written as
\[
F_{(\overline{X}_n, Y)}(u, v_2, \ldots, v_n)
= \int_{-\infty}^{u} \frac{\sqrt{n}\, e^{-n y_1^2/2}}{\sqrt{2\pi}}\, dy_1 \cdot \int_{-\infty}^{v_2} \cdots \int_{-\infty}^{v_n} \frac{\sqrt{n}\, e^{-C(y_2,\ldots,y_n)/2}}{(2\pi)^{(n-1)/2}}\, dy_n \cdots dy_2.
\]
Observe that
\[
\int_{-\infty}^{u} \frac{\sqrt{n}\, e^{-n y_1^2/2}}{\sqrt{2\pi}}\, dy_1
\]
is the cdf of a normal(0, 1/n) random variable, which is the distribution of $\overline{X}_n$. Therefore,
\[
F_{(\overline{X}_n, Y)}(u, v_2, \ldots, v_n) = F_{\overline{X}_n}(u) \cdot \int_{-\infty}^{v_2} \cdots \int_{-\infty}^{v_n} \frac{\sqrt{n}\, e^{-C(y_2,\ldots,y_n)/2}}{(2\pi)^{(n-1)/2}}\, dy_n \cdots dy_2,
\]
\[
Y = (X_2 - \overline{X}_n,\ X_3 - \overline{X}_n,\ \ldots,\ X_n - \overline{X}_n)
\]
where $\overline{X}_n$ and $S_n^2$ are the sample mean and variance, respectively, based on a random sample of size n taken from a normal(μ, σ²) distribution. We begin by rewriting the expression for $T_n$ in (2.26) as
\[
T_n = \frac{\dfrac{\overline{X}_n - \mu}{\sigma/\sqrt{n}}}{\dfrac{S_n}{\sigma}}, \tag{2.27}
\]
and observing that
\[
Z_n = \frac{\overline{X}_n - \mu}{\sigma/\sqrt{n}} \sim \text{normal}(0, 1).
\]
Furthermore,
\[
\frac{S_n}{\sigma} = \sqrt{\frac{S_n^2}{\sigma^2}} = \sqrt{\frac{V_n}{n-1}},
\]
where
\[
V_n = \frac{n-1}{\sigma^2}\, S_n^2,
\]
which has a $\chi^2(n-1)$ distribution, according to (2.22). It then follows from (2.27) that
\[
T_n = \frac{Z_n}{\sqrt{\dfrac{V_n}{n-1}}},
\]
where $Z_n$ is a standard normal random variable, and $V_n$ has a $\chi^2$ distribution with n − 1 degrees of freedom. Furthermore, by (2.23), $Z_n$ and $V_n$ are independent. Consequently, using the result in Example 2.3.8, the statistic $T_n$ defined in (2.26) has a t distribution with n − 1 degrees of freedom; that is,
\[
\frac{\overline{X}_n - \mu}{S_n/\sqrt{n}} \sim t(n-1). \tag{2.28}
\]
Notice that the distribution on the right-hand side of (2.28) does not depend on the parameters μ and σ²; we can therefore obtain a confidence interval for the mean of a normal(μ, σ²) distribution, based on the sample mean and variance calculated from a random sample of size n, by determining a value $t_{\alpha/2}$ such that
\[
P(|T_n| < t_{\alpha/2}) = 1 - \alpha.
\]
We then have that
\[
P\left(\frac{|\overline{X}_n - \mu|}{S_n/\sqrt{n}} < t_{\alpha/2}\right) = 1 - \alpha,
\]
or
\[
P\left(|\mu - \overline{X}_n| < t_{\alpha/2}\, \frac{S_n}{\sqrt{n}}\right) = 1 - \alpha,
\]
or
\[
P\left(\overline{X}_n - t_{\alpha/2}\, \frac{S_n}{\sqrt{n}} < \mu < \overline{X}_n + t_{\alpha/2}\, \frac{S_n}{\sqrt{n}}\right) = 1 - \alpha.
\]
We have therefore obtained a 100(1 − α)% confidence interval for the mean of a normal(μ, σ²) distribution based on the sample mean and variance of a random sample of size n from that distribution; namely,
\[
\left(\overline{X}_n - t_{\alpha/2}\, \frac{S_n}{\sqrt{n}},\ \overline{X}_n + t_{\alpha/2}\, \frac{S_n}{\sqrt{n}}\right). \tag{2.29}
\]
To find the value of $t_{\alpha/2}$ in (2.29) we use the fact that the pdf of the t distribution is symmetric about the vertical line at 0 (that is, it is even) to obtain
\[
P(|T_n| < t) = P(-t < T_n \leqslant t),
\]
where we have used the fact that $T_n$ is a continuous random variable. Now, by the symmetry of the pdf of $T_n$, $F_{T_n}(-t) = 1 - F_{T_n}(t)$. Thus,
Example 2.3.12. Give a 95% confidence interval for the mean of a normal distribution based on the sample mean and variance computed from a sample of size n = 20.

In MS Excel, two-tailed inverse values of the t distribution are given by the TINV function, whose format is

TINV(probability, degrees_freedom)

In this case the probability of the two tails is α = 0.05 and the number of degrees of freedom is 19. Thus, according to (2.30), $t_{\alpha/2}$ is given by

TINV(0.05, 19) ≈ 2.09,

where we have used 0.05 because TINV in MS Excel takes the two-tailed probability.
In R, the inverse cdf for a random variable with a t distribution is given by the qt function, whose format is

qt(probability, df).
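Putting (2.29) together in code (our sketch, with made-up data; the critical value $t_{\alpha/2} \approx 2.093$ for 19 degrees of freedom is hard-coded, since Python's standard library has no t quantile function):

```python
import math
import statistics

data = list(range(1, 21))        # hypothetical sample of size n = 20
t_crit = 2.093                   # ~ qt(0.975, 19), i.e. t_{alpha/2} for alpha = 0.05

n = len(data)
xbar = statistics.fmean(data)    # sample mean
s = statistics.stdev(data)       # sample standard deviation S_n
margin = t_crit * s / math.sqrt(n)

lo, hi = xbar - margin, xbar + margin
print(round(lo, 2), round(hi, 2))   # 7.73 13.27
```

Replacing `data` with actual observations gives the interval of Example 2.3.12 for any sample of size 20.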
Example 2.3.13. Obtain a 95% confidence interval for the average number of
popcorn kernels in a 1/4–cup based on the data in Table 1.2 on page 8.
(326, 358)
Chapter 3
Hypothesis Testing
[Figure: Histogram of N]

[Figure: Histogram of UnPoppedN]
We have therefore divided the range of possible observations into categories, and the probabilities computed above give the likelihood that a given observation (the count of unpopped kernels, in this case) will fall in a given category, assuming that a Poisson model is driving the process. Using these probabilities, we can predict how many observations out of the 27 will fall, on average, in each category. If the probability that a count will fall in category i is $p_i$, and n is the total number of observations (in this case, n = 27), then the predicted number of counts in category i is
\[
E_i = n\, p_i.
\]
Table 3.1 shows the predicted values in each category as well as the actual
(observed) counts.
The last column in Table 3.1 shows the actual observed counts based on the data in Table 1.1 on page 6. Are the large discrepancies between the observed and predicted counts in the first three categories in Table 3.1 enough evidence for us to dismiss the Poisson hypothesis? One of the goals of this chapter is to answer this question with confidence. We will need to find a way to measure the discrepancy that will allow us to make statements based on probability calculations. A measure of the discrepancy between the values predicted by an assumed probability model and the values that are actually observed in the data was introduced by Karl Pearson in 1900, [Pla83]. In order to motivate Pearson's statistic, we first present an example involving the multinomial distribution.
𝑋1 + 𝑋2 + ⋅ ⋅ ⋅ + 𝑋𝑘 = 𝑛. (3.1)
\[
\mathbf{X} = (X_1, X_2, \ldots, X_k) \tag{3.3}
\]
\[
p_{(X_1,X_2,\ldots,X_k)}(n_1, n_2, \ldots, n_k) =
\begin{cases}
\dfrac{n!}{n_1!\, n_2! \cdots n_k!}\, p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k} & \text{if } \sum_{i=1}^{k} n_i = n; \\[6pt]
0 & \text{otherwise}.
\end{cases} \tag{3.4}
\]
We first show that each $X_i$ has a marginal distribution which is binomial(n, $p_i$), so that
\[
E(X_i) = n p_i \quad \text{for all } i = 1, 2, \ldots, k,
\]
and
\[
\mathrm{var}(X_i) = n p_i (1 - p_i) \quad \text{for all } i = 1, 2, \ldots, k.
\]
Note that $X_1, X_2, \ldots, X_k$ are not independent because of the relation in (3.1). In fact, it can be shown that
Remark 3.1.2. Note that when k = 2 in Theorem 3.1.1 we recover the binomial theorem.
\[
p_{X_1}(n_1) = \sum_{\substack{n_2, n_3, \ldots, n_k \\ n_2 + n_3 + \cdots + n_k = n - n_1}} \frac{n!}{n_1!\, n_2! \cdots n_k!}\, p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k},
\]
so that
\[
\begin{aligned}
p_{X_1}(n_1) &= \frac{p_1^{n_1}}{n_1!} \sum_{\substack{n_2, n_3, \ldots, n_k \\ n_2 + \cdots + n_k = n - n_1}} \frac{n!}{n_2! \cdots n_k!}\, p_2^{n_2} \cdots p_k^{n_k} \\
&= \frac{p_1^{n_1}}{n_1!} \cdot \frac{n!}{(n-n_1)!} \sum_{\substack{n_2, n_3, \ldots, n_k \\ n_2 + \cdots + n_k = n - n_1}} \frac{(n-n_1)!}{n_2! \cdots n_k!}\, p_2^{n_2} \cdots p_k^{n_k} \\
&= \binom{n}{n_1}\, p_1^{n_1}\, (p_2 + p_3 + \cdots + p_k)^{n - n_1},
\end{aligned}
\]
\[
Z = \frac{X_1 - n p_1}{\sqrt{n p_1 (1 - p_1)}}
\]
We have therefore proved that, for large values of n, the random variable
\[
Q = \sum_{i=1}^{k} \frac{(X_i - n p_i)^2}{n p_i} \tag{3.5}
\]
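The statistic (3.5) is easy to compute. A small Python sketch with made-up observed counts and category probabilities (not the popcorn data; the function name is ours):

```python
def pearson_q(observed, probs, n):
    """Pearson chi-square statistic: sum over i of (X_i - n p_i)^2 / (n p_i)."""
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))

# Hypothetical example: n = 27 observations over three equally likely categories,
# so the expected count in each category is 9.
q = pearson_q([10, 10, 7], [1 / 3, 1 / 3, 1 / 3], 27)
print(q)   # (1 + 1 + 4)/9 = 0.666...
```

Large values of Q signal a large discrepancy between observed and predicted counts, which is exactly how it will be used in the goodness of fit test below.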
or 0.02%, which is a very small probability. Thus, the chances of observing the counts in the fourth column of Table 3.1 on page 46, under the assumption that the Poisson hypothesis is true, are very small. The fact that we did observe those counts, and that the counts came from observations recorded in Table 1.1 on page 6, suggests that it is highly unlikely that the counts of unpopped kernels in that table follow a Poisson distribution. We are therefore justified in rejecting the Poisson hypothesis on the basis of insufficient statistical support from the data.
Example 3.2.2 (Testing a binomial model). We have seen how to use a chi-square goodness of fit test to determine that the Poisson model for the distribution of counts of unpopped kernels in Table 1.1 on page 6 is not supported by the data in the table. A more appropriate model would be a binomial model. In this case we have two unknown parameters: the mean number of kernels, n, in one-quarter cup, and the probability, p, that a given kernel will not pop. We have estimated n independently using the data in Table 1.2 on page 8 to be $\hat{n} = 342$, according to the result in Example 2.3.13 on page 42. In order to estimate p, we may use the average number of unpopped kernels in one-quarter cup from the data in Table 1.1 on page 6 and then divide that number by the estimated value of n to obtain the estimate
\[
\hat{p} = \frac{56}{342} \approx 0.1637.
\]
Thus, in this example, we assume that the count, X, of unpopped kernels in one-quarter cup in Table 1.1 on page 6 follows the distribution
\[
X \sim \text{binomial}(\hat{n}, \hat{p}).
\]
\[
\hat{\mu} = \overline{X}_n \approx 342,
\]
and
\[
\hat{\sigma} = S_n \approx 35.
\]
We therefore assume that
\[
N \sim \text{normal}(\hat{\mu}, \hat{\sigma}^2)
\]
and use the corresponding pdf to compute the probabilities that the counts will
lie in certain ranges.
Table 3.3 on page 54 shows those ranges and their corresponding probabilities. Note that the ranges for the counts were chosen so that the expected count for each category is 5. Table 3.3 also shows the predicted and observed counts, from which we get the value of the chi-square statistic, Q, to be $\hat{Q} = 2/5$. In this case Q has an approximate $\chi^2(1)$ asymptotic distribution, according to
Thus, based on the data, we cannot reject the null hypothesis that the counts
can be described as following a normal distribution. Hence, we were justified
in assuming a normal model when estimating the mean number of kernels in
one–quarter cup in Example 2.3.13 on page 42.
H₀: N is normally distributed

and

H₁: N is not normally distributed.
Here is another example.
Example 3.3.1. We wish to determine whether a given coin is fair or not. Thus, we test the null hypothesis
\[
\text{H}_o\colon\ p = \frac{1}{2}
\]
versus the alternative hypothesis
\[
\text{H}_1\colon\ p \neq \frac{1}{2},
\]
where p denotes the probability that a given toss of the coin will yield a head.
H₀: Y ∼ binomial(400, 0.5).

Notice that this hypothesis completely specifies the distribution of the random variable Y, which is known as a test statistic. On the other hand, the hypothesis in the goodness of fit test in Example 3.2.3 on page 53 does not specify a distribution. H₀ in Example 3.2.3 simply states that the count, N, of kernels in one-quarter cup follows a normal distribution, but it does not specify the parameters μ and σ².

Definition 3.3.2 (Simple versus Composite Hypotheses). A hypothesis which completely specifies a distribution is said to be a simple hypothesis. A hypothesis which is not simple is said to be composite.

For example, the alternative hypothesis, H₁: p ≠ 0.5, in Example 3.3.1 is composite, since the test statistic, Y, for that test is binomial(400, p) where p is any value between 0 and 1 which is not 0.5. Thus, H₁ is really a combination of many hypotheses.
The decision to reject or not reject H₀ in a hypothesis test is based on a set of observations, $X_1, X_2, \ldots, X_n$; these could be the outcomes of a certain experiment performed to test the hypothesis and are, therefore, random variables with a certain distribution. Given a set of observations, $X_1, X_2, \ldots, X_n$, a test statistic, $T = T(X_1, X_2, \ldots, X_n)$, may be formed. For instance, in Example 3.3.1 on page 54, the experiment might consist of flipping the coin 400 times and determining the number of heads. If the null hypothesis in that test is true, then the 400 observations are independent Bernoulli(0.5) trials. We can define the test statistic for this test to be
\[
T = \sum_{i=1}^{400} X_i,
\]
so that, if H₀ is true,
\[
T \sim \text{binomial}(400, 0.5).
\]
A test statistic for a hypothesis test may be used to establish a criterion for rejecting H₀. For instance, in the coin tossing Example 3.3.1, we can say that we reject the hypothesis that the coin is fair if
\[
|T - 200| \geqslant c; \tag{3.6}
\]
that is, the distance from the statistic T to the mean of the assumed distribution is at least a certain critical value, c. The condition in (3.6) constitutes a decision criterion for rejection of H₀. If the null hypothesis is true and the observed value, $\hat{T}$, of the test statistic, T, falls within the range specified by the rejection criterion in (3.6), we mistakenly reject H₀ when it is in fact true. This is known
If the null hypothesis, H₀, is in fact false, but the hypothesis test does not yield the rejection of H₀, then a type II error is made. The probability of a type II error is denoted by β.

In general, a hypothesis test is concerned with the question of whether a parameter, θ, from a certain underlying distribution is in a certain range or not. Suppose the underlying distribution has pdf or pmf denoted by f(x | θ), where we have explicitly expressed the dependence of the distribution function on the parameter θ; for instance, in Example 3.3.1, the underlying distribution is
\[
f(x \mid p) = p^x (1-p)^{1-x}, \quad \text{for } x = 0 \text{ or } x = 1.
\]
The hypotheses then take the form
\[
\text{H}_o\colon\ \theta \in \Omega_o
\]
and
\[
\text{H}_1\colon\ \theta \in \Omega_1,
\]
where $\Omega_o$ and $\Omega_1$ are complementary subsets of a parameter space $\Omega = \Omega_o \cup \Omega_1$, with $\Omega_o \cap \Omega_1 = \emptyset$. In Example 3.3.1, we have that $\Omega_o = \{0.5\}$ and
\[
\Omega_1 = \{p \in [0,1] \mid p \neq 0.5\}.
\]
\[
T = T(X_1, X_2, \ldots, X_n),
\]
where A is a subset of the real line. For example, in the coin tossing example, we had the rejection region
\[
R = \{(x_1, x_2, \ldots, x_n) \in \mathcal{D} \mid |T(x_1, x_2, \ldots, x_n) - 200| > c\},
\]
or
\[
R = \{(x_1, x_2, \ldots, x_n) \in \mathcal{D} \mid |T(x_1, x_2, \ldots, x_n) - np| > c\},
\]
since, in this case, T ∼ binomial(n, p), where n = 400, and p depends on which hypothesis we are assuming to be true. Thus, in this case, the set A in the definition of the rejection region in (3.7) is
\[
\text{H}_o\colon\ \theta \in \Omega_o
\]
and
\[
\text{H}_1\colon\ \theta \in \Omega_1,
\]
let
\[
P_\theta\big((x_1, x_2, \ldots, x_n) \in R\big)
\]
denote the probability that the observation values fall in the rejection region under the assumption that the random variables $X_1, X_2, \ldots, X_n$ are iid with distribution f(x | θ). Thus,
\[
\max_{\theta \in \Omega_o} P_\theta\big((x_1, x_2, \ldots, x_n) \in R\big)
\]
is the largest probability that H₀ will be rejected given that H₀ is true; this is the significance level for the test. In Example 3.3.1,
\[
\alpha = P_{0.5}\big((x_1, x_2, \ldots, x_n) \in R\big),
\]
where
\[
P_\theta\big((x_1, x_2, \ldots, x_n) \in R\big)
\]
gives the probability of rejecting the null hypothesis when H₀ is false. It then follows that the probability of a Type II error, for the case in which θ ∈ Ω₁, is
\[
\beta(\theta) = 1 - P_\theta\big((x_1, x_2, \ldots, x_n) \in R\big);
\]
this is the probability of not rejecting the null hypothesis when H₀ is in fact false.
Definition 3.3.5 (Power of a Test). For θ ∈ Ω₁, the function
\[
P_\theta\big((x_1, x_2, \ldots, x_n) \in R\big)
\]
is called the power function for the test at θ; that is, $P_\theta\big((x_1, \ldots, x_n) \in R\big)$ is the probability of rejecting the null hypothesis when it is in fact false. We will use the notation
\[
\gamma(\theta) = P_\theta\big((x_1, x_2, \ldots, x_n) \in R\big).
\]
Example 3.3.6. In Example 3.3.1 on page 54, consider the rejection region
\[
R = \{(x_1, x_2, \ldots, x_n) \mid |T(x_1, x_2, \ldots, x_n) - 200| > 20\},
\]
where
\[
T(x_1, x_2, \ldots, x_n) = \sum_{j=1}^{n} x_j,
\]
\[
\alpha = 1 - P(180 \leqslant T \leqslant 220),
\]
where we have used the continuity correction, since we are applying the Central Limit Theorem to approximate a discrete distribution; namely, T has an approximate normal(200, 100) distribution in this case, since n = 400 is large. We then have that
\[
\alpha \approx 0.0404.
\]
where

T ∼ binomial(400, p) for p ≠ 1/2.

We write

γ(p) = P(|T − 200| > 20) = 1 − P(180 ⩽ T ⩽ 220).
A sketch of the power function (Figure 3.3.3) was produced with the R command

plot(p,gammap,type='l',ylab="Power at p")
p      γ(p)
0.10   1.0000
0.20   1.0000
0.30   1.0000
0.40   0.9767
0.43   0.7756
0.44   0.6378
0.45   0.4800
0.46   0.3260
0.47   0.1978
0.48   0.1076
0.49   0.0566
0.50   0.0404
0.51   0.0566
0.52   0.1076
0.53   0.1978
0.54   0.3260
0.55   0.4800
0.56   0.6378
0.57   0.7756
0.60   0.9767
0.70   1.0000
0.80   1.0000
0.90   1.0000
[Figure 3.3.3: Sketch of the graph of the power function γ(p) for the test in Example 3.3.6; vertical axis "Power at p", ranging from 0 to 1.]
where p and gammap are arrays in which the values of p and γ(p) were stored.
The values of p were generated using the command

p <- seq(0.01,0.99,by=0.01)

and gammap was computed from p.
Observe that the sketch of the power function in Figure 3.3.3 on page 61
suggests that 𝛾(𝑝) tends to 1 as either 𝑝 → 0 or 𝑝 → 1, and that 𝛾(𝑝) → 𝛼 as
𝑝 → 0.5.
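The power values in the table can also be computed exactly from the binomial distribution, without the normal approximation. The notes do this in R; the following sketch is an illustration in Python (the function name `power` is ours, not the notes'):

```python
from math import comb

def power(p, n=400):
    """gamma(p) = P(|T - 200| > 20) = 1 - P(180 <= T <= 220), T ~ binomial(n, p)."""
    accept = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(180, 221))
    return 1 - accept

alpha = power(0.5)   # significance level; close to the 0.0404 obtained above
```

The exact values differ only slightly from those in the table, which were obtained from the normal approximation.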
Λ ⩽ 𝑐,
H𝑜 : 𝜃 ∈ Ω𝑜
L(p | x1, x2, . . . , xn) = p^y (1 − p)^{n−y},

where y = Σ_{i=1}^n x_i.
H_o : θ ∈ Ω_o versus H_1 : θ ∈ Ω_1,

where Ω = Ω_o ∪ Ω_1 with Ω_o ∩ Ω_1 = ∅.
Example 3.4.4 (Simple hypotheses for Bernoulli(p) trials). Consider the test
of
H𝑜 : 𝑝 = 𝑝 𝑜
versus
H1 : 𝑝 = 𝑝1 ,
where p_1 ≠ p_o, based on a random sample of size n from a Bernoulli(p) distribution, for 0 < p < 1. The likelihood ratio statistic for this test is
Λ(x1, x2, . . . , xn) = L(p_o | x1, x2, . . . , xn) / max{L(p_o | x1, x2, . . . , xn), L(p_1 | x1, x2, . . . , xn)},

which in this case reduces to

Λ(x1, x2, . . . , xn) = p_o^y (1 − p_o)^{n−y} / max{p_o^y (1 − p_o)^{n−y}, p_1^y (1 − p_1)^{n−y}},

where y = Σ_{i=1}^n x_i.
Definition 3.4.5 (Likelihood Ratio Test). We can use the likelihood ratio
statistic, Λ(x1, x2, . . . , xn), to define the rejection region

R : Λ(x1, x2, . . . , xn) ⩽ c,

for some critical value c with 0 < c < 1. This defines a likelihood ratio test
(LRT) for H_o against H_1.
The rationale for this definition is that, if the likelihood ratio of the sample
is very small, the evidence provided by the sample in favor of the null hypothesis
is not strong in comparison with the evidence for the alternative. Thus, in this
case it makes sense to reject H𝑜 .
H𝑜 : 𝑝 = 𝑝𝑜
versus
H1 : 𝑝 = 𝑝1 ,
for 𝑝𝑜 ∕= 𝑝1 , based on a random sample 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 from a Bernoulli(𝑝)
distribution for 0 < 𝑝 < 1.
Solution: The rejection region for the likelihood ratio test is given by

R : Λ(x1, x2, . . . , xn) ⩽ c,

for 0 < c < 1, where

Λ(x1, x2, . . . , xn) = p_o^y (1 − p_o)^{n−y} / max{p_o^y (1 − p_o)^{n−y}, p_1^y (1 − p_1)^{n−y}},

with

y = Σ_{i=1}^n x_i.

If Λ(x1, x2, . . . , xn) ⩽ c < 1, the maximum in the denominator cannot be the numerator, so that

Λ(x1, x2, . . . , xn) = p_o^y (1 − p_o)^{n−y} / [p_1^y (1 − p_1)^{n−y}];
the rejection condition can therefore be written as

a^n r^y ⩽ c, (3.9)

where a = (1 − p_o)/(1 − p_1), r = [p_o(1 − p_1)]/[p_1(1 − p_o)], and

y = Σ_{i=1}^n x_i.
In the case p_1 > p_o we have r < 1, so that ln r < 0 and (3.9) is equivalent to

y ⩾ (ln c − n ln a)/ln r.
In other words, the LRT will reject H_o if

Y ⩾ b,

where b = ln(c/a^n)/ln r > 0, and Y is the statistic

Y = Σ_{i=1}^n X_i,
y ⩽ (ln c − n ln a)/ln r.
In other words, the LRT will reject H𝑜 if
𝑌 ⩽ 𝑑,
where d = (ln c − n ln a)/ln r can be made to be positive by choosing

n > (ln c)/(ln a),

and Y is again the number of successes in the sample. □
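The reduction of the LRT to a threshold test on Y can be checked numerically: for p_1 > p_o the likelihood ratio is non-increasing in y, so every set {Λ ⩽ c} is of the form {y ⩾ b}. A Python sketch, with hypothetical values n = 20, p_o = 0.3, p_1 = 0.6:

```python
def likelihood(p, y, n):
    # Bernoulli-sample likelihood, which depends on the data only through y
    return p**y * (1 - p)**(n - y)

def lrt_stat(y, n, p0, p1):
    l0, l1 = likelihood(p0, y, n), likelihood(p1, y, n)
    return l0 / max(l0, l1)

n, p0, p1 = 20, 0.3, 0.6   # hypothetical values with p1 > p0
lams = [lrt_stat(y, n, p0, p1) for y in range(n + 1)]
# lams is non-increasing in y, so each {Lambda <= c} is a set {y >= b}
```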
We next consider the example in which we test
H𝑜 : 𝑝 = 𝑝𝑜
versus
H1 : p ≠ p_o
based on a random sample 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 from a Bernoulli(𝑝) distribution for
0 < 𝑝 < 1. We would like to find the LRT rejection region for this test of
hypotheses.
In this case the likelihood ratio statistic is

Λ(x1, x2, . . . , xn) = L(p_o | x1, x2, . . . , xn) / sup_{0<p<1} L(p | x1, x2, . . . , xn), (3.10)
where L(p | x1, x2, . . . , xn) = p^y (1 − p)^{n−y} for y = Σ_{i=1}^n x_i.
In order to determine the denominator in the likelihood ratio in (3.10), we
need to maximize the function 𝐿(𝑝 ∣ 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) over 0 < 𝑝 < 1. We can do
this by maximizing the natural logarithm of the likelihood function,
ℓ(p) = ln(L(p | x1, x2, . . . , xn)) = y ln p + (n − y) ln(1 − p), for 0 < p < 1,

which is maximized at

p̂ = y/n.
Observe also that lim_{t→0+} h(t) = ln[(1 − p_o)^n] and lim_{t→(1/p_o)−} h(t) = ln[p_o^n], so that

Λ(0) = (1 − p_o)^n and Λ(1/p_o) = p_o^n.
Putting all the information about the graph of Λ(t) together, we obtain a sketch
like the one shown in Figure 3.4.4, where we have sketched the case p_o = 1/4 and
n = 20 for 0 ⩽ t ⩽ 4.

[Figure 3.4.4: Sketch of Λ(t) for p_o = 1/4 and n = 20; horizontal axis t from 0 to 4, vertical axis Λ(t) from 0 to 1.]

The sketch in Figure 3.4.4 suggests that, given any
positive value of c such that c < 1 and c > max{p_o^n, (1 − p_o)^n}, there exist
positive values 𝑡1 and 𝑡2 such that 0 < 𝑡1 < 1 < 𝑡2 < 1/𝑝𝑜 and
Λ(𝑡) = 𝑐 for 𝑡 = 𝑡1 , 𝑡2 .
Furthermore,
Λ(𝑡) ⩽ 𝑐 for 𝑡 ⩽ 𝑡1 or 𝑡 ⩾ 𝑡2 .
Thus, the LRT rejection region for the test of H𝑜 : 𝑝 = 𝑝𝑜 versus H1 : 𝑝 ∕= 𝑝𝑜 is
equivalent to
p̂/p_o ⩽ t1 or p̂/p_o ⩾ t2,
which we could rephrase in terms of Y = Σ_{i=1}^n X_i as

R : Y ⩽ t1 n p_o or Y ⩾ t2 n p_o,
for some 𝑡1 and 𝑡2 with 0 < 𝑡1 < 1 < 𝑡2 . This rejection region can also be
phrased as
𝑅 : 𝑌 < 𝑛𝑝𝑜 − 𝑏 or 𝑌 > 𝑛𝑝𝑜 + 𝑏,
for some 𝑏 > 0. The value of 𝑏 will then be determined by the significance level
that we want to impose on the test.
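For a prescribed significance level, the value of b can be found by scanning the binomial distribution under H_o. The following Python sketch is an illustration, with hypothetical values n = 100, p_o = 0.5, α = 0.05:

```python
from math import comb

def reject_prob(b, n, p0):
    """P(Y < n*p0 - b or Y > n*p0 + b) for Y ~ binomial(n, p0)."""
    m = n * p0
    return sum(comb(n, k) * p0**k * (1 - p0)**(n - k)
               for k in range(n + 1) if k < m - b or k > m + b)

def critical_b(n, p0, alpha):
    # smallest integer b whose rejection region has probability <= alpha under H_o
    b = 0
    while reject_prob(b, n, p0) > alpha:
        b += 1
    return b

b = critical_b(100, 0.5, 0.05)   # hypothetical test of H_o: p = 0.5
```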
Example 3.4.8 (Likelihood ratio test based on a sample from a normal distribution). We wish to test the hypothesis

H_o : μ = μ_o, σ² > 0,

versus the alternative H_1 : μ ≠ μ_o, σ² > 0.
Writing ℓ(μ, σ) = ln(L(μ, σ | x1, x2, . . . , xn)), we compute the partial derivatives

∂ℓ/∂μ (μ, σ) = (n/σ²)(x̄ − μ)

and

∂ℓ/∂σ (μ, σ) = (1/σ³) Σ_{i=1}^n (x_i − μ)² − n/σ,

where x̄ = (1/n) Σ_{i=1}^n x_i, and the second partial derivatives
∂²ℓ/∂μ² (μ, σ) = −n/σ²,
∂²ℓ/∂σ∂μ (μ, σ) = ∂²ℓ/∂μ∂σ (μ, σ) = −(2n/σ³)(x̄ − μ),
and
∂²ℓ/∂σ² (μ, σ) = −(3/σ⁴) Σ_{i=1}^n (x_i − μ)² + n/σ².
The critical points of ℓ(μ, σ) solve the system

∂ℓ/∂μ (μ, σ) = 0,
∂ℓ/∂σ (μ, σ) = 0,
which yields
μ̂ = x̄,

σ̂² = (1/n) Σ_{i=1}^n (x_i − x̄)².
To see that ℓ(μ, σ) is maximized at these values, look at the Hessian matrix

( ∂²ℓ/∂μ² (μ, σ)    ∂²ℓ/∂σ∂μ (μ, σ) )
( ∂²ℓ/∂μ∂σ (μ, σ)   ∂²ℓ/∂σ² (μ, σ)  )

at (μ̂, σ̂) to get

( −n/σ̂²      0       )
(    0      −2n/σ̂²   ),
which has negative eigenvalues. It then follows that ℓ(μ, σ) is maximized at
(μ̂, σ̂). Hence, x̄ is the MLE for μ and

σ̂² = (1/n) Σ_{i=1}^n (x_i − x̄)²

is the MLE for σ². Observe that

σ̂² = ((n − 1)/n) S_n²,

so that

E(σ̂²) = ((n − 1)/n) σ²,
ℓ(σ) = ln(L(μ_o, σ | x1, x2, . . . , xn)) = −(1/(2σ²)) Σ_{i=1}^n (x_i − μ_o)² − n ln σ − (n/2) ln(2π).
Then

ℓ′(σ) = (1/σ³) Σ_{i=1}^n (x_i − μ_o)² − n/σ

and

ℓ′′(σ) = −(3/σ⁴) Σ_{i=1}^n (x_i − μ_o)² + n/σ².
Setting ℓ′(σ) = 0 yields the critical point σ̄, where

σ̄² = (1/n) Σ_{i=1}^n (x_i − μ_o)².

Note that

ℓ′′(σ̄) = −2n/σ̄² < 0,

so that ℓ(σ) is maximized when σ = σ̄. We then have that

sup_{σ>0} L(μ_o, σ | x1, x2, . . . , xn) = L(μ_o, σ̄ | x1, x2, . . . , xn).
Observe that

Σ_{i=1}^n (x_i − μ_o)² = Σ_{i=1}^n (x_i − x̄ + x̄ − μ_o)²
= Σ_{i=1}^n (x_i − x̄)² + Σ_{i=1}^n (x̄ − μ_o)²,
since

Σ_{i=1}^n 2(x_i − x̄)(x̄ − μ_o) = 2(x̄ − μ_o) Σ_{i=1}^n (x_i − x̄) = 0.

Dividing by n, we then obtain that

σ̄² = σ̂² + (x̄ − μ_o)². (3.12)
The likelihood ratio statistic is therefore

Λ(x1, x2, . . . , xn) = sup_{σ>0} L(μ_o, σ | x1, x2, . . . , xn) / L(μ̂, σ̂ | x1, x2, . . . , xn) = σ̂ⁿ/σ̄ⁿ.

An LRT then rejects H_o if

σ̂ⁿ/σ̄ⁿ ⩽ c,
for some 𝑐 with 0 < 𝑐 < 1, or
σ̂²/σ̄² ⩽ c^{2/n},

or

σ̄²/σ̂² ⩾ 1/c^{2/n},
where 1/c^{2/n} > 1. In view of (3.12), we see that an LRT will reject H_o if

(x̄ − μ_o)²/σ̂² ⩾ 1/c^{2/n} − 1 ≡ k,
where k > 0, and σ̂² is the MLE for σ². Writing ((n − 1)/n) S_n² for σ̂², we see that
an LRT will reject H_o if

|x̄ − μ_o| / (S_n/√n) ⩾ √((n − 1) k) ≡ b,
where b > 0. Hence, the LRT can be based on the test statistic

T_n = (X̄_n − μ_o) / (S_n/√n).
Note that 𝑇𝑛 has a 𝑡(𝑛−1) distribution if H𝑜 is true. We then see that if 𝑡𝛼/2,𝑛−1
is such that
P(∣𝑇 ∣ ⩾ 𝑡𝛼/2,𝑛−1 ) = 𝛼, for 𝑇 ∼ 𝑡(𝑛 − 1),
then the test that rejects H_o when

|X̄_n − μ_o| / (S_n/√n) ⩾ t_{α/2,n−1}

has significance level α.
Observe also that the set of values of μ_o which do not get rejected by this
test is the open interval

( X̄_n − t_{α/2,n−1} S_n/√n , X̄_n + t_{α/2,n−1} S_n/√n ).
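For concreteness, T_n and this interval can be computed as follows. This is a Python illustration; the data and the critical value t_{0.025,9} = 2.262 (taken from a t table for n = 10) are assumptions, not from the notes:

```python
from math import sqrt
from statistics import mean, stdev

data = [5.1, 4.8, 5.3, 5.0, 4.9, 5.2, 5.4, 4.7, 5.0, 5.1]   # hypothetical sample
mu0, t_crit = 5.0, 2.262        # H_o: mu = 5.0; t_crit = t_{0.025, 9} from a table

n = len(data)
xbar, s = mean(data), stdev(data)        # stdev computes S_n (divisor n - 1)
t_n = (xbar - mu0) / (s / sqrt(n))
half = t_crit * s / sqrt(n)
interval = (xbar - half, xbar + half)    # values of mu0 that are not rejected
reject = abs(t_n) >= t_crit
```

For this sample |T_n| falls well below the critical value, so μ_o = 5.0 is not rejected and lies inside the interval.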
H𝑜 : 𝜃 = 𝜃 𝑜
γ̃(θ1) ⩽ γ(θ1); (3.19)
in other words, out of all the tests of the simple hypothesis H𝑜 : 𝜃 = 𝜃𝑜 versus
H1 : 𝜃 = 𝜃1 , the LRT yields the largest possible power. Consequently, the
LRT gives the smallest probability of making a Type II error out of the tests of
significance level α.
The proof of the Neyman–Pearson Lemma is straightforward. First observe
that

R = (R ∩ R̃) ∪ (R ∩ R̃ᶜ), (3.20)

where R̃ᶜ denotes the complement of R̃. It then follows from (3.15) that
α = ∫_{R∩R̃} L(θ_o | x) dx + ∫_{R∩R̃ᶜ} L(θ_o | x) dx, (3.21)
R̃ = (R̃ ∩ R) ∪ (R̃ ∩ Rᶜ), (3.22)
where we have used (3.24). The inequality in (3.19) now follows from (3.31).
Thus, we have proved the Neyman–Pearson Lemma.
The Neyman–Pearson Lemma applies only to tests of simple hypotheses.
For instance, in Example 3.4.6 on page 64, dealing with the test of H_o : p = p_o
versus H_1 : p = p_1, for p_1 > p_o, based on a random sample X1, X2, . . . , Xn from
a Bernoulli(p) distribution for 0 < p < 1, we saw that the LRT rejects the null
hypothesis, at some significance level α, if
Y = Σ_{i=1}^n X_i ⩾ b, (3.32)
where b is chosen according to the desired significance level. For large n,

α = P(Y ⩾ b)
  = P( (Y − n p_o)/√(n p_o (1 − p_o)) ⩾ (b − n p_o)/√(n p_o (1 − p_o)) )
  ≈ P( Z ⩾ (b − n p_o)/√(n p_o (1 − p_o)) ),

where Z ∼ normal(0, 1), by the Central Limit Theorem. Solving for b yields

b ≈ n p_o + z_α √(n p_o (1 − p_o)), (3.33)

where z_α is such that P(Z ⩾ z_α) = α. By the Neyman–Pearson Lemma, the LRT with this value of b
in (3.32) gives the most powerful test at the significance level of α. Observe
that this value of b depends only on p_o and n; it does not depend on p_1.
Now consider the test of H_o : p = p_o versus H_1 : p > p_o. Since the alternative hypothesis is not simple, we cannot apply the Neyman–Pearson Lemma directly. However, by the previous considerations, the test that rejects H_o : p = p_o
if

Y ⩾ b,

where b is given by (3.33), is, for large n, the most powerful test at level α for every
p_1 > p_o; i.e., for every possible value in the alternative hypothesis H_1 : p > p_o.
We then say that the LRT is the uniformly most powerful test (UMP) at
level 𝛼 in this case.
Definition 3.5.1 (Uniformly most powerful test). A test of a simple hypothesis
H_o : θ = θ_o against a composite alternative hypothesis H_1 : θ ∈ Ω_1 is said to
be a uniformly most powerful (UMP) test at level α if it is most powerful
at that level for every simple alternative θ = θ1 in Ω_1.
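A quick numerical illustration of why the one-sided test Y ⩾ b is UMP: its power function P(Y ⩾ b) is increasing in p, so the same rejection region is most powerful against every alternative p_1 > p_o at once. The values n = 100, p_o = 0.5, b = 59 (roughly level 0.04–0.05) below are hypothetical:

```python
from math import comb

def power_ump(p, n=100, b=59):
    """P(Y >= b) for Y ~ binomial(n, p): power of the test that rejects when Y >= b."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(b, n + 1))

level = power_ump(0.5)   # significance level of the test, roughly 0.04-0.05
```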
Chapter 4
Evaluating Estimators
An estimator is a statistic

T = T(X1, X2, . . . , Xn).

For instance, when sampling from a normal(μ, σ²) distribution, the sample variance S_n² and
the MLE σ̂² = ((n − 1)/n) S_n² are both estimators for the variance σ². The sample variance, S_n², is unbiased,
while the MLE is not.
As another example, consider a random sample, X1, X2, . . . , Xn, from a
Poisson distribution with parameter λ. Then the sample mean, X̄_n, and the
sample variance, S_n², are both unbiased estimators for λ.
Given two estimators for a given parameter, 𝜃, is there a way to evaluate
the two estimators in such a way that we can tell which of the two is the better
one? In this chapter we explore one way to measure how good an estimator is,
the mean squared error or MSE. We will then see how to use that measure to
compare one estimator to others.
MSE(W) = E_θ[(W − θ)²].

Writing W − θ = (W − E_θ(W)) + (E_θ(W) − θ) and expanding the square, the
expectation of the cross term vanishes, since

E_θ[2(W − E_θ(W))(E_θ(W) − θ)] = 2(E_θ(W) − θ)[E_θ(W) − E_θ(W)] = 0.

Consequently,

MSE(W) = var(W) + [E_θ(W) − θ]²;

that is, the mean squared error of W is the sum of the variance of W and the
quantity [E_θ(W) − θ]². The expression E_θ(W) − θ is called the bias of the
estimator W and is denoted by bias_θ(W); that is,

bias_θ(W) = E_θ(W) − θ.
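The decomposition MSE(W) = var(W) + [bias_θ(W)]² can be seen numerically. The following Python sketch estimates all three quantities by simulation for a deliberately biased estimator (sample mean plus a constant offset); the distribution, offset, and sample sizes are illustrative assumptions:

```python
import random
from statistics import mean

random.seed(7)
theta, offset, N = 2.0, 0.3, 5000    # true mean, artificial bias, replications

ests = []
for _ in range(N):
    sample = [random.gauss(theta, 1.0) for _ in range(10)]
    ests.append(mean(sample) + offset)   # W = sample mean + 0.3, biased for theta

m = mean(ests)
mse = mean((w - theta) ** 2 for w in ests)
bias = m - theta
var_w = mean((w - m) ** 2 for w in ests)
# with these empirical definitions, mse == var_w + bias**2 is an exact
# algebraic identity (up to floating-point rounding)
```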
MSE(X̄_n) = var(X̄_n) = σ²/n

and

MSE(S_n²) = var(S_n²) = 2σ⁴/(n − 1),

where we have used the fact that

((n − 1)/σ²) S_n² ∼ χ²(n − 1),

and therefore

((n − 1)²/σ⁴) var(S_n²) = 2(n − 1).
Example 4.1.2 (Comparing the sample variance and the MLE in a sample
from a normal distribution). Let X1, X2, . . . , Xn be a random sample from a
normal(μ, σ²) distribution. The MLE for σ² is the estimator

σ̂² = (1/n) Σ_{i=1}^n (X_i − X̄_n)².
Since σ̂² = ((n − 1)/n) S_n², and S_n² is an unbiased estimator for σ², it follows that

E(σ̂²) = ((n − 1)/n) σ² = σ² − σ²/n.
It then follows that the bias of σ̂² is

bias(σ̂²) = E(σ̂²) − σ² = −σ²/n,

which shows that, on average, σ̂² underestimates σ².
Next, we compute the variance of σ̂². In order to do this, we use the fact
that

((n − 1)/σ²) S_n² ∼ χ²(n − 1),

so that

var( ((n − 1)/σ²) S_n² ) = 2(n − 1).
It then follows from σ̂² = ((n − 1)/n) S_n² that

(n²/σ⁴) var(σ̂²) = 2(n − 1),
so that

var(σ̂²) = 2(n − 1)σ⁴/n².
It then follows that the mean squared error of σ̂² is

MSE(σ̂²) = var(σ̂²) + [bias(σ̂²)]²
= 2(n − 1)σ⁴/n² + σ⁴/n²
= ((2n − 1)/n²) σ⁴.
Comparing the value of MSE(σ̂²) to

MSE(S_n²) = 2σ⁴/(n − 1),

we see that

MSE(σ̂²) < MSE(S_n²).
Hence, the MLE for σ² has a smaller mean squared error than the unbiased
estimator S_n². Thus, σ̂² is a more precise estimator than S_n²; however, S_n² is
more accurate than σ̂².
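The comparison can be checked exactly from the two formulas (working in units of σ⁴); a short Python sketch using exact rational arithmetic:

```python
from fractions import Fraction

def mse_s2(n):
    """MSE(S_n^2) = 2 sigma^4 / (n - 1), in units of sigma^4."""
    return Fraction(2, n - 1)

def mse_mle(n):
    """MSE(sigma_hat^2) = (2n - 1) sigma^4 / n^2, in units of sigma^4."""
    return Fraction(2 * n - 1, n * n)

# the MLE has strictly smaller MSE for every sample size n >= 2
assert all(mse_mle(n) < mse_s2(n) for n in range(2, 201))
```

Note that the ratio MSE(σ̂²)/MSE(S_n²) = (2n − 1)(n − 1)/(2n²) tends to 1 as n grows, so the advantage of the MLE vanishes in large samples.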
cov(W1, W2) = E_θ(W1 W2)
= ∫_{ℝⁿ} W(x) [1/L(θ | x)] (∂/∂θ)[L(θ | x)] L(θ | x) dx
= ∫_{ℝⁿ} W(x) (∂/∂θ)[L(θ | x)] dx
= ∫_{ℝⁿ} (∂/∂θ)[W(x) L(θ | x)] dx.
Thus, if the order of differentiation and integration can be interchanged, we
have that

cov(W1, W2) = (∂/∂θ)[ ∫_{ℝⁿ} W(x) L(θ | x) dx ] = (∂/∂θ)[E_θ(W)].
Thus, if we set

g(θ) = E_θ(W)

for all θ in the parameter range, we see that cov(W1, W2) = g′(θ). Note also that

W2 = (∂/∂θ)[ln(L(θ | X1, X2, . . . , Xn))] = Σ_{i=1}^n (∂/∂θ)[ln(f(X_i | θ))].
var(𝑊2 ) = 𝑛𝐼(𝜃).
var(W) ⩾ [g′(θ)]² / (n I(θ)), (4.4)

where

I(θ) = var( (∂/∂θ)[ln(f(X | θ))] )

is the Fisher information. For the case in which W is unbiased we obtain from
(4.4) that

var(W) ⩾ 1/(n I(θ)). (4.5)
Example 4.2.1. Let X1, X2, . . . , Xn be a random sample from a Poisson(λ)
distribution. Then,

f(X, λ) = (λ^X / X!) e^{−λ},

so that

ln(f(X, λ)) = X ln λ − λ − ln(X!)

and

(∂/∂λ)[ln(f(X, λ))] = X/λ − 1.

Then the Fisher information is

I(λ) = (1/λ²) var(X) = (1/λ²) · λ = 1/λ.

Thus, the Crámer–Rao lower bound for unbiased estimators is obtained from
(4.5) to be

1/(n I(λ)) = λ/n.
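The identity I(λ) = 1/λ can be checked numerically by summing the squared score (X/λ − 1)² against the Poisson pmf. A Python sketch, truncating the support where the tail is negligible:

```python
from math import exp

def poisson_fisher_info(lam, support=60):
    """Numerically compute E[(X/lam - 1)^2] for X ~ Poisson(lam)."""
    pmf = exp(-lam)                      # pmf at x = 0
    total = (0.0 / lam - 1.0) ** 2 * pmf
    for x in range(1, support + 1):
        pmf *= lam / x                   # pmf(x) = pmf(x-1) * lam / x
        total += (x / lam - 1.0) ** 2 * pmf
    return total

lam = 3.0
info = poisson_fisher_info(lam)          # should be close to 1/lam
```

Since var(X̄_n) = λ/n exactly, the sample mean attains the Crámer–Rao bound 1/(n I(λ)) = λ/n and is therefore an efficient unbiased estimator of λ.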
Next, differentiate (4.7) with respect to θ one more time to obtain that

(∂/∂θ) E_θ[ (∂/∂θ)(ln(f(X | θ))) ] = 0, (4.8)
where, assuming that the order of differentiation and integration can be interchanged,

(∂/∂θ) E_θ[ (∂/∂θ)(ln(f(x | θ))) ]
= (∂/∂θ)[ ∫_{−∞}^{∞} (∂/∂θ)(ln(f(x | θ))) f(x | θ) dx ]
= ∫_{−∞}^{∞} (∂²/∂θ²)(ln(f(x | θ))) f(x | θ) dx + ∫_{−∞}^{∞} (∂/∂θ)(ln(f(x | θ))) (∂/∂θ)f(x | θ) dx
= E_θ[ (∂²/∂θ²)(ln(f(x | θ))) ] + ∫_{−∞}^{∞} (1/f(x | θ)) [ (∂/∂θ)f(x | θ) ]² dx,
where

∫_{−∞}^{∞} (1/f(x | θ)) [ (∂/∂θ)f(x | θ) ]² dx
= ∫_{−∞}^{∞} [ (1/f(x | θ)) (∂/∂θ)f(x | θ) ]² f(x | θ) dx
= ∫_{−∞}^{∞} [ (∂/∂θ)(ln(f(x | θ))) ]² f(x | θ) dx
= E_θ[ ( (∂/∂θ)(ln(f(x | θ))) )² ].
Consequently,

(∂/∂θ) E_θ[ (∂/∂θ)(ln(f(x | θ))) ] = E_θ[ (∂²/∂θ²)(ln(f(x | θ))) ] + I(θ),

so that, by (4.8),

I(θ) = −E_θ[ (∂²/∂θ²)(ln(f(x | θ))) ].
Appendix A

Pearson Chi–Square Statistic

The goal of this appendix is to prove the first part of Theorem 3.1.4 on page
49; namely: assume that

(X1, X2, . . . , Xk)

is a random vector with a multinomial(n, p1, p2, . . . , pk) distribution, and define

Q = Σ_{i=1}^k (X_i − n p_i)² / (n p_i). (A.1)

Then, as n → ∞, Q converges in distribution to a χ²(k − 1) random variable.
C⁻¹_{(U1,U2)} =
( 1/p1 + 1/p3     1/p3        )
( 1/p3            1/p2 + 1/p3 ).
Consequently,

n C⁻¹_{W_n} =
( 1/p1 + 1/p3     1/p3        )
( 1/p3            1/p2 + 1/p3 ).
Note also that

(W_n − E(W_n))ᵀ C⁻¹_{W_n} (W_n − E(W_n)),
where (W𝑛 − 𝐸(W𝑛 ))𝑇 is the transpose of the column vector (W𝑛 − 𝐸(W𝑛 )),
is equal to
n⁻¹ (X1 − np1, X2 − np2) [ 1/p1 + 1/p3 , 1/p3 ; 1/p3 , 1/p2 + 1/p3 ] (X1 − np1, X2 − np2)ᵀ,
which is equal to

n⁻¹ (1/p1 + 1/p3)(X1 − np1)²
+ (n⁻¹/p3)(X1 − np1)(X2 − np2) + (n⁻¹/p3)(X2 − np2)(X1 − np1)
+ n⁻¹ (1/p2 + 1/p3)(X2 − np2)².
Note that

(X1 − np1)(X2 − np2) = (X1 − np1)(n − X1 − X3 − np2) = −(X1 − np1)² − (X1 − np1)(X3 − np3),

and, similarly,

(X2 − np2)(X1 − np1) = −(X2 − np2)² − (X2 − np2)(X3 − np3).
It then follows that the quadratic form above is equal to

n⁻¹ (1/p1 + 1/p3 − 1/p3)(X1 − np1)²
− (n⁻¹/p3)(X1 − np1)(X3 − np3) − (n⁻¹/p3)(X2 − np2)(X3 − np3)
+ n⁻¹ (1/p2 + 1/p3 − 1/p3)(X2 − np2)²,
or

(1/(np1))(X1 − np1)²
− (1/(np3))(X3 − np3)[(X1 − np1) + (X2 − np2)]
+ (1/(np2))(X2 − np2)²,
where

(X3 − np3)[(X1 − np1) + (X2 − np2)] = (X3 − np3)[X1 + X2 − n(p1 + p2)] = −(X3 − np3)².
It follows that the quadratic form is equal to

(1/(np1))(X1 − np1)² + (1/(np3))(X3 − np3)² + (1/(np2))(X2 − np2)²;
that is,

(W_n − E(W_n))ᵀ C⁻¹_{W_n} (W_n − E(W_n)) = Σ_{j=1}^3 (X_j − n p_j)² / (n p_j).
Next, put

Z_n = C_{W_n}^{−1/2} (W_n − E(W_n))

and apply the multivariate central limit theorem (see, for instance, [Fer02, Theorem 5, p. 26]) to obtain that
Z_n →D Z ∼ normal(0, I) as n → ∞;

that is, the bivariate random vectors Z_n = C_{W_n}^{−1/2}(W_n − E(W_n)) converge in
distribution to a bivariate random vector Z with mean (0, 0)ᵀ and covariance
matrix

I = ( 1  0 ; 0  1 ).

In other words,

Z = (Z1, Z2)ᵀ,

where Z1 and Z2 are independent normal(0, 1) random variables. Consequently,

[C_{W_n}^{−1/2}(W_n − E(W_n))]ᵀ C_{W_n}^{−1/2}(W_n − E(W_n)) →D ZᵀZ = Z1² + Z2²
as n → ∞; that is,

Q = Σ_{i=1}^3 (X_i − n p_i)² / (n p_i)

converges in distribution to a χ²(2) random variable as n → ∞.
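A small Monte Carlo check of this limit for k = 3: the average of Q over many simulated multinomial samples should be near E[χ²(2)] = 2, and the tail probability P(Q > 5.991) near 0.05. The sample size and cell probabilities in this Python sketch are illustrative assumptions:

```python
import random
from statistics import mean

random.seed(42)
n, probs = 60, [0.2, 0.3, 0.5]     # hypothetical multinomial(n, p1, p2, p3)
trials = 4000

def q_statistic():
    # draw one multinomial(n, probs) observation by sampling n categories
    draws = random.choices([0, 1, 2], weights=probs, k=n)
    counts = [draws.count(i) for i in range(3)]
    return sum((counts[i] - n * probs[i]) ** 2 / (n * probs[i]) for i in range(3))

qs = [q_statistic() for _ in range(trials)]
```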
(X1, X2, . . . , Xk) ∼ multinomial(n, p1, p2, . . . , pk).

Next, put

Z_n = C_{W_n}^{−1/2} (W_n − E(W_n))
Appendix B

The Variance of the Sample Variance

The main goal of this appendix is to compute the variance of the sample variance
based on a sample from an arbitrary distribution; i.e.,
var(S_n²),

where

S_n² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄_n)².
We will come up with a formula based on the second and fourth central
moments of the underlying distribution. More precisely, we will prove that
var(S_n²) = (1/n) ( μ4 − ((n − 3)/(n − 1)) μ2² ), (B.1)

where μ2 is the second central moment,

μ2 = E[(X − E(X))²],

and, in general,

μ_k = E[(X − E(X))^k].
since

Σ_i Σ_j (X_i − X̄_n)(X̄_n − X_j) = Σ_i (X_i − X̄_n) Σ_j (X̄_n − X_j) = 0.
= μ4 + 6 μ2 · μ2 + μ4,

where we have used the independence of the X_i's and the definition of the central
moments. We then have that

E[(X_i − X_j)⁴] = 2μ4 + 6μ2², for i ≠ j. (B.6)
For the rest of the expectations, E[(X_i − X_j)²(X_k − X_ℓ)²], in (B.4) there are
two possibilities:

(i) i ≠ k and j ≠ ℓ, or

(ii) either i = k, or j = ℓ, but not both simultaneously.
In case (i) we obtain, by the independence of the X_i's and the definition of
the central moments, that

E[(X_i − X_j)²(X_k − X_ℓ)²] = E[(X_i − X_j)²] · E[(X_k − X_ℓ)²], (B.7)

where

E[(X_i − X_j)²] = E[(X_i − μ1 + μ1 − X_j)²] = E[(X_i − μ1)²] + E[(X_j − μ1)²],

since

E[(X_i − μ1)(μ1 − X_j)] = E(X_i − μ1) · E(μ1 − X_j) = 0.

Consequently,

E[(X_i − X_j)²] = 2μ2.

Similarly,

E[(X_k − X_ℓ)²] = 2μ2,

so that, in case (i),

E[(X_i − X_j)²(X_k − X_ℓ)²] = 4μ2². (B.8)
= μ4 + 3μ2².

We obtain the same value for all the other expectations in case (ii); i.e.,

E[(X_i − X_j)²(X_i − X_ℓ)²] = μ4 + 3μ2², for i ≠ j ≠ ℓ. (B.9)
It follows from (B.4) and the values of the possible expectations, E[(X_i − X_j)²(X_k − X_ℓ)²],
we have computed in equations (B.6), (B.8) and (B.9), that

E[n²(n − 1)²(S_n²)²] = Σ_{i<j} Σ_{k<ℓ} E[(X_i − X_j)²(X_k − X_ℓ)²]
= C(n, 2)(2μ4 + 6μ2²) + 6 C(n, 4)(4μ2²) + ( C(n, 2)² − C(n, 2) − 6 C(n, 4) )(μ4 + 3μ2²),

where C(n, k) denotes the binomial coefficient.
Noting that C(n, 2) = n(n − 1)/2, the above expression simplifies to

E[n²(n − 1)²(S_n²)²] = n(n − 1)(μ4 + 3μ2²) + n(n − 1)(n − 2)(n − 3)μ2² + n(n − 1)(n − 2)(μ4 + 3μ2²).

Thus, dividing by n(n − 1) on both sides of the previous equation, we then obtain
that

E[n(n − 1)(S_n²)²] = μ4 + 3μ2² + (n − 2)(n − 3)μ2² + (n − 2)(μ4 + 3μ2²).
Thus,

var(S_n²) = E[(S_n²)²] − [E(S_n²)]²
= (1/n) ( μ4 + ((n² − 2n + 3)/(n − 1)) μ2² ) − (μ2)²,
since S_n² is an unbiased estimator of μ2. Simplifying, we then obtain that

var(S_n²) = (1/n) μ4 + ((3 − n)/(n(n − 1))) μ2²,
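As a consistency check, in the normal case μ4 = 3σ⁴ and μ2 = σ², and formula (B.1) reduces to the value 2σ⁴/(n − 1) obtained in Chapter 4 from the χ²(n − 1) distribution of (n − 1)S_n²/σ². A short exact verification in Python, working in units of σ⁴:

```python
from fractions import Fraction

def var_s2(n, mu4, mu2):
    """Formula (B.1): var(S_n^2) = (1/n) * (mu4 - (n - 3)/(n - 1) * mu2^2)."""
    return Fraction(1, n) * (mu4 - Fraction(n - 3, n - 1) * mu2 ** 2)

# normal case: mu2 = 1, mu4 = 3 (units of sigma^4) recovers 2 / (n - 1)
assert all(var_s2(n, 3, 1) == Fraction(2, n - 1) for n in range(2, 200))
```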