
Statistical Theory

Lecture Notes

Adolfo J. Rumbos

© Draft date December 18, 2009
Contents

1 Introduction 5
1.1 Introduction to statistical inference . . . . . . . . . . . . . . . . . 5
1.1.1 An Introductory Example . . . . . . . . . . . . . . . . . . 5
1.1.2 Sampling: Concepts and Terminology . . . . . . . . . . . 7

2 Estimation 13
2.1 Estimating the Mean of a Distribution . . . . . . . . . . . . . . . 13
2.2 Interval Estimate for Proportions . . . . . . . . . . . . . . . . . . 15
2.3 Interval Estimates for the Mean . . . . . . . . . . . . . . . . . . . 17
2.3.1 The 𝜒2 Distribution . . . . . . . . . . . . . . . . . . . . . 18
2.3.2 The 𝑡 Distribution . . . . . . . . . . . . . . . . . . . . . . 26
2.3.3 Sampling from a normal distribution . . . . . . . . . . . . 29
2.3.4 Distribution of the Sample Variance from a Normal Dis-
tribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3.5 The Distribution of 𝑇𝑛 . . . . . . . . . . . . . . . . . . . . 39

3 Hypothesis Testing 43
3.1 Chi–Square Goodness of Fit Test . . . . . . . . . . . . . . . . . . 44
3.1.1 The Multinomial Distribution . . . . . . . . . . . . . . . . 46
3.1.2 The Pearson Chi-Square Statistic . . . . . . . . . . . . . . 48
3.1.3 Goodness of Fit Test . . . . . . . . . . . . . . . . . . . . . 49
3.2 The Language and Logic of Hypothesis Tests . . . . . . . . . . . 50
3.3 Hypothesis Tests in General . . . . . . . . . . . . . . . . . . . . . 54
3.4 Likelihood Ratio Test . . . . . . . . . . . . . . . . . . . . . . . . 61
3.5 The Neyman–Pearson Lemma . . . . . . . . . . . . . . . . . . . . 73

4 Evaluating Estimators 77
4.1 Mean Squared Error . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.2 Cramér–Rao Theorem . . . . . . . . . . . . . . . . . . . . . . . 80

A Pearson Chi–Square Statistic 87

B The Variance of the Sample Variance 93

Chapter 1

Introduction

1.1 Introduction to statistical inference


The main topic of this course is statistical inference. Loosely speaking, statisti-
cal inference is the process of going from information gained from a sample to
inferences about a population from which the sample is taken. There are two
aspects of statistical inference that we’ll be studying in this course: estimation
and hypothesis testing. In estimation, we try to determine parameters from a
population based on quantities, referred to as statistics, calculated from data
in a sample. The degree to which the estimates resemble the parameters be-
ing estimated can be measured by ascertaining the probability that a certain
range of values around the estimate will contain the actual parameter. The use
of probability is at the core of statistical inference; it involves the postulation
of a certain probability model underlying the situation being studied and cal-
culations based on that model. The same procedure can in turn be used to
determine the degree to which the data in the sample support the underlying
model; this is the essence of hypothesis testing.
Before we delve into the details of the statistical theory of estimation and
hypothesis testing, we will present a simple example which will serve to illustrate
several aspects of the theory.

1.1.1 An Introductory Example


I have a hot–air popcorn popper which I have been using a lot lately. It is a
small appliance consisting of a metal, cylindrical container with narrow vents
at the bottom, on the sides of the cylinder, through which hot air is pumped.
The vents are slanted in a given direction so that the kernels are made to
circulate at the bottom of the container. The top of the container is covered
with a hard-plastic lid with a wide spout that directs popped and unpopped
kernels to a container placed next to the popper. The instructions call for
one–quarter cup of kernels to be placed at the bottom of the container and
the device to be plugged in. After a short while of the kernels swirling in hot

air, a few of the kernels begin to pop. Pressure from the circulating air and
other kernels popping and bouncing around inside the cylinder forces kernels
to the top of the container, then to the spout, and finally into the container.
Once you start eating the popcorn, you realize that not all the kernels popped.
You also notice that there are two kinds of unpopped kernels: those that just
didn’t pop and those that were kicked out of the container before they could
get warm enough to pop. In any case, after you are done eating the popped
kernels, you cannot resist the temptation to count how many kernels did not pop.
Table 1.1 shows the results of 27 popping sessions performed under nearly the
same conditions. Each popping session represents a random experiment.1 The

Trial Number of Unpopped Kernels


1 32
2 11
3 32
4 9
5 17
6 8
7 7
8 15
9 139
10 110
11 124
12 111
13 67
14 143
15 35
16 52
17 35
18 65
19 44
20 52
21 49
22 18
23 56
24 131
25 55
26 59
27 37

Table 1.1: Number of Unpopped Kernels out of 1/4–cup of popcorn

1 A random experiment is a process or observation, which can be repeated indefinitely

under the same conditions, and whose outcomes cannot be predicted with certainty before
the experiment is performed.

number of unpopped kernels is a random variable2 which we obtain from the


outcome of each experiment. Denoting the number of unpopped kernels in a
given run by 𝑋, we may postulate that 𝑋 follows a Binomial distribution with
parameters 𝑁 and 𝑝, where 𝑝 is the probability that a given kernel will not pop
(either because it was kicked out of the container too early, or because it would
just not pop) and 𝑁 is the number of kernels contained in one-quarter cup. We
write
𝑋 ∼ binom(𝑁, 𝑝)
and have that
\[
P(X = k) = \binom{N}{k} p^k (1-p)^{N-k} \quad\text{for } k = 0, 1, 2, \ldots, N,
\]
where
\[
\binom{N}{k} = \frac{N!}{k!\,(N-k)!}, \quad k = 0, 1, 2, \ldots, N.
\]
This is the underlying probability model that we may postulate for this situa-
tion. The probability of a failure to pop for a given kernel, 𝑝, and the number of
kernels, 𝑁 , in one–quarter cup are unknown parameters. The challenge before
us is to use the data in Table 1.1 on page 6 to estimate the parameter 𝑝. Notice
that 𝑁 is also unknown, so we'll have to estimate 𝑁 as well; however,
the data in Table 1.1 do not give enough information to do so. We will therefore
have to design a new experiment to obtain data that will allow us to estimate
𝑁. This will be done in the next chapter. Before we proceed further, we will
lay out the sampling notions and terminology that are at the foundation of
statistical inference.
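As a quick illustration of the binomial model above, the following R sketch (R is also used later in these notes for normal and 𝑡 quantiles) evaluates the binomial pmf for hypothetical parameter values; the numbers 350 and 0.15 are placeholders chosen for illustration only, not estimates derived from the data.

# Illustrative sketch only: N and p below are hypothetical placeholder values,
# not estimates obtained from the data in Table 1.1.
N <- 350    # hypothetical number of kernels in one-quarter cup
p <- 0.15   # hypothetical probability that a given kernel fails to pop

# Probability of observing exactly 52 unpopped kernels under X ~ binom(N, p)
dbinom(52, size = N, prob = p)

# Probability of observing at most 52 unpopped kernels
pbinom(52, size = N, prob = p)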

1.1.2 Sampling: Concepts and Terminology


Suppose we wanted to estimate the number of popcorn kernels in one quarter
cup of popcorn. In order to do this we can sample one quarter cup from a
bag of popcorn and count the kernels in the quarter cup. Each time we do
the sampling we get a value, 𝑁𝑖 , for the number of kernels. We postulate that
there is a value, 𝜇, which gives the mean value of kernels in one quarter cup of
popcorn. It is reasonable to assume that the distribution of each of the 𝑁𝑖 , for
𝑖 = 1, 2, 3, . . ., is normal around 𝜇 with certain variance 𝜎 2 . That is,

𝑁𝑖 ∼ normal(𝜇, 𝜎 2 ) for all 𝑖 = 1, 2, 3, . . . ,

so that each of the 𝑁𝑖 s has a density function, 𝑓𝑁 , given by

\[
f_N(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad\text{for } -\infty < x < \infty.
\]
2 A random variable is a numerical outcome of a random experiment whose value cannot

be determined with certainty.



Hence, the probability that the number of kernels in one quarter cup of popcorn
lies within a certain range of values, $a \leqslant N < b$, is
\[
P(a \leqslant N < b) = \int_a^b f_N(x)\,\mathrm{d}x.
\]

Notice here that we are approximating a discrete random variable, 𝑁 , by a


continuous one. This approximation is justified if we are dealing with large
numbers of kernels, so that a few kernels might not make a large relative differ-
ence. Table 1.2 shows a few of those numbers. If we also assume that the 𝑁𝑖 s

Sample Number of Kernels


1 356
2 368
3 356
4 351
5 339
6 298
7 289
8 352
9 447
10 314
11 332
12 369
13 298
14 327
15 319
16 316
17 341
18 367
19 357
20 334

Table 1.2: Number of Kernels in 1/4–cup of popcorn

are independent random variables, then 𝑁1 , 𝑁2 , . . . , 𝑁𝑛 constitutes a random


sample of size 𝑛.
Definition 1.1.1 (Random Sample). (See also [HCM04, Definition 5.1.1, p
234]) The random variables, 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 , form a random sample of size 𝑛 on
a random variable 𝑋 if they are independent and each has the same distribution
as that of 𝑋. We say that 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 constitute a random sample from the
distribution of 𝑋.
Example 1.1.2. The second column of Table 1.2 shows values from a random
sample from the distribution of the number of kernels, 𝑁, in one-quarter
cup of popcorn kernels.

Given a random sample, 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 , from the distribution of a random


variable, 𝑋, the sample mean, 𝑋 𝑛 , is defined by

\[
\overline{X}_n = \frac{X_1 + X_2 + \cdots + X_n}{n}.
\]

𝑋 𝑛 is an example of a statistic.

Definition 1.1.3 (Statistic). (See also [HCM04, Definition 5.1.2, p 235]) A


statistic is a function of a random sample. In other words, a statistic is a
quantity that is calculated from data contained in a random sample.

Let 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 denote a random sample from a distribution of mean 𝜇


and variance 𝜎 2 . Then the expected value of the sample mean 𝑋 𝑛 is

𝐸(𝑋 𝑛 ) = 𝜇.

We say that 𝑋 𝑛 is an unbiased estimator for the mean 𝜇.

Example 1.1.4 (Unbiased Estimation of the Variance). Let 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 be


a random sample from a distribution of mean 𝜇 and variance 𝜎 2 . Consider
\[
\sum_{k=1}^{n} (X_k - \mu)^2 = \sum_{k=1}^{n} \left[ X_k^2 - 2\mu X_k + \mu^2 \right]
= \sum_{k=1}^{n} X_k^2 - 2\mu \sum_{k=1}^{n} X_k + n\mu^2
= \sum_{k=1}^{n} X_k^2 - 2\mu n \overline{X}_n + n\mu^2 .
\]

On the other hand,

\[
\sum_{k=1}^{n} (X_k - \overline{X}_n)^2
= \sum_{k=1}^{n} \left[ X_k^2 - 2\overline{X}_n X_k + \overline{X}_n^{\,2} \right]
= \sum_{k=1}^{n} X_k^2 - 2\overline{X}_n \sum_{k=1}^{n} X_k + n\overline{X}_n^{\,2}
= \sum_{k=1}^{n} X_k^2 - n\overline{X}_n^{\,2} .
\]

Consequently,

\[
\sum_{k=1}^{n} (X_k - \mu)^2 - \sum_{k=1}^{n} (X_k - \overline{X}_n)^2
= n\overline{X}_n^{\,2} - 2\mu n \overline{X}_n + n\mu^2 = n(\overline{X}_n - \mu)^2 .
\]

It then follows that

\[
\sum_{k=1}^{n} (X_k - \overline{X}_n)^2 = \sum_{k=1}^{n} (X_k - \mu)^2 - n(\overline{X}_n - \mu)^2 .
\]

Taking expectations on both sides, we get

\[
E\!\left( \sum_{k=1}^{n} (X_k - \overline{X}_n)^2 \right)
= \sum_{k=1}^{n} E\!\left[ (X_k - \mu)^2 \right] - n\, E\!\left[ (\overline{X}_n - \mu)^2 \right]
= n\sigma^2 - n\,\mathrm{var}(\overline{X}_n)
= n\sigma^2 - n\,\frac{\sigma^2}{n}
= (n-1)\sigma^2 .
\]

Thus, dividing by $n-1$,

\[
E\!\left( \frac{1}{n-1} \sum_{k=1}^{n} (X_k - \overline{X}_n)^2 \right) = \sigma^2 .
\]

Hence, the random variable

\[
S_n^2 = \frac{1}{n-1} \sum_{k=1}^{n} (X_k - \overline{X}_n)^2 ,
\]

called the sample variance, is an unbiased estimator of the variance.
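The following R sketch (not part of the original derivation) is a quick empirical check of this fact: it repeatedly draws samples from a distribution with known variance and compares the average of the sample variances, computed with the $n-1$ divisor, to $\sigma^2$. The specific values 5, 2, 10 and 10000 below are arbitrary choices for the demonstration.

# Empirical check (illustration only) that the sample variance with the n - 1
# divisor is unbiased.  We sample from a normal(5, 4) distribution, so the
# true variance is sigma^2 = 4.
set.seed(123)
n <- 10
reps <- 10000
sample_vars <- replicate(reps, var(rnorm(n, mean = 5, sd = 2)))  # var() uses n - 1
mean(sample_vars)   # should be close to 4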

Given a random sample, 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 , from a distribution with mean 𝜇


and variance 𝜎 2 , and a statistic, 𝑇 = 𝑇 (𝑋1 , 𝑋2 , . . . , 𝑋𝑛 ), based on the ran-
dom sample, it is of interest to find out what the distribution of the statistic,
𝑇 , is. This is called the sampling distribution of 𝑇 . For example, we would
like to know what the sampling distribution of the sample mean, 𝑋 𝑛 , is. In
order to find out what the sampling distribution of a statistic is, we need to
know the joint distribution, $F_{(X_1,X_2,\ldots,X_n)}(x_1, x_2, \ldots, x_n)$, of the sample
variables $X_1, X_2, \ldots, X_n$. Since the variables $X_1, X_2, \ldots, X_n$ are independent
and identically distributed (iid), we can compute

𝐹(𝑋1 ,𝑋2 ,...,𝑋𝑛 ) (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) = 𝐹𝑋 (𝑥1 ) ⋅ 𝐹𝑋 (𝑥2 ) ⋅ ⋅ ⋅ 𝐹𝑋 (𝑥𝑛 ),



where 𝐹𝑋 is the common distribution. Recall that

𝐹𝑋 (𝑥) = P(𝑋 ⩽ 𝑥)

and

𝐹(𝑋1 ,𝑋2 ,...,𝑋𝑛 ) (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) = P(𝑋1 ⩽ 𝑥1 , 𝑋2 ⩽ 𝑥2 , . . . , 𝑋𝑛 ⩽ 𝑥𝑛 ).

If 𝑋 is a continuous random variable with density 𝑓𝑋 (𝑥), then the joint density
of the sample is

$f_{(X_1,X_2,\ldots,X_n)}(x_1, x_2, \ldots, x_n) = f_X(x_1) \cdot f_X(x_2) \cdots f_X(x_n).$

Example 1.1.5. Let $N_1, N_2, \ldots, N_n$ denote a random sample from the experiment consisting of scooping up a quarter-cup of popcorn kernels from a bag and counting the number of kernels. Assume that each $N_i$ has a normal$(\mu,\sigma^2)$ distribution. We would like to find the distribution of the sample mean $\overline{N}_n$. We can do this by first computing the moment generating function (mgf), $M_{\overline{N}_n}(t)$, of $\overline{N}_n$:
\[
M_{\overline{N}_n}(t) = E\!\left(e^{t\overline{N}_n}\right)
= E\!\left(e^{(N_1 + N_2 + \cdots + N_n)(t/n)}\right)
= M_{N_1 + N_2 + \cdots + N_n}\!\left(\frac{t}{n}\right)
= M_{N_1}\!\left(\frac{t}{n}\right) M_{N_2}\!\left(\frac{t}{n}\right) \cdots M_{N_n}\!\left(\frac{t}{n}\right),
\]
since the $N_i$s are independent. Thus, since the $N_i$s are also identically distributed,
\[
M_{\overline{N}_n}(t) = \left( M_{N_1}\!\left(\frac{t}{n}\right) \right)^{n},
\]
where $M_{N_1}\!\left(\dfrac{t}{n}\right) = e^{\mu t/n + \sigma^2 t^2/2n^2}$, since $N_1$ has a normal$(\mu,\sigma^2)$ distribution. It then follows that
\[
M_{\overline{N}_n}(t) = e^{\mu t + \sigma^2 t^2/2n},
\]
which is the mgf of a normal$(\mu, \sigma^2/n)$ distribution. It then follows that $\overline{N}_n$ has a normal distribution with mean
\[
E(\overline{N}_n) = \mu
\]
and variance
\[
\mathrm{var}(\overline{N}_n) = \frac{\sigma^2}{n}.
\]

Example 1.1.5 shows that the sample mean, 𝑋 𝑛 , for a random sample from
a normal(𝜇, 𝜎 2 ) distribution follows a normal(𝜇, 𝜎 2 /𝑛). A surprising, and ex-
tremely useful, result from the theory of probability, states that for large values
of $n$, the sample mean for samples from any distribution is approximately
normal$(\mu, \sigma^2/n)$. This is the essence of the Central Limit Theorem:

Theorem 1.1.6 (Central Limit Theorem). [HCM04, Theorem 4.4.1, p 220]
Suppose $X_1, X_2, X_3, \ldots$ are independent, identically distributed random variables with $E(X_i) = \mu$ and finite variance $\mathrm{var}(X_i) = \sigma^2$, for all $i$. Then
\[
\lim_{n\to\infty} P\!\left( \frac{\overline{X}_n - \mu}{\sigma/\sqrt{n}} \leqslant z \right) = P(Z \leqslant z),
\]
for all $z \in \mathbb{R}$, where $Z \sim$ Normal$(0,1)$.

Thus, for large values of $n$, the distribution function of $\dfrac{\overline{X}_n - \mu}{\sigma/\sqrt{n}}$ can be approximated by the standard normal distribution. We write
\[
\frac{\overline{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{\;D\;} Z \sim \text{Normal}(0,1)
\]
and say that $\dfrac{\overline{X}_n - \mu}{\sigma/\sqrt{n}}$ converges in distribution to $Z$. In general, we have

Definition 1.1.7 (Convergence in Distribution). A sequence, $(Y_n)$, of random variables is said to converge in distribution to a random variable $Y$ if
\[
\lim_{n\to\infty} F_{Y_n}(y) = F_Y(y)
\]
for all $y$ where $F_Y$ is continuous. We write
\[
Y_n \xrightarrow{\;D\;} Y \quad\text{as } n \to \infty.
\]

In practice, the Central Limit Theorem is applied to approximate the probabilities
\[
P\!\left( \frac{\overline{X}_n - \mu}{\sigma/\sqrt{n}} \leqslant z \right) \approx P(Z \leqslant z) \quad\text{for large } n,
\]
which we could write as
\[
F_{\overline{X}_n} \approx F_{\mu + \frac{\sigma}{\sqrt{n}} Z} \quad\text{for large } n;
\]
in other words, for large sample sizes, $n$, the distribution of the sample mean is
approximately normal$(\mu, \sigma^2/n)$.
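A short R simulation (an illustration only, not part of the original notes) makes this concrete: sample means of draws from a decidedly non-normal distribution already look approximately normal$(\mu, \sigma^2/n)$ for moderate $n$. The choices of the exponential(1) distribution, $n = 30$ and 10000 replications are arbitrary.

# The sampling distribution of the mean of exponential(1) samples is
# approximately normal for moderately large n.
set.seed(1)
n <- 30
sample_means <- replicate(10000, mean(rexp(n, rate = 1)))

# Compare the simulated distribution with the normal(mu, sigma^2/n) prediction:
# for rate = 1, mu = 1 and sigma^2 = 1.
hist(sample_means, freq = FALSE, main = "Sample means of exponential(1) samples")
curve(dnorm(x, mean = 1, sd = 1/sqrt(n)), add = TRUE)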
Chapter 2

Estimation

2.1 Estimating the Mean of a Distribution


We saw in the previous section that the sample mean, 𝑋 𝑛 , of a random sample,
𝑋1 , 𝑋2 , . . . , 𝑋𝑛 , from a distribution with mean 𝜇 is an unbiased estimator for 𝜇;
that is, $E(\overline{X}_n) = \mu$. In this section we will see that, as we increase the sample size, $n$, the sample mean, $\overline{X}_n$, approaches $\mu$ in probability; that is, for every $\varepsilon > 0$,
\[
\lim_{n\to\infty} P(|\overline{X}_n - \mu| \geqslant \varepsilon) = 0,
\]
or
\[
\lim_{n\to\infty} P(|\overline{X}_n - \mu| < \varepsilon) = 1.
\]
We then say that $\overline{X}_n$ converges to $\mu$ in probability and write
\[
\overline{X}_n \xrightarrow{\;P\;} \mu \quad\text{as } n \to \infty.
\]

Definition 2.1.1 (Convergence in Probability). A sequence, $(Y_n)$, of random variables is said to converge in probability to $b \in \mathbb{R}$ if, for every $\varepsilon > 0$,
\[
\lim_{n\to\infty} P(|Y_n - b| < \varepsilon) = 1.
\]
We write
\[
Y_n \xrightarrow{\;P\;} b \quad\text{as } n \to \infty.
\]

The fact that $\overline{X}_n$ converges to $\mu$ in probability is known as the weak Law of Large Numbers. We will prove this fact under the assumption that the distribution being sampled has finite variance, $\sigma^2$. Then, the weak Law of Large Numbers will follow from the inequality:

Theorem 2.1.2 (Chebyshev Inequality). Let $X$ be a random variable with mean $\mu$ and variance $\mathrm{var}(X)$. Then, for every $\varepsilon > 0$,
\[
P(|X - \mu| \geqslant \varepsilon) \leqslant \frac{\mathrm{var}(X)}{\varepsilon^2}.
\]

Proof: We shall prove this inequality for the case in which $X$ is continuous with pdf $f_X$.
Observe that $\mathrm{var}(X) = E[(X-\mu)^2] = \displaystyle\int_{-\infty}^{\infty} |x-\mu|^2 f_X(x)\,\mathrm{d}x$. Thus,
\[
\mathrm{var}(X) \geqslant \int_{A_\varepsilon} |x-\mu|^2 f_X(x)\,\mathrm{d}x,
\]
where $A_\varepsilon = \{x \in \mathbb{R} \mid |x-\mu| \geqslant \varepsilon\}$. Consequently,
\[
\mathrm{var}(X) \geqslant \varepsilon^2 \int_{A_\varepsilon} f_X(x)\,\mathrm{d}x = \varepsilon^2 P(A_\varepsilon).
\]
We therefore get that
\[
P(A_\varepsilon) \leqslant \frac{\mathrm{var}(X)}{\varepsilon^2},
\]
or
\[
P(|X - \mu| \geqslant \varepsilon) \leqslant \frac{\mathrm{var}(X)}{\varepsilon^2}.
\]

Applying the Chebyshev Inequality to the case in which $X$ is the sample mean, $\overline{X}_n$, we get
\[
P(|\overline{X}_n - \mu| \geqslant \varepsilon) \leqslant \frac{\mathrm{var}(\overline{X}_n)}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2}.
\]
We therefore obtain that
\[
P(|\overline{X}_n - \mu| < \varepsilon) \geqslant 1 - \frac{\sigma^2}{n\varepsilon^2}.
\]
Thus, letting $n \to \infty$, we get that, for every $\varepsilon > 0$,
\[
\lim_{n\to\infty} P(|\overline{X}_n - \mu| < \varepsilon) = 1.
\]
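The weak Law of Large Numbers is easy to visualize in R; the sketch below (an illustration, not part of the original notes) plots running means of Bernoulli trials, which settle near the true probability. The values $p = 0.3$ and 5000 trials are arbitrary choices for the demonstration.

# Running means of Bernoulli(0.3) trials settle near p = 0.3 as n grows.
set.seed(42)
x <- rbinom(5000, size = 1, prob = 0.3)
running_means <- cumsum(x) / seq_along(x)
plot(running_means, type = "l", xlab = "n", ylab = "sample mean")
abline(h = 0.3, lty = 2)   # the value the sample mean converges to in probability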

Later in these notes we will need the fact that a continuous function of a sequence which converges in probability will also converge in probability:

Theorem 2.1.3 (Slutsky's Theorem). Suppose that $(Y_n)$ converges in probability to $b$ as $n \to \infty$ and that $g$ is a function which is continuous at $b$. Then, $(g(Y_n))$ converges in probability to $g(b)$ as $n \to \infty$.

Proof: Let $\varepsilon > 0$ be given. Since $g$ is continuous at $b$, there exists $\delta > 0$ such that
\[
|y - b| < \delta \;\Rightarrow\; |g(y) - g(b)| < \varepsilon.
\]
It then follows that the event $A_\delta = \{y \mid |y - b| < \delta\}$ is a subset of the event $B_\varepsilon = \{y \mid |g(y) - g(b)| < \varepsilon\}$. Consequently,
\[
P(A_\delta) \leqslant P(B_\varepsilon).
\]
It then follows that
\[
P(|Y_n - b| < \delta) \leqslant P(|g(Y_n) - g(b)| < \varepsilon) \leqslant 1. \tag{2.1}
\]
Now, since $Y_n \xrightarrow{\;P\;} b$ as $n \to \infty$,
\[
\lim_{n\to\infty} P(|Y_n - b| < \delta) = 1.
\]
It then follows from Equation (2.1) and the Squeeze or Sandwich Theorem that
\[
\lim_{n\to\infty} P(|g(Y_n) - g(b)| < \varepsilon) = 1.
\]

Since the sample mean, $\overline{X}_n$, converges in probability to the mean, $\mu$, of the sampled distribution, by the weak Law of Large Numbers, we say that $\overline{X}_n$ is a consistent estimator for $\mu$.

2.2 Interval Estimate for Proportions


Example 2.2.1 (Estimating Proportions). Let 𝑋1 , 𝑋2 , 𝑋3 , . . . denote indepen-
dent identically distributed (iid) Bernoulli(𝑝) random variables. Then the sam-
ple mean, 𝑋 𝑛 , is an unbiased and consistent estimator for 𝑝. Denoting 𝑋 𝑛 by
$\hat{p}_n$, we then have that
\[
E(\hat{p}_n) = p \quad\text{for all } n = 1, 2, 3, \ldots,
\]
and
\[
\hat{p}_n \xrightarrow{\;P\;} p \quad\text{as } n \to \infty;
\]
that is, for every $\varepsilon > 0$,
\[
\lim_{n\to\infty} P(|\hat{p}_n - p| < \varepsilon) = 1.
\]
By Slutsky's Theorem (Theorem 2.1.3), we also have that
\[
\sqrt{\hat{p}_n(1 - \hat{p}_n)} \xrightarrow{\;P\;} \sqrt{p(1-p)} \quad\text{as } n \to \infty.
\]
Thus, the statistic $\sqrt{\hat{p}_n(1 - \hat{p}_n)}$ is a consistent estimator of the standard deviation $\sigma = \sqrt{p(1-p)}$ of the Bernoulli$(p)$ trials $X_1, X_2, X_3, \ldots$
Now, by the Central Limit Theorem, we have that
\[
\lim_{n\to\infty} P\!\left( \frac{\hat{p}_n - p}{\sigma/\sqrt{n}} \leqslant z \right) = P(Z \leqslant z),
\]
where $Z \sim$ Normal$(0,1)$, for all $z \in \mathbb{R}$. Hence, since $\sqrt{\hat{p}_n(1 - \hat{p}_n)}$ is a consistent estimator for $\sigma$, we have that, for large values of $n$,
\[
P\!\left( \frac{\hat{p}_n - p}{\sqrt{\hat{p}_n(1 - \hat{p}_n)}/\sqrt{n}} \leqslant z \right) \approx P(Z \leqslant z),
\]

for all $z \in \mathbb{R}$. Similarly, for large values of $n$,
\[
P\!\left( \frac{\hat{p}_n - p}{\sqrt{\hat{p}_n(1 - \hat{p}_n)}/\sqrt{n}} \leqslant -z \right) \approx P(Z \leqslant -z).
\]
Subtracting this from the previous expression we get
\[
P\!\left( -z < \frac{\hat{p}_n - p}{\sqrt{\hat{p}_n(1 - \hat{p}_n)}/\sqrt{n}} \leqslant z \right) \approx P(-z < Z \leqslant z)
\]
for large values of $n$, or
\[
P\!\left( -z \leqslant \frac{p - \hat{p}_n}{\sqrt{\hat{p}_n(1 - \hat{p}_n)}/\sqrt{n}} < z \right) \approx P(-z < Z \leqslant z)
\]
for large values of $n$.
Now, suppose that $z > 0$ is such that $P(-z < Z \leqslant z) \geqslant 0.95$. Then, for that value of $z$, we get that, approximately, for large values of $n$,
\[
P\!\left( \hat{p}_n - z\,\frac{\sqrt{\hat{p}_n(1 - \hat{p}_n)}}{\sqrt{n}} \;\leqslant\; p \;<\; \hat{p}_n + z\,\frac{\sqrt{\hat{p}_n(1 - \hat{p}_n)}}{\sqrt{n}} \right) \geqslant 0.95.
\]

Thus, for large values of $n$, the intervals
\[
\left[ \hat{p}_n - z\,\frac{\sqrt{\hat{p}_n(1 - \hat{p}_n)}}{\sqrt{n}},\; \hat{p}_n + z\,\frac{\sqrt{\hat{p}_n(1 - \hat{p}_n)}}{\sqrt{n}} \right)
\]
have the property that the probability that the true proportion $p$ lies in them is at least 95%. For this reason, the interval
\[
\left[ \hat{p}_n - z\,\frac{\sqrt{\hat{p}_n(1 - \hat{p}_n)}}{\sqrt{n}},\; \hat{p}_n + z\,\frac{\sqrt{\hat{p}_n(1 - \hat{p}_n)}}{\sqrt{n}} \right)
\]
is called the 95% confidence interval estimate for the proportion $p$. To find the value of $z$ that yields the 95% confidence interval for $p$, observe that
\[
P(-z < Z \leqslant z) = F_Z(z) - F_Z(-z) = F_Z(z) - (1 - F_Z(z)) = 2F_Z(z) - 1.
\]
Thus, we need to solve for $z$ in the inequality
\[
2F_Z(z) - 1 \geqslant 0.95,
\]
or
\[
F_Z(z) \geqslant 0.975.
\]
This yields $z = 1.96$. We then get that the approximate 95% confidence interval estimate for the proportion $p$ is
\[
\left[ \hat{p}_n - 1.96\,\frac{\sqrt{\hat{p}_n(1 - \hat{p}_n)}}{\sqrt{n}},\; \hat{p}_n + 1.96\,\frac{\sqrt{\hat{p}_n(1 - \hat{p}_n)}}{\sqrt{n}} \right)
\]

Example 2.2.2. In the corn–popping experiment described in Section 1.1.1, out


of 356 kernels, 52 failed to pop. In this example, we compute a 95% confidence
interval for $p$, the probability of failure to pop for a given kernel, based on
this information. An estimate for $p$ in this case is $\hat{p}_n = 52/356 \approx 0.146$. An
approximate 95% confidence interval estimate for the true proportion of kernels,
$p$, which will not pop is then
\[
\left[ 0.146 - 1.96\,\frac{\sqrt{0.146(0.854)}}{\sqrt{356}},\; 0.146 + 1.96\,\frac{\sqrt{0.146(0.854)}}{\sqrt{356}} \right),
\]

or about [0.146 − 0.037, 0.146 + 0.037), or [0.109, 0.183). Thus, the failure to pop
rate is between 10.9% and 18.3% with a 95% confidence level. The confidence
level here indicates the probability that the method used to produce the inter-
val estimate from the data will contain the true value of the parameter being
estimated.
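The interval in Example 2.2.2 can be reproduced with a few lines of R (a sketch; qnorm is the same inverse normal cdf used later in Section 2.3.3):

# Reproducing the interval in Example 2.2.2: 52 unpopped kernels out of a
# quarter-cup containing 356 kernels.
n <- 356
p_hat <- 52 / n
z <- qnorm(0.975)                                    # approximately 1.96
margin <- z * sqrt(p_hat * (1 - p_hat)) / sqrt(n)
c(lower = p_hat - margin, upper = p_hat + margin)    # roughly (0.109, 0.183)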

2.3 Interval Estimates for the Mean


In the previous section we obtained an approximate confidence interval (CI)
estimate for the probability that a given kernel will fail to pop. We did this by
using the fact that, for large numbers of trials, a binomial distribution can be
approximated by a normal distribution (by the Central Limit Theorem). We also
used the fact that the sample standard deviation $\sqrt{\hat{p}_n(1 - \hat{p}_n)}$ is a consistent
estimator of the standard deviation $\sigma = \sqrt{p(1-p)}$ of the Bernoulli$(p)$ trials
$X_1, X_2, X_3, \ldots$ The consistency condition might not hold in general. However,
in the case in which sampling is done from a normal distribution an exact
confidence interval estimate may be obtained based on the sample mean and
variance by means of the 𝑡–distribution. We present this development here and
apply it to the problem of estimating the mean number of popcorn kernels in
one quarter cup.
We have already seen that the sample mean, 𝑋 𝑛 , of a random sample of size
𝑛 from a normal(𝜇, 𝜎 2 ) follows a normal(𝜇, 𝜎 2 /𝑛) distribution. It then follows
that
\[
P\!\left( \frac{|\overline{X}_n - \mu|}{\sigma/\sqrt{n}} \leqslant z \right) = P(|Z| \leqslant z) \quad\text{for all } z \in \mathbb{R}, \tag{2.2}
\]
where 𝑍 ∼ normal(0, 1). Thus, if we knew 𝜎, then we could obtain the 95% CI
for 𝜇 by choosing 𝑧 = 1.96 in (2.2). We would then obtain the CI:
\[
\left[ \overline{X}_n - 1.96\,\frac{\sigma}{\sqrt{n}},\; \overline{X}_n + 1.96\,\frac{\sigma}{\sqrt{n}} \right].
\]
However, $\sigma$ is generally an unknown parameter. So, we need to resort to
a different kind of estimate. The idea is to use the sample variance, $S_n^2$, to
estimate $\sigma^2$, where
\[
S_n^2 = \frac{1}{n-1} \sum_{k=1}^{n} (X_k - \overline{X}_n)^2. \tag{2.3}
\]

Thus, instead of considering the normalized sample means
\[
\frac{\overline{X}_n - \mu}{\sigma/\sqrt{n}},
\]
we consider the random variables
\[
T_n = \frac{\overline{X}_n - \mu}{S_n/\sqrt{n}}. \tag{2.4}
\]

The task that remains then is to determine the sampling distribution of 𝑇𝑛 .


This was done by William Sealy Gosset in 1908 in an article published in the
journal Biometrika under the pseudonym Student [Stu08]. The fact that we
can actually determine the distribution of 𝑇𝑛 in (2.4) depends on the fact that
𝑋1 , 𝑋2 , . . . , 𝑋𝑛 is a random sample from a normal distribution and knowledge
of the 𝜒2 distribution.

2.3.1 The 𝜒2 Distribution


Example 2.3.1 (The Chi–Square Distribution with one degree of freedom).
Let 𝑍 ∼ Normal(0, 1) and define 𝑋 = 𝑍 2 . Give the probability density function
(pdf) of 𝑋.

Solution: The pdf of $Z$ is given by
\[
f_Z(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}, \quad\text{for } -\infty < z < \infty.
\]

We compute the pdf for $X$ by first determining its cumulative distribution function (cdf): for $x \geqslant 0$,
\[
P(X \leqslant x) = P(Z^2 \leqslant x)
= P(-\sqrt{x} \leqslant Z \leqslant \sqrt{x})
= P(-\sqrt{x} < Z \leqslant \sqrt{x}),
\]
since $Z$ is continuous. Thus,
\[
P(X \leqslant x) = P(Z \leqslant \sqrt{x}) - P(Z \leqslant -\sqrt{x})
= F_Z(\sqrt{x}) - F_Z(-\sqrt{x}) \quad\text{for } x > 0.
\]
We then have that the cdf of $X$ is
\[
F_X(x) = F_Z(\sqrt{x}) - F_Z(-\sqrt{x}) \quad\text{for } x > 0,
\]

from which we get, after differentiation with respect to 𝑥,


\[
\begin{aligned}
f_X(x) &= F_Z'(\sqrt{x}) \cdot \frac{1}{2\sqrt{x}} + F_Z'(-\sqrt{x}) \cdot \frac{1}{2\sqrt{x}} \\
&= f_Z(\sqrt{x})\, \frac{1}{2\sqrt{x}} + f_Z(-\sqrt{x})\, \frac{1}{2\sqrt{x}} \\
&= \frac{1}{2\sqrt{x}} \left\{ \frac{1}{\sqrt{2\pi}}\, e^{-x/2} + \frac{1}{\sqrt{2\pi}}\, e^{-x/2} \right\} \\
&= \frac{1}{\sqrt{2\pi}} \cdot \frac{1}{\sqrt{x}}\, e^{-x/2}
\end{aligned}
\]
for $x > 0$. □

Definition 2.3.2. A continuous random variable, $X$, having the pdf
\[
f_X(x) =
\begin{cases}
\dfrac{1}{\sqrt{2\pi}} \cdot \dfrac{1}{\sqrt{x}}\, e^{-x/2} & \text{if } x > 0, \\[2mm]
0 & \text{otherwise},
\end{cases}
\]
is said to have a Chi--Square distribution with one degree of freedom. We write
\[
X \sim \chi^2(1).
\]

Remark 2.3.3. Observe that if $X \sim \chi^2(1)$, then its expected value is
\[
E(X) = E(Z^2) = 1,
\]
since $\mathrm{var}(Z) = E(Z^2) - (E(Z))^2$, $E(Z) = 0$ and $\mathrm{var}(Z) = 1$. To compute the second moment of $X$, $E(X^2) = E(Z^4)$, we need to compute the fourth moment of $Z$. In order to do this, we first note that the mgf of $Z$ is
\[
M_Z(t) = e^{t^2/2} \quad\text{for all } t \in \mathbb{R}.
\]
Its fourth derivative can be computed to be
\[
M_Z^{(4)}(t) = (3 + 6t^2 + t^4)\, e^{t^2/2} \quad\text{for all } t \in \mathbb{R}.
\]
Thus,
\[
E(Z^4) = M_Z^{(4)}(0) = 3.
\]
We then have that the variance of $X$ is
\[
\mathrm{var}(X) = E(X^2) - (E(X))^2 = E(Z^4) - 1 = 3 - 1 = 2.
\]
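A short R simulation (an illustration, not part of the original remark) confirms these moments and, more generally, that $Z^2$ has the $\chi^2(1)$ distribution; the sample size 100000 is arbitrary.

# Squares of standard normal draws should have mean 1 and variance 2, and
# their empirical cdf should agree with pchisq(., df = 1).
set.seed(7)
z <- rnorm(100000)
x <- z^2
c(mean = mean(x), var = var(x))                  # approximately 1 and 2
max(abs(ecdf(x)(0:10) - pchisq(0:10, df = 1)))   # small discrepancy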

Suppose next that we have two independent random variables, $X$ and $Y$, both of which have a $\chi^2(1)$ distribution. We would like to know the distribution of the sum $X + Y$.

Denote the sum $X + Y$ by $W$. We would like to compute the pdf $f_W$. Since $X$ and $Y$ are independent, $f_W$ is given by the convolution of $f_X$ and $f_Y$; namely,
\[
f_W(w) = \int_{-\infty}^{+\infty} f_X(u) f_Y(w - u)\,\mathrm{d}u,
\]
where
\[
f_X(x) =
\begin{cases}
\dfrac{1}{\sqrt{2\pi}\sqrt{x}}\, e^{-x/2} & x > 0, \\[2mm]
0 & \text{elsewhere},
\end{cases}
\qquad
f_Y(y) =
\begin{cases}
\dfrac{1}{\sqrt{2\pi}\sqrt{y}}\, e^{-y/2} & y > 0, \\[2mm]
0 & \text{otherwise}.
\end{cases}
\]
We then have that
\[
f_W(w) = \int_0^{\infty} \frac{1}{\sqrt{2\pi}\sqrt{u}}\, e^{-u/2} f_Y(w - u)\,\mathrm{d}u,
\]
since $f_X(u)$ is zero for negative values of $u$. Similarly, since $f_Y(w - u) = 0$ for $w - u < 0$, we get that
\[
f_W(w) = \int_0^{w} \frac{1}{\sqrt{2\pi}\sqrt{u}}\, e^{-u/2}\, \frac{1}{\sqrt{2\pi}\sqrt{w-u}}\, e^{-(w-u)/2}\,\mathrm{d}u
= \frac{e^{-w/2}}{2\pi} \int_0^{w} \frac{1}{\sqrt{u}\sqrt{w-u}}\,\mathrm{d}u.
\]
Next, make the change of variables $t = \dfrac{u}{w}$. Then, $\mathrm{d}u = w\,\mathrm{d}t$ and
\[
f_W(w) = \frac{e^{-w/2}}{2\pi} \int_0^{1} \frac{w}{\sqrt{wt}\sqrt{w - wt}}\,\mathrm{d}t
= \frac{e^{-w/2}}{2\pi} \int_0^{1} \frac{1}{\sqrt{t}\sqrt{1-t}}\,\mathrm{d}t.
\]
Making a second change of variables $s = \sqrt{t}$, we get that $t = s^2$ and $\mathrm{d}t = 2s\,\mathrm{d}s$, so that
\[
f_W(w) = \frac{e^{-w/2}}{\pi} \int_0^{1} \frac{1}{\sqrt{1 - s^2}}\,\mathrm{d}s
= \frac{e^{-w/2}}{\pi} \left[ \arcsin(s) \right]_0^{1}
= \frac{1}{2}\, e^{-w/2} \quad\text{for } w > 0,
\]
and zero otherwise. It then follows that $W = X + Y$ has the pdf of an exponential(2) random variable.

Definition 2.3.4 ($\chi^2$ distribution with $n$ degrees of freedom). Let $X_1, X_2, \ldots, X_n$ be independent, identically distributed random variables with a $\chi^2(1)$ distribution. Then the random variable $X_1 + X_2 + \cdots + X_n$ is said to have a $\chi^2$ distribution with $n$ degrees of freedom. We write
\[
X_1 + X_2 + \cdots + X_n \sim \chi^2(n).
\]
The calculations preceding Definition 2.3.4 show that if a random variable, $W$, has a $\chi^2(2)$ distribution, then its pdf is given by
\[
f_W(w) =
\begin{cases}
\dfrac{1}{2}\, e^{-w/2} & \text{for } w > 0; \\[2mm]
0 & \text{for } w \leqslant 0.
\end{cases}
\]
Our goal in the following set of examples is to come up with the formula for the pdf of a $\chi^2(n)$ random variable.
Example 2.3.5 (Three degrees of freedom). Let $X \sim$ exponential(2) and $Y \sim \chi^2(1)$ be independent random variables and define $W = X + Y$. Give the distribution of $W$.

Solution: Since $X$ and $Y$ are independent, by Problem 1 in Assignment #3, $f_W$ is the convolution of $f_X$ and $f_Y$:
\[
f_W(w) = f_X * f_Y(w) = \int_{-\infty}^{\infty} f_X(u) f_Y(w - u)\,\mathrm{d}u,
\]
where
\[
f_X(x) =
\begin{cases}
\dfrac{1}{2}\, e^{-x/2} & \text{if } x > 0; \\[2mm]
0 & \text{otherwise};
\end{cases}
\qquad
f_Y(y) =
\begin{cases}
\dfrac{1}{\sqrt{2\pi}\sqrt{y}}\, e^{-y/2} & \text{if } y > 0; \\[2mm]
0 & \text{otherwise}.
\end{cases}
\]
It then follows that, for $w > 0$,
\[
f_W(w) = \int_0^{\infty} \frac{1}{2}\, e^{-u/2} f_Y(w - u)\,\mathrm{d}u
= \int_0^{w} \frac{1}{2}\, e^{-u/2}\, \frac{1}{\sqrt{2\pi}\sqrt{w-u}}\, e^{-(w-u)/2}\,\mathrm{d}u
= \frac{e^{-w/2}}{2\sqrt{2\pi}} \int_0^{w} \frac{1}{\sqrt{w-u}}\,\mathrm{d}u.
\]
Making the change of variables $t = u/w$, we get that $u = wt$ and $\mathrm{d}u = w\,\mathrm{d}t$, so that
\[
f_W(w) = \frac{e^{-w/2}}{2\sqrt{2\pi}} \int_0^{1} \frac{w}{\sqrt{w - wt}}\,\mathrm{d}t
= \frac{\sqrt{w}\, e^{-w/2}}{2\sqrt{2\pi}} \int_0^{1} \frac{1}{\sqrt{1-t}}\,\mathrm{d}t
= \frac{\sqrt{w}\, e^{-w/2}}{\sqrt{2\pi}} \left[ -\sqrt{1-t}\, \right]_0^{1}
= \frac{1}{\sqrt{2\pi}}\, \sqrt{w}\, e^{-w/2},
\]
for $w > 0$. It then follows that
\[
f_W(w) =
\begin{cases}
\dfrac{1}{\sqrt{2\pi}}\, \sqrt{w}\, e^{-w/2} & \text{if } w > 0; \\[2mm]
0 & \text{otherwise}.
\end{cases}
\]
This is the pdf for a $\chi^2(3)$ random variable. □

Example 2.3.6 (Four degrees of freedom). Let $X, Y \sim$ exponential(2) be independent random variables and define $W = X + Y$. Give the distribution of $W$.

Solution: Since $X$ and $Y$ are independent, $f_W$ is the convolution of $f_X$ and $f_Y$:
\[
f_W(w) = f_X * f_Y(w) = \int_{-\infty}^{\infty} f_X(u) f_Y(w - u)\,\mathrm{d}u,
\]
where
\[
f_X(x) =
\begin{cases}
\dfrac{1}{2}\, e^{-x/2} & \text{if } x > 0; \\[2mm]
0 & \text{otherwise};
\end{cases}
\qquad
f_Y(y) =
\begin{cases}
\dfrac{1}{2}\, e^{-y/2} & \text{if } y > 0; \\[2mm]
0 & \text{otherwise}.
\end{cases}
\]
It then follows that, for $w > 0$,
\[
f_W(w) = \int_0^{\infty} \frac{1}{2}\, e^{-u/2} f_Y(w - u)\,\mathrm{d}u
= \int_0^{w} \frac{1}{2}\, e^{-u/2}\, \frac{1}{2}\, e^{-(w-u)/2}\,\mathrm{d}u
= \frac{e^{-w/2}}{4} \int_0^{w} \mathrm{d}u
= \frac{w\, e^{-w/2}}{4},
\]
for $w > 0$. It then follows that
\[
f_W(w) =
\begin{cases}
\dfrac{1}{4}\, w\, e^{-w/2} & \text{if } w > 0; \\[2mm]
0 & \text{otherwise}.
\end{cases}
\]
This is the pdf for a $\chi^2(4)$ random variable. □

We are now ready to derive the general formula for the pdf of a 𝜒2 (𝑛) random
variable.

Example 2.3.7 ($n$ degrees of freedom). In this example we prove that if $W \sim \chi^2(n)$, then the pdf of $W$ is given by
\[
f_W(w) =
\begin{cases}
\dfrac{1}{\Gamma(n/2)\, 2^{n/2}}\, w^{\frac{n}{2}-1} e^{-w/2} & \text{if } w > 0; \\[2mm]
0 & \text{otherwise},
\end{cases}
\tag{2.5}
\]
where $\Gamma$ denotes the Gamma function defined by
\[
\Gamma(z) = \int_0^{\infty} t^{z-1} e^{-t}\,\mathrm{d}t \quad\text{for all real values of } z \text{ except } 0, -1, -2, -3, \ldots
\]

Proof: We proceed by induction on $n$. Observe that when $n = 1$ the formula in (2.5) yields, for $w > 0$,
\[
f_W(w) = \frac{1}{\Gamma(1/2)\, 2^{1/2}}\, w^{\frac{1}{2}-1} e^{-w/2} = \frac{1}{\sqrt{2\pi}\sqrt{w}}\, e^{-w/2},
\]
which is the pdf of a $\chi^2(1)$ random variable. Thus, the formula in (2.5) holds true for $n = 1$.

Next, assume that a $\chi^2(n)$ random variable has pdf given by (2.5). We will show that if $W \sim \chi^2(n+1)$, then its pdf is given by
\[
f_W(w) =
\begin{cases}
\dfrac{1}{\Gamma((n+1)/2)\, 2^{(n+1)/2}}\, w^{\frac{n-1}{2}} e^{-w/2} & \text{if } w > 0; \\[2mm]
0 & \text{otherwise}.
\end{cases}
\tag{2.6}
\]
By the definition of a $\chi^2(n+1)$ random variable, we have that $W = X + Y$, where $X \sim \chi^2(n)$ and $Y \sim \chi^2(1)$ are independent random variables. It then follows that
\[
f_W = f_X * f_Y,
\]
where
\[
f_X(x) =
\begin{cases}
\dfrac{1}{\Gamma(n/2)\, 2^{n/2}}\, x^{\frac{n}{2}-1} e^{-x/2} & \text{if } x > 0; \\[2mm]
0 & \text{otherwise};
\end{cases}
\qquad
f_Y(y) =
\begin{cases}
\dfrac{1}{\sqrt{2\pi}\sqrt{y}}\, e^{-y/2} & \text{if } y > 0; \\[2mm]
0 & \text{otherwise}.
\end{cases}
\]
Consequently, for $w > 0$,
\[
f_W(w) = \int_0^{w} \frac{1}{\Gamma(n/2)\, 2^{n/2}}\, u^{\frac{n}{2}-1} e^{-u/2}\, \frac{1}{\sqrt{2\pi}\sqrt{w-u}}\, e^{-(w-u)/2}\,\mathrm{d}u
= \frac{e^{-w/2}}{\Gamma(n/2)\sqrt{\pi}\, 2^{(n+1)/2}} \int_0^{w} \frac{u^{\frac{n}{2}-1}}{\sqrt{w-u}}\,\mathrm{d}u.
\]
Next, make the change of variables $t = u/w$; we then have that $u = wt$, $\mathrm{d}u = w\,\mathrm{d}t$ and
\[
f_W(w) = \frac{w^{\frac{n-1}{2}} e^{-w/2}}{\Gamma(n/2)\sqrt{\pi}\, 2^{(n+1)/2}} \int_0^{1} \frac{t^{\frac{n}{2}-1}}{\sqrt{1-t}}\,\mathrm{d}t.
\]
Making a further change of variables $t = z^2$, so that $\mathrm{d}t = 2z\,\mathrm{d}z$, we obtain that
\[
f_W(w) = \frac{2\, w^{\frac{n-1}{2}} e^{-w/2}}{\Gamma(n/2)\sqrt{\pi}\, 2^{(n+1)/2}} \int_0^{1} \frac{z^{n-1}}{\sqrt{1-z^2}}\,\mathrm{d}z. \tag{2.7}
\]
It remains to evaluate the integrals
\[
\int_0^{1} \frac{z^{n-1}}{\sqrt{1-z^2}}\,\mathrm{d}z \quad\text{for } n = 1, 2, 3, \ldots
\]
We can evaluate these by making the trigonometric substitution $z = \sin\theta$, so that $\mathrm{d}z = \cos\theta\,\mathrm{d}\theta$ and
\[
\int_0^{1} \frac{z^{n-1}}{\sqrt{1-z^2}}\,\mathrm{d}z = \int_0^{\pi/2} \sin^{n-1}\theta\,\mathrm{d}\theta.
\]
Looking up the last integral in a table of integrals we find that, if $n$ is even and $n \geqslant 4$, then
\[
\int_0^{\pi/2} \sin^{n-1}\theta\,\mathrm{d}\theta = \frac{1 \cdot 3 \cdot 5 \cdots (n-2)}{2 \cdot 4 \cdot 6 \cdots (n-1)},
\]
which can be written in terms of the Gamma function as
\[
\int_0^{\pi/2} \sin^{n-1}\theta\,\mathrm{d}\theta = \frac{2^{n-2} \left[ \Gamma\!\left(\frac{n}{2}\right) \right]^2}{\Gamma(n)}. \tag{2.8}
\]
Note that this formula also works for $n = 2$.
Similarly, we obtain for odd $n$ with $n \geqslant 1$ that
\[
\int_0^{\pi/2} \sin^{n-1}\theta\,\mathrm{d}\theta = \frac{\Gamma(n)}{2^{n-1} \left[ \Gamma\!\left(\frac{n+1}{2}\right) \right]^2}\, \frac{\pi}{2}. \tag{2.9}
\]
Now, if $n$ is odd and $n \geqslant 1$ we may substitute (2.9) into (2.7) to get
\[
f_W(w) = \frac{2\, w^{\frac{n-1}{2}} e^{-w/2}}{\Gamma(n/2)\sqrt{\pi}\, 2^{(n+1)/2}}\, \frac{\Gamma(n)}{2^{n-1} \left[ \Gamma\!\left(\frac{n+1}{2}\right) \right]^2}\, \frac{\pi}{2}
= \frac{w^{\frac{n-1}{2}} e^{-w/2}}{\Gamma(n/2)\, 2^{(n+1)/2}}\, \frac{\Gamma(n)\sqrt{\pi}}{2^{n-1} \left[ \Gamma\!\left(\frac{n+1}{2}\right) \right]^2}.
\]
Now, by Problem 5 in Assignment 1, for odd $n$,
\[
\Gamma\!\left(\frac{n}{2}\right) = \frac{\Gamma(n)\sqrt{\pi}}{2^{n-1}\, \Gamma\!\left(\frac{n+1}{2}\right)}.
\]
It then follows that
\[
f_W(w) = \frac{w^{\frac{n-1}{2}} e^{-w/2}}{\Gamma\!\left(\frac{n+1}{2}\right) 2^{(n+1)/2}}
\]
for $w > 0$, which is (2.6) for odd $n$.
Next, suppose that $n$ is a positive, even integer. In this case we substitute (2.8) into (2.7) and get
\[
f_W(w) = \frac{2\, w^{\frac{n-1}{2}} e^{-w/2}}{\Gamma(n/2)\sqrt{\pi}\, 2^{(n+1)/2}}\, \frac{2^{n-2} \left[ \Gamma\!\left(\frac{n}{2}\right) \right]^2}{\Gamma(n)},
\]
or
\[
f_W(w) = \frac{w^{\frac{n-1}{2}} e^{-w/2}}{2^{(n+1)/2}}\, \frac{2^{n-1}\, \Gamma\!\left(\frac{n}{2}\right)}{\sqrt{\pi}\, \Gamma(n)}. \tag{2.10}
\]
Now, since $n$ is even, $n+1$ is odd, so that by Problem 5 in Assignment 1 again, we get that
\[
\Gamma\!\left(\frac{n+1}{2}\right) = \frac{\Gamma(n+1)\sqrt{\pi}}{2^{n}\, \Gamma\!\left(\frac{n+2}{2}\right)} = \frac{n\,\Gamma(n)\sqrt{\pi}}{2^{n}\, \frac{n}{2}\, \Gamma\!\left(\frac{n}{2}\right)},
\]
from which we get that
\[
\frac{2^{n-1}\, \Gamma\!\left(\frac{n}{2}\right)}{\sqrt{\pi}\, \Gamma(n)} = \frac{1}{\Gamma\!\left(\frac{n+1}{2}\right)}.
\]
Substituting this into (2.10) yields
\[
f_W(w) = \frac{w^{\frac{n-1}{2}} e^{-w/2}}{\Gamma\!\left(\frac{n+1}{2}\right) 2^{(n+1)/2}}
\]
for $w > 0$, which is (2.6) for even $n$. This completes the inductive step and the proof is now complete. That is, if $W \sim \chi^2(n)$, then the pdf of $W$ is given by
\[
f_W(w) =
\begin{cases}
\dfrac{1}{\Gamma(n/2)\, 2^{n/2}}\, w^{\frac{n}{2}-1} e^{-w/2} & \text{if } w > 0; \\[2mm]
0 & \text{otherwise},
\end{cases}
\]
for 𝑛 = 1, 2, 3, . . .
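The derived formula can be checked numerically against R's built-in chi-square density (a sketch, for illustration only; the choice of 5 degrees of freedom is arbitrary):

# Sanity check: the derived chi-square(n) density agrees with dchisq.
n <- 5
w <- seq(0.1, 20, by = 0.1)
derived <- w^(n/2 - 1) * exp(-w/2) / (gamma(n/2) * 2^(n/2))
max(abs(derived - dchisq(w, df = n)))   # essentially zero, up to rounding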

2.3.2 The 𝑡 Distribution


In this section we derive a very important distribution in statistics, the Student
𝑡 distribution, or 𝑡 distribution for short. We will see that this distribution will
come in handy when we complete our discussion of estimating the mean based
on a random sample from a normal(𝜇, 𝜎 2 ) distribution.
Example 2.3.8 (The 𝑡 distribution). Let 𝑍 ∼ normal(0, 1) and 𝑋 ∼ 𝜒2 (𝑛 − 1)
be independent random variables. Define
\[
T = \frac{Z}{\sqrt{X/(n-1)}}.
\]
Give the pdf of the random variable 𝑇 .

Solution: We first compute the cdf, $F_T$, of $T$; namely,
\[
F_T(t) = P(T \leqslant t) = P\!\left( \frac{Z}{\sqrt{X/(n-1)}} \leqslant t \right) = \iint_{R_t} f_{(X,Z)}(x,z)\,\mathrm{d}x\,\mathrm{d}z,
\]
where $R_t$ is the region in the $xz$--plane given by
\[
R_t = \{ (x,z) \in \mathbb{R}^2 \mid z < t\sqrt{x/(n-1)},\; x > 0 \},
\]
and the joint distribution, $f_{(X,Z)}$, of $X$ and $Z$ is given by
\[
f_{(X,Z)}(x,z) = f_X(x) \cdot f_Z(z) \quad\text{for } x > 0 \text{ and } z \in \mathbb{R},
\]
because $X$ and $Z$ are assumed to be independent. Furthermore,
\[
f_X(x) =
\begin{cases}
\dfrac{1}{\Gamma\!\left(\frac{n-1}{2}\right) 2^{(n-1)/2}}\, x^{\frac{n-1}{2}-1} e^{-x/2} & \text{if } x > 0; \\[2mm]
0 & \text{otherwise},
\end{cases}
\]
and
\[
f_Z(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}, \quad\text{for } -\infty < z < \infty.
\]
We then have that
\[
F_T(t) = \int_0^{\infty} \int_{-\infty}^{t\sqrt{x/(n-1)}} \frac{x^{\frac{n-3}{2}}\, e^{-(x+z^2)/2}}{\Gamma\!\left(\frac{n-1}{2}\right) \sqrt{\pi}\, 2^{\frac{n}{2}}}\,\mathrm{d}z\,\mathrm{d}x.
\]
Next, make the change of variables
\[
u = x, \qquad v = \frac{z}{\sqrt{x/(n-1)}},
\]
so that
\[
x = u, \qquad z = v\sqrt{u/(n-1)}.
\]
Consequently,
\[
F_T(t) = \frac{1}{\Gamma\!\left(\frac{n-1}{2}\right) \sqrt{\pi}\, 2^{\frac{n}{2}}} \int_{-\infty}^{t} \int_0^{\infty} u^{\frac{n-3}{2}}\, e^{-(u + uv^2/(n-1))/2} \left| \frac{\partial(x,z)}{\partial(u,v)} \right| \mathrm{d}u\,\mathrm{d}v,
\]
where the Jacobian of the change of variables is
\[
\frac{\partial(x,z)}{\partial(u,v)} = \det\!
\begin{pmatrix}
1 & 0 \\[1mm]
v/\bigl(2\sqrt{u}\sqrt{n-1}\bigr) & \sqrt{u}/\sqrt{n-1}
\end{pmatrix}
= \frac{u^{1/2}}{\sqrt{n-1}}.
\]
It then follows that
\[
F_T(t) = \frac{1}{\Gamma\!\left(\frac{n-1}{2}\right) \sqrt{(n-1)\pi}\, 2^{\frac{n}{2}}} \int_{-\infty}^{t} \int_0^{\infty} u^{\frac{n}{2}-1}\, e^{-(u + uv^2/(n-1))/2}\,\mathrm{d}u\,\mathrm{d}v.
\]
Next, differentiate with respect to $t$ and apply the Fundamental Theorem of Calculus to get
\[
f_T(t) = \frac{1}{\Gamma\!\left(\frac{n-1}{2}\right) \sqrt{(n-1)\pi}\, 2^{\frac{n}{2}}} \int_0^{\infty} u^{\frac{n}{2}-1}\, e^{-\left(1 + \frac{t^2}{n-1}\right) u/2}\,\mathrm{d}u.
\]
Put $\alpha = \dfrac{n}{2}$ and $\beta = \dfrac{2}{1 + \frac{t^2}{n-1}}$. Then,
\[
f_T(t) = \frac{1}{\Gamma\!\left(\frac{n-1}{2}\right) \sqrt{(n-1)\pi}\, 2^{\alpha}} \int_0^{\infty} u^{\alpha-1} e^{-u/\beta}\,\mathrm{d}u
= \frac{\Gamma(\alpha)\beta^{\alpha}}{\Gamma\!\left(\frac{n-1}{2}\right) \sqrt{(n-1)\pi}\, 2^{\alpha}} \int_0^{\infty} \frac{u^{\alpha-1} e^{-u/\beta}}{\Gamma(\alpha)\beta^{\alpha}}\,\mathrm{d}u,
\]
where
\[
f_U(u) =
\begin{cases}
\dfrac{u^{\alpha-1} e^{-u/\beta}}{\Gamma(\alpha)\beta^{\alpha}} & \text{if } u > 0 \\[2mm]
0 & \text{if } u \leqslant 0
\end{cases}
\]
is the pdf of a $\Gamma(\alpha,\beta)$ random variable (see Problem 5 in Assignment #3). We then have that
\[
f_T(t) = \frac{\Gamma(\alpha)\beta^{\alpha}}{\Gamma\!\left(\frac{n-1}{2}\right) \sqrt{(n-1)\pi}\, 2^{\alpha}} \quad\text{for } t \in \mathbb{R}.
\]
Using the definitions of $\alpha$ and $\beta$ we obtain that
\[
f_T(t) = \frac{\Gamma\!\left(\frac{n}{2}\right)}{\Gamma\!\left(\frac{n-1}{2}\right) \sqrt{(n-1)\pi}} \cdot \frac{1}{\left( 1 + \dfrac{t^2}{n-1} \right)^{n/2}} \quad\text{for } t \in \mathbb{R}.
\]
This is the pdf of a random variable with a $t$ distribution with $n-1$ degrees of freedom. In general, a random variable, $T$, is said to have a $t$ distribution with $r$ degrees of freedom, for $r \geqslant 1$, if its pdf is given by
\[
f_T(t) = \frac{\Gamma\!\left(\frac{r+1}{2}\right)}{\Gamma\!\left(\frac{r}{2}\right) \sqrt{r\pi}} \cdot \frac{1}{\left( 1 + \dfrac{t^2}{r} \right)^{(r+1)/2}} \quad\text{for } t \in \mathbb{R}.
\]
We write $T \sim t(r)$. Thus, in this example we have seen that, if $Z \sim$ normal$(0,1)$ and $X \sim \chi^2(n-1)$, then
\[
\frac{Z}{\sqrt{X/(n-1)}} \sim t(n-1).
\]
We will see the relevance of this example in the next section when we continue our discussion of estimating the mean of a normal distribution.
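The construction in Example 2.3.8 can also be illustrated empirically in R (a sketch, not part of the original notes; the values $n = 10$ and 50000 draws are arbitrary):

# Z / sqrt(X/(n-1)) with Z standard normal and X chi-square(n-1), independent,
# should behave like a t(n-1) random variable.
set.seed(2)
n <- 10
z <- rnorm(50000)
x <- rchisq(50000, df = n - 1)
t_sim <- z / sqrt(x / (n - 1))
# Compare a few simulated quantiles with the t(n-1) quantiles:
rbind(simulated = quantile(t_sim, c(0.05, 0.5, 0.95)),
      theoretical = qt(c(0.05, 0.5, 0.95), df = n - 1))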

2.3.3 Sampling from a normal distribution


Let 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 be a random sample from a normal(𝜇, 𝜎 2 ) distribution.
Then, the sample mean, 𝑋 𝑛 has a normal(𝜇, 𝜎 2 /𝑛) distribution.
Observe that
\[
|\overline{X}_n - \mu| < b \;\Longleftrightarrow\; \frac{|\overline{X}_n - \mu|}{\sigma/\sqrt{n}} < \frac{\sqrt{n}\, b}{\sigma},
\]
where
\[
\frac{\overline{X}_n - \mu}{\sigma/\sqrt{n}} \sim \text{normal}(0,1).
\]
Thus,
\[
P(|\overline{X}_n - \mu| < b) = P\!\left( |Z| < \frac{\sqrt{n}\, b}{\sigma} \right), \tag{2.11}
\]
where $Z \sim$ normal$(0,1)$. Observe that the distribution of the standard normal random variable $Z$ is independent of the parameters $\mu$ and $\sigma$. Thus, for given values of $z > 0$ we can compute $P(|Z| < z)$. For example, if there is a way of knowing the cdf for $Z$, either by looking up values in probability tables or using statistical software packages to compute them, we have that
\[
P(|Z| < z) = P(-z < Z < z) = P(-z < Z \leqslant z) = P(Z \leqslant z) - P(Z \leqslant -z) = F_Z(z) - F_Z(-z),
\]
where $F_Z(-z) = 1 - F_Z(z)$, by the symmetry of the pdf of $Z$. Consequently,
\[
P(|Z| < z) = 2F_Z(z) - 1 \quad\text{for } z > 0.
\]

Suppose that 0 < 𝛼 < 1 and let 𝑧𝛼/2 be the value of 𝑧 for which P(∣𝑍∣ < 𝑧) =
1 − 𝛼. We then have that 𝑧𝛼/2 satisfies the equation
\[
F_Z(z) = 1 - \frac{\alpha}{2}.
\]
Thus,
\[
z_{\alpha/2} = F_Z^{-1}\!\left( 1 - \frac{\alpha}{2} \right), \tag{2.12}
\]
where $F_Z^{-1}$ denotes the inverse of the cdf of $Z$. Then, setting
\[
\frac{\sqrt{n}\, b}{\sigma} = z_{\alpha/2},
\]
we see from (2.11) that
\[
P\!\left( |\overline{X}_n - \mu| < z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \right) = 1 - \alpha,
\]
which we can write as
\[
P\!\left( |\mu - \overline{X}_n| < z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \right) = 1 - \alpha,
\]
or
\[
P\!\left( \overline{X}_n - z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} < \mu < \overline{X}_n + z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \right) = 1 - \alpha, \tag{2.13}
\]
which says that the probability that the interval
\[
\left( \overline{X}_n - z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}},\; \overline{X}_n + z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \right) \tag{2.14}
\]

captures the parameter 𝜇 is 1−𝛼. The interval in (2.14) is called the 100(1−𝛼)%
confidence interval for the mean, 𝜇, based on the sample mean. Notice that
this interval assumes that the variance, 𝜎 2 , is known, which is not the case in
general. So, in practice it is not very useful (we will see later how to remedy this
situation); however, it is a good example to illustrate the concept of a confidence
interval.
For a more concrete example, let 𝛼 = 0.05. Then, to find 𝑧𝛼/2 we may use
the NORMINV function in MS Excel, which gives the inverse of the cumulative
distribution function of a normal random variable. The format for this function
is

NORMINV(probability,mean,standard_dev)

In this case the probability is $1 - \frac{\alpha}{2} = 0.975$, the mean is 0, and the standard
deviation is 1. Thus, according to (2.12), $z_{\alpha/2}$ is given by

NORMINV(0.975, 0, 1) ≈ 1.959963985

or about 1.96.
In R, the inverse cdf for a normal random variable is the qnorm function
whose format is

qnorm(probability, mean, standard_deviation).

Thus, in R, for 𝛼 = 0.05,

𝑧𝛼/2 ≈ qnorm(0.975, 0, 1) ≈ 1.959964 ≈ 1.96.

Hence the 95% confidence interval for the mean, 𝜇, of a normal(𝜇, 𝜎 2 ) distribu-
tion based on the sample mean, $\overline{X}_n$, is
\[
\left( \overline{X}_n - 1.96\,\frac{\sigma}{\sqrt{n}},\; \overline{X}_n + 1.96\,\frac{\sigma}{\sqrt{n}} \right), \tag{2.15}
\]

provided that the variance, 𝜎 2 , of the distribution is known. Unfortunately, in


most situations, 𝜎 2 is an unknown parameter, so the formula for the confidence
interval in (2.14) is not useful at all. In order to remedy this situation, in 1908,
William Sealy Gosset, writing under the pseudonym of A. Student (see [Stu08]),
proposed looking at the statistic

\[
T_n = \frac{\overline{X}_n - \mu}{S_n/\sqrt{n}},
\]
where $S_n^2$ is the sample variance defined by
\[
S_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \overline{X}_n)^2.
\]

Thus, we are replacing $\sigma$ in
\[
\frac{\overline{X}_n - \mu}{\sigma/\sqrt{n}}
\]
by the sample standard deviation, $S_n$, so that we only have one unknown parameter, $\mu$, in the definition of $T_n$.
In order to find the sampling distribution of $T_n$, we will first need to determine the distribution of $S_n^2$, given that sampling is done from a normal$(\mu,\sigma^2)$ distribution. We will find the distribution of $S_n^2$ by first finding the distribution of the statistic
\[
W_n = \frac{1}{\sigma^2} \sum_{i=1}^{n} (X_i - \overline{X}_n)^2. \tag{2.16}
\]
Starting with
\[
(X_i - \mu)^2 = \left[ (X_i - \overline{X}_n) + (\overline{X}_n - \mu) \right]^2,
\]
we can derive the identity
\[
\sum_{i=1}^{n} (X_i - \mu)^2 = \sum_{i=1}^{n} (X_i - \overline{X}_n)^2 + n(\overline{X}_n - \mu)^2, \tag{2.17}
\]

where we have used the fact that
\[
\sum_{i=1}^{n} (X_i - \overline{X}_n) = 0. \tag{2.18}
\]
Next, dividing the equation in (2.17) by $\sigma^2$ and rearranging, we obtain that
\[
\sum_{i=1}^{n} \left( \frac{X_i - \mu}{\sigma} \right)^2 = W_n + \left( \frac{\overline{X}_n - \mu}{\sigma/\sqrt{n}} \right)^2, \tag{2.19}
\]
where we have used the definition of the random variable $W_n$ in (2.16). Observe that the random variable
\[
\sum_{i=1}^{n} \left( \frac{X_i - \mu}{\sigma} \right)^2
\]
has a $\chi^2(n)$ distribution, since the $X_i$s are iid normal$(\mu,\sigma^2)$, so that
\[
\frac{X_i - \mu}{\sigma} \sim \text{normal}(0,1),
\]
and, consequently,
\[
\left( \frac{X_i - \mu}{\sigma} \right)^2 \sim \chi^2(1).
\]
Similarly,
\[
\left( \frac{\overline{X}_n - \mu}{\sigma/\sqrt{n}} \right)^2 \sim \chi^2(1),
\]
since $\overline{X}_n \sim$ normal$(\mu, \sigma^2/n)$. We can then re--write (2.19) as
\[
Y = W_n + X, \tag{2.20}
\]
where $Y \sim \chi^2(n)$ and $X \sim \chi^2(1)$. If we can prove that $W_n$ and $X$ are independent random variables, we will then be able to conclude that
\[
W_n \sim \chi^2(n-1). \tag{2.21}
\]
To see why the assertion in (2.21) is true, if $W_n$ and $X$ are independent, note that from (2.20) we get that the mgf of $Y$ is
\[
M_Y(t) = M_{W_n}(t) \cdot M_X(t),
\]

by independence of $W_n$ and $X$. Consequently,
\[
M_{W_n}(t) = \frac{M_Y(t)}{M_X(t)}
= \frac{\left( \dfrac{1}{1-2t} \right)^{n/2}}{\left( \dfrac{1}{1-2t} \right)^{1/2}}
= \left( \frac{1}{1-2t} \right)^{(n-1)/2},
\]
which is the mgf for a $\chi^2(n-1)$ random variable. Thus, in order to prove (2.21), it remains to prove that $W_n$ and $\left( \dfrac{\overline{X}_n - \mu}{\sigma/\sqrt{n}} \right)^2$ are independent random variables.

2.3.4 Distribution of the Sample Variance from a Normal


Distribution
In this section we will establish (2.21), which we now write as
\[
\frac{(n-1)}{\sigma^2}\, S_n^2 \sim \chi^2(n-1). \tag{2.22}
\]
As pointed out in the previous section, (2.22) will follow from (2.20) if we can prove that
\[
\frac{1}{\sigma^2} \sum_{i=1}^{n} (X_i - \overline{X}_n)^2 \quad\text{and}\quad \left( \frac{\overline{X}_n - \mu}{\sigma/\sqrt{n}} \right)^2 \quad\text{are independent.} \tag{2.23}
\]
In turn, the claim in (2.23) will follow from the claim that
\[
\sum_{i=1}^{n} (X_i - \overline{X}_n)^2 \quad\text{and}\quad \overline{X}_n \quad\text{are independent.} \tag{2.24}
\]

The justification for the last assertion is given in the following two examples.

Example 2.3.9. Suppose that $X$ and $Y$ are independent random variables. Show that $X$ and $Y^2$ are also independent.

Solution: Compute, for $x \in \mathbb{R}$ and $u \geqslant 0$,
\[
P(X \leqslant x, Y^2 \leqslant u) = P(X \leqslant x, |Y| \leqslant \sqrt{u})
= P(X \leqslant x, -\sqrt{u} \leqslant Y \leqslant \sqrt{u})
= P(X \leqslant x) \cdot P(-\sqrt{u} \leqslant Y \leqslant \sqrt{u}),
\]
since $X$ and $Y$ are assumed to be independent. Consequently,
\[
P(X \leqslant x, Y^2 \leqslant u) = P(X \leqslant x) \cdot P(Y^2 \leqslant u),
\]
which shows that $X$ and $Y^2$ are independent. □

Example 2.3.10. Let $a$ and $b$ be real numbers with $a \neq 0$. Suppose that $X$ and $Y$ are independent random variables. Show that $X$ and $aY + b$ are also independent.

Solution: Compute, for $x \in \mathbb{R}$ and $w \in \mathbb{R}$,
\[
P(X \leqslant x, aY + b \leqslant w) = P\!\left(X \leqslant x,\; Y \leqslant \frac{w-b}{a}\right)
= P(X \leqslant x) \cdot P\!\left(Y \leqslant \frac{w-b}{a}\right),
\]
since $X$ and $Y$ are assumed to be independent. Consequently,
\[
P(X \leqslant x, aY + b \leqslant w) = P(X \leqslant x) \cdot P(aY + b \leqslant w),
\]
which shows that $X$ and $aY + b$ are independent. □


Hence, in order to prove (2.22) it suffices to show that the claim in (2.24) is
true. To prove this last claim, observe that from (2.18) we get
𝑛

(𝑋1 − 𝑋 𝑛 ) = − (𝑋𝑖 − 𝑋 𝑛 ,
𝑖=2

so that, squaring on both sides,


( 𝑛 )2

2
(𝑋1 − 𝑋 𝑛 ) = (𝑋𝑖 − 𝑋 𝑛 .
𝑖=2

Hence, the random variable


𝑛
( 𝑛
)2 𝑛
∑ ∑ ∑
(𝑋𝑖 − 𝑋 𝑛 )2 = (𝑋𝑖 − 𝑋 𝑛 + (𝑋𝑖 − 𝑋 𝑛 )2
𝑖=1 𝑖=2 𝑖=2

is a function of the random vector

(𝑋2 − 𝑋 𝑛 , 𝑋3 − 𝑋 𝑛 , . . . , 𝑋𝑛 − 𝑋 𝑛 ).

Consequently, the claim in (2.24) will be proved if we can prove that

𝑋 𝑛 and (𝑋2 − 𝑋 𝑛 , 𝑋3 − 𝑋 𝑛 , . . . , 𝑋𝑛 − 𝑋 𝑛 ) are independent. (2.25)

The proof of the claim in (2.25) relies on the assumption that the random
variables 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 are iid normal random variables. We illustrate this
2.3. INTERVAL ESTIMATES FOR THE MEAN 35

in the following example for the spacial case in which 𝑛 = 2 and 𝑋1 , 𝑋2 ∼


normal(0, 1). Observe that, in view of Example 2.3.10, by considering

𝑋𝑖 − 𝜇
for 𝑖 = 1, 2, . . . , 𝑛,
𝜎
we may assume from the outset that 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 are iid normal(0, 1) random
variables.

Example 2.3.11. Let $X_1$ and $X_2$ denote independent normal$(0,1)$ random variables. Define
\[
U = \frac{X_1 + X_2}{2} \quad\text{and}\quad V = \frac{X_2 - X_1}{2}.
\]
Show that $U$ and $V$ are independent random variables.

Solution: Compute the joint cdf of $U$ and $V$:
\[
F_{(U,V)}(u,v) = P(U \leqslant u, V \leqslant v)
= P\!\left( \frac{X_1 + X_2}{2} \leqslant u,\; \frac{X_2 - X_1}{2} \leqslant v \right)
= P(X_1 + X_2 \leqslant 2u,\; X_2 - X_1 \leqslant 2v)
= \iint_{R_{u,v}} f_{(X_1,X_2)}(x_1, x_2)\,\mathrm{d}x_1\,\mathrm{d}x_2,
\]
where $R_{u,v}$ is the region in $\mathbb{R}^2$ defined by
\[
R_{u,v} = \{ (x_1, x_2) \in \mathbb{R}^2 \mid x_1 + x_2 \leqslant 2u,\; x_2 - x_1 \leqslant 2v \},
\]
and $f_{(X_1,X_2)}$ is the joint pdf of $X_1$ and $X_2$:
\[
f_{(X_1,X_2)}(x_1, x_2) = \frac{1}{2\pi}\, e^{-(x_1^2 + x_2^2)/2} \quad\text{for all } (x_1, x_2) \in \mathbb{R}^2,
\]
where we have used the assumption that $X_1$ and $X_2$ are independent normal$(0,1)$ random variables.
Next, make the change of variables
\[
r = x_1 + x_2 \quad\text{and}\quad w = x_2 - x_1,
\]
so that
\[
x_1 = \frac{r - w}{2} \quad\text{and}\quad x_2 = \frac{r + w}{2},
\]
and therefore
\[
x_1^2 + x_2^2 = \frac{1}{2}\,(r^2 + w^2).
\]
Thus, by the change of variables formula,
\[
F_{(U,V)}(u,v) = \int_{-\infty}^{2u} \int_{-\infty}^{2v} \frac{1}{2\pi}\, e^{-(r^2 + w^2)/4} \left| \frac{\partial(x_1, x_2)}{\partial(r, w)} \right| \mathrm{d}w\,\mathrm{d}r,
\]
where
\[
\frac{\partial(x_1, x_2)}{\partial(r, w)} = \det\!
\begin{pmatrix}
1/2 & -1/2 \\[1mm]
1/2 & 1/2
\end{pmatrix}
= \frac{1}{2}.
\]
Thus,
\[
F_{(U,V)}(u,v) = \frac{1}{4\pi} \int_{-\infty}^{2u} \int_{-\infty}^{2v} e^{-r^2/4} \cdot e^{-w^2/4}\,\mathrm{d}w\,\mathrm{d}r,
\]
which we can write as
\[
F_{(U,V)}(u,v) = \frac{1}{2\sqrt{\pi}} \int_{-\infty}^{2u} e^{-r^2/4}\,\mathrm{d}r \;\cdot\; \frac{1}{2\sqrt{\pi}} \int_{-\infty}^{2v} e^{-w^2/4}\,\mathrm{d}w.
\]
Taking partial derivatives with respect to $u$ and $v$ yields
\[
f_{(U,V)}(u,v) = \frac{1}{\sqrt{\pi}}\, e^{-u^2} \cdot \frac{1}{\sqrt{\pi}}\, e^{-v^2},
\]
where we have used the fundamental theorem of calculus and the chain rule. Thus, the joint pdf of $U$ and $V$ is the product of the two marginal pdfs
\[
f_U(u) = \frac{1}{\sqrt{\pi}}\, e^{-u^2} \quad\text{for } -\infty < u < \infty,
\]
and
\[
f_V(v) = \frac{1}{\sqrt{\pi}}\, e^{-v^2} \quad\text{for } -\infty < v < \infty.
\]
Hence, $U$ and $V$ are independent random variables. □
To prove in general that if $X_1, X_2, \ldots, X_n$ is a random sample from a normal$(0,1)$ distribution, then
\[
\overline{X}_n \quad\text{and}\quad (X_2 - \overline{X}_n, X_3 - \overline{X}_n, \ldots, X_n - \overline{X}_n) \quad\text{are independent},
\]
we may proceed as follows. Denote the random vector
\[
(X_2 - \overline{X}_n, X_3 - \overline{X}_n, \ldots, X_n - \overline{X}_n)
\]
by $Y$, and compute the joint cdf of $\overline{X}_n$ and $Y$:
\[
F_{(\overline{X}_n, Y)}(u, v_2, v_3, \ldots, v_n)
= P(\overline{X}_n \leqslant u,\; X_2 - \overline{X}_n \leqslant v_2,\; X_3 - \overline{X}_n \leqslant v_3,\; \ldots,\; X_n - \overline{X}_n \leqslant v_n)
= \idotsint_{R_{u,v_2,v_3,\ldots,v_n}} f_{(X_1,X_2,\ldots,X_n)}(x_1, x_2, \ldots, x_n)\,\mathrm{d}x_1\,\mathrm{d}x_2 \cdots \mathrm{d}x_n,
\]
where
\[
R_{u,v_2,\ldots,v_n} = \{ (x_1, x_2, \ldots, x_n) \in \mathbb{R}^n \mid \overline{x} \leqslant u,\; x_2 - \overline{x} \leqslant v_2,\; \ldots,\; x_n - \overline{x} \leqslant v_n \},
\]
for $\overline{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n}$, and the joint pdf of $X_1, X_2, \ldots, X_n$ is
\[
f_{(X_1,X_2,\ldots,X_n)}(x_1, x_2, \ldots, x_n) = \frac{1}{(2\pi)^{n/2}}\, e^{-\left(\sum_{i=1}^n x_i^2\right)/2} \quad\text{for all } (x_1, x_2, \ldots, x_n) \in \mathbb{R}^n,
\]
since $X_1, X_2, \ldots, X_n$ are iid normal$(0,1)$.


Next, make the change of variables
\[
y_1 = \overline{x}, \quad
y_2 = x_2 - \overline{x}, \quad
y_3 = x_3 - \overline{x}, \quad \ldots, \quad
y_n = x_n - \overline{x},
\]
so that
\[
x_1 = y_1 - \sum_{i=2}^{n} y_i, \quad
x_2 = y_1 + y_2, \quad
x_3 = y_1 + y_3, \quad \ldots, \quad
x_n = y_1 + y_n,
\]
and therefore
\[
\sum_{i=1}^{n} x_i^2 = \left( y_1 - \sum_{i=2}^{n} y_i \right)^2 + \sum_{i=2}^{n} (y_1 + y_i)^2
= n y_1^2 + \left( \sum_{i=2}^{n} y_i \right)^2 + \sum_{i=2}^{n} y_i^2
= n y_1^2 + C(y_2, y_3, \ldots, y_n),
\]
where we have set
\[
C(y_2, y_3, \ldots, y_n) = \left( \sum_{i=2}^{n} y_i \right)^2 + \sum_{i=2}^{n} y_i^2.
\]

Thus, by the change of variables formula,
\[
F_{(\overline{X}_n, Y)}(u, v_2, \ldots, v_n)
= \int_{-\infty}^{u} \int_{-\infty}^{v_2} \cdots \int_{-\infty}^{v_n} \frac{e^{-(n y_1^2 + C(y_2,\ldots,y_n))/2}}{(2\pi)^{n/2}} \left| \frac{\partial(x_1, x_2, \ldots, x_n)}{\partial(y_1, y_2, \ldots, y_n)} \right| \mathrm{d}y_n \cdots \mathrm{d}y_1,
\]
where
\[
\frac{\partial(x_1, x_2, \ldots, x_n)}{\partial(y_1, y_2, \ldots, y_n)} = \det\!
\begin{pmatrix}
1 & -1 & -1 & \cdots & -1 \\
1 & 1 & 0 & \cdots & 0 \\
1 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
1 & 0 & 0 & \cdots & 1
\end{pmatrix}.
\]
In order to compute this determinant, observe that
\[
\begin{aligned}
n y_1 &= x_1 + x_2 + x_3 + \cdots + x_n, \\
n y_2 &= -x_1 + (n-1)x_2 - x_3 - \cdots - x_n, \\
n y_3 &= -x_1 - x_2 + (n-1)x_3 - \cdots - x_n, \\
&\;\;\vdots \\
n y_n &= -x_1 - x_2 - x_3 - \cdots + (n-1)x_n,
\end{aligned}
\]
which can be written in matrix form as
\[
n
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix}
= A
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix},
\]
where $A$ is the $n \times n$ matrix
\[
A =
\begin{pmatrix}
1 & 1 & 1 & \cdots & 1 \\
-1 & (n-1) & -1 & \cdots & -1 \\
-1 & -1 & (n-1) & \cdots & -1 \\
\vdots & \vdots & \vdots & & \vdots \\
-1 & -1 & -1 & \cdots & (n-1)
\end{pmatrix},
\]
whose determinant is
\[
\det A = \det\!
\begin{pmatrix}
1 & 1 & 1 & \cdots & 1 \\
0 & n & 0 & \cdots & 0 \\
0 & 0 & n & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
0 & 0 & 0 & \cdots & n
\end{pmatrix}
= n^{n-1}.
\]
Thus, since
\[
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix}
= n A^{-1}
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix},
\]
it follows that
\[
\frac{\partial(x_1, x_2, \ldots, x_n)}{\partial(y_1, y_2, \ldots, y_n)} = \det(n A^{-1}) = n^n \cdot \frac{1}{n^{n-1}} = n.
\]
Consequently,
\[
F_{(\overline{X}_n, Y)}(u, v_2, \ldots, v_n)
= \int_{-\infty}^{u} \int_{-\infty}^{v_2} \cdots \int_{-\infty}^{v_n} \frac{n\, e^{-n y_1^2/2}\, e^{-C(y_2,\ldots,y_n)/2}}{(2\pi)^{n/2}}\,\mathrm{d}y_n \cdots \mathrm{d}y_2\,\mathrm{d}y_1,
\]
which can be written as
\[
F_{(\overline{X}_n, Y)}(u, v_2, \ldots, v_n)
= \int_{-\infty}^{u} \frac{\sqrt{n}\, e^{-n y_1^2/2}}{\sqrt{2\pi}}\,\mathrm{d}y_1 \;\cdot\; \int_{-\infty}^{v_2} \cdots \int_{-\infty}^{v_n} \frac{\sqrt{n}\, e^{-C(y_2,\ldots,y_n)/2}}{(2\pi)^{(n-1)/2}}\,\mathrm{d}y_n \cdots \mathrm{d}y_2.
\]
Observe that
\[
\int_{-\infty}^{u} \frac{\sqrt{n}\, e^{-n y_1^2/2}}{\sqrt{2\pi}}\,\mathrm{d}y_1
\]
is the cdf of a normal$(0, 1/n)$ random variable, which is the distribution of $\overline{X}_n$. Therefore
\[
F_{(\overline{X}_n, Y)}(u, v_2, \ldots, v_n) = F_{\overline{X}_n}(u) \cdot \int_{-\infty}^{v_2} \cdots \int_{-\infty}^{v_n} \frac{\sqrt{n}\, e^{-C(y_2,\ldots,y_n)/2}}{(2\pi)^{(n-1)/2}}\,\mathrm{d}y_n \cdots \mathrm{d}y_2,
\]
which shows that $\overline{X}_n$ and the random vector
\[
Y = (X_2 - \overline{X}_n, X_3 - \overline{X}_n, \ldots, X_n - \overline{X}_n)
\]
are independent. Hence we have established (2.22); that is,
\[
\frac{(n-1)}{\sigma^2}\, S_n^2 \sim \chi^2(n-1).
\]

2.3.5 The Distribution of 𝑇𝑛


We are now in a position to determine the sampling distribution of the statistic
\[
T_n = \frac{\overline{X}_n - \mu}{S_n/\sqrt{n}}, \tag{2.26}
\]
where $\overline{X}_n$ and $S_n^2$ are the sample mean and variance, respectively, based on a random sample of size $n$ taken from a normal$(\mu,\sigma^2)$ distribution.
We begin by re--writing the expression for $T_n$ in (2.26) as
\[
T_n = \frac{\dfrac{\overline{X}_n - \mu}{\sigma/\sqrt{n}}}{\dfrac{S_n}{\sigma}}, \tag{2.27}
\]
and observing that
\[
Z_n = \frac{\overline{X}_n - \mu}{\sigma/\sqrt{n}} \sim \text{normal}(0,1).
\]
Furthermore,
\[
\frac{S_n}{\sigma} = \sqrt{\frac{S_n^2}{\sigma^2}} = \sqrt{\frac{V_n}{n-1}},
\]
where
\[
V_n = \frac{n-1}{\sigma^2}\, S_n^2,
\]
which has a $\chi^2(n-1)$ distribution, according to (2.22). It then follows from (2.27) that
\[
T_n = \frac{Z_n}{\sqrt{\dfrac{V_n}{n-1}}},
\]
where $Z_n$ is a standard normal random variable, and $V_n$ has a $\chi^2$ distribution with $n-1$ degrees of freedom. Furthermore, by (2.23), $Z_n$ and $V_n$ are independent. Consequently, using the result in Example 2.3.8, the statistic $T_n$ defined in (2.26) has a $t$ distribution with $n-1$ degrees of freedom; that is,
\[
\frac{\overline{X}_n - \mu}{S_n/\sqrt{n}} \sim t(n-1). \tag{2.28}
\]
Notice that the distribution on the right--hand side of (2.28) is independent of the parameters $\mu$ and $\sigma^2$; we can therefore obtain a confidence interval for the mean of a normal$(\mu,\sigma^2)$ distribution based on the sample mean and variance calculated from a random sample of size $n$ by determining a value $t_{\alpha/2}$ such that
\[
P(|T_n| < t_{\alpha/2}) = 1 - \alpha.
\]
We then have that
\[
P\!\left( \frac{|\overline{X}_n - \mu|}{S_n/\sqrt{n}} < t_{\alpha/2} \right) = 1 - \alpha,
\]
or
\[
P\!\left( |\mu - \overline{X}_n| < t_{\alpha/2}\,\frac{S_n}{\sqrt{n}} \right) = 1 - \alpha,
\]
or
\[
P\!\left( \overline{X}_n - t_{\alpha/2}\,\frac{S_n}{\sqrt{n}} < \mu < \overline{X}_n + t_{\alpha/2}\,\frac{S_n}{\sqrt{n}} \right) = 1 - \alpha.
\]
We have therefore obtained a $100(1-\alpha)\%$ confidence interval for the mean of a normal$(\mu,\sigma^2)$ distribution based on the sample mean and variance of a random sample of size $n$ from that distribution; namely,
\[
\left( \overline{X}_n - t_{\alpha/2}\,\frac{S_n}{\sqrt{n}},\; \overline{X}_n + t_{\alpha/2}\,\frac{S_n}{\sqrt{n}} \right). \tag{2.29}
\]

To find the value for $t_{\alpha/2}$ in (2.29) we use the fact that the pdf for the $t$ distribution is symmetric about the vertical line at 0 (or even) to obtain that
\[
P(|T_n| < t) = P(-t < T_n < t) = P(-t < T_n \leqslant t) = P(T_n \leqslant t) - P(T_n \leqslant -t) = F_{T_n}(t) - F_{T_n}(-t),
\]
where we have used the fact that $T_n$ is a continuous random variable. Now, by the symmetry of the pdf of $T_n$, $F_{T_n}(-t) = 1 - F_{T_n}(t)$. Thus,
\[
P(|T_n| < t) = 2F_{T_n}(t) - 1 \quad\text{for } t > 0.
\]
So, to find $t_{\alpha/2}$ we need to solve
\[
F_{T_n}(t) = 1 - \frac{\alpha}{2}.
\]
We therefore get that
\[
t_{\alpha/2} = F_{T_n}^{-1}\!\left( 1 - \frac{\alpha}{2} \right), \tag{2.30}
\]
where $F_{T_n}^{-1}$ denotes the inverse of the cdf of $T_n$.

Example 2.3.12. Give a 95% confidence interval for the mean of a normal
distribution based on the sample mean and variance computed from a sample
of size 𝑛 = 20.

Solution: In this case, 𝛼 = 0.05 and 𝑇𝑛 ∼ 𝑡(19).


To find 𝑡𝛼/2 we may use the TINV function in MS Excel, which
gives the inverse of the two–tailed cumulative distribution function
of a random variable with a 𝑡 distribution. That is, the inverse of the
function
P(∣𝑇𝑛 ∣ > 𝑡) for 𝑡 > 0.
The format for this function is

TINV(probability,degrees_freedom)

In this case the probability of the two tails is 𝛼 = 0.05 and the
number of degrees of freedom is 19. Thus, according to (2.30), 𝑡𝛼/2
is given by
TINV(0.05, 19) ≈ 2.09,
where we have used 0.05 because TINV in MS Excel gives two-tailed
probability distribution values.
In R, the inverse cdf for a random variable with a 𝑡 distribution is
the qt function, whose format is

qt(probability, df).

Thus, in R, for 𝛼 = 0.05,

𝑡𝛼/2 ≈ qt(0.975, 19) ≈ 2.09.

Hence the 95% confidence interval for the mean, $\mu$, of a normal$(\mu,\sigma^2)$ distribution based on the sample mean, $\overline{X}_n$, and the sample variance, $S_n^2$, is
\[
\left( \overline{X}_n - 2.09\,\frac{S_n}{\sqrt{n}},\; \overline{X}_n + 2.09\,\frac{S_n}{\sqrt{n}} \right), \tag{2.31}
\]
where $n = 20$. □

Example 2.3.13. Obtain a 95% confidence interval for the average number of
popcorn kernels in a 1/4–cup based on the data in Table 1.2 on page 8.

Solution: Assume that the underlying distribution of the count
of kernels in 1/4–cup is normal(𝜇, 𝜎 2 ), where 𝜇 is the parameter we
are trying to estimate and 𝜎 2 is unknown.
The sample mean, 𝑋 𝑛 , based on the sample in Table 1.2 on page
8 is 𝑋 𝑛 ≈ 342. The sample standard deviation is 𝑆𝑛 ≈ 35. Thus,
using the formula in (2.31) we get that

(326, 358)

is a 95% confidence interval for the average number of kernels in one


quarter cup. □
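This interval can be reproduced directly in R (a sketch, using the kernel counts from Table 1.2 and the qt function introduced above):

# Reproducing the interval in Example 2.3.13.
kernels <- c(356, 368, 356, 351, 339, 298, 289, 352, 447, 314,
             332, 369, 298, 327, 319, 316, 341, 367, 357, 334)
n <- length(kernels)                 # 20
xbar <- mean(kernels)                # about 342
s <- sd(kernels)                     # about 35
t_crit <- qt(0.975, df = n - 1)      # about 2.09
c(lower = xbar - t_crit * s / sqrt(n),
  upper = xbar + t_crit * s / sqrt(n))   # roughly (326, 358)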
Chapter 3

Hypothesis Testing

In the examples of the previous chapters, we have assumed certain underlying


distributions which depended on one or more parameters. Based on that assump-
tion, we have obtained estimates for a parameter through calculations made
with the values of a random sample; this process yielded statistics which can
be used as estimators for the parameters. Assuming an underlying distribu-
tion allowed as to determine the sampling distribution for the statistic; in turn,
knowledge of the sampling distribution permitted the calculations of probabili-
ties that estimates are within certain range of a parameter.
For instance, the confidence interval estimate for the average number of
popcorn kernels in one–quarter cup presented in Example 2.3.13 on page 42 re-
lied on the assumption that the number of kernels in one–quarter cup follows
a normal(𝜇, 𝜎 2 ) distribution. There is nothing sacred about the normality as-
sumption. The assumption was made because a lot is known about the normal
distribution and therefore the assumption was a convenient one to use in order
to illustrate the concept of a confidence interval. In fact, cursory study of the
data in Table 1.2 on page 8 reveals that the shape of the distribution might not
be as bell–shaped as one would hope; see the histogram of the data in Figure
3.0.1 on page 44, where 𝑁 denotes the number of kernels in one–quarter cup.
Nevertheless, the hope is that, if a larger sample of one–quarter cups of kernels
is collected, then we would expect to see the numbers bunching up around some
value close to the true mean count of kernels in one–quarter cup.
The preceding discussion underscores the fact that assumptions, which are
made to facilitate the process of parameter estimation, are also there to be
questioned. In particular, the following question needs to be addressed: Does
the assumed underlying distribution truly reflect what is going on in the real
problem being studied? The process of questioning assumptions falls under the
realm of Hypothesis Testing in statistical inference. In this Chapter we discuss
how this can be done. We begin with the example of determining whether the
number of unpopped kernels in one–quarter cup follows a Poisson distribution.


[Figure 3.0.1: Histogram of Data in Table 1.2 (frequency of 𝑁, the number of kernels in one-quarter cup, for counts between roughly 300 and 450).]

3.1 Chi–Square Goodness of Fit Test


We begin with the example of determining whether the counts of unpopped
kernels in one–quarter cup shown in Table 1.1 on page 6 can be accounted for
by a Poisson(𝜆) distribution, where 𝜆 is the average number of unpopped kernels
in one quarter cup. A point estimate for 𝜆 is then given by the average of the
values in the table; namely, 56. Before we proceed any further, I must point out
that the reason that I am assuming a Poisson model for the data in Table 1.1 is
merely for the sake of illustration of the Chi–Square test that we’ll be discussing
in this section. However, a motivation for the use of the Poisson model is that
a Poisson random variable is a limit of binomial random variables as 𝑛 tends to
infinity under the condition that 𝑛𝑝 = 𝜆 remains constant (see your solution to
Problem 5 in Assignment 2). However, this line of reasoning would be justified
if the probability that a given kernel will not pop is small, which is not really
justified in this situation since, by the result in Example 2.2.2 on page 17, a 95%
confidence interval for 𝑝 is (0.109, 0.183). In addition, a look at the histogram
of the data in Table 1.1, shown in Figure 3.1.2, reveals that the shape of that
distribution for the number of unpopped kernels is far from being Poisson. The
reason for this is that the right tail of a Poisson distribution should be thinner
than that of the distribution shown in Figure 3.1.2.
Moreover, calculation of the sample variance for the data in Table 1.1 yields
1810, which is way too far from the sample mean of 56. Recall that the mean
and variance of a Poisson(𝜆) distribution are both equal to 𝜆.

[Figure 3.1.2: Histogram of data for unpopped kernels in Table 1.1 (frequency of UnPoppedN for counts between 0 and 140).]

Despite all of the objections to the applicability of the Poisson model to


the unpopped kernels data listed previously, I will proceed with the Poisson
assumption in order to illustrate the Chi–Square Goodness of Fit method, which
provides a quantitative way to reject the Poisson model with confidence.
Thus, assuming that the Poisson model is the mechanism behind the ob-
servations of unpopped kernels, we may compute the probabilities of observing
certain counts of unpopped kernels by using the pmf
𝜆𝑘 −𝜆
𝑝𝑋 (𝑘) =
𝑒 for 𝑘 = 0, 1, 2, . . . ,
𝑘!
and 0 elsewhere, where 𝜆 is taken to be the estimated value of 56. Hence, the
probability that we observe counts between 0 and 50 is
50

P(0 ⩽ 𝑋 ⩽ 50) = 𝑝𝑋 (𝑘) ≈ 0.2343;
𝑘=0

between 51 and 55:


55

P(51 ⩽ 𝑋 ⩽ 55) = 𝑝𝑋 (𝑘) ≈ 0.2479;
𝑘=51

between 56 and 60:


60

P(56 ⩽ 𝑋 ⩽ 60) = 𝑝𝑋 (𝑘) ≈ 0.2487;
𝑘=56
46 CHAPTER 3. HYPOTHESIS TESTING

and 61 and above:




P(𝑋 ⩾ 61) = 𝑝𝑋 (𝑘) ≈ 0.2691.
𝑘=61

We have therefore divided the range of possible observations into categories, and
the probabilities computed above give the likelihood a given observation (the
count of unpopped kernels, in this case) will fall in a given category, assuming
that a Poisson model is driving the process. Using these probabilities, we can
predict how many observations out of the 27 will fall, on average, in each cat-
egory. If the probability that a count will fall in category 𝑖 is 𝑝𝑖 , and 𝑛 is the
total number of observations (in this case, 𝑛 = 27), then the predicted number
of counts in category 𝑖 is
𝐸𝑖 = 𝑛𝑝𝑖 .
Table 3.1 shows the predicted values in each category as well as the actual
(observed) counts.

Category Counts 𝑝𝑖 Predicted Observed


(𝑖) Range Counts Counts
1 0 ⩽ 𝑋 ⩽ 50 0.2343 6 14
2 51 ⩽ 𝑋 ⩽ 55 0.2479 7 3
3 56 ⩽ 𝑋 ⩽ 60 0.2487 7 2
4 𝑋 ⩾ 61 0.2691 7 8

Table 3.1: Counts Predicted by the Poisson Model

The last column in Table 3.1 shows that actual observed counts based on the
data in Table 1.1 on page 6. Are the large discrepancies between the observed
and predicted counts in the first three categories in Table 3.1 enough evidence
for us to dismiss the Poisson hypothesis? One of the goals of this chapter is to
answer this question with confidence. We will need to find a way to measure
the discrepancy that will allow us to make statements based on probability
calculations. A measure of the discrepancy between the values predicted by
an assumed probability model and the values that are actually observed in the
data was introduced by Karl Pearson in 1900, [Pla83]. In order to motivate
the Pearson’s statistic, we first present an example involving the multinomial
distribution.

3.1.1 The Multinomial Distribution


Consider the general situation of 𝑘 categories whose counts are given by random
variables 𝑋1 , 𝑋1 , . . . , 𝑋𝑘 . Assume that there is a total 𝑛 of observations so that

𝑋1 + 𝑋2 + ⋅ ⋅ ⋅ + 𝑋𝑘 = 𝑛. (3.1)
3.1. CHI–SQUARE GOODNESS OF FIT TEST 47

We assume that the probability that a count is going to fall in category 𝑖 is 𝑝𝑖


for 𝑖 = 1, 2, . . . , 𝑘. Assume also that the categories are mutually exclusive and
exhaustive so that
𝑝1 + 𝑝2 + ⋅ ⋅ ⋅ + 𝑝𝑘 = 1. (3.2)
Then, the distribution of the random vector

X = (𝑋1 , 𝑋2 , . . . , 𝑋𝑘 ) (3.3)

is multinomial so that the joint pmf of the random variables 𝑋1 , 𝑋1 , . . . , 𝑋𝑘 ,


given that 𝑋1 + 𝑋2 + ⋅ ⋅ ⋅ + 𝑋𝑘 = 𝑛, is

𝑛!
⎧ ∑𝑘
𝑛1 𝑛2 𝑛𝑘
⎨ 𝑛 !𝑛 ! ⋅ ⋅ ⋅ 𝑛 ! 𝑝1 𝑝2 ⋅ ⋅ ⋅ 𝑝𝑘 if 𝑖=1 𝑛𝑘 = 𝑛;


1 2 𝑘
𝑝(𝑋1 ,𝑋2 ,...,𝑋𝑘 ) (𝑛1 , 𝑛2 , . . . , 𝑛𝑘 ) =


0 otherwise.

(3.4)
We first show that each 𝑋𝑖 has marginal distribution which is binomial(𝑛, 𝑝𝑖 ),
so that
𝐸(𝑋𝑖 ) = 𝑛𝑝𝑖 for all 𝑖 = 1, 2, . . . , 𝑘,
and
var((𝑋𝑖 )) = 𝑛𝑝𝑖 (1 − 𝑝𝑖 ) for all 𝑖 = 1, 2, . . . , 𝑘.
Note that the 𝑋1 , 𝑋2 , . . . , 𝑋𝑘 are not independent because of the relation in
(3.1). In fact, it can be shown that

cov(𝑋𝑖 , 𝑋𝑗 ) = −𝑛𝑝𝑗 𝑝𝑗 for 𝑖 ∕= 𝑗.

We will first establish that the marginal distribution of 𝑋𝑖 is binomial. We


will show it for 𝑋1 in the following example. The proof for the other variables
is similar. In the proof, though, we will need the following extension of the
binomial theorem known as the multinomial theorem [CB01, Theorem 4.6.4, p.
181].

Theorem 3.1.1 (Multinomial Theorem). Let 𝑛, 𝑛1 , 𝑛2 , . . . , 𝑛𝑘 denote non–


negative integers, and 𝑎1 , 𝑎2 , . . . , 𝑎𝑘 be real numbers. Then,
∑ 𝑛!
(𝑎1 + 𝑎2 + ⋅ ⋅ ⋅ + 𝑎𝑘 )𝑛 = 𝑎𝑛1 1 𝑎𝑛2 2 ⋅ ⋅ ⋅ 𝑎𝑛𝑘 𝑘 ,
𝑛1 +𝑛2 +⋅⋅⋅+𝑛𝑘
𝑛 !𝑛
=𝑛 1 2
! ⋅ ⋅ ⋅ 𝑛 𝑘 !

where the sum is take over all 𝑘–tuples of nonnegative integers, 𝑛1 , 𝑛2 , . . . , 𝑛𝑘


which add up to 𝑛.

Remark 3.1.2. Note that when 𝑘 = 2 in Theorem 3.1.1 we recover the binomial
theorem,

Example 3.1.3. Let (𝑋1 , 𝑋2 , . . . , 𝑋𝑘 ) have a multinomil distribution with pa-


rameters 𝑛, 𝑝1 , 𝑝2 , . . . , 𝑝𝑘 . Then, the marginal distribution of 𝑋1 is binomial(𝑛, 𝑝1 ).
48 CHAPTER 3. HYPOTHESIS TESTING

Solution: The marginal distribution of 𝑋1 has pmf

∑ 𝑛!
𝑝𝑋1 (𝑛1 ) = 𝑝𝑛1 𝑝𝑛2 ⋅ ⋅ ⋅ 𝑝𝑛𝑘 𝑘 ,
𝑛2 ,𝑛3 ,...,𝑛𝑘
𝑛1 !𝑛2 ! ⋅ ⋅ ⋅ 𝑛𝑘 ! 1 2
𝑛2 +𝑛3 +...+𝑛𝑘 =𝑛−𝑛1

where the summation is taken over all nonnegative, integer values of


𝑛2 , 𝑛3 , . . . , 𝑛𝑘 which add up to 𝑛 − 𝑛1 . We then have that

𝑝𝑛1 1 ∑ 𝑛!
𝑝𝑋1 (𝑛1 ) = 𝑝𝑛2 ⋅ ⋅ ⋅ 𝑝𝑛𝑘 𝑘
𝑛1 ! 𝑛2 ,𝑛3 ,...,𝑛𝑘
𝑛2 ! ⋅ ⋅ ⋅ 𝑛𝑘 ! 2
𝑛2 +𝑛3 +...+𝑛𝑘 =𝑛−𝑛1

𝑝𝑛1 1 𝑛! ∑ (𝑛 − 𝑛1 )! 𝑛2
= 𝑝 ⋅ ⋅ ⋅ 𝑝𝑛𝑘 𝑘
𝑛1 ! (𝑛 − 𝑛1 )! 𝑛2 ,𝑛3 ,...,𝑛𝑘
𝑛2 ! ⋅ ⋅ ⋅ 𝑛𝑘 ! 2
𝑛2 +𝑛3 +...+𝑛𝑘 =𝑛−𝑛1

( )
𝑛 𝑛1
= 𝑝 (𝑝2 + 𝑝3 + ⋅ ⋅ ⋅ + 𝑝𝑘 )𝑛−𝑛1 ,
𝑛1 1

where we have applied the multinomial theorem (Theorem 3.1.1).


Using (3.2) we then obtain that
( )
𝑛 𝑛1
𝑝𝑋1 (𝑛1 ) = 𝑝 (1 − 𝑝1 )𝑛−𝑛1 ,
𝑛1 1

which is the pmf of a binomial(𝑛, 𝑝1 ) distribution. □

3.1.2 The Pearson Chi-Square Statistic


We first consider the example of a multinomial random vector (𝑋1 , 𝑋2 ) with
parameters 𝑛, 𝑝1 , 𝑝2 ; in other words, there are only two categories and the counts
in each category are binomial(𝑛, 𝑝𝑖 ) for 𝑖 = 1, 2, with 𝑋1 + 𝑋2 = 𝑛. We consider
the situation when 𝑛 is very large. In this case, the random variable

𝑋1 − 𝑛𝑝1
𝑍=√
𝑛𝑝1 (1 − 𝑝1 )

has an approximate normal(0, 1) distribution for large values of 𝑛. Consequently,


for large values of 𝑛,
(𝑋1 − 𝑛𝑝1 )2
𝑍2 =
𝑛𝑝1 (1 − 𝑝1 )

has an approximate 𝜒2 (1) distribution.


3.1. CHI–SQUARE GOODNESS OF FIT TEST 49

Note that we can write


(𝑋1 − 𝑛𝑝1 )2 (1 − 𝑝1 ) + (𝑋1 − 𝑛𝑝1 )2 𝑝1
𝑍2 =
𝑛𝑝1 (1 − 𝑝1 )

(𝑋1 − 𝑛𝑝1 )2 (𝑋1 − 𝑛𝑝1 )2


= +
𝑛𝑝1 𝑛(1 − 𝑝1 )

(𝑋1 − 𝑛𝑝1 )2 (𝑛 − 𝑋2 − 𝑛𝑝1 )2


= +
𝑛𝑝1 𝑛(1 − 𝑝1 )

(𝑋1 − 𝑛𝑝1 )2 (𝑋2 − 𝑛(1 − 𝑝1 ))2


= +
𝑛𝑝1 𝑛(1 − 𝑝1 )

(𝑋1 − 𝑛𝑝1 )2 (𝑋2 − 𝑛𝑝2 )2


= + .
𝑛𝑝1 𝑛𝑝2

We have therefore proved that, for large values of 𝑛, the random variable

(𝑋1 − 𝑛𝑝1 )2 (𝑋2 − 𝑛𝑝2 )2


𝑄= +
𝑛𝑝1 𝑛𝑝2

has an approximate 𝜒2 (1) distribution.


The random variable 𝑄 is the Pearson Chi–Square statistic for 𝑘 = 2.

Theorem 3.1.4 (Pearson Chi–Square Statistic). Let (𝑋1 , 𝑋2 , . . . , 𝑋𝑘 ) be a ran-


dom vector with a multinomial(𝑛, 𝑝1 , . . . , 𝑝𝑘 ) distribution. The random variable

𝑘
∑ (𝑋𝑖 − 𝑛𝑝𝑖 )2
𝑄= (3.5)
𝑖=1
𝑛𝑝𝑖

has an approximate 𝜒2 (𝑘 − 1) distribution for large values of 𝑛. If the 𝑝𝑖 s


are computed assuming an underlying distribution with 𝑐 unknown parameters,
then the number of degrees of freedom in the chi–square distribution for 𝑄 get
reduced by 𝑐. In other words

𝑄 ∼ 𝜒2 (𝑘 − 𝑐 − 1) for large values of 𝑛.

Theorem 3.1.4, the proof of which is relegated to Appendix A on page 87 in


these notes, forms the basis for the Chi–Square Goodness of Fit Test. Examples
of the application of this result will be given in subsequent sections.

3.1.3 Goodness of Fit Test


We now go back to the analysis of the data portrayed in Table 3.1 on page 46.
Letting 𝑋1 , 𝑋2 , 𝑋3 , 𝑋4 denote the observed counts in the fourth column of the
50 CHAPTER 3. HYPOTHESIS TESTING

table, we compute the value of the Pearson Chi–Square statistic according to


(3.5) to be
4
∑ (𝑋𝑖 − 𝑛𝑝𝑖 )2
𝑄
ˆ= ≈ 16.67,
𝑖=1
𝑛𝑝𝑖
where, in this case, 𝑛 = 27 and the 𝑝𝑖 s are given in the third column of Table
3.1. This is the measure of how far the observed counts are from the counts
predicted by the Poisson assumption. How significant is the number 16.67? Is
it a big number or not? More importantly, how probable would a value like
16.67, or higher, be if the Poisson assumption is true? The last question is
one we could answer approximately by using Pearson’s Theorem 3.1.4. Since,
𝑄 ∼ 𝜒2 (2) in this case, the answer to the last question is

𝑝 = P(𝑄 > 16.67) ≈ 0.0002,

or 0.02%, less than 1%, which is a very small probability. Thus, the chances of
observing the counts in the fourth column of Table 3.1 on page 46, under the
assumption that the Poisson hypothesis is true, are very small. The fact that
we did observe those counts, and the counts came from observations recorded in
Table 1.1 on page 6 suggest that it is highly unlikely that the counts of unpopped
kernels in that table follow a Poisson distribution. We are therefore justified in
rejecting the Poisson hypothesis on the basis on not enough statistical support
provided by the data.

3.2 The Language and Logic of Hypothesis Tests


The argument that we followed in the example presented in the previous section
is typical of hypothesis tests.

∙ Postulate a Null Hypothesis. First, we postulated a hypothesis that


purports to explain patters observed in data. This hypothesis is the one
to be tested against the data. In the example at hand, we want to test
whether the counts of unpopped kernels in a one–quarter cup follow a
Poisson distribution. The Poisson assumption was used to determine
probabilities that observations will fall into one of four categories. We
can use these values to formulate a null hypothesis, H𝑜 , in terms of the
the predicted probabilities; we write

H𝑜 : 𝑝1 = 0.2343, 𝑝2 = 0.2479, 𝑝3 = 0.2487, 𝑝4 = 0.2691.

Based on probabilities in H𝑜 , we compute the expected counts in each


categories
𝐸𝑖 = 𝑛𝑝𝑖 for 𝑖 = 1, 2, 3, 4.
Remark 3.2.1 (Why were the categories chosen the way we chose them?).
Pearson’s Theorem 3.1.4 gives an approximate distribution for the Chi–
Square statistic in (3.5) for large values of 𝑛. A rule of thumb to justify
3.2. THE LANGUAGE AND LOGIC OF HYPOTHESIS TESTS 51

the use the Chi–Square approximation to distribution of the Chi–Square


statistic, 𝑄, is to make sure that the expected count in each category is 5
or more. That is why we divided the range of counts in Table 1.1 on page
6 into the four categories shown in Table 3.1 on page 46.

∙ Compute a Test Statistic. In the example of the previous section,


we computed the Pearson Chi–Square statistic, 𝑄, ˆ which measures how
far the observed counts, 𝑋1 , 𝑋2 , 𝑋3 , 𝑋4 , are from the expected counts,
𝐸1 , 𝐸2 , 𝐸3 , 𝐸4 :
4
∑ (𝑋𝑖 − 𝐸𝑖 )2
𝑄ˆ= .
𝑖=1
𝐸𝑖

According to Pearson’s Theorem, the random variable 𝑄 given by (3.5);


namely,
4
∑ (𝑋𝑖 − 𝑛𝑝𝑖 )2
𝑄=
𝑖=1
𝑛𝑝𝑖

has an approximate 𝜒2 (4 − 1 − 1) distribution in this case.


∙ Compute or approximate a 𝑝–value. A 𝑝–value for a test is the
probability that the test statistic will attain the computed value, or more
extreme ones, under the assumption that the null hypothesis is true. In
the example of the previous section, we used the fact that 𝑄 has an ap-
proximate 𝜒2 (2) distribution to compute

𝑝–value = P(𝑄 ⩾ 𝑄).


ˆ

∙ Make a decision. Either we reject or we don’t reject the null hypothesis.


The criterion for rejection is some threshold, 𝛼 with 0 < 𝛼 < 1, usually
some small probability, say 𝛼 = 0.01 or 𝛼 = 0.05.
We reject H𝑜 if 𝑝–value < 𝛼; otherwise we don’t reject H𝑜 .
We usually refer to 𝛼 as a level of significance for the test. If 𝑝–value < 𝛼
we say that we reject H𝑜 at the level of significance 𝛼.
In the example of the previous section

𝑝–value ≈ 0.0002 < 0.01;

Thus, we reject the Poisson model as an explanation of the distribution for


the counts of unpopped kernels in Table 1.1 on page 6 at the significance
level of 𝛼 = 1%.

Example 3.2.2 (Testing a binomial model). We have seen how to use a chi–
square goodness of fit test to determine that the Poisson model for the distribu-
tion of counts of unpopped kernels in Table 1.1 on page 6 is not supported by
the data in the table. A more appropriate model would be a binomial model.
In this case we have two unknown parameters: the mean number of kernels,
52 CHAPTER 3. HYPOTHESIS TESTING

𝑛, in one–quarter cup, and the probability, 𝑝, that a given kernel will not pop.
We have estimated 𝑛 independently using the data in Table 1.2 on page 8 to
be 𝑛ˆ = 342 according to the result in Example 2.3.13 on page 42. In order to
estimate 𝑝, we may use the average number of unppoped kernels in one–quarter
cup from the data in Table 1.1 on page 6 and then divide that number by the
estimated value of 𝑛 to obtain the estimate
56
𝑝ˆ = ≈ 0.1637.
342
Thus, in this example, we assume that the counts, 𝑋, of unpopped kernels in
one–quarter cup in Table 1.1 on page 6 follows the distribution

𝑋 ∼ binomial(ˆ
𝑛, 𝑝ˆ).

Category Counts 𝑝𝑖 Predicted Observed


(𝑖) Range Counts Counts
1 0 ⩽ 𝑋 ⩽ 50 0.2131 6 14
2 51 ⩽ 𝑋 ⩽ 55 0.2652 7 3
3 56 ⩽ 𝑋 ⩽ 60 0.2700 7 2
4 𝑋 ⩾ 61 0.2517 7 8

Table 3.2: Counts Predicted by the Binomial Model

Table 3.2 shows the probabilities predicted by the binomial hypothesis in


each of the categories that we used in the previous example in which we tested
the Poisson hypothesis. Observe that the binomial model predicts the same
expected counts as the Poisson model. We therefore get the same value for
the Pearson Chi–Square statistic, 𝑄 ˆ = 16.67. In this case the approximate,
2
asymptotic distribution of 𝑄 is 𝜒 (1) because we estimated two parameters, 𝑛
and 𝑝, to compute the 𝑝𝑖 s. Thus, the 𝑝–value in this case is approximated by

𝑝–value ≈ 4.45 × 10−5 ,

which is a very small probability. Thus, we reject the binomial hypothesis.


Hence the hypothesis that distribution of the counts of unpopped kernels follows
a binomial model is not supported by the data. Consequently, the interval
estimate for 𝑝 which we obtained in Example 2.2.2 on page 17 is not justified
since that interval was obtained under the assumption of a binomial model. We
therefore need to come up with another way to obtain an interval estimate for
𝑝.
At this point we need to re–evaluate the model and re–examine the assump-
tions that went into the choice of the Poisson and binomial distributions as
possible explanations for the distribution of counts in Table 1.1 on page 6. An
important assumption that goes into the derivations of both models is that of
3.2. THE LANGUAGE AND LOGIC OF HYPOTHESIS TESTS 53

independent trials. In this case, a trial consists of determining whether a given


kernel will pop or not. It was mentioned in Section 1.1.1 that a kernel in a hot–
air popper might not pop because it gets pushed out of the container because of
the popping of kernels in the neighborhood of the given kernel. Thus, the event
that a kernel will not pop will depend on whether a nearby kernel popped or
not, and no necessarily on some intrinsic property of the kernel. These consid-
erations are not consistent with the independence assumption required by bout
the Poisson and the binomial models. Thus, these models are not appropriate
for this situation.
How we proceed from this point on will depend on which question we want
to answer. If we want to know what the intrinsic probability of not popping
for a given kernel is, independent of the popping mechanism that is used, we
need to redesign the experiment so that the popping procedure that is used
will guarantee the independence of trials required by the binomial or Poisson
models. For example, a given number of kernels, 𝑛, might be laid out on flat
surface in a microwave oven.
If we want to know what the probability of not popping is for the hot–
air popper, we need to come up with another way to model the distribution.
This process is complicated by the fact that there are two mechanisms at work
that prevent a given from popping: an intrinsic mechanism depending on the
properties of a given kernel, and the swirling about of the kernels in the container
that makes it easy for the popping of a given kernel to cause other kernels to
be pushed out before they pop. Both of these mechanisms need to be modeled.

Example 3.2.3 (A test of normality). In this example we test whether the


counts of kernels in one–quarter cup shown in Table 1.2 on page 8 can be
assumed to come from a normal distribution. We first use the data in the table
to estimate 𝜇 and 𝜎 2 . Based on the calculations in Example 2.3.13 on page 42,
we get the following estimates

ˆ = 𝑋 𝑛 ≈ 342,
𝜇

and
ˆ = 𝑆 𝑛 ≈ 35.
𝜎
We therefore assume that

𝑁 ∼ normal(ˆ ˆ2 )
𝜇, 𝜎

and use the corresponding pdf to compute the probabilities that the counts will
lie in certain ranges.
Table 3.3 on page 54 shows those ranges and their corresponding probabili-
ties. Note that the ranges for the counts were chosen so that the expected count
for each category is 5. Table 3.3 shows also the predicted and observed counts
from which we get the value for the chi–square statistic, 𝑄, to be 𝑄 ˆ = 2/5.
In this case 𝑄 has an approximate 𝜒2 (1) asymptotic distribution, according to
54 CHAPTER 3. HYPOTHESIS TESTING

Category Counts 𝑝𝑖 Predicted Observed


(𝑖) Range Counts Counts
1 𝑁 ⩽ 319 0.2555 5 6
2 319 < 𝑁 ⩽ 342 0.2445 5 5
3 342 < 𝑁 ⩽ 365 0.2445 5 5
4 𝑁 > 365 0.2555 5 4

Table 3.3: Kernels in 1/4–cup Predicted by the Normal Model

Pearson’s Theorem, since we estimated two parameters, 𝜇 and 𝜎, based on the


data. We therefore obtain the approximate 𝑝–value
ˆ ≈ 0.5271.
𝑝–value = P(𝑄 ⩾ 𝑄)

Thus, based on the data, we cannot reject the null hypothesis that the counts
can be described as following a normal distribution. Hence, we were justified
in assuming a normal model when estimating the mean number of kernels in
one–quarter cup in Example 2.3.13 on page 42.

3.3 Hypothesis Tests in General


Hypothesis testing is a tool in statistical inference which provides a general
framework for rejecting certain hypothesis, known as the null hypothesis and
denoted by H𝑜 , against an alternative hypothesis, denoted by H1 . For in-
stance, in Example 3.2.3 on page 53 we tested the hypothesis that the counts
of kernel in a one–quarter cup, shown in Table 1.2 on page 8, follows a normal
distribution. In this case, denoting the counts of kernels by 𝑁 , we may state
the null and alternative hypotheses as

H𝑜 : 𝑁 is normaly distributed

and
H1 : 𝑁 is not normaly distributed.
Here is another example.
Example 3.3.1. We wish to determine whether a given coin is fair or not.
Thus, we test the null hypothesis
1
H𝑜 : 𝑝=
2
versus the alternative hypothesis
1
H1 : 𝑝 ∕= ,
2
where 𝑝 denotes the probability that a given toss of the coin will yield a head.
3.3. HYPOTHESIS TESTS IN GENERAL 55

In order to tests the hypotheses in Example 3.3.1, we may perform an ex-


periment which consists of flipping the coin 400 times. If we let 𝑌 denote the
number of heads that we observe, then H𝑜 may be stated as

H𝑜 : 𝑌 ∼ binomial(400, 0.5).

Notice that this hypothesis completely specifies the distribution of the random
variable 𝑌 , which is known as a test statistic. On the other hand, the hy-
pothesis in the goodness of fit test in Example 3.2.3 on page 53 does not specify
a distribution. H𝑜 in Example 3.2.3 simply states that the the count, 𝑁 , of
kernels in one–quarter cup follows a normal distribution, but it does not specify
the parameters 𝜇 and 𝜎 2 .
Definition 3.3.2 (Simple versus Composite Hypotheses). A hypothesis which
completely specifies a distribution is said to be a simple hypothesis. A hy-
pothesis which is not simple is said to be composite.
For example, the alternative hypothesis, H1 : 𝑝 ∕= 0.5, in Example 3.3.1 is
composite since the test statistic, 𝑌 , for that test is binomial(400, 𝑝) where 𝑝 is
any value between 0 and 1 which is not 0.5. Thus, H1 is a really a combination
of many hypotheses.
The decision to reject or not reject H𝑜 in a hypothesis test is based on a
set of observations, 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 ; these could be the outcomes of certain ex-
periment performed to test the hypothesis and are, therefore, random variables
with certain distribution. Given a set of of observations, 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 , a test
statistic, 𝑇 = 𝑇 (𝑋1 , 𝑋2 , . . . , 𝑋𝑛 ), may be formed. For instance, in Example
3.3.1 on page 54, the experiment might consist of flipping the coin 400 times
and determining the number of heads. If the null hypothesis in that test is true,
then the 400 observations are independent Bernoulli(0.5) trials. We can define
the test statistic for this test to be
400

𝑇 = 𝑋𝑖
𝑖=1

so that, in H𝑜 is true,
𝑇 ∼ binomial(400, 0.5).
A test statistic for a hypothesis test may be used to establish a criterion for
rejecting H𝑜 . For instance in the coin tossing Example 3.3.1, we can say that
we reject the hypothesis that the coin is fair if

∣𝑇 − 200∣ > 𝑐; (3.6)

that is, the distance from the statistic 𝑇 to the mean of the assumed distribution
is at least certain critical value, 𝑐. The condition in (3.6) constitutes a decision
criterion for rejection of H𝑜 . It the null hypothesis is true and the observed
value, 𝑇ˆ, of the test statistic, 𝑇 , falls within the range specified by the rejection
criterion in (3.6), we mistakenly reject H𝑜 when it is in fact true. This is known
56 CHAPTER 3. HYPOTHESIS TESTING

as a Type I error. The probability of committing a Type I error in the coin


tossing example is
P (∣𝑇 − 200∣ > 𝑐 ∣ H𝑜 true) .
Definition 3.3.3 (Significance Level). The largest probability of making a type
I error is denoted by 𝛼 and is called the significance level of the test.
Example 3.3.4. In the coin tossing example (Example 3.3.1), we can set a given
significance level, 𝛼, as follows. Since the number of tosses is large, 𝑛 = 400, we
can use the central limit theorem to get that
( )
∣𝑇 − 200∣ 𝑐
P (∣𝑇 − 200∣ > 𝑐 ∣ H𝑜 true) = P √ >
400 ⋅ (0.5)(1 − 0.5) 10
( 𝑐 )
≈ P ∣𝑍∣ > ,
10
where 𝑍 ∼ normal(0, 1). It then follows that, if we set
𝑐
= 𝑧𝛼/2 ,
10
where 𝑧𝛼/2 is such that P(∣𝑍∣ > 𝑧𝛼/2 ) = 𝛼, we obtain that 𝑐 = 10𝑧𝛼/2 . Hence,
the rejection criterion
∣𝑇 − 200∣ > 10𝑧𝛼/2
yields a test with a significance level 𝛼. For example, if 𝛼 = 0.05, then we get
that 𝑧𝛼/2 = 1.96 and therefore 𝑐 = 19.6 ≈ 20. Hence, the test that rejects H𝑜 if

𝑇 < 180 or 𝑇 > 220

has a significance level of 𝛼 = 0.05.

If the null hypothesis, H𝑜 , is in fact false, but the hypothesis test does not
yield the rejection of H𝑜 , then a type II error is made. The probability of a type
II error is denoted by 𝛽.
In general, a hypothesis test is concerned with the question of whether a
parameter, 𝜃, from certain underlying distribution is in a certain range or not.
Suppose the underlying distribution has pdf or pmf denoted by 𝑓 (𝑥 ∣ 𝜃), where
we have explicitly expressed the dependence of the distribution function on the
parameter 𝜃; for instance, in Example 3.3.1, the underlying distribution is

𝑓 (𝑥 ∣ 𝑝) = 𝑝𝑥 (1 − 𝑝)1−𝑥 , for 𝑥 = 0 or 𝑥 = 1,

and 0 otherwise. The parameter 𝜃 in this case is 𝑝, the probability of a success


in a Bernoulli(𝑝) trial.
In the general setting, the null and alternative hypothesis are statements of
the form
H 𝑜 : 𝜃 ∈ Ω𝑜
3.3. HYPOTHESIS TESTS IN GENERAL 57

and
H1 : 𝜃 ∈ Ω1
where Ω𝑜 and Ω1 are complementary subsets of a parameter space Ω = Ω𝑜 ∪ Ω1 ,
where Ω𝑜 ∩ Ω1 = ∅. In Example 3.3.1, we have that Ω𝑜 = {0.5} and

Ω1 = {𝑝 ∈ [0, 1] ∣ 𝑝 ∕= 0.5}.

Given a set of observations, 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 , which may be assumed to be


iid random variables with distribution 𝑓 (𝑥 ∣ 𝜃), we denote the set all possible
values of the 𝑛–tuple (𝑋1 , 𝑋2 , . . . , 𝑋𝑛 ) by 𝒟. Consider a statistic

𝑇 = 𝑇 (𝑋1 , 𝑋2 , . . . , 𝑋𝑛 ).

A rejection region, 𝑅, for a test is defined by

𝑅 = {(𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) ∈ 𝒟 ∣ 𝑇 (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) ∈ 𝐴}, (3.7)

where 𝐴 is a subset of the real line. For example, in the coin tossing example,
we had the rejection region

𝑅 = {(𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) ∈ 𝒟 ∣ ∣𝑇 (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) − 𝐸(𝑇 )∣ > 𝑐},

or
𝑅 = {(𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) ∈ 𝒟 ∣ ∣𝑇 (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) − 𝑛𝑝∣ > 𝑐},
since, this this case, 𝑇 ∼ binomial(𝑛, 𝑝), where 𝑛 = 400, and 𝑝 depends on
which hypothesis we are assuming to be true. Thus, in this case, the set 𝐴 in
the definition of the rejection region in (3.7) is

𝐴 = (−∞, 𝑛𝑝 − 𝑐) ∪ (𝑛𝑝 + 𝑐, ∞).

Given a rejection region, 𝑅, for a test of hypotheses

H𝑜 : 𝜃 ∈ Ω𝑜

and
H1 : 𝜃 ∈ Ω1 ,
let
P𝜃 ((𝑥1 , 𝑥1 , . . . , 𝑥𝑛 ) ∈ 𝑅)
denote the probability that the observation values fall in the rejection region
under the assumption that the random variables 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 are iid with
distribution 𝑓 (𝑥 ∣ 𝜃). Thus,

max P𝜃 ((𝑥1 , 𝑥1 , . . . , 𝑥𝑛 ) ∈ 𝑅)
𝜃∈Ω𝑜

is the largest probability that H𝑜 will be rejected given that H𝑜 ; this is the
significance level for the test; that is,

𝛼 = max P𝜃 ((𝑥1 , 𝑥1 , . . . , 𝑥𝑛 ) ∈ 𝑅).


𝜃∈Ω𝑜
58 CHAPTER 3. HYPOTHESIS TESTING

In Example 3.3.1 on page 54, Ω𝑜 = {0.5}; thus,

𝛼 = P0.5 ((𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) ∈ 𝑅) ,

where

𝑅 = {(𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) ∣ 𝑇 (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) < 𝑛𝑝−𝑐, or 𝑇 (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) > 𝑛𝑝+𝑐}.

By the same token, for 𝜃 ∈ Ω1 ,

P𝜃 ((𝑥1 , 𝑥1 , . . . , 𝑥𝑛 ) ∈ 𝑅)

gives the probability of rejecting the null hypothesis when H𝑜 is false. It then
follows that the probability of a Type II error, for the case in which 𝜃 ∈ Ω1 , is

𝛽(𝜃) = 1 − P𝜃 ((𝑥1 , 𝑥1 , . . . , 𝑥𝑛 ) ∈ 𝑅);

this is the probability of not rejecting the null hypothesis when H𝑜 is in fact
false.
Definition 3.3.5 (Power of a Test). For 𝜃 ∈ Ω1 , the function

P𝜃 ((𝑥1 , 𝑥1 , . . . , 𝑥𝑛 ) ∈ 𝑅))

is called the power function for the test at 𝜃; that is, P𝜃 ((𝑥1 , 𝑥1 , . . . , 𝑥𝑛 ) ∈ 𝑅))
is the probability of rejecting the null hypothesis when it is in fact false. We
will use the notation

𝛾(𝜃) = P𝜃 ((𝑥1 , 𝑥1 , . . . , 𝑥𝑛 ) ∈ 𝑅)) for 𝜃 ∈ Ω1 .

Example 3.3.6. In Example 3.3.1 on page 54, consider the rejection region

𝑅 = {(𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) ∈ 𝒟 ∣ ∣𝑇 (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) − 200∣ > 20}

where
𝑛

𝑇 (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) = 𝑥𝑗 ,
𝑗=1

for 𝑛 = 400, in the test of H𝑜 : 𝑝 = 0.5 against H1 : 𝑝 ∕= 0.5. The significance


level for this test is
𝛼 = P0.5 (∣𝑇 − 200∣ > 20) ,
where
𝑇 ∼ binomial(400, 0.5).
Thus,
𝛼 = 1 − P (∣𝑇 − 200∣ ⩽ 20)

= 1 − P (180 ⩽ 𝑇 ⩽ 220)

= 1 − P (179.5 < 𝑇 ⩽ 220.5) ,


3.3. HYPOTHESIS TESTS IN GENERAL 59

where we have used the continuity correction, since we are going to be applying
the Central Limit Theorem to approximate a discrete distribution; namely, 𝑇
has an approximate normal(200, 100) distribution in this case, since 𝑛 = 400 is
large. We then have that

𝛼 ≈ 1 − P (179.5 < 𝑌 ⩽ 220.5) ,

where 𝑌 ∼ normal(200, 100). Consequently,

𝛼 ≈ 1 − (𝐹𝑌 (220.5) − 𝐹𝑌 (179.5)),

where 𝐹𝑌 is cdf of 𝑌 ∼ normal(200, 100). We then have that

𝛼 ≈ 0.0404.

We next compute the power function for this test:

𝛾(𝑝) = P(∣𝑇 − 200∣ > 20),

where
1
𝑇 ∼ binomial(400, 𝑝) for 𝑝 ∕= .
2
We write
𝛾(𝑝) = P(∣𝑇 − 200∣ > 20)

= 1 − P(∣𝑇 − 200∣ ⩽ 20)

= 1 − P(180 ⩽ 𝑇 ⩽ 220)

= 1 − P(179.5 < 𝑇 ⩽ 220.5),


where we have used again the continuity correction, since we are going to be
applying the Central Limit Theorem to approximate the distribution of 𝑇 by
that of a normal(400𝑝, 400𝑝(1 − 𝑝)) random variable. We then have that

𝛾(𝑝) ≈ 1 − P(179.5 < 𝑌𝑝 ⩽ 220.5),

where 𝑌𝑝 denotes a normal(400𝑝, 400𝑝(1 − 𝑝)) random variable. Thus,

𝛾(𝑝) ≈ 1 − (𝐹𝑌𝑝 (220.5) − 𝐹𝑌𝑝 (179.5)), (3.8)

where 𝐹𝑌𝑝 denotes the cdf of 𝑌𝑝 ∼ normal(400𝑝, 400𝑝(1 − 𝑝)).


Table 3.4 on page 60 shows a few values of 𝑝 and their corresponding ap-
proximate values of 𝛾(𝑝) according to (3.8). A sketch of the graph of 𝛾 as a
function of 𝑝 is shown in Figure 3.3.3 on page 61.
The sketch in Figure 3.3.3 was obtained using the plot function in R by
typing

plot(p,gammap,type=’l’,ylab="Power at p")
60 CHAPTER 3. HYPOTHESIS TESTING

𝑝 𝛾(𝑝)
0.10 1.0000
0.20 1.0000
0.30 1.0000
0.40 0.9767
0.43 0.7756
0.44 0.6378
0.45 0.4800
0.46 0.3260
0.47 0.1978
0.48 0.1076
0.49 0.0566
0.50 0.0404
0.51 0.0566
0.52 0.1076
0.53 0.1978
0.54 0.3260
0.55 0.4800
0.56 0.6378
0.57 0.7756
0.60 0.9767
0.70 1.0000
0.80 1.0000
0.90 1.0000

Table 3.4: Table of values of 𝑝 and 𝛾(𝑝)


3.4. LIKELIHOOD RATIO TEST 61

1.0
0.8
0.6
Power at p

0.4
0.2

0.0 0.2 0.4 0.6 0.8 1.0

Figure 3.3.3: Sketch of graph of power function for test in Example 3.3.6

where p and gammap are arrays where values of 𝑝 and 𝛾(𝑝) were stored. These
were obtained using the commands:

p <- seq(0.01,0.99,by=0.01)

and

gammap <- 1-(pnorm(220.5,400*p,sqrt(400*p*(1-p)))


-pnorm(179.5,400*p,sqrt(400*p*(1-p))))

Observe that the sketch of the power function in Figure 3.3.3 on page 61
suggests that 𝛾(𝑝) tends to 1 as either 𝑝 → 0 or 𝑝 → 1, and that 𝛾(𝑝) → 𝛼 as
𝑝 → 0.5.

3.4 Likelihood Ratio Test


Likelihood ratio tests provide a general way of obtaining a test statistic, Λ,
called a likelihood ratio statistic, and a rejection criterion of the form

Λ ⩽ 𝑐,

for some critical value 𝑐, for the test of the hypothesis

H𝑜 : 𝜃 ∈ Ω𝑜
62 CHAPTER 3. HYPOTHESIS TESTING

versus the alternative


H1 : 𝜃 ∈ Ω1 ,
based on a random sample, 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 , coming a distribution with distri-
bution function 𝑓 (𝑥 ∣ 𝜃). Here 𝑓 (𝑥 ∣ 𝜃) represents a pdf or a pmf, Ω = Ω𝑜 ∪ Ω1
is the parameter space, and Ω𝑜 ∩ Ω1 = ∅.
Before we define the likelihood ratio statistic, Λ, we need to define the con-
cept of a likelihood function.
Definition 3.4.1 (Likelihood Function). Given a random sample, 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 ,
from a distribution with distribution function 𝑓 (𝑥 ∣ 𝜃), either a pdf or a pmf,
where 𝜃 is some unknown parameter (either a scalar or a vector parameter), the
joint distribution of the sample is given by
𝑓 (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ∣ 𝜃) = 𝑓 (𝑥1 ∣ 𝜃) ⋅ 𝑓 (𝑥2 ∣ 𝜃) ⋅ ⋅ ⋅ 𝑓 (𝑥𝑛 ∣ 𝜃),
by the independence condition in the definition of a random sample. If the
random variables, 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 , are discrete, 𝑓 (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ∣ 𝜃) gives the
probability of observing the values
𝑋1 = 𝑥1 , 𝑋2 = 𝑥2 , . . . , 𝑋𝑛 = 𝑥𝑛 ,
under the assumption that the sample is taken from certain distribution with
parameter 𝜃. We can also interpret 𝑓 (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ∣ 𝜃𝑜 ) as measuring the
likelihood that the parameter 𝜃 will take on the value 𝜃𝑜 given that we have
observed the values 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 in the sample. Thus, we call
𝑓 (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ∣ 𝜃)
the likelihood function for the parameter 𝜃 given the observations
𝑋1 = 𝑥1 , 𝑋2 = 𝑥2 , . . . , 𝑋𝑛 = 𝑥𝑛 ,
and denote it by 𝐿(𝜃 ∣ 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ); that is,
𝐿(𝜃 ∣ 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) = 𝑓 (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ∣ 𝜃).
Example 3.4.2 (Likelihood function for independent Bernoulli(𝑝) trials). Let
𝑋1 , 𝑋2 , . . . , 𝑋𝑛 be a random sample from a Bernoulli(𝑝). Thus, the underlying
distribution in this case is
𝑓 (𝑥 ∣ 𝑝) = 𝑝𝑥 (1 − 𝑝)1−𝑥 for 𝑥 = 0, 1 and 0 otherwise,
where 0 < 𝑝 < 1.
We then get that the likelihood function for 𝑝, based on the sample obser-
vations, is
𝐿(𝑝 ∣ 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) = 𝑝𝑥1 (1 − 𝑝)1−𝑥1 ⋅ 𝑝𝑥2 (1 − 𝑝)1−𝑥2 ⋅ ⋅ ⋅ 𝑝𝑥𝑛 (1 − 𝑝)1−𝑥𝑛

= 𝑝𝑦 (1 − 𝑝)𝑛−𝑦
𝑛

where 𝑦 = 𝑥𝑖 .
𝑖=1
3.4. LIKELIHOOD RATIO TEST 63

Definition 3.4.3 (Likelihood Ratio Statistic). For a general hypothesis test of

H 𝑜 : 𝜃 ∈ Ω𝑜

against the alternative


H 1 : 𝜃 ∈ Ω1 ,
based on a random sample, 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 , from a distribution with distribution
function 𝑓 (𝑥 ∣ 𝜃), the likelihood ratio statistic, Λ(𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ), is defined
by
sup 𝐿(𝜃 ∣ 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 )
𝜃∈Ω𝑜
Λ(𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) = ,
sup 𝐿(𝜃 ∣ 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 )
𝜃∈Ω

where Ω = Ω𝑜 ∪ Ω1 with Ω𝑜 ∩ Ω1 = ∅.

Example 3.4.4 (Simple hypotheses for Bernoulli(p) trials). Consider the test
of
H𝑜 : 𝑝 = 𝑝 𝑜
versus
H1 : 𝑝 = 𝑝1 ,
where 𝑝1 ∕= 𝑝𝑜 , based on random sample of size 𝑛 from a Bernoulli(𝑝) distribu-
tion, for 0 < 𝑝 < 1. The likelihood ratio statistic for this test is

𝐿(𝑝𝑜 ∣ 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 )
Λ(𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) = ,
max{𝐿(𝑝𝑜 ∣ 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ), 𝐿(𝑝1 ∣ 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 )}

since, for this case, Ω𝑜 = {𝑝𝑜 } and Ω = {𝑝𝑜 , 𝑝1 }; thus,

𝑝𝑦𝑜 (1 − 𝑝𝑜 )𝑛−𝑦
Λ(𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) = ,
max{𝑝𝑦𝑜 (1 − 𝑝𝑜 )𝑛−𝑦 , 𝑝𝑦1 (1 − 𝑝1 )𝑛−𝑦 }
𝑛

where 𝑦 = 𝑥𝑖 .
𝑖=1

Definition 3.4.5 (Likelihood Ratio Test). We can use the likelihood ratio
statistic, Λ(𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ), to define the rejection region

𝑅 = {(𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) ∣ Λ(𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) ⩽ 𝑐},

for some critical value 𝑐 with 0 < 𝑐 < 1. This defines a likelihood ratio test
(LRT) for H𝑜 against H1 .
The rationale for this definition is that, if the likelihood ratio of the sample
is very small, the evidence provided by the sample in favor of the null hypothesis
is not strong in comparison with the evidence for the alternative. Thus, in this
case it makes sense to reject H𝑜 .
64 CHAPTER 3. HYPOTHESIS TESTING

Example 3.4.6. Find the likelihood ratio test for

H𝑜 : 𝑝 = 𝑝𝑜

versus
H1 : 𝑝 = 𝑝1 ,
for 𝑝𝑜 ∕= 𝑝1 , based on a random sample 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 from a Bernoulli(𝑝)
distribution for 0 < 𝑝 < 1.

Solution: The rejection region for the likelihood ratio test is given
by
𝑅 : Λ(𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) ⩽ 𝑐,
for 0 < 𝑐 < 1, where

𝑝𝑦𝑜 (1 − 𝑝𝑜 )𝑛−𝑦
Λ(𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) = ,
max{𝑝𝑦𝑜 (1 − 𝑝𝑜 )𝑛−𝑦 , 𝑝𝑦1 (1 − 𝑝1 )𝑛−𝑦 }

with
𝑛

𝑦= 𝑥𝑖 .
𝑖=1

Thus, for 𝑅 to be defined, we must have that

𝑝𝑦𝑜 (1 − 𝑝𝑜 )𝑛−𝑦
Λ(𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) = ;
𝑝𝑦1 (1 − 𝑝1 )𝑛−𝑦

otherwise Λ(𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) would be 1, and so we wouldn’t be able


to get the condition Λ(𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) ⩽ 𝑐 to hold since 𝑐 < 1. Thus,
the LRT for this example will reject H𝑜 if
( )𝑦 ( )𝑛−𝑦
𝑝𝑜 1 − 𝑝𝑜
⩽ 𝑐,
𝑝1 1 − 𝑝1
or ( )𝑛 ( )𝑦
1 − 𝑝𝑜 𝑝𝑜 (1 − 𝑝1 )
⩽ 𝑐.
1 − 𝑝1 𝑝1 (1 − 𝑝𝑜 )
Write
1 − 𝑝𝑜 𝑝𝑜 (1 − 𝑝1 )
𝑎= and 𝑟 = .
1 − 𝑝1 𝑝1 (1 − 𝑝𝑜 )
Then, the LRT rejection region is defined by

𝑎𝑛 𝑟𝑦 ⩽ 𝑐, (3.9)

where
𝑛

𝑦= 𝑥𝑖 .
𝑖=1
3.4. LIKELIHOOD RATIO TEST 65

We consider two cases:


Case 1: 𝑝1 > 𝑝𝑜 . In this case, 𝑎 > 1 and 𝑟 < 1. Thus, taking the
natural logarithm on both sides of (3.9) and solving for 𝑦 we get that
the rejection region for the LRT in this example is equivalent to

ln 𝑐 − 𝑛 ln 𝑎
𝑦⩾ .
ln 𝑟
In other words, the LRT will reject H𝑜 if

𝑌 ⩾ 𝑏,

ln (𝑐/𝑎𝑛 )
where 𝑏 = > 0, and 𝑌 is the statistic
ln 𝑟
𝑛

𝑌 = 𝑋𝑖 ,
𝑖=1

which counts the number of successes in the sample.


Case 2: 𝑝1 < 𝑝𝑜 . In this case, 𝑎 < 1 and 𝑟 > 1. We then get from
(3.9) the LRT in this example rejects H𝑜 if

ln 𝑐 − 𝑛 ln 𝑎
𝑦⩽ .
ln 𝑟
In other words, the LRT will reject H𝑜 if

𝑌 ⩽ 𝑑,

ln 𝑐 − 𝑛 ln 𝑎
where 𝑑 = can be made to be positive by choosing
ln 𝑟
ln 𝑐
𝑛> , and 𝑌 is again the number of successes in the sample. □
ln 𝑎
We next consider the example in which we test

H𝑜 : 𝑝 = 𝑝𝑜

versus
H1 : 𝑝 ∕= 𝑝0
based on a random sample 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 from a Bernoulli(𝑝) distribution for
0 < 𝑝 < 1. We would like to find the LRT rejection region for this test of
hypotheses.
In this case the likelihood ratio statistic is
𝐿(𝑝𝑜 ∣ 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 )
Λ(𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) = , (3.10)
sup 𝐿(𝑝 ∣ 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 )
1<𝑝<1
66 CHAPTER 3. HYPOTHESIS TESTING

𝑛

𝑦 𝑛−𝑦
where 𝐿(𝑝 ∣ 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) = 𝑝 (1 − 𝑝) for 𝑦 = 𝑥𝑖 .
𝑖=1
In order to determine the denominator in the likelihood ratio in (3.10), we
need to maximize the function 𝐿(𝑝 ∣ 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) over 0 < 𝑝 < 1. We can do
this by maximizing the natural logarithm of the likelihood function,

ℓ(𝑝) = ln(𝐿(𝑝 ∣ 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 )), 0 < 𝑝 < 1,

since ln : (0, ∞) → ℝ is a strictly increasing function. Thus, we need to maximize

ℓ(𝑝) = 𝑦 ln 𝑝 + (𝑛 − 𝑦) ln(1 − 𝑝) over 0 < 𝑝 < 1.

In order to do this, we compute the derivatives


𝑦 𝑛−𝑦
ℓ′ (𝑝) = − ,
𝑝 1−𝑝
and
𝑦 𝑛−𝑦
ℓ′′ (𝑝) = − − ,
𝑝2 (1 − 𝑝)2
and observe that ℓ′′ (𝑝) < 0 for all 1 < 𝑝 < 1. Thus, a critical point of ℓ; that is,
a solution of ℓ′ (𝑝) = 0, will yield a maximum for the function ℓ(𝑝).
Solving for 𝑝 in ℓ′ (𝑝) = 0 yields the critical point
1
𝑝ˆ = 𝑦,
𝑛
which is the sample proportion of successes. This is an example of a maximum
likelihood estimator (MLE) for 𝑝. In general, we have the following definition.
Definition 3.4.7 (Maximum Likelihood Estimator). Let 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 be a
random sample from a distribution with distribution function 𝑓 (𝑥 ∣ 𝜃), for 𝜃 in
some parameter space Ω. A value, 𝜃,
ˆ of the parameter 𝜃 such that

𝐿(𝜃ˆ ∣ 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) = sup 𝐿(𝜃 ∣ 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 )


𝜃∈Ω

is called a maximum likelihood estimator for 𝜃, or an MLE for 𝜃.


We therefore have that the likelihood ratio statistic for the test of H𝑜 : 𝑝 = 𝑝𝑜
versus H1 : 𝑝 ∕= 𝑝𝑜 , based on a random sample of size 𝑛 from a Bernoulli(𝑝)
distribution, is
𝑝𝑦 (1 − 𝑝𝑜 )𝑛−𝑦
Λ(𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) = 𝑜𝑦 ,
𝑝ˆ (1 − 𝑝ˆ)𝑛−𝑦
where
𝑛

𝑦= 𝑥𝑖
𝑖=1

and
1
𝑝ˆ = 𝑦
𝑛
3.4. LIKELIHOOD RATIO TEST 67

is the MLE for 𝑝 based on the random sample.


Write the likelihood ratio statistic as
( )𝑦 ( )𝑛−𝑦
𝑝𝑜 1 − 𝑝𝑜
Λ(𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) =
𝑝ˆ 1 − 𝑝ˆ
⎛ ⎞𝑛−𝑦
1
⎜ 𝑝𝑜 − 1 ⎟
( )𝑦
𝑝𝑜
= ⎜
⎝ 1
⎟ ,
𝑝ˆ 𝑝ˆ ⎠

𝑝𝑜 𝑝𝑜
𝑝ˆ
and set 𝑡 = . Then, Λ can be written as a function of 𝑡 as follows
𝑝𝑜
( )𝑛−𝑛𝑝𝑜 𝑡
1 1 − 𝑝𝑜 1
Λ(𝑡) = 𝑛𝑝𝑜 𝑡 ⋅ , for 0 ⩽ 𝑡 ⩽ ,
𝑡 1 − 𝑝𝑜 𝑡 𝑝𝑜
1
where we have used the fact that 𝑝ˆ = 𝑦 so that 𝑦 = 𝑛𝑝𝑜 𝑡.
𝑛
1
We now proceed to sketch the graph of Λ as a function of 𝑡 for 0 ⩽ 𝑡 ⩽ .
𝑝𝑜
First note that Λ(𝑡) attains its maximum value of 1 when 𝑡 = 1; namely,
when 𝑝ˆ = 𝑝𝑜 . That 𝑡 = 1 is the only value of 𝑡 at which the maximum for Λ(𝑡)
is attained can be verified by showing that
( )
1 − 𝑝𝑜 𝑡
ℎ(𝑡) = ln(Λ(𝑡)) = −𝑛𝑝𝑜 𝑡 ln 𝑡 − (𝑛 − 𝑛𝑝𝑜 𝑡) ln
1 − 𝑝𝑜
attains its maximum solely at 𝑡 = 1. Computing the derivative of ℎ with respect
to 𝑡 we find that
( )
′ 1 − 𝑝𝑜 𝑡
ℎ (𝑡) = −𝑛𝑝𝑜 ln 𝑡 + 𝑛𝑝𝑜 ln
1 − 𝑝𝑜
( )
1 − 𝑝𝑜 𝑡
= 𝑛𝑝𝑜 ln .
𝑡(1 − 𝑝𝑜 )
Thus, ℎ′ (𝑡) = 0 if and only if
1 − 𝑝𝑜 𝑡
= 1,
𝑡(1 − 𝑝𝑜 )
which implies that 𝑡 = 1 is the only critical point of ℎ. The fact that 𝑡 = 1
yields a maximum for ℎ can be seen by observing that the second derivative of
ℎ with respect to 𝑡,
𝑛𝑝𝑜 𝑛𝑝2𝑜
ℎ′′ (𝑡) = − − ,
𝑡 1 − 𝑝𝑜 𝑡
1
is negative for 0 < 𝑡 < .
𝑝𝑜
68 CHAPTER 3. HYPOTHESIS TESTING

Observe also that lim+ ℎ(𝑡) = ln[(1 − 𝑝𝑜 )𝑛 ] and lim ℎ(𝑡) = ln[𝑝𝑛𝑜 ], so
𝑡→0 𝑡→(1/𝑝𝑜 )−
that
Λ(0) = (1 − 𝑝𝑜 )𝑛 and Λ(1/𝑝𝑜 ) = 𝑝𝑛𝑜 .
Putting all the information about the graph for Λ(𝑡) together we obtain a sketch
as the one shown in Figure 3.4.4,where we have sketched the case 𝑝𝑜 = 1/4 and
𝑛 = 20 for 0 ⩽ 𝑡 ⩽ 4. The sketch in Figure 3.4.4 suggests that, given any

1.0
0.8
0.6
Lmabda(t)

0.4
0.2
0.0

0 1 2 3 4

Figure 3.4.4: Sketch of graph of Λ(𝑡) for 𝑝𝑜 = 1/4, 𝑛 = 20, and 0 ⩽ 𝑡 ⩽ 4

positive value of 𝑐 such that 𝑐 < 1 and 𝑐 > max{𝑝𝑛𝑜 , (1 − 𝑝𝑜 )𝑛 }, there exist
positive values 𝑡1 and 𝑡2 such that 0 < 𝑡1 < 1 < 𝑡2 < 1/𝑝𝑜 and

Λ(𝑡) = 𝑐 for 𝑡 = 𝑡1 , 𝑡2 .

Furthermore,
Λ(𝑡) ⩽ 𝑐 for 𝑡 ⩽ 𝑡1 or 𝑡 ⩾ 𝑡2 .
Thus, the LRT rejection region for the test of H𝑜 : 𝑝 = 𝑝𝑜 versus H1 : 𝑝 ∕= 𝑝𝑜 is
equivalent to
𝑝ˆ 𝑝ˆ
⩽ 𝑡1 or ⩾ 𝑡2 ,
𝑝𝑜 𝑝𝑜
𝑛

which we could rephrase in terms of 𝑌 = 𝑋𝑖 as
𝑖=1

𝑅: 𝑌 ⩽ 𝑡1 𝑛𝑝𝑜 or 𝑌 ⩾ 𝑡2 𝑛𝑝𝑜 ,
3.4. LIKELIHOOD RATIO TEST 69

for some 𝑡1 and 𝑡2 with 0 < 𝑡1 < 1 < 𝑡2 . This rejection region can also be
phrased as
𝑅 : 𝑌 < 𝑛𝑝𝑜 − 𝑏 or 𝑌 > 𝑛𝑝𝑜 + 𝑏,
for some 𝑏 > 0. The value of 𝑏 will then be determined by the significance level
that we want to impose on the test.
Example 3.4.8 (Likelihood ratio test based on a sample from a normal distri-
bution). We wish to test the hypothesis

H𝑜 : 𝜇 = 𝜇𝑜 , 𝜎 2 > 0

versus the alternative


H1 : 𝜇 ∕= 𝜇𝑜 , 𝜎 2 > 0,
based on a random sample, 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 , from a normal(𝜇, 𝜎 2 ) distribution.
Observe that both H𝑜 and H1 are composite hypotheses.
The likelihood function in this case is
2 2 2 2 2 2
𝑒−(𝑥1 −𝜇) /2𝜎 𝑒−(𝑥2 −𝜇) /2𝜎 𝑒−(𝑥𝑛 −𝜇) /2𝜎
𝐿(𝜇, 𝜎 ∣ 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) = √ ⋅ √ ⋅⋅⋅ √
2𝜋 𝜎 2𝜋 𝜎 2𝜋 𝜎
∑𝑛 2 2
𝑒− 𝑖=1 (𝑥𝑖 −𝜇) /2𝜎
= .
(2𝜋)𝑛/2 𝜎 𝑛
The likelihood ratio statistic is
sup 𝐿(𝜇𝑜 , 𝜎 ∣ 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 )
𝜎>0
Λ(𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) = , (3.11)
𝐿(ˆ ˆ ∣ 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 )
𝜇, 𝜎
where 𝜇ˆ is the MLE for 𝜇 and 𝜎ˆ2 is the MLE for 𝜎 2 . To find these MLEs, we
need to maximize the natural logarithm of the likelihood function:
𝑛
1 ∑ 𝑛
ℓ(𝜇, 𝜎 ∣ 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) = − (𝑥𝑖 − 𝜇)2 − 𝑛 ln 𝜎 − ln(2𝜋).
2𝜎 2 𝑖=1 2

We therefore need to look at the first partial derivatives


𝑛
∂ℓ 1 ∑ 𝑛
(𝜇, 𝜎) = (𝑥𝑖 − 𝜇) = 2 (𝑥 − 𝜇)
∂𝜇 𝜎 2 𝑖=1 𝜎

𝑛
∂ℓ 1 ∑ 𝑛
(𝜇, 𝜎) = (𝑥𝑖 − 𝜇)2 − ,
∂𝜎 𝜎 3 𝑖=1 𝜎
𝑛
1∑
where 𝑥 = 𝑥𝑖 , and the second partial derivatives
𝑛 𝑖=1

∂2ℓ 𝑛
(𝜇, 𝜎) = − 2 ,
∂𝜇2 𝜎
70 CHAPTER 3. HYPOTHESIS TESTING

∂2ℓ ∂2ℓ 2𝑛
(𝜇, 𝜎) = (𝜇, 𝜎) = − 3 (𝑥 − 𝜇),
∂𝜎∂𝜇 ∂𝜇∂𝜎 𝜎
and
𝑛
∂2ℓ 3 ∑ 𝑛
(𝜇, 𝜎) = − 4 (𝑥𝑖 − 𝜇)2 + 2 ,
∂𝜎 2 𝜎 𝑖=1 𝜎

The critical points of ℓ(𝜇, 𝜎) are solutions to the system

∂ℓ

⎨ ∂𝜇 (𝜇, 𝜎) = 0


⎩ ∂ℓ (𝜇, 𝜎)


= 0,

∂𝜎
which yields
𝜇
ˆ = 𝑥,
𝑛
1∑
ˆ2
𝜎 = (𝑥𝑖 − 𝑥)2 .
𝑛 𝑖=1

To see that ℓ(𝜇, 𝜎) is maximized at these values, look at the Hessian matrix,
⎛ 2
∂2ℓ

∂ ℓ
⎜ ∂𝜇2 (𝜇, 𝜎) (𝜇, 𝜎)
⎜ ∂𝜎∂𝜇 ⎟

⎜ ⎟,
⎜ 2 2

⎝ ∂ ℓ ∂ ℓ ⎠

(𝜇, 𝜎) 2
(𝜇, 𝜎)
∂𝜇 𝜎 ∂𝜎

at (ˆ
𝜇, 𝜎
ˆ) to get
⎛ 𝑛 ⎞
− 0
ˆ2
⎜ 𝜎 ⎟
⎜ ⎟,
⎝ 2𝑛 ⎠
0 −
ˆ2
𝜎
which has negative eigenvalues. It then follows that ℓ(𝜇, 𝜎) is maximized at

𝜇, 𝜎
ˆ). Hence, 𝑥 is the MLE for 𝜇 and
𝑛
1∑
ˆ2 =
𝜎 (𝑥𝑖 − 𝑥)2
𝑛 𝑖=1

is the MLE for 𝜎 2 . Observe that 𝜎


ˆ2 is not the sample variance, 𝑆𝑛2 . In fact,

𝑛−1 2
ˆ2 =
𝜎 𝑆𝑛 ,
𝑛
so that
𝑛−1 2
𝜎2 ) =
𝐸(ˆ 𝜎 ,
𝑛
3.4. LIKELIHOOD RATIO TEST 71

and so 𝜎 ˆ2 is not an unbiased estimator of 𝜎 2 . It is, however, the maximum


likelihood estimator of 𝜎 2 .
We then have that the denominator in the likelihood ratio in (3.11) is
∑𝑛 2
𝜎2
𝑒− 𝑖=1 (𝑥𝑖 −𝑥) /2ˆ 𝑒−𝑛/2
𝐿(ˆ ˆ ∣ 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) =
𝜇, 𝜎 = .
(2𝜋)𝑛/2 ˆ𝑛
𝜎 (2𝜋)𝑛/2 𝜎
ˆ𝑛

To compute the numerator in (3.11), we need to maximize

ℓ(𝜎) = ln(𝐿(𝜇𝑜 , 𝜎 ∣ 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ))
𝑛
1 ∑ 𝑛
= − 2
(𝑥𝑖 − 𝜇𝑜 )2 − 𝑛 ln 𝜎 − ln(2𝜋).
2𝜎 𝑖=1 2

Taking derivatives of ℓ we obtain


𝑛
1 ∑ 𝑛
ℓ′ (𝜎) = 3
(𝑥𝑖 − 𝜇𝑜 )2 −
𝜎 𝑖=1 𝜎

and
𝑛
3 ∑ 𝑛
ℓ′′ (𝜎) = − 4
(𝑥𝑖 − 𝜇𝑜 )2 + 2 .
𝜎 𝑖=1 𝜎

Thus, a critical point of ℓ(𝜎) is the value, 𝜎, of 𝜎 given by


𝑛
1∑
𝜎2 = (𝑥𝑖 − 𝜇𝑜 )2 .
𝑛 𝑖=1

Note that
2𝑛
ℓ′′ (𝜎) = −
< 0,
𝜎2
so that ℓ(𝜎) is maximized when 𝜎 = 𝜎. We then have that

sup 𝐿(𝜇𝑜 , 𝜎 ∣ 𝑥1 , 𝑥2 , . . . 𝑥𝑛 ) = 𝐿(𝜇𝑜 , 𝜎 ∣ 𝑥1 , 𝑥2 , . . . 𝑥𝑛 ),


𝜎>0

where
𝑛
1∑
𝜎2 = (𝑥𝑖 − 𝜇𝑜 )2 .
𝑛 𝑖=1
Observe that
𝑛
∑ 𝑛

(𝑥𝑖 − 𝜇𝑜 )2 = (𝑥𝑖 − 𝑥 + 𝑥 − 𝜇𝑜 )2
𝑖=1 𝑖=1

𝑛
∑ 𝑛

2
= (𝑥𝑖 − 𝑥) + (𝑥 − 𝜇𝑜 )2 ,
𝑖=1 𝑖=1
72 CHAPTER 3. HYPOTHESIS TESTING

since
𝑛
∑ 𝑛

2(𝑥𝑖 − 𝑥)(𝑥 − 𝜇𝑜 ) = 2(𝑥 − 𝜇𝑜 ) (𝑥𝑖 − 𝑥) = 0.
𝑖=1 𝑖=1

We then have that


𝜎2 = 𝜎
ˆ2 + (𝑥 − 𝜇𝑜 )2 . (3.12)
Consequently,
∑𝑛 2 2
𝑒− 𝑖=1 (𝑥𝑖 −𝜇𝑜 ) /2𝜎 𝑒−𝑛/2
sup 𝐿(𝜇𝑜 , 𝜎 ∣ 𝑥1 , 𝑥2 , . . . 𝑥𝑛 ) = = .
𝜎>0 (2𝜋)𝑛/2 𝜎 𝑛 (2𝜋)𝑛/2 𝜎 𝑛

Thus, the likelihood ratio statistic in (3.11) is

sup 𝐿(𝜇𝑜 , 𝜎 ∣ 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 )
𝜎>0 ˆ𝑛
𝜎
Λ(𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) = = .
𝐿(ˆ ˆ ∣ 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 )
𝜇, 𝜎 𝜎𝑛

Hence, an LRT will reject H𝑜 is

ˆ𝑛
𝜎
⩽ 𝑐,
𝜎𝑛
for some 𝑐 with 0 < 𝑐 < 1, or
ˆ2
𝜎
⩽ 𝑐2/𝑛 ,
𝜎2
or
𝜎2 1
⩾ 2/𝑛 ,
ˆ2
𝜎 𝑐
1
where > 1. In view of (3.12), we see that and LRT will reject H𝑜 if
𝑐2/𝑛
(𝑥 − 𝜇𝑜 )2 1
⩾ 2/𝑛 − 1 ≡ 𝑘,
ˆ2
𝜎 𝑐
𝑛−1 2
where 𝑘 > 0, and 𝜎ˆ2 is the MLE for 𝜎 2 . Writing ˆ2 we see that
𝑆𝑛 for 𝜎
𝑛
an LRT will reject H𝑜 if

∣𝑥 − 𝜇𝑜 ∣ √
√ ⩾ (𝑛 − 1)𝑘 ≡ 𝑏,
𝑆𝑛 / 𝑛

where 𝑏 > 0. Hence, the LRT can be based in the test statistic

𝑋 𝑛 − 𝜇𝑜
𝑇𝑛 = √ .
𝑆𝑛 / 𝑛

Note that 𝑇𝑛 has a 𝑡(𝑛−1) distribution if H𝑜 is true. We then see that if 𝑡𝛼/2,𝑛−1
is such that
P(∣𝑇 ∣ ⩾ 𝑡𝛼/2,𝑛−1 ) = 𝛼, for 𝑇 ∼ 𝑡(𝑛 − 1),
3.5. THE NEYMAN–PEARSON LEMMA 73

then the LRT of H𝑜 : 𝜇 = 𝜇𝑜 versus H1 : 𝜇 ∕= 𝜇𝑜 which rejects H𝑜 if

∣𝑋 𝑛 − 𝜇𝑜 ∣
√ ⩾ 𝑡𝛼/2,𝑛−1 ,
𝑆𝑛 / 𝑛
has significance level 𝛼.
Observe also that the set of values of 𝜇𝑜 which do not get rejected by this
test is the open interval
( )
𝑆𝑛 𝑆𝑛
𝑋 𝑛 − 𝑡𝛼/2,𝑛−1 √ , 𝑋 𝑛 + 𝑡𝛼/2,𝑛−1 √ ,
𝑛 𝑛

which is a 100(1 − 𝛼)% confidence interval for the mean, 𝜇, of a normal(𝜇, 𝜎 2 )


distribution based on a random sample of size 𝑛 from that distribution. This
provides another interpretation of a confidence interval based on a hypothesis
test.

3.5 The Neyman–Pearson Lemma


Consider a test of a simple hypothesis

H𝑜 : 𝜃 = 𝜃 𝑜

versus the alternative


H1 : 𝜃 = 𝜃 1
based on a random sample of size 𝑛 from a distribution with distribution func-
tion 𝑓 (𝑥 ∣ 𝜃). The likelihood ratio statistic in this case is
𝐿(𝜃𝑜 ∣ 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 )
Λ(𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) = . (3.13)
𝐿(𝜃1 ∣ 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 )
The rejection region for the LRT is

𝑅 = {(𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) ∣ Λ(𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) ⩽ 𝑐} (3.14)

for some 0 < 𝑐 < 1.


If the significance level of the test is 𝛼, then

𝛼= 𝑓 (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ∣ 𝜃𝑜 ) d𝑥1 d𝑥2 ⋅ ⋅ ⋅ d𝑥𝑛 ,
𝑅

for the case in which 𝑓 (𝑥 ∣ 𝜃) is a pdf. Thus,



𝛼= 𝐿(𝜃𝑜 ∣ 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) d𝑥1 d𝑥2 ⋅ ⋅ ⋅ d𝑥𝑛 , (3.15)
𝑅

It then follows that the power of the LRT is



𝛾(𝜃1 ) = 𝐿(𝜃1 ∣ 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) d𝑥1 d𝑥2 ⋅ ⋅ ⋅ d𝑥𝑛 ; (3.16)
𝑅
74 CHAPTER 3. HYPOTHESIS TESTING

that is, the probability of reject H𝑜 when H1 is true.


Consider next another test with rejection region 𝑅 ˜ and significance level 𝛼.
We then have that

𝛼= 𝐿(𝜃𝑜 ∣ 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) d𝑥1 d𝑥2 ⋅ ⋅ ⋅ d𝑥𝑛 , (3.17)
𝑅
˜

and the power of the new test is



𝛾
˜(𝜃1 ) = 𝐿(𝜃1 ∣ 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) d𝑥1 d𝑥2 ⋅ ⋅ ⋅ d𝑥𝑛 . (3.18)
𝑅
˜

The Neyman–Pearson Lemma states that

𝛾
˜(𝜃1 ) ⩽ 𝛾(𝜃1 ); (3.19)

in other words, out of all the tests of the simple hypothesis H𝑜 : 𝜃 = 𝜃𝑜 versus
H1 : 𝜃 = 𝜃1 , the LRT yields the largest possible power. Consequently, the
LRT gives the smallest probability of making a Type II error our of the tests of
significance level 𝛼.
The proof of the Neyman–Pearson Lemma is straight forward. First observe
that
𝑅 = (𝑅 ∩ 𝑅) ˜ ∪ (𝑅 ∩ 𝑅˜𝑐 ), (3.20)
˜𝑐 denotes the complement of 𝑅.
where 𝑅 ˜ It then follows from (3.15) that
∫ ∫
𝛼= 𝐿(𝜃𝑜 ∣ x) dx + 𝐿(𝜃𝑜 ∣ x) dx, (3.21)
𝑅∩𝑅
˜ ˜𝑐
𝑅∩𝑅

where we have abbreviated the vector (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) by x, and the volume


element d𝑥1 d𝑥2 ⋅ ⋅ ⋅ d𝑥𝑛 by dx. Similarly, using

𝑅
˜ = (𝑅 ˜ ∩ 𝑅𝑐 ),
˜ ∩ 𝑅) ∪ (𝑅 (3.22)

and (3.17) we get that


∫ ∫
𝛼= 𝐿(𝜃𝑜 ∣ x) dx + 𝐿(𝜃𝑜 ∣ x) dx. (3.23)
𝑅∩𝑅
˜ 𝑅∩𝑅
˜ 𝑐

Combining (3.21) and (3.23) we then get that


∫ ∫
𝐿(𝜃𝑜 ∣ x) dx = 𝐿(𝜃𝑜 ∣ x) dx. (3.24)
˜𝑐
𝑅∩𝑅 𝑅∩𝑅
˜ 𝑐

To prove (3.19), use (3.18) and (3.22) to get


∫ ∫
𝛾
˜(𝜃1 ) = 𝐿(𝜃1 ∣ x) dx + 𝐿(𝜃1 ∣ x) dx (3.25)
𝑅∩𝑅
˜ 𝑅∩𝑅
˜ 𝑐

Similarly, using (3.16) and (3.20), we get that


∫ ∫
𝛾(𝜃1 ) = 𝐿(𝜃1 ∣ x) dx + 𝐿(𝜃1 ∣ x) dx. (3.26)
𝑅∩𝑅
˜ ˜𝑐
𝑅∩𝑅
3.5. THE NEYMAN–PEARSON LEMMA 75

Next, subtract (3.25) from (3.26) to get


∫ ∫
𝛾(𝜃1 ) − 𝛾
˜(𝜃1 ) = 𝐿(𝜃1 ∣ x) dx − 𝐿(𝜃1 ∣ x) dx, (3.27)
˜𝑐
𝑅∩𝑅 𝑅∩𝑅
˜ 𝑐

and observe that


𝑐𝐿(𝜃1 ∣ x) ⩾ 𝐿(𝜃𝑜 ∣ x) ˜𝑐
on 𝑅 ∩ 𝑅 (3.28)
and
𝑐𝐿(𝜃1 ∣ x) ⩽ 𝐿(𝜃𝑜 ∣ x) ˜ ∩ 𝑅𝑐 ,
on 𝑅 (3.29)
where we have used (3.13) and (3.14). Multiplying the inequality in (3.29) by
−1 we get that
−𝑐𝐿(𝜃1 ∣ x) ⩾ −𝐿(𝜃𝑜 ∣ x) on 𝑅 ˜ ∩ 𝑅𝑐 . (3.30)
It then follows from (3.27), (3.28) and (3.30)
(∫ ∫ )
1
𝛾(𝜃1 ) − 𝛾
˜(𝜃1 ) ⩾ 𝐿(𝜃𝑜 ∣ x) dx − 𝐿(𝜃𝑜 ∣ x) dx = 0 (3.31)
𝑐 𝑅∩𝑅˜𝑐 𝑅∩𝑅
˜ 𝑐

where we have used (3.24). The inequality in (3.19) now follows from (3.31).
Thus, we have proved the Neymann–Pearson Lemma.
The Neyman–Pearson Lemma applies only to tests of simple hypotheses.
For instance, in Example 3.4.6 of page 64 dealing with the test of H𝑜 : 𝑝 = 𝑝𝑜
versus H1 : 𝑝 = 𝑝1 , for 𝑝1 > 𝑝𝑜 , based on a random sample 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 from
a Bernoulli(𝑝) distribution for 0 < 𝑝 < 1, we saw that the LRT rejects the null
hypothesis at some significance level, 𝛼, is
𝑛

𝑌 = 𝑋𝑖 ⩾ 𝑏, (3.32)
𝑖=1

for some 𝑏 > 0 determined by 𝛼. By the Neyman–Pearson Lemma, this is the


most powerful test at that significance level; that is, the test with the smallest
probability of a Type II error. Recall that the value of 𝑏 yielding a significance
level 𝛼 may be obtained, for large sample sizes, 𝑛, by applying the Central Limit
Theorem. In fact, assuming that the null hypothesis is true, the test statistic 𝑌
in (3.32) is binomial(𝑛, 𝑝𝑜 ). We then have that

𝛼 = P(𝑌 ⩾ 𝑏)
( )
𝑌 − 𝑛𝑝𝑜 𝑏 − 𝑛𝑝𝑜
= P √ ⩾√
𝑛𝑝𝑜 (1 − 𝑝𝑜 ) 𝑛𝑝𝑜 (1 − 𝑝𝑜 )
( )
𝑏 − 𝑛𝑝𝑜
≈ P 𝑍⩾√ ,
𝑛𝑝𝑜 (1 − 𝑝𝑜 )

where 𝑍 ∼ normal(0, 1). Thus, if 𝑧𝛼 is such that P(𝑍 ⩾ 𝑧𝛼 ) = 𝛼, then



𝑏 = 𝑛𝑝𝑜 + 𝑧𝛼 𝑛𝑝𝑜 (1 − 𝑝𝑜 ) (3.33)
76 CHAPTER 3. HYPOTHESIS TESTING

in (3.32) gives the most powerful test at the significance level of 𝛼. Observe
that this value of 𝑏 depends only on 𝑝𝑜 and 𝑛; it does not depend on 𝑝1 .
Now consider the test of H𝑜 : 𝑝 = 𝑝𝑜 versus H1 : 𝑝 > 𝑝𝑜 . Since, the alterna-
tive hypothesis is not simple, we cannot apply the Neyman–Pearson Lemma di-
rectly. However, by the previous considerations, the test that rejects H𝑜 : 𝑝 = 𝑝𝑜
if
𝑌 ⩾ 𝑏,
where 𝑏 is given by (3.33) for large 𝑛 is the most powerful test at level 𝛼 for every
𝑝1 > 𝑝𝑜 ; i.e., for every possible value in the alternative hypothesis H1 : 𝑝 > 𝑝𝑜 .
We then say that the LRT is the uniformly most powerful test (UMP) at
level 𝛼 in this case.
Definition 3.5.1 (Uniformly most powerful test). A test of a simple hypothesis
H𝑜 : 𝜃 = 𝜃𝑜 against a composite alternative hypothesis H1 : 𝜃 ∈ Ω1 is said to
be uniformly most powerful test (UMP) at a level 𝛼, if it is most powerful
at that level for every simple alternative 𝜃 = 𝜃1 in Ω1 .
Chapter 4

Evaluating Estimators

Given a random sample, 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 , from a distribution with distribution


function 𝑓 (𝑥 ∣ 𝜃), we have seen that there might be more than one statistic,

𝑇 = 𝑇 (𝑋1 , 𝑋2 , . . . , 𝑋𝑛 ),

that can be used to estimate the parameter 𝜃. For example, if 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 is


a random sample from a normal(𝜇, 𝜎 2 ) distribution, then the sample variance,
𝑛
1 ∑
𝑆𝑛2 = (𝑋𝑖 − 𝑋 𝑛 )2 ,
𝑛 − 1 𝑖=1

and the maximum likelihood estimator,


𝑛
1∑
ˆ2 =
𝜎 (𝑋𝑖 − 𝑋 𝑛 )2 ,
𝑛 𝑖=1

are both estimators for the variance 𝜎 2 . The sample variance, 𝑆𝑛2 , is unbiased,
while the MLE is not.
As another example, consider a random sample, 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 , from a
Poisson distribution with parameter 𝜆. Then, the sample mean, 𝑋 𝑛 and the
then the sample variance, 𝑆𝑛2 . are both unbiased estimators for 𝜆.
Given two estimators for a given parameter, 𝜃, is there a way to evaluate
the two estimators in such a way that we can tell which of the two is the better
one? In this chapter we explore one way to measure how good an estimator is,
the mean squared error or MSE. We will then see how to use that measure to
compare one estimator to others.

4.1 Mean Squared Error


Given a random sample, 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 , from a distribution with distribution
function 𝑓 (𝑥 ∣ 𝜃), and an estimator, 𝑊 = 𝑊 (𝑋1 , 𝑋2 , . . . , 𝑋𝑛 ), for the parameter

77
78 CHAPTER 4. EVALUATING ESTIMATORS

𝜃, we define the mean squared error (MSE) of 𝑊 [ to be the] expected value


of (𝑊 − 𝜃)2 . We denote this expected value by 𝐸𝜃 (𝑊 − 𝜃)2 and compute it,
for the case in which 𝑓 (𝑥 ∣ 𝜃) is a pdf, using the formula

𝐸𝜃 (𝑊 − 𝜃)2 = (𝑊 − 𝜃)2 𝑓 (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ∣ 𝜃) d𝑥1 d𝑥2 ⋅ ⋅ ⋅ d𝑥𝑛 ,
[ ]
ℝ𝑛

where 𝑓 (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ∣ 𝜃) is the joint distribution of the sample. The subscript,


𝜃, in the expectation symbol for expectation, 𝐸, reminds us that we are using
𝑓 (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ∣ 𝜃). By the same token, the expectation of the 𝑊 is written
𝐸𝜃 (𝑊 ). We also write

MSE(𝑊 ) = 𝐸𝜃 (𝑊 − 𝜃)2 .
[ ]

Observe that, since expectation is a linear operation,


[ ]
MSE(𝑊 ) = 𝐸𝜃 ((𝑊 − 𝐸𝜃 (𝑊 )) + (𝐸𝜃 (𝑊 ) − 𝜃))2
[ ]
= 𝐸𝜃 ((𝑊 − 𝐸𝜃 (𝑊 )))2 + 2(𝑊 − 𝐸𝜃 (𝑊 ))(𝐸𝜃 (𝑊 ) − 𝜃) + (𝐸𝜃 (𝑊 ) − 𝜃)2
[ ] [ ]
= 𝐸𝜃 (𝑊 − 𝐸𝜃 (𝑊 ))2 + 𝐸𝜃 (𝐸𝜃 (𝑊 ) − 𝜃)2

since

𝐸𝜃 [2(𝑊 − 𝐸𝜃 (𝑊 ))(𝐸𝜃 (𝑊 ) − 𝜃)] = 2(𝐸𝜃 (𝑊 ) − 𝜃) 𝐸𝜃 [(𝑊 − 𝐸𝜃 (𝑊 )]

= 2(𝐸𝜃 (𝑊 ) − 𝜃) [𝐸𝜃 (𝑊 ) − 𝐸𝜃 (𝑊 )]

= 0.

We then have that

MSE(𝑊 ) = var𝜃 (𝑊 ) + [𝐸𝜃 (𝑊 ) − 𝜃]2 ;

that is, the mean square error of 𝑊 is the sum of the variance of 𝑊 and the
quantity [𝐸𝜃 (𝑊 ) − 𝜃]2 . The expression 𝐸𝜃 (𝑊 ) − 𝜃 is called the bias of the
estimator 𝑊 and is denoted by bias𝜃 (𝑊 ); that is,

bias𝜃 (𝑊 ) = 𝐸𝜃 (𝑊 ) − 𝜃.

We then have that the mean square error of an estimator is

MSE(𝑊 ) = var𝜃 (𝑊 ) + [bias𝜃 (𝑊 )]2 .

Thus, if the estimator, 𝑊 , is unbiased, then 𝐸𝜃 (𝑊 ) = 𝜃, so that

MSE(𝑊 ) = var𝜃 (𝑊 ), for an unbiased estimator.


4.1. MEAN SQUARED ERROR 79

Example 4.1.1. Let 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 be a random sample from a normal(𝜇, 𝜎 2 )


distribution. Then, the sample mean, 𝑋 𝑛 , and the sample variance, 𝑆𝑛2 , are
unbiased estimators of the 𝜇 and 𝜎 2 , respectively. It then follows that

𝜎2
MSE(𝑋 𝑛 ) = var(𝑋 𝑛 ) =
𝑛
and
2𝜎 4
MSE(𝑆𝑛2 ) = var(𝑆𝑛2 ) = ,
𝑛−1
where we have used the fact that
𝑛−1 2
𝑆 ∼ 𝜒2 (𝑛 − 1),
𝜎2 𝑛
and therefore
(𝑛 − 1)2
var(𝑆𝑛2 ) = 2(𝑛 − 1).
𝜎4
Example 4.1.2 (Comparing the sample variance and the MLE in a sample
from a norma distribution). Let 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 be a random sample from a
normal(𝜇, 𝜎 2 ) distribution. The MLE for 𝜎 2 is the estimator
𝑛
1∑
ˆ2 =
𝜎 (𝑋𝑖 − 𝑋 𝑛 )2 .
𝑛 𝑖=1

𝑛−1 2
ˆ2 =
Since 𝜎 𝑆𝑛 , and 𝑆𝑛2 is an unbiased estimator for 𝜃 it follows that
𝑛
𝑛−1 2 𝜎2
𝜎2 ) =
𝐸(ˆ 𝜎 = 𝜎2 − .
𝑛 𝑛
ˆ2 is
It then follows that the bias of 𝜎
𝜎2
𝜎 2 ) = 𝐸(ˆ
bias(ˆ 𝜎2 ) − 𝜎2 = − ,
𝑛
ˆ2 underestimates 𝜎 2 .
which shows that, on average, 𝜎
ˆ2 . In order to do this, we used the fact
Next, we compute the variance of 𝜎
that
𝑛−1 2
𝑆 ∼ 𝜒2 (𝑛 − 1),
𝜎2 𝑛
so that ( )
𝑛−1 2
var 𝑆 = 2(𝑛 − 1).
𝜎2 𝑛
𝑛−1 2
ˆ2 =
It then follows from 𝜎 𝑆𝑛 that
𝑛
𝑛2
𝜎 2 ) = 2(𝑛 − 1),
var(ˆ
𝜎4
80 CHAPTER 4. EVALUATING ESTIMATORS

so that
2(𝑛 − 1)𝜎 4
𝜎2 ) =
var(ˆ .
𝑛2
ˆ2 is
It the follows that the mean squared error of 𝜎
𝜎2 )
MSE(ˆ = 𝜎 2 ) + bias(ˆ
var(ˆ 𝜎2 )

2(𝑛 − 1)𝜎 4 𝜎4
= 2
+ 2
𝑛 𝑛
2𝑛 − 1 4
= 𝜎 .
𝑛2
𝜎 2 ) to
Comparing the value of MSE(ˆ
2𝜎 4
MSE(𝑆𝑛2 ) = ,
𝑛−1
we see that
𝜎 2 ) < MSE(𝑆𝑛2 ).
MSE(ˆ
Hence, the MLE for 𝜎 2 has a smaller mean squared error than the unbiased
estimator 𝑆𝑛2 . Thus, 𝜎ˆ2 is a more precise estimator than 𝑆𝑛2 ; however, 𝑆𝑛2 is
2
more accurate than 𝜎ˆ .

4.2 Crámer–Rao Theorem


If 𝑊 = 𝑊 (𝑋1 , 𝑋2 , . . . , 𝑋𝑛 ) is an unbiased estimator for 𝜃, where 𝑋1 , 𝑋2 , . . . , 𝑋𝑛
is a random sample from a distribution with distribution function 𝑓 (𝑥 ∣ 𝜃), we
saw in the previous section that the mean squared error of 𝑊 is given by
MSE(𝑊 ) = var𝜃 (𝑊 ).
The question we would like to answer in this section is the following: Out of all
unbiased estimators of 𝜃 based on the random sample 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 , is there
one with the smallest possible variance, and consequently the smallest possible
MSE?
We will provide a partial answer to the question posed above. The answer
is based on a lower bound for the variance of a statistic, 𝑊 , based on a random
sample from a distribution with distribution function 𝑓 (𝑥 ∣ 𝜃). The lower bound
was discovered independently by Rao and Crámer around the middle of the
twentieth century. The idea is to show that
var𝜃 (𝑊 ) ⩾ 𝑏(𝜃, 𝑛)
for all estimators, 𝑊 , based on the sample, for a function 𝑏 of the parameter
𝜃. The Crámer–Rao inequality can be derived as a consequence of the Cauchy–
Schwarz inequality: For any statistics, 𝑊1 and 𝑊2 , based on the sample,
[cov(𝑊1 , 𝑊2 )]2 ⩽ var(𝑊1 ) ⋅ var(𝑊2 ). (4.1)
4.2. CRÁMER–RAO THEOREM 81

The proof of (4.1) is very straightforward. Define a function ℎ : ℝ → ℝ by


ℎ(𝑡) = var(𝑊1 + 𝑡𝑊2 ) for all 𝑡 ∈ ℝ,
and observe that ℎ(𝑡) ⩾ 0 for all 𝑡 ∈ ℝ. By the properties of the expectation
operator and the definition of variance,
[ ] 2
ℎ(𝑡) = 𝐸 (𝑊1 + 𝑡𝑊2 )2 − [𝐸(𝑊1 + 𝑡𝑊2 )]
[ ] 2
= 𝐸 𝑊12 + 2𝑡𝑊1 𝑊2 + 𝑡2 𝑊22 − [𝐸(𝑊1 ) + 𝑡𝐸(𝑊2 )]
[ ] [ ] 2 2
= 𝐸 𝑊12 + 2𝑡𝐸 [𝑊1 𝑊2 ] + 𝑡2 𝐸 𝑊22 − [𝐸(𝑊1 )] − 2𝑡𝐸(𝑊1 )𝐸(𝑊2 ) − 𝑡2 [𝐸(𝑊2 )]

= var(𝑊1 ) + 2 cov(𝑊1 , 𝑊2 ) 𝑡 + var(𝑊2 ) 𝑡2 ,


where we have used the definition of covariance
cov(𝑊1 , 𝑊2 ) = 𝐸(𝑊1 𝑊2 ) − 𝐸(𝑊1 )𝐸(𝑊2 ). (4.2)
It then follows that ℎ(𝑡) is quadratic polynomial which is never negative. Con-
sequently, the discriminant,
[2 cov(𝑊1 , 𝑊2 )]2 − 4 var(𝑊2 )var(𝑊1 ),
is at most 0; that is,
2
4 [cov(𝑊1 , 𝑊2 )] − 4 var(𝑊2 )var(𝑊1 ) ⩽ 0,
from which the Cauchy–Schwarz inequality in (4.1) follows.
To obtain the Crámmer–Rao lower bound, we will apply the Cauchy–Schwarz
inequality (4.1) to the case 𝑊1 = 𝑊 and

𝑊2 = [ln (𝐿(𝜃 ∣ 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 ))] .
∂𝜃
In other words, 𝑊1 is the estimator in question and 𝑊2 is the partial deriva-
tive with respect to the parameter 𝜃 of the natural logarithm of the likelihood
function.
In order to compute cov(𝑊1 , 𝑊2 ), we will need the expected value of 𝑊2 .
Note that
1 ∂
𝑊2 = [𝐿(𝜃 ∣ 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 )] ,
𝐿(𝜃 ∣ 𝑋1 , 𝑋2 , . . . 𝑋𝑛 ) ∂𝜃
so that ∫
1 ∂
𝐸𝜃 (𝑊2 ) = [𝐿(𝜃 ∣ x)] 𝐿(𝜃 ∣ x) dx,
ℝ𝑛 𝐿(𝜃 ∣ x) ∂𝜃
where we have denoted the vector (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ) by x and the volume element,
d𝑥1 d𝑥2 ⋅ ⋅ ⋅ d𝑥𝑛 , by dx. We then have that


𝐸𝜃 (𝑊2 ) = [𝐿(𝜃 ∣ x)] dx.
ℝ 𝑛 ∂𝜃
82 CHAPTER 4. EVALUATING ESTIMATORS

Assuming that the order of differentiation and integration can be changed, we


then have that
[∫ ]
∂ ∂
𝐸𝜃 (𝑊2 ) = 𝐿(𝜃 ∣ (𝑥)) dx = (1) = 0,
∂𝜃 ℝ𝑛 ∂𝜃

so that 𝑊2 has expected value 0. It then follows from (4.2) that

cov(𝑊1 , 𝑊2 ) = 𝐸𝜃 (𝑊1 𝑊2 )

1 ∂
= 𝑊 (x) [𝐿(𝜃 ∣ x)] 𝐿(𝜃 ∣ x) dx
ℝ𝑛 𝐿(𝜃 ∣ x) ∂𝜃


= 𝑊 (x) [𝐿(𝜃 ∣ x)] dx
ℝ𝑛 ∂𝜃


= [𝑊 (x) 𝐿(𝜃 ∣ x)] dx.
ℝ𝑛 ∂𝜃
Thus, if the order of differentiation and integration can be interchanged, we
have that [∫ ]

cov(𝑊1 , 𝑊2 ) = 𝑊 (x) 𝐿(𝜃 ∣ x) dx
∂𝜃 ℝ𝑛


= [𝐸 (𝑊 )] .
∂𝜃 𝜃
Thus, if we set
𝑔(𝜃) = 𝐸𝜃 (𝑊 )
for all 𝜃 in the parameter range, we see that

cov𝜃 (𝑊1 , 𝑊2 ) = 𝑔 ′ (𝜃).

In particular, if 𝑊 is an unbiased estimator, cov𝜃 (𝑊, 𝑊2 ) = 1, where



𝑊2 = [ln (𝐿(𝜃 ∣ 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 ))] .
∂𝜃
Applying the Cauchy–Schwarz inequality in (4.1) we then have that

[𝑔 ′ (𝜃)]2 ⩽ var(𝑊 ) ⋅ var(𝑊2 ). (4.3)

In order to compute var(𝑊2 ), observe that


( 𝑛 )
∂ ∑
𝑊2 = ln(𝑓 (𝑋𝑖 ∣ 𝜃)
∂𝜃 𝑖=1

𝑛
∑ ∂
= (ln(𝑓 (𝑋𝑖 ∣ 𝜃)) .
𝑖=1
∂𝜃
Thus, since 𝑋1, 𝑋2, . . . , 𝑋𝑛 are iid random variables with distribution function
𝑓(𝑥 ∣ 𝜃),

var(𝑊2) = ∑_{𝑖=1}^{𝑛} var( ∂/∂𝜃 [ln(𝑓(𝑋𝑖 ∣ 𝜃))] )

        = 𝑛 ⋅ var( ∂/∂𝜃 [ln(𝑓(𝑋 ∣ 𝜃))] ).

The variance of the random variable ∂/∂𝜃 [ln(𝑓(𝑋 ∣ 𝜃))] is called the Fisher
information and is denoted by 𝐼(𝜃). We then have that

var(𝑊2) = 𝑛𝐼(𝜃).

We then obtain from (4.3) that

[𝑔′(𝜃)]² ⩽ 𝑛𝐼(𝜃) var(𝑊),

which yields the Crámer–Rao lower bound

var(𝑊) ⩾ [𝑔′(𝜃)]² / (𝑛𝐼(𝜃)), (4.4)

where

𝐼(𝜃) = var( ∂/∂𝜃 [ln(𝑓(𝑋 ∣ 𝜃))] )

is the Fisher information. For the case in which 𝑊 is unbiased we obtain from
(4.4) that

var(𝑊) ⩾ 1 / (𝑛𝐼(𝜃)). (4.5)
Example 4.2.1. Let 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 be a random sample from a Poisson(𝜆)
distribution. Then,
𝑓(𝑋 ∣ 𝜆) = (𝜆^𝑋 / 𝑋!) 𝑒^{−𝜆},

so that

ln(𝑓(𝑋 ∣ 𝜆)) = 𝑋 ln 𝜆 − 𝜆 − ln(𝑋!)

and

∂/∂𝜆 [ln(𝑓(𝑋 ∣ 𝜆))] = 𝑋/𝜆 − 1.

Then the Fisher information is

𝐼(𝜆) = (1/𝜆²) var(𝑋) = (1/𝜆²) ⋅ 𝜆 = 1/𝜆.

Thus, the Crámer–Rao lower bound for unbiased estimators is obtained from
(4.5) to be

1/(𝑛𝐼(𝜆)) = 𝜆/𝑛.
Observe that if 𝑊 = 𝑋̄𝑛, the sample mean, then 𝑊 is unbiased and

var(𝑊) = 𝜆/𝑛.

Thus, in this case, the lower bound for the variance of unbiased estimators is
attained at the sample mean. We say that 𝑋̄𝑛 is an efficient estimator.
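As a quick numerical sanity check of this claim, one can simulate many Poisson samples and compare the Monte Carlo variance of the sample mean with the bound 𝜆/𝑛. The short Python sketch below does this with NumPy; the values of 𝜆, 𝑛, and the number of replications are arbitrary illustrative choices.

    # Monte Carlo check: the variance of the sample mean of a Poisson(lam) sample
    # of size n should match the Cramer-Rao bound lam/n.
    import numpy as np

    rng = np.random.default_rng(seed=0)
    lam, n, reps = 2.5, 50, 200_000        # arbitrary illustrative values

    samples = rng.poisson(lam, size=(reps, n))
    sample_means = samples.mean(axis=1)

    print("Monte Carlo variance of the sample mean:", sample_means.var())
    print("Cramer-Rao lower bound lam/n:           ", lam / n)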
Definition 4.2.2 (Efficient Estimator). An unbiased estimator, 𝑊, of a pa-
rameter, 𝜃, is said to be efficient if its variance is the lower bound in the
Crámer–Rao inequality; that is, if

var(𝑊) = 1 / (𝑛𝐼(𝜃)),

where 𝐼(𝜃) is the Fisher information,

𝐼(𝜃) = var( ∂/∂𝜃 [ln(𝑓(𝑋 ∣ 𝜃))] ).

For any unbiased estimator, 𝑊, of 𝜃, we define the efficiency of 𝑊, denoted
eff𝜃(𝑊), to be

eff𝜃(𝑊) = (1/(𝑛𝐼(𝜃))) / var𝜃(𝑊).
Thus, by the Crámer–Rao inequality (4.5),
eff𝜃 (𝑊 ) ⩽ 1
for all unbiased estimators, 𝑊 , of 𝜃. Furthermore, eff𝜃 (𝑊 ) = 1 if and only if 𝑊
is efficient.
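To illustrate the notion of efficiency in the Poisson setting of Example 4.2.1, note that the sample variance 𝑆𝑛² is also an unbiased estimator of 𝜆, since the mean and the variance of a Poisson(𝜆) distribution are both equal to 𝜆. A short simulation (again with arbitrary illustrative parameter values) suggests that its efficiency is strictly below 1, while the sample mean has efficiency essentially equal to 1.

    # Efficiency of two unbiased estimators of lambda for Poisson data:
    # the sample mean (efficient) and the sample variance (unbiased, less efficient).
    import numpy as np

    rng = np.random.default_rng(seed=1)
    lam, n, reps = 2.5, 50, 200_000        # arbitrary illustrative values
    cr_bound = lam / n                     # 1/(n I(lambda)) = lambda/n

    samples = rng.poisson(lam, size=(reps, n))
    xbar = samples.mean(axis=1)
    s2 = samples.var(axis=1, ddof=1)       # unbiased sample variance

    print("estimated eff(sample mean):    ", cr_bound / xbar.var())  # close to 1
    print("estimated eff(sample variance):", cr_bound / s2.var())    # below 1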

Next, we turn to the question of computing the Fisher information, 𝐼(𝜃).
First, observe that

𝐼(𝜃) = var( ∂/∂𝜃 [ln(𝑓(𝑋 ∣ 𝜃))] ) = 𝐸𝜃[ ( ∂/∂𝜃 [ln(𝑓(𝑋 ∣ 𝜃))] )² ], (4.6)

since

𝐸𝜃[ ∂/∂𝜃 [ln(𝑓(𝑋 ∣ 𝜃))] ] = 0. (4.7)

To see why (4.7) is true, observe that

𝐸𝜃[ ∂/∂𝜃 [ln(𝑓(𝑋 ∣ 𝜃))] ] = ∫_{−∞}^{∞} ∂/∂𝜃 [ln(𝑓(𝑥 ∣ 𝜃))] 𝑓(𝑥 ∣ 𝜃) d𝑥

                          = ∫_{−∞}^{∞} (1 / 𝑓(𝑥 ∣ 𝜃)) ∂/∂𝜃 [𝑓(𝑥 ∣ 𝜃)] 𝑓(𝑥 ∣ 𝜃) d𝑥

                          = ∫_{−∞}^{∞} ∂/∂𝜃 [𝑓(𝑥 ∣ 𝜃)] d𝑥

                          = ∂/∂𝜃 ( ∫_{−∞}^{∞} 𝑓(𝑥 ∣ 𝜃) d𝑥 ),
assuming that the order of differentiation and integration can be interchanged.
The identity in (4.7) now follows from the fact that

∫_{−∞}^{∞} 𝑓(𝑥 ∣ 𝜃) d𝑥 = 1.

Next, differentiate (4.7) with respect to 𝜃 one more time to obtain that

∂/∂𝜃 𝐸𝜃[ ∂/∂𝜃 [ln(𝑓(𝑋 ∣ 𝜃))] ] = 0, (4.8)
where, assuming that the order of differentiation and integration can be inter-
changed,
∂/∂𝜃 𝐸𝜃[ ∂/∂𝜃 [ln(𝑓(𝑋 ∣ 𝜃))] ] = ∂/∂𝜃 [ ∫_{−∞}^{∞} ∂/∂𝜃 [ln(𝑓(𝑥 ∣ 𝜃))] 𝑓(𝑥 ∣ 𝜃) d𝑥 ]

    = ∫_{−∞}^{∞} ∂²/∂𝜃² [ln(𝑓(𝑥 ∣ 𝜃))] 𝑓(𝑥 ∣ 𝜃) d𝑥 + ∫_{−∞}^{∞} ∂/∂𝜃 [ln(𝑓(𝑥 ∣ 𝜃))] ∂/∂𝜃 [𝑓(𝑥 ∣ 𝜃)] d𝑥

    = 𝐸𝜃[ ∂²/∂𝜃² [ln(𝑓(𝑋 ∣ 𝜃))] ] + ∫_{−∞}^{∞} (1 / 𝑓(𝑥 ∣ 𝜃)) [ ∂/∂𝜃 𝑓(𝑥 ∣ 𝜃) ]² d𝑥,
where

∫_{−∞}^{∞} (1 / 𝑓(𝑥 ∣ 𝜃)) [ ∂/∂𝜃 𝑓(𝑥 ∣ 𝜃) ]² d𝑥 = ∫_{−∞}^{∞} [ (1 / 𝑓(𝑥 ∣ 𝜃)) ∂/∂𝜃 𝑓(𝑥 ∣ 𝜃) ]² 𝑓(𝑥 ∣ 𝜃) d𝑥

    = ∫_{−∞}^{∞} [ ∂/∂𝜃 ln(𝑓(𝑥 ∣ 𝜃)) ]² 𝑓(𝑥 ∣ 𝜃) d𝑥

    = 𝐸𝜃[ ( ∂/∂𝜃 ln(𝑓(𝑋 ∣ 𝜃)) )² ].

It then follows from (4.6) that

∫_{−∞}^{∞} (1 / 𝑓(𝑥 ∣ 𝜃)) [ ∂/∂𝜃 𝑓(𝑥 ∣ 𝜃) ]² d𝑥 = 𝐼(𝜃).

Consequently,

∂/∂𝜃 𝐸𝜃[ ∂/∂𝜃 [ln(𝑓(𝑋 ∣ 𝜃))] ] = 𝐸𝜃[ ∂²/∂𝜃² [ln(𝑓(𝑋 ∣ 𝜃))] ] + 𝐼(𝜃).
In view of (4.8) we therefore have that

𝐼(𝜃) = −𝐸𝜃[ ∂²/∂𝜃² [ln(𝑓(𝑋 ∣ 𝜃))] ], (4.9)

which gives another formula for computing the Fisher information.
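As a numerical illustration of the two formulas for 𝐼(𝜃), the following short Python sketch approximates both expectations by Monte Carlo in the Poisson case of Example 4.2.1, where both should be close to 1/𝜆; the parameter values are arbitrary illustrative choices.

    # Two ways of approximating the Fisher information of Poisson(lam):
    # I(lam) = Var(d/dlam log f(X|lam)) and I(lam) = -E[d^2/dlam^2 log f(X|lam)].
    import numpy as np

    rng = np.random.default_rng(seed=2)
    lam, reps = 2.5, 500_000               # arbitrary illustrative values

    x = rng.poisson(lam, size=reps)
    score = x / lam - 1.0                  # d/dlam log f = X/lam - 1
    second_derivative = -x / lam**2        # d^2/dlam^2 log f = -X/lam^2

    print("var(score):            ", score.var())
    print("-E[second derivative]: ", -second_derivative.mean())
    print("exact value 1/lam:     ", 1.0 / lam)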


Appendix A

Pearson Chi–Square Statistic

The goal of this appendix is to prove the first part of Theorem 3.1.4 on page
49; namely: assume that
(𝑋1 , 𝑋2 , . . . , 𝑋𝑘 )
is a random vector with a multinomial(𝑛, 𝑝1 , 𝑝2 , . . . , 𝑝𝑘 ) distribution, and define
𝑄 = ∑_{𝑖=1}^{𝑘} (𝑋𝑖 − 𝑛𝑝𝑖)² / (𝑛𝑝𝑖). (A.1)

Then, for large values of 𝑛, 𝑄 has an approximate 𝜒²(𝑘 − 1) distribution. The
idea for the proof presented here comes from Exercise 3 on page 60 in [Fer02].
We saw in Section 3.1.2 that the result in Theorem 3.1.4 is true for 𝑘 = 2.
We begin the discussion in this appendix with the case 𝑘 = 3. It is hoped that
the main features of the proof of the general case will be seen in this simple
case.
Consider the random vector U = (𝑈1 , 𝑈2 , 𝑈3 ) with a multinomial(1, 𝑝1 , 𝑝2 , 𝑝3 )
distribution. In other words, each component function, 𝑈𝑖 , is a Bernoulli(𝑝𝑖 )
random variable for 𝑖 = 1, 2, 3, and the distribution of U is conditioned on
𝑈1 + 𝑈2 + 𝑈3 = 1.
We then have that
𝐸(𝑈𝑗 ) = 𝑝𝑗 for 𝑗 = 1, 2, 3;
var(𝑈𝑗 ) = 𝑝𝑗 (1 − 𝑝𝑗 ) for 𝑗 = 1, 2, 3;
and
cov(𝑈𝑖, 𝑈𝑗) = −𝑝𝑖 𝑝𝑗 for 𝑖 ≠ 𝑗.
Suppose now that we have a sequence of independent multinomial(1, 𝑝1 , 𝑝2 , 𝑝3 )
random vectors
U1 , U2 , . . . , U𝑛 , . . .

We then have that the random vector

X𝑛 = (𝑋1, 𝑋2, 𝑋3) = ∑_{𝑖=1}^{𝑛} U𝑖

has a multinomial(𝑛, 𝑝1, 𝑝2, 𝑝3) distribution.


We now try to get an expression for the Pearson chi–square statistic in (A.1),
for 𝑘 = 3, in terms of the bivariate random vector
W𝑛 = (𝑋1, 𝑋2)^𝑇.

The expected value of the random vector W𝑛 is

𝐸(W𝑛) = (𝑛𝑝1, 𝑛𝑝2)^𝑇,

and its covariance matrix is

C𝑊𝑛 = 𝑛 ( 𝑝1(1 − 𝑝1)    −𝑝1𝑝2
           −𝑝1𝑝2         𝑝2(1 − 𝑝2) ),
or

C𝑊𝑛 = 𝑛 C(𝑈1,𝑈2),

where C(𝑈1,𝑈2) is the covariance matrix for the bivariate random vector (𝑈1, 𝑈2)^𝑇,
for (𝑈1, 𝑈2, 𝑈3) ∼ multinomial(1, 𝑝1, 𝑝2, 𝑝3). Note that the determinant of the
matrix C(𝑈1,𝑈2) is 𝑝1𝑝2𝑝3, which is different from 0 since we are assuming that
0 < 𝑝𝑖 < 1 for 𝑖 = 1, 2, 3.
It then follows that C(𝑈1,𝑈2) is invertible with inverse

C(𝑈1,𝑈2)^{−1} = ( 1/𝑝1 + 1/𝑝3      1/𝑝3
                   1/𝑝3             1/𝑝2 + 1/𝑝3 ).

Consequently,

𝑛 C𝑊𝑛^{−1} = ( 1/𝑝1 + 1/𝑝3      1/𝑝3
                1/𝑝3             1/𝑝2 + 1/𝑝3 ).

Note also that

(W𝑛 − 𝐸(W𝑛))^𝑇 C𝑊𝑛^{−1} (W𝑛 − 𝐸(W𝑛)),
where (W𝑛 − 𝐸(W𝑛))^𝑇 is the transpose of the column vector (W𝑛 − 𝐸(W𝑛)),
is equal to

𝑛^{−1} (𝑋1 − 𝑛𝑝1, 𝑋2 − 𝑛𝑝2) ( 1/𝑝1 + 1/𝑝3      1/𝑝3
                               1/𝑝3             1/𝑝2 + 1/𝑝3 ) (𝑋1 − 𝑛𝑝1, 𝑋2 − 𝑛𝑝2)^𝑇,

which is equal to

𝑛^{−1} (1/𝑝1 + 1/𝑝3)(𝑋1 − 𝑛𝑝1)²

  + (𝑛^{−1}/𝑝3)(𝑋1 − 𝑛𝑝1)(𝑋2 − 𝑛𝑝2) + (𝑛^{−1}/𝑝3)(𝑋2 − 𝑛𝑝2)(𝑋1 − 𝑛𝑝1)

  + 𝑛^{−1} (1/𝑝2 + 1/𝑝3)(𝑋2 − 𝑛𝑝2)².
Note that
(𝑋1 − 𝑛𝑝1 )(𝑋2 − 𝑛𝑝2 ) = (𝑋1 − 𝑛𝑝1 )(𝑛 − 𝑋1 − 𝑋3 − 𝑛𝑝2 )

= (𝑋1 − 𝑛𝑝1 )(𝑛(1 − 𝑝2 ) − 𝑋1 − 𝑋3 )

= (𝑋1 − 𝑛𝑝1 )(𝑛(𝑝1 + 𝑝3 ) − 𝑋1 − 𝑋3 )

= −(𝑋1 − 𝑛𝑝1 )(𝑋1 − 𝑛𝑝1 + 𝑋3 − 𝑛𝑝3 )

= −(𝑋1 − 𝑛𝑝1 )2 − (𝑋1 − 𝑛𝑝1 )(𝑋3 − 𝑛𝑝3 ).

Similarly, we obtain that

(𝑋2 − 𝑛𝑝2 )(𝑋1 − 𝑛𝑝1 ) = −(𝑋2 − 𝑛𝑝2 )2 − (𝑋2 − 𝑛𝑝2 )(𝑋3 − 𝑛𝑝3 ).

We then have that

(W𝑛 − 𝐸(W𝑛))^𝑇 C𝑊𝑛^{−1} (W𝑛 − 𝐸(W𝑛))

is equal to

𝑛^{−1} (1/𝑝1 + 1/𝑝3 − 1/𝑝3)(𝑋1 − 𝑛𝑝1)²

  − (𝑛^{−1}/𝑝3)(𝑋1 − 𝑛𝑝1)(𝑋3 − 𝑛𝑝3) − (𝑛^{−1}/𝑝3)(𝑋2 − 𝑛𝑝2)(𝑋3 − 𝑛𝑝3)

  + 𝑛^{−1} (1/𝑝2 + 1/𝑝3 − 1/𝑝3)(𝑋2 − 𝑛𝑝2)²,
or

(1/(𝑛𝑝1))(𝑋1 − 𝑛𝑝1)²

  − (1/(𝑛𝑝3))(𝑋3 − 𝑛𝑝3)[(𝑋1 − 𝑛𝑝1) + (𝑋2 − 𝑛𝑝2)]

  + (1/(𝑛𝑝2))(𝑋2 − 𝑛𝑝2)²,
where
(𝑋3 − 𝑛𝑝3 )[(𝑋1 − 𝑛𝑝1 ) + (𝑋2 − 𝑛𝑝2 )] = (𝑋3 − 𝑛𝑝3 )[𝑋1 + 𝑋2 − 𝑛(𝑝1 + 𝑝2 )]

= (𝑋3 − 𝑛𝑝3 )[𝑛 − 𝑋3 − 𝑛(1 − 𝑝3 )]

= −(𝑋3 − 𝑛𝑝3 )2 .

We have therefore shown that

(W𝑛 − 𝐸(W𝑛))^𝑇 C𝑊𝑛^{−1} (W𝑛 − 𝐸(W𝑛))

is equal to

(1/(𝑛𝑝1))(𝑋1 − 𝑛𝑝1)² + (1/(𝑛𝑝3))(𝑋3 − 𝑛𝑝3)² + (1/(𝑛𝑝2))(𝑋2 − 𝑛𝑝2)²;

that is,

(W𝑛 − 𝐸(W𝑛))^𝑇 C𝑊𝑛^{−1} (W𝑛 − 𝐸(W𝑛)) = ∑_{𝑗=1}^{3} (𝑋𝑗 − 𝑛𝑝𝑗)² / (𝑛𝑝𝑗),

which is the Pearson Chi–Square statistic for the case 𝑘 = 3. Consequently,

𝑄 = (W𝑛 − 𝐸(W𝑛))^𝑇 C𝑊𝑛^{−1} (W𝑛 − 𝐸(W𝑛)).
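The identity just derived can also be checked numerically; the following short Python sketch draws one multinomial(𝑛, 𝑝1, 𝑝2, 𝑝3) vector and compares the quadratic form in W𝑛 with the Pearson statistic 𝑄 computed directly (the values of 𝑛 and the 𝑝𝑖 are arbitrary illustrative choices).

    # Check that (W_n - E(W_n))^T C_{W_n}^{-1} (W_n - E(W_n)) equals the Pearson
    # chi-square statistic Q for one multinomial(n, p1, p2, p3) observation.
    import numpy as np

    rng = np.random.default_rng(seed=3)
    n, p = 500, np.array([0.2, 0.5, 0.3])  # arbitrary illustrative values

    x = rng.multinomial(n, p)              # (X1, X2, X3)
    q_pearson = np.sum((x - n * p) ** 2 / (n * p))

    w = x[:2].astype(float)                # W_n = (X1, X2)
    mean_w = n * p[:2]
    cov_w = n * (np.diag(p[:2]) - np.outer(p[:2], p[:2]))
    q_form = (w - mean_w) @ np.linalg.inv(cov_w) @ (w - mean_w)

    print(q_pearson, q_form)               # the two values agree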

Our goal now is to show that, as 𝑛 → ∞,

(W𝑛 − 𝐸(W𝑛))^𝑇 C𝑊𝑛^{−1} (W𝑛 − 𝐸(W𝑛))

tends in distribution to a 𝜒²(2) random variable.
Observe that the matrix C𝑊𝑛^{−1} is symmetric and positive definite. There-
fore, it has a square root, C𝑊𝑛^{−1/2}, which is also symmetric and positive definite.
Consequently,

𝑄 = (W𝑛 − 𝐸(W𝑛))^𝑇 (C𝑊𝑛^{−1/2})^𝑇 C𝑊𝑛^{−1/2} (W𝑛 − 𝐸(W𝑛)),

which we can write as

𝑄 = (C𝑊𝑛^{−1/2} (W𝑛 − 𝐸(W𝑛)))^𝑇 C𝑊𝑛^{−1/2} (W𝑛 − 𝐸(W𝑛)).
Next, put

Z𝑛 = C𝑊𝑛^{−1/2} (W𝑛 − 𝐸(W𝑛))

and apply the multivariate central limit theorem (see, for instance, [Fer02, The-
orem 5, p. 26]) to obtain that

Z𝑛 →ᴰ Z ∼ normal(0, 𝐼) as 𝑛 → ∞;

that is, the bivariate random vectors, Z𝑛 = C𝑊𝑛^{−1/2} (W𝑛 − 𝐸(W𝑛)), converge in
distribution to the bivariate random vector Z with mean (0, 0)^𝑇 and covariance
matrix

𝐼 = ( 1  0
      0  1 ).

In other words

Z = (𝑍1, 𝑍2)^𝑇,

where 𝑍1 and 𝑍2 are independent normal(0, 1) random variables. Consequently,

(C𝑊𝑛^{−1/2} (W𝑛 − 𝐸(W𝑛)))^𝑇 C𝑊𝑛^{−1/2} (W𝑛 − 𝐸(W𝑛)) →ᴰ Z^𝑇 Z = 𝑍1² + 𝑍2²

as 𝑛 → ∞; that is,

𝑄 = ∑_{𝑖=1}^{3} (𝑋𝑖 − 𝑛𝑝𝑖)² / (𝑛𝑝𝑖)

converges in distribution to a 𝜒²(2) random variable as 𝑛 → ∞, which we wanted
to prove.
The proof of the general case is analogous. Begin with the multivariate
random vector

(𝑋1 , 𝑋2 , . . . , 𝑋𝑘 ) ∼ multinomial(𝑛, 𝑝1 , 𝑝2 , . . . , 𝑝𝑘 ).

Define the random vector

W𝑛 = (𝑋1, 𝑋2, . . . , 𝑋𝑘−1)^𝑇

with covariance matrix C𝑊𝑛. Verify that

𝑄 = ∑_{𝑗=1}^{𝑘} (𝑋𝑗 − 𝑛𝑝𝑗)² / (𝑛𝑝𝑗) = (C𝑊𝑛^{−1/2} (W𝑛 − 𝐸(W𝑛)))^𝑇 C𝑊𝑛^{−1/2} (W𝑛 − 𝐸(W𝑛)).

Next, put

Z𝑛 = C𝑊𝑛^{−1/2} (W𝑛 − 𝐸(W𝑛))

and apply the multivariate central limit theorem to obtain that

Z𝑛 →ᴰ Z ∼ normal(0, 𝐼) as 𝑛 → ∞,

where 𝐼 is the (𝑘 − 1) × (𝑘 − 1) identity matrix, so that

Z = (𝑍1, 𝑍2, . . . , 𝑍𝑘−1)^𝑇,

where 𝑍1, 𝑍2, . . . , 𝑍𝑘−1 are independent normal(0, 1) random variables. Conse-
quently,

(C𝑊𝑛^{−1/2} (W𝑛 − 𝐸(W𝑛)))^𝑇 C𝑊𝑛^{−1/2} (W𝑛 − 𝐸(W𝑛)) →ᴰ Z^𝑇 Z = ∑_{𝑗=1}^{𝑘−1} 𝑍𝑗² ∼ 𝜒²(𝑘 − 1)

as 𝑛 → ∞. This proves that

𝑄 = ∑_{𝑖=1}^{𝑘} (𝑋𝑖 − 𝑛𝑝𝑖)² / (𝑛𝑝𝑖)

converges in distribution to a 𝜒²(𝑘 − 1) random variable as 𝑛 → ∞.
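A small simulation makes the convergence visible; the Python sketch below generates many multinomial samples with 𝑘 = 4 categories and compares the empirical mean and variance of 𝑄 with the mean 𝑘 − 1 and variance 2(𝑘 − 1) of a 𝜒²(𝑘 − 1) distribution (the sample size, probabilities, and number of repetitions are arbitrary illustrative choices).

    # Simulate the Pearson statistic Q for multinomial(n, p) data and compare its
    # empirical mean and variance with those of a chi-square(k - 1) distribution.
    import numpy as np

    rng = np.random.default_rng(seed=4)
    n, reps = 1000, 100_000                # arbitrary illustrative values
    p = np.array([0.1, 0.2, 0.3, 0.4])     # k = 4 categories
    k = len(p)

    counts = rng.multinomial(n, p, size=reps)            # reps-by-k table
    q = np.sum((counts - n * p) ** 2 / (n * p), axis=1)

    print("simulated mean of Q:", q.mean(), " vs. k - 1 =", k - 1)
    print("simulated var of Q: ", q.var(), " vs. 2(k - 1) =", 2 * (k - 1))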


Appendix B

The Variance of the Sample Variance

The main goal of this appendix is to compute the variance of the sample variance
based on a sample from an arbitrary distribution; i.e.,

var(𝑆𝑛²),

where

𝑆𝑛² = (1/(𝑛 − 1)) ∑_{𝑖=1}^{𝑛} (𝑋𝑖 − 𝑋̄𝑛)².

We will come up with a formula based on the second and fourth central
moments of the underlying distribution. More precisely, we will prove that

var(𝑆𝑛²) = (1/𝑛) ( 𝜇4 − ((𝑛 − 3)/(𝑛 − 1)) 𝜇2² ), (B.1)

where 𝜇2 denotes the second central moment, or variance, of the distribution


and 𝜇4 is the fourth central moment.
In general, we define the first moment, 𝜇1, of the distribution of 𝑋 to be

𝜇1 = 𝐸(𝑋),

the mean of the distribution. The second central moment of 𝑋, 𝜇2, is

𝜇2 = 𝐸[(𝑋 − 𝐸(𝑋))²];

in other words, 𝜇2 is the variance of the distribution. Similarly, for any 𝑘 ⩾ 2,
the 𝑘th central moment, 𝜇𝑘, of 𝑋 is

𝜇𝑘 = 𝐸[(𝑋 − 𝐸(𝑋))^𝑘].


First observe that, for each 𝑖 and 𝑗,

(𝑋𝑖 − 𝑋𝑗)² = (𝑋𝑖 − 𝑋̄𝑛 + 𝑋̄𝑛 − 𝑋𝑗)²

           = (𝑋𝑖 − 𝑋̄𝑛)² + 2(𝑋𝑖 − 𝑋̄𝑛)(𝑋̄𝑛 − 𝑋𝑗) + (𝑋̄𝑛 − 𝑋𝑗)²,

so that

∑_𝑖 ∑_𝑗 (𝑋𝑖 − 𝑋𝑗)² = ∑_𝑖 ∑_𝑗 (𝑋𝑖 − 𝑋̄𝑛)² + ∑_𝑖 ∑_𝑗 (𝑋𝑗 − 𝑋̄𝑛)², (B.2)

since

∑_𝑖 ∑_𝑗 (𝑋𝑖 − 𝑋̄𝑛)(𝑋̄𝑛 − 𝑋𝑗) = ∑_𝑖 (𝑋𝑖 − 𝑋̄𝑛) ⋅ ∑_𝑗 (𝑋̄𝑛 − 𝑋𝑗) = 0.

It then follows from (B.2) that

∑_𝑖 ∑_𝑗 (𝑋𝑖 − 𝑋𝑗)² = 𝑛 ∑_𝑖 (𝑋𝑖 − 𝑋̄𝑛)² + 𝑛 ∑_𝑗 (𝑋𝑗 − 𝑋̄𝑛)²

                  = 𝑛(𝑛 − 1)𝑆𝑛² + 𝑛(𝑛 − 1)𝑆𝑛²,

from which we obtain another formula for the sample variance:

𝑆𝑛² = (1/(2𝑛(𝑛 − 1))) ∑_𝑖 ∑_𝑗 (𝑋𝑖 − 𝑋𝑗)²,

which we can also write as

𝑆𝑛² = (1/(𝑛(𝑛 − 1))) ∑_{𝑖<𝑗} (𝑋𝑖 − 𝑋𝑗)². (B.3)
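The identity in (B.3) is easy to verify numerically; the short Python sketch below compares the usual formula for 𝑆𝑛² with the sum of squared pairwise differences for a single simulated sample (the distribution and sample size are arbitrary illustrative choices).

    # Numerical check of (B.3): S_n^2 equals the sum of (X_i - X_j)^2 over pairs
    # i < j, divided by n(n - 1).
    import numpy as np

    rng = np.random.default_rng(seed=5)
    n = 20                                 # arbitrary illustrative sample size
    x = rng.normal(size=n)                 # any underlying distribution works

    s2_usual = x.var(ddof=1)
    pairwise = sum((x[i] - x[j]) ** 2 for i in range(n) for j in range(i + 1, n))
    s2_pairs = pairwise / (n * (n - 1))

    print(s2_usual, s2_pairs)              # the two values agree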

In order to compute var(𝑆𝑛²) we will need to compute the expectation of (𝑆𝑛²)²,
where, according to the formula in (B.3),

(𝑆𝑛²)² = (1/(𝑛²(𝑛 − 1)²)) ∑_{𝑖<𝑗} ∑_{𝑘<ℓ} (𝑋𝑖 − 𝑋𝑗)²(𝑋𝑘 − 𝑋ℓ)².

It then follows by the linearity of the expectation operator that

𝐸[(𝑆𝑛²)²] = (1/(𝑛²(𝑛 − 1)²)) ∑_{𝑖<𝑗} ∑_{𝑘<ℓ} 𝐸[(𝑋𝑖 − 𝑋𝑗)²(𝑋𝑘 − 𝑋ℓ)²]. (B.4)

We will then compute the expectations 𝐸[(𝑋𝑖 − 𝑋𝑗)²(𝑋𝑘 − 𝑋ℓ)²] for all
possible values of 𝑖, 𝑗, 𝑘, ℓ ranging from 1 to 𝑛 such that 𝑖 < 𝑗 and 𝑘 < ℓ. There
are [𝑛(𝑛 − 1)/2]² of those terms contributing to the expectation of (𝑆𝑛²)² in
(B.4). Out of those terms, 𝑛(𝑛 − 1)/2, or (𝑛 choose 2), are of the form

𝐸[(𝑋𝑖 − 𝑋𝑗)⁴], (B.5)
where 𝑖 < 𝑗. We compute the expectations in (B.5) as follows:

𝐸[(𝑋𝑖 − 𝑋𝑗)⁴] = 𝐸[(𝑋𝑖 − 𝜇1 + 𝜇1 − 𝑋𝑗)⁴]

             = 𝐸[(𝑋𝑖 − 𝜇1)⁴ + 4(𝑋𝑖 − 𝜇1)³(𝜇1 − 𝑋𝑗) + 6(𝑋𝑖 − 𝜇1)²(𝜇1 − 𝑋𝑗)²
                   + 4(𝑋𝑖 − 𝜇1)(𝜇1 − 𝑋𝑗)³ + (𝜇1 − 𝑋𝑗)⁴]

             = 𝜇4 + 6𝜇2 ⋅ 𝜇2 + 𝜇4,

where we have used the independence of the 𝑋𝑖's and the definition of the central
moments. We then have that

𝐸[(𝑋𝑖 − 𝑋𝑗)⁴] = 2𝜇4 + 6𝜇2², for 𝑖 ≠ 𝑗. (B.6)
For the rest of the expectations, 𝐸[(𝑋𝑖 − 𝑋𝑗)²(𝑋𝑘 − 𝑋ℓ)²], in (B.4) there are
two possibilities:

(i) the pairs {𝑖, 𝑗} and {𝑘, ℓ} have no index in common, or

(ii) the pairs {𝑖, 𝑗} and {𝑘, ℓ} have exactly one index in common.

In case (i) we obtain, by the independence of the 𝑋𝑖's and the definition of
the central moments, that

𝐸[(𝑋𝑖 − 𝑋𝑗)²(𝑋𝑘 − 𝑋ℓ)²] = 𝐸[(𝑋𝑖 − 𝑋𝑗)²] ⋅ 𝐸[(𝑋𝑘 − 𝑋ℓ)²], (B.7)
where

𝐸[(𝑋𝑖 − 𝑋𝑗)²] = 𝐸[(𝑋𝑖 − 𝜇1 + 𝜇1 − 𝑋𝑗)²]

             = 𝐸[(𝑋𝑖 − 𝜇1)²] + 𝐸[(𝑋𝑗 − 𝜇1)²],

since

𝐸[(𝑋𝑖 − 𝜇1)(𝜇1 − 𝑋𝑗)] = 𝐸(𝑋𝑖 − 𝜇1) ⋅ 𝐸(𝜇1 − 𝑋𝑗) = 0.

Consequently,

𝐸[(𝑋𝑖 − 𝑋𝑗)²] = 2𝜇2.

Similarly,

𝐸[(𝑋𝑘 − 𝑋ℓ)²] = 2𝜇2.

We then have from (B.7) that

𝐸[(𝑋𝑖 − 𝑋𝑗)²(𝑋𝑘 − 𝑋ℓ)²] = 4𝜇2², for 𝑖, 𝑗, 𝑘, ℓ all distinct. (B.8)
There are

(𝑛 choose 4) ⋅ 4!

ways of making the choices of 𝑖, 𝑗, 𝑘, ℓ all distinct. Since we are only interested
in those choices with 𝑖 < 𝑗 and 𝑘 < ℓ, we get a total of

6 (𝑛 choose 4)

choices in case (i). Consequently, the number of choices in case (ii) is

(𝑛 choose 2)² − (𝑛 choose 2) − 6 (𝑛 choose 4).
One of the expectations in case (ii) is of the form

𝐸[(𝑋𝑖 − 𝑋𝑗)²(𝑋𝑖 − 𝑋ℓ)²],

where 𝑗 ≠ ℓ. In this case we have

𝐸[(𝑋𝑖 − 𝑋𝑗)²(𝑋𝑖 − 𝑋ℓ)²]
   = 𝐸[(𝑋𝑖 − 𝜇1 + 𝜇1 − 𝑋𝑗)²(𝑋𝑖 − 𝜇1 + 𝜇1 − 𝑋ℓ)²]

   = 𝐸[ ((𝑋𝑖 − 𝜇1)² + 2(𝑋𝑖 − 𝜇1)(𝜇1 − 𝑋𝑗) + (𝜇1 − 𝑋𝑗)²)
        ⋅ ((𝑋𝑖 − 𝜇1)² + 2(𝑋𝑖 − 𝜇1)(𝜇1 − 𝑋ℓ) + (𝜇1 − 𝑋ℓ)²) ]

   = 𝐸[ (𝑋𝑖 − 𝜇1)⁴ + 2(𝑋𝑖 − 𝜇1)³(𝜇1 − 𝑋ℓ)
        + (𝑋𝑖 − 𝜇1)²(𝑋ℓ − 𝜇1)² + 2(𝑋𝑖 − 𝜇1)³(𝜇1 − 𝑋𝑗)
        + 4(𝑋𝑖 − 𝜇1)²(𝜇1 − 𝑋𝑗)(𝜇1 − 𝑋ℓ)
        + 2(𝑋𝑖 − 𝜇1)(𝜇1 − 𝑋𝑗)(𝑋ℓ − 𝜇1)²
        + (𝑋𝑖 − 𝜇1)²(𝑋𝑗 − 𝜇1)²
        + 2(𝑋𝑖 − 𝜇1)(𝑋𝑗 − 𝜇1)²(𝜇1 − 𝑋ℓ)
        + (𝑋𝑗 − 𝜇1)²(𝑋ℓ − 𝜇1)² ].

Next, use the linearity of the expectation operator, the independence of 𝑋𝑖, 𝑋𝑗
and 𝑋ℓ, and the definition of the central moments to get

𝐸[(𝑋𝑖 − 𝑋𝑗)²(𝑋𝑖 − 𝑋ℓ)²] = 𝜇4 + 𝜇2 ⋅ 𝜇2 + 𝜇2 ⋅ 𝜇2 + 𝜇2 ⋅ 𝜇2

                        = 𝜇4 + 3𝜇2².

We obtain the same value for all the other expectations in case (ii); i.e.,

𝐸[(𝑋𝑖 − 𝑋𝑗)²(𝑋𝑖 − 𝑋ℓ)²] = 𝜇4 + 3𝜇2², for 𝑖, 𝑗, ℓ distinct. (B.9)
It follows from (B.4) and the values of the possible expectations, 𝐸[(𝑋𝑖 − 𝑋𝑗)²(𝑋𝑘 − 𝑋ℓ)²],
we have computed in equations (B.6), (B.8) and (B.9), that

𝐸[𝑛²(𝑛 − 1)²(𝑆𝑛²)²] = ∑_{𝑖<𝑗} ∑_{𝑘<ℓ} 𝐸[(𝑋𝑖 − 𝑋𝑗)²(𝑋𝑘 − 𝑋ℓ)²]

                    = (𝑛 choose 2)(2𝜇4 + 6𝜇2²) + 6 (𝑛 choose 4)(4𝜇2²)

                      + ( (𝑛 choose 2)² − (𝑛 choose 2) − 6 (𝑛 choose 4) )(𝜇4 + 3𝜇2²).

Noting that (𝑛 choose 2) = 𝑛(𝑛 − 1)/2, the above expression simplifies to

𝐸[𝑛²(𝑛 − 1)²(𝑆𝑛²)²] = 𝑛(𝑛 − 1)(𝜇4 + 3𝜇2²) + 𝑛(𝑛 − 1)(𝑛 − 2)(𝑛 − 3)𝜇2²

                      + 𝑛(𝑛 − 1)(𝑛 − 2)(𝜇4 + 3𝜇2²).

Thus, dividing by 𝑛(𝑛 − 1) on both sides of the previous equation we then obtain
that

𝐸[𝑛(𝑛 − 1)(𝑆𝑛²)²] = 𝜇4 + 3𝜇2² + (𝑛 − 2)(𝑛 − 3)𝜇2² + (𝑛 − 2)(𝜇4 + 3𝜇2²)

                  = (𝑛 − 1)𝜇4 + (𝑛² − 2𝑛 + 3)𝜇2².

Dividing by 𝑛 − 1 we then have that

𝑛 𝐸[(𝑆𝑛²)²] = 𝜇4 + ((𝑛² − 2𝑛 + 3)/(𝑛 − 1)) 𝜇2²,

from which we obtain that

𝐸[(𝑆𝑛²)²] = (1/𝑛) ( 𝜇4 + ((𝑛² − 2𝑛 + 3)/(𝑛 − 1)) 𝜇2² ).

Thus,

var(𝑆𝑛²) = 𝐸[(𝑆𝑛²)²] − [𝐸(𝑆𝑛²)]²

         = (1/𝑛) ( 𝜇4 + ((𝑛² − 2𝑛 + 3)/(𝑛 − 1)) 𝜇2² ) − (𝜇2)²,

since 𝑆𝑛² is an unbiased estimator of 𝜇2. Simplifying, we then obtain that

var(𝑆𝑛²) = (1/𝑛) 𝜇4 + ((3 − 𝑛)/(𝑛(𝑛 − 1))) 𝜇2²,

which yields the equation in (B.1).
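Formula (B.1) can also be checked by simulation; the Python sketch below does so for an exponential distribution, estimating 𝜇2 and 𝜇4 from a large calibration sample and comparing the resulting value of (B.1) with the Monte Carlo variance of 𝑆𝑛² (all parameter choices are arbitrary and everything shown is an approximation).

    # Monte Carlo check of (B.1): compare the simulated variance of S_n^2 with
    # (1/n) * (mu4 - (n - 3)/(n - 1) * mu2^2).
    import numpy as np

    rng = np.random.default_rng(seed=6)
    n, reps = 10, 200_000                  # arbitrary illustrative values

    # Estimate the central moments mu2 and mu4 of the underlying distribution
    # (here exponential with mean 1) from a large calibration sample.
    calibration = rng.exponential(scale=1.0, size=2_000_000)
    mu1 = calibration.mean()
    mu2 = np.mean((calibration - mu1) ** 2)
    mu4 = np.mean((calibration - mu1) ** 4)

    samples = rng.exponential(scale=1.0, size=(reps, n))
    s2 = samples.var(axis=1, ddof=1)

    print("simulated var(S_n^2):", s2.var())
    print("formula (B.1):       ", (mu4 - (n - 3) / (n - 1) * mu2 ** 2) / n)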


Bibliography

[CB01] G. Casella and R. L. Berger. Statistical Inference. Duxbury Press, second edition, 2001.

[Fer02] T. S. Ferguson. A Course in Large Sample Theory. Chapman & Hall/CRC, 2002.

[HCM04] R. V. Hogg, A. Craig, and J. W. McKean. Introduction to Mathematical Statistics. Prentice Hall, 2004.

[Pla83] R. L. Plackett. Karl Pearson and the chi–square test. International Statistical Review, 51(1):59–72, April 1983.

[Stu08] Student. The probable error of a mean. Biometrika, 6(1):1–25, March 1908.
