Lecture Notes - 3
In all of the previous sections of these notes, we have focussed on the area of statistical estimation.
In other words, we have tried to use our data (and sometimes a prior belief in the case of Bayesian
approaches) to arrive at either “best guesses” (in the case of point estimation) or “plausible ranges”
(in the case of interval estimation) for some quantitative aspect (often encoded as a parameter
of a distributional family) of a population of interest. In many situations, however, the simple
estimation of a population characteristic is not the final desired outcome of a statistical analysis.
Specifically, we may want to use our estimates to decide whether some previously proposed theory
or statement regarding the population of interest is actually true (or at least is plausible given the
information provided by the observations at hand). This is, of course, the standard framework of
statistical hypothesis testing which is familiar from any introductory unit in basic statistics. In
this final section of these notes, we will briefly discuss the more formal structure of the theory
of hypothesis testing which underlies the standard testing procedures (for population means and
proportions) which are the main staple of any introductory presentation. We start by giving a
formal set of definitions for parametric hypothesis testing and then introduce perhaps the most
important and flexible of all testing procedures, based on the likelihood function. Finally, as we
have done with both point and interval estimation previously, we will briefly investigate procedures
for some standard situations which are not (as heavily) dependent on the parametric assumptions
which will underlie our initial discussions of statistical testing theory.
4.1. Definitions
We shall introduce and define the key aspects of a statistical hypothesis test through a rather simple
example. Suppose that we have purchased a light-bulb based on its advertised claim that the mean
lifetime of such bulbs is at least 1000 hours. If we then observe the lifetime of the actual bulb we
purchased, we have some data with which to assess the advertising claim. This simple scenario is
precisely the framework of statistical hypothesis testing.
4.1.1. Statistical Hypotheses and Decision Rules: More formally, suppose we believe that the
lifetime of the population of bulbs in question is exponentially distributed with mean parameter
θ, so that the probability density associated with X, the random lifetime of a bulb, is given by
fX (x; θ) = θ−1 e−x/θ for some θ ∈ Θ. A statistical hypothesis is then simply a statement regarding
the population of interest or, equivalently in the parametric case described here, the value of the
true population parameter. As such, we can formulate the hypothesis we wish to examine regarding
the population of light-bulbs as H0 : θ ≥ 1000. More generally, we have:
Definition 4.1: Suppose that X1 , . . . , Xn represent a simple random sample from a parametric
family with density function fX (x; θ) for some parameter θ ∈ Θ. A statistical hypothesis is
simply a subset of the parameter space, Θ. Any statistical hypothesis of interest, often termed
the null hypothesis, is associated with a competing alternative hypothesis. As such, a null
hypothesis and its alternative form a partition of the parameter space Θ consisting of the sets:
Θ0 , the set of parameter values which constitute the null hypothesis and Θ1 = Θc0 ∩ Θ, the set
of parameter values which are in the parameter space but not in the null hypothesis collection.
Note that in our light-bulb example, Θ0 = {θ ∈ Θ : θ ≥ 1000}. Moreover, we stress that the
alternative hypothesis is defined as the complement of the null hypothesis within the parameter
space. In other words, if we are considering testing the mean of a normal distribution and our null
hypothesis is H0 : µ = 0, then the general alternative (in the case that the parameter space of µ is
the entire real line) would be the two-sided one, H1 : µ ≠ 0. However, if we restrict the parameter
space to only non-negative values (perhaps because of some external information regarding the
specific problem at hand), then the relevant alternative hypothesis would be the one-sided one,
H1 : µ > 0, since {µ = 0}c ∩ {µ ≥ 0} = {µ > 0}.
A statistical test of the null hypothesis H0 : θ ∈ Θ0 is then just a decision rule based on
the observed data for deciding whether to accept H0 or reject it, and thus accept the alternative
hypothesis, H1 : θ ∈ Θ1 .
Definition 4.2: Suppose that X1 , . . . , Xn represent a simple random sample from a parametric
family with density functions fX (x; θ) for some parameter θ ∈ Θ. Further let X represent the
sample space of the (random) vector X = (X1 , . . . , Xn ). A statistical test of the null hypothesis
H0 : θ ∈ Θ0 is just a decision rule based on a partitioning of the sample space. In particular,
if we partition the sample space X into those outcomes of the observations which would lead
us to reject H0 , often denoted as C and referred to as the rejection region or critical region of
the test, and those observations which would lead us to accept H0 , which is just the collection
C c ∩ X , then a statistical test is simply defined by the decision rule which rejects H0 in favor
of H1 in the case that X = (X1 , . . . , Xn ) ∈ C and accepts H0 otherwise.
So, characterising a statistical test is as simple as defining its associated rejection region. For
instance, in our light-bulb example, we can define the test which rejects H0 if X, the observed
lifetime of our sampled bulb, is less than 1000 hours. In other words, we define a test with critical
region C = {X < 1000}. Indeed, since we have already seen (during our initial discussions of
the concept of sufficiency) that statistics can be viewed as partitioning the sample space of the
observations, it is quite common to define a statistical test in terms of a rejection region which
is just a level set for some statistic T (X1 , . . . , Xn ); in other words, C has the form C = {X ∈
X : T (X) < k} for some prespecified value k. Of course, whether this is a “good” test must be
determined by examining the properties of the testing procedure so determined. This exercise is
the subject of the next section.
4.1.2. Size and the Power Function: Common sense would indicate that the test described in the example of the previous section (namely, rejecting the null hypothesis that the mean lifetime of the bulbs is at least 1000 hours whenever a single observed lifetime is less than 1000 hours) is not a very good test, since it is quite prone to making an error. Indeed, we can assess the quality
of a statistical test by examining the two distinct types of errors that can arise from it. If the
observations fall in the rejection region C when in fact the null hypothesis, H0, is true, then our testing procedure will reject H0 when it should not. Such a mistake is termed a Type I error and occurs with probability Prθ(C) for θ ∈ Θ0. Alternatively, if the observed data values fall outside the rejection region when in fact the null hypothesis is false, then our testing procedure will accept H0 when it should not. Such a mistake is termed a Type II error and occurs with probability Prθ(C^c) for θ ∈ Θ1. Clearly, we would like to use a testing procedure which has a small chance of
making errors of either type.
Of course, to actually assess the probability of making an error, we must make a probability
statement about the observed data values, and these values depend on the true parameter θ. For
instance, suppose that in our light-bulb example, the true mean lifetime of bulbs is exactly 1000
hours. In this case, H0 is indeed true and the chance of a Type I error is
$$\Pr(C) = \Pr{}_{1000}(X < 1000) = \int_0^{1000} \frac{1}{1000}\, e^{-x/1000}\,dx = 1 - e^{-1000/1000} = 0.632.$$
On the other hand, if θ = 500 then H0 is false and the chance of making a Type II error is:
$$\Pr(C^c) = \Pr{}_{500}(X \ge 1000) = \int_{1000}^{\infty} \frac{1}{500}\, e^{-x/500}\,dx = e^{-1000/500} = 0.135.$$
Clearly, there is a strong relationship between Type I and Type II errors. In particular, note that
for a given value of θ, only one type of error can occur (since for any given θ, H0 either is or is not
true). For convenience we generally focus our attention on the so-called power function:
Definition 4.3: The power function of a statistical test of H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1
determined by the rejection region C is given by KC (θ) = P rθ (C). Note that this function
yields the chance of a Type I error when θ ∈ Θ0 and yields the probability of correctly rejecting
H0 when θ ∈ Θ1 ; that is KC (θ) = 1 − P rθ (C c ), which is one minus the probability of a Type
II error when θ ∈ Θ1 . This last probability is often termed the power of the test, since it
represents the likelihood of the test detecting that the null hypothesis is indeed false (i.e., its
power of detection).
Since KC (θ) is just the chance of rejecting H0 when the true parameter value is θ, we would like
to have tests which have values of KC (θ) which are large when θ ∈ Θ1 and which are small when
θ ∈ Θ0 . Of course, since KC (θ) is a function, it can sometimes be difficult to work with directly.
Therefore, we often define:
Definition 4.4: The size (or significance level) of a statistical test is given by
$$\alpha_C = \sup_{\theta\in\Theta_0} K_C(\theta).$$
In other words, the size of a test is the largest possible chance of a Type I error.
In the case of our light-bulb example, it is easy to calculate the power function as KC(θ) = Prθ(X < 1000) = 1 − e^{−1000/θ}. Therefore, the size of the test determined by C = {X < 1000} is easily seen to be
$$\sup_{\theta\in\Theta_0} K_C(\theta) = \sup_{\theta\ge 1000}\left(1 - e^{-1000/\theta}\right) = 1 - e^{-1000/1000} = 0.632.$$
To obtain a test with a prescribed size α, we can instead use a critical region of the form C = {X < kα}, which has size supθ≥1000(1 − e^{−kα/θ}) = 1 − e^{−kα/1000}. Some simple algebra shows that, for this example, we have kα = −1000 ln(1 − α). In particular, if we want a test with size α = 0.05, we should use a test with rejection region C = {X < 51.29}, since −1000 ln(1 − 0.05) = 51.29.
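As a quick numerical check, these size and power calculations are easy to reproduce in code. The following Python sketch (the function names are our own illustration, not part of the notes) assumes numpy is available.

```python
import numpy as np

def cutoff(alpha, theta0=1000.0):
    """Critical value k_alpha for the test C = {X < k} of H0: theta >= theta0,
    based on a single Exponential(mean theta) lifetime."""
    # Size is sup_{theta >= theta0} (1 - exp(-k/theta)) = 1 - exp(-k/theta0).
    return -theta0 * np.log(1.0 - alpha)

def power(k, theta):
    """K_C(theta) = Pr_theta(X < k) for X ~ Exponential(mean theta)."""
    return 1.0 - np.exp(-k / theta)

k = cutoff(0.05)                  # 51.29 hours
print(k)                          # 51.293...
print(power(k, theta=1000.0))     # 0.05, the size of the test
print(power(k, theta=500.0))      # 0.0975, the power at theta = 500
```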
Of course, size (i.e., Type I error) is only one side of the coin. We must also examine our
chance of a Type II error. In particular, we would like to ensure that the power function of our test
is large when θ ∈ Θ1 . Note that if we employ the test based on the critical region C = {X < 51.29}
for our lightbulb example (so that we have a test with size α = 0.05), then the power of this
test when θ = 500 (i.e., when the true mean lifetime is half of the advertised duration) is given
by
$$K_C(500) = \Pr{}_{500}(X < 51.29) = \int_0^{51.29} \frac{1}{500}\, e^{-x/500}\,dx = 1 - e^{-51.29/500} = 0.0975.$$
In other
words, this test has less than a 10% chance of detecting even this drastic departure from the null
hypothesis. Unfortunately, if our power is not as large as we like, then we cannot simply change a
rejection region of the form C = {X < k} to increase the power without simultaneously affecting
the size of our test. Indeed, it is usually the case that simple modifications to a testing procedure
to decrease the chance of a Type II error (or equivalently to increase the power of the test when H0
is false) will increase the size of the test. Our task, then, is to find tests (or equivalently rejection
regions) of a given size which have the best possible power when θ ∈ Θ1 . We note that there are
two potential ways of modifying our test so as to increase its power at the same time as maintaining
its size. The first is to change the sample space X (recall that a statistical test is equivalent to a
partitioning of the sample space). The only way to effectively achieve a change in X is to change
the sample size (and indeed, it should seem reasonable that the easiest way to increase the power of
detection of departures from the null hypothesis is to increase the information available on which
to base a decision). While this is sometimes a possibility in practice, usually we are in the position
of already having gathered our observations and so the size of the sample is a fixed quantity. The
other method of changing our test is, of course, to change our critical region C. We have noted that
critical regions are generally based on level sets of a statistic (though they certainly do not have to
be), and simply changing the level of the set [i.e., changing the value k in the region of the form
{X ∈ X : T (X) < k}] will generally only increase the power at the expense of increasing the size
of the test as well. As such, we must change our critical region (and thus the corresponding test)
more substantially and dramatically, generally by basing it on a different statistic, T (X). Finding
“good” tests based on level sets of statistics T (X) for a fixed sample size is the subject of the
following sections.
However, there is one case in which it is always possible to find a UMP (uniformly most powerful) test, that is, a test whose power at every θ ∈ Θ1 is at least as large as that of any other test of size no larger than α. This case is the subject of the next section.
4.2.1. Simple Hypotheses and the Neyman-Pearson Lemma: A statistical hypothesis which
consists of only a single parameter value is generally termed simple. For instance, if Θ0 for the null
hypothesis H0 : θ ∈ Θ0 consists of the single value θ0 (i.e., Θ0 = {θ0 }), then it is a simple hypothesis.
Alternatively, if a hypothesis is not simple (i.e., it contains more than a single possible value) it is
termed composite. In this section, we will examine the case of a statistical test for which both the
null and alternative hypotheses are simple. In other words, we shall suppose that X1 , . . . , Xn are
a sample from a population characterised by a probability model with density function fX (x; θ)
for θ ∈ Θ where Θ = {θ0 , θ1 } and we shall focus on testing the null hypothesis H0 : θ ∈ Θ0 with
Θ0 = {θ0 }. Note that the structure of Θ means that this is a test of the null hypothesis H0 : θ = θ0
versus the alternative hypothesis H1 : θ = θ1 .
We now demonstrate that the UMP test in the case of two simple hypotheses is based on the
so-called likelihood ratio:
$$\Lambda(X_1,\ldots,X_n) = \frac{L(\theta_0; X_1,\ldots,X_n)}{L(\theta_1; X_1,\ldots,X_n)} = \frac{f_X(X_1;\theta_0)\cdots f_X(X_n;\theta_0)}{f_X(X_1;\theta_1)\cdots f_X(X_n;\theta_1)}.$$
In particular, the test we shall define has a critical region of the form C = {Λ(X1 , . . . , Xn ) ≤ k}.
[NOTE: Since θ0 and θ1 are specified constants in the current testing framework, Λ(X1 , . . . , Xn )
is a statistic.] The idea here is to construct the critical region, C, by collecting together those
elements of the sample space, X , which give the strongest evidence against the null hypothesis. In
this respect, the ratio of the likelihood for any given sample at each of the two possible parameter
values is precisely a relative measure of how plausible the two hypotheses are. In other words, when
Λ(X1 , . . . , Xn ) is very small, this is strong evidence that the observations arose from the alternative
hypothesis rather than the null hypothesis. All that remains, then, is to determine the value of
k so as to ensure that the test is of the desired size α. This can always be accomplished with
an application of (perhaps rather tedious) calculus in the current setting, since we have assumed
that our hypotheses are simple (and thus completely determine the distribution of the data). Of
course, while it should seem intuitively reasonable that the likelihood ratio is a good method of
distinguishing between samples which support the null hypothesis versus samples which support
the alternative hypothesis, in order to be assured that the test based on this statistic is UMP
we need to demonstrate that the likelihood ratio provides the “best” information for making this
distinction. This fact is the subject of the so-called Neyman-Pearson Lemma:
Theorem 4.1: Suppose that X1 , . . . , Xn are a sample from a population characterised by
a probability model with density function fX (x; θ) for θ ∈ Θ where Θ = {θ0 , θ1 } and we
want to test the null hypothesis H0 : θ ∈ Θ0 with Θ0 = {θ0 }. Then the test with critical
region C = {Λ(X1 , . . . , Xn ) ≤ kα }, where Λ(X1 , . . . , Xn ) is the likelihood ratio statistic defined
previously and kα is defined such that P rθ0 (C) = α, is uniformly most powerful among all tests
of size no larger than α.
Proof: We start by considering any other test of size α′ ≤ α, determined by the critical region C′. We need to show that Prθ1(C) ≥ Prθ1(C′), since this demonstrates that the test based on the critical region C has power at least as large as that of any other test of size no larger than α for all θ ∈ Θ1 (and here we see why the fact that the alternative hypothesis is simple makes this situation much easier to deal with than the general case of a composite alternative). Now, we note the following simple probability identities:
$$\Pr{}_{\theta_1}(C) = \Pr{}_{\theta_1}(C\cap C') + \Pr{}_{\theta_1}(C\cap C'^c);$$
$$\Pr{}_{\theta_1}(C') = \Pr{}_{\theta_1}(C'\cap C) + \Pr{}_{\theta_1}(C'\cap C^c).$$
So, we can demonstrate the desired result by simply showing that Prθ1(C ∩ C′^c) − Prθ1(C′ ∩ C^c) ≥ 0. To do so, we first note that for any event E ⊆ C we have Λ(x1, . . . , xn) ≤ kα, and thus:
$$\Pr{}_{\theta_1}(E) = \int_E L(\theta_1; x_1,\ldots,x_n)\,dx_1\cdots dx_n = \int_E \frac{L(\theta_0; x_1,\ldots,x_n)}{\Lambda(x_1,\ldots,x_n)}\,dx_1\cdots dx_n \ge \frac{1}{k_\alpha}\int_E L(\theta_0; x_1,\ldots,x_n)\,dx_1\cdots dx_n = \frac{1}{k_\alpha}\Pr{}_{\theta_0}(E),$$
while for any event E ⊆ C^c we have Λ(x1, . . . , xn) > kα, so that the analogous argument gives Prθ1(E) ≤ (1/kα) Prθ0(E). Therefore:
$$\begin{aligned}
\Pr{}_{\theta_1}(C\cap C'^c) - \Pr{}_{\theta_1}(C'\cap C^c) &\ge \frac{1}{k_\alpha}\Pr{}_{\theta_0}(C\cap C'^c) - \frac{1}{k_\alpha}\Pr{}_{\theta_0}(C'\cap C^c)\\
&= \frac{1}{k_\alpha}\left\{\Pr{}_{\theta_0}(C\cap C'^c) - \Pr{}_{\theta_0}(C'\cap C^c)\right\}\\
&= \frac{1}{k_\alpha}\left\{\Pr{}_{\theta_0}(C\cap C'^c) + \Pr{}_{\theta_0}(C\cap C') - \Pr{}_{\theta_0}(C'\cap C) - \Pr{}_{\theta_0}(C'\cap C^c)\right\}\\
&= \frac{1}{k_\alpha}\left\{\Pr{}_{\theta_0}(C) - \Pr{}_{\theta_0}(C')\right\}\\
&= \frac{1}{k_\alpha}(\alpha - \alpha')\\
&\ge 0.
\end{aligned}$$
So, we now have a UMP test for the case of simple null and alternative hypotheses. Of course, for
any specific instance, we will need to calculate the appropriate value of kα .
Example 4.1: Suppose that X1 , . . . , Xn are a random sample from a normal distribution with
mean µ and unit variance. Further, suppose that we know µ ∈ {0, 1}. We wish to test H0 : µ = 0
versus H1 : µ = 1. Now, the likelihood function in this case is:
$$L(\mu; X_1,\ldots,X_n) = \frac{1}{(2\pi)^{n/2}}\exp\left\{-\frac{1}{2}\sum_{i=1}^{n}(X_i-\mu)^2\right\}.$$
Therefore, the uniformly most powerful test of size α in this case is determined by the rejection
region C = {Λ(X1 , . . . , Xn ) ≤ kα }, where
$$\Lambda(X_1,\ldots,X_n) = \frac{(2\pi)^{-n/2}\exp\left\{-\frac{1}{2}\sum_{i=1}^{n}X_i^2\right\}}{(2\pi)^{-n/2}\exp\left\{-\frac{1}{2}\sum_{i=1}^{n}(X_i-1)^2\right\}} = \exp\left\{-\frac{1}{2}\sum_{i=1}^{n}\left[X_i^2-(X_i-1)^2\right]\right\} = \exp\left\{\frac{n}{2}-\sum_{i=1}^{n}X_i\right\},$$
In other words, we can now see that the UMP test is equivalently determined by a rejection region of the form C = {X̄ ≥ cα}, where cα = 1/2 − (1/n) ln(kα) is now determined so that Pr0(C) = α.
This form of the critical region makes determination of the required constant much easier, since
the distribution of the statistic X̄ is well-known in this case. In particular, when µ = 0, X̄ is normally distributed with mean 0 and variance 1/n, so that the required value of cα can be
determined as:
$$\Pr{}_0(\bar X \ge c_\alpha) = \alpha \implies \Pr{}_0\left(\sqrt{n}\,\bar X \ge c_\alpha\sqrt{n}\right) = \alpha \implies 1 - \Phi\left(c_\alpha\sqrt{n}\right) = \alpha \implies c_\alpha = \Phi^{-1}(1-\alpha)\,\frac{1}{\sqrt{n}}.$$
Of course, we can now determine the value of kα if we so desire, but it is no longer necessary, as
we see that the UMP test is now simply determined by the decision rule which rejects H0 : µ = 0
in favor of H1 : µ = 1 whenever X̄ ≥ Φ⁻¹(1 − α)/√n. As a final aside, we note that this rejection rule can also be written in the form:
$$\frac{\bar X - 0}{\sqrt{1/n}} \ge \Phi^{-1}(1-\alpha),$$
which looks strikingly like the usual one-sided test for a single population mean when the popu-
lation variance is assumed known. Indeed, this is precisely the starting point for demonstrating
the previously stated facts regarding the UMP nature of the usual one-sided t-tests under the
assumption of normally distributed observations.
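To make the resulting decision rule concrete, here is a small Python sketch of the test from Example 4.1 (our own illustration; the function name ump_test is hypothetical, and scipy is assumed for the normal quantile Φ⁻¹).

```python
import numpy as np
from scipy.stats import norm

def ump_test(x, alpha=0.05):
    """UMP test of H0: mu = 0 vs H1: mu = 1 for N(mu, 1) data.

    Rejects H0 when the sample mean exceeds Phi^{-1}(1 - alpha) / sqrt(n).
    """
    x = np.asarray(x)
    n = len(x)
    c_alpha = norm.ppf(1.0 - alpha) / np.sqrt(n)
    return x.mean() >= c_alpha

rng = np.random.default_rng(0)
sample = rng.normal(loc=1.0, scale=1.0, size=25)  # data actually drawn from H1
print(ump_test(sample))  # True with high probability: the power here is
                         # Phi(sqrt(n) - Phi^{-1}(1 - alpha))
```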
We close this section with a few remarks. First, we note that the simple nature of the null and alter-
native hypotheses assumed here by no means requires θ0 and θ1 to be scalar values, just that they be
a single (possibly vector-valued) point in the parameter space Θ. Second, we note that there was no
real requirement that our observed sample contain independent observations or even that the struc-
ture of our testing framework be parametric in the true sense of the word. All that we truly required
was that the two competing hypotheses each completely determined a distinct joint likelihood for
the observed data. In other words, if our hypotheses took the form H0 : fX1 ,...,Xn (x1 , . . . , xn ) =
g0 (x1 , . . . , xn ) and H1 : fX1 ,...,Xn (x1 , . . . , xn ) = g1 (x1 , . . . , xn ) where fX1 ,...,Xn (x1 , . . . , xn ) repre-
sents the joint density function of the observations X1 , . . . , Xn and g0 (x1 , . . . , xn ) and g1 (x1 , . . . , xn )
are two given functions, then the UMP test of size α associated with these competing hypotheses is
determined by a rejection region of the form
$$C = \left\{\frac{g_0(x_1,\ldots,x_n)}{g_1(x_1,\ldots,x_n)} \le k_\alpha\right\},$$
where kα is determined so that
$$\int_C g_0(x_1,\ldots,x_n)\,dx_1\cdots dx_n = \alpha$$
(of course, this last requirement may mean a rather tedious and complicated calculus problem is required before we can actually implement this test). Finally, we
complicated calculus problem is required before we can actually implement this test). Finally, we
note that it may not always be possible to find a value kα which satisfies the strictures of Theorem
4.1. In other words, there may be no value kα such that P rθ0 (C) = α exactly. In particular, this
can occur when the observed data have a discrete distribution.
Example 4.2: Suppose that X1 , . . . , X10 are iid random variables having a Bernoulli distribu-
tion with parameter θ. Further, suppose that we wish to test H0 : θ = 0.5 versus H1 : θ = 0.2.
The likelihood function in this case is just:
$$L(\theta; X_1,\ldots,X_{10}) = \theta^{10\bar X}(1-\theta)^{10(1-\bar X)},$$
so that the critical region of the UMP test takes the form:
$$C = \{\Lambda(X_1,\ldots,X_{10}) \le k_\alpha\} = \left\{\frac{(0.5)^{10\bar X}(0.5)^{10(1-\bar X)}}{(0.2)^{10\bar X}(0.8)^{10(1-\bar X)}} \le k_\alpha\right\} = \left\{\left(\tfrac{5}{8}\right)^{10} 4^{10\bar X} \le k_\alpha\right\} = \{10\bar X \le c_\alpha\},$$
where cα = log₄[(8/5)^{10} kα]. Suppose that we wish to find a UMP test of size α = 0.01. Since 10X̄ has a binomial distribution with parameters n = 10 and p = 0.5 under the null hypothesis, we see that we must find cα such that:
$$\sum_{i=0}^{c_\alpha}\binom{10}{i}(0.5)^{10} = 0.01.$$
However, Pr0.5(10X̄ = 0) = (0.5)^{10} = 0.00098, while Pr0.5(10X̄ ≤ 1) = 11(0.5)^{10} = 0.01074. Therefore, Pr0.5(C) < 0.01 for any choice cα < 1 and Pr0.5(C) > 0.01 for any choice cα ≥ 1. In other words, there is no possible value of cα which makes the probability of the rejection region exactly equal to 0.01.
In such cases, however, while there may be no UMP test of a specific size α (if kα does not exist
for this size), there will always be a UMP test for some collection of sizes α1 , α2 , . . . , and we can
then pick the UMP test with the size closest to our desired size α. Indeed, in Example 4.2, we can
find a UMP test of size α = 0.0107421875 which is rather close to 0.01.
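The granularity of the attainable sizes in Example 4.2 can be seen by enumerating the binomial tail probabilities directly; a minimal Python sketch, assuming scipy is available:

```python
from scipy.stats import binom

# Attainable sizes for the test C = {10*Xbar <= c} when 10*Xbar ~ Bin(10, 0.5)
for c in range(11):
    print(c, binom.cdf(c, 10, 0.5))
# c = 0 gives size 0.00098 and c = 1 gives 0.01074: no c yields exactly 0.01,
# and 0.0107421875 is the attainable size closest to the desired 0.01.
```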
4.2.2. Generalised Likelihood Ratio Tests: In the previous section, we were able to find a UMP
test in the case of simple null and alternative hypotheses. Of course, we have already noted that
such an endeavour is generally not possible in the case of composite hypotheses. Nonetheless, the
result of the Neyman-Pearson lemma does lead quite naturally to the construction of a test in the
case of composite hypotheses. In particular, suppose that X1 , . . . , Xn are a random sample from a
population characterised by a probability model with density function fX (x; θ) for θ ∈ Θ and we
are interested in testing H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1 where Θ0 ∪ Θ1 = Θ is any partition of the
parameter space. Following the general notion of the Neyman-Pearson lemma, we can define the
generalised likelihood ratio:
$$\Lambda_g(X_1,\ldots,X_n) = \frac{\sup_{\theta\in\Theta_0} L(\theta; X_1,\ldots,X_n)}{\sup_{\theta\in\Theta} L(\theta; X_1,\ldots,X_n)} = \frac{\sup_{\theta\in\Theta_0} f_X(X_1;\theta)\cdots f_X(X_n;\theta)}{\sup_{\theta\in\Theta} f_X(X_1;\theta)\cdots f_X(X_n;\theta)}$$
and the generalised likelihood ratio test which has critical region C = {Λg (X1 , . . . , Xn ) ≤ kα }
where, as usual, kα is defined so that the size of the test is α; that is, supθ∈Θ0 P rθ (C) = α. Note
that the generalised likelihood ratio differs from the likelihood ratio statistic defined in Theorem
4.1 not only in the use of supremums (which are now necessary due to the potentially composite
nature of the hypotheses) but also in that the denominator is maximised over the entirety of the
parameter space Θ (rather than over the alternative hypothesis). This difference is employed for
purely mathematical reasons, and a little thought shows that the set C defined by a level set of the
generalised likelihood ratio is typically equivalent to a level set of the statistic:
$$\Lambda'_g(X_1,\ldots,X_n) = \frac{\sup_{\theta\in\Theta_0} L(\theta; X_1,\ldots,X_n)}{\sup_{\theta\in\Theta_1} L(\theta; X_1,\ldots,X_n)}.$$
In other words,
$$\{\Lambda_g(X_1,\ldots,X_n) \le k_\alpha\} = \{\Lambda'_g(X_1,\ldots,X_n) \le k'_\alpha\}$$
for some value k′α, provided the level set of Λg(X1, . . . , Xn) in question has kα < 1. However, from the perspective of constructing a critical region, these are the only level sets of interest, since samples for which Λg(X1, . . . , Xn) = 1 indicate that the null hypothesis is at least as likely as the alternative (since the supremum of the likelihood over the entire parameter space is no smaller than the supremum over the alternative hypothesis subset) and as such would never reasonably be included in a rejection region.
It would be nice if this test based on the generalised likelihood ratio were always the UMP test; however, this is not the case. There are indeed cases where this test can be shown to be the UMP
test (indeed, the usual t-tests for population means and linear regression coefficients in the case of
normally distributed observations turn out to have the form of generalised likelihood ratio tests).
However, a full demonstration of when these tests are UMP is beyond the scope of these notes.
Moreover, even were we able to conclude that the likelihood ratio test was UMP we would still be
in the unenviable position of having to determine the appropriate value kα in the definition of the
critical region C. Fortunately, it turns out that even when the generalised likelihood ratio test is
not UMP, it typically has excellent properties (in particular, it can be shown to have nearly the
largest possible power as the sample size increases towards infinity). As such, we tend to use the
generalised likelihood ratio test in most complex testing situations where no other specific UMP
test is available.
We close this section by noting one other strength of the generalised likelihood ratio test.
Recall that we must determine the value of kα in the definition of the rejection region C =
{Λg (X1 , . . . , Xn ) ≤ kα }. To do so requires the distribution of the statistic Λg (X1 , . . . , Xn ) which
can be quite complicated in general. However, in some specific situations, the distribution of
Λg (X1 , . . . , Xn ) can be accurately approximated. In particular, suppose that θ = (θ1 , . . . , θp ), so
that the probability model parameter is a p-vector. Further suppose that Θ is an open subset
of p-dimensional Euclidean space (for example, the entire p-dimensional Euclidean space itself or
perhaps the positive quadrant, so that Θ = {θ ∈ IRp : θ1 > 0, . . . , θp > 0}) and the null hy-
pothesis we are interested in testing has the form H0 : θ1 = θ1,0 , . . . , θq = θq,0 for some q ≤ p.
In this case, it can be shown that the distribution of −2 ln{Λg (X1 , . . . , Xn )} is approximately
chi-squared with q degrees of freedom. As such, we can construct a test based on the gener-
alised likelihood ratio with an approximate size α which is determined by the rejection region
C = {−2 ln[Λg(X1, . . . , Xn)] ≥ χ²q(1 − α)} = {Λg(X1, . . . , Xn) ≤ exp[−χ²q(1 − α)/2]}, where χ²q(1 − α) denotes the 1 − α quantile of the chi-squared distribution with q degrees of freedom.
Example 4.3: Suppose that we observe the array of independent random variables Xij , i =
1, . . . , I, j = 1, . . . , J where Xij is normally distributed with mean µi and variance σi2 (i.e.,
a standard balanced one-way analysis of variance dataset). An important assumption for the
validity of standard ANOVA procedures is that of homoscedasticity. Suppose we wish to test this
assumption; that is, we wish to test the hypothesis H0 : σ1² = · · · = σI². Note that this hypothesis is not quite in the form required for our chi-squared approximation to the generalised likelihood ratio test. However, a simple reparameterisation from σ1², . . . , σI² to σ1², τ2² = σ2² − σ1², . . . , τI² = σI² − σ1² shows that the null hypothesis can be written in the form H0 : τ2² = 0, . . . , τI² = 0. Now, the likelihood for this situation can readily be calculated as:
$$L(\mu_1,\ldots,\mu_I,\sigma_1^2,\tau_2^2,\ldots,\tau_I^2) = \prod_{i=1}^{I}\prod_{j=1}^{J}\left[2\pi(\sigma_1^2+\tau_i^2)\right]^{-1/2}\exp\left\{-\frac{(X_{ij}-\mu_i)^2}{2(\sigma_1^2+\tau_i^2)}\right\},$$
where we have defined τ1² = 0. Some straightforward (though tedious) calculus shows that this likelihood is maximised at:
$$\hat\mu_i = \frac{1}{J}\sum_{j=1}^{J}X_{ij} = \bar X_i; \qquad \hat\sigma_1^2 = \frac{1}{J}\sum_{j=1}^{J}(X_{1j}-\bar X_1)^2; \qquad \hat\tau_i^2 = \frac{1}{J}\sum_{j=1}^{J}(X_{ij}-\bar X_i)^2 - \hat\sigma_1^2,$$
while under the null hypothesis (where all I groups share a common variance) the likelihood is maximised at µ̂i = X̄i and σ̂0² = (IJ)⁻¹ Σᵢ Σⱼ (Xij − X̄i)². Writing σ̂i² = σ̂1² + τ̂i² for the unrestricted variance estimates, the ratio of the two maximised likelihoods yields −2 ln[Λg] = IJ ln σ̂0² − J Σᵢ ln σ̂i². Finally, then, since the null hypothesis specifies q = I − 1 of the parameters, we see that the generalised likelihood ratio test with approximate size α is determined by the rejection region:
$$C = \left\{IJ\ln\hat\sigma_0^2 - J\sum_{i=1}^{I}\ln\hat\sigma_i^2 \ \ge\ \chi^2_{I-1}(1-\alpha)\right\}.$$
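A minimal Python sketch of the resulting procedure (our own illustration, assuming the uncorrected statistic −2 ln Λg = IJ ln σ̂0² − J Σᵢ ln σ̂ᵢ² derived above, with MLE-style variance estimates that divide by J rather than J − 1):

```python
import numpy as np
from scipy.stats import chi2

def glr_equal_variances(x, alpha=0.05):
    """Generalised likelihood ratio test of H0: sigma_1^2 = ... = sigma_I^2
    for a balanced one-way layout x of shape (I, J)."""
    I, J = x.shape
    group_vars = x.var(axis=1)      # sigma_hat_i^2, with divisor J (MLE form)
    pooled_var = group_vars.mean()  # sigma_hat_0^2 = (1/(IJ)) sum (X_ij - Xbar_i)^2
    stat = I * J * np.log(pooled_var) - J * np.sum(np.log(group_vars))
    return stat, stat >= chi2.ppf(1.0 - alpha, df=I - 1)

rng = np.random.default_rng(1)
data = rng.normal(size=(4, 30)) * np.array([1.0, 1.0, 1.0, 2.0])[:, None]
print(glr_equal_variances(data))  # the heteroscedastic fourth group should trigger rejection
```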
We close this section by noting that a proof of the chi-squared approximation to the distribution of −2 ln{Λg(X1, . . . , Xn)} is beyond the scope of these notes, but it does follow along the lines of
the argument used in the construction of the asymptotic chi-squared likelihood-based confidence
intervals developed in Section 3.2.2.
4.3. Non-Parametric Tests
4.3.1. Tests for Univariate Samples: Suppose that X1, . . . , Xn represent a random sample from a population characterised by a distribution with CDF F(x), about which we make no parametric assumptions, and that we wish to test the null hypothesis H0 : F(z) = p0 for some specified value z and proportion p0. If we let Y denote the number of positive values in the collection X1 − z, . . . , Xn − z, then Y has a binomial distribution with parameters n and 1 − p, and we can construct a test against the two-sided alternative based on the critical region C = {Y ≤ c1 or Y ≥ c2},
where p is the (unknown) true value of F (z). Of course, we must choose c1 and c2 in order to
achieve a desired size for this test. As such, we need to choose the values of c1 and c2 so that
KC (p0 ) = α. [NOTE: Since Y is clearly a discrete random variable, we will not be able to achieve
all possible sizes; see Example 4.2 and the remarks at the end of Section 4.2.1.] We note that this
test is valid regardless of the underlying distribution F (x). Typically, the value of interest for p0
will be one-half, so that we are testing whether z is the median of the distribution F (x). In such
cases, the test described here is referred to as the sign test, since it can be seen to be based on
the number of positive values among the collection X1 − z, . . . , Xn − z. [NOTE: The version of the
sign test presented here is two-sided, however, it can be easily modified to achieve a test against
either of the one-sided alternatives H1 : F (z) > p0 or H1 : F (z) < p0 . All that is required is a
modification of the critical region to the form C = {Y ≥ c} or C = {Y ≤ c}, respectively.]
Example 4.4: Let X1 , . . . , X10 be a random sample from a population characterised by a
distribution with CDF F (x). Suppose we wish to test whether the median of this distribution is
equal to 72; that is, we wish to test H0 : F (72) = 0.5 against the two-sided alternative. Further,
suppose that we would like a test of size α = 0.07. Some simple calculation shows:
$$\sum_{i=0}^{0}\binom{10}{i}(0.5)^i(1-0.5)^{10-i} = 0.00097656; \qquad \sum_{i=0}^{1}\binom{10}{i}(0.5)^i(1-0.5)^{10-i} = 0.01074219;$$
$$\sum_{i=0}^{2}\binom{10}{i}(0.5)^i(1-0.5)^{10-i} = 0.05468750; \qquad \sum_{i=8}^{10}\binom{10}{i}(0.5)^i(1-0.5)^{10-i} = 0.05468750;$$
$$\sum_{i=9}^{10}\binom{10}{i}(0.5)^i(1-0.5)^{10-i} = 0.01074219; \qquad \sum_{i=10}^{10}\binom{10}{i}(0.5)^i(1-0.5)^{10-i} = 0.00097656.$$
So, we see that it is not possible to choose a rejection region such that the size of the test is
precisely 0.07. However, we can choose either C = {Y ≤ 2 or Y ≥ 9} or C = {Y ≤ 1 or Y ≥ 8}
and arrive at a test which has size 0.0547 + 0.0107 = 0.0654 which is reasonably close to the
desired level.
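The binomial tail sums of Example 4.4 are a one-liner with scipy's binomial CDF; the following sketch (our own) reproduces the attainable sizes:

```python
from scipy.stats import binom

n = 10
# Tail probabilities of Y ~ Bin(10, 0.5) under H0: F(72) = 0.5
for c in range(3):
    print(c, binom.cdf(c, n, 0.5))          # Pr(Y <= c): 0.00098, 0.01074, 0.05469
for c in (8, 9, 10):
    print(c, 1 - binom.cdf(c - 1, n, 0.5))  # Pr(Y >= c): 0.05469, 0.01074, 0.00098

# Size of the rejection region C = {Y <= 2 or Y >= 9}:
print(binom.cdf(2, n, 0.5) + (1 - binom.cdf(8, n, 0.5)))  # 0.0654
```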
The sign test is remarkably flexible, making essentially no assumptions regarding the underlying
distribution F (x). However, it can be shown that its power (i.e., the probability of it detecting
that the null hypothesis is actually false) is quite low (and indeed, calculating the power when
F (z) = p1 for some value p1 ≠ p0 is a straightforward calculation again involving the binomial
distribution). This lack of power should not be very surprising, as the sign test is only based
on whether the given observations are larger than the proposed median z, and ignores how far
above or below z the observations were. As such, the sign test tends to ignore useful information
contained in the sample. It does this in order to avoid making various assumptions regarding the
underlying distribution F (x), and this is a common theme for non-parametric tests; namely, they
give up power in order to avoid making parametric assumptions. This is not an advisable thing to
do if we truly believe in a given set of parametric assumptions. However, if we do not believe in
any parametric framework, then using the non-parametric approach seems a more prudent way to
proceed. Nonetheless, the loss of information inherent in the sign test seems rather dramatic, and
it can often be improved upon without requiring parametric assumptions to be made.
We saw that the sign test for the median was based on the statistic Y , the number of values in
the collection X1 − z, . . . , Xn − z which were positive. This approach essentially ignores the size of
the deviation between the observation and the proposed null hypothesis median value z. It turns out
that it is possible to retain some of the information contained in the size of these differences without
reverting to a parametric approach. In particular, suppose we define the quantities Zi = |Xi − z|
and let Ri be the rank of Zi in an ordered list of the values Z1 , . . . , Zn . For example, if n = 3 and
Z2 < Z3 < Z1 , then we would have R1 = 3 (since Z1 is the largest of the Zi ’s) while R2 = 1 and
R3 = 2. Finally, we define
$$s_i = \begin{cases} -1 & \text{if } X_i < z \\ \phantom{-}0 & \text{if } X_i = z \\ \phantom{-}1 & \text{if } X_i > z. \end{cases}$$
A test of the null hypothesis H0 : F(z) = 0.5 can then be constructed based on the level sets of the so-called Wilcoxon signed-rank statistic, $W = \sum_{i=1}^{n} s_i R_i$. The idea is that if z truly is
the median of the population, then the ranks Ri will be evenly dispersed among the positive and
negative Xi − z values, and thus the statistic W will tend to be near zero. On the other hand,
if z is not the true median, then the large deviations from z will tend to congregate on one side
of z or the other, meaning that more of the large ranks will go with either the positive Xi − z
values (if the true median is larger than z) or the negative Xi − z values (if the true median is
smaller than z). In either case, the value of the statistic W will tend to be far from zero (in either
direction). Therefore, we can construct a test against the two-sided alternative with rejection region
C = {W ≤ c1 or W ≥ c2 }. Again, of course, we must determine the values c1 and c2 so as to ensure
that the size of our test is the desired value, α (and again, there are the obvious one-sided versions
of this test). Unfortunately, unlike the sign test, the distribution of the statistic W is no longer as
simple as the binomial distribution of Y . Nonetheless, the distribution of W under H0 can indeed
be computed directly (and tables of its distribution for small sample sizes exist). Moreover, it can
further be shown that the distribution of W under H0 is approximately normal with mean zero and variance $\sum_{i=1}^{n} i^2 = \frac{1}{6}n(n+1)(2n+1)$ when the sample size n is large. The demonstration of this fact is beyond the scope of these notes; however, we do note that $W = \sum_{i=1}^{n} s_i R_i$ has the form of a sum, and thus it is not overly surprising that its distribution can be approximated by a normal distribution.
Example 4.5: Suppose that we observe the following 20 data values:
94.1, 93.3, 91.2, 93.0, 104.8, 100.6, 110.4, 94.1, 95.2, 102.1,
92.9, 102.7, 111.5, 88.4, 88.7, 105.0, 94.0, 99.1, 109.5, 97.3,
and we wish to construct an α = 0.01 level test of H0 : F (95) = 0.5 versus the two-sided
alternative. So, the desired critical region has the form C = {W ≤ c1 or W ≥ c2 }, and we need
to choose c1 and c2 so that PrH0(C) = 0.01 (at least approximately). Using the fact that, under H0, W is approximately normally distributed with mean zero and variance (1/6)(20)(21)(41) = 2870, we see that
$$\Pr{}_{H_0}(C) \approx \Phi\left(\frac{c_1}{\sqrt{2870}}\right) + 1 - \Phi\left(\frac{c_2}{\sqrt{2870}}\right),$$
and thus choosing c2 = −c1 = 2.575√2870 = 137.95 yields a test with size 2[1 − Φ(2.575)] ≈ 0.01.
[NOTE: These are certainly not the only possible choices for c1 and c2 , but the symmetry of
the resulting rejection region seems a sensible feature.] Now, to actually implement the test on
the given data, we note that the Xi − 95 values are:
−0.9, −1.7, −3.8, −2.0, 9.8, 5.6, 15.4, −0.9, 0.2, 7.1,
−2.1, 7.7, 16.5, −6.6, −6.3, 10.0, −1.0, 4.1, 14.5, 2.3.
The ranks of the absolute values of this collection are:
2.5, 5, 9, 6, 16, 11, 19, 2.5, 1, 14,
7, 15, 20, 13, 12, 17, 4, 10, 18, 8.
[NOTE: In the case of tied values, we simply assign the average rank; for example, the two
absolute values of 0.9 are the second and third smallest, so each is assigned a rank of 2.5. It
should be noted, however, that if there are a large number of tied observations, the normal
approximation to the distribution of W can become poor, and the procedure described here
would need to be modified.] So, we can now calculate the Wilcoxon signed-rank statistic as:
W = −2.5 − 5 − 9 − 6 + 16 + 11 + 19 − 2.5 + 1 + 14
− 7 + 15 + 20 − 13 − 12 + 17 − 4 + 10 + 18 + 8
= 88.
Since 88 ∉ C, we do not reject the null hypothesis. Of course, if we were to change the size
of our test to α = 0.1, then the rejection region would need to change accordingly. A simple
calculation (left as an exercise) shows that (assuming we wish to maintain the symmetric aspect
of our rejection region), the new critical region is given by C = {W ≤ −88.13 or W ≥ 88.13}.
Again, we see that 88 ∉ C, but this time it is a very near thing. Indeed, we recall from our introductory units in statistics that the p-value of a testing procedure is the smallest size α for which
the observed data falls in the rejection region. As such, we see that the p-value associated with
this Wilcoxon signed-rank test for the observed data is very near to 0.1.
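The signed-rank computation in Example 4.5 can be automated as follows; this is our own Python sketch, using scipy's rankdata, which assigns averaged ranks to ties exactly as described in the NOTE above.

```python
import numpy as np
from scipy.stats import rankdata, norm

def signed_rank(x, z):
    """Wilcoxon signed-rank statistic W = sum_i s_i R_i for H0: F(z) = 0.5."""
    d = np.asarray(x) - z
    ranks = rankdata(np.abs(d))   # tied absolute values receive averaged ranks
    return np.sum(np.sign(d) * ranks)

x = [94.1, 93.3, 91.2, 93.0, 104.8, 100.6, 110.4, 94.1, 95.2, 102.1,
     92.9, 102.7, 111.5, 88.4, 88.7, 105.0, 94.0, 99.1, 109.5, 97.3]
W = signed_rank(x, z=95.0)
n = len(x)
sd = np.sqrt(n * (n + 1) * (2 * n + 1) / 6.0)   # sqrt(2870) under H0
print(W)                                        # 88.0
print(norm.ppf(0.995) * sd)                     # 137.95: the alpha = 0.01 cut-off
print(2 * (1 - norm.cdf(abs(W) / sd)))          # approximate two-sided p-value, near 0.1
```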
We close this section by noting that the Wilcoxon signed-rank test generally has much better power
than the sign test and still does not require parametric assumptions. Of course, the power of the
Wilcoxon signed-rank test is still generally less than that of parametric procedures, provided we
believe that the required parametric assumptions are indeed true.
4.3.2. Tests for Bivariate Samples: In this section, we shall assume that we have observed two random samples X1, . . . , Xn and Y1, . . . , Ym from two independent, univariate populations characterised by distributions having CDFs F(x) and G(y), respectively. As in the previous section, we shall not make any further assumptions regarding the forms of F(x) or G(y). We shall then be interested in testing whether the two populations are characterised by the same distribution. In other words, we wish to test the null hypothesis H0 : F(z) = G(z) for all z against the two-sided alternative H1 : F(z) ≠ G(z) for some z. We shall discuss several different tests for this situation.
The first test we shall discuss is essentially just a test for the equality of medians (and as such is usually referred to as the median test). The idea of the test is that if the two populations have the same distribution (or indeed, just the same median), then when the two samples are combined and the median of this combined collection is calculated, we should expect half of each sample to fall below the combined median. Specifically, then, we define Z = median{X1, . . . , Xn, Y1, . . . , Ym} and
$$V = \sum_{i=1}^{n} I_{\{X_i < Z\}},$$
so that V is just the number of Xi's which fall below the combined median Z. Clearly, if H0 is true then we would expect V to be close to n/2, and thus we shall construct our test with a rejection region of the form C = {|V − n/2| ≥ k}. All that remains is to determine the value of k to achieve a desired size α for our test. If we assume that the CDFs F(x) and G(y) are continuous, so that
the chance of any of the Xi ’s or Yj ’s being equal is zero (i.e., there is no chance of any ties in the
combined collection of observations), then it can easily be seen that in order for V = v, we must
choose v out of the n Xi's and 0.5(m + n) − v of the m Yj's to be less than Z. This is precisely the structure of the so-called hypergeometric distribution. In other words, we have:
$$\Pr{}_{H_0}(V = v) = \frac{\binom{n}{v}\binom{m}{0.5(m+n)-v}}{\binom{m+n}{0.5(m+n)}},$$
where we must be careful to interpret $\binom{m}{0.5(m+n)-v}$ to be zero when 0.5(m + n) − v < 0.
[NOTE: In the case that 0.5(m+n) is not an integer then, by convention, we simply use 0.5(m+n−1)
instead, the idea being that we have thus ignored the observed value equal to Z, the combined
median, which will always exist in a combined sample of odd size.] Now, if m and n are small enough,
an exact calculation of the hypergeometric probabilities can be performed and an appropriate value
for k can then be chosen to yield a test of the desired size. When m and n are large, however, such
calculations are extremely time consuming. As such, we can approximate the distribution of V with
a normal distribution when m and n are large (in this particular case, the normal approximation
is quite accurate as soon as m, n > 10). It is a reasonably straightforward exercise to show that:
$$E_{H_0}(V) = \frac{n}{2}; \qquad \mathrm{Var}_{H_0}(V) = \frac{mn}{4(m+n-1)} = \sigma_V^2.$$
A simple calculation then shows that we should choose k = Φ⁻¹(1 − α/2)σV. As a specific example, suppose that we have m = n = 20. In this case, σV² = 400/(4 · 39) = 2.5641. Thus, a test with size α = 0.05 would reject H0 whenever V differed from n/2 = 10 by more than k = 1.96√2.5641 = 3.14; that is, we will reject H0 if more than 13 or fewer than 7 Xi's fall below the combined median, Z.
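Both the exact hypergeometric distribution of V and its normal approximation are readily computed; a Python sketch for the m = n = 20 illustration above (our own, using scipy's hypergeom with its M, n, N parameterisation):

```python
import numpy as np
from scipy.stats import hypergeom, norm

n, m = 20, 20
half = (n + m) // 2
# Under H0, V ~ Hypergeometric: v of the n X's land among the half smallest values
V = hypergeom(M=n + m, n=n, N=half)
print(V.mean(), V.var())           # 10.0 and 2.5641, matching n/2 and mn/(4(m+n-1))
k = norm.ppf(0.975) * np.sqrt(V.var())
print(k)                           # 3.14: reject when |V - 10| > 3.14
# Exact size of the region {V <= 6 or V >= 14}, for comparison:
print(V.cdf(6) + V.sf(13))
```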
Of course, just as the sign test ignored the actual values of the observed data, the median test
described above does not take into account the size of the Xi ’s and Yj ’s but only the number of these
values which fall below the observed median of the combined sample. In the case of the univariate
framework, we saw that the sign test could be improved upon by incorporating the ranks of the
data, and this led to the Wilcoxon signed-rank test. In the current setting, a very similar approach
can be taken to develop an improved test for the null hypothesis H0 : F (z) = G(z) for all z. We
again start by considering the combined sample, and define Ri (i = 1, . . . , n) to be the rank of Xi
in the combined sample. For example, if we have observed the samples X1 = 1, X2 = 6, X3 = 2
and Y1 = 0, Y2 = 4, then the ordered combined collection is Y1 , X1 , X3 , Y2 , X2 and thus R1 = 2,
R2 = 5 and R3 = 3. The Mann-Whitney test can then be determined by defining a rejection region based on the statistic $T = \sum_{i=1}^{n} R_i$ [NOTE: this test is also sometimes referred to as the Wilcoxon rank-sum test]. It is a reasonably straightforward (though tedious) exercise to show that, under the null hypothesis, the mean and variance of T are:
$$E_{H_0}(T) = \frac{n(n+m+1)}{2}; \qquad \mathrm{Var}_{H_0}(T) = \frac{nm(n+m+1)}{12}.$$
Now, if the observed value of T is far from its expectation under the null hypothesis, then this
is evidence that we should reject the null hypothesis. Indeed, the rejection region for the Mann-
Whitney test is of the form C = {|T −EH0 (T )| ≥ k}. All that remains is to appropriately determine
the value k to ensure the desired size of the test. For the simple data set with n = 3 and m = 2
given earlier, we note that the observed value of T is 2 + 5 + 3 = 10. To determine the distribution
of T in this case, we note that for n = 3 and m = 2, there are 10 possible general arrangements for
the combined values in terms of the sample to which the values belong; that is, the ordered sample
could have been associated with the arrangements:
xxxyy, xxyxy, xxyyx, xyxxy, xyxyx, xyyxx, yxxxy, yxxyx, yxyxx, yyxxx
(e.g., the given data are in the arrangement yxxyx). For each of these 10 arrangements, the
associated values of T are 6, 7, 8, 8, 9, 10, 9, 10, 11, and 12. Under the null hypothesis, each of
these 10 arrangements is equally likely, and thus we can calculate:
$$\Pr{}_{H_0}(T \le 6) = \frac{1}{10}, \quad \Pr{}_{H_0}(T \le 7) = \frac{1}{5}, \quad \Pr{}_{H_0}(T \ge 11) = \frac{1}{5}, \quad \Pr{}_{H_0}(T \ge 12) = \frac{1}{10}.$$
As such, if we want a test with size α = 0.2, we could use the rejection region C = {T = 6 or T =
12}. [NOTE: Again, we see that it is not always possible to construct tests for all possible sizes.]
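For small samples, the exact null distribution of T can be obtained by brute-force enumeration of the arrangements, exactly as above; a short Python sketch (our own):

```python
from itertools import combinations

n, m = 3, 2
positions = range(1, n + m + 1)
# T is the sum of the ranks occupied by the X-sample; under H0 every
# allocation of the n X-labels to the n + m positions is equally likely.
t_values = sorted(sum(c) for c in combinations(positions, n))
print(t_values)                                        # [6, 7, 8, 8, 9, 9, 10, 10, 11, 12]
print(sum(t <= 6 for t in t_values) / len(t_values))   # Pr(T <= 6) = 0.1
print(sum(t >= 12 for t in t_values) / len(t_values))  # Pr(T >= 12) = 0.1
```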
Unfortunately, the exact distribution of the statistic T is quite complicated when m and n are
reasonably large. However, as for the Wilcoxon signed-rank statistic, it turns out that, under the
null hypothesis, the distribution of T is well-approximated by a normal distribution with mean
EH0 (T ) and variance V arH0 (T ). As such, we can define a rejection region for the Mann-Whitney
test with approximate size α by setting k = Φ⁻¹(1 − α/2)√VarH0(T).
Example 4.6: Suppose that we observe two samples of size n = 10 and m = 9 as follows:
X : 4.3, 5.9, 4.9, 3.1, 5.3, 6.4, 6.2, 3.8, 7.1, 5.8,
Y : 5.5, 7.9, 6.8, 9.0, 5.6, 6.3, 8.5, 4.6, 7.5.
The sorted combined sample (along with whether each observation was an Xi or a Yj) is:
3.1(x), 3.8(x), 4.3(x), 4.6(y), 4.9(x), 5.3(x), 5.5(y), 5.6(y), 5.8(x), 5.9(x),
6.2(x), 6.3(y), 6.4(x), 6.8(y), 7.1(x), 7.5(y), 7.9(y), 8.5(y), 9.0(y).
Therefore, the observed value of the rank-sum statistic is T = 1 + 2 + 3 + 5 + 6 + 9 + 10 + 11 + 13 + 15 = 75. Furthermore, we see that under the null hypothesis, the mean and variance of T are
given by
$$E_{H_0}(T) = \frac{10(10+9+1)}{2} = 100; \qquad \mathrm{Var}_{H_0}(T) = \frac{10(9)(10+9+1)}{12} = 150.$$
Therefore, a size α = 0.05 test is determined by the critical region C = {|T − 100| ≥ k}, where k = 1.96√150 = 24.005. So, since |T − 100| = |75 − 100| = 25, we reject the null hypothesis (of course, if we had desired a test of size α = 0.01 we would not have rejected H0, since in this case the appropriate value of k would have been 2.575√150 = 31.537). Finally, by way of comparison, we note that the observed number of Xi's less than the combined sample median of 5.9 is V = 6. Using the normal approximation to the distribution of V, we see that a median test with size α = 0.05 is determined by the critical region C = {|V − 5| ≥ 1.96√1.25} = {|V − 5| ≥ 2.19}, since
$$E_{H_0}(V) = \frac{10}{2} = 5; \qquad \mathrm{Var}_{H_0}(V) = \frac{9(10)}{4(9+10-1)} = 1.25.$$
Thus, since |V − 5| = 1 in this case, we do not reject the null hypothesis. This is a nice example
of how the median test is less powerful than the Mann-Whitney test (not an overly surprising
result given that the median test ignores more information contained in the observed data than
does the Mann-Whitney test).
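A Python sketch of the Mann-Whitney calculation in Example 4.6 (our own illustration; rankdata again performs the combined-sample ranking):

```python
import numpy as np
from scipy.stats import rankdata, norm

x = [4.3, 5.9, 4.9, 3.1, 5.3, 6.4, 6.2, 3.8, 7.1, 5.8]
y = [5.5, 7.9, 6.8, 9.0, 5.6, 6.3, 8.5, 4.6, 7.5]
n, m = len(x), len(y)

ranks = rankdata(np.concatenate([x, y]))
T = ranks[:n].sum()                               # ranks of the X's in the combined sample
mean_T = n * (n + m + 1) / 2.0                    # 100
sd_T = np.sqrt(n * m * (n + m + 1) / 12.0)        # sqrt(150)
print(T)                                          # 75.0
print(abs(T - mean_T) >= norm.ppf(0.975) * sd_T)  # True: reject at size 0.05
```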
As noted at the end of Example 4.6, the Mann-Whitney test is generally more powerful than the
median test since it takes into account, to some degree, the relative sizes of the observed data values.
Of course, it only takes account of these sizes through the use of ranks, and thus still ignores some
potentially relevant information. As such, we close this section by introducing another testing
procedure for the null hypothesis H0 : F (z) = G(z) for all z which does take into account the
actual observed values of the data directly.
Suppose that we have observed two independent samples, X = (X1, . . . , Xn) and Y = (Y1, . . . , Ym), and that we have settled on some statistic T = T(X, Y) which can be used to investigate the potential differences between the two samples (e.g., the most common choice would be X̄ − Ȳ, though many other choices are possible). As in the parametric setting, we then construct a test based on a rejection region of the form C = {T ≤ k1 or T ≥ k2} for values of k1 and k2 chosen to ensure that the size of the resulting test is some desired value α.
would use our chosen underlying probability model to determine the value of k1 and k2 . However,
in the current setting, we have avoided making parametric assumptions. Nonetheless, it is possible
to determine values of k1 and k2 under the assumption of the null hypothesis of equal distributions
within the two populations under study. We note that if H0 is true, then the observation labels (i.e., whether the observation is associated with the X-sample or the Y-sample) are equally likely to have arisen in any of the $\binom{n+m}{n}$ possible allocations of the observed values to X and Y samples.
As such, we can define a new data set X′ = (X′1, . . . , X′n), Y′ = (Y′1, . . . , Y′m), which is just a permutation of the original samples (so that values in the original X-sample may now appear in the new Y′-sample instead), and a new test statistic value T′ = T(X′, Y′). If we calculate values of T′ for all of the possible re-allocations of the data labels, then we can approximate the probability PrH0(C) by simply calculating the proportion of these T′ values which fall in the set C. Or, conversely, we can construct a rejection region with (approximately) the desired size α by selecting k1 and k2 to be the lower and upper α/2-quantiles of the observed distribution of the T′ values; that is, if we represent the ordered collection of the N = $\binom{n+m}{n}$ values of T′ as T′[1], . . . , T′[N], then k1 = T′[Nα/2] and k2 = T′[N(1−α/2)] [where, of course, we must round off the values Nα/2 and
N(1 − α/2) to the nearest integer value]. The test so constructed is often referred to as a permutation test, due to the process of permuting the sample labels on which it is based. We stress that, despite the fact that we are using the actual observed values of our data in the construction of the test, there are no parametric assumptions being employed. The actual implementation of the testing process is easiest to understand by examination of a simple example:
Example 4.7: Suppose that we observed the two datasets
X1 = 4, X2 = 3, X3 = 7; Y1 = 1, Y2 = 9.
Taking T = X̄ − Ȳ, the observed value of the statistic is T = 14/3 − 5 = −0.33. There are $\binom{5}{3}$ = 10 possible re-allocations of the five observed values into an X′-sample of size 3 and a Y′-sample of size 2, and the associated T′ values are:
−5.33, −2.83, −2.00, −1.17, −0.33, −0.33, 1.33, 2.17, 3.83, 4.67.
Since each of these T′ values is equally likely under the null hypothesis, we see that the region C = {T ≤ −5.33 or T ≥ 4.67} has an approximate size of 0.2 (since 2 of the ten re-allocations yield T′ values which lie in C). As such, we have constructed a test with size α = 0.2. Since our observed value is T = −0.33 ∉ C, we see that we cannot reject the null hypothesis H0 : F(z) = G(z) for all z. Of course, we could just as easily have used some other statistic T, say the difference in medians. In general, the choice of statistic will depend upon how we believe the two populations are likely to differ from one another, and thus is a quite problem-specific issue.
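The full enumeration in Example 4.7 takes only a few lines of Python (our own sketch); as the next paragraph notes, for larger samples the exhaustive enumeration would simply be replaced by a random subset of re-allocations.

```python
from itertools import combinations
import numpy as np

x = [4.0, 3.0, 7.0]
y = [1.0, 9.0]
pooled = np.array(x + y)
n = len(x)

def stat(xs, ys):
    return np.mean(xs) - np.mean(ys)  # T = Xbar - Ybar

t_obs = stat(x, y)  # -0.33
# All C(5,3) = 10 re-allocations of the pooled values into an X'-sample of size 3
t_perm = []
for idx in combinations(range(len(pooled)), n):
    mask = np.zeros(len(pooled), dtype=bool)
    mask[list(idx)] = True
    t_perm.append(stat(pooled[mask], pooled[~mask]))
t_perm = np.sort(t_perm)
print(t_perm)                 # approx. [-5.33, -2.83, -2.0, -1.17, -0.33, -0.33, 1.33, 2.17, 3.83, 4.67]
print(t_perm[0], t_perm[-1])  # k1 and k2 for an (approximate) size 0.2 test
print(t_obs)                  # -0.33, not in C, so H0 is not rejected
```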
In general, when n + m is large, the number of re-allocations of the labels is extremely large (e.g., for the dataset of Example 4.6, where n = 10 and m = 9, there are 92,378 different re-allocations of the data into two samples of appropriate size). In such cases, it is common practice to use only
a random subset of some number B of the possible re-allocations. We note the similarity in this
regard to the idea underlying the bootstrap introduced in Section 2.6.3. Indeed, the bootstrap can
also be used to construct non-parametric hypothesis tests, but we do not discuss this idea here.