Lecture 8: Introduction to Hypothesis Testing
It is often the case that we wish to use data to make a binary decision about some unknown
aspect of nature. For example, we may wish to decide whether or not it is plausible that a
parameter takes some particular value. A frequentist approach to using data to make such
decisions is hypothesis testing, also called significance testing.
Note: There exist Bayesian counterparts of frequentist hypothesis tests, but the two
philosophies differ more substantially for these types of binary decisions than for estimation problems.
8.1 Basic Structure of Hypothesis Tests
A hypothesis test consists of two hypotheses and a rejection region. The rejection region
may be specified via a test statistic and a critical value. We define each of these terms below.
Hypotheses
A hypothesis is any statement about an unknown aspect of a distribution. In a hypothesis
test, we have two hypotheses:
H0, the null hypothesis, and
H1, the alternative hypothesis.
Often a hypothesis is stated in terms of the value of one or more unknown parameters, in which case it is called a parametric hypothesis. Specifically, suppose we have an unknown parameter θ. Then parametric hypotheses about θ can be written in general as H0: θ ∈ Θ0 and H1: θ ∈ Θ1, where Θ0 and Θ1 are disjoint, i.e., Θ0 ∩ Θ1 = ∅. We will typically assume hypotheses to be parametric unless clearly stated otherwise.
Example 8.1.1: Let X1, . . . , Xn be iid observations from a distribution with an unknown mean μ ∈ ℝ. Parametric hypotheses about μ could be H0: μ ≤ 2 and H1: μ > 2. A different set of parametric hypotheses could be H0: μ = 2 and H1: μ ≠ 2.
Rejection Region

A hypothesis test rejects H0 if and only if the observed data X fall in the rejection region R ⊆ S, where S denotes the sample space of X.

Example 8.1.3: Let X ~ Bin(n, θ), where 0 < θ < 1, and consider testing H0: θ = 1/2 versus H1: θ ≠ 1/2. Perhaps the simplest nontrivial test of these hypotheses is to reject H0 if and only if the trials are all successes or all failures, i.e., if and only if X = 0 or X = n. Then the rejection region is R = {0, n}.
Essentially, a hypothesis test is its rejection region, in the sense that two tests of the same
hypotheses based on the same data are identical tests if and only if they have the same
rejection region.
Test Statistic
It is common to write the rejection region R in the form

Rc = {x ∈ S : T(x) ≥ c}  or  Rc = {x ∈ S : T(x) > c},

where T(X) is a test statistic and c is a critical value. Notice that if c1 > c2, then Rc1 ⊆ Rc2. Thus, writing rejection regions in this form allows us to construct a series of nested rejection regions corresponding to the same test statistic.
Example 8.1.4: Let X ~ Bin(n, θ), where 0 < θ < 1, and consider testing H0: θ = 1/2 versus H1: θ ≠ 1/2. A simple test of these hypotheses is to reject H0 if and only if X/n is far enough from 1/2. Then we could state the test statistic and rejection region as

T(X) = |X/n − 1/2|,    Rc = {x ∈ S : T(x) ≥ c}.

For instance, if n = 6, the rejection regions for the various possible critical values c are as follows:
Critical Value     Rejection Region Rc
c ≤ 0              {0, 1, 2, 3, 4, 5, 6}
0 < c ≤ 1/6        {0, 1, 2, 4, 5, 6}
1/6 < c ≤ 1/3      {0, 1, 5, 6}
1/3 < c ≤ 1/2      {0, 6}
1/2 < c            ∅
Thus, we always reject H0 (Rc = S) if c ≤ 0, while we never reject H0 (Rc = ∅) if c > 1/2.
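To make the table concrete, here is a minimal Python sketch (the function names are ours, not from the text) that enumerates Rc for n = 6, one value of c from each interval:

```python
from fractions import Fraction

n = 6

def T(x):
    # Test statistic T(x) = |x/n - 1/2| from Example 8.1.4.
    return abs(Fraction(x, n) - Fraction(1, 2))

def rejection_region(c):
    # R_c = {x in S : T(x) >= c}, where S = {0, 1, ..., n}.
    return {x for x in range(n + 1) if T(x) >= c}

# One representative c from each interval in the table above.
for c in [Fraction(-1), Fraction(1, 6), Fraction(1, 3), Fraction(1, 2), Fraction(2, 3)]:
    print(f"c = {c}: R_c = {sorted(rejection_region(c))}")
```

Running this reproduces the five nested regions in the table, illustrating how a single test statistic generates a whole family of tests.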
Although any rejection region R can be written in the form R = {x ∈ S : T(x) ≥ c} or R = {x ∈ S : T(x) > c} for some test statistic T(X) and some critical value c, it may occasionally be more convenient to express a rejection region in some other form.
Example 8.1.5: Example 9.1.7 of DeGroot & Schervish describes a test statistic Yn and a hypothesis test in which we reject H0 if and only if Yn ≤ 2.9 or Yn ≥ 4.0. On page 533, DeGroot & Schervish claim that the rejection region of this test cannot be written in the form {x ∈ S : T(x) ≥ c}. However, this claim is clearly incorrect, since we can simply define another test statistic Zn = max{2.9 − Yn, Yn − 4.0} and write the rejection region as {x ∈ S : Zn(x) ≥ 0}. However, it is probably more convenient to work with the rejection region in terms of the original Yn, even though it does not fit the standard form.
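A tiny sketch verifying that the Zn reformulation selects exactly the same outcomes (the grid of Yn values is illustrative only):

```python
def reject_original(y):
    # Reject H0 iff Yn <= 2.9 or Yn >= 4.0.
    return y <= 2.9 or y >= 4.0

def reject_via_z(y):
    # Equivalent rule: Zn = max(2.9 - Yn, Yn - 4.0) >= 0.
    return max(2.9 - y, y - 4.0) >= 0

# The two rules agree on a grid of illustrative Yn values.
for y in [2.0, 2.9, 3.0, 3.5, 3.99, 4.0, 5.0]:
    assert reject_original(y) == reject_via_z(y)
print("Both rules define the same rejection region.")
```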
Example 8.1.6: Let X ~ Bin(n, θ), where 0 < θ < 1, and consider testing H0: θ = 1/2 versus H1: θ ≠ 1/2. Clearly the hypothesis tests proposed in Example 8.1.3 and Example 8.1.4 are good since X is more likely to fall in the rejection region if θ ≠ 1/2 than if θ = 1/2.
Example 8.1.7: Let X ~ Bin(n, θ), where 0 < θ < 1, and consider testing H0: θ = 1/2 versus H1: θ ≠ 1/2. A legal hypothesis test is simply to always reject H0. The rejection region of this test is {0, 1, . . . , n}, the entire sample space. Another legal hypothesis test is simply to never reject H0. The rejection region of this test is ∅. However, these two hypothesis tests are obviously a waste of time.
Example 8.1.8: Let X ~ Bin(n, θ), where 0 < θ < 1, and consider testing H0: θ = 1/2 versus H1: θ ≠ 1/2. Suppose we take the test statistic to be T(X) = X and reject H0 if and only if X ≥ c. This is a legal hypothesis test. However, it is not a good test of these hypotheses since Pθ(X ≥ c) is smaller for θ < 1/2 than for θ = 1/2. (Note, however, that it would be a good test of H0: θ = 1/2 versus H1: θ > 1/2.)
Example 8.1.9: Let X ~ Bin(n, θ), where 0 < θ < 1, and consider testing H0: θ = 1/2 versus H1: θ ≠ 1/2. The seemingly perfect test that rejects H0 if and only if θ ≠ 1/2 is not a hypothesis test at all, since it does not specify a rejection region as a subset of the sample space. (It specifies a rule in terms of the parameter value itself, which of course is impossible to apply since the parameter value is unknown.)
8.2 Errors and Power
We now discuss basic properties of hypothesis tests in a probabilistic context. Remember that
hypothesis tests as discussed here are a fundamentally frequentist concept, so probabilities
discussed here are calculated as if the true parameter value θ is fixed but unknown.
Type I and Type II Errors
Recall that an ideal hypothesis test would always fail to reject H0 when it is true and would
always reject H0 when it is false. Then an ideal hypothesis test would be constructed so that
X ∈ R if and only if θ ∈ Θ1. However, since X is random, this goal is typically impossible.
Hence, there is usually some chance that our test will make the wrong decision.
A type I error occurs if we reject H0 when it is true, i.e., if θ ∈ Θ0 and X ∈ R.
A type II error occurs if we fail to reject H0 when it is false, i.e., if θ ∈ Θ1 and X ∉ R.
The following table of possibilities may be helpful:
Truth             Data     Decision            Outcome
H0 (θ ∈ Θ0)       X ∉ R    Fail to Reject H0   Correct Decision
H0 (θ ∈ Θ0)       X ∈ R    Reject H0           Type I Error
H1 (θ ∈ Θ1)       X ∉ R    Fail to Reject H0   Type II Error
H1 (θ ∈ Θ1)       X ∈ R    Reject H0           Correct Decision
Of course, in reality we would not know whether a decision is correct or is an error, because
we would not know the true parameter value θ. However, we can still consider the probability
of each type of error.
If θ ∈ Θ0, then the probability of a type I error is Pθ(X ∈ R).
If θ ∈ Θ1, then the probability of a type II error is Pθ(X ∉ R) = 1 − Pθ(X ∈ R).
The true value of θ is unknown, but these probabilities can be calculated for each possible θ.
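For instance, here is a brief sketch (using the test of Example 8.1.3 with n = 6, an illustrative choice) that evaluates both error probabilities at particular values of θ:

```python
from scipy.stats import binom

n = 6
R = {0, n}  # rejection region of the all-successes-or-all-failures test (Example 8.1.3)

def prob_reject(theta):
    # P_theta(X in R) for X ~ Bin(n, theta).
    return sum(binom.pmf(x, n, theta) for x in R)

# Type I error probability at the null value theta = 1/2:
print("P(type I error):", prob_reject(0.5))        # 2/64 = 0.03125

# Type II error probability at an alternative value, e.g. theta = 0.9:
print("P(type II error):", 1 - prob_reject(0.9))   # ~0.469
```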
Power Function
The power function of a hypothesis test with rejection region R is Power(θ) = Pθ(X ∈ R).
Note: We will write Power(θ) to avoid any notational confusion, but be aware that this notation is nonstandard. Our textbook uses π(θ) for the power function, while another textbook uses β(θ). The latter choice is particularly confusing since many people instead use β to denote the probability of a type II error.
Notice that the power function provides the probabilities of both error types:

Power(θ) = Pθ(X ∈ R) = { P(type I error)        if θ ∈ Θ0,
                         1 − P(type II error)   if θ ∈ Θ1.
Note: When people use the word power in the context of hypothesis tests, they usually mean 1 − P(type II error), i.e., they mean the values of Power(θ) for θ ∈ Θ1. The definition of the power function above is simply the logical extension to θ ∈ Θ0 as well. Note, however, that it is actually bad if Power(θ) is large for θ ∈ Θ0.
Ideally, we would have

Power(θ) = 1Θ1(θ) = { 0   if θ ∈ Θ0,
                      1   if θ ∈ Θ1,
but we know this is typically impossible since it corresponds to a perfect hypothesis test.
More practically, we want Power(θ) to be small for θ ∈ Θ0 and large for θ ∈ Θ1.
Example 8.2.1: Let X ~ Bin(6, θ), where 0 < θ < 1, and consider testing H0: θ ≤ 1/2 versus H1: θ > 1/2 using one of the following three hypothesis tests:

Test 1: Reject H0 if and only if X = 6. The power function of this hypothesis test is
Power(1)(θ) = Pθ(X = 6) = θ^6.

Test 2: Reject H0 if and only if X ≥ 5. The power function of this hypothesis test is
Power(2)(θ) = Pθ(X ≥ 5) = θ^6 + 6θ^5(1 − θ) = θ^5(6 − 5θ).

Test 3: Reject H0 if and only if X ≥ 4. The power function of this hypothesis test is
Power(3)(θ) = Pθ(X ≥ 4) = θ^6 + 6θ^5(1 − θ) + 15θ^4(1 − θ)^2 = θ^4(15 − 24θ + 10θ^2).
These functions are plotted below.
[Figure: Power(1)(θ), Power(2)(θ), and Power(3)(θ) plotted against θ for 0 ≤ θ ≤ 1, with power on the vertical axis from 0.0 to 1.0. Test 3 has the highest curve and Test 1 the lowest, since the rejection regions are nested.]
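The following sketch recomputes the three power functions with scipy, checks them against the closed forms above, and reproduces the figure:

```python
import numpy as np
from scipy.stats import binom
import matplotlib.pyplot as plt

thetas = np.linspace(0.001, 0.999, 200)

# P_theta(X >= k) for X ~ Bin(6, theta); binom.sf(k - 1, ...) gives P(X >= k).
power1 = binom.sf(5, 6, thetas)  # Test 1: reject iff X = 6
power2 = binom.sf(4, 6, thetas)  # Test 2: reject iff X >= 5
power3 = binom.sf(3, 6, thetas)  # Test 3: reject iff X >= 4

# Sanity checks against the closed-form expressions derived above.
assert np.allclose(power1, thetas**6)
assert np.allclose(power2, thetas**5 * (6 - 5 * thetas))
assert np.allclose(power3, thetas**4 * (15 - 24 * thetas + 10 * thetas**2))

for power, label in [(power1, "Test 1"), (power2, "Test 2"), (power3, "Test 3")]:
    plt.plot(thetas, power, label=label)
plt.xlabel("theta"); plt.ylabel("Power"); plt.legend(); plt.show()
```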
8.3 Level and Size
Suppose for now that our null hypothesis is H0: θ = θ0 (i.e., suppose that Θ0 = {θ0}). Then it suffices to consider Pθ0[T(X) ≥ c].
Achieving a Specified Level
Our test has level α if and only if Pθ0[T(X) ≥ c] ≤ α. For any α > 0, we can find a value of c large enough to satisfy this inequality.
Achieving a Specified Size
Now suppose that we wish to construct a test with size α. Our test has size α if and only if Pθ0[T(X) ≥ c] = α. It may or may not be possible to find such a test.
If the distribution of T(X) is continuous, then we want to find a value c ∈ ℝ such that

Pθ0[T(X) ≥ c] = 1 − Fθ0(c) = α,

where Fθ0 denotes the cdf of T(X) for parameter value θ0. If 0 < α < 1, then there exists a point c ∈ ℝ that satisfies this equation since the cdf of T(X) is continuous. Thus, if the test statistic T(X) is a continuous random variable and 0 < α < 1, then there exists a choice of the critical value c that achieves size α.
If instead the distribution of T(X) is discrete, then there may or may not exist a value of c for which Pθ0[T(X) ≥ c] = α. If no such c exists, then there does not exist a test with size α based on the test statistic T(X). In this case, we would typically try to find a test with size less than α (so that it still has level α) but as close to α as possible.
Example 8.3.2: In Example 8.3.1, we can obtain a test with size α by taking the critical value c to be the number such that P(|Z| ≥ c) = α for a standard normal random variable Z. (For α = 0.05, this is c ≈ 1.96. For α = 0.10, this is c ≈ 1.64.) Any larger value of c would also yield a test with level α, but the size of such a test would be smaller than α.
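These critical values are the (1 − α/2) quantiles of the standard normal distribution; a one-line check with scipy:

```python
from scipy.stats import norm

# Two-sided critical value: c such that P(|Z| >= c) = alpha for Z ~ N(0, 1),
# i.e., c is the (1 - alpha/2) quantile of the standard normal distribution.
for alpha in [0.05, 0.10]:
    c = norm.ppf(1 - alpha / 2)
    print(f"alpha = {alpha}: c = {c:.3f}")  # 1.960 and 1.645
```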
If the null hypothesis is instead composite, we consider supθ∈Θ0 Pθ[T(X) ≥ c]. In many problems this supremum equals Pθ*[T(X) ≥ c] for some θ* ∈ Θ0. (Often θ* is on the boundary of Θ0.) Then we can proceed as if the set Θ0 were instead simply {θ*}, i.e., as if the null hypothesis were simply H0: θ = θ*.
Example 8.3.3: In Example 8.2.1 and Example 8.2.2,

sup0<θ≤1/2 Pθ(X ≥ c) = Pθ=1/2(X ≥ c)

for all c ∈ ℝ (which was why the sizes of the tests in Example 8.2.2 could be computed by evaluating the power function at θ = 1/2). Then since the distribution of X is discrete, a test with size exactly α only exists for certain values of α. For example, there does not exist a test of this form with size 0.05. If we were asked to find a test with level 0.05, we could choose Test 1, which rejects H0 if and only if X = 6. This test has size 1/64 ≈ 0.016, so 0.05 is indeed a level of this test.
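A quick sketch confirming these sizes (the attainable sizes are just upper tail probabilities of Bin(6, 1/2)):

```python
from scipy.stats import binom

# Size of the test "reject iff X >= k" under theta = 1/2 (the boundary of H0).
for k in [6, 5, 4]:
    size = binom.sf(k - 1, 6, 0.5)  # P_{theta = 1/2}(X >= k)
    print(f"reject iff X >= {k}: size = {size:.4f}")
# Sizes: 1/64 ~ 0.0156, 7/64 ~ 0.1094, 22/64 ~ 0.3438 -- none equals 0.05.
```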
8.4 P-Values
The choice of the size or level of a test is typically subjective. This subjectivity can be
somewhat unsatisfying, since two different people can reach opposite conclusions from the
same data and the same test statistic simply because they chose to use different sizes or
levels (and hence different critical values).
Example 8.4.1: In Example 8.3.1 and Example 8.3.2, we considered a test that rejects H0 if and only if the test statistic exceeds the number c such that P(|Z| ≥ c) = α for a standard normal random variable Z. Suppose one person uses α = 0.05 and c ≈ 1.96, while another person uses α = 0.10 and c ≈ 1.64. Now suppose the observed test statistic value is 1.76. Then the first person will fail to reject H0, while the second person will reject H0.
The p-value of the observed data xobs is p(xobs) = supθ∈Θ0 Pθ[T(X) ≥ T(xobs)]. Let c be the smallest number such that the test associated with Rc has level α. If xobs ∈ Rc, then T(xobs) ≥ c, so

p(xobs) = supθ∈Θ0 Pθ[T(X) ≥ T(xobs)] ≤ supθ∈Θ0 Pθ[T(X) ≥ c] ≤ α

since the test has level α. Now suppose instead that xobs ∉ Rc. Then T(xobs) < c, so

p(xobs) = supθ∈Θ0 Pθ[T(X) ≥ T(xobs)] > α

since otherwise c would not be the smallest number such that the test associated with Rc has level α.
Thus, Theorem 8.4.2 tells us that an equivalent way to make the final decision in a hypothesis test is to calculate the p-value p(xobs) for the observed data xobs and reject H0 at level α if and only if p(xobs) ≤ α. For this reason, the p-value is sometimes called the observed significance level.
Example 8.4.3: In Example 8.4.1, the observed test statistic value 1.76 has p-value

p(1.76) = P(|Z| ≥ 1.76) ≈ 0.078,

where Z is a standard normal random variable. Since 0.078 ≤ 0.10 but 0.078 > 0.05, this single number reproduces both decisions in Example 8.4.1.
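The corresponding computation in scipy (two-sided tail of the standard normal):

```python
from scipy.stats import norm

# Two-sided p-value for an observed statistic of 1.76:
# P(|Z| >= 1.76) = 2 * P(Z >= 1.76) for Z ~ N(0, 1).
p_value = 2 * norm.sf(1.76)
print(f"p-value = {p_value:.3f}")  # ~0.078: reject at alpha = 0.10, not at 0.05
```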
8.5 Criticisms of Hypothesis Testing
Frequentist hypothesis testing has been an immensely popular tool of statistical inference
for decades. However, there do exist scenarios in which hypothesis tests show properties
that some people consider illogical and unacceptable. On the other hand, some people see
absolutely no problem with this type of behavior. We now provide a few examples merely
to illustrate some issues that can arise.
Example 8.5.1: Suppose we wish to test whether a particular coin is fair or weighted in favor of heads. Then our hypotheses are H0: θ = 1/2 and H1: θ > 1/2, where θ denotes the probability that the coin yields heads on any given flip. Now suppose we are told that the following sequence of flips was observed (in order):
heads, heads, heads, heads, heads, tails.
There is some ambiguity here about how we should represent the data as a random variable.
Perhaps the person flipping the coin decided to flip the coin repeatedly until obtaining tails. Let X be the number of times heads is observed for such an experiment before the first tails. Then X ~ Geometric(θ), and a sensible hypothesis test is to reject H0 if and only if X ≥ c for some c. The observed value of X was X = 5, so the p-value is

p(5) = Pθ=1/2(X ≥ 5) = 1/32 ≈ 0.031.
Perhaps the person flipping the coin instead decided to flip the coin six times and record the results. Let X be the number of times heads is observed for such an experiment. Then X ~ Bin(6, θ), and a sensible hypothesis test is to reject H0 if and only if X ≥ c for some c. The observed value of X was X = 5, so the p-value is

p(5) = Pθ=1/2(X ≥ 5) = 7/64 ≈ 0.109.
Thus, the two different representations yield very different p-values and would therefore lead to opposite conclusions at both α = 0.05 and α = 0.10. This is troubling since there is no clear
reason to prefer either representation over the other. Essentially, the result of our hypothesis
test depends on knowing what the experimenter would have done under circumstances that
are already known not to have occurred (e.g., whether the experimenter would have stopped
flipping had tails occurred earlier than the sixth flip).
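A short sketch computing both p-values for the same observed sequence, to make the contrast explicit (scipy's geom counts trials up to and including the first tails, which is why the argument below is 5):

```python
from scipy.stats import binom, geom

# Stopping rule 1: flip until the first tails. Five heads before the first
# tails means the first tails occurred on trial 6 or later.
p_geometric = geom.sf(5, 0.5)  # P(first tails on trial > 5) = (1/2)^5
print(f"geometric p-value: {p_geometric:.3f}")  # ~0.031

# Stopping rule 2: flip exactly six times; X = number of heads in 6 flips.
p_binomial = binom.sf(4, 6, 0.5)  # P(X >= 5) = 7/64
print(f"binomial p-value: {p_binomial:.3f}")  # ~0.109
```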
Example 8.5.2: A researcher visits a lab and is allowed to use Machine A to conduct some
measurements. These measurements are then used to perform a hypothesis test and reach
a conclusion. However, the researcher later learns that the lab actually had two similar
machines of this type (Machine A and Machine B), that another researcher also visited the
lab the same day, and that the two machines were assigned to the two researchers randomly.
Also, the machines are not identical: Machine A is a better piece of equipment and hence
provides more precise measurements than Machine B. Although these new facts do not change
the researcher's data or test statistic, they do change the distribution of that test statistic,
which must instead be calculated as if there were probability 1/2 of using Machine A and
probability 1/2 of using Machine B. Thus, the outcome of the hypothesis test can be altered
even after the data has been collected by the mere existence of Machine B and the fact that
it could have been used instead, even though it is already known that it was not used.
Example 8.5.3: Suppose a certain voltage is to be measured using a voltmeter for which the readings are iid N(μ, σ²) random variables, where σ² > 0 is known. The sample mean is
computed, and a hypothesis test is performed. However, it is later learned that the voltmeter
had a maximum reading of 10 V, and any reading that otherwise would have been greater
than 10 V would have instead been given as 10 V. This fact changes the distribution of the
test statistic and could thus alter the outcome of the hypothesis test. Note that this change
occurs even if all of the readings are less than 10 V, i.e., even if it is already known that the
maximum did not actually matter.
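A simulation sketch of this phenomenon, with illustrative values (μ = 9 V, σ = 1, n = 5 readings are our assumptions, not from the text): capping readings at 10 V shifts the null distribution of the sample mean, so the critical value of a test based on it changes.

```python
import numpy as np

rng = np.random.default_rng(0)
mu0, sigma, n = 9.0, 1.0, 5          # hypothetical null mean, sd, sample size
draws = rng.normal(mu0, sigma, size=(100_000, n))

xbar_ideal = draws.mean(axis=1)                       # voltmeter without a maximum
xbar_censored = np.minimum(draws, 10.0).mean(axis=1)  # readings capped at 10 V

# The critical value of a size-0.05 test of H0: mu = mu0 vs H1: mu > mu0
# differs between the two distributions of the test statistic.
print("95th percentile, ideal:    ", np.quantile(xbar_ideal, 0.95))
print("95th percentile, censored: ", np.quantile(xbar_censored, 0.95))
```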
These examples also highlight the differences between frequentist and Bayesian inference.
Frequentist inference conditions on parameter values and integrates/sums over all possible data values that could be observed.
Bayesian inference conditions on the observed data values and integrates/sums over all
possible values of the parameter.
Thus, the issues that arise in the examples in this section do not arise in Bayesian inference.
Since Bayesian methods are conditional on the data that is actually observed, they are
unaffected by what could have happened for data values that did not actually occur.
Note: Of course, there also exist scenarios where Bayesian methods exhibit behavior
that can be criticized on philosophical grounds. We will return to such scenarios later
in the course if time permits.