BT Wk5 LectureNotes B
Overview
Hypothesis testing is a formal procedure for comparing competing theories about natural phenomena. It
can be viewed as a key component of the scientific method and, in general, a means of advancing knowledge
and understanding.
The simplest scenario has two competing hypotheses, one labelled the Null Hypothesis and denoted H0 and
the other labelled the Alternative Hypothesis and denoted H1 . In our statistical framework these hypotheses
are typically statements about the possible values of a parameter (or parameters):
H0 : θ ∈ Θ0
H1 : θ ∈ Θ1
The sets defined by the hypotheses are mutually exclusive, Θ0 ∩ Θ1 = ∅, and (usually) exhaustive, i.e., their
union includes the entire parameter space, Θ0 ∪ Θ1 = Θ.
Example. The fraction of a specific variety of potatoes infected by a virus is approximately 0.15. A virus
resistant variety of potatoes will be planted this year and the hope is that the fraction infected will be less
than 0.15. Letting θ denote the probability that a plant is infected, the two competing hypotheses are:
H0 : θ ≥ 0.15
H1 : θ < 0.15
Example. In multiple regression one is interested in knowing if one or more covariates have a linear
relationship with a response variable. For example, is this model, E[Y ]=β0 + β1 x1 + β2 x2 , correct? Two
hypotheses might be:
H0 : β1 = 0
H1 : β1 ≠ 0
Comments
• A Null Hypothesis that includes only a single value for θ is called a Point Null Hypothesis (or Simple
Hypothesis). There are often practical problems with such hypotheses; e.g., β1 = 0.001, however
negligible in practice, would be contrary to H0 : β1 = 0.
• Competing hypotheses can (sometimes) be viewed as competing models about phenomena. The above
hypotheses could be written as:
H0 ≡ M0 is the correct model : E[Y ] = β0 + β2 x2
H1 ≡ M1 is the correct model : E[Y ] = β0 + β1 x1 + β2 x2
And one can easily imagine a larger set of hypotheses or alternative models:
M1 : E[Y ] = β0
M2 : E[Y ] = β0 + β1 x1
M3 : E[Y ] = β0 + β2 x2
M4 : E[Y ] = β0 + β1 x1 + β2 x2
5.4.1 Classical Hypothesis Testing
Example A. The sampling model is Normal(θ, σ 2 ), where θ is unknown but σ 2 is known and equals 2
(admittedly seldom realistic). The null hypothesis is that θ is less than or equal to 3 while the alternative
hypothesis is that θ is greater than 3:
H0 : θ ≤ 3
H1 : θ > 3
A random sample of n=10 is taken and the sample average is ȳ = 4. The test statistic:
T (y) = (ȳ − θ0 ) / √(σ²/n)
where θ0 is a value in the set θ ≤ 3. Note that conditional on H0 , T (y) is Normal(0,1)9 . Given that there
is an infinite number of values in the set Θ0 , the convention is to select the value of θ ∈ Θ0 that would yield
the largest p-value, in this case θ0 = 3, and then the p-value is
Pr (T (y) ≥ T (observed)) = Pr ( T (y) ≥ (4 − 3)/√(2/10) = 2.236 ) = 1 − Φ(2.236) = 0.013
where Φ(z) is the cumulative distribution function for a standard normal random variable. Note that
extremeness here is in the direction of H1 , namely, towards values of θ > 3. Such a p-value of 0.013 would
be considered by many to be “sufficiently small”, or statistically significant, and H0 would be rejected.
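As a numerical check of Example A, the z-statistic and p-value can be reproduced with a short sketch. The notes use R; this sketch uses only the Python standard library, with the standard normal CDF built from the error function:

```python
# Sketch (not from the notes): verifying the Example A p-value numerically.
from math import sqrt, erf

def norm_cdf(z):
    # Standard normal CDF, Phi(z), via the error function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

ybar, theta0, sigma2, n = 4.0, 3.0, 2.0, 10
z = (ybar - theta0) / sqrt(sigma2 / n)   # observed test statistic
p_value = 1.0 - norm_cdf(z)              # extremeness toward H1: theta > 3
print(round(z, 3), round(p_value, 3))    # 2.236 0.013
```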
Example B. Two linear models for an expected outcome are proposed, where one model is nested inside
the other model:
M 1 : E[Y ] = β0 + β1 x
M 2 : E[Y ] = β0 + β1 x + β2 x2
9 The test statistic T (y) for this setting is sometimes written z and is called the z-statistic.
Equivalently,
H0 : β2 = 0
H1 : β2 ≠ 0
Assuming normality of Y , the common test statistic is the t-statistic, t = (β̂2 − 0)/std.error(β̂2 ). And
extremeness in this case would be values of t that are relatively far from 0, t ≪ 0 or t ≫ 0.
1. H0 and H1 must be structured such that “extremeness” in the direction of H1 is definable in order
to calculate the p-value. If one is comparing models that are not nested, “extremeness” is not readily
definable. For example, exponential “growth” versus linear “growth” models:
M 1 : E[Y ] = β0 exp(β1 t)
M 2 : E[Y ] = β0 + β1 t
If H0 is that M 1 is true, and H1 is that M 2 is true, then assuming that H0 is true, what is a measure
of extremeness in the direction of H1 ?
2. The evidence is only against H0 as the p-value is calculated assuming that H0 is true.
• A small p-value indicates that the data are not what would be expected if H0 is true.
• A large p-value, however, does not mean that H0 , or the model implied by H0 , is true, because
the calculation is made assuming that H0 is true; so there is no weight of evidence for H0 .
• This is the reason that the frequentist conclusion given a large p-value is to say “fail to reject”
H0 , and Not to say “accept” H0 . You can’t accept something that you assumed was true in the
first place.
3. The p-value itself, e.g., 0.01, does not provide “weight of evidence” for H0 . The p-value is a long-
run relative frequency measure: if H0 were true, only 1% of the time would the observed results, or
more extreme results, occur. The p-value is not the probability that H0 is true.
4. Calculation of P-values involves including values that were not even observed. This violates the Like-
lihood Principle10 .
Example C. (This example was discussed previously in Lecture Notes 1.) The sampling model for
the data is Poisson(θ) and there are two hypotheses about θ:
H0 : θ = 1 H1 : θ = 2
A sample size n=1 is drawn and yields the value y=2. The standard frequentist approach is to calculate
the p-value: the probability of the observed value and any values in a direction away from H0 in the
direction of H1 . In this case the p-value is Pr(Y ≥ 2|H0 ) = 1- Pr(Y = 0 ∪ Y = 1|θ = 1) = 0.26411 .
Thus, one would not reject H0 .
This procedure is violating the Likelihood Principle, however, in that inference is being based on more
than the likelihood of the data: the probability of events that did not occur, such as Y =3 or Y =4, is
being used as the basis for inference.
10 Reminder from Lecture 1 Notes: The Likelihood Principle says that given a sample of data, y, any two sampling models for
y, say p1 (y|θ) and p2 (y|θ), that have proportional likelihoods yield the same inference for θ. The main point is that inference
for θ depends on the observed y alone, not on unobserved values of y.
11 In R: 1-ppois(q=1,lambda=1)=0.2642411.
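The footnote's ppois calculation can be reproduced directly from the Poisson pmf; a sketch in Python (the notes use R):

```python
# Sketch: the Example C p-value, Pr(Y >= 2 | theta = 1), without R's ppois.
from math import exp

def pois_pmf(y, theta):
    # Poisson pmf: e^{-theta} theta^y / y!
    fact = 1
    for k in range(2, y + 1):
        fact *= k
    return exp(-theta) * theta**y / fact

p_value = 1.0 - pois_pmf(0, 1.0) - pois_pmf(1, 1.0)
print(round(p_value, 4))  # 0.2642
```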
5.4.2 Bayesian Hypothesis Testing
The Bayesian approach assigns prior probabilities to the two hypotheses,
H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1 ,
say p0 = Pr(H0 ) and p1 = Pr(H1 ), where p0 + p1 = 1.
Then data, y, are collected and the posterior probabilities for each hypothesis are calculated:
Simple
H0 : θ = θ0 versus H1 : θ = θ1
Then
Pr(H0 |y) = Pr(θ = θ0 |y) = f (y|θ0 )p0 / m(y) = f (y|θ0 )p0 / [f (y|θ0 )p0 + f (y|θ1 )p1 ]
Pr(H1 |y) = Pr(θ = θ1 |y) = f (y|θ1 )p1 / m(y) = f (y|θ1 )p1 / [f (y|θ0 )p0 + f (y|θ1 )p1 ]
Given that Pr(H0 |y) + Pr(H1 |y) = 1, Pr(H1 |y) is simply 1 − Pr(H0 |y).
Note that to calculate posterior odds the normalizing constant m(y) need not be calculated:
Pr(H0 |y) / Pr(H1 |y) = f (y|θ0 )p0 / [f (y|θ1 )p1 ]
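The simple-vs-simple calculation can be written directly as a short sketch (not from the notes); f0 and f1 below stand for the likelihoods f(y|θ0) and f(y|θ1), and the numeric inputs are purely illustrative:

```python
# Sketch of the simple-vs-simple posterior calculation; inputs are illustrative.
def posterior_probs(f0, f1, p0, p1):
    # f0 = f(y|theta0), f1 = f(y|theta1); p0, p1 are the prior probabilities
    m = f0 * p0 + f1 * p1              # m(y), the normalizing constant
    return f0 * p0 / m, f1 * p1 / m

def posterior_odds(f0, f1, p0, p1):
    # Posterior odds of H0 against H1; m(y) is never needed here
    return (f0 * p0) / (f1 * p1)

pr0, pr1 = posterior_probs(0.5, 0.25, 0.5, 0.5)
print(round(pr0, 3), round(pr1, 3), posterior_odds(0.5, 0.25, 0.5, 0.5))
# 0.667 0.333 2.0
```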
Composite
H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1
Letting π(θ) be the prior probability over the entire parameter space, the prior probability for Hypothesis i
is
pi = ∫_{θ∈Θi} π(θ) dθ
Thus the prior distribution for the parameter θ, π(θ), is inducing the prior for the hypothesis, pi 12 .
Now
Pr(Hi |y) = p(y, Hi )/m(y)
= pi p(y|Hi )/m(y)
= pi [∫ p(y, θ|Hi ) dθ]/m(y)
= pi [∫ f (y|θ)π(θ|Hi ) dθ]/m(y)
= pi [∫_{θ∈Θi} f (y|θ) (π(θ)/pi ) dθ]/m(y)
= [∫_{θ∈Θi} f (y|θ)π(θ) dθ]/m(y)
= ∫_{θ∈Θi} p(θ|y) dθ
= Pr(θ ∈ Θi |y)
The key step in the above is the equality of integrating π(θ|Hi ) over the entire parameter space Θ and
integrating π(θ)/pi over the reduced parameter space Θi .
Stated most plainly, however, the posterior probability of Hi is simply the integral of the posterior for θ over
Θi .
The posterior odds of H0 against H1 can be written:
Pr(H0 |y) / Pr(H1 |y) = [∫_{θ∈Θ0} f (y|θ)π(θ)dθ] / [∫_{θ∈Θ1} f (y|θ)π(θ)dθ] = Pr(θ ∈ Θ0 |y) / Pr(θ ∈ Θ1 |y)
= Pr(θ ∈ Θ0 |y) / [1 − Pr(θ ∈ Θ0 |y)]
Remarks
• Multiple Hypotheses. Multiple hypotheses can be handled similarly. The different hypotheses could
correspond to different sets of models: M1 , . . ., MK :
Pr(Hi |y) = Pr(Hi , y) / Pr(y) = Pr(Hi , y) / Σ_{j=1}^{K} Pr(Hj , y)
where the form of Pr(Hi , y) would depend upon whether Hi was simple or composite.
• Computational difficulties. For composite hypotheses, the integration needed to calculate Pr(θ ∈
Θi |y) may not be analytically tractable.
An alternative to calculating posterior probabilities for the hypotheses is Bayes factors. A Bayes factor is
the ratio of posterior odds to prior odds. The prior odds for H0 against H1 are the ratio p0 /p1 . E.g., if
p0 =0.6 and p1 =0.4, then 0.6/0.4 = 1.5 are the prior odds. The posterior odds for H0 against H1 are the ratio
Pr(H0 |y)/ Pr(H1 |y). The Bayes Factor for H0 against H1 , which is written BF01 , is
BF01 = [Pr(H0 |y)/ Pr(H1 |y)] / (p0 /p1 )
Rules of thumb for interpreting Bayes Factors are given by Kass and Raftery (Journal of the American
Statistical Association, 90(430), 1995):
12 Note: one can specify a prior for the hypothesis independent of the prior for θ; e.g., simply state that p0 =0.3 regardless of
the π(θ).
BF01 Interpretation
<3 No evidence for H0 over H1
>3 Positive evidence for H0
> 20 Strong evidence for H0
> 150 Very strong evidence for H0
• BF01 < 1/3 ⇒ BF10 > 3 ⇒ positive evidence for H1
• BF01 < 1/20 ⇒ BF10 > 20 ⇒ strong evidence for H1
Simple vs Simple
H0 : θ = θ0 vs H1 : θ = θ1 .
BF01 = [Pr(H0 |y)/ Pr(H1 |y)] / (p0 /p1 ) = [f (y|θ0 )p0 /(f (y|θ1 )p1 )] / (p0 /p1 ) = f (y|θ0 )/f (y|θ1 ) (5.11)
Thus the Bayes Factor is simply the ratio of the likelihoods, and the priors for the hypotheses are irrelevant.
Example C (continued). The sampling distribution for the data is Poisson(θ) and H0 : θ = 1 and
H1 : θ = 2. The prior for H0 is p0 =0.8, thus p1 =0.2. A single observation, n = 1, is observed with y = 2.
Then BF01 = (e^−1 · 1²)/(e^−2 · 2²) = 0.6796, and BF10 = 1.4715. Thus there is no evidence for H0 over H1 , or for
H1 over H0 .
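A sketch checking the BF01 calculation for Example C, as the Poisson likelihood ratio at y = 2 (the notes use R; this is stdlib Python):

```python
# Sketch: BF01 for Example C is the likelihood ratio f(2|theta=1)/f(2|theta=2).
from math import exp

def pois_pmf(y, theta):
    # Poisson pmf: e^{-theta} theta^y / y!
    fact = 1
    for k in range(2, y + 1):
        fact *= k
    return exp(-theta) * theta**y / fact

bf01 = pois_pmf(2, 1.0) / pois_pmf(2, 2.0)   # = e^{-1} 1^2 / (e^{-2} 2^2)
print(round(bf01, 4), round(1 / bf01, 4))    # 0.6796 1.4715
```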
Composite vs Composite
H0 : θ ∈ Θ0 vs H1 : θ ∈ Θ1 ; Θ0 ∪ Θ1 = Θ.
BF01 = [Pr(H0 |y)/ Pr(H1 |y)] / (p0 /p1 ) = { [∫_{θ∈Θ0} f (y|θ)π(θ)dθ] / [∫_{θ∈Θ1} f (y|θ)π(θ)dθ] } / (p0 /p1 )
= [Pr(θ ∈ Θ0 |y)/ Pr(θ ∈ Θ1 |y)] / (p0 /p1 ) (5.12)
Simple vs Composite
H0 : θ = θ0 vs H1 : θ ≠ θ0 .
BF01 = [Pr(H0 |y)/ Pr(H1 |y)] / (p0 /p1 ) = { [f (y|θ0 )p0 ] / [p1 ∫ f (y|θ)π(θ)dθ] } / (p0 /p1 )
= f (y|θ0 ) / ∫ f (y|θ)π(θ)dθ = f (y|θ0 )/m(y) (5.13)
Notes:
• As in the simple versus simple case, the prior probabilities for the hypotheses cancel out.
5.4.5 Example D: Simple Null and Simple Alternative
The number of hairs per square inch of mohair fabric used by a teddy bear manufacturer is assumed to have
a Poisson(θ) distribution (King and Ross, 2017). The manufacturer wants to test the hypotheses:
H0 : θ = 100; H1 : θ = 110
To test these hypotheses an independent random sample of n pieces of fabric is drawn and the number of
hairs per square inch, y = y1 , . . . , yn , is recorded.
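For this simple-vs-simple setup the Bayes factor is again a likelihood ratio, BF01 = Π f(yi |100)/f(yi |110). A sketch on the log scale; the hair counts below are hypothetical, since the notes give no sample:

```python
# Sketch for Example D; the data are hypothetical, invented for illustration.
from math import exp, log, lgamma

def log_pois_pmf(y, theta):
    # log of the Poisson pmf; lgamma(y + 1) = log(y!)
    return -theta + y * log(theta) - lgamma(y + 1)

y = [104, 97, 112, 108, 95]   # hypothetical counts for n = 5 fabric pieces
log_bf01 = sum(log_pois_pmf(yi, 100) - log_pois_pmf(yi, 110) for yi in y)
bf01 = exp(log_bf01)          # algebraically exp(10n - sum(y) log(1.1))
print(round(bf01, 3))
```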
Exercise: Given yi , i=1,. . .,10 are independent Normal(µ,1) random variables, where we observe the data:
3.4, 2.9, 3.0, 3.5, 3.3, 3.7, 2.7, 3.9, 2.7, 2.9
Test the simple hypothesis: H0 : µ = 3 vs H1 : µ = 3.5. Show that the Bayes Factor BF01 = 1.28.
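A sketch checking the exercise's answer: with σ = 1 the Bayes factor reduces to a Normal likelihood ratio, so log BF01 = Σi [(yi − 3.5)² − (yi − 3)²]/2:

```python
# Sketch verifying the exercise's Bayes factor (Normal likelihood ratio, sigma = 1).
from math import exp

y = [3.4, 2.9, 3.0, 3.5, 3.3, 3.7, 2.7, 3.9, 2.7, 2.9]
log_bf01 = sum(((yi - 3.5)**2 - (yi - 3.0)**2) / 2.0 for yi in y)
bf01 = exp(log_bf01)
print(round(bf01, 2))  # 1.28
```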
5.4.6 Example E: Composite Null and Composite Alternative
A food manufacturer is considering releasing a new flavour of hummus, but before doing so wants to carry
out an experiment with volunteers to see whether this new flavour is liked better than a competitor’s version
(based on example from Carlin and Louis, 2009). They would like to be “pretty sure” that the new flavour
is preferred by at least 60% of hummus consumers. Letting θ be the probability that the new flavour is
preferred, there are two hypotheses:
H0 : θ ≥ 0.6 versus H1 : θ < 0.6
If there is strong evidence for H0 , they will release the new flavour. The manufacturer would prefer to be
cautious and selects a Beta prior for θ that has an expected value of 0.5 and a coefficient of variation of 0.3
(thus a standard deviation of 0.5*0.3=0.15). That translates into Beta(5.056, 5.056).
The induced prior probability for H0 is then13 :
p0 = ∫_{0.6}^{1} [1/Be(5.056, 5.056)] θ^{4.056} (1 − θ)^{4.056} dθ = ∫_{0.6}^{1} [Γ(10.112)/(Γ(5.056)Γ(5.056))] θ^{4.056} (1 − θ)^{4.056} dθ = 0.265
Thus p1 = 0.735.
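The induced prior can be checked numerically; a sketch that integrates the Beta density by the midpoint rule instead of calling R's pbeta:

```python
# Sketch: induced prior p0 = Pr(theta >= 0.6) under the Beta(5.056, 5.056) prior.
from math import exp, lgamma, log

a = b = 5.056
log_const = lgamma(a + b) - lgamma(a) - lgamma(b)   # log of 1/Be(a, b)
n_steps, lo = 20000, 0.6
h = (1.0 - lo) / n_steps                            # midpoint-rule step width
p0 = sum(exp(log_const + (a - 1) * log(lo + (i + 0.5) * h)
             + (b - 1) * log(1 - lo - (i + 0.5) * h)) for i in range(n_steps)) * h
print(round(p0, 3))  # 0.265
```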
To test these hypotheses, a taste preference study is carried out with n=16 volunteers. How would you
recommend that such a study be carried out?
Assume that the probability of preferring the new flavour is the same for all volunteers and the responses
are independent. Then, letting y be the number preferring the new flavour, y ∼ Binomial(16, θ). After the
study was completed, 13 of the 16 volunteers preferred the new flavour. What are the posterior probabilities
for H0 and H1 ? And what is BF01 ?
To begin, note that Pr(H0 |y) is the same as Pr(θ ≥ 0.6|y). We know that the Beta distribution is conjugate
for the Binomial distribution and the posterior is Beta(α + y, β + n − y), or in this case, Beta(5.056+13,
5.056+16-13) = Beta(18.056, 8.056). Therefore:
Pr(H0 |y = 13) = ∫_{0.6}^{1} [1/Be(18.056, 8.056)] θ^{18.056−1} (1 − θ)^{8.056−1} dθ = 0.8448
Pr(H1 |y = 13) = 1 − Pr(H0 |13) = 0.1552
Note: the R code for Pr(θ ≥ 0.6|y) is 1-pbeta(0.6,18.056,8.056) = 0.8447625. And the Bayes Factor:
BF01 = [0.8448/0.1552] / [0.265/0.735] ≈ 15.1
which by the Kass and Raftery rules of thumb is positive evidence for H0 .
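The whole Example E calculation can be sketched end to end in Python, with the two Beta tail probabilities computed by midpoint-rule integration rather than pbeta:

```python
# Sketch of Example E: prior and posterior Pr(theta >= 0.6), then the Bayes factor.
from math import exp, lgamma, log

def beta_tail(q, a, b, n_steps=20000):
    # Pr(theta >= q) under Beta(a, b), by midpoint-rule integration
    log_const = lgamma(a + b) - lgamma(a) - lgamma(b)   # log of 1/Be(a, b)
    h = (1.0 - q) / n_steps
    total = 0.0
    for i in range(n_steps):
        t = q + (i + 0.5) * h
        total += exp(log_const + (a - 1) * log(t) + (b - 1) * log(1 - t))
    return total * h

p0 = beta_tail(0.6, 5.056, 5.056)       # induced prior Pr(H0), about 0.265
post0 = beta_tail(0.6, 18.056, 8.056)   # posterior Pr(H0 | y = 13), about 0.8448
bf01 = (post0 / (1 - post0)) / (p0 / (1 - p0))
print(round(post0, 4), round(bf01, 1))  # posterior odds over prior odds, about 15
```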
Table 5.1: Numerical summaries of posterior quantities for taste preference study.
13 In R: 1-pbeta(0.6,5.056,5.056)=0.265.
Figure 5.8: Four prior distributions for θ in the hummus taste preference study: Beta(5.056, 5.056),
Beta(0.5, 0.5), Beta(1, 1), and Beta(2, 2). The vertical line at 0.6 marks the division between H0 and H1 .

Figure 5.9: Four posterior distributions for θ in the hummus taste preference study given y=13 in n=16
trials: Beta(18.056, 8.056), Beta(13.5, 3.5), Beta(14, 4), and Beta(15, 5). The vertical line at 0.6 marks
the division between H0 and H1 .
5.4.7 Example F: Simple Null and Composite Alternative
The sampling distribution is Poisson(θ). The null hypothesis is H0 : θ = 5 and the alternative is H1 : θ ≠ 5,
where p0 =0.7. A Gamma prior distribution is chosen for θ such that E[θ]=5 with a CV of 0.1, thus a
Gamma(100,20).
A random sample of n=8 is drawn yielding the following values
3, 3, 3, 3, 5, 7, 7, 4
Note: ȳ = 4.375, and θ|y ∼ Gamma(100 + Σ_{i=1}^{8} yi , 20 + n) = Gamma(135, 28).
To find the posterior probabilities:
Pr(H0 |y) = Pr(H0 , y)/m(y) ∝ p0 Π_{i=1}^{8} e^{−5} 5^{yi} /yi ! = 0.7 · e^{−40} 5^{35} / Π_{i=1}^{8} yi ! = 9.12873e−08

Pr(H1 |y) = Pr(H1 , y)/m(y) ∝ p1 ∫_{0}^{∞} [e^{−8θ} θ^{35} / Π_{i=1}^{8} yi !] · [20^{100} /Γ(100)] θ^{100−1} e^{−20θ} dθ
= 0.3 · [1/Π_{i=1}^{8} yi !] · [20^{100} /Γ(100)] · Γ(135)/28^{135} = 3.684845e−08
Then
Pr(H0 |y) = 9.12873e−08 / (9.12873e−08 + 3.684845e−08) = 0.7124
Pr(H1 |y) = 3.684845e−08 / (9.12873e−08 + 3.684845e−08) = 0.2876
And the Bayes Factor14 for H0 against H1 :
BF01 = (0.7124/0.2876) / (0.7/0.3) = 1.0617
which implies no evidence for H0 over H1 or vice versa.
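The quantities in Example F are numerically delicate (the joint probabilities are of order 1e−07), so in practice the calculation is done on the log scale. A sketch in Python; the common factor Π yi ! is dropped from both terms since it cancels in the normalization:

```python
# Sketch of Example F on the log scale; prod(y_i!) cancels and is omitted.
from math import exp, log, lgamma

y = [3, 3, 3, 3, 5, 7, 7, 4]
s, n = sum(y), len(y)                     # s = 35, n = 8
log_joint0 = log(0.7) - 5 * n + s * log(5)       # p0 * e^{-40} 5^{35}
log_joint1 = (log(0.3) + 100 * log(20) - lgamma(100)
              + lgamma(100 + s) - (100 + s) * log(20 + n))
              # p1 * 20^{100}/Gamma(100) * Gamma(135)/28^{135}
m = max(log_joint0, log_joint1)           # subtract the max before exponentiating
w0, w1 = exp(log_joint0 - m), exp(log_joint1 - m)
pr0 = w0 / (w0 + w1)
print(round(pr0, 4), round(1 - pr0, 4))   # 0.7124 0.2876
```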
As said previously, multiple models can be viewed as multiple hypotheses. From an example by Lavine15 ,
a primary (elementary) school in Fresno, California had two high-voltage transmission lines nearby and
the cancer rate amongst staff was a concern as 8 of the 145 staff had developed invasive cancers. Assume
independence between staff and identical probabilities for cancer. Let y denote the number developing cancer
and θ the probability of cancer. Then y ∼ Binomial(n=145, θ) is the sampling model.
Based on data collected at a national level (for approximately the same age of the staff, mostly women, and
number of years of working), the expected number of cancers for 145 staff was estimated to be 4.2. Translating
that into a probability, one hypothesis was that θ=4.2/145 ≈ 0.03. However, different individuals thought
the rate was higher and three alternative hypotheses were postulated:
These four hypotheses can be viewed as 4 models. Lavine proposed that a priori, H1 was as likely to be
right as it was to be wrong, thus the prior for H1 was Pr(H1 ) = 1/2. Then he assumed that any of the
remaining hypotheses was equally likely, thus Pr(H2 ) = Pr(H3 ) = Pr(H4 )=1/6. The posterior probabilities
for the four hypotheses can be viewed as the relative weight of evidence for the competing theories:
Thus, one could conclude that given the data and the priors, each of the four hypotheses is about equally
likely. Or that the weight of evidence for each model is about the same. The posterior odds that the
cancer rate is higher than the national average, i.e., the posterior odds of H2 or H3 or H4 against H1 , are
(0.21+0.28+0.28)/0.23 = 3.3. Given that the prior odds of H2 or H3 or H4 against H1 are 1, this is also the
Bayes Factor, and by the Kass and Raftery criteria this is just above the “positive evidence” lower bound of
3.
Contrast with Frequentist Approach. Lavine also carried out the frequentist analysis of H0 : θ = 0.03
against the alternative H1 : θ > 0.03. The P-value is the probability of observing an outcome equal to what
was observed, 8 occurrences of cancer in 145 staff, and anything more extreme in the direction of H1 16 :
Pr(Y ≥ 8|θ = 0.03) = Pr(Y = 8|θ = 0.03) + Pr(Y = 9|θ = 0.03) + . . . + Pr(Y = 145|θ = 0.03)
= 1 − Pr(Y < 8|θ = 0.03, n = 145) = 0.0717
This would be considered “significant” evidence against H0 if the cut-off was 0.10. However, as Lavine points
out, this P-value does not account for how well the other hypotheses explain the data, it uses information
about things that did not happen (e.g., there were Not 9, nor 10, nor 11, and so on incidences of cancer), and
the Likelihood Principle is not obeyed.
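This P-value can be checked by summing binomial pmf terms directly; a sketch in Python (in R it would be 1 - pbinom(7, 145, 0.03)):

```python
# Sketch: the frequentist P-value Pr(Y >= 8 | n = 145, theta = 0.03).
from math import exp, lgamma, log

def binom_pmf(k, n, p):
    # Binomial pmf via log-gamma, to avoid large factorials
    log_coef = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coef + k * log(p) + (n - k) * log(1 - p))

p_value = 1.0 - sum(binom_pmf(k, 145, 0.03) for k in range(8))
print(round(p_value, 4))  # 0.0717
```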
R code
# Posterior densities for the hummus example (Figure 5.9). The objects not
# defined in the notes (theta.seq, post.a.set, post.b.set, my.lwd) are filled
# in here from the figure legend so the code runs as-is.
theta.seq <- seq(0.001, 0.999, by=0.001)
post.a.set <- c(18.056, 13.5, 14, 15)
post.b.set <- c(8.056, 3.5, 4, 5)
my.lwd <- 2
plot(theta.seq,dbeta(theta.seq,post.a.set[1],post.b.set[1]),type="l",
  xlab=expression(theta),ylab="",col=1,lty=1,xlim=c(0,1),lwd=my.lwd)
for(j in 2:4) {
  lines(theta.seq,dbeta(theta.seq,post.a.set[j],post.b.set[j]),
    col=j,lty=j,lwd=my.lwd)
}
abline(v=0.6,col="purple")  # division between H0 and H1
legend("topleft",legend=paste0("Beta(",post.a.set,", ",post.b.set,")"),
  lty=1:4,col=1:4,lwd=my.lwd)