Complete Lecture Notes STAT 261
by
Mary Lesperance
© Mary Lesperance
Department of Mathematics and Statistics
University of Victoria, Victoria, B.C.
Contents
1 Background material 1
1.1 Distribution Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 R code for Distribution Figures . . . . . . . . . . . . . . . . . 10
1.2 Review Stat260 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 Likelihood methods 14
2.1 Introduction to Maximum Likelihood Estimation . . . . . . . . . . . 14
2.2 Likelihoods Based on Frequency Tables . . . . . . . . . . . . . . . . . 18
2.3 Unusual example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Combining Independent Events . . . . . . . . . . . . . . . . . . . . . 22
2.5 Relative Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5.1 R code for Example 2.1.1 . . . . . . . . . . . . . . . . . . . . . 30
2.5.2 R code for Example 2.2.1 . . . . . . . . . . . . . . . . . . . . . 31
2.6 Likelihood for Continuous Models . . . . . . . . . . . . . . . . . . . . 33
2.6.1 R code for Example 2.6.1 . . . . . . . . . . . . . . . . . . . . . 36
2.7 Invariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.7.1 R code for Example 2.7.1 . . . . . . . . . . . . . . . . . . . . . 42
4 Tests of Significance 54
4.1 Introduction to Tests of Significance . . . . . . . . . . . . . . . . . . . 54
4.2 Likelihood Ratio Tests for Simple Null Hypotheses . . . . . . . . . . . 60
4.2.1 One Parameter Case . . . . . . . . . . . . . . . . . . . . . . . 61
5 Confidence Intervals 93
5.1 Invert a Test to Derive a Confidence Interval . . . . . . . . . . . . . . 93
5.2 Approximate Confidence Intervals . . . . . . . . . . . . . . . . . . . . 97
5.2.1 R Code for Example 5.2.1 . . . . . . . . . . . . . . . . . . . . 100
5.3 Another Approximate Confidence Interval . . . . . . . . . . . . . . . 101
Index 154
Chapter 1
Background material
where \binom{n}{x} = n!/[x!(n − x)!].
[Figure: Binomial(n = 100, p = 0.1) pmf.]
2. Multinomial(n, p1, …, pk); f(x1, …, xk) = \binom{n}{x1 ··· xk} p1^{x1} p2^{x2} ··· pk^{xk},
where \binom{n}{x1 ··· xk} = n!/[(x1!)(x2!) ··· (xk!)]
and xi = 0, 1, …, n, such that x1 + ··· + xk = n and p1 + ··· + pk = 1.
Note: Σ_{i=1}^{k} pi = 1 and Σ_{i=1}^{k} Xi = n.
Example: Toss a fair die n = 100 times and let (X1 , X2 , . . . , X6 ) be the observed
frequencies of the numbers 1, 2, 3, 4, 5, 6 from the tosses of the die. Since the
die is fair, then pi = 1/6 for i = 1, . . . , 6.
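For instance, the probability of any particular outcome vector can be evaluated in R with dmultinom() (a brief illustration; the particular counts below are arbitrary):

#P(X1=20, X2=15, X3=18, X4=17, X5=14, X6=16) in 100 tosses of a fair die
dmultinom(c(20, 15, 18, 17, 14, 16), prob = rep(1/6, 6))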
3. Negative Binomial(r, p); f(x) = \binom{x+r−1}{r−1} p^r (1 − p)^x, x = 0, 1, …
Consider independent repetitions of an experiment each of which has exactly
two possible outcomes, say (S, F ) .
Let P (S) = p constant, i.e. the same for each experiment
Let X = # F ’s before the rth S
Then X ∼ NegBin(r, p)
Example: Continue flipping a fair coin and stop when you observe the first
head. X = the number of tails before the first head has a Negative Binomial
distribution with r = 1.
[Figure: Negative Binomial(r = 10, p = 0.1) pmf.]
5. Hypergeometric(N, M, n); f(x) = \binom{M}{x}\binom{N−M}{n−x} / \binom{N}{n}, where max(0, n − N + M) ≤ x ≤ min(n, M).
Consider a finite population of size N . Let each object in the population be
characterized as either a S or F, where there are M ≤ N S’s in the population.
Draw a random sample of size n from the population without replacement.
Let X = # S’s in the sample of size n.
Then X ∼ Hypergeometric(N, M, n).
Example: Suppose that a bin contains N = 100 balls, of which M = 30 are
white and N −M = 70 are black. Choose a random sample of n = 10 balls from
the bin without replacement. X = the number of white balls in the sample has
a Hypergeometric(N = 100, M = 30, n = 10) distribution.
Example: A shipping container contains N = 10, 000 iPhone 7’s of which
M = 30 are defective and the remainder are not defective. Choose a random
sample of n = 100 iPhone 7’s from the shipping container without replacement.
Then X = the number of defectives in the sample has a Hypergeometric(N =
10, 000, M = 30, n = 100) distribution.
In this example, n/N = 100/10, 000 = 0.01 ≤ 0.05. Then X = the number of
defectives in the sample is approximately distributed as Binomial(n = 100, p =
30/10, 000 = 0.003).
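A quick numerical check of this approximation in R (a sketch; dhyper() takes the number of defectives m, non-defectives n, and draws k):

#exact hypergeometric probabilities vs. the binomial approximation
x <- 0:5
round(dhyper(x, m = 30, n = 9970, k = 100), 5)
round(dbinom(x, size = 100, prob = 30/10000), 5)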
[Figure: Hypergeometric(N = 100, M = 30, n = 10) pmf.]
6. Poisson(λ); f(x) = λ^x e^{−λ}/x!, x = 0, 1, …
Models the number of occurrences of random events in space and time, where
the average rate, λ per unit time (or area, or volume) is constant.
[Figure: Poisson(λ = 5) pmf.]
[Figure: Exponential(rate = 0.5) pdf.]
8. Gamma(α, β); f(x) = x^{α−1} exp(−x/β) / [Γ(α)β^α], x > 0, α, β > 0.
[Figure: Gamma(α = 2, β = 2) pdf.]
9. Normal(µ, σ²); f(x) = (1/(√(2π)σ)) exp{−(1/2)((x − µ)/σ)²}, x, µ ∈ ℜ, σ² > 0.
[Figure: Normal(0, 1) pdf.]
#Binomial
x <- 0:100
plot(x, dbinom(x,size=100,prob=.1), ylab='pmf', xlab='x')
title("Binomial(n=100, p=0.1) pmf")
#Negative Binomial
x <- 0:100
plot(x, dnbinom(x,size=10,prob=.1), ylab='pmf', xlab='x')
title("Negative Binomial(r=10, p=0.1) pmf")
#Hypergeometric
x <- 0:10
plot(x, dhyper(x,m=30,n=70,k=10), ylab='pmf', xlab='x')
title("Hypergeometric(N=100, M=30, n=10) pmf")
#Poisson
x <- 0:20
plot(x, dpois(x,lambda=5), ylab='pmf', xlab='x')
title("Poisson(lambda=5) prob mass function")
#Exponential
x <- seq(0,10,by=.01)
plot(x, dexp(x,rate=.5), ylab='pdf', xlab='x', type='l')
title("Exponential(rate=.5) prob density function")
#Gamma
x <- seq(0,15,by=.01)
plot(x, dgamma(x,shape=2,scale=2), ylab='pdf', xlab='x', type='l')
title("Gamma(alpha=2, beta=2) prob density function")
#Normal
x <- seq(-3,3,by=.01)
plot(x, dnorm(x,mean=0, sd=1), ylab='pdf', xlab='x', type='l')
title("Normal(mean=0, sd=1) prob density function")
X ∼ Normal(µ, σ²), with sample space ℜ
X ∼ Exponential(mean θ), with sample space (0, ∞)
The probability density function of X, pdf, is f(x) = (d/dx) F(x).
• Variance of X: Var(X) = σ²_X = E[(X − E(X))²] = E(X²) − [E(X)]²
Discrete case: Var(X) = Σ_x (x − E(X))² f(x)
Continuous case: Var(X) = ∫_{−∞}^{∞} (x − E(X))² f(x) dx
Recall: (i) √Var(X) = σ is called the standard deviation (sd) of X
(ii) Var(aX + b) = a² Var(X), a, b constants
(iii) Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)
(iv) Cov(X, Y) = E(XY) − E(X)E(Y)
1.3 Notation
The following is a list of notation for these notes:
1. ∼ : is distributed as
2. ≈ : approximately distributed as
3. L(θ): the likelihood function as a function of θ
4. ℓ(θ) : log-likelihood as function of θ
Chapter 2
Likelihood methods
Let θ = probability that a randomly chosen parcel is illegal. The auditors are inter-
ested in estimating θ.
Question: What assumptions are required for the use of the Binomial distribution
here?
MAXIMIZATION =⇒ CALCULUS
L(θ) = θ^x (1 − θ)^{100−x}
In Example 2.1.1, c = 1/\binom{100}{x}.
To ensure that θ̂ = x/100 is a maximum, we check that the second derivative ℓ″(θ̂) < 0.
ℓ″(θ) = −x/θ² − (100 − x)/(1 − θ)² < 0 for all 0 < θ < 1.
At the boundary values of 0 and 1, L(0) = L(1) = 0 < L(θ) for 0 < θ < 1. Therefore θ̂ is a maximum, the MLE.
Question: What is the MLE of θ when x = 0 or x = n?
S(θ) = ℓ′(θ) = dℓ(θ)/dθ
I(θ) = −ℓ″(θ) = −d²ℓ(θ)/dθ²
At the MLE θ̂, the estimated proportion of illegal parcels, S(θ̂) = 0 and I(θ̂) > 0.
X2 ∼ Negative Binomial(r = 7, p = θ)
p(x; θ) = \binom{x+r−1}{r−1} θ^r (1 − θ)^x
L(θ) = θ^7 (1 − θ)^x if c = 1/\binom{x+r−1}{r−1}
ℓ(θ) = 7 ln θ + x ln(1 − θ)
ℓ′(θ) = 7/θ − x/(1 − θ) ⟹ θ̂ = 7/(x + 7)
ℓ″(θ) = −7/θ² − x/(1 − θ)² < 0 for 0 < θ < 1.
where each outcome of one of the n experiments must fall in exactly one category,
A1 , . . . , Ak , a partition of the sample space.
• Let Xi = # of times Ai occurs in n repetitions [Σ_{i=1}^{k} Xi = n]
• E(Xi ) = npi
We can add a row in the table corresponding to expected cell frequencies.
P(x1, x2, x3, x4; θ) = \binom{200}{112, 36, 22, 30} p1(θ)^{112} p2(θ)^{36} p3(θ)^{22} p4(θ)^{30}
L(θ) = [1 − θ]^{112+36+22} θ^{36+2·22+3·30} = [1 − θ]^{170} θ^{170}
ℓ′(θ) = S(θ) = −170/(1 − θ) + 170/θ
ℓ′(θ̂) = 0 ⟹ θ̂ = 170/340 = 1/2
ℓ″(θ) = −170/(1 − θ)² − 170/θ² < 0 for 0 < θ < 1
Checking the boundary points 0 and 1: L(0) = L(1) = 0, but L(θ) > 0 for θ ≠ 0, 1; therefore θ̂ = 1/2 is a maximum, that is, it is the MLE of θ.
Substituting the MLE for θ into the expressions for the pi's, we obtain,
p̂1 = p1(θ̂) = 1 − θ̂ = 1/2
p̂2 = p2(θ̂) = θ̂(1 − θ̂) = 1/4
p̂3 = p3(θ̂) = θ̂²(1 − θ̂) = 1/8
p̂4 = θ̂³ = 1/8
Using these estimates, we obtain estimated expected frequencies np̂i under the model:
# hits required to fracture      1      2      3     ≥4    Total
Observed frequency             112     36     22     30     200
Estimated expected frequency   100     50     25     25     200
The estimated expected frequencies display poor agreement with the observed fre-
quencies. We expect some variation between the observed and estimated expected
frequencies. Does the poor agreement here suggest that something is wrong with the
assumed probability model? We need to be able to quantify the differences between
observed and estimated expected frequencies and decide if these are due to chance
variation only or to an inappropriate model. It may be that the assumed model is
incorrect, for example, the assumption of a constant probability of surviving a blow
independently of previous blows may not be realistic.
There are some examples for which we cannot use Calculus to compute the maximum
likelihood estimate. Here is one such example.
Example 2.3.1. The ‘enemy’ has an unknown number, N, of drones, which have been numbered 1, 2, …, N. Spies have reported sighting 8 drones with numbers 137, 24,
86, 33, 92, 129, 17, 111. Assume that sightings are independent and that each of the
drones has probability N1 of being observed at each sighting. Find N̂ .
P(137, 24, 86, 33, 92, 129, 17, 111; N) = 1/N^8 if N ≥ max{137, 24, 86, …, 111}, and 0 otherwise.
As N decreases, P (137, 24, 86, 33, 92, 129, 17, 111; N ) increases provided that N ≥
137. Therefore, to maximize the probability of the observed data assuming this
model, we need to make N as small as possible subject to N ≥ max{137, 24, 86, . . . , 111}.
Therefore, N̂ = 137 is the MLE of N . This is an example where we do NOT use
Calculus to solve for the MLE.
Assuming that the numbers of illegal parcels for day 1 is independent of the number
of illegal parcels for day 2, we can write the JOINT probability mass function (pmf)
for X1 and X2 as:
p(x1, x2; θ) = \binom{100}{x1} θ^{x1} (1 − θ)^{100−x1} \binom{100}{x2} θ^{x2} (1 − θ)^{100−x2}.
The Likelihood function for θ now uses both data values and becomes,
Questions:
(1) What about θ = .06? Is this a reasonable or plausible value for θ given the
data we have?
(2) How can we produce a set of θ-values that are plausible given the data?
R(θ) = L(θ)/L(θ̂) = cL(θ)/[cL(θ̂)] = p(x; θ)/p(x; θ̂).
Note:
R(θ1) = cL(θ1)/[cL(θ̂)] = [Probability of data when θ = θ1] / [maximum probability of data over all values of θ].
• If R(θ1 ) = 0.1 then the data are 10 times more probable when θ = θ̂ than
when θ = θ1 , under the hypothesized model.
• If R(θ2 ) = 0.5, then the data are 2 times more probable when θ = θ̂ than when
θ = θ2 , under the hypothesized model.
• θ2 is a more plausible parameter value than θ1 .
• R(θ) gives us a way of assessing and generating plausible values of θ given the
data and the hypothesized model.
• For example, {θ|R(θ) ≥ 0.5} is a set of θ values that give the data at least 50%
of the maximum possible probability under the hypothesized model.
Definition: A 100 p% Likelihood interval (LI) for θ is the set of θ values such
that,
R(θ) ≥ p or equivalently ln R(θ) = r(θ) ≥ ln p.
Likelihood intervals are similar, in practice, to Confidence Intervals, and we will see
that they are mathematically related when the data are normally distributed. As
an example, in the one-sample normal case when σ is known, the 14.7% Likelihood
interval for the unknown mean, µ, corresponds to a 95% confidence interval.
Relative Likelihood is also used for Hypothesis Testing/Tests of Significance which
we will see in Chapter 4.
L(θ) = θ^x (1 − θ)^{n−x}.
Here x = 7, θ̂ = 7/100 and L(θ̂) = (7/100)^7 (93/100)^{93}.
To compute a 100p% Likelihood interval, we want all θ such that
R(θ) = θ^7 (1 − θ)^{93} / [(7/100)^7 (93/100)^{93}] ≥ p.
Equivalently, we can find the values θ such that r(θ) = ln R(θ) ≥ ln(p).
Here we find the roots θ of r(θ) − ln(p) = 0, where θ is in the interval (0, 1) using
the R function uniroot(). To use uniroot(), we need to supply the function that
we wish to solve and starting values that bracket the roots. To determine starting
values, we graph r(θ) − ln(p) versus θ and overlay a horizontal line at zero. We give
an example using p = 0.1 for a 10% Likelihood Interval.
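A minimal sketch of this procedure in R follows (the helper names ell, logR.m.lnp, theta, and p anticipate the code in Section 2.5.1; the data are x = 7 successes out of n = 100):

x <- 7; n <- 100
ell <- function(theta){ x*log(theta) + (n - x)*log(1 - theta) }  #log-likelihood
thetahat <- x/n                                                  #MLE, 0.07
logR.m.lnp <- function(theta, thetahat, p){ ell(theta) - ell(thetahat) - log(p) }
p <- 0.1                                      #10% likelihood interval
theta <- seq(0.01, 0.20, by = 0.001)          #grid for the graph
plot(theta, logR.m.lnp(theta, thetahat, p), type='l',
     ylab='r(theta)-ln(p)', xlab='theta')
abline(h=0)
#starting values read off the graph bracket each root
uniroot(logR.m.lnp, c(0.02, 0.04), thetahat, p)$root   #lower endpoint, approx. 0.028
uniroot(logR.m.lnp, c(0.10, 0.15), thetahat, p)$root   #upper endpoint, approx. 0.138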
Figure 2.1: 10% Likelihood interval construction. The log relative likelihood minus
ln(0.1) is plotted versus θ. A horizontal line at zero is overdrawn to assist with
starting value determination.
From the graph, we see that there are two roots. The lower root lies within the
interval [0.02, 0.04] and the upper root lies within the interval [0.1, 0.15]. These are
the starting values that we supply to uniroot() in the code in the next section. The
roots that R returned are in the $root slot below. The 10% Likelihood interval for
θ is thus (0.028, 0.138) and the MLE is θ̂ = 0.07. Note that this interval is NOT symmetric about 0.07: the right endpoint is further from 0.07 than the left endpoint. This is displayed in the asymmetry of the plot above. [Aside: the 50%
Likelihood interval for θ is (0.044, 0.10).]
L(θ) = p1^{112} p2^{36} p3^{22} p4^{30} = [1 − θ]^{170} θ^{170},
θ̂ = 1/2 and so L(θ̂) = (1/2)^{340}.
To compute a 100 p% Likelihood interval, we want all θ such that
R(θ) = (1 − θ)^{170} θ^{170} / (1/2)^{340} ≥ p, or
R(θ) = [4(1 − θ)θ]^{170} ≥ p.
To solve this problem, we find the roots θ of R(θ) − p = 0, for admissible values of θ
in the interval (0, 1). Below, we tabulate values for R(θ) and r(θ) for various values
of θ to help us discern starting values for numerical root finding software.
θ      R(θ)              r(θ)
.3     1.34 × 10^{−13}   −29.64
.4     .00968            −6.94
.45    .1811   ← 10%     −1.71
.46    .3357             −1.09
.47    .5417   ← 50%     −.61
.50    1       ← θ̂       0
.53    .5417   ← 50%     −.61
.54    .3357             −1.09
.55    .1811   ← 10%     −1.71
.60    .00968            −6.94
Equivalently, we could have used r (θ) to compute the Likelihood Intervals. For a
100 p% Likelihood Interval we want all θ such that, r(θ) ≥ ln p. To compute the
endpoints of the interval, we solve the lower and upper roots of the equation in θ,
r(θ) − ln p = 0 using the R code below:
Alternatively (or additionally) a graph of the log relative likelihood can aid in choos-
ing starting values for a root finding technique. Below is the graph of the log relative
likelihood minus ln(.1) for Example 2.2.1.
Figure 2.3: 10% Likelihood interval construction. The log relative likelihood - ln(.1)
is plotted versus theta. A horizontal line at zero is overdrawn to assist with starting
value determination.
#MLE of theta
thetahat <- optimize(ell, c(.05,.09), maximum=TRUE)
thetahat
#Likelihood intervals
lower <- uniroot(logR.m.lnp, c(.02, .04), thetahat$maximum, p)
lower
plot(theta,logR.m.lnp(theta,thetahat$maximum,p), ylab=’r(theta)-ln(p)’,
xlab=’theta’, type=’b’)
title(’Example 2.2.1, Log Relative Likelihood - ln(p)’)
abline(h=0)
#The plot helps us to determine starting values for a root finding
# technique used to solve for the Likelihood interval
#Likelihood intervals
#log relative likelihood minus ln(p)
#Find a root in the interval (.4, .5)
In the case of continuous measurements, the pdf evaluated at x does not represent the probability of observing x; however, we construct the likelihood using the pdf.
In this case, the joint pdf of the independent sample is written as the product of the
marginal pdf’s and,
L(θ) = ∏_{i=1}^{n} f(xi; θ).
Suppose that we observed the following times between failures (to the nearest day):
70 11 66 5 20 4 35 40 29 8
1. What is an estimate of the expected time between failures? (Σ xi = 288)
2. What values are plausible given the data? (10% and 50% LI’s)
3. The computer manufacturer claims that the mean time between failures is 100
days. Comment.
Solution:
1. X ∼ Exponential(θ), where θ = expected time between failures. The estimated expected time between failures is θ̂ = Σ xi / n = 288/10 = 28.8.
Figure 2.4: 10% Likelihood interval. Log relative likelihood minus ln(0.1) plotted
versus θ. Horizontal line is at zero.
The data do not support the claim that the mean time between failures is 100
days. 100 days is not a plausible value for θ.
x <- c(70 , 11 , 66 , 5 , 20 , 4 , 35 , 40 , 29 , 8)
ell <- function(theta,x){
n <- length(x)
return(-n*log(theta) - sum(x)/theta)
}
theta <- seq(10,60,by=1)
plot(theta,ell(theta,x),ylab=’log likelihood’,xlab=’theta’)
title(’Log Likelihood, Example 2.6.1’)
#MLE
thetahat <- optimize(ell, c(20,40), maximum=TRUE, x=x)
thetahat
ell(thetahat$maximum,x)
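#NOTE: the definitions of p and logR.m.lnp are not shown in this extract;
#a minimal sketch consistent with the calls below is:
p <- 0.1                                     #10% likelihood interval
logR.m.lnp <- function(theta, thetahat, x, p){
  ell(theta, x) - ell(thetahat, x) - log(p)  #r(theta) - ln(p)
}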
plot(theta,logR.m.lnp(theta,thetahat$maximum,x,p),
ylab=’log relative likelihood - ln(p)’,xlab=’theta’)
abline(h=0) #add a horizontal line at zero
title(’Log-relative Likelihood minus ln(p), Example 2.6.1’)
#Likelihood intervals
lower <- uniroot(logR.m.lnp, c(10,20), thetahat$maximum, x, p)
lower
upper <- uniroot(logR.m.lnp, c(50,70), thetahat$maximum, x, p)
upper
2.7 Invariance
Optional Reading: Section 9.6
In the above Example 2.6.1, we might also be interested in estimating the probability
that the time between failures is greater than 100 days, i.e.
β = P(X > 100) = ∫_{100}^{∞} (1/θ) e^{−x/θ} dx = e^{−100/θ}
ℓ(β) = −n ln(−100/ln β) + (ln β / 100) Σ_{i=1}^{n} xi,
ℓ′(β) = [−n/(−100/ln β)] · [100/((ln β)² β)] + [1/(100β)] Σ_{i=1}^{n} xi.
β̂ = e^{−100/θ̂} = 0.031.
The estimated probability that the time between failures is greater than 100 days is
0.031, very small.
We see that maximum Likelihood Estimates have some very nice properties - MLE's are invariant under one-to-one parametric transformations. In addition, a 10% Likelihood interval for β is (e^{−100/θ1}, e^{−100/θ2}) = (e^{−100/15.65}, e^{−100/61.88}) = (0.0017, 0.199), where θ1 and θ2 are the endpoints of the 10% Likelihood interval for θ.
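In R, invariance means the MLE and likelihood-interval endpoints for θ can simply be plugged into the transformation (a brief sketch using the values quoted above):

thetahat <- 28.8                 #MLE of theta from Example 2.6.1
theta.LI <- c(15.65, 61.88)      #10% likelihood interval for theta
exp(-100/thetahat)               #MLE of beta, approx. 0.031
exp(-100/theta.LI)               #10% likelihood interval for beta, approx. (0.0017, 0.199)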
where θ > 0. Data for a random sample of n = 10 families living in Toronto is:
1.02, 1.41, 1.75, 2.31, 3.42, 4.31, 9.21, 17.4, 38.6, 392.8.
(a) Find the MLE of θ.
(b) Obtain an estimate of the median family income, β.
Figure 2.5 graphs the density of the Pareto distribution for various values of θ.
Typically, there are many individuals with smaller incomes and only a few individuals
with large incomes, and this is the behaviour displayed by the Pareto densities.
[Figure 2.5: Pareto densities f(x) for θ = 0.5, 1, and 2.]
Solution:
(a) Find the MLE of θ.
f(x1, …, xn; θ) = ∏_{i=1}^{n} θ xi^{−(θ+1)}
L(θ) = θ^n ∏_{i=1}^{n} xi^{−(θ+1)}
ℓ(θ) = n ln θ − (θ + 1) Σ_{i=1}^{n} ln xi
ℓ′(θ) = n/θ − Σ_{i=1}^{n} ln xi ⟹ θ̂ = n / Σ_{i=1}^{n} ln xi = 1/1.92 = 0.52208
ℓ″(θ) = −n/θ² < 0 for all θ.
where 0 < x < 1 and F (x) is the cdf of X. The Median is the 50th percentile.
0.5 = P(X ≤ β) = ∫_{1}^{β} θ x^{−(θ+1)} dx = [−x^{−θ}]_{1}^{β} = 1 − β^{−θ}
0.5 = β^{−θ} ⟹ β = 0.5^{−1/θ} = 2^{1/θ}
Returning to Example 2.7.1, the MLE of β is β̂ = 2^{1/θ̂} = 2^{Σ ln xi / n} = 3.78.
Given that a 10% LI for θ is 0.24 ≤ θ ≤ 0.96, a 10% LI for β is (β = 2^{1/θ} is monotone decreasing in θ)
2^{1/0.96} ≤ β ≤ 2^{1/0.24}, that is, 2.06 ≤ β ≤ 17.96,
a set of plausible values for median income (relative to subsistence level) based on
the Toronto sample data.
Some Comments:
• What is the effect of increasing sample size, n on Likelihood intervals? Gener-
ally this produces a more sharply peaked likelihood which results in narrower
Likelihood intervals. Likelihood intervals for θ will be more precisely estimated.
• Can we combine data from independent experiments or studies? Suppose we
are given a random sample of family incomes (relative to subsistence level) for
families living in London, England where it is assumed that the pdf of the
income distribution is
f(x; θ) = θ x^{−(θ+1)} for x ≥ 1, and 0 for x < 1.
When would it be appropriate to pool this data with the Toronto data and
produce a common estimate of θ?
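The R code for Example 2.7.1 below uses objects whose definitions are not shown in this extract; a minimal sketch of those pieces, consistent with the calls that follow, is:

x <- c(1.02, 1.41, 1.75, 2.31, 3.42, 4.31, 9.21, 17.4, 38.6, 392.8)
thetahat <- length(x)/sum(log(x))        #closed-form MLE, approx. 0.522
p <- 0.1                                 #10% likelihood interval
ell <- function(theta, x){ length(x)*log(theta) - (theta + 1)*sum(log(x)) }
logR.m.lnp <- function(theta, thetahat, x, p){ ell(theta, x) - ell(thetahat, x) - log(p) }
theta <- seq(0.1, 1.5, by = 0.01)        #plotting grid (range chosen for illustration)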
plot(theta,logR.m.lnp(theta,thetahat,x,p),
ylab=’log relative likelihood - ln(p)’,xlab=’theta’)
abline(h=0) #add a horizontal line at zero
title(’Log-relative Likelihood minus ln(p), Example 2.7.1’)
#Likelihood intervals
lower <- uniroot(logR.m.lnp, c(.2,.5), thetahat, x, p)
lower
upper <- uniroot(logR.m.lnp, c(.6,1.2), thetahat, x, p)
upper
#MLE and Likelihood intervals for beta, the median
2^(1/thetahat)
2^(1/c(lower$root,upper$root))
Chapter 3
Two Parameter Likelihoods
Solution: L(µ, σ²) = c ∏_{i=1}^{n} f(xi; µ, σ²)
Figure 3.1: Log Likelihood for a random sample of size 100 from Normal(0,1)
Figure 3.2: Log Likelihood contour plot for a random sample of size 100 from Nor-
mal(0,1)
Figures 3.1 and 3.2 display the Log-likelihood as a function of µ and σ for a sample of
size 100 data values generated from the Normal(0,1) distribution. The figures show
that the Log-likelihood is maximized somewhere near µ = 0.25 and σ = 1.1.
Find the values (µ̂, σ̂ 2 ) that maximize ℓ (µ, σ 2 ) . To do so, we take derivatives of
ℓ(µ, σ 2 ) with respect to µ and σ.
∂ℓ/∂µ = (1/σ²) Σ_{i=1}^{n} (xi − µ)   (3.1)
∂ℓ/∂σ = −n/σ + (1/σ³) Σ_{i=1}^{n} (xi − µ)²   (3.2)
(3.1) = 0 ⟹ Σ_{i=1}^{n} (xi − µ̂) = 0 ⟹ µ̂ = Σ_{i=1}^{n} xi / n = x̄.
(3.2) = 0 ⟹ n/σ̂ = (1/σ̂³) Σ_{i=1}^{n} (xi − µ̂)² ⟹ nσ̂² = Σ_{i=1}^{n} (xi − x̄)².
I(µ, σ²) = [ −∂²ℓ/∂µ²   −∂²ℓ/∂µ∂σ ; −∂²ℓ/∂σ∂µ   −∂²ℓ/∂σ² ] = [ I11  I12 ; I21  I22 ].
In the example,
I11 = −∂²ℓ/∂µ² = n/σ²
I12 = −∂²ℓ/∂µ∂σ = 2 Σ (xi − µ)/σ³
I22 = −∂²ℓ/∂σ² = −n/σ² + (3/σ⁴) Σ (xi − µ)²
Substituting in µ̂ and σ̂²,
Î11 = n²/Σ(xi − x̄)² > 0,   Î12 = 0
Î22 = −n/σ̂² + 3nσ̂²/σ̂⁴ = 2n/σ̂² > 0
Î11 Î22 − Î21² > 0 ⟹ (µ̂, σ̂²) is the joint MLE.
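These closed-form MLEs can be checked against a direct numerical maximization (a sketch; the simulated sample and the use of optim() are illustrative only):

set.seed(1)
x <- rnorm(100, mean = 0, sd = 1)          #a sample like the one behind Figures 3.1-3.2
negll <- function(par, x){                 #par = c(mu, sigma)
  if (par[2] <= 0) return(Inf)             #guard against non-positive sigma
  -sum(dnorm(x, mean = par[1], sd = par[2], log = TRUE))
}
optim(c(0.5, 1.5), negll, x = x)$par       #numerical MLEs
c(mean(x), sqrt(mean((x - mean(x))^2)))    #closed-form: xbar and sigma-hat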
Example 2.2.1 revisited: Specimens of a new high impact plastic are tested by
repeatedly striking them with a hammer until they fracture.
We assumed that:
P(Y = y) = θ^{y−1}(1 − θ), y = 1, 2, 3, …
P(item is flawed) = 1 − λ,
P(Y = 1 | flawed) = 1,
P(item is not flawed) = λ, and
P(Y = y | not flawed) = θ^{y−1}(1 − θ).
p1 = P(Y = 1) = P(Y = 1 and flawed) + P(Y = 1 and not flawed) = 1 − λ + λ(1 − θ) = 1 − λθ
p2 = P(Y = 2) = P(Y = 2 and flawed) + P(Y = 2 and not flawed) = 0 + λθ(1 − θ)
p3 = P(Y = 3) = λθ²(1 − θ)
p4 = 1 − p1 − p2 − p3 = 1 − (1 − λθ) − λθ(1 − θ) − λθ²(1 − θ) = λθ³
To compute the MLE’s of θ and λ, we need to take derivatives, set them to zero and
solve for θ̂ and λ̂.
∂ℓ/∂θ = −112λ/(1 − λθ) + 170/θ − 58/(1 − θ)   (3.3)
∂ℓ/∂λ = −112θ/(1 − λθ) + 88/λ   (3.4)
(3.4) = 0 ⟹ 112λ̂θ̂ = 88(1 − λ̂θ̂) ⟹ (112/88)λ̂θ̂ = 1 − λ̂θ̂ ⟹ (200/88)λ̂θ̂ = 1,
and
λ̂ = (88/200)(1/θ̂).
Substituting 1 − λ̂θ̂ = (112/88)λ̂θ̂ into (3.3),
0 = −112λ̂/[(112/88)λ̂θ̂] + 170/θ̂ − 58/(1 − θ̂),
⟹ 0 = 82/θ̂ − 58/(1 − θ̂).
θ̂ = 82/140 = 41/70 = 0.5857
and λ̂ = (88 × 70)/(200 × 41) = 154/205 = 0.7512.
Substituting θ̂ and λ̂ into the expressions for the pi's yields estimated probabilities:
The estimated and observed frequencies are very close! The new model fits the data
very well.
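The fitted probabilities and the corresponding expected frequencies can be checked numerically (a brief sketch):

thetahat <- 41/70; lambdahat <- 154/205
phat <- c(1 - lambdahat*thetahat,
          lambdahat*thetahat*(1 - thetahat),
          lambdahat*thetahat^2*(1 - thetahat),
          lambdahat*thetahat^3)
round(200*phat, 2)    #estimated expected frequencies; compare with 112, 36, 22, 30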
We should check that we have attained a maximum using the second derivatives
evaluated at the maximum. The information matrix entries are given below. As an
exercise, check that they satisfy the criteria for a maximum.
I11 = −∂²ℓ/∂θ² = 112λ²/(1 − λθ)² + 170/θ² + 58/(1 − θ)² > 0
I22 = −∂²ℓ/∂λ² = 112θ²/(1 − λθ)² + 88/λ² > 0
I21 = I12 = −∂²ℓ/∂θ∂λ = 112/(1 − λθ) + 112θλ/(1 − λθ)²
Left tail areas are tabulated in the optional textbook on page 351, and a chi-square
table is available on Brightspace. We will also use R to compute these.
Example: P(χ²_(4) ≤ 7.779) = 0.9.
Check that you can get this answer from the chi-square table provided on Brightspace
or on page 351 of the optional text or using R (code below).
The chi-square density is graphed for various values of ν, the degrees of freedom.
[Figure: Chi-square densities for df = 1, 4, 10.]
x <- seq(0.3,30,by=.05)
chi.dat <- cbind(dchisq(x,1), dchisq(x,4), dchisq(x,10)) #chi-square densities
matplot(x,chi.dat,type='l',col=1:3, lty=1:3, main='Chi-square densities')
legend("topright",c(paste('df=',c(1,4,10))),lty=1:3,col=1:3)
Chapter 4
Tests of Significance
To test this claim, we will perform an experiment using a deck of cards. After
shuffling the cards, a volunteer chooses a card and I must divine the colour of the
suit. This is repeated 25 times and the number of correct responses is recorded.
Let X be the number of correct responses out of the 25 independent trials.
Notes:
1. Even if I do NOT have ESP, some correct responses will occur by chance.
2. If I do have ESP, I should be able to achieve more correct responses than would
be expected by chance alone.
We define two hypotheses:
1. H1 : I have ESP. This is the claim or research hypothesis and is called the
alternative hypothesis. (It is also denoted as HA .)
2. H0 : I do not have ESP. This is usually the complement of H1 and is called the
null hypothesis.
We proceed as if we were performing a proof by contradiction, assuming
that the null hypothesis is true until we obtain enough evidence against
it.
To determine whether there is evidence against the null and in favour of the hypoth-
esis of ESP, we compare the results obtained from the experiment with that which
would be expected under the null hypothesis that I do NOT have ESP.
Large values of X observed, xobs , will be interpreted as evidence against the null
hypothesis and in favour of the alternative hypothesis.
Under H0 , the hypothesis that I do NOT have ESP, my responses are guesses, and
each response has a .50 chance of being correct. Therefore Under H0 ,
X ∼ Binomial(n = 25, p = 0.5), so that I am expected to get E(X) = np = 12.5
correct, on average if the null hypothesis is true.
Note that the null and alternative hypotheses can be written in terms of p, the Binomial probability of a correct response, as H0: p = 0.5 versus H1: p > 0.5.
Suppose that I get xobs = 18 correct. Does this provide evidence against H0 ?
Returning to the example, suppose that I get xobs =18 correct responses. The p-
value= P (X ≥ 18) is computed using the distribution of X under the null hypothesis,
that is, Binomial(n = 25, p = 0.5). Using R, we compute:
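The original R output is not reproduced in this extract; a minimal equivalent computation is:

1 - pbinom(17, size = 25, prob = 0.5)   #P(X >= 18), approx. 0.0216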
where the normal probability is obtained from the N (0, 1) cdf table on Brightspace
or on page 349 of the optional text. The approximation can be improved using a
continuity correction to the normal approximation,
P(X ≥ 18) = 1 − P(X ≤ 17) ≃ 1 − P(Z ≤ (17.5 − 12.5)/√(25/4)) = 1 − P(Z ≤ 2) = 1 − 0.97725 = 0.02275.
Note: You should review how to use the N (0, 1) cdf table since we will resort to
tables for the Quizzes and Exam.
• Small p − values suggest that if H0 were true, results as extreme as xobs would
occur very rarely. The data are then deemed to be inconsistent with H0 . Thus
we say that we have evidence against H0 .
We say that:
In our ESP example, we conclude that we have evidence against H0 , (p-value = 0.02).
The data are consistent with the hypothesis that I have ESP. Of course, results like
these could have occurred by chance, and this does NOT prove that I have ESP.
The p-value is a probability computed using our data and assuming that the null
hypothesis is true.
Example 4.1.1. Blind Taste Test: Twenty-five individuals were given two similar glasses, one of Pepsi and one of Coke, and each was asked to identify the one that was Coke. 60% (15) correctly identified Coke. Is this consistent with the hypothesis that there is no detectable difference between Pepsi and Coke?
Solution: We shall initially proceed under the assumption that H0 is true, that there is no detectable difference between Pepsi and Coke, and see if the data provide evidence against the null hypothesis. We need a probability model for the data under the assumption that H0 is true. Let
D ≡ |X − 12.5|
be our test statistic which ranks possible values of X according to how close they
are to H0 .
D = .5 =⇒ X = 13 or 12
D = 1.5 =⇒ X = 14 or 11
If H0 were true, results as extreme X = 15 would occur fairly often and we have no
evidence against the null hypothesis. The data are consistent with the null hypothesis
that there is no detectable difference between Pepsi and Coke (p-value = .42).
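A sketch of this calculation in R, using the two-sided tail probability described above:

#P(D >= 2.5) = P(X >= 15) + P(X <= 10) under Binomial(25, 0.5)
pbinom(10, size = 25, prob = 0.5) + (1 - pbinom(14, size = 25, prob = 0.5))   #approx. 0.42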
Example 4.1.1 continued: Suppose that 60% (150) of 250 individuals correctly
identified Coke. Is there evidence against the null hypothesis?
Under H0: X ∼ Binomial(n = 250, p = 1/2) and E(X) = 125.
There is very strong evidence against the null hypothesis of no detectable difference
between Coke and Pepsi (p-value = 0.002)!!
CHAPTER 4. TESTS OF SIGNIFICANCE 60
R Code: 2*pnorm(-25/sqrt(250/4))
The large SAMPLE SIZE yields a more precise estimate of the probability of correctly
identifying Coke versus Pepsi!
We have looked at two test statistics to test hypotheses about the Binomial parameter
p, D = X and D = |X − np|. In general, the test statistic will depend upon the
hypothesis being tested and it may be difficult to “come up” with a test statistic. The
Likelihood Ratio Statistic (LRS) is a good statistic and it has intuitive appeal. We
consider the LRS for simple null hypotheses. In many applications, the hypothesis
to be tested can be formulated as a hypothesis concerning the values of unknown
parameters in a probability model.
Definition: A simple hypothesis specifies numerical values for all of the unknown
parameters in the model.
D = 2[ℓ(θ̂) − ℓ(θ0)], where θ̂ is the MLE of θ. Since ℓ(θ̂) ≥ ℓ(θ0) for all values of θ0, D ≥ 0.
Thus, D ranks possible outcomes of the experiment according to how well they agree with H0: θ = θ0. Let dobs be the observed numeric value of the Likelihood Ratio Statistic; then the p-value is calculated as p-value = P(D ≥ dobs | H0: θ = θ0). Under H0,
D ≈ χ²_(1)
in most cases of one-parameter simple hypotheses; therefore, you can use the chi-squared table to obtain p-values as p-value ≃ P(χ²_(1) ≥ dobs).
Notation Notes:
• ≈ means approximately distributed as
• ≃ means approximately equal to
Example 4.2.1. The measurement errors associated with a set of scales are independent normal with known σ = 1.3 grams. Ten (n = 10) weighings of an unknown mass µ give the following results in grams:
Are the data consistent with H0: µ = µ0 = 226? Derive the LRS for testing µ = 226.
D = 2[ℓ(µ̂) − ℓ(µ0)]
L(µ) = ∏_{i=1}^{n} (1/(√(2π)σ)) exp{−(1/2σ²)(xi − µ)²} ∝ (1/√(σ²))^n exp{−(1/2σ²) Σ_{i=1}^{n} (xi − µ)²}
Since σ is assumed known, and equal to 1.3 grams, the term (1/√(σ²))^n is a constant and can be disregarded in the construction of the likelihood for µ.
ℓ(µ) = −(1/2σ²) Σ_{i=1}^{n} (xi − µ)² and µ̂ = Σ xi / n = x̄ = 227.49.
D = −2r(µ0) = 2[ℓ(µ̂) − ℓ(µ0)], where µ0 = 226
= 2[−(1/2σ²) Σ_{i=1}^{n} (xi − x̄)² + (1/2σ²) Σ_{i=1}^{n} (xi − µ0)²]
= n(x̄ − µ0)²/σ² = (x̄ − µ0)²/(σ²/n).
If X ∼ N(µ0, σ²), then X̄ ∼ N(µ0, σ²/n)
⟹ Z = (X̄ − µ0)/(σ/√n) ∼ N(0, 1) ⟹ D = Z² ∼ χ²_(1)
For this example, the likelihood ratio test statistic is equivalent to the Z test statistic
that you learned in your first course in statistics!
In the normal case, the LRS has an EXACT χ2(1) distribution!
dobs = [10/(1.3)²] (227.49 − 226)² = 13.14
p-value = P(D ≥ dobs | µ0 = 226) = P(χ²_(1) ≥ 13.14) = 0.00029 < .005
There is very strong evidence against H0 : µ = 226 (p − value < 0.005) . In our
report, we should include our estimate of the mean, µ, together with an interval
estimate which provides information about the margin of error in estimating the
mean. We write in our report: “The estimated mean weight is 227.49 grams
and the data are not consistent with the hypothesis that the mean weight
is 226 grams (p − value < 0.005, 10% Likelihood interval estimate 226.61-
228.37 grams).”
I computed the 10% likelihood interval using the R code provided below. We will
see later that likelihood intervals are related to confidence intervals, which are more
commonly quoted in practice. In this example, the interval is very, very narrow, and
the lower endpoint is close to 226 grams. Although the data suggest that the mean
weight is statistically significantly different from 226, the investigator may find no
practical difference between the observed data and the hypothesis that the mean
is 226 grams.
#Log-likelihood function
ell <- function(mu,y,sigma){
res<-vector("numeric",length(mu))
for (i in 1:length(mu)){
res[i]<--sum((y-mu[i])^2)/2/sigma^2
}
return(res)
}
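#NOTE: the function logR (log relative likelihood) used below is not shown in this
#extract; a minimal sketch consistent with the calls that follow is:
logR <- function(mu, muhat, y, sigma){
  ell(mu, y, sigma) - ell(muhat, y, sigma)
}
#the data vector y (the ten weighings) and sigma (1.3) are assumed defined earlier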
#MLE
muhat <- optimize(ell, c(225,230), maximum=TRUE, y=y, sigma=sigma)
muhat
mu <- seq(225,230,by=.1)
plot(mu,logR(mu,muhat$maximum,y,sigma), ylab=’log relative likelihood’,xlab=’mu’)
abline(h=log(.1)) #add a horizontal line at ln(p)
title(’Log-relative Likelihood, Example 4.2.1’)
#Likelihood intervals
p <- .1 #10% likelihood interval
#log relative likelihood minus ln(p)
logR.m.lnp <- function(mu, muhat, y,sigma, p) {logR(mu,muhat,y,sigma)-log(p)}
Under the assumption that θ = θ0 , D ≈ χ2(k) (in most cases), where k is the
number of functionally independent unknown θ parameters in the model.
Σ_{i=1}^{5} pi = 1. If four parameters are known, then the fifth is fully determined.
Test the null hypothesis that deaths are equally likely to occur on any day of the
week.
Solution:
(X1 , . . . , X7 ) ∼ Multinomial (n = 63, p1 , . . . , p7 )
The null hypothesis is that deaths are equally likely to occur on any day of the week,
and so, H0: p1 = p2 = ··· = p7 = 1/7, a simple hypothesis.
The likelihood ratio statistic for testing H0 is,
L(p1, …, p7) = p1^{x1} p2^{x2} ··· p7^{x7} = p1^{22} p2^{7} ··· p7^{6}, where Σ_{i=1}^{7} pi = 1
ℓ(p1, …, p7) = 22 ln p1 + 7 ln p2 + ··· + 6 ln p7 = Σ_{i=1}^{7} xi ln pi.
We need to maximize ℓ(p1, …, p7) subject to the constraint Σ_{i=1}^{7} pi = 1. To do that we can use Lagrange multipliers (if you have taken Math 200) or we can simply substitute the constraint into the log-likelihood and maximize over the parameters.
∂g/∂pi = xi/pi + γ for i = 1, …, 7   (4.1)
∂g/∂γ = Σ pi − 1   (4.2)
We take derivatives with respect to the pi's, set the expressions equal to zero and solve for the MLE's of the pi's.
∂ℓ/∂pi = xi/p̂i − x7/p̂7 = 0 for i = 1, …, 6.
Therefore, p̂i = xi p̂7/x7.   (4.3)
Now that we have the MLE's of the pi's, we can compute our Likelihood Ratio Statistic. Here k = 6 functionally independent parameters since Σ_{i=1}^{7} pi = 1. If six parameters are known, then the seventh is fully determined.
The p-value is very small (< 0.01) and we say that we have very strong evidence
against the hypothesis that deaths are equally likely to occur on any day of the
week. The estimated expected frequencies under the null hypothesis that pi = 1/7 =
0.1428571 are all npi = 9. From the table of observed frequencies, we note that
there are many more heart attacks on Monday than would be expected under the
null hypothesis.
We may have hypothesized, a priori, that heart attacks have different daily probabilities of occurrence, for example, they may be more likely to occur on Mondays, and formed our null hypothesis as:
H0 : p2 = p3 = · · · = p7 = p and p1 unspecified.
The null hypothesis does not specify numerical values for every parameter in the
model, since p is unknown, therefore it is NOT a simple hypothesis! H0 is an ex-
ample of a composite hypothesis. Note that if we knew p, we could obtain p1 by
subtraction since the p′ s must sum to one so that p1 = 1 − 6p.
To test the new, composite hypothesis, we need to find the MLE of the pi's assuming that H0 is true. We substitute the hypothesized values into the log-likelihood and maximize over p.
ℓ(p1, p2 = p, p3 = p, p4 = p, p5 = p, p6 = p, p7 = p) = x1 ln p1 + Σ_{i=2}^{7} (xi ln p) = x1 ln p1 + (ln p) Σ_{i=2}^{7} xi
∂ℓ_H/∂p = 22(−6)/(1 − 6p) + 41/p.
The MLE of p under H0 is p̃ = 41/378 = 0.1085 and p̃1 = 0.349.
ℓH (p̃1 , p̃) is the largest value that ℓ (p1 , . . . , p7 ) can attain under H0 .
D ≈ χ2(k−q)
where
k = # functionally independent unknown parameters in the basic model
q = # functionally independent unknown parameters in hypothesized model.
The p-value is large (> 0.1), and we have no evidence against the null hypothesis. We conclude that heart attacks were more likely to occur on Mondays for this sample, and equally likely to occur on the other days of the week. The estimated expected frequencies under H0 are shown in the bottom row of the following table:
p_1
dobs <- -2*logR(c(p_1,rep(p_tilde,6)),freq/sum(freq),freq)
dobs
1-pchisq(dobs,5) #pvalue
1. Test the hypothesis that the probability of a Yes response is the same
in all four provinces.
L(p1, p2, p3, p4) = ∏_{i=1}^{4} pi^{yi} (1 − pi)^{ni−yi}
ℓ(p1, p2, p3, p4) = Σ_{i=1}^{4} [yi ln pi + (ni − yi) ln(1 − pi)]
Taking the derivative with respect to pi , setting equal to zero and solving, we ob-
tain,
∂ℓ/∂pi = yi/pi − (100 − yi)/(1 − pi), and p̂i = yi/ni, i = 1, 2, 3, 4.
H0: p1 = p2 = p3 = p4 = p, p unspecified
ℓ_H(p) = ℓ(p1 = p, p2 = p, p3 = p, p4 = p) = Σ_{i=1}^{4} [yi ln p + (ni − yi) ln(1 − p)] = (Σ_{i=1}^{4} yi) ln p + (400 − Σ_{i=1}^{4} yi) ln(1 − p)
Substituting in values for p̂1, p̂2, p̂3, p̂4, and p̃, we obtain dobs = 10.76. The p-value for the test is computed as p-value ≃ P(χ²_(k−q) ≥ dobs) = P(χ²_(3) ≥ 10.76) ≃ 0.013.
There is evidence against the hypothesis that the four provinces have the same prob-
ability of responding Yes. We would write, “the data are not consistent with the
hypothesis that respondents from the four provinces are equally likely to support the
legalization of pot (p-value = 0.013).” The estimated expected frequencies under the
null hypothesis are given in the table below. Note that Ontario has fewer respondents
in favour of legalizing pot than would be expected under the hypothesis that those
sampled from the four provinces were equally likely to support legalization.
D = 2 Σ_{all cells} ObsFreq ln[ObsFreq/ExpectedFreq],
where ObsFreq is the observed frequency and ExpectedFreq is the estimated expected frequency under the null hypothesis. This form for the likelihood ratio statistic will come up again.
2. Test the hypothesis that the three western provinces have respon-
dents who are equally likely to say Yes, whereas Ontario responds differ-
ently.
H0′ : p1 = p2 = p3 = pW , p4 unspecified
This is a composite hypothesis because pW and p4 are unknown and must be esti-
mated. There are q = 2 functionally independent unknown parameters. Substituting
the null hypothesis into the log-likelihood, we obtain,
ℓ_{H′}(pW, p4) = ℓ(p1 = pW, p2 = pW, p3 = pW, p4)
= Σ_{i=1}^{3} [yi ln pW + (ni − yi) ln(1 − pW)] + y4 ln p4 + (100 − y4) ln(1 − p4)
= (Σ_{i=1}^{3} yi) ln pW + (300 − Σ_{i=1}^{3} yi) ln(1 − pW) + y4 ln p4 + (100 − y4) ln(1 − p4)
Substituting in values for p̂1 , p̂2 , p̂3 , p̂4 , and p̃W and p̃4 , we obtain dobs = 1.814. The
p-value for the test is computed as,
p − value ≃ P (χ2(k−q) ≥ dobs )
= P (χ2(2) ≥ 1.814) ≃ 0.404.
We have no evidence against the hypothesis that the Western provinces are equally
likely to support legalization of pot.
LRS<-function(p,p0,y,n){
2*(ell(y,n,p)-ell(y,n,p0))
}
pW<-sum(y[1:3])/sum(n[1:3])
p1<-c(pW,pW,pW,y[4]/n[4]) #MLE under (b) H0
p1
D1<-LRS(phat,p1,y,n) #observed LRS
D1
1-pchisq(D1,4-2) #p-value
200 specimens of a new high impact plastic are tested by repeatedly striking them
with a hammer until they fracture. The data are as follows:
Let
Y = # hits required to fracture a specimen
We assumed that:
f(y) = P(Y = y) = θ^{y−1}(1 − θ), y = 1, 2, 3, …
H0: Geometric; p1 = 1 − θ, p2 = θ(1 − θ), p3 = θ²(1 − θ), p4 = θ³.
We computed the MLE for θ under H0, θ̃ = 1/2, so that p̃1 = 1/2, p̃2 = 1/4, p̃3 = 1/8, p̃4 = 1/8.
D = 2 Σ_{all cells} ObsFreq ln[ObsFreq/ExpectedFreq].
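A sketch of this computation in R for the geometric model (here k − q = 3 − 1 = 2 degrees of freedom; variable names are illustrative):

obs <- c(112, 36, 22, 30)
expd <- 200*c(1/2, 1/4, 1/8, 1/8)     #estimated expected frequencies under H0
dobs <- 2*sum(obs*log(obs/expd))
dobs                                  #approx. 7.05
1 - pchisq(dobs, df = 2)              #approximate p-value, approx. 0.03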
Exercise: Test the fit of the “extended” geometric model - where we assumed that
a proportion λ were defective.
Are these observations consistent with the hypothesis of a common failure rate for
single and married students? Use a Likelihood Ratio test to answer the question.
Because only the total number of students, 1500, is fixed in advance of the examination, the data arise from a single Multinomial distribution and Σ_{i,j=1}^{2} pij = 1.
Let’s introduce some general notation and label the frequencies as follows:
We have that (X11 , X12 , X21 , X22 ) ∼ Multinomial (n = 1500, p11 , p12 , p21 , p22 ) and the
number of functionally independent parameters is k = 3.
Then the MLE for pij under the BASIC model is p̂ij = xij/n.
If H0 is true, then each cell probability factors into a product of the corresponding marginal probabilities. In other words, the null hypothesis states that pass/fail on the examination is independent of marital status!
Let α = P(a student fails) and β = P(a student is married), so that under H0, p11 = αβ, p12 = (1 − α)β, p21 = α(1 − β), and p22 = (1 − α)(1 − β).
We will use the Likelihood ratio statistic for testing H0, therefore we need to compute the MLE's of the pij's under the hypothesized model.
The Likelihood looks the same as that for two independent binomials!
We use these to compute the p̃ij's and the estimated expected frequencies under H0 in each of the cells in the table as follows:
eij = estimated expected freq. in (i, j)th cell under hypothesized model
e11 = n p̃11 = n α̃β̃ = r1 c1 / n = (297)(157)/1500
e12 = n p̃12 = n(1 − α̃)β̃ = r1 c2 / n
e21 = n p̃21 = n α̃(1 − β̃) = r2 c1 / n
e22 = n p̃22 = n(1 − α̃)(1 − β̃) = r2 c2 / n
n
The estimated expected frequencies under the independence hypothesis are included
in the original data table in parentheses.
Observed frequencies (estimated expected frequencies under H0):
            Fail            Pass              Total
Married     14 (31.086)     143 (125.914)     157
Single      283 (265.914)   1060 (1077.086)   1343
Total       297             1203              1500
From the last section, we have that the form of the Likelihood ratio statistic for the multinomial model is:
D = 2 Σ_{all cells} ObsFreq ln[ObsFreq/ExpectedFreq] = 2 Σ_{all cells} xij ln(xij/eij)
dobs = 2(7.70) = 15.40
p-value = P(D ≥ dobs | H0 true) ≃ P(χ²_(k−q) ≥ 15.40), with k = 3, q = 2,
= P(χ²_(1) ≥ 15.40) = .00009 < .001
We have very strong evidence against the hypothesis that there is a common failure
rate for single and married students.
The data suggest that there is an association between marital status and whether
students pass or fail the examination.
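A sketch of the same computation in R (variable names illustrative):

x <- matrix(c(14, 143, 283, 1060), nrow = 2, byrow = TRUE)   #observed frequencies
e <- outer(rowSums(x), colSums(x))/sum(x)                    #estimated expected frequencies under H0
dobs <- 2*sum(x*log(x/e))
dobs                                                         #approx. 15.4
1 - pchisq(dobs, df = 1)                                     #p-value, approx. 0.00009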
What is the nature of the association?
Proportion of married who failed = 14/157 = .089
Proportion of single who failed = 283/1343 = .211
The data suggest that married students are less likely to fail the exam. Does that mean that marriage causes better outcomes on examinations?
This does not mean that A causes B! There are 3 possible cause-effect relationships
that could produce the association:
(i) A causes B
(ii) B causes A
(iii) some other factor C causes both A and B
We cannot claim that A causes B until we have ruled out (ii) and (iii).
D = 2[ℓ(p̂) − ℓ(p̃)], where p̂ij = xij/n
= 2 Σ_i Σ_j xij ln[xij/(n p̃ij)]
= 2 Σ_i Σ_j xij ln[xij/eij]   (eij = n p̃ij),
where eij are the estimated expected frequencies under the hypothesized model, H0.
⟹ q = (a − 1) + (b − 1)
It can be shown [Exercise!] that α̃i = ri/n and β̃j = cj/n, so that
eij = n p̃ij = n α̃i β̃j = ri cj / n.
The Likelihood ratio test statistic for testing H0 has an approximate χ² distribution with degrees of freedom computed as
k − q = ab − 1 − [(a − 1) + (b − 1)] = (a − 1)(b − 1).
Example 4.8.1. The following data on heights of 210 married couples were presented
by Yule in 1900.
Wife
Husband Tall Medium Short Total
Tall 18 (15.48) 28 (32.19) 19 (17.33) 65
Med 20 (23.57) 51 (49.03) 28 (26.4) 99
Short 12 (10.95) 25 (22.78) 9 (12.27) 46
Total 50 104 56 210
Test the hypothesis that heights of husbands and wives are independent.
Solution: The estimated expected frequencies, eij = ri cj / n, under the hypothesis of independence are given in parentheses in the table.
The Likelihood ratio test statistic for testing H0 is
D = 2 Σ_{i=1}^{3} Σ_{j=1}^{3} xij ln(xij/eij).
There is no evidence against the hypothesis of independence. The data suggest that
the heights of husbands and wives are not associated.
freq <- matrix(c(18, 28, 19, 20, 51, 28, 12, 25, 9), nrow=3, byrow=TRUE)
freq
sum(freq)
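#The rest of the computation is not shown in this extract; a minimal continuation is:
e <- outer(rowSums(freq), colSums(freq))/sum(freq)  #estimated expected frequencies r_i c_j / n
round(e, 2)
dobs <- 2*sum(freq*log(freq/e))
dobs
1 - pchisq(dobs, df = (3-1)*(3-1))                  #approximate p-value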
G.O.F. = Σ_{all cells} (xj − ej)²/ej, where ej = estimated expected frequency under H0
G.O.F. ≈ χ²_(k−q)
When the ej ’s are large, G.O.F. will be very nearly equal to the Likelihood ratio
statistic.
Chapter 5
Confidence Intervals
the government’s claim that the recession has had no effect on profits, based on an
observed sample average decrease in profits of $1,000. Assume that measurements
are independent and normally distributed with σ = $600.
Solution:
Let Xi = change in profit in dollars for business i.
dobs = n(x̄ − µ0)²/σ² = 50(−1000 − 0)²/600² ≈ 139
p-value = P(D ≥ dobs | H0 true) = P(χ²_(1) ≥ dobs) < .001
Now, we consider, for which parameter values, µ0 , does a likelihood ratio test of
H0 : µ = µ0 yield a
p − value(µ0 ) ≥ 0.05?
What values of µ0 are reasonably consistent with the data?
where dobs(µ0) = n(x̄ − µ0)²/σ².
From tables or R, P(χ²_(1) ≥ 3.841) = .05, therefore p-value(µ0) ≥ .05 if and only if dobs(µ0) ≤ 3.841. We must solve for the values µ0 such that
n(x̄ − µ0)²/σ² ≤ 3.841
(x̄ − µ0)² ≤ 3.841 σ²/n
−1.96 σ/√n ≤ x̄ − µ0 ≤ 1.96 σ/√n
x̄ − 1.96 σ/√n ≤ µ0 ≤ x̄ + 1.96 σ/√n
estimate of µ. Let µT be the true unknown value of µ. We can compute the fraction
of times that the random interval would include the true value µT in a large number
of repetitions of the experiment.
P{X̄ − 1.96 σ/√n ≤ µT ≤ X̄ + 1.96 σ/√n | µ = µT}
= P{−1.96 ≤ (X̄ − µT)/(σ/√n) ≤ 1.96 | µ = µT} = .95, since
(X̄ − µT)/(σ/√n) ∼ N(0, 1) if µ = µT.
We are confident that in 95 times out of 100 trials of the experiment the above
interval will contain µT , the true value.
Definition: A 95% Confidence Interval, CI, is an interval for the unknown param-
eter µT , such that in a large number of repetitions of the experiment, CI covers µT
95 times out of 100.
One way to construct a 95% CI is to solve for the set of parameter values µ0 such
that for a test of H : µ = µ0
Thus, an approximate 95% Confidence interval for θ is the set of θ0 values such
that,
d (θ0 ) ≤ 3.841
⇐⇒ −2r (θ0 ) ≤ 3.841
⇐⇒ r (θ0 ) ≥ −1.92
⇐⇒ er(θ0 ) = R (θ0 ) ≥ e−1.92 = .147
Thus, an approximate 95% Confidence Interval for θ is just a 14.7% Likelihood In-
terval! Below is a table of common confidence interval levels and their corresponding
likelihood intervals.
CI % Corresponding LI %
90 25.8
95 14.7
96.8 10
99 3.6
includes the true value; he/she must rely on indirect statistical theory for assurance that, in the long run, 95% of the Confidence Intervals similarly constructed would include the true value.
Example 5.2.1. (Example 4.4.1 revisited) In a random sample of 100 people from
B.C., it was found that 23 out of 100 favour legalization of pot. Construct an
approximate 95% Confidence Interval for the proportion of B.C. voters who support
legalization of pot.
X ∼ Binomial(n = 100, θ)
Method 1: Use the Likelihood Ratio Statistic. Using the methods of this section, we find a 14.7% Likelihood interval for θ. This works out to be [.155, .319] using the R code which is included at the end of this chapter.
p-value(θ0) = P{|X − nθ0|/√(nθ0(1 − θ0)) ≥ |xobs − nθ0|/√(nθ0(1 − θ0)) | H0: θ = θ0}
≃ P{|Z| ≥ |xobs − nθ0|/√(nθ0(1 − θ0))}
where Z ∼ N(0, 1).
p-value(θ0) ≥ .05 ⟺ |xobs − nθ0|/√(nθ0(1 − θ0)) ≤ 1.96
|xobs − nθ0| ≤ 1.96 √(nθ̂(1 − θ̂))
xobs/n − 1.96 √(θ̂(1 − θ̂)/n) ≤ θ0 ≤ xobs/n + 1.96 √(θ̂(1 − θ̂)/n)
[θ̂ − 1.96 √(θ̂(1 − θ̂)/n), θ̂ + 1.96 √(θ̂(1 − θ̂)/n)] = [.148, .312]
This gives us an indication of the precision of our estimate and the margin of error
for our estimate is 0.082. Intervals constructed this way have the property that 19
times out of 20, such intervals will cover the true value.
This is the basis for quotes in the media such as,
“is considered accurate to within 8.2 percentage points, 19 times out of 20”.
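The code fragment below uses a log-likelihood function ell and a function logR.m.lnp whose definitions are not shown in this extract; a minimal sketch consistent with the calls that follow, for x = 23 successes out of n = 100, is:

x <- 23; n <- 100; p <- 0.147     #14.7% likelihood interval (approximate 95% CI)
ell <- function(theta){ x*log(theta) + (n - x)*log(1 - theta) }
logR.m.lnp <- function(theta, thetahat, p){ ell(theta) - ell(thetahat) - log(p) }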
#MLE of theta
thetahat <- optimize(ell, c(.1,.6), maximum=TRUE)
thetahat
#Likelihood intervals
lower <- uniroot(logR.m.lnp, c(.1, .23), thetahat$maximum, p)
lower
A related, and extremely useful result, is that in many cases, the MLE has an
approximate normal distribution for large sample size, n. Again, consider a Basic
model that has one, unknown parameter, θ, and let θ̂ be the MLE of θ where the
‘true’ value of θ = θ0 . Then for many probability models,
(θ̂ − θ0)√I(θ̂) ≈ N(0, 1),
where I(θ̂) is the Information function defined in Section 2.1 evaluated at θ̂, I(θ) = −ℓ″(θ) = −d²ℓ(θ)/dθ². This result can be rewritten as,
θ̂ ≈ N(θ0, I(θ̂)^{−1}),
so that θ̂ is approximately normally distributed with the true value as its mean,
and with asymptotic variance 1/I(θ̂). The result generalizes to the multi-parameter
case.
Using this result, we obtain an approximate 100(1−α)% Confidence Interval as,
(θ̂ − z_{1−α/2}/√I(θ̂), θ̂ + z_{1−α/2}/√I(θ̂)),
where z_{1−α/2} is the 1 − α/2 quantile of the N(0, 1) distribution. For example, for α = .05, z_{1−α/2} = 1.96.
This result is used very frequently in applied statistics!!
Example 5.3.1. (Example 5.2.1 revisited) In a random sample of 100 people from
B.C., it was found that 23 out of 100 favour legalization of pot. Construct an
approximate 95% Confidence Interval for the proportion of B.C. voters who support
legalization of pot using the normal approximation for the MLE.
X ∼ Binomial(n = 100, θ)
ℓ(θ) = x ln θ + (n − x) ln(1 − θ)
ℓ′(θ) = x/θ − (n − x)/(1 − θ)
ℓ″(θ) = −x/θ² − (n − x)/(1 − θ)²
I(θ) = x/θ² + (n − x)/(1 − θ)²,
and after some simplification,
I(θ̂)^{−1} = θ̂(1 − θ̂)/n.
This yields an approximate Confidence Interval for θ, the proportion of B.C. voters
who support legalization of pot as,
[θ̂ − 1.96 √(θ̂(1 − θ̂)/n), θ̂ + 1.96 √(θ̂(1 − θ̂)/n)] = [.148, .312],
which is the same as what we obtained in the previous section using Method 2!
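A one-line check of this interval in R (a sketch):

thetahat <- 23/100
thetahat + c(-1, 1)*1.96*sqrt(thetahat*(1 - thetahat)/100)   #approx. (0.148, 0.312)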
Chapter 6
Normal Theory
Σ_{i=1}^{n} ai Xi ∼ N(Σ_{i=1}^{n} ai µi, Σ_{i=1}^{n} ai² σi²).
Z1² ∼ χ²_(1) and Σ_{i=1}^{n} Zi² ∼ χ²_(n).
Then
Yi = µi + εi where εi ∼ N (0, σ 2 ) independent.
A consequence is that,
The smaller σ is, the smaller we expect εi to be. σ measures the amount of random
variability (noise) that one would expect in repeated measurements taken under the
same conditions.
Example 6.2.1. Monthly salaries for Engineering and Math co-op students were
collected. A sample of 10 salaries is given below (in $):
Compute a 95% Confidence Interval for the mean monthly salary, assuming that
Yi = monthly salaries for person i = 1, ..., n are N (µ, σ 2 ) and independent.
To assess the assumption of normality, we could consider a histogram as in Figure
6.1. Unfortunately, histograms are not very informative for small samples. A normal
QQ (quantile-quantile) plot is given in Figure 6.2. If the data are approximately
normally distributed, then this graph should resemble a straight line. In this case,
it does(!) and we have no evidence against the normal assumption. We will learn
about normal quantile-quantile (QQ) plots in Section 6.4.5.
[Figure 6.1: Histogram of the 10 salaries. Figure 6.2: Normal QQ plot of the salaries.]
where
T = (µ̂ − µ0)/(s/√n) = (ȳ − µ0)/(s/√n)
and
s² = Σ_{i=1}^{n} (yi − ȳ)²/(n − 1).
We can compute a 95% Confidence interval for µ by finding all µ0 such that p-value(µ0) ≥ 0.05. Instead note that D is a one-to-one increasing transformation of T², which I will call g(D) = T².
We have that,
p-value(µ0) ≥ 0.05 ⟺ t²obs(µ0) ≤ a², where
P{T² ≤ a² | H0: µ = µ0} = 0.95
= P{−a ≤ T ≤ a | H0: µ = µ0}
= P{−a ≤ (Ȳ − µ0)/(s/√n) ≤ a | H0: µ = µ0}.
(i) Z = (Ȳ − µ0)/√(σ²/n) ∼ N(0, 1) when µ = µ0
(ii) V = (n − 1)s²/σ² ∼ χ²_(n−1), independent of Z. (Proved in Stat450)
E[V] = n − 1. (See Chapter 3)
E[s²] = E[σ²V/(n − 1)] = σ² E[V]/(n − 1) = σ²
T = Z/√(V/(n − 1)) = N(0, 1)/√(χ²_(n−1)/(n − 1)) ∼ t_(n−1),
the Student’s t distribution with n−1 degrees of freedom. Percentiles of the Student’s
t distribution are tabulated in Table B3. If the degrees of freedom are very large,
> 60, then the t distribution approaches N(0,1). In general, for any Z ∼ N (0, 1)
independent of V ∼ χ2(ν) ,
Z/√(V/ν) ∼ t_(ν).
P{−a ≤ (Ȳ − µ0)/(s/√n) ≤ a | H0: µ = µ0} = .95
n = 10, so P{−a ≤ t_(9) ≤ a} = 0.95, giving a = 2.262:
−2.262 ≤ (Ȳ − µ0)/(s/√n) ≤ 2.262.
Isolating µ0, we obtain
(Ȳ − 2.262 s/√n, Ȳ + 2.262 s/√n).
A 95% confidence interval for the entire dataset using n = 1151, ȳ = $3445.382,
s = $920.52, 97.5 percentile of t(1150) equal to 1.96 is:
t_(n−1)^{.975} = 97.5th percentile of t_(n−1)
Example 6.2.2. Returning to Example 6.2.1, let Yi = monthly salary for person
i = 1, ..., n, and Yi ∼ N (µ, σ 2 ) independent. Test the hypothesis that the mean
salary is $3, 000 per month.
data: y
t = 1.2836, df = 9, p-value = 0.2313
alternative hypothesis: true mean is not equal to 3000
95 percent confidence interval:
2621.22 4372.58
sample estimates:
mean of x
3496.9
For the general normal linear model, we will use a t statistic for inference about
mean parameters when σ 2 is unknown.
> summary(ywt1)
Term WTNum GR.UG SalMonth
2015 - Fall :111 W-1:352 GR: 50 Min. :1406
2015 - Spring:151 UG:302 1st Qu.:2717
2015 - Summer: 90 Median :3003
Mean :3149
3rd Qu.:3526
Max. :7259
[Figure: Histogram of Work Term 1 monthly salaries ($).]
Let Yi , i = 1, ..., n be the monthly salary for the i’th co-op student. We assume
that Yi ∼ N (µ, σ 2 ), independent. For the Work Term 1 monthly salaries, we test the
hypothesis that H0 : σ 2 = σ02 = 7502 using a Likelihood Ratio test.
ℓ(µ, σ²) = −(n/2) ln σ² − (1/2σ²) Σ_{i=1}^{n} (yi − µ)²
In our example, k = 2, µ̂ = ȳ = $3149 and σ̂² = Σ(yi − ȳ)²/n = [(n − 1)/n] s² = 778.9781².
P {A ≤ θT ≤ B} = p
(b) T = (Ȳ − µ)/(s/√n) ∼ t_(n−1) is a pivotal quantity for µ when σ² is unknown
(c) (n − 1)s²/σ² ∼ χ²_(n−1) is a pivotal quantity for σ²
P{a ≤ (n − 1)s²/σ² ≤ b} = .95
The convention is to choose a and b so that the two tails have equal area; see Figure 6.4 below.
[Figure 6.4: Chi-square(4) density, with a = qchisq(.025, 4) and b = qchisq(.975, 4) marking the equal-tail 95% region.]
P{a ≤ (n − 1)s²/σ² ≤ b} = .95
P{(n − 1)s²/a ≥ σ² ≥ (n − 1)s²/b} = .95
The interval [(n − 1)s²/b, (n − 1)s²/a] satisfies the definition of a 95% Confidence Interval for σ².
For the Work Term 1 data, s = 780.087, n = 352, a = 300.9897, b = 404.7974, and a 95% CI for σ² is [726², 842²]. There is a great deal of variability in the data.
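A sketch of this computation in R, using the values quoted above:

n <- 352; s <- 780.087
a <- qchisq(0.025, df = n - 1)
b <- qchisq(0.975, df = n - 1)
(n - 1)*s^2/c(b, a)               #95% CI for sigma^2
sqrt((n - 1)*s^2/c(b, a))         #approx. (726, 842) on the sigma scale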
Example 6.3.1. Monthly salaries for Co-op students in Work Term 1 and 2 were
collected over one year. The file ‘SalaryWT12.csv’ contains the salaries. A histogram
and summary statistics by work term number are given below. Recall that the box
on a boxplot encases the middle 50 percent of the data, i.e. from the 25’th to 75’th
percentile. The solid line in the middle of the box indicates the median and outliers
CHAPTER 6. NORMAL THEORY 117
are plotted as small circles in both tails. The boxplots of the salaries for the two
work terms suggest that these salaries have distributions which are similar.
[Figure: Boxplots of monthly salaries for Work Terms 1 and 2.]
ywt12$WTNum: W-2
[1] 847.1914
We assume that,
Y1i ∼ N (µ1 , σ 2 ), independent,
Y2j ∼ N (µ2 , σ 2 ), independent,
and that σ² = 818².
As an exercise, show that µ̂1 = ȳ1 = Σ_{i=1}^{n1} y1i/n1 = $3,149 and µ̂2 = ȳ2 = Σ_{j=1}^{n2} y2j/n2 = $3,354.
$3, 354.
n1
X n1
X
2
(y1i − ȳ) = [(y1i − ȳ1 ) + (ȳ1 − ȳ)]2
i=1 i=1
Xn1
= (yi1 − ȳ1 )2 + n1 (ȳ1 − ȳ)2 ,
i=1
since n1
X
(yi1 − ȳ1 ) (ȳ1 − ȳ) = 0.
i=1
Simplifying for the work term 2 data in the same way yields,
D = (1/σ²)[n1(ȳ1 − ȳ)² + n2(ȳ2 − ȳ)²].
Similarly, ȳ2 − ȳ = n1(ȳ2 − ȳ1)/(n1 + n2).
D = (1/σ²)[n1 n2²/(n1 + n2)² (ȳ1 − ȳ2)² + n2 n1²/(n1 + n2)² (ȳ1 − ȳ2)²]
= (1/σ²) [n1 n2/(n1 + n2)] (ȳ1 − ȳ2)²
= (1/σ²) (ȳ1 − ȳ2)²/(1/n1 + 1/n2).
p − value = P (D ≥ dobs | H0 : µ1 = µ2 ) .
We can obtain the exact distribution of D under H0 here. Since V ar(Ȳ1 ) = σ 2 /n1 ,
V ar(Ȳ2 ) = σ 2 /n2 and Ȳ1 and Ȳ2 are independent, V ar(Ȳ1 − Ȳ2 ) = σ 2 (1/n1 + 1/n2 ).
Therefore,
(Ȳ1 − Ȳ2)²/[σ²(1/n1 + 1/n2)] ∼ χ²_(1) exactly, and
(Ȳ1 − Ȳ2)/[σ √(1/n1 + 1/n2)] = Z ∼ N(0, 1) exactly.
The p-value is
p-value = P(χ²_(1) ≥ (ȳ1 − ȳ2)²/[σ²(1/n1 + 1/n2)]) = P(|Z| ≥ |ȳ1 − ȳ2|/[σ √(1/n1 + 1/n2)]),
where Z ∼ N(0, 1).
There is strong evidence against the hypothesis that salaries for work terms 1 and
2 are the same. The increase in mean monthly salary is significantly different from
zero.
A 95% confidence interval for the difference in the means is:
$$\bar{y}_1 - \bar{y}_2 \pm z_{.975}\,\sigma\sqrt{1/n_1 + 1/n_2} = -205.95 \pm 125.09 = (-331.04,\ -80.86)$$
We report, “The estimated mean monthly increase in salary in work term 2 over
work term 1 is $205.95 (95% CI $80.86 - $331.04).”
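A minimal sketch of this interval in R, assuming the ywt12 data frame of Example 6.3.1 and the known value $\sigma = 818$ used above (tapply is used here in place of the by() calls that appear in the code later in this section):

# Sketch: 95% CI for mu1 - mu2 with sigma = 818 assumed known.
means <- tapply(ywt12$SalMonth, ywt12$WTNum, mean)    # group means, W-1 and W-2
ns    <- tapply(ywt12$SalMonth, ywt12$WTNum, length)  # group sample sizes
dbar  <- means[1] - means[2]                          # ybar1 - ybar2, about -205.95
me    <- qnorm(.975) * 818 * sqrt(sum(1 / ns))        # margin of error, about 125.09
c(dbar - me, dbar + me)                               # about (-331.04, -80.86)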
We assume that,
$$Y_{1i} \sim N\left(\mu_1, \sigma^2\right), \text{ independent,} \qquad Y_{2j} \sim N\left(\mu_2, \sigma^2\right), \text{ independent,}$$
and that $\sigma^2$ is unknown.
$$L\left(\mu_1, \mu_2, \sigma^2\right) = \left(\frac{1}{\sigma^2}\right)^{\frac{n_1}{2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n_1}(y_{1i} - \mu_1)^2\right]\left(\frac{1}{\sigma^2}\right)^{\frac{n_2}{2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{j=1}^{n_2}(y_{2j} - \mu_2)^2\right]$$
As an exercise, show that $\hat{\mu}_1 = \bar{y}_1 = \sum_{i=1}^{n_1} y_{1i}/n_1 = \$3{,}149$, $\hat{\mu}_2 = \bar{y}_2 = \sum_{j=1}^{n_2} y_{2j}/n_2 = \$3{,}354$, and
$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n_1}(y_{1i} - \bar{y}_1)^2 + \sum_{j=1}^{n_2}(y_{2j} - \bar{y}_2)^2}{n_1 + n_2}.$$
$$L\left(\mu_1 = \mu, \mu_2 = \mu, \sigma^2\right) = \left(\frac{1}{\sigma^2}\right)^{\frac{n_1+n_2}{2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n_1}(y_{1i} - \mu)^2 - \frac{1}{2\sigma^2}\sum_{j=1}^{n_2}(y_{2j} - \mu)^2\right]$$
$$T = \frac{\bar{Y}_1 - \bar{Y}_2}{s_{pooled}\sqrt{1/n_1 + 1/n_2}} \sim t_{(n_1+n_2-2)},$$
and
$$s^2_{pooled} = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}.$$
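A minimal sketch of the pooled calculation done directly in R (assuming the ywt12 data frame with WTNum levels labelled "W-1" and "W-2", as in the summary output above), for comparison with the t.test(..., var.equal = TRUE) call in the code listing below:

# Sketch: pooled two-sample t statistic computed by hand.
y1 <- ywt12$SalMonth[ywt12$WTNum == "W-1"]
y2 <- ywt12$SalMonth[ywt12$WTNum == "W-2"]
n1 <- length(y1); n2 <- length(y2)
sp2 <- ((n1 - 1) * var(y1) + (n2 - 1) * var(y2)) / (n1 + n2 - 2)  # pooled variance
tobs <- (mean(y1) - mean(y2)) / sqrt(sp2 * (1 / n1 + 1 / n2))
pval <- 2 * pt(-abs(tobs), df = n1 + n2 - 2)                      # two-sided p-value
c(t = tobs, p.value = pval)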
The R output for this problem is below and code is in the following section:
$$T = \frac{\bar{Y}_1 - \bar{Y}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \sim t_{(\nu)},$$
where ν is the Satterthwaite approximation to the degrees of freedom.
The R output using this approximation appears below and code is in the following
subsection.
> t.test(SalMonth~WTNum, data=ywt12,var.equal=FALSE)
The p-value for the test of the equality of the two means is 0.001291, which is very
similar to the results of the test which assumes that the variances are equal.
### Method 1: Assume variances equal and known, Test and Confidence Interval for Difference ###
ywt12.means<-by(ywt12$SalMonth,ywt12$WTNum,mean)
ywt12.num<-by(ywt12$SalMonth,ywt12$WTNum,length)
ywt12.means
ywt12.num
zobs<-(ywt12.means[1]-ywt12.means[2])/818/sqrt(sum(1/ywt12.num))
zobs
pvalue<-2*pnorm(-abs(zobs))
pvalue
zobs^2
1-pchisq(zobs^2,1)
t.test(SalMonth~WTNum, data=ywt12,var.equal=TRUE)
t.test(SalMonth~WTNum, data=ywt12,var.equal=FALSE)
Taking derivatives and solving, the MLEs under the Reduced model are:
$$\tilde{\mu} = \frac{\sum_i y_{1i} + \sum_j y_{2j}}{n_1 + n_2} \quad\text{and}\quad \tilde{\sigma}^2 = \frac{\sum_i (y_{1i} - \tilde{\mu})^2 + \sum_j (y_{2j} - \tilde{\mu})^2}{n_1 + n_2}.$$
$$D = -(n_1 + n_2)\ln\hat{\sigma}^2 + (n_1 + n_2)\ln\tilde{\sigma}^2 = (n_1 + n_2)\ln\frac{\tilde{\sigma}^2}{\hat{\sigma}^2}$$
$$(n_1 + n_2)\,\tilde{\sigma}^2 = \sum_i\left(y_{1i} - \frac{n_1\bar{y}_1 + n_2\bar{y}_2}{n_1 + n_2}\right)^2 + \sum_j\left(y_{2j} - \frac{n_1\bar{y}_1 + n_2\bar{y}_2}{n_1 + n_2}\right)^2$$
$$\begin{aligned}
\frac{\tilde{\sigma}^2}{\hat{\sigma}^2} &= 1 + \frac{\frac{n_1 n_2}{n_1 + n_2}(\bar{y}_1 - \bar{y}_2)^2}{\sum_i (y_{1i} - \bar{y}_1)^2 + \sum_j (y_{2j} - \bar{y}_2)^2}\\
&= 1 + \frac{(\bar{y}_1 - \bar{y}_2)^2}{(n_1 + n_2 - 2)\left[\frac{\sum_i (y_{1i} - \bar{y}_1)^2 + \sum_j (y_{2j} - \bar{y}_2)^2}{n_1 + n_2 - 2}\right]\left[\frac{1}{n_1} + \frac{1}{n_2}\right]}\\
&= 1 + \frac{1}{n_1 + n_2 - 2}\,T^2, \qquad\text{where } T = \frac{\bar{y}_1 - \bar{y}_2}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}.
\end{aligned}$$
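The algebra above can be checked numerically. The following sketch uses simulated data (any two samples would do) to confirm that $(n_1+n_2)\ln(\tilde{\sigma}^2/\hat{\sigma}^2)$ and $(n_1+n_2)\ln\left(1 + T^2/(n_1+n_2-2)\right)$ agree:

# Sketch: numerical check of D = (n1 + n2) * log(1 + T^2 / (n1 + n2 - 2)).
set.seed(1)
y1 <- rnorm(30, mean = 5, sd = 2)
y2 <- rnorm(40, mean = 6, sd = 2)
n1 <- length(y1); n2 <- length(y2)
mu.tilde   <- mean(c(y1, y2))
sig2.hat   <- (sum((y1 - mean(y1))^2) + sum((y2 - mean(y2))^2)) / (n1 + n2)
sig2.tilde <- (sum((y1 - mu.tilde)^2) + sum((y2 - mu.tilde)^2)) / (n1 + n2)
sp2   <- (sum((y1 - mean(y1))^2) + sum((y2 - mean(y2))^2)) / (n1 + n2 - 2)
Tstat <- (mean(y1) - mean(y2)) / sqrt(sp2 * (1 / n1 + 1 / n2))
c((n1 + n2) * log(sig2.tilde / sig2.hat),
  (n1 + n2) * log(1 + Tstat^2 / (n1 + n2 - 2)))   # the two values agree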
We wish to determine the form and strength of the relationship between the response
variable (y) and the explanatory variable (x). Typically there are two possible goals
for the analysis:
(1) Explanation: What is the relationship between y and x?
(2) Prediction: Given x, can we predict y accurately?
Example 6.4.1. We are interested in the relationship between monthly salaries for
co-op students as a function of the work term number.
xi = work term number for student i,
yi = monthly salary for student i.
The first step is to graph the data and determine an appropriate model to fit to the
data. The data are graphed below:
[Figure: Scatterplot of monthly salary versus work term number; salaries range from about $1000 to $7000.]
(1) For a given work term number, the monthly salaries are subject to a large amount of variability;
(2) A linear relationship between monthly salary (Y) and work term number (X) seems appropriate.
In this course, we will primarily consider models where the y's are linearly related to the x's.
$$L\left(\alpha, \beta, \sigma^2\right) = \prod_{i=1}^{n}\frac{1}{\sigma}\exp\left[-\frac{1}{2\sigma^2}(y_i - \mu_i)^2\right] = \sigma^{-n}\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu_i)^2\right]$$
$$\ell\left(\alpha, \beta, \sigma^2\right) = -n\ln\sigma \underbrace{- \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \alpha - \beta x_i)^2}_{\hat{\alpha},\,\hat{\beta}\text{ will maximize this}}$$
$$\implies \hat{\alpha}, \hat{\beta} \text{ will minimize } \sum_{i=1}^{n}(y_i - \alpha - \beta x_i)^2$$
Letting $\hat{\epsilon}_i = y_i - \hat{\alpha} - \hat{\beta}x_i$, the equations (6.5) and (6.6) can be written as:
$$(6.5)\!: \quad \frac{1}{\sigma^2}\sum_{i=1}^{n}\hat{\epsilon}_i = 0$$
$$(6.6)\!: \quad \frac{1}{\sigma^2}\sum_{i=1}^{n}\hat{\epsilon}_i x_i = 0$$
Using algebraic manipulations, we derive some alternate formulae for SXY and SXX
which will be useful later.
(i)
$$S_{XX} = \sum_{i=1}^{n}(x_i - \bar{x})\,x_i = \sum_{i=1}^{n}x_i^2 - \bar{x}\sum_{i=1}^{n}x_i = \sum_{i=1}^{n}x_i^2 - n\bar{x}^2$$
(ii)
$$S_{XX} = \sum_{i=1}^{n}(x_i - \bar{x})\,x_i - \sum_{i=1}^{n}(x_i - \bar{x})\,\bar{x} \quad\text{since the 2nd term} = 0, \qquad\text{so } S_{XX} = \sum_{i=1}^{n}(x_i - \bar{x})^2$$
(iii)
$$S_{XY} = \sum_{i=1}^{n}(y_i - \bar{y})\,x_i = \sum_{i=1}^{n}x_i y_i - \bar{y}\sum_{i=1}^{n}x_i = \sum_{i=1}^{n}x_i y_i - n\bar{x}\bar{y}$$
(iv)
$$S_{XY} = \sum_{i=1}^{n}(y_i - \bar{y})\,x_i - \sum_{i=1}^{n}(y_i - \bar{y})\,\bar{x} \quad\text{since the 2nd term} = 0, \qquad\text{so } S_{XY} = \sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x})$$
(v)
$$S_{XY} = \sum_{i=1}^{n}(x_i - \bar{x})\,y_i \quad\text{since } \sum_{i=1}^{n}(x_i - \bar{x})\,\bar{y} = 0$$
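These identities are easy to check numerically. A minimal sketch with a small made-up data set:

# Sketch: numerical check of the alternate formulae for SXX and SXY.
x <- c(1, 2, 2, 3, 4)
y <- c(2.1, 2.9, 3.2, 4.8, 6.1)
n <- length(x)
c(sum((x - mean(x)) * x), sum(x^2) - n * mean(x)^2,
  sum((x - mean(x))^2))                                         # three forms of SXX
c(sum((y - mean(y)) * x), sum(x * y) - n * mean(x) * mean(y),
  sum((y - mean(y)) * (x - mean(x))), sum((x - mean(x)) * y))   # four forms of SXY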
To estimate σ 2 we compute the derivative of the log likelihood function with respect
to σ.
$$\frac{\partial\ell}{\partial\sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{n}(y_i - \alpha - \beta x_i)^2$$
$$\left.\frac{\partial\ell}{\partial\sigma}\right|_{\hat{\alpha},\hat{\beta},\hat{\sigma}} = 0 \implies n\hat{\sigma}^2 = \sum_{i=1}^{n}\left(y_i - \hat{\alpha} - \hat{\beta}x_i\right)^2$$
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{\alpha} - \hat{\beta}x_i\right)^2 = \frac{1}{n}\sum_{i=1}^{n}\hat{\epsilon}_i^2$$
Note that ϵ̂i is an estimate of ϵi = Yi −(α + βxi ) where we assumed that ϵi ∼ N (0, σ 2 )
and independent. We call ϵ̂i a Residual.
Call:
lm(formula = SalMonth ~ WTNumN, data = salarynz)
Residuals:
Min 1Q Median 3Q Max
-2960.8 -522.6 -157.4 406.4 4136.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2887.40 56.26 51.33 <2e-16 ***
WTNumN 234.99 21.06 11.16 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
• The estimated relationship between monthly salary and work term number is:
Salary = 2887.40 + 234.99 × Work Term number.
We want to express $\hat{\beta} = \sum_{i=1}^{n} a_i Y_i$ as a linear combination of the $Y_i$'s.
$$\hat{\beta} = \frac{S_{XY}}{S_{XX}} = \frac{\sum (x_i - \bar{x})\,y_i}{S_{XX}} = \sum \frac{(x_i - \bar{x})}{S_{XX}}\,y_i$$
Let $a_i = \dfrac{x_i - \bar{x}}{S_{XX}}$; then $\hat{\beta} \sim N\left(\sum a_i\mu_i,\ \sigma^2\sum a_i^2\right)$.
$$E(\hat{\beta}) = \sum a_i\mu_i = \sum \frac{(x_i - \bar{x})}{S_{XX}}(\alpha + \beta x_i) = \beta$$
$$VAR(\hat{\beta}) = \sigma^2\sum a_i^2 = \sigma^2\sum \frac{(x_i - \bar{x})^2}{S_{XX}^2} = \frac{\sigma^2}{S_{XX}}$$
$$\hat{\beta} \sim N\left(\beta,\ \frac{\sigma^2}{S_{XX}}\right)$$
We use
$$\frac{\hat{\beta} - \beta}{s/\sqrt{S_{XX}}} \sim t_{(n-2)} \tag{6.7}$$
for tests and confidence intervals for β. (It is a pivotal quantity for β as it is a function of the data and a function of the unknown parameter and its distribution is completely known.)
The quantity in the denominator of (6.7), $s/\sqrt{S_{XX}}$, is called the Standard Error of $\hat{\beta}$, s.e.($\hat{\beta}$), and it is listed in the output of Figure 6.7 under the column ‘Std. Error’. It is the square root of the estimated variance of $\hat{\beta}$. The standard error for $\hat{\beta}$ is 21.06.
Note that in (6.7), the degrees of freedom for the t distribution are the same as the
denominator in the formula for s2 . This result holds generally. The estimate s is
given in the output Figure 6.7 labelled as ‘Residual standard error:’ and the value
here is 874.7 on 1149 degrees of freedom.
We next test the hypothesis that β = 0, and construct a 99% confidence interval for
β.
$$p\text{-value} = P\left(t_{(n-2)} \ge \frac{\hat{\beta} - 0}{s\sqrt{\frac{1}{S_{XX}}}}\ \middle|\ H_0: \beta = 0 \text{ true}\right)$$
The R output in Figure 6.7, provides the observed value of the test statistic, (6.7),
under the column ‘t value’. The observed value of our t-statistic for β is 11.16. The
p − value is given in the column ‘Pr(> |t|)’ and its value is listed as < 2e − 16.
For very small or very large numbers, R uses exponential notation. 2e − 16 means
2 × 10−16 .
$$p\text{-value} = P\left(t_{(1149)} \ge 11.16\right) < 2\times 10^{-16}.$$
$$\text{estimate} \pm t_{(\nu),\,.995}\ \text{s.e.(estimate)}.$$
Here we obtain the t-quantile from R as: qt(.995, 1149), so our 99% confidence
interval is:
234.99 ± 2.58 × 21.06 = [180.66, 289.32].
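A minimal sketch of this interval in R, assuming Sal.lm is the fitted model lm(SalMonth ~ WTNumN, data = salarynz) shown in Figure 6.7:

# Sketch: 99% confidence interval for beta, by hand and via confint().
beta.hat <- coef(summary(Sal.lm))["WTNumN", "Estimate"]     # 234.99
se.beta  <- coef(summary(Sal.lm))["WTNumN", "Std. Error"]   # 21.06
beta.hat + c(-1, 1) * qt(.995, df = 1149) * se.beta         # about (180.66, 289.32)
confint(Sal.lm, "WTNumN", level = 0.99)                     # same interval directly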
Distribution of α̂
We want to express $\hat{\alpha} = \sum_{i=1}^{n} b_i Y_i$ as a linear combination of the $Y_i$'s.
$$\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} = \frac{1}{n}\sum y_i - \bar{x}\sum a_i y_i = \sum\underbrace{\left[\frac{1}{n} - \bar{x}a_i\right]}_{b_i}y_i = \sum b_i y_i, \qquad a_i = \frac{x_i - \bar{x}}{S_{XX}}$$
Therefore, $\hat{\alpha} \sim N\left(\sum b_i\mu_i,\ \sigma^2\sum b_i^2\right)$.
$$VAR(\hat{\alpha}) = \sigma^2\sum b_i^2$$
$$\sum b_i^2 = \sum\left[\frac{1}{n} - \frac{\bar{x}(x_i - \bar{x})}{S_{XX}}\right]^2 = \sum\left[\frac{1}{n^2} - \frac{2}{n}\,\frac{\bar{x}(x_i - \bar{x})}{S_{XX}} + \frac{\bar{x}^2(x_i - \bar{x})^2}{S_{XX}^2}\right] = \frac{1}{n} + \frac{\bar{x}^2}{S_{XX}^2}\sum(x_i - \bar{x})^2 = \frac{1}{n} + \frac{\bar{x}^2}{S_{XX}}$$
$$\hat{\alpha} \sim N\left(\alpha,\ \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{XX}}\right]\right)$$
We use
$$\frac{\hat{\alpha} - \alpha}{s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{XX}}}} \sim t_{(n-2)} \tag{6.8}$$
for tests and confidence intervals for α. (It is a pivotal quantity for α as it is a function of the data and a function of the unknown parameter and its distribution is completely known.)
The quantity in the denominator of (6.8), $s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{XX}}}$, is called the Standard Error of $\hat{\alpha}$, s.e.($\hat{\alpha}$), and it is listed in the output of Figure 6.7 under the column ‘Std. Error’. It is the square root of the estimated variance of $\hat{\alpha}$. The standard error for $\hat{\alpha}$ is 56.26.
Note that in (6.8), the degrees of freedom for the t distribution are the same as the
denominator in the formula for s2 .
We next test the hypothesis that α = 0, and construct a 99% confidence interval for
α.
$$p\text{-value} = P\left(t_{(n-2)} \ge \frac{|\hat{\alpha} - 0|}{s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{XX}}}}\ \middle|\ H_0: \alpha = 0\right)$$
The R output in Figure 6.7, provides the observed value of the test statistic, (6.8),
under the column ‘t value’. The observed value of our t-statistic for α is 51.33. The
p − value is given in the column ‘Pr(> |t|)’ and its value is listed as < 2e − 16.
For very small or very large numbers, R uses exponential notation. 2e − 16 means
2 × 10−16 .
$$p\text{-value} = P\left(t_{(1149)} \ge 51.33\right) < 2\times 10^{-16}.$$
$$\text{estimate} \pm t_{(\nu),\,.995}\ \text{s.e.(estimate)}.$$
Here we obtain the t-quantile from R as: qt(.995, 1149), so our 99% confidence
interval is:
2887.40 ± 2.58 × 56.26 = [$2742.25, $3032.55].
$$\hat{\mu}_0 = \left[\bar{y} - \hat{\beta}\bar{x}\right] + \hat{\beta}x_0 = \bar{y} + \hat{\beta}(x_0 - \bar{x}) = \frac{1}{n}\sum y_i + (x_0 - \bar{x})\sum a_i y_i, \qquad a_i = \frac{x_i - \bar{x}}{S_{XX}}$$
$$= \sum\underbrace{\left[\frac{1}{n} + (x_0 - \bar{x})a_i\right]}_{c_i}y_i$$
$$\implies \hat{\mu}_0 \sim N\left(\sum c_i\mu_i,\ \sigma^2\sum c_i^2\right)$$
We know that,
$$E(\hat{\alpha}) = \alpha, \quad E(\hat{\beta}) = \beta \implies E\left(\hat{\alpha} + \hat{\beta}x_0\right) = \alpha + \beta x_0$$
$$\sum c_i^2 = \sum\left[\frac{1}{n} + (x_0 - \bar{x})a_i\right]^2 = \sum\left[\frac{1}{n^2} + \frac{2(x_0 - \bar{x})}{n}a_i + (x_0 - \bar{x})^2 a_i^2\right] = \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}} \quad\text{since } \sum a_i = 0$$
$$\hat{\alpha} + \hat{\beta}x_0 \sim N\left(\alpha + \beta x_0,\ \left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}\right]\sigma^2\right)$$
$$\frac{\hat{\alpha} + \hat{\beta}x_0 - (\alpha + \beta x_0)}{s\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}}} \sim t_{(n-2)}$$
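A minimal sketch of this interval in R, again assuming Sal.lm is the fitted model of Figure 6.7 and taking x0 = 3 work terms as an illustrative value:

# Sketch: estimated mean salary and 95% CI for the mean response at x0 = 3.
predict(Sal.lm, newdata = data.frame(WTNumN = 3),
        interval = "confidence", level = 0.95)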
$$\begin{aligned}
SST &= \sum (y_i - \bar{y})^2 \qquad\text{adding and subtracting } \hat{y}_i \text{ within the brackets}\\
&= \sum\left[(y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})\right]^2, \quad\text{where } \hat{y}_i = \hat{\alpha} + \hat{\beta}x_i\\
&= \sum (y_i - \hat{y}_i)^2 + \sum (\hat{y}_i - \bar{y})^2 + 2\sum (y_i - \hat{y}_i)(\hat{y}_i - \bar{y})\\
&= \sum (y_i - \hat{y}_i)^2 + \sum (\hat{y}_i - \bar{y})^2\\
&= \sum \hat{\epsilon}_i^2 + \sum (\hat{y}_i - \bar{y})^2 = SSE + SSR
\end{aligned}$$
The cross-product term is zero because of equations 6.5 and 6.6. SSE is the sum
of squares error and SSR is the sum of squares regression.
Now we define $R^2$,
$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}.$$
Note that 0 ≤ R2 ≤ 1. If the regression line fits the data perfectly, then yi = ŷi and
ϵ̂i = 0 for all i = 1, ..., n. In that case, SSE = 0 and R2 = 1.
If ŷi = ȳ for all i = 1, ..., n, then SSR = 0 and R2 = 0.
Returning to the R output for the co-op salary data, R2 is called ‘Multiple R-squared:’
at the bottom of Figure 6.7, and is equal to 0.09779. Only about 10% of the variation
in salaries is explained by the work term number.
R2 has a deficiency in that it can be artificially inflated by adding more explanatory
variables into the model. Adjusted R2 incorporates a penalty for the number of
explanatory variables in the model, and is the preferred measure of fit for linear
regressions. Its formula is:
$$\text{Adjusted } R^2 = 1 - \frac{s^2}{s_Y^2},$$
where $s_Y^2$ is the sample variance of the $y_i$'s.
It is listed in Figure 6.7 as ‘Adjusted R-squared’.
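A minimal sketch computing both quantities from their definitions, assuming Sal.lm is the fitted model of Figure 6.7:

# Sketch: R^2 and adjusted R^2 from the sums of squares.
y   <- fitted(Sal.lm) + residuals(Sal.lm)   # recover the observed responses
SST <- sum((y - mean(y))^2)
SSE <- sum(residuals(Sal.lm)^2)
c(R2    = 1 - SSE / SST,                          # Multiple R-squared
  adjR2 = 1 - summary(Sal.lm)$sigma^2 / var(y))   # Adjusted R-squared = 1 - s^2 / s_Y^2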
Anscombe’s data
Anscombe’s data provides a good illustration of issues with linear regression and R2 .
See the file AnscombeR.pdf on Brightspace. The Anscombe dataset is built into R;
simply type anscombe to see the dataset. The dataset consists of four pairs of x’s
and y’s. We graph and fit linear models to each of the pairs.
> anscombe
x1 x2 x3 x4 y1 y2 y3 y4
1 10 10 10 8 8.04 9.14 7.46 6.58
2 8 8 8 8 6.95 8.14 6.77 5.76
3 13 13 13 8 7.58 8.74 12.74 7.71
4 9 9 9 8 8.81 8.77 7.11 8.84
5 11 11 11 8 8.33 9.26 7.81 8.47
6 14 14 14 8 9.96 8.10 8.84 7.04
7 6 6 6 8 7.24 6.13 6.08 5.25
8 4 4 4 19 4.26 3.10 5.39 12.50
9  12 12 12  8 10.84 9.13  8.15 5.56
10  7  7  7  8  4.82 7.26  6.42 7.91
11  5  5  5  8  5.68 4.74  5.73 6.89
> summary(ans.lm1)
Call:
lm(formula = y1 ~ x1, data = anscombe)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.0001 1.1247 2.667 0.02573 *
x1 0.5001 0.1179 4.241 0.00217 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Figures 6.9 and 6.10 below show the scatterplots of the pairs of Anscombe’s data
together with the fitted linear model in the first columns. Plots of residuals versus
fitted values from linear model fits are shown in the second columns. A linear model
seems appropriate for the first pair, (x1, y1). The second pair, (x2, y2), requires a
quadratic model. The third pair has an outlier which raises the regression line. The
fourth pair has an influential point which totally determines the line. Thus, although
their R2 values are all the same, we see that the linear model fits are all very different
for the four pairs.
[Figures 6.9 and 6.10: For each Anscombe pair — "Y1 vs X1", "Y2 vs X2", "Y3 vs X3", "Y4 vs X4" — a scatterplot with the fitted line (left column) and the corresponding "Residuals vs Fitted" plot (right column).]
ans.lm1 <- lm(y1 ~ x1, data = anscombe)       # fit the linear model for the first pair
plot(ans.lm1, which = 1, add.smooth = FALSE)  # residuals versus fitted values
summary(ans.lm1)
ANOVA
ANOVA stands for Analysis of Variance, and it is a tabulation of the sources of
variation that we derived for R². It usually also includes test statistics, most often
an F-test statistic with its p-value. For our simple linear regression, the ANOVA
table has the following form:

Source       Df       Sum Sq   Mean Sq             F value
Regression   1        SSR      MSR = SSR/1         MSR/MSE
Residuals    n − 2    SSE      MSE = SSE/(n − 2)
Total        n − 1    SST
The ANOVA table for the Co-op salary data appears below. Note that the p-value
for the F-test is exactly the same as the p-value for the t test of the H0 : β = 0
in Figure 6.7. That is because we have only one X variable in our model, namely
WTNumN.
> anova(Sal.lm)
Analysis of Variance Table
Response: SalMonth
Df Sum Sq Mean Sq F value Pr(>F)
WTNumN 1 95290565 95290565 124.54 < 2.2e-16 ***
Residuals 1149 879171908 765163
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
$$Y_i = \alpha + \beta x_i + \epsilon_i, \qquad \epsilon_i \sim N\left(0, \sigma^2\right) \text{ independent.}$$
To estimate $\epsilon_i$, we use
$$\hat{\epsilon}_i = y_i - \left(\hat{\alpha} + \hat{\beta}x_i\right) = y_i - \hat{y}_i = \text{residual}_i.$$
If the model is correct, we would expect (ϵ̂1 , . . . , ϵ̂n ) to behave like a random sample
from N (0, σ 2 ).
A plot of residuals should be scattered about the centre line at zero, all within
approximately ±3s. We plot the residuals versus the fitted values, ŷi , and look
for:
1. constant variance,
2. patterns that suggest nonlinearity,
3. outliers,
4. influential points.
In the plots of residuals versus fitted values of Figures 6.9 and 6.10, the pairs 2, 3
and 4 indicate problems with the linear models.
We can also plot a histogram of the residuals to check for normality, or a Normal
Q-Q plot which is explained in the next section.
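A minimal sketch of these checks in R, assuming Sal.lm is the fitted model of Figure 6.7:

# Sketch: residual diagnostics for the co-op salary model.
plot(fitted(Sal.lm), residuals(Sal.lm),
     xlab = "Fitted values", ylab = "Residuals")      # look for constant spread, no pattern
abline(h = 0, lty = 2)
hist(residuals(Sal.lm), main = "Histogram of residuals")
qqnorm(residuals(Sal.lm)); qqline(residuals(Sal.lm))  # Normal Q-Q plot (next section)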
If the data come from a normal distribution, the points in a Normal Q-Q plot fall approximately on a straight line. Figure 6.12 is a Normal Q-Q plot of a sample of size 200 generated in R from the χ²(2) distribution. The points do NOT fall on a straight line.
[Normal Q-Q plots of Sample Quantiles versus Theoretical Quantiles; the second plot, Figure 6.12, is for the χ²(2) sample of size 200 and curves away from a straight line.]
x3<-rchisq(200,df=2)
qqnorm(x3)
qqline(x3)
Example 6.5.1. Twelve students in a statistics course recorded the scores listed
below on their first and second tests in the course.
Student
1 2 3 4 5 6 7 8 9 10 11 12
Test 1 64 28 90 30 97 20 100 67 54 44 100 71
Test 2 80 87 90 57 89 51 81 82 89 78 100 81
Test the hypothesis that there is no difference in the scores for the 2 tests.
Solution:
Note: The Test 1 and Test 2 scores for the same student are not independent of each other. We would, however, expect results from different individuals to be independent of one another. We therefore work with the differences $x_i = \text{Test 1}_i - \text{Test 2}_i$, $i = 1, \ldots, 12$, and assume $X_i \sim N(\mu, \sigma^2)$, independent.
$$p\text{-value} = P\left(t_{(n-1)} \ge \frac{|\bar{x} - 0|}{s/\sqrt{n}}\ \middle|\ H_0: \mu = 0\right)$$
$$\bar{x} = -16.67, \qquad s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1} = 474.97, \qquad t_{obs} = \frac{|-16.67|}{\sqrt{474.97/12}} = 2.65$$
$$p\text{-value} = P\left(t_{(n-1)} \ge 2.65\right)$$
$$P\left(t_{(11)} \ge 2.201\right) = .025, \qquad P\left(t_{(11)} \ge 2.718\right) = .01$$
There is evidence against the hypothesis that there is no difference in the scores for
the 2 exams, with a mean difference of -16.67.
A 95% Confidence Interval for the mean difference has the form,
$$\bar{X} \pm t_{(n-1)}\,\frac{s}{\sqrt{n}} = -16.67 \pm (2.201)\sqrt{\frac{474.97}{12}} = [-30.5,\ -2.82].$$
Note that the interval does not cover zero and the scores were significantly lower for
Test 1 than Test 2.
Our concluding statement is that: Scores on Test 1 were significantly lower than
those on Test 2, with a mean difference of 16.67 (s.e. 6.29).
R Code for Paired Measurements Example:
>T1<-c( 64, 28, 90, 30, 97, 20, 100, 67, 54, 44, 100, 71)
>T2<-c( 80, 87, 90, 57, 89, 51, 81, 82, 89, 78, 100, 81)
> var(T1-T2)
[1] 474.9697
> t.test(T1-T2)
data: T1 - T2
t = -2.6491, df = 11, p-value = 0.02262
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-30.513786 -2.819547
sample estimates:
mean of x
-16.66667
Why pair?
Suppose instead we had randomly chosen one sample of Test 1 scores and an independent sample of Test 2 scores, assuming,
$$Y_{1i} \sim N\left(\mu_1, \sigma^2\right) \text{ and } Y_{2j} \sim N\left(\mu_2, \sigma^2\right), \text{ independent,}$$
and investigated $H_0: \mu_1 - \mu_2 = 0$.
Suppose that by chance, the first group consisted of students that were brighter
than the second group. Then any difference in the two test results may be due to
the intelligence difference of the groups rather than to differences in the difficulty of
the two tests. With pairing, differences in the two tests will not be obscured by the
second sample of students being entirely different (independent) from the first.