
Lecture Notes for Stat261

by
Mary Lesperance

© Mary Lesperance
Department of Mathematics and Statistics
University of Victoria, Victoria, B.C.
Contents

1 Background material 1
1.1 Distribution Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 R code for Distribution Figures . . . . . . . . . . . . . . . . . 10
1.2 Review Stat260 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2 Likelihood methods 14
2.1 Introduction to Maximum Likelihood Estimation . . . . . . . . . . . 14
2.2 Likelihoods Based on Frequency Tables . . . . . . . . . . . . . . . . . 18
2.3 Unusual example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Combining Independent Events . . . . . . . . . . . . . . . . . . . . . 22
2.5 Relative Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5.1 R code for Example 2.1.1 . . . . . . . . . . . . . . . . . . . . . 30
2.5.2 R code for Example 2.2.1 . . . . . . . . . . . . . . . . . . . . . 31
2.6 Likelihood for Continuous Models . . . . . . . . . . . . . . . . . . . . 33
2.6.1 R code for Example 2.6.1 . . . . . . . . . . . . . . . . . . . . . 36
2.7 Invariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.7.1 R code for Example 2.7.1 . . . . . . . . . . . . . . . . . . . . . 42

3 Two Parameter Likelihoods 44


3.1 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . . 44
3.2 The Chi-Square Distribution . . . . . . . . . . . . . . . . . . . . . . . 51
3.2.1 R code for Chi-square distribution . . . . . . . . . . . . . . . 53

4 Tests of Significance 54
4.1 Introduction to Tests of Significance . . . . . . . . . . . . . . . . . . . 54
4.2 Likelihood Ratio Tests for Simple Null Hypotheses . . . . . . . . . . . 60
4.2.1 One Parameter Case . . . . . . . . . . . . . . . . . . . . . . . 61


4.2.2 R code for Example 4.2.1 . . . . . . . . . . . . . . . . . . . . 64


4.2.3 LR Statistic for 2 or More Parameters . . . . . . . . . . . . . 65
4.2.4 R code for Example 4.2.3 . . . . . . . . . . . . . . . . . . . . 69
4.3 Likelihood Ratio Tests for Composite Hypotheses . . . . . . . . . . . 69
4.3.1 R Code Example 4.2.3 continued . . . . . . . . . . . . . . . . 72
4.3.2 Summary of Likelihood Ratio testing . . . . . . . . . . . . . . 73
4.4 Tests for Binomial Probabilities . . . . . . . . . . . . . . . . . . . . . 74
4.4.1 R Code for Example 4.4.1: . . . . . . . . . . . . . . . . . . . . 78
4.5 Tests for Multinomial Probabilities, Goodness of fit test . . . . . . . . 78
4.5.1 R Code for Example 4.5.1: . . . . . . . . . . . . . . . . . . . . 81
4.6 Multinomial Probabilities - Tests for Independence in Contingency
Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.6.1 R Code for Example 4.6.1: . . . . . . . . . . . . . . . . . . . . 86
4.7 Cause and Effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.7.1 Accuracy of the χ2 approximation . . . . . . . . . . . . . . . . 87
4.8 The General Contingency Table . . . . . . . . . . . . . . . . . . . . . 88
4.8.1 R Code for Example 4.8.1: . . . . . . . . . . . . . . . . . . . . 90
4.8.2 Pearson’s Goodness of Fit Statistic . . . . . . . . . . . . . . . 91
4.8.3 R Code for Pearson’s GOF test, Example 4.8.1: . . . . . . . . 92

5 Confidence Intervals 93
5.1 Invert a Test to Derive a Confidence Interval . . . . . . . . . . . . . . 93
5.2 Approximate Confidence Intervals . . . . . . . . . . . . . . . . . . . . 97
5.2.1 R Code for Example 5.2.1 . . . . . . . . . . . . . . . . . . . . 100
5.3 Another Approximate Confidence Interval . . . . . . . . . . . . . . . 101

6 Normal Theory 103


6.1 Basic Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.2 One Sample Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.2.1 Confidence Intervals for µ . . . . . . . . . . . . . . . . . . . . 107
6.2.2 Hypothesis tests for µ . . . . . . . . . . . . . . . . . . . . . . 110
6.2.3 Inferences for σ 2 . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.3 The Two Sample Model . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.3.1 Inferences for the differences between two means . . . . . . . . 118
6.3.2 Testing Equality of Variances . . . . . . . . . . . . . . . . . . 126
6.3.3 Appendix: Derive the 2-sample σ unknown but equal t-test . . 127
6.4 The Straight Line Model . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.4.1 Linear model parameter estimation . . . . . . . . . . . . . . . 132

6.4.2 Linear model Distribution theory . . . . . . . . . . . . . . . . 136


6.4.3 R2 and ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . 141
6.4.4 Checking Goodness of Fit . . . . . . . . . . . . . . . . . . . . 147
6.4.5 Normal Q-Q plots . . . . . . . . . . . . . . . . . . . . . . . . . 148
6.5 Analysis of Paired Measurements . . . . . . . . . . . . . . . . . . . . 151

Index 154
Chapter 1

Background material

1.1 Distribution Summary


1. Binomial(n, p); f(x) = (n choose x) p^x (1 − p)^(n−x), x = 0, 1, . . . , n,
   where (n choose x) = n!/(x!(n − x)!).

Consider n independent repetitions of an experiment each of which has only


two possible outcomes, say (S, F ) where

P {S} = p is constant, i.e. the same for each experiment

Let X = #S’s in n trials


Then X ∼ Binomial (n, p) .
Example: Let X be the number of heads in n tosses of a fair coin.
Note: We often use the Binomial Distribution when we Sample with Re-
placement.


Figure 1.1: Binomial probability mass function, n=100, p=0.1

2. Multinomial(n, p1, . . . , pk); f(x1, . . . , xk) = (n choose x1, . . . , xk) p1^(x1) p2^(x2) · · · pk^(xk),
   where (n choose x1, . . . , xk) = n!/(x1! x2! · · · xk!) and
   xi = 0, 1, . . . , n, such that x1 + · · · + xk = n and p1 + · · · + pk = 1.

Consider n independent repetitions of an experiment for which each outcome


can be classified in exactly one of k mutually exclusive ways, A1 , A2 , . . . , Ak .
Let

pi = P {an outcome of one trial is of class Ai }


Xi = # outcomes that are of class i out of n repetitions

Then (X1 , X2 , . . . , Xk ) ∼ Multinomial (n, p1 , . . . , pk )



Note: Σ_{i=1}^k pi = 1 and Σ_{i=1}^k Xi = n.

Example: Toss a fair die n = 100 times and let (X1 , X2 , . . . , X6 ) be the observed
frequencies of the numbers 1, 2, 3, 4, 5, 6 from the tosses of the die. Since the
die is fair, then pi = 1/6 for i = 1, . . . , 6.

3. Negative Binomial(r, p); f(x) = (x + r − 1 choose r − 1) p^r (1 − p)^x, x = 0, 1, ...
Consider independent repetitions of an experiment each of which has exactly
two possible outcomes, say (S, F ) .
Let P (S) = p constant, i.e. the same for each experiment
Let X = # F ’s before the rth S
Then X ∼ NegBin(r, p)
Example: Continue flipping a fair coin and stop when you observe the first
head. X = the number of tails before the first head has a Negative Binomial
distribution with r = 1.

Figure 1.2: Negative Binomial probability mass function, r=10, p=0.1

4. Geometric(p); is the same as Negative Binomial (r = 1, p)

5. Hypergeometric(N, M, n); f(x) = (M choose x)(N − M choose n − x) / (N choose n),
   where max(0, n − N + M) ≤ x ≤ min(n, M).
Consider a finite population of size N . Let each object in the population be
characterized as either a S or F, where there are M ≤ N S’s in the population.
Draw a random sample of size n from the population without replacement.
Let X = # S’s in the sample of size n.
Then X ∼ Hypergeometric(N, M, n).
Example: Suppose that a bin contains N = 100 balls, of which M = 30 are
white and N −M = 70 are black. Choose a random sample of n = 10 balls from
the bin without replacement. X = the number of white balls in the sample has
a Hypergeometric(N = 100, M = 30, n = 10) distribution.
Example: A shipping container contains N = 10, 000 iPhone 7’s of which
M = 30 are defective and the remainder are not defective. Choose a random
sample of n = 100 iPhone 7’s from the shipping container without replacement.
Then X = the number of defectives in the sample has a Hypergeometric(N =
10, 000, M = 30, n = 100) distribution.
In this example the sampling fraction is small, n/N = 100/10,000 = 0.01 ≤ 0.05, so X = the number of
defectives in the sample is approximately distributed as Binomial(n = 100, p =
30/10,000 = 0.003).
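
A quick numerical comparison (not part of the original notes; the values just mirror this example) of the exact Hypergeometric pmf with its Binomial approximation when the sampling fraction is small:

# Hypergeometric pmf vs its Binomial approximation (illustrative check only)
x <- 0:5
round(dhyper(x, m=30, n=10000-30, k=100), 5)    # exact Hypergeometric(10000, 30, 100)
round(dbinom(x, size=100, prob=30/10000), 5)    # Binomial(100, 0.003) approximation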

Figure 1.3: Hypergeometric probability mass function, N=100, M=30, n=10

6. Poisson(λ); f(x) = λ^x e^(−λ)/x!, x = 0, 1, . . .

Models the number of occurrences of random events in space and time, where
the average rate, λ per unit time (or area, or volume) is constant.

Let X = # events in t units of time


Then X ∼ Poisson (λt)

Example: Let X = the number of customers arriving at a bank in a given one


hour time interval.

Figure 1.4: Poisson probability mass function, λ = 5

7. Exponential(mean θ); f(x) = (1/θ) e^(−x/θ), x > 0, θ > 0.
Models lifetimes where there is no deterioration with age - or - waiting times
between successive random events in a Poisson process. We also parameterize
the exponential distribution using the rate parameter, λ = 1/θ.

Figure 1.5: Exponential density function, θ = 1/.5 = 2

8. Gamma(α, β); f(x) = (1/(Γ(α) β^α)) x^(α−1) exp(−x/β), x > 0, α, β > 0.

Figure 1.6: Gamma density function, α = 2, β = 2

9. Normal(µ, σ²); f(x) = (1/(√(2π) σ)) exp{−(1/2)((x − µ)/σ)²}, x, µ ∈ ℝ, σ² > 0.

Many measurements are approximately normal.


If X ∼ N(µ, σ²), then (X − µ)/σ ∼ N(0, 1).

Figure 1.7: Normal density function, µ = 0, σ = 1

Note: If X1, . . . , Xn are independent with Xi ∼ N(µi, σi²) and a1, . . . , an are constants, then

Σ_{i=1}^n ai Xi ∼ N(Σ ai µi, Σ ai² σi²).

Central Limit Theorem

Let Sn = Σ_{i=1}^n Xi be the sum of n independent random variables, each with mean µ and variance σ². Then

(Sn − nµ)/(σ√n) ≈ N(0, 1) for large n,

where ≈ means approximately distributed as.
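
A small simulation (not part of the original notes; the Exponential(mean = 2) choice is only an illustration) shows the standardized sum settling onto the N(0, 1) curve:

# CLT illustration: standardized sums of Exponential(mean = 2) variables
set.seed(1)
n <- 50; mu <- 2; sigma <- 2              # Exponential with mean 2 also has sd 2
Sn <- replicate(10000, sum(rexp(n, rate=1/mu)))
z  <- (Sn - n*mu)/(sigma*sqrt(n))
hist(z, breaks=40, freq=FALSE, main="CLT check")
curve(dnorm(x), add=TRUE)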

1.1.1 R code for Distribution Figures

#Binomial
x <- 0:100
plot(x, dbinom(x,size=100,prob=.1), ylab='pmf', xlab='x')
title("Binomial(n=100, p=0.1) pmf")

#Negative Binomial
x <- 0:100
plot(x, dnbinom(x,size=10,prob=.1), ylab='pmf', xlab='x')
title("Negative Binomial(r=10, p=0.1) pmf")

#Hypergeometric
x <- 0:10
plot(x, dhyper(x,m=30,n=70,k=10), ylab='pmf', xlab='x')
title("Hypergeometric(N=100, M=30, n=10) pmf")

#Poisson
x <- 0:20
plot(x, dpois(x,lambda=5), ylab='pmf', xlab='x')
title("Poisson(lambda=5) prob mass function")

#Exponential
x <- seq(0,10,by=.01)
plot(x, dexp(x,rate=.5), ylab='pdf', xlab='x', type='l')
title("Exponential(rate=.5) prob density function")

#Gamma
x <- seq(0,15,by=.01)
plot(x, dgamma(x,shape=2,scale=2), ylab='pdf', xlab='x', type='l')
title("Gamma(alpha=2, beta=2) prob density function")

#Normal
x <- seq(-3,3,by=.01)
plot(x, dnorm(x,mean=0, sd=1), ylab='pdf', xlab='x', type='l')
title("Normal(mean=0, sd=1) prob density function")

1.2 Review Stat260


• A random variable, X, is a quantity that can take various real
values according to chance.

Notation: X, Y, Z random variables


x, y, z realized values of random variables
X , Y, Z range of values

• Discrete random variables: have only finitely many or at most countably


many possible values. For example,

X ∼ Binomial (n, p) X = {0, 1, 2, . . . , n}


X ∼ Poisson (λ) X = {0, 1, 2, . . .}

The Probability mass function, pmf, of X is f (x) = P (X = x).

• Continuous random variables: can take on any real value in an interval.


For example,

X ∼ Normal (µ, σ 2 ) X =R
X ∼ Exponential (mean θ) X = (0, ∞)

The Cumulative distribution function of X, cdf, is F (x) = P (X ≤ x).

The Probability density function of X, pdf, is f(x) = (d/dx) F(x).

Given f(x) we can obtain F(x),

F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(u) du.

• Expectation of X: E(X), also called the (population) mean of X.

  Discrete case: E(X) = Σ_{x∈X} x f(x), where f(x) is the pmf of X.

  Continuous case: E(X) = ∫_{−∞}^{∞} x f(x) dx, where f(x) is the pdf of X.

  Recall: E(aX + b) = aE(X) + b, where a, b are constants.

• Variance of X: Var(X) = σX² = E(X − E(X))² = E(X²) − [E(X)]²

  Discrete case: Var(X) = Σ_{x∈X} (x − E(X))² f(x)

  Continuous case: Var(X) = ∫_{−∞}^{∞} (x − E(X))² f(x) dx

  Recall: (i) √Var(X) = σ is called the standard deviation (sd) of X
  (ii) Var(aX + b) = a² Var(X), a, b constants
  (iii) Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)
  (iv) Cov(X, Y) = E(XY) − E(X)E(Y)

• Independent random variables:


Let X and Y be random variables with marginal pmf’s (or pdf’s) f1 (x) and
f2 (y) respectively. Let f (x, y) be the joint pmf (pdf) of X and Y . Then X and
Y are statistically independent if and only if

f (x, y) = f1 (x)f2 (y)

Recall: If X, Y are independent, then

Cov(X, Y ) = 0 and V ar(X + Y ) = V ar(X) + V ar(Y )
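
A tiny numerical illustration of the discrete-case expectation and variance formulas above (not from the notes; the Binomial(10, 0.3) choice is arbitrary):

# E(X) = sum x f(x) and Var(X) = sum (x - E(X))^2 f(x) for X ~ Binomial(10, 0.3)
x <- 0:10
f <- dbinom(x, size=10, prob=0.3)
EX <- sum(x*f)              # equals np = 3
VarX <- sum((x - EX)^2*f)   # equals np(1-p) = 2.1
c(EX=EX, VarX=VarX)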



1.3 Notation
The following is a list of notation for these notes:
1. ∼ : is distributed as
2. ≈ : approximately distributed as
3. L(θ): the likelihood function as a function of θ
4. ℓ(θ) : log-likelihood as function of θ
Chapter 2

Likelihood methods

2.1 Introduction to Maximum Likelihood Estimation
Optional Text Reading: Section 9.1, pp. 3-8

Example 2.1.1. Canada Border Services processes hundreds of thousands of small


parcels entering the country by mail. As part of their effort to combat the opioid crisis, they
are interested in estimating the proportion of small parcels that contain ingredients
that could be used to manufacture illicit drugs, which we will call illegal parcels here.
They do not have the resources to check ALL small parcels entering the country, and
instead they perform a sample audit.

They randomly choose n = 100 small parcels and keep track of

X = # illegal parcels out of n = 100

Let θ = probability that a randomly chosen parcel is illegal. The auditors are inter-
ested in estimating θ.

We begin by postulating a probability model to describe the sampling procedure.


We will assume that:

X ∼ Binomial(n = 100, p = θ), x = 0, 1, ..., 100; 0 ≤ θ ≤ 1

p(x; θ) = (100 choose x) θ^x (1 − θ)^(100−x)

Question: What assumptions are required for the use of the Binomial distribution
here?

The Maximum Likelihood Estimate, MLE of θ is the value of θ that maximizes


p(x; θ) given the data x.
Is that a reasonable estimate of θ?
It is the parameter value that best explains the data in the sense that it maximizes
the probability of the data assuming that the hypothesized probability model is
true.

MAXIMIZATION =⇒ CALCULUS

Let’s introduce some simplifications into the optimization problem:


• The factor (100 choose x) will have no effect on the maximization of p(x; θ) over θ.

To simplify the expression, we will omit multiplicative constants that do not


involve θ.

Definition: L(θ) = c p(x; θ) is called the Likelihood function, where c is
a positive constant that does not depend on θ. Therefore, we have:

max_θ p(x; θ) ⇐⇒ max_θ L(θ)

In Example 2.1.1, c = 1/(100 choose x) and L(θ) = θ^x (1 − θ)^(100−x).

• Usually L(θ) is the product of terms in θ, however, it is generally easier to take


derivatives of sums.

Definition: ℓ(θ) = ln L(θ) is called the Log-likelihood function.

Note: ln(y) is a monotone increasing function of y, so we have the following:

max_θ ℓ(θ) ⇔ max_θ L(θ) ⇔ max_θ p(x; θ)

Returning to the auditing Example 2.1.1:

ℓ(θ) = ln L(θ) = x ln θ + (100 − x) ln(1 − θ)

ℓ′(θ) = x/θ − (100 − x)/(1 − θ), 0 < θ < 1

At the maximum point, θ̂,

ℓ′(θ̂) = 0 = x/θ̂ − (100 − x)/(1 − θ̂) =⇒ θ̂ = x/100

To ensure that θ̂ = x/100 is a maximum, we check that the second derivative ℓ′′(θ̂) < 0.

ℓ′′(θ) = −x/θ² − (100 − x)/(1 − θ)² < 0 for all 0 < θ < 1.

At the boundary values of 0 and 1, L(0) = L(1) = 0 < L(θ) for 0 < θ < 1. Therefore θ̂
is a maximum, the MLE.
Question: What is the MLE of θ when x = 0 or x = n?

Example: If x = 7, then θ̂ = 0.07. An estimated 7% of small parcels were illegal, i.e.
contained illicit ingredients.

We define two more quantities:



• The Score Function, S(θ), is defined as

S(θ) = ℓ′(θ) = dℓ(θ)/dθ

• The Information Function, I(θ), is defined as

I(θ) = −ℓ′′(θ) = −d²ℓ(θ)/dθ²

At the MLE θ̂ (here, the estimated proportion of illegal parcels), S(θ̂) = 0 and I(θ̂) > 0.
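
As a numerical sketch (using the Example 2.1.1 values n = 100 and x = 7; this code is not from the original notes), the Score is zero and the Information positive at the MLE:

# Score and Information for the Binomial log-likelihood of Example 2.1.1
x <- 7; n <- 100
S <- function(theta) x/theta - (n - x)/(1 - theta)
I <- function(theta) x/theta^2 + (n - x)/(1 - theta)^2
thetahat <- x/n
S(thetahat)   # 0 at the MLE
I(thetahat)   # > 0 at the MLE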

Returning to the auditing Example 2.1.1 Sampling Method II:


The auditors check through the parcels in a random fashion until they find r = 7
parcels with questionable ingredients. They note

X2 = # legal parcels until the 7th illegal parcel is observed

X2 ∼ Negative Binomial(r = 7, p = θ)

p(x; θ) = (x + r − 1 choose r − 1) θ^r (1 − θ)^x

L(θ) = θ^7 (1 − θ)^x, taking c = 1/(x + r − 1 choose r − 1)

ℓ(θ) = 7 ln θ + x ln(1 − θ)

ℓ′(θ) = 7/θ − x/(1 − θ) ⇒ θ̂ = 7/(x + 7)

ℓ′′(θ) = −7/θ² − x/(1 − θ)² < 0 for 0 < θ < 1.

If x2 = 93, then θ̂ = .07.
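
A quick numerical check (not in the original notes) confirms this closed-form answer by maximizing the log-likelihood directly:

# Maximize the Sampling Method II log-likelihood with r = 7, x2 = 93
ell <- function(theta) 7*log(theta) + 93*log(1 - theta)
optimize(ell, c(0.01, 0.99), maximum=TRUE)$maximum   # approximately 0.07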



2.2 Likelihoods Based on Frequency Tables


Optional Text Reading: Section 9.1, pp. 8-10

Example 2.2.1. Two hundred specimens of a new high-impact plastic produced


using a new 3D printing machine are tested by repeatedly striking them with a
hammer until they fracture. Let Yi be the random number of hits that are required to
fracture the i′ th specimen so that the observed number of hits satisfies yi = 1, 2, 3, ....
The following table summarizes the results in a Frequency Table. For example, 112
specimens fractured on the first hit and 30 specimens required 4 or more hits to
fracture.

# hits required to fracture 1 2 3 ≥ 4 Total


# specimens 112 36 22 30 200

Suppose that a specimen has a constant probability, θ, of surviving a hit, indepen-


dently of previous hits received. Find the MLE of θ based on the Frequency Table
results for 200 independent specimens. Compare estimated expected frequencies with
the observed frequencies.

Note that we have
- n independent repetitions of an experiment
- each outcome must fall in exactly one category
Often data from n independent repetitions of an experiment are summarized in a
frequency table,

Type of event or category    A1  A2  . . .  Ak    Total

Observed frequency           X1  X2  . . .  Xk    n

where each outcome of one of the n experiments must fall in exactly one category,
A1 , . . . , Ak , a partition of the sample space.
• Let Xi = # of times Ai occurs in n repetitions [Σ_{i=1}^k Xi = n]

• pi = P{an outcome of one trial is of type Ai} [Σ_{i=1}^k pi = 1]

• E(Xi) = npi

We can add a row in the table corresponding to expected cell frequencies.

Type of event or class    A1   A2   . . .  Ak    Total

Observed frequency        X1   X2   . . .  Xk    n
Expected frequency        np1  np2  . . .  npk   n

The pi ’s may be determined from a probability model that depends on an unknown


parameter, θ, so that pi = pi (θ).
The distribution of the frequencies in the table is Multinomial(n, p1 (θ), . . . , pk (θ))
and the probability of observing a particular set of frequencies (x1 , ..., xk ) is:
 
P(x1, . . . , xk; θ) = (n choose x1, x2, . . . , xk) p1(θ)^(x1) p2(θ)^(x2) . . . pk(θ)^(xk).

The likelihood function is therefore,

L(θ) = p1(θ)^(x1) p2(θ)^(x2) . . . pk(θ)^(xk),

and the MLE, θ̂, is the value of θ that maximizes L(θ).


Using θ̂ we can compute pi (θ̂) = p̂i and np̂i = npi (θ̂), the estimated expected
frequencies. The estimated expected frequencies are compared with the observed
frequencies to give us an indication of how well the probability model fits.

Returning to Example 2.2.1:

Let Y = # hits required to fracture a random specimen


p1(θ) = P(Y = 1) = 1 − θ
p2(θ) = P(Y = 2) = θ(1 − θ)
p3(θ) = P(Y = 3) = θ²(1 − θ)
p4(θ) = 1 − p1 − p2 − p3 = 1 − (1 − θ) − θ(1 − θ) − θ²(1 − θ) = θ³

You may recognize that Y − 1 ∼ Negative Binomial(r = 1, p = θ) which is also


known as the Geometric(θ) distribution. We can now write down the distribution of
our data and obtain the Likelihood function.

 
P(x1, x2, x3, x4; θ) = (200 choose 112, 36, 22, 30) p1(θ)^112 p2(θ)^36 p3(θ)^22 p4(θ)^30

L(θ) = p1(θ)^112 p2(θ)^36 p3(θ)^22 p4(θ)^30

     = [1 − θ]^112 [θ(1 − θ)]^36 [θ²(1 − θ)]^22 [θ³]^30

     = [1 − θ]^(112+36+22) θ^(36+2·22+3·30)

     = [1 − θ]^170 θ^170

ℓ(θ) = 170 ln(1 − θ) + 170 ln θ

ℓ′(θ) = S(θ) = −170/(1 − θ) + 170/θ

ℓ′(θ̂) = 0 =⇒ θ̂ = 170/340 = 1/2

ℓ′′(θ) = −170/(1 − θ)² − 170/θ² < 0 for 0 < θ < 1

Checking the boundary points 0 and 1: L(0) = L(1) = 0 but L(θ) > 0 for θ ≠ 0, 1,
therefore θ̂ = 1/2 is a maximum, that is, it is the MLE of θ.

Substituting in the MLE for θ into the expressions for the p′ s, we obtain,

p̂1 = p1(θ̂) = 1 − θ̂ = 1/2
p̂2 = p2(θ̂) = θ̂(1 − θ̂) = 1/4
p̂3 = p3(θ̂) = θ̂²(1 − θ̂) = 1/8
p̂4 = θ̂³ = 1/8

Using these estimates, we obtain estimated expected frequencies np̂i under the
model:
# hits required to fracture              1    2    3    ≥ 4    Total
Observed frequency                       112  36   22   30     200
Estimated expected frequency (200 p̂i)    100  50   25   25     200

The estimated expected frequencies display poor agreement with the observed fre-
quencies. We expect some variation between the observed and estimated expected
frequencies. Does the poor agreement here suggest that something is wrong with the
assumed probability model? We need to be able to quantify the differences between
observed and estimated expected frequencies and decide if these are due to chance
variation only or to an inappropriate model. It may be that the assumed model is
incorrect, for example, the assumption of a constant probability of surviving a blow
independently of previous blows may not be realistic.
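
The comparison is easy to reproduce numerically; a short sketch (not code from the original notes) using the fitted geometric model:

# Estimated expected frequencies under thetahat = 1/2 vs the observed frequencies
thetahat <- 0.5
phat <- c(1 - thetahat,
          thetahat*(1 - thetahat),
          thetahat^2*(1 - thetahat),
          thetahat^3)
rbind(observed = c(112, 36, 22, 30), expected = 200*phat)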

2.3 Unusual example

There are some examples for which we cannot use Calculus to compute the maximum
likelihood estimate. Here is one such example.

Example 2.3.1. The ‘enemy’ has an unknown number, N, of drones, which have been
numbered 1, 2, . . . , N. Spies have reported sighting 8 drones with numbers 137, 24,
86, 33, 92, 129, 17, 111. Assume that sightings are independent and that each of the
drones has probability 1/N of being observed at each sighting. Find N̂.

P(137, 24, 86, 33, 92, 129, 17, 111; N) = 1/N^8 if N ≥ max{137, 24, 86, . . . , 111}, and 0 otherwise.

As N decreases, P (137, 24, 86, 33, 92, 129, 17, 111; N ) increases provided that N ≥
137. Therefore, to maximize the probability of the observed data assuming this
model, we need to make N as small as possible subject to N ≥ max{137, 24, 86, . . . , 111}.
Therefore, N̂ = 137 is the MLE of N . This is an example where we do NOT use
Calculus to solve for the MLE.
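
For completeness, the corresponding one-line check in R (not from the original notes):

# The MLE of N is simply the largest observed drone number
sightings <- c(137, 24, 86, 33, 92, 129, 17, 111)
max(sightings)   # Nhat = 137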

2.4 Combining Independent Events


Optional Text Reading: Section 9.2
Example 2.4.1. Suppose that for Example 2.1.1 we observed the number of illegal
parcels on each of two days, so that we observe
X1 = # illegal parcels out of n = 100 on day 1, and
X2 = # illegal parcels out of n = 100 on day 2.

Assuming that the number of illegal parcels for day 1 is independent of the number
of illegal parcels for day 2, we can write the JOINT probability mass function (pmf)
for X1 and X2 as:

p(x1, x2; θ) = (100 choose x1) θ^(x1) (1 − θ)^(100−x1) × (100 choose x2) θ^(x2) (1 − θ)^(100−x2).

The Likelihood function for θ now uses both data values and becomes,

L(θ) = θ^(x1) (1 − θ)^(100−x1) × θ^(x2) (1 − θ)^(100−x2) = θ^(x1+x2) (1 − θ)^(200−x1−x2),

and the Log-likelihood function is,

ℓ(θ) = ln L(θ) = (x1 + x2) ln θ + (200 − x1 − x2) ln(1 − θ).

We can proceed as above and obtain the Maximum Likelihood Estimate of θ. As an


exercise, show that the MLE is θ̂ = (x1 + x2 )/200.
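
A numerical sketch of that exercise (the day-1 and day-2 counts below are hypothetical, not values from the notes):

# Compare the numerical maximizer of the combined log-likelihood with (x1+x2)/200
x1 <- 7; x2 <- 11   # hypothetical counts for day 1 and day 2
ell <- function(theta) (x1 + x2)*log(theta) + (200 - x1 - x2)*log(1 - theta)
optimize(ell, c(0.001, 0.999), maximum=TRUE)$maximum
(x1 + x2)/200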

2.5 Relative Likelihood


Optional Text Reading: Section 9.3

In Example 2.1.1, we estimated θ, the probability that a randomly chosen parcel


is illegal in a Binomial experiment, X ∼ Binomial(n = 100, θ). We computed the
MLE as θ̂ = x/100 = .07 when x = 7. We know this to be the most plausible value
of θ in the sense that it maximizes the probability of the observed data, assuming
the Binomial model. Ultimately, we want to obtain a set of plausible values for θ
which incorporate the variability in the data as described by the model.

Questions:
(1) What about θ = .06? Is this a reasonable or plausible value for θ given the
data we have?
(2) How can we produce a set of θ-values that are plausible given the data?

The relative plausibilities of other θ-values may be examined by comparing them


with θ̂, the MLE.

Definition: The Relative Likelihood function (RLF) of θ is

R(θ) = L(θ)/L(θ̂).

Since L(θ) = c p(x; θ) where c does not depend on θ, then

R(θ) = c p(x; θ)/(c p(x; θ̂)) = p(x; θ)/p(x; θ̂).

Since 0 ≤ L(θ) ≤ L(θ̂) for all θ, then 0 ≤ R(θ) ≤ 1.

Definition: The Log Relative Likelihood function of θ is

r(θ) = ln R(θ) = ln L(θ) − ln L(θ̂) = ℓ(θ) − ℓ(θ̂).



Since 0 ≤ R(θ) ≤ 1, then −∞ ≤ r(θ) ≤ 0.

Note:
R(θ1) = L(θ1)/L(θ̂) = [probability of the data when θ = θ1] / [maximum probability of the data over all values of θ].

• If R(θ1 ) = 0.1 then the data are 10 times more probable when θ = θ̂ than
when θ = θ1 , under the hypothesized model.
• If R(θ2 ) = 0.5, then the data are 2 times more probable when θ = θ̂ than when
θ = θ2 , under the hypothesized model.
• θ2 is a more plausible parameter value than θ1 .
• R(θ) gives us a way of assessing and generating plausible values of θ given the
data and the hypothesized model.
• For example, {θ|R(θ) ≥ 0.5} is a set of θ values that give the data at least 50%
of the maximum possible probability under the hypothesized model.

Definition: A 100 p% Likelihood interval (LI) for θ is the set of θ values such
that,
R(θ) ≥ p or equivalently ln R(θ) = r(θ) ≥ ln p.

Likelihood Interval Guidelines:


θ-values inside 10% LIs are referred to as plausible
50% LIs are referred to as very plausible

Question: Is a 10% LI contained in a 50% LI, or is a 50% LI contained in a 10%


LI?

Likelihood intervals are similar, in practice, to Confidence Intervals, and we will see
that they are mathematically related when the data are normally distributed. As
an example, in the one-sample normal case when σ is known, the 14.7% Likelihood
interval for the unknown mean, µ, corresponds to a 95% confidence interval.
Relative Likelihood is also used for Hypothesis Testing/Tests of Significance which
we will see in Chapter 4.

Any report of results of an experiment should include θ̂ as well as an interval estimate


such as a likelihood interval or a confidence interval.

Returning to Example 2.1.1, we construct a 100 p% LI for θ. We want to find


all θ values such that R(θ) ≥ p, where R(θ) = L(θ)/L(θ̂) and

L(θ) = θ^x (1 − θ)^(n−x).

Here x = 7, θ̂ = 7/100 and L(θ̂) = (7/100)^7 (93/100)^93.
To compute a 100 p% Likelihood interval, we want all θ such that

R(θ) = θ^7 (1 − θ)^93 / [(7/100)^7 (93/100)^93] ≥ p.

Equivalently, we can find the values θ such that r(θ) = ln R(θ) ≥ ln(p).
Here we find the roots θ of r(θ) − ln(p) = 0, where θ is in the interval (0, 1) using
the R function uniroot(). To use uniroot(), we need to supply the function that
we wish to solve and starting values that bracket the roots. To determine starting
values, we graph r(θ) − ln(p) versus θ and overlay a horizontal line at zero. We give
an example using p = 0.1 for a 10% Likelihood Interval.


Figure 2.1: 10% Likelihood interval construction. The log relative likelihood minus
ln(0.1) is plotted versus θ. A horizontal line at zero is overdrawn to assist with
starting value determination.

From the graph, we see that there are two roots. The lower root lies within the
interval [0.02, 0.04] and the upper root lies within the interval [0.1, 0.15]. These are
the starting values that we supply to uniroot() in the code in the next section. The
roots that R returned are in the $root slot below. The 10% Likelihood interval for
θ is thus: (0.028, 0.138) and the MLE is θ̂ = 0.07. Note that this interval is NOT
symmetric about the value 0.07 with the right endpoint further from 0.07 than the
left endpoint. This is displayed in the asymmetry of the plot above. [Aside: the 50%
Likelihood interval for θ is (0.044, 0.10).]

> lower <- uniroot(logR.m.lnp, c(.02, .04), thetahat$maximum, p)


> lower
$root
[1] 0.02804908
$f.root
[1] 0.004036655
$iter
[1] 4
$init.it
[1] NA
$estim.prec
[1] 6.103516e-05

> upper <- uniroot(logR.m.lnp, c(.1, .15), thetahat$maximum, p)


> upper
$root
[1] 0.1378683
$f.root
[1] -1.041713e-05
$iter
[1] 4
$init.it
[1] NA
$estim.prec
[1] 6.103516e-05

Figure 2.2: R Output: Likelihood interval computations for Example 2.1.1

In Example 2.2.1, we worked through an example involving specimens of a new


high impact plastic which were tested repeatedly by striking them with a hammer
until they fractured. In that example,
θ = P {specimen survives a hit independently of blows received}
and the data are given again below.
# hits required to fracture 1 2 3 ≥ 4 Total
# specimens 112 36 22 30 200

We computed the MLE, θ̂ = 0.5


We know that θ̂ = 0.5 is the most plausible value of θ in the sense that it maximizes
P (x1 , x2 , x3 , x4 ; θ), the probability of the observed data given θ.
We construct a 100 p% LI for θ. We want to find all θ values such that R(θ) ≥ p,
where R(θ) = L(θ)/L(θ̂) and

L(θ) = p1^112 p2^36 p3^22 p4^30 = [1 − θ]^170 θ^170,

θ̂ = 1/2, and so L(θ̂) = (1/2)^340.
To compute a 100 p% Likelihood interval, we want all θ such that

R(θ) = (1 − θ)^170 θ^170 / (1/2)^340 ≥ p, or

R(θ) = [4(1 − θ)θ]^170 ≥ p.

To solve this problem, we find the roots θ of R(θ) − p = 0, for admissible values of θ
in the interval (0, 1). Below, we tabulate values for R(θ) and r(θ) for various values
of θ to help us discern starting values for numerical root finding software.

θ       R(θ)               r(θ)
0.30    1.34 × 10^(−13)    −29.64
0.40    0.00968            −6.94
0.45    0.1811   ← 10%     −1.71
0.46    0.3357             −1.09
0.47    0.5417   ← 50%     −0.61
0.50    1        ← θ̂       0
0.53    0.5417   ← 50%     −0.61
0.54    0.3357             −1.09
0.55    0.1811   ← 10%     −1.71
0.60    0.00968            −6.94

Refining near R(θ) = 0.10: R(0.44) = 0.0849 and R(0.442) = 0.10, giving the 10% LI [0.442, 0.558].
Refining near R(θ) = 0.50: R(0.468) = 0.4977 and R(0.469) = 0.5196, giving the 50% LI [0.468, 0.532].

Equivalently, we could have used r (θ) to compute the Likelihood Intervals. For a
100 p% Likelihood Interval we want all θ such that, r(θ) ≥ ln p. To compute the
endpoints of the interval, we solve the lower and upper roots of the equation in θ,
r(θ) − ln p = 0 using the R code below:

50% LI: r(θ) = ln 0.5 = −0.69
10% LI: r(θ) = ln 0.1 = −2.30,

where r(θ) = ℓ(θ) − ℓ(θ̂),

ℓ(θ) = 170 ln(1 − θ) + 170 ln θ, and
ℓ(θ̂) = 340 ln 0.5 = −235.67.

Alternatively (or additionally) a graph of the log relative likelihood can aid in choos-
ing starting values for a root finding technique. Below is the graph of the log relative
likelihood minus ln(.1) for Example 2.2.1.


Figure 2.3: 10% Likelihood interval construction. The log relative likelihood - ln(.1)
is plotted versus theta. A horizontal line at zero is overdrawn to assist with starting
value determination.

2.5.1 R code for Example 2.1.1


# Example 2.1.1
# Log-likelihood function
ell <- function(theta){
7*log(theta) + 93*log(1-theta)
}
theta <- seq(.02, .15,by=.005)
plot(theta,ell(theta),ylab='log likelihood',xlab='theta')
title('Example 2.1.1, Log-Likelihood')

#MLE of theta
thetahat <- optimize(ell, c(.05,.09), maximum=TRUE)
thetahat

#Log relative likelihood function


logR <- function(theta, thetahat){
ell(theta) - ell(thetahat)
}
logR(theta,thetahat$maximum)
p <- .1 #10% likelihood interval
logR.m.lnp <- function(theta, thetahat, p) {logR(theta,thetahat)-log(p)}

plot(theta,logR.m.lnp(theta,thetahat$maximum, p), ylab='r(theta)-ln(p)',
     xlab='theta', type='b')
abline(h=0)
title('Example 2.1.1, Log Relative Likelihood - ln(p)')

#Likelihood intervals
lower <- uniroot(logR.m.lnp, c(.02, .04), thetahat$maximum, p)
lower

upper <- uniroot(logR.m.lnp, c(.1, .15), thetahat$maximum, p)


upper

2.5.2 R code for Example 2.2.1


# Example 2.2.1
# Log-likelihood function and plot
ell <- function(theta){
170*log(theta) + 170*log(1-theta)
}

theta <- seq(.35,.65,by=.01)


plot(theta,ell(theta),ylab='log-likelihood',xlab='theta')
title('Example 2.2.1, Log-likelihood')

#MLE of theta - looks for maximum in interval (.4, .6)


thetahat <- optimize(ell, c(.4,.6), maximum=TRUE)
thetahat

#Log relative likelihood function and plot


logR <- function(theta, thetahat){
ell(theta) - ell(thetahat)
}
logR(theta,thetahat$maximum)
p <- .1 #10% likelihood interval
logR.m.lnp <- function(theta, thetahat, p) {logR(theta,thetahat)-log(p)}

plot(theta,logR.m.lnp(theta,thetahat$maximum,p), ylab='r(theta)-ln(p)',
     xlab='theta', type='b')
title('Example 2.2.1, Log Relative Likelihood - ln(p)')
abline(h=0)
#The plot helps us to determine starting values for a root finding
# technique used to solve for the Likelihood interval

#Likelihood intervals
#log relative likelihod minus ln(p)
#Find a root in the interval (.4, .5)

lower <- uniroot(logR.m.lnp, c(.4, .5), thetahat$maximum, p)


lower

#Find a root in the interval (.5, .6)


upper <- uniroot(logR.m.lnp, c(.5, .6), thetahat$maximum, p)
upper

2.6 Likelihood for Continuous Models


Optional Text Reading: Section 9.4

Suppose that we have a sample x1, x2, . . . , xn of independent observations on a continuous
random variable X which has pdf f(x; θ), where θ is an unknown parameter.

Example 2.6.1. Times between successive failures X, of a computer system are


thought to be independent and identically distributed (iid) random variables with
an exponential distribution having mean θ.

Let X1 , X2 . . . , Xn represent a random sample of observed times between successive


failures so that
f(x; θ) = (1/θ) e^(−x/θ).

In the case of continuous measurements, the pdf evaluated at x does not represent
the probability of observing x, however, we construct the likelihood using the pdf.
In this case, the joint pdf of the independent sample is written as the product of the
marginal pdf’s and,

L(θ) = ∏_{i=1}^n f(xi; θ).

Substituting in the example,

L(θ) = ∏_{i=1}^n f(xi; θ) = ∏_{i=1}^n (1/θ) e^(−xi/θ) = θ^(−n) e^(−Σxi/θ)

ℓ(θ) = −n ln θ − (1/θ) Σ_{i=1}^n xi

S(θ) = ℓ′(θ) = −n/θ + Σxi/θ² =⇒ θ̂ = Σxi/n = x̄, the sample mean!

ℓ′′(θ) = n/θ² − 2Σxi/θ³ = n/θ² − 2nx̄/θ³

ℓ′′(θ̂) = n/x̄² − 2nx̄/x̄³ = n/x̄² − 2n/x̄² = −n/x̄²

I(θ̂) = n/x̄² > 0 =⇒ θ̂ = x̄ is the MLE

Suppose that we observed the following times between failures (to the nearest day):

70 11 66 5 20 4 35 40 29 8
1. What is an estimate of the expected time between failures? (Σ xi = 288)
2. What values are plausible given the data? (10% and 50% LI’s)
3. The computer manufacturer claims that the mean time between failures is 100
days. Comment.

Solution:
1. X ∼ Exponential(mean θ), where θ = expected time between failures. The estimated expected time between
failures is θ̂ = Σxi/n = 288/10 = 28.8.

2. Using r(θ), we obtain likelihood intervals for θ as follows:

r(θ) = ln R(θ) = ln[L(θ)/L(θ̂)] = ℓ(θ) − ℓ(θ̂)

ℓ(θ̂) = −n ln θ̂ − (1/θ̂) Σxi = −10 ln 28.8 − 10 = −43.60

Plot r(θ) − ln(p) = −10 ln θ − 288/θ + 43.60 − ln(p) versus θ.


Figure 2.4: 10% Likelihood interval. Log relative likelihood minus ln(0.1) plotted
versus θ. Horizontal line is at zero.

The likelihood intervals are:


50% LI : 20.28 ≤ θ ≤ 42.83
10% LI : 15.65 ≤ θ ≤ 61.88

The data do not support the claim that the mean time between failures is 100
days. 100 days is not a plausible value for θ.

2.6.1 R code for Example 2.6.1


# Example 2.6.1

x <- c(70 , 11 , 66 , 5 , 20 , 4 , 35 , 40 , 29 , 8)
ell <- function(theta,x){
n <- length(x)
return(-n*log(theta) - sum(x)/theta)
}
theta <- seq(10,60,by=1)
plot(theta,ell(theta,x),ylab='log likelihood',xlab='theta')
title('Log Likelihood, Example 2.6.1')

#MLE
thetahat <- optimize(ell, c(20,40), maximum=TRUE, x=x)
thetahat
ell(thetahat$maximum,x)

#Log relative likelihood function


logR <- function(theta, thetahat,x){
ell(theta,x) - ell(thetahat,x)
}
logR(theta,thetahat$maximum,x)
p <- .1 #10% likelihood interval
#log relative likelihood minus ln(p)
logR.m.lnp <- function(theta, thetahat, x, p) {logR(theta,thetahat,x)-log(p)}

plot(theta,logR.m.lnp(theta,thetahat$maximum,x,p),
ylab='log relative likelihood - ln(p)',xlab='theta')
abline(h=0) #add a horizontal line at zero
title('Log-relative Likelihood minus ln(p), Example 2.6.1')

#Likelihood intervals
lower <- uniroot(logR.m.lnp, c(10,20), thetahat$maximum, x, p)
lower
upper <- uniroot(logR.m.lnp, c(50,70), thetahat$maximum, x, p)
upper

p <- .5 #50% likelihood interval


lower <- uniroot(logR.m.lnp, c(10,25), thetahat$maximum, x, p)
lower
upper <- uniroot(logR.m.lnp, c(40,60), thetahat$maximum, x, p)
upper

2.7 Invariance
Optional Reading: Section 9.6
In the above Example 2.6.1, we might also be interested in estimating the probability
that the time between failures is greater than 100 days, i.e.

β = P(X > 100) = ∫_{100}^{∞} (1/θ) e^(−x/θ) dx = e^(−100/θ)

How can we find the MLE of β?


Solution: We can reparameterize the Log-likelihood in terms of θ(β) = −100/ln β. Note
that β is an INCREASING function of θ. [Exercise: Show that the derivative of β
with respect to θ is positive for all β ∈ (0, 1).]
The Log-likelihood and Score function for β are:

ℓ(β) = −n ln(−100/ln β) + (ln β/100) Σ_{i=1}^n xi,

ℓ′(β) = [−n/(−100/ln β)] · [100/(β (ln β)²)] + (1/(100β)) Σ_{i=1}^n xi.

Setting the derivative to zero and solving [as an exercise] yields,

β̂ = e^(−100/θ̂) = 0.031.

The estimated probability that the time between failures is greater than 100 days is
0.031, very small.
We see that maximum Likelihood Estimates have some very nice properties - MLE's
are invariant under one-to-one parametric transformations. In addition, a 10% Likelihood
interval for β is (e^(−100/θ1), e^(−100/θ2)) = (e^(−100/15.65), e^(−100/61.88)) = (0.0017, 0.199),
where θ1 and θ2 are the endpoints of the 10% Likelihood interval for θ.
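
These invariance calculations are easy to reproduce; a short sketch (not code from the original notes) using the θ results of Example 2.6.1:

# MLE and 10% likelihood interval for beta = P(X > 100) via invariance
thetahat <- 28.8
theta.LI <- c(15.65, 61.88)   # 10% LI for theta from Example 2.6.1
exp(-100/thetahat)            # MLE of beta, approx 0.031
exp(-100/theta.LI)            # 10% LI for beta, approx (0.0017, 0.199)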

Example 2.7.1. Family income, X is measured on a scale such that X = 1 corre-


sponds to a subsistence level income. The pdf of the income distribution is assumed
to be the Pareto distribution
f(x; θ) = θ x^(−(θ+1)) for x ≥ 1, and f(x; θ) = 0 for x < 1,

where θ > 0. Data for a random sample of n = 10 families living in Toronto is:
1.02, 1.41, 1.75, 2.31, 3.42, 4.31, 9.21, 17.4, 38.6, 392.8.
(a) Find the MLE of θ.
(b) Obtain an estimate of the median family income, β.

Figure 2.5 graphs the density of the Pareto distribution for various values of θ.
Typically, there are many individuals with smaller incomes and only a few individuals
with large incomes, and this is the behaviour displayed by the Pareto densities.


Figure 2.5: The Pareto density, for values of θ

Solution:
(a) Find the MLE of θ.
f(x1, . . . , xn; θ) = ∏_{i=1}^n θ xi^(−(θ+1))

L(θ) = θ^n ∏_{i=1}^n xi^(−(θ+1))

ℓ(θ) = n ln θ − (θ + 1) Σ_{i=1}^n ln xi

ℓ′(θ) = n/θ − Σ_{i=1}^n ln xi =⇒ θ̂ = n / Σ_{i=1}^n ln xi = 1/1.92 = 0.52208

ℓ′′(θ) = −n/θ² < 0 for all θ.

A 10% Likelihood interval for θ is (0.24, 0.96)

(b) Find an estimate of the median family income, β.



Definition: The 100αth percentile of continuous X is the variate value Qα such that

P(X ≤ Qα) = F(Qα) = α

where 0 < α < 1 and F(x) is the cdf of X. The Median is the 50th percentile.

Returning to Example 2.7.1,

0.5 = P(X ≤ β) = ∫_{1}^{β} θ x^(−(θ+1)) dx = [−x^(−θ)]_{1}^{β} = 1 − β^(−θ)

β^(−θ) = 0.5 =⇒ β = 0.5^(−1/θ) = 2^(1/θ)

Note that β is a 1 − 1 function of θ!


Substituting θ = ln 2/ln β into L(θ),

L(θ) = L(ln 2/ln β) = (ln 2/ln β)^n ∏ xi^(−(ln 2/ln β + 1)) = L*(β)

We can find β̂ by maximizing L*(β). As an exercise, show that the maximizer of
L*(β) is β̂ = 2^(1/θ̂).

MLE’s are invariant under one-to-one parametric transformations. Also, R (θ) =


R∗ (β) where θ = ln 2/ ln β. Relative plausibilities do not depend upon the
parametrization.

Definition: Invariance Property. Let θ = g (β) be a one-to-one transformation


of β. Let θ̂ be the MLE of θ and θ1 ≤ θ ≤ θ2 be a 100p% likelihood interval for
θ.

The MLE of β is β̂ = g⁻¹(θ̂) and a 100p% Likelihood Interval for β is:

g⁻¹(θ1) ≤ β ≤ g⁻¹(θ2) if g is monotone increasing,
g⁻¹(θ2) ≤ β ≤ g⁻¹(θ1) if g is monotone decreasing.

Returning to Example 2.7.1, the MLE of β is β̂ = 2^(1/θ̂) = 2^(Σ ln xi / n) = 3.78.
Given that a 10% LI for θ is
0.24 ≤ θ ≤ 0.96,
a 10% LI for β is (β is a monotone decreasing function of θ)

2^(1/0.96) ≤ β ≤ 2^(1/0.24)
2.06 ≤ β ≤ 17.96,

a set of plausible values for median income (relative to subsistence level) based on
the Toronto sample data.

Some Comments:
• What is the effect of increasing sample size, n on Likelihood intervals? Gener-
ally this produces a more sharply peaked likelihood which results in narrower
Likelihood intervals. Likelihood intervals for θ will be more precisely estimated.
• Can we combine data from independent experiments or studies? Suppose we
are given a random sample of family incomes (relative to subsistence level) for
families living in London, England where it is assumed that the pdf of the
income distribution is
f(x; θ) = θ x^(−(θ+1)) for x ≥ 1, and f(x; θ) = 0 for x < 1.

When would it be appropriate to pool this data with the Toronto data and
produce a common estimate of θ?

We can estimate θ for each sample, θ̂Toronto , θ̂London .


Plot the two log-relative likelihoods rToronto (θ), rLondon (θ), on the same graph
and look for values of θ that are plausible for both locations. If there are some,
combine the two sets of data to produce a common estimate of θ.

2.7.1 R code for Example 2.7.1


#Example 2.7.1 Family Income Pareto distribution with data
# Pareto density plot
dpareto <- function(x,theta){
theta*x^(-theta-1)
}
xx<-seq(1,4,by=.01)
pareto.dat<-cbind(dpareto(xx,.5), dpareto(xx,1), dpareto(xx,2))
matplot(xx, pareto.dat, type='l',col=1, lty=c(1,2,5),
main='Pareto distribution',
ylab='f(x)',xlab='x')
legend("topright",c(paste('theta',c(.5, 1, 2))),lty=c(1,2,5),col=1)
#We have an algebraic expression for the MLE, thetahat
x<- c(1.02, 1.41, 1.75, 2.31, 3.42, 4.31, 9.21, 17.4, 38.6, 392.8)
n <- length(x)
n
thetahat <- n/sum(log(x))
thetahat
ell <- function(theta,x){
n <- length(x)
ellres <- vector('numeric',length(theta))
for(i in (1:length(theta)))
{ ellres[i] <- n*log(theta[i]) - sum((theta[i]+1)*log(x))
}
return(ellres)
}
### use this one!!! No loops
ell2 <- function(theta,x){
n <- length(x)
ellres <- n*log(theta) - (theta+1)*sum(log(x))
return(ellres)
}

#Graph the Log Relative likelihood function


theta <- seq(.1,2,by=.01)
logR <- function(theta, thetahat,x){
ell(theta,x) - ell(thetahat,x)
}

p <- .1 #10% likelihood interval


#log relative likelihood minus ln(p)
logR.m.lnp <- function(theta, thetahat, x, p) {logR(theta,thetahat,x)-log(p)}

plot(theta,logR.m.lnp(theta,thetahat,x,p),
ylab='log relative likelihood - ln(p)',xlab='theta')
abline(h=0) #add a horizontal line at zero
title('Log-relative Likelihood minus ln(p), Example 2.7.1')

#Likelihood intervals
lower <- uniroot(logR.m.lnp, c(.2,.5), thetahat, x, p)
lower
upper <- uniroot(logR.m.lnp, c(.6,1.2), thetahat, x, p)
upper
#MLE and Likelihood intervals for beta, the median
2^(1/thetahat)
2^(1/c(lower$root,upper$root))
Chapter 3

Two Parameter Likelihoods

3.1 Maximum Likelihood Estimation


Optional Reading: Section 10.1

Example 3.1.1. Suppose x1 , x2 , . . . , xn are independent observations on a random


variable, X ∼ N (µ, σ 2 ), where both µ, σ 2 are unknown. Find the joint MLE (µ̂, σ̂ 2 ).

Solution:

L(µ, σ²) = c ∏_{i=1}^n f(xi; µ, σ²)

where f(x; µ, σ²) is the pdf of X.

f(x; µ, σ²) = (1/√(2πσ²)) exp{−(1/(2σ²))(x − µ)²}

L(µ, σ²) = c ∏_{i=1}^n (1/√(2πσ²)) exp{−(1/(2σ²))(xi − µ)²}
         = (1/σ²)^(n/2) exp{−(1/(2σ²)) Σ_{i=1}^n (xi − µ)²}

ℓ(µ, σ²) = −(n/2) ln σ² − (1/(2σ²)) Σ_{i=1}^n (xi − µ)²



Figure 3.1: Log Likelihood for a random sample of size 100 from Normal(0,1)


Figure 3.2: Log Likelihood contour plot for a random sample of size 100 from Nor-
mal(0,1)

Figures 3.1 and 3.2 display the Log-likelihood as a function of µ and σ for a sample of
size 100 data values generated from the Normal(0,1) distribution. The figures show
that the Log-likelihood is maximized somewhere near µ = 0.25 and σ = 1.1.
Find the values (µ̂, σ̂ 2 ) that maximize ℓ (µ, σ 2 ) . To do so, we take derivatives of
ℓ(µ, σ 2 ) with respect to µ and σ.

∂ℓ/∂µ = (1/σ²) Σ_{i=1}^n (xi − µ)                        (3.1)

∂ℓ/∂σ = −n/σ + (1/σ³) Σ_{i=1}^n (xi − µ)²                (3.2)

At the joint maximizer, (µ̂, σ̂²), both (3.1) and (3.2) are 0.

(3.1) = 0 =⇒ Σ_{i=1}^n (xi − µ̂) = 0 =⇒ µ̂ = Σ_{i=1}^n xi/n = x̄.

Substituting µ̂ = x̄ into (3.2):

(3.2) = 0 =⇒ n/σ̂ = (1/σ̂³) Σ_{i=1}^n (xi − µ̂)² =⇒ nσ̂² = Σ_{i=1}^n (xi − x̄)²

=⇒ σ̂² = Σ_{i=1}^n (xi − x̄)²/n

Checking for a Maximum


The second derivatives, ∂ 2 ℓ/∂µ2 and ∂ 2 ℓ/∂σ 2 only give information in the direction
of the µ, σ axes. We need criteria that test for a maximum in all directions radiating
from µ̂, σ̂. We provide the criteria here, and a proof is given in the optional text
pages 90 and 91.

As in the one parameter case, we define the Observed Information matrix,


I(µ, σ 2 ),

 
I(µ, σ²) = [ −∂²ℓ/∂µ²    −∂²ℓ/∂µ∂σ ]   =   [ I11  I12 ]
           [ −∂²ℓ/∂σ∂µ   −∂²ℓ/∂σ²  ]       [ I21  I22 ]

Note that ∂²ℓ/∂σ∂µ = ∂²ℓ/∂µ∂σ, and so I21 = I12. At a relative maximum,
(µ̂, σ̂²), I(µ̂, σ̂²) must satisfy:

Î11 = I11(µ̂, σ̂²) > 0,   Î22 > 0,   Î11 Î22 − Î21² > 0.

In the example,

I11 = −∂²ℓ/∂µ² = n/σ²

I12 = −∂²ℓ/∂µ∂σ = 2 Σ(xi − µ)/σ³

I22 = −∂²ℓ/∂σ² = −n/σ² + (3/σ⁴) Σ(xi − µ)²

Substituting in µ̂ and σ̂²,

Î11 = n²/Σ(xi − x̄)² > 0,   Î12 = 0,

Î22 = −n/σ̂² + 3nσ̂²/σ̂⁴ = 2n/σ̂² > 0,

Î11 Î22 − Î21² > 0 =⇒ (µ̂, σ̂²) is the joint MLE.

Example 2.2.1 revisited: Specimens of a new high impact plastic are tested by
repeatedly striking them with a hammer until they fracture.

Y = # blows required to fracture a specimen



We assumed that:

P(Y = y) = θ^(y−1) (1 − θ), y = 1, 2, 3, . . .

where θ = P (surviving a hit independently of previous hits). Y has a geometric


distribution. The assumption does not seem tenable and we found that the model
did not yield estimated expected frequencies that were close to the observed frequen-
cies.

It is suggested that, while the geometric distribution applies to most specimens, a


fraction 1 − λ, 0 < λ < 1, of them will have flaws and always fracture on the first
hit.

Compute estimates of λ, θ and compare observed and estimated expected frequencies


under the model.

# hits required    1    2    3    ≥ 4    Total

# specimens        112  36   22   30     200
                   x1   x2   x3   x4     n

Solution: We need to construct the probability of the observed frequencies as a


function of λ and θ.
Recall that we have 200 repetitions of an experiment where each outcome falls in
one of the above categories. We modelled the probability of the observed frequencies
using a Multinomial distribution.
Let xi = the number of specimens in category i; Σ_{i=1}^4 xi = n = 200.

P(x1, x2, x3, x4) = (200 choose x1, x2, x3, x4) p1^(x1) p2^(x2) p3^(x3) p4^(x4),

where pi = P(a specimen falls in category i).


pi = P (i hits required to fracture) = P (Y = i) i = 1, 2, 3
p4 = P (≥ 4 hits required to fracture) = P (Y ≥ 4)

We need to obtain expressions for p1 , . . . , p4 in terms of λ and θ. It may be helpful


to draw a tree diagram here. Try it as an exercise.

P (item is flawed) = 1 − λ,
P (Y = 1 | flawed) = 1,
P (item is not flawed) = λ, and
P(Y = y | not flawed) = θ^(y−1) (1 − θ).

Recall: Let A and B be 2 events, then P (A and B) = P (A | B) P (B)

p1 = P (Y = 1)
= P (Y = 1 and flawed) + P (Y = 1 and not flawed)
= 1 − λ + λ (1 − θ) = 1 − λθ

p2 = P (Y = 2)
= P (Y = 2 and flawed) + P (Y = 2 and not flawed)
= 0 + λθ (1 − θ)

p3 = P(Y = 3) = λθ²(1 − θ)

p4 = 1 − p1 − p2 − p3 = 1 − (1 − λθ) − λθ(1 − θ) − λθ²(1 − θ) = λθ³

We can express the likelihood and log-likelihood in terms of θ and λ:

L(θ, λ) = p1^(x1) p2^(x2) p3^(x3) p4^(x4)

        = [1 − λθ]^112 [λθ(1 − θ)]^36 [λθ²(1 − θ)]^22 [λθ³]^30

        = (1 − λθ)^112 λ^88 θ^170 (1 − θ)^58

ℓ(θ, λ) = 112 ln(1 − λθ) + 88 ln λ + 170 ln θ + 58 ln(1 − θ)

To compute the MLE’s of θ and λ, we need to take derivatives, set them to zero and
solve for θ̂ and λ̂.

∂ℓ/∂θ = −112λ/(1 − λθ) + 170/θ − 58/(1 − θ)                (3.3)

∂ℓ/∂λ = −112θ/(1 − λθ) + 88/λ                               (3.4)

(3.4) = 0 =⇒ 112λ̂θ̂ = 88(1 − λ̂θ̂) =⇒ 200λ̂θ̂ = 88 =⇒ λ̂θ̂ = 88/200,

and

λ̂ = (88/200)(1/θ̂).

Substituting 1 − λ̂θ̂ = (112/88) λ̂θ̂ into (3.3),

0 = −112λ̂ / [(112/88) λ̂θ̂] + 170/θ̂ − 58/(1 − θ̂),

=⇒ 0 = 82/θ̂ − 58/(1 − θ̂).

θ̂ = 82/140 = 41/70 = 0.5857

and λ̂ = (88 × 70)/(200 × 41) = 154/205 = 0.7512.

Substituting θ̂ and λ̂ into the expressions for the pi's yields estimated probabilities:

p̂1 = 1 − λ̂θ̂ = 0.56
p̂2 = λ̂θ̂(1 − θ̂) = 0.18
p̂3 = λ̂θ̂²(1 − θ̂) = 0.11
p̂4 = λ̂θ̂³ = 0.15.

# hits required 1 2 3 ≥ 4 Total


observed frequency 112 36 22 30 200
np̂i = estimated frequency 112.00 36.46 21.35 30.19 200

The estimated and observed frequencies are very close! The new model fits the data
very well.
We should check that we have attained a maximum using the second derivatives
evaluated at the maximum. The information matrix entries are given below. As an
exercise, check that they satisfy the criteria for a maximum.

I11 = −∂²ℓ/∂θ² = 112λ²/(1 − λθ)² + 170/θ² + 58/(1 − θ)² > 0

I22 = −∂²ℓ/∂λ² = 112θ²/(1 − λθ)² + 88/λ² > 0

I21 = I12 = −∂²ℓ/∂θ∂λ = 112/(1 − λθ) + 112θλ/(1 − λθ)²
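
A numerical cross-check (a sketch, not code from the original notes) maximizes this two-parameter log-likelihood directly:

# Maximize ell(theta, lambda) for the flawed-specimen model with optim()
ell <- function(par){
  theta <- par[1]; lambda <- par[2]
  if (theta <= 0 || theta >= 1 || lambda <= 0 || lambda >= 1) return(-1e10)  # outside parameter space
  112*log(1 - lambda*theta) + 88*log(lambda) + 170*log(theta) + 58*log(1 - theta)
}
optim(c(0.5, 0.5), ell, control=list(fnscale=-1))$par   # approx (0.586, 0.751)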

3.2 The Chi-Square Distribution


A continuous variate X with pdf

f(x; ν) = c_ν x^(ν/2 − 1) e^(−x/2) for x > 0, where c_ν is a positive constant,

is said to have a χ²(ν) distribution, where ν is called the degrees of freedom.

It can be shown that E(X) = ν and Var(X) = 2ν.

We will be interested in the cdf of X,

F(x; ν) = P(χ²(ν) ≤ x) = ∫_{0}^{x} f(u; ν) du.

Left tail areas are tabulated in the optional textbook on page 351, and a chi-square
table is available on Brightspace. We will also use R to compute these.
 
Example: P(χ²(4) ≤ 7.779) = 0.9.
Check that you can get this answer from the chi-square table provided on Brightspace
or on page 351 of the optional text or using R (code below).

The chi-square density is graphed for various values of ν, the degrees of freedom.

Figure 3.3: Chi-square densities

Properties of the Chi-square distribution:


1. Let X1, X2, . . . , Xn be independent random variates with Xi ∼ χ²(νi). Then
   X1 + X2 + · · · + Xn ∼ χ²(ν1 + ν2 + · · · + νn).
2. Let Z ∼ N(0, 1). Then Z² ∼ χ²(1).
3. Let Z1, Z2, . . . , Zn be independent N(0, 1) random variables. Then Z1², Z2², . . . , Zn²
   are independent χ²(1) random variables and Z1² + Z2² + . . . + Zn² ∼ χ²(n).

3.2.1 R code for Chi-square distribution


pchisq(7.779,4) # find probability chi-square(4) < 7.779
qchisq(.9, 4) # find .90 quantile of chi-square(4)

x <- seq(0.3,30,by=.05)
chi.dat <- cbind(dchisq(x,1), dchisq(x,4), dchisq(x,10)) #chi-square densities
matplot(x,chi.dat,type='l',col=1:3, lty=1:3, main='Chi-square densities')
legend("topright",c(paste('df=',c(1,4,10))),lty=1:3,col=1:3)
Chapter 4

Tests of Significance

4.1 Introduction to Tests of Significance


Optional reading: Chapter 12.1
We will consider a formal method for testing hypotheses about model parameters.
To illustrate the ideas, I will consider a simple example which tests the following
claim.

I claim that I have ESP

To test this claim, we will perform an experiment using a deck of cards. After
shuffling the cards, a volunteer chooses a card and I must divine the colour of the
suit. This is repeated 25 times and the number of correct responses is recorded.
Let X be the number of correct responses out of the 25 independent trials.

Notes:
1. Even if I do NOT have ESP, some correct responses will occur by chance.
2. If I do have ESP, I should be able to achieve more correct responses than would
be expected by chance alone.
We define two hypotheses:


1. H1 : I have ESP. This is the claim or research hypothesis and is called the
alternative hypothesis. (It is also denoted as HA .)
2. H0 : I do not have ESP. This is usually the complement of H1 and is called the
null hypothesis.
We proceed as if we were performing a proof by contradiction, assuming
that the null hypothesis is true until we obtain enough evidence against
it.
To determine whether there is evidence against the null and in favour of the hypoth-
esis of ESP, we compare the results obtained from the experiment with that which
would be expected under the null hypothesis that I do NOT have ESP.
Large values of X observed, xobs , will be interpreted as evidence against the null
hypothesis and in favour of the alternative hypothesis.
Under H0 , the hypothesis that I do NOT have ESP, my responses are guesses, and
each response has a .50 chance of being correct. Therefore Under H0 ,
X ∼ Binomial(n = 25, p = 0.5), so that I am expected to get E(X) = np = 12.5
correct, on average if the null hypothesis is true.
Note that the null and alternative hypotheses can be written in terms of p, the
Binomial probability of success as:

H1 : p > 0.5 H0 : p = 0.5.

Suppose that I get xobs = 18 correct. Does this provide evidence against H0 ?

To answer that question, Statisticians compute a p-value also called a Signifi-


cance level (SL). It is defined as the probability of observing a result as extreme
or more in the direction of the alternative hypothesis, computed assuming that
the null hypothesis is true.

Returning to the example, suppose that I get xobs =18 correct responses. The p-
value= P (X ≥ 18) is computed using the distribution of X under the null hypothesis,
that is, Binomial(n = 25, p = 0.5). Using R, we compute:

p-value = P (X ≥ 18) = 1 − P (X ≤ 17) = 0.02164263

using the code,



1-pbinom(17, size=25, p=.5).

You may recall the Normal approximation to the Binomial,

X ≈ N(µ = np = 25/2, σ² = np(1 − p) = 25/4),

is appropriate when np ≥ 5 and n(1 − p) ≥ 5. The p-value can be computed using
this approximation as:

p-value = P (X ≥ xobs) = P ((X − µ)/σ ≥ (xobs − 12.5)/√(25/4)) ≃ P (Z ≥ (xobs − 12.5)/√(25/4)),

where Z ∼ N (0, 1).


The p-value is approximately,

P (X ≥ 18) ≃ P (Z ≥ 2.2) = 1 − 0.98610 = 0.0139,

where the normal probability is obtained from the N (0, 1) cdf table on Brightspace
or on page 349 of the optional text. The approximation can be improved using a
continuity correction to the normal approximation,

P (X ≥ 18) = 1 − P (X ≤ 17) ≃ 1 − P (Z ≤ (17.5 − 12.5)/√(25/4))
           = 1 − P (Z ≤ 2) = 1 − 0.97725 = 0.02275.
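A quick R check of the exact p-value and the two normal approximations above:

# Exact Binomial p-value versus the normal approximation, with and without continuity correction
n <- 25; p0 <- 0.5; xobs <- 18
mu <- n*p0; sigma <- sqrt(n*p0*(1-p0))
1 - pbinom(xobs - 1, n, p0)            # exact: 0.02164263
1 - pnorm((xobs - mu)/sigma)           # normal approximation: 0.0139
1 - pnorm((xobs - 0.5 - mu)/sigma)     # with continuity correction: 0.02275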

Note: You should review how to use the N (0, 1) cdf table since we will resort to
tables for the Quizzes and Exam.

How do we interpret the p-value in practice?


• Large p − values suggest that results as extreme as xobs would occur fairly
often if H0 were true and we have no evidence that H0 is false. There is no
inconsistency with H0 , BUT this does not prove that H0 is true! It only
indicates a lack of evidence against H0 .

• Small p − values suggest that if H0 were true, results as extreme as xobs would
occur very rarely. The data are then deemed to be inconsistent with H0 . Thus
we say that we have evidence against H0 .

We say that:

p − value > .10 - no evidence against H0


.05 < p − value ≤ .10 - marginal evidence against H0
.01 < p − value ≤ .05 - evidence against H0
p − value ≤ .01 - strong evidence against H0

In our ESP example, we conclude that we have evidence against H0 , (p-value = 0.02).
The data are consistent with the hypothesis that I have ESP. Of course, results like
these could have occurred by chance, and this does NOT prove that I have ESP.

Ingredients for Tests of Significance

1. Test statistic, D - provides a ranking of all possible outcomes of an experiment


according to how closely they agree with H0 , the null hypothesis.
Small values of D =⇒ close agreement with H0
Large values of D =⇒ poor agreement with H0
(In the ESP example, D = X = the number of correct responses.)
2. We need a measuring device to determine how far away from H0 is the observed
test statistic. We use the p − value = P (D ≥ dobs | H0 true), the probability
of a random D greater than or equal to the value observed, dobs , computed
assuming that H0 is true.
This is the probability of observing such poor agreement between the null
hypothesis and the data if H0 were true.
If the p − value is small, then such poor agreement would almost never occur
when H0 is true.
With data, we cannot prove or disprove a null hypothesis. All that we can do is
to say whether our data is consistent or not consistent with the null hypothesis.

The p-value is a probability computed using our data and assuming that the null
hypothesis is true.

Example 4.1.1. Blind Taste Test: Twenty-five individuals were given two similar
glasses, one of Pepsi, one of Coke and each was asked to identify the one that was
Coke. 60% (15) correctly identified Coke. Is this consistent with

H0 : there is no detectable difference between Pepsi and Coke.

Solution: We shall initially proceed under the assumption that H0 is true, that there
is no detectable difference between Pepsi and Coke, and see if the data provide
evidence against the null hypothesis. We need a probability model for the data
under the assumption that H0 is true. Let

X = # individuals out of 25 who correctly identified Coke.

If H0 is true, then the responses would be guesses, with a .50 chance of being correct.
Therefore under H0 , X ∼ Binomial(n = 25, p = 1/2) and we would expect to observe
a value of X near E(X) = np = 12.5 if H0 is true. In this example, very small
numbers of correct responses as well as large numbers of correct responses would
suggest that there is a detectable difference between Pepsi and Coke. For example,
if zero out of 25 responses were correct, that would suggest that the two drinks were
detectably different, even though Coke was never correctly identified.
We therefore define our statistic to be large when the number of observed correct,
xobs , is much smaller or much larger than 12.5. Let

D ≡ |X − 12.5|

be our test statistic which ranks possible values of X according to how close they
are to H0 .

• If D is close to zero, then the data are in agreement with H0 .


• if D is large (close to 12), then the data are NOT in agreement with H0 .
In our example, 15 correctly identified Coke, so that,

dobs = |xobs − 12.5| = |15 − 12.5| = 2.5



p − value = P (D ≥ dobs | H0 true)


= P (D = 2.5 or 3.5 or . . . 12.5 | H0 true)
= 1 − P (D = .5 or 1.5 | H0 true)

D = .5 =⇒ X = 13 or 12
D = 1.5 =⇒ X = 14 or 11

p-value = 1 − P (X = 11, 12, 13 or 14 | H0 true)
        = 1 − Σ_{x=11}^{14} (25 choose x) (1/2)^25
        = 0.4243562 (using R)
        = EXACT p-value

R Code: 1- sum(dbinom(11:14, size=25, p=.5))

If H0 were true, results as extreme as X = 15 would occur fairly often and we have no
evidence against the null hypothesis. The data are consistent with the null hypothesis
that there is no detectable difference between Pepsi and Coke (p-value = .42).

Example 4.1.1 continued: Suppose that 60% (150) of 250 individuals correctly
identified Coke. Is there evidence against the null hypothesis?
Under H0 : X ∼ Binomial(n = 250, p = 1/2) and E(X) = 125.


D = |X − 125| and dobs = |150 − 125| = 25


p − value = P (D ≥ dobs | H0 true) = 0.001883301

There is very strong evidence against the null hypothesis of no detectable difference
between Coke and Pepsi (p-value = 0.002)!!

R Code: 1- sum(dbinom(101:149, size=250, p=.5))

Since n = 250 is large, we can use a normal approximation to the Binomial to


obtain,

p-value = P (|X − 125| ≥ 25)
        = P (|X − 125|/√(250 p(1 − p)) ≥ 25/√(250 p(1 − p)))
        ≃ P (|Z| ≥ 25/√(250/4))
        = P (|Z| ≥ 3.16) = 0.001565402

R Code: 2*pnorm(-25/sqrt(250/4))
The large SAMPLE SIZE yields a more precise estimate of the probability of correctly
identifying Coke versus Pepsi!

4.2 Likelihood Ratio Tests for Simple Null Hypotheses
Optional reading: Chapter 12.2

We have looked at two test statistics to test hypotheses about the Binomial parameter
p, D = X and D = |X − np|. In general, the test statistic will depend upon the
hypothesis being tested and it may be difficult to “come up” with a test statistic. The
Likelihood Ratio Statistic (LRS) is a good statistic and it has intuitive appeal. We
consider the LRS for simple null hypotheses. In many applications, the hypothesis
to be tested can be formulated as a hypothesis concerning the values of unknown
parameters in a probability model.

Definition: A simple hypothesis specifies numerical values for all of the unknown
parameters in the model.

4.2.1 One Parameter Case


Consider a probability model with one unknown parameter θ. We wish to test
H0 : θ = θ0 where θ0 is a particular numerical value.
For example, H0 : p = 1/2 in the Binomial model.
With simple hypothesis tests, we are asking if a particular parameter value θ0 is
plausible given the data we have. We have already seen a function that tells us
about the relative plausibilities of parameter values, R(θ) or r(θ).

Definition: The Likelihood Ratio Statistic (LRS) for testing H0 : θ = θ0 is

D ≡ −2r (θ0 ) = 2[ℓ(θ̂) − ℓ (θ0 )],

where θ̂ is the MLE of θ. Since ℓ(θ̂) ≥ ℓ (θ0 ) for all values of θ0 , then D ≥ 0.

D small =⇒ outcome is such that θ0 is a plausible parameter value


D large =⇒ outcome is such that θ0 is not a plausible parameter value

Thus, D ranks possible outcomes of the experiment according to how well they agree
with H0 : θ = θ0 . Let dobs be the observed numeric value of the Likelihood Ratio
Statistic, then the p − value is calculated as:

p − value = SL = P (D ≥ dobs | H0 true)


= P (D ≥ dobs | θ = θ0 )

Under the assumption that H0 : θ = θ0 is true,

D ≈ χ2(1)

in most cases of one-parameter simple hypotheses, therefore, you can use the
chi-squared table to obtain p-values as,

p − value ≃ P (χ2(1) ≥ dobs ).



Notation Notes:
• ≈ means approximately distributed as
• ≃ means approximately equal to

Example 4.2.1. The measurement errors associated with a set of scales are independent
normal with known σ = 1.3 grams. Ten (n = 10) weighings of an unknown
mass µ give the following results in grams:

227.1 226.8 224.8 228.2 225.6


229.7 228.4 228.8 225.9 229.6

Are the data consistent with H0 : µ = µ0 = 226? Derive the LRS for testing µ = 226.

Let xi represent the i'th observed weighing; then

D = 2 [ℓ(µ̂) − ℓ(µ0)]

L(µ) = ∏_{i=1}^n (1/√(2πσ²)) exp(−(xi − µ)²/(2σ²))
     = (1/√(2πσ²))^n exp(−(1/(2σ²)) Σ_{i=1}^n (xi − µ)²)

Since σ is assumed known, and equal to 1.3 grams, the term (1/√(2πσ²))^n is considered
a constant and can be disregarded in the construction of the likelihood for µ.

ℓ(µ) = −(1/(2σ²)) Σ_{i=1}^n (xi − µ)²   and   µ̂ = Σ_{i=1}^n xi/n = x̄ = 227.49.

D = −2r(µ0) = 2 [ℓ(µ̂) − ℓ(µ0)], where µ0 = 226
  = 2 [−(1/(2σ²)) Σ_{i=1}^n (xi − x̄)² + (1/(2σ²)) Σ_{i=1}^n (xi − µ0)²]

Consider the second term in the above expression:

Σ_{i=1}^n (xi − µ0)² = Σ_{i=1}^n [(xi − x̄) + (x̄ − µ0)]² = Σ_{i=1}^n (xi − x̄)² + Σ_{i=1}^n (x̄ − µ0)².

Why is the cross-product term zero?

Therefore,

D = (1/σ²) [−Σ_{i=1}^n (xi − x̄)² + Σ_{i=1}^n (xi − x̄)² + Σ_{i=1}^n (x̄ − µ0)²]
  = n(x̄ − µ0)²/σ² = (x̄ − µ0)²/(σ²/n).

We want to find the distribution of D assuming that H0 : µ = µ0 = 226 is true.

If X ∼ N(µ0 , σ²), then X̄ ∼ N(µ0 , σ²/n)
=⇒ Z = (X̄ − µ0)/(σ/√n) ∼ N(0, 1) =⇒ D = Z² ∼ χ²(1)

For this example, the likelihood ratio test statistic is equivalent to the Z test statistic
that you learned in your first course in statistics!
In the normal case, the LRS has an EXACT χ2(1) distribution!

dobs = (10/(1.3)²)(227.49 − 226)² = 13.14
p-value = P (D ≥ dobs | µ0 = 226) = P (χ²(1) ≥ 13.14) = 0.00029 < .005

There is very strong evidence against H0 : µ = 226 (p − value < 0.005) . In our
report, we should include our estimate of the mean, µ, together with an interval
estimate which provides information about the margin of error in estimating the
mean. We write in our report: “The estimated mean weight is 227.49 grams
and the data are not consistent with the hypothesis that the mean weight
is 226 grams (p − value < 0.005, 10% Likelihood interval estimate 226.61-
228.37 grams).”
I computed the 10% likelihood interval using the R code provided below. We will
see later that likelihood intervals are related to confidence intervals, which are more
commonly quoted in practice. In this example, the interval is very, very narrow, and
the lower endpoint is close to 226 grams. Although the data suggest that the mean
weight is statistically significantly different from 226, the investigator may find no
practical difference between the observed data and the hypothesis that the mean
is 226 grams.

4.2.2 R code for Example 4.2.1


#Compute the LRS and p-value
d<-10*(227.49-226)^2/(1.3^2)
1-pchisq(d,1)

#Define functions and data required to compute a likelihood interval.


sigma<-1.3
y<-c(227.1, 226.8, 224.8, 228.2, 225.6, 229.7, 228.4, 228.8, 225.9, 229.6)
mean(y)

#Log-likelihood function
ell <- function(mu,y,sigma){
res<-vector("numeric",length(mu))
for (i in 1:length(mu)){
res[i]<--sum((y-mu[i])^2)/2/sigma^2
}
return(res)
}

#MLE
muhat <- optimize(ell, c(225,230), maximum=TRUE, y=y, sigma=sigma)
muhat

#Log relative likelihood function; plot for values of mu to help


# determine starting values for computing a 10% likelihood interval
logR <- function(mu, muhat,y,sigma){
ell(mu,y,sigma) - ell(muhat,y,sigma)
}

mu <- seq(225,230,by=.1)
plot(mu,logR(mu,muhat$maximum,y,sigma), ylab='log relative likelihood',xlab='mu')
abline(h=log(.1)) #add a horizontal line at ln(p)
title('Log-relative Likelihood, Example 4.2.1')

#Likelihood intervals
p <- .1 #10% likelihood interval
#log relative likelihood minus ln(p)
logR.m.lnp <- function(mu, muhat, y,sigma, p) {logR(mu,muhat,y,sigma)-log(p)}

lower <- uniroot(logR.m.lnp, c(226,227), muhat$maximum, y,sigma, p)


lower
upper <- uniroot(logR.m.lnp, c(228,229), muhat$maximum, y,sigma, p)
upper

4.2.3 LR Statistic for 2 or More Parameters


Suppose that we have a probability model that depends upon a vector of unknown
parameters θ = (θ1 , θ2 , . . . , θp )′ . We wish to test H0 : θ = θ0 , for θ0 a vector of
numbers. Then H0 is a simple hypothesis because it specifies a numerical value for
each parameter in the model.

The Likelihood ratio statistic for testing the SIMPLE hypothesis H0 : θ = θ0


is

D ≡ −2r(θ0) = 2 [ℓ(θ̂) − ℓ(θ0)],

where θ̂ = (θ̂1, . . . , θ̂p)′ is the joint MLE.

Under the assumption that θ = θ0 , D ≈ χ2(k) (in most cases), where k is the
number of functionally independent unknown θ parameters in the model.

Example 4.2.2. Multinomial distribution: Consider frequencies (X1, X2, . . . , X5) ∼
Multinomial(n, p1, p2, . . . , p5). Here there are k = 4 functionally independent parameters
since Σ_{i=1}^5 pi = 1. If four parameters are known, then the fifth is fully determined.

Example 4.2.3. Heart disease: In a long-term study of heart disease in a large


group of men, it was noted that 63 men who had no previous record of heart problems
died suddenly of heart attacks. The following table shows the number of such deaths
recorded on each day of the week.

Day Mon Tues Wed Thurs Fri Sat Sun Total


# deaths 22 7 6 13 5 4 6 63
x1 x2 x3 x4 x5 x6 x7 n

Test the null hypothesis that deaths are equally likely to occur on any day of the
week.

Solution:
(X1 , . . . , X7 ) ∼ Multinomial (n = 63, p1 , . . . , p7 )

The null hypothesis is that deaths are equally likely to occur on any day of the week,
and so, H0 : p1 = p2 = · · · = p7 = 1/7, a simple hypothesis.
The likelihood ratio statistic for testing H0 is,

D = −2r (p1 = 1/7, p2 = 1/7, . . . , p7 = 1/7)


= 2 [ℓ (p̂1 , . . . , p̂7 ) − ℓ (p1 = 1/7, . . . , p7 = 1/7)]

We need the joint MLE’s for the p′ s!

L(p1, . . . , p7) = p1^x1 p2^x2 · · · p7^x7 = p1^22 p2^7 · · · p7^6, where Σ_{i=1}^7 pi = 1

ℓ(p1, . . . , p7) = 22 ln p1 + 7 ln p2 + · · · + 6 ln p7 = Σ_{i=1}^7 xi ln pi.

We need to maximize ℓ(p1, . . . , p7) subject to the constraint Σ_{i=1}^7 pi = 1. To do that we
can use Lagrange multipliers (if you have taken Math 200) or we can simply substitute
the constraint into the log-likelihood and maximize over the parameters.

Method 1: Lagrange multipliers


The objective function is g = Σ_{i=1}^7 xi ln pi + γ (Σ_{i=1}^7 pi − 1), and the partial derivatives
with respect to the p's and γ are,

∂g/∂pi = xi/pi + γ,  i = 1, . . . , 7        (4.1)
∂g/∂γ = Σ_{i=1}^7 pi − 1                     (4.2)

To maximize, set the partial derivatives to 0 and solve.

(4.1) = 0 =⇒ p̂i = −xi/γ̂,  i = 1, . . . , 7
(4.2) = 0 =⇒ Σ_{i=1}^7 p̂i = 1 =⇒ −Σ_{i=1}^7 xi/γ̂ = 1 =⇒ γ̂ = −Σ_{i=1}^7 xi = −63 = −n

Substituting γ̂ into the solution for (4.1), p̂i = xi/n = xi/63, i = 1, . . . , 7,
which is the sample proportion that falls in category i.

Method 2: Substitute the constraint into the log-likelihood


We have that,

ℓ(p1, . . . , p7) = Σ_{i=1}^7 xi ln pi,

where Σ_{i=1}^7 pi = 1 and Σ_{i=1}^7 xi = n. Let p7 = 1 − p1 − p2 − p3 − p4 − p5 − p6 , and
substitute this into the log-likelihood. The log-likelihood becomes,

ℓ(p1, . . . , p7) = Σ_{i=1}^6 xi ln pi + x7 ln(1 − p1 − p2 − p3 − p4 − p5 − p6).

We take derivatives with respect to the p′ s, set the expressions equal to zero and
solve for the MLE’s of the p′ s.

∂ℓ/∂pi = xi/p̂i − x7/p̂7 = 0   for i = 1, . . . , 6.
Therefore, p̂i = xi (p̂7/x7).                (4.3)

Summing the six equations (4.3) results in,

Σ_{i=1}^6 p̂i = (p̂7/x7) Σ_{i=1}^6 xi
1 − p̂7 = (p̂7/x7)(n − x7)
(1 − p̂7) x7 = p̂7 (n − x7)
p̂7 = x7/n, and substituting into (4.3),
p̂i = xi/n.

Now that we have the MLE’s of the p′ s, we can compute our Likelihood Ratio
Statistic:

dobs = 2 [ℓ(p̂1, . . . , p̂7) − ℓ(1/7, . . . , 1/7)]
     = 2 [Σ_{i=1}^7 xi ln(xi/n) − Σ_{i=1}^7 xi ln(1/7)]
     = 23.27

p-value ≃ P (χ²(k) ≥ dobs | H0 true), here k = 6
        = P (χ²(6) ≥ 23.27 | p1 = p2 = · · · = p7 = 1/7)
        ≃ 0.0007

Here there are k = 6 functionally independent parameters since Σ_{i=1}^7 pi = 1. If six
parameters are known, then the seventh is fully determined.
The p-value is very small (< 0.01) and we say that we have very strong evidence
against the hypothesis that deaths are equally likely to occur on any day of the
week. The estimated expected frequencies under the null hypothesis that pi = 1/7 =
0.1428571 are all npi = 9. From the table of observed frequencies, we note that
there are many more heart attacks on Monday than would be expected under the
null hypothesis.

4.2.4 R code for Example 4.2.3


#Heart disease example
freq <- c(22 , 7 , 6 , 13 , 5 , 4 , 6)
sum(freq)
ell <- function(p, freq){
# Multinomial log-likelihood
# freq = frequencies; p = probabilities
sum(freq * log(p))
}

logR <- function(p, phat, freq){


#Log relative likelihood function
ell(p, freq) - ell(phat, freq)
}

dobs <- -2 * logR(rep(1/7, 7), freq/sum(freq),freq) #LRS observed


dobs
1-pchisq(dobs, 6) #pvalue

4.3 Likelihood Ratio Tests for Composite Hypotheses
Optional reading: Chapter 12.3

We may have hypothesized, a priori, that heart attacks have different daily proba-
bilities of occurrence, for example, they may be more likely to occur on Mondays,
and formed our null hypothesis as:
H0 : p2 = p3 = · · · = p7 = p and p1 unspecified.

Our BASIC MODEL for the multinomial frequencies is still (x1 , x2 , . . . , x7 ) ∼


Multinomial(n = 63, p1 , . . . , p7 ).

The null hypothesis does not specify numerical values for every parameter in the
model, since p is unknown, therefore it is NOT a simple hypothesis! H0 is an ex-
ample of a composite hypothesis. Note that if we knew p, we could obtain p1 by
subtraction since the p′ s must sum to one so that p1 = 1 − 6p.

Definition: A Composite hypothesis reduces the number of unknown parame-


ters in the model, but not to zero.

To test the new, composite hypothesis, we need to find the MLE of the p′ s assuming
that H0 is true. We substitute the hypothesized values into the log-likelihood and
maximize over p.

ℓ(p1, p2 = p, p3 = p, p4 = p, p5 = p, p6 = p, p7 = p) = x1 ln p1 + Σ_{i=2}^7 (xi ln p)
                                                      = x1 ln p1 + (ln p) Σ_{i=2}^7 xi

Instead of using a Lagrange multiplier, we substitute the constraint, p1 = 1 − 6p,
into the log-likelihood as follows,

ℓH(p1, p) = ℓ(p1, p2 = p, . . . , p7 = p) = x1 ln[1 − 6p] + (ln p) Σ_{i=2}^7 xi
          = 22 ln(1 − 6p) + 41 ln p.

Taking the derivative with respect to p yields,

∂ℓH/∂p = 22(−6)/(1 − 6p) + 41/p.

The MLE of p under H0 is p̃ = 41/378 = 0.1085 and p̃1 = 0.349.
ℓH (p̃1 , p̃) is the largest value that ℓ (p1 , . . . , p7 ) can attain under H0 .
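As a quick numerical check of the constrained MLE, we can maximize ℓH directly in R (a sketch; freq holds the observed frequencies from the table above, and the search interval is an arbitrary choice inside (0, 1/6)):

# Numerical check that the constrained MLE p-tilde equals 41/378
freq <- c(22, 7, 6, 13, 5, 4, 6)
ellH <- function(p) freq[1]*log(1 - 6*p) + sum(freq[2:7])*log(p)
optimize(ellH, c(0.001, 1/6 - 0.001), maximum=TRUE)$maximum   # approx 0.1085 = 41/378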

The Likelihood Ratio Statistic for testing the composite hypothesis H0 is

D = 2 [ℓ(p̂1, p̂2, . . . , p̂7) − ℓ(p̃1, p̃2, . . . , p̃7)],

where the first term is the maximum possible log-likelihood for this probability model
and the second term is the maximum possible log-likelihood under H0 , with
p̃1 = 0.349 and p̃i = p̃ = .1085 for i = 2, ..., 7.

D ranks possible outcomes according to how well they agree with H0 .


If D is small, then the maximum of the log likelihood is nearly as large assuming H0
to be true as under the basic model.

Under H0 : the asymptotic distribution of the LRS is approximately

D ≈ χ2(k−q)

where
k = # functionally independent unknown parameters in the basic model
q = # functionally independent unknown parameters in hypothesized model.

In our example, k = 6 and q = 1.

dobs = 2 [ℓ(p̂1, . . . , p̂7) − ℓ(p̃1, p̃, . . . , p̃)]
     = 2 [Σ_{i=1}^7 xi ln(xi/n) − x1 ln p̃1 − Σ_{i=2}^7 xi ln p̃]
     = 2 [(−110.96) − (−114.22)] = 6.529754

p-value = P (D ≥ dobs | H0 true)
        ≃ P (χ²(5) ≥ 6.529754) = 0.2580262


The p-value is large (> 0.1), and we have no evidence against the null hypothesis. We
conclude that heart attacks were more likely to occur on Mondays for this sample,
and equally likely to occur on the other days of the week. The estimated expected
frequencies under H0 , are shown in the bottom row of the following table:

Day Mon Tues Wed Thurs Fri Sat Sun Total


# deaths 22 7 6 13 5 4 6 63
Estimated expected 22=63p˜1 6.8=63p̃ 6.8 6.8 6.8 6.8 6.8 n = 63
freq under H0

4.3.1 R Code Example 4.2.3 continued


#hypothesize that p_2 = p_3 ... p_7 =p
p_tilde <- 41/378
p_tilde
p_1 <- 1-6*p_tilde

p_1
dobs <- -2*logR(c(p_1,rep(p_tilde,6)),freq/sum(freq),freq)
dobs

1-pchisq(dobs,5) #pvalue

4.3.2 Summary of Likelihood Ratio testing


The following summarizes the material in Sections 4.2 and 4.3.
1. First we assume that the data arise from a BASIC probability model with k
functionally independent unknown parameters. Compute the MLE’s of these
k unknown parameters, θ̂.
2. Then we write the null hypothesis in terms of model parameters with q func-
tionally independent unknown parameters. Note that for a simple hypothesis,
q = 0. Compute the MLE’s of the q unknown parameters under the null
hypothesis, θ̃, when q ̸= 0.
3. We test H0 - is the data consistent with the null hypothesis model?
(a) The likelihood ratio test statistic is computed. Small values of the statistic
indicate good agreement between the data and the null hypothesis.
(b) The p-value is the probability of obtaining results as extreme as those
observed assuming that the null hypothesis is true. Small p-values indicate
that the results are unlikely to occur if the null hypothesis is true. Large
p-values suggest that the data are consistent with the null hypothesis.

Likelihood Ratio Statistic: D = −2r(θ̃) = 2 ln[L(θ̂)/L(θ̃)] = 2[ℓ(θ̂) − ℓ(θ̃)],
where θ̃ is the MLE of θ under the hypothesized model.

If H0 is true, then D ≈ χ2(k−q) , for the cases that we will consider.



4.4 Tests for Binomial Probabilities


Optional reading: Chapter 12.4

Example 4.4.1. Should Pot Be Legalized? One hundred people, randomly


selected from each of four provinces, were asked whether or not they think that pot
should be legalized. The frequencies responding yes and no are tabulated below.
Province B.C. Alberta Sask. Ontario Total
Yes 23 19 27 10 79
No 77 81 73 90 321
Totals 100 100 100 100 400
1. Test the hypothesis that the probability of a Yes response is the same in all
four provinces.
2. Test the hypothesis that the three western provinces have respondents who are
equally likely to say Yes, whereas Ontario responds differently.

1. Test the hypothesis that the probability of a Yes response is the same
in all four provinces.

Step 1: BASIC model


Let Yi = # Yes for province i out of ni = 100.
Then Yi ∼ Binomial (ni , pi ) independent i = 1, 2, 3, 4, where pi = P (Yes for province i).

The probability mass function for the data is,

f(y1, y2, y3, y4; p1, p2, p3, p4) = ∏_{i=1}^4 (ni choose yi) pi^yi (1 − pi)^(ni − yi)

There are k = 4 functionally independent parameters.

L(p1, p2, p3, p4) = ∏_{i=1}^4 pi^yi (1 − pi)^(ni − yi)
ℓ(p1, p2, p3, p4) = Σ_{i=1}^4 [yi ln pi + (ni − yi) ln(1 − pi)]

Taking the derivative with respect to pi , setting equal to zero and solving, we obtain,

∂ℓ/∂pi = yi/pi − (ni − yi)/(1 − pi), and p̂i = yi/ni , i = 1, 2, 3, 4.

Step 2: Hypothesized model

H0 : p1 = p2 = p3 = p4 = p unspecified

This is a composite hypothesis because p is unknown and must be estimated. There is


q = 1 functionally independent unknown parameter. Substituting the null hypothesis
into the log-likelihood, we obtain,

ℓH(p) = ℓ(p1 = p, p2 = p, p3 = p, p4 = p) = Σ_{i=1}^4 [yi ln p + (ni − yi) ln(1 − p)]
      = (Σ_{i=1}^4 yi) ln p + (400 − Σ_{i=1}^4 yi) ln(1 − p)

Taking derivatives, setting to zero and solving, we obtain,


∂ℓH/∂p = (Σ_{i=1}^4 yi)/p − (400 − Σ_{i=1}^4 yi)/(1 − p), and p̃ = (Σ_{i=1}^4 yi)/400 = 0.1975.

Step 3: Test the Hypothesis

D = 2 [ℓ(p̂1, p̂2, p̂3, p̂4) − ℓ(p̃, p̃, p̃, p̃)]
  = 2 Σ_{i=1}^4 { yi ln(yi/(ni p̃i)) + (ni − yi) ln((ni − yi)/(ni (1 − p̃i))) },

where p̃i = p̃ = 0.1975, i = 1, 2, 3, 4.



Substituting in values for p̂1 , p̂2 , p̂3 , p̂4 , and p̃, we obtain dobs = 10.76. The p-value
for the test is computed as,

p − value ≃ P (χ2(k−q) ≥ dobs )


= P (χ2(3) ≥ 10.76) ≃ 0.013.

There is evidence against the hypothesis that the four provinces have the same prob-
ability of responding Yes. We would write, “the data are not consistent with the
hypothesis that respondents from the four provinces are equally likely to support the
legalization of pot (p-value = 0.013).” The estimated expected frequencies under the
null hypothesis are given in the table below. Note that Ontario has fewer respondents
in favour of legalizing pot than would be expected under the hypothesis that those
sampled from the four provinces were equally likely to support legalization.

Province B.C. Alberta Sask. Ontario Total


Yes 23 19 27 10 79
(Est Expected Yes) (19.75) (19.75) (19.75) (19.75) 79
No 77 81 73 90 321
Totals 100 100 100 100 400

Note: The form of the likelihood ratio statistic is:

D = 2 Σ_{all cells} ObsFreq · ln(ObsFreq/ExpectedFreq),

where ObsF req is the observed frequency and ExpectedF req is the estimated ex-
pected frequency under the null hypothesis. This form for the likelihood ratio statis-
tic will come up again.
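This general form is easy to compute directly in R (a sketch; obs and expfreq are the observed and estimated expected frequencies over all eight cells of the table, using the values from part 1):

# Likelihood ratio statistic in its Obs/Expected form
lrs <- function(obs, expfreq) 2*sum(obs*log(obs/expfreq))
obs <- c(23, 19, 27, 10, 77, 81, 73, 90)        # Yes counts then No counts
expfreq <- c(rep(19.75, 4), rep(80.25, 4))      # expected under a common p = 0.1975
lrs(obs, expfreq)                               # approx 10.76
1 - pchisq(lrs(obs, expfreq), 3)                # p-value approx 0.013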

2. Test the hypothesis that the three western provinces have respon-
dents who are equally likely to say Yes, whereas Ontario responds differ-
ently.

Step 1: BASIC model


The BASIC model stays the same as for question 1.

Step 2: Hypothesized model



H0′ : p1 = p2 = p3 = pW , p4 unspecified

This is a composite hypothesis because pW and p4 are unknown and must be esti-
mated. There are q = 2 functionally independent unknown parameters. Substituting
the null hypothesis into the log-likelihood, we obtain,
ℓH′(pW, p4) = ℓ(p1 = pW, p2 = pW, p3 = pW, p4)
            = Σ_{i=1}^3 [yi ln pW + (ni − yi) ln(1 − pW)] + y4 ln p4 + (100 − y4) ln(1 − p4)
            = (Σ_{i=1}^3 yi) ln pW + (300 − Σ_{i=1}^3 yi) ln(1 − pW) + y4 ln p4 + (100 − y4) ln(1 − p4)

Taking derivatives, setting to zero and solving, we obtain,


∂ℓH′/∂pW = (Σ_{i=1}^3 yi)/pW − (300 − Σ_{i=1}^3 yi)/(1 − pW), and p̃W = (Σ_{i=1}^3 yi)/300 = 0.23.
∂ℓH′/∂p4 = y4/p4 − (100 − y4)/(1 − p4), and p̃4 = y4/100 = 0.10.
The estimated expected frequencies for the four provinces under the null hypothesis,
H0′ , are 23, 23, 23 and 10 respectively.

Step 3: Test the Hypothesis

D = 2 [ℓ(p̂1, p̂2, p̂3, p̂4) − ℓ(p̃W, p̃W, p̃W, p̃4)]
  = 2 Σ_{i=1}^4 { yi ln(yi/(ni p̃i)) + (ni − yi) ln((ni − yi)/(ni (1 − p̃i))) }.

Substituting in values for p̂1 , p̂2 , p̂3 , p̂4 , and p̃W and p̃4 , we obtain dobs = 1.814. The
p-value for the test is computed as,
p − value ≃ P (χ2(k−q) ≥ dobs )
= P (χ2(2) ≥ 1.814) ≃ 0.404.

We have no evidence against the hypothesis that the Western provinces are equally
likely to support legalization of pot.

4.4.1 R Code for Example 4.4.1:

y<-c(23, 19, 27, 10)


n<-rep(100,4)
ell<-function(y,n,p){
sum(y*log(p) + (n-y)*log(1-p))
}

LRS<-function(p,p0,y,n){
2*(ell(y,n,p)-ell(y,n,p0))
}

phat<-y/n #MLE under BASIC model


p0<-sum(y)/sum(n) #MLE under (a) H0
p0
D<-LRS(phat,p0,y,n) #observed LRS
D
1-pchisq(D,4-1) #p-value

pW<-sum(y[1:3])/sum(n[1:3])
p1<-c(pW,pW,pW,y[4]/n[4]) #MLE under (b) H0
p1
D1<-LRS(phat,p1,y,n) #observed LRS
D1
1-pchisq(D1,4-2) #p-value

4.5 Tests for Multinomial Probabilities, Goodness of Fit Test
Optional reading: Chapter 12.5

ASIDE: Note the difference between independent Binomials and Multinomial by


noting which marginal totals are fixed.

Example 4.5.1. (Example 2.2.1 revisited). (Goodness of Fit test).



200 specimens of a new high impact plastic are tested by repeatedly striking them
with a hammer until they fracture. The data are as follows:

# hits required 1 2 3 ≥ 4 Total


# specimens 112 36 22 30 200
frequencies x1 x2 x3 x4 n = 200

Let
Y = # hits required to fracture a specimen

We assumed that:

f(y) = P (Y = y) = θ^(y−1) (1 − θ),   y = 1, 2, 3, . . .

where θ = P (surviving a hit independently of previous hits). Y has a geometric


distribution. The assumption did not seem tenable and we found that the model did
not yield estimated expected frequencies that were close to the observed frequencies.
Now, we can formally test the Goodness of fit of this model using a Likelihood ratio
test.

Step 1: BASIC model


(X1 , X2 , X3 , X4 ) ∼ M ultinomial (n = 200, p1 , p2 , p3 , p4 ).
There are k = 3 functionally independent unknown parameters.

L(p1, p2, p3, p4) = p1^x1 p2^x2 p3^x3 p4^x4
ℓ(p1, p2, p3, p4) = Σ_{i=1}^4 xi ln pi
p̂i = xi/n = xi/200   (already shown)

Step 2: Hypothesized model

H0 : Geometric: p1 = 1 − θ, p2 = θ(1 − θ), p3 = θ²(1 − θ), p4 = θ³.


There is only q = 1 unknown parameter under the hypothesized model.



We computed the MLE for θ under H0 , θ̃ = 1/2, so that

p̃1 = 1/2, p̃2 = 1/4, p̃3 = 1/8, p̃4 = 1/8

Step 3: Test the Hypothesis

D = −2r(p̃1, p̃2, p̃3, p̃4)
  = 2 [ℓ(p̂1, p̂2, p̂3, p̂4) − ℓ(p̃1, p̃2, p̃3, p̃4)]
  = 2 [Σ_{i=1}^4 xi ln(xi/n) − Σ_{i=1}^4 xi ln p̃i]
  = 2 Σ_{i=1}^4 xi ln(xi/(n p̃i))

D has the form that we saw earlier,

D = 2 Σ_{all cells} ObsFreq · ln(ObsFreq/ExpectedFreq).

The estimated expected frequencies are given in the table below:

# blows required 1 2 3 ≥ 4 Total


# specimens 112 36 22 30 200
frequencies x1 x2 x3 x4 n = 200
(Est exp freq) (100) (50) (25) (25) (200)

Substituting into the formula, we obtain dobs = 7.048.


Under H0 : D ≈ χ2(k−q) , k − q = 3 − 1 = 2.

p-value = P (D ≥ dobs | H0 true)
        ≃ P (χ²(2) ≥ 7.048) = 0.02948

.025 < p-value < .05

There is evidence against the geometric model.

Exercise: Test the fit of the “extended” geometric model - where we assumed that
a proportion λ were defective.

4.5.1 R Code for Example 4.5.1:


# Goodness of fit Plastic specimens

freq <- c(112, 36, 22, 30)


sum(freq)
ell <- function(p,freq){
# Multinomial log-likelihood
# freq = frequencies; p = probabilities
sum(freq*log(p))
}

LRS <- function(p0, phat,freq){


#Likelihood ratio statistic
2*(ell(phat,freq) - ell(p0,freq))
}

ptilde <- c(.5, .25, .125, .125)


dobs <- LRS(ptilde,freq/sum(freq),freq) #LRS observed
dobs
1-pchisq(dobs,2) #pvalue

4.6 Multinomial Probabilities - Tests for Independence in Contingency Tables
Optional reading: 12.6
Contingency tables are also called two-way tables or cross-tabulations.

Example 4.6.1. It was noted that married undergraduates seemed to do better


academically than single students. The following observations were made on the
examination results of 1500 engineering students. Students were asked to check a
box on the examination booklet indicating if they were married or single.

Fail Pass Total


Married 14 143 157
Single 283 1060 1343
Total 297 1203 1500

Are these observations consistent with the hypothesis of a common failure rate for
single and married students? Use a Likelihood Ratio test to answer the question.

Step 1: BASIC Model


First, which totals are known before the examination is written? Only the overall total of 1500!
Assume that an engineering student falls in exactly one of the 4 categories inde-
pendently of the other students and with a constant probability given in the table
below:
Fail Pass Total
Married p11 p12
Single p21 p22
Total 1

Because only the total number of students, 1500, is fixed in advance of the examination,
the data arise from a single Multinomial distribution and Σ_{i,j=1}^2 pij = 1.

Let’s introduce some general notation and label the frequencies as follows:

Fail Pass Total


Married x11 x12 r1
Single x21 x22 r2
Total c1 c2 n = 1500

We have that (X11 , X12 , X21 , X22 ) ∼ Multinomial (n = 1500, p11 , p12 , p21 , p22 ) and the
number of functionally independent parameters is k = 3.
Then the MLE for pij under the BASIC model is p̂ij = xij/n.
.

Step 2: Hypothesized model


The hypothesis is that the failure probability is the same for married and single
students. We write that in terms of the model as follows:

H0 : P (fail | married) = P (fail | single) = P (fail) = α unspecified.

If H0 is true, then

P (pass | married) = 1 − α = P (pass | single) = P (pass)

In other words, the null hypothesis states that pass/fail on the examination is inde-
pendent of marital status!
Let

β=P (married) then,


p11 = P (fail & married) = αβ
p12 = P (pass & married) = (1 − α) β
p21 = P (fail & single) = α (1 − β)
p22 = P (pass & single) = (1 − α) (1 − β)

Here q = 2, as we have two parameters that require estimation.

We will use the Likelihood ratio statistic for testing H0 , therefore we need to compute
the MLE’s of the pij ’s under the hypothesized model.

L(α, β) = [αβ]^14 [(1 − α)β]^143 [α(1 − β)]^283 [(1 − α)(1 − β)]^1060
        = α^297 (1 − α)^1203 β^157 (1 − β)^1343

The Likelihood looks the same as that for two independent binomials!

ℓ(α, β) = 297 ln α + 1203 ln(1 − α) + 157 ln β + 1343 ln(1 − β)

α̃ = 297/1500 = c1/n = proportion who fail
β̃ = 157/1500 = r1/n = proportion married

We use these to compute the p̃ij's and the estimated expected frequencies under H0
in each of the cells in the table as follows:

eij = estimated expected freq. in the (i, j)th cell under the hypothesized model
e11 = n p̃11 = n α̃ β̃ = r1 c1/n = (297)(157)/1500
e12 = n p̃12 = n (1 − α̃) β̃ = r1 c2/n
e21 = n p̃21 = n α̃ (1 − β̃) = r2 c1/n
e22 = n p̃22 = n (1 − α̃)(1 − β̃) = r2 c2/n
The estimated expected frequencies under the independence hypothesis are included
in the original data table in parentheses.
Observed frequencies (estimated expected frequencies under H0 )
Fail (Fail) Pass (Pass) Total
Married 14 (31.086) 143 (125.914) 157
Single 283 (265.914) 1060 (1077.086) 1343
Total 297 1203 1500

Step 3: Test the hypothesis



From the last section, we have that the form of the Likelihood ratio statistic for the
multinomial model is:

D = 2 Σ_{all cells} ObsFreq · ln(ObsFreq/ExpectedFreq).

Rewriting that using the notation of this section,

D = 2 Σ_{all cells} xij ln(xij/eij)
dobs = 2 (7.70) = 15.40

p-value = P (D ≥ dobs | H0 true)
        ≃ P (χ²(k−q) ≥ 15.40),   k = 3, q = 2
        = P (χ²(1) ≥ 15.40) = .00009 < .001

We have very strong evidence against the hypothesis that there is a common failure
rate for single and married students.
The data suggest that there is an association between marital status and whether
students pass or fail the examination.
What is the nature of the association?

Proportion of married who failed = 14/157 = .089

Proportion of single who failed = 283/1343 = .211

The data suggests that married students are less likely to fail the exam. Does that
mean that marriage causes better outcomes on examinations?

4.6.1 R Code for Example 4.6.1:


# Test of Independence

freq <- matrix(c(14, 143, 283, 1060), nrow=2, byrow=TRUE)


freq
sum(freq)
ell <- function(p,freq){
# Multinomial log-likelihood
# freq = frequencies; p = probabilities
sum(freq*log(p))
}

LRS <- function(p0, phat,freq){
#Likelihood ratio statistic
2*(ell(phat,freq) - ell(p0,freq))
}

rsum <- rowSums(freq)


csum <- colSums(freq)
rsum
csum

eij <- outer(rsum, csum)/sum(freq) # exp freq


eij # eij= r_i * c_j / n

# estimated probs under H0 are eij/sum(freq)

dobs <- LRS(c(eij)/sum(freq),c(freq)/sum(freq),c(freq)) #LRS observed


dobs
1-pchisq(dobs,1) #pvalue, df=(#rows - 1)(#cols - 1)

4.7 Cause and Effect


Optional reading: Sections 12.7
The statement “A and B are associated” means that A and B tend to occur together.

This does not mean that A causes B! There are 3 possible cause-effect relationships
that could produce the association:
(i) A causes B
(ii) B causes A
(iii) some other factor C causes both A and B
We cannot claim that A causes B until we have ruled out (ii) and (iii).

4.7.1 Accuracy of the χ2 approximation


There are situations where the χ2 approximation to the distribution of
the Likelihood ratio statistic for multinomial/binomial data is inaccu-
rate.
The χ2 approximation should not be trusted if there are categories for which ei ≃ 0
but xi ≥ 1.
Rule of thumb: ei should be ≥ 5.
Remedy: If there are several categories for which ei < 5, pool adjacent categories
to increase the corresponding ei's.

4.8 The General Contingency Table


Consider n independent repetitions of an experiment and classify each outcome in
two ways according to which of events, Ai or Bj , i = 1, ..., a, j = 1, ..., b occur,
where
(i) A1 , . . . , Aa is a partition of the A sample space
(ii) B1 , . . . , Bb is a partition of the B sample space,
so that each outcome belongs to exactly one of the A′i s, and each outcome belongs
to exactly one of the Bj′ s.
The data is tabulated using the following notation:
Observed frequencies
B1 B2 . . . Bb Total
A1 x11 x12 x1b r1
A2 x21 x22 r2
..
.
Aa xa1 xa2 xab ra
Total c1 c2 cb n
Let pij = P {an outcome falls in class Ai Bj }; then Σ_{i,j} pij = 1.

Under the assumption of independent repetitions, the BASIC model is

(X11 , X12 , . . . , Xab ) ∼ Multinomial (n, p11 , . . . , pab ) .

There are k = ab − 1 functionally independent unknown parameters in the BASIC model.

P (x11, x12, . . . , xab) = (n choose x11, . . . , xab) p11^x11 · · · pab^xab

ℓ(p) = Σ_{i=1}^a Σ_{j=1}^b xij ln pij , where p = (p11, p12, . . . , pab)′.

The Likelihood ratio statistic for testing hypotheses about p is:



D = 2 [ℓ(p̂) − ℓ(p̃)], where p̂ij = xij/n
  = 2 Σ_i Σ_j xij ln(xij/(n p̃ij))
  = 2 Σ_i Σ_j xij ln(xij/eij)   (eij = n p̃ij),

where eij are the estimated expected frequencies under the hypothesized model, H0 .

D ≈ χ²(ab−1−q), where q is the number of functionally independent unknown parameters
in the hypothesized model.

The null Hypothesis of Independence can be written as:

H0 : P (Ai Bj ) = P (Ai ) P (Bj ) for all i, j

or H0 : P (Bj | Ai ) = P (Bj ) for all i, j

Under H0 , the unknown parameters are:

αi = P (Ai),  i = 1, . . . , a,  with Σ_{i=1}^a αi = 1
βj = P (Bj),  j = 1, . . . , b,  with Σ_{j=1}^b βj = 1

=⇒ q = (a − 1) + (b − 1)

It can be shown [Exercise!] that α̃i = ri/n and β̃j = cj/n, so that,

eij = n p̃ij = n α̃i β̃j = ri cj/n.

The Likelihood ratio test statistic for testing H0 has an approximate χ2 distribution
with degrees of freedom computed as:
k − q = ab − 1 − [(a − 1) + (b − 1)] = (a − 1)(b − 1).

Example 4.8.1. The following data on heights of 210 married couples were presented
by Yule in 1900.

Wife
Husband Tall Medium Short Total
Tall 18 (15.48) 28 (32.19) 19 (17.33) 65
Med 20 (23.57) 51 (49.03) 28 (26.4) 99
Short 12 (10.95) 25 (22.78) 9 (12.27) 46
Total 50 104 56 210

Test the hypothesis that heights of husbands and wives are independent.

Solution: The estimated expected frequencies, eij = ri cj/n, under the hypothesis of
independence are given in parentheses in the table.
The Likelihood ratio statistic for testing H0 is:

D = 2 Σ_{i=1}^3 Σ_{j=1}^3 xij ln(xij/eij)

p-value = P (D ≥ dobs | H0 true)
        ≃ P (χ²(k−q) ≥ 3.13),   k − q = (a − 1)(b − 1) = 4
        = P (χ²(4) ≥ 3.13) ≃ 0.54

There is no evidence against the hypothesis of independence. The data suggest that
the heights of husbands and wives are not associated.

4.8.1 R Code for Example 4.8.1:


# Test of Independence - Yule example

freq <- matrix(c(18, 28, 19, 20, 51, 28, 12, 25, 9), nrow=3, byrow=TRUE)
freq
sum(freq)

ell <- function(p,freq){


# Multinomial log-likelihood
# freq = frequencies; p = probabilities
sum(freq*log(p))
}

LRS <- function(p0, phat,freq){


#Likelihood ratio statistic
2*(ell(phat,freq) - ell(p0,freq))
}

rsum <- rowSums(freq)


csum <- colSums(freq)
rsum
csum

eij <- outer(rsum, csum)/sum(freq) # exp freq


eij # eij= r_i * c_j / n

# estimated probs under H0 are eij/sum(freq)

dobs <- LRS(c(eij)/sum(freq),c(freq)/sum(freq),c(freq)) #LRS observed


dobs
1-pchisq(dobs,4) #pvalue, df=(#rows - 1)(#cols - 1)

4.8.2 Pearson’s Goodness of Fit Statistic


Pearson’s Goodness of Fit Statistic may be used with multinomial or binomial
data.

G.O.F. = Σ_{all cells} (xj − ej)²/ej , where ej = estimated expected frequency under H0

G.O.F. ≈ χ²(k−q)

When the ej ’s are large, G.O.F. will be very nearly equal to the Likelihood ratio
statistic.

For the Yule heights data in Example 4.8.1,

G.O.F. = 3.02, which yields p − value ≃ 0.55,

and there is no evidence against H0 .

4.8.3 R Code for Pearson’s GOF test, Example 4.8.1:


# Pearson Goodness-of-fit Statistic
# use the freq and eij from the code in the previous section

GOF <- sum((freq-eij)^2/eij)


1-pchisq(GOF,4)
Chapter 5

Confidence Intervals

Optional reading: Section 11.4


A confidence interval is interpreted as a range of “reasonable” values for a parameter
given the data. Earlier we considered the use of the relative likelihood function in
determining which values of an unknown parameter θ are plausible in the light of the
data. Values within the 10% Likelihood Interval are considered plausible because
they give the data at least 10% of the maximum probability which is possible under
the model.
Here we consider another way, based on tests of significance, of constructing a “rea-
sonable” interval of values for an unknown parameters. We show how these intervals
are related to Likelihood Intervals.
Definition: [A, B] is a 100(1 − α)% confidence interval for θ if CP (θ0 ) =
P (A ≤ θ0 ≤ B | θ = θ0 ) = 1 − α, (the coverage probability) for all parameter values
θ0 . A 95% confidence interval would include the true parameter value θ0 in 95% of
repetitions of the experiment with θ fixed.

5.1 Invert a Test to Derive a Confidence Interval
Example 5.1.1. The Small Business Association has taken a sample of 50 small
businesses to investigate the effect that the recession has had on profit levels. Test

the government’s claim that the recession has had no effect on profits, based on an
observed sample average decrease in profits of $1,000. Assume that measurements
are independent and normally distributed with σ = $600.

Solution:
Let Xi = change in profit in dollars for business i.

Step 1: BASIC model


Assume Xi ∼ N(µ, σ² = 600²), i = 1, ..., 50, and therefore k = 1.

L(µ) = ∏_{i=1}^n exp(−(xi − µ)²/(2σ²))   (ignoring constant factors, since σ is known)

ℓ(µ) = −(1/(2σ²)) Σ_{i=1}^n (xi − µ)²  and we know µ̂ = x̄.

Step 2: Hypothesized Model  H0 : µ = µ0 = 0, i.e. that there is no change in mean
profits. Here q = 0, since there are no unknown parameters to estimate.

Step 3: Test the hypothesis

LRS = D = −2r(µ0) = 2 [ℓ(µ̂) − ℓ(µ0)] = n(X̄ − µ0)²/σ².

We obtained this result in the last chapter.

Here n = 50, σ² = 600², and x̄ = −1000, so

dobs = 50 (−1000 − 0)²/600² ≈ 138.9
p-value = P (D ≥ dobs | H0 true) = P (χ²(1) ≥ 138.9) < .001


We have very strong evidence against H0 of no effect due to the recession.

Now, we consider, for which parameter values, µ0 , does a likelihood ratio test of
H0 : µ = µ0 yield a
p − value(µ0 ) ≥ 0.05?
What values of µ0 are reasonably consistent with the data?

p-value(µ0) = P (χ²(1) ≥ dobs(µ0) | H0 : µ = µ0) ≥ .05

where

dobs(µ0) = n(x̄ − µ0)²/σ².

From tables or R, P (χ²(1) ≥ 3.841) = .05,

therefore p-value(µ0) ≥ .05 if and only if dobs(µ0) ≤ 3.841. We must solve for the
values µ0 such that

n(x̄ − µ0)²/σ² ≤ 3.841
(x̄ − µ0)² ≤ 3.841 σ²/n
−1.96 σ/√n ≤ x̄ − µ0 ≤ 1.96 σ/√n
x̄ − 1.96 σ/√n ≤ µ0 ≤ x̄ + 1.96 σ/√n

Here x̄ = −1000, σ = 600, n = 50, so the observed confidence interval is


[−1166, −834]. Values of µ0 within this interval are consistent with the data, and
are reasonable estimates of µ. Since the confidence interval lies below zero, the gov-
ernment’s claim that the recession has had no effect on profits is not plausible given
the data.
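A short R sketch of these calculations, using the values given in the example:

# Example 5.1.1: LRS test of mu0 = 0 and the 95% confidence interval for mu
n <- 50; xbar <- -1000; sigma <- 600; mu0 <- 0
dobs <- n*(xbar - mu0)^2/sigma^2
dobs                                   # approx 138.9
1 - pchisq(dobs, 1)                    # p-value, essentially 0
xbar + c(-1, 1)*1.96*sigma/sqrt(n)     # 95% CI: approx [-1166, -834]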

One property of the Interval:


For a random X̄, the interval [X̄ − 1.96 σ/√n, X̄ + 1.96 σ/√n] is a random interval. If
we repeated the experiment, we would obtain a different x̄ and a different interval
estimate of µ. Let µT be the true unknown value of µ. We can compute the fraction
of times that the random interval would include the true value µT in a large number
of repetitions of the experiment.

P (X̄ − 1.96 σ/√n ≤ µT ≤ X̄ + 1.96 σ/√n | µ = µT)
= P (−1.96 ≤ (X̄ − µT)/(σ/√n) ≤ 1.96 | µ = µT) = .95,

since (X̄ − µT)/(σ/√n) ∼ N(0, 1) if µ = µT.

With probability .95 the interval

[X̄ − 1.96 σ/√n, X̄ + 1.96 σ/√n]

contains µT , the true value.


This interval is called a 95% Confidence Interval for µ.

We are confident that in 95 times out of 100 trials of the experiment the above
interval will contain µT , the true value.

Definition: A 95% Confidence Interval, CI, is an interval for the unknown param-
eter µT , such that in a large number of repetitions of the experiment, CI covers µT
95 times out of 100.

One way to construct a 95% CI is to solve for the set of parameter values µ0 such
that for a test of H : µ = µ0

p-value (µ0 ) ≥ 0.05.

which we do in the next section.



5.2 Approximate Confidence Intervals


We consider a Basic model that has one unknown parameter, θ, and let
D = −2r(θ0) be the Likelihood Ratio Statistic for testing H0 : θ = θ0 .

We know that D ≈ χ²(1), so that p-value ≃ P (χ²(1) ≥ dobs(θ0)),
where dobs(θ0) is the observed value of D when θ = θ0 .

Since P (χ²(1) ≥ 3.841) = .05, then p-value(θ0) ≥ .05 ⇐⇒ dobs(θ0) ≤ 3.841.

Thus, an approximate 95% Confidence interval for θ is the set of θ0 values such that,

dobs(θ0) ≤ 3.841
⇐⇒ −2r(θ0) ≤ 3.841
⇐⇒ r(θ0) ≥ −1.92
⇐⇒ e^r(θ0) = R(θ0) ≥ e^−1.92 = .147

Thus, an approximate 95% Confidence Interval for θ is just a 14.7% Likelihood In-
terval! Below is a table of common confidence interval levels and their corresponding
likelihood intervals.

CI % Corresponding LI %
90 25.8
95 14.7
96.8 10
99 3.6
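These correspondences can be checked directly in R, since a 100(1 − α)% confidence interval corresponds to an exp(−qchisq(1 − α, 1)/2) likelihood interval:

# Likelihood-interval level corresponding to each confidence level
ci <- c(.90, .95, .968, .99)
round(exp(-qchisq(ci, 1)/2), 3)   # approx 0.258 0.147 0.100 0.036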

Analogy: Pitching Horseshoes


Constructing a 95% confidence interval is like pitching horseshoes. In each case, there
is a fixed target either the population µ or the stake. We are trying to bracket the
target with some chancy device, either the random interval or the horseshoe.
There are several important ways in which the two situations differ. Customarily,
only one Confidence Interval is constructed, for a target µ that is not visible. Conse-
quently, the statistician does not know directly whether his/her Confidence Interval
includes the true value; he/she must rely on indirect statistical theory for assurance:
in the long run, 95% of the Confidence Intervals similarly constructed would include
the true value.

Example 5.2.1. (Example 4.4.1 revisited) In a random sample of 100 people from
B.C., it was found that 23 out of 100 favour legalization of pot. Construct an
approximate 95% Confidence Interval for the proportion of B.C. voters who support
legalization of pot.

θ = proportion who support legalization of pot


X = Number out of 100 sampled who support legalization of pot

X ∼ Binomial(n = 100, θ)

We want the set of all values θ0 , such that in a test H0 : θ = θ0

p-value (θ0 ) ≥ .05

Method 1: Use Likelihood Ratio Statistic Using the methods of this section,
we find a 14.7% Likelihood interval for θ. This works out to be [.155, .319] using the
R-code which is included at the end of this chapter.

Method 2: Use D = |X − nθ|


Here we use D = |X − nθ|, so that
p − value(θ0 ) = P {|X − nθ0 | ≥ |xobs − nθ0 | | H : θ = θ0 } .

For nθ, n(1 − θ) ≥ 5, X ≈ N(nθ, nθ(1 − θ)).

p-value(θ0) = P (|X − nθ0|/√(nθ0(1 − θ0)) ≥ |xobs − nθ0|/√(nθ0(1 − θ0)) | H0 : θ = θ0)
            ≃ P (|Z| ≥ |xobs − nθ0|/√(nθ0(1 − θ0))),

where Z ∼ N(0, 1), and

p-value(θ0) ≥ .05 ⇐⇒ |xobs − nθ0|/√(nθ0(1 − θ0)) ≤ 1.96

(i) We can solve the quadratic for θ0 . [Exercise!]

(ii) Or we can approximate VAR(X) = nθ0(1 − θ0) with nθ̂(1 − θ̂), where θ̂ = xobs/n = .23.

Using method (ii) - Solve for θ0 .

|xobs − nθ0| ≤ 1.96 √(nθ̂(1 − θ̂))
xobs/n − 1.96 √(θ̂(1 − θ̂)/n) ≤ θ0 ≤ xobs/n + 1.96 √(θ̂(1 − θ̂)/n)

An approximate 95% Confidence Interval for θ is:

[θ̂ − 1.96 √(θ̂(1 − θ̂)/n), θ̂ + 1.96 √(θ̂(1 − θ̂)/n)] = [.148, .312]

This gives us an indication of the precision of our estimate and the margin of error
for our estimate is 0.082. Intervals constructed this way have the property that 19
times out of 20, such intervals will cover the true value.
This is the basis for quotes in the media such as,

“is considered accurate to within 8.2 percentage points, 19 times out of 20”.
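A quick R version of Method (ii), using the values from this example:

# Approximate 95% CI based on the normal approximation with estimated variance
thetahat <- 23/100; n <- 100
me <- 1.96*sqrt(thetahat*(1 - thetahat)/n)   # margin of error, approx 0.082
thetahat + c(-1, 1)*me                       # approx [0.148, 0.312]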

5.2.1 R Code for Example 5.2.1


# Binomial Example 95% CI using LI
# Log-likelihood function
ell <- function(theta){
23*log(theta) + 77*log(1-theta)
}
theta <- seq(.1,.45,by=.01)
plot(theta,ell(theta),ylab='log likelihood',xlab='theta',
type='l')
title('Example Binomial CI, Log-Likelihood')

#MLE of theta
thetahat <- optimize(ell, c(.1,.6), maximum=TRUE)
thetahat

#Log relative likelihood function


logR <- function(theta, thetahat){
ell(theta) - ell(thetahat)
}
p <- .147 #14.7% likelihood interval
logR.m.lnp <- function(theta, thetahat, p) {logR(theta,thetahat)-log(p)}

plot(theta,logR.m.lnp(theta,thetahat$maximum,p), ylab='log relative likelihood',
xlab='theta',type='l')
abline(h=0) #add a horizontal line at zero
title('Example Binomial CI, Log Relative Likelihood')

#Likelihood intervals
lower <- uniroot(logR.m.lnp, c(.1, .23), thetahat$maximum, p)
lower

upper <- uniroot(logR.m.lnp, c(.23, .6), thetahat$maximum, p)


upper

5.3 Another Approximate Confidence Interval


Optional reading: 11.3
We learned in Section 4.2.1 that the Likelihood Ratio Statistic has an approxi-
mate χ2(1) distribution in the one parameter case assuming the null hypothesis to
be true.

A related, and extremely useful result, is that in many cases, the MLE has an
approximate normal distribution for large sample size, n. Again, consider a Basic
model that has one, unknown parameter, θ, and let θ̂ be the MLE of θ where the
‘true’ value of θ = θ0 . Then for many probability models,

(θ̂ − θ0) √I(θ̂) ≈ N(0, 1),

where I(θ̂) is the Information function defined in Section 2.1 evaluated at θ̂,
I(θ) = −ℓ′′(θ) = −d²ℓ(θ)/dθ². This result can be rewritten as,

θ̂ ≈ N(θ0 , I(θ̂)⁻¹),

so that θ̂ is approximately normally distributed with the true value as its mean,
and with asymptotic variance 1/I(θ̂). The result generalizes to the multi-parameter
case.
Using this result, we obtain an approximate 100(1 − α)% Confidence Interval as,

[θ̂ − z1−α/2/√I(θ̂), θ̂ + z1−α/2/√I(θ̂)]

where z1−α/2 is the 1 − α/2 quantile of the N (0, 1) distribution. For example, for
α = .05, z1−α/2 = 1.96.
This result is used very frequently in applied statistics!!

Example 5.3.1. (Example 5.2.1 revisited) In a random sample of 100 people from
B.C., it was found that 23 out of 100 favour legalization of pot. Construct an
approximate 95% Confidence Interval for the proportion of B.C. voters who support
legalization of pot using the normal approximation for the MLE.

θ = proportion who support legalization of pot


X = Number out of 100 sampled who support legalization of pot

X ∼ Binomial(n = 100, θ)

The Log-likelihood, Score and Information function are respectively:

ℓ(θ) = x ln θ + (n − x) ln(1 − θ)
ℓ′(θ) = x/θ − (n − x)/(1 − θ)
ℓ′′(θ) = −x/θ² − (n − x)/(1 − θ)²
I(θ) = x/θ² + (n − x)/(1 − θ)²,

and after some simplification,

I(θ̂)⁻¹ = θ̂(1 − θ̂)/n.

This yields an approximate 95% Confidence Interval for θ, the proportion of B.C. voters
who support legalization of pot, as

[θ̂ − 1.96 √(θ̂(1 − θ̂)/n), θ̂ + 1.96 √(θ̂(1 − θ̂)/n)] = [.148, .312],

which is the same as what we obtained in the previous section using Method 2!
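The same interval can be obtained numerically, which is handy when I(θ̂) has no convenient closed form. A minimal sketch using optim (the starting value 0.2 and the search bounds are arbitrary choices):

# Numerical MLE and observed information via the Hessian of the negative log-likelihood
negll <- function(theta) -(23*log(theta) + 77*log(1 - theta))
fit <- optim(0.2, negll, method="Brent", lower=0.001, upper=0.999, hessian=TRUE)
se <- sqrt(1/fit$hessian[1, 1])      # square root of 1/I(theta-hat)
fit$par + c(-1, 1)*1.96*se           # approx [0.148, 0.312]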
Chapter 6

Normal Theory

Optional reading: Sections 13.1, 13.2


The normal distribution plays a large role in modelling and the statistical analysis
of continuous measurements. Many types of measurements have distributions which
are approximately normal - the Central Limit Theorem helps to explain this. In
the next sections, we will concentrate solely on models for normal measurements:
Maximum Likelihood Estimation, tests of hypotheses and confidence intervals for
normal measurements taken under varying conditions.
Before we do, recall:
(1) Let X1 , . . . , Xn be independent random variables with Xi ∼ N(µi , σi²).
Let a1 , . . . , an be constants. Then

Σ_{i=1}^n ai Xi ∼ N(Σ_{i=1}^n ai µi , Σ_{i=1}^n ai² σi²).

(2) Let Z1 , . . . , Zn be independent N(0, 1) random variables; then

Z1² ∼ χ²(1) and Σ_{i=1}^n Zi² ∼ χ²(n).


6.1 Basic Assumptions


We assume that measurements are independent and normally distributed with con-
stant variance, Yi ∼ N (µi , σ 2 ).
Under this assumption, we assume that the effect of changing conditions is to al-
ter µ. We can write the model in terms of independent error variables, ε1 , . . . , εn
where
εi = Yi − µi ∼ N(0, σ²).

Then
Yi = µi + εi where εi ∼ N (0, σ 2 ) independent.

A consequence is that,

P {−3σ ≤ εi ≤ 3σ} = .9973 ≈ 1.
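R Code check: pnorm(3) - pnorm(-3)   # = 0.9973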

The smaller σ is, the smaller we expect εi to be. σ measures the amount of random
variability (noise) that one would expect in repeated measurements taken under the
same conditions.

Assumptions Concerning µ1 , µ2 , . . . , µn We will express the n mean parameters


as functions of q parameters, where q < n.

(1) One-Sample Model: n measurements taken under the same conditions


e.g. blood pressure measurements, Yi for a group of patients all receiving
the same drug
Assume: µ1 = · · · = µn = α unknown
There is q = 1 unknown mean parameter, assuming σ 2 is known.

(2) Two-Sample Model: 2 groups of sample measurements


e.g. salaries for co-op students: 2nd year, 3rd year
Assume: µ2nd = α
µ3rd = α + β
There are q = 2 unknown mean parameters, assuming σ 2 is known.

(3) Straight Line Model: n measurements taken under varying conditions


e.g. salaries for recently graduated students, Yi , depend upon the number
of co-op work terms, xi they performed.
Here, the xi are known constants,
Yi vary,
Assume Yi ∼ N(µi , σ²) where µi = α + βxi and α, β are unknown pa-
rameters.
There are q = 2 unknown mean parameters, assuming σ 2 is known.

6.2 One Sample Model


Optional reading: Section 13.3

Example 6.2.1. Monthly salaries for Engineering and Math co-op students were
collected. A sample of 10 salaries is given below (in $):

5050 4184 1787 2167 2650


5499 3163 3016 3120 4333

Compute a 95% Confidence Interval for the mean monthly salary, assuming that
Yi = monthly salaries for person i = 1, ..., n are N (µ, σ 2 ) and independent.
To assess the assumption of normality, we could consider a histogram as in Figure
6.1. Unfortunately, histograms are not very informative for small samples. A normal
QQ (quantile-quantile) plot is given in Figure 6.2. If the data are approximately
normally distributed, then this graph should resemble a straight line. In this case,
it does(!) and we have no evidence against the normal assumption. We will learn
about normal quantile-quantile (QQ) plots in Section 6.4.5.
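The two plots can be produced in R as follows (a sketch, using the 10 salaries listed above):

# Histogram and normal QQ plot for the sample of 10 monthly salaries
y <- c(5050, 4184, 1787, 2167, 2650, 5499, 3163, 3016, 3120, 4333)
hist(y, main='Histogram of 10 salaries', xlab='dollars')
qqnorm(y, main='Normal QQ plot of 10 salaries', ylab='dollars')
qqline(y)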
Figure 6.1: Histogram of 10 Salaries (Frequency vs. dollars)

Figure 6.2: QQ plot of 10 Salaries (dollars vs. Theoretical Quantiles)



6.2.1 Confidence Intervals for µ


Confidence Interval for µ when σ² is known
In the case that σ² is known, we saw from the chapter on confidence intervals that
a 95% CI for µ is [Ȳ − 1.96 σ/√n, Ȳ + 1.96 σ/√n]. This is the set of µ0 such that
p-value(µ0) ≥ 0.05 in a Likelihood ratio test of H0 : µ = µ0 . The value 1.96 = z1−α/2 ,
where α = 0.05.

Confidence Interval for µ when σ² is unknown

One can show [EXERCISE!] that the Likelihood ratio statistic for testing H0 : µ = µ0
when σ² is unknown is

D = n ln(1 + T²/(n − 1))

where

T = (µ̂ − µ0)/(s/√n) = (ȳ − µ0)/(s/√n)

and

s² = Σ_{i=1}^n (yi − ȳ)²/(n − 1).

We can compute a 95% Confidence interval for µ by finding all µ0 such that
p-value(µ0) ≥ 0.05. Instead, note that D is a one-to-one increasing function of T²;
write the inverse transformation as g(D) = T².

p-value(µ0) = P {D ≥ dobs(µ0) | H0 : µ = µ0}
            = P {g(D) ≥ g(dobs(µ0)) | H0 : µ = µ0}
            = P {T² ≥ t²obs(µ0) | H0 : µ = µ0}.


We have that

p-value(µ0) ≥ 0.05 ⟺ tobs(µ0)² ≤ a²

where a² is chosen so that

0.95 = P {T² ≤ a² | H0 : µ = µ0}
= P {−a ≤ T ≤ a | H0 : µ = µ0}
= P {−a ≤ (Ȳ − µ0)/(s/√n) ≤ a | H0 : µ = µ0}.

The exact distribution of T is known. Rewriting T as

T = (Ȳ − µ0)/√(σ²/n) ÷ √(s²/σ²),

we examine the two pieces in the expression.

(i) Z = (Ȳ − µ0)/√(σ²/n) ∼ N (0, 1) when µ = µ0

(ii) V = (n − 1)s²/σ² ∼ χ²(n−1), independent of Z. (Proved in Stat450)
E[V ] = n − 1. (See Chapter 3)
E[s²] = E[ σ²V/(n − 1) ] = σ²E[V ]/(n − 1) = σ²

Thus, s² is an unbiased estimate of σ².

Putting the pieces together,

T = Z/√( V/(n − 1) ) = N (0, 1)/√( χ²(n−1)/(n − 1) ) ∼ t(n−1) ,

the Student’s t distribution with n−1 degrees of freedom. Percentiles of the Student’s
t distribution are tabulated in Table B3. If the degrees of freedom are very large,
> 60, then the t distribution approaches N(0,1). In general, for any Z ∼ N (0, 1)
independent of V ∼ χ²(ν) ,

Z/√(V/ν) ∼ t(ν) .

Returning to our example, we want to find a such that

P { −a ≤ (Ȳ − µ0)/(s/√n) ≤ a | H0 : µ = µ0 } = .95.

Here n = 10, so P { −a ≤ t(9) ≤ a } = 0.95.

Using the following R code, we obtain a = 2.262157.


R Code: qt(.975,9)

A 95% CI for µ is the set of all µ0 such that

−2.262 ≤ (Ȳ − µ0)/(s/√n) ≤ 2.262.

Isolating µ0 , we obtain

[ Ȳ − 2.262 s/√n , Ȳ + 2.262 s/√n ].

Here, ȳ = $3496.9, n = 10 and s = $1224.116. A 95% Confidence interval for µ is

[$2,621.22, $4,372.58].

A 95% confidence interval for the entire dataset, using n = 1151, ȳ = $3445.382, s = $920.52 and the 97.5'th percentile of t(1150) equal to 1.96, is:

[$3,392.15, $3,498.62],



a much narrower interval because n is larger.
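The 10-salary interval can be checked directly from the formula in R (a minimal sketch using the data of Example 6.2.1):

y <- c(5050, 4184, 1787, 2167, 2650, 5499, 3163, 3016, 3120, 4333)
n <- length(y)
mean(y) + c(-1,1)*qt(.975, n-1)*sd(y)/sqrt(n)   #gives [2621.22, 4372.58]
t.test(y)$conf.int                              #the same interval from t.test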

Summary: 95% Confidence Interval for µ.

(a) when σ² known: [ Ȳ − 1.96 σ/√n , Ȳ + 1.96 σ/√n ],
where 1.96 = 97.5'th percentile of N (0, 1);

(b) when σ² unknown: [ Ȳ − t.975(n−1) s/√n , Ȳ + t.975(n−1) s/√n ],
where t.975(n−1) = 97.5'th percentile of t(n−1) .

6.2.2 Hypothesis tests for µ


Optional reading: Section 13.3

Example 6.2.2. Returning to Example 6.2.1, let Yi = monthly salary for person
i = 1, ..., n, and Yi ∼ N (µ, σ 2 ) independent. Test the hypothesis that the mean
salary is $3, 000 per month.

Solution: The Likelihood ratio test statistic for testing H0 : µ = $3,000 is

D = n ln( 1 + T²/(n − 1) ), where T = (Ȳ − 3000)/(s/√n).

p-value = P {D ≥ dobs | H0 : µ = 3000}   since D is a monotone function of T²
= P {T² ≥ tobs² | H0 : µ = 3000}
= P {|T | ≥ |tobs | | H0 : µ = 3000}

But T ∼ t(n−1=9) when µ = 3000, therefore

p-value = P { |t(9)| ≥ |tobs | }
|tobs | = |ȳ − 3000|/(s/√n) = 1.2836
p-value = P { |t(9)| ≥ 1.2836 } = 0.2313,

using R code: 2*(1-pt(1.2836,9)). There is no evidence against H0 : µ = $3,000.
R code and output for the test:
> t.test(y,mu=3000)

One Sample t-test

data: y
t = 1.2836, df = 9, p-value = 0.2313
alternative hypothesis: true mean is not equal to 3000
95 percent confidence interval:
2621.22 4372.58
sample estimates:
mean of x
3496.9

For the general normal linear model, we will use a t statistic for inference about
mean parameters when σ 2 is unknown.

6.2.3 Inferences for σ 2


Example 6.2.3. Monthly salaries for Co-op students in Work Term 1 were collected
over one year. The file ‘SalaryWT1.csv’ contains the salaries. A histogram and
summary statistics are given below. GR.UG stands for Graduate and Undergraduate.

> summary(ywt1)
Term WTNum GR.UG SalMonth
2015 - Fall :111 W-1:352 GR: 50 Min. :1406
2015 - Spring:151 UG:302 1st Qu.:2717
2015 - Summer: 90 Median :3003
Mean :3149
3rd Qu.:3526
Max. :7259

Figure 6.3: Histogram of Work Term 1 Salaries

Hypothesis Tests for σ 2


Optional reading: Section 13.3

Let Yi , i = 1, ..., n be the monthly salary for the i’th co-op student. We assume
that Yi ∼ N (µ, σ 2 ), independent. For the Work Term 1 monthly salaries, we test the
hypothesis that H0 : σ 2 = σ02 = 7502 using a Likelihood Ratio test.

D = 2 [ℓ (µ̂, σ̂) − ℓ (µ̃, σ0²)]

where (µ̂, σ̂) is the joint MLE and µ̃ is the value that maximizes ℓ under H0.

Step 1: BASIC model

ℓ(µ, σ²) = −(n/2) ln σ² − (1/(2σ²)) Σi (yi − µ)²

In our example,

k = 2, µ̂ = ȳ = $3149 and σ̂² = Σ(yi − ȳ)²/n = ((n − 1)/n) s² = 778.9781².

Step 2: Hypothesized Model, H0 : σ² = 750²

We need to compute µ̃, assuming that σ² = 750². Here q = 1.
As an exercise, show that maximizing ℓ(µ, σ² = 750²) over µ leads to µ̃ = ȳ.

Step 3: Test the hypothesis:
Substituting into the expression for the Likelihood ratio statistic,

D = 2 [ ℓ(µ̂, σ̂) − ℓ(µ̃, σ² = σ0² = 750²) ]
= 2 [ −(n/2) ln σ̂² − (1/(2σ̂²)) Σ(yi − ȳ)² + (n/2) ln σ0² + (1/(2σ0²)) Σ(yi − ȳ)² ]
= n ln(σ0²/σ̂²) − n + n σ̂²/σ0²     (6.1)

σ̂² = 778.9781², σ0² = 750² and dobs = 1.038.

Under H0 : σ² = 750², D ≈ χ²(k−q) .

k = 2, q = 1 and D ≈ χ²(1) .

p-value = P { D ≥ dobs | H0 : σ² = 750² }
≃ P { χ²(1) ≥ 1.038 } = 0.3083.

We have no evidence against H0 : σ² = 750².

Confidence intervals for σ 2


Here we consider another way to construct Confidence Intervals, using Pivotal
Quantities.

Recall the definition of a confidence interval.


Definition: A 100p% confidence interval [A, B] for the unknown parameter θT , is
an interval such that in a large number of repetitions of the experiment, [A, B] covers
θT 100p times out of 100. [0 ≤ p ≤ 1] i.e.

P {A ≤ θT ≤ B} = p

We can use a pivotal quantity to construct a CI.

Definition: A Pivotal Quantity, Q, is a function of the data, and a monotone


function of the unknown parameter, such that the distribution of Q does not depend
upon θT .
Below are some examples of pivotal quantities.

(a) Z = (Ȳ − µ)/(σ/√n) ∼ N (0, 1) is a pivotal quantity for µ when σ² known

(b) T = (Ȳ − µ)/(s/√n) ∼ t(n−1) is a pivotal quantity for µ when σ² unknown

(c) (n − 1)s²/σ² ∼ χ²(n−1) is a pivotal quantity for σ²

Here we construct a 95% confidence interval for σ 2 using a pivotal quantity.

Using χ2 tables or R we find a, b such that

P { a ≤ (n − 1)s²/σ² ≤ b } = .95

The convention is to choose a and b so that two tails have equal area, see Figure 6.4
below.

Figure 6.4: Chi-square on 4 degrees of freedom density, with a = qchisq(.025,4), b = qchisq(.975,4) and central area 0.95 between a and b

Then by a series of monotone transformations, we isolate σ².

P { a ≤ (n − 1)s²/σ² ≤ b } = .95
P { (n − 1)s²/a ≥ σ² ≥ (n − 1)s²/b } = .95

The interval [ (n − 1)s²/b , (n − 1)s²/a ] satisfies the definition of a 95% Confidence Interval for σ².
For the Work Term 1 data,

s = 780.087, n = 352
a = 300.9897, b = 404.7974,

and a 95% CI for σ² is [726², 842²]. There is a great deal of variability in the data.

R code, Inferences for σ 2


LRS.sig<-function(y,sigma02){#LRS test for H_0: sigma^2 = sigma0^2
n<-length(y)
sigma2hat<-var(y)*(n-1)/n
LRS<-n*(log(sigma02/sigma2hat) + (sigma2hat/sigma02) - 1)
return(LRS)
}
n<-dim(ywt1)[1]
sd(ywt1$SalMonth) #sd for Work Term 1 Co-op salary data
sqrt(var(ywt1$SalMonth)*(n-1)/n) #MLE of sigma

D<- LRS.sig(ywt1$SalMonth, 750^2) #Likelihood Ratio test


D
1-pchisq(D,1)

#99% Confidence Interval for the Variance


sqrt((n-1)*var(ywt1$SalMonth)/qchisq(c(.995,.005),n-1))
qchisq(c(.005,.995),n-1)

#95% Confidence Interval for the Variance


sqrt((n-1)*var(ywt1$SalMonth)/qchisq(c(.975,.025),n-1))
qchisq(c(.025,.975),n-1)

6.3 The Two Sample Model


Optional reading: Section 13.4

Example 6.3.1. Monthly salaries for Co-op students in Work Term 1 and 2 were
collected over one year. The file ‘SalaryWT12.csv’ contains the salaries. A histogram
and summary statistics by work term number are given below. Recall that the box
on a boxplot encases the middle 50 percent of the data, i.e. from the 25’th to 75’th
percentile. The solid line in the middle of the box indicates the median and outliers

are plotted as small circles in both tails. The boxplots of the salaries for the two
work terms suggest that these salaries have distributions which are similar.

Figure 6.5: Boxplots of Work Term 1 and 2 Salaries

> by(ywt12$SalMonth,ywt12$WTNum,summary) #summary statistics


ywt12$WTNum: W-1
Min. 1st Qu. Median Mean 3rd Qu. Max.
1406 2717 3003 3149 3526 7259
---------------------------------------------------------------------------
ywt12$WTNum: W-2
Min. 1st Qu. Median Mean 3rd Qu. Max.
866.6 2899.0 3200.0 3354.0 3691.0 7279.0

> by(ywt12$SalMonth,ywt12$WTNum,sd) #standard deviations


ywt12$WTNum: W-1
[1] 780.087
---------------------------------------------------------------------------

ywt12$WTNum: W-2
[1] 847.1914

6.3.1 Inferences for the differences between two means


Is this data consistent with the hypothesis that salaries for work term one and two
are the same?

Two sample model, Variances assumed EQUAL and KNOWN


It is very unusual in practice to assume that the variances of the two groups are
known. This case provides a ‘baby’ step en route to the case where variances are not
assumed to be known.
Let
Y1i = monthly salary for work term 1, i = 1, ..., n1
Y2j = monthly salary for work term 2, j = 1, ..., n2

We assume that,
Y1i ∼ N (µ1 , σ 2 ), independent,
Y2j ∼ N (µ2 , σ 2 ), independent,
and that σ 2 = 8182 .

We test the hypothesis, H0 : µ1 = µ2 , or equivalently, H0 : µ1 − µ2 = 0.

Step 1: BASIC model

L(µ1 , µ2 ) = exp[ −(1/(2σ²)) Σi (y1i − µ1 )² ] × exp[ −(1/(2σ²)) Σj (y2j − µ2 )² ]

As an exercise, show that µ̂1 = ȳ1 = Σi y1i /n1 = $3,149 and µ̂2 = ȳ2 = Σj y2j /n2 = $3,354.

Step 2: Hypothesized Model



Assuming H0 : µ1 = µ2 = µ, unknown, we need to estimate µ.

L(µ1 = µ, µ2 = µ) = exp[ −(1/(2σ²)) Σi (y1i − µ)² − (1/(2σ²)) Σj (y2j − µ)² ]

As an exercise, show that the MLE of µ is

µ̃ = ( Σi y1i + Σj y2j )/(n1 + n2 ) = ȳ.

Step 3: Test the hypothesis

D = 2 [ℓ(µ̂1 , µ̂2 ) − ℓ(µ1 = µ̃, µ2 = µ̃)]
= 2 [ −(1/(2σ²)) Σi (y1i − ȳ1 )² − (1/(2σ²)) Σj (y2j − ȳ2 )²
+ (1/(2σ²)) Σi (y1i − ȳ)² + (1/(2σ²)) Σj (y2j − ȳ)² ]

This can be simplified using

Σi (y1i − ȳ)² = Σi [ (y1i − ȳ1 ) + (ȳ1 − ȳ) ]²
= Σi (y1i − ȳ1 )² + n1 (ȳ1 − ȳ)²,

since Σi (y1i − ȳ1 )(ȳ1 − ȳ) = 0.

Simplifying for the work term 2 data in the same way yields

D = (1/σ²) [ n1 (ȳ1 − ȳ)² + n2 (ȳ2 − ȳ)² ].

Noting that ȳ is a function of both ȳ1 and ȳ2 ,

ȳ = (n1 ȳ1 + n2 ȳ2 )/(n1 + n2 ),
and so ȳ1 − ȳ = (n2 ȳ1 − n2 ȳ2 )/(n1 + n2 ).
Similarly, ȳ2 − ȳ = (n1 ȳ2 − n1 ȳ1 )/(n1 + n2 ).

Substituting into the expression for D yields

D = (1/σ²) [ (n1 n2²/(n1 + n2 )²) (ȳ1 − ȳ2 )² + (n2 n1²/(n1 + n2 )²) (ȳ1 − ȳ2 )² ]
= (1/σ²) ( n1 n2 /(n1 + n2 ) ) (ȳ1 − ȳ2 )²
= (1/σ²) (ȳ1 − ȳ2 )² / (1/n1 + 1/n2 ).
n1 n2

The p − value for the test is,

p − value = P (D ≥ dobs | H0 : µ1 = µ2 ) .

We know that under H0 : µ1 = µ2 , D ≈ χ2(k−q=1) .

We can obtain the exact distribution of D under H0 here. Since V ar(Ȳ1 ) = σ 2 /n1 ,
V ar(Ȳ2 ) = σ 2 /n2 and Ȳ1 and Ȳ2 are independent, V ar(Ȳ1 − Ȳ2 ) = σ 2 (1/n1 + 1/n2 ).
Therefore,

(Ȳ1 − Ȳ2 )² / [ σ² (1/n1 + 1/n2 ) ] ∼ χ²(1) exactly, and

(Ȳ1 − Ȳ2 ) / [ σ √(1/n1 + 1/n2 ) ] = Z ∼ N (0, 1) exactly.

The p-value is

p-value = P { χ²(1) ≥ (ȳ1 − ȳ2 )² / [ σ² (1/n1 + 1/n2 ) ] }
= P { |Z| ≥ |ȳ1 − ȳ2 | / [ σ √(1/n1 + 1/n2 ) ] },

where Z ∼ N (0, 1).

Returning to our example:

ȳ1 = 3148.532, ȳ2 = 3354.483, σ² = 818²,
n1 = 352, n2 = 308,

and the

p-value = P { |Z| ≥ √10.4129 } = .00125.

There is strong evidence against the hypothesis that salaries for work terms 1 and
2 are the same. The increase in mean monthly salary is significantly different from
zero.
A 95% confidence interval for the difference in the means is:

ȳ1 − ȳ2 ± z.975 σ √(1/n1 + 1/n2 )
= −205.95 ± 125.09
= (−331.04, −80.86).

We report, “The estimated mean monthly increase in salary in work term 2 over
work term 1 is $205.95 (95% CI $80.86 - $331.04).”

Two sample model, Variances assumed EQUAL and UNKNOWN


In this section, we assume that the variances for the two groups are equal but un-
known.
Let
Y1i = monthly salary for work term 1, i = 1, ..., n1
Y2j = monthly salary for work term 2, j = 1, ..., n2

We assume that,
Y1i ∼ N (µ1 , σ 2 ), independent,
Y2j ∼ N (µ2 , σ 2 ), independent,
and that σ 2 is unknown.

We test the hypothesis, H0 : µ1 = µ2 , or equivalently, H0 : µ1 − µ2 = 0.

Step 1: BASIC model

L(µ1 , µ2 , σ²) = (1/σ²)^(n1/2) exp[ −(1/(2σ²)) Σi (y1i − µ1 )² ] × (1/σ²)^(n2/2) exp[ −(1/(2σ²)) Σj (y2j − µ2 )² ]

As an exercise, show that µ̂1 = ȳ1 = Σi y1i /n1 = $3,149, µ̂2 = ȳ2 = Σj y2j /n2 = $3,354, and

σ̂² = [ Σi (y1i − ȳ1 )² + Σj (y2j − ȳ2 )² ]/(n1 + n2 ).

Step 2: Hypothesized Model

Assuming H0 : µ1 = µ2 = µ, unknown, we need to estimate µ and the common σ².

L(µ1 = µ, µ2 = µ, σ²) = (1/σ²)^((n1+n2)/2) exp[ −(1/(2σ²)) Σi (y1i − µ)² − (1/(2σ²)) Σj (y2j − µ)² ]

As an exercise, show that the MLE of µ is

µ̃ = ( Σi y1i + Σj y2j )/(n1 + n2 ) = ȳ,

and that

σ̃² = [ Σi (y1i − ȳ)² + Σj (y2j − ȳ)² ]/(n1 + n2 ).

Step 3: Test the hypothesis

D = 2 [ ℓ(µ̂1 , µ̂2 , σ̂²) − ℓ(µ1 = µ̃, µ2 = µ̃, σ̃²) ]

is a monotone function of T², where

T = (Ȳ1 − Ȳ2 )/( spooled √(1/n1 + 1/n2 ) ) ∼ t(n1 +n2 −2) ,

and

s²pooled = [ (n1 − 1)s1² + (n2 − 1)s2² ]/(n1 + n2 − 2).

[See the Appendix for a proof of this result.]


Returning to our example, tobs = −3.2504, and the p-value is,

P (|t(n1 +n2 −2) | ≥ |tobs |) = 0.001211.

A 95% confidence interval for the difference of the means is:

ȳ1 − ȳ2 ± t.975(n1 +n2 −2) spooled √(1/n1 + 1/n2 )
= (−330.37, −81.54).

The R output for this problem is below and code is in the following section:

> t.test(SalMonth~WTNum, data=ywt12,var.equal=TRUE)


Two Sample t-test

data: SalMonth by WTNum


t = -3.2504, df = 658, p-value = 0.001211
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-330.36720 -81.53596
sample estimates:
mean in group W-1 mean in group W-2
3148.532 3354.483

Two sample model, Variances assumed UNEQUAL and UNKNOWN


For the two sample model with unequal and unknown variances, we will use the
Satterthwaite approximation to the degrees of freedom of the t-test that you learned
in your first course in Statistics.

T = (Ȳ1 − Ȳ2 )/√( s1²/n1 + s2²/n2 ) ∼ t(ν) ,
where ν is the Satterthwaite approximation to the degrees of freedom.
The R output using this approximation appears below and code is in the following
subsection.
> t.test(SalMonth~WTNum, data=ywt12,var.equal=FALSE)

Welch Two Sample t-test

data: SalMonth by WTNum


t = -3.2326, df = 628.79, p-value = 0.001291
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-331.06376 -80.83941
sample estimates:
mean in group W-1 mean in group W-2
3148.532 3354.483

The p-value for the test of the equality of the two means is 0.001291, which is very
similar to the results of the test which assumes that the variances are equal.
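The approximate degrees of freedom quoted in the output (628.79) can be reproduced from the group summaries reported earlier; the sketch below uses the standard Welch–Satterthwaite formula (stated here as an aside, it is not derived in these notes):

s1 <- 780.087;  n1 <- 352   #Work Term 1 sd and sample size
s2 <- 847.1914; n2 <- 308   #Work Term 2 sd and sample size
v1 <- s1^2/n1; v2 <- s2^2/n2
(v1 + v2)^2 / (v1^2/(n1-1) + v2^2/(n2-1))   #approximately 628.8, the df in the Welch output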

R code for Inferences for Two sample model


###Inferences about the differences in the means, $\mu_1 - \mu_2$###

###Method 1: Assume variances equal and known, Test and Confidence Interval
for Difference###

ywt12.means<-by(ywt12$SalMonth,ywt12$WTNum,mean)
ywt12.num<-by(ywt12$SalMonth,ywt12$WTNum,length)
ywt12.means
ywt12.num
zobs<-(ywt12.means[1]-ywt12.means[2])/818/sqrt(sum(1/ywt12.num))
zobs
pvalue<-2*pnorm(-abs(zobs))
pvalue

zobs^2
1-pchisq(zobs^2,1)

#95% confidence interval for the difference


(ywt12.means[1]-ywt12.means[2]) + qnorm(.975)*c(-1,1)*818*sqrt(sum(1/ywt12.num))
#difference in means
(ywt12.means[1]-ywt12.means[2])
#margin of error
qnorm(.975)*c(-1,1)*818*sqrt(sum(1/ywt12.num))

###Method 2: Assume variances equal and unknown###


#this uses pooled estimate of variance for test

t.test(SalMonth~WTNum, data=ywt12,var.equal=TRUE)

###Method 3: Assume variance are not equal###

t.test(SalMonth~WTNum, data=ywt12,var.equal=FALSE)

6.3.2 Testing Equality of Variances


On the last assignment, you will test the hypothesis of equal variances in the two
sample model using a Likelihood Ratio test.

6.3.3 Appendix: Derive the 2-sample σ unknown but equal t-test

Derive the two-sample t-test.
Y1i ∼ N (µ1 , σ²), i = 1, . . . , n1
Y2j ∼ N (µ2 , σ²), j = 1, . . . , n2
The likelihood ratio statistic for testing the equality of the two means is:

D = 2 [ ℓ(µ̂1 , µ̂2 , σ̂²) − ℓ(µ1 = µ̃, µ2 = µ̃, σ̃²) ].
  

The Log-likelihood under the Basic model is:

ℓ(µ1 , µ2 , σ²) = −(n1/2) ln σ² − (1/(2σ²)) Σi (y1i − µ1 )² − (n2/2) ln σ² − (1/(2σ²)) Σj (y2j − µ2 )².

The MLE's for µ1 , µ2 and σ² under the Basic model are solutions of the following equations:

∂ℓ/∂µ1 = (1/σ²) Σi (y1i − µ1 ) = 0     (6.2)
∂ℓ/∂µ2 = (1/σ²) Σj (y2j − µ2 ) = 0     (6.3)
∂ℓ/∂σ = −(n1 + n2 )/σ + (1/σ³) [ Σi (y1i − µ1 )² + Σj (y2j − µ2 )² ] = 0     (6.4)

The MLE's under the Basic model are:

µ̂1 = Σi y1i /n1
µ̂2 = Σj y2j /n2
σ̂² = [ Σi (y1i − µ̂1 )² + Σj (y2j − µ̂2 )² ]/(n1 + n2 )

The Log-likelihood under the Reduced (hypothesized) model is:

ℓH (µ1 = µ, µ2 = µ, σ²) = −((n1 + n2 )/2) ln σ² − (1/(2σ²)) [ Σi (y1i − µ)² + Σj (y2j − µ)² ]

Taking derivatives and solving, the MLE's under the Reduced model are:

µ̃ = ( Σi y1i + Σj y2j )/(n1 + n2 ), and
σ̃² = [ Σi (y1i − µ̃)² + Σj (y2j − µ̃)² ]/(n1 + n2 ).

Substituting the above into the Likelihood Ratio Statistic,

D = −(n1 + n2 ) ln σ̂² + (n1 + n2 ) ln σ̃²
= (n1 + n2 ) ln( σ̃²/σ̂² )

(n1 + n2 ) σ̃² = Σi ( y1i − (n1 ȳ1 + n2 ȳ2 )/(n1 + n2 ) )² + Σj ( y2j − (n1 ȳ1 + n2 ȳ2 )/(n1 + n2 ) )²

The first term can be written as follows:

Σi ( y1i − (n1 ȳ1 + n2 ȳ2 )/(n1 + n2 ) )² = Σi [ (y1i − ȳ1 ) + ( ȳ1 − (n1 ȳ1 + n2 ȳ2 )/(n1 + n2 ) ) ]²
= Σi (y1i − ȳ1 )² + n1 ( ȳ1 − (n1 ȳ1 + n2 ȳ2 )/(n1 + n2 ) )²,

and similarly for the second term,

Σj ( y2j − (n1 ȳ1 + n2 ȳ2 )/(n1 + n2 ) )² = Σj (y2j − ȳ2 )² + n2 ( ȳ2 − (n1 ȳ1 + n2 ȳ2 )/(n1 + n2 ) )².

The contents of the second terms can be simplified as

ȳ1 − (n1 ȳ1 + n2 ȳ2 )/(n1 + n2 ) = n2 (ȳ1 − ȳ2 )/(n1 + n2 ),
ȳ2 − (n1 ȳ1 + n2 ȳ2 )/(n1 + n2 ) = n1 (ȳ2 − ȳ1 )/(n1 + n2 ).
n1 ȳ1 + n2 ȳ2 n1 (ȳ2 − ȳ1 )
ȳ2 − = .
n1 + n2 n1 + n2

Substituting into part of the expression for D,

σ̃²/σ̂² = 1 + [ n1 ( n2 (ȳ1 − ȳ2 )/(n1 + n2 ) )² + n2 ( n1 (ȳ2 − ȳ1 )/(n1 + n2 ) )² ] / [ Σi (y1i − ȳ1 )² + Σj (y2j − ȳ2 )² ]
= 1 + ( n1 n2 /(n1 + n2 ) ) (ȳ1 − ȳ2 )² / [ Σi (y1i − ȳ1 )² + Σj (y2j − ȳ2 )² ]
= 1 + (1/(n1 + n2 − 2)) (ȳ1 − ȳ2 )² / [ ( ( Σi (y1i − ȳ1 )² + Σj (y2j − ȳ2 )² )/(n1 + n2 − 2) ) (1/n1 + 1/n2 ) ]
= 1 + T²/(n1 + n2 − 2),  where T = (ȳ1 − ȳ2 )/( sp √(1/n1 + 1/n2 ) ).

The Likelihood ratio statistic is a monotone function of T²,

D = (n1 + n2 ) ln( 1 + T²/(n1 + n2 − 2) ).
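The identity above can be verified numerically; here is a minimal sketch with simulated data (illustrative only) that computes D directly from σ̃² and σ̂², and again from the pooled-variance t statistic:

set.seed(1)
y1 <- rnorm(8, mean=5, sd=2)   #simulated group 1
y2 <- rnorm(6, mean=6, sd=2)   #simulated group 2
n1 <- length(y1); n2 <- length(y2)
sig2hat <- (sum((y1-mean(y1))^2) + sum((y2-mean(y2))^2))/(n1+n2)   #MLE of variance, Basic model
ybar <- (sum(y1)+sum(y2))/(n1+n2)
sig2til <- (sum((y1-ybar)^2) + sum((y2-ybar)^2))/(n1+n2)          #MLE of variance, Reduced model
D.direct <- (n1+n2)*log(sig2til/sig2hat)
sp2 <- ((n1-1)*var(y1) + (n2-1)*var(y2))/(n1+n2-2)                #pooled variance
Tstat <- (mean(y1)-mean(y2))/sqrt(sp2*(1/n1+1/n2))
D.fromT <- (n1+n2)*log(1 + Tstat^2/(n1+n2-2))
c(D.direct, D.fromT)   #the two values agree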

6.4 The Straight Line Model


Optional reading: Section 13.5, 13.6

In this section we analyze data that is in the form of ordered pairs

(x1 , y1 ) , (x2 , y2 ) , . . . , (xn , yn )

yi - called the response or dependent variable


xi - called the explanatory or predictor variable

We wish to determine the form and strength of the relationship between the response
variable (y) and the explanatory variable (x). Typically there are two possible goals
for the analysis:
(1) Explanation: What is the relationship between y and x.
(2) Prediction: Given x, can we predict y accurately.

Example 6.4.1. We are interested in the relationship between monthly salaries for
co-op students as a function of the work term number.
xi = work term number for student i,
yi = monthly salary for student i.
The first step is to graph the data and determine an appropriate model to fit to the
data. The data are graphed below:

Figure 6.6: Boxplots of monthly salaries by work term number

From the graph, we note that,

(1) for a given number of work terms, the monthly salaries are subject to a large
amount of variability:
(2) A linear relationship between monthly salary (Y ) and work term number (X)
seems appropriate.

In this course, we will primarily consider models where the y ′ s are linearly related
to the x′ s.

We assume that Yi ∼ N (µi , σ 2 ) independent, where µi = E (Yi ) = α + βxi . The


mean monthly salary is linearly related to the work term number.
Another way to write this model is as:
Yi = (α + βxi ) + ϵi ,
where ϵi ∼ N (0, σ 2 ), independent.

6.4.1 Linear model parameter estimation


To estimate α, β and σ², we use Maximum likelihood estimation.

L(α, β, σ²) = Πi (1/σ) exp[ −(1/(2σ²)) (yi − µi )² ]
= σ^(−n) exp[ −(1/(2σ²)) Σi (yi − µi )² ]

ℓ(α, β, σ²) = −n ln σ − (1/(2σ²)) Σi (yi − α − βxi )²

α̂, β̂ will maximize ℓ, and therefore α̂, β̂ will minimize Σi (yi − α − βxi )².

α̂, β̂ are often called LEAST SQUARES ESTIMATES.

We need to solve the system of equations,

∂ℓ/∂α |α̂,β̂ = (1/σ²) Σi (yi − α̂ − β̂xi ) = 0     (6.5)
∂ℓ/∂β |α̂,β̂ = (1/σ²) Σi (yi − α̂ − β̂xi ) xi = 0     (6.6)

Letting ϵ̂i = yi − α̂ − β̂xi , the equations (6.5) and (6.6) can be written as:

(6.5): (1/σ²) Σi ϵ̂i = 0
(6.6): (1/σ²) Σi ϵ̂i xi = 0

Solving (6.5) leads to

nȳ − nα̂ − β̂ nx̄ = 0  ⟹  α̂ = ȳ − β̂ x̄.



Substituting for α̂ and solving (6.6) yields

Σi (yi − ȳ + β̂ x̄ − β̂ xi ) xi = 0
⟹ Σ (yi − ȳ) xi − β̂ Σ (xi − x̄) xi = 0

β̂ = Σ (yi − ȳ) xi / Σ (xi − x̄) xi = SXY/SXX.

Using algebraic manipulations, we derive some alternate formulae for SXY and SXX
which will be useful later.
(i)
n
X n
X n
X
2
SXX = (xi − x̄) xi = xi − x̄ xi
i=1 i=1 i=1
Xn
SXX = x2i − nx̄2
i=1

(ii)
n
X n
X
SXX = (xi − x̄) xi − (xi − x̄) x̄ since 2nd term=0
i=1 i=1
n
X
SXX = (xi − x̄)2
i=1

(iii)
n
X n
X n
X
SXY = (yi − ȳ) xi = xi yi − ȳ xi
i=1 i=1 i=1
Xn
SXY = xi yi − nx̄ȳ
i=1

(iv) SXY = Σi (yi − ȳ) xi − Σi (yi − ȳ) x̄   (since the 2nd term = 0)
= Σi (yi − ȳ)(xi − x̄)

(v) SXY = Σi (xi − x̄) yi   (since Σi (xi − x̄) ȳ = 0)

To estimate σ², we compute the derivative of the log likelihood function with respect to σ.

∂ℓ/∂σ = −n/σ + (1/σ³) Σi (yi − α − βxi )²

∂ℓ/∂σ |α̂,β̂,σ̂ = 0  ⟹  nσ̂² = Σi (yi − α̂ − β̂xi )²

σ̂² = (1/n) Σi (yi − α̂ − β̂xi )² = (1/n) Σi ϵ̂i²
n i=1

We can show that E(σ̂²) ≠ σ² and it is therefore a biased estimate.

To estimate σ², we will use:

s² = (1/(n − 2)) Σi (yi − α̂ − β̂xi )²
= (1/(n − 2)) Σi ϵ̂i² ,  where ϵ̂i = yi − α̂ − β̂xi .

Note that ϵ̂i is an estimate of ϵi = Yi − (α + βxi ), where we assumed that ϵi ∼ N (0, σ²) and independent. We call ϵ̂i a Residual.
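As a check of these formulas, the least squares estimates and s can be computed directly in R and compared with lm() (a minimal sketch; the x and y vectors below are small made-up data, not the co-op salary file):

x <- c(1, 2, 3, 4, 5, 6, 7)                #hypothetical explanatory values
y <- c(2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2)  #hypothetical responses
n <- length(y)
SXX <- sum((x - mean(x))*x)
SXY <- sum((y - mean(y))*x)
beta.hat <- SXY/SXX
alpha.hat <- mean(y) - beta.hat*mean(x)
ehat <- y - alpha.hat - beta.hat*x         #residuals
s <- sqrt(sum(ehat^2)/(n - 2))
c(alpha.hat, beta.hat, s)
coef(lm(y ~ x))                            #matches alpha.hat and beta.hat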

Returning to Example 6.4.1, the fitted model from R is given below.


> Sal.lm<-lm(SalMonth~WTNumN, data=salarynz)
> summary(Sal.lm)

Call:
lm(formula = SalMonth ~ WTNumN, data = salarynz)

Residuals:
Min 1Q Median 3Q Max
-2960.8 -522.6 -157.4 406.4 4136.2

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2887.40 56.26 51.33 <2e-16 ***
WTNumN 234.99 21.06 11.16 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 874.7 on 1149 degrees of freedom


Multiple R-squared: 0.09779, Adjusted R-squared: 0.097
F-statistic: 124.5 on 1 and 1149 DF, p-value: < 2.2e-16

Figure 6.7: R Output: Linear regression for salary data

• The estimated relationship between monthly salary and work term number is:
Salary = 2887.40 + 234.99 × Work Term number.

• The estimate of σ is s =“Residual standard error” = 874.7 on 1149 degrees of


freedom.
• We estimate that monthly salary increases by $234.99 for each additional work
term.
• The intercept estimate is the estimated monthly salary for zero work terms,
but this is not meaningful here. Instead, we could quote the estimated monthly
salary for work term 1, $2887.40 + $234.99.

6.4.2 Linear model Distribution theory


Distribution of β̂
Recall: If Y1 , . . . , Yn are independent with Yi ∼ N (µi , σ²), then Σ ai Yi ∼ N ( Σ ai µi , σ² Σ ai² ), for constants a1 , ..., an .

We want to express β̂ = Σi ai Yi as a linear combination of the Yi ’s.

β̂ = SXY/SXX = Σ (xi − x̄) yi /SXX = Σ [ (xi − x̄)/SXX ] yi

Let ai = (xi − x̄)/SXX; then β̂ ∼ N ( Σ ai µi , σ² Σ ai² ).

E(β̂) = Σ ai µi = Σ [ (xi − x̄)/SXX ] (α + βxi ) = β
VAR(β̂) = σ² Σ ai² = σ² Σ (xi − x̄)²/SXX² = σ²/SXX

β̂ ∼ N ( β, σ²/SXX )

We can use the quantity

(β̂ − β)/√( s²/SXX ) ∼ t(n−2)     (6.7)

for tests and confidence intervals for β. (It is a pivotal quantity for β as it is a function of the data and a function of the unknown parameter and its distribution is completely known.)

The quantity in the denominator of (6.7), s/√SXX, is called the Standard Error of β̂, s.e.(β̂), and it is listed on the output of Figure 6.7 under the column ‘Std. Error’.

It is the square root of the estimated variance of β̂. The standard error for β̂ is
21.06.
Note that in (6.7), the degrees of freedom for the t distribution are the same as the
denominator in the formula for s2 . This result holds generally. The estimate s is
given in the output Figure 6.7 labelled as ‘Residual standard error:’ and the value
here is 874.7 on 1149 degrees of freedom.
We next test the hypothesis that β = 0, and construct a 99% confidence interval for β.

p-value = P { |t(n−2)| ≥ |β̂ − 0|/( s √(1/SXX) ) | H0 : β = 0 true }

The R output in Figure 6.7 provides the observed value of the test statistic, (6.7), under the column ‘t value’. The observed value of our t-statistic for β is 11.16. The p-value is given in the column ‘Pr(> |t|)’ and its value is listed as < 2e−16. For very small or very large numbers, R uses exponential notation: 2e−16 means 2 × 10⁻¹⁶.

p-value = P { |t(1149)| ≥ 11.16 } < 2 × 10⁻¹⁶.


We have very strong evidence against H0 : β = 0.

To compute a 99% CI for β, we use the general formulation,

estimate ± t.995(ν) × s.e.(estimate).

Here we obtain the t-quantile from R as: qt(.995, 1149), so our 99% confidence
interval is:
234.99 ± 2.58 × 21.06 = [180.66, 289.32].
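In R, the same interval is available directly from the fitted object (a sketch, assuming the Sal.lm object fitted above is in the workspace):

qt(.995, 1149)                  #the t quantile used above, 2.58
confint(Sal.lm, level=0.99)     #99% CIs for the intercept and the WTNumN slope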

Distribution of α̂
We want to express α̂ = Σi ai Yi , as a linear combination of the Yi ’s.

α̂ = ȳ − β̂ x̄ = (1/n) Σ yi − x̄ Σ ai yi ,  where ai = (xi − x̄)/SXX
= Σ [ 1/n − x̄ ai ] yi
= Σ bi yi ,  with bi = 1/n − x̄ ai .

Therefore, α̂ ∼ N ( Σ bi µi , σ² Σ bi² ).

E(α̂) = Σ bi µi = Σ [ 1/n − x̄ (xi − x̄)/SXX ] (α + βxi )
= α + β x̄ − (αx̄/SXX) Σ (xi − x̄) − (β x̄/SXX) Σ (xi − x̄) xi
= α,  since SXX = Σ (xi − x̄) xi and Σ (xi − x̄) = 0.

VAR(α̂) = σ² Σ bi²
Σ bi² = Σ [ 1/n − x̄ (xi − x̄)/SXX ]²
= Σ [ 1/n² − 2 x̄ (xi − x̄)/(n SXX) + x̄² (xi − x̄)²/SXX² ]
= 1/n + (x̄²/SXX²) Σ (xi − x̄)²
= 1/n + x̄²/SXX

α̂ ∼ N ( α, σ² (1/n + x̄²/SXX) )

We can use the quantity

(α̂ − α)/( s √(1/n + x̄²/SXX) ) ∼ t(n−2)     (6.8)

for tests and confidence intervals for α. (It is a pivotal quantity for α as it is a function of the data and a function of the unknown parameter and its distribution is completely known.)

The quantity in the denominator of (6.8), s √(1/n + x̄²/SXX), is called the Standard Error of α̂, s.e.(α̂), and it is listed on the output of Figure 6.7 under the column ‘Std. Error’. It is the square root of the estimated variance of α̂. The standard error for α̂ is 56.26.
Note that in (6.8), the degrees of freedom for the t distribution are the same as the denominator in the formula for s².
We next test the hypothesis that α = 0, and construct a 99% confidence interval for α.

p-value = P { |t(n−2)| ≥ |α̂ − 0|/( s √(1/n + x̄²/SXX) ) | H0 : α = 0 }

The R output in Figure 6.7 provides the observed value of the test statistic, (6.8), under the column ‘t value’. The observed value of our t-statistic for α is 51.33. The p-value is given in the column ‘Pr(> |t|)’ and its value is listed as < 2e−16. For very small or very large numbers, R uses exponential notation: 2e−16 means 2 × 10⁻¹⁶.

p-value = P { |t(1149)| ≥ 51.33 } < 2 × 10⁻¹⁶.


We have very strong evidence against H0 : α = 0.

To compute a 99% CI for α, we use the general formulation,

estimate ± t.995(ν) × s.e.(estimate).

Here we obtain the t-quantile from R as: qt(.995, 1149), so our 99% confidence
interval is:
2887.40 ± 2.58 × 56.26 = [$2742.25, $3032.55].

Distribution of µ̂0 = α̂ + β̂x0


Given a particular value for x, say x0 , E(Y ) = α + βx0 is estimated with α̂ + β̂x0 = µ̂0 . We can obtain its distribution as follows.

µ̂0 = [ ȳ − β̂ x̄ ] + β̂ x0
= ȳ + β̂ (x0 − x̄)
= (1/n) Σ yi + (x0 − x̄) Σ ai yi ,  where ai = (xi − x̄)/SXX
= Σ [ 1/n + (x0 − x̄) ai ] yi ,  with ci = 1/n + (x0 − x̄) ai

⟹ µ̂0 ∼ N ( Σ ci µi , σ² Σ ci² )

We know that

E(α̂) = α, E(β̂) = β  ⟹  E(α̂ + β̂x0 ) = α + βx0

Σ ci² = Σ [ 1/n + (x0 − x̄) ai ]² ,  ai = (xi − x̄)/SXX
= Σ [ 1/n² + 2 (x0 − x̄) ai /n + (x0 − x̄)² ai² ]
= 1/n + (x0 − x̄)²/SXX,  since Σ ai = 0

α̂ + β̂x0 ∼ N ( α + βx0 , [ 1/n + (x0 − x̄)²/SXX ] σ² )

We can construct confidence intervals and tests for µ0 = α + βx0 using



( α̂ + β̂x0 − (α + βx0 ) )/( s √(1/n + (x0 − x̄)²/SXX) ) ∼ t(n−2)

A 99% CI for µ0 = α + βx0 is

α̂ + β̂x0 ± t.995(n−2) s √( 1/n + (x0 − x̄)²/SXX ).

Note that the Confidence interval is narrowest when x0 = x̄ and it increases as


|x0 − x̄| increases. Therefore, we can estimate α + βx0 most precisely when x0 is
close to x̄, the mean of the x values used to fit the line.
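In R, the estimate α̂ + β̂x0 and its confidence interval can be obtained with predict() (a sketch, assuming the Sal.lm fit of Figure 6.7; x0 = 3 work terms is used here only for illustration):

predict(Sal.lm, newdata=data.frame(WTNumN=3), interval="confidence", level=0.99)
#fit = estimated mean salary at 3 work terms; lwr and upr give the 99% CI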

6.4.3 R2 and ANOVA


R2
R² measures the proportion of the variation in Y explained by the model. It is also called the Coefficient of Determination.
We obtain R² by decomposing the variation in Y into two parts, one part explained by the linear regression (SSR), and one part unexplained (error) by the regression (SSE).
If we did not have any X ′ s, then we would estimate the mean of Y using ȳ and the variance of Y using sY² = Σi (yi − ȳ)²/(n − 1). The total variation in Y is the numerator of sY² and is called the Total Sum of Squares, SST, and is decomposed as follows.

SST = Σ (yi − ȳ)²   (adding and subtracting ŷi within the brackets)
= Σ [ (yi − ŷi ) + (ŷi − ȳ) ]² ,  where ŷi = α̂ + β̂xi
= Σ (yi − ŷi )² + Σ (ŷi − ȳ)² + 2 Σ (yi − ŷi )(ŷi − ȳ)
= Σ (yi − ŷi )² + Σ (ŷi − ȳ)²
= Σ ϵ̂i² + Σ (ŷi − ȳ)² = SSE + SSR

The cross-product term is zero because of equations 6.5 and 6.6. SSE is the sum
of squares error and SSR is the sum of squares regression.
Now we define R²:

R² = SSR/SST = 1 − SSE/SST.
Note that 0 ≤ R2 ≤ 1. If the regression line fits the data perfectly, then yi = ŷi and
ϵ̂i = 0 for all i = 1, ..., n. In that case, SSE = 0 and R2 = 1.
If ŷi = ȳ for all i = 1, ..., n, then SSR = 0 and R2 = 0.
Returning to the R output for the co-op salary data, R2 is called ‘Multiple R-squared:’
at the bottom of Figure 6.7, and is equal to 0.09779. Only about 10% of the variation
in salaries is explained by the work term number.
R2 has a deficiency in that it can be artificially inflated by adding more explanatory
variables into the model. Adjusted R2 incorporates a penalty for the number of
explanatory variables in the model, and is the preferred measure of fit for linear
regressions. Its formula is:

Adjusted R² = 1 − s²/sY² .
It is listed in Figure 6.7 as ‘Adjusted R-squared’.
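The decomposition and both R² measures can be verified from the fitted model (a sketch, assuming Sal.lm and the salarynz data frame used to fit it are in the workspace):

y <- salarynz$SalMonth
SSE <- sum(resid(Sal.lm)^2)
SSR <- sum((fitted(Sal.lm) - mean(y))^2)
SST <- sum((y - mean(y))^2)
c(SST, SSE + SSR)                        #the two totals agree
SSR/SST                                  #Multiple R-squared, 0.09779
1 - (SSE/(length(y)-2))/var(y)           #Adjusted R-squared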

Anscombe’s data
Anscombe’s data provides a good illustration of issues with linear regression and R2 .
See the file AnscombeR.pdf on Brightspace. The Anscombe dataset is built into R;
simply type anscombe to see the dataset. The dataset consists of four pairs of x’s
and y’s. We graph and fit linear models to each of the pairs.
> anscombe
x1 x2 x3 x4 y1 y2 y3 y4
1 10 10 10 8 8.04 9.14 7.46 6.58
2 8 8 8 8 6.95 8.14 6.77 5.76
3 13 13 13 8 7.58 8.74 12.74 7.71
4 9 9 9 8 8.81 8.77 7.11 8.84
5 11 11 11 8 8.33 9.26 7.81 8.47
6 14 14 14 8 9.96 8.10 8.84 7.04
7 6 6 6 8 7.24 6.13 6.08 5.25
8 4 4 4 19 4.26 3.10 5.39 12.50

9 12 12 12 8 10.84 9.13 8.15 5.56


10 7 7 7 8 4.82 7.26 6.42 7.91
11 5 5 5 8 5.68 4.74 5.73 6.89
Below is the output from R for the fits of the linear models, Y = α + βX + ϵ, for each of the four pairs of x and y. Yes, they all have the SAME fit; SAME R², SAME coefficient estimates, SAME everything, so I only included one version. The graphs of the data pairs are quite different, however.

> summary(ans.lm1)

Call:
lm(formula = y1 ~ x1, data = anscombe)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.0001 1.1247 2.667 0.02573 *
x1 0.5001 0.1179 4.241 0.00217 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.237 on 9 degrees of freedom


Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217

Figure 6.8: R output, fit of linear model of Y on X

Figures 6.9 and 6.10 below show the scatterplots of the pairs of Anscombe’s data
together with the fitted linear model in the first columns. Plots of residuals versus
fitted values from linear model fits are shown in the second columns. A linear model
seems appropriate for the first pair, (x1, y1). The second pair, (x2, y2), requires a
quadratic model. The third pair has an outlier which raises the regression line. The
fourth pair has an influential point which totally determines the line. Thus, although
their R2 values are all the same, we see that the linear model fits are all very different
for the four pairs.

Figure 6.9: Anscombe pairs 1 and 2, Scatterplots; Residual plots



Figure 6.10: Anscombe pairs 3 and 4, Scatterplots; Residual plots

R Code for Anscombe analyses:


anscombe
plot(anscombe$x1, anscombe$y1, main='Y1 vs X1')
ans.lm1<-lm(y1~x1, data=anscombe)
abline(ans.lm1)
plot(ans.lm1,which=1,add.smooth=FALSE)
summary(ans.lm1)

plot(anscombe$x2, anscombe$y2, main='Y2 vs X2')
ans.lm2<-lm(y2~x2, data=anscombe)
abline(ans.lm2)
plot(ans.lm2,which=1,add.smooth=FALSE)
summary(ans.lm2)

plot(anscombe$x3, anscombe$y3, main='Y3 vs X3')
ans.lm3<-lm(y3~x3, data=anscombe)
abline(ans.lm3)
plot(ans.lm3,which=1,add.smooth=FALSE)
summary(ans.lm3)

plot(anscombe$x4, anscombe$y4, main='Y4 vs X4')
ans.lm4<-lm(y4~x4, data=anscombe)
abline(ans.lm4)
plot(ans.lm4,which=1,add.smooth=FALSE)
summary(ans.lm4)

ANOVA
ANOVA stands for Analysis of Variance, and it is a tabulation of the sources of
variation that we derived for R2 . It usually also includes test statistics, and usually,
an F −test statistic with its p − value. For our simple linear regression, the ANOVA
table has the following form.

Source Df Sum Sq Mean Sq F value Pr(>F)


X variable dfR SSR MSR = SSR/dfR MSR/MSE p-value
Residuals(Error) dfE SSE MSE = SSE/dfE

Table 6.1: Analysis of Variance Table

The quantities in the table are defined below:


• Df = degrees of freedom

• Sum Sq = sum of squares


• Mean Sq = mean square = Sum Sq/Df
• F value = value of F statistic for testing H0 that all coefficients of X variable(s)
are zero = MSR/MSE
• Pr(>F) = p-value for the test H0 that all coefficients of the X variable(s) are
zero; small p-values indicate that there is evidence against H0
• MSE = s² = Σ ϵ̂i²/dfE is the estimate of σ².

The ANOVA table for the Co-op salary data appears below. Note that the p-value
for the F-test is exactly the same as the p-value for the t test of the H0 : β = 0
in Figure 6.7. That is because we have only one X variable in our model, namely
WTNumN.

> anova(Sal.lm)
Analysis of Variance Table

Response: SalMonth
Df Sum Sq Mean Sq F value Pr(>F)
WTNumN 1 95290565 95290565 124.54 < 2.2e-16 ***
Residuals 1149 879171908 765163
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>

6.4.4 Checking Goodness of Fit


One method to check the fit of the model is to plot the data together with the fitted
line. Are the data points scattered randomly about the fitted line? In Figure 6.9,
the scatterplot of (x1, y1) with the fitted line indicates a good model fit. The other
Anscombe pairs do not fit the linear model well.
Another method is to plot the residuals. The model has the form,

Yi = α + βxi + ϵi ,  where ϵi ∼ N (0, σ²), independent.


To estimate ϵi , we use

ϵ̂i = yi − (α̂ + β̂xi ) = yi − ŷi = residuali .

If the model is correct, we would expect (ϵ̂1 , . . . , ϵ̂n ) to behave like a random sample
from N (0, σ 2 ).
A plot of residuals should be scattered about the centre line at zero, all within
approximately ±3s. We plot the residuals versus the fitted values, ŷi , and look
for:
1. constant variance,
2. patterns that suggest nonlinearity,
3. outliers,
4. influential points.
In the plots of residuals versus fitted values of Figures 6.9 and 6.10, the pairs 2, 3
and 4 indicate problems with the linear models.
We can also plot a histogram of the residuals to check for normality, or a Normal
Q-Q plot which is explained in the next section.
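For the co-op salary fit, a basic residual plot can be drawn as follows (a sketch, assuming the Sal.lm object from Figure 6.7):

plot(fitted(Sal.lm), resid(Sal.lm), xlab='Fitted values', ylab='Residuals')
abline(h=0)
s <- summary(Sal.lm)$sigma        #residual standard error
abline(h=c(-3,3)*s, lty=2)        #approximate +/- 3s bands
#hist(resid(Sal.lm)) gives a quick check of normality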

6.4.5 Normal Q-Q plots


A Normal Q-Q plot of residuals is a graph of the ordered residuals from smallest
to largest, versus the corresponding percentiles of the N (0, 1) distribution. Suppose
that n = 10 and we have 10 distinct ordered residuals, ϵ̂(1) < ... < ϵ̂(n) . The brackets
are used to denote ordered values from (1) to (n).
• ϵ̂(1) is plotted versus the 100(1 − .5)/10 = 5th percentile of N (0, 1) = −1.644854.
• ϵ̂(2) is plotted versus the 100(2 − .5)/10 = 15th percentile of N (0, 1) = −1.036433.
• ...
• ϵ̂(10) is plotted versus the 100(10 − .5)/10 = 95th percentile of N (0, 1) = 1.644854.
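These plotting positions can be computed directly; the sketch below reproduces them and builds a Q-Q plot by hand for 10 hypothetical residuals. (R's qqnorm() uses a slightly different plotting-position convention for small samples, so its quantiles may differ a little.)

n <- 10
qnorm((1:n - 0.5)/n)     #-1.644854, -1.036433, ..., 1.644854
ehat <- c(-1.2, 0.4, 0.1, -0.6, 2.0, -0.3, 0.8, -1.5, 0.2, 0.5)   #hypothetical residuals
plot(qnorm((1:n - 0.5)/n), sort(ehat),
     xlab='Theoretical Quantiles', ylab='Sample Quantiles')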
If the residuals are approximately normally distributed, then the graph should roughly
look like a straight line. Figure 6.11 is a Normal Q-Q plot of a sample of size 200
generated in R from the N (0, 1) distribution. The points fall roughly on a straight

line. Figure 6.12 is a Normal Q-Q plot of a sample of size 200 generated in R from
the χ2(2) distribution. The points do NOT fall on a straight line.

Figure 6.11: Normal QQ plot of Normal data



Figure 6.12: Normal QQ plot of Chi-square(2) data

R Code for Normal Q-Q plots:


set.seed(12345)
x1<-rnorm(200)
qqnorm(x1)
qqline(x1) #overlays a line through the first and third quartiles

x3<-rchisq(200,df=2)
qqnorm(x3)
qqline(x3)

6.5 Analysis of Paired Measurements


Optional reading: Section 13.7
The analysis of paired measurements is an application of the one sample model.

Example 6.5.1. Twelve students in a statistics course recorded the scores listed
below on their first and second tests in the course.

Student
1 2 3 4 5 6 7 8 9 10 11 12
Test 1 64 28 90 30 97 20 100 67 54 44 100 71
Test 2 80 87 90 57 89 51 81 82 89 78 100 81

Test the hypothesis that there is no difference in the scores for the 2 tests.
Solution:
Note: The Test 1 and Test 2 pairs are not independent of each other. We would
expect results from different individuals to be independent of one another how-
ever.

Let Xi = i′ th difference (Test1 −Test2 ), and assume Xi ∼ N (µ, σ 2 ) independent, σ 2


is unknown. This is just a one sample model for the Xi ’s.

We will use the t statistic,

T = (X̄ − µ)/(s/√n) ∼ t(n−1) ,

to test H0 : µ = 0.

p-value = P { |t(n−1)| ≥ |x̄ − 0|/(s/√n) | H0 : µ = 0 }

x̄ = −16.67
s² = Σ(xi − x̄)²/(n − 1) = 474.97
|tobs | = |−16.67|/√(474.97/12) = 2.65

p-value = P { |t(n−1)| ≥ 2.65 }

P { t(11) ≥ 2.201 } = .025
P { t(11) ≥ 2.718 } = .01
⟹ .02 < p-value ≤ .05

There is evidence against the hypothesis that there is no difference in the scores for
the 2 exams, with a mean difference of -16.67.

A 95% Confidence Interval for the mean difference has the form

X̄ ± t.975(n−1) s/√n = −16.67 ± (2.201) √(474.97/12)
= [−30.5, −2.82].

Note that the interval does not cover zero and the scores were significantly lower for
Test 1 than Test 2.
Our concluding statement is: Scores on Test 1 were significantly lower than those on Test 2, with a mean difference of 16.67 (s.e. 6.29).
R Code for Paired Measurements Example:
>T1<-c( 64, 28, 90, 30, 97, 20, 100, 67, 54, 44, 100, 71)
>T2<-c( 80, 87, 90, 57, 89, 51, 81, 82, 89, 78, 100, 81)
> var(T1-T2)
[1] 474.9697
> t.test(T1-T2)

One Sample t-test

data: T1 - T2
t = -2.6491, df = 11, p-value = 0.02262
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-30.513786 -2.819547
sample estimates:
mean of x
-16.66667

Why pair?
Suppose instead we had randomly chosen a sample of Test 1 and Test 2 results,
assuming,

Test 1 results ∼ N (µ1 , σ 2 )

Test 2 results ∼ N (µ2 , σ 2 )

and investigated H0 : µ1 − µ2 = 0.

Suppose that by chance, the first group consisted of students that were brighter
than the second group. Then any difference in the two test results may be due to
the intelligence difference of the groups rather than to differences in the difficulty of
the two tests. With pairing, differences in the two tests will not be obscured by the
second sample of students being entirely different (independent) from the first.

Pairing is effective when there is considerable variation between subjects because it


controls or reduces unwanted or extraneous variation.
Index

Alternative hypothesis, 55
ANOVA, 146
Binomial distribution, 1
Central limit theorem, 9
Chi-square distribution, 51
Composite hypothesis, 70
Confidence Interval, 93
Contingency table, 82
Coverage Probability, 93
Cumulative distribution function, 11
Degrees of freedom, 51
Expectation, 12
Exponential distribution, 6
Fitted values, 148
Geometric distribution, 4
Hypergeometric distribution, 4
Independent random variables, 12
Information function, 17
Information matrix, 46
Least Squares estimates, 132
Likelihood function, 15
Likelihood Interval, 24
Likelihood Ratio Statistics, 61
Log Relative Likelihood function, 23
Log-likelihood function, 16
Maximum likelihood estimate, MLE, 15
Median, 40
Multinomial distribution, 2
Negative Binomial distribution, 3
Normal distribution, 8
Normal Q-Q plot, 148
Null hypothesis, 55
p-value, 55
Pareto distribution, 38
Percentile, 40
Pivotal Quantity, 113
Poisson distribution, 5
Probability density function, 11
Probability mass function, 11
R-squared, 141
Relative Likelihood Function, 23
Residual, 134
Residual Standard Error, 135
Sampling with replacement, 1
Score Function, 17
Significance level, 55
Simple hypothesis, 60
SSE, Sum of Squares Error, 141
SSR, Sum of Squares Regression, 141
Standard Error, 136, 139
Variance of X, 12
