
ST102/ST110

Elementary Statistical Theory

Course pack

2023/24 (Winter term)

Dr James Abdey

lse.ac.uk/statistics

© James Abdey 2023–24

The author asserts copyright over all material in this course guide except where
otherwise indicated. All rights reserved. No part of this work may be reproduced in any
form, or by any means, without permission in writing from the author.
Contents


6 Sampling distributions of statistics 1


6.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
6.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
6.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
6.4 Random samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
6.4.1 Joint distribution of a random sample . . . . . . . . . . . . . . . . 2
6.5 Statistics and their sampling distributions . . . . . . . . . . . . . . . . . 3
6.5.1 Sampling distribution of a statistic . . . . . . . . . . . . . . . . . 4
6.6 Sample mean from a normal population . . . . . . . . . . . . . . . . . . . 6
6.7 The central limit theorem . . . . . . . . . . . . . . . . . . . . . . . . . . 10
6.8 Some common sampling distributions . . . . . . . . . . . . . . . . . . . . 12
6.8.1 The χ2 distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 13
6.8.2 (Student’s) t distribution . . . . . . . . . . . . . . . . . . . . . . . 15
6.8.3 The F distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 17
6.9 Prelude to statistical inference . . . . . . . . . . . . . . . . . . . . . . . . 17
6.9.1 Population versus random sample . . . . . . . . . . . . . . . . . . 19
6.9.2 Parameter versus statistic . . . . . . . . . . . . . . . . . . . . . . 19
6.9.3 Difference between ‘Probability’ and ‘Statistics’ . . . . . . . . . . 21
6.10 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
6.11 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

7 Point estimation 23
7.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
7.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
7.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
7.4 Estimation criteria: bias, variance and mean squared error . . . . . . . . 24
7.5 Method of moments (MM) estimation . . . . . . . . . . . . . . . . . . . . 30
7.6 Least squares (LS) estimation . . . . . . . . . . . . . . . . . . . . . . . . 32
7.7 Maximum likelihood (ML) estimation . . . . . . . . . . . . . . . . . . . . 34
7.8 Asymptotic distribution of MLEs . . . . . . . . . . . . . . . . . . . . . . 39


7.9 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40


7.10 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

8 Interval estimation 43
8.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
8.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
8.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
8.4 Interval estimation for means of normal distributions . . . . . . . . . . . 44
8.4.1 An important property of normal samples . . . . . . . . . . . . . 46
8.5 Approximate confidence intervals . . . . . . . . . . . . . . . . . . . . . . 47
8.5.1 Means of non-normal distributions . . . . . . . . . . . . . . . . . 47
8.5.2 MLE-based confidence intervals . . . . . . . . . . . . . . . . . . . 47
8.6 Use of the chi-squared distribution . . . . . . . . . . . . . . . . . . . . . 47
8.7 Interval estimation for variances of normal distributions . . . . . . . . . . 48
8.8 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
8.9 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

9 Hypothesis testing 51
9.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
9.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
9.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
9.4 Introductory examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
9.5 Setting p-value, significance level, test statistic . . . . . . . . . . . . . . . 54
9.5.1 General setting of hypothesis tests . . . . . . . . . . . . . . . . . 54
9.5.2 Statistical testing procedure . . . . . . . . . . . . . . . . . . . . . 55
9.5.3 Two-sided tests for normal means . . . . . . . . . . . . . . . . . . 56
9.5.4 One-sided tests for normal means . . . . . . . . . . . . . . . . . . 57
9.6 t tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
9.7 General approach to statistical tests . . . . . . . . . . . . . . . . . . . . . 59
9.8 Two types of error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
9.9 Tests for variances of normal distributions . . . . . . . . . . . . . . . . . 60
9.10 Summary: tests for µ and σ² in N(µ, σ²) . . . . . . . . . . . . . . . 62
9.11 Comparing two normal means with paired observations . . . . . . . . . . 62
9.11.1 Power functions of the test . . . . . . . . . . . . . . . . . . . . . . 63
9.12 Comparing two normal means . . . . . . . . . . . . . . . . . . . . . . . . 63
9.12.1 Tests on µ_X − µ_Y with known σ²_X and σ²_Y . . . . . . . . . . . 64

9.12.2 Tests on µ_X − µ_Y with σ²_X = σ²_Y but unknown . . . . . . . . . 64
9.13 Tests for correlation coefficients . . . . . . . . . . . . . . . . . . . . . . . 67
9.13.1 Tests for correlation coefficients . . . . . . . . . . . . . . . . . . . 69
9.14 Tests for the ratio of two normal variances . . . . . . . . . . . . . . . . . 70
9.15 Summary: tests for two normal distributions . . . . . . . . . . . . . . . . 73
9.16 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
9.17 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

10 Analysis of variance (ANOVA) 75


10.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
10.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
10.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
10.4 Testing for equality of three population means . . . . . . . . . . . . . . . 75
10.5 One-way analysis of variance . . . . . . . . . . . . . . . . . . . . . . . . . 77
10.6 From one-way to two-way ANOVA . . . . . . . . . . . . . . . . . . . . . 86
10.7 Two-way analysis of variance . . . . . . . . . . . . . . . . . . . . . . . . 86
10.8 Residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
10.9 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
10.10 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . 92

11 Linear regression 93
11.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
11.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
11.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
11.4 Introductory examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
11.5 Simple linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
11.6 Inference for parameters in normal regression models . . . . . . . . . . . 100
11.7 Regression ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
11.8 Confidence intervals for E(y) . . . . . . . . . . . . . . . . . . . . . . . . 105
11.9 Prediction intervals for y . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
11.10 Multiple linear regression models . . . . . . . . . . . . . . . . . . . . 108
11.11 Regression using R . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
11.12 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
11.13 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . 119

A Sampling distributions of statistics 121


A.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121


A.2 Practice questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

B Point estimation 129


B.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
B.2 Practice questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

C Interval estimation 143


C.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
C.2 Practice questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

D Hypothesis testing 149


D.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
D.2 Practice questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

E Analysis of variance (ANOVA) 159


E.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
E.2 Practice questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165

F Linear regression 167


F.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
F.2 Practice questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173

G Solutions to Practice questions 175


G.1 Chapter 6 – Sampling distributions of statistics . . . . . . . . . . . . . . 175
G.2 Chapter 7 – Point estimation . . . . . . . . . . . . . . . . . . . . . . . . 177
G.3 Chapter 8 – Interval estimation . . . . . . . . . . . . . . . . . . . . . . . 180
G.4 Chapter 9 – Hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . 182
G.5 Chapter 10 – Analysis of variance . . . . . . . . . . . . . . . . . . . . . . 186
G.6 Chapter 11 – Linear regression . . . . . . . . . . . . . . . . . . . . . . . . 187

H Formula sheet in the summer examination 191

Chapter 6
Sampling distributions of statistics

6.1 Synopsis of chapter


This chapter considers the idea of sampling and the concept of a sampling distribution
for a statistic (such as a sample mean) which must be understood by all users of
statistics.

6.2 Learning outcomes


After completing this chapter, you should be able to:

demonstrate how sampling from a population results in a sampling distribution for a statistic

prove and apply the results for the mean and variance of the sampling distribution of the sample mean when a random sample is drawn with replacement

state the central limit theorem and recall when the limit is likely to provide a good approximation to the distribution of the sample mean.

6.3 Introduction
Suppose we have a sample of n observations of a random variable X:

{X1 , X2 , . . . , Xn }.

We have already stated that in statistical inference each individual observation Xi is regarded as a value of a random variable X, with some probability distribution (that is, the population distribution).
In this chapter we discuss how we define and work with:

the joint distribution of the whole sample {X1, X2, . . . , Xn}, treated as a multivariate random variable

distributions of univariate functions of {X1, X2, . . . , Xn} (statistics).


6.4 Random samples


Many of the results discussed here hold for many (or even all) probability distributions,
not just for some specific distributions.
It is then convenient to use generic notation.

We use f (x) to denote both the pdf of a continuous random variable, and the pf of
a discrete random variable.

The parameter(s) of a distribution are generally denoted as θ. For example, for the
Poisson distribution θ stands for λ, and for the normal distribution θ stands for
(µ, σ 2 ).

Parameters are often included in the notation: f (x; θ) denotes the pf/pdf of a
distribution with parameter(s) θ, and F (x; θ) is its cdf.

For simplicity, we may often use phrases like ‘distribution f (x; θ)’ or ‘distribution
F (x; θ)’ when we mean ‘distribution with the pf/pdf f (x; θ)’ and ‘distribution with the
cdf F (x; θ)’, respectively.
The simplest assumptions about the joint distribution of the sample are as follows.

1. {X1 , X2 , . . . , Xn } are independent random variables.

2. {X1 , X2 , . . . , Xn } are identically distributed random variables. Each Xi has the


same distribution f (x; θ), with the same value of the parameter(s) θ.

The random variables {X1 , X2 , . . . , Xn } are then called:

independent and identically distributed (IID) random variables from the


distribution (population) f (x; θ)

a random sample of size n from the distribution (population) f (x; θ).

We will assume this most of the time from now. So you will see many examples and
questions which begin something like:

‘Let {X1 , X2 , . . . , Xn } be a random sample from a normal distribution with


mean µ and variance σ 2 . . .’.

6.4.1 Joint distribution of a random sample


The joint probability distribution of the random variables in a random sample is an
important quantity in statistical inference. It is known as the likelihood function.
You will hear more about it in the chapter on point estimation.
For a random sample the joint distribution is easy to derive, because the Xi s are
independent.


The joint pf/pdf of a random sample is:

f(x1, x2, . . . , xn) = f(x1; θ) f(x2; θ) · · · f(xn; θ) = ∏_{i=1}^{n} f(xi; θ).

Other assumptions about random samples

Not all problems can be seen as IID random samples of a single random variable. There
are other possibilities, which you will see more of in the future.

IID samples from multivariate population distributions. For example, a sample of (Xi, Yi), with the joint distribution ∏_{i=1}^{n} f(xi, yi).

Independent but not identically distributed observations. For example, observations (Xi, Yi) where Yi (the 'response variable') is treated as random, but Xi (the 'explanatory variable') is not. Hence the joint distribution of the Yi s is ∏_{i=1}^{n} f_{Y|X}(yi | xi; θ), where f_{Y|X}(y | x; θ) is the conditional distribution of Y given X. This is the starting point of regression modelling (introduced later in the course).

Non-independent observations. For example, a time series {Y1, Y2, . . . , YT} where i = 1, 2, . . . , T are successive time points. The joint distribution of the series is, in general:

f(y1; θ) f(y2 | y1; θ) f(y3 | y1, y2; θ) · · · f(yT | y1, y2, . . . , yT−1; θ).

Random samples and their observed values

Here we treat {X1 , X2 , . . . , Xn } as random variables. Therefore, we consider what values


{X1 , X2 , . . . , Xn } might have in different samples.
Once a real sample is actually observed, the values of {X1 , X2 , . . . , Xn } in that specific
sample are no longer random variables, but realised values of random variables, i.e.
known numbers.
Sometimes this distinction is emphasised in the notation by using:

X1 , X2 , . . . , Xn for the random variables

x1 , x2 , . . . , xn for the observed values.

6.5 Statistics and their sampling distributions


A statistic is a known function of the random variables {X1 , X2 , . . . , Xn } in a random
sample.


Example 6.1 All of the following are statistics:

the sample mean X̄ = ∑_{i=1}^{n} Xi /n

the sample variance S² = ∑_{i=1}^{n} (Xi − X̄)²/(n − 1) and standard deviation S = √S²

the sample median, quartiles, minimum, maximum etc.

quantities such as:

∑_{i=1}^{n} Xi²   and   X̄/(S/√n).

Here we focus on single (univariate) statistics. More generally, we could also consider
vectors of statistics, i.e. multivariate statistics.

6.5.1 Sampling distribution of a statistic

A (simple) random sample is modelled as a sequence of IID random variables. A


statistic is a function of these random variables, so it is also a random variable, with a
distribution of its own.
In other words, if we collected several random samples from the same population, the
values of a statistic would not be the same from one sample to the next, but would vary
according to some probability distribution.
The sampling distribution is the probability distribution of the values which the
statistic would have in a large number of samples collected (independently) from the
same population.

Example 6.2 Suppose we collect a random sample of size n = 20 from a normal


population (distribution) X ∼ N (5, 1).
Consider the following statistics:

sample mean X̄, sample variance S 2 , and maxX = max(X1 , X2 , . . . , Xn ).

Here is one such random sample (with values rounded to 2 decimal places):
6.28 5.22 4.19 3.56 4.15 4.11 4.03 5.81 5.43 6.09
4.98 4.11 5.55 3.95 4.97 5.68 5.66 3.37 4.98 6.58
For this random sample, the values of our statistics are:

x̄ = 4.94

s2 = 0.90

maxx = 6.58.


Here is another such random sample (with values rounded to 2 decimal places):
5.44 6.14 4.91 5.63 3.89 4.17 5.79 5.33 5.09 3.90
5.47 6.62 6.43 5.84 6.19 5.63 3.61 5.49 4.55 4.27
For this sample, the values of our statistics are:

x̄ = 5.22 (the first sample had x̄ = 4.94)

s2 = 0.80 (the first sample had s2 = 0.90)

maxx = 6.62 (the first sample had maxx = 6.58).
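
The sketch below is a quick way to reproduce this kind of calculation in R. It is not part of the course pack, and the values will differ from run to run since no seed is fixed, which is exactly the point of a sampling distribution.

x <- rnorm(20, mean = 5, sd = 1)   # one random sample of size n = 20 from N(5, 1)
mean(x)    # sample mean
var(x)     # sample variance (n - 1 divisor)
max(x)     # sample maximum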

How to derive a sampling distribution?

The sampling distribution of a statistic is the distribution of the values of the statistic
in (infinitely) many repeated samples. However, typically we only have one sample
which was actually observed. Therefore, the sampling distribution seems like an
essentially hypothetical concept.
Nevertheless, it is possible to derive the forms of sampling distributions of statistics
under different assumptions about the sampling schemes and population distribution
f (x; θ).
There are two main ways of doing this.

Exactly or approximately through mathematical derivation. This is the most


convenient way for subsequent use, but is not always easy.
With simulation, i.e. by using a computer to generate (artificial) random samples
from a population distribution of a known form.

Example 6.3 Consider again a random sample of size n = 20 from the population
X ∼ N (5, 1), and the statistics X̄, S 2 and maxX .

We first consider deriving the sampling distributions of these by approximation


through simulation.

Here a computer was used to draw 10,000 independent random samples of


n = 20 from N (5, 1), and the values of X̄, S 2 and maxX for each of these
random samples were recorded.

Figures 6.1, 6.2 and 6.3 show histograms of the statistics for these 10,000
random samples.

We now consider deriving the exact sampling distribution. Here this is possible. For
a random sample of size n from N (µ, σ 2 ) we have:

(a) X̄ ∼ N (µ, σ 2 /n)

(b) (n − 1)S 2 /σ 2 ∼ χ2n−1


(c) the sampling distribution of Y = maxX has the following pdf:

fY (y) = n(FX (y))n−1 fX (y)

where FX (x) and fX (x) are the cdf and pdf of X ∼ N (µ, σ 2 ), respectively.

Curves of the densities of these distributions are also shown in Figures 6.1, 6.2 and
6.3.
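
A minimal R sketch of the simulation route is given below. It is not the script used to produce the figures, and the seed and plotting details are assumptions, but the logic (10,000 samples of size 20 from N(5, 1), recording X̄, S² and maxX) follows the description above.

set.seed(1)                        # assumed seed, for reproducibility
R <- 10000; n <- 20
xbar <- s2 <- maxx <- numeric(R)
for (r in 1:R) {
  x <- rnorm(n, mean = 5, sd = 1)
  xbar[r] <- mean(x)               # sample mean
  s2[r]   <- var(x)                # sample variance
  maxx[r] <- max(x)                # sample maximum
}
hist(xbar, freq = FALSE)           # compare with the exact N(5, 1/20) pdf
curve(dnorm(x, 5, sqrt(1/20)), add = TRUE)
hist((n - 1) * s2, freq = FALSE)   # (n-1)S^2/sigma^2 with sigma^2 = 1
curve(dchisq(x, df = n - 1), add = TRUE)   # chi-squared pdf with 19 degrees of freedom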


Figure 6.1: Simulation-generated sampling distribution of X̄ to accompany Example 6.3.

6.6 Sample mean from a normal population


Consider one very common statistic, the sample mean:

X̄ = (1/n) ∑_{i=1}^{n} Xi = (1/n)X1 + (1/n)X2 + · · · + (1/n)Xn.

What is the sampling distribution of X̄?


We know from Section 5.10.2 that for independent {X1, X2, . . . , Xn} from any distribution:

E(∑_{i=1}^{n} ai Xi) = ∑_{i=1}^{n} ai E(Xi)

and:

Var(∑_{i=1}^{n} ai Xi) = ∑_{i=1}^{n} ai² Var(Xi).



Figure 6.2: Simulation-generated sampling distribution of S 2 to accompany Example 6.3.


Figure 6.3: Simulation-generated sampling distribution of maxX to accompany Example 6.3.


For a random sample, all Xi s are independent and E(Xi) = E(X) is the same for all of them, since the Xi s are identically distributed. X̄ = ∑_i Xi /n is of the form ∑_i ai Xi, with ai = 1/n for all i = 1, 2, . . . , n.
Therefore:

E(X̄) = ∑_{i=1}^{n} (1/n) E(X) = n × (1/n) E(X) = E(X)

and:

Var(X̄) = ∑_{i=1}^{n} (1/n²) Var(X) = n × (1/n²) Var(X) = Var(X)/n.

So the mean and variance of X̄ are E(X) and Var(X)/n, respectively, for a random
sample from any population distribution of X. What about the form of the sampling
distribution of X̄?
This depends on the distribution of X, and is not generally known. However, when the
distribution of X is normal, we do know that the sampling distribution of X̄ is also
normal.
Suppose that {X1, X2, . . . , Xn} is a random sample from a normal distribution with mean µ and variance σ², then:

X̄ ∼ N(µ, σ²/n).

For example, the pdf drawn on the histogram in Figure 6.1 is that of N (5, 1/20).
We have E(X̄) = E(X) = µ.

In an individual sample, x̄ is not usually equal to µ, the expected value of the


population.

However, over repeated samples the values of X̄ are centred at µ.


We also have Var(X̄) = Var(X)/n = σ²/n, and hence also sd(X̄) = σ/√n.

The variation of the values of X̄ in different samples (the sampling variance) is


large when the population variance of X is large.

More interestingly, the sampling variance gets smaller when the sample size n
increases.

In other words, when n is large the distribution of X̄ is more tightly concentrated


around µ than when n is small.

Figure 6.4 shows sampling distributions of X̄ from N (5, 1) for different n.

Example 6.4 Suppose that the heights (in cm) of men (aged over 16) in a
population follow a normal distribution with some unknown mean µ and a known
standard deviation of 7.39.



Figure 6.4: Sampling distributions of X̄ from N (5, 1) for different n.

We plan to select a random sample of n men from the population, and measure their
heights. How large should n be so that there is a probability of at least 0.95 that the
sample mean X̄ will be within 1 cm of the population mean µ?

Here X ∼ N(µ, (7.39)²), so X̄ ∼ N(µ, (7.39/√n)²). What we need is the smallest n such that:

P(|X̄ − µ| ≤ 1) ≥ 0.95.

So:

P(|X̄ − µ| ≤ 1) ≥ 0.95
P(−1 ≤ X̄ − µ ≤ 1) ≥ 0.95
P(−1/(7.39/√n) ≤ (X̄ − µ)/(7.39/√n) ≤ 1/(7.39/√n)) ≥ 0.95
P(−√n/7.39 ≤ Z ≤ √n/7.39) ≥ 0.95
P(Z > √n/7.39) < 0.05/2 = 0.025

where Z ∼ N(0, 1). From Table 3 of Murdoch and Barnes' Statistical Tables, we see that the smallest z which satisfies P(Z > z) < 0.025 is z = 1.97. Therefore:

√n/7.39 ≥ 1.97   ⇔   n ≥ (7.39 × 1.97)² = 211.9.

Therefore, n should be at least 212.
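
For readers who prefer to check such a calculation in R rather than in the printed tables, a minimal sketch is below; qnorm gives the exact critical value (about 1.96), whereas the text uses 1.97 from the tables.

qnorm(0.975)                # exact 2.5% upper critical value, approximately 1.96
(7.39 * qnorm(0.975))^2     # approximately 209.8 with the exact critical value
ceiling((7.39 * 1.97)^2)    # 212, matching the table-based calculation above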


6.7 The central limit theorem


We have discussed the very convenient result that if a random sample comes from a
normally-distributed population, the sampling distribution of X̄ is also normal. How
about sampling distributions of X̄ from other populations?
For this, we can use a remarkable mathematical result, the central limit theorem
(CLT). In essence, the CLT states that the normal sampling distribution of X̄ which
holds exactly for random samples from a normal distribution, also holds approximately
for random samples from nearly any distribution.
The CLT applies to ‘nearly any’ distribution because it requires that the variance of the
population distribution is finite. If it is not (such as for some Pareto distributions,
introduced in Chapter 3), the CLT does not hold. However, such distributions are not
common.
Suppose that {X1 , X2 , . . . , Xn } is a random sample from a population distribution
which has mean E(Xi ) = µ < ∞ and variance Var(Xi ) = σ 2 < ∞, that is with a finite
mean and finite variance. Let X̄n denote the sample mean calculated from a random
sample of size n, then:

lim_{n→∞} P((X̄n − µ)/(σ/√n) ≤ z) = Φ(z)

for any z, where Φ(z) denotes the cdf of the standard normal distribution.
The 'lim_{n→∞}' indicates that this is an asymptotic result, i.e. one which holds increasingly well as n increases, and exactly when the sample size is infinite.
The full proof of the CLT is not straightforward. A partial (and non-examinable!)
version is given in a note on the ST102 Moodle site.
In less formal language, the CLT says that for a random sample from nearly any distribution with mean µ and variance σ², then:

X̄ ∼ N(µ, σ²/n)

approximately, when n is sufficiently large. We can then say that X̄ is asymptotically normally distributed with mean µ and variance σ²/n.

The wide reach of the CLT

It may appear that the CLT is still somewhat limited, in that it applies only to sample
means calculated from random (IID) samples. However, this is not really true, for two
main reasons.

There are more general versions of the CLT which do not require the observations
Xi to be IID.

Even the basic version applies very widely, when we realise that the ‘X’ can also be
a function of the original variables in the data. For example, if X and Y are


random variables in the sample, we can also apply the CLT to:

∑_{i=1}^{n} ln(Xi)/n   or   ∑_{i=1}^{n} Xi Yi /n.

Therefore, the CLT can also be used to derive sampling distributions for many statistics
which do not initially look at all like X̄ for a single random variable in an IID sample.
You may get to do this in future courses.

How large is ‘large n’?

The larger the sample size n, the better the normal approximation provided by the CLT
is. In practice, we have various rules-of-thumb for what is ‘large enough’ for the
approximation to be ‘accurate enough’. This also depends on the population
distribution of Xi . For example:

for symmetric distributions, even small n is enough

for very skewed distributions, larger n is required.

For many distributions, n > 30 is sufficient for the approximation to be reasonably


accurate.

Example 6.5 In the first case, we simulate random samples of sizes:

n = 1, 5, 10, 30, 100 and 1,000

from the Exp(0.25) distribution (for which µ = 4 and σ 2 = 16). This is clearly a
skewed distribution, as shown by the histogram for n = 1 in Figure 6.5.
10,000 independent random samples of each size were generated. Histograms of the
values of X̄ in these random samples are shown in Figure 6.5. Each plot also shows
the pdf of the approximating normal distribution, N (4, 16/n). The normal
approximation is reasonably good already for n = 30, very good for n = 100, and
practically perfect for n = 1,000.
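
A compact R version of this simulation is sketched below; the seed and the use of replicate() are assumptions rather than the script behind Figure 6.5, but the design (10,000 samples of each size from Exp(0.25)) follows the text.

set.seed(1)   # assumed seed, for reproducibility
for (n in c(1, 5, 10, 30, 100, 1000)) {
  xbar <- replicate(10000, mean(rexp(n, rate = 0.25)))      # 10,000 sample means
  hist(xbar, freq = FALSE, main = paste("n =", n))
  curve(dnorm(x, mean = 4, sd = sqrt(16 / n)), add = TRUE)  # N(4, 16/n) approximation
}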

Example 6.6 In the second case, we simulate 10,000 independent random samples
of sizes:
n = 1, 10, 30, 50, 100 and 1,000
from the Bernoulli(0.2) distribution (for which µ = 0.2 and σ 2 = 0.16).
Here the distribution of Xi itself is not even continuous, and has only two possible
values, 0 and 1. Nevertheless, the sampling distribution of X̄ can be very
well-approximated by the normal distribution, when n is large enough.
Note that since here Xi = 1 or Xi = 0 for all i, X̄ = ∑_{i=1}^{n} Xi /n = m/n, where m is the number of observations for which Xi = 1. In other words, X̄ is the sample proportion of the value X = 1.
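
The Bernoulli(0.2) case can be sketched the same way in R (again an illustrative script, not the one used for Figure 6.6):

set.seed(1)   # assumed seed
n <- 50
pbar <- replicate(10000, mean(rbinom(n, size = 1, prob = 0.2)))   # sample proportions
hist(pbar, freq = FALSE)
curve(dnorm(x, mean = 0.2, sd = sqrt(0.16 / n)), add = TRUE)      # N(0.2, 0.16/n)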



Figure 6.5: Sampling distributions of X̄ for various n when sampling from the Exp(0.25)
distribution.

The normal approximation is clearly very bad for small n, but reasonably good
already for n = 50, as shown by the histograms in Figure 6.6.

6.8 Some common sampling distributions


In the remaining chapters, we will make use of results like the following.
Suppose that {X1 , X2 , . . . , Xn } and {Y1 , Y2 , . . . , Ym } are two independent random
samples from N (µ, σ 2 ), then:
(n − 1)S²_X /σ² ∼ χ²_{n−1}   and   (m − 1)S²_Y /σ² ∼ χ²_{m−1}

√((n + m − 2)/(1/n + 1/m)) × (X̄ − Ȳ)/√((n − 1)S²_X + (m − 1)S²_Y) ∼ t_{n+m−2}

and:

S²_X /S²_Y ∼ F_{n−1, m−1}.
Here ‘χ2 ’, ‘t’ and ‘F ’ refer to three new families of probability distributions:

the χ2 (‘chi-squared’) distribution


the t distribution
the F distribution.



Figure 6.6: Sampling distributions of X̄ for various n when sampling from the
Bernoulli(0.2) distribution.

These are not often used as distributions of individual variables. Instead, they are used
as sampling distributions for various statistics. Each of them arises from the normal
distribution in a particular way. We will now briefly introduce their main properties.
This is in preparation for statistical inference, where the uses of these distributions will
be discussed at length.

6.8.1 The χ2 distribution

Definition of the χ2 distribution

Let Z1, Z2, . . . , Zk be independent N(0, 1) random variables. If:

X = Z1² + Z2² + · · · + Zk² = ∑_{i=1}^{k} Zi²

the distribution of X is the χ² distribution with k degrees of freedom. This is denoted by X ∼ χ²(k) or X ∼ χ²_k.

The χ2k distribution is a continuous distribution, which can take values of x ≥ 0. Its
mean and variance are:

E(X) = k
Var(X) = 2k.


For reference, the probability density function of X ∼ χ²_k is:

f(x) = (2^{k/2} Γ(k/2))^{−1} x^{k/2−1} e^{−x/2}  for x ≥ 0, and 0 otherwise

where:

Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx

is the gamma function, which is defined for all α > 0. (Note the formula of the pdf of X ∼ χ²_k is not examinable.)
The shape of the pdf depends on the degrees of freedom k, as illustrated in Figure 6.7.
In most applications of the χ2 distribution the appropriate value of k is known, in which
case it does not need to be estimated from data.


Figure 6.7: χ2 pdfs for various degrees of freedom.

If X1, X2, . . . , Xm are independent random variables and Xi ∼ χ²_{ki}, then their sum is also χ²-distributed where the individual degrees of freedom are added, such that:

X1 + X2 + · · · + Xm ∼ χ²_{k1+k2+···+km}.

The uses of the χ² distribution will be discussed later. One example though is if {X1, X2, . . . , Xn} is a random sample from the population N(µ, σ²), and S² is the sample variance, then:

(n − 1)S²/σ² ∼ χ²_{n−1}.
This result is used to derive basic tools of statistical inference for both µ and σ 2 for the
normal distribution.

Tables of the χ2 distribution

In exercises and the examination, you will need a table of some probabilities for the χ2
distribution. Table 8 of Murdoch and Barnes’ Statistical Tables shows the following
information.


The rows correspond to different degrees of freedom k (denoted in the table by ν).
The table shows values of k up to 100.

The columns correspond to the right-tail probability P (X > x) = α, where


X ∼ χ2k , for different values of α. The first page contains α = 0.995, 0.99, . . . , 0.50,
and the second page contains α = 0.30, 0.25, . . . , 0.001.

The numbers in the table are values of x such that P (X > x) = α for the k and α
in that row and column.

Example 6.7 Consider two numbers in the 'ν = 5' row, the 2.675 in the 'α = 0.75' column and the 3.000 in the 'α = 0.70' column. These mean that for X ∼ χ²_5 we have:

P (X > 2.675) = 0.75 (and hence P (X ≤ 2.675) = 0.25)

P (X > 3.000) = 0.70 (and hence P (X ≤ 3.000) = 0.30).

These also provide bounds for probabilities of other values. For example, since 2.8 is
between 2.675 and 3.000, we can conclude that:

0.70 < P (X > 2.8) < 0.75.
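
If a computer is to hand, the same probabilities (and the bound) can be checked with R's chi-squared cdf; for example:

pchisq(2.675, df = 5, lower.tail = FALSE)   # approximately 0.75
pchisq(3.000, df = 5, lower.tail = FALSE)   # approximately 0.70
pchisq(2.8,   df = 5, lower.tail = FALSE)   # lies between 0.70 and 0.75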

The ways in which this table may be used in statistical inference will be explained in
later chapters.

6.8.2 (Student’s) t distribution

Definition of Student’s t distribution

Suppose Z ∼ N(0, 1), X ∼ χ²_k, and Z and X are independent. The distribution of the random variable:

T = Z/√(X/k)

is the t distribution with k degrees of freedom. This is denoted T ∼ t_k or T ∼ t(k). The distribution is also known as 'Student's t distribution'.

The t_k distribution is continuous with the pdf:

f(x) = (Γ((k + 1)/2)/(√(kπ) Γ(k/2))) (1 + x²/k)^{−(k+1)/2}

for all −∞ < x < ∞. Examples of f(x) for different k are shown in Figure 6.8. (Note the formula of the pdf of t_k is not examinable.)
From Figure 6.8, we see the following.

The distribution is symmetric around 0.



Figure 6.8: Student’s t pdfs for various degrees of freedom.

As k → ∞, the tk distribution tends to the standard normal distribution, so tk with


large k is very similar to N (0, 1).
For any finite value of k, the tk distribution has heavier tails than the standard
normal distribution, i.e. tk places more probability on values far from 0 than
N (0, 1) does.

For T ∼ t_k, the mean and variance of the distribution are:

E(T) = 0  for k > 1

and:

Var(T) = k/(k − 2)  for k > 2.

This means that for t_1 neither E(T) nor Var(T) exist, and for t_2, Var(T) does not exist.

Tables of the t distribution

In exercises and the examination, you will need a table of some probabilities for the t
distribution. Table 7 of Murdoch and Barnes’ Statistical Tables shows the following
information.

The rows correspond to different degrees of freedom k (denoted in the table by ν).
The table shows values of k up to 120, and then ‘∞’, which is N (0, 1).
If you need a tk distribution for which k is not in the table, use the nearest value or
use interpolation.
The columns correspond to the right-tail probability P (T > t) = α, where T ∼ tk ,
for α = 0.10, 0.05, . . . , 0.0005.
The numbers in the table are values of t such that P (T > t) = α for the k and α in
that row and column.


Example 6.8 Consider the number 2.132 in the ‘ν = 4’ row, and the ‘α = 0.05’
column. This means that for T ∼ t4 we have:

P (T > 2.132) = 0.05 (and hence P (T ≤ 2.132) = 0.95).

The table also provides bounds for other probabilities. For example, the number in
the ‘α = 0.025’ column is 2.776, so P (T > 2.776) = 0.025. Since 2.132 < 2.5 < 2.776,
we know that 0.025 < P (T > 2.5) < 0.05.
Results for left-tail probabilities P (T < t) = α can also be obtained, because the t
distribution is symmetric around 0. This means that P (T < t) = P (T > −t). For
example:
P (T < −2.132) = P (T > 2.132) = 0.05
and P (T < −2.5) < 0.05 since P (T > 2.5) < 0.05.
This is the same trick we used for the standard normal distribution.
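
Again, the table look-ups can be reproduced in R, for example:

qt(0.95, df = 4)                       # 2.132, since P(T > 2.132) = 0.05
pt(2.776, df = 4, lower.tail = FALSE)  # 0.025
pt(2.5,   df = 4, lower.tail = FALSE)  # between 0.025 and 0.05
pt(-2.132, df = 4)                     # 0.05, by symmetry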

6.8.3 The F distribution

Definition of the F distribution

Let U and V be two independent random variables, where U ∼ χ²_p and V ∼ χ²_k. The distribution of:

F = (U/p)/(V/k)

is the F distribution with degrees of freedom (p, k), denoted F ∼ F_{p, k} or F ∼ F(p, k).

The F distribution is a continuous distribution, with non-zero probabilities for x > 0.


The general shape of its pdf is shown in Figure 6.9.
For F ∼ Fp, k , E(F ) = k/(k − 2), for k > 2. If F ∼ Fp, k , then 1/F ∼ Fk, p . If T ∼ tk ,
then T 2 ∼ F1, k .
Tables of F distributions will be needed for some purposes. They will be available in the
examination. We will postpone practice with them until later in the course.
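
The relationships just stated are easy to verify numerically in R; the degrees of freedom below are illustrative choices only.

qt(0.975, df = 8)^2                            # if T ~ t_8 then T^2 ~ F(1, 8) ...
qf(0.95, df1 = 1, df2 = 8)                     # ... so this gives the same value
pf(2, df1 = 10, df2 = 3, lower.tail = FALSE)   # P(F > 2) for F ~ F(10, 3)
pf(1/2, df1 = 3, df2 = 10)                     # the same probability, using 1/F ~ F(3, 10)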

6.9 Prelude to statistical inference


We conclude Chapter 6 with a discussion of the preliminaries of statistical inference
before moving on to point estimation. The discussion below will review some key
concepts introduced previously.
So, just what is 'Statistics'? It is a scientific subject concerned with collecting and 'making sense' of data.

Collection: designing experiments/questionnaires, designing sampling schemes, and


administration of data collection.



Figure 6.9: F pdfs for various degrees of freedom.

Making sense: estimation, testing and forecasting.

So, ‘Statistics’ is an application-oriented subject, particularly useful or helpful in


answering questions such as the following.

Does a certain new drug prolong life for AIDS sufferers?

Is global warming really happening?

Are GCSE and A-level examination standards declining?

Is the gap between rich and poor widening in Britain?

Is there still a housing bubble in London?

Is the Chinese yuan undervalued? If so, by how much?

These questions are difficult to study in a laboratory, and admit no self-evident axioms.
Statistics provides a way of answering these types of questions using data.
What should we learn in ‘Statistics’ ? The basic ideas, methods and theory. Some
guidelines for learning/applying statistics are the following.

Understand what data say in each specific context. All the methods are just tools
to help us to understand data.

Concentrate on what to do and why, rather than on concrete calculations and


graphing.

It may take a while to catch the basic idea of statistics – keep thinking!


6.9.1 Population versus random sample


Consider the following two practical examples.

Example 6.9 A new type of tyre was designed to increase its lifetime. The
manufacturer tested 120 new tyres and obtained the average lifetime (over these 120
tyres) of 35,391 miles. So the manufacturer claims that the mean lifetime of new
tyres is 35,391 miles.

Example 6.10 A newspaper sampled 1,000 potential voters, and 350 of them were
Labour Party supporters. It claims that the proportion of Labour voters in the
whole country is 350/1,000 = 0.35, i.e. 35%.

In both cases, the conclusion is drawn on a population (i.e. all the objects concerned)
based on the information from a sample (i.e. a subset of the population).
In Example 6.9, it is impossible to measure the whole population. In Example 6.10, it is
not economical to measure the whole population. Therefore, errors are inevitable!
The population is the entire set of objects concerned, and these objects are typically
represented by some numbers. We do not know the entire population in practice.
In Example 6.9, the population consists of the lifetimes of all tyres, including those to
be produced in the future. For the opinion poll in Example 6.10, the population consists
of many ‘1’s and ‘0’s, where each ‘1’ represents a voter for the Labour party, and each
‘0’ represents a voter for other parties.
A sample is a (randomly) selected subset of a population, and is known in practice. The
population is unknown. We represent a population by a probability distribution.
Why do we need a model for the entire population?

Because the questions we ask concern the entire population, not just the data we
have. Having a model for the population tells us that the remaining population is
not much different from our data or, in other words, that the data are
representative of the population.

Why do we need a random model?

Because the process of drawing a sample from a population is a bit like the process
of generating random variables. A different sample would produce different values.
Therefore, the population from which we draw a random sample is represented as a
probability distribution.

6.9.2 Parameter versus statistic


For a given problem, we typically assume a population to be a probability distribution
F (x; θ), where the form of distribution F is known (such as normal or Poisson), and θ
denotes some unknown characteristic (such as the mean or variance) and is called a
parameter.


Example 6.11 Continuing with Example 6.9, the population may be assumed to
be N (µ, σ 2 ) with θ = (µ, σ 2 ), where µ is the ‘true’ lifetime.
Let:
X = the lifetime of a tyre
then we can write X ∼ N (µ, σ 2 ).

Example 6.12 Continuing with Example 6.10, the population is a Bernoulli


distribution such that:

P (X = 1) = P (a Labour voter) = π

and:
P (X = 0) = P (a non-Labour voter) = 1 − π
where:

π = the proportion of Labour supporters in the UK


= the probability of a voter being a Labour supporter.

A sample: a set of data or random variables?

A sample of size n, {X1 , X2 , . . . , Xn }, is also called a random sample. It consists of n


real numbers in a practical problem. The word ‘random’ captures the fact that samples
(of the same size) taken by different people or at different times may be different, as
they are different subsets of a population.
Furthermore, a sample is also viewed as n independent and identically distributed
(IID) random variables, when we assess the performance of a statistical method.

Example 6.13 For the tyre lifetime in Example 6.9, suppose the realised sample (of size n = 120) gives the sample mean:

x̄ = (1/n) ∑_{i=1}^{n} xi = 35,391.

A different sample may give a different sample mean, such as 36,721.

Is the sample mean X̄ a good estimator of the unknown ‘true’ lifetime µ? Obviously,
we cannot use the real number 35,391 to assess how good this estimator is, as a different
sample may give a different average value, such as 36,721.
By treating {X1 , X2 , . . . , Xn } as random variables, X̄ is also a random variable. If the
distribution of X̄ concentrates closely around (unknown) µ, X̄ is a good estimator of µ.


Definition of a statistic

Any known function of a random sample is called a statistic. Statistics are used for
statistical inference such as estimation and testing.

Example 6.14 Let {X1, X2, . . . , Xn} be a random sample from the population N(µ, σ²), then:

X̄ = (1/n) ∑_{i=1}^{n} Xi,   X1 + Xn²   and   sin(X3) + 6

are all statistics, but:

(X1 − µ)/σ

is not a statistic, as it depends on the unknown quantities µ and σ².

An observed random sample is often denoted as {x1 , x2 , . . . , xn }, indicating that they


are n real numbers. They are seen as a realisation of n IID random variables
{X1 , X2 , . . . , Xn }.
The connection between a population and a sample is shown in Figure 6.10, where θ is
a parameter. A known function of {X1 , X2 , . . . , Xn } is called a statistic.

Figure 6.10: Representation of the connection between a population and a sample.

6.9.3 Difference between ‘Probability’ and ‘Statistics’


‘Probability’ is a mathematical subject, while ‘Statistics’ is an application-oriented
subject (which uses probability heavily).

Example 6.15 Let:

X = the number of lectures attended by a student in a term with 20 lectures


then X ∼ Bin(20, π), i.e. the pf is:

P(X = x) = (20!/(x! (20 − x)!)) π^x (1 − π)^{20−x}  for x = 0, 1, 2, . . . , 20

and 0 otherwise.
Some probability questions are as follows. Treating π as known:

what is E(X) (the average number of lectures attended)?

what is P (X ≥ 18) (the proportion of students attending at least 18 lectures)?

what is P (X < 10) (the proportion of students attending fewer than half of the
lectures)?

Some statistics questions are as follows.

What is π (the average attendance rate)?

Is π larger than 0.9?

Is π smaller than 0.5?
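
To make the contrast concrete, the probability questions have direct numerical answers in R once π is treated as known (π = 0.9 below is purely an illustrative value), whereas the statistics questions ask what observed data tell us about the unknown π.

p <- 0.9                                             # an assumed, known value of pi
20 * p                                               # E(X)
pbinom(17, size = 20, prob = p, lower.tail = FALSE)  # P(X >= 18)
pbinom(9,  size = 20, prob = p)                      # P(X < 10) = P(X <= 9)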

6.10 Overview of chapter


This chapter introduced sampling distributions of statistics which are the foundations
to statistical inference. The sampling distribution of the sample mean was derived
exactly when sampling from normal populations and also approximately for more
general distributions using the central limit theorem. Three new families of distributions
(χ2 , t and F ) were defined.

6.11 Key terms and concepts


Central limit theorem Chi-squared (χ2 ) distribution
F distribution IID random variables
Random sample Sampling distribution
Sampling variance Statistic
(Student’s) t distribution

Did you hear the one about the statistician? Probably.


(Anon)

Chapter 7
Point estimation

7.1 Synopsis of chapter


This chapter covers point estimation. Specifically, the properties of estimators are
considered and the attributes of a desirable estimator are discussed. Techniques for
deriving estimators are introduced.

7.2 Learning outcomes


After completing this chapter, you should be able to:

summarise the performance of an estimator with reference to its sampling


distribution

use the concepts of bias and variance of an estimator

define mean squared error and calculate it for simple estimators

find estimators using the method of moments, least squares and maximum
likelihood.

7.3 Introduction
The basic setting is that we assume a random sample {X1 , X2 , . . . , Xn } is observed from
a population F (x; θ). The goal is to make inference (i.e. estimation or testing) for the
unknown parameter(s) θ.

Statistical inference is based on two things.


1. A set of data/observations {X1 , X2 , . . . , Xn }.
2. An assumption of F (x; θ) for the joint distribution of {X1 , X2 , . . . , Xn }.

Inference is carried out using a statistic, i.e. a known function of {X1 , X2 , . . . , Xn }.

For estimation, we look for a statistic θ̂ = θ̂(X1, X2, . . . , Xn) such that the value of θ̂ is taken as an estimate (i.e. an estimated value) of θ. Such a θ̂ is called a point estimator of θ.

For testing, we typically use a statistic to test if a hypothesis on θ (such as θ = 3) is true or not.


Example 7.1 Let {X1 , X2 , . . . , Xn } be a random sample from a population with


mean µ = E(Xi ). Find an estimator of µ.
Since µ is the mean of the population, a natural estimator would be the sample mean µ̂ = X̄, where:

X̄ = (1/n) ∑_{i=1}^{n} Xi = (X1 + X2 + · · · + Xn)/n.

We call µ̂ = X̄ a point estimator (or simply an estimator) of µ.
For example, if we have an observed sample of 9, 16, 15, 4 and 12, hence of size n = 5, the sample mean is:

µ̂ = (9 + 16 + 15 + 4 + 12)/5 = 11.2.

The value 11.2 is a point estimate of µ. For an observed sample of 15, 16, 10, 8 and 9, we obtain µ̂ = 11.6.

7.4 Estimation criteria: bias, variance and mean squared error

Estimators are random variables and, therefore, have probability distributions, known
as sampling distributions. As we know, two important properties of probability
distributions are the mean and variance. Our objective is to create a formal criterion
which combines both of these properties to assess the relative performance of different
estimators.

Bias of an estimator

Let θb be an estimator of the population parameter θ.1 We define the bias of an


estimator as:
Bias(θ)
b = E(θ)b − θ. (7.1)
An estimator is:

positively biased if b −θ >0


E(θ)

unbiased if b −θ =0
E(θ)

negatively biased if b − θ < 0.


E(θ)

A positively-biased estimator means the estimator would systematically overestimate the parameter by the size of the bias, on average. An unbiased estimator means the estimator would estimate the parameter correctly, on average. A negatively-biased estimator means the estimator would systematically underestimate the parameter by the size of the bias, on average.

¹ The hat notation is often used by statisticians to denote an estimator of the parameter beneath the hat. So, for example, λ̂ denotes an estimator of the Poisson rate parameter λ.
In words, the bias of an estimator is the difference between the expected (average) value
of the estimator and the true parameter being estimated. Intuitively, it would be
desirable, other things being equal, to have an estimator with zero bias, called an
unbiased estimator. Given the definition of bias in (7.1), an unbiased estimator would
satisfy:
E(θ̂) = θ.
In words, the expected value of the estimator is the true parameter being estimated, i.e.
on average, under repeated sampling, an unbiased estimator correctly estimates θ.
We view bias as a ‘bad’ thing, so, other things being equal, the smaller an estimator’s
bias the better.

Example 7.2 Since E(X̄) = µ, the sample mean X̄ is an unbiased estimator of µ


because:
E(X̄) − µ = 0.

Variance of an estimator

The variance of an estimator, denoted Var(θ̂), is obtained directly from the estimator's sampling distribution.

Example 7.3 For the sample mean, X̄, we have:

Var(X̄) = σ²/n.     (7.2)

It is clear that in (7.2) increasing the sample size n decreases the estimator's variance (and hence the standard error, i.e. the square root of the estimator's variance), therefore increasing the precision of the estimator.² We conclude that variance is also a 'bad' thing so, other things being equal, the smaller an estimator's variance the better.

² Remember, however, that this increased precision comes at a cost, namely the increased expenditure on data collection.

Estimator properties

Is µ̂ = X̄ a 'good' estimator of µ?
Intuitively, X1 or (X1 + X2 + X3)/3 would not be good enough as estimators of µ. However, can we use other estimators such as the sample median:

µ̂1 = X_{((n+1)/2)} for odd n, and µ̂1 = (X_{(n/2)} + X_{(n/2+1)})/2 for even n

or perhaps a trimmed sample mean:

µ̂2 = (X_{(k1+1)} + X_{(k1+2)} + · · · + X_{(n−k2)})/(n − k1 − k2)

or simply µ̂3 = (X_{(1)} + X_{(n)})/2, where X_{(1)}, X_{(2)}, . . . , X_{(n)} are the order statistics obtained by rearranging X1, X2, . . . , Xn into ascending order:

X_{(1)} ≤ X_{(2)} ≤ · · · ≤ X_{(n)}

and k1 and k2 are two small, positive integers?
To highlight the key idea, let θ be a scalar, and θ̂ be a (point) estimator of θ. A good estimator would make |θ̂ − θ| as small as possible. However:

θ is unknown

the value of θ̂ changes with the observed sample.

Mean squared error and mean absolute deviation

The mean squared error (MSE) of θ̂ is defined as:

MSE(θ̂) = E((θ̂ − θ)²)

and the mean absolute deviation (MAD) of θ̂ is defined as:

MAD(θ̂) = E(|θ̂ − θ|).

Intuitively, MAD is a more appropriate measure for the error in estimation. However, it is technically less convenient since the function h(x) = |x| is not differentiable at x = 0. Therefore, the MSE is used more often.
If E(θ̂²) < ∞, it holds that:

MSE(θ̂) = Var(θ̂) + (Bias(θ̂))²

where Bias(θ̂) = E(θ̂) − θ.
Proof:

MSE(θ̂) = E((θ̂ − θ)²)
        = E(((θ̂ − E(θ̂)) + (E(θ̂) − θ))²)
        = E((θ̂ − E(θ̂))²) + E((E(θ̂) − θ)²) + 2E((θ̂ − E(θ̂))(E(θ̂) − θ))
        = Var(θ̂) + E((Bias(θ̂))²) + 2(E(θ̂) − E(θ̂))(E(θ̂) − θ)
        = Var(θ̂) + (Bias(θ̂))² + 0.


We have already established that both bias and variance of an estimator are ‘bad’
things, so the MSE (being the sum of a bad thing and a bad thing squared) can also be
viewed as a ‘bad’ thing.3 Hence when faced with several competing estimators, we
prefer the estimator with the smallest MSE.
So, although an unbiased estimator is intuitively appealing, it is perfectly possible that
a biased estimator might be preferred if the ‘cost’ of the bias is offset by a substantial
reduction in variance. Hence the MSE provides us with a formal criterion to assess the
trade-off between the bias and variance of different estimators of the same parameter.

Example 7.4 A population is known to be normally distributed, i.e. X ∼ N(µ, σ²). Suppose we wish to estimate the population mean, µ. We draw a random sample {X1, X2, . . . , Xn} such that these random variables are IID. We have three candidate estimators of µ, T1, T2 and T3, defined as:

T1 = X̄ = (1/n) ∑_{i=1}^{n} Xi,   T2 = (X1 + Xn)/2   and   T3 = X̄ + 3.

Which estimator should we choose?
We begin by computing the MSE for T1, noting:

E(T1) = E(X̄) = µ   and   Var(T1) = Var(X̄) = σ²/n.

Hence T1 is an unbiased estimator of µ. So the MSE of T1 is just the variance of T1, since the bias is 0. Therefore, MSE(T1) = σ²/n.
Moving to T2, note:

E(T2) = E((X1 + Xn)/2) = (E(X1) + E(Xn))/2 = (µ + µ)/2 = µ

and:

Var(T2) = (Var(X1) + Var(Xn))/2² = 2σ²/4 = σ²/2.

So T2 is also an unbiased estimator of µ, hence MSE(T2) = σ²/2.
Finally, consider T3, noting:

E(T3) = E(X̄ + 3) = E(X̄) + 3 = µ + 3   and   Var(T3) = Var(X̄ + 3) = Var(X̄) = σ²/n.

So T3 is a positively-biased estimator of µ, with a bias of 3. Hence we have MSE(T3) = σ²/n + 3² = σ²/n + 9.
We seek the estimator with the smallest MSE. Clearly, MSE(T1) < MSE(T3) so we can eliminate T3. Now comparing T1 with T2, we note that:

for n = 2, MSE(T1) = MSE(T2), since the estimators are identical

for n > 2, MSE(T1) < MSE(T2), so T1 is preferred.

So T1 = X̄ is our preferred estimator of µ. Intuitively this should make sense. Note for n > 2, T1 uses all the information in the sample (i.e. all observations are used), unlike T2 which uses the first and last observations only. Of course, for n = 2, these estimators are identical.

³ Or, for that matter, a 'very bad' thing!
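
Returning to Example 7.4, a short Monte Carlo sketch in R makes the comparison tangible; µ = 5, σ = 2 and n = 10 are illustrative values, not taken from the text.

set.seed(1)
mu <- 5; sigma <- 2; n <- 10
T1 <- T2 <- T3 <- numeric(10000)
for (r in 1:10000) {
  x <- rnorm(n, mu, sigma)
  T1[r] <- mean(x)            # sample mean
  T2[r] <- (x[1] + x[n]) / 2  # average of the first and last observations
  T3[r] <- mean(x) + 3        # positively-biased estimator
}
c(mean((T1 - mu)^2), mean((T2 - mu)^2), mean((T3 - mu)^2))
# estimated MSEs, close to sigma^2/n = 0.4, sigma^2/2 = 2 and sigma^2/n + 9 = 9.4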

Some remarks are the following.

i. µ̂ = X̄ is a better estimator of µ than X1 as:

MSE(µ̂) = σ²/n < MSE(X1) = σ².

ii. As n → ∞, MSE(X̄) → 0, i.e. when the sample size tends to infinity, the error in estimation goes to 0. Such an estimator is called a (mean-square) consistent estimator.
Consistency is a reasonable requirement. It may be used to rule out some silly estimators.
For µ̃ = (X1 + X4)/2, MSE(µ̃) = σ²/2 which does not converge to 0 as n → ∞. This is due to the fact that only a small portion of information (i.e. X1 and X4) is used in the estimation.

iii. For any random sample {X1, X2, . . . , Xn} from a population with mean µ and variance σ², it holds that E(X̄) = µ and Var(X̄) = σ²/n. The derivation of the expected value and variance of the sample mean was covered in Chapter 6.

iv. For any independent random variables Y1, Y2, . . . , Yk and constants a1, a2, . . . , ak, then:

E(∑_{i=1}^{k} ai Yi) = ∑_{i=1}^{k} ai E(Yi)   and   Var(∑_{i=1}^{k} ai Yi) = ∑_{i=1}^{k} ai² Var(Yi).

The proof uses the fact that:

Var(∑_{i=1}^{k} ai Yi) = E((∑_{i=1}^{k} ai (Yi − E(Yi)))²).

Example 7.5 Bias by itself cannot be used to measure the quality of an estimator. Consider two artificial estimators of θ, θ̂1 and θ̂2, such that θ̂1 takes only the two values θ − 100 and θ + 100, and θ̂2 takes only the two values θ and θ + 0.2, with the following probabilities:

P(θ̂1 = θ − 100) = P(θ̂1 = θ + 100) = 0.5

and:

P(θ̂2 = θ) = P(θ̂2 = θ + 0.2) = 0.5.

Note that θ̂1 is an unbiased estimator of θ and θ̂2 is a positively-biased estimator of θ as:

Bias(θ̂2) = E(θ̂2) − θ = ((θ × 0.5) + ((θ + 0.2) × 0.5)) − θ = 0.1.

However:

MSE(θ̂1) = E((θ̂1 − θ)²) = (−100)² × 0.5 + (100)² × 0.5 = 10,000

and:

MSE(θ̂2) = E((θ̂2 − θ)²) = 0² × 0.5 + (0.2)² × 0.5 = 0.02.

Hence θ̂2 is a much better (i.e. more accurate) estimator of θ than θ̂1.

Example 7.6 Let {X1, X2, . . . , Xn} be a random sample from a population with mean µ = E(Xi) and variance σ² = Var(Xi) < ∞, for i = 1, 2, . . . , n. Let µ̂ = X̄. Find MSE(µ̂).
We compute the bias and variance separately.

E(µ̂) = E((1/n) ∑_{i=1}^{n} Xi) = (1/n) ∑_{i=1}^{n} E(Xi) = (1/n) ∑_{i=1}^{n} µ = µ.

Hence Bias(µ̂) = E(µ̂) − µ = 0. For the variance, we note the useful formula:

(∑_{i=1}^{k} ai)(∑_{j=1}^{k} bj) = ∑_{i=1}^{k} ∑_{j=1}^{k} ai bj = ∑_{i=1}^{k} ai bi + ∑_{1≤i≠j≤k} ai bj.

Especially:

(∑_{i=1}^{k} ai)² = ∑_{i=1}^{k} ai² + ∑_{1≤i≠j≤k} ai aj.

Hence:

Var(µ̂) = Var((1/n) ∑_{i=1}^{n} Xi)
        = E(((1/n) ∑_{i=1}^{n} Xi − µ)²)
        = E(((1/n) ∑_{i=1}^{n} (Xi − µ))²)
        = (1/n²)(∑_{i=1}^{n} E((Xi − µ)²) + ∑_{1≤i≠j≤n} E((Xi − µ)(Xj − µ)))
        = (1/n²)(nσ² + ∑_{1≤i≠j≤n} E(Xi − µ) E(Xj − µ))
        = σ²/n.

Hence MSE(µ̂) = MSE(X̄) = σ²/n.


Finding estimators

In general, how should we find an estimator of θ in a practical situation?


There are three conventional methods:

method of moments estimation

least squares estimation

maximum likelihood estimation.

7.5 Method of moments (MM) estimation

Method of moments estimation

Let {X1 , X2 , . . . , Xn } be a random sample from a population F (x; θ). Suppose θ has
p components (for example, for a normal population N (µ, σ 2 ), p = 2; for a Poisson
population with parameter λ, p = 1).
Let:

   µk = µk (θ) = E(Xᵏ)

denote the kth population moment, for k = 1, 2, . . .. Therefore, µk depends on the
unknown parameter θ, as everything else about the distribution F (x; θ) is known.

Denote the kth sample moment by:

   Mk = (1/n) Σ_{i=1}^{n} Xiᵏ = (X1ᵏ + X2ᵏ + · · · + Xnᵏ)/n.

The MM estimator (MME) θ̂ of θ is the solution of the p equations:

   µk (θ̂) = Mk   for k = 1, 2, . . . , p.

Example 7.7 Let {X1 , X2 , . . . , Xn } be a random sample from a population with
mean µ and variance σ² < ∞. Find the MM estimator of (µ, σ²).

There are two unknown parameters. Let:

   µ̂ = µ̂1 = M1   and   µ̂2 = M2 = (1/n) Σ_{i=1}^{n} Xi².

This gives us µ̂ = M1 = X̄.

Since σ² = µ2 − µ1² = E(X²) − (E(X))², we have:

   σ̂² = M2 − M1² = (1/n) Σ_{i=1}^{n} Xi² − X̄² = (1/n) Σ_{i=1}^{n} (Xi − X̄)².

Note we have:

   E(σ̂²) = E((1/n) Σ_{i=1}^{n} Xi² − X̄²)

         = (1/n) Σ_{i=1}^{n} E(Xi²) − E(X̄²)

         = E(X²) − E(X̄²)

         = σ² + µ² − (σ²/n + µ²)

         = (n − 1)σ²/n.

Since:

   E(σ̂²) − σ² = −σ²/n < 0

σ̂² is a negatively-biased estimator of σ².

The sample variance, defined as:

   S² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)²

is a more frequently-used estimator of σ² as it has zero bias, i.e. it is an unbiased
estimator since E(S²) = σ². This is why we use the n − 1 divisor when calculating
the sample variance.

A useful formula for computation of the sample variance is:

   S² = (1/(n − 1)) (Σ_{i=1}^{n} Xi² − nX̄²).
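The distinction between the two divisors is easy to see numerically. The following is a
minimal R sketch (the data vector x below is an assumed example; any numeric vector
would do):

x <- c(2.82, 3.01, 3.11, 2.71, 2.93, 2.68, 3.02, 3.01, 2.93, 2.56)  # assumed data
n <- length(x)

sigma2.mm <- mean(x^2) - mean(x)^2        # MM estimate, divisor n
s2 <- sum((x - mean(x))^2) / (n - 1)      # sample variance, divisor n - 1

var(x)                           # built-in var() uses the n - 1 divisor, agreeing with s2
sigma2.mm - (n - 1) * s2 / n     # zero: the MM estimate is (n - 1)/n times s2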

Note the MME does not use any information on F (x; θ) beyond the moments.

The idea is that Mk should be pretty close to µk when n is sufficiently large. In fact:

   Mk = (1/n) Σ_{i=1}^{n} Xiᵏ

converges to:

   µk = E(Xᵏ)

as n → ∞. This is due to the law of large numbers (LLN). We illustrate this
phenomenon by simulation using R.

Example 7.8 For N (2, 4), we have µ1 = 2 and µ2 = 8. We use the sample moments
M1 and M2 as estimators of µ1 and µ2 , respectively. Note how the sample moments
converge to the population moments as the sample size increases.


For a sample of size n = 10, we obtained m1 = 0.5145838 and m2 = 2.171881.

> x <- rnorm(10,2,2)


> x
[1] 0.70709403 -1.38416864 -0.01692815 2.51837989 -0.28518898 1.96998829
[7] -1.53308559 -0.42573724 1.76006933 1.83541490
> mean(x)
[1] 0.5145838
> x2 <- x^2
> mean(x2)
[1] 2.171881

For a sample of size n = 100, we obtained m1 = 2.261542 and m2 = 8.973033.

> x <- rnorm(100,2,2)


> mean(x)
[1] 2.261542
> x2 <- x^2
> mean(x2)
[1] 8.973033

For a sample of size n = 500, we obtained m1 = 1.912112 and m2 = 7.456353.

> x <- rnorm(500,2,2)


> mean(x)
[1] 1.912112
> x2 <- x^2
> mean(x2)
[1] 7.456353

Example 7.9 For a Poisson distribution with λ = 1, we have µ1 = 1 and µ2 = 2.


With a sample of size n = 500, we obtained m1 = 1.09 and m2 = 2.198.

> x <- rpois(500,1)


> mean(x)
[1] 1.09
> x2 <- x^2
> mean(x2)
[1] 2.198
> x
[1] 1 2 2 1 0 0 0 0 0 0 2 2 1 2 1 1 1 2 ...

7.6 Least squares (LS) estimation


Given a random sample {X1 , X2 , . . . , Xn } from a population with mean µ and variance
σ 2 , how can we estimate µ?

The MME of µ is the sample mean X̄ = Σ_{i=1}^{n} Xi /n.

Least squares estimator of µ

The estimator X̄ is also the least squares estimator (LSE) of µ, defined as the value of a
which minimises the sum of squares, i.e. we have:

   µ̂ = X̄ = arg min_a Σ_{i=1}^{n} (Xi − a)².

Proof: Given that S = Σ_{i=1}^{n} (Xi − a)² = Σ_{i=1}^{n} (Xi − X̄)² + n(X̄ − a)², where all terms are
non-negative, then the value of a for which S is minimised is when n(X̄ − a)² = 0, i.e.
a = X̄.
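The same minimisation can be checked numerically. The sketch below (with an assumed
small data vector x) uses R's optimize() to minimise the sum of squares and confirms
that the minimiser is the sample mean:

x <- c(2.82, 3.01, 3.11, 2.71, 2.93)            # assumed example data
sum.sq <- function(a) sum((x - a)^2)            # the least squares criterion

optimize(sum.sq, interval = range(x))$minimum   # numerical minimiser
mean(x)                                         # agrees, up to numerical error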


Estimator accuracy

In order to assess the accuracy of µ̂ = X̄ as an estimator of µ we calculate its MSE:

   MSE(µ̂) = E((µ̂ − µ)²) = σ²/n.

In order to determine the distribution of µ̂ we require knowledge of the underlying
distribution. Even if the relevant knowledge is available, one may only compute the
exact distribution of µ̂ explicitly for a limited number of cases.

By the central limit theorem, as n → ∞, we have:

   P((X̄ − µ)/(σ/√n) ≤ z) → Φ(z)

for any z, where Φ(z) is the cdf of N (0, 1), i.e. when n is large, X̄ ∼ N (µ, σ²/n)
approximately.

Hence when n is large:

   P(|X̄ − µ| ≤ 1.96 × σ/√n) ≈ 0.95.

In practice, the standard deviation σ is unknown and so we replace it by the sample
standard deviation S, where S² is the sample variance, given by:

   S² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)².

This gives an approximation of:

   P(|X̄ − µ| ≤ 1.96 × S/√n) ≈ 0.95.

To be on the safe side, the coefficient 1.96 is often replaced by 2. The estimated
standard error of X̄ is:

   E.S.E.(X̄) = S/√n = ((1/(n(n − 1))) Σ_{i=1}^{n} (Xi − X̄)²)^{1/2}.

Some remarks are the following.

i. The LSE is a geometrical solution – it minimises the sum of squared distances
between the estimated value and each observation. It makes no use of any
information about the underlying distribution.

ii. Taking the derivative of Σ_{i=1}^{n} (Xi − a)² with respect to a, and equating it to 0, we
obtain (after dividing through by −2):

   Σ_{i=1}^{n} (Xi − a) = Σ_{i=1}^{n} Xi − na = 0.

Hence the solution is µ̂ = â = X̄. This is another way to derive the LSE of µ.

7.7 Maximum likelihood (ML) estimation


We begin with an illustrative example. Maximum likelihood (ML) estimation
generalises the reasoning in the following example to arbitrary settings.

Example 7.10 Suppose we toss a coin 10 times, and record the number of ‘heads’
as a random variable X. Therefore:

X ∼ Bin(10, π)

where π = P (heads) ∈ (0, 1) is the unknown parameter.

If x = 8, what is your best guess (i.e. estimate) of π? Obviously 0.8!

Is π = 0.1 possible? Yes, but very unlikely.

Is π = 0.5 possible? Yes, but not very likely.

Is π = 0.7 or 0.9 possible? Yes, very likely.

Nevertheless, π = 0.8 is the most likely, or ‘maximally’ likely value of the parameter.
Why do we think ‘π = 0.8’ is most likely?
Let:

   L(π) = P (X = 8) = (10!/(8! 2!)) π⁸ (1 − π)².

Since x = 8 is the event which occurred in the experiment, this probability should
presumably be fairly large. Figure 7.1 shows a plot of L(π) as a function of π.

The most likely value of π should make this probability as large as possible. This
value is taken as the maximum likelihood estimate of π.

Maximising L(π) is equivalent to maximising:

   l(π) = ln(L(π)) = 8 ln π + 2 ln(1 − π) + c

where c is the constant ln(10!/(8! 2!)). Setting dl(π)/dπ = 0, we obtain the
maximum likelihood estimate π̂ = 0.8.

Figure 7.1: Plot of the likelihood function in Example 7.10.
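A plot like Figure 7.1 can be reproduced in R, and the maximising value located
numerically; the following is a minimal sketch:

L <- function(p) dbinom(8, size = 10, prob = p)    # likelihood of x = 8 from Bin(10, p)

curve(L(x), from = 0, to = 1, xlab = "pi", ylab = "L(pi)")    # likelihood plot
optimize(L, interval = c(0, 1), maximum = TRUE)$maximum       # approximately 0.8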

Maximum likelihood definition

Let f (x1 , x2 , . . . , xn ; θ) be the joint probability density function (or probability


function) for random variables (X1 , X2 , . . . , Xn ). The maximum likelihood estimator
(MLE) of θ based on the observations {X1 , X2 , . . . , Xn } is defined as:

   θ̂ = max_θ f (X1 , X2 , . . . , Xn ; θ).

Some remarks are the following.

i. The MLE depends only on the observations {X1 , X2 , . . . , Xn }, such that:

   θ̂ = θ̂(X1 , X2 , . . . , Xn ).

Therefore, θ̂ is a statistic (as it must be for an estimator of θ).

ii. If {X1 , X2 , . . . , Xn } is a random sample from a population with probability density
function f (x; θ), the joint probability density function for (X1 , X2 , . . . , Xn ) is:

   Π_{i=1}^{n} f (xi ; θ).


The joint pdf is a function of (X1 , X2 , . . . , Xn ), while θ is a parameter.

The joint pdf describes the probability distribution of {X1 , X2 , . . . , Xn }.

The likelihood function is defined as:

   L(θ) = Π_{i=1}^{n} f (Xi ; θ).     (7.3)

The likelihood function is a function of θ, while {X1 , X2 , . . . , Xn } are treated as


constants (as given observations).

The likelihood function reflects the information about the unknown parameter θ in
the data {X1 , X2 , . . . , Xn }.

Some remarks are the following.

i. The likelihood function is a function of the parameter. It is defined up to positive
constant factors. A likelihood function is not a probability density function. It
contains all the information about the unknown parameter from the observations.

ii. The MLE is θ̂ = max_θ L(θ).

iii. It is often more convenient to use the log-likelihood function4 denoted as:

   l(θ) = ln L(θ) = Σ_{i=1}^{n} ln f (Xi ; θ)

as it transforms the product in (7.3) into a sum. Note that:

   θ̂ = max_θ l(θ).

iv. For a smooth likelihood function, the MLE is often the solution of the equation:

   d l(θ)/dθ = 0.

v. If θ̂ is the MLE and φ = g(θ) is a function of θ, then φ̂ = g(θ̂) is the MLE of φ (which is
known as the invariance principle of the MLE).

vi. Unlike the MME or LSE, the MLE uses all the information about the population
distribution. It is often more efficient (i.e. more accurate) than the MME or LSE.

vii. In practice, ML estimation should be used whenever possible.

4
 Throughout where ‘log’ is used in log-likelihood functions, it will be assumed to be the logarithm to
the base e, i.e. the natural logarithm.

Example 7.11 Let {X1 , X2 , . . . , Xn } be a random sample from a distribution with
pdf:

   f (x; λ) = λ²x e^{−λx}  for x > 0, and 0 otherwise

where λ > 0 is unknown. Find the MLE of λ.

The joint pdf is f (x1 , x2 , . . . , xn ; λ) = Π_{i=1}^{n} (λ²xi e^{−λxi}) if all xi > 0, and 0 otherwise.

The likelihood function is:

   L(λ) = λ^{2n} exp(−λ Σ_{i=1}^{n} Xi) Π_{i=1}^{n} Xi = λ^{2n} exp(−nλX̄) Π_{i=1}^{n} Xi .

The log-likelihood function is l(λ) = 2n ln λ − nλX̄ + c, where c = ln(Π_{i=1}^{n} Xi) is a
constant.

Setting:

   d l(λ)/dλ = 2n/λ̂ − nX̄ = 0

we obtain λ̂ = 2/X̄.

Note the MLE λ̂ may be obtained from maximising L(λ) directly. However, it is
much easier to work with l(λ) instead.
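The closed-form answer can be checked numerically. The density above is that of a
Gamma distribution with shape 2 and rate λ, so the following R sketch (with an assumed
true value λ = 1.5) simulates data and compares a numerical maximisation of l(λ) with
2/X̄:

set.seed(1)
x <- rgamma(200, shape = 2, rate = 1.5)    # data from the pdf above with lambda = 1.5

# negative log-likelihood, omitting the additive constant c
negll <- function(l) -(2 * length(x) * log(l) - l * sum(x))

optimize(negll, interval = c(0.01, 10))$minimum   # numerical MLE
2 / mean(x)                                       # closed-form MLE, should agree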

Example 7.12 Let {X1 , X2 , . . . , Xn } be a random sample from N (µ, σ²).

The joint pdf is (2πσ²)^{−n/2} exp(−Σ_{i=1}^{n} (xi − µ)²/(2σ²)).

Case I: σ² is known.

The likelihood function is:

   L(µ) = (2πσ²)^{−n/2} exp(−(1/(2σ²)) Σ_{i=1}^{n} (Xi − µ)²)

        = (2πσ²)^{−n/2} exp(−(1/(2σ²)) Σ_{i=1}^{n} (Xi − X̄)²) exp(−(n/(2σ²))(X̄ − µ)²).

Hence the log-likelihood function is:

   l(µ) = ln((2πσ²)^{−n/2}) − (1/(2σ²)) Σ_{i=1}^{n} (Xi − X̄)² − (n/(2σ²))(X̄ − µ)².

Maximising l(µ) with respect to µ gives µ̂ = X̄.

Case II: σ² is unknown.

The likelihood function is:

   L(µ, σ²) = (2π)^{−n/2} (σ²)^{−n/2} exp(−(1/(2σ²)) Σ_{i=1}^{n} (Xi − µ)²).

Hence the log-likelihood function is:

   l(µ, σ²) = −(n/2) ln(σ²) − (1/(2σ²)) Σ_{i=1}^{n} (Xi − µ)² + c

where c = −(n/2) ln(2π). Regardless of the value of σ², l(X̄, σ²) ≥ l(µ, σ²). Hence
µ̂ = X̄.

The MLE of σ² should maximise:

   l(X̄, σ²) = −(n/2) ln(σ²) − (1/(2σ²)) Σ_{i=1}^{n} (Xi − X̄)² + c.

It follows from the lemma below that σ̂² = Σ_{i=1}^{n} (Xi − X̄)²/n.

Lemma: Let g(x) = −a ln(x) − b/x, where a, b > 0, then:

   g(b/a) = max_{x>0} g(x).

Proof: Letting g′(x) = −a/x + b/x² = 0 leads to the solution x = b/a.

Now suppose we wanted to find the MLE of γ = σ/µ.

Since γ = γ(µ, σ), by the invariance principle the MLE of γ is:

   γ̂ = γ(µ̂, σ̂) = σ̂/µ̂ = √(Σ_{i=1}^{n} (Xi − X̄)²/n) / (Σ_{i=1}^{n} Xi /n).

Example 7.13 Consider a population with three types of individuals labelled 1, 2
and 3, and occurring according to the Hardy–Weinberg proportions:

   p(1; θ) = θ²,   p(2; θ) = 2θ(1 − θ)   and   p(3; θ) = (1 − θ)²

where 0 < θ < 1. Note that p(1; θ) + p(2; θ) + p(3; θ) = 1.

A random sample of size n is drawn from this population with n1 observed values
equal to 1 and n2 observed values equal to 2 (therefore, there are n − n1 − n2 values
equal to 3). Find the MLE of θ.

Let us assume {X1 , X2 , . . . , Xn } is the sample (i.e. n observed values). Among them,
there are n1 ‘1’s, n2 ‘2’s, and n − n1 − n2 ‘3’s. The likelihood function is (where ∝
means ‘proportional to’):

   L(θ) = Π_{i=1}^{n} p(Xi ; θ) = p(1; θ)^{n1} p(2; θ)^{n2} p(3; θ)^{n−n1−n2}

        = θ^{2n1} (2θ(1 − θ))^{n2} (1 − θ)^{2(n−n1−n2)}

        ∝ θ^{2n1+n2} (1 − θ)^{2n−2n1−n2}.

The log-likelihood is therefore, up to an additive constant:

   l(θ) = (2n1 + n2 ) ln θ + (2n − 2n1 − n2 ) ln(1 − θ).

Setting:

   d l(θ)/dθ = (2n1 + n2 )/θ̂ − (2n − 2n1 − n2 )/(1 − θ̂) = 0

that is:

   (1 − θ̂)(2n1 + n2 ) = θ̂(2n − 2n1 − n2 )

leads to the MLE:

   θ̂ = (2n1 + n2 )/(2n).

For example, for a sample with n = 4, n1 = 1 and n2 = 2, we obtain a point estimate
of θ̂ = 0.5.
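As a quick numerical check (a sketch only, using the small counts from the example
above), the log-likelihood can be maximised in R and compared with the closed-form
answer:

n <- 4; n1 <- 1; n2 <- 2    # counts from the example above

loglik <- function(theta) (2*n1 + n2) * log(theta) + (2*n - 2*n1 - n2) * log(1 - theta)

optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)$maximum   # about 0.5
(2*n1 + n2) / (2*n)                                                    # closed form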

7.8 Asymptotic distribution of MLEs

Let {X1 , X2 , . . . , Xn } be a random sample from a population with a smooth pdf f (x; θ),
and θ is a scalar. Denote as:

   θ̂ = θ̂(X1 , X2 , . . . , Xn )

the MLE of θ. Under some regularity conditions, the distribution of √n(θ̂ − θ)
converges to N (0, 1/I(θ)) as n → ∞, where I(θ) is the Fisher information defined as:

   I(θ) = −∫_{−∞}^{∞} (∂² ln f (x; θ)/∂θ²) f (x; θ) dx.

Some remarks are the following.

i. When n is large, θ̂ ∼ N (θ, (nI(θ))⁻¹) approximately.

ii. For a discrete distribution with probability function p(x; θ), then:

   I(θ) = −Σ_x (∂² ln p(x; θ)/∂θ²) p(x; θ).

Example 7.14 For N (µ, σ²) with σ² known, we have:

   f (x; µ) = (2πσ²)^{−1/2} exp(−(x − µ)²/(2σ²)).

Therefore:

   ln f (x; µ) = −(1/2) ln(2πσ²) − (x − µ)²/(2σ²).

Hence:

   d ln f (x; µ)/dµ = (x − µ)/σ²   and   d² ln f (x; µ)/dµ² = −1/σ².

Therefore:

   I(µ) = −∫_{−∞}^{∞} (−1/σ²) f (x; µ) dx = 1/σ².

The MLE of µ is X̄, and hence X̄ ∼ N (µ, σ²/n).

Example 7.15 For the Poisson distribution, p(x; λ) = λ^x e^{−λ}/x!. Therefore:

   ln p(x; λ) = x ln λ − λ − ln(x!).

Hence:

   d ln p(x; λ)/dλ = x/λ − 1   and   d² ln p(x; λ)/dλ² = −x/λ².

Therefore:

   I(λ) = (1/λ²) Σ_{x=0}^{∞} x p(x; λ) = (1/λ²) E(X) = 1/λ.

The MLE of λ is X̄. Hence X̄ ∼ N (λ, λ/n) approximately, when n is large.
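The approximate normality of the MLE in Example 7.15 is easy to see by simulation.
The following R sketch (with assumed values λ = 1 and n = 500) repeatedly draws
Poisson samples and examines the sampling distribution of X̄:

set.seed(42)
lambda <- 1; n <- 500

mle <- replicate(2000, mean(rpois(n, lambda)))   # 2,000 replications of the MLE

mean(mle)    # close to lambda
var(mle)     # close to lambda/n = 0.002
hist(mle)    # approximately bell-shaped, as the theory suggests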

7.9 Overview of chapter


This chapter introduced point estimation. Key properties of estimators were explored
and the characteristics of a desirable estimator were studied through the calculation of
the mean squared error. Methods for finding estimators of parameters were also
described, including method of moments, least squares and maximum likelihood
estimation.

7.10 Key terms and concepts


Bias Consistent estimator
Fisher information Information
Invariance principle Law of large numbers (LLN)
Least squares estimation Likelihood function
Log-likelihood function Maximum likelihood estimation
Mean absolute deviation (MAD) Mean squared error (MSE)
Method of moments estimation Parameter
Point estimate Point estimator
Population moment Random sample
Sample moment Standard error
Statistic Unbiased


The group was alarmed to find that if you are a labourer, cleaner or dock
worker, you are twice as likely to die than a member of the professional classes.
(The Sunday Times, 31 August 1980)

Chapter 8
Interval estimation

8.1 Synopsis of chapter


This chapter covers interval estimation – a natural extension of point estimation. Due
to the almost inevitable sampling error, we wish to communicate the level of
uncertainty in our point estimate by constructing confidence intervals.

8.2 Learning outcomes


After completing this chapter, you should be able to:

explain the coverage probability of a confidence interval

construct confidence intervals for means of normal and non-normal populations


when the variance is known and unknown

construct confidence intervals for the variance of a normal population

explain the link between confidence intervals and distribution theory, and critique
the assumptions made to justify the use of various confidence intervals.

8.3 Introduction
Point estimation is simple but not informative enough, since a point estimator is
always subject to errors. A more scientific approach is to find an upper bound
U = U (X1 , X2 , . . . , Xn ) and a lower bound L = L(X1 , X2 , . . . , Xn ), and hope that the
unknown parameter θ lies between the two bounds L and U (life is not always as simple
as that, but it is a good start).
An intuitive guess for estimating the population mean would be:

L = X̄ − k × S.E.(X̄) and U = X̄ + k × S.E.(X̄)

where k > 0 is a constant and S.E.(X̄) is the standard error of the sample mean.
The (random) interval (L, U ) forms an interval estimator of θ. For estimation to be
as precise as possible, intuitively the width of the interval, U − L, should be small.


Typically, the coverage probability:

P (L(X1 , X2 , . . . , Xn ) < θ < U (X1 , X2 , . . . , Xn )) < 1.

Ideally, we should choose L and U such that:

the width of the interval is as small as possible

the coverage probability is as large as possible.

8.4 Interval estimation for means of normal
distributions

Let us consider a simple example. We have a random sample {X1 , X2 , . . . , Xn } from the
distribution N (µ, σ²), with σ² known.

From Chapter 7, we have reason to believe that X̄ is a good estimator of µ. We also
know X̄ ∼ N (µ, σ²/n), and hence:

   (X̄ − µ)/(σ/√n) = √n(X̄ − µ)/σ ∼ N (0, 1).

Therefore, supposing a 95% coverage probability:

   0.95 = P(√n|X̄ − µ|/σ ≤ 1.96)

        = P(|µ − X̄| ≤ 1.96 × σ/√n)

        = P(−1.96 × σ/√n < µ − X̄ < 1.96 × σ/√n)

        = P(X̄ − 1.96 × σ/√n < µ < X̄ + 1.96 × σ/√n).

Therefore, the interval covering µ with probability 0.95 is:

   (X̄ − 1.96 × σ/√n, X̄ + 1.96 × σ/√n)

which is called a 95% confidence interval for µ.

Example 8.1 Suppose σ = 1, n = 4, and x̄ = 2.25, then a 95% confidence interval
for µ is:

   (2.25 − 1.96 × 1/√4, 2.25 + 1.96 × 1/√4) = (1.27, 3.23).

Instead of a simple point estimate of µ̂ = 2.25, we say µ is between 1.27 and 3.23 at
the 95% confidence level.


What is P (1.27 < µ < 3.23) = 0.95 in Example 8.1? Well, this probability does not
mean anything, since µ is an unknown constant!
We treat (1.27, 3.23) as one realisation of the random interval (X̄ − 0.98, X̄ + 0.98)
which covers µ with probability 0.95.
What is the meaning of ‘with probability 0.95’ ? If one repeats the interval estimation a
large number of times, about 95% of the time the interval estimator covers the true µ.
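This repeated-sampling interpretation can be illustrated with a small simulation. The R
sketch below assumes a true µ = 2 and σ = 1 with n = 4, builds the interval many times,
and records how often it covers µ:

set.seed(7)
mu <- 2; sigma <- 1; n <- 4

covers <- replicate(10000, {
  xbar <- mean(rnorm(n, mu, sigma))
  (xbar - 1.96 * sigma / sqrt(n) < mu) & (mu < xbar + 1.96 * sigma / sqrt(n))
})

mean(covers)   # proportion of intervals covering mu, close to 0.95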
Some remarks are the following.

i. The confidence level is often specified as 90%, 95% or 99%. Obviously the higher
the confidence level, the wider the interval.

For the normal distribution example:

   0.90 = P(√n|X̄ − µ|/σ ≤ 1.645) = P(X̄ − 1.645 × σ/√n < µ < X̄ + 1.645 × σ/√n)

   0.95 = P(√n|X̄ − µ|/σ ≤ 1.96) = P(X̄ − 1.96 × σ/√n < µ < X̄ + 1.96 × σ/√n)

   0.99 = P(√n|X̄ − µ|/σ ≤ 2.576) = P(X̄ − 2.576 × σ/√n < µ < X̄ + 2.576 × σ/√n).

The widths of the three intervals are 2 × 1.645 × σ/√n, 2 × 1.96 × σ/√n and
2 × 2.576 × σ/√n, corresponding to the confidence levels of 90%, 95% and 99%,
respectively.

To achieve a 100% confidence level in the normal example, the width of the interval
would have to be infinite!

ii. Among all the confidence intervals at the same confidence level, the one with the
smallest width gives the most accurate estimation and is, therefore, optimal.

iii. For a distribution with a symmetric unimodal density function, optimal confidence
intervals are symmetric, as depicted in Figure 8.1.

Figure 8.1: Symmetric unimodal density function showing that a given probability is
represented by the narrowest interval when symmetric about the mean.

Dealing with unknown σ

In practice the standard deviation σ is typically unknown, and we replace it with the
sample standard deviation:

   S = ((1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)²)^{1/2}

leading to a confidence interval for µ of the form:

   (X̄ − k × S/√n, X̄ + k × S/√n)

where k is a constant determined by the confidence level and also by the distribution of
the statistic:

   (X̄ − µ)/(S/√n).     (8.1)

However, the distribution of (8.1) is no longer normal – it is the Student’s t distribution.

8.4.1 An important property of normal samples

Let {X1 , X2 , . . . , Xn } be a random sample from N (µ, σ²). Suppose:

   X̄ = (1/n) Σ_{i=1}^{n} Xi ,   S² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)²   and   E.S.E.(X̄) = S/√n

where E.S.E.(X̄) denotes the estimated standard error of the sample mean.

i. X̄ ∼ N (µ, σ²/n) and (n − 1)S²/σ² ∼ χ²_{n−1} .

ii. X̄ and S² are independent, therefore:

   (√n(X̄ − µ)/σ) / √((n − 1)S²/((n − 1)σ²)) = (X̄ − µ)/(S/√n) = (X̄ − µ)/E.S.E.(X̄) ∼ t_{n−1} .

An accurate 100(1 − α)% confidence interval for µ, where α ∈ (0, 1), is:

   (X̄ − c × S/√n, X̄ + c × S/√n) = (X̄ − c × E.S.E.(X̄), X̄ + c × E.S.E.(X̄))

where c > 0 is a constant such that P (T > c) = α/2, where T ∼ t_{n−1} .
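In R the constant c comes from qt(), and the whole interval can be checked against the
built-in t.test() function. A minimal sketch, with an assumed data vector x, is:

x <- c(2.82, 3.01, 3.11, 2.71, 2.93, 2.68, 3.02, 3.01, 2.93, 2.56)   # assumed data
n <- length(x); alpha <- 0.05

c.val <- qt(1 - alpha/2, df = n - 1)            # the constant c above
mean(x) + c(-1, 1) * c.val * sd(x) / sqrt(n)    # 95% confidence interval for mu

t.test(x)$conf.int                              # built-in check, same interval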

8.5 Approximate confidence intervals

8.5.1 Means of non-normal distributions

Let {X1 , X2 , . . . , Xn } be a random sample from a non-normal distribution with mean µ
and variance σ² < ∞.

When n is large, √n(X̄ − µ)/σ is N (0, 1) approximately.

Therefore, we have an approximate 95% confidence interval for µ given by:

   (X̄ − 1.96 × S/√n, X̄ + 1.96 × S/√n)

where S is the sample standard deviation. Note that it is a two-stage approximation.

1. Approximate the distribution of √n(X̄ − µ)/σ by N (0, 1).

2. Approximate σ by S.

Example 8.2 The salary data of 253 graduates from a UK business school (in
thousands of pounds) yield the following: n = 253, x̄ = 47.126, s = 6.843 and so
s/√n = 0.43.

A point estimate of the average salary µ is x̄ = 47.126.

An approximate 95% confidence interval for µ is:

   47.126 ± 1.96 × 0.43  ⇒  (46.283, 47.969).

8.5.2 MLE-based confidence intervals

Let {X1 , X2 , . . . , Xn } be a random sample from a smooth distribution with unknown
parameter θ. Let θ̂ = θ̂(X1 , X2 , . . . , Xn ) be the MLE of θ.

Under some regularity conditions, it holds that θ̂ ∼ N (θ, (nI(θ))⁻¹) approximately,
when n is large, where I(θ) is the Fisher information.

This leads to the following approximate 95% confidence interval for θ:

   (θ̂ − 1.96 × (nI(θ̂))^{−1/2} , θ̂ + 1.96 × (nI(θ̂))^{−1/2}).

8.6 Use of the chi-squared distribution

Let X1 , X2 , . . . , Xn be independent N (µ, σ²) random variables. Therefore:

   (Xi − µ)/σ ∼ N (0, 1).

Hence:

   (1/σ²) Σ_{i=1}^{n} (Xi − µ)² ∼ χ²_n .

Note that:

   (1/σ²) Σ_{i=1}^{n} (Xi − µ)² = (1/σ²) Σ_{i=1}^{n} (Xi − X̄)² + n(X̄ − µ)²/σ².     (8.2)

Proof: We have:

   Σ_{i=1}^{n} (Xi − µ)² = Σ_{i=1}^{n} ((Xi − X̄) + (X̄ − µ))²

                        = Σ_{i=1}^{n} (Xi − X̄)² + Σ_{i=1}^{n} (X̄ − µ)² + 2 Σ_{i=1}^{n} (Xi − X̄)(X̄ − µ)

                        = Σ_{i=1}^{n} (Xi − X̄)² + n(X̄ − µ)² + 2(X̄ − µ) Σ_{i=1}^{n} (Xi − X̄)

                        = Σ_{i=1}^{n} (Xi − X̄)² + n(X̄ − µ)².

Hence:

   (1/σ²) Σ_{i=1}^{n} (Xi − µ)² = (1/σ²) Σ_{i=1}^{n} (Xi − X̄)² + n(X̄ − µ)²/σ².

Since X̄ ∼ N (µ, σ²/n), then n(X̄ − µ)²/σ² ∼ χ²_1 . It can be proved that:

   (1/σ²) Σ_{i=1}^{n} (Xi − X̄)² ∼ χ²_{n−1} .

Therefore, decomposition (8.2) is an instance of the relationship:

   χ²_n = χ²_{n−1} + χ²_1 .

8.7 Interval estimation for variances of normal
distributions

Let {X1 , X2 , . . . , Xn } be a random sample from a normal population with mean µ and
variance σ² < ∞.

Let M = Σ_{i=1}^{n} (Xi − X̄)² = (n − 1)S², then M/σ² ∼ χ²_{n−1} .

For any given small α ∈ (0, 1), we can find 0 < k1 < k2 such that:

   P (X < k1 ) = P (X > k2 ) = α/2

where X ∼ χ²_{n−1} . Therefore:

   1 − α = P (k1 < M/σ² < k2 ) = P (M/k2 < σ² < M/k1 ).

Hence a 100(1 − α)% confidence interval for σ² is:

   (M/k2 , M/k1 ).

Example 8.3 Suppose n = 15 and the sample variance is s² = 24.5. Let α = 0.05.
From Table 8 of Murdoch and Barnes’ Statistical Tables, we find:

   P (X < 5.629) = P (X > 26.119) = 0.025

where X ∼ χ²_14 .

Hence a 95% confidence interval for σ² is:

   (M/26.119, M/5.629) = (14 × S²/26.119, 14 × S²/5.629)

                       = (0.536 × S², 2.487 × S²)

                       = (13.132, 60.934).

In the above calculation, we have used the formula:

   S² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)² = (1/(n − 1)) × M.
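The same interval can be obtained in R with qchisq(), which plays the role of Table 8:

n <- 15; s2 <- 24.5; alpha <- 0.05
M <- (n - 1) * s2

k1 <- qchisq(alpha/2, df = n - 1)        # 5.629
k2 <- qchisq(1 - alpha/2, df = n - 1)    # 26.119

c(M / k2, M / k1)                        # 95% confidence interval for sigma^2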

8.8 Overview of chapter


This chapter covered interval estimation. A confidence interval converts a point
estimate of an unknown parameter into an interval estimate, reflecting the likely
sampling error. The chapter demonstrated how to construct confidence intervals for
means and variances of normal populations.

8.9 Key terms and concepts


Confidence interval Coverage probability
Interval estimator Interval width

A statistician took the Dale Carnegie Course, improving his confidence from
95% to 99%.
(Anon)

Chapter 9
Hypothesis testing

9.1 Synopsis of chapter


This chapter discusses hypothesis testing which is used to answer questions about an
unknown parameter. We consider how to perform an appropriate hypothesis test for a
given problem, determine error probabilities and test power, and draw appropriate
conclusions from a hypothesis test.

9.2 Learning outcomes


After completing this chapter, you should be able to:

define and apply the terminology of hypothesis testing

conduct statistical tests of all the types covered in the chapter

calculate the power of some of the simpler tests

explain the construction of rejection regions as a consequence of prior distributional


results, with reference to the significance level and power.

9.3 Introduction
Hypothesis testing, together with statistical estimation, are the two most
frequently-used statistical inference methods. Hypothesis testing addresses a different
type of practical question from statistical estimation.
Based on the data, a (statistical) test is to make a binary decision on a hypothesis,
denoted by H0 :
reject H0 or not reject H0 .

9.4 Introductory examples

Example 9.1 Consider a simple experiment – toss a coin 20 times.


Let {X1 , X2 , . . . , X20 } be the outcomes where ‘heads’ → Xi = 1, and ‘tails’
→ Xi = 0.
Hence the probability distribution is P (Xi = 1) = π = 1 − P (Xi = 0), for π ∈ (0, 1).

Estimation would involve estimating π, using π̂ = X̄ = (X1 + X2 + · · · + X20 )/20.

Testing involves assessing if a hypothesis such as ‘the coin is fair’ is true or not. For
example, this particular hypothesis can be formally represented as:

   H0 : π = 0.50.

We cannot be sure what the answer is just from the data.

If π̂ = 0.90, H0 is unlikely to be true.

If π̂ = 0.45, H0 may be true (and also may be untrue).

If π̂ = 0.70, what to do then?

Example 9.2 A customer complains that the amount of coffee powder in a coffee
tin is less than the advertised weight of 3 pounds.
A random sample of 20 tins is selected, resulting in an average weight of x̄ = 2.897
pounds. Is this sufficient to substantiate the complaint?
Again statistical estimation cannot provide a firm answer, due to random
fluctuations between different random samples. So we cast the problem into a
hypothesis testing problem as follows.
Let the weight of coffee in a tin be a normal random variable X ∼ N (µ, σ 2 ). We
need to test the hypothesis µ < 3. In fact, we use the data to test the hypothesis:

H0 : µ = 3.

If we could reject H0 , the customer complaint would be vindicated.

Example 9.3 Suppose one is interested in evaluating the mean income (in £000s)
of a community. Suppose income in the population is modelled as N (µ, 25) and a
random sample of n = 25 observations is taken, yielding the sample mean x̄ = 17.
Independently of the data, three expert economists give their own opinions as
follows.

Dr A claims the mean income is µ = 16.

Ms B claims the mean income is µ = 15.

Mr C claims the mean income is µ = 14.

How would you assess these experts’ statements?


X̄ ∼ N (µ, σ 2 /n) = N (µ, 1). We assess the statements based on this distribution.
If Dr A’s claim is correct, X̄ ∼ N (16, 1). The observed value x̄ = 17 is one standard
deviation away from µ, and may be regarded as a typical observation from the
distribution. Hence there is little inconsistency between the claim and the data
evidence. This is shown in Figure 9.1.


If Ms B’s claim is correct, X̄ ∼ N (15, 1). The observed value x̄ = 17 begins to look a
bit ‘extreme’, as it is two standard deviations away from µ. Hence there is some
inconsistency between the claim and the data evidence. This is shown in Figure 9.2.
If Mr C’s claim is correct, X̄ ∼ N (14, 1). The observed value x̄ = 17 is very extreme,
as it is three standard deviations away from µ. Hence there is strong inconsistency
between the claim and the data evidence. This is shown in Figure 9.3.

Figure 9.1: Comparison of claim and data evidence for Dr A in Example 9.3.

Figure 9.2: Comparison of claim and data evidence for Ms B in Example 9.3.

Figure 9.3: Comparison of claim and data evidence for Mr C in Example 9.3.


9.5 Setting p-value, significance level, test statistic


A measure of the discrepancy between the hypothesised (claimed) value of µ and the
observed value X̄ = x̄ is the probability of observing X̄ = x̄ or more extreme values
under the null hypothesis. This probability is called the p-value.

Example 9.4 Continuing Example 9.3:

under H0 : µ = 16, P (X̄ ≥ 17) + P (X̄ ≤ 15) = P (|X̄ − 16| ≥ 1) = 0.317

under H0 : µ = 15, P (X̄ ≥ 17) + P (X̄ ≤ 13) = P (|X̄ − 15| ≥ 2) = 0.046

under H0 : µ = 14, P (X̄ ≥ 17) + P (X̄ ≤ 11) = P (|X̄ − 14| ≥ 3) = 0.003.

In summary, we reject the hypothesis µ = 15 or µ = 14, as, for example, if the


hypothesis µ = 14 is true, the probability of observing x̄ = 17, or more extreme
values, would be as small as 0.003. We are comfortable with this decision, as a small
probability event would be very unlikely to occur in a single experiment.
On the other hand, we cannot reject the hypothesis µ = 16. However, this does not
imply that this hypothesis is necessarily true as, for example, µ = 17 or 18 are at
least as likely as µ = 16. Remember:

not reject ≠ accept.

A statistical test is incapable of ‘accepting’ a hypothesis.

Definition of p-values

A p-value is the probability of the event that the test statistic takes the observed
value or more extreme (i.e. more unlikely) values under H0 . It is a measure of the
discrepancy between the hypothesis H0 and the data.

• A ‘small’ p-value indicates that H0 is not supported by the data.

• A ‘large’ p-value indicates that H0 is not inconsistent with the data.

So p-values may be seen as a risk measure of rejecting H0 , as shown in Figure 9.4.

9.5.1 General setting of hypothesis tests


Let {X1 , X2 , . . . , Xn } be a random sample from a distribution with cdf F (x; θ). We are
interested in testing the hypotheses:
H0 : θ = θ 0 vs. H1 : θ ∈ Θ1
where θ0 is a fixed value, Θ1 is a set, and θ0 ∉ Θ1 .

H0 is called the null hypothesis.


H1 is called the alternative hypothesis.


Figure 9.4: Interpretation of p-values as a risk measure.

The significance level is based on α, which is a small number between 0 and 1


selected subjectively. Often we choose α = 0.10, 0.05 or 0.01, i.e. tests are often
conducted at the significance levels of 10%, 5% or 1%, respectively. So we test at the
100α% significance level.
Our decision is to reject H0 if the p-value is ≤ α.

9.5.2 Statistical testing procedure


1. Find a test statistic T = T (X1 , X2 , . . . , Xn ). Denote by t the value of T for the
given sample of observations under H0 .

2. Compute the p-value:

p = Pθ0 (T = t or more ‘extreme’ values)

where Pθ0 denotes the probability distribution such that θ = θ0 .

3. If p ≤ α we reject H0 . Otherwise, H0 is not rejected.

Our understanding of ‘extremity’ is defined by the alternative hypothesis H1 . This will


become clear in subsequent examples. The significance level determines which p-values
are considered ‘small’.

Example 9.5 Let {X1 , X2 , . . . , X20 }, taking values either 1 or 0, be the outcomes
of an experiment of tossing a coin 20 times, where:

P (Xi = 1) = π = 1 − P (Xi = 0) for π ∈ (0, 1).

We are interested in testing:

H0 : π = 0.50 vs. H1 : π ≠ 0.50.

Suppose there are 17 Xi s taking the value 1, and 3 Xi s taking the value 0. Will you
reject the null hypothesis at the 5% significance level?

Let T = X1 + X2 + · · · + X20 . Therefore, T ∼ Bin(20, π). We use T as the test
statistic. With the given sample, we observe t = 17. What are the more extreme
values of T if H0 is true?

Under H0 , E(T ) = nπ0 = 10. Hence 3 is as extreme as 17, and the more extreme
values are:

   0, 1, 2, 18, 19 and 20.

Therefore, the p-value is:

   Σ_{i=0}^{3} PH0 (T = i) + Σ_{i=17}^{20} PH0 (T = i)

      = Σ_{i=0}^{3} (20!/(i! (20 − i)!)) (0.50)^i (0.50)^{20−i} + Σ_{i=17}^{20} (20!/(i! (20 − i)!)) (0.50)^i (0.50)^{20−i}

      = 2 × (0.50)^{20} × Σ_{i=0}^{3} 20!/(i! (20 − i)!)

      = 2 × (0.50)^{20} × (1 + 20 + (20 × 19)/2! + (20 × 19 × 18)/3!)

      = 0.0026.

So we reject the null hypothesis of a fair coin at the 5% significance level (indeed, the
p-value is so small that we would also reject at the 1% significance level).
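The p-value can be verified in R, either directly from the binomial cdf or with the
built-in exact test:

2 * pbinom(3, size = 20, prob = 0.5)       # two-sided p-value, about 0.0026

binom.test(17, n = 20, p = 0.5)$p.value    # built-in exact binomial test, same value here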

9.5.3 Two-sided tests for normal means

Let {X1 , X2 , . . . , Xn } be a random sample from N (µ, σ²). Assume σ² > 0 is known. We
are interested in testing the hypotheses:

   H0 : µ = µ0   vs.   H1 : µ ≠ µ0

where µ0 is a given constant.

Intuitively, if H0 is true, X̄ = Σ_{i=1}^{n} Xi /n should be close to µ0 . Therefore, large values of
|X̄ − µ0 | suggest a departure from H0 .

Under H0 , X̄ ∼ N (µ0 , σ²/n), i.e. √n(X̄ − µ0 )/σ ∼ N (0, 1). Hence the test statistic
may be defined as:

   T = √n(X̄ − µ0 )/σ = (X̄ − µ0 )/(σ/√n) ∼ N (0, 1)

and we reject H0 for sufficiently ‘large’ values of |T |.

How large is ‘large’ ? This is determined by the significance level.

Suppose µ0 = 3, σ = 0.148, n = 20 and x̄ = 2.897. Therefore, the observed value of T is
t = √20 × (2.897 − 3)/0.148 = −3.112. Hence the p-value is:

   Pµ0 (|T | ≥ 3.112) = P (|Z| > 3.112) = 0.0019

where Z ∼ N (0, 1). Therefore, the null hypothesis of µ = 3 will be rejected even at the
1% significance level.

Alternatively, for a given 100α% significance level we may find the critical value cα
such that Pµ0 (|T | > cα ) = α. Therefore, the p-value is ≤ α if and only if the observed
value of |T | ≥ cα .

Using this alternative approach, we do not need to compute the p-value.

For this example, cα = zα/2 , that is the top 100α/2th percentile of N (0, 1), i.e. the
z-value which cuts off α/2 probability in the upper tail of the standard normal
distribution.

For α = 0.10, 0.05 and 0.01, zα/2 = 1.645, 1.96 and 2.576, respectively. Since we observe
|t| = 3.112, the null hypothesis is rejected at all three significance levels.

9.5.4 One-sided tests for normal means

Let {X1 , X2 , . . . , Xn } be a random sample from N (µ, σ²) with σ² > 0 known. We are
interested in testing the hypotheses:

   H0 : µ = µ0   vs.   H1 : µ < µ0

where µ0 is a known constant.

Under H0 , T = √n(X̄ − µ0 )/σ ∼ N (0, 1). We continue to use T as the test statistic. For
H1 : µ < µ0 we should reject H0 when t ≤ c, where c < 0 is a constant.

For a given 100α% significance level, the critical value c should be chosen such that:

   α = Pµ0 (T ≤ c) = P (Z ≤ c).

Therefore, c is the 100αth percentile of N (0, 1). Due to the symmetry of N (0, 1),
c = −zα , where zα is the top 100αth percentile of N (0, 1), i.e. P (Z > zα ) = α, where
Z ∼ N (0, 1). For α = 0.05, zα = 1.645. We reject H0 if t ≤ −1.645.

Example 9.6 Suppose µ0 = 3, σ = 0.148, n = 20 and x̄ = 2.897, then:

   t = √20 × (2.897 − 3)/0.148 = −3.112 < −1.645.

So the null hypothesis of µ = 3 is rejected at the 5% significance level as there is
significant evidence from the data that the true mean is likely to be smaller than 3.

Some remarks are the following.

i. We use a one-tailed test when we are only interested in the departure from H0 in
one direction.

ii. The distribution of a test statistic under H0 must be known in order to calculate
p-values or critical values.

iii. A test may be carried out by either computing the p-value or determining the
critical value.

iv. The probability of incorrect decisions in hypothesis testing is typically positive. For
example, the significance level is the probability of rejecting a true H0 .


9.6 t tests

t tests are one of the most frequently-used statistical tests.

Let {X1 , X2 , . . . , Xn } be a random sample from N (µ, σ²), where both µ and σ² > 0 are
unknown. We are interested in testing the hypotheses:

   H0 : µ = µ0   vs.   H1 : µ < µ0

where µ0 is known.

Now we cannot use √n(X̄ − µ0 )/σ as a statistic, since σ is unknown. Naturally we
replace it by S, where:

   S² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)².

The test statistic is then the famous t statistic:

   T = √n(X̄ − µ0 )/S = (X̄ − µ0 )/(S/√n).

We reject H0 if t < c, where c is the critical value determined by the significance level:

   PH0 (T < c) = α

where PH0 denotes the distribution under H0 (with mean µ0 and unknown σ²).

Under H0 , T ∼ t_{n−1} . Hence:

   α = PH0 (T < c)

i.e. c is the 100αth percentile of the t distribution with n − 1 degrees of freedom. By
symmetry, c = −t_{α, n−1} , where t_{α, k} denotes the top 100αth percentile of the t_k
distribution.

Example 9.7 To deal with the customer complaint that the average amount of
coffee powder in a coffee tin is less than the advertised 3 pounds, 20 tins were
weighed, yielding the following observations:

2.82, 3.01, 3.11, 2.71, 2.93, 2.68, 3.02, 3.01, 2.93, 2.56,
2.78, 3.01, 3.09, 2.94, 2.82, 2.81, 3.05, 3.01, 2.85, 2.79.

The sample mean and standard deviation are, respectively:

x̄ = 2.897 and s = 0.148.

To test H0 : µ = 3 vs. H1 : µ < 3 at the 1% significance level, the critical value is


c = −t0.01, 19 = −2.539.

Since t = √20 × (2.897 − 3)/0.148 = −3.112 < −2.539, we reject the null hypothesis
that µ = 3 at the 1% significance level.
We conclude that there is highly significant evidence which supports the claim that
the mean amount of coffee is less than 3 pounds.
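The whole calculation can be reproduced in R with the built-in t.test() function applied
to the 20 observations:

coffee <- c(2.82, 3.01, 3.11, 2.71, 2.93, 2.68, 3.02, 3.01, 2.93, 2.56,
            2.78, 3.01, 3.09, 2.94, 2.82, 2.81, 3.05, 3.01, 2.85, 2.79)

t.test(coffee, mu = 3, alternative = "less")   # one-sided t test of H0: mu = 3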

Note the hypotheses tested are in fact:

   H0 : µ = µ0 , σ² > 0   vs.   H1 : µ ≠ µ0 , σ² > 0.

Although H0 does not specify the population distribution completely (σ² > 0), the
distribution of the test statistic, T , under H0 is completely known. This enables us
to find the critical value or p-value.

9.7 General approach to statistical tests


Let {X1 , X2 , . . . , Xn } be a random sample from the distribution F (x; θ). We are
interested in testing:
H0 : θ ∈ Θ0 vs. H1 : θ ∈ Θ1
where Θ0 and Θ1 are two non-overlapping sets. A general approach to test the above
hypotheses at the 100α% significance level may be described as follows.

1. Find a test statistic T = T (X1 , X2 , . . . , Xn ) such that the distribution of T under


H0 is known.

2. Identify a critical region C such that:

PH0 (T ∈ C) = α.

3. If the observed value of T with the given sample is in the critical region C, H0 is
rejected. Otherwise, H0 is not rejected.

In order to make a test powerful in the sense that the chance of making an incorrect
decision is small, the critical region should consist of those values of T which are least
supportive of H0 (i.e. which lie in the direction of H1 ).

9.8 Two types of error


Statistical tests are often associated with two kinds of decision errors, which are
displayed in the following table:

                                    Decision made
                           H0 not rejected       H0 rejected
True state    H0 true      Correct decision      Type I error
of nature     H1 true      Type II error         Correct decision

Some remarks are the following.

i. Ideally we would like to have a test which minimises the probabilities of making
both types of error, which unfortunately is not feasible.


ii. The probability of making a Type I error is the significance level, which is under
our control.
iii. We do not have explicit control over the probability of a Type II error. For a given
significance level, we try to choose a test statistic such that the probability of a
Type II error is small.
iv. The power function of the test is defined as:
β(θ) = Pθ (H0 is rejected) for θ ∈ Θ1
i.e. β(θ) = 1 − P (Type II error).
v. The null hypothesis H0 and the alternative hypothesis H1 are not treated equally in
a statistical test, i.e. there is an asymmetric treatment. The choice of H0 is based
on the subject matter concerned and/or technical convenience.
vi. It is more conclusive to end a test with H0 rejected, as the decision of ‘not reject
H0 ’ does not imply that H0 is accepted.

9.9 Tests for variances of normal distributions

Example 9.8 A container-filling machine is used to package milk cartons of 1 litre


(= 1,000 cm3 ). Ideally, the amount of milk should only vary slightly. The company
which produced the filling machine claims that the variance of the milk content is
not greater than 1 cm3 . To examine the veracity of the claim, a random sample of 25
cartons is taken, resulting in 25 measurements (in cm3 ) as follows:

1,000.3, 1,001.3, 999.5, 999.7, 999.3,


999.8, 998.3, 1,000.6, 999.7, 999.8,
1,001.0, 999.4, 999.5, 998.5, 1,000.7,
999.6, 999.8, 1,000.0, 998.2, 1,000.1,
998.1, 1,000.7, 999.8, 1,001.3, 1,000.7.

Do these data support the claim of the company?

Turning Example 9.8 into a statistical problem, we assume that the data form a random
sample from N (µ, σ²). We are interested in testing the hypotheses:

   H0 : σ² = σ0²   vs.   H1 : σ² > σ0².

Let S² = Σ_{i=1}^{n} (Xi − X̄)²/(n − 1), then (n − 1)S²/σ² ∼ χ²_{n−1} . Under H0 we have:

   T = (n − 1)S²/σ0² = Σ_{i=1}^{n} (Xi − X̄)²/σ0² ∼ χ²_{n−1} .

Since we will reject H0 against an alternative hypothesis σ² > σ0², we should reject H0
for large values of T .

H0 is rejected if t > χ²_{α, n−1} , where χ²_{α, n−1} denotes the top 100αth percentile of the χ²_{n−1}
distribution, i.e. we have:

   P (T ≥ χ²_{α, n−1}) = α.

For any σ² > σ0², the power of the test at σ is:

   β(σ) = Pσ (H0 is rejected)

        = Pσ (T > χ²_{α, n−1})

        = Pσ ((n − 1)S²/σ0² > χ²_{α, n−1})

        = Pσ ((n − 1)S²/σ² > (σ0²/σ²) × χ²_{α, n−1})

which is greater than α, as σ0²/σ² < 1, where (n − 1)S²/σ² ∼ χ²_{n−1} when σ² is the true
variance, instead of σ0². Note that here 1 − β(σ) is the probability of a Type II error.

Suppose we choose α = 0.05. For n = 25, χ²_{α, n−1} = χ²_{0.05, 24} = 36.415.

With the given sample, s² = 0.8088 and σ0² = 1, t = 24 × 0.8088 = 19.41 < χ²_{0.05, 24} .
Hence we do not reject H0 at the 5% significance level. There is no significant evidence
from the data against the company’s claim that the variance is not beyond 1.

With σ0² = 1, the power function is:

   β(σ) = P ((n − 1)S²/σ² > χ²_{0.05, 24}/σ²) = P ((n − 1)S²/σ² > 36.415/σ²)

where (n − 1)S²/σ² ∼ χ²_24 .

For any given values of σ², we may compute β(σ). We list some specific values next.

   σ²                     1        1.5      2        3        4
   χ²_{0.05, 24}/σ²       36.415   24.277   18.208   12.138   9.104
   β(σ)                   0.05     0.446    0.793    0.978    0.997
   Approximate β(σ)       0.05     0.40     0.80     0.975    0.995

Clearly, β(σ) increases as σ² increases. Intuitively, it is easier to reject H0 : σ² = 1 if the
true population, which generates the data, has a larger variance σ².

Due to the sparsity of the available χ² tables, we may only obtain some approximate
values for β(σ) – see the entries in the last row in the above table. The more accurate
values of β(σ) were calculated using a computer.
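The more accurate power values referred to above can be reproduced in R with pchisq():

sigma2 <- c(1, 1.5, 2, 3, 4)
crit <- qchisq(0.95, df = 24)      # 36.415, the 5% critical value

power <- pchisq(crit / sigma2, df = 24, lower.tail = FALSE)
round(power, 3)                    # 0.050 0.446 0.793 0.978 0.997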
Some remarks are the following.

i. The significance level is selected subjectively by the statistician. To make the
conclusion more convincing in the above example, we may use α = 0.10 instead. As
χ²_{0.10, 24} = 33.196, H0 is not rejected at the 10% significance level. In fact the p-value
is:

   PH0 (T ≥ 19.41) = 0.73

where T ∼ χ²_24 .


ii. As σ² increases, the power function β(σ) also increases.

iii. For H1 : σ² ≠ σ0², we should reject H0 if:

   t ≤ χ²_{1−α/2, n−1}   or   t ≥ χ²_{α/2, n−1}

where χ²_{α, k} denotes the top 100αth percentile of the χ²_k distribution.

9.10 Summary: tests for µ and σ² in N (µ, σ²)

   Null hypothesis, H0      µ = µ0 (σ² known)      µ = µ0               σ² = σ0²

   Test statistic, T        (X̄ − µ0)/(σ/√n)        (X̄ − µ0)/(S/√n)      (n − 1)S²/σ0²

   Distribution of T
   under H0                 N (0, 1)               t_{n−1}              χ²_{n−1}

In the above table, X̄ = Σ_{i=1}^{n} Xi /n, S² = Σ_{i=1}^{n} (Xi − X̄)²/(n − 1), and {X1 , X2 , . . . , Xn } is a
random sample from N (µ, σ²).

9.11 Comparing two normal means with paired
observations

Suppose that the observations are paired:

   (X1 , Y1 ), (X2 , Y2 ), . . . , (Xn , Yn )

where all Xi s and Yi s are independent, Xi ∼ N (µX , σX²) and Yi ∼ N (µY , σY²).

We are interested in testing the hypothesis:

   H0 : µX = µY .     (9.1)

Example 9.9 The following are some practical examples.

Do husbands make more money than wives?


Is the increased marketing budget improving sales?
Are customers willing to pay more for the new product than the old one?
Does TV advertisement A have higher average effectiveness than advertisement
B?
Will promotion method A generate higher sales than method B?

Observations are paired together for good reasons: husband-wife, before-after,
A-vs.-B (from the same subject).

Let Zi = Xi − Yi , for i = 1, 2, . . . , n, then {Z1 , Z2 , . . . , Zn } is a random sample from the
population N (µ, σ²), where:

   µ = µX − µY   and   σ² = σX² + σY².

The hypothesis (9.1) can also be expressed as:

   H0 : µ = 0.

Therefore, we should use the test statistic T = √n Z̄/S, where Z̄ and S² denote,
respectively, the sample mean and the sample variance of {Z1 , Z2 , . . . , Zn }.

At the 100α% significance level, for α ∈ (0, 1), we reject the hypothesis µX = µY when:

   |t| > t_{α/2, n−1} , if the alternative is H1 : µX ≠ µY

   t > t_{α, n−1} , if the alternative is H1 : µX > µY

   t < −t_{α, n−1} , if the alternative is H1 : µX < µY

where P (T > t_{α, n−1}) = α, for T ∼ t_{n−1} .

9.11.1 Power functions of the test

Consider the case of testing H0 : µX = µY vs. H1 : µX > µY only. For µ = µX − µY > 0,
we have:

   β(µ) = Pµ (H0 is rejected)

        = Pµ (T > t_{α, n−1})

        = Pµ (√n Z̄/S > t_{α, n−1})

        = Pµ (√n(Z̄ − µ)/S > t_{α, n−1} − √n µ/S)

where √n(Z̄ − µ)/S ∼ t_{n−1} under the distribution represented by Pµ .

Note that for µ > 0, β(µ) > α. Furthermore, β(µ) increases as µ increases.

9.12 Comparing two normal means

Let {X1 , X2 , . . . , Xn } and {Y1 , Y2 , . . . , Ym } be two independent random samples drawn
from, respectively, N (µX , σX²) and N (µY , σY²). We seek to test hypotheses on µX − µY .

We cannot pair the two samples together, because of the different sample sizes n and m.

Let the sample means be X̄ = Σ_{i=1}^{n} Xi /n and Ȳ = Σ_{i=1}^{m} Yi /m, and the sample variances be:

   SX² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)²   and   SY² = (1/(m − 1)) Σ_{i=1}^{m} (Yi − Ȳ)².

Some remarks are the following.

   X̄, Ȳ, SX² and SY² are independent.

   X̄ ∼ N (µX , σX²/n) and (n − 1)SX²/σX² ∼ χ²_{n−1} .

   Ȳ ∼ N (µY , σY²/m) and (m − 1)SY²/σY² ∼ χ²_{m−1} .

Hence X̄ − Ȳ ∼ N (µX − µY , σX²/n + σY²/m). If σX² = σY², then:

   [(X̄ − Ȳ − (µX − µY )) / √(σX²/n + σY²/m)] / √(((n − 1)SX²/σX² + (m − 1)SY²/σY²)/(n + m − 2))

      = √((n + m − 2)/(1/n + 1/m)) × (X̄ − Ȳ − (µX − µY ))/√((n − 1)SX² + (m − 1)SY²) ∼ t_{n+m−2} .

9.12.1 Tests on µX − µY with known σX² and σY²

Suppose we are interested in testing:

   H0 : µX = µY   vs.   H1 : µX ≠ µY .

Note that:

   (X̄ − Ȳ − (µX − µY )) / √(σX²/n + σY²/m) ∼ N (0, 1).

Under H0 , µX − µY = 0, so we have:

   T = (X̄ − Ȳ) / √(σX²/n + σY²/m) ∼ N (0, 1).

At the 100α% significance level, for α ∈ (0, 1), we reject H0 if |t| > zα/2 , where
P (Z > zα/2 ) = α/2, for Z ∼ N (0, 1).

A 100(1 − α)% confidence interval for µX − µY is:

   X̄ − Ȳ ± zα/2 × √(σX²/n + σY²/m).

9.12.2 Tests on µX − µY with σX² = σY² but unknown

This time we consider the following hypotheses:

   H0 : µX − µY = δ0   vs.   H1 : µX − µY > δ0

where δ0 is a given constant. Under H0 , we have:

   T = √((n + m − 2)/(1/n + 1/m)) × (X̄ − Ȳ − δ0 )/√((n − 1)SX² + (m − 1)SY²) ∼ t_{n+m−2} .

At the 100α% significance level, for α ∈ (0, 1), we reject H0 if t > t_{α, n+m−2} , where
P (T > t_{α, n+m−2}) = α, for T ∼ t_{n+m−2} .

A 100(1 − α)% confidence interval for µX − µY is:

   X̄ − Ȳ ± t_{α/2, n+m−2} × √(((1/n + 1/m)/(n + m − 2)) × ((n − 1)SX² + (m − 1)SY²)).

Example 9.10 Two types of razor, A and B, were compared using 100 men in an
experiment. Each man shaved one side, chosen at random, of his face using one razor
and the other side using the other razor. The times taken to shave, Xi and Yi
minutes, for i = 1, 2, . . . , 100, corresponding to the razors A and B, respectively,
were recorded, yielding:

   x̄ = 2.84,   sX² = 0.48,   ȳ = 3.02   and   sY² = 0.42.

Also available is the sample variance of the differences, Zi = Xi − Yi , which is
sZ² = 0.6.

Test, at the 5% significance level, if the two razors lead to different mean shaving
times. State clearly any assumptions used in the test.

Assumption: Suppose {X1 , X2 , . . . , Xn } and {Y1 , Y2 , . . . , Yn } are two independent
random samples from, respectively, N (µX , σX²) and N (µY , σY²).

The problem requires us to test the following hypotheses:

   H0 : µX = µY   vs.   H1 : µX ≠ µY .

There are three approaches – a paired comparison method and two two-sample
comparisons based on different assumptions. Since the data are recorded in pairs,
the paired comparison is most relevant and effective to analyse these data.

Method I: paired comparison

We have Zi = Xi − Yi ∼ N (µZ , σZ²) with µZ = µX − µY and σZ² = σX² + σY². We want
to test:

   H0 : µZ = 0   vs.   H1 : µZ ≠ 0.

This is the standard one-sample t test, where:

   √n(Z̄ − µZ )/SZ = (X̄ − Ȳ − (µX − µY ))/(SZ /√n) ∼ t_{n−1} .

H0 is rejected if |t| > t_{0.025, 99} = 1.98, where under H0 we have:

   T = √n Z̄/SZ = √100 × (X̄ − Ȳ)/SZ .

With the given data, we observe t = 10 × (2.84 − 3.02)/√0.6 = −2.327. Hence we
reject the hypothesis that the two razors lead to the same mean shaving time at the
5% significance level.

A 95% confidence interval for µX − µY is:

   x̄ − ȳ ± t_{0.025, n−1} × sZ /√n = −0.18 ± 0.154  ⇒  (−0.334, −0.026).
Some remarks are the following.

i. Zero is not in the confidence interval for µX − µY .


ii. t0.025, 99 = 1.98 is pretty close to z0.025 = 1.96.

Method II: two-sample comparison with known variances


2
A further assumption is that σX = 0.48 and σY2 = 0.42.
2
Note X̄ − Ȳ ∼ N (µX − µY , σX /100 + σY2 /100), i.e. we have:
X̄ − Ȳ − (µX − µY )
p
2
∼ N (0, 1).
σX /100 + σY2 /100
Hence we reject H0 when |t| > 1.96 at the 5% significance level, where:
X̄ − Ȳ
T =p 2 .
σX /100 + σY2 /100

For the given data, t = −0.18/ 0.009 = −1.90. Hence we cannot reject H0 .
A 95% confidence interval for µX − µY is:
r
2
σX σ2
x̄ − ȳ ± 1.96 × + Y = −0.18 ± 0.186 ⇒ (−0.366, 0.006).
100 100
The value 0 is now contained in the confidence interval.

Method III: two-sample comparison with equal but unknown variance

A different additional assumption is that σX² = σY² = σ².

Now X̄ − Ȳ ∼ N (µX − µY , σ²/50) and 99(SX² + SY²)/σ² ∼ χ²_198 . Hence:

   √50 × (X̄ − Ȳ − (µX − µY )) / √(99 × (SX² + SY²)/198) = 10 × (X̄ − Ȳ − (µX − µY ))/√(SX² + SY²) ∼ t_198 .

Hence we reject H0 if |t| > t_{0.025, 198} = 1.97 where:

   T = 10 × (X̄ − Ȳ)/√(SX² + SY²).

For the given data, t = −1.897. Hence we cannot reject H0 at the 5% significance
level.

A 95% confidence interval for µX − µY is:

   x̄ − ȳ ± t_{0.025, 198} × √((sX² + sY²)/100) = −0.18 ± 0.1870  ⇒  (−0.367, 0.007)

which contains 0.


Some remarks are the following.

i. Different methods lead to different but not contradictory conclusions, as remember:

not reject ≠ accept.

ii. The paired comparison is intuitively the most relevant, requires the least
assumptions, and leads to the most conclusive inference (i.e. rejection of H0 ). It
also produces the narrowest confidence interval.

iii. Methods II and III ignore the pairing of the data. Consequently, the inference is
less conclusive and less accurate.

iv. A general observation is that H0 is rejected at the 100α% significance level if and
only if the value hypothesised by H0 is not within the corresponding 100(1 − α)%
confidence interval.

v. It is much more challenging to compare two normal means with unknown and
unequal variances. This will not be discussed in this course.

9.13 Tests for correlation coefficients

We now consider a test for the correlation coefficient of two random variables X and Y
where:

   ρ = Corr(X, Y ) = Cov(X, Y ) / (Var(X) Var(Y ))^{1/2}

                   = E((X − E(X))(Y − E(Y ))) / (E((X − E(X))²) E((Y − E(Y ))²))^{1/2} .

Some remarks are the following.

i. ρ ∈ [−1, 1], and |ρ| = 1 if and only if Y = aX + b for some constants a and b.
Furthermore, a > 0 if ρ = 1, and a < 0 if ρ = −1.

ii. ρ measures only the linear relationship between X and Y . When ρ = 0, X and Y
are linearly independent, that is uncorrelated.

iii. If X and Y are independent (in the sense that the joint pdf is the product of the
two marginal pdfs), ρ = 0. However, if ρ = 0, X and Y are not necessarily
independent, as there may exist some non-linear relationship between X and Y .

iv. If ρ > 0, X and Y tend to increase (or decrease) together. If ρ < 0, X and Y tend
to move in opposite directions.

67
9. Hypothesis testing

Sample correlation coefficient

Given paired observations (Xi , Yi ), for i = 1, 2, . . . , n, a natural estimator of ρ is
defined as:

   ρ̂ = Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / (Σ_{i=1}^{n} (Xi − X̄)² Σ_{j=1}^{n} (Yj − Ȳ)²)^{1/2}

where X̄ = Σ_{i=1}^{n} Xi /n and Ȳ = Σ_{i=1}^{n} Yi /n.

Example 9.11 The measurements of height, X, and weight, Y , are taken from 69
students in a class. ρ should be positive, intuitively!

In Figure 9.5, the vertical line at x̄ and the horizontal line at ȳ divide the 69 points
into 4 quadrants: northeast (NE), southwest (SW), northwest (NW) and southeast
(SE). Most points are in either NE or SW.

In the NE quadrant, xi > x̄ and yi > ȳ, hence:

   Σ_{i∈NE} (xi − x̄)(yi − ȳ) > 0.

In the SW quadrant, xi < x̄ and yi < ȳ, hence:

   Σ_{i∈SW} (xi − x̄)(yi − ȳ) > 0.

In the NW quadrant, xi < x̄ and yi > ȳ, hence:

   Σ_{i∈NW} (xi − x̄)(yi − ȳ) < 0.

In the SE quadrant, xi > x̄ and yi < ȳ, hence:

   Σ_{i∈SE} (xi − x̄)(yi − ȳ) < 0.

Overall:

   Σ_{i=1}^{69} (xi − x̄)(yi − ȳ) > 0

and hence ρ̂ > 0.

Figure 9.6 shows examples of different sample correlation coefficients using scatterplots
of bivariate observations.


Figure 9.5: Scatterplot of height and weight in Example 9.11.

9.13.1 Tests for correlation coefficients

Let {(X1 , Y1 ), (X2 , Y2 ), . . . , (Xn , Yn )} be a random sample from a two-dimensional
normal distribution. Let ρ = Corr(Xi , Yi ). We are interested in testing:

   H0 : ρ = 0   vs.   H1 : ρ ≠ 0.

It can be shown that under H0 the test statistic is:

   T = ρ̂ √((n − 2)/(1 − ρ̂²)) ∼ t_{n−2} .

Hence we reject H0 at the 100α% significance level, for α ∈ (0, 1), if |t| > t_{α/2, n−2} , where:

   P (T > t_{α/2, n−2}) = α/2.

Some remarks are the following.

i. |T | = |ρ̂| √((n − 2)/(1 − ρ̂²)) increases as |ρ̂| increases.

ii. For H1 : ρ > 0, we reject H0 if t > t_{α, n−2} .

iii. Two random variables X and Y are jointly normal if aX + bY is normal for any
constants a and b.

iv. For jointly normal random variables X and Y , if Corr(X, Y ) = 0, X and Y are also
independent.
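In R this test is provided by cor.test(). A minimal sketch, using assumed height and
weight vectors in the spirit of Example 9.11, is:

height <- c(170, 165, 180, 175, 160, 172, 168, 185, 178, 162)   # assumed data
weight <- c( 65,  59,  80,  72,  55,  70,  64,  88,  75,  58)   # assumed data

cor(height, weight)                                    # the sample correlation
cor.test(height, weight, alternative = "two.sided")   # t test of H0: rho = 0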


Figure 9.6: Scatterplots of bivariate observations with different sample correlation


coefficients.

9.14 Tests for the ratio of two normal variances

Let {X1 , X2 , . . . , Xn } and {Y1 , Y2 , . . . , Ym } be two independent random samples from,
respectively, N (µX , σX²) and N (µY , σY²). We are interested in testing:

   H0 : σY²/σX² = k   vs.   H1 : σY²/σX² ≠ k

where k > 0 is a given constant. The case with k = 1 is of particular interest since this
tests for equal variances.

Let the sample means be X̄ = Σ_{i=1}^{n} Xi /n and Ȳ = Σ_{i=1}^{m} Yi /m, and the sample variances be:

   SX² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)²   and   SY² = (1/(m − 1)) Σ_{i=1}^{m} (Yi − Ȳ)².

We have (n − 1)SX²/σX² ∼ χ²_{n−1} and (m − 1)SY²/σY² ∼ χ²_{m−1} . Therefore:

   (σY²/σX²) × (SX²/SY²) = (SX²/σX²) / (SY²/σY²) ∼ F_{n−1, m−1} .

Under H0 , T = k SX²/SY² ∼ F_{n−1, m−1} . Hence H0 is rejected if:

   t < F_{1−α/2, n−1, m−1}   or   t > F_{α/2, n−1, m−1}

where F_{α, p, k} denotes the top 100αth percentile of the F_{p, k} distribution, that is:

   P (T > F_{α, p, k}) = α

available from Table 9 of Murdoch and Barnes’ Statistical Tables.

Since:

   P (F_{1−α/2, n−1, m−1} ≤ (σY²/σX²) × (SX²/SY²) ≤ F_{α/2, n−1, m−1}) = 1 − α

a 100(1 − α)% confidence interval for σY²/σX² is:

   (F_{1−α/2, n−1, m−1} × SY²/SX² , F_{α/2, n−1, m−1} × SY²/SX²).

Example 9.12 Here we practise use of Table 9 of Murdoch and Barnes’ Statistical
Tables to obtain critical values for the F distribution.

Table 9 can be used to find the top 100αth percentile of the F_{ν1, ν2} distribution for
α = 0.05, 0.025, 0.01 and 0.001.

For example, for ν1 = 3 and ν2 = 5, then:

   P (F_{3, 5} > 5.41) = 0.05

   P (F_{3, 5} > 7.76) = 0.025

   P (F_{3, 5} > 12.06) = 0.01

and:

   P (F_{3, 5} > 33.20) = 0.001.

To find the bottom 100αth percentile, we note that F_{1−α, ν1, ν2} = 1/F_{α, ν2, ν1} . So, for
ν1 = 3 and ν2 = 5, we have:

   P (F_{3, 5} < 1/F_{0.05, 5, 3}) = P (F_{3, 5} < 1/9.01) = P (F_{3, 5} < 0.111) = 0.05

   P (F_{3, 5} < 1/F_{0.025, 5, 3}) = P (F_{3, 5} < 1/14.90) = P (F_{3, 5} < 0.067) = 0.025

   P (F_{3, 5} < 1/F_{0.01, 5, 3}) = P (F_{3, 5} < 1/28.20) = P (F_{3, 5} < 0.035) = 0.01

and:

   P (F_{3, 5} < 1/F_{0.001, 5, 3}) = P (F_{3, 5} < 1/134.60) = P (F_{3, 5} < 0.007) = 0.001.
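The same percentiles can be obtained in R with qf(), which avoids the need for
statistical tables:

qf(0.95, df1 = 3, df2 = 5)       # top 5% point of F(3, 5), about 5.41
qf(0.05, df1 = 3, df2 = 5)       # bottom 5% point, about 0.111
1 / qf(0.95, df1 = 5, df2 = 3)   # the reciprocal relationship gives the same value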

Example 9.13 The daily returns (in percentages) of two assets, X and Y , are
recorded over a period of 100 trading days, yielding average daily returns of x̄ = 3.21
and ȳ = 1.41. Also available from the data are the following quantities:
100
X 100
X 100
X
x2i = 1,989.24, yi2 = 932.78 and xi yi = 661.11.
i=1 i=1 i=1

71
9. Hypothesis testing

Assume the data are normally distributed. Are the two assets positively correlated
with each other, and is asset X riskier than asset Y ?
With n = 100 we have:

s_X^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right) = 9.69

and:

s_Y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\right) = 7.41.

Therefore:

\hat{\rho} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{(n-1)s_X s_Y} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{(n-1)s_X s_Y} = 0.249.
First we test:
H0 : ρ = 0 vs. H1 : ρ > 0.
Under H0, the test statistic is:

T = \hat{\rho}\sqrt{\frac{n-2}{1-\hat{\rho}^2}} \sim t_{98}.
Setting α = 0.01, we reject H0 if t > t0.01, 98 = 2.37. With the given data, t = 2.545
hence we reject the null hypothesis of ρ = 0 at the 1% significance level. We
conclude that there is highly significant evidence indicating that the two assets are
positively correlated.
We measure the risks in terms of variances, and test:

H_0: \sigma_X^2 = \sigma_Y^2 \quad \text{vs.} \quad H_1: \sigma_X^2 > \sigma_Y^2.

Under H0, we have that:

T = \frac{S_X^2}{S_Y^2} \sim F_{99,\, 99}.
Hence we reject H0 if t > F0.05, 99, 99 = 1.39 at the 5% significance level, using Table 9
of Murdoch and Barnes’ Statistical Tables.
With the given data, t = 9.69/7.41 = 1.308. Therefore, we cannot reject H0 . As the
test is not significant at the 5% significance level, we may not conclude that the
variances of the two assets are significantly different. Therefore, there is no
significant evidence indicating that asset X is riskier than asset Y .
Strictly speaking, the test is valid only if the two samples are independent of each
other, which is not the case here.


9.15 Summary: tests for two normal distributions


Let (X_1, X_2, \ldots, X_n) \sim_{IID} N(\mu_X, \sigma_X^2), (Y_1, Y_2, \ldots, Y_m) \sim_{IID} N(\mu_Y, \sigma_Y^2), and ρ = Corr(X, Y).

A summary table of tests for two normal distributions is:

Null hypothesis H0                          Test statistic T                                                                           Distribution of T under H0
μX − μY = δ  (σX², σY² known)                (X̄ − Ȳ − δ) / √(σX²/n + σY²/m)                                                             N(0, 1)
μX − μY = δ  (σX² = σY² unknown)             (X̄ − Ȳ − δ) / (Sp √(1/n + 1/m)), where Sp² = ((n−1)SX² + (m−1)SY²)/(n+m−2)                 t_{n+m−2}
ρ = 0  (paired data, so n = m)               ρ̂ √((n−2)/(1 − ρ̂²))                                                                        t_{n−2}
σY²/σX² = k                                  k SX²/SY²                                                                                  F_{n−1, m−1}

9.16 Overview of chapter


This chapter has discussed hypothesis tests for parameters of normal distributions –
specifically means and variances. In each case an appropriate test statistic was
constructed whose distribution under the null hypothesis was known. Concepts of
hypothesis testing errors and power were also discussed, as well as how to test
correlation coefficients.

9.17 Key terms and concepts


Alternative hypothesis Critical value
Decision Null hypothesis
p-value Paired comparison
Power function Significance level
t test Test statistic
Type I error Type II error

To p, or not to p?
(James Abdey, Ph.D. Thesis 2009.1 )

1. Available at http://etheses.lse.ac.uk/31

Chapter 10
Analysis of variance (ANOVA)

10.1 Synopsis of chapter


This chapter introduces analysis of variance (ANOVA) which is a widely-used technique
for detecting differences between groups based on continuous dependent variables.

10.2 Learning outcomes


After completing this chapter, you should be able to:

explain the purpose of analysis of variance

restate and interpret the models for one-way and two-way analysis of variance

conduct small examples of one-way and two-way analysis of variance with a


calculator, reporting the results in an ANOVA table

perform hypothesis tests and construct confidence intervals for one-way and
two-way analysis of variance

explain how to interpret residuals from an analysis of variance.

10.3 Introduction
Analysis of variance (ANOVA) is a popular tool which has an applicability and power
which we can only start to appreciate in this course. The idea of analysis of variance is
to investigate how variation in structured data can be split into pieces associated with
components of that structure. We look only at one-way and two-way classifications,
providing tests and confidence intervals which are widely used in practice.

10.4 Testing for equality of three population means


We begin with an illustrative example to test the hypothesis that three population means are equal.


Example 10.1 To assess the teaching quality of class teachers, a random sample of
6 examination marks was selected from each of three classes. The examination marks
for each class are listed in the table below.
Can we infer from these data that there is no significant difference in the
examination marks among all three classes?

Class 1 Class 2 Class 3


85 71 59
75 75 64
82 73 62
76 74 69
71 69 75
85 82 67

Suppose examination marks from Class j follow the distribution N (µj , σ 2 ), for
j = 1, 2, 3. So we assume examination marks are normally distributed with the same
variance in each class, but possibly different means.
We need to test the hypothesis:

H0 : µ1 = µ2 = µ3 .

The data form a 6 × 3 array. Denote the data point at the (i, j)th position as Xij .
We compute the column means first where the jth column mean is:
\bar{X}_{\cdot j} = \frac{X_{1j} + X_{2j} + \cdots + X_{n_j j}}{n_j}

where nj is the sample size of group j (here nj = 6 for all j).


This leads to x̄·1 = 79, x̄·2 = 74 and x̄·3 = 66. Transposing the table, we get:

Observation
1 2 3 4 5 6 Mean
Class 1 85 75 82 76 71 85 79
Class 2 71 75 73 74 69 82 74
Class 3 59 64 62 69 75 67 66

Note that similar problems arise from other practical situations. For example:

comparing the returns of three stocks

comparing sales using three advertising strategies

comparing the effectiveness of three medicines.

If H0 is true, the three observed sample means x̄·1 , x̄·2 and x̄·3 should be very close to
each other, i.e. all of them should be close to the overall sample mean, x̄, which is:
\bar{x} = \frac{\bar{x}_{\cdot 1} + \bar{x}_{\cdot 2} + \bar{x}_{\cdot 3}}{3} = \frac{79 + 74 + 66}{3} = 73

i.e. the mean value of all 18 observations.


So we wish to perform a hypothesis test based on the variation in the sample means
such that the greater the variation, the more likely we are to reject H0 . One possible
measure for the variation in the sample means X̄·j about the overall sample mean X̄,
for j = 1, 2, 3, is:
\sum_{j=1}^{3} (\bar{X}_{\cdot j} - \bar{X})^2. \quad (10.1)

However, (10.1) is not scale-invariant, so it would be difficult to judge whether the


realised value is large enough to warrant rejection of H0 due to the magnitude being
dependent on the units of measurement of the data. So we seek a scale-invariant test
statistic.
Just as we scaled the covariance between two random variables to give the
scale-invariant correlation coefficient, we can similarly scale (10.1) to give the
following possible test statistic:
T = \frac{\sum_{j=1}^{3} (\bar{X}_{\cdot j} - \bar{X})^2}{\text{sum of the three sample variances}}.

Hence we would reject H0 for large values of T . (Note t = 0 if x̄·1 = x̄·2 = x̄·3 which
would mean that there is no variation at all between the sample means. In this case
all the sample means would equal x̄.)
It remains to determine the distribution of T under H0 .

10.5 One-way analysis of variance


We now extend Example 10.1 to consider a general setting where there are k
independent random samples available from k normal distributions N (µj , σ 2 ), for
j = 1, 2, . . . , k. (Example 10.1 corresponds to k = 3.)
Denote by X1j , X2j , . . . , Xnj j the random sample with sample size nj from N (µj , σ 2 ), for
j = 1, 2, . . . , k.
Our goal is to test:
H0 : µ1 = µ2 = · · · = µk
vs.
H1 : not all µj s are the same.
One-way analysis of variance (one-way ANOVA) involves a continuous dependent
variable and one categorical independent variable (sometimes called a factor, or
treatment), where the k different levels of the categorical variable are the k different
groups.
We now introduce statistics associated with one-way ANOVA.


Statistics associated with one-way ANOVA

The jth sample mean is:

\bar{X}_{\cdot j} = \frac{1}{n_j} \sum_{i=1}^{n_j} X_{ij}.

The overall sample mean is:

\bar{X} = \frac{1}{n} \sum_{j=1}^{k} \sum_{i=1}^{n_j} X_{ij} = \frac{1}{n} \sum_{j=1}^{k} n_j \bar{X}_{\cdot j}

where n = \sum_{j=1}^{k} n_j is the total number of observations across all k groups.

The total variation is:

\sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar{X})^2

with n − 1 degrees of freedom.

The between-groups variation is:

B = \sum_{j=1}^{k} n_j (\bar{X}_{\cdot j} - \bar{X})^2

with k − 1 degrees of freedom.

The within-groups variation is:

W = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_{\cdot j})^2

with n − k = \sum_{j=1}^{k} (n_j - 1) degrees of freedom.

The ANOVA decomposition is:

\sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar{X})^2 = \sum_{j=1}^{k} n_j (\bar{X}_{\cdot j} - \bar{X})^2 + \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_{\cdot j})^2.

We have already discussed the jth sample mean and overall sample mean. The total
variation is a measure of the overall (total) variability in the data from all k groups
about the overall sample mean. The ANOVA decomposition decomposes this into two
components: between-groups variation (which is attributable to the factor level) and
within-groups variation (which is attributable to the variation within each group and is
assumed to be the same σ 2 for each group).
Some remarks are the following.

i. B and W are also called, respectively, between-treatments variation and within-treatments variation. In fact W is effectively a residual (error) sum of squares, representing the variation which cannot be explained by the treatment or group factor.
ii. The ANOVA decomposition follows from the identity:

\sum_{i=1}^{m} (a_i - b)^2 = \sum_{i=1}^{m} (a_i - \bar{a})^2 + m(\bar{a} - b)^2.

However, the actual derivation is not required for this course.


iii. The following are some useful formulae for manual computations.

• n = \sum_{j=1}^{k} n_j.

• \bar{X}_{\cdot j} = \sum_{i=1}^{n_j} X_{ij}/n_j and \bar{X} = \sum_{j=1}^{k} n_j \bar{X}_{\cdot j}/n.

• Total variation = Total SS = B + W = \sum_{j=1}^{k} \sum_{i=1}^{n_j} X_{ij}^2 - n\bar{X}^2.

• B = \sum_{j=1}^{k} n_j \bar{X}_{\cdot j}^2 - n\bar{X}^2.

• Residual (Error) SS = W = \sum_{j=1}^{k} \sum_{i=1}^{n_j} X_{ij}^2 - \sum_{j=1}^{k} n_j \bar{X}_{\cdot j}^2 = \sum_{j=1}^{k} (n_j - 1)S_j^2, where S_j^2 is the jth sample variance.

We now note, without proof, the following results.

i. B = \sum_{j=1}^{k} n_j (\bar{X}_{\cdot j} - \bar{X})^2 and W = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_{\cdot j})^2 are independent of each other.

ii. W/\sigma^2 = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_{\cdot j})^2/\sigma^2 \sim \chi^2_{n-k}.

iii. Under H_0: \mu_1 = \cdots = \mu_k, then B/\sigma^2 = \sum_{j=1}^{k} n_j (\bar{X}_{\cdot j} - \bar{X})^2/\sigma^2 \sim \chi^2_{k-1}.

In order to test H_0: \mu_1 = \mu_2 = \cdots = \mu_k, we define the following test statistic:

F = \frac{\sum_{j=1}^{k} n_j (\bar{X}_{\cdot j} - \bar{X})^2/(k-1)}{\sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_{\cdot j})^2/(n-k)} = \frac{B/(k-1)}{W/(n-k)}.

Under H0, F \sim F_{k-1,\, n-k}. We reject H0 at the 100α% significance level if:

f > F_{\alpha,\, k-1,\, n-k}

where F_{\alpha,\, k-1,\, n-k} is the top 100αth percentile of the F_{k-1,\, n-k} distribution, i.e. P(F > F_{\alpha,\, k-1,\, n-k}) = \alpha, and f is the observed test statistic value.


The p-value of the test is:

p-value = P (F > f ).

It is clear that f > Fα, k−1, n−k if and only if the p-value < α, as we must reach the same
conclusion regardless of whether we use the critical value approach or the p-value
approach to hypothesis testing.

One-way ANOVA table

Typically, one-way ANOVA results are presented in a table as follows:


Source    DF       SS       MS            F                        p-value
Factor    k − 1    B        B/(k − 1)     [B/(k−1)] / [W/(n−k)]    p
Error     n − k    W        W/(n − k)
Total     n − 1    B + W

Example 10.2 Continuing with Example 10.1, for the given data, k = 3,
n1 = n2 = n3 = 6, n = n1 + n2 + n3 = 18, x̄·1 = 79, x̄·2 = 74, x̄·3 = 66 and x̄ = 73.
The sample variances are calculated to be s_1^2 = 34, s_2^2 = 20 and s_3^2 = 32. Therefore:

b = \sum_{j=1}^{3} 6(\bar{x}_{\cdot j} - \bar{x})^2 = 6 \times ((79 - 73)^2 + (74 - 73)^2 + (66 - 73)^2) = 516

and:

w = \sum_{j=1}^{3} \sum_{i=1}^{6} (x_{ij} - \bar{x}_{\cdot j})^2 = \sum_{j=1}^{3} \sum_{i=1}^{6} x_{ij}^2 - 6\sum_{j=1}^{3} \bar{x}_{\cdot j}^2 = \sum_{j=1}^{3} 5s_j^2 = 5 \times (34 + 20 + 32) = 430.

Hence:

f = \frac{b/(k-1)}{w/(n-k)} = \frac{516/2}{430/15} = 9.
Under H0 : µ1 = µ2 = µ3 , F ∼ Fk−1, n−k = F2, 15 . Since F0.01, 2, 15 = 6.359 < 9, using
Table 9 of Murdoch and Barnes’ Statistical Tables, we reject H0 at the 1%
significance level. In fact the p-value (using a computer) is P (F > 9) = 0.003.
Therefore, we conclude that there is a significant difference among the mean
examination marks across the three classes.


The one-way ANOVA table is as follows:

Source DF SS MS F p-value
Class 2 516 258 9 0.003
Error 15 430 28.67
Total 17 946
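
As a check (this sketch is not part of the original example), the same table can be reproduced in R by entering the 18 marks from Example 10.1:

> # marks from Example 10.1, six per class
> marks <- c(85, 75, 82, 76, 71, 85, 71, 75, 73, 74, 69, 82, 59, 64, 62, 69, 75, 67)
> class <- factor(rep(c("Class 1", "Class 2", "Class 3"), each = 6))
> anova(lm(marks ~ class))     # reproduces SS = 516 and 430, f = 9, p-value = 0.003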

Example 10.3 A study performed by a Columbia University professor counted the


number of times per minute professors from three different departments said ‘uh’ or
‘ah’ during lectures to fill gaps between words. The data listed in ‘UhAh.csv’ were
derived from observing 100 minutes from each of the three departments. If we
assume that the more frequent use of ‘uh’ or ‘ah’ results in more boring lectures, can
we conclude that some departments’ professors are more boring than others?
The counts for English, Mathematics and Political Science departments are stored.
As always in statistical analysis, we first look at the summary (descriptive) statistics
of these data.

> attach(UhAh)
> summary(UhAh)
Frequency Department
Min. : 0.00 English :100
1st Qu.: 4.00 Mathematics :100
Median : 5.00 Political Science:100
Mean : 5.48
3rd Qu.: 7.00
Max. :11.00
> xbar <- tapply(Frequency, Department, mean)
> s <- tapply(Frequency, Department, sd)
> n <- tapply(Frequency, Department, length)
> sem <- s/sqrt(n)
> list(xbar,s,n,sem)
[[1]]
English Mathematics Political Science
5.81 5.30 5.33

[[2]]
English Mathematics Political Science
2.493203 2.012587 1.974867

[[3]]
English Mathematics Political Science
100 100 100

[[4]]
English Mathematics Political Science
0.2493203 0.2012587 0.1974867


Surprisingly, professors in English say ‘uh’ or ‘ah’ more on average than those in
Mathematics and Political Science (compare the sample means of 5.81, 5.30 and
5.33), but the difference seems small. However, we need to formally test whether the
(seemingly small) differences are statistically significant.
Using the data, R produces the following one-way ANOVA table:

> anova(lm(Frequency ~ Department))


Analysis of Variance Table

Response: Frequency
Df Sum Sq Mean Sq F value Pr(>F)
Department 2 16.38 8.1900 1.7344 0.1783
Residuals 297 1402.50 4.7222
Since the p-value for the F test is 0.1783, we cannot reject the following hypothesis:

H0 : µ1 = µ2 = µ3 .

Therefore, there is no evidence of a difference in the mean number of ‘uh’s or ‘ah’s


said by professors across the three departments.

In addition to a one-way ANOVA table, we can also obtain the following.

An estimator of σ is:

\hat{\sigma} = S = \sqrt{\frac{W}{n-k}}.

95% confidence intervals for μj are given by:

\bar{X}_{\cdot j} \pm t_{0.025,\, n-k} \times \frac{S}{\sqrt{n_j}} \quad \text{for } j = 1, 2, \ldots, k

where t_{0.025, n−k} is the top 2.5th percentile of the Student's t_{n−k} distribution, which can be obtained from Table 7 of Murdoch and Barnes' Statistical Tables.

Example 10.4 Assuming a common variance for each group, from the preceding
output in Example 10.3 we see that:
\hat{\sigma} = s = \sqrt{\frac{1,402.50}{297}} = \sqrt{4.72} = 2.173.

Since t_{0.025, 297} \approx t_{0.025, \infty} = 1.96, using Table 7 of Murdoch and Barnes' Statistical Tables, we obtain the following 95% confidence intervals for μ1, μ2 and μ3, respectively:

j = 1: \quad 5.81 \pm 1.96 \times \frac{2.173}{\sqrt{100}} \;\Rightarrow\; (5.38, 6.24)

j = 2: \quad 5.30 \pm 1.96 \times \frac{2.173}{\sqrt{100}} \;\Rightarrow\; (4.87, 5.73)

j = 3: \quad 5.33 \pm 1.96 \times \frac{2.173}{\sqrt{100}} \;\Rightarrow\; (4.90, 5.76).

R can produce the following:

> stripchart(Frequency ~ Department,pch=16,vert=T)


> arrows(1:3,xbar+1.96*2.173/sqrt(n),1:3,xbar-1.96*2.173/sqrt(n),
angle=90,code=3,length=0.1)
> lines(1:3,xbar,pch=4,type="b",cex=2)
These 95% confidence intervals can be seen plotted in the R output below. Note that
these confidence intervals all overlap, which is consistent with our failure to reject
the null hypothesis that all population means are equal.

Figure 10.1: Overlapping confidence intervals.

Example 10.5 In early 2001, the American economy was slowing down and
companies were laying off workers. A poll conducted during February 2001 asked a
random sample of workers how long (in months) it would be before they faced
significant financial hardship if they lost their jobs, with the data available in the file
‘GallupPoll.csv’. They are classified into four groups according to their incomes.
Below is part of the R output of the descriptive statistics of the classified data. Can
we infer that income group has a significant impact on the mean length of time
before facing financial hardship?

Hardship Income.group
Min. : 0.00 $20 to 30K: 81
1st Qu.: 8.00 $30 to 50K:114
Median :15.00 Over $50K : 39
Mean :16.11 Under $20K: 67
3rd Qu.:22.00
Max. :50.00


> xbar <- tapply(Hardship, Income.group, mean)


> s <- tapply(Hardship, Income.group, sd)
> n <- tapply(Hardship, Income.group, length)
> sem <- s/sqrt(n)
> list(xbar,s,n,sem)
[[1]]
$20 to 30K $30 to 50K Over $50K Under $20K
15.493827 18.456140 22.205128 9.313433

[[2]]
$20 to 30K $30 to 50K Over $50K Under $20K
9.233260 9.507464 11.029099 8.087043

[[3]]
$20 to 30K $30 to 50K Over $50K Under $20K
81 114 39 67

[[4]]
$20 to 30K $30 to 50K Over $50K Under $20K
1.0259178 0.8904556 1.7660693 0.9879896
Inspection of the sample means suggests that there is a difference between income
groups, but we need to conduct a one-way ANOVA test to see whether the
differences are statistically significant.
We apply one-way ANOVA to test whether the means in the k = 4 groups are equal,
i.e. H0 : µ1 = µ2 = µ3 = µ4 , from highest to lowest income groups.
We have n1 = 39, n2 = 114, n3 = 81 and n4 = 67, hence:

n = \sum_{j=1}^{k} n_j = 39 + 114 + 81 + 67 = 301.

Also \bar{x}_{\cdot 1} = 22.21, \bar{x}_{\cdot 2} = 18.456, \bar{x}_{\cdot 3} = 15.49, \bar{x}_{\cdot 4} = 9.313 and:

\bar{x} = \frac{1}{n}\sum_{j=1}^{k} n_j \bar{x}_{\cdot j} = \frac{39 \times 22.21 + 114 \times 18.456 + 81 \times 15.49 + 67 \times 9.313}{301} = 16.109.

Now:

b = \sum_{j=1}^{k} n_j (\bar{x}_{\cdot j} - \bar{x})^2
  = 39 \times (22.21 - 16.109)^2 + 114 \times (18.456 - 16.109)^2 + 81 \times (15.49 - 16.109)^2 + 67 \times (9.313 - 16.109)^2
  = 5,205.097.

We have s_1^2 = (11.03)^2 = 121.661, s_2^2 = (9.507)^2 = 90.383, s_3^2 = (9.23)^2 = 85.193 and s_4^2 = (8.087)^2 = 65.400, hence:

w = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_{\cdot j})^2 = \sum_{j=1}^{k} (n_j - 1)s_j^2
  = 38 \times 121.661 + 113 \times 90.383 + 80 \times 85.193 + 66 \times 65.400
  = 25,968.24.

Consequently:

f = \frac{b/(k-1)}{w/(n-k)} = \frac{5,205.097/3}{25,968.24/(301-4)} = 19.84.

Under H0, F \sim F_{k-1,\, n-k} = F_{3,\, 297}. Since F_{0.01,\, 3,\, 297} \approx 3.848 < 19.84, we reject H0 at the 1% significance level, i.e. there is strong evidence that income group has a significant impact on the mean length of time before facing financial hardship.

The pooled estimate of σ is:

s = \sqrt{w/(n-k)} = \sqrt{25,968.24/(301-4)} = 9.351.

A 95% confidence interval for μj is:

\bar{x}_{\cdot j} \pm t_{0.025,\, 297} \times \frac{s}{\sqrt{n_j}} = \bar{x}_{\cdot j} \pm 1.96 \times \frac{9.351}{\sqrt{n_j}} = \bar{x}_{\cdot j} \pm \frac{18.328}{\sqrt{n_j}}.

Hence, for example, a 95% confidence interval for μ1 is:

22.21 \pm \frac{18.328}{\sqrt{39}} \;\Rightarrow\; (19.28, 25.14)

and a 95% confidence interval for μ4 is:

9.313 \pm \frac{18.328}{\sqrt{67}} \;\Rightarrow\; (7.07, 11.55).
Notice that these two confidence intervals do not overlap, which is consistent with
our conclusion that there is a difference between the group means.
R output for the data is:

> anova(lm(Hardship ~ Income.group))


Analysis of Variance Table

Response: Hardship
Df Sum Sq Mean Sq F value Pr(>F)
Income.group 3 5202.1 1734.03 19.828 9.636e-12 ***
Residuals 297 25973.3 87.45
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note that minor differences are due to rounding errors in calculations.


10.6 From one-way to two-way ANOVA


One-way ANOVA: a review
We have independent observations Xij ∼ N (µj , σ 2 ) for i = 1, 2, . . . , nj and
j = 1, 2, . . . , k. We are interested in testing:

H0 : µ1 = µ2 = · · · = µk .

The variation of the Xij s is driven by a factor at different levels µ1 , µ2 , . . . , µk , in


addition to random fluctuations (i.e. random errors). We test whether such a factor
effect exists or not. We can model a one-way ANOVA problem as follows:

Xij = µ + βj + εij for i = 1, 2, . . . , nj and j = 1, 2, . . . , k

where εij ∼ N(0, σ²) and the εij s are independent. µ is the average effect and βj is the factor (or treatment) effect at the jth level. Note that \sum_{j=1}^{k} \beta_j = 0. The null hypothesis (i.e. that the group means are all equal) can also be expressed as:

H0 : β1 = β2 = · · · = βk = 0.

10.7 Two-way analysis of variance


Two-way analysis of variance (two-way ANOVA) involves a continuous dependent
variable and two categorical independent variables (factors). Two-way ANOVA models
the observations as:

Xij = µ + γi + βj + εij for i = 1, 2, . . . , r and j = 1, 2, . . . , c

where:

µ represents the average effect


β1 , β2 , . . . , βc represent c different treatment (column) levels
γ1 , γ2 , . . . , γr represent r different block (row) levels
εij ∼ N (0, σ 2 ) and the εij s are independent.

In total, there are n = r × c observations. We now consider the conditions to make the
parameters µ, γi and βj identifiable for i = 1, 2, . . . , r and j = 1, 2, . . . , c. The conditions
are:
γ1 + γ2 + · · · + γr = 0 and β1 + β2 + · · · + βc = 0.
We will be interested in testing the following hypotheses.

The ‘no treatment (column) effect’ hypothesis of H0 : β1 = β2 = · · · = βc = 0.


The ‘no block (row) effect’ hypothesis of H0 : γ1 = γ2 = · · · = γr = 0.

We now introduce statistics associated with two-way ANOVA.


Statistics associated with two-way ANOVA

The sample mean at the ith block level is:

\bar{X}_{i \cdot} = \frac{1}{c} \sum_{j=1}^{c} X_{ij} \quad \text{for } i = 1, 2, \ldots, r.

The sample mean at the jth treatment level is:

\bar{X}_{\cdot j} = \frac{1}{r} \sum_{i=1}^{r} X_{ij} \quad \text{for } j = 1, 2, \ldots, c.

The overall sample mean is:

\bar{X} = \bar{X}_{\cdot \cdot} = \frac{1}{n} \sum_{i=1}^{r} \sum_{j=1}^{c} X_{ij}.

The total variation (with rc − 1 degrees of freedom) is:

\text{Total SS} = \sum_{i=1}^{r} \sum_{j=1}^{c} (X_{ij} - \bar{X})^2.

The between-blocks (rows) variation (with r − 1 degrees of freedom) is:

B_{row} = c \sum_{i=1}^{r} (\bar{X}_{i \cdot} - \bar{X})^2.

The between-treatments (columns) variation (with c − 1 degrees of freedom) is:

B_{col} = r \sum_{j=1}^{c} (\bar{X}_{\cdot j} - \bar{X})^2.

The residual (error) variation (with (r − 1)(c − 1) degrees of freedom) is:

\text{Residual SS} = \sum_{i=1}^{r} \sum_{j=1}^{c} (X_{ij} - \bar{X}_{i \cdot} - \bar{X}_{\cdot j} + \bar{X})^2.

The (two-way) ANOVA decomposition is:

\sum_{i=1}^{r} \sum_{j=1}^{c} (X_{ij} - \bar{X})^2 = c \sum_{i=1}^{r} (\bar{X}_{i \cdot} - \bar{X})^2 + r \sum_{j=1}^{c} (\bar{X}_{\cdot j} - \bar{X})^2 + \sum_{i=1}^{r} \sum_{j=1}^{c} (X_{ij} - \bar{X}_{i \cdot} - \bar{X}_{\cdot j} + \bar{X})^2.


The total variation is a measure of the overall (total) variability in the data and the
(two-way) ANOVA decomposition decomposes this into three components:
between-blocks variation (which is attributable to the row factor level),
between-treatments variation (which is attributable to the column factor level) and
residual variation (which is attributable to the variation not explained by the row and
column factors).
The following are some useful formulae for manual computations.

Row sample means: \bar{X}_{i \cdot} = \sum_{j=1}^{c} X_{ij}/c, for i = 1, 2, \ldots, r.

Column sample means: \bar{X}_{\cdot j} = \sum_{i=1}^{r} X_{ij}/r, for j = 1, 2, \ldots, c.

Overall sample mean: \bar{X} = \sum_{i=1}^{r} \sum_{j=1}^{c} X_{ij}/n = \sum_{i=1}^{r} \bar{X}_{i \cdot}/r = \sum_{j=1}^{c} \bar{X}_{\cdot j}/c.

Total SS = \sum_{i=1}^{r} \sum_{j=1}^{c} X_{ij}^2 - rc\bar{X}^2.

Between-blocks (rows) variation: B_{row} = c \sum_{i=1}^{r} \bar{X}_{i \cdot}^2 - rc\bar{X}^2.

Between-treatments (columns) variation: B_{col} = r \sum_{j=1}^{c} \bar{X}_{\cdot j}^2 - rc\bar{X}^2.

Residual SS = (Total SS) − B_{row} − B_{col} = \sum_{i=1}^{r} \sum_{j=1}^{c} X_{ij}^2 - c \sum_{i=1}^{r} \bar{X}_{i \cdot}^2 - r \sum_{j=1}^{c} \bar{X}_{\cdot j}^2 + rc\bar{X}^2.

In order to test the ‘no block (row) effect’ hypothesis of H0 : γ1 = γ2 = · · · = γr = 0, the test statistic is defined as:

F = \frac{B_{row}/(r-1)}{(\text{Residual SS})/((r-1)(c-1))} = \frac{(c-1)B_{row}}{\text{Residual SS}}.

Under H0, F \sim F_{r-1,\, (r-1)(c-1)}. We reject H0 at the 100α% significance level if:

f > F_{\alpha,\, r-1,\, (r-1)(c-1)}

where F_{\alpha,\, r-1,\, (r-1)(c-1)} is the top 100αth percentile of the F_{r-1,\, (r-1)(c-1)} distribution, i.e. P(F > F_{\alpha,\, r-1,\, (r-1)(c-1)}) = α, and f is the observed test statistic value.

The p-value of the test is:

p-value = P(F > f).

In order to test the ‘no treatment (column) effect’ hypothesis of H0 : β1 = β2 = · · · = βc = 0, the test statistic is defined as:

F = \frac{B_{col}/(c-1)}{(\text{Residual SS})/((r-1)(c-1))} = \frac{(r-1)B_{col}}{\text{Residual SS}}.

Under H0, F \sim F_{c-1,\, (r-1)(c-1)}. We reject H0 at the 100α% significance level if:

f > F_{\alpha,\, c-1,\, (r-1)(c-1)}.


The p-value of the test is defined in the usual way.

Two-way ANOVA table

As with one-way ANOVA, two-way ANOVA results are presented in a table as follows:
Source          DF               SS            MS                             F                             p-value
Row factor      r − 1            Brow          Brow/(r − 1)                   (c − 1)Brow / Residual SS     p
Column factor   c − 1            Bcol          Bcol/(c − 1)                   (r − 1)Bcol / Residual SS     p
Residual        (r − 1)(c − 1)   Residual SS   Residual SS/((r − 1)(c − 1))
Total           rc − 1           Total SS

10.8 Residuals
Before considering an example of two-way ANOVA, we briefly consider residuals.
Recall the original two-way ANOVA model:

Xij = µ + γi + βj + εij .

We now decompose the observations as follows:

Xij = X̄ + (X̄i· − X̄) + (X̄·j − X̄) + (Xij − X̄i· − X̄·j + X̄)

for i = 1, 2, . . . , r and j = 1, 2, . . . , c, where we have the following point estimators.

\hat{\mu} = \bar{X} is the point estimator of µ.

\hat{\gamma}_i = \bar{X}_{i \cdot} - \bar{X} is the point estimator of γi, for i = 1, 2, . . . , r.

\hat{\beta}_j = \bar{X}_{\cdot j} - \bar{X} is the point estimator of βj, for j = 1, 2, . . . , c.

It follows that the residual, i.e. the estimator of εij, is:

\hat{\varepsilon}_{ij} = X_{ij} - \bar{X}_{i \cdot} - \bar{X}_{\cdot j} + \bar{X}

for i = 1, 2, . . . r and j = 1, 2, . . . , c.
The two-way ANOVA model assumes εij ∼ N (0, σ 2 ) and so, if the model structure is
correct, then the εbij s should behave like independent N (0, σ 2 ) random variables.


Example 10.6 The following table lists the percentage annual returns (calculated
four times per annum) of the Common Stock Index at the New York Stock
Exchange during 1981–85, available in the data file ‘NYSE.csv’.

1st quarter 2nd quarter 3rd quarter 4th quarter


1981 5.7 6.0 7.1 6.7
1982 7.2 7.0 6.1 5.2
1983 4.9 4.1 4.2 4.4
1984 4.5 4.9 4.5 4.5
1985 4.4 4.2 4.2 3.6

(a) Is the variability in returns from year to year statistically significant?


(b) Are returns affected by the quarter of the year?
Using two-way ANOVA, we test the no row effect hypothesis to answer (a), and test
the no column effect hypothesis to answer (b). We have r = 5 and c = 4.
The row sample means are calculated using \bar{X}_{i \cdot} = \sum_{j=1}^{c} X_{ij}/c, which gives 6.375, 6.375, 4.4, 4.6 and 4.1, for i = 1, 2, . . . , 5, respectively.

The column sample means are calculated using \bar{X}_{\cdot j} = \sum_{i=1}^{r} X_{ij}/r, which gives 5.34, 5.24, 5.22 and 4.88, for j = 1, 2, 3, 4, respectively.

The overall sample mean is \bar{x} = \sum_{i=1}^{r} \bar{x}_{i \cdot}/r = 5.17.

The sum of the squared observations is \sum_{i=1}^{r} \sum_{j=1}^{c} x_{ij}^2 = 559.06.

Hence we have the following.

\text{Total SS} = \sum_{i=1}^{r} \sum_{j=1}^{c} x_{ij}^2 - rc\bar{x}^2 = 559.06 - 20 \times (5.17)^2 = 559.06 - 534.578 = 24.482.

b_{row} = c \sum_{i=1}^{r} \bar{x}_{i \cdot}^2 - rc\bar{x}^2 = 4 \times 138.6112 - 534.578 = 19.867.

b_{col} = r \sum_{j=1}^{c} \bar{x}_{\cdot j}^2 - rc\bar{x}^2 = 5 \times 107.036 - 534.578 = 0.602.

\text{Residual SS} = (\text{Total SS}) - b_{row} - b_{col} = 24.482 - 19.867 - 0.602 = 4.013.

To test the no row effect hypothesis H0 : γ1 = γ2 = · · · = γ5 = 0, the test statistic value is:

f = \frac{(c-1)b_{row}}{\text{Residual SS}} = \frac{3 \times 19.867}{4.013} = 14.852.
Under H0, F \sim F_{r-1,\, (r-1)(c-1)} = F_{4,\, 12}. Using Table 9 of Murdoch and Barnes' Statistical Tables, since F_{0.01,\, 4,\, 12} = 5.412 < 14.852, we reject H0 at the 1% significance level. We conclude that there is strong evidence that the return does depend on the year.

To test the no column effect hypothesis H0 : β1 = β2 = β3 = β4 = 0, the test statistic value is:

f = \frac{(r-1)b_{col}}{\text{Residual SS}} = \frac{4 \times 0.602}{4.013} = 0.600.
Under H0 , F ∼ Fc−1, (r−1)(c−1) = F3, 12 . Since F0.10, 3, 12 = 2.606 > 0.600, we cannot
reject H0 even at the 10% significance level. Therefore, there is no significant
evidence indicating that the return depends on the quarter.
The results may be summarised in a two-way ANOVA table as follows:

Source DF SS MS F p-value
Year 4 19.867 4.967 14.852 < 0.01
Quarter 3 0.602 0.201 0.600 > 0.10
Residual 12 4.013 0.334
Total 19 24.482

We could also provide 95% confidence interval estimates for each block and
treatment level by using the pooled estimator of σ 2 , which is:
S^2 = \frac{\text{Residual SS}}{(r-1)(c-1)} = \text{Residual MS}.

For the given data, s² = 0.334.
R produces the following output:

> anova(lm(Return ~ Year + Quarter))


Analysis of Variance Table

Response: Return
Df Sum Sq Mean Sq F value Pr(>F)
Year 4 19.867 4.9667 14.852 0.0001349 ***
Quarter 3 0.602 0.2007 0.600 0.6271918
Residuals 12 4.013 0.3344
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note that the confidence intervals for years 1 and 2 (corresponding to 1981 and
1982) are separated from those for years 3 to 5 (that is, 1983 to 1985), which is
consistent with rejection of H0 in the no row effect test. In contrast, the confidence
intervals for each quarter all overlap, which is consistent with our failure to reject H0
in the no column effect test.
Finally, we may also look at the residuals:

\hat{\varepsilon}_{ij} = X_{ij} - \hat{\mu} - \hat{\gamma}_i - \hat{\beta}_j \quad \text{for } i = 1, 2, \ldots, r \text{ and } j = 1, 2, \ldots, c.

If the assumed normal model (structure) is correct, the \hat{\varepsilon}_{ij} s should behave like independent N(0, σ²) random variables.
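
As a rough check in R (a sketch, not part of the original example), assuming the NYSE data have been read in with Return as the dependent variable and Year and Quarter as factors, as in the anova() call above:

> fit <- lm(Return ~ Year + Quarter)
> ehat <- residuals(fit)               # the estimated errors
> qqnorm(ehat); qqline(ehat)           # points near the line suggest approximate normality
> plot(fitted(fit), ehat)              # no obvious pattern suggests constant variance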


10.9 Overview of chapter


This chapter introduced analysis of variance as a statistical tool to detect differences
between group means. One-way and two-way analysis of variance frameworks were
presented depending on whether one or two independent variables were modelled,
respectively. Statistical inference in the form of hypothesis tests and confidence intervals
was conducted.

10.10 Key terms and concepts


ANOVA decomposition Between-blocks variation
Between-groups variation Between-treatments variation
One-way ANOVA Random errors
Residual Sample mean
Total variation Two-way ANOVA
Within-groups variation

A total of 4,000 cans are opened around the world every second. Ten babies are
conceived around the world every second. Each time you open a can, you stand
a 1-in-400 chance of falling pregnant.
(True or false?)

Chapter 11
Linear regression

11.1 Synopsis of chapter


This chapter covers linear regression whereby the variation in a continuous dependent
variable is modelled as being explained by one or more continuous independent
variables.

11.2 Learning outcomes


After completing this chapter, you should be able to:

derive from first principles the least squares estimators of the intercept and slope in
the simple linear regression model

explain how to construct confidence intervals and perform hypothesis tests for the
intercept and slope in the simple linear regression model

demonstrate how to construct confidence intervals and prediction intervals and


explain the difference between the two

summarise the multiple linear regression model with several explanatory variables,
and explain its interpretation

provide the assumptions on which regression models are based

interpret typical output from a computer package fitting of a regression model.

11.3 Introduction
Regression analysis is one of the most frequently-used statistical techniques. It aims
to model an explicit relationship between one dependent variable, often denoted as y,
and one or more regressors (also called covariates, or independent variables), often
denoted as x1 , x2 , . . . , xp .
The goal of regression analysis is to understand how y depends on x1 , x2 , . . . , xp and to
predict or control the unobserved y based on the observed x1 , x2 , . . . , xp . We start with
some simple examples with p = 1.


11.4 Introductory examples

Example 11.1 In a university town, the sales, y, of 10 Armand’s Pizza Parlour


restaurants are closely related to the student population, x, in their neighbourhoods.
The data file ‘Armand.csv’ contains the sales (in thousands of euros) in a period of
three months together with the numbers of students (in thousands) in their
neighbourhoods.
We plot y against x, and draw a straight line through the middle of the data points:

y = β0 + β1 x + ε

where ε stands for a random error term, β0 is the intercept and β1 is the slope of the
straight line.

For a given student population, x, the predicted sales are yb = β0 + β1 x.

Example 11.2 The data file ‘WeightHeight.csv’ contains the heights, x, and
weights, y, of 69 students in a class.
We plot y against x, and draw a straight line through the middle of the data cloud:
y = β0 + β1 x + ε
where ε stands for a random error term, β0 is the intercept and β1 is the slope of the
straight line.
For a given height, x, the predicted value yb = β0 + β1 x may be viewed as a kind of
‘standard weight’.


Example 11.3 Some other possible examples of y and x are shown in the following
table.

y x
Sales Price
Weight gain Protein in diet
Present FTSE 100 index Past FTSE 100 index
Consumption Income
Salary Tenure
Daughter’s height Mother’s height

In most cases, there are several x variables involved. We will consider such situations
later in this chapter.

Some questions to consider are the following.

How to draw a line through data clouds, i.e. how to estimate β0 and β1 ?
How accurate is the fitted line?
What is the error in predicting a future y?

11.5 Simple linear regression


We now present the simple linear regression model. Let the paired observations
(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn ) be drawn from the model:
yi = β0 + β1 xi + εi
where:
E(εi ) = 0 and Var(εi ) = E(ε2i ) = σ 2 > 0.
Furthermore, suppose Cov(εi , εj ) = E(εi εj ) = 0 for all i ≠ j. That is, the εi s are
assumed to be uncorrelated (remembering that a zero covariance between two random
variables implies that they are uncorrelated).
So the model has three parameters: β0 , β1 and σ 2 .
For convenience, we will treat x1 , x2 , . . . , xn as constants.1 We have:
E(yi ) = β0 + β1 xi and Var(yi ) = σ 2 .
Since the εi s are uncorrelated (by assumption), it follows that y1 , y2 , . . . , yn are also
uncorrelated with each other.
Sometimes we assume εi ∼ N (0, σ 2 ), in which case yi ∼ N (β0 + β1 xi , σ 2 ), and
y1 , y2 , . . . , yn are independent. (Remember that a linear transformation of a normal
random variable is also normal, and that for jointly normal random variables if they are
uncorrelated then they are also independent.)
1
If you study an econometrics course, you will explore regression models in much more detail than is
covered here. For example, x1 , x2 , . . . , xn will be treated as random variables in an econometrics course.


Our tasks are two-fold.

Statistical inference for β0 , β1 and σ 2 , i.e. (point) estimation, confidence intervals


and hypothesis testing.

Prediction intervals for future values of y.

We derive estimators of β0 and β1 using least squares estimation (introduced in Chapter 7). The least squares estimators (LSEs) of β0 and β1 are the values of (β0 , β1 ) at which the function:

L(\beta_0, \beta_1) = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2

obtains its minimum.


We proceed to partially differentiate L(β0 , β1 ) with respect to β0 and β1 , respectively.
Firstly:
n
∂ X
L(β0 , β1 ) = −2 (yi − β0 − β1 xi ).
∂β0 i=1

Upon setting this partial derivative to zero, this leads to:


n
X n
X
yi − nβb0 − βb1 xi = 0 or βb0 = ȳ − βb1 x̄.
i=1 i=1

Secondly:
n
∂ X
L(β0 , β1 ) = −2 xi (yi − β0 − β1 xi ).
∂β1 i=1

Upon setting this partial derivative to zero, this leads to:


n
X
0= xi (yi − βb0 − βb1 xi )
i=1
n
X
= xi (yi − ȳ − (βb1 xi − βb1 x̄))
i=1
n
X n
X
= xi (yi − ȳ) − βb1 xi (xi − x̄).
i=1 i=1

Hence:
n
P n
P
xi (yi − ȳ) (xi − x̄)(yi − ȳ)
i=1 i=1
βb1 = Pn = n
P and βb0 = ȳ − βb1 x̄.
xi (xi − x̄) (xi − x̄)2
i=1 i=1

The estimator \hat{\beta}_1 above is based on the fact that for any constant c, we have:

\sum_{i=1}^{n} x_i (y_i - \bar{y}) = \sum_{i=1}^{n} (x_i - c)(y_i - \bar{y})

since:

\sum_{i=1}^{n} c(y_i - \bar{y}) = c\sum_{i=1}^{n} (y_i - \bar{y}) = 0.

Given that \sum_{i=1}^{n} (x_i - \bar{x}) = 0, it follows that \sum_{i=1}^{n} c(x_i - \bar{x}) = 0 for any constant c.

In order to calculate \hat{\beta}_1 numerically, often the following formula is convenient:

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}.

An alternative derivation is as follows. Note L(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2. For any β0 and β1 , we have:

L(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i + \hat{\beta}_0 - \beta_0 + (\hat{\beta}_1 - \beta_1) x_i)^2 = L(\hat{\beta}_0, \hat{\beta}_1) + \sum_{i=1}^{n} (\hat{\beta}_0 - \beta_0 + (\hat{\beta}_1 - \beta_1) x_i)^2 + 2B \quad (11.1)

where:

B = \sum_{i=1}^{n} (\hat{\beta}_0 - \beta_0 + (\hat{\beta}_1 - \beta_1) x_i)(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = (\hat{\beta}_0 - \beta_0) \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) + (\hat{\beta}_1 - \beta_1) \sum_{i=1}^{n} x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i).

Now let (\hat{\beta}_0, \hat{\beta}_1) be the solution to the equations:

\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0 \quad \text{and} \quad \sum_{i=1}^{n} x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0 \quad (11.2)

such that B = 0. By (11.1), we have:

L(\beta_0, \beta_1) = L(\hat{\beta}_0, \hat{\beta}_1) + \sum_{i=1}^{n} (\hat{\beta}_0 - \beta_0 + (\hat{\beta}_1 - \beta_1) x_i)^2 \ge L(\hat{\beta}_0, \hat{\beta}_1).

Hence (\hat{\beta}_0, \hat{\beta}_1) are the least squares estimators (LSEs) of β0 and β1 , respectively.

To find the explicit expression from (11.2), note the first equation can be written as:

n\bar{y} - n\hat{\beta}_0 - n\hat{\beta}_1 \bar{x} = 0.

Hence \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}. Substituting this into the second equation, we have:

0 = \sum_{i=1}^{n} x_i (y_i - \bar{y} - \hat{\beta}_1 (x_i - \bar{x})) = \sum_{i=1}^{n} x_i (y_i - \bar{y}) - \hat{\beta}_1 \sum_{i=1}^{n} x_i (x_i - \bar{x}).

Therefore:

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i (y_i - \bar{y})}{\sum_{i=1}^{n} x_i (x_i - \bar{x})} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.

This completes the derivation.


Remember \sum_{i=1}^{n} (x_i - \bar{x}) = 0. Hence \sum_{i=1}^{n} c(x_i - \bar{x}) = 0 for any constant c.

We also note the estimator of σ², which is:

\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2}{n-2}.
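
A short R sketch (not part of the original text) of these formulae, using simulated data so that the manual estimates can be compared with lm():

> set.seed(1)
> n <- 50
> x <- runif(n, 0, 10)
> y <- 2 + 0.5*x + rnorm(n)                 # simulated data with beta0 = 2, beta1 = 0.5
> beta1.hat <- sum((x - mean(x))*(y - mean(y)))/sum((x - mean(x))^2)
> beta0.hat <- mean(y) - beta1.hat*mean(x)
> sigma2.hat <- sum((y - beta0.hat - beta1.hat*x)^2)/(n - 2)
> c(beta0.hat, beta1.hat)                   # matches the coefficients from lm()
> coef(lm(y ~ x))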

We now explore the properties of the LSEs \hat{\beta}_0 and \hat{\beta}_1. We now proceed to show that the means and variances of these LSEs are:

E(\hat{\beta}_0) = \beta_0 \quad \text{and} \quad Var(\hat{\beta}_0) = \frac{\sigma^2 \sum_{i=1}^{n} x_i^2}{n \sum_{i=1}^{n} (x_i - \bar{x})^2}

for \hat{\beta}_0, and:

E(\hat{\beta}_1) = \beta_1 \quad \text{and} \quad Var(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}

for \hat{\beta}_1.

Proof: Recall we treat the xi s as constants, and we have E(yi ) = β0 + β1 xi and also Var(yi ) = σ². Hence:

E(\bar{y}) = E\left(\frac{1}{n}\sum_{i=1}^{n} y_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(y_i) = \frac{1}{n}\sum_{i=1}^{n} (\beta_0 + \beta_1 x_i) = \beta_0 + \beta_1 \bar{x}.

Therefore:

E(y_i - \bar{y}) = \beta_0 + \beta_1 x_i - (\beta_0 + \beta_1 \bar{x}) = \beta_1 (x_i - \bar{x}).

Consequently, we have:

E(\hat{\beta}_1) = E\left(\frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right) = \frac{\sum_{i=1}^{n} (x_i - \bar{x}) E(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\beta_1 \sum_{i=1}^{n} (x_i - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \beta_1.

Now:

E(\hat{\beta}_0) = E(\bar{y} - \hat{\beta}_1 \bar{x}) = \beta_0 + \beta_1 \bar{x} - \beta_1 \bar{x} = \beta_0.

Therefore, the LSEs \hat{\beta}_0 and \hat{\beta}_1 are unbiased estimators of β0 and β1 , respectively.

To work out the variances, the key is to write \hat{\beta}_1 and \hat{\beta}_0 as linear estimators (i.e. linear combinations of the yi s):

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x}) y_i}{\sum_{k=1}^{n} (x_k - \bar{x})^2} = \sum_{i=1}^{n} a_i y_i

where a_i = (x_i - \bar{x}) \big/ \sum_{k=1}^{n} (x_k - \bar{x})^2 and:

\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \bar{y} - \sum_{i=1}^{n} a_i \bar{x} y_i = \sum_{i=1}^{n} \left(\frac{1}{n} - a_i \bar{x}\right) y_i.

Note that:

\sum_{i=1}^{n} a_i = 0 \quad \text{and} \quad \sum_{i=1}^{n} a_i^2 = \frac{1}{\sum_{k=1}^{n} (x_k - \bar{x})^2}.

Now we note the following lemma, without proof. Let y1 , y2 , . . . , yn be uncorrelated random variables, and b1 , b2 , . . . , bn be constants, then:

Var\left(\sum_{i=1}^{n} b_i y_i\right) = \sum_{i=1}^{n} b_i^2 Var(y_i).

By this lemma:

Var(\hat{\beta}_1) = Var\left(\sum_{i=1}^{n} a_i y_i\right) = \sigma^2 \sum_{i=1}^{n} a_i^2 = \frac{\sigma^2}{\sum_{k=1}^{n} (x_k - \bar{x})^2}

and:

Var(\hat{\beta}_0) = \sigma^2 \sum_{i=1}^{n} \left(\frac{1}{n} - a_i \bar{x}\right)^2 = \sigma^2 \left(\frac{1}{n} + \sum_{i=1}^{n} a_i^2 \bar{x}^2\right) = \frac{\sigma^2}{n}\left(1 + \frac{n\bar{x}^2}{\sum_{k=1}^{n} (x_k - \bar{x})^2}\right) = \frac{\sigma^2 \sum_{k=1}^{n} x_k^2}{n \sum_{k=1}^{n} (x_k - \bar{x})^2}.

The last equality uses the fact that:

\sum_{k=1}^{n} x_k^2 = \sum_{k=1}^{n} (x_k - \bar{x})^2 + n\bar{x}^2.


11.6 Inference for parameters in normal regression models

The normal simple linear regression model is yi = β0 + β1 xi + εi , where:

ε1 , ε2 , . . . , εn ∼IID N(0, σ²).

y1 , y2 , . . . , yn are independent (but not identically distributed) and:

yi ∼ N(β0 + β1 xi , σ²).

Since any linear combination of normal random variables is also normal, the LSEs of β0 and β1 (as linear estimators) are also normal random variables. In fact:

\hat{\beta}_0 \sim N\left(\beta_0,\ \frac{\sigma^2 \sum_{i=1}^{n} x_i^2}{n \sum_{i=1}^{n} (x_i - \bar{x})^2}\right) \quad \text{and} \quad \hat{\beta}_1 \sim N\left(\beta_1,\ \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right).

Since σ² is unknown in practice, we replace σ² by its estimator:

\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2}{n-2}

and use the estimated standard errors:

E.S.E.(\hat{\beta}_0) = \frac{\hat{\sigma}}{\sqrt{n}} \left(\frac{\sum_{i=1}^{n} x_i^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right)^{1/2}

and:

E.S.E.(\hat{\beta}_1) = \frac{\hat{\sigma}}{\left(\sum_{i=1}^{n} (x_i - \bar{x})^2\right)^{1/2}}.

The following results all make use of distributional results introduced earlier in the
course. Statistical inference (confidence intervals and hypothesis testing) for the normal
simple linear regression model can then be performed.

i. We have:

\frac{(n-2)\hat{\sigma}^2}{\sigma^2} = \frac{\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2}{\sigma^2} \sim \chi^2_{n-2}.

ii. \hat{\beta}_0 and \hat{\sigma}^2 are independent, hence:

\frac{\hat{\beta}_0 - \beta_0}{E.S.E.(\hat{\beta}_0)} \sim t_{n-2}.

iii. \hat{\beta}_1 and \hat{\sigma}^2 are independent, hence:

\frac{\hat{\beta}_1 - \beta_1}{E.S.E.(\hat{\beta}_1)} \sim t_{n-2}.

Confidence intervals for the simple linear regression model parameters

A 100(1 − α)% confidence interval for β0 is:

\left(\hat{\beta}_0 - t_{\alpha/2,\, n-2} \times E.S.E.(\hat{\beta}_0),\ \hat{\beta}_0 + t_{\alpha/2,\, n-2} \times E.S.E.(\hat{\beta}_0)\right)

and a 100(1 − α)% confidence interval for β1 is:

\left(\hat{\beta}_1 - t_{\alpha/2,\, n-2} \times E.S.E.(\hat{\beta}_1),\ \hat{\beta}_1 + t_{\alpha/2,\, n-2} \times E.S.E.(\hat{\beta}_1)\right)

where t_{\alpha,\, k} denotes the top 100αth percentile of the Student's t_k distribution, obtained from Table 7 of Murdoch and Barnes' Statistical Tables.
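
In R, such confidence intervals can be obtained from a fitted model object; a sketch (not in the original), assuming reg <- lm(y ~ x) has been fitted:

> est <- coef(summary(reg))           # estimates and estimated standard errors
> est[, "Estimate"] - qt(0.975, df.residual(reg))*est[, "Std. Error"]   # lower limits
> est[, "Estimate"] + qt(0.975, df.residual(reg))*est[, "Std. Error"]   # upper limits
> confint(reg, level = 0.95)          # the same intervals from the built-in function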

Tests for the regression slope

The relationship between y and x in the regression model hinges on β1 . If β1 = 0, then y ∼ N(β0 , σ²).

To validate the use of the regression model, we need to make sure that β1 ≠ 0, or more practically that \hat{\beta}_1 is significantly non-zero. This amounts to testing:

H0 : β1 = 0 vs. H1 : β1 ≠ 0.

Under H0 , the test statistic is:

T = \frac{\hat{\beta}_1}{E.S.E.(\hat{\beta}_1)} \sim t_{n-2}.

At the 100α% significance level, we reject H0 if |t| > tα/2, n−2 , where t is the observed
test statistic value.
Alternatively, we could use H1 : β1 < 0 or H1 : β1 > 0 if there was a rationale for
doing so. In such cases, we would reject H0 if t < −tα, n−2 and t > tα, n−2 for the
lower-tailed and upper-tailed t tests, respectively.

Some remarks are the following.

i. For testing H0 : β1 = b for a given constant b, the above test still applies, but now with the following test statistic:

T = \frac{\hat{\beta}_1 - b}{E.S.E.(\hat{\beta}_1)}.

ii. Tests for the regression intercept β0 may be constructed in a similar manner, replacing β1 and \hat{\beta}_1 with β0 and \hat{\beta}_0 , respectively.

In the normal regression model, the LSEs \hat{\beta}_0 and \hat{\beta}_1 are also the MLEs of β0 and β1 , respectively.

Since εi = yi − β0 − β1 xi ∼IID N(0, σ²), the likelihood function is:

L(\beta_0, \beta_1, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(y_i - \beta_0 - \beta_1 x_i)^2\right) \propto \left(\frac{1}{\sigma^2}\right)^{n/2} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2\right).

Hence the log-likelihood function is:

l(\beta_0, \beta_1, \sigma^2) = \frac{n}{2}\ln\left(\frac{1}{\sigma^2}\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 + c.

Therefore, for any β0 , β1 and σ² > 0, we have:

l(\beta_0, \beta_1, \sigma^2) \le l(\hat{\beta}_0, \hat{\beta}_1, \sigma^2).

Hence (\hat{\beta}_0, \hat{\beta}_1) are the MLEs of (β0 , β1 ).

To find the MLE of σ², we need to maximise:

l(\sigma^2) = l(\hat{\beta}_0, \hat{\beta}_1, \sigma^2) = \frac{n}{2}\ln\left(\frac{1}{\sigma^2}\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2.

Setting u = 1/σ², it is equivalent to maximising:

g(u) = n\ln u - ub

where b = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2.

Setting dg(u)/du = n/u − b = 0 gives \hat{u} = n/b, i.e. g(u) attains its maximum at u = \hat{u}. Hence the MLE of σ² is:

\tilde{\sigma}^2 = \frac{1}{\hat{u}} = \frac{b}{n} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2.

Note the MLE \tilde{\sigma}^2 is a biased estimator of σ². In practice, we often use the unbiased estimator:

\hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2.

We now consider an empirical example of the normal simple linear regression model.


Example 11.4 The dataset ‘Cigarette.csv’ contains the annual cigarette


consumption, x, and the corresponding mortality rate, y, due to coronary heart
disease (CHD) of 21 countries. Some useful summary statistics calculated from the
data are:
\sum_{i=1}^{21} x_i = 45,110, \quad \sum_{i=1}^{21} y_i = 3,042.2, \quad \sum_{i=1}^{21} x_i^2 = 109,957,100, \quad \sum_{i=1}^{21} y_i^2 = 529,321.58 \quad \text{and} \quad \sum_{i=1}^{21} x_i y_i = 7,319,602.
Do these data support the suspicion that smoking contributes to CHD mortality?
(Note the assertion ‘smoking is harmful for health’ is largely based on statistical,
rather than laboratory, evidence.)
We fit the regression model y = β0 + β1 x + ε. Our least squares estimates of β1 and β0 are, respectively:

\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{\sum_i x_i y_i - n\bar{x}\bar{y}}{\sum_i x_i^2 - n\bar{x}^2} = \frac{\sum_i x_i y_i - \sum_i x_i \sum_j y_j / n}{\sum_i x_i^2 - (\sum_i x_i)^2 / n} = \frac{7,319,602 - 45,110 \times 3,042.2/21}{109,957,100 - (45,110)^2/21} = 0.06

and:

\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \frac{3,042.2 - 0.06 \times 45,110}{21} = 15.77.

Also:

\hat{\sigma}^2 = \frac{\sum_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2}{n-2} = \frac{\sum_i y_i^2 + n\hat{\beta}_0^2 + \hat{\beta}_1^2 \sum_i x_i^2 - 2\hat{\beta}_0 \sum_i y_i - 2\hat{\beta}_1 \sum_i x_i y_i + 2\hat{\beta}_0 \hat{\beta}_1 \sum_i x_i}{n-2} = 2,181.66.
We now proceed to test H0 : β1 = 0 vs. H1 : β1 > 0. (If indeed smoking contributes
to CHD mortality, then β1 > 0.)
We have calculated βb1 = 0.06. However, is this deviation from zero due to sampling
error, or is it significantly different from zero? (The magnitude of βb1 itself is not
important in determining if β1 = 0 or not – changing the scale of x may make βb1
arbitrarily small.)
Under H0 , the test statistic is:

T = \frac{\hat{\beta}_1}{E.S.E.(\hat{\beta}_1)} \sim t_{n-2} = t_{19}

where E.S.E.(\hat{\beta}_1) = \hat{\sigma} \big/ \left(\sum_i (x_i - \bar{x})^2\right)^{1/2} = 0.01293.
Since t = 0.06/0.01293 = 4.64 > 2.54 = t0.01, 19 , we reject the hypothesis β1 = 0 at
the 1% significance level and we conclude that there is strong evidence that smoking
contributes to CHD mortality.
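
The arithmetic in this example can be reproduced in R from the summary statistics alone (a sketch, not part of the original example):

> n <- 21
> sum.x <- 45110; sum.y <- 3042.2
> sum.x2 <- 109957100; sum.y2 <- 529321.58; sum.xy <- 7319602
> xbar <- sum.x/n; ybar <- sum.y/n
> Sxx <- sum.x2 - n*xbar^2; Sxy <- sum.xy - n*xbar*ybar
> beta1 <- Sxy/Sxx                                   # about 0.06
> beta0 <- ybar - beta1*xbar                         # about 15.77
> sigma2 <- (sum.y2 - n*ybar^2 - beta1^2*Sxx)/(n-2)  # residual SS/(n-2), about 2,181.66
> beta1/sqrt(sigma2/Sxx)                             # t statistic, about 4.64
> qt(0.99, df = n-2)                                 # upper 1% critical value, about 2.54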


11.7 Regression ANOVA


In Chapter 10 we discussed ANOVA, whereby we decomposed the total variation of a
continuous dependent variable. In a similar way we can decompose the total variation of
y in the simple linear regression model. It can be shown that the regression ANOVA
decomposition is:

\sum_{i=1}^{n} (y_i - \bar{y})^2 = \hat{\beta}_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 + \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2

where, denoting sum of squares by ‘SS’, we have the following.

Total SS is \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2.

Regression (explained) SS is \hat{\beta}_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 = \hat{\beta}_1^2 \left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right).

Residual (error) SS is \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 = Total SS − Regression SS.

If εi ∼ N(0, σ²) and β1 = 0, then it can be shown that:

\sum_{i=1}^{n} (y_i - \bar{y})^2 / \sigma^2 \sim \chi^2_{n-1}

\hat{\beta}_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 / \sigma^2 \sim \chi^2_1

\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 / \sigma^2 \sim \chi^2_{n-2}.

Therefore, under H0 : β1 = 0, we have:

F = \frac{(\text{Regression SS})/1}{(\text{Residual SS})/(n-2)} = \frac{(n-2)\hat{\beta}_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2}{\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2} = \left(\frac{\hat{\beta}_1}{E.S.E.(\hat{\beta}_1)}\right)^2 \sim F_{1,\, n-2}.

We reject H0 at the 100α% significance level if f > F_{\alpha,\, 1,\, n-2}, where f is the observed test statistic value and F_{\alpha,\, 1,\, n-2} is the top 100αth percentile of the F_{1,\, n-2} distribution, obtained from Table 9 of Murdoch and Barnes' Statistical Tables.
A useful statistic is the coefficient of determination, denoted as R², defined as:

R^2 = \frac{\text{Regression SS}}{\text{Total SS}} = 1 - \frac{\text{Residual SS}}{\text{Total SS}}.
If we view Total SS as the total variation (or energy) of y, then R2 is the proportion of
the total variation of y explained by x. Note that R2 ∈ [0, 1]. The closer R2 is to 1, the
better the explanatory power of the regression model.
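
In R, R² is reported by summary() of a fitted model; it can also be computed directly from the two sums of squares (a sketch, not in the original, assuming a fitted simple linear regression object reg):

> # assuming reg <- lm(y ~ x) has already been fitted
> y.obs <- model.response(model.frame(reg))
> total.ss <- sum((y.obs - mean(y.obs))^2)
> residual.ss <- sum(residuals(reg)^2)
> 1 - residual.ss/total.ss          # R^2
> summary(reg)$r.squared            # the same value reported by summary()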


11.8 Confidence intervals for E(y)


Based on the observations (xi , yi ), for i = 1, 2, . . . , n, we fit a regression model:

yb = βb0 + βb1 x.

Our goal is to predict the unobserved y corresponding to a known x. The point


prediction is:
yb = βb0 + βb1 x.

For the analysis to be more informative, we would like to have some ‘error bars’ for our
prediction. We introduce two methods as follows.

A confidence interval for µ(x) = E(y) = β0 + β1 x.

A prediction interval for y.

A confidence interval is an interval estimator of an unknown parameter (i.e. for a


constant) while a prediction interval is for a random variable. They are different and
serve different purposes.
We assume the model is normal, i.e. ε = y − β0 − β1 x ∼ N(0, σ²), and let \hat{\mu}(x) = \hat{\beta}_0 + \hat{\beta}_1 x, such that \hat{\mu}(x) is an unbiased estimator of µ(x). We note without proof that:

\hat{\mu}(x) \sim N\left(\mu(x),\ \frac{\sigma^2}{n} \cdot \frac{\sum_{i=1}^{n} (x_i - x)^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}\right).

Standardising gives:

\frac{\hat{\mu}(x) - \mu(x)}{\sqrt{(\sigma^2/n) \sum_{i=1}^{n} (x_i - x)^2 \big/ \sum_{j=1}^{n} (x_j - \bar{x})^2}} \sim N(0, 1).

In practice σ² is unknown, but it can be shown that (n − 2)\hat{\sigma}^2/\sigma^2 \sim \chi^2_{n-2}, where \hat{\sigma}^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2/(n-2). Furthermore, \hat{\mu}(x) and \hat{\sigma}^2 are independent. Hence:

\frac{\hat{\mu}(x) - \mu(x)}{\sqrt{(\hat{\sigma}^2/n) \sum_{i=1}^{n} (x_i - x)^2 \big/ \sum_{j=1}^{n} (x_j - \bar{x})^2}} \sim t_{n-2}.


Confidence interval for µ(x)

A 100(1 − α)% confidence interval for µ(x) is:

\hat{\mu}(x) \pm t_{\alpha/2,\, n-2} \times \hat{\sigma} \times \left(\frac{\sum_{i=1}^{n} (x_i - x)^2}{n \sum_{j=1}^{n} (x_j - \bar{x})^2}\right)^{1/2}.
j=1

Such a confidence interval contains the true expectation E(y) = µ(x) with probability
1 − α over repeated samples. It does not cover y with probability 1 − α.

11.9 Prediction intervals for y

A 100(1 − α)% prediction interval is an interval which contains y with probability


1 − α.
We may assume that the y to be predicted is independent of y1 , y2 , . . . , yn used in the
estimation of the regression model.
Hence y − \hat{\mu}(x) is normal with mean 0 and variance:

Var(y) + Var(\hat{\mu}(x)) = \sigma^2 + \frac{\sigma^2}{n} \cdot \frac{\sum_{i=1}^{n} (x_i - x)^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}.

Therefore:

\frac{y - \hat{\mu}(x)}{\left[\hat{\sigma}^2 \left(1 + \frac{\sum_{i=1}^{n} (x_i - x)^2}{n \sum_{j=1}^{n} (x_j - \bar{x})^2}\right)\right]^{1/2}} \sim t_{n-2}.

Prediction interval for y

A 100(1 − α)% prediction interval covering y with probability 1 − α is:

\hat{\mu}(x) \pm t_{\alpha/2,\, n-2} \times \hat{\sigma} \times \left(1 + \frac{\sum_{i=1}^{n} (x_i - x)^2}{n \sum_{j=1}^{n} (x_j - \bar{x})^2}\right)^{1/2}.


Some remarks are the following.

i. It holds that:

P\left(y \in \hat{\mu}(x) \pm t_{\alpha/2,\, n-2} \times \hat{\sigma} \times \left(1 + \frac{\sum_{i=1}^{n} (x_i - x)^2}{n \sum_{j=1}^{n} (x_j - \bar{x})^2}\right)^{1/2}\right) = 1 - \alpha.

ii. The prediction interval for y is wider than the confidence interval for E(y). The
former contains the unobserved random variable y with probability 1 − α, the
latter contains the unknown constant E(y) with probability 1 − α over repeated
samples.

Example 11.5 The dataset ‘UsedFord.csv’ contains the prices (y, in $000s) of 100
three-year-old Ford Tauruses together with their mileages (x, in thousands of miles)
when they were sold at auction. Based on these data, a car dealer needs to make two
decisions.

1. To prepare cash for bidding on one three-year-old Ford Taurus with a mileage of
x = 40.

2. To prepare buying several three-year-old Ford Tauruses with mileages close to


x = 40 from a rental company.
For the first task, a prediction interval would be more appropriate. For the second
task, the car dealer needs to know the average price and, therefore, a confidence
interval is appropriate. This can be easily done using R.

> reg <- lm(Price~ Mileage)


> summary(reg)

Call:
lm(formula = Price ~ Mileage)

Residuals:
Min 1Q Median 3Q Max
-0.68679 -0.27263 0.00521 0.23210 0.70071

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.248727 0.182093 94.72 <2e-16 ***
Mileage -0.066861 0.004975 -13.44 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3265 on 98 degrees of freedom


Multiple R-squared: 0.6483, Adjusted R-squared: 0.6447
F-statistic: 180.6 on 1 and 98 DF, p-value: < 2.2e-16


> new.Mileage <- data.frame(Mileage = c(40))


> predict(reg, newdata = new.Mileage, int = "c")
fit lwr upr
1 14.57429 14.49847 14.65011
> predict(reg, newdata = new.Mileage, int = "p")
fit lwr upr
1 14.57429 13.92196 15.22662

We predict that a Ford Taurus will sell for between $13,922 and $15,227. The
average selling price of several three-year-old Ford Tauruses is estimated to be
between $14,498 and $14,650. Because predicting the selling price for one car is more
difficult, the corresponding prediction interval is wider than the confidence interval.
To produce the plots with confidence intervals for E(y) and prediction intervals for
y, we proceed as follows:

> pc <- predict(reg,int="c")


> pp <- predict(reg,int="p")
> plot(Mileage,Price,pch=16)
> matlines(Mileage,pc)
> matlines(Mileage,pp)
[Figure: scatterplot of Price against Mileage with the fitted line, confidence bands for E(y) and prediction bands for y.]

11.10 Multiple linear regression models


For most practical problems, the variable of interest, y, typically depends on several
explanatory variables, say x1 , x2 , . . . , xp , leading to the multiple linear regression
model. In this course we only provide a brief overview of the multiple linear regression
model. Subsequent econometrics courses would explore this model in much greater
depth.


Let (yi , xi1 , xi2 , . . . , xip ), for i = 1, 2, . . . , n, be observations from the model:

yi = β0 + β1 xi1 + β2 xi2 + · · · + βp xip + εi

where:

E(εi ) = 0, Var(εi ) = σ² > 0 and Cov(εi , εj ) = 0 for all i ≠ j.

The multiple linear regression model is a natural extension of the simple linear
regression model, just with more parameters: β0 , β1 , β2 , . . . , βp and σ 2 .
Treating all of the xij s as constants as before, we have:

E(yi ) = β0 + β1 xi1 + β2 xi2 + · · · + βp xip and Var(yi ) = σ 2 .

y1 , y2 , . . . , yn are uncorrelated with each other, again as before.


If in addition εi ∼ N(0, σ²), then:

y_i \sim N\left(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij},\ \sigma^2\right).

Estimation of the intercept and slope parameters is still performed using least squares estimation. The LSEs \hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_p are obtained by minimising:

\sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\right)^2

leading to the fitted regression model:

\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p.

The residuals are expressed as:

\hat{\varepsilon}_i = y_i - \hat{\beta}_0 - \sum_{j=1}^{p} \hat{\beta}_j x_{ij}.

Just as with the simple linear regression model, we can decompose the total variation of y such that:

\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} \hat{\varepsilon}_i^2

or, in words:

Total SS = Regression SS + Residual SS.

An unbiased estimator of σ² is:

\hat{\sigma}^2 = \frac{1}{n-p-1} \sum_{i=1}^{n} \left(y_i - \hat{\beta}_0 - \sum_{j=1}^{p} \hat{\beta}_j x_{ij}\right)^2 = \frac{\text{Residual SS}}{n-p-1}.

We can test a single slope coefficient by testing:

H0 : βi = 0 vs. H1 : βi ≠ 0.


Under H0 , the test statistic is:

T = \frac{\hat{\beta}_i}{E.S.E.(\hat{\beta}_i)} \sim t_{n-p-1}

and we reject H0 if |t| > tα/2, n−p−1 . However, note the slight difference in the
interpretation of the slope coefficient βj . In the multiple regression setting, βj is the
effect of xj on y, holding all other independent variables fixed – this is unfortunately
not always practical.
It is also possible to test whether all the regression coefficients are equal to zero. This is
known as a joint test of significance and can be used to test the overall significance
of the regression model, i.e. whether there is at least one significant explanatory
(independent) variable, by testing:

H0 : β1 = β2 = · · · = βp = 0 vs. H1 : At least one βi ≠ 0.

Indeed, it is preferable to perform this joint test of significance before conducting t tests
of individual slope coefficients. Failure to reject H0 would render the model useless and
hence the model would not warrant any further statistical investigation.
Provided εi ∼ N(0, σ²), under H0 : β1 = β2 = · · · = βp = 0, the test statistic is:

F = \frac{(\text{Regression SS})/p}{(\text{Residual SS})/(n-p-1)} \sim F_{p,\, n-p-1}.

We reject H0 at the 100α% significance level if f > Fα, p, n−p−1 .


It may be shown that:

\text{Regression SS} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = \sum_{i=1}^{n} \left(\hat{\beta}_1 (x_{i1} - \bar{x}_1) + \hat{\beta}_2 (x_{i2} - \bar{x}_2) + \cdots + \hat{\beta}_p (x_{ip} - \bar{x}_p)\right)^2.

Hence, under H0 , f should be very small.
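
A brief R sketch (not part of the original text) of fitting a multiple linear regression and reading off the joint F test, using simulated data with two hypothetical regressors:

> set.seed(2)
> n <- 100
> x1 <- rnorm(n); x2 <- rnorm(n)
> y <- 1 + 0.5*x1 - 0.3*x2 + rnorm(n)
> fit <- lm(y ~ x1 + x2)
> summary(fit)    # t tests for each slope; the F-statistic line is the joint test
>                 # of H0: beta1 = beta2 = 0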


We now conclude the chapter with worked examples of linear regression using R.

11.11 Regression using R


To solve practical regression problems, we need to use statistical computing packages.
All of them include linear regression analysis. In fact all statistical packages, such as R,
make regression analysis much easier to use.

Example 11.6 We illustrate the use of linear regression in R using the dataset
‘Armand.csv’, introduced in Example 11.1.

> reg <- lm(Sales ~ Student.population)


> summary(reg)


Call:
lm(formula = Sales ~ Student.population)

Residuals:
Min 1Q Median 3Q Max
-21.00 -9.75 -3.00 11.25 18.00

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 60.0000 9.2260 6.503 0.000187 ***
Student.population 5.0000 0.5803 8.617 2.55e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.83 on 8 degrees of freedom


Multiple R-squared: 0.9027, Adjusted R-squared: 0.8906
F-statistic: 74.25 on 1 and 8 DF, p-value: 2.549e-05
The fitted line is \hat{y} = 60 + 5x. We have \hat{\sigma}^2 = (13.83)^2. Also, \hat{\beta}_0 = 60 and E.S.E.(\hat{\beta}_0) = 9.2260; \hat{\beta}_1 = 5 and E.S.E.(\hat{\beta}_1) = 0.5803.
For testing H0 : β0 = 0 we have t = \hat{\beta}_0/E.S.E.(\hat{\beta}_0) = 6.503. The p-value is P(|T| > 6.503) = 0.000187, where T ∼ t_{n−2}.
For testing H0 : β1 = 0 we have t = \hat{\beta}_1/E.S.E.(\hat{\beta}_1) = 8.617. The p-value is P(|T| > 8.617) = 0.0000255, where T ∼ t_{n−2}.
The F test statistic value is 74.25 with a corresponding p-value of:

P (F > 74.25) = 0.00002549

where F ∼ F_{1, 8}.

Example 11.7 We apply the simple linear regression model to study the
relationship between two series of financial returns – a regression of Cisco Systems
stock returns, y, on S&P500 Index returns, x. This regression model is an example of
the capital asset pricing model (CAPM).
Stock returns are defined as:

\text{return} = \frac{\text{current price} - \text{previous price}}{\text{previous price}} \approx \ln\left(\frac{\text{current price}}{\text{previous price}}\right)
when the difference between the two prices is small.
The data file ‘Returns.csv’ contains daily returns over the period 3 January – 29
December 2000 (i.e. n = 252 observations). The dataset has 5 columns: Day, S&P500
return, Cisco return, Intel return and Sprint return.
Daily prices are definitely not independent. However, daily returns may be seen as a
sequence of uncorrelated random variables.


> summary(S.P500)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-6.00451 -0.85028 -0.03791 -0.04242 0.79869 4.65458

> summary(Cisco)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-13.4387 -3.0819 -0.1150 -0.1336 2.6363 15.4151
For the S&P500, the average daily return is −0.04%, the maximum daily return is
4.65%, the minimum daily return is −6.01% and the standard deviation is 1.40%.
For Cisco, the average daily return is −0.13%, the maximum daily return is 15.42%,
the minimum daily return is −13.44% and the standard deviation is 4.23%.
We see that Cisco is much more volatile than the S&P500.

> sandpts <- ts(S.P500)


> ciscots <- ts(Cisco)
> ts.plot(sandpts,ciscots,col=c(1:2))
[Figure: time series plot of the S&P500 and Cisco daily returns (in %) against time, days 0–250.]

There is clear synchronisation between the movements of the two series of returns,
as evident from examining the sample correlation coefficient.

> cor.test(S.P500,Cisco)

Pearson’s product-moment correlation

data: S.P500 and Cisco


t = 14.943, df = 250, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0


95 percent confidence interval:


0.6155530 0.7470423
sample estimates:
cor
0.686878
We fit the regression model: Cisco = β0 + β1 S&P500 + ε.
Our rationale is that part of the fluctuation in Cisco returns was driven by the
fluctuation in the S&P500 returns.
R produces the following regression output.

> reg <- lm(Cisco ~ S.P500)


> summary(reg)

Call:
lm(formula = Cisco ~ S.P500)

Residuals:
Min 1Q Median 3Q Max
-13.1175 -2.0238 0.0091 2.0614 9.9491

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.04547 0.19433 -0.234 0.815
S.P500 2.07715 0.13900 14.943 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.083 on 250 degrees of freedom


Multiple R-squared: 0.4718, Adjusted R-squared: 0.4697
F-statistic: 223.3 on 1 and 250 DF, p-value: < 2.2e-16
The estimated slope is βb1 = 2.07715. The null hypothesis H0 : β1 = 0 is rejected with
a p-value of 0.000 (to three decimal places). Therefore, the test is extremely
significant.
Our interpretation is that when the market index goes up by 1%, Cisco stock goes
up by 2.07715%, on average. However, the error term ε in the model is large, with an
estimated σ̂ = 3.083%.
The p-value for testing H0 : β0 = 0 is 0.815, so we cannot reject the hypothesis that
β0 = 0. Recall βb0 = ȳ − βb1 x̄ and both ȳ and x̄ are very close to 0.
R2 = 47.18%, hence 47.18% of the variation of Cisco stock may be explained by the
variation of the S&P500 index, or, in other words, 47.18% of the risk in Cisco stock
is the market-related risk.
The capital asset pricing model (CAPM) is a simple asset pricing model in finance
given by:
yi = β0 + β1 xi + εi
where yi is a stock return and xi is a market return at time i.


The total risk of the stock is:

(1/n) Σ_{i=1}^{n} (yi − ȳ)² = (1/n) Σ_{i=1}^{n} (ŷi − ȳ)² + (1/n) Σ_{i=1}^{n} (yi − ŷi )².

The market-related (or systematic) risk is:

(1/n) Σ_{i=1}^{n} (ŷi − ȳ)² = β̂1² (1/n) Σ_{i=1}^{n} (xi − x̄)².

The firm-specific risk is:

(1/n) Σ_{i=1}^{n} (yi − ŷi )².
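As an illustration, this decomposition can be computed in R from the fitted CAPM
regression. The sketch below assumes the reg object and the Cisco vector from the
output above, and uses the divisor n as in the formulae.

> total  <- mean((Cisco - mean(Cisco))^2)        # total risk
> market <- mean((fitted(reg) - mean(Cisco))^2)  # market-related (systematic) risk
> firm   <- mean(residuals(reg)^2)               # firm-specific risk
> market/total                                   # proportion of market-related risk (equals R^2)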

Some remarks are the following.

i. β1 measures the market-related (or systematic) risk of the stock.

ii. Market-related risk is unavoidable, while firm-specific risk may be ‘diversified
away’ through hedging.

iii. Variance is a simple measure (and one of the most frequently-used) of risk in
finance.

Example 11.8 The data in the file ‘Foods.csv’ illustrate the effects of marketing
instruments on the weekly sales volume of a certain food product over a three-year
period. Data are real but transformed to protect the innocent!
There are observations on the following four variables:

y = LVOL: logarithms of weekly sales volume


x1 = PROMP : promotion price
x2 = FEAT : feature advertising
x3 = DISP : display measure.

R produces the following descriptive statistics.

> summary(Foods)
LVOL PROMP FEAT DISP
Min. :13.83 Min. :3.075 Min. : 2.84 Min. :12.42
1st Qu.:14.08 1st Qu.:3.330 1st Qu.:15.95 1st Qu.:20.59
Median :14.24 Median :3.460 Median :22.99 Median :25.11
Mean :14.28 Mean :3.451 Mean :24.84 Mean :25.31
3rd Qu.:14.43 3rd Qu.:3.560 3rd Qu.:33.49 3rd Qu.:29.34
Max. :15.07 Max. :3.865 Max. :57.10 Max. :45.94
n = 156. The values of FEAT and DISP are much larger than LVOL.
As always, first we plot the data to ascertain basic characteristics.


> LVOLts <- ts(LVOL)


> ts.plot(LVOLts)

[Figure: time series plot of LVOL against time (weeks 0–150).]

The time series plot indicates momentum in the data.


Next we show scatterplots between y and each xi .

> plot(PROMP,LVOL,pch=16)
[Figure: scatterplot of LVOL against PROMP.]


> plot(FEAT,LVOL,pch=16)

[Figure: scatterplot of LVOL against FEAT.]

> plot(DISP,LVOL,pch=16)
[Figure: scatterplot of LVOL against DISP.]

What can we observe from these pairwise plots?


There is a negative correlation between LVOL and PROMP.
There is a positive correlation between LVOL and FEAT.
There is little or no correlation between LVOL and DISP, but this might have
been blurred by the other input variables.


Therefore, we should regress LVOL on PROMP and FEAT first.


We run a multiple linear regression model using x1 and x2 as explanatory variables:
y = β0 + β1 x1 + β2 x2 + ε.

> reg <- lm(LVOL~PROMP + FEAT)


> summary(reg)

Call:
lm(formula = LVOL ~ PROMP + FEAT)

Residuals:
Min 1Q Median 3Q Max
-0.32734 -0.08519 -0.01011 0.08471 0.30804

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.1500102 0.2487489 68.94 <2e-16 ***
PROMP -0.9042636 0.0694338 -13.02 <2e-16 ***
FEAT 0.0100666 0.0008827 11.40 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1268 on 153 degrees of freedom


Multiple R-squared: 0.756, Adjusted R-squared: 0.7528
F-statistic: 237 on 2 and 153 DF, p-value: < 2.2e-16
We begin by performing a joint test of significance by testing H0 : β1 = β2 = 0. The
test statistic value is given in the regression ANOVA table as f = 237, with a
corresponding p-value of 0.000 (to three decimal places). Hence H0 is rejected and we
have strong evidence that at least one slope coefficient is not equal to zero.
Next we consider individual t tests of H0 : β1 = 0 and H0 : β2 = 0. The respective
test statistic values are −13.02 and 11.40, both with p-values of 0.000 (to three
decimal places) indicating that both slope coefficients are non-zero.
Turning to the estimated coefficients, βb1 = −0.904 (to three decimal places) which
indicates that LVOL decreases as PROMP increases controlling for FEAT. Also,
βb2 = 0.010 (to three decimal places) which indicates that LVOL increases as FEAT
increases, controlling for PROMP.
We could also compute 95% confidence intervals, given by:
βbi ± t0.025, n−3 × E.S.E.(βbi ).
Since n − 3 = 153 is large, t0.025, n−3 ≈ z0.025 = 1.96.
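In R, these confidence intervals can be obtained directly from the fitted model object;
the following one-liner is a sketch using the reg object above.

> confint(reg, level = 0.95)   # 95% confidence intervals for all coefficients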
R2 = 0.756. Therefore, 75.6% of the variation of LVOL can be explained (jointly)
with PROMP and FEAT. However, a large R2 does not necessarily mean that the
fitted model is useful. For the estimation of coefficients and predicting y, the
absolute measure ‘Residual SS’ (or σ b2 ) plays a critical role in determining the
accuracy of the model.


Consider now introducing DISP into the regression model to give three explanatory
variables:
y = β0 + β1 x1 + β2 x2 + β3 x3 + ε.
The reason for adding the third variable is that one would expect DISP to have an
impact on sales and we may wish to estimate its magnitude.

> reg <- lm(LVOL~PROMP + FEAT + DISP)


> summary(reg)

Call:
lm(formula = LVOL ~ PROMP + FEAT + DISP)

Residuals:
Min 1Q Median 3Q Max
-0.33363 -0.08203 -0.00272 0.07927 0.33812

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.2372251 0.2490226 69.220 <2e-16 ***
PROMP -0.9564415 0.0726777 -13.160 <2e-16 ***
FEAT 0.0101421 0.0008728 11.620 <2e-16 ***
DISP 0.0035945 0.0016529 2.175 0.0312 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1253 on 152 degrees of freedom


Multiple R-squared: 0.7633, Adjusted R-squared: 0.7587
F-statistic: 163.4 on 3 and 152 DF, p-value: < 2.2e-16

All the estimated coefficients have the right sign (according to commercial common
sense!) and are statistically significant. In particular, the relationship with DISP
seems real when the other inputs are taken into account. On the other hand, the
addition of DISP to the model has resulted in a very small reduction in σ̂, from
√0.0161 = 0.1268 to √0.0157 = 0.1253, and correspondingly a slightly higher R²
(0.7633, i.e. 76.33% of the variation of LVOL is explained by the model). Therefore,
DISP contributes very little to ‘explaining’ the variation of LVOL after the other
two explanatory variables, PROMP and FEAT, are taken into account.
Intuitively, we would expect a higher R2 if we add a further explanatory variable to
the model. However, the model has become more complex as a result – there is an
additional parameter to estimate. Therefore, strictly speaking, we should consider
the ‘adjusted R2 ’ statistic, although this will not be considered in this course.
Special care should be exercised when predicting with x out of the range of the
observations used to fit the model, which is called extrapolation.


11.12 Overview of chapter


This chapter has covered the linear regression model with one or more explanatory
variables. Least squares estimators were derived for the simple linear regression model,
and statistical inference procedures were also covered. The multiple linear regression
model and applications using R concluded the chapter.

11.13 Key terms and concepts


ANOVA decomposition
Coefficient of determination
Confidence interval
Dependent variable
Independent variable
Intercept
Least squares estimation
Linear estimators
Multiple linear regression
Prediction interval
Regression analysis
Regressor
Residual
Simple linear regression
Slope coefficient

Facts are stubborn, but statistics are more pliable.


(Mark Twain)

Appendix A
Sampling distributions of statistics

A.1 Worked examples


1. Suppose A, B and C are independent chi-squared random variables with 5, 7 and
10 degrees of freedom, respectively. Calculate:
(a) P (B < 12)
(b) P (A + B + C < 14)
(c) P (A − B − C < 0)
(d) P (A³ + B³ + C³ < 0).
In this question, you should use the closest value given in Murdoch and Barnes’
Statistical Tables. Further approximation is not required.

Solution:

(a) P (B < 12) ≈ 0.9, directly from Table 8, where B ∼ χ²_7 .

(b) A + B + C ∼ χ²_{5+7+10} = χ²_{22} , so P (A + B + C < 14) is the probability that such
a random variable is less than 14, which is approximately 0.1 from Table 8.

(c) Transforming and rearranging the probability, we need:

P (A < B + C) = P (A/5 < ((B + C)/17) × (17/5)) = P ((A/5)/((B + C)/17) < 3.4) = P (F < 3.4) ≈ 0.975

where F ∼ F5, 17 , using Table 9 (practice of which will be covered later in the
course; although Table 9 of Murdoch and Barnes’ Statistical Tables has yet to be
formally introduced, you should be able to see how this works).

(d) A chi-squared random variable only assumes non-negative values. Hence each
of A, B and C is non-negative, so A³ + B³ + C³ ≥ 0, and:

P (A³ + B³ + C³ < 0) = 0.
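If you have access to R, these table look-ups can be checked numerically with the
built-in distribution functions (a sketch; not required for the course, which uses
Murdoch and Barnes’ Statistical Tables):

> pchisq(12, df = 7)            # (a) P(B < 12), approximately 0.9
> pchisq(14, df = 22)           # (b) P(A + B + C < 14), approximately 0.1
> pf(3.4, df1 = 5, df2 = 17)    # (c) P(F < 3.4), approximately 0.975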

2. Suppose {Zi }, for i = 1, 2, . . . , k, are independent and identically distributed


standard normal random variables, i.e. Zi ∼ N (0, 1), for i = 1, 2, . . . , k.


State the distribution of:

(a) Z1²

(b) Z1²/Z2²

(c) Z1 /√(Z2²)

(d) Σ_{i=1}^{k} Zi /k

(e) Σ_{i=1}^{k} Zi²

(f) (3/2) × (Z1² + Z2²)/(Z3² + Z4² + Z5²).

Solution:

(a) Z1² ∼ χ²_1

(b) Z1²/Z2² ∼ F1, 1

(c) Z1 /√(Z2²) ∼ t1

(d) Σ_{i=1}^{k} Zi /k ∼ N (0, 1/k)

(e) Σ_{i=1}^{k} Zi² ∼ χ²_k

(f) (3/2) × (Z1² + Z2²)/(Z3² + Z4² + Z5²) ∼ F2, 3 .

3. X1 , X2 , X3 and X4 are independent normally distributed random variables each
with a mean of 0 and a standard deviation of 3. Find:

(a) P (X1 + 2X2 > 9)

(b) P (X1² + X2² > 54)

(c) P ((X1² + X2²) > 99(X3² + X4²)).

Solution:

(a) We have X1 ∼ N (0, 9) and X2 ∼ N (0, 9). Hence 2X2 ∼ N (0, 36) and
X1 + 2X2 ∼ N (0, 45). So:

P (X1 + 2X2 > 9) = P (Z > 9/√45) = P (Z > 1.34) = 0.0901.

(b) We have X1 /3 ∼ N (0, 1) and X2 /3 ∼ N (0, 1). Hence X1²/9 ∼ χ²_1 and
X2²/9 ∼ χ²_1 . Therefore, X1²/9 + X2²/9 ∼ χ²_2 . So:

P (X1² + X2² > 54) = P (Y > 6) = 0.05

where Y ∼ χ²_2 .

(c) We have X1²/9 + X2²/9 ∼ χ²_2 and also X3²/9 + X4²/9 ∼ χ²_2 . So:

(X1² + X2²)/(X3² + X4²) = ((X1² + X2²)/18)/((X3² + X4²)/18) ∼ F2, 2 .

Hence:

P ((X1² + X2²) > 99(X3² + X4²)) = P (Y > 99) = 0.01

where Y ∼ F2, 2 .

4. The independent random variables X1 , X2 and X3 are each normally distributed
with a mean of 0 and a variance of 4. Find:

(a) P (X1 > X2 + X3 )

(b) P (X1² > 9.25(X2² + X3²))

(c) P (X1 > 5(X2² + X3²)^{1/2} ).

Solution:

(a) We have Xi ∼ N (0, 4), for i = 1, 2, 3, hence:

X1 − X2 − X3 ∼ N (0, 12).

So:

P (X1 > X2 + X3 ) = P (X1 − X2 − X3 > 0) = P (Z > 0) = 0.5.

(b) We have Xi /2 ∼ N (0, 1), so Xi²/4 ∼ χ²_1 for i = 1, 2, 3. Hence:

2X1²/(X2² + X3²) = ((X1²/4)/1)/(((X2² + X3²)/4)/2) ∼ F1, 2 .

So:

P (X1² > 9.25(X2² + X3²)) = P (2X1²/(X2² + X3²) > 9.25 × 2) = P (Y > 18.5) = 0.05

where Y ∼ F1, 2 .

(c) We have:

P (X1 > 5(X2² + X3²)^{1/2} ) = P (X1 /2 > 5(X2²/4 + X3²/4)^{1/2} ) = P (X1 /2 > 5√2 × ((X2²/4 + X3²/4)/2)^{1/2} )

i.e. P (Y1 > 5√2 × √Y2 ), where Y1 ∼ N (0, 1) and Y2 ∼ χ²_2 /2, or P (Y3 > 7.07),
where Y3 ∼ t2 . From Table 7, this is approximately 0.01.

5. The independent random variables X1 , X2 , X3 and X4 are each normally
distributed with a mean of 0 and a variance of 4. Using Murdoch and Barnes’
Statistical Tables, derive values for k in each of the following cases:

(a) P (3X1 + 4X2 > 5) = k

(b) P (X1 > k√(X3² + X4²)) = 0.025

(c) P (X1² + X2² + X3² < k) = 0.9

(d) P (X2² + X3² + X4² > 19X1² + 20X3²) = k.

Solution:

(a) We have Xi ∼ N (0, 4), for i = 1, 2, 3, 4, hence 3X1 ∼ N (0, 36) and
4X2 ∼ N (0, 64). Therefore:

(3X1 + 4X2 )/10 = Z ∼ N (0, 1).

So, P (3X1 + 4X2 > 5) = k = P (Z > 0.5) = 0.3085.

(b) We have Xi /2 ∼ N (0, 1), for i = 1, 2, 3, 4, hence (X3² + X4²)/4 ∼ χ²_2 . So:

P (X1 > k√(X3² + X4²)) = 0.025 = P (T > k√2)

where T ∼ t2 and hence k√2 = 4.303, so k = 3.04268.

(c) We have (X1² + X2² + X3²)/4 ∼ χ²_3 , so:

P (X1² + X2² + X3² < k) = 0.9 = P (X < k/4)

where X ∼ χ²_3 . Therefore, k/4 = 6.251. Hence k = 25.004.

(d) P (X2² + X3² + X4² > 19X1² + 20X3²) = k simplifies to:

P (X2² + X4² > 19(X1² + X3²)) = k

and:

(X2² + X4²)/(X1² + X3²) ∼ F2, 2 .

So, from Table 9, k = 0.05.

6. Suppose that the heights of students are normally distributed with a mean of 68.5
inches and a standard deviation of 2.7 inches. If 200 random samples of size 25 are
drawn from this population with means recorded to the nearest 0.1 inch, find:
(a) the expected mean and standard deviation of the sampling distribution of the
mean
(b) the expected number of recorded sample means which fall between 67.9 and
69.2 inclusive
(c) the expected number of recorded sample means falling below 67.0.

Solution:

(a) The sampling distribution of the mean of 25 observations has the same mean
as the population, which is 68.5 inches. The standard deviation (standard
error) of the sample mean is 2.7/√25 = 0.54.

(b) Notice that the samples are random, so we cannot be sure exactly how many
will have means between 67.9 and 69.2 inches. We can work out the probability
that the sample mean will lie in this interval using the sampling distribution
X̄ ∼ N (68.5, (0.54)²).
We need to make a continuity correction, to account for the fact that the
recorded means are rounded to the nearest 0.1 inch. For example, the
probability that the recorded mean is ≥ 67.9 inches is the same as the
probability that the sample mean is > 67.85. Therefore, the probability we
want is:

P (67.85 < X̄ < 69.25) = P ((67.85 − 68.5)/0.54 < Z < (69.25 − 68.5)/0.54)
                       = P (−1.20 < Z < 1.39)
                       = Φ(1.39) − Φ(−1.20)
                       = 0.9177 − (1 − 0.1151)
                       = 0.8026.

As usual, the values of Φ(1.39) and Φ(−1.20) can be found from Table 3 of
Murdoch and Barnes’ Statistical Tables. Since there are 200 independent
random samples drawn, we can now think of each as a single trial. The
recorded mean lies between 67.9 and 69.2 with probability 0.8026 at each trial.
We are dealing with a binomial distribution with n = 200 trials and
probability of success π = 0.8026. The expected number of successes is:

nπ = 200 × 0.8026 = 160.52.

(c) The probability that the recorded mean is < 67.0 inches is:

P (X̄ < 66.95) = P (Z < (66.95 − 68.5)/0.54) = P (Z < −2.87) = Φ(−2.87) = 0.00205

so the expected number of recorded means below 67.0 out of a sample of 200 is:

200 × 0.00205 = 0.41.
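As a check, the probabilities and expected counts in (b) and (c) can be reproduced in
R (a sketch):

> p1 <- pnorm(69.25, mean = 68.5, sd = 0.54) - pnorm(67.85, mean = 68.5, sd = 0.54)
> 200 * p1   # expected number of recorded means between 67.9 and 69.2
> p2 <- pnorm(66.95, mean = 68.5, sd = 0.54)
> 200 * p2   # expected number of recorded means below 67.0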

7. If Z is a random variable with a standard normal distribution, what is
P (Z² < 3.841)?

Solution:
We can compute the probability in two different ways. Working with the standard
normal distribution, we have:

P (Z² < 3.841) = P (−√3.841 < Z < √3.841)
               = P (−1.96 < Z < 1.96)
               = Φ(1.96) − Φ(−1.96)
               = 0.9750 − (1 − 0.9750) = 0.95.

Alternatively, we can use the fact that Z² follows a χ²_1 distribution. From Table 8
of Murdoch and Barnes’ Statistical Tables we can see that 3.841 is the 5% right-tail
value for this distribution, and so P (Z² < 3.841) = 0.95, as before.

8. Suppose that X1 and X2 are independent N (0, 4) random variables. Compute
P (X1² < 36.84 − X2²).

Solution:
Rearrange the inequality to obtain:

P (X1² < 36.84 − X2²) = P (X1² + X2² < 36.84)
                      = P ((X1² + X2²)/4 < 36.84/4)
                      = P ((X1 /2)² + (X2 /2)² < 9.21).

Since X1 /2 and X2 /2 are independent N (0, 1) random variables, the sum of their
squares will follow a χ²_2 distribution. Using Table 8 of Murdoch and Barnes’
Statistical Tables, we see that 9.210 is the 1% right-tail value, so the probability we
are looking for is 0.99.

9. Suppose that X1 , X2 and X3 are independent N (0, 1) random variables, while Y
(independently) follows a χ²_5 distribution. Compute P (X1² + X2² < 7.236Y − X3²).

Solution:
Rearranging the inequality gives:

P (X1² + X2² < 7.236Y − X3²) = P (X1² + X2² + X3² < 7.236Y )
                             = P ((X1² + X2² + X3²)/Y < 7.236)
                             = P (((X1² + X2² + X3²)/3)/(Y /5) < (5/3) × 7.236)
                             = P (((X1² + X2² + X3²)/3)/(Y /5) < 12.060).

Since X1² + X2² + X3² ∼ χ²_3 , we have a ratio of independent χ²_3 and χ²_5 random
variables, each divided by its degrees of freedom. By definition, this follows an F3, 5
distribution. From Table 9 of Murdoch and Barnes’ Statistical Tables, we see that
12.060 is the 1% upper-tail value for this distribution, so the probability we want is
equal to 0.99.

10. Compare the normal distribution approximation to the exact values for the
upper-tail probabilities for the binomial distribution with 100 trials and probability
of success 0.1.


Solution:
Let R ∼ Bin(100, 0.1) denote the exact number of successes. It has mean and
variance:

E(R) = nπ = 100 × 0.1 = 10

and:

Var(R) = nπ(1 − π) = 100 × 0.1 × 0.9 = 9

so we use the approximation R ∼ N (10, 9), or equivalently:

(R − 10)/√9 = (R − 10)/3 ∼ N (0, 1), approximately.

Applying a continuity correction of 0.5 (for example, 7.8 successes are rounded up
to 8) gives:

P (R ≥ r) ≈ P (Z > (r − 0.5 − 10)/3).

The results are summarised in the following table. The first column is the number
of successes; the second gives the exact binomial probabilities; the third column
lists the corresponding z-values (with the continuity correction); and the fourth
gives the probabilities for the normal approximation.
Although the agreement between columns two and four is not too bad, you may
think it is not as close as you would like for some applications.
r P (R ≥ r) z = (r − 0.5 − 10)/3 P (Z > z)
1 0.999973 −3.1667 0.999229
2 0.999678 −2.8333 0.997697
3 0.998055 −2.5000 0.993790
4 0.992164 −2.1667 0.984870
5 0.976289 −1.8333 0.966624
6 0.942423 −1.5000 0.933193
7 0.882844 −1.1667 0.878327
8 0.793949 −0.8333 0.797672
9 0.679126 −0.5000 0.691462
10 0.548710 −0.1667 0.566184
11 0.416844 0.1667 0.433816
12 0.296967 0.5000 0.308538
13 0.198179 0.8333 0.202328
14 0.123877 1.1667 0.121673
15 0.072573 1.5000 0.066807
16 0.039891 1.8333 0.033376
17 0.020599 2.1667 0.015130
18 0.010007 2.5000 0.006210
19 0.004581 2.8333 0.002303
20 0.001979 3.1667 0.000771
21 0.000808 3.5000 0.000233
22 0.000312 3.8333 0.000063
23 0.000114 4.1667 0.000015
24 0.000040 4.5000 0.000003
25 0.000013 4.8333 0.000001
26 0.000004 5.1667 0.000000
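A table of this kind can be generated directly in R, which makes it easy to experiment
with other values of n and π. The following is a sketch:

> r <- 1:26
> exact  <- pbinom(r - 1, size = 100, prob = 0.1, lower.tail = FALSE)  # P(R >= r)
> z      <- (r - 0.5 - 10)/3                                           # continuity-corrected z
> approx <- pnorm(z, lower.tail = FALSE)                               # P(Z > z)
> round(cbind(r, exact, z, approx), 6)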


A.2 Practice questions


Try to solve the questions before looking at the solutions – promise?! Solutions are
located in Appendix G.

1. (a) Suppose {X1 , X2 , X3 , X4 } is a random sample of size n = 4 from the
Bernoulli(0.2) distribution. What is the distribution of Σ_{i=1}^{n} Xi in this case?

(b) Write down the sampling distribution of X̄ = Σ_{i=1}^{n} Xi /n for the sample
considered in (a). In other words, write down the possible values of X̄ and
their probabilities.
Hint: what are the possible values of Σ_i Xi , and their probabilities?

(c) Suppose we have a random sample of size n = 100 from the Bernoulli(0.2)
distribution. What is the approximate sampling distribution of X̄ suggested by
the central limit theorem in this case? Use this distribution to calculate an
approximate value for the probability that X̄ > 0.3. (The true value of this
probability is 0.0061.)

2. Suppose that we plan to take a random sample of size n from a normal distribution
with mean µ and standard deviation σ = 2.
(a) Suppose µ = 4 and n = 20.
i. What is the probability that the mean X̄ of the sample is greater than 5?
ii. What is the probability that X̄ is smaller than 3?
iii. What is P (|X̄ − µ| ≤ 1) in this case?
(b) How large should n be in order that P (|X̄ − µ| ≤ 0.5) ≥ 0.95 for every possible
value of µ?
(c) It is claimed that the true value of µ is 5 in a population. A random sample of
size n = 100 is collected from this population, and the mean for this sample is
x̄ = 5.8. Based on the result in (b), what would you conclude from this value
of X̄?

3. A random sample of 25 audits is to be taken from a company’s total audits, and


the average value of these audits is to be calculated.
(a) Explain what you understand by the sampling distribution of this average and
discuss its relationship to the population mean.
(b) Is it reasonable to assume that this sampling distribution is normal?
(c) If the population of all audits has a mean of £54 and a standard deviation of
£10, find the probability that:
i. the sample mean will be greater than £60
ii. the sample mean will be within 5% of the population mean.

Did you hear the one about the statistician? Probably.


(Anon)

Appendix B
Point estimation

B.1 Worked examples


1. Let X1 and X2 be two independent random variables with the same mean, µ, and
the same variance, σ² < ∞. Let µ̂ = aX1 + bX2 be an estimator of µ, where a and b
are two non-zero constants.

(a) Identify the condition on a and b to ensure that µ̂ is an unbiased estimator of µ.

(b) Find the minimum mean squared error (MSE) among all unbiased estimators of µ.

Solution:

(a) E(µ̂) = E(aX1 + bX2 ) = a E(X1 ) + b E(X2 ) = (a + b)µ. Hence a + b = 1 is
the condition for µ̂ to be an unbiased estimator of µ.

(b) Under this condition, noting that b = 1 − a, we have:

MSE(µ̂) = Var(µ̂) = a² Var(X1 ) + b² Var(X2 ) = (a² + b²)σ² = (2a² − 2a + 1)σ².

Setting d MSE(µ̂)/da = (4a − 2)σ² = 0, we have a = 0.5, and hence b = 0.5.
Therefore, among all unbiased linear estimators, the sample mean (X1 + X2 )/2
has the minimum variance.

Remark: Let {X1 , X2 , . . . , Xn } be a random sample from a population with finite
variance. The sample mean X̄ has the minimum variance among all unbiased linear
estimators of the form Σ_{i=1}^{n} ai Xi , hence it is the best linear unbiased estimator
(BLUE(!)).

2. Let {X1 , X2 , . . . , Xn } be a random sample from the (continuous) uniform
distribution such that X ∼ Uniform[0, θ], where θ > 0. Find the method of
moments estimator (MME) of θ.

Solution:
The pdf of Xi is f (xi ; θ) = 1/θ for 0 ≤ xi ≤ θ, and 0 otherwise. Therefore:

E(Xi ) = ∫_0^θ xi (1/θ) dxi = [xi²/(2θ)]_0^θ = θ/2.

Therefore, setting µ̂1 = M1 , we have:

θ̂/2 = X̄  ⇒  θ̂ = 2X̄ = 2 Σ_{i=1}^{n} Xi /n.

3. Let X ∼ Bin(n, π), where n is known. Find the method of moments estimator
(MME) of π.

Solution:
The pf of the binomial distribution is:

P (X = x) = (n!/(x! (n − x)!)) π^x (1 − π)^{n−x} for x = 0, 1, 2, . . . , n

and 0 otherwise. Therefore:

E(X) = Σ_{x=0}^{n} x P (X = x) = Σ_{x=1}^{n} x (n!/(x! (n − x)!)) π^x (1 − π)^{n−x} = Σ_{x=1}^{n} (n!/((x − 1)! (n − x)!)) π^x (1 − π)^{n−x} .

Let m = n − 1 and write j = x − 1, then (n − x) = (m − j), and:

E(X) = Σ_{j=0}^{m} (n m!/(j! (m − j)!)) π π^j (1 − π)^{m−j} = nπ Σ_{j=0}^{m} (m!/(j! (m − j)!)) π^j (1 − π)^{m−j} .

Therefore, E(X) = nπ, and hence π̂ = X/n.

4. Let {X1 , X2 , . . . , Xn } be a random sample from the distribution with pdf:

f (x) = λ exp(−λ(x − a)) for x ≥ a, and 0 otherwise

where λ > 0. Find the method of moments estimators (MMEs) of λ and a.

Solution:
We have:

E(X) = ∫_a^∞ x λ exp(−λ(x − a)) dx = (1/λ) ∫_0^∞ (y + λa) e^{−y} dy = 1/λ + a

and:

E(X²) = ∫_a^∞ x² λ exp(−λ(x − a)) dx = ∫_0^∞ (y/λ + a)² e^{−y} dy = 2/λ² + 2a/λ + a².

Therefore, the MMEs are the solutions to the equations:

X̄ = 1/λ̂ + â   and   (1/n) Σ_{i=1}^{n} Xi² = 2/λ̂² + 2â/λ̂ + â².

Actually, the explicit solutions may be obtained as follows:

(1/n) Σ_{i=1}^{n} Xi² − X̄² = 2/λ̂² + 2â/λ̂ + â² − (1/λ̂ + â)² = 1/λ̂².

Hence:

λ̂ = ((1/n) Σ_{i=1}^{n} Xi² − X̄²)^{−1/2} = ((1/n) Σ_{i=1}^{n} (Xi − X̄)²)^{−1/2} .

Consequently:

â = X̄ − 1/λ̂.

5. Let {X1 , X2 , . . . , Xn } be a random sample from the distribution N (µ, 1). Find the
maximum likelihood estimator (MLE) of µ.

Solution:
The joint pdf of the observations is:

f (x1 , x2 , . . . , xn ; µ) = Π_{i=1}^{n} (1/√(2π)) exp(−(xi − µ)²/2) = (2π)^{−n/2} exp(−(1/2) Σ_{i=1}^{n} (xi − µ)²).

We write the above as a function of µ only:

L(µ) = C exp(−(1/2) Σ_{i=1}^{n} (Xi − µ)²)

where C > 0 is a constant. The MLE µ̂ maximises this function, and also
maximises the function:

l(µ) = ln L(µ) = −(1/2) Σ_{i=1}^{n} (Xi − µ)² + ln(C).

Therefore, the MLE effectively minimises Σ_{i=1}^{n} (Xi − µ)², i.e. the MLE is also the
least squares estimator (LSE), i.e. µ̂ = X̄.

6. Let {X1 , X2 , . . . , Xn } be a random sample from a Poisson distribution with mean
λ > 0. Find the maximum likelihood estimator (MLE) of λ.

Solution:
The probability function is:

P (X = x) = e^{−λ} λ^x /x! .

The likelihood and log-likelihood functions are, respectively:

L(λ) = Π_{i=1}^{n} (e^{−λ} λ^{Xi} /Xi !) = e^{−nλ} λ^{nX̄} / Π_{i=1}^{n} Xi !

and:

l(λ) = ln L(λ) = nX̄ ln(λ) − nλ + C = n(X̄ ln(λ) − λ) + C

where C is a constant (i.e. it may depend on the Xi but cannot depend on the
parameter). Setting:

(d/dλ) l(λ) = n(X̄/λ̂ − 1) = 0

we obtain the MLE λ̂ = X̄, which is also the MME.

7. Let {X1 , X2 , . . . , Xn } be a random sample from the (continuous) uniform
distribution Uniform[0, θ], where θ > 0 is unknown.

(a) Find the maximum likelihood estimator (MLE) of θ.

(b) If n = 3, x1 = 0.2, x2 = 3.6 and x3 = 1.1, what is the maximum likelihood
estimate of θ?

Solution:

(a) The pdf of Uniform[0, θ] is f (x; θ) = 1/θ for 0 ≤ x ≤ θ, and 0 otherwise.
The joint pdf is:

f (x1 , x2 , . . . , xn ; θ) = θ^{−n} for 0 ≤ x1 , x2 , . . . , xn ≤ θ, and 0 otherwise.

In fact f (x1 , x2 , . . . , xn ; θ), as a function of θ, is the likelihood function, L(θ).
The maximum likelihood estimator of θ is the value at which the likelihood
function L(θ) achieves its maximum. Note:

L(θ) = θ^{−n} for X(n) ≤ θ, and 0 otherwise

where X(n) = max_i Xi . Since θ^{−n} is decreasing in θ, the likelihood is
maximised at the smallest permissible value of θ, namely X(n) . Hence the MLE
is θ̂ = X(n) , which is different from the MME. For example, if x(n) = 1.16, the
likelihood is zero for θ < 1.16 and decreases as θ increases beyond 1.16, so it is
maximised at θ = 1.16.

(b) For the given data, the maximum observation is x(3) = 3.6. Therefore, the
maximum likelihood estimate is θ̂ = 3.6.

8. Use the observed random sample x1 = 8.2, x2 = 10.6, x3 = 9.1 and x4 = 4.9 to
calculate the maximum likelihood estimate of λ in the exponential pdf:

f (x; λ) = λe^{−λx} for x ≥ 0, and 0 otherwise.

Solution:
We derive a general formula with a random sample {X1 , X2 , . . . , Xn } first. The
joint pdf is:

f (x1 , x2 , . . . , xn ; λ) = λ^n e^{−λnx̄} for x1 , x2 , . . . , xn ≥ 0, and 0 otherwise.

With all xi ≥ 0, L(λ) = λ^n e^{−λnX̄} , hence the log-likelihood function is:

l(λ) = ln L(λ) = n ln(λ) − λnX̄.

Setting:

(d/dλ) l(λ) = n/λ̂ − nX̄ = 0  ⇒  λ̂ = 1/X̄.

For the given sample, x̄ = (8.2 + 10.6 + 9.1 + 4.9)/4 = 8.2. Therefore, λ̂ = 1/8.2 = 0.1220.

9. The following data show the number of occupants in passenger cars observed
during one hour at a busy junction. It is assumed that these data follow a
geometric distribution with pf:
p(x; π) = (1 − π)^{x−1} π for x = 1, 2, . . . , and 0 otherwise.

Number of occupants 1 2 3 4 5 ≥6 Total


Frequency 678 227 56 28 8 14 1,011

Find the maximum likelihood estimate of π.


Solution:
The sample size is n = 1,011. If we know all the 1,011 observations, the joint
probability function for x1 , x2 , . . . , x1,011 is:

L(π) = Π_{i=1}^{1,011} p(xi ; π).

However, we only know that there are 678 xi s equal to 1, 227 xi s equal to 2, . . .,
and 14 xi s equal to some integers not smaller than 6.
Note that:

P (Xi ≥ 6) = Σ_{x=6}^{∞} p(x; π) = π(1 − π)^5 (1 + (1 − π) + (1 − π)² + · · · ) = π(1 − π)^5 × 1/π = (1 − π)^5 .

Hence we may only use:

L(π) = p(1; π)^{678} p(2; π)^{227} p(3; π)^{56} p(4; π)^{28} p(5; π)^{8} ((1 − π)^5 )^{14}
     = π^{1,011−14} (1 − π)^{227+56×2+28×3+8×4+14×5}
     = π^{997} (1 − π)^{525}

hence:

l(π) = ln L(π) = 997 ln(π) + 525 ln(1 − π).

Setting:

(d/dπ) l(π) = 997/π̂ − 525/(1 − π̂) = 0  ⇒  π̂ = 997/(997 + 525) = 0.655.
Remark: Since P (Xi = 1) = π, πb = 0.655 indicates that about 2/3 of cars have only
one occupant. Note E(Xi ) = 1/π. In order to ensure that the average number of
occupants is not smaller than k, we require π < 1/k.

10. Let {X1 , X2 , . . . , Xn }, where n > 2, be a random sample from an unknown


population with mean θ and variance σ 2 . We want to choose between two
estimators of θ, θb1 = X̄ and θb2 = (X1 + X2 )/2. Which is the better estimator of θ?
Solution:
Let us consider the bias first. The estimator θb1 is just the sample mean, so we know
that it is unbiased. The estimator θb2 has expectation:
 
X 1 + X 2 E(X1 ) + E(X2 ) θ+θ
E(θb2 ) = E = = =θ
2 2 2
so it is also an unbiased estimator of θ.
Next, we consider the variances of the two estimators. We have:
σ2
Var(θ1 ) = Var(X̄) =
b
n
and:
σ2 + σ2 σ2
 
X1 + X2 Var(X1 ) + Var(X2 )
Var(θb2 ) = Var = = = .
2 4 4 2

Since n > 2, we can see that θb1 has a lower variance than θb2 , so it is a better
estimator. Unsurprisingly, we obtain a better estimator of θ by considering the
whole sample, rather than just the first two values.


11. Show that the MSE of an estimator θ̂ can be written as:

MSE(θ̂) = Var(θ̂) + (Bias(θ̂))².

Solution:
We need to introduce the term E(θ̂) inside the expectation, so we add and subtract
it to obtain:

MSE(θ̂) = E((θ̂ − θ)²)
        = E(((θ̂ − E(θ̂)) − (θ − E(θ̂)))²)
        = E((θ̂ − E(θ̂))² − 2(θ̂ − E(θ̂))(θ − E(θ̂)) + (θ − E(θ̂))²)
        = E((θ̂ − E(θ̂))²) − 2 E((θ̂ − E(θ̂))(θ − E(θ̂))) + E((θ − E(θ̂))²).

The first term in this expression is, by definition, the variance of θ̂. The final term
is:

E((θ − E(θ̂))²) = (θ − E(θ̂))² = (E(θ̂) − θ)² = (Bias(θ̂))²

because θ and E(θ̂) are both constants, and are not affected by the expectation
operator. It remains to be shown that the middle term is equal to zero. We have:

E((θ̂ − E(θ̂))(θ − E(θ̂))) = (θ − E(θ̂)) E(θ̂ − E(θ̂)) = (θ − E(θ̂))(E(θ̂) − E(θ̂)) = 0

which concludes our proof.

12. Find the MSEs of the estimators in Question 10.


Solution:
The MSEs are:

MSE(θ̂1 ) = Var(θ̂1 ) + (Bias(θ̂1 ))² = σ²/n + 0 = σ²/n

and:

MSE(θ̂2 ) = Var(θ̂2 ) + (Bias(θ̂2 ))² = σ²/2 + 0 = σ²/2.

Note that the MSE of an unbiased estimator is equal to its variance.

13. Are the estimators in Question 10 (mean-square) consistent?


Solution:
The estimator θb1 has MSE equal to σ 2 /n, which converges to 0 as n → ∞. The
estimator θb2 has MSE equal to σ 2 /2, which stays constant as n → ∞. Therefore, θb1
is a (mean-square) consistent estimator of θ, whereas θb2 is not.


14. Suppose that we have a random sample {X1 , X2 , . . . , Xn } from a Uniform[−θ, θ]
distribution. Find the method of moments estimator of θ.

Solution:
The mean of the Uniform[a, b] distribution is (a + b)/2. In our case, this gives
E(X) = (−θ + θ)/2 = 0. The first population moment does not depend on θ, so we
need to move to the next (i.e. second) population moment.
Recall that the variance of the Uniform[a, b] distribution is (b − a)²/12. Hence the
second population moment is:

E(X²) = Var(X) + (E(X))² = (θ − (−θ))²/12 + 0² = θ²/3.

We set this equal to the second sample moment to obtain:

(1/n) Σ_{i=1}^{n} Xi² = θ̂²/3.

Therefore, the method of moments estimator of θ is:

θ̂MM = √((3/n) Σ_{i=1}^{n} Xi²).

15. Let {X1 , X2 , . . . , Xn } be a random sample from a Bin(m, π) distribution, with both
m and π unknown. Find the method of moments estimators of m, the number of
trials, and π, the probability of success.

Solution:
There are two unknown parameters, so we need two equations. The expectation
and variance of a Bin(m, π) distribution are mπ and mπ(1 − π), respectively, so we
have:

µ1 = E(X) = mπ

and:

µ2 = E(X²) = Var(X) + (E(X))² = mπ(1 − π) + (mπ)².

Setting the first two sample and population moments equal gives:

(1/n) Σ_{i=1}^{n} Xi = m̂π̂   and   (1/n) Σ_{i=1}^{n} Xi² = m̂π̂(1 − π̂) + (m̂π̂)².

The two equations need to be solved simultaneously. Solving the first equation for
π̂ gives:

π̂ = (Σ_{i=1}^{n} Xi /n)/m̂ = X̄/m̂.

Now we can substitute π̂ into the second moment equation to obtain:

(1/n) Σ_{i=1}^{n} Xi² = m̂ (X̄/m̂)(1 − X̄/m̂) + (m̂ (X̄/m̂))²

which we now solve for m̂ to find the method of moments estimator:

m̂MM = X̄² / (X̄² − ((1/n) Σ_{i=1}^{n} Xi² − X̄)).

16. Consider again the Uniform[−θ, θ] distribution from Question 14. Suppose that we
observe the following data:

1.8, 0.7, −0.2, −1.8, 2.8, 0.6, −1.3 and − 0.1.

Estimate θ using the method of moments.


Solution:
The point estimate is:

θ̂MM = √((3/8) Σ_{i=1}^{8} xi²) ≈ 2.518

which implies that the data came from a Uniform[−2.518, 2.518] distribution.
However, this clearly cannot be true since the observation x5 = 2.8 falls outside this
range! The method of moments does not take into account that all of the
observations need to lie in the interval [−θ, θ], and so it fails to produce a useful
estimate.

17. Let {X1 , X2 , . . . , Xn } be a random sample from an Exp(λ) distribution. Find the
MLE of λ.
Solution:
The likelihood function is:

L(λ) = Π_{i=1}^{n} f (xi ; λ) = Π_{i=1}^{n} λe^{−λXi} = λ^n e^{−λ Σi Xi} = λ^n e^{−λnX̄}

so the log-likelihood function is:

l(λ) = ln(λ^n e^{−λnX̄} ) = n ln(λ) − λnX̄.

Differentiating and setting equal to zero gives:

(d/dλ) l(λ) = n/λ̂ − nX̄ = 0  ⇒  λ̂ = 1/X̄.

The second derivative of the log-likelihood function is:

(d²/dλ²) l(λ) = −n/λ²

which is always negative, hence the MLE λ̂ = 1/X̄ is indeed a maximum. This
happens to be the same as the method of moments estimator of λ.
b = 1/X̄ is indeed a maximum. This
happens to be the same as the method of moments estimator of λ.


18. Let {X1 , X2 , . . . , Xn } be a random sample from a N (µ, σ²) distribution. Find the
MLE of σ² if:

(a) µ is known

(b) µ is unknown.

In each case, work out if the MLE is an unbiased estimator of σ².

Solution:
The likelihood function is:

L(µ, σ²) = Π_{i=1}^{n} f (xi ; µ, σ²) = Π_{i=1}^{n} (2πσ²)^{−1/2} exp(−(Xi − µ)²/(2σ²)) = (2πσ²)^{−n/2} exp(−(1/(2σ²)) Σ_{i=1}^{n} (Xi − µ)²)

so the log-likelihood function is:

l(µ, σ²) = −(n/2) ln(2π) − (n/2) ln(σ²) − (1/(2σ²)) Σ_{i=1}^{n} (Xi − µ)².

Differentiating with respect to σ² and setting the derivative equal to zero gives:

(d/dσ²) l(µ, σ²) = −n/(2σ̂²) + (1/(2σ̂⁴)) Σ_{i=1}^{n} (Xi − µ)² = 0.

If µ is known, we can solve this equation for σ̂²:

n/(2σ̂²) = (1/(2σ̂⁴)) Σ_{i=1}^{n} (Xi − µ)²  ⇒  σ̂² = (1/n) Σ_{i=1}^{n} (Xi − µ)².

The second derivative is always negative, so we conclude that the MLE:

σ̂² = (1/n) Σ_{i=1}^{n} (Xi − µ)²

is indeed a maximum. We can work out the bias of this estimator directly:

E(σ̂²) = E((1/n) Σ_{i=1}^{n} (Xi − µ)²) = (σ²/n) Σ_{i=1}^{n} E(((Xi − µ)/σ)²) = (σ²/n) Σ_{i=1}^{n} E(Zi²) = (σ²/n) × n = σ²

where Zi = (Xi − µ)/σ, for i = 1, 2, . . . , n. Therefore, the MLE of σ² is an unbiased
estimator in this case.
If µ is unknown, we also need to maximise the likelihood function with respect to
µ. Here, we consider an alternative method. The likelihood function is:

L(µ, σ²) = (2πσ²)^{−n/2} exp(−(1/(2σ²)) Σ_{i=1}^{n} (Xi − µ)²)

so, whatever the value of σ², we need to ensure that Σ_{i=1}^{n} (Xi − µ)² is minimised.
However, we have:

Σ_{i=1}^{n} (Xi − µ)² = Σ_{i=1}^{n} (Xi − X̄)² + n(X̄ − µ)².

Only the second term on the right-hand side depends on µ and, because of the
square, its minimum value is zero. It is minimised when µ is equal to the sample
mean, so this is the MLE of µ:

µ̂ = X̄.

The resulting MLE of σ² is:

σ̂² = (1/n) Σ_{i=1}^{n} (Xi − X̄)².

This is not the same as the sample variance S², where we divide by n − 1 instead of
n. The expectation of the MLE of σ² is:

E(σ̂²) = E((1/n) Σ_{i=1}^{n} (Xi − X̄)²) = (1/n) E((n − 1)S²) = (σ²/n) E((n − 1)S²/σ²).

The term inside the expectation, (n − 1)S²/σ², follows a χ²_{n−1} distribution, and so:

E(σ̂²) = (σ²/n)(n − 1).

This is not equal to σ², so the MLE of σ² is a biased estimator in this case. (Note
that the estimator σ̂² = S² is an unbiased estimator of σ².) The bias of the MLE is:

Bias(σ̂²) = E(σ̂²) − σ² = (σ²/n)(n − 1) − σ² = −σ²/n

which tends to zero as n → ∞. In such cases, we say that the estimator is
asymptotically unbiased.


B.2 Practice questions


Try to solve the questions before looking at the solutions – promise?! Solutions are
located in Appendix G.

1. Based on a random sample of two independent observations from a population with


mean µ and standard deviation σ, consider two estimators of µ, X and Y , defined
as:
X = X1 /2 + X2 /2   and   Y = X1 /3 + 2X2 /3.
Are X and Y unbiased estimators of µ?

2. Prove that, for normally distributed data, S 2 is an unbiased estimator of σ 2 , but


that S is a biased estimator of σ.
Hint: if X̄ is the sample mean for a random sample of size n, the fact that the
observations {X1 , X2 , . . . , Xn } are independent can be used to prove that (in the
standard notation):
E(X̄²) = µ² + σ²/n.

3. A random sample of n independent Bernoulli trials with success probability π


results in R successes. Derive an unbiased estimator of π(1 − π).

4. Given a random sample of n values from a normal distribution with unknown mean
and variance, consider the following two estimators of σ² (the unknown population
variance), where Sxx = Σ (Xi − X̄)²:

T1 = Sxx /(n − 1)   and   T2 = Sxx /n.
For each of these determine its bias, its variance and its mean squared error. Which
has the smaller mean squared error?
Hint: use the fact that Var(S 2 ) = 2σ 4 /(n − 1) for a random sample of size n, or
some equivalent formula.

5. Suppose that you are given observations y1 , y2 , y3 and y4 such that:

y 1 = α + β + ε1

y2 = −α + β + ε2

y 3 = α − β + ε3

y4 = −α − β + ε4 .

The random variables εi , for i = 1, 2, 3, 4, are independent and normally distributed


with mean 0 and variance σ 2 .


(a) Find the least squares estimators of the parameters α and β.


(b) Verify that the least squares estimators in (a) are unbiased estimators of their
respective parameters.
(c) Find the variance of the least squares estimator of α.

The group was alarmed to find that if you are a labourer, cleaner or dock
worker, you are twice as likely to die than a member of the professional classes.
(The Sunday Times, 31 August 1980)

Appendix C
Interval estimation

C.1 Worked examples


1. (a) Find the length of a 95% confidence interval for the mean of a normal
distribution with known variance σ 2 .
(b) Find the minimum sample size such that the width of a 95% confidence
interval is not wider than d, where d > 0 is a prescribed constant.

Solution:

(a) With an available random sample {X1 , X2 , . . . , Xn } from the normal
distribution N (µ, σ²) with σ² known, a 95% confidence interval for µ is of the
form:

(X̄ − 1.96 × σ/√n , X̄ + 1.96 × σ/√n).

Hence the width of the confidence interval is:

(X̄ + 1.96 × σ/√n) − (X̄ − 1.96 × σ/√n) = 2 × 1.96 × σ/√n = 3.92 × σ/√n.

(b) Let 3.92 × σ/√n ≤ d, and so we obtain the condition for the required sample
size:

n ≥ (3.92 × σ/d)² = 15.37 × σ²/d².

Therefore, in order to achieve the required accuracy, the sample size n should
be at least as large as 15.37 × σ²/d².
Note that as the variance σ² increases, the confidence interval width increases,
and as the sample size n increases, the confidence interval width decreases.
Also, note that when σ² is unknown, the width of a confidence interval for µ
depends on S. Therefore, the width is a random variable.

2. The data below are from a random sample of size n = 9 taken from the distribution
N (µ, σ 2 ):
3.75, 5.67, 3.14, 7.89, 3.40, 9.32, 2.80, 10.34 and 14.31.

(a) Assume σ 2 = 16. Find a 95% confidence interval for µ. If the width of such a
confidence interval must not exceed 2.5, at least how many observations do we
need?
(b) Suppose σ 2 is now unknown. Find a 95% confidence interval for µ. Compare
the result with that obtained in (a) and comment.
(c) Obtain a 95% confidence interval for σ 2 .


Solution:

(a) We have x̄ = 6.74. For a 95% confidence interval, α = 0.05 so we need to find
the top 100α/2 = 2.5th percentile of N (0, 1), which is 1.96. Since σ = 4 and
n = 9, a 95% confidence interval for µ is:

x̄ ± 1.96 × σ/√n  ⇒  (6.74 − 1.96 × 4/3, 6.74 + 1.96 × 4/3) = (4.13, 9.35).

In general, a 100(1 − α)% confidence interval for µ is:

(X̄ − zα/2 × σ/√n , X̄ + zα/2 × σ/√n)

where zα denotes the top 100αth percentile of the standard normal
distribution, i.e. such that P (Z > zα ) = α, where Z ∼ N (0, 1). Hence the
width of the confidence interval is:

2 × zα/2 × σ/√n.

For this example, α = 0.05, z0.025 = 1.96 and σ = 4. Setting the width of the
confidence interval to be at most 2.5, we have:

2 × 1.96 × σ/√n = 15.68/√n ≤ 2.5.

Hence:

n ≥ (15.68/2.5)² = 39.34.

So we need a sample of at least 40 observations in order to obtain a 95%
confidence interval with a width not greater than 2.5.

(b) When σ² is unknown, a 95% confidence interval for µ is:

(X̄ − tα/2, n−1 × S/√n , X̄ + tα/2, n−1 × S/√n)

where S² = Σ_{i=1}^{n} (Xi − X̄)²/(n − 1), and tα, k denotes the top 100αth
percentile of the Student’s tk distribution, i.e. such that P (T > tα, k ) = α for
T ∼ tk . For this example, s² = 16, s = 4, n = 9 and t0.025, 8 = 2.306. Hence a
95% confidence interval for µ is:

6.74 ± 2.306 × 4/3  ⇒  (3.67, 9.81).

This confidence interval is much wider than the one obtained in (a). Since we
do not know σ², we have less information available for our estimation. It is
only natural that our estimation becomes less accurate.
Note that although the sample size is n, the Student’s t distribution used has
only n − 1 degrees of freedom. The loss of 1 degree of freedom in the sample
variance is due to not knowing µ. Hence we estimate µ using the data, for
which we effectively pay a ‘price’ of one degree of freedom.

(c) Note (n − 1)S²/σ² ∼ χ²_{n−1} = χ²_8 . From Table 8 of Murdoch and Barnes’
Statistical Tables, for X ∼ χ²_8 , we find that:

P (X < 2.180) = P (X > 17.535) = 0.025.

Hence:

P (2.180 < 8 × S²/σ² < 17.535) = 0.95.

Therefore, the lower bound for σ² is 8 × s²/17.535 = 7.30, and the upper
bound is 8 × s²/2.180 = 58.72. Therefore, a 95% confidence interval for σ²,
noting s² = 16, is:

(7.30, 58.72).

Note that the estimation in this example is rather inaccurate. This is due to
two reasons.

i. The sample size is small.

ii. The population variance, σ², is large.

3. Assume that the random variable X is normally distributed and that σ 2 is known.
What confidence level would be associated with each of the following intervals?
(a) (x̄ − 1.645 × σ/√n , x̄ + 2.326 × σ/√n).

(b) (−∞, x̄ + 2.576 × σ/√n).

(c) (x̄ − 1.645 × σ/√n , x̄).

Solution:
We have X̄ ∼ N (µ, σ²/n), hence √n(X̄ − µ)/σ ∼ N (0, 1).
(a) P (−1.645 < Z < 2.326) = 0.94, hence a 94% confidence level.
(b) P (−∞ < Z < 2.576) = 0.995, hence a 99.5% confidence level.
(c) P (−1.645 < Z < 0) = 0.45, hence a 45% confidence level.

4. Five independent samples, each of size n, are to be drawn from a normal
distribution where σ² is known. For each sample, the interval:

(x̄ − 0.96 × σ/√n , x̄ + 1.06 × σ/√n)

will be constructed. What is the probability that at least four of the intervals will
contain the unknown µ?

Solution:
The probability that the given interval will contain µ is:

P (−0.96 < Z < 1.06) = 0.6869.

The probability of four or five such intervals is binomial with n = 5 and
π = 0.6869, so let the number of such intervals be Y ∼ Bin(5, 0.6869). The required
probability is:

P (Y ≥ 4) = C(5, 4) (0.6869)⁴ (0.3131) + C(5, 5) (0.6869)⁵ = 0.5014.


5. A personnel manager has found that historically the scores on aptitude tests given
to applicants for entry-level positions are normally distributed with σ = 32.4
points. A random sample of nine test scores from the current group of applicants
had a mean score of 187.9 points.
(a) Find an 80% confidence interval for the population mean score of the current
group of applicants.
(b) Based on these sample results, a statistician found for the population mean a
confidence interval extending from 165.8 to 210.0 points. Find the confidence
level of this interval.

Solution:
(a) We have n = 9, x̄ = 187.9, σ = 32.4 and 1 − α = 0.80, hence α/2 = 0.10 and,
from Table 3 of Murdoch and Barnes’ Statistical Tables, P (Z > 1.282) =
1 − Φ(1.282) = 0.10. So an 80% confidence interval is:
187.9 ± 1.282 × 32.4/√9  ⇒  (174.05, 201.75).

(b) The half-width of the confidence interval is 210.0 − 187.9 = 22.1, which is
equal to the margin of error, i.e. we have:
22.1 = k × σ/√n = k × 32.4/√9  ⇒  k = 2.05.
P (Z > 2.05) = 1 − Φ(2.05) = 0.02018 = α/2 ⇒ α = 0.04036. Hence we have
a 100(1 − α)% = 100(1 − 0.04036)% ≈ 96% confidence interval.

6. A manufacturer is concerned about the variability of the levels of impurity


contained in consignments of raw materials from a supplier. A random sample of 10
consignments showed a standard deviation of 2.36 in the concentration of impurity
levels. Assume normality.
(a) Find a 95% confidence interval for the population variance.
(b) Would a 99% confidence interval for this variance be wider or narrower than
that found in (a)?

Solution:
(a) We have n = 10, s² = (2.36)² = 5.5696, χ²_{0.975, 9} = 2.700 and χ²_{0.025, 9} = 19.023.
Hence a 95% confidence interval for σ² is:

((n − 1)s²/χ²_{0.025, n−1} , (n − 1)s²/χ²_{0.975, n−1}) = (9 × 5.5696/19.023 , 9 × 5.5696/2.700) = (2.64, 18.57).

(b) A 99% confidence interval would be wider since:

χ²_{0.995, n−1} < χ²_{0.975, n−1}   and   χ²_{0.005, n−1} > χ²_{0.025, n−1} .


7. Why do we not always choose a very high confidence level for a confidence interval?

Solution:
We do not always want to use a very high confidence level because the confidence
interval would be very wide. We have a trade-off between the width of the
confidence interval and the coverage probability.

8. Suppose that 9 bags of sugar are selected from the supermarket shelf at random
and weighed. The weights in grammes are 812.0, 786.7, 794.1, 791.6, 811.1, 797.4,
797.8, 800.8 and 793.2. Construct a 95% confidence interval for the mean weight of
all the bags on the shelf. Assume the population is normal.

Solution:
Here we have a random sample of size n = 9. The mean is 798.30. The sample
variance is s2 = 72.76, which gives a sample standard deviation s = 8.53. From
Table 7 of Murdoch and Barnes’ Statistical Tables, the top 2.5th percentile of the t
distribution with n − 1 = 8 degrees of freedom is 2.306. Therefore, a 95%
confidence interval is:

(798.30 − 2.306 × 8.53/√9 , 798.30 + 2.306 × 8.53/√9) = (798.30 − 6.56, 798.30 + 6.56) = (791.74, 804.86).

It is sometimes more useful to write this as 798.30 ± 6.56.
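The same interval can be obtained in R with the built-in t.test() function (a sketch):

> weights <- c(812.0, 786.7, 794.1, 791.6, 811.1, 797.4, 797.8, 800.8, 793.2)
> t.test(weights, conf.level = 0.95)$conf.int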

9. Continuing Question 8, suppose we are now told that σ, the population standard
deviation, is known to be 8.5 g. Construct a 95% confidence interval using this
information.

Solution:
From Table 7 of Murdoch and Barnes’ Statistical Tables, the top 2.5th percentile of
the standard normal distribution z0.025 = 1.96 (recall t∞ = N (0, 1)) so a 95%
confidence interval for the population mean is:

(798.30 − 1.96 × 8.5/√9 , 798.30 + 1.96 × 8.5/√9) = (798.30 − 5.55, 798.30 + 5.55) = (792.75, 803.85).

Again, it may be more useful to write this as 798.30 ± 5.55. Note that this
confidence interval is less wide than the one in Question 8, even though our initial
estimate s turned out to be very close to the true value of σ.

10. Construct a 90% confidence interval for the variance of the bags of sugar in
Question 8. Does the given value of 8.5 g for the population standard deviation
seem plausible?


Solution:
We have n = 9 and s2 = 72.76. For a 90% confidence interval, we need the bottom
and top 5th percentiles of the chi-squared distribution on n − 1 = 8 degrees of
freedom. These are:
χ²_{0.95, 8} = 2.733   and   χ²_{0.05, 8} = 15.507.

A 90% confidence interval is:

((n − 1)S²/χ²_{α/2, n−1} , (n − 1)S²/χ²_{1−α/2, n−1}) = ((9 − 1) × 72.76/15.507 , (9 − 1) × 72.76/2.733) = (37.536, 213.010).

The corresponding values for the standard deviation are:

(√37.536 , √213.010) = (6.127, 14.595).
The given value falls well within this confidence interval, so we have no reason to
doubt it.

C.2 Practice questions


Try to solve the questions before looking at the solutions – promise?! Solutions are
located in Appendix G.

1. A business requires an inexpensive check on the value of stock in its warehouse. In


order to do this, a random sample of 50 items is taken and valued. The average
value of these is computed to be £320.41 with a (sample) standard deviation of
£40.60. It is known that there are 9,875 items in the total stock.
(a) Estimate the total value of the stock to the nearest £10,000.
(b) Construct a 95% confidence interval for the mean value of all items and hence
construct a 95% confidence interval for the total value of the stock.
(c) You are told that the confidence interval in (b) is too wide for decision-making
purposes and you are asked to assess how many more items would need to be
sampled to obtain a confidence interval with the same level of confidence, but
with half the width.

2. (a) A sample of 954 adults in early 1987 found that 23% of them held shares.
Given a UK adult population of 41 million and assuming a proper random
sample was taken, construct a 95% confidence interval estimate for the number
of shareholders in the UK.
(b) A ‘similar’ survey the previous year had found a total of 7 million shareholders.
Assuming ‘similar’ means the same sample size, construct a 95% confidence
interval estimate of the increase in shareholders between the two years.

A statistician took the Dale Carnegie Course, improving his confidence from
95% to 99%.
(Anon)

Appendix D
Hypothesis testing

D.1 Worked examples

1. A manufacturer has developed a new fishing line which is claimed to have an


average breaking strength of 7 kg, with a standard deviation of 0.25 kg. Assume
that the standard deviation figure is correct and that the breaking strength is
normally distributed. Suppose that we carry out a test, at the 5% significance level,
of H0 : µ = 7 vs. H1 : µ < 7. Find the sample size which is necessary for the test to
have 90% power if the true breaking strength is 6.95 kg.

Solution:
The critical value for the test is z0.95 = −1.645 and the probability of rejecting H0
with this test is:

P ((X̄ − 7)/(0.25/√n) < −1.645)

which we rewrite as:

P ((X̄ − 6.95)/(0.25/√n) < (7 − 6.95)/(0.25/√n) − 1.645)

because X̄ ∼ N (6.95, (0.25)²/n).
To ensure power of 90% we need z0.10 = 1.282 since:

P (Z < 1.282) = 0.90.

Therefore:

(7 − 6.95)/(0.25/√n) − 1.645 = 1.282
0.2 × √n = 2.927
√n = 14.635
n = 214.1832.

So to ensure that the test power is at least 90%, we should use a sample size of 215.
Remark: We see a rather large sample size is required. Hence investigators are
encouraged to use sample sizes large enough to come to rational decisions.


2. A doctor claims that the average European is more than 8.5 kg overweight. To test
this claim, a random sample of 12 Europeans were weighed, and the difference
between their actual weight and their ideal weight was calculated. The data are:

14, 12, 8, 13, −1, 10, 11, 15, 13, 20, 7 and 14.

Assuming the data follow a normal distribution, conduct a t test to infer at the 5%
significance level whether or not the doctor’s claim is true.
Solution:
We have a random sample of size n = 12 from N (µ, σ 2 ), and we test H0 : µ = 8.5
vs. H1 : µ > 8.5. The test statistic, under H0 , is:

T = (X̄ − 8.5)/(S/√n) = (X̄ − 8.5)/(S/√12) ∼ t11 .

We reject H0 if t > t0.05, 11 = 1.796. For the given data:

x̄ = (1/12) Σ_{i=1}^{12} xi = 11.333   and   s² = (1/11) (Σ_{i=1}^{12} xi² − 12x̄²) = 26.606.

Hence:

t = (11.333 − 8.5)/√(26.606/12) = 1.903 > 1.796 = t0.05, 11
so we reject H0 at the 5% significance level. There is significant evidence to support
the doctor’s claim.
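For comparison, the same one-sided test can be run in R with t.test() (a sketch); the
reported t statistic and one-sided p-value can be checked against the calculation above.

> x <- c(14, 12, 8, 13, -1, 10, 11, 15, 13, 20, 7, 14)
> t.test(x, mu = 8.5, alternative = "greater")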

3. {X1 , X2 , . . . , X21 } represents a random sample of size 21 from a normal population


with mean µ and variance σ 2 .
(a) Construct a test procedure with a 5% significance level to test the null
hypothesis that σ 2 = 8 against the alternative that σ 2 > 8.
(b) Evaluate the power of the test for the values of σ 2 given below.
σ2 = 8.84 10.04 10.55 11.03 12.99 15.45 17.24

Solution:
(a) We test:
H0 : σ 2 = 8 vs. H1 : σ 2 > 8.
The test statistic, under H0 , is:

T = (n − 1)S²/σ0² = 20 × S²/8 ∼ χ²_{20} .

With a 5% significance level, we reject the null hypothesis if:

t ≥ 31.410

since χ²_{0.05, 20} = 31.410.

(b) To evaluate the power, we need the probability of rejecting H0 (which happens
if t ≥ 31.410) conditional on the actual value of σ², that is:

P (T ≥ 31.410 | σ² = k) = P (T × 8/k ≥ 31.410 × 8/k)

where k is the true value of σ², noting that:

T × 8/k ∼ χ²_{20} .

σ² = k          8.84   10.04   10.55   11.03   12.99   15.45   17.24
31.410 × 8/k    28.4   25.0    23.8    22.8    19.3    16.3    14.6
β(σ²)           0.10   0.20    0.25    0.30    0.50    0.70    0.80

4. The weights (in grammes) of a group of five-week-old chickens reared on a


high-protein diet are 336, 421, 310, 446, 390 and 434. The weights of a second
group of chickens similarly reared, except for their low-protein diet, are 224, 275,
393, 282 and 365. Is there evidence that the additional protein has increased the
average weight of the chickens? Assume normality.
Solution:
Assuming normally-distributed populations with possibly different means, but the
same variance, we test:

H0 : µX = µY vs. H1 : µX > µY .

The sample means and standard deviations are x̄ = 389.5, ȳ = 307.8, sX = 55.40
and sY = 69.45. The test statistic and its distribution under H0 are:

T = √((n + m − 2)/(1/n + 1/m)) × (X̄ − Ȳ )/√((n − 1)SX² + (m − 1)SY²) ∼ tn+m−2

and we obtain, for the given data, t = 2.175 > 1.833 = t0.05, 9 hence we reject H0
that the mean weights are equal and conclude that the mean weight for the
high-protein diet is greater at the 5% significance level.
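The pooled two-sample test can be reproduced in R as follows (a sketch); var.equal =
TRUE imposes the common-variance assumption used above.

> high <- c(336, 421, 310, 446, 390, 434)
> low  <- c(224, 275, 393, 282, 365)
> t.test(high, low, var.equal = TRUE, alternative = "greater")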

5. Suppose that we have two independent samples from normal populations with
known variances. We want to test the H0 that the two population means are equal
against the alternative that they are different. We could use each sample by itself
to write down 95% confidence intervals and reject H0 if these intervals did not
overlap. What would be the significance level of this test?
Solution:
Let us assume H0 : µX = µY is true, then the two 95% confidence intervals do not
overlap if and only if:
X̄ − 1.96 × σX /√n ≥ Ȳ + 1.96 × σY /√m   or   Ȳ − 1.96 × σY /√m ≥ X̄ + 1.96 × σX /√n.

So we want the probability:

P (|X̄ − Ȳ | ≥ 1.96 × (σX /√n + σY /√m))

which is:

P (|X̄ − Ȳ |/√(σX²/n + σY²/m) ≥ 1.96 × (σX /√n + σY /√m)/√(σX²/n + σY²/m)).

So we have:

P (|Z| ≥ 1.96 × (σX /√n + σY /√m)/√(σX²/n + σY²/m))

where Z ∼ N (0, 1). This does not reduce in general, but if we assume n = m and
σX² = σY², then it reduces to:

P (|Z| ≥ 1.96 × √2) = 0.0056.

The significance level is about 0.6%, which is much smaller than the usual
conventions of 5% and 1%. Putting variability into two confidence intervals makes
them more likely to overlap than you might think, and so your chance of
incorrectly rejecting the null hypothesis is smaller than you might expect!

6. The following table shows the number of salespeople employed by a company and
the corresponding value of sales (in £000s):

Number of salespeople (x) 210 209 219 225 232 221


Sales (y) 206 200 204 215 222 216
Number of salespeople (x) 220 233 200 215 205 227
Sales (y) 210 218 201 212 204 212

Compute the sample correlation coefficient for these data and carry out a formal
test for a (linear) relationship between the number of salespeople and sales.
Note that:

Σ xi = 2,616,  Σ yi = 2,520,  Σ x²i = 571,500,  Σ y²i = 529,746  and  Σ xi yi = 550,069.

Solution:
We test:
H0 : ρ = 0 vs. H1 : ρ > 0.
The corresponding test statistic and its distribution under H0 are:

T = ρb√(n − 2)/√(1 − ρb²) ∼ tn−2.
We find ρb = 0.8716 and obtain t = 5.62 > 2.764 = t0.01, 10 and so we reject H0 at the
1% significance level. Since the test is highly significant, there is overwhelming
evidence of a (linear) relationship between the number of salespeople and the value
of sales.
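For illustration only, the sample correlation and test statistic can be verified with a short Python sketch, assuming NumPy and SciPy are available.

    import math
    import numpy as np
    from scipy import stats

    x = [210, 209, 219, 225, 232, 221, 220, 233, 200, 215, 205, 227]  # salespeople
    y = [206, 200, 204, 215, 222, 216, 210, 218, 201, 212, 204, 212]  # sales
    r = np.corrcoef(x, y)[0, 1]                       # about 0.8716
    n = len(x)
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)  # about 5.62
    print(r, t, stats.t.ppf(0.99, df=n - 2))          # critical value t_{0.01, 10} = 2.764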


7. Two independent samples from normal populations yield the following results:

Sample 1:  n = 5,  Σ (xi − x̄)² = 4.8
Sample 2:  m = 7,  Σ (yi − ȳ)² = 37.2

Test at the 5% significance level whether the population variances are the same
based on the above data.
Solution:
We test:
H0: σ²1 = σ²2 vs. H1: σ²1 ≠ σ²2.
Under H0 , the test statistic is:

T = S²1/S²2 ∼ Fn−1, m−1 = F4, 6.

Critical values are F0.975, 4, 6 = 1/F0.025, 6, 4 = 1/9.20 = 0.11 and F0.025, 4, 6 = 6.23,
using Table 9 of Murdoch and Barnes’ Statistical Tables. The test statistic value is:
t = (4.8/4)/(37.2/6) = 0.1935
and since 0.11 < 0.1935 < 6.23 we do not reject H0 , which means there is no
evidence of a difference in the variances.
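A minimal Python sketch (an illustration, assuming SciPy is available) confirms the test statistic and the critical values taken from the tables.

    from scipy import stats

    s1_sq = 4.8 / 4     # sample variance of sample 1 (n = 5)
    s2_sq = 37.2 / 6    # sample variance of sample 2 (m = 7)
    f = s1_sq / s2_sq   # about 0.1935
    lower = stats.f.ppf(0.025, 4, 6)   # about 0.11
    upper = stats.f.ppf(0.975, 4, 6)   # about 6.23
    print(f, lower, upper, lower < f < upper)   # inside the interval, so do not reject H0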

8. Why does it make no sense to use a hypothesis like x̄ = 2?


Solution:
We can see immediately if x̄ = 2 by calculating the sample mean. Inference is
concerned with the population from which the sample was taken. We are not very
interested in the sample mean in its own right.

9. (a) Of 100 clinical trials, 5 have shown that wonder-drug ‘Zap2’ is better than the
standard treatment (aspirin). Should we be excited by these results?
(b) Of the 1,000 clinical trials of 1,000 different drugs this year, 30 trials found
drugs which seem better than the standard treatments with which they were
compared. The television news reports only the results of those 30 ‘successful’
trials. Should we believe these reports?
(c) A child welfare officer says that she has a test which always reveals when a
child has been abused, and she suggests it be put into general use. What is she
saying about Type I and Type II errors for her test?
Solution:
(a) If 5 clinical trials out of 100 report that Zap2 is better, this is consistent with
there being no difference whatsoever between Zap2 and aspirin if a 5% Type I
error probability is being used for tests in these clinical trials. With a 5%
significance level we expect 5 trials in 100 to show spurious significant results.


(b) If the television news reports the 30 successful trials out of 1,000, and those
trials use tests with a significance level of 5%, we may well choose to be very
cautious about believing the results. We would expect 50 spuriously significant
results in the 1,000 trial results.
(c) The welfare officer is saying that the Type II error has probability zero. The
test is always positive if the null hypothesis of no abuse is false. On the other
hand, the welfare officer is saying nothing about the probability of a Type I
error. It may well be that the probability of a Type I error is high, which
would lead to many false accusations of abuse when no abuse had taken place.
One should always think about both types of error when proposing a test.

10. A machine is designed to fill bags of sugar. The weight of the bags is normally
distributed with standard deviation σ. If the machine is correctly calibrated, σ
should be no greater than 20 g. We collect a random sample of 18 bags and weigh
them. The sample standard deviation is found to be equal to 32.48 g. Is there any
evidence that the machine is incorrectly calibrated?
Solution:
This is a hypothesis test for the variance of a normal population, so we will use the
chi-squared distribution. Let:

X1 , X2 , . . . , X18 ∼ N (µ, σ 2 )

be the weights of the bags in the sample. An appropriate test has hypotheses:

H0 : σ 2 = 400 vs. H1 : σ 2 > 400.

This is a one-sided test, because we are interested in detecting an increase in


variance. We compute the value of the test statistic:

t = (n − 1)s²/σ²0 = (18 − 1) × (32.48)²/(20)² = 44.835.

At the 5% significance level, the upper-tail value of the chi-squared distribution on


ν = 18 − 1 = 17 degrees of freedom is χ²0.05, 17 = 27.587. Our test statistic exceeds this
value, so we reject the null hypothesis.
We now move to the 1% significance level. The upper-tail value is χ²0.01, 17 = 33.409,
so we reject H0 again. We conclude that there is very strong evidence that the
machine is incorrectly calibrated.
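As a quick numerical check (illustrative only, assuming SciPy is available), the test statistic and the two critical values can be computed as follows.

    from scipy import stats

    n, s, sigma0 = 18, 32.48, 20.0
    t = (n - 1) * s ** 2 / sigma0 ** 2         # test statistic, about 44.8
    print(t)
    print(stats.chi2.ppf(0.95, df=n - 1))      # 27.587
    print(stats.chi2.ppf(0.99, df=n - 1))      # 33.409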

11. After the machine in Question 10 is calibrated, we collect a new sample of 21 bags.
The sample standard deviation of their weights is 23.72 g. Based on this sample,
can you conclude that the calibration has reduced the variance of the weights of the
bags?
Solution:
Let:
Y1 , Y2 , . . . , Y21 ∼ N (µY , σY2 )


be the weights of the bags in the new sample, and use σ²X to denote the variance of
the distribution of the previous sample, to avoid confusion. We want to test for a
reduction in variance, so we set:
H0: σ²X/σ²Y = 1 vs. H1: σ²X/σ²Y > 1.
The value of the test statistic in this case is:
s²X/s²Y = (32.48)²/(23.72)² = 1.875.
If the null hypothesis is true, the test statistic will follow an F18−1, 21−1 = F17, 20
distribution.
At the 5% significance level, the upper-tail critical value of the F17, 20 distribution is
F0.05, 17, 20 = 2.17. Our test statistic does not exceed this value, so we cannot reject
the null hypothesis.
We move to the 10% significance level. The upper-tail critical value is
F0.10, 17, 20 = 1.821, so we can now reject the null hypothesis (if only barely). We
conclude that there is some evidence that the variance is reduced, but it is not very
strong evidence.
Notice the difference between the conclusions of these two tests. We have a much
more powerful test when we compare our sample standard deviation of 32.48 g to a fixed
standard deviation of 20 g, than when we compare it to an estimated standard
deviation of 23.72 g, even though the values are similar.

D.2 Practice questions


Try to solve the questions before looking at the solutions – promise?! Solutions are
located in Appendix G.

1. A random sample of fibres is known to come from one of two environments, A or B.


It is known from past experience that the lengths of fibres from A have a log-normal
distribution so that the log-length of an A-type fibre is normally distributed about
a mean of 0.80 with a standard deviation of 1.00. (Original units are in microns.)
The log-lengths of B-type fibres are normally distributed about a mean of 0.65
with a standard deviation of 1.00. In order to identify the environment from which
the given sample was taken, a subsample of n fibres is to be measured and the
classification is to be made on the evidence of these measurements.
Do not be put off by the log-normal distribution. This simply means that it is the
logs of the data, rather than the original data, which have a normal distribution. If
X represents the log of a fibre length for fibres from A, then X ∼ N (0.8, 1).
(a) If n = 50 and the sample is attributed to type A if the sample mean of
log-lengths exceeds 0.75, determine the error probabilities.
(b) What sample size and decision procedures should be used if it is desired to
have error probabilities such that the chance of misclassifying as A is to be 5%
and the chance of misclassifying as B is to be 10%?


(c) If the sample is classified as A if the sample mean of log-lengths exceeds 0.75,
and the misclassification as A is to have a probability of 2%, what sample size
should be used and what is the probability of a B-type misclassification?
(d) If the sample comes from neither A nor B but from an environment with a
mean log-length of 0.70, what is the probability of classifying it as type A if
the decision procedure determined in (b) is applied?

2. In a wire-based nail manufacturing process the target length for cut wire is 22 cm.
It is known that lengths vary with a standard deviation equal to 0.08 cm. In order
to monitor this process, a random sample of 50 separate wires is accurately
measured and the process is regarded as operating satisfactorily (the null
hypothesis) if the sample mean length lies between 21.97 cm and 22.03 cm so that
this is the decision procedure used (i.e. if the sample mean falls within this range
then the null hypothesis is not rejected, otherwise the null hypothesis is rejected).
(a) Determine the probability of a Type I error for this test.
(b) Determine the probability of making a Type II error when the process is
actually cutting to a length of 22.05 cm.
(c) Find the probability of rejecting the null hypothesis when the true cutting
length is 22.01 cm. (This is the power of the test when the true mean is 22.01
cm.)

3. A sample of seven is taken at random from a large batch of (nominally 12-volt)


batteries. These are tested and their true voltages are shown below:
12.9, 11.6, 13.5, 13.9, 12.1, 11.9 and 13.0.

(a) Test if the mean voltage of the whole batch is 12 volts.


(b) Test if the mean batch voltage is less than 12 volts.
Which test do you think is the more appropriate?

4. To instil customer loyalty, airlines, hotels, rental car companies, and credit card
companies (among others) have initiated frequency marketing programmes which
reward their regular customers. In the United States alone, millions of people are
members of the frequent-flier programmes of the airline industry. A large fast food
restaurant chain wished to explore the profitability of such a programme. They
randomly selected 12 of their 1,200 restaurants nationwide and instituted a
frequency programme which rewarded customers with a $5.00 gift certificate after
every 10 meals purchased at full price.
They ran the trial programme for three months. The restaurants not in the sample
had an average increase in profits of $1,047.34 over the previous three months,
whereas the restaurants in the sample had the following changes in profit:

$2,232.90 $545.47 $3,440.70 $1,809.10


$6,552.70 $4,798.70 $2,965.00 $2,610.70
$3,381.30 $1,591.40 $2,376.20 −$2,191.00


Note that the last number is negative, representing a decrease in profits. Specify
the appropriate null and alternative hypotheses for determining whether the mean
profit change for restaurants with frequency programmes is significantly greater (in
a statistical sense which you should make clear) than $1,047.34.

5. Two companies supplying a television repair service are compared by their repair
times (in days). Random samples of recent repair times for these companies gave
the following statistics:

Sample size Sample mean Sample variance


Company A 44 11.9 7.3
Company B 52 10.8 6.2

(a) Is there evidence that the companies differ in their true mean repair times?
Give an appropriate hypothesis test to support your conclusions.
(b) What is the p-value of your test?
(c) What difference would it have made if the sample sizes had each been smaller
by 35 (i.e. sizes 9 and 17, respectively)?

6. A museum conducts a survey of its visitors in order to assess the popularity of a


device which is used to provide information on the museum exhibits. The device
will be withdrawn if less than 30% of all of the museum’s visitors make use of it. Of
a random sample of 80 visitors, 20 chose to use the device.
(a) Carry out a hypothesis test at the 5% significance level to see if the device
should be withdrawn or not and state your conclusions.
(b) Determine the p-value of the test.
(c) What is the power of this test if the actual percentage of all visitors who would
use this device is only 20%?

To p, or not to p?
(James Abdey, Ph.D. Thesis 2009.¹)

¹ Available at http://etheses.lse.ac.uk/31

Appendix E
Analysis of variance (ANOVA)

E.1 Worked examples


1. Three trainee salespeople were working on a trial basis. Salesperson A went in the
field for 5 days and made a total of 440 sales. Salesperson B was tried for 7 days
and made a total of 630 sales. Salesperson C was tried for 10 days and made a total
of 690 sales. Note that these figures are total sales, not daily averages. The sum of
the squares of all 22 daily sales (Σ x²i) is 146,840.
(a) Construct a one-way analysis of variance table.
(b) Would you say there is a difference between the mean daily sales of the three
salespeople? Justify your answer.
(c) Construct a 95% confidence interval for the mean difference between
salesperson B and salesperson C. Would you say there is a difference?

Solution:

(a) The means are 440/5 = 88, 630/7 = 90 and 690/10 = 69. We will perform a
one-way ANOVA. First, we calculate the overall mean. This is:

(440 + 630 + 690)/22 = 80.
We can now calculate the sum of squares between salespeople. This is:

5 × (88 − 80)2 + 7 × (90 − 80)2 + 10 × (69 − 80)2 = 2,230.

The total sum of squares is:

146,840 − 22 × (80)2 = 6,040.

Here is the one-way ANOVA table:


Source DF SS MS F p-value
Salesperson 2 2,230 1,115 5.56 ≈ 0.01
Error 19 3,810 200.53
Total 21 6,040

(b) As 5.56 > 3.52 = F0.05, 2, 19 , which is the top 5th percentile of the F2, 19
distribution (interpolated from Table 9 of Murdoch and Barnes’ Statistical
Tables), we reject H0 : µ1 = µ2 = µ3 and conclude that there is evidence that
the means are not equal.


(c) We have:
90 − 69 ± 2.093 × √(200.53 × (1/7 + 1/10)) = 21 ± 14.61.

Here 2.093 is the top 2.5th percentile point of the t distribution with 19
degrees of freedom. A 95% confidence interval is (6.39, 35.61). As zero is not
included, there is evidence of a difference.
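For illustration, the whole ANOVA table can be rebuilt from the summary figures with a few lines of Python (no external libraries needed); this is a sketch, not part of the original solution.

    ns = [5, 7, 10]                       # days worked by A, B, C
    totals = [440, 630, 690]              # total sales
    sum_sq = 146_840                      # sum of squares of all 22 daily sales
    n = sum(ns)
    grand_mean = sum(totals) / n                                            # 80
    means = [t_ / n_ for t_, n_ in zip(totals, ns)]                         # 88, 90, 69
    between = sum(n_ * (m - grand_mean) ** 2 for n_, m in zip(ns, means))   # 2,230
    total_ss = sum_sq - n * grand_mean ** 2                                 # 6,040
    within = total_ss - between                                             # 3,810
    f = (between / 2) / (within / (n - 3))                                  # about 5.56
    print(means, between, within, f)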

2. The total times spent by three basketball players on court were recorded. Player A
was recorded on three occasions and the times were 29, 25 and 33 minutes. Player
B was recorded twice and the times were 16 and 30 minutes. Player C was recorded
on three occasions and the times were 12, 14 and 16 minutes. Use analysis of
variance to test whether there is any difference in the average times the three
players spend on court.
Solution:
We have x̄·A = 29, x̄·B = 23, x̄·C = 14 and x̄ = 21.875. Hence:

3 × (29 − 21.875)2 + 2 × (23 − 21.875)2 + 3 × (14 − 21.875)2 = 340.875.

The total sum of squares is:

4,307 − 8 × (21.875)2 = 478.875.

Here is the one-way ANOVA table:

Source DF SS MS F p-value
Players 2 340.875 170.4375 6.175 ≈ 0.045
Error 5 138 27.6
Total 7 478.875

We test H0 : µ1 = µ2 = µ3 (i.e. the average times they play are the same) vs. H1 :
The average times they play are not the same.

As 6.175 > 5.79 = F0.05, 2, 5 , which is the top 5th percentile of the F2, 5 distribution,
we reject H0 and conclude that there is evidence of a difference between the means.
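Since the raw times are available here, the F statistic can also be checked directly (an illustrative sketch, assuming SciPy is available).

    from scipy import stats

    player_a = [29, 25, 33]
    player_b = [16, 30]
    player_c = [12, 14, 16]
    f, p = stats.f_oneway(player_a, player_b, player_c)
    print(f, p)   # F is about 6.175 with a p-value of about 0.045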

3. Three independent random samples were taken. Sample A consists of 4


observations taken from a normal distribution with mean µA and variance σ 2 ,
sample B consists of 6 observations taken from a normal distribution with mean µB
and variance σ 2 , and sample C consists of 5 observations taken from a normal
distribution with mean µC and variance σ 2 .
The average value of the first sample was 24, the average value of the second
sample was 20, and the average value of the third sample was 18. The sum of the
squared observations (all of them) was 6,722.4. Test the hypothesis:

H 0 : µA = µB = µC

against the alternative that this is not so.


Solution:
We will perform a one-way ANOVA. First we calculate the overall mean:
(4 × 24 + 6 × 20 + 5 × 18)/15 = 20.4.
We can now calculate the sum of squares between groups:

4 × (24 − 20.4)2 + 6 × (20 − 20.4)2 + 5 × (18 − 20.4)2 = 81.6.

The total sum of squares is:

6,722.4 − 15 × (20.4)2 = 480.

Here is the one-way ANOVA table:

Source DF SS MS F p-value
Sample 2 81.6 40.8 1.229 ≈ 0.327
Error 12 398.4 33.2
Total 14 480

As 1.229 < 3.89 = F0.05, 2, 12 , which is the top 5th percentile of the F2, 12
distribution, we see that there is no evidence that the means are not equal.

4. Four suppliers were asked to quote prices for seven different building materials. The
average quote of supplier A was 1,315.8. The average quote of suppliers B, C and D
were 1,238.4, 1,225.8 and 1,200.0, respectively. The following is the calculated
two-way ANOVA table with some entries missing.

Source DF SS MS F p-value
Materials 17,800
Suppliers
Error
Total 358,700
(a) Complete the table using the information provided above.
(b) Is there a significant difference between the quotes of different suppliers?
Explain your answer.
(c) Construct a 90% confidence interval for the difference between suppliers A and
D. Would you say there is a difference?
Solution:
(a) The average quote of all suppliers is:
(1,315.8 + 1,238.4 + 1,225.8 + 1,200.0)/4 = 1,245.
Hence the sum of squares (SS) due to suppliers is:

7 × ((1,315.8 − 1,245)² + (1,238.4 − 1,245)² + (1,225.8 − 1,245)² + (1,200.0 − 1,245)²) = 52,148.88


and the MS due to suppliers is 52,148.88/(4 − 1) = 17,382.96.


The degrees of freedom are 7 − 1 = 6, 4 − 1 = 3, (7 − 1)(4 − 1) = 18 and
7 × 4 − 1 = 27 for materials, suppliers, error and total sum of squares,
respectively.
The SS for materials is 6 × 17,800 = 106,800. We have that the SS due to the
error is given by 358,700 − 52,148.88 − 106,800 = 199,751.12 and the MS is
199,751.12/18 = 11,097.28. The F values are:

17,800/11,097.28 = 1.604 and 17,382.96/11,097.28 = 1.567

for materials and suppliers, respectively. The two-way ANOVA table is:
Source DF SS MS F p-value
Materials 6 106,800 17,800 1.604 ≈ 0.203
Suppliers 3 52,148.88 17,382.96 1.567 ≈ 0.232
Error 18 199,751.12 11,097.28
Total 27 358,700
(b) We test H0 : µ1 = µ2 = µ3 = µ4 (i.e. there is no difference between suppliers)
vs. H1 : There is a difference between suppliers. The F value is 1.567 and at a
5% significance level the critical value from Table 9 (degrees of freedom 3 and
18) is 3.16, hence we do not reject H0 and conclude that there is not enough
evidence that there is a difference.
(c) The top 5th percentile of the t distribution with 18 degrees of freedom is 1.734
and the MS value is 11,097.28. So a 90% confidence interval is:
1,315.8 − 1,200 ± 1.734 × √(11,097.28 × (1/7 + 1/7)) = 115.8 ± 97.64

giving (18.16, 213.44). Since zero is not in the interval, there appears to be a
difference between suppliers A and D.

5. Blood alcohol content (BAC) is measured in milligrams per decilitre of blood


(mg/dL). A researcher is looking into the effects of alcoholic drinks. Four different
individuals tried five different brands of strong beer (A, B, C, D and E) on different
days, of course! Each individual consumed 1L of beer over a 30-minute period and
their BAC was measured one hour later. The average BAC for beers A, C, D and E
were 83.25, 95.75, 79.25 and 99.25, respectively. The value for beer B is not given.
The following information is provided as well.

Source DF SS MS F p-value
Drinker 1.56
Beer 303.5
Error 695.6
Total


(a) Complete the table using the information provided above.


(b) Is there a significant difference between the effects of different beers? What
about different drinkers?
(c) Construct a 90% confidence interval for the difference between the effects of
beers C and D. Would you say there is a difference?
Solution:
(a) We have:
Source DF SS MS F p-value
Drinker 3 271.284 90.428 1.56 ≈ 0.250
Beer 4 1214 303.5 5.236 ≈ 0.011
Error 12 695.6 57.967
Total 19 2,180.884
(b) We test the hypothesis H0 : µ1 = µ2 = · · · = µ5 (i.e. there is no difference
between the effects of different beers) vs. the alternative H1 : There is a
difference between the effects of different beers. The F value is 5.236 and at a
5% significance level the critical value from Table 9 is F0.05, 4, 12 = 3.26, so since
5.236 > 3.26 we reject H0 and conclude that there is evidence of a difference.
For drinkers, we test the hypothesis H0 : µ1 = µ2 = µ3 = µ4 (i.e. there is no
difference between the effects on different drinkers) vs. the alternative H1 :
There is a difference between the effects on different drinkers. The F value is
1.56 and at a 5% significance level the critical value from Table 9 is
F0.05, 3, 12 = 3.49, so since 1.56 < 3.49 we fail to reject H0 and conclude that
there is no evidence of a difference.
(c) The top 5th percentile of the t distribution with 12 degrees of freedom is 1.782.
So a 90% confidence interval is:
95.75 − 79.25 ± 1.782 × √(57.967 × (1/4 + 1/4)) = 16.5 ± 9.59

giving (6.91, 26.09). As the interval does not contain zero, there is evidence of
a difference between the effects of beers C and D.

6. A motor manufacturer operates five continuous-production plants: A, B, C, D and


E. The average rate of production has been calculated for the three shifts of each
plant and recorded in the table below. Does there appear to be a difference in
production rates in different plants or by different shifts?

A B C D E
Early shift 102 93 85 110 72
Late shift 85 87 71 92 73
Night shift 75 80 75 77 76

Solution:
Here r = 3 and c = 5. We may obtain the two-way ANOVA table as follows:


Source DF SS MS F
Shift 2 652.13 326.07 5.62
Plant 4 761.73 190.43 3.28
Error 8 463.87 57.98
Total 14 1,877.73

Under the null hypothesis of no shift effect, F ∼ F2, 8 . Since F0.05, 2, 8 = 4.46 < 5.62,
we can reject the null hypothesis at the 5% significance level. (Note the p-value
= 0.030.)
Under the null hypothesis of no plant effect, F ∼ F4, 8 . Since F0.05, 4, 8 = 3.84 > 3.28,
we cannot reject the null hypothesis at the 5% significance level. (Note the p-value
= 0.072.)
Overall, the data collected show some evidence of a shift effect but little evidence
of a plant effect.
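The two-way sums of squares can be reproduced from the table with a short NumPy sketch (an illustration only, not part of the original solution).

    import numpy as np

    # production rates: rows are shifts, columns are plants A to E
    x = np.array([[102, 93, 85, 110, 72],
                  [85, 87, 71, 92, 73],
                  [75, 80, 75, 77, 76]], dtype=float)
    r, c = x.shape
    grand = x.mean()
    shift_ss = c * ((x.mean(axis=1) - grand) ** 2).sum()   # about 652.13
    plant_ss = r * ((x.mean(axis=0) - grand) ** 2).sum()   # about 761.73
    total_ss = ((x - grand) ** 2).sum()                    # about 1,877.73
    error_ss = total_ss - shift_ss - plant_ss              # about 463.87
    df_error = (r - 1) * (c - 1)
    print(shift_ss / (r - 1) / (error_ss / df_error))      # F for shifts, about 5.62
    print(plant_ss / (c - 1) / (error_ss / df_error))      # F for plants, about 3.28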

7. Complete the two-way ANOVA table below. In the places of p-values, indicate in
the form such as ‘< 0.01’ appropriately and use the closest value which you may
find from Murdoch and Barnes’ Statistical Tables.

Source DF SS MS F p-value
Row factor 4 ? 234.23 ? ?
Column factor 6 270.84 45.14 1.53 ?
Residual ? 708.00 ?
Total 34 1,915.76

Solution:
First, C2 SS = (C2 MS)×4 = 936.92.
The degrees of freedom for Error is 34 − 4 − 6 = 24. Therefore, Error MS
= 708.00/24 = 29.5.
Hence the F statistic for testing no C2 effect is 234.23/29.5 = 7.94. From Table 9 of
Murdoch and Barnes’ Statistical Tables, F0.001, 4, 24 = 6.59 < 7.94. Therefore, the
corresponding p-value is smaller than 0.001.
Since F0.05, 6, 24 = 2.51 > 1.53, the p-value for testing the C3 effect is greater than
0.05.
The complete ANOVA table is as follows:

Two-way ANOVA: C1 versus C2, C3

Source DF SS MS F P
C2 4 936.92 234.23 7.94 <0.001
C3 6 270.84 45.14 1.53 >0.05
Error 24 708.00 29.5
Total 34 1,915.76


E.2 Practice questions


Try to solve the questions before looking at the solutions – promise?! Solutions are
located in Appendix G.

1. An executive of a prepared frozen meals company is interested in the amounts of


money spent on such products by families in different income ranges. The table
below lists the monthly expenditures (in dollars) on prepared frozen meals from 15
randomly selected families divided into three groups according to their incomes.
Under $15,000 $15,000 – $30,000 Over $30,000
45.2 53.2 52.7
60.1 56.6 73.6
52.8 68.7 63.3
31.7 51.8 51.8
33.6 54.2
39.4
(a) Based on these data, can we infer at the 5% significance level that the
population mean expenditures on prepared frozen meals are the same for the
three different income groups?
(b) Produce a one-way ANOVA table.
(c) Construct 95% confidence intervals for the mean expenditures of the first
(under $15,000) and the third (over $30,000) income groups.

2. Does the level of success of publicly-traded companies affect the way their board
members are paid? The annual payments (in $000s) of randomly selected
publicly-traded companies to their board members were recorded. The companies
were divided into four quarters according to the returns in their stocks, and the
payments from each quarter were grouped together. Some summary statistics are
provided below.

Descriptive Statistics: 1st quarter, 2nd quarter, 3rd quarter, 4th quarter

Variable N Mean SE Mean StDev


1st quarter 30 74.10 2.89 15.81
2nd quarter 30 75.67 2.48 13.57
3rd quarter 30 78.50 2.79 15.28
4th quarter 30 81.30 2.85 15.59
(a) Can we infer that the amount of payment differs significantly across the four
groups of companies?
(b) Construct 95% confidence intervals for the mean payment of the 1st quarter
companies and the 4th quarter companies.

A total of 4,000 cans are opened around the world every second. Ten babies are
conceived around the world every second. Each time you open a can, you stand
a 1-in-400 chance of falling pregnant.
(True or false?)

Appendix F
Linear regression

F.1 Worked examples


1. Consider the simple linear regression model representing the linear relationship
between two variables, y and x:
yi = β0 + β1 xi + εi
for i = 1, 2, . . . , n, where εi are independent and identically distributed random
variables with mean 0 and variance σ 2 . Prove that the least squares straight line
must necessarily pass through the point (x̄, ȳ).
Solution:
The estimated regression line is:
ybi = βb0 + βb1 xi
where βb0 = ȳ − βb1 x̄. When x̄ is substituted for xi , we obtain:
yb = βb0 + βb1 x̄ = ȳ − βb1 x̄ + βb1 x̄ = ȳ.
Therefore, the least squares straight line must necessarily pass through the point
(x̄, ȳ).

2. The following linear regression model is proposed to represent the linear


relationship between two variables, y and x:
yi = βxi + εi
for i = 1, 2, . . . , n, where εi are independent and identically distributed random
variables with mean 0 and variance σ 2 , and β is an unknown parameter to be
estimated.
(a) A proposed estimator of β is βb, the value of β which minimises:

Σ (yi − βxi)².

Explain why this estimator is sensible.


(b) Another proposed estimator of β is β̃, the value of β which minimises:

Σ |yi − βxi|.

Explain why βb would be preferred to β̃.


(c) Express βb explicitly as a function of yi and xi only.


(d) Using the estimator βb:

i. what is the value of βb if yi = xi for all i? What if they are the exact
opposites of each other, i.e. yi = −xi for all i?
ii. is it always the case that −1 ≤ βb ≤ 1?
Solution:
(a) The estimator βb is sensible because it is the least squares estimator of β, which
provides the ‘best’ fit to the data in terms of minimising the sum of squared
residuals.
(b) The estimator βb is preferred to β̃ because the estimator β̃ is the least absolute
deviations estimator of β, which is also an option, but unlike βb it cannot be
computed explicitly via differentiation as the function f (x) = |x| is not
differentiable at zero. Therefore, β̃ is harder to compute than βb.
(c) We need to minimise a convex quadratic, so we can do that by differentiating
it and equating the derivative to zero. We obtain:
−2 Σ (yi − βb xi) xi = 0

which yields:
βb = Σ xi yi / Σ x²i.

(d) i. If xi = yi , then βb = 1. If xi = −yi , then βb = −1.


ii. Not true. A counterexample is to take n = 1, x1 = 1 and y1 = 2.

3. Let {(xi , yi )}, for i = 1, 2, . . . , n, be observations from the linear regression model:

yi = β0 + β1 xi + εi.

(a) Suppose that the slope, β1 , is known. Find the least squares estimator (LSE) of
the intercept, β0 .
(b) Suppose that the intercept, β0 , is known. Find the LSE of the slope, β1 .
Solution:
(a) When β1 is known, let zi = yi − β1 xi. The model then reduces to zi = β0 + εi. The LSE βb0 minimises Σ (zi − β0)², hence:

βb0 = z̄ = (1/n) Σ (yi − β1 xi).


(b) When β0 is known, we may write zi = yi − β0. The model is reduced to zi = β1 xi + εi. Note that (all sums over i = 1, 2, . . . , n):

Σ (zi − β1 xi)² = Σ (zi − βb1 xi + (βb1 − β1)xi)² = Σ (zi − βb1 xi)² + (βb1 − β1)² Σ x²i + 2D

where D = (βb1 − β1) Σ xi (zi − βb1 xi). Suppose we choose βb1 such that:

Σ xi (zi − βb1 xi) = 0, i.e. Σ xi zi − βb1 Σ x²i = 0.

Hence:

Σ (zi − β1 xi)² = Σ (zi − βb1 xi)² + (βb1 − β1)² Σ x²i ≥ Σ (zi − βb1 xi)².

Therefore, βb1 is the LSE of β1. Note now:

βb1 = Σ xi zi / Σ x²i = Σ xi (yi − β0) / Σ x²i.

4. Suppose an experimenter intends to perform a regression analysis by taking a total


of 2n data points, where the xi s are restricted to the interval [0, 5]. If the
xy-relationship is assumed to be linear and if the objective is to estimate the slope
with the greatest possible precision, what values should be assigned to the xi s?
Solution:
Since:
Var(βb1) = σ² / Σ (xi − x̄)²

in order to minimise the variance of the sampling distribution of βb1, we must maximise Σ (xi − x̄)².

To accomplish this, take half of the xi s to be 0, and the other half to be 5.

5. Suppose a total of n = 9 observations are to be taken on a simple linear regression


model, where the xi s will be set equal to 1, 2, . . . , 9. If the variance associated with
the xy-relationship is known to be 45, what is the probability that the estimated
slope will be within 1.5 units of the true slope?


Solution:
Since x̄ = (1 + 2 + · · · + 9)/9 = 5, then Σ (xi − x̄)² = 60 and so:

Var(βb1) = σ² / Σ (xi − x̄)² = 45/60 = 0.75.

Therefore:
βb1 ∼ N (β1 , 0.75).
We require:
 
P(|βb1 − β1| < 1.5) = P(|Z| < 1.5/√0.75) = P(|Z| < 1.73) = 1 − 2 × 0.0418 = 0.9164.

6. A researcher wants to investigate whether there is a significant link between GDP


per capita and average life expectancy in major cities. Data have been collected in
30 major cities, yielding average GDPs per capita x1 , x2 , . . . , x30 (in $000s) and
average life expectancies y1 , y2 , . . . , y30 (in years). The following linear regression
model has been proposed:
yi = β0 + β1 xi + εi
where the εi s are independent and N (0, σ 2 ). Some summary statistics are:
Σ xi = 620.35,  Σ yi = 2,123.00,  Σ xi yi = 44,585.1,  Σ x²i = 13,495.62  and  Σ y²i = 151,577.3

(all sums over i = 1, 2, . . . , 30).

(a) Find the least-squares estimates of β0 and β1 and write down the fitted
regression model.
(b) Compute a 95% confidence interval for the slope coefficient β1 . What can be
concluded?
(c) Compute R2 . What can be said about how ‘good’ the model is?
(d) With x = 30, find a prediction interval which covers y with probability 0.95.
With 97.5% confidence, what minimum average life expectancy can a city
expect once its GDP per capita reaches $30,000?
Solution:
(a) We have:
βb1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)² = (Σ xi yi − nx̄ȳ)/(Σ x²i − nx̄²) = 1.026
and:
βb0 = ȳ − βb1 x̄ = 49.55.
Hence the fitted model is:
yb = 49.55 + 1.026x.


(b) We first need E.S.E.(βb1), for which we need σb². For σb², we need the Residual
SS (from the Total SS and the Regression SS). We compute:

Total SS = Σ y²i − nȳ² = 1,339.67

Regression SS = βb1² (Σ x²i − nx̄²) = 702.99

Residual SS = Total SS − Regression SS = 636.68

σb² = 636.68/28 = 22.74

E.S.E.(βb1) = √(σb²/(Σ x²i − nx̄²)) = 0.184.

Hence a 95% confidence interval for β1 is:

(βb1 − t0.025, 28 × E.S.E.(βb1 ), βb1 + t0.025, 28 × E.S.E.(βb1 ))

which gives:
1.026 ± 2.05 × 0.184 ⇒ (0.65, 1.40).

The confidence interval does not contain zero. Therefore, we would reject the
hypothesis of β1 being zero at the 5% significance level. Hence there does
appear to be a significant link.

(c) The model can explain 52% of the variation of y, since:

R² = Regression SS/Total SS = 702.99/1,339.67 = 0.52.

Whether or not the model is ‘good’ is subjective. It is not necessarily ‘bad’,


although we may be able to determine a ‘better’ model with better
explanatory power, possibly using multiple linear regression.

(d) The prediction interval has the form:

βb0 + βb1 x ± t0.025, n−2 × σb × √(1 + (Σ x²i − 2x Σ xi + nx²)/(n(Σ x²i − nx̄²)))

which gives:
(69.79, 90.87).

Therefore, we can be 97.5% confident that the average life expectancy lies
above 69.79 years once GDP per capita reaches $30,000.
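For illustration, the fit, R² and the confidence interval for the slope can be reproduced from the summary sums alone; the sketch below assumes SciPy is available and is not part of the original solution.

    import math
    from scipy import stats

    n = 30
    sx, sy, sxy, sxx, syy = 620.35, 2123.00, 44585.1, 13495.62, 151577.3
    xbar, ybar = sx / n, sy / n
    b1 = (sxy - n * xbar * ybar) / (sxx - n * xbar ** 2)   # about 1.026
    b0 = ybar - b1 * xbar                                  # about 49.55
    total_ss = syy - n * ybar ** 2
    reg_ss = b1 ** 2 * (sxx - n * xbar ** 2)
    sigma2 = (total_ss - reg_ss) / (n - 2)                 # about 22.74
    ese_b1 = math.sqrt(sigma2 / (sxx - n * xbar ** 2))     # about 0.184
    tcrit = stats.t.ppf(0.975, df=n - 2)
    print(b0, b1, reg_ss / total_ss)                       # intercept, slope, R-squared
    print(b1 - tcrit * ese_b1, b1 + tcrit * ese_b1)        # 95% confidence interval for the slope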


7. The following is partial regression output:

The regression equation is


y = 2.1071 + 1.1263x

Predictor Coef SE Coef


Constant 2.1071 0.2321
x 1.1263 0.0911

Analysis of Variance

SOURCE DF SS
Regression 1 2011.12
Residual Error 40 539.17

In addition, x̄ = 1.56.
(a) Find an estimate of the error term variance, σ 2 .
(b) Calculate and interpret R2 .
(c) Test at the 5% significance level whether or not the slope in the regression
model is equal to 1.
(d) For x = 0.8, find a 95% confidence interval for the expectation of y.
Solution:
(a) Noting n = 40 + 1 + 1 = 42, we have:

σb² = Residual SS/(n − 2) = 539.17/40 = 13.479.

(b) We have Total SS = Regression SS + Residual SS = 2,550.29. Hence:

R² = Regression SS/Total SS = 2,011.12/2,550.29 = 0.7886.

Therefore, 78.86% of the variation of y is explained by x.


(c) Under H0 : β1 = 1, the test statistic is:

T = (βb1 − 1)/E.S.E.(βb1) ∼ tn−2 = t40.

We reject H0 if |t| > 2.021 = t0.025, 40 . As t = 0.1263/0.0911 = 1.386, we cannot


reject the null hypothesis that β1 = 1 at the 5% significance level.
(d) Note Σ (xi − x̄)² = (Regression SS)/(βb1)² = 2,011.12/(1.1263)² = 1,585.367. Also:

Σ (xi − x)² = Σ (xi − x̄)² + n(x̄ − x)² = 1,585.367 + 42 × (1.56 − 0.8)² = 1,609.626.


Hence a 95% confidence interval for E(y) given x = 0.8 is:

βb0 + βb1 x ± t0.025, n−2 × σb × √(Σ (xi − x)²/(n Σ (xj − x̄)²))
= 2.1071 + 1.1263 × 0.8 ± 2.021 × √(13.479 × 1,609.626/(42 × 1,585.367))
= 3.0081 ± 1.1536 ⇒ (1.854, 4.162).

8. Why is the squared sample correlation coefficient between the yi s and xi s the same
as the squared sample correlation coefficient between the yi s and ybi s? No algebra is
needed for this.
Solution:
The only difference between the xi s and ybi s is a rescaling by multiplying by βb1 ,
followed by a relocation by adding βb0 . Correlation coefficients are not affected by a
change of scale or location, so it will be the same whether we use the xi s or the ybi s.

9. If the model fits, then the fitted values and the residuals from the model are
independent of each other. What do you expect to see if the model fits when you
plot residuals against fitted values?
Solution:
If the model fits, one would expect to see a random scatter with no particular
pattern.

F.2 Practice questions


Try to solve the questions before looking at the solutions – promise?! Solutions are
located in Appendix G.

1. The table below shows the cost of fire damage for ten fires together with the
corresponding distances of the fires to the nearest fire station:

Distance in miles (x) 4.9 4.5 6.3 3.2 5.0


Cost in £000s (y) 31.1 31.1 43.1 22.1 36.2
Distance in miles (x) 5.7 4.0 4.3 2.5 5.2
Cost in £000s (y) 35.8 25.9 28.0 22.9 33.5

(a) Fit a straight line to these data and construct a 95% confidence interval for
the increase in cost of a fire for each mile from the nearest fire station.
(b) Test the hypothesis that the ‘true line’ passes through the origin.


2. The yearly profits made by a company, over a period of eight consecutive years are
shown below:

Year 1 2 3 4 5 6 7 8
Profit (in £000s) 18 21 34 31 44 46 60 75

(a) Fit a straight line to these data and compute a 95% confidence interval for the
‘true’ yearly increase in profits.
(b) The company accountant forecasts the profits for year 9 to be £90,000. Is this
forecast reasonable if it is based on the above data?

3. The data table below shows the yearly expenditure (in £000s) by a cosmetics
company in advertising a particular brand of perfume:

Year (x) 1 2 3 4 5 6 7 8
Expenditure (y) 170 170 275 340 435 510 740 832

(a) Fit a regression line to these data and construct a 95% confidence interval for
its slope.
(b) Construct an analysis of variance table and compute the R2 statistic for the fit.
(c) Comment on the goodness of fit of the linear regression model.
(d) Predict the expenditure for Year 9 and construct a 95% prediction interval for
the actual expenditure.

4. Let X and ε be two independent random variables, and E(ε) = 0. Let


Y = β0 + β1 X + ε. Show that:
β1 = Cov(X, Y)/Var(X) = Corr(X, Y) × √(Var(Y)/Var(X)).

Facts are stubborn, but statistics are more pliable.


(Mark Twain)

Appendix G
Solutions to Practice questions

G.1 Chapter 6 – Sampling distributions of statistics


1. (a) The sum of n independent Bernoulli random variables, each with success
probability π, is Bin(n, π). Here n = 4 and π = 0.2, so Σ Xi ∼ Bin(4, 0.2).
i=1
(b) The possible values of Σ Xi are 0, 1, 2, 3 and 4, and their probabilities can be
calculated from the binomial distribution. For example:

P(Σ Xi = 1) = (4 choose 1) × (0.2)¹ × (0.8)³ = 4 × 0.2 × 0.512 = 0.4096.

The other probabilities are shown in the table below.


Since X̄ = Σ Xi/4, the possible values of X̄ are 0, 0.25, 0.5, 0.75 and 1. Their
probabilities are the same as those of the corresponding values of Σ Xi. For
example, P(X̄ = 0.25) = P(Σ Xi = 1) = 0.4096. The values and their
probabilities are:
X̄ = x̄ 0.0 0.25 0.50 0.75 1.0
P (X̄ = x̄) 0.4096 0.4096 0.1536 0.0256 0.0016

(c) For Xi ∼ Bernoulli(π), E(Xi ) = π and Var(Xi ) = π(1 − π). Therefore, the
approximate normal sampling distribution of X̄, derived from the central limit
theorem, is N (π, π(1 − π)/n). Here this is:
 
N(0.2, 0.2 × 0.8/100) = N(0.2, 0.0016) = N(0.2, (0.04)²).

Therefore, the probability requested by the question is approximately:


 
P(X̄ > 0.3) = P((X̄ − 0.2)/0.04 > (0.3 − 0.2)/0.04) = P(Z > 2.5) = 0.0062

using Table 3 of Murdoch and Barnes’ Statistical Tables. This is very close to
the probability obtained from the exact sampling distribution, which is about
0.0061.
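As an illustrative check (assuming SciPy is available), both the exact binomial tail probability and the central limit theorem approximation can be computed directly.

    from scipy import stats

    n, pi = 100, 0.2
    exact = stats.binom.sf(30, n, pi)    # P(sum of the X_i > 30), about 0.0061
    approx = stats.norm.sf(0.3, loc=pi, scale=(pi * (1 - pi) / n) ** 0.5)   # about 0.0062
    print(exact, approx)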

2. (a) Let {X1 , X2 , . . . , Xn } denote the random sample. We know that the sampling
distribution of X̄ is N(µ, σ²/n), here N(4, 2²/20) = N(4, 0.2).


i. The probability we need is:


 
P(X̄ > 5) = P((X̄ − 4)/√0.2 > (5 − 4)/√0.2) = P(Z > 2.24) = 0.0126

where, as usual, Z ∼ N (0, 1).


ii. P (X̄ < 3) is obtained similarly. Note that this leads to
P (Z < −2.24) = 0.0126, which is equal to the P (X̄ > 5) = P (Z > 2.24)
result obtained above. This is because 5 is one unit above the mean µ = 4,
and 3 is one unit below the mean, and because the normal distribution is
symmetric around its mean.
iii. One way of expressing this is:

P (X̄ − µ > 1) = P (X̄ − µ < −1) = 0.0126

for µ = 4. This also shows that:

P (X̄ − µ > 1) + P (X̄ − µ < −1) = P (|X̄ − µ| > 1) = 2 × 0.0126 = 0.0252

and hence:
P (|X̄ − µ| ≤ 1) = 1 − 2 × 0.0126 = 0.9748.
In other words, the probability is 0.9748 that the sample mean is within
one unit of the true population mean, µ = 4.
(b) We can use the same ideas as in (a). Since X̄ ∼ N (µ, 4/n) we have:

P (|X̄ − µ| ≤ 0.5) = 1 − 2 × P (X̄ − µ > 0.5)


= 1 − 2 × P((X̄ − µ)/√(4/n) > 0.5/√(4/n))
= 1 − 2 × P(Z > 0.25√n)
≥ 0.95

which holds if:


P(Z > 0.25√n) ≤ 0.05/2 = 0.025.

From Table 3 of Murdoch and Barnes’ Statistical Tables, we see that this is
true when 0.25√n ≥ 1.96, i.e. when n ≥ (1.96/0.25)² = 61.5. Rounding up to
the nearest integer, we get n ≥ 62. The sample size should be at least 62 for us
to be 95% confident that the sample mean will be within 0.5 units of the true
mean, µ.
(c) Here n > 62, yet x̄ is further than 0.5 units from the claimed mean of µ = 5.
Based on the result in (b), this would be quite unlikely if µ is really 5. One
explanation of this apparent contradiction is that µ is not really equal to 5.
This kind of reasoning will be the basis of statistical hypothesis testing, which
will be discussed later in the course.


3. (a) The sample average is composed of 25 randomly sampled data which are
subject to sampling variability, hence the average is also subject to this
variability. Its sampling distribution describes its probability properties. If a
large number of such averages were independently sampled, then their
histogram would be the sampling distribution.
(b) It is reasonable to assume that this sampling distribution is normal due to the
CLT, although the sample size is rather small. If n = 25 and µ = 54 and
σ = 10, then the CLT says that:

X̄ ∼ N(µ, σ²/n) = N(54, 100/25).

(c) i. We have:
P(X̄ > 60) = P(Z > (60 − 54)/√(100/25)) = P(Z > 3) = 0.0013

using Table 3 of Murdoch and Barnes’ Statistical Tables.


ii. We are asked for:
 
P(0.95 × 54 < X̄ < 1.05 × 54) = P(−0.05 × 54/2 < Z < 0.05 × 54/2)
= P(−1.35 < Z < 1.35)
= 0.8230

using Table 3 of Murdoch and Barnes’ Statistical Tables.

G.2 Chapter 7 – Point estimation


1. We have:
 
E(X) = E(X1/2 + X2/2) = (1/2) × E(X1) + (1/2) × E(X2) = (1/2) × µ + (1/2) × µ = µ

and:

E(Y) = E(X1/3 + 2X2/3) = (1/3) × E(X1) + (2/3) × E(X2) = (1/3) × µ + (2/3) × µ = µ.

It follows that both estimators are unbiased estimators of µ.

2. The formula for S 2 is:


S² = (1/(n − 1)) (Σ X²i − nX̄²).


This means that (n − 1)S² = Σ X²i − nX̄², hence:

E((n − 1)S²) = (n − 1) E(S²) = E(Σ X²i − nX̄²) = n E(X²i) − n E(X̄²).

Because the sample is random, E(Xi2 ) = E(X 2 ) for all i = 1, 2, . . . , n as all the
variables are identically distributed. From the standard formula
Var(X) = σ 2 = E(X 2 ) − µ2 , so (using the hint):

E(X²) = σ² + µ² and E(X̄²) = µ² + σ²/n.

Hence:

(n − 1) E(S²) = n(σ² + µ²) − n(µ² + σ²/n) = (n − 1)σ²
so E(S 2 ) = σ 2 , which means that S 2 is an unbiased estimator of σ 2 , as stated.
The standard formula for Var(X), applied to S, states that:

E(S 2 ) = Var(S) + (E(S))2

which means that:


E(S) = √(E(S²) − Var(S)) = √(σ² − Var(S)) < σ = √σ²

since all variances are strictly positive. It follows that S is a biased estimator of σ
(with its average value lower than the true value σ).

3. As defined, R is a random variable, and R ∼ Bin(n, π), so that E(R) = nπ and


hence E(R/n) = π. It also follows that:
     
1 − E(R/n) = E(1 − R/n) = E((n − R)/n) = 1 − π.

So the first obvious guess is that we should try R/n × (1 − R/n) = R/n − (R/n)2 .
Now:
nπ(1 − π) = Var(R) = E(R2 ) − (E(R))2 = E(R2 ) − (nπ)2 .
So:

E((R/n)²) = (1/n²) E(R²) = (1/n²)(nπ(1 − π) + n²π²)

⇒ E(R/n − (R/n)²) = (1/n) E(R) − (1/n²)(nπ(1 − π) + n²π²)
= nπ/n − n²π²/n² − π(1 − π)/n
= π − π² − π(1 − π)/n.


However, π(1 − π) = π − π², so:

E(R/n − (R/n)²) = π(1 − π) − π(1 − π)/n = π(1 − π) × (n − 1)/n.

It follows that:

π(1 − π) = (n/(n − 1)) × E(R/n − (R/n)²) = E(R/(n − 1) − R²/(n(n − 1))).

So we have found the unbiased estimator of π(1 − π), but it could do with tidying
up! When this is done, we see that:

R(n − R)
n(n − 1)

is the required unbiased estimator of π(1 − π).

4. For T1:

E(T1) = E(Sxx/(n − 1)) = (1/(n − 1)) E(Sxx) = (1/(n − 1)) × (n − 1)σ² = σ².

Hence T1 is an unbiased estimator of σ 2 . Turning to the variance:


Var(T1) = Var(Sxx/(n − 1)) = (1/(n − 1))² × Var(Sxx) = (1/(n − 1))² × 2σ⁴(n − 1) = 2σ⁴/(n − 1).

By definition, MSE(T1 ) = Var(T1 ) + (Bias(T1 ))2 = 2σ 4 /(n − 1) + 02 = 2σ 4 /(n − 1).


For T2:

E(T2) = E(Sxx/n) = (1/n) E(Sxx) = (1/n) × (n − 1)σ² = (1 − 1/n)σ².

It follows that Bias(T2) = −σ²/n, hence T2 is a biased estimator of σ². Also:

Var(T2) = Var(Sxx/n) = (1/n)² × Var(Sxx) = (1/n)² × 2σ⁴(n − 1) = 2(n − 1)σ⁴/n².

By definition, MSE(T2 ) = 2(n − 1)σ 4 /n2 + (−σ 2 /n)2 = (2n − 1)σ 4 /n2 .
It can be seen that MSE(T1 ) > MSE(T2 ) since:

2/(n − 1) − (2n − 1)/n² = (2n² − (n − 1)(2n − 1))/(n²(n − 1)) = (2n² − (2n² − 3n + 1))/(n²(n − 1)) = (3n − 1)/(n²(n − 1)) > 0.

So, although T2 is a biased estimator of σ 2 , it is preferable to T1 due to the


dominating effect of its smaller variance.


5. (a) We start off with the sum of squares function:


S = Σ ε²i = (y1 − α − β)² + (y2 + α − β)² + (y3 − α + β)² + (y4 + α + β)².

Now take the partial derivatives:


∂S/∂α = −2(y1 − α − β) + 2(y2 + α − β) − 2(y3 − α + β) + 2(y4 + α + β) = −2(y1 − y2 + y3 − y4) + 8α

and:

∂S/∂β = −2(y1 − α − β) − 2(y2 + α − β) + 2(y3 − α + β) + 2(y4 + α + β) = −2(y1 + y2 − y3 − y4) + 8β.

The least squares estimators αb and βb are the solutions to ∂S/∂α = 0 and
∂S/∂β = 0. Hence:

αb = (y1 − y2 + y3 − y4)/4 and βb = (y1 + y2 − y3 − y4)/4.

(b) αb is an unbiased estimator of α since:

E(αb) = E((y1 − y2 + y3 − y4)/4) = (α + β + α − β + α − β + α + β)/4 = α.

βb is an unbiased estimator of β since:

E(βb) = E((y1 + y2 − y3 − y4)/4) = (α + β − α + β − α + β + α + β)/4 = β.

(c) We have:

Var(αb) = Var((y1 − y2 + y3 − y4)/4) = 4σ²/16 = σ²/4.

G.3 Chapter 8 – Interval estimation


1. (a) The total value of the stock is 9,875µ, where µ is the mean value of an item of
stock. From Chapter 6, X̄ is the obvious estimator of µ, so 9,875X̄ is the
obvious estimator of 9,875µ. Therefore, an estimate for the total value of the
stock is 9,875 × 320.41 = £3,160,000 (to the nearest £10,000).
(b) In this question n = 50 is large, and σ 2 is unknown so a 95% confidence
interval for µ is:
x̄ ± 1.96 × s/√n = 320.41 ± 1.96 × 40.6/√50 = 320.41 ± 11.25 ⇒ (£309.16, £331.66).


Note that because n is large we have used the standard normal distribution. It
is more accurate to use a t distribution with 49 degrees of freedom. This gives
an interval of (£308.87, £331.95) – not much of a difference.
To obtain a 95% confidence interval for the total value of the stock, 9,875µ,
multiply the interval by 9,875. This gives (to the nearest £10,000):
(£3,050,000, £3,280,000).

(c) Increasing the sample size by a factor of k reduces the width of the confidence
interval by a factor of √k. Therefore, increasing the sample size by a factor of
4 will reduce the width of the confidence interval by a factor of 2 (= √4).
Hence we need to increase the sample size from 50 to 4 × 50 = 200. So we
should collect another 150 observations.

2. (a) Let π be the proportion of shareholders in the population. Start by estimating


π. We are estimating a proportion and n is large, so an approximate 95%
confidence interval for π is, using the central limit theorem:
πb ± 1.96 × √(πb(1 − πb)/n) ⇒ 0.23 ± 1.96 × √(0.23 × 0.77/954) = 0.23 ± 0.027 ⇒ (0.203, 0.257).
Therefore, a 95% confidence interval for the number (rather than the
proportion) of shareholders in the UK is obtained by multiplying the above
interval endpoints by 41 million and getting the answer 8.3 million to 10.5
million. An alternative way of expressing this is:
9,400,000 ± 1,100,000 ⇒ (8,300,000, 10,500,000).
Therefore, we estimate there are about 9.4 million shareholders in the UK,
with a margin of error of 1.1 million.
(b) Let us start by finding a 95% confidence interval for the difference in the two
proportions. We use the formula:
πb1 − πb2 ± 1.96 × √(πb1(1 − πb1)/n1 + πb2(1 − πb2)/n2).
The estimates of the proportions π1 and π2 are 0.23 and 0.171, respectively.
We know n1 = 954 and although n2 is unknown we can assume it is
approximately equal to 954 (noting the ‘similar’ in the question), so an
approximate 95% confidence interval is:
0.23 − 0.171 ± 1.96 × √(0.23 × 0.77/954 + 0.171 × 0.829/954) = 0.059 ± 0.036 ⇒ (0.023, 0.094).
By multiplying by 41 million, we get a confidence interval of:
2,400,000 ± 1,500,000 ⇒ (900,000, 3,900,000).
We estimate that the number of shareholders has increased by about 2.4
million in the two years. There is quite a large margin of error, i.e. 1.5 million,
especially when compared with a point estimate (i.e. interval midpoint) of 2.4
million.


G.4 Chapter 9 – Hypothesis testing


1. (a) We have n = 50 and σ = 1. We wish to test:

H0 : µ = 0.65 (sample is from ‘B’) vs. H1 : µ = 0.80 (sample is from ‘A’).

The decision rule is that we reject H0 if x̄ > 0.75.


The probability of a Type I error is:

P(X̄ > 0.75 | H0) = P(Z > (0.75 − 0.65)/(1/√50)) = P(Z > 0.71) = 0.2389.

The probability of a Type II error is:

P(X̄ < 0.75 | H1) = P(Z < (0.75 − 0.80)/(1/√50)) = P(Z < −0.35) = 0.3632.
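These two error probabilities can be verified with a couple of lines of Python (illustrative only, assuming SciPy is available).

    from scipy import stats

    se = 1 / 50 ** 0.5                                  # standard error of the sample mean
    alpha = stats.norm.sf(0.75, loc=0.65, scale=se)     # P(Xbar > 0.75 | mu = 0.65), about 0.24
    beta = stats.norm.cdf(0.75, loc=0.80, scale=se)     # P(Xbar < 0.75 | mu = 0.80), about 0.36
    print(alpha, beta)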

(b) To find the sample size n and the value a, we need to solve two conditions:

• α = P(X̄ > a | H0) = P(Z > (a − 0.65)/(1/√n)) = 0.05 ⇒ (a − 0.65)/(1/√n) = 1.645.

• β = P(X̄ < a | H1) = P(Z < (a − 0.80)/(1/√n)) = 0.10 ⇒ (a − 0.80)/(1/√n) = −1.28.
Solving these equations gives a = 0.734 and n = 381, remembering to round
up!
(c) A sample is classified as being from A (i.e. we choose H1) if x̄ > 0.75. We have:

α = P(X̄ > 0.75 | H0) = P(Z > (0.75 − 0.65)/(1/√n)) = 0.02 ⇒ (0.75 − 0.65)/(1/√n) = 2.05.
Solving this equation gives n = 421, remembering to round up! Therefore:

β = P(X̄ < 0.75 | H1) = P(Z < (0.75 − 0.80)/(1/√421)) = P(Z < −1.026) = 0.1515.

(d) The rule in (b) is ‘take n = 381 and reject H0 if x̄ > 0.734’. So:

P(X̄ > 0.734 | µ = 0.7) = P(Z > (0.734 − 0.7)/(1/√381)) = P(Z > 0.66) = 0.2546.

2. (a) We have:

α = 1 − P (21.97 < X̄ < 22.03 | µ = 22)


 
= 1 − P((21.97 − 22)/(0.08/√50) < Z < (22.03 − 22)/(0.08/√50))
= 1 − P (−2.65 < Z < 2.65)
= 1 − 0.992
= 0.008.


(b) We have:

β = P (21.97 < X̄ < 22.03 | µ = 22.05)


 
= P((21.97 − 22.05)/(0.08/√50) < Z < (22.03 − 22.05)/(0.08/√50))
= P (−7.07 < Z < −1.77)
= P (Z < −1.77) − P (Z < −7.07)
= 0.0384.

(c) We have:

P (rejecting H0 | µ = 22.01) = 1 − P (21.97 < X̄ < 22.03 | µ = 22.01)


 
= 1 − P((21.97 − 22.01)/(0.08/√50) < Z < (22.03 − 22.01)/(0.08/√50))
= 1 − P(−3.53 < Z < 1.77)
= 1 − (P (Z < 1.77) − P (Z < −3.53))
= 1 − (0.9616 − 0.00023)
= 0.0386.

3. (a) We are to test H0 : µ = 12 vs. H1 : µ 6= 12. The key points here are that n is
small and that σ 2 is unknown. We can use the t test and this is valid provided
the data are normally distributed. The test statistic value is:
t = (x̄ − 12)/(s/√7) = (12.7 − 12)/(0.858/√7) = 2.16.
This is compared to a Student’s t distribution on 6 degrees of freedom. The
critical value corresponding to a 5% significance level is 2.447. Hence we
cannot reject the null hypothesis at the 5% significance level. (We can reject at
the 10% significance level, but the convention on this course is to regard such
evidence merely as casting doubt on H0 , rather than justifying rejection as
such, i.e. such a result would be ‘weakly significant’.)
(b) We are to test H0 : µ = 12 vs. H1 : µ < 12. There is no need to do a formal
statistical test. As the sample mean is 12.7, which is greater than 12, there is
no evidence whatsoever for the alternative hypothesis.
In (a) you are asked to do a two-sided test and in (b) it is a one-sided test. Which
is more appropriate will depend on the purpose of the experiment, and your
suspicions before you conduct it.
• If you suspected before collecting the data that the mean voltage was less than
12 volts, the one-sided test would be appropriate.
• If you had no prior reason to believe that the mean was less than 12 volts you
would perform a two-sided test.


• General rule: decide on whether it is a one- or two-sided test before performing


the statistical test!

4. It is useful to discuss the issues about this question before giving the solution.
• We want to know whether a loyalty programme such as that at the 12 selected
restaurants would result in an increase in mean profits greater than that
observed (during the three-month test) at the other sites within the chain.
• So we can model the profits across the chain as $1,047.34 + x, where $x is the
supposed effect of the promotion, and if the true mean value of x is µ, then we
wish to test:
H0 : µ = 0 vs. H1 : µ > 0
which is a one-tailed test since, clearly, there are (preliminary) grounds for
thinking that there is an increase due to the loyalty programme.
• We know nothing about the variability of profits across the rest of the chain,
so we will have to use the sample data, i.e. to calculate the sample variance
and to employ the t distribution with ν = 12 − 1 = 11 degrees of freedom.
• Although we shall want the variance of the data ‘sample value − 1,047.34’,
this will be the same as the variance of the sample data, since for any random
variable X and constant k we have:

Var(X + k) = Var(X)

because in calculating the variance every value (xi − x̄) is ‘replaced’ by


((xi + k) − (x̄ + k)), which is in fact the same value.
• So we need to calculate x̄, Σ x²i, Sxx = Σ x²i − nx̄² and s².
The total change in profit for restaurants in the programme is Σ xi = 30,113.17.
Since n = 12, the mean change in profit for restaurants in the programme is:
30,113.17/12 = 2,509.431 = 1,047.34 + 1,462.091
hence use x̄ = 1,462.091.
The raw sum of squares is Σ x²i = 126,379,568.8. So, the ‘corrected’ sum of squares is:

Sxx = Σ x²i − nx̄² = 126,379,568.8 − 12 × (2,509.431)² = 50,812,651.51.

Therefore:
s² = Sxx/(n − 1) = 50,812,651.51/11 = 4,619,331.956.
Hence the estimated standard error is:
s/√n = √(4,619,331.956/12) = √384,944.3296 = 620.439.


So, the test statistic value is:

(x̄ − µ0)/(s/√n) = (1,462.091 − 0)/620.439 = 2.3565.

The relevant critical values for t11 in this one-tailed test are:

5%: 1.796 and 1%: 2.718.

So we see that the test is significant at the 5% significance level, but not at the 1%
significance level, so reject H0 and conclude that the loyalty programme does have
an effect. (In fact, this means the result is moderately significant that the
programme has had a beneficial effect for the company.)

5. (a) We test H0 : µA = µB vs. H1 : µA 6= µB , where we use a two-tailed test since


there is no prior reason to suggest the direction of the difference, if any. The
test statistic value is:
(11.9 − 10.8)/√(7.3/44 + 6.2/52) = 2.06
where we assume the sample variances are equal to the population variances
due to the large sample sizes (and hence we would expect accurate variance
estimates). For a two-tailed test, this is significant at the 5% significance level
(1.96 < 2.06), but not at the 1% significance level (2.06 < 2.576). We reject H0
and conclude that company A is slower in repair times on average than
company B, with a moderately significant result.

(b) The p-value for this two-tailed test is 2 × P (Z > 2.06) = 0.0394.

(c) For small samples, we should use a pooled estimate of the population standard
deviation:
s = √(((9 − 1) × 7.3 + (17 − 1) × 6.2)/((9 − 1) + (17 − 1))) = 2.5626 on 24 degrees of freedom.

Hence the test statistic value in this case is:


(11.9 − 10.8)/(2.5626 × √(1/9 + 1/17)) = 1.04.

This should be compared with the t24 distribution and is clearly not
significant, even at the 10% significance level. With the smaller samples we fail
to detect the difference.
Comparing the two test statistic calculations shows that the different results
flow from differences in the estimated standard errors, hence ultimately (and
unsurprisingly) from the differences in the sample sizes used in the two
situations.


6. (a) Let π be the population proportion of visitors who would use the device. We
test H0: π = 0.3 vs. H1: π < 0.3. The sample proportion is p = 20/80 = 0.25. The
standard error of the sample proportion is √(0.3 × 0.7/80) = 0.0512. The test
statistic value is:

z = (0.25 − 0.30)/0.0512 = −0.976.
For a one-sided (lower-tailed) test at the 5% significance level, the critical
value is −1.645, so the test is not significant – and not even at the 10%
significance level (the critical value is −1.282). On the basis of the data, there
is no reason to withdraw the device.
The critical region for the above test is to reject H0 if the sample proportion is
less than 0.3 − 1.645 × 0.0512, i.e. if the sample proportion, p, is less than
0.2157.
(b) The p-value of the test is the probability of the test statistic value or a more
extreme value conditional on H0 being true. Hence the p-value is:

P (Z ≤ −0.976) = 0.1645.

So for any α < 0.1645 we would fail to reject H0 .


(c) The power of the test when π = 0.2 is the conditional probability:

P (P < 0.2157 | π = 0.2).

When π = 0.2, the standard error of the sample proportion is √(0.2 × 0.8/80) = 0.0447.
Therefore, the power when π = 0.2 is:

P(Z < (0.2157 − 0.2)/0.0447) = P(Z < 0.35) = 0.6368.

G.5 Chapter 10 – Analysis of variance


1. (a) For this example, k = 3, n1 = 6, n2 = 5, n3 = 4 and n = n1 + n2 + n3 = 15.
We have x̄·1 = 43.8, x̄·2 = 56.9, x̄·3 = 60.35 and x̄ = 52.58.
Also, Σ_{j=1}^{3} Σ_{i=1}^{n_j} x_{ij}² = 43,387.85.

Total SS = Σ_{j=1}^{3} Σ_{i=1}^{n_j} x_{ij}² − nx̄² = 43,387.85 − 41,469.85 = 1,918.

w = Σ_{j=1}^{3} Σ_{i=1}^{n_j} x_{ij}² − Σ_{j=1}^{3} n_j x̄_{·j}² = 43,387.85 − 42,267.18 = 1,120.67.

Therefore, b = Total SS − w = 1,918 − 1,120.67 = 797.33.

To test H0 : µ1 = µ2 = µ3 , the test statistic value is:

f = (b/(k − 1))/(w/(n − k)) = (797.33/2)/(1,120.67/12) = 4.269.


Under H0 , F ∼ F2, 12 . Since F0.05, 2, 12 = 3.89 < 4.269, we reject H0 at the 5%
significance level, i.e. there is evidence that the population mean expenditures on
frozen meals are not the same for the three different income groups.
(b) The ANOVA table is as follows:
Source DF SS MS F P
Income 2 797.33 398.67 4.269 <0.05
Error 12 1,120.67 93.39
Total 14 1,918.00
(c) A 95% confidence interval for µj is of the form:

X̄·j ± t_{0.025, n−k} × S/√n_j = X̄·j ± t_{0.025, 12} × √93.39/√n_j = X̄·j ± 21.056/√n_j .

For j = 1, a 95% confidence interval is 43.8 ± 21.056/√6 ⇒ (35.20, 52.40).

For j = 3, a 95% confidence interval is 60.35 ± 21.056/√4 ⇒ (49.82, 70.88).
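
The F test and the two confidence intervals can be reproduced from the sums of squares
quoted above. The sketch below assumes the group sizes and group means given in the
solution and uses scipy for the F and t distributions.

import numpy as np
from scipy import stats

k, n_j = 3, np.array([6, 5, 4])
n = n_j.sum()
total_ss, w = 1918.0, 1120.67
b = total_ss - w

f_stat = (b / (k - 1)) / (w / (n - k))
print(f"F = {f_stat:.3f}, F(0.05; 2, 12) = {stats.f.ppf(0.95, 2, 12):.2f}")
print(f"p-value = {stats.f.sf(f_stat, 2, 12):.4f}")          # below 0.05

# 95% confidence intervals for the group means, using s = sqrt(MSW) on n - k df
s = np.sqrt(w / (n - k))
half_widths = stats.t.ppf(0.975, n - k) * s / np.sqrt(n_j)
for j, (xbar, hw) in enumerate(zip([43.8, 56.9, 60.35], half_widths), start=1):
    print(f"group {j}: ({xbar - hw:.2f}, {xbar + hw:.2f})")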

2. (a) Here k = 4 and n1 = n2 = n3 = n4 = 30, so n = 120. We have x̄·1 = 74.10,
x̄·2 = 75.67, x̄·3 = 78.50, x̄·4 = 81.30, b = 909, w = 26,408 and the pooled estimate
of σ is s = 15.09.
Hence the test statistic value is:

f = (b/(k − 1))/(w/(n − k)) = (909/3)/(26,408/116) = 1.33.

Under H0 : µ1 = µ2 = µ3 = µ4 , F ∼ Fk−1, n−k = F3, 116 . Since
F0.05, 3, 116 = 2.68 > 1.33, we cannot reject H0 at the 5% significance level.
Hence there is no evidence that the population mean payments differ among the four
groups.
(b) A 95% confidence interval for µj is of the form:

X̄·j ± t_{0.025, n−k} × S/√n_j = X̄·j ± t_{0.025, 116} × 15.09/√30 = X̄·j ± 5.46.

For j = 1, a 95% confidence interval is 74.10 ± 5.46 ⇒ (68.64, 79.56).


For j = 4, a 95% confidence interval is 81.30 ± 5.46 ⇒ (75.84, 86.76).
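
The same kind of check applies here, again taking the quoted sums of squares as given;
note that the denominator degrees of freedom are now n − k = 116.

import numpy as np
from scipy import stats

b, w, k, n = 909.0, 26408.0, 4, 120
f_stat = (b / (k - 1)) / (w / (n - k))
print(f"F = {f_stat:.2f}, F(0.05; 3, 116) = {stats.f.ppf(0.95, 3, 116):.2f}")   # 1.33 vs about 2.68

half_width = stats.t.ppf(0.975, n - k) * np.sqrt(w / (n - k)) / np.sqrt(30)
print(f"confidence interval half-width = {half_width:.2f}")                     # about 5.46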

G.6 Chapter 11 – Linear regression


1. (a) We first calculate x̄ = 4.56, Σ x_i² = 219.46, ȳ = 30.97, Σ y_i² = 9,973.99 and
Σ x_i y_i = 1,475.1. The estimated regression coefficients are:

β̂1 = (1,475.1 − 10 × 4.56 × 30.97)/(219.46 − 10 × (4.56)²) = 5.46 and
β̂0 = 30.97 − 5.46 × 4.56 = 6.07.


The fitted line is:

\widehat{Cost} = 6.07 + 5.46 × Distance.
In order to perform statistical inference, we need to find:

σ̂² = Σ_i (y_i − β̂0 − β̂1 x_i)²/(n − 2)
   = (Σ y_i² + nβ̂0² + β̂1² Σ x_i² − 2β̂0 Σ y_i − 2β̂1 Σ x_i y_i + 2β̂0 β̂1 Σ x_i)/(n − 2)
   = (9,973.99 + 10 × (6.07)² + (5.46)² × 219.46 − 2 × 6.07 × 309.7
      − 2 × 5.46 × 1,475.1 + 2 × 6.07 × 5.46 × 45.6)/(10 − 2)
   = 4.95.

The estimated standard error of β̂1 is:

√4.95/√(219.46 − 10 × (4.56)²) = 0.66.

Hence a 95% confidence interval for β1 is 5.46 ± 2.306 × 0.66 ⇒ (3.94, 6.98).
(b) To test H0 : β0 = 0 vs. H1 : β0 ≠ 0, we first determine the estimated standard
error of β̂0 , which is:

(√4.95/√10) × (219.46/(219.46 − 10 × (4.56)²))^{1/2} = 3.07.

Therefore, the test statistic value is:

6.07/3.07 = 1.98.
Comparing with the t8 distribution, this is not significant at the 5%
significance level (1.98 < 2.306), but it is significant at the 10% significance
level (1.860 < 1.98).
There is only weak evidence against the null hypothesis. Note though that in
practice this hypothesis is not really of interest. A line through the origin
implies that there is zero cost of a fire which takes place right next to a fire
station. This hypothesis does not seem sensible!

2. The question implies that we want to explain changes in profitability by the


passage of time. If we let x represent years and y represent profits (in £000s) then
we need to perform a regression of y on x.
(a) We first calculate x̄ = 4.5, Σ x_i² = 204, ȳ = 41.125, Σ y_i² = 16,159 and
Σ x_i y_i = 1,802. The estimated regression coefficients are:

β̂1 = (1,802 − 8 × 4.5 × 41.125)/(204 − 8 × (4.5)²) = 7.65 and
β̂0 = 41.125 − 7.65 × 4.5 = 6.70.

The fitted line is:

\widehat{Profit} = 6.70 + 7.65 × Year.


In order to perform statistical inference, we need to find:

σ̂² = Σ_i (y_i − β̂0 − β̂1 x_i)²/(n − 2)
   = (Σ y_i² + nβ̂0² + β̂1² Σ x_i² − 2β̂0 Σ y_i − 2β̂1 Σ x_i y_i + 2β̂0 β̂1 Σ x_i)/(n − 2)
   = (16,159 + 8 × (6.70)² + (7.65)² × 204 − 2 × 6.70 × 329
      − 2 × 7.65 × 1,802 + 2 × 6.70 × 7.65 × 36)/(8 − 2)
   = 27.98.

The estimated standard error of β̂1 is:

√27.98/√(204 − 8 × (4.5)²) = 0.82.
Hence a 95% confidence interval for β1 is 7.65 ± 2.447 × 0.82 ⇒ (5.64, 9.66).
(b) Substituting x = 9 we find the predicted year 9 profit (in £000s) is 75.55. The
estimated standard error of this prediction is:
√27.98 × (1 + (204 − 2 × 9 × 36 + 8 × 9²)/(8 × (204 − 8 × (4.5)²)))^{1/2} = 6.71.

It follows that (using t_{n−2} = t_6) a 95% prediction interval for the year 9 profit
(in £000s) is:
75.55 ± 2.447 × 6.71 ⇒ (59.13, 91.97).
As 90 is in this prediction interval, we cannot reject the accountant’s forecast
out of hand. However, it is right at the top end of the prediction interval, and
hence seems rather optimistic.
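
A sketch of the prediction calculation, again working only from the summary sums given
above; the point forecast and the interval should match the hand calculation up to
rounding.

import numpy as np
from scipy import stats

n, xbar, sum_x, sum_x2 = 8, 4.5, 36, 204
b0, b1, sigma2 = 6.70, 7.65, 27.98
x_new = 9

y_hat = b0 + b1 * x_new                               # point prediction: 75.55
sxx = sum_x2 - n * xbar**2
sum_dev2 = sum_x2 - 2 * x_new * sum_x + n * x_new**2  # sum of (x_i - x_new)^2
se_pred = np.sqrt(sigma2 * (1 + sum_dev2 / (n * sxx)))

t = stats.t.ppf(0.975, n - 2)                         # 2.447
print(f"prediction: {y_hat:.2f}")
print(f"95% prediction interval: ({y_hat - t * se_pred:.2f}, {y_hat + t * se_pred:.2f})")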

3. (a) We first calculate x̄ = 4.5, Σ x_i² = 204, ȳ = 434, Σ y_i² = 1,938,174 and
Σ x_i y_i = 19,766. The estimated regression coefficients are:

β̂1 = (19,766 − 8 × 4.5 × 434)/(204 − 8 × (4.5)²) = 98.62 and
β̂0 = 434 − 98.62 × 4.5 = −9.79.

The fitted line is:

\widehat{Expenditure} = −9.79 + 98.62 × Year.
In order to perform statistical inference, we need to find:

σ̂² = Σ_i (y_i − β̂0 − β̂1 x_i)²/(n − 2)
   = (Σ y_i² + nβ̂0² + β̂1² Σ x_i² − 2β̂0 Σ y_i − 2β̂1 Σ x_i y_i + 2β̂0 β̂1 Σ x_i)/(n − 2)
   = (1,938,174 + 8 × (−9.79)² + (98.62)² × 204 − 2 × (−9.79) × 3,472
      − 2 × 98.62 × 19,766 + 2 × (−9.79) × 98.62 × 36)/(8 − 2)
   = 3,807.65.


The estimated standard error of β̂1 is:

√3,807.65/√(204 − 8 × (4.5)²) = 9.52.

Hence a 95% confidence interval for β1 is:

98.62 ± 2.447 × 9.52 ⇒ (75.32, 121.92).

(b) The ANOVA table is:


Source DF SS MS F
Regression 1 408,480 408,480 107.269
Residual Error 6 22,846 3,808
Total 7 431,326
Hence R2 = 408,480/431,326 = 0.947.
(c) As R2 is very close to 1, the linear regression model provides a very good fit.
(d) Substituting x = 9 we find the predicted year 9 expenditure (in £000s) is 877.79.
The estimated standard error of this prediction is:

√3,807.65 × (1 + (204 − 2 × 9 × 36 + 8 × 9²)/(8 × (204 − 8 × (4.5)²)))^{1/2} = 78.23.

It follows that (using t_{n−2} = t_6) a 95% prediction interval for the year 9
expenditure (in £000s) is:

877.79 ± 2.447 × 78.23 ⇒ (686.36, 1,069.22).
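
The regression ANOVA quantities in part (b) can be reconstructed from the same
summaries. The sketch below assumes the fitted values quoted above and checks the sums
of squares, the F statistic and R²; the figures agree with the table up to the rounding
of β̂1.

import numpy as np

n, xbar, ybar, sum_x2, sum_y2 = 8, 4.5, 434, 204, 1938174
b1, sigma2 = 98.62, 3807.65

sxx = sum_x2 - n * xbar**2
total_ss = sum_y2 - n * ybar**2          # total sum of squares about ybar
reg_ss = b1**2 * sxx                     # regression sum of squares
res_ss = (n - 2) * sigma2                # residual sum of squares

f_stat = reg_ss / (res_ss / (n - 2))
r2 = reg_ss / total_ss
print(f"Total SS = {total_ss:.0f}, Regression SS = {reg_ss:.0f}, Residual SS = {res_ss:.0f}")
print(f"F = {f_stat:.1f}, R^2 = {r2:.3f}")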

4. We first note E(Y) = β0 + β1 E(X) and Y − E(Y) = (X − E(X))β1 + ε. Hence:

Cov(X, Y) = E((X − E(X))(Y − E(Y)))
          = E((X − E(X))((X − E(X))β1 + ε))
          = β1 E((X − E(X))²) + E((X − E(X))ε)
          = β1 Var(X)

since ε has zero mean and is independent of X, so that E((X − E(X))ε) = 0.

Therefore, β1 = Cov(X, Y)/Var(X). The second equality follows from the fact that
Corr(X, Y) = Cov(X, Y)/(Var(X) Var(Y))^{1/2}.

Also, note that the first equality resembles the estimator:

β̂1 = Σ_i (x_i − x̄)(y_i − ȳ) / Σ_i (x_i − x̄)²

although in the simple linear regression model y = β0 + β1 x + ε, x is assumed to be


fixed (to make the inference easier). Otherwise βb0 and βb1 are no longer linear
estimators, for example. The second equality reinforces the fact that β1 > 0 if and
only if x and y are positively correlated.
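
As an illustration of the analogy drawn above (and not part of the solution), the
least squares slope computed from any data set equals the sample covariance of x and y
divided by the sample variance of x. The data below are made up purely for
demonstration.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 1.5 * x + rng.normal(scale=2.0, size=50)   # hypothetical data

sxy = np.sum((x - x.mean()) * (y - y.mean()))
sxx = np.sum((x - x.mean()) ** 2)
b1_ls = sxy / sxx                                    # least squares slope

# Sample covariance divided by sample variance gives the same number
b1_cov = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

print(np.isclose(b1_ls, b1_cov))    # True: the (n - 1) factors cancel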

Appendix H
Formula sheet in the summer examination

Simple linear regression


Model: y_i = β0 + β1 x_i + ε_i.

LSEs: β̂0 = ȳ − β̂1 x̄ and β̂1 = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / Σ_{j=1}^{n} (x_j − x̄)²,

and:

Var(β̂0) = σ² Σ_{j=1}^{n} x_j² / (n Σ_{i=1}^{n} (x_i − x̄)²),
Var(β̂1) = σ² / Σ_{i=1}^{n} (x_i − x̄)²,
Cov(β̂0, β̂1) = −σ² x̄ / Σ_{i=1}^{n} (x_i − x̄)².

Estimator for the variance of ε_i: σ̂² = Σ_{i=1}^{n} (y_i − β̂0 − β̂1 x_i)²/(n − 2).

Regression ANOVA:

Total SS = Σ_{i=1}^{n} (y_i − ȳ)², Regression SS = β̂1² Σ_{i=1}^{n} (x_i − x̄)² and
Residual SS = Σ_{i=1}^{n} (y_i − β̂0 − β̂1 x_i)².

Squared regression correlation coefficients:

R² = Regression SS / Total SS and R²_adj = 1 − [(Residual SS)/(n − 2)] / [(Total SS)/(n − 1)].

For a given x, the expectation of y is µ(x) = β0 + β1 x. A 100(1 − α)% confidence
interval for µ(x) is:

β̂0 + β̂1 x ± t_{α/2, n−2} × σ̂ × ( Σ_{i=1}^{n} (x_i − x)² / (n Σ_{j=1}^{n} (x_j − x̄)²) )^{1/2}

and a 100(1 − α)% prediction interval covering y with probability (1 − α) is:

β̂0 + β̂1 x ± t_{α/2, n−2} × σ̂ × ( 1 + Σ_{i=1}^{n} (x_i − x)² / (n Σ_{j=1}^{n} (x_j − x̄)²) )^{1/2}.
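
The two intervals above translate directly into a short function. The sketch below is
illustrative only (the function and variable names are not part of the formula sheet)
and assumes raw data vectors x and y are available.

import numpy as np
from scipy import stats

def regression_intervals(x, y, x0, alpha=0.05):
    """Confidence interval for mu(x0) and prediction interval for y at x0."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    sigma_hat = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))

    centre = b0 + b1 * x0
    ratio = np.sum((x - x0) ** 2) / (n * np.sum((x - x.mean()) ** 2))
    t = stats.t.ppf(1 - alpha / 2, n - 2)

    ci = (centre - t * sigma_hat * np.sqrt(ratio), centre + t * sigma_hat * np.sqrt(ratio))
    pi = (centre - t * sigma_hat * np.sqrt(1 + ratio), centre + t * sigma_hat * np.sqrt(1 + ratio))
    return ci, pi

# Toy example with made-up data
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([12.0, 19.5, 28.1, 34.9, 41.2, 50.3, 55.8, 64.7])
print(regression_intervals(x, y, x0=9))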


One-way ANOVA:

Total variation: Σ_{j=1}^{k} Σ_{i=1}^{n_j} (X_{ij} − X̄)² = Σ_{j=1}^{k} Σ_{i=1}^{n_j} X_{ij}² − nX̄².

Between-treatments variation: B = Σ_{j=1}^{k} n_j (X̄_{·j} − X̄)² = Σ_{j=1}^{k} n_j X̄_{·j}² − nX̄².

Within-treatments variation: W = Σ_{j=1}^{k} Σ_{i=1}^{n_j} (X_{ij} − X̄_{·j})² = Σ_{j=1}^{k} Σ_{i=1}^{n_j} X_{ij}² − Σ_{j=1}^{k} n_j X̄_{·j}².

Two-way ANOVA:

Total variation: Σ_{i=1}^{r} Σ_{j=1}^{c} (X_{ij} − X̄)² = Σ_{i=1}^{r} Σ_{j=1}^{c} X_{ij}² − rcX̄².

Between-blocks (rows) variation: B_row = c Σ_{i=1}^{r} (X̄_{i·} − X̄)² = c Σ_{i=1}^{r} X̄_{i·}² − rcX̄².

Between-treatments (columns) variation: B_col = r Σ_{j=1}^{c} (X̄_{·j} − X̄)² = r Σ_{j=1}^{c} X̄_{·j}² − rcX̄².

Residual (error) variation:

Σ_{i=1}^{r} Σ_{j=1}^{c} (X_{ij} − X̄_{i·} − X̄_{·j} + X̄)² = Σ_{i=1}^{r} Σ_{j=1}^{c} X_{ij}² − c Σ_{i=1}^{r} X̄_{i·}² − r Σ_{j=1}^{c} X̄_{·j}² + rcX̄².
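
A quick numerical check of the two-way decomposition on a small made-up table (values
chosen arbitrarily): the four quantities should satisfy Total = B_row + B_col + Residual.

import numpy as np

# Hypothetical 3 x 4 table of observations (rows = blocks, columns = treatments)
X = np.array([[12.0, 15.0, 14.0, 11.0],
              [13.0, 18.0, 16.0, 12.0],
              [10.0, 14.0, 13.0,  9.0]])
r, c = X.shape
grand = X.mean()
row_means = X.mean(axis=1)      # the X-bar_{i.} values
col_means = X.mean(axis=0)      # the X-bar_{.j} values

total = np.sum((X - grand) ** 2)
b_row = c * np.sum((row_means - grand) ** 2)
b_col = r * np.sum((col_means - grand) ** 2)
resid = np.sum((X - row_means[:, None] - col_means[None, :] + grand) ** 2)

print(np.isclose(total, b_row + b_col + resid))   # True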

lse.ac.uk/statistics

Department of Statistics
The London School of Economics
and Political Science
Houghton Street
London WC2A 2AE
Email: [email protected]
Telephone: +44 (0)20 7852 3709

The London School of Economics and Political Science is a School of the University of London. It is a
charity and is incorporated in England as a company limited by guarantee under the Companies Acts
(Reg no 70527).

The School seeks to ensure that people are treated equitably, regardless of age, disability, race,
nationality, ethnic or national origin, gender, religion, sexual orientation or personal circumstances.
