WT ST102
Course pack
Dr James Abdey
lse.ac.uk/statistics
The author asserts copyright over all material in this course guide except where
otherwise indicated. All rights reserved. No part of this work may be reproduced in any
form, or by any means, without permission in writing from the author.
Contents
7 Point estimation
 7.1 Synopsis of chapter
 7.2 Learning outcomes
 7.3 Introduction
 7.4 Estimation criteria: bias, variance and mean squared error
 7.5 Method of moments (MM) estimation
 7.6 Least squares (LS) estimation
 7.7 Maximum likelihood (ML) estimation
 7.8 Asymptotic distribution of MLEs

8 Interval estimation
 8.1 Synopsis of chapter
 8.2 Learning outcomes
 8.3 Introduction
 8.4 Interval estimation for means of normal distributions
  8.4.1 An important property of normal samples
 8.5 Approximate confidence intervals
  8.5.1 Means of non-normal distributions
  8.5.2 MLE-based confidence intervals
 8.6 Use of the chi-squared distribution
 8.7 Interval estimation for variances of normal distributions
 8.8 Overview of chapter
 8.9 Key terms and concepts

9 Hypothesis testing
 9.1 Synopsis of chapter
 9.2 Learning outcomes
 9.3 Introduction
 9.4 Introductory examples
 9.5 Setting p-value, significance level, test statistic
  9.5.1 General setting of hypothesis tests
  9.5.2 Statistical testing procedure
  9.5.3 Two-sided tests for normal means
  9.5.4 One-sided tests for normal means
 9.6 t tests
 9.7 General approach to statistical tests
 9.8 Two types of error
 9.9 Tests for variances of normal distributions
 9.10 Summary: tests for µ and σ² in N(µ, σ²)
 9.11 Comparing two normal means with paired observations
  9.11.1 Power functions of the test
 9.12 Comparing two normal means
  9.12.1 Tests on µX − µY with known σ²X and σ²Y
  9.12.2 Tests on µX − µY with σ²X = σ²Y but unknown
 9.13 Tests for correlation coefficients
  9.13.1 Tests for correlation coefficients
 9.14 Tests for the ratio of two normal variances
 9.15 Summary: tests for two normal distributions
 9.16 Overview of chapter
 9.17 Key terms and concepts

11 Linear regression
 11.1 Synopsis of chapter
 11.2 Learning outcomes
 11.3 Introduction
 11.4 Introductory examples
 11.5 Simple linear regression
 11.6 Inference for parameters in normal regression models
 11.7 Regression ANOVA
 11.8 Confidence intervals for E(y)
 11.9 Prediction intervals for y
 11.10 Multiple linear regression models
 11.11 Regression using R
 11.12 Overview of chapter
 11.13 Key terms and concepts
Chapter 6
Sampling distributions of statistics
By the end of this chapter, you should be able to:

prove and apply the results for the mean and variance of the sampling distribution
of the sample mean when a random sample is drawn with replacement
state the central limit theorem and recall when the limit is likely to provide a good
approximation to the distribution of the sample mean.
6.3 Introduction
Suppose we have a sample of n observations of a random variable X:
{X1 , X2 , . . . , Xn }.
We use f (x) to denote both the pdf of a continuous random variable, and the pf of
a discrete random variable.
The parameter(s) of a distribution are generally denoted as θ. For example, for the
Poisson distribution θ stands for λ, and for the normal distribution θ stands for (µ, σ²).
Parameters are often included in the notation: f (x; θ) denotes the pf/pdf of a
distribution with parameter(s) θ, and F (x; θ) is its cdf.
For simplicity, we may often use phrases like ‘distribution f (x; θ)’ or ‘distribution
F (x; θ)’ when we mean ‘distribution with the pf/pdf f (x; θ)’ and ‘distribution with the
cdf F (x; θ)’, respectively.
The simplest assumption about the joint distribution of the sample is that the
observations X₁, X₂, …, Xₙ are independent and identically distributed (IID) — that is,
the sample is a random sample. We will assume this most of the time from now on. So
you will see many examples and questions which begin something like: 'Let
{X₁, X₂, …, Xₙ} be a random sample from the distribution f(x; θ) …'.
Not all problems can be seen as IID random samples of a single random variable. There
are other possibilities, which you will see more of in the future.
6.5 Statistics and their sampling distributions

Examples of statistics include the sample mean X̄ = Σᵢ₌₁ⁿ Xᵢ/n, the sample variance
S² = Σᵢ₌₁ⁿ (Xᵢ − X̄)²/(n − 1) and the standard deviation S = √S², and the sample
maximum max X.
Here we focus on single (univariate) statistics. More generally, we could also consider
vectors of statistics, i.e. multivariate statistics.
Here is one random sample of size n = 20 from X ∼ N(5, 1) (with values rounded to 2 decimal places):
6.28 5.22 4.19 3.56 4.15 4.11 4.03 5.81 5.43 6.09
4.98 4.11 5.55 3.95 4.97 5.68 5.66 3.37 4.98 6.58
For this random sample, the values of our statistics are:
x̄ = 4.94, s² = 0.90 and max x = 6.58.
Here is another such random sample (with values rounded to 2 decimal places):
5.44 6.14 4.91 5.63 3.89 4.17 5.79 5.33 5.09 3.90
5.47 6.62 6.43 5.84 6.19 5.63 3.61 5.49 4.55 4.27
For this sample, the values of our statistics are (again to 2 decimal places):

x̄ = 5.22, s² = 0.80 and max x = 6.62.
The sampling distribution of a statistic is the distribution of the values of the statistic
in (infinitely) many repeated samples. However, typically we only have one sample
which was actually observed. Therefore, the sampling distribution seems like an
essentially hypothetical concept.
Nevertheless, it is possible to derive the forms of sampling distributions of statistics
under different assumptions about the sampling schemes and population distribution
f (x; θ).
There are two main ways of doing this.
Example 6.3 Consider again a random sample of size n = 20 from the population
X ∼ N(5, 1), and the statistics X̄, S² and max X.
One way to study their sampling distributions is by simulation: we generate 10,000
independent random samples of size 20 from this population and compute the three
statistics for each sample. Figures 6.1, 6.2 and 6.3 show histograms of the statistics
for these 10,000 random samples.
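The following R sketch illustrates this simulation approach (a minimal sketch: the seed and plotting details are our own illustrative choices, with 10,000 replications as stated above):

set.seed(1)                          # any seed will do; chosen for reproducibility
nrep <- 10000                        # number of simulated samples, as in the text
n <- 20
xbar <- s2 <- mx <- numeric(nrep)
for (r in 1:nrep) {
  x <- rnorm(n, mean = 5, sd = 1)    # one random sample from N(5, 1)
  xbar[r] <- mean(x); s2[r] <- var(x); mx[r] <- max(x)
}
hist(xbar, freq = FALSE)             # compare with the N(5, 1/20) pdf
curve(dnorm(x, 5, sqrt(1/20)), add = TRUE)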
We now consider deriving the exact sampling distribution. Here this is possible. For
a random sample of size n from N(µ, σ²) we have:

X̄ ∼ N(µ, σ²/n), (n − 1)S²/σ² ∼ χ²ₙ₋₁, and max X has cdf (FX(x))ⁿ and pdf n(FX(x))ⁿ⁻¹ fX(x)

where FX(x) and fX(x) are the cdf and pdf of X ∼ N(µ, σ²), respectively. (The χ²
distribution is introduced later in this chapter.)
Curves of the densities of these distributions are also shown in Figures 6.1, 6.2 and 6.3.
[Figure 6.1: histogram of the simulated sample means. Figure 6.2: histogram of the simulated sample variances. Figure 6.3: histogram of the simulated maximum values. In each case the exact density is superimposed.]

6.6 Sample mean from a normal population

For independent random variables X₁, X₂, …, Xₙ and constants a₁, a₂, …, aₙ, we have:

E(Σᵢ₌₁ⁿ aᵢXᵢ) = Σᵢ₌₁ⁿ aᵢE(Xᵢ)

and:

Var(Σᵢ₌₁ⁿ aᵢXᵢ) = Σᵢ₌₁ⁿ aᵢ²Var(Xᵢ).
For a random sample, all the Xᵢs are independent, and E(Xᵢ) = E(X) is the same for all
of them, since the Xᵢs are identically distributed. X̄ = Σᵢ₌₁ⁿ Xᵢ/n is of the form
Σᵢ₌₁ⁿ aᵢXᵢ, with aᵢ = 1/n for all i = 1, 2, …, n.
Therefore:

E(X̄) = Σᵢ₌₁ⁿ (1/n)E(X) = n × (1/n)E(X) = E(X)

and:

Var(X̄) = Σᵢ₌₁ⁿ (1/n²)Var(X) = n × (1/n²)Var(X) = Var(X)/n.
So the mean and variance of X̄ are E(X) and Var(X)/n, respectively, for a random
sample from any population distribution of X. What about the form of the sampling
distribution of X̄?
This depends on the distribution of X, and is not generally known. However, when the
distribution of X is normal, we do know that the sampling distribution of X̄ is also
normal.
Suppose that {X₁, X₂, …, Xₙ} is a random sample from a normal distribution with
mean µ and variance σ². Then:

X̄ ∼ N(µ, σ²/n).
For example, the pdf drawn on the histogram in Figure 6.1 is that of N (5, 1/20).
We have E(X̄) = E(X) = µ.
We also have Var(X̄) = Var(X)/n = σ²/n, and hence sd(X̄) = σ/√n.
More interestingly, the sampling variance gets smaller when the sample size n
increases.
Example 6.4 Suppose that the heights (in cm) of men (aged over 16) in a
population follow a normal distribution with some unknown mean µ and a known
standard deviation of 7.39.
[Figure 6.4: pdfs of X̄ for sample sizes n = 5, 20 and 100, showing the sampling distribution concentrating around µ as n increases.]
We plan to select a random sample of n men from the population, and measure their
heights. How large should n be so that there is a probability of at least 0.95 that the
sample mean X̄ will be within 1 cm of the population mean µ?
Here X ∼ N(µ, (7.39)²), so X̄ ∼ N(µ, (7.39/√n)²). What we need is the smallest n
such that:

P(|X̄ − µ| ≤ 1) ≥ 0.95.

So:

P(|X̄ − µ| ≤ 1) = P(−1 ≤ X̄ − µ ≤ 1)
= P(−1/(7.39/√n) ≤ (X̄ − µ)/(7.39/√n) ≤ 1/(7.39/√n))
= P(−√n/7.39 ≤ Z ≤ √n/7.39) ≥ 0.95

which holds if and only if:

P(Z > √n/7.39) ≤ 0.05/2 = 0.025

where Z ∼ N(0, 1). From Table 3 of Murdoch and Barnes' Statistical Tables, we see
that the smallest z which satisfies P(Z > z) < 0.025 is z = 1.97. Therefore:

√n/7.39 ≥ 1.97 ⇔ n ≥ (7.39 × 1.97)² = 211.9.
Therefore, n should be at least 212.
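As a check, the same calculation can be sketched in R (note that R's exact normal percentile is 1.96 to 2 decimal places, whereas the printed tables round up to 1.97):

sigma <- 7.39
z <- qnorm(0.975)            # 1.959964..., the exact top 2.5% point of N(0, 1)
(sigma * z)^2                # about 209.8, so n = 210 with the exact z
ceiling((sigma * 1.97)^2)    # 212, matching the calculation from the tables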
6.7 The central limit theorem

The central limit theorem (CLT) states that if {X₁, X₂, …, Xₙ} is a random sample
from (almost) any population with mean µ and finite variance σ², then for large n,
approximately:

X̄ ∼ N(µ, σ²/n).
It may appear that the CLT is still somewhat limited, in that it applies only to sample
means calculated from random (IID) samples. However, this is not really true, for two
main reasons.
There are more general versions of the CLT which do not require the observations
Xi to be IID.
Even the basic version applies very widely, when we realise that the ‘X’ can also be
a function of the original variables in the data. For example, if X and Y are
random variables in the sample, we can also apply the CLT to:
Σᵢ₌₁ⁿ ln(Xᵢ)/n or Σᵢ₌₁ⁿ XᵢYᵢ/n.
Therefore, the CLT can also be used to derive sampling distributions for many statistics
which do not initially look at all like X̄ for a single random variable in an IID sample.
You may get to do this in future courses.
The larger the sample size n, the better the normal approximation provided by the CLT
is. In practice, we have various rules-of-thumb for what is ‘large enough’ for the
approximation to be ‘accurate enough’. This also depends on the population
distribution of Xᵢ. For example:

Example 6.5 In the first case, we simulate 10,000 independent random samples of
sizes n = 1, 5, 10, 30, 100 and 1,000 from the Exp(0.25) distribution (for which
µ = 4 and σ² = 16). This is clearly a skewed distribution, as shown by the
histogram for n = 1 in Figure 6.5.
10,000 independent random samples of each size were generated. Histograms of the
values of X̄ in these random samples are shown in Figure 6.5. Each plot also shows
the pdf of the approximating normal distribution, N (4, 16/n). The normal
approximation is reasonably good already for n = 30, very good for n = 100, and
practically perfect for n = 1,000.
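A minimal R sketch of this experiment for a single sample size (our own illustrative code; change n to reproduce the different panels):

set.seed(1)
nrep <- 10000
n <- 30                                    # try n = 1, 5, 10, 30, 100, 1000
xbar <- replicate(nrep, mean(rexp(n, rate = 0.25)))   # Exp(0.25): mu = 4, sigma^2 = 16
hist(xbar, freq = FALSE)
curve(dnorm(x, mean = 4, sd = sqrt(16/n)), add = TRUE)  # CLT approximation N(4, 16/n)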
Example 6.6 In the second case, we simulate 10,000 independent random samples
of sizes:
n = 1, 10, 30, 50, 100 and 1,000
from the Bernoulli(0.2) distribution (for which µ = 0.2 and σ 2 = 0.16).
Here the distribution of Xi itself is not even continuous, and has only two possible
values, 0 and 1. Nevertheless, the sampling distribution of X̄ can be very
well-approximated by the normal distribution, when n is large enough.
Note that since here Xᵢ = 1 or Xᵢ = 0 for all i, X̄ = Σᵢ₌₁ⁿ Xᵢ/n = m/n, where m is the
number of observations for which Xᵢ = 1. In other words, X̄ is the sample
proportion of the value X = 1.
Figure 6.5: Sampling distributions of X̄ for various n when sampling from the Exp(0.25)
distribution.
The normal approximation is clearly very bad for small n, but reasonably good
already for n = 50, as shown by the histograms in Figure 6.6.
6.8 Some common sampling distributions

For random samples from normal distributions, exact sampling distribution results are
available: for example, (n − 1)S²/σ² ∼ χ²ₙ₋₁ and √n(X̄ − µ)/S ∼ tₙ₋₁ and, for two
independent samples {X₁, …, Xₙ} and {Y₁, …, Yₘ} from normal populations with equal
variances:

S²X/S²Y ∼ Fₙ₋₁, ₘ₋₁.
Here 'χ²', 't' and 'F' refer to three new families of probability distributions: the
chi-squared, Student's t and F distributions.
Figure 6.6: Sampling distributions of X̄ for various n when sampling from the
Bernoulli(0.2) distribution.
These are not often used as distributions of individual variables. Instead, they are used
as sampling distributions for various statistics. Each of them arises from the normal
distribution in a particular way. We will now briefly introduce their main properties.
This is in preparation for statistical inference, where the uses of these distributions will
be discussed at length.
If Z₁, Z₂, …, Zₖ are independent N(0, 1) random variables, then
X = Z₁² + Z₂² + ⋯ + Zₖ² has a chi-squared distribution with k degrees of freedom,
denoted X ∼ χ²ₖ. The χ²ₖ distribution is a continuous distribution, which takes values
x ≥ 0. Its mean and variance are:

E(X) = k and Var(X) = 2k.
[Figure 6.7: pdfs of the χ²ₖ distribution for k = 1 and 2 (left panel) and k = 10 and 20 (right panel).]
In exercises and the examination, you will need a table of some probabilities for the χ2
distribution. Table 8 of Murdoch and Barnes’ Statistical Tables shows the following
information.
The rows correspond to different degrees of freedom k (denoted in the table by ν).
The table shows values of k up to 100.
The numbers in the table are values of x such that P (X > x) = α for the k and α
in that row and column.
Example 6.7 Consider two numbers in the 'ν = 5' row, the 2.675 in the 'α = 0.75'
column and the 3.000 in the 'α = 0.70' column. These mean that for X ∼ χ²₅ we
have:

P(X > 2.675) = 0.75 and P(X > 3.000) = 0.70.

These also provide bounds for probabilities of other values. For example, since 2.8 is
between 2.675 and 3.000, we can conclude that:

0.70 < P(X > 2.8) < 0.75.
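These table look-ups can be checked in R (a sketch; R computes the probabilities exactly rather than from tables):

pchisq(2.675, df = 5, lower.tail = FALSE)  # approximately 0.75
pchisq(3.000, df = 5, lower.tail = FALSE)  # approximately 0.70
pchisq(2.8, df = 5, lower.tail = FALSE)    # lies between 0.70 and 0.75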
The ways in which this table may be used in statistical inference will be explained in
later chapters.
Suppose Z ∼ N (0, 1), X ∼ χ2k , and Z and X are independent. The distribution of
the random variable:
T = Z/√(X/k)
is the t distribution with k degrees of freedom. This is denoted T ∼ tk or
T ∼ t(k). The distribution is also known as ‘Student’s t distribution’.
Its pdf f(x) is positive for all −∞ < x < ∞. Examples of f(x) for different k are shown
in Figure 6.8. (Note that the formula of the pdf of tₖ is not examinable.)
From Figure 6.8, we see that the tₖ pdf is symmetric around 0 and bell-shaped like the
standard normal pdf, but with heavier tails; as k increases, it gets closer to the N(0, 1)
pdf.

[Figure 6.8: pdfs of tₖ for k = 1, 3, 8 and 20, together with the N(0, 1) pdf.]
We have E(T) = 0 for k > 1, and:

Var(T) = k/(k − 2) for k > 2.
This means that for t1 neither E(T ) nor Var(T ) exist, and for t2 , Var(T ) does not exist.
In exercises and the examination, you will need a table of some probabilities for the t
distribution. Table 7 of Murdoch and Barnes’ Statistical Tables shows the following
information.
The rows correspond to different degrees of freedom k (denoted in the table by ν).
The table shows values of k up to 120, and then ‘∞’, which is N (0, 1).
If you need a tk distribution for which k is not in the table, use the nearest value or
use interpolation.
The columns correspond to the right-tail probability P (T > t) = α, where T ∼ tk ,
for α = 0.10, 0.05, . . . , 0.0005.
The numbers in the table are values of t such that P (T > t) = α for the k and α in
that row and column.
Example 6.8 Consider the number 2.132 in the 'ν = 4' row, and the 'α = 0.05'
column. This means that for T ∼ t₄ we have:

P(T > 2.132) = 0.05.
The table also provides bounds for other probabilities. For example, the number in
the ‘α = 0.025’ column is 2.776, so P (T > 2.776) = 0.025. Since 2.132 < 2.5 < 2.776,
we know that 0.025 < P (T > 2.5) < 0.05.
Results for left-tail probabilities P (T < t) = α can also be obtained, because the t
distribution is symmetric around 0. This means that P (T < t) = P (T > −t). For
example:
P (T < −2.132) = P (T > 2.132) = 0.05
and P (T < −2.5) < 0.05 since P (T > 2.5) < 0.05.
This is the same trick we used for the standard normal distribution.
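Again, these values can be checked in R (a sketch using the exact t distribution functions):

qt(0.05, df = 4, lower.tail = FALSE)   # 2.132, the tabulated value
pt(2.5, df = 4, lower.tail = FALSE)    # lies between 0.025 and 0.05
pt(-2.132, df = 4)                     # 0.05, by the symmetry argument above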
Let U and V be two independent random variables, where U ∼ χ2p and V ∼ χ2k .
The distribution of:
F = (U/p)/(V/k)
is the F distribution with degrees of freedom (p, k), denoted F ∼ Fp, k or
F ∼ F (p, k).
[Figure 6.9: pdfs of the Fₚ,ₖ distribution for degrees of freedom (p, k) = (10, 3), (10, 10) and (10, 50).]

6.9 Prelude to statistical inference
These questions are difficult to study in a laboratory, and admit no self-evident axioms.
Statistics provides a way of answering these types of questions using data.
What should we learn in 'Statistics'? The basic ideas, methods and theory. Some
guidelines for learning/applying statistics are the following.
Understand what data say in each specific context. All the methods are just tools
to help us to understand data.
It may take a while to catch the basic idea of statistics – keep thinking!
Example 6.9 A new type of tyre was designed to increase its lifetime. The
manufacturer tested 120 new tyres and obtained the average lifetime (over these 120
tyres) of 35,391 miles. So the manufacturer claims that the mean lifetime of new
tyres is 35,391 miles.
Example 6.10 A newspaper sampled 1,000 potential voters, and 350 of them were
Labour Party supporters. It claims that the proportion of Labour voters in the
whole country is 350/1,000 = 0.35, i.e. 35%.
In both cases, the conclusion is drawn on a population (i.e. all the objects concerned)
based on the information from a sample (i.e. a subset of the population).
In Example 6.9, it is impossible to measure the whole population. In Example 6.10, it is
not economical to measure the whole population. Therefore, errors are inevitable!
The population is the entire set of objects concerned, and these objects are typically
represented by some numbers. We do not know the entire population in practice.
In Example 6.9, the population consists of the lifetimes of all tyres, including those to
be produced in the future. For the opinion poll in Example 6.10, the population consists
of many ‘1’s and ‘0’s, where each ‘1’ represents a voter for the Labour party, and each
‘0’ represents a voter for other parties.
A sample is a (randomly) selected subset of a population, and is known in practice. The
population is unknown. We represent a population by a probability distribution.
Why do we need a model for the entire population?
Because the questions we ask concern the entire population, not just the data we
have. Having a model for the population tells us that the remaining population is
not much different from our data or, in other words, that the data are
representative of the population.
Because the process of drawing a sample from a population is a bit like the process
of generating random variables. A different sample would produce different values.
Therefore, the population from which we draw a random sample is represented as a
probability distribution.
Example 6.11 Continuing with Example 6.9, the population may be assumed to
be N(µ, σ²) with θ = (µ, σ²), where µ is the 'true' lifetime.
Let:

X = the lifetime of a tyre

then we can write X ∼ N(µ, σ²).
Example 6.12 For the opinion poll in Example 6.10, let X take the value 1 for a
Labour voter and 0 otherwise. Then:

P(X = 1) = P(a Labour voter) = π

and:

P(X = 0) = P(a non-Labour voter) = 1 − π

where:

π ∈ [0, 1] is the (unknown) proportion of Labour voters in the population.
Example 6.13 For the tyre lifetime in Example 6.9, suppose the realised sample
(of size n = 120) gives the sample mean:
x̄ = (1/n) Σᵢ₌₁ⁿ xᵢ = 35,391.
Is the sample mean X̄ a good estimator of the unknown ‘true’ lifetime µ? Obviously,
we cannot use the real number 35,391 to assess how good this estimator is, as a different
sample may give a different average value, such as 36,721.
By treating {X1 , X2 , . . . , Xn } as random variables, X̄ is also a random variable. If the
distribution of X̄ concentrates closely around (unknown) µ, X̄ is a good estimator of µ.
Definition of a statistic
Any known function of a random sample is called a statistic. Statistics are used for
statistical inference such as estimation and testing.
Suppose X, the number of lectures (out of 20) which a student attends, follows a
Bin(20, π) distribution, i.e.:

P(X = x) = (20!/(x!(20 − x)!)) πˣ(1 − π)²⁰⁻ˣ for x = 0, 1, 2, …, 20

and 0 otherwise.
Some probability questions are as follows. Treating π as known:
what is P (X < 10) (the proportion of students attending fewer than half of the
lectures)?
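Such probabilities are straightforward to compute once π is specified. A sketch in R, where the value π = 0.6 is purely hypothetical:

p <- 0.6                            # hypothetical value of pi, for illustration only
pbinom(9, size = 20, prob = p)      # P(X < 10) = P(X <= 9)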
Chapter 7
Point estimation
By the end of this chapter, you should be able to:

find estimators using the method of moments, least squares and maximum
likelihood.
7.3 Introduction
The basic setting is that we assume a random sample {X1 , X2 , . . . , Xn } is observed from
a population F (x; θ). The goal is to make inference (i.e. estimation or testing) for the
unknown parameter(s) θ.
We call µ̂ = X̄ a point estimator (or simply an estimator) of µ.
For example, if we have an observed sample of 9, 16, 15, 4 and 12, hence of size
n = 5, the sample mean is:

µ̂ = (9 + 16 + 15 + 4 + 12)/5 = 11.2.

The value 11.2 is a point estimate of µ. For an observed sample of 15, 16, 10, 8
and 9, we obtain µ̂ = 11.6.
Bias of an estimator

The bias of an estimator θ̂ of θ is defined as Bias(θ̂) = E(θ̂) − θ. An estimator is:

unbiased if E(θ̂) − θ = 0, i.e. if E(θ̂) = θ.
Variance of an estimator

Recall that, for the sample mean:

Var(X̄) = σ²/n. (7.2)
It is clear that in (7.2) increasing the sample size n decreases the estimator’s variance
(and hence the standard error, i.e. the square root of the estimator’s variance), therefore
increasing the precision of the estimator.2 We conclude that variance is also a ‘bad’
thing so, other things being equal, the smaller an estimator’s variance the better.
Estimator properties
Is µ̂ = X̄ a 'good' estimator of µ?
Intuitively, X₁ or (X₁ + X₂ + X₃)/3 would not be good enough as estimators of µ.
However, can we use other estimators such as the sample median:

µ̂₁ = X((n+1)/2) for odd n, and µ̂₁ = (X(n/2) + X(n/2+1))/2 for even n?
² Remember, however, that this increased precision comes at a cost – namely the increased expenditure on data collection.
θ is unknown, and the value of θ̂ changes with the observed sample.
Intuitively, MAD is a more appropriate measure for the error in estimation. However, it
is technically less convenient since the function h(x) = |x| is not differentiable at x = 0.
Therefore, the MSE is used more often.
If E(θ̂²) < ∞, it holds that:

MSE(θ̂) = Var(θ̂) + (Bias(θ̂))²

where Bias(θ̂) = E(θ̂) − θ.
Proof:

MSE(θ̂) = E((θ̂ − θ)²)
= E(((θ̂ − E(θ̂)) + (E(θ̂) − θ))²)
= E((θ̂ − E(θ̂))²) + E((E(θ̂) − θ)²) + 2E((θ̂ − E(θ̂))(E(θ̂) − θ))
= Var(θ̂) + (Bias(θ̂))² + 2(E(θ̂) − E(θ̂))(E(θ̂) − θ)
= Var(θ̂) + (Bias(θ̂))² + 0.
We have already established that both bias and variance of an estimator are ‘bad’
things, so the MSE (being the sum of a bad thing and a bad thing squared) can also be
viewed as a ‘bad’ thing.3 Hence when faced with several competing estimators, we
prefer the estimator with the smallest MSE.
So, although an unbiased estimator is intuitively appealing, it is perfectly possible that
a biased estimator might be preferred if the ‘cost’ of the bias is offset by a substantial
reduction in variance. Hence the MSE provides us with a formal criterion to assess the
trade-off between the bias and variance of different estimators of the same parameter.
E(T1 ) = E(X̄) = µ
and:
Var(T₁) = Var(X̄) = σ²/n.
Hence T1 is an unbiased estimator of µ. So the MSE of T1 is just the variance of T1 ,
since the bias is 0. Therefore, MSE(T1 ) = σ 2 /n.
Moving to T₂, note:

E(T₂) = E((X₁ + Xₙ)/2) = (E(X₁) + E(Xₙ))/2 = (µ + µ)/2 = µ

and:

Var(T₂) = (Var(X₁) + Var(Xₙ))/2² = 2σ²/4 = σ²/2.

So T₂ is also an unbiased estimator of µ, hence MSE(T₂) = σ²/2.
Finally, consider T₃ = X̄ + 3, noting:

E(T₃) = E(X̄ + 3) = E(X̄) + 3 = µ + 3

and:

Var(T₃) = Var(X̄ + 3) = Var(X̄) = σ²/n.

So T₃ is a positively-biased estimator of µ, with a bias of 3. Hence we have
MSE(T₃) = σ²/n + 3² = σ²/n + 9.
We seek the estimator with the smallest MSE. Clearly, MSE(T₁) < MSE(T₃) so we
can eliminate T₃. Now comparing T₁ with T₂, we note that
MSE(T₁) = σ²/n ≤ σ²/2 = MSE(T₂) for all n ≥ 2, so T₁ (the sample mean) is the
preferred estimator of µ.
³ Or, for that matter, a 'very bad' thing!
i. µ̂ = X̄ is a better estimator of µ than X₁ as:

MSE(µ̂) = σ²/n < MSE(X₁) = σ².
ii. As n → ∞, MSE(X̄) → 0, i.e. when the sample size tends to infinity, the error in
estimation goes to 0. Such an estimator is called a (mean-square) consistent
estimator.
Consistency is a reasonable requirement. It may be used to rule out some silly
estimators.
For µ̃ = (X1 + X4 )/2, MSE(µ̃) = σ 2 /2 which does not converge to 0 as n → ∞.
This is due to the fact that only a small portion of information (i.e. X1 and X4 )
is used in the estimation.
iii. For any random sample {X1 , X2 , . . . , Xn } from a population with mean µ and
variance σ 2 , it holds that E(X̄) = µ and Var(X̄) = σ 2 /n. The derivation of the
expected value and variance of the sample mean was covered in Chapter 6.
iv. For any independent random variables Y₁, Y₂, …, Yₖ and constants a₁, a₂, …, aₖ,
we have:

E(Σᵢ₌₁ᵏ aᵢYᵢ) = Σᵢ₌₁ᵏ aᵢE(Yᵢ) and Var(Σᵢ₌₁ᵏ aᵢYᵢ) = Σᵢ₌₁ᵏ aᵢ²Var(Yᵢ).
Example 7.5 Bias by itself cannot be used to measure the quality of an estimator.
Consider two artificial estimators of θ, θb1 and θb2 , such that θb1 takes only the two
values, θ − 100 and θ + 100, and θb2 takes only the two values θ and θ + 0.2, with the
following probabilities:

P(θ̂₁ = θ − 100) = P(θ̂₁ = θ + 100) = 0.5

and:

P(θ̂₂ = θ) = P(θ̂₂ = θ + 0.2) = 0.5.
28
7.4. Estimation criteria: bias, variance and mean squared error
Note that E(θ̂₁) = θ (unbiased) while E(θ̂₂) = θ + 0.1 (biased). However:

MSE(θ̂₁) = E((θ̂₁ − θ)²) = (−100)² × 0.5 + 100² × 0.5 = 10,000

and:

MSE(θ̂₂) = E((θ̂₂ − θ)²) = 0² × 0.5 + (0.2)² × 0.5 = 0.02.
Hence θ̂₂ is a much better (i.e. more accurate) estimator of θ than θ̂₁.
In particular:

(Σᵢ₌₁ᵏ aᵢ)² = Σᵢ₌₁ᵏ aᵢ² + Σ₁≤ᵢ≠ⱼ≤ₖ aᵢaⱼ.
Hence:

Var(µ̂) = Var((1/n) Σᵢ₌₁ⁿ Xᵢ) = E(((1/n) Σᵢ₌₁ⁿ Xᵢ − µ)²)
= E(((1/n) Σᵢ₌₁ⁿ (Xᵢ − µ))²)
= (1/n²)(Σᵢ₌₁ⁿ E((Xᵢ − µ)²) + Σ₁≤ᵢ≠ⱼ≤ₙ E((Xᵢ − µ)(Xⱼ − µ)))
= (1/n²)(nσ² + Σ₁≤ᵢ≠ⱼ≤ₙ E(Xᵢ − µ)E(Xⱼ − µ))
= σ²/n

using independence, since E(Xᵢ − µ)E(Xⱼ − µ) = 0 for i ≠ j.
Hence MSE(µ̂) = MSE(X̄) = σ²/n.
Finding estimators
Let {X1 , X2 , . . . , Xn } be a random sample from a population F (x; θ). Suppose θ has
p components (for example, for a normal population N (µ, σ 2 ), p = 2; for a Poisson
population with parameter λ, p = 1).
Let:

µₖ = µₖ(θ) = E(Xᵏ)

denote the kth population moment, for k = 1, 2, …. Therefore, µₖ depends on the
unknown parameter θ, as everything else about the distribution F(x; θ) is known.
Denote the kth sample moment by:

Mₖ = (1/n) Σᵢ₌₁ⁿ Xᵢᵏ = (X₁ᵏ + X₂ᵏ + ⋯ + Xₙᵏ)/n.

The method of moments estimator (MME) θ̂ of θ is obtained by matching the first p
population and sample moments, i.e. by solving:

µₖ(θ̂) = Mₖ for k = 1, 2, …, p.
This gives us µ̂ = M₁ = X̄.
Since σ² = µ₂ − µ₁² = E(X²) − (E(X))², we have:

σ̂² = M₂ − M₁² = (1/n) Σᵢ₌₁ⁿ Xᵢ² − X̄² = (1/n) Σᵢ₌₁ⁿ (Xᵢ − X̄)².
Note we have:

E(σ̂²) = E((1/n) Σᵢ₌₁ⁿ Xᵢ² − X̄²)
= (1/n) Σᵢ₌₁ⁿ E(Xᵢ²) − E(X̄²)
= E(X²) − E(X̄²)
= (σ² + µ²) − (σ²/n + µ²)
= (n − 1)σ²/n.

Since:

E(σ̂²) − σ² = −σ²/n < 0

σ̂² is a negatively-biased estimator of σ².
The sample variance, defined as:

S² = (1/(n − 1)) Σᵢ₌₁ⁿ (Xᵢ − X̄)²

satisfies E(S²) = σ², and so is an unbiased estimator of σ².
Note the MME does not use any information on F (x; θ) beyond the moments.
The idea is that Mk should be pretty close to µk when n is sufficiently large. In fact:
Mₖ = (1/n) Σᵢ₌₁ⁿ Xᵢᵏ converges to µₖ = E(Xᵏ)
as n → ∞. This is due to the law of large numbers (LLN). We illustrate this
phenomenon by simulation using R.
Example 7.8 For N (2, 4), we have µ1 = 2 and µ2 = 8. We use the sample moments
M1 and M2 as estimators of µ1 and µ2 , respectively. Note how the sample moments
converge to the population moments as the sample size increases.
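The R code behind this illustration is not reproduced here; the following sketch (with our own choice of seed and sample sizes) shows the idea:

set.seed(1)
for (n in c(10, 100, 1000, 100000)) {
  x <- rnorm(n, mean = 2, sd = 2)        # a random sample from N(2, 4)
  cat(n, mean(x), mean(x^2), "\n")       # M1 -> 2 and M2 -> 8 as n grows
}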
7.6 Least squares (LS) estimation
The MME of µ is the sample mean X̄ = Σᵢ₌₁ⁿ Xᵢ/n.
The estimator X̄ is also the least squares estimator (LSE) of µ, defined as the value of
a which minimises Σᵢ₌₁ⁿ (Xᵢ − a)²:

µ̂ = X̄ = arg minₐ Σᵢ₌₁ⁿ (Xᵢ − a)².
Proof: Given that:

S = Σᵢ₌₁ⁿ (Xᵢ − a)² = Σᵢ₌₁ⁿ (Xᵢ − X̄)² + n(X̄ − a)²

where all terms are non-negative, the value of a for which S is minimised is that for
which n(X̄ − a)² = 0, i.e. a = X̄.
Estimator accuracy

MSE(µ̂) = E((µ̂ − µ)²) = σ²/n.

In order to determine the distribution of µ̂ we require knowledge of the underlying
distribution. Even if the relevant knowledge is available, one may only compute the
exact distribution of µ̂ explicitly for a limited number of cases.
By the central limit theorem, as n → ∞, we have:

P((X̄ − µ)/(σ/√n) ≤ z) → Φ(z)

for any z, where Φ(z) is the cdf of N(0, 1), i.e. when n is large, X̄ ∼ N(µ, σ²/n)
approximately.
Hence when n is large:

P(|X̄ − µ| ≤ 1.96 × σ/√n) ≈ 0.95.
To be on the safe side, the coefficient 1.96 is often replaced by 2. The estimated
standard error of X̄ is:

E.S.E.(X̄) = S/√n = ((1/(n(n − 1))) Σᵢ₌₁ⁿ (Xᵢ − X̄)²)^(1/2).
Example 7.10 Suppose we toss a coin 10 times, and record the number of ‘heads’
as a random variable X. Therefore:
X ∼ Bin(10, π)
Suppose we observe x = 8 heads. Many values of π could have produced this
observation; nevertheless, π = 0.8 is the most likely, or 'maximally' likely, value of the
parameter.
Why do we think 'π = 0.8' is most likely?
Let:

L(π) = P(X = 8) = (10!/(8! 2!)) π⁸(1 − π)².

Since x = 8 is the event which occurred in the experiment, this probability should be
relatively large. Figure 7.1 shows a plot of L(π) as a function of π.
The most likely value of π should make this probability as large as possible. This
value is taken as the maximum likelihood estimate of π.
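A short R sketch reproduces Figure 7.1 and confirms the maximising value numerically (illustrative code, not part of the original example):

L <- function(p) choose(10, 8) * p^8 * (1 - p)^2         # L(pi) = P(X = 8)
curve(L(x), from = 0, to = 1, xlab = "pi", ylab = "L(pi)")
optimize(L, interval = c(0, 1), maximum = TRUE)$maximum  # close to 0.8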
Maximising L(π) is equivalent to maximising:

l(π) = ln L(π) = c + 8 ln π + 2 ln(1 − π)

where c is a constant. Setting dl(π)/dπ = 8/π − 2/(1 − π) = 0 gives the maximum
likelihood estimate π̂ = 0.8.
In general, the maximum likelihood estimator (MLE) is a statistic of the random
sample:

θ̂ = θ̂(X₁, X₂, …, Xₙ).
The likelihood function reflects the information about the unknown parameter θ in
the data {X1 , X2 , . . . , Xn }.
iii. It is often more convenient to use the log-likelihood function⁴ denoted as:

l(θ) = ln L(θ) = Σᵢ₌₁ⁿ ln f(Xᵢ; θ)

since the MLE θ̂ maximises l(θ) over θ:

θ̂ = arg maxθ l(θ).

iv. For a smooth likelihood function, the MLE is often the solution of the equation:

dl(θ)/dθ = 0.
vi. Unlike the MME or LSE, the MLE uses all the information about the population
distribution. It is often more efficient (i.e. more accurate) than the MME or LSE.
⁴ Throughout where 'log' is used in log-likelihood functions, it will be assumed to be the logarithm to the base e, i.e. the natural logarithm.
The log-likelihood function is:

l(λ) = 2n ln λ − nλX̄ + c

where c = ln(Πᵢ₌₁ⁿ Xᵢ) is a constant.
Setting:

dl(λ)/dλ = 2n/λ̂ − nX̄ = 0

we obtain λ̂ = 2/X̄.
Note the MLE λ̂ may be obtained by maximising L(λ) directly. However, it is
much easier to work with l(λ) instead.
Case I: σ² is known.
The likelihood function is:

L(µ) = (2πσ²)^(−n/2) exp(−(1/(2σ²)) Σᵢ₌₁ⁿ (Xᵢ − µ)²)
= (2πσ²)^(−n/2) exp(−(1/(2σ²)) Σᵢ₌₁ⁿ (Xᵢ − X̄)²) exp(−(n/(2σ²))(X̄ − µ)²).

Only the last factor depends on µ, and it is maximised when (X̄ − µ)² = 0. Hence the
MLE is µ̂ = X̄.
It follows from the lemma below that σ̂² = Σᵢ₌₁ⁿ (Xᵢ − X̄)²/n.
7.8 Asymptotic distribution of MLEs
Let {X₁, X₂, …, Xₙ} be a random sample from the distribution f(x; θ), and let:

θ̂ = θ̂(X₁, X₂, …, Xₙ)

be the MLE of θ. Under some regularity conditions, the distribution of √n(θ̂ − θ)
converges to N(0, 1/I(θ)) as n → ∞, where I(θ) is the Fisher information defined as:

I(θ) = −∫₋∞^∞ (∂² ln f(x; θ)/∂θ²) f(x; θ) dx.
The following examples illustrate this result.
For the normal distribution N(µ, σ²) with σ² known, the pdf is
f(x; µ) = (2πσ²)^(−1/2) exp(−(x − µ)²/(2σ²)). Therefore:

ln f(x; µ) = −(1/2) ln(2πσ²) − (1/(2σ²))(x − µ)².

Hence:

d ln f(x; µ)/dµ = (x − µ)/σ² and d² ln f(x; µ)/dµ² = −1/σ².

Therefore:

I(µ) = −∫₋∞^∞ (−1/σ²) f(x; µ) dx = 1/σ².

The MLE of µ is X̄, and hence X̄ ∼ N(µ, σ²/n).
Example 7.15 For the Poisson distribution, p(x; λ) = λˣe^(−λ)/x!. Therefore:

ln p(x; λ) = x ln λ − λ − ln(x!).

Hence:

d ln p(x; λ)/dλ = x/λ − 1 and d² ln p(x; λ)/dλ² = −x/λ².

Therefore:

I(λ) = (1/λ²) Σₓ₌₀^∞ x p(x; λ) = (1/λ²) E(X) = 1/λ.
The group was alarmed to find that if you are a labourer, cleaner or dock
worker, you are twice as likely to die than a member of the professional classes.
(The Sunday Times, 31 August 1980)
Chapter 8
Interval estimation
By the end of this chapter, you should be able to:

explain the link between confidence intervals and distribution theory, and critique
the assumptions made to justify the use of various confidence intervals.
8.3 Introduction
Point estimation is simple but not informative enough, since a point estimator is
always subject to errors. A more scientific approach is to find an upper bound
U = U (X1 , X2 , . . . , Xn ) and a lower bound L = L(X1 , X2 , . . . , Xn ), and hope that the
unknown parameter θ lies between the two bounds L and U (life is not always as simple
as that, but it is a good start).
An intuitive guess for estimating the population mean would be an interval of the form:

X̄ ± k × S.E.(X̄)

where k > 0 is a constant and S.E.(X̄) is the standard error of the sample mean.
The (random) interval (L, U ) forms an interval estimator of θ. For estimation to be
as precise as possible, intuitively the width of the interval, U − L, should be small.
What does 'P(1.27 < µ < 3.23) = 0.95' mean in Example 8.1? Well, this probability
does not mean anything, since µ is an unknown constant!
We treat (1.27, 3.23) as one realisation of the random interval (X̄ − 0.98, X̄ + 0.98)
which covers µ with probability 0.95.
What is the meaning of 'with probability 0.95'? If one repeats the interval estimation a
large number of times, about 95% of the time the interval estimator covers the true µ.
Some remarks are the following.
i. The confidence level is often specified as 90%, 95% or 99%. Obviously the higher
the confidence level, the wider the interval.
For the normal distribution example:
0.90 = P(√n|X̄ − µ|/σ ≤ 1.645) = P(X̄ − 1.645 × σ/√n < µ < X̄ + 1.645 × σ/√n)

0.95 = P(√n|X̄ − µ|/σ ≤ 1.96) = P(X̄ − 1.96 × σ/√n < µ < X̄ + 1.96 × σ/√n)

0.99 = P(√n|X̄ − µ|/σ ≤ 2.576) = P(X̄ − 2.576 × σ/√n < µ < X̄ + 2.576 × σ/√n).

The widths of the three intervals are 2 × 1.645 × σ/√n, 2 × 1.96 × σ/√n and
2 × 2.576 × σ/√n, corresponding to the confidence levels of 90%, 95% and 99%,
respectively.
To achieve a 100% confidence level in the normal example, the width of the interval
would have to be infinite!
ii. Among all the confidence intervals at the same confidence level, the one with the
smallest width gives the most accurate estimation and is, therefore, optimal.
iii. For a distribution with a symmetric unimodal density function, optimal confidence
intervals are symmetric, as depicted in Figure 8.1.
In practice the standard deviation σ is typically unknown, and we replace it with the
sample standard deviation:
S = ((1/(n − 1)) Σᵢ₌₁ⁿ (Xᵢ − X̄)²)^(1/2)
Figure 8.1: Symmetric unimodal density function showing that a given probability is
represented by the narrowest interval when symmetric about the mean.
where k is a constant determined by the confidence level and also by the distribution of
the statistic:
(X̄ − µ)/(S/√n). (8.1)
However, the distribution of (8.1) is no longer normal – it is the Student’s t distribution.
Let:

X̄ = (1/n) Σᵢ₌₁ⁿ Xᵢ, S² = (1/(n − 1)) Σᵢ₌₁ⁿ (Xᵢ − X̄)² and E.S.E.(X̄) = S/√n

where E.S.E.(X̄) denotes the estimated standard error of the sample mean.
An accurate 100(1 − α)% confidence interval for µ, where α ∈ (0, 1), is:

(X̄ − c × S/√n, X̄ + c × S/√n) = (X̄ − c × E.S.E.(X̄), X̄ + c × E.S.E.(X̄))

where c = tα/2, n−1 is the top 100(α/2)th percentile of the tn−1 distribution.
8.5 Approximate confidence intervals
Example 8.2 The salary data of 253 graduates from a UK business school (in
thousands of pounds) yield the following: n = 253, x̄ = 47.126, s = 6.843 and so
s/√n = 0.43.
A point estimate of the average salary µ is x̄ = 47.126.
An approximate 95% confidence interval for µ is:

47.126 ± 1.96 × 0.43 ⇒ (46.283, 47.969).
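The interval is simple to reproduce in R (a sketch using the summary statistics above):

n <- 253; xbar <- 47.126; s <- 6.843
se <- s / sqrt(n)                 # 0.43
xbar + c(-1, 1) * 1.96 * se       # approximate 95% confidence interval for mu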
Note that:

(1/σ²) Σᵢ₌₁ⁿ (Xᵢ − µ)² = (1/σ²) Σᵢ₌₁ⁿ (Xᵢ − X̄)² + n(X̄ − µ)²/σ². (8.2)

Proof: We have:

Σᵢ₌₁ⁿ (Xᵢ − µ)² = Σᵢ₌₁ⁿ ((Xᵢ − X̄) + (X̄ − µ))²
= Σᵢ₌₁ⁿ (Xᵢ − X̄)² + Σᵢ₌₁ⁿ (X̄ − µ)² + 2 Σᵢ₌₁ⁿ (Xᵢ − X̄)(X̄ − µ)
= Σᵢ₌₁ⁿ (Xᵢ − X̄)² + n(X̄ − µ)² + 2(X̄ − µ) Σᵢ₌₁ⁿ (Xᵢ − X̄)
= Σᵢ₌₁ⁿ (Xᵢ − X̄)² + n(X̄ − µ)²

since Σᵢ₌₁ⁿ (Xᵢ − X̄) = 0. Dividing by σ² gives (8.2).
Since X̄ ∼ N(µ, σ²/n), then n(X̄ − µ)²/σ² ∼ χ²₁. It can be proved that:

(1/σ²) Σᵢ₌₁ⁿ (Xᵢ − X̄)² ∼ χ²ₙ₋₁.
For any given small α ∈ (0, 1), we can find 0 < k₁ < k₂ such that:

P(X < k₁) = P(X > k₂) = α/2

where X ∼ χ²ₙ₋₁. Writing M = Σᵢ₌₁ⁿ (Xᵢ − X̄)² = (n − 1)S², we therefore have:

1 − α = P(k₁ < M/σ² < k₂) = P(M/k₂ < σ² < M/k₁).
Example 8.3 Suppose n = 15 and the sample variance is s² = 24.5. Let α = 0.05.
From Table 8 of Murdoch and Barnes' Statistical Tables, we find:

P(X < 5.629) = 0.025 and P(X > 26.119) = 0.025

where X ∼ χ²₁₄.
Hence a 95% confidence interval for σ² is:

(m/26.119, m/5.629) = (14 × s²/26.119, 14 × s²/5.629)
= (0.536 × s², 2.487 × s²)
= (13.132, 60.934).
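In R, the same interval can be computed with qchisq in place of the tables (a sketch):

n <- 15; s2 <- 24.5
k1 <- qchisq(0.025, df = n - 1)           # about 5.629
k2 <- qchisq(0.975, df = n - 1)           # about 26.119
c((n - 1) * s2 / k2, (n - 1) * s2 / k1)   # 95% confidence interval for sigma^2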
A statistician took the Dale Carnegie Course, improving his confidence from
95% to 99%.
(Anon)
Chapter 9
Hypothesis testing
9.3 Introduction
Hypothesis testing, together with statistical estimation, are the two most
frequently-used statistical inference methods. Hypothesis testing addresses a different
type of practical question from statistical estimation.
Based on the data, a (statistical) test makes a binary decision on a hypothesis, denoted
by H0:

reject H0 or not reject H0.
H0: π = 0.50.

If π̂ = 0.90, H0 is unlikely to be true.
If π̂ = 0.45, H0 may be true (and also may be untrue).
If π̂ = 0.70, what to do then?
Example 9.2 A customer complains that the amount of coffee powder in a coffee
tin is less than the advertised weight of 3 pounds.
A random sample of 20 tins is selected, resulting in an average weight of x̄ = 2.897
pounds. Is this sufficient to substantiate the complaint?
Again statistical estimation cannot provide a firm answer, due to random
fluctuations between different random samples. So we cast the problem into a
hypothesis testing problem as follows.
Let the weight of coffee in a tin be a normal random variable X ∼ N (µ, σ 2 ). We
need to test the hypothesis µ < 3. In fact, we use the data to test the hypothesis:
H0 : µ = 3.
Example 9.3 Suppose one is interested in evaluating the mean income (in £000s)
of a community. Suppose income in the population is modelled as N (µ, 25) and a
random sample of n = 25 observations is taken, yielding the sample mean x̄ = 17.
Independently of the data, three expert economists give their own opinions as
follows.
If Ms B’s claim is correct, X̄ ∼ N (15, 1). The observed value x̄ = 17 begins to look a
bit ‘extreme’, as it is two standard deviations away from µ. Hence there is some
inconsistency between the claim and the data evidence. This is shown in Figure 9.2.
If Mr C’s claim is correct, X̄ ∼ N (14, 1). The observed value x̄ = 17 is very extreme,
as it is three standard deviations away from µ. Hence there is strong inconsistency
between the claim and the data evidence. This is shown in Figure 9.3.
Figure 9.1: Comparison of claim and data evidence for Dr A in Example 9.3.
Figure 9.2: Comparison of claim and data evidence for Ms B in Example 9.3.
Figure 9.3: Comparison of claim and data evidence for Mr C in Example 9.3.
Definition of p-values
A p-value is the probability of the event that the test statistic takes the observed
value or more extreme (i.e. more unlikely) values under H0 . It is a measure of the
discrepancy between the hypothesis H0 and the data.
Example 9.5 Let {X₁, X₂, …, X₂₀}, taking values either 1 or 0, be the outcomes
of an experiment of tossing a coin 20 times, where Xᵢ = 1 denotes 'heads' and the
null hypothesis is that the coin is fair:

H0: P(Xᵢ = 1) = 0.5.

Suppose there are 17 Xᵢs taking the value 1, and 3 Xᵢs taking the value 0. Will you
reject the null hypothesis at the 5% significance level?
H0: µ = µ0 vs. H1: µ < µ0.

The critical value c satisfies:

α = Pµ0(T ≤ c) = P(Z ≤ c).

Therefore, c is the 100αth percentile of N(0, 1). Due to the symmetry of N(0, 1),
c = −zα, where zα is the top 100αth percentile of N(0, 1), i.e. P(Z > zα) = α, where
Z ∼ N(0, 1). For α = 0.05, zα = 1.645. We reject H0 if t ≤ −1.645.
i. We use a one-tailed test when we are only interested in the departure from H0 in
one direction.
ii. The distribution of a test statistic under H0 must be known in order to calculate
p-values or critical values.
iii. A test may be carried out by either computing the p-value or determining the
critical value.
iv. The probability of incorrect decisions in hypothesis testing is typically positive. For
example, the significance level is the probability of rejecting a true H0 .
9.6 t tests
t tests are one of the most frequently-used statistical tests.
Let {X1 , X2 , . . . , Xn } be a random sample from N (µ, σ 2 ), where both µ and σ 2 > 0 are
unknown. We are interested in testing the hypotheses:
H0 : µ = µ0 vs. H1 : µ < µ0
where µ0 is known.
Now we cannot use √n(X̄ − µ0)/σ as a test statistic, since σ is unknown. Naturally we
replace σ by S, where:

S² = (1/(n − 1)) Σᵢ₌₁ⁿ (Xᵢ − X̄)².

The test statistic is then the famous t statistic:

T = √n(X̄ − µ0)/S = (X̄ − µ0)/(S/√n).
We reject H0 if t < c, where c is the critical value determined by the significance level:
PH0 (T < c) = α
where PH0 denotes the distribution under H0 (with mean µ0 and unknown σ²).
Under H0 , T ∼ tn−1 . Hence:
α = PH0 (T < c)
i.e. c is the 100αth percentile of the t distribution with n − 1 degrees of freedom. By
symmetry, c = −tα, n−1 , where tα, k denotes the top 100αth percentile of the tk
distribution.
Example 9.7 To deal with the customer complaint that the average amount of
coffee powder in a coffee tin is less than the advertised 3 pounds, 20 tins were
weighed, yielding the following observations:
2.82, 3.01, 3.11, 2.71, 2.93, 2.68, 3.02, 3.01, 2.93, 2.56,
2.78, 3.01, 3.09, 2.94, 2.82, 2.81, 3.05, 3.01, 2.85, 2.79.
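These data can be analysed directly in R with t.test (a sketch; t.test carries out exactly the one-sided t test described above):

x <- c(2.82, 3.01, 3.11, 2.71, 2.93, 2.68, 3.02, 3.01, 2.93, 2.56,
       2.78, 3.01, 3.09, 2.94, 2.82, 2.81, 3.05, 3.01, 2.85, 2.79)
t.test(x, mu = 3, alternative = "less")   # tests H0: mu = 3 vs. H1: mu < 3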
Although H0 does not specify the population distribution completely (σ 2 > 0), the
distribution of the test statistic, T , under H0 is completely known. This enables us
to find the critical value or p-value.
PH0 (T ∈ C) = α.
3. If the observed value of T with the given sample is in the critical region C, H0 is
rejected. Otherwise, H0 is not rejected.
In order to make a test powerful in the sense that the chance of making an incorrect
decision is small, the critical region should consist of those values of T which are least
supportive of H0 (i.e. which lie in the direction of H1 ).
                                  Decision made
                          H0 not rejected     H0 rejected
True state   H0 true      Correct decision    Type I error
of nature    H1 true      Type II error       Correct decision
i. Ideally we would like to have a test which minimises the probabilities of making
both types of error, which unfortunately is not feasible.
ii. The probability of making a Type I error is the significance level, which is under
our control.
iii. We do not have explicit control over the probability of a Type II error. For a given
significance level, we try to choose a test statistic such that the probability of a
Type II error is small.
iv. The power function of the test is defined as:
β(θ) = Pθ (H0 is rejected) for θ ∈ Θ1
i.e. β(θ) = 1 − P (Type II error).
v. The null hypothesis H0 and the alternative hypothesis H1 are not treated equally in
a statistical test, i.e. there is an asymmetric treatment. The choice of H0 is based
on the subject matter concerned and/or technical convenience.
vi. It is more conclusive to end a test with H0 rejected, as the decision of ‘not reject
H0 ’ does not imply that H0 is accepted.
Turning Example 9.8 into a statistical problem, we assume that the data form a random
sample from N (µ, σ 2 ). We are interested in testing the hypotheses:
H0: σ² = σ0² vs. H1: σ² > σ0².
Let S² = Σᵢ₌₁ⁿ (Xᵢ − X̄)²/(n − 1), then (n − 1)S²/σ² ∼ χ²ₙ₋₁. Under H0 we have:

T = (n − 1)S²/σ0² = Σᵢ₌₁ⁿ (Xᵢ − X̄)²/σ0² ∼ χ²ₙ₋₁.
Since we will reject H0 against an alternative hypothesis σ 2 > σ02 , we should reject H0
for large values of T .
H0 is rejected if t > χ²α, n−1, where χ²α, n−1 denotes the top 100αth percentile of the
χ²ₙ₋₁ distribution, i.e. we have:

P(T ≥ χ²α, n−1) = α.
For any σ² > σ0², the power of the test at σ is:

β(σ) = Pσ(T > χ²α, n−1) = P(X > σ0² χ²α, n−1/σ²), where X ∼ χ²ₙ₋₁.

For example, with n = 25, α = 0.05 and σ0² = 1 (so χ²₀.₀₅, ₂₄ = 36.415):

σ²                    1        1.5      2        3        4
χ²₀.₀₅, ₂₄/σ²         36.415   24.277   18.208   12.138   9.104
β(σ)                  0.05     0.446    0.793    0.978    0.997
Approximate β(σ)      0.05     0.40     0.80     0.975    0.995
In the above table, X̄ = Σᵢ₌₁ⁿ Xᵢ/n, S² = Σᵢ₌₁ⁿ (Xᵢ − X̄)²/(n − 1), and
{X₁, X₂, …, Xₙ} is a random sample from N(µ, σ²).
Let Zᵢ = Xᵢ − Yᵢ for the paired observations, so that the Zᵢs form a random sample
with mean and variance:

µ = µX − µY and σ² = σ²X + σ²Y.

The hypothesis of interest is:

H0: µ = 0.

Therefore, we should use the test statistic T = √n Z̄/S, where Z̄ and S² denote,
respectively, the sample mean and the sample variance of {Z₁, Z₂, …, Zₙ}.
At the 100α% significance level, for α ∈ (0, 1), we reject the hypothesis µX = µY when
|t| > tα/2, n−1.
Let the sample means be X̄ = Σᵢ₌₁ⁿ Xᵢ/n and Ȳ = Σᵢ₌₁ᵐ Yᵢ/m, and the sample
variances be:

S²X = (1/(n − 1)) Σᵢ₌₁ⁿ (Xᵢ − X̄)² and S²Y = (1/(m − 1)) Σᵢ₌₁ᵐ (Yᵢ − Ȳ)².

X̄, Ȳ, S²X and S²Y are independent.
X̄ ∼ N(µX, σ²X/n) and (n − 1)S²X/σ²X ∼ χ²ₙ₋₁, and similarly for Ȳ and S²Y.
Hence X̄ − Ȳ ∼ N(µX − µY, σ²X/n + σ²Y/m). If σ²X = σ²Y, then:

(X̄ − Ȳ − (µX − µY))/√(σ²X/n + σ²Y/m) ÷ √(((n − 1)S²X/σ²X + (m − 1)S²Y/σ²Y)/(n + m − 2))
= √((n + m − 2)/(1/n + 1/m)) × (X̄ − Ȳ − (µX − µY))/√((n − 1)S²X + (m − 1)S²Y) ∼ tₙ₊ₘ₋₂.
9.12.1 Tests on µX − µY with known σ²X and σ²Y

Suppose we are interested in testing:

H0: µX = µY vs. H1: µX ≠ µY.

Note that:

(X̄ − Ȳ − (µX − µY))/√(σ²X/n + σ²Y/m) ∼ N(0, 1).

Under H0, µX − µY = 0, so we have:

T = (X̄ − Ȳ)/√(σ²X/n + σ²Y/m) ∼ N(0, 1).

At the 100α% significance level, for α ∈ (0, 1), we reject H0 if |t| > zα/2, where
P(Z > zα/2) = α/2, for Z ∼ N(0, 1).
A 100(1 − α)% confidence interval for µX − µY is:

X̄ − Ȳ ± zα/2 × √(σ²X/n + σ²Y/m).
9.12.2 Tests on µX − µY with σ²X = σ²Y but unknown

This time we consider the following hypotheses:

H0: µX − µY = δ0 vs. H1: µX − µY > δ0.
Example 9.10 Two types of razor, A and B, were compared using 100 men in an
experiment. Each man shaved one side, chosen at random, of his face using one razor
and the other side using the other razor. The times taken to shave, Xi and Yi
minutes, for i = 1, 2, . . . , 100, corresponding to the razors A and B, respectively,
were recorded, yielding x̄ = 2.84, ȳ = 3.02 and a sample variance of the differences
zᵢ = xᵢ − yᵢ of s²Z = 0.6. We wish to test:

H0: µX = µY vs. H1: µX ≠ µY.
There are three approaches – a paired comparison method and two two-sample
comparisons based on different assumptions. Since the data are recorded in pairs,
the paired comparison is most relevant and effective to analyse these data.
With the given data, we observe t = √100 × (2.84 − 3.02)/√0.6 = −2.327. Hence we
reject the hypothesis that the two razors lead to the same mean shaving time at the
5% significance level.
A 95% confidence interval for µX − µY is:

x̄ − ȳ ± t₀.₀₂₅, n−1 × sZ/√n = −0.18 ± 0.154 ⇒ (−0.334, −0.026).
Some remarks are the following.
ii. The paired comparison is intuitively the most relevant, requires the least
assumptions, and leads to the most conclusive inference (i.e. rejection of H0 ). It
also produces the narrowest confidence interval.
iii. Methods II and III ignore the pairing of the data. Consequently, the inference is
less conclusive and less accurate.
iv. A general observation is that H0 is rejected at the 100α% significance level if and
only if the value hypothesised by H0 is not within the corresponding 100(1 − α)%
confidence interval.
v. It is much more challenging to compare two normal means with unknown and
unequal variances. This will not be discussed in this course.
ρ = Corr(X, Y) = Cov(X, Y)/(Var(X) Var(Y))^(1/2)
= E((X − E(X))(Y − E(Y)))/(E((X − E(X))²) E((Y − E(Y))²))^(1/2).
i. ρ ∈ [−1, 1], and |ρ| = 1 if and only if Y = aX + b for some constants a and b.
Furthermore, a > 0 if ρ = 1, and a < 0 if ρ = −1.
ii. ρ measures only the linear relationship between X and Y. When ρ = 0, there is no
linear relationship between X and Y, i.e. X and Y are uncorrelated.
iii. If X and Y are independent (in the sense that the joint pdf is the product of the
two marginal pdfs), ρ = 0. However, if ρ = 0, X and Y are not necessarily
independent, as there may exist some non-linear relationship between X and Y .
iv. If ρ > 0, X and Y tend to increase (or decrease) together. If ρ < 0, X and Y tend
to move in opposite directions.
The sample correlation coefficient is defined as:

ρ̂ = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ)/(Σᵢ₌₁ⁿ (Xᵢ − X̄)² Σᵢ₌₁ⁿ (Yᵢ − Ȳ)²)^(1/2)

where X̄ = Σᵢ₌₁ⁿ Xᵢ/n and Ȳ = Σᵢ₌₁ⁿ Yᵢ/n.
Example 9.11 The measurements of height, X, and weight, Y , are taken from 69
students in a class. ρ should be positive, intuitively!
In Figure 9.5, the vertical line at x̄ and the horizontal line at ȳ divide the 69 points
into 4 quadrants: northeast (NE), southwest (SW), northwest (NW) and southeast
(SE). Most points are in either NE or SW.
Overall:

Σᵢ₌₁⁶⁹ (xᵢ − x̄)(yᵢ − ȳ) > 0

and hence the sample correlation coefficient ρ̂ > 0.
Figure 9.6 shows examples of different sample correlation coefficients using scatterplots
of bivariate observations.
Suppose we wish to test:

H0: ρ = 0 vs. H1: ρ ≠ 0.

Under H0, the test statistic:

T = ρ̂ √((n − 2)/(1 − ρ̂²)) ∼ tₙ₋₂.

Hence we reject H0 at the 100α% significance level, for α ∈ (0, 1), if |t| > tα/2, n−2,
where:

P(T > tα/2, n−2) = α/2.
i. |T| = |ρ̂| √((n − 2)/(1 − ρ̂²)) increases as |ρ̂| increases.
iii. Two random variables X and Y are jointly normal if aX + bY is normal for any
constants a and b.
iv. For jointly normal random variables X and Y , if Corr(X, Y ) = 0, X and Y are also
independent.
Let {X₁, …, Xₙ} and {Y₁, …, Yₘ} be the two samples, with sample variances:

S²X = (1/(n − 1)) Σᵢ₌₁ⁿ (Xᵢ − X̄)² and S²Y = (1/(m − 1)) Σᵢ₌₁ᵐ (Yᵢ − Ȳ)².

We have (n − 1)S²X/σ²X ∼ χ²ₙ₋₁ and (m − 1)S²Y/σ²Y ∼ χ²ₘ₋₁. Therefore:

(σ²Y/σ²X) × (S²X/S²Y) = (S²X/σ²X)/(S²Y/σ²Y) ∼ Fₙ₋₁, ₘ₋₁.

Under H0: σ²Y/σ²X = k, T = k S²X/S²Y ∼ Fₙ₋₁, ₘ₋₁. Hence H0 is rejected if:

t > Fα/2, n−1, m−1 or t < F1−α/2, n−1, m−1

where Fα, p, k denotes the top 100αth percentile of the Fp, k distribution, that is:

P(T > Fα, p, k) = α.
A 100(1 − α)% confidence interval for σ²Y/σ²X is:

(F1−α/2, n−1, m−1 × S²Y/S²X, Fα/2, n−1, m−1 × S²Y/S²X).
Example 9.12 Here we practise use of Table 9 of Murdoch and Barnes’ Statistical
Tables to obtain critical values for the F distribution.
Table 9 can be used to find the top 100αth percentile of the Fν1 , ν2 distribution for
α = 0.05, 0.025, 0.01 and 0.001.
For example, for ν₁ = 3 and ν₂ = 5, the tables give:

P(F₃,₅ > 5.41) = 0.05, P(F₃,₅ > 7.76) = 0.025, P(F₃,₅ > 12.06) = 0.01

and:

P(F₃,₅ > 33.20) = 0.001.
To find the bottom 100αth percentile, we note that F1−α, ν1, ν2 = 1/Fα, ν2, ν1. So, for
ν₁ = 3 and ν₂ = 5, we have:

P(F₃,₅ < 1/F₀.₀₅,₅,₃) = P(F₃,₅ < 1/9.01) = P(F₃,₅ < 0.111) = 0.05
P(F₃,₅ < 1/F₀.₀₂₅,₅,₃) = P(F₃,₅ < 1/14.90) = P(F₃,₅ < 0.067) = 0.025
P(F₃,₅ < 1/F₀.₀₁,₅,₃) = P(F₃,₅ < 1/28.20) = P(F₃,₅ < 0.035) = 0.01

and:

P(F₃,₅ < 1/F₀.₀₀₁,₅,₃) = P(F₃,₅ < 1/134.60) = P(F₃,₅ < 0.007) = 0.001.
Example 9.13 The daily returns (in percentages) of two assets, X and Y , are
recorded over a period of 100 trading days, yielding average daily returns of x̄ = 3.21
and ȳ = 1.41. Also available from the data are the following quantities:
Σᵢ₌₁¹⁰⁰ xᵢ² = 1,989.24, Σᵢ₌₁¹⁰⁰ yᵢ² = 932.78 and Σᵢ₌₁¹⁰⁰ xᵢyᵢ = 661.11.
Assume the data are normally distributed. Are the two assets positively correlated
with each other, and is asset X riskier than asset Y ?
With n = 100 we have:

s²X = (1/(n − 1)) Σᵢ₌₁ⁿ (xᵢ − x̄)² = (1/(n − 1))(Σᵢ₌₁ⁿ xᵢ² − nx̄²) = 9.69

and:

s²Y = (1/(n − 1)) Σᵢ₌₁ⁿ (yᵢ − ȳ)² = (1/(n − 1))(Σᵢ₌₁ⁿ yᵢ² − nȳ²) = 7.41.

Therefore:

ρ̂ = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ)/((n − 1)sXsY) = (Σᵢ₌₁ⁿ xᵢyᵢ − nx̄ȳ)/((n − 1)sXsY) = 0.249.
First we test:
H0 : ρ = 0 vs. H1 : ρ > 0.
Under H0, the test statistic is:

T = ρ̂ √((n − 2)/(1 − ρ̂²)) ∼ t₉₈.
Setting α = 0.01, we reject H0 if t > t0.01, 98 = 2.37. With the given data, t = 2.545
hence we reject the null hypothesis of ρ = 0 at the 1% significance level. We
conclude that there is highly significant evidence indicating that the two assets are
positively correlated.
We measure the risks in terms of variances, and test:
H0: σ²X = σ²Y vs. H1: σ²X > σ²Y.
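The conclusion of this test is not reproduced in the extract above; the calculation can be completed in R along the following lines (a sketch, with the critical value replacing a table look-up):

s2x <- 9.69; s2y <- 7.41; n <- 100
s2x / s2y                                               # observed test statistic
qf(0.05, df1 = n - 1, df2 = n - 1, lower.tail = FALSE)  # 5% critical value of F(99, 99)
# H0 is rejected at the 5% significance level only if the statistic exceeds this value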
9.15 Summary: tests for two normal distributions

Null hypothesis H0    Conditions             Test statistic T
µX − µY = δ           σ²X, σ²Y known         (X̄ − Ȳ − δ)/√(σ²X/n + σ²Y/m)
µX − µY = δ           σ²X = σ²Y unknown      √((n + m − 2)/(1/n + 1/m)) × (X̄ − Ȳ − δ)/√((n − 1)S²X + (m − 1)S²Y)
ρ = 0                 n = m                  ρ̂ √((n − 2)/(1 − ρ̂²))
σ²Y/σ²X = k                                  k S²X/S²Y
To p, or not to p?
(James Abdey, Ph.D. Thesis 2009.¹)

¹ Available at http://etheses.lse.ac.uk/31
Chapter 10
Analysis of variance (ANOVA)
By the end of this chapter, you should be able to:

restate and interpret the models for one-way and two-way analysis of variance
perform hypothesis tests and construct confidence intervals for one-way and
two-way analysis of variance
10.3 Introduction
Analysis of variance (ANOVA) is a popular tool which has an applicability and power
which we can only start to appreciate in this course. The idea of analysis of variance is
to investigate how variation in structured data can be split into pieces associated with
components of that structure. We look only at one-way and two-way classifications,
providing tests and confidence intervals which are widely used in practice.
Example 10.1 To assess the teaching quality of class teachers, a random sample of
6 examination marks was selected from each of three classes. The examination marks
for each class are listed in the table below.
Can we infer from these data that there is no significant difference in the
examination marks among all three classes?
Suppose examination marks from Class j follow the distribution N (µj , σ 2 ), for
j = 1, 2, 3. So we assume examination marks are normally distributed with the same
variance in each class, but possibly different means.
We need to test the hypothesis:
H0 : µ1 = µ2 = µ3 .
The data form a 6 × 3 array. Denote the data point at the (i, j)th position as Xᵢⱼ.
We compute the column means first, where the jth column mean is:

X̄·ⱼ = (X₁ⱼ + X₂ⱼ + ⋯ + Xₙⱼⱼ)/nⱼ.

               Observation
            1    2    3    4    5    6   Mean
Class 1    85   75   82   76   71   85     79
Class 2    71   75   73   74   69   82     74
Class 3    59   64   62   69   75   67     66
Note that similar problems arise in many other practical situations.
If H0 is true, the three observed sample means x̄·1 , x̄·2 and x̄·3 should be very close to
each other, i.e. all of them should be close to the overall sample mean, x̄, which is:
x̄ = (x̄·₁ + x̄·₂ + x̄·₃)/3 = (79 + 74 + 66)/3 = 73.
Hence we would reject H0 for large values of T . (Note t = 0 if x̄·1 = x̄·2 = x̄·3 which
would mean that there is no variation at all between the sample means. In this case
all the sample means would equal x̄.)
It remains to determine the distribution of T under H0 .
where n = Σⱼ₌₁ᵏ nⱼ is the total number of observations across all k groups.
The total variation is:

Σⱼ₌₁ᵏ Σᵢ₌₁ⁿʲ (Xᵢⱼ − X̄)²

with n − 1 degrees of freedom.
The ANOVA decomposition is:

Σⱼ₌₁ᵏ Σᵢ₌₁ⁿʲ (Xᵢⱼ − X̄)² = Σⱼ₌₁ᵏ nⱼ(X̄·ⱼ − X̄)² + Σⱼ₌₁ᵏ Σᵢ₌₁ⁿʲ (Xᵢⱼ − X̄·ⱼ)²

where the between-groups variation B = Σⱼ₌₁ᵏ nⱼ(X̄·ⱼ − X̄)² has k − 1 degrees of
freedom, and the within-groups variation W = Σⱼ₌₁ᵏ Σᵢ₌₁ⁿʲ (Xᵢⱼ − X̄·ⱼ)² has
n − k = Σⱼ₌₁ᵏ (nⱼ − 1) degrees of freedom.
We have already discussed the jth sample mean and overall sample mean. The total
variation is a measure of the overall (total) variability in the data from all k groups
about the overall sample mean. The ANOVA decomposition decomposes this into two
components: between-groups variation (which is attributable to the factor level) and
within-groups variation (which is attributable to the variation within each group and is
assumed to be the same σ 2 for each group).
Some remarks are the following.
ii. W/σ² = Σ_{j=1}^{k} Σ_{i=1}^{nj} (Xij − X̄·j)²/σ² ∼ χ²n−k.
iii. Under H0: µ1 = · · · = µk, B/σ² = Σ_{j=1}^{k} nj(X̄·j − X̄)²/σ² ∼ χ²k−1.
The test statistic is therefore:
F = (B/(k − 1))/(W/(n − k)) ∼ Fk−1, n−k under H0
and we reject H0 for large values, i.e. when f > Fα, k−1, n−k, where Fα, k−1, n−k is the top 100αth percentile of the Fk−1, n−k distribution, i.e. P(F > Fα, k−1, n−k) = α, and f is the observed test statistic value.
The p-value of the test is p-value = P(F > f).
It is clear that f > Fα, k−1, n−k if and only if the p-value < α, as we must reach the same
conclusion regardless of whether we use the critical value approach or the p-value
approach to hypothesis testing.
Example 10.2 Continuing with Example 10.1, for the given data, k = 3,
n1 = n2 = n3 = 6, n = n1 + n2 + n3 = 18, x̄·1 = 79, x̄·2 = 74, x̄·3 = 66 and x̄ = 73.
The sample variances are calculated to be s21 = 34, s22 = 20 and s23 = 32. Therefore:
b = Σ_{j=1}^{3} 6(x̄·j − x̄)² = 6 × ((79 − 73)² + (74 − 73)² + (66 − 73)²) = 516
and:
w = Σ_{j=1}^{3} Σ_{i=1}^{6} (xij − x̄·j)² = Σ_{j=1}^{3} Σ_{i=1}^{6} x²ij − 6 Σ_{j=1}^{3} x̄²·j = Σ_{j=1}^{3} 5s²j = 5 × (34 + 20 + 32) = 430.
Hence:
f = (b/(k − 1))/(w/(n − k)) = (516/2)/(430/15) = 258/28.67 = 9.
Under H0 : µ1 = µ2 = µ3 , F ∼ Fk−1, n−k = F2, 15 . Since F0.01, 2, 15 = 6.359 < 9, using
Table 9 of Murdoch and Barnes’ Statistical Tables, we reject H0 at the 1%
significance level. In fact the p-value (using a computer) is P (F > 9) = 0.003.
Therefore, we conclude that there is a significant difference among the mean
examination marks across the three classes.
Source DF SS MS F p-value
Class 2 516 258 9 0.003
Error 15 430 28.67
Total 17 946
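These calculations are straightforward to reproduce in R. The following sketch (an illustration, assuming the marks are typed in by hand since no data file is named for this example) rebuilds the ANOVA table and p-value above:

marks <- c(85, 75, 82, 76, 71, 85,   # Class 1
           71, 75, 73, 74, 69, 82,   # Class 2
           59, 64, 62, 69, 75, 67)   # Class 3
class <- factor(rep(1:3, each = 6))
summary(aov(marks ~ class))                     # reproduces DF, SS, MS and f = 9
pf(9, df1 = 2, df2 = 15, lower.tail = FALSE)    # p-value = 0.003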
> attach(UhAh)
> summary(UhAh)
Frequency Department
Min. : 0.00 English :100
1st Qu.: 4.00 Mathematics :100
Median : 5.00 Political Science:100
Mean : 5.48
3rd Qu.: 7.00
Max. :11.00
> xbar <- tapply(Frequency, Department, mean)
> s <- tapply(Frequency, Department, sd)
> n <- tapply(Frequency, Department, length)
> sem <- s/sqrt(n)
> list(xbar,s,n,sem)
[[1]]
English Mathematics Political Science
5.81 5.30 5.33
[[2]]
English Mathematics Political Science
2.493203 2.012587 1.974867
[[3]]
English Mathematics Political Science
100 100 100
[[4]]
English Mathematics Political Science
0.2493203 0.2012587 0.1974867
Surprisingly, professors in English say ‘uh’ or ‘ah’ more on average than those in
Mathematics and Political Science (compare the sample means of 5.81, 5.30 and
5.33), but the difference seems small. However, we need to formally test whether the
(seemingly small) differences are statistically significant.
Using the data, R produces the following one-way ANOVA table:
Response: Frequency
Df Sum Sq Mean Sq F value Pr(>F)
Department 2 16.38 8.1900 1.7344 0.1783
Residuals 297 1402.50 4.7222
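The call which generated this table is not shown in the extract above; one plausible command, using the variables made available by attach(UhAh), is:

> anova(lm(Frequency ~ Department))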
Since the p-value for the F test is 0.1783, we cannot reject the following hypothesis:
H0 : µ1 = µ2 = µ3 .
An estimator of σ is:
σ̂ = S = √(W/(n − k)).
95% confidence intervals for µj are given by:
X̄·j ± t0.025, n−k × S/√nj for j = 1, 2, . . . , k
where t0.025, n−k is the top 2.5th percentile of the Student’s tn−k distribution, which
can be obtained from Table 7 of Murdoch and Barnes’ Statistical Tables.
Example 10.4 Assuming a common variance for each group, from the preceding
output in Example 10.3 we see that:
σ̂ = s = √(1,402.50/297) = √4.72 = 2.173.
Since t0.025, 297 ≈ t0.025, ∞ = 1.96, using Table 7 of Murdoch and Barnes’ Statistical
Tables, we obtain the following 95% confidence intervals for µ1 , µ2 and µ3 ,
respectively:
j = 1:  5.81 ± 1.96 × 2.173/√100 ⇒ (5.38, 6.24)
j = 2:  5.30 ± 1.96 × 2.173/√100 ⇒ (4.87, 5.73)
j = 3:  5.33 ± 1.96 × 2.173/√100 ⇒ (4.90, 5.76).
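These intervals are easy to verify in R; a sketch using the summary quantities computed above (group means, common group size of 100 and the pooled estimate σ̂ = 2.173):

xbar <- c(5.81, 5.30, 5.33)                      # English, Mathematics, Political Science
half <- qt(0.975, df = 297) * 2.173 / sqrt(100)  # half-width of each interval
cbind(lower = xbar - half, upper = xbar + half)  # matches (5.38, 6.24) etc.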
Example 10.5 In early 2001, the American economy was slowing down and
companies were laying off workers. A poll conducted during February 2001 asked a
random sample of workers how long (in months) it would be before they faced
significant financial hardship if they lost their jobs, with the data available in the file
‘GallupPoll.csv’. They are classified into four groups according to their incomes.
Below is part of the R output of the descriptive statistics of the classified data. Can
we infer that income group has a significant impact on the mean length of time
before facing financial hardship?
Hardship Income.group
Min. : 0.00 $20 to 30K: 81
1st Qu.: 8.00 $30 to 50K:114
Median :15.00 Over $50K : 39
Mean :16.11 Under $20K: 67
3rd Qu.:22.00
Max. :50.00
[[2]]
$20 to 30K $30 to 50K Over $50K Under $20K
9.233260 9.507464 11.029099 8.087043
[[3]]
$20 to 30K $30 to 50K Over $50K Under $20K
81 114 39 67
[[4]]
$20 to 30K $30 to 50K Over $50K Under $20K
1.0259178 0.8904556 1.7660693 0.9879896
Inspection of the sample means suggests that there is a difference between income
groups, but we need to conduct a one-way ANOVA test to see whether the
differences are statistically significant.
We apply one-way ANOVA to test whether the means in the k = 4 groups are equal,
i.e. H0 : µ1 = µ2 = µ3 = µ4 , from highest to lowest income groups.
We have n1 = 39, n2 = 114, n3 = 81 and n4 = 67, hence:
n = Σ_{j=1}^{k} nj = 39 + 114 + 81 + 67 = 301.
Also x̄·1 = 22.21, x̄·2 = 18.456, x̄·3 = 15.49, x̄·4 = 9.313 and:
x̄ = (1/n) Σ_{j=1}^{k} nj x̄·j = (39 × 22.21 + 114 × 18.456 + 81 × 15.49 + 67 × 9.313)/301 = 16.109.
Now:
b = Σ_{j=1}^{k} nj(x̄·j − x̄)² = 39(22.21 − 16.109)² + 114(18.456 − 16.109)² + 81(15.49 − 16.109)² + 67(9.313 − 16.109)² = 5,205.097.
We have s²1 = (11.03)² = 121.661, s²2 = (9.507)² = 90.383, s²3 = (9.23)² = 85.193 and s²4 = (8.087)² = 65.400, hence:
w = Σ_{j=1}^{k} (nj − 1)s²j = 38 × 121.661 + 113 × 90.383 + 80 × 85.193 + 66 × 65.400 = 25,968.24.
Consequently:
f = (b/(k − 1))/(w/(n − k)) = (5,205.097/3)/(25,968.24/(301 − 4)) = 19.84.
Under H0 , F ∼ Fk−1, n−k = F3, 297 . Since F0.01, 3, 297 ≈ 3.848 < 19.84, we reject H0 at
the 1% significance level, i.e. there is strong evidence that income group has a
significant impact on the mean length of time before facing financial hardship.
The pooled estimate of σ is:
s = √(w/(n − k)) = √(25,968.24/(301 − 4)) = 9.351.
95% confidence intervals for the group means are given by:
x̄·j ± t0.025, 297 × s/√nj = x̄·j ± 1.96 × 9.351/√nj = x̄·j ± 18.328/√nj.
Response: Hardship
Df Sum Sq Mean Sq F value Pr(>F)
Income.group 3 5202.1 1734.03 19.828 9.636e-12 ***
Residuals 297 25973.3 87.45
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note that minor differences are due to rounding errors in calculations.
The one-way ANOVA model can be written as:
Xij = µ + βj + εij
where εij ∼ N(0, σ²) and the εij s are independent. µ is the average effect and βj is the factor (or treatment) effect at the jth level. Note that Σ_{j=1}^{k} βj = 0. The null hypothesis H0: µ1 = µ2 = · · · = µk (i.e. that the group means are all equal) can also be expressed as:
H0: β1 = β2 = · · · = βk = 0.
For two-way ANOVA, the model is:
Xij = µ + γi + βj + εij for i = 1, 2, . . . , r and j = 1, 2, . . . , c
where the εij s are independent N(0, σ²) random variables, γi is the ith row (block) effect and βj is the jth column (treatment) effect.
In total, there are n = r × c observations. We now consider the conditions to make the parameters µ, γi and βj identifiable for i = 1, 2, . . . , r and j = 1, 2, . . . , c. The conditions are:
γ1 + γ2 + · · · + γr = 0 and β1 + β2 + · · · + βc = 0.
We will be interested in testing the following hypotheses.
The (two-way) ANOVA decomposition is:
Σ_{i=1}^{r} Σ_{j=1}^{c} (Xij − X̄)² = c Σ_{i=1}^{r} (X̄i· − X̄)² + r Σ_{j=1}^{c} (X̄·j − X̄)² + Σ_{i=1}^{r} Σ_{j=1}^{c} (Xij − X̄i· − X̄·j + X̄)².
The total variation is a measure of the overall (total) variability in the data and the
(two-way) ANOVA decomposition decomposes this into three components:
between-blocks variation (which is attributable to the row factor level),
between-treatments variation (which is attributable to the column factor level) and
residual variation (which is attributable to the variation not explained by the row and
column factors).
The following are some useful formulae for manual computations.
Row sample means: X̄i· = Σ_{j=1}^{c} Xij/c, for i = 1, 2, . . . , r.
Column sample means: X̄·j = Σ_{i=1}^{r} Xij/r, for j = 1, 2, . . . , c.
Overall sample mean: X̄ = Σ_{i=1}^{r} Σ_{j=1}^{c} Xij/n = Σ_{i=1}^{r} X̄i·/r = Σ_{j=1}^{c} X̄·j/c.
Total SS = Σ_{i=1}^{r} Σ_{j=1}^{c} X²ij − rcX̄².
Between-blocks (rows) variation: Brow = c Σ_{i=1}^{r} X̄²i· − rcX̄².
Between-treatments (columns) variation: Bcol = r Σ_{j=1}^{c} X̄²·j − rcX̄².
Residual SS = (Total SS) − Brow − Bcol = Σ_{i=1}^{r} Σ_{j=1}^{c} X²ij − c Σ_{i=1}^{r} X̄²i· − r Σ_{j=1}^{c} X̄²·j + rcX̄².
As with one-way ANOVA, two-way ANOVA results are presented in a table as follows:
Source         DF               SS           MS                             F                          p-value
Row factor     r − 1            Brow         Brow/(r − 1)                   (c − 1)Brow/Residual SS    p
Column factor  c − 1            Bcol         Bcol/(c − 1)                   (r − 1)Bcol/Residual SS    p
Residual       (r − 1)(c − 1)   Residual SS  Residual SS/((r − 1)(c − 1))
Total          rc − 1           Total SS
10.8 Residuals
Before considering an example of two-way ANOVA, we briefly consider residuals.
Recall the original two-way ANOVA model:
Xij = µ + γi + βj + εij .
The point estimators of the parameters are µ̂ = X̄, γ̂i = X̄i· − X̄ and β̂j = X̄·j − X̄, and the residuals are:
ε̂ij = Xij − µ̂ − γ̂i − β̂j
for i = 1, 2, . . . , r and j = 1, 2, . . . , c.
The two-way ANOVA model assumes εij ∼ N (0, σ 2 ) and so, if the model structure is
correct, then the ε̂ij s should behave like independent N(0, σ²) random variables.
Example 10.6 The following table lists the percentage annual returns (calculated
four times per annum) of the Common Stock Index at the New York Stock
Exchange during 1981–85, available in the data file ‘NYSE.csv’.
Here r = 5 (years) and c = 4 (quarters). We compute:
brow = c Σ_{i=1}^{r} x̄²i· − rcx̄² = 4 × 138.6112 − 534.578 = 19.867
and:
bcol = r Σ_{j=1}^{c} x̄²·j − rcx̄² = 5 × 107.036 − 534.578 = 0.602.
To test the no row effect hypothesis H0: γ1 = γ2 = · · · = γ5 = 0, the test statistic value is:
f = (c − 1)brow/Residual SS = 3 × 19.867/4.013 = 14.852.
Under H0, F ∼ Fr−1, (r−1)(c−1) = F4, 12. Since F0.01, 4, 12 = 5.412 < 14.852, we reject H0 at the 1% significance level. We conclude that there is strong evidence that the return does depend on the year.
To test the no column effect hypothesis H0 : β1 = β2 = β3 = β4 = 0, the test statistic
value is:
f = (r − 1)bcol/Residual SS = 4 × 0.602/4.013 = 0.600.
Under H0 , F ∼ Fc−1, (r−1)(c−1) = F3, 12 . Since F0.10, 3, 12 = 2.606 > 0.600, we cannot
reject H0 even at the 10% significance level. Therefore, there is no significant
evidence indicating that the return depends on the quarter.
The results may be summarised in a two-way ANOVA table as follows:
Source DF SS MS F p-value
Year 4 19.867 4.967 14.852 < 0.01
Quarter 3 0.602 0.201 0.600 > 0.10
Residual 12 4.013 0.334
Total 19 24.482
We could also provide 95% confidence interval estimates for each block and
treatment level by using the pooled estimator of σ 2 , which is:
S² = Residual SS/((r − 1)(c − 1)) = Residual MS.
For the given data, s2 = 0.334.
R produces the following output:
Response: Return
Df Sum Sq Mean Sq F value Pr(>F)
Year 4 19.867 4.9667 14.852 0.0001349 ***
Quarter 3 0.602 0.2007 0.600 0.6271918
Residuals 12 4.013 0.3344
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
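The call producing this output is not shown; one plausible version, assuming the data frame read from 'NYSE.csv' has columns Return, Year and Quarter, is:

NYSE <- read.csv("NYSE.csv")
anova(lm(Return ~ factor(Year) + factor(Quarter), data = NYSE))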
Note that the confidence intervals for years 1 and 2 (corresponding to 1981 and
1982) are separated from those for years 3 to 5 (that is, 1983 to 1985), which is
consistent with rejection of H0 in the no row effect test. In contrast, the confidence
intervals for each quarter all overlap, which is consistent with our failure to reject H0
in the no column effect test.
Finally, we may also look at the residuals:
ε̂ij = Xij − µ̂ − γ̂i − β̂j for i = 1, 2, . . . , r and j = 1, 2, . . . , c.
If the assumed normal model (structure) is correct, the ε̂ij s should behave like
independent N (0, σ 2 ) random variables.
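A quick residual check along these lines can be sketched in R (assuming the fitted model object is stored as fit, e.g. from the lm call suggested above):

res <- residuals(fit)
qqnorm(res); qqline(res)   # points close to the line support normality
plot(fitted(fit), res)     # no visible pattern suggests the structure is adequate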
A total of 4,000 cans are opened around the world every second. Ten babies are
conceived around the world every second. Each time you open a can, you stand
a 1-in-400 chance of falling pregnant.
(True or false?)
Chapter 11
Linear regression
derive from first principles the least squares estimators of the intercept and slope in
the simple linear regression model
explain how to construct confidence intervals and perform hypothesis tests for the
intercept and slope in the simple linear regression model
summarise the multiple linear regression model with several explanatory variables,
and explain its interpretation
11.3 Introduction
Regression analysis is one of the most frequently-used statistical techniques. It aims
to model an explicit relationship between one dependent variable, often denoted as y,
and one or more regressors (also called covariates, or independent variables), often
denoted as x1 , x2 , . . . , xp .
The goal of regression analysis is to understand how y depends on x1 , x2 , . . . , xp and to
predict or control the unobserved y based on the observed x1 , x2 , . . . , xp . We start with
some simple examples with p = 1.
Example 11.2 The data file ‘WeightHeight.csv’ contains the heights, x, and
weights, y, of 69 students in a class.
We plot y against x, and draw a straight line through the middle of the data cloud:
y = β0 + β1 x + ε
where ε stands for a random error term, β0 is the intercept and β1 is the slope of the
straight line.
For a given height, x, the predicted value yb = β0 + β1 x may be viewed as a kind of
‘standard weight’.
Example 11.3 Some other possible examples of y and x are shown in the following
table.
y x
Sales Price
Weight gain Protein in diet
Present FTSE 100 index Past FTSE 100 index
Consumption Income
Salary Tenure
Daughter’s height Mother’s height
In most cases, there are several x variables involved. We will consider such situations
later in this chapter.
How to draw a line through data clouds, i.e. how to estimate β0 and β1 ?
How accurate is the fitted line?
What is the error in predicting a future y?
Secondly:
∂L(β0, β1)/∂β1 = −2 Σ_{i=1}^{n} xi(yi − β0 − β1xi).
Hence:
β̂1 = Σ_{i=1}^{n} xi(yi − ȳ) / Σ_{i=1}^{n} xi(xi − x̄) = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)² and β̂0 = ȳ − β̂1x̄.
The estimator β̂1 above is based on the fact that, for any constant c, we have:
Σ_{i=1}^{n} xi(yi − ȳ) = Σ_{i=1}^{n} (xi − c)(yi − ȳ)
since:
Σ_{i=1}^{n} c(yi − ȳ) = c Σ_{i=1}^{n} (yi − ȳ) = 0.
Given that Σ_{i=1}^{n} (xi − x̄) = 0, it follows that Σ_{i=1}^{n} c(xi − x̄) = 0 for any constant c.
An alternative derivation is as follows. Note that L(β0, β1) = Σ_{i=1}^{n} (yi − β0 − β1xi)². For any β0 and β1, we have:
L(β0, β1) = Σ_{i=1}^{n} (yi − β̂0 − β̂1xi + β̂0 − β0 + (β̂1 − β1)xi)²
          = L(β̂0, β̂1) + Σ_{i=1}^{n} (β̂0 − β0 + (β̂1 − β1)xi)² + 2B     (11.1)
where:
B = Σ_{i=1}^{n} (β̂0 − β0 + (β̂1 − β1)xi)(yi − β̂0 − β̂1xi)
  = (β̂0 − β0) Σ_{i=1}^{n} (yi − β̂0 − β̂1xi) + (β̂1 − β1) Σ_{i=1}^{n} xi(yi − β̂0 − β̂1xi).
Now choose (β̂0, β̂1) to satisfy:
Σ_{i=1}^{n} (yi − β̂0 − β̂1xi) = 0 and Σ_{i=1}^{n} xi(yi − β̂0 − β̂1xi) = 0.     (11.2)
Then B = 0, and since the middle term in (11.1) is non-negative, L(β0, β1) ≥ L(β̂0, β̂1) for all β0 and β1. Hence (β̂0, β̂1) are the least squares estimators (LSEs) of β0 and β1, respectively.
To find the explicit expressions from (11.2), note the first equation can be written as n(ȳ − β̂0 − β̂1x̄) = 0. Hence β̂0 = ȳ − β̂1x̄. Substituting this into the second equation, we have:
0 = Σ_{i=1}^{n} xi(yi − ȳ − β̂1(xi − x̄)) = Σ_{i=1}^{n} xi(yi − ȳ) − β̂1 Σ_{i=1}^{n} xi(xi − x̄).
Therefore:
β̂1 = Σ_{i=1}^{n} xi(yi − ȳ) / Σ_{i=1}^{n} xi(xi − x̄) = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)².
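These closed-form LSEs translate directly into R; a minimal sketch with simulated data (all names hypothetical):

set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50)    # true beta0 = 2, beta1 = 0.5
beta1.hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0.hat <- mean(y) - beta1.hat * mean(x)
c(beta0.hat, beta1.hat)
coef(lm(y ~ x))                 # agrees with the formulas above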
We now explore the properties of the LSEs β̂0 and β̂1, showing that their means and variances are:
E(β̂0) = β0 and Var(β̂0) = σ² Σ_{i=1}^{n} x²i / (n Σ_{i=1}^{n} (xi − x̄)²)
and:
E(β̂1) = β1 and Var(β̂1) = σ² / Σ_{i=1}^{n} (xi − x̄)².
Proof: Recall we treat the xi s as constants, and we have E(yi ) = β0 + β1 xi and also
Var(yi ) = σ 2 . Hence:
E(ȳ) = E((1/n) Σ_{i=1}^{n} yi) = (1/n) Σ_{i=1}^{n} E(yi) = (1/n) Σ_{i=1}^{n} (β0 + β1xi) = β0 + β1x̄.
Therefore:
E(yi − ȳ) = β0 + β1 xi − (β0 + β1 x̄) = β1 (xi − x̄).
Consequently, we have:
E(β̂1) = E(Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)²) = Σ_{i=1}^{n} (xi − x̄)E(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)² = β1 Σ_{i=1}^{n} (xi − x̄)² / Σ_{i=1}^{n} (xi − x̄)² = β1.
Now:
E(β̂0) = E(ȳ − β̂1x̄) = β0 + β1x̄ − β1x̄ = β0.
Therefore, the LSEs β̂0 and β̂1 are unbiased estimators of β0 and β1, respectively.
To work out the variances, the key is to write β̂1 and β̂0 as linear estimators (i.e. linear combinations of the yi s):
β̂1 = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / Σ_{k=1}^{n} (xk − x̄)² = Σ_{i=1}^{n} (xi − x̄)yi / Σ_{k=1}^{n} (xk − x̄)² = Σ_{i=1}^{n} ai yi
where ai = (xi − x̄) / Σ_{k=1}^{n} (xk − x̄)², and:
β̂0 = ȳ − β̂1x̄ = ȳ − Σ_{i=1}^{n} ai x̄ yi = Σ_{i=1}^{n} (1/n − ai x̄) yi.
Note that:
Σ_{i=1}^{n} ai = 0 and Σ_{i=1}^{n} a²i = 1 / Σ_{k=1}^{n} (xk − x̄)².
By this lemma:
Var(β̂1) = Var(Σ_{i=1}^{n} ai yi) = σ² Σ_{i=1}^{n} a²i = σ² / Σ_{k=1}^{n} (xk − x̄)²
and:
Var(β̂0) = σ² Σ_{i=1}^{n} (1/n − ai x̄)² = σ² (1/n + Σ_{i=1}^{n} a²i x̄²) = (σ²/n)(1 + nx̄² / Σ_{k=1}^{n} (xk − x̄)²) = σ² Σ_{k=1}^{n} x²k / (n Σ_{k=1}^{n} (xk − x̄)²).
Assume that:
ε1, ε2, . . . , εn ∼ IID N(0, σ²).
Then:
yi ∼ N(β0 + β1xi, σ²).
Since any linear combination of normal random variables is also normal, the LSEs of β0 and β1 (as linear estimators) are also normal random variables. In fact:
β̂0 ∼ N(β0, σ² Σ_{i=1}^{n} x²i / (n Σ_{i=1}^{n} (xi − x̄)²)) and β̂1 ∼ N(β1, σ² / Σ_{i=1}^{n} (xi − x̄)²).
The estimated standard errors are obtained by replacing σ² with its estimator σ̂², giving:
E.S.E.(β̂0) = σ̂ (Σ_{i=1}^{n} x²i / (n Σ_{i=1}^{n} (xi − x̄)²))^{1/2} and E.S.E.(β̂1) = σ̂ / (Σ_{i=1}^{n} (xi − x̄)²)^{1/2}.
The following results all make use of distributional results introduced earlier in the
course. Statistical inference (confidence intervals and hypothesis testing) for the normal
simple linear regression model can then be performed.
i. We have:
(n − 2)σ̂²/σ² = Σ_{i=1}^{n} (yi − β̂0 − β̂1xi)²/σ² ∼ χ²n−2.
ii. It holds that:
(β̂0 − β0)/E.S.E.(β̂0) ∼ tn−2.
iii. Likewise:
(β̂1 − β1)/E.S.E.(β̂1) ∼ tn−2
where tα, k denotes the top 100αth percentile of the Student’s tk distribution, obtained
from Table 7 of Murdoch and Barnes’ Statistical Tables.
H0: β1 = 0 vs. H1: β1 ≠ 0.
Under H0, the test statistic is:
T = β̂1/E.S.E.(β̂1) ∼ tn−2.
At the 100α% significance level, we reject H0 if |t| > tα/2, n−2 , where t is the observed
test statistic value.
Alternatively, we could use H1 : β1 < 0 or H1 : β1 > 0 if there was a rationale for
doing so. In such cases, we would reject H0 if t < −tα, n−2 and t > tα, n−2 for the
lower-tailed and upper-tailed t tests, respectively.
i. For testing H0 : β1 = b for a given constant b, the above test still applies, but now
with the following test statistic:
T = (β̂1 − b)/E.S.E.(β̂1).
ii. Tests for the regression intercept β0 may be constructed in a similar manner, replacing β1 and β̂1 with β0 and β̂0, respectively.
In the normal regression model, the LSEs β̂0 and β̂1 are also the MLEs of β0 and β1, respectively.
Since εi = yi − β0 − β1xi ∼ IID N(0, σ²), the likelihood function is:
L(β0, β1, σ²) = Π_{i=1}^{n} (1/√(2πσ²)) exp(−(yi − β0 − β1xi)²/(2σ²)) ∝ (1/σ²)^{n/2} exp(−(1/(2σ²)) Σ_{i=1}^{n} (yi − β0 − β1xi)²).
Maximising L over β0 and β1, for any fixed σ², amounts to minimising Σ_{i=1}^{n} (yi − β0 − β1xi)², so the MLEs of β0 and β1 coincide with the LSEs. Writing u = 1/σ², maximising over σ² is then equivalent to maximising:
g(u) = n ln u − ub
where b = Σ_{i=1}^{n} (yi − β̂0 − β̂1xi)². Setting g′(u) = n/u − b = 0 gives û = n/b, i.e. σ̂² = b/n.
We now proceed to test H0 : β1 = 0 vs. H1 : β1 > 0. (If indeed smoking contributes
to CHD mortality, then β1 > 0.)
We have calculated β̂1 = 0.06. However, is this deviation from zero due to sampling error, or is it significantly different from zero? (The magnitude of β̂1 itself is not important in determining if β1 = 0 or not – changing the scale of x may make β̂1 arbitrarily small.)
Under H0, the test statistic is:
T = β̂1/E.S.E.(β̂1) ∼ tn−2 = t19
where E.S.E.(β̂1) = σ̂/(Σi (xi − x̄)²)^{1/2} = 0.01293.
Since t = 0.06/0.01293 = 4.64 > 2.54 = t0.01, 19 , we reject the hypothesis β1 = 0 at
the 1% significance level and we conclude that there is strong evidence that smoking
contributes to CHD mortality.
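The critical value and p-value used here are easy to check in R:

t.stat <- 0.06 / 0.01293                  # 4.64
qt(0.99, df = 19)                         # 2.539, the top 1% point of t_19
pt(t.stat, df = 19, lower.tail = FALSE)   # one-sided p-value, far below 0.01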
Regression (explained) SS is β̂²1 Σ_{i=1}^{n} (xi − x̄)² = β̂²1 (Σ_{i=1}^{n} x²i − nx̄²).
Residual (error) SS is Σ_{i=1}^{n} (yi − β̂0 − β̂1xi)² = Total SS − Regression SS.
Under H0: β1 = 0, we have:
β̂²1 Σ_{i=1}^{n} (xi − x̄)²/σ² ∼ χ²1 and Σ_{i=1}^{n} (yi − β̂0 − β̂1xi)²/σ² ∼ χ²n−2
and these are independent, so the test statistic is:
F = Regression SS / (Residual SS/(n − 2)) ∼ F1, n−2 under H0.
We reject H0 at the 100α% significance level if f > Fα, 1, n−2 , where f is the observed
test statistic value and Fα, 1, n−2 is the top 100αth percentile of the F1, n−2 distribution,
obtained from Table 9 of Murdoch and Barnes’ Statistical Tables.
A useful statistic is the coefficient of determination, denoted as R², defined as:
R² = Regression SS/Total SS = 1 − Residual SS/Total SS.
If we view Total SS as the total variation (or energy) of y, then R2 is the proportion of
the total variation of y explained by x. Note that R2 ∈ [0, 1]. The closer R2 is to 1, the
better the explanatory power of the regression model.
ŷ = β̂0 + β̂1x.
For the analysis to be more informative, we would like to have some ‘error bars’ for our
prediction. We introduce two methods as follows.
Standardising gives:
(µ̂(x) − µ(x)) / ((σ²/n) Σ_{i=1}^{n} (xi − x)² / Σ_{j=1}^{n} (xj − x̄)²)^{1/2} ∼ N(0, 1).
Replacing σ² with its estimator σ̂² gives:
(µ̂(x) − µ(x)) / ((σ̂²/n) Σ_{i=1}^{n} (xi − x)² / Σ_{j=1}^{n} (xj − x̄)²)^{1/2} ∼ tn−2.
Such a confidence interval contains the true expectation E(y) = µ(x) with probability
1 − α over repeated samples. It does not cover y with probability 1 − α.
For predicting y itself, note that:
Var(y) + Var(µ̂(x)) = σ² + (σ²/n) Σ_{i=1}^{n} (xi − x)² / Σ_{j=1}^{n} (xj − x̄)².
Therefore:
(y − µ̂(x)) / (σ̂² (1 + (1/n) Σ_{i=1}^{n} (xi − x)² / Σ_{j=1}^{n} (xj − x̄)²))^{1/2} ∼ tn−2.
i. It holds that:
P(y ∈ µ̂(x) ± tα/2, n−2 × σ̂ × (1 + (1/n) Σ_{i=1}^{n} (xi − x)² / Σ_{j=1}^{n} (xj − x̄)²)^{1/2}) = 1 − α.
ii. The prediction interval for y is wider than the confidence interval for E(y). The
former contains the unobserved random variable y with probability 1 − α, the
latter contains the unknown constant E(y) with probability 1 − α over repeated
samples.
Example 11.5 The dataset ‘UsedFord.csv’ contains the prices (y, in $000s) of 100
three-year-old Ford Tauruses together with their mileages (x, in thousands of miles)
when they were sold at auction. Based on these data, a car dealer needs to make two
decisions.
1. To prepare cash for bidding on one three-year-old Ford Taurus with a mileage of
x = 40.
Call:
lm(formula = Price ~ Mileage)
Residuals:
Min 1Q Median 3Q Max
-0.68679 -0.27263 0.00521 0.23210 0.70071
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.248727 0.182093 94.72 <2e-16 ***
Mileage -0.066861 0.004975 -13.44 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We predict that a Ford Taurus will sell for between $13,922 and $15,227. The
average selling price of several three-year-old Ford Tauruses is estimated to be
between $14,498 and $14,650. Because predicting the selling price for one car is more
difficult, the corresponding prediction interval is wider than the confidence interval.
To produce the plots with confidence intervals for E(y) and prediction intervals for
y, we proceed as follows:
(Figure: scatter plot of Price against Mileage, with the fitted regression line, 95% confidence intervals for E(y) and 95% prediction intervals for y.)
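One plausible way to produce such a plot in R (the exact plotting code is not shown here; column names as in the regression output above) is:

UsedFord <- read.csv("UsedFord.csv")
fit <- lm(Price ~ Mileage, data = UsedFord)
new <- data.frame(Mileage = seq(20, 50, length.out = 100))
ci <- predict(fit, new, interval = "confidence")   # for E(y)
pi <- predict(fit, new, interval = "prediction")   # for a single y
plot(UsedFord$Mileage, UsedFord$Price, pch = 16, xlab = "Mileage", ylab = "Price")
matlines(new$Mileage, ci, lty = c(1, 2, 2), col = 1)
matlines(new$Mileage, pi[, 2:3], lty = 3, col = 1)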
Let (yi, xi1, xi2, . . . , xip), for i = 1, 2, . . . , n, be observations from the model:
yi = β0 + β1xi1 + β2xi2 + · · · + βpxip + εi
where the εi s are independent N(0, σ²) error terms.
The multiple linear regression model is a natural extension of the simple linear
regression model, just with more parameters: β0 , β1 , β2 , . . . , βp and σ 2 .
Treating all of the xij s as constants as before, we have E(yi) = β0 + β1xi1 + · · · + βpxip and Var(yi) = σ².
Estimation of the intercept and slope parameters is still performed using least squares estimation. The LSEs β̂0, β̂1, β̂2, . . . , β̂p are obtained by minimising:
Σ_{i=1}^{n} (yi − β0 − Σ_{j=1}^{p} βj xij)².
Just as with the simple linear regression model, we can decompose the total variation of
y such that:
Σ_{i=1}^{n} (yi − ȳ)² = Σ_{i=1}^{n} (ŷi − ȳ)² + Σ_{i=1}^{n} ε̂²i
or, in words:
Total SS = Regression SS + Residual SS.
An unbiased estimator of σ 2 is:
σ̂² = (1/(n − p − 1)) Σ_{i=1}^{n} (yi − β̂0 − Σ_{j=1}^{p} β̂j xij)² = Residual SS/(n − p − 1).
H0: βi = 0 vs. H1: βi ≠ 0.
Under H0, the test statistic is:
T = β̂i/E.S.E.(β̂i) ∼ tn−p−1
and we reject H0 if |t| > tα/2, n−p−1 . However, note the slight difference in the
interpretation of the slope coefficient βj . In the multiple regression setting, βj is the
effect of xj on y, holding all other independent variables fixed – this is unfortunately
not always practical.
It is also possible to test whether all the regression coefficients are equal to zero. This is
known as a joint test of significance and can be used to test the overall significance
of the regression model, i.e. whether there is at least one significant explanatory
(independent) variable, by testing:
H0: β1 = β2 = · · · = βp = 0 vs. H1: at least one βj ≠ 0.
Indeed, it is preferable to perform this joint test of significance before conducting t tests
of individual slope coefficients. Failure to reject H0 would render the model useless and
hence the model would not warrant any further statistical investigation.
Provided εi ∼ N (0, σ 2 ), under H0 : β1 = β2 = · · · = βp = 0, the test statistic is:
F = ((Regression SS)/p) / ((Residual SS)/(n − p − 1)) ∼ Fp, n−p−1.
Example 11.6 We illustrate the use of linear regression in R using the dataset
‘Armand.csv’, introduced in Example 11.1.
Call:
lm(formula = Sales ~ Student.population)
Residuals:
Min 1Q Median 3Q Max
-21.00 -9.75 -3.00 11.25 18.00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 60.0000 9.2260 6.503 0.000187 ***
Student.population 5.0000 0.5803 8.617 2.55e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
where the test statistic follows an F1, 8 distribution under H0.
Example 11.7 We apply the simple linear regression model to study the
relationship between two series of financial returns – a regression of Cisco Systems
stock returns, y, on S&P500 Index returns, x. This regression model is an example of
the capital asset pricing model (CAPM).
Stock returns are defined as:
return = (current price − previous price)/previous price ≈ ln(current price/previous price)
when the difference between the two prices is small.
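In R, for a vector of daily prices (the name p below is hypothetical), percentage log returns can be computed as:

returns <- 100 * diff(log(p))   # log returns, in per cent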
The data file ‘Returns.csv’ contains daily returns over the period 3 January – 29
December 2000 (i.e. n = 252 observations). The dataset has 5 columns: Day, S&P500
return, Cisco return, Intel return and Sprint return.
Daily prices are definitely not independent. However, daily returns may be seen as a
sequence of uncorrelated random variables.
> summary(S.P500)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-6.00451 -0.85028 -0.03791 -0.04242 0.79869 4.65458
> summary(Cisco)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-13.4387 -3.0819 -0.1150 -0.1336 2.6363 15.4151
For the S&P500, the average daily return is −0.04%, the maximum daily return is 4.65%, the minimum daily return is −6.00% and the standard deviation is 1.40%.
For Cisco, the average daily return is −0.13%, the maximum daily return is 15.42%,
the minimum daily return is −13.44% and the standard deviation is 4.23%.
We see that Cisco is much more volatile than the S&P500.
(Figure: time series plots of the daily returns on the S&P500 and Cisco.)
There is clear synchronisation between the movements of the two series of returns,
as evident from examining the sample correlation coefficient.
> cor.test(S.P500,Cisco)
Call:
lm(formula = Cisco ~ S.P500)
Residuals:
Min 1Q Median 3Q Max
-13.1175 -2.0238 0.0091 2.0614 9.9491
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.04547 0.19433 -0.234 0.815
S.P500 2.07715 0.13900 14.943 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
iii. Variance is a simple measure (and one of the most frequently-used) of risk in
finance.
Example 11.8 The data in the file ‘Foods.csv’ illustrate the effects of marketing
instruments on the weekly sales volume of a certain food product over a three-year
period. Data are real but transformed to protect the innocent!
There are observations on the following four variables:
> summary(Foods)
LVOL PROMP FEAT DISP
Min. :13.83 Min. :3.075 Min. : 2.84 Min. :12.42
1st Qu.:14.08 1st Qu.:3.330 1st Qu.:15.95 1st Qu.:20.59
Median :14.24 Median :3.460 Median :22.99 Median :25.11
Mean :14.28 Mean :3.451 Mean :24.84 Mean :25.31
3rd Qu.:14.43 3rd Qu.:3.560 3rd Qu.:33.49 3rd Qu.:29.34
Max. :15.07 Max. :3.865 Max. :57.10 Max. :45.94
n = 156. The values of FEAT and DISP are much larger than LVOL.
As always, first we plot the data to ascertain basic characteristics.
(Figure: time series plot of LVOL.)
> plot(PROMP,LVOL,pch=16)
(Figure: scatter plot of LVOL against PROMP.)
> plot(FEAT,LVOL,pch=16)
(Figure: scatter plot of LVOL against FEAT.)
> plot(DISP,LVOL,pch=16)
(Figure: scatter plot of LVOL against DISP.)
Call:
lm(formula = LVOL ~ PROMP + FEAT)
Residuals:
Min 1Q Median 3Q Max
-0.32734 -0.08519 -0.01011 0.08471 0.30804
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.1500102 0.2487489 68.94 <2e-16 ***
PROMP -0.9042636 0.0694338 -13.02 <2e-16 ***
FEAT 0.0100666 0.0008827 11.40 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Consider now introducing DISP into the regression model to give three explanatory
variables:
y = β0 + β1 x1 + β2 x2 + β3 x3 + ε.
The reason for adding the third variable is that one would expect DISP to have an
impact on sales and we may wish to estimate its magnitude.
Call:
lm(formula = LVOL ~ PROMP + FEAT + DISP)
Residuals:
Min 1Q Median 3Q Max
-0.33363 -0.08203 -0.00272 0.07927 0.33812
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.2372251 0.2490226 69.220 <2e-16 ***
PROMP -0.9564415 0.0726777 -13.160 <2e-16 ***
FEAT 0.0101421 0.0008728 11.620 <2e-16 ***
DISP 0.0035945 0.0016529 2.175 0.0312 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
All the estimated coefficients have the right sign (according to commercial common
sense!) and are statistically significant. In particular, the relationship with DISP
seems real when the other inputs are taken into account. On the other hand, the addition of DISP to the model has resulted in a very small reduction in σ̂, from √0.0161 = 0.1268 to √0.0157 = 0.1253, and correspondingly a slightly higher R²
(0.7633, i.e. 76.33% of the variation of LVOL is explained by the model). Therefore,
DISP contributes very little to ‘explaining’ the variation of LVOL after the other
two explanatory variables, PROMP and FEAT, are taken into account.
Intuitively, we would expect a higher R2 if we add a further explanatory variable to
the model. However, the model has become more complex as a result – there is an
additional parameter to estimate. Therefore, strictly speaking, we should consider
the ‘adjusted R2 ’ statistic, although this will not be considered in this course.
Special care should be exercised when predicting with x out of the range of the
observations used to fit the model, which is called extrapolation.
Appendix A
Sampling distributions of statistics
Solution:
where F ∼ F5, 17, using Table 9 (practice of which will be covered later in the course).
(d) A chi-squared random variable only assumes non-negative values. Hence each
of A, B and C is non-negative, so A3 + B 3 + C 3 ≥ 0, and:
P (A3 + B 3 + C 3 < 0) = 0.
Solution:
(a) We have X1 ∼ N (0, 9) and X2 ∼ N (0, 9). Hence 2X2 ∼ N (0, 36) and
X1 + 2X2 ∼ N (0, 45). So:
P(X1 + 2X2 > 9) = P(Z > 9/√45) = P(Z > 1.34) = 0.0901.
(b) We have X1/3 ∼ N(0, 1) and X2/3 ∼ N(0, 1). Hence X²1/9 ∼ χ²1 and X²2/9 ∼ χ²1. Therefore, X²1/9 + X²2/9 = Y, say, follows a χ²2 distribution, from which the required probability is obtained.
(c) We have X²1/9 + X²2/9 ∼ χ²2 and also X²3/9 + X²4/9 ∼ χ²2. So:
(X²1 + X²2)/(X²3 + X²4) = ((X²1 + X²2)/18)/((X²3 + X²4)/18) ∼ F2, 2.
Hence:
P((X²1 + X²2) > 99(X²3 + X²4)) = P(Y > 99) = 0.01
where Y ∼ F2, 2.
Solution:
(a) We have Xi ∼ N (0, 4), for i = 1, 2, 3, hence:
X1 − X2 − X3 ∼ N (0, 12).
So:
P (X1 > X2 + X3 ) = P (X1 − X2 − X3 > 0) = P (Z > 0) = 0.5.
(b) We have Xi/2 ∼ N(0, 1), so X²i/4 ∼ χ²1 for i = 1, 2, 3. Hence:
2X²1/(X²2 + X²3) = ((X²1/4)/1)/(((X²2 + X²3)/4)/2) ∼ F1, 2.
So:
P(X²1 > 9.25(X²2 + X²3)) = P(2X²1/(X²2 + X²3) > 9.25 × 2) = P(Y > 18.5) = 0.05
where Y ∼ F1, 2.
(c) We have:
P(X1 > 5(X²2 + X²3)^{1/2}) = P(X1/2 > 5(X²2/4 + X²3/4)^{1/2})
                            = P(X1/2 > 5√2 ((X²2/4 + X²3/4)/2)^{1/2})
i.e. P(Y1 > 5√2 Y2), where Y1 ∼ N(0, 1) and Y2 = (χ²2/2)^{1/2}, or P(Y3 > 7.07), where Y3 ∼ t2. From Table 7, this is approximately 0.01.
Solution:
(a) We have Xi ∼ N(0, 4), for i = 1, 2, 3, 4, hence 3X1 ∼ N(0, 36) and 4X2 ∼ N(0, 64). Therefore, 3X1 + 4X2 ∼ N(0, 100) and:
(3X1 + 4X2)/10 = Z ∼ N(0, 1).
(b) We have Xi/2 ∼ N(0, 1), for i = 1, 2, 3, 4, hence (X²3 + X²4)/4 ∼ χ²2. So:
P(X1 > k√(X²3 + X²4)) = 0.025 = P(T > k√2)
where T ∼ t2, and hence k√2 = 4.303, so k = 3.04268.
(c) We have (X²1 + X²2 + X²3)/4 ∼ χ²3, so:
P(X²1 + X²2 + X²3 < k) = 0.9 = P(X < k/4)
where X ∼ χ²3, and:
(X²2 + X²4)/(X²1 + X²3) ∼ F2, 2.
So, from Table 9, k = 0.05.
6. Suppose that the heights of students are normally distributed with a mean of 68.5
inches and a standard deviation of 2.7 inches. If 200 random samples of size 25 are
drawn from this population with means recorded to the nearest 0.1 inch, find:
(a) the expected mean and standard deviation of the sampling distribution of the
mean
(b) the expected number of recorded sample means which fall between 67.9 and
69.2 inclusive
(c) the expected number of recorded sample means falling below 67.0.
Solution:
(a) The sampling distribution of the mean of 25 observations has the same mean as the population, which is 68.5 inches. The standard deviation (standard error) of the sample mean is 2.7/√25 = 0.54.
(b) Notice that the samples are random, so we cannot be sure exactly how many
will have means between 67.9 and 69.2 inches. We can work out the probability
that the sample mean will lie in this interval using the sampling distribution:
X̄ ∼ N (68.5, (0.54)2 ).
We need to make a continuity correction, to account for the fact that the
recorded means are rounded to the nearest 0.1 inch. For example, the
probability that the recorded mean is ≥ 67.9 inches is the same as the
probability that the sample mean is > 67.85. Therefore, the probability we
want is:
P(67.85 < X̄ < 69.25) = P((67.85 − 68.5)/0.54 < Z < (69.25 − 68.5)/0.54)
                      = P(−1.20 < Z < 1.39)
                      = Φ(1.39) − Φ(−1.20)
                      = 0.9177 − 0.1151
                      = 0.8026.
As usual, the values of Φ(1.39) and Φ(−1.20) can be found from Table 3 of
Murdoch and Barnes’ Statistical Tables. Since there are 200 independent
random samples drawn, we can now think of each as a single trial. The
recorded mean lies between 67.9 and 69.2 with probability 0.8026 at each trial.
We are dealing with a binomial distribution with n = 200 trials and
probability of success π = 0.8026. The expected number of successes is:
nπ = 200 × 0.8026 = 160.52.
(c) The probability that the recorded mean is < 67.0 inches is:
P(X̄ < 66.95) = P(Z < (66.95 − 68.5)/0.54) = P(Z < −2.87) = Φ(−2.87) = 0.00205
so the expected number of recorded means below 67.0 out of a sample of 200 is:
200 × 0.00205 = 0.41.
Alternatively, we can use the fact that Z 2 follows a χ21 distribution. From Table 8
of Murdoch and Barnes’ Statistical Tables we can see that 3.841 is the 5% right-tail
value for this distribution, and so P (Z 2 < 3.84) = 0.95, as before.
Since X1 /2 and X2 /2 are independent N (0, 1) random variables, the sum of their
squares will follow a χ22 distribution. Using Table 8 of Murdoch and Barnes’
Statistical Tables, we see that 9.210 is the 1% right-tail value, so the probability we
are looking for is 0.99.
P(X²1 + X²2 < 7.236Y − X²3) = P(X²1 + X²2 + X²3 < 7.236Y)
                            = P((X²1 + X²2 + X²3)/Y < 7.236)
                            = P(((X²1 + X²2 + X²3)/3)/(Y/5) < (5/3) × 7.236)
                            = P(((X²1 + X²2 + X²3)/3)/(Y/5) < 12.060).
Since X12 + X22 + X32 ∼ χ23 , we have a ratio of independent χ23 and χ25 random
variables, each divided by its degrees of freedom. By definition, this follows an F3, 5
distribution. From Table 9 of Murdoch and Barnes’ Statistical Tables, we see that
12.060 is the 1% upper-tail value for this distribution, so the probability we want is
equal to 0.99.
10. Compare the normal distribution approximation to the exact values for the
upper-tail probabilities for the binomial distribution with 100 trials and probability
of success 0.1.
Solution:
Let R ∼ Bin(100, 0.1) denote the exact number of successes. It has mean and variance:
E(R) = nπ = 100 × 0.1 = 10
and:
Var(R) = nπ(1 − π) = 100 × 0.1 × 0.9 = 9
so we use the approximation R ∼̇ N(10, 9) or, equivalently:
(R − 10)/√9 = (R − 10)/3 ∼̇ N(0, 1).
Applying a continuity correction of 0.5 (for example, 7.8 successes are rounded up
to 8) gives:
P(R ≥ r) ≈ P(Z > (r − 0.5 − 10)/3).
The results are summarised in the following table. The first column is the number
of successes; the second gives the exact binomial probabilities; the third column
lists the corresponding z-values (with the continuity correction); and the fourth
gives the probabilities for the normal approximation.
Although the agreement between columns two and four is not too bad, you may
think it is not as close as you would like for some applications.
r P (R ≥ r) z = (r − 0.5 − 10)/3 P (Z > z)
1 0.999973 −3.1667 0.999229
2 0.999678 −2.8333 0.997697
3 0.998055 −2.5000 0.993790
4 0.992164 −2.1667 0.984870
5 0.976289 −1.8333 0.966624
6 0.942423 −1.5000 0.933193
7 0.882844 −1.1667 0.878327
8 0.793949 −0.8333 0.797672
9 0.679126 −0.5000 0.691462
10 0.548710 −0.1667 0.566184
11 0.416844 0.1667 0.433816
12 0.296967 0.5000 0.308538
13 0.198179 0.8333 0.202328
14 0.123877 1.1667 0.121673
15 0.072573 1.5000 0.066807
16 0.039891 1.8333 0.033376
17 0.020599 2.1667 0.015130
18 0.010007 2.5000 0.006210
19 0.004581 2.8333 0.002303
20 0.001979 3.1667 0.000771
21 0.000808 3.5000 0.000233
22 0.000312 3.8333 0.000063
23 0.000114 4.1667 0.000015
24 0.000040 4.5000 0.000003
25 0.000013 4.8333 0.000001
26 0.000004 5.1667 0.000000
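The two probability columns of this table can be reproduced in R as follows:

r <- 1:26
exact  <- pbinom(r - 1, size = 100, prob = 0.1, lower.tail = FALSE)  # P(R >= r)
approx <- pnorm((r - 0.5 - 10) / 3, lower.tail = FALSE)              # normal approximation
round(cbind(r, exact, approx), 6)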
2. Suppose that we plan to take a random sample of size n from a normal distribution
with mean µ and standard deviation σ = 2.
(a) Suppose µ = 4 and n = 20.
i. What is the probability that the mean X̄ of the sample is greater than 5?
ii. What is the probability that X̄ is smaller than 3?
iii. What is P (|X̄ − µ| ≤ 1) in this case?
(b) How large should n be in order that P (|X̄ − µ| ≤ 0.5) ≥ 0.95 for every possible
value of µ?
(c) It is claimed that the true value of µ is 5 in a population. A random sample of
size n = 100 is collected from this population, and the mean for this sample is
x̄ = 5.8. Based on the result in (b), what would you conclude from this value
of X̄?
Appendix B
Point estimation
Therefore, setting µ̂1 = M1, we have:
θ̂/2 = X̄ ⇒ θ̂ = 2X̄ = (2/n) Σ_{i=1}^{n} Xi.
3. Let X ∼ Bin(n, π), where n is known. Find the methods of moments estimator
(MME) of π.
Solution:
The pf of the binomial distribution is:
P(X = x) = (n!/(x!(n − x)!)) π^x (1 − π)^{n−x} for x = 0, 1, 2, . . . , n
so E(X) = nπ. Setting nπ̂ = X gives the MME π̂ = X/n.
and:
E(X²) = ∫_a^∞ x²λ exp(−λ(x − a)) dx = ∫_0^∞ (y/λ + a)² e^{−y} dy = 2/λ² + 2a/λ + a².
5. Let {X1 , X2 , . . . , Xn } be a random sample from the distribution N (µ, 1). Find the
maximum likelihood estimator (MLE) of µ.
Solution:
The joint pdf of the observations is:
f(x1, x2, . . . , xn; µ) = Π_{i=1}^{n} (1/√(2π)) exp(−(xi − µ)²/2) = (2π)^{−n/2} exp(−(1/2) Σ_{i=1}^{n} (xi − µ)²).
As a function of µ this is the likelihood, and maximising it amounts to minimising Σ_{i=1}^{n} (xi − µ)², which is achieved at µ̂ = X̄, the MLE.
and:
l(λ) = ln L(λ) = nX̄ ln(λ) − nλ + C
where C is a constant (i.e. it may depend on the Xi s but cannot depend on the parameter). Setting:
(d/dλ) l(λ) = n(X̄/λ − 1) = 0
we obtain the MLE λ̂ = X̄, which is also the MME.
Solution:
(a) The pdf of Uniform[0, θ] is:
f(x; θ) = 1/θ for 0 ≤ x ≤ θ, and 0 otherwise.
The joint pdf is:
f(x1, x2, . . . , xn; θ) = θ^{−n} for 0 ≤ x1, x2, . . . , xn ≤ θ, and 0 otherwise.
In fact f(x1, x2, . . . , xn; θ), as a function of θ, is the likelihood function, L(θ). The maximum likelihood estimator of θ is the value at which the likelihood function L(θ) achieves its maximum. Note:
L(θ) = θ^{−n} for X(n) ≤ θ, and 0 otherwise
where X(n) = max_i Xi. Since θ^{−n} is decreasing in θ, the likelihood is maximised at the smallest permissible value of θ. Hence the MLE is θ̂ = X(n), which is different from the MME. For example, if x(n) = 1.16, the likelihood is zero for θ < 1.16 and decreasing for θ ≥ 1.16, so it is maximised at θ̂ = 1.16.
(b) For the given data, the maximum observation is x(3) = 3.6. Therefore, the
maximum likelihood estimate is θb = 3.6.
8. Use the observed random sample x1 = 8.2, x2 = 10.6, x3 = 9.1 and x4 = 4.9 to
calculate the maximum likelihood estimate of λ in the exponential pdf:
f(x; λ) = λe^{−λx} for x ≥ 0, and 0 otherwise.
Solution:
We derive a general formula with a random sample {X1, X2, . . . , Xn} first. The joint pdf is:
f(x1, x2, . . . , xn; λ) = λⁿ e^{−λnx̄} for x1, x2, . . . , xn ≥ 0, and 0 otherwise.
Hence l(λ) = ln L(λ) = n ln(λ) − λnX̄. Setting:
(d/dλ) l(λ) = n/λ − nX̄ = 0 ⇒ λ̂ = 1/X̄.
For the given sample, x̄ = (8.2 + 10.6 + 9.1 + 4.9)/4 = 8.2. Therefore, λ̂ = 1/8.2 = 0.1220.
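A quick numerical check in R:

x <- c(8.2, 10.6, 9.1, 4.9)
1 / mean(x)   # 0.1220, the maximum likelihood estimate of lambda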
9. The following data show the number of occupants in passenger cars observed
during one hour at a busy junction. It is assumed that these data follow a
geometric distribution with pf:
p(x; π) = (1 − π)^{x−1} π for x = 1, 2, . . . , and 0 otherwise.
However, we only know that there are 678 xi s equal to 1, 227 xi s equal to 2, . . .,
and 14 xi s equal to some integers not smaller than 6.
Note that:
P(Xi ≥ 6) = Σ_{x=6}^{∞} p(x; π) = π(1 − π)⁵ (1 + (1 − π) + (1 − π)² + · · · ) = π(1 − π)⁵ × (1/π) = (1 − π)⁵.
Hence the likelihood is:
L(π) = p(1; π)^{678} p(2; π)^{227} p(3; π)^{56} p(4; π)^{28} p(5; π)^{8} ((1 − π)⁵)^{14}
     = π^{1,011−14} (1 − π)^{227+56×2+28×3+8×4+14×5}
     = π^{997} (1 − π)^{525}
hence:
l(π) = ln L(π) = 997 ln(π) + 525 ln(1 − π).
Setting:
(d/dπ) l(π) = 997/π − 525/(1 − π) = 0 ⇒ π̂ = 997/(997 + 525) = 0.655.
Remark: Since P(Xi = 1) = π, π̂ = 0.655 indicates that about 2/3 of cars have only
one occupant. Note E(Xi ) = 1/π. In order to ensure that the average number of
occupants is not smaller than k, we require π < 1/k.
Since n > 2, we can see that θb1 has a lower variance than θb2 , so it is a better
estimator. Unsurprisingly, we obtain a better estimator of θ by considering the
whole sample, rather than just the first two values.
Solution:
We need to introduce the term E(θ̂) inside the expectation, so we add and subtract it to obtain:
MSE(θ̂) = E((θ̂ − θ)²)
        = E(((θ̂ − E(θ̂)) − (θ − E(θ̂)))²)
        = E((θ̂ − E(θ̂))² − 2(θ̂ − E(θ̂))(θ − E(θ̂)) + (θ − E(θ̂))²)
        = E((θ̂ − E(θ̂))²) − 2E((θ̂ − E(θ̂))(θ − E(θ̂))) + E((θ − E(θ̂))²).
The first term is Var(θ̂) and the third is (Bias(θ̂))², since θ − E(θ̂) is a constant. The middle term vanishes because E(θ̂ − E(θ̂)) = 0. Hence MSE(θ̂) = Var(θ̂) + (Bias(θ̂))².
15. Let {X1 , X2 , . . . , Xn } be a random sample from a Bin(m, π) distribution, with both
m and π unknown. Find the method of moments estimators of m, the number of
trials, and π, the probability of success.
Solution:
There are two unknown parameters, so we need two equations. The expectation and variance of a Bin(m, π) distribution are mπ and mπ(1 − π), respectively, so we have:
µ1 = E(X) = mπ
and:
µ2 = Var(X) + (E(X))² = mπ(1 − π) + (mπ)².
Setting the first two sample and population moments equal gives:
(1/n) Σ_{i=1}^{n} Xi = m̂π̂ and (1/n) Σ_{i=1}^{n} X²i = m̂π̂(1 − π̂) + (m̂π̂)².
The two equations need to be solved simultaneously. Solving the first equation for π̂ gives:
π̂ = (Σ_{i=1}^{n} Xi/n)/m̂ = X̄/m̂.
Now we can substitute π̂ into the second moment equation to obtain:
(1/n) Σ_{i=1}^{n} X²i = m̂(X̄/m̂)(1 − X̄/m̂) + (m̂X̄/m̂)² = X̄(1 − X̄/m̂) + X̄²
which can be solved for m̂:
m̂ = X̄²/(X̄ + X̄² − (1/n) Σ_{i=1}^{n} X²i).
16. Consider again the Uniform[−θ, θ] distribution from Question 14. Suppose that we
observe the following data:
which implies that the data came from a Uniform[−2.518, 2.518] distribution.
However, this clearly cannot be true since the observation x5 = 2.8 falls outside this
range! The method of moments does not take into account that all of the
observations need to lie in the interval [−θ, θ], and so it fails to produce a useful
estimate.
17. Let {X1 , X2 , . . . , Xn } be a random sample from an Exp(λ) distribution. Find the
MLE of λ.
Solution:
The likelihood function is:
L(λ) = Π_{i=1}^{n} f(xi; λ) = Π_{i=1}^{n} λe^{−λXi} = λⁿ e^{−λ Σi Xi} = λⁿ e^{−λnX̄}
so the log-likelihood is l(λ) = n ln(λ) − λnX̄. Setting l′(λ) = n/λ − nX̄ = 0 gives λ̂ = 1/X̄. Since:
(d²/dλ²) l(λ) = −n/λ² < 0
the stationary point is indeed a maximum.
18. Let {X1 , X2 , . . . , Xn } be a random sample from a N (µ, σ 2 ) distribution. Find the
MLE of σ 2 if:
(a) µ is known
(b) µ is unknown.
In each case, work out if the MLE is an unbiased estimator of σ 2 .
Solution:
The likelihood function is:
L(µ, σ²) = Π_{i=1}^{n} f(xi; µ, σ²) = Π_{i=1}^{n} (1/√(2πσ²)) exp(−(Xi − µ)²/(2σ²)) = (2πσ²)^{−n/2} exp(−(1/(2σ²)) Σ_{i=1}^{n} (Xi − µ)²)
and hence the log-likelihood is:
l(µ, σ²) = −(n/2) ln(2πσ²) − (1/(2σ²)) Σ_{i=1}^{n} (Xi − µ)².
Differentiating with respect to σ 2 and setting the derivative equal to zero gives:
(d/dσ²) l(µ, σ²) = −n/(2σ̂²) + (1/(2σ̂⁴)) Σ_{i=1}^{n} (Xi − µ)² = 0.
If µ is known, we can solve this equation for σ̂²:
n/(2σ̂²) = (1/(2σ̂⁴)) Σ_{i=1}^{n} (Xi − µ)² ⇒ σ̂² = (1/n) Σ_{i=1}^{n} (Xi − µ)².
One can check that this is indeed a maximum. We can work out the bias of this estimator directly:
E(σ̂²) = E((1/n) Σ_{i=1}^{n} (Xi − µ)²) = (σ²/n) Σ_{i=1}^{n} E(((Xi − µ)/σ)²) = (σ²/n) Σ_{i=1}^{n} E(Z²i) = (σ²/n) × n = σ²
where the Zi s are standard normal, so the MLE is unbiased when µ is known.
(b) If µ is unknown, we must also maximise the likelihood over µ. Whatever the value of σ², we need to ensure that Σ_{i=1}^{n} (Xi − µ)² is minimised. However, we have:
Σ_{i=1}^{n} (Xi − µ)² = Σ_{i=1}^{n} (Xi − X̄)² + n(X̄ − µ)².
Only the second term on the right-hand side depends on µ and, because of the
square, its minimum value is zero. It is minimised when µ is equal to the sample
mean, so this is the MLE of µ: µ̂ = X̄. The resulting MLE of σ² is:
σ̂² = (1/n) Σ_{i=1}^{n} (Xi − X̄)².
This is not the same as the sample variance S 2 , where we divide by n − 1 instead of
n. The expectation of the MLE of σ 2 is:
E(σ̂²) = E((1/n) Σ_{i=1}^{n} (Xi − X̄)²) = (1/n) E((n − 1)S²) = (σ²/n) E((n − 1)S²/σ²).
The term inside the expectation, (n − 1)S 2 /σ 2 , follows a χ2n−1 distribution, and so:
E(σ̂²) = (σ²/n)(n − 1).
This is not equal to σ², so the MLE of σ² is a biased estimator in this case. (Note that S² is an unbiased estimator of σ².) The bias of the MLE is:
Bias(σ̂²) = E(σ̂²) − σ² = (σ²/n)(n − 1) − σ² = −σ²/n
which tends to zero as n → ∞. In such cases, we say that the estimator is
asymptotically unbiased.
4. Given a random sample of n values from a normal distribution with unknown mean and variance, consider the following two estimators of σ² (the unknown population variance), where Sxx = Σ (Xi − X̄)²:
T1 = Sxx/(n − 1) and T2 = Sxx/n.
For each of these determine its bias, its variance and its mean squared error. Which has the smaller mean squared error?
Hint: use the fact that Var(S²) = 2σ⁴/(n − 1) for a random sample of size n, or some equivalent formula.
y 1 = α + β + ε1
y2 = −α + β + ε2
y 3 = α − β + ε3
y4 = −α − β + ε4 .
The group was alarmed to find that if you are a labourer, cleaner or dock
worker, you are twice as likely to die than a member of the professional classes.
(The Sunday Times, 31 August 1980)
Appendix C
Interval estimation
Solution:
(a) With an available random sample {X1, X2, . . . , Xn} from the normal distribution N(µ, σ²) with σ² known, a 95% confidence interval for µ is of the form:
(X̄ − 1.96 × σ/√n, X̄ + 1.96 × σ/√n).
Hence the width of the confidence interval is:
(X̄ + 1.96 × σ/√n) − (X̄ − 1.96 × σ/√n) = 2 × 1.96 × σ/√n = 3.92 × σ/√n.
(b) Let 3.92 × σ/√n ≤ d, and so we obtain the condition for the required sample size:
n ≥ (3.92 × σ/d)² = 15.37 × σ²/d².
Therefore, in order to achieve the required accuracy, the sample size n should be at least as large as 15.37 × σ²/d².
Note that as the variance σ² increases, the confidence interval width increases, and as the sample size n increases, the width decreases. Also, note that when σ² is unknown, the width of a confidence interval for µ depends on S. Therefore, the width is a random variable.
2. The data below are from a random sample of size n = 9 taken from the distribution
N (µ, σ 2 ):
3.75, 5.67, 3.14, 7.89, 3.40, 9.32, 2.80, 10.34 and 14.31.
(a) Assume σ 2 = 16. Find a 95% confidence interval for µ. If the width of such a
confidence interval must not exceed 2.5, at least how many observations do we
need?
(b) Suppose σ 2 is now unknown. Find a 95% confidence interval for µ. Compare
the result with that obtained in (a) and comment.
(c) Obtain a 95% confidence interval for σ 2 .
Solution:
(a) We have x̄ = 6.74. For a 95% confidence interval, α = 0.05 so we need to find the top 100α/2 = 2.5th percentile of N(0, 1), which is 1.96. Since σ = 4 and n = 9, a 95% confidence interval for µ is:
x̄ ± 1.96 × σ/√n ⇒ (6.74 − 1.96 × 4/3, 6.74 + 1.96 × 4/3) = (4.13, 9.35).
In general, a 100(1 − α)% confidence interval for µ is:
(X̄ − zα/2 × σ/√n, X̄ + zα/2 × σ/√n)
where zα denotes the top 100αth percentile of the standard normal distribution, i.e. such that P(Z > zα) = α, where Z ∼ N(0, 1). Hence the width of the confidence interval is:
2 × zα/2 × σ/√n.
For this example, α = 0.05, z0.025 = 1.96 and σ = 4. Setting the width of the confidence interval to be at most 2.5, we have:
2 × 1.96 × σ/√n = 15.68/√n ≤ 2.5.
Hence:
n ≥ (15.68/2.5)² = 39.34.
So we need a sample of at least 40 observations in order to obtain a 95% confidence interval with a width not greater than 2.5.
(b) When σ 2 is unknown, a 95% confidence interval for µ is:
(X̄ − tα/2, n−1 × S/√n, X̄ + tα/2, n−1 × S/√n)
where S² = Σ_{i=1}^{n} (Xi − X̄)²/(n − 1), and tα, k denotes the top 100αth percentile of the Student's tk distribution, i.e. such that P(T > tα, k) = α for T ∼ tk. For this example, s² = 16, s = 4, n = 9 and t0.025, 8 = 2.306. Hence a 95% confidence interval for µ is:
6.74 ± 2.306 × 4/3 ⇒ (3.67, 9.81).
This confidence interval is much wider than the one obtained in (a). Since we
do not know σ 2 , we have less information available for our estimation. It is
only natural that our estimation becomes less accurate.
Note that although the sample size is n, the Student’s t distribution used has
only n − 1 degrees of freedom. The loss of 1 degree of freedom in the sample
variance is due to not knowing µ. Hence we estimate µ using the data, for
which we effectively pay a ‘price’ of one degree of freedom.
(c) Note (n − 1)S²/σ² ∼ χ²n−1 = χ²8. From Table 8 of Murdoch and Barnes' Statistical Tables, for X ∼ χ²8, we find that:
P(X < 2.180) = P(X > 17.535) = 0.025.
Hence:
P(2.180 < 8S²/σ² < 17.535) = 0.95.
Therefore, the lower bound for σ² is 8s²/17.535 = 7.298, and the upper bound is 8s²/2.180 = 58.716. Therefore, a 95% confidence interval for σ², noting s² = 16, is:
(7.30, 58.72).
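The same interval can be computed in R from the quantities above:

n <- 9; s2 <- 16
c((n - 1) * s2 / qchisq(0.975, df = n - 1),   # lower bound, 7.30
  (n - 1) * s2 / qchisq(0.025, df = n - 1))   # upper bound, 58.72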
Note that the estimation in this example is rather inaccurate. This is due to
two reasons.
i. The sample size is small.
ii. The population variance, σ 2 , is large.
3. Assume that the random variable X is normally distributed and that σ 2 is known.
What confidence level would be associated with each of the following intervals?
(a) (x̄ − 1.645 × σ/√n, x̄ + 2.326 × σ/√n).
(b) (−∞, x̄ + 2.576 × σ/√n).
(c) (x̄ − 1.645 × σ/√n, x̄).
Solution:
We have X̄ ∼ N(µ, σ²/n), hence √n(X̄ − µ)/σ ∼ N(0, 1).
(a) P (−1.645 < Z < 2.326) = 0.94, hence a 94% confidence level.
(b) P (−∞ < Z < 2.576) = 0.995, hence a 99.5% confidence level.
(c) P (−1.645 < Z < 0) = 0.45, hence a 45% confidence level.
5. A personnel manager has found that historically the scores on aptitude tests given
to applicants for entry-level positions are normally distributed with σ = 32.4
points. A random sample of nine test scores from the current group of applicants
had a mean score of 187.9 points.
(a) Find an 80% confidence interval for the population mean score of the current
group of applicants.
(b) Based on these sample results, a statistician found for the population mean a
confidence interval extending from 165.8 to 210.0 points. Find the confidence
level of this interval.
Solution:
(a) We have n = 9, x̄ = 187.9, σ = 32.4 and 1 − α = 0.80, hence α/2 = 0.10 and, from Table 3 of Murdoch and Barnes' Statistical Tables, P(Z > 1.282) = 1 − Φ(1.282) = 0.10. So an 80% confidence interval is:
187.9 ± 1.282 × 32.4/√9 ⇒ (174.05, 201.75).
(b) The half-width of the confidence interval is 210.0 − 187.9 = 22.1, which is equal to the margin of error, i.e. we have:
22.1 = k × σ/√n = k × 32.4/√9 ⇒ k = 2.05.
P(Z > 2.05) = 1 − Φ(2.05) = 0.02018 = α/2 ⇒ α = 0.04036. Hence we have a 100(1 − α)% = 100(1 − 0.04036)% ≈ 96% confidence interval.
Solution:
(a) We have n = 10, s² = (2.36)² = 5.5696, χ²0.975, 9 = 2.700 and χ²0.025, 9 = 19.023. Hence a 95% confidence interval for σ² is:
((n − 1)s²/χ²0.025, n−1, (n − 1)s²/χ²0.975, n−1) = (9 × 5.5696/19.023, 9 × 5.5696/2.700) = (2.64, 18.57).
(b) A higher confidence level would give a wider interval, since:
χ²0.995, n−1 < χ²0.975, n−1 and χ²0.005, n−1 > χ²0.025, n−1.
7. Why do we not always choose a very high confidence level for a confidence interval?
Solution:
We do not always want to use a very high confidence level because the confidence
interval would be very wide. We have a trade-off between the width of the
confidence interval and the coverage probability.
8. Suppose that 9 bags of sugar are selected from the supermarket shelf at random
and weighed. The weights in grammes are 812.0, 786.7, 794.1, 791.6, 811.1, 797.4,
797.8, 800.8 and 793.2. Construct a 95% confidence interval for the mean weight of
all the bags on the shelf. Assume the population is normal.
Solution:
Here we have a random sample of size n = 9. The mean is 798.30. The sample
variance is s2 = 72.76, which gives a sample standard deviation s = 8.53. From
Table 7 of Murdoch and Barnes’ Statistical Tables, the top 2.5th percentile of the t
distribution with n − 1 = 8 degrees of freedom is 2.306. Therefore, a 95%
confidence interval is:
(798.30 − 2.306 × 8.53/√9, 798.30 + 2.306 × 8.53/√9) = (798.30 − 6.56, 798.30 + 6.56) = (791.74, 804.86).
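In R, the same interval is produced by t.test:

x <- c(812.0, 786.7, 794.1, 791.6, 811.1, 797.4, 797.8, 800.8, 793.2)
t.test(x)$conf.int   # 95% confidence interval for the mean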
9. Continuing Question 8, suppose we are now told that σ, the population standard deviation, is known to be 8.5 g. Construct a 95% confidence interval using this information.
Solution:
From Table 7 of Murdoch and Barnes' Statistical Tables, the top 2.5th percentile of the standard normal distribution is z0.025 = 1.96 (recall t∞ = N(0, 1)), so a 95% confidence interval for the population mean is:
(798.30 − 1.96 × 8.5/√9, 798.30 + 1.96 × 8.5/√9) = (798.30 − 5.55, 798.30 + 5.55) = (792.75, 803.85).
Again, it may be more useful to write this as 798.30 ± 5.55. Note that this confidence interval is narrower than the one in Question 8, even though our initial estimate s turned out to be very close to the true value of σ.
10. Construct a 90% confidence interval for the variance of the bags of sugar in Question 8. Does the given value of 8.5 g for the population standard deviation seem plausible?
Solution:
We have n = 9 and s2 = 72.76. For a 90% confidence interval, we need the bottom
and top 5th percentiles of the chi-squared distribution on n − 1 = 8 degrees of
freedom. These are:
χ²0.95, 8 = 2.733 and χ²0.05, 8 = 15.507.
A 90% confidence interval is:
((n − 1)S²/χ²α/2, n−1, (n − 1)S²/χ²1−α/2, n−1) = ((9 − 1) × 72.76/15.507, (9 − 1) × 72.76/2.733) = (37.536, 213.010).
The corresponding values for the standard deviation are:
(√37.536, √213.010) = (6.127, 14.595).
The given value falls well within this confidence interval, so we have no reason to
doubt it.
2. (a) A sample of 954 adults in early 1987 found that 23% of them held shares.
Given a UK adult population of 41 million and assuming a proper random
sample was taken, construct a 95% confidence interval estimate for the number
of shareholders in the UK.
(b) A ‘similar’ survey the previous year had found a total of 7 million shareholders.
Assuming ‘similar’ means the same sample size, construct a 95% confidence
interval estimate of the increase in shareholders between the two years.
A statistician took the Dale Carnegie Course, improving his confidence from
95% to 99%.
(Anon)
Appendix D
Hypothesis testing
Solution:
The critical value for the test is z0.95 = −1.645 and the probability of rejecting H0 with this test is:
P((X̄ − 7)/(0.25/√n) < −1.645)
which we rewrite as:
P((X̄ − 6.95)/(0.25/√n) < (7 − 6.95)/(0.25/√n) − 1.645).
Therefore:
(7 − 6.95)/(0.25/√n) − 1.645 = 1.282
0.2√n = 2.927
√n = 14.635
n = 214.1832.
So to ensure that the test power is at least 90%, we should use a sample size of 215.
Remark: We see a rather large sample size is required. Hence investigators are
encouraged to use sample sizes large enough to come to rational decisions.
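The algebra of this sample size calculation can be checked numerically in R:

n <- ((qnorm(0.95) + qnorm(0.90)) * 0.25 / 0.05)^2   # 214.18
ceiling(n)                                           # 215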
2. A doctor claims that the average European is more than 8.5 kg overweight. To test
this claim, a random sample of 12 Europeans were weighed, and the difference
between their actual weight and their ideal weight was calculated. The data are:
14, 12, 8, 13, −1, 10, 11, 15, 13, 20, 7 and 14.
Assuming the data follow a normal distribution, conduct a t test to infer at the 5%
significance level whether or not the doctor’s claim is true.
Solution:
We have a random sample of size n = 12 from N (µ, σ 2 ), and we test H0 : µ = 8.5
vs. H1 : µ > 8.5. The test statistic, under H0 , is:
T = (X̄ − 8.5)/(S/√n) = (X̄ − 8.5)/(S/√12) ∼ t11.
Hence:
t = (11.333 − 8.5)/√(26.606/12) = 1.903 > 1.796 = t0.05, 11
so we reject H0 at the 5% significance level. There is significant evidence to support
the doctor’s claim.
Solution:
(a) We test:
H0 : σ 2 = 8 vs. H1 : σ 2 > 8.
The test statistic, under H0 , is:
T = (n − 1)S²/σ²0 = 20S²/8 ∼ χ²20.
We reject H0 at the 5% significance level if t ≥ χ²0.05, 20 = 31.410, where t is the observed test statistic value.
(b) To evaluate the power, we need the probability of rejecting H0 (which happens if t ≥ 31.410) conditional on the actual value of σ², that is:
P(T ≥ 31.410 | σ² = k) = P(T × 8/k ≥ 31.410 × 8/k)
H0 : µX = µY vs. H1 : µX > µY .
The sample means and standard deviations are x̄ = 389.5, ȳ = 307.8, sX = 55.40
and sY = 69.45. The test statistic and its distribution under H0 are:
s
n+m−2 X̄ − Ȳ
T = ×p ∼ tn+m−2
1/n + 1/m (n − 1)SX 2
+ (m − 1)SY2
and we obtain, for the given data, t = 2.175 > 1.833 = t0.05, 9 hence we reject H0
that the mean weights are equal and conclude that the mean weight for the
high-protein diet is greater at the 5% significance level.
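Remark: the question's data are not reproduced above, but sample sizes of n = 6 and m = 5 (so n + m − 2 = 9 degrees of freedom) are consistent with the printed value t = 2.175. Under that assumption, a Python sketch of the pooled two-sample calculation is:

import math
from scipy import stats

n, m = 6, 5                      # assumed, to match the 9 degrees of freedom
xbar, ybar = 389.5, 307.8
sx, sy = 55.40, 69.45

pooled_var = ((n - 1)*sx**2 + (m - 1)*sy**2) / (n + m - 2)
t_stat = (xbar - ybar) / math.sqrt(pooled_var * (1/n + 1/m))
print(t_stat)                        # about 2.175
print(stats.t.ppf(0.95, n + m - 2))  # t_{0.05,9} = 1.833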
5. Suppose that we have two independent samples from normal populations with
known variances. We want to test the H0 that the two population means are equal
against the alternative that they are different. We could use each sample by itself
to write down 95% confidence intervals and reject H0 if these intervals did not
overlap. What would be the significance level of this test?
Solution:
Assume H0 : µX = µY is true. Then the two 95% confidence intervals do not
overlap if and only if:

X̄ − 1.96 × σX/√n ≥ Ȳ + 1.96 × σY/√m or Ȳ − 1.96 × σY/√m ≥ X̄ + 1.96 × σX/√n.
Equivalently, H0 is rejected if and only if |X̄ − Ȳ| ≥ 1.96 × (σX/√n + σY/√m). Under
H0 we have X̄ − Ȳ ∼ N(0, σX²/n + σY²/m), so when the two standard errors are equal
(σX/√n = σY/√m) this probability is P(|Z| ≥ 2 × 1.96/√2) = P(|Z| ≥ 2.77) ≈ 0.006.
The significance level is therefore about 0.6%, which is much smaller than the usual
conventions of 5% and 1%. Putting variability into two confidence intervals makes
them more likely to overlap than you might think, and so your chance of
incorrectly rejecting the null hypothesis is smaller than you might expect!
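Remark: the 0.6% figure is easy to confirm by simulation. A sketch, assuming equal known variances and equal sample sizes (so that the two standard errors match):

import numpy as np

rng = np.random.default_rng(1)
reps, n, m, sigma = 200_000, 25, 25, 1.0
se = sigma / np.sqrt(n)                        # common standard error of each mean
xbar = rng.normal(0.0, se, reps)               # sample means generated under H0
ybar = rng.normal(0.0, se, reps)
reject = np.abs(xbar - ybar) > 2 * 1.96 * se   # intervals fail to overlap
print(reject.mean())                           # about 0.0056, i.e. roughly 0.6%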
6. The following table shows the number of salespeople employed by a company and
the corresponding value of sales (in £000s):
Compute the sample correlation coefficient for these data and carry out a formal
test for a (linear) relationship between the number of salespeople and sales.
Note that:

Σ xi = 2,616, Σ yi = 2,520, Σ xi² = 571,500, Σ yi² = 529,746 and Σ xiyi = 550,069.
Solution:
We test:
H0 : ρ = 0 vs. H1 : ρ > 0.
The corresponding test statistic and its distribution under H0 are:

T = ρ̂ √(n − 2) / √(1 − ρ̂²) ∼ tn−2.
We find ρb = 0.8716 and obtain t = 5.62 > 2.764 = t0.01, 10 and so we reject H0 at the
1% significance level. Since the test is highly significant, there is overwhelming
evidence of a (linear) relationship between the number of salespeople and the value
of sales.
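Remark: ρ̂ and t can be computed directly from the given sums; n = 12 is implied by the 10 degrees of freedom. A Python sketch:

import math

n = 12
sx, sy, sxx, syy, sxy = 2616, 2520, 571500, 529746, 550069
rho_hat = (sxy - sx*sy/n) / math.sqrt((sxx - sx**2/n) * (syy - sy**2/n))
t_stat = rho_hat * math.sqrt(n - 2) / math.sqrt(1 - rho_hat**2)
print(rho_hat, t_stat)   # about 0.8716 and 5.62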
7. Two independent samples from normal populations yield the following results:
Sample 1: n = 5, Σ (xi − x̄)² = 4.8
Sample 2: m = 7, Σ (yi − ȳ)² = 37.2

Test at the 5% significance level whether the population variances are the same
based on the above data.
Solution:
We test:
H0 : σ1² = σ2² vs. H1 : σ1² ≠ σ2².

Under H0, the test statistic is:

T = S1²/S2² ∼ Fn−1, m−1 = F4, 6.

Critical values are F0.975, 4, 6 = 1/F0.025, 6, 4 = 1/9.20 = 0.11 and F0.025, 4, 6 = 6.23,
using Table 9 of Murdoch and Barnes’ Statistical Tables. The test statistic value is:

t = (4.8/4) / (37.2/6) = 0.1935
and since 0.11 < 0.1935 < 6.23 we do not reject H0 , which means there is no
evidence of a difference in the variances.
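Remark: the critical values (and an exact p-value) can be obtained from scipy rather than tables. A sketch:

from scipy import stats

n, m = 5, 7
s1_sq, s2_sq = 4.8/(n - 1), 37.2/(m - 1)   # sample variances 1.2 and 6.2
f_stat = s1_sq / s2_sq                     # 0.1935
lower = stats.f.ppf(0.025, n - 1, m - 1)   # about 0.11
upper = stats.f.ppf(0.975, n - 1, m - 1)   # about 6.23
print(f_stat, (lower, upper))              # inside the interval: do not reject H0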
9. (a) Of 100 clinical trials, 5 have shown that wonder-drug ‘Zap2’ is better than the
standard treatment (aspirin). Should we be excited by these results?
(b) Of the 1,000 clinical trials of 1,000 different drugs this year, 30 trials found
drugs which seem better than the standard treatments with which they were
compared. The television news reports only the results of those 30 ‘successful’
trials. Should we believe these reports?
(c) A child welfare officer says that she has a test which always reveals when a
child has been abused, and she suggests it be put into general use. What is she
saying about Type I and Type II errors for her test?
Solution:
(a) If 5 clinical trials out of 100 report that Zap2 is better, this is consistent with
there being no difference whatsoever between Zap2 and aspirin if a 5% Type I
error probability is being used for tests in these clinical trials. With a 5%
significance level we expect 5 trials in 100 to show spurious significant results.
(b) If the television news reports the 30 successful trials out of 1,000, and those
trials use tests with a significance level of 5%, we may well choose to be very
cautious about believing the results. We would expect 50 spuriously significant
results in the 1,000 trial results.
(c) The welfare officer is saying that the Type II error has probability zero. The
test is always positive if the null hypothesis of no abuse is false. On the other
hand, the welfare officer is saying nothing about the probability of a Type I
error. It may well be that the probability of a Type I error is high, which
would lead to many false accusations of abuse when no abuse had taken place.
One should always think about both types of error when proposing a test.
10. A machine is designed to fill bags of sugar. The weight of the bags is normally
distributed with standard deviation σ. If the machine is correctly calibrated, σ
should be no greater than 20 g. We collect a random sample of 18 bags and weigh
them. The sample standard deviation is found to be equal to 32.48 g. Is there any
evidence that the machine is incorrectly calibrated?
Solution:
This is a hypothesis test for the variance of a normal population, so we will use the
chi-squared distribution. Let:
X1, X2, . . . , X18 ∼ N(µ, σ²)

be the weights of the bags in the sample. An appropriate test has hypotheses:

H0 : σ² = (20)² = 400 vs. H1 : σ² > 400.
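Remark: a Python sketch completing the computation under these hypotheses (the numbers below are computed here, not quoted from the text):

from scipy import stats

n, s, sigma0 = 18, 32.48, 20.0
test_stat = (n - 1) * s**2 / sigma0**2    # 17 x 32.48^2 / 400 = 44.84
critical = stats.chi2.ppf(0.95, n - 1)    # chi-squared_{0.05,17} = 27.587
p_value = stats.chi2.sf(test_stat, n - 1) # well below 0.001
print(test_stat, critical, p_value)       # 44.84 > 27.59: strong evidence of miscalibration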
11. After the machine in Question 3 is calibrated, we collect a new sample of 21 bags.
The sample standard deviation of their weights is 23.72 g. Based on this sample,
can you conclude that the calibration has reduced the variance of the weights of the
bags?
Solution:
Let:
Y1, Y2, . . . , Y21 ∼ N(µY, σY²)
be the weights of the bags in the new sample, and use σX² to denote the variance of
the distribution of the previous sample, to avoid confusion. We want to test for a
reduction in variance, so we set:

H0 : σX²/σY² = 1 vs. H1 : σX²/σY² > 1.
The value of the test statistic in this case is:
sX²/sY² = (32.48)²/(23.72)² = 1.875.
If the null hypothesis is true, the test statistic will follow an F18−1, 21−1 = F17, 20
distribution.
At the 5% significance level, the upper-tail critical value of the F17, 20 distribution is
F0.05, 17, 20 = 2.17. Our test statistic does not exceed this value, so we cannot reject
the null hypothesis.
We move to the 10% significance level. The upper-tail critical value is
F0.10, 17, 20 = 1.821, so we can now reject the null hypothesis (if only barely). We
conclude that there is some evidence that the variance is reduced, but it is not very
strong evidence.
Notice the difference between the conclusions of these two tests. We have a much
more powerful test when we compare our sample standard deviation of 32.48 g to the
fixed standard deviation of 20 g, than when we compare it to an estimated standard
deviation of 23.72 g, even though the values are similar.
(c) If the sample is classified as A if the sample mean of log-lengths exceeds 0.75,
and the misclassification as A is to have a probability of 2%, what sample size
should be used and what is the probability of a B-type misclassification?
(d) If the sample comes from neither A nor B but from an environment with a
mean log-length of 0.70, what is the probability of classifying it as type A if
the decision procedure determined in (b) is applied?
2. In a wire-based nail manufacturing process the target length for cut wire is 22 cm.
It is known that lengths vary with a standard deviation equal to 0.08 cm. In order
to monitor this process, a random sample of 50 separate wires is accurately
measured and the process is regarded as operating satisfactorily (the null
hypothesis) if the sample mean length lies between 21.97 cm and 22.03 cm, so this
is the decision procedure used (i.e. if the sample mean falls within this range
then the null hypothesis is not rejected, otherwise the null hypothesis is rejected).
(a) Determine the probability of a Type I error for this test.
(b) Determine the probability of making a Type II error when the process is
actually cutting to a length of 22.05 cm.
(c) Find the probability of rejecting the null hypothesis when the true cutting
length is 22.01 cm. (This is the power of the test when the true mean is 22.01
cm.)
4. To instil customer loyalty, airlines, hotels, rental car companies, and credit card
companies (among others) have initiated frequency marketing programmes which
reward their regular customers. In the United States alone, millions of people are
members of the frequent-flier programmes of the airline industry. A large fast food
restaurant chain wished to explore the profitability of such a programme. They
randomly selected 12 of their 1,200 restaurants nationwide and instituted a
frequency programme which rewarded customers with a $5.00 gift certificate after
every 10 meals purchased at full price.
They ran the trial programme for three months. The restaurants not in the sample
had an average increase in profits of $1,047.34 over the previous three months,
whereas the restaurants in the sample had the following changes in profit:
Note that the last number is negative, representing a decrease in profits. Specify
the appropriate null and alternative hypotheses for determining whether the mean
profit change for restaurants with frequency programmes is significantly greater (in
a statistical sense which you should make clear) than $1,047.34.
5. Two companies supplying a television repair service are compared by their repair
times (in days). Random samples of recent repair times for these companies gave
the following statistics:
(a) Is there evidence that the companies differ in their true mean repair times?
Give an appropriate hypothesis test to support your conclusions.
(b) What is the p-value of your test?
(c) What difference would it have made if the sample sizes had each been smaller
by 35 (i.e. sizes 9 and 17, respectively)?
To p, or not to p?
(James Abdey, Ph.D. Thesis 2009. Available at http://etheses.lse.ac.uk/31)
Appendix E
Analysis of variance (ANOVA)
Solution:
(a) The means are 440/5 = 88, 630/7 = 90 and 690/10 = 69. We will perform a
one-way ANOVA. First, we calculate the overall mean. This is:

(440 + 630 + 690)/(5 + 7 + 10) = 1,760/22 = 80.
(b) As 5.56 > 3.52 = F0.05, 2, 19 , which is the top 5th percentile of the F2, 19
distribution (interpolated from Table 9 of Murdoch and Barnes’ Statistical
Tables), we reject H0 : µ1 = µ2 = µ3 and conclude that there is evidence that
the means are not equal.
(c) We have:
90 − 69 ± 2.093 × √( 200.53 × (1/7 + 1/10) ) = 21 ± 14.61.
Here 2.093 is the top 2.5th percentile point of the t distribution with 19
degrees of freedom. A 95% confidence interval is (6.39, 35.61). As zero is not
included, there is evidence of a difference.
2. The total times spent by three basketball players on court were recorded. Player A
was recorded on three occasions and the times were 29, 25 and 33 minutes. Player
B was recorded twice and the times were 16 and 30 minutes. Player C was recorded
on three occasions and the times were 12, 14 and 16 minutes. Use analysis of
variance to test whether there is any difference in the average times the three
players spend on court.
Solution:
We have x̄·A = 29, x̄·B = 23, x̄·C = 14 and x̄ = 21.875. Hence:
Source DF SS MS F p-value
Players 2 340.875 170.4375 6.175 ≈ 0.045
Error 5 138 27.6
Total 7 478.875
We test H0 : µ1 = µ2 = µ3 (i.e. the average times they play are the same) vs. H1 :
The average times they play are not the same.
As 6.175 > 5.79 = F0.05, 2, 5 , which is the top 5th percentile of the F2, 5 distribution,
we reject H0 and conclude that there is evidence of a difference between the means.
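Remark: scipy reproduces this table's F statistic and p-value in one line. A sketch:

from scipy import stats

player_a = [29, 25, 33]
player_b = [16, 30]
player_c = [12, 14, 16]
f_stat, p_value = stats.f_oneway(player_a, player_b, player_c)
print(f_stat, p_value)   # about 6.175 and 0.045, as in the table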
Solution:
We will perform a one-way ANOVA. First we calculate the overall mean:
(4 × 24 + 6 × 20 + 5 × 18)/15 = 20.4.
We can now calculate the sum of squares between groups:
Source DF SS MS F p-value
Sample 2 81.6 40.8 1.229 ≈ 0.327
Error 12 398.4 33.2
Total 14 480
As 1.229 < 3.89 = F0.05, 2, 12 , which is the top 5th percentile of the F2, 12
distribution, we see that there is no evidence that the means are not equal.
4. Four suppliers were asked to quote prices for seven different building materials. The
average quote of supplier A was 1,315.8. The average quote of suppliers B, C and D
were 1,238.4, 1,225.8 and 1,200.0, respectively. The following is the calculated
two-way ANOVA table with some entries missing.
Source      DF    SS        MS       F    p-value
Materials                   17,800
Suppliers
Error
Total             358,700
(a) Complete the table using the information provided above.
(b) Is there a significant difference between the quotes of different suppliers?
Explain your answer.
(c) Construct a 90% confidence interval for the difference between suppliers A and
D. Would you say there is a difference?
Solution:
(a) The average quote of all suppliers is:
1,315.8 + 1,238.4 + 1,225.8 + 1,200.0
= 1,245.
4
Hence the sum of squares (SS) due to suppliers is:

7 × ((1,315.8 − 1,245)² + (1,238.4 − 1,245)² + (1,225.8 − 1,245)² + (1,200.0 − 1,245)²) = 52,148.88.
The materials SS is 6 × 17,800 = 106,800, and by subtraction the error SS is
358,700 − 106,800 − 52,148.88 = 199,751.12, giving an error MS of
199,751.12/18 = 11,097.28. The F values are:

17,800/11,097.28 = 1.604 and 17,382.96/11,097.28 = 1.567

for materials and suppliers, respectively. The two-way ANOVA table is:
Source DF SS MS F p-value
Materials 6 106,800 17,800 1.604 ≈ 0.203
Suppliers 3 52,148.88 17,382.96 1.567 ≈ 0.232
Error 18 199,751.12 11,097.28
Total 27 358,700
(b) We test H0 : µ1 = µ2 = µ3 = µ4 (i.e. there is no difference between suppliers)
vs. H1 : There is a difference between suppliers. The F value is 1.567 and at a
5% significance level the critical value from Table 9 (degrees of freedom 3 and
18) is 3.16, hence we do not reject H0 and conclude that there is not enough
evidence that there is a difference.
(c) The top 5th percentile of the t distribution with 18 degrees of freedom is 1.734
and the MS value is 11,097.28. So a 90% confidence interval is:
1,315.8 − 1,200.0 ± 1.734 × √( 11,097.28 × (1/7 + 1/7) ) = 115.8 ± 97.64
giving (18.16, 213.44). Since zero is not in the interval, there appears to be a
difference between suppliers A and D.
Source DF SS MS F p-value
Drinker 1.56
Beer 303.5
Error 695.6
Total
giving (6.91, 26.09). As the interval does not contain zero, there is evidence of
a difference between the effects of beers C and D.
A B C D E
Early shift 102 93 85 110 72
Late shift 85 87 71 92 73
Night shift 75 80 75 77 76
Solution:
Here r = 3 and c = 5. We may obtain the two-way ANOVA table as follows:
Source DF SS MS F
Shift 2 652.13 326.07 5.62
Plant 4 761.73 190.43 3.28
Error 8 463.87 57.98
Total 14 1,877.73
Under the null hypothesis of no shift effect, F ∼ F2, 8 . Since F0.05, 2, 8 = 4.46 < 5.62,
we can reject the null hypothesis at the 5% significance level. (Note the p-value
= 0.030.)
Under the null hypothesis of no plant effect, F ∼ F4, 8 . Since F0.05, 4, 8 = 3.84 > 3.28,
we cannot reject the null hypothesis at the 5% significance level. (Note the p-value
= 0.072.)
Overall, the data collected show some evidence of a shift effect but little evidence
of a plant effect.
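Remark: the whole two-way decomposition can be checked with a few lines of numpy (a sketch; the layout of the data array mirrors the table above):

import numpy as np
from scipy import stats

x = np.array([[102, 93, 85, 110, 72],    # early shift
              [85, 87, 71, 92, 73],      # late shift
              [75, 80, 75, 77, 76]],     # night shift
             dtype=float)
r, c = x.shape
total_ss = ((x - x.mean())**2).sum()                  # 1,877.73
row_ss = c * ((x.mean(axis=1) - x.mean())**2).sum()   # 652.13 (shifts)
col_ss = r * ((x.mean(axis=0) - x.mean())**2).sum()   # 761.73 (plants)
error_ss = total_ss - row_ss - col_ss                 # 463.87
error_df = (r - 1) * (c - 1)                          # 8
f_shift = (row_ss/(r - 1)) / (error_ss/error_df)      # 5.62
f_plant = (col_ss/(c - 1)) / (error_ss/error_df)      # 3.28
print(stats.f.sf(f_shift, r - 1, error_df))           # about 0.030
print(stats.f.sf(f_plant, c - 1, error_df))           # about 0.072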
7. Complete the two-way ANOVA table below. In the places of p-values, indicate in
the form such as ‘< 0.01’ appropriately and use the closest value which you may
find from Murdoch and Barnes’ Statistical Tables.
Source DF SS MS F p-value
Row factor 4 ? 234.23 ? ?
Column factor 6 270.84 45.14 1.53 ?
Residual ? 708.00 ?
Total 34 1,915.76
Solution:
First, C2 SS = (C2 MS)×4 = 936.92.
The degrees of freedom for Error is 34 − 4 − 6 = 24. Therefore, Error MS
= 708.00/24 = 29.5.
Hence the F statistic for testing no C2 effect is 234.23/29.5 = 7.94. From Table 9 of
Murdoch and Barnes’ Statistical Tables, F0.001, 4, 24 = 6.59 < 7.94. Therefore, the
corresponding p-value is smaller than 0.001.
Since F0.05, 6, 24 = 2.51 > 1.53, the p-value for testing the C3 effect is greater than
0.05.
The complete ANOVA table is as follows:
Source DF SS MS F P
C2 4 936.92 234.23 7.94 <0.001
C3 6 270.84 45.14 1.53 >0.05
Error 24 708.00 29.5
Total 34 1,915.76
2. Does the level of success of publicly-traded companies affect the way their board
members are paid? The annual payments (in $000s) of randomly selected
publicly-traded companies to their board members were recorded. The companies
were divided into four quarters according to the returns in their stocks, and the
payments from each quarter were grouped together. Some summary statistics are
provided below.
Descriptive Statistics: 1st quarter, 2nd quarter, 3rd quarter, 4th quarter
A total of 4,000 cans are opened around the world every second. Ten babies are
conceived around the world every second. Each time you open a can, you stand
a 1-in-400 chance of falling pregnant.
(True or false?)
Appendix F
Linear regression
i. what is the value of β̂ if yi = xi for all i? What if they are the exact
opposites of each other, i.e. yi = −xi for all i?
ii. is it always the case that −1 ≤ β̂ ≤ 1?
Solution:
(a) The estimator β̂ is sensible because it is the least squares estimator of β, which
provides the ‘best’ fit to the data in terms of minimising the sum of squared
residuals.
(b) The estimator β̂ is preferred to β̃ because β̃, the least absolute deviations
estimator of β, cannot be computed explicitly via differentiation, as the function
f(x) = |x| is not differentiable at zero. Therefore, β̃ is harder to compute than β̂.
(c) We need to minimise a convex quadratic, so we can do that by differentiating
it and equating the derivative to zero. We obtain:

−2 Σ (yi − β̂xi) xi = 0 (summing over i = 1, . . . , n)

which yields:

β̂ = Σ xiyi / Σ xi².
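Remark: a short numerical illustration of this estimator, which also answers i. (with y = x it gives exactly 1, with y = −x exactly −1) and ii. (with y = 2x it gives 2, so β̂ is not confined to [−1, 1]). The data values are illustrative:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])       # illustrative data
beta = lambda y: (x @ y) / (x @ x)       # beta-hat = sum(x*y) / sum(x^2)
print(beta(x), beta(-x), beta(2*x))      # 1.0, -1.0, 2.0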
3. Let {(xi, yi)}, for i = 1, 2, . . . , n, be observations from the linear regression model:

yi = β0 + β1xi + εi.
(a) Suppose that the slope, β1 , is known. Find the least squares estimator (LSE) of
the intercept, β0 .
(b) Suppose that the intercept, β0 , is known. Find the LSE of the slope, β1 .
Solution:
(a) When β1 is known, let zi = yi − β1xi. The model then reduces to zi = β0 + εi.
The LSE β̂0 minimises Σ (zi − β0)², hence:

β̂0 = z̄ = (1/n) Σ (yi − β1xi).
where D = (β̂1 − β1) Σ xi(zi − β̂1xi). Suppose we choose β̂1 such that:

Σ xi(zi − β̂1xi) = 0, i.e. Σ xizi − β̂1 Σ xi² = 0.

Hence:

Σ (zi − β1xi)² = Σ (zi − β̂1xi)² + (β̂1 − β1)² Σ xi² ≥ Σ (zi − β̂1xi)²

so the LSE of the slope is β̂1 = Σ xizi / Σ xi² = Σ xi(yi − β0) / Σ xi².
Solution:
Since x̄ = (1 + 2 + · · · + 9)/9 = 5, then Σ (xi − x̄)² = 60 and so:

Var(β̂1) = σ² / Σ (xi − x̄)² = 45/60 = 0.75.

Therefore:

β̂1 ∼ N(β1, 0.75).

We require:

P(|β̂1 − β1| < 1.5) = P( |Z| < 1.5/√0.75 ) = P(|Z| < 1.73) = 1 − 2 × 0.0418 = 0.9164.
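Remark: the probability can be checked without tables. A Python sketch:

import math
from scipy import stats

sxx = sum((x - 5)**2 for x in range(1, 10))   # x_i = 1, ..., 9 gives 60
var_b1 = 45 / sxx                             # 0.75
z = 1.5 / math.sqrt(var_b1)                   # 1.732
print(2 * stats.norm.cdf(z) - 1)              # about 0.917 (0.9164 with the tables' z = 1.73)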
(a) Find the least-squares estimates of β0 and β1 and write down the fitted
regression model.
(b) Compute a 95% confidence interval for the slope coefficient β1 . What can be
concluded?
(c) Compute R2 . What can be said about how ‘good’ the model is?
(d) With x = 30, find a prediction interval which covers y with probability 0.95.
With 97.5% confidence, what minimum average life expectancy can a city
expect once its GDP per capita reaches $30,000?
Solution:
(a) We have:

β̂1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)² = ( Σ xiyi − nx̄ȳ ) / ( Σ xi² − nx̄² ) = 1.026

and:

β̂0 = ȳ − β̂1x̄ = 49.55.

Hence the fitted model is:

ŷ = 49.55 + 1.026x.
(b) We first need E.S.E.(β̂1), for which we need σ̂². For σ̂², we need the Residual
SS (from the Total SS and the Regression SS). We compute:

Total SS = Σ yi² − nȳ² = 1,339.67

Regression SS = β̂1² × ( Σ xi² − nx̄² ) = 702.99

Residual SS = 1,339.67 − 702.99 = 636.68, so σ̂² = 636.68/28 = 22.74

E.S.E.(β̂1) = ( σ̂² / (Σ xi² − nx̄²) )^1/2 = 0.184.

A 95% confidence interval for β1 is β̂1 ± t0.025, 28 × E.S.E.(β̂1), which gives:

1.026 ± 2.05 × 0.184 ⇒ (0.65, 1.40).
The confidence interval does not contain zero. Therefore, we would reject the
hypothesis of β1 being zero at the 5% significance level. Hence there does
appear to be a significant link.
(c) We have:

R² = Regression SS / Total SS = 702.99/1,339.67 = 0.52

so the regression explains just over half of the sample variation in y: a moderate,
but far from perfect, fit.
(d) With x = 30, a 95% prediction interval for y is:

β̂0 + β̂1x ± t0.025, n−2 × σ̂ × ( 1 + (Σ xi² − 2x Σ xi + nx²) / (n(Σ xi² − nx̄²)) )^1/2
which gives:
(69.79, 90.87).
Therefore, we can be 97.5% confident that the average life expectancy lies
above 69.79 years once GDP per capita reaches $30,000.
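Remark: part (d) is easily wrapped into a reusable function. In the sketch below, Sxx = Σ xi² − nx̄² = 667.8 is recovered from Regression SS/β̂1², while x̄ = 20.6 is an assumed value (the raw data are not reproduced here) chosen to be consistent with the printed interval:

import math
from scipy import stats

def prediction_interval(b0, b1, sigma2, n, sxx, xbar, x0, level=0.95):
    # standard error for predicting a single new y at x = x0
    se = math.sqrt(sigma2 * (1 + 1/n + (x0 - xbar)**2 / sxx))
    t = stats.t.ppf((1 + level)/2, n - 2)
    centre = b0 + b1 * x0
    return centre - t*se, centre + t*se

print(prediction_interval(49.55, 1.026, 22.74, 30, sxx=667.8, xbar=20.6, x0=30))
# approximately (69.8, 90.9)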
Analysis of Variance
SOURCE DF SS
Regression 1 2011.12
Residual Error 40 539.17
In addition, x̄ = 1.56.
(a) Find an estimate of the error term variance, σ 2 .
(b) Calculate and interpret R2 .
(c) Test at the 5% significance level whether or not the slope in the regression
model is equal to 1.
(d) For x = 0.8, find a 95% confidence interval for the expectation of y.
Solution:
(a) Noting n = 40 + 1 + 1 = 42, we have:

σ̂² = Residual SS / (n − 2) = 539.17/40 = 13.479.

(b) We have:

R² = Regression SS / Total SS = 2,011.12/2,550.29 = 0.7886

so about 79% of the sample variation in y is explained by the regression.

(c) To test H0 : β1 = 1 vs. H1 : β1 ≠ 1, the test statistic is:

T = (β̂1 − 1) / E.S.E.(β̂1) ∼ tn−2 = t40.
8. Why is the squared sample correlation coefficient between the yi s and xi s the same
as the squared sample correlation coefficient between the yi s and ybi s? No algebra is
needed for this.
Solution:
The only difference between the xi s and the ŷi s is a rescaling (multiplying by β̂1)
followed by a relocation (adding β̂0). Correlation coefficients are unaffected by
changes of scale and location (a negative slope only flips the sign), so the squared
sample correlation coefficient is the same whether we use the xi s or the ŷi s.
9. If the model fits, then the fitted values and the residuals from the model are
independent of each other. What do you expect to see if the model fits when you
plot residuals against fitted values?
Solution:
If the model fits, one would expect to see a random scatter with no particular
pattern.
1. The table below shows the cost of fire damage for ten fires together with the
corresponding distances of the fires to the nearest fire station:
(a) Fit a straight line to these data and construct a 95% confidence interval for
the increase in cost of a fire for each mile from the nearest fire station.
(b) Test the hypothesis that the ‘true line’ passes through the origin.
2. The yearly profits made by a company, over a period of eight consecutive years are
shown below:
Year 1 2 3 4 5 6 7 8
Profit (in £000s) 18 21 34 31 44 46 60 75
(a) Fit a straight line to these data and compute a 95% confidence interval for the
‘true’ yearly increase in profits.
(b) The company accountant forecasts the profits for year 9 to be £90,000. Is this
forecast reasonable if it is based on the above data?
3. The data table below shows the yearly expenditure (in £000s) by a cosmetics
company in advertising a particular brand of perfume:
Year (x) 1 2 3 4 5 6 7 8
Expenditure (y) 170 170 275 340 435 510 740 832
(a) Fit a regression line to these data and construct a 95% confidence interval for
its slope.
(b) Construct an analysis of variance table and compute the R2 statistic for the fit.
(c) Comment on the goodness of fit of the linear regression model.
(d) Predict the expenditure for Year 9 and construct a 95% prediction interval for
the actual expenditure.
Appendix G
Solutions to Practice questions
(c) For Xi ∼ Bernoulli(π), E(Xi ) = π and Var(Xi ) = π(1 − π). Therefore, the
approximate normal sampling distribution of X̄, derived from the central limit
theorem, is N (π, π(1 − π)/n). Here this is:
N(0.2, 0.2 × 0.8/100) = N(0.2, 0.0016) = N(0.2, (0.04)²).
using Table 3 of Murdoch and Barnes’ Statistical Tables. This is very close to
the probability obtained from the exact sampling distribution, which is about
0.0061.
2. (a) Let {X1, X2, . . . , Xn} denote the random sample. We know that the sampling
distribution of X̄ is N(µ, σ²/n), here N(4, 2²/20) = N(4, 0.2).
and hence:
P (|X̄ − µ| ≤ 1) = 1 − 2 × 0.0126 = 0.9748.
In other words, the probability is 0.9748 that the sample mean is within
one unit of the true population mean, µ = 4.
(b) We can use the same ideas as in (a). Since X̄ ∼ N (µ, 4/n) we have:
3. (a) The sample average is composed of 25 randomly sampled data which are
subject to sampling variability, hence the average is also subject to this
variability. Its sampling distribution describes its probability properties. If a
large number of such averages were independently sampled, then their
histogram would be the sampling distribution.
(b) It is reasonable to assume that this sampling distribution is normal due to the
CLT, although the sample size is rather small. If n = 25, µ = 54 and σ = 10,
then the CLT says that:

X̄ ∼ N(µ, σ²/n) = N(54, 100/25).
(c) i. We have:
P(X̄ > 60) = P( Z > (60 − 54)/√(100/25) ) = P(Z > 3) = 0.0013
and:
E(Y) = E( X1/3 + 2X2/3 ) = (1/3) × E(X1) + (2/3) × E(X2) = (1/3) × µ + (2/3) × µ = µ.
This means that (n − 1)S² = Σ Xi² − nX̄², hence:

E((n − 1)S²) = (n − 1) E(S²) = E( Σ Xi² − nX̄² ) = n E(Xi²) − n E(X̄²).

Because the sample is random, E(Xi²) = E(X²) for all i = 1, 2, . . . , n as all the
variables are identically distributed. From the standard formula
Var(X) = σ² = E(X²) − µ², so (using the hint):

E(X²) = σ² + µ² and E(X̄²) = µ² + σ²/n.

Hence:

(n − 1) E(S²) = n(σ² + µ²) − n(µ² + σ²/n) = (n − 1)σ²

so E(S²) = σ², which means that S² is an unbiased estimator of σ², as stated.
The standard formula for Var(X), applied to S, states that:

Var(S) = E(S²) − (E(S))² > 0

since all variances are strictly positive. Hence (E(S))² < E(S²) = σ², so E(S) < σ.
It follows that S is a biased estimator of σ (with its average value lower than the
true value σ).
So the first obvious guess is that we should try (R/n) × (1 − R/n) = R/n − (R/n)².
Now:

nπ(1 − π) = Var(R) = E(R²) − (E(R))² = E(R²) − (nπ)².

So:

E((R/n)²) = E(R²)/n² = (nπ(1 − π) + n²π²)/n²

⇒ E( R/n − (R/n)² ) = E(R)/n − (nπ(1 − π) + n²π²)/n²
= nπ/n − n²π²/n² − π(1 − π)/n
= π − π² − π(1 − π)/n.
It follows that:

π(1 − π) = (n/(n − 1)) × E( R/n − (R/n)² ) = E( R/(n − 1) − R²/(n(n − 1)) ).

So we have found an unbiased estimator of π(1 − π), but it could do with tidying
up! When this is done, we see that:

R(n − R) / (n(n − 1))

is an unbiased estimator of π(1 − π).
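Remark: the unbiasedness of this estimator is easy to see in simulation. A sketch with illustrative values n = 10 and π = 0.3:

import numpy as np

rng = np.random.default_rng(0)
n, pi, reps = 10, 0.3, 500_000
r = rng.binomial(n, pi, reps)
estimate = r * (n - r) / (n * (n - 1))
print(estimate.mean(), pi * (1 - pi))   # both close to 0.21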
4. For T1:

E(T1) = E( Sxx/(n − 1) ) = E(Sxx)/(n − 1) = (n − 1)σ²/(n − 1) = σ²

so T1 is unbiased and MSE(T1) = Var(T1) = 2(n − 1)σ⁴/(n − 1)² = 2σ⁴/(n − 1).
By definition, MSE(T2) = 2(n − 1)σ⁴/n² + (−σ²/n)² = (2n − 1)σ⁴/n².
It can be seen that MSE(T1) > MSE(T2) since:

2σ⁴/(n − 1) > (2n − 1)σ⁴/n² ⟺ 2n² > (2n − 1)(n − 1) = 2n² − 3n + 1 ⟺ 3n > 1

which holds for all n ≥ 2.
and:
∂S/∂β = −2(y1 − α − β) − 2(y2 + α − β) + 2(y3 − α + β) + 2(y4 + α + β)
= −2(y1 + y2 − y3 − y4) + 8β.
(b) α̂ is an unbiased estimator of α since:

E(α̂) = E( (y1 − y2 + y3 − y4)/4 ) = (α + β + α − β + α − β + α + β)/4 = α.
(c) We have:

Var(α̂) = Var( (y1 − y2 + y3 − y4)/4 ) = 4σ²/16 = σ²/4.
Note that because n is large we have used the standard normal distribution. It
is more accurate to use a t distribution with 49 degrees of freedom. This gives
an interval of (£308.87, £331.95) – not much of a difference.
To obtain a 95% confidence interval for the total value of the stock, 9,875µ,
multiply the interval by 9,875. This gives (to the nearest £10,000):
(£3,050,000, £3,280,000).
(b) To find the sample size n and the value a, we need to solve two conditions:
• α = P(X̄ > a | H0) = P( Z > (a − 0.65)/(1/√n) ) = 0.05 ⇒ (a − 0.65)/(1/√n) = 1.645.
• β = P(X̄ < a | H1) = P( Z < (a − 0.80)/(1/√n) ) = 0.10 ⇒ (a − 0.80)/(1/√n) = −1.28.
Solving these equations gives a = 0.734 and n = 381, remembering to round up!
(c) A sample is classified as type A (i.e. we decide in favour of H1) if x̄ > 0.75. We have:

α = P(X̄ > 0.75 | H0) = P( Z > (0.75 − 0.65)/(1/√n) ) = 0.02 ⇒ (0.75 − 0.65)/(1/√n) = 2.05.

Solving this equation gives n = 421, remembering to round up! Therefore:

β = P(X̄ < 0.75 | H1) = P( Z < (0.75 − 0.80)/(1/√421) ) = P(Z < −1.026) = 0.1515.
(d) The rule in (b) is ‘take n = 381 and reject H0 if x̄ > 0.734’. So:

P(X̄ > 0.734 | µ = 0.7) = P( Z > (0.734 − 0.7)/(1/√381) ) = P(Z > 0.66) = 0.2546.
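Remark: parts (b)–(d) amount to solving two standard normal equations; a Python sketch (illustrative names):

import math
from scipy import stats

mu0, mu1, sigma = 0.65, 0.80, 1.0
z_a, z_b = stats.norm.ppf(0.95), stats.norm.ppf(0.90)   # 1.645 and 1.282
n = math.ceil(((z_a + z_b) * sigma / (mu1 - mu0))**2)   # 381
a = mu0 + z_a * sigma / math.sqrt(n)                    # about 0.734
power_at_07 = stats.norm.sf((a - 0.70) / (sigma / math.sqrt(n)))
print(n, a, power_at_07)   # 381, 0.734, about 0.25 (0.2546 with the rounded values above)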
2. (a) We have:
(b) We have:
(c) We have:
3. (a) We are to test H0 : µ = 12 vs. H1 : µ 6= 12. The key points here are that n is
small and that σ 2 is unknown. We can use the t test and this is valid provided
the data are normally distributed. The test statistic value is:
t = (x̄ − 12)/(s/√7) = (12.7 − 12)/(0.858/√7) = 2.16.
This is compared to a Student’s t distribution on 6 degrees of freedom. The
critical value corresponding to a 5% significance level is 2.447. Hence we
cannot reject the null hypothesis at the 5% significance level. (We can reject at
the 10% significance level, but the convention on this course is to regard such
evidence merely as casting doubt on H0 , rather than justifying rejection as
such, i.e. such a result would be ‘weakly significant’.)
(b) We are to test H0 : µ = 12 vs. H1 : µ < 12. There is no need to do a formal
statistical test. As the sample mean is 12.7, which is greater than 12, there is
no evidence whatsoever for the alternative hypothesis.
In (a) you are asked to do a two-sided test and in (b) it is a one-sided test. Which
is more appropriate will depend on the purpose of the experiment, and your
suspicions before you conduct it.
• If you suspected before collecting the data that the mean voltage was less than
12 volts, the one-sided test would be appropriate.
• If you had no prior reason to believe that the mean was less than 12 volts you
would perform a two-sided test.
4. It is useful to discuss the issues about this question before giving the solution.
• We want to know whether a loyalty programme such as that at the 12 selected
restaurants would result in an increase in mean profits greater than that
observed (during the three-month test) at the other sites within the chain.
• So we can model the profits across the chain as $1,047.34 + x, where $x is the
supposed effect of the promotion, and if the true mean value of x is µ, then we
wish to test:
H0 : µ = 0 vs. H1 : µ > 0
which is a one-tailed test since, clearly, there are (preliminary) grounds for
thinking that there is an increase due to the loyalty programme.
• We know nothing about the variability of profits across the rest of the chain,
so we will have to use the sample data, i.e. to calculate the sample variance
and to employ the t distribution with ν = 12 − 1 = 11 degrees of freedom.
• Although we shall want the variance of the data ‘sample value − 1,047.34’,
this will be the same as the variance of the sample data, since for any random
variable X and constant k we have:
Var(X + k) = Var(X).
Therefore:

s² = Sxx/(n − 1) = 50,812,651.51/11 = 4,619,331.956.

Hence the estimated standard error is:

s/√n = √(4,619,331.956/12) = √384,944.3296 = 620.439.
The test statistic value is:

(x̄ − µ0)/(s/√n) = (1,462.091 − 0)/620.439 = 2.3565.

The relevant critical values for t11 in this one-tailed test are t0.05, 11 = 1.796 and
t0.01, 11 = 2.718.
So we see that the test is significant at the 5% significance level, but not at the 1%
significance level, so reject H0 and conclude that the loyalty programme does have
an effect. (In fact, this means the result is moderately significant that the
programme has had a beneficial effect for the company.)
(b) The p-value for this two-tailed test is 2 × P (Z > 2.06) = 0.0394.
(c) For small samples, we should use a pooled estimate of the population standard
deviation:
s = √( ((9 − 1) × 7.3 + (17 − 1) × 6.2) / ((9 − 1) + (17 − 1)) ) = 2.5626 on 24 degrees of freedom.
This should be compared with the t24 distribution and is clearly not
significant, even at the 10% significance level. With the smaller samples we fail
to detect the difference.
Comparing the two test statistic calculations shows that the different results
flow from differences in the estimated standard errors, hence ultimately (and
unsurprisingly) from the differences in the sample sizes used in the two
situations.
6. (a) Let π be the population proportion of visitors who would use the device. We
test H0 : π = 0.3 vs. H1 : π < 0.3. The sample proportion is p = 20/80 = 0.25.
The standard error of the sample proportion, under H0, is √(0.3 × 0.7/80) = 0.0512.
The test statistic value is:

z = (0.25 − 0.30)/0.0512 = −0.976.
For a one-sided (lower-tailed) test at the 5% significance level, the critical
value is −1.645, so the test is not significant – and not even at the 10%
significance level (the critical value is −1.282). On the basis of the data, there
is no reason to withdraw the device.
The critical region for the above test is to reject H0 if the sample proportion is
less than 0.3 − 1.645 × 0.0512, i.e. if the sample proportion, p, is less than
0.2157.
(b) The p-value of the test is the probability of the test statistic value or a more
extreme value conditional on H0 being true. Hence the p-value is:
P (Z ≤ −0.976) = 0.1645.
When π = 0.2, the standard error of the sample proportion is
√(0.2 × 0.8/80) = 0.0447. Therefore, the power when π = 0.2 is:

P( Z < (0.2157 − 0.2)/0.0447 ) = P(Z < 0.35) = 0.6368.
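Remark: a Python sketch of this power calculation (illustrative names):

import math
from scipy import stats

n, pi0, pi1 = 80, 0.3, 0.2
se0 = math.sqrt(pi0 * (1 - pi0) / n)       # 0.0512, standard error under H0
cutoff = pi0 - 1.645 * se0                 # 0.2157: reject H0 if p < cutoff
se1 = math.sqrt(pi1 * (1 - pi1) / n)       # 0.0447, standard error when pi = 0.2
print(stats.norm.cdf((cutoff - pi1) / se1))   # about 0.64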
f = (b/(k − 1)) / (w/(n − k)) = (797.33/2) / (1,120.67/12) = 4.269.
X̄·j ± t0.025, n−k × S/√nj = X̄·j ± t0.025, 116 × 15.09/√30 = X̄·j ± 5.46.
Hence a 95% confidence interval for β1 is 5.46 ± 2.306 × 0.66 ⇒ (3.94, 6.98).
(b) To test H0 : β0 = 0 vs. H1 : β0 ≠ 0, we first determine the estimated standard
error of β̂0, which is:

E.S.E.(β̂0) = (√4.95/√10) × ( 219.46 / (219.46 − 10 × (4.56)²) )^1/2 = 3.07.

Therefore, the test statistic value is:

6.07/3.07 = 1.98.
Comparing with the t8 distribution, this is not significant at the 5%
significance level (1.98 < 2.306), but it is significant at the 10% significance
level (1.860 < 1.98).
There is only weak evidence against the null hypothesis. Note though that in
practice this hypothesis is not really of interest. A line through the origin
implies that there is zero cost of a fire which takes place right next to a fire
station. This hypothesis does not seem sensible!
3. (a) We first calculate x̄ = 4.5, Σ xi² = 204, ȳ = 434, Σ yi² = 1,938,174 and
Σ xiyi = 19,766. The estimated regression coefficients are:

β̂1 = (19,766 − 8 × 4.5 × 434) / (204 − 8 × (4.5)²) = 98.62 and β̂0 = 434 − 98.62 × 4.5 = −9.79.

The fitted line is:

ŷ = −9.79 + 98.62 × Year (where y is expenditure).

In order to perform statistical inference, we need to find:

σ̂² = Σ (yi − β̂0 − β̂1xi)² / (n − 2)
= ( Σ yi² + nβ̂0² + β̂1² Σ xi² − 2β̂0 Σ yi − 2β̂1 Σ xiyi + 2β̂0β̂1 Σ xi ) / (n − 2)
It follows that (using tn−2 = t6 ) a 95% prediction interval for the predicted
profit (in £000s) is:
Therefore, β1 = Cov(X, Y )/Var(X). The second equality follows from the fact that
Corr(X, Y ) = Cov(X, Y )/(Var(X) Var(Y ))1/2 .
Also, note that the first equality resembles the estimator:
β̂1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)².
Appendix H
Formula sheet in the summer examination
Var(β̂0) = σ² Σ xi² / ( n Σ (xi − x̄)² ), Var(β̂1) = σ² / Σ (xi − x̄)², Cov(β̂0, β̂1) = −σ² x̄ / Σ (xi − x̄)².

Estimator for the variance of εi: σ̂² = Σ (yi − β̂0 − β̂1xi)² / (n − 2).
Regression ANOVA:

Total SS = Σ (yi − ȳ)², Regression SS = β̂1² Σ (xi − x̄)² and Residual SS = Σ (yi − β̂0 − β̂1xi)².
Squared regression correlation coefficient: R² = Regression SS / Total SS = 1 − Residual SS / Total SS.
One-way ANOVA (j = 1, . . . , k treatments, i = 1, . . . , nj observations in treatment j):

Total variation: Σj Σi (Xij − X̄)² = Σj Σi Xij² − nX̄².

Between-treatments variation: B = Σj nj (X̄·j − X̄)² = Σj nj X̄·j² − nX̄².

Within-treatments variation: W = Σj Σi (Xij − X̄·j)² = Σj Σi Xij² − Σj nj X̄·j².
Two-way ANOVA (i = 1, . . . , r rows, j = 1, . . . , c columns):

Total variation: Σi Σj (Xij − X̄)² = Σi Σj Xij² − rcX̄².

Between-blocks (rows) variation: Brow = c Σi (X̄i· − X̄)² = c Σi X̄i·² − rcX̄².

Between-treatments (columns) variation: Bcol = r Σj (X̄·j − X̄)² = r Σj X̄·j² − rcX̄².