
Prelims Statistics and Data Analysis

Lectures 1–10

Christl Donnelly
[email protected]

Trinity Term 2024

1
Course info

Christl Donnelly, Department of Statistics,


[email protected]
16 lectures over 4 weeks
As usual in Prelims there will be one sheet per two lectures, so for this
course that means two sheets per week – this should be about the right
amount for one tutorial.

2
Introduction

3
Probability

We start with a probability model 𝑃, and we deduce properties of 𝑃.


Imagine flipping a coin 5 times. Assuming that flips are fair and
independent, what is the probability of getting 2 heads?
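(Worked answer: the number of heads is Binomial(5, 1/2), so the probability is (5 choose 2)(1/2)⁵ = 10/32 = 0.3125.)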

4
Statistics

In statistics we have data, which we regard as having been generated


from some unknown probability model 𝑃. We want to be able to say
some useful things about 𝑃.
We flip a coin 1000 times and observe 530 heads. Is the coin fair? Is the
probability of heads greater than 1/2? Or could it be equal to 1/2?

5
Precision of estimation

Question: What is the average body temperature in humans?


Collect data: let 𝑥1 , . . . , 𝑥 𝑛 be the body temperatures of 𝑛 randomly
chosen individuals, i.e. our model is that individuals are chosen
independently from the general population.
Then estimate: x̄ = (1/n) ∑ᵢ₌₁ⁿ xᵢ could be our estimate, i.e. the sample
average.
How precise is x̄ as an estimate of the population mean?

6
Statistics Publications - Axillary temperature in young,
healthy adults

Histograms of axillary temperature B) measured for 10 seconds; C)


measured for 10 minutes in ◦ C (axillary = under the armpit)
Source: Marui, S., Misawa, A., Tanaka, Y. et al. J Physiol Anthropol (2017) 36: 18.
https://doi.org/10.1186/s40101-017-0133-y

Copyright: © The Author(s) 2017. Open Access: this article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
7
Statistics Publications - IQ of University Students

Histogram showing the distribution of IQ among university students


involved in the present study (40 men and 40 women).
Source: Kleisner K, Chvátalová V, Flegr J (2014) Perceived Intelligence Is Associated with Measured
Intelligence in Men but Not Women. PLoS ONE 9(3): e81237.
https://doi.org/10.1371/journal.pone.0081237

Copyright: © 2014 Kleisner et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
8
Relationships between observations

Let 𝑦 𝑖 = price of house in month 𝑥 𝑖 , for 𝑖 = 1, . . . , 𝑛.


• Is it reasonable that 𝑦 𝑖 = 𝛼 + 𝛽𝑥 𝑖 + "error"?
• Is 𝛽 > 0?
• How accurately can we estimate 𝛼 and 𝛽, and how do we estimate
them?

9
Relationship between time and house prices



Figure: Scatterplot of some Oxford house price data.

10
Statistics Publications - Underrepresentation of women
on boards

The association of women on editorial/advisory board and in the


corresponding academic specialty.
Source: Ioannidou E, Rosania A (2015) Under-Representation of Women on Dental
Journal Editorial Boards. PLoS ONE 10(1): e0116630.
https://doi.org/10.1371/journal.pone.0116630

Copyright: © 2015 Ioannidou, Rosania. This is an open access article distributed under the terms of the Creative Commons Attribution License.
11
Notation and conventions

Lower case letters 𝑥 1 , . . . , 𝑥 𝑛 denote observations: we regard these as the


observed values of random variables 𝑋1 , . . . , 𝑋𝑛 .
Let x = (𝑥1 , . . . , 𝑥 𝑛 ) and X = (𝑋1 , . . . , 𝑋𝑛 ).
It is convenient to think of 𝑥 𝑖 as
• the observed value of 𝑋𝑖
• or sometimes as a possible value that 𝑋𝑖 can take.
Since 𝑥 𝑖 is a possible value for 𝑋𝑖 we can calculate probabilities, e.g. if
𝑋𝑖 ∼ Poisson(𝜆) then

P(Xᵢ = xᵢ) = e^(−λ) λ^(xᵢ) / xᵢ!,   xᵢ = 0, 1, . . . .

12
1. Random Samples

13
Random Samples
A random sample of size 𝑛 is a set of random variables 𝑋1 , . . . , 𝑋𝑛 which
are independent and identically distributed (i.i.d.).
Let 𝑋1 , . . . , 𝑋𝑛 be a random sample from a Poisson(𝜆) distribution.
e.g. 𝑋𝑖 = # traffic accidents on St Giles’ in year 𝑖.
We’ll write 𝑓 (x) to denote the joint probability mass function (p.m.f.) of
𝑋1 , . . . , 𝑋𝑛 . Then

f(x) = P(X₁ = x₁, X₂ = x₂, . . . , Xₙ = xₙ)   (definition of joint p.m.f.)

= P(X₁ = x₁) P(X₂ = x₂) · · · P(Xₙ = xₙ)   (by independence)

= (e^(−λ) λ^(x₁)/x₁!) · (e^(−λ) λ^(x₂)/x₂!) · · · (e^(−λ) λ^(xₙ)/xₙ!)   (since the Xᵢ are Poisson)

= e^(−nλ) λ^(∑ᵢ₌₁ⁿ xᵢ) / ∏ᵢ₌₁ⁿ xᵢ!.
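As a quick numerical check (a minimal sketch, not part of the original slides; the counts and λ are invented for illustration), the product of the individual p.m.f.s and the closed form above agree:

```python
import numpy as np
from math import exp, factorial
from scipy.stats import poisson

x = np.array([3, 1, 4, 2, 0])  # hypothetical accident counts
lam = 2.0
n = len(x)

f1 = np.prod(poisson.pmf(x, lam))  # product of individual Poisson p.m.f.s
# closed form: e^(-n*lam) * lam^(sum x_i) / prod(x_i!)
f2 = exp(-n * lam) * lam ** x.sum() / np.prod([factorial(k) for k in x])

print(f1, f2)  # both approximately 1.6e-4
```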

14
Random Samples
Let 𝑋1 , . . . , 𝑋𝑛 be a random sample from an exponential distribution
with probability density function (p.d.f.) given by

f(x) = (1/μ) e^(−x/μ),   x ≥ 0.

e.g. 𝑋𝑖 = time until the first breakdown of machine 𝑖 in a factory.


We’ll write 𝑓 (x) to denote the joint p.d.f. of 𝑋1 , . . . , 𝑋𝑛 . Then

f(x) = f(x₁) · · · f(xₙ)   (since the Xᵢ are independent)

= ∏ᵢ₌₁ⁿ (1/μ) e^(−xᵢ/μ)

= (1/μⁿ) exp(−(1/μ) ∑ᵢ₌₁ⁿ xᵢ).

15
1 We use 𝑓 to denote a p.m.f. in the first example and to denote a
p.d.f. in the second example. It is convenient to use the same letter
(i.e. 𝑓 ) in both the discrete and continuous cases. (In introductory
probability you may often see 𝑝 for p.m.f. and 𝑓 for p.d.f.)
We could write 𝑓𝑋𝑖 (𝑥 𝑖 ) for the p.m.f./p.d.f. of 𝑋𝑖 , and 𝑓𝑋1 ,...,𝑋𝑛 (x)
for the joint p.m.f./p.d.f. of X₁, . . . , Xₙ. However it is convenient to
keep things simpler by omitting subscripts on 𝑓 .
2 In the second example 𝐸(𝑋𝑖 ) = 𝜇 and we say “𝑋𝑖 has an
exponential distribution with mean 𝜇” (i.e. expectation 𝜇).
Sometimes, and often in probability, we work with “an
exponential distribution with parameter 𝜆” where 𝜆 = 1/𝜇. To
change the parameter from 𝜇 to 𝜆 all we do is replace 𝜇 by 1/𝜆 to
get
f(x) = λ e^(−λx),   x ≥ 0.
Sometimes (often in statistics) we parametrise the distribution
using 𝜇, sometimes (often in probability) we parametrise it using
𝜆.
16
In probability we assume that the parameters 𝜆 and 𝜇 in our two
examples are known. In statistics we wish to estimate 𝜆 and 𝜇 from
data.
• What is the best way to estimate them? And what does “best”
mean?
• For a given method, how precise is the estimation?

17
2. Summary Statistics

18
Summary Statistics
The expected value of 𝑋, E(𝑋), is also called its mean. This is often
denoted 𝜇.
The variance of 𝑋, var(𝑋), is var(𝑋) = 𝐸[(𝑋 − 𝜇)2 ]. This is often
denoted 𝜎2 . The standard deviation of 𝑋 is 𝜎.
Let X₁, . . . , Xₙ be a random sample. The sample mean X̄ is defined by

X̄ = (1/n) ∑ᵢ₌₁ⁿ Xᵢ.

The sample variance S² is defined by

S² = (1/(n − 1)) ∑ᵢ₌₁ⁿ (Xᵢ − X̄)².

The sample standard deviation is S = √(sample variance).

19
Statistics Publications - Descriptive statistics -
biochemistry

Means ± standard deviations are reported for the continuous variables.


Source: Sait Demirkol et al. (2012) Evaluation of the mean platelet volume in patients with cardiac
syndrome X. Clinics vol.67 no.9 http://dx.doi.org/10.6061/clinics/2012(09)06

Copyright: All the contents of this journal, except where otherwise noted, is licensed under a Creative Commons Attribution License

20
Statistics Publications - IQ of University Students - Mean
(125.04) and Standard deviation (17.298)

Histogram showing the distribution of IQ among university students


Source: Kleisner K, Chvátalová V, Flegr J (2014) Perceived Intelligence Is Associated with Measured
Intelligence in Men but Not Women. PLoS ONE 9(3): e81237.
https://doi.org/10.1371/journal.pone.0081237

Copyright: © 2014 Kleisner et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License.
21
Note

1 The denominator in the definition of S² is n − 1, not n.
2 X̄ and S² are random variables. Their distributions are called the sampling distributions of X̄ and S².
3 Given observations x₁, . . . , xₙ we can compute the observed values x̄ and s².
4 The sample mean x̄ is a summary of the location of the sample.
5 The sample variance s² (or the sample standard deviation s) is a summary of the spread of the sample about x̄.

22
The Normal Distribution

The random variable 𝑋 has a normal distribution with mean 𝜇 and


variance 𝜎2 , written 𝑋 ∼ 𝑁(𝜇, 𝜎2 ), if the p.d.f. of 𝑋 is

f(x) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²)),   −∞ < x < ∞.

Properties: 𝐸(𝑋) = 𝜇, var(𝑋) = 𝜎 2 .

23
Standard Normal Distribution

If 𝜇 = 0 and 𝜎2 = 1, then 𝑋 ∼ 𝑁(0, 1) is said to have a standard normal


distribution. Properties:
• If 𝑍 ∼ 𝑁(0, 1) and 𝑋 = 𝜇 + 𝜎𝑍, then 𝑋 ∼ 𝑁(𝜇, 𝜎 2 ).
• If 𝑋 ∼ 𝑁(𝜇, 𝜎 2 ) and 𝑍 = (𝑋 − 𝜇)/𝜎, then 𝑍 ∼ 𝑁(0, 1).
• The cumulative distribution function (c.d.f.) of the standard
normal distribution is
Φ(x) = ∫₋∞ˣ (1/√(2π)) e^(−u²/2) du.
This cannot be written in a closed form, but can be found by
numerical integration to an arbitrary degree of accuracy.

24
Standard Normal


Figure: Standard normal p.d.f. and c.d.f.

25
Standard and Poors 500 stock index (2012 data)

[Time series plot of the S&P 500 Index against trading day.]
26
S&P 500 daily returns (%)

[Plot of daily returns (%) against trading day.]
27
S&P 500 daily returns (%)

[Histogram of the daily returns (%).]
28
Insurance claim amounts
[Histogram of claim amounts, with lognormal and gamma densities.]
29
Time intervals between major earthquakes
Histogram of time intervals between 62 major earthquakes 1902–77: an
exponential density looks plausible.



30
3. Maximum Likelihood

31
Maximum Likelihood Estimation
Maximum likelihood estimation is a general method for estimating
unknown parameters from data. This turns out to be the method of
choice in many contexts, though this isn’t obvious at this stage.
Suppose e.g. that 𝑥1 , . . . , 𝑥 𝑛 are time intervals between major
earthquakes. Assume these are observations of 𝑋1 , . . . , 𝑋𝑛
independently drawn from an exponential distribution with mean 𝜇,
so that each 𝑋𝑖 has p.d.f.

f(x; μ) = (1/μ) e^(−x/μ),   x ≥ 0.

We have written 𝑓 (𝑥; 𝜇) to indicate that the p.d.f. 𝑓 depends on 𝜇.


• How do we estimate 𝜇?
• Is X̄ a good estimator for μ?
• Is there a general principle we can use?

32
Maximum Likelihood Estimation

In general we write 𝑓 (𝑥; 𝜃) to indicate that the p.d.f./p.m.f. 𝑓 , which is


a function of 𝑥, depends on a parameter 𝜃. Similarly, 𝑓 (x; 𝜃) denotes
the joint p.d.f./p.m.f. of X. (Sometimes 𝑓 (𝑥; 𝜃) is written 𝑓 (𝑥|𝜃).)
The parameter 𝜃 may be a vector, e.g. 𝜃 = (𝜇, 𝜎 2 ) in the earlier 𝑁(𝜇, 𝜎 2 )
example.

33
Maximum Likelihood Estimation

If we regard 𝜃 as unknown, then we need to estimate it


using 𝑥1 , . . . , 𝑥 𝑛 .
Let 𝑋1 , . . . , 𝑋𝑛 have joint p.d.f./p.m.f. 𝑓 (x; 𝜃). Given observed values
𝑥1 , . . . , 𝑥 𝑛 of 𝑋1 , . . . , 𝑋𝑛 , the likelihood of 𝜃 is the function

𝐿(𝜃) = 𝑓 (x; 𝜃). (1)

The log-likelihood is ℓ (𝜃) = log 𝐿(𝜃).


So 𝐿(𝜃) is the joint p.d.f./p.m.f. of the observed data regarded as a
function of 𝜃, for fixed x.
In the definition of ℓ (𝜃), log means log to base 𝑒, i.e. log = log𝑒 = ln.

34
Maximum Likelihood Estimation
Often we assume that 𝑋1 , . . . , 𝑋𝑛 are a random sample from 𝑓 (𝑥; 𝜃), so
that

L(θ) = f(x; θ) = ∏ᵢ₌₁ⁿ f(xᵢ; θ)   since the Xᵢ are independent.

Sometimes we have independent 𝑋𝑖 whose distributions differ, say 𝑋𝑖


is from 𝑓𝑖 (𝑥; 𝜃). Then the likelihood is
L(θ) = ∏ᵢ₌₁ⁿ fᵢ(xᵢ; θ).

The maximum likelihood estimate (MLE) θ̂(x) is the value of θ that
maximises L(θ) for the given x.

35
The idea of maximum likelihood is to estimate the parameter by the
value of 𝜃 that gives the greatest likelihood to observations 𝑥1 , . . . , 𝑥 𝑛 .
That is, the 𝜃 for which the probability or probability density is
maximised.
Since taking logs is monotone, θ̂(x) also maximises ℓ(θ). Finding the
MLE by maximising ℓ(θ) is often more convenient.

36
[Example continued] In our exponential mean 𝜇 example, the
parameter is 𝜇 and
L(μ) = ∏ᵢ₌₁ⁿ (1/μ) e^(−xᵢ/μ) = μ^(−n) exp(−(1/μ) ∑ᵢ₌₁ⁿ xᵢ)

ℓ(μ) = log L(μ) = −n log μ − (1/μ) ∑ᵢ₌₁ⁿ xᵢ.

To find the maximum,

dℓ/dμ = −n/μ + (∑ᵢ₌₁ⁿ xᵢ)/μ².

So

dℓ/dμ = 0 ⟺ n/μ = (∑ᵢ₌₁ⁿ xᵢ)/μ² ⟺ μ = x̄.

37
This is a maximum since

d²ℓ/dμ² at μ = x̄ is n/x̄² − 2(∑ᵢ₌₁ⁿ xᵢ)/x̄³ = n/x̄² − 2n/x̄² = −n/x̄² < 0.

So the MLE is μ̂(x) = x̄. Often we’ll just write μ̂ = x̄.
In this case the maximum likelihood estimator of μ is μ̂(X) = X̄, which
is a random variable. (More on the difference between estimates and
estimators later.)
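A quick numerical check (a sketch using simulated data with an arbitrary mean; not part of the original slides): maximising ℓ(μ) numerically recovers x̄.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.exponential(scale=437.0, size=62)  # simulated waiting times

def neg_loglik(mu):
    # negative of l(mu) = -n log(mu) - (1/mu) * sum(x_i)
    return len(x) * np.log(mu) + x.sum() / mu

res = minimize_scalar(neg_loglik, bounds=(1.0, 5000.0), method="bounded")
print(res.x, x.mean())  # numerical maximiser agrees with the sample mean
```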

38

Figure: Likelihood L(μ) and log-likelihood ℓ(μ) for the exponential (earthquake) example.

Note that neither the likelihood nor the log-likelihood is plotted on its natural scale (the axes are scaled by 10¹⁹¹ and 10⁻² respectively).

39
If θ̂(x) is the maximum likelihood estimate of θ, then the maximum
likelihood estimator (MLE) is defined by θ̂(X).
Note: both maximum likelihood estimate and maximum likelihood
estimator are often abbreviated to MLE.

40
Opinion poll example
Suppose 𝑛 individuals are drawn independently from a large
population. Let
Xᵢ = 1 if individual i is a Labour voter, and Xᵢ = 0 otherwise.
Let p be the proportion of Labour voters, so that

P(Xᵢ = 1) = p,   P(Xᵢ = 0) = 1 − p.

This is a Bernoulli distribution, for which the p.m.f. can be written

f(x; p) = P(X = x) = p^x (1 − p)^(1−x),   x = 0, 1.
The likelihood is

L(p) = ∏ᵢ₌₁ⁿ p^(xᵢ) (1 − p)^(1−xᵢ) = p^r (1 − p)^(n−r)

where r = ∑ᵢ₌₁ⁿ xᵢ.
41
Opinion poll example (continued)

So the log-likelihood is

ℓ (𝑝) = log 𝐿(𝑝)


= 𝑟 log 𝑝 + (𝑛 − 𝑟) log(1 − 𝑝)

For a maximum, differentiate and set to zero:


r/p − (n − r)/(1 − p) = 0 ⟺ r/p = (n − r)/(1 − p) ⟺ r − rp = np − rp

and so p = r/n. This is a maximum since ℓ″(p) < 0.


So p̂ = r/n, i.e. the MLE is p̂ = ∑ᵢ₌₁ⁿ Xᵢ/n, which is the proportion of
Labour voters in the sample.

42
Genetics example

Suppose we test randomly chosen individuals at a particular locus on


the genome. Each chromosome can be type 𝐴 or 𝑎. Every individual
has two chromosomes (one from each parent), so the genotype can be
𝐴𝐴, 𝐴𝑎 or 𝑎𝑎. (Note that order is not relevant, there is no distinction
between 𝐴𝑎 and 𝑎𝐴.)
Hardy-Weinberg law: under plausible assumptions,

P(AA) = p₁ = θ²,   P(Aa) = p₂ = 2θ(1 − θ),   P(aa) = p₃ = (1 − θ)²

for some 𝜃 with 0 ≤ 𝜃 ≤ 1.


Now suppose the random sample of 𝑛 individuals contains:

𝑥1 of type 𝐴𝐴, 𝑥2 of type 𝐴𝑎, 𝑥3 of type 𝑎𝑎

where 𝑥1 + 𝑥 2 + 𝑥3 = 𝑛 and these are observations of 𝑋1 , 𝑋2 , 𝑋3 .

43
Genetics example (continued)
Then

L(θ) = P(X₁ = x₁, X₂ = x₂, X₃ = x₃)

= (n!/(x₁! x₂! x₃!)) p₁^(x₁) p₂^(x₂) p₃^(x₃)

= (n!/(x₁! x₂! x₃!)) (θ²)^(x₁) [2θ(1 − θ)]^(x₂) [(1 − θ)²]^(x₃).
This is a multinomial distribution.
So

ℓ (𝜃) = constant + 2𝑥1 log 𝜃 + 𝑥2 [log 2 + log 𝜃 + log(1 − 𝜃)] + 2𝑥3 log(1 − 𝜃)
= constant + (2𝑥1 + 𝑥2 ) log 𝜃 + (𝑥2 + 2𝑥3 ) log(1 − 𝜃)

(where the constants depend on 𝑥1 , 𝑥2 , 𝑥3 but not 𝜃).

44
Genetics example (continued)

Then ℓ′(θ̂) = 0 gives

(2x₁ + x₂)/θ̂ − (x₂ + 2x₃)/(1 − θ̂) = 0

or

(2x₁ + x₂)(1 − θ̂) = (x₂ + 2x₃)θ̂.

So

θ̂ = (2x₁ + x₂)/(2(x₁ + x₂ + x₃)) = (2x₁ + x₂)/(2n).
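As an illustration (a sketch with hypothetical genotype counts, not from the slides), the closed form can be checked against a numerical maximisation:

```python
import numpy as np
from scipy.optimize import minimize_scalar

x1, x2, x3 = 30, 50, 20  # hypothetical counts of AA, Aa, aa
n = x1 + x2 + x3

def neg_loglik(theta):
    # l(theta) = (2 x1 + x2) log(theta) + (x2 + 2 x3) log(1 - theta) + const
    return -((2*x1 + x2) * np.log(theta) + (x2 + 2*x3) * np.log(1 - theta))

res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, (2*x1 + x2) / (2*n))  # both 0.55
```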

45
Maximum likelihood approach

Steps:
• Write down the (log) likelihood
• Find the maximum (usually by differentiation, but not quite
always)
• Rearrange to give the parameter estimate in terms of the data.

46
Statistics Publications - Likelihood of a hepatitis A
transmission tree

v = possible infector; w = contacts; 𝜃 = parameters of serial interval dist.


Source: Zhang X-S, Iacono GL (2018) Estimating human-to-human transmissibility of hepatitis A virus in an outbreak at an elementary school

in China, 2011. PLoS ONE 13(9): e0204201. https://doi.org/10.1371/journal.pone.0204201

Copyright: © 2018 Zhang, Iacono. This is an open access article distributed under the terms of the Creative Commons Attribution License,

which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
47
Estimating multiple parameters
Let X₁, . . . , Xₙ ~iid N(μ, σ²) where both μ and σ² are unknown.
[Here ~iid means “are independent and identically distributed as.”]
The likelihood is

L(μ, σ²) = ∏ᵢ₌₁ⁿ (1/√(2πσ²)) exp(−(xᵢ − μ)²/(2σ²))

= (2πσ²)^(−n/2) exp(−(1/(2σ²)) ∑ᵢ₌₁ⁿ (xᵢ − μ)²)

with log-likelihood

ℓ(μ, σ²) = −(n/2) log 2π − (n/2) log(σ²) − (1/(2σ²)) ∑ᵢ₌₁ⁿ (xᵢ − μ)².

48
Estimating multiple parameters (continued)
We maximise ℓ jointly over μ and σ²:

∂ℓ/∂μ = (1/σ²) ∑ᵢ₌₁ⁿ (xᵢ − μ)

∂ℓ/∂(σ²) = −n/(2σ²) + (1/(2(σ²)²)) ∑ᵢ₌₁ⁿ (xᵢ − μ)²

and solving ∂ℓ/∂μ = 0 and ∂ℓ/∂(σ²) = 0 simultaneously we obtain

μ̂ = (1/n) ∑ᵢ₌₁ⁿ xᵢ = x̄

σ̂² = (1/n) ∑ᵢ₌₁ⁿ (xᵢ − μ̂)² = (1/n) ∑ᵢ₌₁ⁿ (xᵢ − x̄)².

Hence: the MLE of μ is the sample mean, but the MLE of σ² is (n − 1)s²/n. (More later.)
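A quick numerical illustration (a sketch with simulated data): the joint maximisers are x̄ and the “divide by n” variance, which differs from the sample variance s² by the factor (n − 1)/n.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=200)  # simulated sample
n = len(x)

mu_hat = x.mean()                        # MLE of mu
sigma2_hat = ((x - mu_hat) ** 2).mean()  # MLE of sigma^2 (divides by n)
s2 = x.var(ddof=1)                       # sample variance (divides by n-1)

print(mu_hat, sigma2_hat, (n - 1) / n * s2)  # last two values agree
```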
49
“Estimate” and “estimator”

Estimator:
• A rule for constructing an estimate.
• A function of the random variables X involved in the random
sample.
• Itself a random variable.
Estimate:
• The numerical value of the estimator for the particular data set.
• The value of the function evaluated at the data 𝑥1 , . . . , 𝑥 𝑛 .

50
Statistics Publications - Influenza transmission in Mexico
in 2009

Log likelihood profile of the influenza transmission rate


Source: Michael Springborn, Gerardo Chowell, Matthew MacLachlan, Eli P Fenichel (2015)
Accounting for behavioral responses during a flu epidemic using home television viewing. BMC
Infectious Diseases 15:21 https://doi.org/10.1186/s12879-014-0691-0

Copyright: © Springborn et al.; licensee BioMed Central. 2015. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.
51
Statistics Publications - Influenza transmission in Mexico
in 2009

Sometimes the multidimensional log likelihood surface is complex


Source: Michael Springborn, Gerardo Chowell, Matthew MacLachlan, Eli P Fenichel (2015)
Accounting for behavioral responses during a flu epidemic using home television viewing. BMC
Infectious Diseases 15:21 https://doi.org/10.1186/s12879-014-0691-0

Copyright: © Springborn et al.; licensee BioMed Central. 2015. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.
52
Statistics Publications - Influenza transmission in Mexico
in 2009

Often these surfaces have to be searched numerically to identify the


global maximum
Source: Michael Springborn, Gerardo Chowell, Matthew MacLachlan, Eli P Fenichel (2015)
Accounting for behavioral responses during a flu epidemic using home television viewing. BMC
Infectious Diseases 15:21 https://doi.org/10.1186/s12879-014-0691-0

Copyright: © Springborn et al.; licensee BioMed Central. 2015. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.
53
Statistics Publications - The maximum log likelihood is
easy to spot - Influenza transmission in Mexico in 2009

Log likelihood profile of the influenza transmission rate


Source: Michael Springborn, Gerardo Chowell, Matthew MacLachlan, Eli P Fenichel (2015)
Accounting for behavioral responses during a flu epidemic using home television viewing. BMC
Infectious Diseases 15:21 https://doi.org/10.1186/s12879-014-0691-0

Copyright: © Springborn et al.; licensee BioMed Central. 2015. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.
54
Statistics Publications - Sometimes a maximum looks
like a minimum - Influenza transmission in ferrets

Sometimes the log likelihood profile is flipped and even given relative
to its maximum
Source: Rebecca Frise et al. (2016) Contact transmission of influenza virus between ferrets imposes a
looser bottleneck than respiratory droplet transmission allowing propagation of antiviral resistance.
Scientific Reports volume 6, Article number: 29793

Copyright: This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line.
55
4. Parameter Estimation

56
Parameter Estimation - Earthquake example

Recall the earthquake example:

X₁, . . . , Xₙ ~iid exponential distribution, mean μ.

Possible estimators of μ:
• X̄
• (1/3)X₁ + (2/3)X₂
• X₁ + X₂ − X₃
• (2/(n(n + 1)))(X₁ + 2X₂ + · · · + nXₙ).
How should we choose?

57
In general, suppose 𝑋1 , . . . , 𝑋𝑛 is a random sample from a distribution
with p.d.f./p.m.f. 𝑓 (𝑥; 𝜃). We want to estimate 𝜃 from observations
𝑥1 , . . . , 𝑥 𝑛 .
A statistic is any function 𝑇(X) of 𝑋1 , . . . , 𝑋𝑛 that does not depend on 𝜃.
An estimator of 𝜃 is any statistic 𝑇(X) that we might use to estimate 𝜃.
𝑇(x) is the estimate of 𝜃 obtained via 𝑇 from observed values x.

T(X) is a random variable, e.g. X̄.
T(x) is a fixed number, based on data, e.g. x̄.

58
We can choose between estimators by studying their properties. A
good estimator should take values close to 𝜃.
The estimator 𝑇 = 𝑇(X) is said to be unbiased for 𝜃 if, whatever the true
value of 𝜃, we have 𝐸(𝑇) = 𝜃.
This means that “on average” 𝑇 is correct.

59
Example: Earthquakes
Possible estimators X̄, (1/3)X₁ + (2/3)X₂, etc.
Since E(Xᵢ) = μ, we have

E(X̄) = (1/n) ∑ᵢ₌₁ⁿ μ = μ

E((1/3)X₁ + (2/3)X₂) = (1/3)μ + (2/3)μ = μ.

Similar calculations show that X₁ + X₂ − X₃ and (2/(n(n + 1))) ∑ⱼ₌₁ⁿ jXⱼ are also unbiased.

60
Example: Normal variance
Suppose X₁, . . . , Xₙ ~iid N(μ, σ²), with μ and σ² unknown, and let
T = (1/n) ∑(Xᵢ − X̄)². Then T is the MLE of σ². Is T unbiased?
Let Zᵢ = (Xᵢ − μ)/σ. So the Zᵢ are independent and N(0, 1), E(Zᵢ) = 0,
var(Zᵢ) = E(Zᵢ²) = 1.

E[(Xᵢ − X̄)²] = E[σ²(Zᵢ − Z̄)²]

= σ² var(Zᵢ − Z̄)   since E(Zᵢ − Z̄) = 0

= σ² var(−(1/n)Z₁ − (1/n)Z₂ − · · · + ((n − 1)/n)Zᵢ + · · · − (1/n)Zₙ)

= σ² [(1/n²) var(Z₁) + (1/n²) var(Z₂) + · · · + ((n − 1)²/n²) var(Zᵢ) + · · · + (1/n²) var(Zₙ)]
   since var(∑ aᵢUᵢ) = ∑ aᵢ² var(Uᵢ) for independent Uᵢ

= σ² [(n − 1) × (1/n²) + (n − 1)²/n²] = ((n − 1)/n) σ².
61
So

E(T) = (1/n) ∑ᵢ₌₁ⁿ E[(Xᵢ − X̄)²] = ((n − 1)/n) σ² < σ².

Hence T is not unbiased: T will underestimate σ² on average.
But

S² = (1/(n − 1)) ∑ᵢ₌₁ⁿ (Xᵢ − X̄)² = (n/(n − 1)) T

so the sample variance satisfies

E(S²) = (n/(n − 1)) E(T) = σ².

So S² is unbiased for σ².

62
Uniform distribution – some unusual features!
Suppose X₁, . . . , Xₙ ~iid Uniform[0, θ], where θ > 0, i.e.

f(x; θ) = 1/θ if 0 ≤ x ≤ θ, and f(x; θ) = 0 otherwise.

What is the MLE for θ? Is the MLE unbiased?
Calculate the likelihood:

L(θ) = ∏ᵢ₌₁ⁿ f(xᵢ; θ)

= 1/θⁿ if 0 ≤ xᵢ ≤ θ for all i, and 0 otherwise

= 0 if 0 < θ < max xᵢ, and 1/θⁿ if θ ≥ max xᵢ.

Note: θ ≥ xᵢ for all i ⟺ θ ≥ max xᵢ. (And max xᵢ means the maximum of x₁, . . . , xₙ.)
63

Figure: Likelihood 𝐿(𝜃) for Uniform[0, 𝜃] example.

From the diagram:


• the max occurs at θ̂ = max xᵢ
• this is not a point where ℓ ′(𝜃) = 0
• taking logs doesn’t help.
64
Consider the range of values of 𝑥 for which 𝑓 (𝑥; 𝜃) > 0, i.e. 0 ≤ 𝑥 ≤ 𝜃.
The thing that makes this example different to our previous ones is
that this range depends on 𝜃 (and we must take this into account
because the likelihood is a function of 𝜃).

The MLE of θ is θ̂ = max Xᵢ. What is E(θ̂)?
Find the c.d.f. of θ̂:

F(y) = P(θ̂ ≤ y)
= P(max Xᵢ ≤ y)
= P(X₁ ≤ y, X₂ ≤ y, . . . , Xₙ ≤ y)
= P(X₁ ≤ y) P(X₂ ≤ y) · · · P(Xₙ ≤ y)   since the Xᵢ are independent
= (y/θ)ⁿ if 0 ≤ y ≤ θ, and 1 if y > θ.

65
So, differentiating the c.d.f., the p.d.f. is

f(y) = n y^(n−1)/θⁿ,   0 ≤ y ≤ θ.

So

E(θ̂) = ∫₀^θ y · (n y^(n−1)/θⁿ) dy = (n/θⁿ) ∫₀^θ yⁿ dy = nθ/(n + 1).

So θ̂ is not unbiased. But note that it is asymptotically unbiased:
E(θ̂) → θ as n → ∞.
In fact under mild assumptions MLEs are always asymptotically unbiased.
66
Further Properties of Estimators

The mean squared error (MSE) of an estimator 𝑇 is defined by

MSE(𝑇) = 𝐸[(𝑇 − 𝜃)2 ].

The bias 𝑏(𝑇) of 𝑇 is defined by

𝑏(𝑇) = 𝐸(𝑇) − 𝜃.

1 Both MSE(𝑇) and 𝑏(𝑇) may depend on 𝜃.


2 MSE is a measure of the “distance” between 𝑇 and 𝜃, so is a good
overall measure of performance.
3 𝑇 is unbiased if 𝑏(𝑇) = 0 for all 𝜃.

67
Theorem 4.1 MSE(𝑇) = var(𝑇) + [𝑏(𝑇)]2 .
Proof
Let 𝜇 = 𝐸(𝑇). Then

MSE(𝑇) = 𝐸[{(𝑇 − 𝜇) + (𝜇 − 𝜃)}2 ]


= 𝐸[(𝑇 − 𝜇)2 + 2(𝜇 − 𝜃)(𝑇 − 𝜇) + (𝜇 − 𝜃)2 ]
= 𝐸[(𝑇 − 𝜇)2 ] + 2(𝜇 − 𝜃)𝐸[𝑇 − 𝜇] + (𝜇 − 𝜃)2
= var(𝑇) + 2(𝜇 − 𝜃) × 0 + (𝜇 − 𝜃)2
= var(𝑇) + [𝑏(𝑇)]2 .

So an estimator with small MSE needs to have small variance and small
bias. Unbiasedness alone is not particularly desirable – it is the
combination of small variance and small bias which is important.

68
Reminder

Suppose 𝑎 1 , . . . , 𝑎 𝑛 are constants. It is always the case that

𝐸(𝑎1 𝑋1 + · · · + 𝑎 𝑛 𝑋𝑛 ) = 𝑎1 𝐸(𝑋1 ) + · · · + 𝑎 𝑛 𝐸(𝑋𝑛 ).

If 𝑋1 , . . . , 𝑋𝑛 are independent then

var(a₁X₁ + · · · + aₙXₙ) = a₁² var(X₁) + · · · + aₙ² var(Xₙ).

In particular, if X₁, . . . , Xₙ is a random sample with E(Xᵢ) = μ and var(Xᵢ) = σ², then

E(X̄) = μ and var(X̄) = σ²/n.

69
Uniform distribution
Suppose X₁, . . . , Xₙ ~iid Uniform[0, θ], i.e.

f(x; θ) = 1/θ if 0 ≤ x ≤ θ, and 0 otherwise.

We will consider two estimators of θ:
• T = 2X̄, the natural estimator based on the sample mean (because the mean of the distribution is θ/2)
• θ̂ = max Xᵢ, the MLE.
Now E(T) = 2E(X̄) = θ, so T is unbiased. Hence

MSE(T) = var(T) = 4 var(X̄) = 4 var(X₁)/n.
70
We have E(X₁) = θ/2 and

E(X₁²) = ∫₀^θ x² (1/θ) dx = θ²/3

so

var(X₁) = θ²/3 − (θ/2)² = θ²/12

hence

MSE(T) = 4 var(X₁)/n = θ²/(3n).

71
Previously we showed that θ̂ has p.d.f.

f(y) = n y^(n−1)/θⁿ,   0 ≤ y ≤ θ

and E(θ̂) = nθ/(n + 1). So b(θ̂) = nθ/(n + 1) − θ = −θ/(n + 1). Also,

E(θ̂²) = ∫₀^θ y² (n y^(n−1)/θⁿ) dy = nθ²/(n + 2)

so

var(θ̂) = θ² [n/(n + 2) − n²/(n + 1)²] = nθ²/((n + 1)²(n + 2))

72
hence

MSE(θ̂) = var(θ̂) + [b(θ̂)]²
= 2θ²/((n + 1)(n + 2))
< θ²/(3n) = MSE(T) for n ≥ 3.

• MSE(θ̂) ≪ MSE(T) for large n, so θ̂ is much better – its MSE
decreases like 1/n² rather than 1/n.
Remember: T = 2X̄ and θ̂ = max Xᵢ.
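The comparison is easy to reproduce by simulation (a sketch; θ = 1, n = 50 and the number of replications are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 1.0, 50, 100_000
x = rng.uniform(0, theta, size=(reps, n))

T = 2 * x.mean(axis=1)      # T = 2 * sample mean
theta_hat = x.max(axis=1)   # MLE: maximum of the sample

def mse(est):
    return ((est - theta) ** 2).mean()

print(mse(T), theta**2 / (3 * n))                          # ~ theta^2/(3n)
print(mse(theta_hat), 2 * theta**2 / ((n + 1) * (n + 2)))  # much smaller
```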

73
• Note that ((n + 1)/n)θ̂ is unbiased and

MSE(((n + 1)/n)θ̂) = var(((n + 1)/n)θ̂)
= ((n + 1)²/n²) var(θ̂)
= θ²/(n(n + 2))
< MSE(θ̂) for n ≥ 2.

However, among all estimators of the form λθ̂, the MSE is minimized by ((n + 2)/(n + 1))θ̂.
[To show this: note var(λθ̂) = λ² var(θ̂) and b(λθ̂) = λnθ/(n + 1) − θ.
Now plug in the formulae and minimise over λ.]

74
Estimating the parameter of Uniform[0, 𝜃]
[Histograms of the sampling distributions of the estimators max(xᵢ) and 2x̄, for samples of size n = 10, 100, 1000 from Uniform[0, θ] with θ = 1.]
75
Estimating the parameter of Uniform[0, 𝜃]
[The same histograms plotted on a common horizontal scale (0.4 to 1.6).]
76
Estimation so far

So far we’ve considered getting an estimate which is a single number –


a point estimate – for the parameter of interest: e.g. x̄, max xᵢ, s², . . . .
Maximum likelihood is a good way (usually) of producing an estimate
(but we did better when the range of the distribution depends on 𝜃 –
fairly unusual).
MLEs are usually asymptotically unbiased, and have MSE decreasing
like 1/𝑛 for large 𝑛.
MLEs can be found in quite general situations.

77
Statistics Publications - Maximum likelihood in
phylogenetics

Source: Placide Mbala-Kingebeni et al. (2019) Rapid Confirmation of the Zaire Ebola Virus in the Outbreak of the Equateur Province in the

Democratic Republic of Congo: ... , Clinical Infectious Diseases 68(2): 330–333, https://doi.org/10.1093/cid/ciy527

Copyright: © The Author(s) 2018. Published by Oxford University Press for the Infectious Diseases Society of America. This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs licence.
78
Statistics Publications - Maximum likelihood in object
recognition

Copyright: © 2017 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License

http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.

79
Statistics Publications - Radiation research

Source: Leslie Stayner et al. (2007) A Monte Carlo Maximum Likelihood Method for Estimating Uncertainty Arising from Shared Errors in

Exposures in Epidemiological Studies of Nuclear Workers. Radiation Research, 168(6):757-763. https://doi.org/10.1667/RR0677.1

"Usage of BioOne Complete content is strictly limited to personal, educational, and non-commercial used."

https://bioone.org/journals/Radiation-Research/volume-168/issue-6/RR0677.1/A-Monte-Carlo-Maximum-Likelihood-Method-for-Estimating-Uncertainty-Arising/10.1667/RR0677.1.full
80
5. Accuracy of Estimation

81
Accuracy of estimation: Confidence Intervals

A crucial aspect of statistics is not just to estimate a quantity of interest,


but to assess how accurate or precise that estimate is. One approach is to
find an interval, called a confidence interval (CI) within which we think
the true parameter falls.

82
Theorem 5.1

Suppose a₁, . . . , aₙ are constants and that X₁, . . . , Xₙ are independent
with Xᵢ ~ N(μᵢ, σᵢ²). Let Y = ∑ᵢ₌₁ⁿ aᵢXᵢ. Then

Y ~ N(∑ᵢ₌₁ⁿ aᵢμᵢ, ∑ᵢ₌₁ⁿ aᵢ²σᵢ²).

Proof: See Sheet 1.


We know from Prelims Probability how to calculate the mean and
variance of 𝑌. The additional information here is that 𝑌 has a normal
distribution, i.e. “a linear combination of normals is itself normal.”

83
Example
Let X₁, . . . , Xₙ ~iid N(μ, σ₀²) where μ is unknown and σ₀² is known.
What can we say about μ?
By Theorem 5.1,

∑ᵢ₌₁ⁿ Xᵢ ~ N(nμ, nσ₀²)

X̄ ~ N(μ, σ₀²/n).

So, standardising X̄,

X̄ − μ ~ N(0, σ₀²/n)

(X̄ − μ)/(σ₀/√n) ~ N(0, 1).

Now if Z ~ N(0, 1), then P(−1.96 < Z < 1.96) = 0.95.


84

Figure: Standard normal p.d.f.: the shaded area, i.e. the area under the curve
from 𝑥 = −1.96 to 𝑥 = 1.96, is 0.95.

85
So

P(−1.96 < (X̄ − μ)/(σ₀/√n) < 1.96) = 0.95

P(−1.96 σ₀/√n < X̄ − μ < 1.96 σ₀/√n) = 0.95

P(X̄ − 1.96 σ₀/√n < μ < X̄ + 1.96 σ₀/√n) = 0.95

P(the interval X̄ ± 1.96 σ₀/√n contains μ) = 0.95.

Note that we write X̄ ± 1.96 σ₀/√n to mean the interval

(X̄ − 1.96 σ₀/√n, X̄ + 1.96 σ₀/√n).

This interval is a random interval since its endpoints involve X̄ (and X̄
is a random variable). It is an example of a confidence interval.
86
Confidence Intervals: general definition

Definition
If a(X) and b(X) are two statistics, and 0 < α < 1, the interval
(a(X), b(X)) is called a confidence interval for θ with confidence level
1 − α if, for all θ,

P(a(X) < θ < b(X)) = 1 − α.

The interval (a(X), b(X)) is also called a 100(1 − α)% CI, e.g. a “95%
confidence interval” if α = 0.05.
Usually we are interested in small values of α: the most commonly
used values are 0.05 and 0.01 (i.e. confidence levels of 95% and 99%)
but there is nothing special about any confidence level.

87
The interval (a(x), b(x)) is called an interval estimate and the random
interval (a(X), b(X)) is called an interval estimator.
Note: a(X) and b(X) do not depend on θ.
We would like to construct a(X) and b(X) so that:
• the width of the interval (a(X), b(X)) is small
• the probability P(a(X) < θ < b(X)) is large.




88
Percentage points of normal distribution
For any 𝛼 with 0 < 𝛼 < 1, let 𝑧 𝛼 be the constant such that Φ(𝑧 𝛼 ) = 1 − 𝛼,
where Φ is the 𝑁(0, 1) c.d.f. (i.e. if 𝑍 ∼ 𝑁(0, 1) then 𝑃(𝑍 > 𝑧 𝛼 ) = 𝛼).


Figure: Standard normal p.d.f.: the shaded area, i.e. the area under the curve to
the right of 𝑧 𝛼 , is 𝛼.

89
We call z_α the “1 − α quantile of N(0, 1).”

α:    0.1   0.05   0.025   0.005
z_α:  1.28  1.64   1.96    2.58

By the same argument as before, if X₁, . . . , Xₙ ~iid N(μ, σ₀²) with σ₀²
known, then a level 1 − α confidence interval for μ is

(X̄ ± z_{α/2} σ₀/√n).

90
Oxford rainfall, annual, 20 years

[Scatterplot of annual rainfall (mm) against year.]
91
Oxford rainfall, annual, 20 years

We have rainfall amounts (in mm) x₁, . . . , xₙ where n = 20 and
x̄ = 688.4. We assume σ₀ = 130.

• The endpoints of a 95% CI are

x̄ ± 1.96 σ₀/√n = 688.4 ± 57.0.

So a 95% CI for μ is (631.4, 745.4).

• A 99% CI is (x̄ ± 2.58 σ₀/√n) = (613.4, 763.4).
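The arithmetic is easy to reproduce (a sketch; only x̄, n and σ₀ from this slide are needed):

```python
from math import sqrt

xbar, n, sigma0 = 688.4, 20, 130.0
se = sigma0 / sqrt(n)  # standard deviation of the sample mean

for z, level in [(1.96, "95%"), (2.58, "99%")]:
    print(level, (round(xbar - z * se, 1), round(xbar + z * se, 1)))
# 95% (631.4, 745.4); 99% (613.4, 763.4)
```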

92
Confidence interval example

When X₁, . . . , Xₙ ~iid N(μ, σ₀²), a 90% CI for μ is

(a(X), b(X)) = (X̄ − 1.64 σ₀/√n, X̄ + 1.64 σ₀/√n).

So a 90% interval estimate based on data x is (a(x), b(x)).




93
90% CIs from samples of size 15 from N(10, 4)

[100 simulated 90% confidence intervals plotted against sample number.]

94
The symmetric confidence interval for 𝜇

x̄ ± 1.96 σ₀/√n

is called a central confidence interval for μ.

Suppose now that 𝑐 and 𝑑 are any constants such that


𝑃(−𝑐 < 𝑍 < 𝑑) = 1 − 𝛼
for 𝑍 ∼ 𝑁(0, 1).
Figure: Standard normal p.d.f.: the shaded area under the curve from −c to d is 1 − α.
95

Then

P(X̄ − dσ₀/√n < μ < X̄ + cσ₀/√n) = 1 − α.

The choice c = d = z_{α/2} gives the shortest such interval.

96
One-sided confidence limits
Continuing our normal example, we have

P((X̄ − μ)/(σ₀/√n) > −z_α) = 1 − α

so

P(μ < X̄ + z_α σ₀/√n) = 1 − α

and so (−∞, X̄ + z_α σ₀/√n) is a “one-sided” confidence interval. We call

X̄ + z_α σ₀/√n

an upper 1 − α confidence limit for μ.

Similarly

P(μ > X̄ − z_α σ₀/√n) = 1 − α

and X̄ − z_α σ₀/√n is a lower 1 − α confidence limit for μ.

97
Interpretation of a Confidence Interval
• The parameter 𝜃 is fixed but unknown.
• If we imagine repeating our experiment then we’d get new data,
x′ = (x₁′, . . ., xₙ′) say, and hence we’d get a new confidence interval
(a(x′), b(x′)). If we did this repeatedly we would “catch” the true
parameter value about 95% of the time, for a 95% confidence
interval: i.e. about 95% of our intervals would contain θ.
• The confidence level is a coverage probability, the probability that
the random confidence interval (a(X), b(X)) covers the true θ. (It’s
a random interval because the endpoints a(X), b(X) are random
variables.)
But note that the interval (a(x), b(x)) is not a random interval, e.g.
(a(x), b(x)) = (631.4, 745.4) in the rainfall example. So it is wrong to say
that (a(x), b(x)) contains θ with probability 1 − α: this interval, e.g.
(631.4, 745.4), either definitely does or definitely does not contain θ, but
we can’t say which of these two possibilities is true as θ is unknown.

98
The Central Limit Theorem (CLT)
We know that if X₁, . . . , Xₙ ~iid N(μ, σ²) then

(X̄ − μ)/(σ/√n) ~ N(0, 1).

Theorem 5.2 Let X₁, . . . , Xₙ be i.i.d. from any distribution with mean μ
and variance σ² ∈ (0, ∞). Then, for all x,

P((X̄ − μ)/(σ/√n) ≤ x) → Φ(x) as n → ∞.

Here, as usual, Φ is the N(0, 1) c.d.f. So for large n, whatever the
distribution of the Xᵢ,

(X̄ − μ)/(σ/√n) ≈ N(0, 1)

where ≈ means “has approximately the same distribution as.” Usually
n > 30 is ok for a reasonable approximation.
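Plots like those on the following slides can be reproduced with a few lines of simulation (a sketch; the exponential case is shown, and the third central moment is used as a rough measure of departure from normality):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 1.0, 1.0  # Exponential(1) has mean 1 and variance 1

for n in (1, 10, 30, 200):
    x = rng.exponential(scale=1.0, size=(50_000, n))
    z = (x.mean(axis=1) - mu) / (sigma / np.sqrt(n))  # standardised mean
    skew = ((z - z.mean()) ** 3).mean()  # shrinks towards 0 (the N(0,1) value)
    print(n, round(float(skew), 3))
```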
99
CLT with 𝑋𝑖 ∼ 𝑁(7, 9)

[Densities of the standardised sample mean for n = 1, 10, 30, 200.]

100
CLT with 𝑋𝑖 ∼ 𝑈(0, 1)

[Densities of the standardised sample mean for n = 1, 10, 30, 200.]

101
CLT with 𝑋𝑖 ∼ Exponential(1)

[Densities of the standardised sample mean for n = 1, 10, 30, 200.]

102
CLT with 𝑋𝑖 ∼ Pareto(1, 3)

[Densities of the standardised sample mean for n = 1, 10, 30, 200.]

103
With X₁, X₂, . . . i.i.d. from any distribution with E(Xᵢ) = μ,
var(Xᵢ) = σ²:
• the weak law of large numbers (Prelims Probability) tells us that
the distribution of X̄ concentrates around μ as n becomes large,
i.e. for ε > 0, we have P(|X̄ − μ| > ε) → 0 as n → ∞
• the CLT adds to this
– the fluctuations of X̄ around μ are of order 1/√n
– the asymptotic distribution of these fluctuations is normal.

104
Example Suppose X₁, . . . , Xₙ ~iid exponential with mean μ, e.g.
Xᵢ = survival time of patient i. So

f(x; μ) = (1/μ) e^(−x/μ),   x ≥ 0

and E(Xᵢ) = μ, var(Xᵢ) = μ².
For large n, by the CLT,

(X̄ − μ)/(μ/√n) ≈ N(0, 1).

So

P(−z_{α/2} < (X̄ − μ)/(μ/√n) < z_{α/2}) ≈ 1 − α

P(μ(1 − z_{α/2}/√n) < X̄ < μ(1 + z_{α/2}/√n)) ≈ 1 − α

P(X̄/(1 + z_{α/2}/√n) < μ < X̄/(1 − z_{α/2}/√n)) ≈ 1 − α.

105
Hence an approximate 1 − α CI for μ is

(X̄/(1 + z_{α/2}/√n), X̄/(1 − z_{α/2}/√n)).

Numerically, if we have n = 100 patients and α = 0.05 (so z_{α/2} = 1.96), then

(0.84x̄, 1.24x̄)

is an approximate 95% CI for μ.

106
Example: Opinion poll
In an opinion poll, suppose 321 of 1003 voters said they would vote for
Party X. What’s the underlying level of support for Party X?
Let X₁, . . . , Xₙ be a random sample from the Bernoulli(p) distribution, i.e.

P(Xᵢ = 1) = p,   P(Xᵢ = 0) = 1 − p.

The MLE of p is X̄. Also E(Xᵢ) = p and var(Xᵢ) = p(1 − p) = σ²(p), say.
For large n, by the CLT,

(X̄ − p)/(σ(p)/√n) ≈ N(0, 1).

So

1 − α ≈ P(−z_{α/2} < (X̄ − p)/(σ(p)/√n) < z_{α/2})
= P(X̄ − z_{α/2} σ(p)/√n < p < X̄ + z_{α/2} σ(p)/√n).
107
The interval (X̄ ± z_{α/2} σ(p)/√n) has approximate probability 1 − α of
containing the true p, but it is not a confidence interval since its
endpoints depend on p via σ(p).
To get an approximate confidence interval:
• either, solve the inequality to get P(a(X) < p < b(X)) ≈ 1 − α
where a(X), b(X) don’t depend on p
• or, estimate σ(p) by σ(p̂) = √(p̂(1 − p̂)) where p̂ = x̄, the MLE. This
gives endpoints

p̂ ± z_{α/2} √(p̂(1 − p̂)/n).

For n = 1003 and p̂ = x̄ = 321/1003, this gives a 95% CI for p of
(0.29, 0.35) using the two approximations: (i) the CLT and (ii) σ(p)
approximated by σ(p̂).
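Reproducing the plug-in interval (a sketch of the computation above):

```python
from math import sqrt

r, n, z = 321, 1003, 1.96
p_hat = r / n
se = sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error of p-hat

print(round(p_hat - z * se, 2), round(p_hat + z * se, 2))  # 0.29 0.35
```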

108
Opinion polls often mention “±3% error.”
Note that for any p,

σ²(p) = p(1 − p) ≤ 1/4

since p(1 − p) is maximised at p = 1/2. Then we have

P(p̂ − 0.03 < p < p̂ + 0.03) = P(−0.03/(σ(p)/√n) < (p̂ − p)/(σ(p)/√n) < 0.03/(σ(p)/√n))

≈ Φ(0.03/(σ(p)/√n)) − Φ(−0.03/(σ(p)/√n))

≥ Φ(0.03√(4n)) − Φ(−0.03√(4n)).

For this probability to be at least 0.95 we need 0.03√(4n) ≥ 1.96, or
n ≥ 1068. Opinion polls typically use n ≈ 1100.

109
Standard errors

Definition Let θ̂ be an estimator of θ based on X. The standard error
SE(θ̂) of θ̂ is defined by

SE(θ̂) = √(var(θ̂)).

Example

• Let X₁, . . . , Xₙ ~iid N(μ, σ²). Then μ̂ = X̄ and var(μ̂) = σ²/n. So
SE(μ̂) = σ/√n.

• Let X₁, . . . , Xₙ ~iid Bernoulli(p). Then p̂ = X̄ and
var(p̂) = p(1 − p)/n. So SE(p̂) = √(p(1 − p)/n).

110
Sometimes SE(θ̂) depends on θ itself, meaning that SE(θ̂) is unknown.
In such cases we have to plug in parameter estimates to get the
estimated standard error, e.g. the estimated standard errors
SE(X̄) = σ̂/√n and SE(p̂) = √(p̂(1 − p̂)/n).
The values plugged in (σ̂ and p̂ above) could be maximum likelihood,
or other, estimates.
(We could write SÊ(p̂), i.e. with a hat on the SE, to denote that p̂ has
been plugged in, but this is ugly so we won’t; we’ll just write SE(p̂).)

111
If θ̂ is unbiased, then MSE(θ̂) = var(θ̂) = [SE(θ̂)]². So the standard
error (or estimated standard error) gives some quantification of the
accuracy of estimation.

If in addition θ̂ is approximately N(θ, SE(θ̂)²) then, by the arguments
used above, an approximate 1 − α CI for θ is given by (θ̂ ± z_{α/2} SE(θ̂)),
where again we might need to plug in to obtain the estimated standard
error. Since, roughly, z_{0.025} = 2 and z_{0.001} = 3,

(estimate ± 2 estimated std errors) is an approximate 95% CI

(estimate ± 3 estimated std errors) is an approximate 99.8% CI.

The CLT justifies the normal approximation for θ̂ = X̄, but
θ̂ ≈ N(θ, SE(θ̂)²) is also appropriate for more general MLEs by other
theory (see Part A).

112
Statistics Publications - ADHD in psychiatric patients

Attention-deficit/hyperactivity disorder diagnoses


Walter Deberdt et al. (2015) Prevalence of ADHD in nonpsychotic adult psychiatric care (ADPSYC): A
multinational cross-sectional study in Europe. BMC Psychiatry15:242
https://doi.org/10.1186/s12888-015-0624-5

This article is distributed under the terms of the Creative Commons Attribution 4.0 International License

(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided

you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes

were made.
113
Statistics Publications - ADHD in psychiatric patients
We can reconstruct the confidence intervals reported for the percentage
diagnosed with ADHD. There was an additional note at the bottom of
the table:

Let us consider those diagnosed using the DSM-IV-TR criteria. There
were 2009 patients analysed and it was reported that 15.8% were
diagnosed with ADHD.
We know that for X₁, . . . , Xₙ ~iid Bernoulli(p):

p̂ = x̄ = 318/2009 = 0.1582

SE(p̂) = √(0.1582(1 − 0.1582)/2009) = 0.0081.

Using p̂ ± 1.96 SE(p̂) gives the interval (0.1423, 0.1742), matching what
was reported in the paper.

114
Example Suppose X₁, . . . , Xₙ ~iid N(μ, σ²) with μ and σ² unknown.
The MLEs are μ̂ = X̄, σ̂² = (1/n) ∑(Xᵢ − X̄)², and SE(μ̂) = σ/√n is
unknown because σ is unknown. So to use (μ̂ ± z_{α/2} SE(μ̂)) as the basis
for a confidence interval we need to estimate σ. One possibility is to
use σ̂ and so get the interval

(μ̂ ± z_{α/2} σ̂/√n).

However we can improve on this:

(i) use s (the sample standard deviation) instead of σ̂ (better as
E(S²) = σ² whereas E(σ̂²) < σ²)
(ii) use a critical value from a t-distribution instead of z_{α/2} – see Part
A Statistics (better as (X̄ − μ)/(S/√n) has a t-distribution, exactly,
whereas using the CLT is an approximation).

115
Statistics Publications - Barn Owl Productivity

Pavluvčík P, Poprach K, Machar I, Losík J, Gouveia A, Tkadlec E (2015) Barn Owl Productivity
Response to Variability of Vole Populations. PLoS ONE 10(12): e0145851.
https://doi.org/10.1371/journal.pone.0145851

Copyright: © 2015 Pavluvčík et al. This is an open access article distributed under the terms of the Creative Commons Attribution License,

which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

116
Statistics Publications - Barn Owl Productivity

So for clutches we know that μ̂ = 1.31, n = 2071 and SE(μ̂) = 0.010.
Thus, the 95% confidence interval is (1.31 ± 1.96 × 0.010) = (1.29, 1.33).

And σ̂ = 0.010 × √2071 = 0.455.

Pavluvčík et al. 2015 https://doi.org/10.1371/journal.pone.0145851

Copyright: © 2015 Pavluvčík et al. This is an open access article distributed under the terms of the Creative Commons Attribution License,

which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

117
6. Linear Regression

118
Linear Regression

Suppose we measure two variables in the same population:


𝑥 the explanatory variable
𝑦 the response variable.
Other possible names for 𝑥 are the predictor or feature or input variable or
independent variable. Other possible names for 𝑦 are the output variable
or dependent variable.

119
CO2 emissions versus GDP
Let 𝑥 measure the GDP per head, and 𝑦 the CO2 emissions per head,
for 178 countries.



[Scatterplot of CO2 emissions per head against GDP per head.]
120
Questions of interest:
For fixed 𝑥, what is the average value of 𝑦?
How does that average value change with 𝑥?
A simple model for the dependence of 𝑦 on 𝑥 is

𝑦 = 𝛼 + 𝛽𝑥 + "error".

Note: a linear relationship like this does not necessarily imply that 𝑥
causes 𝑦.

121
More precise model
We regard the values of 𝑥 as being fixed and known, and we regard the
values of 𝑦 as being the observed values of random variables.
We suppose that

𝑌𝑖 = 𝛼 + 𝛽𝑥 𝑖 + 𝜖 𝑖 , 𝑖 = 1, . . . , 𝑛 (2)

where

𝑥1 , . . . , 𝑥 𝑛 are known constants


𝜖1 , . . . , 𝜖 𝑛 are i.i.d. 𝑁(0, 𝜎 2 ) "random errors"
𝛼, 𝛽 are unknown parameters.

The “random errors” 𝜖1 , . . . , 𝜖 𝑛 represent random scatter of the points


(𝑥 𝑖 , 𝑦 𝑖 ) about the line 𝑦 = 𝛼 + 𝛽𝑥, we do not expect these points to lie
on a perfect straight line.
Sometimes we will refer to the above model as being 𝑌 = 𝛼 + 𝛽𝑥 + 𝜖.

122
US city temperature data

[Scatterplot of 30-year average minimum January temperature (°C) against latitude (degrees), with selected cities labelled: Miami FL, Houston TX, San Francisco CA, Albuquerque NM, Boise ID, Minneapolis MN.]


123
Auto data


[Scatterplot of miles per gallon against horsepower.]

124
𝑌 = 𝛼 + 𝛽𝑥 + 𝜖 for the CO2 example

1 The values 𝑦1 , . . . , 𝑦𝑛 (e.g. the CO2 emissions per head in the


various countries) are the observed values of random variables
𝑌1 , . . . , 𝑌𝑛 .
2 The values 𝑥1 , . . . , 𝑥 𝑛 (e.g. the GDP per head in the various
countries) do not correspond to random variables. They are fixed
and known constants.

125
Questions:
• How do we estimate 𝛼 and 𝛽?
• Does the mean of 𝑌 actually depend on the value of 𝑥? i.e. is
𝛽 ≠ 0?
We now find the MLEs of 𝛼 and 𝛽, and we regard 𝜎 2 as being known.
The MLEs of 𝛼 and 𝛽 are the same if 𝜎2 is unknown. If 𝜎2 is unknown,
then we simply maximise over 𝜎2 as well to obtain its MLE – this is no
harder than what we do here (try it!). However, working out all of the
properties of this MLE is harder and beyond what we can do in this
course.

126
We have Yᵢ ~ N(α + βxᵢ, σ²). So Yᵢ has p.d.f.

f_{Yᵢ}(yᵢ) = (1/√(2πσ²)) exp(−(yᵢ − α − βxᵢ)²/(2σ²)),   −∞ < yᵢ < ∞.

So the likelihood L(α, β) is

L(α, β) = ∏ᵢ₌₁ⁿ (1/√(2πσ²)) exp(−(yᵢ − α − βxᵢ)²/(2σ²))

= (2πσ²)^(−n/2) exp(−(1/(2σ²)) ∑ᵢ₌₁ⁿ (yᵢ − α − βxᵢ)²)

with log-likelihood

ℓ(α, β) = −(n/2) log(2πσ²) − (1/(2σ²)) ∑ᵢ₌₁ⁿ (yᵢ − α − βxᵢ)².

127
ℓ(α, β) = −(n/2) log(2πσ²) − (1/(2σ²)) ∑ᵢ₌₁ⁿ (yᵢ − α − βxᵢ)².

∑ᵢ₌₁ⁿ (yᵢ − α − βxᵢ)² = ∑ᵢ₌₁ⁿ (yᵢ − (α + βxᵢ))².

So maximising ℓ(α, β) over α and β is equivalent to minimising the sum
of squares

S(α, β) = ∑ᵢ₌₁ⁿ (yᵢ − α − βxᵢ)².

For this reason the MLEs of α and β are also called least squares
estimators.

128
What does simple linear regression do?
Minimise the sum of squared vertical distances from the points to the
line 𝑦 = 𝛼 + 𝛽𝑥.



[Scatterplot with a fitted line; the vertical segments from each point to the line are the distances being squared.]
129
Theorem 6.1

The MLEs (or, equivalently, the least squares estimates) of 𝛼 and 𝛽 are
given by

α̂ = ((∑xᵢ²)(∑yᵢ) − (∑xᵢ)(∑xᵢyᵢ)) / (n∑xᵢ² − (∑xᵢ)²)

β̂ = (n∑xᵢyᵢ − (∑xᵢ)(∑yᵢ)) / (n∑xᵢ² − (∑xᵢ)²).

These sums are from 𝑖 = 1 to 𝑛.
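A sketch of Theorem 6.1 in code (the data are invented; np.polyfit acts as an independent check of the formulas):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.arange(1.0, 21.0)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)  # invented data

n = x.size
denom = n * (x**2).sum() - x.sum()**2
beta_hat = (n * (x * y).sum() - x.sum() * y.sum()) / denom
alpha_hat = ((x**2).sum() * y.sum() - x.sum() * (x * y).sum()) / denom

print(alpha_hat, beta_hat)
print(np.polyfit(x, y, 1))  # returns (slope, intercept): the same values
```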

130
Proof of Theorem 6.1
To find α̂ and β̂ we calculate

∂S/∂α = −2 ∑(yᵢ − α − βxᵢ)

∂S/∂β = −2 ∑ xᵢ(yᵢ − α − βxᵢ).

Putting these partial derivatives equal to zero, the minimisers α̂ and β̂
satisfy

nα̂ + β̂ ∑xᵢ = ∑yᵢ

α̂ ∑xᵢ + β̂ ∑xᵢ² = ∑xᵢyᵢ.

Solving this pair of simultaneous equations for α̂ and β̂ gives the
required MLEs.

131
Sometimes we consider the model

Yᵢ = a + b(xᵢ − x̄) + εᵢ,   i = 1, . . . , n

and find the MLEs of a and b by minimising ∑(yᵢ − a − b(xᵢ − x̄))².
This model is just an alternative parametrisation of our original model:
the first model is

Y = α + βx + ε

and the second is

Y = a + b(x − x̄) + ε = (a − bx̄) + bx + ε.

Here Y, x denote general values of Y, x (and x̄ = (1/n) ∑xᵢ is the mean of
the n data values of x). Comparing the two model equations, b = β
and a − bx̄ = α.
The interpretation of the parameters is that β = b is the increase in E(Y)
when x increases by 1. The parameter α is the value of E(Y) when x is
0; whereas a is the value of E(Y) when x is x̄.
132
Statistics Publications - Estimating dry mass in plants

Estimation of the relationship between individual plant height and


aboveground dry mass
Laís Samira Correia Nunes, Antonio Fernando Monteiro Camargo (2017) A simple non-destructive
method for estimating aboveground biomass of emergent aquatic macrophytes. Acta Limnologica
Brasiliensia vol.29 http://dx.doi.org/10.1590/s2179-975x6416

This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use,

distribution and reproduction in any medium, provided the original work is properly cited.

133
Statistics Publications - Eye measurements

Relationships between the peripapillary retinal nerve fiber layer (RNFL, a,b,c)
and the thickness of the ganglion cell-inner plexiform layer (GCIPL, d,e,f) and
other eye characteristics (important for evaluation of glaucoma)
Sam Seo et al. (2017) Ganglion cell-inner plexiform layer and retinal nerve fiber layer thickness according to myopia and optic disc area: a

quantitative and three-dimensional analysis. BMC Ophthalmology 17:22 https://doi.org/10.1186/s12886-017-0419-1

This article is distributed under the terms of the Creative Commons Attribution 4.0 International License

(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
134
Alternative expressions for α̂ and β̂ are

β̂ = ∑(xᵢ − x̄)(yᵢ − ȳ) / ∑(xᵢ − x̄)²   (3)

 = ∑(xᵢ − x̄)yᵢ / ∑(xᵢ − x̄)²   (4)

α̂ = ȳ − β̂x̄.

The above alternative for α̂ follows directly from ∂S/∂α = 0.
To obtain the alternatives for β̂: Theorem 6.1 gives

β̂ = (∑xᵢyᵢ − (1/n)(∑xᵢ)(∑yᵢ)) / (∑xᵢ² − (1/n)(∑xᵢ)²)

 = (∑xᵢyᵢ − n x̄ȳ) / (∑xᵢ² − n x̄²).   (5)

Now check that the numerators and denominators in (3) and (5) are the
same. Then observe that the numerators of (3) and (4) differ by
∑(xᵢ − x̄)ȳ, which is 0.
135
The fitted regression line is the line y = α̂ + β̂x.
The point (x̄, ȳ) always lies on this line.




136
Bias of regression parameter estimates
Let wᵢ = xᵢ − x̄ and note ∑wᵢ = 0.
From (4) the maximum likelihood estimator of β is

β̂ = (1/∑wᵢ²) ∑ wᵢYᵢ

so

E(β̂) = (1/∑wᵢ²) E(∑ wᵢYᵢ) = (1/∑wᵢ²) ∑ wᵢ E(Yᵢ).

Note E(Yᵢ) = α + βxᵢ = α + βx̄ + βwᵢ (using xᵢ = wᵢ + x̄), and so

E(β̂) = (1/∑wᵢ²) ∑ wᵢ(α + βx̄ + βwᵢ)

= (1/∑wᵢ²) [(α + βx̄) ∑wᵢ + β ∑wᵢ²] = β

since ∑wᵢ = 0, so β̂ is unbiased.
137
Bias of regression parameter estimates
Also α̂ = Ȳ − β̂x̄ and so

E(α̂) = E(Ȳ) − x̄ E(β̂)
= (1/n) ∑ E(Yᵢ) − βx̄   since E(β̂) = β
= (1/n) ∑ (α + βxᵢ) − βx̄
= (1/n) · n(α + βx̄) − βx̄
= α.

So α̂ and β̂ are unbiased.
Note the unbiasedness of α̂, β̂ does not depend on the assumptions
that the εᵢ are independent, normal and have the same variance, only
on the assumptions that the errors are additive and E(εᵢ) = 0.
138
Variance of regression parameter estimates
We are usually only interested in the variance of β̂:

var(β̂) = var(∑ wᵢYᵢ / ∑wᵢ²)
= (1/(∑wᵢ²)²) var(∑ wᵢYᵢ)
= (1/(∑wᵢ²)²) ∑ wᵢ² var(Yᵢ)
= (1/(∑wᵢ²)²) ∑ wᵢ² σ²
= σ² / ∑wᵢ².

Since β̂ is a linear combination of the independent normal random
variables Yᵢ, the estimator β̂ is itself normal: β̂ ~ N(β, σ_β²) where
σ_β² = σ²/∑wᵢ².
139
So the standard error of β̂ is σ_β and if σ² is known, then a 95% CI for β is

(β̂ ± 1.96 σ_β).

However, this is only a valid CI when σ² is known and, in practice, σ²
is rarely known.
For σ² unknown we need to plug in an estimate of σ², i.e. use
σ̂_β² = σ̂²/∑wᵢ² where σ̂² is some estimate of σ². For example we could
use the MLE, which is σ̂² = (1/n) ∑(yᵢ − α̂ − β̂xᵢ)². Using the (θ̂ ± 2 SE(θ̂))
approximation for a 95% confidence interval, we have that (β̂ ± 2σ̂_β) is
an approximate 95% confidence interval for β.

140
A better approach here, but beyond the scope of this course, is to
estimate σ² using

s² = (1/(n − 2)) ∑(yᵢ − α̂ − β̂xᵢ)²

and to base the confidence interval on a t-distribution rather than a
normal distribution. This estimator S² is unbiased for σ² (see Sheet 5),
but details about its distribution and the t-distribution are beyond this
course – see Parts A/B.

141
7. Multiple Linear Regression

142
Multiple linear regression – hill races data
Below are record times (in hours) for 23 hill races together with race
distance (in miles) and amount of climb (in feet).
dist climb time
Binevenagh 7.5 1740 0.8583
Slieve Gullion 4.2 1110 0.4667
Glenariff Mountain 5.9 1210 0.7031
Donard & Commedagh 6.8 3300 1.0386
McVeigh Classic 5.0 1200 0.5411
Tollymore Mountain 4.8 950 0.4833
Slieve Martin 4.3 1600 0.5506
Moughanmore 3.0 1500 0.4636
Hen & Cock 2.5 1500 0.4497
Annalong Horseshoe 12.0 5080 1.9492
Monument Race 4.0 1000 0.4717
Loughshannagh Horseshoe 4.3 1700 0.6469
Rocky 4.0 1300 0.5231
Meelbeg Meelmore 3.5 1800 0.4544
Donard Forest 4.5 1400 0.5186
Slieve Donard 5.5 2790 0.9483
Flagstaff to Carling 11.0 3000 1.4569
Slieve Bearnagh 4.0 2690 0.6878
Seven Sevens 18.9 8775 3.9028
Lurig Challenge 4.0 1000 0.4347
Scrabo Hill Race 2.9 750 0.3247
Slieve Gallion 4.6 1440 0.6361
BARF Turkey Trot 5.7 1430 0.7131
143
Scatterplots

[Scatterplot of time (hours) against dist (miles).]

• Time increases with distance and also with amount of climb.


• Most data points are crowded into the bottom LH corner of each plot – we
could transform time, dist, climb before fitting a model (and we will).

144
Scatterplots

[Scatterplot of time (hours) against climb (feet).]

• Time increases with distance and also with amount of climb.


• Most data points are crowded into the bottom LH corner of each plot – we
could transform time, dist, climb before fitting a model (and we will).

145
Multiple linear regression

For the hill races example, let Y = time, x₁ = distance, x₂ = climb.
We can consider a model in which 𝑌 depends on both 𝑥 1 and 𝑥2 , of the
form

𝑌 = 𝛽0 + 𝛽 1 𝑥1 + 𝛽 2 𝑥 2 + 𝜖.

This model has one response variable 𝑌 (as usual), but now we have
two explanatory variables 𝑥 1 and 𝑥2 , and three regression parameters
𝛽0 , 𝛽1 , 𝛽2 .

146
Let the ith race have time y_i, distance x_i1 and climb x_i2. Then in more
detail our model is

Y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ε_i,   i = 1, . . . , n

where

x_i1, x_i2, for i = 1, . . . , n, are known constants
ε_1, . . . , ε_n ~ iid N(0, σ²)
β_0, β_1, β_2 are unknown parameters

and (as usual) y_i denotes the observed value of the random variable Y_i.
As for simple linear regression we obtain the MLEs/least squares
estimates of β_0, β_1, β_2 by minimising

S(β) = Σ_{i=1}^n (y_i − β_0 − β_1 x_i1 − β_2 x_i2)²

with respect to β_0, β_1, β_2, i.e. by solving ∂S/∂β_k = 0 for k = 0, 1, 2.
As before, the only property of the ε_i needed to define the least squares
estimates is E(ε_i) = 0.
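
As a quick illustration (my own sketch, not from the original slides), the least squares fit can be computed numerically from the design matrix [1, x_1, x_2]; the data below are made up:

    import numpy as np

    # Least squares for Y = b0 + b1*x1 + b2*x2 + eps (illustrative data).
    rng = np.random.default_rng(2)
    n = 30
    x1 = rng.uniform(2, 20, n)            # e.g. distance
    x2 = rng.uniform(700, 9000, n)        # e.g. climb
    y = 0.1 + 0.1 * x1 + 2e-4 * x2 + rng.normal(scale=0.1, size=n)

    X = np.column_stack([np.ones(n), x1, x2])     # design matrix
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta_hat)                                # estimates of (b0, b1, b2)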
147
What does multiple linear regression do?
For two explanatory variables: minimise the sum of squared vertical
distances from the points to the plane 𝑦 = 𝛽 0 + 𝛽1 𝑥1 + 𝛽 2 𝑥2 .


[Figure from James, Witten, Hastie and Tibshirani (2013).]


148
Hill races data

Maindonald and Braun (2010) suggest taking logarithms and


considering 𝑌 = 𝛽 0 + 𝛽1 𝑥1 + 𝛽2 𝑥2 + 𝜖 where

𝑌 = log(time), 𝑥1 = log(dist), 𝑥2 = log(climb).

A couple of reasons why we might transform like this:

• The max value of time is more than 10 times the min value, with times
bunched up towards zero and with a long tail; similarly for dist and climb.
A log-transform will lead to a more symmetric spread.
• The longest race has a time more than double the next largest, and the
dist and climb of this race stand out similarly. Using untransformed
variables, this race will have much more say in determining the fitted
model than any other; taking logs will reduce this. (Can we be more
precise than “will have much more say”? Yes, see later.)

149
Untransformed scales
[Scatter plot matrix of time (hours), climb (feet) and dist (miles) on untransformed scales.]

150
Logarithmic scales
[Scatter plot matrix of time (log hours), climb (log feet) and dist (log miles) on logarithmic scales.]

151
The fitted model is

y = −4.96 + 0.68x_1 + 0.47x_2

log(time) = −4.96 + 0.68 log(dist) + 0.47 log(climb)
time = 0.0070 × dist^0.68 × climb^0.47

Some interpretation: for a given value of climb, if the distance doubles then
time increases by a factor of 2^0.68 = 1.60. Is this reasonable? If the distance
doubles, don’t we expect the time taken to be multiplied by more than 2?

The key thing is that climb is being held constant – so although doubling the
length seems to make a race more than twice as hard, the gradient is halved (if
climb is held constant), making the climbing aspect of the race easier. The
estimated effect overall is the factor of 1.60.

152
• We can fit a model depending on distance only

log(time) = −2.21 + 1.12 log(dist)

• or depending on climb only

log(time) = −7.10 + 0.90 log(climb)

• or depending on both distance and climb

log(time) = −4.96 + 0.68 log(dist) + 0.47 log(climb).

Observe that the estimated regression coefficients are different in each model,
e.g. the coefficient of log(dist) is different depending on whether log(climb) is
part of the model – because if log(climb) is included then it can help explain
some of the variation in log(time), whereas if log(climb) is absent then
log(dist) has to account for the variation in log(time) on its own.

153
A general multiple regression model has 𝑝 explanatory variables
(𝑥1 , . . . , 𝑥 𝑝 ),

𝑌 = 𝛽0 + 𝛽1 𝑥1 + · · · + 𝛽 𝑝 𝑥 𝑝 + 𝜖

and the MLEs/least squares estimates are obtained by minimising


Σ_{i=1}^n (y_i − β_0 − β_1 x_i1 − · · · − β_p x_ip)²

with respect to 𝛽 0 , 𝛽 1 , . . . , 𝛽 𝑝 . In this course we will focus on 𝑝 = 1 or 2.

154
Statistics Publications - Multiple regression for monthly
hatching success of leatherback turtles

Santidrián Tomillo P, Saba VS, Blanco GS, Stock CA, Paladino FV, Spotila JR (2012) Climate Driven
Egg and Hatchling Mortality Threatens Survival of Eastern Pacific Leatherback Turtles. PLoS ONE
7(5): e37602. https://doi.org/10.1371/journal.pone.0037602

Copyright: © 2012 Santidrián Tomillo et al. This is an open-access article distributed under the terms of the Creative Commons Attribution

License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

155
Statistics Publications - Multiple regression for monthly
emergence rate of leatherback turtles

Note the absence of an intercept term. What does this imply?


Santidrián Tomillo P, Saba VS, Blanco GS, Stock CA, Paladino FV, Spotila JR (2012) Climate Driven
Egg and Hatchling Mortality Threatens Survival of Eastern Pacific Leatherback Turtles. PLoS ONE
7(5): e37602. https://doi.org/10.1371/journal.pone.0037602

Copyright: © 2012 Santidrián Tomillo et al. This is an open-access article distributed under the terms of the Creative Commons Attribution

License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

156
Quadratic regression

The relationship between 𝑌 and 𝑥 may be approximately quadratic in


which case we can consider the model 𝑌 = 𝛽 0 + 𝛽 1 𝑥 + 𝛽2 𝑥 2 + 𝜖. This is
the case 𝑝 = 2 with 𝑥1 = 𝑥 and 𝑥2 = 𝑥 2 .
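
For illustration only (my own sketch, not from the slides), a quadratic fit is just least squares with regressors x and x²; in Python, np.polyfit carries out the same computation on made-up data:

    import numpy as np

    # Quadratic regression Y = b0 + b1*x + b2*x^2 + eps (illustrative data).
    rng = np.random.default_rng(3)
    x = np.linspace(0, 4, 40)
    y = x**2 + rng.normal(scale=0.5, size=x.size)

    b2, b1, b0 = np.polyfit(x, y, deg=2)   # least squares quadratic fit
    print(b0, b1, b2)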

157
Statistics Publications - Weight gain and diet in fish

Quadratic regression analysis of Percentage Weight Gain (PWG) for


grass carp fed graded levels of isoleucine.
Source: Gan L, Jiang W-D, Wu P, Liu Y, Jiang J, Li S-H, et al. (2014) Flesh Quality Loss in Response to
Dietary Isoleucine Deficiency and Excess in Fish: ... .PLoS ONE 9(12): e115129.
https://doi.org/10.1371/journal.pone.0115129

Copyright: © 2014 Gan et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
158
Statistics Publications - Weight gain and diet in fish

Source: Gan et al. 2014 https://doi.org/10.1371/journal.pone.0115129

What level of isoleucine will maximise the Percentage Weight Gain?


If y = −1.414x² + 34.45x − 67.39, then the maximum is found where
2 × (−1.414)x + 34.45 = 0, so x = 34.45/(2 × 1.414) = 12.18.
159
Statistics Publications - Cubic regression

Cubic regression analysis of the run time for a fast approximate


quadratic assignment algorithm as a function of the number of vertices.
Source: Vogelstein JT, Conroy JM, Lyzinski V, Podrazik LJ, Kratzer SG, Harley ET, et al. (2015) Fast
Approximate Quadratic Programming for Graph Matching. PLoS ONE 10(4): e0121002.
https://doi.org/10.1371/journal.pone.0121002

Copyright: This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or

otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication
160
Statistics Publications - Cubic regression

Cubic regression analysis of the run time for a fast approximate


quadratic assignment algorithm as a function of the number of vertices.
Source: Vogelstein et al. 2015 https://doi.org/10.1371/journal.pone.0121002

What is the prediction of the run time if there are 50 vertices?


Does the variation in the observations appear to be constant?
161
For convenience write the two explanatory variables as x and z. So
suppose

Y_i = β_0 + β_1 x_i + β_2 z_i + ε_i,   i = 1, . . . , n

where ε_i ~ iid N(0, σ²) and assume Σ x_i = Σ z_i = 0.

Then minimising S(β) gives

β̂_0 = Σ y_i / n
β̂_1 = (1/Δ) [ Σ z_i² Σ x_i y_i − Σ x_i z_i Σ z_i y_i ]
β̂_2 = (1/Δ) [ Σ x_i² Σ z_i y_i − Σ x_i z_i Σ x_i y_i ]

where

Δ = Σ x_i² Σ z_i² − ( Σ x_i z_i )².
The method (solving ∂S/∂β_k = 0 for k = 0, 1, 2, i.e. 3 equations in 3
unknowns) is straightforward, the algebra less so – there are more
elegant ways to do some of this (using matrices, in 3rd year).
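
As a sanity check (my own illustration, not from the slides), the closed-form estimates above can be compared numerically with a generic least squares solver on made-up centred data:

    import numpy as np

    # Closed-form estimates for two centred regressors, checked against lstsq.
    rng = np.random.default_rng(4)
    n = 25
    x = rng.normal(size=n); x -= x.mean()    # enforce sum(x) = 0
    z = rng.normal(size=n); z -= z.mean()    # enforce sum(z) = 0
    y = 1.0 + 2.0 * x - 0.5 * z + rng.normal(scale=0.3, size=n)

    Sxz = np.sum(x * z)
    delta = np.sum(x*x) * np.sum(z*z) - Sxz**2
    b0 = np.sum(y) / n
    b1 = (np.sum(z*z) * np.sum(x*y) - Sxz * np.sum(z*y)) / delta
    b2 = (np.sum(x*x) * np.sum(z*y) - Sxz * np.sum(x*y)) / delta

    X = np.column_stack([np.ones(n), x, z])
    print(np.allclose([b0, b1, b2], np.linalg.lstsq(X, y, rcond=None)[0]))  # True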
162
Interpretation of regression coefficients

Consider the model with 𝑝 = 2, so two regressors 𝑥1 and 𝑥 2 :

𝑌 = 𝛽0 + 𝛽 1 𝑥1 + 𝛽 2 𝑥2 + 𝜖.

We interpret 𝛽 1 as the average effect on 𝑌 of a one unit increase in 𝑥1 ,


holding the other regressor 𝑥2 fixed.
Similarly we interpret 𝛽 2 as the average effect on 𝑌 of a one unit
increase in 𝑥2 , holding the other regressor 𝑥1 fixed.
In these interpretations “average” means we are talking about the
change in 𝐸(𝑌) when changing 𝑥1 (or 𝑥2 ).
One important thing to note here is that x_1 and x_2 often change
together, e.g. in the hill races example, a race whose distance is one
mile greater will usually have an increased value of climb as well.
This makes interpretation more difficult.

163
8. Assessing the fit of a model

164
Assessing the fit of a model
Having fitted a model, we should consider how well it fits the data. A
model is normally an approximation to reality: is the approximation
sufficiently good that the model is useful? This question applies to
mathematical models in general. In this course we will approach the
question by considering the fit of a simple linear regression
(generalisations are possible).
For the model Y = α + βx + ε let α̂, β̂ be the usual estimates of α, β
based on the observation pairs (x_1, y_1), . . . , (x_n, y_n).
From now on we consider this model, with the usual assumptions
about ε, unless otherwise stated.
Definition The ith fitted value ŷ_i of Y is defined by ŷ_i = α̂ + β̂x_i, for
i = 1, . . . , n.
The ith residual e_i is defined by e_i = y_i − ŷ_i, for i = 1, . . . , n.
The residual sum of squares RSS is defined by RSS = Σ e_i².
The residual standard error RSE is defined by RSE = √( RSS / (n − 2) ).
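
To make these definitions concrete, here is a short sketch of my own on made-up data (not part of the original slides):

    import numpy as np

    # Fitted values, residuals, RSS and RSE for a simple linear regression.
    rng = np.random.default_rng(5)
    x = np.linspace(0, 5, 30)
    y = 1 + 2 * x + rng.normal(scale=0.7, size=x.size)

    beta_hat = np.sum((x - x.mean()) * y) / np.sum((x - x.mean())**2)
    alpha_hat = y.mean() - beta_hat * x.mean()

    y_fit = alpha_hat + beta_hat * x      # fitted values
    e = y - y_fit                         # residuals
    rss = np.sum(e**2)                    # residual sum of squares
    rse = np.sqrt(rss / (len(y) - 2))     # residual standard error
    print(rss, rse)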
165
The RSE is an estimate of the standard deviation σ. If the fitted values
are close to the observed values, i.e. ŷ_i ≈ y_i for all i (so that the e_i are
small), then the RSE will be small. Alternatively if one or more of the e_i
is large then the RSE will be higher.
We have E(e_i) = 0. In taking this expectation, we treat y_i as the random
variable Y_i, and we treat ŷ_i as the random variable α̂ + β̂x_i (in
particular, α̂ and β̂ are estimators, not estimates). Hence

E(e_i) = E(Y_i − α̂ − β̂x_i)
       = E(Y_i) − E(α̂) − E(β̂)x_i
       = E(α + βx_i + ε_i) − α − βx_i     since α̂, β̂ are unbiased
       = α + βx_i + E(ε_i) − α − βx_i
       = 0                                since E(ε_i) = 0.

166
Potential problem: non-linearity

The model 𝑌 = 𝛼 + 𝛽𝑥 + 𝜖 assumes a straight-line relationship between


𝑌 (the response) and 𝑥 (the predictor). If the true relationship is far
from linear then any conclusions (e.g. predictions) we draw from the
fit will be suspect.
A residual plot is a useful graphical tool for identifying non-linearity:
for simple linear regression we can plot the residuals 𝑒 𝑖 against the
fitted values b
𝑦 𝑖 . Ideally the plot will show no pattern. The existence of
a pattern may indicate a problem with some aspect of the linear model.
Note that for simple linear regression (i.e. the case p = 1) plotting e_i
against x_i gives an equivalent plot, just with a different horizontal
scale, since there is an exact linear relation between the x_i and ŷ_i (i.e.
ŷ_i = α̂ + β̂x_i).
[Plotting e_i against ŷ_i generalises better to multiple regression.]

167
Residual plots

True model is 𝑌 = 2 + 2𝑥 + 𝜖. Fitted model is 𝑌 = 𝛼 + 𝛽𝑥 + 𝜖.


The right form of model has been fitted, the residual plot should show
no pattern, just random scatter – it does.
[Three panels: data from Y = 2 + 2x + ε with the fitted line; residuals against x; residuals against fitted values ŷ_i. The residuals show random scatter.]

Plotting residuals against x_i, or against ŷ_i, is basically the same (since
ŷ_i = α̂ + β̂x_i).

168
[Three panels: data from Y = x² + ε with a straight-line fit; residuals against fitted values (two versions). The residuals show clear curvature.]

A straight-line model 𝑌 = 𝛽 0 + 𝛽1 𝑥 + 𝜖 has been fitted when actually the


relationship is quadratic 𝑌 = 𝑥 2 + 𝜖.

The residuals should indicate a problem – they do – there is a pattern, they are
not randomly scattered. The curvature indicated in the right-hand plot is what
we should notice in the middle plot. [How to fit the curve is beyond the scope
of this course.]

169
[Three panels: data from Y = x² + ε with a quadratic fit; residuals against fitted values (two versions). The residuals show no pattern.]

True model is 𝑌 = 𝑥 2 + 𝜖. Fitted model is 𝑌 = 𝛽 0 + 𝛽 1 𝑥 + 𝛽 2 𝑥 2 + 𝜖, i.e. the


correct form of model has been fitted. The residuals don’t show any pattern.

170
Auto data


[Scatterplot of miles per gallon against horsepower for the Auto data.]

171
Auto data, linear fit and quadratic fit


[Scatterplot of miles per gallon against horsepower, with the fitted straight line and fitted quadratic curve.]

172
[Left: residual plot (residuals against fitted values ŷ) for the linear fit, showing curvature. Right: residual plot for the quadratic fit, showing little pattern.]

Left: the pattern (curvature) in the residuals from the linear fit
𝑌 = 𝛽0 + 𝛽 1 𝑥 + 𝜖 indicates non-linearity.

Right: little pattern remains in the residuals from the quadratic fit
𝑌 = 𝛽 0 + 𝛽 1 𝑥 + 𝛽 2 𝑥 2 + 𝜖.

173
Potential problem: non-constant variance of errors

We have assumed that the errors have a constant variance, i.e.


var(𝑌𝑖 ) = var(𝜖 𝑖 ) = 𝜎2 . That is, the same variance for all 𝑖. Unfortunately,
this is often not true. e.g. the variance of the error may increase as 𝑌
increases.
Non-constant variance is also called heteroscedasticity. We can identify
this from the presence of a funnel-type shape in the residual plot.

174
Non-constant variance of errors

Plots with response variable 𝑌, so the model is 𝑌 = 𝛼 + 𝛽𝑥 + 𝜖.


Both plots suggest non-constant variance of the error, the variability
appears larger when 𝑌 (or 𝑥) is larger.

[Left: y against x with the fitted line. Right: residuals against fitted values ŷ. Both plots show a funnel shape, with variability increasing as y (or x) increases.]

175
How might we deal with non-constant variance of the errors?
One possibility is to transform the response Y using a transformation
such as log Y or √Y (which shrinks larger responses more), leading to
a reduction in heteroscedasticity.

176
Plots with log 𝑌 as the response variable, so the model is
log 𝑌 = 𝛼 + 𝛽𝑥 + 𝜖.

[Left: log(y) against x with the fitted line. Right: residuals against fitted values; the funnel shape is gone.]

177
Sometimes we might have a good idea about the variance of Y_i: we
might think var(Y_i) = var(ε_i) = σ²/w_i where σ² is unknown but where the
w_i are known. e.g. if Y_i is actually the mean of n_i observations, where
each of these n_i observations is made at x = x_i, then var(Y_i) = σ²/n_i. So
w_i = n_i in this case.
It is straightforward to show (exercise) that the MLEs of α, β are
obtained by minimising

Σ_{i=1}^n w_i (y_i − α − βx_i)².    (6)

The form of (6) is intuitively correct: if w_i is small then var(Y_i) is large,
so there is a lot of uncertainty about observation i, so this observation
shouldn’t affect the fit too much – this is achieved in (6) by observation
i being weighted by the small value of w_i. Hence this approach is
called weighted least squares.
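
For illustration (my own sketch, not from the slides), weighted least squares can be computed by rescaling each row of the design matrix and response by √w_i, since Σ w_i (y_i − α − βx_i)² = Σ (√w_i y_i − α√w_i − β√w_i x_i)²:

    import numpy as np

    # Weighted least squares for Y = a + b*x with var(Y_i) = sigma^2 / w_i.
    rng = np.random.default_rng(6)
    x = np.linspace(1, 10, 40)
    w = rng.integers(1, 6, size=x.size)       # e.g. w_i = n_i group sizes
    y = 3 + 0.8 * x + rng.normal(scale=1/np.sqrt(w))

    X = np.column_stack([np.ones_like(x), x])
    sw = np.sqrt(w)
    ab_hat, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    print(ab_hat)                              # estimates of (a, b)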

178
Potential problem: outliers

An outlier is a point for which y_i is far from the value ŷ_i predicted by
the model. Outliers can arise for a variety of reasons, e.g. incorrect
recording of an observation during data collection.
Residual plots can be used to identify outliers. Recall that E(e_i) = 0.
But in practice it can be difficult to decide how large (i.e. how far from
the expected value of zero) a residual needs to be before we consider it
a possible outlier. To address this we can plot studentized residuals
instead of residuals, where

ith studentized residual = e_i / SE(e_i).

179
Theorem 8.1 var(e_i) = σ²(1 − h_i) where

h_i = 1/n + (x_i − x̄)² / Σ_{j=1}^n (x_j − x̄)².

Definition The ith studentized residual r_i is defined by

r_i = e_i / ( s √(1 − h_i) )

where s = RSE is the residual standard error.

[Here we call the r_i “studentized” residuals. Some authors call the r_i
“standardized” residuals and save the word “studentized” to mean
something similar but different.]
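
In code (my own sketch, on made-up data), the leverages and studentized residuals are direct translations of these formulas:

    import numpy as np

    # Leverages h_i and studentized residuals r_i, simple linear regression.
    rng = np.random.default_rng(7)
    x = np.linspace(0, 5, 30)
    y = 1 + 2 * x + rng.normal(scale=0.7, size=x.size)

    w = x - x.mean()
    beta_hat = np.sum(w * y) / np.sum(w**2)
    alpha_hat = y.mean() - beta_hat * x.mean()
    e = y - alpha_hat - beta_hat * x           # residuals

    h = 1/len(x) + w**2 / np.sum(w**2)         # leverages
    s = np.sqrt(np.sum(e**2) / (len(x) - 2))   # RSE
    r = e / (s * np.sqrt(1 - h))               # studentized residuals
    print(np.abs(r).max())                     # |r_i| > 3 suggests an outlier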

180
So the 𝑟 𝑖 are all on a comparable scale, each having a standard
deviation of about 1. We will say that observations with |𝑟 𝑖 | > 3 are
possible outliers.
If we believe an outlier is due to an error in data collection, then one
solution is to simply remove the observation from the data and re-fit
the model. However, an outlier may instead indicate a problem with
the model, e.g. a nonlinear relationship between 𝑌 and 𝑥, so care must
be taken.
Similarly, this kind of problem could arise if we have a missing
regressor, i.e. we could be using 𝑌 = 𝛽0 + 𝛽 1 𝑥 1 + 𝜖 when we should
really be using 𝑌 = 𝛽0 + 𝛽 1 𝑥1 + 𝛽 2 𝑥2 + 𝜖.

181
Proof of Theorem 8.1
Idea of the proof: write e_i = Σ_j a_j Y_j and use

var( Σ_j a_j Y_j ) = Σ_j a_j² var(Y_j) = σ² Σ_j a_j²    (7)

since the Y_j are independent with var(Y_j) = σ² for all j. Here and
below, all sums are from 1 to n.
First recall

β̂ = Σ_j (x_j − x̄) Y_j / S_xx    where S_xx = Σ_k (x_k − x̄)²

and

α̂ = Ȳ − β̂x̄
   = (1/n) Σ_j Y_j − x̄ Σ_j (x_j − x̄) Y_j / S_xx
   = Σ_j [ 1/n − x̄(x_j − x̄)/S_xx ] Y_j.
182
Proof of Theorem 8.1 ... continued
So

ŷ_i = α̂ + β̂x_i
    = Σ_j [ 1/n − x̄(x_j − x̄)/S_xx ] Y_j + x_i Σ_j (x_j − x̄) Y_j / S_xx
    = Σ_j [ 1/n + (x_i − x̄)(x_j − x̄)/S_xx ] Y_j.    (8)

We can write

Y_i = Σ_j δ_ij Y_j    (9)

where δ_ij = 1 if i = j, and δ_ij = 0 otherwise.

183
Proof of Theorem 8.1 ... continued

Note δ_ij² = δ_ij and Σ_j δ_ij² = Σ_j δ_ij = 1. So, using (8) and (9),

e_i = Y_i − α̂ − β̂x_i
    = Σ_j [ δ_ij − 1/n − (x_i − x̄)(x_j − x̄)/S_xx ] Y_j.

184
Proof of Theorem 8.1 ... continued
As the Y_j are independent, as at (7),

var(e_i) = Σ_j [ δ_ij − 1/n − (x_i − x̄)(x_j − x̄)/S_xx ]² var(Y_j)

= σ² Σ_j [ δ_ij² + 1/n² + (x_i − x̄)²(x_j − x̄)²/S_xx²
           − (2/n) δ_ij − 2 δ_ij (x_i − x̄)(x_j − x̄)/S_xx + (2/n)(x_i − x̄)(x_j − x̄)/S_xx ]

= σ² [ 1 + 1/n + (x_i − x̄)²/S_xx − 2/n − 2(x_i − x̄)²/S_xx + (2/n)·((x_i − x̄)/S_xx) Σ_j (x_j − x̄) ]

= σ² [ 1 − 1/n − (x_i − x̄)²/S_xx ]

= σ²(1 − h_i). □
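
As an optional numerical check of the theorem (my own sketch, not part of the course), simulating many data sets with fixed x values gives empirical variances of the e_i matching σ²(1 − h_i):

    import numpy as np

    # Monte Carlo check of var(e_i) = sigma^2 (1 - h_i).
    rng = np.random.default_rng(8)
    x = np.array([0.0, 1.0, 2.0, 3.0, 8.0])    # last point has high leverage
    alpha, beta, sigma = 1.0, 2.0, 0.5
    reps, n = 100_000, len(x)

    w = x - x.mean()
    Y = alpha + beta * x + rng.normal(scale=sigma, size=(reps, n))
    b = (Y @ w) / np.sum(w**2)                 # beta-hat for each replicate
    a = Y.mean(axis=1) - b * x.mean()          # alpha-hat for each replicate
    E = Y - a[:, None] - np.outer(b, x)        # residuals e_i

    h = 1/n + w**2 / np.sum(w**2)
    print(E.var(axis=0))                       # empirical var(e_i)
    print(sigma**2 * (1 - h))                  # theoretical sigma^2 (1 - h_i)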

185
Plots with studentized residuals

Left: dotted line = regression line based on the black points, red line =
regression line with red point included.

Middle = residuals 𝑒 𝑖 . Right = studentized residuals 𝑟 𝑖 .

[Three panels: data with the two regression lines; residuals e_i against fitted values; studentized residuals r_i against fitted values.]

Residuals 𝑒 𝑖 and studentized residuals 𝑟 𝑖 are almost the same. The regression
does not fit the red point well, but whether the red point is included or not has
little effect on the fitted line.

186
Scale of 𝑦 has changed, |𝑒 𝑖 | and |𝑟 𝑖 | are much smaller. Also, the red point isn’t
so extreme.
[Three panels as before, on the new scale of y: data with the two regression lines; residuals; studentized residuals.]

The studentized residual of the red point is greater than 4, so it is a possible
outlier; again, it has a tiny effect on the fitted line.

187
Potential problem: high leverage points

Outliers are observations for which the response 𝑦 𝑖 is unusual given


the value of 𝑥 𝑖 . On the other hand, observations with high leverage have
an unusual value of 𝑥 𝑖 .
Definition The leverage of the ith observation is h_i, where

h_i = 1/n + (x_i − x̄)² / Σ_{j=1}^n (x_j − x̄)².

Clearly h_i depends only on the values of the x_i (it doesn’t depend on
the y_i values at all). We see that h_i increases with the distance of x_i
from x̄.

188
High leverage points tend to have a sizeable impact on the regression
line. Since var(𝑒 𝑖 ) = 𝜎2 (1 − ℎ 𝑖 ), a large leverage ℎ 𝑖 will make var(𝑒 𝑖 )
small. And then since 𝐸(𝑒 𝑖 ) = 0, this means that the regression line will
be “pulled” close to the point (𝑥 𝑖 , 𝑦 𝑖 ).

Note that Σ h_i = Σ 1/n + Σ (x_i − x̄)²/Σ_j (x_j − x̄)² = 1 + 1 = 2, so the
average leverage is h̄ = 2/n. One rule-of-thumb is to regard points with
a leverage more than double this as high leverage points, i.e. points
with h_i > 4/n.
Why does this matter? We should be concerned if the regression line is
heavily affected by just a couple of points, because any problems with
these points might invalidate the entire fit. Hence it is important to
identify high leverage observations.
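
A quick way to do this in practice (my own sketch, with made-up x values):

    import numpy as np

    # Compute leverages and flag points with h_i > 4/n.
    x = np.array([1.2, 1.9, 2.3, 2.8, 3.1, 3.5, 4.0, 9.5])   # 9.5 is unusual
    n = len(x)
    h = 1/n + (x - x.mean())**2 / np.sum((x - x.mean())**2)
    print(h.sum())          # equals 2 for simple linear regression
    print(x[h > 4/n])       # high leverage points by the rule of thumb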

189
Leverage

Left: plot of (𝑥 𝑖 , 𝑦 𝑖 ), 𝑖 = 1, . . . , 𝑛.
Right: plot of leverage values ℎ 𝑖 against 𝑥 𝑖 .

[Left: scatterplot of y against x. Right: leverage values h_i against x; the leverage is largest for x far from x̄.]

190
Blue line = excluding red point, dotted red line = including red point.
[Four panels: low leverage/low residual, high leverage/low residual, low leverage/high residual, high leverage/high residual; each shows the fitted line with and without the red point.]

191
Left: red line = fitted line from black points + point 𝐴, blue line = fitted
line from black points + point 𝐴 + point 𝐵.
Right: studentized residuals 𝑟 𝑖 against leverage values ℎ 𝑖 .
[Left: data with points A and B marked and the two fitted lines. Right: studentized residuals r_i against leverage values h_i.]

192
Removing point 𝐵 – which has high leverage and a high residual – has
a much more substantial impact on the regression line than removing
an outlier with low leverage and/or low residual.
In the plot of studentized residuals against leverage:
• point B has high leverage and a high residual: it is having a
substantial effect on the regression line, yet it still has a high
residual
• in contrast point A has a high residual (the model isn’t fitting well
at A), but A isn’t affecting the fit much at all – the reason that A
has little effect on the fit is that it has low leverage.

193
Statistics Publications - Underrepresentation of women
on boards

The association of women on editorial/advisory board and in the


corresponding academic specialty.
Do you have any concerns here about fitting a simple linear
regression? Spot the high leverage point!
Source: Ioannidou E, Rosania A (2015) Under-Representation of Women on Dental
Journal Editorial Boards. PLoS ONE 10(1): e0116630.
194
Statistics Publications - Estimating dry mass in plants

Estimation of the relationship between individual plant height and


aboveground dry mass
Do you have any concerns here about fitting a simple linear
regression? Spot the high leverage point!
Laís Samira Correia Nunes, Antonio Fernando Monteiro Camargo (2017) A simple non-destructive
method for estimating aboveground biomass of emergent aquatic macrophytes. Acta Limnologica
Brasiliensia vol.29 http://dx.doi.org/10.1590/s2179-975x6416

This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use,

distribution and reproduction in any medium, provided the original work is properly cited.

195
Statistics Publications - Cubic regression

Cubic regression analysis of the run time for a fast approximate


quadratic assignment algorithm as a function of the number of vertices.
Source: Vogelstein et al. 2015 https://doi.org/10.1371/journal.pone.0121002

Does the variation in the observations appear to be constant?

196
