Lecture Slides 1 To 10 - 2024
Lectures 1–10
Christl Donnelly
[email protected]
Course info

Introduction

Probability

Statistics

Precision of estimation
Statistics Publications - Axillary temperature in young, healthy adults

Statistics Publications - IQ of University Students
Relationships between observations

Relationship between time and house prices
[Scatter plot: house Price (roughly 150–350) against Month (0–100).]
Statistics Publications - Underrepresentation of women on boards
Notation and conventions
P(X_i = x_i) = \frac{e^{-\lambda}\,\lambda^{x_i}}{x_i!}, \qquad x_i = 0, 1, \ldots
1. Random Samples
Random Samples
A random sample of size 𝑛 is a set of random variables 𝑋1 , . . . , 𝑋𝑛 which
are independent and identically distributed (i.i.d.).
Let 𝑋1 , . . . , 𝑋𝑛 be a random sample from a Poisson(𝜆) distribution.
e.g. 𝑋𝑖 = # traffic accidents on St Giles’ in year 𝑖.
We’ll write 𝑓 (x) to denote the joint probability mass function (p.m.f.) of
𝑋1 , . . . , 𝑋𝑛 . Then
f(\mathbf{x}) = \frac{e^{-\lambda}\lambda^{x_1}}{x_1!} \cdot \frac{e^{-\lambda}\lambda^{x_2}}{x_2!} \cdots \frac{e^{-\lambda}\lambda^{x_n}}{x_n!} \qquad \text{since the } X_i \text{ are i.i.d. Poisson}

= \frac{e^{-n\lambda}\,\lambda^{\sum_{i=1}^n x_i}}{\prod_{i=1}^n x_i!}.
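As a small added sketch (not part of the original slides), the joint log-p.m.f. can be checked numerically; the value of λ and the counts below are assumed example values.

```python
import numpy as np
from math import lgamma
from scipy.stats import poisson

lam = 2.0                       # assumed example value of lambda
x = np.array([3, 1, 0, 2, 4])   # hypothetical observed counts x_1, ..., x_n

# Joint log-p.m.f. of an i.i.d. sample = sum of the marginal log-p.m.f.s
log_joint = poisson.logpmf(x, lam).sum()

# Closed form derived above: e^{-n lam} lam^{sum x_i} / prod(x_i!)
n = len(x)
closed_form = -n * lam + x.sum() * np.log(lam) - sum(lgamma(k + 1) for k in x)

print(log_joint, closed_form)   # the two values agree
```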
Random Samples
Let 𝑋1 , . . . , 𝑋𝑛 be a random sample from an exponential distribution
with probability density function (p.d.f.) given by
f(x) = \frac{1}{\mu}\, e^{-x/\mu}, \qquad x \ge 0.

Then the joint p.d.f. is

f(\mathbf{x}) = \prod_{i=1}^n \frac{1}{\mu}\, e^{-x_i/\mu} = \frac{1}{\mu^n} \exp\!\Big(-\frac{1}{\mu}\sum_{i=1}^n x_i\Big).
1 We use 𝑓 to denote a p.m.f. in the first example and to denote a
p.d.f. in the second example. It is convenient to use the same letter
(i.e. 𝑓 ) in both the discrete and continuous cases. (In introductory
probability you may often see 𝑝 for p.m.f. and 𝑓 for p.d.f.)
We could write 𝑓𝑋𝑖 (𝑥 𝑖 ) for the p.m.f./p.d.f. of 𝑋𝑖 , and 𝑓𝑋1 ,...,𝑋𝑛 (x)
for the joint p.m.f./p.d.f. of 𝑋1 , . . . , 𝑋𝑛 . However it is convenient to
keep things simpler by omitting subscripts on 𝑓 .
2 In the second example 𝐸(𝑋𝑖 ) = 𝜇 and we say “𝑋𝑖 has an
exponential distribution with mean 𝜇” (i.e. expectation 𝜇).
Sometimes, and often in probability, we work with “an
exponential distribution with parameter 𝜆” where 𝜆 = 1/𝜇. To
change the parameter from 𝜇 to 𝜆 all we do is replace 𝜇 by 1/𝜆 to
get
𝑓 (𝑥) = 𝜆𝑒 −𝜆𝑥 , 𝑥 ≥ 0.
Sometimes (often in statistics) we parametrise the distribution
using 𝜇, sometimes (often in probability) we parametrise it using
𝜆.
In probability we assume that the parameters 𝜆 and 𝜇 in our two
examples are known. In statistics we wish to estimate 𝜆 and 𝜇 from
data.
• What is the best way to estimate them? And what does “best”
mean?
• For a given method, how precise is the estimation?
2. Summary Statistics
Summary Statistics
The expected value of 𝑋, E(𝑋), is also called its mean. This is often
denoted 𝜇.
The variance of 𝑋, var(𝑋), is var(𝑋) = 𝐸[(𝑋 − 𝜇)2 ]. This is often
denoted 𝜎2 . The standard deviation of 𝑋 is 𝜎.
Let 𝑋1 , . . . , 𝑋𝑛 be a random sample. The sample mean 𝑋 is defined by
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i.
Statistics Publications - Descriptive statistics - biochemistry

Statistics Publications - IQ of University Students - Mean (125.04) and Standard deviation (17.298)
The Normal Distribution
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \qquad -\infty < x < \infty.
Standard Normal Distribution
Standard Normal

[Two plots for x from −4 to 4: the standard normal p.d.f. f(x) and the c.d.f. Φ(x).]
Standard and Poors 500 stock index (2012 data)

[Plot: S&P 500 Index (roughly 1250–1450) against Day.]
S&P 500 daily returns (%)

[Plot: daily Return (%) against Day.]
S&P 500 daily returns (%)

[Histogram: Density against Return (%), from −2 to 2.]
Insurance claim amounts

[Histogram of claim amounts (0–30) with fitted lognormal and gamma densities.]
Time intervals between major earthquakes
Histogram of time intervals between 62 major earthquakes 1902–77: an
exponential density looks plausible.
[Histogram: Density against time interval between earthquakes.]
Maximum Likelihood Estimation
Maximum likelihood estimation is a general method for estimating
unknown parameters from data. This turns out to be the method of
choice in many contexts, though this isn’t obvious at this stage.
Suppose e.g. that 𝑥1 , . . . , 𝑥 𝑛 are time intervals between major
earthquakes. Assume these are observations of 𝑋1 , . . . , 𝑋𝑛
independently drawn from an exponential distribution with mean 𝜇,
so that each 𝑋𝑖 has p.d.f.
f(x; \mu) = \frac{1}{\mu}\, e^{-x/\mu}, \qquad x \ge 0.
Maximum Likelihood Estimation
Often we assume that 𝑋1 , . . . , 𝑋𝑛 are a random sample from 𝑓 (𝑥; 𝜃), so
that
L(\theta) = f(\mathbf{x}; \theta) = \prod_{i=1}^n f(x_i; \theta) \qquad \text{since the } X_i \text{ are independent.}
The idea of maximum likelihood is to estimate the parameter by the
value of 𝜃 that gives the greatest likelihood to observations 𝑥1 , . . . , 𝑥 𝑛 .
That is, the 𝜃 for which the probability or probability density is
maximised.
We write θ̂(x) for the value of 𝜃 maximising 𝐿(𝜃), and ℓ(𝜃) = log 𝐿(𝜃)
for the log-likelihood. Since taking logs is monotone, θ̂(x) also
maximises ℓ(𝜃). Finding the MLE by maximising ℓ(𝜃) is often more
convenient.
[Example continued] In our exponential mean 𝜇 example, the
parameter is 𝜇 and
L(\mu) = \prod_{i=1}^n \frac{1}{\mu}\, e^{-x_i/\mu} = \mu^{-n} \exp\!\Big(-\frac{1}{\mu}\sum_{i=1}^n x_i\Big)

\ell(\mu) = \log L(\mu) = -n \log \mu - \frac{1}{\mu}\sum_{i=1}^n x_i.

So

\frac{d\ell}{d\mu} = 0 \iff \frac{n}{\mu} = \frac{\sum_{i=1}^n x_i}{\mu^2} \iff \mu = \bar{x}.
This is a maximum since

\left.\frac{d^2\ell}{d\mu^2}\right|_{\mu=\bar{x}} = \frac{n}{\bar{x}^2} - \frac{2\sum_{i=1}^n x_i}{\bar{x}^3} = -\frac{n}{\bar{x}^2} < 0.

So the MLE is μ̂(x) = x̄. Often we’ll just write μ̂ = x̄.

In this case the maximum likelihood estimator of 𝜇 is μ̂(X) = X̄, which
is a random variable. (More on the difference between estimates and
estimators later.)
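As a brief added sketch (not from the slides; the data values are invented), the analytic MLE μ̂ = x̄ can be checked by maximising ℓ(μ) numerically:

```python
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([12.4, 40.9, 7.7, 25.3, 3.1])   # hypothetical time intervals

# Negative log-likelihood of the exponential(mean mu) model:
# -l(mu) = n log(mu) + (sum x_i)/mu
def neg_loglik(mu):
    return len(x) * np.log(mu) + x.sum() / mu

res = minimize_scalar(neg_loglik, bounds=(1e-6, 1e3), method="bounded")
print(res.x, x.mean())   # the numerical maximiser agrees with the sample mean
```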
[Two plots against mu (0–1000): the likelihood (scale ×10^191) and the log-likelihood (scale ×10^−2).]

Note that both the likelihood and the log-likelihood are not plotted on
the natural scale.
If θ̂(x) is the maximum likelihood estimate of 𝜃, then the maximum
likelihood estimator (MLE) is defined by θ̂(X).

Note: both maximum likelihood estimate and maximum likelihood
estimator are often abbreviated to MLE.
Opinion poll example
Suppose 𝑛 individuals are drawn independently from a large
population. Let
X_i = \begin{cases} 1 & \text{if individual } i \text{ is a Labour voter} \\ 0 & \text{otherwise.} \end{cases}
Let 𝑝 be the proportion of Labour voters, so that
𝑃(𝑋𝑖 = 1) = 𝑝, 𝑃(𝑋𝑖 = 0) = 1 − 𝑝.
This is a Bernoulli distribution, for which the p.m.f. can be written
𝑓 (𝑥; 𝑝) = 𝑃(𝑋 = 𝑥) = 𝑝 𝑥 (1 − 𝑝)1−𝑥 , 𝑥 = 0, 1.
The likelihood is

L(p) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} = p^r (1-p)^{n-r}

where r = \sum_{i=1}^n x_i.
Opinion poll example (continued)

So the log-likelihood is

\ell(p) = r \log p + (n - r)\log(1 - p)

with

\ell'(p) = \frac{r}{p} - \frac{n - r}{1 - p},

and setting ℓ′(p) = 0 gives the MLE p̂ = r/n = x̄.
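A small added sketch confirming that ℓ(p) is maximised at r/n; the counts are those of the opinion poll used later in these notes:

```python
import numpy as np

n, r = 1003, 321   # poll counts from the example later in these notes

p = np.linspace(0.001, 0.999, 9999)
loglik = r * np.log(p) + (n - r) * np.log(1 - p)

print(p[np.argmax(loglik)], r / n)   # both ≈ 0.320
```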
Genetics example

Suppose genotypes occur with probabilities 𝑝1 = 𝜃², 𝑝2 = 2𝜃(1 − 𝜃) and
𝑝3 = (1 − 𝜃)², and let 𝑋1 , 𝑋2 , 𝑋3 be the counts of each genotype in a
sample of size 𝑛 = 𝑥1 + 𝑥2 + 𝑥3 .
Genetics example (continued)
Then
L(\theta) = P(X_1 = x_1, X_2 = x_2, X_3 = x_3)
          = \frac{n!}{x_1!\, x_2!\, x_3!}\, p_1^{x_1} p_2^{x_2} p_3^{x_3}
          = \frac{n!}{x_1!\, x_2!\, x_3!}\, (\theta^2)^{x_1} [2\theta(1-\theta)]^{x_2} [(1-\theta)^2]^{x_3}.
This is a multinomial distribution.
So
ℓ (𝜃) = constant + 2𝑥1 log 𝜃 + 𝑥2 [log 2 + log 𝜃 + log(1 − 𝜃)] + 2𝑥3 log(1 − 𝜃)
= constant + (2𝑥1 + 𝑥2 ) log 𝜃 + (𝑥2 + 2𝑥3 ) log(1 − 𝜃)
Genetics example (continued)
Then ℓ′(θ̂) = 0 gives

\frac{2x_1 + x_2}{\hat{\theta}} - \frac{x_2 + 2x_3}{1 - \hat{\theta}} = 0

or

(2x_1 + x_2)(1 - \hat{\theta}) = (x_2 + 2x_3)\,\hat{\theta}.

So

\hat{\theta} = \frac{2x_1 + x_2}{2(x_1 + x_2 + x_3)} = \frac{2x_1 + x_2}{2n}.
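A quick added numerical sketch (the genotype counts are invented) confirming the closed form against a grid maximisation:

```python
import numpy as np

x1, x2, x3 = 45, 38, 17   # hypothetical genotype counts
n = x1 + x2 + x3

theta = np.linspace(0.001, 0.999, 9999)
loglik = (2 * x1 + x2) * np.log(theta) + (x2 + 2 * x3) * np.log(1 - theta)

print(theta[np.argmax(loglik)], (2 * x1 + x2) / (2 * n))   # both ≈ 0.64
```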
Maximum likelihood approach
Steps:
• Write down the (log) likelihood
• Find the maximum (usually by differentiation, but not quite
always)
• Rearrange to give the parameter estimate in terms of the data.
Statistics Publications - Likelihood of a hepatitis A transmission tree
Estimating multiple parameters
Let 𝑋1 , . . . , 𝑋𝑛 ∼iid 𝑁(𝜇, 𝜎²) where both 𝜇 and 𝜎² are unknown.

[Here ∼iid means “are independent and identically distributed as.”]

The likelihood is

L(\mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right) = (2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\right)

with log-likelihood

\ell(\mu, \sigma^2) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2.
Estimating multiple parameters (continued)
We maximise ℓ jointly over 𝜇 and 𝜎2 :
\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu)

\frac{\partial \ell}{\partial (\sigma^2)} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n (x_i - \mu)^2

and solving ∂ℓ/∂𝜇 = 0 and ∂ℓ/∂(𝜎²) = 0 simultaneously we obtain

\hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}

\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \hat{\mu})^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2.
Estimator:
• A rule for constructing an estimate.
• A function of the random variables X involved in the random
sample.
• Itself a random variable.
Estimate:
• The numerical value of the estimator for the particular data set.
• The value of the function evaluated at the data 𝑥1 , . . . , 𝑥 𝑛 .
Statistics Publications - Influenza transmission in Mexico in 2009

Statistics Publications - The maximum log likelihood is easy to spot - Influenza transmission in Mexico in 2009
Statistics Publications - Sometimes a maximum looks like a minimum - Influenza transmission in ferrets

Sometimes the log likelihood profile is flipped and even given relative
to its maximum.

Source: Rebecca Frise et al. (2016) Contact transmission of influenza virus between ferrets imposes a looser bottleneck than respiratory droplet transmission allowing propagation of antiviral resistance. Scientific Reports volume 6, Article number: 29793
4. Parameter Estimation
Parameter Estimation - Earthquake example
𝑋1 , . . . , 𝑋𝑛 ∼iid exponential distribution, mean 𝜇.

Possible estimators of 𝜇:

• \bar{X}
• \frac{1}{3}X_1 + \frac{2}{3}X_2
• X_1 + X_2 - X_3
• \frac{2}{n(n+1)}(X_1 + 2X_2 + \cdots + nX_n).

How should we choose?
In general, suppose 𝑋1 , . . . , 𝑋𝑛 is a random sample from a distribution
with p.d.f./p.m.f. 𝑓 (𝑥; 𝜃). We want to estimate 𝜃 from observations
𝑥1 , . . . , 𝑥 𝑛 .
A statistic is any function 𝑇(X) of 𝑋1 , . . . , 𝑋𝑛 that does not depend on 𝜃.
An estimator of 𝜃 is any statistic 𝑇(X) that we might use to estimate 𝜃.
𝑇(x) is the estimate of 𝜃 obtained via 𝑇 from observed values x.
We can choose between estimators by studying their properties. A
good estimator should take values close to 𝜃.
The estimator 𝑇 = 𝑇(X) is said to be unbiased for 𝜃 if, whatever the true
value of 𝜃, we have 𝐸(𝑇) = 𝜃.
This means that “on average” 𝑇 is correct.
Example: Earthquakes
Possible estimators \bar{X}, \frac{1}{3}X_1 + \frac{2}{3}X_2 , etc.

Since E(X_i) = \mu, we have

E(\bar{X}) = \frac{1}{n}\sum_{i=1}^n \mu = \mu

E\big(\tfrac{1}{3}X_1 + \tfrac{2}{3}X_2\big) = \tfrac{1}{3}\mu + \tfrac{2}{3}\mu = \mu.

Similar calculations show that X_1 + X_2 - X_3 and \frac{2}{n(n+1)}\sum_{j=1}^n jX_j are
also unbiased.
Example: Normal variance
Suppose 𝑋1 , . . . , 𝑋𝑛 ∼iid 𝑁(𝜇, 𝜎²), with 𝜇 and 𝜎² unknown, and let
T = \frac{1}{n}\sum (X_i - \bar{X})^2. Then 𝑇 is the MLE of 𝜎². Is 𝑇 unbiased?

Let Z_i = (X_i - \mu)/\sigma. So the 𝑍𝑖 are independent and 𝑁(0, 1), with
E(Z_i) = 0 and \mathrm{var}(Z_i) = E(Z_i^2) = 1.

E[(X_i - \bar{X})^2] = E[\sigma^2 (Z_i - \bar{Z})^2]
 = \sigma^2\, \mathrm{var}(Z_i - \bar{Z}) \qquad \text{since } E(Z_i - \bar{Z}) = 0
 = \sigma^2\, \mathrm{var}\Big(-\tfrac{1}{n}Z_1 - \tfrac{1}{n}Z_2 - \cdots + \tfrac{n-1}{n}Z_i + \cdots - \tfrac{1}{n}Z_n\Big)
 = \sigma^2\Big(\tfrac{1}{n^2}\mathrm{var}(Z_1) + \tfrac{1}{n^2}\mathrm{var}(Z_2) + \cdots + \tfrac{(n-1)^2}{n^2}\mathrm{var}(Z_i) + \cdots + \tfrac{1}{n^2}\mathrm{var}(Z_n)\Big)
   \qquad \text{since } \mathrm{var}\big(\textstyle\sum a_i U_i\big) = \sum a_i^2\, \mathrm{var}(U_i) \text{ for independent } U_i
 = \sigma^2\Big((n-1)\times\tfrac{1}{n^2} + \tfrac{(n-1)^2}{n^2}\Big) = \frac{n-1}{n}\,\sigma^2.
So

E(T) = \frac{1}{n}\sum_{i=1}^n E[(X_i - \bar{X})^2] = \frac{n-1}{n}\,\sigma^2 < \sigma^2.
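A short added simulation sketch (parameter values chosen arbitrarily) showing this downward bias of the MLE of σ²:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, n, reps = 10.0, 4.0, 15, 100_000   # assumed example settings

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
T = ((x - x.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)   # MLE of sigma^2

print(T.mean(), (n - 1) / n * sigma2)   # both ≈ 3.73, below the true value 4
```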
Uniform distribution – some unusual features!
Suppose 𝑋1 , . . . , 𝑋𝑛 ∼iid Uniform[0, 𝜃], where 𝜃 > 0, i.e.

f(x; \theta) = \begin{cases} \frac{1}{\theta} & \text{if } 0 \le x \le \theta \\ 0 & \text{otherwise.} \end{cases}

What is the MLE for 𝜃? Is the MLE unbiased?

Calculate the likelihood:

L(\theta) = \prod_{i=1}^n f(x_i; \theta)
          = \begin{cases} \frac{1}{\theta^n} & \text{if } 0 \le x_i \le \theta \text{ for all } i \\ 0 & \text{otherwise} \end{cases}
          = \begin{cases} 0 & \text{if } 0 < \theta < \max x_i \\ \frac{1}{\theta^n} & \text{if } \theta \ge \max x_i. \end{cases}

Note: 𝜃 ≥ 𝑥𝑖 for all 𝑖 ⟺ 𝜃 ≥ max 𝑥𝑖 . (And max 𝑥𝑖 means max_{1≤𝑖≤𝑛} 𝑥𝑖 .)
[Plot: the likelihood against theta; it is 0 for 𝜃 < max 𝑥𝑖 and equals 1/𝜃^𝑛 for 𝜃 ≥ max 𝑥𝑖 .]

The MLE of 𝜃 is θ̂ = max 𝑋𝑖 . What is E(θ̂)?

Find the c.d.f. of θ̂:

F(y) = P(\hat{\theta} \le y)
     = P(\max X_i \le y)
     = P(X_1 \le y, X_2 \le y, \ldots, X_n \le y)
     = P(X_1 \le y)\,P(X_2 \le y) \cdots P(X_n \le y) \qquad \text{since the } X_i \text{ are independent}
     = \begin{cases} (y/\theta)^n & \text{if } 0 \le y \le \theta \\ 1 & \text{if } y > \theta. \end{cases}
So, differentiating the c.d.f., the p.d.f. is

f(y) = \frac{n y^{n-1}}{\theta^n}, \qquad 0 \le y \le \theta.

So

E(\hat{\theta}) = \int_0^\theta y \cdot \frac{n y^{n-1}}{\theta^n}\, dy = \frac{n}{\theta^n}\int_0^\theta y^n\, dy = \frac{n\theta}{n+1}.

So θ̂ is not unbiased. But note that it is asymptotically unbiased:
E(θ̂) → 𝜃 as 𝑛 → ∞.

In fact under mild assumptions MLEs are always asymptotically
unbiased.
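An added simulation sketch (arbitrary θ and n) checking E(θ̂) = nθ/(n+1):

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 1.0, 10, 200_000   # arbitrary illustrative settings

theta_hat = rng.uniform(0, theta, size=(reps, n)).max(axis=1)
print(theta_hat.mean(), n * theta / (n + 1))   # both ≈ 0.909
```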
Further Properties of Estimators

The bias of an estimator 𝑇 of 𝜃 is b(T) = E(T) - \theta, and the mean squared
error (MSE) of 𝑇 is MSE(T) = E[(T - \theta)^2].
Theorem 4.1 MSE(T) = \mathrm{var}(T) + [b(T)]^2.

Proof
Let 𝜇 = 𝐸(𝑇). Then

MSE(T) = E[(T - \theta)^2] = E[(T - \mu + \mu - \theta)^2]
       = E[(T - \mu)^2] + 2(\mu - \theta)E(T - \mu) + (\mu - \theta)^2
       = \mathrm{var}(T) + [b(T)]^2 \qquad \text{since } E(T - \mu) = 0. \;\;\square

So an estimator with small MSE needs to have small variance and small
bias. Unbiasedness alone is not particularly desirable – it is the
combination of small variance and small bias which is important.
Reminder
E(\bar{X}) = \mu \quad \text{and} \quad \mathrm{var}(\bar{X}) = \frac{\sigma^2}{n}.
Uniform distribution
Suppose 𝑋1 , . . . , 𝑋𝑛 ∼iid Uniform[0, 𝜃], i.e.

f(x; \theta) = \begin{cases} \frac{1}{\theta} & \text{if } 0 \le x \le \theta \\ 0 & \text{otherwise.} \end{cases}

We will consider two estimators of 𝜃:

• T = 2\bar{X}, the natural estimator based on the sample mean (because the mean of the distribution is 𝜃/2)
• θ̂ = max 𝑋𝑖 , the MLE.

Now E(T) = 2E(\bar{X}) = \theta, so 𝑇 is unbiased. Hence

MSE(T) = \mathrm{var}(T) = 4\,\mathrm{var}(\bar{X}) = \frac{4\,\mathrm{var}(X_1)}{n}.
We have E(X_1) = \theta/2 and

E(X_1^2) = \int_0^\theta x^2 \cdot \frac{1}{\theta}\, dx = \frac{\theta^2}{3}

so

\mathrm{var}(X_1) = \frac{\theta^2}{3} - \left(\frac{\theta}{2}\right)^2 = \frac{\theta^2}{12}

hence

MSE(T) = \frac{4\,\mathrm{var}(X_1)}{n} = \frac{\theta^2}{3n}.
Previously we showed that θ̂ has p.d.f.

f(y) = \frac{n y^{n-1}}{\theta^n}, \qquad 0 \le y \le \theta

and E(\hat{\theta}) = n\theta/(n+1). So b(\hat{\theta}) = n\theta/(n+1) - \theta = -\theta/(n+1). Also,

E(\hat{\theta}^2) = \int_0^\theta y^2 \cdot \frac{n y^{n-1}}{\theta^n}\, dy = \frac{n\theta^2}{n+2}

so

\mathrm{var}(\hat{\theta}) = \theta^2\left(\frac{n}{n+2} - \frac{n^2}{(n+1)^2}\right) = \frac{n\theta^2}{(n+1)^2(n+2)}
hence

MSE(\hat{\theta}) = \mathrm{var}(\hat{\theta}) + [b(\hat{\theta})]^2 = \frac{2\theta^2}{(n+1)(n+2)} < \frac{\theta^2}{3n} = MSE(T) \quad \text{for } n \ge 3.

• MSE(θ̂) ≪ MSE(T) for large 𝑛, so θ̂ is much better – its MSE
decreases like 1/𝑛² rather than 1/𝑛.

Remember: T = 2\bar{X} and θ̂ = max 𝑋𝑖 .
• Note that \frac{n+1}{n}\hat{\theta} is unbiased and

MSE\Big(\frac{n+1}{n}\hat{\theta}\Big) = \mathrm{var}\Big(\frac{n+1}{n}\hat{\theta}\Big) = \frac{(n+1)^2}{n^2}\,\mathrm{var}(\hat{\theta}) = \frac{\theta^2}{n(n+2)} < MSE(\hat{\theta}) \quad \text{for } n \ge 2.
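An added simulation sketch (with θ = 1 chosen for illustration) comparing the mean squared errors of the three estimators:

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 1.0, 10, 200_000

x = rng.uniform(0, theta, size=(reps, n))
T = 2 * x.mean(axis=1)                 # 2 * sample mean
mle = x.max(axis=1)                    # max X_i
corrected = (n + 1) / n * mle          # unbiased multiple of the MLE

for name, est in [("2*mean", T), ("max", mle), ("(n+1)/n*max", corrected)]:
    print(name, ((est - theta) ** 2).mean())
# Theory: theta^2/(3n) ≈ 0.0333, 2 theta^2/((n+1)(n+2)) ≈ 0.0152, theta^2/(n(n+2)) ≈ 0.0083
```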
Estimating the parameter of Uniform[0, 𝜃]

[Six histograms of simulated estimates with 𝜃 = 1: max(x_i) (left column) and 2*mean(x) (right column) for decreasing sample sizes, n = 1000 in the top row; the sampling distribution of max(x_i) is far more concentrated around 𝜃.]
Estimating the parameter of Uniform[0, 𝜃]

[Histograms of simulated estimates with n = 10: max(x_i) and 2*mean(x) against theta.]
Estimation so far
Statistics Publications - Maximum likelihood in phylogenetics

Source: Placide Mbala-Kingebeni et al. (2019) Rapid Confirmation of the Zaire Ebola Virus in the Outbreak of the Equateur Province in the Democratic Republic of Congo: ..., Clinical Infectious Diseases 68(2): 330–333, https://doi.org/10.1093/cid/ciy527

Statistics Publications - Maximum likelihood in object recognition

Statistics Publications - Radiation research

Source: Leslie Stayner et al. (2007) A Monte Carlo Maximum Likelihood Method for Estimating Uncertainty Arising from Shared Errors in ..., https://bioone.org/journals/Radiation-Research/volume-168/issue-6/RR0677.1/A-Monte-Carlo-Maximum-Likelihood-Method-for-Estimating-Uncertainty-Arising/10.1667/RR0677.1.full
5. Accuracy of Estimation
Accuracy of estimation: Confidence Intervals
Theorem 5.1
If 𝑋1 , . . . , 𝑋𝑛 are independent with 𝑋𝑖 ∼ 𝑁(𝜇𝑖 , 𝜎𝑖²) and 𝑎1 , . . . , 𝑎𝑛 are
constants, then Y = \sum_{i=1}^n a_i X_i satisfies

Y \sim N\Big(\sum_{i=1}^n a_i \mu_i, \; \sum_{i=1}^n a_i^2 \sigma_i^2\Big).
Example
Let 𝑋1 , . . . , 𝑋𝑛 ∼iid 𝑁(𝜇, 𝜎₀²) where 𝜇 is unknown and 𝜎₀² is known.
What can we say about 𝜇?

By Theorem 5.1,

\sum_{i=1}^n X_i \sim N(n\mu, n\sigma_0^2), \qquad \bar{X} \sim N\Big(\mu, \frac{\sigma_0^2}{n}\Big).

So, standardising \bar{X},

\bar{X} - \mu \sim N\Big(0, \frac{\sigma_0^2}{n}\Big), \qquad \frac{\bar{X} - \mu}{\sigma_0/\sqrt{n}} \sim N(0, 1).
[Figure: Standard normal p.d.f.; the shaded area under the curve from 𝑥 = −1.96 to 𝑥 = 1.96 is 0.95.]
So

P\Big(-1.96 < \frac{\bar{X} - \mu}{\sigma_0/\sqrt{n}} < 1.96\Big) = 0.95

P\Big(-1.96\,\frac{\sigma_0}{\sqrt{n}} < \bar{X} - \mu < 1.96\,\frac{\sigma_0}{\sqrt{n}}\Big) = 0.95

P\Big(\bar{X} - 1.96\,\frac{\sigma_0}{\sqrt{n}} < \mu < \bar{X} + 1.96\,\frac{\sigma_0}{\sqrt{n}}\Big) = 0.95

P\Big(\text{the interval } \bar{X} \pm 1.96\,\frac{\sigma_0}{\sqrt{n}} \text{ contains } \mu\Big) = 0.95.

So a confidence interval with confidence level 0.95 is

\Big(\bar{X} - 1.96\,\frac{\sigma_0}{\sqrt{n}}, \; \bar{X} + 1.96\,\frac{\sigma_0}{\sqrt{n}}\Big).
Definition
If 𝑎(X) and 𝑏(X) are two statistics, and 0 < 𝛼 < 1, the interval
(𝑎(X), 𝑏(X)) is called a confidence interval for 𝜃 with confidence level
1 − 𝛼 if, for all 𝜃,

P\big(a(\mathbf{X}) < \theta < b(\mathbf{X})\big) = 1 - \alpha.

The interval (𝑎(X), 𝑏(X)) is also called a 100(1 − 𝛼)% CI, e.g. a “95%
confidence interval” if 𝛼 = 0.05.

Usually we are interested in small values of 𝛼: the most commonly
used values are 0.05 and 0.01 (i.e. confidence levels of 95% and 99%)
but there is nothing special about any confidence level.
The interval 𝑎(x), 𝑏(x) is called an interval estimate and the random
interval 𝑎(X), 𝑏(X) is called an interval estimator.
Percentage points of normal distribution
For any 𝛼 with 0 < 𝛼 < 1, let 𝑧 𝛼 be the constant such that Φ(𝑧 𝛼 ) = 1 − 𝛼,
where Φ is the 𝑁(0, 1) c.d.f. (i.e. if 𝑍 ∼ 𝑁(0, 1) then 𝑃(𝑍 > 𝑧 𝛼 ) = 𝛼).
[Figure: Standard normal p.d.f.; the shaded area under the curve to the right of 𝑧𝛼 is 𝛼.]
We call 𝑧𝛼 the “1 − 𝛼 quantile of 𝑁(0, 1).”

Replacing 1.96 by 𝑧𝛼/2 , a confidence interval for 𝜇 with confidence level 1 − 𝛼 is

\bar{X} \pm \frac{z_{\alpha/2}\,\sigma_0}{\sqrt{n}}.
Oxford rainfall, annual, 20 years

[Scatter plot: annual Rainfall (mm), axis 200–800, against Year.]
Confidence interval example
When 𝑋1 , . . . , 𝑋𝑛 ∼iid 𝑁(𝜇, 𝜎₀²), a 90% CI for 𝜇 is

\big(a(\mathbf{X}), b(\mathbf{X})\big) = \Big(\bar{X} - 1.64\,\frac{\sigma_0}{\sqrt{n}}, \; \bar{X} + 1.64\,\frac{\sigma_0}{\sqrt{n}}\Big).
90% CIs from samples of size 15 from N(10, 4)

[Figure: 100 simulated 90% confidence intervals plotted against sample number, on an axis from 6 to 14.]
The symmetric confidence interval for 𝜇

\bar{x} \pm 1.96\,\frac{\sigma_0}{\sqrt{n}}

is called a central confidence interval for 𝜇.

[Figure: Standard normal p.d.f.; the shaded area under the curve from −𝑐 to 𝑑 is 1 − 𝛼.]

Then

P\Big(\bar{X} - \frac{d\sigma_0}{\sqrt{n}} < \mu < \bar{X} + \frac{c\sigma_0}{\sqrt{n}}\Big) = 1 - \alpha.

The choice 𝑐 = 𝑑 = 𝑧𝛼/2 gives the shortest such interval.
One-sided confidence limits

Continuing our normal example we have

P\Big(\frac{\bar{X} - \mu}{\sigma_0/\sqrt{n}} > -z_\alpha\Big) = 1 - \alpha

so

P\Big(\mu < \bar{X} + \frac{z_\alpha \sigma_0}{\sqrt{n}}\Big) = 1 - \alpha

and so \Big(-\infty, \; \bar{X} + \frac{z_\alpha \sigma_0}{\sqrt{n}}\Big) is a “one-sided” confidence interval. We call

\bar{X} + \frac{z_\alpha \sigma_0}{\sqrt{n}}

an upper 1 − 𝛼 confidence limit for 𝜇.

Similarly

P\Big(\mu > \bar{X} - \frac{z_\alpha \sigma_0}{\sqrt{n}}\Big) = 1 - \alpha

and \bar{X} - \frac{z_\alpha \sigma_0}{\sqrt{n}} is a lower 1 − 𝛼 confidence limit for 𝜇.
Interpretation of a Confidence Interval
• The parameter 𝜃 is fixed but unknown.
• If we imagine repeating our experiment then we’d get new data,
x′ = (𝑥1′ , . . ., 𝑥 𝑛′ ) say, and hence we’d get a new confidence interval
𝑎(x′), 𝑏(x′) . If we did this repeatedly we would “catch” the true
parameter value about 95% of the time, for a 95% confidence
interval: i.e. about 95% of our intervals would contain 𝜃.
• The confidence level is a coverage probability, the probability that
the random confidence interval 𝑎(X), 𝑏(X) covers the true 𝜃. (It’s
a random interval because the endpoints 𝑎(X), 𝑏(X) are random
variables.)
But note that
the interval 𝑎(x), 𝑏(x) is not a random interval, e.g.
𝑎(x), 𝑏(x) = (631.4, 745.4) in the rainfall example. So it is wrong to say
that 𝑎(x), 𝑏(x) contains 𝜃 with probability 1 − 𝛼: this interval, e.g.
(631.4, 745.4), either definitely does or definitely does not contain 𝜃, but
we can’t say which of these two possibilities is true as 𝜃 is unknown.
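The repeated-experiment interpretation can be checked with a small added simulation sketch (all parameter values here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma0, n, reps = 10.0, 2.0, 15, 10_000   # arbitrary illustrative settings

xbar = rng.normal(mu, sigma0, size=(reps, n)).mean(axis=1)
half = 1.96 * sigma0 / np.sqrt(n)

covered = (xbar - half < mu) & (mu < xbar + half)
print(covered.mean())   # ≈ 0.95: about 95% of the random intervals contain mu
```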
The Central Limit Theorem (CLT)

We know that if 𝑋1 , . . . , 𝑋𝑛 ∼iid 𝑁(𝜇, 𝜎²) then

\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).

The CLT says that this holds approximately for large 𝑛 whatever the
distribution of the 𝑋𝑖 : if 𝑋1 , 𝑋2 , . . . are i.i.d. with mean 𝜇 and variance 𝜎²,
then for all 𝑥,

P\Big(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le x\Big) \to \Phi(x) \quad \text{as } n \to \infty.

[Figure: densities of the standardised value of 𝑥̄ for n = 1, 10, 30, 200, approaching the 𝑁(0, 1) density.]
CLT with 𝑋𝑖 ∼ 𝑈(0, 1)

[Figure: densities of the standardised value of 𝑥̄ for n = 1, 10, 30, 200.]

CLT with 𝑋𝑖 ∼ Exponential(1)

[Figure: densities of the standardised value of 𝑥̄ for n = 1, 10, 30, 200.]

CLT with 𝑋𝑖 ∼ Pareto(1, 3)

[Figure: densities of the standardised value of 𝑥̄ for n = 1, 10, 30, 200.]
With 𝑋1 , 𝑋2 , . . . i.i.d. from any distribution with 𝐸(𝑋𝑖 ) = 𝜇, var(𝑋𝑖 ) = 𝜎²:

• the weak law of large numbers (Prelims Probability) tells us that the distribution of 𝑋̄ concentrates around 𝜇 as 𝑛 becomes large, i.e. for 𝜖 > 0, we have 𝑃(|𝑋̄ − 𝜇| > 𝜖) → 0 as 𝑛 → ∞
• the CLT adds to this (see the simulation sketch below):
  – the fluctuations of 𝑋̄ around 𝜇 are of order 1/√𝑛
  – the asymptotic distribution of these fluctuations is normal.
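A minimal added CLT sketch; the Exponential(1) distribution is chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 30, 50_000
mu = sigma = 1.0   # Exponential(1) has mean 1 and standard deviation 1

xbar = rng.exponential(mu, size=(reps, n)).mean(axis=1)
z = (xbar - mu) / (sigma / np.sqrt(n))   # standardised sample means

# Empirical quantiles are close to the N(0,1) values -1.96 and 1.96,
# with some skew remaining at n = 30
print(np.quantile(z, [0.025, 0.975]))
```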
Example Suppose 𝑋1 , . . . , 𝑋𝑛 ∼iid exponential with mean 𝜇, e.g.
𝑋𝑖 = survival time of patient 𝑖. So

f(x; \mu) = \frac{1}{\mu}\, e^{-x/\mu}, \qquad x \ge 0

and E(X_i) = \mu, \mathrm{var}(X_i) = \mu^2.

For large 𝑛, by the CLT,

\frac{\bar{X} - \mu}{\mu/\sqrt{n}} \approx N(0, 1).

So

P\Big(-z_{\alpha/2} < \frac{\bar{X} - \mu}{\mu/\sqrt{n}} < z_{\alpha/2}\Big) \approx 1 - \alpha

P\Big(\mu\Big(1 - \frac{z_{\alpha/2}}{\sqrt{n}}\Big) < \bar{X} < \mu\Big(1 + \frac{z_{\alpha/2}}{\sqrt{n}}\Big)\Big) \approx 1 - \alpha

P\left(\frac{\bar{X}}{1 + \frac{z_{\alpha/2}}{\sqrt{n}}} < \mu < \frac{\bar{X}}{1 - \frac{z_{\alpha/2}}{\sqrt{n}}}\right) \approx 1 - \alpha.
Hence an approximate 1 − 𝛼 CI for 𝜇 is

\left(\frac{\bar{X}}{1 + \frac{z_{\alpha/2}}{\sqrt{n}}}, \; \frac{\bar{X}}{1 - \frac{z_{\alpha/2}}{\sqrt{n}}}\right).
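A short added sketch (the data are invented) computing this approximate interval:

```python
import numpy as np
from scipy.stats import norm

x = np.array([12.4, 40.9, 7.7, 25.3, 3.1, 18.0, 9.6, 30.2])  # hypothetical survival times
n, alpha = len(x), 0.05

z = norm.ppf(1 - alpha / 2)   # z_{alpha/2} ≈ 1.96
xbar = x.mean()
print(xbar / (1 + z / np.sqrt(n)), xbar / (1 - z / np.sqrt(n)))
```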
Example: Opinion poll

In an opinion poll, suppose 321 of 1003 voters said they would vote for
Party X. What’s the underlying level of support for Party X?

Let 𝑋1 , . . . , 𝑋𝑛 be a random sample from the Bernoulli(𝑝) distribution, i.e.

P(X_i = 1) = p, \qquad P(X_i = 0) = 1 - p.

The MLE of 𝑝 is 𝑋̄. Also E(X_i) = p and \mathrm{var}(X_i) = p(1-p) = \sigma^2(p), say.

For large 𝑛, by the CLT,

\frac{\bar{X} - p}{\sigma(p)/\sqrt{n}} \approx N(0, 1).

So

1 - \alpha \approx P\Big(-z_{\alpha/2} < \frac{\bar{X} - p}{\sigma(p)/\sqrt{n}} < z_{\alpha/2}\Big)
            = P\Big(\bar{X} - z_{\alpha/2}\,\frac{\sigma(p)}{\sqrt{n}} < p < \bar{X} + z_{\alpha/2}\,\frac{\sigma(p)}{\sqrt{n}}\Big).
The interval \bar{X} \pm \frac{z_{\alpha/2}}{\sqrt{n}}\,\sigma(p) has approximate probability 1 − 𝛼 of
containing the true 𝑝, but it is not a confidence interval since its
endpoints depend on 𝑝 via 𝜎(𝑝).

To get an approximate confidence interval:
• either, solve the inequality to get P(a(\mathbf{X}) < p < b(\mathbf{X})) ≈ 1 − 𝛼 where 𝑎(X), 𝑏(X) don’t depend on 𝑝
• or, estimate 𝜎(𝑝) by \sigma(\hat{p}) = \sqrt{\hat{p}(1 - \hat{p})} where p̂ = 𝑥̄ is the MLE. This gives endpoints

\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}.
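An added sketch of the plug-in interval for the 321/1003 poll above:

```python
import numpy as np
from scipy.stats import norm

r, n, alpha = 321, 1003, 0.05
p_hat = r / n                            # MLE, ≈ 0.320

z = norm.ppf(1 - alpha / 2)
se = np.sqrt(p_hat * (1 - p_hat) / n)    # plug-in standard error
print(p_hat - z * se, p_hat + z * se)    # ≈ (0.291, 0.349)
```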
Opinion polls often mention “±3% error.”

Note that for any 𝑝,

\sigma^2(p) = p(1-p) \le \frac{1}{4}

since 𝑝(1 − 𝑝) is maximised at 𝑝 = 1/2. Then we have

P(\hat{p} - 0.03 < p < \hat{p} + 0.03) = P\Big(\frac{-0.03}{\sigma(p)/\sqrt{n}} < \frac{\hat{p} - p}{\sigma(p)/\sqrt{n}} < \frac{0.03}{\sigma(p)/\sqrt{n}}\Big)
\approx \Phi\Big(\frac{0.03}{\sigma(p)/\sqrt{n}}\Big) - \Phi\Big(\frac{-0.03}{\sigma(p)/\sqrt{n}}\Big)
\ge \Phi(0.03\sqrt{4n}) - \Phi(-0.03\sqrt{4n}).

For this probability to be at least 0.95 we need 0.03\sqrt{4n} \ge 1.96, i.e.
n \ge (1.96/0.06)^2 \approx 1067.1, so 𝑛 ≥ 1068. Opinion polls typically use 𝑛 ≈ 1100.
Standard errors

The standard error SE(θ̂) of an estimator θ̂ is its standard deviation:
SE(\hat{\theta}) = \sqrt{\mathrm{var}(\hat{\theta})}.

Example
• Let 𝑋1 , . . . , 𝑋𝑛 ∼iid 𝑁(𝜇, 𝜎²). Then μ̂ = 𝑋̄ and var(μ̂) = 𝜎²/𝑛. So SE(μ̂) = 𝜎/√𝑛.
• Let 𝑋1 , . . . , 𝑋𝑛 ∼iid Bernoulli(𝑝). Then p̂ = 𝑋̄ and var(p̂) = 𝑝(1 − 𝑝)/𝑛. So SE(p̂) = \sqrt{p(1-p)/n}.
Sometimes SE(θ̂) depends on 𝜃 itself, meaning that SE(θ̂) is unknown.
In such cases we have to plug in parameter estimates to get the
estimated standard error, e.g. plugging in gives estimated standard
errors SE(𝑋̄) = σ̂/√𝑛 and SE(p̂) = \sqrt{\hat{p}(1-\hat{p})/n}.

The values plugged in (σ̂ and p̂ above) could be maximum likelihood,
or other, estimates.

(We could write \widehat{SE}(p̂), i.e. with a hat on the SE, to denote that p̂ has
been plugged in, but this is ugly so we won’t; we’ll just write SE(p̂).)
If θ̂ is unbiased, then MSE(θ̂) = var(θ̂) = [SE(θ̂)]². So the standard
error (or estimated standard error) gives some quantification of the
accuracy of estimation.

If in addition θ̂ is approximately 𝑁(𝜃, SE(θ̂)²) then, by the arguments
used above, an approximate 1 − 𝛼 CI for 𝜃 is given by θ̂ ± 𝑧𝛼/2 SE(θ̂),
where again we might need to plug in to obtain the estimated standard
error. Since, roughly, 𝑧0.025 = 2 and 𝑧0.001 = 3, the interval θ̂ ± 2 SE(θ̂)
has confidence level roughly 95%, and θ̂ ± 3 SE(θ̂) roughly 99.8%.
Statistics Publications - ADHD in psychiatric patients

We can reconstruct the confidence intervals reported for the percentage
diagnosed with ADHD. There was an additional note at the bottom of
the table:
Example Suppose 𝑋1 , . . . , 𝑋𝑛 ∼iid 𝑁(𝜇, 𝜎²) with 𝜇 and 𝜎² unknown.

The MLEs are μ̂ = 𝑋̄ and \hat{\sigma}^2 = \frac{1}{n}\sum (X_i - \bar{X})^2, and SE(μ̂) = 𝜎/√𝑛 is
unknown because 𝜎 is unknown. So to use μ̂ ± 𝑧𝛼/2 SE(μ̂) as the basis
for a confidence interval we need to estimate 𝜎. One possibility is to
use σ̂ and so get the interval

\hat{\mu} \pm z_{\alpha/2}\,\frac{\hat{\sigma}}{\sqrt{n}}.
Statistics Publications - Barn Owl Productivity

Source: Pavluvčík P, Poprach K, Machar I, Losík J, Gouveia A, Tkadlec E (2015) Barn Owl Productivity Response to Variability of Vole Populations. PLoS ONE 10(12): e0145851. https://doi.org/10.1371/journal.pone.0145851
6. Linear Regression
CO2 emissions versus GDP
Let 𝑥 measure the GDP per head, and 𝑦 the CO2 emissions per head,
for 178 countries.
[Scatter plot: CO2 emissions per head against GDP per head for the 178 countries.]
Questions of interest:
For fixed 𝑥, what is the average value of 𝑦?
How does that average value change with 𝑥?
A simple model for the dependence of 𝑦 on 𝑥 is
𝑦 = 𝛼 + 𝛽𝑥 + "error".
Note: a linear relationship like this does not necessarily imply that 𝑥
causes 𝑦.
More precise model
We regard the values of 𝑥 as being fixed and known, and we regard the
values of 𝑦 as being the observed values of random variables.
We suppose that

Y_i = \alpha + \beta x_i + \epsilon_i, \qquad i = 1, \ldots, n \tag{2}

where
• 𝑥1 , . . . , 𝑥𝑛 are known constants
• 𝜖1 , . . . , 𝜖𝑛 ∼iid 𝑁(0, 𝜎²)
• 𝛼, 𝛽 are unknown parameters.
US city temperature data

[Scatter plot of US city temperatures, with cities labelled, including Miami FL, Houston TX, San Francisco CA, Albuquerque NM, Boise ID and Minneapolis MN.]

[Scatter plot: Miles per gallon against Horsepower.]
𝑌 = 𝛼 + 𝛽𝑥 + 𝜖 for the CO2 example
Questions:
• How do we estimate 𝛼 and 𝛽?
• Does the mean of 𝑌 actually depend on the value of 𝑥? i.e. is
𝛽 ≠ 0?
We now find the MLEs of 𝛼 and 𝛽, and we regard 𝜎 2 as being known.
The MLEs of 𝛼 and 𝛽 are the same if 𝜎2 is unknown. If 𝜎2 is unknown,
then we simply maximise over 𝜎2 as well to obtain its MLE – this is no
harder than what we do here (try it!). However, working out all of the
properties of this MLE is harder and beyond what we can do in this
course.
We have 𝑌𝑖 ∼ 𝑁(𝛼 + 𝛽𝑥𝑖 , 𝜎²). So 𝑌𝑖 has p.d.f.

f_{Y_i}(y_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{1}{2\sigma^2}(y_i - \alpha - \beta x_i)^2\right), \qquad -\infty < y_i < \infty.

Since the 𝑌𝑖 are independent, the likelihood is

L(\alpha, \beta) = \prod_{i=1}^n f_{Y_i}(y_i) = (2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \alpha - \beta x_i)^2\right)

with log-likelihood

\ell(\alpha, \beta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \alpha - \beta x_i)^2.
Maximising ℓ(𝛼, 𝛽) over 𝛼 and 𝛽 is equivalent to minimising the sum of squares

S(\alpha, \beta) = \sum_{i=1}^n (y_i - \alpha - \beta x_i)^2 = \sum_{i=1}^n \big(y_i - (\alpha + \beta x_i)\big)^2.

For this reason the MLEs of 𝛼 and 𝛽 are also called least squares
estimators.
What does simple linear regression do?
Minimise the sum of squared vertical distances from the points to the
line 𝑦 = 𝛼 + 𝛽𝑥.
[Scatter plot of y against x with the fitted least squares line and the vertical distances from the points to the line.]
Theorem 6.1
The MLEs (or, equivalently, the least squares estimates) of 𝛼 and 𝛽 are
given by

\hat{\alpha} = \frac{(\sum x_i^2)(\sum y_i) - (\sum x_i)(\sum x_i y_i)}{n\sum x_i^2 - (\sum x_i)^2}

\hat{\beta} = \frac{n\sum x_i y_i - (\sum x_i)(\sum y_i)}{n\sum x_i^2 - (\sum x_i)^2}.
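An added sketch (with toy data) computing these closed forms and checking them against numpy's least squares routine:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # toy data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

denom = n * (x**2).sum() - x.sum()**2
beta_hat = (n * (x * y).sum() - x.sum() * y.sum()) / denom
alpha_hat = ((x**2).sum() * y.sum() - x.sum() * (x * y).sum()) / denom

# Cross-check against numpy's least squares solver
A = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print((alpha_hat, beta_hat), coef)   # the two answers agree
```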
Proof of Theorem 6.1

To find α̂ and β̂ we calculate

\frac{\partial S}{\partial \alpha} = -2\sum (y_i - \alpha - \beta x_i)

\frac{\partial S}{\partial \beta} = -2\sum x_i (y_i - \alpha - \beta x_i).

Setting both partial derivatives to zero and solving the resulting pair of
linear equations gives the stated formulas.
Sometimes we consider the model

Y_i = a + b(x_i - \bar{x}) + \epsilon_i, \qquad i = 1, \ldots, n.

In terms of a single observation, the first model is

Y = \alpha + \beta x + \epsilon

and the second is

Y = a + b(x - \bar{x}) + \epsilon = (a - b\bar{x}) + bx + \epsilon.

So the two models are equivalent, with 𝛼 = 𝑎 − 𝑏𝑥̄ and 𝛽 = 𝑏.
Statistics Publications - Eye measurements

Relationships between the peripapillary retinal nerve fiber layer (RNFL, a,b,c)
and the thickness of the ganglion cell-inner plexiform layer (GCIPL, d,e,f) and
other eye characteristics (important for evaluation of glaucoma)

Source: Sam Seo et al. (2017) Ganglion cell-inner plexiform layer and retinal nerve fiber layer thickness according to myopia and optic disc area.
Alternative expressions for α̂ and β̂ are

\hat{\beta} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \tag{3}
            = \frac{\sum (x_i - \bar{x})\, y_i}{\sum (x_i - \bar{x})^2} \tag{4}

\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}.

The above alternative for α̂ follows directly from ∂𝑆/∂𝛼 = 0.

To obtain the alternatives for β̂: Theorem 6.1 gives

\hat{\beta} = \frac{\sum x_i y_i - \frac{1}{n}(\sum x_i)(\sum y_i)}{\sum x_i^2 - \frac{1}{n}(\sum x_i)^2} = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}. \tag{5}

Now check that the numerators and denominators in (3) and (5) are the
same. Then observe that the numerators of (3) and (4) differ by
\sum (x_i - \bar{x})\bar{y}, which is 0.
The fitted regression line is the line y = \hat{\alpha} + \hat{\beta}x.

The point (𝑥̄, 𝑦̄) always lies on this line.

[Scatter plot of y against x with the fitted regression line.]
Bias of regression parameter estimates

Let 𝑤𝑖 = 𝑥𝑖 − 𝑥̄ and note \sum w_i = 0. By (4), \hat{\beta} = \sum w_i Y_i / \sum w_i^2, so

E(\hat{\beta}) = \frac{\sum w_i E(Y_i)}{\sum w_i^2} = \frac{\sum w_i (\alpha + \beta x_i)}{\sum w_i^2} = \frac{\alpha\sum w_i + \beta\sum w_i (w_i + \bar{x})}{\sum w_i^2} = \beta

since \sum w_i = 0, so β̂ is unbiased.
Bias of regression parameter estimates

Also α̂ = 𝑌̄ − β̂𝑥̄ and so

E(\hat{\alpha}) = E(\bar{Y}) - \bar{x}\,E(\hat{\beta})
              = \frac{1}{n}\sum E(Y_i) - \beta\bar{x} \qquad \text{since } E(\hat{\beta}) = \beta
              = \frac{1}{n}\sum (\alpha + \beta x_i) - \beta\bar{x}
              = \frac{1}{n}\cdot n(\alpha + \beta\bar{x}) - \beta\bar{x}
              = \alpha.

So α̂ and β̂ are unbiased.

Note the unbiasedness of α̂, β̂ does not depend on the assumptions
that the 𝜖𝑖 are independent, normal and have the same variance, only
on the assumptions that the errors are additive and 𝐸(𝜖𝑖 ) = 0.
Variance of regression parameter estimates

We are usually only interested in the variance of β̂:

\mathrm{var}(\hat{\beta}) = \mathrm{var}\left(\frac{\sum w_i Y_i}{\sum w_i^2}\right)
 = \frac{1}{(\sum w_i^2)^2}\,\mathrm{var}\Big(\sum w_i Y_i\Big)
 = \frac{1}{(\sum w_i^2)^2}\sum w_i^2\, \mathrm{var}(Y_i)
 = \frac{1}{(\sum w_i^2)^2}\sum w_i^2\, \sigma^2
 = \frac{\sigma^2}{\sum w_i^2}.

Since β̂ is a linear combination of the independent normal random
variables 𝑌𝑖 , the estimator β̂ is itself normal: β̂ ∼ 𝑁(𝛽, 𝜎𝛽²) where
\sigma_\beta^2 = \sigma^2 / \sum w_i^2.
So the standard error of β̂ is 𝜎𝛽 and, if 𝜎² is known, a 95% CI for 𝛽 is
(β̂ − 1.96𝜎𝛽 , β̂ + 1.96𝜎𝛽 ).
A better approach here, but beyond the scope of this course, is to
estimate 𝜎² using

s^2 = \frac{1}{n-2}\sum (y_i - \hat{\alpha} - \hat{\beta}x_i)^2

and to base the confidence interval on a 𝑡-distribution rather than a
normal distribution. This estimator 𝑆² is unbiased for 𝜎² (see Sheet 5),
but details about its distribution and the 𝑡-distribution are beyond this
course – see Parts A/B.
7. Multiple Linear Regression
Multiple linear regression – hill races data
Below are record times (in hours) for 23 hill races together with race
distance (in miles) and amount of climb (in feet).
dist climb time
Binevenagh 7.5 1740 0.8583
Slieve Gullion 4.2 1110 0.4667
Glenariff Mountain 5.9 1210 0.7031
Donard & Commedagh 6.8 3300 1.0386
McVeigh Classic 5.0 1200 0.5411
Tollymore Mountain 4.8 950 0.4833
Slieve Martin 4.3 1600 0.5506
Moughanmore 3.0 1500 0.4636
Hen & Cock 2.5 1500 0.4497
Annalong Horseshoe 12.0 5080 1.9492
Monument Race 4.0 1000 0.4717
Loughshannagh Horseshoe 4.3 1700 0.6469
Rocky 4.0 1300 0.5231
Meelbeg Meelmore 3.5 1800 0.4544
Donard Forest 4.5 1400 0.5186
Slieve Donard 5.5 2790 0.9483
Flagstaff to Carling 11.0 3000 1.4569
Slieve Bearnagh 4.0 2690 0.6878
Seven Sevens 18.9 8775 3.9028
Lurig Challenge 4.0 1000 0.4347
Scrabo Hill Race 2.9 750 0.3247
Slieve Gallion 4.6 1440 0.6361
BARF Turkey Trot 5.7 1430 0.7131
Scatterplots

[Scatter plot: time against dist.]

[Scatter plot: time against climb.]
Multiple linear regression
𝑌 = 𝛽0 + 𝛽 1 𝑥1 + 𝛽 2 𝑥 2 + 𝜖.
This model has one response variable 𝑌 (as usual), but now we have
two explanatory variables 𝑥 1 and 𝑥2 , and three regression parameters
𝛽0 , 𝛽1 , 𝛽2 .
Let the 𝑖th race have time 𝑦𝑖 , distance 𝑥𝑖1 and climb 𝑥𝑖2 . Then in more
detail our model is

Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i, \qquad i = 1, \ldots, n

where
• 𝑥𝑖1 , 𝑥𝑖2 , for 𝑖 = 1, . . . , 𝑛, are known constants
• 𝜖1 , . . . , 𝜖𝑛 ∼iid 𝑁(0, 𝜎²)
• 𝛽0 , 𝛽1 , 𝛽2 are unknown parameters

and (as usual) 𝑦𝑖 denotes the observed value of the random variable 𝑌𝑖 .

As for simple linear regression we obtain the MLEs/least squares
estimates of 𝛽0 , 𝛽1 , 𝛽2 by minimising

S(\beta) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2})^2

with respect to 𝛽0 , 𝛽1 , 𝛽2 , i.e. by solving ∂𝑆/∂𝛽𝑘 = 0 for 𝑘 = 0, 1, 2.

As before, the only property of the 𝜖𝑖 needed to define the least squares
estimates is 𝐸(𝜖𝑖 ) = 0.
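An added sketch of the least squares fit via a design matrix, using the first five races from the table above:

```python
import numpy as np

# First five rows of the hill races table (dist in miles, climb in feet, time in hours)
dist  = np.array([7.5, 4.2, 5.9, 6.8, 5.0])
climb = np.array([1740, 1110, 1210, 3300, 1200])
time  = np.array([0.8583, 0.4667, 0.7031, 1.0386, 0.5411])

# Design matrix with an intercept column; lstsq minimises S(beta)
X = np.column_stack([np.ones(len(dist)), dist, climb])
beta_hat, *_ = np.linalg.lstsq(X, time, rcond=None)
print(beta_hat)   # estimates of beta_0, beta_1, beta_2
```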
What does multiple linear regression do?
For two explanatory variables: minimise the sum of squared vertical
distances from the points to the plane 𝑦 = 𝛽 0 + 𝛽1 𝑥1 + 𝛽 2 𝑥2 .
[3-D plot: data points and the fitted regression plane over the (X1, X2) plane.]
• The max value of time is more than 10 times the min value, with times
bunched up towards zero, and with a long tail, similarly for dist, climb.
A log-transform will lead to a more symmetric spread.
• The longest race has a time more than double the next largest, and the
dist and climb of this race stand out similarly. Using untransformed
variables this race will have much more say in determining the fitted
model than any other, taking logs will reduce this. (Can we be more
precise than “will have much more say”? Yes, see later.)
Untransformed scales

[Pairs plot of time (hours), climb (feet) and dist (miles) on untransformed scales.]
Logarithmic scales

[Pairs plot of time (log hours), climb (log feet) and dist (log miles).]
The fitted model is
Some interpretation: for a given value of climb, if the distance doubles then
time increases by a factor of 2^0.68 = 1.60. Is this reasonable? If the distance
doubles, don’t we expect the time taken to be multiplied by more than 2?
The key thing is that climb is being held constant – so although doubling the
length seems to make a race more than twice as hard, the gradient is halved (if
climb is held constant), making the climbing aspect of the race easier. The
estimated effect overall is the factor of 1.60.
• We can fit a model depending on distance only
Observe that the estimated regression coefficients are different in each model,
e.g. the coefficient of log(dist) is different depending on whether log(climb) is
part of the model – because if log(climb) is included then it can help explain
some of the variation in log(time), whereas if log(climb) is absent then
log(dist) has to account for the variation in log(time) on its own.
A general multiple regression model has 𝑝 explanatory variables
(𝑥1 , . . . , 𝑥 𝑝 ),
𝑌 = 𝛽0 + 𝛽1 𝑥1 + · · · + 𝛽 𝑝 𝑥 𝑝 + 𝜖
Statistics Publications - Multiple regression for monthly hatching success of leatherback turtles

Source: Santidrián Tomillo P, Saba VS, Blanco GS, Stock CA, Paladino FV, Spotila JR (2012) Climate Driven Egg and Hatchling Mortality Threatens Survival of Eastern Pacific Leatherback Turtles. PLoS ONE 7(5): e37602. https://doi.org/10.1371/journal.pone.0037602

Statistics Publications - Multiple regression for monthly emergence rate of leatherback turtles
Quadratic regression
Statistics Publications - Weight gain and diet in fish
Statistics Publications - Cubic regression
𝑌 = 𝛽0 + 𝛽 1 𝑥1 + 𝛽 2 𝑥2 + 𝜖.
8. Assessing the fit of a model
Assessing the fit of a model
Having fitted a model, we should consider how well it fits the data. A
model is normally an approximation to reality: is the approximation
sufficiently good that the model is useful? This question applies to
mathematical models in general. In this course we will approach the
question by considering the fit of a simple linear regression
(generalisations are possible).
For the model 𝑌 = 𝛼 + 𝛽𝑥 + 𝜖 let α̂, β̂ be the usual estimates of 𝛼, 𝛽
based on the observation pairs (𝑥1 , 𝑦1 ), . . . , (𝑥𝑛 , 𝑦𝑛 ).

From now on we consider this model, with the usual assumptions
about 𝜖, unless otherwise stated.

Definition The 𝑖th fitted value ŷ𝑖 of 𝑌 is defined by ŷ𝑖 = α̂ + β̂𝑥𝑖 , for
𝑖 = 1, . . . , 𝑛.

The 𝑖th residual 𝑒𝑖 is defined by 𝑒𝑖 = 𝑦𝑖 − ŷ𝑖 , for 𝑖 = 1, . . . , 𝑛.

The residual sum of squares RSS is defined by RSS = \sum e_i^2.

The residual standard error RSE is defined by RSE = \sqrt{\tfrac{1}{n-2}\,RSS}.
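An added sketch computing fitted values, residuals, RSS and RSE for the toy data used earlier:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # same toy data as before
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

w = x - x.mean()
beta_hat = (w * y).sum() / (w**2).sum()    # expression (4) from Lecture 6
alpha_hat = y.mean() - beta_hat * x.mean()

fitted = alpha_hat + beta_hat * x          # fitted values y-hat_i
e = y - fitted                             # residuals e_i
RSS = (e**2).sum()
RSE = np.sqrt(RSS / (n - 2))               # estimates sigma
print(e, RSS, RSE)
```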
The RSE is an estimate of the standard deviation 𝜎. If the fitted values
are close to the observed values, i.e. ŷ𝑖 ≈ 𝑦𝑖 for all 𝑖 (so that the 𝑒𝑖 are
small), then the RSE will be small. Alternatively if one or more of the 𝑒𝑖
is large then the RSE will be higher.

We have 𝐸(𝑒𝑖 ) = 0. In taking this expectation, we treat 𝑦𝑖 as the random
variable 𝑌𝑖 , and we treat ŷ𝑖 as the random variable α̂ + β̂𝑥𝑖 (in
particular, α̂ and β̂ are estimators, not estimates). Hence

E(e_i) = E(Y_i - \hat{\alpha} - \hat{\beta}x_i)
       = E(Y_i) - E(\hat{\alpha}) - E(\hat{\beta})x_i
       = E(\alpha + \beta x_i + \epsilon_i) - \alpha - \beta x_i \qquad \text{since } \hat{\alpha}, \hat{\beta} \text{ are unbiased}
       = \alpha + \beta x_i + E(\epsilon_i) - \alpha - \beta x_i
       = 0 \qquad \text{since } E(\epsilon_i) = 0.
Potential problem: non-linearity
Residual plots

[Three panels: a scatter plot of y against x with the fitted line, and plots of the residuals against fitted values.]
[Three panels: data simulated from Y = x² + ε with a straight-line fit, and the residuals of Y = β0 + β1x + ε plotted against the fitted values ŷ.]
The residuals should indicate a problem – they do – there is a pattern, they are
not randomly scattered. The curvature indicated in the right-hand plot is what
we should notice in the middle plot. [How to fit the curve is beyond the scope
of this course.]
[Three panels: the same data from Y = x² + ε, and the residuals of the quadratic fit Y = β0 + β1x + β2x² + ε plotted against ŷ.]
Auto data

[Scatter plot: Miles per gallon against Horsepower.]
Auto data, linear fit and quadratic fit

[Scatter plot: Miles per gallon against Horsepower with the fitted linear and quadratic curves.]
[Two panels: residuals against fitted values ŷ for the linear fit (left) and for the quadratic fit (right).]
Left: the pattern (curvature) in the residuals from the linear fit
𝑌 = 𝛽0 + 𝛽 1 𝑥 + 𝜖 indicates non-linearity.
Right: little pattern remains in the residuals from the quadratic fit
𝑌 = 𝛽 0 + 𝛽 1 𝑥 + 𝛽 2 𝑥 2 + 𝜖.
Potential problem: non-constant variance of errors
Non-constant variance of errors

[Two panels: y against x, and residuals against fitted values ŷ, with the spread of the residuals increasing with the fitted value.]
How might we deal with non-constant variance of the errors?

One possibility is to transform the response 𝑌 using a transformation
such as log 𝑌 or √𝑌 (which shrinks larger responses more), leading to
a reduction in heteroscedasticity.
Plots with log 𝑌 as the response variable, so the model is
log 𝑌 = 𝛼 + 𝛽𝑥 + 𝜖.

[Two panels: log(y) against x, and residuals against fitted values, with roughly constant spread.]
Sometimes we might have a good idea about the variance of 𝑌𝑖 : we
might think var(𝑌𝑖 ) = var(𝜖𝑖 ) = 𝜎²/𝑤𝑖 where 𝜎² is unknown but where the
𝑤𝑖 are known. e.g. if 𝑌𝑖 is actually the mean of 𝑛𝑖 observations, where
each of these 𝑛𝑖 observations is made at 𝑥 = 𝑥𝑖 , then var(𝑌𝑖 ) = 𝜎²/𝑛𝑖 . So
𝑤𝑖 = 𝑛𝑖 in this case.

It is straightforward to show (exercise) that the MLEs of 𝛼, 𝛽 are
obtained by minimising

\sum_{i=1}^n w_i (y_i - \alpha - \beta x_i)^2. \tag{6}
Potential problem: outliers
Theorem 8.1 var(𝑒𝑖 ) = 𝜎²(1 − ℎ𝑖 ) where

h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^n (x_j - \bar{x})^2}.

The 𝑖th studentized residual is r_i = \frac{e_i}{RSE\sqrt{1 - h_i}}.
So the 𝑟 𝑖 are all on a comparable scale, each having a standard
deviation of about 1. We will say that observations with |𝑟 𝑖 | > 3 are
possible outliers.
If we believe an outlier is due to an error in data collection, then one
solution is to simply remove the observation from the data and re-fit
the model. However, an outlier may instead indicate a problem with
the model, e.g. a nonlinear relationship between 𝑌 and 𝑥, so care must
be taken.
Similarly, this kind of problem could arise if we have a missing
regressor, i.e. we could be using 𝑌 = 𝛽0 + 𝛽 1 𝑥 1 + 𝜖 when we should
really be using 𝑌 = 𝛽0 + 𝛽 1 𝑥1 + 𝛽 2 𝑥2 + 𝜖.
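An added sketch computing the leverages and studentized residuals for the toy fit used earlier:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # toy data again
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

w = x - x.mean()
beta_hat = (w * y).sum() / (w**2).sum()
alpha_hat = y.mean() - beta_hat * x.mean()

e = y - (alpha_hat + beta_hat * x)
RSE = np.sqrt((e**2).sum() / (n - 2))

h = 1 / n + w**2 / (w**2).sum()            # leverages h_i (Theorem 8.1)
r = e / (RSE * np.sqrt(1 - h))             # studentized residuals
print(h, r)                                # flag |r_i| > 3 as possible outliers
```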
Proof of Theorem 8.1

Idea of the proof: write e_i = \sum_j a_j Y_j and use

\mathrm{var}\Big(\sum_j a_j Y_j\Big) = \sum_j a_j^2\,\mathrm{var}(Y_j) = \sigma^2 \sum_j a_j^2 \tag{7}

since the 𝑌𝑗 are independent with var(𝑌𝑗 ) = 𝜎² for all 𝑗. Here and
below, all sums are from 1 to 𝑛.

First recall

\hat{\beta} = \frac{\sum_j (x_j - \bar{x})\,Y_j}{S_{xx}} \quad \text{where } S_{xx} = \sum_k (x_k - \bar{x})^2

and

\hat{\alpha} = \bar{Y} - \hat{\beta}\bar{x}
            = \frac{1}{n}\sum_j Y_j - \bar{x}\,\frac{\sum_j (x_j - \bar{x})\,Y_j}{S_{xx}}
            = \sum_j \left(\frac{1}{n} - \frac{\bar{x}(x_j - \bar{x})}{S_{xx}}\right) Y_j.
Proof of Theorem 8.1 ... continued

So

\hat{y}_i = \hat{\alpha} + \hat{\beta}x_i
         = \sum_j \left(\frac{1}{n} - \frac{\bar{x}(x_j - \bar{x})}{S_{xx}}\right) Y_j + x_i\,\frac{\sum_j (x_j - \bar{x})\,Y_j}{S_{xx}}
         = \sum_j \left(\frac{1}{n} + \frac{(x_i - \bar{x})(x_j - \bar{x})}{S_{xx}}\right) Y_j. \tag{8}

We can write

Y_i = \sum_j \delta_{ij} Y_j \tag{9}

where

\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise.} \end{cases}
Proof of Theorem 8.1 ... continued

So

e_i = Y_i - \hat{\alpha} - \hat{\beta}x_i

and, using (8) and (9),

e_i = \sum_j \left(\delta_{ij} - \frac{1}{n} - \frac{(x_i - \bar{x})(x_j - \bar{x})}{S_{xx}}\right) Y_j.
Proof of Theorem 8.1 ... continued

As the 𝑌𝑗 are independent, as at (7),

\mathrm{var}(e_i) = \sum_j \left(\delta_{ij} - \frac{1}{n} - \frac{(x_i - \bar{x})(x_j - \bar{x})}{S_{xx}}\right)^2 \mathrm{var}(Y_j)
 = \sigma^2 \sum_j \left[\delta_{ij}^2 + \frac{1}{n^2} + \frac{(x_i - \bar{x})^2 (x_j - \bar{x})^2}{S_{xx}^2} - \frac{2}{n}\delta_{ij} - \frac{2(x_i - \bar{x})(x_j - \bar{x})}{S_{xx}}\delta_{ij} + \frac{2}{n}\frac{(x_i - \bar{x})(x_j - \bar{x})}{S_{xx}}\right]
 = \sigma^2 \left[1 + \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}} - \frac{2}{n} - \frac{2(x_i - \bar{x})^2}{S_{xx}} + \frac{2(x_i - \bar{x})}{n\,S_{xx}}\sum_j (x_j - \bar{x})\right]
 = \sigma^2 \left(1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{S_{xx}}\right)
 = \sigma^2 (1 - h_i). \qquad \square
Plots with studentized residuals

Left: dotted line = regression line based on the black points, red line =
regression line with red point included.

[Three panels: the data with both fitted lines, the residuals 𝑒𝑖 , and the studentized residuals 𝑟𝑖 .]

Residuals 𝑒𝑖 and studentized residuals 𝑟𝑖 are almost the same. The regression
does not fit the red point well, but whether the red point is included or not has
little effect on the fitted line.
Scale of 𝑦 has changed, |𝑒𝑖 | and |𝑟𝑖 | are much smaller. Also, the red point isn’t
so extreme.

[The same three panels on the new scale of 𝑦.]
Potential problem: high leverage points

h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^n (x_j - \bar{x})^2}.
High leverage points tend to have a sizeable impact on the regression
line. Since var(𝑒 𝑖 ) = 𝜎2 (1 − ℎ 𝑖 ), a large leverage ℎ 𝑖 will make var(𝑒 𝑖 )
small. And then since 𝐸(𝑒 𝑖 ) = 0, this means that the regression line will
be “pulled” close to the point (𝑥 𝑖 , 𝑦 𝑖 ).
Leverage

Left: plot of (𝑥𝑖 , 𝑦𝑖 ), 𝑖 = 1, . . . , 𝑛.
Right: plot of leverage values ℎ𝑖 against 𝑥𝑖 .

[Two panels as described; the leverage is largest at the extreme values of 𝑥.]
Blue line = excluding red point, dotted red line = including red point.

[Four panels combining low/high leverage with low/high residual for the red point, e.g. “Low leverage, low residual” and “High leverage, low residual”, showing its varying effect on the fitted line.]
Left: red line = fitted line from black points + point 𝐴, blue line = fitted
line from black points + point 𝐴 + point 𝐵.
Right: studentized residuals 𝑟𝑖 against leverage values ℎ𝑖 .

[Two panels as described, with points 𝐴 and 𝐵 marked.]
Removing point 𝐵 – which has high leverage and a high residual – has
a much more substantial impact on the regression line than removing
an outlier with low leverage and/or low residual.
In the plot of studentized residuals against leverage:
• point B has high leverage and a high residual, it is having a
substantial effect on the regression line yet it still has a high
residual
• in contrast point A has a high residual, the model isn’t fitting well
at A, but A isn’t affecting the fit much at all – the reason that A
has little effect on the fit is that it has low leverage.
Statistics Publications - Underrepresentation of women on boards
Statistics Publications - Cubic regression