
Module 3: Linear Regression

TMA4268 Statistical Learning V2025

Stefanie Muff, Department of Mathematical Sciences, NTNU

January 23 and 24, 2025


Acknowledgements

• Thanks to Mette Langaas and her TAs who permitted me to use and modify their original material.
• Some of the figures and slides in this presentation are taken from (or inspired by) James et al. (2013).
Introduction

Learning material for this module

• James et al. (2013): An Introduction to Statistical Learning, Chapter 3 (skip 3.5).

We need more statistical theory than is presented in the textbook, which you will find on this module page.
What will you learn?
• Simple linear regression:
• Model and assumptions
• Least squares
• Testing and confidence intervals
• Multiple linear regression:
• The use of matrix algebra
• Distribution of estimators
• Assessing model fit, model selection
• Confidence and prediction ranges
• Qualitative predictors
• Interactions
• Assessing model fit / residual analysis
Linear regression

• Very simple approach for supervised learning.


• Parametric.
• A quantitative response vs. one or several explanatory variables
(quantitative or qualitative).
• Is linear regression too simple? Maybe, but it is very useful. It is important to understand, because many learning methods can be seen as generalizations of linear regression.
Motivating example: Prognostic factors for body fat
(From Theo Gasser & Burkhardt Seifert Grundbegriffe der Biostatistik)

Body fat is an important indicator of overweight, but it is difficult to measure.
Question: Which factors allow for precise estimation (prediction) of
body fat?
Study with 243 male participants, where body fat (%) and BMI and
other predictors were measured. Some scatterplots1 :
[Figure: scatterplots of bodyfat (y) against bmi, age, neck and hip]

1 The data to reproduce these plots and analyses can be found here: https://github.com/stefaniemuff/statlearning2/tree/master/3LinReg/data
For a good predictive model we need to dive into multiple linear regression. However, we start with the simple case of only one predictor variable:

[Figure: scatterplot of body fat (%) against bmi]
Interesting questions
1. How good is BMI as a predictor for body fat?
2. How strong is this relationship?
3. Is the relationship linear?
4. Are other variables also associated with bodyfat?
5. How well can we predict the bodyfat of a person?
Simple Linear Regression

• One quantitative response 𝑌 is modelled


• from one covariate 𝑥 (=simple),
• and the relationship between 𝑌 and 𝑥 is assumed to be linear.

If the relation between Y and x is perfectly linear, all instances of (x, Y), given by (x_i, y_i), i = 1, … , n, lie on a straight line and fulfill

y_i = β_0 + β_1 x_i .
But which is the “true” or “best” line, if the relationship is not exact?

[Figure: scatterplot of body fat (%) against bmi]

Task: Estimate the intercept and slope parameters (by “eye”) and
write it down (we will look at the “best” answer later).
→ Mentimeter
It is obvious that
• the linear relationship does not describe the data perfectly.
• another realization of the data (other 243 males) would lead to a
slightly different picture.

⇒ We need a model that describes the relationship between BMI and bodyfat.
The simple linear regression model
In the linear regression model the dependent variable 𝑌 is related to
the independent variable 𝑥 as

Y = β_0 + β_1 x + ε ,   ε ∼ N(0, σ^2) .

In this formulation, Y is a random variable with Y | x ∼ N(β_0 + β_1 x, σ^2), where

Y | x = expected value + error ,

with expected value E(Y) = β_0 + β_1 x and error ε.

Note:
• The model for 𝑌 given 𝑥 has three parameters: 𝛽0 (intercept), 𝛽1
(slope coefficient) and 𝜎2 .
• 𝑥 is the independent/ explanatory / regressor variable.
• 𝑌 is the dependent / outcome / response variable.
Modeling assumptions

The central assumption in linear regression is that for any pairs


(𝑥𝑖 , 𝑌𝑖 ), the error 𝜀𝑖 ∼ 𝑁 (0, 𝜎2 ). This implies
a) The expected value of 𝜀𝑖 is 0: E(𝜀𝑖 ) = 0.
b) All 𝜀𝑖 have the same variance: Var(𝜀𝑖 ) = 𝜎2 .
c) All 𝜀𝑖 are normally distributed.
d) 𝜀 is independent of any variable, observation number etc.
e) 𝜀1 , 𝜀2 , … , 𝜀𝑛 are independent of each other.
Visualization of the regression assumptions
The assumptions about the linear regression model lie in the error
term
𝜀 ∼ 𝑁 (0, 𝜎2 ) .

Note: The true regression line goes through E(𝑌 ).


Parameter estimation (“model fitting”)
In a regression analysis, the task is to estimate the regression
coefficients 𝛽0 , 𝛽1 and the residual variance 𝜎2 for a given set of
(𝑥, 𝑦) data.

• Problem: For more than two points (𝑥𝑖 , 𝑦𝑖 ), 𝑖 = 1, … , 𝑛, there is


generally no perfectly fitting line.
• Aim: We want to find the parameters (𝑎, 𝑏) of the best fitting
line 𝑌 = 𝑎 + 𝑏𝑥.
• Idea: Minimize the deviations between the data points (𝑥𝑖 , 𝑦𝑖 )
and the regression line.

But what are we actually going to minimize?


Least squares
Remember the least squares method. Graphically, we are
minimizing the sum of the squared distances over all points:

[Figure: data points (x, y) with vertical distances to a candidate regression line]
• Mathematically, 𝑎 and 𝑏 are estimated such that the sum of
squared vertical distances (residual sum of squares)

RSS = ∑_{i=1}^{n} e_i^2 ,   where e_i = y_i − (a + b x_i)

is being minimized.
• The respective “best” estimates are called 𝛽0̂ and 𝛽1̂ .
• We can predict the value of the response for a (new) observation of the covariate at x: ŷ = β̂_0 + β̂_1 x.
• The i-th residual of the model is the difference between the i-th observed response value and the i-th predicted value, and is written as e_i = y_i − ŷ_i.
• We may regard the residuals as predictions (not estimates) of the
error terms 𝜀𝑖 .
(The error terms are random variables and can not be estimated - they can be predicted. It is only for
parameters that we speak about estimates.)
Least squares estimators:
Using 𝑛 observed independent data points

(𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), … , (𝑥𝑛 , 𝑦𝑛 ) ,

the least squares estimates for simple linear regression are given as

β̂_0 = ȳ − β̂_1 x̄

and

β̂_1 = ∑_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / ∑_{i=1}^{n} (x_i − x̄)^2 = Cov(x, y) / Var(x) ,

where ȳ = (1/n) ∑_{i=1}^{n} y_i and x̄ = (1/n) ∑_{i=1}^{n} x_i are the sample means.

This is something you should have proven in your previous statistics classes; if you forgot how to get there, please check again, e.g. in Chapter 11 of the book by Walpole et al. (2012), see here.
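As a quick sanity check, these formulas can be evaluated directly in R. A minimal sketch, assuming the d.bodyfat data frame from the body fat example (used below) is loaded:

x <- d.bodyfat$bmi
y <- d.bodyfat$bodyfat
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope estimate
b0 <- mean(y) - b1 * mean(x)                                     # intercept estimate
c(b0, b1)  # should agree with coef(lm(bodyfat ~ bmi, data = d.bodyfat))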
Do-it-yourself “by hand”

Go to the Shiny gallery and try to "estimate" the correct parameters. You can do this here:

https://gallery.shinyapps.io/simple_regression/
Example continued: Body fat
Assuming a linear relationship between % body fat (bodyfat) and BMI (bmi), we can get the LS estimates using R as follows:

r.bodyfat = lm(bodyfat~bmi, data=d.bodyfat)

The estimates (and more information) can be obtained as follows:

summary(r.bodyfat)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -26.984368 2.7689004 -9.745518 3.921511e-19
## bmi 1.818778 0.1083411 16.787522 2.063854e-42
We see that the model fits the data quite well; it captures the essence. It looks like a linear relationship between bodyfat and bmi is a good approximation.

[Figure: bodyfat versus bmi with the fitted regression line]
Questions:
• The blue line gives the estimated model. Explain what the line
means in practice. Is this result plausible?
• Compare the estimates for 𝛽0 and 𝛽1 to the estimates you gave at
the beginning - were you close?
• How does this relate to the true (population) model?
• What could the regression line look like if another set of
243 males were used for estimation?
Uncertainty in the estimates 𝛽0̂ and 𝛽1̂
Note: 𝛽0̂ and 𝛽1̂ are themselves random variables and as such contain
uncertainty!

Let us look again at the regression output, this time only for the
coefficients. The second column shows the standard error of the
estimate:
summary(r.bodyfat)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -26.984368 2.7689004 -9.745518 3.921511e-19
## bmi 1.818778 0.1083411 16.787522 2.063854e-42

→ The logical next question is: what is the distribution of the


estimates?
Distribution of the estimators for 𝛽0̂ and 𝛽1̂
To obtain an intuition, we generate data points according to the model

y_i = 4 − 2 x_i + ε_i ,   ε_i ∼ N(0, 0.5^2) .

In each round, we estimate the parameters and store them:
set.seed(1)
niter <- 1000
pars <- matrix(NA,nrow=niter,ncol=2)
for (ii in 1:niter){
x <- rnorm(100)
y <- 4 - 2*x + rnorm(100,0,sd=0.5)
pars[ii,] <- lm(y~x)$coef
}

Doing it 1000 times, we obtain the following distributions for 𝛽0̂ and
𝛽1̂ :
[Figure: histograms of the 1000 simulated estimates of beta0 (around 4) and beta1 (around −2)]
Accuracy of the parameter estimates
• The standard errors of the estimates are given by the following
formulas:
Var(β̂_0) = SE(β̂_0)^2 = σ^2 [ 1/n + x̄^2 / ∑_{i=1}^{n} (x_i − x̄)^2 ]

and

Var(β̂_1) = SE(β̂_1)^2 = σ^2 / ∑_{i=1}^{n} (x_i − x̄)^2 .
• Cov(𝛽0̂ , 𝛽1̂ ) is in general different from zero.

Note: We will derive a general version of these formulas for multiple


linear regression, because without matrix notation this is very
cumbersome.
Under the assumption that 𝜀 ∼ 𝑁 (0, 𝜎2 ), and assuming 𝛽0̂ and 𝛽1̂ are
estimated as in formulas (1) and (2), we have in addition that

β̂_0 ∼ N(β_0, σ^2_{β_0})   and   β̂_1 ∼ N(β_1, σ^2_{β_1}) .

Again: We will derive this in the multiple linear regression version in


more generality (see recommended exercise 3).
Design issue with data collection
Recall that
SE(β̂_1)^2 = σ^2 / ∑_{i=1}^{n} (x_i − x̄)^2 ,
thus for a given 𝜎2 , the standard error is only dependent on the
design of the 𝑥𝑖 ’s!
• Would we like the SE(𝛽1̂ )2 large or small? Why?
• If it is possible for us to choose the 𝑥𝑖 ’s, which strategy should we
use to choose them?
Residual standard error (RSE)
• Problem: 𝜎 is usually not known, but needs to be estimated2 .
• Remember: The residual sum of squares is RSS = ∑_{i=1}^{n} (y_i − β̂_0 − β̂_1 x_i)^2 .
• An estimate of σ, the residual standard error, RSE, is given by

σ̂ = RSE = √( RSS / (n − 2) ) = √( (1/(n − 2)) ∑_{i=1}^{n} (y_i − ŷ_i)^2 ) .

• So actually we have

ŜE(β̂_1)^2 = σ̂^2 / ∑_{i=1}^{n} (x_i − x̄)^2 ,

but we usually just write SE(β̂_1)^2 (without the extra hat).

2 𝜎2 is the irreducible error variance.


The estimated standard errors can be seen using the summary()
function:

summary(r.bodyfat)$coef

## Estimate Std. Error t value Pr(>|t|)


## (Intercept) -26.984368 2.7689004 -9.745518 3.921511e-19
## bmi 1.818778 0.1083411 16.787522 2.063854e-42
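To see how these pieces fit together, here is a minimal sketch that reproduces σ̂ (the RSE) and the standard error of the bmi coefficient by hand (assuming r.bodyfat is the simple model bodyfat ~ bmi fitted above):

n <- nrow(d.bodyfat)
RSS <- sum(residuals(r.bodyfat)^2)
RSE <- sqrt(RSS / (n - 2))     # estimate of sigma; compare with summary(r.bodyfat)$sigma
se.b1 <- RSE / sqrt(sum((d.bodyfat$bmi - mean(d.bodyfat$bmi))^2))
c(RSE, se.b1)                  # se.b1 should match the Std. Error of bmi above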
Testing and Confidence Intervals

After the regression parameters and their uncertainties have been


estimated, there are typically two fundamental questions:

1. “Are the parameters compatible with some specific


value?” Typically, the question is whether the slope 𝛽1 might be
0 or not, that is: “Is 𝑥 an informative predictor or not?”
→ This leads to a statistical test.

2. “Which values of the parameters are compatible with the data?”


→ This leads us to determine confidence intervals.
Let’s first go back to the output from the bodyfat example:
summary(r.bodyfat)$coef

## Estimate Std. Error t value Pr(>|t|)


## (Intercept) -26.984368 2.7689004 -9.745518 3.921511e-19
## bmi 1.818778 0.1083411 16.787522 2.063854e-42

Besides the estimate and the standard error (which we discussed before), there is a t value and a probability Pr(>|t|) that we need to understand.
How do these things help us to answer the two questions above?
Testing the effect of a covariate

Remember: in a statistical test you first need to specify the null


hypothesis. Here, typically, the null hypothesis is

𝐻0 ∶ 𝛽1 = 0 .
In words: H_0 = "There is no relationship between X and Y."³

Here, the alternative hypothesis is given by

𝐻𝐴 ∶ 𝛽1 ≠ 0

3 You could also test against another null hypothesis, like 𝛽1 = 𝑐.


Remember: To carry out a statistical test, we need a test statistic.
This is some type of summary statistic that follows a known
distribution under 𝐻0 . For our purpose, we use the so-called
T-statistic

T = β̂_1 / SE(β̂_1) .

Note: If you want to test against another value than β_1 = 0, say H_0: β_1 = c, the formula is

T = (β̂_1 − c) / SE(β̂_1) .
Distribution of parameter estimators

Brief recap: Under H_0, T has a t-distribution with n − 2 degrees of freedom (n = number of data points; compare to Chapter 8.6 in Walpole et al. (2012)).

We will derive a general version for multiple linear regression!


Hypothesis tests for bodyfat example

So let’s again go back to the bodyfat regression output:


summary(r.bodyfat)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -26.984368 2.7689004 -9.745518 3.921511e-19
## bmi 1.818778 0.1083411 16.787522 2.063854e-42

Q: Check how the Estimate, Std. Error and t value relate.

• The last column contains the 𝑝-values of the tests with 𝐻0 :


𝛽0 = 0 and 𝛽1 = 0, respectively.
• The 𝑝-value for bmi is very small (𝑝 < 0.0001). What does this
mean?
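A minimal sketch of how the columns relate (assuming r.bodyfat is still the simple model bodyfat ~ bmi): the t value is the estimate divided by its standard error, and Pr(>|t|) is the corresponding two-sided tail probability of a t-distribution with n − 2 degrees of freedom.

est <- summary(r.bodyfat)$coef["bmi", "Estimate"]
se  <- summary(r.bodyfat)$coef["bmi", "Std. Error"]
tval <- est / se                                     # T-statistic for H0: beta1 = 0
pval <- 2 * pt(abs(tval), df = nrow(d.bodyfat) - 2, lower.tail = FALSE)
c(tval, pval)                                        # compare with the summary output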
Cautionary notes regarding 𝑝-values:

• The (mis)use of 𝑝-values is heavily under critique in the scientific


world!
• Simple yes/no decisions often stand on very shaky scientific ground.

We will discuss this a bit in the final module 12. The topic is connected to good/bad research practice, problems with "reproducibility" and scientific progress in general. See e.g. here:

• The p-value statement by the ASA: https://amstat.tandfonline.com/doi/full/10.1080/00031305.2016.1154108
• Ideas to redefine what "statistical significance" means: https://www.nature.com/articles/s41562-017-0189-z
• Recently written with some co-authors: https://www.cell.com/trends/ecology-evolution/fulltext/S0169-5347(21)00284-6
Confidence intervals
• Confidence intervals (CIs) are a much more informative way to
report results than 𝑝-values!
• The 𝑡-distribution4 can be used to create confidence intervals for
the regression parameters. The lower and upper limits of a 95%
confidence interval for 𝛽𝑗 are

β̂_j ± t_{1−α/2, n−2} ⋅ SE(β̂_j) ,   j = 0, 1.

• Interpretation of this confidence interval:


• There is a 95% probability that the interval will contain the true
value of 𝛽𝑗 .
• It is the range of parameter estimates that are compatible
with the data .

4 If 𝑛 is large, the normal approximation to the 𝑡-distribution can be used (and


is used in the textbook).
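A minimal sketch of this computation by hand, for the bmi coefficient of the simple model (assuming r.bodyfat and d.bodyfat as above); compare it with the confint() call below:

est <- summary(r.bodyfat)$coef["bmi", "Estimate"]
se  <- summary(r.bodyfat)$coef["bmi", "Std. Error"]
tq  <- qt(0.975, df = nrow(d.bodyfat) - 2)   # t quantile for a 95% CI
c(est - tq * se, est + tq * se)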
We can directly ask R to give us the CIs:

confint(r.bodyfat,level=c(0.95))

## 2.5 % 97.5 %
## (Intercept) -32.438703 -21.530032
## bmi 1.605362 2.032195

Interpretation:
For an increase in the bmi by one index point, roughly … percentage
points more bodyfat are expected, and all true values for 𝛽1 between
… and … are ….
Model accuracy
Measured by
1. The residual standard error (RSE), which provides an
absolute measure of lack of fit (see above).

2. The coefficient of determination 𝑅2 , which measures the


proportion of 𝑦’s variance explained by the model (between 0 and
1), is a relative measure of fit or lack of fit (1 − 𝑅2 ):

R^2 = (TSS − RSS) / TSS = 1 − RSS / TSS = 1 − ∑_{i=1}^{n} (y_i − ŷ_i)^2 / ∑_{i=1}^{n} (y_i − ȳ)^2 ,

where

TSS = ∑_{i=1}^{n} (y_i − ȳ)^2

is the total sum of squares, a measure of the total variability in Y.
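A minimal sketch of this computation for the simple bodyfat model (assuming r.bodyfat is the model bodyfat ~ bmi and d.bodyfat is loaded):

y   <- d.bodyfat$bodyfat
RSS <- sum(residuals(r.bodyfat)^2)   # residual sum of squares
TSS <- sum((y - mean(y))^2)          # total sum of squares
1 - RSS / TSS                        # compare with summary(r.bodyfat)$r.squared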


𝑅2 in simple linear regression
Note: In simple linear regression, 𝑅2 is the squared correlation
between the independent and the dependent variable.

Verify this by comparing 𝑅2 from the bodyfat output to the squared


correlation between the two variables:

summary(r.bodyfat)$r.squared
## [1] 0.5390391
cor(d.bodyfat$bodyfat,d.bodyfat$bmi)^2
## [1] 0.5390391
Multiple Linear Regression

Remember that the bodyfat dataset contained much more


information than only bmi and bodyfat:
• bodyfat: % of body fat.
• age: age of the person.
• weight: body weight.
• height: body height.
• bmi: bmi.
• abdomen: circumference of abdomen.
• hip: circumference of hip.
Model

We assume

𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + ... + 𝛽𝑝 𝑋𝑝 + 𝜀 , (1)

where X_j is the j-th predictor and β_j the respective regression coefficient.

Assume we have n sampling units (x_{1i}, … , x_{pi}, y_i), 1 ≤ i ≤ n, such that each represents an instance of equation (1). We can then use the data matrix

X = [ 1  x_11  …  x_1p
      1  x_21  …  x_2p
      ⋮   ⋮         ⋮
      1  x_n1  …  x_np ]
to write the model in matrix form:

Y = X𝛽 + 𝜀
Notation
• Y ∶ (𝑛 × 1) vector of responses (e.g., bodyfat).
• X ∶ (𝑛 × (𝑝 + 1)) design matrix, and 𝑥𝑇𝑖 is a (𝑝 + 1)-dimensional
row vector for observation 𝑖.
• 𝛽 ∶ ((𝑝 + 1) × 1) vector of regression parameters (𝛽0 , 𝛽1 , … , 𝛽𝑝 )⊤ .
• 𝜀 ∶ (𝑛 × 1) vector of random errors.
• We assume that pairs (𝑥𝑇𝑖 , 𝑦𝑖 ) (𝑖 = 1, ..., 𝑛) are measured from
independent sampling units.

Remark: other books, including the book in TMA4267 and TMA4315


define 𝑝 to include the intercept. This may lead to some confusion
about 𝑝 or 𝑝 + 1 in formulas…
Classical linear model

Y = X𝛽 + 𝜀
Assumptions:
1. E(𝜀) = 0.
2. Cov(𝜀) = E(𝜀𝜀𝑇 ) = 𝜎2 𝐼.
3. The design matrix has full rank, rank(X) = 𝑝 + 1. (We assume
𝑛 >> (𝑝 + 1).)
The classical normal linear regression model is obtained if additionally
4. ε ∼ N_n(0, σ^2 I) holds. Here N_n denotes the n-dimensional multivariate normal distribution.
The bodyfat example for two predictors

We are looking at the regression model

𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + 𝜀 ,
with bodyfat as the response (𝑌 ) and bmi and age as 𝑋1 and 𝑋2 .

The regression
r.bodyfat=lm(bodyfat ~ bmi + age ,data=d.bodyfat)

The design matrix:


head(model.matrix(r.bodyfat))
## (Intercept) bmi age
## 1 1 23.65 23
## 2 1 23.36 22
## 3 1 24.69 22
## 4 1 24.91 26
## 5 1 25.54 24
## 6 1 26.48 24
Distribution of the response vector

Assume that

Y = X𝛽 + 𝜀 , 𝜀 ∼ 𝑁𝑛 (0, 𝜎2 I) .

Q:
• What is the expected value E(Y) given X?
• The covariance matrix Cov(Y) given X?
• Thus what is the distribution of Y given X?
A:
Y ∼ 𝑁𝑛 (X𝛽, 𝜎2 I)
Parameter estimation for 𝛽

In multiple linear regression, the parameter vector 𝛽 is estimated with


maximum likelihood and least squares. These two methods give the
same estimator when we assume the normal linear regression model.

With least squares, we minimize the RSS for a multiple linear regression model:
RSS = ∑_{i=1}^{n} (y_i − ŷ_i)^2 = ∑_{i=1}^{n} (y_i − β̂_0 − β̂_1 x_{i1} − β̂_2 x_{i2} − … − β̂_p x_{ip})^2
    = ∑_{i=1}^{n} (y_i − x_i^T β̂)^2 = (Y − Xβ̂)^T (Y − Xβ̂)

The estimator is found by solving the system of (p + 1) equations

∂RSS/∂β = 0 .

→ Derivation on next 2 pages.
Summing up:
The least squares and maximum likelihood estimator for β is given by

β̂ = (X^T X)^{−1} X^T Y .
Example continued
r.bodyfat3 <- lm(bodyfat ~ bmi + age + neck + hip +abdomen,data=d.bodyfat)
summary(r.bodyfat3)

##
## Call:
## lm(formula = bodyfat ~ bmi + age + neck + hip + abdomen, data = d.bodyfat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3727 -3.1884 -0.1559 3.1003 12.7613
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.74965 7.29830 -1.062 0.28939
## bmi 0.42647 0.23133 1.844 0.06649 .
## age 0.01457 0.02783 0.524 0.60100
## neck -0.80206 0.19097 -4.200 3.78e-05 ***
## hip -0.31764 0.10751 -2.954 0.00345 **
## abdomen 0.83909 0.08418 9.968 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.392 on 237 degrees of freedom
## Multiple R-squared: 0.7185, Adjusted R-squared: 0.7126
## F-statistic: 121 on 5 and 237 DF, p-value: < 2.2e-16
Reproduce the values under Estimate by calculating without the use
of lm.

X=model.matrix(r.bodyfat3 )
Y=d.bodyfat$bodyfat
betahat=solve(t(X)%*%X)%*%t(X)%*%Y
print(betahat)

## [,1]
## (Intercept) -7.74964673
## bmi 0.42647368
## age 0.01457356
## neck -0.80206081
## hip -0.31764315
## abdomen 0.83909391
Distribution of the regression parameter estimator
Given
𝛽̂ = (X𝑇 X)−1 X𝑇 Y ,
what are
• The mean E(β̂)?
• The covariance matrix Cov(β̂)?
• The distribution of β̂?

Hint: Use that


• 𝛽̂ = CY with C = (X𝑇 X)−1 X𝑇 .
• Y ∼ 𝑁𝑛 (X𝛽, 𝜎2 I).
Answers:
See Problem 2 of recommended exercise 3 (and the respective
solutions) to derive that

β̂ ∼ N_{p+1}(β, σ^2 (X^T X)^{−1}) ,

where σ^2 (X^T X)^{−1} is the covariance matrix.
The covariance matrix of 𝛽 ̂ in R

The covariance matrix for the 𝛽 ̂ can be obtained as follows:

vcov(r.bodyfat3)
## (Intercept) bmi age neck hip
## (Intercept) 53.26521684 0.6774596810 -0.0780438125 -0.7219656479 -0.548205733
## bmi 0.67745968 0.0535131152 0.0005729015 -0.0120408637 -0.005804073
## age -0.07804381 0.0005729015 0.0007745054 -0.0003432518 0.001523951
## neck -0.72196565 -0.0120408637 -0.0003432518 0.0364680351 -0.002715930
## hip -0.54820573 -0.0058040729 0.0015239515 -0.0027159299 0.011558850
## abdomen 0.16457979 -0.0110809165 -0.0011917596 -0.0007706161 -0.004570722
## abdomen
## (Intercept) 0.1645797895
## bmi -0.0110809165
## age -0.0011917596
## neck -0.0007706161
## hip -0.0045707222
## abdomen 0.0070861066
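As a small sketch of the connection to the regression output: the standard errors reported by summary() are the square roots of the diagonal of this covariance matrix.

sqrt(diag(vcov(r.bodyfat3)))              # standard errors from the covariance matrix
summary(r.bodyfat3)$coef[, "Std. Error"]  # same values in the summary table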
How does this compare to simple linear regression? Not so easy to see
a connection!

β̂_0 = Ȳ − β̂_1 x̄   and   β̂_1 = ∑_{i=1}^{n} (x_i − x̄)(Y_i − Ȳ) / ∑_{i=1}^{n} (x_i − x̄)^2 ,

β̂ = (X^T X)^{−1} X^T Y .

Exercise: Verify the connection using β = (β_0, β_1)^T and

X = [ 1  x_11
      1  x_21
      ⋮   ⋮
      1  x_n1 ] .
Four important questions

1. Is at least one of the predictors 𝑋1 , … , 𝑋𝑝 useful in predicting


the response?
2. Do all the predictors help to explain 𝑌 , or is only a subset of
predictors useful?
3. How well does the model fit the data?
4. Given a set of predictor variables, how accurate is our prediction?
1. Relationship between predictors and response?

Question is whether we could as well omit all predictor variables at


the same time, that is

𝐻0 ∶ 𝛽 1 = 𝛽 2 = … = 𝛽 𝑝 = 0

vs.

𝐻1 ∶ at least one 𝛽𝑗 is non-zero.


To answer this, we need the 𝐹 -statistic

F = [(TSS − RSS)/p] / [RSS/(n − p − 1)] ∼ F_{p, n−p−1} ,

where the total sum of squares is TSS = ∑_i (y_i − ȳ)^2 and the residual sum of squares is RSS = ∑_i (y_i − ŷ_i)^2. Under the normal regression assumptions, F follows an F_{p, n−p−1} distribution (see Walpole et al. (2012), Chapter 8.7).
• If 𝐻0 is true, 𝐹 is expected to be ≈ 1.
• Otherwise, we expect that the numerator is larger than the
denominator (because the regression then explains a lot of
variation) and thus 𝐹 is greater than 1. For an observed value 𝑓0 ,
the 𝑝-value is given as

𝑝 = 𝑃 (𝐹𝑝,𝑛−𝑝−1 > 𝑓0 ) .
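A minimal sketch that computes the F-statistic and its p-value by hand (assuming r.bodyfat is the model bodyfat ~ bmi + age fitted earlier, so p = 2):

y   <- d.bodyfat$bodyfat
n   <- nrow(d.bodyfat); p <- 2
RSS <- sum(residuals(r.bodyfat)^2)
TSS <- sum((y - mean(y))^2)
F0  <- ((TSS - RSS) / p) / (RSS / (n - p - 1))
c(F0, pf(F0, p, n - p - 1, lower.tail = FALSE))  # compare with the output below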
Checking the 𝐹 -value in the R output:

summary(r.bodyfat)

##
## Call:
## lm(formula = bodyfat ~ bmi + age, data = d.bodyfat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.0415 -3.8725 -0.1237 3.9193 12.6599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -31.25451 2.78973 -11.203 < 2e-16 ***
## bmi 1.75257 0.10449 16.773 < 2e-16 ***
## age 0.13268 0.02732 4.857 2.15e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.329 on 240 degrees of freedom
## Multiple R-squared: 0.5803, Adjusted R-squared: 0.5768
## F-statistic: 165.9 on 2 and 240 DF, p-value: < 2.2e-16

Conclusion?
More complex hypotheses

Sometimes we don’t want to test if all 𝛽’s are zero at the same time,
but only a subset 1, … , 𝑞:

𝐻0 ∶ 𝛽 1 = 𝛽 2 = ⋯ = 𝛽 𝑞 = 0
vs.
𝐻1 ∶ at least one different from zero.

Again, the 𝐹 -test can be used, but now 𝐹 is calculated like

F = [(RSS_0 − RSS)/q] / [RSS/(n − p − 1)] ∼ F_{q, n−p−1} ,
where
• Large model: RSS with 𝑝 + 1 regression parameters
• Small model: RSS0 with 𝑝 + 1 − 𝑞 regression parameters
Example in R
• Question: Do weight and height explain something of
bodyfat, on top of the variables bmi and age?
• Fit both models and use the anova() function to carry out the
𝐹 -test:

r.bodyfat.small=lm(bodyfat ~ bmi + age ,data=d.bodyfat)


r.bodyfat.large=lm(bodyfat ~ bmi + age + weight + height, data=d.bodyfat)
anova(r.bodyfat.small,r.bodyfat.large)
## Analysis of Variance Table
##
## Model 1: bodyfat ~ bmi + age
## Model 2: bodyfat ~ bmi + age + weight + height
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 240 6816.2
## 2 238 6702.9 2 113.28 2.0112 0.1361
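A minimal sketch that reproduces this F-value by hand from the two residual sums of squares (deviance() returns the RSS of an lm fit):

RSS0 <- deviance(r.bodyfat.small)   # RSS of the small model (bmi + age)
RSS  <- deviance(r.bodyfat.large)   # RSS of the large model (+ weight + height)
q <- 2; n <- nrow(d.bodyfat); p <- 4
F0 <- ((RSS0 - RSS) / q) / (RSS / (n - p - 1))
c(F0, pf(F0, q, n - p - 1, lower.tail = FALSE))  # compare with the anova() table above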
Inference about a single predictor 𝛽𝑗

A special case is

𝐻0 ∶ 𝛽𝑗 = 0 vs. 𝐻1 ∶ 𝛽𝑗 ≠ 0

• Nothing new: We did it for simple linear regression!


• This makes sense: it is known that (or you can try to show it yourself)

F_{1, n−p−1} = t^2_{n−p−1} ,

thus we can use a T-statistic with (n − p − 1) degrees of freedom to get the p-value.
Going back again:

summary(r.bodyfat)$coef

## Estimate Std. Error t value Pr(>|t|)


## (Intercept) -31.2545057 2.78973238 -11.203406 1.039096e-23
## bmi 1.7525705 0.10448723 16.773060 2.600646e-42
## age 0.1326767 0.02731582 4.857137 2.149482e-06

However:
Only checking the individual 𝑝-values is dangerous. Why?
Inference about 𝛽𝑗 : confidence interval

• Using that

T_j = β̂_j / SE(β̂_j) ∼ t_{n−p−1} ,

we can create confidence intervals for β_j in the same manner as we did for simple linear regression (see slide 41). For example, when using the typical significance level α = 0.05, we have

β̂_j ± t_{0.975, n−p−1} ⋅ SE(β̂_j) .

confint(r.bodyfat)
## 2.5 % 97.5 %
## (Intercept) -36.7499929 -25.7590185
## bmi 1.5467413 1.9583996
## age 0.0788673 0.1864861
2. Deciding on important variables

Overarching question:

Which model is the best?


But:
• Not clear what best means → we need an objective criterion, like
AIC, BIC, Mallows 𝐶𝑝 , adjusted 𝑅2 .
• There are usually many possible models. For 𝑝 predictors, we
can build 2𝑝 different models.
• Cautionary note: Model selection can also lead to biased parameter estimates.

→ This topic is the focus of Module 6.


3. Model Fit

We can again look at the two measures from simple linear regression:

• An absolute measure of lack of fit is again given by the estimate


of 𝜎, the residual standard error (RSE)

σ̂ = RSE = √( RSS / (n − p − 1) ) .
• 𝑅2 is again the fraction of variance explained (no change from
simple linear regression)
R^2 = (TSS − RSS) / TSS = 1 − RSS / TSS = 1 − ∑_{i=1}^{n} (y_i − ŷ_i)^2 / ∑_{i=1}^{n} (y_i − ȳ)^2 .

Simply speaking: “The higher 𝑅2 , the better.”


However: Caveat with 𝑅2

Let us look at the 𝑅2 s from the three bodyfat models


model 1: 𝑦 ∼ 𝑏𝑚𝑖
model 2: 𝑦 ∼ 𝑏𝑚𝑖 + 𝑎𝑔𝑒
model 3: 𝑦 ∼ 𝑏𝑚𝑖 + 𝑎𝑔𝑒 + 𝑛𝑒𝑐𝑘 + ℎ𝑖𝑝 + 𝑎𝑏𝑑𝑜𝑚𝑒𝑛
summary(r.bodyfatM1)$r.squared

## [1] 0.5390391
summary(r.bodyfatM2)$r.squared

## [1] 0.5802956
summary(r.bodyfatM3)$r.squared

## [1] 0.718497

The models explain 54%, 58% and 72% of the total variability of 𝑦.
It thus seems that larger models are "better". However, R^2 always increases when new variables are included, and this does not mean that the model is more reasonable.
Adjusted 𝑅2

When the sample size n is small with respect to the number of variables p included in the model, an adjusted R^2 gives a better ("fairer") estimate of the actual variability that is explained by the covariates:

R^2_a = 1 − (1 − R^2) (n − 1) / (n − p − 1) .

R^2_a penalizes for adding more variables if they do not really improve the model!

→ R^2_a may decrease when a new variable is added.
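A minimal sketch of this formula in R, for the five-predictor model r.bodyfat3 fitted earlier (here p = 5):

n  <- nrow(d.bodyfat); p <- 5
R2 <- summary(r.bodyfat3)$r.squared
1 - (1 - R2) * (n - 1) / (n - p - 1)  # compare with summary(r.bodyfat3)$adj.r.squared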


Model fit – in a broader sense

We will look at model validation / model checking later.


4. Predictions: Two questions

1. Which other regression lines are compatible with the


observed data?
→ We can use 𝛽0̂ , … , 𝛽𝑝̂ to estimate the least squares plane

𝑌 ̂ = 𝛽0̂ + 𝛽1̂ 𝑋1 + … + 𝛽𝑝̂ 𝑋𝑝

as an approximation of 𝑓(𝑋) = 𝛽0 + 𝛽1 𝑋1 + … + 𝛽𝑝 𝑋𝑝 . This leads to


the confidence interval.

2. Where do future observations with a given 𝑥 coordinate


lie?
→ Even if we could predict 𝑌 ̂ = 𝑓(𝑋), the true value 𝑌 = 𝑓(𝑋) + 𝜀
varies around 𝑌 ̂ . We can compute a prediction interval for new
observations 𝑌 .
[Figure: bodyfat versus BMI with the fitted line, the 95% confidence range and the much wider 95% prediction range]

• Plotting the confidence and prediction intervals around all


predicted values 𝑌0̂ one obtains the confidence range or
confidence band for the expected values of 𝑌 .

Note: The prediction range is much broader than the confidence


range. Why?
Calculation of the confidence and prediction intervals

See problem 2 in the recommended exercise 3 (for this module).


• Confidence and prediction intervals for given data points can be
found in R using predict on an lm object (make sure that
newdata is a data.frame with the same names as the original
data).

fit=lm(bodyfat ~ bmi + age + abdomen,data=d.bodyfat)


newobs=d.bodyfat[1,]

predict(fit,newdata=newobs,interval="confidence",type="response")

## fit lwr upr


## 1 13.17595 11.99122 14.36069
predict(fit,newdata=newobs,interval="prediction",type="response")

## fit lwr upr


## 1 13.17595 3.951613 22.4003
Finally, we need to keep in mind that the model we work with is only
an approximation of the reality. In fact,

In 2014, David Hand wrote:

In general, when building statistical models, we must not forget


that the aim is to understand something about the real world.
Or predict, choose an action, make a decision, summarize
evidence, and so on, but always about the real world, not an
abstract mathematical world: our models are not the reality
– a point well made by George Box in his often-cited remark
that “all models are wrong, but some are useful”.

(Box 1979)
Extensions of the linear model

Extensions that make the linear model very powerful:

• Binary covariates (e.g., male/female, smoker/non-smoker)

• Categorical covariates (e.g., black/white/green)?

• Interaction terms

• Non-linear terms
Binary predictors

So far, the covariates 𝑋 were always continuous.


In reality, there are no restrictions assumed with respect to the 𝑋
variables.
One very frequent data type are binary variables, that is, variables
that can only attain values 0 or 1.

If the binary variable 𝑥 is the only variable in the model


𝑌𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 + 𝜀𝑖 , the model has only two predicted outcomes (plus
error):

Y_i = β_0 + ε_i            if x_i = 0 ,
Y_i = β_0 + β_1 + ε_i      if x_i = 1 .

Example: Credit card data analysis in Section 3.3.1 in the ISLR


book.
Qualitative predictors with more than 2 levels

More generally, a covariate may indicate a category, for instance the


species of an animal or a plant. This type of covariate is called a
factor. The trick: convert a factor variable 𝑋 with 𝑘 levels (for
instance 3 species) into 𝑘 dummy variables 𝑋𝑗 with

x_{ij} = 1, if the i-th observation belongs to group j ,
x_{ij} = 0, otherwise.

Each of the covariates 𝑥1 , … , 𝑥𝑘 can then be included as a binary


variable in the model

𝑦𝑖 = 𝛽0 + 𝛽1 𝑥1 + … + 𝛽𝑘 𝑥𝑘 + 𝜀𝑖 .

However: this model is not identifiable.5


5 What does that mean? I could add a constant to 𝛽1 , 𝛽2 , ...𝛽𝑘 and subtract it
from 𝛽0 , and the model would fit equally well to the data, so it cannot be decided
which set of the parameters is best.
Solution: One of the 𝑘 categories must be selected as a reference
category and is not included in the model. Typically: the first category
is the reference, thus 𝛽1 = 0.

The model thus discriminates between the factor levels, such that
(assuming 𝛽1 = 0)

y_i = β_0 + ε ,          if x_{i1} = 1
y_i = β_0 + β_2 + ε ,    if x_{i2} = 1
  …
y_i = β_0 + β_k + ε ,    if x_{ik} = 1 .
!Important to remember!
(Common aspect that leads to confusion!)

A factor covariate with 𝑘 factor levels requires 𝑘 − 1 parameters!


→ The degrees of freedom of the fitted model are therefore reduced by
𝑘 − 1.
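A minimal sketch (with a toy factor, not part of the Credit example that follows) of how R builds the k − 1 dummy variables via the design matrix:

species <- factor(c("A", "B", "C", "B", "A"))  # a factor with k = 3 levels
model.matrix(~ species)  # intercept plus two dummy columns; level "A" is the reference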
Example

We are now using the Credit dataset from the ISLR library.

library(ISLR)
data(Credit)
head(Credit)
## ID Income Limit Rating Cards Age Education Gender Student Married Ethnicity
## 1 1 14.891 3606 283 2 34 11 Male No Yes Caucasian
## 2 2 106.025 6645 483 3 82 15 Female Yes Yes Asian
## 3 3 104.593 7075 514 4 71 11 Male No No Asian
## 4 4 148.924 9504 681 3 36 11 Female No No Asian
## 5 5 55.882 4897 357 2 68 16 Male No Yes Caucasian
## 6 6 80.180 8047 569 4 77 10 Male No No Caucasian
## Balance
## 1 333
## 2 903
## 3 580
## 4 964
## 5 331
## 6 1151

Question: Do the Balances differ for different Ethnicities


(Caucasian/Asian/Afro-American)?
library(GGally)
ggpairs(Credit[,c("Ethnicity","Balance")],aes(colour=Ethnicity))

[Figure: ggpairs plot of Ethnicity and Balance, coloured by Ethnicity]
In R, a factor covariate can be used in the same way as a continuous
predictor:

r.lm <- lm(Balance ~ Ethnicity, data=Credit)


summary(r.lm)$coef

## Estimate Std. Error t value Pr(>|t|)


## (Intercept) 531.00000 46.31868 11.4640565 1.774117e-26
## EthnicityAsian -18.68627 65.02107 -0.2873880 7.739652e-01
## EthnicityCaucasian -12.50251 56.68104 -0.2205766 8.255355e-01

Interpretation? Do the ethnicities really differ? Check also the 𝐹 -test


in the last line of the summary output.

ŷ_i = …   if i is Asian ,
ŷ_i = …   if i is Caucasian ,
ŷ_i = …   if i is Afro-American .
Sidenote: The “reference category”
In the above example we do not see a result for the
EthnicityAfrican American. Why?
• African American is chosen to be the reference category.
• The results for EthnicityAsian and EthnicityCaucasian are differences with respect to the reference category.
• R chooses the reference category in alphabetic order! This is
sometimes not a relevant category.
• You can change the reference category:

library(dplyr)
Credit <- mutate(Credit,Ethnicity = relevel(Ethnicity,ref="Caucasian"))
r.lm <- lm(Balance ~ Ethnicity, data=Credit)
summary(r.lm)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 518.497487 32.66986 15.8708211 2.824537e-44
## EthnicityAfrican American 12.502513 56.68104 0.2205766 8.255355e-01
## EthnicityAsian -6.183762 56.12165 -0.1101850 9.123184e-01

Note: The differences are now with respect to the Caucasian category – the
model is however exactly the same!
Testing for a categorical predictor
Question: Is a qualitative predictor needed in the model?
For a predictor with more than two levels (like Ethnicity above), the
Null Hypothesis is whether

𝛽1 = … = 𝛽𝑘−1 = 0
at the same time.
→ We again need the 𝐹 -test6 , as always when we test for more
than one 𝛽𝑗 = 0 simultaneously!
In R, this is done by the anova() function:
anova(r.lm)
## Analysis of Variance Table
##
## Response: Balance
## Df Sum Sq Mean Sq F value Pr(>F)
## Ethnicity 2 18454 9227 0.0434 0.9575
## Residuals 397 84321458 212397

6 remember that the 𝐹 -test is a generalization of the 𝑡-test!


Interactions: Removing the additivity assumption
We again look at the Credit dataset. We want to model the Balance
as a function of Income and wheter the person is a student or not.
The model is given as

Balance𝑖 = 𝛽0 + 𝛽1 ⋅ Income𝑖 + 𝛽2 ⋅ Student𝑖 + 𝜀𝑖 ,

where Student is a binary variable. Thus we have a model that looks


like
Balance_i = β_0 + β_2 + β_1 ⋅ Income_i + ε_i ,   if i is a student,
Balance_i = β_0 + β_1 ⋅ Income_i + ε_i ,         otherwise.

In R, we simply add Student to the model:


r.lm <- lm(Balance ~ Income + Student, Credit)

Caveat: This model assumes that students and non-students have


the same slope for Income. Realistic?
Let’s look at the graphs:

→ We want a model that allows for different slopes!


Interaction terms
We formulate a new model that includes the interaction term
(Income ⋅ Student):

Balance𝑖 = 𝛽0 +𝛽1 ⋅Income𝑖 +𝛽2 ⋅Student𝑖 +𝛽3 ⋅Income𝑖 ⋅Student𝑖 +𝜀𝑖 ,

Thus we have a model that allows for different intercept and slope for
the two groups:

Balance_i = β_0 + β_2 + (β_1 + β_3) ⋅ Income_i + ε_i ,   if i is a student,
Balance_i = β_0 + β_1 ⋅ Income_i + ε_i ,                 otherwise.
In R, this is again quite simple:
r.lm <- lm(Balance ~ Income * Student, Credit)
summary(r.lm)$coef

## Estimate Std. Error t value Pr(>|t|)


## (Intercept) 200.623153 33.6983706 5.953497 5.789658e-09
## Income 6.218169 0.5920936 10.502003 6.340684e-23
## StudentYes 476.675843 104.3512235 4.567995 6.586095e-06
## Income:StudentYes -1.999151 1.7312511 -1.154743 2.488919e-01

Interpretation:
We allow the model to depend on the binary variable Student, such
that
For a student: ŷ = 200.6 + 476.7 + (6.2 − 2.0) ⋅ Income
For a non-student: ŷ = 200.6 + 6.2 ⋅ Income

Question: Is the interaction relevant here?


The hierarchical principle

If we include an interaction in a model, we should also include the


main effects, even if the 𝑝-values associated with the coefficients of the
main effects are large (see p.89 in ISLR book).
More interactions
We can include interactions also between
• two continuous variables.
• a categorical variable with more than 2 levels and a continuous
variable.

→ See problem 1 in recommended exercises 3 (important)!


Non-linear terms
Linear regression is even more powerful!
• We have seen that it is possible to include continuous, binary or
factorial covariates in a regression model.
• Even transformations of covariates can be included in (almost) any form, for instance the square of a variable, X^2:

y_i = β_0 + β_1 x_i + β_2 x_i^2 + ε_i ,

which leads to a quadratic or polynomial regression (if higher


order terms are used).

• Other common transformations are:


• log
• √
• sin, cos, …
How can a quadratic regression be a linear regression??

Note:
The word linear refers to the linearity in the coefficients, and not on a
linear relationship between 𝑌 and 𝑋1 , … , 𝑋𝑝 !

Question: When would we need such a regression? Well, sometimes


the world is not linear. In particular, if
• there is a theoretical/biological/medical reason to believe in a
non-linear relationship, or
• the residual analysis indicates that there are non-linear
associations in the data.

→ In the later modules, we will discuss other more advanced non-linear approaches for
addressing this issue.
Challenges - for model fit

1. Non-linearity of data
2. Correlation of error terms
3. Non-constant variance of error terms
4. Non-Normality of error terms
5. Outliers
6. High leverage points
7. Collinearity
Recap of modelling assumptions in linear regression

To make valid inference from our model, we must check if our model
assumptions are fulfilled!7

The assumption in linear regression is that the error terms follow a N(0, σ^2) distribution, implying that:

1. The expected value of 𝜀𝑖 is 0: E(𝜀𝑖 ) = 0.


2. All 𝜀𝑖 have the same variance: Var(𝜀𝑖 ) = 𝜎2 .
3. The 𝜀𝑖 are normally distributed.
4. The 𝜀𝑖 are independent of each other.

7 What is the problem if the assumptions are violated?


Model checking tool I: Tukey-Anscombe diagram

The Tukey-Anscombe diagram plots the residuals against the fitted


values. For the bodyfat data it looks like this:

[Figure: "Residuals vs Fitted" plot for the bodyfat model: residuals against fitted values]

This plot is ideal to check if assumptions 1. and 2. (and partially 4.)


are met. Here, this seems fine.
Model checking tool II: The QQ-diagram

To check assumption 3., the quantiles of the observed distribution are


plotted against the quantiles of the respective theoretical (normal)
distribution:

[Figure: normal Q-Q plot of the standardized residuals against the theoretical quantiles]

If the points lie approximately on a straight line, the data is fairly normally distributed. This is often "tested" by eye, and needs some experience.
Model checking tool III: The scale-location plot

The scale-location plot is particularly suited to check the assumption


of equal variances (homoscedasticity; assumption 2.).

The idea is to plot the square root of the (standardized) residuals


√|𝑟𝑖̃ | against the fitted values 𝑦𝑖̂ . There should be no trend:
[Figure: scale-location plot: √|standardized residuals| against fitted values]
Model checking tool IV: The leverage plot

• Mainly useful to determine outliers.

• To understand the leverage plot, we need to introduce the idea of


the leverage.

• In simple regression, the leverage of individual i is defined as

H_ii = 1/n + (x_i − x̄)^2 / ∑_{i′} (x_{i′} − x̄)^2 .   (2)

Q: When are leverages expected to be large/small?


Illustration: Data points with x_i values far from the mean have a stronger leverage effect than when x_i ≈ x̄:

[Figure: three scatterplots (y, y1, y2 against x); the middle panel contains a point with an outlying x value]

The outlier in the middle plot "pulls" the regression line in its direction and biases the slope.

Go to http://students.brown.edu/seeing-theory/regression-analysis/index.html to do it manually!⁸

8 You may choose "Ordinary Least Squares" and then click on "I" for the Anscombe quartet example. Drag points to see what happens with the regression line.
In the leverage plot, (standardized) residuals 𝑟𝑖̃ are plotted against
the leverage 𝐻𝑖𝑖 (still for the bodyfat):
[Figure: "Residuals vs Leverage" plot for the bodyfat model: standardized residuals against leverage]

Critical ranges are the top and bottom right corners! Why?
Leverages in multiple regression
• Leverage is defined as the diagonal elements of the so-called hat matrix H⁹, i.e., the leverage of the i-th data point is H_ii, the i-th diagonal element of H = X(X^T X)^{−1} X^T.

• Exercise: Verify that formula (2) comes out in the special case of
simple linear regression.
• A large leverage indicates that the observation (𝑖) has a large
influence on the estimation results, and that the covariate values
(𝑥𝑖 ) are unusual.
9 Do you remember why H is called hat matrix?
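A minimal sketch that computes the leverages directly from the hat matrix and compares them with R's built-in hatvalues() (assuming r.bodyfat3 is the five-predictor model fitted earlier):

X <- model.matrix(r.bodyfat3)
H <- X %*% solve(t(X) %*% X) %*% t(X)  # hat matrix
head(diag(H))                          # leverages H_ii
head(hatvalues(r.bodyfat3))            # same values from the built-in function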
Different types of residuals?

It can be shown that the vector of residuals, e = (e_1, e_2, … , e_n), has a (singular) normal distribution with
• E(e) = 0
• Cov(e) = 𝜎2 (I − H),
where H = X(X𝑇 X)−1 X𝑇 .

This means that the residuals (possibly) have different variance, and
may also be correlated.

Q: Why is that a problem?


A:
• To check the model assumptions we want to look at the
distribution of the error terms 𝜀𝑖 , to check that our errors are
independent, homoscedastic (same variance for each observation),
and not dependent on our covariates.
• However, we only have the residuals 𝑒𝑖 , the “predictions” for 𝜖𝑖 .
• It would have been great if the 𝑒𝑖 have the same properties as 𝜖𝑖 .

→ To make the 𝑒𝑖 more “like 𝜖𝑖 ”, we use standardized or studentized


residuals.
Standardized residuals:

r̃_i = e_i / ( σ̂ √(1 − H_ii) ) ,
where 𝐻𝑖𝑖 is the 𝑖th diagonal element of the hat matrix H.
In R you can get the standardized residuals from an lm-object (named
fit) by rstandard(fit).

Studentized residuals:

r*_i = e_i / ( σ̂_(i) √(1 − H_ii) ) ,
where σ̂_(i) is the estimated residual standard error in a model with observation number i omitted. It can be shown that it is possible to calculate the studentized residuals directly from the standardized residuals.
In R you can get the studentized residuals from an lm-object (named
fit) by rstudent(fit).
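A minimal sketch verifying the standardized residuals by hand (assuming r.bodyfat3 is the five-predictor model fitted earlier):

e <- residuals(r.bodyfat3)
h <- hatvalues(r.bodyfat3)
s <- summary(r.bodyfat3)$sigma       # sigma-hat
head(e / (s * sqrt(1 - h)))          # compare with head(rstandard(r.bodyfat3))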
Diagnostic plots in R
See exercises: We use autoplot() from the ggfortify package in R
to plot the diagnostic plots.
Collinearity
In brief, collinearity refers to the situation when two or more
predictors are correlated, thus encode (partially) for the same
information.

Problems:
• Reduces the accuracy of the estimated coefficients 𝛽𝑗̂ (large SE!).
• Consequently, reduces power in finding effects (𝑝-values become
larger).

Solutions:
• Detect it by calculating the variance inflation factor (VIF).
• Remove the problematic variable.
• Or combine the collinear variables into a single new one.

Todo: Read in the course book p.99-102 (self-study).
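A minimal sketch of the VIF computation for one predictor (bmi) in the five-predictor bodyfat model, using the definition VIF_j = 1/(1 − R_j^2), where R_j^2 comes from regressing X_j on the other predictors (assuming d.bodyfat is loaded as before; the vif() function in the car package returns the same values):

r.aux <- lm(bmi ~ age + neck + hip + abdomen, data = d.bodyfat)
1 / (1 - summary(r.aux)$r.squared)   # VIF for bmi
# car::vif(r.bodyfat3) returns the VIFs for all predictors (if car is installed)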


Further reading

• Videos on YouTube by the authors of ISL, Chapter 3


References

Box, G. E. P. 1979. “Robustness in the Strategy of Scientific Model


Building.” In In Robustness in Statistics, edited by R. L. Launer
and G. N. Wilkinson, 201–36. New York: Academic Press.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani.
2013. An Introduction to Statistical Learning. Vol. 112. Springer.
Walpole, R. E., R. H. Myers, S. L. Myers, and K. Ye. 2012. Probability & Statistics for Engineers and Scientists. 9th ed. Boston: Pearson Education Inc.
