Module 3: Linear Regression: TMA4268 Statistical Learning V2025
• Thanks to Mette Langaas and her TAs, who permitted me to use and
modify their original material.
• Some of the figures and slides in this presentation are taken (or
are inspired) from James et al. (2013).
Introduction
[Figure: Scatterplots of bodyfat (y) against four predictor variables.]¹
¹ The data to reproduce these plots and analyses can be found here:
https://github.com/stefaniemuff/statlearning2/tree/master/3LinReg/data
For a good predictive model we need to dive into multiple linear
regression. However, we start with the simple case of only one
predictor variable:
[Figure: Scatterplot of body fat (%) against bmi.]
Interesting questions
1. How good is BMI as a predictor for body fat?
2. How strong is this relationship?
3. Is the relationship linear?
4. Are other variables also associated with bodyfat?
5. How well can we predict the bodyfat of a person?
Simple Linear Regression
$$y_i = \beta_0 + \beta_1 x_i.$$
But which is the “true” or “best” line, if the relationship is not exact?
[Figure: Scatterplot of body fat (%) against bmi.]
Task: Estimate the intercept and slope parameters (by “eye”) and
write it down (we will look at the “best” answer later).
→ Mentimeter
It is obvious that
• the linear relationship does not describe the data perfectly.
• another realization of the data (a different sample of 243 males)
would lead to a slightly different picture.
$$Y = \beta_0 + \beta_1 x + \varepsilon\,, \quad \varepsilon \sim N(0, \sigma^2)\,.$$
Note:
• The model for 𝑌 given 𝑥 has three parameters: 𝛽0 (intercept), 𝛽1
(slope coefficient) and 𝜎2 .
• 𝑥 is the independent/ explanatory / regressor variable.
• 𝑌 is the dependent / outcome / response variable.
Modeling assumptions
[Figure: Scatterplot of y against x illustrating the modeling assumptions.]
• Mathematically, 𝑎 and 𝑏 are estimated such that the sum of
squared vertical distances (residual sum of squares)
$$\text{RSS} = \sum_{i=1}^{n} e_i^2\,, \quad \text{where } e_i = y_i - (a + b x_i),$$
is minimized.
• The respective “best” estimates are called 𝛽0̂ and 𝛽1̂ .
• We can predict the value of the response for a (new) observation
of the covariate at 𝑥.
𝑦 ̂ = 𝛽0̂ + 𝛽1̂ 𝑥.
• The 𝑖-th residual of the model is the difference between the 𝑖-th
observed response value and the 𝑖-th predicted value, and is
written as:
$$e_i = y_i - \hat{y}_i\,.$$
• We may regard the residuals as predictions (not estimates) of the
error terms 𝜀𝑖 .
(The error terms are random variables and can not be estimated - they can be predicted. It is only for
parameters that we speak about estimates.)
Least squares estimators:
Using 𝑛 observed independent data points
the least squares estimates for simple linear regression are given as
$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x$$
and
$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n} (x_i - \bar x)^2} = \frac{\widehat{\text{Cov}}(x, y)}{\widehat{\text{Var}}(x)}\,,$$
where $\bar y = \frac{1}{n}\sum_{i=1}^{n} y_i$ and $\bar x = \frac{1}{n}\sum_{i=1}^{n} x_i$ are the sample means.
This is something you should have proven in your previous statistics classes; if you
forgot how to get there, please check again, e.g., in Chapter 11 of the book by Walpole
et al. (2012), see here.
Do-it-yourself “by hand”
summary(r.bodyfat)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -26.984368 2.7689004 -9.745518 3.921511e-19
## bmi 1.818778 0.1083411 16.787522 2.063854e-42
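As a sketch of the “by hand” computation (assuming the d.bodyfat data frame with
columns bodyfat and bmi from the data link above), the least squares formulas
reproduce the Estimate column:
x <- d.bodyfat$bmi
y <- d.bodyfat$bodyfat
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope estimate
b0 <- mean(y) - b1 * mean(x)                                     # intercept estimate
c(beta0.hat = b0, beta1.hat = b1)  # should match -26.98 and 1.82 above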
We see that the model fits the data quite well and captures the essential trend:
a linear relationship between bodyfat and bmi appears to be a good
approximation.
[Figure: Scatterplot of bodyfat against bmi with the fitted regression line in blue.]
Questions:
• The blue line gives the estimated model. Explain what the line
means in practice. Is this result plausible?
• Compare the estimates for 𝛽0 and 𝛽1 to the estimates you gave at
the beginning - were you close?
• How does this relate to the true (population) model?
• What could the regression line look like if another set of
243 males were used for estimation?
Uncertainty in the estimates 𝛽0̂ and 𝛽1̂
Note: 𝛽0̂ and 𝛽1̂ are themselves random variables and as such contain
uncertainty!
Let us look again at the regression output, this time only for the
coefficients. The second column shows the standard error of the
estimate:
summary(r.bodyfat)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -26.984368 2.7689004 -9.745518 3.921511e-19
## bmi 1.818778 0.1083411 16.787522 2.063854e-42
Simulating new data from a known model and re-estimating 1000 times, we
obtain the following distributions for $\hat\beta_0$ and $\hat\beta_1$:
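A minimal sketch of such a simulation (the true values $\beta_0 = 4$, $\beta_1 = -2$
and the rest of the setup are assumptions, chosen to match the axis ranges of the
histograms below):
set.seed(123)
n <- 1000; beta0 <- 4; beta1 <- -2            # assumed true parameter values
x <- runif(n, -2, 2)
est <- t(replicate(1000, {
  y <- beta0 + beta1 * x + rnorm(n, sd = 1)   # new realization of the data
  coef(lm(y ~ x))                             # re-estimate the coefficients
}))
hist(est[, 1], main = "beta0"); hist(est[, 2], main = "beta1")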
[Figure: Histograms of the 1000 simulated estimates; $\hat\beta_0$ centered around 4.0 and $\hat\beta_1$ around −2.0.]
Accuracy of the parameter estimates
• The standard errors of the estimates are given by the following
formulas:
$$\text{Var}(\hat\beta_0) = \text{SE}(\hat\beta_0)^2 = \sigma^2 \left[\frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^{n}(x_i - \bar x)^2}\right]$$
and
$$\text{Var}(\hat\beta_1) = \text{SE}(\hat\beta_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar x)^2}\,.$$
• Cov(𝛽0̂ , 𝛽1̂ ) is in general different from zero.
• In practice, $\sigma$ is unknown; it is estimated by the residual standard error
$$\hat\sigma = \text{RSE} = \sqrt{\frac{1}{n-2}\,\text{RSS}} = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}(y_i - \hat y_i)^2}\,.$$
• So actually we have
$$\widehat{\text{SE}}(\hat\beta_1)^2 = \frac{\hat\sigma^2}{\sum_{i=1}^{n}(x_i - \bar x)^2}\,,$$
which is the quantity reported in the Std. Error column of the R output:
summary(r.bodyfat)$coef
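A sketch of the “by hand” computation of $\text{SE}(\hat\beta_1)$, plugging
$\hat\sigma$ into the formula above:
sigma.hat <- summary(r.bodyfat)$sigma      # residual standard error
x <- d.bodyfat$bmi
sigma.hat / sqrt(sum((x - mean(x))^2))     # should match the Std. Error of bmi (0.108)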
𝐻0 ∶ 𝛽1 = 0 .
In words: 𝐻0 = “There is no relationship between 𝑋 and 𝑌 .”
𝐻𝐴 ∶ 𝛽1 ≠ 0
The test statistic under $H_0$ is
$$T = \frac{\hat\beta_1}{\text{SE}(\hat\beta_1)}\,.$$
More generally, to test $H_0: \beta_1 = c$ against $H_A: \beta_1 \neq c$ we use
$$T = \frac{\hat\beta_1 - c}{\text{SE}(\hat\beta_1)}\,.$$
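As a sketch, the t-value and two-sided $p$-value for bmi can be reproduced by hand
(under $H_0$, $T \sim t_{n-2}$):
cf <- summary(r.bodyfat)$coef
t.val <- cf["bmi", "Estimate"] / cf["bmi", "Std. Error"]    # 16.79, as in the output
2 * pt(abs(t.val), df = nrow(d.bodyfat) - 2, lower.tail = FALSE)  # two-sided p-value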
Distribution of parameter estimators
We will discuss this a bit in the final module 12. The topic is
connected to good/bad research practice, problems with
“reproducibility” and scientific progress in general. See e.g. here:
confint(r.bodyfat,level=c(0.95))
## 2.5 % 97.5 %
## (Intercept) -32.438703 -21.530032
## bmi 1.605362 2.032195
Interpretation:
For an increase in the bmi by one index point, roughly … percentage
points more bodyfat are expected, and all true values for 𝛽1 between
… and … are ….
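A sketch reproducing the confint() output by hand via
$\hat\beta_1 \pm t_{0.975,\,n-2}\,\text{SE}(\hat\beta_1)$:
cf <- summary(r.bodyfat)$coef
n <- nrow(d.bodyfat)
cf["bmi", "Estimate"] + c(-1, 1) * qt(0.975, df = n - 2) * cf["bmi", "Std. Error"]
## should give 1.605362 and 2.032195, as above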
Model accuracy
Measured by
1. The residual standard error (RSE), which provides an
absolute measure of lack of fit (see above).
2. The $R^2$ statistic, the fraction of variance explained:
$$R^2 = \frac{\text{TSS} - \text{RSS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat y_i)^2}{\sum_{i=1}^{n}(y_i - \bar y)^2}\,,$$
where
$$\text{TSS} = \sum_{i=1}^{n}(y_i - \bar y)^2$$
is the total sum of squares.
summary(r.bodyfat)$r.squared
## [1] 0.5390391
cor(d.bodyfat$bodyfat,d.bodyfat$bmi)^2
## [1] 0.5390391
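The same number can be computed directly from the RSS/TSS definition (a sketch):
RSS <- sum(residuals(r.bodyfat)^2)
TSS <- sum((d.bodyfat$bodyfat - mean(d.bodyfat$bodyfat))^2)
1 - RSS / TSS   # again 0.5390391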
Multiple Linear Regression
We assume
𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + ... + 𝛽𝑝 𝑋𝑝 + 𝜀 , (1)
Y = X𝛽 + 𝜀
Notation
• Y ∶ (𝑛 × 1) vector of responses (e.g., bodyfat).
• X ∶ (𝑛 × (𝑝 + 1)) design matrix, where $x_i^T$ is the (𝑝 + 1)-dimensional
row vector for observation 𝑖.
• 𝛽 ∶ ((𝑝 + 1) × 1) vector of regression parameters (𝛽0 , 𝛽1 , … , 𝛽𝑝 )⊤ .
• 𝜀 ∶ (𝑛 × 1) vector of random errors.
• We assume that pairs (𝑥𝑇𝑖 , 𝑦𝑖 ) (𝑖 = 1, ..., 𝑛) are measured from
independent sampling units.
Y = X𝛽 + 𝜀
Assumptions:
1. E(𝜀) = 0.
2. Cov(𝜀) = E(𝜀𝜀𝑇 ) = 𝜎2 𝐼.
3. The design matrix has full rank, rank(X) = 𝑝 + 1. (We assume
𝑛 >> (𝑝 + 1).)
The classical normal linear regression model is obtained if additionally
4. $\varepsilon \sim N_n(\mathbf{0}, \sigma^2 \mathbf{I})$ holds. Here $N_n$ denotes the $n$-dimensional
multivariate normal distribution.
The bodyfat example for two predictors
𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + 𝜀 ,
with bodyfat as the response (𝑌 ) and bmi and age as 𝑋1 and 𝑋2 .
The regression
r.bodyfat=lm(bodyfat ~ bmi + age ,data=d.bodyfat)
Assume that
Y = X𝛽 + 𝜀 , 𝜀 ∼ 𝑁𝑛 (0, 𝜎2 I) .
Q:
• What is the expected value E(Y) given X?
• The covariance matrix Cov(Y) given X?
• Thus what is the distribution of Y given X?
A:
Y ∼ 𝑁𝑛 (X𝛽, 𝜎2 I)
Parameter estimation for 𝛽
$$\frac{\partial\,\text{RSS}}{\partial \beta} = \mathbf{0}\,.$$
→ Derivation on next 2 pages.
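In condensed form, the derivation is (a sketch):
$$\text{RSS} = (\mathbf{Y} - \mathbf{X}\beta)^T(\mathbf{Y} - \mathbf{X}\beta)\,, \qquad \frac{\partial\,\text{RSS}}{\partial \beta} = -2\,\mathbf{X}^T(\mathbf{Y} - \mathbf{X}\beta) = \mathbf{0} \;\Rightarrow\; \mathbf{X}^T\mathbf{X}\,\beta = \mathbf{X}^T\mathbf{Y}\,,$$
and since $\mathbf{X}$ has full rank, $\mathbf{X}^T\mathbf{X}$ is invertible.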
Summing up:
The least squares and maximum likelihood estimator for $\beta$ is given by
$$\hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}\,.$$
Example continued
r.bodyfat3 <- lm(bodyfat ~ bmi + age + neck + hip +abdomen,data=d.bodyfat)
summary(r.bodyfat3)
##
## Call:
## lm(formula = bodyfat ~ bmi + age + neck + hip + abdomen, data = d.bodyfat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3727 -3.1884 -0.1559 3.1003 12.7613
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.74965 7.29830 -1.062 0.28939
## bmi 0.42647 0.23133 1.844 0.06649 .
## age 0.01457 0.02783 0.524 0.60100
## neck -0.80206 0.19097 -4.200 3.78e-05 ***
## hip -0.31764 0.10751 -2.954 0.00345 **
## abdomen 0.83909 0.08418 9.968 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.392 on 237 degrees of freedom
## Multiple R-squared: 0.7185, Adjusted R-squared: 0.7126
## F-statistic: 121 on 5 and 237 DF, p-value: < 2.2e-16
Reproduce the values under Estimate by calculating without the use
of lm.
X = model.matrix(r.bodyfat3)                # design matrix (incl. intercept column)
Y = d.bodyfat$bodyfat                       # response vector
betahat = solve(t(X) %*% X) %*% t(X) %*% Y  # (X^T X)^{-1} X^T Y
print(betahat)
## [,1]
## (Intercept) -7.74964673
## bmi 0.42647368
## age 0.01457356
## neck -0.80206081
## hip -0.31764315
## abdomen 0.83909391
Distribution of the regression parameter estimator
Given
𝛽̂ = (X𝑇 X)−1 X𝑇 Y ,
what are
• The mean $E(\hat\beta)$?
• The covariance matrix $\text{Cov}(\hat\beta)$?
• The distribution of $\hat\beta$?
A:
$$\hat\beta \sim N_{p+1}\bigl(\beta,\; \underbrace{\sigma^2 (\mathbf{X}^T\mathbf{X})^{-1}}_{\text{covariance matrix}}\bigr)\,.$$
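A sketch verifying the covariance formula numerically (reusing X from the hand
calculation above):
sigma2.hat <- summary(r.bodyfat3)$sigma^2   # estimate of sigma^2
sigma2.hat * solve(t(X) %*% X)              # should match vcov(r.bodyfat3) below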
The covariance matrix of 𝛽 ̂ in R
vcov(r.bodyfat3)
## (Intercept) bmi age neck hip
## (Intercept) 53.26521684 0.6774596810 -0.0780438125 -0.7219656479 -0.548205733
## bmi 0.67745968 0.0535131152 0.0005729015 -0.0120408637 -0.005804073
## age -0.07804381 0.0005729015 0.0007745054 -0.0003432518 0.001523951
## neck -0.72196565 -0.0120408637 -0.0003432518 0.0364680351 -0.002715930
## hip -0.54820573 -0.0058040729 0.0015239515 -0.0027159299 0.011558850
## abdomen 0.16457979 -0.0110809165 -0.0011917596 -0.0007706161 -0.004570722
## abdomen
## (Intercept) 0.1645797895
## bmi -0.0110809165
## age -0.0011917596
## neck -0.0007706161
## hip -0.0045707222
## abdomen 0.0070861066
How does this compare to simple linear regression? Not so easy to see
a connection!
For simple linear regression we had
$$\hat\beta_0 = \bar Y - \hat\beta_1 \bar x \quad\text{and}\quad \hat\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar x)(Y_i - \bar Y)}{\sum_{i=1}^{n}(x_i - \bar x)^2}\,,$$
while for multiple linear regression
$$\hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}\,.$$
$$H_0: \beta_1 = \beta_2 = \ldots = \beta_p = 0$$
vs.
$$H_1: \text{at least one } \beta_j \text{ is different from zero.}$$
The test statistic is
$$F = \frac{(\text{TSS} - \text{RSS})/p}{\text{RSS}/(n - p - 1)} \sim F_{p,\,n-p-1}\,,$$
where total sum of squares $\text{TSS} = \sum_i (y_i - \bar y)^2$, and residual sum of
squares $\text{RSS} = \sum_i (y_i - \hat y_i)^2$. Under the normal regression
assumptions, $F$ follows an $F_{p,\,n-p-1}$ distribution (see Walpole et al.
(2012), Chapter 8.7).
• If 𝐻0 is true, 𝐹 is expected to be ≈ 1.
• Otherwise, we expect that the numerator is larger than the
denominator (because the regression then explains a lot of
variation) and thus 𝐹 is greater than 1. For an observed value 𝑓0 ,
the 𝑝-value is given as
𝑝 = 𝑃 (𝐹𝑝,𝑛−𝑝−1 > 𝑓0 ) .
Checking the 𝐹 -value in the R output:
summary(r.bodyfat)
##
## Call:
## lm(formula = bodyfat ~ bmi + age, data = d.bodyfat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.0415 -3.8725 -0.1237 3.9193 12.6599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -31.25451 2.78973 -11.203 < 2e-16 ***
## bmi 1.75257 0.10449 16.773 < 2e-16 ***
## age 0.13268 0.02732 4.857 2.15e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.329 on 240 degrees of freedom
## Multiple R-squared: 0.5803, Adjusted R-squared: 0.5768
## F-statistic: 165.9 on 2 and 240 DF, p-value: < 2.2e-16
Conclusion?
More complex hypotheses
Sometimes we don’t want to test if all 𝛽’s are zero at the same time,
but only a subset 1, … , 𝑞:
𝐻0 ∶ 𝛽 1 = 𝛽 2 = ⋯ = 𝛽 𝑞 = 0
vs.
𝐻1 ∶ at least one different from zero.
$$F = \frac{(\text{RSS}_0 - \text{RSS})/q}{\text{RSS}/(n - p - 1)} \sim F_{q,\,n-p-1}\,,$$
where
• Large model: RSS with 𝑝 + 1 regression parameters
• Small model: RSS0 with 𝑝 + 1 − 𝑞 regression parameters
Example in R
• Question: Do weight and height explain something of
bodyfat, on top of the variables bmi and age?
• Fit both models and use the anova() function to carry out the
𝐹 -test:
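A sketch of how this could look (assuming d.bodyfat also contains weight and
height columns):
r.small <- lm(bodyfat ~ bmi + age, data = d.bodyfat)
r.large <- lm(bodyfat ~ bmi + age + weight + height, data = d.bodyfat)
anova(r.small, r.large)   # F-test of H0: beta_weight = beta_height = 0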
A special case is
𝐻0 ∶ 𝛽𝑗 = 0 vs. 𝐻1 ∶ 𝛽𝑗 ≠ 0
summary(r.bodyfat)$coef
However:
Only checking the individual 𝑝-values is dangerous. Why?
Inference about 𝛽𝑗 : confidence interval
• Using that
$$T_j = \frac{\hat\beta_j - \beta_j}{\text{SE}(\hat\beta_j)} \sim t_{n-p-1}\,,$$
we can create confidence intervals for $\beta_j$ in the same manner as
we did for simple linear regression (see slide 41). For example,
for the typical 95% confidence level ($\alpha = 0.05$) we have
confint(r.bodyfat)
## 2.5 % 97.5 %
## (Intercept) -36.7499929 -25.7590185
## bmi 1.5467413 1.9583996
## age 0.0788673 0.1864861
2. Deciding on important variables
Overarching question:
We can again look at the two measures from simple linear regression:
• The residual standard error,
$$\hat\sigma = \text{RSE} = \sqrt{\frac{\text{RSS}}{n - p - 1}}\,.$$
• 𝑅2 is again the fraction of variance explained (no change from
simple linear regression)
$$R^2 = \frac{\text{TSS} - \text{RSS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat y_i)^2}{\sum_{i=1}^{n}(y_i - \bar y)^2}\,.$$
summary(r.bodyfatM1)$r.squared
## [1] 0.5390391
summary(r.bodyfatM2)$r.squared
## [1] 0.5802956
summary(r.bodyfatM3)$r.squared
## [1] 0.718497
The models explain 54%, 58% and 72% of the total variability of $y$.
It thus seems that larger models are “better”. However, $R^2$ always
increases when new variables are included, and this alone does not
mean that the model is more reasonable.
Adjusted 𝑅2
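The standard definition penalizes the number of covariates $p$:
$$R^2_{\text{adj}} = 1 - \frac{\text{RSS}/(n - p - 1)}{\text{TSS}/(n - 1)}\,.$$
In contrast to $R^2$, the adjusted version can decrease when a useless variable is
added to the model.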
[Figure: bodyfat against BMI.]
predict(fit,newdata=newobs,interval="confidence",type="response")
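As a sketch of how this call could be set up (fit and newobs are not defined in
the excerpt; the names and the BMI value are assumptions):
fit <- lm(bodyfat ~ bmi, data = d.bodyfat)       # hypothetical model
newobs <- data.frame(bmi = 30)                   # hypothetical new observation
predict(fit, newdata = newobs, interval = "confidence")  # CI for the expected response
predict(fit, newdata = newobs, interval = "prediction")  # wider interval for a new observation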
“All models are wrong, but some are useful.” (Box 1979)
Extensions of the linear model
• Interaction terms
• Non-linear terms
Binary predictors
$$Y_i = \begin{cases} \beta_0 + \varepsilon_i & \text{if } x_i = 0\,, \\ \beta_0 + \beta_1 + \varepsilon_i & \text{if } x_i = 1\,. \end{cases}$$
$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_k x_{ik} + \varepsilon_i\,.$$
The model thus discriminates between the factor levels, such that
(assuming 𝛽1 = 0)
$$y_i = \begin{cases} \beta_0 + \varepsilon_i, & \text{if } x_{i1} = 1 \\ \beta_0 + \beta_2 + \varepsilon_i, & \text{if } x_{i2} = 1 \\ \;\;\vdots \\ \beta_0 + \beta_k + \varepsilon_i, & \text{if } x_{ik} = 1\,. \end{cases}$$
Important to remember!
(A common aspect that leads to confusion!)
We are now using the Credit dataset from the ISLR library.
library(ISLR)
data(Credit)
head(Credit)
## ID Income Limit Rating Cards Age Education Gender Student Married Ethnicity
## 1 1 14.891 3606 283 2 34 11 Male No Yes Caucasian
## 2 2 106.025 6645 483 3 82 15 Female Yes Yes Asian
## 3 3 104.593 7075 514 4 71 11 Male No No Asian
## 4 4 148.924 9504 681 3 36 11 Female No No Asian
## 5 5 55.882 4897 357 2 68 16 Male No Yes Caucasian
## 6 6 80.180 8047 569 4 77 10 Male No No Caucasian
## Balance
## 1 333
## 2 903
## 3 580
## 4 964
## 5 331
## 6 1151
[Figure: Distributions of Balance by Ethnicity in the Credit data.]
In R, a factor covariate can be used in the same way as a continuous
predictor:
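For example (a sketch; r.lm0 is a hypothetical name, and the fit uses R's default
reference category):
r.lm0 <- lm(Balance ~ Ethnicity, data = Credit)
summary(r.lm0)$coef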
$$\hat y_i = \begin{cases} 512.31 & \text{if } i \text{ is Asian} \\ 518.50 & \text{if } i \text{ is Caucasian} \\ 531.00 & \text{if } i \text{ is African American,} \end{cases}$$
i.e., the three group means (recoverable from the coefficient output shown
further below: $518.50 - 6.18$, $518.50$, and $518.50 + 12.50$).
Sidenote: The “reference category”
In the above example we do not see a result for the
EthnicityAfrican American. Why?
• African American is chosen to be the reference category.
• The results for EthnicityAsian and EthnicityCaucasian are
differences with respect to the reference category.
• R chooses the reference category in alphabetic order! This is
sometimes not a relevant category.
• You can change the reference category:
library(dplyr)
Credit <- mutate(Credit,Ethnicity = relevel(Ethnicity,ref="Caucasian"))
r.lm <- lm(Balance ~ Ethnicity, data=Credit)
summary(r.lm)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 518.497487 32.66986 15.8708211 2.824537e-44
## EthnicityAfrican American 12.502513 56.68104 0.2205766 8.255355e-01
## EthnicityAsian -6.183762 56.12165 -0.1101850 9.123184e-01
Note: The differences are now with respect to the Caucasian category – the
model is however exactly the same!
Testing for a categorical predictor
Question: Is a qualitative predictor needed in the model?
For a predictor with more than two levels (like Ethnicity above), the
Null Hypothesis is whether
𝛽1 = … = 𝛽𝑘−1 = 0
at the same time.
→ We again need the $F$-test⁶, as always when we test for more
than one $\beta_j = 0$ simultaneously!
In R, this is done by the anova() function:
anova(r.lm)
## Analysis of Variance Table
##
## Response: Balance
## Df Sum Sq Mean Sq F value Pr(>F)
## Ethnicity 2 18454 9227 0.0434 0.9575
## Residuals 397 84321458 212397
Thus we have a model that allows for different intercept and slope for
the two groups:
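A sketch of the corresponding fit (the coefficient values quoted below are
consistent with an Income × Student interaction model on the Credit data;
r.interact is a hypothetical name):
r.interact <- lm(Balance ~ Income * Student, data = Credit)
summary(r.interact)$coef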
Interpretation:
We allow the model to depend on the binary variable Student, such
that
For a student: $\hat y = 200.6 + 476.7 + (6.2 - 2.0) \cdot \text{Income}$
For a non-student: $\hat y = 200.6 + 6.2 \cdot \text{Income}$
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i\,,$$
Note:
The word linear refers to the linearity in the coefficients, and not on a
linear relationship between 𝑌 and 𝑋1 , … , 𝑋𝑝 !
→ In the later modules, we will discuss other more advanced non-linear approaches for
addressing this issue.
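Within the linear model itself, such a quadratic term can be added via I() in R,
e.g. (a sketch on the bodyfat data; r.quad is a hypothetical name):
r.quad <- lm(bodyfat ~ bmi + I(bmi^2), data = d.bodyfat)
summary(r.quad)$coef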
Challenges - for model fit
1. Non-linearity of data
2. Correlation of error terms
3. Non-constant variance of error terms
4. Non-Normality of error terms
5. Outliers
6. High leverage points
7. Collinearity
Recap of modelling assumptions in linear regression
To make valid inference from our model, we must check if our model
assumptions are fulfilled!⁷
[Figure: “Residuals vs Fitted” plot for the bodyfat model; observations 76, 9 and 12 are flagged.]
[Figure: “Normal Q–Q” plot of the standardized residuals; observations 76, 9 and 12 are flagged.]
[Figure: Scale–Location-type plot against fitted values (y-axis range 0.0–1.2).]
Model checking tool IV: The leverage plot
For simple linear regression, the leverage of the $i$-th observation is
$$H_{ii} = \frac{1}{n} + \frac{(x_i - \bar x)^2}{\sum_{i'}(x_{i'} - \bar x)^2}\,. \qquad (2)$$
[Figure: Three scatterplots (y, y1 and y2 against x) illustrating how single observations affect the fitted regression line.]
The outlier in the middle plot “pulls” the regression line in its
direction and biases the slope.
Go to http://students.brown.edu/seeing-theory/regression-analysis/index.html
to do it manually!⁸
⁸ You may choose “Ordinary Least Squares” and then click on “I” for the
Anscombe quartet example. Drag points to see what happens with the regression
line.
In the leverage plot, (standardized) residuals 𝑟𝑖̃ are plotted against
the leverage 𝐻𝑖𝑖 (still for the bodyfat):
[Figure: “Residuals vs Leverage” plot for the bodyfat model; observations 35, 207 and 39 are flagged.]
Critical ranges are the top and bottom right corners! Why?
Leverages in multiple regression
• Leverage is defined via the diagonal elements of the so-called hat
matrix H⁹, i.e., the leverage of the $i$-th data point is the diagonal
element $H_{ii}$ of $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$.
• Exercise: Verify that formula (2) comes out in the special case of
simple linear regression.
• A large leverage indicates that the observation (𝑖) has a large
influence on the estimation results, and that the covariate values
(𝑥𝑖 ) are unusual.
⁹ Do you remember why H is called the hat matrix?
Different types of residuals?
This means that the residuals (possibly) have different variance, and
may also be correlated.
Standardized residuals:
$$\tilde r_i = \frac{e_i}{\hat\sigma\sqrt{1 - H_{ii}}}\,,$$
where 𝐻𝑖𝑖 is the 𝑖th diagonal element of the hat matrix H.
In R you can get the standardized residuals from an lm-object (named
fit) by rstandard(fit).
Studentized residuals:
$$r_i^* = \frac{e_i}{\hat\sigma_{(i)}\sqrt{1 - H_{ii}}}\,,$$
where $\hat\sigma_{(i)}$ is the estimated residual standard error in a model with
observation number $i$ omitted. It can be shown that the studentized residuals
can be calculated directly from the standardized residuals.
In R you can get the studentized residuals from an lm-object (named
fit) by rstudent(fit).
Diagnostic plots in R
See exercises: We use autoplot() from the ggfortify package in R
to plot the diagnostic plots.
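A minimal sketch (ggfortify provides an autoplot() method for lm objects):
library(ggfortify)
autoplot(r.bodyfat3)   # residuals vs fitted, Q-Q, scale-location, residuals vs leverage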
Collinearity
In brief, collinearity refers to the situation when two or more
predictors are correlated, thus encode (partially) for the same
information.
Problems:
• Reduces the accuracy of the estimated coefficients 𝛽𝑗̂ (large SE!).
• Consequently, reduces power in finding effects (𝑝-values become
larger).
Solutions:
• Detect it by calculating the variance inflation factor (VIF).
• Remove the problematic variable.
• Or combine the collinear variables into a single new one.
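A sketch for detecting collinearity with the variance inflation factor, using
vif() from the car package (one common implementation); recall that
$\text{VIF}_j = 1/(1 - R_j^2)$, where $R_j^2$ is the $R^2$ from regressing the
$j$-th covariate on all the others:
library(car)
vif(r.bodyfat3)   # VIFs much larger than 5-10 signal problematic collinearity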