Lab 8
Simple regression equation

$$Y = \alpha + \beta x + e$$

where $Y$ is the observed value, $\alpha$ the intercept, and $\beta$ the slope.

NORMAL EQUATIONS

The Least Squares Estimators are used to estimate $\alpha$ and $\beta$: they find the line that minimizes the sum of the squared differences between the observed data points and the predicted values on the line.

$$B = \frac{\sum_{i} x_i Y_i - n\,\bar{x}\,\bar{Y}}{\sum_{i} x_i^{2} - n\,\bar{x}^{2}}, \qquad A = \bar{Y} - B\,\bar{x}$$
Simple regression equation – in R
set.seed(123) # Create some random data
n <- 10; x <- 1:n; y <- 1:n + runif(n, -2, 2)
df <- data.frame(x = x, y = y)
# Least Squares Estimates
B <- (sum(x * y) - n * mean(x) * mean(y)) /
     (sum(x^2) - n * mean(x)^2)
A <- mean(y) - B * mean(x)
# Plot the regression line
library(ggplot2)
ggplot(df, aes(x = x, y = y)) +
  geom_segment(aes(x = x, xend = x, y = A + B*x, yend = y),
               color = "cyan4") +
  geom_point(color = "brown3") +
  geom_abline(aes(slope = B, intercept = A)) +
  labs(title = "Estimated regression line",
       subtitle = paste0("A = ", round(A, 2), ", B = ", round(B, 2)))
The lm() function
The lm() function uses the least squares regression method to fit a linear model:
► y is a vector of dependent variable values
► x is a vector of independent variable values
► data is the data.frame where the x and y variables are contained
model <- lm(y ~ x, data = df)
model
# Call:
# lm(formula = y ~ x, data = df)
# Coefficients:
# (Intercept) x
# 0.2702 1.0078
lm()’s output
Using the names() function on the linear model estimated by lm(), we can see the many outputs available:
names(model)
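For a standard lm fit such as the one above, this returns the names of the stored model components:

# [1] "coefficients"  "residuals"     "effects"       "rank"
# [5] "fitted.values" "assign"        "qr"            "df.residual"
# [9] "xlevels"       "call"          "terms"         "model"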
Where:
► coefficients is a named vector of coefficients
► the residuals are the response minus the fitted.values
► fitted.values are the fitted mean values
lm()’s output
To obtain the estimated coefficients we can use the following code:
coefficients(model); coef(model)
model$coefficients
# (Intercept) x
# 0.2702156 1.0077772
Exercise 1 – Least squares estimates
The dataset below contains the weights and flipper lengths of 10 penguins from the
palmerpenguins dataset:
weights <- c(4100, 3800, 4300, 4550, 3775, 2900, 4600, 5400, 4100, 4500)
flipper_lengths <- c(216, 191, 210, 205, 199, 187, 209, 228, 210, 211)
A. Build a simple regression line using the flipper length as the independent variable and the
weight as the response variable. Manually compute the least squares estimates for the
intercept and slope.
B. Use the lm() function and compare the coefficients to those obtained in A.
C. Use ggplot2() to plot the weights and flipper lengths as points. Then add the regression
line using the geom_abline() function.
D. Remove the regression line from the plot above and replace it using the geom_smooth()
function with method = "lm" and formula = "y ~ x".
SOLUTION
Exercise 1 – Least squares estimates
A. Build a simple regression line using the flipper length as the independent variable and
the weight as the response variable. Manually compute the least squares estimates
for the intercept and slope.
weights <- c(4100, 3800, 4300, 4550, 3775, 2900, 4600, 5400, 4100, 4500)
flipper_lengths <- c(216, 191, 210, 205, 199, 187, 209, 228, 210, 211)
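The computation itself is not shown on the slide; a minimal sketch of the manual estimates (the names data, slope, and intercept are chosen to match the plotting code used in part C):

n <- length(weights)
data <- data.frame(weights, flipper_lengths)
# Least squares estimates: slope and intercept
slope <- (sum(flipper_lengths * weights) - n * mean(flipper_lengths) * mean(weights)) /
  (sum(flipper_lengths^2) - n * mean(flipper_lengths)^2)
intercept <- mean(weights) - slope * mean(flipper_lengths)
slope; intercept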
SOLUTION
Exercise 1 – Least squares estimates
B. Use the lm() function and compare the coefficients to those obtained in A.
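The lm() call is not shown on the slide; a possible sketch (the model name fit is chosen here), reusing the data frame from part A:

fit <- lm(weights ~ flipper_lengths, data = data)
coef(fit)             # coefficients from lm()
c(intercept, slope)   # manual estimates from part A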
C. Use ggplot2() to plot the weights and flipper lengths as points. Then add the
regression line using the geom_abline() function.
library(ggplot2)
ggplot(data, aes(x = flipper_lengths, y = weights)) +
  geom_point() +
  geom_abline(aes(intercept = intercept, slope = slope))
SOLUTION
Exercise 1 – Least squares estimates
D. Remove the regression line from the plot above and replace it using the
geom_smooth() function with method = "lm" and formula = "y ~ x".
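The code is not shown on the slide; a sketch, reusing the scatter plot from part C (in geom_smooth() the formula refers to the x and y aesthetics):

ggplot(data, aes(x = flipper_lengths, y = weights)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)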
Assessing the model: plot the regression line

Graphical evaluation of a simple regression line involves visually inspecting the scatter plot of the data points and the fitted line to assess the line's appropriateness in capturing the overall trend and relationship between the variables.

[Figure: example scatter plots illustrating a strong relationship, a weak relationship, no relationship, and no linear relationship]
Assessing the model: amount of variation

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$

The dependent variable $y$ has a certain amount of variability. When we build a regression model we can decompose the total variability into two components: a portion explained by the independent variable $x$ and the residual, which remains unexplained.

$$S_{yy} = SS_E + SS_R$$

$$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

► $S_{yy}$ = Total sum of squares, $\sum_i (y_i - \bar{y})^2$
► $SS_E$ = Explained sum of squares (sum of squares explained thanks to the regression), $\sum_i (\hat{y}_i - \bar{y})^2$
► $SS_R$ = Residual sum of squares (unexplained sum of squares), $\sum_i (y_i - \hat{y}_i)^2$
Coefficient of determination R²
► By decomposing the errors, we can gain insights into the variability explained by the
model and the remaining unexplained variation.
► The coefficient of determination 𝑹𝟐 measures the proportion of the variance in the
dependent variable that is explained by the independent variables in a regression
model.
► It ranges from 0 to 1, with 0 indicating that the model explains none of the variability and 1 indicating that it explains all of it.
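As a quick illustration, a sketch reusing the model fitted on the lm() slide (summary() stores R² in the r.squared component):

# Coefficient of determination for the earlier model
summary(model)$r.squared
# Equivalently, from the sums of squares
SSe <- sum((fitted(model) - mean(df$y))^2)  # explained
Syy <- sum((df$y - mean(df$y))^2)           # total
SSe / Syy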
Exercise 2 – Determination index
A. Estimate a model without the intercept. Minimise the sum of squares for a model with only a
slope and compute its value on the dataset. Then use the lm() function with a formula
of the form "y ~ -1 + x" to estimate the same model, named model1, and compare the
two values.
B. As above, estimate an intercept-only model, named model2. Then use the lm() function
with a formula of the form "y ~ 1" to estimate the same model and compare the two
values. What do you notice?
C. Estimate a third model, named model3, with intercept and slope. Plot the regression lines
for model1, model2, and model3 over the scatterplot of the data.
D. For model3, compute 𝑆## , 𝑆𝑆$ , and 𝑆𝑆% . Then compute the determination index.
SOLUTION
Exercise 2 – Determination index
A. Estimate a model without the intercept. Minimise the sum of squares for a model with the
sole slope and compute its value in the dataset. Then use the lm() function with a formula
of the form "y ~ -1 + x" to estimate the same model, named model1, and compare
the two values.
Minimising $\sum_{i=1}^{n} (y_i - b x_i)^2$ with respect to $b$ gives

$$b = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^{2}}$$
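The R code is not shown on the slide; a sketch, assuming the exercise's data frame data with columns hours and scores has already been created:

# Manual estimate of the slope-only model
b <- sum(data$hours * data$scores) / sum(data$hours^2)
b
# Same model via lm(), removing the intercept with -1
model1 <- lm(scores ~ -1 + hours, data = data)
coef(model1)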
SOLUTION
Exercise 2 – Determination index
B. As above, estimate an intercept-only model, named model2. Then use the lm()
function with a formula of the form "y ~ 1" to estimate the same model and
compare the two values. What do you notice?
Minimising $\sum_{i=1}^{n} (y_i - a)^2$ with respect to $a$ gives

$$a = \frac{\sum_{i=1}^{n} y_i}{n} = \bar{y}$$

mean(scores)
model2 <- lm(scores ~ 1, data = data)
model2
SOLUTION
Exercise 2 – Determination index
C. Estimate a third model, named model3, with intercept and slope. Plot the regression
lines for model1, model2, and model3 over the scatterplot of the data.
model3 <- lm(scores ~ hours, data = data)
ggplot(data, aes(x = hours, y = scores)) +
  geom_point() +
  geom_abline(aes(intercept = 0, slope = coef(model1),
                  color = "Slope")) +
  geom_abline(aes(intercept = coef(model2), slope = 0,
                  color = "Intercept")) +
  geom_abline(aes(intercept = coef(model3)[1],
                  slope = coef(model3)[2],
                  color = "Intercept and Slope")) +
  guides(color = guide_legend("Model"))
SOLUTION
Exercise 2 – Determination index
D. For model3, compute 𝑆!! , 𝑆𝑆" , and 𝑆𝑆# . Then compute the determination index.
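The sums of squares are not computed on the slide; a sketch (again assuming the data frame data with columns hours and scores):

Syy <- sum((data$scores - mean(data$scores))^2)     # total sum of squares
SSe <- sum((fitted(model3) - mean(data$scores))^2)  # explained sum of squares
SSr <- sum(residuals(model3)^2)                     # residual sum of squares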
R2 <- 1 - SSr/Syy
# 82.85% of the scores' variability is explained by the different
# amounts of study hours!
Assessing the model: residuals
Residuals are the differences between the observed values $y$ of the dependent variable and the values predicted by the regression model $\hat{y}$.

[Figure: residual plots illustrating a satisfactory pattern and a funnel pattern]
Assessing the model: residuals
Residual analysis helps assess how well the regression model captures the patterns and
trends in the data. They are usually standardised by dividing them by the unbiased
estimate of the standard deviation.
$$\text{standardised residual}_i = \frac{y_i - \hat{y}_i}{\sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2}}}$$

where the numerator is the residual and the denominator is the unbiased estimate of the residual standard deviation.
They are useful for:
► Checking normality: Standardized residuals should be distributed as a standardized
normal random variable.
► Detecting Heteroscedasticity: Plotting standardized residuals against the predicted
values or the independent variable can reveal patterns, such as a funnel shape or cone
shape, which may suggest heteroscedasticity.
► Identifying Outliers: points with standardized residuals that fall outside a certain range
(e.g., ±2 or ±3) may indicate observations that have a significant impact on the
regression model. These outliers might influence the estimated coefficients or affect the
overall fit of the model.
Assessing the model: residuals’ plots
# Define a model
m1 <- lm(y ~ x, data = df)
# Extract the residuals
res <- residuals(m1)
# Extract the fitted values
fitted <- fitted(m1)
# Compute the standardized residuals
stdres <- residuals(m1) /
  sqrt(sum(residuals(m1)^2) / m1$df.residual)
# Plot the standardized residuals against the fitted values
ggplot(df, aes(x = fitted, y = stdres)) +
  geom_point(color = "brown3") +
  geom_hline(aes(yintercept = 0), linetype = "dashed") +
  labs(x = "Fitted values",
       y = "Standardized Residuals")
Assessing the model: residuals’ normality
Residuals' normality in simple linear regression can be
assessed using several methods:
► Normality tests: Shapiro-Wilk or Anderson-Darling tests; p-values greater than 0.05 suggest
no evidence against normality.
► Skewness and kurtosis: values close to zero.
► Visual inspection: residuals vs fitted values, residuals histogram and density plot, Q-Q plots.
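A minimal sketch of these checks, reusing the residuals res and standardized residuals stdres from the previous slide (shapiro.test() comes with base R's stats package):

# Shapiro-Wilk normality test on the residuals
shapiro.test(res)
# Quick visual checks with base graphics
hist(stdres, main = "Standardized residuals")
qqnorm(stdres); qqline(stdres)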
Linear model summary()
In R, the command summary() used for a linear model object (e.g., produced by lm()
function) returns all the summary information about the regression model.
[Figure: annotated summary() output highlighting the coefficient estimates]
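For example, applied to the model fitted earlier (a sketch; the summary object also exposes individual components):

summary(model)
# Useful pieces can be extracted directly, e.g.:
summary(model)$coefficients  # estimates, std. errors, t values, p-values
summary(model)$r.squared     # coefficient of determination
summary(model)$sigma         # residual standard error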
Exercise 3 – Regression on penguins
library(palmerpenguins)
df <- na.omit(penguins[, c("body_mass_g", "flipper_length_mm")])
A. Plot the body weights, the flipper lengths and draw the regression line in red.
B. How much variability of the body weights is explained by the model? (hint: compute the determination
index)
C. Compute the standardised residuals and plot them together with the fitted values. Are they
satisfactory? (hint: do you see any pattern?)
D. Compute the theoretical and observed quantiles for the standardised residuals. Draw the Q-Q plot.
Then use the function plot() on the regression model output. What are you obtaining?
E. Do A., B., C., and D., but instead of using the flipper length as independent variable, see if the
bill_length_mm variable could be a good regressor for body weight (hint: how is the determination
index? How are the standardised residuals distributed?)
SOLUTION
Exercise 3 – Regression on penguins
A. Plot the body weights, the flipper lengths and draw the regression line in red.
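The code is not shown on the slide; a sketch (the model m1 is reused on the following slides):

m1 <- lm(body_mass_g ~ flipper_length_mm, data = df)
ggplot(df, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +
  geom_abline(aes(intercept = coef(m1)[1], slope = coef(m1)[2]),
              color = "red") +
  labs(x = "Flipper length (mm)", y = "Body mass (g)",
       title = "Palmer penguins",
       subtitle = "Flipper length and body mass")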
SOLUTION
Exercise 3 – Regression on penguins
B. How much variability of the body weights is explained by the model? (hint: compute the
determination index)
# Automatically
summary(m1)
# Or manually
SSy <- sum((df$body_mass_g - mean(df$body_mass_g))^2)
SSexp <- sum((fitted(m1) - mean(df$body_mass_g))^2)
SSunexp <- sum(residuals(m1)^2)
R2 <- SSexp/SSy
SOLUTION
Exercise 3 – Regression on penguins
C. Compute the standardised residuals and plot them together with the fitted values.
Are they satisfactory?
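The code is not shown on the slide; a sketch following the residuals' plots slide, which also stores the standardized residuals in df so that part D can use them:

df$fitted <- fitted(m1)
df$stdres <- residuals(m1) /
  sqrt(sum(residuals(m1)^2) / m1$df.residual)
ggplot(df, aes(x = fitted, y = stdres)) +
  geom_point() +
  geom_hline(aes(yintercept = 0), linetype = "dashed") +
  labs(x = "Fitted values", y = "Standardized Residuals")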
SOLUTION
Exercise 3 – Regression on penguins
D. Compute the theoretical and observed quantiles for the
standardised residuals. Draw the Q-Q plot. Then use the function
plot() on the regression model output. What are you obtaining?
Assess normality:
# Use plotting positions to avoid infinite quantiles at probabilities 0 and 1
probs <- ppoints(nrow(df))
df$observed_q <- quantile(df$stdres, probs = probs)
df$theoretical_q <- qnorm(probs)
ggplot(df, aes(x = theoretical_q, y = observed_q)) +
  geom_point() +
  geom_abline(aes(intercept = 0, slope = 1)) +
  labs(x = "Theoretical quantiles", y = "Observed quantiles")
SOLUTION
Exercise 3 – Regression on penguins
E. Do A., B., C., and D. but instead of using the flipper length as
independent variable, see if the bill_length_mm variable could
be a good regressor for body weight.
df <- na.omit(penguins[, c("body_mass_g", "bill_length_mm")])
m2 <- lm(body_mass_g ~ bill_length_mm, data = df)
ggplot(data = df, aes(x = bill_length_mm, y = body_mass_g)) +
  geom_point() +
  geom_abline(aes(slope = coef(m2)[2], intercept = coef(m2)[1]),
              color = "red") +
  labs(x = "Bill length (mm)", y = "Body mass (g)",
       title = "Palmer penguins",
       subtitle = "Bill length and body mass")
summary(m2)
SOLUTION
Exercise 3 – Regression on penguins
par(mfrow = c(2, 2))
plot(m2)