
Session7 LinearRegression

The document discusses linear regression models for predicting used car prices. It presents a dataset of 1,417 used Toyota Corolla sales with variables like price, age, mileage, and features. Simple linear regression is performed with price as the target and age as a predictor. The model shows a significant negative relationship between age and price, with age explaining about 77% of price variation. Diagnostic plots are examined to validate that the residuals meet assumptions of linear regression.


Regression: Linear Models

Ignasi Puig

Departament d'Estadística i IO. UPC

October 11th, 2023

Ignasi Puig (Departament d'Estadística i IO. UPC) Regression: Linear Models October 11th, 2023
Outline

1 Problem Description

2 Simple Linear Regression

3 Multiple Linear Regression

4 Others

5 Issues

Problem Description

A group of recent graduates of the Engineering School wants to launch a website for used car
sales. They want to develop an application that, given the characteristics of a car,
provides an estimated price range for it.
To test the feasibility of their idea they have a set of 1,417 sales of used Toyota Corollas.

Load data:
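> # stringsAsFactors = TRUE keeps FuelType as a factor (the default became FALSE in R 4.0.0)
> data <- read.table(file = 'data/ToyotaCorolla.csv', header = TRUE,
                     sep = ',', dec = '.', stringsAsFactors = TRUE)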

Problem Description

Check data structure


> str(data)
'data.frame': 1436 obs. of 10 variables:
$ Price : int 13500 13750 13950 14950 13750 12950 16900 18600 21500 ...
$ Age : int 23 23 24 26 30 32 27 30 27 23 ...
$ KM : int 46986 72937 41711 48000 38500 61000 94612 75889 19700 ...
$ FuelType : Factor w/ 3 levels "CNG","Diesel",..: 2 2 2 2 2 2 2 2 3 2 ...
$ HP : int 90 90 90 90 90 90 90 90 192 69 ...
$ MetColor : int 1 1 1 0 0 0 1 1 0 0 ...
$ Automatic: int 0 0 0 0 0 0 0 0 0 0 ...
$ CC : int 2000 2000 2000 2000 2000 2000 2000 2000 1800 1900 ...
$ Doors : int 3 3 3 3 3 3 3 3 3 3 ...
$ Weight : int 1165 1165 1165 1165 1170 1170 1245 1245 1185 1105 ...

Problem Description
Clean up data

Remove cars with 2 doors (2 cars) and cars running on compressed natural gas ('CNG').
Change the KM variable name to Mileage.
Convert columns 5 to 9 (HP, MetColor, Automatic, CC and Doors) from numeric to factor.

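> library(dplyr)   # provides filter() and mutate_at()
> data <- filter(data, Doors > 2, FuelType != 'CNG')
> data <- droplevels(data)   # drop the now-empty 'CNG' factor level
> colnames(data)[3] <- 'Mileage'
> cols <- c('HP', 'MetColor', 'Automatic', 'CC', 'Doors')
> data <- mutate_at(data, cols, factor)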

Problem Description
Data

n = 1,417 second hand sales of Toyota Corollas.


> head(data)
Id Price Age Mileage FuelType HP MetColor Automatic Doors
1 13500 23 46986 Diesel 90 1 0 3
2 13750 23 72937 Diesel 90 1 0 3
3 13950 24 41711 Diesel 90 1 0 3
4 14950 26 48000 Diesel 90 0 0 3
5 13750 30 38500 Diesel 90 0 0 3

Id, sale id.
Price, price, in USD, paid for the car.
Age, age of the car in months.
Mileage, distance driven (odometer reading; the raw column is named KM).
FuelType, Diesel or Petrol.
HP, car power in horsepower.
MetColor, 0 regular paint, 1 metallic paint.
Automatic, 0 manual transmission, 1 automatic transmission.
Doors, number of doors: 3 (hatchback), 4 (sedan), 5 (family car).
Problem Description
Exploratory Analysis

> summary(data)
     Price            Age           Mileage
 Min.   : 4350   Min.   : 1.00   Min.   :     1
 1st Qu.: 8450   1st Qu.:44.00   1st Qu.: 42800
 Median : 9900   Median :61.00   Median : 63000
 Mean   :10750   Mean   :55.93   Mean   : 67883
 3rd Qu.:11950   3rd Qu.:70.00   3rd Qu.: 86221
 Max.   :32500   Max.   :80.00   Max.   :243000

   FuelType         HP       MetColor  Automatic  Doors
 Diesel: 154   110    :818   0:462     0:1338     3:616
 Petrol:1263   86     :248   1:955     1:  79     4:135
               97     :164                        5:666
               72     : 73
               90     : 36
               69     : 34
               (Other): 44

Problem Description
Questions to Answer

1 Is there a relationship between age and price?
2 Is it stronger than the relationship between mileage and price?
3 How accurately can we estimate the effect of age on a car's price?
4 How well can we predict the price given a car's age?
5 Does the effect of age on price differ between petrol and diesel cars?
6 Is the relationship linear?

Problem Description
Regression Problems

Regression problem: given a set of observations $\{y_i, x_{i1}, x_{i2}, \ldots, x_{ip}\}_{i=1,\ldots,n}$, where $Y$ is a
quantitative random variable and $(X_1, X_2, \ldots, X_p)$ is a set of quantitative and qualitative
variables observed for each individual, we want to build a function that, given the
features of a new individual $i$, predicts its value $y_i$.

Problem Description
Analyzing the relationship

> pairs(data[, 1:5], col = 'red')

Simple Linear Regression
Parameter Estimation

Let’s assume that the linear relationship between price and car age is:

$$Price = \beta_0 + \beta_1\, Age + \epsilon, \qquad \epsilon \sim N(0, \sigma^2)$$

Figure: Price vs. Age. Toyota Corolla second hand sales sample.

Simple Linear Regression
Parameter Properties

> mod1 <- lm(Price ~ Age, data = data)
> summary(mod1)
Call:
lm(formula = Price ~ Age, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-8440.0  -997.9   -24.2   867.7 12868.1

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 20316.037    146.362   138.8   <2e-16 ***
Age          -171.046      2.483   -68.9   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1744 on 1415 degrees of freedom
Multiple R-squared:  0.7704,    Adjusted R-squared:  0.7702
F-statistic:  4747 on 1 and 1415 DF,  p-value: < 2.2e-16

Simple Linear Regression
Model Fit

The overall fit of a linear regression (clearer in the Multiple Linear Regression case, to be
seen later!) can be assessed with:
1 Residual Standard Error (RSE),

$$\hat{\sigma} = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

It measures the average amount by which a response deviates from the true regression line.
2 Determination Coefficient ($R^2$),

$$R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}$$

where $TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2$ and $RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$. It can be understood as the proportion of total
variability explained by the regression.
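As a minimal sketch of the two fit measures above, the following computes them by hand and compares them with the values stored in `summary()`. The data are simulated here: the sample size, coefficients and noise level are illustrative assumptions, not the Toyota Corolla data.

```r
# Simulated stand-in data: every number below is an illustrative assumption
set.seed(1)
n     <- 200
age   <- runif(n, 1, 80)
price <- 20000 - 170 * age + rnorm(n, sd = 1500)
fit   <- lm(price ~ age)

rss <- sum(residuals(fit)^2)          # residual sum of squares
tss <- sum((price - mean(price))^2)   # total sum of squares

rse <- sqrt(rss / (n - 2))            # residual standard error (n - 2: two parameters)
r2  <- 1 - rss / tss                  # coefficient of determination

# Both should agree with what summary() stores
all.equal(rse, summary(fit)$sigma)
all.equal(r2,  summary(fit)$r.squared)
```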

Simple Linear Regression
Model assumptions validation

Model assumptions are validated via the model residuals.


> par(mfrow = c(2, 2))
> plot(mod1, ask = FALSE)
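By default, `plot()` on an `lm` object draws four diagnostic panels: Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage. The sketch below rebuilds the first two by hand to show which quantities they use. It runs on simulated data (an assumption made so that the example is self-contained), not on `mod1`.

```r
# Simulated stand-in data: illustrative assumption, not the course data set
set.seed(2)
age   <- runif(150, 1, 80)
price <- 20000 - 170 * age + rnorm(150, sd = 1500)
fit   <- lm(price ~ age)

res <- residuals(fit)   # raw residuals y_i - yhat_i
std <- rstandard(fit)   # standardized residuals
fv  <- fitted(fit)      # fitted values yhat_i

par(mfrow = c(1, 2))
# No visible pattern and constant spread suggest linearity and constant variance
plot(fv, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
# Points close to the line suggest approximately normal errors
qqnorm(std)
qqline(std)
```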

Multiple Linear Regression
Model

In Multiple Linear Regression one assumes that the response $Y$ relates to more than one
regressor. The model with $p$ regressors ($p+1$ parameters) becomes

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \ldots + \beta_p X_{ip} + \epsilon_i$$

and one can predict $\hat{Y}$ with the estimates $b_j$ of the $\beta_j$:

$$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \ldots + b_p X_p$$

where the $b_j$'s are chosen as those that minimize the residual sum of squares (RSS):

$$RSS = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_{i1} - b_2 x_{i2} - \ldots - b_p x_{ip})^2$$

that is, the Least Squares Estimates again.
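The minimizer of the RSS solves the normal equations $(X^T X)\, b = X^T y$, where $X$ is the design matrix with a leading column of ones. A minimal sketch on simulated data (the variable names and true coefficients are assumptions for illustration) checks that this gives the same estimates as `lm()`:

```r
# Simulated data: names and coefficients are illustrative assumptions
set.seed(3)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                      # design matrix with intercept column
b <- solve(crossprod(X), crossprod(X, y))  # solves (X'X) b = X'y

fit <- lm(y ~ x1 + x2)
cbind(normal_eq = drop(b), lm = coef(fit)) # the two columns should coincide
```

`lm()` itself relies on a QR decomposition rather than on the normal equations, which is numerically more stable; the result is the same up to rounding.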

Multiple Linear Regression
Model

In Multiple Linear Regression one fits the best hyperplane, in the Least Squares sense,
over the observations.

Multiple Linear Regression
Parameter Estimates

> mod2 <- lm(Price ~ Age + Mileage, data = data)
> summary(mod2)
Call:
lm(formula = Price ~ Age + Mileage, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-6823.7  -976.8   -75.9   831.6 12622.8

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.050e+04  1.408e+02  145.56   <2e-16 ***
Age         -1.547e+02  2.762e+00  -56.00   <2e-16 ***
Mileage     -1.611e-02  1.393e-03  -11.56   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1667 on 1414 degrees of freedom
Multiple R-squared:  0.7902,    Adjusted R-squared:  0.7899
F-statistic:  2663 on 2 and 1414 DF,  p-value: < 2.2e-16

Multiple Linear Regression
Parameters Relevance

Is at least one of the predictors useful at predicting the response?

$$H_0: \beta_1 = \beta_2 = \ldots = \beta_p = 0$$
$$H_1: \text{at least one } \beta_j \neq 0$$

Test statistic:

$$F = \frac{(TSS - RSS)/p}{RSS/(n - (p + 1))}$$

Under $H_0$ and with the model assumptions,

$$F \sim F\text{-Snedecor}(p,\; n - (p + 1))$$
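The test statistic can be reproduced directly from TSS and RSS. The sketch below does so on simulated data (an assumption for self-containment; `x2` is deliberately given no real effect) and compares the result with the F-statistic stored by `summary()`.

```r
# Simulated data: illustrative assumption
set.seed(4)
n  <- 120
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 5 + 1.5 * x1 + rnorm(n)    # x2 has no real effect in this simulation
fit <- lm(y ~ x1 + x2)

p   <- 2                         # number of regressors
rss <- sum(residuals(fit)^2)
tss <- sum((y - mean(y))^2)

f_stat  <- ((tss - rss) / p) / (rss / (n - (p + 1)))
# Upper tail of the F distribution with (p, n - (p + 1)) degrees of freedom
p_value <- pf(f_stat, df1 = p, df2 = n - (p + 1), lower.tail = FALSE)

c(by_hand = f_stat, from_summary = unname(summary(fit)$fstatistic["value"]))
```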

Multiple Linear Regression
Model Fit

How well does the model fit the data?


One could use the coefficient of determination ($R^2$) as in Simple Linear Regression,
but it never decreases when regressors are added, even irrelevant ones.
> data.frame(mod = as.character(c(mod2$call, mod3$call)),
             R2 = round(c(summary(mod2)$r.squared,
                          summary(mod3)$r.squared), 4),
             adjR2 = round(c(summary(mod2)$adj.r.squared,
                             summary(mod3)$adj.r.squared), 4))
                                                      mod     R2  adjR2
1        lm(formula = Price ~ Age + Mileage, data = data) 0.7902 0.7899
2 lm(formula = Price ~ Age + Mileage + Rand, data = data) 0.7903 0.7898

$$R^2 = 1 - \frac{RSS}{TSS}$$

$$R^2_{adj} = 1 - \frac{RSS/(n - (p + 1))}{TSS/(n - 1)} = 1 - \frac{\hat{\sigma}^2}{s^2(y)}$$

$R^2_{adj}$ penalizes the introduction of new variables in the model.
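The construction of the `Rand` regressor used in the comparison is not shown in the slides. A plausible reading is a column of pure noise; the sketch below makes that assumption explicit on simulated data (the `rand` variable, the helper `adj_r2()` and all numbers are hypothetical) and computes the adjusted coefficient by hand.

```r
# Simulated data; 'rand' is a hypothetical noise regressor unrelated to price
set.seed(5)
n       <- 300
age     <- runif(n, 1, 80)
mileage <- runif(n, 0, 200000)
price   <- 20000 - 155 * age - 0.016 * mileage + rnorm(n, sd = 1600)
rand    <- rnorm(n)

m_a <- lm(price ~ age + mileage)
m_b <- lm(price ~ age + mileage + rand)

# Adjusted R^2 computed from its definition
adj_r2 <- function(model) {
  y   <- model.response(model.frame(model))
  n   <- length(y)
  p   <- length(coef(model)) - 1      # regressors, excluding the intercept
  rss <- sum(residuals(model)^2)
  tss <- sum((y - mean(y))^2)
  1 - (rss / (n - (p + 1))) / (tss / (n - 1))
}

c(R2_a = summary(m_a)$r.squared, R2_b = summary(m_b)$r.squared)  # R2_b >= R2_a
c(adj_a = adj_r2(m_a), adj_b = adj_r2(m_b))                      # may go down
```

The plain $R^2$ of the larger model cannot be lower, whereas the adjusted version only rises if the extra regressor reduces the RSS enough to offset the lost degree of freedom.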
Multiple Linear Regression
Parameter Meaning

1 $\beta_j$ is the impact on $Y$ of a one-unit increase in $X_j$, keeping all other regressors
constant.
2 A significant $\beta_j$ (large t-statistic and small p-value) means that regressor $X_j$ brings
relevant information to explain/predict $Y$ once all other regressors included in
the model are taken into account. The t-test measures the predictive value of $X_j$
when all other $X_k$, $k \neq j$, are known.

Multiple Linear Regression
Parameter Meaning

$\beta_j$ is the impact on $Y$ of a one-unit increase in $X_j$, keeping all other regressors constant.

> coins <- read.table(file = 'coins.csv', sep = ';', header = TRUE, dec = ',')
> head(coins)
id X1c X2c X5c X10c X20c X50c X1e X2e noCoins noSmallCoins totValueEuros
1 4 3 3 2 5 5 2 4 28 22 13.95
2 2 4 1 1 1 1 2 4 16 10 10.95
3 4 4 3 3 3 3 2 4 26 20 12.67
4 1 4 3 5 1 3 1 4 22 17 11.44
5 2 4 1 5 5 4 4 5 30 21 17.65
6 5 4 3 3 3 4 2 1 25 22 7.18

Multiple Linear Regression
Parameter Meaning

Build one model using the number of small coins, and another one using both the number
of small coins and the total number of coins.
Compare them: write the model formulas and explain the change in the sign of the coefficients.
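One possible way to explore this exercise without the `coins.csv` file is to simulate similar data. The sketch below assumes that each purse holds between 1 and 5 coins of each of the eight euro denominations and that "small" means below 1 euro; both are assumptions read off the `head(coins)` output, not facts stated by the author.

```r
# Simulated purses: counts of 1 to 5 coins per denomination (assumption)
set.seed(6)
n      <- 5000
values <- c(0.01, 0.02, 0.05, 0.10, 0.20, 0.50, 1, 2)   # euro denominations
counts <- matrix(sample(1:5, n * 8, replace = TRUE), ncol = 8)

noSmallCoins  <- rowSums(counts[, 1:6])   # coins below 1 euro
noCoins       <- rowSums(counts)          # all coins
totValueEuros <- drop(counts %*% values)

m_small <- lm(totValueEuros ~ noSmallCoins)
m_both  <- lm(totValueEuros ~ noSmallCoins + noCoins)

coef(m_small)   # noSmallCoins has a positive coefficient
coef(m_both)    # noSmallCoins turns negative once noCoins is in the model
```

On its own, one more small coin adds a little value. With the total number of coins held constant, one more small coin means one fewer 1 or 2 euro coin, so the expected value goes down: this is the "keeping all other regressors constant" reading of the coefficient.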

Multiple Linear Regression
Parameter Meaning

A significant $\beta_j$ (large t-statistic and small p-value) means that regressor $X_j$ brings relevant
information to explain/predict $Y$ once all other regressors included in the model are
taken into account. The t-test measures the predictive value of $X_j$ when all other $X_k$, $k \neq j$,
are known.

> racc <- read.table(file = 'webRACC.csv', sep = ';', header = TRUE,
                     dec = ',')
> head(racc)
Id Marca Mod PrecioEur PotHP VelMaxKmh AccSeg ParMaxNm PesKg
1 Audi A3 23430 122 203 9.3 200 1250
2 Audi A4 40430 177 228 7.8 380 1640
3 Audi A6 50840 204 240 7.2 400 1720
4 Audi Q3 32370 150 200 8.9 250 1405
5 Audi Q5 50620 177 200 9.0 380 1820
6 Audi A4 32200 120 205 10.5 290 1560

Multiple Linear Regression
Parameter Meaning

Build one model using PotHP and AccSeg as predictors, and another one that also includes
VelMaxKmh.
Compare them: can you explain the change in the AccSeg p-value between the two
models?

Multiple Linear Regression
Model Prediction

Given a set of regressors for a new observation: what should our prediction be? How
much uncertainty does it have?
Assuming our linear model is correct, there are two sources of uncertainty:
1 The uncertainty behind our estimation of the $\beta_j$'s (reducible error).
2 The uncertainty behind the model itself: its error $\epsilon$ (irreducible error).
Two types of predictions are possible:
The expected value (confidence interval): the confidence interval for the average
value $\hat{Y}$ given the regressors $\{X_j = x_j\}_{j=1,\ldots,p}$.
The observed value (prediction interval): the interval for the prediction
of an individual, $\hat{Y}_i$, given its regressors $\{X_j = x_{ij}\}_{j=1,\ldots,p}$.

Multiple Linear Regression
Model Prediction

What is the average price of a 5-year-old car (60 months) with 50,000 miles?
> round(predict(mod2, newdata = data.frame(Age = 60, Mileage = 50000),
                interval = 'confidence'), 1)
  fit     lwr     upr
10408 10300.7 10515.4
What is the price range for a 5-year-old car with 50,000 miles?
> round(predict(mod2, newdata = data.frame(Age = 60, Mileage = 50000),
                interval = 'prediction'), 1)
  fit    lwr     upr
10408 7135.8 13680.2
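The difference between the two intervals can be reproduced on simulated data. The sketch below (all numbers are illustrative assumptions, loosely inspired by the `mod2` estimates) requests both intervals for the same new car and compares their widths.

```r
# Simulated data: illustrative assumption, not the course data set
set.seed(7)
n       <- 300
age     <- runif(n, 1, 80)
mileage <- runif(n, 0, 200000)
price   <- 20000 - 155 * age - 0.016 * mileage + rnorm(n, sd = 1600)
fit     <- lm(price ~ age + mileage)

new_car  <- data.frame(age = 60, mileage = 50000)
conf_int <- predict(fit, newdata = new_car, interval = "confidence")
pred_int <- predict(fit, newdata = new_car, interval = "prediction")

# Same point estimate; the prediction interval also carries the error term
rbind(confidence = conf_int[1, ], prediction = pred_int[1, ])
```

In the `mod2` output, the half-width of the prediction interval (about 3,272) is close to 1.96 times the residual standard error (1667), which suggests that almost all of its width comes from the irreducible error rather than from the uncertainty in the coefficients.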

Multiple Linear Regression
Model Assumptions

Caution! The model can only be used if the model assumptions are met:

Assumptions:

$$Y_i = \beta_0 + \sum_{j=1}^{p} \beta_j X_{ij} + \epsilon_i$$

$$\epsilon_i \sim N(0, \sigma^2)$$

Others
Qualitative Predictors

Is there a relationship between type of transmission and price?

Figure: Price vs. Age by transmission type

Others
Qualitative Predictors

Is there a relationship between type of transmission and price?


> round(mod4$coefficients, 2)
(Intercept)         Age     Mileage  Automatic1
   20463.85     -155.61       -0.02      756.51
What does the Automatic1 parameter estimate mean?
Dichotomous categorical variables are coded as 0/1. The model is then:

$$Price_i = b_0 + b_{Age}\,Age_i + b_{Mileage}\,Mileage_i + b_{Automatic}\,Automatic_i$$

When the car is manual, the variable Automatic1 is 0 and the model is:

$$Price_i = b_0 + b_{Age}\,Age_i + b_{Mileage}\,Mileage_i$$

When the car is automatic, the variable Automatic1 is 1 and the model is:

$$Price_i = b_0 + b_{Age}\,Age_i + b_{Mileage}\,Mileage_i + b_{Automatic}$$

$$Price_i = (b_0 + b_{Automatic}) + b_{Age}\,Age_i + b_{Mileage}\,Mileage_i$$

$$Price_i = b_0' + b_{Age}\,Age_i + b_{Mileage}\,Mileage_i, \qquad b_0' = b_0 + b_{Automatic}$$

Including a categorical variable amounts to a level shift (a different intercept) for each value of
the categorical variable.
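The level shift can be checked numerically. The sketch below fits a model with a 0/1 factor on simulated data (the variable names and the true shift of 750 are assumptions for illustration) and derives the two intercepts from the coefficients.

```r
# Simulated data: illustrative assumption
set.seed(8)
n         <- 400
age       <- runif(n, 1, 80)
automatic <- factor(rbinom(n, 1, 0.3))   # levels "0" (manual) and "1" (automatic)
price     <- 20000 - 155 * age + 750 * (automatic == "1") + rnorm(n, sd = 1000)
fit       <- lm(price ~ age + automatic)

b <- coef(fit)
intercept_manual    <- unname(b["(Intercept)"])
intercept_automatic <- unname(b["(Intercept)"] + b["automatic1"])

# Two parallel lines: same slope, different intercepts
c(manual = intercept_manual, automatic = intercept_automatic,
  slope = unname(b["age"]))
```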
Others
Qualitative Predictors

When the categorical variable has more than two levels, R automatically creates one 0/1
variable for each level except the first, following the factor level ordering; the first level acts as the reference.
Call:
lm(formula = Price ~ Age + Mileage + Automatic + Doors, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-7042.8  -965.4   -76.0   821.2 12503.7

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.011e+04  1.542e+02 130.423  < 2e-16 ***
Age         -1.532e+02  2.765e+00 -55.400  < 2e-16 ***
Mileage     -1.582e-02  1.382e-03 -11.446  < 2e-16 ***
Automatic1   7.933e+02  1.917e+02   4.138 3.70e-05 ***
Doors4       9.042e+00  1.563e+02   0.058    0.954
Doors5       4.980e+02  9.291e+01   5.361 9.68e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1642 on 1411 degrees of freedom
Multiple R-squared:  0.797,     Adjusted R-squared:  0.7962
F-statistic:  1108 on 5 and 1411 DF,  p-value: < 2.2e-16
Others
Qualitative Predictors

Now the model is:

$$Price_i = b_0 + b_{Age}\,Age_i + b_{Mileage}\,Mileage_i + b_{Automatic}\,Automatic_i + b_{Doors4}\,Doors4_i + b_{Doors5}\,Doors5_i$$

When car $i$ has three doors, then Doors4 = 0 and Doors5 = 0 and the model
becomes:

$$Price_i = b_0 + b_{Age}\,Age_i + b_{Mileage}\,Mileage_i + b_{Automatic}\,Automatic_i$$

When car $i$ has four doors, then Doors4 = 1 and Doors5 = 0 and the model becomes:

$$Price_i = (b_0 + b_{Doors4}) + b_{Age}\,Age_i + b_{Mileage}\,Mileage_i + b_{Automatic}\,Automatic_i$$

Similarly when the car has five doors.
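The 0/1 columns that R builds for a factor can be inspected with `model.matrix()`. The small sketch below uses a hypothetical five-car vector (not the course data) to show the default coding and the effect of choosing another reference level.

```r
# Hypothetical toy factor with the three door counts
doors <- factor(c(3, 4, 5, 3, 5))
levels(doors)            # "3" "4" "5": the first level is the reference
model.matrix(~ doors)    # columns: (Intercept), doors4, doors5

# Changing the reference level changes which dummy variables are created
doors_ref5 <- relevel(doors, ref = "5")
colnames(model.matrix(~ doors_ref5))
```

A three-door car has zeros in both dummy columns, so its price level is carried by the intercept alone, as in the equations above.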

Others
Interactions

Let’s suppose the relationship between Price and Age is the following:

Figure: Price vs. Age by transmission type

We would say that Transmission interacts with Age.

For "young" cars, being automatic is a plus. However, this plus diminishes as the car
ages. It even becomes a drawback for cars more than 5 years old!

Others
Interactions

> summary(lm(Price~Age*Automatic,data=inter))
Call:
lm(formula = Price ~ Age * Automatic, data = inter)

Residuals:
    Min      1Q  Median      3Q     Max
-2986.0 -1063.3  -119.5  1048.7  4603.0

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)    22592.387    283.090  79.807  < 2e-16 ***
Age             -207.198      6.534 -31.710  < 2e-16 ***
Automatic1      2234.083    414.819   5.386 2.05e-07 ***
Age:Automatic1   -37.067      9.371  -3.955 0.000107 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1507 on 196 degrees of freedom
Multiple R-squared:  0.9224,    Adjusted R-squared:  0.9212
F-statistic: 776.4 on 3 and 196 DF,  p-value: < 2.2e-16
Figure: Simulated interaction between Age and transmission type: Automatic
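With an interaction term, each transmission type gets its own intercept and its own Age slope. The arithmetic below uses the estimates printed in the summary output to obtain both slopes and the age at which the two fitted lines cross; the variable names are chosen here for illustration.

```r
# Estimates copied from the summary output of Price ~ Age * Automatic
b0     <- 22592.387   # intercept (manual cars)
b_age  <- -207.198    # Age slope for manual cars
b_auto <- 2234.083    # intercept shift for automatic cars
b_int  <- -37.067     # additional Age slope for automatic cars

slope_manual    <- b_age
slope_automatic <- b_age + b_int

# The automatic premium is b_auto + b_int * Age; it vanishes at this age (months)
crossover_age <- -b_auto / b_int

c(slope_manual = slope_manual, slope_automatic = slope_automatic,
  crossover_months = crossover_age)
```

The crossover falls at roughly 60 months, which matches the remark that being automatic becomes a drawback for cars older than about 5 years in this simulated example.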

Issues
Non-linearity

Going back to the data that gathers several cars features:

Id Marca Mod PrecioEur PotHP VelMaxKmh AccSeg ParMaxNm PesKg
1  Audi  A3      23430   122       203    9.3      200  1250
2  Audi  A4      40430   177       228    7.8      380  1640
3  Audi  A6      50840   204       240    7.2      400  1720
4  Audi  Q3      32370   150       200    8.9      250  1405
5  Audi  Q5      50620   177       200    9.0      380  1820
6  Audi  A4      32200   120       205   10.5      290  1560

Issues
Non-linearity

Let’s analyze the relationship between car maximum speed (km/h) and power (HP):

Figure: Car maximum speed (km/h) vs. power (HP)

Issues
Non-linearity

Is this the best model?

Issues
Non-linearity

Let’s have a second look at the relationship between car maximum speed (km/h) and
power (HP):

Figure: Car maximum speed (km/h) vs. power (HP)

Issues
Non-linearity

What about now?

Issues
Non-constant Variance: Heteroscedasticity

Area vs. Perimeter of 100 randomly generated triangles:

Figure: Area vs. Perimeter observations
Figure: Area vs. Perimeter linear model residuals

Issues
Non-constant Variance: Heteroscedasticity

As $Area \propto Length^2$, transforming $Area$ to $\sqrt{Area}$ partly removes the tendency to have
larger variance at larger values of the prediction:

Figure: $\sqrt{Area}$ vs. Perimeter observations
Figure: $\sqrt{Area}$ vs. Perimeter linear model residuals

Issues
Non-constant Variance: Heteroscedasticity

Transforming variables helps to:

address non-linearity.
correct heteroscedasticity.
normalize observations (e.g. logarithm transformations).

Issues
Multicollinearity

Remember, in Multiple Linear Regression, the significance of a parameter is related to
the information it brings to the response given the other model regressors.
When several regressors explain the same underlying behavior, multicollinearity arises.
Predicting an adult male's height from the length of his femur:
id height leftLeg rightLeg
1 178.54 80.96 80.90
2 167.76 74.17 73.63
3 180.74 89.45 88.97
4 171.22 84.03 84.58
5 166.41 79.68 79.79
6 161.92 65.42 65.70

Issues
Multicollinearity

Figure: Matrix scatterplot diagram

Issues
Multicollinearity

Call:
lm(formula = height ~ rightLeg, data = leg)

Residuals:
     Min       1Q   Median       3Q      Max
-15.7892  -5.1326  -0.1116   3.8258  23.2344

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  96.1225     9.4719  10.148  < 2e-16 ***
rightLeg      1.0120     0.1182   8.558 1.61e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.677 on 98 degrees of freedom
Multiple R-squared:  0.4277,    Adjusted R-squared:  0.4219
F-statistic: 73.24 on 1 and 98 DF,  p-value: 1.613e-13
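The effect of two nearly identical regressors can be illustrated on simulated data. The sketch below invents leg measurements (all numbers are assumptions, only loosely matching the scale of the table) and compares the standard error of `rightLeg` when it is used alone and when `leftLeg` is also in the model.

```r
# Simulated measurements: illustrative assumption, not the 'leg' data set
set.seed(11)
n        <- 100
leftLeg  <- rnorm(n, mean = 80, sd = 7)
rightLeg <- leftLeg + rnorm(n, sd = 0.3)          # nearly identical to leftLeg
height   <- 96 + 0.5 * leftLeg + 0.5 * rightLeg + rnorm(n, sd = 7)

m_right <- lm(height ~ rightLeg)
m_both  <- lm(height ~ leftLeg + rightLeg)

se_right <- coef(summary(m_right))["rightLeg", "Std. Error"]
se_both  <- coef(summary(m_both))["rightLeg", "Std. Error"]
# Variance inflation factor for two regressors: 1 / (1 - r^2)
vif      <- 1 / (1 - cor(leftLeg, rightLeg)^2)

c(se_single = se_right, se_joint = se_both, vif = vif)
```

Each leg is a useful predictor on its own, but once the other leg is known it adds almost no information, so its standard error is inflated and its t-test tends to lose significance even though the overall fit barely changes.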

The End
