Session 7: Linear Regression
Ignasi Puig
[email protected]
Ignasi Puig (Departament d’Estadı́stica i IO. UPC) Regression: Linear Models October 11th. , 2023 1 / 52
Outline
1 Problem Description
2 Simple Linear Regression
3 Multiple Linear Regression
4 Others
5 Issues
Problem Description
A group of recent graduates of the Engineering School wants to launch a website for used-car
sales. They want to develop an application that, given the characteristics of a car,
provides an estimated price range for it.
To test the feasibility of their idea, they have a data set of 1,417 sales of used Toyota Corollas.
Load data:
> data <- read.table(file = 'data/ToyotaCorolla.csv', header = TRUE,
                     sep = ',', dec = '.')
Clean up data
Remove the cars with 2 doors (2 cars) and those using gas fuel ('CNG').
Change the name of the KM variable to Mileage.
Convert columns 5, 6, 8 and 9 from numeric to factor.
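The clean-up steps above can be sketched as follows. This is a minimal illustration on a made-up five-row data frame, not the author's script: the column names are taken from the summary output of the deck (with the mileage column still called KM), the values are invented, and the factor columns are selected by name rather than by index, which is an assumption about what the indices refer to.

```r
# Toy stand-in for the raw file. Column names follow the summary shown in the
# slides; every value below is invented for illustration.
raw <- data.frame(
  Price     = c(13500, 9900, 8450, 11950, 7500),
  Age       = c(23, 61, 70, 44, 75),
  KM        = c(46986, 63000, 86221, 42800, 120000),
  FuelType  = c("Diesel", "Petrol", "Petrol", "CNG", "Petrol"),
  HP        = c(90, 110, 86, 110, 72),
  MetColor  = c(1, 0, 1, 1, 0),
  Automatic = c(0, 0, 1, 0, 0),
  Doors     = c(3, 5, 4, 5, 2)
)

# Drop the 2-door cars and the CNG cars
clean <- subset(raw, Doors != 2 & FuelType != "CNG")

# Rename KM to Mileage
names(clean)[names(clean) == "KM"] <- "Mileage"

# Convert the categorical columns to factors (chosen by name: an assumption)
factor_cols <- c("HP", "MetColor", "Automatic", "Doors")
clean[factor_cols] <- lapply(clean[factor_cols], factor)
clean$FuelType <- droplevels(factor(clean$FuelType))

str(clean)
```

Selecting columns by name is a little more robust than using positions, because it keeps working if the column order of the file changes.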
Data
> summary(data)
     Price            Age           Mileage
 Min.   : 4350   Min.   : 1.00   Min.   :     1
 1st Qu.: 8450   1st Qu.:44.00   1st Qu.: 42800
 Median : 9900   Median :61.00   Median : 63000
 Mean   :10750   Mean   :55.93   Mean   : 67883
 3rd Qu.:11950   3rd Qu.:70.00   3rd Qu.: 86221
 Max.   :32500   Max.   :80.00   Max.   :243000

   FuelType         HP        MetColor  Automatic  Doors
 Diesel: 154   110    :818   0:462     0:1338     3:616
 Petrol:1263   86     :248   1:955     1:  79     4:135
               97     :164                        5:666
               72     : 73
               90     : 36
               69     : 34
               (Other): 44
Questions to Answer
Regression Problems
Regression problem: given a set of observations $\{y_i, x_{i1}, x_{i2}, \dots, x_{ip}\}_{i=1,\dots,n}$, where $Y$ is a
quantitative random variable and $(X_1, X_2, \dots, X_p)$ is a set of quantitative and qualitative
variables observed for each individual, we want to build a function that, given the
features of a new individual $i$, predicts its value $y_i$.
Analyzing the relationship
> pairs(data[, 1:5], col = 'red')
Simple Linear Regression
Parameter Estimation
Let's assume that the relationship between price and car age is linear:
$$\text{Price} = \beta_0 + \beta_1\,\text{Age} + \epsilon$$
$$\epsilon \sim N(0, \sigma^2)$$
Figure: Price vs. Age. Sample of second-hand Toyota Corolla sales.
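As a rough illustration of how the two parameters of this model are estimated, the sketch below fits the line with `lm()` and compares it with the closed-form least squares estimates, $b_1 = \operatorname{cov}(x, y)/\operatorname{var}(x)$ and $b_0 = \bar{y} - b_1 \bar{x}$. The data are simulated, and the true coefficients, the noise level and the name `mod1` are arbitrary choices made only for this example.

```r
set.seed(1)
n <- 200
Age <- runif(n, 1, 80)                            # age in months (simulated)
Price <- 20000 - 170 * Age + rnorm(n, sd = 1500)  # hypothetical true line plus noise
sim <- data.frame(Price, Age)

# Least squares fit of Price = b0 + b1 * Age
mod1 <- lm(Price ~ Age, data = sim)

# Closed-form least squares estimates for the simple regression case
b1 <- cov(sim$Age, sim$Price) / var(sim$Age)
b0 <- mean(sim$Price) - b1 * mean(sim$Age)

coef(mod1)
c(b0 = b0, b1 = b1)
```

Both approaches return the same numbers; `lm()` simply generalizes the computation to several regressors.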
Parameter Properties
Model Fit
The overall fit of a linear regression (clearer in the Multiple Linear Regression case to be
seen later!) can be assessed with:
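1 Residual Standard Error (RSE):
$$\hat{\sigma} = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
It measures the average amount by which a response deviates from the true regression line.
2 Coefficient of Determination ($R^2$):
$$R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}$$
where $TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares and $RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is the
residual sum of squares. $R^2$ can be understood as the proportion of the total
variability explained by the regression.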
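Both measures can be computed directly from the residuals. The following sketch, on simulated data with arbitrary coefficients, evaluates the two formulas by hand and places them next to the values reported by `summary()`.

```r
set.seed(2)
n <- 150
x <- runif(n, 0, 10)
y <- 3 + 2 * x + rnorm(n, sd = 1.5)   # simulated response
fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)          # residual sum of squares
tss <- sum((y - mean(y))^2)           # total sum of squares

rse <- sqrt(rss / (n - 2))            # residual standard error
r2  <- 1 - rss / tss                  # coefficient of determination

c(rse = rse, from_summary = summary(fit)$sigma)
c(r2 = r2, from_summary = summary(fit)$r.squared)
```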
Model assumptions validation
Multiple Linear Regression
Model
In Multiple Linear Regression one assumes that the response $Y$ relates to more than one
regressor. The model with $p$ regressors ($p+1$ parameters) becomes
$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_p X_{ip} + \epsilon_i$$
$$\hat{Y}_i = b_0 + b_1 X_{i1} + b_2 X_{i2} + \dots + b_p X_{ip}$$
where the $b_j$'s are chosen as those that minimize the residual sum of squares (RSS):
$$RSS = \sum_{i=1}^{n} \left( y_i - (b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_p x_{ip}) \right)^2$$
In Multiple Linear Regression one fits the best hyperplane, in the Least Squares sense,
to the observations.
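One standard way to obtain the least squares hyperplane, not spelled out on the slide, is to solve the normal equations $(X^\top X)\,b = X^\top y$, where $X$ is the design matrix with a leading column of ones. The sketch below does this on simulated data (coefficients chosen arbitrarily) and checks that it agrees with `lm()`.

```r
set.seed(3)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)      # simulated response

X <- cbind(1, x1, x2)                      # design matrix with intercept column
b <- solve(crossprod(X), crossprod(X, y))  # solves (X'X) b = X'y

fit <- lm(y ~ x1 + x2)
cbind(normal_equations = drop(b), lm = coef(fit))
```

In practice `lm()` uses a QR decomposition rather than the normal equations, which is numerically more stable, but the resulting estimates are the same.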
Parameter Estimates
Parameters Relevance
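$H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0$
$H_1$: at least one $\beta_j \neq 0$
Test statistic:
$$F = \frac{(TSS - RSS)/p}{RSS/(n - (p + 1))}$$
which, under $H_0$ and when the model assumptions hold, follows an $F_{p,\,n-(p+1)}$ distribution.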
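The test statistic can be reproduced from the two sums of squares. The sketch below uses simulated data with $p = 3$ regressors (one of them with a true coefficient of zero) and compares the hand-computed $F$ with the one stored by `summary()`; the p-value is the upper tail of the $F_{p,\,n-(p+1)}$ distribution.

```r
set.seed(4)
n <- 120
p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- drop(1 + X %*% c(1.5, 0, -1) + rnorm(n))  # simulated response
d <- data.frame(y, X)                          # columns: y, X1, X2, X3
fit <- lm(y ~ ., data = d)

rss <- sum(residuals(fit)^2)
tss <- sum((y - mean(y))^2)

F_manual <- ((tss - rss) / p) / (rss / (n - (p + 1)))
p_value  <- pf(F_manual, df1 = p, df2 = n - (p + 1), lower.tail = FALSE)

c(manual = F_manual, lm = unname(summary(fit)$fstatistic["value"]))
p_value
```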
Model Fit
$$R^2 = 1 - \frac{RSS}{TSS}$$
$$R^2_{adj} = 1 - \frac{RSS/(n - (p + 1))}{TSS/(n - 1)} = 1 - \frac{\hat{\sigma}^2}{s^2(y)}$$
$R^2_{adj}$ penalizes the introduction of new variables in the model.
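To see the penalty at work, the sketch below adds a regressor that is pure noise to a simulated model. $R^2$ can never decrease when a variable is added, whereas $R^2_{adj}$ only increases if the new variable reduces the RSS enough to compensate for the lost degree of freedom. The helper `adj_r2()` is a hypothetical function written for this example; it just evaluates the formula above.

```r
set.seed(5)
n <- 80
x <- rnorm(n)
noise <- rnorm(n)                 # unrelated to y by construction
y <- 2 + x + rnorm(n)

fit_small <- lm(y ~ x)
fit_big   <- lm(y ~ x + noise)

# Adjusted R^2 computed from its definition
adj_r2 <- function(fit) {
  n <- length(residuals(fit))
  p <- length(coef(fit)) - 1                 # number of regressors
  rss <- sum(residuals(fit)^2)
  y_obs <- fitted(fit) + residuals(fit)      # recover the observed response
  tss <- sum((y_obs - mean(y_obs))^2)
  1 - (rss / (n - (p + 1))) / (tss / (n - 1))
}

rbind(
  small = c(R2 = summary(fit_small)$r.squared, adjR2 = adj_r2(fit_small)),
  big   = c(R2 = summary(fit_big)$r.squared,   adjR2 = adj_r2(fit_big))
)
```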
Parameter Meaning
$\beta_j$ is the change in $Y$ associated with a one-unit increase in $X_j$, keeping all other regressors constant.
Build a model using the number of small coins, and another one using both the number
of small coins and the total number of coins.
Compare them: write the model formulas and explain the change in the sign of the coefficients.
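A possible way to explore this exercise is sketched below on simulated data. The response is not stated on the slide, so the sketch assumes, purely for illustration, that it is the total amount of money in a purse in which small coins are worth 0.10 and large coins 1.00; every variable name and number here is hypothetical.

```r
set.seed(6)
n <- 300
total <- sample(5:40, n, replace = TRUE)      # total number of coins
small <- rbinom(n, size = total, prob = 0.5)  # how many of them are small
large <- total - small
# Assumed response: amount of money, with a little measurement noise
amount <- 0.10 * small + 1.00 * large + rnorm(n, sd = 0.5)

m_small <- lm(amount ~ small)          # small coins only
m_both  <- lm(amount ~ small + total)  # small coins and total number of coins

coef(m_small)["small"]  # positive: more small coins tends to mean more coins overall
coef(m_both)["small"]   # negative: for a fixed total, a small coin replaces a large one
```

The sign flips because the meaning of the coefficient changes: in the second model it measures the effect of one more small coin while the total number of coins is kept constant.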
A significant $\beta_j$ (large t-statistic and small p-value) means that regressor $X_j$ brings relevant
information to explain/predict $Y$ once all other regressors included in the model are
taken into account. The t-test measures the predictive value of $X_j$ when all other $X_k$, $k \neq j$,
are known.
Build a model using PotHP and AccSeg as predictors, and another one that also includes
VelMaxKmh.
Compare them: can you explain the change in the AccSeg p-value between the two
models?
Model Prediction
Given the regressor values of a new observation: what should our prediction be? How
much uncertainty does it have?
Assuming our linear model is correct, there are two sources of uncertainty:
1 The uncertainty in our estimates of the $\beta_j$'s (reducible error).
2 The uncertainty in the model itself: its error $\epsilon$ (irreducible error).
Two types of predictions are possible:
The expected value (confidence interval): the interval for the average
value of $Y$ given the regressors $\{X_j = x_j\}_{j=1,\dots,p}$. It accounts only for the first source.
The observed value (prediction interval): the interval for the value
of an individual, $Y_i$, given its regressors $\{X_j = x_{ij}\}_{j=1,\dots,p}$. It accounts for both sources, so it is always wider.
What is the average price of a 5-year-old car (60 months) with 50,000 km?
> round(predict(mod2, newdata = data.frame(Age = 60, Mileage = 50000),
                interval = 'confidence'), 1)
    fit     lwr     upr
  10408 10300.7 10515.4
What is the price range for a 5-year-old car with 50,000 km?
> round(predict(mod2, newdata = data.frame(Age = 60, Mileage = 50000),
                interval = 'prediction'), 1)
    fit    lwr     upr
  10408 7135.8 13680.2
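The relation between the two intervals can be checked numerically. The sketch below refits a model of the same form on simulated data (the coefficients and noise level are arbitrary, and the object names are hypothetical) and verifies that the prediction interval adds the irreducible error on top of the confidence interval: the squared half-width of the prediction interval equals the squared half-width of the confidence interval plus $(t \cdot \hat{\sigma})^2$.

```r
set.seed(7)
n <- 300
Age <- runif(n, 1, 80)
Mileage <- runif(n, 1000, 200000)
Price <- 20000 - 150 * Age - 0.015 * Mileage + rnorm(n, sd = 1600)  # simulated
sim <- data.frame(Price, Age, Mileage)
fit <- lm(Price ~ Age + Mileage, data = sim)

new_car <- data.frame(Age = 60, Mileage = 50000)
conf_int <- predict(fit, newdata = new_car, interval = "confidence")
pred_int <- predict(fit, newdata = new_car, interval = "prediction")

sigma_hat <- summary(fit)$sigma
t_crit <- qt(0.975, df = df.residual(fit))  # default level is 95%

half_conf <- (conf_int[, "upr"] - conf_int[, "lwr"]) / 2
half_pred <- (pred_int[, "upr"] - pred_int[, "lwr"]) / 2

# Both intervals share the same centre; only the width differs
rbind(confidence = conf_int[1, ], prediction = pred_int[1, ])
c(half_pred_sq = unname(half_pred^2),
  reconstructed = unname(half_conf^2 + (t_crit * sigma_hat)^2))
```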
Model Assumptions
Caution! The model can only be used if the model assumptions are met.
Assumptions:
$$Y_i = \beta_0 + \sum_{j=1}^{p} \beta_j X_{ij} + \epsilon_i$$
$$\epsilon_i \sim N(0, \sigma^2)$$
Others
Qualitative Predictors
When the car is manual, the variable Automatic1 takes the value 0 and the model is:
When the car is automatic, the variable Automatic1 takes the value 1 and the model is:
Including a categorical variable amounts to a level shift (a different intercept) for each value of
the categorical variable.
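The level shift can be seen in a small simulated example. The sketch below assumes a model with Age and a two-level factor Automatic (levels "0" for manual and "1" for automatic); the coefficients are invented. The coefficient that R labels `Automatic1` is exactly the difference between the two intercepts, while the slope for Age is shared.

```r
set.seed(8)
n <- 200
Age <- runif(n, 1, 80)
Automatic <- factor(rbinom(n, 1, 0.3))   # "0" = manual, "1" = automatic
Price <- 20000 - 150 * Age + 800 * (Automatic == "1") + rnorm(n, sd = 1000)
sim <- data.frame(Price, Age, Automatic)

fit <- lm(Price ~ Age + Automatic, data = sim)
b <- coef(fit)

intercept_manual    <- unname(b["(Intercept)"])
intercept_automatic <- unname(b["(Intercept)"] + b["Automatic1"])

# Two cars of the same age that differ only in transmission type
same_age <- data.frame(Age = c(50, 50),
                       Automatic = factor(c("0", "1"), levels = c("0", "1")))
pred <- predict(fit, newdata = same_age)

c(manual = intercept_manual, automatic = intercept_automatic)
diff(pred)   # equals the Automatic1 coefficient
```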
When the categorical variable has more than two levels, R automatically creates one dummy
variable per level except the first one in the factor ordering, which acts as the reference level.
Call:
lm(formula = Price ~ Age + Mileage + Automatic + Doors, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-7042.8  -965.4   -76.0   821.2 12503.7

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.011e+04  1.542e+02 130.423  < 2e-16 ***
Age         -1.532e+02  2.765e+00 -55.400  < 2e-16 ***
Mileage     -1.582e-02  1.382e-03 -11.446  < 2e-16 ***
Automatic1   7.933e+02  1.917e+02   4.138 3.70e-05 ***
Doors4       9.042e+00  1.563e+02   0.058    0.954
Doors5       4.980e+02  9.291e+01   5.361 9.68e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1642 on 1411 degrees of freedom
Multiple R-squared: 0.797, Adjusted R-squared: 0.7962
F-statistic: 1108 on 5 and 1411 DF, p-value: < 2.2e-16
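$$\text{Price}_i = b_0 + b_{\text{Age}}\,\text{Age}_i + b_{\text{Mileage}}\,\text{Mileage}_i + b_{\text{Automatic}}\,\text{Automatic}_i + b_{\text{Doors4}}\,\text{Doors4}_i + b_{\text{Doors5}}\,\text{Doors5}_i$$
When car $i$ has three doors, Doors4 = 0 and Doors5 = 0, and the model
becomes:
$$\text{Price}_i = b_0 + b_{\text{Age}}\,\text{Age}_i + b_{\text{Mileage}}\,\text{Mileage}_i + b_{\text{Automatic}}\,\text{Automatic}_i$$
When car $i$ has four doors, Doors4 = 1 and Doors5 = 0, and the model becomes:
$$\text{Price}_i = (b_0 + b_{\text{Doors4}}) + b_{\text{Age}}\,\text{Age}_i + b_{\text{Mileage}}\,\text{Mileage}_i + b_{\text{Automatic}}\,\text{Automatic}_i$$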
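The dummy variables that R builds for a factor can be inspected with `model.matrix()`. The sketch below uses a made-up five-car vector of door counts; with the default coding, the first level (3 doors) is the reference and gets no column of its own. `relevel()` is shown as one way to choose a different reference level.

```r
# Five hypothetical cars with 3, 4 or 5 doors
doors <- data.frame(Doors = factor(c(3, 4, 5, 3, 5)))

# One column per non-reference level: Doors4 and Doors5
mm <- model.matrix(~ Doors, data = doors)
mm

# Making 5 doors the reference level instead of 3
doors_ref5 <- relevel(doors$Doors, ref = "5")
levels(doors_ref5)
```

A three-door car has zeros in both dummy columns, so its price is described by the intercept alone, in line with the first reduced model above.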
Interactions
Let’s suppose the relationship between Price and Age is the following:
> summary(lm(Price~Age*Automatic,data=inter))
Call:
lm(formula = Price ~ Age * Automatic, data = inter)

Residuals:
    Min      1Q  Median      3Q     Max
-2986.0 -1063.3  -119.5  1048.7  4603.0

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)    22592.387    283.090  79.807  < 2e-16 ***
Age             -207.198      6.534 -31.710  < 2e-16 ***
Automatic1      2234.083    414.819   5.386 2.05e-07 ***
Age:Automatic1   -37.067      9.371  -3.955 0.000107 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1507 on 196 degrees of freedom
Multiple R-squared: 0.9224, Adjusted R-squared: 0.9212
F-statistic: 776.4 on 3 and 196 DF, p-value: < 2.2e-16
Figure: Simulated interaction between Age and transmission type: Automatic
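With an interaction term, each transmission type gets its own intercept and its own slope. In the output above, manual cars have slope -207.198, while automatic cars have slope -207.198 - 37.067 = -244.265 and intercept 22592.387 + 2234.083 = 24826.470. The sketch below reproduces this reading on freshly simulated data; the data frame `inter_sim` and its coefficients are invented for the example and are not the author's `inter` data set.

```r
set.seed(9)
n <- 200
Age <- runif(n, 1, 80)
Automatic <- factor(rep(c(0, 1), each = n / 2))   # "0" = manual, "1" = automatic
slope <- ifelse(Automatic == "1", -245, -205)     # hypothetical true slopes
Price <- 22500 + 2200 * (Automatic == "1") + slope * Age + rnorm(n, sd = 1500)
inter_sim <- data.frame(Price, Age, Automatic)

fit <- lm(Price ~ Age * Automatic, data = inter_sim)
b <- coef(fit)

slope_manual    <- unname(b["Age"])
slope_automatic <- unname(b["Age"] + b["Age:Automatic1"])

# The same slope is obtained by fitting the automatic cars on their own
fit_auto <- lm(Price ~ Age, data = subset(inter_sim, Automatic == "1"))

c(manual = slope_manual, automatic = slope_automatic,
  automatic_only_fit = unname(coef(fit_auto)["Age"]))
```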
Issues
Non-linearity
Let’s analyze the relationship between car maximum speed (km/h) and power (HP):
Let's take a second look at the relationship between car maximum speed (km/h) and
power (HP):
Non-constant Variance: Heteroscedasticity
Figure: Area vs. Perimeter observations.
Figure: Area vs. Perimeter linear model residuals.
As $\text{Area} \propto \text{Length}^2$, transforming Area to $\sqrt{\text{Area}}$ removes, to some extent, the tendency to have
larger variance at larger values of the prediction:
Figure: $\sqrt{\text{Area}}$ vs. Perimeter observations.
Figure: $\sqrt{\text{Area}}$ vs. Perimeter linear model residuals.
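The effect of the square-root transformation can be illustrated with a simulation. The sketch below assumes a specific mechanism that is not in the slides: $\sqrt{\text{Area}}$ is linear in Perimeter with constant-variance noise, so that Area itself has a spread that grows with Perimeter. The helper `spread_ratio()` is a hypothetical, rough diagnostic that compares the residual standard deviation in the upper half of the fitted values with that in the lower half; a ratio close to 1 suggests constant variance.

```r
set.seed(10)
n <- 1000
Perimeter <- runif(n, 10, 50)
# Assumed mechanism: sqrt(Area) is linear in Perimeter, noise has constant variance
sqrt_area <- 0.25 * Perimeter + rnorm(n, sd = 0.5)
Area <- sqrt_area^2

fit_raw  <- lm(Area ~ Perimeter)        # response on the original scale
fit_sqrt <- lm(sqrt(Area) ~ Perimeter)  # response on the square-root scale

# Ratio of residual spread: upper half vs. lower half of the fitted values
spread_ratio <- function(fit) {
  upper <- fitted(fit) > median(fitted(fit))
  sd(residuals(fit)[upper]) / sd(residuals(fit)[!upper])
}

c(raw = spread_ratio(fit_raw), sqrt = spread_ratio(fit_sqrt))
```

A plot of the residuals against the fitted values, as in the figures, conveys the same information visually: a funnel shape for the raw model and a roughly even band after the transformation.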
Multicollinearity
Call:
lm(formula = height ~ rightLeg, data = leg)

Residuals:
     Min       1Q   Median       3Q      Max
-15.7892  -5.1326  -0.1116   3.8258  23.2344

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  96.1225     9.4719  10.148  < 2e-16 ***
rightLeg      1.0120     0.1182   8.558 1.61e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.677 on 98 degrees of freedom
Multiple R-squared: 0.4277, Adjusted R-squared: 0.4219
F-statistic: 73.24 on 1 and 98 DF, p-value: 1.613e-13
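The typical symptom of multicollinearity can be reproduced with simulated leg data. The sketch below assumes a second, hypothetical regressor `leftLeg` that is almost identical to `rightLeg`; all numbers are invented. When both legs enter the model, the standard error of the `rightLeg` coefficient inflates sharply, even though the overall fit barely changes. The variance inflation factor (VIF), a common diagnostic that the slides do not mention, quantifies this inflation.

```r
set.seed(11)
n <- 100
rightLeg <- rnorm(n, mean = 80, sd = 6)
leftLeg  <- rightLeg + rnorm(n, sd = 0.3)   # nearly a copy of rightLeg
height   <- 96 + 1.0 * rightLeg + rnorm(n, sd = 7.5)
leg_sim  <- data.frame(height, rightLeg, leftLeg)

fit_one <- lm(height ~ rightLeg, data = leg_sim)
fit_two <- lm(height ~ rightLeg + leftLeg, data = leg_sim)

se_one <- summary(fit_one)$coefficients["rightLeg", "Std. Error"]
se_two <- summary(fit_two)$coefficients["rightLeg", "Std. Error"]

# VIF of rightLeg: 1 / (1 - R^2) from regressing it on the other regressor
vif_right <- 1 / (1 - summary(lm(rightLeg ~ leftLeg, data = leg_sim))$r.squared)

c(se_one = se_one, se_two = se_two, vif = vif_right)
c(r2_one = summary(fit_one)$r.squared, r2_two = summary(fit_two)$r.squared)
```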
The End