

FIT2086 Lecture 6
Linear Regression

Daniel F. Schmidt

Faculty of Information Technology, Monash University

September 4, 2017

Outline

1 Linear Regression Models
    Supervised Learning
    Linear Regression Models

2 Model Selection for Linear Regression
    Under and Overfitting
    Model Selection Methods

Revision from last week

Hypothesis testing: test a null hypothesis against an alternative

    H0 : null hypothesis
    vs
    HA : alternative hypothesis

A test statistic measures how different our observed sample is from the null hypothesis
A p-value quantifies the evidence against the null hypothesis: it is the probability of seeing a sample that results in a test statistic as extreme as, or more extreme than, the one we observed, just by chance, if the null hypothesis were true.


Supervised Learning (1)

Over the last three weeks we have looked at parameter inference:

In week 3 we examined point estimation using maximum likelihood
    Selecting our "best guess" at a single value of the parameter
In week 4 we examined interval estimation using confidence intervals
    Giving a range of plausible values for the unknown population parameter
In week 5 we examined hypothesis testing
    Quantifying statistical evidence against a given hypothesis

Supervised Learning (2)

Now we will start to see how these tools can be used to build
more complex models
Over the next three weeks we will look at supervised learning
In particular, we will look at linear regression
But first, what is supervised learning?

Supervised Learning (3)

Imagine we have measured p + 1 variables on n individuals (people, objects, things)
We would like to predict one of the variables using the remaining p variables

If the variable we are predicting is categorical, we are performing classification
    Example: predicting if someone has diabetes from medical measurements.
If the variable we are predicting is numerical, we are performing regression
    Example: predicting the quality of a wine from chemical and seasonal information.

Supervised Learning (4)


The variable we are predicting is designated the "y" variable
    We have (y1 , . . . , yn)
    This variable is often called the target, response, or outcome.

The other variables are usually designated "X" variables
    We have (xi,1 , . . . , xi,p) for each individual i = 1, . . . , n
    These variables are often called explanatory variables, predictors, covariates, or exposures.

Usually we assume the targets are random variables and the predictors are known without error

Supervised Learning (5)

Supervised learning: find a relationship between the targets yi and the associated predictors xi,1 , . . . , xi,p.
That is, learn a function f (·) such that

    yi = f (xi,1 , . . . , xi,p)

There is usually error in measuring yi, so no f (·) fits perfectly
    ⇒ we model yi as a realisation of the random variable Yi
So instead, we find an f (·) whose outputs are "close" to y1 , . . . , yn

It is "supervised" because we have examples to learn from
The supervised learning model depends on the form of f (·)

Linear Regression

Linear regression is a special type of supervised learning
    In this case, we take the function f (·) that relates the predictors to the target to be linear

One of the most important models in statistics
    The resulting model is highly interpretable
    It is very flexible and can even handle nonlinear relationships
    It is computationally efficient to fit, even for very large p
    It is an enormous area of research and work
    ⇒ we will get acquainted with the basics

Simple Linear Regression (1)

Consider the dataset we examined in Studio 5: blood pressure and several other measurements on 20 patients
Imagine we want to model blood pressure



Simple Linear Regression (2)


[Figure: blood pressure (mmHg) plotted against patient ID (1 to 20)]

Simple Linear Regression (3)

Our blood pressure variable BP1 , . . . , BP20 is continuous
    ⇒ we choose to model it using a normal distribution
The maximum likelihood estimate of the mean µ is

    µ̂ = (1/20) Σ_{i=1}^{20} yi = 114

which is equivalent to the sample mean

We have a new person from the population this sample was drawn from, and we want to predict their blood pressure
Using our simple model, our best guess of this person's blood pressure is 114, i.e., the estimated mean µ̂

Simple Linear Regression (4)


[Figure: prediction of BP using the mean, a horizontal line at 114 mmHg, plotted against patient ID]

Simple Linear Regression (5)

How good is our model at predicting?
    One way we could measure this is through prediction error
    We don't know future data, but we can see how well the model predicts the data we already have
Let ŷi denote the prediction of sample yi using a model; then

    ei = yi − ŷi

are the errors between our model predictions ŷi and the observed data yi
    ⇒ often called residual errors, or just residuals
A good fit would lead to overall small errors

Simple Linear Regression (6)


[Figure: prediction of BP using the mean, with the error/residual for each patient shown]

Simple Linear Regression (7)

We can summarise the total error of fit of our model by

    RSS = Σ_{i=1}^{n} ei²

which is called the residual sum-of-squared errors.

For our simple mean model, RSS = 560 (see the sketch below)
Can we do better (smaller error) if we use one of the other measured variables to help predict blood pressure?
For example, if we took a person's weight into account, could we build a better predictor of their blood pressure?
To get an idea if there is scope for improvement we can plot blood pressure vs weight
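As a quick numerical sketch of the mean-model RSS calculation (the blood pressure values below are made-up placeholders rather than the Studio 5 data, so the printed numbers will not match 114 or 560):

    import numpy as np

    # Hypothetical blood pressure readings (mmHg) for 20 patients;
    # placeholders standing in for the Studio 5 data.
    bp = np.array([110, 112, 118, 120, 105, 114, 121, 109, 113, 117,
                   108, 122, 116, 111, 119, 107, 115, 124, 110, 112], dtype=float)

    mu_hat = bp.mean()            # ML estimate of the mean (the "mean model")
    residuals = bp - mu_hat       # e_i = y_i - yhat_i, with yhat_i = mu_hat for every i
    rss = np.sum(residuals ** 2)  # residual sum-of-squared errors

    print(f"mu_hat = {mu_hat:.2f}, RSS = {rss:.2f}")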

Simple Linear Regression (8)


Blood pressure vs weight – BP appears to increase with weight

[Figure: scatter plot of blood pressure (mmHg) against weight (kg)]

Simple Linear Regression (9)


Our simple mean model is clearly not a good fit

[Figure: BP vs weight with the mean model overlaid; the flat line is clearly a poor fit]

Simple Linear Regression (10)

Our simple mean model predicts blood pressure by

    E [BPi ] = µ

irrespective of any other data on individual i

Let (Weight1 , . . . , Weight20) be the weights of our 20 individuals
We can let the mean vary as a linear function of weight, i.e.,

    E [BPi | Weighti ] = β0 + β1 Weighti

This says that the conditional mean of blood pressure BPi for individual i, given the individual's weight Weighti , is equal to β0 plus β1 times the weight Weighti
Note our simple mean model is a linear model with β1 = 0

Simple Linear Regression (11)


The fitted linear model: E [BPi | Weighti ] = 2.2053 + 1.2009 Weighti
[Figure: BP vs weight with the fitted regression line overlaid]

Simple Linear Regression (12)


Residuals: ei = BPi − 2.2053 − 1.2009 Weighti (RSS = 54.52)
[Figure: BP vs weight with the fitted line and each patient's residual shown]

Simple Linear Regression (13) – Key Slide

A linear model of the form

    E [Yi | xi ] = ŷi = β0 + β1 xi

is called a simple linear regression.

It has two free regression parameters:
    β0 is the intercept; it is the predicted value ŷi when the predictor xi = 0
    β1 is a regression coefficient; it is the amount the predicted value ŷi changes by for a one unit change in the predictor xi

Simple Linear Regression (14)

In our example yi is blood pressure and xi is weight;

    ŷi = 2.2053 + 1.2009 xi

so
    For every additional kilogram a person weighs, their predicted blood pressure increases by 1.2009 mmHg
    For a person who weighs zero kilograms, the predicted blood pressure is 2.2053 mmHg
    The predictions might not make sense outside of sensible ranges of the predictors!

Fitting Simple Linear Regressions (1)

How did we arrive at β̂0 = 2.2053 and β̂1 = 1.2009 in our blood pressure vs weight example?
We measure the fit of a model by its RSS:

    RSS = Σ_{i=1}^{n} (yi − β0 − β1 xi)²
        = Σ_{i=1}^{n} (yi − ŷi)²
        = Σ_{i=1}^{n} ei²

Smaller error = better fit

Fitting Simple Linear Regressions (2)

So the least-squares principle says we choose (estimate) β0 , β1 to minimise the RSS
Formally,

    (β̂0 , β̂1) = arg min over (β0 , β1) of Σ_{i=1}^{n} (yi − β0 − β1 xi)²

These are often called least-squares (LS) estimates.
    There are alternative measures of error; for example, the least sum of absolute errors.
    Least squares is popular due to its simplicity, computational efficiency and connections to normal models

Fitting Simple Linear Regressions (3)

The RSS is a function of β0 and β1 , i.e.,

    RSS(β0 , β1) = Σ_{i=1}^{n} (yi − β0 − β1 xi)²

The least-squares estimates are the solutions to the equations

    ∂RSS(β0 , β1)/∂β0 = −2 Σ_{i=1}^{n} (yi − β0 − β1 xi) = 0
    ∂RSS(β0 , β1)/∂β1 = −2 Σ_{i=1}^{n} xi (yi − β0 − β1 xi) = 0

where we use the chain rule.

Fitting Simple Linear Regressions (4)

The solution for β0 is

    β̂0 = [ (Σ_{i=1}^{n} yi)(Σ_{i=1}^{n} xi²) − (Σ_{i=1}^{n} yi xi)(Σ_{i=1}^{n} xi) ] / [ n Σ_{i=1}^{n} xi² − (Σ_{i=1}^{n} xi)² ]

and the solution for β1 is

    β̂1 = [ Σ_{i=1}^{n} yi xi − β̂0 Σ_{i=1}^{n} xi ] / Σ_{i=1}^{n} xi²

Fitting Simple Linear Regressions (5)

Given LS estimates β̂0 , β̂1 we can find the predictions for our data,

    ŷi = β̂0 + β̂1 xi

and the residuals

    ei = yi − ŷi

The vector of residuals e = (e1 , . . . , en) has the properties

    Σ_{i=1}^{n} ei = 0   and   corr(x, e) = 0

where x = (x1 , . . . , xn) is our predictor variable.
This means least-squares fits a line such that the mean of the resulting residuals is zero, and the residuals are uncorrelated with the predictor. A quick numerical check of these properties is sketched below.
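The following is a minimal numpy sketch of the closed-form least-squares estimates from the previous slide, checked against numpy's own polyfit, together with the two residual properties above; the data is synthetic (generated from an assumed linear relationship), not the blood pressure dataset:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 20
    x = rng.uniform(86, 102, size=n)                # synthetic "weights"
    y = 2.2 + 1.2 * x + rng.normal(0, 1.5, size=n)  # synthetic "blood pressures"

    # Closed-form least-squares estimates from the slide above
    sx, sy = x.sum(), y.sum()
    sxx, sxy = (x ** 2).sum(), (x * y).sum()
    b0 = (sy * sxx - sxy * sx) / (n * sxx - sx ** 2)
    b1 = (sxy - b0 * sx) / sxx

    # Same fit via numpy's built-in polynomial least squares
    b1_np, b0_np = np.polyfit(x, y, deg=1)
    assert np.allclose([b0, b1], [b0_np, b1_np])

    # Residual properties: residuals sum to zero and are uncorrelated with x
    e = y - (b0 + b1 * x)
    print("sum of residuals:", e.sum())            # ~0 up to floating point error
    print("corr(x, e):", np.corrcoef(x, e)[0, 1])  # ~0 up to floating point error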

Multiple Linear Regression (1) – Key Slide

We have used one explanatory variable in our linear model
A great strength of linear models is that they easily handle multiple variables
Let xi,j denote variable j for individual i, where j = 1, . . . , p; i.e., we have p explanatory variables. Then

    E [yi | xi,1 , . . . , xi,p ] = β0 + Σ_{j=1}^{p} βj xi,j

The intercept is now the expected value of the target when xi,1 = xi,2 = · · · = xi,p = 0
The coefficient βj is the increase in the expected value of the target per unit change in explanatory variable j

Multiple Linear Regression (2) – Key Slide

We fit a multiple linear regression using least-squares
    ⇒ assume p < n, otherwise the solution is non-unique
Given coefficients β0 , β1 , . . . , βp the RSS is

    RSS(β0 , β1 , . . . , βp) = Σ_{i=1}^{n} ( yi − β0 − Σ_{j=1}^{p} βj xi,j )²

Now we have to solve

    (β̂0 , β̂1 , . . . , β̂p) = arg min over (β0 , β1 , . . . , βp) of RSS(β0 , β1 , . . . , βp)

Efficient algorithms exist to find these estimates

Multiple Linear Regression (3)

Matrix algebra can simplify the linear regression equations
    We have a vector of targets y = (y1 , . . . , yn)
    We have a vector of coefficients β = (β1 , . . . , βp)
    We can treat each variable as a vector xj = (x1,j , . . . , xn,j)
Arrange these vectors as the columns of a matrix X of predictors:

    X = (x1 , x2 , . . . , xp) =
        [ x1,1  x1,2  ...  x1,p ]
        [ x2,1  x2,2  ...  x2,p ]
        [  ...   ...        ... ]
        [ xn,1  xn,2  ...  xn,p ]

We call this the design matrix
    ⇒ it has p columns (predictors) and n rows (individuals)

Multiple Linear Regression (4) – Key Slide

We can form our predictions and residuals using

    ŷ = Xβ + β0 1n   and   e = y − ŷ

where 1n is a vector of n ones.
We can then write our RSS very compactly as

    RSS(β0 , β) = e′e

If β̂0 , β̂ are the least-squares estimates, then

    corr(xj , e) = 0 for all j

That is, least-squares finds the plane such that the residuals (errors) are uncorrelated with all predictors in the model; a small least-squares sketch using this matrix form is given below.
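A small sketch of this matrix form, assuming synthetic data and using numpy's general least-squares routine on a design matrix augmented with a column of ones for the intercept:

    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 50, 3
    X = rng.normal(size=(n, p))                      # design matrix: n rows, p predictor columns
    beta_true = np.array([1.5, -2.0, 0.5])
    y = 4.0 + X @ beta_true + rng.normal(0, 0.3, n)  # targets with intercept 4.0 and noise

    # Augment with a column of ones so the intercept is estimated jointly
    X1 = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    b0_hat, beta_hat = coef[0], coef[1:]

    yhat = X1 @ coef
    e = y - yhat
    rss = e @ e                                      # RSS(beta0, beta) = e'e

    # Residuals are uncorrelated with every predictor in the model
    for j in range(p):
        print(f"corr(x_{j+1}, e) = {np.corrcoef(X[:, j], e)[0, 1]:.2e}")
    print("RSS =", rss)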

R-squared (R²) (1)

The residual sum-of-squares tells us how well we fit the data
    But the scale is arbitrary: what does an RSS of 2,352 mean?
    Instead, we define the RSS relative to some reference point
We use the total sum-of-squares as the reference:

    TSS = Σ_{i=1}^{n} (yi − ȳ)²

which is the residual sum-of-squares obtained by fitting the intercept only (the "mean model")

R-squared (R²) (2) – Key Slide

The R² value is then defined as

    R² = 1 − RSS/TSS

which is also called the coefficient-of-determination
    R² lies between 0 (model has no explanatory power) and 1 (model completely explains the data)
    The higher the R², the better the fit to the data
    Adding an extra predictor always increases R²
    ⇒ predictors that greatly increase R² are potentially important

Example: Multiple regression and R² (1)

Let us revisit our blood pressure data
    The residual sum-of-squares of our mean model was 560
    ⇒ this is our reference model (total sum-of-squares)
Regression of blood pressure (BP) onto weight gave us

    E [BP | Weight] = 2.20 + 1.20 Weight

which had an RSS of 54.52 ⇒ R² ≈ 0.9



Example: Multiple regression and R² (2)

In our data we also have each individual's age
We fit a multiple linear regression of BP onto weight and age:

    E [BP | Weight, Age] = −16.57 + 1.03 Weight + 0.71 Age

This says that:
    for every kilogram of weight, a person's blood pressure rises by 1.03 mmHg;
    for every year of age, a person's blood pressure rises by 0.71 mmHg.
This model has an RSS of 4.82 ⇒ R² = 0.99
So including age seems to increase our fit substantially (the R² arithmetic is sketched below)
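With the numbers quoted above (TSS = 560 from the mean model, RSS = 54.52 for weight only, RSS = 4.82 for weight and age), the R² calculation is one line per model:

    def r_squared(rss: float, tss: float) -> float:
        """Coefficient of determination, R^2 = 1 - RSS/TSS."""
        return 1.0 - rss / tss

    TSS = 560.0                      # mean (intercept-only) model
    print(r_squared(54.52, TSS))     # weight only: ~0.90
    print(r_squared(4.82, TSS))      # weight + age: ~0.99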

Handling Categorical Predictors (1)

Sometimes our predictors are categorical variables
    This means the numerical values they take are just codes for different categories
    It makes no sense to "add" or "multiply" them
Instead we turn them into K − 1 new predictors (if K is the number of categories)
    These predictors take on a one when an individual is in a particular category, and zero otherwise
    They are called indicator variables.

Handling Categorical Predictors (2) – Key Slide

Example: a variable with four categories coded as 1, 2, 3 and 4

    Category    Indicators (category 2, category 3, category 4)
    1           0  0  0
    2           1  0  0
    1           0  0  0
    3           0  1  0
    4           0  0  1
    2           1  0  0
    3           0  1  0
    2           1  0  0
    4           0  0  1

We do not build an indicator for the first category
Regression coefficients for the other categories are increases in the target relative to being in the first category; a small encoding sketch appears below.
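A small sketch of this K − 1 indicator encoding, written in plain numpy so the mapping is explicit; the nine category codes are the ones from the example above:

    import numpy as np

    def indicator_matrix(categories: np.ndarray) -> np.ndarray:
        """Encode a categorical vector as K-1 indicator columns, dropping the first category."""
        levels = np.unique(categories)   # sorted category codes, e.g. [1, 2, 3, 4]
        kept = levels[1:]                # no indicator for the first category
        return (categories[:, None] == kept[None, :]).astype(int)

    cats = np.array([1, 2, 1, 3, 4, 2, 3, 2, 4])
    print(indicator_matrix(cats))
    # Each row has a single 1 marking category 2, 3 or 4, and all zeros for category 1.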

Nonlinear effects (1)

Sometimes predictors are related to the target in a nonlinear fashion
    We can still use linear models by transforming the predictors
    If the transformed predictors are linearly related to the target, regression will work well
We can often detect this by plotting the residuals against a variable: if they exhibit a nonlinear trend or curve, it is a sign that a transformation might be needed

Nonlinear effects (2)


[Figure: example dataset, y plotted against x for x between 0 and 1]

Nonlinear effects (3)


Fitted model: ŷ = −1.07 + 9.55x; RSS = 0.95
[Figure: the data with the fitted straight line overlaid]

Nonlinear effects (4)


Example data: residuals exhibit clear nonlinear trend
[Figure: residuals plotted against x, showing a clear curved (nonlinear) trend]

Nonlinear effects (5)

There are several common transformations
A logarithmic transformation can be used if the predictor seems to be more variable for larger values of the predictor:

    xi,j ⇒ log xi,j

    This can only be used if all xi,j > 0
Polynomial transformations offer general purpose nonlinear fits
    We turn our variable into q new variables of the form:

    xi,j ⇒ xi,j , xi,j² , xi,j³ , . . . , xi,j^q

    The higher the q, the more nonlinear the fit can become, but at greater risk of overfitting; a polynomial-fit sketch appears below.
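A sketch of the polynomial transformation on synthetic data with an assumed quadratic trend: the single predictor x is expanded into the columns 1, x, ..., x^q, and the expanded design matrix is fitted by least squares:

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(0, 1, size=60)
    y = 2.0 * x + 8.0 * x ** 2 + rng.normal(0, 0.1, size=60)  # nonlinear (quadratic) truth

    def poly_design(x: np.ndarray, q: int) -> np.ndarray:
        """Columns 1, x, x^2, ..., x^q (intercept included as the first column)."""
        return np.column_stack([x ** k for k in range(q + 1)])

    # Straight-line fit vs quadratic fit
    for q in (1, 2):
        X = poly_design(x, q)
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = np.sum((y - X @ coef) ** 2)
        print(f"q = {q}: coefficients = {np.round(coef, 3)}, RSS = {rss:.3f}")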

Nonlinear effects (6)


New model: ŷ = −0.02 + 2.16x + 7.77x² (R² = 0.999)
[Figure: the data with the fitted quadratic curve overlaid]

Connecting LS to ML (1)
The least-squares estimates turn out to be the maximum likelihood estimates under a normal model for the errors.
To show this, let our targets Y1 , . . . , Yn be random variables
Write the linear regression model as

    Yi = β0 + Σ_{j=1}^{p} βj xi,j + εi

where εi is a random, unobserved "error"
Now assume that εi ∼ N (0, σ²)
This is equivalent to saying that

    Yi | xi,1 , . . . , xi,p ∼ N ( β0 + Σ_{j=1}^{p} βj xi,j , σ² )

so each Yi is normally distributed with variance σ² and a mean that depends on the values of the associated predictors

Connecting LS to ML (2)

Each Yi is independent
Given target data y, the likelihood function can be written

    p(y | β0 , β, σ²) = Π_{i=1}^{n} (1/√(2πσ²)) exp( −(yi − β0 − Σ_{j=1}^{p} βj xi,j)² / (2σ²) )

Noting that e^(−a) e^(−b) = e^(−a−b), this simplifies to

    p(y | β0 , β, σ²) = (2πσ²)^(−n/2) exp( −Σ_{i=1}^{n} (yi − β0 − Σ_{j=1}^{p} βj xi,j)² / (2σ²) )

where we can see the term in the numerator inside the exp(·) is the residual sum-of-squares.

Connecting LS to ML (3) – Key Slide

Taking the negative logarithm of this yields

    L(y | β0 , β, σ²) = (n/2) log(2πσ²) + RSS(β0 , β) / (2σ²)

As the value of σ² only scales the RSS term, it is easy to see that the values of β0 and β that minimise the negative log-likelihood are the least-squares estimates β̂0 and β̂
The LS estimates are the same as the maximum likelihood estimates, assuming the random "errors" εi are normally distributed
Our residuals

    ei = yi − ŷi

can be viewed as our estimates of the errors εi.

Connecting LS to ML (4)

How do we estimate the error variance σ²?
The maximum likelihood estimate is

    σ̂²_ML = RSS(β̂0 , β̂) / n

but this tends to underestimate the actual variance.
A better estimate is the unbiased estimate

    σ̂²_u = RSS(β̂0 , β̂) / (n − p − 1)

where p is the number of predictors used to fit the model. A small sketch of both estimates is given below.
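A minimal sketch of the two variance estimates and the minimised negative log-likelihood from these slides, reusing a small synthetic least-squares fit; the printed numbers illustrate the formulas only:

    import numpy as np

    rng = np.random.default_rng(3)
    n, p = 40, 2
    X = rng.normal(size=(n, p))
    y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(0, 0.5, n)

    X1 = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    rss = np.sum((y - X1 @ coef) ** 2)

    sigma2_ml = rss / n            # ML estimate; tends to underestimate the variance
    sigma2_u = rss / (n - p - 1)   # unbiased estimate

    # Minimised negative log-likelihood: (n/2) log(2*pi*RSS/n) + n/2
    negloglik = 0.5 * n * np.log(2 * np.pi * rss / n) + 0.5 * n

    print(f"sigma2_ML = {sigma2_ml:.4f}, sigma2_unbiased = {sigma2_u:.4f}")
    print(f"minimised negative log-likelihood = {negloglik:.3f}")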

Making predictions with a linear model

Given estimates β̂0 , β̂ we can make predictions about new data
To estimate the value of the target for some new predictor values x′1 , x′2 , . . . , x′p :

    ŷ = β̂0 + Σ_{j=1}^{p} β̂j x′j

Using the normal model of the residuals, we can also get a probability distribution over future data:

    Ŷ ∼ N ( β̂0 + Σ_{j=1}^{p} β̂j x′j , σ² )

By changing the predictors we can see how the target changes
    Example: seeing how weight and age affect blood pressure
Be careful using predictions outside of sensible predictor values! A prediction sketch is given below.
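A prediction sketch using the weight-and-age coefficients quoted earlier, with the error variance estimated from the quoted RSS of 4.82; the new person's weight and age are made up, and the interval below ignores the extra uncertainty in the estimated coefficients themselves:

    import numpy as np

    # Fitted coefficients quoted in the lecture example: E[BP | Weight, Age] = b0 + b1*Weight + b2*Age
    b0, beta = -16.57, np.array([1.03, 0.71])
    sigma2 = 4.82 / (20 - 2 - 1)    # unbiased variance estimate from the quoted RSS of 4.82

    x_new = np.array([92.0, 50.0])  # a hypothetical new person: 92 kg, 50 years old
    y_hat = b0 + beta @ x_new       # point prediction

    # Approximate 95% prediction interval from Y ~ N(y_hat, sigma^2)
    half_width = 1.96 * np.sqrt(sigma2)
    print(f"predicted BP = {y_hat:.2f} mmHg")
    print(f"approx. 95% interval: ({y_hat - half_width:.2f}, {y_hat + half_width:.2f})")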


Underfitting/Overfitting (1)

We often have many measured predictors
    In our blood pressure example, we have weight, body surface area, age, pulse rate and a measure of stress
    Should we use them all, and if not, why not?
The R² always improves as we include more predictors
    ⇒ so the model always fits the data we have better
    But prediction on new, unseen data might be worse
    Performance on new, unseen data is called generalisation

Underfitting/Overfitting (2) – Key Slide

Risks of including/excluding predictors:

Omitting important predictors
    Called underfitting
    Leads to systematic error (bias) in predicting the target
Including spurious predictors
    Called overfitting
    Leads our model to "learn" noise and random variation
    Poorer ability to predict new, unseen data from our population

Underfitting/Overfitting Example (1)

Example: we observe x and y data and want to build a prediction model for y using x
    The data looks nonlinear, so we use polynomial regression
    We take x, x², x³, . . . , x^20 ⇒ a very flexible model
How many terms should we include?
For example, do we use

    y = β0 + β1 x + β2 x² + ε

or

    y = β0 + β1 x + β2 x² + β3 x³ + β4 x⁴ + β5 x⁵ + ε

or another model with some other number of polynomial terms?

Underfitting/Overfitting Example (2)


Example dataset of 50 samples
[Figure: scatter plot of the 50 (x, y) samples, with x between −1 and 1]

Underfitting/Overfitting Example (3)


Use (x, x²): too simple – underfitting
[Figure: the data with the fitted quadratic curve, which misses the structure in the data]

Underfitting/Overfitting Example (4)


Use (x, x², . . . , x^20): too complex – overfitting
[Figure: the data with the fitted degree-20 polynomial, which chases the noise]

Underfitting/Overfitting Example (5)


Use (x, x², . . . , x⁶): seems "just right". But how do we find this model?
[Figure: the data with the fitted degree-6 polynomial, which follows the trend without chasing the noise]

Using Hypothesis Testing – Key Slide

One approach is to use hypothesis testing
    We know that a predictor j is unimportant if βj = 0
    So we can test the hypothesis

        H0 : βj = 0
        vs
        HA : βj ≠ 0

    which, in this setting, is a variant of the t-test (see Ross, Chapter 9, and Studio 6)
Strengths: easy to apply, easy to understand
Weaknesses: difficult to directly compare two different models
A sketch of the coefficient t-test is given below.
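A sketch of the coefficient t-test on synthetic data, assuming the standard linear-model standard errors: se(β̂j) is the square root of the j-th diagonal entry of σ̂²_u (X′X)⁻¹ (with the intercept column included in X), and the p-value comes from a t distribution with n − p − 1 degrees of freedom. This is the textbook construction rather than anything specific to these slides:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    n, p = 60, 3
    X = rng.normal(size=(n, p))
    y = 2.0 + X @ np.array([1.5, 0.0, -0.8]) + rng.normal(0, 1.0, n)  # predictor 2 is spurious

    X1 = np.column_stack([np.ones(n), X])        # include intercept column
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    rss = np.sum((y - X1 @ coef) ** 2)
    sigma2_u = rss / (n - p - 1)

    cov = sigma2_u * np.linalg.inv(X1.T @ X1)    # estimated covariance of the estimates
    se = np.sqrt(np.diag(cov))
    t_stats = coef / se
    p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - p - 1)

    for name, b, t, pv in zip(["intercept", "x1", "x2", "x3"], coef, t_stats, p_values):
        print(f"{name}: estimate = {b:.3f}, t = {t:.2f}, p = {pv:.3g}")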

Model Selection (1)

A different approach is through model selection
In the context of linear regression, we define a model by specifying which predictors are included in the linear regression
For example, in our blood pressure example:
    {Weight}
    {Weight, Age}
    {Age, Stress}
    {Age, Stress, Pulse}
are some of the possible models we could build
Given a model, we can estimate the associated linear regression coefficients using least-squares/maximum likelihood
The question then becomes how to choose a good model

Model Selection (2)

We use maximum likelihood to choose the parameters
    Remember, this means we adjust the parameters of our distribution until we find the ones that maximise the probability of seeing the data y we have observed
    Can we use this to select a model as well as the parameters?
Assume a normal distribution for our regression errors
The minimised negative log-likelihood, evaluated at the ML estimates β̂0 , β̂, σ̂²_ML , is

    L(y | β̂0 , β̂, σ̂²_ML) = (n/2) log( 2π RSS(β̂0 , β̂)/n ) + n/2

This always decreases as we add more predictors to our model
    ⇒ it cannot be used to select models, only parameters

Model Selection (3) – Key Slide

Let M denote a model (a set of predictors to use)
Let L(y | β̂0 , β̂, σ̂²_ML , M) denote the minimised negative log-likelihood for the model M
We can select a model by minimising an information criterion

    L(y | β̂0 , β̂, σ̂²_ML , M) + α(n, kM)

where
    α(·) is a model complexity penalty;
    kM is the number of predictors in model M;
    n is the size of our data sample.
This is a form of penalised likelihood estimation
    ⇒ a model is penalised by its complexity (ability to fit data)

Model Selection (4)

How do we measure complexity, i.e., choose α(·)?
Akaike Information Criterion (AIC):

    α(n, kM) = kM

Bayesian Information Criterion (BIC):

    α(n, kM) = (kM / 2) log n

The AIC penalty is smaller than the BIC penalty ⇒ increased chance of overfitting
The BIC penalty is bigger than the AIC penalty ⇒ increased chance of underfitting
Differences in scores of 3 or more are considered significant; a small AIC/BIC comparison is sketched below.
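A sketch of information-criterion scoring for a family of polynomial models on synthetic data, using the minimised negative log-likelihood from the earlier slide, L = (n/2) log(2π RSS/n) + n/2, together with the AIC and BIC penalties defined above; the data and candidate degrees are made up for illustration:

    import numpy as np

    rng = np.random.default_rng(5)
    n = 50
    x = rng.uniform(-1, 1, size=n)
    y = 1.0 + 3.0 * x - 4.0 * x ** 3 + rng.normal(0, 0.4, n)  # true model is cubic

    def fit_rss(x, y, degree):
        X = np.column_stack([x ** k for k in range(degree + 1)])  # 1, x, ..., x^degree
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        return np.sum((y - X @ coef) ** 2)

    for k in range(1, 9):                             # k = number of polynomial predictors
        rss = fit_rss(x, y, k)
        negloglik = 0.5 * n * np.log(2 * np.pi * rss / n) + 0.5 * n
        aic_score = negloglik + k                     # AIC penalty: k_M
        bic_score = negloglik + 0.5 * k * np.log(n)   # BIC penalty: (k_M / 2) log n
        print(f"degree {k}: AIC score = {aic_score:.2f}, BIC score = {bic_score:.2f}")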

Finding a Good Model (1)

The most obvious approach is to try all possible combinations of predictors, and choose the one that has the smallest information criterion score
    Called the all subsets approach
    If we have p predictors then we have 2^p models to try
    For p = 50, 2^p ≈ 1.1 × 10^15 !
    So this method is computationally intractable for moderate p

Finding a Good Model (2)

An alternative is to search through the model space
Forward selection algorithm:
    1 Start with the empty model;
    2 Find the predictor that reduces the information criterion the most;
    3 If no predictor improves the model, stop;
    4 Otherwise, add this predictor to the model;
    5 Return to Step 2.
Backwards selection is a related algorithm
    Start with the full model and remove predictors
Both searches are computationally tractable for large p, but may miss important predictors. A forward-selection sketch is given below.
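A sketch of the forward selection loop, scoring each candidate model with the penalised negative log-likelihood (a BIC-style penalty is assumed here); the predictors and data are synthetic:

    import numpy as np

    rng = np.random.default_rng(6)
    n, p = 80, 6
    X = rng.normal(size=(n, p))
    beta_true = np.array([2.0, 0.0, -1.5, 0.0, 0.0, 1.0])  # predictors 2, 4, 5 are spurious
    y = 1.0 + X @ beta_true + rng.normal(0, 1.0, n)

    def score(cols):
        """BIC-style score: negative log-likelihood plus (k/2) log n."""
        X1 = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
        coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
        rss = np.sum((y - X1 @ coef) ** 2)
        return 0.5 * n * np.log(2 * np.pi * rss / n) + 0.5 * n + 0.5 * len(cols) * np.log(n)

    selected, remaining = [], set(range(p))
    best = score(selected)                              # start with the empty model
    while remaining:
        candidate = min(remaining, key=lambda j: score(selected + [j]))
        cand_score = score(selected + [candidate])
        if cand_score >= best:                          # no predictor improves the score: stop
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best = cand_score

    print("selected predictors (0-indexed):", selected, "score:", round(best, 2))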

Reading/Terms to Revise

Reading for this week: Chapter 9 of Ross.

Terms you should know:


    Target, predictor, explanatory variable;
    Intercept, coefficient;
    R² value;
    Categorical predictors;
    Polynomial regression;
    Model, model selection;
    Overfitting, underfitting;
    Information criteria.

This week we looked at supervised learning for continuous targets; next week we will examine supervised learning for categorical targets (classification).
