

FIT2086 Lecture 6
Linear Regression

Daniel F. Schmidt

Faculty of Information Technology, Monash University

September 4, 2017

Outline

1 Linear Regression Models
    Supervised Learning
    Linear Regression Models

2 Model Selection for Linear Regression
    Under and Overfitting
    Model Selection Methods

Revision from last week

Hypothesis testing: test a null hypothesis against an alternative

    H0 : null hypothesis
    vs
    HA : alternative hypothesis

A test statistic measures how different our observed sample is from the null hypothesis
A p-value quantifies the evidence against the null hypothesis: it is the probability of seeing a sample that results in a test statistic as extreme as, or more extreme than, the one we observed, just by chance, if the null hypothesis were true.


Supervised Learning (1)

Over the last three weeks we have looked at parameter inference:

In week 3 we examined point estimation using maximum likelihood
    Selecting our "best guess" at a single value of the parameter
In week 4 we examined interval estimation using confidence intervals
    Giving a range of plausible values for the unknown population parameter
In week 5 we examined hypothesis testing
    Quantifying statistical evidence against a given hypothesis

Supervised Learning (2)

Now we will start to see how these tools can be used to build
more complex models
Over the next three weeks we will look at supervised learning
In particular, we will look at linear regression
But first, what is supervised learning?

Supervised Learning (3)

Imagine we have measured p + 1 variables on n individuals (people, objects, things)
We would like to predict one of the variables using the remaining p variables

If the variable we are predicting is categorical, we are performing classification
    Example: predicting if someone has diabetes from medical measurements.
If the variable we are predicting is numerical, we are performing regression
    Example: predicting the quality of a wine from chemical and seasonal information.

Supervised Learning (4)


The variable we are predicting is designated the "y" variable
    We have (y1 , . . . , yn)
    This variable is often called the target, response, or outcome.

The other variables are usually designated "X" variables
    We have (xi,1 , . . . , xi,p) for each individual i = 1, . . . , n
    These variables are often called explanatory variables, predictors, covariates, or exposures.

Usually we assume the targets are random variables and the predictors are known without error

Supervised Learning (5)

Supervised learning: find a relationship between the targets yi and the associated predictors xi,1 , . . . , xi,p.
That is, learn a function f (·) such that

    yi = f (xi,1 , . . . , xi,p)

There is usually error in measuring yi, so no f (·) fits perfectly
    ⇒ we model yi as a realisation of the random variable Yi
So instead, we find an f (·) whose outputs are "close" to y1 , . . . , yn

It is "supervised" because we have examples to learn from
The supervised learning model depends on the form of f (·)

Linear Regression

Linear regression is a special type of supervised learning
    In this case, we take the function f (·) that relates the predictors to the target to be linear

One of the most important models in statistics
    The resulting model is highly interpretable
    It is very flexible and can even handle nonlinear relationships
    It is computationally efficient to fit, even for very large p
    It is an enormous area of research and work
    ⇒ we will get acquainted with the basics

Simple Linear Regression (1)

Consider the dataset we examined in Studio 5: blood pressure and several other measurements on 20 patients
Imagine we want to model blood pressure



Simple Linear Regression (2)


[Figure: blood pressure (mmHg) plotted against patient ID (1 to 20)]

Simple Linear Regression (3)

Our blood pressure variable BP1 , . . . , BP20 is continuous
    ⇒ we choose to model it using a normal distribution
The maximum likelihood estimate of the mean µ is

    µ̂ = (1/20) Σ_{i=1}^{20} yi = 114

which is equivalent to the sample mean

We have a new person from the population this sample was drawn from, and we want to predict their blood pressure
Using our simple model, our best guess of this person's blood pressure is 114, i.e., the estimated mean µ̂

Simple Linear Regression (4)


[Figure: prediction of BP using the mean, a horizontal line at 114 mmHg, plotted against patient ID]

Simple Linear Regression (5)

How good is our model at predicting?
    One way we could measure this is through prediction error
    We don't know future data, but we can see how well the model predicts the data we already have
Let ŷi denote the prediction of sample yi using a model; then

    ei = yi − ŷi

are the errors between our model predictions ŷi and the observed data yi
    ⇒ often called residual errors, or just residuals
A good fit would lead to overall small errors

Simple Linear Regression (6)


[Figure: prediction of BP using the mean, with the error/residual for each patient shown]

Simple Linear Regression (7)

We can summarise the total error of fit of our model by

    RSS = Σ_{i=1}^{n} ei²

which is called the residual sum-of-squared errors.

For our simple mean model, RSS = 560 (see the sketch below)
Can we do better (smaller error) if we use one of the other measured variables to help predict blood pressure?
For example, if we took a person's weight into account, could we build a better predictor of their blood pressure?
To get an idea if there is scope for improvement we can plot blood pressure vs weight
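As a quick numerical sketch of the mean-model RSS calculation (the blood pressure values below are made-up placeholders rather than the Studio 5 data, so the printed numbers will not match 114 or 560):

    import numpy as np

    # Hypothetical blood pressure readings (mmHg) for 20 patients;
    # placeholders standing in for the Studio 5 data.
    bp = np.array([110, 112, 118, 120, 105, 114, 121, 109, 113, 117,
                   108, 122, 116, 111, 119, 107, 115, 124, 110, 112], dtype=float)

    mu_hat = bp.mean()            # ML estimate of the mean (the "mean model")
    residuals = bp - mu_hat       # e_i = y_i - yhat_i, with yhat_i = mu_hat for every i
    rss = np.sum(residuals ** 2)  # residual sum-of-squared errors

    print(f"mu_hat = {mu_hat:.2f}, RSS = {rss:.2f}")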

Simple Linear Regression (8)


Blood pressure vs weight – BP appears to increase with weight

[Figure: scatter plot of blood pressure (mmHg) against weight (kg)]

Simple Linear Regression (9)


Our simple mean model is clearly not a good fit

[Figure: BP vs weight with the mean model overlaid; the flat line is clearly a poor fit]

Simple Linear Regression (10)

Our simple mean model predicts blood pressure by

    E [BPi ] = µ

irrespective of any other data on individual i

Let (Weight1 , . . . , Weight20) be the weights of our 20 individuals
We can let the mean vary as a linear function of weight, i.e.,

    E [BPi | Weighti ] = β0 + β1 Weighti

This says that the conditional mean of blood pressure BPi for individual i, given the individual's weight Weighti , is equal to β0 plus β1 times the weight Weighti
Note our simple mean model is a linear model with β1 = 0

Simple Linear Regression (11)


The fitted linear model: E [BPi | Weighti ] = 2.2053 + 1.2009 Weighti
[Figure: BP vs weight with the fitted regression line overlaid]

Simple Linear Regression (12)


Residuals: ei = BPi − 2.2053 − 1.2009 Weighti (RSS = 54.52)
[Figure: BP vs weight with the fitted line and each patient's residual shown]

Simple Linear Regression (13) – Key Slide

A linear model of the form

    E [Yi | xi ] = ŷi = β0 + β1 xi

is called a simple linear regression.

It has two free regression parameters:
    β0 is the intercept; it is the predicted value ŷi when the predictor xi = 0
    β1 is a regression coefficient; it is the amount the predicted value ŷi changes by for a one unit change in the predictor xi

Simple Linear Regression (14)

In our example yi is blood pressure and xi is weight;

    ŷi = 2.2053 + 1.2009 xi

so
    For every additional kilogram a person weighs, their predicted blood pressure increases by 1.2009 mmHg
    For a person who weighs zero kilograms, the predicted blood pressure is 2.2053 mmHg
    The predictions might not make sense outside of sensible ranges of the predictors!

Fitting Simple Linear Regressions (1)

How did we arrive at β̂0 = 2.2053 and β̂1 = 1.2009 in our blood pressure vs weight example?
We measure the fit of a model by its RSS:

    RSS = Σ_{i=1}^{n} (yi − β0 − β1 xi)²
        = Σ_{i=1}^{n} (yi − ŷi)²
        = Σ_{i=1}^{n} ei²

Smaller error = better fit

Fitting Simple Linear Regressions (2)

So the least-squares principle says we choose (estimate) β0 , β1 to minimise the RSS
Formally,

    (β̂0 , β̂1) = arg min over (β0 , β1) of Σ_{i=1}^{n} (yi − β0 − β1 xi)²

These are often called least-squares (LS) estimates.
    There are alternative measures of error; for example, the least sum of absolute errors.
    Least squares is popular due to its simplicity, computational efficiency and connections to normal models

Fitting Simple Linear Regressions (3)

The RSS is a function of β0 and β1 , i.e.,

    RSS(β0 , β1) = Σ_{i=1}^{n} (yi − β0 − β1 xi)²

The least-squares estimates are the solutions to the equations

    ∂RSS(β0 , β1)/∂β0 = −2 Σ_{i=1}^{n} (yi − β0 − β1 xi) = 0
    ∂RSS(β0 , β1)/∂β1 = −2 Σ_{i=1}^{n} xi (yi − β0 − β1 xi) = 0

where we use the chain rule.

Fitting Simple Linear Regressions (4)

The solution for β0 is

    β̂0 = [ (Σ_{i=1}^{n} yi)(Σ_{i=1}^{n} xi²) − (Σ_{i=1}^{n} yi xi)(Σ_{i=1}^{n} xi) ] / [ n Σ_{i=1}^{n} xi² − (Σ_{i=1}^{n} xi)² ]

and the solution for β1 is

    β̂1 = [ Σ_{i=1}^{n} yi xi − β̂0 Σ_{i=1}^{n} xi ] / Σ_{i=1}^{n} xi²

Fitting Simple Linear Regressions (5)

Given LS estimates β̂0 , β̂1 we can find the predictions for our data,

    ŷi = β̂0 + β̂1 xi

and the residuals

    ei = yi − ŷi

The vector of residuals e = (e1 , . . . , en) has the properties

    Σ_{i=1}^{n} ei = 0   and   corr(x, e) = 0

where x = (x1 , . . . , xn) is our predictor variable.
This means least-squares fits a line such that the mean of the resulting residuals is zero, and the residuals are uncorrelated with the predictor. A quick numerical check of these properties is sketched below.
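The following is a minimal numpy sketch of the closed-form least-squares estimates from the previous slide, checked against numpy's own polyfit, together with the two residual properties above; the data is synthetic (generated from an assumed linear relationship), not the blood pressure dataset:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 20
    x = rng.uniform(86, 102, size=n)                # synthetic "weights"
    y = 2.2 + 1.2 * x + rng.normal(0, 1.5, size=n)  # synthetic "blood pressures"

    # Closed-form least-squares estimates from the slide above
    sx, sy = x.sum(), y.sum()
    sxx, sxy = (x ** 2).sum(), (x * y).sum()
    b0 = (sy * sxx - sxy * sx) / (n * sxx - sx ** 2)
    b1 = (sxy - b0 * sx) / sxx

    # Same fit via numpy's built-in polynomial least squares
    b1_np, b0_np = np.polyfit(x, y, deg=1)
    assert np.allclose([b0, b1], [b0_np, b1_np])

    # Residual properties: residuals sum to zero and are uncorrelated with x
    e = y - (b0 + b1 * x)
    print("sum of residuals:", e.sum())            # ~0 up to floating point error
    print("corr(x, e):", np.corrcoef(x, e)[0, 1])  # ~0 up to floating point error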

Multiple Linear Regression (1) – Key Slide

We have used one explanatory variable in our linear model
A great strength of linear models is that they easily handle multiple variables
Let xi,j denote variable j for individual i, where j = 1, . . . , p; i.e., we have p explanatory variables. Then

    E [yi | xi,1 , . . . , xi,p ] = β0 + Σ_{j=1}^{p} βj xi,j

The intercept is now the expected value of the target when xi,1 = xi,2 = · · · = xi,p = 0
The coefficient βj is the increase in the expected value of the target per unit change in explanatory variable j

Multiple Linear Regression (2) – Key Slide

We fit a multiple linear regression using least-squares
    ⇒ assume p < n, otherwise the solution is non-unique
Given coefficients β0 , β1 , . . . , βp the RSS is

    RSS(β0 , β1 , . . . , βp) = Σ_{i=1}^{n} ( yi − β0 − Σ_{j=1}^{p} βj xi,j )²

Now we have to solve

    (β̂0 , β̂1 , . . . , β̂p) = arg min over (β0 , β1 , . . . , βp) of RSS(β0 , β1 , . . . , βp)

Efficient algorithms exist to find these estimates

Multiple Linear Regression (3)

Matrix algebra can simplify the linear regression equations
    We have a vector of targets y = (y1 , . . . , yn)
    We have a vector of coefficients β = (β1 , . . . , βp)
    We can treat each variable as a vector xj = (x1,j , . . . , xn,j)
Arrange these vectors as the columns of a matrix X of predictors:

    X = (x1 , x2 , . . . , xp) =
        [ x1,1  x1,2  ...  x1,p ]
        [ x2,1  x2,2  ...  x2,p ]
        [  ...   ...        ... ]
        [ xn,1  xn,2  ...  xn,p ]

We call this the design matrix
    ⇒ it has p columns (predictors) and n rows (individuals)

Multiple Linear Regression (4) – Key Slide

We can form our predictions and residuals using

    ŷ = Xβ + β0 1n   and   e = y − ŷ

where 1n is a vector of n ones.
We can then write our RSS very compactly as

    RSS(β0 , β) = e′e

If β̂0 , β̂ are the least-squares estimates, then

    corr(xj , e) = 0 for all j

That is, least-squares finds the plane such that the residuals (errors) are uncorrelated with all predictors in the model; a small least-squares sketch using this matrix form is given below.
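A small sketch of this matrix form, assuming synthetic data and using numpy's general least-squares routine on a design matrix augmented with a column of ones for the intercept:

    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 50, 3
    X = rng.normal(size=(n, p))                      # design matrix: n rows, p predictor columns
    beta_true = np.array([1.5, -2.0, 0.5])
    y = 4.0 + X @ beta_true + rng.normal(0, 0.3, n)  # targets with intercept 4.0 and noise

    # Augment with a column of ones so the intercept is estimated jointly
    X1 = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    b0_hat, beta_hat = coef[0], coef[1:]

    yhat = X1 @ coef
    e = y - yhat
    rss = e @ e                                      # RSS(beta0, beta) = e'e

    # Residuals are uncorrelated with every predictor in the model
    for j in range(p):
        print(f"corr(x_{j+1}, e) = {np.corrcoef(X[:, j], e)[0, 1]:.2e}")
    print("RSS =", rss)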

R-squared (R²) (1)

The residual sum-of-squares tells us how well we fit the data
    But the scale is arbitrary: what does an RSS of 2,352 mean?
    Instead, we define the RSS relative to some reference point
We use the total sum-of-squares as the reference:

    TSS = Σ_{i=1}^{n} (yi − ȳ)²

which is the residual sum-of-squares obtained by fitting the intercept only (the "mean model")

R-squared (R²) (2) – Key Slide

The R² value is then defined as

    R² = 1 − RSS/TSS

which is also called the coefficient-of-determination
    R² lies between 0 (model has no explanatory power) and 1 (model completely explains the data)
    The higher the R², the better the fit to the data
    Adding an extra predictor always increases R²
    ⇒ predictors that greatly increase R² are potentially important

Example: Multiple regression and R² (1)

Let us revisit our blood pressure data
    The residual sum-of-squares of our mean model was 560
    ⇒ this is our reference model (total sum-of-squares)
Regression of blood pressure (BP) onto weight gave us

    E [BP | Weight] = 2.20 + 1.20 Weight

which had an RSS of 54.52 ⇒ R² ≈ 0.9



Example: Multiple regression and R² (2)

In our data we also have each individual's age
We fit a multiple linear regression of BP onto weight and age:

    E [BP | Weight, Age] = −16.57 + 1.03 Weight + 0.71 Age

This says that:
    for every kilogram of weight, a person's blood pressure rises by 1.03 mmHg;
    for every year of age, a person's blood pressure rises by 0.71 mmHg.
This model has an RSS of 4.82 ⇒ R² = 0.99
So including age seems to increase our fit substantially (the R² arithmetic is sketched below)
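With the numbers quoted above (TSS = 560 from the mean model, RSS = 54.52 for weight only, RSS = 4.82 for weight and age), the R² calculation is one line per model:

    def r_squared(rss: float, tss: float) -> float:
        """Coefficient of determination, R^2 = 1 - RSS/TSS."""
        return 1.0 - rss / tss

    TSS = 560.0                      # mean (intercept-only) model
    print(r_squared(54.52, TSS))     # weight only: ~0.90
    print(r_squared(4.82, TSS))      # weight + age: ~0.99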

Handling Categorical Predictors (1)

Sometimes our predictors are categorical variables
    This means the numerical values they take are just codes for different categories
    It makes no sense to "add" or "multiply" them
Instead we turn them into K − 1 new predictors (if K is the number of categories)
    These predictors take on a one when an individual is in a particular category, and zero otherwise
    They are called indicator variables.

Handling Categorical Predictors (2) – Key Slide

Example: a variable with four categories coded as 1, 2, 3 and 4

    Category    Indicators (category 2, category 3, category 4)
    1           0  0  0
    2           1  0  0
    1           0  0  0
    3           0  1  0
    4           0  0  1
    2           1  0  0
    3           0  1  0
    2           1  0  0
    4           0  0  1

We do not build an indicator for the first category
Regression coefficients for the other categories are increases in the target relative to being in the first category; a small encoding sketch appears below.
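A small sketch of this K − 1 indicator encoding, written in plain numpy so the mapping is explicit; the nine category codes are the ones from the example above:

    import numpy as np

    def indicator_matrix(categories: np.ndarray) -> np.ndarray:
        """Encode a categorical vector as K-1 indicator columns, dropping the first category."""
        levels = np.unique(categories)   # sorted category codes, e.g. [1, 2, 3, 4]
        kept = levels[1:]                # no indicator for the first category
        return (categories[:, None] == kept[None, :]).astype(int)

    cats = np.array([1, 2, 1, 3, 4, 2, 3, 2, 4])
    print(indicator_matrix(cats))
    # Each row has a single 1 marking category 2, 3 or 4, and all zeros for category 1.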

Nonlinear effects (1)

Sometimes predictors are related to the target in a nonlinear fashion
    We can still use linear models by transforming the predictors
    If the transformed predictors are linearly related to the target, regression will work well
We can often detect this by plotting the residuals against a variable: if they exhibit a nonlinear trend or curve, it is a sign that a transformation might be needed

Nonlinear effects (2)


[Figure: example dataset, y plotted against x for x between 0 and 1]

Nonlinear effects (3)


Fitted model: ŷ = −1.07 + 9.55x; RSS = 0.95
[Figure: the data with the fitted straight line overlaid]

Nonlinear effects (4)


Example data: residuals exhibit clear nonlinear trend
[Figure: residuals plotted against x, showing a clear curved (nonlinear) trend]

Nonlinear effects (5)

There are several common transformations
A logarithmic transformation can be used if the predictor seems to be more variable for larger values of the predictor:

    xi,j ⇒ log xi,j

    This can only be used if all xi,j > 0
Polynomial transformations offer general purpose nonlinear fits
    We turn our variable into q new variables of the form:

    xi,j ⇒ xi,j , xi,j² , xi,j³ , . . . , xi,j^q

    The higher the q, the more nonlinear the fit can become, but at greater risk of overfitting; a polynomial-fit sketch appears below.
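A sketch of the polynomial transformation on synthetic data with an assumed quadratic trend: the single predictor x is expanded into the columns 1, x, ..., x^q, and the expanded design matrix is fitted by least squares:

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(0, 1, size=60)
    y = 2.0 * x + 8.0 * x ** 2 + rng.normal(0, 0.1, size=60)  # nonlinear (quadratic) truth

    def poly_design(x: np.ndarray, q: int) -> np.ndarray:
        """Columns 1, x, x^2, ..., x^q (intercept included as the first column)."""
        return np.column_stack([x ** k for k in range(q + 1)])

    # Straight-line fit vs quadratic fit
    for q in (1, 2):
        X = poly_design(x, q)
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = np.sum((y - X @ coef) ** 2)
        print(f"q = {q}: coefficients = {np.round(coef, 3)}, RSS = {rss:.3f}")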

Nonlinear effects (6)


New model: ŷ = −0.02 + 2.16x + 7.77x² (R² = 0.999)
[Figure: the data with the fitted quadratic curve overlaid]

Connecting LS to ML (1)
The least-squares estimates turn out to be the maximum likelihood estimates under a normal model for the errors.
To show this, let our targets Y1 , . . . , Yn be random variables
Write the linear regression model as

    Yi = β0 + Σ_{j=1}^{p} βj xi,j + εi

where εi is a random, unobserved "error"
Now assume that εi ∼ N (0, σ²)
This is equivalent to saying that

    Yi | xi,1 , . . . , xi,p ∼ N ( β0 + Σ_{j=1}^{p} βj xi,j , σ² )

so each Yi is normally distributed with variance σ² and a mean that depends on the values of the associated predictors

Connecting LS to ML (2)

Each Yi is independent
Given target data y, the likelihood function can be written

    p(y | β0 , β, σ²) = Π_{i=1}^{n} (1/√(2πσ²)) exp( −(yi − β0 − Σ_{j=1}^{p} βj xi,j)² / (2σ²) )

Noting that e^(−a) e^(−b) = e^(−a−b), this simplifies to

    p(y | β0 , β, σ²) = (2πσ²)^(−n/2) exp( −Σ_{i=1}^{n} (yi − β0 − Σ_{j=1}^{p} βj xi,j)² / (2σ²) )

where we can see the term in the numerator inside the exp(·) is the residual sum-of-squares.

Connecting LS to ML (3) – Key Slide

Taking the negative logarithm of this yields

    L(y | β0 , β, σ²) = (n/2) log(2πσ²) + RSS(β0 , β) / (2σ²)

As the value of σ² only scales the RSS term, it is easy to see that the values of β0 and β that minimise the negative log-likelihood are the least-squares estimates β̂0 and β̂
The LS estimates are the same as the maximum likelihood estimates, assuming the random "errors" εi are normally distributed
Our residuals

    ei = yi − ŷi

can be viewed as our estimates of the errors εi.

Connecting LS to ML (4)

How do we estimate the error variance σ²?
The maximum likelihood estimate is

    σ̂²_ML = RSS(β̂0 , β̂) / n

but this tends to underestimate the actual variance.
A better estimate is the unbiased estimate

    σ̂²_u = RSS(β̂0 , β̂) / (n − p − 1)

where p is the number of predictors used to fit the model. A small sketch of both estimates is given below.
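A minimal sketch of the two variance estimates and the minimised negative log-likelihood from these slides, reusing a small synthetic least-squares fit; the printed numbers illustrate the formulas only:

    import numpy as np

    rng = np.random.default_rng(3)
    n, p = 40, 2
    X = rng.normal(size=(n, p))
    y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(0, 0.5, n)

    X1 = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    rss = np.sum((y - X1 @ coef) ** 2)

    sigma2_ml = rss / n            # ML estimate; tends to underestimate the variance
    sigma2_u = rss / (n - p - 1)   # unbiased estimate

    # Minimised negative log-likelihood: (n/2) log(2*pi*RSS/n) + n/2
    negloglik = 0.5 * n * np.log(2 * np.pi * rss / n) + 0.5 * n

    print(f"sigma2_ML = {sigma2_ml:.4f}, sigma2_unbiased = {sigma2_u:.4f}")
    print(f"minimised negative log-likelihood = {negloglik:.3f}")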

Making predictions with a linear model

Given estimates β̂0 , β̂ we can make predictions about new data
To estimate the value of the target for some new predictor values x′1 , x′2 , . . . , x′p :

    ŷ = β̂0 + Σ_{j=1}^{p} β̂j x′j

Using the normal model of the residuals, we can also get a probability distribution over future data:

    Ŷ ∼ N ( β̂0 + Σ_{j=1}^{p} β̂j x′j , σ² )

By changing the predictors we can see how the target changes
    Example: seeing how weight and age affect blood pressure
Be careful using predictions outside of sensible predictor values! A prediction sketch is given below.
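A prediction sketch using the weight-and-age coefficients quoted earlier, with the error variance estimated from the quoted RSS of 4.82; the new person's weight and age are made up, and the interval below ignores the extra uncertainty in the estimated coefficients themselves:

    import numpy as np

    # Fitted coefficients quoted in the lecture example: E[BP | Weight, Age] = b0 + b1*Weight + b2*Age
    b0, beta = -16.57, np.array([1.03, 0.71])
    sigma2 = 4.82 / (20 - 2 - 1)    # unbiased variance estimate from the quoted RSS of 4.82

    x_new = np.array([92.0, 50.0])  # a hypothetical new person: 92 kg, 50 years old
    y_hat = b0 + beta @ x_new       # point prediction

    # Approximate 95% prediction interval from Y ~ N(y_hat, sigma^2)
    half_width = 1.96 * np.sqrt(sigma2)
    print(f"predicted BP = {y_hat:.2f} mmHg")
    print(f"approx. 95% interval: ({y_hat - half_width:.2f}, {y_hat + half_width:.2f})")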


Underfitting/Overfitting (1)

We often have many measured predictors
    In our blood pressure example, we have weight, body surface area, age, pulse rate and a measure of stress
    Should we use them all, and if not, why not?
The R² always improves as we include more predictors
    ⇒ so the model always fits the data we have better
    But prediction on new, unseen data might be worse
    Performance on new, unseen data is called generalisation

Underfitting/Overfitting (2) – Key Slide

Risks of including/excluding predictors:

Omitting important predictors
    Called underfitting
    Leads to systematic error (bias) in predicting the target
Including spurious predictors
    Called overfitting
    Leads our model to "learn" noise and random variation
    Poorer ability to predict new, unseen data from our population

Underfitting/Overfitting Example (1)

Example: we observe x and y data and want to build a prediction model for y using x
    The data looks nonlinear, so we use polynomial regression
    We take x, x², x³, . . . , x^20 ⇒ a very flexible model
How many terms should we include?
For example, do we use

    y = β0 + β1 x + β2 x² + ε

or

    y = β0 + β1 x + β2 x² + β3 x³ + β4 x⁴ + β5 x⁵ + ε

or another model with some other number of polynomial terms?

Underfitting/Overfitting Example (2)


Example dataset of 50 samples
[Figure: scatter plot of the 50 (x, y) samples, with x between −1 and 1]

Underfitting/Overfitting Example (3)


Use (x, x²): too simple – underfitting
[Figure: the data with the fitted quadratic curve, which misses the structure in the data]

Underfitting/Overfitting Example (4)


Use (x, x², . . . , x^20): too complex – overfitting
[Figure: the data with the fitted degree-20 polynomial, which chases the noise]

Underfitting/Overfitting Example (5)


Use (x, x², . . . , x⁶): seems "just right". But how do we find this model?
[Figure: the data with the fitted degree-6 polynomial, which follows the trend without chasing the noise]

Using Hypothesis Testing – Key Slide

One approach is to use hypothesis testing
    We know that a predictor j is unimportant if βj = 0
    So we can test the hypothesis

        H0 : βj = 0
        vs
        HA : βj ≠ 0

    which, in this setting, is a variant of the t-test (see Ross, Chapter 9, and Studio 6)
Strengths: easy to apply, easy to understand
Weaknesses: difficult to directly compare two different models
A sketch of the coefficient t-test is given below.
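A sketch of the coefficient t-test on synthetic data, assuming the standard linear-model standard errors: se(β̂j) is the square root of the j-th diagonal entry of σ̂²_u (X′X)⁻¹ (with the intercept column included in X), and the p-value comes from a t distribution with n − p − 1 degrees of freedom. This is the textbook construction rather than anything specific to these slides:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    n, p = 60, 3
    X = rng.normal(size=(n, p))
    y = 2.0 + X @ np.array([1.5, 0.0, -0.8]) + rng.normal(0, 1.0, n)  # predictor 2 is spurious

    X1 = np.column_stack([np.ones(n), X])        # include intercept column
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    rss = np.sum((y - X1 @ coef) ** 2)
    sigma2_u = rss / (n - p - 1)

    cov = sigma2_u * np.linalg.inv(X1.T @ X1)    # estimated covariance of the estimates
    se = np.sqrt(np.diag(cov))
    t_stats = coef / se
    p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - p - 1)

    for name, b, t, pv in zip(["intercept", "x1", "x2", "x3"], coef, t_stats, p_values):
        print(f"{name}: estimate = {b:.3f}, t = {t:.2f}, p = {pv:.3g}")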

Model Selection (1)

A different approach is through model selection
In the context of linear regression, we define a model by specifying which predictors are included in the linear regression
For example, in our blood pressure example:
    {Weight}
    {Weight, Age}
    {Age, Stress}
    {Age, Stress, Pulse}
are some of the possible models we could build
Given a model, we can estimate the associated linear regression coefficients using least-squares/maximum likelihood
The question then becomes how to choose a good model

Model Selection (2)

We use maximum likelihood to choose the parameters
    Remember, this means we adjust the parameters of our distribution until we find the ones that maximise the probability of seeing the data y we have observed
    Can we use this to select a model as well as the parameters?
Assume a normal distribution for our regression errors
The minimised negative log-likelihood, evaluated at the ML estimates β̂0 , β̂, σ̂²_ML , is

    L(y | β̂0 , β̂, σ̂²_ML) = (n/2) log( 2π RSS(β̂0 , β̂)/n ) + n/2

This always decreases as we add more predictors to our model
    ⇒ it cannot be used to select models, only parameters

Model Selection (3) – Key Slide

Let M denote a model (a set of predictors to use)
Let L(y | β̂0 , β̂, σ̂²_ML , M) denote the minimised negative log-likelihood for the model M
We can select a model by minimising an information criterion

    L(y | β̂0 , β̂, σ̂²_ML , M) + α(n, kM)

where
    α(·) is a model complexity penalty;
    kM is the number of predictors in model M;
    n is the size of our data sample.
This is a form of penalised likelihood estimation
    ⇒ a model is penalised by its complexity (ability to fit data)

Model Selection (4)

How do we measure complexity, i.e., choose α(·)?
Akaike Information Criterion (AIC):

    α(n, kM) = kM

Bayesian Information Criterion (BIC):

    α(n, kM) = (kM / 2) log n

The AIC penalty is smaller than the BIC penalty ⇒ increased chance of overfitting
The BIC penalty is bigger than the AIC penalty ⇒ increased chance of underfitting
Differences in scores of 3 or more are considered significant; a small AIC/BIC comparison is sketched below.
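A sketch of information-criterion scoring for a family of polynomial models on synthetic data, using the minimised negative log-likelihood from the earlier slide, L = (n/2) log(2π RSS/n) + n/2, together with the AIC and BIC penalties defined above; the data and candidate degrees are made up for illustration:

    import numpy as np

    rng = np.random.default_rng(5)
    n = 50
    x = rng.uniform(-1, 1, size=n)
    y = 1.0 + 3.0 * x - 4.0 * x ** 3 + rng.normal(0, 0.4, n)  # true model is cubic

    def fit_rss(x, y, degree):
        X = np.column_stack([x ** k for k in range(degree + 1)])  # 1, x, ..., x^degree
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        return np.sum((y - X @ coef) ** 2)

    for k in range(1, 9):                             # k = number of polynomial predictors
        rss = fit_rss(x, y, k)
        negloglik = 0.5 * n * np.log(2 * np.pi * rss / n) + 0.5 * n
        aic_score = negloglik + k                     # AIC penalty: k_M
        bic_score = negloglik + 0.5 * k * np.log(n)   # BIC penalty: (k_M / 2) log n
        print(f"degree {k}: AIC score = {aic_score:.2f}, BIC score = {bic_score:.2f}")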

Finding a Good Model (1)

The most obvious approach is to try all possible combinations of predictors, and choose the one that has the smallest information criterion score
    Called the all subsets approach
    If we have p predictors then we have 2^p models to try
    For p = 50, 2^p ≈ 1.1 × 10^15 !
    So this method is computationally intractable for moderate p

Finding a Good Model (2)

An alternative is to search through the model space
Forward selection algorithm:
    1 Start with the empty model;
    2 Find the predictor that reduces the information criterion the most;
    3 If no predictor improves the model, stop;
    4 Otherwise, add this predictor to the model;
    5 Return to Step 2.
Backwards selection is a related algorithm
    Start with the full model and remove predictors
Both searches are computationally tractable for large p, but may miss important predictors. A forward-selection sketch is given below.
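A sketch of the forward selection loop, scoring each candidate model with the penalised negative log-likelihood (a BIC-style penalty is assumed here); the predictors and data are synthetic:

    import numpy as np

    rng = np.random.default_rng(6)
    n, p = 80, 6
    X = rng.normal(size=(n, p))
    beta_true = np.array([2.0, 0.0, -1.5, 0.0, 0.0, 1.0])  # predictors 2, 4, 5 are spurious
    y = 1.0 + X @ beta_true + rng.normal(0, 1.0, n)

    def score(cols):
        """BIC-style score: negative log-likelihood plus (k/2) log n."""
        X1 = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
        coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
        rss = np.sum((y - X1 @ coef) ** 2)
        return 0.5 * n * np.log(2 * np.pi * rss / n) + 0.5 * n + 0.5 * len(cols) * np.log(n)

    selected, remaining = [], set(range(p))
    best = score(selected)                              # start with the empty model
    while remaining:
        candidate = min(remaining, key=lambda j: score(selected + [j]))
        cand_score = score(selected + [candidate])
        if cand_score >= best:                          # no predictor improves the score: stop
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best = cand_score

    print("selected predictors (0-indexed):", selected, "score:", round(best, 2))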

Reading/Terms to Revise

Reading for this week: Chapter 9 of Ross.

Terms you should know:


    Target, predictor, explanatory variable;
    Intercept, coefficient;
    R² value;
    Categorical predictors;
    Polynomial regression;
    Model, model selection;
    Overfitting, underfitting;
    Information criteria.

This week we looked at supervised learning for continuous targets; next week we will examine supervised learning for categorical targets (classification).
