Linear Regression. Courtesy: Richard Zemel, Raquel Urtasun and Sanja Fidler

The document discusses linear regression for predicting continuous outputs. It describes how linear regression requires continuous target variables, input features, training examples where the features and targets are known, a model representing the relationship between features and targets, an objective function to evaluate model fit, and optimization to find the best model parameters. It provides examples of using linear regression to predict movie ratings and house prices based on various input features. The key questions of how to parametrize the model, define the objective function, and optimize for generalization on new data are also discussed.


Linear Regression

Courtesy: Richard Zemel, Raquel Urtasun and Sanja Fidler
University of Toronto
(Most plots in this lecture are from Bishop's book)

Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 1 / 22


Problems for Today

What should I watch this Friday?
- Goal: Predict movie rating automatically!
- Goal: Predict the price of the house

Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 2 / 22


Regression

What do all these problems have in common?
- Continuous outputs; we'll call these t
  (e.g., a rating: a real number between 0-10, # of followers, house price)

Predicting continuous outputs is called regression.

What do I need in order to predict these outputs?
- Features (inputs); we'll call these x (or x for vectors)
- Training examples: many x^(i) for which t^(i) is known (e.g., many movies for which we know the rating)
- A model: a function that represents the relationship between x and t
- A loss (or cost, or objective) function, which tells us how well our model approximates the training examples
- Optimization: a way of finding the parameters of our model that minimize the loss

Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 3 / 22
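The five ingredients above can be written down as a minimal sketch. The data, model form, and grid-search "optimizer" here are all illustrative stand-ins, not the lecture's method (the lecture develops least squares later):

```python
import numpy as np

# Illustrative data: features (inputs) and known training targets
x = np.array([1.0, 2.0, 3.0])        # features x^(i)
t = np.array([2.1, 3.9, 6.2])        # targets t^(i)

def model(x, w):
    # Model: a function relating x to t (here a line with parameters w)
    w0, w1 = w
    return w0 + w1 * x

def objective(w):
    # Loss: how well the model approximates the training examples
    return np.sum((t - model(x, w)) ** 2)

# Optimization: pick the parameters that make the objective small
# (a crude grid search, just to make the role of optimization concrete)
best_w = min(((w0, w1) for w0 in np.linspace(-1, 1, 21)
                       for w1 in np.linspace(0, 3, 31)),
             key=objective)
```

Any candidate parameters can be compared through `objective`; the "best" model is simply the one the optimizer found with the smallest loss.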
Simple 1-D regression

Circles are data points (i.e., training examples) that are given to us.
The data points are uniform in x, but may be displaced in y:

    t(x) = f(x) + ε

with ε some noise.
In green is the "true" curve, which we don't know.
Goal: We want to fit a curve to these points.

Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 5 / 22
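Data like this can be generated as a short sketch. The choice f(x) = sin(2πx) and the noise level are assumptions (Bishop's book, which these plots come from, uses a sinusoid), not something fixed by the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Assumed "true" curve (the green curve we normally wouldn't know)
    return np.sin(2 * np.pi * x)

# Inputs uniform in x; targets displaced in y by noise eps
x = np.linspace(0.0, 1.0, 10)
eps = rng.normal(scale=0.2, size=x.shape)
t = f(x) + eps
```

The pairs (x, t) play the role of the circles in the plot; fitting a curve means recovering something close to `f` from `t` alone.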
Simple 1-D regression

Key Questions:
- How do we parametrize the model?
- What loss (objective) function should we use to judge the fit?
- How do we optimize fit to unseen test data (generalization)?

Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 6 / 22


Example: Boston Housing data

Estimate the median house price in a neighborhood based on neighborhood statistics.
Look at a first possible attribute (feature): per capita crime rate.
Use this to predict house prices in other neighborhoods.
Is this a good input (attribute) to predict house prices?

Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 7 / 22
Represent the Data

Data is described as pairs D = {(x^(1), t^(1)), ..., (x^(N), t^(N))}
- x ∈ R is the input feature (per capita crime rate)
- t ∈ R is the target output (median house price)
- (i) simply indexes the training examples (we have N in this case)

Here t is continuous, so this is a regression problem.

The model outputs y, an estimate of t:

    y(x) = w0 + w1 x

What type of model did we choose?

Divide the dataset into training and testing examples:
- Use the training examples to construct a hypothesis, or function approximator, that maps x to a predicted y
- Evaluate the hypothesis on the test set

Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 8 / 22
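The split-and-evaluate procedure can be sketched as follows. The crime-rate/price numbers and the weights (w0, w1) below are made up for illustration; they are not fitted yet (fitting comes with least squares):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dataset D = {(x^(i), t^(i))}: crime rate -> median price
x = rng.uniform(0, 30, size=50)
t = 40.0 - 0.8 * x + rng.normal(scale=3.0, size=x.shape)

# Divide the dataset into training and testing examples
idx = rng.permutation(len(x))
train, test = idx[:40], idx[40:]

# A linear hypothesis y(x) = w0 + w1*x with some assumed weights
w0, w1 = 40.0, -0.8
y = lambda x: w0 + w1 * x

# Evaluate the hypothesis on the test set (mean squared error)
mse = np.mean((t[test] - y(x[test])) ** 2)
```

Only the training indices would be used to choose (w0, w1); the held-out test indices estimate how the hypothesis generalizes.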
Noise

A simple model typically does not exactly fit the data:
- the lack of fit can be considered noise

Sources of noise:
- Imprecision in data attributes (input noise, e.g., noise in the per-capita crime rate)
- Errors in data targets (mislabeling, e.g., noise in house prices)
- Additional attributes, not captured by the data attributes, that affect the target values (latent variables). In the example, what else could affect house prices?
- A model that may be too simple to account for the data targets

Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 9 / 22


Least-Squares Regression

Define a model:

    y(x) = function(x, w)
    Linear: y(x) = w0 + w1 x

The standard loss/cost/objective function measures the squared error between y and the true value t:

    A(w) = Σ_{n=1}^{N} [t^(n) − y(x^(n))]^2

For the linear model:

    A(w) = Σ_{n=1}^{N} [t^(n) − (w0 + w1 x^(n))]^2

For a particular hypothesis (y(x) defined by a choice of w, drawn in red), what does the loss represent geometrically?
How do we obtain the weights w = (w0, w1)? Find the w that minimizes the loss A(w).
For the linear model, what kind of a function is A(w)?

Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 10 / 22
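A sketch of the answers: A(w) is a quadratic in (w0, w1), so setting its derivatives to zero gives a small linear system (the normal equations), solved here directly. The toy data is chosen noiseless so the fit is exact:

```python
import numpy as np

def loss(w0, w1, x, t):
    # A(w) = sum_n [t^(n) - (w0 + w1 x^(n))]^2
    return np.sum((t - (w0 + w1 * x)) ** 2)

def fit_least_squares(x, t):
    # Minimize the quadratic A(w): dA/dw0 = dA/dw1 = 0 yields
    # the 2x2 normal equations (X^T X) w = X^T t
    X = np.column_stack([np.ones_like(x), x])  # design matrix [1, x]
    return np.linalg.solve(X.T @ X, X.T @ t)   # w = (w0, w1)

x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.0, 3.0, 5.0, 7.0])   # exactly t = 1 + 2x, no noise
w0, w1 = fit_least_squares(x, t)
# For this noiseless data the fit recovers w0 = 1, w1 = 2
# (up to floating point), and the loss A(w) is essentially zero
```

Because A(w) is a convex quadratic, this stationary point is the global minimum; with noisy data the same two lines return the best-fitting line rather than an exact one.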
