Week 4: Linear Regression
Introduction to Applied Machine Learning

IAML: Linear Regression

Nigel Goddard
School of Informatics

Semester 1

1 / 38
Overview

- The linear model
- Fitting the linear model to data
- Probabilistic interpretation of the error function
- Examples of regression problems
- Dealing with multiple outputs
- Generalized linear regression
- Radial basis function (RBF) models

2 / 38
The Regression Problem

- Classification and regression problems:
  - Classification: the target of prediction is discrete
  - Regression: the target of prediction is continuous
- Training data: a set D of pairs (x_i, y_i) for i = 1, ..., n, where x_i ∈ R^D and y_i ∈ R
- Today: linear regression, i.e., the relationship between x and y is linear.
- Although this is simple (and limited), it is:
  - More powerful than you would expect
  - The basis for more complex nonlinear methods
  - A model that teaches a lot about regression and classification

3 / 38
Examples of regression problems

- Robot inverse dynamics: predicting what torques are needed to drive a robot arm along a given trajectory
- Electricity load forecasting: generating hourly forecasts two days in advance
- Predicting staffing requirements at help desks based on historical data and product and sales information
- Predicting the time to failure of equipment based on utilization and environmental conditions

4 / 38
The Linear Model

- Linear model:

      f(x; w) = w_0 + w_1 x_1 + ... + w_D x_D = φ(x) w                (1)

  where φ(x) = (1, x_1, ..., x_D) = (1, x^T) and w = (w_0, w_1, ..., w_D)^T.

- The maths of fitting linear models to data is easy. We use the notation φ(x) to make generalisation easy later.

5 / 38
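
To make equation (1) concrete, here is a minimal NumPy sketch (not part of the slides; the weights and input are made-up values) that forms φ(x) = (1, x^T) and evaluates f(x; w) = φ(x)w:

```python
import numpy as np

# Hypothetical weights for a model with D = 3 features: w = (w0, w1, w2, w3)
w = np.array([0.5, 2.0, -1.0, 0.25])

def predict(x, w):
    """Evaluate f(x; w) = w0 + w1*x1 + ... + wD*xD via phi(x) = (1, x^T)."""
    phi = np.concatenate(([1.0], x))  # prepend the constant 1 for the bias term
    return phi @ w

x = np.array([1.0, 3.0, -2.0])        # a single input with D = 3 features
print(predict(x, w))                   # 0.5 + 2*1 - 1*3 + 0.25*(-2) = -1.0
```
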
Toy example: Data
[Scatter plot of the toy data: y against x, with x from −3 to 3.]
6 / 38
Toy example: Data
[Two scatter-plot panels of the same toy data, y against x (x from −3 to 3).]
7 / 38
With two features


Instead of a line, a plane. With more features, a hyperplane.

[Figure 3.1 (Hastie, Tibshirani, and Friedman): linear least squares fitting with X ∈ R^2. We seek the linear function of X that minimizes the sum of squared residuals from Y.]
8 / 38
With more features

CPU Performance Data Set


- Predict PRP: published relative performance
- MYCT: machine cycle time in nanoseconds (integer)
- MMIN: minimum main memory in kilobytes (integer)
- MMAX: maximum main memory in kilobytes (integer)
- CACH: cache memory in kilobytes (integer)
- CHMIN: minimum channels in units (integer)
- CHMAX: maximum channels in units (integer)

9 / 38
With more features

PRP = - 56.1
+ 0.049 MYCT
+ 0.015 MMIN
+ 0.006 MMAX
+ 0.630 CACH
- 0.270 CHMIN
+ 1.46 CHMAX

10 / 38
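
As a worked example (the machine below is hypothetical; only the coefficients come from the slide), the fitted equation is evaluated like this:

```python
# Coefficients from the fitted model on the slide
coeffs = {"MYCT": 0.049, "MMIN": 0.015, "MMAX": 0.006,
          "CACH": 0.630, "CHMIN": -0.270, "CHMAX": 1.46}

# A hypothetical machine (feature values invented for illustration)
features = {"MYCT": 125, "MMIN": 256, "MMAX": 6000,
            "CACH": 16, "CHMIN": 4, "CHMAX": 8}

prp = -56.1 + sum(coeffs[name] * features[name] for name in coeffs)
print(round(prp, 2))  # predicted published relative performance for this machine
```
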
In matrix notation

- The design matrix is n × (D + 1):

      Φ = [ 1  x_11  x_12  ...  x_1D
            1  x_21  x_22  ...  x_2D
            ⋮
            1  x_n1  x_n2  ...  x_nD ]

- x_ij is the jth component of the training input x_i
- Let y = (y_1, ..., y_n)^T
- Then ŷ = Φw is ...?

11 / 38
Linear Algebra: The 1-Slide Version
What is matrix multiplication? Consider

      A = [ a_11  a_12  a_13          b = [ b_1
            a_21  a_22  a_23                b_2
            a_31  a_32  a_33 ],             b_3 ]

First consider matrix times vector, i.e., Ab. Two answers:

1. Ab is a linear combination of the columns of A:

      Ab = b_1 (a_11, a_21, a_31)^T + b_2 (a_12, a_22, a_32)^T + b_3 (a_13, a_23, a_33)^T

2. Ab is a vector. Each element of the vector is the dot product between b and one row of A:

      Ab = ( (a_11, a_12, a_13) · b,  (a_21, a_22, a_23) · b,  (a_31, a_32, a_33) · b )^T
12 / 38
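
A quick NumPy check of the two views (A and b are arbitrary made-up values; both views agree with A @ b):

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 10.]])
b = np.array([1., 0., 2.])

# View 1: Ab is a linear combination of the columns of A
view1 = b[0] * A[:, 0] + b[1] * A[:, 1] + b[2] * A[:, 2]

# View 2: each element of Ab is the dot product of one row of A with b
view2 = np.array([A[i, :] @ b for i in range(3)])

print(np.allclose(view1, A @ b), np.allclose(view2, A @ b))  # True True
```
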
Linear model (part 2)

In matrix notation:

- The design matrix is n × (D + 1):

      Φ = [ 1  x_11  x_12  ...  x_1D
            1  x_21  x_22  ...  x_2D
            ⋮
            1  x_n1  x_n2  ...  x_nD ]

- x_ij is the jth component of the training input x_i
- Let y = (y_1, ..., y_n)^T
- Then ŷ = Φw is the vector of the model's predicted values on the training inputs.

13 / 38
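
A small sketch of the design matrix in code (random made-up inputs and weights), confirming that ŷ = Φw yields one prediction per training input:

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 5, 3
X = rng.normal(size=(n, D))            # n training inputs, D features each

Phi = np.hstack([np.ones((n, 1)), X])  # design matrix, shape (n, D + 1)
w = rng.normal(size=D + 1)             # some weight vector with D + 1 entries

y_hat = Phi @ w                        # predicted values on the training inputs
print(Phi.shape, y_hat.shape)          # (5, 4) (5,)
```
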
Solving for Model Parameters

This looks like what we’ve seen in linear algebra

y = Φw

We know y and Φ but not w.

So why not take w = Φ^{−1} y? (You can't, but why?)

14 / 38
Solving for Model Parameters

This looks like what we’ve seen in linear algebra

y = Φw

We know y and Φ but not w.

So why not take w = Φ^{−1} y? (You can't, but why?)

Three reasons:

- Φ is not square; it is n × (D + 1).
- The system is overconstrained (n equations for only D + 1 parameters), or put another way,
- the data has noise, so in general no w satisfies all the equations exactly.

15 / 38
Loss function

We want a loss function O(w) such that:

- we can minimize it with respect to w, and
- at the minimum, ŷ looks like y
- (recall that ŷ = Φw depends on w).

16 / 38
Fitting a linear model to data

- A common choice: squared error (it makes the maths easy):

      O(w) = Σ_{i=1}^n (y_i − w^T x_i)^2

- In the picture, this is the sum of the squared lengths of the black sticks.
- Each term y_i − w^T x_i is called a residual.

[Figure 3.1 (Hastie, Tibshirani, and Friedman): linear least squares fitting with X ∈ R^2, minimizing the sum of squared residuals from Y.]

17 / 38
Fitting a linear model to data

- Given a dataset D of pairs (x_i, y_i) for i = 1, ..., n, squared error makes the maths easy:

      E(w) = Σ_{i=1}^n (y_i − f(x_i; w))^2 = (y − Φw)^T (y − Φw)

- We want to minimize this with respect to w.
- The error surface is a parabolic bowl.

[Figure (Tom Mitchell): the error E[w] plotted over (w_0, w_1) forms a parabolic bowl.]

How do we do this?
18 / 38
The Solution

- Answer: to minimize O(w) = Σ_{i=1}^n (y_i − w^T x_i)^2, set the partial derivatives to 0.
- This has an analytical solution (see the sketch below):

      ŵ = (Φ^T Φ)^{−1} Φ^T y

- (Φ^T Φ)^{−1} Φ^T is the pseudo-inverse of Φ.
- First check: does this make sense? Do the matrix dimensions line up?
- Then: why is this called a pseudo-inverse?
- Finally: what happens if there are no features?

19 / 38
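
A sketch of the solution on synthetic data (everything below is made up for illustration). Forming (Φ^T Φ)^{−1} explicitly is rarely done in practice; solving the normal equations or calling np.linalg.lstsq gives the same ŵ more stably:

```python
import numpy as np

rng = np.random.default_rng(1)
n, D = 100, 2
X = rng.normal(size=(n, D))
w_true = np.array([1.0, 2.0, -3.0])              # (w0, w1, w2), invented
Phi = np.hstack([np.ones((n, 1)), X])
y = Phi @ w_true + 0.1 * rng.normal(size=n)      # exact linear model plus noise

# Normal equations: (Phi^T Phi) w_hat = Phi^T y
w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Numerically preferable route to the same answer
w_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)

print(w_hat, np.allclose(w_hat, w_lstsq))        # close to w_true, and True
```
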
Probabilistic interpretation of O(w)

- Assume that y = w^T x + ε, where ε ∼ N(0, σ^2).
- (This is an exact linear relationship plus Gaussian noise.)
- This implies that y_i | x_i ∼ N(w^T x_i, σ^2), i.e.

      − log p(y_i | x_i) = log √(2π) + log σ + (y_i − w^T x_i)^2 / (2σ^2)

- So minimising O(w) is equivalent to maximising the likelihood!
- We can view w^T x as E[y | x].
- The squared residuals allow estimation of σ^2:

      σ̂^2 = (1/n) Σ_{i=1}^n (y_i − w^T x_i)^2

20 / 38
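
Continuing in the same spirit, a short sketch (again with invented data) of estimating σ^2 from the squared residuals:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(-3, 3, size=n)
y = 1.5 * x + 0.5 + rng.normal(scale=0.4, size=n)  # invented true model, noise sd 0.4

Phi = np.column_stack([np.ones(n), x])
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)

residuals = y - Phi @ w_hat
sigma2_hat = residuals @ residuals / n             # (1/n) * sum of squared residuals
print(sigma2_hat)                                  # close to 0.4**2 = 0.16
```
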
Fitting this into the general structure for learning algorithms:

- Define the task: regression
- Decide on the model structure: the linear regression model
- Decide on the score function: squared error (likelihood)
- Decide on the optimization/search method to optimize the score function: calculus (analytic solution)

21 / 38
Sensitivity to Outliers
- Linear regression is sensitive to outliers.
- Example: suppose y = 0.5x + ε, where ε ∼ N(0, √0.25), and then add a point at (2.5, 3):

[Scatter plot of the simulated data, x from 0 to 5, together with the added point at (2.5, 3).]
22 / 38
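
A rough reconstruction of this experiment (the exact data behind the plot aren't available; the noise level below assumes the 0.25 on the slide is a variance, i.e. standard deviation 0.5):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 5, size=20)
y = 0.5 * x + rng.normal(scale=0.5, size=len(x))   # y = 0.5x + Gaussian noise

def fit_line(x, y):
    """Ordinary least squares fit of y = w0 + w1*x; returns (w0, w1)."""
    Phi = np.column_stack([np.ones(len(x)), x])
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

print("without outlier:", fit_line(x, y))
x_out = np.append(x, 2.5)                          # add the outlying point (2.5, 3)
y_out = np.append(y, 3.0)
print("with outlier:   ", fit_line(x_out, y_out))  # intercept and slope shift noticeably
```
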
Diagnostics

Graphical diagnostics can be useful for checking:

- Is the relationship obviously nonlinear? Look for structure in the residuals.
- Are there obvious outliers?

The goal isn't to find all problems. You can't. The goal is to find obvious, embarrassing problems.

Example: plot the residuals against the fitted values (see the sketch below). Stats packages will do this for you.

23 / 38
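
A minimal residuals-versus-fitted plot with matplotlib (synthetic data, deliberately nonlinear so the diagnostic has something to show; a stats package would produce an equivalent plot):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.3 * x**2 + rng.normal(scale=1.0, size=len(x))  # truth is quadratic

Phi = np.column_stack([np.ones(len(x)), x])                # fit a straight line anyway
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
fitted = Phi @ w

plt.scatter(fitted, y - fitted)       # curvature in this plot reveals the misfit
plt.axhline(0, color="grey")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()
```
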
Dealing with multiple outputs

- Suppose there are q different targets for each input x.
- We introduce a different weight vector w_i for each target dimension and do the regression separately for each one.
- This is called multiple regression.

24 / 38
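
In code, the q separate regressions can be solved in one call by stacking the targets into an n × q matrix Y; each column of the resulting weight matrix is the solution for one output (a sketch with made-up data):

```python
import numpy as np

rng = np.random.default_rng(5)
n, D, q = 50, 3, 2
X = rng.normal(size=(n, D))
Phi = np.hstack([np.ones((n, 1)), X])
W_true = rng.normal(size=(D + 1, q))               # one weight column per output
Y = Phi @ W_true + 0.05 * rng.normal(size=(n, q))  # q noisy targets per input

W_hat, *_ = np.linalg.lstsq(Phi, Y, rcond=None)    # shape (D + 1, q)
print(np.allclose(W_hat, W_true, atol=0.1))        # True: each column recovered
```
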
Basis expansion

- We can easily transform the original attributes x non-linearly into φ(x) and do linear regression on the transformed features.

[Figure (Chris Bishop, PRML): three families of basis functions on [−1, 1]: polynomials, Gaussians, and sigmoids.]

25 / 38
- The design matrix is now n × m:

      Φ = [ φ_1(x_1)  φ_2(x_1)  ...  φ_m(x_1)
            φ_1(x_2)  φ_2(x_2)  ...  φ_m(x_2)
            ⋮
            φ_1(x_n)  φ_2(x_n)  ...  φ_m(x_n) ]

- Let y = (y_1, ..., y_n)^T.
- Minimize E(w) = |y − Φw|^2. As before we have an analytical solution:

      ŵ = (Φ^T Φ)^{−1} Φ^T y

- (Φ^T Φ)^{−1} Φ^T is the pseudo-inverse of Φ.

26 / 38
Example: polynomial regression
φ(x) = (1, x, x^2, ..., x^M)^T

[Figure (Chris Bishop, PRML): polynomial fits of order M = 0, 1, 3 and 9 to the same toy dataset.]


27 / 38
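
A sketch of polynomial regression through this basis expansion (the toy data here are generated to resemble the PRML example, not taken from it):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, size=10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=len(x))  # noisy sinusoid

def poly_fit(x, t, M):
    """Least-squares fit of an order-M polynomial via the design matrix Phi."""
    Phi = np.vander(x, M + 1, increasing=True)  # columns 1, x, x^2, ..., x^M
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

for M in (0, 1, 3, 9):
    w = poly_fit(x, t, M)
    Phi = np.vander(x, M + 1, increasing=True)
    print(M, np.sum((t - Phi @ w) ** 2))  # training error shrinks as M grows
```
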
More about the features

- Transforming the features can be important.
- Example: suppose I want to predict CPU performance.
- Maybe one of the features is the manufacturer:

      x_1 = 1 if Intel, 2 if AMD, 3 if Apple, 4 if Motorola

- Let's use this as a feature. Will this work?

28 / 38
More about the features

- Transforming the features can be important.
- Example: suppose I want to predict CPU performance.
- Maybe one of the features is the manufacturer:

      x_1 = 1 if Intel, 2 if AMD, 3 if Apple, 4 if Motorola

- Let's use this as a feature. Will this work?
- Not the way you want. Do you really believe AMD is double Intel?

29 / 38
More about the features

- Instead, convert this into 0/1 indicator features (see the sketch below):

      x_1 = 1 if Intel, 0 otherwise
      x_2 = 1 if AMD, 0 otherwise
      ...

- Note this is a consequence of linearity. We saw something similar with text in the first week.
- Other good transformations: log, square, square root.

30 / 38
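
A minimal sketch of the 0/1 (one-hot) encoding, using the manufacturer categories from the slide:

```python
import numpy as np

MANUFACTURERS = ["Intel", "AMD", "Apple", "Motorola"]

def one_hot(value, categories=MANUFACTURERS):
    """Return a 0/1 indicator vector with one position per category."""
    vec = np.zeros(len(categories))
    vec[categories.index(value)] = 1.0
    return vec

print(one_hot("AMD"))       # [0. 1. 0. 0.]
print(one_hot("Motorola"))  # [0. 0. 0. 1.]
```
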
Radial basis function (RBF) models

- Set φ_i(x) = exp(−(1/2) |x − c_i|^2 / α^2).
- We need to position these "basis functions" at previously chosen centres c_i, with a given width α. There are many ways to do this, but choosing a subset of the datapoints as centres is one method that is quite effective.
- Finding the weights is the same as ever: the pseudo-inverse solution (see the sketch below).

31 / 38
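
A sketch of this recipe on invented one-dimensional data: pick a couple of centres, build the RBF design matrix (with a bias column), and reuse the same least-squares solution as before:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 7, size=40)
y = (np.exp(-0.5 * (x - 3) ** 2)
     + 1.5 * np.exp(-0.5 * (x - 6) ** 2)
     + rng.normal(scale=0.1, size=len(x)))        # made-up target plus noise

def rbf_design(x, centres, alpha):
    """Phi[i, j] = exp(-0.5 * |x_i - c_j|^2 / alpha^2), plus a leading bias column."""
    diffs = x[:, None] - centres[None, :]
    Phi = np.exp(-0.5 * diffs ** 2 / alpha ** 2)
    return np.hstack([np.ones((len(x), 1)), Phi])

centres = np.array([3.0, 6.0])                    # e.g. a chosen subset of the inputs
Phi = rbf_design(x, centres, alpha=1.0)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)       # pseudo-inverse / least-squares fit

print(w)                                          # bias weight plus one weight per centre
```
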
RBF example

[Scatter plot of the RBF example data: y against x, with x from 0 to 7 and y roughly between −0.5 and 2.]

32 / 38
RBF example

[The same data again, plotted as y against x (x from 0 to 7).]
33 / 38
An RBF feature

[Two panels: the original data (y against x) and y plotted against the RBF feature with c_1 = 3, α_1 = 1.]
34 / 38
Another RBF feature

Notice how the feature functions "specialize" in input space.

[Two panels: the original data (y against x) and y plotted against the RBF feature with c_2 = 6, α_2 = 1.]
35 / 38
RBF example
Run the RBF regression with both basis functions above and plot the residuals y_i − φ(x_i)^T w.

[Two panels: the original data (y against x) and the residuals of the RBF model plotted against x.]
36 / 38
RBF: Ay, there’s the rub

- So why not use RBFs for everything?
- Short answer: you might need too many basis functions.
- This is especially true in high dimensions (we'll say more later).
- Too many basis functions means you probably overfit.
- Extreme example: centre one on each training point.
- Also: notice that we haven't yet seen in the course how to learn the RBF parameters, i.e., the mean and standard deviation of each kernel.
- The main point of presenting RBFs now is to set up later methods like support vector machines.

37 / 38
Summary

- Linear regression is often useful out of the box.
- It is more useful than it might seem, because "linear" means linear in the parameters. You can apply a nonlinear transform to the data first, e.g., polynomial or RBF features. This point will come up again.
- The maximum likelihood solution is computationally efficient (pseudo-inverse).
- There is a danger of overfitting, especially with many features or basis functions.

38 / 38
