Week 4 Linear Regression
Nigel Goddard
School of Informatics
Semester 1
Overview
The Regression Problem
Examples of regression problems
The Linear Model
I Linear model
f(x; w) = w0 + w1 x1 + . . . + wD xD = φ(x)w
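A minimal NumPy sketch of this prediction; the weights and the input below are made-up values for illustration.

import numpy as np

# Hypothetical weights and input: w0 is the intercept, w1..wD the feature weights.
w = np.array([0.5, 2.0, -1.0])     # (w0, w1, w2) for D = 2 features
x = np.array([1.3, 0.7])           # a single input with D = 2 features

phi = np.concatenate(([1.0], x))   # phi(x) = (1, x1, ..., xD)
f = phi @ w                        # f(x; w) = w0 + w1*x1 + ... + wD*xD
print(f)                           # 0.5 + 2.0*1.3 - 1.0*0.7 = 2.4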
Toy example: Data
[Scatter plot of the toy data: y against x, with x from −3 to 3.]
Toy example: Data
[The toy data shown in two side-by-side panels, y against x for x from −3 to 3.]
With two features
[Scatter plot of data over two input features X1 and X2.]
With more features
PRP = - 56.1
+ 0.049 MYCT
+ 0.015 MMIN
+ 0.006 MMAX
+ 0.630 CACH
- 0.270 CHMIN
+ 1.46 CHMAX
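As a quick check of how such a fitted model is used, here is a small Python sketch that evaluates it on a hypothetical machine; the feature values below are made up for illustration, not taken from the dataset.

# Coefficients from the fitted model above; the machine's feature values are invented.
coef = {"MYCT": 0.049, "MMIN": 0.015, "MMAX": 0.006,
        "CACH": 0.630, "CHMIN": -0.270, "CHMAX": 1.46}
intercept = -56.1

machine = {"MYCT": 29, "MMIN": 8000, "MMAX": 32000,
           "CACH": 32, "CHMIN": 8, "CHMAX": 32}

# Predicted performance: intercept plus a weighted sum of the features.
prp = intercept + sum(coef[name] * machine[name] for name in coef)
print(round(prp, 1))   # about 322.0 for these made-up values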
In matrix notation
I Design matrix is n × (D + 1)
Φ = [ 1  x11  x12  . . .  x1D
      1  x21  x22  . . .  x2D
      ⋮   ⋮    ⋮           ⋮
      1  xn1  xn2  . . .  xnD ]
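A NumPy sketch of building such a design matrix; the small data matrix X is made up for illustration.

import numpy as np

# Toy data matrix X: n = 4 examples, D = 2 features (values are made up).
X = np.array([[0.5, 1.2],
              [1.0, 0.3],
              [1.5, 2.1],
              [2.0, 0.9]])

# Design matrix: prepend a column of ones so that w0 acts as the intercept.
Phi = np.column_stack([np.ones(len(X)), X])   # shape (n, D + 1)
print(Phi.shape)                              # (4, 3)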
Linear Algebra: The 1-Slide Version
What is matrix multiplication? For a matrix A and a vector b,

    A = [ a11  a12  a13       b = [ b1
          a21  a22  a23             b2
          a31  a32  a33 ],          b3 ]

the product Ab is the vector whose ith entry is the dot product of the ith row of A with b, i.e. (Ab)i = ai1 b1 + ai2 b2 + ai3 b3.
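A quick NumPy check of this; the numbers are arbitrary.

import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0],
              [2.0, 0.0, 1.0]])
b = np.array([1.0, 2.0, 3.0])

# Each entry of A @ b is the dot product of one row of A with b.
print(A @ b)                               # [ 5. 11.  5.]
print(np.array([row @ b for row in A]))    # same thing, row by row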
In matrix notation:
I Design matrix is n × (D + 1)
Φ = [ 1  x11  x12  . . .  x1D
      1  x21  x22  . . .  x2D
      ⋮   ⋮    ⋮           ⋮
      1  xn1  xn2  . . .  xnD ]
Solving for Model Parameters
y = Φw
Solving for Model Parameters
y = Φw
Why can we not simply solve y = Φw for w? Three reasons:
I Φ is not square: it is n × (D + 1).
I The system is overconstrained: n equations for only D + 1 parameters.
I The data has noise.
Loss function
ŷ = Φw
Fitting a linear model to data
[Scatter plot over inputs X1, X2 with target Y; vertical black sticks join each point to the fitted plane.]

I A common choice: squared error (makes the maths easy):

        O(w) = Σ_{i=1}^{n} (yi − wT xi)²

I In the picture: this is the sum of the squared lengths of the black sticks.
I (Each one is called a residual, i.e. each yi − wT xi.)

[Surface plot of the error E[w] as a function of w0 and w1. Figure: Tom Mitchell]

How do we do this?
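A small NumPy sketch of computing O(w); the data and the candidate weights are simulated here purely for illustration.

import numpy as np

# Squared-error loss O(w) for a simulated design matrix Phi and targets y.
rng = np.random.default_rng(0)
Phi = np.column_stack([np.ones(20), rng.normal(size=20)])   # rows (1, x_i)
w_true = np.array([1.0, 0.5])
y = Phi @ w_true + rng.normal(scale=0.3, size=20)

def squared_error(w, Phi, y):
    residuals = y - Phi @ w            # each entry is y_i - w^T x_i
    return np.sum(residuals ** 2)

print(squared_error(w_true, Phi, y))
print(squared_error(np.zeros(2), Phi, y))   # a worse w gives a larger loss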
The Solution
I Answer: to minimize O(w) = Σ_{i=1}^{n} (yi − wT xi)², set the partial derivatives to 0.
I This has an analytical solution
ŵ = (ΦT Φ)−1 ΦT y
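A NumPy sketch of this closed-form solution on simulated data; solving the linear system is used in place of forming the inverse explicitly, which is numerically safer.

import numpy as np

# Closed-form least-squares fit: w_hat = (Phi^T Phi)^{-1} Phi^T y.
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=30)
y = 1.0 + 0.5 * x + rng.normal(scale=0.4, size=30)
Phi = np.column_stack([np.ones_like(x), x])

w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)   # avoids an explicit inverse
print(w_hat)                                      # roughly (1.0, 0.5)

# np.linalg.lstsq computes the same least-squares solution.
w_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w_lstsq)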
Probabilistic interpretation of O(w)
        − log p(yi | xi) = log √(2π) + log σ + (yi − wT xi)² / (2σ²)

I So minimising O(w) is equivalent to maximising the likelihood!
I Can view wT x as E[y | x].
I The squared residuals allow estimation of σ²:

        σ̂² = (1/n) Σ_{i=1}^{n} (yi − wT xi)²
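A NumPy sketch of this variance estimate on simulated data; the true σ = 0.5 and the sample size are chosen for illustration.

import numpy as np

# Maximum-likelihood estimate of the noise variance from the residuals.
rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=200)
y = 1.0 + 0.5 * x + rng.normal(scale=0.5, size=200)   # true sigma^2 = 0.25
Phi = np.column_stack([np.ones_like(x), x])

w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
residuals = y - Phi @ w_hat
sigma2_hat = np.mean(residuals ** 2)   # (1/n) * sum of squared residuals
print(sigma2_hat)                      # roughly 0.25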
Fitting this into the general structure for learning algorithms:
Sensitivity to Outliers
I Linear regression is sensitive to outliers.
I Example: Suppose y = 0.5x + ε, where ε ∼ N(0, 0.25), and then add a point (2.5, 3):
[Scatter plot of the simulated data for x between 0 and 5, with the added point at (2.5, 3).]
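A NumPy sketch of this experiment, with data simulated from the model above; the sample size and random seed are arbitrary choices.

import numpy as np

# Effect of a single outlier on the least-squares fit.
rng = np.random.default_rng(3)
x = rng.uniform(0, 5, size=10)
y = 0.5 * x + rng.normal(scale=0.5, size=10)   # noise variance 0.25

def fit(x, y):
    Phi = np.column_stack([np.ones_like(x), x])
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

print(fit(x, y))                                    # roughly (0, 0.5)
print(fit(np.append(x, 2.5), np.append(y, 3.0)))    # pulled toward the outlier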
Diagnostics
Dealing with multiple outputs
Basis expansion
[Three panels showing example basis functions φj(x) for x between −1 and 1.]
I Design matrix is n × m
Φ = [ φ1(x1)  φ2(x1)  . . .  φm(x1)
      φ1(x2)  φ2(x2)  . . .  φm(x2)
        ⋮        ⋮              ⋮
      φ1(xn)  φ2(xn)  . . .  φm(xn) ]
I Let y = (y1 , . . . , yn )T
I Minimize E(w) = |y − Φw|2 . As before we have an
analytical solution
ŵ = (ΦT Φ)−1 ΦT y
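A NumPy sketch of this fit with a hand-picked list of basis functions; the cubic-polynomial basis and the simulated data are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, size=50)
y = np.sin(3 * x) + rng.normal(scale=0.1, size=50)

basis = [lambda x: np.ones_like(x),    # phi_1(x) = 1
         lambda x: x,                  # phi_2(x) = x
         lambda x: x ** 2,             # phi_3(x) = x^2
         lambda x: x ** 3]             # phi_4(x) = x^3

Phi = np.column_stack([phi(x) for phi in basis])   # n x m design matrix
w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)    # (Phi^T Phi)^{-1} Phi^T y
print(w_hat)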
Example: polynomial regression
φ(x) = (1, x, x², . . . , x^M)T
[Four panels: polynomial fits of degree M = 0, M = 1, M = 3 and M = 9 to the same data, t plotted against x.]
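A NumPy sketch of fitting polynomials of several degrees M; the data generator here is simulated for illustration, not the data used in the figure.

import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, size=10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=10)

for M in (0, 1, 3, 9):
    Phi = np.vander(x, M + 1, increasing=True)   # columns 1, x, x^2, ..., x^M
    w_hat, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    train_err = np.sum((t - Phi @ w_hat) ** 2)
    print(M, round(train_err, 4))                # training error shrinks as M grows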
More about the features
More about the features
x1 = 1 if Intel, 0 otherwise
x2 = 1 if AMD, 0 otherwise
. . .
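A small Python sketch of this one-of-K encoding; the category list and the example machines are made up.

import numpy as np

categories = ["Intel", "AMD", "Other"]
cpus = ["AMD", "Intel", "Intel", "Other"]

# One row per example, one indicator column per category.
onehot = np.array([[1.0 if c == cat else 0.0 for cat in categories] for c in cpus])
print(onehot)
# [[0. 1. 0.]
#  [1. 0. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]]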
Radial basis function (RBF) models
RBF example
[Scatter plot of the RBF example data: y against x, for x between 0 and 7.]
RBF example

[Plot for the RBF example, y against x.]
An RBF feature
[Two panels: y against x (left) and y against the RBF feature φ3(x) (right).]
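The slides do not show the formula for this feature; a common choice is a Gaussian bump exp(−(x − c)²/(2h²)), sketched below with an assumed centre and width.

import numpy as np

# A Gaussian RBF feature: a "bump" centred at c with width h.
# The centre and width are assumptions for illustration, not values from the slides.
def rbf(x, c, h):
    return np.exp(-((x - c) ** 2) / (2 * h ** 2))

x = np.linspace(0, 7, 8)
print(rbf(x, c=3.0, h=1.0))   # largest at x = 3, decaying on either side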
Another RBF feature
[Two panels: y against x (left, x between 0 and 7) and y against a second RBF feature (right, feature values roughly 0 to 0.4).]
RBF example
Run the RBF model with both basis functions above and plot the residuals yi − φ(xi)T w.

[Two panels: the data y against x (left) and the residuals of the RBF model against x (right), both for x between 0 and 7.]
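A NumPy sketch of this fit, using two Gaussian RBF features plus a bias; the centres, width and simulated data are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 7, size=40)
y = (np.exp(-((x - 2.0) ** 2) / 2)
     + 0.5 * np.exp(-((x - 5.0) ** 2) / 2)
     + rng.normal(scale=0.1, size=40))

def rbf(x, c, h=1.0):
    return np.exp(-((x - c) ** 2) / (2 * h ** 2))

# Design matrix: a bias column plus the two RBF features.
Phi = np.column_stack([np.ones_like(x), rbf(x, 2.0), rbf(x, 5.0)])
w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

residuals = y - Phi @ w_hat            # y_i - phi(x_i)^T w
print(w_hat)
print(np.mean(residuals ** 2))         # small if the two bumps explain the data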
RBF: Ay, there’s the rub
Summary