
Lecture 5: Regression

C4B Machine Learning Hilary 2011 A. Zisserman

• Linear regression
• Loss function
• Ridge regression

• Basis functions

• Dual representation and kernels

Regression

• Suppose we are given a training set of N observations

    (x_1, y_1), \ldots, (x_N, y_N)   with   x_i ∈ R^d,  y_i ∈ R

• The regression problem is to estimate f(x) from this data such that

    y_i = f(x_i)
Learning by optimization

• As in the case of classification, learning a regressor can be formulated as an optimization:

Minimize with respect to f ∈ F

    \sum_{i=1}^{N} \ell\big( f(x_i), y_i \big) \;+\; \lambda R(f)

  (first term: loss function; second term: regularization)

• There is a choice of both loss functions and regularization


• e.g. squared loss, SVM “hinge-like” loss
• squared regularizer, lasso regularizer

• Algorithms can be kernelized

Choice of regression function – non-linear basis functions

• The regression function f(x, w) is a non-linear function of x, but linear in w:

    f(x, w) = w_0 + w_1 \phi_1(x) + w_2 \phi_2(x) + \ldots + w_M \phi_M(x) = w^\top \Phi(x)

• For example, for x ∈ R, polynomial regression with \phi_j(x) = x^j:

    f(x, w) = w_0 + w_1 \phi_1(x) + w_2 \phi_2(x) + \ldots + w_M \phi_M(x) = \sum_{j=0}^{M} w_j x^j

  e.g. for M = 3,

    f(x, w) = (w_0, w_1, w_2, w_3) \begin{pmatrix} 1 \\ x \\ x^2 \\ x^3 \end{pmatrix} = w^\top \Phi(x)

  so \Phi : x \to \Phi(x) maps R^1 \to R^4.
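As a concrete aside (not from the lecture), here is a minimal NumPy sketch of the polynomial feature map Φ for 1-D inputs; the helper name `poly_design_matrix` and the example degree are my own choices:

```python
import numpy as np

def poly_design_matrix(x, M):
    """Map 1-D inputs x (shape (N,)) to polynomial features [1, x, x^2, ..., x^M]."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, N=M + 1, increasing=True)   # columns: x^0, x^1, ..., x^M

x = np.array([0.1, 0.4, 0.7])
Phi = poly_design_matrix(x, M=3)    # shape (3, 4): row i is Phi(x_i)^T
```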
• or the basis functions can be Gaussians centred on the training data:

    \phi_j(x) = \exp\big( -(x - x_j)^2 / 2\sigma^2 \big)

  e.g. for 3 points,

    f(x, w) = (w_1, w_2, w_3) \begin{pmatrix} e^{-(x-x_1)^2/2\sigma^2} \\ e^{-(x-x_2)^2/2\sigma^2} \\ e^{-(x-x_3)^2/2\sigma^2} \end{pmatrix} = w^\top \Phi(x)

  so \Phi : x \to \Phi(x) maps R^1 \to R^3.
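Similarly, a small sketch (again my own illustration) of the Gaussian basis-function map, with the basis centres placed on the training inputs; `gauss_design_matrix` and the value of sigma are assumptions for the example:

```python
import numpy as np

def gauss_design_matrix(x, centres, sigma):
    """Features phi_j(x) = exp(-(x - x_j)^2 / (2 sigma^2)), one column per centre x_j."""
    x = np.asarray(x, dtype=float)[:, None]               # shape (N, 1)
    centres = np.asarray(centres, dtype=float)[None, :]   # shape (1, J)
    return np.exp(-(x - centres) ** 2 / (2.0 * sigma ** 2))

x_train = np.array([0.1, 0.4, 0.7])
Phi = gauss_design_matrix(x_train, centres=x_train, sigma=0.334)  # shape (3, 3)
```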

Least squares “ridge regression”

• Cost function – squared loss plus a squared regularizer:

    \tilde{E}(w) = \frac{1}{2} \sum_{i=1}^{N} \big\{ f(x_i, w) - y_i \big\}^2 + \frac{\lambda}{2} \|w\|^2

  (first term: loss function, measuring the error against each target value y_i; second term: regularization)

• Regression function for x (1D):

    f(x, w) = w_0 + w_1 \phi_1(x) + w_2 \phi_2(x) + \ldots + w_M \phi_M(x) = w^\top \Phi(x)

• NB the squared loss arises in Maximum Likelihood estimation for an error model

    y_i = \tilde{y}_i + n_i,   n_i \sim \mathcal{N}(0, \sigma^2)

  where y_i is the measured value and \tilde{y}_i the true value.
Solving for the weights w

Notation: write the target and regressed values as N-vectors

    y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix},
    \qquad
    f = \begin{pmatrix} \Phi(x_1)^\top w \\ \Phi(x_2)^\top w \\ \vdots \\ \Phi(x_N)^\top w \end{pmatrix}
      = \Phi w
      = \begin{bmatrix}
          1 & \phi_1(x_1) & \ldots & \phi_M(x_1) \\
          1 & \phi_1(x_2) & \ldots & \phi_M(x_2) \\
          \vdots & & & \vdots \\
          1 & \phi_1(x_N) & \ldots & \phi_M(x_N)
        \end{bmatrix}
        \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_M \end{pmatrix}

Φ is the N × M design matrix.

e.g. for polynomial regression with basis functions up to x^2:

    \Phi w = \begin{bmatrix}
               1 & x_1 & x_1^2 \\
               1 & x_2 & x_2^2 \\
               \vdots & \vdots & \vdots \\
               1 & x_N & x_N^2
             \end{bmatrix}
             \begin{pmatrix} w_0 \\ w_1 \\ w_2 \end{pmatrix}

In this notation the cost function becomes

    \tilde{E}(w) = \frac{1}{2} \sum_{i=1}^{N} \big\{ f(x_i, w) - y_i \big\}^2 + \frac{\lambda}{2} \|w\|^2
                 = \frac{1}{2} \sum_{i=1}^{N} \big( y_i - w^\top \Phi(x_i) \big)^2 + \frac{\lambda}{2} \|w\|^2
                 = \frac{1}{2} \| y - \Phi w \|^2 + \frac{\lambda}{2} \|w\|^2

Now, compute where the derivative w.r.t. w is zero for the minimum:

    \frac{d\tilde{E}(w)}{dw} = -\Phi^\top (y - \Phi w) + \lambda w = 0

Hence

    \big( \Phi^\top \Phi + \lambda I \big) w = \Phi^\top y

    w = \big( \Phi^\top \Phi + \lambda I \big)^{-1} \Phi^\top y

With M basis functions and N data points (assume N > M), the dimensions are: w is M×1, \big( \Phi^\top \Phi + \lambda I \big)^{-1} is M×M, \Phi^\top is M×N, and y is N×1.


• This shows that there is a unique solution.

• If λ = 0 (no regularization), then

    w = (\Phi^\top \Phi)^{-1} \Phi^\top y = \Phi^+ y

  where \Phi^+ is the pseudo-inverse of Φ (pinv in Matlab); a small NumPy sketch follows this list.

• Adding the term λI improves the conditioning of the inverse, since even if Φ is not full rank, (\Phi^\top \Phi + \lambda I) will be full rank (for sufficiently large λ).

• As λ → ∞, w → \frac{1}{\lambda} \Phi^\top y → 0.

• Often the regularization is applied only to the inhomogeneous part of w, i.e. to \tilde{w}, where w = (w_0, \tilde{w}).
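A minimal NumPy sketch of this closed-form primal solution (my own illustration; the synthetic data and the `ridge_weights` helper are assumptions, and `np.linalg.solve` is used instead of an explicit inverse for numerical stability):

```python
import numpy as np

def ridge_weights(Phi, y, lam):
    """Solve (Phi^T Phi + lam*I) w = Phi^T y for the primal weights w."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

# Example: fit a degree-3 polynomial to a few noisy points.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 9)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)  # illustrative data only
Phi = np.vander(x, N=4, increasing=True)                       # columns 1, x, x^2, x^3
w = ridge_weights(Phi, y, lam=1e-3)

# With lam = 0 this reduces to the pseudo-inverse solution:
w_pinv = np.linalg.pinv(Phi) @ y
```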

    w = \big( \Phi^\top \Phi + \lambda I \big)^{-1} \Phi^\top y

    f(x, w) = w^\top \Phi(x) = \Phi(x)^\top w
            = \Phi(x)^\top \big( \Phi^\top \Phi + \lambda I \big)^{-1} \Phi^\top y
            = b(x)^\top y

The output is a linear blend, b(x), of the training values {y_i}.
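To make the “linear blend” reading concrete, a short sketch (my own, reusing the setup of the previous sketch) that forms b(x)^T for a batch of query points:

```python
import numpy as np

def blend_matrix(Phi_query, Phi, lam):
    """Rows are b(x)^T = Phi(x)^T (Phi^T Phi + lam*I)^{-1} Phi^T for each query x."""
    M = Phi.shape[1]
    A = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T)  # shape (M, N)
    return Phi_query @ A                                       # shape (Q, N)

# Prediction as a blend of the training targets:
#   B = blend_matrix(Phi_query, Phi, lam);  f_query = B @ y
# which matches Phi_query @ w from the closed-form weights above.
```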
Example 1: polynomial basis functions

[Figure: “Ideal fit” – the true curve shown with the noisy sample points over x ∈ [0, 1].]

• The red curve is the true function (which is not a polynomial).

• The data points are samples from the curve with added noise in y.

• There is a choice in both the degree, M, of the basis functions used, and in the strength of the regularization.

    f(x, w) = \sum_{j=0}^{M} w_j x^j = w^\top \Phi(x), \qquad \Phi : x \to \Phi(x), \; R \to R^{M+1}

  w is an (M+1)-dimensional vector.

N = 9 samples, M = 7

[Figure: four panels with lambda = 100, 0.001, 1e-010 and 1e-015, each showing the sample points, the ideal fit and the regularized polynomial fit over x ∈ [0, 1].]
M = 3 and M = 5: least-squares fits

[Figure: two panels showing the sample points, the ideal fit and the least-squares solution for M = 3 and M = 5, followed by two “Polynomial basis functions” panels plotted on much larger y-scales.]
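The slides do not state the underlying true function, so any sketch of this experiment has to assume one; below sin(2πx) is used purely as an illustrative stand-in for the red curve, and the noise level is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
true_f = lambda x: np.sin(2 * np.pi * x)    # assumed stand-in for the true curve

N, M = 9, 7
x_train = np.linspace(0, 1, N)
y_train = true_f(x_train) + 0.15 * rng.standard_normal(N)

x_test = np.linspace(0, 1, 200)
Phi_train = np.vander(x_train, N=M + 1, increasing=True)
Phi_test = np.vander(x_test, N=M + 1, increasing=True)

for lam in [100, 1e-3, 1e-10, 1e-15]:
    w = np.linalg.solve(Phi_train.T @ Phi_train + lam * np.eye(M + 1),
                        Phi_train.T @ y_train)
    f_test = Phi_test @ w
    print(lam, float(np.max(np.abs(f_test))))   # smaller lam -> less smoothing, more oscillatory fits
```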

Example 2: Gaussian basis functions

[Figure: “Ideal fit” – the true curve shown with the noisy sample points over x ∈ [0, 1].]

• The red curve is the true function (which is not a polynomial).

• The data points are samples from the curve with added noise in y.

• Basis functions are centred on the training data (N points).

• There is a choice in both the scale, sigma, of the basis functions used, and in the strength of the regularization.

    f(x, w) = \sum_{i=1}^{N} w_i \, e^{-(x - x_i)^2/\sigma^2} = w^\top \Phi(x), \qquad \Phi : x \to \Phi(x), \; R \to R^{N}

  w is an N-vector.
N = 9 samples, sigma = 0.334

[Figure: four panels with lambda = 100, 0.001, 1e-010 and 1e-015, each showing the sample points, the ideal fit and the Gaussian-basis fit over x ∈ [0, 1].]

Choosing lambda using a validation set

[Figure: left – training and validation error norms plotted against log λ, with the minimum of the validation error marked; right – the fit obtained with the λ selected on the validation set, shown with the sample points and the ideal fit.]
[Figure: validation-set fits for sigma = 0.334 and sigma = 0.1, each shown with the sample points and the ideal fit, followed by two “Gaussian basis functions” panels plotted on very different y-scales.]
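A sketch of this model-selection loop (my own illustration, reusing the assumed sin(2πx) data and Gaussian basis from the earlier sketches; the λ grid and the train/validation split are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
true_f = lambda x: np.sin(2 * np.pi * x)    # assumed true curve, as before
sigma = 0.334

x_train = np.linspace(0, 1, 9)
y_train = true_f(x_train) + 0.15 * rng.standard_normal(9)
x_val = rng.uniform(0, 1, 20)
y_val = true_f(x_val) + 0.15 * rng.standard_normal(20)

def gauss_phi(x, centres, sigma):
    return np.exp(-(x[:, None] - centres[None, :]) ** 2 / sigma ** 2)

Phi_tr = gauss_phi(x_train, x_train, sigma)
Phi_va = gauss_phi(x_val, x_train, sigma)

best = None
for lam in 10.0 ** np.arange(-10.0, 2.0):   # sweep lambda on a log grid
    w = np.linalg.solve(Phi_tr.T @ Phi_tr + lam * np.eye(len(x_train)),
                        Phi_tr.T @ y_train)
    err = np.linalg.norm(Phi_va @ w - y_val)   # validation error norm
    if best is None or err < best[0]:
        best = (err, lam, w)

print("selected lambda:", best[1])
```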

Summary and preview


So far we have considered the primal problem, where

    f(x, w) = \sum_{i=1}^{M} w_i \phi_i(x) = w^\top \Phi(x)

and we wanted a solution for w ∈ R^M.

Now we will consider the dual problem, where

    w = \sum_{i=1}^{N} a_i \Phi(x_i)

and we want a solution for a ∈ R^N.

We will see that

• there is a closed form solution for a,

• the solution involves the N × N Gram matrix k(x_i, x_j) = \Phi(x_i)^\top \Phi(x_j),

• so we can use the kernel trick again to replace scalar products


Dual Representation

    \tilde{E}(w) = \frac{1}{2} \sum_{i=1}^{N} \big( y_i - w^\top \Phi(x_i) \big)^2 + \frac{\lambda}{2} \|w\|^2,
    \qquad \Phi : x \to \Phi(x), \; R \to R^M

and from the derivative w.r.t. w:

    \frac{d\tilde{E}(w)}{dw} = -\sum_{i=1}^{N} \big( y_i - w^\top \Phi(x_i) \big) \Phi(x_i) + \lambda w = 0

Hence

    w = \sum_{i=1}^{N} \frac{y_i - w^\top \Phi(x_i)}{\lambda} \, \Phi(x_i) = \sum_{i=1}^{N} a_i \Phi(x_i)

Again the vector w can be written as a linear combination of the training data (the Representer Theorem).
    w = \sum_{i=1}^{N} a_i \Phi(x_i)
      = \big[ \Phi(x_1) \;\; \Phi(x_2) \;\; \ldots \;\; \Phi(x_N) \big]
        \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_N \end{pmatrix}
      = \Phi^\top a        (assume N > M)

where Φ is the N × M design matrix; dimensions: w is M×1, \Phi^\top is M×N, a is N×1.

Substitute w = \Phi^\top a into

    \tilde{E}(w) = \frac{1}{2} \| y - \Phi w \|^2 + \frac{\lambda}{2} \|w\|^2
                 = \frac{1}{2} \| y - \Phi \Phi^\top a \|^2 + \frac{\lambda}{2} a^\top \Phi \Phi^\top a
                 = \frac{1}{2} \| y - K a \|^2 + \frac{\lambda}{2} a^\top K a

where K = \Phi \Phi^\top is the N × N kernel Gram matrix with entries k(x_i, x_j) = \Phi(x_i)^\top \Phi(x_j). Minimize \tilde{E}(a) w.r.t. a to show that

    a = (K + \lambda I)^{-1} y        (Exercise)

Dimensions: a is N×1, (K + \lambda I)^{-1} is N×N, y is N×1.

• The dual version involves inverting an N × N matrix, cf. inverting an M × M matrix in the primal.


    f(x, w) = w^\top \Phi(x) = \Phi(x)^\top w
            = \Phi(x)^\top \Phi^\top a
            = \Phi(x)^\top \big[ \Phi(x_1) \;\; \Phi(x_2) \;\; \ldots \;\; \Phi(x_N) \big] a
            = \big( \Phi(x)^\top \Phi(x_1) \;\; \Phi(x)^\top \Phi(x_2) \;\; \ldots \;\; \Phi(x)^\top \Phi(x_N) \big) a
            = \big( k(x, x_1) \;\; k(x, x_2) \;\; \ldots \;\; k(x, x_N) \big) a

Write k(x) = \big( k(x, x_1) \; k(x, x_2) \; \ldots \; k(x, x_N) \big)^\top, then

    f(x) = k(x)^\top a = k(x)^\top (K + \lambda I)^{-1} y

• Again, see that the output is a linear blend, b(x)^\top y, of the training values {y_i}, where

    b(x)^\top = k(x)^\top (K + \lambda I)^{-1}

• All the advantages of kernels: it is only necessary to provide k(x_i, x_j) rather than compute Φ(x) explicitly.
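A compact kernel ridge regression sketch of this dual solution (my own illustration; the Gaussian kernel, its bandwidth and the synthetic data are assumptions):

```python
import numpy as np

def gauss_kernel(X1, X2, sigma=0.334):
    """k(x, x') = exp(-(x - x')^2 / (2 sigma^2)) for 1-D inputs."""
    return np.exp(-(X1[:, None] - X2[None, :]) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(3)
x_train = np.linspace(0, 1, 9)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(9)  # illustrative data
lam = 1e-3

K = gauss_kernel(x_train, x_train)                            # N x N Gram matrix
a = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)  # dual coefficients a = (K + lam I)^{-1} y

x_test = np.linspace(0, 1, 200)
k_test = gauss_kernel(x_test, x_train)    # row q holds k(x_q)^T = (k(x_q, x_1), ..., k(x_q, x_N))
f_test = k_test @ a                       # f(x) = k(x)^T (K + lam I)^{-1} y
```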

Example: 3D Human Pose from Silhouettes

Objective: Recover 3D human body pose from image silhouettes


• 3D pose = joint angles

Applications:
• motion capture, resynthesis
• human-computer interaction
• action recognition
• visual surveillance

Ankur Agarwal and Bill Triggs, ICML 2004


Silhouette descriptors to represent input

Use Shape Context Histograms – distributions of local shape context responses

Regression for pose vector y

Predict the 3D pose y given the shape context histogram x as

    y = A \, k(x)

where

• k(x) = \big( k(x, x_1), k(x, x_2), \ldots, k(x, x_N) \big)^\top is a vector of scalar basis functions,

• A = (a_1, a_2, \ldots, a_N) is a matrix of dual vectors a_j.

(Compare with the scalar version y = \sum_{i=1}^{N} a_i \, k(x_i, x).)

Learn A from the training data {x_i, y_i} by optimizing the cost function

    \min_A \; \sum_{i=1}^{N} \| y_i - A \, k(x_i) \|^2 + \lambda \, \mathrm{trace}(A^\top A)
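For concreteness, a hedged sketch of a closed-form solution to this cost (my own least-squares reading of the objective, not necessarily the training procedure actually used by Agarwal and Triggs; `fit_pose_regressor` is a hypothetical helper):

```python
import numpy as np

def fit_pose_regressor(K, Y, lam):
    """Minimise sum_i ||y_i - A k(x_i)||^2 + lam * trace(A^T A) in closed form.

    K : (N, N) Gram matrix with K[i, j] = k(x_i, x_j)
    Y : (N, m) matrix whose rows are the target pose vectors y_i
    Returns A of shape (m, N), so a new pose is predicted as A @ k_x.
    """
    N = K.shape[0]
    # Normal equations: A (K K^T + lam I) = Y^T K^T, with K symmetric here.
    return np.linalg.solve(K @ K.T + lam * np.eye(N), K @ Y).T

# Prediction for a new descriptor x: y_hat = A @ k_x, where k_x[i] = k(x, x_i).
```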
Training and test data

[Figure: training-data pipeline – video recordings, motion capture data (providing the pose targets y), re-rendered poses, re-rendered silhouettes.]

Results: Synthetic Spiral Walk Test Sequence

• Mean angular error per d.o.f. = 6.0°

• Instances of error due to ambiguities: 15% of instances, when the complete sequence is estimated from individual images.

Ambiguities in pose reconstruction

The silhouette-to-pose problem is inherently multi-valued; add tracking to disambiguate.

Tracking results

Tracking a real motion sequence

1. Preprocessing
   a) Background subtraction
   b) Shadow removal for silhouette extraction

2. Regression
   - obtain 3D body joint angles as output
   - these can be used to render synthetic models …

Background reading

• Bishop, chapters 3 & 6.1

• More on the web page: http://www.robots.ox.ac.uk/~az/lectures/ml

• e.g. Gaussian process regression
