
Lecture 6: Regression continued

C4B Machine Learning Hilary 2011 A. Zisserman

• Lasso
• L1 regularization
• other regularizers

• SVM regression
• epsilon-insensitive loss

• More loss functions

Regression

• Suppose we are given a training set of N observations

  $(x_1, y_1), \ldots, (x_N, y_N)$ with $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$

• The regression problem is to estimate $f(x)$ from this data such that

  $y_i = f(x_i)$
Regression cost functions

Minimize with respect to w:

$$\sum_{i=1}^{N} \ell\left(f(x_i, w), y_i\right) + \lambda R(w)$$

(loss function + regularization)

• There is a choice of both the loss function and the regularizer


• So far we have seen "ridge" regression:

  • squared loss: $\sum_{i=1}^{N} \left(y_i - f(x_i, w)\right)^2$

  • squared regularizer: $\lambda \|w\|^2$
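
As a concrete reference point, here is a minimal numpy sketch of ridge regression with the closed-form solution; the data sizes, noise level, and λ value below are assumptions for illustration only, not from the lecture.

```python
import numpy as np

# Ridge regression sketch: minimize sum_i (y_i - w.Phi(x_i))^2 + lam * ||w||^2.
# The closed-form solution is w = (Phi^T Phi + lam * I)^{-1} Phi^T y.

rng = np.random.default_rng(0)
N, d = 20, 3                       # illustrative sizes (assumed)
Phi = rng.normal(size=(N, d))      # design matrix: one row Phi(x_i) per observation
w_true = np.array([1.0, -2.0, 0.5])
y = Phi @ w_true + 0.1 * rng.normal(size=N)

lam = 0.1                          # regularization weight lambda (assumed)
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)
print(w_ridge)                     # shrunk towards zero, but generally not exactly zero
```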


• Now, consider other losses and regularizers

The "Lasso" or L1 norm regularization

• LASSO = Least Absolute Shrinkage and Selection Operator

Minimize with respect to $w \in \mathbb{R}^d$:

$$\sum_{i=1}^{N} \left(y_i - f(x_i, w)\right)^2 + \lambda \sum_{j=1}^{d} |w_j|$$

(loss function + regularization)

• This is a quadratic optimization problem

• There is a unique solution

• p-norm definition: $\|w\|_p = \left( \sum_{j=1}^{d} |w_j|^p \right)^{1/p}$
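
For reference, a minimal sketch of minimizing this objective numerically with scikit-learn's Lasso and Ridge estimators; the data are synthetic and the regularization strength is arbitrary. Note that scikit-learn's Lasso scales the squared loss by 1/(2N), so its alpha plays the role of λ only up to that scaling.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (assumed): only 2 of 10 features are actually relevant.
rng = np.random.default_rng(1)
N, d = 50, 10
X = rng.normal(size=(N, d))
w_true = np.zeros(d)
w_true[[0, 3]] = [2.0, -1.5]
y = X @ w_true + 0.1 * rng.normal(size=N)

# scikit-learn's Lasso minimizes (1/(2N)) * ||y - Xw||^2 + alpha * ||w||_1.
lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print("lasso coefficients:", np.round(lasso.coef_, 3))  # many entries exactly zero
print("ridge coefficients:", np.round(ridge.coef_, 3))  # small but non-zero entries
```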
Sparsity property of the Lasso

• Contour plots for d = 2: the loss $\sum_{i=1}^{N} (y_i - f(x_i, w))^2$ together with the regularizer, $\lambda \|w\|^2$ for ridge regression and $\lambda \sum_{j=1}^{d} |w_j|$ for the lasso

• The minimum occurs where the loss contours are tangent to the regularizer's contours

• For the lasso case, minima occur at the "corners" of the regularizer's contours

• Consequently one of the weights is zero

• In high dimensions many weights can be zero
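
The exact zeros can also be seen in one dimension, where both problems have closed-form solutions; the sketch below (with made-up observations and λ) contrasts the lasso's soft-thresholding with ridge's proportional shrinkage.

```python
import numpy as np

# One-dimensional illustration of why the lasso produces exact zeros.
# For min_w (y - w)^2 + lam * |w|  the solution is soft-thresholding:
#     w* = sign(y) * max(|y| - lam/2, 0)   (exactly zero when |y| <= lam/2)
# For min_w (y - w)^2 + lam * w^2  the solution is proportional shrinkage:
#     w* = y / (1 + lam)                   (never exactly zero for y != 0)

def lasso_1d(y, lam):
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)

def ridge_1d(y, lam):
    return y / (1.0 + lam)

y = np.array([-2.0, -0.3, 0.1, 0.5, 3.0])   # illustrative "observations" (assumed)
lam = 1.0
print("lasso:", lasso_1d(y, lam))   # small entries are set exactly to zero
print("ridge:", ridge_1d(y, lam))   # all entries shrunk, none exactly zero
```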
Example: Lasso for polynomial basis function regression

• The red curve is the true function (which is not a polynomial)

• The data points are samples from the curve with added noise in y

• N = 9, M = 7

$$f(x, w) = \sum_{j=0}^{M} w_j x^j = w^\top \Phi(x)$$

where w is an (M+1)-dimensional vector

[Figure: "ideal fit": sample points and the ideal fit curve, y against x]
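
A minimal sketch of how this experiment could be set up; the true curve, noise level, and regularization strength below are assumptions, since the lecture specifies only N = 9 and M = 7.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Polynomial basis regression with a lasso penalty, mirroring the N = 9, M = 7 setup.
rng = np.random.default_rng(2)
N, M = 9, 7
x = np.linspace(0.0, 1.0, N)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=N)   # assumed "true" curve + noise

# Design matrix Phi with columns 1, x, x^2, ..., x^M (so w has M + 1 entries).
Phi = np.vander(x, M + 1, increasing=True)

lasso = Lasso(alpha=1e-3, fit_intercept=False, max_iter=100000).fit(Phi, y)
print("fitted weights:", np.round(lasso.coef_, 3))      # several weights driven to zero
```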

[Figures: "Variation of weights with log λ" for ridge regression and for the lasso, showing the weights $w_j$ plotted against log λ, with detail views at small λ]
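
One way such weight-vs-λ paths can be generated is to refit over a grid of λ values; a minimal sketch with synthetic data follows (the λ grid and data are assumptions).

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Trace how the fitted weights vary as the regularization weight lambda changes.
rng = np.random.default_rng(3)
N, d = 40, 6
X = rng.normal(size=(N, d))
w_true = np.array([3.0, -2.0, 0.0, 0.0, 1.0, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=N)

lambdas = np.logspace(-4, 1, 30)
lasso_path = np.array([Lasso(alpha=lam, max_iter=100000).fit(X, y).coef_ for lam in lambdas])
ridge_path = np.array([Ridge(alpha=lam).fit(X, y).coef_ for lam in lambdas])

# The lasso weights become exactly zero one by one as lambda grows,
# whereas the ridge weights shrink smoothly without reaching zero.
print(lasso_path[-1], ridge_path[-1])
```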
Second example – lasso in action

[Figure: fitted weights plotted against the regularization parameter λ, for 0 ≤ λ ≤ 1.5]

Sparse weight vectors

• Weights being zero is a method of "feature selection": zeroing out the unimportant features

• The SVM classifier also has this property (sparse α in the dual representation)

• Ridge regression does not

• AdaBoost achieves feature selection by a different, greedy approach
Other regularizers

$$\sum_{i=1}^{N} \left(y_i - f(x_i, w)\right)^2 + \lambda \sum_{j=1}^{d} |w_j|^q$$

• For q ≥ 1, the cost function is convex and has a unique minimum. The solution can be obtained by quadratic optimization.

• For q < 1, the problem is not convex, and obtaining the global minimum is more difficult.
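
A small sketch of this objective as a function of q (purely illustrative; the data and parameter values are made up):

```python
import numpy as np

def regularized_cost(w, X, y, lam, q):
    """Squared loss plus an L_q penalty: sum_i (y_i - X_i.w)^2 + lam * sum_j |w_j|^q.

    Convex in w for q >= 1; for q < 1 the penalty is non-convex, so a
    gradient-based optimizer may only find a local minimum.
    """
    residuals = y - X @ w
    return np.sum(residuals ** 2) + lam * np.sum(np.abs(w) ** q)

# Tiny usage example with made-up numbers.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 2.0])
print(regularized_cost(w, X, y, lam=0.5, q=0.5))
```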
SVMs for Regression

Use the ε-insensitive error measure

$$V_\varepsilon(r) = \begin{cases} 0 & \text{if } |r| \le \varepsilon \\ |r| - \varepsilon & \text{otherwise} \end{cases}$$

This can also be written as

$$V_\varepsilon(r) = \left(|r| - \varepsilon\right)_+$$

where $(\cdot)_+$ denotes the positive part, or equivalently as

$$V_\varepsilon(r) = \max\left(|r| - \varepsilon, 0\right)$$

[Figure: $V_\varepsilon(r)$ plotted against r, compared with the square loss; the cost is zero inside the ε "tube"]
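
A one-line numpy version of this loss, for reference (the residuals and ε value are arbitrary):

```python
import numpy as np

def eps_insensitive(r, eps):
    # V_eps(r) = max(|r| - eps, 0): zero inside the eps-"tube", linear outside it.
    return np.maximum(np.abs(r) - eps, 0.0)

r = np.linspace(-2.0, 2.0, 9)        # residuals y - f(x), values chosen for illustration
print(eps_insensitive(r, eps=0.5))
```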

• As before, introduce slack variables for points that violate the ε-insensitive error

• For each data point $x_i$, two slack variables, $\xi_i, \hat{\xi}_i$, are required (depending on whether $f(x_i)$ is above or below the tube)

• Learning is by the optimization

$$\min_{w \in \mathbb{R}^d,\, \xi_i,\, \hat{\xi}_i} \; C \sum_{i=1}^{N} \left( \xi_i + \hat{\xi}_i \right) + \frac{1}{2} \|w\|^2$$

(loss function + regularization)

subject to

$$y_i \le f(x_i, w) + \varepsilon + \xi_i, \quad y_i \ge f(x_i, w) - \varepsilon - \hat{\xi}_i, \quad \xi_i \ge 0, \quad \hat{\xi}_i \ge 0 \quad \text{for } i = 1, \ldots, N$$

[Figure: the ε-tube around f(x); the cost is zero inside the tube]

• Again, this is a quadratic programming problem


• It can be dualized
• Some of the data points will become support vectors
• It can be kernelized
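
In practice the dual, kernelized problem is what library implementations solve; a minimal sketch with scikit-learn's SVR follows (the data and the C, ε, γ values are assumptions for illustration).

```python
import numpy as np
from sklearn.svm import SVR

# Support vector regression with an RBF kernel; C, epsilon and gamma are illustrative.
rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0.0, 1.0, size=30))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=30)   # assumed target curve + noise

svr = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=10.0).fit(x[:, None], y)

print("number of support vectors:", len(svr.support_))   # points outside (or on) the tube
print("prediction at x = 0.5:", svr.predict(np.array([[0.5]])))
```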
Example: SV regression with Gaussian basis functions

• The red curve is the true function (which is not a polynomial)

• Regression function: Gaussians centred on the data points

• Parameters are: C, ε, σ

$$f(x, w) = \sum_{i=1}^{N} w_i \, e^{-(x - x_i)^2/\sigma^2} = w^\top \Phi(x)$$

$\Phi : x \to \Phi(x)$, $\mathbb{R} \to \mathbb{R}^N$, and w is an N-vector

[Figure: "ideal fit": sample points and the ideal fit curve, y against x]
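
A short sketch of this explicit feature map (the inputs and σ are placeholders):

```python
import numpy as np

def gaussian_design(x, centres, sigma):
    """Design matrix for Gaussian basis functions centred on the data points.

    Row i is Phi(x_i), with Phi(x)[j] = exp(-(x - centre_j)^2 / sigma^2),
    so each input in R is mapped to a vector in R^N (one component per centre)."""
    return np.exp(-((x[:, None] - centres[None, :]) ** 2) / sigma ** 2)

x = np.linspace(0.0, 1.0, 9)           # illustrative 1-D inputs, also used as the centres
Phi = gaussian_design(x, centres=x, sigma=0.1)
print(Phi.shape)                        # (N, N): one basis function per data point
```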

epsilon = 0.01

[Figures: sample points with the ideal fit, and sample points with the validation set fit and the support vectors, y against x]

• The validation set fit is a search over both C and σ
epsilon = 0.5 and epsilon = 0.8

[Figures: sample points, the validation set fit, and the support vectors for ε = 0.5 and ε = 0.8]

As ε increases:
• the fit becomes looser
• fewer data points are support vectors
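
This behaviour can be checked numerically, for example by refitting the hypothetical scikit-learn setup sketched earlier over a few ε values:

```python
import numpy as np
from sklearn.svm import SVR

# Count support vectors as epsilon grows; C, gamma and the data are illustrative.
rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0.0, 1.0, size=30))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=30)

for eps in [0.01, 0.5, 0.8]:
    svr = SVR(kernel="rbf", C=10.0, epsilon=eps, gamma=10.0).fit(x[:, None], y)
    print(f"epsilon = {eps}: {len(svr.support_)} support vectors")
# The count typically drops as epsilon increases: more points fall inside the tube.
```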

Loss functions for regression

• quadratic (square) loss: $\ell(y, f(x)) = \frac{1}{2}\left(y - f(x)\right)^2$

• ε-insensitive loss: $\ell(y, f(x)) = \max\left(|r| - \varepsilon, 0\right)$, where $r = y - f(x)$

• Huber loss (mixed quadratic/linear), for robustness to outliers: $\ell(y, f(x)) = h(y - f(x))$, with

$$h(r) = \begin{cases} r^2 & \text{if } |r| \le c \\ 2c|r| - c^2 & \text{otherwise} \end{cases}$$

• all of these are convex

[Figure: the square, ε-insensitive, and Huber losses plotted against y − f(x)]
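
For reference, a small sketch implementing the three losses as defined above, as one might for such a comparison plot (the ε and c values are placeholders):

```python
import numpy as np

def square_loss(r):
    return 0.5 * r ** 2

def eps_insensitive_loss(r, eps=0.5):
    return np.maximum(np.abs(r) - eps, 0.0)

def huber_loss(r, c=1.0):
    # Quadratic near zero, linear in the tails: r^2 if |r| <= c, else 2c|r| - c^2.
    return np.where(np.abs(r) <= c, r ** 2, 2 * c * np.abs(r) - c ** 2)

r = np.linspace(-3.0, 3.0, 7)   # residuals y - f(x), chosen for illustration
print(square_loss(r), eps_insensitive_loss(r), huber_loss(r), sep="\n")
```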
Final notes on cost functions

Regressors and classifiers can be constructed by a "mix 'n' match" of loss functions and regularizers to obtain a learning machine suited to a particular application, e.g. for a classifier $f(x) = w^\top x + b$:

• L1 logistic regression

$$\min_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \log\left(1 + e^{-y_i f(x_i)}\right) + \lambda \|w\|_1$$

• L1-SVM

$$\min_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \max\left(0, 1 - y_i f(x_i)\right) + \lambda \|w\|_1$$

• Least squares SVM

$$\min_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \left[\max\left(0, 1 - y_i f(x_i)\right)\right]^2 + \lambda \|w\|^2$$
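
As one concrete instance, L1-regularized logistic regression is available off the shelf; a minimal sketch follows (synthetic data, arbitrary regularization strength; note that scikit-learn parameterizes the penalty by C, which behaves like 1/λ).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# L1-regularized logistic regression: sparse classifier weights.
rng = np.random.default_rng(6)
N, d = 100, 10
X = rng.normal(size=(N, d))
w_true = np.zeros(d)
w_true[[1, 4]] = [2.0, -3.0]                     # only two informative features (assumed)
yc = np.sign(X @ w_true + 0.1 * rng.normal(size=N))

# Smaller C means stronger regularization (C plays the role of 1/lambda).
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, yc)
print(np.round(clf.coef_, 3))                    # most weights exactly zero
```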

Background reading

• Bishop, chapters 3.1 & 7.1.4

• Hastie et al, chapters 3.4 & 12.3.5

• More on the web page: http://www.robots.ox.ac.uk/~az/lectures/ml
