
CSC 411: Lecture 15: Support Vector Machine

Richard Zemel, Raquel Urtasun and Sanja Fidler

University of Toronto



Today

Margin
Max-margin classification

Today

We are back to supervised learning.

We are given training data $\{(x^{(i)}, t^{(i)})\}_{i=1}^N$.

We will look at classification, so $t^{(i)}$ will represent the class label.

We will focus on binary classification (two classes).

We will consider a linear classifier first (next class: non-linear decision boundaries).

Tiny change from before: instead of using t = 1 and t = 0 for the positive and negative class, we will use t = 1 for the positive and t = −1 for the negative class.


Logistic Regression

$$y = \begin{cases} 1 & \text{if } w^T x + b \ge 0 \\ -1 & \text{if } w^T x + b < 0 \end{cases}$$
Max Margin Classification

Instead of fitting all the points, focus on the boundary points.

Aim: learn a boundary that leads to the largest margin (buffer) from points on both sides.

Why: intuition; theoretical support; and it works well in practice.

The subset of vectors that support (determine) the boundary are called the support vectors.


Linear SVM

Max margin classifier: inputs in the margin are of unknown class.

$$y = \begin{cases} 1 & \text{if } w^T x + b \ge 1 \\ -1 & \text{if } w^T x + b \le -1 \\ \text{undefined} & \text{if } -1 < w^T x + b < 1 \end{cases}$$

Can write the above condition as:

$$(w^T x + b)\, y \ge 1$$
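To make the condition concrete, here is a minimal NumPy sketch (not from the lecture; w, b, and the data are made-up values) that evaluates the combined condition $(w^T x + b)\, t \ge 1$ on a toy dataset:

```python
import numpy as np

# A minimal sketch (illustrative values, not from the lecture): evaluate the
# combined margin condition (w^T x + b) t >= 1 on a toy 2-D dataset.
w = np.array([2.0, 1.0])
b = -1.0
X = np.array([[2.0, 1.0],    # a positive example
              [-1.0, 0.0]])  # a negative example
t = np.array([1, -1])

scores = X @ w + b        # w^T x + b for each example
print(scores * t >= 1)    # [ True  True ]: both lie outside the margin band
```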


Geometry of the Problem

The vector w is orthogonal to the +1 plane.

If u and v are two points on that plane, then

$$w^T (u - v) = 0$$

The same is true for the −1 plane.

Also: for a point $x^+$ on the +1 plane and the nearest point $x^-$ on the −1 plane:

$$x^+ = \lambda w + x^-$$
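A quick numeric sanity check of the orthogonality claim (all values below are illustrative): pick two points u and v on the +1 plane and verify $w^T(u - v) = 0$.

```python
import numpy as np

# Illustrative check: with w = [2, 1] and b = -1, the +1 plane is
# w^T x + b = 1, i.e. 2*x1 + x2 = 2. Both u and v satisfy it, so w
# should be orthogonal to u - v.
w = np.array([2.0, 1.0])
u = np.array([1.0, 0.0])   # 2*1 + 0 = 2, on the +1 plane
v = np.array([0.0, 2.0])   # 2*0 + 2 = 2, on the +1 plane
print(w @ (u - v))         # 0.0
```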


Computing the Margin

Also: for a point $x^+$ on the +1 plane and the nearest point $x^-$ on the −1 plane:

$$x^+ = \lambda w + x^-$$

Starting from the +1 plane equation and using $w^T x^- + b = -1$ (since $x^-$ lies on the −1 plane):

$$w^T x^+ + b = 1$$
$$w^T (\lambda w + x^-) + b = 1$$
$$w^T x^- + b + \lambda w^T w = 1$$
$$-1 + \lambda w^T w = 1$$

Therefore

$$\lambda = \frac{2}{w^T w}$$


Computing the Margin

Define the margin M to be the distance between the +1 and −1 planes.

We can now express this in terms of w: to maximize the margin, we minimize the length of w.

$$M = \|x^+ - x^-\| = \|\lambda w\| = \lambda \sqrt{w^T w} = \frac{2\sqrt{w^T w}}{w^T w} = \frac{2}{\sqrt{w^T w}} = \frac{2}{\|w\|}$$
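A small numeric check of $M = 2/\|w\|$ (the specific w and b below are illustrative, not from the slides):

```python
import numpy as np

# Illustrative check that the distance between the planes w^T x + b = +1
# and w^T x + b = -1 equals 2 / ||w||.
w = np.array([3.0, 4.0])       # ||w|| = 5, so M should be 0.4
b = 0.0

lam = 2.0 / (w @ w)            # lambda = 2 / (w^T w)
x_minus = -w / (w @ w)         # a point on the -1 plane: w^T x + b = -1
x_plus = lam * w + x_minus     # nearest point on the +1 plane

print(w @ x_plus + b)          # 1.0: x_plus is indeed on the +1 plane
print(np.linalg.norm(x_plus - x_minus), 2 / np.linalg.norm(w))  # both 0.4
```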


Learning a Margin-Based Classifier

We can search for the optimal parameters (w and b) by finding a solution that:
1. Correctly classifies the training examples: $\{(x^{(i)}, t^{(i)})\}_{i=1}^N$
2. Maximizes the margin (same as minimizing $w^T w$)

$$\min_{w,b} \frac{1}{2}\|w\|^2 \qquad \text{s.t. } \forall i \;\; (w^T x^{(i)} + b)\, t^{(i)} \ge 1$$

This is called the primal formulation of the Support Vector Machine (SVM).

Can optimize via projected gradient descent, etc.

Apply Lagrange multipliers: formulate an equivalent problem.
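Before moving to the Lagrangian, here is a hedged sketch of solving the primal directly as a quadratic program with cvxpy (an assumed dependency, not mentioned in the lecture), on a toy linearly separable dataset:

```python
import cvxpy as cp
import numpy as np

# Sketch of the primal SVM as a QP, using cvxpy (assumed dependency).
# X is N x d, t in {-1, +1}; the toy data is linearly separable.
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.0], [-2.0, 0.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))    # (1/2) ||w||^2
constraints = [cp.multiply(t, X @ w + b) >= 1]      # (w^T x_i + b) t_i >= 1
cp.Problem(objective, constraints).solve()
print(w.value, b.value)
```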


Learning a Linear SVM

Convert the constrained minimization to an unconstrained optimization problem: represent the constraints as penalty terms:

$$\min_{w,b} \frac{1}{2}\|w\|^2 + \text{penalty term}$$

For data $\{(x^{(i)}, t^{(i)})\}_{i=1}^N$, use the following penalty:

$$\max_{\alpha_i \ge 0} \alpha_i \left[1 - (w^T x^{(i)} + b)\, t^{(i)}\right] = \begin{cases} 0 & \text{if } (w^T x^{(i)} + b)\, t^{(i)} \ge 1 \\ \infty & \text{otherwise} \end{cases}$$

Rewrite the minimization problem:

$$\min_{w,b} \left\{ \frac{1}{2}\|w\|^2 + \sum_{i=1}^N \max_{\alpha_i \ge 0} \alpha_i \left[1 - (w^T x^{(i)} + b)\, t^{(i)}\right] \right\}$$

where the $\alpha_i$ are the Lagrange multipliers

$$= \min_{w,b} \max_{\alpha_i \ge 0} \left\{ \frac{1}{2}\|w\|^2 + \sum_{i=1}^N \alpha_i \left[1 - (w^T x^{(i)} + b)\, t^{(i)}\right] \right\}$$


Solution to Linear SVM

Let:

$$J(w, b; \alpha) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^N \alpha_i \left[1 - (w^T x^{(i)} + b)\, t^{(i)}\right]$$

Swap the "max" and "min": this gives a lower bound

$$\max_{\alpha_i \ge 0} \min_{w,b} J(w, b; \alpha) \le \min_{w,b} \max_{\alpha_i \ge 0} J(w, b; \alpha)$$

Equality holds under certain conditions (strong duality, which holds here since the objective is convex and the constraints are affine).


Solution to Linear SVM

Solving:

$$\max_{\alpha_i \ge 0} \min_{w,b} J(w, b; \alpha) = \max_{\alpha_i \ge 0} \min_{w,b} \left\{ \frac{1}{2}\|w\|^2 + \sum_{i=1}^N \alpha_i \left[1 - (w^T x^{(i)} + b)\, t^{(i)}\right] \right\}$$

First minimize J(·) w.r.t. w, b for fixed Lagrange multipliers:

$$\frac{\partial J(w, b; \alpha)}{\partial w} = w - \sum_{i=1}^N \alpha_i t^{(i)} x^{(i)} = 0$$

$$\frac{\partial J(w, b; \alpha)}{\partial b} = -\sum_{i=1}^N \alpha_i t^{(i)} = 0$$

We obtain

$$w = \sum_{i=1}^N \alpha_i t^{(i)} x^{(i)}$$

Then substitute back to get the final optimization:

$$L = \max_{\alpha_i \ge 0} \left\{ \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i,j=1}^N t^{(i)} t^{(j)} \alpha_i \alpha_j \, (x^{(i)} \cdot x^{(j)}) \right\}$$
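The dual is also a quadratic program and can be handed to the same solver. A minimal cvxpy sketch (again an assumed dependency; the toy dataset matches the primal sketch above):

```python
import cvxpy as cp
import numpy as np

# Sketch of the dual QP with cvxpy (assumed dependency), same toy data.
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.0], [-2.0, 0.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
N = len(t)

# Q_ij = t_i t_j (x_i . x_j); a tiny ridge keeps Q numerically PSD
Q = np.outer(t, t) * (X @ X.T) + 1e-9 * np.eye(N)

alpha = cp.Variable(N)
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, Q))
constraints = [alpha >= 0, t @ alpha == 0]   # the two dual constraints
cp.Problem(objective, constraints).solve()
print(alpha.value)   # the nonzero entries mark the support vectors
```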


Summary of Linear SVM

Binary, linearly separable classification.

Linear classifier with maximal margin.

Train the SVM by maximizing

$$\max_{\alpha_i \ge 0} \left\{ \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i,j=1}^N t^{(i)} t^{(j)} \alpha_i \alpha_j \, (x^{(i)} \cdot x^{(j)}) \right\}$$

$$\text{subject to } \alpha_i \ge 0; \quad \sum_{i=1}^N \alpha_i t^{(i)} = 0$$

The weights are

$$w = \sum_{i=1}^N \alpha_i t^{(i)} x^{(i)}$$

Only a small subset of the $\alpha_i$'s will be nonzero, and the corresponding $x^{(i)}$'s are the support vectors S.

Prediction on a new example:

$$y = \text{sign}\left[b + x \cdot \left(\sum_{i=1}^N \alpha_i t^{(i)} x^{(i)}\right)\right] = \text{sign}\left[b + x \cdot \left(\sum_{i \in S} \alpha_i t^{(i)} x^{(i)}\right)\right]$$
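A minimal sketch of this prediction rule in NumPy (the helper name `predict` and all numeric values are illustrative; the α's and b would come from solving the dual, e.g. the QP sketch above):

```python
import numpy as np

# Sketch of the dual prediction rule; `predict` is a hypothetical helper and
# all numeric values are illustrative placeholders.
def predict(x_new, alpha, t, X, b, tol=1e-6):
    S = alpha > tol                          # support vector mask
    w = (alpha[S] * t[S]) @ X[S]             # w = sum_{i in S} alpha_i t_i x_i
    return np.sign(w @ x_new + b)

alpha = np.array([0.25, 0.0, 0.25, 0.0])
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.0], [-2.0, 0.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
print(predict(np.array([3.0, 3.0]), alpha, t, X, b=-0.5))   # 1.0
```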


What if data is not linearly separable?

Introduce slack variables $\xi_i$:

$$\min_{w,b} \frac{1}{2}\|w\|^2 + \lambda \sum_{i=1}^N \xi_i \qquad \text{s.t. } \xi_i \ge 0; \;\; \forall i \;\; t^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi_i$$

An example lies on the wrong side of the hyperplane when $\xi_i > 1$.

Therefore $\sum_i \xi_i$ upper bounds the number of training errors.

λ trades off training error vs. model complexity.

This is known as the soft-margin extension.
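One simple way to optimize this objective, shown here as a sketch rather than the lecture's method, is subgradient descent on the equivalent hinge-loss form $\frac{1}{2}\|w\|^2 + \lambda \sum_i \max(0,\, 1 - t^{(i)}(w \cdot x^{(i)} + b))$; all hyperparameters below are illustrative choices:

```python
import numpy as np

# Sketch (not the lecture's method): subgradient descent on the hinge-loss
# form of the soft-margin objective. lam, lr, and iters are illustrative.
def train_soft_margin(X, t, lam=1.0, lr=0.01, iters=2000):
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        viol = t * (X @ w + b) < 1           # margin violators: xi_i > 0
        grad_w = w - lam * (t[viol] @ X[viol])
        grad_b = -lam * np.sum(t[viol])
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.0], [-2.0, 0.0], [0.5, 0.5]])
t = np.array([1.0, 1.0, -1.0, -1.0, -1.0])   # last point sits near the boundary
w, b = train_soft_margin(X, t)
print(np.sign(X @ w + b))                    # predicted labels on the training set
```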
