This document provides an introduction to support vector machines (SVMs). It discusses how SVMs find the optimal hyperplane for binary classification that maximizes the margin between the two classes. The hyperplane is determined by support vectors, which are the data points closest to the decision boundary. The document describes how SVMs solve a quadratic optimization problem to learn the hyperplane parameters that maximize the margin. It also covers extensions of SVMs to non-linearly separable data using soft margins that allow some misclassification with a penalty.


INTRODUCTION TO MACHINE LEARNING

SUPPORT VECTOR MACHINE

The slides are from Raymond J. Mooney (ML Research Group @ Univ. of Texas)

Mingon Kang, Ph.D.


Department of Computer Science @ UNLV
Linear Separators
 Binary classification can be viewed as the task of
separating classes in feature space:
The separating hyperplane:   wTx + b = 0
Points on one side:          wTx + b > 0
Points on the other side:    wTx + b < 0

f(x) = sign(wTx + b)
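To make the decision rule concrete, here is a minimal Python/NumPy sketch (Python is not part of the original slides; the particular w and b are the values from the worked example later in the deck):

import numpy as np

def predict(w, b, x):
    # Linear classifier: sign of the score w^T x + b.
    return np.sign(w @ x + b)

w = np.array([1.0, 2.0])   # normal vector of the hyperplane
b = -5.5                   # bias
print(predict(w, b, np.array([2.0, 3.0])))   #  1.0 -> positive side
print(predict(w, b, np.array([1.0, 1.0])))   # -1.0 -> negative side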
Ch. 15

Linear classifiers: Which Hyperplane?


 Lots of possible choices for a, b, c in a decision line ax + by − c = 0.

 A Support Vector Machine (SVM) finds an optimal* solution.
 Maximizes the distance between the hyperplane and the “difficult points” close to the decision boundary
 One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions
Sec. 15.1

Support Vector Machine (SVM)


 SVMs maximize the margin around the separating hyperplane.
◼ A.k.a. large margin classifiers
 The decision function is fully specified by a subset of the training samples, the support vectors.
 Solving an SVM is a quadratic programming problem.

(Figure: the support vectors lie on the margin boundaries; the maximum-margin separator is contrasted with one that gives a narrower margin.)
Sec. 15.1

Maximum Margin: Formalization



 w: decision hyperplane normal vector


 xi: data point i
 yi: class of data point i (+1 or -1)
 Classifier is: f(xi) = sign(wTxi + b)
 Functional margin of xi is: yi (wTxi + b)
 The functional margin of a dataset is twice the minimum
functional margin for any point
 The factor of 2 comes from measuring the whole width of the
margin
 Problem: we can increase this margin simply by scaling w, b….
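A quick numeric illustration of this scaling problem, reusing the hyperplane from the later worked example (Python/NumPy, not from the slides): multiplying w and b by a constant multiplies the functional margin by that constant without moving the hyperplane.

import numpy as np

x, y = np.array([2.0, 3.0]), 1          # one labeled point
w, b = np.array([1.0, 2.0]), -5.5       # a separating hyperplane

print(y * (w @ x + b))                  # functional margin: 2.5
print(y * (10 * w @ x + 10 * b))        # scaled by 10: 25.0, same hyperplane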
Sec. 15.1

Geometric Margin
 Distance from an example to the separator is  r = y (wTx + b) / ‖w‖
 Examples closest to the hyperplane are support vectors.
 Margin ρ of the separator is the width of separation between the support vectors of the two classes.

Derivation of r (see figure: x is an example, x′ its projection onto the hyperplane):
The segment x′ − x is perpendicular to the decision boundary, so it is parallel to w.
The unit vector in that direction is w/‖w‖, so the segment is r·w/‖w‖, and x′ = x − y·r·w/‖w‖.
x′ satisfies wTx′ + b = 0, so wT(x − y·r·w/‖w‖) + b = 0.
Recall that ‖w‖ = sqrt(wTw), so this is wTx − y·r·‖w‖ + b = 0.
Solving for r (and using y² = 1, since y = ±1) gives r = y(wTx + b)/‖w‖.
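The formula can be checked numerically; a short Python/NumPy sketch (same example hyperplane as before, not from the slides) that also verifies the projected point x′ lands on the hyperplane:

import numpy as np

w, b = np.array([1.0, 2.0]), -5.5
x, y = np.array([1.0, 1.0]), -1

r = y * (w @ x + b) / np.linalg.norm(w)      # geometric distance to the hyperplane
print(r)                                     # ~1.118 (= sqrt(5)/2)

# Cross-check: x' = x - y*r*w/|w| must lie exactly on the hyperplane.
x_prime = x - y * r * w / np.linalg.norm(w)
print(w @ x_prime + b)                       # ~0.0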
Sec. 15.1

Linear SVM Mathematically


The linearly separable case

 Assume that the functional margin of each data item is at least 1; then the following two constraints follow for a training set {(xi ,yi)}:

wTxi + b ≥ 1 if yi = 1
wTxi + b ≤ −1 if yi = −1
 For support vectors, the inequality becomes an equality
 Then, since each example’s distance from the hyperplane is  r = y (wTx + b) / ‖w‖,
and this distance equals 1/‖w‖ at the support vectors, the margin (full width of separation) is:

ρ = 2 / ‖w‖
Sec. 15.1

Linear Support Vector Machine (SVM)

 Hyperplane:  wTx + b = 0

 Taking xa and xb to be the closest points on the two margin boundaries, they satisfy:
wTxa + b = 1
wTxb + b = −1

 Extra scale constraint:
min over i = 1,…,n of |wTxi + b| = 1

 Subtracting the two boundary equations gives wT(xa − xb) = 2; since xa − xb is parallel to w and its length is the margin width, this implies:
ρ = ‖xa − xb‖₂ = 2/‖w‖₂
Worked example: Geometric margin

 The maximum margin weight vector is parallel to the line from (1, 1) to (2, 3), so the weight vector is (a multiple of) (1, 2).
 The decision boundary is normal (“perpendicular”) to it, halfway between the two points.
 It passes through (1.5, 2).
 So the decision boundary is x1 + 2x2 − 5.5 = 0.
 The geometric margin (the full width between the two points) is √5.
Worked example: Functional margin
 Let’s minimize ‖w‖ subject to yi(wTxi + b) ≥ 1.
 The constraint holds with equality at the support vectors; w = (a, 2a) for some a.
For (1, 1) with yi = −1:  a + 2a + b = −1
For (2, 3) with yi = +1:  2a + 6a + b = 1
 So a = 2/5 and b = −11/5.
 The optimal hyperplane is  w = (2/5, 4/5) and b = −11/5.
 Margin ρ is 2/|w| = 2/√(4/25 + 16/25) = 2/(2√5/5) = √5.
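A short numeric check of this worked example (Python/NumPy; the points, labels, and claimed w, b are taken from the slides above):

import numpy as np

X = np.array([[1.0, 1.0], [2.0, 3.0]])   # the two support vectors
y = np.array([-1, 1])

w, b = np.array([2/5, 4/5]), -11/5       # claimed optimal hyperplane

print(y * (X @ w + b))                   # [1. 1.]  functional margin = 1 at both SVs
print(2 / np.linalg.norm(w))             # 2.236... = sqrt(5), the margin rho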
Sec. 15.1

Linear SVMs Mathematically (cont.)


 Then we can formulate the quadratic optimization problem:

Find w and b such that


ρ = 2/‖w‖ is maximized; and for all {(xi , yi)}:
wTxi + b ≥ 1 if yi = 1;  wTxi + b ≤ −1 if yi = −1

 A better formulation (min ‖w‖ = max 1/‖w‖):

Find w and b such that


Φ(w) =½ wTw is minimized;

and for all {(xi ,yi)}: yi (wTxi + b) ≥ 1
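This minimization can be handed to a general-purpose convex solver directly. A minimal sketch assuming the cvxpy package (a generic Python modeling library, not something used in the slides), on a tiny hand-made dataset; real SVM libraries use specialized solvers instead:

import cvxpy as cp
import numpy as np

# Toy linearly separable data (illustrative).
X = np.array([[1.0, 1.0], [0.0, 1.0], [2.0, 3.0], [3.0, 3.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))        # (1/2) w^T w
constraints = [cp.multiply(y, X @ w + b) >= 1]          # y_i (w^T x_i + b) >= 1
cp.Problem(objective, constraints).solve()

print(w.value, b.value)   # for this data: close to w = (2/5, 4/5), b = -11/5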


Sec. 15.1

Solving the Optimization Problem


Find w and b such that
Φ(w) =½ wTw is minimized;
and for all {(xi ,yi)}: yi (wTxi + b) ≥ 1

 This is now optimizing a quadratic function subject to linear constraints


 Quadratic optimization problems are a well-known class of mathematical
programming problems, and many (intricate) algorithms exist for solving them
(with many special-purpose ones built for SVMs)
 The solution involves constructing a dual problem in which a Lagrange multiplier αi
is associated with every constraint in the primal problem:
Find α1…αN such that
Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi
Sec. 15.1

The Optimization Problem Solution


 The solution has the form:

w = Σαiyixi        b = yk − wTxk   for any xk such that αk ≠ 0

 Each non-zero αi indicates that the corresponding xi is a support vector.
 Then the classifying function will have the form:

f(x) = ΣαiyixiTx + b

 Notice that it relies on an inner product between the test point x and the
support vectors xi
 We will return to this later.
 Also keep in mind that solving the optimization problem involved computing the
inner products xiTxj between all pairs of training points.

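Given dual multipliers αi from a QP solver, the solution above is a few lines of NumPy. A sketch using the two points of the earlier worked example, for which the dual optimum is α1 = α2 = 2/5 (stated here, not derived):

import numpy as np

X = np.array([[1.0, 1.0], [2.0, 3.0]])
y = np.array([-1.0, 1.0])
alpha = np.array([0.4, 0.4])     # dual solution for the worked example; both points are SVs

w = (alpha * y) @ X              # w = sum_i alpha_i y_i x_i
k = np.argmax(alpha)             # any index with alpha_k != 0
b = y[k] - w @ X[k]              # b = y_k - w^T x_k

def f(x_new):
    # Classify via inner products with the support vectors (equivalent to sign(w^T x + b)).
    return np.sign((alpha * y) @ (X @ x_new) + b)

print(w, b)                      # [0.4 0.8] -2.2  (= (2/5, 4/5), -11/5)
print(f(np.array([3.0, 3.0])))   # 1.0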
Sec. 15.2.1

Soft Margin Classification

 If the training data is not linearly separable, slack variables ξi can be added to
allow misclassification of difficult or noisy examples.
 Allow some errors
 Let some points be moved to where they belong, at a cost
 Still, try to minimize training set errors, and to place the hyperplane “far” from each class (large margin)

(Figure: two misclassified points, with slacks ξi and ξj measuring how far they lie beyond the margin.)
Sec. 15.2.1
Soft Margin Classification Mathematically
 The old formulation:

Find w and b such that


Φ(w) =½ wTw is minimized and for all {(xi ,yi)}
yi (wTxi + b) ≥ 1
 The new formulation incorporating slack variables:

Find w and b such that


Φ(w) =½ wTw + CΣξi is minimized and for all {(xi ,yi)}
yi (wTxi + b) ≥ 1- ξi and ξi ≥ 0 for all i
 Parameter C can be viewed as a way to control overfitting
 A regularization term

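In practice this formulation is usually solved through a library. A minimal sketch assuming scikit-learn is available (SVC and its C parameter are scikit-learn names, not from the slides), showing how C trades margin width against slack:

import numpy as np
from sklearn.svm import SVC

# Noisy, not perfectly separable toy data (illustrative).
X = np.array([[1, 1], [0, 1], [2, 3], [3, 3], [1.5, 2.2]])
y = np.array([-1, -1, 1, 1, -1])          # the last point sits on the "wrong" side

for C in (0.1, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C -> wider margin, more slack tolerated; large C -> fewer margin violations.
    print(C, clf.coef_, clf.intercept_, clf.n_support_)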
Sec. 15.2.1

Soft Margin Classification – Solution


 The dual problem for soft margin classification:

Find α1…αN such that


Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi

 Neither slack variables ξi nor their Lagrange multipliers appear in the dual
problem!
 Again, xi with non-zero αi will be support vectors.
 Solution to the dual problem is:
w = Σαiyixi          (w is not needed explicitly for classification!)
b = yk(1 − ξk) − wTxk   where k = argmaxk′ αk′
f(x) = ΣαiyixiTx + b
Sec. 15.1

Classification with SVMs


 Given a new point x, we can score its projection
onto the hyperplane normal:
 I.e., compute score: wTx + b = ΣαiyixiTx + b
◼ Decide class based on whether the score is < or > 0

 Can set a confidence threshold t:
Score > t: yes
Score < −t: no
Else: don’t know
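A small Python sketch of this three-way decision rule (the helper name decide and the threshold value are illustrative, not from the slides):

import numpy as np

def decide(w, b, x, t=1.0):
    # Return +1 / -1 only when the score clears the confidence threshold t.
    score = w @ x + b
    if score > t:
        return 1
    if score < -t:
        return -1
    return None          # "don't know" region

w, b = np.array([0.4, 0.8]), -2.2
print(decide(w, b, np.array([3.0, 3.0])))   # 1     (score 1.4 > t)
print(decide(w, b, np.array([1.5, 2.0])))   # None  (score 0.0, inside the band)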
Sec. 15.2.1

Linear SVMs: Summary


 The classifier is a separating hyperplane.

 The most “important” training points are the support vectors; they define the
hyperplane.

 Quadratic optimization algorithms can identify which training points xi are


support vectors with non-zero Lagrange multipliers αi.

 Both in the dual formulation of the problem and in the solution, training
points appear only inside inner products:

Find α1…αN such that f(x) = ΣαiyixiTx + b


Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi
Sec. 15.2.3

Non-linear SVMs
 Datasets that are linearly separable (with some noise) work out great:

(figure: 1-D points along the x axis, separable by a single threshold)

 But what are we going to do if the dataset is just too hard?

(figure: 1-D points along the x axis where no single threshold separates the classes)

 How about … mapping data to a higher-dimensional space:


(figure: the same data mapped to (x, x²) becomes linearly separable)
Sec. 15.2.3

Non-linear SVMs: Feature spaces


 General idea: the original feature space can
always be mapped to some higher-dimensional
feature space where the training set is separable:

Φ: x → φ(x)

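A tiny numeric illustration of the idea, assuming the mapping φ(x) = (x, x²) suggested by the previous slide’s picture (Python/NumPy, data values illustrative):

import numpy as np

x = np.array([-3.0, -2.0, 2.0, 3.0, -0.5, 0.0, 0.5])
y = np.array([1, 1, 1, 1, -1, -1, -1])      # positives far from 0, negatives near 0

phi = np.column_stack([x, x ** 2])          # map 1-D data to 2-D: (x, x^2)

# In the mapped space, the horizontal line x^2 = 2 separates the classes.
print(np.all((phi[:, 1] > 2) == (y == 1)))  # True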
Sec. 15.2.3

The “Kernel Trick”

 The linear classifier relies on an inner product between vectors K(xi,xj)=xiTxj


 If every datapoint is mapped into high-dimensional space via some
transformation Φ: x → φ(x), the inner product becomes:
K(xi,xj) = φ(xi)Tφ(xj)
 A kernel function is some function that corresponds to an inner product in some
expanded feature space.
 Example:
2-dimensional vectors x = [x1 x2]; let K(xi,xj) = (1 + xiTxj)².
Need to show that K(xi,xj) = φ(xi)Tφ(xj):
K(xi,xj) = (1 + xiTxj)² = 1 + xi1²xj1² + 2 xi1xj1xi2xj2 + xi2²xj2² + 2xi1xj1 + 2xi2xj2
         = [1  xi1²  √2 xi1xi2  xi2²  √2 xi1  √2 xi2]T [1  xj1²  √2 xj1xj2  xj2²  √2 xj1  √2 xj2]
         = φ(xi)Tφ(xj),   where φ(x) = [1  x1²  √2 x1x2  x2²  √2 x1  √2 x2]
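This identity is easy to check numerically (a Python/NumPy sketch; the two vectors are arbitrary):

import numpy as np

def K(a, b):
    # Polynomial kernel of degree 2.
    return (1 + a @ b) ** 2

def phi(v):
    # Explicit feature map from the expansion above.
    x1, x2 = v
    return np.array([1, x1**2, np.sqrt(2)*x1*x2, x2**2, np.sqrt(2)*x1, np.sqrt(2)*x2])

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(K(xi, xj), phi(xi) @ phi(xj))   # both 4.0: same inner product, without building phi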
Sec. 15.2.3

Kernels
 Why use kernels?
 Make non-separable problem separable.
 Map data into better representational space

 Common kernels
 Linear

 Polynomial: K(x,z) = (1 + xTz)^d


◼ Gives feature conjunctions
 Radial basis function (infinite dimensional space)

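A sketch comparing these kernels, assuming scikit-learn's SVC (its kernel, degree, and gamma parameters are library names, not from the slides), on a toy radially separable dataset that a linear kernel cannot handle but polynomial and RBF kernels can:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.0, 1, -1)   # inner circle vs. outside

for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, degree=2, gamma="scale").fit(X, y)
    # Training accuracy: linear stays near the base rate; poly/rbf should be close to 1.0.
    print(kernel, round(clf.score(X, y), 2))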
