Lecture 18 - SVM


Supervised Learning – Classification

Support Vector Machines


Background
• There are three methods to establish a classifier:
  a) Model a classification rule directly
     Examples: k-NN, decision trees, perceptron, SVM
  b) Model the probability of class memberships given input data
     Example: feedforward ANN (multi-layered perceptron)
  c) Make a probabilistic model of data within each class
     Examples: naive Bayes, model-based classifiers
• a) and b) are examples of discriminative classification
• c) is an example of generative classification
• b) and c) are both examples of probabilistic classification
Support Vector Machines - Overview
• Proposed by Vapnik and his colleagues
  - Started in 1963, took shape in the late 70's as part of his statistical learning theory (with Chervonenkis)
  - Current form established in the early 90's (with Cortes)
• Became popular in the last decade
  - Classification, regression (function approximation), optimization
• Basic ideas
  - Maximize the margin of the decision boundary
  - Overcome the linear separability problem by transforming the problem into a higher-dimensional space using kernel functions
Linear Classifiers
f(x, w, b) = sign(w · x + b)
[Figure: an input x is mapped by f to an estimated label y_est; one marker denotes +1, the other denotes -1]
How would you classify this data?
Linear Classifiers
f(x, w, b) = sign(w · x + b)
[Figure: several candidate separating lines drawn through the same data]
Any of these would be fine... but which is best?
Classifier Margin
f(x, w, b) = sign(w · x + b)
Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a datapoint.
Maximum Margin
f(x, w, b) = sign(w · x + b)
The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM, called a linear SVM (LSVM).
Maximum Margin
f(x, w, b) = sign(w · x + b)
Support vectors are those datapoints that the margin pushes up against.
The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM, called a linear SVM (LSVM).
Why Maximum Margin?
Support vectors are those datapoints that the margin pushes up against; the maximum margin linear classifier is the linear classifier with the maximum margin (the simplest kind of SVM, an LSVM).
Intuitively, placing the boundary as far as possible from the nearest points of both classes leaves the most room for error, so small perturbations or noise in the data are least likely to flip a classification.
Specifying a line and margin
[Figure: plus-plane, classifier boundary, and minus-plane]
• How do we represent this mathematically?
• …in m input dimensions?
Specifying a line and margin
[Figure: plus-plane, classifier boundary, and minus-plane]
Conditions for the optimal separating hyperplane for data points (x1, y1), …, (xl, yl), where yi = ±1:
1. w · xi + b ≥ 1 if yi = +1 (points in the plus class)
2. w · xi + b ≤ -1 if yi = -1 (points in the minus class)
Estimate the Margin
[Figure: separating hyperplane w · x + b = 0, a data point x, and the normal vector w; one marker denotes +1, the other denotes -1]
x – input vector
w – normal vector of the hyperplane
b – bias (offset) term
Maximize Margin
[Figure: separating hyperplane w · x + b = 0 with its margin]
The margin is the distance from the hyperplane to the closest training point, so the maximum-margin hyperplane solves

    argmax_{w,b}  min_{xi ∈ D}  |xi · w + b| / sqrt( Σ_{i=1..d} wi² )
    subject to  ∀ xi ∈ D : yi (xi · w + b) ≥ 0

Fixing the scale of w by requiring
    w · xi + b ≥ 1 iff yi = +1
    w · xi + b ≤ -1 iff yi = -1,
i.e. yi (w · xi + b) ≥ 1, gives the equivalent problem

    argmin_{w,b}  Σ_{i=1..d} wi²
    subject to  ∀ xi ∈ D : yi (xi · w + b) ≥ 1
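To make this concrete, here is a minimal sketch (not part of the original slides) that solves the hard-margin primal problem above with SciPy's general-purpose SLSQP solver on a small hypothetical dataset; dedicated SVM solvers instead work on the dual formulation discussed next.

    # Hard-margin primal: argmin_{w,b} sum_i w_i^2  s.t.  y_i (w . x_i + b) >= 1
    import numpy as np
    from scipy.optimize import minimize

    X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],        # class +1
                  [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]])   # class -1
    y = np.array([1, 1, 1, -1, -1, -1], dtype=float)

    def objective(p):
        w = p[:-1]                       # last entry of p is b, which is not penalized
        return np.sum(w ** 2)

    constraints = [{"type": "ineq",      # y_i (w . x_i + b) - 1 >= 0
                    "fun": lambda p, xi=xi, yi=yi: yi * (p[:-1] @ xi + p[-1]) - 1.0}
                   for xi, yi in zip(X, y)]

    res = minimize(objective, x0=np.zeros(X.shape[1] + 1),
                   method="SLSQP", constraints=constraints)
    w, b = res.x[:-1], res.x[-1]
    print("w =", w, "b =", b, "margin width = 2/||w|| =", 2 / np.linalg.norm(w))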
Linear SVM
• Linear model:
    f(x) = +1 if w · x + b ≥ 1
           -1 if w · x + b ≤ -1
• Learning the model is equivalent to determining the values of w and b.
• How to find w and b from training data?

• Constrained Optimization
• Lagrangian Method
SVM – Lagrangian Formulation
[Derivation via Lagrange multipliers shown on the original slides; a sketch of the resulting dual problem follows.]
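As a minimal sketch (not from the slides) of where the Lagrangian route leads, the dual problem maximizes Σi αi − ½ Σij αi αj yi yj (xi · xj) subject to αi ≥ 0 and Σi αi yi = 0; w and b are then recovered from the multipliers. The toy data and the use of SciPy below are illustrative assumptions.

    import numpy as np
    from scipy.optimize import minimize

    X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
                  [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]])
    y = np.array([1, 1, 1, -1, -1, -1], dtype=float)
    G = (y[:, None] * X) @ (y[:, None] * X).T          # G_ij = y_i y_j (x_i . x_j)

    def neg_dual(a):                                   # minimize the negated dual objective
        return 0.5 * a @ G @ a - a.sum()

    res = minimize(neg_dual, x0=np.zeros(len(y)), method="SLSQP",
                   bounds=[(0, None)] * len(y),
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}])
    alpha = res.x
    w = (alpha * y) @ X                                # w = sum_i alpha_i y_i x_i
    sv = int(np.argmax(alpha))                         # index of a support vector (alpha_sv > 0)
    b = y[sv] - w @ X[sv]                              # from y_sv (w . x_sv + b) = 1
    print("alpha =", np.round(alpha, 4), "w =", w, "b =", b)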
Example of Linear SVM

x1       x2       y    λ (Lagrange multiplier)
0.3858   0.4687   +1   65.5261
0.4871   0.6110   -1   65.5261
0.9218   0.4103   -1   0
0.7382   0.8936   -1   0
0.1763   0.0579   +1   0
0.4057   0.3529   +1   0
0.9355   0.8132   -1   0
0.2146   0.0099   +1   0
Example of Linear SVM
• Only the first two tuples (the ones with nonzero λ) are support vectors in this case.
Learning Linear SVM
• Let w = (w1, w2) and b denote the parameters to be determined.
• We can solve for w1 and w2 as follows: from the dual solution, w = Σi λi yi xi, and then b = yk − w · xk for any support vector xk (a worked sketch is given below).
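A minimal sketch of this calculation for the table above, assuming the standard dual relations w = Σi λi yi xi and b = yk − w · xk for a support vector xk (the zero-λ rows contribute nothing):

    import numpy as np

    X = np.array([[0.3858, 0.4687], [0.4871, 0.6110], [0.9218, 0.4103],
                  [0.7382, 0.8936], [0.1763, 0.0579], [0.4057, 0.3529],
                  [0.9355, 0.8132], [0.2146, 0.0099]])
    y = np.array([1, -1, -1, -1, 1, 1, -1, 1], dtype=float)
    lam = np.array([65.5261, 65.5261, 0, 0, 0, 0, 0, 0])

    w = (lam * y) @ X                 # w = sum_i lambda_i y_i x_i, roughly (-6.64, -9.32)
    b = y[0] - w @ X[0]               # use the first support vector, roughly 7.93
    print("w =", w, "b =", b)
    print("predictions:", np.sign(X @ w + b), "labels:", y)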
Learning Linear SVM
• The decision boundary depends only on the support vectors.
• If you have a data set with the same support vectors, the decision boundary will not change.
• How to classify using SVM once w and b are found? Given a test record xi:
    f(xi) = +1 if w · xi + b ≥ 1
            -1 if w · xi + b ≤ -1
Support Vector Machines
• What if the problem is not linearly separable?
Support Vector Machines
• What if the problem is not linearly separable?
• Introduce slack variables ξi ≥ 0.
• Need to minimize:
    L(w) = ||w||² / 2 + C ( Σ_{i=1..N} ξi )^k
• Subject to:
    yi = +1 if w · xi + b ≥ 1 − ξi
    yi = −1 if w · xi + b ≤ −1 + ξi
• If k is 1 or 2, this leads to a similar objective function as the linear SVM, but with different constraints.
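A minimal sketch (not from the slides) of the k = 1 case: eliminating the slack variables gives the unconstrained hinge-loss form L(w, b) = ||w||²/2 + C Σi max(0, 1 − yi (w · xi + b)), which the toy subgradient-descent loop below minimizes directly; the data and hyperparameters are illustrative assumptions.

    import numpy as np

    def train_soft_margin(X, y, C=1.0, lr=0.01, epochs=1000):
        w = np.zeros(X.shape[1])
        b = 0.0
        for _ in range(epochs):
            margins = y * (X @ w + b)
            viol = margins < 1                                  # margin violations
            grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
            grad_b = -C * y[viol].sum()
            w -= lr * grad_w
            b -= lr * grad_b
        return w, b

    X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
                  [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]])
    y = np.array([1, 1, 1, -1, -1, -1], dtype=float)
    w, b = train_soft_margin(X, y, C=10.0)
    print("w =", w, "b =", b, "predictions:", np.sign(X @ w + b))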
Support Vector Machines
[Figure: two candidate decision boundaries B1 and B2 with their margin hyperplanes b11, b12 and b21, b22]
• Find the hyperplane that optimizes both factors: a wide margin and few margin violations.
Nonlinear Support Vector Machines
• What if the decision boundary is not linear?
Concept of Nonlinear Mapping
[Figures: data that are not linearly separable in the original input space become linearly separable after a nonlinear mapping into a higher-dimensional feature space]
• Transform data into higher dimensional space

Decision boundary:
 
w  ( x )  b  0
Example of Nonlinear SVM
[Figure: decision boundary of an SVM with a polynomial degree-2 kernel]
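This kind of example can be reproduced with an off-the-shelf library; the short sketch below (an illustrative assumption, not part of the slides) fits scikit-learn's SVC with a degree-2 polynomial kernel on the circular toy data from the mapping example.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(200, 2))
    y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5, 1, -1)

    clf = SVC(kernel="poly", degree=2, coef0=1.0, C=1.0)   # polynomial kernel of degree 2
    clf.fit(X, y)
    print("training accuracy:", clf.score(X, y))
    print("support vectors per class:", clf.n_support_)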
Learning Nonlinear SVM
• Advantage of using a kernel:
  - Computing the dot product Φ(xi) · Φ(xj) directly in the original space avoids the curse of dimensionality.
• Not all functions can be kernels:
  - Must make sure there is a corresponding Φ in some high-dimensional space.
  - Mercer's theorem gives the precise condition.
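A minimal sketch (not from the slides) of the point about dot products: for the degree-2 polynomial kernel K(u, v) = (u · v)², the map Φ(u) = (u1², u2², √2·u1·u2) satisfies K(u, v) = Φ(u) · Φ(v), yet K is evaluated entirely in the original two-dimensional space without ever forming Φ.

    import numpy as np

    def K(u, v):
        return (u @ v) ** 2                   # kernel evaluated in the original space

    def phi(u):
        return np.array([u[0] ** 2, u[1] ** 2, np.sqrt(2) * u[0] * u[1]])

    u, v = np.array([0.3, -1.2]), np.array([2.0, 0.7])
    print(K(u, v), phi(u) @ phi(v))           # both print 0.0576 (up to floating-point error)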
Characteristics of SVM
• The learning problem is formulated as a convex optimization problem.
• Efficient algorithms are available to find the global minimum, whereas many other methods use greedy approaches and find only locally optimal solutions.
• High computational complexity for building the model.
• Robust to noise.
• Overfitting is handled by maximizing the margin of the decision boundary.
• SVM can handle irrelevant and redundant attributes better than many other techniques.
• The user needs to provide the type of kernel function and the cost parameter C.
• Difficult to handle missing values.
References
• An excellent tutorial on VC-dimension and Support Vector Machines:
  C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
  http://citeseer.nj.nec.com/burges98tutorial.html
• The VC/SRM/SVM Bible:
  Statistical Learning Theory by Vladimir Vapnik, Wiley-Interscience, 1998.
• Download SVM-light:
  http://svmlight.joachims.org/
Some other issues in SVM
• SVM works only in a real-valued space. For a categorical attribute, we need to convert its categorical values to numeric values.
• SVM does only two-class classification. For multi-class problems, some strategies can be applied, e.g., one-against-rest and error-correcting output coding (see the sketch below).
• The hyperplane produced by SVM is hard for human users to understand, and the matter is made worse by kernels. Thus, SVM is commonly used in applications that do not require human understanding of the model.
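A minimal sketch (not from the slides) of the one-against-rest strategy: train one binary SVM per class and predict the class whose classifier returns the largest decision value. The three-blob toy data and the use of scikit-learn are illustrative assumptions.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    # hypothetical 3-class data: three Gaussian blobs
    X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in ([0, 0], [3, 0], [0, 3])])
    y = np.repeat([0, 1, 2], 30)

    classifiers = {}
    for c in np.unique(y):
        clf = SVC(kernel="linear")
        clf.fit(X, np.where(y == c, 1, -1))        # class c vs. the rest
        classifiers[c] = clf

    scores = np.column_stack([classifiers[c].decision_function(X)
                              for c in sorted(classifiers)])
    pred = scores.argmax(axis=1)                   # pick the class with the largest score
    print("training accuracy:", (pred == y).mean())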
