
An Introduction to Support Vector Machines

CSE 573 Autumn 2005

Henry Kautz

based on slides stolen from Pierre Dönnes’ web site


Main Ideas
• Max-Margin Classifier
  – Formalizes the notion of the best linear separator
• Lagrangian Multipliers
  – A way to convert a constrained optimization problem into one that is easier to solve
• Kernels
  – Projecting data into a higher-dimensional space can make it linearly separable
• Complexity
  – Depends only on the number of training examples, not on the dimensionality of the kernel space!
Tennis example

[Scatter plot of training examples along axes Temperature and Humidity; legend: = play tennis, = do not play tennis]
Linear Support Vector Machines
Data: ⟨xi, yi⟩, i = 1, …, l
  xi ∈ Rd
  yi ∈ {−1, +1}

[Scatter plot of the data in the (x1, x2) plane; legend: = +1, = −1]
Linear SVM 2
Data: ⟨xi, yi⟩, i = 1, …, l
  xi ∈ Rd
  yi ∈ {−1, +1}

[Plot of the same data with a separating line f(x); regions labeled = +1 and = −1]

All hyperplanes in Rd are parameterized by a vector w and a constant b,
and can be expressed as w•x + b = 0 (remember the equation for a hyperplane
from algebra!).
Our aim is to find such a hyperplane, f(x) = sign(w•x + b), that
correctly classifies our data.
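To make the decision rule concrete, here is a minimal NumPy sketch of f(x) = sign(w•x + b); the values of w and b below are made up for illustration, not learned from data.

```python
import numpy as np

# Illustrative only: w and b are made-up values, not learned from data.
w = np.array([2.0, -1.0])   # normal vector of the hyperplane
b = 0.5                     # offset

def f(x):
    """Linear SVM decision function: sign(w·x + b)."""
    return np.sign(np.dot(w, x) + b)

print(f(np.array([1.0, 0.0])))   # +1 side of the hyperplane
print(f(np.array([-1.0, 1.0])))  # -1 side of the hyperplane
```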
Definitions
Define the hyperplane H such that:
  xi•w + b ≥ +1 when yi = +1
  xi•w + b ≤ −1 when yi = −1
H1 and H2 are the planes:
  H1: xi•w + b = +1
  H2: xi•w + b = −1
The points on the planes H1 and H2 are the support vectors.

[Figure: hyperplane H lying between H1 and H2, with distances d+ and d− to the closest points]

d+ = the shortest distance to the closest positive point
d− = the shortest distance to the closest negative point
The margin of a separating hyperplane is d+ + d−.
Maximizing the margin
We want a classifier with as big a margin as possible.

[Figure: hyperplane H lying between H1 and H2]

Recall that the distance from a point (x0, y0) to the line
Ax + By + c = 0 is |Ax0 + By0 + c| / sqrt(A² + B²).
The distance between H and H1 is: |w•x + b| / ||w|| = 1 / ||w||
The distance between H1 and H2 is: 2 / ||w||

In order to maximize the margin, we need to minimize ||w||, with the
condition that there are no data points between H1 and H2:
  xi•w + b ≥ +1 when yi = +1
  xi•w + b ≤ −1 when yi = −1
These can be combined into: yi(xi•w + b) ≥ 1
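A tiny numerical sketch of these quantities: it computes the margin 2/||w|| and checks the combined constraint. The weight vector, offset, and data points are made up for illustration.

```python
import numpy as np

# Toy illustration with made-up values (not learned weights).
w = np.array([2.0, -1.0])
b = 0.5
X = np.array([[2.0, 0.0], [-2.0, 1.0]])   # two training points
y = np.array([+1, -1])

margin = 2.0 / np.linalg.norm(w)          # distance between H1 and H2
constraints_ok = y * (X @ w + b) >= 1     # y_i (x_i·w + b) >= 1 for every i

print(margin)           # 2/||w||
print(constraints_ok)   # both True if no point lies between H1 and H2
```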
Constrained Optimization Problem

Minimize $\|w\| = \sqrt{w \cdot w}$ subject to $y_i(x_i \cdot w + b) \ge 1$ for all $i$.

Lagrangian method: maximize $\inf_w L(w, b, \alpha)$, where

$$L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i (x_i \cdot w + b) - 1 \right]$$

At the extremum, the partial derivatives of L with respect to both w and b must be 0.
Taking the derivatives, setting them to 0, substituting back into L, and simplifying yields:

Maximize
$$\sum_i \alpha_i - \frac{1}{2}\sum_{i,j} y_i y_j \alpha_i \alpha_j \, x_i \cdot x_j$$
subject to
$$\sum_i \alpha_i y_i = 0 \quad \text{and} \quad \alpha_i \ge 0$$
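For concreteness, the sketch below solves this dual problem numerically on a made-up four-point data set using SciPy's general-purpose SLSQP optimizer; a real SVM implementation would use a dedicated quadratic-programming solver instead.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up, linearly separable toy data for illustration only.
X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])

# Matrix of y_i y_j (x_i · x_j) terms appearing in the dual objective.
Q = (y[:, None] * y[None, :]) * (X @ X.T)

def neg_dual(alpha):
    # Negative of the dual objective: sum(alpha) - 1/2 * alpha^T Q alpha.
    return -(alpha.sum() - 0.5 * alpha @ Q @ alpha)

constraints = {"type": "eq", "fun": lambda a: a @ y}   # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * len(y)                        # alpha_i >= 0

res = minimize(neg_dual, x0=np.zeros(len(y)), bounds=bounds,
               constraints=constraints, method="SLSQP")
alpha = res.x

# Recover the weight vector: w = sum_i alpha_i y_i x_i (b omitted for brevity).
w = (alpha * y) @ X
print("alpha =", np.round(alpha, 3))
print("w =", np.round(w, 3))
```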
Quadratic Programming
• Why is this reformulation a good thing?
• The problem

  Maximize $\sum_i \alpha_i - \frac{1}{2}\sum_{i,j} y_i y_j \alpha_i \alpha_j \, x_i \cdot x_j$
  subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \ge 0$

  is an instance of what is called a positive semi-definite programming problem
• For a fixed real-number accuracy, it can be solved in
  O(n log n) time = O(|D|² log |D|²)
Problems with linear SVM

[Figure: data with labels = −1 and = +1 that cannot be separated by a straight line]

What if the decision function is not linear?


Kernel Trick

The data points are linearly separable in the space $(x_1^2,\, x_2^2,\, \sqrt{2}\,x_1 x_2)$.

We want to maximize
$$\sum_i \alpha_i - \frac{1}{2}\sum_{i,j} y_i y_j \alpha_i \alpha_j \, F(x_i) \cdot F(x_j)$$
Define $K(x_i, x_j) = F(x_i) \cdot F(x_j)$.
Cool thing: K is often easy to compute directly! Here,
$$K(x_i, x_j) = (x_i \cdot x_j)^2$$
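A quick numerical check of this identity, using exactly the feature map F and kernel K defined above (the two test points are random):

```python
import numpy as np

# Verify that F(x) = (x1^2, x2^2, sqrt(2)*x1*x2) gives the same dot
# products as the kernel K(xi, xj) = (xi · xj)^2, without projecting.
def F(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def K(xi, xj):
    return np.dot(xi, xj) ** 2

rng = np.random.default_rng(0)
xi, xj = rng.normal(size=2), rng.normal(size=2)

print(np.dot(F(xi), F(xj)))   # dot product in the projected space
print(K(xi, xj))              # same value, computed directly
```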
Other Kernels

The polynomial kernel:
K(xi, xj) = (xi•xj + 1)^p, where p is a tunable parameter.
Evaluating K only requires one addition and one exponentiation
more than the original dot product.

Gaussian kernels (also called radial basis functions):
K(xi, xj) = exp(−||xi − xj||² / (2σ²))
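Both kernels are one-liners in code; a minimal sketch (the parameter values p and σ are arbitrary examples):

```python
import numpy as np

# Sketches of the two kernels above; p and sigma are tunable parameters
# whose default values here are arbitrary examples.
def polynomial_kernel(xi, xj, p=3):
    return (np.dot(xi, xj) + 1) ** p

def gaussian_kernel(xi, xj, sigma=1.0):
    diff = xi - xj
    return np.exp(-np.dot(diff, diff) / (2 * sigma ** 2))

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(xi, xj))
print(gaussian_kernel(xi, xj))
```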
Overtraining/overfitting
A well-known problem with machine learning methods is overtraining.
This means that we have learned the training data very well, but
we cannot classify unseen examples correctly.
An example: a botanist who knows his trees so well that every time he sees
a new tree, he claims it is not a tree.

[Figure: an overly complex decision boundary wound around the = −1 and = +1 training points]
Overtraining/overfitting 2
There is a measure of the risk of overtraining with SVMs (there are also other
measures).
It can be shown that the portion, n, of unseen data that will be
misclassified is bounded by:
  n ≤ number of support vectors / number of training examples

Ockham's razor principle: simpler systems are better than more complex ones.
In the SVM case: fewer support vectors mean a simpler representation of the
hyperplane.

Example: understanding a certain cancer is easier if it can be described by
one gene than if we have to describe it with 5000.
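As a rough illustration of this bound, the sketch below fits scikit-learn's SVC on made-up data and reports the fraction of training points that end up as support vectors (SVC stores their indices in the support_ attribute):

```python
import numpy as np
from sklearn.svm import SVC

# Made-up data: 200 points, labeled by a simple linear rule.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = SVC(kernel="linear").fit(X, y)
bound = len(clf.support_) / len(X)   # support vectors / training examples
print(f"support vectors: {len(clf.support_)} / {len(X)}  (bound = {bound:.2f})")
```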
A practical example: protein localization
• Proteins are synthesized in the cytosol.
• They are transported into different subcellular locations where they carry out their functions.
• Aim: to predict in what location a certain protein will end up!!!
Subcellular Locations
Method
• Hypothesis: the amino acid composition of proteins from different compartments should differ.
• Extract proteins with known subcellular location from SWISSPROT.
• Calculate the amino acid composition of the proteins.
• Try to differentiate between cytosol, extracellular, mitochondria, and nuclear proteins by using SVMs.
Input encoding
Prediction of nuclear proteins:
Label the known nuclear proteins as +1 and all others as −1.
The input vector xi represents the amino acid composition,
e.g. xi = (4.2, 6.7, 12, …, 0.5), with one component per amino acid (A, C, D, …, Y).

[Diagram: the SVM model separates "Nuclear" proteins from "All others"]
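A minimal sketch of this encoding: the feature vector is the percentage of each of the 20 standard amino acids in a sequence (the example sequence is just a short stand-in, not a real SWISSPROT entry):

```python
# One feature per standard amino acid, in the order A, C, D, ..., Y.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_composition(sequence):
    """Percentage of each amino acid in the sequence."""
    n = len(sequence)
    return [100.0 * sequence.count(aa) / n for aa in AMINO_ACIDS]

x_i = amino_acid_composition("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(x_i)   # 20 numbers, one per amino acid, summing to ~100
```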
Cross-validation
Cross-validation: split the data into n sets, train on n−1 sets, and test on the set left out of training.

[Diagram: the "Nuclear" and "All others" examples are each split into subsets 1, 2, 3; one subset forms the test set while the remaining subsets form the training set]
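With scikit-learn this procedure is one call to cross_val_score; the sketch below uses made-up composition vectors in place of the real SWISSPROT data:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Made-up data standing in for the amino-acid-composition vectors above.
rng = np.random.default_rng(0)
X = rng.random(size=(120, 20))        # 120 "proteins", 20 composition features
y = rng.integers(0, 2, size=120)      # nuclear vs. all others (here 1/0)

scores = cross_val_score(SVC(kernel="linear"), X, y, cv=3)   # n = 3 folds
print(scores)          # accuracy on each held-out fold
print(scores.mean())
```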
Performance measurements

[Diagram: the model's predictions on the test data, with examples marked as TP, FP, TN, and FN; legend: = +1, = −1]

SP = TP / (TP + FP), the fraction of predicted +1 that actually are +1.
SE = TP / (TP + FN), the fraction of the +1 that actually are predicted as +1.
In this case: SP = 5 / (5 + 1) = 0.83
              SE = 5 / (5 + 2) = 0.71
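In code these two measures are simple ratios of the confusion-matrix counts (SP is what is usually called precision, SE is recall); a minimal sketch reproducing the numbers above:

```python
def sp_se(tp, fp, fn):
    sp = tp / (tp + fp)   # fraction of predicted +1 that really are +1
    se = tp / (tp + fn)   # fraction of true +1 that are predicted as +1
    return sp, se

print(sp_se(tp=5, fp=1, fn=2))   # (0.833..., 0.714...) as in the example
```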
A Cautionary Example

Image classification of tanks: autofire when an enemy tank is spotted.
Input data: photos of our own and enemy tanks.
It worked really well on the training set used.
In reality it failed completely.

Reason: all photos of enemy tanks were taken in the morning, and all photos of
our own tanks at dawn. The classifier had simply learned to tell morning from dawn!!!!
References
http://www.kernel-machines.org/
http://www.support-vector.net/

N. Cristianini and J. Shawe-Taylor: An Introduction to Support Vector Machines
(and other kernel-based learning methods). Cambridge University Press, 2000.
ISBN 0-521-78019-5.

Papers by Vapnik.

C.J.C. Burges: A tutorial on Support Vector Machines. Data Mining and
Knowledge Discovery 2:121-167, 1998.