Support Vector Machine

Support Vector Machines (SVM) are supervised learning algorithms primarily used for classification tasks, aiming to find the optimal hyperplane that separates different classes in n-dimensional space. SVM can handle both linearly and non-linearly separable data, utilizing support vectors to define the decision boundary and maximize the margin between classes. The document also discusses mathematical formulations, optimization problems, and the kernel trick for mapping data into higher-dimensional spaces to improve classification accuracy.


Machine Learning Group, University of Texas at Austin

Support Vector Machines


• Support Vector Machine (SVM) is one of the most popular supervised
  learning algorithms; it can be used for both classification and regression
  problems.
• However, it is primarily used for classification problems in machine
  learning.
• The goal of the SVM algorithm is to find the best line or decision
  boundary that segregates n-dimensional space into classes, so that new
  data points can easily be placed in the correct category in the future.
• This best decision boundary is called a hyperplane. SVM chooses the
  extreme points/vectors that help in creating the hyperplane.
• These extreme points are called support vectors, and hence the algorithm is
  termed a Support Vector Machine.
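
As a quick illustration of this classification setting, the sketch below fits an SVM classifier with scikit-learn (an assumed dependency; the toy dataset and variable names are made up for illustration) and predicts the class of a new point.

```python
# Minimal sketch: fitting an SVM classifier, assuming scikit-learn is installed.
import numpy as np
from sklearn.svm import SVC

# Toy 2-D dataset: two classes that are linearly separable.
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],   # class -1
              [6.0, 5.0], [7.0, 7.0], [8.0, 6.0]])  # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear")   # linear decision boundary (hyperplane)
clf.fit(X, y)

# Classify a new, previously unseen point.
print(clf.predict([[4.0, 4.0]]))   # predicted class label
print(clf.support_vectors_)        # the extreme points that define the hyperplane
```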


Continue...
• Consider the below diagram in which there are two different categories
that are classified using a decision boundary or hyperplane:


Continue...
• Example: SVM can be understood with the example we used for the KNN
  classifier. Suppose we see a strange cat that also has some features of
  dogs. If we want a model that can accurately identify whether it is a cat
  or a dog, such a model can be created using the SVM algorithm.
• We first train the model with many images of cats and dogs so that it can
  learn the different features of cats and dogs, and then we test it on this
  strange creature.
• The support vectors define a decision boundary between these two classes
  (cat and dog); since the algorithm chooses the extreme cases (support
  vectors), it will look at the extreme cases of cat and dog.
• On the basis of the support vectors, it will classify the new example as a
  cat.


Continue...
• Consider the below diagram:

• The SVM algorithm can be used for face detection, image classification,
  text categorization, etc.


Types of SVM
• SVM can be of two types:

• Linear SVM: Linear SVM is used for linearly separable data. If a dataset
  can be classified into two classes by using a single straight line, then
  such data is termed linearly separable data, and the classifier used is
  called a Linear SVM classifier.

• Non-linear SVM: Non-linear SVM is used for non-linearly separable data.
  If a dataset cannot be classified by using a straight line, then such data
  is termed non-linear data, and the classifier used is called a Non-linear
  SVM classifier. A comparison of the two is sketched below.
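
As a rough sketch of this distinction (assuming scikit-learn; make_blobs and make_circles are just convenient toy datasets), a linear kernel suffices for linearly separable data, while an RBF kernel handles data that no straight line can separate:

```python
# Sketch: linear vs. non-linear SVM on toy data, assuming scikit-learn is installed.
from sklearn.datasets import make_blobs, make_circles
from sklearn.svm import SVC

# Linearly separable data: two well-separated blobs.
X_lin, y_lin = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)
print("linear kernel on blobs:  ", SVC(kernel="linear").fit(X_lin, y_lin).score(X_lin, y_lin))

# Non-linearly separable data: one class inside a ring of the other class.
X_circ, y_circ = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
print("linear kernel on circles:", SVC(kernel="linear").fit(X_circ, y_circ).score(X_circ, y_circ))
print("rbf kernel on circles:   ", SVC(kernel="rbf").fit(X_circ, y_circ).score(X_circ, y_circ))
```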


Continue...
• Hyperplane and Support Vectors in the SVM algorithm:
• Hyperplane: There can be multiple lines/decision boundaries that segregate
  the classes in n-dimensional space, but we need to find the best decision
  boundary that helps to classify the data points. This best boundary is
  known as the hyperplane of SVM.
• The dimensions of the hyperplane depend on the number of features in the
  dataset: if there are 2 features (as shown in the image), the hyperplane is
  a straight line; if there are 3 features, the hyperplane is a 2-dimensional
  plane.
• We always create the hyperplane that has the maximum margin, i.e. the
  maximum distance between the hyperplane and the nearest data points of each
  class.
• Support Vectors:
• The data points or vectors that are closest to the hyperplane and that
  affect the position of the hyperplane are termed support vectors. Since
  these vectors support the hyperplane, they are called support vectors.

How does SVM work?


Continue...
• Linear SVM:
• The working of the SVM algorithm can be understood by using an example.
  Suppose we have a dataset that has two labels (green and blue), and the
  dataset has two features x1 and x2. We want a classifier that can classify
  a pair (x1, x2) of coordinates as either green or blue. Consider the below
  image:

Continue...
• Since this is a 2-D space, the two classes can easily be separated by just
  using a straight line. But there can be multiple lines that separate these
  classes. Consider the below image:

• The SVM algorithm helps to find the best line or decision boundary; this
  best boundary or region is called a hyperplane.
• The SVM algorithm finds the points of both classes that are closest to the
  boundary. These points are called support vectors.
• The distance between these vectors and the hyperplane is called the margin,
  and the goal of SVM is to maximize this margin. The hyperplane with the
  maximum margin is called the optimal hyperplane. A small sketch of these
  ideas on toy data follows.
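
A small sketch of these ideas (assuming scikit-learn; the two-feature toy data is made up for illustration): fit a linear SVM, read off the support vectors, and compute the margin width as 2/||w||.

```python
# Sketch: support vectors and margin of a linear SVM, assuming scikit-learn.
import numpy as np
from sklearn.svm import SVC

# Toy data with two features (x1, x2) and two labels ("blue" = 0, "green" = 1).
X = np.array([[1, 1], [2, 1], [1, 2], [2, 2],
              [5, 5], [6, 5], [5, 6], [6, 6]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)  # a large C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]                   # normal vector of the separating hyperplane
print("support vectors:\n", clf.support_vectors_)
print("margin width 2/||w|| =", 2.0 / np.linalg.norm(w))
```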

Perceptron Revisited: Linear Separators

• Binary classification can be viewed as the task of separating classes in
  feature space:

      w^T x + b = 0   (separating hyperplane)
      w^T x + b > 0   (one side)
      w^T x + b < 0   (other side)

      f(x) = sign(w^T x + b)
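
As a concrete reading of this decision rule, here is a plain NumPy sketch; the weight vector and test points below are arbitrary examples, not values from the slides.

```python
# Sketch: the linear decision rule f(x) = sign(w^T x + b) in NumPy.
import numpy as np

w = np.array([2.0, -1.0])   # example weight vector (normal to the hyperplane)
b = -1.0                    # example bias term

def f(x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return np.sign(w @ x + b)

print(f(np.array([3.0, 1.0])))   # w^T x + b = 4 > 0  -> +1
print(f(np.array([0.0, 2.0])))   # w^T x + b = -3 < 0 -> -1
```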


Linear Separators

• Which of the linear separators is optimal?


Classification Margin
• The (signed) distance from example x_i to the separator is
      r = (w^T x_i + b) / ||w||
• Examples closest to the hyperplane are support vectors.
• The margin ρ of the separator is the distance between the support vectors
  of the two classes (the width of the band around the hyperplane that
  contains no training points). A small numeric check appears below.
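
A small numeric check of these quantities (plain NumPy; the hyperplane and points are arbitrary illustrations):

```python
# Sketch: distance of points to a hyperplane and the resulting margin, in NumPy.
import numpy as np

w = np.array([1.0, 1.0])    # example hyperplane normal
b = -3.0                    # example bias: the hyperplane is x1 + x2 = 3

def signed_distance(x):
    """Signed distance r = (w^T x + b) / ||w|| from point x to the hyperplane."""
    return (w @ x + b) / np.linalg.norm(w)

positives = np.array([[3.0, 1.0], [4.0, 2.0]])   # class +1 examples
negatives = np.array([[1.0, 1.0], [0.0, 2.0]])   # class -1 examples

# The margin is the gap between the closest examples on either side.
rho = min(signed_distance(x) for x in positives) - max(signed_distance(x) for x in negatives)
print("margin rho =", rho)
```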


Maximum Margin Classification


• Maximizing the margin is good according to intuition and PAC theory.
• It implies that only the support vectors matter; the other training
  examples are ignorable.


Linear SVM Mathematically


• Let the training set {(x_i, y_i)}, i = 1..n, with x_i ∈ R^d and
  y_i ∈ {-1, 1}, be separated by a hyperplane with margin ρ. Then for each
  training example (x_i, y_i):

      w^T x_i + b ≤ -ρ/2   if y_i = -1
      w^T x_i + b ≥  ρ/2   if y_i =  1

  which is equivalent to  y_i (w^T x_i + b) ≥ ρ/2.

• For every support vector x_s the above inequality is an equality. After
  rescaling w and b by ρ/2 in the equality, we obtain that the distance
  between each x_s and the hyperplane is

      r = y_s (w^T x_s + b) / ||w|| = 1 / ||w||

• Then the margin can be expressed through the (rescaled) w and b as:

      ρ = 2r = 2 / ||w||


Linear SVMs Mathematically (cont.)

• Then we can formulate the quadratic optimization problem:

  Find w and b such that
      ρ = 2 / ||w||  is maximized
  and for all (x_i, y_i), i = 1..n :  y_i (w^T x_i + b) ≥ 1

  which can be reformulated as:

  Find w and b such that
      Φ(w) = ||w||² = w^T w  is minimized
  and for all (x_i, y_i), i = 1..n :  y_i (w^T x_i + b) ≥ 1


Solving the Optimization Problem

Find w and b such that
    Φ(w) = w^T w  is minimized
and for all (x_i, y_i), i = 1..n :  y_i (w^T x_i + b) ≥ 1

• We need to optimize a quadratic function subject to linear constraints.

• Quadratic optimization problems are a well-known class of mathematical
  programming problems for which several (non-trivial) algorithms exist.
• The solution involves constructing a dual problem in which a Lagrange
  multiplier α_i is associated with every inequality constraint of the primal
  (original) problem:

  Find α_1 … α_n such that
      Q(α) = Σ α_i - ½ Σ Σ α_i α_j y_i y_j x_i^T x_j  is maximized and
      (1) Σ α_i y_i = 0
      (2) α_i ≥ 0 for all α_i

  A sketch of solving this dual numerically follows.
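
As one possible sketch of solving this dual numerically (assuming the cvxopt package is available; the tiny separable dataset is made up), the problem is cast in cvxopt's standard QP form: minimize ½ αᵀPα + qᵀα subject to Gα ≤ h and Aα = b.

```python
# Sketch: solving the hard-margin SVM dual with a generic QP solver (cvxopt assumed).
import numpy as np
from cvxopt import matrix, solvers

# Tiny linearly separable dataset.
X = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 4.0], [5.0, 4.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
n = len(y)

# Dual: maximize  Σ α_i - ½ ΣΣ α_i α_j y_i y_j x_i^T x_j
# In cvxopt's minimization form:  min ½ α^T P α + q^T α,  G α <= h,  A α = b.
P = matrix(np.outer(y, y) * (X @ X.T))   # P_ij = y_i y_j x_i^T x_j
q = matrix(-np.ones(n))                  # the -1 vector flips max into min
G = matrix(-np.eye(n))                   # -α_i <= 0, i.e. α_i >= 0
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1))             # equality constraint: Σ α_i y_i = 0
b = matrix(0.0)

sol = solvers.qp(P, q, G, h, A, b)
alpha = np.ravel(sol["x"])
print("alphas:", np.round(alpha, 4))     # non-zero alphas mark the support vectors
```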

The Optimization Problem Solution

• Given a solution α_1 … α_n to the dual problem, the solution to the primal
  is:

      w = Σ α_i y_i x_i        b = y_k - Σ α_i y_i x_i^T x_k   for any α_k > 0

• Each non-zero α_i indicates that the corresponding x_i is a support vector.

• Then the classifying function is (note that we don't need w explicitly):

      f(x) = Σ α_i y_i x_i^T x + b

• Notice that it relies on an inner product between the test point x and the
  support vectors x_i – we will return to this later.
• Also keep in mind that solving the optimization problem involved computing
  the inner products x_i^T x_j between all training points. The sketch below
  recovers w and b from a fitted model in this way.
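
A sketch of these recovery formulas using scikit-learn's fitted attributes (an assumed dependency; note that clf.dual_coef_ stores the products α_i·y_i for the support vectors, not α_i alone):

```python
# Sketch: recovering w and b from the dual solution, assuming scikit-learn.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 4.0], [5.0, 4.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

alpha_y = clf.dual_coef_[0]          # alpha_i * y_i for each support vector
sv = clf.support_vectors_

w = alpha_y @ sv                     # w = sum_i alpha_i y_i x_i
k = clf.support_[0]                  # index of one support vector
b = y[k] - w @ X[k]                  # b = y_k - w^T x_k

print("w:", w, " vs clf.coef_:", clf.coef_[0])
print("b:", b, " vs clf.intercept_:", clf.intercept_[0])
```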

Soft Margin Classification

• What if the training set is not linearly separable?

• Slack variables ξ_i can be added to allow misclassification of difficult or
  noisy examples; the resulting margin is called a soft margin.


Soft Margin Classification Mathematically

• The old formulation:

  Find w and b such that
      Φ(w) = w^T w  is minimized
  and for all (x_i, y_i), i = 1..n :  y_i (w^T x_i + b) ≥ 1

• The modified formulation incorporates slack variables:

  Find w and b such that
      Φ(w) = w^T w + C Σ ξ_i  is minimized
  and for all (x_i, y_i), i = 1..n :  y_i (w^T x_i + b) ≥ 1 - ξ_i,  ξ_i ≥ 0

• The parameter C can be viewed as a way to control overfitting: it "trades
  off" the relative importance of maximizing the margin and fitting the
  training data. The sketch below shows its effect on a noisy dataset.
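
A quick sketch of that trade-off (scikit-learn assumed; the overlapping blobs are arbitrary toy data): a small C tolerates more margin violations and keeps a wider margin with more support vectors, while a large C fits the training data more tightly.

```python
# Sketch: how the soft-margin parameter C trades margin width against training fit.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping (not perfectly separable) toy data.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:6}: margin={margin:.3f}, "
          f"support vectors={len(clf.support_vectors_)}, "
          f"train accuracy={clf.score(X, y):.3f}")
```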


Soft Margin Classification – Solution

• The dual problem is identical to the separable case (it would not be
  identical if the 2-norm penalty for slack variables, C Σ ξ_i², were used in
  the primal objective; we would then need additional Lagrange multipliers
  for the slack variables):

  Find α_1 … α_n such that
      Q(α) = Σ α_i - ½ Σ Σ α_i α_j y_i y_j x_i^T x_j  is maximized and
      (1) Σ α_i y_i = 0
      (2) 0 ≤ α_i ≤ C for all α_i

• Again, the x_i with non-zero α_i will be support vectors.

• The solution to the dual problem is:

      w = Σ α_i y_i x_i
      b = y_k (1 - ξ_k) - Σ α_i y_i x_i^T x_k   for any k such that α_k > 0

  Again, we don't need to compute w explicitly for classification:

      f(x) = Σ α_i y_i x_i^T x + b


Theoretical Justification for Maximum Margins

• Vapnik has proved the following:

  The class of optimal linear separators has VC dimension h bounded from
  above as

      h ≤ min( ⌈D² / ρ²⌉ , m₀ ) + 1

  where ρ is the margin, D is the diameter of the smallest sphere that can
  enclose all of the training examples, and m₀ is the dimensionality.

• Intuitively, this implies that regardless of the dimensionality m₀ we can
  minimize the VC dimension by maximizing the margin ρ.

• Thus, the complexity of the classifier is kept small regardless of
  dimensionality.


Linear SVMs: Overview

• The classifier is a separating hyperplane.

• The most "important" training points are the support vectors; they define
  the hyperplane.

• Quadratic optimization algorithms can identify which training points x_i
  are support vectors, i.e. those with non-zero Lagrange multipliers α_i.

• Both in the dual formulation of the problem and in the solution, training
  points appear only inside inner products:

  Find α_1 … α_n such that
      Q(α) = Σ α_i - ½ Σ Σ α_i α_j y_i y_j x_i^T x_j  is maximized and
      (1) Σ α_i y_i = 0
      (2) 0 ≤ α_i ≤ C for all α_i

      f(x) = Σ α_i y_i x_i^T x + b


Non-linear SVMs

• Datasets that are linearly separable (perhaps with some noise) work out
  great:

  [Figure: 1-D data along the x-axis, separable by a single threshold at 0]

• But what are we going to do if the dataset is just too hard?

  [Figure: 1-D data along the x-axis where no single threshold separates the classes]

• How about mapping the data to a higher-dimensional space:

  [Figure: the same data mapped to (x, x²), where a straight line now separates the classes]

  A sketch of this mapping appears below.
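
A minimal sketch of that idea in NumPy (the 1-D toy data is made up): points that no single threshold can separate become linearly separable after mapping x → (x, x²).

```python
# Sketch: making 1-D data separable by mapping x -> (x, x^2).
import numpy as np

# 1-D data: the positive class sits on both ends, the negative class in the
# middle, so no single threshold on x separates them.
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.array([ 1,    1,   -1,  -1,  -1,   1,   1 ])

# Map to a 2-D feature space: phi(x) = (x, x^2).
phi = np.column_stack([x, x ** 2])

# In the new space the horizontal line x2 = 2 separates the classes.
predictions = np.where(phi[:, 1] > 2.0, 1, -1)
print("correctly separated:", np.array_equal(predictions, y))
```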

Non-linear SVMs: Feature spaces

• General idea: the original feature space can always be mapped to some
higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)


The “Kernel Trick”

• The linear classifier relies on an inner product between vectors:
      K(x_i, x_j) = x_i^T x_j
• If every data point is mapped into a high-dimensional space via some
  transformation Φ: x → φ(x), the inner product becomes:
      K(x_i, x_j) = φ(x_i)^T φ(x_j)
• A kernel function is a function that is equivalent to an inner product in
  some feature space.
• Example:
  2-dimensional vectors x = [x1 x2]; let K(x_i, x_j) = (1 + x_i^T x_j)².
  We need to show that K(x_i, x_j) = φ(x_i)^T φ(x_j):
      K(x_i, x_j) = (1 + x_i^T x_j)²
                  = 1 + x_i1² x_j1² + 2 x_i1 x_j1 x_i2 x_j2 + x_i2² x_j2² + 2 x_i1 x_j1 + 2 x_i2 x_j2
                  = [1  x_i1²  √2 x_i1 x_i2  x_i2²  √2 x_i1  √2 x_i2]^T [1  x_j1²  √2 x_j1 x_j2  x_j2²  √2 x_j1  √2 x_j2]
                  = φ(x_i)^T φ(x_j),   where φ(x) = [1  x1²  √2 x1 x2  x2²  √2 x1  √2 x2]
• Thus, a kernel function implicitly maps data to a high-dimensional space
  (without the need to compute each φ(x) explicitly). A numeric check of this
  identity appears below.
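
A quick NumPy check of this identity on arbitrary example vectors:

```python
# Sketch: verifying the kernel identity (1 + x^T z)^2 == phi(x)^T phi(z) numerically.
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D inputs."""
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2)*x1*x2, x2**2, np.sqrt(2)*x1, np.sqrt(2)*x2])

def K(x, z):
    """Polynomial kernel computed directly in the original 2-D space."""
    return (1.0 + x @ z) ** 2

x = np.array([1.0, 2.0])   # arbitrary example vectors
z = np.array([3.0, -1.0])

print(K(x, z))             # kernel value without leaving 2-D space
print(phi(x) @ phi(z))     # the same value via the explicit 6-D feature map
```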

What Functions are Kernels?

• For some functions K(x_i, x_j), checking that K(x_i, x_j) = φ(x_i)^T φ(x_j)
  can be cumbersome.
• Mercer's theorem:
  Every positive semi-definite symmetric function is a kernel.
• Positive semi-definite symmetric functions correspond to a positive
  semi-definite symmetric Gram matrix:

        | K(x1,x1)  K(x1,x2)  K(x1,x3)  …  K(x1,xn) |
        | K(x2,x1)  K(x2,x2)  K(x2,x3)  …  K(x2,xn) |
    K = |    …         …         …      …     …     |
        | K(xn,x1)  K(xn,x2)  K(xn,x3)  …  K(xn,xn) |

  One way to check this numerically is sketched below.
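
A small sketch of that check (plain NumPy; the sample points and the RBF kernel are arbitrary choices): build the Gram matrix for a candidate kernel and confirm its eigenvalues are non-negative.

```python
# Sketch: checking positive semi-definiteness of a kernel's Gram matrix in NumPy.
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel, used here as the candidate kernel to test."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # arbitrary sample points

# Gram matrix K_ij = K(x_i, x_j).
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])

eigenvalues = np.linalg.eigvalsh(K)   # symmetric matrix -> real eigenvalues
print("smallest eigenvalue:", eigenvalues.min())
print("positive semi-definite (up to round-off):", eigenvalues.min() > -1e-10)
```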


Examples of Kernel Functions


• Linear: K(x_i, x_j) = x_i^T x_j
  – Mapping Φ: x → φ(x), where φ(x) is x itself.

• Polynomial of power p: K(x_i, x_j) = (1 + x_i^T x_j)^p
  – Mapping Φ: x → φ(x), where φ(x) has C(d+p, p) dimensions (the binomial
    coefficient "d+p choose p").

• Gaussian (radial-basis function): K(x_i, x_j) = exp( -||x_i - x_j||² / (2σ²) )
  – Mapping Φ: x → φ(x), where φ(x) is infinite-dimensional: every point is
    mapped to a function (a Gaussian); a combination of such functions for the
    support vectors is the separator.

• The higher-dimensional space still has intrinsic dimensionality d (the
  mapping is not onto), but linear separators in it correspond to non-linear
  separators in the original space. These kernels are sketched in code below.
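
A compact sketch of these three kernel functions in NumPy (σ, p, and the test vectors are arbitrary example values):

```python
# Sketch: the three example kernels as plain NumPy functions.
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, p=2):
    return (1.0 + x @ z) ** p

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(linear_kernel(x, z), polynomial_kernel(x, z), gaussian_kernel(x, z))
```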

Non-linear SVMs Mathematically

• Dual problem formulation:

  Find α_1 … α_n such that
      Q(α) = Σ α_i - ½ Σ Σ α_i α_j y_i y_j K(x_i, x_j)  is maximized and
      (1) Σ α_i y_i = 0
      (2) α_i ≥ 0 for all α_i

• The solution is:

      f(x) = Σ α_i y_i K(x_i, x) + b

• The optimization techniques for finding the α_i's remain the same! A usage
  sketch with a custom kernel follows.
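
As a usage sketch (scikit-learn assumed; the toy dataset and kernel choice are illustrative), SVC accepts a callable kernel that returns the Gram matrix, so the same optimization machinery runs with a hand-written kernel:

```python
# Sketch: plugging a hand-written kernel into scikit-learn's SVC.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

def poly2_kernel(A, B):
    """Degree-2 polynomial kernel; returns the Gram matrix between rows of A and B."""
    return (1.0 + A @ B.T) ** 2

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

clf = SVC(kernel=poly2_kernel)   # custom callable kernel
clf.fit(X, y)
print("training accuracy with custom kernel:", clf.score(X, y))
```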


SVM applications

• SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and
  gained increasing popularity in the late 1990s.
• SVMs are currently among the best performers for a number of classification
  tasks ranging from text to genomic data.
• SVMs can be applied to complex data types beyond feature vectors (e.g.
  graphs, sequences, relational data) by designing kernel functions for such
  data.
• SVM techniques have been extended to a number of tasks such as regression
  [Vapnik et al. '97], principal component analysis [Schölkopf et al. '99],
  etc.
• The most popular optimization algorithms for SVMs use decomposition to
  hill-climb over a subset of the α_i's at a time, e.g. SMO [Platt '99] and
  [Joachims '99].
• Tuning SVMs remains a black art: selecting a specific kernel and its
  parameters is usually done in a trial-and-error manner.
