13.1 Support Vector Machine

This document discusses support vector machines (SVMs) for classification. It explains that SVMs find the optimal separating hyperplane that maximizes the margin between the two classes of data. This hyperplane is known as the maximum marginal hyperplane (MMH). The data points that lie closest to the MMH are called the support vectors, and they are the most informative for determining the classification. SVMs can handle both linearly separable and non-linearly separable data using techniques like soft margins and kernels to project the data into higher dimensions where linear separation is possible.


SUPPORT VECTOR MACHINE

Pranya PO, MSc


Classification of customers

[Figure, repeated over several slides: a scatter plot of customers by Income (vertical axis) and Age (horizontal axis). One symbol marks customers who buy computers (YES) and the other marks customers who do not (NO); the two groups occupy different regions of the plot. A new, unlabeled customer, shown as "?", must be assigned to one of the two classes, which motivates drawing a decision boundary between the groups.]
Support vector machine
• Support vector machine (SVM) is a method for the classification of both linear and
nonlinear data.
• It uses a nonlinear mapping to transform the original training data into a higher
dimension. Within this new dimension, it searches for the linear optimal separating
hyperplane: a “decision boundary” separating the tuples of one class from another.

• The SVM finds this hyperplane using support vectors ("essential" training tuples) and margins (defined by the support vectors).
  [Figure: the two customer classes separated by a hyperplane, with Y and N marking the two sides of the boundary.]
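To make the idea concrete, here is a minimal sketch (not part of the original slides) of fitting a linear SVM to a toy customer data set with scikit-learn; the numbers, the [age, income] feature order, and the +1/-1 class encoding are all invented for illustration.

```python
# Minimal sketch (assumed): a linear SVM on a toy age/income data set.
# The data values below are invented purely for illustration.
import numpy as np
from sklearn.svm import SVC

# Each row is [age, income]; label +1 = buys a computer, -1 = does not.
X = np.array([[25, 40], [30, 55], [35, 60],
              [45, 30], [50, 25], [55, 35]], dtype=float)
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1.0)   # linear decision boundary
clf.fit(X, y)

print(clf.support_vectors_)         # the "essential" training tuples
print(clf.predict([[40, 45]]))      # classify a new customer
```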
Main ideas of SVM
• Maximum marginal hyperplane
  • Formalizes the notion of the best linear separator
• Lagrangian multipliers
  • A way to convert a constrained optimization problem into one that is easier to solve
• Slack variables
  • Allow misclassification of difficult or noisy examples
• Kernel tricks
  • Projecting data into a higher-dimensional space can make it linearly separable
Linear SVM
Linear SVM
• Linear SVM: the case when the data are linearly separable.
• The simplest case: a two-class problem where the classes are linearly separable.
• Let the data set $D$ be given as $(X_1, y_1), (X_2, y_2), \ldots, (X_k, y_k)$, where each $X_i$ is a training tuple with an associated class label $y_i$.
• Each $y_i$ takes one of two values, either $+1$ or $-1$ ($y_i \in \{-1, +1\}$), corresponding to the classes buys_computer = yes and buys_computer = no, respectively.
Linear SVM
• The 2-D training data are linearly separable. There are an infinite number of possible separating hyperplanes or "decision boundaries," some of which are shown here as dashed lines. Which one is best?
  [Figure: linearly separable 2-D data with several candidate separating lines drawn as dashed lines.]
• An SVM approaches this problem by searching for the maximum marginal hyperplane (MMH).
Maximum marginal hyperplane
We expect the hyperplane with the larger margin to be more accurate at classifying future data tuples than the hyperplane with the smaller margin. This is why the SVM searches for the hyperplane with the largest margin.
Maximum marginal hyperplane
• The shortest distance from a hyperplane to one side of its margin is equal to the shortest distance from the hyperplane to the other side of its margin, where the "sides" of the margin are parallel to the hyperplane.
• This distance is the shortest distance from the MMH to the closest training tuple of either class.
Maximum marginal hyperplane
• A separating hyperplane can be written as
  $$W \cdot X + b = 0,$$
  where $W$ is a weight vector, namely $W = (w_1, w_2, \ldots, w_n)$; $n$ is the number of attributes; and $b$ is a scalar, often referred to as a bias.
• Consider two input attributes, $A_1$ and $A_2$, where $x_1$ and $x_2$ are the values of the attributes for $X$. The equation above can be rewritten as
  $$w_0 + w_1 x_1 + w_2 x_2 = 0,$$
  where $w_0$ is the scalar $b$.
Maximum marginal hyperplane
• Thus, any point that lies above the separating hyperplane satisfies
  $$w_0 + w_1 x_1 + w_2 x_2 > 0.$$
  Similarly, any point that lies below the separating hyperplane satisfies
  $$w_0 + w_1 x_1 + w_2 x_2 < 0.$$
• The weights can be adjusted so that the hyperplanes defining the "sides" of the margin can be written as
  $$H_1: \; w_0 + w_1 x_1 + w_2 x_2 \ge 1 \quad \text{for } y_i = +1,$$
  $$H_2: \; w_0 + w_1 x_1 + w_2 x_2 \le -1 \quad \text{for } y_i = -1.$$
• Any tuple that falls on or above $H_1$ belongs to class $+1$, and any tuple that falls on or below $H_2$ belongs to class $-1$. Combining the two inequalities, we get
  $$y_i (w_0 + w_1 x_1 + w_2 x_2) \ge 1, \quad \forall i.$$
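As a quick sanity check (assumed, not from the slides), the combined constraint can be evaluated numerically for a hand-picked hyperplane and a few toy tuples; all numbers below are invented for illustration.

```python
# Sketch (assumed): numerically checking y_i * (w0 + w1*x1 + w2*x2) >= 1
# for a hypothetical hyperplane and four toy training tuples.
import numpy as np

w0, w = -1.0, np.array([2.0, -1.0])           # hypothetical bias and weights
X = np.array([[2.0, 1.0], [3.0, 2.0],         # tuples of class +1
              [0.0, 1.0], [0.5, 2.0]])         # tuples of class -1
y = np.array([+1, +1, -1, -1])

margins = y * (w0 + X @ w)                     # y_i * (w0 + w1*x1 + w2*x2)
print(margins)                                 # every value is >= 1
print(np.all(margins >= 1))                    # True: all tuples respect H1/H2
```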
Maximum marginal hyperplane
• Any training tuples that fall on hyperplanes $H_1$ or $H_2$ are called support vectors.
• Essentially, the support vectors are the most difficult tuples to classify and give the most information regarding classification.
Maximum marginal hyperplane
• The distance from the separating hyperplane to any point on $H_1$ is $\frac{1}{\lVert W \rVert}$, where $\lVert W \rVert = \sqrt{W \cdot W} = \sqrt{w_1^2 + w_2^2 + \cdots + w_n^2}$ is the Euclidean norm of $W$.
• By definition, this is equal to the distance from any point on $H_2$ to the separating hyperplane.
• Therefore, the maximal margin is $\frac{2}{\lVert W \rVert}$.
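The margin width can be read off a trained linear SVM. The sketch below (assumed; it uses scikit-learn rather than anything from the slides) fits a near hard-margin SVM to four toy points whose true margin is 2 and recovers $2 / \lVert W \rVert$ from the learned weights.

```python
# Sketch (assumed): recovering the maximal margin 2/||W|| from a fitted
# linear SVM; scikit-learn exposes the learned weight vector as clf.coef_.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([-1, -1, +1, +1])                 # separable, true margin = 2

clf = SVC(kernel="linear", C=1e6).fit(X, y)    # very large C ~ hard margin
w = clf.coef_[0]                               # W = (w1, ..., wn)
margin = 2.0 / np.linalg.norm(w)               # maximal margin = 2 / ||W||
print(w, clf.intercept_, margin)               # margin is close to 2
```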
Maximum marginal hyperplane
• So, how does an SVM find the MMH and the support vectors?
• We can rewrite $y_i (w_0 + w_1 x_1 + w_2 x_2) \ge 1, \ \forall i$ so that it becomes what is known as a constrained quadratic optimization problem, using a Lagrangian formulation.
• The solution is then obtained using the Karush-Kuhn-Tucker (KKT) conditions.
Maximum marginal hyperplane
• Based on the Lagrangian formulation mentioned before, the MMH can be rewritten as the decision boundary
  $$d(X^T) = \sum_{i=1}^{l} y_i \,\alpha_i \, X_i \cdot X^T + b_0,$$
  where $y_i$ is the class label of support vector $X_i$; $X^T$ is a test tuple; $\alpha_i$ (the Lagrange multipliers) and $b_0$ are numeric parameters that were determined automatically by the optimization or SVM algorithm noted before; and $l$ is the number of support vectors.
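This decision boundary can be reproduced from a fitted model. A minimal sketch (assumed, based on scikit-learn's documented attributes rather than on the slides): SVC stores the products $y_i \alpha_i$ in dual_coef_ and the support vectors in support_vectors_, so the sum above can be computed by hand and compared with decision_function.

```python
# Sketch (assumed): reproducing d(X^T) = sum_i y_i*alpha_i*(X_i . X^T) + b0
# from a fitted scikit-learn SVC with a linear kernel.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)     # a linearly separable toy labeling

clf = SVC(kernel="linear").fit(X, y)

X_test = rng.normal(size=(3, 2))
# dual_coef_ holds y_i * alpha_i for the support vectors only.
manual = clf.dual_coef_ @ (clf.support_vectors_ @ X_test.T) + clf.intercept_
print(manual.ravel())
print(clf.decision_function(X_test))           # matches the manual sum
```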

Soft margin classification


• What if the training set is not linearly separable?
• Slack variables $\xi_i$ can be added to allow misclassification of difficult or noisy examples; the resulting margin is called a soft margin.
• We then pay a cost for each misclassified example, which depends on how far it is from meeting the margin requirement.
• Adding the slack variables $\xi_i$ to $y_i (w_0 + w_1 x_1 + w_2 x_2) \ge 1$, the constraint is rewritten as
  $$y_i (w_0 + w_1 x_1 + w_2 x_2) \ge 1 - \xi_i.$$
  [Figure: two points that violate the margin, with their slack distances labeled $\xi_i$ and $\xi_j$.]
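In practice the cost paid per unit of slack is a tunable hyperparameter; in scikit-learn it is the C argument of SVC. The sketch below (assumed, toy data with a few deliberately flipped labels) shows how varying C changes the number of support vectors and the resulting margin width.

```python
# Sketch (assumed): the soft margin in scikit-learn is controlled by C,
# the penalty per unit of slack. Typically, a smaller C tolerates more
# margin violations (wider margin, more support vectors), while a larger
# C penalizes violations heavily (narrower margin).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] > 0, 1, -1)
y[:5] = -y[:5]                                  # flip a few labels to add noise

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, len(clf.support_), 2.0 / np.linalg.norm(clf.coef_[0]))
```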
Non-linear SVM

Non-linear SVM
• Datasets that are linearly separable work out great:
  [Figure: points on a one-dimensional x-axis that a single threshold separates.]
• But what are we going to do if the dataset is just too hard?
  [Figure: points on the x-axis with the two classes interleaved, so no single threshold separates them.]
• How about mapping the data to a higher-dimensional space:
  [Figure: the same points lifted to two dimensions, e.g. $x \mapsto (x, x^2)$, where a straight line separates the classes.]
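A minimal sketch of that lifting idea (assumed; the toy points and the specific map $x \mapsto (x, x^2)$ are illustrative, not taken from the slides): the 1-D classes below cannot be split by a single threshold, but after the lift a linear SVM separates them perfectly.

```python
# Sketch (assumed): making 1-D non-separable data separable by mapping
# each point x to (x, x**2) and fitting a linear SVM in the lifted space.
import numpy as np
from sklearn.svm import SVC

x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.array([+1, +1, -1, -1, -1, +1, +1])      # outer vs. inner points:
                                                # no single threshold on x works

X2 = np.column_stack([x, x ** 2])               # lift to (x, x^2)
clf = SVC(kernel="linear").fit(X2, y)
print(clf.score(X2, y))                         # 1.0: perfectly separated in 2-D
```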

Feature spaces
General idea: the original feature space can always be mapped to
some higher-dimensional feature space where the training set is
separable:

Φ: x → φ(x)

The “kernel trick”


• The linear classifier relies on an inner product between vectors: $K(x_i, x_j) = x_i^T x_j$.
• If every data point is mapped into a high-dimensional space via some transformation $\Phi: x \mapsto \varphi(x)$, the inner product becomes $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$.
• A kernel function is a function that corresponds to an inner product in some expanded feature space.
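A small worked check of this correspondence (assumed, not from the slides): for 2-D inputs, the polynomial kernel $K(x, z) = (x^T z)^2$ equals an ordinary inner product after the explicit map $\varphi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, so the kernel computes the expanded-space inner product without ever constructing $\varphi$.

```python
# Sketch (assumed): the kernel trick for K(x, z) = (x . z)**2 in 2-D.
import numpy as np

def phi(v):
    # Explicit feature map whose inner product equals (x . z)**2.
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

kernel_value = (x @ z) ** 2             # computed in the original 2-D space
explicit_value = phi(x) @ phi(z)        # same value via the explicit mapping
print(kernel_value, explicit_value)     # both equal 1.0
```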

Kernels
• Why use kernels?
  • They can make a non-separable problem separable.
  • They map data into a better representational space.
• Common kernels: linear, polynomial, radial basis function (RBF/Gaussian), and sigmoid.
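A minimal sketch (assumed, random toy data) showing how these kernels are selected in scikit-learn's SVC; the circular class boundary is chosen so the nonlinear kernels have an edge over the linear one.

```python
# Sketch (assumed): trying the common kernel choices exposed by SVC.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, -1)   # circular boundary

models = {
    "linear": SVC(kernel="linear"),
    "polynomial": SVC(kernel="poly", degree=3),
    "rbf (Gaussian)": SVC(kernel="rbf", gamma="scale"),
    "sigmoid": SVC(kernel="sigmoid"),
}
for name, clf in models.items():
    print(name, clf.fit(X, y).score(X, y))     # training accuracy per kernel
```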
Multiclass classification
One-versus-all/one-versus-rest
• Given m classes, we train m binary classifiers, one for each class.
• Classifier j is trained using data of class j as the positive class, and the
remaining data as the negative class. It learns to return a positive
value for class j and a negative value for the rest.
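A minimal one-versus-rest sketch (assumed, not from the slides) using scikit-learn's OneVsRestClassifier around a linear SVC; the iris data set stands in as a convenient 3-class example.

```python
# Sketch (assumed): one-versus-rest multiclass SVMs. For m classes,
# m binary SVMs are trained, each with one class as positive and the
# remaining data as negative.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)               # m = 3 classes
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
print(len(ovr.estimators_))                     # 3 binary classifiers
print(ovr.predict(X[:5]))
```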
All-versus-all/one-versus-one
• Given $m$ classes, we construct $\frac{m(m-1)}{2}$ binary classifiers: the combination nCr, where n = m classes and r = 2.
• A classifier is trained using data of the two classes.
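A matching one-versus-one sketch (assumed): scikit-learn's OneVsOneClassifier builds the $\frac{m(m-1)}{2}$ pairwise SVMs explicitly, so the count can be inspected; plain SVC also uses a one-versus-one scheme internally for multiclass problems.

```python
# Sketch (assumed): one-versus-one multiclass SVMs, one classifier per
# pair of classes, m*(m-1)/2 in total.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)               # m = 3 classes
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
print(len(ovo.estimators_))                     # m*(m-1)/2 = 3 pairwise classifiers
print(ovo.predict(X[:5]))
```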
