
Support Vector Machines

Support Vector Machines (SVM)


■ SVMs were introduced by Vladimir Vapnik (Vapnik, 1995). The technique is based on statistical learning theory.

■ The main objective in SVM is to find the hyperplane which separates the d-dimensional data points perfectly into two classes.

■ However, since example data is often not linearly separable, SVMs introduce the notion of a "kernel-induced feature space", which casts the data points (input space) into a higher-dimensional feature space where the data is separable.

■ SVM works well with high-dimensional data and thus avoids the dimensionality problem.

■ The higher-dimensional space does not need to be dealt with directly, which helps control overfitting.
Support Vector Machines (SVM)

We discuss
 A classification technique for training data that are linearly separable, known as a Linear SVM.
 A classification technique for training data that are linearly non-separable, known as a Non-linear SVM.
 The maximum margin of the hyperplane, which is also a key point.
Support Vector Machines - Linear classifier

■ Classification techniques that are based on drawing separating lines to distinguish between objects of different class labels are known as hyperplane classifiers.

■ A decision plane is one that separates a set of objects having different class labels.

■ Any new object falling to the right of the separating line is labeled, i.e., classified, as GREEN (or classified as RED should it fall to the left).

■ The objects closest to the hyperplane are called support vectors.


Linear separability

Figure: two 2-D data sets, one linearly separable and one not linearly separable.
Input Space to Feature Space

• The original objects are transformed using a set of mathematical functions, known as kernels.
• Instead of constructing a complex curve, we find an optimal line that can separate the objects.
Linear Classifiers

We are given l training examples {xi, yi}, i = 1, …, l, where each example has d inputs (xi ∈ Rd) and a class label with one of two values (yi ∈ {−1, +1}).

• All hyperplanes in Rd are parameterized by a vector w and a constant b, expressed using the equation w · x + b = 0, where w is the vector orthogonal to the hyperplane.
• Given such a hyperplane (w, b) that separates the data, we classify using the function f(x) = sign(w · x + b).
Linear Classifiers

f(x, w, b) = sign(w · x + b)
• points with w · x + b > 0 are classified as +1
• points with w · x + b < 0 are classified as −1

How would you classify this data?
Finding a hyperplane

In matrix form, a hyperplane thus can be represented as

W · X + b = 0    (4)

where W = [w1, w2, …, wm], X = [x1, x2, …, xm], and b is a real constant.

Here, W and b are parameters of the classifier model to be evaluated given a training set D.
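To make the classifier concrete, here is a minimal NumPy sketch of evaluating sign(W · X + b); the weight and bias values are made up purely for illustration and are not part of the original notes.

```python
import numpy as np

def predict(X, w, b):
    """Classify the rows of X with the hyperplane w.x + b = 0: returns +1 or -1."""
    return np.sign(X @ w + b)

# Hypothetical parameters of a 2-D hyperplane (illustration only).
w = np.array([1.0, -2.0])
b = 0.5

X = np.array([[3.0, 1.0],    # w.x + b = 3 - 2 + 0.5 = +1.5 -> +1
              [0.0, 2.0]])   # w.x + b = 0 - 4 + 0.5 = -3.5 -> -1
print(predict(X, w, b))      # [ 1. -1.]
```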
Hyperplane and Classification

W · X + b = 0; the equation can be explained as follows.

W represents the orientation and b is the intercept of the hyperplane from the origin.

If both W and b are scaled (up or down) by a non-zero constant, we get the same hyperplane.

This means there can be an infinite number of solutions using various scaling factors, all of them geometrically representing the same hyperplane.



Max Margin of Linear Classifiers
Two problems may arise:
 Are all hyperplanes equivalent as far as the classification of the data is concerned?
 If not, which hyperplane is the best?

 Regarding classification error on the training data, all of them have zero error.
 But there is no guarantee that all hyperplanes perform equally well on unseen (i.e., test) data.

 For a good classifier, choose the one of the infinitely many hyperplanes that performs well not only on the training data but also on the test data.
Hyperplanes with decision boundaries and their margins.
Maximum Margin Hyperplane

Two hyperplanes H1 and H2 have decision boundaries (denoted as b11 and b12 for H1, and b21 and b22 for H2).

A decision boundary is a boundary which is parallel to the hyperplane and touches the closest class on one side of the hyperplane.

The distance between the two decision boundaries of a hyperplane is called the margin. If data are classified using hyperplane H1, the margin is larger than with hyperplane H2.

The larger the margin, the lower the classification error.


The Perceptron Classifier
Given linearly separable data xi labelled into two categories yi ∈ {−1, +1}, find a weight vector w such that the discriminant function

f(xi) = wT xi + b

separates the categories for i = 1, …, N.
• How can we find this separating hyperplane?

The Perceptron Algorithm

Write the classifier as f(xi) = w̃T x̃i + w0 = wT xi, where w = (w̃, w0) and xi = (x̃i, 1).

• Initialize w = 0
• Cycle through the data points {xi, yi}
  • if xi is misclassified then w ← w + α yi xi
• Until all the data is correctly classified
For example, in 2D, the same update is applied whenever a point xi is misclassified.

Figure: the weight vector w before and after an update on a misclassified point xi (axes x1 and x2).

Perceptron example

Figure: the separating line found by the perceptron on a 2-D data set.

• If the data is linearly separable, then the algorithm will converge.
• Convergence can be slow …
• The separating line ends up close to the training data.
• We would prefer a larger margin for generalization.
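The loop above is easy to turn into code. Below is a minimal NumPy sketch (not the lecture's own code) using the standard update w ← w + α yi xi for a misclassified point, with each xi augmented by a constant 1 so that the bias w0 is learned as part of w.

```python
import numpy as np

def perceptron(X, y, alpha=1.0, max_epochs=100):
    """Train a perceptron on linearly separable data.
    X: (n, d) inputs, y: (n,) labels in {-1, +1}. Returns augmented weights (w~, w0)."""
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])  # append 1 so the bias is part of w
    w = np.zeros(Xa.shape[1])                      # initialize w = 0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(Xa, y):                  # cycle through the data points
            if yi * np.dot(w, xi) <= 0:            # xi is misclassified (or on the boundary)
                w = w + alpha * yi * xi            # update: w <- w + alpha * yi * xi
                errors += 1
        if errors == 0:                            # stop once everything is classified correctly
            break
    return w

# Tiny separable toy set (made up for illustration).
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(np.sign(X @ w[:2] + w[2]))  # reproduces y on this separable data
```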
Hyperplane Classifier

■ To avoid such ambiguity:

■ A given hyperplane represented by (w, b) is equally expressed by all pairs {λw, λb} for λ ∈ R+.

■ We therefore fix the scaling of the hyperplane so that the closest example(s) on each side lie at a "distance" of exactly 1.

■ That is, we consider those that satisfy:
  ■ w · xi + b ≥ +1 when yi = +1
  ■ w · xi + b ≤ −1 when yi = −1
i.e., yi (w · xi + b) ≥ 1 ∀i

• To obtain the geometric distance from the hyperplane to a data point, we normalize by the magnitude of w:
  d((w, b), xi) = yi (w · xi + b) / ||w|| ≥ 1 / ||w||
• We want the hyperplane that maximizes the geometric distance to the closest data points.

Choosing the hyperplane that maximizes the margin


Maximum Margin

f(x, w, b) = sign(w · x + b)

1. Maximizing the margin
2. Support vectors are important

The maximum margin linear classifier, defined by its support vectors, is the simplest kind of SVM (called an LSVM).

Linear SVM
Linear SVM for Linearly Not Separable Data

A linear SVM can be refitted to learn a hyperplane that is tolerant of a small number of non-separable training points.

The approach of refitting is called the soft margin approach (hence the SVM is called a Soft Margin SVM); it introduces slack variables for the non-separable cases.

More specifically, the soft margin SVM keeps a linear hyperplane (i.e., linear decision boundaries) even in situations where the classes are not linearly separable.
Linear SVM Mathematically
The margin width is M. The two margin hyperplanes (through x+ and x−) are parallel (they have the same normal w) and no training points fall between them.

What we know:
■ w · x+ + b = +1
■ w · x− + b = −1
■ w · (x+ − x−) = 2, and hence the margin width M = 2 / ||w||
Linear SVM Mathematically
■ Goal: 1) Correctly classify all training data:
w · xi + b ≥ +1 if yi = +1
w · xi + b ≤ −1 if yi = −1
i.e., yi (w · xi + b) ≥ 1 for all i
2) Maximize the margin M = 2 / ||w||, which is the same as minimizing ½ wTw.

■ We can formulate a constrained optimization problem and solve for w and b:

■ Minimize Φ(w) = ½ wTw
subject to yi (w · xi + b) ≥ 1 for all i
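This quadratic program can be handed to an off-the-shelf solver. As a sketch (an added illustration, not the notes' own code), scikit-learn's SVC with a linear kernel and a very large C approximates the hard-margin formulation above; the toy data set is hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy set (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [1.5, 2.5],
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C penalizes any slack, approximating the hard-margin SVM.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin width M = 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```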
Searching for MMH

This constrained optimization problem is popularly known as a convex optimization problem, where the objective function is quadratic and the constraints are linear in the parameters W and b.

A well-known technique to solve a convex optimization problem is the standard Lagrange multiplier method.

First, we shall review the Lagrange multiplier method, and then come back to solving our own SVM problem.



Lagrange Multipliers
• Consider a problem: minx f(x) subject to h(x) ≥ 0

• We define the Lagrangian L(x, α) = f(x) − α h(x)

• α is called a "Lagrange multiplier"

• Solve: minx maxα L(x, α) subject to α ≥ 0

Original Problem:
Find w and b such that
Φ(w) = ½ wTw is minimized,
and for all i, {(xi, yi)}: yi (wTxi + b) ≥ 1

Construct the Lagrangian function for the optimization:

L(w, b, α) = ½ wTw − Σi αi [ yi (wTxi + b) − 1 ],   s.t. αi ≥ 0 ∀i

Our goal is to minimize L over (w, b) and maximize it over α (or, equivalently, solve the dual).

Setting the derivative with respect to w to zero gives w = Σi αi yi xi; the derivative with respect to b gives Σi αi yi = 0.

Substituting, we get the dual problem:

maxα: Σi αi − ½ Σi Σj αi αj yi yj xiTxj

subject to Σi αi yi = 0 and αi ≥ 0 (or 0 ≤ αi ≤ C in the soft-margin case)

α is the vector of m non-negative Lagrange multipliers to be determined, and C is a constant.
Optimal hyperplane:

• The vector w is just a linear combination of the training examples.

• If we have found the αi's, then in order to make a prediction we only have to calculate a quantity that depends on the inner product between x and the points in the training set.
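To make the "only inner products are needed" point concrete, here is a small NumPy sketch of the decision function f(x) = Σi αi yi (xi · x) + b computed from the support vectors alone; the α values, support vectors and b below are hypothetical, chosen only to show the computation.

```python
import numpy as np

def dual_decision(x, X_sv, y_sv, alpha_sv, b):
    """f(x) = sum_i alpha_i * y_i * <x_i, x> + b, using only the support vectors
    (the training points with non-zero alpha)."""
    return np.sum(alpha_sv * y_sv * (X_sv @ x)) + b

# Hypothetical support vectors, labels and multipliers (illustration only).
X_sv = np.array([[1.0, 2.0], [2.0, 1.0]])
y_sv = np.array([+1, -1])
alpha_sv = np.array([0.5, 0.5])
b = 0.1

x_new = np.array([0.0, 3.0])
print(np.sign(dual_decision(x_new, X_sv, y_sv, alpha_sv, b)))  # +1 for this example
```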


Illustration : Linear SVM
Consider a binary classification problem with a training set of 8 tuples, as shown in Table 1.
Using quadratic programming, we can solve the KKT constraints to obtain the Lagrange multipliers λi for each training tuple, which are shown in Table 1.
Note that only the first two tuples are support vectors in this case.
Let W = (w1, w2) and b denote the parameters to be determined. We can solve for w1 and w2 as follows:

w1 = Σi λi · yi · xi1 = 65.52 × 1 × 0.38 + 65.52 × (−1) × 0.49 = −6.64

w2 = Σi λi · yi · xi2 = 65.52 × 1 × 0.47 + 65.52 × (−1) × 0.61 = −9.32
Illustration : Linear SVM

Table 1: Training Data

x1 x2 y λ
0.38 0.47 + 65.52
0.49 0.61 - 65.52
0.92 0.41 - 0
0.74 0.89 - 0
0.18 0.58 + 0
0.41 0.35 + 0
0.93 0.81 - 0
0.21 0.10 + 0



Illustration : Linear SVM

Linear SVM example.



Illustration : Linear SVM
The parameter b can be calculated for each support vector as follows:
b1 = 1 − W · x1                              // for support vector x1 (y1 = +1)
   = 1 − (−6.64) × 0.38 − (−9.32) × 0.47     // using the dot product
   = 7.93
b2 = −1 − W · x2                             // for support vector x2 (y2 = −1)
   = −1 − (−6.64) × 0.49 − (−9.32) × 0.61    // using the dot product
   = 7.93

Averaging these values of b1 and b2, we get b = 7.93.



Illustration : Linear SVM

Thus, the MMH is −6.64·x1 − 9.32·x2 + 7.93 = 0 (also see Fig. 6).
Suppose the test data is X = (0.5, 0.5). Then
δ(X) = W · X + b
     = −6.64 × 0.5 − 9.32 × 0.5 + 7.93
     = −0.05,
which is negative.

This implies that the test data falls on or below the MMH, and the SVM classifies X as belonging to class label −.
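The arithmetic of this illustration is easy to check with a few lines of NumPy. Since the table values are rounded, W and b are taken as reported above rather than recomputed from scratch.

```python
import numpy as np

W = np.array([-6.64, -9.32])   # weights from the worked example
b = 7.93                       # bias from the worked example

X_test = np.array([0.5, 0.5])
delta = W @ X_test + b
print(round(float(delta), 2))               # -0.05
print("class:", "+" if delta > 0 else "-")  # "-", as in the text
```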
Dataset with noise

■ Hard Margin: so far we have required that all data points be classified correctly (no training error).
■ What if the training set is noisy?
- Solution 1: use very powerful kernels, but this leads to OVERFITTING!
Soft Margin Classification
Slack variables ξi can be added to allow misclassification of difficult or noisy examples.

What should our quadratic optimization criterion be?
Minimize ½ wTw + λ Σi ξi
Hard Margin vs. Soft Margin
■ The old formulation:
Find w and b such that
Φ(w) = ½ wTw is minimized, and for all {(xi, yi)}:
yi (wTxi + b) ≥ 1

■ The new formulation incorporating slack variables:
Find w and b such that
Φ(w) = ½ wTw + λ Σi ξi is minimized, and for all {(xi, yi)}:
yi (wTxi + b) ≥ 1 − ξi and ξi ≥ 0 for all i

■ The parameter λ can be viewed as a way to control overfitting.
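To connect the slack variables with something computable: at the optimum each ξi equals the hinge loss max(0, 1 − yi(wTxi + b)), so the soft-margin objective can be evaluated directly for any candidate (w, b). A small NumPy sketch follows; the weights and data are made up for illustration.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, lam=1.0):
    """0.5*||w||^2 + lam * sum_i xi_i, where xi_i = max(0, 1 - y_i*(w.x_i + b))."""
    slack = np.maximum(0.0, 1.0 - y * (X @ w + b))   # slack (hinge loss) per example
    return 0.5 * np.dot(w, w) + lam * np.sum(slack)

# Hypothetical weights and a slightly noisy toy set (illustration only).
w, b = np.array([1.0, 1.0]), -1.0
X = np.array([[2.0, 1.0], [0.5, 0.2], [-1.0, -1.0]])
y = np.array([1, 1, -1])

print(soft_margin_objective(w, b, X, y, lam=1.0))  # 1.0 + 1.3 = 2.3
```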
Linear SVMs: Overview
■ The classifier is a separating hyperplane.
■ Most “important” training points are support vectors; they
define the hyperplane.
■ Quadratic optimization algorithms can identify which training
points xi are support vectors with non-zero Lagrangian
multipliers αi.
■ Both in the dual formulation of the problem and in the solution
training points appear only inside dot products:

Find α1…αN such that
Q(α) = Σ αi − ½ ΣΣ αi αj yi yj xiTxj is maximized and
(1) Σ αi yi = 0
(2) 0 ≤ αi ≤ C for all αi

f(x) = Σ αi yi xiTx + b
Learned model

f(x) = wTx + b

(Slide from Deva Ramanan)
Non-linear SVMs
■ Datasets that are linearly separable with some noise work out great.

■ But what are we going to do if the dataset is just too hard?

■ How about… mapping the data to a higher-dimensional space (e.g., from x to (x, x²))?
Non-Linear SVM
To understand this, note that a linear hyperplane is expressed as a linear equation in terms of the n-dimensional components, whereas a non-linear hypersurface is a non-linear expression.

Figure 13: 2D view of a few class separabilities.

Non-Linear SVM
A hyperplane is expressed as
Linear: w1x1 + w2x2 + w3x3 + c = 0    (30)

whereas a non-linear hypersurface is expressed as
Nonlinear: w1x1² + w2x2² + w3x1x2 + w4x3² + w5x1x3 + c = 0    (31)

The task therefore becomes finding a non-linear decision boundary, that is, a non-linear hypersurface in the input space containing the linearly non-separable data.
This task is in fact neither hard nor complex, and fortunately it can be accomplished by extending the formulation of the linear SVM we have already learned.

Non-linear SVMs: Feature spaces
■ General idea: the original input space (with non-linearly separable data) can always be mapped to some higher-dimensional feature space where the training set is linearly separable:

Φ: x → φ(x)
Concept of Non-Linear Mapping
Example: Non-linear mapping to linear SVM
The figure below shows an example of a 2-D data set consisting of class label +1 (shown as +) and class label −1 (shown as −).

Figure: Non-linear mapping to Linear SVM.


Concept of Non-Linear Mapping

Example: Non-linear mapping to linear SVM


We see that all instances of class −1 can be separated from instances of class +1 by a circle. The following equation of the decision boundary can be used:

X(x1, x2) = +1 if √((x1 − 0.5)² + (x2 − 0.5)²) > 0.2

X(x1, x2) = −1 otherwise


Concept of Non-Linear Mapping
Example: Non-linear mapping to linear SVM
With the mapping z1 = x1² − x1 and z2 = x2² − x2, the data plotted in the Z space are separable by the linear boundary

Z: z1 + z2 = −0.46

(since (x1 − 0.5)² + (x2 − 0.5)² = 0.04 is equivalent to x1² − x1 + x2² − x2 = −0.46).

Figure: Non-linear mapping to Linear SVM.
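A short NumPy sketch of this mapping (an illustration added here, not the original figure's code): applying z1 = x1² − x1 and z2 = x2² − x2 to points labelled by the circular rule makes the linear boundary z1 + z2 = −0.46 reproduce the labels exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 2))                  # points in the unit square
r = np.sqrt((X[:, 0] - 0.5) ** 2 + (X[:, 1] - 0.5) ** 2)  # distance from (0.5, 0.5)
y = np.where(r > 0.2, 1, -1)                              # labels from the circular rule

# Non-linear mapping to the Z space: z1 = x1^2 - x1, z2 = x2^2 - x2.
Z = np.column_stack([X[:, 0] ** 2 - X[:, 0], X[:, 1] ** 2 - X[:, 1]])

# In the Z space the decision boundary is the line z1 + z2 = -0.46.
pred = np.where(Z[:, 0] + Z[:, 1] > -0.46, 1, -1)
print("agreement with the circular rule:", np.mean(pred == y))  # 1.0
```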



Non-Linear to Linear Transformation: Issues

The non-linear mapping, and hence the idea of a linear decision boundary in the new space, looks simple, but there are several potential problems in doing so.

1 Mapping: how do we choose the non-linear mapping to a higher-dimensional space?
In fact, the φ-transformation works fine for small examples, but it becomes impractical for realistically sized problems.

2 Cost of mapping: for N-dimensional input instances there exist

NH = (N + d − 1)! / (d! (N − 1)!)

different monomials comprising a feature space of dimensionality NH. Here, d is the maximum degree of a monomial.



Non-Linear to Linear Transformation: Issues

3 Dimensionality problem: it may suffer from the curse of dimensionality often associated with high-dimensional data.
More specifically, in the calculation of W·X or Xi·X (in δ(X), see Eqn. 18), we need n multiplications and n additions (in their dot products) for each of the n-dimensional input instances and support vectors, respectively.
As the numbers of input instances and support vectors are enormously large, this is computationally expensive.

4 Computational cost: solving the quadratic constrained optimization problem in the high-dimensional feature space is also a computationally expensive task.



Non-Linear to Linear Transformation: Issues

Fortunately, mathematicians have cleverly proposed an elegant solution to the above problems.

Their solution consists of the following:

1 Dual formulation of the optimization problem
2 Kernel trick



Mapping the Inputs to other dimensions - the
use of Kernels
• Finding the optimal curve to fit the data is difficult.
• To “pre-process" the data in such a way that the problem is
transformed into one of finding a simple hyperplane.
•We define a mapping z = φ(x) that transforms the d-dimensional
input vector x into a (usually higher) d*-dimensional vector z.
• We choose φ() so that the new training data {φ(xi), yi} is separable by a hyperplane.
• How do we go about choosing φ()?
Kernel Trick

Example: SVM classifier for non-linear data


Suppose there is a set of data in R² (i.e., in 2-D space), and φ is the mapping from X ∈ R² to Z(z1, z2, z3) ∈ R³, the 3-D space.

In R²: X(x1, x2)

In R³: Z(z1, z2, z3)



Kernel Trick

Example: SVM classifier for non-linear data

φ(X) ⇒ Z with

z1 = x1², z2 = √2·x1x2, z3 = x2²

The hyperplane in R² is of the form

w1x1² + w2√2·x1x2 + w3x2² = 0,

which is the equation of an ellipse in 2D.



Kernel Trick
Example: SVM classifier for non-linear data
After the transformation, φ as mentioned above, we have the
decision boundary of the form

w1z1 + w2z2 + w3z3 = 0

This is clearly a linear form in 3-D space. In other words, W·x + b = 0 in R² has a mapped equivalent W′·z + b′ = 0 in R³.

This means that data which are not linearly separable in 2-D become separable in 3-D; that is, non-linear data can be classified by a linear SVM classifier.



Kernel Trick
The above can be generalized as follows.

Classifier:

δ(x) = Σi=1..n λi yi (xi · x) + b

δ(z) = Σi=1..n λi yi (φ(xi) · φ(x)) + b

Learning:

Maximize Σi λi − ½ Σi,j λi λj yi yj (xi · xj)

Maximize Σi λi − ½ Σi,j λi λj yi yj (φ(xi) · φ(xj))

Subject to: λi ≥ 0, Σi λi yi = 0



Kernel Trick
Now the question is how to choose φ, the mapping function X ⇒ Z, so that a linear SVM can be directly applied.

A breakthrough solution to this problem comes in the form of a method known as the kernel trick.

We discuss the kernel trick in the following.

We know that the dot product (·) is often regarded as a measure of similarity between two input vectors.
For example, if X and Y are two vectors, then
X · Y = |X||Y| cos θ

Here, the similarity between X and Y is measured as cosine similarity.

If θ = 0 (i.e., cos θ = 1), then they are most similar; if they are orthogonal (cos θ = 0), they are dissimilar.
Kernel Trick
Analogously, if Xi and Xj are two tuples, then Xi · Xj is regarded as a measure of similarity between Xi and Xj.

Again, φ(Xi) and φ(Xj) are the transformed features of Xi and Xj, respectively, in the transformed space; thus, φ(Xi) · φ(Xj) should also be regarded as a similarity measure between φ(Xi) and φ(Xj) in the transformed space.

This is the basic idea behind the kernel trick.

Now the question naturally arises: if both measure similarity, what is the relationship between them (i.e., between Xi · Xj and φ(Xi) · φ(Xj))?

Let us try to find the answer to this question through an example.



Kernel Trick
Example: Correlation between Xi .Xj and φ(Xi ).φ(Xj )
Without any loss of generality, let us consider the situation stated below:

φ: R² ⇒ R³ with z1 = x1², z2 = √2·x1x2, z3 = x2²    (40)

Suppose Xi = [xi1, xi2] and Xj = [xj1, xj2] are any two vectors in R².

Similarly, φ(Xi) = [xi1², √2·xi1xi2, xi2²] and φ(Xj) = [xj1², √2·xj1xj2, xj2²] are the transformed versions of Xi and Xj, but in R³.



Kernel Trick
Example: Correlation between Xi .Xj and φ(Xi ).φ(Xj )
Now,

φ(Xi) · φ(Xj) = [xi1², √2·xi1xi2, xi2²] · [xj1², √2·xj1xj2, xj2²]
= xi1²·xj1² + 2·xi1xi2·xj1xj2 + xi2²·xj2²
= (xi1·xj1 + xi2·xj2)²
= (Xi · Xj)²
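A quick numerical check of this identity for two arbitrary vectors (a sketch added for illustration, not part of the original notes):

```python
import numpy as np

def phi(x):
    """Explicit mapping R^2 -> R^3: (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

Xi = np.array([1.0, 2.0])
Xj = np.array([3.0, -1.0])

lhs = phi(Xi) @ phi(Xj)   # dot product in the transformed space
rhs = (Xi @ Xj) ** 2      # squared dot product in the original space
print(lhs, rhs)           # both equal 1.0
```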



Kernel Trick : Correlation between Xi.Xj and φ(Xi ).φ(Xj )

With reference to the above example, we can conclude that φ(Xi) · φ(Xj) is directly related to Xi · Xj.
In fact, the same can be proved in general, for any feature vectors and their transformed feature vectors.
More specifically, there is a correlation between dot products of the original data and of the transformed data.
Based on the above discussion, we can write the following implication:

Xi · Xj ⇒ φ(Xi) · φ(Xj) ⇒ K(Xi, Xj)    (41)

Here, K(Xi, Xj) denotes a function more popularly called a kernel function.



Kernel Trick : Significance

Computational efficiency:
Another important significance is easy and efficient computability.

We know that the discussed SVM classifier needs several repeated rounds of dot-product computation, both in the learning phase and in the classification phase.

On the other hand, using the kernel trick, we can do it once and with fewer dot products.



The “Kernel Trick”
■ The linear classifier relies on a dot product between vectors: K(xi, xj) = xiTxj
■ If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the dot product becomes: K(xi, xj) = φ(xi)Tφ(xj)
■ A kernel function is some function that corresponds to an inner product in some expanded feature space.
■ Example:
  2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xiTxj)².
  Need to show that K(xi, xj) = φ(xi)Tφ(xj):
  K(xi, xj) = (1 + xiTxj)²
            = 1 + xi1²xj1² + 2·xi1xj1·xi2xj2 + xi2²xj2² + 2·xi1xj1 + 2·xi2xj2
            = [1  xi1²  √2·xi1xi2  xi2²  √2·xi1  √2·xi2]T [1  xj1²  √2·xj1xj2  xj2²  √2·xj1  √2·xj2]
            = φ(xi)Tφ(xj), where φ(x) = [1  x1²  √2·x1x2  x2²  √2·x1  √2·x2]
Non-linear SVMs Mathematically
■ Dual problem formulation:
Find α1…αN such that
Q(α) = Σ αi − ½ ΣΣ αi αj yi yj K(xi, xj) is maximized and
(1) Σ αi yi = 0
(2) αi ≥ 0 for all αi

■ The solution is:

f(x) = Σ αi yi K(xi, x) + b

■ Optimization techniques for finding the αi's remain the same!
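As a sketch of the complete non-linear pipeline (using scikit-learn as a convenience; this is an added illustration, not the notes' own code), an RBF-kernel SVM solves exactly this dual on data that no straight line can separate:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the input space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# The dual problem is solved with K(xi, xj) = exp(-gamma * ||xi - xj||^2).
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print("training accuracy:", clf.score(X, y))
print("support vectors per class:", clf.n_support_)
```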


Examples of Kernel Functions
■ Linear: K(xi, xj) = xiTxj

■ Polynomial of power p: K(xi, xj) = (1 + xiTxj)^p
  Produces large dot products. The power p is specified a priori by the user.

■ Gaussian (radial-basis function network): K(xi, xj) = exp(−||xi − xj||² / (2σ²))
Examples of Kernel Functions
■ Laplacian: K(X, Y) = exp(−λ ||X − Y||)

■ Mahalanobis: K(X, Y) = exp(−(X − Y)T A (X − Y))
  Followed when statistical test data is known.

■ Sigmoid: K(X, Y) = tanh(β0 XTY + β1)
  Followed when statistical test data is known.
Nonlinear SVM - Overview
■ SVM locates a separating hyperplane in the feature space and classifies points in that space.
■ It does not need to represent the feature space explicitly; it simply defines a kernel function.
■ The kernel function plays the role of the dot product in the feature space.
Properties of SVM
■ Flexibility in choosing a similarity function.

■ Sparseness of the solution when dealing with large data sets:
  only support vectors are used to specify the separating hyperplane.

■ Ability to handle large feature spaces:
  complexity does not depend on the dimensionality of the feature space.

■ Overfitting can be controlled by the soft margin approach.

■ Nice math property: a simple convex optimization problem which is guaranteed to converge to a single global solution.

■ Feature selection.
Weakness of SVM
■ It is sensitive to noise
-A relatively small number of mislabeled examples can
dramatically decrease the performance

■ It only considers two classes


- how to do multi-class classification with SVM?
- Answer:
1) with output arity m, learn m SVMs
❑ SVM 1 learns “Output==1” vs “Output != 1”
❑ SVM 2 learns “Output==2” vs “Output != 2”
❑ :
❑ SVM m learns “Output==m” vs “Output != m”
2)To predict the output for a new input, just predict with each
SVM and find out which one puts the prediction the furthest
into the positive region.
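A minimal sketch of scheme 1) using scikit-learn (the iris data here is just a stand-in multi-class problem, not part of the original slides): OneVsRestClassifier trains one "class k vs. rest" SVM per class and predicts the class whose decision value is furthest into the positive region.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # 3 classes, so 3 binary SVMs are trained

ovr = OneVsRestClassifier(SVC(kernel="rbf", gamma="scale")).fit(X, y)
print(len(ovr.estimators_))         # 3: one "class k vs. rest" SVM per class
print(ovr.predict(X[:5]))           # class with the largest decision value wins
```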
Some Issues
■ Choice of kernel
- Gaussian or polynomial kernel is default
- if ineffective, more elaborate kernels are needed
-domain experts can give assistance in formulating appropriate
similarity measures

■ Choice of kernel parameters


- e.g. σ in Gaussian kernel
- σ is the distance between closest points with different
classifications
- In the absence of reliable criteria, applications rely on the use of a validation set or cross-validation to set such parameters (see the grid-search sketch after this list).

■ Optimization criterion – Hard margin vs. Soft margin


- a lengthy series of experiments in which various parameters
are tested
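As referenced in the kernel-parameter bullet above, such parameters are in practice usually tuned by a cross-validated grid search over C and the kernel parameter. A scikit-learn sketch on synthetic data (an added illustration; gamma plays the role of 1/(2σ²) in the Gaussian kernel):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100],          # soft-margin trade-off
              "gamma": [0.001, 0.01, 0.1, 1]}  # RBF kernel width (gamma = 1/(2*sigma^2))
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)

print(search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))
```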
