Support Vector Machine

Support Vector Machines (SVM) are primarily used for classification tasks, aiming to find a hyperplane that maximizes the margin between different classes of data points. The algorithm utilizes support vectors, which are critical data points that influence the hyperplane's position and orientation. SVM can also handle non-linear data through kernel methods, mapping data to higher-dimensional spaces for better separation.

Support Vector and Kernel Methods

• Support Vector Machine, abbreviated as SVM, can be used for both regression and classification tasks.
• However, it is most widely used for classification objectives.
• The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (N = the number of features) that distinctly classifies the data points.
• To separate the two classes of data points, there are many possible hyperplanes that could be chosen.
Support Vector Machines

• The objective is to find a plane that has the maximum margin, i.e., the maximum distance between data points of both classes.
• Maximizing the margin distance provides some reinforcement so that future data points can be classified with more confidence.
Hyperplane as Decision Surface
• Hyperplanes are decision boundaries that help
classify the data points.
• Data points falling on either side of the hyperplane can be attributed to different classes; it is a form of binary classification.
• The dimension of the hyperplane depends upon
the number of features. If the number of input
features is 2, then the hyperplane is just a line. If
the number of input features is 3, then the
hyperplane becomes a two-dimensional plane.
Support Vectors

• Support vectors are data points that are closer to the hyperplane and influence the position and orientation of the hyperplane.
• Using these support vectors, we maximize the margin of the classifier.
• Deleting the support vectors will change the position of the hyperplane. These are the points that help us build our SVM.
Maximizing the Margin

• In logistic regression, we take the output of the linear function and squash the value into the range [0, 1] using the sigmoid function.
• If the squashed value is greater than a threshold value (0.5), we assign it the label 1; otherwise we assign it the label 0.
• In SVM, we take the output of the linear function directly: if that output is greater than 1, we identify the point with one class, and if the output is less than −1, we identify it with the other class.
• Since the threshold values are changed to 1 and −1 in SVM, we obtain this reinforcement range of values ([−1, 1]), which acts as the margin.
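As a rough sketch of the two decision rules (the weights, bias, and test point below are toy assumptions, not values from the slides):

```python
import numpy as np

# Toy weight vector and bias chosen for illustration; in practice these come from training.
w = np.array([2.0, -1.0])
b = 0.5

def logistic_predict(x):
    # Logistic regression: squash w^T x + b into [0, 1] with the sigmoid, threshold at 0.5.
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    return 1 if p > 0.5 else 0

def svm_predict(x):
    # SVM: use the raw linear score; the region between -1 and +1 is the margin.
    score = w @ x + b
    return +1 if score >= 0 else -1

x = np.array([1.0, 0.3])
print(logistic_predict(x), svm_predict(x), w @ x + b)
```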
Support Vector Machine (SVM)

• SVMs maximize the margin around the separating hyperplane.
• A.k.a. large margin classifiers.
• The decision function is fully specified by a subset of training samples, the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.

[Figure: a separating hyperplane with a narrow margin vs. the SVM hyperplane that maximizes the margin; the support vectors lie on the margin.]
Maximum Margin: Formalization

• w: decision hyperplane normal vector
• xi: data point i
• yi: class of data point i (+1 or −1). NB: not 1/0.
• Classifier is: f(xi) = sign(wTxi + b)
• Functional margin of xi is: yi(wTxi + b)
  • But note that we can increase this margin simply by scaling w, b…
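A tiny numeric illustration of this scaling issue, with toy values for w, b, and one training example (all assumed for illustration):

```python
import numpy as np

w, b = np.array([1.0, 2.0]), -0.5        # assumed toy hyperplane parameters
x_i, y_i = np.array([1.0, 1.0]), +1      # assumed toy training example

# Functional margin: y_i (w^T x_i + b)
print(y_i * (w @ x_i + b))               # 2.5

# Scaling (w, b) by 10 leaves the decision boundary unchanged
# but inflates the functional margin tenfold:
print(y_i * ((10 * w) @ x_i + 10 * b))   # 25.0
```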
Geometric Margin

• Distance from an example to the separator is r = y(wTx + b)/|w|.
• Examples closest to the hyperplane are support vectors.
• Margin ρ of the separator is the width of separation between the support vectors of the two classes.

Derivation of finding r:
• The dotted line x′ − x is perpendicular to the decision boundary, so it is parallel to w.
• The unit vector is w/|w|, so the line is rw/|w|, and x′ = x − yrw/|w|.
• x′ satisfies wTx′ + b = 0, so wT(x − yrw/|w|) + b = 0.
• Recall that |w| = sqrt(wTw), so wTx − yr|w| + b = 0.
• Solving for r gives: r = y(wTx + b)/|w|.
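A minimal sketch of the result (reusing the same toy w, b, and point as above, all assumed): the geometric margin is unchanged when (w, b) is rescaled, unlike the functional margin.

```python
import numpy as np

def geometric_margin(w, b, x, y):
    # r = y (w^T x + b) / |w|: signed distance from x to the hyperplane.
    return y * (w @ x + b) / np.linalg.norm(w)

w, b = np.array([1.0, 2.0]), -0.5        # same assumed toy hyperplane as above
x, y = np.array([1.0, 1.0]), +1

print(geometric_margin(w, b, x, y))
# Invariant to rescaling (w, b):
print(geometric_margin(10 * w, 10 * b, x, y))
```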
Linear SVM Mathematically: the linearly separable case

• Assume that all data is at least distance 1 from the hyperplane; then the following two constraints follow for a training set {(xi, yi)}:
  wTxi + b ≥ 1 if yi = 1
  wTxi + b ≤ −1 if yi = −1
• For support vectors, the inequality becomes an equality.
• Then, since each example's distance from the hyperplane is r = y(wTx + b)/|w|,
  the margin is: ρ = 2/|w|.
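A short sketch that checks these canonical constraints and evaluates the margin 2/|w| for an assumed toy training set and an assumed feasible hyperplane (not taken from the slides):

```python
import numpy as np

# Assumed toy linearly separable training set.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

# Assumed hyperplane in canonical form: the closest points satisfy y_i (w^T x_i + b) = 1.
w, b = np.array([0.25, 0.25]), 0.0

# Check the constraints y_i (w^T x_i + b) >= 1 for every training point.
print(np.all(y * (X @ w + b) >= 1))    # True for this choice of (w, b)

# Width of separation between the two margin hyperplanes.
print(2.0 / np.linalg.norm(w))          # 2 / |w|
```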
Linear Support Vector Machine (SVM)

• The separating hyperplane is wTx + b = 0; the margin hyperplanes through support vectors xa and xb satisfy wTxa + b = 1 and wTxb + b = −1.
• Extra scale constraint: mini=1,…,n |wTxi + b| = 1
• This implies:
  wT(xa − xb) = 2
  ρ = ||xa − xb||2 = 2/||w||2
Solving the Optimization Problem
Find w and b such that
Φ(w) = ½wTw is minimized;
and for all {(xi, yi)}: yi(wTxi + b) ≥ 1

• This is now optimizing a quadratic function subject to linear constraints.
• Quadratic optimization problems are a well-known class of mathematical programming problem, and many (intricate) algorithms exist for solving them (with many special ones built for SVMs).
• The solution involves constructing a dual problem where a Lagrange multiplier αi is associated with every constraint in the primal problem:

Find α1…αN such that
Q(α) = Σαi − ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi
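In practice this dual quadratic program is rarely solved by hand. As a rough illustration, scikit-learn's SVC with a linear kernel (and a large C to approximate the hard-margin problem) solves it internally and exposes the support vectors and the products αiyi; the toy data below is an assumption for demonstration:

```python
import numpy as np
from sklearn.svm import SVC

# Assumed toy linearly separable data (same as the earlier sketch).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

# A large C approximates the hard-margin problem; SVC solves the dual QP internally.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print(clf.support_vectors_)   # the x_i with non-zero alpha_i
print(clf.dual_coef_)         # stores alpha_i * y_i for the support vectors
```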
The Optimization Problem Solution

• The solution has the form:
  w = Σαiyixi    b = yk − wTxk for any xk such that αk ≠ 0
• Each non-zero αi indicates that the corresponding xi is a support vector.
• Then the classifying function will have the form:
  f(x) = ΣαiyixiTx + b
• Notice that it relies on an inner product between the test point x and the support vectors xi.
• We will return to this later.
• Also keep in mind that solving the optimization problem involved computing the inner products xiTxj between all pairs of training points.
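Continuing the sketch above (same assumed toy data), the solution can be unpacked: dual_coef_ holds αiyi for the support vectors, so w = Σαiyixi can be reconstructed, and the manual score wTx + b matches the library's decision function:

```python
import numpy as np
from sklearn.svm import SVC

# Same assumed toy data as in the previous sketch.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# w = sum_i alpha_i y_i x_i  (dual_coef_ already stores alpha_i * y_i)
w = (clf.dual_coef_ @ clf.support_vectors_).ravel()
b = clf.intercept_[0]

# f(x) = sum_i alpha_i y_i x_i^T x + b, evaluated at an assumed test point.
x_test = np.array([1.0, 0.5])
print(w @ x_test + b)                         # manual score
print(clf.decision_function([x_test])[0])     # matches sklearn's own score
```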
Classification with SVMs

• Given a new point x, we can score its projection onto the hyperplane normal:
  • I.e., compute the score: wTx + b = ΣαiyixiTx + b
  • Decide the class based on whether the score is < or > 0.
• Can set a confidence threshold t:
  Score > t: yes
  Score < −t: no
  Else: don't know
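A minimal sketch of this thresholded decision rule; the threshold t = 0.5 and the example scores are arbitrary assumptions:

```python
def classify_with_reject(score, t=0.5):
    # Decide +1 / -1, or abstain ("don't know") when the score falls inside (-t, t).
    if score > t:
        return 1
    if score < -t:
        return -1
    return None

for s in (1.3, -0.2, -0.9):
    print(s, classify_with_reject(s))
```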
Linear SVMs: Summary

• The classifier is a separating hyperplane.
• The most “important” training points are the support vectors; they define the hyperplane.
• Quadratic optimization algorithms can identify which training points xi are support vectors with non-zero Lagrangian multipliers αi.
• Both in the dual formulation of the problem and in the solution, training points appear only inside inner products:

Find α1…αN such that
Q(α) = Σαi − ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi

f(x) = ΣαiyixiTx + b
Non-linear SVMs

• Datasets that are linearly separable (with some noise) work out great.
• But what are we going to do if the dataset is just too hard?
• How about… mapping the data to a higher-dimensional space?

[Figure: 1-D data on the x-axis that no single threshold separates becomes separable after mapping each point x to (x, x²).]
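A small numeric illustration of the idea (the 1-D points and labels are assumptions): data that no single threshold on x can separate becomes separable once each point is mapped to (x, x²):

```python
import numpy as np

# Assumed 1-D toy data: the positive class sits between the negatives,
# so no single threshold on x separates the classes.
x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([-1, -1, 1, 1, -1, -1])

# Map each point to (x, x^2); in this 2-D space a horizontal line
# such as x^2 = 2.5 separates the two classes.
phi = np.column_stack([x, x ** 2])
print(phi)
print((phi[:, 1] < 2.5).astype(int) * 2 - 1)   # recovers y exactly
```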
Non-linear SVMs: Feature Spaces

• General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:

  Φ: x → φ(x)
The “Kernel Trick”

• The linear classifier relies on an inner product between vectors: K(xi, xj) = xiTxj.
• If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes: K(xi, xj) = φ(xi)Tφ(xj).
• A kernel function is some function that corresponds to an inner product in some expanded feature space.
• Example: 2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xiTxj)².
  We need to show that K(xi, xj) = φ(xi)Tφ(xj):
  K(xi, xj) = (1 + xiTxj)² = 1 + xi1²xj1² + 2xi1xj1xi2xj2 + xi2²xj2² + 2xi1xj1 + 2xi2xj2
  = [1  xi1²  √2xi1xi2  xi2²  √2xi1  √2xi2]T [1  xj1²  √2xj1xj2  xj2²  √2xj1  √2xj2]
  = φ(xi)Tφ(xj), where φ(x) = [1  x1²  √2x1x2  x2²  √2x1  √2x2].
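The algebra above can be checked numerically. This sketch (with arbitrary toy vectors, an assumption for illustration) confirms that the quadratic kernel (1 + xiTxj)² equals the inner product of the explicit 6-dimensional feature vectors φ(xi) and φ(xj):

```python
import numpy as np

def poly_kernel(xi, xj):
    # Quadratic polynomial kernel K(xi, xj) = (1 + xi^T xj)^2.
    return (1.0 + xi @ xj) ** 2

def phi(x):
    # Explicit feature map for the kernel above (2-D input -> 6-D space).
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

xi = np.array([1.0, 2.0])    # assumed toy vectors
xj = np.array([3.0, -1.0])

print(poly_kernel(xi, xj))   # 4.0
print(phi(xi) @ phi(xj))     # same value, via the explicit feature map
```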
Kernels

Why use kernels?
• Make a non-separable problem separable.
• Map data into a better representational space.

Common kernels:
• Linear
• Polynomial: K(x, z) = (1 + xTz)^d
  • Gives feature conjunctions
• Radial basis function (infinite-dimensional space)
  • Has not been very useful in text classification
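As a rough illustration of swapping kernels, scikit-learn's SVC accepts these choices directly; the half-moons dataset and the parameter values below are assumptions for demonstration, not from the slides:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Assumed toy non-linear dataset (two interleaving half-moons).
X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

for kernel, params in [("linear", {}),
                       ("poly", {"degree": 2, "coef0": 1}),
                       ("rbf", {"gamma": "scale"})]:
    clf = SVC(kernel=kernel, **params).fit(X, y)
    print(kernel, round(clf.score(X, y), 3))   # training accuracy per kernel
```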
