5d. Support Vector Machine


Machine Learning (CSPC41)

Support Vector Machine

Ref, Source & Ack:
1. Dr Anoop Kumar, Computer Engg Dept, NIT Kurukshetra
2. Jiawei Han, Micheline Kamber, and Jian Pei, University of Illinois at Urbana-Champaign & Simon Fraser University, USA

Support Vector Machine: Three Main Ideas

1. Define what an optimal hyperplane is (taking into account that it needs to be computed efficiently): maximize the margin.
2. Generalize to non-linearly separable problems: have a penalty term for misclassifications.
3. Map the data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that the data are mapped implicitly to this space.

Linear Classifiers: Which Hyperplane?

• Lots of possible solutions for a, b, c in the decision boundary ax + by − c = 0 (a small sketch follows this slide).
• Some methods find a separating hyperplane, but not the optimal one.
• Support Vector Machine (SVM) finds an optimal* solution.
  – Maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary.
  – One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.

* But note that Naïve Bayes also finds an optimal solution, just under a different definition of optimality.
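
A minimal sketch of the first bullet, using NumPy with hypothetical 2-D points and hand-picked coefficients: several different choices of (a, b, c) separate the same data, which is why a criterion for the optimal hyperplane is needed.

```python
import numpy as np

# Hypothetical 2-D toy data: two points per class, labels +1 / -1.
X = np.array([[1.0, 1.0], [2.0, 2.0],      # class +1
              [-1.0, -1.0], [-2.0, -1.5]]) # class -1
y = np.array([+1, +1, -1, -1])

# Two hand-picked choices of (a, b, c) for the boundary ax + by - c = 0.
for a, b, c in [(1.0, 1.0, 0.0), (0.5, 2.0, 0.2)]:
    scores = a * X[:, 0] + b * X[:, 1] - c
    separates = np.all(np.sign(scores) == y)
    print((a, b, c), "separates the data:", separates)  # both print True
```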

Non-Linear Classifiers

• For non-linear data, SVM is more complex: use a kernelized SVM for non-linearly separable data.
• Say we have some non-linearly separable data in one dimension. We can transform this data into two dimensions, and the data will become linearly separable in two dimensions. This is done by mapping each 1-D data point to a corresponding 2-D ordered pair (see the sketch below).
• So for any non-linearly separable data in any dimension, we can just map the data to a higher dimension and then make it linearly separable.
• In a kernelized SVM, the similarity is measured between the points in the newly transformed feature space.
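
A minimal sketch of the 1-D to 2-D idea above, with a hypothetical dataset and the assumed mapping x → (x, x²); the threshold x² = 2 is hand-picked for this toy set.

```python
import numpy as np

# Hypothetical 1-D data that no single threshold on x can separate:
# class +1 sits between the two groups of class -1.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([-1,   -1,   +1,   +1,  +1,  -1,  -1])

# Map each 1-D point to the 2-D ordered pair (x, x^2).
X2 = np.column_stack([x, x ** 2])

# In the new space a horizontal line such as x^2 = 2 separates the classes.
pred = np.where(X2[:, 1] < 2.0, +1, -1)
print(np.all(pred == y))  # True: linearly separable after the mapping
```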


Support Vector Machine (SVM): Maximizing the Margin

• SVMs maximize the margin around the separating hyperplane, hence "large margin classifiers".
• The decision function is fully specified by a subset of the training samples, the support vectors (the tuples that fall on the margin's boundary hyperplanes).
• Solving SVMs is a quadratic programming problem.
• Seen by many as one of the most successful text classification methods* (*but other discriminative methods often perform very similarly).
[Figure: a narrower-margin separator versus the separator that maximizes the margin; the support vectors lie on the margin.]

Maximum Margin: Formalization

• A point lying above the hyperplane has wᵀxi + b > 0 and label yi = +1.
• A point lying below the hyperplane has wᵀxi + b < 0 and label yi = −1.
• The weights can be adjusted so that the hyperplanes defining the "sides" of the margin can be written as:
  H1: wᵀxi + b ≥ +1 for yi = +1
  H2: wᵀxi + b ≤ −1 for yi = −1
• Any tuple falling on or above H1 is classified as +1, and any tuple falling on or below H2 is classified as −1.
• Combining the two inequalities: yi(wᵀxi + b) ≥ 1.
• This defines the functional margin of point xi as yi(wᵀxi + b) (a numerical sketch follows).
  – But note that we can increase this margin simply by scaling w, b…
• The functional margin of the dataset is twice the minimum functional margin over all points (the factor of 2 comes from measuring the whole width of the margin).
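
A small numerical sketch of the functional margin, with a hypothetical (w, b) and labelled points; it also shows how scaling (w, b) inflates the functional margin without moving the hyperplane.

```python
import numpy as np

# Hypothetical hyperplane (w, b) and labelled points.
w = np.array([2.0, 1.0])
b = -1.0
X = np.array([[2.0, 1.0], [1.0, 1.0], [-1.0, 0.5]])
y = np.array([+1, +1, -1])

# Functional margin of each point: yi * (w^T xi + b).
fm = y * (X @ w + b)
print(fm)               # [4. 2. 2.5] -- all >= 1, so the constraints hold

# Functional margin of the dataset: twice the minimum over the points.
print(2 * fm.min())     # 4.0

# Scaling (w, b) inflates the functional margin without moving the hyperplane,
# which is why the geometric margin (next slides) divides by ||w||.
print(y * (X @ (3 * w) + 3 * b))   # three times larger, same hyperplane
```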

Sec. 15.1 Sec. 15.1

Support Vector Machine

• One of the most important tools in the machine learning toolbox.
• In a nutshell:
  – map the data to a predetermined very high-dimensional space via a kernel function;
  – find the hyperplane that maximizes the margin between the two classes;
  – if the data are not separable, find the hyperplane that maximizes the margin and minimizes the (weighted average of the) misclassifications.

Maximum Margin: Formalization

• Decision surface: a hyperplane in feature space.
• w: decision hyperplane normal vector.
• xi: data point i.
• yi: class of data point i (+1 or −1). (Note: it classifies as +1 or −1, not as 1/0.)
• b is called the bias (or zero-intercept, like c in the line equation mx + c).
• General hyperplane equation: wᵀx + b = 0, so the classifier is f(xi) = sign(wᵀxi + b).
• w and x are n-dimensional vectors; T stands for transpose. Why is wᵀ used? If w and x are both column vectors, wᵀ is a row vector, so the matrix multiplication wᵀx becomes compatible.
• In some books/notes the T might not be written, so w and wᵀ are used interchangeably. Remember this while reading.

How to Maximize: Formalization

• Recall the distance between two parallel planes P1: w1x + w2y + w3z + d1 = 0 and P2: w1x + w2y + w3z + d2 = 0:
  d = |d2 − d1| / √(w1² + w2² + w3²)
• √(w1² + w2² + w3²) is the Euclidean norm of w, written ||w||.
• So the distance of a point on H1 from the separating hyperplane is 1/||w||:
  a point on H1 satisfies w1x + w2y + w3z + d1 = +1, i.e. w1x + w2y + w3z + (d1 − 1) = 0,
  while the separating hyperplane is w1x + w2y + w3z + d1 = 0.
• Similarly, the distance of a point on H2 from the separating hyperplane is also 1/||w||.
• Maximal margin: 2/||w|| (see the numerical check below).

Linear SVMs Mathematically

• Then we can formulate the quadratic optimization problem:

  Find w and b such that the margin 2/||w|| is maximized; and for all {(xi, yi)}:
  wᵀxi + b ≥ 1 if yi = 1;  wᵀxi + b ≤ −1 if yi = −1

• A better formulation (min ||w|| = max 1/||w||):

  Find w and b such that Φ(w) = ½ wᵀw is minimized; and for all {(xi, yi)}: yi(wᵀxi + b) ≥ 1

Soft Margin Classification Mathematically

• The old formulation:

  Find w and b such that Φ(w) = ½ wᵀw is minimized and for all {(xi, yi)}: yi(wᵀxi + b) ≥ 1

• The new formulation incorporating slack variables:

  Find w and b such that Φ(w) = ½ wᵀw + C Σξi is minimized and for all {(xi, yi)}:
  yi(wᵀxi + b) ≥ 1 − ξi and ξi ≥ 0 for all i

• The parameter C can be viewed as a way to control overfitting:
  – a regularization term (a numerical sketch of the objective is given below).

Non-linear SVMs: Feature Spaces

• General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:

  Φ: x → φ(x)
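
Referring back to the soft-margin formulation above, a small numerical sketch with a hypothetical w, b, dataset, and C: the smallest slack that satisfies the constraints is ξi = max(0, 1 − yi(wᵀxi + b)), and the objective is Φ(w) = ½ wᵀw + C Σξi.

```python
import numpy as np

# Hypothetical separator, data, and penalty C; one point violates the margin.
w = np.array([1.0, 1.0])
b = -1.0
X = np.array([[2.0, 1.0], [0.2, 0.3], [-1.0, -1.0]])
y = np.array([+1, +1, -1])
C = 10.0

# Smallest slack satisfying yi (w^T xi + b) >= 1 - xi_i with xi_i >= 0.
xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
print(xi)        # [0.  1.5 0. ] -- only the violating point needs slack

# Soft-margin objective: Phi(w) = 1/2 w^T w + C * sum(xi_i).
print(0.5 * (w @ w) + C * xi.sum())   # 1.0 + 15.0 = 16.0
```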


Solving the Optimization Problem

  Find w and b such that Φ(w) = ½ wᵀw is minimized; and for all {(xi, yi)}: yi(wᵀxi + b) ≥ 1

• This is now optimizing a quadratic function subject to linear constraints.
• Quadratic optimization problems are a well-known class of mathematical programming problems, and many (intricate) algorithms exist for solving them (with many special ones built for SVMs).
• The solution involves constructing a dual problem where a Lagrange multiplier αi is associated with every constraint in the primal problem:

  Find α1…αN such that Q(α) = Σαi − ½ ΣΣ αiαj yiyj xiᵀxj is maximized and
  (1) Σ αiyi = 0
  (2) αi ≥ 0 for all αi

Soft Margin Classification – Solution

• The dual problem for soft margin classification:

  Find α1…αN such that Q(α) = Σαi − ½ ΣΣ αiαj yiyj xiᵀxj is maximized and
  (1) Σ αiyi = 0
  (2) 0 ≤ αi ≤ C for all αi

• Neither the slack variables ξi nor their Lagrange multipliers appear in the dual problem!
• Again, the xi with non-zero αi will be the support vectors.
• The solution to the dual problem is:

  w = Σ αiyixi
  b = yk(1 − ξk) − wᵀxk where k = argmax_k′ αk′
  f(x) = Σ αiyixiᵀx + b

• w is not needed explicitly for classification! (A sketch using the dual solution follows.)

SVM: Adv Features: Kernels

• Map data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that the data are mapped implicitly to this space.
• The linear classifier relies on an inner product between vectors: K(xi, xj) = xiᵀxj.
• If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes K(xi, xj) = φ(xi)ᵀφ(xj).
• A kernel function is a function that corresponds to an inner product in some expanded feature space.
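
A minimal sketch of classifying with the dual solution, using hypothetical multipliers αi that already satisfy Σ αiyi = 0; the slack of the chosen support vector is assumed to be zero when computing b.

```python
import numpy as np

# Hypothetical dual solution: multipliers alpha_i (with sum(alpha * y) = 0),
# labels, and training points. Only the support vectors have alpha_i > 0.
alpha = np.array([0.0, 0.5, 0.5, 0.0])
y = np.array([+1, +1, -1, -1])
X = np.array([[2.0, 2.0], [1.0, 1.0], [0.0, 0.0], [-1.0, -1.0]])

# w = sum_i alpha_i yi xi (only needed if we want the explicit linear form).
w = (alpha * y) @ X
print(w)                              # [0.5 0.5]

# b from a support vector xk with alpha_k > 0, assuming its slack xi_k = 0.
k = int(np.argmax(alpha))
b = y[k] - w @ X[k]

# Classification needs only inner products with the support vectors:
# f(x) = sum_i alpha_i yi (xi^T x) + b
x_new = np.array([1.5, 0.5])
score = (alpha * y) @ (X @ x_new) + b
print(np.sign(score))                 # +1
```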


The Optimization Problem Solution

• The solution has the form:

  w = Σ αiyixi
  b = yk − wᵀxk for any xk such that αk ≠ 0

• Each non-zero αi indicates that the corresponding xi is a support vector.
• Then the classifying function has the form:

  f(x) = Σ αiyixiᵀx + b

• Notice that it relies on an inner product between the test point x and the support vectors xi.
• Also keep in mind that solving the optimization problem involved computing the inner products xiᵀxj between all pairs of training points.

Classification with SVMs

• Given a new point x, we can score its projection onto the hyperplane normal:
  – i.e., compute the score wᵀx + b = Σ αiyixiᵀx + b,
  – and decide the class based on whether the score is < or > 0.
• We can also set a confidence threshold t:
  Score > t: yes
  Score < −t: no
  Else: don't know
[Figure: the separating hyperplane at score 0 with the margin hyperplanes at scores +1 and −1.]

SVM: Adv Features: Kernels

• Common kernels (minimal sketches follow):
  – Linear
  – Polynomial: K(x, z) = (1 + xᵀz)^d
    • gives feature conjunctions
  – Radial basis function (infinite-dimensional space)
    • has not been very useful in text classification
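
Minimal sketches of the kernels listed above as plain functions; the parameter values d and gamma are illustrative choices, not values from the slides.

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, d=2):
    # K(x, z) = (1 + x^T z)^d -- implicitly encodes feature conjunctions.
    return (1.0 + x @ z) ** d

def rbf_kernel(x, z, gamma=0.5):
    # Radial basis function: inner product in an infinite-dimensional space.
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(linear_kernel(x, z), polynomial_kernel(x, z), rbf_kernel(x, z))
```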


SVMs: Summary

• The classifier is a separating hyperplane.
• The most "important" training points are the support vectors; they define the hyperplane.
• Quadratic optimization algorithms can identify which training points xi are support vectors, i.e. those with non-zero Lagrange multipliers αi.
• Margin maximization is the aim.
• For non-linearly separable data, if a few points get misclassified by the separating hyperplane, penalize the misclassified points and minimize the penalty along with maximizing the margin.
• If the data are mostly non-linearly separable, use a kernel function: map the data to a higher dimension and then optimize to find the separating hyperplane with maximum margin in the high-dimensional space (an end-to-end sketch follows).
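
Putting the summary together, a minimal end-to-end sketch assuming scikit-learn is available: a hypothetical non-linearly separable dataset, an RBF kernel for the implicit high-dimensional mapping, and C as the misclassification penalty.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical non-linearly separable data: one class inside a ring of the other.
rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, 100)
inner = rng.normal(0.0, 0.3, (100, 2))                           # class 0
outer = 2.0 * np.column_stack([np.cos(angles), np.sin(angles)])  # class 1
X = np.vstack([inner, outer])
y = np.array([0] * 100 + [1] * 100)

# RBF kernel: implicit mapping to a high-dimensional space;
# C: penalty on margin violations (soft margin).
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print(clf.score(X, y))                 # training accuracy
print(clf.support_vectors_.shape[0])   # number of support vectors found
```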


Soft Margin Classification

• If the training data are not linearly separable, slack variables ξi can be added to allow misclassification of difficult or noisy examples.
• Allow some errors:
  – let some points be moved to where they belong, at a cost.
• Still, try to minimize the training set errors, and to place the hyperplane "far" from each class (large margin). A sketch of how the penalty C trades these off is given below.
[Figure: a separating hyperplane with two misclassified points and their slack variables ξi and ξj.]

Non-linear SVMs

• Datasets that are linearly separable (with some noise) work out great.
  [Figure: 1-D points on the x-axis.]
• But what are we going to do if the dataset is just too hard?
  [Figure: a harder 1-D dataset on the x-axis.]
• How about … mapping the data to a higher-dimensional space?
  [Figure: the same data mapped to the (x, x²) plane.]
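
A small sketch of the soft-margin trade-off above, assuming scikit-learn is available: for a noisy, overlapping hypothetical dataset, varying C changes how many points are allowed to violate the margin (positive slack).

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical noisy, overlapping 2-D data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 0.8, (50, 2)),   # class -1
               rng.normal(+1.0, 0.8, (50, 2))])  # class +1
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Slack needed by each point: xi_i = max(0, 1 - yi * f(xi)).
    slack = np.maximum(0.0, 1.0 - y * clf.decision_function(X))
    print(f"C={C}: support vectors={len(clf.support_)}, "
          f"margin violations={(slack > 0).sum()}")
```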
