5d. Support Vector Machine
Non-Linear Classifiers
• For non-linear data, SVM is more complex: use a kernelized SVM for data that is not linearly separable.
• Say we have some non-linearly separable data in one dimension. We can transform this data into two dimensions, and it becomes linearly separable there. This is done by mapping each 1-D data point to a corresponding 2-D ordered pair (see the sketch below).
• So for non-linearly separable data in any dimension, we can map the data to a higher dimension and then make it linearly separable there.
• In a kernelized SVM, the similarity between points is computed in the newly transformed feature space.
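A minimal numeric sketch of this idea (the data and the feature map x → (x, x²) are illustrative choices, not from the slides): the 1-D points below cannot be split by a single threshold, but become linearly separable after mapping to 2-D.

import numpy as np

# 1-D data that is not linearly separable: class +1 for points far from
# the origin, class -1 for points near it.
x = np.array([-3.0, -2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0, 3.0])
y = np.where(np.abs(x) > 1.0, 1, -1)

# No single threshold on x separates the classes, but the feature map
# phi(x) = (x, x**2) makes them separable by the horizontal line x2 = 1.
phi = np.column_stack([x, x ** 2])
print(np.all(np.where(phi[:, 1] > 1.0, 1, -1) == y))   # True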
Setting Up the Optimization Problem
• A better formulation (since maximizing the margin 1/||w|| is equivalent to minimizing ||w||): Find w and b such that
Φ(w) = ½ wTw is minimized;
and for all {(xi, yi)}: yi(wTxi + b) ≥ 1

Soft Margin Classification
• Find w and b such that
Φ(w) = ½ wTw + CΣξi is minimized, and for all {(xi, yi)}:
yi(wTxi + b) ≥ 1 - ξi and ξi ≥ 0 for all i
• Parameter C can be viewed as a way to control overfitting
– A regularization term (see the sketch below)
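A minimal sketch of the role of C, assuming scikit-learn is available (the toy data and the C values are illustrative): smaller C tolerates more slack and gives a wider margin, while larger C penalizes misclassification more heavily.

import numpy as np
from sklearn.svm import SVC

# Two roughly separable Gaussian blobs (illustrative data).
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.array([1] * 20 + [-1] * 20)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    # Small C: more slack, wider margin, more support vectors.
    # Large C: slack is expensive, narrower margin.
    print(f"C={C:>6}: margin width = {2 / np.linalg.norm(w):.3f}, "
          f"support vectors = {clf.n_support_.sum()}")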
Solving the Optimization Problem
Find w and b such that
Φ(w) = ½ wTw is minimized;
and for all {(xi, yi)}: yi(wTxi + b) ≥ 1
• This is now optimizing a quadratic function subject to linear constraints.
• Quadratic optimization problems are a well-known class of mathematical programming problems, and many (intricate) algorithms exist for solving them (with many special ones built for SVMs).
• The solution involves constructing a dual problem where a Lagrange multiplier αi is associated with every constraint in the primal problem:
Find α1…αN such that
Q(α) = Σαi - ½ ΣΣ αiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi

Soft Margin Classification – Solution
• The dual problem for soft margin classification:
Find α1…αN such that
Q(α) = Σαi - ½ ΣΣ αiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi
• Neither the slack variables ξi nor their Lagrange multipliers appear in the dual problem!
• Again, the xi with non-zero αi will be support vectors.
• Solution to the dual problem is:
w = Σαiyixi
b = yk(1 - ξk) - wTxk where k = argmax_k' αk'
• w is not needed explicitly for classification: f(x) = ΣαiyixiTx + b

SVM: Adv Features: Kernels
• Map data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that the data are mapped implicitly to this space.
• The linear classifier relies on an inner product between vectors: K(xi, xj) = xiTxj.
• If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes K(xi, xj) = φ(xi)Tφ(xj).
• A kernel function is a function that corresponds to an inner product in some expanded feature space (see the numerical check below).
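A minimal numeric check of the kernel idea (the vectors and the explicit feature map are illustrative, not from the slides): the degree-2 polynomial kernel computed on the original 2-D vectors matches the inner product of explicit 6-D feature vectors, so the expanded space never has to be materialized.

import numpy as np

def phi(v):
    # One explicit degree-2 feature map for 2-D input.
    x1, x2 = v
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])

k_implicit = (1 + x @ z) ** 2    # kernel on the original 2-D vectors
k_explicit = phi(x) @ phi(z)     # inner product in the expanded 6-D space
print(k_implicit, k_explicit)    # both print 0.25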
The Optimization Problem Solution
• The solution has the form:
w = Σαiyixi
b = yk - wTxk for any xk such that αk ≠ 0
• Each non-zero αi indicates that the corresponding xi is a support vector.
• Then the classifying function has the form:
f(x) = ΣαiyixiTx + b
• Notice that it relies on an inner product between the test point x and the support vectors xi.
• Also keep in mind that solving the optimization problem involved computing the inner products xiTxj between all pairs of training points.

Classification with SVMs
• Given a new point x, we can score its projection onto the hyperplane normal:
– I.e., compute the score wTx + b = ΣαiyixiTx + b
– Decide the class based on whether the score is < or > 0
• Can set a confidence threshold t (a scoring sketch follows the kernels list below):
Score > t: yes
Score < -t: no
Else: don't know
[Figure: separating hyperplane at score 0 with margin levels at scores +1 and -1]

SVM: Adv Features: Kernels
• Common kernels
– Linear: K(x, z) = xTz
– Polynomial: K(x, z) = (1 + xTz)^d
• Gives feature conjunctions
– Radial basis function (infinite-dimensional space)
• Hasn't been very useful in text classification
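A minimal sketch, assuming scikit-learn (the toy data and the threshold t = 0.5 are illustrative): the fitted model stores only the support vectors and the products αi·yi (its dual_coef_ attribute), and the score of a new point is recovered exactly as ΣαiyixiTx + b.

import numpy as np
from sklearn.svm import SVC

# Two Gaussian blobs (illustrative data).
rng = np.random.RandomState(1)
X = np.vstack([rng.randn(30, 2) + [1.5, 1.5], rng.randn(30, 2) - [1.5, 1.5]])
y = np.array([1] * 30 + [-1] * 30)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

x_new = np.array([0.3, 0.2])
# Score from the dual form: sum_i alpha_i y_i (x_i^T x) + b
score = clf.dual_coef_[0] @ (clf.support_vectors_ @ x_new) + clf.intercept_[0]
print(score, clf.decision_function([x_new])[0])   # the two scores agree

# Confidence threshold t: answer only when the score is far from the boundary.
t = 0.5
print("yes" if score > t else ("no" if score < -t else "don't know"))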
Support Vector Machine: Three Main Ideas
1. Define what an optimal hyperplane is (taking into account that it needs to be computed efficiently): maximize the margin.
2. Generalize to non-linearly separable problems: have a penalty term for misclassifications.
3. Map data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that data are mapped implicitly to this space.

SVMs: Summary
• The classifier is a separating hyperplane.
• The most "important" training points are the support vectors; they define the hyperplane.
• Quadratic optimization algorithms can identify which training points xi are support vectors, i.e., those with non-zero Lagrange multipliers αi.
• Margin maximization is the aim.
• For non-linearly separable data, if a few points are misclassified by the separating hyperplane, penalize the misclassified points and minimize that penalty along with maximizing the margin.
• If the data is mostly not linearly separable, use a kernel function to map the data to a higher dimension, then optimize to find the separating hyperplane with maximum margin in the high-dimensional space (see the end-to-end sketch below).
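An end-to-end sketch, assuming scikit-learn (the make_circles data, C, and the two kernels are illustrative choices): a linear SVM cannot separate concentric circles, while an RBF-kernel SVM finds a max-margin separator in the implicit high-dimensional space.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0).fit(X, y)
    # Training accuracy: linear stays near chance, rbf separates the circles.
    print(kernel, clf.score(X, y))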