
Machine Learning and Network Analysis
MA4207
Looking back at ML Algorithms
Earlier methods (pre-1980s)
◼ Linear decision boundary
◼ Based on theoretical properties
1980-90
◼ Neural networks, decision trees, ensemble learning
◼ Almost no theoretical basis
Post-1990
◼ Nonlinear functions based on computational theory
◼ Theoretical properties
Main Ideas of current ML Algorithms

◼ Efficient separability using non-linear regions
◼ Use of kernel functions
◼ Dot-product-based similarity measures (see the kernel sketch below)
◼ Quadratic optimization for a global minimum
◼ New methods use more optimization and less greedy search
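To make the dot-product idea concrete, here is a minimal numpy sketch (my own illustration, not from the slides): a linear kernel is just the Gram matrix of pairwise dot products, and an RBF kernel is a common nonlinear similarity built from the same quantities.

```python
import numpy as np

def linear_kernel(X, Z):
    """Gram matrix of pairwise dot products: K[i, j] = x_i . z_j."""
    return X @ Z.T

def rbf_kernel(X, Z, gamma=1.0):
    """RBF (Gaussian) kernel: K[i, j] = exp(-gamma * ||x_i - z_j||^2)."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Z**2, axis=1)[None, :]
        - 2.0 * X @ Z.T
    )
    return np.exp(-gamma * sq_dists)

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
print(linear_kernel(X, X))  # dot-product similarities
print(rbf_kernel(X, X))     # nonlinear similarities in an implicit feature space
```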
Support Vector Machine

◼ Optimal hyperplane for linearly separable patterns
◼ Use of kernel functions for a non-linear decision plane
◼ Maximum-margin classifier
◼ Extendable to multi-class problems (a usage sketch follows below)
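As a quick usage sketch (my own, assuming scikit-learn is available; the slides do not prescribe a library), a linear SVM can be fit in a few lines. With a large C it approximates the hard-margin classifier discussed here, and its support vectors are exposed directly.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: two clusters in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[-2, -2], size=(20, 2)),
               rng.normal(loc=[+2, +2], size=(20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

# Large C approximates the hard-margin SVM discussed in these slides.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("w =", clf.coef_[0], " b =", clf.intercept_[0])
print("support vectors:\n", clf.support_vectors_)   # the 'difficult' boundary points
print("predictions:", clf.predict([[0.5, 0.5], [-1.0, -3.0]]))
```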
Support Vectors
◼ Support vectors are the data points that lie closest to the decision surface (hyperplane)
◼ They are the data points most difficult to classify
◼ They are instrumental in determining the optimum location of the decision surface
◼ It can be shown that the optimal hyperplane stems from the function class with the lowest "capacity" (number of independent features/parameters)
Maximum Margin Decision Plane
◼ A "skinny" decision plane (small margin) is able to adopt many orientations
◼ A "fat" one (large margin) has limited flexibility
◼ Larger margin ⇒ lower capacity
◼ If the margin is sufficiently large, the complexity of the function will be low even if the dimensionality is very high!
Linearly separable data, binary classification
◼ In general there are lots of possible solutions (an infinite number!)
◼ The Support Vector Machine finds an optimal solution
Linear SVM
◼ SVMs maximize the margin around the separating hyperplane.
◼ The decision function is fully specified by a (usually very small) subset of the training samples: the support vectors.
◼ Training reduces to a quadratic programming problem that is easy to solve by standard methods.
Linear SVM
◼ Find a, b, c such that
  ax + by ≥ c for red points and
  ax + by ≤ c (or < c) for green points.
◼ There are lots of possible solutions for a, b, c.
◼ Some methods find a separating hyperplane, but not the optimal one (e.g., a neural net); see the comparison sketch below.
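To illustrate the last bullet, the following sketch (my own, assuming scikit-learn) fits both a perceptron and a linear SVM on the same separable data and compares their geometric margins $\min_i y_i(w \cdot x_i + b)/\|w\|$; both separate the data, but the SVM's margin is typically the larger one.

```python
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([-2, -2], 0.5, (30, 2)),
               rng.normal([+2, +2], 0.5, (30, 2))])
y = np.array([-1] * 30 + [+1] * 30)

def geometric_margin(w, b, X, y):
    """Smallest signed distance of any point to the hyperplane w.x + b = 0."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

perc = Perceptron().fit(X, y)
svm = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

print("perceptron margin:", geometric_margin(perc.coef_[0], perc.intercept_[0], X, y))
print("SVM margin:       ", geometric_margin(svm.coef_[0], svm.intercept_[0], X, y))
```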
Which points should influence optimality?
◼ All points?
◼ Only the "difficult" points close to the decision boundary (the support vectors)
◼ The support vectors decide the location of the hyperplane
◼ Optimization techniques are used to find the optimum hyperplane
Linear SVM
Support vectors: input vectors that just touch the boundary of the margin (the "street"). In the original figure there are three of them, v1, v2, v3 (or, rather, the 'tips' of the vectors), lying on the planes $w^T x + b = +1$ or $w^T x + b = -1$; d denotes half of the street 'width'.
Linear SVM
Define the hyperplanes H such that:
$w \cdot x_i + b \ge +1$ when $y_i = +1$
$w \cdot x_i + b \le -1$ when $y_i = -1$
H1 and H2 are the planes:
H1: $w \cdot x_i + b = +1$
H2: $w \cdot x_i + b = -1$
The points on the planes H1 and H2 are the tips of the support vectors.
The plane H0 is the median in between, where $w \cdot x_i + b = 0$.
$d_+$ = the shortest distance to the closest positive point
$d_-$ = the shortest distance to the closest negative point
The margin (gutter) of a separating hyperplane is $d_+ + d_-$.
Linear SVM
The classifier should separate the data with the biggest margin.

The distance between a point $x$ and the plane $(w, b)$ is
$$ d(x; w, b) = \frac{|w \cdot x + b|}{\|w\|} $$

The optimal hyperplane has infinitely many equivalent parameterizations (any rescaling of $(w, b)$ describes the same plane); choose the scaling such that the discriminant function becomes 1 for the training examples closest to the boundary:
$$ |w \cdot x + b| = 1 \quad \text{for the closest points (the canonical hyperplane)} $$

The distance between H1 and H0 is then $1/\|w\|$, and the margin becomes
$$ d_+ + d_- = \frac{2}{\|w\|} $$
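A small numpy sketch (my own illustration) of these formulas, using a hypothetical canonically scaled hyperplane: the distance of each point is $|w \cdot x + b|/\|w\|$ and the margin is $2/\|w\|$.

```python
import numpy as np

def distance_to_hyperplane(X, w, b):
    """Unsigned distance of each row of X to the plane w.x + b = 0."""
    return np.abs(X @ w + b) / np.linalg.norm(w)

# A hypothetical canonical hyperplane: |w.x + b| = 1 at the closest points.
w = np.array([1.0, 1.0])
b = -1.0
X = np.array([[1.0, 1.0],    # w.x + b = +1  -> lies on H1
              [0.0, 0.0],    # w.x + b = -1  -> lies on H2
              [2.0, 2.0]])   # an interior point of the +1 class

print(distance_to_hyperplane(X, w, b))              # [0.707..., 0.707..., 2.121...]
print("margin = 2/||w|| =", 2 / np.linalg.norm(w))  # 1.414...
```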


Linear SVM
In order to maximize the margin, we thus need to minimize $\|w\|$, with the condition that there are no data points between H1 and H2:
$w^T x_i + b \ge +1$ for points with $y_i = +1$
$w^T x_i + b \le -1$ for points with $y_i = -1$
$\Rightarrow \; y_i(w^T x_i + b) \ge 1$ for all $i$

Constrained optimization problem:
$$ \min_{w,b} \; J(w) = \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w^T x_i + b) \ge 1, \; i = 1, \dots, N $$

$J(w)$ is a quadratic function whose surface is a paraboloid with a single global minimum (an improvement over neural networks, which can get stuck in local minima).
It is solved using classical Lagrangian optimization; a direct-solver sketch follows below.
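As promised above, here is a minimal sketch (my own, assuming scipy and a toy separable dataset) that feeds this constrained quadratic program to a general-purpose solver; real SVM implementations use specialized QP or dual solvers instead.

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable data in 2-D.
X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 3.0],
              [0.0, 0.0], [-0.5, 1.0], [0.5, -1.0]])
y = np.array([+1, +1, +1, -1, -1, -1])

# Decision variables z = (w1, w2, b).
def objective(z):
    w = z[:2]
    return 0.5 * w @ w                      # J(w) = 1/2 ||w||^2

def margin_constraints(z):
    w, b = z[:2], z[2]
    return y * (X @ w + b) - 1.0            # must be >= 0: y_i(w.x_i + b) >= 1

res = minimize(objective,
               x0=np.zeros(3),
               method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])

w, b = res.x[:2], res.x[2]
print("w =", w, " b =", b, " margin =", 2 / np.linalg.norm(w))
```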
Kuhn-Tucker (KKT) Theorem

At the optimum, the multipliers satisfy $\alpha_i \ge 0$ together with the complementary slackness condition
$$ \alpha_i \left[ y_i(w^T x_i + b) - 1 \right] = 0 \quad \forall i $$
so for active constraints $\alpha_i \ge 0$ (and may be positive), while for inactive constraints $\alpha_i = 0$.

The KKT conditions allow us to identify the training examples that define the largest-margin hyperplane.
Linear SVM
Constrained minimization of $J(w)$ is solved by introducing the Lagrangian
$$ L_P(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i(w^T x_i + b) - 1 \right], \qquad \alpha_i \ge 0 $$

This yields an unconstrained optimization problem that is solved by:
• minimizing $L_P$ w.r.t. the primal variables $w$ and $b$, and
• maximizing $L_P$ w.r.t. the dual variables $\alpha_i \ge 0$ (the Lagrange multipliers).
Thus, the optimum is defined by a saddle point.
This is known as the Lagrangian primal problem.
Linear SVM
◼ To simplify the primal problem, we eliminate the primal variables $(w, b)$: differentiating $L_P(w, b, \alpha)$ with respect to $w$ and $b$ and setting the derivatives to zero yields
$$ \frac{\partial L_P}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i} \alpha_i y_i x_i, \qquad \frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_{i} \alpha_i y_i = 0 $$
◼ Expansion of $L_P$ yields
$$ L_P = \tfrac{1}{2} w^T w - \sum_i \alpha_i y_i w^T x_i - b \sum_i \alpha_i y_i + \sum_i \alpha_i $$
◼ Using the optimality condition $\partial L_P/\partial w = 0$, the first term in $L_P$ can be expressed as
$$ \tfrac{1}{2} w^T w = \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, x_i^T x_j $$
◼ The second term in $L_P$ can be expressed in the same way:
$$ \sum_i \alpha_i y_i w^T x_i = \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, x_i^T x_j $$
◼ The third term in $L_P$ is zero by virtue of the optimality condition $\partial L_P/\partial b = 0$.
Linear SVM
Merging all the expressions together we get
$$ L_D(\alpha) = \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, x_i^T x_j $$
subject to the constraints
$$ \sum_i \alpha_i y_i = 0, \qquad \alpha_i \ge 0 \;\; \forall i $$
Lagrangian dual problem ($L_P \Rightarrow L_D$)
◼ This transforms the problem of finding a saddle point of $L_P(w, b, \alpha)$ into the easier one of maximizing $L_D(\alpha)$.
◼ $L_D(\alpha)$ depends only on the Lagrange multipliers $\alpha$, not on $(w, b)$.
◼ The primal problem scales with dimensionality ($w$ has one coefficient for each dimension), whereas the dual problem scales with the amount of training data (there is one Lagrange multiplier per example).
◼ In $L_D(\alpha)$, the training data appears only through dot products $x_i^T x_j$.
◼ This property can be cleverly exploited to perform the classification in a higher (even infinite) dimensional space; a dual-solver sketch follows below.
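As referenced above, here is a minimal sketch (my own, assuming scipy and a toy dataset) that maximizes $L_D(\alpha)$ numerically and then recovers $w$ and $b$; note that the matrix of dot products $x_i^T x_j$ is the only place the data enters, so it could be replaced by any kernel matrix.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data.
X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 3.0],
              [0.0, 0.0], [-0.5, 1.0], [0.5, -1.0]])
y = np.array([+1.0, +1.0, +1.0, -1.0, -1.0, -1.0])
N = len(y)

G = (y[:, None] * X) @ (y[:, None] * X).T     # G[i, j] = y_i y_j x_i.x_j (dot products only)

def neg_dual(alpha):                          # minimize -L_D(alpha)
    return 0.5 * alpha @ G @ alpha - alpha.sum()

res = minimize(neg_dual,
               x0=np.zeros(N),
               method="SLSQP",
               bounds=[(0.0, None)] * N,                              # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_i alpha_i y_i = 0

alpha = res.x
sv = alpha > 1e-6                              # support vectors: alpha_i > 0
w = (alpha * y) @ X                            # w = sum_i alpha_i y_i x_i
b = np.mean(y[sv] - X[sv] @ w)                 # from the KKT condition on the support vectors

print("alpha =", np.round(alpha, 4))
print("support vectors:\n", X[sv])
print("w =", w, " b =", b)
```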
Linear SVM
The KKT complementary condition states that, for every point in the training set, the following equality must hold:
$$ \alpha_i \left[ y_i(w^T x_i + b) - 1 \right] = 0 \quad \forall i $$
Therefore, for every example, either $\alpha_i = 0$ or $y_i(w^T x_i + b) - 1 = 0$ must hold.

Those points for which $\alpha_i > 0$ must then lie on one of the two hyperplanes that define the largest margin (the term $y_i(w^T x_i + b) - 1$ becomes zero only on these hyperplanes). These points are known as the support vectors. All the other points must have $\alpha_i = 0$.

The support vectors contribute to defining the optimal hyperplane:
$$ w = \sum_{i \in SV} \alpha_i y_i x_i $$

The bias term $b$ is found from the KKT complementary condition on the support vectors (e.g., $b = y_s - w^T x_s$ for any support vector $x_s$). Hence the complete dataset could be replaced by only the support vectors, and the separating hyperplane would be the same (see the check below).
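A quick empirical check of that last claim (my own sketch, assuming scikit-learn, separable data, and a large C to approximate the hard margin): refitting on the support vectors alone should reproduce essentially the same hyperplane.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([-2, -2], 0.5, (50, 2)),
               rng.normal([+2, +2], 0.5, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

full = SVC(kernel="linear", C=1e6).fit(X, y)

# Keep only the support vectors and refit on them alone.
sv_idx = full.support_
reduced = SVC(kernel="linear", C=1e6).fit(X[sv_idx], y[sv_idx])

print("w (full)    :", full.coef_[0],    " b:", full.intercept_[0])
print("w (SVs only):", reduced.coef_[0], " b:", reduced.intercept_[0])
```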
Non-Linear SVM

[The remaining four slides are figures only: they illustrate how a kernel function maps the data into a higher-dimensional feature space in which a linear separating hyperplane can be found.]
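As a closing illustration of the non-linear case (my own sketch, assuming scikit-learn; the gamma value is an arbitrary choice), an RBF-kernel SVM separates data that no hyperplane in the original input space can:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, noise=0.05, factor=0.4, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=2.0).fit(X, y)

print("training accuracy, linear kernel:", linear.score(X, y))  # roughly chance level
print("training accuracy, RBF kernel:   ", rbf.score(X, y))     # close to 1.0
```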
