Support Vector Machine
Are explicitly based on a theoretical model of learning
Come with theoretical guarantees about their performance
Are not affected by local minima
Do not suffer from the curse of dimensionality
Support vectors are the data points that lie closest to the decision surface.
They are the most difficult points to classify.
They have a direct bearing on the optimum location of the decision surface.
Which Hyperplane?
In general, there are many possible solutions.
The Support Vector Machine finds an optimal solution.
Support Vector Machine
Also known as the maximum-margin classifier
Support vectors are the elements of the training set that would change the position of the dividing hyperplane if removed.
Support vectors are the critical elements of the training set.
Finding the optimal hyperplane is an optimization problem that can be solved by standard optimization techniques (using Lagrange multipliers to obtain a form that can be solved analytically).
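As a concrete illustration, here is a minimal sketch, assuming scikit-learn and an invented toy 2-D dataset, that fits a (nearly) hard-margin linear SVM and reads off which training points end up as support vectors:

import numpy as np
from sklearn.svm import SVC

# Invented toy data: two linearly separable classes in 2-D.
X = np.array([[1.0, 1.0], [2.0, 2.5], [0.5, 2.0],    # class -1
              [3.0, 0.5], [4.0, 1.5], [3.5, -0.5]])  # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)  # large C approximates the hard margin
clf.fit(X, y)

print(clf.support_vectors_)  # the critical points closest to the decision surface
print(clf.support_)          # their indices in the training set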
Maximizing the margin
We want a classifier with as large a margin as possible.
To maximize the margin, we need to minimize ||w||, with the condition that there are no data points between the margin hyperplanes H1 (x · w + b = +1) and H2 (x · w + b = −1):
xi · w + b ≥ +1 when yi = +1
xi · w + b ≤ −1 when yi = −1
These can be combined into yi(xi · w + b) ≥ 1
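To make these constraints concrete, here is a minimal numpy check; the hyperplane w = (1, −1), b = −1.25 and the four points are hand-picked for illustration, not derived in the slides:

import numpy as np

# Hand-picked toy points and a hand-picked separating hyperplane.
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 0.5], [4.0, 1.5]])
y = np.array([-1, -1, 1, 1])
w, b = np.array([1.0, -1.0]), -1.25

margins = y * (X @ w + b)     # yi(xi . w + b) for every point
assert np.all(margins >= 1)   # no data points between H1 and H2
print(2 / np.linalg.norm(w))  # margin width 2/||w|| ~ 1.414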
Constrained optimization problem
Our goal is to develop a computationally efficient procedure that uses the training sample T = {(xi, di)}, i = 1, …, N, to find the optimal hyperplane subject to this constraint.
When you want to maximize (or minimize) a multivariable function f(x, y, …) subject to the constraint that another multivariable function equals a constant, g(x, y, …) = c, follow these steps:
Step 1: Introduce a new variable λ and define a new function
L(x, y, …, λ) = f(x, y, …) − λ(g(x, y, …) − c)
The function L is known as the Lagrange function and λ is known as the Lagrange multiplier.
Step 2: Set ∇L(x, y, …, λ) = 0.
Step 3: Consider each solution, which will look something like (x0, y0, …, λ0). Plug each one into the function f; whichever gives the greatest (or smallest) value is the maximum (or minimum) point you are seeking.
Example:
Problem: Suppose you are running a factory, producing some sort of widget that requires steel as a raw material. Your costs are predominantly human labor, which is $20 per hour for your workers, and the steel itself, which costs $170 per ton. Suppose your revenue R is loosely modeled by the following equation:
R(h, s) = 200 · h^(2/3) · s^(1/3)
where h = hours of labor and s = tons of steel.
If your budget is $20,000, what is the maximum possible revenue?
This budgetary constraint can be modeled as:
20h + 170s = 20000
or g(h, s) = 20h + 170s − 20000 = 0
We begin by writing the Lagrangian function for this setup:
Step 1: L(h, s, λ) = 200 · h^(2/3) · s^(1/3) − λ(20h + 170s − 20000)
Step 2: Set the gradient ∇L = 0 for every variable of the function:
∂L/∂h = 200 · (2/3) · h^(−1/3) · s^(1/3) − 20λ = 0 .......... (1)
∂L/∂s = 200 · (1/3) · h^(2/3) · s^(−2/3) − 170λ = 0 .......... (2)
∂L/∂λ = −(20h + 170s − 20000) = 0 .......... (3)
Now, solving the three equations, we find
h = 666.67
s = 39.22
λ = 2.59
This means you should employ about 667 hours of labor and purchase about 39 tons of steel, which gives a maximum revenue of approximately $51,855 while satisfying your budgetary constraint.
The interpretation of λ is: for every additional $1 of budget, you gain about $2.59 of additional revenue.
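The same three equations can be solved programmatically; a minimal sympy sketch of the worked example above (the starting guess for the numeric solver is an arbitrary choice):

import sympy as sp

h, s, lam = sp.symbols("h s lam", positive=True)
R = 200 * h**sp.Rational(2, 3) * s**sp.Rational(1, 3)  # revenue model
L = R - lam * (20*h + 170*s - 20000)                   # Lagrangian

eqs = [sp.diff(L, v) for v in (h, s, lam)]             # the gradient, set to 0
sol = sp.nsolve(eqs, (h, s, lam), (600, 40, 2))        # solve near a rough guess
print(sol)  # approximately [666.67, 39.22, 2.59]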
Returning to the SVM: setting the derivatives of the Lagrangian function to zero with respect to w (condition 1) and b (condition 2), we get, respectively:
w = Σ αi di xi and Σ αi di = 0 (sums over i = 1, …, N)
The solution vector w is defined in terms of an expansion that involves the N training examples.
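This expansion can be checked numerically; a hedged sketch, assuming scikit-learn, whose fitted SVC stores the products αi·di in dual_coef_ and the support vectors in support_vectors_ (toy data invented for the example):

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 0.5], [4.0, 1.5]])
d = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, d)

# w = sum_i alpha_i * d_i * x_i, summed over the support vectors only
w_expansion = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_expansion, clf.coef_))  # True: same weight vector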
The primal problem is a convex optimization problem with linear constraints.
Given such a constrained optimization problem, we can construct a corresponding dual problem.
As per the Duality Theorem: if the primal problem has an optimal solution, then the dual problem also has an optimal solution, and the corresponding optimal values are equal.
To postulate our dual problem, we expand the primal Lagrangian and substitute w = Σ αi di xi back into it.
We can then reformulate the problem as the dual problem: maximize
Q(α) = Σ αi − (1/2) Σi Σj αi αj di dj (xi · xj)
subject to Σ αi di = 0 and αi ≥ 0.
Note that this optimization depends on the samples xi only through the dot products (xi)ᵀ(xj).
If we lift xi to a high-dimensional space using a mapping φ(x), we need to compute the high-dimensional product φ(xi)ᵀφ(xj).
Alternatively, can we find a function that directly computes the value of this dot product in the high-dimensional space? Such a function is known as a KERNEL FUNCTION.
Kernel functions do not need to perform operations in the high-dimensional space explicitly.
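A minimal sketch of this idea, using the homogeneous polynomial kernel K(x, z) = (x · z)² as an assumed example (this particular kernel is not named in the slides): the kernel returns the dot product of the lifted vectors without ever forming them.

import numpy as np

def phi(x):
    # Explicit feature map into 3-D, shown only for comparison.
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def kernel(x, z):
    # Computes phi(x) . phi(z) directly, staying in the original 2-D space.
    return np.dot(x, z) ** 2

x, z = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(np.dot(phi(x), phi(z)), kernel(x, z))  # both print 16.0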
THANK YOU