
Support Vector Machines and Kernels
Doing Really Well with Linear Decision Surfaces
Linear Separators
• Training instances
  • x ∈ ℝ^n
  • y ∈ {-1, 1}
• Model parameters
  • w ∈ ℝ^n
  • b ∈ ℝ
• Hyperplane
  • <w, x> + b = 0
  • w1x1 + w2x2 + … + wnxn + b = 0
• Decision function
  • f(x) = sign(<w, x> + b), sketched in code below

Math review, the inner (dot) product:
<a, b> = a · b = Σi ai bi = a1b1 + a2b2 + … + anbn
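A minimal sketch of this decision function in Python with NumPy, assuming the weights w and offset b have already been learned; the numbers below are made up purely for illustration.

import numpy as np

def decision(x, w, b):
    # Linear SVM decision function: f(x) = sign(<w, x> + b)
    return np.sign(np.dot(w, x) + b)

# Hypothetical learned parameters for a 2-D problem
w = np.array([0.4, -1.2])
b = 0.5

print(decision(np.array([2.0, 1.0]), w, b))   #  1.0 -> class +1
print(decision(np.array([-1.0, 2.0]), w, b))  # -1.0 -> class -1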
Intuitions

[Figure, repeated across several slides: a 2-D scatter of X and O points with different candidate linear separators drawn through it.]
A “Good” Separator

[Figure: the X/O scatter with a separating line that leaves clear space between itself and both classes.]

Noise in the Observations

[Figure: the same scatter with noisy observations shifting points near the boundary.]

Ruling Out Some Separators

[Figure: candidate separators that pass too close to the data points are ruled out.]

Lots of Noise

[Figure: with heavier noise, separators with little clearance would misclassify points.]

Maximizing the Margin

[Figure: the separator chosen to maximize the distance to the closest X and O points.]

“Fat” Separators

[Figure: the separator drawn as a wide band, the “fattest” band that still splits the two classes.]
Why Maximize Margin?
• Increasing the margin reduces capacity
• Capacity must be restricted to generalize
• m training instances
  • 2^m ways to label them
  • What if the function class can separate every one of those labelings?
  • Then it shatters the training instances
• The VC dimension is the largest m such that the function class can shatter some set of m points
VC Dimension Example

[Figure: the 2^3 = 8 possible labelings of three points in the plane; each labeling can be separated by a line, so linear separators in 2-D shatter three points. A computational check follows below.]
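A quick check of this example, assuming scikit-learn is available: take three non-collinear points and verify that every labeling that uses both classes can be fit perfectly by a linear SVM (the two single-class labelings are trivially separable). The points are an arbitrary choice.

import itertools
import numpy as np
from sklearn.svm import SVC

# Three non-collinear points in the plane (arbitrary choice)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

for labels in itertools.product([-1, 1], repeat=3):
    y = np.array(labels)
    if len(set(labels)) < 2:
        print(labels, "trivially separable (single class)")
        continue
    # Very large C approximates a hard margin (no slack tolerated)
    clf = SVC(kernel="linear", C=1e6).fit(X, y)
    ok = clf.score(X, y) == 1.0
    print(labels, "separable" if ok else "NOT separable")

# All 8 labelings are achievable, so linear separators in 2-D shatter these
# three points and their VC dimension is at least 3.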
Support Vectors

[Figure: the maximum-margin separator with the X and O points that lie on the margin, the support vectors, highlighted.]
The Math
• For perfect classification, we want
  • yi (<w, xi> + b) ≥ 0 for all i
  • Why? Because yi and (<w, xi> + b) then have the same sign, so every instance falls on the correct side of the hyperplane.
• To maximize the margin, we want
  • the w that minimizes |w|^2 (the margin width is 2/|w|, so a smaller |w| means a wider margin)
Dual Optimization Problem
• Maximize over α
  • W(α) = Σi αi - 1/2 Σi,j αi αj yi yj <xi, xj>
• Subject to
  • αi ≥ 0
  • Σi αi yi = 0
• Decision function
  • f(x) = sign(Σi αi yi <x, xi> + b), sketched in code below
• Reference material (click the links below):
  • Primal/dual problems: https://www.youtube.com/watch?v=OR-xXUmBtYU
  • KKT conditions: https://www.youtube.com/watch?v=vx7d32Jz97w
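A minimal NumPy sketch of the dual decision function, assuming the multipliers αi, labels yi, support vectors xi and offset b have already been obtained from the optimization; the numbers below are illustrative only, not the result of an actual fit.

import numpy as np

def dual_decision(x, alphas, ys, support_vectors, b):
    # f(x) = sign(sum_i alpha_i * y_i * <x, x_i> + b)
    score = np.sum(alphas * ys * (support_vectors @ x)) + b
    return np.sign(score)

# Illustrative values (not actually optimized)
support_vectors = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0]])
ys = np.array([1.0, 1.0, -1.0])
alphas = np.array([0.3, 0.2, 0.5])
b = -0.1

print(dual_decision(np.array([1.5, 1.0]), alphas, ys, support_vectors, b))  # 1.0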
What if Data Are Not Perfectly Linearly Separable?
• We cannot find w and b that satisfy
  • yi (<w, xi> + b) ≥ 1 for all i
• Introduce slack variables ξi
  • yi (<w, xi> + b) ≥ 1 - ξi for all i
• Minimize
  • |w|^2 + C Σi ξi (a small numeric sketch follows below)
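A small NumPy sketch of this soft-margin objective for a fixed (w, b), assuming each slack is set to its smallest feasible value ξi = max(0, 1 - yi(<w, xi> + b)); all numbers are illustrative.

import numpy as np

def soft_margin_objective(w, b, X, y, C):
    # |w|^2 + C * sum_i xi_i, with xi_i = max(0, 1 - y_i * (<w, x_i> + b))
    margins = y * (X @ w + b)
    slacks = np.maximum(0.0, 1.0 - margins)
    return np.dot(w, w) + C * np.sum(slacks)

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-0.5, 0.5]])
y = np.array([1, 1, -1, -1])
w = np.array([0.5, 0.5])
b = 0.0

print(soft_margin_objective(w, b, X, y, C=1.0))    # 1.5   (one point violates the margin)
print(soft_margin_objective(w, b, X, y, C=100.0))  # 100.5 (larger C weights that violation heavily)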
Strengths of SVMs
• Good generalization in theory
• Good generalization in practice
• Work well with few training instances
• Find the globally best model
• Efficient algorithms
• Amenable to the kernel trick …
Simple example
• We have two colors of balls on a table that we want to separate.
• We get a stick and put it on the table. This works pretty well, right?
• Some villain comes and places more balls on the table. The stick kind of works, but one of the balls is on the wrong side, and there is probably a better place to put the stick now.
• SVMs try to put the stick in the best possible place by leaving as big a gap on either side of the stick as possible.
• There is another trick in the SVM toolbox that is even more important. Say the villain has seen how good you are with a stick, so he gives you a new challenge.
• There is no stick in the world that will let you split those balls well, so what do you do? You flip the table, of course, throwing the balls into the air. Then, with your pro ninja skills, you grab a sheet of paper and slip it between the balls.
• Here,
  • boring adults call the balls data,
  • the stick is a classifier,
  • the biggest-gap trick is called optimization,
  • flipping the table is called kernelling, and
  • the piece of paper is a hyperplane.

What if the Surface is Non-Linear?

[Figure: a cluster of X points surrounded by O points; no straight line separates the two classes.]

Image from http://www.atrandomresearch.com/iclass/
Soft Margins
• The idea is based on a simple premise: allow the SVM to make a certain number of mistakes while keeping the margin as wide as possible, so that the other points can still be classified correctly.
• This can be done simply by modifying the objective of the SVM.
• A decision boundary with a wider margin (the green boundary in the original figure) generalizes better on unseen data; in this situation, the soft-margin formulation helps avoid overfitting.
How it Works (Mathematically)?
• Aim to minimize the following objective:
  • |w|^2 + C Σi ξi
• This differs from the original objective in the second term. Here, C is a hyperparameter that decides the trade-off between maximizing the margin and minimizing the mistakes (a small scikit-learn comparison follows below).
• When C is small, classification mistakes are given less importance and the focus is on maximizing the margin.
• Similarly, when C is large, the focus is on avoiding misclassification, at the expense of keeping the margin small.
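A hedged scikit-learn sketch of this trade-off: fit the same slightly overlapping data with a small and a large C and compare how many support vectors each model keeps (wider margins typically keep more support vectors). The dataset is synthetic and used only for illustration.

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two slightly overlapping blobs (synthetic data)
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:<6} support vectors={clf.n_support_.sum():3d} "
          f"training accuracy={clf.score(X, y):.2f}")

# Small C: wide margin, more support vectors, possibly a few training mistakes.
# Large C: narrower margin, fewer support vectors, fits the training data more tightly.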
Kernel Methods
Making the Non-Linear Linear

When Linear Separators Fail

[Figure: data that is not linearly separable in the original x1-x2 space becomes linearly separable after mapping x1 to x1^2.]

Note: kernels are a way to solve non-linear problems with the help of linear classifiers. This is known as the kernel trick.
Simple example
• In this case we cannot find a straight line to separate the apples from the lemons. So how can we solve this problem? We will use the kernel trick! (2D to 3D)
Function of a Kernel
• A kernel is a function used in SVMs to help solve problems.
• Kernels provide shortcuts that avoid complex calculations.
• The amazing thing about kernels is that they let us go to higher dimensions and still perform the calculations smoothly.
Mapping into a New Feature Space
• Φ: x → X = Φ(x)
• Φ(x1, x2) = (x1, x2, x1^2, x2^2, x1x2)
• Rather than running the SVM on the original xi, run it on Φ(xi), i.e. transform the dimensions (a sketch follows below)
• Find a non-linear separator in the input space
• What if Φ(xi) is really big?
• Use kernels to compute it implicitly!

Image from http://web.engr.oregonstate.edu/~afern/classes/cs534/
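A minimal sketch of this idea with scikit-learn: explicitly apply the map Φ(x1, x2) = (x1, x2, x1^2, x2^2, x1x2) to ring-shaped data and fit a plain linear SVM in the new feature space. The dataset is synthetic and used only for illustration.

import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

def phi(X):
    # Explicit feature map (x1, x2) -> (x1, x2, x1^2, x2^2, x1*x2)
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 ** 2, x2 ** 2, x1 * x2])

# A ring of one class around a blob of the other: not linearly separable in 2-D
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_in_2d = SVC(kernel="linear").fit(X, y)
linear_in_5d = SVC(kernel="linear").fit(phi(X), y)

print("accuracy in the original 2-D space:", linear_in_2d.score(X, y))
print("accuracy in the mapped 5-D space:  ", linear_in_5d.score(phi(X), y))
# The same linear classifier separates the classes once the data has been mapped.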
Kernels
• Find a kernel K such that
  • K(x1, x2) = <Φ(x1), Φ(x2)>
• Computing K(x1, x2) should be efficient, much more so than computing Φ(x1) and Φ(x2)
• Use K(x1, x2) in the SVM algorithm rather than <x1, x2>
• Remarkably, this is possible
Examples of Kernel Functions
• Linear: K(xi, xj) = xi^T xj
  • Mapping Φ: x → φ(x), where φ(x) is x itself
• Polynomial of power p: K(xi, xj) = (1 + xi^T xj)^p
  • Mapping Φ: x → φ(x), where φ(x) has C(d + p, p) dimensions (d is the input dimension)
• Gaussian (radial-basis function): K(xi, xj) = exp(-|xi - xj|^2 / (2σ^2))
  • Mapping Φ: x → φ(x), where φ(x) is infinite-dimensional: every point is mapped to a function (a Gaussian); a combination of these functions for the support vectors is the separator.
The Polynomial Kernel
• K(x1, x2) = <x1, x2>^2
• x1 = (x11, x12)
• x2 = (x21, x22)
• <x1, x2> = x11x21 + x12x22
• <x1, x2>^2 = x11^2 x21^2 + x12^2 x22^2 + 2 x11 x21 x12 x22
• Φ(x1) = (x11^2, x12^2, √2 x11 x12)
• Φ(x2) = (x21^2, x22^2, √2 x21 x22)
• K(x1, x2) = <Φ(x1), Φ(x2)> (verified numerically below)
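A quick numerical check of this identity in NumPy, using arbitrary example vectors:

import numpy as np

def phi(x):
    # Feature map for the degree-2 polynomial kernel on 2-D inputs
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x1 = np.array([1.0, 3.0])
x2 = np.array([2.0, -1.0])

kernel_value = np.dot(x1, x2) ** 2         # <x1, x2>^2
explicit_value = np.dot(phi(x1), phi(x2))  # <phi(x1), phi(x2)>

print(kernel_value, explicit_value)  # both equal 1.0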
The Polynomial Kernel
• Φ(x) contains all monomials of degree d
• Useful in visual pattern recognition
• Number of monomials
  • 16×16 pixel image
  • ~10^10 monomials of degree 5 (a quick check follows below)
• Never explicitly compute Φ(x)!
• Variation: K(x1, x2) = (<x1, x2> + 1)^2
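A quick sanity check of the 10^10 figure, assuming it counts the monomials of degree 5 over the 256 pixels of a 16×16 image (combinations with repetition, C(256 + 5 - 1, 5)):

import math

pixels = 16 * 16  # 256 input dimensions
degree = 5
# Monomials of degree 5 in 256 variables, chosen with repetition
count = math.comb(pixels + degree - 1, degree)
print(count)  # 9,525,431,552 ≈ 10^10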
A Few Good Kernels
• Dot product kernel
  • K(x1, x2) = <x1, x2>
• Polynomial kernel
  • K(x1, x2) = <x1, x2>^d (monomials of degree d)
  • K(x1, x2) = (<x1, x2> + 1)^d (all monomials of degree 1, 2, …, d)
• Gaussian kernel
  • K(x1, x2) = exp(-|x1 - x2|^2 / (2σ^2))
  • Radial basis functions
• Sigmoid kernel
  • K(x1, x2) = tanh(<x1, x2> + θ)
  • Neural networks
• Establishing “kernel-hood” from first principles is non-trivial
The Kernel Trick
“Given an algorithm which is formulated in terms of a positive definite kernel K1, one can construct an alternative algorithm by replacing K1 with another positive definite kernel K2.”
• SVMs can use the kernel trick
• To get more info on SVMs, visit: https://www.youtube.com/watch?v=GcCG0PPV6cg
Using a Different Kernel in the Dual Optimization Problem
• For example, use the polynomial kernel with d = 4 (including lower-order terms): K(xi, xj) = (<xi, xj> + 1)^4
• Maximize over α
  • W(α) = Σi αi - 1/2 Σi,j αi αj yi yj (<xi, xj> + 1)^4
• Subject to
  • αi ≥ 0
  • Σi αi yi = 0
• Decision function
  • f(x) = sign(Σi αi yi (<x, xi> + 1)^4 + b)
• The inner products <xi, xj> and <x, xi> are themselves kernels, so by the kernel trick we just replace them (a scikit-learn sketch follows below).
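In scikit-learn this substitution is just a kernel choice; a hedged sketch using the degree-4 polynomial kernel with the +1 term (coef0=1, gamma=1) on synthetic data:

from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Synthetic, non-linearly-separable data for illustration
X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

# scikit-learn's polynomial kernel is (gamma * <xi, xj> + coef0)^degree,
# so gamma=1, coef0=1, degree=4 gives (<xi, xj> + 1)^4
clf = SVC(kernel="poly", degree=4, coef0=1, gamma=1).fit(X, y)
print("training accuracy:", clf.score(X, y))
print("support vectors per class:", clf.n_support_)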
Regularization
• Mapping 1D to 2D
• After the transformation, we can easily delimit the two classes (lemons and apples) using just a single line.
• In real-life applications we won't have a simple straight line; we will have lots of curves and high dimensions.
• In some cases we won't have two hyperplanes that separate the data with no points between them, so we need some trade-offs and tolerance for outliers.
• Fortunately, the SVM algorithm has a so-called regularization parameter to configure the trade-off and to tolerate outliers.
Regularization
• The regularization parameter, or slack penalty factor (in Python it's called C), tells the SVM optimization how much you want to avoid misclassifying each training example.
• If C is high, the optimization will choose a smaller-margin hyperplane, so the training-data misclassification rate will be lower.
• On the other hand, if C is low, then the margin will be big, even if there are misclassified training examples.
• When C is low, the margin is wider (implicitly the boundary has fewer curves and doesn't strictly follow the data points), even if, say, two apples end up classified as lemons.
• When C is high, the boundary is full of curves and all the training data is classified correctly.
• Don't forget: even if all the training data is correctly classified, this doesn't mean that increasing C will always increase the precision (it can lead to overfitting).
Gamma (γ = 1/(2σ^2))
• The next important parameter is gamma. The gamma parameter (see the RBF kernel equation above) defines how far the influence of a single training example reaches.
• This means that a high gamma considers only points close to the plausible hyperplane, while a low gamma also considers points at a greater distance (see the sketch below).
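A hedged sketch of gamma's effect using scikit-learn's RBF kernel: the same synthetic data fit with a low and a high gamma, comparing support-vector counts and training accuracy.

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=1)

for gamma in (0.1, 100.0):
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)
    print(f"gamma={gamma:<6} support vectors={clf.n_support_.sum():3d} "
          f"training accuracy={clf.score(X, y):.2f}")

# Low gamma: each point influences a broad region, giving a smoother boundary.
# High gamma: influence is very local, giving a wiggly boundary that can overfit.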
SVM Example using Python

# Fitting an SVM to the training set
# (X_train and y_train are assumed to come from an earlier train/test split)
from sklearn.svm import SVC

classifier = SVC(kernel='rbf', C=0.1, gamma=0.1)
classifier.fit(X_train, y_train)
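Since C and gamma interact, a common next step is to search over both; a hedged sketch using scikit-learn's GridSearchCV on synthetic stand-in data (the parameter grid is only an example):

from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data so the snippet is self-contained
X_train, X_test, y_train, y_test = train_test_split(
    *make_moons(n_samples=300, noise=0.2, random_state=0), random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("test accuracy:  ", search.score(X_test, y_test))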
