SVM and Kernel

What is a good Decision Boundary?

• Many decision boundaries can separate Class 1 from Class 2!
  – The Perceptron algorithm can be used to find such a boundary
• Are all decision boundaries equally good?

[Figure: points of Class 1 and Class 2 with several candidate decision boundaries.]
Examples of Bad Decision Boundaries

[Figure: two panels showing Class 1 and Class 2 separated by poorly chosen decision boundaries.]
Finding the Decision Boundary
• Let {x_1, ..., x_n} be our data set and let y_i ∈ {1, -1} be the
  class label of x_i
• For y_i = 1:   wᵀx_i + b ≥ 1
  For y_i = -1:  wᵀx_i + b ≤ -1
• So:  y_i (wᵀx_i + b) ≥ 1  for all (x_i, y_i)

[Figure: Class 1 (y = -1) and Class 2 (y = 1) points on either side of the decision boundary, with margin m.]
Large-margin Decision Boundary
• The decision boundary should be as far away from the data of
  both classes as possible
  – We should maximize the margin, m

[Figure: Class 1 and Class 2 with the large-margin decision boundary; m denotes the margin.]
Finding the Decision Boundary
• The decision boundary should classify all points correctly:
    y_i (wᵀx_i + b) ≥ 1, for all i
• The decision boundary can be found by solving the following
  constrained optimization problem:
    minimize (1/2)||w||²  subject to  y_i (wᵀx_i + b) ≥ 1 for all i
• This is a constrained optimization problem. Solving it requires
  the use of Lagrange multipliers.
Finding the Decision Boundary
• The Lagrangian is
    L = (1/2) wᵀw + Σ_i α_i (1 - y_i (wᵀx_i + b)),  with α_i ≥ 0
  – Note that ||w||² = wᵀw
Gradient with respect to w and b
• Setting the gradient of L w.r.t. w and b to zero, we have:

    L = (1/2) wᵀw + Σ_{i=1..n} α_i (1 - y_i (wᵀx_i + b))
      = (1/2) Σ_{k=1..m} w_k w_k + Σ_{i=1..n} α_i (1 - y_i (Σ_{k=1..m} w_k x_i^k + b))

  n: number of examples, m: dimension of the space

    ∂L/∂w_k = 0 for every k,   and   ∂L/∂b = 0

A “Good” Separator
[Figure: X and O data points with a separating line placed between the classes.]

Noise in the Observations
[Figure: the same X and O data points with observation noise.]

Ruling Out Some Separators
[Figure: separators that pass too close to the data points.]

Lots of Noise
[Figure: the same data with larger noise around each point.]

Maximizing the Margin
[Figure: the separator with the largest margin between the X and O points.]

“Fat” Separators
[Figure: the separator drawn as a thick band between the classes.]
An example of VC dimension
• Suppose our model class is a hyperplane.
• In 2-D, we can find a plane (i.e. a line) to deal with any labeling
  of three points. A 2-D hyperplane shatters 3 points.
• But we cannot deal with some of the possible labelings of four
  points. A 2-D hyperplane (i.e. a line) does not shatter 4 points.
Support Vectors
[Figure: the maximum-margin separator with the support vectors (the X and O points on the margin) highlighted.]
The Math
• Training instances
  – x ∈ ℝⁿ
  – y ∈ {-1, 1}
• Decision function
  – f(x) = sign(<w, x> + b)
  – w ∈ ℝⁿ
  – b ∈ ℝ
• Find w and b that
  – Perfectly classify training instances
    • Assuming linear separability
  – Maximize the margin
The Math
• For perfect classification, we want
  – y_i (<w, x_i> + b) ≥ 0 for all i
  – Why?
• To maximize the margin, we want
  – w that minimizes ||w||²
Some examples of VC dimension
• The VC dimension of a hyperplane in 2-D is 3.
– In k dimensions it is k+1.
• It’s just a coincidence that the VC dimension of a hyperplane is
  almost identical to the number of parameters it takes to define
  a hyperplane.
• A sine wave has infinite VC dimension and only 2
parameters! By choosing the phase and period carefully
we can shatter any random collection of one-dimensional
datapoints (except for nasty special cases).

f ( x) a sin(b x)
The probabilistic guarantee

    E_test ≤ E_train + sqrt( [h (log(2N/h) + 1) - log(p/4)] / N )

  where N = size of training set
        h = VC dimension of the model class
        p = upper bound on probability that this bound fails

• So if we train models with different complexity, we should pick
  the one that minimizes this bound.
• Actually, this is only sensible if we think the bound is fairly
  tight, which it usually isn’t. The theory provides insight, but in
  practice we still need some witchcraft.
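
As a quick illustration (my addition), the sketch below plugs some numbers into the bound as reconstructed above; the values of N, h, and p are arbitrary assumptions, chosen only to show how the gap grows with the VC dimension.

```python
# Hedged sketch: evaluate the (reconstructed) VC bound for illustrative values.
# N, h, and p below are arbitrary choices, not values from the lecture.
import math

def vc_bound_gap(N, h, p):
    """Return the sqrt term that is added to the training error in the bound."""
    return math.sqrt((h * (math.log(2 * N / h) + 1) - math.log(p / 4)) / N)

for h in (10, 100, 1000):
    print(h, round(vc_bound_gap(N=10_000, h=h, p=0.05), 3))
# The gap grows with the VC dimension h for a fixed amount of training data,
# which is why lower-capacity (large-margin) model classes generalize better.
```
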
Preventing overfitting when using big sets of features
• Suppose we use a big set of features
to ensure that the two classes are
linearly separable. What is the best
separating line to use?
• The Bayesian answer is to use them
all (including ones that do not quite
separate the data.)
• Weight each line by its posterior
probability (i.e. by a combination of
how well it fits the data and how well it
fits the prior).
• Is there an efficient way to
approximate the correct Bayesian
answer?
Support Vector Machines
• The line that maximizes the minimum margin is a good bet.
  – The model class of “hyper-planes with a margin of m” has a
    low VC dimension if m is big.
• This maximum-margin separator is determined by a subset of
  the datapoints.
  – Datapoints in this subset are called “support vectors”.
  – It will be useful computationally if only a small fraction of the
    datapoints are support vectors, because we use the support
    vectors to decide which side of the separator a test case is on.

[Figure: the maximum-margin separator; the support vectors are indicated by the circles around them.]
Training a linear SVM
• To find the maximum margin separator, we have to solve the
  following optimization problem:
    w·x_c + b ≥ 1   for positive cases
    w·x_c + b ≤ -1  for negative cases
  and ||w||² is as small as possible
• This is tricky but it’s a convex problem. There is only one
  optimum and we can find it without fiddling with learning rates
  or weight decay or early stopping.
  – Don’t worry about the optimization problem. It has been
    solved. It’s called quadratic programming.
  – It takes time proportional to N² which is really bad for very
    big datasets
    • so for big datasets we end up doing approximate optimization!
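
To make this concrete, here is a minimal sketch (my addition, not from the original slides) of training a maximum-margin linear classifier with scikit-learn's SVC, which solves this quadratic program internally; the dataset is a toy example.

```python
# Hedged sketch: training a (nearly) hard-margin linear SVM with scikit-learn.
# The quadratic program is solved internally; a very large C approximates the
# hard-margin objective described above (assumes the toy data are separable).
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.0, random_state=1)

clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print("w =", clf.coef_[0], " b =", clf.intercept_[0])
print("number of support vectors:", len(clf.support_vectors_))
print("training accuracy:", clf.score(X, y))
```
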
Testing a linear SVM
• The separator is defined as the set of points for which:
    w·x + b = 0
  so if  w·x_c + b > 0  say it’s a positive case
  and if w·x_c + b < 0  say it’s a negative case
A Bayesian Interpretation
• Using the maximum margin separator often gives a pretty good
  approximation to using all separators weighted by their
  posterior probabilities.
What to do if there is no separating plane
• Use a much bigger set of features.
  – This looks as if it would make the computation hopelessly
    slow, but in the next part of the lecture we will see how to
    use the “kernel” trick to make the computation fast even
    with huge numbers of features.
• Extend the definition of maximum margin to allow
  non-separating planes.
  – This can be done by using “slack” variables.
Introducing slack variables
• Slack variables are constrained to be non-negative. When they
  are greater than zero they allow us to cheat by putting the
  plane closer to the datapoint than the margin. So we need to
  minimize the amount of cheating. This means we have to pick a
  value for lambda (this sounds familiar!)
    w·x_c + b ≥  1 - ξ_c  for positive cases
    w·x_c + b ≤ -1 + ξ_c  for negative cases
    with ξ_c ≥ 0 for all c
    and  ||w||²/2 + λ Σ_c ξ_c  as small as possible
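
As an aside (my addition, not from the slides), in scikit-learn the amount of allowed slack is controlled by the penalty parameter C, which plays roughly the role of 1/λ here: small C tolerates more slack, large C approaches the hard-margin solution. A minimal sketch:

```python
# Hedged sketch: the soft-margin trade-off via scikit-learn's C parameter,
# which acts roughly like 1/lambda in the objective above (an assumption of
# this illustration, not a statement from the slides).
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping classes: no separating plane exists, so slack is required.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=3.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:7.2f}  support vectors={len(clf.support_vectors_):3d}  "
          f"train accuracy={clf.score(X, y):.2f}")
# Smaller C (larger lambda) allows more slack: a wider margin, more support
# vectors, and usually a few more training errors.
```
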
A picture of the best plane with a slack variable
The story so far
• If we use a large set of non-adaptive features, we can
often make the two classes linearly separable.
– But if we just fit any old separating plane, it will not
generalize well to new cases.
• If we fit the separating plane that maximizes the margin
(the minimum distance to any of the data points), we will
get much better generalization.
– Intuitively, by maximizing the margin we are
squeezing out all the surplus capacity that came from
using a high-dimensional feature space.
• This can be justified by a whole lot of clever mathematics
which shows that
– large margin separators have lower VC dimension.
– models with lower VC dimension have a smaller gap
between the training and test error rates.
Why do large margin separators have lower VC
dimension?
• Consider a random set of N points that
all fit inside a unit hypercube.
• If the number of dimensions is bigger
than N-2, it is easy to find a separating
plane for any labeling of the points.
  – So the fact that there is a separating plane doesn’t tell us
    much. It’s like putting a straight line through 2 data points.
• But there is unlikely to be a separating plane with a big margin.
  – If we find such a plane it’s unlikely to be a coincidence. So it
    will probably apply to the test data too.
How to make a plane curved
• Fitting hyperplanes as separators is mathematically easy.
  – The mathematics is linear.
• By replacing the raw input variables with a much larger set of
  features we get a nice property:
  – A planar separator in the high-dimensional space of feature
    vectors is a curved separator in the low-dimensional space
    of the raw input variables.

[Figure: a planar separator in a 20-D feature space projected back to the original 2-D space, where it appears curved.]
A potential problem and a magic solution
• If we map the input vectors into a very high-dimensional
feature space, surely the task of finding the maximum-
margin separator becomes computationally intractable?
– The mathematics is all linear, which is good, but the
vectors have a huge number of components.
– So taking the scalar product of two vectors is very
expensive.
• The way to keep things tractable is to use
“the kernel trick”
• The kernel trick makes your brain hurt when you first learn
  about it, but it’s actually very simple.
What the kernel trick achieves
• All of the computations that we need to do to find the
maximum-margin separator can be expressed in terms of
scalar products between pairs of datapoints (in the high-
dimensional feature space).
• These scalar products are the only part of the computation
that depends on the dimensionality of the high-dimensional
space.
– So if we had a fast way to do the scalar products we
would not have to pay a price for solving the learning
problem in the high-D space.
• The kernel trick is just a magic way of doing scalar products
a whole lot faster than is usually possible.
– It relies on choosing a way of mapping to the high-
dimensional feature space that allows fast scalar
products.
The kernel trick
• For many mappings from a low-D space to a high-D space,
  there is a simple operation on two vectors in the low-D space
  that can be used to compute the scalar product of their two
  images in the high-D space:

    K(x_a, x_b) = φ(x_a) · φ(x_b)

[Figure: x_a and x_b in the low-D space map to φ(x_a) and φ(x_b) in the high-D space; computing K(x_a, x_b) directly lets the kernel do the work instead of doing the scalar product in the obvious way.]
Dealing with the test data
• If we choose a mapping to a high-D space for
which the kernel trick works, we do not have to
pay a computational price for the high-
dimensionality when we find the best hyper-plane.
– We cannot express the hyperplane by using its normal
vector in the high-dimensional space because this
vector would have a huge number of components.
– Luckily, we can express it in terms of the support
vectors.
• But what about the test data? We cannot compute the scalar
  product w · φ(x) directly because it’s in the high-D space.
Dealing with the test data
• We need to decide which side of the separating hyperplane a
  test point lies on, and this requires us to compute a scalar
  product.
• We can express this scalar product as a weighted average of
  scalar products with the stored support vectors.
  – This could still be slow if there are a lot of support vectors.
The classification rule
• The final classification rule is quite simple:

    bias + Σ_{s ∈ SV} w_s K(x_test, x_s) > 0

  where SV is the set of support vectors.
• All the cleverness goes into selecting the support vectors that
  maximize the margin and computing the weight to use on each
  support vector.
• We also need to choose a good kernel function and we may
  need to choose a lambda for dealing with non-separable cases.
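
To connect this rule to a concrete implementation (my addition, not from the slides), the sketch below fits an RBF-kernel SVM with scikit-learn and then reproduces its decision value by hand, using only the stored support vectors, their weights (dual_coef_), and the bias (intercept_).

```python
# Hedged sketch: reproduce the SVM classification rule
#   bias + sum_{s in SV} w_s K(x_test, x_s)
# from a fitted scikit-learn model, using only its support vectors.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

def rbf(a, b, gamma=1.0):
    """Gaussian RBF kernel K(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2, axis=-1))

x_test = np.array([0.5, 0.0])
K = rbf(clf.support_vectors_, x_test)                   # K(x_s, x_test) for every support vector
decision = clf.intercept_[0] + clf.dual_coef_[0] @ K    # bias + sum_s w_s K(x_test, x_s)

print(np.isclose(decision, clf.decision_function([x_test])[0]))   # expected: True
print("positive case" if decision > 0 else "negative case")
```
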
Some commonly used kernels

  Polynomial:                  K(x, y) = (x·y + 1)^p
  Gaussian radial basis fn:    K(x, y) = exp(-||x - y||² / 2σ²)
  Neural net:                  K(x, y) = tanh(k x·y - δ)

  (p, σ, k, δ are parameters that the user must choose.)

• For the neural network kernel, there is one “hidden unit” per
  support vector, so the process of fitting the maximum margin
  hyperplane decides how many hidden units to use.
  Also, it may violate Mercer’s condition.
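
For reference (my addition), these three kernels are straightforward to write down in NumPy; the parameter values below are arbitrary illustrative choices.

```python
# Hedged sketch: the three kernels above as NumPy functions on single vectors.
# The parameter values (p, sigma, k, delta) are arbitrary illustrative choices.
import numpy as np

def polynomial_kernel(x, y, p=3):
    return (np.dot(x, y) + 1.0) ** p

def gaussian_rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def neural_net_kernel(x, y, k=1.0, delta=1.0):
    # Note: this "tanh" kernel is not positive definite for all (k, delta),
    # i.e. it may violate Mercer's condition, as the slide points out.
    return np.tanh(k * np.dot(x, y) - delta)

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(x, y), gaussian_rbf_kernel(x, y), neural_net_kernel(x, y))
```
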
Performance
• Support Vector Machines work very well in practice.
– The user must choose the kernel function and its
parameters, but the rest is automatic.
– The test performance is very good.
• They can be expensive in time and space for big datasets
– The computation of the maximum-margin hyper-plane
depends on the square of the number of training cases.
– We need to store all the support vectors.
• SVM’s are very good if you have no idea about what
structure to impose on the task.
• The kernel trick can also be used to do PCA in a much
higher-dimensional space, thus giving a non-linear version
of PCA in the original space.
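
As a side note (my addition), the kernel-PCA idea mentioned above is available off the shelf, for example in scikit-learn; the kernel and parameter choices below are illustrative.

```python
# Hedged sketch: non-linear PCA via the kernel trick (kernel PCA).
# The dataset and parameter choices are illustrative only.
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, _ = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=5.0).fit_transform(X)
print(X_kpca.shape)   # (300, 2): data projected onto the top 2 kernel principal components
```
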
Support Vector Machines are Perceptrons!
• SVM’s use each training case, x, to define a feature K(x, .)
  where K is chosen by the user.
  – So the user designs the features.
• Then they do “feature selection” by picking the support
vectors, and they learn how to weight the features by
solving a big optimization problem.
• So an SVM is just a very clever way to train a standard
perceptron.
– All of the things that a perceptron cannot do cannot
be done by SVM’s (but it’s a long time since 1969 so
people have forgotten this).
A problem that cannot be solved using a
kernel that computes the similarity of a test
image to a training case
• Suppose we have images that may contain a tank, but
with a cluttered background.
• To recognize which ones contain a tank, it is no good
computing a global similarity
– A non-tank test image may have a very similar
background to a tank training image, so it will have
very high similarity if the tanks are only a small
fraction of the image.
• We need local features that are appropriate for the task.
So they must be learned, not pre-specified.
• It’s very appealing to convert a learning problem to a convex
  optimization problem
– but we may end up by ignoring aspects of the real
learning problem in order to make it convex.
A hybrid approach
• If we use a neural net to define the features, maybe we
can use convex optimization for the final layer of weights
and then backpropagate derivatives to “learn the kernel”.
• The convex optimization is quadratic in the number of
training cases. So this approach works best when most
of the data is unlabelled.
– Unsupervised pre-training can then use the unlabelled
data to learn features that are appropriate for the
domain.
– The final convex optimization can use these features
as well as possible and also provide derivatives that
allow them to be fine-tuned.
– This seems better than just trying lots of kernels and
selecting the best ones (which is the current method).
Learning to extract the orientation of a face patch (Ruslan Salakhutdinov)

The training and test sets
• Training: 100, 500, or 1000 labeled cases, plus 11,000 unlabeled cases
• Test: face patches from new people

The root mean squared error in the orientation when combining
GP’s with deep belief nets:

               GP on        GP on top-level   GP on top-level features
               the pixels   features          with fine-tuning
  100 labels   22.2         17.9              15.2
  500 labels   17.2         12.7               7.2
  1000 labels  16.3         11.2               6.4

Conclusion: The deep features are much better than the pixels.
Fine-tuning helps a lot.
Dual Optimization Problem
• Maximize over α
  – W(α) = Σ_i α_i - 1/2 Σ_{i,j} α_i α_j y_i y_j <x_i, x_j>
• Subject to
  – α_i ≥ 0
  – Σ_i α_i y_i = 0
• Decision function
  – f(x) = sign(Σ_i α_i y_i <x, x_i> + b)
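
To make the dual concrete (my addition), here is how W(α) and the decision function can be written with NumPy, assuming the α_i have already been obtained from a QP solver; the arrays at the bottom are placeholders for shapes only.

```python
# Hedged sketch: the dual objective W(alpha) and the resulting decision
# function, written in NumPy. It assumes alpha has already been obtained
# from a QP solver; the arrays below are dummy values for a shape check.
import numpy as np

def dual_objective(alpha, X, y):
    """W(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j <x_i, x_j>."""
    G = (X @ X.T) * np.outer(y, y)          # G_ij = y_i y_j <x_i, x_j>
    return alpha.sum() - 0.5 * alpha @ G @ alpha

def decision(x, alpha, X, y, b):
    """f(x) = sign(sum_i alpha_i y_i <x, x_i> + b)."""
    return np.sign(np.sum(alpha * y * (X @ x)) + b)

X = np.random.randn(5, 2)
y = np.array([1, 1, -1, -1, 1])
alpha = np.ones(5)                           # would normally come from the QP solver
print(dual_objective(alpha, X, y))
print(decision(np.array([0.1, 0.2]), alpha, X, y, b=0.0))
```
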
Strengths of SVMs

• Good generalization in theory


• Good generalization in practice
• Work well with few training instances
• Find globally best model
• Efficient algorithms
• Amenable to the kernel trick …
What if Surface is Non-Linear?

[Figure: a cluster of X points surrounded by O points; no straight line can separate them.]

Image from http://www.atrandomresearch.com/iclass/
Kernel Methods
Making the Non-Linear Linear
When Linear Separators Fail

[Figure: left, data plotted against x1 and x2 where the X and O points cannot be separated by a line; right, the same data mapped using x1², where a linear separator exists.]
Mapping into a New Feature Space

  Φ : x → X = Φ(x)
  Φ(x1, x2) = (x1, x2, x1², x2², x1·x2)

• Rather than run the SVM on x_i, run it on Φ(x_i)
• Find a non-linear separator in input space
• What if Φ(x_i) is really big?
• Use kernels to compute it implicitly!

Image from http://web.engr.oregonstate.edu/~afern/classes/cs534/
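
As an illustration (my addition), the explicit mapping above is easy to apply by hand before running a linear SVM; this is exactly the work the kernel trick later lets us avoid.

```python
# Hedged sketch: explicitly map 2-D inputs with
# phi(x1, x2) = (x1, x2, x1^2, x2^2, x1*x2) and train a *linear* SVM on the
# mapped data, giving a non-linear separator in the original input space.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

def phi(X):
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

X, y = make_circles(n_samples=300, factor=0.4, noise=0.05, random_state=0)

linear_raw    = SVC(kernel="linear").fit(X, y)
linear_mapped = SVC(kernel="linear").fit(phi(X), y)

print("accuracy on raw inputs:   ", linear_raw.score(X, y))         # poor: circles aren't linearly separable
print("accuracy on mapped inputs:", linear_mapped.score(phi(X), y)) # near 1.0: separable after phi
```
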
Kernels
• Find a kernel K such that
  – K(x1, x2) = <Φ(x1), Φ(x2)>
• Computing K(x1, x2) should be efficient, much more so than
  computing Φ(x1) and Φ(x2)
• Use K(x1, x2) in the SVM algorithm rather than <x1, x2>
• Remarkably, this is possible
The Polynomial Kernel
• K(x1, x2) = <x1, x2>²
  – x1 = (x11, x12)
  – x2 = (x21, x22)
• <x1, x2> = (x11·x21 + x12·x22)
• <x1, x2>² = (x11²·x21² + x12²·x22² + 2·x11·x12·x21·x22)
• Φ(x1) = (x11², x12², √2·x11·x12)
• Φ(x2) = (x21², x22², √2·x21·x22)
• K(x1, x2) = <Φ(x1), Φ(x2)>
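
A quick numeric check of this identity (my addition):

```python
# Hedged sketch: numerically verify that <x1, x2>^2 equals <phi(x1), phi(x2)>
# for phi(x) = (x_1^2, x_2^2, sqrt(2) * x_1 * x_2).
import numpy as np

def phi(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x1 = np.array([1.0, 2.0])
x2 = np.array([3.0, -1.0])

kernel_value   = np.dot(x1, x2) ** 2          # computed in the low-D space: one dot product
explicit_value = np.dot(phi(x1), phi(x2))     # computed in the mapped 3-D space

print(kernel_value, explicit_value)           # both equal 1.0 here: (1*3 + 2*(-1))^2 = 1
print(np.isclose(kernel_value, explicit_value))
```
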
The Polynomial Kernel
• Φ(x) contains all monomials of degree d
• Useful in visual pattern recognition
• Number of monomials
  – 16×16 pixel image
  – ~10¹⁰ monomials of degree 5
• Never explicitly compute Φ(x)!
• Variation: K(x1, x2) = (<x1, x2> + 1)²
A Few Good Kernels
• Dot product kernel
  – K(x1, x2) = <x1, x2>
• Polynomial kernel
  – K(x1, x2) = <x1, x2>^d (monomials of degree d)
  – K(x1, x2) = (<x1, x2> + 1)^d (all monomials of degree 1, 2, …, d)
• Gaussian kernel
  – K(x1, x2) = exp(-||x1 - x2||² / 2σ²)
  – Radial basis functions
• Sigmoid kernel
  – K(x1, x2) = tanh(<x1, x2> + θ)
  – Neural networks
• Establishing “kernel-hood” from first principles is non-trivial
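
One practical sanity check of "kernel-hood" (my addition, and only a necessary condition on a finite sample, not a proof) is to verify that the Gram matrix built from the candidate kernel is positive semi-definite.

```python
# Hedged sketch: a finite-sample sanity check related to Mercer's condition.
# A valid kernel must yield a positive semi-definite Gram matrix for ANY set
# of points; checking one random sample gives evidence, not a proof.
import numpy as np

def gram_matrix(kernel, X):
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def is_psd(G, tol=1e-10):
    return bool(np.all(np.linalg.eigvalsh(G) >= -tol))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))

rbf     = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0)
sigmoid = lambda a, b: np.tanh(np.dot(a, b) + 1.0)

print("RBF Gram matrix PSD:    ", is_psd(gram_matrix(rbf, X)))      # expected: True
print("Sigmoid Gram matrix PSD:", is_psd(gram_matrix(sigmoid, X)))  # may be False: tanh
                                                                    # kernels can violate Mercer
```
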
The Kernel Trick

“Given an algorithm which is formulated in terms of a positive
definite kernel K1, one can construct an alternative algorithm by
replacing K1 with another positive definite kernel K2”

⇒ SVMs can use the kernel trick
Using a Different Kernel in the Dual Optimization Problem
• For example, using the polynomial kernel with d = 4 (including
  lower-order terms).
• Maximize over α
  – W(α) = Σ_i α_i - 1/2 Σ_{i,j} α_i α_j y_i y_j (<x_i, x_j> + 1)⁴
• Subject to
  – α_i ≥ 0
  – Σ_i α_i y_i = 0
• Decision function
  – f(x) = sign(Σ_i α_i y_i (<x, x_i> + 1)⁴ + b)
• The inner products <x_i, x_j> and <x, x_i> are kernels, so by
  the kernel trick we just replace them!
Conclusion
• SVMs find the optimal (maximum-margin) linear separator
• The kernel trick makes SVMs non-linear learning algorithms
