
NPTEL

NPTEL ONLINE CERTIFICATION COURSE

Introduction to Machine Learning

Lecture 30

Prof. Balaraman Ravindran


Computer Science and Engineering
Indian Institute of Technology Madras

SVM Kernels

So if you remember, I asked you to note the fact that I am using an inner product there, right, $x_i^T x$ as the inner product of two vectors, and the way I wrote the dual also, I had only inner products in there. So in fact, if I want to evaluate the dual, I need to know only the inner products of pairs of vectors. Likewise, if I want to finally evaluate and use the classifier that I learn, I still only need to find inner products, right.

(Refer Slide Time: 01:24)

So if I can come up with a way of efficiently computing these inner products, I can do something interesting. So what is that? What do we normally do to make linear classifiers more powerful? Basis transformations. So I can just take my x and replace it with some function h(x) that gives me a larger basis. It could be as simple as replacing it with the square: I take x and replace it with x², and then I will get a larger basis. Now it turns out that after a fair amount of math I get a dual that looks like this. So that is the inner product notation, and if I can compute the inner product, I can solve the same kind of optimization problem, but I can do this in some other transformed space.
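For reference (the exact expression is on the slide), this is the standard soft-margin dual, with the inner product $x_i^T x_j$ replaced by the inner product of the transformed points:

$$\max_{\alpha}\; \sum_{i=1}^{N}\alpha_i \;-\; \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j\, y_i y_j\,\langle h(x_i),\, h(x_j)\rangle \quad \text{subject to} \quad 0 \le \alpha_i \le C,\;\; \sum_{i=1}^{N}\alpha_i y_i = 0.$$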
(Refer Slide Time: 02:11)

So likewise our f(x) is going to be like this. So essentially what I need to know is the inner product of h(x) and h(x') for whatever pair x and x' that I would like to consider. During training these are pairs of training points, right, while when I am actually using the classifier, one of them is a support point and the other is the input data point that I am looking at. At any point I just take these pairs of data points and I need to compute the inner product.
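For reference, in standard form the classifier evaluated in the transformed space is

$$f(x) = \sum_{i=1}^{N}\alpha_i y_i\,\langle h(x_i),\, h(x)\rangle + \beta_0,$$

where only the support points (those with $\alpha_i > 0$) contribute to the sum.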
(Refer Slide Time: 03:13)

So I am going to call this some function which is a kind of distance function or a similarity measure between h(x) and h(x'). Such similarity measures are also called kernels. So we have been hearing about kernels in the context of support vector machines; if you have been trying to use libsvm or any of the other tools for some projects over the summer, you have heard of kernels. Kernels are nothing but similarity functions. The nice thing about the kernels that we use is that they actually operate on x and x'. They operate on x and x', but they are computing the inner product of h(x) and h(x'). Did you see that? They are going to work with x and x', but they will be computing the inner product of h(x) and h(x').
(Refer Slide Time: 04:27)

So I will give you an example. The kernel function k should be symmetric and positive semi-definite (in some cases positive definite). People remember what positive definite is, right? $x^T A x > 0$ if it is positive definite and $x^T A x \ge 0$ if it is positive semi-definite. Essentially we want the quadratic forms to be positive.

We do not want to take $x^T A x$ and suddenly find it is negative. In fact, you remember I told you that $x^T A x$ is the kind of quadratic form we are optimizing, and things will get messed up big time in the computation if the quadratic form becomes negative; then we will have problems getting the whole optimization to go through. Okay, so that is the mechanistic reason for wanting it to be positive semi-definite. There is a much more fundamental reason for it, for which I have not developed the math or the intuition here for you to understand, so it has to come in a later course. Hopefully in the kernel methods course, if you are taking it, you will figure out why that is needed.
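A minimal sketch of what this requirement means in practice, assuming arbitrary sample points and the degree-2 polynomial kernel that comes up shortly: for any finite set of points, the Gram matrix $K_{ij} = k(x_i, x_j)$ should be symmetric with no negative eigenvalues.

```python
import numpy as np

def poly2_kernel(x, z):
    # Degree-2 polynomial kernel (discussed below): (1 + <x, z>)^2
    return (1.0 + np.dot(x, z)) ** 2

# Arbitrary sample points in R^2, for illustration only
X = np.array([[2.0, 3.0], [4.0, 5.0], [0.0, 1.0], [1.0, -2.0]])

# Gram matrix K_ij = k(x_i, x_j)
n = len(X)
K = np.array([[poly2_kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

# A valid kernel gives a symmetric, positive semi-definite Gram matrix
print("symmetric:", np.allclose(K, K.T))
print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())  # >= 0 up to round-off
```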
So there are many choices which you can use for the kernels.
(Refer Slide Time: 07:26)

So there is something called the polynomial kernel, which is essentially $(1 + \langle x, x'\rangle)^d$. So d is a parameter you can set: d of two, three, four; you can even have d of one, which is essentially whatever we have solved so far. The next one is called the Gaussian kernel or the RBF kernel, where the similarity is given by $\exp(-\gamma\,\|x - x'\|^2)$, which is essentially the Gaussian without your normalizing factor. That is why it is called the RBF kernel; if you want to call it the Gaussian kernel you actually have to make it a Gaussian, otherwise call it the RBF kernel.

And then this is called the neural network kernel, or sometimes the sigmoidal kernel. This is just the hyperbolic tangent $\tanh(\kappa_1\,\langle x, x'\rangle + \kappa_2)$, with some arbitrary constants $\kappa_1$ and $\kappa_2$ which are parameters that you choose, and $\langle x, x'\rangle$ is the inner product. So these are some of the popular kernels which can be used for any generic data, but then, depending on the kind of data that you are looking at and where the data comes from, people do develop specialized kernels; for example, for string data people have come up with a lot of kernels.
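As a minimal sketch, the three generic kernels just listed can be written directly as functions of x and x'; each takes two points from the original space and returns a similarity value (the default parameter values below are arbitrary, not recommendations from the lecture):

```python
import numpy as np

def polynomial_kernel(x, z, d=2):
    # (1 + <x, z>)^d
    return (1.0 + np.dot(x, z)) ** d

def rbf_kernel(x, z, gamma=1.0):
    # exp(-gamma * ||x - z||^2): the Gaussian without the normalizing factor
    return np.exp(-gamma * np.sum((x - z) ** 2))

def sigmoid_kernel(x, z, k1=1.0, k2=0.0):
    # tanh(k1 * <x, z> + k2): the "neural network" / sigmoidal kernel
    return np.tanh(k1 * np.dot(x, z) + k2)
```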

When you want to compare strings, how do I look at similarity between strings? The nice thing about whatever we have done so far is that you can apply it not just to data that comes from $R^p$. You have been assuming so far that your x comes from some p-dimensional real space, but as long as you can define a proper kernel, you can apply this max-margin classification that we have done to any kind of data; it does not have to come from a real-valued space.

That is not true of many of the other things you have looked at; all of those inherently depend on the fact that the data is real valued. Because of this nice property, what is called the kernel trick, you can do all of these nice things: as long as you can define an appropriate kernel, you can actually apply this to any kind of data. So that is one very powerful idea.
(Refer Slide Time: 09:28)

So just to convince you, let us look at the polynomial kernel of degree two operating on vectors of two dimensions. There are two 2's here: the degree d is two and the dimension p is also two, but they need not necessarily be the same; I could have had a much larger dimension, but it was easy for me to write something small. So this is what we get.
(Refer Slide Time: 10:33)

So for p = 2, $k(x, x') = (1 + \langle x, x'\rangle)^2 = (1 + x_1 x_1' + x_2 x_2')^2 = 1 + 2x_1 x_1' + 2x_2 x_2' + x_1^2 x_1'^2 + x_2^2 x_2'^2 + 2x_1 x_1' x_2 x_2'$. I have just squared it; now if you think of what h must be, we get the following.


(Refer Slide Time: 11:18)

So what is this function h? It is essentially the quadratic basis expansion. So I have two features, $x_1$ and $x_2$; remember that x is $(x_1, x_2)$. So this is essentially the quadratic expansion: the first coordinate is 1, the second coordinate is $\sqrt{2}\,x_1$, the third coordinate is $\sqrt{2}\,x_2$, the fourth coordinate is $x_1^2$, the fifth coordinate is $x_2^2$, and the sixth coordinate is $\sqrt{2}\,x_1 x_2$. It is all the quadratic basis expansion. Now if I make this operate on x and x' and take the inner product, what will be the terms? $1,\ 2x_1 x_1',\ 2x_2 x_2',\ x_1^2 x_1'^2,\ x_2^2 x_2'^2,\ 2x_1 x_1' x_2 x_2'$, which is exactly what we have here, right. So the nice thing about it is that I can essentially compute the inner product of x and x' first, add 1 and square it, and numerically what I will end up with is the same as what I would have ended up with if I had done the basis expansion and then taken the inner product.
(Refer Slide Time: 13:05)

If I had just taken whatever are the original vectors, let us say I have some (2, 3) and (4, 5), then instead of doing this basis expansion and then computing the inner product, I can just take the inner product right away: $(1 + 2\cdot 4 + 3\cdot 5)^2 = 24^2 = 576$, and that is essentially what the answer would be. Well, for degree 2 it might not seem like a great saving, but what about a degree-15 polynomial? I am essentially doing similar amounts of computation, except that I have to raise something to the power of 15.
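A minimal sketch, just to verify numerically that the kernel trick and the explicit basis expansion agree; the vectors (2, 3) and (4, 5) are the ones from the example above, and the function names are mine:

```python
import math
import numpy as np

def poly2_kernel(x, z):
    # Degree-2 polynomial kernel computed directly in R^2
    return (1.0 + np.dot(x, z)) ** 2

def h(x):
    # Explicit quadratic basis expansion into R^6
    x1, x2 = x
    return np.array([1.0,
                     math.sqrt(2) * x1,
                     math.sqrt(2) * x2,
                     x1 ** 2,
                     x2 ** 2,
                     math.sqrt(2) * x1 * x2])

x, z = np.array([2.0, 3.0]), np.array([4.0, 5.0])
print(poly2_kernel(x, z))   # 576.0, computed in R^2
print(np.dot(h(x), h(z)))   # 576.0, computed in R^6
```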
That is basis expansion; if you thought something else about basis expansion, please correct it, this is basis expansion. So I take the original data, and as I said, you could have new components like sin x or cos x, it does not matter; you could think of a variety of different ways of expanding the basis. In this case I am just doing the quadratic basis expansion.
So given whatever we have done so far, this whole idea of kernels arrives rather straightforwardly. What I cannot write down for you is the basis expansion for the RBF kernel: it turns out that the computation it is doing is actually in an infinite-dimensional vector space. Here the computation is in a six-dimensional space. I took data points from a two-dimensional space and did, in effect, a computation in a six-dimensional space, and I gave you back the answer, but all the time I was doing computation only in the two-dimensional space: I only took the inner product of the two vectors, added 1 and squared it, so I am essentially doing computations only in $R^2$, while the actual number I am returning to you is the result of a computation done in $R^6$. That is why it is called the kernel trick.

Likewise, for the RBF kernel, I will do something in whatever is the original dimensional space you give me, but the resulting computation has an interpretation in some infinite-dimensional vector space; in that case it is not even easy to write it down. That is why RBF kernels are powerful: they work on a variety of data. But they are not all-powerful, so you have to be careful about that. So that is essentially all there is to support vector machines; we have now covered kernels for support vector machines as well.

So, I don't know, those of you who have used libsvm or one such tool will know that for RBF kernels you would have to tune two parameters. One is C, which we already saw; that is essentially how much penalty you are giving to the slack. The other one you will tune is γ, essentially this term here; it is some kind of width parameter for your Gaussian, it controls how wide your Gaussian is. So those are the two parameters you tune. For polynomial kernels you have d and you have your C, and for sigmoidal kernels you have the constants $\kappa_1$ and $\kappa_2$ and you have C. This form of defining a support vector machine is called C-SVM, okay.
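As an illustration of the tuning parameters just mentioned, here is a minimal sketch using scikit-learn's SVC (a C-SVM implementation built on libsvm); the data and parameter values are arbitrary placeholders, not recommendations from the lecture:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: two Gaussian blobs in R^2 (arbitrary, for illustration only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# RBF kernel: tune C (penalty) and gamma (width of the Gaussian)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)

# Polynomial kernel: tune C and the degree d
poly_svm = SVC(kernel="poly", degree=3, C=1.0).fit(X, y)

# Sigmoidal kernel: tanh(gamma * <x, x'> + coef0), i.e. the two constants
sig_svm = SVC(kernel="sigmoid", gamma=0.1, coef0=0.0, C=1.0).fit(X, y)

print("support vectors per class (RBF):", rbf_svm.n_support_)
```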

There are other constraints that you can impose on it, not just the penalty on the ξ's (the slack variables); you can impose a penalty on the number of support vectors you consider. Suppose I run on the data and it comes back and says, okay, everything is a support vector; that is not something interesting. How can everything be a support vector? Can all the data points be equidistant from the separating hyperplane? Not if you are considering a linear kernel, but when I am considering RBF kernels, the separating hyperplane can be very, very complex.

In which case you might end up with a lot of support vectors. Typically, if you have not thought too much about it and you are setting some very high values for C and trying to run this thing, you will end up with something like sixty percent of your data being support vectors. So instead of trying to fix that empirically, second-guessing why you got so many support vectors and trying different C, different γ, and so on and so forth, you can use something called the nu-SVM (not "new" but ν, nu), which lets you put a bound on the fraction of support vectors you are going to get. You can say, do the best you can, but do not give me more than 30 support vectors, something to that effect.
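A minimal sketch of the ν-SVM idea using scikit-learn's NuSVC; note that in this implementation ν is specified as a fraction of the training set, acting as an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors. The data below is an arbitrary placeholder:

```python
import numpy as np
from sklearn.svm import NuSVC, SVC

# Toy data: two blobs in R^2 (arbitrary)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# C-SVM with a carelessly large C can keep many support vectors
c_svm = SVC(kernel="rbf", C=1e6, gamma=5.0).fit(X, y)

# nu-SVM: control the support-vector budget through nu instead of C
nu_svm = NuSVC(kernel="rbf", nu=0.1, gamma=5.0).fit(X, y)

print("C-SVM support vectors: ", c_svm.n_support_.sum())
print("nu-SVM support vectors:", nu_svm.n_support_.sum())
```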

IIT Madras Production

Funded by
Department of Higher Education
Ministry of Human Resource Development
Government of India

www.nptel.ac.in

Copyrights Reserved
