SVM 3
Lecture 29
Suppose I have some data which is not linearly separable. That is the problem we have seen with perceptrons: what happens if the data is not linearly separable? Perceptrons do not converge. So can we tweak the objective function that we have here to make sure that we can handle data that is not linearly separable? Is it okay to say "non linearly separable data"? That was my question. It should be "linearly inseparable data", right? So you have to be careful where you put the negation.
So what do we do in this case? Somebody had a suggestion. There are many choices you could make, but there is one particular choice which seems to yield a very nice optimization formulation. What is that choice? I am going to say that I would really like to maximize the margin, and I would like to get as many data points correct as possible.
(Refer Slide Time: 01:42)
So if you think about it, there are a couple of things. This is the margin that I want, so what are the problems here? Well, these data points are within the margin; I have some data points that are within the margin, and I would like to minimize such cases. There are some data points that are within the margin and erroneous, and I would like to minimize such cases as well. What if I had tried to get this point correct? There is a gap here between the points; if I try to get this correct and move my classification surface below, then the margin would have been reduced even further. So it is okay to get this one wrong. But then what about this case: is it within the margin or outside the margin? Within. The margin for that class is defined on the other side, so anything to this side means x is within the margin. Look at yi times f(x) here: it will actually be negative. We want yi f(x) to be greater than one, yi f(x) ≥ 1, and here it is going to be negative.
So obviously this is within the margin. Essentially what I want to do is minimize these distances. You can see the distances that I marked here: this is a certain small distance inside the margin, this is a large distance inside the margin, this is a very large distance inside the margin, and likewise. So I can mark each one of these, and I want to minimize them. Let us denote them, say, ξ1 to ξ5; I want to minimize those, essentially.
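The slack distances just described can be computed directly. Here is a minimal numeric sketch; the separator w, b and the three points are my own illustrative values, not the figure on the board:

```python
import numpy as np

# Hypothetical linear classifier f(x) = w.x + b (illustrative values)
w = np.array([1.0, 1.0])
b = -1.0

# Three points, all labelled +1
X = np.array([[2.0, 2.0],   # comfortably outside the margin
              [0.6, 0.6],   # correct side, but inside the margin
              [0.2, 0.2]])  # on the wrong side of the separator
y = np.array([1.0, 1.0, 1.0])

# y_i * f(x_i) >= 1 means the point respects the margin; anything below 1
# incurs a slack, which is exactly the deviation we want to minimize in total
scores = y * (X @ w + b)                 # approximately [3.0, 0.2, -0.6]
slack = np.maximum(0.0, 1.0 - scores)    # approximately [0.0, 0.8, 1.6]
print(slack.sum())                       # total deviation, roughly 2.4
```

Only the second and third points contribute slack; the first is outside the margin and costs nothing.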
So I minimize the sum of these deviations along with my original objective function. Why not minimize the maximum here? Because that would essentially mean I will try to get as many things correct as possible; in this case I do not mind getting something wrong as long as the overall deviation does not exceed a certain limit. The difference between minimizing the maximum and minimizing the sum is that with the sum I might as well give up all of the budget to a single data point, something that is very hard to classify.
And I might have one single outlier somewhere here. Let us draw it: this data might be perfectly separable except that I have one outlier. Now if I say minimize the sum of the slacks, it is fine. But if I say minimize the max, then it is going to actually give me a hyperplane somewhere there. Like I said, many different formulations are possible; this one actually yields a very nice computation, and that is one of the reasons people use it.
(Refer Slide Time: 07:19)
So what I am going to do is write it here. I am going to say that this has to be what we had already found out, and I am going to introduce a slack variable so that it does not have to be greater than M; it can be some fraction less as well. M is what I would really like, but I allow it some slack. Ideally I would want most of these ξi's to be 0; if I force all ξi's to be zero I am back here, but I would really like some leeway.

So I am allowing myself that leeway by introducing ξi here. This is a very standard technique for relaxing constraints in optimization; that is one of the reasons people adopt it. Another constraint I could have chosen, which in fact is a little more common, is M − ξi, but it turns out that in this particular case if I choose M − ξi instead of M(1 − ξi) I end up getting a non-convex optimization problem.
So we do not want that so we end up doing this. So I drew this figure first because I wanted to
get an idea of what these slack variables actually mean. So the slack variables essentially tell you
by what fraction right you are violating the margin. So is ξ1 is essentially what fraction of
distance you are coming in here from the margin ξ2 is what fraction of the distance you are
coming in from the margin.
So the margin is M so I have moved some fraction of the distance inside. So they essentially that
is what the ξ tells me. So what are the constraints we have? So the first constraint I have is okay
all ξi have to be ≥ 0. I do not care about points going to that side of the margin. So all ξi ≥ 0 and
the second thing is whatever we have been talking about. I do not want the ξi is to be very large
taken in total so I want to upper bound them by a constant.
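Written out, the constraints being described take this form; a sketch in the lecture's notation, with f(xi) = β0 + xiᵀβ as the classifier and M as the margin:

```latex
y_i\,(\beta_0 + x_i^{T}\beta) \;\ge\; M(1 - \xi_i), \qquad
\xi_i \ge 0, \qquad
\sum_{i=1}^{n} \xi_i \le \text{const}.
```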
Now, about going to the other side of the margin: if ξi were negative, I would essentially be imposing a tighter constraint than what I was looking for, since the requirement would become larger than M. Remember this is a relative distance, so the constraint is essentially M − Mξi; the original requirement was M, and the point is now allowed to be Mξi inside of that. So ξi is essentially a relative distance, and if I make ξi negative, this minus becomes a plus, which would mean that not only do I want the data points to be at least M away, I am actually asking them to be further away than M. It just imposes a tighter constraint, and I do not want that to happen. And here we are essentially giving it a budget: we do not want the total to exceed the budget. We saw such a constraint earlier; we had a budget and we did not want things to be greater than the budget.
(Refer Slide Time: 10:43)
So yes, ridge regression and LASSO and other things. Wherever we were looking at regularized regression, we had this "greater than or less than a constant", and what did we do in those cases? We pushed it into the objective function and added a multiplier there, and then there is a relationship between this constant and the multiplier that we put in the objective function. Likewise, we will do the same thing here, and I will do all the other transformations that we need in order to normalize β and so on.

So essentially I will end up with the same objective function I had there, and you want ≥ 1 because we have gotten rid of the M. How do we get rid of M? Because M is 1/||β||, so we got rid of it that way. Anything else we need here? Now that we have this objective function, what should the value of C be if I want to solve the linearly separable problem, or equivalently if I want to ensure that all ξi are 0? This is a simple question: C should be infinity.
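After the normalization, the optimization problem being arrived at is the standard soft-margin primal; a sketch in the lecture's notation, with C as the penalty on the slacks:

```latex
\min_{\beta_0,\,\beta,\,\xi}\;\; \frac{1}{2}\,\lVert\beta\rVert^{2} \;+\; C\sum_{i=1}^{n}\xi_i
\qquad \text{subject to} \qquad
y_i\,(\beta_0 + x_i^{T}\beta) \;\ge\; 1 - \xi_i, \quad \xi_i \ge 0.
```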
(Refer Slide Time: 13:53)
So the larger the value of C, the more you are penalizing the violations, and so the smaller the ξi will be. The larger the value of C, the smaller the ξi should be, so there is a trade-off: the larger you make C, the smaller the margin will be, but we will be getting more of the training data correct. For small values of C you are allowing a little bit more leeway: if C is very small then you are allowing a lot more errors to happen; if C is very large then you are forcing the classifier to classify as much of the training data correctly as possible.
If the data is truly linearly separable and you make C very large, what will happen? You will find the correct linear separator. But if the data is truly linearly separable and you keep C small, what might happen? You might trade off errors on the training data for a larger margin, even though the data is linearly separable. Is that a desirable thing? When exactly? If the data is noisy, such that there are some data points, maybe only one or two, that are closer to the margin, then if you are trying to find the perfect linear separation you will pay attention to them as well, and therefore you will end up with a small margin. But if you are willing to ignore a few noisy data points, then even if the training data looks perfectly separable, you might end up making a few errors on it but you will get a more robust classifier.

Can people visualize such a situation? I am going to try and do something here; let us see if it works. This looks perfectly separable. I add noise: is it still separable? There you go, it is still separable, and if you try to solve it as a perfectly separable problem, that is the separator you are going to get. But if you allow errors, then that will probably be the separating hyperplane you get, and that is probably a more appropriate hyperplane.
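The effect of C can be demonstrated with a standard SVM solver; this sketch uses scikit-learn's `SVC` (one common package, not the tool used in the lecture), on illustrative data with one noisy point planted near the boundary:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two well-separated clusters plus one noisy point near the boundary,
# labelled with the left class (illustrative data, not the lecture's figure)
X = np.vstack([rng.normal([-2.0, 0.0], 0.3, size=(20, 2)),
               rng.normal([+2.0, 0.0], 0.3, size=(20, 2)),
               [[0.3, 0.0]]])
y = np.array([-1] * 20 + [1] * 20 + [-1])

# Large C: approximately the hard-margin solution, pays attention to the outlier
hard = SVC(kernel="linear", C=1e6).fit(X, y)
# Small C: tolerates slack on the outlier, keeps a wider margin
soft = SVC(kernel="linear", C=0.01).fit(X, y)

# Margin width is 2/||w||; small C trades a training error for a wider margin
margin_hard = 2 / np.linalg.norm(hard.coef_)
margin_soft = 2 / np.linalg.norm(soft.coef_)
print(margin_hard < margin_soft)  # True
```

The small-C classifier gives up on the noisy point and recovers roughly the separator the clean clusters alone would suggest.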
(Refer Slide Time: 16:11)
Apart from being robust, it is also correct in an expected sense. We will move on to the primal. I just wanted to leave this on the board until I wrote this note, so that you can compare.
(Refer Slide Time: 16:46)
Yeah, I do not have to do this. It is not a single condition; there is one for each i. Do we need the constraint ∑i ξi ≤ const explicitly? No, right, and that is why we constructed it this way: we put it into the optimization objective function itself, so by minimizing this we are ensuring that ∑i ξi will be less than some limit. And like I was mentioning in the ridge regression discussion, you can find a relationship between this constant and this C.
It is also a function of the range of the objective function, but you can always find one. So basically they are equivalent ways of writing the optimization problem, except that the constant and the C will not be the same; they will be different values. So this constraint is gone, it is no longer present here; it went into the objective function.
So putting all of this back in and doing some algebra, you might be surprised at the outcome. Has anyone already solved it? It looks familiar, right? It is essentially the same dual you will get, but your constraints are different. This one was already there; it is just added for completeness' sake. What is important here is that earlier I had only a non-negativity constraint on α, whereas now I have an upper bound on the value of α. Why is that? Because α is C − μ, and since μ ≥ 0, there has to be an upper bound of C on α. Good. So what about the other KKT conditions?
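For reference, the dual being described, with the box constraint that comes from α = C − μ and μ ≥ 0, is (a sketch in standard notation):

```latex
\max_{\alpha}\;\; \sum_{i=1}^{n}\alpha_i \;-\; \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}
\alpha_i\,\alpha_j\, y_i\, y_j\, x_i^{T}x_j
\qquad \text{subject to} \qquad
0 \le \alpha_i \le C, \quad \sum_{i=1}^{n}\alpha_i\, y_i = 0.
```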
(Refer Slide Time: 22:28)
So here I will make ξi large enough that the term in the square bracket goes to 0, because yi f(xi) will be less than 1; for that case my αi will be C, and I do not want to penalize this case further. In this case ξi will be 0, and in this case also ξi will be 0, because what I really need is my condition yi f(xi) ≥ 1 − ξi; if yi f(xi) is already ≥ 1, I can set ξi to 0. In both these cases, ξi = 0.
So what are all the support vectors? Everything on the margin, and everything on the wrong side of the margin as well. Everything for which α is nonzero now becomes a support vector. At the end of the day, you are just going to use a package to solve all of these things, but that is like saying: anyway you are going to use Microsoft Windows, or Mac OS X, or something, so why learn operating systems?

You need to know what the internals are. It is not just using the tools that matters; if it were, we could do a tool course: how to use the tools, how to start up libsvm. It is not trivial. Many people I know actually run experiments with SVMs by just using the default parameter settings that the package gives. The thing is, you need to understand what it is that you are tuning. So now I have told you about the C parameter, and you have some idea of what a large C means versus what a small C means, instead of blindly tuning C from some number to some other number. Having an appreciation of what these things are doing actually helps you use the tools better. That is the whole idea behind doing all of this; it is not that I am going to expect you to come and derive a large-margin classifier tomorrow.
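The point about not relying on default settings can be made concrete. A minimal sketch using scikit-learn (one common package, which wraps libsvm; the dataset and the grid of C values here are my own illustration): rather than accepting the default C, search over a range and cross-validate, then inspect the support vectors, which are exactly the points with nonzero α.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_blobs

# Toy two-class data (illustrative; any dataset would do)
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.5, random_state=0)

# Instead of the default C, search over a range and cross-validate
grid = GridSearchCV(SVC(kernel="linear"),
                    param_grid={"C": [0.01, 0.1, 1.0, 10.0, 100.0]},
                    cv=5).fit(X, y)
best = grid.best_estimator_

# Support vectors: the points on the margin plus the margin violators,
# i.e. exactly those with nonzero dual variable alpha
print(grid.best_params_)
print(best.support_vectors_.shape[0], "support vectors out of", len(X))
```

Knowing that large C shrinks the margin and small C tolerates slack tells you which end of this grid to extend when the cross-validation picks a boundary value.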
IIT Madras Production
Funded by
Department of Higher Education
Ministry of Human Resource Development
Government of India
www.nptel.ac.in
Copyrights Reserved