28.7 - Polynomial Kernel - mp4
So, since we have talked so highly of the kernelization idea, let's look at one of the simplest kernels, called the polynomial kernel. With it, we'll build some geometric intuition for why kernels are so powerful. Let's take this example where I have a bunch of points, the two-concentric-circles case: a bunch of positive points surrounded by a bunch of negative points. In the case of logistic regression, if you recall, we solved this problem by transforming the space, that is, by doing feature transforms. If my data actually lives in features f1 and f2, I transform it into f1^2 and f2^2 and then apply logistic regression on top of it, so that I can find a hyperplane that separates the data in the new space, because in the original space no hyperplane can separate it. That is the feature transform idea combined with logistic regression.
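As a quick sketch of that explicit feature-transform idea (this is illustrative and not from the lecture; sklearn's make_circles is used only to generate toy concentric-circles data, and the specific parameter values are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression

# Toy concentric-circles data: inner circle is one class, outer ring is the other.
X, y = make_circles(n_samples=500, factor=0.4, noise=0.05, random_state=0)

# Logistic regression on the raw 2D features (f1, f2): no separating hyperplane exists.
raw_acc = LogisticRegression().fit(X, y).score(X, y)

# Explicit feature transform: (f1, f2) -> (f1^2, f2^2). In this space the boundary
# is roughly f1^2 + f2^2 = r^2, which is a line, so logistic regression can separate it.
X_sq = X ** 2
sq_acc = LogisticRegression().fit(X_sq, y).score(X_sq, y)

print(f"accuracy on raw features:     {raw_acc:.2f}")
print(f"accuracy on squared features: {sq_acc:.2f}")
```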
Now let's see how kernelization solves this problem. The definition of a polynomial kernel is: given two data points x1 and x2, the general polynomial kernel is K(x1, x2) = (x1^T x2 + c)^d, where c and d are constants. It is general because d can be anything: two, three, four, and so on. Let's take a simple example, the quadratic kernel, which can be written as K(x1, x2) = (1 + x1^T x2)^2. This is called a quadratic kernel, or a polynomial kernel with c = 1 and d = 2. Now we'll see why this is useful.
If I apply this, what is K(x1, x2)? Let's expand it: K(x1, x2) is nothing but (1 + x1^T x2)^2. Let's take the simple 2D case and assume x1 is the vector (x11, x12) and x2 is (x21, x22); in this subscript notation the first index says which point it is and the second index says which dimension it is. Since x1 is a vector, x1^T x2 means you multiply the corresponding components and add them up, so x1^T x2 = x11*x21 + x12*x22. Now, if you expand the square, what do you get?

K(x1, x2) = (1 + x11*x21 + x12*x22)^2
          = 1 + x11^2*x21^2 + x12^2*x22^2 + 2*x11*x21 + 2*x12*x22 + 2*x11*x21*x12*x22

I just expanded this as (a + b + c)^2. Now this is interesting, and I'll tell you why: I can write it as a product of two vectors. Let me write the first vector, which collects only the x1 terms:

x1' = (1, x11^2, x12^2, sqrt(2)*x11, sqrt(2)*x12, sqrt(2)*x11*x12)

and let the second vector, built the same way from x2, be

x2' = (1, x21^2, x22^2, sqrt(2)*x21, sqrt(2)*x22, sqrt(2)*x21*x22)

Let these two vectors be called x1' and x2'. Now I can show that the expansion above, i.e. K(x1, x2), is exactly x1'^T x2'.
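To make that equivalence concrete, here is a small numerical check (a sketch; the names quadratic_kernel and phi are just illustrative): computing (1 + x1^T x2)^2 directly in the original 2D space gives the same number as explicitly mapping both points into the 6-dimensional space and taking a plain dot product there.

```python
import numpy as np

def quadratic_kernel(x1, x2):
    # K(x1, x2) = (1 + x1^T x2)^2, the polynomial kernel with c = 1, d = 2.
    return (1.0 + np.dot(x1, x2)) ** 2

def phi(x):
    # Explicit 6D feature map implied by the quadratic kernel:
    # (a, b) -> (1, a^2, b^2, sqrt(2)*a, sqrt(2)*b, sqrt(2)*a*b)
    a, b = x
    s = np.sqrt(2.0)
    return np.array([1.0, a**2, b**2, s * a, s * b, s * a * b])

x1 = np.array([0.3, -1.2])
x2 = np.array([2.0, 1.0])

print(quadratic_kernel(x1, x2))   # kernel computed in the original 2D space
print(np.dot(phi(x1), phi(x2)))   # same value via the explicit 6D feature map
```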
Here is the fun part. Look at our original points: x1 is in 2D and x2 is also in 2D. In x1', you only have terms built from x11 and x12; there is no x21 or x22 in it. Similarly, x2' only has terms built from x21 and x22. So imagine I am given x1 and x2 and I transform them into x1' and x2'. Now, instead of computing x1^T x2, I can compute x1'^T x2'. This is exactly a feature transform. So what did I show? I showed that kernelization is equal to feature transformation.
Kernelization takes your data, which is d-dimensional, and does a feature transform internally; it doesn't do it explicitly, it does it internally and implicitly. See, in this case we explicitly transformed f1 and f2 into f1^2 and f2^2; that is an explicit feature transform. In the case of kernelization, the data is in d-dimensional space (here 2D), and implicitly, internally, using the kernel trick, by just replacing x1^T x2 with K(x1, x2), it is as if we replaced x1 with x1' and x2 with x2'. So it is taking me to a different space of dimension d', where d' is typically greater than d. Intuitively, that is what kernelization is doing internally: it takes data in a d-dimensional space and implicitly, internally, performs a feature transformation into a d'-dimensional space with d' > d. Now look at x1' and x2': they have the square terms, and they live in a six-dimensional space (count the components: 1, 2, 3, 4, 5, 6). So we went from 2D data to 6D data using the kernel trick. And in the 6D space, since you have the square terms, which are very similar to the f1^2 and f2^2 terms from the explicit transform, the data becomes linearly separable.
This is very, very important, and there is a very nice theorem behind it called Mercer's theorem. I will not go into its details; it comes from a lot of beautiful mathematics, especially related to SVMs. Intuitively, what Mercer's theorem says is what the kernel trick is doing internally: the kernel trick, or kernelization, takes data from a d-dimensional space and converts it into a higher d'-dimensional space, typically with d' > d. If your data is not linearly separable in the original space, like the concentric circles here, then using the right kernel it takes you to a d'-dimensional space where the data becomes linearly separable. Of course, there are some more mathematical conditions for Mercer's theorem to hold, but intuitively this is what it says. So whatever we did with explicit feature transformation in the case of logistic regression is now replaced by kernelization, because the kernel trick does that for you internally. That's the most important part of the kernel trick, and that's why kernelization is such a beautiful idea: it does implicit feature transformation for you. But again, the challenge is: what is the right kernel to apply? For this data set, I know that a polynomial kernel of degree two (called degree two because d = 2) is the right kernel.
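As a sketch of that claim (illustrative parameter choices, not from the lecture): in sklearn, SVC with kernel="poly", degree=2, coef0=1 corresponds, up to sklearn's gamma scaling, to the (1 + x1^T x2)^2 kernel discussed above.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=500, factor=0.4, noise=0.05, random_state=0)

# A linear kernel cannot separate concentric circles ...
linear_svm = SVC(kernel="linear").fit(X, y)

# ... but the quadratic polynomial kernel (degree=2, coef0=1, i.e. roughly
# (1 + x1^T x2)^2 up to gamma scaling) implicitly works in the 6D space
# where the data is linearly separable.
poly_svm = SVC(kernel="poly", degree=2, coef0=1).fit(X, y)

print("linear kernel train accuracy:   ", round(linear_svm.score(X, y), 2))
print("quadratic kernel train accuracy:", round(poly_svm.score(X, y), 2))
```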
So now the challenge changes: the explicit feature transformation we did for non-linearly-separable data in the case of logistic regression is replaced, in the case of SVMs, with finding the right kernel. SVMs are essentially all about finding the right kernel. If you have the right kernel for the right problem, everything is solved. So it's important to understand, given a data set, what the right kernel to apply is. If your data set looks like this, a polynomial kernel of degree two will work; for some data sets, higher-degree polynomials may work. Finding the right kernel is often the most important task in SVMs, because everything else is provided for you: the bias-variance trade-off is very simple (you basically change C to trade off bias and variance on a given data set), and someone else has already implemented the whole SVM machinery. Your challenge as an applied machine learning engineer is to find the right kernel for the right problem, which, again, is not always easy.
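In practice, "finding the right kernel" often looks like a simple cross-validated search over a few candidate kernels and C values. Here is a minimal sketch, assuming the same toy circles data; the candidate grid below is entirely arbitrary and not prescribed by the lecture.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_circles(n_samples=500, factor=0.4, noise=0.05, random_state=0)

# Candidate kernels and C values; C controls the bias-variance trade-off.
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["poly"], "degree": [2, 3, 4], "coef0": [1], "C": [0.1, 1, 10]},
]

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("best kernel and parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 2))
```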