Support Vector Machine (SVM) & Kernel Functions

Reference [1], Chapter 7, Section 7.3
Reference [3], Chapter 6, page 292

[1]: Flach, P. (2012). Machine Learning: The Art and Science of Algorithms that Make Sense of Data. Cambridge University Press.
[3]: Bishop, C. M. (2006). Pattern Recognition and Machine Learning. New York: Springer-Verlag.
Support Vector Machine (SVM)
The Support Vector Machine (SVM) is a supervised learning algorithm mostly used for classification, but it can also be used for regression.
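
For orientation, here is a minimal sketch of both uses. It assumes the scikit-learn library (not referenced in the slides) and hypothetical toy data; it is an illustration, not part of the lecture material.

```python
# Minimal sketch (assumes scikit-learn): the same SVM family provides
# SVC for classification and SVR for regression.
from sklearn.svm import SVC, SVR

X = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]  # toy feature vectors
y_class = [0, 0, 1, 1]                                 # class labels
y_reg = [0.1, 0.9, 2.1, 2.9]                           # continuous targets

clf = SVC(kernel="linear").fit(X, y_class)  # classification
reg = SVR(kernel="linear").fit(X, y_reg)    # regression

print(clf.predict([[2.5, 2.5]]))  # predicted class, e.g. [1]
print(reg.predict([[2.5, 2.5]]))  # predicted continuous value
```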

Idea behind SVM

SVM is based on the idea of finding a hyperplane that best separates the data points into different classes.
SVM: Some Important Points
The main idea is that, based on the labeled training data, the algorithm tries to find the optimal hyperplane which can be used to classify new data points.

In two dimensions the hyperplane is a simple line.

Usually a learning algorithm tries to learn the most common characteristics of a class (what differentiates one class from another), and the classification is based on those representative characteristics (so classification is based on the differences between classes). The SVM works the other way around.

It finds the most similar examples between the classes. Those will be the support vectors.
Example
As an example, let's consider two classes, apples and lemons.

Other algorithms will learn the most evident, most representative characteristics of apples and lemons: apples are green and rounded, while lemons are yellow and elliptic.

In contrast, SVM will search for apples that are very similar to lemons, for example apples which are yellow and have an elliptic form. This will be a support vector. The other support vector will be a lemon that is similar to an apple (green and rounded).

So other algorithms learn the differences, while SVM learns the similarities.


If we visualize the example above in 2D, we will have something like this:
As we go from left to right, all the examples will be classified as apples until we reach the yellow apple. From this point on, the confidence that a new example is an apple drops, while the confidence for the lemon class increases. When the lemon-class confidence becomes greater than the apple-class confidence, new examples will be classified as lemons (somewhere between the yellow apple and the green lemon).

Based on these support vectors, the algorithm tries to find the best hyperplane that
separates the classes.

In 2D the hyperplane is a line, so it would look like this:


The boundary can also be drawn like this:

As you can see, we have an infinite number of possibilities for drawing the decision boundary.

So how can we find the optimal one?
Finding the Optimal Hyperplane
Intuitively, the best line is the one that is farthest from both the apple and the lemon examples, i.e., the one with the largest margin.

To obtain the optimal solution, we have to maximize the margin on both sides (if we have multiple classes, then we have to maximize it considering each of the classes).
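
This intuition can be checked numerically. The sketch below is a hypothetical toy example using only NumPy: for a candidate line w·x + b = 0, the margin is the smallest value of y_i(w·x_i + b)/||w|| over the training points, and the best candidate is the one with the largest such value.

```python
# Hypothetical sketch: compare candidate separating lines by their margin.
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 3.0],      # class +1 ("apples")
              [6.0, 1.0], [7.0, 2.0]])     # class -1 ("lemons")
y = np.array([1, 1, -1, -1])

def margin(w, b):
    """Smallest signed distance of any training point to the line w.x + b = 0."""
    return np.min(y * (X @ w + b) / np.linalg.norm(w))

candidates = {
    "line A": (np.array([-1.0, 0.0]), 3.0),   # vertical line x1 = 3
    "line B": (np.array([-1.0, 0.0]), 4.0),   # vertical line x1 = 4
    "line C": (np.array([-1.0, 0.5]), 2.5),   # tilted line
}
for name, (w, b) in candidates.items():
    print(name, "margin =", round(margin(w, b), 3))
# The candidate with the largest (positive) margin separates the classes best.
```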
Use Cases: Finding the Optimal Hyperplane
Identifying the Right Hyperplane (Scenario I)
Here, we have three hyperplanes (A, B, and C).

Now, identify the right hyperplane to classify the stars and circles.

A rule of thumb for identifying the right hyperplane: "Select the hyperplane which segregates the two classes better."

In this scenario, hyperplane B does this job best.
Identifying the Right Hyperplane (Scenario II)
Here, we have three hyperplanes (A, B, and C), and all of them segregate the classes well.

Now, how can we identify the right hyperplane?

Maximizing the distance between the nearest data point (of either class) and the hyperplane will help us decide the right hyperplane. This distance is called the margin.

Looking at the diagram for this scenario, you can see that the margin for hyperplane C is larger than for both A and B. Hence, we choose C as the right hyperplane.

Another compelling reason for selecting the hyperplane with the larger margin is robustness: if we select a hyperplane with a small margin, there is a high chance of misclassification.
Identifying the Right Hyperplane (Scenario III)
Use the rules discussed in the previous slides to identify the right hyperplane.

Some of you may have selected hyperplane B, as it has a higher margin compared to A.

But here is the catch: SVM selects the hyperplane which classifies the classes accurately prior to maximizing the margin.

Here, hyperplane B has a classification error, while A has classified all points correctly.

Therefore, the right hyperplane is A.


Identifying the Right Hyperplane (Scenario IV)
Here we are unable to segregate the two classes using a straight line, as one of the stars lies in the territory of the other (circle) class as an outlier.

The SVM algorithm has a feature to ignore outliers and find the hyperplane with the maximum margin. Hence, we can say that SVM classification is robust to outliers.
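
The sketch below illustrates this behavior under the assumption that scikit-learn is used on hypothetical toy data; the soft-margin parameter C (not discussed in the slides) controls how strongly a single stray point can pull the boundary, so the outlier is simply treated as a margin violation.

```python
# Sketch (assumes scikit-learn): a soft-margin SVM tolerating a single outlier.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
circles = rng.normal(loc=[0, 0], scale=0.5, size=(20, 2))   # class 0 cluster
stars = rng.normal(loc=[4, 4], scale=0.5, size=(20, 2))     # class 1 cluster
outlier = np.array([[0.2, 0.2]])                            # a "star" inside the circles

X = np.vstack([circles, stars, outlier])
y = np.array([0] * 20 + [1] * 21)

# A moderate C lets the optimizer treat the outlier as a margin violation
# instead of bending the hyperplane around it.
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("number of support vectors per class:", clf.n_support_)
```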
Identifying the Right Hyperplane (Scenario V)
In this scenario, we cannot have a linear hyperplane between the two classes, so how does SVM classify them? Until now, we have only looked at linear hyperplanes.

SVM can solve this problem easily, by introducing an additional feature. Here, we will add a new feature z = x^2 + y^2.

Now, let's plot the data points on the x and z axes:

In the above plot, the points to consider are:
● All values of z will always be positive, because z is the sum of the squares of x and y.
● In the original plot, the red circles appear close to the origin of the x and y axes, leading to lower values of z, while the stars lie relatively far from the origin, resulting in higher values of z.

With this new feature, it is easy for the SVM classifier to place a linear hyperplane between these two classes.
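
A small sketch of this manual transformation, using hypothetical ring-shaped toy data and NumPy only: points near the origin get a small z, points far from the origin get a large z, and a simple threshold on z now separates the two classes.

```python
# Sketch: make circularly-arranged data linearly separable by adding z = x^2 + y^2.
import numpy as np

rng = np.random.default_rng(1)
angles = rng.uniform(0, 2 * np.pi, 30)

inner = np.c_[np.cos(angles), np.sin(angles)] * 1.0   # "red circles" near the origin
outer = np.c_[np.cos(angles), np.sin(angles)] * 3.0   # "stars" far from the origin

def add_z(points):
    """Append the new feature z = x^2 + y^2 to each 2D point."""
    z = (points ** 2).sum(axis=1, keepdims=True)
    return np.hstack([points, z])

print(add_z(inner)[:, 2].max())   # about 1: small z for the inner points
print(add_z(outer)[:, 2].min())   # about 9: large z for the outer points
# In the (x, z) plane a horizontal line, e.g. z = 5, now separates the classes.
```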
But another burning question arises: do we need to add this feature manually to obtain a hyperplane?

No, the SVM algorithm has a technique called the kernel trick.

The SVM kernel is a function that takes a low-dimensional input space and transforms it into a higher-dimensional space, i.e., it converts a non-separable problem into a separable problem. It is mostly useful in non-linear separation problems.
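
As a hedged illustration (again assuming scikit-learn and toy data), the same ring-shaped data can be separated with an RBF kernel without ever constructing the z feature by hand; the kernel computes the similarities in the higher-dimensional space implicitly.

```python
# Sketch (assumes scikit-learn): the kernel trick replaces manual feature engineering.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
angles = rng.uniform(0, 2 * np.pi, 60)
radii = np.where(np.arange(60) < 30, 1.0, 3.0)        # inner ring vs outer ring
X = np.c_[radii * np.cos(angles), radii * np.sin(angles)]
y = (radii > 2).astype(int)

clf = SVC(kernel="rbf", gamma="scale").fit(X, y)      # no explicit z feature needed
print("training accuracy:", clf.score(X, y))
```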
Terminologies Used in SVM
The points closest to the hyperplane are called the support vectors, and the distance of these vectors from the hyperplane is called the margin.

The basic intuition to develop here is that the farther the support vectors are from the hyperplane, the higher the probability of correctly classifying the points in their respective classes.

Support vectors are critical in determining the hyperplane, because if the position of these vectors changes, the hyperplane's position is altered. Technically, this hyperplane can also be called the margin-maximizing hyperplane.
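
To make these terms concrete, the sketch below (assuming scikit-learn and hypothetical toy data) fits a linear SVM and reads off the support vectors and the margin width, which for a linear kernel equals 2/||w||.

```python
# Sketch (assumes scikit-learn): inspect support vectors and margin of a linear SVM.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [3, 3], [6, 1], [7, 2], [8, 3]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e3).fit(X, y)   # large C approximates a hard margin

w = clf.coef_[0]
print("support vectors:\n", clf.support_vectors_)
print("margin width = 2/||w|| =", 2 / np.linalg.norm(w))
```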
Hyperplane (Decision Surface)
The hyperplane is the function which is used to differentiate between the classes.

In 2D, the function used to separate the classes is a line; in 3D, it is a plane; similarly, the function which separates the points in higher dimensions is called a hyperplane.

Now that you know about the hyperplane, let's move back to SVM.
Let's say there are m dimensions; then the equation of the hyperplane in m-dimensional space can be given as

    w · x + b = w1·x1 + w2·x2 + … + wm·xm + b = 0

where the wi are the components of the weight vector, the xi are the input variables, and b is the bias term.
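
As a tiny worked example with hypothetical numbers, the sign of w·x + b tells us on which side of the hyperplane a point x lies:

```python
# Sketch: classify a point by the sign of w.x + b for a hypothetical hyperplane.
import numpy as np

w = np.array([2.0, -1.0, 0.5])    # example weight vector (m = 3 dimensions)
b = -1.0                          # bias term
x = np.array([1.0, 0.5, 2.0])     # a point to classify

decision = np.dot(w, x) + b       # w1*x1 + w2*x2 + w3*x3 + b
print("decision value:", decision)
print("predicted class:", 1 if decision >= 0 else -1)
```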


Basic Steps
The basic steps of the SVM are:

1. select two parallel hyperplanes (in 2D, lines) which separate the data with no points between them (the red lines)
2. maximize their distance (the margin)
3. the average line (here, the line halfway between the two red lines) will be the decision boundary

This is very nice and easy, but finding the best margin is a non-trivial optimization problem (it is easy in 2D, when we have only two attributes, but what if we have N dimensions, with N a very big number?).

To solve this optimization problem, we use Lagrange multipliers, as sketched below.
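
The slides do not go into the optimization itself. As a hedged sketch (assuming scikit-learn and toy data), SVC solves this dual problem internally, and the resulting non-zero Lagrange multipliers, scaled by the labels, are exposed as dual_coef_, from which the weight vector can be reconstructed.

```python
# Sketch (assumes scikit-learn): the solver returns the non-zero Lagrange multipliers.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [3, 3], [6, 1], [7, 2], [8, 3]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e3).fit(X, y)

# dual_coef_ holds y_i * alpha_i for the support vectors only;
# all other training points have alpha_i = 0 and drop out of the solution.
print("support vector indices:", clf.support_)
print("y_i * alpha_i:", clf.dual_coef_[0])

# The weight vector is recovered as w = sum_i (y_i * alpha_i) * x_i.
print("w from duals:", clf.dual_coef_[0] @ clf.support_vectors_)
print("w from model:", clf.coef_[0])
```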
