Lecture Notes - SVM
Support Vector Machine (SVM) is an advanced machine learning technique with a unique way of solving
complex problems such as image recognition, face detection and voice detection. As you will learn in this session,
SVMs solve the problem of nonlinearity through kernels.
For instance, if you have data as shown in the figure below, SVMs can handle it easily, and that is what
distinguishes SVMs from logistic regression.
It is important to remember that SVMs belong to the class of linear machine learning models (logistic regression is
also a linear model).
A linear model uses a linear function (i.e. of the form y = ax +b) to model the relationship between the input x and
output y. For example, in logistic regression, the log(odds) of an outcome (say, defaulting on a credit card) is linearly
related to the attributes x1, x2, etc.
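For instance, the 'linearly related' claim above can be written out explicitly (using the same W notation used for the hyperplane equations later in these notes, with p denoting the probability of the outcome):

```latex
\log\left(\frac{p}{1-p}\right) = W_0 + W_1 x_1 + W_2 x_2 + \dots + W_d x_d
```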
Similarly, SVMs are also linear models, and they need the attributes in numeric form.
Concept of Hyperplane in 2D
Before you move on to support vector machines, you need to understand the concept of hyperplanes. Essentially, a
hyperplane is a boundary that separates the data set into classes (for example, spam emails from ham emails). It could be
a line, a 2D plane, or even an n-dimensional plane that is beyond our imagination.
A line used to separate one class from another is called a hyperplane. In fact, it is the model you are trying to
build, as shown in the figure below:
Figure 2: Hyperplane
The standard equation of a line is given by ax + by + c = 0. You could generalise it as W0 + W1x1 + W2x2=0, where x1
and x2 are the features — such as 'word_freq_technology' and 'word_freq_money' — and W1 and W2 are the
coefficients.
For any line determined by its W coefficients, substituting the values of the features x1 and x2 into the equation of
the line returns a value.
A positive value (blue points in the plot above) means that the point belongs to one class, while a negative value
(red points in the plot above) means it belongs to the other class. A value of zero means that the point lies on the
line (the hyperplane), because any point on the line satisfies the equation W0 + W1x1 + W2x2 = 0.
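As a small illustration, here is a sketch in Python with made-up coefficients (not taken from the figure) showing how the sign of the linear expression decides the class:

```python
import numpy as np

# Hypothetical hyperplane: W0 + W1*x1 + W2*x2 = 0
W0, W1, W2 = -1.0, 2.0, 3.0

# A few hypothetical observations (word_freq_technology, word_freq_money)
points = np.array([[0.5, 0.4],
                   [0.1, 0.1],
                   [0.5, 0.0]])

# Substituting each point into the expression of the line returns a value;
# its sign tells you which side of the hyperplane the point falls on.
values = W0 + W1 * points[:, 0] + W2 * points[:, 1]
print(values)            # roughly [ 1.2 -0.5  0. ]
print(np.sign(values))   # +1 -> one class, -1 -> other class, 0 -> on the hyperplane
```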
Concept of Hyperplane in 3D
In 3 dimensions (refer to figure below), the hyperplane (light orange) will be a plane with an expression
of ax+by+cz+d = 0. The plane divides the data set into two halves. Data points above the plane represent one class
(red), while data points below the plane represent the other class (blue).
Figure 3: 3D Hyperplane - Plane
In general, if the hyperplane is constructed from d attributes in d-dimensional space, its expression can be written as follows:
W0 + W1x1 + W2x2 + ... + Wdxd = 0
The model denoted by the expression given above is called a linear discriminator. Similar to the 2D and 3D
expressions, an n-dimensional hyperplane also follows the general rule: all points above the plane will yield a
value greater than 0, and those below it will yield a value less than 0 when plugged into the expression of the hyperplane.
There could be multiple lines (hyperplanes) that perfectly separate the two classes, as shown in the figure
below. The best line is the one that maintains the largest possible equal distance from the nearest points of
both classes. In other words, for the separator to be optimal, the margin, i.e. the distance of the nearest point to the
separator, should be maximum. This is called the Maximal Margin Classifier.
In the figure above, the third line (hyperplane) is the maximal margin classifier.
The mathematical formulation of the maximal margin classifier involves two major constraints that need to be taken
into account while maximising the margin. They are:
The standardisation of the coefficients such that the sum of the squares of the coefficients of all the
attributes is equal to 1. For example, if you have 20 attributes, then the coefficients should satisfy:
W1² + W2² + W3² + ... + W20² = 1
Along with the first constraint, the maximal margin hyperplane should also satisfy the constraint given
below:
yi(W0 + W1xi1 + W2xi2 + ... + Wdxid) ≥ M for every observation i, where yi is the class label (+1 or −1) and M is the margin.
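Put together, this is a compact restatement of the optimisation problem (with M denoting the margin being maximised):

```latex
\max_{W_0, W_1, \ldots, W_d,\; M} \; M
\quad \text{subject to} \quad
\sum_{j=1}^{d} W_j^2 = 1,
\qquad
y_i \left( W_0 + W_1 x_{i1} + \cdots + W_d x_{id} \right) \geq M \;\; \text{for every observation } i.
```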
The maximal margin line (hyperplane), although it separates the two classes perfectly, is very sensitive to the
training data. This means that the Maximal Margin Classifier will perform perfectly on the training data set. But on
the unseen data, it may perform poorly. Also, there are cases where the classes cannot be perfectly separated.
The Support Vector Classifier essentially allows certain points to be deliberately misclassified. By doing this, it is able
to classify most of the points correctly in the unseen data and is also more robust.
The Support Vector Classifier is also called the Soft Margin Classifier because instead of searching for the margin
that exactly classifies each and every data point to the correct class, the Soft Margin Classifier allows some
observations to fall on the wrong side. Only the points that are close to the hyperplane are considered while
constructing it, and those points are called support vectors.
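As a minimal sketch, here is how you could fit a soft margin classifier with scikit-learn's SVC (the toy data below is made up, not a data set from these notes) and inspect which points become support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Toy, roughly separable 2D data (hypothetical values)
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [2.0, 1.0],
              [6.0, 5.0], [7.0, 7.0], [8.0, 6.0], [7.0, 5.0]])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

clf = SVC(kernel='linear')   # a soft margin classifier with a linear hyperplane
clf.fit(X, y)

print(clf.coef_, clf.intercept_)   # W1, W2 and W0 of the separating hyperplane
print(clf.support_vectors_)        # only the points near the margin act as support vectors
print(clf.predict([[4.0, 4.0]]))   # classify a new (hypothetical) observation
```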
The support vector classifier works well when the data is only partially intermingled (i.e. the data can be classified
with a few misclassifications). But what if the distribution is completely intermingled yet follows some pattern, such
as the circular distribution of labels (+ and -) shown in the figure below?
Obviously, the Support Vector Classifier can't classify the data above correctly, because it divides the data set into
two halves, which misclassifies a lot of data points. But it doesn't mean that this problem cannot be solved. There is
a way to solve such problems, which we will learn later.
Like the Maximal Margin Classifier, the Support Vector Classifier also maximises the margin; but the margin, here,
will allow some points to be misclassified, as shown in figure below.
Figure 7: Soft Margin Classifier
To select the best-fit Support Vector Classifier, the notion of slack variables (epsilon, ε) helps in comparing the
classifiers. A slack variable is used to control misclassifications: it tells you where an observation is located relative
to the margin and the hyperplane.
There are three different conditions for an observation, depending on its slack value. If epsilon (ϵ) = 0 for an
observation, it lies on the correct side of the margin, as shown in the figure below.
If an observation violates only the margin, i.e. 0 < epsilon (ϵ) < 1, it is still classified correctly, as shown in the
figure below.
Figure 9: Slack Variables
But if an observation violates the hyperplane, i.e. epsilon (ϵ) > 1, it is on the wrong side of the
hyperplane, as shown in the figure below.
Each data point has a slack value associated with it, according to where the point is located.
The value of slack lies between 0 and +infinity.
Lower values of slack are better than higher values: slack = 0 implies a correct classification, slack > 1 implies an
incorrect classification, and a slack between 0 and 1 implies a correct classification that violates the margin.
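In the penalty formulation used by most software (where the margin corresponds to the decision function taking the values +1 and −1), the slack of each training point can be recovered as max(0, 1 − y·f(x)). The sketch below works under that assumption, on made-up, slightly overlapping data:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, slightly overlapping 2D data
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [3.5, 4.5],
              [6.0, 5.0], [7.0, 7.0], [8.0, 6.0], [4.5, 3.5]])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

clf = SVC(kernel='linear', C=1.0).fit(X, y)

f = clf.decision_function(X)       # signed score of each point w.r.t. the hyperplane
slack = np.maximum(0, 1 - y * f)   # epsilon_i for each observation

# slack == 0     -> on the correct side of the margin
# 0 < slack < 1  -> inside the margin but still correctly classified
# slack > 1      -> on the wrong side of the hyperplane (misclassified)
print(np.round(slack, 2))
```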
Cost of Misclassification
The cost of misclassification, denoted by cost or 'C', is a budget that is greater than or equal to the summation of the
epsilons of all the data points.
Once you understand the notion of the slack variable, you can easily compare two Support Vector Classifiers: measure
the summation of the epsilons (ϵ) for each hyperplane and choose the one that gives you the smaller sum. The budget
C bounds the summation of the epsilons of all the data points, i.e.
ε1 + ε2 + ... + εn ≤ C
When C is large, the slack variables can be large, i.e. you allow a larger number of data points to be misclassified or
to violate the margin. So you get a hyperplane where the margin is wide and misclassifications are allowed. In this
case, the model is flexible, more generalisable, and less likely to overfit. In other words, it has a high bias.
On the other hand, when C is small, you force the individual slack variables to be small, i.e. you do not allow many
data points to fall on the wrong side of the margin or the hyperplane. So, the margin is narrow and there are few
misclassifications. In this case, the model is less flexible, less generalisable, and more likely to overfit. In other
words, it has a high variance.
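One caveat if you try this in code: scikit-learn's C parameter is a misclassification penalty, not the budget described above, so it works in the opposite direction (a small scikit-learn C allows more violations and a wider margin, which corresponds to a large budget). A small sketch on made-up data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) - 1, rng.randn(50, 2) + 1])   # two overlapping blobs
y = np.array([-1] * 50 + [1] * 50)

# Small penalty -> wide margin, many violations allowed (higher bias, lower variance)
wide_margin = SVC(kernel='linear', C=0.01).fit(X, y)

# Large penalty -> narrow margin, few violations allowed (lower bias, higher variance)
narrow_margin = SVC(kernel='linear', C=100).fit(X, y)

# The wide-margin model typically ends up with more support vectors
print(len(wide_margin.support_), len(narrow_margin.support_))
```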
Kernels
Kernels are one of the most interesting inventions in machine learning, partly because they were born through the
creative imagination of mathematicians, and partly because of their utility in dealing with nonlinear data sets.
So far, we have learnt about hyperplanes, the Maximal Margin Classifier, and the Support Vector Classifier. All of
these are linear models (since they use linear hyperplanes to separate the classes). However, many real-world data
sets are not separable by linear boundaries. For instance, what if the distribution of data points looks like the figure
given below?
You can transform nonlinear boundaries to linear boundaries by applying certain functions to the original attributes.
The original space (X, Y) is called the attribute space, and the transformed space (X’, Y’) is called the feature space.
Let’s say you want to classify emails into 'spam' or 'ham' on the basis of two attributes — 'word_freq_office' (X) and
'word_freq_lottery' (Y). The plot below shows the data set, which is clearly nonlinear.
To convert this data set into a linearly separable one, a simple transformation into a new feature space (X’, Y’) can be
made. For now, don’t worry about the math behind the transformation. You may almost never need to manually
transform data sets. Just assume that some appropriate transformation from (X, Y) to (X’, Y’) can make the data
linearly separable.
In the original attribute space, notice that the observations are distributed in a circular fashion. This gives you a hint
that the transformation should convert the circular distribution to a linear distribution.
X′ = (word_freq_office (X) − a)²
Y′ = (word_freq_lottery (Y) − b)²
where (a, b) is the centre of the circular pattern.
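A quick sketch of this transformation on synthetic data (a, b and r below are the assumed centre and radius of the circular pattern; the data is made up): after transforming, the two classes can be separated by the straight line X′ + Y′ = r².

```python
import numpy as np

rng = np.random.RandomState(42)
a, b, r = 0.5, 0.5, 1.0        # assumed centre and radius of the circular pattern

# Synthetic points: 'ham' inside the circle, 'spam' outside it
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([rng.uniform(0.0, 0.8, 100),    # inside the circle
                        rng.uniform(1.2, 2.0, 100)])   # outside the circle
X = a + radii * np.cos(angles)
Y = b + radii * np.sin(angles)
labels = np.array([0] * 100 + [1] * 100)               # 0 = ham, 1 = spam

# Feature transformation: attribute space (X, Y) -> feature space (X', Y')
X_new = (X - a) ** 2
Y_new = (Y - b) ** 2

# In the feature space the boundary is linear: X' + Y' = r**2
predicted_ham = (X_new + Y_new) < r ** 2
print(np.mean(predicted_ham == (labels == 0)))          # accuracy of the linear rule
```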
Feature Transformation
The process of transforming the original attributes into a new feature space is called ‘feature transformation’.
However, as the number of attributes increases, there is an exponential increase in the number of dimensions in the
transformed feature space. Suppose you have four variables in your data set; then, considering only a polynomial
transformation of degree 2, you end up with 15 features in the new feature space, as shown in the
figure below.
Figure 13: 4D Feature Space
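You can verify this count with scikit-learn's PolynomialFeatures (the feature names below are hypothetical): with 4 input variables and degree 2, you get 15 features (the constant term, 4 linear terms, 4 squares and 6 pairwise products).

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.random.rand(5, 4)               # 5 rows, 4 hypothetical attributes
poly = PolynomialFeatures(degree=2)    # all polynomial terms up to degree 2
X_poly = poly.fit_transform(X)

print(X_poly.shape[1])                 # 15 features in the transformed space
print(poly.get_feature_names_out(['x1', 'x2', 'x3', 'x4']))
```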
Kernel Trick
As feature transformation results in a large number of features, it makes the modelling (i.e. the learning process)
computationally expensive. The use of kernels resolves this issue. The key fact that makes the kernel trick possible is
that, to find the best-fit model, the learning algorithm only needs the inner products of the observations (Xiᵀ·Xj). It
never uses the individual data points X1, X2, etc. in isolation.
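A tiny sketch of why this matters, using the simple polynomial kernel K(x, z) = (xᵀz)² in two dimensions (the explicit feature map phi below is one standard choice for this kernel): the kernel value, computed directly in the original attribute space, equals the inner product of the explicitly transformed features, so the algorithm never needs to build the transformed features at all.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2D point."""
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

def poly_kernel(x, z):
    """Polynomial kernel computed directly in the original attribute space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

print(np.dot(phi(x), phi(z)))   # inner product in the transformed feature space: 121.0
print(poly_kernel(x, z))        # the same value, without ever computing phi: 121.0
```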
Think of a kernel as a black box, as shown in the figure below. The attributes are passed to the black box, and it
implicitly returns the linear boundaries for the classification of the nonlinear data. This means you never have to:
Manually find the mathematical transformation needed to convert a nonlinear attribute space to a linear feature space
Perform computationally heavy transformations
In practice, you only need to know that kernels are functions which help you transform non-linear datasets. Given a
dataset, you can try various kernels, and choose the one that produces the best model. The three most popular
types of kernel functions are:
The linear kernel: This is the same as the support vector classifier, or the hyperplane, without any
transformation at all
The polynomial kernel: It is capable of creating nonlinear, polynomial decision boundaries
The radial basis function (RBF) kernel: This is the most complex one, which is capable of transforming highly
nonlinear feature spaces to linear ones. It is even capable of creating elliptical (i.e. enclosed) decision
boundaries
The three types of kernel functions shown in the figures below represent the typical decision boundaries they are
capable of creating. Notice that the linear kernel is the same as the vanilla hyperplane, the polynomial kernel can
produce polynomial-shaped nonlinear boundaries, and the RBF kernel can produce highly nonlinear, ellipsoid-shaped
boundaries.
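As a sketch, you can try all three kernels in scikit-learn on the same (synthetic, circular) data set; on a pattern like this, the RBF kernel would typically outperform the linear one:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic circular data, similar in spirit to the nonlinear examples above
X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)

for kernel in ['linear', 'poly', 'rbf']:
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=5)
    print(kernel, round(scores.mean(), 3))   # mean cross-validated accuracy
```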
Kernel Parameter
In nonlinear kernels such as the RBF kernel, you use the parameter sigma (denoted as gamma in some texts and
packages) to control the amount of nonlinearity in the model: the higher the value of sigma, the more nonlinearity is
introduced; the lower the value, the less. Apart from sigma, you also have the hyperparameter C, or the cost (with all
types of kernels).
Figure 18: RBF Kernels
The plot above shows three RBF kernels with different values of sigma. When sigma is high, more nonlinearity is
added. When you increase sigma from 10 to 100, the nonlinearity increases further and all the training points
are mapped correctly, as is visible in the highly complex decision boundary in the third panel of the figure above.
However, this results in a highly complex model that may overfit the training set (i.e. a model with high variance).
Like most other hyperparameters, it is advisable to tune the value of sigma using cross-validation.
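A small sketch of this effect (on synthetic circular data; in scikit-learn the parameter is called gamma): the training accuracy typically keeps rising as gamma grows, which is exactly the overfitting risk described above.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, noise=0.2, factor=0.5, random_state=0)

for gamma in [0.1, 10, 1000]:
    clf = SVC(kernel='rbf', gamma=gamma).fit(X, y)
    # Higher gamma -> more complex decision boundary -> higher training accuracy,
    # but not necessarily better performance on unseen data
    print(gamma, round(clf.score(X, y), 3))
```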
Choosing the appropriate kernel is important for building a model of optimum complexity. If the kernel is highly
nonlinear, the model is likely to overfit. On the other hand, if the kernel is too simple, then it may not fit the training
data well.
Usually, it is difficult to choose the appropriate kernel by visualising the data or using exploratory analysis.
Thus, cross-validation (or trial and error, if you are only choosing from two or three types of kernels) is often a good strategy.
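In code, the kernel choice can simply be included in the cross-validated search along with the other hyperparameters (the grid values below are arbitrary starting points, not recommendations):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)

# Search over the kernel type together with its main hyperparameters
param_grid = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10]},
    {'kernel': ['poly'], 'degree': [2, 3], 'C': [0.1, 1, 10]},
    {'kernel': ['rbf'], 'gamma': [0.01, 0.1, 1], 'C': [0.1, 1, 10]},
]

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```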