Machine Learning with Python
Machine Learning Algorithms - Support Vector Machine (SVM)
Prof. Shibdas Dutta,
Associate Professor,
DCG DATA CORE SYSTEMS INDIA PVT LTD
Kolkata
Company Confidential: Data-Core Systems, Inc. | datacoresystems.com
Machine Learning Algorithms – Classification Algorithms: SVM
Introduction - SVM
Support vector machines (SVMs) are powerful yet flexible supervised machine learning
algorithms used for both classification and regression.
Generally, however, they are used for classification problems. SVMs were first
introduced in the 1960s and later refined in the 1990s.
SVMs have a unique way of implementation compared to other machine learning
algorithms.
Lately, they have become extremely popular because of their ability to handle multiple
continuous and categorical variables.
Working of SVM
An SVM model is basically a representation of different classes separated by a hyperplane in a
multidimensional space.
The hyperplane is generated in an iterative manner by SVM so that the classification error is minimized.
The goal of SVM is to divide the dataset into classes by finding a maximum marginal hyperplane (MMH).
[Figure: two classes, Class A and Class B, plotted in a 2-D feature space (X-axis, Y-axis), separated by a hyperplane; the margin and the support vectors lying on its boundary are highlighted.]
The following are important concepts in SVM:
· Support Vectors: Data points that are closest to the hyperplane are called support vectors. The
separating line is defined with the help of these data points.
· Hyperplane: As we can see in the above diagram, it is a decision plane or space that divides a
set of objects belonging to different classes.
· Margin: It may be defined as the gap between two lines drawn on the closest data points
of different classes. It can be calculated as the perpendicular distance from the line to the support
vectors. A large margin is considered a good margin and a small margin is considered a bad
margin.
The main goal of SVM is to divide the dataset into classes by finding a maximum marginal hyperplane
(MMH), which can be done in the following two steps:
· First, SVM generates hyperplanes iteratively that segregate the classes in the best way.
· Then, it chooses the hyperplane that separates the classes correctly with the maximum margin
(see the margin-width sketch below).
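For a fitted linear SVM, the margin width follows directly from the weight vector w of the hyperplane: it equals 2 / ||w||. A minimal sketch (not from the original slides; it assumes scikit-learn's SVC and the make_blobs data used later in this tutorial):

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Fit a linear SVM on linearly separable sample data
X, y = make_blobs(n_samples=100, centers=2, random_state=0, cluster_std=0.50)
model = SVC(kernel='linear', C=1E10).fit(X, y)

w = model.coef_[0]                  # weight vector of the separating hyperplane
print(2 / np.linalg.norm(w))        # width of the margin between the two classes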
Implementing SVM in Python
For implementing SVM in Python, we will start by importing the standard libraries as follows:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns; sns.set()
Next, we create a sample dataset having linearly separable data, using make_blobs from
sklearn.datasets, for classification using SVM:
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=100, centers=2,random_state=0, cluster_std=0.50)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer');
The following would be the output after generating a sample dataset having 100
samples and 2 clusters:
We know that SVM supports discriminative classification: it divides the classes from each other by simply
finding a line in the case of two dimensions, or a manifold in the case of multiple dimensions. It is
implemented on the above dataset as follows:
xfit = np.linspace(-1, 3.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer')
plt.plot([0.6], [2.1], 'x', color='black', markeredgewidth=4, markersize=12)
# Iterate over the tuples (m, b) to plot different candidate separating lines
for m, b in [(1, 0.65), (0.5, 1.6), (-0.2, 2.9)]:
    # Calculate y based on the current slope and intercept
    plt.plot(xfit, m * xfit + b, '-k')  # '-k' means solid black line
plt.xlim(-1, 3.5);
The output is as follows:
We can see from the above output that there are three different separators that perfectly discriminate the
above samples.
As discussed, the main goal of SVM is to divide the dataset into classes by finding a maximum
marginal hyperplane (MMH). Hence, rather than drawing a zero-width line between the classes, we can
draw around each line a margin of some width up to the nearest point. It can be done as follows:
xfit = np.linspace(-1, 3.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer')
for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:
    yfit = m * xfit + b
    plt.plot(xfit, yfit, '-k')
    # Fill between yfit - d (lower boundary of the shaded area) and yfit + d
    # (upper boundary) to show the margin; the fill is a slightly transparent
    # light gray color
    plt.fill_between(xfit, yfit - d, yfit + d,
                     edgecolor='none', color='#AAAAAA', alpha=0.4)
plt.xlim(-1, 3.5);
From the above output, we can easily observe the “margins” within the discriminative classifiers.
SVM will choose the line that maximizes the margin.
Next, we will use Scikit-Learn’s support vector classifier to train an SVM model on this data. Here, we are
using a linear kernel to fit the SVM as follows:
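The fitting code itself does not appear on the slide; a minimal sketch consistent with the output printed below:

from sklearn.svm import SVC  # Scikit-Learn's support vector classifier

model = SVC(kernel='linear', C=1E10)  # very large C gives a "hard margin" fit
model.fit(X, y)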
C is the regularization parameter: it controls the trade-off between maximizing the margin
and minimizing the classification error. A very large value of C (like 1E10) means that the
classifier will prioritize correctly classifying all training examples (even if it means
a smaller margin). This can lead to overfitting, especially on small datasets.
The output is as follows:
SVC(C=10000000000.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr',
degree=3, gamma='auto_deprecated', kernel='linear', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)
C=10000000000.0: This is the regularization parameter. A large value of C (like 1E10, i.e. 10000000000.0) means that the SVC will
prioritize minimizing classification errors on the training data, which can lead to overfitting.
cache_size=200: This specifies the size of the kernel cache (in MB). The kernel cache stores the results of kernel computations to
speed up the training process.
class_weight=None: This parameter allows you to set the weights for different classes. When set to None, all classes are treated as
equally important. You can set it to 'balanced' to automatically adjust weights inversely proportional to class frequencies.
coef0=0.0: This parameter is relevant when using polynomial or sigmoid kernels. It controls the influence of higher-order terms in
the polynomial kernel and the scaling factor in the sigmoid kernel. Since we are using a linear kernel, coef0 is not relevant here.
decision_function_shape='ovr': This determines the strategy for multi-class classification. 'ovr' stands for "one-vs-rest," meaning
that the classifier fits one classifier per class, with each classifier separating one class from the rest. The alternative is 'ovo' (one-vs-one).
degree=3: This parameter is relevant for polynomial kernels and specifies the degree of the polynomial function. Like coef0, it is
not relevant for a linear kernel.
gamma='auto_deprecated': The gamma parameter defines the influence of a single training example. 'auto_deprecated' was used in
older versions of scikit-learn to signify an automatically calculated gamma value based on the number of features. In newer versions,
you should use 'scale' or 'auto'.
kernel='linear': This specifies the type of kernel used by the SVC. In this case, it is linear, which means the classifier will try to find a
linear decision boundary between classes.
max_iter=-1: This controls the maximum number of iterations the solver can run. -1 means no limit, allowing the solver to run until
convergence.
probability=False: When set to True, this parameter enables probability estimates by training an additional model with cross-
validation. It increases training time, so it is set to False by default.
random_state=None: This parameter sets the seed for the random number generator, which can ensure reproducibility of results. None
means the random number generator will be initialized randomly.
shrinking=True: This parameter enables the shrinking heuristic, which can speed up training by ignoring some training examples that
are unlikely to change the decision boundary.
tol=0.001: This sets the tolerance for the stopping criterion. The algorithm will stop iterating when the error improvement is below
this threshold.
verbose=False: This controls whether to print detailed information during training. False means no output will be printed.
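As a sketch of how the same classifier would be constructed explicitly in a current scikit-learn version (gamma='scale' replaces the deprecated 'auto_deprecated'; the remaining values are the defaults listed above):

from sklearn.svm import SVC

model = SVC(C=1E10, kernel='linear', gamma='scale',
            degree=3, coef0=0.0, class_weight=None,
            probability=False, shrinking=True,
            tol=0.001, max_iter=-1, verbose=False)
model.fit(X, y)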
Now, for a better understanding, the following will plot the decision function for a 2D SVC:
def decision_function(model, ax=None, plot_support=True):
    if ax is None:
        ax = plt.gca()  # If no axis is provided, use the current axis
    xlim = ax.get_xlim()  # Get the current limits of the axis
    ylim = ax.get_ylim()
For evaluating the model, we need to create a grid as follows:
    x = np.linspace(xlim[0], xlim[1], 30)  # Create grid to evaluate the model
    y = np.linspace(ylim[0], ylim[1], 30)
    Y, X = np.meshgrid(y, x)
    xy = np.vstack([X.ravel(), Y.ravel()]).T
    P = model.decision_function(xy).reshape(X.shape)
Next, we need to plot the decision boundary and margins as follows:
    ax.contour(X, Y, P, colors='k',
               levels=[-1, 0, 1], alpha=0.5, linestyles=['--', '-', '--'])
Now, similarly plot the support vectors as follows:
    if plot_support:
        ax.scatter(model.support_vectors_[:, 0],
                   model.support_vectors_[:, 1],
                   s=300, linewidth=1, facecolors='none');
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)
Now, use this function to plot the decision boundary and margins of our fitted model as follows:
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer')
decision_function(model);
We can observe from the above output that an SVM classifier has been fit to the data, with margins
(i.e. the dashed lines) and support vectors, the pivotal elements of this fit, touching the dashed lines.
These support vector points are stored in the support_vectors_ attribute of the classifier as
follows:
model.support_vectors_
The output is as follows:
array([[0.5323772 , 3.31338909],
[2.11114739, 3.57660449],
[1.46870582, 1.86947425]])
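Beyond support_vectors_, the fitted classifier exposes related attributes (standard scikit-learn SVC API) that are often useful:

model.support_      # indices of the support vectors within the training data X
model.n_support_    # number of support vectors belonging to each class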
SVM Kernels
In practice, the SVM algorithm is implemented with a kernel that transforms the input data space
into the required form.
SVM uses a technique called the kernel trick, in which the kernel takes
a low-dimensional input space and transforms it into a higher-dimensional space.
In simple words, the kernel converts non-separable problems into separable problems by adding more
dimensions.
It makes SVM more powerful, flexible and accurate. The following are some of the types of kernels used
by SVM:
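To see the idea behind the kernel trick, consider a toy sketch (not from the original slides): points on a line that no single threshold can separate become linearly separable once a squared feature is added:

import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([1, 0, 0, 0, 1])   # outer vs. inner points: no 1-D threshold separates them
X_lifted = np.c_[x, x ** 2]     # add a second dimension x^2; a horizontal line now separates the classes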
Linear Kernel
It can be used as a dot product between any two observations. The formula of the linear kernel is as
below:
K(x, xi) = sum(x * xi)
From the above formula, we can see that the product between two vectors, say x and xi, is the sum of the
multiplication of each pair of input values.
Polynomial Kernel
It is a more generalized form of the linear kernel and can distinguish curved or nonlinear input spaces.
Following is the formula for the polynomial kernel:
K(x, xi) = (1 + sum(x * xi))^d
Here d is the degree of the polynomial, which we need to specify manually in the learning algorithm.
Radial Basis Function (RBF) Kernel
The RBF kernel, mostly used in SVM classification, maps the input space into an infinite-dimensional
space. The following formula explains it mathematically:
K(x, xi) = exp(-gamma * sum((x - xi)^2))
Here, gamma must be positive and is commonly chosen between 0 and 1. We need to manually specify it
in the learning algorithm. A good default value of gamma is 0.1.
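To make the three kernel formulas concrete, a quick numerical sketch (the vectors and parameter values are chosen arbitrarily for illustration):

import numpy as np

x  = np.array([1.0, 2.0])
xi = np.array([3.0, 4.0])

K_linear = np.sum(x * xi)                        # 11.0, same as np.dot(x, xi)
K_poly   = (1 + np.sum(x * xi)) ** 3             # polynomial kernel with degree d = 3
K_rbf    = np.exp(-0.1 * np.sum((x - xi) ** 2))  # RBF kernel with gamma = 0.1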
Having implemented SVM for linearly separable data, we can also implement it in Python for data that is
not linearly separable. It can be done by using kernels.
Example
The following is an example of creating an SVM classifier by using kernels. We will be using the iris
dataset from scikit-learn.
We will start by importing the following packages:
import pandas as pd
import numpy as np
from sklearn import svm, datasets
import matplotlib.pyplot as plt
Now, we need to load the input data:
iris = datasets.load_iris()
From this dataset, we are taking the first two features as follows:
X = iris.data[:, :2]
y = iris.target
Next, we will plot the SVM boundaries with the original data as follows:
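The mesh-grid construction does not appear on the slide; a sketch that produces the xx, yy and X_plot used in the code below:

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
h = (x_max - x_min) / 100   # grid step size
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
X_plot = np.c_[xx.ravel(), yy.ravel()]   # grid points at which to evaluate the classifier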
Now, we need to provide the value of the regularization parameter as follows:
C = 1.0
Next, the SVM classifier object can be created as follows:
svc_classifier = svm.SVC(kernel='linear', C=C).fit(X, y)
Z = svc_classifier.predict(X_plot)
Z = Z.reshape(xx.shape)
plt.figure(figsize=(15, 5))
plt.subplot(121)
plt.contourf(xx, yy, Z, cmap=plt.cm.tab10, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.title('Support Vector Classifier with linear kernel')
Output
Text(0.5, 1.0, 'Support Vector Classifier with linear kernel')
For creating an SVM classifier with the rbf kernel, we can change the kernel to rbf as follows:
svc_classifier = svm.SVC(kernel='rbf', gamma='auto', C=C).fit(X, y)
Z = svc_classifier.predict(X_plot)
Z = Z.reshape(xx.shape)
plt.figure(figsize=(15, 5))
plt.subplot(121)
plt.contourf(xx, yy, Z, cmap=plt.cm.tab10, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.title('Support Vector Classifier with rbf kernel')
Output
Text(0.5, 1.0, 'Support Vector Classifier with rbf kernel')
We set the value of gamma to 'auto', but you can also provide an explicit value between 0 and 1.
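For example, an explicit gamma could be passed instead (a sketch; the value 0.5 is arbitrary):

svc_classifier = svm.SVC(kernel='rbf', gamma=0.5, C=C).fit(X, y)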
Pros and Cons of SVM Classifiers
Pros of SVM classifiers
SVM classifiers offer great accuracy and work well with high-dimensional spaces. SVM
classifiers basically use only a subset of the training points, and hence use very little memory.
Cons of SVM classifiers
They have a high training time, hence in practice they are not suitable for large datasets. Another
disadvantage is that SVM classifiers do not work well with overlapping classes.
Thank You