Lecture 7: Classification (SVM)

Classification : Support Vector Machines (SVM)
Adama Science and Technology University
School of Electrical Engineering and Computing
Department of Computer Science & Engineering

Dr. Mesfin Abebe Haile (2022)


Support Vector Machines

 Support vector machines are considered by some people to be the best stock (out-of-the-box) classifier: used unmodified, they produce low error rates.
 Support vector machines make good decisions for data points that are outside the training set (the test set).
 There are many implementations of support vector machines, but we’ll focus on one of the most popular implementations:
 The Sequential Minimal Optimization (SMO) algorithm.
 It breaks the problem down into sub-problems that can be solved analytically (by calculating) rather than numerically (by searching or optimizing).
 We will also see how to use kernels to extend SVMs to a wider range of datasets.

Support Vector Machines

 Pros and cons of SVM:


 Pros:
 Low generalization error, computationally inexpensive, easy to interpret results.
 Cons:
 Sensitive to tuning parameters and kernel choice; natively only handles binary classification.
 Works with:
 Numeric values, nominal values.

Support Vector Machines

 The line used to separate the dataset is called a separating hyperplane (it is what separates the data).
 For a dataset with three dimensions, the separating hyperplane is a plane (two dimensions).
 For a dataset with 1024 dimensions, the separating hyperplane has 1023 dimensions.

 The hyperplane is our decision boundary.


 Everything on one side belongs to one class, and everything on the other side
belongs to a different class.
 We’d like to make our classifier in such a way that the farther a data point is
from the decision boundary, the more confident we are about the prediction
we’ve made.

Support Vector Machines

 We’d like to find the point closest to the separating hyperplane and make sure this is as far away from the separating line as possible.
 This distance is known as the margin.

 We want to have the greatest possible margin, because if we made a mistake or trained our classifier on limited data, we’d want it to
be as robust as possible.
 The points closest to the separating hyperplane are known as support vectors.

Support Vector Machines
 Since we're trying to maximize the distance from the separating line to the support vectors, we need to find a way to optimize this problem.
 How can we measure the line that best separates the data? (by the maximum margin, measured along the normal, that is, the line perpendicular to the separator)

Support Vector Machines

 General approach to SVMs:


 Collect: Any method.
 Prepare: Numeric values are needed.
 Analyze: It helps to visualize the separating hyperplane.
 Train: The majority of the time will be spent here. Two parameters can be adjusted during this phase. (alpha and b)

 Test: Very simple calculation.


 Use: You can use an SVM in almost any classification problem. One thing to note is that SVMs are natively binary classifiers; extensions are needed to use them with more than two classes.

Support Vector Machines

 SVM is one of the most popular ML algorithms:


 They were extremely popular around the time they were developed in the 1990s and remain a
preferred choice when a high-performing algorithm with little tuning is needed.

 The Maximal-Margin classifier is a hypothetical classifier that best explains how an SVM works in practice.
 For example, if you had two input variables, this would form a two-dimensional space.

Support Vector Machines

 A hyperplane is a line that splits the input variable space.


 In SVM, a hyperplane is selected to best separate the points in the input variable space by their class, either
class 0 or class 1.

 In two dimensions you can visualize this as a line, and let's assume that all of our input points can be
completely separated by this line.
 For example:
 B0 + (B1 * X1) + (B2 * X2) = 0

Support Vector Machines

 Where:
 The coefficients (B1 and B2), which determine the slope of the line, and
 the intercept (B0) are found by the learning algorithm, and
 X1 and X2 are the two input variables.

 You can make classifications using this line.


 By plugging input values into the line equation, you can
calculate whether a new point is above or below the line.

Support Vector Machines

 Above the line, the equation returns a value greater than 0 and the point
belongs to the first class (class 0).
 Below the line, the equation returns a value less than 0 and the point
belongs to the second class (class 1).

 A point close to the line returns a value close to zero and may be
difficult to classify.
 If the magnitude of the value is large, the model may have more
confidence in the prediction.
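 To make this concrete, below is a minimal Python sketch of this decision rule; the coefficients are made up for illustration, not learned from data:

   # Hypothetical coefficients (a real SVM learns these from the training data).
   B0, B1, B2 = -3.0, 1.0, 0.5

   def classify(x1, x2):
       value = B0 + (B1 * x1) + (B2 * x2)
       return 0 if value > 0 else 1   # above the line -> class 0, below -> class 1

   print(classify(4.0, 2.0))   # value =  2.0, well above the line -> class 0
   print(classify(1.0, 1.0))   # value = -1.5, below the line      -> class 1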

Support Vector Machines

 The distance between the line and the closest data points is
referred to as the margin.
 The best or optimal line that can separate the two classes is the
line that has the largest margin. This is called the Maximal-Margin
hyperplane.

 The margin is calculated as the perpendicular distance from the
line to only the closest points.
 Only these points are relevant in defining the line and in the
construction of the classifier.
SVM : Soft Margin Classifier

 In practice, real data is messy and cannot be separated perfectly
with a hyperplane.
 The constraint of maximizing the margin of the line that separates
the classes must be relaxed. This is often called the soft margin
classifier.
 An additional set of coefficients is introduced that gives the
margin wiggle room in each dimension. These coefficients are
sometimes called slack variables.
 This increases the complexity of the model, as there are more
parameters for the model to fit to the data.
SVM : Soft Margin Classifier

 A tuning parameter is introduced, called simply C, that defines the
magnitude of the wiggle allowed across all dimensions.
 The C parameter defines the amount of violation of the margin
allowed.

 A value of C = 0 allows no violations, and we are back to the inflexible
Maximal-Margin Classifier described above.
 The larger the value of C, the more violations of the margin
are permitted.

SVM : Soft Margin Classifier

 During the learning of the hyperplane from data, all training
instances that lie within the distance of the margin will affect the
placement of the hyperplane and are referred to as support vectors.
 And as C affects the number of instances that are allowed to fall
within the margin, C influences the number of support vectors
used by the model.
 The smaller the value of C, the more sensitive the algorithm is to the
training data (higher variance and lower bias).
 The larger the value of C, the less sensitive the algorithm is to the
training data (lower variance and higher bias).
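 As a hedged aside: scikit-learn's SVC also has a parameter named C, but it behaves as the inverse of the "budget" C described above: a large scikit-learn C permits few margin violations (a harder margin), while a small one permits many (a softer margin). A minimal sketch, with an illustrative synthetic dataset:

   from sklearn.datasets import make_classification
   from sklearn.svm import SVC

   X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                              n_redundant=0, random_state=0)

   hard_margin = SVC(kernel='linear', C=1000.0).fit(X, y)   # few violations allowed
   soft_margin = SVC(kernel='linear', C=0.01).fit(X, y)     # many violations allowed

   # A softer margin typically leaves more points inside the margin,
   # so more support vectors are used.
   print(len(hard_margin.support_), len(soft_margin.support_))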
Support Vector Machines (Kernels)

 The SVM algorithm is implemented in practice using a kernel.


 The learning of the hyperplane in linear SVM is done by
transforming the problem using some linear algebra.

 A powerful insight is that the linear SVM can be rephrased using
the inner product of any two given observations, rather than the
observations themselves.
 The inner product between two vectors is the sum of the
multiplication of each pair of input values.

Support Vector Machines (Kernels)

 For example, the inner product of the vectors [2, 3] and [5, 6] is
2*5 + 3*6 or 28.
 The equation for making a prediction for a new input using the
dot product between the input (x) and each support vector (xi) is
calculated as follows:

 f(x) = B0 + sum(ai * (x · xi))


 This is an equation that involves calculating the inner products of a
new input vector (x) with all support vectors in the training data.
 The coefficients B0 and ai (for each input) must be estimated from
the training data by the learning algorithm.
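 Below is a minimal NumPy sketch of this prediction equation; the support vectors, the ai coefficients, and B0 are made up for illustration, not estimated from data:

   import numpy as np

   print(np.dot([2, 3], [5, 6]))                 # 2*5 + 3*6 = 28

   support_vectors = np.array([[1.0, 2.0],       # the xi (hypothetical values)
                               [3.0, 1.0]])
   a = np.array([0.6, -0.4])                     # the ai (hypothetical values)
   B0 = 0.5

   def f(x):
       # inner product of the new input x with every support vector
       return B0 + np.sum(a * support_vectors.dot(x))

   print(f(np.array([2.0, 2.0])))                # the sign of f(x) gives the predicted class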
Linear Kernel SVM

 The dot-product is called the kernel and can be re-written as:


 K(x, xi) = sum(x * xi)

 The kernel defines the similarity or a distance measure between
new data and the support vectors.

 The dot product is the similarity measure used for linear SVM or
a linear kernel because the distance is a linear combination of the
inputs.

Linear Kernel SVM

 Other kernels can be used that transform the input space into
higher dimensions such as a Polynomial Kernel and a Radial
Kernel. This is called the Kernel Trick.

 It is desirable to use more complex kernels, as they allow the lines
that separate the classes to be curved or even more complex.

 This in turn can lead to more accurate classifiers.

Polynomial Kernel SVM

 Instead of the dot-product, we can use a polynomial kernel. For example:
 K(x, xi) = (1 + sum(x * xi))^d

 Where the degree of the polynomial (d) must be specified by hand to
the learning algorithm.

 When d=1 this is the same as the linear kernel. The polynomial
kernel allows for curved lines in the input space.

Radial Kernel SVM

 Finally, we can also have a more complex radial kernel. For example:
 K(x, xi) = exp(-gamma * sum((x - xi)^2))
 Where gamma is a parameter that must be specified to the
learning algorithm.

 A good default value for gamma is 0.1; gamma is often in the range
0 < gamma < 1.
 The radial kernel is very local and can create complex regions
within the feature space, like closed polygons in two-dimensional
space.
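 The three kernels above can be written directly with NumPy; this is a sketch of the kernel formulas only, not of a full SVM implementation:

   import numpy as np

   def linear_kernel(x, xi):
       return np.sum(x * xi)                            # K(x, xi) = sum(x * xi)

   def polynomial_kernel(x, xi, d=2):
       return (1 + np.sum(x * xi)) ** d                 # K(x, xi) = (1 + sum(x * xi))^d

   def radial_kernel(x, xi, gamma=0.1):
       return np.exp(-gamma * np.sum((x - xi) ** 2))    # K(x, xi) = exp(-gamma * sum((x - xi)^2))

   x, xi = np.array([2.0, 3.0]), np.array([5.0, 6.0])
   print(linear_kernel(x, xi))        # 28.0
   print(polynomial_kernel(x, xi))    # (1 + 28)^2 = 841.0
   print(radial_kernel(x, xi))        # exp(-0.1 * 18), about 0.165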
Data Preparation for SVM

 How to prepare your training data when learning an SVM model.


 Numerical Inputs: SVM assumes that your inputs are numeric.
 If you have categorical inputs you may need to convert them to
binary dummy variables (one variable for each category); see the sketch below.

 Binary Classification: Basic SVM is intended for binary (two-class)
classification problems.
 However, extensions have been developed for regression and
multi-class classification.
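 A minimal sketch of converting a categorical input into binary dummy variables with pandas (the column name "color" and its values are made up for illustration):

   import pandas as pd

   df = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue']})
   dummies = pd.get_dummies(df, columns=['color'])
   print(dummies)   # one 0/1 column per category: color_blue, color_green, color_red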

Classification with SVMs

 What is the best (linear) decision boundary?


 Two features (number of nodes, age)
 Two labels/values (survived, lost)

Classification with SVMs

 Find the line that best separates classes.


 More misclassification

Classification with SVMs

 Find the line that best separates classes.


 No misclassification, but the model is sensitive due to the boundary points.

Classification with SVMs

 Find the line that best separates classes.


 No misclassification, but the model is sensitive due to the boundary points.

Classification with SVMs

 Include the largest boundary possible.


 This one maximizes the distance between the classes.
 This is not very sensitive to small variations.

Outlier Sensitivity in SVMs

 The goal of SVMs is to find the best boundary.


 There are some problems that this ideal may create.

Outlier Sensitivity in SVMs

 This boundary maximizes the distance from the classes.


 But, what happens when the dataset is messier?

Outlier Sensitivity in SVMs

 Now, there is a red sample close to the blue.


 What will happen to our boundary?

Outlier Sensitivity in SVMs

 The shifted boundary is not better than the previous vertical boundary.

Outlier Sensitivity in SVMs

 Therefore, it's probably still best not to move our natural
boundary for that one red record.

Linear SVM : the Syntax

 Import the class containing the classification method.


 from sklearn.svm import LinearSVC
 Create an instance of the class.
 LinSVC = LinearSVC(penalty='l2', C=10.0) # [regularization parameters]
 Fit the instance on the data and then predict the expected value.
 LinSVC = LinSVC.fit(x_train, y_train)
 y_predict = LinSVC.predict(x_test)

 Tune regularization parameters with cross-validation.
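 A runnable sketch of this workflow end to end; the synthetic dataset and the parameter grid are illustrative choices, not recommendations:

   from sklearn.datasets import make_classification
   from sklearn.model_selection import train_test_split, GridSearchCV
   from sklearn.svm import LinearSVC

   X, y = make_classification(n_samples=500, n_features=10, random_state=0)
   x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

   # Tune the regularization parameter C with cross-validation.
   grid = GridSearchCV(LinearSVC(penalty='l2', dual=False),
                       param_grid={'C': [0.01, 0.1, 1.0, 10.0]}, cv=5)
   grid.fit(x_train, y_train)

   y_predict = grid.predict(x_test)
   print(grid.best_params_, grid.score(x_test, y_test))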


Classification with SVMs

 What we have seen so far is the linear version of Support
Vector Machines. (Simple Version)
 There is a way (Kernel trick) to achieve non-linear decision
boundaries with SVMs.
 Non-linear data can be made linearly separable with higher
dimensionality.

The Kernel Trick

 Transform data so it is linearly separable.


 The complicated-looking curve actually maps to a hyperplane
in this three-dimensional space.
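 A hedged illustration of the same idea (not the exact figure on this slide): points on two concentric circles are not linearly separable in two dimensions, but adding a third feature x1^2 + x2^2 makes them separable by a plane:

   import numpy as np
   from sklearn.svm import LinearSVC

   rng = np.random.default_rng(0)
   angles = rng.uniform(0, 2 * np.pi, 200)
   radii = np.where(np.arange(200) < 100, 1.0, 3.0)          # inner vs. outer circle
   X = np.c_[radii * np.cos(angles), radii * np.sin(angles)]
   y = (radii > 2).astype(int)

   X3 = np.c_[X, X[:, 0] ** 2 + X[:, 1] ** 2]                # lift the data to three dimensions
   print(LinearSVC(dual=False).fit(X3, y).score(X3, y))      # about 1.0: a plane now separates the classes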

SVM Gaussian Kernel

 A dataset with two features; the classes are color-coded.


 This dataset is not linearly separable.

SVM Gaussian Kernel

 Approach 1: Create higher order features to transform the data.


 From the ones we have: Budget^2 + Rating^2 + Budget * Rating + …
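 A sketch of this approach with scikit-learn's PolynomialFeatures; Budget and Rating are the slide's example features, and the numeric values are made up:

   import numpy as np
   from sklearn.preprocessing import PolynomialFeatures

   X = np.array([[10.0, 3.5],    # [Budget, Rating] for each sample (made-up values)
                 [50.0, 4.2]])
   poly = PolynomialFeatures(degree=2, include_bias=False)
   X_expanded = poly.fit_transform(X)
   print(poly.get_feature_names_out(['Budget', 'Rating']))
   # ['Budget' 'Rating' 'Budget^2' 'Budget Rating' 'Rating^2']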

SVM Gaussian Kernel

 Approach 2: Transform the space to a different coordinate
system.

SVM Gaussian Kernel

 The algorithm tries to select the best support vectors and creates
these Gaussian features.

SVM Gaussian Kernel

 Another name for this is the "radial basis function" (RBF) kernel.


 RBF is the most commonly used kernel.

SVMs with Kernels : the Syntax

 Import the class containing the classification method.


 from sklearn.svm import SVC
 Create an instance of the class.
 rbfSVC = SVC(kernel='rbf', gamma=1.0, C=10.0) # [set the kernel and its associated coefficient (gamma)]
 Fit the instance on the data and then predict the expected value.
 rbfSVC = rbfSVC.fit(x_train, y_train)
 y_predict = rbfSVC.predict(x_test)

 Tune kernel and associated parameters with cross-validation.
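 A sketch of that tuning step with GridSearchCV, assuming x_train, y_train and x_test are defined as on this slide; the parameter grid is only an illustrative starting point:

   from sklearn.model_selection import GridSearchCV
   from sklearn.svm import SVC

   param_grid = {'gamma': [0.01, 0.1, 1.0], 'C': [0.1, 1.0, 10.0]}
   search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
   search.fit(x_train, y_train)

   print(search.best_params_)
   y_predict = search.best_estimator_.predict(x_test)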


SVMs

 SVMs with RBF kernels are very slow to train with lots of
features or data. (in real life)

 So constructing an approximate kernel mapping is often "good enough".


 There are a few methods to approximate the kernel:
 Nystroem
 RBFSampler are among the popular ones.

 A kernel map creates a dataset in a higher-dimensional
space; it is not by itself a classifier.
Fast Kernel Transformations : the Syntax
 Import the class containing the kernel approximation method.
 from sklearn.kernel_approximation import Nystroem
 Create an instance of the class.
 nystroemSVC = Nystroem(kernel='rbf', gamma=1.0, n_components=100) # [multiple non-linear kernels can be used; kernel and gamma are identical to SVC; n_components is the number of samples used to build the mapping]
 Fit the instance on the data and transform.
 X_train = nystroemSVC.fit_transform(X_train)
 X_test = nystroemSVC.transform(X_test)
 Tune kernel parameters and components with cross-validation.
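 Because the Nystroem map only transforms the data, it is typically paired with a linear classifier; below is a sketch using a Pipeline, assuming untransformed X_train, y_train and X_test arrays are available:

   from sklearn.pipeline import make_pipeline
   from sklearn.kernel_approximation import Nystroem
   from sklearn.svm import LinearSVC

   approx_rbf_svc = make_pipeline(
       Nystroem(kernel='rbf', gamma=1.0, n_components=100),
       LinearSVC(dual=False),
   )
   approx_rbf_svc.fit(X_train, y_train)
   y_predict = approx_rbf_svc.predict(X_test)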
Fast Kernel Transformations : the Syntax
 Import the class containing the kernel approximation method.
 from sklearn.kernel_approximation import RBFSampler
 Create an instance of the class.
 rbfSample = RBFSampler(gamma=1.0, n_components=100) # [RBF is the only kernel that can be used; the parameter names are identical to the previous slide]
 Fit the instance on the data and transform.
 X_train = rbfSample.fit_transform(X_train)
 X_test = rbfSample.transform(X_test)

 Tune kernel parameters and components with cross-validation.


When to use Logistic Regression vs. SVC
Question & Answer

Thank You !!!

