Support Vector Machines: Theory, Implementation, and Applications
Presented By: Sachin 2022UIT3065
Presented to:
Learning Objectives: Understanding SVM principles, mathematical formulation, implementation, and practical use cases.
Introduction to Support Vector Machines
Developed by Vladimir Vapnik and colleagues at AT&T Bell
Laboratories (1992-1995), SVM evolved from Statistical Learning Theory.
It is a supervised machine learning algorithm that finds an optimal
hyperplane to separate data into distinct classes, focusing on
maximizing the margin between those classes. Originally formulated for
binary classification, it was later extended to multi-class settings.
Fundamental Concept
At its core, the Support Vector Machine (SVM) seeks to identify the
optimal hyperplane that most effectively separates data points into
distinct classes. This hyperplane is chosen to maximize the margin—
the distance between itself and the nearest data points from each
class. In an n-dimensional space, the hyperplane is defined by the
equation w·x + b = 0, where w represents the weight vector, x is the
input vector, and b is the bias. Classification is then accomplished
using the decision function: f(x) = sign(w·x + b).
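As a minimal sketch of this decision rule (assuming NumPy and hypothetical, already-trained values for w and b, which are not given in the slides):

import numpy as np

# Hypothetical trained parameters of the hyperplane w·x + b = 0
w = np.array([0.25, 0.25])   # weight vector
b = 0.0                      # bias

def decide(x):
    # f(x) = sign(w·x + b): classify by which side of the hyperplane x falls on
    return np.sign(np.dot(w, x) + b)

print(decide(np.array([2.0, 3.0])))    # +1.0: positive side of the hyperplane
print(decide(np.array([-1.0, -4.0])))  # -1.0: negative side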
Key Terminology
Support Vectors: Critical data points nearest to the separating hyperplane, influencing its position and orientation.
Margin: The perpendicular distance between the hyperplane and the closest support vectors, indicating classification confidence.
Maximum Margin Hyperplane (MMH): The optimal hyperplane that maximizes the margin, providing the best separation between classes.
Decision Boundary: The hyperplane that distinctly separates data points of different classes, enabling classification.
Feature Space: The n-dimensional space representing all possible values of the input features, where data points are plotted.
Feature Space Mapping: The input data is transformed into a high-dimensional feature space, enabling the
representation of complex relationships within the data.
Hyperplane Generation: Multiple potential hyperplanes are created within the feature space, each acting as a
candidate decision boundary between classes.
Margin Calculation: For each hyperplane, the margin—the distance between the hyperplane and the nearest data
points (support vectors) from each class—is computed.
Maximum Margin Selection: The algorithm selects the hyperplane with the largest margin, known as the Maximum
Margin Hyperplane (MMH), ensuring optimal class separation.
Support Vector Identification: Critical data points closest to the MMH, called support vectors, are identified. These
points play a pivotal role in determining the hyperplane's position and orientation.
Decision Function Derivation: A decision function is formulated using the support vectors and MMH. This function
classifies new data points by determining which side of the hyperplane they fall on.
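These steps map closely onto an off-the-shelf implementation. The sketch below is a rough illustration using scikit-learn (an assumed choice; the slides do not prescribe a library): it fits a linear SVM on toy data, inspects the identified support vectors and the resulting hyperplane, and classifies a new point with the learned decision function.

import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable training data (two classes)
X = np.array([[2, 2], [3, 3], [4, 2], [-2, -2], [-3, -3], [-4, -2]])
y = np.array([1, 1, 1, -1, -1, -1])

# Linear kernel: the feature-space mapping is just the identity here;
# a very large C approximates a hard margin on separable data
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print("Support vectors:\n", clf.support_vectors_)       # points closest to the MMH
print("w =", clf.coef_[0], " b =", clf.intercept_[0])   # hyperplane parameters
print("Margin width 2/||w|| =", 2 / np.linalg.norm(clf.coef_[0]))

# Decision: which side of the hyperplane does a new point fall on?
print("Predicted class:", clf.predict(np.array([[1.0, 1.5]]))[0])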
Mathematical Representation
The foundation of a linear Support Vector Machine lies in its precise mathematical formulation,
which can be broken down into the following components:
Primal Formulation:
The objective is to minimize ½||w||², where w is the weight vector, while ensuring that every
data point satisfies the constraint yi(w·xi + b) ≥ 1. This guarantees not only correct
classification but also a functional margin of at least 1, resulting in a convex quadratic
optimization problem that can be solved using Lagrange multipliers.
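For a concrete, toy-sized view of this primal problem, the sketch below solves it with SciPy's general-purpose SLSQP solver (an assumed choice made purely for illustration, not a dedicated SVM solver): it minimizes ½||w||² subject to yi(w·xi + b) ≥ 1 over four points.

import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Parameter vector p = [w1, w2, b]
objective = lambda p: 0.5 * (p[0] ** 2 + p[1] ** 2)    # ½||w||²

# One inequality constraint per point: yi(w·xi + b) - 1 >= 0
constraints = [
    {"type": "ineq",
     "fun": lambda p, xi=xi, yi=yi: yi * (np.dot(p[:2], xi) + p[2]) - 1}
    for xi, yi in zip(X, y)
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, " b =", b)                              # expect w ≈ [0.25, 0.25], b ≈ 0
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))  # ≈ 5.66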
Lagrangian Formulation:
To solve this optimization problem, the Lagrangian function is defined as:
L(w, b, α) = ½||w||² - Σi αi[yi(w·xi + b) - 1]
Here, αi represents the Lagrange multipliers associated with each constraint.
Decision Function:
Finally, with the optimal parameters in hand, the decision function formulated for classifying
new data points is:
f(x) = sign(Σi αiyi(xi·x) + b)
This function assigns the class of a new data point based on which side of the hyperplane it falls
on.
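Because only the support vectors end up with non-zero αi, this sum runs over the support vectors alone. A minimal sketch, assuming hypothetical values for αi, the support vectors, their labels, and b (here chosen to match the toy problem above):

import numpy as np

# Assumed training results (hypothetical, for illustration only)
alphas = np.array([0.0625, 0.0625])             # non-zero Lagrange multipliers
sv_X = np.array([[2.0, 2.0], [-2.0, -2.0]])     # support vectors
sv_y = np.array([1.0, -1.0])                    # their labels
b = 0.0                                         # bias

def decision(x):
    # f(x) = sign(Σi αi yi (xi·x) + b), summed over support vectors only
    return np.sign(np.sum(alphas * sv_y * (sv_X @ x)) + b)

print(decision(np.array([1.0, 3.0])))    # +1.0
print(decision(np.array([-2.0, -1.0])))  # -1.0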
Linear Separability
Linearly separable data can be perfectly divided into distinct classes
using a hyperplane. This division requires selecting parameters w (the
weight vector) and b (the bias) so that for every training example (xi,
yi), the condition yi(w·xi + b) > 0 is met. The classification boundaries
are established by the canonical hyperplanes defined by w·x + b = 1
and w·x + b = -1, which create a margin of width 2/||w||. A hard-
margin SVM seeks to find the unique hyperplane that maximizes this
margin, ensuring the most robust separation between the classes.
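A short numerical check of these relationships, using the assumed parameters w = (0.25, 0.25) and b = 0 from the toy problem above: the closest points lie on the canonical hyperplanes w·x + b = ±1, each at distance 1/||w|| from the decision boundary, for a total margin of 2/||w||.

import numpy as np

w, b = np.array([0.25, 0.25]), 0.0                            # assumed hyperplane
x_pos, x_neg = np.array([2.0, 2.0]), np.array([-2.0, -2.0])   # closest points

print(np.dot(w, x_pos) + b)       #  1.0 -> lies on w·x + b = +1
print(np.dot(w, x_neg) + b)       # -1.0 -> lies on w·x + b = -1
print(1 / np.linalg.norm(w))      # distance of each canonical hyperplane ≈ 2.83
print(2 / np.linalg.norm(w))      # margin width ≈ 5.66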
Margin Maximization
The goal of margin maximization is to achieve the widest possible
separation between classes, quantified as 2/||w||, while ensuring all
data points meet the requirement yi(w·xi + b) ≥ 1. This objective is
mathematically equivalent to minimizing (||w||²)/2 under the same
constraints, forming a convex optimization problem with a unique
solution. This problem is commonly solved using the Lagrangian
formulation introduced earlier: L(w, b, α) = ½||w||² - Σi αi[yi(w·xi + b) - 1].
To address non-linearly separable data, Support Vector Machines (SVMs) employ two
main approaches. The first is the Soft-Margin SVM, which introduces non-negative slack
variables ξi. These slack variables allow some misclassification by accepting data points
that fall within the margin or even on the wrong side of the hyperplane: the constraints
relax to yi(w·xi + b) ≥ 1 - ξi, and the objective becomes minimizing ½||w||² + C Σi ξi,
where the regularization parameter C balances maximizing the margin against
accommodating classification errors.
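As a rough illustration of this trade-off (again assuming scikit-learn), a small C tolerates more margin violations and yields a wider margin, while a large C penalizes violations heavily:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two overlapping Gaussian blobs: not perfectly linearly separable
X = np.vstack([rng.normal(-1.0, 1.2, size=(50, 2)),
               rng.normal(+1.0, 1.2, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:>6}: margin width = {margin:.2f}, "
          f"support vectors = {len(clf.support_vectors_)}")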
The second approach utilizes Kernel Methods. Instead of trying to find a linear
boundary in the original input space, kernel functions transform the data into a
higher-dimensional space where linear separation becomes feasible. This
transformation is performed implicitly through the computation of dot products in
the new space, making it unnecessary to explicitly compute the high-dimensional
mapping. As a result, the kernel trick enables SVMs to manage complex, non-linear
relationships effectively.
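A minimal sketch of the kernel trick in practice, assuming scikit-learn and the RBF (Gaussian) kernel, one common choice among several rather than anything prescribed here: concentric circles admit no linear boundary in the original 2-D space, yet a kernelized SVM separates them easily.

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: no separating line exists in the original 2-D space
X, y = make_circles(n_samples=400, factor=0.3, noise=0.08, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
# RBF kernel: implicit mapping via K(x, z) = exp(-gamma * ||x - z||^2)
rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)

print("Linear kernel accuracy:", linear_svm.score(X_test, y_test))  # near chance level
print("RBF kernel accuracy:   ", rbf_svm.score(X_test, y_test))     # close to 1.0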