SVM
SVMs handle both linearly and non-linearly separable datasets. This capability and their robust
mathematical foundation have propelled SVMs to the forefront of various applications across
industries, from text classification and image recognition to bioinformatics and financial forecasting.
Understanding SVMs can open up possibilities in your data analysis journey.
Many real-world datasets exhibit non-linear relationships between features and classes,
making them non-linearly separable. Non-linear SVMs address this challenge by employing
kernel functions to implicitly map the original feature space into a higher-dimensional space,
where linear separation becomes feasible.
By transforming the data into a higher-dimensional space, SVMs can find hyperplanes that
effectively separate classes even when they are not linearly separable in the original feature
space.
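As a concrete illustration of this idea, the sketch below maps 2D points into 3D with an explicit quadratic feature map. The use of scikit-learn and a synthetic concentric-circles dataset are assumptions made for this example, not part of the text above: data that a linear SVM cannot separate in the original space becomes separable after the mapping.

# A minimal sketch of the idea above: concentric-circle data is not linearly
# separable in 2D, but an explicit quadratic feature map makes it separable.
# Dataset and feature map are illustrative choices, not prescribed by the text.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A linear SVM in the original 2D space struggles with this data.
linear_2d = LinearSVC(max_iter=10000).fit(X, y)
print("accuracy in original space:", linear_2d.score(X, y))

# Map (x1, x2) -> (x1^2, x2^2, sqrt(2)*x1*x2): the circles become separable by a plane.
Z = np.column_stack([X[:, 0] ** 2, X[:, 1] ** 2, np.sqrt(2) * X[:, 0] * X[:, 1]])
linear_3d = LinearSVC(max_iter=10000).fit(Z, y)
print("accuracy after mapping:", linear_3d.score(Z, y))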
The linear kernel is the simplest kernel function used in Support Vector Machines (SVMs). It's
essentially a dot product between two data points. Mathematically, it's represented as:
K(x1, x2) = x1 · x2
where x1 and x2 are the feature vectors of the two data points and · denotes the dot product.
How it Works:
1. Kernel Function: The linear kernel calculates the similarity between two data points by
computing the dot product of their feature vectors. A higher dot product generally indicates
greater similarity.
2. Hyperplane: The SVM algorithm uses the kernel function to find the optimal hyperplane that
separates the data points into different classes. This hyperplane is defined by the support
vectors, which are the data points closest to the decision boundary.
Linearly Separable Data: The linear kernel is most effective when the data is linearly
separable. This means that a straight line (or a hyperplane in higher dimensions) can
perfectly separate the different classes.
Simplicity: Due to its simplicity, the linear kernel is often used as a baseline for comparison
with other kernel functions.
Limitations:
Non-Linearly Separable Data: If the data is not linearly separable, the linear kernel may not
perform well. In such cases, non-linear kernels like the polynomial kernel or the RBF kernel
are often more effective.
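The following minimal sketch trains a linear-kernel SVM on linearly separable data and inspects the support vectors that define the hyperplane. The scikit-learn API and the synthetic blobs dataset are assumptions for illustration, not part of the text above.

# A minimal sketch (assumed setup, not from the text) of the linear kernel in
# scikit-learn: SVC with kernel='linear' computes K(x1, x2) = x1 . x2 internally.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated blobs: a linearly separable baseline case.
X, y = make_blobs(n_samples=100, centers=2, random_state=42)

clf = SVC(kernel="linear")
clf.fit(X, y)

print("number of support vectors:", clf.support_vectors_.shape[0])  # points defining the hyperplane
print("training accuracy:", clf.score(X, y))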
The polynomial kernel is a powerful tool within Support Vector Machines (SVMs) that allows them
to handle non-linearly separable data. Here's a breakdown of its key aspects:
Core Idea:
Essentially, the polynomial kernel transforms the original input data into a higher-
dimensional space.
This transformation is done implicitly, meaning we don't actually calculate the coordinates of
the data in this higher space. Instead, the kernel function calculates the dot product of the
transformed vectors.
The advantage of this transformation is that data that's not separable by a straight line (or
hyperplane) in the original space might become separable in the higher-dimensional space.
Mathematical Representation:
o K(x, y) = (x ⋅ y + c)^d
o Where x and y are the input feature vectors, c is a constant term, and d is the degree of the polynomial.
How it Works:
By raising the dot product of the input vectors to a certain power (d), the kernel effectively
creates new features that are combinations of the original features.
For example, if d = 2, the kernel creates features that are quadratic combinations of the
original features.
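The small numeric check below confirms this for d = 2 and c = 1: the polynomial kernel equals an ordinary dot product computed in an explicit feature space of quadratic combinations. It is an illustrative NumPy sketch; the helper names poly_kernel and phi are made up for this example.

# Illustrative check (not from the text): for d = 2 and c = 1, the polynomial
# kernel equals a dot product in an explicit space of quadratic features.
import numpy as np

def poly_kernel(x, y, c=1.0, d=2):
    return (np.dot(x, y) + c) ** d

def phi(x, c=1.0):
    # Explicit degree-2 feature map for a 2D input (x1, x2).
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1,
                     np.sqrt(2 * c) * x2,
                     c])

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])
print(poly_kernel(x, y))       # kernel value computed implicitly
print(np.dot(phi(x), phi(y)))  # same value via the explicit mapping (up to rounding)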
Key Considerations:
Degree (d):
o A higher degree allows for more complex decision boundaries but also increases the
risk of overfitting.
Constant (c):
o The constant c influences the balance between higher-degree and lower-degree
terms.
Computational Cost:
o Higher degrees increase the computational cost, especially on large datasets.
Applications:
Polynomial kernels are useful in situations where the data exhibits non-linear relationships.
o Image processing.
o Text classification.
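A minimal sketch of a polynomial-kernel SVM follows. It assumes scikit-learn, where the degree parameter corresponds to d and coef0 to the constant c, and uses a synthetic two-moons dataset purely for illustration.

# A minimal sketch (assumed setup, not from the text) of a polynomial-kernel SVM:
# `degree` is d and `coef0` is the constant c in K(x, y) = (x . y + c)^d.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

clf = SVC(kernel="poly", degree=3, coef0=1.0, C=1.0)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))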
The Radial Basis Function (RBF) kernel is a very popular kernel function used in Support Vector
Machines (SVMs), particularly for non-linear classification. Here's a breakdown of its key
characteristics:
Core Idea:
The RBF kernel implicitly transforms the input data into an infinite-dimensional space.
Mathematical Representation:
o K(x, y) = exp(−γ ||x − y||²)
o Where x and y are the input feature vectors, ||x − y|| is the Euclidean distance between them, and γ (gamma) controls the width of the kernel.
How it Works:
Essentially, the RBF kernel calculates how similar two data points are by measuring their
distance.
Points that are close together have a high similarity, while points that are far apart have a
low similarity.
The gamma parameter determines how much influence a single training example has.
A small gamma means a wider influence, while a large gamma means a narrower influence.
Key Considerations:
Gamma (γ):
o A high gamma leads to a very "tight" fit, which can result in overfitting.
Computational Cost:
o The RBF kernel can be computationally expensive, especially for large datasets.
Flexibility:
o The RBF kernel is very flexible and can handle complex, non-linear decision
boundaries.
Applications:
o Image classification.
o Bioinformatics.
o Text classification.
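The sketch below fits RBF-kernel SVMs with several gamma values, illustrating the behaviour described above: a very large gamma fits the training set tightly but tends to generalize worse. The scikit-learn API and the synthetic two-moons dataset are assumptions for illustration.

# A minimal sketch (assumed setup, not from the text): RBF-kernel SVMs on a
# non-linear dataset, showing how gamma changes the fit.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for gamma in (0.1, 1.0, 100.0):
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X_train, y_train)
    print(f"gamma={gamma}: train={clf.score(X_train, y_train):.2f}, "
          f"test={clf.score(X_test, y_test):.2f}")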
HARD MARGIN
Linear SVM:
This refers to an SVM that aims to find a linear hyperplane to separate the data.
In a 2D space, this hyperplane is a straight line; in 3D, it's a plane; and in higher dimensions,
it's a hyperplane.
A linear SVM is suitable when the data is linearly separable, meaning it can be perfectly
divided by a straight line or hyperplane.
The goal is to maximize the margin, which is the distance between the hyperplane and the
closest data points (support vectors) from each class.
Limitations:
o It's very sensitive to outliers. Even a single outlier can significantly affect the
hyperplane or make it impossible to find one.
o It only works when the data is perfectly linearly separable, which is rarely the case in
real-world datasets.
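In practice a hard margin is usually approximated rather than enforced exactly. The sketch below does so by setting the penalty parameter C to a very large value; scikit-learn and the synthetic dataset are assumptions for illustration, and the example presumes the blobs are linearly separable.

# Approximating a hard-margin linear SVM (illustrative sketch, not from the text):
# a very large C means (almost) no misclassification is tolerated.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.8, random_state=7)

hard = SVC(kernel="linear", C=1e10)  # effectively a hard-margin linear SVM
hard.fit(X, y)
print("training accuracy:", hard.score(X, y))
print("number of support vectors:", len(hard.support_vectors_))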
SOFT MARGIN
A soft margin SVM addresses the limitations of the hard margin SVM by allowing some
misclassifications.
It introduces "slack variables" that allow some data points to be on the wrong side of the
margin or even the wrong side of the hyperplane.
This makes the SVM more robust to outliers and allows it to work with data that is not
perfectly linearly separable.
A "C" parameter is used to control the trade-off between maximizing the margin and
minimizing the number of misclassifications.
Benefits:
o More robust to outliers than a hard margin SVM.
o Works with data that is not perfectly linearly separable, which is the common case in real-world datasets.
o The C parameter lets you tune the trade-off between a wide margin and a close fit to the training data.
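The following sketch varies C on data that no hyperplane can separate perfectly, showing the trade-off between margin width and training accuracy. The scikit-learn API and the overlapping synthetic blobs are assumptions for illustration.

# A minimal sketch (assumed setup, not from the text) of the C trade-off: a small
# C widens the margin and tolerates misclassifications; a large C penalizes them.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping classes: no hyperplane separates them perfectly.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=1)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin_width = 2.0 / np.linalg.norm(clf.coef_[0])  # margin = 2 / ||w||
    print(f"C={C}: margin width={margin_width:.3f}, "
          f"training accuracy={clf.score(X, y):.2f}")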
Support Vector Regression (SVR) is a machine learning technique that extends Support Vector
Machines (SVMs) to handle regression problems, predicting continuous outcomes rather than
classifying data into discrete categories. It aims to find a function that fits the data within a
tolerance margin (epsilon), penalizing only points whose prediction error exceeds that margin.
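A minimal SVR sketch follows; the scikit-learn API and the noisy sine data are illustrative assumptions. Here epsilon defines the tube within which errors are ignored, and C penalizes points that fall outside it.

# A minimal sketch (assumed setup, not from the text) of Support Vector Regression:
# SVR fits a function to continuous targets, here a noisy sine curve.
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)

# epsilon sets the tube within which errors are ignored; C penalizes points outside it.
model = SVR(kernel="rbf", C=10.0, epsilon=0.1)
model.fit(X, y)
print("R^2 on training data:", model.score(X, y))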