Support Vector Machine
Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression challenges.
Core Concepts
1. Hyperplane:
o A decision boundary that separates data points of different classes. In 2D it is a line, in 3D it is a plane, and in higher dimensions it is a hyperplane.
2. Margin:
o The distance between the hyperplane and the nearest data points
(support vectors). SVM aims to maximize this margin.
3. Support Vectors:
o Data points closest to the hyperplane that influence its position and
orientation.
4. Kernel Trick:
o SVM uses kernel functions to implicitly map the data into a higher-dimensional space, making it easier to find a separating hyperplane when the data is not linearly separable in its original space.
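To make these concepts concrete, here is a minimal sketch (not part of the original notes) that fits a linear SVM with scikit-learn's SVC on a small toy dataset and inspects the support vectors that define the margin; the data points and parameter values are illustrative assumptions.

```python
# Minimal sketch: fitting a linear SVM and inspecting its support vectors.
# The toy data and parameter values are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

# Two well-separated 2-D classes
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# Support vectors are the points closest to the hyperplane;
# they alone determine its position and orientation.
print("Support vectors:\n", clf.support_vectors_)
print("Hyperplane: w =", clf.coef_, " b =", clf.intercept_)
```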
Hyperparameters in SVM
1. C (Regularization Parameter)
• Purpose: Balances the trade-off between maximizing the margin and
minimizing classification error.
• Effect:
o Large C: Focuses on correctly classifying all training points, resulting
in a smaller margin and potential overfitting.
o Small C: Allows more misclassifications but achieves a wider margin,
improving generalization.
Real-life Analogy for C:
Imagine you're designing a security system for a museum:
1. Large C:
o The security guard checks every single person and every detail of
their belongings. This ensures no one suspicious gets through but
slows down entry for everyone (overfitting).
o Applied to SVM: This results in a very tight decision boundary that
might not generalize well to new visitors (new data).
2. Small C:
o The guard only checks for large, obvious threats (e.g., large bags or
unusual behavior). Some small errors may occur, but it ensures quick
and smooth entry for most people (better generalization).
o Applied to SVM: The decision boundary is looser, focusing on the
broader picture and tolerating some mistakes.
Scenario: Classifying emails as "spam" or "not spam".
• Large C: The model tries to perfectly classify every email in the training
data. If one legitimate email contains the word "win" (often found in spam),
the model might overfit and treat all emails with "win" as spam.
• Small C: The model tolerates a few misclassified emails in the training data
but finds a broader, more generalized rule for spam classification.
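A hypothetical sketch of the spam scenario above, assuming a tiny hand-made corpus and arbitrary C values: a large C pushes the linear SVM to fit every training email, while a small C tolerates a few training mistakes in exchange for a wider margin.

```python
# Hypothetical sketch: how C changes a linear SVM's tolerance for training errors.
# The tiny corpus and the C values are illustrative assumptions, not real data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

emails = [
    "win a free prize now",       # spam
    "claim your free money",      # spam
    "meeting agenda for monday",  # not spam
    "did you win the contract",   # legitimate email that contains "win"
]
labels = [1, 1, 0, 0]

X = CountVectorizer().fit_transform(emails)

# Large C: the model works hard to classify every training email correctly,
# which can overfit to words like "win".
strict = LinearSVC(C=100.0).fit(X, labels)

# Small C: the model tolerates a few training mistakes in exchange for a
# wider margin and (usually) better generalization.
lenient = LinearSVC(C=0.01).fit(X, labels)

print("Large C training accuracy:", strict.score(X, labels))
print("Small C training accuracy:", lenient.score(X, labels))
```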
2. Kernel
• Purpose: Determines the transformation applied to the data.
• Types:
o Linear Kernel: For linearly separable data.
o Polynomial Kernel: For more complex patterns.
o Radial Basis Function (RBF) Kernel: Popular for non-linear data due
to its flexibility.
o Sigmoid Kernel: Acts like a neural network activation function.
• Advice:
o Use a linear kernel for datasets where features are already linearly
separable or have high dimensionality.
o Use RBF as a default for non-linear data.
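As a quick illustration of this advice, the sketch below trains SVC with each kernel on scikit-learn's make_moons dataset (a non-linear toy problem); the dataset, noise level, and parameter values are assumptions chosen only for demonstration.

```python
# Illustrative sketch: comparing SVM kernels on the same non-linear toy data.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, C=1.0, gamma="scale")
    clf.fit(X_train, y_train)
    print(f"{kernel:8s} test accuracy: {clf.score(X_test, y_test):.2f}")
```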
Hard Margin vs. Soft Margin
So far, we’ve used hard margin classification: all training samples must lie on the “correct side of the street”. This has two drawbacks:
• It only works if the data is linearly separable (left figure).
• It is sensitive to outliers, so the resulting model may not generalize well (right figure).
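A rough sketch of the difference, assuming a very large C can stand in for a hard margin (an approximation, not an exact equivalence); the toy points, including one outlier, and the C values are illustrative.

```python
# Illustrative sketch: approximating a hard margin with a very large C and
# comparing it with a soft margin on data containing one outlier.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2],      # class 0
              [5, 5], [6, 5], [5, 6],      # class 1
              [1.5, 1.5]])                 # outlier labeled as class 1
y = np.array([0, 0, 0, 1, 1, 1, 1])

# Very large C ~ hard margin: the boundary contorts to classify the outlier.
hard = SVC(kernel="linear", C=1e6).fit(X, y)

# Moderate C ~ soft margin: the outlier may sit inside (or across) the margin.
soft = SVC(kernel="linear", C=1.0).fit(X, y)

print("C=1e6 (≈ hard margin) w:", hard.coef_[0])
print("C=1.0 (soft margin)   w:", soft.coef_[0])
```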
SVM Regression
The trick is to reverse the objective: instead of trying to fit the largest possible street between two classes while limiting margin violations, SVM Regression tries to fit as many instances as possible on the street while limiting margin violations (i.e., instances off the street). The width of the street is controlled by the hyperparameter ε (epsilon).
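A minimal sketch of SVM Regression using scikit-learn's SVR, assuming synthetic sine-shaped data and arbitrary epsilon values; a wider street (larger epsilon) typically needs fewer support vectors.

```python
# Illustrative sketch: SVM Regression with scikit-learn's SVR.
# The synthetic data and the epsilon/C values are assumptions for demonstration.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(42)
X = np.sort(rng.uniform(0, 5, size=(60, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=60)

# epsilon sets the width of the "street": points inside it incur no loss.
wide_street = SVR(kernel="rbf", C=1.0, epsilon=0.5).fit(X, y)
narrow_street = SVR(kernel="rbf", C=1.0, epsilon=0.05).fit(X, y)

# A wider street needs fewer support vectors; a narrower one tracks the data more closely.
print("epsilon=0.5  support vectors:", len(wide_street.support_))
print("epsilon=0.05 support vectors:", len(narrow_street.support_))
```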
Multiclass Classification: One-vs-Rest (OVR)
SVM is inherently a binary classifier. The one-vs-rest (OVR) strategy handles multiclass problems by training one binary SVM per class (that class versus all the others) and predicting the class whose classifier produces the highest decision score.
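A minimal sketch of OVR with scikit-learn, assuming the built-in iris dataset and arbitrary parameter values; OneVsRestClassifier wraps SVC and trains one binary classifier per class.

```python
# Illustrative sketch: one-vs-rest SVM for a 3-class problem.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# OneVsRestClassifier trains one binary SVC per class.
ovr = OneVsRestClassifier(SVC(kernel="rbf", C=1.0))
ovr.fit(X, y)

print("Number of binary classifiers:", len(ovr.estimators_))  # 3, one per class
print("Training accuracy:", ovr.score(X, y))
```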
Kernels
• Polynomial: K(a, b) = (γ aᵀb + r)^d
• Gaussian (RBF): K(a, b) = exp(−γ ‖a − b‖²)
• Sigmoid: K(a, b) = tanh(γ aᵀb + r)
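As a cross-check of these formulas, the sketch below computes each kernel both by hand and with scikit-learn's pairwise helpers; the example vectors and the γ, r, d values are arbitrary assumptions.

```python
# Illustrative sketch: the three non-linear kernels computed explicitly and via
# scikit-learn's pairwise helpers. Vectors and parameter values are arbitrary.
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel, sigmoid_kernel

a = np.array([[1.0, 2.0]])
b = np.array([[3.0, 4.0]])
gamma, r, d = 0.5, 1.0, 3

# Polynomial: K(a, b) = (gamma * a.b + r) ** d
print(polynomial_kernel(a, b, degree=d, gamma=gamma, coef0=r))
print((gamma * a @ b.T + r) ** d)

# Gaussian RBF: K(a, b) = exp(-gamma * ||a - b||^2)
print(rbf_kernel(a, b, gamma=gamma))
print(np.exp(-gamma * np.sum((a - b) ** 2)))

# Sigmoid: K(a, b) = tanh(gamma * a.b + r)
print(sigmoid_kernel(a, b, gamma=gamma, coef0=r))
print(np.tanh(gamma * a @ b.T + r))
```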