Lect_07_Distance_Based_Algorithms

This lecture covers K-Nearest Neighbors (KNN) and Support Vector Machines (SVM) as distance-based algorithms in machine learning. It discusses the mechanics, strengths, and weaknesses of KNN, alongside SVM concepts such as hard and soft margins, the kernel trick, and multiclass classification strategies. The document also emphasizes the importance of distance functions and scaling in the context of these algorithms.


MSBA 315

ML & Predictive Analytics


Lecture 07 – Distance-Based Algorithms
(KNN and SVM)
Wael Khreich
[email protected]
Learning Outcomes
• K-Nearest Neighbors
• Classification and Regression
• KNN Pros and Cons

• Support Vector Machines


• Hard vs Soft Margin
• The Kernel trick
• Multi-class SVM
• One-Class SVM
• SVM Pros and Cons

Machine Learning Pipeline

[Pipeline diagram: Model(s) -> Training & Evaluation. Train with Model.fit(X_train, y_train); evaluate the predictions from Model.predict(X_valid) against y_valid.]
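As a minimal sketch of this fit/predict workflow in scikit-learn (the dataset, split, and KNN model below are illustrative choices, not from the slides):

# Minimal fit/predict sketch (illustrative dataset and model; not from the slides)
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = KNeighborsClassifier(n_neighbors=5)   # example model choice
model.fit(X_train, y_train)                   # training step
y_pred = model.predict(X_valid)               # prediction step
print("Validation accuracy:", accuracy_score(y_valid, y_pred))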
Nearest Neighbors
• The k-nearest neighbors (KNN) algorithm is a simple, supervised
machine learning algorithm that can be used to solve both
classification and regression problems
• KNN is non-parametric: it stores all the available training data and classifies a
new data point based on the similarity of its nearest neighbors

Nearest Neighbors - Motivations

Distance Functions

For two points A(x1, y1) and B(x2, y2):

• Manhattan distance: |x1 − x2| + |y1 − y2|
• Euclidean distance: √((x1 − x2)² + (y1 − y2)²)
• Chebyshev distance: max{|x1 − x2|, |y1 − y2|}
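A small illustrative sketch computing the three distances above for two example 2-D points (the values are made up):

# Manhattan, Euclidean, and Chebyshev distances between two example 2-D points
import numpy as np

A = np.array([1.0, 2.0])   # point A(x1, y1)
B = np.array([4.0, 6.0])   # point B(x2, y2)

manhattan = np.sum(np.abs(A - B))          # |x1 - x2| + |y1 - y2|        -> 7.0
euclidean = np.sqrt(np.sum((A - B) ** 2))  # sqrt((x1-x2)^2 + (y1-y2)^2)  -> 5.0
chebyshev = np.max(np.abs(A - B))          # max(|x1-x2|, |y1-y2|)        -> 4.0
print(manhattan, euclidean, chebyshev)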
Scaling - Review

Normalization scales each input variable separately to the range [0, 1], the standard range for floating-point values:

  x_normalized = (x − min) / (max − min)

Standardization rescales the distribution of values so that the mean of the observed values is 0 and the standard deviation is 1:

  x_standardized = (x − μ) / σ
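A brief sketch of both scaling options using scikit-learn (the toy feature matrix is an assumption for illustration):

# Normalization (min-max) and standardization (z-score) with scikit-learn
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])               # toy feature matrix

X_norm = MinMaxScaler().fit_transform(X)   # each column rescaled to [0, 1]
X_std = StandardScaler().fit_transform(X)  # each column to mean 0, std 1
print(X_norm)
print(X_std)
# In a real pipeline, fit the scaler on the training set only and reuse it on new data.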
Impact of K

The nearest-neighbor rule is a sub-optimal procedure:

• It does not achieve the Bayes error rate
• Yet, asymptotically, its error is never worse than twice the Bayes error rate
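Since k is a hyperparameter, it is usually tuned empirically; a minimal sketch of selecting k by cross-validation (the candidate values and dataset are illustrative):

# Choosing k by 5-fold cross-validation (candidate values are illustrative)
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11]}   # candidate k values
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print("Best k:", search.best_params_["n_neighbors"])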
KNN – Strengths and Weaknesses
+ Very simple but often surprisingly effective
+ Fast training phase (just need to store the training set) - lazy learner
+ Non-parametric approach that can make use of large amounts of data

– Does not produce a model (always needs access to the training data)
– Does not work well with high-dimensional or very large datasets
– Computing the distance between a new point and every stored point becomes very expensive
– Suffers from the curse of dimensionality
– Does not work well with categorical features
– Requires choosing a suitable value of k (hyperparameter)
– Sensitive to feature scaling
Support Vector Machines

What are Support Vector Machines?
• Support vector machines (SVMs) are a
set of supervised learning methods used
for:
• Classification (binary)
• Regression
• Outlier detection

• Main Ideas:
• Use a hyperplane to separate the examples
• Choose the hyperplane with the maximum margin

Hyperplanes
• Separates a P-dimensional space into two half-spaces: positive (+1) and negative (−1)
• Defined by a normal vector w ∈ ℝ^P of weights {w1, w2, …, wP} pointing towards the positive half-space
• Equation of the hyperplane: wᵀx + b = 0
• b is a single number called the bias term
  • b = 0: the hyperplane passes through the origin
  • b ≠ 0: the hyperplane is shifted parallel to itself along the direction of w
• x is the input vector
[Figure: a hyperplane wᵀx + b = 0 with normal vector w separating the Class +1 and Class −1 half-spaces]
Hyperplanes
• Signed distance from the hyperplane to an input point x_n:
  γ_n = (wᵀx_n + b) / ||w||
• γ is a signed distance:
  • positive if the point is on the same side of the hyperplane as the normal vector w
  • negative if it is on the opposite side
  • zero if the point lies exactly on the hyperplane
• The Euclidean norm ||w|| is the magnitude (length) of the weight vector w:
  ||w|| = √(w1² + w2² + … + wP²)
• |b| / ||w|| is the distance from the origin to the hyperplane
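A quick numeric sketch of the signed-distance formula (the weight vector, bias, and points are made-up values):

# Signed distance of points to the hyperplane w^T x + b = 0
import numpy as np

w = np.array([3.0, 4.0])            # assumed weight (normal) vector, ||w|| = 5
b = -5.0                            # assumed bias term
points = np.array([[3.0, 4.0],      # on the positive side of the hyperplane
                   [0.0, 0.0],      # on the negative side (the origin here)
                   [0.6, 0.8]])     # lies exactly on the hyperplane

gamma = (points @ w + b) / np.linalg.norm(w)   # gamma_n = (w^T x_n + b) / ||w||
print(gamma)                                   # [ 4. -1.  0.]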
Maximum Margin Hyperplane
• SVM is a hyperplane-based (linear) classifier that ensures a large margin around the hyperplane
• Objective: find the hyperplane with the
largest margin
• maximum margin hyperplane
• Finds the support vectors and
their weights
• Support vectors are the most important
examples
• Prediction: y = sign(wᵀx_test + b)
Hard Margin SVM
• For a hard margin, we want all points to satisfy the margin constraint:
  y_n (wᵀx_n + b) ≥ 1 for every training example (x_n, y_n)
• The objective for hard-margin SVM:
  minimize ½ ||w||² subject to y_n (wᵀx_n + b) ≥ 1 for all n
• Minimizing ||w|| is equivalent to maximizing the margin
• Solution: using quadratic optimization tools
Source: Lecture by Piyush Rai
Soft Margin SVM
• Relax the hard constraint (allow some misclassification)
• x_n can violate the margin by ξ_n ≥ 0
• The slack variable ξ_n equals the distance by which the point x_n falls on the wrong side of its margin boundary:
  • ξ_n = 0: correct side of the margin
  • ξ_n > 0: wrong side of the margin
  • ξ_n > 1: wrong side of the hyperplane
Source: Lecture by Piyush Rai
Soft Margin SVM
• Maximize the margin while minimizing the sum of slacks Σ_{n=1..N} ξ_n (the total training error):
  minimize ½ ||w||² + C Σ_{n=1..N} ξ_n subject to y_n (wᵀx_n + b) ≥ 1 − ξ_n, ξ_n ≥ 0
• C is a cost value associated with the points that violate the margin (hyperparameter)
• Solution: using quadratic optimization tools
Source: Lecture by Piyush Rai
Soft Margin SVM
• When the cost C is large:
  • The model prioritizes minimizing training error
  • It will try harder to perfectly separate the training data, even if that results in a smaller margin
  • The decision boundary becomes more complex to accommodate the training points
  • This leads to low bias and high variance (risk of overfitting)
• When the cost C is small:
  • The model prioritizes maximizing the margin, allowing some training errors (misclassifications)
  • It generalizes better to new data by keeping a simpler decision boundary
  • This leads to higher bias and lower variance (risk of underfitting)
• C controls the bias-variance trade-off:
  • Large C: lower bias, higher variance
  • Small C: higher bias, lower variance
Source: Lecture by Piyush Rai
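A minimal sketch of how C might be compared in practice (the toy dataset and candidate values are assumptions):

# Cross-validated accuracy of a linear SVM for a few values of C (toy data)
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
for C in [0.01, 1, 100]:
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=C))
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"C={C}: mean CV accuracy = {scores.mean():.3f}")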
Non-linear Spaces
• We need a mapping from the original space into a higher-dimensional space, where the non-linear relationship appears linear
• Increase the dimensionality (add new features) by applying a non-linear mapping Φ(x)
Source: Lecture by Yann LeCun


Kernel Trick
• The kernel trick provides an efficient mapping from the original space into a higher-dimensional space
• It allows us to implicitly map data into a higher-dimensional space without actually computing the mapping
• A kernel function takes two vectors as input and outputs a scalar: the similarity between the two vectors
Source: Lecture by Yann LeCun
Kernel Trick
• The kernel function in the original space (a similarity measure between two instances):
  K(x_i, x_j)
• Becomes an inner product in the higher-dimensional feature space:
  K(x_i, x_j) = Φ(x_i) · Φ(x_j)
• Without having to explicitly compute Φ(x)
• Instead of explicitly computing Φ(x), we directly use the polynomial kernel K
Source: Lecture by Yann LeCun
Kernel Trick
• The feature space can be high-dimensional or even infinite-dimensional
• Calculating Φ(x) for each training example is inefficient in high dimensions and impossible in infinite dimensions
• Computing the kernel means computing the inner product between the images of the data points: K(x_i, x_j) = Φ(x_i)ᵀΦ(x_j)

Kernel Matrix (Gram Matrix)
• Each element K(x_i, x_j) represents the similarity between the i-th and j-th data points, computed using the kernel function
• The matrix is symmetric
• The diagonal represents the self-similarity
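An illustrative sketch of building a kernel (Gram) matrix directly, without computing Φ(x) (the data and kernel parameters are assumptions):

# Kernel (Gram) matrices computed directly from the data, no explicit feature map
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel

X = np.random.RandomState(0).randn(5, 3)   # 5 toy points in 3 dimensions

K_rbf = rbf_kernel(X, gamma=0.5)           # K[i, j] = exp(-gamma * ||x_i - x_j||^2)
K_poly = polynomial_kernel(X, degree=3)    # K[i, j] = (gamma * x_i . x_j + coef0)^3

print(K_rbf.shape)                         # (5, 5)
print(np.allclose(K_rbf, K_rbf.T))         # True: the matrix is symmetric
print(np.diag(K_rbf))                      # all ones: self-similarity under the RBF kernel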
Kernel Trick – Common Kernels

[Table of common kernels, e.g., linear, polynomial, RBF (Gaussian), and sigmoid]

The kernel trick is also used in: Kernel PCA, Kernel Gaussian Processes, Kernel K-Means, etc.
SVM Decision Function

Decision function: f(x) = sign( Σ_i α_i* y_i K(x_i, x) + b )
• x is the test point to classify
• The Lagrange multipliers α_i* are proportional to the weights of the support vectors x_i
• y_i are the labels of these support vectors
• K(x_i, x) is the kernel function, which measures the similarity between the support vectors x_i and the test point x
• b is the bias term
• The decision function f(x) calculates a weighted sum of similarities between the test point x and the support vectors x_i
• The sign of this sum determines the predicted class label for the test point (positive or negative)
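A sketch reproducing this decision function from a fitted scikit-learn SVC, assuming its convention that dual_coef_ stores α_i* y_i for the support vectors (the dataset and kernel settings are illustrative):

# Reproducing f(x) = sum_i alpha_i* y_i K(x_i, x) + b from a fitted SVC
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = SVC(kernel="rbf", gamma=0.2, C=1.0).fit(X, y)

x_test = X[:3]                                              # a few points to score
K = rbf_kernel(x_test, clf.support_vectors_, gamma=0.2)     # K(x_i, x) for each support vector
manual = K @ clf.dual_coef_.ravel() + clf.intercept_        # weighted sum of similarities + b
print(np.allclose(manual, clf.decision_function(x_test)))   # True
print(np.sign(manual))                                      # predicted side of the hyperplane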
Support Vector Regression (SVR)
In regression problems, we typically minimize the mean-squared error:
  Σ_i (y_i − (wᵀx_i + b))²

In SVR, we minimize the absolute error (robust to outliers):
  Σ_i |y_i − (wᵀx_i + b)|

We cannot require that all points be approximated correctly (overfitting!). SVR tries to find a hyperplane that captures the general trend of the data while allowing for some amount of noise or outliers.
Support Vector Regression (SVR)
• To allow errors in SVR (and make it robust to noise), we use the ε-insensitive loss:
  L(y_i, x_i) = 0                          if |y_i − (wᵀx_i + b)| ≤ ε
  L(y_i, x_i) = |y_i − (wᵀx_i + b)| − ε    otherwise
[Figure: the ε-insensitive tube around the regression function; only points whose target value falls outside the tube incur a loss]
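A minimal SVR sketch where epsilon sets the width of the no-penalty tube (the toy 1-D data is an assumption):

# SVR with an epsilon-insensitive tube (toy 1-D regression data)
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)   # noisy targets

# epsilon sets the width of the no-penalty tube; C penalizes points outside it
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
print("Number of support vectors:", len(svr.support_vectors_))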
Multiclass SVMs – OVO
One-vs-One (OVO)

• Build binary (2-class) classifiers for every pair of classes
• Train C(K, 2) = K(K − 1)/2 different SVMs, one per pair of classes
• Predict the class that receives the most "votes" across the classifiers
• Pros & Cons:
  + Less sensitive to class imbalance
  − Requires training a large number of SVMs:
    K = 10 classes -> 45 SVMs
    K = 20 classes -> 190 SVMs
Multiclass SVMs – OVR
One-vs-Rest (OVR), also called One-vs-All
• Build K binary classifiers
• Train the k-th classifier using the data from class C_k as positive examples and the data from the remaining K − 1 classes as negative examples
• Predict the class with the largest score among the trained SVMs
• Pros & Cons:
  + More efficient
  − More sensitive to class imbalance
• Default in sklearn
Source: krishijagran.com
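A brief sketch of both strategies using scikit-learn's generic wrappers (the dataset is illustrative; note that SVC itself already applies OVO internally):

# OVO vs OVR wrappers around a binary SVM (3-class toy problem)
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                          # K = 3 classes

ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)   # K(K-1)/2 = 3 pairwise SVMs
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)  # K = 3 one-vs-rest SVMs
print(len(ovo.estimators_), len(ovr.estimators_))          # 3 3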
One-Class SVM
• Two popular approaches:

1. SVDD: Support Vector Data Description (Tax and Duin, 2004)
  • Attempts to fit the smallest possible hypersphere around the data
  • Assumes the positives lie within a ball with the smallest possible radius R, allowing some slacks ξ_n
  • The algorithm tries to minimize the number and magnitude of these slacks

2. OC-SVM: One-Class SVM (Schölkopf et al., 2001)
  • Finds a max-margin hyperplane separating the positives from the origin (which represents the negatives)
  • Starts by mapping the input data into a higher-dimensional feature space using a kernel function
  • Tries to find the optimal hyperplane that maximizes the margin while minimizing the number and magnitude of the slack variables
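A minimal one-class SVM sketch for novelty detection with scikit-learn (the training data and the nu/gamma settings are assumptions):

# One-class SVM for outlier/novelty detection (toy data)
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = 0.3 * rng.randn(200, 2)                 # "normal" data around the origin
X_test = np.array([[0.1, -0.2], [3.0, 3.0]])      # one inlier and one obvious outlier

oc_svm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)
print(oc_svm.predict(X_test))                     # +1 for inliers, -1 for outliers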
SVMs – Pros and Cons
• Strengths:
• Works well for small datasets
• Appears to avoid overfitting in high-dimensional spaces
• Effective when number of dimensions is greater than number of samples
• Memory efficient (during testing): The complexity of SVM is characterized by
the number of support vectors
• Weaknesses:
• Sensitive to its hyperparameters
• Time-consuming to choose the best kernel (and its parameters)
• Training time can be expensive for large datasets
• Computing the kernel can be expensive
  • Mitigation: keep a linear kernel and manually engineer the features before training
• Multiclass SVMs require repeated training (OVR or OVO)
• Difficult to interpret (especially when using kernels)
Take Away
• SVM finds the hyperplane that maximizes the margin
• Introduce soft margin for noisy data
• Implicit mapping to higher dimensional space
• The kernel trick enables an efficient mapping from the input space to a higher-dimensional feature space by making the dot product in the feature space easy to compute
• Scale the inputs to SVM so that features with large variance do not dominate those with smaller variance
  • Scale each feature to [0, 1] or [-1, 1]
Resources
• LIBSVM -- A Library for Support Vector Machines (see demo)
• Sklearn
• http://www.kernel-machines.org/
• https://www.svm-tutorial.com/
