Lect_07_Distance_Based_Algorithms

This lecture covers K-Nearest Neighbors (KNN) and Support Vector Machines (SVM) as distance-based algorithms in machine learning. It discusses the mechanics, strengths, and weaknesses of KNN, alongside SVM concepts such as hard and soft margins, the kernel trick, and multiclass classification strategies. The document also emphasizes the importance of distance functions and scaling in the context of these algorithms.


MSBA 315

ML & Predictive Analytics


Lecture 07 – Distance-Based Algorithms
(KNN and SVM)
Wael Khreich
[email protected]
Learning Outcomes
• K-Nearest Neighbors
• Classification and Regression
• KNN Pros and Cons

• Support Vector Machines


• Hard vs Soft Margin
• The Kernel trick
• Multi-class SVM
• One-Class SVM
• SVM Pros and Cons

Machine Learning Pipeline

[Pipeline diagram: Model(s) -> Training & Evaluation. Train with Model.fit(X_train, y_train); evaluate the predictions from Model.predict(X_valid) against y_valid.]
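As a minimal sketch of this fit/predict workflow in scikit-learn (the dataset, split, and KNN model below are illustrative choices, not from the slides):

# Minimal fit/predict sketch (illustrative dataset and model; not from the slides)
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = KNeighborsClassifier(n_neighbors=5)   # example model choice
model.fit(X_train, y_train)                   # training step
y_pred = model.predict(X_valid)               # prediction step
print("Validation accuracy:", accuracy_score(y_valid, y_pred))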
Nearest Neighbors
• The k-nearest neighbors (KNN) algorithm is a simple, supervised
machine learning algorithm that can be used to solve both
classification and regression problems
• KNN is non-parametric: it stores all the available training data and classifies a
new data point based on the similarity of its nearest neighbors

Nearest Neighbors - Motivations

Distance Functions

For two points A(x1, y1) and B(x2, y2):

• Manhattan distance: |x1 − x2| + |y1 − y2|
• Euclidean distance: √((x1 − x2)² + (y1 − y2)²)
• Chebyshev distance: max{|x1 − x2|, |y1 − y2|}
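A small illustrative sketch computing the three distances above for two example 2-D points (the values are made up):

# Manhattan, Euclidean, and Chebyshev distances between two example 2-D points
import numpy as np

A = np.array([1.0, 2.0])   # point A(x1, y1)
B = np.array([4.0, 6.0])   # point B(x2, y2)

manhattan = np.sum(np.abs(A - B))          # |x1 - x2| + |y1 - y2|        -> 7.0
euclidean = np.sqrt(np.sum((A - B) ** 2))  # sqrt((x1-x2)^2 + (y1-y2)^2)  -> 5.0
chebyshev = np.max(np.abs(A - B))          # max(|x1-x2|, |y1-y2|)        -> 4.0
print(manhattan, euclidean, chebyshev)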
Scaling - Review

Normalization scales each input variable separately to the range [0, 1], the standard range for floating-point values:

  x_normalized = (x − min) / (max − min)

Standardization rescales the distribution of values so that the mean of the observed values is 0 and the standard deviation is 1:

  x_standardized = (x − μ) / σ
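A brief sketch of both scaling options using scikit-learn (the toy feature matrix is an assumption for illustration):

# Normalization (min-max) and standardization (z-score) with scikit-learn
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])               # toy feature matrix

X_norm = MinMaxScaler().fit_transform(X)   # each column rescaled to [0, 1]
X_std = StandardScaler().fit_transform(X)  # each column to mean 0, std 1
print(X_norm)
print(X_std)
# In a real pipeline, fit the scaler on the training set only and reuse it on new data.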
Impact of K

The nearest-neighbor rule is a sub-optimal procedure:

• It does not achieve the Bayes error rate
• Yet, asymptotically, its error is never worse than twice the Bayes error rate
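Since k is a hyperparameter, it is usually tuned empirically; a minimal sketch of selecting k by cross-validation (the candidate values and dataset are illustrative):

# Choosing k by 5-fold cross-validation (candidate values are illustrative)
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11]}   # candidate k values
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print("Best k:", search.best_params_["n_neighbors"])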
KNN – Strengths and Weaknesses
+ Very simple but often surprisingly effective
+ Fast training phase (just need to store the training set) - lazy learner
+ Non-parametric approach that can make use of large amounts of data

– Does not produce a model (always needs access to the training data)
– Does not work well with high-dimensional or very large datasets
– Computing the distance between a new point and every stored point becomes very expensive
– Suffers from the curse of dimensionality
– Does not work well with categorical features
– Requires choosing a suitable value of k (hyperparameter)
– Sensitive to feature scaling
Support Vector Machines

What are Support Vector Machines?
• Support vector machines (SVMs) are a
set of supervised learning methods used
for:
• Classification (binary)
• Regression
• Outlier detection

• Main Ideas:
• Use a hyperplane to separate the examples
• Choose the hyperplane with the maximum margin

Hyperplanes
• Separates a P-dimensional space into two half-spaces: positive (+1) and negative (−1)
• Defined by a normal vector w ∈ ℝ^P of weights {w1, w2, …, wP} pointing towards the positive half-space
• Equation of the hyperplane: wᵀx + b = 0
• b is a single number called the bias term
  • b = 0: the hyperplane passes through the origin
  • b ≠ 0: the hyperplane is shifted parallel to itself along the direction of w
• x is the input vector
[Figure: a hyperplane wᵀx + b = 0 with normal vector w separating the Class +1 and Class −1 half-spaces]
Hyperplanes
• Signed distance from the hyperplane to an input point x_n:
  γ_n = (wᵀx_n + b) / ||w||
• γ is a signed distance:
  • positive if the point is on the same side of the hyperplane as the normal vector w
  • negative if it is on the opposite side
  • zero if the point lies exactly on the hyperplane
• The Euclidean norm ||w|| is the magnitude (length) of the weight vector w:
  ||w|| = √(w1² + w2² + … + wP²)
• |b| / ||w|| is the distance from the origin to the hyperplane
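A quick numeric sketch of the signed-distance formula (the weight vector, bias, and points are made-up values):

# Signed distance of points to the hyperplane w^T x + b = 0
import numpy as np

w = np.array([3.0, 4.0])            # assumed weight (normal) vector, ||w|| = 5
b = -5.0                            # assumed bias term
points = np.array([[3.0, 4.0],      # on the positive side of the hyperplane
                   [0.0, 0.0],      # on the negative side (the origin here)
                   [0.6, 0.8]])     # lies exactly on the hyperplane

gamma = (points @ w + b) / np.linalg.norm(w)   # gamma_n = (w^T x_n + b) / ||w||
print(gamma)                                   # [ 4. -1.  0.]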
Maximum Margin Hyperplane
• SVM is a hyperplane-based (linear) classifier that ensures a large margin around the hyperplane
• Objective: find the hyperplane with the
largest margin
• maximum margin hyperplane
• Finds the support vectors and
their weights
• Support vectors are the most important
examples
• Prediction: y = sign(wᵀx_test + b)
Hard Margin SVM
• For a hard margin, we want all points to satisfy the margin constraint:
  y_n (wᵀx_n + b) ≥ 1 for every training example (x_n, y_n)
• The objective for hard-margin SVM:
  minimize ½ ||w||² subject to y_n (wᵀx_n + b) ≥ 1 for all n
• Minimizing ||w|| is equivalent to maximizing the margin
• Solution: using quadratic optimization tools
Source: Lecture by Piyush Rai
Soft Margin SVM
• Relax the hard constraint (allow some misclassification)
• x_n can violate the margin by ξ_n ≥ 0
• The slack variable ξ_n equals the distance by which the point x_n falls on the wrong side of its margin boundary:
  • ξ_n = 0: correct side of the margin
  • ξ_n > 0: wrong side of the margin
  • ξ_n > 1: wrong side of the hyperplane
Source: Lecture by Piyush Rai
Soft Margin SVM
• Maximize the margin while minimizing the sum of slacks Σ_{n=1..N} ξ_n (the total training error):
  minimize ½ ||w||² + C Σ_{n=1..N} ξ_n subject to y_n (wᵀx_n + b) ≥ 1 − ξ_n, ξ_n ≥ 0
• C is a cost value associated with the points that violate the margin (hyperparameter)
• Solution: using quadratic optimization tools
Source: Lecture by Piyush Rai
Soft Margin SVM
• When the cost C is large:
  • The model prioritizes minimizing training error
  • It will try harder to perfectly separate the training data, even if that results in a smaller margin
  • The decision boundary becomes more complex to accommodate the training points
  • This leads to low bias and high variance (risk of overfitting)
• When the cost C is small:
  • The model prioritizes maximizing the margin, allowing some training errors (misclassifications)
  • It generalizes better to new data by keeping a simpler decision boundary
  • This leads to higher bias and lower variance (risk of underfitting)
• C controls the bias-variance trade-off:
  • Large C: lower bias, higher variance
  • Small C: higher bias, lower variance
Source: Lecture by Piyush Rai
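A minimal sketch of how C might be compared in practice (the toy dataset and candidate values are assumptions):

# Cross-validated accuracy of a linear SVM for a few values of C (toy data)
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
for C in [0.01, 1, 100]:
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=C))
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"C={C}: mean CV accuracy = {scores.mean():.3f}")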
Non-linear Spaces
• We need a mapping from the original space into a higher-dimensional space, where the non-linear relationship appears linear
• Increase the dimensionality (add new features) by applying a non-linear mapping Φ(x)
Source: Lecture by Yann LeCun


Kernel Trick
• The kernel trick provides an efficient mapping from the original space into a higher-dimensional space
• It allows us to implicitly map data into a higher-dimensional space without actually computing the mapping
• A kernel function takes two vectors as input and outputs a scalar: the similarity between the two vectors
Source: Lecture by Yann LeCun
Kernel Trick
• The kernel function in the original space (a similarity measure between two instances):
  K(x_i, x_j)
• Becomes an inner product in the higher-dimensional feature space:
  K(x_i, x_j) = Φ(x_i) · Φ(x_j)
• Without having to explicitly compute Φ(x)
• Instead of explicitly computing Φ(x), we directly use the polynomial kernel K
Source: Lecture by Yann LeCun
Kernel Trick
• The feature space can be high-dimensional or even infinite-dimensional
• Calculating Φ(x) for each training example is inefficient in high dimensions and impossible in infinite dimensions
• Computing the kernel means computing the inner product between the images of the data points: K(x_i, x_j) = Φ(x_i)ᵀΦ(x_j)

Kernel Matrix (Gram Matrix)
• Each element K(x_i, x_j) represents the similarity between the i-th and j-th data points, computed using the kernel function
• The matrix is symmetric
• The diagonal represents the self-similarity
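An illustrative sketch of building a kernel (Gram) matrix directly, without computing Φ(x) (the data and kernel parameters are assumptions):

# Kernel (Gram) matrices computed directly from the data, no explicit feature map
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel

X = np.random.RandomState(0).randn(5, 3)   # 5 toy points in 3 dimensions

K_rbf = rbf_kernel(X, gamma=0.5)           # K[i, j] = exp(-gamma * ||x_i - x_j||^2)
K_poly = polynomial_kernel(X, degree=3)    # K[i, j] = (gamma * x_i . x_j + coef0)^3

print(K_rbf.shape)                         # (5, 5)
print(np.allclose(K_rbf, K_rbf.T))         # True: the matrix is symmetric
print(np.diag(K_rbf))                      # all ones: self-similarity under the RBF kernel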
Kernel Trick – Common Kernels

[Table of common kernels, e.g., linear, polynomial, RBF (Gaussian), and sigmoid]

The kernel trick is also used in: Kernel PCA, Kernel Gaussian Processes, Kernel K-Means, etc.
SVM Decision Function

Decision function: f(x) = sign( Σ_i α_i* y_i K(x_i, x) + b )
• x is the test point to classify
• The Lagrange multipliers α_i* are proportional to the weights of the support vectors x_i
• y_i are the labels of these support vectors
• K(x_i, x) is the kernel function, which measures the similarity between the support vectors x_i and the test point x
• b is the bias term
• The decision function f(x) calculates a weighted sum of similarities between the test point x and the support vectors x_i
• The sign of this sum determines the predicted class label for the test point (positive or negative)
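A sketch reproducing this decision function from a fitted scikit-learn SVC, assuming its convention that dual_coef_ stores α_i* y_i for the support vectors (the dataset and kernel settings are illustrative):

# Reproducing f(x) = sum_i alpha_i* y_i K(x_i, x) + b from a fitted SVC
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = SVC(kernel="rbf", gamma=0.2, C=1.0).fit(X, y)

x_test = X[:3]                                              # a few points to score
K = rbf_kernel(x_test, clf.support_vectors_, gamma=0.2)     # K(x_i, x) for each support vector
manual = K @ clf.dual_coef_.ravel() + clf.intercept_        # weighted sum of similarities + b
print(np.allclose(manual, clf.decision_function(x_test)))   # True
print(np.sign(manual))                                      # predicted side of the hyperplane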
Support Vector Regression (SVR)
In regression problems, we typically minimize the mean-squared error:
  Σ_i (y_i − (wᵀx_i + b))²

In SVR, we minimize the absolute error (robust to outliers):
  Σ_i |y_i − (wᵀx_i + b)|

We cannot require that all points be approximated correctly (overfitting!). SVR tries to find a hyperplane that captures the general trend of the data while allowing for some amount of noise or outliers.
Support Vector Regression (SVR)
• To allow errors in SVR (and make it robust to noise), we use the ε-insensitive loss:
  L(y_i, x_i) = 0                          if |y_i − (wᵀx_i + b)| ≤ ε
  L(y_i, x_i) = |y_i − (wᵀx_i + b)| − ε    otherwise
[Figure: the ε-insensitive tube around the regression function; only points whose target value falls outside the tube incur a loss]
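A minimal SVR sketch where epsilon sets the width of the no-penalty tube (the toy 1-D data is an assumption):

# SVR with an epsilon-insensitive tube (toy 1-D regression data)
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)   # noisy targets

# epsilon sets the width of the no-penalty tube; C penalizes points outside it
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
print("Number of support vectors:", len(svr.support_vectors_))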
Multiclass SVMs – OVO
One-vs-One (OVO)

• Build binary (2-class) classifiers for every pair of classes
• Train C(K, 2) = K(K − 1)/2 different SVMs, one per pair of classes
• Predict the class that receives the most "votes" across the classifiers
• Pros & Cons:
  + Less sensitive to class imbalance
  − Requires training a large number of SVMs:
    K = 10 classes -> 45 SVMs
    K = 20 classes -> 190 SVMs
Multiclass SVMs – OVR
One-vs-Rest (OVR), also called One-vs-All
• Build K binary classifiers
• Train the k-th classifier using the data from class C_k as positive examples and the data from the remaining K − 1 classes as negative examples
• Predict the class with the largest score among the trained SVMs
• Pros & Cons:
  + More efficient
  − More sensitive to class imbalance
• Default in sklearn
Source: krishijagran.com
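A brief sketch of both strategies using scikit-learn's generic wrappers (the dataset is illustrative; note that SVC itself already applies OVO internally):

# OVO vs OVR wrappers around a binary SVM (3-class toy problem)
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                          # K = 3 classes

ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)   # K(K-1)/2 = 3 pairwise SVMs
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)  # K = 3 one-vs-rest SVMs
print(len(ovo.estimators_), len(ovr.estimators_))          # 3 3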
One-Class SVM
• Two popular approaches:

1. SVDD: Support Vector Data Description (Tax and Duin, 2004)
  • Attempts to fit the smallest possible hypersphere around the data
  • Assumes the positives lie within a ball with the smallest possible radius R, allowing some slacks ξ_n
  • The algorithm tries to minimize the number and magnitude of these slacks

2. OC-SVM: One-Class SVM (Schölkopf et al., 2001)
  • Finds a max-margin hyperplane separating the positives from the origin (which represents the negatives)
  • Starts by mapping the input data into a higher-dimensional feature space using a kernel function
  • Tries to find the optimal hyperplane that maximizes the margin while minimizing the number and magnitude of the slack variables
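A minimal one-class SVM sketch for novelty detection with scikit-learn (the training data and the nu/gamma settings are assumptions):

# One-class SVM for outlier/novelty detection (toy data)
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = 0.3 * rng.randn(200, 2)                 # "normal" data around the origin
X_test = np.array([[0.1, -0.2], [3.0, 3.0]])      # one inlier and one obvious outlier

oc_svm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)
print(oc_svm.predict(X_test))                     # +1 for inliers, -1 for outliers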
SVMs – Pros and Cons
• Strengths:
• Works well for small datasets
• Appears to avoid overfitting in high-dimensional spaces
• Effective when number of dimensions is greater than number of samples
• Memory efficient (during testing): The complexity of SVM is characterized by
the number of support vectors
• Weaknesses:
• Sensitive to its hyperparameters
• Time-consuming to choose the best kernel (and its parameters)
• Training time can be expensive for large datasets
• Computing the kernel can be expensive
  • Mitigation: keep a linear kernel and manually engineer the features before training
• Multiclass SVMs require repeated training (OVR or OVO)
• Difficult to interpret (especially when using kernels)
Take Away
• SVM finds the hyperplane that maximizes the margin
• Introduce soft margin for noisy data
• Implicit mapping to higher dimensional space
• The kernel trick enables an efficient mapping from the input space to a higher-dimensional feature space by making the dot product in the feature space easy to compute
• Scale the inputs to SVM so that features with large variance do not dominate those with smaller variance
  • Scale each feature to [0, 1] or [-1, 1]
Resources
• LIBSVM -- A Library for Support Vector Machines (see demo)
• Sklearn
• http://www.kernel-machines.org/
• https://www.svm-tutorial.com/
