Lect_07_Distance_Based_Algorithms
Machine Learning Pipeline
[Pipeline diagram] Training & evaluation of the model(s): Model.fit(X_train, y_train) trains the model; predictions on X_valid are then evaluated against y_valid.
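A minimal sklearn-style sketch of this fit/evaluate loop (the dataset, split, and choice of model are illustrative assumptions, not from the slides):

# Illustrative fit/evaluate loop: train on (X_train, y_train), evaluate on (X_valid, y_valid).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

model = KNeighborsClassifier()
model.fit(X_train, y_train)               # Model.fit(X_train, y_train)
y_pred = model.predict(X_valid)           # Model.predict(X_valid)
print(model.score(X_valid, y_valid))      # validation accuracy against y_valid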
Nearest Neighbors
• The k-nearest neighbors (KNN) algorithm is a simple, supervised
machine learning algorithm that can be used to solve both
classification and regression problems
• KNN is non-parametric: it stores all the available data and classifies a new data point based on its neighborhood similarity
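As a minimal sketch (the toy data and k = 3 are illustrative assumptions), this is how KNN classification looks in scikit-learn:

# KNN classification: "training" just stores the data; prediction is a vote among neighbors.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1, 1], [1, 2], [2, 2], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_train, y_train)                      # stores the training points
print(knn.predict([[2, 1], [9, 9]]))           # majority vote of the 3 nearest neighbors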
Nearest Neighbors - Motivations
Distance Functions
Distance Functions: A(x₁, y₁) and B(x₂, y₂)
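The distance formulas on this slide did not survive extraction; as a sketch, the two most common choices for points A(x₁, y₁) and B(x₂, y₂) are the Euclidean and Manhattan distances:

# Common distance functions between A(x1, y1) and B(x2, y2) -- small illustrative values.
import numpy as np

A = np.array([1.0, 2.0])   # (x1, y1)
B = np.array([4.0, 6.0])   # (x2, y2)

euclidean = np.sqrt(np.sum((A - B) ** 2))   # sqrt((x1-x2)^2 + (y1-y2)^2) -> 5.0
manhattan = np.sum(np.abs(A - B))           # |x1-x2| + |y1-y2|           -> 7.0
print(euclidean, manhattan)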
Scaling - Review
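As a brief review sketch (the feature values are illustrative), these are the two scalers most commonly applied before distance-based models:

# Rescale features so that no single feature dominates the distance computation.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])  # features on very different scales

print(MinMaxScaler().fit_transform(X))    # rescales each feature to [0, 1]
print(StandardScaler().fit_transform(X))  # rescales each feature to mean 0, std 1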
Impact of K
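A hedged sketch of how k is handled in practice: a small k gives a flexible, high-variance boundary, a large k a smoother, high-bias one, so k is usually selected by cross-validation (the data and grid below are illustrative):

# Tune n_neighbors with cross-validation to balance bias and variance.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 11, 21]}, cv=5)
search.fit(X, y)
print(search.best_params_)                 # k with the best cross-validated accuracy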
What Are Support Vector Machines?
• Support vector machines (SVMs) are a
set of supervised learning methods used
for:
• Classification (binary)
• Regression
• Outlier detection
• Main Ideas:
• Use a hyperplane to separate the examples
• Choose the hyperplane with maximum margin
Hyperplanes
• Separates a P-dimensional space into two half-spaces: a positive one (+1) and a negative one (-1)
• Defined by a normal vector 𝒘 ∈ ℝ^P of weights {w₁, w₂, …, w_P} pointing towards the positive half-space
• Equation of the hyperplane: 𝒘ᵀx + b = 0
• b is a single number called the bias term
• b = 0: the hyperplane passes through the origin
• b ≠ 0: the hyperplane is shifted parallel to itself away from the origin (by a distance |b| / ‖𝒘‖ along 𝒘)
• x is the input vector
[Figure: a hyperplane 𝒘ᵀx + b = 0 with normal vector 𝒘 separating Class +1 from Class -1]
Hyperplanes
• Distance to an input point xₙ from the hyperplane:
γₙ = (𝒘ᵀxₙ + b) / ‖𝒘‖
• γₙ is a signed distance:
• positive if the point is on the same side of the hyperplane as the normal vector 𝒘
• negative if it is on the opposite side
• zero if the point lies exactly on the hyperplane
• The Euclidean norm ‖𝒘‖ is the magnitude (length) of the weight vector 𝒘:
‖𝒘‖ = √(w₁² + w₂² + … + w_P²)
• |b| / ‖𝒘‖: the distance from the origin to the hyperplane
[Figure: points xₙ and xᵢ on either side of the hyperplane 𝒘ᵀx + b = 0]
Maximum Margin Hyperplane
• SVM is a hyperplane-based (linear) classifier that ensures a large margin around the hyperplane
• Objective: find the hyperplane with the
largest margin
• maximum margin hyperplane
• Finds the support vectors and
their weights
• Support vectors are the most important
examples
• Prediction:
• ŷ = sign(𝒘ᵀx_test + b)
Hard Margin SVM
• For a hard margin, we want all training points to satisfy the margin constraint, as formalized below
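For reference, the standard hard-margin formulation (textbook form; the slide's own equation did not survive extraction):

\min_{\mathbf{w},\, b} \ \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{s.t.} \quad y_n \left( \mathbf{w}^{T} x_n + b \right) \ge 1, \quad n = 1, \dots, N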
Soft Margin SVM
• Relax the hard constraint (allow some misclassification)
• xₙ can violate the margin by ξₙ ≥ 0
• ξₙ is a slack variable equal to the distance by which the corresponding point xₙ falls on the wrong side of the margin boundary
• ξₙ = 0: correct side of the margin
• ξₙ > 0: wrong side of the margin
• ξₙ > 1: wrong side of the hyperplane
Soft Margin SVM
• Maximize the margin while minimizing the sum of slacks ∑ₙ ξₙ over all N training points (the total training error); see the formulation below
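For reference, the standard soft-margin formulation (textbook form), where the cost C weighs the total slack against the margin:

\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \ \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{n=1}^{N} \xi_n
\quad \text{s.t.} \quad y_n \left( \mathbf{w}^{T} x_n + b \right) \ge 1 - \xi_n, \quad \xi_n \ge 0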
Soft Margin SVM
• When the cost C is large:
• The model prioritizes minimizing training error
• It will try harder to perfectly separate the training data, even if that results in a smaller margin
• The decision boundary becomes more complex to accommodate the training points
• Leads to low bias and high variance (overfitting)
• When the cost C is small:
• The model prioritizes maximizing the margin, allowing some training errors (misclassifications)
• It generalizes better to new data by keeping a simpler decision boundary
• Leads to higher bias (underfitting) and lower variance
• C controls the bias–variance trade-off:
• Large C: lower bias, higher variance
• Small C: higher bias, lower variance
(Source: Lecture by Piyush Rai)
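A hedged sketch of the C trade-off with scikit-learn's SVC (the data and C grid are illustrative):

# Compare small-C and large-C soft-margin SVMs on slightly noisy data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, flip_y=0.1, random_state=0)

for C in [0.01, 1, 100]:
    # Small C: wider margin, more slack (higher bias); large C: fewer training errors (higher variance).
    scores = cross_val_score(SVC(kernel="rbf", C=C), X, y, cv=5)
    print(C, scores.mean())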
Non-linear Spaces
• We need a mapping from the original space into a higher-dimensional space, where the non-linear relationship appears linear
• Increase dimensionality (add new features) by applying a non-linear mapping Φ(x); see the sketch below
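A toy sketch of such a mapping (the specific Φ below is an illustrative choice, not the slide's): concentric circles are not linearly separable in 2-D, but adding the feature x₁² + x₂² makes them separable by a hyperplane:

# Explicit non-linear mapping: Phi(x1, x2) = (x1, x2, x1^2 + x2^2).
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
Phi_X = np.c_[X, (X ** 2).sum(axis=1)]                      # add the new feature x1^2 + x2^2

print(SVC(kernel="linear").fit(X, y).score(X, y))           # poor: not linearly separable in 2-D
print(SVC(kernel="linear").fit(Phi_X, y).score(Phi_X, y))   # ~1.0: linear in the mapped space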
Kernel Trick – Common Kernels
Kernel trick also used in: Kernel PCA, Kernel Gaussian Processes,
Kernel K-Means, etc.
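The kernel table on this slide did not survive extraction; as a general reference (not necessarily the slide's exact selection), the kernels offered by most SVM libraries are:

K_{\text{linear}}(x, x') = x^{T} x'
K_{\text{poly}}(x, x') = (\gamma\, x^{T} x' + r)^{d}
K_{\text{RBF}}(x, x') = \exp\!\left(-\gamma \lVert x - x' \rVert^{2}\right)
K_{\text{sigmoid}}(x, x') = \tanh(\gamma\, x^{T} x' + r)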
SVM Decision Function
Decision function: f(x) = ∑ᵢ αᵢ* yᵢ K(xᵢ, x) + b, with prediction ŷ = sign(f(x))
• 𝑥 is the test point to classify
• The Lagrange multipliers 𝛼𝑖∗ are proportional to the weights of the
support vectors 𝑥𝑖
• 𝑦𝑖 are the labels of these support vectors
• 𝐾(𝑥𝑖 , 𝑥) is the kernel function, which measures the similarity between
the support vectors (𝑥𝑖 ) and the test point 𝑥
• b is the bias term
• The decision function f(x) calculates a weighted sum of similarities between the test point x and the support vectors xᵢ
• The sign of this sum determines the predicted class label for the test
point (positive or negative)
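A hedged sketch reproducing this decision function from a fitted scikit-learn SVC: dual_coef_ stores the products αᵢ* yᵢ for the support vectors, so the weighted sum of kernel similarities plus the bias matches decision_function (the data, gamma, and C are illustrative):

# Recompute f(x) = sum_i alpha_i* y_i K(x_i, x) + b by hand and compare with SVC.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

K = rbf_kernel(X[:5], clf.support_vectors_, gamma=0.5)   # similarities to the support vectors
f = K @ clf.dual_coef_.ravel() + clf.intercept_[0]       # weighted sum of similarities + bias
print(np.allclose(f, clf.decision_function(X[:5])))      # True
print(np.sign(f))                                        # predicted side of the hyperplane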
Support Vector Regression (SVR)
In regression problems, we typically minimize the mean squared error:
∑ᵢ (yᵢ − (𝒘ᵀxᵢ + b))²
SVR instead uses the ε-insensitive loss, which ignores residuals smaller than ε:
• 0 if |yᵢ − (𝒘ᵀxᵢ + b)| ≤ ε
• |yᵢ − (𝒘ᵀxᵢ + b)| − ε otherwise
[Figure: ε-insensitive tube around the regression line, target value vs. input]
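A minimal sketch of ε-insensitive regression with scikit-learn's SVR (the toy data, C, and epsilon are illustrative):

# Residuals smaller than epsilon cost nothing; larger residuals are penalized linearly.
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
print(svr.predict([[2.5]]))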
Multiclass SVMs – OVO
One-vs-One (OVO): train one binary SVM for each pair of classes and predict by majority vote over the pairwise classifiers
One-Class SVMs
1. SVDD: Support Vector Data Description (Tax and Duin, 2004):
• Attempts to fit the smallest possible hypersphere around the data
• Assumes the positives lie within a ball with the smallest possible radius R, allowing some slacks ξₙ
• The algorithm tries to minimize the number and magnitude of these slacks
2. OC-SVM: One-Class SVM (Scholkopf et al., 2001):
• Finds a max-margin hyperplane separating the positives from the origin (which represents the negatives)
• Starts by mapping the input data into a higher-dimensional feature space using a kernel function
• Tries to find the optimal hyperplane that maximizes the margin while minimizing the number and magnitude of the slack variables
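A minimal sketch of OC-SVM for outlier detection with scikit-learn's OneClassSVM (the data, gamma, and nu are illustrative; nu roughly upper-bounds the fraction of training points treated as outliers):

# Fit on "normal" examples only, then flag new points as inliers (+1) or outliers (-1).
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))     # positive (normal) examples only

oc_svm = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X_train)
print(oc_svm.predict([[0.1, -0.2], [6.0, 6.0]]))            # +1 = inlier, -1 = outlier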
SVMs – Pros and Cons
• Strengths:
• Works well for small datasets
• Appears to avoid overfitting in high-dimensional spaces
• Effective when number of dimensions is greater than number of samples
• Memory efficient (during testing): The complexity of SVM is characterized by
the number of support vectors
• Weaknesses:
• Sensitive to the choice of parameters
• Time-consuming to choose the best kernel (and its parameters)
• Training time can be expensive for large datasets
• Computing the kernel can be expensive
• A common workaround is to keep a linear kernel and manually engineer the features before training
• Multiclass SVMs require repeating the training for several binary problems (OVR vs. OVO)
• Difficult to interpret (especially when using kernels)
Take Away
• SVM finds the hyperplane that maximizes the margin
• Introduce soft margin for noisy data
• Implicit mapping to a higher-dimensional space
• The kernel trick allows an efficient mapping from the input space to a higher-dimensional feature space by making the dot product in the feature space cheap to compute
• The inputs to the SVM must be scaled so that features with large variance do not dominate those with smaller variance
• Scale each feature to [0, 1] or [-1, 1]
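A minimal sketch of these take-aways combined (the data is illustrative): scale each feature to [0, 1] before the SVM by chaining a MinMaxScaler and an SVC in a pipeline:

# Scaling and the SVM fitted together, so the scaler is learned only on each training fold.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

model = make_pipeline(MinMaxScaler(), SVC(kernel="rbf", C=1.0))   # scale each feature to [0, 1]
print(cross_val_score(model, X, y, cv=5).mean())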
Resources
• LIBSVM -- A Library for Support Vector Machines (see demo)
• Sklearn
• http://www.kernel-machines.org/
• https://www.svm-tutorial.com/