Lecture 5 of SYSC4415 focuses on Support Vector Machines (SVM) and K-Nearest Neighbors (KNN) for data classification. Key concepts include the SVM kernel for nonlinear decision boundaries, the hinge loss function, and the mechanics of KNN classification. The lecture also emphasizes hands-on activities to implement these classifiers and understand their underlying principles.


SYSC4415

Introduction to Machine Learning

Lecture 5

Prof James Green


[email protected]
Systems and Computer Engineering, Carleton University
Learning Objectives for Lecture 5
• Understand how SVM and KNN models classify data.
• Understand how to train an SVM model from labelled data.
• Introduce concept of SVM kernel to achieve nonlinear decision boundary.
• Define a "support vector"
• Be able to classify data by hand using a KNN classifier.

Pre-lecture Assignment:
• Read Sections 3.4–3.5 of the 100pMLB

In-class activities:
• Poll Everywhere review questions
• Review fundamental equations governing SVM
• Discuss impact of kernel function
• Implement a KNN classifier by hand using sample data
Key terms
• Hinge loss function, kernel trick, kernel functions (kernels), RBF
kernel, Euclidean distance, k-Nearest Neighbors, cosine similarity,
support vector.
Support Vector Machines - Review
• Review of support vector machines (SVM)
Note that the 100pMLB doesn't show the dot product explicitly:
f(x) = sign(wᵀx − b) = sign(w · x − b)
Learn w, b such that y_i(w · x_i − b) ≥ 1 for all training examples i, while minimizing ||w||.

How is this equivalent to that?

Why minimize ||w||? (The margin width is 2/||w||, so a smaller ||w|| gives a wider margin.)
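As a quick illustration of the decision rule above, here is a minimal sketch in Python (NumPy assumed available); the weight vector and bias are made-up values for illustration, not learned from data:

import numpy as np

# Hypothetical, made-up parameters for illustration (not learned from data).
w = np.array([2.0, -1.0])
b = 0.5

def svm_predict(x, w, b):
    """Linear SVM decision rule: f(x) = sign(w . x - b)."""
    return np.sign(np.dot(w, x) - b)

print(svm_predict(np.array([1.0, 0.0]), w, b))  # ->  1.0 (positive class)
print(svm_predict(np.array([0.0, 2.0]), w, b))  # -> -1.0 (negative class)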


Support Vector Machines – new concepts
• Data can be linearly inseparable due to 1) noise or 2) inherent structure
Cost function (soft-margin SVM):

C||w||² + (1/N) Σ_i max(0, 1 − y_i(w · x_i − b))

Hinge loss function: max(0, 1 − y_i(w · x_i − b))
• equals 0 if the prediction is right (the point is on the correct side of the margin)
• is positive if the prediction is wrong

C determines the tradeoff between increasing the size of the decision boundary and classification of the training data (it regulates the empirical risk):
• large C → width of the margin matters most
• small C → classifying the training data matters more
⇒ trade-off between generalization and empirical risk

• Soft-margin SVM: optimizes the hinge-loss function
• Hard-margin SVM: no hinge loss

Inherent non-linearity:
• Implicitly transform the original space into a higher dimension during the cost function optimization. This is called the kernel trick.

1) Noisy/mislabelled data
• Use hinge loss function to allow for incorrectly classified training data

• large C → width of the margin matters most
• small C → classifying the training data matters more

For correctly classified points, 1 − y_i(w · x_i − b) ≤ 0, so the hinge loss contributes nothing; for incorrectly classified points it becomes positive, and the penalty from those samples dominates.

Minimize ||w|| → maximize the margin.
Minimize the hinge loss function → for each misclassified sample, minimize its distance from the decision margin.
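The behaviour described above can be checked numerically. Below is a minimal sketch (plain NumPy, toy data made up for illustration) that evaluates the soft-margin objective C||w||² + (1/N) Σ max(0, 1 − y_i(w · x_i − b)); points beyond the margin contribute zero hinge loss, while the one margin-violating point contributes a positive penalty:

import numpy as np

def soft_margin_cost(w, b, X, y, C):
    """C*||w||^2 plus the mean hinge loss over the training set."""
    margins = y * (X @ w - b)               # y_i (w . x_i - b) for every sample
    hinge = np.maximum(0.0, 1.0 - margins)  # zero for points beyond the margin
    return C * np.dot(w, w) + hinge.mean()

# Toy labelled data (made up): the second point violates the margin.
X = np.array([[2.0, 1.0], [0.5, -1.0], [-1.0, -2.0], [-2.0, 1.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 0.0]), 0.0

print(soft_margin_cost(w, b, X, y, C=0.01))  # small C: hinge term dominates
print(soft_margin_cost(w, b, X, y, C=10.0))  # large C: ||w||^2 term dominates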

• Hard vs. soft margin SVM


• C hyperparameter
• Either way, solve the ‘optimization under constraints’ problem using Lagrange multipliers and then quadratic programming
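If scikit-learn is available (an assumption; the toolset isn't specified on the slide), the effect of the C hyperparameter can be explored directly. Note that scikit-learn places C on the hinge term rather than on ||w||², so large C favours classifying the training data and small C favours a wide margin, the reverse of the convention used above:

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy two-class data (illustration only).
X, y = make_blobs(n_samples=60, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Fewer support vectors generally indicates a harder (narrower) margin.
    print(f"C = {C:<6} support vectors = {clf.n_support_.sum()}")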
2) Inherent structure of data
• Can introduce nonlinear mapping to map data to higher dimensional
space where data become linearly separable
• “Kernel trick” (computational trick)
• It turns out that we only need to compute dot products between pairs of
vectors (training/testing instances) in the new space, and not the actual
transformed vectors explicitly
• Therefore, define a “kernel function” to bypass mapping:
From: http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html

For x, x' in R^N, define the kernel

K(x, x') = ⟨φ(x), φ(x')⟩,

where ⟨·, ·⟩ is an inner product in R^M, M > N,

and φ transforms R^N to R^M:
2) Inherent structure of data (example)

From http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html
2) Inherent structure of data (example cont'd)

From http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html
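A quick numeric check of the kernel trick, assuming a standard degree-2 polynomial feature map φ(x) = (x₁², √2·x₁x₂, x₂²) (an assumption; the linked page uses a similar 2-D → 3-D construction): the kernel K(x, x') = (x · x')², computed entirely in the original space, equals the dot product of the explicitly transformed vectors.

import numpy as np

def phi(x):
    """Explicit mapping R^2 -> R^3: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly_kernel(x, z):
    """Same quantity computed in the ORIGINAL space: K(x, z) = (x . z)^2."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(np.dot(phi(x), phi(z)))  # ~1.0 -- dot product after explicit transformation
print(poly_kernel(x, z))       #  1.0 -- no transformation ever computed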
2) Inherent structure of data
Common kernels:
1) Linear kernel: K(x, x') = x · x'
No mapping to higher dimensional space…

2) Polynomial kernel: K(x, x') = (x · x' + 1)^d (dth-order polynomial)

3) Radial basis function kernel: K(x, x') = exp(−||x − x'||² / (2σ²))
||x − x'||² is the Euclidean distance (squared)

• Infinite dimensions, but chance of overfitting only depends on # of support vectors
• Varying σ controls the smooth/curvy nature of the decision boundary

• For the math lovers, https://www.youtube.com/watch?v=_PwhiWxHK8o shows a near-intuitive derivation of SVMs (50 min MIT grad lecture)
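A short sketch (again assuming scikit-learn) that fits the three kernels listed above to nonlinearly separable toy data; in scikit-learn the RBF width is controlled by gamma, which plays the role of 1/(2σ²), so a higher gamma gives a curvier boundary:

from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Toy data that is not linearly separable.
X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

for name, clf in [
    ("linear", SVC(kernel="linear")),
    ("polynomial (d = 3)", SVC(kernel="poly", degree=3)),
    ("RBF (smooth, gamma = 0.1)", SVC(kernel="rbf", gamma=0.1)),
    ("RBF (curvy, gamma = 10)", SVC(kernel="rbf", gamma=10.0)),
]:
    acc = clf.fit(X, y).score(X, y)
    print(f"{name:26s} training accuracy = {acc:.2f}")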
K-Nearest Neighbour classifier
• Non-parametric
• Vote among classes of K nearest training points to test point
• Hyperparameters:
• K, # neighbours to examine
• Choice of distance metric
• City block distance, Euclidean distance, Mahalanobis distance, etc.
• Cosine similarity is often used for comparing two documents (binary feature vectors): cos(x, x') = (x · x') / (||x|| ||x'||)
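A minimal sketch of the distance and similarity measures just listed (plain NumPy; the two "document" vectors are made-up binary examples):

import numpy as np

def city_block(a, b):
    """L1 (Manhattan / city block) distance."""
    return np.abs(a - b).sum()

def euclidean(a, b):
    """L2 (Euclidean) distance."""
    return np.sqrt(((a - b) ** 2).sum())

def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (||a|| ||b||)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Two made-up binary "bag of words" document vectors.
doc1 = np.array([1, 0, 1, 1, 0])
doc2 = np.array([1, 1, 1, 0, 0])

print(city_block(doc1, doc2))         # 2
print(euclidean(doc1, doc2))          # ~1.414
print(cosine_similarity(doc1, doc2))  # ~0.667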
K-Nearest Neighbour Example
• Consider the following table of training data:
Weight Length Fish Type
4 19 Salmon
6 17 Salmon
8 15 Salmon
8 12 Sea Bass
7 13 Sea Bass
9 15 Sea Bass

• For K=3, use city-block-distance to classify the following test point:


• Weight = 8, length = 12
[Hand-drawn scatter plot of the training data, weight vs. length, with the salmon, sea bass, and unknown test point marked.]

Worked solution: using city-block distance from (weight = 8, length = 12), the K = 3 nearest training points are two sea bass (distances 0 and 2) and one salmon (distance 3), so the test point is classified as Sea Bass.

Note: if K is set to the total number of training points, no distances need to be computed, since the majority class always wins (e.g., if there are more apples than oranges, the prediction is always apple).
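A minimal sketch (plain NumPy) that reproduces the hand calculation, K = 3 with city-block distance and a majority vote over the fish table; it agrees with the Sea Bass result above.

import numpy as np

# Training data from the table: columns are (weight, length).
X_train = np.array([[4, 19], [6, 17], [8, 15], [8, 12], [7, 13], [9, 15]])
y_train = np.array(["Salmon", "Salmon", "Salmon",
                    "Sea Bass", "Sea Bass", "Sea Bass"])

def knn_predict(x, X, y, k=3):
    """Classify x by majority vote among its k nearest neighbours (city-block)."""
    dists = np.abs(X - x).sum(axis=1)   # L1 distance to every training point
    nearest = np.argsort(dists)[:k]     # indices of the k closest points
    labels, counts = np.unique(y[nearest], return_counts=True)
    return labels[np.argmax(counts)]

print(knn_predict(np.array([8, 12]), X_train, y_train, k=3))  # Sea Bass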
