Handout 03 Classic Classifiers
François Pitié
Before we dive into Neural Networks, keep in mind that Neural Nets
have been around for a while and, until recently, they were not the
method of choice for Machine Learning.
A zoo of algorithms exists out there, and we'll briefly introduce here
some of the classic methods for supervised learning.
k-nearest neighbours
k-nearest neighbours
pros:
• It is a non-parametric technique.
• It works surprisingly well and you can obtain high accuracy if the training set is large enough.
cons:
• Predictions are slow and memory-hungry for large training sets, as distances to all stored samples must be computed.
• Accuracy degrades in high dimensions (the curse of dimensionality).
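As a minimal sketch of how k-NN is typically used in practice (scikit-learn on a synthetic two-moons dataset, purely illustrative and not code from the handout):

```python
# k-NN sketch: "training" simply stores the data; prediction votes among
# the k nearest stored samples (scikit-learn, synthetic data).
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # k controls the neighbourhood size
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```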
Decision Trees
In decision trees (Breiman et al., 1984) and their many variants, each
node of the decision tree is associated with a region of the input
space, and internal nodes partition that region into sub-regions (in
a divide-and-conquer fashion).
The regions are split along the axes of the input space (e.g. at each
node you take a decision according to a binary test such as x2 < 3).
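A quick sketch of what these axis-aligned tests look like once a tree is fitted (scikit-learn on assumed synthetic data; the feature names x1…x4 are made up for the printout):

```python
# Each internal node of a fitted decision tree applies an axis-aligned
# binary test such as "x2 < 3" (scikit-learn, synthetic data).
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# export_text prints the learned thresholds, one axis-aligned test per node.
print(export_text(tree, feature_names=["x1", "x2", "x3", "x4"]))
```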
Decision Trees
In AdaBoost and Random Forests, multiple decision trees are combined
and their individual predictions are aggregated into a probability for
the final prediction.
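For instance, a random forest averages the class probabilities predicted by its individual trees; a scikit-learn sketch on assumed synthetic data:

```python
# Random forest sketch: predictions of many trees are aggregated into
# class probabilities (scikit-learn, synthetic data for illustration).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Probabilities aggregated over the 100 trees, for the first 3 observations.
print(forest.predict_proba(X[:3]))
```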
Decision Trees
[1] Real-Time Human Pose Recognition in Parts from a Single Depth Image
J. Shotton, A. Fitzgibbon, A. Blake, A. Kipman, M. Finocchio, B. Moore, T. Sharp, 2011
[https://goo.gl/UTM6s1]
Decision Trees
pros:
• It is fast.
cons:
• Decisions are taken along axes (e.g. x1 < 3), but it could be more
efficient to split the classes along a diagonal (e.g. x1 < x2).
Decision Trees
LINKS:
https://www.youtube.com/watch?v=p17C9q2M00Q
SVM
Until recently, Support Vector Machines were the most popular technique around.
Like in Logistic Regression, SVM starts as a linear classifier:
y = [x⊺ w > 0]
The difference with logistic regression lies in the choice of the loss
function.
SVM
L_SVM(w) = ∑ᵢ₌₁ᴺ [yᵢ = 0] max(0, 1 + xᵢ⊺w) + [yᵢ = 1] max(0, 1 − xᵢ⊺w)
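A direct NumPy transcription of this loss, with labels yᵢ ∈ {0, 1} as in the formula (the data used to call it is made up for illustration):

```python
import numpy as np

def svm_hinge_loss(w, X, y):
    """Hinge loss with labels y in {0, 1}, exactly as written above."""
    scores = X @ w                               # x_i^T w for every observation
    loss_pos = np.maximum(0.0, 1.0 - scores)     # terms where y_i = 1
    loss_neg = np.maximum(0.0, 1.0 + scores)     # terms where y_i = 0
    return np.sum(np.where(y == 1, loss_pos, loss_neg))

# Tiny illustrative call on random data (not from the handout).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)
w = np.array([1.0, 0.0, 0.0])
print(svm_hinge_loss(w, X, y))
```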
SVM
There is a lot more to SVMs, but this will not be covered in this course.
No Free Lunch Theorem
Note that, a priori, there is no advantage in using a linear SVM over
logistic regression in terms of performance alone. It all depends on the
type of data you have.
Recall that the choice of loss function directly relates to assumptions
you make about the distribution of the prediction errors, and thus
about the dataset of your problem.
No Free Lunch Theorem
This is formalised in the “no free lunch” theorem (Wolpert, 1996), which
tells us that classifiers perform equally well when averaged over all
possible problems. In other words: your choice of classifier should
depend on the problem at hand.
[Figure: schematic performance of Classifier A, Classifier B and Classifier C across different problems/datasets.]
Kernel Trick
ϕ(x) = (1, x, x², x³, …)⊺
Kernel Trick
The idea here is the same: we want to find a feature map x ↦ ϕ(x) that
transforms the input data into a new dataset that can be separated by a
linear classifier.
Transforming the original features into more complex ones is a key
ingredient of machine learning, and something that we’ll see again
with Deep Learning.
The collected features are usually not optimal for linearly separating
the classes and it is often unclear how these should be transformed.
We would like the machine learning technique to learn how to best
recombine the features so as to yield optimal class separation.
So our first problem is to find a useful feature transformation ϕ. Another
problem is that the size of the new feature vectors ϕ(x) could
potentially grow very large.
Consider, for example, polynomial augmentations, where the features are
complemented with all monomials of the original features up to a given order.
For example, if you have p = 100 features per observation and you are
looking at a polynomial of order 5, the resulting feature vector has
about 100 million dimensions!
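The count is easy to verify: the number of monomials of degree at most d in p variables is C(p + d, d). A one-liner sketch:

```python
from math import comb

p, d = 100, 5
# Number of monomials of degree <= d in p variables (constant term included).
print(comb(p + d, d))   # 96560646, i.e. roughly 100 million features
```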
Now, recall that Least-Squares solutions are given by
ŵ = (X⊺X)⁻¹ X⊺ y
So, we want to transform the original features into higher-level
features, but we do not want this to come at the cost of greatly
increasing the dimension of the original problem.
The Kernel trick offers an elegant solution to this problem and allows
us to use very complex mapping functions ϕ without having to ever
explicitly compute them.
Kernel Trick
We start from the observation that most loss functions only operate
on the scores x⊺w, e.g.:

ŵ = arg min_w E(w) = ∑ᵢ₌₁ⁿ e(xᵢ⊺w)

We can show (see lecture notes) that, for any x, the score at the
optimum, x⊺ŵ, can then be re-expressed as:

x⊺ŵ = ∑ᵢ₌₁ⁿ αᵢ x⊺xᵢ
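In code, this re-expression means that scoring a new point only requires inner products with the training points; a NumPy sketch (the αᵢ are taken as given and the data is made up):

```python
import numpy as np

def score_primal(x, w_hat):
    # Direct evaluation of the score x^T w_hat.
    return x @ w_hat

def score_dual(x, X_train, alpha):
    # The same score re-expressed as sum_i alpha_i * (x^T x_i):
    # only inner products with the training points are needed.
    return np.sum(alpha * (X_train @ x))

# If w_hat = sum_i alpha_i * x_i, the two expressions agree.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 3))
alpha = rng.normal(size=20)
w_hat = X_train.T @ alpha
x = rng.normal(size=3)
print(np.isclose(score_primal(x, w_hat), score_dual(x, X_train, alpha)))  # True
```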
Kernel Trick
Many kernel functions are possible. For instance, the polynomial kernel
of order d can be defined as:

κ(u, v) = (1 + u⊺v)ᵈ

and one can show that this is equivalent to using a polynomial mapping
as proposed earlier, except that instead of requiring hundreds of
millions of dimensions, we only need to take scalar products between
vectors of dimension p.
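A quick numerical sanity check of this equivalence for degree d = 2 (NumPy; the √2 scalings in the explicit map are the standard choice that makes the inner products match):

```python
import numpy as np
from itertools import combinations

def poly_kernel(u, v, d=2):
    # Only a p-dimensional scalar product is ever computed.
    return (1.0 + u @ v) ** d

def phi_degree2(u):
    # Explicit degree-2 feature map whose inner product equals (1 + u.v)^2.
    p = len(u)
    cross = [np.sqrt(2) * u[i] * u[j] for i, j in combinations(range(p), 2)]
    return np.concatenate([[1.0], np.sqrt(2) * u, u ** 2, cross])

rng = np.random.default_rng(0)
u, v = rng.normal(size=5), rng.normal(size=5)
print(np.isclose(poly_kernel(u, v), phi_degree2(u) @ phi_degree2(v)))  # True
```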
Kernel Trick
The most commonly used kernel is probably the Radial Basis Function
(RBF) kernel:
κ(u, v) = exp(−γ ∥u − v∥²)
Kernel Trick: Intuition (pt1)
To get some intuition about these kernels, consider the kernel trick
with an RBF kernel. The score for a particular observation x is:

score(x) = ∑ᵢ₌₁ⁿ αᵢ κ(x, xᵢ)
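With an RBF kernel, this score is a weighted sum of Gaussian bumps centred on the training points; a NumPy sketch (the αᵢ and the points are made up for illustration):

```python
import numpy as np

def rbf_kernel(u, v, gamma=1.0):
    # kappa(u, v) = exp(-gamma * ||u - v||^2)
    return np.exp(-gamma * np.sum((u - v) ** 2))

def score(x, X_train, alpha, gamma=1.0):
    # score(x) = sum_i alpha_i * kappa(x, x_i)
    return sum(a * rbf_kernel(x, xi, gamma) for a, xi in zip(alpha, X_train))

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]])
alpha = np.array([0.9, -0.8, 0.6])
print(score(np.array([0.2, 0.1]), X_train, alpha, gamma=2.0))
```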
Kernel Trick: Intuition (pt2)
In SVM, the actual values of α̂ᵢ are estimated by way of the minimisation
of the Hinge loss.
The optimisation falls outside the scope of this course material. We
could use Gradient Descent but, as it turns out, the Hinge loss makes
this a constrained optimisation problem, and we can use a dedicated
solver for that. The good news is that we can find the global minimum
without having to worry about convergence issues.
We find after optimisation that, indeed, −1 ≤ α̂ᵢ ≤ 1, with the sign of
α̂ᵢ indicating the class membership, thus following a similar idea to
what was proposed in the previous slide.
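In practice this optimisation is handled by an off-the-shelf solver; for instance, scikit-learn's SVC exposes the fitted support vectors and their signed dual coefficients directly (sketch on assumed synthetic data):

```python
# Fitting an RBF-kernel SVM and inspecting the support vectors and their
# signed dual coefficients (scikit-learn, synthetic data for illustration).
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

print("number of support vectors:", len(clf.support_))
print("signed dual coefficients :", clf.dual_coef_[0][:5])   # bounded by C
print("a few support vectors    :\n", clf.support_vectors_[:3])
```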
Kernel Trick: Intuition (pt4)
[Figure: scatter plot with score contour lines; selected observations annotated with their weights, e.g. α₁₂ = +0.9, α₁₇ = +0.6, α₃ = −0.2, α₅ = −0.8, α₆ = −0.7, α₈ = −0.7, and α₁₁ = α₁₃ = α₂₃ = α₃₇ = 0; axes range from −4 to 4.]
SVM-RBF example with score contour lines. The thickness of each observation's
outer circle is proportional to ∣αᵢ∣ (no outer circle means αᵢ = 0). Only a
subset of datapoints, called support vectors, have non-null αᵢ. They lie near
the class boundary and are the only datapoints used in making predictions.
SVM results with polynomial kernel
SVM results with RBF kernel
Decision Boundaries for SVM using Gaussian kernels. The value of γ controls
the smoothness of the boundary by setting the size of the neighbourhood.
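A small sketch of how γ would typically be compared in practice (scikit-learn on assumed synthetic data, using cross-validated accuracy):

```python
# Effect of gamma on an RBF-kernel SVM: larger gamma means a smaller
# neighbourhood and a wigglier decision boundary (synthetic data).
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
for gamma in [0.1, 1.0, 10.0, 100.0]:
    acc = cross_val_score(SVC(kernel="rbf", gamma=gamma), X, y, cv=5).mean()
    print(f"gamma={gamma:6.1f}  cv accuracy={acc:.3f}")
```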
Other Kernel Methods Exist
Support vector machines are not the only algorithm that can avail of
the kernel trick. Many other linear models (including logistic regression)
can be enhanced in this way. They are known as kernel methods.
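For instance, kernel ridge regression applies the same trick to ridge regression; scikit-learn ships an implementation (sketch on made-up 1-D regression data):

```python
# Kernel ridge regression: the kernel trick applied to a linear model other
# than the SVM (scikit-learn, synthetic regression data for illustration).
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

model = KernelRidge(kernel="rbf", gamma=0.5, alpha=1.0).fit(X, y)
print(model.predict(np.array([[0.0], [1.5]])))
```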
Kernel Methods Drawbacks
The main drawback is computational: kernel methods work with pairwise
comparisons between datapoints, which typically means building an n × n
kernel matrix over the training set. This becomes impractical for large
datasets (e.g. more than tens of thousands of observations).
Kernel Methods and Neural Networks
Kernel methods rely on a fixed, hand-picked kernel to transform the
features; neural networks, which we cover next, instead learn the feature
transformation directly from the data.
References
SEE ALSO:
Gaussian Processes,
Reproducing kernel Hilbert spaces,
Kernel Logistic Regression
Take Away
Neural Nets have existed for a while, but it is only recently (2012) that
they have started to surpass all other techniques.
Kernel-based techniques were very popular until recently, as they offer
an elegant way of transforming the input features into more complex
features that can then be linearly separated.
The problem with kernel techniques is that they cannot deal efficiently
with large datasets (e.g. more than tens of thousands of observations).