Handout 03 Classic Classifiers
François Pitié
Before we dive into Neural Networks, keep in mind that Neural Nets
have been around for a while and, until recently, they were not the
method of choice for Machine Learning.
A zoo of algorithms exists out there, and we'll briefly introduce here
some of the classic methods for supervised learning.
k-nearest neighbours
k-nearest neighbours
pros:
• It is a non-parametric technique.
• It works surprisingly well and you can obtain high accuracy if the training set is large enough.
cons:
• Predictions are slow and memory-hungry for large training sets, as distances to all stored samples must be computed.
• Accuracy degrades in high dimensions (the curse of dimensionality).
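As a minimal sketch of how k-NN is typically used in practice (scikit-learn on a synthetic two-moons dataset, purely illustrative and not code from the handout):

```python
# k-NN sketch: "training" simply stores the data; prediction votes among
# the k nearest stored samples (scikit-learn, synthetic data).
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # k controls the neighbourhood size
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```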
Decision Trees
In decision trees (Breiman et al., 1984) and their many variants, each
node of the decision tree is associated with a region of the input
space, and internal nodes partition that region into sub-regions (in
a divide-and-conquer fashion).
The regions are split along the axes of the input space (e.g. at each
node you take a decision according to a binary test such as x2 < 3).
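A quick sketch of what these axis-aligned tests look like once a tree is fitted (scikit-learn on assumed synthetic data; the feature names x1…x4 are made up for the printout):

```python
# Each internal node of a fitted decision tree applies an axis-aligned
# binary test such as "x2 < 3" (scikit-learn, synthetic data).
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# export_text prints the learned thresholds, one axis-aligned test per node.
print(export_text(tree, feature_names=["x1", "x2", "x3", "x4"]))
```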
Decision Trees
In AdaBoost and Random Forests, multiple decision trees are combined
and their individual predictions are aggregated into a probability for
the final prediction.
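For instance, a random forest averages the class probabilities predicted by its individual trees; a scikit-learn sketch on assumed synthetic data:

```python
# Random forest sketch: predictions of many trees are aggregated into
# class probabilities (scikit-learn, synthetic data for illustration).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Probabilities aggregated over the 100 trees, for the first 3 observations.
print(forest.predict_proba(X[:3]))
```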
Decision Trees
[1] Real-Time Human Pose Recognition in Parts from a Single Depth Image
J. Shotton, A. Fitzgibbon, A. Blake, A. Kipman, M. Finocchio, B. Moore, T. Sharp, 2011
[https://goo.gl/UTM6s1]
Decision Trees
pros:
• It is fast.
cons:
• Decisions are taken along axes (e.g. x1 < 3), but it could be more
efficient to split the classes along a diagonal (e.g. x1 < x2).
Decision Trees
LINKS:
https://www.youtube.com/watch?v=p17C9q2M00Q
SVM
Until recently, Support Vector Machines were the most popular technique around.
Like in Logistic Regression, SVM starts as a linear classifier:
y = [x⊺ w > 0]
The difference with logistic regression lies in the choice of the loss
function.
SVM
L_SVM(w) = ∑ᵢ₌₁ᴺ [yᵢ = 0] max(0, 1 + xᵢ⊺w) + [yᵢ = 1] max(0, 1 − xᵢ⊺w)
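A direct NumPy transcription of this loss, with labels yᵢ ∈ {0, 1} as in the formula (the data used to call it is made up for illustration):

```python
import numpy as np

def svm_hinge_loss(w, X, y):
    """Hinge loss with labels y in {0, 1}, exactly as written above."""
    scores = X @ w                               # x_i^T w for every observation
    loss_pos = np.maximum(0.0, 1.0 - scores)     # terms where y_i = 1
    loss_neg = np.maximum(0.0, 1.0 + scores)     # terms where y_i = 0
    return np.sum(np.where(y == 1, loss_pos, loss_neg))

# Tiny illustrative call on random data (not from the handout).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)
w = np.array([1.0, 0.0, 0.0])
print(svm_hinge_loss(w, X, y))
```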
SVM
There is a lot more to SVMs, but this will not be covered in this course.
No Free Lunch Theorem
Note that, a priori, there is no advantage in using a linear SVM over
logistic regression in terms of performance alone. It all depends on the
type of data you have.
Recall that the choice of loss function directly relates to assumptions
you make about the distribution of the prediction errors, and thus
about the dataset of your problem.
No Free Lunch Theorem
This is formalised in the “no free lunch” theorem (Wolpert, 1996), which
tells us that classifiers perform equally well when averaged over all
possible problems. In other words: your choice of classifier should
depend on the problem at hand.
[Figure: schematic performance of Classifier A, Classifier B and Classifier C across different problems/datasets.]
Kernel Trick
ϕ(x) = (1, x, x², x³, …)⊺
Kernel Trick
The idea here is the same: we want to find a feature map x ↦ ϕ(x) that
transforms the input data into a new dataset that can be separated by a
linear classifier.
Transforming the original features into more complex ones is a key
ingredient of machine learning, and something that we’ll see again
with Deep Learning.
The collected features are usually not optimal for linearly separating
the classes and it is often unclear how these should be transformed.
We would like the machine learning technique to learn how to best
recombine the features so as to yield optimal class separation.
So our first problem is to find a useful feature transformation ϕ. Another
problem is that the size of the new feature vectors ϕ(x) could
potentially grow very large.
Consider, for example, polynomial augmentations, where the features are
complemented with all monomials of the original features up to a given order.
For example, if you have p = 100 features per observation and you are
looking at a polynomial of order 5, the resulting feature vector has
about 100 million dimensions!
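The count is easy to verify: the number of monomials of degree at most d in p variables is C(p + d, d). A one-liner sketch:

```python
from math import comb

p, d = 100, 5
# Number of monomials of degree <= d in p variables (constant term included).
print(comb(p + d, d))   # 96560646, i.e. roughly 100 million features
```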
Now, recall that Least-Squares solutions are given by
ŵ = (X⊺X)⁻¹ X⊺ y
So, we want to transform the original features into higher-level
features, but we do not want this to come at the cost of greatly
increasing the dimension of the original problem.
The Kernel trick offers an elegant solution to this problem and allows
us to use very complex mapping functions ϕ without having to ever
explicitly compute them.
Kernel Trick
We start from the observation that most loss functions only operate
on the scores x⊺w, e.g.:

ŵ = arg min_w E(w) = ∑ᵢ₌₁ⁿ e(xᵢ⊺w)

We can show (see lecture notes) that, for any x, the score at the
optimum, x⊺ŵ, can then be re-expressed as:

x⊺ŵ = ∑ᵢ₌₁ⁿ αᵢ x⊺xᵢ
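In code, this re-expression means that scoring a new point only requires inner products with the training points; a NumPy sketch (the αᵢ are taken as given and the data is made up):

```python
import numpy as np

def score_primal(x, w_hat):
    # Direct evaluation of the score x^T w_hat.
    return x @ w_hat

def score_dual(x, X_train, alpha):
    # The same score re-expressed as sum_i alpha_i * (x^T x_i):
    # only inner products with the training points are needed.
    return np.sum(alpha * (X_train @ x))

# If w_hat = sum_i alpha_i * x_i, the two expressions agree.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 3))
alpha = rng.normal(size=20)
w_hat = X_train.T @ alpha
x = rng.normal(size=3)
print(np.isclose(score_primal(x, w_hat), score_dual(x, X_train, alpha)))  # True
```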
Kernel Trick
Many kernel functions are possible. For instance, the polynomial kernel
of order d can be defined as:

κ(u, v) = (1 + u⊺v)ᵈ

and one can show that this is equivalent to using a polynomial mapping
as proposed earlier, except that instead of requiring hundreds of
millions of dimensions, we only need to take scalar products between
vectors of dimension p.
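A quick numerical sanity check of this equivalence for degree d = 2 (NumPy; the √2 scalings in the explicit map are the standard choice that makes the inner products match):

```python
import numpy as np
from itertools import combinations

def poly_kernel(u, v, d=2):
    # Only a p-dimensional scalar product is ever computed.
    return (1.0 + u @ v) ** d

def phi_degree2(u):
    # Explicit degree-2 feature map whose inner product equals (1 + u.v)^2.
    p = len(u)
    cross = [np.sqrt(2) * u[i] * u[j] for i, j in combinations(range(p), 2)]
    return np.concatenate([[1.0], np.sqrt(2) * u, u ** 2, cross])

rng = np.random.default_rng(0)
u, v = rng.normal(size=5), rng.normal(size=5)
print(np.isclose(poly_kernel(u, v), phi_degree2(u) @ phi_degree2(v)))  # True
```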
Kernel Trick
The most commonly used kernel is probably the Radial Basis Function
(RBF) kernel:
κ(u, v) = exp(−γ ∥u − v∥²)
Kernel Trick: Intuition (pt1)
To get some intuition about these kernels, consider the kernel trick
with an RBF kernel. The score for a particular observation x is:

score(x) = ∑ᵢ₌₁ⁿ αᵢ κ(x, xᵢ)
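With an RBF kernel, this score is a weighted sum of Gaussian bumps centred on the training points; a NumPy sketch (the αᵢ and the points are made up for illustration):

```python
import numpy as np

def rbf_kernel(u, v, gamma=1.0):
    # kappa(u, v) = exp(-gamma * ||u - v||^2)
    return np.exp(-gamma * np.sum((u - v) ** 2))

def score(x, X_train, alpha, gamma=1.0):
    # score(x) = sum_i alpha_i * kappa(x, x_i)
    return sum(a * rbf_kernel(x, xi, gamma) for a, xi in zip(alpha, X_train))

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]])
alpha = np.array([0.9, -0.8, 0.6])
print(score(np.array([0.2, 0.1]), X_train, alpha, gamma=2.0))
```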
Kernel Trick: Intuition (pt2)
In SVM, the actual values of α̂ᵢ are estimated by way of the minimisation
of the Hinge loss.
The optimisation falls outside the scope of this course material. We
could use Gradient Descent but, as it turns out, the Hinge loss makes
this a constrained optimisation problem, and we can use a dedicated
solver for that. The good news is that we can find the global minimum
without having to worry about convergence issues.
We find after optimisation that, indeed, −1 ≤ α̂ᵢ ≤ 1, with the sign of
α̂ᵢ indicating the class membership, thus following a similar idea to
what was proposed in the previous slide.
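In practice this optimisation is handled by an off-the-shelf solver; for instance, scikit-learn's SVC exposes the fitted support vectors and their signed dual coefficients directly (sketch on assumed synthetic data):

```python
# Fitting an RBF-kernel SVM and inspecting the support vectors and their
# signed dual coefficients (scikit-learn, synthetic data for illustration).
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

print("number of support vectors:", len(clf.support_))
print("signed dual coefficients :", clf.dual_coef_[0][:5])   # bounded by C
print("a few support vectors    :\n", clf.support_vectors_[:3])
```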
Kernel Trick: Intuition (pt4)
[Figure: scatter plot with score contour lines; selected observations annotated with their weights, e.g. α₁₂ = +0.9, α₁₇ = +0.6, α₃ = −0.2, α₅ = −0.8, α₆ = −0.7, α₈ = −0.7, and α₁₁ = α₁₃ = α₂₃ = α₃₇ = 0; axes range from −4 to 4.]
SVM-RBF example with score contour lines. The thickness of each observation's
outer circle is proportional to ∣αᵢ∣ (no outer circle means αᵢ = 0). Only a
subset of datapoints, called support vectors, have non-null αᵢ. They lie near
the class boundary and are the only datapoints used in making predictions.
SVM results with polynomial kernel
SVM results with RBF kernel
Decision Boundaries for SVM using Gaussian kernels. The value of γ controls
the smoothness of the boundary by setting the size of the neighbourhood.
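A small sketch of how γ would typically be compared in practice (scikit-learn on assumed synthetic data, using cross-validated accuracy):

```python
# Effect of gamma on an RBF-kernel SVM: larger gamma means a smaller
# neighbourhood and a wigglier decision boundary (synthetic data).
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
for gamma in [0.1, 1.0, 10.0, 100.0]:
    acc = cross_val_score(SVC(kernel="rbf", gamma=gamma), X, y, cv=5).mean()
    print(f"gamma={gamma:6.1f}  cv accuracy={acc:.3f}")
```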
Other Kernel Methods Exist
Support vector machines are not the only algorithm that can avail of
the kernel trick. Many other linear models (including logistic regression)
can be enhanced in this way. They are known as kernel methods.
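For instance, kernel ridge regression applies the same trick to ridge regression; scikit-learn ships an implementation (sketch on made-up 1-D regression data):

```python
# Kernel ridge regression: the kernel trick applied to a linear model other
# than the SVM (scikit-learn, synthetic regression data for illustration).
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

model = KernelRidge(kernel="rbf", gamma=0.5, alpha=1.0).fit(X, y)
print(model.predict(np.array([[0.0], [1.5]])))
```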
Kernel Methods Drawbacks
The main drawback is computational: kernel methods work with pairwise
comparisons between datapoints, which typically means building an n × n
kernel matrix over the training set. This becomes impractical for large
datasets (e.g. more than tens of thousands of observations).
Kernel Methods and Neural Networks
Kernel methods rely on a fixed, hand-picked kernel to transform the
features; neural networks, which we cover next, instead learn the feature
transformation directly from the data.
References
SEE ALSO:
Gaussian Processes,
Reproducing kernel Hilbert spaces,
Kernel Logistic Regression
Take Away
Neural Nets have existed for a while, but it is only recently (2012) that
they have started to surpass all other techniques.
Kernel-based techniques were very popular until recently, as they offer
an elegant way of transforming the input features into more complex
features that can then be linearly separated.
The problem with kernel techniques is that they cannot deal efficiently
with large datasets (e.g. more than tens of thousands of observations).