Classic Machine Learning Algorithms
Abstract
In this chapter, we present the main classic machine learning algorithms.
A large part of the chapter is devoted to supervised learning algorithms
for classification and regression, including nearest-neighbor methods, lin-
ear and logistic regressions, support vector machines and tree-based algo-
rithms. We also describe the problem of overfitting as well as strategies
to overcome it. We finally provide a brief overview of unsupervised learn-
ing methods, namely for clustering and dimensionality reduction. The
chapter does not cover neural networks and deep learning as these will
be presented in Chapters 3, 4, 5 and 6.
1. Introduction
This chapter presents the main classic machine learning (ML) algorithms.
There is a focus on supervised learning methods for classification and re-
gression, but we also describe some unsupervised approaches. The chap-
ter is meant to be readable by someone with no background in machine
learning. It is nevertheless necessary to have some basic notions of linear
algebra, probabilities and statistics. If this is not the case, we refer the
reader to chapters 2 and 3 of [1].
The rest of this chapter is organized as follows. Rather than grouping
methods by categories (for instance classification or regression methods),
we chose to present methods by increasing order of complexity. We first
provide the notations in Section 2. We then describe a very intuitive fam-
ily of methods, that of nearest neighbors (Section 3). We continue with
linear regression (Section 4) and logistic regression (Section 5), the latter
being a classification technique. We subsequently introduce the problem
of overfitting (Section 6) as well as strategies to mitigate it (Section 7).
Section 8 describes support vector machines (SVM). Section 9 explains
how binary classification methods can be extended to a multi-class set-
ting. We then describe methods which are specifically adapted to the case
of normal distributions (Section 10). Decision trees and random forests
are described in Section 11. We then briefly describe some unsupervised
learning algorithms, namely for clustering (Section 12) and dimensional-
ity reduction (Section 13). The chapter ends with a description of kernel
methods which can be used to extend linear algorithms to non-linear
cases (Section 14). Box 1 summarizes the algorithms presented in this
chapter, grouped by categories and then sorted in order of appearance.
2. Notations
Let n be the number of samples and p be the number of features. An
input sample is thus a p-dimensional vector:
$$x = \begin{pmatrix} x_1 \\ \vdots \\ x_p \end{pmatrix}$$
The input space is denoted by I and the set of training samples is denoted
by X .
The output is denoted by y. In the case of regression, y is a real number. In the case of classification, y is a single label. More precisely, y can only take one of a finite set
of values called labels. The set of possible classes (i.e., labels) is denoted
by C = {C1 , . . . , Cq }, with q being the number of classes. As the values
of the classes are not meaningful, when there are only two classes, the
classes are often called the positive and negative classes. In this case and
also for mathematical reasons, without loss of generality, we assume the
values of the classes to be +1 and −1.
3. Nearest-neighbor methods
One of the most intuitive approaches to machine learning is nearest neigh-
bors. It is based on the following intuition: for a given input, its corre-
sponding output is likely to be similar to the outputs of similar inputs. A
real-life metaphor would be that, if a subject has characteristics similar to those of other subjects who were diagnosed with a given disease, then this subject is likely to also be suffering from this disease.
More formally, nearest-neighbor methods use the training samples
from the neighborhood of a given point x, denoted by N (x), to perform
prediction [2].
For regression tasks, the prediction is computed as a weighted mean
of the target values in N (x):
$$\hat{y} = \sum_{x^{(i)} \in N(x)} w_i^{(x)}\, y^{(i)}$$
where $w_i^{(x)}$ is the weight associated with $x^{(i)}$ when predicting the output of $x$, with $w_i^{(x)} \geq 0\ \forall i$ and $\sum_i w_i^{(x)} = 1$.
For classification tasks, the predicted label corresponds to the label
with the largest weighted sum of occurrences of each label:
$$\hat{y} = \arg\max_{C_k \in \mathcal{C}} \sum_{x^{(i)} \in N(x)} w_i^{(x)}\, \mathbf{1}_{y^{(i)} = C_k}$$
3.1 Metrics
Many metrics have been defined for various types of input data such
as vectors of real numbers, integers or booleans. Among these different
types, vectors of real numbers are one of the most common types of
input data, for which the most commonly used metric is the Euclidean
distance, defined as:
$$\forall x, x' \in \mathcal{I}, \quad \|x - x'\|_2 = \sqrt{\sum_{j=1}^{p} (x_j - x'_j)^2}$$
3.2 Neighborhood
The two most common definitions of the neighborhood rely on either
the number of neighbors or the radius around the given point. Figure 1
illustrates the differences between both definitions.
The k-nearest neighbor method defines the neighborhood of a given
point x as the set of the k closest points to x:
$$N(x) = \{x^{(i)}\}_{i=1}^{k} \quad \text{with} \quad d(x, x^{(1)}) \leq \ldots \leq d(x, x^{(n)})$$
The radius neighbor method defines the neighborhood of a given point
x as the set of points whose dissimilarity to x is smaller than the given
radius, denoted by r:
$$N(x) = \{x^{(i)} \in \mathcal{X} \mid d(x, x^{(i)}) < r\}$$
3.3 Weights
The two most common approaches to compute the weights are to use:
• uniform weights (all the weights are equal):
$$\forall i, \quad w_i^{(x)} = \frac{1}{|N(x)|}$$
• weights inversely proportional to the dissimilarity:
$$\forall i, \quad w_i^{(x)} = \frac{\frac{1}{d(x^{(i)}, x)}}{\sum_j \frac{1}{d(x^{(j)}, x)}} = \frac{1}{d(x^{(i)}, x) \sum_j \frac{1}{d(x^{(j)}, x)}}$$
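To make these definitions concrete, here is a minimal sketch of weighted k-nearest-neighbor regression written with NumPy; the array names (X_train, y_train, x_new), the simulated data and the choice of the Euclidean distance are illustrative assumptions, not part of the original text.

```python
import numpy as np

def knn_predict_regression(X_train, y_train, x_new, k=5, weighted=True):
    """Weighted k-NN regression: weighted mean of the k closest targets."""
    # Euclidean distances between x_new and every training sample
    dists = np.linalg.norm(X_train - x_new, axis=1)
    idx = np.argsort(dists)[:k]            # indices of the k nearest neighbors
    if weighted:
        w = 1.0 / (dists[idx] + 1e-12)     # weights inversely proportional to the distance
    else:
        w = np.ones(k)                     # uniform weights
    w = w / w.sum()                        # normalize so that the weights sum to 1
    return np.dot(w, y_train[idx])

# Example usage with synthetic data
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
y_train = X_train.sum(axis=1) + rng.normal(scale=0.1, size=100)
x_new = np.zeros(3)
print(knn_predict_regression(X_train, y_train, x_new, k=5))
```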
4. Linear regression
Linear regression is a regression model that linearly combines the fea-
tures. Each feature is associated with a coefficient that represents the
relative weight of this feature compared to the other features. A real-life
metaphor would be to see the coefficients as the ingredients of a recipe:
the key is to find the best balance (i.e., proportions) between all the
ingredients in order to make the best cake.
Mathematically, a linear model is a model that linearly combines the
features [5]:
$$f(x) = w_0 + \sum_{j=1}^{p} w_j x_j$$
With the convention that a constant feature equal to 1 is appended to $x$ so that the intercept $w_0$ is absorbed into $w$, the prediction can be written compactly as:
$$\hat{y} = f(x) = x^\top w$$
One wants to find the optimal parameters $w^\star$ that minimize the cost function, here the sum of squared errors over the training set:
$$w^\star = \arg\min_w J(w), \quad \text{with} \quad J(w) = \|Xw - y\|_2^2$$
Setting the gradient of $J$ to zero yields a closed-form solution:
$$\nabla_w J(w^\star) = 0 \;\Longrightarrow\; 2X^\top X w^\star - 2X^\top y = 0 \;\Longrightarrow\; X^\top X w^\star = X^\top y \;\Longrightarrow\; w^\star = \left(X^\top X\right)^{-1} X^\top y$$
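As a quick illustration of this closed-form solution, the sketch below fits a linear regression with the normal equations in NumPy; the simulated data and variable names are illustrative assumptions, and in practice np.linalg.lstsq or scikit-learn's LinearRegression would be preferred for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
X = np.hstack([np.ones((n, 1)), X])          # prepend a constant feature for the intercept
w_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=n)

# Normal equations: w* = (X^T X)^{-1} X^T y
w_star = np.linalg.solve(X.T @ X, X.T @ y)   # solving the linear system avoids an explicit inverse
print(w_star)                                 # close to w_true
```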
5. Logistic regression
Intuitively, linear regression consists in finding the line that best fits the
data: the true output should be as close to the line as possible. For
binary classification, one wants the line to separate both classes as well
as possible: the samples from one class should all be in one subspace and
the samples from the other class should all be in the other subspace, with
the inputs being as far as possible from the line.
Mathematically, for binary classification tasks, a linear model is de-
fined by a hyperplane splitting the input space into two subspaces such
that each subspace is characteristic of one class. For instance, a line
splits a plane into two subspaces in the two-dimensional case, while a
plane splits a three-dimensional space into two subspaces. A hyperplane
is defined by a vector w = (w0 , w1 , . . . , wp ) and f (x) = x⊤ w corresponds
to the signed distance between the input x and the hyperplane w: in one
subspace, the distance with any input is always positive, whereas in the
other subspace, the distance with any input is always negative. Figure 4
illustrates the decision function in the two-dimensional case where both
classes are linearly separable.
The sign of the signed distance corresponds to the decision function
of a linear binary classification model:
$$\hat{y} = \operatorname{sign}(f(x)) = \begin{cases} +1 & \text{if } f(x) > 0 \\ -1 & \text{if } f(x) < 0 \end{cases}$$
Logistic regression models the probability of the positive class with the sigmoid function $\sigma(u) = \frac{1}{1 + \exp(-u)}$ applied to the signed distance, $P(y = +1 \mid x = x) = \sigma(f(x))$, so that the probability of the negative class is:
$$P(y = -1 \mid x = x) = 1 - P(y = +1 \mid x = x) = \frac{\exp(-f(x))}{1 + \exp(-f(x))} = \frac{1}{1 + \exp(f(x))} = \sigma(-f(x))$$
The w coefficients are estimated by maximizing the likelihood of the training data, that is the product of the probabilities $P(y = y^{(i)} \mid x = x^{(i)}) = \sigma\big(y^{(i)} f(x^{(i)}; w)\big)$. Taking the logarithm gives:
$$\log(L(w)) = \sum_{i=1}^{n} \log \sigma\big(y^{(i)} f(x^{(i)}; w)\big) = \sum_{i=1}^{n} -\log\big(1 + \exp(-y^{(i)} x^{(i)\top} w)\big) = -\sum_{i=1}^{n} \log\big(1 + \exp(-y^{(i)} x^{(i)\top} w)\big)$$
We can see that the w coefficients that maximize the likelihood are also the coefficients that minimize the sum of the logistic loss values, with the logistic loss being defined as:
$$\ell_{\text{logistic}}(y, f(x)) = \log\big(1 + \exp(-y\, f(x))\big)$$
Unlike for linear regression, there is no closed formula for this minimization. One thus needs to use an iterative optimization method, such as gradient descent.
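Below is a minimal sketch of such an optimization, minimizing the average logistic loss by plain gradient descent with NumPy; the learning rate, number of iterations and synthetic data are illustrative assumptions, and library implementations (e.g., scikit-learn's LogisticRegression) rely on more sophisticated solvers.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def fit_logistic_regression(X, y, lr=0.1, n_iter=1000):
    """Minimize the mean logistic loss with gradient descent (labels in {-1, +1})."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        margins = y * (X @ w)                        # y^(i) * f(x^(i))
        # gradient of (1/n) * sum_i log(1 + exp(-margins_i))
        grad = -(X.T @ (y * sigmoid(-margins))) / n
        w -= lr * grad
    return w

# Example usage on a linearly separable synthetic dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
w = fit_logistic_regression(X, y)
y_pred = np.sign(X @ w)
print("training accuracy:", np.mean(y_pred == y))
```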
7. Penalized models
Depending on the class of algorithms, there exist different strategies to
tackle overfitting.
For neighbor methods, the number of neighbors used to define the
neighborhood of any input and the strategy to compute the weights are
the key hyperparameters to control the bias-variance trade-off. For mod-
els that are presented in the remaining sections of this chapter, we men-
tion strategies to address the bias-variance trade-off in their respective
sections. In this section, we present the most commonly used strategies
for models whose parameters are optimized by minimizing a cost function
defined as the mean loss values over all the training samples:
$$\min_w J(w) \quad \text{with} \quad J(w) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(y^{(i)}, f(x^{(i)}; w)\big)$$
This is for instance the case of the linear and logistic regression methods
presented in the previous sections.
7.1 Penalties
The main idea is to introduce a penalty term Pen(w) that will constrain
the parameters w to have some desired properties. The most common
penalties are the ℓ2 penalty, the ℓ1 penalty and the elastic-net penalty.
7.1.1. ℓ2 penalty
The ℓ2 penalty is defined as the squared ℓ2 norm of the w coefficients:
$$\ell_2(w) = \|w\|_2^2 = \sum_{j=1}^{p} w_j^2$$
The ℓ2 penalty forces each coefficient wi not to be too large and makes
the coefficients more robust to collinearity (i.e., when some features are
approximately linear combinations of the other features).
Figure 6: Unit balls of the ℓ0 , ℓ1 and ℓ2 norms. For each norm, the set
of points in R2 whose norm is equal to 1 is plotted. The ℓ1 norm is the
best convex approximation to the ℓ0 norm. Note that the lines for the ℓ0
norm extend to −∞ and +∞, but are cut for plotting reasons.
7.1.2. ℓ1 penalty
The ℓ2 penalty forces the values of the parameters not to be too large, but it gives no incentive to shrink small values all the way to zero. Indeed, the
square of a small value is even smaller. When the number of features is
large, or when interpretability is important, it can be useful to make the
algorithm select the most important features. The corresponding metric
is the ℓ0 “norm” (which is not a proper norm in the mathematical sense),
defined as the number of nonzero elements:
$$\ell_0(w) = \|w\|_0 = \sum_{j=1}^{p} \mathbf{1}_{w_j \neq 0}$$
However, the ℓ0 “norm” is neither differentiable nor convex, which makes its direct minimization difficult. The ℓ1 penalty, defined as the ℓ1 norm of the coefficients,
$$\ell_1(w) = \|w\|_1 = \sum_{j=1}^{p} |w_j|$$
is its best convex approximation (see Figure 6) and also encourages sparse solutions. A first approach to penalization is to solve the constrained optimization problem
$$w^\star = \arg\min_{w \,:\, \text{Pen}(w) < c} J(w)$$
which reads as “Find the optimal parameters that minimize the cost function J among all the parameters w that satisfy Pen(w) < c” for a positive real number c. Figure 7 illustrates the optimal solution of a
simple linear regression task with different constraints. This figure also
highlights the sparsity property of the ℓ1 penalty (the optimal parameter
for the horizontal axis is set to zero) that the ℓ2 penalty does not have
(the optimal parameter for the horizontal axis is small but different from
zero).
Although this approach is appealing due to its intuitiveness and the
possibility to set the maximum possible penalty on the parameters w, it
leads to a minimization problem that is not trivial to solve. A similar approach consists in adding the regularization term to the cost function:
$$\min_w J(w) + \lambda\, \text{Pen}(w)$$
where $\lambda > 0$ is a hyperparameter that controls the strength of the regularization.
The penalties can also be added in other models such as logistic regres-
sion, support-vector machines, artificial neural networks, etc.
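As an illustration, the sketch below solves ℓ2-penalized least squares (ridge regression) in closed form, using the penalized cost $\|Xw - y\|_2^2 + \lambda \|w\|_2^2$; the synthetic data and the values of λ are illustrative assumptions.

```python
import numpy as np

def ridge_regression(X, y, lam=1.0):
    """Minimize ||Xw - y||^2 + lam * ||w||^2 (closed-form solution)."""
    p = X.shape[1]
    # Setting the gradient to zero gives (X^T X + lam * I) w = X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=50)

for lam in (0.01, 1.0, 100.0):
    w = ridge_regression(X, y, lam)
    print(lam, np.round(w[:3], 3))   # larger lam shrinks the coefficients toward zero
```

In scikit-learn, the Ridge, Lasso and ElasticNet estimators implement the ℓ2, ℓ1 and elastic-net penalties for linear regression.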
8. Support-vector machine
Linear and logistic regression take into account every training sample
in order to find the best line, which is due to their corresponding loss
functions: the squared error is zero only if the true and predicted outputs
are equal, and the logistic loss is always positive. One could argue that
the training samples whose outputs are “easily” well predicted are not
relevant: only the training samples whose outputs are not “easily” well
predicted or are wrongly predicted should be taken into account. The
support vector machine (SVM) algorithm is based on this principle.
The original support vector machine algorithm was invented in 1963 and
was a linear binary classification algorithm [10]. Figure 9 illustrates the
main concept of its original version. When both classes are linearly sep-
arable, there exists an infinite number of hyperplanes that separate both
classes. The SVM algorithm finds the hyperplane that maximizes the
margin, that is the distance between the hyperplane and the closest points
of both classes to the hyperplane, while linearly separating both classes.
The SVM algorithm was later updated to non-separable classes [11].
Figure 10 illustrates the role of the margin in this case. The dashed lines
correspond to the hyperplanes defined by the equations x⊤ w = +1 and
x⊤ w = −1. The margin is the distance between both hyperplanes and
is equal to $2/\|w\|_2$. It defines which samples are included in the decision
function of the model: a sample is included if and only if it is inside
the margin, or outside the margin and misclassified. Such samples are
called support vectors and are illustrated in Figure 10 with a black circle
surrounding them. In this case, the margin can be seen as a regularization term: the larger the margin is, the more support vectors are included in the decision function, and the more regularized the model is.
Figure 11: Binary classification losses. The logistic loss is always pos-
itive, even when the point is accurately classified with high confidence
(i.e., when yf (x) ≫ 0), whereas the hinge loss is equal to zero when the
point is accurately classified with good confidence (i.e., when yf (x) ≥ 1).
The loss function for the SVM algorithm is called the hinge loss and is defined as:
$$\ell_{\text{hinge}}(y, f(x)) = \max\big(0, 1 - y\, f(x)\big)$$
Figure 11 illustrates the curves of the logistic and hinge losses. The
logistic loss is always positive, even when the point is accurately classified
with high confidence (i.e., when yf (x) ≫ 0), whereas the hinge loss is
equal to zero when the point is accurately classified with good confidence
(i.e., when yf (x) ≥ 1). One can see that a sample (x, y) can be a support vector only if yf (x) ≤ 1; samples with yf (x) > 1 have a hinge loss equal to zero, ℓhinge (y, f (x)) = 0, and do not contribute to the decision function.
The optimal w coefficients for the original version are estimated by
minimizing an objective function consisting of the sum of the hinge loss
values and a ℓ2 penalty term (which is inversely proportional to the mar-
gin):
$$\min_w \; \sum_{i=1}^{n} \max\big(0, 1 - y^{(i)} x^{(i)\top} w\big) + \frac{1}{2C} \|w\|_2^2$$
To obtain non-linear decision functions, one can map the inputs from the input space I to a feature space G using a function ϕ:
$$\phi : \mathcal{I} \to \mathcal{G}, \quad x \mapsto \phi(x)$$
The decision function would still be linear (with a dot product), but in
the feature space:
f (x) = ϕ(x)⊤ w
Unfortunately, solving the corresponding minimization problem is not
trivial:
$$\min_w \; \sum_{i=1}^{n} \max\big(0, 1 - y^{(i)} \phi(x^{(i)})^\top w\big) + \frac{1}{2C} \|w\|_2^2 \qquad (3)$$
Nonetheless, two mathematical properties make the use of non-linear
transformations in the feature space possible: the kernel trick and the
representer theorem.
The kernel trick asserts that the dot product in the feature space can be computed using only the points from the input space and a kernel function, denoted by K:
$$\forall x, x' \in \mathcal{I}, \quad K(x, x') = \phi(x)^\top \phi(x')$$
The representer theorem [13, 14] asserts that, under certain conditions
on the kernel K and the feature space G associated with the function ϕ,
any minimizer of Equation 3 admits the following form:
$$f = \sum_{i=1}^{n} \alpha_i K(\cdot, x^{(i)})$$
where α solves:
$$\min_\alpha \; \sum_{i=1}^{n} \max\big(0, 1 - y^{(i)} [K\alpha]_i\big) + \frac{1}{2C}\, \alpha^\top K \alpha$$
where $[K\alpha]_i$ denotes the $i$-th entry of the vector $K\alpha$, $K$ being the $n \times n$ matrix with entries $K_{ij} = K(x^{(i)}, x^{(j)})$.
The training samples $x^{(i)}$ whose coefficients $\alpha_i$ are non-zero are the support vectors, and only they contribute to the decision function:
$$SV = \{ i \in \{1, \ldots, n\} \mid \alpha_i \neq 0 \}$$
$$f(x) = \sum_{i=1}^{n} \alpha_i K(x, x^{(i)}) = \sum_{i \in SV} \alpha_i K(x, x^{(i)})$$
The kernel trick and the representer theorem show that it is more
practical to work with the kernel K instead of the mapping function ϕ.
Popular kernel functions include:
• the linear kernel: $K(x, x') = x^\top x'$
• the polynomial kernel: $K(x, x') = \big(\gamma\, x^\top x' + c_0\big)^d$ with $\gamma > 0$, $c_0 \geq 0$, $d \in \mathbb{N}^\star$
• the radial basis function (RBF) kernel: $K(x, x') = \exp\big(-\gamma \|x - x'\|_2^2\big)$ with $\gamma > 0$
• the sigmoid kernel: $K(x, x') = \tanh\big(\gamma\, x^\top x' + c_0\big)$ with $\gamma > 0$, $c_0 \geq 0$
The linear kernel yields a linear decision function and is actually iden-
tical to the original formulation of the SVM algorithm (one can show
that there is a mapping between the α and w coefficients). Non-linear
kernels allow for non-linear, more complex, decision functions. This is
particularly useful when the data is not linearly separable, which is the
most common use case. Figure 12 illustrates the decision function and
the margin of a SVM classification model for four different kernels.
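For reference, here is a short sketch of how such models can be fitted with scikit-learn's SVC estimator; the dataset and the hyperparameter values are illustrative assumptions, not the original chapter's experiments.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Non-linearly separable toy data
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compare the four kernels; C controls the strength of the regularization
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel, C=1.0).fit(X_train, y_train)
    print(kernel, round(clf.score(X_test, y_test), 3))
```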
The SVM algorithm was also extended to regression tasks with the
use of the ε-insensitive loss. Similarly to the hinge loss, which is equal to
zero for points that are correctly classified and outside the margin, the
ε-insensitive loss is equal to zero when the error between the true target value and the predicted value is not greater than ε:
$$\ell_{\varepsilon\text{-insensitive}}(y, f(x)) = \max\big(0, |y - f(x)| - \varepsilon\big)$$
The objective function for the SVM regression algorithm combines the
values of ε-insensitive loss of the training points and the ℓ2 penalty:
$$\min_w \; \sum_{i=1}^{n} \max\big(0, \big|y^{(i)} - \phi(x^{(i)})^\top w\big| - \varepsilon\big) + \frac{1}{2C} \|w\|_2^2$$
9. Multiclass classification
The classification algorithms that we presented so far, logistic regression
and support-vector machines, are binary classification algorithms: they
can only be used when there are only two possible outcomes. However,
in practice, it is common to have more than two possible outcomes. For
instance, differential diagnosis of brain disorders is often between several,
and not only two, diseases.
Several strategies have been proposed to extend any binary classi-
fication algorithm to multiclass classification tasks. They all rely on
transforming the multiclass classification task into several binary clas-
sification tasks. In this section, we present the most commonly used
strategies: one-vs-rest, one-vs-one and error correcting output code [15].
Figure 14 illustrates the main ideas of these approaches. But first, we present the natural extension of logistic regression to the multiclass setting, often referred to as multinomial or softmax regression.
Figure 13: Regression losses. The squared error loss takes very small
values for small errors and very large values for large errors, whereas the
absolute error loss takes small values for small errors and large values
for large errors. Both losses take small but non-zero values when the
error is small. On the contrary, the ε-insensitive loss is null when the
error is small and otherwise equal to the absolute error loss minus ε. When
computed over several samples, the squared and absolute error losses are
often referred to as mean squared error (MSE) and mean absolute error
(MAE) respectively.
9.1 Multinomial logistic regression
In the multiclass setting, each class Ck has its own vector of coefficients wk , and the sigmoid function is replaced with the softmax function to model the probability of each class:
$$\forall k \in \{1, \ldots, q\}, \quad P(y = C_k \mid x = x) = \frac{\exp\big(x^\top w_k\big)}{\sum_{j=1}^{q} \exp\big(x^\top w_j\big)}$$
The coefficients (wk )1≤k≤q are still estimated by maximizing the like-
lihood function:
$$L(w_1, \ldots, w_q) = \prod_{i=1}^{n} \prod_{k=1}^{q} P\big(y = C_k \mid x = x^{(i)}\big)^{\mathbf{1}_{y^{(i)} = C_k}}$$
Equivalently, one minimizes the negative log-likelihood:
$$-\log\big(L(w_1, \ldots, w_q)\big) = -\sum_{i=1}^{n} \sum_{k=1}^{q} \mathbf{1}_{y^{(i)} = C_k} \log P\big(y = C_k \mid x = x^{(i)}\big) = -\sum_{i=1}^{n} \sum_{k=1}^{q} \mathbf{1}_{y^{(i)} = C_k} \log\left(\frac{\exp\big(x^{(i)\top} w_k\big)}{\sum_{j=1}^{q} \exp\big(x^{(i)\top} w_j\big)}\right) = \sum_{i=1}^{n} \ell_{\text{cross-entropy}}\Big(y^{(i)}, \operatorname{softmax}\big(x^{(i)\top} w_1, \ldots, x^{(i)\top} w_q\big)\Big)$$
where ℓcross entropy is known as the cross-entropy loss and is defined, for
any label y and any vector of probabilities (π1 , . . . , πq ), as:
$$\ell_{\text{cross-entropy}}\big(y, (\pi_1, \ldots, \pi_q)\big) = -\sum_{k=1}^{q} \mathbf{1}_{y = C_k} \log(\pi_k)$$
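The following small NumPy sketch computes the softmax probabilities and the cross-entropy loss defined above; the shift by the maximum logit is a standard numerical-stability trick, and the toy values are illustrative.

```python
import numpy as np

def softmax(logits):
    """Convert scores x^T w_1, ..., x^T w_q into probabilities."""
    z = logits - np.max(logits)          # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

def cross_entropy(y_index, probas):
    """Cross-entropy loss for the true class index y_index."""
    return -np.log(probas[y_index])

logits = np.array([2.0, 0.5, -1.0])      # scores for three classes
probas = softmax(logits)
print(probas, probas.sum())              # probabilities summing to 1
print(cross_entropy(0, probas))          # small loss: class 0 has the largest score
```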
9.2 One-vs-rest
A strategy to transform a multiclass classification task into several binary
classification tasks is to fit a binary classification algorithm for each class:
the positive class is the given class, and the negative class consists of all
the other classes merged into a single class. This strategy is known as one-
vs-rest. The advantage of this strategy is that each class is characterized
by a single model, so that it is possible to gain deeper knowledge about each class by inspecting its corresponding model.
9.3 One-vs-one
Another strategy is to fit a binary classification algorithm for each pair
of classes: this strategy is known as one-vs-one. The advantage of this
strategy is that the classes in each binary classification task are “pure”,
in the sense that different classes are never merged into a single class.
However, the number of binary classification algorithms that needs to be trained is larger for the one-vs-one strategy (q(q − 1)/2) than for the one-vs-rest strategy (q). Nonetheless, for the one-vs-one strategy, the number of training samples in each binary classification task is smaller than the
total number of samples, which makes training each binary classification
algorithm usually faster. Another drawback is that this strategy is less
interpretable compared to the one-vs-rest strategy, as the predicted class
corresponds to the class obtaining the most votes (i.e., winning the most
one-vs-one matchups), which does not take into account the confidence
in winning each matchup. For instance, winning a one-vs-one matchup
with 0.99 probability gives the same result as winning the same matchup
with 0.51 probability, i.e. one vote.
9.4 Error correcting output code
The error correcting output code strategy assigns a binary code word to each class and fits one binary classifier per bit of the code word; the predicted class is the one whose code word is the closest to the vector of binary predictions according to a given distance (the lower, the more similar). There exist advanced strategies to define the code book, but it has been argued that a random code book usually gives results as good as a sophisticated one [16].
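As an illustration, scikit-learn provides generic wrappers implementing these three strategies around any binary classifier; the sketch below applies them to a linear SVM on a toy dataset (the dataset and hyperparameters are illustrative assumptions).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier, OutputCodeClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)             # 3-class classification task
base = LinearSVC(max_iter=10000)              # binary linear SVM used as the base estimator

strategies = {
    "one-vs-rest": OneVsRestClassifier(base),
    "one-vs-one": OneVsOneClassifier(base),
    "output code": OutputCodeClassifier(base, code_size=2, random_state=0),
}
for name, clf in strategies.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(name, round(scores.mean(), 3))
```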
10. Methods based on normal distributions
Several classification methods rely on the assumption that, conditionally on each class Ck , the input x follows a multivariate normal distribution with mean vector µk and covariance matrix Σk :
$$p_{x|y=C_k}(x) = \frac{1}{\sqrt{(2\pi)^p\, |\Sigma_k|}} \exp\left(-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]\right)$$
The posterior probability of each class is then given by Bayes' rule:
$$P(y = C_k \mid x = x) = \frac{p_{x|y=C_k}(x)\, P(y = C_k)}{p_x(x)}$$
It is more convenient to work with the logarithm of these probabilities:
$$\begin{aligned}
\log P(y = C_k \mid x = x) &= \log p_{x|y=C_k}(x) + \log P(y = C_k) - \log p_x(x) \\
&= -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] - \frac{1}{2} \log |\Sigma_k| + \log P(y = C_k) - \frac{p}{2} \log(2\pi) - \log p_x(x) \\
&= -\frac{1}{2} x^\top \Sigma_k^{-1} x + x^\top \Sigma_k^{-1} \mu_k - \frac{1}{2} \mu_k^\top \Sigma_k^{-1} \mu_k - \frac{1}{2} \log |\Sigma_k| + \log P(y = C_k) - \frac{p}{2} \log(2\pi) - \log p_x(x)
\end{aligned} \qquad (4)$$
A first simplifying assumption is that, for each class, the covariance matrix is proportional to the identity matrix:
$$\forall k, \quad \Sigma_k = \sigma_k^2 I_p$$
In this case, Equation 4 becomes:
$$\begin{aligned}
\log P(y = C_k \mid x = x) &= -\frac{1}{2\sigma_k^2} x^\top x + \frac{1}{\sigma_k^2} x^\top \mu_k - \frac{1}{2\sigma_k^2} \mu_k^\top \mu_k - p \log \sigma_k + \log P(y = C_k) - \frac{p}{2} \log(2\pi) - \log p_x(x) \\
&= x^\top W_k x + x^\top w_k + w_{0k} + s
\end{aligned}$$
where:
• $W_k = -\frac{1}{2\sigma_k^2} I_p$ is the matrix of the quadratic term for class $C_k$,
• $w_k = \frac{1}{\sigma_k^2} \mu_k$ is the vector of the linear term for class $C_k$,
• $w_{0k} = -\frac{1}{2\sigma_k^2} \mu_k^\top \mu_k - p \log \sigma_k + \log P(y = C_k)$ is the intercept for class $C_k$, and
• $s = -\frac{p}{2} \log(2\pi) - \log p_x(x)$ is a term that does not depend on the class $C_k$.
The probabilities for input x to belong to each class $C_k$ can then be computed as:
$$P(y = C_k \mid x = x) = \frac{\exp\big(x^\top W_k x + x^\top w_k + w_{0k}\big)}{\sum_{j=1}^{q} \exp\big(x^\top W_j x + x^\top w_j + w_{0j}\big)}$$
A further simplifying assumption is that all the classes share the same covariance matrix, still proportional to the identity matrix:
$$\forall k, \quad \Sigma_k = \sigma^2 I_p$$
In this case, the quadratic term no longer depends on the class and the model becomes linear:
$$\begin{aligned}
\log P(y = C_k \mid x = x) &= -\frac{1}{2\sigma^2} x^\top x + \frac{1}{\sigma^2} x^\top \mu_k - \frac{1}{2\sigma^2} \mu_k^\top \mu_k - p \log \sigma + \log P(y = C_k) - \frac{p}{2} \log(2\pi) - \log p_x(x) \\
&= x^\top w_k + w_{0k} + s
\end{aligned}$$
where:
• $w_k = \frac{1}{\sigma^2} \mu_k$ is the vector of the linear term for class $C_k$,
• $w_{0k} = -\frac{1}{2\sigma^2} \mu_k^\top \mu_k + \log P(y = C_k)$ is the intercept for class $C_k$, and
• $s = -\frac{1}{2\sigma^2} x^\top x - p \log \sigma - \frac{p}{2} \log(2\pi) - \log p_x(x)$ is a term that does not depend on the class $C_k$.
Another case is when the classes share the same, not necessarily spherical, covariance matrix:
$$\forall k, \quad \Sigma_k = \Sigma$$
This model is known as linear discriminant analysis. Equation 4 becomes:
$$\begin{aligned}
\log P(y = C_k \mid x = x) &= -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] - \frac{1}{2} \log |\Sigma| + \log P(y = C_k) - \frac{p}{2} \log(2\pi) - \log p_x(x) \\
&= -\frac{1}{2} \left( x^\top \Sigma^{-1} x - x^\top \Sigma^{-1} \mu_k - \mu_k^\top \Sigma^{-1} x + \mu_k^\top \Sigma^{-1} \mu_k \right) - \frac{1}{2} \log |\Sigma| + \log P(y = C_k) - \frac{p}{2} \log(2\pi) - \log p_x(x) \\
&= x^\top \Sigma^{-1} \mu_k - \frac{1}{2} x^\top \Sigma^{-1} x - \frac{1}{2} \mu_k^\top \Sigma^{-1} \mu_k + \log P(y = C_k) - \frac{1}{2} \log |\Sigma| - \frac{p}{2} \log(2\pi) - \log p_x(x) \\
&= x^\top w_k + w_{0k} + s
\end{aligned}$$
where:
• $w_k = \Sigma^{-1} \mu_k$ is the vector of coefficients for class $C_k$,
• $w_{0k} = -\frac{1}{2} \mu_k^\top \Sigma^{-1} \mu_k + \log P(y = C_k)$ is the intercept for class $C_k$, and
• $s = -\frac{1}{2} x^\top \Sigma^{-1} x - \frac{1}{2} \log |\Sigma| - \frac{p}{2} \log(2\pi) - \log p_x(x)$ is a term that does not depend on the class $C_k$.
Therefore, linear discriminant analysis is a linear model.
Finally, in the general case where each class has its own covariance matrix $\Sigma_k$, Equation 4 can be written as:
$$\begin{aligned}
\log P(y = C_k \mid x = x) &= -\frac{1}{2} x^\top \Sigma_k^{-1} x + x^\top \Sigma_k^{-1} \mu_k - \frac{1}{2} \mu_k^\top \Sigma_k^{-1} \mu_k - \frac{1}{2} \log |\Sigma_k| + \log P(y = C_k) - \frac{p}{2} \log(2\pi) - \log p_x(x) \\
&= x^\top W_k x + x^\top w_k + w_{0k} + s
\end{aligned}$$
where:
• $W_k = -\frac{1}{2} \Sigma_k^{-1}$ is the matrix of the quadratic term for class $C_k$,
• $w_k = \Sigma_k^{-1} \mu_k$ is the vector of the linear term for class $C_k$,
• $w_{0k} = -\frac{1}{2} \mu_k^\top \Sigma_k^{-1} \mu_k - \frac{1}{2} \log |\Sigma_k| + \log P(y = C_k)$ is the intercept for class $C_k$, and
• $s = -\frac{p}{2} \log(2\pi) - \log p_x(x)$ is a term that does not depend on the class $C_k$.
Therefore, quadratic discriminant analysis is a quadratic model.
The probabilities for input x to belong to each class Ck can then easily
be computed:
$$P(y = C_k \mid x = x) = \frac{\exp\big(x^\top W_k x + x^\top w_k + w_{0k}\big)}{\sum_{j=1}^{q} \exp\big(x^\top W_j x + x^\top w_j + w_{0j}\big)}$$
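As an illustration, both linear and quadratic discriminant analysis are available in scikit-learn; the sketch below fits them on a toy dataset (the dataset and the absence of hyperparameter tuning are illustrative assumptions).

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Linear decision boundaries (shared covariance matrix)
lda = LinearDiscriminantAnalysis()
# Quadratic decision boundaries (one covariance matrix per class)
qda = QuadraticDiscriminantAnalysis()

for name, clf in (("LDA", lda), ("QDA", qda)):
    print(name, round(cross_val_score(clf, X, y, cv=5).mean(), 3))
```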
11. Decision trees and random forests
Figure: a real-life example of a decision process for medical triage, in which, depending on the severity of the symptoms and on whether a specialist can quickly be consulted, one is advised to consult their general practitioner, consult a specialist, or call emergency services.
Figure 18: A decision tree: (left) the rules learned by the decision tree,
and (right) the corresponding decision function.
For classification tasks, the impurity of a subset of training samples is computed from the proportions $p_k$ of samples belonging to each class $C_k$ in this subset. The most common impurity criteria are:
• Gini index: $\sum_k p_k (1 - p_k)$
• Entropy: $-\sum_k p_k \log(p_k)$
• Misclassification: $1 - \max_k p_k$
Figure 19 illustrates the values of the Gini index and the entropy for
a single class Ck and for different proportions of samples pk . One can
see that the entropy function takes larger values than the Gini index,
especially for pk < 0.8. Since the sum of the proportions is equal to 1,
most classes only represent a small proportion of the samples. Therefore,
a simple interpretation is that entropy is more discriminative against
heterogeneous subsets than the Gini index. Misclassification only takes
into account the proportion of the most common class and tends to be
even less discriminative against heterogeneous subsets than both entropy
and Gini index.
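To make the comparison concrete, here is a small NumPy sketch computing the three impurity criteria from a vector of class proportions; the example proportions are illustrative.

```python
import numpy as np

def gini(p):
    return np.sum(p * (1 - p))

def entropy(p):
    p = p[p > 0]                      # convention: 0 * log(0) = 0
    return -np.sum(p * np.log(p))

def misclassification(p):
    return 1 - np.max(p)

p_pure = np.array([1.0, 0.0, 0.0])        # homogeneous subset
p_mixed = np.array([0.4, 0.4, 0.2])       # heterogeneous subset
for p in (p_pure, p_mixed):
    print(round(gini(p), 3), round(entropy(p), 3), round(misclassification(p), 3))
```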
For regression tasks, the mean error from a reference value (such as
the mean or the median) is often used as the impurity criterion:
• Mean squared error: $\frac{1}{|S|} \sum_{y \in S} (y - \bar{y})^2$ with $\bar{y} = \frac{1}{|S|} \sum_{y \in S} y$
• Mean absolute error: $\frac{1}{|S|} \sum_{y \in S} |y - \operatorname{median}_S(y)|$
Theoretically, a tree can grow until every leaf node is perfectly pure.
However, such a tree would have a lot of branches and would be very
complex, making it prone to overfitting. Several strategies are commonly
used to limit the size of the tree. One approach consists in growing the
tree with no restriction, then pruning the tree, that is replacing subtrees
with nodes [17]. Other popular strategies to limit the complexity of the tree are usually applied while the tree is grown, and include setting a maximum depth for the tree, a minimum number of samples required to split a node, or a minimum number of samples required in a leaf node.
Figure 19: Illustration of Gini index and entropy. The entropy function
takes larger values than the Gini index, especially for pk < 0.8, and is thus more discriminative against heterogeneous subsets (when most classes represent only a small proportion of the samples) than the Gini index.
Random forests combine the predictions of many decision trees built on randomly perturbed versions of the training set. In a bid to have trees that are not perfectly correlated (thus building actually different trees), each tree is built using only a subset of the
training samples obtained with random sampling. Moreover, for each
decision node of each tree, only a subset of the features are considered
to find the best split.
The final prediction is obtained by averaging the predictions of each
tree. For classification tasks, the predicted class is either the most com-
monly predicted class (hard-voting) or the one with the highest mean
probability estimate (soft-voting) across the trees. For regression tasks,
the predicted value is usually the mean of the predicted values across the
trees.
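A minimal, hedged example of fitting such an ensemble with scikit-learn is shown below; the dataset and hyperparameter values are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# n_estimators: number of trees; max_features: number of features considered at each split
clf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
print(round(cross_val_score(clf, X, y, cv=5).mean(), 3))
```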
12. Clustering
So far, we have presented classic machine learning algorithms for clas-
sification and regression, which are the main components of supervised
learning. Each input x(i) had an associated output y (i) . In this section, we consider the unsupervised setting, where no output is available: clustering algorithms aim at grouping the samples into clusters of similar inputs.
12.1 k-means
The k-means algorithm divides a set of n samples, denoted by X , into a
set of k disjoint clusters, each denoted by Xj , such that X = {X1 , . . . , Xk }.
Figure 20 illustrates the concept of this algorithm. Each cluster Xj
is characterized by its centroid, denoted by µj , that is the mean of the
samples in this cluster:
$$\mu_j = \frac{1}{|X_j|} \sum_{x^{(i)} \in X_j} x^{(i)}$$
The centroids fully define the set of clusters because each sample is as-
signed to the cluster whose centroid is the closest.
The k-means algorithm aims at finding centroids that minimize the
inertia, also known as within-cluster sum-of-squares criterion:
$$\min_{\{\mu_1, \ldots, \mu_k\}} \; \sum_{j=1}^{k} \sum_{x^{(i)} \in X_j} \|x^{(i)} - \mu_j\|_2^2$$
The original algorithm used to find the centroids is often referred to as Lloyd's algorithm [20] and is presented in Algorithm 1. After initializing
the centroids, a two-step loop is repeated until convergence (when the
centroids are identical for two consecutive iterations) consisting of:
1. the assignment step, where the clusters are updated based on the
current centroids, and
2. the update step, where the centroids are updated based on the cur-
rent clusters.
More precisely, the assignment step updates the clusters as follows:
$$\forall j \in \{1, \ldots, k\}, \quad X_j = \big\{x^{(i)} \in \mathcal{X} \;\big|\; \|x^{(i)} - \mu_j\|_2^2 = \min_l \|x^{(i)} - \mu_l\|_2^2\big\}$$
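The following NumPy sketch implements Lloyd's algorithm as described above; the random initialization of the centroids and the synthetic data are illustrative choices, and library implementations such as scikit-learn's KMeans use smarter initializations like k-means++ [22].

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]    # random initial centroids
    for _ in range(n_iter):
        # Assignment step: index of the closest centroid for each sample
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned samples
        # (this simple sketch assumes no cluster becomes empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):                # convergence reached
            break
        centroids = new_centroids
    return centroids, labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in ((0, 0), (5, 5), (0, 5))])
centroids, labels = kmeans(X, k=3)
print(np.round(centroids, 2))
```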
12.2 Gaussian mixture model
A Gaussian mixture model assumes that each sample is generated from one of k distributions:
$$x = \sum_{j=1}^{k} \Delta_j x_j$$
The density of x is thus a weighted sum of the densities of the k distributions, with weights $\pi_j$ summing to 1:
$$\forall x \in \mathcal{X}, \quad p_x(x; \theta) = \sum_{j=1}^{k} \pi_j\, p_j(x; \theta_j)$$
Given the current parameters, the probability for each sample x(i) to have been generated from the j-th distribution is:
$$\forall i \in \{1, \ldots, n\},\ \forall j \in \{1, \ldots, k\}, \quad \gamma_i(j) = \frac{\pi_j\, p_j\big(x^{(i)}; \theta_j, \Sigma_j\big)}{\sum_{l=1}^{k} \pi_l\, p_l\big(x^{(i)}; \theta_l, \Sigma_l\big)}$$
These probabilities are at the core of the expectation-maximization (EM) algorithm, which alternates between two steps:
• the expectation step, in which the probability for each sample x(i) to have been generated from distribution Fj is computed, and
• the maximization step, in which the weights and the parameters of each distribution are updated based on these probabilities.
These two steps are analogous to the two steps of Lloyd's algorithm: the update step computes the centroid of each cluster as the mean of the samples in
a given cluster, while the maximization step updates the probability and
the parameters of each distribution as a weighted average over all the
samples. For these reasons, the k-means algorithm is often referred to as
a hard-voting clustering algorithm, as opposed to the Gaussian mixture
model which is referred to as a soft-voting clustering algorithm.
The Gaussian mixture model has several advantages over the k-means
algorithm.
First, the use of normal distribution densities instead of Euclidean distances lessens the distance inflation issue in high-dimensional spaces. Second, the Gaussian mixture model includes covariance matrices, allowing for clusters with elliptical shapes, while the k-means algorithm only includes centroids, forcing clusters to have circular shapes.
Nonetheless, the Gaussian mixture model also has several drawbacks,
sharing a few with the k-means algorithm.
First, the number of distributions k is a hyperparameter. Setting
a value much different from the actual number of clusters may yield
poor clusters. Second, the log-likelihood is not a concave function. Like
Lloyd’s algorithm, the EM algorithm is guaranteed to converge but it
may converge to a local maximum that is not a global maximum. Sev-
eral strategies are often applied to address this issue, including sophis-
ticated centroid initialization [22] and running the algorithm numerous
times and keeping the best run (i.e., the one yielding the highest log-
likelihood). Third, the Gaussian mixture model has more parameters
than the k-means algorithm. Therefore, it usually requires more sam-
ples to accurately estimate its parameters (in particular the covariance
matrices) than the k-means algorithm.
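For illustration, the sketch below fits both clustering models with scikit-learn on synthetic data; the data, the number of clusters and the covariance type are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two elongated (elliptical) clusters
X = np.vstack([
    rng.multivariate_normal([0, 0], [[4, 0], [0, 0.2]], size=200),
    rng.multivariate_normal([0, 3], [[4, 0], [0, 0.2]], size=200),
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

print("k-means centroids:\n", np.round(km.cluster_centers_, 2))
print("GMM means:\n", np.round(gmm.means_, 2))
print("GMM soft assignment of the first sample:", np.round(gmm.predict_proba(X[:1]), 3))
```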
13. Dimensionality reduction
Dimensionality reduction consists in finding a lower-dimensional representation of the data. The most popular method is principal component analysis (PCA), which computes a linear transformation of the input defined by a matrix W whose columns are called the principal components:
$$t = x^\top W$$
The first principal component $w^{(1)}$ is the unit vector that maximizes the explained variance:
$$w^{(1)} = \arg\max_{\|w\|=1} \left\{ \sum_{i=1}^{n} \big(x^{(i)\top} w\big)^2 \right\}$$
$$w^{(1)} = \arg\max_{w \in \mathbb{R}^p} \frac{w^\top X^\top X w}{w^\top w}$$
As X ⊤ X is a positive semi-definite matrix, a well known result from
linear algebra is that w(1) is the eigenvector associated with the largest
eigenvalue of X ⊤ X.
The k-th component is found by subtracting the first k − 1 principal
components from X:
$$\hat{X}_k = X - \sum_{s=1}^{k-1} X w^{(s)} w^{(s)\top}$$
and then finding the unit vector that explains the maximum variance
from this new data matrix:
$$w^{(k)} = \arg\max_{\|w\|=1} \left\{ \|\hat{X}_k w\| \right\} = \arg\max_{w \in \mathbb{R}^p} \frac{w^\top \hat{X}_k^\top \hat{X}_k w}{w^\top w}$$
One can show that the eigenvector associated with the k-th largest eigen-
value of the X ⊤ X matrix maximizes the quantity to be maximized.
Therefore, the matrix W is the matrix whose columns are the eigen-
vectors of the X ⊤ X matrix, sorted by descending order of their associ-
ated eigenvalues.
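The following NumPy sketch computes the principal components through the eigendecomposition of X⊤X after centering the features; the synthetic data is an illustrative assumption.

```python
import numpy as np

def pca(X, n_components):
    """Principal components as the leading eigenvectors of X^T X (after centering)."""
    X = X - X.mean(axis=0)                       # center the features
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]            # sort by decreasing eigenvalue
    W = eigvecs[:, order[:n_components]]         # columns = principal components
    return X @ W, W                              # projected data and components

rng = np.random.default_rng(0)
# 2-D data mostly spread along the direction (1, 1)
X = rng.normal(size=(300, 1)) @ np.array([[1.0, 1.0]]) + 0.1 * rng.normal(size=(300, 2))
T, W = pca(X, n_components=1)
print(np.round(W[:, 0], 3))                      # close to (0.707, 0.707) up to sign
```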
14. Kernel methods
Kernel methods make it possible to extend a linear algorithm to non-linear cases without explicitly computing the mapping ϕ to the feature space. Following the representer theorem, the prediction function is of the form $f = \sum_{i=1}^{n} \alpha_i K(\cdot, x^{(i)})$, where K is the kernel function, which is equal to the dot product in the feature space:
$$\forall x, x' \in \mathcal{I}, \quad K(x, x') = \phi(x)^\top \phi(x')$$
As this term frequently appears, we denote by K the n × n symmetric matrix consisting of the evaluations of the kernel on all the pairs of training samples:
$$\forall i, j \in \{1, \ldots, n\}, \quad K_{ij} = K\big(x^{(i)}, x^{(j)}\big)$$
The cost function can be simplified using the specific form of the possible
functions:
$$\sum_{i=1}^{n} \big(y^{(i)} - f(x^{(i)})\big)^2 + \lambda \|f\|^2 = \sum_{i=1}^{n} \left(y^{(i)} - \sum_{j=1}^{n} \alpha_j K\big(x^{(i)}, x^{(j)}\big)\right)^2 + \lambda \left\| \sum_{i=1}^{n} \alpha_i K(\cdot, x^{(i)}) \right\|^2 = \sum_{i=1}^{n} \big(y^{(i)} - \alpha^\top K_{:,i}\big)^2 + \lambda\, \alpha^\top K \alpha = \|y - K\alpha\|_2^2 + \lambda\, \alpha^\top K \alpha$$
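Setting the gradient of this cost with respect to α to zero, one can check that α = (K + λI)⁻¹y is a minimizer. The sketch below implements this kernel ridge regression with an RBF kernel; the kernel choice, its bandwidth and the synthetic data are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K(x, x') = exp(-gamma * ||x - x'||^2) for all pairs of rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

def kernel_ridge_fit(X, y, lam=1.0, gamma=1.0):
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)     # alpha = (K + lam I)^{-1} y

def kernel_ridge_predict(X_train, alpha, X_new, gamma=1.0):
    return rbf_kernel(X_new, X_train, gamma) @ alpha         # f(x) = sum_i alpha_i K(x, x^(i))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)
alpha = kernel_ridge_fit(X, y, lam=0.1, gamma=1.0)
print(np.round(kernel_ridge_predict(X, alpha, np.array([[0.0], [1.5]])), 2))  # ~ sin(0), sin(1.5)
```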
15. Conclusion
In this chapter, we described the main classic machine learning methods.
Due to space constraints, the description of some of them was brief. The
reader who seeks more details can refer to [5, 6]. All these approaches
are implemented in the scikit-learn Python library [27]. A common point
of the approaches presented in this chapter is that they use as input a
set of given or pre-extracted features. In contrast, deep learning approaches often provide an end-to-end learning setup within which the
features are learned. These techniques are covered in Chapters 3 to 6.
Acknowledgments
The authors would like to thank Hicham Janati for his fruitful remarks.
The authors would like to acknowledge the extensive documentation of
the scikit-learn Python package, in particular its user guide, for the rele-
vant information and references provided. We used the NumPy [28], mat-
plotlib [29] and scikit-learn [27] Python packages to generate all the fig-
ures. This work was supported by the French government under manage-
ment of Agence Nationale de la Recherche as part of the “Investissements
d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Insti-
tute) and reference ANR-10-IAIHU-06 (Agence Nationale de la Recherche-
10-IA Institut Hospitalo-Universitaire-6), and by the European Union
H2020 programme (grant number 826421, project TVB-Cloud).
References
[1] Goodfellow I, Bengio Y, Courville A (2016) Deep Learning. MIT Press, http://www.deeplearningbook.org
[2] Murphy KP (2012) Machine Learning: A Probabilistic Perspective. The MIT Press, Cambridge, MA
[3] Bentley JL (1975) Multidimensional binary search trees used for associative searching. Communications of the ACM 18(9):509–517
[4] Omohundro SM (1989) Five Balltree Construction Algorithms. Tech. rep., International Computer Science Institute
[5] Bishop CM (2006) Pattern Recognition and Machine Learning. Springer-Verlag
[6] Hastie T, Tibshirani R, Friedman J (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. Springer Series in Statistics, Springer-Verlag, New York
[7] Tikhonov AN, Arsenin VY, John F (1977) Solutions of Ill Posed Problems. John Wiley & Sons Inc, Washington: New York
[8] Tibshirani R (1996) Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society Series B (Methodological) 58(1):267–288
[9] Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67(2):301–320
[10] Vapnik VN, Lerner A (1963) Pattern recognition using generalized portrait method. Automation and Remote Control 24:774–780
[11] Cortes C, Vapnik V (1995) Support-vector networks. Machine Learning 20(3):273–297
[12] Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on Computational learning theory, Association for Computing Machinery, Pittsburgh, Pennsylvania, USA, COLT '92, pp 144–152
[13] Aizerman MA, Braverman EA, Rozonoer L (1964) Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control 25:821–837
[14] Schölkopf B, Herbrich R, Smola AJ (2001) A Generalized Representer Theorem. In: Computational Learning Theory, Springer, pp 416–426
[15] Aly M (2005) Survey on multiclass classification methods
[16] James G, Hastie T (1998) The Error Coding Method and PICTs. Journal of Computational and Graphical Statistics 7(3):377–387
[17] Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and Regression Trees. Taylor & Francis
[18] Breiman L (2001) Random Forests. Machine Learning 45(1):5–32
[19] Geurts P, Ernst D, Wehenkel L (2006) Extremely randomized trees. Machine Learning 63(1):3–42
[20] Lloyd S (1982) Least squares quantization in PCM. IEEE Transactions on Information Theory 28(2):129–137
[21] Elkan C (2003) Using the triangle inequality to accelerate k-means. In: Proceedings of the Twentieth International Conference on Machine Learning, pp 147–153
[22] Arthur D, Vassilvitskii S (2007) k-means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pp 1027–1035
[23] Aggarwal CC, Hinneburg A, Keim DA (2001) On the Surprising Behavior of Distance Metrics in High Dimensional Space. In: International Conference on Database Theory, Springer, pp 420–434
[24] Dempster AP, Laird NM, Rubin DB (1977) Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society Series B (Methodological) 39(1):1–38
[25] Jolliffe IT (2002) Principal Component Analysis, 2nd edn. Springer Science & Business Media
[26] Schölkopf B, Smola AJ, Müller KR (1999) Kernel principal component analysis. In: Advances in kernel methods: support vector learning, MIT Press, pp 327–352
[27] Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al (2011) Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research 12:2825–2830
[28] Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, Wieser E, Taylor J, Berg S, Smith NJ, et al (2020) Array programming with NumPy. Nature 585(7825):357–362
[29] Hunter JD (2007) Matplotlib: A 2D graphics environment. Computing in Science & Engineering 9(3):90–95