
Classic machine learning algorithms

Johann Faouzi, Olivier Colliot

To cite this version:


Johann Faouzi, Olivier Colliot. Classic machine learning algorithms. Olivier Colliot. Machine Learning for Brain Disorders, Springer, In press. ⟨hal-03830094v1⟩

HAL Id: hal-03830094


https://hal.science/hal-03830094v1
Submitted on 26 Oct 2022 (v1), last revised 25 Jan 2024 (v4)

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Distributed under a Creative Commons Attribution 4.0 International License


Chapter 2

Classic machine learning


algorithms

Johann Faouzi*1 and Olivier Colliot1


1 Sorbonne Université, Institut du Cerveau - Paris Brain Institute - ICM,
CNRS, Inria, Inserm, AP-HP, Hôpital de la Pitié-Salpêtrière, F-75013, Paris,
France
* Corresponding author: e-mail address: [email protected]

Abstract
In this chapter, we present the main classic machine learning algorithms.
A large part of the chapter is devoted to supervised learning algorithms
for classification and regression, including nearest-neighbor methods, lin-
ear and logistic regressions, support vector machines and tree-based algo-
rithms. We also describe the problem of overfitting as well as strategies
to overcome it. We finally provide a brief overview of unsupervised learn-
ing methods, namely for clustering and dimensionality reduction. The
chapter does not cover neural networks and deep learning as these will
be presented in Chapters 3, 4, 5 and 6.

Keywords: machine learning, classification, regression, clustering,


dimensionality reduction

1. Introduction
This chapter presents the main classic machine learning (ML) algorithms.
There is a focus on supervised learning methods for classification and re-
gression, but we also describe some unsupervised approaches. The chap-
ter is meant to be readable by someone with no background in machine
learning. It is nevertheless necessary to have some basic notions of linear


algebra, probabilities and statistics. If this is not the case, we refer the
reader to chapters 2 and 3 of [1].
The rest of this chapter is organized as follows. Rather than grouping
methods by categories (for instance classification or regression methods),
we chose to present methods by increasing order of complexity. We first
provide the notations in Section 2. We then describe a very intuitive fam-
ily of methods, that of nearest neighbors (Section 3). We continue with
linear regression (Section 4) and logistic regression (Section 5), the latter
being a classification technique. We subsequently introduce the problem
of overfitting (Section 6) as well as strategies to mitigate it (Section 7).
Section 8 describes support vector machines (SVM). Section 9 explains
how binary classification methods can be extended to a multi-class set-
ting. We then describe methods which are specifically adapted to the case
of normal distributions (Section 10). Decision trees and random forests
are described in Section 11. We then briefly describe some unsupervised
learning algorithms, namely for clustering (Section 12) and dimensional-
ity reduction (Section 13). The chapter ends with a description of kernel
methods which can be used to extend linear algorithms to non-linear
cases (Section 14). Box 1 summarizes the algorithms presented in this
chapter, grouped by categories and then sorted in order of appearance.

2. Notations
Let n be the number of samples and p be the number of features. An
input sample is thus a p-dimensional vector:
 
    x = (x1, . . . , xp)⊤

An output sample is denoted by y. Thus, a sample is (x, y). The dataset


of n samples can then be summarized as an n × p matrix X representing
the input data and an n-dimensional vector y representing the target
data:
   (1) (1)
  
x(1) x1 . . . x p y1
 ..   .. .
X= . = . . .. ..  , y =  ... 
  
x(n) (n)
x1
(n)
. . . xp yn

The input space is denoted by I and the set of training samples is denoted
by X .
In the case of regression, y is a real number. In the case of classifica-
tion, y is a single label. More precisely, y can only take one of a finite set


Box 1: Main classic ML algorithms

• Supervised learning

– Classification: nearest neighbors, logistic regression, sup-


port vector machine (SVM), naive Bayes, linear discriminant
analysis (LDA), quadratic discriminant analysis, tree-based
models (decision tree, random forest, extremely randomized
trees).
– Regression: nearest-neighbors, linear regression, support
vector machine regression, tree-based models (decision tree,
random forest, extremely randomized trees), kernel ridge re-
gression.

• Unsupervised learning

– Clustering: k-means, Gaussian mixture model.


– Dimensionality reduction: principal component analysis
(PCA), linear discriminant analysis (LDA), kernel principal
component analysis.


of values called labels. The set of possible classes (i.e., labels) is denoted
by C = {C1 , . . . , Cq }, with q being the number of classes. As the values
of the classes are not meaningful, when there are only two classes, the
classes are often called the positive and negative classes. In this case and
also for mathematical reasons, without loss of generality, we assume the
values of the classes to be +1 and −1.

3. Nearest-neighbor methods
One of the most intuitive approaches to machine learning is nearest neigh-
bors. It is based on the following intuition: for a given input, its corre-
sponding output is likely to be similar to the outputs of similar inputs. A
real-life metaphor would be that, if a subject has similar characteristics to other subjects who were diagnosed with a given disease, then this
subject is likely to also be suffering from this disease.
More formally, nearest-neighbor methods use the training samples
from the neighborhood of a given point x, denoted by N (x), to perform
prediction [2].
For regression tasks, the prediction is computed as a weighted mean
of the target values in N (x):
    ŷ = Σ_{x(i) ∈ N(x)} wi(x) y(i)

where wi(x) is the weight associated with x(i) for predicting the output of x, with wi(x) ≥ 0 ∀i and Σ_i wi(x) = 1.
For classification tasks, the predicted label corresponds to the label with the largest weighted sum of occurrences of each label:

    ŷ = arg max_{Ck} Σ_{x(i) ∈ N(x)} wi(x) 1_{y(i) = Ck}

A key parameter of nearest-neighbor methods is the metric, denoted


by d, that is a mathematical function that defines dissimilarity. The
metric is used to define the neighborhood of any point and can also be
used to compute the weights.

3.1 Metrics
Many metrics have been defined for various types of input data such
as vectors of real numbers, integers or booleans. Among these different
types, vectors of real numbers are one of the most common types of


input data, for which the most commonly used metric is the Euclidean
distance, defined as:
    ∀x, x′ ∈ I,  ∥x − x′∥₂ = √( Σ_{j=1}^p (xj − x′j)² )

The Euclidean distance is sometimes referred to as the “ordinary” dis-


tance since it is the one based on the Pythagorean theorem and that
everyone uses in their everyday lives.

3.2 Neighborhood
The two most common definitions of the neighborhood rely on either
the number of neighbors or the radius around the given point. Figure 1
illustrates the differences between both definitions.
The k-nearest neighbor method defines the neighborhood of a given
point x as the set of the k closest points to x:
    N(x) = {x(i)}_{i=1,...,k}  with  d(x, x(1)) ≤ . . . ≤ d(x, x(n))
The radius neighbor method defines the neighborhood of a given point
x as the set of points whose dissimilarity to x is smaller than the given
radius, denoted by r:
N (x) = {x(i) ∈ X | d(x, x(i) ) < r}

3.3 Weights
The two most common approaches to compute the weights are to use:
• uniform weights (all the weights are equal):

    ∀i, wi(x) = 1 / |N(x)|

• weights inversely proportional to the dissimilarity:

    ∀i, wi(x) = (1 / d(x(i), x)) / Σ_j (1 / d(x(j), x))

With uniform weights, every point in the neighborhood equally con-


tributes to the prediction. With weights inversely proportional to the
dissimilarity, closer points contribute more to the prediction than further
points. Figure 2 illustrates the different decision functions obtained with
uniform weights and weights inversely proportional to the dissimilarity
for a 3-nearest neighbor classification model.
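As an illustration of these two weighting schemes, here is a minimal sketch in Python, assuming the scikit-learn library and a synthetic dataset, neither of which is part of this chapter:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.neighbors import KNeighborsClassifier

    # Synthetic binary classification data (illustrative only).
    X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                               n_redundant=0, random_state=0)

    # 3-nearest neighbor classifiers with the two weighting schemes.
    knn_uniform = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X, y)
    knn_distance = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X, y)

    x_new = np.array([[0.5, -1.0]])
    print(knn_uniform.predict(x_new), knn_distance.predict(x_new))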


Figure 1: Different definitions of the neighborhood. On the left, the


neighborhood of a given point is the set of its 5 nearest neighbors. On
the right, the neighborhood of a given point is the set of points whose
dissimilarity is lower than the radius. For a given input, its neighborhood
may be different depending on the definition used. The Euclidean distance
is used as the metric in both examples.

Figure 2: Impact of the definition of the weights on the prediction func-


tion of a 3-nearest neighbor classification model. When the weights are
inversely proportional to the dissimilarity, the classifier is more sensitive to outliers since the predictions in the close neighborhood of any input are mostly dictated by the label of this input, independently of the number
of neighbors used. With uniform weights, the prediction function tends
to be smoother.


3.4 Neighbor search


The brute-force method to compute the neighborhood for n points with
p features is to compute the metric for each pair of inputs, which has
an O(n²p) algorithmic complexity (assuming that evaluating the metric
for a pair of inputs has a complexity of O(p), which is the case for most
metrics). However, it is possible to decrease this algorithmic complexity
if the metric is a distance, that is if the metric d satisfies the following
properties:
1. Non-negativity: ∀a, b, d(a, b) ≥ 0
2. Identity: ∀a, b, d(a, b) = 0 if and only if a = b
3. Symmetry: ∀a, b, d(a, b) = d(b, a)
4. Triangle Inequality: ∀a, b, c, d(a, b) + d(b, c) ≥ d(a, c)
The key property is the triangle inequality, which has a simple interpre-
tation: the shortest path between two points is a straight line. Math-
ematically, if a is far from c and c is close to b (i.e., d(a, c) is large
and d(b, c) is small), then a is far from b (i.e., d(a, b) is large). This is
obtained by rewriting the triangle inequality as follows:
∀a, b, c, d(a, b) ≥ d(a, c) − d(b, c)
This means that it is not necessary to compute d(a, b) in this case. There-
fore, the computational cost of a nearest neighbors search can be reduced
to O(n log(n)p) or better, which is a substantial improvement over the
brute-force method for large n. Two popular methods that take advan-
tage of this property are the K-dimensional tree structure [3] and the
ball tree structure [4].
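For illustration, a sketch of a neighbor search backed by a ball tree, assuming the scikit-learn library (the data and parameters are arbitrary):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))   # 1000 points with 10 features

    # Neighbor search using a ball tree (a k-d tree would use algorithm="kd_tree").
    nn = NearestNeighbors(n_neighbors=5, algorithm="ball_tree").fit(X)
    distances, indices = nn.kneighbors(X[:3])  # 5 nearest neighbors of the first 3 points
    print(indices.shape)  # (3, 5)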

4. Linear regression
Linear regression is a regression model that linearly combines the fea-
tures. Each feature is associated with a coefficient that represents the
relative weight of this feature compared to the other features. A real-life
metaphor would be to see the coefficients as the ingredients of a recipe:
the key is to find the best balance (i.e., proportions) between all the
ingredients in order to make the best cake.
Mathematically, a linear model is a model that linearly combines the
features [5]:
    f(x) = w0 + Σ_{j=1}^p wj xj


A common notation consists in including a 1 in x so that f (x) can be


written as the dot product between the vector x and the vector w:
    f(x) = w0 × 1 + Σ_{j=1}^p wj xj = x⊤w

where the vector w consists of:


• the intercept (also known as bias) w0 , and
• the coefficients (w1 , . . . , wp ), where each coefficient wj is associated
with the corresponding feature xj .
In the case of linear regression, f (x) is the predicted output:

ŷ = f (x) = x⊤ w

There are several methods to estimate the w coefficients. In this sec-


tion, we present the oldest one which is known as ordinary least squares
regression.
In the case of ordinary least squares regression, the cost function J is
the sum of the squared errors on the training data (see Figure 3):
    J(w) = Σ_{i=1}^n (y(i) − ŷ(i))² = Σ_{i=1}^n (y(i) − x(i)⊤w)² = ∥y − Xw∥₂²

One wants to find the optimal parameters w⋆ that minimize the cost
function:
    w⋆ = arg min_w J(w)

This optimization problem is convex, implying that any local minimum is


a global minimum, and differentiable, implying that every local minimum
has a null gradient. One therefore aims to find null gradients of the cost
function:

    ∇w J(w⋆) = 0
    ⟹ 2X⊤Xw⋆ − 2X⊤y = 0
    ⟹ X⊤Xw⋆ = X⊤y
    ⟹ w⋆ = (X⊤X)⁻¹ X⊤y

Ordinary least squares regression is one of the few machine learning optimization problems for which there exists a closed-form solution, i.e., the optimal solution can be computed using a finite number of standard operations such as addition, multiplication and evaluations of well-known functions.
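As a minimal sketch of this closed-form solution with NumPy (the synthetic data and variable names are ours, not the chapter's), where a column of ones is appended to X so that the intercept w0 is estimated together with the other coefficients:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 100, 3
    X = rng.normal(size=(n, p))
    y = 2.0 + X @ np.array([1.0, -3.0, 0.5]) + 0.1 * rng.normal(size=n)

    # Add a column of ones so that w[0] plays the role of the intercept w0.
    X1 = np.hstack([np.ones((n, 1)), X])

    # Closed-form ordinary least squares solution w* = (X^T X)^{-1} X^T y,
    # computed with a linear solver rather than an explicit matrix inverse.
    w_star = np.linalg.solve(X1.T @ X1, X1.T @ y)
    print(w_star)  # approximately [2.0, 1.0, -3.0, 0.5]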


Box 2: Linear regression

• Main idea: best hyperplane (i.e., line when p = 1, plane when p = 2) mapping the inputs to the outputs.

• Mathematical formulation: linear relationship between the predicted output ŷ and the input x that minimizes the sum of squared errors:

    ŷ = w0⋆ + Σ_{j=1}^p wj⋆ xj   with   w⋆ = arg min_w Σ_{i=1}^n (y(i) − x(i)⊤w)²

• Regularization: can be penalized to avoid overfitting (ridge), to


perform feature selection (lasso), or both (elastic-net). See Sec-
tion 7.

Figure 3: Ordinary least squares regression. The coefficients (that is the


intercept and the slope with a single predictor) are estimated by minimiz-
ing the sum of the squared errors.


5. Logistic regression
Intuitively, linear regression consists in finding the line that best fits the
data: the true output should be as close to the line as possible. For
binary classification, one wants the line to separate both classes as well
as possible: the samples from one class should all be in one subspace and
the samples from the other class should all be in the other subspace, with
the inputs being as far as possible from the line.
Mathematically, for binary classification tasks, a linear model is de-
fined by a hyperplane splitting the input space into two subspaces such
that each subspace is characteristic of one class. For instance, a line
splits a plane into two subspaces in the two-dimensional case, while a
plane splits a three-dimensional space into two subspaces. A hyperplane
is defined by a vector w = (w0 , w1 , . . . , wp ) and f (x) = x⊤ w corresponds
to the signed distance between the input x and the hyperplane w: in one
subspace, the distance with any input is always positive, whereas in the
other subspace, the distance with any input is always negative. Figure 4
illustrates the decision function in the two-dimensional case where both
classes are linearly separable.
The sign of the signed distance corresponds to the decision function
of a linear binary classification model:
    ŷ = sign(f(x)) = +1 if f(x) > 0,  −1 if f(x) < 0

The logistic regression model is a probabilistic linear model that trans-


forms the signed distance to the hyperplane into a probability using the
sigmoid function [6], denoted by σ(u) = 1 / (1 + exp(−u)).
Consider the linear model:

    f(x) = x⊤w = w0 + Σ_{j=1}^p wj xj

Then the probability of belonging to the positive class is:


    P(y = +1 | x = x) = σ(f(x)) = 1 / (1 + exp(−f(x)))

and that of belonging to the negative class is:

    P(y = −1 | x = x) = 1 − P(y = +1 | x = x)
                      = exp(−f(x)) / (1 + exp(−f(x)))


                      = 1 / (1 + exp(f(x)))
                      = σ(−f(x))

Figure 4: Decision function of a logistic regression model. A logistic regression is a linear model, that is, its decision function is linear. In the two-dimensional case, it separates a plane with a line.

By applying the inverse of the sigmoid function, which is known as


the logit function, one can see that the logarithm of the odds ratio is
modeled as a linear combination of the features:
   
    log( P(y = +1 | x = x) / P(y = −1 | x = x) ) = log( P(y = +1 | x = x) / (1 − P(y = +1 | x = x)) ) = f(x)

The w coefficients are estimated by maximizing the likelihood func-


tion, that is the function measuring the goodness of fit of the model to
the training data:
    L(w) = Π_{i=1}^n P(y = y(i) | x = x(i); w)

For computational reasons, it is easier to maximize the log-likelihood,


which is simply the logarithm of the likelihood:
    log(L(w)) = Σ_{i=1}^n log P(y = y(i) | x = x(i); w)


Box 3: Logistic regression

• Main idea: best hyperplane (i.e., line) that separates two classes.

• Mathematical formulation: the signed distance to the hyper-


plane is mapped into the probability to belong to the positive class
using the sigmoid function:
    f(x) = w0 + Σ_{j=1}^p wj xj

    P(y = +1 | x = x) = σ(f(x)) = 1 / (1 + exp(−f(x)))

• Estimation: likelihood maximization.

• Regularization: can be penalized to avoid overfitting (ℓ2 penalty),


to perform feature selection (ℓ1 penalty), or both (elastic-net
penalty).

Since P(y = y(i) | x = x(i); w) = σ(y(i) f(x(i); w)), the log-likelihood can be rewritten as:

    log(L(w)) = Σ_{i=1}^n log σ(y(i) f(x(i); w))
              = Σ_{i=1}^n −log(1 + exp(−y(i) x(i)⊤w))
              = −Σ_{i=1}^n log(1 + exp(−y(i) x(i)⊤w))

Finally, we can rewrite this maximization problem as a minimization problem by noticing that max_w log(L(w)) = −min_w −log(L(w)):

    max_w log(L(w)) = −min_w Σ_{i=1}^n log(1 + exp(−y(i) x(i)⊤w))

We can see that the w coefficients that maximize the likelihood are also the coefficients that minimize the sum of the logistic loss values, with the logistic loss being defined as:

    ℓlogistic(y, f(x)) = log(1 + exp(−y f(x))) / log(2)

Unlike for linear regression, there is no closed formula for this minimiza-
tion. One thus needs to use an optimization method such as for instance


gradient descent which was presented in Section 3 of Chapter 1. In prac-


tice, more sophisticated approaches such as quasi-Newton methods and
variants of stochastic gradient descent are often used.
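The following is a minimal sketch (ours, not the chapter's) of plain batch gradient descent on the logistic loss, using the convention y ∈ {−1, +1} adopted in this chapter; the learning rate and number of iterations are arbitrary:

    import numpy as np

    def sigmoid(u):
        return 1.0 / (1.0 + np.exp(-u))

    def fit_logistic_gd(X, y, lr=0.1, n_iter=1000):
        """Minimize (1/n) sum_i log(1 + exp(-y_i x_i^T w)) by gradient descent.

        X is (n, p) with a leading column of ones, y takes values in {-1, +1}.
        """
        n, p = X.shape
        w = np.zeros(p)
        for _ in range(n_iter):
            margins = y * (X @ w)
            # d/dw log(1 + exp(-m_i)) = -y_i x_i * sigmoid(-m_i)
            grad = -(X.T @ (y * sigmoid(-margins))) / n
            w -= lr * grad
        return w

    # Illustrative usage on synthetic data.
    rng = np.random.default_rng(0)
    X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
    y = np.sign(X[:, 1] - X[:, 2] + 0.1 * rng.normal(size=200))
    w = fit_logistic_gd(X, y)
    probas = sigmoid(X @ w)  # estimated P(y = +1 | x)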

6. Overfitting and regularization


The original formulations of ordinary least square regression and logistic
regression are unregularized models, that is the model is trained to fit the
training data as much as possible. Let us consider a real-life example as it
is very similar to human learning. If a person learns by heart the content
of a book, they are able to solve the exercises in the book, but unable to
apply the theoretical concepts to new exercises or real-life situations. If
a person only quickly reads through the book, they are probably unable
to solve either the exercises in the book or new exercises.
The corresponding concepts are known as overfitting and underfitting
in machine learning. Overfitting occurs when a model fits the training data too closely and generalizes poorly to new data. Conversely, underfitting occurs when a model does not capture the characteristics of the training data well enough, and thus also generalizes poorly to new data.
Overfitting and underfitting are related to frequently used terms in
machine learning: bias and variance. Bias is defined as the expected
(i.e., mean) difference between the true output and the predicted out-
put. Variance is defined as the variability of the predicted output. For
instance, let us consider an algorithm predicting the age of a person from
a picture. If the algorithm always underestimates or overestimates the
age, then the algorithm is biased. If the algorithm makes both large and
small errors, then the algorithm has a high variance.
Ideally, one would like to have a model with a small bias and a small
variance. However, the bias of a model tends to increase when decreasing
its variance, and the variance of the model tends to increase when de-
creasing its bias. This phenomenon is known as the bias-variance trade-
off. Figure 5 illustrates this phenomenon. One can also notice it by
computing the squared error between the true output y (fixed) and the
predicted output ŷ (random variable): its expected value is the sum of
the squared bias of ŷ and the variance of ŷ:

    E[(y − ŷ)²] = E[y² − 2yŷ + ŷ²]
                = y² − 2yE[ŷ] + E[ŷ²]
                = y² − 2yE[ŷ] + E[ŷ]² − E[ŷ]² + E[ŷ²]
                = (E[ŷ] − y)² + E[ŷ²] − E[ŷ]²
                = (E[ŷ] − y)² + E[(ŷ − E[ŷ])²]
                = (E[ŷ] − y)²  +  Var[ŷ]
                     bias²        variance

7. Penalized models
Depending on the class of algorithms, there exist different strategies to
tackle overfitting.
For neighbor methods, the number of neighbors used to define the
neighborhood of any input and the strategy to compute the weights are
the key hyperparameters to control the bias-variance trade-off. For mod-
els that are presented in the remaining sections of this chapter, we men-
tion strategies to address the bias-variance trade-off in their respective
sections. In this section, we present the most commonly used strategies
for models whose parameters are optimized by minimizing a cost function
defined as the mean loss values over all the training samples:
    min_w J(w)   with   J(w) = (1/n) Σ_{i=1}^n ℓ(y(i), f(x(i); w))
This is for instance the case of the linear and logistic regression methods
presented in the previous sections.

7.1 Penalties
The main idea is to introduce a penalty term Pen(w) that will constrain the parameters w to have some desired properties. The most common
penalties are the ℓ2 penalty, the ℓ1 penalty and the elastic-net penalty.

7.1.1. ℓ2 penalty
The ℓ2 penalty is defined as the squared ℓ2 norm of the w coefficients:

    ℓ2(w) = ∥w∥₂² = Σ_{j=1}^p wj²

The ℓ2 penalty forces each coefficient wi not to be too large and makes
the coefficients more robust to collinearity (i.e., when some features are
approximately linear combinations of the other features).


Figure 5: Illustration of underfitting and overfitting. Underfitting occurs


when a model is too simple and does not capture the characteristics of the training data well enough, leading to high bias and low variance. Conversely, overfitting occurs when a model is too complex and learns the
noise in the training data, leading to low bias and high variance.


Figure 6: Unit balls of the ℓ0 , ℓ1 and ℓ2 norms. For each norm, the set
of points in R2 whose norm is equal to 1 is plotted. The ℓ1 norm is the
best convex approximation to the ℓ0 norm. Note that the lines for the ℓ0
norm extend to −∞ and +∞, but are cut for plotting reasons.

7.1.2. ℓ1 penalty

The ℓ2 penalty forces the values of the parameters not to be too large,
but does not incentivize small values to shrink toward zero. Indeed, the
square of a small value is even smaller. When the number of features is
large, or when interpretability is important, it can be useful to make the
algorithm select the most important features. The corresponding metric
is the ℓ0 “norm” (which is not a proper norm in the mathematical sense),
defined as the number of nonzero elements:
    ℓ0(w) = ∥w∥₀ = Σ_{j=1}^p 1_{wj ≠ 0}

However, the ℓ0 “norm” is neither differentiable nor convex (which are


useful properties to solve an optimization problem, but this is not fur-
ther detailed for the sake of conciseness). The best convex differentiable
approximation of the ℓ0 “norm” is the ℓ1 norm (see Figure 6), defined
as the sum of the absolute values of each element:
    ℓ1(w) = ∥w∥₁ = Σ_{j=1}^p |wj|


7.1.3. Elastic-net penalty


Both the ℓ2 and ℓ1 penalties have their upsides and downsides. In order
to try to obtain the best of both penalties, one can add both penalties to the
objective function. The combination of both penalties is known as the
elastic-net penalty:

    EN(w, α) = α∥w∥₁ + (1 − α)∥w∥₂²

where α ∈ [0, 1] is a hyperparameter representing the proportion of the


ℓ1 penalty compared to the ℓ2 penalty.

7.2 New optimization problem


A natural approach would be to add a constraint to the minimization
problem:
    min_w J(w)   subject to   Pen(w) < c        (1)

which reads as “Find the optimal parameters that minimize the cost
function J among all the parameters w that satisfy Pen(w) < c” for
a positive real number c. Figure 7 illustrates the optimal solution of a
simple linear regression task with different constraints. This figure also
highlights the sparsity property of the ℓ1 penalty (the optimal parameter
for the horizontal axis is set to zero) that the ℓ2 penalty does not have
(the optimal parameter for the horizontal axis is small but different from
zero).
Although this approach is appealing due to its intuitiveness and the
possibility to set the maximum possible penalty on the parameters w, it
leads to a minimization problem that is not trivial to solve. A similar
approach consists in adding the regularization term in the cost function:

    min_w J(w) + λ × Pen(w)        (2)

where λ > 0 is a hyperparameter that controls the weight of the penalty


term compared to the mean loss values over all the training samples. This
formulation is related to the Lagrangian function of the minimization
problem with the penalty constraint.
This formulation leads to a minimization problem with no constraint
which is much easier to solve. One can actually show that Equation 1
and Equation 2 are related: solving Equation 2 for a given λ, whose
optimal solution is denoted by wλ⋆ , is equivalent to solving Equation 1
for c = Pen(wλ⋆ ). In other words, solving Equation 2 for a given λ is
equivalent to solving Equation 1 for c whose value is only known after
finding the optimal solution of Equation 2.


Figure 7: Illustration of the minimization problem with a constraint


on the penalty term. The plot represents the value of the loss function
for different values of the two coefficients for a linear regression task.
The black star indicates the optimal solution with no constraint. The
green and orange stars indicate the optimal solutions when imposing a
constraint on the ℓ2 and ℓ1 norms of the parameters w respectively.


Figure 8: Illustration of regularization. A kernel ridge regression al-


gorithm is fitted on the training data (blue points) with different values
of λ, which is the weight of the regularization in the cost function. The
smaller the value of λ, the smaller the weight of the ℓ2 regularization. The algorithm underfits (respectively overfits) the data when the value of λ is too large (respectively too small).


Figure 8 illustrates the impact of the regularization term λ × Pen(w)


on the prediction function of a kernel ridge regression algorithm (see Sec-
tion 14 for more details) for different values of λ. For high values of λ, the regularization term dominates the mean loss value, so the prediction function does not fit the training data well enough (underfitting). For small values of λ, the mean loss value dominates the regularization term, so the prediction function fits the training data too closely (overfitting). A good balance between the mean loss value and the regularization term is required to learn the best function.
Since linear regression is one of the oldest and best-known models, the
aforementioned penalties were originally introduced for linear regression:
• Linear regression with the ℓ2 penalty is also known as ridge [7]:

    min_w ∥y − Xw∥₂² + λ∥w∥₂²

As in ordinary least squares regression, there exists a closed-form solution:

    w⋆ = (X⊤X + λI)⁻¹ X⊤y

• Linear regression with the ℓ1 penalty is also known as lasso [8]:

    min_w ∥y − Xw∥₂² + λ∥w∥₁

• Linear regression with the elastic-net penalty is also known as elastic-net [9]:

    min_w ∥y − Xw∥₂² + λα∥w∥₁ + λ(1 − α)∥w∥₂²

The penalties can also be added in other models such as logistic regres-
sion, support-vector machines, artificial neural networks, etc.
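As an illustration, here is a sketch of the ridge closed-form solution in NumPy, followed by the corresponding penalized estimators in scikit-learn (assumed here, not used in the chapter); note that scikit-learn parametrizes the penalty slightly differently (alpha plays the role of λ, and the squared loss is averaged for Lasso and ElasticNet), so the coefficients are comparable but not identical:

    import numpy as np
    from sklearn.linear_model import Ridge, Lasso, ElasticNet

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = X @ np.array([1.0, 0.0, 0.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

    # Closed-form ridge solution w* = (X^T X + lambda I)^(-1) X^T y (no intercept here).
    lam = 1.0
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

    # The same penalties through scikit-learn estimators.
    ridge = Ridge(alpha=1.0).fit(X, y)
    lasso = Lasso(alpha=0.1).fit(X, y)          # several coefficients driven exactly to zero
    enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
    print(w_ridge, ridge.coef_, lasso.coef_, enet.coef_)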

8. Support-vector machine
Linear and logistic regression take into account every training sample
in order to find the best line, which is due to their corresponding loss
functions: the squared error is zero only if the true and predicted outputs
are equal, and the logistic loss is always positive. One could argue that
the training samples whose outputs are “easily” well predicted are not
relevant: only the training samples whose outputs are not “easily” well
predicted or are wrongly predicted should be taken into account. The
support vector machine (SVM) algorithm is based on this principle.


Figure 9: Support vector machine classifier with linearly separable


classes. When two classes are linearly separable, there exists an infi-
nite number of hyperplanes separating them (left). The decision function
of the support vector machine classifier is the hyperplane that maximizes
the margin, that is the distance between the hyperplane and the closest
points to the hyperplane (right). Support vectors are highlighted with a
black circle surrounding them.

8.1 Original formulation

The original support vector machine algorithm was invented in 1963 and
was a linear binary classification algorithm [10]. Figure 9 illustrates the
main concept of its original version. When both classes are linearly sep-
arable, there exists an infinite number of hyperplanes that separate both
classes. The SVM algorithm finds the hyperplane that maximizes the
margin, that is the distance between the hyperplane and the closest points
of both classes to the hyperplane, while linearly separating both classes.
The SVM algorithm was later updated to non-separable classes [11].
Figure 10 illustrates the role of the margin in this case. The dashed lines
correspond to the hyperplanes defined by the equations x⊤ w = +1 and
x⊤ w = −1. The margin is the distance between both hyperplanes and
is equal to 2/∥w∥₂. It defines which samples are included in the decision
function of the model: a sample is included if and only if it is inside
the margin, or outside the margin and misclassified. Such samples are
called support vectors and are illustrated in Figure 10 with a black circle
surrounding them. In this case, the margin can be seen as a regularization
term: the larger the margin is, the more support vectors are included in
the decision function, the more regularized the model is.


Figure 10: Decision function of a support-vector machine classifier with


a linear kernel when both classes are not strictly linearly separable. The
support vectors are the training points within the margin of the decision
function and the misclassified training points. The support vectors are
highlighted with a black circle surrounding them.


Figure 11: Binary classification losses. The logistic loss is always pos-
itive, even when the point is accurately classified with high confidence
(i.e., when yf (x) ≫ 0), whereas the hinge loss is equal to zero when the
point is accurately classified with good confidence (i.e., when yf (x) ≥ 1).

The loss function for the SVM algorithm is called the hinge loss and
is defined as:

ℓhinge (y, f (x)) = max(0, 1 − yf (x))

Figure 11 illustrates the curves of the logistic and hinge losses. The
logistic loss is always positive, even when the point is accurately classified
with high confidence (i.e., when yf (x) ≫ 0), whereas the hinge loss is
equal to zero when the point is accurately classified with good confidence
(i.e., when yf(x) ≥ 1). One can see that a sample (x, y) is a support vector if and only if yf(x) ≤ 1, that is if and only if its hinge loss is positive or it lies exactly on the margin boundary.
The optimal w coefficients for the original version are estimated by
minimizing an objective function consisting of the sum of the hinge loss
values and a ℓ2 penalty term (which is inversely proportional to the mar-
gin):
    min_w Σ_{i=1}^n max(0, 1 − y(i) x(i)⊤w) + (1/(2C)) ∥w∥₂²
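A minimal sketch of this objective minimized by subgradient descent (ours, for illustration only), assuming labels in {−1, +1} and an input matrix that already includes a column of ones for the intercept:

    import numpy as np

    def fit_linear_svm_subgradient(X, y, C=1.0, lr=0.01, n_iter=1000):
        """Minimize sum_i max(0, 1 - y_i x_i^T w) + ||w||_2^2 / (2C) by subgradient descent."""
        n, p = X.shape
        w = np.zeros(p)
        for _ in range(n_iter):
            margins = y * (X @ w)
            violated = margins < 1                       # samples with a nonzero hinge loss
            subgrad = -(X[violated].T @ y[violated]) + w / C
            w -= lr * subgrad
        return w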


8.2 General formulation with kernels


The SVM algorithm was later updated to non-linear decision functions
with the use of kernels [12].
In order to have a non-linear decision function, one could map the
input space I into another space (often called the feature space), denoted
by G, using a function denoted by ϕ:

    ϕ : I → G,   x ↦ ϕ(x)

The decision function would still be linear (with a dot product), but in
the feature space:
f (x) = ϕ(x)⊤ w
Unfortunately, solving the corresponding minimization problem is not trivial:

    min_w Σ_{i=1}^n max(0, 1 − y(i) ϕ(x(i))⊤w) + (1/(2C)) ∥w∥₂²        (3)
Nonetheless, two mathematical properties make the use of non-linear
transformations in the feature space possible: the kernel trick and the
representer theorem.
The kernel trick asserts that the dot product in the feature space can
be computed using only the points from the input space and a kernel
function, denoted by K:

∀x, x′ ∈ I, ϕ(x)⊤ ϕ(x′ ) = K(x, x′ )

The representer theorem [13, 14] asserts that, under certain conditions
on the kernel K and the feature space G associated with the function ϕ,
any minimizer of Equation 3 admits the following form:
    f = Σ_{i=1}^n αi K(·, x(i))

where α solves:

    min_α Σ_{i=1}^n max(0, 1 − y(i) [Kα]i) + (1/(2C)) α⊤Kα

where K is the n×n matrix consisting of the evaluations of the kernel on


all the pairs of training samples: ∀i, j ∈ {1, . . . , n}, Kij = K(x(i) , x(j) ).
Because the hinge loss is equal to zero if and only if yf(x) is greater than or equal to 1, only the training samples (x(i), y(i)) such that y(i)f(x(i)) < 1 have a nonzero αi coefficient.


Box 4: Support-vector machine

• Main idea: hyperplane (i.e., line) that maximizes the margin (i.e.,
the distance between the hyperplane and the closest inputs to the
hyperplane).

• Support vectors: only the misclassified inputs and the inputs


well classified but with low confidence are taken into account.

• Non-linearity: decision function can be non-linear with the use


of non-linear kernels.

• Regularization: ℓ2 penalty.

These points are the so-called support


vectors and this is why they are the only training samples contributing
to the decision function of the model:

    SV = {i ∈ {1, . . . , n} | αi ≠ 0}

    f(x) = Σ_{i=1}^n αi K(x, x(i)) = Σ_{i ∈ SV} αi K(x, x(i))

The kernel trick and the representer theorem show that it is more
practical to work with the kernel K instead of the mapping function ϕ.
Popular kernel functions include:

• the linear kernel:

    K(x, x′) = x⊤x′

• the polynomial kernel:

    K(x, x′) = (γ x⊤x′ + c0)^d   with γ > 0, c0 ≥ 0, d ∈ ℕ*

• the sigmoid kernel:

    K(x, x′) = tanh(γ x⊤x′ + c0)   with γ > 0, c0 ≥ 0

• the radial basis function (RBF) kernel:

    K(x, x′) = exp(−γ ∥x − x′∥₂²)   with γ > 0


The linear kernel yields a linear decision function and is actually iden-
tical to the original formulation of the SVM algorithm (one can show
that there is a mapping between the α and w coefficients). Non-linear
kernels allow for non-linear, more complex, decision functions. This is
particularly useful when the data is not linearly separable, which is the
most common use case. Figure 12 illustrates the decision function and
the margin of a SVM classification model for four different kernels.
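For illustration, fitting a support vector machine classifier with the four kernels listed above on a synthetic non-linearly-separable dataset, assuming the scikit-learn library (which the chapter does not use):

    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    # Two-class data that is not linearly separable (synthetic, for illustration only).
    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

    for kernel in ["linear", "poly", "rbf", "sigmoid"]:
        clf = SVC(kernel=kernel, C=1.0, gamma="scale").fit(X, y)
        print(kernel, clf.score(X, y), len(clf.support_))  # accuracy and number of support vectors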
The SVM algorithm was also extended to regression tasks with the
use of the ε-insensitive loss. Similarly to the hinge loss, which is equal to
zero for points that are correctly classified and outside the margin, the
ε-insensitive loss is equal to zero when the error between the true target
value and the predicted value is not greater than ε:

ℓε−insensitive (y, f (x)) = max(0, |y − f (x)| − ε)

The objective function for the SVM regression algorithm combines the
values of ε-insensitive loss of the training points and the ℓ2 penalty:
    min_w Σ_{i=1}^n max(0, |y(i) − ϕ(x(i))⊤w| − ε) + (1/(2C)) ∥w∥₂²

Figure 13 illustrates the curves of three regression losses. The squared


error loss takes very small values for small errors and very high values for
high errors, whereas the absolute error loss takes small values for small
errors and high values for high errors. Both losses take small but non-
zero values when the error is small. On the contrary, the ε-insensitive
loss is null when the error is small and otherwise equal to the absolute
error loss minus ε.
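A brief illustration of support vector regression with the ε-insensitive loss, again assuming scikit-learn and synthetic data:

    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X = np.sort(rng.uniform(-3, 3, size=(100, 1)), axis=0)
    y = np.sin(X).ravel() + 0.1 * rng.normal(size=100)

    # Support vector regression with an RBF kernel; epsilon sets the width of the
    # insensitive tube around the prediction, C controls the regularization strength.
    svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
    print(svr.predict(X[:5]))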

9. Multiclass classification
The classification algorithms that we presented so far, logistic regression
and support-vector machines, are binary classification algorithms: they
can only be used when there are only two possible outcomes. However,
in practice, it is common to have more than two possible outcomes. For
instance, differential diagnosis of brain disorders is often between several,
and not only two, diseases.
Several strategies have been proposed to extend any binary classi-
fication algorithm to multiclass classification tasks. They all rely on
transforming the multiclass classification task into several binary clas-
sification tasks. In this section, we present the most commonly used
strategies: one-vs-rest, one-vs-one and error correcting output code [15].
Figure 14 illustrates the main ideas of these approaches. But first, we


Figure 12: Impact of the kernel on the decision function of a support


vector machine classifier. A non-linear kernel allows for a non-linear
decision function.


Figure 13: Regression losses. The squared error loss takes very small
values for small errors and very large values for large errors, whereas the
absolute error loss takes small values for small errors and large values
for large errors. Both losses take small but non-zero values when the
error is small. On the contrary, the ε-insensitive loss is null when the error is small and otherwise equal to the absolute error loss minus ε. When
computed over several samples, the squared and absolute error losses are
often referred to as mean squared error (MSE) and mean absolute error
(MAE) respectively.


One-vs-rest:
    {1} vs. {2, 3, 4, 5}
    {2} vs. {1, 3, 4, 5}
    {3} vs. {1, 2, 4, 5}
    {4} vs. {1, 2, 3, 5}
    {5} vs. {1, 2, 3, 4}

One-vs-one:
    {1} vs. {2}    {1} vs. {3}    {1} vs. {4}    {1} vs. {5}    {2} vs. {3}
    {2} vs. {4}    {2} vs. {5}    {3} vs. {4}    {3} vs. {5}    {4} vs. {5}

Output code:
    {1, 3} vs. {2, 4, 5}
    {1, 4, 5} vs. {2, 3}
    {2} vs. {1, 3, 4, 5}
    {1, 2, 3} vs. {4, 5}
    {2, 5} vs. {1, 3, 4}
    {2, 3, 4} vs. {1, 5}
    {4} vs. {1, 2, 3, 5}
    ...

Figure 14: Main approaches to convert a multiclass classification task


into several binary classification tasks. In the one-vs-rest approach, each
class is associated to a binary classification model that is trained to sep-
arate this class from all the other classes. In the one-vs-one approach, a
binary classification algorithm is trained on each pair of classes. In the
error correcting output code approach, the classes are (randomly) split
into two groups and a binary classification algorithm is trained for each
split.

present a natural extension of logistic regression to multiclass classifi-


cation tasks which is often referred to as multinomial logistic regression
[5].

9.1 Multinomial logistic regression


For binary classification, logistic regression is characterized by a hyper-
plane: the signed distance to the hyperplane is mapped into the proba-
bility of belonging to the positive class using the sigmoid function. How-
ever, for multiclass classification, a single hyperplane is not enough to
characterize all the classes. Instead, each class Ck is characterized by a
hyperplane wk and, for any input x, one can compute the signed dis-
tance x⊤ wk between the input x and the hyperplane wk . The signed


distances are mapped into probabilities using the softmax function, defined as softmax(x1, . . . , xq) = ( exp(x1)/Σ_{j=1}^q exp(xj), . . . , exp(xq)/Σ_{j=1}^q exp(xj) ), as follows:

    ∀k ∈ {1, . . . , q},  P(y = Ck | x = x) = exp(x⊤wk) / Σ_{j=1}^q exp(x⊤wj)

The coefficients (wk)_{1≤k≤q} are still estimated by maximizing the likelihood function:

    L(w1, . . . , wq) = Π_{i=1}^n Π_{k=1}^q P(y = Ck | x = x(i))^{1_{y(i) = Ck}}

which is equivalent to minimizing the negative log-likelihood:

    −log(L(w1, . . . , wq))
    = −Σ_{i=1}^n Σ_{k=1}^q 1_{y(i) = Ck} log P(y = Ck | x = x(i))
    = −Σ_{i=1}^n Σ_{k=1}^q 1_{y(i) = Ck} log( exp(x(i)⊤wk) / Σ_{j=1}^q exp(x(i)⊤wj) )
    = Σ_{i=1}^n ℓcross-entropy(y(i), softmax(x(i)⊤w1, . . . , x(i)⊤wq))

where ℓcross-entropy is known as the cross-entropy loss and is defined, for any label y and any vector of probabilities (π1, . . . , πq), as:

    ℓcross-entropy(y, (π1, . . . , πq)) = −Σ_{k=1}^q 1_{y = Ck} log(πk)

This loss is commonly used to train artificial neural networks on classifi-


cation tasks and is equivalent to the logistic loss in the binary case.
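A minimal NumPy sketch (ours) of the softmax function and of the cross-entropy loss averaged over a few samples; the scores and class labels below are arbitrary:

    import numpy as np

    def softmax(scores):
        # Stabilized softmax over the last axis: subtracting the max does not change the result.
        e = np.exp(scores - scores.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def cross_entropy(y_onehot, probas):
        # Mean of -sum_k 1{y = C_k} log(pi_k) over the samples.
        eps = 1e-12  # avoid log(0)
        return -np.mean(np.sum(y_onehot * np.log(probas + eps), axis=1))

    # Three samples, four classes: scores x^T w_k stacked column-wise (illustrative values).
    scores = np.array([[2.0, 1.0, 0.1, -1.0],
                       [0.0, 3.0, 0.5, 0.2],
                       [1.0, -2.0, 0.3, 2.5]])
    y_onehot = np.eye(4)[[0, 1, 3]]  # true classes of the three samples
    print(cross_entropy(y_onehot, softmax(scores)))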
Figure 15 illustrates the impact of the strategy used to handle a mul-
ticlass classification task on the decision function.

9.2 One-vs-rest
A strategy to transform a multiclass classification task into several binary
classification tasks is to fit a binary classification algorithm for each class:
the positive class is the given class, and the negative class consists of all
the other classes merged into a single class. This strategy is known as one-
vs-rest. The advantage of this strategy is that each class is characterized
by a single model, so that it is possible to gain deeper knowledge about


Figure 15: Illustration of the impact of the strategy used to handle a


multiclass classification task on the decision function of a logistic regres-
sion model.


the class by inspecting its corresponding model. A consequence is that


the predictions for new samples take into account the confidence of the
models: the predicted class for a new input is the class for which the
corresponding model is the most confident that this input belongs to its
class. The one-vs-rest strategy is the most commonly used strategy and
usually a good default choice.

9.3 One-vs-one
Another strategy is to fit a binary classification algorithm for each pair
of classes: this strategy is known as one-vs-one. The advantage of this
strategy is that the classes in each binary classification task are “pure”,
in the sense that different classes are never merged into a single class.
However, the number of binary classification algorithms that need to be trained is larger for the one-vs-one strategy (q(q − 1)/2) than for the one-vs-rest strategy (q). Nonetheless, for the one-vs-one strategy, the number
of training samples in each binary classification tasks is smaller than the
total number of samples, which makes training each binary classification
algorithm usually faster. Another drawback is that this strategy is less
interpretable compared to the one-vs-rest strategy, as the predicted class
corresponds to the class obtaining the most votes (i.e., winning the most
one-vs-one matchups), which does not take into account the confidence
in winning each matchup.¹ For instance, winning a one-vs-one matchup
with 0.99 probability gives the same result as winning the same matchup
with 0.51 probability, i.e. one vote.

9.4 Error correcting output code


A substantially different strategy, inspired by the theory of error correc-
tion code, consists in merging a subset of classes into one class and the
other subset into the other class, for each binary classification task. This
data is often called the code book and can be represented as a matrix
whose rows correspond to the classes and whose columns correspond to
the binary classification tasks. The matrix consists only of −1 and +1
values that represent the corresponding label for each class and for each
binary task.2 For any input, each binary classifier returns the score (or
probability) associated with the positive class. The predicted class for
this input is the class whose corresponding vector is the most similar to
the vector of scores, with similarity being assessed with the Euclidean
¹ The confidences are actually taken into account, but only in the event of a tie.
² The values are 0 and 1 when the classification algorithm does not return scores but only probabilities.


distance (the lower, the more similar). There exist advanced strategies
to define the code book, but it has been argued that a random code book
usually gives as good results as a sophisticated one [16].
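For illustration, the three strategies are available as generic wrappers in scikit-learn (assumed here; the base classifier and dataset are arbitrary):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import (OneVsRestClassifier, OneVsOneClassifier,
                                    OutputCodeClassifier)

    X, y = load_iris(return_X_y=True)  # three classes
    base = LogisticRegression(max_iter=1000)

    ovr = OneVsRestClassifier(base).fit(X, y)                  # q binary classifiers
    ovo = OneVsOneClassifier(base).fit(X, y)                   # q(q-1)/2 binary classifiers
    ecoc = OutputCodeClassifier(base, code_size=2, random_state=0).fit(X, y)  # random code book
    print(ovr.score(X, y), ovo.score(X, y), ecoc.score(X, y))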

10. Decision functions with normal


distributions

Normal distributions are popular distributions because they are com-


monly found in nature. For instance, the distribution of heights and
birth weights of human beings can be approximated using normal dis-
tributions. Moreover, normal distributions are particularly easy to work
with from a mathematical point of view. For these reasons, a common
model consists in assuming that the training input vectors are indepen-
dently sampled from normal distributions.
A possible classification model consists in assuming that, for each
class, all the corresponding inputs are sampled from a normal distribution
with mean vector µk and covariance matrix Σk :

∀i such that y (i) = Ck , x(i) ∼ N (µk , Σk )

Using the probability density function of a normal distribution, one can


compute the probability density of any input x associated to the distri-
bution N (µk , Σk ) of class Ck :

 
    p_{x|y=Ck}(x) = 1 / √((2π)^p |Σk|) · exp( −(1/2) (x − µk)⊤ Σk⁻¹ (x − µk) )

With such a probabilistic model, it is easy to compute the probability


that a sample belongs to class Ck using Bayes rule:

    P(y = Ck | x = x) = p_{x|y=Ck}(x) P(y = Ck) / px(x)

With normal distributions, it is mathematically easier to work with log-


probabilities:

    log P(y = Ck | x = x)
    = log p_{x|y=Ck}(x) + log P(y = Ck) − log px(x)
    = −(1/2) (x − µk)⊤Σk⁻¹(x − µk) − (1/2) log|Σk| + log P(y = Ck) − (p/2) log(2π) − log px(x)
    = −(1/2) x⊤Σk⁻¹x + x⊤Σk⁻¹µk − (1/2) µk⊤Σk⁻¹µk − (1/2) log|Σk| + log P(y = Ck)
      − (p/2) log(2π) − log px(x)                                          (4)

It is also possible to make further assumptions on the covariance


matrices that lead to different models. In this section, we present the
most commonly used ones: Naive Bayes, linear discriminant analysis and
quadratic discriminant analysis. Figure 16 illustrates the covariance ma-
trices and the decision functions for these models in the two-dimensional
case.

10.1 Naive Bayes


The Naive Bayes model assumes that, conditionally to each class Ck , the
features are independent and have the same variance σk²:

    ∀k, Σk = σk² Ip

Equation 4 can thus be further simplified:

    log P(y = Ck | x = x)
    = −(1/(2σk²)) x⊤x + (1/σk²) x⊤µk − (1/(2σk²)) µk⊤µk − p log σk + log P(y = Ck)
      − (p/2) log(2π) − log px(x)
    = x⊤Wk x + x⊤wk + w0k + s

where:

• Wk = −(1/(2σk²)) Ip is the matrix of the quadratic term for class Ck,

• wk = (1/σk²) µk is the vector of the linear term for class Ck,

• w0k = −(1/(2σk²)) µk⊤µk − p log σk + log P(y = Ck) is the intercept for class Ck, and

• s = −(p/2) log(2π) − log px(x) is a term that does not depend on class Ck.


Figure 16: Illustration of decision functions with normal distributions.


A two-dimensional covariance matrix can be represented as an ellipse. In
the Naive Bayes model, the features are assumed to be independent and
to have the same variance conditionally to the class, leading to covari-
ance matrices being represented as circles. When the covariance matrices
are assumed to be identical, the decision functions are linear instead of
quadratic.



Therefore, Naive Bayes is a quadratic model. The probabilities for input


x to belong to each class Ck can then easily be computed:

    P(y = Ck | x = x) = exp(x⊤Wk x + x⊤wk + w0k) / Σ_{j=1}^q exp(x⊤Wj x + x⊤wj + w0j)

With the Naive Bayes model, it is relatively common for the conditional variances σk² to all be equal:

    ∀k, Σk = σk² Ip = σ² Ip

In this case, Equation 4 can be even further simplified:

    log P(y = Ck | x = x)
    = −(1/(2σ²)) x⊤x + (1/σ²) x⊤µk − (1/(2σ²)) µk⊤µk − p log σ + log P(y = Ck)
      − (p/2) log(2π) − log px(x)
    = x⊤wk + w0k + s

where:

• wk = (1/σ²) µk is the vector of the linear term for class Ck,

• w0k = −(1/(2σ²)) µk⊤µk + log P(y = Ck) is the intercept for class Ck, and

• s = −(1/(2σ²)) x⊤x − p log σ − (p/2) log(2π) − log px(x) is a term that does not depend on class Ck.

In this case, Naive Bayes becomes a linear model.

10.2 Linear discriminant analysis


Linear discriminant analysis (LDA) makes the assumption that all the
covariance matrices are identical but otherwise arbitrary:

    ∀k, Σk = Σ

Therefore, Equation 4 can be further simplified:

    log P(y = Ck | x = x)
    = −(1/2) (x − µk)⊤Σ⁻¹(x − µk) − (1/2) log|Σ| + log P(y = Ck) − (p/2) log(2π) − log px(x)
    = −(1/2) ( x⊤Σ⁻¹x − x⊤Σ⁻¹µk − µk⊤Σ⁻¹x + µk⊤Σ⁻¹µk )
      − (1/2) log|Σ| + log P(y = Ck) − (p/2) log(2π) − log px(x)
    = x⊤Σ⁻¹µk − (1/2) µk⊤Σ⁻¹µk + log P(y = Ck) − (1/2) x⊤Σ⁻¹x − (1/2) log|Σ|
      − (p/2) log(2π) − log px(x)
    = x⊤wk + w0k + s

where:

• wk = Σ⁻¹µk is the vector of coefficients for class Ck,

• w0k = −(1/2) µk⊤Σ⁻¹µk + log P(y = Ck) is the intercept for class Ck, and

• s = −(1/2) x⊤Σ⁻¹x − (1/2) log|Σ| − (p/2) log(2π) − log px(x) is a term that does not depend on class Ck.
Therefore, linear discriminant analysis is a linear model. When Σ is
diagonal, linear discriminant analysis is identical to Naive Bayes with
identical conditional variances.
The probabilities for input x to belong to each class Ck can then easily
be computed:
    P(y = Ck | x = x) = exp(x⊤wk + w0k) / Σ_{j=1}^q exp(x⊤wj + w0j)

10.3 Quadratic discriminant analysis


Quadratic discriminant analysis makes no assumption on the covariance
matrices Σk that can all be arbitrary. Equation 4 can be written as:

    log P(y = Ck | x = x)
    = −(1/2) x⊤Σk⁻¹x + x⊤Σk⁻¹µk − (1/2) µk⊤Σk⁻¹µk − (1/2) log|Σk| + log P(y = Ck)
      − (p/2) log(2π) − log px(x)
    = x⊤Wk x + x⊤wk + w0k + s

where:

• Wk = −(1/2) Σk⁻¹ is the matrix of the quadratic term for class Ck,

• wk = Σk⁻¹µk is the vector of the linear term for class Ck,

• w0k = −(1/2) µk⊤Σk⁻¹µk − (1/2) log|Σk| + log P(y = Ck) is the intercept for class Ck, and

• s = −(p/2) log(2π) − log px(x) is a term that does not depend on class Ck.

Therefore, quadratic discriminant analysis is a quadratic model.
The probabilities for input x to belong to each class Ck can then easily be computed:

    P(y = Ck | x = x) = exp(x⊤Wk x + x⊤wk + w0k) / Σ_{j=1}^q exp(x⊤Wj x + x⊤wj + w0j)

11. Tree-based algorithms

11.1 Decision tree


Binary decisions based on conditional statements are frequently used in
everyday life because they are intuitive and easy to understand. Fig-
ure 17 illustrates a general approach when someone is ill. Depending on
conditional statements (severity of symptoms, ability to quickly consult
a specialist), the decision (consult your general practitioner or a spe-
cialist, or call for emergency services) is different. Models with such an
architecture are often used in machine learning and are called decision
trees.
A decision tree is an algorithm containing only conditional statements
and can be represented with a tree [17]. This graph consists of:
• decision nodes for all the conditional statements,
• branches for the potential outcomes of each decision node, and
• leaf nodes for the final decision.
Figure 18 illustrates a decision tree and its corresponding decision func-
tion. For a given sample, the final decision is obtained by following its
corresponding path, starting at the root node.
A decision tree recursively partitions the feature space in order to
group samples with the same labels or similar target values. At each
node, the objective is to find the best (feature, threshold) pair so that
both subsets obtained with this split are the most pure, that is homoge-
neous. To do so, the best (feature, threshold) pair is defined as the pair
that minimizes an impurity criterion.


Figure 17: A general thought process when being ill. Depending on conditional statements (severity of symptoms, ability to quickly consult a specialist), the decision (consult your general practitioner or a specialist, or call for emergency services) is different.

Figure 18: A decision tree: (left) the rules learned by the decision tree, and (right) the corresponding decision function.

Let S ⊆ X be a subset of training samples. For classification tasks,


the distribution of the classes, that is the proportion of each class, is used
to measure the purity of the subset. Let pk be the proportion of samples
from class Ck in a given partition:
    pk = (1/|S|) Σ_{y ∈ S} 1_{y = Ck}

Popular impurity criteria for classification tasks include:

• Gini index: Σ_k pk (1 − pk)

• Entropy: −Σ_k pk log(pk)

• Misclassification: 1 − max_k pk

Figure 19 illustrates the values of the Gini index and the entropy for
a single class Ck and for different proportions of samples pk . One can
see that the entropy function takes larger values than the Gini index,
especially for pk < 0.8. Since the sum of the proportions is equal to 1,
most classes only represent a small proportion of the samples. Therefore,
a simple interpretation is that entropy is more discriminative against
heterogeneous subsets than the Gini index. Misclassification only takes
into account the proportion of the most common class and tends to be
even less discriminative against heterogeneous subsets than both entropy
and Gini index.
For regression tasks, the mean error from a reference value (such as
the mean or the median) is often used as the impurity criterion:
• Mean squared error: (1 / |S|) ∑_{y ∈ S} (y − ȳ)², with ȳ = (1 / |S|) ∑_{y ∈ S} y

• Mean absolute error: (1 / |S|) ∑_{y ∈ S} |y − median_S(y)|

Theoretically, a tree can grow until every leaf node is perfectly pure.
However, such a tree would have a lot of branches and would be very
complex, making it prone to overfitting. Several strategies are commonly
used to limit the size of the tree. One approach consists in growing the
tree with no restriction, then pruning the tree, that is replacing subtrees
with nodes [17]. Other popular strategies to limit the complexity of the
tree are usually applied while the tree is grown, and include setting:


• a maximum depth for the tree,

• a minimum number of samples required to be at a leaf node,

• a minimum number of samples required to split a given partition,

• a maximum number of leaf nodes,

• a maximum number of features considered (instead of all the features) to find the best split,

• a minimum impurity decrease to split an internal node.

Figure 19: Illustration of the Gini index and entropy. The entropy function takes larger values than the Gini index, especially for p_k < 0.8, and is thus more discriminative against heterogeneous subsets (in which most classes represent only a small proportion of the samples) than the Gini index.
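In scikit-learn [27], these regularization strategies correspond directly to hyperparameters of the decision tree estimators. The snippet below is a minimal sketch (the data X_train, y_train and X_test are assumed to already exist, and the hyperparameter values are arbitrary):

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    criterion="gini",           # impurity criterion ("gini" or "entropy")
    max_depth=4,                # maximum depth of the tree
    min_samples_leaf=5,         # minimum number of samples at a leaf node
    min_samples_split=10,       # minimum number of samples to split a node
    max_leaf_nodes=20,          # maximum number of leaf nodes
    max_features="sqrt",        # number of features considered at each split
    min_impurity_decrease=0.0,  # minimum impurity decrease to split a node
    random_state=0,
)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)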

11.2 Random forest


One limitation of decision trees is their simplicity. Decision trees tend to
use a small fraction of the features in their decision function. In order to
use more features in the decision tree, growing a larger tree is required,
but large trees tend to suffer from overfitting, that is having a low bias
but a high variance. One solution to decrease the variance without increasing the bias too much is to build an ensemble of trees with randomness, hence the name random forest [18].


Box 5: Random forest

• Random forest: ensemble of decision trees with randomness introduced to build different trees.

• Decision tree: algorithm containing only conditional statements and represented with a tree.

• Regularization: maximum depth for each tree, minimum number of samples required to split a given partition, etc.

In a bid to have trees that are not perfectly correlated (thus building
actually different trees), each tree is built using only a subset of the
training samples obtained with random sampling. Moreover, for each
decision node of each tree, only a subset of the features are considered
to find the best split.
The final prediction is obtained by averaging the predictions of each
tree. For classification tasks, the predicted class is either the most com-
monly predicted class (hard-voting) or the one with the highest mean
probability estimate (soft-voting) across the trees. For regression tasks,
the predicted value is usually the mean of the predicted values across the
trees.
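As an illustration, a random forest can be fitted with scikit-learn as follows (a minimal sketch; X_train, y_train and X_test are assumed to exist, and the hyperparameter values are arbitrary):

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=500,     # number of trees in the forest
    max_features="sqrt",  # random subset of features considered at each split
    bootstrap=True,       # each tree is built on a bootstrap sample of the data
    n_jobs=-1,            # build the trees in parallel
    random_state=0,
)
clf.fit(X_train, y_train)
probas = clf.predict_proba(X_test)  # mean class probabilities across the trees
y_pred = clf.predict(X_test)        # class with the highest mean probability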

11.3 Extremely randomized trees


Even though random forests involve randomness in sampling both the
samples and the features, trees inside a random forest tend to be corre-
lated, thus limiting the variance decrease. In order to decrease even more
the variance of the model (while possibly increasing its bias) by growing
less correlated trees, extremely randomized trees introduce more random-
ness [19]. Instead of looking for the best split among all the candidate
(feature, threshold) pairs, one threshold is drawn at random for each
candidate feature and the best of these randomly generated thresholds
is chosen as the splitting rule.
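In scikit-learn, this variant is available as a drop-in replacement for the random forest estimator (again a sketch, assuming X_train and y_train exist):

from sklearn.ensemble import ExtraTreesClassifier

# Same interface as RandomForestClassifier; the difference lies in how the
# splitting thresholds are chosen (drawn at random for each candidate feature).
clf = ExtraTreesClassifier(n_estimators=500, max_features="sqrt", random_state=0)
clf.fit(X_train, y_train)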

12. Clustering
So far, we have presented classic machine learning algorithms for clas-
sification and regression, which are the main components of supervised
learning. Each input x(i) had an associated output y (i) . In this section


we present clustering, which is an unsupervised machine learning task.


In unsupervised learning, only the inputs x(i) are available, with no asso-
ciated outputs. As the ground truth is not available, the objective is to
extract information from the input data without supervising the learning
process with the output data.
Clustering consists in finding groups of samples such that:
• samples from the same group are similar, and
• samples from different groups are different.
For instance, clustering can be used to identify disease subtypes for het-
erogeneous diseases such as Alzheimer’s disease and Parkinson’s disease.
In this section, we present two of the most common clustering meth-
ods: the k-means algorithm and the Gaussian mixture model.

12.1 k-means
The k-means algorithm divides a set of n samples, denoted by X , into a
set of k disjoint clusters, each denoted by Xj , such that X = {X1 , . . . , Xk }.
Figure 20 illustrates the concept of this algorithm. Each cluster Xj
is characterized by its centroid, denoted by µj , that is the mean of the
samples in this cluster:
µ_j = (1 / |X_j|) ∑_{x^(i) ∈ X_j} x^(i)

The centroids fully define the set of clusters because each sample is as-
signed to the cluster whose centroid is the closest.
The k-means algorithm aims at finding centroids that minimize the
inertia, also known as within-cluster sum-of-squares criterion:
min_{µ_1, . . . , µ_k} ∑_{j=1}^{k} ∑_{x^(i) ∈ X_j} ‖x^(i) − µ_j‖₂²

The original algorithm used to find the centroids is often referred to as Lloyd's algorithm [20] and is presented in Algorithm 1.
the centroids, a two-step loop is repeated until convergence (when the
centroids are identical for two consecutive iterations) consisting of:
1. the assignment step, where the clusters are updated based on the
current centroids, and
2. the update step, where the centroids are updated based on the cur-
rent clusters.


Figure 20: Illustration of the k-means algorithm. The objective of the algorithm is to find the centroids that minimize the within-cluster sum-of-squares criterion. In this example, the inertia is approximately equal to 184.80 and is the lowest possible inertia, meaning that the represented centroids are optimal.

When clusters are well-defined, a point from a given cluster is likely to


stay in this cluster. Therefore, the assignment step can be sped up thanks
to the triangle inequality by keeping track of lower and upper bounds for
distances between points and centers, at the cost of higher memory usage
[21].
Even though the k-means algorithm is one of the simplest and most
used clustering algorithms, it has several downsides that should be kept
in mind.
First, the number of clusters k is a hyperparameter. Setting a value
much different from the actual number of clusters may yield poor clusters.
Second, the inertia is not a convex function. Although Lloyd’s al-
gorithm is guaranteed to converge, it may converge to a local minimum
that is not a global minimum. Figure 21 illustrates the convergence to
such centroids. Several strategies are often applied to address this issue,
including sophisticated centroid initialization [22] and running the algo-
rithm numerous times and keeping the best run (i.e., the one yielding
the lowest inertia).
Third, inertia makes the assumption that the clusters are convex and
isotropic. The k-means algorithm may yield poor results when this as-
sumption does not hold, such as with elongated clusters or manifolds
with irregular shapes.
Fourth, the Euclidean distance tends to be inflated (i.e., the ratio of the distances of the nearest and farthest neighbors to a given target is close to 1) in high-dimensional spaces, making inertia a poor criterion in such spaces [23]. One can alleviate this issue by running a dimensionality reduction algorithm such as principal component analysis prior to the k-means algorithm.

Algorithm 1: Lloyd's algorithm (a.k.a. naive k-means algorithm).
Result: centroids {µ_1, . . . , µ_k}
Initialize the centroids {µ_1, . . . , µ_k};
while not converged do
    Assignment step: compute the clusters, i.e., assign each sample to its nearest centroid:
        ∀j ∈ {1, . . . , k}, X_j = {x^(i) ∈ X | ‖x^(i) − µ_j‖₂² = min_l ‖x^(i) − µ_l‖₂²}
    Update step: compute the centroids of the updated clusters:
        ∀j ∈ {1, . . . , k}, µ_j = (1 / |X_j|) ∑_{x^(i) ∈ X_j} x^(i)
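The two steps of Algorithm 1 translate almost line by line into NumPy. The function below is a minimal sketch for illustration (random initialization, no handling of empty clusters), not the optimized implementation used in practice:

import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means; X has shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # initialization
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for each sample
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: mean of the samples assigned to each cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return centroids, labels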

12.2 Gaussian mixture model


A mixture model makes the assumption that each sample is generated
from a mixture of several independent distributions.
Let k be the number of distributions. Each distribution Fj is charac-
terized by its probability of being picked, denoted by πj , and its density
pj with parameters θj , denoted by pj (·, θj ). Let ∆ = (∆1 , . . . , ∆k ) be a
vector-valued random variable such that:
∑_{j=1}^{k} ∆_j = 1 and ∀j ∈ {1, . . . , k}, P(∆_j = 1) = 1 − P(∆_j = 0) = π_j

and (x1 , . . . , xk ) be independent random variables such that xj ∼ Fj .


The samples are assumed to be generated from a random variable x with
density px such that:

x = ∑_{j=1}^{k} ∆_j x_j

so that the density p_x of x is a weighted sum of the individual densities:

∀x ∈ X, p_x(x, θ) = ∑_{j=1}^{k} π_j p_j(x; θ_j)

Figure 21: Illustration of the convergence of the k-means algorithm to bad local minima. In the upper figure, the algorithm converged to a global minimum because the inertia is equal to the minimum possible value (184.80), thus the obtained clusters are optimal. In the four other figures, the algorithm converged to local minima that are not global minima because the inertias are higher than the minimum possible value, thus the obtained clusters are suboptimal.

Figure 22: Gaussian mixture model. For each estimated distribution, the mean vector and the ellipse consisting of all the points within one standard deviation of the mean are plotted.

A Gaussian mixture model is a particular case of a mixture model in which each distribution F_j is a Gaussian distribution with mean vector µ_j and covariance matrix Σ_j:

∀j ∈ {1, . . . , k}, F_j = N(µ_j, Σ_j)

Figure 22 illustrates the learned distributions from a Gaussian mixture model.
The objective is to find the parameters θ = ({µ_j}_{j=1}^k, {Σ_j}_{j=1}^k, {π_j}_{j=1}^k) that maximize the likelihood:

L(θ) = ∏_{i=1}^{n} p_x(x^(i); θ)

For computational reasons, it is easier to maximize the log-likelihood:


log(L(θ)) = ∑_{i=1}^{n} log(p_x(x^(i); θ)) = ∑_{i=1}^{n} log( ∑_{j=1}^{k} π_j p_j(x^(i); θ_j) )

Because the density pX (·, θ) is a weighted sum of Gaussian densities, the


expression cannot be further simplified.
In order to solve this maximization problem, an algorithm called
Expectation–Maximization (EM) is often applied [24]. Algorithm 2 de-
scribes the main concepts of this algorithm. After initializing the param-
eters of each distribution, a two-step loop is repeated until convergence
(when the parameters are stable over consecutive loops):


Algorithm 2: Expectation–Maximization algorithm for Gaussian mixture models.
Result: mean vectors {µ_j}_{j=1}^k, covariance matrices {Σ_j}_{j=1}^k and probabilities {π_j}_{j=1}^k
Initialize the mean vectors {µ_j}_{j=1}^k, covariance matrices {Σ_j}_{j=1}^k and probabilities {π_j}_{j=1}^k;
while not converged do
    E-step: compute the posterior probability γ_i(j) of each sample x^(i) having been generated from distribution F_j:
        ∀i ∈ {1, . . . , n}, ∀j ∈ {1, . . . , k}, γ_i(j) = π_j p_j(x^(i); θ_j) / ∑_{l=1}^{k} π_l p_l(x^(i); θ_l)
    M-step: update the parameters of each distribution F_j:
        ∀j ∈ {1, . . . , k}, µ_j = ∑_{i=1}^{n} γ_i(j) x^(i) / ∑_{i=1}^{n} γ_i(j)
        ∀j ∈ {1, . . . , k}, Σ_j = ∑_{i=1}^{n} γ_i(j) [x^(i) − µ_j][x^(i) − µ_j]⊤ / ∑_{i=1}^{n} γ_i(j)
        ∀j ∈ {1, . . . , k}, π_j = (1/n) ∑_{i=1}^{n} γ_i(j)

• the expectation step, in which the probability for each sample x^(i) to have been generated from distribution F_j is computed, and

• the maximization step, in which the probability and the parameters of each distribution are updated.

Because it is impossible to know which samples have been generated


by each distribution, it is also impossible to directly maximize the log-
likelihood, which is why we compute its expected value using the posterior
probabilities, hence the name expectation step. The second step simply
consists in maximizing the expected log-likelihood, hence the name max-
imization step.
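In practice, the EM updates rarely need to be implemented by hand: scikit-learn's GaussianMixture estimator runs them internally. A minimal sketch, assuming a data matrix X of shape (n_samples, n_features) exists:

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(
    n_components=3,          # number of Gaussian distributions k
    covariance_type="full",  # one full covariance matrix per component
    n_init=10,               # run EM several times and keep the best fit
    random_state=0,
)
gmm.fit(X)
labels = gmm.predict(X)            # hard assignment to the most likely component
posteriors = gmm.predict_proba(X)  # soft assignment: posterior probabilities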
Lloyd’s and EM algorithms have a lot of similarities. In the first step,
the assignment step assigns each sample to its closest cluster, whereas
the expectation step computes the probability for each sample to have
been generated from each distribution. In the second step, the update


step computes the centroid of each cluster as the mean of the samples in
a given cluster, while the maximization step updates the probability and
the parameters of each distribution as a weighted average over all the
samples. For these reasons, the k-means algorithm is often referred to as
a hard-voting clustering algorithm, as opposed to the Gaussian mixture
model which is referred to as a soft-voting clustering algorithm.
The Gaussian mixture model has several advantages over the k-means
algorithm.
First, the use of normal distribution densities instead of Euclidean distances mitigates the distance-inflation issue in high-dimensional spaces. Second, the Gaussian mixture model includes covariance matrices, allowing for clusters with elliptical shapes, while the k-means algorithm only includes centroids, forcing clusters to have circular shapes.
Nonetheless, the Gaussian mixture model also has several drawbacks,
sharing a few with the k-means algorithm.
First, the number of distributions k is a hyperparameter. Setting
a value much different from the actual number of clusters may yield
poor clusters. Second, the log-likelihood is not a concave function. Like
Lloyd’s algorithm, the EM algorithm is guaranteed to converge but it
may converge to a local maximum that is not a global maximum. Sev-
eral strategies are often applied to address this issue, including sophis-
ticated centroid initialization [22] and running the algorithm numerous
times and keeping the best run (i.e., the one yielding the highest log-
likelihood). Third, the Gaussian mixture model has more parameters
than the k-means algorithm. Therefore, it usually requires more sam-
ples to accurately estimate its parameters (in particular the covariance
matrices) than the k-means algorithm.

13. Dimensionality reduction


Dimensionality reduction consists in finding a good mapping from the
input space into a space of lower dimension. Dimensionality reduction
can either be unsupervised or supervised.

13.1 Principal component analysis


For exploratory data analysis, it may be interesting to investigate the
variances of the p features and the p(p − 1)/2 covariances or correlations. However, as the value of p increases, this process becomes increasingly tedious. Moreover, each feature may explain a small proportion of the
total variance. It may be more desirable to have another representation
of the data where a small number of features explain most of the total


variance, in other words to have a coordinate system adapted to the input


data.
Principal component analysis (PCA) consists in finding a representa-
tion of the data through principal components [25]. The principal com-
ponents are a sequence of unit vectors such that the i-th vector is the
best approximation of the data (i.e., maximizing the explained variance)
while being orthogonal to the first i − 1 vectors.
Figure 23 illustrates principal component analysis when the input
space is two-dimensional. On the upper figure, the training data in the
original space is plotted. Both features explain about the same amount
of the total variance, although one can clearly see that both features
are strongly correlated. Principal component analysis identifies a new
Cartesian coordinate system based on the input data. On the lower
figure, the training data in the new coordinate system is plotted. The
first dimension explains much more variance than the second dimension.

13.1.1. Full decomposition


Mathematically, given an input matrix X ∈ Rn×p that is centered (i.e.,
the mean value of each column X:,j is equal to zero), the objective is to
find a matrix W ∈ Rp×p such that:

• W is an orthogonal matrix, i.e. its columns are unit vectors and orthogonal to each other,

• the new representation of the input data, denoted by T, consists of the coordinates in the Cartesian coordinate system induced by W (whose columns form an orthogonal basis of R^p with the Euclidean dot product):

T = XW

• each column of W maximizes the explained variance.

Each column wi = W:,i is a principal component. Each input vector x


is transformed into another vector t using a linear combination of each
feature with the weights from the W matrix:

t = x⊤ W

The first principal component w_1 is the unit vector that maximizes the explained variance:

w_1 = argmax_{‖w‖=1} { ∑_{i=1}^{n} (x^(i)⊤ w)² } = argmax_{‖w‖=1} { ‖Xw‖² } = argmax_{‖w‖=1} { w⊤ X⊤ X w }

or, equivalently, as a Rayleigh quotient:

w_1 = argmax_{w ∈ R^p} (w⊤ X⊤ X w) / (w⊤ w)

As X⊤X is a positive semi-definite matrix, a well-known result from linear algebra is that w_1 is the eigenvector associated with the largest eigenvalue of X⊤X.

The k-th component is found by subtracting the first k − 1 principal components from X:

X̂_k = X − ∑_{s=1}^{k−1} X w_s w_s⊤

and then finding the unit vector that explains the maximum variance from this new data matrix:

w_k = argmax_{‖w‖=1} { ‖X̂_k w‖ } = argmax_{w ∈ R^p} (w⊤ X̂_k⊤ X̂_k w) / (w⊤ w)

One can show that this quantity is maximized by the eigenvector associated with the k-th largest eigenvalue of the X⊤X matrix. Therefore, the matrix W is the matrix whose columns are the eigenvectors of the X⊤X matrix, sorted in descending order of their associated eigenvalues.

Figure 23: Illustration of principal component analysis. On the upper figure, the training data in the original space (blue points with black axes) is plotted. Both features explain about the same amount of the total variance, although one can clearly see a linear pattern. The principal component analysis algorithm learns a new Cartesian coordinate system based on the input data (red axes). On the lower figure, the training data in the new coordinate system is plotted (green points with red axes). The first dimension explains much more variance than the second dimension.
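Computationally, the full decomposition therefore reduces to an eigendecomposition of X⊤X. A minimal NumPy sketch (our illustration, assuming the columns of X have already been centered):

import numpy as np

def pca_full(X):
    """Principal components of a centered data matrix X (n_samples, n_features)."""
    eigenvalues, W = np.linalg.eigh(X.T @ X)  # eigendecomposition of X^T X
    order = np.argsort(eigenvalues)[::-1]     # sort by decreasing eigenvalue
    eigenvalues, W = eigenvalues[order], W[:, order]
    T = X @ W                                 # coordinates in the new system
    return W, eigenvalues, T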

13.1.2. Truncated decomposition


Since each principal component iteratively maximizes the remaining vari-
ance, the first principal components explain most of the total variance,
while the last ones explain a tiny proportion of the total variance. There-
fore, keeping only a subset of the ordered principal components usually
gives a good representation of the input data.
Mathematically, given a number of dimensions l, the new representa-
tion is obtained by truncating the matrix of principal components W to
only keep the first l columns, resulting in the submatrix W:,:l :
T̃ = XW:,:l
Figure 24 illustrates the use of principal component analysis as dimen-
sionality reduction. The Iris flower dataset consists of 50 samples for each
of three iris species (setosa, versicolor and virginica) for which four fea-
tures were measured: the length and the width of the sepals and petals,
in centimeters. The projection of each sample on the first two principal
components is shown in this figure.


Figure 24: Illustration of principal component analysis as a dimensionality reduction technique. The Iris flower dataset consists of 50 samples for each of three iris species (setosa, versicolor and virginica) for which four features were measured: the length and the width of the sepals and petals, in centimeters. The projection of each sample on the first two principal components is shown in this figure. The first dimension explains most of the variance (92.46%).
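A similar two-dimensional projection of the Iris dataset can be obtained with scikit-learn (a sketch of the idea, not the exact code used to produce the figure):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)      # 150 samples, 4 features
pca = PCA(n_components=2)              # keep the first two principal components
T = pca.fit_transform(X)               # shape (150, 2)
print(pca.explained_variance_ratio_)   # proportion of variance per component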


13.2 Linear discriminant analysis


In Section 10, we introduced linear discriminant analysis (LDA) as a clas-
sification method. However, it can also be used as a supervised dimen-
sionality reduction method. LDA fits a multivariate normal distribution
for each class Ck , so that each class is characterized by its mean vector
µk ∈ Rp and has the same covariance matrix Σ ∈ Rp×p . However, a
set of k points lies in a space of dimension at most k − 1. For instance,
a set of 2 points lies on a line, while a set of 3 points lies on a plane.
Therefore, the subspace induced by the k mean vectors µk can be used
as dimensionality reduction.
There exists another formulation of linear discriminant analysis which
is equivalent and more intuitive for dimensionality reduction. Linear
discriminant analysis aims to find a linear projection so that the classes
are separated as much as possible (i.e., projections of samples from a
same class are close to each other, while projections of samples from
different classes are far from each other).
Mathematically, the objective is to find the matrix W ∈ R^{p×l} (with l ≤ k − 1) that maximizes the between-class scatter while also minimizing the within-class scatter:

max_W tr( (W⊤ S_w W)⁻¹ (W⊤ S_b W) )

The within-class scatter matrix Sw summarizes the diffusion between the


mean vector µk of class Ck and all the inputs x(i) belonging to class Ck ,
over all the classes:
S_w = ∑_{k=1}^{q} ∑_{y^(i) = C_k} [x^(i) − µ_k][x^(i) − µ_k]⊤

The between-class scatter matrix Sb summarizes the diffusion between


all the mean vectors:
S_b = ∑_{k=1}^{q} n_k [µ_k − µ][µ_k − µ]⊤

where n_k is the proportion of samples belonging to class C_k and µ = ∑_{k=1}^{q} n_k µ_k = (1/n) ∑_{i=1}^{n} x^(i) is the mean vector over all the input vectors.
One can show that the W matrix consists of the first l eigenvectors of the matrix S_w⁻¹ S_b with the corresponding eigenvalues being sorted in descending order. Just as in principal component analysis, the corresponding eigenvalues can be used to determine the contribution of each dimension. However, the criterion for linear discriminant analysis is different from the one for principal component analysis: it is to maximize the separability of the classes instead of maximizing the explained variance.

Figure 25: Illustration of linear discriminant analysis as a dimensionality reduction technique. The Iris flower dataset consists of 50 samples for each of three iris species (setosa, versicolor and virginica) for which four features were measured: the length and the width of the sepals and petals, in centimeters. The projection of each sample on the learned two-dimensional space is shown in this figure.

Figure 25 illustrates the use of linear discriminant analysis as a di-


mensionality reduction technique. We use the same Iris flower dataset as
in Figure 24 illustrating principal component analysis. The projection
of each sample on the learned two-dimensional space is shown and one
can see that the first (horizontal) axis is more discriminative of the three
classes with linear discriminant analysis than with principal component
analysis.
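The projection of Figure 25 can be reproduced in spirit with scikit-learn's LinearDiscriminantAnalysis used as a transformer (a sketch, not the exact code behind the figure):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)  # at most (number of classes - 1)
T = lda.fit_transform(X, y)  # supervised: the class labels y are required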


14. Kernel methods


Kernel methods allow for generalizing linear models to non-linear models
with the use of kernel functions.
As mentioned in Section 8, the main idea of kernel methods is to
first map the input data from the original input space to a feature space,
and then perform dot products in this feature space. Under certain
assumptions, an optimal solution of the minimization problem of the
cost function admits the following form:
f = ∑_{i=1}^{n} α_i K(·, x^(i))

where K is the kernel function which is equal to the dot product in the
feature space:
∀x, x′ ∈ I, K(x, x′ ) = ϕ(x)⊤ ϕ(x′ )
As this term frequently appears, we denote by K the n × n symmetric
matrix consisting of the evaluations of the kernel on all the pairs of
training samples:

∀i, j ∈ {1, . . . , n}, Kij = K(x(i) , x(j) )

In this section we present the extension of two models previously in-


troduced in this chapter, ridge regression and principal component anal-
ysis, with kernel functions.

14.1 Kernel ridge regression


Kernel ridge regression combines ridge regression with the kernel trick,
and thus learns a linear function in the space induced by the respective
kernel and the training data [2]. For non-linear kernels, this corresponds
to a non-linear function in the original input space.
Mathematically, the objective is to find the function f with the fol-
lowing form:
f = ∑_{i=1}^{n} α_i K(·, x^(i))

that minimizes the sum of squared errors with an ℓ2 penalization term:

min_f ∑_{i=1}^{n} ( y^(i) − f(x^(i)) )² + λ‖f‖²


The cost function can be simplified using the specific form of the possible functions:

∑_{i=1}^{n} ( y^(i) − f(x^(i)) )² + λ‖f‖²
  = ∑_{i=1}^{n} ( y^(i) − ∑_{j=1}^{n} α_j K(x^(i), x^(j)) )² + λ ‖ ∑_{i=1}^{n} α_i K(·, x^(i)) ‖²
  = ∑_{i=1}^{n} ( y^(i) − α⊤ K_{:,i} )² + λ α⊤ K α
  = ‖y − Kα‖₂² + λ α⊤ K α

Therefore, the minimization problem is:

min_α ‖y − Kα‖₂² + λ α⊤ K α

for which a solution is given by:

α⋆ = (K + λI)⁻¹ y
Figure 8 illustrates the prediction function of a kernel ridge regression
algorithm with a radial basis function kernel. The prediction function is
non-linear as the kernel is non-linear.
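The closed-form solution lends itself to a short NumPy sketch with a radial basis function kernel (an illustration under the assumption that X_train, y_train and X_test exist; not an optimized implementation):

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

def kernel_ridge_fit(X_train, y_train, lam=1.0, gamma=1.0):
    K = rbf_kernel(X_train, X_train, gamma)
    # Solve (K + λI) α = y
    return np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)

def kernel_ridge_predict(X_test, X_train, alpha, gamma=1.0):
    # f(x) = Σ_i α_i K(x, x^(i))
    return rbf_kernel(X_test, X_train, gamma) @ alpha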

14.2 Kernel principal component analysis


As mentioned in Section 13, principal component analysis consists in
finding the linear orthogonal subspace in the original input space such
that each principal component explains the most variance. The optimal
solution is given by the first eigenvectors of X ⊤ X with the corresponding
eigenvalues being sorted in descending order.
With kernel principal component analysis, the objective is to find the
linear orthogonal subspace in the feature space such that each principal
component in the feature space explains the most variance [26]. The
solution is given by the first l eigenvectors (αk )1≤k≤l of the K matrix
with the corresponding eigenvalues being sorted in descending order. The
eigenvectors are normalized in order to be unit vectors in the feature
space.
Finally, the projection of any input x in the original space on the
k-th component can be computed as:
ϕ(x)⊤ α_k = ∑_{i=1}^{n} α_{ki} K(x, x^(i))


Figure 26: Illustration of kernel principal component analysis. Some non-linearly separable training data is plotted (top). The projected training data using principal component analysis remains non-linearly separable (middle). The projected training data using kernel principal component analysis (with a non-linear kernel) becomes linearly separable (bottom).


Figure 26 illustrates the projection of some non-linearly separable clas-


sification data with principal component analysis and with kernel prin-
cipal component analysis with a non-linear kernel. The projected input
data becomes linearly separable using kernel principal component analy-
sis, whereas the projected input data using (linear) principal component
analysis remains non-linearly separable.
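In scikit-learn, the corresponding estimator is KernelPCA; a minimal sketch of the comparison shown in Figure 26, assuming a data matrix X exists:

from sklearn.decomposition import PCA, KernelPCA

linear_proj = PCA(n_components=2).fit_transform(X)
kernel_proj = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)
# With a non-linear kernel, data that is not linearly separable in the original
# space may become linearly separable in the projected space.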

15. Conclusion
In this chapter, we described the main classic machine learning methods.
Due to space constraints, the description of some of them was brief. The
reader who seeks more details can refer to [5, 6]. All these approaches
are implemented in the scikit-learn Python library [27]. A common point
of the approaches presented in this chapter is that they use as input a
set of given or pre-extracted features. On the contrary, deep learning
approaches often provide an end-to-end learning setup within which the
features are learned. These techniques are covered in Chapters 3 to 6.

Acknowledgments
The authors would like to thank Hicham Janati for his fruitful remarks.
The authors would like to acknowledge the extensive documentation of
the scikit-learn Python package, in particular its user guide, for the rele-
vant information and references provided. We used the NumPy [28], mat-
plotlib [29] and scikit-learn [27] Python packages to generate all the fig-
ures. This work was supported by the French government under manage-
ment of Agence Nationale de la Recherche as part of the “Investissements
d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Insti-
tute) and reference ANR-10-IAIHU-06 (Agence Nationale de la Recherche-
10-IA Institut Hospitalo-Universitaire-6), and by the European Union
H2020 programme (grant number 826421, project TVB-Cloud).

References

[1] Goodfellow I, Bengio Y, Courville A (2016) Deep Learning. MIT Press, http://www.deeplearningbook.org

[2] Murphy KP (2012) Machine Learning: A Probabilistic Perspective. The MIT Press, Cambridge, MA

[3] Bentley JL (1975) Multidimensional binary search trees used for associative searching. Communications of the ACM 18(9):509–517

[4] Omohundro SM (1989) Five Balltree Construction Algorithms. Tech. rep., International Computer Science Institute

[5] Bishop CM (2006) Pattern Recognition and Machine Learning. Springer-Verlag

[6] Hastie T, Tibshirani R, Friedman J (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. Springer Series in Statistics, Springer-Verlag, New York

[7] Tikhonov AN, Arsenin VY, John F (1977) Solutions of Ill Posed Problems. John Wiley & Sons Inc, Washington : New York

[8] Tibshirani R (1996) Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society Series B (Methodological) 58(1):267–288

[9] Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67(2):301–320

[10] Vapnik VN, Lerner A (1963) Pattern recognition using generalized portrait method. Automation and Remote Control 24:774–780

[11] Cortes C, Vapnik V (1995) Support-vector networks. Machine Learning 20(3):273–297

[12] Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on Computational learning theory, Association for Computing Machinery, Pittsburgh, Pennsylvania, USA, COLT '92, pp 144–152

[13] Aizerman MA, Braverman EA, Rozonoer L (1964) Theoretical foundations of the potential function method in pattern recognition learning. In: Automation and Remote Control, 25, pp 821–837

[14] Schölkopf B, Herbrich R, Smola AJ (2001) A Generalized Representer Theorem. In: Computational Learning Theory, Springer, pp 416–426

[15] Aly M (2005) Survey on multiclass classification methods

[16] James G, Hastie T (1998) The Error Coding Method and PICTs. Journal of Computational and Graphical Statistics 7(3):377–387

[17] Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and Regression Trees. Taylor & Francis

[18] Breiman L (2001) Random Forests. Machine Learning 45(1):5–32

[19] Geurts P, Ernst D, Wehenkel L (2006) Extremely randomized trees. Machine Learning 63(1):3–42

[20] Lloyd S (1982) Least squares quantization in PCM. IEEE Transactions on Information Theory 28(2):129–137

[21] Elkan C (2003) Using the triangle inequality to accelerate k-means. In: Proceedings of the Twentieth International Conference on Machine Learning, pp 147–153

[22] Arthur D, Vassilvitskii S (2007) k-means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pp 1027–1035

[23] Aggarwal CC, Hinneburg A, Keim DA (2001) On the Surprising Behavior of Distance Metrics in High Dimensional Space. In: International Conference on Database Theory, Springer, pp 420–434

[24] Dempster AP, Laird NM, Rubin DB (1977) Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society Series B (Methodological) 39(1):1–38

[25] Jolliffe IT (2002) Principal Component Analysis, 2nd edn. Springer Science & Business Media

[26] Schölkopf B, Smola AJ, Müller KR (1999) Kernel principal component analysis. In: Advances in kernel methods: support vector learning, MIT Press, pp 327–352

[27] Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al (2011) Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research 12:2825–2830

[28] Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, Wieser E, Taylor J, Berg S, Smith NJ, et al (2020) Array programming with NumPy. Nature 585(7825):357–362

[29] Hunter JD (2007) Matplotlib: A 2D graphics environment. Computing in Science & Engineering 9(03):90–95
