Machine Learning
Machine Learning
Département GMM
Mastère Spécialisé VALDOM
Année 2022-2023
INSA de Toulouse
Philippe BESSE - Béatrice LAURENT -Cathy MAUGIS - Olivier ROUSTANT
2
Contents
1 Introduction 5
3 Linear models 29
3
4
Introduction
5
6
2010s PO Big Data p and n very large a soil after an accidental release. The objective is to perform a sensitivity
analysis on the numerical code.
2012 Deep Learning
• Genomics: DNA microarrays allow to measure the expression of thou-
2016 Artificial Intelligence (IA): AlphaGo, Imagenet, Generative Adversarial sands of genes simultaneously on a single individual. It is, for example,
Networks .. a challenge to try to infer from those kind of data which genes are in-
volved in a certain type of cancer, by comparing expression levels be-
VVV... : Volume, Variety, Velocity... tween healthy and sick patients. This is generally a high dimensional
problem: number p of genes measured on a microarray is generally much
The development of data storage and computing resources gives rise to the larger than the number n of individuals in the study.
production and the storage of a huge amount of data from which the data sci-
entist will try to learn crucial informations to better understand the underlying • Aeronautical engineering: Aerospace industry produces a huge amont
phenomena or to provide predictions. Many fields are impacted, here are some of signal measurements obtained from thousand of on-board sensors. It
examples of learning problems: is particularly important to detect possible anomalies before launching
the satellite. Similarly, many sensors are involved in planes and it is
• Medicine: identify the risk factors for a certain type of cancer, based on important to detect a abnormal behavior on a sensor. The main objectives
clinical and demographic variables are curve clustering or classification and anomaly detections in a set of
curves for predictive maintenance purposes.
• Meteorology: predict an air pollution rate based on weather conditions
• Energy: forecast an electricity consumption curve for a customer as a • Images: Convolutional neural networks and deep learning led to im-
function of climatic variables and specific characteristics of this customer, pressive progresses for image classification. Many fields are concerned:
build a model for energy optimization of buildings, or predict the energy medical images (e.g. tumor detection), earth observation satellite images,
production of a wind farm. computer vision, autonomous vehicles, ...
of a cancer or not/ abnormal behavior or not/ interest for a certain product ..), small set of influent variables, using the vocabulary of statistical learning.
it is a challenge to derive which explanatory variables (among a possibly large The main methods are implemented in the software R.
number of available ones) are influent for the phenomenon of interest, as well In the same time, the IT community talks more about machine learning,
as to provide a prediction rule. The main objective is therefore a modeling where the approach is more centered on a pure prediction objective, most of
objective which can be specified into sub-objectives that have to be clearly the time by a "black box" without the need for an explicit interpretation. With
defined prior the study since this determines the methods that can be imple- the increase in the size of datasets (in the era of Big Data), algorithms have
mented: been developed in Python, in particular in the scikit learn library.
A common objective of learning is to built a prediction algorithm, mini-
Explore, represent, describe, the variables, their correlations .. mizing a prediction error, with or without the constraint of interpretability of
Explain or test the influence of a variable or a factor in a specific model, the algorithm. Contexts are diverse, whether the aim is to publish a research
assumed to be a priori known article in an academic journal or participating in a Kaggle-type competition
or developing an industrial solution for example for recommendation systems,
Predict & Select a (small) set of predictors, to obtain an interpretable model, fraud detection, predictive maintenance algorithms ... The publication of a new
for example searching for biomarkers learning method or new options of existing methods requires showing that it
outperforms its competitors on a battery of examples, generally from the site
Predict by a "black box" without the need for an explicit interpretation. hosted at the University of California Irvine UCI Repository [26]. The biases
inherent in this approach are discussed in numerous articles (e.g. Hand; 2006)
Important parameters of the problem are the dimensions: the number n of [20] and conferences (e.g. Donoho (2015) [13]. It is notable that the academic
observations or sample size and the number p of variables observed on this pressure of publication has caused an explosion in the number of methods and
sample. The high dimensional framework, where p is possibly greater than their variants. The analysis of Kaggle type competitions and their winning
n has received a great interest in the statistical literature these last 20 years solutions is also very instructive. The pressure leads to combinations, even
and specific methods have been developed for this non classical setting. We architecture of models, of such complexity (see e.g. Figure 1.1) that these so-
will see the importance of parcimony: "it is necessary to determine a model lutions are concretely unusable for slight performance differences (3rd or 4th
that provides an adequate representation of the data, with as few parameters as decimal).
possible". Especially if the data are voluminous, the operational and "industrialized"
Historically, statistical methods have been developed around this type of solutions, necessarily robust and fast, are often satisfied with rather rudimen-
problems and one has proposed models incorporating on the one hand explana- tary methodological tools (see Donoho (2015) [13]).
tory or predictive variables and, on the other hand, a random component or This course proposes to address the wide variety of criteria and methods,
noise. It is then a matter of estimating the parameters of the model from the their conditions of implementation, the choices to be made, in particular to op-
observations, testing the significance of a parameter, selecting a model or a
8
timize the complexity of the models. It is also the opportunity to remind that
robust and linear methods as well as old strategies (descending, ascending,
step-by-step) or more recent (lasso) for the selection of linear or polynomial
models should not be too quickly evacuated from academic or industrial prac-
tices.
Ozone dataset
Figure 1.1: Winning solution of a kaggle contest: Identify people who have This example, studied by Besse et al. (2007) [5] is a real situation whose
a high degree of Psychopathy based on Twitter usage. Weighted combination objective is to predict, for the next day, the risk of exceeding the legal ozone
of combinations (boosting, neural networks) of thirty three models (random concentration threshold in urban areas. The problem can be considered as a
forest, boosting, k nearest neighbors ...) and 8 new variables (features) regression problem: the variable to explain is an ozone concentration, but also
as a binary classification problem: exceeding or not the legal threshold. There
are only 8 explanatory variables, one of them is already a prediction of ozone
concentration but obtained by a deterministic fluid mechanics model (Navier
and Stockes equations). This is an example of statistical adaptation. The de-
terministic forecast on the basis of a global grid (30 km) is improved locally, at
9
Forests, neural networks... More appropriate algorithms, acting directly on • Predict the output y associated to a new entry x.
images such as convolutional neural networks will be studied next year.
• Select the important explanatory variables among x1 , . . . , xp .
Figure 1.3: MNIST some examples of handwritten digits A prediction rule is a measurable function fˆ : X → Y that associates
the output fˆ(x) to the input x ∈ X .
In order to quantify the quality of the prevision, we introduce a loss function.
3 Introduction to supervised learning D EFINITION 1. — A measurable function ℓ : Y × Y → R+ is a loss function
In the framework of Supervised learning, we have a Learning sample if ℓ(y, y) = 0 and ℓ(y, y ′ ) > 0 for y ̸= y ′ .
composed with observation data of the type input/output:
In real regression, it is natural to consider Lp (p ≥ 1) losses
dn1 = {(x1 , y1 ), . . . , (xn , yn )} l(y, y ′ ) = |y − y ′ |p .
where, for i = 1 . . . n, xi = (x1i , . . . , xpi ) ∈ X is a set of p explanatory If p = 2, the L2 loss is called "quadratic loss".
variables and yi ∈ Y is a response variable. In classification, one can consider the consider the 0-1 loss defined, for all
In this course, we consider supervised learning for real regression (Y ⊂ R) y, y ′ ∈ Y by
or for classification (Y finite). The explanatory variables x1 , . . . xp can be l(y, y ′ ) = 1y̸=y′ .
qualitatives or quantitatives.
Since the 0-1 loss is not smooth, it may be useful to consider other losses that
we will see in the classification courses.
Objectives: From the learning sample, we want to
The goal is to minimize the expectation of this loss function, leading to the
• Estimate the link between the input vector x (explanary variables) and notion of risk:
the output y (variable to explain):
D EFINITION 2. — Let f be a prediction rule defined from the learning sample
y = f (x1 , . . . , xp ). D n . Given a loss function ℓ, the risk - or generalization error - of the
11
where, in the above expression, (X, Y ) is independent from the learning sam-
ple D n .
It is generally recommended to take 50% of the data for the learning sample, 4. Random partition of the sample into a train set and a test set according
25% of the data for the validation sample and 25% of the data for the test to its size and choice of a loss function that will be used to estimate the
sample. prediction error.
5. The train set is separated into a learning sample and a validation sam-
Splitting the data set is not always a good solution, especially if its size is
ple. For each method considered: generalized linear model (Gaussian, bi-
quite small. We will see in Chapter 2 several ways to estimate the generaliza-
nomial or Poisson), parametric (linear or quadratic) or nonparametric (k
tion error.
nearest neighbors), discrimination, neural network (perceptron), binary
decision tree, support vectors machine, aggregation (bagging, boosting,
4 Strategy for statistical learning random forest. . . )
4.1 The steps of a statistical analysis • Estimate the model with the learning set for given values of a pa-
rameter of complexity: number of variables, neighbors, leaves, neu-
In a real situation, the initial preparation of the data (data munging: extrac- rons, penalization or regularization . . .
tion, cleaning, verification, possible allocation of missing data, transformation
...) is the most thankless phase, the one that requires the most time, human • optimization of this parameter (or these parameters) by minimizing
resources and various skills: informatics, statistics and knowledge of the field the empirical loss on the validation set, or by cross-validation on
of the data. This stage does not require major theoretical developments but the train set or the training error plus a penalty term.
rather a lot of common sense, experience and a good knowledge of the data. 6. Comparison of the previous optimal models (one per method) by estimat-
Once successfully completed, the modeling or learning phase can begin. ing the prediction error on the test set.
Systematically and also very schematically, the analysis, also called the
Data science follows the steps described below for most fields of application. 7. Possible iteration of the previous approach or Monte Carlo cross-
validation: if the test sample at step 4 is too small, the prediction error
1. Data extraction with or without sampling applied to structured databases obtained at step 6 can be very dependent on this test sample. The Monte
(SQL) or not (NoSQL) Carlo cross-validation approach consists in successive random partitions
of the sample (train and test) to study the distribution of the test error
2. Visualization, exploration of the data for the detection of atypical values, for each model or at least take the mean of the prediction errors obtained
errors or anomalies; study of distributions and correlation structures and from several Monte-Carlo iterations to ensure the robustness of the final
search for transformations of variables, construction of new variables and selected model.
/ or representation in adapted bases (Fourier, spline, wavelets ...).
8. Choice of the "best" method according to its prediction error, its robust-
3. Taking into account missing data, by simple deletion or by imputation. ness but also its interpretability if necessary.
13
9. Re-estimation of the selected model on all the data. • Neural networks will be introduced in Chapter 10. We will focus on mul-
tilayer perceptron, backpropagation algorithms, optimization algorithms,
10. Industrialization: implementation of the model on the complete data base. and provide an introduction to deep learning to will be completed next
year by the study of the Convolutional Neural Networks.
The end of this process can be modified by building a combination of the
different methods rather than selecting the best one. This is often the case • Finally, we will approach ethical aspects of statistical decisions and legal
with winning "gas factory" solutions in Kaggle competitions. This has also and societal impacts of AI.
been theorized in two approaches leading to a collaboration between models:
COBRA from Biau et al. (2016) [6] and SuperLearner from van der Laan et
al. (2007) [39].
1 Introduction of the risk on a test sample (independent of the training sample), measuring the
generalization capacity of the algorithm, we generally obtain higher values. If
1.1 Objectives these new data are representative of the whole distribution of the data, we ob-
tain an unbiased estimator of the risk. Three strategies are described to obtain
The performance of a model or algorithm is evaluated by a risk or general- unbiased estimates of risk:
ization error. The measurement of this performance is very important since,
on the one hand, it allows to operate a model selection in a family of models 1. a penalisation of the empirical risk
associated with the learning method we used and, on the other hand, it guides
the choice of the best method by comparing each of the optimized models at 2. a split of the sample: train set and test set. The train set is itself decom-
the previous step. Finally, it provides a measure of the quality or even of the posed into a leaning set to estimate the models for a given algorithm and
confidence that we can give to the prediction with the selected model. a validation set to estimate the generalization error of each model in order
Once the notion of statistical model or prediction rule is specified, the risk to choose the best one, the test set is used to estimate the risk of each
is defined from an associated loss function. In practice, this risk needs to be optimized model.
estimated and different strategies are proposed. 3. by simulation: cross validation, bootstrap.
The main issue is to construct an unbiased estimator of this risk. The empir-
ical risk (based on the training sample), also called the training error is biased The choice depends on several factors including the desired objective, the size
by optimism, it underestimates the risk. If we compute an empirical estimator of the initial sample, the complexity of the models, the computational com-
plexity of the algorithms.
15
16
2 Risk and model selection Then, the prediction rule generally assigns to the input x the class that maxi-
mizes the estimated probability that is
2.1 Loss function and risk
fˆ(x) = argmaxk∈{1,2,...,K} fˆk (x).
We consider supervised regression or classification problems. We have a
training data set with n observation points (or objects) X i and their associated In this setting, a loss function often used is the so-called cross-entropy (or
output Yi (real value in regression, class or label in classification). negative log-likelihood). Minimizing this loss function is equivalent to maxi-
n
d corresponds to the observation of the random n-sample D n
= mizing the log-likelihood. It is defined as:
{(X 1 , Y1 ), . . . , (X n , Yn )} with unknown joint distribution P on X × Y. K
X
ℓ(Y, fˆ(X)) = − 1Y =k log(fˆk (X)).
A prediction rule is a measurable function fˆ : X → Y that associates k=1
the output fˆ(x) to the input x ∈ X . It depends on D n and is thus random.
In order to quantify the quality of the prevision, we introduce a loss function. In all cases, the goal is to minimize the expectation of the loss function,
leading to the notion of risk.
D EFINITION 3. — A measurable function ℓ : Y × Y → R+ is a loss function
if ℓ(y, y) = 0 and ℓ(y, y ′ ) > 0 for y ̸= y ′ . D EFINITION 4. — Let f be a prediction rule build on the learning sample D n .
Given a loss function ℓ, the risk - or generalisation error - of f is defined by
p
In real regression it is natural to consider L (p ≥ 1) losses
RP (f ) = E(X,Y )∼P [ℓ(Y, f (X))],
ℓ(y, y ′ ) = |y − y ′ |p .
where, in the above expression (X, Y ) is independent from the learning sam-
2
If p = 2, the L loss is called "quadratic loss". ple D n .
In classification, one can consider the 0-1 loss defined, for all y, y ′ ∈ Y by
Let F be the set of possible prediction rules. f ∗ is called an optimal rule if
′
ℓ(y, y ) = 1y̸=y′ .
RP (f ∗ ) = inf RP (f ).
f ∈F
Since the 0-1 loss is not smooth, it may be useful to consider other losses.
Assuming that Y ∈ {1, 2, . . . , K}, rather than providing a class, many clas- A natural question then arises: is it possible to build optimal rules ?
sification algorithms provide estimation of the probability that the output Y
belongs to each class, given the input X = x, that is Case of real regression with L2 loss:
D EFINITION 5. — We call regression function the function η ∗ : X → Y The definition of the optimal rules described above depends on the knowl-
defined by edge of the distribution P of (X, Y ). In practice, we have a training sample
η ∗ (x) = E[Y |X = x]. D n = {(X 1 , Y1 ), . . . , (X n , Yn )} with joint unknown distribution P , from
which we construct a regression or classification rule. The aim is to find a
T HEOREM 1. — The regression function η ∗ : x 7→ E[Y |X = x] satisfies: "good" classification rule, in the sense that its risk is as small as possible.
T HEOREM 2. — The regression rule defined by µ∗ (x) = median[Y |X = x] Nevertheless, this is not a good idea: this estimator is optimistic and will un-
verifies: der estimate the risk (or generalisation error) as illustrated in the polynomial
∗ regression example presented in Figure 2.1.
RP (µ ) = inf RP (f ).
f ∈F
The empirical risk (also called training error) is not a good estimate of the
generalization error: it decreases as the complexity of the model increases.
Case of classification with 0 − 1 loss : Hence minimizing the training error leads to select the most complex model,
this leads to overfitting. Figure 2.2 illustrates the optimism of the training
ℓ(y, y ′ ) = 1y̸=y′ . error, that underestimates the generalization error, which is estimated here on
a test sample.
D EFINITION 6. — We call Bayes rule any function f ∗ of F such that for all
x ∈ X,
P(Y = f ∗ (x)|X = x) = max P(Y = y|X = x).
y∈Y A first way to have a good criterion for model selection is to minimize the
empirical risk plus a penalty term, the penalty term will penalize too complex
T HEOREM 3. — If f ∗ is a Bayes rule, then RP (f ∗ ) = inf f ∈F RP (f ). model to prevent overfitting.
18
● ●
●
●
●
2
● ●
●
●
2
●
●
1
● ●
●
●
1
y
y
●
●
0
0
−1
● ●
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
x x
2.5
● ●
● ●
● ●
2.0
2.0
● ●
1.5
1.5
● ● ● ●
● ●
● ●
1.0
1.0
y
● ●
● ●
0.5
0.5
0.0
0.0
−0.5
−0.5
● ●
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Figure 2.2: Behavior of training error (in blue) and test error (in red) as the
x x
These two terms are of different natures. To evaluate them, we will use tools follows
n
respectively from statistics and approximation theory. 1X p 2
Cp = (Yi − Ŷi )2 + 2 σb
The selection of a model F̂ in a collection of models C for which the risk of n i=1 n
the estimator fˆF̂ (D n ) is close to the one of the oracle will be obtained by the where p is the number of parameter of the model, n the number of observations
minimization of a penalized criterion of the type: b2 is an estimation of the variance of the error.
and σ
F̂ = argminF ∈C {Rn (fˆF ) + pen(F )}. In framework of a the linear model Y = Xβ + ε, for which this criterion
was historically introduced, the expression becomes
In the above formula, a penalty is added to the empirical risk. The role of the
penalty is to penalize models with "large" dimension, in order to avoid over- n
1X p 2
fitting. The optimal choice of the penalty (according to the statistical models C p = (Yi − (X β̂)i )2 + 2 σb ,
n i=1 n
considered) is a very active research topic in statistics.
The more complex a model, the more flexible it is and can adjust to the where β ∈ Rp and σ 2 is an estimator of the variance of the variables εi ’s
observed data and therefore the smaller the bias. On the other hand, the vari- obtained by a model with large dimension (small bias). This last point is crucial
ance increases with the number of parameters to be estimated and therefore for the quality of the criterion: it amounts to assume that the full model (with
with this complexity. The objective is to minimize the quadratic risk, which all the variables) is the "true" model, or at least a model with a small bias to
is a sum of the variance and the squared bias term. Hence, we are looking for allow a good estimation of σ 2 .
the best compromise between the bias and the variance term: it is sometimes The Figure 2.3 shows the behavior of the Mallow’s Cp in the pedagogical
preferable to accept to bias the estimate as for example in ridge regression to example of polynomial regression. This criterions selects a polynomial with
reduce its variance. degree 3.
Penalized criterion: Mallow’s CP AIC, AICc , BIC
The Mallow’s Cp (1973)[28] was historically the first penalized criterion, While Mallow’s CP is associated to the quadratic loss, Aikaike’s Informa-
introduced for Gaussian linear model. It is based on the penalization of the tion Criterion (1974)[2] (AIC) is, more generally, related to the log-likelihood.
least square criterion by a penalty which is proportional to the dimension of It corresponds to the opposite of the empirical log-likelihood L plus a penalty
the model. It is based on the decomposition term proportional to the dimension of the model:
RP (f ) = Rn (f ) + Optim
b b p
AIC = −2L + 2 .
which corresponds to the empirical risk plus a estimation of the bias corre- n
sponding to the optimism of the empirical risk. This optimism has to be esti- The quantity −2L is also called deviance. One easily verifies that, in the Gaus-
mated to obtain a better estimation of the risk. This criterion is expressed as sian model with variance assumed to be known, the deviance and least square
20
CP de Mallows Polynôme de degré 3 n > e2 ≈ 7.4, BIC penalizes more heavily complex models. The consequence
● ●
●
is that BIC will generally select simpler model than AIC.
8
2
●
7
●
this criterion, among a collection of possible models.
6
●
●
1
●
5
CP
0
4
−1
●
tion consists in estimating the generalization error, either with data that where
2
● ●
● ●
●
●
• CV (m) estimates the generalization error of the model m and we select Algorithm 1 Monte Carlo Cross-Validation
the model which minimizes CV (m).
for k=1 à B do
(−k) Split randomly the sample into two parts: training set and test set with a
Note that, if K is small (for example K = 2), each estimator fˆm is trained prescribed proportion
with around n/2 observations. Hence, these estimators are less accurate than
an estimator built with n observations, leading to a greater variance in the for models in list of models do
estimation of the generalization error by cross-validation. When the number Estimate the parameters of the current model with the training set.
of folds K = n, the method is called leave-one-out cross-validation. This Compute the test error by the empirical risk on the test set.
method has a low bias to estimate the generalization error, but a high variance end for
(−i)
since all the estimators fˆm are highly correlated. The computation time is end for
also high for the leave-one-out method. This is why, in practice an intermediate For each model, compute the mean of the B test errors and draw the boxplots
choice such as K = 10 is often recommended. This is generally the default of the distributions of these errors.
value in softwares.
25
●
●
20
●
●
●
● ●
15
●
●
y
●
10
●
● ● ●
●
● ● ●
●
● ●
5
●
● ●
●
●
● ●
●
0
●
−10 0 10 20 30 40
Figure 2.4: Boxplot of the test errors for various methods optimized by Monte Figure 2.5: Original data
Carlo K-fold Cross-Validation on Ozone data set
consider the following estimator:
3.2 Estimation by Bootstrap B n
c boot = 1 1
XX
Let us first describe the Bootstrap, before showing how it can be used to Err ℓ(yi , fˆ∗b (xi )),
Bn
estimate the extra-sample prediction error. Suppose we have a training data set i=1 b=1
Z = {z1 , . . . zn }, with zi = (xi , yi ) and a model to be fitted on these data. We
measuring the mean, over the B bootstrap predictors, of the error on the train-
denote by fˆ the model fitted with the sample Z. The principle of the bootstrap
ing sample Z. However, we easily see that this is not a good estimate of the
is to randomly draw datasets of size n with replacement from the original sam-
generalization error since the bootstrap samples and the original sample have
ple Z. Conditionally on Z, all these draws are independent. Figures 2.6 and
many observations in common. Hence, this estimator will be too optimistic: it
2.7 show two bootstrap samples from the original dataset presented in Figure
will underestimate the generalization error. A better idea is to exploit the fact
2.5.
that each bootstrap sample does not contain all the observations of the original
We draw B bootstrap samples (for example B = 500) that we denote sample. Namely, we have
(Z∗b , b = 1, . . . , B). We fit the model with each of these bootstrap sam-
ples. We denote fˆ∗b the model fitted with the sample Z∗b . How can we use all
n
1 1
these predictors to estimate the prediction error of fˆ ? A first idea would be to P (Observation zi ∈
/ bootstrap sample b) = 1− ≈ = 0.368.
n e
24
●
●
duce the estimator
20
● n
c oob = 1 1
●
X X
●
● Err ℓ(yi , fˆ∗b (xi )).
n i=1 |C −i |
●
15
●
● −i b∈C
y
●
10
●
●
● ●
● ●
This estimator is called the out-of-bag estimator. If B is large enough, then for
●
●
●
●
●
all i , |C −i | =
̸ 0. Otherwise, the observation i for which |C −i | = 0 can be
●
●
● ●
5
●
● ●
●●
●
●
● ● removed from the above formula. This estimator uses extra sample observation
0
●
●
−10 0 10 20 30 40
to estimate the error of each predictor fˆ∗b , avoiding the overfitting problem
x encountered by Err c boot . Nevertheless, in expectation, each bootstrap sample
contains 0.632n observations, which is less that 2n/3 and we would like to
Figure 2.6: Bootstrap sample no 1 (in blue), and corresp. prediction with tree. estimate the generalization error of a predictor fˆ built with the n observations
The point size is proportional to the number of replicates. of the original sample Z. Each bootstrap predictors fˆ∗b will be less accurate
than fˆ since it is built with a smaller sample size. This induces a bias in the
estimation of the generalization error of fˆ by Err
c oob . To correct this bias, the
25
●
●
".632 bootstrap estimator " has been introduced by Efron and Tibshirani (1997)
[14]. It is defined by
20
●
●
●
● ●
(.632)
15
●
●
Err
c = .368err
¯ + .632Err
c oob ,
y
●
●
10
●●
●
●
●
● ● where err
● ●
5
●
●
●
●
●
situation, and a correction has been proposed in this case. It is called the
●
●
●
●
●
.632+bootstrap (see Hastie et al. [21] p. 220 for more details).
0
−10 0 10 20 30 40
x Remarks.
Figure 2.7: Bootstrap sample no 2 (in violet), and corresp. prediction with tree. 1. All the estimators proposed to estimate the generalization error are
The point size is proportional to the number of replicates. asymptotically equivalent, and it is not possible to know which method
will be more precise for a fixed sample size n.
25
2. The boostrap is time consuming and more complicated. It is less used Confusion matrix
in practice. Nevertheless, it plays a central role in recent methods of
aggregation, involving the bagging (for bootstrap aggregating) such as Given a threshold s, we use the prediction rule: if P̂(Yi = 1|X = xi ) > s,
random forests as we will see in Chapter 8 . then Ybi = 1, else Ybi = 0.
The confusion matrix crosses the modalities of the predicted variable for a
3. In conclusion, the estimation of a generalization error is delicate, and it is threshold value s with those of the observed variable in a contingency table:
recommended to consider the same estimator to compare two prediction
methods and to be very careful, without theoretical justification, to use Prediction Observation Total
one of these estimation to certify an algorithm. For this last purpose, the Yi = 1 Yi = 0
use of a test sample, with sufficiently large size, would be recommended. ybi = 1 n11 (s) n10 (s) n1+ (s)
ybi = 0 n01 (s) n00 (s) n0+ (s)
We will end this chapter by presenting the ROC curves, that are used to com- Total n+1 n+0 n
pare the relative performances of several binary classification methods.
In classic situations of medical diagnosis, marketing, pattern recognition,
4 Discrimination and ROC curves signal detection ... the following main quantities are considered:
For a two class classification problem: Y = {0, 1}, prediction methods • Number of positive conditions P = n+1
often provide an estimator of P(Y = 1|X = x). Then, a natural prediction is • Number of negative conditions N = n+0
to affect the observation x to the class 1 if
• True positives T P = n11 (s) (Y
bi = 1 et Yi = 1)
1
P̂(Y = 1|X = x) > .
2 • True negatives T N = n00 (s) (Y
bi = 0 et Yi = 0)
This gives a symmetric role to classes 0 and 1, which is sometimes not desir- • False negatives F N = n01 (s) (Y
bi = 0 et Yi = 1)
able (health context, for instance). The idea is to parameterize the decision by
a new threshold parameter s: • False positives F P = n10 (s) (Y
bi = 1 et Yi = 0)
T N +T P F N +F P
• Accuracy and error rate: ACC = N +P =1− N +P
P̂(Y = 1|X = x) > s ⇔ x belongs to class 1
TP
• True positive rate or sensitivity, recall T P R = P = 1 − FNR
s should be chosen according to policy decision, typically a tradeoff between
TN
the rate of true positive and false positive. • True negative rate or specificity, selectivity T N R = N = 1 − FPR
26
TP
• Precision or positive predictive value P P V = T P +F P = 1 − F DR • The True Positive Rate:
FP
• False positive rate F P R = N = 1 − TNR ♯{i, Ŷi = 1, Yi = 1}
T P R(s) = .
FN ♯{i, Yi = 1}
• False negative rate F N R = P = 1 − TPR
• False discovery rate F DR = FP The ROC curve plots TPR(s) versus FPR(s) for all values of s ∈ [0, 1].
F N +T N ,
We illustrate the construction of a ROC curve for a naïf example of logistic
• F1 score or harmonic mean of precision and sensitivity regression in dimension 1 in Figure 2.8.
PPV × TPR 2 × TP By making the threshold s vary in [0, 1], we obtain the complete ROC curve
F1 = 2 × = .
PPV + TPR 2 × TP + FP + FN presented in Figure 2.9
• Fβ (β ∈ R+ ) score, How to use ROC curve to select classifiers ? The "ideal" Roc curve corre-
sponds to FPR=0 and TPR =1 (no error of classification).
PPV × TPR We would like to use ROC curve to compare several classification rules, but
Fβ = (1 + β 2 ) 2 .
β PPV + TPR generally, the curves will intersect as shown in Figure 2.9 The AUC: Area
Under the Curve is a criterion which is often used to compare several classi-
The notions of specificity and sensitivity come from signal theory; their val- fication rules.
ues depend directly on the threshold s. By increasing s, the sensitivity de- In order to compare several methods with various complexity, the ROC curves
creases while the specificity increases. A good model combines high sensitiv- should be estimated on a test sample, they are indeed optimistic on the learning
ity and high specificity for signal detection. sample.
The last criterion Fβ makes it possible to weight between specificity and
sensitivity by taking into account the importance or the cost of false positives.
The smaller β, the more expensive false positives are compared to false
negatives.
By analogy with the first and second kind errors for testing procedures, we
consider the two following quantities that will be used to draw the ROC curve.
1.0
1.0
● ● ● ● ● ● ●
●
●
0.8
0.8
●
● ● ●
1.0
1.0
0.6
0.6
Probability
●
●
●
0.8
0.8
●
● (s = 0.5 )
0.4
0.4
TRUE positive rate
0.6
0.6
Probability
● ●
0.2
0.2
0.4
0.4
0.0
0.0
●
●
0.2
0.2
0.0
1.0
● ● ● ● (s = 0.2 ) ●
●
●
0.8
0.8
●
TRUE positive rate
0.6
0.6
Probability
0.4
0.4
●
0.2
0.2
0.0
0.0
Figure 2.8: Points in the ROC curve obtained for s = 0.5 and s = 0.2
Figure 2.10: ROC curves for several classification rules on bank data
28
Chapter 3
Linear models
29
30
The explanatory variables are given in the matrix X(n×(p+1)) with general
For the qualitative variables, we consider indicator functions of the different
term Xij , the first column contains the vector 1 (X0i = 1). The regressors X j
levels of the factor, and introduce some constraints for identifiability. By
can be quantitative variables, nonlinear transformation of quantitative variables
default, in R, the smallest value of the factor are set in the reference.
(such as log, exp, square ..), interaction between variables X j = X k .X l , they
This is an analysis of covariance model (mixing quantitative and qualitative
can also correspond to qualitative variables: in this case the variables X j are
variables).
indicator variables coding the different levels of a factor (we remind that we
need identifiability conditions in this case).
The response variable is given in the vector Y with general term Yi . We 2.2 Estimation of the parameters
set β = [β0 β1 · · · βp ]′ , which leads to the matricial formulation of the linear
model: Least square estimators
Y = Xβ + ε. The regressors X j are observed, the unknown parameters of the model are
the vector β and σ 2 . β is estimated by minimizing the residuals sum of square
As a practical example, we consider the Ozone data set. or equivalently, assuming that the errors are Gaussian, by maximisation of the
The data frame has 1041 observations of the following components: likelihood.
31
We minimise with respect to the parameter β ∈ Rp+1 the criterion : are still defined as the projection of Y onto the space generated by the columns
of X, even if there is not a unique β b such that Y
b = Xβ.b In practice, if X′ X
n
is not invertible (which is necessarily the case in high dimension when the
X 2
(Yi − β0 − β1 Xi1 − · · · − βp Xip )2 = ∥Y − Xβ∥
i=1
number of variables p is larger than the number of observations n - since p
= (Y − Xβ)′ (Y − Xβ) vectors of Rn are necessarily linearly dependent), we have to remove variables
from the model or to consider other approches to reduce the dimension ( Ridge,
= Y′ Y − 2β ′ X′ Y + β ′ X′ Xβ. Lasso, PLS ...) that we will developed in the next chapters.
Derivating the last equation, we obtain the “ normal equations” : We define the vector of residuals as:
columns of X are linearly independent. If it is not the case, this means that the
application β 7→ Xβ is not injective, hence the model is not identifiable and β σ2
is not uniquely defined. Nevertheless, even in this case, the predicted values Y σ̂ 2 ∼ χ2
n − (p + 1) (n−(p+1))
b
32
and is independent of β̂. In order to test that the variable associated to the parameter β j has no influence
in the model, hence H0 : β j = 0 contre H1 : β j ̸= 0, we reject the null
Exercise. — Prove Theorem 4 hypothesis at the level 5% if 0 does not belong to the previous confidence
β interval.
b is a linear estimator of β (it is a linear transformation of the observation
Y) and it is unbiased. One can wonder if it has some optimality property. This Exercise. — Recover the construction of the confidence intervals.
is indeed the case: the next theorem, called the Gauss-Markov theorem, is very
b has the smallest Test of significance of a variable
famous in statistics. It asserts that the least square estimator β
variance among all linear unbiased estimator of β. We recall the linear model
T HEOREM 5. — Let A and B two matrices. We say that A ⪯ B if B − A is Yi = β0 + β1 Xi1 + β2 Xi2 + · · · + βp Xip + εi i = 1, 2, . . . , n
positive semi-definite. Let β
e a linear unbiased estimator of β, with variance-
covariance matrix V.e Then, σ 2 (X′ X)−1 ⪯ V. e We want to test if the variable X j is significant in the model or not, which is
equivalent to test the nullity of the parameter βj .
Exercise. — Prove the Gauss-Markov theorem. We test H0 : βj = 0 against H1 : βj ̸= 0.
Theorem 5 shows that the estimator β b is the best among all linear unbiased Under the hypothesis H0 ,
estimator of β, nevertheless, in the next section, we will see that it can be
preferable to consider biased estimator, if they have a smaller variance than β,
b β
b
j
Tj = q ∼ T(n−(p+1)) .
to reduce the quadratic risk. This will be the case for the Ridge, Lasso, PCR,
σ̂ (X ′ X)−1
2
j,j
or PLS regression.
Confidence intervals The p-value of the test is defined as
One can easily deduce from Theorem 4 that PH0 (|Tj | > |Tj |obs ) = P(|T(n−(p+1)) | > |Tj |obs ),
b −β
β
q j j
∼ T(n−(p+1)) where |Tj |obs is the observed value for the variable |Tj | with our data. If the
σ̂ 2 (X ′ X)−1 p-value is very small, then it is unlikely that |Tj |obs is obtained from a Student
i,i
distribution with n − (p + 1) degrees of freedom, hence we will reject the
follows a Student distribution with n−(p+1) degrees of freedom. This allows hypothesis H0 , and conclude that the variable X j is significant. We fix some
to build confidence intervals and tests for the parameters β j . The following level α (generally 5%) for the test . If p-value < α, we reject the nullity of
interval is a 0.95 confidence interval for β j : βj and conclude that the variable X j is significant in the model. One easily
q q prove that the probability to reject H0 when it is true (i.e. to conclude that the
b − tn−(p+1),0.975 σ̂ 2 (X ′ X)−1 , β
[β b + tn−(p+1),0.975 σ̂ 2 (X ′ X)−1 ]. variable X j is significant when it is not) is less than the level α of the test.
j j,j j j,j
33
′
On the example of the Ozone data set, the software R gives the following and that Yb0 ∼ N (X0 ′ β, σ 2 X0 (X′ X)−1 X0 ). We can deduce a confidence
output, with the default constraints of R: interval for the mean response X0 ′ β at the new observation point X0 :
q
Coefficients Estimate Std. Error t value Pr(>|t|) X0 ′ β
b − tn−(p+1),0.975 σ̂ X′ (X′ X)−1 X0 ,
0
(Intercept) -33.43948 6.98313 -4.789 1.93e-06 ****
JOUR1 0.46159 1.88646 0.245 0.806747 q
MOCAGE 0.37509 0.03694 10.153 < 2e-16 *** X0 ′ β
b + tn−(p+1),0.975 σ̂ X′ (X′ X)−1 X0 .
0
TEMPE 3.96507 0.22135 17.913 < 2e-16 ***
... ... ... ... ... A prediction interval for the response Y0 at the new observation point X0 is:
Residual standard error: 27.83 on 1028 degrees of freedom
q
X0 ′ β
b − tn−(p+1),0.975 σ̂ 1 + X′ (X′ X)−1 X0 ,
0
2.3 Prediction
q
′b ′
′ −1
X0 β + tn−(p+1),0.975 σ̂ 1 + X0 (X X) X0 .
As mentioned above, the vector of predicted values is
Exercise. — Recover the construction of the prediction intervals. Hint: what
Y b = X(X′ X)−1 X′ Y = HY.
b = Xβ is the distribution of Yb0 − Y0 ?
On the example of the Ozone data, with the - simple linear regression model
This corresponds to the predicted values at the observation points. Based on with the single variable X= MOCAGE
the n previous observations, we may be interested with the prediction of the
response of the model for a new point: X0 ′ = (1, X0 1 , . . . , X0 p ): Yi = β0 + β1 Xi + εi , i = 1, . . . , n,
we obtain the following confidence and prediction intervals.
Y0 = β 0 + β 1 X01 + β 2 X02 + . . . + β p X0p + ε0 ,
2.4 Fisher test of a submodel
where ε0 ∼ N (0, σ 2 ). The predicted value is
Suppose that our data obey to a polynomial regression model of degree p
Yb0 = β
b +β b X0 p = X0 ′ β.
b X0 1 + . . . β b and we want to test the null hypothesis that our data obey to a polynomial
0 1 p
regression model of degree k < p , hence we want to test that the p − k last
coefficients of β are equal to 0. More generally, assume that our data obey to
We derive from Theorem 4 that the model, called Model (1):
E(Yb0 ) = X0 ′ β = β 0 + β 1 X01 + β 2 X02 + . . . + β p X0p Y = Xβ + ε.
34
Y = X̃θ + ε.
V = {Xβ, β ∈ Rp }
and
200 W = {X̃θ, θ ∈ Rl }.
We say that Model (0) is a submodel of Model (1) if W is a linear subspace of
O3obs
V.
100
∥Xβ b 2 /(p − l)
b − X̃θ∥
F = .
b 2 /(n − p)
∥Y − Xβ∥
35
An alternative way to write the F -statistics is: Residuals vs Fitted Normal Q-Q
489
150 489
Standardized residuals
856 4 856
100
F = ,
SSR1 /(n − p)
Residuals
50 2
where SSR0 and SSR1 respectively denote the residuals sum of square under 0 0
Exercise. — Prove that, under the null hypothesis H0 , the F -statistics is a -100
Fisher distribution with parameters (p − l, n − p). 50 100 150 200 -2 0 2
Fitted values Theoretical Quantiles
Standardized residuals
Standardized Residuals
856
4
when the sub-model is valid, and becomes larger under the alternative. Hence, 1.0
0
the null hypothesis is rejected for large values of F , namely, for a level-α test, 0.5 635
-2
when 179
0.0
F > fp−l,n−p,1−α , 50 100 150 200 0.00 0.05 0.10 0.15
Fitted values Leverage
where fp,q,1−α is the (1−α) quantile of the Fisher distribution with parameters
(p, q). The statistical softwares provide the p− value of the test: Figure 3.2: Diagnosis on the residuals for Ozone data
PH0 (F > Fobs )
where Fobs is the observed value for the F -statistics. The null hypothesis is • The linear model is valid: there is no tendancy in the residuals,
rejected at level α if the p− value is smaller than α.
• Detection of possible outliers with the Cook’s distance
2.5 Diagnosis on the residuals
• Normality of the residuals (if this assumption was used to provide confi-
As illustrated for Ozone data on Figure 3.2, the analysis and visualisation of dence/prediction intervals or tests).
the residuals allow to verify some hypotheses:
This is rather classical for linear regression, and we focus here on the detec-
• Homoscedasticity: the variance σ 2 is assumed to be constant, tion of possible high collinearities between the regressors, since it has an im-
36
pact on the variance of our estimators. Indeed, we have seen that the variance- between the largest and the smallest eigenvalues of R. If this ratio is large,
covariance matrix of βb is σ 2 (X′ X)−1 . then the problem is ill-conditioned.
When the matrix X is ill-conditioned, which means that the determinant of This condition number is a global indicator of collinearities, while the VIF
′
X X is close to 0, we will have high variances for some components of β. b It allows to identify the variables that are problematic.
is therefore important to detect and remedy these situations by removing some
variables of the model or introducing some constraints on the parameters to 3 Determination coefficient and Model se-
reduce the variance of the estimators. lection
VIF
3.1 R2 and adjusted R2
Most statistical softwares propose collinearity diagnosis. The most classical
il the Variance Influence Factor (VIF) We define respectively the total, explicated and residual sums of squares by
n
1 X
2
Vj = (Yi − Ȳ )2 =
Y − Y1
,
1 − Rj2 SST =
i=1
where Rj2 corresponds to the determination coefficient of the regression of Xn
2
the variable X j on the other explanatory variables ; Rj represents also the SSE = (Ŷi − Ȳ )2 =
Y − Y1
,
b
n j
cosine of the angle in R between X and the linear subspace generated by i=1
the variables {X 1 , . . . , X j−1 , X j+1 , . . . , X p }. The more X j is “linearly” Xn
2
2
SSR = (Ŷi − Yi )2 =
Y − Y
= ∥e∥ .
b
linked with the other variables, the more Rj is close to 1 ; we show that the
variance of the estimator of βj is large in this case. This variance is minimal i=1
j
when X is orthogonal to the subspace generated by the other variables. Since we consider a model with intercept, by Pythagora’s theorem,
2
2
Condition number
Y − Y1
2 =
Y − Y b
+
Y
b
− Y1
,
2.5
2.5
● ● ● ●
●
●
● ● ●
● ●
2
2.0
2.0
● ●
●
●
2
● ● ●
1.5
1.5
● ● ● ●
●
1
● ●
●
●
● ●
● ●
1.0
1.0
●
1
y
y
● ● ●
● ●
0.5
0.5
0
0.0
0.0
−1
−0.5
−0.5
● ● ● ●
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
x x x x
Figure 3.3: Polynomial regression: adjusted model, on the left: y = β0 + Figure 3.4: Polynomial regression: adjusted model, on the left: y = β0 +β1 x+
β1 x + ϵ, R2 = 0.03, on the right: y = β0 + β1 x + β2 x2 + ϵ, R2 = 0.73. . . . + β5 x5 + ϵ, R2 = 0.874, on the right: y = β0 + β1 x + . . . + β10 x10 + ϵ,
R2 = 1.
Note that 0 ≤ R2 ≤ 1. The model is well adjusted to the n training data if the
residuals sum of square SSR is close to 0, or equivalently, if the determination Of course this model is not the best one: it has a very high variance since
coefficient R2 is close to 1. Hence, the first hint is that a "good" model is a we estimate as much coefficients as the number of observations. This is a
model for which R2 is close to 1. This is in fact not true, as shown by the typical case of overfitting. When the degree of the polynomial increases, the
following pedagogical example of polynomial regression. Suppose that we bias of our estimators decreases, but the variance increases. The best model is
have a training sample (Xi , Yi )1≤i≤n where Xi ∈ [0, 1] and Yi ∈ R and we the one that realizes the best trade-off between the bias term and the variance
adjust polynomials on these data: term. Hence, we have seen that maximizing the determination coefficient is
not a good criterion to compare models with various complexity. It is more
Yi = β 0 + β 1 Xi + β 2 Xi2 + . . . + β k Xik + εi . interesting to consider the adjusted determination coefficient defined by:
SSR/(n − k − 1)
2 R′2 = 1 − .
When k increases, the model is more and more complex, hence
Y − Y SST/(n − 1)
b
decreases, and R2 increases as shown in Figures 3.3 and 3.4. The definition of R′2 takes into account the complexity of the model, repre-
The determination coefficient is equal to 1 for the polynomial of degree sented here by its number of coefficients: k + 1 for a polynomial of degree k,
n − 1 (which has n coefficients) and passes through all the training points. and penalizes more complex models. One can choose, between several mod-
38
els, the one which maximizes the adjusted R2 . In the previous example, we
would choose a polynomial of degree 3 with this criterion. [0, 1] R
π
More generally, we have to define model selection procedures that realize a logit: π → ln 1−π
good compromise between a good adjustment to the data (small bias) and a exp(x)
1+exp(x)
← x : sigmoid
small variance; and an unbiased estimator is not necessarily the best one in
logit sigmoid
this sense. We will prefer a biased model if this allows to reduce drastically
the variance. There are several ways to do that:
1.0
6
0.8
4
2
0.6
0
p
x
• Reducing the number of explanatory variables and by the same way sim-
0.4
−2
plifying the model (variable selection or Lasso penalization)
−4
0.2
−6
0.0
0.0 0.2 0.4 0.6 0.8 1.0 −6 −4 −2 0 2 4 6
p x
Exercise. — Compute the Bayes classifier f ∗ for this model and determine the
border between f ∗ = 1 and f ∗ = −1.
1.0
● ●
●● ● ● ●● ● ● ● ● ● ● ●
●
●
●●
4.2 Estimation of the parameters
0.8
●
●
n
Given a n-sample D = {(X 1 , Y1 ), . . . , (X n , Yn )}, we can estimate the
0.6
Probability
parameter β by maximizing the conditional likelihood of Y = (Y1 , . . . , Yn ) ●
0.4
●
●
given (X 1 , . . . , X n ). Since the distribution of Y given X = x is a Bernoulli ●
0.2
distribution with parameter πβ (x), the conditional likelihood is
0.0
n
Y 4 5 6 7 8 9 10
L(Y1 , . . . , Yn , β) = πβ (Xi )Yi (1 − πβ (Xi ))1−Yi .
x
i=1
Y exp(⟨β, Xi ⟩) Y 1 Figure 3.5: Logistic regression for a dataset composed of 2 groups of size 15,
L(Y , β) = . sampled from Normal distributions, centered at 5 and 7, with variance 1.
1 + exp(⟨β, Xi ⟩) 1 + exp(⟨β, Xi ⟩)
i,Yi =1 i,Yi =0
• Unlike the linear model, there is no explicit expression for the maximum Like for linear models, in a high dimensional setting (p is large), it will be
likelihood estimator β̂. necessary to use variable selection and model selection procedures by intro-
ducing penalized likelihood criterions (AIC, BIC, LASSO ..). This is the topic
• It can be shown that computing β̂ is a convex optimization problem. of Chapter 4.
• We compute the gradient of the log-likelihood, also called the score func-
tion S(Y, β) and use a Newton-Raphson algorithm to approximate β̂
satisfying S(Y, β̂) = 0.
1 Introduction the model. Moreover, a model with a small number of variables is more inter-
esting for the interpretation, keeping only the variables that have the strongest
We have made some reminders on linear models in Chapter 3. We have seen effects on the variable to explain. There are several ways to do that.
that, in a high dimensional framework, when p is possibly large, even larger
Assume we want to select a subset of variables among all possible subsets
than n, a complete model obtained by least square estimation is overfitted and
taken from the input variables. Each subset defines a model, and we want to
it is necessary to regularize the least square estimation by introducing some
select the "best model". We have seen that maximizing the R2 is not a good
penalty on the complexity of the models in order to reduce the variance of
criterion since this will always lead to select the full model. It is more inter-
the estimators. The adjusted R2 , presented in Chapter 3 is a first step in this
esting to select the model maximizing the adjusted determination coefficient
direction. Model selection and variable selection for linear models has been
R′2 . Many other penalized criterion have been introduce for variable selection
intensively studied this past twenty years and is still a very active field of re-
such as the Mallow’s CP criterion or the BIC criterion. In both cases, it corre-
search in statistics. Some of these methods such as Ridge or Lasso methods,
sponds to the minimization of the least square criterion plus some penalty term,
will be at the core of this course.
depending on the number k of parameters in the model m that is considered.
2 Variable selection
As we have seen, the least square estimator is not satisfactory since it has low
n
bias but generally high variance. In most examples, several variables are not X
Crit(m) = (Yi − Ŷi )2 + pen(k).
significant, and we may have better results by removing those variables from
i=1
41
42
● ●
●
8
n
X
2 2
2
●
7
●
i=1 ●
6
●
●
1
●
5
and the BIC criterion penalizes more the dimension of the model with an ad-
CP
y
●
0
ditional logarithmic term.
4
3
n
−1
X
(Yi − Ŷi )2 + log(n)kσ 2 .
●
2
● ●
CritBIC (m) = ●
●
●
●
● ●
k x
The aim is to select the model (among all possible subsets) that minimizes one
of those criterion. On the example of the polynomial models, we obtain the
results summarized in Figure 4.1. Figure 4.1: Mallows’CP in function of the degree of the polynomial. Selected
p
Nevertheless, the number of subsets of a set of p variables is 2 , and it is model: polynomial with degree 3.
impossible (as soon as p > 30) to explore all the models to minimize the cri-
terion. Fast algorithms have been developed to find a clever way to explore a
subsample of the models. This are the backward, forward and stepwise algo-
rithms.
Backward/Forward Algorithms: All those algorithms stop when the criterion can no more be reduced. Let us
see some applications of those algorithms on the Ozone data.
• Forward selection: We start from the constant model (only the intercept,
Stepwise Algorithm
no explanatory variable), and we add sequentially the variable that allows We apply the StepAIC algorithm, with the option both of the software R in
to reduce the more the criterion. order to select a subset of variables, and we present here an intermediate result:
• Backward selection: This is the same principle, but starting from the
full model and removing one variable at each step in order to reduce the
criterion.
Start: AIC=6953.05
• Stepwise selection: This is a mixed algorithm, adding or removing one O3obs ∼ MOCAGE + TEMPE + RMH2O + NO2 + NO + VentMOD +
variable at each step in order to reduce the criterion in the best way. VentANG
43
The principle of the Ridge regression is to consider all the explanatory vari-
D EFINITION 8. — The ridge estimator of β
e in the model
ables, but to introduce constraints on the parameters in order to avoid overfit-
ting, and by the same way in order to reduce the variance of the estimators. In
Y=X
eβe + ϵ,
the case of the Ridge regression, we introduce an l2 constraint on the parameter
β.
is defined by
3.1 Model and estimation
n p p
(j)
X X X
If we have an ill-conditionned problem, but we want to keep all the variables, β
b = argmin
β∈Rp+1
(Yi − Xi βj )2 + λ βj2 ,
it is possible to improve the numerical properties and to reduce the variance of i=1 j=0 j=1
the estimator by considering a slightly biased estimator of the parameter β.
We consider the linear model where λ is a non negative parameter, that we have to calibrate.
Y=X
eβe + ϵ, P ROPOSITION 1. — Assume that X is centered. We obtain the following ex-
44
plicit solution for the Ridge estimator: Choice of the penalty term
β̂1
In the Figure 4.2, we see results obtained by the ridge method for several
. values of the tuning parameter λ = l on the polynomial regression example.
′ −1 ′
. = (X X + λIp ) X (Y − Ȳ 1).
β̂0 = Ȳ , β̂R = Increasing the penalty leads to more regular solutions, the bias increases, and
the variance decreases. We have overfitting when the penalty is equal to 0 and
β̂p
under-fitting when the penalty is too large.
Exercise. — Prove the Proposition 1. For each regularization method, the choice of the parameter λ is crucial and
determinant for the model selection. We see in Figure 4.3 the Regularisation
Remarks: path, showing the profiles of the estimated parameters when the tuning param-
eter λ increases.
1. X′ X is a nonnegative symmetric matrix (for all vector u in Rp ,
u′ (X′ X)u = ∥Xu∥2 ≥ 0. Hence, for any λ > 0, X′ X + λIp is in- Choice of the regularization parameter
vertible.
Most softwares use the cross-validation to select the tuning parameter
2. The constant β is not penalized, otherwise, the estimator would depend penalty. The principe is the following:
0
on the choice of the origin for Y. We obtain βb0 = Y, adding a constant
• We split the data into K sub-samples. For all I from 1 to K:
to Y does not modify the values of βbj for j ≥ 1.
– We compute the Ridge estimator associated to a regularization pa-
3. The ridge estimator is not invariant by normalization of the vectors X (j) , rameter λ from the data of all the subsamples, except the I-th (that
it is therefore important to normalize the vectors before minimizing the will be a "‘test"’ sample).
criterion. (−I)
– We denote by β̂ λ the obtained estimator.
4. The ridge regression is equivalent to the least square estimation under the – We test the performances of this estimator on the data that have not
constraint that the l2 -norm of the vector β is not too large: been used to build it, that is the one of the I-th sub-sample.
n o • We compute the criterion:
b = arg min ∥Y − Xβ∥2 ; ∥β∥2 < c .
β R n
β 1X (−τ (i)) 2
CV (λ) = (Y i − X i β̂ λ ) .
The ridge regression keeps all the parameters, but, introducing constraints n i=1
on the values of the βj ’s avoids too large values for the estimated param-
eters, which reduces the variance. • We choose the value of λ which minimizes CV (λ).
45
● ●
● ●
2.5
2.5
● ●
2.0
2.0
● ●
● ●
1.5
1.5
● ●
● ●
● ●
y
y
1.0
1.0
● ●
● ●
0.5
0.5
0.0
0.0
−0.5
−0.5 JOUR1
20
● ●
MOCAGE
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 TEMPE
RMH2O
x x
NO2
15
NO
STATION1
Régression Ridge, l=10^−4 Régression Ridge, l=0.1
STATION2
STATION3
10
STATION4
t(mod.ridge$coef)
● ●
● ●
VentMOD
2.5
2.5
VentANG
● ●
5
2.0
2.0
● ●
● ●
1.5
1.5
● ●
● ●
0
● ●
y
y
1.0
1.0
● ●
● ●
-5
0.5
0.5
0.0
0.0
-10
−0.5
−0.5
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
x x
● ●
Figure 4.3: Regularization paths for the Ridge regression
● ●
2.5
2.5
● ●
2.0
2.0
● ●
● ●
1.5
1.5
● ●
● ●
● ●
y
y
1.0
1.0
● ●
● ●
0.5
0.5
0.0
0.0
−0.5
−0.5
● ●
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
x x
MOCAGE
TEMPE
RMH2O
orthogonal: UU′ = U′ U = In , VV′ = V′ V = Ip .
NO2
We have
15
NO
STATION1
STATION2 Xβb = UD(D′ D + λIp )−1 D′ U′ Y.
R
STATION3
10
STATION4
t(mod.ridge$coef)
VentMOD Suppose that n ≤ p. We denote by u(1) , . . . , u(n) the columns of the matrix
VentANG
U. Setting d1 ≥ . . . ≥ dp ≥ 0 the diagonal elements of D, UD is a n × p
5
p
!
X d2j
Xβ R =
b u j
2+λ (uj )′ Y.
-5
j=1
dj
-10
Let us compare this estimator with the least square estimator (which corre-
0 500 1000 1500 2000
sponds to λ = 0):
Xp
Xβb= uj (uj )′ Y.
j=1
Figure 4.4: Selection of the regularization parameter by CV (uj )′ Y corresponds to the j-th component of Y in the basis (u1 , . . . , un ).
In the case of
the ridge regression, this component is multiplied by the factor
d2j / d2j + λ ∈]0, 1[, we can say that this component has been thresholded.
Application to the Ozoner data: The value of λ selected by cross-validation Remarks:
is 5.4. We show the obtained value in Figure 4.4. 1) When the tuning parameter λ increases, the coefficients are more and more
thresholded.
Singular Value Decomposition and Ridge regression
2) x 7→ x/(x + λ) is a non decreasing function of x for x > 0. The largest
2 2 2
The Singular Value Decomposition (SVD) of the centered matrix X allows coefficients are slightly thresholded: if dj >> λ, dj / dj + λ is close to 1.
to interpret the ridge regression as a shrinkage method. The SVD of the matrix The threshold decreases when j increases since dj decreases.
X has the following form:
We can give an interpretation in relation with the Principal Components
X = UDV′ , Analysis . X being centered, X′ X/n is the empirical variance-covariance
47
We see that the ridge regression shrinks slightly the first principal components Y = Xβ + ϵ,
(for which dj is large), and more the last principal components.
is defined by:
We can associate to the ridge procedure the quantity df (λ) which is called the
effective number of degrees of freedom in the ridge regression and is defined
n p p
by (j)
X X X
βb
Lasso = argmin β∈Rp+1
(Yi − Xi βj )2 + λ |βj | ,
p
X d2j i=1 j=0 j=1
df (λ) = .
d2j + λ
j=1 where λ is a nonnegative tuning parameter.
If λ = 0, df (λ) = p (no shrinkage), if λ → ∞, df (λ) → 0, at the limit, all the We can show that this is equivalent to the minimization problem:
coefficients are equal to 0.
2
β β∈Rp ,∥β∥1 ≤t (∥Y − Xβ∥ ),
b = argmin
L
48
Pp
where t is suitably chosen, and βc0
Lasso = Ȳ . Like for the Ridge regression, under the constraint j=1 |βj | ≤ t, for some t > 0 (depending on λ).
the parameter λ is a regularization parameter: The statistical
Pp software R introduces a constraint expressed by a relative
bound for j=1 |βj |: the constraint is expressed by
• If λ = 0, we recover the least square estimator.
p p
• If λ tends to infinity, all the coefficients β̂j are equal to 0 for j = 1, . . . , p. (0)
X X
|βj | ≤ κ |β̂j |,
j=1 j=1
The solution to the Lasso is parsimonious (or sparse), since it has many null
coefficients. where β̂ (0) is the least square estimator and κ ∈ [0, 1].
′
If the matrix X is orthogonal: (X X = Id), the solution is explicit. For κ = 1 we recover the least square estimator (there is no constraint) and
for κ = 0, all the β̂j , j ≥ 1, vanish (maximal constraint).
P ROPOSITION 2. — If X′ X = Ip , the solution β of the minimization of the
Lasso criterion 4.2 Applications
∥Y − Xβ∥2 + 2λ∥β∥1
We represent in Figure 4.5 the values of the coefficients in function of κ
is defined as follows: for all j = 1, . . . , p, for the Ozone data: this are the regularization paths of the LASSO. As for
the Ridge regression, the tuning parameter is generally calibrated by cross-
βj = sign(βbj )(|βbj | − λ)1|βbj |≥λ ,
validation.
where β b = X′ Y.
b is the least square estimator: β Comparison LASSO/ RIDGE
The obtained estimator corresponds to a soft thresholding of the least square The Figure 4.6 gives a geometric interpretation of the minimization prob-
estimator. The coefficients βbj are replaced by ϕλ (βbj ) where lems for both the Ridge and Lasso estimators. This explains why the Lasso
solution is sparse.
ϕλ : x 7→ sign(x)(|x| − λ)+ .
4.3 Optimization algorithms for the LASSO
Exercise. — Prove the proposition 2.
Convex functions and subgradients
Another formulation
D EFINITION 10. — A function F : Rn → R is convex if ∀x, y ∈ Rn , ∀λ ∈
The LASSO is equivalent to the minimization of the criterion [0, 1],
n F (λx + (1 − λ)y) ≤ λF (x) + (1 − λ)F (y).
(1) (2) (p)
X
Crit(β) = (Yi − β0 − β1 Xi − β2 Xi − . . . − βp Xi )2
i=1
L EMMA 3. — When F is differentiable, we have F (y) ≥ F (x)+⟨∇F (x), y −
49
MOCAGE
15
TEMPE
NO2
STATIONAls
STATIONCad
STATIONPla
STATIONRam
VentMOD
10
VentANG
coefficients
5
0
relative_bound
Figure 4.5: Regularization paths of the LASSO when the penalty decreases
50
Indeed
Due to its parsimonious solution, this method is widely used to select vari- Z (m) is the linear P
• P combination of X (1) , . . . , X (p) of the form
ables in high dimension settings (when p > n). p (j) 2
i=1 αj,m X with αj,m = 1 with maximal variance and orthog-
(1) (m−1)
onal to Z , . . . , Z .
5 Elastic Net
The Principal Component Regression (PCR) consists in considering a predictor
Elastic Net is a method that combines Ridge and Lasso regression, by in- of the form:
troducing simultaneously the l1 and l2 penalties. The criterion to minimize XM
is Ŷ P CR = θ̂m Z (m)
n m=1
(1) (2) (p)
X
(Yi − β0 − β1 Xi − β2 Xi − . . . − βp Xi )2 with
i=1 ⟨Z (m) , Y ⟩
p p
θ̂m = .
X X ∥Z (m) ∥2
+λ α |βj | + (1 − α) βj2
j=1 j=1 Comments:
• For α = 1, we recover the LASSO. • If M = p, we keep all the variables and we recover the ordinary least
square (OLS) estimator.
• For α = 0, we recover the Ridge regression.
• If one can obtain a good prediction with M < p, then we have reduced
In this case, we have two tuning parameters to calibrate by cross-validation. the number of variables, hence the dimension.
53
• Nevertheless, interpretation is not always easy: if the variables are inter- • In order to obtain the following directions, we orthogonalize the variables
pretable, the principal components (that correspond to linear combination X (j) with respect to the first PLS component W (1) :
of the variables) are generally difficult to interpret.
• We substract to each variables X (j) (1 ≤ j ≤ p) its orthogonal projec-
• This method is quite similar to the Ridge regression, which shrinks the tion in the direction given by W (1) and we normalize the variables thus
coefficients of the principal components. Here, we set to 0 the coefficients obtained.
of the principal components of order greater than M .
• We compute the second PLS component W (2) in the same way as the first
• The first principal components are not necessarily well correlated with component by replacing the variables X (j) ’s by the new variables.
the variable to explain Y , this is the reason why the PLS regression has
been introduced. • We iterate this process by orthogonalizing at each step the variables with
respect to the PLS components.
6.2 Partial Least Square (PLS) regression
The algorithm is the following:
The principle of this method is to make a regression on linear combinations
of the variables Xi ’s, that are highly correlated with Y . • Ŷ 0 = Ȳ and X (j),0 = X (j) . For m = 1, . . . , p
Pp
• We assume that Y has been centered, and that the variables X (j) are also • W (m) = j=1 ⟨Y, X
(j,m−1)
⟩X (j,m−1) .
centered and normalized (with norm 1). ⟨Y,W (m) ⟩
• Ŷ m = Ŷ m−1 + ∥W (m) ∥2
W (m) .
• The first PLS component is defined by:
X (j),m−1 −ΠW (m) (X (j),m−1 )
p
X • ∀j = 1, . . . , p, X (j),m = ∥X (j),m−1 −ΠW (m) (X (j),m−1 )∥
.
W (1) = ⟨Y, X (j) ⟩X (j) .
j=1
• The predictor Ŷ p obtained at step p corresponds to ordinary least square
• The prediction associated to this first component is: estimator.
⟨Y, W (1) ⟩ (1) • This method is useless if the variables X (j) are orthogonal.
Ŷ 1 = W .
∥W (1) ∥2 • When the variables X (j) are correlated, PCR and PLS methods present
the advantage to deal with new variables, that are orthogonal.
Note that if the matrix X is orthogonal, this estimator corresponds to the ordi-
nary least square (OLS) estimator, and in this case, the following steps of the • The choice of the number of PCR or PLS components can be done by
PLS regression are useless. cross-validation.
54
55
56
Exercise. — Prove that Hence the decision boundary between the class k and the class l,
{x, P(Y = k/X = x) = P(Y = l/X = x)} is linear.
K
X
fX (x) = gl (x)πl
l=1 We want now to built a decision rule from a training sample D n =
{(X 1 , Y1 ), . . . , (X n , Yn )} which is close to the Bayes rule. For this purpose,
and that we have to estimate for all k πk , µk and the matrix Σ. We consider the follow-
gk (x)πk ing estimators.
fk (x) = PK . Pn
l=1 gl (x)πl Nk X i 1Yi =k
π̂k = , µ̂k = i=1
n Nk
If we assume that the distribution of X given Y = k is a multivariate normal Pn
distribution, with mean µk and covariance matrix Σk , we have where Nk = i=1 1Yi =k . We estimate Σ by
K X
n
(X i − µ̂k )(X i − µ̂k )′ 1Y
1 1 ′ −1
X
i =k
gk (x) = exp − (x − µk ) Σk (x − µk ) . Σ̂ = .
(2π)p/2 |Σk |1/2 2 n−K
k=1 i=1
For the linear discriminant analysis, we furthermore assume that Σk = Σ To conclude, the Linear Discriminant Analysis assigns the input x to the class
for all k. In this case we have fˆ(x) which maximises δ̂k (x), where we have replaced in the expression of
δk (x) the unknown quantities by their estimators.
log P(Y = k/X = x) = C(x) + δk (x) Remark: If we no more assume that the matrix Σ does not depend on the class
k, we obtain quadratic discriminant functions
where C(x) does not depend on the class k, and
1 1
1 ′ −1 δk (x) = − log |Σk | − (x − µk )′ Σ−1 k (x − µk ) + log(πk ).
′ −1
δk (x) = x Σ µk − µk Σ µk + log(πk ). 2 2
2
This leads to the quadratic discriminant analysis.
The Bayes rule will assign x to the class f ∗ (x) which maximises δk (x).
3 Linear Support Vector Machine
P(Y = k/X = x) πk 3.1 Linearly separable training set
log = log + x′ Σ−1 (µk − µl )
P(Y = l/X = x) πl
1 We assume that X = Rd , endowed with the usual scalar product ⟨., .⟩, and
− (µk + µl )′ Σ−1 (µk − µl ). that Y = {−1, 1}.
2
57
1 |⟨w, x1 − x−1 ⟩|
γ= .
2 ∥w∥
Let us notice that for all κ ̸= 0, the couples (κw, κb) and (w, b) define the
same hyperplane.
3.2 A convex optimisation problem The corresponding dual problem corresponds to the maximization of
Finding the separating hyperplane with maximal margin consists in finding n n
(w, b) such that
X 1 X
θ(α) = αi − αi αj yi yj ⟨xi , xj ⟩
i=1
2 i,j=1
∥w∥2 or 21 ∥w∥2 is minimal Pn
under the constraint under the constraint i=1 αi yi = 0 and αi ≥ 0 ∀i.
yi (⟨w, xi ⟩ + b) ≥ 1 for all i. The Karush-Kuhn-Tucker conditions are
This leads to a convex optimization problem with linear constraints, hence
• αi∗ ≥ 0 ∀i = 1 . . . n.
there exists a unique global minimizer.
The primal problem to solve is: • yi (⟨w∗ , xi ⟩ + b∗ ) ≥ 1 ∀i = 1 . . . n.
∗ 2 −1/2
Pn
The maximal margin equals γ ∗ = 1
∥w∗ ∥ = i=1 (αi ) (provided the
xi ’s are normalized). • If ξi ∈ [0, 1] the point is well classified but in the region defined by the
The αi∗ that do not correspond to support vectors (sv) are equal to 0, and margin.
therefore
fˆ(x) = 1Px yi α∗ ∗
i ⟨xi ,x⟩+b ≥0
− 1Px yi α∗ ∗
i ⟨xi ,x⟩+b <0
. • If ξi > 1 the point is misclassified.
i sv i sv
The previous formulation has two main drawbacks : it assumes that the classes
are linearly separable and it is also very sensitive to outliers as illustrated in
The margin is called flexible margin.
Figure 5.1.
60
Pn
3.5 Optimization problem with relaxed constraints • w∗ = i=1 αi∗ xi yi ,
In order to avoid too large margins, we penalize large values for the slack • b∗ = − 12 {minyi =1 ⟨w∗ , xi ⟩ + minyi =−1 ⟨w∗ , xi ⟩}.
variable ξi .
The primal optimization problem is formalized as follows : We have here two types of support vectors (xi such that αi∗ > 0) :
Pn • The support vectors for which the slack variables are equal to 0. They are
1 2
Minimize with respect to (w, b, ξ) 2 ∥w∥ +C i=1 ξi such that located on the border of the region defining the margin.
yi (⟨w, xi ⟩ + b) ≥ 1 − ξi ∀ i
• The support vectors for which the slack variables are not equal to 0: ξi∗ >
ξi ≥ 0 0 and in this case αi∗ = C.
Remarks : For the vectors that are not support vectors, we have αi∗ = 0 and ξi∗ = 0.
• C > 0 is a tuning parameter of the SVM algorithm. It will determine
the tolerance to misclassifications. If C increases, the number of misclas-
sified points decreases, and if C decreases, the number of misclassified
points increases. C is generally calibrated by cross-validation.
Exercise. — Write the Lagrangian, the dual problem, and the KKT conditions.
Karush-Kuhn-Tucker conditions :
• 0 ≤ αi∗ ≤ C ∀i = 1 . . . n.
• yi (⟨w∗ , xi ⟩ + b∗ ) ≥ 1 − ξi∗ ∀i = 1 . . . n.
• αi∗ (yi (⟨w∗ , xi ⟩ + b∗ ) + ξi∗ − 1) = 0 ∀ i = 1 . . . n. Figure 5.2: Support Vectors in the non separable case
• ξi∗ (αi∗ − C) = 0.
As previously, we obtain the following classification rule: We have assumed in this chapter that the classes are (nearly) linearly sep-
arable. This assumption is often unrealistic, and we will see in the Chapter 6
how to extend the SVM classifiers to a more general setting. Moreover, we fo-
fˆ(x) = 1⟨w∗ ,x⟩+b∗ ≥0 − 1⟨w∗ ,x⟩+b∗ <0 , cused here on classification problems but procedures based on support vector
with for regression have also been proposed and will be presented in Chapter 6.
Chapter 6
1 Introduction
In Chapter 5, we have studied linear SVM. The assumption was that the
training set is nearly linearly separable. In most cases, this assumption is not
realistic.
In this case, a linear SVM leads to bad performances and a high number of
support vectors. We can make the classification procedure more flexible by
enlarging the feature space and sending the entries {xi , i = 1 . . . n} in an
Hilbert space H, with high or possibly infinite dimension, via a function ϕ,
Figure 6.1: Non linearly separable training set
and we apply a linear SVM procedure on the new training set {(ϕ(xi ), yi ), i =
1 . . . n}. The space H is called the feature space. This idea is due to Boser,
Guyon, Vapnik (1992).
In Figure 6.1, setting ϕ(x) = (x21 , x22 , x1 , x2 ), the training set becomes linearly
separable in R4 , and a linear SVM is appropriate.
61
62
′
By way of example, let us precise the RKHS associated with the Gaussian and f a function Rd → R, ϕ : Rd → Rd , B a positive definite matrix, P a
kernel. polynomial with positive coefficients and λ > 0.
P ROPOSITION 8. — For any function f ∈ L1 (Rd ) ∩ L2 (Rd ) and ω ∈ Rd , we The functions defined by k(x, x′ ) = k1 (x, x′ ) + k2 (x, x′ ), λk1 (x, x′ ),
define the Fourier transform k1 (x, x′ )k2 (x, x′ ), f (x)f (x′ ), k1 (ϕ(x), ϕ(x′ )), xT Bx′ , P (k1 (x, x′ )), or
′
Z ek1 (x,x ) are still kernels.
1
F [f ](ω) = f (t)e−i⟨ω,t⟩ dt. We have presented examples of kernels for the case where X = Rd but a
(2π)d/2 Rd
very interesting property is that kernels can be defined for very general input
For any σ > 0, the functional space spaces, such as sets, trees, graphs, texts, DNA sequences ...
Z
2 2
Hσ = {f ∈ C0 (Rd ) ∩ L1 (Rd ) such that |F [f ](ω)|2 eσ |ω| /2 dω < +∞} 3 Minimization of the convexified empirical
Rd
risk
endowed with the scalar product
Z The ideal classification rule is the one which minimizes the risk L(f ) =
⟨f, g⟩Hσ = (2πσ )2 −d/2
F [f ](ω)F [g](ω)e σ 2 |ω|2 /2
dω, P(Y ̸= f (X)), we have seen that the solution is the Bayes rule f ∗ . A classical
Rd way in nonparametric estimation or classification problems is to replace the
∥x−x′ ∥2
risk by the empirical risk and to minimize the empirical risk:
is the RKHS associated with the Gaussian kernel k(x, x′ ) = e− 2σ 2 . n
1X
Ln (f ) = 1Y ̸=f (Xi ) .
Indeed, for all x ∈ Rd , the function k(x, .) belongs to Hσ and we have n i=1 i
term. Enlarging the class F reduces the approximation error but increases empirical risk 1yi (⟨w,xi ⟩+b)<0 with the hinge loss.
the stochastic error.
The empirical risk minimization classifier cannot be used in practice because
of its computational cost, indeed Ln is not convex. This is the reason why we
generally replace the empirical misclassification probability Ln by some con-
vex surrogate, and we consider convex classes F. We consider a loss function
ℓ, and we require the condition ℓ(z) ≥ 1z<0 , which will allow to give an upper
bound for the misclassification probability; indeed
Classical convex losses ℓ are the hinge loss ℓ(z) = (1 − z)+ , the exponential
loss ℓ(z) = exp(−z), the logit loss ℓ(z) = log2 (1 + exp(−z)).
Let us show that SVM are solutions of the minimization of the convexified
(with the hinge loss) and penalized empirical risk. For the sake of simplicity,
we consider the linear case.
We first notice that
the following optimization problem:
Pn yi (⟨w, xi ⟩ + b) ≥ 1 − ξi ∀ i
Minimizing 21 ∥w∥2 + C i=1 ξi s. t.
ξi ≥ 0
is equivalent to minimize Hence, SVM are solutions of the minimization of the convexified empirical
n
risk with the hinge loss ℓ plus a penalty term. Indeed, SVM are solutions of
1 2
X
∥w∥ + C (1 − yi (⟨w, xi ⟩ + b))+ , n
2 i=1
1X
argminf ∈F ℓ(yi f (xi )) + pen(f ),
n i=1
or equivalently
where
n
1X 1 F = {⟨w, x⟩ + b, w ∈ Rd , b ∈ R}
(1 − yi (⟨w, xi ⟩ + b))+ + ∥w∥2 .
n i 2Cn
and
1
γ(w, b, xi , yi ) = (1 − yi (⟨w, xi ⟩ + b))+ is a convex upper bound of the ∀f ∈ F, pen(f ) = ∥w∥2 .
2Cn
65
Dual problem. Show that the dual problem can be formulated as follows : We are looking for a predictor of the form
n n
1 X
(αi − αi′ )(αj − αj′ )⟨xi , xj ⟩
X
Maximize − f (x) = cj k(Xj , x), c ∈ Rn .
2 i,j=1
i=1
n
X n
X
−ε (αi + αi′ ) + yi (αi − αi′ ) Let us denote by K the matrix defined by K i,j = k(Xi , Xj ). The KRLS
i=1 i=1 method consists in minimizing for f on the form defined above the penalized
n
X least square criterion
subject to (αi − αi′ ) = 0 and 0 ≤ αi , αi′ ≤ C ∀i. n
X
i=1
(Yi − f (Xi ))2 + λ∥f ∥2K ,
Karush-Kuhn-Tucker conditions : i=1
where
• αi∗ (ε + ξi∗ − yi + ⟨w∗ , xi ⟩ + b∗ ) = 0 n
X
∥f ∥2K = ci cj k(Xi , Xj ).
• (αi′ )∗ (ε + (ξi′ )∗ + yi − ⟨w∗ , xi ⟩ − b∗ ) = 0
i,j=1
• ξi∗ (C − αi∗ ) = 0, (ξi′ )∗ (C − (αi′ )∗ ) = 0 Equivalently, we minimize for c ∈ Rn the criterion
Exercise. — Draw a picture similar to Figure 5.2 to show the support vectors ∥Y − Kc∥2 + λc′ Kc.
for the regression problem.
There exists an explicit solution
As previously, only the scalar product ⟨xi , xj ⟩ are involved in the solution,
allowing easily to extend to nonlinear regression functions. ĉ = (K + λIn )−1 Y,
n
X
fˆ(x) = ĉj ⟨Xj , x⟩.
j=1
For polynomial or Gaussian kernels for example, we obtain non linear predic-
tors. As for SVM, an important interest of this method is the possibility to be
generalized to complex predictors such as text, graphs, DNA sequences .. as
soon as one can define a kernel function on such objects.
6 Conclusion
• Using kernels allows to delinearize classification algorithms by mapping
X in the RKHS H with the map x 7→ k(x, .). It provides nonlinear
algorithms with almost the same computational properties as linear ones.
• SVM have nice theoretical properties, cf. Vapnik’s theory for empirical
risk minimization [40].
• The use of RKHS allows to apply to any set X (such as set of graphs, texts,
DNA sequences ..) algorithms that are defined for vectors as soon as we
can define a kernel k(x, y) corresponding to some measure of similarity
between two objects of X .
• Important issues concern the choice of the kernel, and of the tuning pa-
rameters to define the SVM procedure.
• Note that SVM can also be used for multi-class classification problems
for example, one can built a SVM classifier for each class against the
others and predict the class for a new point by a majority vote.
• Kernels methods are also used for non supervised classification (kernel
PCA), and for anomaly detection (One-class SVM).
68
Chapter 7
69
70
separates each node into two child nodes more homogeneous than the parent
node in the sense of a criterion to be specified and depending on the type of
the variable Y that we have: quantitative or qualitative.
A first very simple and natural non parametric procedure in supervised re-
gression or classification is the k-Nearest Neighbors (k-NN) method. Given a
leaning sample {(X 1 , Y1 ), . . . , (X n , Yn )} in X × Y, we want to predict the
output Y associated to a new entry x. For this, it seems natural to built the pre-
dictor from the observations in the training sample that are "close" to x. We
consider a distance d on X . We fix an integer k and we retain the k nearest to
x observations {X (1) , . . . , X (k) } and the associated outputs (Y(1) , . . . , Y(k) ). Figure 7.2: Source: Hastie, Tibshirani, Friedman (2019), “The elements of
In a regression context, the prediction at point x is obtained from the mean of statistical learning”
the observations (Y(1) , . . . , Y(k) ) while in classification we consider a majority
vote. The choice of k is of course crucial. A too small value leads to overfitting
modeled is regular.
(small bias but high variance) while a large value of k may lead to underfitting
(small variance but probably high bias).
CART will use the same idea of local mean or majority vote, but the cell in
X that is used to predict at point x is obtained from a more sophisticated way These two aspects or weaknesses of CART: instability and irregularities are
than simply considering the k-Nearest Neighbors of x in the learning sample. at the origin of the success of the methods of aggregation leading to Random
It will also take into account the values of the Yi ’s. When partitioning ends, Forests proposed by Breiman (2001) [8], that will be the topic of next chapter.
each terminal node of the complete tree becomes a leaf to which is assigned a
value if Y is quantitative and a class if Y is qualitative. 2 Construction of a maximal binary tree
The last step consists in pruning the complete tree, which corresponds to a
model selection procedure in order to reduce the complexity and avoid over- We observe p quantitative or qualitative explanatory variables X j and a
fitting. Since Breiman et al. (1984) [9] have introduced this algorithm, CART variable to predict Y which is either qualitative with m modalities {Tℓ ; ℓ =
have been very successful with the major advantage of an easy interpretation 1 . . . , m} or real quantitative, on a sample of n individuals.
of the trees. The drawback is that these models are particularly unstable (not The construction of a binary discrimination tree (cf. figure 1) consists in
robust), very sensitive to fluctuations in the training sample. Furthermore, determining a sequence of nodes.
for quantitative explanatory variables, the construction of a tree constitutes
a dyadic partitioning of space (see Figure 7.2). The model thus defined is, by • A node is defined by the choice of a variable among the p explanatory
construction, discontinuous which may be a problem if the phenomenon to be variables and of a division which induces a partition into two classes.
71
Implicitly, to each node corresponds a subset of the initial sample to which The division criterion is based on the definition of an heterogeneity func-
a dichotomy is applied. tion presented in the next section. The objective is to divide the observations
which compose a node into two more homogeneous groups with respect to the
• A division is defined by a threshold value if the selected variable is quan- variable to explain Y .
titative or a split into two groups of modalities if the variable is qualitative.
Dividing the node κ creates two son nodes. For simplicity, they are denoted
• At the root, the initial node corresponds to the whole sample; the proce- κL (left node) and κR (right node).
dure is then iterated over each of the subsets. Among all the admissible divisions of the node κ, the algorithm retains the
one which minimizes the sum of the heterogeneities of the son nodes DκL +
The algorithm requires: DκR . This amounts to solving at each node κ:
1. the definition of a criterion allowing to select the best division among all max Dκ − (DκL + DκR )
admissible ones for the different variables; {divisions ofX j ;j=1,p}
2. a rule allowing to decide that a node is terminal: it thus becomes a leaf; Graphically, the length of each branch can be represented proportionally to the
3. the predicted value (class or real value) associated to a leaf. reduction in heterogeneity induced by the division.
3 Homogeneity criterion
3.1 Constructing regression trees
For a given region (node) κ with cardinality |κ|, we define the empirical
variance at node |κ| by
1 X
Vκ = (Yi − Y κ )2 , x < 5.98361
|κ| i∈κ |
1
P
where Y κ = |κ| i∈κ Yi . x < 34.1597
The heterogeneity at the node κ is then defined by 17.840
6.796
x < 12.0953
2.501
7.962
X
Dκ = (Yi − Y κ )2 = |κ|Vκ
i∈κ
25
● ●
20
●
15
●
●
y
and right subregions
10
●
●● ● ●
●
● ● ●
●● ●
5
●● ● ●
●
● ●
●
0
●
Dκ − J(j, t)
0.6
Gini
sures in classification. Two main measures are considered to define the het-
erogeneity of node κ.
0.4
For ℓ = 1, . . . , m, let pℓκ denote proportion of the class Tℓ of Y in the node κ.
0.2
• The Cross-Entropy or deviance is defined by
0.0
0.0 0.2 0.4 0.6 0.8 1.0
m
X p
Eκ = − pℓκ log(pℓκ ).
ℓ=1
The heterogeneity at the node κ is then defined by Figure 7.4: Heterogeneity criterions for classification. Both are minimal for
p = 0 or p = 1, and maximal for p = 1/2.
m
X
Dκ = −|κ| pℓκ log(pℓκ ).
ℓ=1
1 1
The cross-entropy is maximal in ( m ,..., m ), minimal in
(1, 0, . . . , 0), . . . , (0, . . . , 0, 1) 4 Pruning the maximal tree
(by continuity, we assume that 0 log(0) = 0).
The previous construction leads to a maximal tree Amax (depending on the
• The Gini concentration s defined by stopping rule) with K leaves, that is generally very unstable and heavily de-
X m pends on the training sample: it is overfitted. We have to build a more parsi-
ℓ ℓ monious, and hence more robust, prediction model. This will be achieved by
Gκ = pκ (1 − pκ ),
ℓ=1 pruning the maximal tree. We have to find a compromise between the trivial
tree reduced to the root (which is underfitted) and the maximal tree Amax . The
which leads to the heterogeneity at the node κ
prediction performances of various trees could be compared on a validation
X m set. All sub trees of the maximal tree are admissible, but they are generally
Dκ = |κ| pℓκ (1 − pℓκ ). too many to be all considered. To get around this problem, Breiman et al.
ℓ=1 (1984)[9] have proposed an algorithm, based on a penalized criterion, to build
an nested sequence of sub-trees of the maximal tree. One then chooses, among
An illustration of these two heterogeneity measures in presented in Figure
this sequence, the optimal tree minimizing a generalization error.
7.4 in the simple case where we have two classes (m = 2).
74
∗
4.1 Construction of Breiman’s subsequence • Critγ1 (κ∗ ) = Critγ1 (Aκ ∗
1 ) and, for γ = γ1 , the node κ becomes
∗
preferable to the subtree Aκ1 .
For a given tree A, we denote by |A| its number of leaves or terminal nodes
κ. The value of |A| is a measure of the complexity of the tree A. We define the
• A2 is the subtree obtained by pruning the branches from the nodes κ∗
fit quality of a tree A by
|A|
minimizing s(κ, Aκ1 ): this gives the second tree in the sub-sequence.
X
D(A) = Dκ
κ=1
• This process is iterated.
where Dκ is the heterogeneity of the terminal node κ of the tree A. The con-
struction of Breiman’s subsequence relies on the penalized criterion We obtain a nested sequence of sub trees
For γ = 0, Amax minimizes C(A). When γ increases, a pruned tree will where AK is the trivial tree, reduced to the root, gathering all the training
be preferable to the maximal tree. More precisely, Breiman’s subsequence is sample.
obtained as follows:
4.2 Determination of the optimal tree
• Let A1 be the sub tree of Amax (maximal tree) obtained by pruning the Once the nested sequence of trees is obtained, we have to determine an op-
nodes κ such that D(κ) = D(κL ) + D(κR ). timal one, minimizing the generalization error. As explained in Chapter 2, this
error can be estimated on a validation set. More often, V -fold cross-validation
• For each node in A1 , D(κ) > D(κL ) + D(κR ) and D(κ) > D(Aκ 1)
κ is used. In this case, the implementation of the V -fold cross-validation is par-
where A1 is the subtree of A1 from node κ.
ticular since for each of the V subsamples composed of V −1 folds, we obtain a
• For γ small enough, for all node κ of A1 , D(κ) + γ > D(Aκ 1 ) + γ|A κ
1 |. different sequence of trees. In fact, the aim of cross-validation is to determine
This holds while, for all node κ of A1 , the optimal value of the penalization parameter γ resulting from Breiman’s
subsequence produced with the whole training set. We then choose the tree
γ < (D(κ) − D(Aκ1 ))/(|Aκ1 | − 1) = s(κ, Aκ1 ). associated with this optimal value of γ. In the cross-validation procedure, for
each value of γ produced by Breiman’s subsequence, the mean error is com-
We then define puted for the V subtrees . This leads to an optimal value of γ, minimizing the
κ ∗ κ∗ prediction error estimated by cross-validation. We then retain the tree corre-
γ1 = inf s(κ, A1 ) = s(κ , A1 ).
κ node of A1 sponding to this value of γ in Breiman’s subsequence.
Algorithm 3 describes the selection of an optimal tree :
75
the decrease in heterogeneity for each variable and split is computed by using
the associated available observations and the best split is chosen as usual.
Instability of trees
A major drawback of trees is their high variance. They are not robust, in the
sense that a small change in the data can lead to very different sequences of
splits. This is why we have to be careful with the interpretation. This is due
to the hierarchical procedure : an error in the choice of a split in the top of
the tree cannot be corrected below. This instability is the price to pay to have
a simple and interpretable model. We will see in Chapter 8 how to aggregate
trees to reduce the variance of the prediction rule.
Lack of smoothness
In a regression framework, the trees are constant piecewise functions, they
are hence not smooth (not even continuous). This may be a problem if the
phenomenon to model is regular. More regular algorithms, such as the MARS
procedure have been developed (see Hastie and al [21]).
same prediction value is obtained for observation falling in the same terminal
node. This is why we observe a column per leaf.
6 Conclusion
Trees have nice properties : they are easy to interpret, efficient algorithms
exist to prune them, they are tolerant to missing data. All these properties
made the success of CART for practical applications. Nevertheless, CART
algorithm has also important drawbacks : it is highly instable, being not robust
to the learning sample and it also suffers from the curse of dimensionality. The
selected tree only depends on few explanatory variables, which is nice for the
interpretation but trees are often (wrongly) interpreted as a variable selection
procedure, due to their high instability. Moreover, prediction accuracy of a
tree is often poor compared to other procedures. This is why more robust
procedures, based on the aggregation of trees leading to Random Forests have
been proposed . They also have better prediction accuracy. This is the topic of
Chapter 8.
79
80
case, if the model returns probabilities associated with each modality as in the
logistic regression model, it is simple to calculate these probabilities.
The principle is elementary, averaging the predictions of several indepen-
dent models allows to reduce the variance and therefore to reduce the predic-
25
●
●
tion error.
20
●
●
15
require too much data. These samples are therefore replaced by B bootstrap ●
●
●
y
samples each obtained by n draws with replacement according to the empirical ●
10
●
●
●
● ●
●
●
●
●
●
● ●
5
●
● ●
●●
●
● ●
●
0
●
●
−10 0 10 20 30 40
Algorithm 4 Bagging x
Let x0 and
Z = {(X1 , Y1 ), . . . , (Xn , Yn )} a learning sample.
25
for b = 1 to B do ●
●
20
●
●
●
15
end for PB
●
●
Compute the mean fbB (x0 ) = B1 b=1 fbzb (x0 ) or the result of a majority
y
●
●
10
● ●●
●
vote. ●
●
●
●
●
● ●
5
●
●
●
●
●
●
●
●
●
0
−10 0 10 20 30 40
Figure 8.1 presents two bootstrap samples and the corresponding models x
3 Random Forests
Parameters of the algorithm
3.1 Motivation
The pruning strategy can, in the case of random forests, be quite elementary.
In the specific case of binary decision tree models (CART), Breiman (2001) Indeed, pruned trees may be strongly correlated because they may involve the
[8] proposes an improvement of the bagging by adding a random component. same variables appearing to be the most explanatory. In the default strategy of
The objective is to make the aggregated trees more independent by adding the algorithm, it is simply the minimum number of observations per leaf which
randomness in the choice of the variables which are involved in the prediction. limits the size of the tree, it is set to 5 by default. We therefore aggregate rather
Since the initial publication of the algorithm, this method has been widely complete trees, which are considered of low bias but of high variance.
tested and compared with other procedures see Fernandez-Delgado et al. 2014
The random selection of a reduced number of m potential predictors at each
[16], Caruana et al. 2008 [10]. It becomes in many machine learning articles
stage of the construction of the trees significantly increases the variability by
the method to beat in terms of prediction accuracy. Theoretical convergence
highlighting other variables. Each tree is obviously less efficient, sub-optimal,
properties, difficult to study, have been published quite recently (Scornet et
but, united being strength, aggregation ultimately leads to good results. The
al. 2015) [36]. However, it can also lead to bad results, especially when the
number m of variables drawn randomly can, according to the examples, be a
underlying problem is linear.
sensitive parameter with default choices are not always optimal :
3.2 Algorithm √
• m= p in a classification problem,
The bagging is applied to binary decision trees by adding a random selection
of m explanatory variables among the p variables. • m = p/3 in a regression problem.
82
ously all the more useful as the variables are very numerous. Two criteria have
been proposed to evaluate the importance of the variable X j .
25
●
●
• The first one Mean Decrease Accuracy (MDA) is based on a random per-
20
●
● mutation of the values of this variable. The more the quality of the pre-
●
● ●
diction, estimated by an out-of-bag error, is degraded by the permutation
15
●
●
of this variable, the more the variable is important. Once the bth tree has
y
●
been constructed, the out-of-bag sample is predicted for this tree and the
10
●
● ● ●
●
●
● ●
estimated error is recorded. The values of the jth variable are then ran-
●
●
●
domly permuted in the out-of-bag data sample and the error is computed
5
●
● ●
●
● ●
●
again. The decrease in prediction accuracy is averaged over all the trees
and used to assess the importance of the variable X j in the forest. It is
●
0
The iterative evaluation of the out-of-bag error makes it possible to control the 1 X
number B of trees in the forest as well as to optimize the choice of m. It is Rn [fbzb , D] = (Yi − fbzb (Xi ))2 .
|D|
nevertheless a cross-validation procedure which is preferably used to optimize i,(Xi ,Yi )∈D
m. The Figure 8.2 presents an example of Random Forest regression predictor,
built with B = 500 bootstrap samples. – The MDA is defined by
⋆
* jκ corresponds to the optimal variable selected for the split
⋆ ⋆
* tκ corresponds to the optimal threshold along the jκ variable. TEMPE ● TEMPE ●
MOCAGE MOCAGE
Example on ozone data : ● ●
STATION ● VentANG ●
VentANG ● SRMH2O ●
“The first measure [%IncMSE] is computed from permuting OOB data : For VentMOD ● LNO2 ●
LNO ● STATION ●
each tree, the prediction error on the out-of-bag portion of the data is recorded LNO2 ● LNO ●
JOUR JOUR
(error rate for classification, MSE for regression). Then the same is done ● ●
after permuting each predictor variable. The difference between the two are 0 10 20 30 40 50 0e+00 1e+05 2e+05 3e+05 4e+05
then averaged over all trees, and normalized by the standard deviation of the %IncMSE IncNodePurity
differences.
The second measure [IncNodePurity] is the total decrease in node impurities Figure 8.3: Variable importance plot, returned by the R function
from splitting on the variable, averaged over all trees. For classification, the importance. MDA on the left and MDI on the right.
node impurity is measured by the Gini index. For regression, it is measured
by residual sum of squares.”
3.4 Implementation
• The randomForest library of R interfaces the original program devel-
oped in Fortran77 by Leo Breiman and Adele Cutler which maintains the
site dedicated to this algorithm.
84
• An alternative in R, more efficient in computing time especially with a grouped into a other modality. As previously, the time to deter-
large volume of data, consists in using the ranger library. mine a better division is obviously largely influenced by the number
of modalities or even the number of possible values of a quantitative
• The software site Weka developed at Waikato University in New Zealand variable. Reducing the number of possible values is finally another
offers a version in Java. way of reducing the computation time but it would be appropriate
• A very efficient and close version of the original algorithm is available in to guide the groupings of the modalities to avoid misinterpretations.
the Scikit-learn library of Python.
• Another version suitable for big data is available in the MLlib li-
4 Conclusion
brary of Spark, a technology developed to interface different hardware/-
Having become the Swiss Army Knife of learning, Random Forests are used
software architectures with distributed data file management systems
for different purposes (see the dedicated site) :
(Hadoop). In addition to the usual parameters : number of trees, max-
imum depth of trees, and number of variables drawn at random to build
• Similarity or proximity between observations : after building each tree,
a subdivision at each node, this implementation adds two parameters :
increment by 1 the similarity or proximity of two observations that are in
subsamplingRate and maxBins, which have a default value. These
the same leaf. Sum on the trees of the forest, normalize by the number of
parameters play an important role, certainly in drastically reducing the
trees. A multidimensional positioning can represent these similarities or
computation time, but, on the other hand, in restricting the precision of
the matrix of dissimilarities that results from them.
the estimate. They regulate the balance between computation time and
precision of the estimate as a subsampling in the data would do. • Detection of multidimensional atypical observations : outliers or novel-
ties that correspond to observations which do not belong to known classes.
– subsamplingRate =1.0 subsamples as its name suggests before
A criterion of "abnormality" with respect to a class is based on the pre-
building each tree. With the default value, it is the classic version of
vious notion of proximities of an observation to the other observations of
random forests with B Bootstrap samples of size n but if this rate is
its class.
less than 1, smaller samples are drawn. The sample are then more
distinct (or independent) for each tree. The variance is therefore re- • Another algorithm, inspired by Random Forests has been developed for
duced (more independent trees) but the bias increases because each anomaly detection, it is called isolation forest.
tree is built with a smaller data set.
• Random forests are used for the Imputation of missing data.
– maxBins = 32 is the maximum number of categories that are con-
sidered for a qualitative variable or the number of possible values for • Adaptations to take into account censored data to model survival times
a quantitative variable. Only the most frequent modalities of a qual- correspond to the survival forest algorithm.
itative variable are taken into account, the others are automatically
Chapter 9
85
86
Classical convex losses ℓ are the hinge loss ℓ(z) = (1 − z)+ , (used for SVM’s Prove that
for example), the exponential loss ℓ(z) = exp(−z), the logit loss ℓ(z) = jm = argmin errm (j),
log2 (1 + exp(−z)). j=1,...,p
Since this optimization problem (9.1) is complex, in order to approximate the and
solution, Adaboost computes a recursive sequence of predictors fˆm for m =
1 1 − errm (j)
βm = log .
0, . . . M with 2 errm (j)
fˆ0 = 0 (1)
Note that the initial weights are equal for all the observations : wi = 1/n
fˆm = fˆm−1 + βm fjm for i = 1, . . . , n and that the AdaBoost algorithm then attributes more weights
where (βm , jm ) minimizes the empirical risk associated to the exponential loss in the computation of the predictor at step m to the observations for which the
function: exponential loss exp(−Yi fˆm−1 (Xi )) of the previous estimator is high.
n This leads to the AdaBoost algorithm :
1X
(βm , jm ) = argmin { exp(−Yi (fˆm−1 (Xi ) + βfj (Xi )))}. (9.2)
β∈R,j=1,...,p n i=1 Algorithm 6 AdaBoost
Hence, at each step, the algorithm looks for the best classifier fjm in the avail- Choose a parameter M .
(1)
able collection f1 , . . . , fp and the best coefficient βm such that adding βm fjm wi = 1/n for i = 1, . . . , n.
to the previous predictor fˆm−1 minimizes the empirical exponential loss. for m = 1, . . . , M do
The final classification rule is given by jm = argmin errm (j)
j=1,...,p
fˆ = sign(fˆM ). βm = 21 log 1−err m (j)
errm (j)
(m+1) (m)
M is a parameter of the procedure. wi = wi exp(−Yi βm fjm (Xi )) for i = 1, . . . , n.
The aim of the following exercise is to compute the solution of (9.2). end for P
M
Exercise. — We denote fˆM (x) = m=1 βm fjm (x).
1 fˆ = sign(fˆM ).
(m)
wi exp(−Yi fˆm−1 (Xi ))
=
n
and we assume that for all j = 1, . . . , p, The introduction of the exponential loss is motivated by computational rea-
Pn (m)
sons in the context of a sequentially additive modeling approach: it leads to
i=1 wi 1sign(fj (Xi ))̸=Yi the simple reweighting AdaBoost algorithm. One can wonder about the rele-
errm (j) = Pn (m)
∈]0, 1[.
i=1 wi
vance of this exponential loss function. The aim of the following exercise is to
87
show that minimizing the exponential loss (at the population level) leads to the consider the L2 loss, leading to the L2 boosting. Sometimes the L1 loss is
Bayes classifier. Although this is not true for the empirical loss, this justifies prefered since it is more robust (less sensitive to outliers). The Huber loss
the use of the exponential loss. combines the advantages of the L2 loss (it is differentiable) and the L1 loss
Exercise. — Let (robustness). It is defined by
f ∗ (x) = argminE e−Y f (X) /X = x . ℓ(y, y ′ ) = (y − y ′ )2 1|y−y′ |≤δ + (2δ|y − y ′ | − δ 2 )1|y−y′ |>δ ,
f
Another way to avoid overfitting is to use a shrinkage method. In the case of indices of importance computed for each tree. This is crucial to have an idea
boosting procedures, this corresponds to scaling the contribution of each tree of the relative importance of each predictor in the model.
by a factor 0 < ν < 1, which leads to the formula The Gradient Boosting is implemented in the R package gbm. It is also imple-
J m
mented in the Scikit Learn library of Python (GradientBoostingClassifier and
GradientBoostingRegressor).
X
fm (x) = fm−1 (x) + ν
b b γjm 1x∈R jm
j=1
6 Conclusion
To summarize, the boosting allows to reduce the variance compared with
single procedures but also the bias by aggregation, which generally leads to
very performant procedures.
Trees are easily interpretable. Of course, the interpretation is lost by aggre-
gating. Nevertheless, like for Random Forest, indices of importance can be
computed : one can average over the M trees of the boosting algorithm the
90
Chapter 10
1 Introduction • The recurrent neural networks, used for sequential data such as text or
times series.
Deep learning is a set of learning methods attempting to model data with
complex architectures combining different non-linear transformations. The el-
ementary bricks of deep learning are the neural networks, that are combined to They are based on deep cascade of layers. They need clever stochastic op-
form the deep neural networks. timization algorithms, and initialization, and also a clever choice of the struc-
ture. They lead to very impressive results, although very few theoretical fon-
These techniques have enabled significant progress in the fields of sound
dations are available till now.
and image processing, including facial recognition, speech recognition, com-
puter vision, automated language processing, text classification (for example
spam recognition). Potential applications are very numerous. A spectacularly 2 Neural networks
example is the AlphaGo program, which learned to play the go game by the
An artificial neural network is an application, non linear with respect to its
deep learning method, and beated the world champion in 2016.
parameters θ that associates to an entry x an output y = f (x, θ). For the
There exist several types of architectures for neural networks: sake of simplicity, we assume that y is unidimensional, but it could also be
multidimensional. This application f has a particular form that we will precise.
• The multilayer perceptrons, that are the oldest and simplest ones
The neural networks can be use for regression or classification. As usual in
• The Convolutional Neural Networks (CNN), particularly adapted for im- statistical learning, the parameters θ are estimated from a learning sample. The
age processing function to minimize is not convex, leading to local minimizers. The success
91
92
of the method came from a universal approximation theorem due to Cybenko • The Rectified Linear Unit (ReLU) activation function
(1989) and Hornik (1991). Moreover, Le Cun (1986) proposed an efficient
way to compute the gradient of a neural network, called backpropagation of ϕ(x) = max(0, x).
the gradient, that allows to obtain a local minimizer of the quadratic criterion
easily. Here is a schematic representation of an artificial neuron where Σ = ⟨wj , x⟩ +
bj . The Figure 10.2 represents the activation function described above. His-
2.1 Artificial Neuron
An artificial neuron is a function fj of the input x = (x1 , . . . , xd ) weighted
by a vector of connection weights wj = (wj,1 , . . . , wj,d ), completed by a
neuron bias bj , and associated to an activation function ϕ, namely
yj = fj (x) = ϕ(⟨wj , x⟩ + bj ).
1 torically, the sigmoid was the mostly used activation function since it is dif-
ϕ(x) = .
1 + exp(−x) ferentiable and allows to keep values in the interval [0, 1]. Nevertheless, it is
problematic since its gradient is very close to 0 when |x| is not close to 0. The
• The hyperbolic tangent function ("tanh") Figure 10.3 represents the Sigmoid function and its derivative. With neural
networks with a high number of layers (which is the case for deep learning),
exp(x) − exp(−x) exp(2x) − 1 this causes troubles for the backpropagation algorithm to estimate the param-
ϕ(x) = = .
exp(x) + exp(−x) exp(2x) + 1 eter (backpropagation is explained in the following). This is why the sigmoid
function was supplanted by the rectified linear function. This function is not
• The hard threshold function differentiable in 0 but in practice this is not really a problem since the proba-
bility to have an entry equal to 0 is generally null. The ReLU function also has
ϕβ (x) = 1x≥β . a sparsification effect. The ReLU function and its derivative are equal to 0 for
93
Figure 10.2: Activation functions Figure 10.3: Sigmoid function (in black) and its derivatives (in red)
negative values, and no information can be obtain in this case for such a unit, 2.2 Multilayer perceptron
this is why it is advised to add a small positive bias to ensure that each unit is
active. Several variations of the ReLU function are considered to make sure A multilayer perceptron (or neural network) is a structure composed by sev-
that all units have a non vanishing gradient and that for x < 0 the derivative is eral hidden layers of neurons where the output of a neuron of a layer becomes
not equal to 0. Namely the input of a neuron of the next layer. Moreover, the output of a neuron can
also be the input of a neuron of the same layer or of neuron of previous layers
ϕ(x) = max(x, 0) + αmin(x, 0) (this is the case for recurrent neural networks). On last layer, called output
layer, we may apply a different activation function as for the hidden layers de-
pending on the type of problems we have at hand: regression or classification.
where α is either a fixed parameter set to a small positive value, or a parameter The Figure 10.4 represents a neural network with three input variables, one
to estimate. output variable, and two hidden layers. Multilayers perceptrons have a basic
94
This theorem is interesting from a theoretical point of view. From a practical This loss function is well adapted with the sigmoid activation function since
point of view, this is not really useful since the number of neurons in the hidden the use of the logarithm avoids to have too small values for the gradient.
layer may be very large. The strength of deep learning lies in the deep (number Finally, for a multi-class classification problem, we consider a generalization
of hidden layers) of the networks. of the previous loss function to k classes
k
3 Estimation of the parameters
X
L(θ) = −E(X,Y )∼P [ 1Y =j log pθ (Y = j/X)].
j=1
Once the architecture of the network has been chosen, the parameters (the
weights wj and biases bj ) have to be estimated from a learning sample. As Ideally we would like to minimize the classification error, but it is not smooth,
usual, the estimation is obtained by minimizing a loss function with a gradient this is why we consider the cross-entropy (or eventually a convex surrogate).
descent algorithm. We first have to choose the loss function.
3.1 Penalized empirical risk
Loss functions
The expected loss can be written as L(θ) = E(X,Y )∼P [ℓ(Y, f (X, θ))] and it
It is classical to estimate the parameters by maximizing the likelihood (or is associated to a loss function ℓ.
equivalently the logarithm of the likelihood). This corresponds to the mini- In order to estimate the parameters θ, we use a training sample (Xi , Yi )1≤i≤n
mization of the loss function which is the opposite of the log likelihood. De- and we minimize the empirical loss
noting θ the vector of parameters to estimate, we consider the expected loss n
1X
function L̃n (θ) = ℓ(Yi , f (Xi , θ))
L(θ) = −E(X,Y )∼P (log(pθ (Y /X)). n i=1
If the model is Gaussian, namely if pθ (Y /X = x) ∼ N (f (x, θ), I), maximiz- eventually we add a regularization term. This leads to minimize the penalized
ing the likelihood is equivalent to minimize the quadratic loss, hence empirical risk
L(θ) = E(X,Y )∼P (∥Y − f (X, θ)∥2 ). 1X
n
Ln (θ) = ℓ(Yi , f (Xi , θ)) + λΩ(θ).
For binary classification, with Y ∈ {0, 1}, maximizing the log-likelihood cor- n i=1
responds to the minimization of the cross-entropy.
Setting f (X, θ) = Pθ (Y = 1/X), the cross-entropy loss is We can consider L2 regularization. Using the same notations as in Section 2.2,
X X X (k)
ℓ(f (x, θ), y) = −[y log(f (x, θ)) + (1 − y) log(1 − f (x, θ))] Ω(θ) = (Wi,j )2
k i j
and the corresponding expected loss is X
= ∥W (k) ∥2F
L(θ) = −E(X,Y )∼P [Y log(f (X, θ)) + (1 − Y ) log(1 − f (X, θ))]. k
96
where ∥W ∥F denotes the Frobenius norm of the matrix W . Note that only the 4 Backpropagation algorithm for classifica-
weights are penalized, the biases are not penalized. It is easy to compute the
gradient of Ω(θ):
tion
▽W (k) Ω(θ) = 2W (k) .
here a K class
We consider classification problem. The output of the MLP
One can also consider L1 regularization, leading to parcimonious solutions: P(Y = 1/x)
.
Ω(θ) =
XXX (k)
|Wi,j |. is f (x) = . We assume that the output activation function is
.
k i j P(Y = K/x)
the softmax function.
In order to minimize the criterion Ln (θ), a stochastic gradient descent al-
gorithm is used. In order to compute the gradient, a clever method, called 1
Backpropagation algorithm is considered. It has been introduced by Rumel- softmax(x1 , . . . , xK ) = PK (ex1 , . . . , exK ).
xk
hart et al. (1988), it is still crucial for deep learning. k=1 e
The stochastic gradient descent algorithm performs at follows:
Let us make some useful computations to compute the gradient.
• Initialization of θ0 = (W (1) , b(1) , . . . , W (L+1) , b(L+1) ).
∂softmax(x)i
• At each iteration, we compute : = softmax(x)i (1 − softmax(x)i ) if i = j
∂xj
1 X = −softmax(x)i softmax(x)j if i ̸= j
θj = θj−1 − ε [▽θ ℓ(f (Xi , θj−1 ), Yi ) + λ ▽θ Ω(θj−1 )].
m
i∈B
We introduce the notation
Note that, in the previous algorithm, we do not compute the gradient for the
K
loss function at each step of the algorithm but only on a subset B of cardinal- X
ity m (called a batch). This is what is classically done for big data sets (and (f (x))y = 1y=k (f (x))k ,
k=1
for deep learning) or for sequential data. B is taken at random without re-
placement. An iteration over all the training examples is called an epoch. The
numbers of epochs to consider is a parameter of the deep learning algorithms. where (f (x))k is the kth component of f (x): (f (x))k = P(Y = k/x). Then
The total number of iterations equals the number of epochs times the sample we have
size n divided by m, the size of a batch. This procedure is called batch learn- K
ing, sometimes, one also takes batches of size 1, reduced to a single training
X
− log(f (x))y = − 1y=k log(f (x))k = ℓ(f (x), y),
example B = {(Xi , Yi )}. k=1
97
∂ℓ(f (x), y) ∂ℓ(f (x), y) ′ (k) – Compute the output gradient ▽a(L+1) (x) ℓ(f (x), y) = f (x) − e(y).
(k)
= ϕ (a (x)j ).
∂a (x)j ∂h(k) (x)j – For k = L + 1 to 1
Hence, * Compute the gradient at the hidden layer k
▽a(k) (x) ℓ(f (x), y) = ▽h(k) (x) ℓ(f (x), y)⊙(ϕ′ (a(k) (x)1 ), . . . , ϕ′ (a(k) (x)j ), . . .)′ ▽W (k) ℓ(f (x), y) = ▽a(k) (x) ℓ(f (x), y)h(k−1) (x)′
where ⊙ denotes the element-wise product. This leads to ▽b(k) ℓ(f (x), y) = ▽a(k) (x) ℓ(f (x), y)
∂ℓ(f (x), y) ∂ℓ(f (x), y) ∂a(k) (x)i * Compute the gradient at the previous layer
(k)
=
∂Wi,j ∂a(k) (x)i ∂W (k)
i,j ▽h(k−1) (x) ℓ(f (x), y) = (W (k) )′ ▽a(k) (x) ℓ(f (x), y)
∂ℓ(f (x), y) (k−1)
= h (x) and
∂a(k) (x)i j
Finally, the gradient of the loss function with respect to hidden weights is ▽a(k−1) (x) ℓ(f (x), y) = ▽h(k−1) (x) ℓ(f (x), y)
▽W (k) ℓ(f (x), y) = ▽a(k) (x) ℓ(f (x), y)h (k−1) ′
(x) . (10.3) ⊙(. . . , ϕ′ (a(k−1) (x)j ), . . . )′
The last step is to compute the gradient with respect to the hidden biases. We
4.1 Optimization algorithms
simply have
∂ℓ(f (x), y) ∂ℓ(f (x), y) Many algorithms can be used to minimize the loss function, all of them have
(k)
= (k) (x)
∂bi ∂a i hyperparameters, that have to be calibrated, and have an important impact on
99
4.1 Optimization algorithms

Many algorithms can be used to minimize the loss function. All of them have hyperparameters that have to be calibrated and that have an important impact on the convergence of the algorithms. The elementary tool of all these algorithms is the Stochastic Gradient Descent (SGD) algorithm. It is the simplest one:

θ_i^new = θ_i^old − ε ∂L/∂θ_i(θ^old),

where ε is the learning rate, whose calibration is very important for the convergence of the algorithm. If it is too small, the convergence is very slow and the optimization can be blocked in a local minimum. If the learning rate is too large, the network will oscillate around an optimum without stabilizing and converging. A classical way to proceed is to adapt the learning rate during the training: it is recommended to begin with a "large" value of ε (for example 0.1) and to reduce it during the successive iterations. However, there is no general rule on how to adjust the learning rate; it is rather the engineer's experience in observing the evolution of the loss function that indicates how to proceed.

The stochasticity of the SGD algorithm lies in the computation of the gradient. Indeed, we consider batch learning: at each step, m training examples are randomly chosen without replacement and the mean of the m corresponding gradients is used to update the parameters. An epoch corresponds to a pass through all the learning data; for example, if the batch size m is 1/100 of the sample size n, an epoch corresponds to 100 batches. We iterate the process over a number nb of epochs that is fixed in advance. If the algorithm has not converged after nb epochs, we continue for nb′ more epochs. Another stopping rule, called early stopping, is also used: it consists in considering a validation sample and stopping the learning when the loss function on this validation sample stops decreasing. Batch learning is used for computational reasons: as we have seen, the backpropagation algorithm needs to store all the intermediate values computed at the forward step in order to compute the gradient during the backward pass, and for big data sets, such as millions of images, this is not feasible, all the more so since deep networks have millions of parameters to calibrate. The batch size m is also a parameter to calibrate. Small batches generally lead to better generalization properties. The particular case of batches of size 1 is called On-line Gradient Descent; the disadvantage of this procedure is the very long computation time. Let us summarize the classical SGD algorithm.

Algorithm 8 Stochastic Gradient Descent algorithm
Fix the parameters ε: learning rate, m: batch size, nb: number of epochs.
Choose the initial parameter θ.
for k = 1 to nb do
  for l = 1 to n/m do
    Take a random batch of size m without replacement in the learning sample: (X_i, Y_i)_{i∈B_l}.
    Compute the gradients with the backpropagation algorithm:
      g = (1/m) ∑_{i∈B_l} ∇_θ ℓ(f(X_i, θ), Y_i).
    Update the parameters: θ ← θ − εg.
  end for
end for
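A minimal Python sketch of Algorithm 8 for a generic parameter vector; the function grad_loss is an assumed placeholder standing for the backpropagation step:

import numpy as np

def sgd(theta, X, Y, grad_loss, eps=0.1, m=32, nb=10, seed=0):
    # grad_loss(theta, Xb, Yb) must return the average gradient on a batch
    rng = np.random.default_rng(seed)
    n = len(X)
    for epoch in range(nb):                    # nb epochs
        perm = rng.permutation(n)              # batches drawn without replacement
        for l in range(n // m):
            batch = perm[l * m:(l + 1) * m]
            g = grad_loss(theta, X[batch], Y[batch])
            theta = theta - eps * g            # update the parameters
    return theta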
Since the choice of the learning rate is delicate and very influential on the convergence of the SGD algorithm, variations of the algorithm have been proposed that are less sensitive to the learning rate. The principle is to add a correction when we update the gradient, called momentum. The method is due to Polyak (1964) [31]. The idea is to accumulate an exponentially decaying moving average of past negative gradients and to continue to move in their direction. The momentum algorithm introduces a variable ν that plays the role of a velocity. A hyperparameter α ∈ [0, 1[ determines how fast the contribution of previous gradients exponentially decays. The method is summarized in Algorithm 9.

Algorithm 9 Stochastic Gradient Descent algorithm with momentum
Fix the parameters ε: learning rate, m: batch size, nb: number of epochs, momentum parameter α ∈ [0, 1[.
Choose the initial parameter θ and the initial velocity ν.
for k = 1 to nb do
  for l = 1 to n/m do
    Sample a minibatch B of size m from the learning sample.
    Compute the gradient estimate: g ← (1/m) ∑_{i∈B} ∇_θ ℓ(f(X_i, θ), Y_i).
    Update the velocity: ν ← αν − εg.
    Update the parameter: θ ← θ + ν.
  end for
end for

This method allows one to attenuate the oscillations of the gradient.

In practice, a more recent version of the momentum, due to Nesterov (1983) [30] and Sutskever et al. (2013) [37], is considered; it is called Nesterov accelerated gradient. The variants lie in the updates of the parameter and the velocity:

ν ← αν − ε (1/m) ∑_{i∈B} ∇_θ ℓ(f(X_i, θ + αν), Y_i)
θ ← θ + ν.

The learning rate ε is a difficult parameter to calibrate because it significantly affects the performances of the neural network. This is why new algorithms have been introduced, to be less sensitive to this learning rate: the RMSProp algorithm, due to Hinton (2012) [22], and the Adam (for Adaptive Moments) algorithm, see Kingma and Ba (2014) [24].

The idea of the RMSProp algorithm is to use a different learning rate for each parameter (each component of θ) and to automatically adapt this learning rate during the training. It is described in Algorithm 10.

Algorithm 10 RMSProp algorithm
Fix the parameters ε: learning rate, m: batch size, nb: number of epochs, decay rate ρ ∈ [0, 1[.
Choose the initial parameter θ.
Choose a small constant δ, usually 10⁻⁶ (to avoid division by 0).
Initialize the accumulation variable r = 0.
for k = 1 to nb do
  for l = 1 to n/m do
    Sample a minibatch B of size m from the learning sample.
    Compute the gradient estimate: g ← (1/m) ∑_{i∈B} ∇_θ ℓ(f(X_i, θ), Y_i).
    Accumulate the squared gradient: r ← ρr + (1 − ρ) g ⊙ g.
    Update the parameter: θ ← θ − (ε/√(δ + r)) ⊙ g (the division and the square root are computed element-wise).
  end for
end for

The Adam algorithm (Kingma and Ba, 2014) is also an adaptive learning rate optimization algorithm; "Adam" means "Adaptive moments". It can be viewed as a variant of the RMSProp algorithm with momentum. It also includes a bias correction of the first order moment (momentum term) and of the second order moment. It is described in Algorithm 11.

We have presented the most popular optimization algorithms for deep learning. There is actually no theoretical foundation for the performances of these algorithms, even for convex functions (which is not the case in deep learning problems!). Numerical studies have been performed to compare a large number of optimization algorithms for various learning problems (Schaul et al. (2014)); no algorithm outperforms all the others.
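The update rules of Algorithms 9 and 10 only change the parameter update of the previous sketch; a minimal illustration of the two variants (g denotes the current batch gradient):

import numpy as np

# Momentum (Algorithm 9): accumulate a velocity nu
def momentum_step(theta, nu, g, eps=0.01, alpha=0.9):
    nu = alpha * nu - eps * g                  # update the velocity
    return theta + nu, nu                      # update the parameter

# RMSProp (Algorithm 10): per-coordinate adaptive learning rate
def rmsprop_step(theta, r, g, eps=0.001, rho=0.9, delta=1e-6):
    r = rho * r + (1 - rho) * g * g            # accumulate the squared gradient
    return theta - eps / np.sqrt(delta + r) * g, r   # element-wise update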
6 Conclusion
We have presented in this course the feedforward neural networks, and
explained how the parameters of these models can be estimated. The choice
of the architecture of the network is also a crucial point. Several models
can be compared by using a cross validation method to select the "best" model.
The perceptrons are defined for vectors; they are therefore not well adapted to some types of data, such as images. By transforming an image into a vector, we lose spatial information, such as shapes.
MAR (Missing At Random). The case of MAR data occurs when the probability that a value is missing depends only on the observed data, and not on the missing values themselves.
3.2 Methods tolerant to missing data

While most methods automatically remove missing data, some tolerate them. This is the case, for example, of trees (CART), which consider surrogate splits: at each node splitting, several optimal pairs variable/threshold are considered and memorized. To compute a prediction, if the data is missing for an observation, it is not the best division that is used but the one just after.

4 Imputation Methods

This section provides a non-exhaustive overview of the most common completion methods. A dataset consists of p quantitative or qualitative variables (Y_1, ..., Y_p) observed for a sample of n individuals; M refers to the matrix indicating missing values by m_ij = 1_{y_ij missing}.

4.1 Stationary completion

There are several possible stationary completions: the most frequently represented value (Concept Most Common Attribute Value Fitting, CMCF [41]) or simply the last known value (Last Observation Carried Forward, LOCF). This method may seem too naive but is often used to lay the foundation for a comparison between completion methods.

4.2 Completion by a linear combination of observations

Another common technique is to replace all missing values with a linear combination of observations. Let us mention the imputation by the mean: the missing value Y_ij is replaced by the mean Ȳ_j over all observed values of the variable Y_j. This case is generalized to any weighted linear combination of the observations. The median of Y_j can also be considered. Instead of using all available values, it is possible to restrict oneself to methods that select the most influential values by local aggregation or regression, or even by combining different aspects.
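As a minimal illustration with the scikit-learn library mentioned in the introduction of this course (the toy matrix is an assumption), the imputation by the mean or the median of each variable can be written as:

import numpy as np
from sklearn.impute import SimpleImputer

# toy data set: np.nan marks the missing values
Y = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])
# each missing value is replaced by the mean of the observed values of Y_j
imp = SimpleImputer(strategy="mean")     # strategy="median" is also possible
Y_completed = imp.fit_transform(Y)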
4.3 Nearest Neighbor Method (KNN)

The completion by k-nearest neighbors (KNN) consists in running the following algorithm, which models and predicts the missing data. Assume that the values Y_{i⋆,J} are missing, where J is the subset of variables not observed for the individual i⋆.

Algorithm 12 Algorithm of k-nearest neighbors (k-nn)
Choose an integer 1 ≤ k ≤ n.
Compute the distances d(Y_{i⋆}, Y_i), i = 1, ..., n (using only the observed variables of Y_{i⋆} to compute the distances).
Retrieve the k observations Y_{(i_1)}, ..., Y_{(i_k)} for which these distances are the smallest.
Assign to the missing values the average of the values of the k nearest neighbors:
∀ j⋆ ∈ J, Y_{i⋆ j⋆} = (1/k) (Y_{(i_1), j⋆} + ... + Y_{(i_k), j⋆}).

The nearest neighbors method requires the choice of the parameter k by optimization of a criterion. Moreover, the notion of distance between individuals must be chosen carefully. One generally considers the Euclidean or Mahalanobis distance.
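Algorithm 12 is implemented, for instance, by the KNNImputer of scikit-learn, which computes a nan-aware Euclidean distance on the variables observed in common; a minimal sketch (the toy data are assumptions):

import numpy as np
from sklearn.impute import KNNImputer

Y = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])
# each missing value is replaced by the average over the k nearest neighbours
imp = KNNImputer(n_neighbors=2)   # the parameter k has to be optimized
Y_completed = imp.fit_transform(Y)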
4.4 Local regression

LOcal regrESSion (LOESS) [42] also allows one to impute missing data. For this, a polynomial of small degree is fitted around the missing data by weighted least squares, giving more weight to values close to the missing data.
Let Y_{i⋆} be an observation with q (among p) missing values. These missing values are imputed by local regression following the algorithm below.

Algorithm 13 Algorithm LOESS
Get the k nearest neighbors Y_{(i_1)}, ..., Y_{(i_k)}.
Create the matrices A ∈ R^{k×(p−q)}, B ∈ R^{k×q} and the vector w ∈ R^{(p−q)×1} such that:
- the lines of A correspond to the k nearest neighbors deprived of the values at the indices of the missing variables of Y_{i⋆};
- the columns of B correspond to the values of the neighbors for the indices of the missing variables of Y_{i⋆};
- the vector w = (Y_{i⋆})_obs corresponds to the (p − q) observed values of Y_{i⋆}.
Solve the least squares problem
x⋆ = argmin_x ∥ A⊤x − w ∥²,
where ∥ · ∥ is the Euclidean norm.
The vector of missing data is then predicted by
(Y_{i⋆})_mis = B⊤x⋆.
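A minimal NumPy sketch of the least squares step of Algorithm 13, under the reading given above (the neighbours are assumed already retrieved and complete; the toy matrices are assumptions):

import numpy as np

# A: the k neighbours restricted to the observed variables of Y_istar, k x (p-q)
# B: the same k neighbours on the missing variables of Y_istar, k x q
# w: the (p-q) observed values of Y_istar
A = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
B = np.array([[10.0], [12.0], [20.0]])
w = np.array([2.0, 2.0])

# weights x* over the k neighbours reproducing the observed part of Y_istar
x_star, *_ = np.linalg.lstsq(A.T, w, rcond=None)   # solves min || A'x - w ||
y_mis = B.T @ x_star                               # prediction (Y_istar)_mis = B'x*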
4.5 By singular value decomposition (SVD)

Case where there are sufficiently many observed data

If there are many more observed data than missing ones, the dataset Y is separated into two groups: on one side Y^c with the complete observations, and on the other side Y^m including the individuals for which some data are missing. We then consider the truncated singular value decomposition (SVD) of the complete set Y^c (see [15]):

Ŷ^c_J = U_J D_J V_J⊤,

where D_J is the diagonal matrix of the J first singular values of Y^c. Note that V_J ∈ M_{p,J}(R). The missing values are then imputed by regression. More precisely, let (Y_{i⋆})_obs ∈ R^{p−q} be the set of observed values of Y_{i⋆}, and let V_J⋆ be the truncated version of V_J, i.e. the one for which the lines corresponding to the q missing variables of Y_{i⋆} have been deleted; hence V_J⋆ ∈ M_{p−q,J}(R). Then the prediction of the q missing data for the individual i⋆, (Y_{i⋆})_mis, is given by

(Y_{i⋆})_mis = V_J^(⋆) β̂,

where V_J^(⋆) ∈ M_{q,J}(R) is the complement of V_J⋆ in V_J and β̂ is the least squares estimator of the regression of (Y_{i⋆})_obs on V_J⋆. As for KNN, this method requires the choice of the parameter J.

Case with too many missing data

If there are too many missing data, this will induce a significant bias in the computation of the SVD decomposition. In addition, there may be at least one missing value for every observation. In this case, the following problem must be solved:

min_{U_J, V_J, D_J} ∥ Y − m − U_J D_J V_J⊤ ∥⋆   (11.1)

where ∥ · ∥⋆ sums the squares of the elements of the matrix, ignoring the missing values, and m is the vector of the means of the observations. The resolution of this problem follows the algorithm below:
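A minimal NumPy sketch of an iterative resolution of problem (11.1), alternating a rank-J truncated SVD of the completed, centred matrix with the re-imputation of the missing cells (this alternating scheme and its stopping rule are simplifying assumptions):

import numpy as np

def svd_impute(Y, J=2, n_iter=50):
    # Y contains NaN where data are missing
    mask = np.isnan(Y)
    m = np.nanmean(Y, axis=0)                 # means of the observations
    Z = np.where(mask, m, Y)                  # initial completion by the means
    for _ in range(n_iter):
        U, d, Vt = np.linalg.svd(Z - m, full_matrices=False)
        low_rank = U[:, :J] @ np.diag(d[:J]) @ Vt[:J]   # rank-J approximation
        Z[mask] = (m + low_rank)[mask]        # re-impute only the missing cells
    return Z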
For the set F of qualitative variables, the difference is defined by

∆F = ( ∑_{j∈F} ∑_{i=1}^{n} 1_{Y^new_{imp,ij} ≠ Y^old_{imp,ij}} ) / #NA,

where #NA is the number of missing values in the categorical variables.
4.7 Bayesian Inference

Let θ be the realization of a random variable and let p(θ) be its prior distribution. The posterior distribution is then given by:

p(θ | Y_obs) ∝ p(θ) f(Y_obs | θ).

Tanner and Wong's (1987) data augmentation method iteratively simulates random samples of missing values and model parameters, taking into account the observed data at each iteration; it consists of an imputation step (I) and a "posterior" step (P). Let θ^(0) be an initial draw obtained from an approximation of the posterior distribution of θ. For a value θ^(t) of θ at an instant t:

• Imputation (I): simulate Y_mis^(t) with the density p(Y_mis | Y_obs, θ^(t)).

• Posterior (P): simulate θ^(t+1) with the density p(θ | Y_obs, Y_mis^(t)).

This iterative procedure converges to a draw of the joint distribution of (Y_mis, θ | Y_obs) when t → +∞.

... imputation [11]. It also allows one to define a measure of the uncertainty induced by the completion. Maintaining the original variability of the data is done by creating imputed values that are based on variables correlated with the missing data and with the causes of absence. Uncertainty is taken into account by creating different versions of the missing data and observing the variability between the imputed data sets.

Amelia II

Amelia II is a multiple imputation program for continuous variables developed by James Honaker et al. (2011) [23]. The model is based on an assumption of normality, Y ∼ N_k(µ, Σ), and thus sometimes requires prior transformations of the data. Let M be the matrix indicating the missing data and θ = (µ, Σ) the parameters of the model. Another hypothesis is that the data are MAR, so

p(M | Y) = p(M | Y_obs).

The likelihood p(Y_obs | θ) is then written as follows:

p(Y_obs, M | θ) = p(M | Y_obs) p(Y_obs | θ),

so

p(θ | Y_obs) ∝ p(Y_obs | θ).

Using the iterative property of the expectation,

p(Y_obs | θ) = ∫ p(Y | θ) dY_mis.

Amelia II's EMB algorithm combines the classical EM algorithm (for maximum likelihood) with a bootstrap approach. For each run, the data are resampled by bootstrap to simulate the uncertainty, then the EM algorithm is run to find the a posteriori estimator θ̂_MAP for the bootstrap data. Imputations are then created by drawing Y_mis according to its distribution conditional on Y_obs and the simulations of θ.
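To fix ideas on the two steps (I) and (P), here is a toy NumPy sketch of data augmentation for a univariate normal sample with known variance and a flat prior on µ; this deliberately simplified model is an assumption, far from the full Amelia II machinery:

import numpy as np

rng = np.random.default_rng(0)
y = np.array([1.2, 0.7, np.nan, 1.9, np.nan])   # sample with missing values
mis = np.isnan(y)
sigma, mu = 1.0, np.nanmean(y)                  # known sigma; initial draw mu^(0)
for t in range(1000):
    # (I) imputation step: draw Y_mis from p(Y_mis | Y_obs, theta^(t))
    y[mis] = rng.normal(mu, sigma, mis.sum())
    # (P) posterior step: draw theta^(t+1) from p(theta | Y_obs, Y_mis^(t));
    # with a flat prior and known sigma, this posterior is N(mean(y), sigma^2/n)
    mu = rng.normal(y.mean(), sigma / np.sqrt(len(y)))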
5 Examples

5.1 Gas consumption frauds

The different completion methods were tested and compared on an example of gas consumption fraud detection. Let Y ∈ R^{n×12} be such that y_ij is the gas consumption of individual i in month j. The distribution of the missing data is non-monotonic and we assume MAR data. After a log transformation in order to approach normality, completion was performed. The results were compared with a test sample of 10% of the data, previously removed from the set.

This actual data set had at least one missing value per individual, and a total of 50.4% of the data was missing. If we consider only the individual monthly consumptions, we obtain the error distribution of each method shown in Figure 11.4.

[Figure 11.4: Fraud - Completion errors on a test sample]
[Figure 11.6: EBP - Completion errors on a test sample by Amelia II when the amount of missing values increases]

[Figure 11.7: EBP - Completion errors on a test sample by missForest when the amount of missing values increases]

5.3 Coronary Heart Disease (CHD)

Most imputation methods are defined only for quantitative variables. However, some of the methods presented above can be used to impute qualitative or even heterogeneous data. This is the case of LOCF, KNN and missForest, which were therefore tested on a reference data set for completion problems [29]. The data were acquired by Detrano et al. (1989) [32] and made available by Bache and Lichman (2013) [4]. They take the form of a matrix of medical observations Y ∈ R^{n×14} of 14 heterogeneous variables for n patients. The dataset thus contains quantitative variables (age, pressure, cholesterol, maximum heart rate, oldpeak) and qualitative variables (sex, pain, sugar, cardio, angina, peak slope, number of heart vessels, thalassemia, absence/presence of heart disease).

Always restricting to the MCAR case, one artificially creates more and more missing data to be imputed. The adequacy of the imputation is measured by the mean absolute error in the case of quantitative data and by the Hamming distance in the case of qualitative data. The results are shown in Figure 11.8.

[Figure 11.8: CHD - Completion errors on a test sample by LOCF (black), KNN (red) and missForest (green) when the amount of missing values increases, for a qualitative (above) and a quantitative (below) variable]
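A minimal sketch of the evaluation protocol used in these examples: hide a growing fraction of the values at random (MCAR), impute, and measure the mean absolute error for a quantitative variable, or the Hamming distance for a qualitative one (the data set and the imputer are placeholder assumptions):

import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
Y = rng.normal(size=(100, 5))                  # placeholder for a real data set
for frac in (0.1, 0.2, 0.3):
    mask = rng.random(Y.shape) < frac          # MCAR: cells hidden at random
    Y_hat = KNNImputer(n_neighbors=5).fit_transform(np.where(mask, np.nan, Y))
    mae = np.abs(Y_hat[mask] - Y[mask]).mean() # error on the hidden cells only
    print(f"{frac:.0%} missing -> MAE = {mae:.3f}")
# for a qualitative variable, the Hamming distance would be used instead:
# (Y_hat[mask] != Y[mask]).mean()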
Chapter 12
drawn up by a group of experts (EC 2019). The next step is the publication of proposed regulations, some of which are in the process of being adopted:

• Digital Market Act (2020): seeking fairness in commercial relationships and addressing risks of obstacles to competition against European companies;

• Digital Services Act (2020): intermediary service, hosting and online platform sites and other social networks; how to control illegal content, and the risks of automatic moderation tools;

• Data Governance Act (2020): contractualization of the uses and re-uses of both public and private databases (data trusts);

• Artificial Intelligence Act (EC 2021): proposal for a regulation laying down harmonized rules on artificial intelligence.

Adding to the GDPR for the protection of personal data, the forthcoming European adoption of this last text (AI Act) will profoundly affect the conditions under which Artificial Intelligence systems (AI systems) are developed and operated. This approach moves from an AI that is meant to be ethical (ethical AI) to an obligation of compliance (lawful AI), which confers the "CE" marking opening access to the European market. The EC thereby intends to assert its normative leadership internationally, so that the EU's power over regulation and the market gives it a competitive advantage in the field of AI.

Consequently, the present document offers a reflection on the methodological handling of this draft regulation, concerning more specifically the usual skills in Statistics and Mathematics of the teams developing an AI system, in particular those deemed high-risk according to the European criteria. It targets more particularly some of the seven requirements cited in the experts' guide (EC 2019), taken up in the white paper (EC 2020) and identified as potential risks (Besse et al. 2019): 1. confidentiality and data analysis; 2. accuracy, robustness, resilience; 3. explainability; 4. non-discrimination.

Section 2 below extracts from the AI Act the key elements affecting methodological developments; Section 3 then comments on their consequences while proposing well-known, Master's-level statistical tools that belong in the toolbox of a future data scientist. These seem suitable, even sufficient, to satisfy the future regulatory obligations of controlling the risks associated with AI systems. Finally, Section 4 works through a numerical use case analogous to the prediction of a credit score on a concrete dataset. This example, taken from a tutorial whose code is freely available, illustrates the search for a least-bad compromise between confidentiality, performance, explainability and sources of discrimination. It highlights the difficulties raised by the writing of the documentation that will have to accompany any high-risk AI system. In conclusion, we propose a summary of the main advances of this draft AI Act and point out the main limitations of its April 2021 version.

2 Technical impacts of the AI Act
2.1 Structure of the draft regulation

Castets-Renard and Besse (2022) detail an analysis of the ex-ante¹ liability regime proposed in the 89 recitals² and 85 articles, structured in 12 titles, of the AI Act: between self-regulation, certification and standardization, to define compliance rules, in particular for the defence of fundamental rights. The objective of the present article is more specific: it focuses on the elements of the draft regulation that directly concern the statistician or data scientist involved in the design of an AI system deemed high-risk because it affects natural persons.

Generally speaking, the recitals, which introduce the draft, list the principles retained by the EC and which guided the drafting of the articles. The EC insists on the need to build international standards while prioritizing the respect of fundamental rights, including non-discrimination. Aware of the place occupied by statistical learning algorithms, it stresses the need for statistical representativeness of the training data and the importance of exhaustive documentation about these data and about the performances of an AI system. Also aware of the opacity of these algorithms, it requires that the capacity to interpret their outputs, or the decisions resulting from them, be kept up to date with ongoing scientific research, and that monitoring be ensured through logging or archiving of the decisions and related data.

2.2 The articles most concerned

The adopted definition of AI (art. 3) is pragmatic and very flexible, relying on an exhaustive list of the algorithms concerned (Annex I). Supervised, unsupervised and reinforcement machine learning algorithms currently constitute the bulk of everyday AI applications. Knowledge representation, inductive programming and, more generally, the expert systems highly developed in the 1970s remain present in certain fields. The third type of algorithm cited targets statistical approaches, Bayesian inference and optimization methods. Statistical approaches, Bayesian or not, which very generally lead to predictions for decision support, can be included in the broad family of data-based learning. On the other hand, optimization methods, such as those for the optimal allocation of resources on intermediation platforms (e.g. Uber, ParcourSup, ...), require a specific treatment. This list can easily be adapted to technological developments. These definitions acknowledge the preponderant place of statistical learning, and hence of the data exploited for its construction. They leave aside procedural algorithms based on the logical rules of a piece of legislation, such as those governing the computation of benefit amounts.

Articles 5 and 6 also adopt the principle of pragmatic definitions by explicitly listing the prohibited AI applications (art. 5) and the high-risk ones, easily adaptable to technological developments. Article 6 distinguishes between the systems already subject to a European regulation (Annex II: transport and healthcare systems), which require an ex-ante certification by a third party, a notified body, and the others (Annex III), which also affect natural persons but whose compliance process is only declarative. Caution: careful consultation of these annexes, and of their evolution, is important in order to properly distinguish high-risk systems from the others. Bank credit scores are concerned (cf. the numerical example in Section 4), as are individual "predictive policing" assessments or recidivism scores (justice), but not explicitly the assessments of crime risk by geographic block such as Predpol or Paved in France.

¹ As opposed to ex-post, ex-ante means here that the compliance analysis or audit of an AI algorithm, in order to validate its certification ("CE" marking), is considered or actually carried out before its release or marketing, and therefore before it is put into operation.
² The recitals are a list of principles that motivate a decree, a law or a regulation and precede the text contained in the list of articles.
For applications in the field of justice, only the AI systems intended for the judicial authorities (magistrates) are concerned, such as the abandoned DataJust project, but not those intended for law firms (e.g. case law analytics).

Article 10 is fundamental: it insists on the importance of an exhaustive preliminary statistical exploration of the data before launching the largely automatic learning and optimization procedures. It avoids a form of hypocrisy by authorizing, subject to advanced confidentiality precautions, the constitution of sensitive personal databases allowing, for example, ethnic statistics. This authorizes the direct measurement of statistical biases, potential sources of discrimination.

Article 11 imposes the writing of documentation, which is essential to open the possibility of an ex-ante audit of a high-risk AI system falling under Annex II, or of an ex-post control for those falling under Annex III. With a reversal of the burden of proof, it is up to the designer to show that he has implemented what was technically possible in terms of security, quality, explainability and non-discrimination in order to reach the expected compliance objectives.

Article 12 imposes an archiving or logging of the operation of a high-risk AI system. This obligation is new compared with previous European texts. It is indispensable to ensure the monitoring of the performance and risk measures, and thus to be able to detect flaws requiring updates, or even a retraining of the system or its shutdown. The archiving conditions are specified in Article 61 (post-market monitoring).

According to Article 13, a user should be able to interpret the outputs, and must be clearly informed of the performances, possibly per group concerned, as well as of the risks, in particular of bias and therefore of discrimination. This is a sensitive point, directly dependent on the complexity of AI systems based on sophisticated, hence opaque, statistical learning algorithms. The choice of the bias metrics is left to the initiative of the designer. Moreover, the lack of perspective on the ongoing research on the explainability of an algorithmic decision leaves much latitude in the interpretation of this article, which will have to be adapted to the evolution of the very active research on this topic. Article 14 completes these provisions by imposing a human oversight aimed at preventing or minimizing the risks to health, safety or fundamental rights.

Article 15 fills an important gap with the obligation to declare the performances (accuracy, robustness, resilience) of a high-risk AI system. It also concerns reinforcement learning algorithms, which are subject to specific risks: potential drifts (biases) and malicious attacks (cybersecurity), as was the case for Microsoft's Tay chatbot.

The articles of the following chapters of Title III notify obligations without providing technical or methodological details: obligations of the provider (art. 16); obligation to set up a quality management system (art. 17), covering in particular the whole data management procedure, from the initial collection to the updates in operation, as well as the post-marketing maintenance; obligation of technical documentation (art. 18) and of conformity assessment (art. 19); obligations of the users (art. 29)...

The Member States are, moreover, invited to designate a notifying authority responsible for monitoring the procedures relating to high-risk systems, and an independent notified body (art. 30 to 39), quite classical in the certification mechanisms already in place. A "CE" marking will be issued to compliant systems (art. 49).

This "CE" marking process is essential for the high-risk AI systems of Annex II; it relies on an ex-ante audit requiring, in the case of an external evaluation, highly developed skills on the part of the body bearing the responsibility for it, so as to be able to detect shortcomings, whether intentional or not.
Without an external evaluation, for the AI systems of Annex II, it is up to the user to take responsibility for the respect, among other things, of fundamental rights, so as to be able to face a control if the Member State designates a competent authority on this subject and provides it with the means.

2.3 Consequences

The analysis of these few articles leads to comments or questions, in particular through the prism of a mathematical or statistical approach to the design of an AI system.

Draft The draft regulation (AI Act) is entering a long maturation process (3 or 4 years, like the GDPR?) before a European adoption and an application by the Member States. The forthcoming amendments will have to be successively taken into consideration in order to analyse their consequences, in the hope that answers, clarifications and corrections will be provided on the points below. Nevertheless, given the time and cost of designing an AI system, it is important to anticipate the adoption of this regulatory framework as of now.

Essential requirements Following the experts' guide, the white paper calls for satisfying seven essential requirements, including those of non-discrimination and fairness, and of societal and environmental well-being.

Environment The consideration of the environmental impact remains anecdotal, merely mentioned in recitals (28) and (81), then in Article 69 (codes of conduct) 2., without any formal obligation to compute a benefit / risk balance (environmental or otherwise) of an AI system. Thus, the obligation to archive the operating data of an AI system generates an environmental cost that would deserve to be taken into account among the risks associated with its deployment, with regard to its usefulness.

Fairness The expressed demand that an AI system respect fundamental rights, with reference to the EU Charter, in particular that of non-discrimination, is very present in the white paper (cited 16 times), as in recitals (15, 17, 28, 39) of the proposed regulation. On the other hand, this principle no longer appears explicitly in the articles. Is it its presence in higher-level texts, such as the EU Charter of Fundamental Rights, that did not justify a repetition here, or rather a lack of harmonization between the Member States on this matter? There is therefore no clarification on the ways to "measure" a discrimination, nor on the need to mitigate it. On the other hand, the investigation and documentation of potential biases are clearly spelled out.

Standards Recital (13) calls for the definition of international standards, in particular regarding fundamental rights. In the absence of a legal definition of the fairness of an algorithm, the latter is defined negatively, by the absence of explicitly prohibited discrimination. The concern is that the literature abounds with dozens of definitions of statistical biases that can be sources of discrimination; which ones should be considered in priority? It is unlikely that the competent authorities will take a position on this subject; they focus (LNE 2021) on the performance measures of the AI systems of Annex II, in particular the transport systems and the health devices, with a view to their certification ("CE" marking). Nevertheless, the search for a systemic or societal bias is required in the preliminary analysis of the data (art. 10, 2. (f)), as is the obligation to detail the performances (accuracy) of an AI system per group or subgroup (art. 13, 3., (b) iv). This makes it possible to take into account certain types of bias, hence certain specific discriminations, even in the absence of normative definitions. Statistical bias indicators that have become relatively consensual in the academic community are proposed in the next section.
On the other hand, it is regrettable that no indication, recommendation or constraint then comes to specify what could or should be done to mitigate or eliminate a discriminatory bias. This is left to the free will of the designer of an AI system, in the hope that the choices made will be explicitly detailed, in full transparency, for the provider who assumes responsibility for them and for the user in relation with the end users. The numerical example illustrates such an approach.

User & End user The regulation deals primarily with commercial considerations, hence with the inherent risks of failure, from the acquisition of the data to the putting into operation of an AI system. Any system must satisfy the announced performance requirements, according to a principle of product safety or of liability for defective products. On the other hand, the final end user, and the damages he may face, are not taken into account at all. The information obligation (art. 13) thus benefits the user and not the end user, the natural person affected, who therefore seems protected to date only by the obligations of Article 22 of the GDPR. He is informed of the use of an AI system concerning him, and he can contest the decision with the human user, but the explanation of the decision, and of the risks incurred, is subject to the skills and to the professional deontology of that user: a financial advisor for a client, a magistrate for a litigant, a human resources manager for a candidate, unless a specific legal framework applies (e.g. the public health code).

Data The regulation acknowledges the preponderant role of machine learning algorithms and therefore the absolute necessity (recital 44) of the quality and relevance of the data leading to their training. Article 10 consequently requires skills in Statistics to conduct the studies prior to the training of an algorithm. We are witnessing a reversal of trend, a swing of the pendulum, from the fully automatic to a reasoned approach, under human responsibility, of this data analysis phase, long and costly but classical in the statistician's profession.

Responsibility Generally speaking, the essential objective is no longer absolute performance, as in Kaggle-type competitions leading to inextricable stacks of opaque algorithms, but to satisfy a set of constraints for compliance, including that of transparency, under the responsibility of the provider of the AI system. The analysis of responsibilities in the event of a failure or of a defective product will be the subject of another text.

Documentation All the choices made during the design of an AI system (data sets, algorithms, learning and test procedures, parameter optimizations, trade-offs between confidentiality, performances, interpretability, biases...) must (art. 11 and Annex IV) be explicitly documented with a view to an ex-ante audit of the systems of Annex II or to an ex-post control of a system of Annex III. This is a reversal of the burden of proof under the responsibility of the provider, who must be able to show that the designer implemented what was technically possible to satisfy the legal obligations (compliance) of security, transparency, performances and non-discrimination.

Notifying authority (Chapter 4, Title III) Each country will set up or designate (art. 30) a service responsible, among other things, for supervising the ex-ante audit of a high-risk AI system of Annex II before its deployment, whether it is marketed or not. The notifying authority designates the notified body that will carry out the audit. Quite surprisingly, an elementary elevator system, embarking only a rudimentary logical "AI" but falling under Annex II, is thereby more constrained, by the obligation of certification by a third-party body, than applications of the AI systems of Annex III (justice, employment, credit...), which directly affect natural persons with real risks to fundamental rights. It will therefore be necessary to pay attention to the interpretation that a Member State will make of this situation, in order to assess the possibilities of referral and the competences for controlling a high-risk system of Annex III.
Archiving & confidentiality On a first reading, the regulation therefore targets the commercial obligations of the provider rather than the ethical or deontological ones towards the end user. Nevertheless, the regulation brings the possibility of taking sensitive data into account (art. 10, 5.), the obligations of archiving the decisions (art. 12), of monitoring the performances per group (art. 13), of a human oversight (art. 14) during the whole period of use, and of retroactive correction of the biases (art. 15). This obligation of archiving and of monitoring of the operation, in particular with regard to the sensitive groups, implicitly requires the acquisition, in full security (encryption, anonymization, pseudonymization...), of confidential data (e.g. ethnic origin). Does this not make indispensable, depending on the field of application, the setting up of an explicit protocol of free and informed consent, of an ethical commitment, between the user and the end user, protected by the GDPR? How are the risks incurred by an end user, or a group of end users, through the collection and exploitation of their sensitive data during the operation of an AI system, evaluated against the benefits expected for themselves or for the public interest?

3 Methodological handling of the AI Act

3.1 Which algorithms

While awaiting the effective adoption of the final text, which is likely to be amended, it is nevertheless prudent, given the investments at stake, to anticipate technical answers to certain constraints or obligations imposed on the AI systems designated as high-risk. This article deliberately leaves aside certain classes of algorithms, mentioned or not in Annex I, whose final list remains the subject of debate between the European institutions.

An expert system is the association of a base of logical rules, or knowledge base, built by experts of the domain concerned, of an inference engine, and of a base of facts observed for a current execution. The inference engine searches for the sequence of rules logically applicable from the observed facts of the base, which is incremented with the consequences of the triggering of the rules. The process iterates until a sought decision is obtained, or not, explained by the sequence of rules leading to it. Highly developed in the 1970s, this research slowed down when facing a so-called NP-complete problem, that is, one of algorithmic complexity exponential in the size of the knowledge base (number of rules). Supplanted by the re-emergence of neural networks (1980s) and then, more broadly, by machine learning, research in this field, known as symbolic AI, has remained active. It is experiencing a revival motivated by the explainability capacities of expert systems.

Statistical approaches, Bayesian or not, based on data are implicitly associated with learning methods. On the other hand, the algorithms for the optimal allocation of resources occupy a place apart. While the allocation principles as such do not raise problems, those of scheduling or sorting resources can bring real risks of indirect discrimination. This is notably the case of the ParcourSup algorithm when higher-education institutions introduce weightings according to the candidates' high school of origin: inner-city vs. suburban high schools.
This situation then joins the case of deterministic or procedural algorithms. These are decision algorithms (e.g. the computation of taxes, benefits or social allowances, ...) based on a set of deterministic decision rules, which can just as well present impacts, disadvantages or risks of indirect discrimination, despite an apparent neutrality. In France, the Défenseure des Droits (2020) pays close attention to the analysis and detection of these risks. They fall under the expert analysis of the decision rules coded in the algorithm, which, as they stand, are not covered by the draft regulation. Nevertheless, the complexity of the algorithm may be such that an ex-post expert analysis will not be able to assess the extent of the indirect risks. Thus, a complex deterministic algorithm can be analysed with the same statistical tools as those suited to a machine learning algorithm.

We therefore insist quite particularly on the AI systems based on supervised or statistical learning algorithms, or empirical AI, as opposed to the so-called symbolic AI of expert systems. They are by far the most widespread among those designated as high-risk (art. 6), since they are liable to affect natural persons directly.

Even without an obligation of ex-ante certification by a notified body, an exhaustive documentation (art. 11) of a high-risk AI system must be produced and provided to the user. This section offers a few methodological indications to meet this expectation.

3.2 The data

Any AI system based on a statistical learning algorithm requires the setting up of a training database that is reliable and representative of the targeted field of application, and that must first of all satisfy the confidentiality requirements of the GDPR. Then, the statistical exploration work, generally long and tedious, of acquisition, verification, analysis, preparation, cleaning, enrichment and secure archiving of the data, is essential to the development of an AI system that is efficient, robust, resilient, and whose potential biases are under control. Building new features suited to the objective, tracking down missing data and possibly handling them by imputation, identifying the anomalies or atypical values (outliers) that are sources of failure, and identifying the sources of bias (under-represented classes or groups, systemic biases) require advanced skills and experience in Statistics.

These skills are indispensable to meet the expectations of Article 10 as well as the needs of the documentation (Annex IV) imposed by Article 11.

3.3 Quality, accuracy and robustness

Articles 13 and 15 clearly impose the obligation to document the performances and the risks of error of an AI system, possibly per sensitive and protected group, or its risks of failure. This makes it indispensable to make the choices explicit, in particular that of the metrics used.

Choice of metric

The evaluation of the quality of an algorithmic decision aid is essential to justify the deployment of an AI system with regard to its benefit / risk balance. In the case of an empirical or machine learning AI system, this amounts to estimating the accuracy of the predictions, whose measures are well known and mastered, an integral part of the learning process. Nevertheless, among a wide range of possibilities, the choice, precisely justified, must be suited to the field, to the type of problem addressed and to the specific risks incurred, whatever the model or the type of learning algorithm used. Let us cite for example the situations of:
Regression, or the modelling and prediction of a quantitative target variable Y. It is generally based on the optimization of a quadratic measure (L2 norm), which can incorporate, at the training stage, different types of penalization, including sparsity penalties (ridge, Lasso), in order to control the complexity of the algorithm and to avoid over-fitting phenomena. Other types of objective functions, based on an L1 or absolute-value loss, less sensitive than the quadratic norm to the presence of atypical values (outliers), allow more robust solutions, since they are tolerant to atypical observations.

Classification, or the modelling and prediction of a qualitative variable Y. The choice of an error measure must be made among very many possibilities: error rate, AUC (area under the ROC curve, for a binary variable Y), Fβ score, Bayes risk, entropy... with the difficult handling of unbalanced-class situations, which guides the choice of the type of measure and requires specific precautions in the balancing of the training base or in the weighting of the objective function, taking into account a possibly asymmetric misclassification cost matrix.

Limits of accuracy

Besse (2021) recalls that the performances of AI are largely overestimated by the media hype from which these technologies benefit. These performances are all the more degraded when the decision concerns the prediction of an individual human behaviour (purchase, churn, hiring, violent act, pathology...) potentially depending on a very large number of explanatory variables or factors, some of which may not be observable. It is important to distinguish the AI systems developed in a well-defined field (e.g. an industrial process under control), where the number of factors or dimensions is reasonable and identified, from the AI systems where the curse of dimensionality operates, when this dimension is very large, or even undetermined.

The history of the statistical, then machine learning, literature can be read as a succession of strategies for controlling the number of variables, and thus of parameters, estimated in a statistical model or trained in an algorithm. It is a matter, for example, of controlling the conditioning of a matrix in regression: PLS (partial least squares), variable selection (AIC, BIC criteria), ridge or Lasso penalizations, and thereby the explosion of the variance of the predictions. More generally, it is also the control of the risk of over-fitting that must be documented, as the result of the optimization of the hyperparameters: number of nearest neighbours, penalty in support vector machines, number of leaves of a tree, number of variables drawn at random in a random forest, depth of the trees and number of iterations in boosting... structure of the convolutional layers and dropout of the neural networks in image recognition. Even if the strategies for optimizing these hyperparameters by cross-validation or on a validation sample are well established, the curse of dimensionality can prove prohibitive (e.g. Verzelen 2012).

Test sample

In any event, it is indispensable to set up a very rigorous approach to the evaluation of the accuracy, and hence of the performances, of an AI system based on a learning algorithm. As stated in Article 3, 31., test data independent of the learning data are used for this purpose. Care must nevertheless be taken to evaluate the performances on data as they will really present themselves in operation, with their defects, and not on a simple random subset of the learning base, as is too often the case in academic research. Indeed, this data set may benefit from a homogeneity of acquisition (e.g. same technology, same operator) and from preprocessings that the real input data to come in operation may lack.
This therefore demands extreme rigour in the constitution of a test sample, to avoid these pitfalls, far too present in academic research (e.g. Liu et al. 2019, Roberts et al. 2021) and leading, under publication pressure, to far too many non-reproducible results and non-certifiable algorithms. Finally, a monitoring (art. 14) throughout the whole lifetime of the AI system is indispensable in order to detect possible drifts or malfunctions (art. 12 and 15) affecting the robustness or the resilience of the decisions.

Robustness

The evaluation of the robustness is linked to the control procedures set up to detect atypical values (outliers) or anomalies in the learning base, and to the choice of the loss function of the training procedure of the algorithm. Imperatively, especially in sensitive applications that can entail high risks in case of error, anomaly detection must also be integrated in operation, so as not to attempt to propose decisions corresponding to atypical situations, foreign to the learning base.

Resilience

The resilience of an AI system is essential for critical devices (connected health devices, piloting aids). This concerns, for example, the handling of missing data during learning as well as in operation. It is a matter of evaluating the capacity of an AI system to ensure functions that can prove vital in case, for example, of a breakdown or of the erratic behaviour of a sensor: choice of an algorithm tolerant to missing data, imputation of these data, operation in degraded mode, alert and shutdown of the system.

3.4 Explainability

An active research field

It is far too early to attempt an operational summary of this topic and to provide clear indications on the approach to adopt in order to satisfy the regulatory requirements (art. 13, 15). For this, one has to wait until research has progressed and a "natural" selection has extracted the most relevant procedures from a large quantity of proposed solutions; a review article on this subject (Barredo Arrieta et al. 2020) listed more than 400 references.

Tree of choices

Let us try to describe the first branches of a decision tree by answering a few rudimentary questions, which would in addition have to be adapted to the field of application, since the type of answer to provide is obviously not the same when it comes to explaining the refusal of a loan or the consequences of an automated aid to the diagnosis of a cancer.

It is important to clearly distinguish the levels of explanation: designer, user or end user, even if the latter is not directly concerned by the draft regulation. Moreover, the explanation can apply either to the general operation of the algorithm or to a specific decision.

Schematically, there are two types of algorithms, including the relatively transparent ones: linear models and decision trees. In this case, an explanation is possible provided that the number of variables and interactions taken into account, or the number of leaves of a tree, remains reasonable. All the other classes of learning algorithms, systematically non-linear and complex, are opaque by construction. It is then a matter of building an explanation by different strategies, such as an explainable approximation by a linear model, a tree or a set of deterministic decision rules. Another strategy consists in providing indications on the importance of the variables, by measuring the effect of a random permutation of their values (mean decrease accuracy, Breiman, 2001), by stressing the algorithm (Bachoc et al. 2020), or by carrying out a sensitivity analysis with Sobol indices (Bénesse et al. 2021).
The designer of an algorithm is also interested in the explanation of a specific decision, in order to identify the cause of an error and to remedy it, for example by completing the learning base with an under-represented group before retraining the algorithm. The user of an AI system must be informed as well as possible (art. 13, 15) of the possibilities of explaining a decision, which he will be able to pass on to the end user (client, patient, litigant, citizen...) according to his own deontology, his commercial interest or a legal constraint, for example for administrative decisions. To this end, a few strategies are proposed, such as a local approximation by an explainable model (linear model, decision tree), or a list of counterfactual examples, that is, the closest situations, in a certain sense, that would lead to the opposite, generally more favourable, decision (granting of a loan). When this proves impossible, as for example in the case of a medical diagnosis involving a large number of opaque factors, it is important to inform the user, and hence the patient, precisely about the risks of error, so that the latter's consent is effectively free and informed.

A few demonstrations of explanatory procedures are proposed on freely accessible sites. Let us cite: gems-ai.com, aix360.mybluemix.net, github.com/MAIF/shapash

Complex reality

One should not lose sight of the fact that the impossibility, or simply the difficulty, of formulating an explanation certainly comes from the use of opaque algorithms, but that their necessity is inherent to the very complexity of reality. A complex reality (e.g. the functions of living organisms), involving many variables, their interactions, non-linear effects or even feedback loops, is necessarily modelled by a complex algorithm, in order to avoid abusive simplifications that could seriously harm the performances. It is, first of all, reality itself that proves complex to explain.

3.5 Bias & discrimination

Although very present in the preliminary texts (white paper (EC 2021), recitals of the AI Act), the reference to the risk of discrimination is not explicit in the draft articles. There nevertheless appear the obligation to detect biases in the data (art. 10) and that of displaying the performances or risks of error per group (art. 13). What are the consequences, given the difficulties of defining and detecting a discrimination, whether human or algorithmic?

Detecting a discrimination

Formally, strict fairness can be expressed by properties of independence, in probability, between the target variable Y, which expresses a decision, and the so-called sensitive variable S, with respect to which a discrimination is in principle prohibited. This variable can be quantitative (e.g. age) or qualitative with two or more classes (e.g. gender or ethnic origin), or, in a more complex way, involve interactions between several sensitive variables. Nevertheless, this theoretical definition of fairness is not concretely practicable for detecting, measuring and mitigating risks of bias. Moreover, legal texts essentially refer to a group of sensitive persons compared with the others. Consequently, and to simplify this first pedagogical reading of the detection of discrimination risks, we consider only one sensitive variable with 2 levels: young vs. old, woman vs. man...

A well-established way of detecting a discriminatory human decision consists in proceeding by testing. In the case of a presumption of discrimination in hiring, the procedure consists in sending two comparable CVs, identical except (counterfactual example) for the level of the sensitive variable (e.g. gender, or ethnic origin associated with the name), in order to compare the responses: an interview proposal or not.
Statistical criteria make it practicable to detect, measure and mitigate risks of bias. Moreover, legal texts essentially refer to one group of sensitive persons relative to the others. Consequently, and to simplify this first pedagogical overview of the detection of discrimination risks, we only consider a sensitive variable with two levels: young vs. old, female vs. male...

A well-established way of detecting a discriminatory human decision is testing. In the case of suspected discrimination in hiring, the procedure consists in sending two comparable CVs that differ only (counterfactual example) in the level of the sensitive variable (e.g. gender, or the ethnic origin associated with the name) and comparing the responses: invitation to an interview or not. This individual approach is made systematic (Rich, 2014) in surveys sending thousands of pairs of CVs. In France this is the official doctrine promoted by the Comité National de l'Information Statistique and periodically commissioned by the DARES (Direction de l'Animation, des Études, de la Recherche et des Statistiques) of the Ministry of Labour.

Statistical indicators can be estimated from such a survey but, since there is no legal definition of fairness, which by default becomes the absence of discrimination, the academic community has proposed several dozen indicators (e.g. Zliobaitė 2017) to evaluate potential biases that are sources of discrimination. Choices must be made among all these bias criteria, noting that many of them turn out to be highly correlated or redundant (Friedler et al. 2019). Empirically, and after reviewing a vast literature on ethical AI, or rather on the identified risks of algorithmic discrimination, a consensus emerges on three priority levels of statistical bias. This elementary article therefore considers three types of probability ratios (equal to 1 under strict independence) for which Besse et al. (2021) propose confidence-interval estimates in order to control their precision.

Statistical parity and disparate impact

The first level of algorithmic discrimination risk is easily illustrated: if an algorithm is trained on biased data, it learns and very faithfully reproduces those systemic, societal or population biases by which one group is historically disadvantaged (e.g. women's income); worse, the algorithm may even reinforce the bias, leading to explicitly discriminatory decisions. It is therefore important to be able to detect, measure, mitigate or even eliminate this type of bias. Fairness as statistical parity (or demographic equality) would be the independence between the sensitive variable(s) S (e.g. gender, ethnic origin) and the prediction Ŷ of the decision. Historically, the deviation from independence used to measure this type of bias has been evaluated in the USA in hiring procedures since 1971 through the notion of disparate impact, now systematically taken up (Barocas and Selbst, 2016) for the evaluation of this type of discrimination in an algorithm. The disparate impact is the ratio of two probabilities: the probability of a favourable decision (Ŷ = 1) for a person of the sensitive group (S = 0), in the legal sense, over the same probability for a person of the other group (S = 1):

DI = P(Ŷ = 1 | S = 0) / P(Ŷ = 1 | S = 1).

This indicator has been part of the Civil Rights Act & Code of Federal Regulations (Title 29, Labor: Part 1607 Uniform guidelines on employee selection procedures) since 1978, with the so-called 4/5ths rule: if DI is below 0.8, the company must provide economic justifications. Software marketed in the USA offering automatic pre-recruitment algorithms anticipates this legal risk (Raghavan et al. 2019) by integrating an automatic bias-mitigation procedure (fair learning). In France there is no obligation nor any mention of this statistical indicator, only an incitement from the Défenseure des Droits and the CNIL (2012) towards corporate human-resources departments. They are advised to keep ethnic statistics, authorized in this case subject to confidentiality, in the form of contingency tables from which estimates of disparate impact could easily be derived.
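For illustration, the sketch below (our own minimal Python code, not the estimator implementation of Besse et al. 2021) estimates DI from binary predictions and a binary sensitive variable, with a rough percentile-bootstrap confidence interval standing in for their interval estimates.

import numpy as np

def disparate_impact(y_pred, s):
    """DI = P(Yhat = 1 | S = 0) / P(Yhat = 1 | S = 1); S = 0 codes the sensitive group."""
    return y_pred[s == 0].mean() / y_pred[s == 1].mean()

def di_confidence_interval(y_pred, s, level=0.95, n_boot=1000, seed=0):
    """Percentile-bootstrap interval, a simple stand-in for the interval
    estimates of Besse et al. (2021); assumes both groups are well represented."""
    rng = np.random.default_rng(seed)
    n = len(y_pred)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)          # resample individuals with replacement
        stats.append(disparate_impact(y_pred[idx], s[idx]))
    alpha = (1 - level) / 2
    return np.quantile(stats, [alpha, 1 - alpha])

# Usage: y_pred and s are numpy arrays of 0/1 values of equal length.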
The presence of a systemic bias is implicitly mentioned at the preliminary data-analysis stage (art. 10, 2., (f)), but without further precision on how it should be taken into account, whereas algorithmically reinforcing this bias would be openly discriminatory. Moreover, would it be politically advisable to introduce a dose of positive discrimination in order to mitigate social discrimination? This is evoked in the experts' guidelines (CE, 2019, guideline 52) as a way to improve the fairness of society, and it is technically the subject of a vast academic literature called fair learning. This option is not taken up explicitly in the AI Act, but we will see in the numerical example below that it cannot be excluded and may even be fully justified once the other types of bias described below are taken into consideration.

Conditional errors

Are the prediction error rates, and hence the risks of wrong decisions, the same for each group (overall error equality)? In other words, is the error independent of the sensitive variable? This can be measured by estimating (with a confidence interval) the ratio of probabilities (probability of being wrong for the sensitive group over the probability of being wrong for the other group):

REC = P(Ŷ ≠ Y | S = 0) / P(Ŷ ≠ Y | S = 1).

Thus, if a group is under-represented in the training database, it is very likely that the decisions concerning it will be less reliable. This was one of the first criticisms levelled at facial recognition algorithms, and the risk is also present in applications to health (Besse et al. 2020) or human resources (De-Arteaga et al. 2019). The identification, handling and monitoring of this risk are present (art. 13, 3., (b), ii and art. 15, 1. & 2.) in the draft regulation and must therefore be explicitly detailed in the documentation (art. 11).

Conditional odds ratios

Even if the two previous criteria are found to be fair, the errors can be asymmetric (more false positives, fewer false negatives) to the detriment of one group, with an impact all the more discriminatory as the error rate is high. This indicator (a comparison of odds ratios of conditional independence, also called equalized odds) is at the heart of the controversy over the COMPAS assessment of recidivism risk in the USA (Larson et al. 2016). It is also present in the numerical example below. This twofold indicator is measured by estimating two probability ratios: the ratio of the false positive rate of the sensitive group over the false positive rate of the other group, and the ratio of the false negative rates for these same groups:

RFP = P(Ŷ = 1 | Y = 0, S = 0) / P(Ŷ = 1 | Y = 0, S = 1)  and  RFN = P(Ŷ = 0 | Y = 1, S = 0) / P(Ŷ = 0 | Y = 1, S = 1).

The evaluation of this type of bias is not explicitly mentioned in the draft regulation. Nevertheless, it is part of the classical procedure for evaluating classification errors by means of a confusion matrix or ROC curves per group, and cannot be neglected.

Note that it is all the more difficult to ignore this last type of bias since the three are interdependent and even interact with the other risks: accuracy and explainability. This is clearly highlighted in the numerical example below. There is therefore a form of deontological obligation, or of statistical consistency, in having to address these different levels of analysis.
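With the same conventions (S = 0 for the sensitive group, binary labels), the three ratios REC, RFP and RFN can be estimated by direct frequency counts; a minimal sketch, with function names of our own choosing:

import numpy as np

def error_rate(y_true, y_pred, s, group):
    m = (s == group)
    return np.mean(y_true[m] != y_pred[m])

def false_positive_rate(y_true, y_pred, s, group):
    m = (s == group) & (y_true == 0)
    return np.mean(y_pred[m] == 1)

def false_negative_rate(y_true, y_pred, s, group):
    m = (s == group) & (y_true == 1)
    return np.mean(y_pred[m] == 0)

def bias_ratios(y_true, y_pred, s):
    """REC, RFP, RFN as ratios sensitive group (S = 0) / other group (S = 1);
    all three equal 1 under strict independence."""
    rec = error_rate(y_true, y_pred, s, 0) / error_rate(y_true, y_pred, s, 1)
    rfp = false_positive_rate(y_true, y_pred, s, 0) / false_positive_rate(y_true, y_pred, s, 1)
    rfn = false_negative_rate(y_true, y_pred, s, 0) / false_negative_rate(y_true, y_pred, s, 1)
    return rec, rfp, rfn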
4 Numerical example

The toy, or sandbox, example of this section concretely illustrates the full complexity of the principles discussed above, underlining their interdependence. This data set is old and widely used to illustrate all the works aiming at an optimal mitigation of bias. The academic community hopes to gain rapid access to many other representative "sandboxes", whose construction is the subject of Article 53 of the AI Act.
4.1 Data

The public data used mimic the context of computing a credit score. They are extracted (a sample of 45,000 persons) from a 1994 US census and describe age, type of employment, education level, marital status, ethnic origin, number of hours worked per week, presence or not of a child, financial income or losses, gender, and low or high income level. They serve as a reference or sandbox for all developments of fair machine-learning algorithms. The task is to predict whether a person's annual income is above or below $50k, and thus to predict, in a way, their creditworthiness given their other socio-economic characteristics. These questions of discrimination in access to credit are still topical (Campisi 2021, Hurlin et al. 2021, Kozodoi et al. 2021), even though the principle of credit scoring became widespread as early as the 1990s with the rise of data mining, since rebranded as AI.

The complete study and the computation code are available in a tutorial (Jupyter notebook), but the illustration here is limited to a brief summary of the analysis of discrimination by gender.
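This is the classical "Adult" benchmark; as a hedged illustration, it can be loaded from OpenML with scikit-learn (the OpenML copy may differ slightly, in size and preprocessing, from the 45,000-person sample of the tutorial):

from sklearn.datasets import fetch_openml

# The classical "Adult" census extract, here in its OpenML version.
adult = fetch_openml("adult", version=2, as_frame=True)
X = adult.data                              # age, workclass, education, sex, hours per week...
y = (adult.target == ">50K").astype(int)    # 1 if annual income above $50k
s = (X["sex"] == "Male").astype(int)        # S = 0: women (sensitive group), S = 1: men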
4.2 Results

An exploratory analysis (data cleaning, statistical description) must first be included in the documentation. It is the subject of another tutorial, of which only a few results are kept here for the sake of brevity. They highlight a strong systemic or societal bias: only 11.6% of women have a high income, against 31.5% of men. The ratio DI = 0.38 is therefore highly disproportionate and can be explained by a few well-identified sociological considerations visible on the first factorial plane (Fig. 12.1) of a multiple correspondence analysis computed after recoding all variables as categorical. Women work on average fewer hours per week (HW1) (household occupations and children?); even though the education level does not seem related to gender, they hold positions with less responsibility (Admin) (glass-ceiling effect?). Another type of bias seems present in these data: women are associated (co-occurrences more frequent than under independence) with the presence of children without being part of a couple, contrary to men. Does this survey preferentially address the head of the household, possibly a single parent?

Figure 12.1: First factorial plane of a multiple correspondence analysis (FactoMineR library, Lê et al. 2008).
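The analysis above is computed with the R library FactoMineR; a rough Python counterpart, assuming the third-party prince package and a prior categorical recoding (including binning hours per week into classes such as HW1), might look like:

import prince  # assumed Python counterpart of FactoMineR's MCA

# All variables recoded as categorical beforehand (hypothetical bins for the
# quantitative ones, e.g. hours per week -> HW1, HW2, ...).
X_cat = X.astype("category")
mca = prince.MCA(n_components=2).fit(X_cat)
coords = mca.row_coordinates(X_cat)   # first factorial plane, as in Figure 12.1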
The data were randomly split into three samples: a training sample (29,000) for estimating the models or training the algorithms, a validation sample (8,000) for optimizing certain hyperparameters, and a test sample (8,000) for evaluating the various performance and bias indicators. The relatively large size of the initial sample makes it possible to use a representative validation sample, as required by the regulation, and thus to avoid heavier cross-validation procedures. The prediction results are gathered in Figure 12.2.
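With scikit-learn, such a three-way split can be sketched as follows (sample sizes from the text; the seed and stratification are our own choices, not necessarily those of the tutorial):

from sklearn.model_selection import train_test_split

# Train 29,000 / validation 8,000 / test 8,000 (sizes from the text; any
# remaining rows of the OpenML copy are simply discarded by the second split).
X_train, X_tmp, y_train, y_tmp, s_train, s_tmp = train_test_split(
    X, y, s, train_size=29_000, random_state=0, stratify=y)
X_valid, X_test, y_valid, y_test, s_valid, s_test = train_test_split(
    X_tmp, y_tmp, s_tmp, train_size=8_000, test_size=8_000,
    random_state=0, stratify=y_tmp)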
The systemic bias of the data (dataBaseBias) is compared with that of the income-level prediction by a classical linear logistic regression model, linLogit: DI = 0.25. Significantly lower (disjoint confidence intervals), it shows that this model reinforces the bias and therefore clearly discriminates against women in its prediction. The naive procedure (linLogit-w-s), which consists in removing the so-called sensitive variable (gender) from the model, in no way suppresses the discriminatory bias (DI = 0.27), because gender is in any case present through the values taken by the other variables (proxy effect). Another consequence of this dependence on proxies is that testing, or a counterfactual test (changing gender, all other things being equal), no longer detects any discrimination (DI = 0.90)!

An elementary non-linear algorithm (tree, a binary decision tree) increases the bias, but not in a statistically significant way, since the confidence intervals are not disjoint. Its accuracy is better than that of the logistic regression model but, if the objective is a useful interpretation, the complexity of the tree must be reduced by penalizing the number of leaves, from about a hundred down to about ten. In that case, accuracy degrades to the level of logistic regression: explainability has a cost.

A more sophisticated non-linear algorithm (random forest) is very faithful to the bias of the data, with an indicator (DI = 0.36) close to that of the societal bias, and provides better accuracy: 0.86 instead of 0.84 for logistic regression. This algorithm does not discriminate more and brings better accuracy, but at the price of the model's explainability. Opaque like a neural network, it does not allow a decision to be explained from its parameters, as is easy with the regression model or a binary decision tree of reasonable size.
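In the same spirit, the three predictors compared here can be approximated with scikit-learn; a sketch without the hyperparameter tuning of the tutorial, reusing the samples and the disparate_impact function defined above:

from sklearn.base import clone
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# One-hot encoding of the categorical variables, numeric columns passed through.
encoder = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), X_train.select_dtypes("category").columns),
    remainder="passthrough")

models = {
    "linLogit": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_leaf_nodes=10),   # penalized tree: about ten leaves
    "randForest": RandomForestClassifier(n_estimators=500, random_state=0),
}
for name, clf in models.items():
    pipe = make_pipeline(clone(encoder), clf).fit(X_train, y_train)
    y_hat = pipe.predict(X_test)
    acc = (y_hat == y_test.to_numpy()).mean()
    print(name, "accuracy:", acc, "DI:", disparate_impact(y_hat, s_test.to_numpy()))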
A delicate question concerns the political choice of whether or not to mitigate the systemic bias in the case of a credit score. Unlike Hurlin et al. (2021), Goglin (2021) addresses it in a very incomplete way, considering exclusively the bias of the errors by gender. This author "justifies" not considering the systemic bias because correcting it would lead women into over-indebtedness, while the third type of bias is simply forgotten. A finer analysis shows, through this example, how important it is to take the three types of bias into account simultaneously in order to avoid a somewhat "paternalistic" position.

In principle, the prediction accuracy for a group depends on its representativeness. If the group is under-represented, the error is larger; this is typically the case in facial recognition, but not in the example treated here. Although women are half as numerous in the sample, the prediction error rate is around 7.9% for women and 17% (REC = 0.36) for men (simplified binary tree algorithm). It is then essential to consider the third type of bias to realize that the situation is ultimately to the disadvantage of women. The false positive rate is higher for men (0.081) than for women (0.016) (RFP = 0.20). This favours men, who benefit more widely from a favourable decision even when it is granted wrongly. In contrast, the false negative rate is higher for women (0.41), to their disadvantage, than for men (0.38) (RFN = 1.08), but these last differences are not significant.

In such a situation, by choosing the default decision threshold of 0.5, a bank would take little risk: a low false positive rate and high false negative rates. But, and this is an important conclusion, a breach of fairness appears, in the sense that the bank takes more risks to the benefit of men even though the error rates concerning them are higher.

A mitigation of the bias of the odds ratios is therefore justified in order to make the chances of obtaining a credit comparable across genders, even when granted wrongly. Rather than balancing these chances by penalizing those of men, a dose of positive discrimination is introduced to the benefit of women, for more fairness, by seeking to equalize the false positive rates by gender as evaluated on the validation sample.

The last two rows of Figure 12.2 propose one simple way (post-processing), among a very voluminous literature, of correcting the bias for more social justice. Two algorithms are trained, one per gender, and the decision threshold (high income or not, credit granted or not...) is lowered for women: 0.3 for the random forests, 0.2 for a binary tree, instead of the default 0.5 for men. This correction of the false positives also impacts the error rates, which become more balanced across genders, and likewise mitigates the disparate impact, for a fairer society. The binary tree used (TreeDiscrPos) is the penalized one (few leaves), chosen to obtain an easy interpretation at the price of accuracy. The thresholds and the penalization parameter were determined on the validation sample before being applied independently to the test sample.
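A minimal sketch of this post-processing, reusing the samples and the encoder defined above (names are ours; the 0.3 threshold is the one reported for the random forests, the equalization itself being tuned on the validation sample):

import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# One random forest per gender; the women's threshold (S = 0) is lowered to
# equalize false positive rates, as estimated on the validation sample.
thresholds = {0: 0.3, 1: 0.5}
models_by_group = {}
for g in (0, 1):
    m = (s_train == g).to_numpy()
    models_by_group[g] = make_pipeline(
        clone(encoder), RandomForestClassifier(n_estimators=500, random_state=0)
    ).fit(X_train[m], y_train[m])

def predict_with_positive_action(X_new, s_new):
    """Group-specific model and decision threshold (post-processing mitigation)."""
    y_hat = np.zeros(len(X_new), dtype=int)
    for g in (0, 1):
        m = (s_new == g).to_numpy()
        proba = models_by_group[g].predict_proba(X_new[m])[:, 1]
        y_hat[m] = (proba >= thresholds[g]).astype(int)
    return y_hat

y_hat_test = predict_with_positive_action(X_test, s_test)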
4.3 Discussion

Some lessons can be drawn from this toy example mimicking the computation of a bank credit score.

• Without precautions, if a bias is present in the data, it is learned and even reinforced by an elementary linear model.

• Naively removing the sensitive variable (gender) to reduce the bias changes nothing; hence the importance (art. 10, 5.) of authorizing a controlled confidentiality risk so that sensitive personal data can be included in order to detect biases.

• A sophisticated algorithm, non-linear and involving the interactions between variables, merely reproduces the bias but, being opaque, no longer allows decisions to be justified if the disparate impact is legally actionable, as in the USA (DI < 0.8). In the present case, a simple binary tree, penalized to control the number of leaves, reconciles a small increase in bias with explainability, without penalizing accuracy too much.

• In the presence of proxies of gender, as is the case in this example, a testing procedure (counterfactual test) is completely unsuited to the ex-post detection of algorithmic discrimination. Only a rigorous analysis of honest documentation (art. 11) describing the data, the training procedure and the performance can therefore be convincing about the non-discriminatory capabilities of an algorithm.

• In this example, choosing a post-processing that mitigates the bias of the conditional odds ratios (similar false positive rates) by gender impacts the three types of bias, reducing their magnitude simultaneously. This is one way of legitimizing the introduction of a dose of positive discrimination that reduces the disadvantage imposed on women without harming men.

• Finally, in this illustrative example, a tree penalized to be sufficiently simple (a small number of leaves) and combined with a touch of positive discrimination provides decision support that is explainable to a client and fair, in terms of the bank's risks, with respect to gender.

• Admittedly, in the case of a credit score, this would increase the bank's risk by reducing prediction quality and increasing the false positive rate for women, but it would provide the bank with tangible communication arguments for an "ethical" image: inclusive, hence fairer, and more explainable decisions without harming accuracy too much.

5 Conclusion

As Meneceur (2021-b) points out in an exhaustive comparison of institutional initiatives, the very numerous ethical approaches aiming to frame the development and application of AI systems are not sufficient and convincing answers for building users' trust. This motivates the approach of the EC leading to the publication of this draft regulation, while the Council of Europe is also considering a mix of binding and non-binding legal instruments to prevent violations of human rights and threats to democracy and the rule of law; the need for compliance replaces ethics.

The analysis of the draft European regulation shows significant advances towards more transparency of AI systems:

• fundamental importance of the data and hence of their thorough, documented preliminary analysis,
• explicit evaluation and documentation of performance and hence of the risks of error or failure: robustness, resilience,

• explicit documentation of the interpretability of a system or of a decision, commensurate with the available technologies and methods,

• consideration of certain types of bias: social fairness in the data, performance across groups and monitoring of the associated possible risks of discrimination,

• logging of activity for traceability of operation,

• appropriate human oversight to reduce and anticipate risks,

• obligation to provide exhaustive documentation to the user (annex III AI systems), audited ex ante by a notified body for annex II AI systems, in order to obtain the "CE" marking.

Nevertheless, this draft regulation, mainly motivated by a harmonization of commercial relations within the Union according to the principles of product safety and liability for defective products, does not take into account damages that may affect end users. The consequences or objectives of the approach adopted by the EC moreover meet the requirements of the FTC (Federal Trade Commission) (Jillson, 2021) for fairness and transparency with regard to the performance of a marketed AI system. Thus some fundamental rights, although retained as an essential requirement in the white paper, are at the very least neglected, all the more so as the high-risk AI systems of annex III are not subject to certification by an independent notified body.

• More broadly than AI applications alone, taking into account a form of digital frugality in order to reduce environmental impacts does not seem, in this draft AI Act, to be a major concern of the EC. This concerns the energy consumption of massive storage and of algorithm training, and the over-exploitation of the mineral resources needed to manufacture digital equipment.

• It is certainly advised to look for potential biases in the data (art. 10, 2., (f)), even with the possibility of taking sensitive personal data into account (art. 10, 5.) to track systemic biases that are potential sources of discrimination. Nevertheless, the absence of precision on how to measure these biases, and how to mitigate or remove them in the training procedures, leaves a gap potentially detrimental to the end user. While it is already quite complex for a person to provide evidence of presumed discrimination in a human decision, for example by testing, the numerical example above shows that this is mission impossible in the face of an algorithmic decision. Only a rigorous audit procedure of the documentation describing the data, the training procedure and the provisions put in place to manage and mitigate biases can guarantee a minimal protection of end users against this type of discrimination. This compliance acts as a reversal of the burden of proof, but one which, for the AI systems of annex III, only benefits the information of the user and not, as things stand, the protection of the end user.

• Aware of these problems, the Défenseure des Droits recently published an opinion in collaboration with the European network EQUINET, whose main conclusions are summarized in a press release. It calls for the principle of non-discrimination (of the end user) to be placed back at the heart of the draft AI Act. One of the essential questions remains who, apart from the user, will be able to access the documentation of a high-risk AI system, and thus to audit it under proper conditions. It will probably be up to each Member State to legislate on these questions.
The toy numerical example also has the merit of clearly showing the interdependence of all the constraints (confidentiality, quality, explainability, fairness across types of bias) that an AI system should satisfy in order to gain users' trust. It also shows that the problem cannot be reduced to the simple objective of minimizing a quantifiable risk in order to reach a best compromise. It is rather the search for a least bad solution, interweaving technical, economic, legal and political choices, which will have to be clearly spelled out in the documentation made mandatory by the forthcoming adoption of an AI Act which would, in any case and despite the current limitations of the draft text, be a notable step towards more transparency.

References

• Besse P. (2021). Médecine, police, justice : l'intelligence artificielle a de réelles limites, The Conversation, 01/12/2021.

• Besse P., Besse Patin A., Castets Renard C. (2020). Implications juridiques et éthiques des algorithmes d'intelligence artificielle dans le domaine de la santé, Statistique & Société, 3, pp. 21-53.

• Besse P., Castets-Renard C., Garivier A., Loubes J.-M. (2019). L'IA du Quotidien peut elle être Éthique ? Loyauté des Algorithmes d'Apprentissage Automatique, Statistique et Société, Vol. 6(3), pp. 9-31.
• Besse P., del Barrio E., Gordaliza P., Loubes J.-M., Risser L. (2021). A survey of bias in Machine Learning through the prism of Statistical Parity for the Adult Data Set, The American Statistician, DOI: 10.1080/00031305.2021.1952897, open-access version.

• Breiman L. (2001). Random forests, Machine Learning, 45, 5-32.

• Campisi N. (2021). From Inherent Racial Bias to Incorrect Data—The Problems With Current Credit Scoring Models, Forbes Advisor.

• Castets-Renard C., Besse P. (2022). Responsabilité ex ante de l'AI Act : entre certification et normalisation, à la recherche des droits fondamentaux au pays de la conformité, in "Un droit de l'intelligence artificielle : entre règles sectorielles et régime général. Perspectives de droit comparé", C. Castets-Renard and J. Eynard (eds.), Bruylant (forthcoming).

• CE (2019). Lignes Directrices pour une IA digne de Confiance, drafted by a group of European experts.

• CE (2020). Livre blanc sur l'intelligence artificielle : une approche européenne d'excellence et de confiance.

• CE (2021). Règlement du parlement et du conseil établissant des règles harmonisées concernant l'intelligence artificielle (législation sur l'intelligence artificielle) et modifiant certains actes législatifs de l'union.

• Conseil d'État (2022). S'engager dans l'intelligence artificielle pour un meilleur service public, study report, published online 30/08/2022.

• De-Arteaga M., Romanov A. et al. (2019). Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting, Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 120-128.

• Défenseure des Droits (2020). Algorithmes : prévenir l'automatisation des discriminations, report.

• Défenseur des Droits, CNIL (2012). Mesurer pour progresser vers l'égalité des chances. Guide méthodologique à l'usage des acteurs de l'emploi.

• Friedler S., Scheidegger C., Venkatasubramanian S., Choudhary S., Hamilton E., Roth D. (2019). Comparative study of fairness-enhancing interventions in machine learning, Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 329-338.

• Goglin C. (2021). Discrimination et IA : comment limiter les risques en matière de crédit bancaire, The Conversation, 23/09/2021.

• Hurlin C., Pérignon C., Saurin S. (2021). The fairness of credit score models, preprint SSRN.

• Jillson E. (2021). Aiming for truth, fairness, and equity in your company's use of AI, blog, accessed 29/05/2021.

• Kozodoi N., Jacob J., Lessmann S. (2021). Fairness in credit scoring: assessment, implementation and profit implications, preprint arXiv.

• Larson J., Mattu S., Kirchner L., Angwin J. (2016). How we analyzed the COMPAS recidivism algorithm, ProPublica, online, accessed 28/04/2020.

• Lê S., Josse J., Husson F. (2008). FactoMineR: An R Package for Multivariate Analysis, Journal of Statistical Software, 25(1), pp. 1-18.

• Liu X., Faes L., Kale A. U. et al. (2019). A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis, The Lancet Digital Health, vol. 1, pp. e271-e297.

• Raghavan M., Barocas S., Kleinberg J., Levy K. (2019). Mitigating bias in Algorithmic Hiring: Evaluating Claims and Practices, Proceedings of the Conference on Fairness, Accountability, and Transparency.

• Rich J. (2014). What Do Field Experiments of Discrimination in Markets Tell Us? A Meta Analysis of Studies Conducted since 2000, IZA Discussion Paper No. 8584.

• Roberts M., Driggs D., Thorpe M., Gilbey J., Yeung M., Ursprung S., Aviles-Rivero A. I., Etmann C., McCague C., Beer L., Weir-McCall J. R., Teng Z., Gkrania-Klotsas E., AIX-COVNET, Rudd J. H. F., Sala E., Schönlieb C.-B. (2021). Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans, Nature Machine Intelligence, 3, pp. 199-217.

• Verzelen N. (2012). Minimax risks for sparse regressions: Ultra-high dimensional phenomenons, Electronic Journal of Statistics, 6, 38-90.