Poly ML SIR
Jérémy Fix
Hervé Frezza-Buet
Matthieu Geist
Frédéric Pennerath
Contents
I Overview 11
1 Introduction 13
1.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.1.1 Data sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.1.2 Data conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.2 Different learning problems... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.2.1 Unsupervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.2.2 Supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.2.3 Semi-supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.2.4 Reinforcement learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.3 Different learning strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.3.1 Inductive Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.3.2 Transductive learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4 Evaluation 45
4.1 Real risk estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.1.1 Cross-validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.1.2 Real risk optimisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2 The specific case of classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5 Risks 53
5.1 Controlling the risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.1.1 The considered learning paradigm . . . . . . . . . . . . . . . . . . . . . . . 53
5.1.2 Bias-variance decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.1.3 Consistency of empirical risk minimization . . . . . . . . . . . . . . . . . . . 56
5.1.4 Towards bounds on the risk . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.1.5 To go further . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.2 Classification, convex surrogates and calibration . . . . . . . . . . . . . . . . . . . . 62
5.2.1 Binary classification with binary loss . . . . . . . . . . . . . . . . . . . . . 62
5.2.2 Cost-sensitive multiclass classification . . . . . . . . . . . . . . . . . . . . . 64
5.2.3 Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.2.4 To go further . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.3 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.3.1 Penalizing complex solutions . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.3.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.3.3 To go further . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6 Preprocessing 69
6.1 Selecting and conditioning data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.1.1 Collecting data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.1.2 Conditioning the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.2 Dimensionality reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.2.1 Variable selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.2.2 Principal Component Analysis (PCA) . . . . . . . . . . . . . . . . . . . . . 75
6.2.3 Relationship between covariance, Gram and euclidean distance matrices . . 83
6.2.4 Principal component analysis in large dimensional space (N ≪ d) . . . . . . 85
6.2.5 Kernel PCA (KPCA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.2.6 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.2.7 Manifold learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
7 Introduction 95
7.1 Acknowledgment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.3 By the way, what is an SVM? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.4 How does it work? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
8 Linear separator 97
8.1 Problem Features and Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
8.1.1 The Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
8.1.2 The Linear Separator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
8.2 Separability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
8.3 Margin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
10 Kernels 113
10.1 The feature space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
10.2 Which Functions Are Kernels? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
10.2.1 A Simple Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
10.2.2 Conditions for a kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
10.2.3 Reference kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
10.2.4 Assembling kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
10.3 The core idea for SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
10.4 Some Kernel Tricks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
10.4.1 Data Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
10.4.2 Centering and Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
10.5 Kernels for Structured Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
10.5.1 Document Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
10.5.2 Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
10.5.3 Other Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
12 Regression 125
12.1 Definition of the Optimization Problem . . . . . . . . . . . . . . . . . . . . . . . . 125
12.2 Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
12.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
21 Bagging 217
21.1 Bootstrap aggregating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
21.2 Random forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
21.3 Extremely randomized trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
22 Boosting 221
22.1 AdaBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
22.1.1 Weighted binary classification . . . . . . . . . . . . . . . . . . . . . . . . . . 221
22.1.2 The AdaBoost algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
22.2 Derivation and partial analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
22.2.1 Forward stagewise additive modeling . . . . . . . . . . . . . . . . . . . . . . 223
22.2.2 Bounding the empirical risk . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
22.3 Restricted functional gradient descent . . . . . . . . . . . . . . . . . . . . . . . . . 226
Overview
Chapter 1
Introduction
Endowing machines with the ability to learn may sound excessive or even inappropriate,
since the common meaning of learning is a cognitive process exhibited by humans or animals.
Learning implies being able to reuse in the future what has been learned in the past, which requires
some memory capacities. Before the age of computers, memory was also a concept mostly used
for humans or animals. From the point of view of physicists, any system that keeps a trace of its
history is obviously endowed with memory, and the process leading to the engram of that trace
could naturally be called learning. Would we say that the solid matter inside a boiled egg is the
memory of the cooking stage? Was the boiling itself a learning process?
In this document, we will not enter into a semiotic analysis of learning. Instead, let us sketch
out what learning is in the so-called “machine learning” field. This first part aims at outlining
that field, highlighting the overall structure of a domain which has grown significantly during the
last decades.
1.1 Datasets
One intends to use machine learning techniques when s/he has to cope with data. Extracting
information from a dataset in order to further process some new data that was not in the initial
dataset can be thought of as the core of machine learning. Every processing coming out of a machine
learning approach is fundamentally data driven. Indeed, if the processing could have been designed
from specific knowledge about what it should do, then this knowledge could have been used for
designing the process without any help from machine learning. For example, nobody uses collections
of moon, earth and sun positions to predict eclipses, since the prediction process can be set up
from Newton’s laws. As opposed to this, there is no known algorithm (i.e. law) telling how to
recognize which handwritten digits are written in the zip-code area of a letter. For such problems,
learning is to be considered, trying to set up a process that is grounded on a huge amount of data
(zip-code scan images and their transcription into digits) to perform its recognition task.
Figure 1.1: The dataset S of the weights and heights of the students attending the Machine
Learning class (fake data). |S| = N = 300.
Sometimes, the generation of data (i.e. PZ ) comes with a supplementary process, that associates
a label to the data. This supplementary labeling process is called the oracle, and it is defined as
a conditional distribution. In this case, let us rename the data into inputs, renaming Z, z, Z, PZ
into X , x, X, PX . The label (or output) y ∈ Y given by the oracle to some input x results from
a stochastic process as well. In other words, the oracle is defined as a conditional distribution
P (Y | X), which is unknown as well. In this case, datasets are made of (x, y) pairs, sampled
according to PX and the oracle, i.e.
z ≝ (x, y)
Z ≝ X × Y                                   (1.1)
p_Z(x, y) ≝ p_{Y|X}(y|x) p_X(x)
From a computational point of view, the dataset S is as if it had been generated by algorithm 1.
In the example of the Machine Learning class students, let us now add a supplementary Boolean
1 In probability theory, instead of dealing with samples tossed from a probability distribution, a single
random variable made of N independent and identically distributed variables is rather considered, which is more
rigorous than considering samples as done here.
attribute to each student, telling whether s/he belongs to the University Wrestling Team. Our
dataset is now made of ((w, h), b) triples, with (w, h) the weight and height, and b set to true if the
student belongs to the Wrestling Team. Such a dataset is represented in figure 1.2. One can see that
the sample distribution is the same as in figure 1.1, but some labels are added. Dealing with such
data in the context of machine learning makes the assumption that the dataset can be obtained
from algorithm 1, i.e. that whether a student belongs to the wrestling team or not can be somehow
deduced from his/her weight and height.
Algorithm 1 MakeDataSet
1: S = ∅
2: for i = 1 to N do
3:   x ← P_X // Toss the input.
4:   y ← P_{Y|X=x} // Thanks to the oracle, toss the output from x.
5:   S ← S ∪ {(x, y)}
6: end for
7: return S
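For readers who prefer code, algorithm 1 translates directly into the following Python sketch. The two sampling routines below are illustrative assumptions (a Gaussian over weight/height and a made-up oracle), not the distributions used to generate the figures.

import random

def make_dataset(N, sample_input, sample_label):
    """Mirror of algorithm 1: toss x from P_X, then y from the oracle P_{Y|X=x}."""
    S = []
    for _ in range(N):
        x = sample_input()       # x <- P_X
        y = sample_label(x)      # y <- P_{Y|X=x}
        S.append((x, y))
    return S

# Illustrative (made-up) distributions for the student example.
def sample_input():
    weight = random.gauss(70, 15)    # kg
    height = random.gauss(175, 10)   # cm
    return (weight, height)

def sample_label(x):
    weight, _ = x
    p_wrestler = 0.3 if weight > 90 else 0.02   # hypothetical oracle P_{Y|X=x}
    return random.random() < p_wrestler

S = make_dataset(300, sample_input, sample_label)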
Figure 1.2: Weights and heights of the students attending the Machine Learning class. The ones
belonging to the University Wrestling Team are plotted in green.
Ways of sampling
The way the data feed the learning process is important in practical applications of machine
learning techniques. Indeed, some of them are restricted to a specific sampling strategy. Let us
Data preprocessing is, in most cases, the place for introducing problem-specific
knowledge, such as dedicated signal processing, suitable metrics, etc.
Each datum z is indeed a set of attributes, i.e. a set of key-value pairs. In the case of our students in
figure 1.2, there are three attributes: weight, height and is_wrestler. Each datum is a specific set of
values for all the attributes.
In practical cases, some values may be missing for some attributes, which may raise problems.
Values may also be over-represented. In the example of the wrestler students in figure 1.2, as few
of them actually belong to the wrestling team, blindly predicting that a student does not belong
to the team may not be such a bad prediction. To avoid this, one may try to balance the dataset,
picking samples such that there is an equal amount of wrestler and non-wrestler students. This
must be done carefully, since the dataset is then no longer sampled according to algorithm 1, as discussed
further in section 2.2.3.
The type of the attributes has to be considered carefully. Some attributes are numerical. This
is the case for the weight and height in our example. Some others are categorical. This is the case for
is_wrestler. A nationality attribute is also categorical, since there is no order between
nationality values, they cannot be added, etc.
Many machine learning algorithms handle vectors, i.e. a set of scalar attributes.
Representing categorical attribute values with numbers, for example using USA =
1, Belgium = 2, France = 3, · · · for a nationality attribute, induces scalar operations
on the values, whose semantics may be silly. In the given example, France > USA
and Belgium is an intermediate value between France and USA.
Another problem that arises with vectorial data is the scale of the different attributes. In
our student example, if weights were given in milligrams and heights in meters, the first dimension
of the vectorial data (the weight) would live in a [20000, 160000] range, while the second (the
height) would live in a [0, 3] range. With such scales, the point cloud of figure 1.1 would be a flat
2 Indeed, they are still independent but the distribution changes.
horizontal line. In other words, the learning algorithm would consider height as a constant,
fluctuating slightly from one data sample to the other3. To avoid this, attribute values may be
rescaled (usually standardization4 is performed).
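As a minimal illustration (not taken from the original text), standardization of a set of vectorial data can be done as follows with NumPy; the array X is assumed to hold one sample per row and one attribute per column.

import numpy as np

def standardize(X):
    """Rescale each column (attribute) to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0          # keep constant attributes unchanged
    return (X - mean) / std

# X: one row per student, columns = (weight, height).
X = np.array([[70.0, 175.0], [95.0, 180.0], [55.0, 160.0]])
X_std = standardize(X)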
The high number of attributes in some data may also lead to an increase of the computation
time, as well as to a lack of generalization performance. To avoid this, variable selection methods
exist that select, from all the attributes, a subset that appears to be useful for the learning. This
is detailed in section 6.2.1.
Last, let us raise a warning: be aware of the attribute meaning. For example, if in the student
dataset all wrestlers are registered first, and if the line number in the data file is an attribute,
comparing this line number to the appropriate threshold (that is, the number of wrestlers) leads
to a perfect, but silly, learning. Such silly cases may occur since machine learning toolboxes
usually provide scripts that take the data file as input and may use all attributes for doing their
job. Make sure that data files do not need cleaning before feeding the toolbox.
Membership test
Relying on the samples in the dataset S, one can determine whether a point belongs or not to the
distribution of the samples5. A membership test for a point could for example consist in finding
the two closest samples to the tested point, measuring the two distances from the point to the
selected samples, averaging these distances, and comparing the result to some threshold. Figure 1.3
shows the result. Once computed, the membership function can be used to say that a student with
a 120 cm height and a 100 kg weight is very unlikely to be one of my students6.
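The following Python sketch implements the membership test just described (average distance to the two nearest samples, compared to a threshold); the samples and the tested point are placeholders.

import numpy as np

def membership_test(point, samples, threshold):
    """True if the average distance to the two nearest samples is below the threshold."""
    dists = np.linalg.norm(samples - point, axis=1)
    two_nearest = np.sort(dists)[:2]
    return two_nearest.mean() < threshold

samples = np.array([[70.0, 175.0], [95.0, 180.0], [55.0, 160.0]])   # (weight, height) pairs
print(membership_test(np.array([100.0, 120.0]), samples, threshold=7.0))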
Identification of components
Another way to extract information from a distribution of samples is to recognize some components.
For example, let us consider, in our case, lines as elementary components. One can describe the
data by its main fitting lines, as in figure 1.4. Projecting high dimensional data onto a few lines, that
are adjusted according to the distribution, may allow to express the high dimensional original data
in a reduced space while preserving its variability.
Clustering
Last, some algorithms enable to represent the samples as a set of clusters, as in figure 1.5. This may
help to set up an a posteriori identification of groups inside the data. Analyzing the groups/clusters
in figure 1.5 may lead me to guess that there are three kinds of students in my class. The examination
of the small cluster on the right suggests that some students form a specific group. I may investigate
this... and discover that there are wrestlers in the classroom.
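As an illustrative sketch (not part of the original text), such a clustering can be obtained with a plain k-means; three clusters are assumed here, as suggested by figure 1.5.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: assign each sample to the nearest center, then update the centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels, centers

# X: one (weight, height) row per student; labels, centers = kmeans(X, k=3)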
3 e.g., the order of magnitude of the Euclidean distance between two samples would then be mainly due to their weight difference.
4 Rescaled such that the mean is 0 and the variance is 1.
5 Is it likely that the point could have belonged to some dataset generated with the unknown P_Z?
6 I cannot know why. Do such students actually exist? If yes, why do they not attend my lecture? Keep in
mind that the data is fake, and so are the conclusions we get from it.
Figure 1.3: Weights and heights of the students attending the Machine Learning class. All 2D
points are submitted to the membership test function described in the text (with threshold 7).
The gray areas contain the points which passed the test.
Figure 1.4: Weights and heights of the students attending the Machine Learning class, linear
component analysis.
Figure 1.5: Weights and heights of the students attending the Machine Learning class, clustering.
Bi-class classification
The most basic, but commonly addressed, supervised learning problem is two-class classification.
A supervised learning problem is a two-class classification problem when |Y| = 2, as in figure 1.2
where Y = {wrestler, non wrestler}. Learning consists in setting up a labelling process that
tells whether a student is a wrestler or not, according to his/her weight and height. In other words, the
learned labelling process should predict the label y of an incoming new datum (x, y) from x only. A
very classical example is the linear classifier. It consists of a hyperplane. A hyperplane splits the
space into two regions. The linear classifier assigns one label to one region, and the other label to
the other region. The learning consists in finding a suitable hyperplane, as in figure 1.6. When
a new student enters the machine learning class, if the hyperplane is well placed8, one can guess
whether that student is a wrestler or not, just by identifying in which region the student lies.
Figure 1.6: Weights and heights of the students attending the Machine Learning class, linear
separation.
7 See section 1.1.1.
8 This is not always possible...
The case of the linear separation in figure 1.6 allows to introduce the concept of classification
score. Even if the output has only two values, the hyperplane is defined by ax + by + c = 0.
So for any (x, y) data, once a, b, c have been determined by learning, one can compute the scalar
s = ax + by + c and decide, according to the sign of s, in which region, green or yellow, the input
(x, y) lies. In this case, the binary decision for labelling relies on the previous computation of a
scalar. The higher s is, the more distant from the line (x, y) is. So a highly positive s means that the
classifier says “strongly” that the student is a wrestler. In general, some classifiers
add a score to the label they produce. This can be a statistical score, or a geometrical score as for
the hyperplane used in our example.
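A minimal sketch of such a score-based linear classifier (the coefficients a, b, c are placeholders that would be obtained by some learning procedure):

def linear_score(weight, height, a, b, c):
    """Signed score of the hyperplane a*weight + b*height + c = 0; its sign gives the class."""
    return a * weight + b * height + c

def predict_wrestler(weight, height, a, b, c):
    return linear_score(weight, height, a, b, c) > 0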
Multi-class classification
A multi-class classification problem is a problem where Y is finite (i.e. |Y| < ∞). In the case
of the students, the label could have been the name of the sport they practice, instead of only
telling whether they practice wrestling or not. Even if some native multi-class learning algorithms exist,
there are also multi-class algorithms that combine two-class algorithms. This can be achieved by
a one-versus-all (or one-versus-rest) strategy (see algorithm 2 for learning, and then algorithm 3 for
labelling a new input) or a one-versus-one strategy (see algorithms 4 and 5). The former requires
scores, as opposed to the latter. The latter requires more computation.
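As an illustration, here is a hedged sketch of the one-versus-all strategy built on top of any binary learner returning a scoring function; the binary learner itself is left abstract (algorithms 2 and 3 give the precise formulation).

def one_versus_all_train(S, classes, train_binary):
    """Train one scorer per class c: class c against all the others.

    train_binary(S_bin) must return a function x -> score (higher means "more positive").
    """
    scorers = {}
    for c in classes:
        S_bin = [(x, +1 if y == c else -1) for (x, y) in S]
        scorers[c] = train_binary(S_bin)
    return scorers

def one_versus_all_predict(scorers, x):
    """Pick the class whose binary scorer is the most confident."""
    return max(scorers, key=lambda c: scorers[c](x))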
Regression
A regression problem is a supervised learning problem where labels are continuous, as opposed to
classification. The standard case is Y = R. For example, if the dataset of the students attending the
machine learning class contains their weights and heights, as well as a Boolean label telling whether
a student is overweight or not, and if this Boolean is the label to be predicted from weight and
height, the supervised learning problem is a classification problem, as presented previously. If now
the BMI9 of the student is given instead of the overweight flag, and if this index value has to be
predicted from weight and height, the supervised learning problem becomes a regression.
The distinction between classification and regression may sound artificial, since it only denotes
that Y is finite or continuous. From a mathematical point of view, such a distinction is irrelevant.
Nevertheless, supervised learning methods are often dedicated to either regression or classification,
which justifies the distinction between the two.
Last, let us mention the case where Y = R^n. This can be solved with n scalar regression
problems, one for predicting each component of the label. When this is applied, each component
prediction is learned separately from the others, whereas they may not be independent. Some
methods like neural networks10 handle multidimensional prediction as a whole, thus exploiting the
possible relations between the dimensions.
A control problem
Let us play the famous weakest link game, used in the weakest link TV show. Here, a single player
is considered. The player tries to get the highest amount of money in a limited time. The game
starts with a 20$ question. The player needs to answer it right to reach the next question. The
next question is a 50$ one. If the player answers it right, s/he is asked the next question, which is
a 100$ one. The values for the questions are 20, 50, 100, 200, 300, 450, 600, 800 and 1000$. Each time
the player reaches a new question, s/he has two options. The first option is to try to answer it and go
to the next stage. The second option is to say ”bank”. In that case, the amount of money associated to
the last question is actually won, and the game restarts from the first stage.
9 Body Mass Index
10 Multi-layered perceptrons indeed.
Figure 1.7: Weights and heights of the students attending the Machine Learning class. Blue dots
are unlabelled data. Yellow dots are students that are known to be non-wrestlers, whereas green
dots are those who are known to be wrestlers.
Let us model the game with a Markovian Decision Process (MDP). First element of the MDP
is a state space S. Here S = {q0 , · · · , qi , · · · , q9 }, i.e. the nine question levels of the game and the
initial state. Second, let us denote by A the set of actions. Here A = {answer, bank}. Third is
the transition matrix11 T, defined such that T^a_{s,s′} is the probability of reaching state s′ from state s
when performing action a. The last element is a reward matrix R, defined such that R^a_{s,s′} is the
reward expectation when the transition (s, a, s′) occurs.
For the weakest link game, the modeling is quite easy. Let us suppose that the probability of
answering a question right is p. The reward is deterministic here. The following holds (a Python
sketch of these matrices is given below):
• ∀i ∈ [0..8], T^answer_{q_i,q_{i+1}} = p and T^answer_{q_i,q_0} = 1 − p. Moreover, T^answer_{q_9,q_0} = 1,
T^bank_{q_i,q_0} = 1 for every i (banking restarts the game), and all other transition probabilities are null.
• R^bank_{q_1,q_0} = 20, R^bank_{q_2,q_0} = 50, R^bank_{q_3,q_0} = 100, R^bank_{q_4,q_0} = 200, R^bank_{q_5,q_0} = 300,
R^bank_{q_6,q_0} = 450, R^bank_{q_7,q_0} = 600, R^bank_{q_8,q_0} = 800, R^bank_{q_9,q_0} = 1000 and R^a_{s,s′} = 0
otherwise. This is the reward profile.
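To make the model concrete, here is a small Python sketch (an illustration, not part of the original text) that builds T and R as arrays, with states q0..q9 and actions answer/bank as described above.

import numpy as np

n_states = 10                       # q0 .. q9
ANSWER, BANK = 0, 1
values = [0, 20, 50, 100, 200, 300, 450, 600, 800, 1000]   # money attached to q0..q9

def build_mdp(p):
    """Transition tensor T[a, s, s'] and reward tensor R[a, s, s'] of the weakest link game."""
    T = np.zeros((2, n_states, n_states))
    R = np.zeros((2, n_states, n_states))
    for i in range(n_states):
        if i < 9:
            T[ANSWER, i, i + 1] = p       # correct answer: reach the next question
            T[ANSWER, i, 0] = 1 - p       # wrong answer: back to the start, nothing won
        else:
            T[ANSWER, 9, 0] = 1.0         # after the last question, back to the start
        T[BANK, i, 0] = 1.0               # banking restarts the game...
        R[BANK, i, 0] = values[i]         # ...and secures the money of the current level
    return T, R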
Figure 1.8 illustrates the modeling of the game with an MDP. The purpose of reinforcement learning
is to solve the MDP. It means finding a policy that allows the player to accumulate the maximal
amount of reward. Such a policy is called the optimal policy. A policy, in this context, is simply the
function that tells for each state which action to do. It can be stochastic, but it can be shown that
the optimal policy is indeed deterministic.
Figure 1.8: Modelling the weakest link game with a Markovian Decision Process. See text for
details.
In the TV show, the players play for a fixed duration, and they have to maximize their return
(i.e. the sum of the money got when banking). In reinforcement learning, usually, no such duration
is given. Time stress is modeled as a probability 1 − γ, γ ∈ [0, 1), of ending the game at each action.
This is not modelled directly in T and R. The optimal policy, that is what reinforcement
learning computes, takes γ into account. If γ is high, one can hope to reach the last states, and thus
one may not bank for the first questions. If γ is low, it is better to be Epicurean12, i.e. to bank as soon
as a question is answered correctly.
Resolution
Even for such a reduced problem, finding the right strategy is not obvious, even when the MDP
(S, A, T, R, γ) is known. When everything is known, the problem of finding what to do in each state
in order to accumulate the highest amount of reward (i.e. finding the optimal policy) is addressed
by optimal control, a field of automatic control. Reinforcement learning addresses this problem when T
and R are unknown. To do so, reinforcement learning methods often rely on inner supervised
learning.
Figure 1.9 shows the optimal policy in different cases.
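When the MDP is known, the optimal policy can be obtained by dynamic programming; the following value iteration sketch (reusing numpy and the build_mdp function from the sketch above) is one illustrative way to reproduce the kind of result shown in figure 1.9.

def value_iteration(T, R, gamma, n_iter=1000):
    """Compute the optimal state values and the corresponding greedy policy."""
    n_actions, n_states, _ = T.shape
    V = np.zeros(n_states)
    for _ in range(n_iter):
        # Q[a, s] = expected immediate reward + discounted value of the next state
        Q = np.einsum('asn,asn->as', T, R) + gamma * (T @ V)
        V = Q.max(axis=0)
    return V, Q.argmax(axis=0)    # policy[s] is ANSWER or BANK

T, R = build_mdp(p=0.6)
V, policy = value_iteration(T, R, gamma=0.9)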
11 It is a 3D tensor indeed....
12 carpe diem
Figure 1.9: According to γ and p, the best policy consists in banking at a specific question level.
This is what this plot illustrates. See text for details.
Figure 1.10: Weights and heights of the students attending the Machine Learning class. The
colored areas correspond to the labels given by a KNN (with k = 10) relying on the dataset.
Chapter 2
The frequentist approach
to a nonlinear function ϕ ∈ Φ^X. In this case, Θ = R^dim(Φ), f_θ(x) ≝ θ^T·ϕ(x) and the hypothesis
set H ≝ { f_θ | θ ∈ R^dim(Φ) } ⊂ Y^X.
When such a projection is used, X is called the ambient space and Φ is called the feature
space. Usually, dim(Φ) ≫ dim(X), meaning that using the non-linear projection ϕ consists in
applying linear methods in high dimension in order to get an overall non-linear processing. This
has dramatic consequences, as illustrated next.
Let us illustrate this in R2 with 3 points in figure 2.1. Of course, if all points were aligned,
separation would have been impossible.
Figure 2.1: For these points in R2, any labelling makes the obtained dataset linearly separable.
This is true for most point configurations. This remark can be extended to n + 1 points in Rn.
Let us now consider the labeled points S in figure 2.2-left. They are obviously not linearly
separable. In this case, the trick is to project X into a feature space Φ, as mentioned in the previous
section, thanks to some function ϕ. In order to name components, x ∈ X = R2 is denoted by
x = (x1, x2). Let us use
ϕ(x) = (x1, x2, x1^2 + x2^2)^T                (2.1)
1 If we denote θ′ = (θ, b) ∈ R^{n+1} and x′ = (x, 1) ∈ R^{n+1}, the expression θ^T·x + b rewrites as θ′^T·x′. This is usually
done in order to avoid a specific formulation for the offset b in mathematical expressions.
Let us define ϕ (S) = {(ϕ (x), y) | (x, y) ∈ S}. Figure 2.2-middle shows that ϕ (S) can be separated.
The separation frontier in the ambient space X is obtained from a reverse projection of ϕ (X ) ∩
{θT .x + b = 0} in the feature space2 . This is how ϕ enables to perform non-linear separation in
the input space, while a linear separator is involved in the feature space.
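As an illustrative sketch (under the assumption of the quadratic map of equation (2.1)), projecting the points and then applying any linear separator in R3 yields a non-linear frontier in R2:

import numpy as np

def phi(x):
    """Feature map of equation (2.1): (x1, x2) -> (x1, x2, x1^2 + x2^2)."""
    x1, x2 = x
    return np.array([x1, x2, x1**2 + x2**2])

def predict(x, theta, b):
    """Linear decision in the feature space = non-linear decision in the ambient space."""
    return np.sign(theta @ phi(x) + b)

# theta and b would be obtained by training any linear separator on {(phi(x), y) | (x, y) in S}.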
Figure 2.2: On the left, the labeling of 10 points in R2 is not linearly separable. The same points,
projected into R3 by ϕ defined by equation (2.1), become separable by a hyperplane, as the middle of
the figure shows. The half-space over the hyperplane corresponds to black labels. It is shaded. The
paraboloid is the projection of the whole R2 into R3 by ϕ. The part of the paraboloid above the
hyperplane is darkened as well. It corresponds to regions labeled as black. On the right, points are
represented back in R2, as on the left, but the plane area which projects into the dark half-space
in R3 is darkened. It can be seen that the linear separation in R3 (the feature space) corresponds to
a non-linear separation in R2 (the ambient space), since ϕ is a non-linear projection.
The trick of projecting a dataset into a high-dimensional space, so that the projected points
become linearly separable, seems powerful. However, it has a drawback. In figure 2.2, the projection
has been chosen carefully. This cannot be done in non-artificial situations. In real cases, we would
rather have built a dataset-agnostic high-dimensional projection into Rn, with n ≫ 9, in order to
make our 10 points very easily separable, and then set up a separator to make the decision. This
cannot be represented, as opposed to what we did for the middle part of figure 2.2 where ϕ projects
into R3. Nevertheless, the right part of figure 2.2 can still be sketched out, even if the feature space
cannot be plotted. The figure that we would obtain may look like figure 2.3... The separation has poor
generalization capabilities, and can hardly be used to predict the label of new incoming samples.
Once again, this drawback occurs since it is easy to separate points in high dimension, so that
separation becomes useless. This is referred to as the curse of dimensionality.
2.2 Risks
Let us consider here supervised learning and the definitions given in section 1.1.1. We would like
to find some predictor (or hypothesis) h ∈ H such that the label y = h(x) given by h to some input
x is likely to be the label associated with that x if it were sampled by algorithm 1, i.e. if that x were
labeled by the oracle. This is what a ”good” predictor is supposed to do. The concept of risk aims
at defining and measuring the quality of a predictor.
The binary loss is suitable for a finite label set Y, i.e. classification, but it may be rough when Y is
continuous, i.e. regression. When Y = R, one can use the quadratic loss defined as

L(y, y′) = (y − y′)^2
Unfortunately, the real risk cannot be computed, since the oracle pZ is unknown. Nevertheless,
it can be estimated. The classical method for estimating an expectation is the computation of the
average over a collection of samples. The dataset S actually contains samples tossed from PZ , as
algorithm 1 shows. The following average, called the empirical risk of h computed from S, denoted
by RSemp (h) or RN (h) (N = |S|), thus estimates the real risk.
def 1 X
RSemp (h) = L (h (x) , y)
|S|
(x,y)∈S
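This estimator translates directly into Python (a sketch; the loss is passed as a function, e.g. the binary or the quadratic loss):

def empirical_risk(h, S, loss):
    """Average loss of the predictor h over the dataset S = [(x, y), ...]."""
    return sum(loss(h(x), y) for x, y in S) / len(S)

binary_loss = lambda y_pred, y: 0.0 if y_pred == y else 1.0
quadratic_loss = lambda y_pred, y: (y_pred - y) ** 2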
The criterion that measures how well some hypothesis h imitates the labeling performed
by the oracle is the real risk. It cannot be computed. The empirical risk may
be an estimation of it. Is this estimation reliable? Can we consider that a predictor
which has a low empirical risk on some dataset has a low real risk in general? This
is the core question of supervised machine learning.
A good predictor is a predictor that minimizes the real risk. Good predictors may be
uninteresting, especially when the distribution of labels is strongly unbalanced.
If I really want to build up an alert system for earthquakes, I would need to balance the
dataset with 50% non-earthquake data, taken randomly from the huge mass of non-earthquake days, and
50% earthquake data (made of all, but few, earthquake data I actually have). If there is only, on
the balanced dataset, 1% of cases for which I predict an earthquake while it actually does not occur (false
positive detection), this 1% may trigger frequent false alerts in the daily use of my detector. In
other words, after the balancing process, the database has lost its statistical significance for a real-life
usage.
ĥ_S ≝ argmin_{h∈H} R^emp_S(h)

where R^emp_S(h) is the empirical risk of h computed on S. The function ĥ_S is also denoted by h_N
(N = |S|). When H is a parametric function set (see section 2.1.1), one can use a gradient descent
in the parameter space to find ĥ_S.
Some problems arise if H gathers complex functions, i.e. if H is rich. Indeed, one can find in H
many hypotheses h for which R^emp_S(h) = 0. Such hypotheses fit the dataset perfectly.
In general, when a perfect fit to the data occurs, one should suspect an overfitting situation,
where the function ĥ_S that has been found performs learning by heart. For example, this
is what happened in figure 2.3, since the non-linear decision region gives the right label for all the
ten points, whereas it will certainly not generalize this good labelling to new data. Overfitting
occurs when H is rich enough to contain functions that can learn big datasets by heart.
2.3.1 Bagging
The first place where randomness can be introduced is the sample set. For training each predictor,
one can build up a dataset from S by sampling P data from S uniformly, with replacement. If the
learning process is very sensitive to the data (some samples can be duplicated since replacement
occurs, some others may be missing, ...), the predictors obtained from this process may show variability.
Nevertheless, that is counterbalanced by the merging procedure described previously, to get a
good overall prediction. Making datasets this way is called bootstrapping, and relying on this to
set up many predictors, whose predictions are merged, is called bagging.
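A minimal sketch of bagging (assuming a generic train function and classifiers whose predictions can be merged by majority vote):

import random

def bagging_train(S, train, n_predictors, P=None):
    """Train n_predictors, each on a bootstrap sample of size P drawn from S with replacement."""
    P = P or len(S)
    return [train(random.choices(S, k=P)) for _ in range(n_predictors)]

def bagging_predict(predictors, x):
    """Merge the predictions by majority vote (for classification)."""
    votes = [h(x) for h in predictors]
    return max(set(votes), key=votes.count)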
time (compared to an accurate optimization process). As for bagging, the merging of the predictions
compensates for the weaknesses of each single predictor output. Dataset bootstrapping can
also be used for training random models, adding more variability.
2.3.3 Boosting
Boosting is a specific case among ensemble methods since it does not rely on randomness. Indeed,
what is actually boosted in the boosting method is a weak predictor learning algorithm. To do the
boosting trick, the weak learning algorithm has to be able to handle weighted datasets. A weighted
dataset is a dataset where each sample is associated with a weight reflecting the importance given
to that sample in the resulting predictor construction.
Boosting is an iterative process. First, equal weights are given to all samples in S. A first
predictor is learned from this. Then, the weights are reconsidered in order to be increased for the
samples badly labeled by the previously learned predictor. A second predictor is learned from the
dataset with this new weight distribution... and so on. The boosting theory gives formulas for
weighting the predictions of each constructed predictor in order to set up a final predictor as a
weighted sum of the individual predictions.
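The following sketch shows the shape of such a boosting loop; the weight-update and predictor-weighting formulas are left abstract since they depend on the boosting variant (AdaBoost gives explicit ones, see chapter 22).

def boosting_train(S, weak_learn, update_weights, predictor_weight, n_rounds):
    """Generic boosting loop over a weak learner that accepts per-sample weights."""
    n = len(S)
    w = [1.0 / n] * n                    # start with equal weights
    predictors, alphas = [], []
    for _ in range(n_rounds):
        h = weak_learn(S, w)             # fit on the weighted dataset
        alphas.append(predictor_weight(h, S, w))
        predictors.append(h)
        w = update_weights(h, S, w)      # increase the weight of badly labeled samples
    return predictors, alphas

def boosting_predict(predictors, alphas, x):
    """Weighted sum of the individual predictions (take its sign for binary classification)."""
    return sum(a * h(x) for a, h in zip(alphas, predictors))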
Chapter 3
The Bayesian approach
This chapter introduces the basics of Bayesian inference on a fake example, in order to build up
the general scheme of Bayesian approaches to machine learning. The mathematics here aim at
being intuitive rather than providing a rigorous definition of probabilities. The latter is grounded
on measure theory, which is not addressed here.
when the density p X.Y (x, y) for that pair is high. Saying that (x, y) is “probable” is thus abusive, this is why it is
quoted.
2 Here, its density of probability is obtained. It describes the random variable when it exists.
In equation 3.1, the argument y is highlighted, in order to stress that p_{Y|X}(y|x0) is a function
of y. This will not be recalled in the following.
Figure 3.1: Joint and conditional densities of probability. X = [0, 10] and Y = [20, 30]. In the
figure, A = p_{X,Y}(x, y), B = p_{X,Y}(x0, y), C = p_X(x0) and D = p_{Y|X}(y|x0).
which leads to the following expression of the Bayes’ rule, expressed for densities of probability:
It is very similar to the more usual Bayes’ rule with probabilities, i.e. P(A | B) = P(B | A) × P(A) / P(B), but
components of the formulas here are functions (the densities) rather than scalars (the probabilities).
The Bayes’ rule for densities of probability is the core mechanism for Bayesian learning, as
illustrated next.
For the sake of further densities of probability plotting, let us consider that models have a scalar
parameter θ ∈ Θ = [0, 1]. For a specific value θ, we consider the data to be samples z ∈ Z = [0, 1]
of the random variable mθ whose density of probability pmθ (z) is defined in figure 3.2.
This density represents the distribution of data for a specific θ. Let us now introduce a random
variable T with values in Θ... this is the trick of Bayesian inference. Its density pT (θ) represents
the values that the parameter θ is likely to have. This is set a priori to any kind of distribution.
Introducing the random variable T gives the density of probability p_{m_θ} a conditional flavor. Indeed,
let us consider that

∀z ∈ Z, p_{m_θ}(z) ≝ p_{Z|T}(z|θ).

We have thus modeled a situation where getting a data sample consists in first tossing θ ← P_T
and second tossing z ← P_{m_θ}. Once again, the first toss sounds artificial... but this is the trick,
as the next section shows.
From this modeling of data generation, the joint density of probability can be computed easily
from the Bayes’ rule (see equation (3.2)).
∀(z, θ) ∈ Z × Θ, p Z.T (z, θ) = p Z|T (z|θ) × pT (θ) = pmθ (z) × pT (θ) (3.3)
Figures 3.3 and 3.4 show the joint density of probability for different pT .
Note that here, the data z is a parameter. The variable is θ, so p T |Z=z is a density of probability
over Θ. Indeed, it tells how θ is distributed under our a priori hypotheses, knowing that the data
z has been observed.
In equation (3.4), pT (θ) is called the prior. It can be computed since it is given a priori. The
value pZ (z) is a normalization constant. It is the “probability” of the occurrence of that data when
the situation is modelled as we do... the way the data is actually sampled is not considered. pZ (z)
can be computed here numerically3 from the joint density of probability given by equation (3.3).
The density of probability p Z|T (z|θ) is known since it is our model, i.e. pmθ (z), as already stated.
3 This is a marginalization computed from an integral: p_Z(z) = ∫_Θ p_{Z,T}(z, θ) dθ.
Figure 3.3: Joint distribution of parameter θ and data z. The distribution pT (uniform here) is
given a priori, as well as p Z|T which is the one defined in figure 3.2.
Figure 3.4: Joint distribution of parameter θ and data z. The distribution pT (parabolic here) is
given a priori, as well as p Z|T which is the one defined in figure 3.2.
The Bayesian inference consists in updating the prior pT(θ) so that it becomes the function
p_{T|Z}(θ|z) that has just been computed. Using that new prior changes the situation that we model.
Indeed, this prior has somehow taken into account the data z that has been provided. Bayesian learning
then consists in repeating this update for each new data sample that is tossed. This is illustrated
in figure 3.5, where it can be seen that pT(θ) gets more and more focused, as the data samples are
provided, on the value θ = 0.7, which is actually the parameter we used for tossing the data samples.
The shape of the distribution reflects the uncertainty concerning the estimated value for that
parameter.
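A grid-based sketch of this update loop (an illustration: the densities are represented by their values on a discretized Θ, and the likelihood function p_{Z|T} is left as a placeholder, e.g. the model of figure 3.2):

import numpy as np

def bayesian_update(prior, theta_grid, likelihood, z):
    """One Bayesian update on a grid: posterior is proportional to likelihood(z|θ) × prior(θ)."""
    posterior = likelihood(z, theta_grid) * prior
    return posterior / np.trapz(posterior, theta_grid)    # divide by p_Z(z)

theta_grid = np.linspace(0.0, 1.0, 501)
prior = np.ones_like(theta_grid)                           # uniform prior on [0, 1]

# 'likelihood' and 'samples' are placeholders for p_{Z|T} and the observed data:
# for z in samples:
#     prior = bayesian_update(prior, theta_grid, likelihood, z)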
Figure 3.5: Bayesian inference. Each plot is the prior pT(θ) after the mentioned number of samples
has been provided.
p_{K|T}(z2, z1 | θ) = p_{m_θ}(z2) × p_{m_θ}(z1) = p_{Z|T}(z2 | θ) × p_{Z|T}(z1 | θ)

So the Bayesian update (equation (3.4)), when a pair of independent samples is given, is the
following:

∀θ ∈ Θ, p_{T|K}(θ | z2, z1) = p_{K|T}(z2, z1 | θ) × p_T(θ) / p_K(z2, z1)
Chapter 4
Evaluation
4.1 Real risk estimation
4.1.1 Cross-validation
Overfitting occurs when the hypothesis ĥ_S computed from the dataset S sticks to that set. In this
case, R^emp_S(ĥ_S) ≈ 0: this is the fitting to the training data which is expected by any inductive
principle, but R^emp_{S′}(ĥ_S) is high for another dataset S′, meaning that the fitting to S is actually
an overfitting.
Let us call S the training set, since learning consists in building R^emp_S from it. One can detect
overfitting by computing the empirical risk of ĥ_S on another dataset S′, called the test set. Of
course, both S and S′ are sampled i.i.d. according to P_Z (see algorithm 1).
Usually, only a single dataset is available, and one can do better than a single split into training
and test sets. This is the idea of the k-fold cross-validation procedure described by algorithm 7,
which provides an estimation of the real risk of ĥ_S by generalizing the use of the two training
and test sets described so far. This is illustrated on figure 4.1.
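A sketch of the k-fold cross-validation procedure of algorithm 7 (the learning algorithm learn and the loss are passed as functions; folds are taken as contiguous chunks for simplicity):

def cross_validation(k, S, learn, loss):
    """Estimate the real risk of 'learn' by averaging the test risk over k folds."""
    fold_size = len(S) // k
    risks = []
    for i in range(k):
        test = S[i * fold_size:(i + 1) * fold_size]
        train = S[:i * fold_size] + S[(i + 1) * fold_size:]
        h = learn(train)
        risks.append(sum(loss(h(x), y) for x, y in test) / len(test))
    return sum(risks) / k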
Figure 4.1: 4-fold cross-validation. See text and algorithm 7 for detail.
The least that can be done when applying supervised machine learning is to
evaluate the method with cross-validation.
θ* = argmin_{θ∈Θ} R_θ
and then, applying α_{θ*} on S gives a predictor h_{θ*}, which is the one returned by the whole process.
The error that could be made with this approach is to consider that R_{θ*}, computed during
the optimization process, is an estimation of the real risk of the optimization process. Indeed, it
estimates the real risk of using our learning algorithm with parameter θ*, but not the real risk of
the whole optimization process, since the value of θ* is not independent from S.
In this case, apart from dividing S into training and test sets for cross-validation, an extra
validation set S′ should be used to measure R^emp_{S′}(h_{θ*}) and estimate the real risk of our optimization
process. As we have extended training and testing sets to a whole cross-validation process, the
validation set can be extended to a cross-validation as well... To do so, let us denote by metalearn
the algorithm described above, reminded in algorithm 8.
This algorithm’s real risk can be estimated by cross-validation, as any other, by simply calling
cross_validation(k, S, metalearn). This process involves two nested levels of cross-validation (see
figure 4.2), which generalize the use of train, test and validation sets.
Algorithm 8 metalearn(S)
1: // α is a learning algorithm with parameters θ, k is a constant.
2: R* = +∞
3: for θ ∈ Θ do
4:   // Consider operation research techniques if Θ cannot be iterated
5:   Compute R_θ = cross_validation(k, S, α_θ).
6:   if R_θ < R* then
7:     R* ← R_θ
8:     θ* ← θ
9:   end if
10: end for
11: Use α_{θ*} to train on S and get a predictor h_{θ*} = α_{θ*}(S).
12: return h_{θ*}
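Using the cross_validation sketch given earlier, algorithm 8 translates into the following Python sketch (make_learner(θ) stands for α_θ and is an assumption of this illustration):

def metalearn(S, make_learner, thetas, k, loss):
    """Grid-search over θ by cross-validation, then retrain the best learner on the whole S."""
    best_risk, best_theta = float('inf'), None
    for theta in thetas:
        risk = cross_validation(k, S, make_learner(theta), loss)
        if risk < best_risk:
            best_risk, best_theta = risk, theta
    return make_learner(best_theta)(S)    # the predictor h_{θ*}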
Figure 4.2: Cross-validation of algorithm 8 involves a nested cross-validation. See text for detail.
48 CHAPTER 4. EVALUATION
Figure 4.3: On the left, a labeled dataset is depicted. Positively labeled samples are painted in
pale yellow, negatively labeled ones are painted in green. The middle of the figure shows a linear
separator. It separates the plane into sub-regions. The sub-region corresponding to positive labels
is painted in yellow as well, the negative sub-region is depicted in green. On the right, a recall of
the confusion matrix coefficients with the colored graphical notation.
responses of the classifier, as in table 4.1. In this example, when a sample belongs to class C it
is systematically given a C label by the predictor. Nevertheless, the C label is also given by the
predictor when real class is not a C.
              predicted
              A    B    C    D
        A    13    2    2    3
real    B     1   25    4    0
        C     0    0   40    0
        D     2    0    1    7
Table 4.1: Confusion matrix for some classifier computed from a dataset S with |S| = 100 data
samples. The classes can be A, B, C or D. Numbers in the matrix are the number of samples
satisfying the condition (their sum is 100).
The confusion matrix can also be meaningful when the problem is cost sensitive. It means that
errors may have different weights/costs according to their nature. For example, predicting that a
patient has no cancer while s/he actually has one is not the same as predicting a cancer while the
patient is sane... Some cost matrix C can be associated to the confusion matrix, where Cij is the
cost of predicting class i while the real class is j. Usually, the Cii are null.
              predicted
              P    N
real    P    TP   FN
        N    FP   TN
Table 4.2: Confusion matrix for some bi-class classifier. Class P is the positive class and class N the
negative one. The coefficient names are true positives (TP), true negatives (TN), false positives
(FP), false negatives (FN).
4.2. THE SPECIFIC CASE OF CLASSIFICATION 49
Figure 4.4: The different concepts are illustrated in the case of a linear separator. Each concept
is a ratio (see text), represented by two red areas. The central one is the numerator, and the
surrounding one the denominator.
Lots of measures are based on the confusion matrix coefficients. The main ones are the sensitivity
(or recall) TP/(TP + FN), the specificity TN/(FP + TN), and the precision TP/(FP + TP). Those
definitions can be depicted graphically as well, as in figure 4.4.
It is also common to plot a predictor performance in a ROC space. That space is a chart
depicted in figure 4.5.
It allows to situate the performances independently from the unbalance between the
classes.
Last, one can summarize the trade-off between sensitivity and precision with the f-score (or
f1-score), computed as

f = 2 × (sensitivity × precision) / (sensitivity + precision)
The f-score is a way to merge sensitivity and precision into a single number, as shown by the chart
in figure 4.6.
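These definitions translate directly into code; a small sketch computing them from the binary confusion matrix counts:

def binary_metrics(tp, fn, fp, tn):
    """Sensitivity (recall), specificity, precision and f-score from confusion matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (fp + tn)
    precision = tp / (fp + tp)
    f_score = 2 * sensitivity * precision / (sensitivity + precision)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f-score": f_score}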
Figure 4.5: ROC space. See text for detail.
Figure 4.6: The f-score as a function of sensitivity and precision.
Chapter 5
Risks
In machine learning, one wants to learn something from data. The “something” to be learnt
is usually quantified thanks to the central notion of risk. However, the risk is an asymptotical
concept, and learning is practically done by the optimization of an empirical counterpart of it. In
section 5.1, we provide a brief introduction to statistical learning theory. Notably, we will provide
(partial) answers to questions such as:
• to what extent the minimized empirical risk is a good approximation of the risk?
These questions are respectively related to the consistency of machine learning and to overfitting.
In the case of classification, the natural (empirical) risk (based on the binary loss) is a difficult
thing to optimize. Therefore, it is customary to minimize a convex surrogate (or proxy) instead. In
section 5.2, we motivate the introduction of these surrogates, exemplify some of them and discuss
their calibration (in other words, is it consistent to minimize them instead of the risk?).
As will be shown in section 5.1, overfitting is heavily related to the capacity (or richness) of
the considered hypothesis spaces. In section 5.3, we briefly introduce the concept of regulariza-
tion, which is a modification of the risk that penalizes complex solutions (the aim being to avoid
overfitting by restricting the search space).
• an oracle that for each input x provides an output y ∈ Y, sampled according to the conditional
distribution P (y | x), also fixed but unknown. If Y = R, we’re facing a regression problem,
whereas if Y is a finite set, we’re facing a classification problem;
• a machine that can implement a set of functions, this set being called the hypothesis space
H = {f : X → Y} ⊂ Y X .
In a first try, supervised learning can be framed as follows: pick f ∈ H that predicts “the best” the
responses of the oracle. The choice of f must be done based on a dataset D = {(xi , yi )1≤i≤n } of
n examples sampled i.i.d. from the joint distribution P (x, y) = P (x)P (y | x). If these examples
are fixed in practice (the dataset is given beforehand), they should really be understood here as
i.i.d. random variables. All quantities computed using these samples are therefore also random
variables.
Before stating more precisely what “the best” means formally, we give some examples of
hypothesis spaces:
In this case, searching for a function f ∈ H amounts to searching for the related parameters
α and β. If Y = {−1, 1} (binary classification, see section 5.2 for multiclass classification),
we can define similarly the following space (writing sgn the operator that gives the sign of a
scalar):

H = { f_{α,β} : X → {−1, 1}, f_{α,β}(x) = sgn(α^T x + β), α ∈ R^p, β ∈ R };
Radial Basis Function Networks (RBFN): the underlying idea here is that many functions
can be represented as a mixture of Gaussians. Given d vectors µ_i ∈ R^p (the centers of the
Gaussians) and d symmetric and positive definite matrices Σ_i (variance matrices) chosen a
priori, the hypothesis space is

H = { f_{α,β} : x → ∑_{i=1}^d α_i exp(−(1/2)(x − µ_i)^T Σ_i^{-1} (x − µ_i)) + β }.

Each Gaussian function is generally called a basis function. Using the same sign trick as
before, this can be used to build a hypothesis space for classification;
Again, this can be modified to form an hypothesis space for binary classification;
nonlinear parametrization: In all above examples, the dependency on the parameters is linear.
This is not necessarily the case: consider an RBFN where the weight of each basis function,
but also the mean and variance of each basis function, have to be learnt. Another classical
example is when predictions are made using an artificial neural network, see part V;
Reproducing Kernel Hilbert Space (RKHS): Let {(xi, yi)1≤i≤n} be the dataset and let K
be a Mercer kernel1 ; the hypothesis space can be written as

H = { f_α : x → ∑_{i=1}^n α_i K(x, x_i), α ∈ R^n }.

Notice that, contrary to the previous examples, the hypothesis depends here on the dataset
of learning samples (through the use of the inputs xi). The related approach is usually called
non-parametric. Notice also that H is not the RKHS, but a subset of it, see part III for
details.
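As an illustration of this kernel-based hypothesis space, here is a sketch evaluating f_α with a Gaussian kernel (the choice of kernel and of its bandwidth is ours, for illustration only; how α is learned is the topic of part III):

import numpy as np

def gaussian_kernel(x, xi, sigma=1.0):
    """K(x, x_i) = exp(-||x - x_i||^2 / (2 sigma^2)), a classical Mercer kernel."""
    return np.exp(-np.sum((x - xi) ** 2) / (2 * sigma ** 2))

def f_alpha(x, alpha, X_train, kernel=gaussian_kernel):
    """Hypothesis of the RKHS-based space: f_alpha(x) = sum_i alpha_i K(x, x_i)."""
    return sum(a * kernel(x, xi) for a, xi in zip(alpha, X_train))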
Now, we still need to state formally what we mean by predicting “the best”. To do so, we
introduce the notion of loss function. A loss function L : Y × Y → R+ measures how similar two outputs
are. It allows quantifying locally the quality of the prediction related to a predictor f ∈ H:
L(y, f(x)) measures the error between the response y of the oracle for a given input x and the
prediction f(x) of the machine for the same input. Here are some examples of loss functions:
1 Roughly speaking, this is the functional generalization of a symmetric and positive definite matrix, see part III
for details.
ℓ2-loss: it is defined as
L(y, f(x)) = (y − f(x))^2.
ℓ1-loss: it is defined as
L(y, f(x)) = |y − f(x)|.
It is of interest in a regression setting when there are outliers (because outliers are less
penalized with the ℓ1-loss), for example;
It is the ideal (but unpractical, see section 5.2) loss for classification.
If the loss function quantifies locally the quality of the prediction, we need a more global
measure of this quality. This is quantified by the risk, formally defined as the expected loss:

R(f) = ∫_{X×Y} L(y, f(x)) dP(x, y) = E[L(Y, f(X))].

Ideally, supervised learning would thus consist in minimizing the risk R(f) under the constraint
f ∈ H, giving the solution

f_0 = argmin_{f∈H} R(f).
Unfortunately, recall that the joint distribution P(x, y) is unknown, so the risk cannot even be
computed for a given function f. However, we have partial information about this distribution,
through the dataset D = {(xi, yi)1≤i≤n}. It is natural to define the empirical risk as

R_n(f) = (1/n) ∑_{i=1}^n L(y_i, f(x_i)).

A supervised learning algorithm will therefore minimize this empirical risk, providing the solution

f_n = argmin_{f∈H} R_n(f).

This is called empirical risk minimization (or ERM for short). Notice that f_n is a random
quantity (as it depends on the dataset, which is a collection of i.i.d. random variables).
To sum up, given a dataset (assumed to be sampled i.i.d. from an unknown joint distribution),
given an hypothesis space (chosen by the practitioner) and given a loss function (also chosen by
the practitioner, depending on the problem at hand), a supervised learning algorithm minimizes
the empirical risk, constructed from the dataset, instead of ideally the risk. The natural questions
that arise are:
• does f_n converge to f_0 (and with what type of convergence)? If it is not the case, machine
learning cannot be consistent;
• given n samples and the chosen hypothesis space (and also depending on the considered loss
function), how close is fn to f0 ?
These are the questions we study next. Before, we discuss briefly the bias-variance decomposition
(a central notion in machine learning).
The term R* is the best one can hope for (notice that it is not necessarily equal to zero, it depends
on the noise of the oracle). In some cases, we can express analytically the minimizer f*, but
the corresponding solution cannot be computed, the underlying distribution being unknown. For
example, with an ℓ2-loss, one can show that f*(x) = E[y | x] = ∫_Y y dP(y | x), which cannot be
computed (the conditional distribution being only observed through data).
Consider the following decomposition:
R(f_n) − R* = (R(f_0) − R*) + (R(f_n) − R(f_0))
    error           bias              variance
Each of these terms is obviously positive. The term R(f_n) − R* is the error of using f_n (computed
from the data) instead of the best (but unreachable) choice f*. The term R(f_0) − R* is called the
bias. It is a deterministic term that compares the best solution one can find in the hypothesis space
H to the best possible solution (without constraint). This term therefore depends on the choice of
H, but not on the data used. The richer the hypothesis space (the more functions belonging to it),
the smaller this term can be. The term R(f_n) − R(f_0) is called the variance. It is a stochastic
term (through its dependency on the data) that should go to zero as the number of samples goes
to infinity. In the next sections, we will see that this is a necessary condition for the ERM to
be consistent, and that we can bound it (in probability) with a function depending on the richness
of H and on the number of samples.
This allows lightening the notations and gives more generality to the following results2 .
Let f be a fixed function of H. The risk (5.3) is the expectation of the random variable
L(Y, f (X)) and the empirical risk (5.4) is the empirical expectation of i.i.d. samples of the same
random variable3. In probability theory, the convergence of an empirical expectation is given
by the laws of large numbers (there are many of them, depending on the required assumptions and on
the considered convergence). Here, we will focus on convergence in probability. Therefore, the
weak law of large numbers states that, for the fixed f ∈ H chosen beforehand (before seeing the
data), we have:

R_n(f) → R(f) in probability, i.e. ∀ε > 0, P(|R(f) − R_n(f)| > ε) → 0 as n → ∞.
Notice that we cannot replace f by fn (the minimizer of the empirical risk) in the above expression,
because fn depends on the data and is therefore itself a random quantity.
2. For example, it applies also to some unsupervised learning algorithms. Let x_1, ..., x_n be i.i.d. samples from an
unknown distribution P(x) to be estimated. Let {p_\alpha(x), \alpha \in R^d} be a set of parameterized densities. Consider the
loss function L(x, p_\alpha) = -\log p_\alpha(x) and the related risk R(p_\alpha) = E[-\log p_\alpha(x)] (this corresponds
to maximizing the likelihood of the data). This case is also handled by the notations z and Q.
3. We recall that it is important to understand that, given this statistical model of supervised learning, the dataset
is a random quantity (so is the empirical risk), even if in practice the dataset is given beforehand and imposed.
Imagine that we could repeat the experiment (of generating the dataset). The dataset would be different, but still
sampled i.i.d. from the same joint distribution.
Figure 5.1: An illustration of the convergence behavior of the risk and its empirical counterpart.
Consider for example a regression problem where one tries to fit a polynomial to data. With too
few points, the empirical risk will be zero while the risk will be high. With an increasing number
of data points, both risks should converge to the same quantity.
Figure 5.2: The considered hypothesis space for showing that limits in Def. 5.1 are not equivalent.
For the ERM to be consistent, we require that fn converges to f0 in some sense. What we are
interested in is the quality of the solution, quantified by the risk. So, the convergence to be studied
is the one of the (empirical) risk of fn to the risk of f0 . This gives a first definition of consistency
of ERM.
Definition 5.1 (Classic consistency of the ERM principle). We say that the ERM principle is
consistent for the set of functions Q(z, f ), f ∈ H, and for the distribution P (z) if the following
convergences occur:
R(f_n) \xrightarrow[n \to \infty]{P} R(f_0) \quad \text{and} \quad R_n(f_n) \xrightarrow[n \to \infty]{P} R(f_0).
Notice that the two limits are not equivalent, as illustrated in Fig. 5.1. To show this more
formally, assume that z ∈ (0, 1) and consider H the space of functions such that Q(z, f) = 1
everywhere except on a finite number of intervals of cumulated length \varepsilon, as illustrated in Fig. 5.2.
Assume that P(z) is uniform on (0, 1). Then, for any n ∈ N, R_n(f_n) = 0 (pick the function with n
intervals centered at the z_1, ..., z_n datapoints). On the other hand, for any f ∈ H, R(f) = 1 − \varepsilon.
Therefore, R(f_0) − R_n(f_n) = 1 − \varepsilon does not converge to zero while R(f_0) − R(f_n) = 0 does.
The problem with this definition of consistency is that it encompasses trivial cases
of consistency. Assume that for a set of functions Q(z, f), f ∈ H, the ERM is not consistent.
Now, let φ be a function such that φ(z) < inf_{f ∈ H} Q(z, f) and add the corresponding function to
H (see Fig. 5.3). With this extended set, the ERM becomes consistent: for any distribution and
any sampled dataset, the empirical risk is minimized by φ(z), which is also the argument that
minimizes the risk. This is a case we would like to avoid, motivating a stricter definition of
consistency.
Definition 5.2 (Strict consistency of the ERM principle). Let Q(z, f), f ∈ H be a set of functions
and P(z) a distribution. For c ∈ R, define H(c) as the set

H(c) = \{ f \in H : R(f) = \int Q(z, f) \, dP(z) \geq c \}.
The principle of ERM is said to be strictly consistent (for the above set of functions and distribu-
tion) if for any c ∈ R, we have
\inf_{f \in H(c)} R_n(f) \xrightarrow[n \to \infty]{P} \inf_{f \in H(c)} R(f).
With this definition, the function φ(z) used in the previous explanation will be removed from
the set H(c) for a large enough value of c. The fact that this definition of strict consistency implies
the definition of classic consistency (the converse being obviously false) is not straightforward, but
it can be demonstrated.
This notion of consistency is fundamental in machine learning. If it is not satisfied, minimizing
an empirical risk makes no sense (so, roughly speaking, machine learning would be useless). What we
would like is a (necessary and) sufficient condition to have strict consistency, that is a convergence
in probability of the quantities of interest. The weak law of large numbers states that for any
function f in H, we have this convergence. However, this is not sufficient. As f_n is a random
quantity, there is no way to know what function will minimize the empirical risk, so the standard
law of large numbers does not apply. However, assume that we have a uniform convergence in
probability, in the sense that

\sup_{f \in H} |R(f) - R_n(f)| \xrightarrow[n \to \infty]{P} 0 \iff \forall \varepsilon > 0, \; P\Big(\sup_{f \in H} |R(f) - R_n(f)| > \varepsilon\Big) \xrightarrow[n \to \infty]{} 0, \quad (5.5)
which is much stronger than the weak law of large numbers (see the next section for a more
quantitative discussion). With such a uniform convergence, the ERM principle is strictly consistent
(convergence occurs in the worst case, so it occurs for f_n: |R(f_n) − R_n(f_n)| ≤ sup_{f ∈ H} |R(f) − R_n(f)|).
This is the basis of a fundamental result of statistical learning theory.
Theorem 5.1 (Vapnik’s key theorem). Assume that there exist two constants a and A such that
for any function Q(z, f), f ∈ H, and for a given distribution P(z), we have

a \leq R(f) = \int Q(z, f) \, dP(z) \leq A.
Obviously, the weak law of large numbers is a corollary of this result. This is called a con-
centration inequality: it states how the empirical mean (which is a random variable) concentrates
around its expectation. Such concentration inequalities are the base of what is called PAC (Prob-
ably Approximately Correct) analysis. The preceding result can be equivalently written as
P\Big( \Big| \frac{1}{n} \sum_{i=1}^{n} X_i - \mu \Big| \leq \varepsilon \Big) > 1 - 2 e^{-2n\varepsilon^2}.
Write \delta = 2 e^{-2n\varepsilon^2} \iff \varepsilon = \sqrt{\frac{\ln\frac{2}{\delta}}{2n}}. Hoeffding’s inequality can equivalently be stated as: with
probability at least 1 − δ,

\Big| \frac{1}{n} \sum_{i=1}^{n} X_i - \mu \Big| \leq \sqrt{\frac{\ln\frac{2}{\delta}}{2n}}.

So the result is probably (with probability at least 1 − δ) approximately (the error being at most
\sqrt{\ln\frac{2}{\delta} / (2n)}) correct.
This can be directly applied to our problem (recall Eqs. (5.3) and (5.4), that correspond re-
spectively to an expectation and to an empirical expectation). Let f ∈ H be a function chosen
beforehand, then with probability at least 1 − δ we have

|R(f) - R_n(f)| \leq \sqrt{\frac{\ln\frac{2}{\delta}}{2n}}.
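As a small numerical illustration of this PAC-style deviation bound (assuming, as above, losses bounded in [0, 1]), the following sketch evaluates it for a few sample sizes; the values of n and δ are arbitrary.

```python
import numpy as np

# Hoeffding, for a fixed f chosen beforehand: with probability at least 1 - delta,
# |R(f) - R_n(f)| <= sqrt(ln(2/delta) / (2n)).
def hoeffding_gap(n, delta):
    return np.sqrt(np.log(2.0 / delta) / (2.0 * n))

for n in (100, 1_000, 10_000):
    print(n, hoeffding_gap(n, delta=0.05))
# The gap shrinks as O(1/sqrt(n)): roughly 0.136, 0.043 and 0.014 here.
```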
Unfortunately, this cannot be extended directly to uniform convergence. Indeed, probabilities are
about measuring sets. What Hoeffding says, when applied to the risk, is that the measure of the
set of datasets satisfying |R(f) − R_n(f)| > \varepsilon is at most δ. Now, this set of datasets depends on the
function f of interest. Take another function f', the corresponding set of datasets will be different,
so the measure of the union of these two sets of datasets (the datasets for which the deviation
exceeds \varepsilon for f or for f') is no longer bounded by δ, but by 2δ. This is
illustrated in Fig. 5.4. We write this idea more formally now.
Assume that H is a finite set, such that Card H = h. We can write

P\Big(\sup_{f \in H} |R(f) - R_n(f)| > \varepsilon\Big) = P\Big(\bigcup_{f \in H} \{|R(f) - R_n(f)| > \varepsilon\}\Big) \quad \text{(by def. of the sup)}

\leq \sum_{f \in H} P(|R(f) - R_n(f)| > \varepsilon) \quad \text{(union bound)}

\leq 2 h e^{-2n\varepsilon^2} \quad \text{(by Hoeffding on each term of the sum).}
Figure 5.4: Here, R is the risk and R_n is the empirical risk for two different datasets (sampled
from the same law). For a given function f ∈ H, the fluctuation (or variation) of R_n(f) around
R(f) is controlled by the Hoeffding inequality. However, f_n depends on the dataset and the
fluctuation of the associated empirical risk cannot be controlled by Hoeffding.
Moreover, we have just shown that, if the hypothesis space is a finite set, then the ERM principle is
strictly consistent, for any distribution (no specific assumption has been made about P (z), apart
from the fact that the related random variables are bounded; this means that it will work for
any—classification here—problem).
Unfortunately, even for the simplest cases exemplified before, the hypothesis space is uncount-
able and this method (Hoeffding with a union bound) does not apply. Yet, the underlying fun-
damental idea is good, it is just the way of counting functions which is not smart enough. Next
we introduce the Vapnik-Chervonenkis dimension, which is a measure of the capacity (the way
of counting functions in a smart way) of an hypothesis space, among others (e.g., Rademacher
averages, shattering dimension, and so on).
Recall that we are focusing here on the case of classification. The basic idea for counting
functions in a smart way is as follows: as we work with empirical risks, there is no difference
between two functions that provide the same empirical risk, so they should not be counted twice.
Consider the examples provided in Fig. 5.5, with 3 to 4 labeled points. Suppose that the classifiers (the
functions of the hypothesis space) make their predictions based on a separating hyperplane (see the
left figure). If two classifiers predict the same labels for the provided examples, they will have the
same empirical risk and will not be distinguishable. In the left figure, there are 8 different classifiers from
this viewpoint. In the middle figure, there are only 6 different classifiers (the points are aligned). In
the right figure, there are 8 possible classifiers (not all configurations of the labels are possible
given a linear separator). Generally speaking, given any hypothesis space and n data points, there
are at most 2^n possible classifiers (still relatively to the value of the associated empirical
risk). We have just seen in the simple example of Fig. 5.5 that if there are more than 3 points, a
linear separator will not be able to provide all possibilities. This rough idea is formalized by the
following result.
Theorem 5.3 (Vapnik & Chervonenkis, Sauer & Shelah). Define G^H(n), the growth function of a
set of functions Q(z, f), f ∈ H, as^4

G^H(n) = \ln \sup_{z_1, \ldots, z_n \in Z} N^H(z_1, \ldots, z_n).
In the first case, the Vapnik-Chervonenkis dimension is infinite, dVC (H) = ∞. In the second case,
we have dVC (H) = h.
Alternatively, we can say that the Vapnik-Chervonenkis dimension of a set of indicator functions
Q(z, f), f ∈ H is the maximum number h of vectors z_1, ..., z_h which can be separated in all 2^h
possible ways using functions of this set (shattered by this set of functions). It plays the role of
the cardinal of this set (while being much smarter). This dimension depends on the structure of
the hypothesis space, but not on the distribution of interest (not on P (z), so not on the specific
problem at hand). It can be estimated or computed in many cases. For example, if classification
is done thanks to a linear separator in R^d, then we have d_VC(H) = d + 1. This says that the
Vapnik-Chervonenkis dimension is a measure of the capacity of the space (and again, it is not
the sole one).
The following result can be proven.
Theorem 5.4 (Bound on the risk). Let δ be in (0, 1). With probability at least 1 − δ, we have
\forall f \in H, \quad R(f) \leq R_n(f) + 2 \sqrt{\frac{2}{n}\Big(d_{VC}(H) \ln\frac{2en}{d_{VC}(H)} + \ln\frac{2}{\delta}\Big)}. \quad (5.6)
This being true for any f , it is also true for fn , the minimizer of the empirical risk. This gives
a bound on the risk, based on the number of samples and on the capacity of H. Notably, this tells
that if n ≫ d_VC(H), then the error term is small and we do not have problems of overfitting (and
conversely, if we do not have enough samples, the empirical risk will be far away from the risk,
leading to an overfitting problem). A direct (and important) corollary of this result is that the
empirical risk minimization is strictly consistent if dVC (H) < ∞.
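As an illustration, the following sketch evaluates the complexity term of the bound (5.6), as reconstructed above, for linear separators in R^d (so that d_VC(H) = d + 1); the chosen values of d, n and δ are arbitrary.

```python
import numpy as np

def vc_bound_gap(n, d_vc, delta=0.05):
    """Complexity term of the bound (5.6): 2*sqrt((2/n)*(d_vc*ln(2en/d_vc) + ln(2/delta)))."""
    return 2.0 * np.sqrt((2.0 / n) * (d_vc * np.log(2.0 * np.e * n / d_vc) + np.log(2.0 / delta)))

# Linear separators in R^d have d_VC = d + 1; the gap only becomes small when n >> d_VC.
d_vc = 10 + 1
for n in (100, 10_000, 1_000_000):
    print(n, vc_bound_gap(n, d_vc))
```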
4. Q_{z_1,...,z_n} is the set of all possible combinations of labels that can be computed given the functions in H for
the data points z_1, ..., z_n; N^H(z_1, ..., z_n) is the cardinal of this set, bounded by 2^n, and G^H(n) is the
logarithm of the supremum of these cardinals over arbitrary datasets (bounded by n ln 2).
5.1.5 To go further
We have provided a short (and dense) introduction to statistical learning theory. For a deeper
introduction, the interested student can follow the optional course “apprentissage statistique” or
read the associated course material by Geist (2015a) (in French). Most of the material presented
in the above sections comes from the book by Vapnik (1998), who did a seminal work in statistical
learning theory. Vapnik (1999) provides a much shorter (than the book) introduction. A seminal
work regarding PAC analysis is the one of Valiant (1984). Other references that might be of
interest (the list being far from exhaustive and disordered, the last reference focusing more on
concentration inequalities) are (Bousquet et al., 2004; Cucker and Smale, 2001; Evgeniou et al.,
2000; Hsu et al., 2014; Györfi et al., 2006; Boucheron et al., 2013). For a general analysis of support
vector machines (to be presented in Part III), see Guermeur (2007).
In other words, minimizing this risk corresponds to minimizing the probability of predicting a
wrong label. That is why the binary loss is the natural loss for classification. The associated
empirical risk is
R_n(f) = \frac{1}{n} \sum_{i=1}^{n} 1_{\{y_i \neq f(x_i)\}}.
There are two problems with this risk:
1. with this formulation, the hypothesis space should satisfy H ⊂ {1, . . . , K}^X, and it is quite
hard to design (for example through a parameterization) a space of functions that output
labels (much harder than designing spaces of functions that output reals);
2. the resulting optimization problem (minimizing the empirical risk) is really hard to solve
(not smooth, not convex, and so on).
Therefore, it is customary to minimize a surrogate (or proxy) to this risk of interest.
We have just designed a (simple) hypothesis space for classification (through the space H). Con-
sidering the binary loss, this leads to the following optimization problem
\min_{\alpha \in R^d} R_n(g_\alpha) = \min_{\alpha \in R^d} \frac{1}{n} \sum_{i=1}^{n} 1_{\{y_i \neq \operatorname{sgn}(f_\alpha(x_i))\}} = \min_{\alpha \in R^d} \frac{1}{n} \sum_{i=1}^{n} 1_{\{y_i \neq \operatorname{sgn}(\alpha^\top \phi(x_i))\}}.
5. We recall that in probability, for an event A depending on a random variable Z, we have E[1_{\{A\}}] =
\int 1_{\{A\}} \, dP(z) = \int_A dP(z) = P(A).
    loss                      ϕ(t) for t ∈ R
    hinge loss                max(0, 1 − t)
    truncated least-squares   (max(0, 1 − t))^2
    least-squares             (1 − t)^2
    exponential loss          e^{−t}
    sigmoid loss              1 − tanh(t)
    logistic loss             ln(1 + e^{−t})

Table 5.1: Some classic loss functions for binary classification, see also Eq. (5.7).
[Figure 5.6 plots L(y, f(x)) = ϕ(y·f(x)) as a function of y·f(x) for the binary loss 1_{R^-}(y·f(x)),
the hinge loss max(1 − y·f(x), 0), the exponential loss e^{−y·f(x)} and the logistic loss ln(1 + e^{−y·f(x)}).]
Figure 5.6: Some plots of classic loss functions for binary classification, see also Eq. (5.7).
The question is: how to solve this optimization problem? It is not convex (convexity being a very nice property in
optimization), not even smooth, so there is no easy answer.
A solution is to introduce a surrogate to this risk, that works on f_α instead of g_α, and that
solves (approximately) the same problem. We give a simple example now. The rationale is to
introduce a loss function such that the loss is low when y and f(x) have the same sign, and high
in the other case. Consider the following risk and its empirical counterpart:

R(f_\alpha) = E\big[e^{-Y f_\alpha(X)}\big] \quad \text{and} \quad R_n(f_\alpha) = \frac{1}{n} \sum_{i=1}^{n} e^{-y_i f_\alpha(x_i)}.
We have that R(fα ) ≥ 0. To minimize this risk, we should set fα such that sgn(fα (xi )) = sgn(yi )
and with a high (ideally infinite) absolute value. Therefore, minimizing this risk makes sense, from
a classification perspective. Moreover, as here fα is linear (in the parameters), one can easily show
that Rn (fα ) is a smooth (Lipschitz) and convex function. This allows using a bunch of optimization
algorithms to solve the related problem, with strong guarantees to compute the global minimum.
This is called the exponential loss. More generally, writing the loss L(y, f(x)) = ϕ(yf(x)) for a
convenient and well-chosen function ϕ : R → R, we have the following generic surrogate to binary
classification:

R(f_\alpha) = E[\varphi(Y f_\alpha(X))] \quad \text{and} \quad R_n(f_\alpha) = \frac{1}{n} \sum_{i=1}^{n} \varphi(y_i f_\alpha(x_i)). \quad (5.7)
In Tab. 5.1 and Fig. 5.6, we provide some classic loss functions. Using a convex surrogate allows for
solving a convex optimisation problem when minimizing the empirical risk. In other words, given
some dataset S = {(x_1, y_1), · · · , (x_i, y_i), · · · , (x_N, y_N)}, the functional R_n(f) is convex with respect
to f. Let us recall that

R_n(f) = \frac{1}{N} \sum_{(x_i, y_i) \in S} L(y_i, f(x_i)) = \frac{1}{N} \sum_{(x_i, y_i) \in S} \varphi(y_i f(x_i))

is a sum of nonnegative terms. Therefore, the convexity of R_n(f) can be deduced from the convexity of
each of the terms K_i(f) := ϕ(y_i f(x_i)) with respect to f. This comes naturally from the convexity
of ϕ, i.e.

∀λ ∈ [0, 1], \; ϕ(λx + (1 − λ)x') ≤ λϕ(x) + (1 − λ)ϕ(x'),

combined with the fact that f ↦ y_i f(x_i) is linear, so that each K_i is the composition of a convex
function with a linear map and is therefore convex; a sum of convex functions being convex, so is R_n(f).
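As an illustration of why convex surrogates are convenient, here is a minimal sketch (in Python with numpy, on purely synthetic data) that minimizes the empirical surrogate risk (5.7) with the logistic loss and a linear parameterization by plain gradient descent; the step size and the number of iterations are arbitrary, illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data with labels y in {-1, +1} and phi(x) = x.
n, d = 500, 2
X = rng.normal(size=(n, d))
y = np.sign(X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=n))

def surrogate_risk_and_grad(alpha, X, y):
    """Empirical surrogate risk (5.7) with the logistic loss phi(t) = ln(1 + exp(-t))."""
    margins = y * (X @ alpha)                      # y_i * f_alpha(x_i)
    risk = np.mean(np.log1p(np.exp(-margins)))
    # d/dalpha ln(1 + e^{-m_i}) = -y_i x_i / (1 + e^{m_i})
    grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
    return risk, grad

alpha = np.zeros(d)
for _ in range(500):                               # plain gradient descent; convexity
    risk, grad = surrogate_risk_and_grad(alpha, X, y)   # ensures a global minimum
    alpha -= 1.0 * grad
print(risk, np.mean(y != np.sign(X @ alpha)))      # surrogate risk and binary empirical risk
```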
    loss                      ψ(s) for s ∈ R
    hinge loss                max(0, 1 + s)
    truncated least-squares   (max(0, 1 + s))^2
    least-squares             (1 + s)^2
    exponential loss          e^{s}
    logistic loss             ln(1 + e^{s})

Table 5.2: Some loss functions for cost-sensitive multiclass classification, see also Eq. (5.12).
We have the same problem as before: designing the space G is difficult and the resulting optimiza-
tion problem is hard, if even possible. Consider again the hypothesis space of linear parameteriza-
tions (5.1),
H = \{ f_\alpha : x \mapsto \alpha^\top \phi(x), \; \alpha \in R^d \}.

From this, we would like to design a space G ⊂ {1, .., K}^X. Write f_{\alpha_1,...,\alpha_K} : X → R^K the function
defined as

f_{\alpha_1,...,\alpha_K}(x) = \big( f_{\alpha_1}(x) \; \ldots \; f_{\alpha_K}(x) \big)^\top \quad \text{such that} \quad \forall 1 \leq k \leq K, \; f_{\alpha_k} \in H.
6. There should be no cost for assigning the correct label, that is c(x, y, y) = 0.
In other words, for each label k, we define a function f_{\alpha_k} ∈ H which can be interpreted as the
score of this label. For a given input, the predicted label is the one with the highest score. The
corresponding space of classifiers is

G = \big\{ g_{\alpha_1,...,\alpha_K} : x \mapsto \operatorname{argmax}_{1 \leq k \leq K} f_{\alpha_k}(x), \; \forall 1 \leq k \leq K : f_{\alpha_k} \in H \big\}.
There remains to define a surrogate to the risk of interest that operates on the functions f_{\alpha_k}
instead of g_{\alpha_1,...,\alpha_K}. Let ψ : R → R be one of the functions defined in Tab. 5.2; we consider the
following surrogate:

R_n(f_{\alpha_1,...,\alpha_K}) = \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} c(x_i, k, y_i) \, \psi(f_{\alpha_k}(x_i)). \quad (5.12)
For the functions of Tab. 5.2, if the functions f_{\alpha_k} are linearly parameterized, the resulting em-
pirical risk is convex. Consider for example the exponential loss, ψ(s) = e^s. For the loss
\sum_{k=1}^{K} c(x_i, k, y_i) \exp(f_{\alpha_k}(x_i)) to be small, we should have f_{\alpha_{y_i}}(x_i) > f_{\alpha_k}(x_i) for k \neq y_i, which shows
informally that this surrogate makes sense (generally, the constraint \sum_{k=1}^{K} f_{\alpha_k}(x) = 0 is added).
Said otherwise, minimizing this surrogate loss should push for smaller scores for labels with larger
cost. Consequently, the classifier that selects the label with maximal score should incur a small
cost.
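The following minimal sketch simply evaluates the surrogate (5.12) on toy scores and costs (with ψ the exponential loss and the usual 0/1 cost, both illustrative choices); it is only meant to make the indexing of the double sum explicit.

```python
import numpy as np

def multiclass_surrogate(scores, costs, psi=np.exp):
    """Empirical surrogate (5.12): (1/n) sum_i sum_k c(x_i, k, y_i) * psi(f_{alpha_k}(x_i)).

    scores: (n, K) array with the scores f_{alpha_k}(x_i);
    costs:  (n, K) array with c(x_i, k, y_i), zero on the true label.
    """
    return np.mean(np.sum(costs * psi(scores), axis=1))

# Toy example with K = 3 classes and the 0/1 cost c(x, k, y) = 1_{k != y}.
rng = np.random.default_rng(0)
n, K = 4, 3
y = rng.integers(K, size=n)
costs = np.ones((n, K)); costs[np.arange(n), y] = 0.0
scores = rng.normal(size=(n, K))
scores -= scores.mean(axis=1, keepdims=True)   # sum-to-zero constraint often added
print(multiclass_surrogate(scores, costs))
```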
Notice that the surrogate (5.12) does not generalize the one of Eq. (5.7): if c(x, f(x), y) =
1_{\{y \neq f(x)\}} and if K = 2, both expressions are not equal. Notice also that it is not the sole solution.
For example, consider the following surrogate:

R_n(f_{\alpha_1,...,\alpha_K}) = \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} c(x_i, k, y_i) \, e^{f_{\alpha_k}(x_i) - f_{\alpha_{y_i}}(x_i)}. \quad (5.13)
It is also a valid surrogate. Knowing what surrogate to use in the cost-sensitive multiclass classi-
fication case is a rather open question.
5.2.3 Calibration
Defining a convex surrogate loss is a common technique to reduce the computational cost of learning
a classifier. However, even if the resulting problem is more amenable to efficient optimization, it is
unclear whether minimizing the surrogate loss results in a good accuracy (that is, whether it will minimize
the original loss of interest). The study of this problem is known as calibration. For example, write
R the risk and R_ϕ its convex surrogate. The question is: if we want a suboptimality gap of \varepsilon
for the risk R, how small should the suboptimality gap for R_ϕ be? If there exists a positive-valued
function δ (called a calibration function) such that

R_ϕ(f) ≤ δ(\varepsilon) ⇒ R(f) ≤ \varepsilon,
the surrogate loss is said to be calibrated with respect to the primary loss. A deeper study of
calibration is beyond the scope of this course material, but the interested student can look at the
references provided next.
5.2.4 To go further
Using a convex surrogate, such as the hinge loss in binary classification (Cortes and Vapnik,
1995), is a common technique in machine learning. Bartlett et al. (2006) studies the calibration
of surrogates provided in Eq. (5.7) (see also the work of Steinwart (2007) for a more general
treatment). The surrogate of Eq. 5.12 has been introduced by Lee et al. (2004) in the cost-
insensitive case and extended to the cost-sensitive case by Wang (2013). Ávila Pires et al. (2013)
study its calibration. The surrogate of Eq. (5.13) has been proposed by Beijbom et al. (2014), the
motivation being to introduce “guess-averse” estimators. An alternative to convex surrogate is to
use “smooth surrogates” (with no problem of calibration, at the cost of convexity), see the work
of Geist (2015b).
5.3 Regularization
We have seen in Sec. 5.1 that the number of samples should be large compared to the capacity (or
richness) of the considered hypothesis space. If one does not have enough samples, or if the hypoth-
esis space is too rich, the problem of overfitting might appear and one should choose a smaller hypothesis
space. However, it is not necessarily an easy task. For example, consider a linear parameteriza-
tion: how should one choose beforehand the basis functions to be removed? When working with
an RKHS (roughly speaking, when using the kernel trick, see Part III), the hypothesis space is im-
plicitly defined through a kernel (and with a Gaussian kernel, commonly used with support vector
machines—see Part III again—, the corresponding Vapnik-Chervonenkis dimension is infinite) and
it cannot easily be shrunk explicitly. A solution to this problem is to regularize the risk, that is, to minimize

J_n(f) = R_n(f) + λ Ω(f),

where Ω(f) measures the complexity of the candidate solution f, λ is called the regularization factor (it is a free parameter) and allows setting a compromise
between the minimization of the empirical risk and the quality of the solution. Informally, this can
also be seen as solving the following optimization problem:

\min_{f \in H \,:\, \Omega(f) \leq \eta} R_n(f),
for some value of the (free) parameter η. Instead of searching for a minimizer of the empirical
risk in the whole hypothesis space, we’re only looking for solutions of a maximum complexity (as
measured by Ω) of η. How to solve the corresponding optimization problem obviously depends
on the risk and on the measure of complexity of solutions. The theoretical analysis of the related
solutions also depends on these instantiations. We give some examples of possible choices for Ω in
the next section.
5.3.2 Examples
For simplicity, we consider here a space of parameterized functions:
H = {fα : X → R, α ∈ Rd }.
\ell_2-penalization: it is defined as

\Omega(f_\alpha) = \|\alpha\|_2^2 = \sum_{j=1}^{d} \alpha_j^2

and is also known as Tikhonov regularization. It is often used with a linear parameteriza-
tion and an \ell_2-loss, providing the regularized linear least-squares (which admits an analytical
solution; a numerical sketch is given after this list);
\ell_0-penalization: this is sometimes called the sparsity norm, or \ell_0-norm, even if it is not a
norm. It is defined as the number of nonzero coefficients,

\Omega(f_\alpha) = \|\alpha\|_0 = \operatorname{Card}\{ j : \alpha_j \neq 0 \}.

The more coefficients are different from zero (the less sparse the solution), the more pe-
nalized f_\alpha is. Notice that solving the corresponding optimization problem is intractable in
general;
5.3. REGULARIZATION 67
\ell_1-penalization: it is defined as

\Omega(f_\alpha) = \|\alpha\|_1 = \sum_{j=1}^{d} |\alpha_j|.

This is often used as a proxy for the \ell_0 norm, as it also promotes sparse solutions.

We do not provide more examples, but there exist many more ways of penalizing an empirical risk.
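As an illustration of the \ell_2 case mentioned in the list above, here is a minimal numpy sketch of regularized linear least-squares, which admits the closed-form solution (X^T X/n + λI)^{-1} X^T y/n; the data are synthetic and the values of λ arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.normal(size=(n, d))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=n)

def ridge(X, y, lam):
    """Minimizer of J_n(alpha) = (1/n)||y - X alpha||^2 + lam * ||alpha||_2^2."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

for lam in (0.0, 0.1, 10.0):
    alpha = ridge(X, y, lam)
    print(lam, np.linalg.norm(alpha))   # a larger lambda shrinks the solution
```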
5.3.3 To go further
A (not always convenient) alternative to regularization is structural risk minimization7 (Vapnik,
1998). For a risk based on the `2 -loss and a linear parameterization, `2 -penalization has been
proposed by Tikhonov (1963) and `1 -penalization by Tibshirani (1996a) (it is known as LASSO,
least absolute shrinkage and selection operator, an efficient method for solving it being proposed
by Efron et al. (2004b)). Under the same assumptions (that is, a linear least-squares problem),
see Hsu et al. (2014) for an analysis of `2 -penalization and Bunea et al. (2007) for `1 -penalization
(among many others). For regularization in support vector machines (and the link between margin
maximization and risk minimization), see for example Evgeniou et al. (2000).
7. Roughly speaking, assume that you have an increasing family of hypothesis spaces, H = \bigcup_k H_k with H_1 ⊂ · · · ⊂ H_k ⊂ · · ·.
One can minimize the empirical risk in each subspace and evaluate the corresponding bound on the risk (see for
example Eq. (5.6)), and choose the solution with the smallest upper bound on the risk.
Chapter 6
Preprocessing
There are actually plenty of datasets that can be used, a lot of them being listed on the Wikipedia
page https://en.wikipedia.org/wiki/List_of_datasets_for_machine_learning_research.
There are also some online platforms, such as Kaggle1 , on which competitions are proposed
with associated datasets.
In order to keep the ordering of the values when encoding the feature, one may use an increasing
number with the following encoding :
Feature scaling
Given a set {x_0, · · · , x_i, · · · , x_{N−1}} of input vectors in R^d, it is usual that the variables do not
vary in the same ranges. There are several machine learning algorithms that compute Euclidean
distances in input space, such as k-nearest neighbors or support vector machines
with radial kernels. In this situation, if one dimension has a range, say, some orders of magnitude
larger than the others, it will dominate the computation of the distance, which will therefore hardly take
into account the other dimensions. There are also some cases where the learning algorithm can
2 http://archive.ics.uci.edu/ml/datasets/Heart+Disease
converge faster if the features are normalized. An example of this situation is when you perform
gradient descent on, for example, a linear regression model with a quadratic cost^3. If the features
are normalized, the loss will be much more circularly symmetric than if they are not. In this case, the
gradient would point toward the minimum and the algorithm would converge faster. As we
shall see later, for example when speaking about neural networks, we will minimize cost functions
with penalties on the parameters we are looking for. This penalty will actually ensure that we do
not overfit the training set. If we do not standardize the data, the penalty will typically penalize
much more the dimensions which cover the largest range, which is not a desired effect.
There are various ways to ensure that the features belong to the same range. Denoting by
x_{i,j} the j-th feature of the input x_i, you might:

• scale each feature so that it belongs to the range [0, 1]:

\forall i \in [0, N-1], \forall j \in [0, d-1], \quad x'_{i,j} = \frac{x_{i,j} - \min_{k \in [0, N-1]} x_{k,j}}{\max_{k \in [0, N-1]} x_{k,j} - \min_{k \in [0, N-1]} x_{k,j}}

• standardize each feature so that it has zero mean and unit variance:

\forall i \in [0, N-1], \forall j \in [0, d-1], \quad x'_{i,j} = \frac{x_{i,j} - \bar{x}_{.,j}}{\sigma_j}, \quad \text{with} \quad
\bar{x}_{.,j} = \frac{1}{N} \sum_{k=0}^{N-1} x_{k,j}, \quad
\sigma_j = \sqrt{\frac{1}{N} \sum_{k=0}^{N-1} (x_{k,j} - \bar{x}_{.,j})^2}
Remember that feature scaling is performed based on the training data. This has two implications.
The first is that you must remember the parameters of the scaling so that, if you apply the learned
predictor on a new input, you actually make use of the same scaling as the one used for learning
the predictor. Also, as the scaling parameters are computed from the training set, you should in
principle include the scaling step in the whole pipeline that has to be cross-validated.
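A minimal sketch of this practice (in Python with numpy, on purely synthetic data): the scaling parameters are estimated on the training set only, and then reused verbatim on any later data.

```python
import numpy as np

def fit_standardizer(X_train):
    """Compute the scaling parameters (per-feature mean and std) on the training set only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0.0] = 1.0          # guard against constant features
    return mean, std

def apply_standardizer(X, mean, std):
    """Apply the *same* scaling to any later data (validation, test, new inputs)."""
    return (X - mean) / std

rng = np.random.default_rng(0)
X_train = rng.normal(loc=[0.0, 100.0], scale=[1.0, 50.0], size=(200, 2))
X_test = rng.normal(loc=[0.0, 100.0], scale=[1.0, 50.0], size=(50, 2))
mean, std = fit_standardizer(X_train)
X_train_s = apply_standardizer(X_train, mean, std)
X_test_s = apply_standardizer(X_test, mean, std)
```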
which is termed feature extraction. In the next sections we focus on three specific fea-
ture extraction methods, namely Principal Component Analysis (PCA), Kernel PCA and Locally
Linear Embedding. These are derived from different definitions of optimality and transformations
applied to the data.
The LASSO (Least Absolute Shrinkage and Selection Operator, (Tibshirani, 1996b)) algorithm
is one example of embedded method. It adds an \ell_1 penalty to a linear least-squares loss^4. The
resulting loss reads:

R_{emp}^{S}(\theta) + \lambda |\theta|_1 = \frac{1}{N} \sum_{i=0}^{N-1} (y_i - \theta^\top x_i)^2 + \lambda |\theta|_1
The `1 penalty term will promote sparse predictors, i.e. predictors in which the parameter
vector θ will have null coefficients, and therefore performs variable selection.
Searching in the variable space Let us formalize the variable selection problem. The inputs
are vectors x ∈ R^d. A variable of x is one of its d components, which we denote x_j : x =
(x_0 \; x_1 \; ... \; x_{d-1})^\top. We define a set of k out of the d variables as X_{σ,k}:

X_{\sigma,k} = \{ x_j : j \in \sigma^{-1}(1) \}, \quad \sigma : [|1..d|] \to \{0, 1\}, \quad |\sigma^{-1}(1)| = k

The function σ indicates, for each variable, if we should keep it or not. We denote by X_d the set
with all the variables. There are \frac{d!}{k!(d-k)!} possible X_{σ,k} sets with k variables, and the search space
actually contains 2^d sets. We suppose, and this will be detailed later on, that we have a score
function J which gives the quality J(X_{σ,k}) of the k selected variables. Some examples of score
functions will be given in the next paragraphs. Let us introduce some quantities:
• given a set X_{σ,k}, the relevance S^{-}_{X_{σ,k}} of the variables x_j ∈ X_{σ,k} is defined as:

\forall x_j \in X_{\sigma,k}, \quad S^{-}_{X_{\sigma,k}}(x_j) = J(X_{\sigma,k}) - J(X_{\sigma,k} \setminus \{x_j\})
4. The regressor is linear in the input: f(θ, x) = θ^T x.
• given a set X_{σ,k}, the relevance S^{+}_{X_{σ,k}} of the variables x_j ∈ X_d \setminus X_{σ,k} is defined as:

\forall x_j \in X_d \setminus X_{\sigma,k}, \quad S^{+}_{X_{\sigma,k}}(x_j) = J(X_{\sigma,k} \cup \{x_j\}) - J(X_{\sigma,k})
We can now define a few additional notions which allow us to estimate how good it is to add or
remove a variable. Given a set of variables X_{σ,k},

• x_j ∈ X_{σ,k} is the most relevant variable iff

x_j = \operatorname{argmax}_{x_k \in X_{\sigma,k}} S^{-}_{X_{\sigma,k}}(x_k)
[Figure: the lattice of variable subsets explored by the sequential searches, from the empty set ∅ (1 set),
through the subsets of size k (d!/(k!(d−k)!) sets), up to the full set X_d (1 set).]
Figure 6.1: Forward and Backward sequential search (SFS, SBS). See the text for details.
We can now introduce different strategies for building a set of variables X_{σ,k}. Starting from an
empty set of variables X = ∅, one can be greedy with respect to the score and add at each step the
next most relevant variable until getting the desired number of variables. This strategy is called
Sequential Forward Search (SFS). Another possibility is to start from the full set X = X_d and drop
out variables by removing the least relevant variable until getting a desired number of variables.
This strategy is called Sequential Backward Search (SBS). These are the most classical strategies,
illustrated on figure 6.1.
The SFS and SBS are heuristics and not necessarily optimal. It might be for example, following
the SFS strategy, that an added variable renders a previously added variable unnecessary. In
general, there is no guarantee that the SFS and SBS strategies find the optimal set of variables
among the exponentially growing number of configurations (2^d). There are other strategies that
have been proposed in the literature, such as the Sequential Floating Forward Search (SFFS) and Sequential
Floating Backward Search (SFBS) introduced in (Pudil et al., 1993), which combine forward and
backward steps. This algorithm has been extended in (Somol et al., 1999) and additional variants
are presented in (Somol et al., 2010).
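The following sketch illustrates the SFS strategy for a generic score function J; the score used here (based on the residual of a least-squares fit on the selected columns) is only illustrative, and any of the scores discussed next could be plugged in instead.

```python
import numpy as np

def sequential_forward_search(score, d, k):
    """Greedy SFS: start from the empty set and repeatedly add the variable with the
    largest relevance S^+ (i.e. the largest increase of J) until k variables are selected."""
    selected = []
    for _ in range(k):
        remaining = [j for j in range(d) if j not in selected]
        gains = [score(selected + [j]) - score(selected) for j in remaining]
        selected.append(remaining[int(np.argmax(gains))])
    return selected

# Illustrative score J: (negative) residual error of a least-squares fit on the selected columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = 3.0 * X[:, 2] - 2.0 * X[:, 4] + 0.1 * rng.normal(size=100)

def score(subset):
    if not subset:
        return -np.mean(y ** 2)
    theta, *_ = np.linalg.lstsq(X[:, subset], y, rcond=None)
    return -np.mean((y - X[:, subset] @ theta) ** 2)

print(sequential_forward_search(score, d=6, k=2))   # expected to select columns 2 and 4
```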
Filter and Wrapper methods Having introduced strategies for building up a set of variables
Xσ,k , we still need to define the score function J. The filter methods use heuristics in the definition
of J while wrappers define the score function J from an estimation of the real risk of the predictor.
A first example of filter heuristic is the Correlation-based Feature Selection (CFS) (Hall, 1999;
Guyon and Elisseeff, 2003). It looks for a set of variables that balances the correlation between
the variables and the output y to be predicted against the correlation between the variables themselves.
The strategy is to keep features highly correlated with the output, yet uncorrelated to each other.
Namely, given a training set {(x_i, y_i), 0 ≤ i ≤ N − 1}, the score J of a set X_{σ,k} is defined as:

J_{CFS}(X_{\sigma,k}) = \frac{k \, r(X_{\sigma,k}, y)}{\sqrt{k(k-1) \, r(X_{\sigma,k}, X_{\sigma,k})}}

r(X_{\sigma,k}, y) = \frac{1}{k} \sum_{i \in \sigma^{-1}(1)} r(x_i, y)

r(X_{\sigma,k}, X_{\sigma,k}) = \frac{1}{k(k-1)} \sum_{i, j \in \sigma^{-1}(1), \, i \neq j} r(x_i, x_j)

where r(x_i, y) (resp. r(x_i, x_j)) denotes the correlation, computed over the training set, between the i-th variable and the output (resp. between the i-th and j-th variables). One can use other correlation measures,
such as mutual information (Hall, 1999). Other filter heuristics have been proposed and some of
them are reviewed in (Zhao et al., 2008).
The wrappers use an estimation of the real risk as the score function J. Denote x^σ ∈ R^k
the vector for which only the k components in X_{σ,k} are retained. Suppose we have a training
set {(x_i, y_i), 0 ≤ i ≤ N − 1} and a validation set {(x'_i, y'_i), 0 ≤ i ≤ M − 1}. Let us denote f̂ the
predictor obtained from a given learning algorithm on the training set {(x^σ_i, y_i), 0 ≤ i ≤ N − 1}.
The score J(X_{σ,k}) can be defined as:

J(X_{\sigma,k}) = \frac{1}{M} \sum_{i=0}^{M-1} L\big(y'_i, \hat{f}(x'^{\sigma}_i)\big)

with L a given loss (usually strongly dependent on the considered learning algorithm). Other esti-
mations of the real risk could be used, such as K-fold cross-validation or bootstrapping.
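A minimal sketch of such a wrapper score; the learning algorithm (here a linear least-squares fit) and the \ell_2 loss are illustrative choices, and, with this convention, a smaller value of J indicates a better subset.

```python
import numpy as np

def wrapper_score(train, valid, subset, fit, loss):
    """Wrapper score J(X_{sigma,k}): train a predictor on the selected columns of the
    training set, then average the loss L over a held-out validation set."""
    (X_tr, y_tr), (X_va, y_va) = train, valid
    cols = list(subset)
    predictor = fit(X_tr[:, cols], y_tr)
    return np.mean(loss(y_va, predictor(X_va[:, cols])))

def fit_least_squares(X, y):
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda X_new: X_new @ theta

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5)); y = X[:, 1] + 0.1 * rng.normal(size=150)
train, valid = (X[:100], y[:100]), (X[100:], y[100:])
print(wrapper_score(train, valid, [1, 3], fit_least_squares, lambda a, b: (a - b) ** 2))
```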
Embedded methods The embedded methods make use of the specifics of the learning algorithm
you consider and are therefore usually dependent on this chosen algorithm. For example, the
support vector machines (SVM) can be considered as an embedded approach as they choose the
support vectors among the training set (in which case, basis functions are selected rather than
variables by themselves). LASSO (Tibshirani, 1996b) is a learning algorithm whose \ell_1
penalty term constrains the weights of some of the variables to be zero. Related to LASSO, some
other methods use a similar penalty, such as LARS (Efron et al., 2004a) or the elastic net (Zou and Hastie,
2003). Regression trees^5 (Breiman et al., 1983) also have internal mechanisms for variable selection.
Another approach, which we might rather consider as a model selection approach (in the sense
that it selects an hypothesis space, which is actually identical to variable selection when the model
is linear), is called complexity regularization. It consists in adding a penalty term to the empirical
risk to be minimized, this penalty term being input data dependent (Barron, 1988; Bartlett et al.,
2002; Wegkamp, 2003; Lugosi and Wegkamp, 2004). The rationale is to get Vapnik-Chervonenkis
5 See chapter 20
type bounds (see Chapter 5) that actually depend on the considered data, thereby allowing
a better estimation of the risk for selecting a model, with theoretical guarantees. This type of
approach is complex and mathematically involved and will not be detailed further.
Figure 6.2: Given a set of data points, we seek a line on which to project the data so that the
sum of the norms of the vectors (red dashed) connecting the data points to their projection is
minimized. This is the problem we solve when we are looking for the first principal component of
the data points.
Let us denote by W ∈ M_{d,r}(R) the matrix in which the columns are the vectors {w_1, · · · , w_i, · · · , w_r}:

W = \big( w_1 \; | \; \cdots \; | \; w_i \; | \; \cdots \; | \; w_r \big)
6. There exist other equivalent ways to derive the principal components, such as finding projection axes maximizing
the variance of the projected data. We come back to this equivalence later in this section.
The above matrix form could have been written directly if we keep in mind that, given that {w_1, · · · , w_j, · · · , w_r}
are orthogonal unit vectors, W W^T is the matrix of the orthogonal projection on the r-dimensional sub-
space generated by {w_1, · · · , w_j, · · · , w_r}. The norm of the residual x − x̂, where x̂ = W W^T x is
the projection of x on the subspace generated by {w_1, · · · , w_j, · · · , w_r}, is |(I − W W^T) x|_2.
Before expanding a little bit the expression inside (6.2), we first note that, as W^T W = I_r,

(W W^T)(W W^T) = W (W^T W) W^T = W W^T.

The matrix W W^T is therefore idempotent. For any idempotent matrix M, we also know that
I − M is idempotent, as (I − M)^2 = I − 2M + M^2 = I − M. Putting all the previous steps together,
we end with the reconstruction error

\sum_{i=0}^{N-1} \big| (I_d - W W^T)(x_i - w_0) \big|^2

to be minimized with respect to w_0 and W.
Deriving w0
We now solve eq. (6.2) with respect to w0 by computing the derivative with respect to w0 and
setting it to zero7 .
\frac{d}{dw_0} \sum_{i=0}^{N-1} \big| (I_d - W W^T)(x_i - w_0) \big|^2 = -2 \sum_{i=0}^{N-1} (I_d - W W^T)(x_i - w_0)
= -2 (I_d - W W^T) \sum_{i=0}^{N-1} (x_i - w_0)

\frac{d}{dw_0} \sum_{i=0}^{N-1} \big| (I_d - W W^T)(x_i - w_0) \big|^2 = 0
\iff (I_d - W W^T) \sum_{i=0}^{N-1} (x_i - w_0) = 0
\iff (I_d - W W^T) \Big( \frac{1}{N} \sum_{i=0}^{N-1} x_i - w_0 \Big) = 0
The vectors u satisfying (Id − W.WT ).u = 0 are the vectors belonging to the subspace generated
by the column vectors of W, i.e. by the vectors {w1 , · · · , wj , · · · , wr }. This actually makes
sense if one thinks of how an affine subspace is defined : the origin w0 of the affine subspace
can be translated by any linear combination of the vectors {w1 , · · · , wj , · · · , wr } and we still
7. For computing the derivative, we note that for any vector x, matrix A, and vector functions u(x), v(x),
\frac{d\, u^T A v}{dx} = \frac{du}{dx} A v + \frac{dv}{dx} A^T u.
get the same affine subspace. For the 1D example on Fig. 6.2, this means translating w_0 along
w_1. Finally, we note that the value of the function (6.2) to be minimized is the same for any
w_0 = \frac{1}{N} \sum_{i=0}^{N-1} x_i + h, with h ∈ span{w_1, · · · , w_j, · · · , w_r}.
We can also note that the optimization problem defined by eq. (6.2) is the same whatever w_0, as
soon as w_0 = \frac{1}{N} \sum_{i=0}^{N-1} x_i + h, h ∈ span{w_1, · · · , w_j, · · · , w_r}. So let us take w_0 = \frac{1}{N} \sum_{i=0}^{N-1} x_i = \bar{x},
which means that the data get centered by the sample mean of {x_0, · · · , x_i, · · · , x_{N−1}} before being
projected. If we look back at the example drawn on fig 6.2, we clearly see that we could have moved
w_0 along w_1 without changing the line on which the data are projected. In general, the fact that
w_0 is defined up to an h ∈ span{w_1, · · · , w_j, · · · , w_r} simply means that the origin of the hyperplane
span{w_1, · · · , w_j, · · · , w_r} can be defined up to a translation within this hyperplane.
For simplicity, let us denote x̃_i = x_i − x̄ and let us expand a little bit the inner term of eq (6.3):

\big| (I_d - W W^T) x̃_i \big|^2 = x̃_i^T (I_d - W W^T) x̃_i = x̃_i^T x̃_i - x̃_i^T W W^T x̃_i

where we write X̃ = (x̃_0 | · · · | x̃_{N−1}), i.e. the matrix with columns x̃_i; the matrix X̃X̃^T is the so-called sam-
ple covariance matrix (up to a normalization factor). Therefore the minimization problem (6.3) is equivalent to the following
maximization problem:

\max_{W \in M_{d,r}(R)} \sum_{j=1}^{r} w_j^T X̃ X̃^T w_j \quad \text{subject to} \quad W^T W = I_r \quad (6.4)
Now, to begin simply and drive our intuition, we shall have a look at the solution when we are
looking for only one principal component vector. In this case, the optimization problem reads

\max_{w_1} w_1^T X̃ X̃^T w_1 \quad \text{subject to} \quad w_1^T w_1 = 1,

with the associated Lagrangian L(w_1, λ_1) = w_1^T X̃ X̃^T w_1 − λ_1 (w_1^T w_1 − 1).
Looking for the critical points, i.e. where the gradient vanishes, we get:

\frac{\partial L}{\partial \lambda_1} = 0 \Rightarrow w_1^T w_1 = 1

\frac{\partial L}{\partial w_1} = 0 \Rightarrow X̃ X̃^T w_1 = \lambda_1 w_1

We therefore conclude that w_1 is an eigenvector of the sample covariance matrix X̃X̃^T with
corresponding eigenvalue λ_1. Now, the question is: which eigenvector should we consider? Well,
to answer that question, we just need to look at what it means that w_1 is an eigenvector for the
term we wish to maximize (6.4):

w_1^T X̃ X̃^T w_1 = \lambda_1 w_1^T w_1 = \lambda_1,

which is maximized for the largest eigenvalue of the sample covariance matrix. So, to conclude
this first part:

The first principal component vector is a normalized eigenvector associated with the
largest eigenvalue of the sample covariance matrix X̃X̃^T.
Now let us have a look at the problem of finding two components. The problem to be solved
is

\max_{w_1, w_2} w_1^T X̃ X̃^T w_1 + w_2^T X̃ X̃^T w_2 \quad \text{subject to} \quad w_1^T w_1 = w_2^T w_2 = 1, \; w_1^T w_2 = 0.

One first thing we can note is that the sum of the variances of the projections can be written in a
slightly different way, taking into account the fact that w_1 and w_2 are orthogonal, i.e. w_1^T w_2 = 0.
The residual of the orthogonal projection of the data X̃ over the vector w_1 is (I_d − w_1 w_1^T) X̃ and:

w_2^T \big( (I_d - w_1 w_1^T) X̃ \big) \big( (I_d - w_1 w_1^T) X̃ \big)^T w_2 = w_2^T (I_d - w_1 w_1^T) X̃ X̃^T (I_d - w_1 w_1^T) w_2

(I_d - w_1 w_1^T) w_2 = w_2 - w_1 (w_1^T w_2) = w_2

\Rightarrow w_2^T \big( (I_d - w_1 w_1^T) X̃ \big) \big( (I_d - w_1 w_1^T) X̃ \big)^T w_2 = w_2^T X̃ X̃^T w_2

Therefore the variance of the data projected on w_2 equals the variance of the residuals, after pro-
jection on w_1, projected on w_2. We can then proceed iteratively, and this greedy algorithm would
lead to selecting the normalized eigenvectors associated with the largest eigenvalues of the sample
covariance matrix (the correctness of this greedy procedure is justified by Theorem 6.1 below).
The solution to the PCA is not unique. Indeed, if w_1 is an eigenvector of M, then −w_1 is
also an eigenvector of M. In terms of the principal components, this means that they are at least
defined up to a sign. For example, on the illustration 6.2, we draw w_1 but we may have considered
−w_1 as well. Also, if the matrix X̃X̃^T has eigenvalues with a multiplicity larger than 1, since
the eigenvectors associated with an eigenvalue λ of multiplicity k_λ of a symmetric matrix M
span a subspace of dimension k_λ, any orthonormal basis of this subspace provides a valid solution
to the PCA problem.
Theorem 6.1. For any symmetric positive semi-definite matrix M ∈ M_{d,d}(R), denote {λ_j}_{j=1}^{d}
its eigenvalues, with λ_1 ≥ λ_2 ≥ · · · ≥ λ_d ≥ 0. For any set of r ∈ [|1, d|] orthogonal unit vectors
{v_1, · · · , v_j, · · · , v_r}, we have:

\sum_{j=1}^{r} v_j^T M v_j \leq \sum_{j=1}^{r} \lambda_j

and this upper bound is reached by eigenvectors associated with the r largest eigenvalues of M.
Proof. Suppose we have r orthogonal unit vectors {v_1, · · · , v_j, · · · , v_r} on which we project the
data. We want to compute the maximal value of \sum_{j=1}^{r} v_j^T M v_j.
Since the matrix M is real symmetric, there exists an orthonormal basis of R^d made of eigenvectors. Let us
denote this basis {e_1, · · · , e_i, · · · , e_d} and the associated eigenvalues {λ_1, · · · , λ_i, · · · , λ_d}. Denote
β_{i,j} = e_i^T v_j the coordinates of v_j in this basis:

\forall j \in [|1, r|], \quad v_j = \sum_{i=1}^{d} \beta_{i,j} e_i

This leads to the inequality ∀i, \sum_{j=1}^{r} \beta_{i,j}^2 \leq 1, which can be injected in equation (6.5):

\sum_{j=1}^{r} v_j^T M v_j = \sum_{i=1}^{d} \lambda_i \sum_{j=1}^{r} \beta_{i,j}^2 \leq \sum_{i=1}^{d} \lambda_i.

Using additionally that the weights c_i = \sum_{j=1}^{r} \beta_{i,j}^2 satisfy 0 ≤ c_i ≤ 1 and \sum_{i=1}^{d} c_i = \sum_{j=1}^{r} \|v_j\|^2 = r,
the quantity \sum_{i} \lambda_i c_i is maximized by putting the full weight on the r largest eigenvalues, so that
\sum_{j=1}^{r} v_j^T M v_j \leq \sum_{j=1}^{r} \lambda_j, as claimed.
It remains to apply the previous theorem to the symmetric positive semi-definite matrix X̃X̃^T
to conclude that eigenvectors associated with the largest eigenvalues are indeed a solution to our
optimization problem.
Finally, considering only r eigenvalues, the sample variance of the projected data equals the frac-
tion \sum_{i=0}^{r-1} \lambda_i / \sum_{i=0}^{d-1} \lambda_i of the original sample variance (the λ_i being sorted by decreasing magnitude).
X̃ = U D V^T
\Rightarrow X̃ X̃^T = U D V^T V D^T U^T = U D D^T U^T = U D D^T U^{-1}

We recognize the diagonalization of X̃X̃^T, the eigenvectors of X̃X̃^T being the column vectors
of U. In the singular value decomposition, we suppose that the diagonal elements of D are
ordered by decreasing magnitude, and implementations usually behave that way. Therefore,
the vectors {w_1, · · · , w_j, · · · , w_r} we are looking for are the first r column vectors of U: W =
(U_1 | U_2 | · · · | U_r). The principal components are the projections of the data over these axes,
which are given by the first r rows of D V^T.
Algorithm 9 gives all the steps for performing an SVD-based PCA. An example of application
of this algorithm on artificial data is shown on Fig. 6.3a, as well as the application of PCA to 5000
samples from the handwritten MNIST dataset on Fig. 6.3b. From the example over the MNIST
dataset, one can observe that the linear projection reveals some isolated clusters (for the digits 0
and 1) but others remain interleaved. Do not be misled, PCA is an unsupervised algorithm as
it does not take into account the labels for finding the projections. The labels are added to the
figures after the PCA is performed, and it turns out that some classes are revealed as isolated.
Using only two principal components, the PCA captures 4.48% of the variance, computed as \frac{\lambda_0 + \lambda_1}{\sum_i \lambda_i},
where the λ_i are the eigenvalues of the sample covariance matrix, ordered by decreasing magnitude.
If the centered data are stacked as the columns of X̃, then, from the SVD X̃ = UDV^T
(the singular values being ordered by decreasing magnitude on the diagonal of D), one
finds the r first principal components as the r first rows of DV^T and the projection
vectors as the r first columns of U.
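A minimal numpy sketch of this SVD-based PCA; the synthetic data and the choice r = 1 are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 72)) * np.array([[2.0], [0.3]])   # d = 2 features, N = 72 samples as columns

# Center the data and stack the centered vectors as the columns of X_tilde.
w0 = X.mean(axis=1, keepdims=True)
X_tilde = X - w0

# SVD: X_tilde = U D V^T, singular values sorted in decreasing order by numpy.
U, D, Vt = np.linalg.svd(X_tilde, full_matrices=False)

r = 1
W = U[:, :r]                        # projection vectors: first r columns of U
components = (np.diag(D) @ Vt)[:r]  # principal components: first r rows of D V^T
explained = (D[:r] ** 2).sum() / (D ** 2).sum()   # fraction of variance captured
print(W, explained)
```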
(a) The PCA of a set of datapoints in R^2. There are N = 72 datapoints in d = 2
dimensions. On the left, the original datapoints are shown. The mean of the data
w_0 is shown as the red dot. The data are centered and stacked in the rows of X̃.
From the SVD decomposition of X̃ = UDV^T, the two columns of V are extracted
and shown as the green and blue arrows. The projection of the datapoints on the two
principal vectors, UD, is shown on the figure on the right.
(b) The PCA applied to 1000 samples from the MNIST handwritten digits dataset.
Each colored point represents one 28 × 28 image, the color indicating the associated
label. Do not be misled, the labels are not used for the PCA, which is an unsuper-
vised technique. The labels are used after applying the PCA, just to get an idea of
how the digits are clustered. The two first components computed from 5000 samples
actually capture only 4.48% of the variance.
(c) The 10 first principal vectors when applying PCA to 5000 samples of the MNIST
handwritten digits dataset. All the 28 × 28 vectors are normalized and have been
saturated in the range [-0.10, 0.10]. We can better appreciate, for example, why the 0 and
1 have, respectively, a negative and positive first component.
and also the same optimization problem as (6.4), which we found when defining the PCA as
minimizing the reconstruction error. We can then conclude that the optimal projection vectors
are the eigenvectors of the sample covariance matrix Σ = \frac{1}{N} X̃X̃^T associated with the r largest
eigenvalues.
where X = [x_0 − x̄ | x_1 − x̄ | · · · | x_{N−1} − x̄], i.e. the vectors x_i − x̄ are the column vectors of X.
The sample covariance matrix is symmetric: Σ^T = Σ. The eigenvalues of any sample covariance
matrix are non-negative. For any eigenvalue-eigenvector pair (λ, v) of Σ, we have:

\lambda v = \Sigma v = \frac{1}{N-1} X X^T v

\Rightarrow \lambda |v|_2^2 = \lambda v^T v = \frac{1}{N-1} v^T X X^T v = \frac{1}{N-1} (X^T v)^T (X^T v) = \frac{1}{N-1} \big| X^T v \big|_2^2

from which it follows λ ≥ 0.
where X = [x0 |x1 | · · · |xN −1 ], i.e. the vectors xi are the column vectors of X. The gram matrix is
symmetric : GT = G.
Euclidean distance matrix. The Euclidean distance matrix of a set of vectors {x_0, · · · , x_i, · · · , x_{N−1}} ∈
R^d is the N × N matrix D whose elements are the squared euclidean distances between the vectors
x_i, namely D_{ij} = |x_i − x_j|^2:

D = \begin{pmatrix} 0 & |x_0 - x_1|^2 & \cdots & |x_0 - x_{N-1}|^2 \\ \vdots & \ddots & & \vdots \\ |x_{N-1} - x_0|^2 & |x_{N-1} - x_1|^2 & \cdots & 0 \end{pmatrix}
The Euclidean distance matrix D is symmetric and has zeros in the main diagonal : DT = D,
∀i, Dii = 0.
Relationship between euclidean distance and gram matrices Let us now detail how the
covariance, gram and euclidean distance matrices are related. The euclidean distances can be
expressed from scalar products only:

\forall i, j, \quad |x_i - x_j|^2 = (x_i - x_j)^T (x_i - x_j) = x_i^T x_i + x_j^T x_j - 2 x_i^T x_j

Therefore, the euclidean distance matrix and gram matrix are related by:

\forall i, j, \quad D_{ij} = G_{ii} + G_{jj} - 2 G_{ij}.

Actually, the euclidean distance matrix can be built from the gram matrix but the converse is not
true: the same euclidean distance matrix can be obtained from different configurations of the input
vectors, and hence from different gram matrices. For example:

X = \begin{pmatrix} 1 & \frac{1}{2} \\ 0 & \frac{\sqrt{3}}{2} \end{pmatrix}, \quad
D = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad
G = \begin{pmatrix} 1 & \frac{1}{2} \\ \frac{1}{2} & 1 \end{pmatrix}

X = \begin{pmatrix} \sqrt{2} & \frac{3}{2\sqrt{2}} \\ 0 & \sqrt{\frac{7}{8}} \end{pmatrix}, \quad
D = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad
G = \begin{pmatrix} 2 & \frac{3}{2} \\ \frac{3}{2} & 2 \end{pmatrix}
More generally, the distance matrix is invariant to translation while the gram matrix is not.
Indeed, translating every vector by t leaves the distances |x_i − x_j| unchanged, while the scalar products become
(x_i + t)^T (x_j + t) = x_i^T x_j + t^T x_i + t^T x_j + |t|^2.
It is actually sufficient to define an origin for a set of vectors so that their Gram matrix can
be deduced from the distance matrix. We can indeed recover the gram matrix of the vectors
x_i − \frac{1}{N} \sum_j x_j from the distance matrix, and one can show that it is given by:

G = -\frac{1}{2} H D H

where H is a so-called centering matrix defined by:

H = I - \frac{1}{N} e e^T

with e a vector full of 1, i.e. H_{i,j} = \delta_{i,j} - \frac{1}{N}. The above transformation leading from D to G is
called double centering.
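A small numerical check of the double centering formula (in Python with numpy, on synthetic data of illustrative sizes).

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 3, 5
X = rng.normal(size=(d, N))                       # columns are the vectors x_i

# Squared Euclidean distance matrix D built from the vectors.
sq_norms = (X ** 2).sum(axis=0)
D = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X.T @ X

H = np.eye(N) - np.ones((N, N)) / N               # centering matrix H = I - (1/N) e e^T
G_from_D = -0.5 * H @ D @ H                       # double centering

X_centered = X - X.mean(axis=1, keepdims=True)
print(np.allclose(G_from_D, X_centered.T @ X_centered))   # True: Gram matrix of the centered vectors
```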
Relationship between the covariance and gram matrices In order to stress an interesting
relationship between the covariance and gram matrices that will be used in the next section for
extending PCA to kernel PCA, we need a bit of linear algebra.
Lemma 6.1. ∀A ∈ R^{n×m}, ker(A) = ker(A^T A).
Proof. Let us consider A ∈ R^{n×m}. It is clear that

\forall x \in R^m, \quad Ax = 0 \Rightarrow A^T A x = 0.

Conversely,

A^T A x = 0 \Rightarrow x^T A^T A x = 0 \Leftrightarrow (Ax)^T Ax = 0 \Leftrightarrow |Ax|_2^2 = 0 \Leftrightarrow Ax = 0.
We now recall the rank-nullity theorem, in the context of matrices, although it could have been
stated in a more general way with linear maps.
Theorem 6.2 (Rank-nullity). ∀A ∈ R^{n×m}, rk(A) + dim(ker(A)) = m.
We can demonstrate a theorem linking the rank of the covariance and gram matrices.
Theorem 6.3. ∀A ∈ R^{n×m}, rk(A^T A) = rk(AA^T) ≤ min(n, m).
Proof. By applying Lemma 6.1 and the rank-nullity theorem:

\forall A \in R^{n \times m}, \quad rk(A) + \dim(\ker(A)) = m = rk(A^T A) + \dim(\ker(A^T A))
\Rightarrow rk(A) = rk(A^T A) \; \text{and} \; rk(A) \leq m.

Applying Lemma 6.1 and the rank-nullity theorem to the matrix A^T:

rk(A^T) + \dim(\ker(A^T)) = n = rk(AA^T) + \dim(\ker(AA^T))
\Rightarrow rk(A^T) = rk(AA^T) \; \text{and} \; rk(A^T) \leq n.

We also know that the column rank and the row rank are equal, and therefore rk(A) = rk(A^T),
which ends the proof.
Given that the sample covariance matrix Σ = \frac{1}{N-1} XX^T has the same rank as (N − 1)Σ = XX^T,
and given the gram matrix G = X^T X, applying the previous theorem leads to:

rk(\Sigma) = rk(XX^T) = rk(X^T X) = rk(G) \leq \min(d, N)

Given that the covariance and gram matrices have the same rank, they have the same number of
nonzero eigenvalues^8. There is actually an even stronger property, which is:
Lemma 6.2 (Eigenvalues of the covariance and gram matrices). The nonzero eigenvalues of the
scaled covariance matrix (N − 1)Σ = XX^T and of the gram matrix G = X^T X are the same.
Proof. Consider a nonzero eigenvalue λ ∈ R^* of XX^T. There exists v ≠ 0, XX^T v = λv. Left-
multiplying by X^T gives X^T X X^T v = λ X^T v. Denoting w = X^T v, we have X^T X w = λw. If
w = X^T v = 0, we have λv = XX^T v = Xw = 0 and therefore v = 0 as λ ≠ 0, which is in
contradiction with our hypothesis. So, necessarily, w ≠ 0, and (λ, X^T v) is an eigenvalue-eigenvector
pair of X^T X = G. We have therefore demonstrated the inclusion of the nonzero eigenvalues of XX^T
in the nonzero eigenvalues of X^T X. Since the two sets have the same number of elements (the two
matrices having the same rank), they are actually equal.
During the demonstration, we also showed that if (λ, v) ∈ R × Rd is an eigenvalue-eigenvector
pair of XXT , then (λ, XT v) ∈ R × RN is an eigenvalue-eigenvector pair of XT X. Conversely if
(λ, w) ∈ R × RN is an eigenvalue-eigenvector pair of XT X, then (λ, Xw) ∈ R × Rd is an eigenvalue-
eigenvector pair of XXT .
engendered by the columns or the rows of A, which has the same dimension as the space engendered by the
columns or the rows of D, and therefore equals the number of nonzero eigenvalues of A.
3. compute the r normalized eigenvectors vj associated with the r largest eigenvalues of the
matrix X̃X̃T ∈ Rd×d
When the number of features d increases, the covariance matrix can become quite large and the
computation of the eigenvectors can be cumbersome. In light of Lemma 6.2, it turns out that
the eigenvectors of X̃X̃^T can be computed from the eigenvectors of X̃^T X̃ ∈ R^{N×N}. In case N ≪ d,
it is much less expensive to compute these eigenvectors. We can therefore state another equivalent
way of computing PCA:
1. center the input vectors: x̄ = \frac{1}{N} \sum_{i=0}^{N-1} x_i, \quad x̃_i = x_i − x̄
We can reformulate a little bit this procedure by making use only of scalar products, and this
will turn out to be very useful for deriving a non-linear extension of PCA :
1. center the input vectors: x̄ = \frac{1}{N} \sum_{i=0}^{N-1} x_i, \quad x̃_i = x_i − x̄
2. compute the r normalized eigenvectors w_j ∈ R^N associated with the r largest eigenvalues λ_j of the
matrix X̃^T X̃ ∈ R^{N×N}

3. project your data on the r normalized eigenvectors \frac{X̃ w_j}{|X̃ w_j|} = \frac{1}{\sqrt{\lambda_j}} X̃ w_j \,^9: for any v ∈ R^d,

\Big(\frac{1}{\sqrt{\lambda_j}} X̃ w_j\Big) \cdot v = \frac{1}{\sqrt{\lambda_j}} w_j \cdot \begin{pmatrix} x̃_0 \cdot v \\ \vdots \\ x̃_{N-1} \cdot v \end{pmatrix}
\varphi : R^d \to \Phi, \quad x \mapsto \varphi(x)

Typically, Φ can be a vector space of larger dimension (even infinite) than the input space, e.g.
Φ = R^m, m ≫ d. Let us assume for now (but we come back soon to this point) that the transformed
data are centered, i.e.

\frac{1}{N} \sum_{i=0}^{N-1} \varphi(x_i) = 0
Let us now perform a linear PCA on the transformed data {ϕ (x0 ) , · · · , ϕ (xi ) , · · · , ϕ (xN −1 )}.
To do so, we will follow the procedure given in the previous section when working only with scalar
products :
2
9 X̃v
= X̃vjT X̃vj = vjT X̃T X̃vj = λj vjT vj = λj
j
1. center the input vectors {ϕ (x0 ) , · · · , ϕ (xi ) , · · · , ϕ (xN −1 )} : there is nothing to do since we
consider, for now, that the mapped vectors are centered (at the end of the section, we come
back to the case the mapped vectors are not centered)
2. compute the r normalized eigenvectors wk ∈ RN associated with the r largest eigenvalues λk
of the Gram matrix G ∈ RN ×N :
G = \begin{pmatrix}
\varphi(x_0) \cdot \varphi(x_0) & \varphi(x_0) \cdot \varphi(x_1) & \cdots & \varphi(x_0) \cdot \varphi(x_{N-1}) \\
\varphi(x_1) \cdot \varphi(x_0) & \varphi(x_1) \cdot \varphi(x_1) & \cdots & \varphi(x_1) \cdot \varphi(x_{N-1}) \\
\vdots & \vdots & \ddots & \vdots \\
\varphi(x_{N-1}) \cdot \varphi(x_0) & \varphi(x_{N-1}) \cdot \varphi(x_1) & \cdots & \varphi(x_{N-1}) \cdot \varphi(x_{N-1})
\end{pmatrix}
Let us have a look at what the last point, the projection of a vector onto a principal direction to get the component,
looks like:

\forall v \in R^d, \quad \Big(\frac{1}{\sqrt{\lambda_k}} \varphi(X) w_k\Big) \cdot \varphi(v) = \frac{1}{\sqrt{\lambda_k}} w_k \cdot \begin{pmatrix} \varphi(x_0) \cdot \varphi(v) \\ \varphi(x_1) \cdot \varphi(v) \\ \vdots \\ \varphi(x_{N-1}) \cdot \varphi(v) \end{pmatrix}

(where \varphi(X) denotes the matrix whose columns are the \varphi(x_i)).
Therefore, the k-th principal component is computed solely from scalar products between the
vectors mapped into the feature space Φ. Actually, the only thing we need to compute when
performing the PCA in the feature space is scalar products between vectors in feature space, since
the Gram matrix, from which we extract the eigenvectors, is also computed only from scalar
products in the feature space. We never need to compute explicitly the feature vectors ϕ(v) ∈ Φ
and only need to evaluate scalar products between two elements of Φ. This algorithm is known as
Kernel PCA (Scholkopf et al., 1999).
The fact that all we need to compute is scalar products in the feature space implies that we
can employ the so-called kernel trick, which implicitly defines the mapping function ϕ from the
definition of a so-called kernel function k. Not every function k is a kernel, as it does not always
correspond to the scalar product in a feature space. However, there are some conditions, known
as Mercer’s theorem, which determine when a function k is actually a kernel. This is explained
in detail in Chapter 10. For our purpose, we just introduce briefly some known kernels:

• the polynomial kernel k_d(x, x') = (x \cdot x' + c)^d

• the gaussian, or RBF, kernel k_{rbf}(x, x') = \exp\Big( -\frac{|x - x'|^2}{2\sigma^2} \Big)

It can be shown that the gaussian kernel actually projects the data into an infinite-dimensional
space. The reader is referred to Chapter 10 for more details on kernels.
The last point that is not yet solved is : what about feature vectors that are actually not
centered in the feature space ? One can actually work with uncentered feature vectors if we change
the kernel :
\forall i, j, \quad \Big( \varphi(x_i) - \frac{1}{N} \sum_{p=0}^{N-1} \varphi(x_p) \Big) \cdot \Big( \varphi(x_j) - \frac{1}{N} \sum_{p=0}^{N-1} \varphi(x_p) \Big)

= \varphi(x_i) \cdot \varphi(x_j) - \frac{1}{N} \sum_{p=0}^{N-1} \varphi(x_i) \cdot \varphi(x_p) - \frac{1}{N} \sum_{p=0}^{N-1} \varphi(x_j) \cdot \varphi(x_p) + \frac{1}{N^2} \sum_{p=0}^{N-1} \sum_{t=0}^{N-1} \varphi(x_p) \cdot \varphi(x_t)

= k(x_i, x_j) - \frac{1}{N} \sum_{k=0}^{N-1} \big( k(x_i, x_k) + k(x_k, x_j) \big) + \frac{1}{N^2} \sum_{p=0}^{N-1} \sum_{t=0}^{N-1} k(x_p, x_t)
Therefore we can introduce the kernel k̃, which takes as input the input vectors {x_0, · · · , x_i, · · · , x_{N−1}} ∈
R^d and computes the scalar product between the centered feature vectors. It can be shown (Scholkopf
et al., 1999) that the associated Gram matrix K̃ is defined as:

K̃ = \Big(I_N - \frac{1}{N} \mathbf{1}\Big) K \Big(I_N - \frac{1}{N} \mathbf{1}\Big)

where I_N is the identity matrix and \mathbf{1} is the square N × N matrix with all entries set to 1.
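A minimal sketch of kernel PCA with an RBF kernel (in Python with numpy); the data, the bandwidth and the number of components are illustrative, and only the components of the training points themselves are computed.

```python
import numpy as np

def rbf_kernel(X, sigma):
    """RBF kernel matrix K_ij = exp(-|x_i - x_j|^2 / (2 sigma^2)); rows of X are the inputs."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kernel_pca(K, r):
    """Kernel PCA from a Gram matrix K: double centering, then projection on the
    r leading eigenvectors, scaled by 1/sqrt(lambda_k) as in the linear case."""
    N = K.shape[0]
    one = np.ones((N, N)) / N
    K_tilde = K - one @ K - K @ one + one @ K @ one     # centered Gram matrix
    eigval, eigvec = np.linalg.eigh(K_tilde)            # ascending eigenvalues
    eigval, eigvec = eigval[::-1][:r], eigvec[:, ::-1][:, :r]
    return K_tilde @ eigvec / np.sqrt(eigval)           # components of the training points

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
Y = kernel_pca(rbf_kernel(X, sigma=1.5), r=2)           # 200 x 2 embedding
```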
Let us now apply K-PCA to the MNIST dataset. Figure 6.4 illustrates the two first principal
components of 5000 digits from the MNIST dataset using an RBF kernel with σ = 4.8,
which corresponds to the mean euclidean distance in the image space between the considered digits
and their closest neighbor. This non-linear projection captures 7% of the variability of the original
data, compared to the linear PCA (fig 6.3b) which captured only around 5%. At the time of writing
this section, it is not completely clear to which extent the kernel PCA brings any improvement
over the PCA when applied to the MNIST dataset. However, there are some datasets for which
k-PCA appears superior to PCA in feature extraction, and the reader is referred for example to
(Scholkopf et al., 1999), in which k-PCA and linear PCA are compared in the context of extracting
features feeding a classifier and where it is shown that extracting features with k-PCA leads to
better classification performance.
Figure 6.4: The Kernel PCA applied to 5000 samples from the MNIST handwritten digits dataset using a RBF kernel with σ = 4.8, corresponding to the mean euclidean distance between the considered images and their closest neighbor. Each colored point represents one 28 × 28 image, the color indicating the associated label. Do not be misled, the labels are not used for the k-PCA, which is an unsupervised technique. The labels are used after applying the k-PCA, just to get an idea of how the digits are clustered. The first two components computed from 5000 samples capture approximately 7% of the variance.
Note that from this asymmetric definition, within SNE, the fact that xi is picked as the neighbor of xj does not imply that xj is picked as the neighbor of xi. While in SNE the similarity is directly taken as pij = pj/i, in t-SNE the similarities are symmetrized:
pij = ( pj/i + pi/j ) / (2N)
The similarities in the low-dimensional space could have been defined similarly. However, as argued in (van der Maaten and Hinton, 2008), using a gaussian for defining the similarities puts too many constraints on the locations of the projections in the low-dimensional space. They propose instead to use a t-Student distribution with 1 degree of freedom, which leads to defining the similarities qij as:
∀i, j, qij = (1 + |yi − yj|²)^−1 / Σ_{k≠l} (1 + |yk − yl|²)^−1
The t-Student distribution with one degree of freedom (or Cauchy distribution) has a heavier tail and allows the datapoints to be pushed a little further apart in the low-dimensional space.
Once the similarities have been defined, it remains to introduce the criterion to be optimized. The dissimilarity between the similarities pij and qij can be estimated with the Kullback-Leibler divergence and reads:
C = Σ_{i,j} pij log( pij / qij )
One can then minimize C with respect to the points yi in the low-dimensional space by performing a gradient descent (with momentum) of C with respect to the yi. The gradient reads (see (van der Maaten and Hinton, 2008)):
∂C/∂yi = 4 Σ_j (pij − qij) (1 + |yi − yj|²)^−1 (yi − yj)
The complexity of the algorithm is quadratic because all the pairwise distances need to be computed. In (van der Maaten, 2014), an improvement of the complexity of the algorithm is introduced with an approximation. It is based on the following idea, which makes the evaluation of the gradient faster: looking at the gradient for one point yi, one can see it as a sum of influences or forces which push or pull the point yi. When some points are far away from yi, one can approximate their individual forces by a single one originating from the center of mass of these points. By approximating the individual contributions of several far points into a single one, one can use the Barnes-Hut approximation used in physics and improve the complexity from O(N²) to O(N log N). Applying t-SNE to 5000 digits of MNIST is shown in fig. 6.5.
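As a hedged sketch of how such a projection can be obtained in practice (assuming scikit-learn is available; the small 8×8 digits dataset is used here as a stand-in for MNIST, and the perplexity value is only illustrative):

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, labels = load_digits(return_X_y=True)      # small 8x8 digit images
Y = TSNE(n_components=2, perplexity=40, method="barnes_hut").fit_transform(X)
# Y has shape (n_samples, 2); labels are only used afterwards to color a scatter plot.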
Figure 6.5: The t-SNE applied to 5000 samples from the MNIST handwritten digits dataset, using 2 components and a perplexity of 40 (van der Maaten and Hinton, 2008). Each colored point represents one 28 × 28 image, the color indicating the associated label. Do not be misled, the labels are not used for the t-SNE, which is an unsupervised technique. However, it turns out that t-SNE nicely clusters the different classes.
Part III
Chapter 7
Introduction
7.1 Acknowledgment
The whole part III mainly consists of the integration of a former document. That document was written in French and translated into English by Cédric Pradalier. The text here thus reuses that translation.
The reader can refer to (Cristanini and Shawe-Taylor, 2000; Shawe-Taylor and Cristanini, 2004;
Vapnik, 2000) for a more exhaustive view of what is presented here.
7.2 Objectives
This document part aims at providing a practical introduction to Support Vector Machines (SVM).
Although this document presents an introduction to the topics and the foundations of the problem,
we refer the reader interested in more mathematical or practical details to the documents cited in
the bibliography. Nevertheless, there should be enough material here to get an “intuitive” under-
standing of SVMs, with an engineering approach allowing a quick and grounded implementation
of these techniques.
SVMs involve a number of mathematical notions, among which the theory of generalization – only hinted at here –, optimization and kernel-based machine learning. We will only cover here the aspects of these theoretical tools required to understand what SVMs are.
Chapter 8
Linear separator
We will first consider the case of a simple (although ultimately not so simple) separator: the linear separator. In practice, this is the core of SVMs, even if they also provide much more powerful separators than the ones we are going to cover in this chapter.
The membership of a vector to a class is represented here with the value y ∈ Y = {−1, 1}, which will simplify the expressions later. Let us consider X = Rn as the input set in the following. The linear separator is the function hw,b(x) = w.x + b, with w ∈ X and b ∈ R.
This separator does not exclusively output values in Y = {−1, 1}, but we will consider that when the result of hw,b(x) is positive, the vector x belongs to the same class as the samples labelled +1, and that if the result is negative, the vector x belongs to the same class as the samples labelled −1.
Before digging deeper into this notion of linear separator, let's point out that the equation hw,b(x) = 0 defines the separation border between the two classes, and that this border is an affine hyperplane in the case of a linear separator.
8.2 Separability
Let's consider again our sample set S, and let's separate it into two subsets according to the value of the label y. We define:
S + = {(x, y) ∈ S | y = 1}
S − = {(x, y) ∈ S | y = −1}
Stating that S is linearly separable means that there exists w ∈ X and b ∈ R such that:

∀(x, y) ∈ S+, w.x + b > 0 and ∀(x, y) ∈ S−, w.x + b < 0
This is not always feasible. There can be label distributions over the vectors in S that make S not linearly separable. In the case of samples taken in the X = R2 plane, stating that the sample distribution is linearly separable means that we can draw a line (thus a hyperplane) such that the samples of the +1 class are on one side of this border and those of the −1 class on the other side.
8.3 Margin
For the following, let's assume that S is linearly separable. This rather strong hypothesis will be relaxed later, but for now it will let us introduce some notions. The core idea of SVM is that, additionally to separating the samples of each class, the hyperplane must cut "right in the middle". To formally define this notion of "right in the middle" (cf. figure 8.1), we introduce the margin.
Figure 8.1: The same samples (class −1 or +1 is marked with different colors) are, on both figures, separated by a line. The notion of margin allows one to qualify mathematically the fact that the separation on the right figure is "better" than the one on the left.
Using figure 8.2, we can already make the following observations. First, the curves defined by the equation hw,b(x) = C are parallel hyperplanes and w is normal to these hyperplanes. The parameter b expresses a shift of the separator plane, i.e. a translation of hw,b(x). The norm |w| of w affects the level set hw,b(x) = C: the larger |w|, the more compressed the level sets will be.
Figure 8.2: Definition of a separator hw,b(x). The values on the graph are the values of the separator at the sample points, not the Euclidean distances. If the Euclidean distance of a point x to the separation border is d, then |hw,b(x)| at that point is d|w|.
When looking for a given separating border, we are faced with an indetermination regarding
the choice of w and b. Any non-null vector w orthogonal to the hyperplane will do. Once this vector is chosen, we determine b such that b/|w| is the oriented measure1 of the distance from the origin to the separating hyperplane.
The margin is defined with respect to a separator hw,b and a given set of samples S. We will denote γhw,b(S) this margin. It is defined from the function γhw,b(x, y) computed for each sample (x, y), also called margin, or rather sample margin. This latter margin is:

γhw,b(x, y) = y hw,b(x) = y (w.x + b)
Since y ∈ {−1, 1} and a separator puts samples with label +1 on the positive side of its border and those of class −1 on the negative side, the sample margin is, up to the norm of w, the distance from a sample to the border. The (overall) margin, for all the samples, is simply the minimum of all sample margins:
γhw,b(S) = min_{(x,y)∈S} γhw,b(x, y) (8.2)
Coming back to the two cases of figure 8.1, it seems clear that on the right side – the better separator – the margin γhw,b(S) is larger, because the border cuts further from the samples. Maximizing the margin is most of the work of an SVM during the learning phase.
Chapter 9
An optimisation problem
find argmin_{w,b} (1/2) w.w
subject to yi (w.xi + b) ≥ 1, ∀(xi, yi) ∈ S
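As a hedged illustration, the hard-margin primal problem above can be written almost verbatim with a generic convex solver. The following sketch assumes cvxpy and numpy are available; the toy dataset and all names are ours.

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2, 0.5, size=(20, 2)), rng.normal(-2, 0.5, size=(20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])                        # labels in {-1, +1}

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))                  # (1/2) w.w
constraints = [cp.multiply(y, X @ w + b) >= 1]                    # y_i (w.x_i + b) >= 1
cp.Problem(objective, constraints).solve()
print(w.value, b.value)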
Figure 9.1: The separator with maximal margin has a border which is defined by at least one sample of each class. The support vectors are marked with a cross.
Figure 9.2: The separator in this figure has the same border as the one from figure 9.1, so it separates the samples with the same quality. However, the difference is that the level sets hw,b(x) = 1 and hw,b(x) = −1 contain the support vectors. To achieve this, it was necessary to modify the norm of w, and adjust b accordingly.
Figure 9.3: Both separators in this figure make the margin of all samples greater than 1. For both
of them, the width of the bands is 2/|w|, where w is the term appearing in hw,b (x) = w.x + b.
Consequently, the “most vertical” separator on this figure has a vector w with a smaller norm than
the other.
Figure 9.4: The binary loss and the hinge loss, as functions of y.h(x).
This loss approximates the binary loss, as figure 9.4 shows. Let us now reconsider equation (9.1). The two constraints can be rewritten as

ξi ≥ max (0, 1 − yi h(xi))

For a given (w, b), the minimization of the objective function in equation (9.1) is improved when the sum is minimal, i.e. when each ξi is minimal. Therefore, as the slack variables should be minimal, we can use the previous expression as an equality before injecting it in the objective function without changing the optimal solution. So the minimization problem of equation (9.1)
can be rewritten as

find argmin_{w,b} (1/2) w.w + C Σ_i max (0, 1 − yi h(xi))

which is

find argmin_{w,b} Σ_i Lhinge(h(xi), yi) + λ w.w.
This last formulation is an actual ERM problem, with an additional regularization term λ w.w.
Solving this problem requires defining the following function, called the problem's Lagrangian, which involves the constraints multiplied by coefficients αi ≥ 0. These coefficients are named Lagrange multipliers. The constraints gi are affine.
L(k, α) = f(k) + Σ_{1≤i≤n} αi gi(k)
In this case, the theory says that the vector k ? minimizing f (k) while respecting the constraints
must satisfy that L has a saddle point at (k ? , α? ). At this point, L is a minimum for k and a
maximum for α:
∀k, ∀α ≥ 0, L (k ? , α) ≤ L (k ? , α? ) ≤ L (k, α? )
The following stands at the optimum:

∂L/∂k (k?, α?) = 0

whereas ∂L/∂α (k?, α?), which should be null as well at the saddle point, may not be defined (see the top-right frame in figure 9.5).
These conditions are sufficient to define the optimum if the Lagrangian is a convex function,
which will be the case for SVMs. See Cristanini and Shawe-Taylor (2000) for supplementary
mathematical justifications.
The challenge is that writing these conditions does not always lead easily to a solution to the
initial problem. In the case of SVMs, solving the dual problem will result in an easier (although
not easy) solution.
Figure 9.5: Transition from the primal problem to the dual problem. The Lagrangian L(k, α) is saddle-shaped around the solution of the problem. The "bottom of the valley", i.e. the minima along k, is represented on the saddle by a dashed line. The equation ∂L/∂k = 0 enables us to link k and α, so as to express k as a function of α, denoted K(α). This link is the projection of the valley onto the "horizontal" plane. Injecting this relation into L gives a function L(k, K(α)) = θ(α). This function is the objective function of the dual problem, which we are trying to maximize, as shown in the figure.
Figure 9.6: The constraint g(k) = 0 (bold curve), the solution k?, the gradient of f and the Jacobian of g, for an infinitesimal displacement h along the constraint.
Staying in the kernel (i.e. on the bold curve in the figure) leads to the following consequence around the solution k?. Let h be an infinitesimal displacement around the optimum, such that k? + h is still in the kernel of g. We have:
g(k ? ) = 0
g(k ? + h) = 0
g(k ? + h) = g(k ? ) + Jg|k (k ? ) .h
where Jg|k(k0) represents the Jacobian matrix of g with respect to k, taken at k0. We can immediately deduce that:
Jg|k (k ? ) .h = 0
A displacement h satisfying the constraints around the solution is thus included in the vector space
defined by the kernel of the Jacobian matrix:
h ∈ ker ( Jg|k (k ? ))
Let's now consider the consequence of this displacement regarding our objective function f. By linearizing around k? while using the constraint-satisfying displacement h above, we have:

f(k? + h) = f(k?) + ∇f|k(k?).h

As a reminder, the gradient is a degenerate Jacobian when the function is scalar instead of vector-valued. Being around k? means that f(k?) is minimal, so long as our displacements respect the constraints (in bold on figure 9.6, i.e. the curve C). So ∇f|k(k?).h ≥ 0. However, similarly to h, −h satisfies the constraints as well, since the set ker(Jg|k(k?)) of h satisfying the constraints is a matrix kernel and as such a vector space. Hence, we also have −∇f|k(k?).h ≥ 0. From this, we can deduce that

∇f|k(k?).h = 0, ∀h ∈ ker(Jg|k(k?))

In other words, ∇f|k(k?) is in the vector sub-space E orthogonal to ker(Jg|k(k?)). But this space E happens to be the one spanned by the column vectors of the matrix Jg|k(k?). Hence, we can confirm that ∇f|k(k?) is a linear combination of these column vectors.
But, since g is the vector of the n scalar constraints gi, these column vectors are the gradients of each of the constraints with respect to the parameters. As a result, we have:
∃(α1, · · · , αn) ∈ Rn : ∇f|k(k?) + Σ_{i=1}^{n} αi ∇gi|k(k?) = 0
The idea is to associate to each constraint gi(k) ≤ 0 a new scalar parameter yi. We group the yi of the inequality constraints into a vector y. We set g′i(k, y) = gi(k) + yi². An optimization problem with inequality constraints thus becomes a problem with equality constraints, but with additional parameters.
The gradient of the Lagrangian with respect to the parameters (now k and y) must be null. It
is thus null if we consider it with respect to k and with respect to y.
By differentiating with respect to k, we still have:
∇f|k(k?) + Σ_{i=1}^{n} αi ∇gi|k(k?) = 0
The two types of constraints of our problem are inequalities (cf. section 9.1.2). The theory then says that if a constraint is saturated, that is to say if it is actually an equality, then its Lagrange multiplier is not null. When it is a strict inequality, the multiplier is null. So, for a constraint gi(...) ≤ 0 for which the associated multiplier would be ki, we have either ki = 0 and gi(...) < 0, or ki > 0 and gi(...) = 0. These two cases can be summarised into a single expression ki gi(...) = 0. This expression is named the supplementary KKT condition3. In our problem, we can then express six KKT conditions: the constraints, the multipliers' positive sign and the supplementary KKT conditions:
∀i, αi ≥ 0 (KKT1)
∀i, µi ≥ 0 (KKT2)
∀i, ξi ≥ 0 (KKT3)
∀i, yi (w.xi + b) ≥ 1 − ξi (KKT4)
∀i, µi ξi = 0 (KKT5)
∀i, αi (yi (w.xi + b) + ξi − 1) = 0 (KKT6)
This being defined, let us set to zero the partial derivatives of the Lagrangian with respect to the terms that are not Lagrange multipliers:
∂L/∂w = 0 ⇒ w = Σ_{i}^{N} αi yi xi (L1)

∂L/∂b = 0 ⇒ Σ_{i}^{N} αi yi = 0 (L2)

∂L/∂ξi = 0 ⇒ ∀i, C − αi − µi = 0 (L3)
Equation L1, injected into the expression of the Lagrangian, lets us remove the term w. Equation L3 lets us remove the µi as well. L2 lets us eliminate b, which is now multiplied in L by a null term. After these substitutions, we now have a Lagrangian expression that depends only on the αi. We will now maximize it by playing on these αi, knowing that injecting L1, L2 and L3 already guarantees that we have a minimum with respect to w, ξ and b. This is the dual problem. The constraints on this problem can be inferred from the constraints on the αi resulting from the equations KKTi. Using L3, KKT2 and KKT3, we can show the following by considering the two cases resulting from KKT5.
The constraints on the αi are thus 0 ≤ αi ≤ C and L2. Hence, we must solve the following optimization problem, dual to our initial problem, to find the Lagrange multipliers αi.
find argmax_α Σ_{i}^{N} αi − (1/2) Σ_{j}^{N} Σ_{i}^{N} αj αi yj yi xj.xi

subject to  Σ_i αi yi = 0
            ∀i, 0 ≤ αi ≤ C
Additionally, from the two cases mentioned above, we can also deduce that ξi (αi − C) = 0. This means that accepting a badly separated sample xi (ξi ≠ 0) is equivalent to using its αi with the maximum value C.
One interesting aspect of this expression of the dual problem is that it only involves the samples xi, or more precisely only their dot products. This will be useful later when moving away from
3 Karush-Kuhn-Tucker
linear separators. Furthermore, the vector of the separating hyperplane being defined by L1, it is the result of contributions from all the samples xi, each with a weight αi. However, these values, after optimization, may turn out to be zero for many samples. Such samples will thus have no influence on the definition of the separator. Those that remain, that is those for which αi is non-zero, will be named support vectors, because they are the ones that define the separating hyperplane.
Solving the dual problem is not trivial. So far, we only defined the problem. In particular, b has now disappeared from the dual problem and we will have to work hard4 to find it back once this problem is solved. We'll discuss this point further in chapter 11.
Let’s complete this chapter with an example of linear separation, where the separator is the
solution of the optimization problem we’ve defined earlier. The samples that effectively influence
the expression L1 with a non-zero coefficient αi , i.e. the support vectors, are marked with a cross
in figure 9.7.
The separation is thus defined with the following equation:
hw,b(x) = ( Σ_{i}^{N} αi yi xi ).x + b
which we will rather write as follows, to only involve the samples through their dot products:
hw,b(x) = Σ_{i}^{N} αi yi xi.x + b (9.2)
Figure 9.7: Hyperplane resulting from the resolution of the optimization problem from section 9.1.2.
The separation border hw,b (x) = 0 is the bold curve. The curve hw,b (x) = 1 is shown as a thin
line, and hw,b (x) = −1 in a dashed line. The support vectors are marked with crosses.
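As a hedged, minimal sketch of such a linear separation in practice (assuming scikit-learn; the toy data and all names are ours), note that the fitted model directly exposes the support vectors, the products αi yi and the offset b of equation (9.2):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2, 1, size=(20, 2)), rng.normal(-2, 1, size=(20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
# dual_coef_ stores alpha_i * y_i for the support vectors, so equation L1 gives:
w = clf.dual_coef_ @ clf.support_vectors_        # w = sum_i alpha_i y_i x_i
b = clf.intercept_
# Samples with alpha_i = 0 are absent from support_vectors_: they do not influence h.
print(len(clf.support_vectors_), w, b)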
Chapter 10
Kernels
The main interest of kernels in the context of SVMs is that everything written in the previous chapters on linear separation also applies readily to non-linear separation once we bring kernels in, so long as we do it right.
In order to build a better separation of the samples, a solution is to project the samples into
a different space1 , and to implement a linear separation in this space where it will hopefully work
better.
1 Often of very high dimension
Obviously, the functions ϕi are not necessarily linear. Furthermore, we can have n = ∞! So, if we use the approaches seen in chapter 9 and apply them in the feature space, that is to say, if we work with the following sample set with binary labels (Y = {−1, 1}):

ϕ(S) = {(ϕ(x1), y1), · · · , (ϕ(xi), yi), · · · , (ϕ(xN), yN)}

instead of

S = {(x1, y1), · · · , (xi, yi), · · · , (xN, yN)}
separator w and a value for b. Now, to decide the class of a new vector x, we could compute ϕ (x)
and apply the separator on ϕ (x) to find out its class membership, −1 or +1.
In practice, we will avoid computing ϕ(x) explicitly by noting that the optimization problem defined in chapter 9 only involves the samples through dot products of pairs of samples.
Let's denote k(x, x′) the product ϕ(x).ϕ(x′). Working on corpus ϕ(S) is equivalent to working on corpus S with the algorithms of chapter 9, but replacing every occurrence of •.• with k(•, •).
So far, the interest of kernels should not be obvious, because to compute k (x, x0 ), we still need
to apply its definition, that is to project x and x0 in the feature space and to compute, in the
feature space, their dot product.
However, the trick, known as the kernel trick, is that we will actually avoid performing this projection because we will compute k(x, x′) in another way. Actually, k(x, x′) is a function that we will choose, making sure that there exists, in theory, a projection ϕ into a space that we will not even try to describe. In this way, we will compute k(x, x′) directly each time the algorithm from chapter 9 refers to a dot product, and that's all. The projection into the huge feature space will be kept implicit.
Let’s take an example. Consider
k(x, x′) = exp( −|x − x′|² / (2σ²) )
It is well known that this function corresponds to the dot product of the projections of x and x′ into an infinite-dimensional space. The optimization algorithm that will use this function, also known as a kernel, will compute a linear separation in this space while maximizing the margin, without having to perform any infinite loop to compute the products by multiplying the terms of the projected vectors two by two!
The separation function is then directly inspired from equation (9.2) page 111; once the optimal αi are found and b is computed,
sep(x) = Σ_{i}^{N} αi yi k(xi, x) + b
knowing that many of the terms of the sum are zero if the problem is actually separable. In our
case, the separator can then be rewritten as:
sep(x) = Σ_{i}^{N} αi yi exp( −|xi − x|² / (2σ²) ) + b
The level set sep(x) = 0 defines the separation border between the classes, and the level sets sep(x) = 1 and sep(x) = −1 represent the margin. Figure 10.2 depicts the result of the algorithm given in chapter 9 with our kernel function.
Figure 10.2: Solution of the optimization problem from section 9.1.2 on the corpus from figure 10.1, but with Gaussian kernels. The separating border sep(x) = 0 is depicted with a bold line and the level sets sep(x) = 1 and sep(x) = −1 respectively with the dashed and thin lines. The support vectors are marked with crosses.
Is it a kernel? If so, what is the corresponding projection? One way to prove it, is to exhibit the
dot product of the projected vectors.
(x.x′ + c)² = ( Σ_i xi x′i + c )²
            = Σ_{i,j} xi x′i xj x′j + 2c Σ_i xi x′i + c²
            = Σ_{i,j} (xi xj)(x′i x′j) + Σ_i (√(2c) xi)(√(2c) x′i) + (c)(c)
Hence, we can deduce that the projection into a space where our kernel is a dot product is

Φ(x) = ( (xi xj)_{i,j}, (√(2c) xi)_i, c )

It corresponds to a projection Φ(x) in a feature space where each component φi(x) is a product of components of x of degree at most d (a monomial). The separator computed from this kernel is a polynomial of degree d, whose terms are components of x. The larger the constant c, the more importance is given to the low-order terms. With c = 1 and d = 3, figure 10.3 depicts the result of the separation.
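As a small numerical sanity check of the expansion above (a sketch, assuming numpy; the helper name phi is ours), for 2-D vectors and c = 1 the kernel value (x.x′ + c)² coincides with the dot product of the explicit feature maps:

import numpy as np

def phi(x, c=1.0):
    # monomials x_i x_j, the terms sqrt(2c) x_i, and the constant c
    return np.concatenate([np.outer(x, x).ravel(),
                           np.sqrt(2 * c) * x,
                           [c]])

rng = np.random.default_rng(0)
x, xp = rng.normal(size=2), rng.normal(size=2)
lhs = (x @ xp + 1.0) ** 2
rhs = phi(x) @ phi(xp)
assert np.isclose(lhs, rhs)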
The second kernel we will introduce here is the Gaussian kernel, also known as RBF3, mentioned earlier:

krbf(x, x′) = exp( −|x − x′|² / (2σ²) )
This kernel corresponds to a projection into an infinite-dimensional space. However, in this space, all the points are projected onto the hypersphere with radius 1. This can be easily seen from |ϕ(x)|² = k(x, x) = exp(0) = 1.
Figure 10.3: Solution of the optimization problem of section 9.1.2 over the corpus from figure 10.1, but with a polynomial kernel of degree 3. The separating border sep(x) = 0 is depicted with a bold line and the level sets sep(x) = 1 and sep(x) = −1 respectively with the dashed and thin lines. The support vectors are marked with crosses.
• α a positive number.
then, the following functions k are also kernels:
k (x, x0 ) = k1 (x, x0 ) + k2 (x, x0 )
k (x, x0 ) = αk1 (x, x0 )
k (x, x0 ) = k1 (x, x0 ) k2 (x, x0 )
k (x, x0 ) = f (x) f (x0 )
k (x, x0 ) = k3 (Φ (x) , Φ (x0 ))
k (x, x0 ) = xT Bx0
k (x, x0 ) = p (k1 (x, x0 ))
k (x, x0 ) = exp (k1 (x, x0 ))
Hence, we can very easily work on normalized data... in the feature space! In practice, the dot
product is:
( ϕ(x) / |ϕ(x)| ) . ( ϕ(x′) / |ϕ(x′)| ) = ϕ(x).ϕ(x′) / ( |ϕ(x)| |ϕ(x′)| ) = k(x, x′) / √( k(x, x) k(x′, x′) )
So, we just need to use the right side of the above expression as a new kernel, built upon a kernel k,
to work with normalized vectors in the feature space corresponding to k. Denoting k̄ the normalized
kernel, we simply have:
k̄(x, x′) = k(x, x′) / √( k(x, x) k(x′, x′) )
We can even quite easily compute distances, in feature space, between the projections of two vectors:

|ϕ(x) − ϕ(x′)| = √( k(x, x) − 2 k(x, x′) + k(x′, x′) )
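As a minimal sketch of the two formulas above (assuming numpy; the kernel choice and the function names k_poly, k_bar and feature_distance are ours):

import numpy as np

def k_poly(x, xp, c=1.0, d=3):
    return (x @ xp + c) ** d

def k_bar(x, xp, k=k_poly):
    # normalized kernel: k(x, x') / sqrt(k(x, x) k(x', x'))
    return k(x, xp) / np.sqrt(k(x, x) * k(xp, xp))

def feature_distance(x, xp, k=k_poly):
    # |phi(x) - phi(x')| = sqrt(k(x, x) - 2 k(x, x') + k(x', x'))
    return np.sqrt(k(x, x) - 2 * k(x, xp) + k(xp, xp))

x, xp = np.array([0.5, 1.0]), np.array([2.0, -1.0])
print(k_bar(x, xp), feature_distance(x, xp))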
To work on centered and reduced samples, we use the kernel k̂ defined as follows:
k̂(x, x′) = ( ( ϕ(x) − (1/N) Σ_{j=1}^{N} ϕ(xj) ) / √var ) . ( ( ϕ(x′) − (1/N) Σ_{j=1}^{N} ϕ(xj) ) / √var )
         = (1/var) ( k(x, x′) − (1/N) Σ_{i=1}^{N} k(x, xi) − (1/N) Σ_{i=1}^{N} k(x′, xi) + (1/N²) Σ_{i,j=1}^{N} k(xi, xj) )
Note that these kernels are very computationally expensive, even in comparison with SVMs, which tend to be computationally heavy algorithms even with simple kernels. This is the type of situation where one would rather pre-compute and store the kernel values for all the sample pairs in the database.
4 SVMs are not the only one to take advantage of the kernel trick
ϕ : Documents → NN
d 7 → ϕ(d) = (f (m1 , d), · · · , f (mN , d))
where f(m, d) is the number of occurrences of word m in document d. For a set {dl}l of l documents, the document-term matrix D, whose line i is given by the vector ϕ(di), lets us define a dot product (hence a kernel) over the documents. In practice, k(di, dj) is given by the coefficient (i, j) of DDT.
We can mitigate the fact that documents will have different length by using a normalized kernel.
Furthermore, we can tune this kernel by injecting some a priori semantic knowledge. For instance, we can define a diagonal matrix R where each diagonal value corresponds to the importance of a given word. We can also define a matrix P of semantic proximity, whose coefficient pi,j represents the semantic proximity of words mi and mj. The semantic matrix S = RP lets us create a kernel taking advantage of this knowledge:
10.5.2 Strings
Character strings have received a lot of attention in computer science, and many approaches have been designed to quantify the similarity between two strings. In particular, SVMs are one of the machine learning techniques used to process DNA sequences, where string similarity is essential. This section provides an example of such a function.
Let's consider the case of the p-spectrum kernel. We intend to compare two strings, probably of different lengths, by using their common sub-strings of length p. Let Σ be an alphabet; we denote Σp the set of strings of length p built on Σ, and s1 s2 the concatenation of s1 and s2. We also denote |A| the number of elements in a set A. We can then define the following expression, for u ∈ Σp: ϕpu(s) is the number of occurrences of u as a sub-string of s, i.e. ϕpu(s) = |{(v1, v2) | s = v1 u v2}|.
For a string s, we get one ϕpu(s) per possible sub-string u. ϕpu(s) is zero for most u ∈ Σp. Thus, we are projecting a string s on a vector space with |Σ|^p dimensions, and the components of the vector ϕp(s) are the ϕpu(s). We can finally define a kernel with a simple dot product:

k(s, t) = ϕp(s).ϕp(t) = Σ_{u∈Σp} ϕpu(s) ϕpu(t)
Let's illustrate this kernel with p = 3 and the following strings: bateau, rateau, oiseau, croise, ciseaux. The elements of Σ3 leading to non-zero components are ate, aux, bat, cis, cro, eau, ise, ois, rat, roi, sea, tea. The lines in table 10.1 are the non-zero components of the vectors ϕp(s). We can then represent as a matrix the values of the kernel for every pair of words, as shown in table 10.2.
word      non-zero components of ϕ3(word)
bateau    ate, bat, eau, tea
rateau    ate, eau, rat, tea
oiseau    eau, ise, ois, sea
croise    cro, ise, ois, roi
ciseaux   aux, cis, eau, ise, sea
Table 10.1: 3-spectrum of the words bateau, rateau, oiseau, croise, ciseaux.
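A possible implementation of the p-spectrum kernel is sketched below (plain Python; the function names are ours). With p = 3 it reproduces the spectra of table 10.1 and should reproduce the kernel values of table 10.2.

from collections import Counter

def spectrum(s, p):
    # counts of every sub-string of length p occurring in s
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))

def k_spectrum(s, t, p):
    phi_s, phi_t = spectrum(s, p), spectrum(t, p)
    return sum(phi_s[u] * phi_t[u] for u in phi_s)   # dot product of the two spectra

words = ["bateau", "rateau", "oiseau", "croise", "ciseaux"]
for s in words:
    print(s, [k_spectrum(s, t, 3) for t in words])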
Chapter 11
Solving SVMs
We can then express the KKT conditions, that is to say the zeroing of the partial derivatives of the Lagrangian. We can also express the additional KKT conditions, which are that when a multiplier is zero, then the constraint is not saturated, and when it is non-zero, then the constraint
1 Sequential Minimal Optimization
2 Be careful, the αi are now primal parameters and the δi , µi and the parameter β are the Lagrange multipliers.
is saturated. Thus, the product of the constraint and its multiplier is zero at the optimum, without the two factors being zero at the same time (see page 110 for a reminder). The multipliers are also all positive.

∀ 1 ≤ i ≤ N :  ∂L/∂αi = (Fi − β) yi − δi + µi = 0,  δi αi = 0,  µi (αi − C) = 0
These conditions get simpler when they are written in the following way, distinguishing three cases according to the value of αi.

Case αi = 0 : so δi > 0 and µi = 0, thus (Fi − β) yi ≥ 0
Case 0 < αi < C : so δi = 0 and µi = 0, thus (Fi − β) yi = 0
Case αi = C : so δi = 0 and µi > 0, thus (Fi − β) yi ≤ 0
Since yi ∈ {−1, 1}, we can separate the values of i according to the sign of Fi − β. This lets us define the following sets of indices:
I0 = {i : 0 < αi < C}
I1 = {i : yi = 1, αi = 0}
I2 = {i : yi = −1, αi = C}
I3 = {i : yi = 1, αi = C}
I4 = {i : yi = −1, αi = 0}
Isup = I0 ∪ I1 ∪ I2
Iinf = I0 ∪ I3 ∪ I4
Then:
i ∈ Isup ⇒ β ≤ Fi
i ∈ Iinf ⇒ β ≥ Fi
We can then define the following bounds on these sets:

bsup = min_{i∈Isup} Fi
binf = max_{i∈Iinf} Fi
In practice, when we iterate the algorithm, we will have bsup ≤ binf as long as we haven't reached the optimum, but at the optimum:
binf ≤ bsup
We can reverse this condition, and say that we haven't reached the optimum as long as we can find two indices, one in Isup and the other in Iinf, which violate the condition binf ≤ bsup. Such a pair of indices defines a violation of the optimality conditions:
Equation (11.1) is theoretical, since on a real computer we will never have, numerically, binf ≤ bsup at the optimum. We will satisfy ourselves with defining this condition "up to a bit", say τ > 0. In other words, the approximate optimality condition is:
Equation (11.1), which defines when a pair of indices violates the optimality condition, is then modified accordingly:
The criterion (11.3) will be tested to check whether we need to keep running the optimization algorithm, or whether we can consider that the optimum has been reached.
Before ending this paragraph, which was ultimately aimed at presenting the stopping criterion of the algorithm we will define in the following sections, let's point out that, at the optimum, bsup ≈ binf ≈ β... and that this value is also the b of our separator! In section 9.2.5 page 111, we lamented that a closed-form solution for b was, until then, not made available by the Lagrangian solution.
Chapter 12
Regression
Until now, we only considered the problem of separating a corpus of samples into two classes, according to their labels −1 or +1. Regression consists in using labels with values in R and searching for a function that will map a vector to its label, based on the samples in the corpus.
In a process similar to that of a linear separator, we reach the definition of the following optimization problem for a regression:
find argmin_{w,b,ξ,ξ′} (1/2) w.w + C Σ_{i}^{N} (ξi + ξ′i)

subject to  yi − w.xi − b ≤ ε + ξi, ∀(xi, yi) ∈ S
            w.xi + b − yi ≤ ε + ξ′i, ∀(xi, yi) ∈ S
            ξi, ξ′i ≥ 0, ∀i
12.2 Resolution
Solving this optimization problem is once again easier after switching to the dual problem, as was the case for the linear separator in chapter 9. Let αi and α′i be the multipliers for the first two constraints of our optimization problem. The vector w of the separator is given by:

wα,α′ = Σ_{i}^{N} (αi − α′i) xi
Figure 12.1: Linear regression. The white dots have abscissa xi (a one-dimensional vector) and ordinate yi. The dashed band represents the set of acceptable distances to the separator, |w.x + b − y| ≤ ε, and not many samples are outside of this set.
Once again, it "just" remains to apply an algorithm that will search for the maximum of this dual problem. Approaches similar to the SMO algorithm exist but lead to algorithms that are relatively hard to implement.
12.3 Examples
Figure 12.2 demonstrates the use of this type of SVM for regression in the case of 1D vectors, for
different kernels. Figure 12.3 gives an example in the case of 2D vectors.
Figure 12.2: Regression on 1D vectors, similar to figure 12.1. Left: using a standard dot product. Middle: using a 3rd degree polynomial kernel. Right: using a Gaussian kernel. The support vectors are marked with a cross. They are the vectors xi for which one of the pair (αi, α′i) is non-zero. They are the ones constraining the position of the regression curve. The tolerance −ε, +ε is represented with dashed lines.
Figure 12.3: Left: the function z = f(x, y) = exp(−2.5(x² + y²)) cos(8 √(x² + y²)) that we used to generate the samples. Middle: 150 samples, obtained by randomly drawing x and y and defining vector xi = (x, y) with a label yi = f(x, y) + ν, with ν drawn uniformly from [−0.1, 0.1]. Right: result of the regression on these samples, with a Gaussian kernel with variance σ = 0.25 and a tolerance ε = 0.05.
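As a hedged sketch of how such a regression can be run in practice (assuming scikit-learn; the synthetic data and the values of C, epsilon and gamma are illustrative, not the ones used for the figures):

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(150, 1))
y = np.cos(4 * X[:, 0]) + rng.uniform(-0.1, 0.1, size=150)   # noisy 1-D samples

reg = SVR(kernel="rbf", C=10.0, epsilon=0.05, gamma=2.0).fit(X, y)
y_hat = reg.predict(X)
print(len(reg.support_))      # samples with a non-zero (alpha_i, alpha'_i) pair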
Chapter 13
Compendium of SVMs
Ultimately, the principle of the approaches seen so far is always the same: we define a quadratic
optimization problem that can be solved using only dot products of pairs of samples.
13.1 Classification
These approaches are called SVC for Support Vector Classification.
13.1.1 C-SVC
The C-SVC is the approach we've seen so far. The optimization problem is only given here as a reminder:

find argmin_{w,b,ξ} (1/2) w.w + C Σ_i ξi

subject to  yi (w.xi + b) ≥ 1 − ξi, ∀i
            ξi ≥ 0, ∀i
13.1.2 ν-SVC
The problem of a C-SVM is that C, which defines when to use the slack variables ξi, does not depend on the number of samples. In some cases, we might want to define the number of support vectors based on the number of samples instead of giving an absolute value. The parameter ν ∈ ]0, 1] is linked to the ratio of examples that can be used as support vectors1. In a C-SVM, we always force the samples to be located outside the band [−1, 1]. Here, we choose a band [−ρ, ρ], and we adjust ρ to obtain the desired ratio of support vectors. This defines the ν-SVC problem.
find argmin_{w,b,ξ,ρ} (1/2) w.w − νρ + (1/N) Σ_i ξi

subject to  yi (w.xi + b) ≥ ρ − ξi, ∀i
            ξi ≥ 0, ∀i
            ρ ≥ 0
Expressing the objective function this way is however not so intuitive, and how ν defines the ratio of samples used as support vectors is far from obvious. This can be justified by looking at the KKT conditions of this optimization problem (Schölkopf et al., 2000).
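As a hedged comparison of the two formulations in practice (assuming scikit-learn; the toy dataset is ours), one can observe that with ν-SVC the fraction of support vectors is tied to ν rather than to an absolute penalty C:

from sklearn.datasets import make_blobs
from sklearn.svm import SVC, NuSVC

X, y = make_blobs(n_samples=200, centers=2, random_state=0)
c_svc = SVC(kernel="rbf", C=1.0).fit(X, y)
nu_svc = NuSVC(kernel="rbf", nu=0.2).fit(X, y)
# fraction of samples used as support vectors in each case
print(len(c_svc.support_) / len(X), len(nu_svc.support_) / len(X))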
13.2 Regression
These approaches are named SVR for Support Vector Regression.
1 This ratio tends towards ν when we have many samples.
13.2.1 ε-SVR
The ε-SVR is the approach we presented earlier. The optimization problem is given here as a reminder.

find argmin_{w,b,ξ,ξ′} (1/2) w.w + C Σ_i (ξi + ξ′i)

subject to  w.xi + b − yi ≥ −ε − ξi, ∀i
            w.xi + b − yi ≤ ε + ξ′i, ∀i
            ξi, ξ′i ≥ 0, ∀i
13.2.2 ν-SVR
Similarly to the ν-SVC, the purpose here is to modulate the width ε of the ε-SVR according to a parameter ν ∈ ]0, 1]. The objective is to define the number of samples outside of a tube of radius ε around the regression function as a ratio ν of the total number of samples2.

find argmin_{w,b,ξ,ξ′,ε} (1/2) w.w + C ( νε + (1/N) Σ_i (ξi + ξ′i) )

subject to  w.xi + b − yi ≥ −ε − ξi, ∀i
            w.xi + b − yi ≤ ε + ξ′i, ∀i
            ξi, ξ′i ≥ 0, ∀i
            ε ≥ 0
As was the case for ν-SVC, the justification of this ν-SVR formulation of the optimization problem
can be found in Schölkopf et al. (2000) and is a consequence of the KKT conditions resulting from
this problem.
Figure 13.1: Minimal enclosing sphere. Left: using the standard dot product. Right: using a
Gaussian kernel with σ = 3. In both cases, C = 0.1 and the samples are randomly drawn in a
10 × 10 square.
As before, we can find a ν-version of this problem, in order to control the number of support vectors and thus the ratio of samples lying outside the sphere (Shawe-Taylor and Cristanini, 2004). The optimization problem is the same as before, setting C = 1/(νN) and then multiplying the objective function by ν3. Once again, the reason why this value of C leads to ν being effectively linked with the ratio of samples outside the sphere can be found by analysing the KKT conditions.
find argmin_{ω,r,ξ} νr² + (1/N) Σ_i ξi

subject to  |xi − ω|² ≤ r² + ξi, ∀i
            ξi ≥ 0, ∀i
Figure 13.2: One class SVM. The samples are the same as in figure 13.1. We use a Gaussian kernel
with σ = 3 and ν = 0.2, which means that around 20% of the samples are outside the selected
region.
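As a hedged sketch of figure 13.2 in practice (assuming scikit-learn; the random data, the gamma conversion from σ and all names are ours):

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))            # samples drawn in a 10 x 10 square

# scikit-learn parametrizes the RBF kernel as exp(-gamma |x - x'|^2), so gamma = 1/(2 sigma^2).
oc = OneClassSVM(kernel="rbf", nu=0.2, gamma=1.0 / (2 * 3.0 ** 2)).fit(X)
outside = (oc.predict(X) == -1).mean()           # fraction of samples flagged as outside
print(outside)                                    # should be roughly 0.2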
Part IV
Vector Quantization
Chapter 14
Introduction and notations for vector quantization
where PN? (X ) is the set of finite subsets of X . In other words, hΩ is defined according to a set
of values Ω = {ω0 , · · · , ωi , · · · , ωK }. It computes its output as the ωi which is the closest to the
input. The ωi s are called the prototypes. All hypotheses in H are not required to use the same
number of prototypes for their computation.
Let us define the distortion induced by a set of prototypes Ω as R(hΩ). It is the expectation, when a sample x is drawn according to X, of the error made by assimilating x to its closest prototype in Ω. It can be approximated by an empirical risk RSemp(hΩ) measured on a dataset S = {x1, · · · , xi, · · · , xN}, actually viewed here as S = {(x1, x1), · · · , (xi, xi), · · · , (xN, xN)}. This empirical risk is considered here as the distortion induced by hΩ on the data.
Since we allow for arbitrarily large Ωs in this definition, it is obvious that, when the values of X are bounded, having a huge set Ω of prototypes uniformly spread over the values taken by X enables R(hΩ) to be made as small as desired. Therefore, as opposed to real supervised learning, the goal here is not only to reduce the distortion, but rather to obtain a minimal distortion when only a few prototypes (i.e. |Ω| < K) are allowed. In this case, the prototypes for which this minimal distortion is obtained are "well spread" over the data.
Handling a discrete collection of few prototypes gives the method its name of vector quantization.
In the previous definition, where a dummy supervised learning setting is used to describe an unsupervised learning problem, the loss function L is crucial.
This is the place for adding the semantics of the problem in the algorithms. Even if the vast majority of references in the literature use X = Rn and L(x, x′) = (x − x′).(x − x′), the central role of the loss has to be kept in mind when vector quantization is applied to real problems.
Let us take the example of handwritten digit recognition, where inputs are digits provided as gray-scaled 8 × 8 images. In this case, X = [0, 1]64, where 0 stands for white and 1 for black. Let us consider the three inputs x1, x2 and x3 depicted in figure 14.1. Using the default loss function mentioned above, the following stands: L(x1, x2) = L(x2, x3) = L(x1, x3) = 20. In other words, x1 x2 x3 forms an equilateral triangle in [0, 1]64. An appropriate design should have led to L(x1, x2) < L(x1, x3), since samples x1 and x2 look very similar to each other. Figure 14.2 shows the inadequacy of the ℓ2 norm as well.
Figure 14.1: Three digit inputs. Each digit is made of 11 black pixels. Each pair of digits is such
that the digits have only one black pixel in common.
Figure 14.2: The original image is on the left. The three other images are respectively obtained from it by a shift, the addition of black rectangles and a darkening. In each of these three cases, the pixel-wise ℓ2 distance to the original image is the same, while the visual alteration we experience as observers is not. The illustration is taken from http://cs231n.github.io/classification/.
14.1.3 Samples
Voronoı̈ subsets
In real cases, the random variable X is unknown. It is supposed to drive the production of the
dataset S = {x1 , · · · , xi , · · · , xN }. Since hΩ in H consists in returning the prototype which is
the closest to the given argument, gathering the samples according to the labels given by hΩ is
meaningful. This leads to the definition1 of Voronoı̈ subsets as follows:
VΩS(ω) def= { x ∈ S | argmin_{ω′∈Ω} L(ω′, x) = ω }

∀(ω, ω′) ∈ Ω², ω ≠ ω′ ⇒ VΩS(ω) ∩ VΩS(ω′) = ∅
∪_{ω∈Ω} VΩS(ω) = S
As the Voronoı̈ subsets form a partition of S, the empirical distortion RSemp (hΩ ) can be de-
composed on each of them:
RSemp(hΩ) = (1/N) Σ_{x∈S} L(x, hΩ(x))
          = (1/N) Σ_{ω∈Ω} Σ_{x∈VΩS(ω)} L(x, ω)
          = (1/N) Σ_{ω∈Ω} VΩS(ω)

where VΩS(ω) def= Σ_{x∈VΩS(ω)} L(x, ω)
Let us call VΩS(ω) the Voronoı̈ distortion caused by ω, since it is the contribution of the samples "around" ω to the global distortion RSemp(hΩ). The relevance of the Voronoı̈ distortion in the control of vector quantization algorithms has been introduced in (Frezza-Buet, 2014); it is detailed in forthcoming paragraphs.
1 The argmin operator is supposed to return a single element in the definition, which may be false theoretically.
This case is not addressed for the sake of clarity, since this is not a big issue in real cases.
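As a complement, here is a minimal numpy sketch (function names are ours) computing the per-prototype Voronoı̈ distortions VΩS(ω) and the empirical distortion RSemp(hΩ) for the quadratic loss L(x, ω) = (x − ω).(x − ω):

import numpy as np

def voronoi_distortions(S, Omega):
    d2 = ((S[:, None, :] - Omega[None, :, :]) ** 2).sum(axis=-1)   # L(x_i, w_j) for every pair
    winner = d2.argmin(axis=1)                                     # index of h_Omega(x_i)
    V = np.array([d2[winner == j, j].sum() for j in range(len(Omega))])
    R_emp = V.sum() / len(S)                                       # R_emp = (1/N) sum_w V(w)
    return V, R_emp

rng = np.random.default_rng(0)
S = rng.uniform(-0.5, 0.5, size=(5000, 2))
Omega = rng.uniform(-0.5, 0.5, size=(50, 2))
V, R_emp = voronoi_distortions(S, Omega)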
Algorithm 10 Computation of S.
1: S ← ∅ // Start with an empty set.
2: for i ← 1 to M do
3: // Let us consider M attempts to add a sample in S.
4: x ∼ UX , u ∼ U[0,1[ // Choose a random place x in X .
5: if u < p (x) then
6: // The test will pass with a probability p (x).
7: S ← S ∪ {x} // x is kept (i.e. not rejected).
8: end if
9: end for
10: return S
Figure 14.3: Here, X = [−0.5, 0.5]2 . S is represented by smaller dots, while Ω is the larger ones.
Left and right figures show two distinct optimal configurations for κ = 6.
Figure 14.4: Left: prototypes and at most 5000 samples. Middle: Voronoı̈ error map. Right: Voronoı̈ error histogram.
In spite of some small fluctuations, the Voronoı̈ distortion is almost equally shared between the prototypes as well when Ω*κ is reached.
Figure 14.5: The non-uniform density function p, plotted over X = [−0.5, 0.5]² with values between 0 and 1.
The value of the equal distortion share, i.e. the center of the darkened range in figures 14.4-right and 14.6-right, is obtained from a large amount of weak sample contributions around prototypes where p is high, while it is obtained from a smaller amount of stronger sample contributions around prototypes where p is low.
Figure 14.6: Same as figure 14.4, using a non-uniform density function p.
The value of this share (e.g. 0.03 in figure 14.4, 0.013 in figure 14.6) appears to give a hint about the quantization accuracy: the more numerous the prototypes are, the lower the share for each one. Let us use the value of this share to control the quantization accuracy. The samples are supposed to be drawn according to algorithm 10, and thus S is rather referred to as SM.
Equation (14.8) states that V_{Ω*κ}^{SM}(ω) is proportional to M. Let us denote by T the proportionality coefficient.

∀ω ∈ Ω*κ, V_{Ω*κ}^{SM}(ω) ≈ M T, T ∈ R+ (14.10)
The coefficient T can be viewed as a target, determined in advance in order to set the accuracy of the quantization. Once T is fixed, a targeted vector quantization process consists of finding the appropriate number of prototypes κ, knowing M, such that each of the measured V_{Ω*κ}^{SM}(ω), ω ∈ Ω*κ, is close to the value M T. This is what the VQ-T algorithm does (see algorithm 11). The use of the shortest confidence interval makes it possible to extract the "main" values from a collection (Guenther, 1969). The implementation of algorithm 11 is naive, since it can easily be improved by a dichotomic approach.
Algorithm 11 VQ-T(T, SM)
1: κ ← 1 // Start with a single prototype, [· · ·] denotes lists.
2: repeat
3:   Compute Ω*κ according to eq. (14.9) // Use ⟨your favorite VQ algorithm here⟩
4:   (a, b) = shortest confidence interval( [ V_{Ω*κ}^{SM}(ω) ]_{ω∈Ω*κ} , δ ) // Use δ = .5
5:   if T.M < a then
6:     κ ← κ + 1
7:   else if T.M > b then
8:     κ ← κ − 1
9:   end if
10: until T.M ∈ [a, b]
11: return Ω*κ
The value of T can be chosen by trial and error, but considering a geometrical interpretation is worth it. Let us consider X = [−0.5, 0.5]², the quadratic loss L(x, ω) = (x − ω)² and a uniform distribution (i.e. p = 1). Let us suppose that, in such a situation, the desired quantization accuracy consists of κ* prototypes. The prototypes are the elements of Ω*κ* obtained from a vector quantization algorithm. The quantization accuracy in figure 14.4, where p = 1, actually corresponds to κ* = 500. Let us consider the Voronoı̈ tessellation induced by the prototypes. The Voronoı̈ tessellation is the partition of X into κ* cells Cω def= { x ∈ X | hΩ*κ*(x) = ω }. Let us also consider Sω def= { x ∈ S | hΩ*κ*(x) = ω } similarly. It can be considered that the Voronoı̈
tessellation divides the area of X, which is 1 here, into κ* parts with the same surface, i.e. 1/κ* each. Let us approximate the shape of each cell Cω by a circle centered at ω. The radius r is such that the surface of the disk is the area of the cell, i.e. πr² = 1/κ*, i.e. r² = 1/(πκ*). The quadratic momentum µ of the disk3 is πr⁴/2. The variance µ is the momentum divided by the disk area, i.e. µ = (πr⁴/2)/(πr²) = r²/2 = 1/(2πκ*). The variance ν of any Sω approximates the variance µ. By definition, ν = V_{Ω}^{SM}(ω)/|Sω|. As p is uniform, there are exactly M samples in SM and they are equally shared among the Sω. Therefore, |Sω| ≈ M/κ*. So the variance can be re-written as ν ≈ κ* V_{Ω}^{SM}(ω)/M. Identifying µ with ν leads to µ ≈ κ* V_{Ω}^{SM}(ω)/M and thus V_{Ω}^{SM}(ω) ≈ µM/κ*. Identifying the latter expression with equation (14.10) leads to T = µ/κ*. As µ = 1/(2πκ*), we have

T = 1/(2πκ*²)

In the case of figure 14.4, since κ* = 500, we have T = 6.37 × 10⁻⁷ and thus, from equation (14.10) with M = 50000, V_{Ω}^{SM}(ω) ≈ 50000 × 6.37 × 10⁻⁷ ≈ 0.0318. This value actually lies within the darkened range in figure 14.4.
Let us now apply the algorithm 11 with the density p depicted in figure 14.5, with M = 50000
and T = 0.0318. This leads to κ = 343. The configuration is displayed in figure 14.7. Comparing
the upper-right regions of figures 14.7 and 14.6 shows that in figure 14.7, the accuracy is similar
to the desired one, i.e. figure 14.4.
Figure 14.7: Left: prototypes and at most 5000 samples. Middle: Voronoı̈ error map. Right: Voronoı̈ error histogram.
N (v) = {v 0 ∈ V | ev↔v0 ∈ E}
E (v) = {ea↔b ∈ E | a = v or b = v}
Extending the graph with a new vertex simply consists in adding a new element (that was not in V) to V.
One notation issue comes from the need to anchor values to both vertices and edges. In other words, we need these objects to have "attributes", in the programming sense of the term. This can be represented by functions. For example, as prototypes are handled by vertices in the algorithms, a function proto ∈ X^V is defined, such that ω = proto(v) is the prototype hosted by v. Edges may carry an age (an integer). In that case, the function age ∈ N^E enables us to define the age age(ev↔v′) of the edge ev↔v′. Some other attributes/functions may be used further on.
Last, attribute assignment means that the function is changed. It is denoted by ←. Changing the prototype handled by a vertex is thus denoted by proto(v) ← ω. It means that a new proto function is now considered. It is identical to the previous one, except that from now on, for the value v, proto(v) = ω.
The notation for the hypothesis space can be set from vertices, rather than from prototypes, for the sake of clarity in the algorithms. Indeed, let us denote:
ω = hΩ(x)
v = hV(x)
where v is the vertex hosting the prototype ω that is the closest to x.
Figure 14.8: Top: vector quantization of the input samples (grey dots) by 50 prototypes (blue dots). Bottom: Voronoı̈ tessellation (green). Each cell is the region where points are closer to the central prototype than to the other ones.
Figure 14.9: Top: Delaunay triangulation (Voronoı̈ tessellation dual). Middle: Masked Delau-
nay triangulation. Bottom: Approximated masked Delaunay triangulation obtained by CHL (see
algorithm 13).
a subgraph of the Delaunay triangulation that only has edges covering the manifold "well" (see middle of figure 14.9).
Computing Delaunay triangulations geometrically is feasible, but determining, from a set of samples, which edges actually belong to the masked Delaunay triangulation is not obvious. Moreover, many vector quantization algorithms build the masked Delaunay triangulation incrementally. In that context, the very simple competitive Hebbian learning algorithm (Martinez and Schulten, 1994) is the basis of these algorithms (see algorithm 13). It leads to the graph at the bottom of figure 14.9. As one can see, some edges are missing, since the graph obtained is not a triangulation.
Figure 14.10: Left: Voronoı̈ tessellation for nearly co-cyclic points. Right: The second order Voronoı̈ tessellation.
Missing edges often correspond to four points that almost lie on the same circle. Such points are depicted in figure 14.10. On the left, the Voronoı̈ tessellation is shown for prototypes A, B, C, D. Regions A and B, B and C, C and D, D and A, as well as A and C, share a common edge. The Delaunay triangulation is made of the 5 segments [AB], [BC], [CD], [DA], [AC]. On the right plot in figure 14.10, the second order Voronoı̈ tessellation is depicted (in red). Each point in a cell in that plot has the same two closest prototypes. In other words, in the CHL procedure (algorithm 13), if a tossed sample belongs to one of the second order Voronoı̈ cells, the corresponding edge of the Delaunay triangulation is created. It can be seen in figure 14.10-right that the creation of the edge AC is very unlikely with CHL, since the corresponding region, i.e. the central cell, is tiny for almost co-circular points. This explains why some edges are missing when comparing the middle and bottom plots in figure 14.9.
a nine in a picture, a topology preserving vector quantization may lead to a graph from which a
cycle and a tail can be extracted (see figure 14.11).
Figure 14.11: The pixels of the digit 9 can be structured as a cycle with pending tail on the right.
Chapter 15
Main algorithms
In this chapter, main vector quantization techniques are presented. Nevertheless, the reader should
keep in mind that lots of variations of those algorithms are available in the literature. An overview
of vector quantization algorithms can also be found in Fritzke (1997).
The reader is invited to refer to paragraph 14.3.1 for notations related to graphs.
15.1 K-means
15.1.1 The Linde-Buzo-Gray algorithm
The k-means algorithm (Lloyd, 1982; Linde et al., 1980; Kanungo et al., 2002) is certainly the most famous vector quantization algorithm, since it is implemented in any numerical processing framework. It considers a set of samples a priori and computes the position of k prototypes so that they minimize the distortion (see algorithm 14). The idea is to update the prototypes so that each of them is the mean of the samples that lie in its Voronoı̈ region.
It is a batch algorithm, since it works on a set given as one bulk of data, and k is a parameter that has to be determined by the user. Line 6 of algorithm 14 consists of cloning some existing prototypes. Here, cloning a vertex means creating a new vertex hosting a random prototype which is very close to the prototype of the initial vertex. The new vertex is added into V. The reaching of the stopping condition has been proven, but the result may be a local distortion minimum. Figures 14.4, 14.6 and 14.7 are obtained by the use of this algorithm.
Algorithm 14 k-means
1: Sample S = {x1 , · · · , xi , · · · , xN } according to p.
2: Compute ω1 as the mean of S.
3: V = {v} such as proto (v) = ω1 // Let us start with a single vertex
4: while |V| < k do
5: Select randomly n = min (k − |V| , |V|) vertices in V.
6: Clone these n vertices (the clones are added in V).
7: repeat
8: ∀ x ∈ S, label (x) ← hV (x)
9: ∀ v ∈ V, proto (v) ← mean{x∈S | label(x)=v} x
10: until No label change has occurred.
11: end while
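As a hedged sketch of the inner loop of algorithm 14 (the Lloyd iterations for the quadratic loss; the cloning/growing outer loop is omitted and all names are ours), assuming numpy:

import numpy as np

def lloyd(S, prototypes, n_iter=100):
    for _ in range(n_iter):
        d2 = ((S[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)                          # line 8: label(x) <- h_V(x)
        new = np.array([S[labels == j].mean(axis=0) if np.any(labels == j)
                        else prototypes[j] for j in range(len(prototypes))])
        if np.allclose(new, prototypes):                    # prototypes (hence labels) no longer change
            return new
        prototypes = new
    return prototypes

rng = np.random.default_rng(0)
S = rng.uniform(-0.5, 0.5, size=(5000, 2))
Omega = lloyd(S, S[rng.choice(len(S), size=20, replace=False)].copy())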
There is no real stopping criterion. After a while, the prototypes hosted by the vertices in V are placed so as to minimize the distortion1. Line 5 selects the vertex whose prototype is the closest to the input sample. This stage is a competition. Line 6 says that the winning vertex v* is the only one whose prototype is modified, consecutively to the computation of the current input. This is called a winner-take-all learning rule.
The update of the winning vertex (line 6) is performed by a low-pass first-order recursive filter. This computes each proto(v) as the mean of the input samples for which v hosted the closest prototype. The same idea motivates the Linde-Buzo-Gray algorithm, as stated previously. Increasing α makes the prototypes shake, whereas a smaller α leads to more stable positions. A good compromise could be to use a large α = 0.1 for the first steps, allowing the prototypes to roughly take their positions, and then use a much smaller α = 0.005.
Figure 15.1: Successive evolution steps of GNG from a 3D input sample distribution. Dots are the input samples. The graph is represented as a red grid, showing edges that intersect at vertices. Each vertex v is placed at the position of proto(v).
Figure 15.2: Successive evolution steps of Growing Grid from a 3D input sample distribution. The drawing convention is the one of figure 15.1.
As for algorithm 15, line 5 performs a competition, i.e. the selection of the vertex whose prototype is the closest to the current input. However, the learning stage, i.e. line 6, slightly changes in algorithm 17. Indeed, learning is applied to all the prototypes. The strength of learning is determined by the term αh(.). It corresponds to a modulation of the learning rate α. The function h ∈ [0, 1]^{R+} has to be a decreasing function such that h(0) = 1. As h(ν(v*, v)) is used at line 6, the modulation is the highest when ν(v*, v) = 0, i.e. when v = v* is considered. The modulation decreases for the neighbors of v* in the graph, but h(ν(v*, v)) is still high for them. For vertices v that are far, in the graph, from v*, h(ν(v*, v)) is weak and the learning has no significant effect. To sum up, as opposed to algorithm 15, v* is not the only vertex whose prototype is modified by the current input sample, since its close neighbors also learn. This is called a winner-take-most learning scheme.
Note that a function h such that h(0) = 1 and h(.) = 0 otherwise makes algorithm 17 identical to algorithm 15.
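As a hedged sketch of one winner-take-most update (in the spirit of line 6 of algorithm 17, assuming numpy), for a SOM whose vertices form a 1-D chain so that ν(v*, v) is simply |v* − v|; the triangular neighbourhood function h and all names are ours:

import numpy as np

def som_step(prototypes, x, alpha=0.05, r=3):
    d2 = ((prototypes - x) ** 2).sum(axis=1)
    winner = d2.argmin()                                  # competition (line 5)
    nu = np.abs(np.arange(len(prototypes)) - winner)      # graph distance to the winner
    h = np.maximum(0.0, 1.0 - nu / r)                     # h(0) = 1, decreasing, 0 beyond r
    prototypes += alpha * h[:, None] * (x - prototypes)   # winner-take-most learning
    return prototypes

rng = np.random.default_rng(0)
protos = rng.uniform(-0.5, 0.5, size=(20, 2))
for x in rng.uniform(-0.5, 0.5, size=(10000, 2)):
    protos = som_step(protos, x)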
Figure 15.3: SOM applied to a coronal input distribution. The drawing convention is the one of figure 15.1. See text for details.
Figure 15.3 shows the results. In figure 15.3-left, r = 15 is used (see equation (15.1)). A wide area of prototypes around the winner actually learns. This has an averaging effect: all the prototypes tend to be attracted to the mean of the input samples. Nevertheless, it can be observed that the grid is "unfolded". In figure 15.3-middle, r = 3 is used. The averaging effect is weaker, and the prototypes cover the input sample distribution better. The drawback is that the map is "folded". In figure 15.3-right, we used first r = 15, and then r = 3. This leads to both an unfolded and nicely covering prototype distribution. Nevertheless, as figure 15.3-right shows, some prototypes (the middle ones in the figure) lie outside the distribution, because the map elasticity pulls them in opposite directions. Such prototypes are sometimes called dead units.
The h function in the literature is often presented as a Gaussian with a slowly decreasing variance, which complicates the formulation of the SOM algorithm. Indeed, simpler h can be used, and the progressive decay can be reduced to a few values, the first ones with a wide h expansion and the last ones with a narrower h.
Once the map is correctly unfolded over the input samples, the following holds: two prototypes that are close according to ν are also close according to L. The converse is false (see figure 15.4).
Be careful with the input sampling. It has to be random. For example, in figure 15.3-right, submitting examples line by line, from the top to the bottom and from the left to the right within each line, would have led to a bad unfolding.
Figure 15.4: When the input manifold dimension is higher than the one of the map. On the figures,
the blue and green prototypes are close according to L but far according to ν. Left and middle:
The graph is a ring. Right: the graph is a grid. Drawing convention is the one of figure 15.1.
15.3.3 Example
Let us consider the case of written digits. Input samples are 28 × 28 gray-scaled images on which digits are written. Input samples are thus x ∈ X = [0, 255]784. As a loss function L, we do not use directly the Euclidean distance in R784 (see section 14.1.2). Rather, when we compare proto (v) to x, we first blur the images and then compute the Euclidean distance between the blurred objects.
In this example, the input samples lie in a manifold in R784. Visualizing this manifold is not easy. As we force the topology to be 2D, since we use a grid for connecting the vertices, we are able to represent the hosted prototypes as a 2D grid on a screen, displaying at each grid position the prototypical image. Recognition can be performed as follows: ask an expert to label each vertex according to its prototype (their number is finite). When a handwritten digit needs to be labeled, find the vertex hosting the closest prototype in the map and give its label to the input.
Another, and maybe more fundamental, aspect of the map in figure 15.5 is that it represents the distribution of all the input digits over the surface of the screen, trying to place the prototypes such that the ones that are close on the screen are actually close digits in R784. This is an example of using self-organizing maps as non-linear projections for visualizing high dimensional data.
Neural networks
Chapter 16
Introduction
What is a neural network? A neural network is basically a set of interconnected units (i.e. a graph of units), having inputs and outputs, each of which computes a pretty simple function of its inputs. The idea of studying a network of interconnected units performing a rather basic computation originates from (McCulloch and Pitts, 1943), which introduces a simple model of a neuron with several inputs xi feeding the neuron through weighted connections of weight wi. The weighted sum of the input contributions Σ_i wi xi provides the pre-activation of the neuron, from which its output is computed with a Heaviside transfer function h(x) = 1x≥0:

y = h( Σ_i wi xi )
h(x) = 0 if x < 0, 1 otherwise
The neuron model of (McCulloch and Pitts, 1943) was not equipped with learning rules allowing its weights to be adapted. As we shall see in the next chapters, various improvements were found, ultimately leading to trainable neural networks.
Even if the first motivation was to model how the brain works, we shall prefer speaking about units rather than neurons, as speaking about neurons tends to insist too much on a relationship with biological neurons. Definitely, biological neurons inspired (and still inspire) the design of neural networks, but neural networks can be considered as a specific structure of predictors in machine learning on their own, without having to refer to any biological motivations to justify their study.
If we denote x the inputs of a unit and y its output, a prototypical neural network unit links
the inputs to the output through a non-linear function f applied to a linear combination of the
inputs :
a = wT x + b
y = f (a)
where f is a so-called transfer function, w a set of weights, b a bias and a the pre-activation, which is introduced for convenience. The transfer function linking the pre-activation and the output (or activation) of the unit can take different forms; figure 16.1 plots some commonly chosen transfer functions.
While the hyperbolic tangent and sigmoid were common choices for the transfer function, it turns out that the softplus and rectified linear units bring interesting results in terms of performance of the learned predictor and speed of learning (Nair and Hinton, 2010; Zeiler et al., 2013). The ReLU is really quick to evaluate, contrary to transfer functions involving exponentials! It also behaves quite favourably when having to differentiate it, as we shall see later in the chapter. These transfer functions are plotted on figure 16.1. There are also population-based transfer functions where the output of a unit actually depends on the pre-activation of a collection of other units. A popular example is the softmax function. If we consider a population of units for which we denote ai the pre-activations and yi the outputs, the softmax computes the outputs as:

∀i, yi = exp(ai) / Σ_j exp(aj)
The softmax is especially used in the context of learning a classifier, as the softmax transfer function constrains the activations to lie in the range [0, 1] and to sum up to 1. The softmax also induces a competition among the interconnected units: if one unit raises its activation, then, due to the normalization constraint, it necessarily induces a drop of the activation of at least one of the other units.
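A small NumPy sketch of the softmax (shifting the pre-activations by their maximum, a standard trick that leaves the result unchanged while avoiding overflow in the exponentials):

import numpy as np

def softmax(a):
    """Softmax over a vector of pre-activations a."""
    # Subtracting max(a) does not change the result but avoids overflow.
    e = np.exp(a - np.max(a))
    return e / np.sum(e)

y = softmax(np.array([1.0, 2.0, 3.0]))
print(y, y.sum())  # the outputs lie in [0, 1] and sum up to 1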
Figure 16.1: Classical transfer functions; a) hyperbolic tangent f(a) = tanh(a), b) sigmoid f(a) = 1/(1 + exp(−a)), c) rectified linear unit f(a) = max(a, 0) = [a]+, d) softplus f(a) = log(1 + exp(a)).
Up to now, we especially focused on the type of computation that a single unit is performing. The topology of the network is also a distinguishing feature of neural networks (fig 16.2). Some neural networks are acyclic or feedforward; you can, say, identify the leaves with the inputs and the root with the output if we take the convention that information flows from the leaves up to the root. In general, we can group the units into layers and therefore speak about the input layer, the output layer, and the hidden layers in between. The hidden layers are so called because they contain the units whose values you actually do not know, while the inputs and outputs are provided by the dataset in a supervised learning problem. Actually, nothing prevents you from considering an architecture where the units have connections to a layer that is not the next one, with so-called skip-layer connections. In particular, if one knows that the output contains some linear dependencies on the input, it could be beneficial to add such skip-layer connections. These connections do not actually enhance the expressiveness of the architecture but slightly push the network into the right direction when learning comes into play.
When the data has a hierarchical structure, some neural networks such as recursive and recurrent neural networks are more appropriate. With recursive neural networks, the same network is evaluated on children to compute the representation of a parent. The children can actually be inputs from the dataset or could also be some parent representations. Recursive neural networks are appropriate when dealing with data that actually have a hierarchical structure, such as in natural language processing. In a recurrent neural network, cycles within the network are introduced. These cycles produce a memory effect in the network, as the activations of the units depend not only on the current input but also on the previous activations within the network. This type of network is particularly suited for data such as time series.
Having briefly sketched what neural networks are, in the next chapters we go in detail through a variety of neural network architectures, especially focusing on how we train them, i.e. how we find optimal parameters given a regression or classification problem, and what they are good for. Classical books on neural networks include (Bishop, 1995; Hertz et al., 1991).
Figure 16.2: a) A feedforward neural network is an acyclic graph of units. b) A recursive neural network is applied recursively to compute the parent representation of two children, these two children being possibly parent nodes themselves. c) A recurrent neural network is fed with a sequence of inputs, the network itself possibly containing cycles.
(Schmidhuber, 2015; Bengio et al., 2013) recently reviewed the history of the ideas in the neural network community and pointed out as well recent trends in the field. There is also the book of (Bengio et al., 2015) that is, at the time of writing these lines, a work in progress written by researchers from the university of Montreal (Y. Bengio), one of the top leading research groups in neural networks with the university of Toronto (G. Hinton) and the IDSIA research group (J. Schmidhuber). The online book of Michael Nielsen (http://neuralnetworksanddeeplearning.com/) is also a good reference.
Chapter 17
Feedforward neural networks
Figure 17.1: A perceptron is an acyclic graph with an input or sensory layer x, an association layer a and an output or result layer r. The association layer activities are computed with predefined basis functions φi. The weights between the association and result layers are trainable (or plastic).
The outputs of the perceptron are computed in two steps: 1) a linear combination of the activities in the association layer defining the pre-activation, 2) a transfer function g applied on the pre-activation:

∀i, ri = g( Σ_j wj,i aj + bi ) = g( Σ_j wj,i φj (x) + bi )
1 Actually, F. Rosenblatt introduces the layers as a Sensory layer, an Association layer and a Result layer and builds up
where g is applied element-wise. One can write the above formula even more compactly by adding an extra constant basis function φb (x) = 1 with extra weights to encompass the bias vector b. We would then consider a weight matrix in R(na+1)×nr and the vector of basis functions would contain na + 1 entries, with one entry set to 1.
The perceptron was introduced in the context of binary classification, in which case the transfer function g is taken to be the sign function:

g(x) = −1 if x < 0, +1 if x ≥ 0
As a last note to finish the introduction of the perceptron, it is actually quite reductive to summarize the contribution of (Rosenblatt, 1962) to the study of the S-A-R architecture for binary classification in this way, as he also studied variants of this architecture; the interested reader is referred to the original book2. In the next sections, we present how one can learn the weights between the association and result layers.
Geometrical interpretation
We can have a geometrical understanding of how the perceptron and its learning rule work. Suppose we have a set of transformed input vectors φ (xi) ∈ Rna+1 and associated labels yi ∈ {−1, 1}. Consider the space Rna+1 to which the transformed inputs φ (xi) as well as the weights w of the perceptron belong. We can associate a hyperplane to each transformed input vector φ (xi), defined by the following equation:

vT φ (xi) = 0

This hyperplane splits the space Rna+1 into two regions: one in which vT φ (xi) < 0 and one in which vT φ (xi) ≥ 0. Consider the case where an input is correctly classified (fig. 17.2). If the input vector is positive (yi = 1), then it means that both the weight vector w and the transformed input φ (xi) belong to the same half space. If the input vector is negative (yi = −1), then it means that the weight vector w and the transformed input φ (xi) do not belong to the same half space.
Figure 17.2: Case when a perceptron correctly classifies an input xi. If the input is positive (yi = 1), both the weight and transformed input belong to the same half space. If the input is negative (yi = −1), the weight vector and transformed input do not belong to the same half space. The grey region indicates the half space in which a weight vector would misclassify the input.
Now, consider the case where the perceptron is misclassifying an input (fig. 17.3). If the input is positive (yi = 1), the weight vector and transformed input do not belong to the same half space. In order to correctly classify the input, they should actually belong to the same half space. In this case, the perceptron learning rule updates the weights as w + φ (xi), which brings, at least in our example, the weight vector into the correct half space. If the input is negative (yi = −1), both the weight and transformed input belong to the same half space, while they should not in order to correctly classify the input. In this case, the perceptron learning rule updates the weights as w − φ (xi), which brings, at least in our example, the weight vector into the complementary half space where it should lie in order to correctly classify the input.
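Both cases boil down to the single update w ← w + yi φ (xi) applied whenever xi is misclassified. A minimal sketch of this learning rule, taking φ (x) = x extended with a constant 1 for the bias (the function and variable names are ours), could be:

import numpy as np

def perceptron_train(X, y, n_epochs=100):
    """Perceptron learning rule. X: (N, d) inputs, y: (N,) labels in {-1, +1}."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend a constant 1 (bias)
    w = np.zeros(Xb.shape[1])
    for _ in range(n_epochs):
        errors = 0
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:        # misclassified (or on the boundary)
                w += yi * xi              # w + phi(x) if yi = +1, w - phi(x) if yi = -1
                errors += 1
        if errors == 0:                   # every sample correctly classified: stop
            break
    return w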
Figure 17.3: Case when a perceptron misclassifies an input xi. If the input is positive (yi = 1), the weight vector and transformed input do not belong to the same half space. If the input is negative (yi = −1), both the weight and transformed input belong to the same half space. The grey region indicates the half space in which a weight vector actually misclassifies the input.
Let us now consider the case where we have two inputs x1 and x2, respectively positive and negative (y1 = 1, y2 = −1). Drawing the hyperplanes defined by vT φ (x1) = 0 and vT φ (x2) = 0 delineates a subspace of Rna+1 in which a weight vector should lie in order to correctly classify the two input vectors, a cone of feasible solutions. It is actually not necessary that such a region exists. Indeed, if, for example, we add an extra input x3 to the two-input example we just considered, such that x3 = x2 − x1, and set y3 = +1, the cone of feasible solutions vanishes. We will go back to this question of feasibility in section 17.1.3.
Figure 17.4: Considering two inputs x1 and x2, respectively positive and negative (y1 = 1, y2 = −1). In order to correctly classify the two inputs, the weight vector must belong to the white region, the cone of feasible solutions.
Linear separability
As we shall see in the next sections, the perceptron algorithm can only solve a particular class of classification problems, which are called linearly separable.
Definition 17.1 (Linear separability). A binary classification problem (xi, yi) ∈ Rd × {−1, 1}, i ∈ [1..N] is said to be linearly separable if there exists w ∈ Rd such that:

∀i, sign(wT xi) = yi
The exact values of the outputs, whether {0, 1} or {−1, 1}, do not actually matter in the above definition. To illustrate the notion of linear separability, consider a binary classification problem with binary inputs, i.e. boolean expressions. Consider two inputs x1, x2 in {0, 1} and one output y ∈ {−1, 1}. The boolean expressions and (x1, x2) and or (x1, x2) are both linearly separable, as shown on fig. 17.5.
Figure 17.5: The AND and OR boolean functions are linearly separable as a line in the input
space can be defined in order to place the positive inputs on one side and the negative inputs on
the other side.
Not all the binary classification problems are linearly separable. One famous example is the
XOR function depicted on fig. 17.6 for which there is no way to define a line separating the positive
and negative inputs.
Figure 17.6: The XOR boolean function is not linearly separable as no line can be defined to split
the positive and negative inputs.
One may wonder how many linearly separable functions with discrete inputs and outputs exist, or even generalize and wonder about the probability that a randomly picked classification problem with real inputs is linearly separable. Actually, it turns out that everything depends on the ratio between the number of data points N and the dimensionality of the input d. If N < d, any labelling of the inputs can be linearly separated. The probability of getting a linearly separable problem then quickly drops as the number of samples gets larger than the number of dimensions (Cover, 1965). In case a classification problem is linearly separable, the perceptron learning rule can be shown to converge to a solution in a finite number of steps. Without loss of generality, we will consider a problem linearly separable in the input space. When introducing the perceptron, we mentioned using transformed inputs by introducing basis functions φi, and we could consider a linearly separable classification problem in the transformed input space. However, as the basis functions are predefined, it is absolutely equivalent to consider that a problem is linearly separable in an input space, whatever this input space is (“raw” or transformed). The perceptron convergence theorem states:
Theorem 17.1 (Perceptron convergence theorem). A classification problem (xi , yi ) ∈ Rd ×{−1, 1}, i ∈
[1..N ] is linearly separable (def 17.1) if and only if the perceptron learning rule converges to an
optimal solution in a finite number of steps.
Proof. Consider a linearly separable binary classification problem (xi, yi) ∈ Rd × {−1, 1}, i ∈ [1..N]. By definition, there exists ŵ such that:

∀i, yi ŵT xi > 0

Necessarily, ‖ŵ‖ > 0. Let us denote wt the weight after having updated t misclassified inputs and xt, yt the t-th misclassified input/output, and let us suppose that there exists an infinite sequence of misclassified input/output pairs3; otherwise, the proof ends immediately. For any t > 0, since the input/output pair xt, yt was misclassified with the weights wt−1, it means (wt−1)T xt yt < 0. The sequence of weights wk after k updates using the perceptron learning rule will be:

w1 = w0 + y1 x1
w2 = w1 + y2 x2
...
wk = wk−1 + yk xk
Taking k > 0 and summing all the above equations leads to wk − w0 = Σ_{i=1}^{k} yi xi. Let us compute the scalar product with ŵ (one solution to the linear separation):

ŵT (wk − w0) = Σ_{i=1}^{k} yi ŵT xi

Since the problem is by hypothesis linearly separable, ∀i, yi ŵT xi > 0. Let us denote tm = min_{i∈[1,N]} yi ŵT xi > 0. Therefore, we end up with:

ŵT (wk − w0) ≥ k tm > 0
Using the Cauchy-Schwarz inequality4, we get:

‖ŵ‖ ‖wk − w0‖ ≥ ŵT (wk − w0) ≥ k tm
⇒ ‖wk − w0‖ ≥ k tm / ‖ŵ‖
⇒ ‖wk‖ ≥ −‖w0‖ + k tm / ‖ŵ‖

Note that tm / ‖ŵ‖ is a constant depending only on the dataset and a fixed solution ŵ. Therefore, ‖wk‖ is lower bounded by a linear function of the number of misclassified input/output pairs k. This is a first point. Let us now focus on upper bounding the norm of wk:
∀k > 0, wk = wk−1 + yk xk
⇒ ‖wk‖² = ‖wk−1‖² + ‖yk xk‖² + 2 (wk−1)T yk xk

Remember that the input/output pair xk, yk is the k-th misclassified input/output pair, meaning (wk−1)T xk yk < 0, and therefore:

∀k > 0, ‖wk‖² < ‖wk−1‖² + ‖yk xk‖²
⇒ ‖wk‖² − ‖w0‖² = Σ_{i=0}^{k−1} ( ‖wi+1‖² − ‖wi‖² ) < Σ_{i=0}^{k−1} ‖yi+1 xi+1‖²
⇒ ‖wk‖² < ‖w0‖² + k tM

with tM = max_{i∈[1,N]} ‖yi xi‖². The latter implies ‖wk‖ < √(‖w0‖² + k tM). That is the second point. We therefore demonstrated that:

∀k, −‖w0‖ + k tm / ‖ŵ‖ ≤ ‖wk‖ < √(‖w0‖² + k tM)

with tm = min_{i∈[1,N]} yi ŵT xi > 0 and tM = max_{i∈[1,N]} ‖yi xi‖².
4 For any vector space E with a scalar product denoted (u.v) (a pre-Hilbert space), |(u.v)|² ≤ (u.u)(v.v)
In the lower bound, we have a linearly increasing function of k. In the upper bound, we have a function increasing as √k. Necessarily, there is a finite value of the number of misclassified input/output pairs k at which the two curves cross, after which the inequality cannot hold anymore. This raises a contradiction and leads to the conclusion that there cannot be an infinite sequence of misclassified input/output pairs, and therefore the perceptron algorithm converges. We therefore demonstrated that if the classification problem is linearly separable, then the perceptron learning rule converges in a finite number of updates.
Given the equivalence, we can then also state that, in case the classification problem is not linearly separable, the perceptron algorithm will never converge since, otherwise, the classification problem would have been linearly separable.
More on perceptrons
While we demonstrated the convergence of the perceptron learning rule, we did not say much about the rate of convergence. The learning rule we consider, and the associated algorithm which picks the input/output pairs one after the other, is not the algorithm that provides the fastest rate of convergence. There are variants of the perceptron learning rule with improved rates of convergence (Gallant, 1990; Muselli, 1997; Soheili and Pena, 2013; Soheili, 2014).
There are also extensions of the perceptron using kernels. As one may note, the weights of the perceptron are always a weighted (by the labels) sum of the input samples:

w = Σ_{i∈I} yi xi

where I is the set of misclassified inputs that we encounter during learning. At some point in time, in order to test the prediction of the perceptron, we simply compute the dot product of the weights with the vector x to test:

w.x = Σ_{i∈I} yi xi.x

and test the sign of w.x to decide whether x belongs to the positive or negative class. Given that the computations are expressed only in terms of dot products, one can extend the algorithm using kernels as in (Freund and Schapire, 1999). Given a mapping function ϕ of our input space into a so-called feature space Φ:

ϕ : Rd → Φ, x ↦ ϕ (x)

As before, testing an input x (which is also a step during learning) would imply computing the dot product of the weights with the input, now projected in the feature space ϕ (x):

w.ϕ (x) = Σ_{i∈I} yi ϕ (xi).ϕ (x) = Σ_{i∈I} yi k (xi, x)
where k is a kernel (see chapter 10 for more details). For example, we show on fig 17.7 an example of binary classification, using RBF kernels with σ = 0.3, where the perceptron is trained with the perceptron learning rule. Each class contains 100 samples and convergence was actually obtained by iterating only two times over the training set. Please note that the point of this illustration is to illustrate the application of the perceptron; it is clear that such a classifier does not possess a large margin around the classes, which might result in bad generalization. However, the interested reader can read (Freund and Schapire, 1999), where the voted-perceptron algorithm is introduced, a modification of the perceptron algorithm with guaranteed margins.
Figure 17.7: Application of the perceptron learning rule with RBF kernels (σ = 0.3) with 100
samples for both the positive and negative classes. Convergence was obtained in two iterations
over the training set.
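As a sketch (not the exact implementation behind figure 17.7), such a kernel perceptron can be written by storing the misclassified samples and their labels instead of an explicit weight vector; the Gaussian kernel and σ = 0.3 follow the figure, while the function names are ours:

import numpy as np

def rbf_kernel(a, b, sigma=0.3):
    """Gaussian (RBF) kernel k(a, b) = exp(-|a - b|^2 / (2 sigma^2))."""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def kernel_perceptron_train(X, y, n_epochs=10, sigma=0.3):
    """X: (N, d) inputs, y: (N,) labels in {-1, +1}. Returns the stored samples and labels."""
    support, labels = [], []
    for _ in range(n_epochs):
        for xi, yi in zip(X, y):
            # w.phi(x) = sum_i y_i k(x_i, x) over the stored misclassified samples
            score = sum(yj * rbf_kernel(xj, xi, sigma) for xj, yj in zip(support, labels))
            if yi * score <= 0:          # misclassified: store this sample
                support.append(xi)
                labels.append(yi)
    return support, labels

def kernel_perceptron_predict(support, labels, x, sigma=0.3):
    score = sum(yj * rbf_kernel(xj, x, sigma) for xj, yj in zip(support, labels))
    return 1 if score >= 0 else -1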
There are two possibilities to solve this optimization problem. The first possibility is a batch
method where all the samples are considered and this least mean square problem can actually be
solved analytically by computing its derivative with respect to w and setting it to zero.
Denoting x̃i = (1; xi) the input xi extended with a leading 1 (for the bias), setting the derivative to zero reads:

∀j, ∂RSemp/∂wj (w) = 0
⇔ −(2/N) Σ_{i=1}^{N} (yi − wT x̃i) x̃i = 0
⇔ Σ_{i=1}^{N} (wT x̃i) x̃i = Σ_{i=1}^{N} yi x̃i
Let us now introduce the vector y with components Yi = yi and the matrix X = (x̃1 x̃2 · · · x̃N) ∈ R(d+1)×N, whose columns are the extended inputs x̃i. We can then rewrite Σ_{i=1}^{N} yi x̃i = Xy. For the left-hand side term:

∀j ∈ [1, d + 1], ( Σ_{i=1}^{N} (wT x̃i) x̃i )j = Σ_{i=1}^{N} Σ_{k=1}^{d+1} wk (x̃i)k (x̃i)j
= Σ_{k=1}^{d+1} wk Σ_{i=1}^{N} Xk,i Xj,i
= Σ_{k=1}^{d+1} wk Σ_{i=1}^{N} Xj,i (XT)i,k
= Σ_{k=1}^{d+1} (XXT)j,k wk

And therefore:

Σ_{i=1}^{N} (wT x̃i) x̃i = (XXT) w

so that the optimality condition becomes

(XXT) w = Xy    (17.1)
which is known as the normal equations. If the matrix XXT is not singular, the solution to the least square problem is:

w = (XXT)−1 Xy

and (XXT)−1 X ∈ R(d+1)×N is actually the Moore-Penrose pseudo-inverse of X. In case the matrix XXT is not invertible, the solution to the least square problem is not unique. One can then find the solution w with the minimal norm. It turns out that this can be computed from the Singular Value Decomposition (SVD) of X. The SVD of X ∈ R(d+1)×N is X = UΣVT, with U ∈ R(d+1)×(d+1) and V ∈ RN×N two orthogonal matrices (U−1 = UT, V−1 = VT) and Σ a diagonal matrix with non-negative elements (some can be equal to zero depending on the rank of the matrix X). The minimal norm solution to the least square problem is then given by (17.2) (Lawson and Hanson, 1974).

w = (VΣ+ UT) y    (17.2)
with Σ+ defined element-wise as:

Σ+i,i = 1/Σi,i if Σi,i ≠ 0, and 0 otherwise.
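A small NumPy sketch of both routes (normal equations when XXT is invertible, minimal-norm least squares otherwise), with X built column-wise from the extended inputs as above; note that np.linalg.lstsq computes the minimal-norm solution through an SVD:

import numpy as np

def least_squares_weights(X_raw, y):
    """X_raw: (N, d) inputs, y: (N,) targets. Returns w in R^(d+1) (bias first)."""
    N = X_raw.shape[0]
    X = np.vstack([np.ones((1, N)), X_raw.T])     # (d+1, N), one extended input per column
    G = X @ X.T                                   # the (d+1, d+1) matrix X X^T
    if np.linalg.matrix_rank(G) == G.shape[0]:
        # Normal equations (17.1): (X X^T) w = X y
        return np.linalg.solve(G, X @ y)
    # Otherwise: minimal-norm least squares solution, computed via the SVD.
    return np.linalg.lstsq(X.T, y, rcond=None)[0]

# usage: y_i is approximated by w[0] + w[1:] @ x_i
w = least_squares_weights(np.random.randn(50, 3), np.random.randn(50))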
It might not be convenient to solve the optimization problem in a single shot, as it requires computing a pseudo-inverse whose size grows with the number of samples. Also, the previous method is a batch method and requires all the samples to be available to compute the optimal solution for the weights. An alternative is to update the parameters w online, one sample at a time. One simple approach is then to compute the gradient of the loss and to perform a so-called steepest descent or gradient descent. The derivative can be taken considering the whole training set (gradient descent) or only one sample at a time (Stochastic Gradient Descent, SGD). If we consider one sample at a time to make the updates online, it reads:

∀i, ∇w L (yi, fw (xi)) = d/dw ‖yi − fw (xi)‖² = −2 dfw/dw (xi) (yi − fw (xi)) = −2 xi (yi − fw (xi))
We can then update the weights according to a fraction of the steepest descent. Therefore, at time t, after observing the input/output pair xi, yi, the weights would be updated according to:

wt+1 = wt − (α/2) ∇w L (yi, fw (xi)) = wt + α xi (yi − fw (xi))

This is actually the so-called delta rule or Widrow-Hoff rule as considered by (Widrow and Hoff, 1962). Even if (Widrow and Hoff, 1962) originally considered binary classification with their architecture, using a linear transfer function and a quadratic loss, as we will see in section 17.1.4, different combinations of transfer function and loss are considered depending on the type of problem to be solved (regression or classification).
17.1.3 Limitations
As explained in the previous section, any neural network in which only the last layer contains trainable weights with a binary transfer function can only solve linearly separable binary classification problems. One of the famous examples that had a strong negative impact on the research efforts in neural networks is the XOR binary function. The XOR classification problem with two inputs x1, x2 is indeed not linearly separable in the (x1, x2) space. However, if we transform the inputs and work in the (x1 x̄2, x̄1 x2) space, the problem becomes linearly separable (fig. 17.8). The question is then to determine how the inputs should be transformed so that the problem becomes linearly separable. In other words, how one might learn appropriate features computed from the inputs so that a classification (or regression) problem becomes solvable.
Figure 17.8: The XOR boolean function x1 ⊕ x2 = x1 x̄2 + x̄1 x2 is not linearly separable in the (x1, x2) space but becomes linearly separable when projected into the (x1 x̄2, x̄1 x2) space.
In section 17.1.1, we saw an example of a perceptron with appropriately chosen basis functions which performs a non-linear classification. In section 17.2, we study a particular type of “single-layer” neural network, the radial basis function networks, in which an appropriate choice of features computed from the inputs allows solving non-linear regression. Actually, the limitation of these networks is not that the perceptron can only represent linearly separable problems; the true question is how to learn the appropriate features, and this is what we will address in the section on multilayer perceptrons.
In the previous sections, we introduced both the perceptron and Adaline networks from a historical perspective, in the sense that our presentation sticks to the architectures introduced respectively by (Rosenblatt, 1962) and (Widrow and Hoff, 1962). We now inspect the question of single layer neural networks from a different perspective by considering which architecture one might use in order to solve regression or classification problems.
Regression
Suppose we are given a monodimensional regression problem S = {(xi, yi) ∈ Rd × R, i ∈ [1..N]}. In that case, one would use a linear transfer function g(x) = x and a quadratic loss, i.e.:

L (y, fw (x)) = ‖y − fw (x)‖²
fw (x) = wT (1; x)

The empirical risk to be minimized therefore reads:

RSemp = (1/N) Σ_{i=1}^{N} ‖ yi − wT (1; xi) ‖²
As detailed in section 17.1.2, the empirical risk can be minimized in batch mode, using all the training set: the optimal weights w⋆ ∈ Rd+1 are given by solving a linear least square problem, i.e. by the equations (17.1) or (17.2), depending on whether or not XXT is invertible, with X = (x̃1 x̃2 · · · x̃N) the matrix of extended inputs. We remind the previous results for completeness:

w⋆ = (XXT)−1 Xy if XXT is invertible, and (VΣ+ UT) y (the minimal norm solution) otherwise.
The second possibility to optimize the weights w is to perform learning online with stochastic gradient descent. One can perform the gradient step using one sample at a time (stochastic gradient), all the samples (batch gradient)5, or a mini-batch gradient considering only a part of the samples at every iteration. For the stochastic gradient descent, given some initial weights w0, the update rule is:

wt+1 = wt − (α/2) ∇w L (yi, fw (xi)) = wt + α xi (yi − fw (xi))

where α is a learning rate to be defined (pretty small if fixed, i.e. α ≈ 10−2, 10−3, or adaptive as we will see later in this chapter). Mini-batches can be meaningful if you use parallel processors (e.g. GPUs), as you actually compute the gradient for several samples with the same weights and can then use the parallelism of the hardware more efficiently.
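A minimal sketch of this online delta rule for the linear regression case (one pass per epoch over randomly ordered samples, fixed learning rate α; all names are ours) could look like:

import numpy as np

def sgd_linear_regression(X, y, alpha=1e-2, n_epochs=100):
    """X: (N, d) inputs, y: (N,) targets. Returns w in R^(d+1) (bias first)."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # extended inputs (1; x_i)
    w = np.zeros(Xb.shape[1])
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):            # random sampling order matters
            # delta rule: w <- w + alpha * x_i * (y_i - f_w(x_i))
            w += alpha * Xb[i] * (y[i] - Xb[i] @ w)
    return w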
Binary classification
Let us now consider binary classification problems: we are given S = {(xi, yi) ∈ Rd × {0, 1}, i ∈ [1..N]}. For learning a classifier, one can actually devise several architectures and associated learning algorithms, but some are more appropriate than others. The first option we consider is to use the logistic transfer function6 g(x) = 1/(1 + exp(−x)), which allows interpreting the output as the conditional probability of belonging to one of the classes (as g(x) ∈ [0, 1]) given an input. In this situation the quadratic loss is not appropriate (see at the end of this paragraph why) and the cross-entropy loss should be preferred:

L (y, ŷ) = −y ln (ŷ) − (1 − y) ln (1 − ŷ)
RSemp (w) = −(1/N) Σ_{i=1}^{N} ( yi ln (fw (xi)) + (1 − yi) ln (1 − fw (xi)) )
fw (x) = g( wT (1; x) )
g(x) = 1/(1 + exp(−x))
5 Please note that it is actually meaningless to perform a batch gradient in this situation, as the optimal weights can be solved for analytically. For multilayer perceptrons, it makes much more sense.
6 In practice, (LeCun et al., 1998) suggests using a scaled hyperbolic tangent transfer function g(x) = 1.7159 tanh (0.6666 x)
As y ∈ {0, 1} and ŷ ∈ [0, 1], we can note that L (y, ŷ) ≥ 0. Also, ∀y ∈ {0, 1}, L (y, ŷ) = 0 ⇔ ŷ = y. We will now compute the gradient of the loss with respect to the weights. A few preliminaries will be helpful:

∀x, g′(x) = exp(−x) / (1 + exp(−x))² = (1/(1 + exp(−x))) (1 − 1/(1 + exp(−x))) = g(x)(1 − g(x))
∀y, ∀ŷ, ∂L/∂ŷ (y, ŷ) = (ŷ − y) / ( ŷ (1 − ŷ) )

Let us denote x̃i = (1; xi). We then get:

∀i, ∂L/∂w (yi, g(wT x̃i)) = x̃i g′(wT x̃i) ∂L/∂ŷ (yi, g(wT x̃i)) = x̃i ( g(wT x̃i) − yi )
wt+1 = wt − α ∇w L (yi, fw (xi)) = wt + α x̃i ( yi − g(wT x̃i) )

which is actually very similar to the update obtained when considering a linear transfer function with a quadratic loss for the regression problem we considered previously. The transfer function was taken here to be the logistic function σ(x) = 1/(1 + exp(−x)). If we were to use, for example, the hyperbolic tangent tanh(x) = (exp(x) − exp(−x)) / (exp(x) + exp(−x)), one would have to adapt the loss accordingly, taking into account the fact that the hyperbolic tangent is linearly linked to the logistic as tanh(x) = 2σ(2x) − 1. The outputs must also be defined in {−1, 1}.
What is going on if, rather than the cross-entropy loss, we take the quadratic loss but still with the logistic transfer function? Performing the computation, we get:

∀i, d/dw ‖yi − g(wT x̃i)‖² = −2 x̃i ( yi − g(wT x̃i) ) g′(wT x̃i)

We see that the gradient of the quadratic cost keeps a g′(wT x̃i) term, which was cancelled out when using the cross-entropy loss. The issue we then encounter is when, for an input xi, we get g(wT x̃i) ≈ 0 or g(wT x̃i) ≈ 1, where the derivative of the logistic function is close to zero. This is the case for example when an input is misclassified and the initial weights are sufficiently strong to bring the logistic function into its saturated part. In this case, the gradient is really flat and it will take quite a long time for the parameters to escape from this region.
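As a sketch, the online update for the logistic transfer function with the cross-entropy loss (the same delta-rule form as above, with labels in {0, 1}; all names are ours) could read:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sgd_logistic_regression(X, y, alpha=0.1, n_epochs=200):
    """X: (N, d) inputs, y: (N,) labels in {0, 1}. Cross-entropy loss + logistic output."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # extended inputs (1; x_i)
    w = np.zeros(Xb.shape[1])
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):
            # gradient of the cross-entropy loss: x_i (g(w^T x_i) - y_i)
            w -= alpha * Xb[i] * (sigmoid(Xb[i] @ w) - y[i])
    return w

def predict_proba(w, x):
    return sigmoid(w @ np.concatenate(([1.0], x)))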
Multiclass classification
In the case of a classification problem with c classes S = {(xi, yi) ∈ Rd × {0, · · · , c − 1}, i ∈ [1..N]}, we would encode the output with the 1-of-c or one-hot encoding, i.e. the output y becomes a vector whose size is the number of classes, with components yk = δk,yi. We can then devise two architectures. The first one is to take a sigmoidal transfer function for the output layer and use the cross-entropy loss applied to each output:

L (y, ŷ) = − Σ_{k=0}^{c−1} ( yk ln (ŷk) + (1 − yk) ln (1 − ŷk) )
RSemp (W) = −(1/N) Σ_{i=1}^{N} L (yi, fW (xi))
fW (x) = g( WT (1; x) ) = ( g(w0T (1; x)), g(w1T (1; x)), · · ·, g(wc−1T (1; x)) )T
g(x) = 1/(1 + exp(−x))
where g is applied element-wise and W is now a (d + 1) × c matrix with the weights to each of the c output units in its columns. One can then verify that the derivative of the loss with respect to any weight wk,j reads:

∂L (y, fW (x)) / ∂wk,j = ( g(wkT x̃) − yk ) x̃j

where x̃ = (1; x).
In this case, we cannot interpret the outputs as a discrete probability distribution, as they are not normalized. If one wants to interpret the outputs as the conditional probability over the labels given the inputs, we can guarantee that the outputs are in the range [0, 1] and sum up to 1 by using the soft-max transfer function. Denoting W the weight matrix whose j-th column W.,j = wj contains the weights from the input to the j-th output, given an input xi it is handy to introduce the notation:

aj = wjT x̃i
∀j ∈ [0, c − 1], ŷj = exp(aj) / Σ_k exp(ak)

In this case, the appropriate loss is the negative log-likelihood loss defined as:

L (y, ŷ) = −log (ŷy)

This supposes that y is the class number. In case y encodes the class with the 1-of-c or one-hot encoding, then you just get the cross-entropy loss − Σ_{k=0}^{c−1} yk log (ŷk). If we write the empirical risk as a function of the parameter matrix W, denoting wj its j-th column:

RSemp (W) = −(1/N) Σ_{i=1}^{N} L (yi, fW (xi)) = −(1/N) Σ_{i=1}^{N} Σ_{k=0}^{c−1} 1{yi = k} log( exp(wkT x̃i) / Σ_{l=0}^{c−1} exp(wlT x̃i) )
Here, the derivatives are a bit more tedious to compute. Let us compute some intermediate steps:

∀x, ∀k, j, ∀i ≠ k, ∂/∂wi,j log( exp(wkT x) / Σ_{l=0}^{c−1} exp(wlT x) ) = −xj exp(wiT x) / Σ_{l=0}^{c−1} exp(wlT x) = −xj [fW (x)]i

∀x, ∀k, j, ∂/∂wk,j log( exp(wkT x) / Σ_{l=0}^{c−1} exp(wlT x) ) = xj − xj exp(wkT x) / Σ_{l=0}^{c−1} exp(wlT x) = xj − xj [fW (x)]k

⇒ ∀x, ∀i, k, j, ∂/∂wi,j log( exp(wkT x) / Σ_{l=0}^{c−1} exp(wlT x) ) = xj ( δi,k − [fW (x)]i )
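Putting the pieces together, a stochastic gradient sketch of this softmax classifier, using the derivative xj (δi,k − [fW (x)]i) computed above (the gradient of the negative log-likelihood is its opposite; all names below are ours), could be:

import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def sgd_softmax_classifier(X, y, c, alpha=0.1, n_epochs=100):
    """X: (N, d) inputs, y: (N,) integer class labels in [0, c-1]."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])     # extended inputs (1; x_i)
    W = np.zeros((Xb.shape[1], c))                    # one column of weights per class
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):
            y_hat = softmax(W.T @ Xb[i])              # predicted class probabilities
            onehot = np.zeros(c); onehot[y[i]] = 1.0
            # gradient of -log(y_hat[y_i]) w.r.t. W is outer(x_i, y_hat - onehot)
            W -= alpha * np.outer(Xb[i], y_hat - onehot)
    return W

def predict(W, x):
    return int(np.argmax(W.T @ np.concatenate(([1.0], x))))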
17.2 Radial basis function networks

For a radial basis function network, the quadratic loss and predictor read:

L (y, fw (x)) = ‖y − fw (x)‖²
fw (x) = wT φ (x)

As we already saw in the previous section, this least square minimization problem can be solved analytically or iteratively with a steepest descent. Analytically, the optimal weights read:

w⋆ = (φ(X) φ(X)T)−1 φ(X) y if φ(X) φ(X)T is invertible, and (VΣ+ UT) y (the minimal norm solution) otherwise.
How do we define the centers and standard deviations of the basis functions? There are actually several possibilities (Schwenker et al., 2001; Peng et al., 2007; Han and Qiao, 2012). The simplest is to pick randomly K − 1 centers from the inputs and to compute a common standard deviation as the mean of the distances between the selected inputs and their closest selected neighbors. We then train/compute the optimal weights to minimize the risk. Another possibility is to apply a clustering algorithm (e.g. k-means) to identify good candidates for the centers and compute the standard deviation as before. Then, after this unsupervised learning step, one would learn the optimal weights in a supervised manner. These are two-phase training algorithms for RBF networks (Schwenker et al., 2001). Another possibility is to train the RBF in three phases (Schwenker et al., 2001). The first two phases consist in initializing the centers and standard deviations of the kernels with some clustering algorithm and then computing the optimal weights directly or with a steepest descent. The third phase consists in adapting all the parameters (weights, centers, standard deviations) using a steepest descent. One can actually compute the gradients of the loss with respect to the weights, centers and standard deviations (Schwenker et al., 2001; Bishop, 1995):
with respect to the weights, centers and standard deviations(Schwenker et al., 2001; Bishop, 1995) :
2
L (y, fw (x)) = |y − fw (x)|2
fw (x) = wT φ (x)
|x − ck |2
∀k, φ (x)k = exp(− )
2σk 2
φ (x)k x − ck
∀k, j, = δk,j φ (x)k
∂cj σk 2
2
φ (x)k |x − ck |2
∀k, j, = δk,j φ (x)k
∂σj σk 3
∂L (y, fw (x))
= −2φ (x)(y − fw (x))
∂w
∂L (y, fw (x)) ∂fw (x) x − ck
∀k, = −2(y − fw (x)) = −2(y − fw (x))wk φ (x)k
∂ck ∂ck σk 2
2
∂L (y, fw (x)) ∂fw (x) |x − ck |2
∀k, = −2(y − fw (x)) = −2(y − fw (x))wk φ (x)k
∂σk ∂σk σk 3
Some other algorithms for optimizing both the weights and the basis function parameters can be found in (Peng et al., 2007; Han and Qiao, 2012).
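A sketch of the simplest two-phase scheme described above (random centers, a shared standard deviation computed from the distances between centers, then weights obtained by least squares; the bias term and all names are our additions) could be:

import numpy as np

def rbf_features(X, centers, sigma):
    """Gaussian basis functions phi_k(x) = exp(-|x - c_k|^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
    return np.exp(-d2 / (2.0 * sigma ** 2))

def train_rbf_two_phase(X, y, K=10, seed=0):
    """Phase 1: pick centers and a common sigma. Phase 2: least squares on the weights."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]
    # common sigma: mean distance between each selected center and its closest other center
    dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    sigma = dists.min(axis=1).mean()
    Phi = np.hstack([np.ones((len(X), 1)), rbf_features(X, centers, sigma)])  # add a bias column
    w = np.linalg.lstsq(Phi, y, rcond=None)[0]
    return centers, sigma, w

def rbf_predict(centers, sigma, w, X):
    Phi = np.hstack([np.ones((len(X), 1)), rbf_features(X, centers, sigma)])
    return Phi @ w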
probability distribution. For a regression problem, the transfer function of the output is taken as
a linear function f (x) = x while non linearities are introduced in the hidden layers.
Let us introduce some notations:
• wij(l): the weight between the j-th unit of layer l − 1 and the i-th unit of layer l,
• ai(l): the pre-activation of the unit i of layer l,
• yi(l): the output of the unit i of layer l.
Every unit computes its pre-activation as a linear combination of its inputs. For simplicity of the notations, we denote yi(0) = xi and Il the set of indices of the units in layer l ∈ [0, L]. Remember that, in order to take into account the bias (offset) in the linear combination of the inputs, each layer l ∈ [0, L − 1] has one unit with a constant output equal to 1. This means, for example, that if the inputs are taken from Rd, the input layer actually contains d + 1 units. The computations within the network read:

Pre-activations: ∀l ∈ [1, L], ∀i ∈ Il, ai(l) = Σ_{j∈Il−1} wij(l) yj(l−1), i.e. a(l) = W(l) y(l−1)
Figure 17.9: A multilayer perceptron is built from an input and output layer with several hidden
layers in between. Each layer other than the output is extended with a unit of constant output 1
for the bias.
Training a multilayer perceptron is a problem of numerical optimization. We can actually resort to any optimization algorithm, such as derivative-free optimization algorithms (e.g. line-search, Brent’s method (Brent, 1973), black-box optimization such as CMA-ES (Hansen, 2006), Particle Swarm Optimization (Engelbrecht, 2007; Eberhart and Kennedy, 1995), ...), optimization algorithms that make use of the gradient (steepest descent (Werbos, 1981; Rumelhart et al., 1986), natural gradient (Amari, 1998a), conjugate gradient (Johansson et al., 1991), ...) or algorithms that make use of the second order derivatives (Hessian), sometimes only approximating it as in (Martens, 2010). We come back to this topic of optimization algorithms in section 17.5. For now, let us consider error backpropagation, which is historically a major breakthrough in the neural network community as it brought the ability to learn multilayer neural networks (Werbos, 1981; Rumelhart et al., 1986).
We consider architectures for which an appropriate combination of loss function and output transfer function has been chosen. As we saw in section 17.1.4, it means:
• for a regression problem with a vectorial output, a linear transfer function and a quadratic loss8: f (a) = a, L (y, ŷ) = ½ ‖y − ŷ‖²
• for a multi-class classification problem, a softmax output transfer function and the negative log-likelihood loss: f (a) = (1 / Σ_k exp(ak)) ( exp(a0) exp(a1) · · · exp(ac−1) )T, L (y, ŷ) = − Σ_k yk log (ŷk)
Starting from some initial weight and bias vector w, its update following the steepest descent reads:

w = w − α ∇w L

Let us compute the derivatives of the loss with respect to a weight from the last hidden layer to the output layer. Denoting a(L) = W(L) y(L−1) the pre-activations of the output layer, where W(L) is the weight matrix from the last hidden layer to the output layer (W(L)i,j is the weight from the hidden unit j to the output unit i), the predicted output can be written as ŷ = f(a(L)). In order to compute the gradient with respect to any weight, we shall apply the chain rule; in the case of a weight wi,j(L) between the j-th hidden unit and the i-th output unit, the gradient of the loss reads:

∂L (y, ŷ) / ∂wi,j(L) = Σ_k ∂L (y, ŷ)/∂ak(L) · ∂ak(L)/∂wi,j(L) = Σ_k ∆k(L) ∂ak(L)/∂wi,j(L), denoting ∆k(L) = ∂L (y, ŷ)/∂ak(L).
Whether in regression or classification, the pre-activations are computed as the product of the weight matrix with the output of layer L − 1: ak(L) = Σ_i wki(L) yi(L−1). Therefore:

∂ak(L) / ∂wi,j(L) = δi,k yj(L−1)

8 the ½ in the quadratic loss is introduced to get formulas similar to the classification case; it is just a scaling factor
We now need to make explicit the term ∆i(L), which is the derivative of the loss with respect to the pre-activations. The computations are actually similar to the ones carried out in section 17.1.4 and are repeated here for completeness:

• in the case of a regression:

∆k(L) = ∂L (y, ŷ)/∂ak(L) = ½ ∂‖y − y(L)‖² / ∂ak(L) = ½ Σ_i ∂(yi − yi(L))² / ∂ak(L) = − Σ_i (yi − yi(L)) ∂yi(L)/∂ak(L)

∂yi(L)/∂ak(L) = ∂f(ai(L))/∂ak(L) = δk,i

⇒ ∆k(L) = − Σ_i (yi − yi(L)) δk,i = −(yk − yk(L))

⇒ ∀i, j, ∂L (y, ŷ)/∂wi,j(L) = yj(L−1) ∆i(L) = −yj(L−1) (yi − yi(L))
• in the case of a classification, denoting c(x) the class of the input x (i.e. ∀i, yi = δi,c(x) using the 1-of-c encoding of the desired output):

∂yi(L)/∂ak(L) = ( exp(ai(L)) / Σ_l exp(al(L)) ) ( δi,k − exp(ak(L)) / Σ_l exp(al(L)) ) = yi(L) (δi,k − yk(L))

⇒ ∆k(L) = − Σ_i (yi / yi(L)) ∂yi(L)/∂ak(L) = − Σ_i yi (δi,k − yk(L)) = −(δc(x),k − yk(L)) = −(yk − yk(L))

⇒ ∀i, j, ∂L (y, ŷ)/∂wi,j(L) = yj(L−1) ∆i(L) = −yj(L−1) (yi − yi(L))
We now turn to the computation of the derivatives with respect to a weight or bias afferent to a unit in layer L − 1:

∀i, j, ∂L (y, ŷ)/∂wi,j(L−1) = Σ_k ∂L (y, ŷ)/∂ak(L−1) · ∂ak(L−1)/∂wi,j(L−1) = Σ_k ∂L (y, ŷ)/∂ak(L−1) δi,k yj(L−2) = yj(L−2) ∆i(L−1)

∆i(L−1) = ∂L (y, ŷ)/∂ai(L−1) = Σ_k ∂L (y, ŷ)/∂ak(L) · ∂ak(L)/∂yi(L−1) · ∂yi(L−1)/∂ai(L−1) = Σ_k ∆k(L) wki(L) g′(ai(L−1)) = g′(ai(L−1)) Σ_k ∆k(L) wki(L)
If we look at the structure of ∆i(L−1), it is basically the ∆ term of the next layer weighted by the weights of the projection from the unit i to the next layer, everything premultiplied by the derivative of the hidden layer transfer function. Hence the name “error backpropagation”: the error term ∆i(L) on the output layer is propagated backward into the previous layer. This process is recursive and backpropagation goes downward to the input layer. We do not detail the derivative of the hidden layer transfer function as it is specific to the network you consider (for instance g′(a) = g(a)(1 − g(a)) for the logistic function, as computed earlier, and g′(a) = 1 − tanh²(a) for the hyperbolic tangent).
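As an illustration, a compact sketch of these backpropagation equations for a network with one hidden tanh layer and a linear output trained with the quadratic loss (so that ∆(L) = −(y − y(L)) and g′(a) = 1 − tanh²(a); the function and variable names are ours) could read:

import numpy as np

def mlp_backprop_step(W1, W2, x, y, alpha=0.01):
    """One SGD step for a 1-hidden-layer MLP (tanh hidden units, linear output).

    W1: (h, d+1) hidden weights, W2: (o, h+1) output weights,
    x: (d,) input, y: (o,) target. The bias is handled by a constant 1 unit per layer.
    """
    # forward pass
    x_b = np.concatenate(([1.0], x))
    a1 = W1 @ x_b                                   # hidden pre-activations
    y1 = np.concatenate(([1.0], np.tanh(a1)))       # hidden outputs, plus the bias unit
    y_hat = W2 @ y1                                 # linear output layer
    # backward pass
    delta2 = -(y - y_hat)                                          # Delta at the output layer
    delta1 = (1.0 - np.tanh(a1) ** 2) * (W2[:, 1:].T @ delta2)     # backpropagated Delta
    # gradients are outer products Delta * (output of the previous layer)
    W2 -= alpha * np.outer(delta2, y1)
    W1 -= alpha * np.outer(delta1, x_b)
    return 0.5 * np.sum((y - y_hat) ** 2)           # current loss, for monitoring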
where we just introduce the bias. In our notations, the term inside the exponential is really (−αci, α)T (1, x)T, i.e. α (x − ci). We can then combine two such sigmoids with different centers and a different gain α. Some examples are drawn on fig 17.10. Combining arbitrarily close sigmoids, one can actually build up bell-shaped functions.
Figure 17.10: By combining two sigmoids with arbitrarily close centers, one can actually build arbitrarily local bell-shaped functions, which can then be weighted in order to produce any smooth function. The full lines plot six sigmoids, which are then grouped by pairs; the differences of these pairs are plotted with dashed lines. For generating the plot, the centers of the pairs are {0.2, 0.25}, {0.5, 0.52}, {0.8, 0.9} with a gain α = 50.
Intuitively, we reach the point that we can define a bunch of pairs of sigmoids (outputs from the hidden layer) that will create local functions, which can then be weighted with the weights from the hidden layer to the output in order to approximate any continuous function. The formal proofs are given in the references (Hornik et al., 1990; Cybenko, 1989; Hornik et al., 1989).
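The construction can be reproduced numerically; the sketch below builds one such localized bump as the difference of two sigmoids with close centers, reusing the first pair of centers and the gain of figure 17.10:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Difference of two sigmoids with close centers: a localized "bump".
alpha, c1, c2 = 50.0, 0.2, 0.25
x = np.linspace(0.0, 1.0, 200)
bump = sigmoid(alpha * (x - c1)) - sigmoid(alpha * (x - c2))

# The bump peaks between the two centers and vanishes away from them;
# weighted sums of such bumps can approximate smooth functions on [0, 1].
print(bump.max(), bump[0], bump[-1])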
9 in which case, one should speak about subgradient
17.4 Generalization
So far, we only discussed minimizing the empirical risk given some data. However, if we only minimize the empirical risk, we will definitely (hopefully) perform well on the training set, but the generalization performance will usually be bad. This is actually especially true when the dataset is limited. It turns out that some recent works using very large datasets (or augmenting the original dataset by transforming it, as in (Ciressan et al., 2012)) do not encounter this issue, as working with a very large dataset can be understood as working with an infinite stream of different inputs. However, if the dataset is not “sufficiently” large with respect to the number of degrees of freedom of the network, the network might overfit the training set and perform very badly on data it has not seen in the training set. In this section, we review some popular methods allowing to get hopefully good generalization performance by counterbalancing the minimization of the empirical risk and penalizing models that are too “complex”.
To take an example of overfitting, imagine that you have some 1D data to regress. Suppose that the unknown model is actually linear in the input, say y = αx, but obviously you do not know the model that generated the data, as this is what you want to learn from samples, say N samples xi, yi. If you were to minimize the empirical risk, you could simply consider the Lagrange polynomial:

f(x) = Σ_{i=0}^{N−1} yi ( Π_{j∈[0,N−1], j≠i} (x − xj) / Π_{j∈[0,N−1], j≠i} (xi − xj) )

This regressor gets a perfect fit to the samples: if we were to estimate the empirical risk, it would actually be null. To avoid too much math, simply suppose that your data are slightly noisy. It is clear that the Lagrange polynomial will not result in a linear function of the input, since such a linear function would not get a null empirical risk while the Lagrange polynomial perfectly fits the data. You would get higher order monomials, which might lead to bad generalization on unseen input/output pairs, since you are actually fitting both the data and the noise that you would like to filter out.
17.4.1 Regularization
So far, we just spoke about minimizing the empirical risk. However, this is not really the quantity of interest. More relevant is the minimization of the real risk, which we usually do not have access to10. Usually, the issue that we may encounter when only focusing on the empirical risk is overfitting, where we would perform perfectly on the training set but badly on data that were not present in the dataset, i.e. we would have a bad generalization error. For example, on figure 17.11, we generated data from a sine with normally distributed noise. Using a RBF with one kernel per sample, it turns out the optimal solution to the least square regression clearly overfits the data, as shown by the learned predictor plotted in solid line on fig. 17.11a.
10 this is less and less true as the datasets we are working with are growing; the abundant amount of data may actually prevent overfitting and discard the need for regularizing the neural networks.
Figure 17.11: With 30 samples generated according to eq. 17.4.1, and building a RBF with one kernel per sample with a standard deviation σ = 0.05, fitting the RBF without any penalty leads to an overfitted regressor (a), while fitting the RBF with a weight decay penalty λ = 2 or a L1 norm penalty (α = 0.005) provides a better generalization (b, c). The dashed line indicates the noise-free data. Original example from (Bishop, 1995).
Performing a gradient descent on this extended cost function simply adds a linear term to the gradient:

∇w J = ∇w L + λ w
w ← w − α (∇w L + λ w) = (1 − αλ) w − α ∇w L
Note that if the predictor is linear and the cost quadratic, adding a L2 penalty simply adds λI to the XXT matrix to be inverted to compute the optimal solution (actually λI with the first diagonal element set to zero to avoid regularizing the bias). Note also that the bias is not included in the regularization. The L2 penalty will actually force the weights to keep a low norm: it will bring w closer to 0. One may see the L2 penalty as a brake on activating the non-linearities of your network. If one has in mind the RBF network with fw (x) = Σ_{k=0}^{d+1} wk φk (x), we see that if the norm of w is low, it will tend to prevent activating the non-linearities. An example of RBF with a L2 penalty is shown on fig 17.11b. With this example in mind, we might understand why it is not a good idea to penalize the bias term. The bias term is the mean component of your data. To see this, we can rewrite the cost function to be minimized by making the bias explicit:

argmin_w Σ_{i=1}^{N} ( yi − w0 − Σ_{k≥1} wk φk (xi) )²
If you now compute the gradient with respect to w0 and set it to zero, you will find:

w0 = (1/N) Σ_{i=1}^{N} yi − Σ_{k≥1} wk ( (1/N) Σ_{i=1}^{N} φk (xi) )

It is usually a good idea to standardize the inputs11, in which case the second term vanishes and you recover the mean of the outputs for the optimal bias. This is one reason why you should not regularize the bias. The penalty should only affect the activation of the non-linearities, which are the main causes of overfitting.
There is also another idea which helps intuit why weight decay helps for generalization. Weight decay tends to bring the weights closer to zero. It turns out that if we have a logistic (or a tanh) transfer function, when the weights get small, the pre-activations tend to be where the logistic is almost linear. Therefore, if the weights are constrained to be small, each layer is actually a linear layer and the whole stack of layers of the multilayer perceptron collapses to a single linear layer. The weights will bring the logistic functions into their saturated part only if it actually decreases the loss sufficiently with respect to the weight decay amplitude. That way, we understand weight decay as a penalty on the complexity, richness or expressiveness of a multilayer perceptron.
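In code, the weight decay update amounts to shrinking the weights at every step before applying the usual gradient; a sketch that excludes the bias component from the penalty, as recommended above (the function name and the convention that w[0] is the bias are ours), could be:

import numpy as np

def sgd_step_with_weight_decay(w, grad_L, alpha=0.01, lam=1e-3):
    """One update w <- (1 - alpha*lam) w - alpha grad_L, without decaying the bias w[0]."""
    decay = np.ones_like(w)
    decay[0] = 0.0                       # do not regularize the bias term
    return w - alpha * (grad_L + lam * decay * w)

w = np.array([0.5, 1.0, -2.0])
grad = np.array([0.1, -0.2, 0.3])
print(sgd_step_with_weight_decay(w, grad))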
Performing a gradient descent on this extended cost function simply adds a term to the gradient which depends on the sign of the components of w:

∇w J = ∇w L + λ sign (w)
w ← w − α (∇w L + λ sign (w)) = w − αλ sign (w) − α ∇w L
On figure 17.12, we give an illustration (Hastie et al., 2009) that helps understand the influence of the L1 norm penalty. Under the hypothesis that our loss is quadratic, the L1 norm penalty tends to favor solutions that are more aligned with the axes than the L2 penalty, which leads to weights that are sparse (i.e. more components are equal to 0). On figure 17.11c, a RBF is regressed with 30 Gaussian kernels and a L1 penalty with α = 0.005. The linear regression with the L1 penalty is solved with the LASSO-LARS algorithm12. It turns out that of the 30 basis functions, only 10 get activated. The norm of the optimal solution of the predictor plotted on the figure is actually around |w⋆|2 ≈ 0.57, quite close to the norm of the optimal solution with the L2 penalty, |w⋆|2 ≈ 0.61, but the solution is much sparser.
Dropout
Dropout is a regularization technique introduced in (Srivastava et al., 2014) and illustrated on fig 17.13. The motivation is actually quite interesting. It is based on the idea of avoiding co-adaptation. It means that, in order to force the hidden units of a MLP to learn sound and robust features, one actually discards some of their feedforward inputs during training. Discarding is controlled by a binary gate tossed for each input connection following a Bernoulli distribution of parameter p. By doing so, the units can hardly compensate their failures with the help of the others (co-adaptation), as they tend to work with a random subset of the other units. When testing the full network, the contributions of the units are scaled by the probability of keeping them, thereby averaging the contributions. The authors report in (Srivastava et al., 2014) that using p = 20% or p = 50% significantly improved the generalization performance of various architectures.
11 a gradient descent has better performance because the cost function is more circular. If you do not standardize the inputs, the cost function might be elongated and the gradient descent would take longer to converge.
12 implemented using the Lasso-Lars algorithm in scikit-learn http://scikit-learn.org.
Figure 17.13: Dropout consists in discarding some units during training with a given probability p taken as 20% or 50% as suggested by (Srivastava et al., 2014). Figure taken from (Srivastava et al., 2014).
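A minimal sketch of the mechanism (Bernoulli gates during training, rescaling by the keep probability at test time; the function name is ours) could be:

import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(y_hidden, p_drop=0.5, training=True):
    """Apply dropout to a vector of hidden activations y_hidden."""
    if training:
        mask = rng.random(y_hidden.shape) >= p_drop   # Bernoulli gate per unit
        return y_hidden * mask
    # At test time, keep every unit but scale by the keep probability (1 - p_drop).
    return y_hidden * (1.0 - p_drop)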
Early stopping
Early stopping consists, in its most naive implementation, in tracking, in parallel with the training error, a validation error computed on a subset of the inputs not in the training set (say, 10% of the data). In an ideal situation, one would observe error curves as a function of the training epoch that look like figure 17.14a. Initially, both the errors on the training and validation set decrease. At some point in time, however, while the training error goes on decreasing, the validation error starts increasing (see (Wang et al., 1993) for a theoretical analysis in a simplified case). This point in time should actually be the point where learning should be stopped in order to avoid overfitting. In practice, the errors do not strictly follow this ideal picture (see fig. 17.14b) (Prechelt, 1996). One can however still monitor the performance of the neural network on the validation set during training and select, at the end of the training period, the weights that led to the lowest error on the validation set.
Figure 17.14: a) Ideal training and validation curves which clearly indicate when to stop learning.
b) Real cases might actually be much more fluctuating. Images from (Prechelt, 1996)
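A sketch of the practical variant described above (keep training, but remember the weights achieving the lowest validation error), where model, train_step and validation_error are placeholders for whatever training loop is used, could be:

import copy

def train_with_early_stopping(model, train_step, validation_error, n_epochs=100):
    """model: any object holding weights; train_step and validation_error are callables."""
    best_error, best_weights = float("inf"), copy.deepcopy(model)
    for epoch in range(n_epochs):
        train_step(model)                      # one pass of (stochastic) gradient descent
        err = validation_error(model)          # error on the held-out validation subset
        if err < best_error:                   # remember the best weights seen so far
            best_error, best_weights = err, copy.deepcopy(model)
    return best_weights, best_error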
17.5 Optimization
Several optimization techniques that are not actually specific to neural networks turn out to converge faster than the classical (stochastic) gradient descent. The interested reader is referred to (Bengio et al., 2015), Chap. 8, for an in-depth presentation of aspects such as momentum, first and second order methods (conjugate gradients, Hessian-free optimizers (Martens, 2010), saddle-free optimizers (Dauphin et al., 2014)). There is actually extensive research on understanding the landscape of the cost functions we encounter with neural networks and on designing specific optimization techniques that take these aspects into account.
Figure 17.15: A convolutional neural network as introduced in (Lecun et al., 1998). The first layers compute convolutions of their input with several trainable filters. This weight sharing dramatically decreases the number of weights to learn by exploiting a fundamental structure of images: the extraction of features from images is translation invariant.
For example, the first convolution layer applied to a RGB image has filters of depth 3. If k filters
are computed from the input image, the next convolution layer will have a depth of k. After
the convolution layer, one finds a pooling layer. A pooling layer introduces another translation
invariance. In their original work, (Lecun et al., 1998) consider subsampling, which reads:
$$y_i^{(l)} = \tanh\Big(\beta \sum_{j \in RF_i} y_j^{(l-1)} + b\Big)$$
Subsampling consists in computing the average of the convolutional layer outputs over a local patch that we call the receptive field. It turns out that another pooling operation, known as max-pooling, works significantly better in practice (Scherer et al., 2010). Max-pooling consists in computing the
max rather than the average within the receptive field of a unit:
$$y_i^{(l)} = \max_{j \in RF_i} y_j^{(l-1)}$$
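The two pooling operations can be sketched in a few lines of NumPy, assuming non-overlapping 2x2 receptive fields on a single feature map (a simplifying assumption, not the general case):

import numpy as np

def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max-pooling of a (H, W) feature map (H, W even)."""
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

def subsample_2x2(feature_map, beta=1.0, b=0.0):
    """Original subsampling of (Lecun et al., 1998): scaled sum over the receptive field, then tanh."""
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return np.tanh(beta * blocks.sum(axis=(1, 3)) + b)

fm = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(fm))       # 2x2 output, maximum of each block
print(subsample_2x2(fm))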
Such convolutional neural networks with non-overlapping max-pooling layers appear to be very effective (Ciresan et al., 2011b). In (Simard et al., 2003), the authors present some "good" choices for setting up a convolutional neural network which turn out to work well in practice, in terms of initialization of the weights and of the organization of convolutional and fully-connected layers. One additional point the authors present is data augmentation, which consists in applying distortions to the training set images in order to feed the network with many more inputs than if we just considered the original dataset. Ideally, data augmentation provides additional sensible inputs which mimic the availability of an infinite dataset and may therefore remove the need to regularize the network.
17.7 Autoencoders
Autoencoders (also known as Diabolo networks) are a specific type of neural network trained to reconstruct their input. A simple single-layer autoencoder is represented in Fig. 17.16. Usually, the hidden layers have the shape of a bottleneck, with smaller and smaller layers, so that the input x gets compressed into a so-called code c that is sufficiently informative to allow the reconstruction x′ to be close to the input x. In the simple autoencoder of Fig. 17.16, the equations read:
$$c = \sigma(Wx + b), \qquad x' = W'c + b'$$
where σ is a transfer function (e.g. logistic). One may constrain the decoding weights W′ to be equal to the transpose of the coding weights W to decrease the number of parameters to train.
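As a rough illustration, the following NumPy sketch trains such a single-hidden-layer autoencoder with tied weights on random toy data, by stochastic gradient descent on the quadratic reconstruction error (the dimensions, learning rate and data are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

n_in, n_code = 16, 4
W = rng.normal(scale=0.1, size=(n_code, n_in))   # coding weights (decoder uses W.T: tied weights)
b = np.zeros(n_code)
b_out = np.zeros(n_in)

def forward(x):
    c = sigmoid(W @ x + b)            # code
    x_rec = W.T @ c + b_out           # linear reconstruction
    return c, x_rec

lr = 0.01
data = rng.normal(size=(200, n_in))
for epoch in range(50):
    for x in data:
        c, x_rec = forward(x)
        # gradient of the quadratic reconstruction error 0.5 * ||x_rec - x||^2
        delta_out = x_rec - x                         # error at the output
        delta_c = (W @ delta_out) * c * (1 - c)       # backpropagated through the code
        W -= lr * (np.outer(delta_c, x) + np.outer(c, delta_out))
        b -= lr * delta_c
        b_out -= lr * delta_out

_, x_rec = forward(data[0])
print(np.mean((x_rec - data[0]) ** 2))                # reconstruction error after training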
If one uses a linear transfer function and a quadratic cost, one then seeks to minimize the reconstruction error from some low-dimensional projection, which is exactly what PCA does. However, when non-linear transfer functions are used, the autoencoder does not behave as a PCA at all (Japkowicz et al., 2000) and can be used to extract useful non-linear features from the inputs; autoencoders turn out to be effective architectures for performing non-linear dimensionality reduction (Hinton et al., 2006).
Figure 17.16: A simple single hidden layer autoencoder. The input x goes through a bottleneck to create a code c from which we seek to build a reconstruction x′ as similar as possible to the original input x.
Actually, one may even use a hidden layer with the same number of units as the input, or even more, and still get sensible hidden units that do not merely learn the identity function (Bengio et al., 2007; Ranzato et al., 2006), provided the architecture is appropriately regularized (early stopping or L1 norm penalty). In (Hinton et al., 2006), a deep autoencoder with three hidden layers between the input and code layer is introduced to perform dimensionality reduction. The authors also present a way to efficiently train such deep architectures. Variants of the autoencoder where noise, acting as a regularizer, is injected into the inputs are presented in (Vincent et al., 2008). Injecting noise forces the network to learn robust features and prevents the autoencoder from simply learning the identity function when large hidden layers are considered. These autoencoders are called denoising autoencoders. In (Vincent et al., 2008), the authors also introduce the stacked denoising autoencoder, which is merely a stack of encoders trained iteratively: a first single-layer denoising autoencoder is trained; then the learned code is used as the input for training a second denoising autoencoder, and so on.
Around 2006, it was found that the vanishing gradient issue can be alleviated by an appropriate initialization of the network parameters (Hinton et al., 2006; Bengio et al., 2007). A similar idea was already introduced in (Schmidhuber, 1992). The idea is to pretrain the feedforward (or recurrent) neural network in an unsupervised way. One such method relies on the autoencoders introduced in a previous section. Remember that autoencoders seek to learn a useful code, useful in the sense that it can be sparse and allows reconstructing the original data. Robust and sparse features are then extracted from the input image. This unsupervised learning seems to bring the weights of the neural network into a region much more favorable for fine-tuning with stochastic gradient descent. One may find additional elements on why training deep networks is difficult in (Glorot and Bengio, 2010).
Figure 17.17: a) Convolutional neural network trained on the MNIST dataset by (Ciressan et al., 2012). By applying small distortions to the original data, the authors build up 35 datasets on each of which a convolutional neural network is trained separately; the networks are then averaged as in (b). Images from (Ciressan et al., 2012).
In several of their works (Cireşan et al., 2010; Ciresan et al., 2011a; Ciressan et al., 2012), the authors use, as the non-linear activation function within the hidden layers of the MLPs (Cireşan et al., 2010) or within the fully connected and convolutional layers of the convolutional networks (Ciressan et al., 2012), a scaled hyperbolic tangent suggested by (LeCun et al., 1998):
$$g(x) = 1.7159 \tanh(0.6666\, x)$$
For the classification output, a softmax is considered. Interestingly, learning is performed using the “good old on-line back-propagation” (Cireşan et al., 2010), without momentum or any other trick except an exponentially decreasing learning rate schedule and an initialization of the weights uniformly in [−0.05, 0.05]. Note however that this is possible because fast GPU implementations make it feasible to train the networks for many epochs. As stated by the authors, one convolutional neural network such as the one of figure 17.17a took 14 hours to train on GPUs, which would have easily taken a month on a CPU. As reported by the authors, the final architecture ranks first on the MNIST classification benchmark.
Figure 17.18: Examples of classes from the ImageNet dataset (Russakovsky et al., 2014).
The SuperVision deep convolutional neural network of (Krizhevsky et al., 2012) ranks first in the ImageNet classification problem (Russakovsky et al., 2014). The 2012 ImageNet challenge consisted in classifying 1000 different categories of objects. The training set was made of 1.2 million images, the validation set of 50,000 images and the test set of 100,000 images. Some examples of the ImageNet dataset are shown in Fig. 17.18. The SuperVision network of (Krizhevsky et al., 2012) is built from 7 hidden layers: 5 convolutional layers followed by 2 fully connected layers. The output layer uses a softmax transfer function. The hidden layers use a rectified linear transfer function. The total number of parameters to learn reaches 60 million. With so many parameters to learn, the authors proposed to extract random patches of 224 × 224 pixels from the 256 × 256 pixel images in order to augment the dataset. Learning uses stochastic gradient descent with dropout (probability of 0.5 to set the output of a hidden unit to 0), momentum of 0.9 and weight decay of 0.0005. The weights are initialized according to a specific scheme detailed in (Krizhevsky et al., 2012), basically relying on normally distributed weights and unit or zero biases depending on the layer. The learning rate is adapted through a heuristic which consists in dividing it by 10 when the validation error stops improving. According to the authors, it took about a week to train the network on two GPUs, involving 90 epochs over the whole dataset of 1.2 million images. Recently, (Krizhevsky, 2014) introduced a new way to make the training of convolutional neural networks on GPUs faster.
Chapter 18
Recurrent neural networks
Suppose we want to learn a predictor for which the current decision depends on the current input
and on the previous inputs. One can actually solve such a task with a feedforward neural network by
extending the input layer with, say, n parts each fed with one sample from xt , xt−1 , xt−2 , · · · xt−n+1 .
The input of such a network is called a tapped delay line and the feedforward network built from
such an input, a time-delay neural network (TDNN)(Waibel et al., 1989).
Figure 18.1: A time-delay neural network (TDNN) takes the history into account for its decision by being fed with a sliding window over the input. The main limitation of such a network is that the maximal number of previous inputs the network can integrate is fixed by the length of the delay line.
The main issue with such a network is that the history that can be used to make the prediction depends on the predefined length of the delay line. In addition, the TDNN uses separate weight vectors to extract information from the samples of the different time steps, which is not always optimal, especially if one needs to extract the same piece of information from the input but at different time steps. This means we introduce several weights that must be trained to perform the same work, which might impair generalization.
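Building the tapped delay line simply amounts to feeding the network with a sliding window over the series; a small NumPy sketch on a toy scalar series, with an assumed window length of n = 4:

import numpy as np

def delay_line_inputs(x, n):
    """Build TDNN inputs: row t contains (x_t, x_{t-1}, ..., x_{t-n+1}).

    The first valid row corresponds to t = n - 1, i.e. the network can
    only start predicting once the delay line is full.
    """
    return np.stack([x[t - n + 1:t + 1][::-1] for t in range(n - 1, len(x))])

x = np.arange(10, dtype=float)       # a toy scalar time series
X = delay_line_inputs(x, n=4)        # shape (7, 4): each row is one sliding window
print(X)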
Rather than relying on a predefined time window over the input, a recurrent neural network
can learn to capture and integrate the sufficient amount of information from past inputs in order
to make a correct decision.
Figure 18.2: A recurrent neural network is a neural network with cycles. The outputs may or may
not be fed back to the network. The output is influenced by the hidden units and may also receive
direct excitation from the input.
Recurrent neural networks are particularly well suited when working with datasets where the
decision requires some form of memory such as for example when working with time series (e.g.
speech signals) and hopefully, the recurrent neural network will learn the dependency in time to
correctly predict the current output. Depending on the application, it might be required that the
output feeds back to the hidden layers as illustrated on fig. 18.2. To describe the computations
within a recurrent neural network, we use the same notations as in (Jaeger, 2002) :
• ui , i ∈ [1, K] denote the input unit activities ,
• xj , j ∈ [1, N ] denotes the hidden or internal state activities,
• yk , k ∈ [1, L] denotes the output unit activities.
The units are interconnected with weighted connections denoted:
• $W^{in} \in \mathbb{R}^{N\times K}$ the weight matrix from the inputs $u_i$ to the hidden units $x_j$
• $W^{back} \in \mathbb{R}^{N\times L}$ the weight matrix from the outputs $y_k$ to the hidden units $x_j$
• $W \in \mathbb{R}^{N\times N}$ the weight matrix between the hidden units $x_j$
• $W^{out} \in \mathbb{R}^{L\times (K+N)}$ the weight matrix from the input and hidden units to the output units
The biases of the hidden and output units are denoted bx and by . Similarly to the case of
feedforward neural networks, we may use different transfer functions for the hidden units and
the output units. Therefore, we introduce the transfer function f and f out for respectively the
hidden and output units. We can now write down the equations that rule the activities within
the recurrent neural network. The network is supposed to be initialized to some initial state
x (0) = x0 , y (0) = y0 :
$$\forall t > 0,\quad x(t) = f\left(W^{in} u(t) + W\, x(t-1) + W^{back} y(t-1) + b^x\right) \qquad (18.1)$$
$$y(t) = f^{out}\left(W^{out}\begin{bmatrix} u(t) \\ x(t) \end{bmatrix} + b^y\right)$$
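A minimal NumPy sketch of this forward recurrence (with a linear output transfer function and arbitrary toy dimensions and weights) could be:

import numpy as np

rng = np.random.default_rng(0)
K, N, L = 2, 5, 1                    # input, hidden and output sizes

W_in = rng.normal(scale=0.3, size=(N, K))
W = rng.normal(scale=0.3, size=(N, N))
W_back = rng.normal(scale=0.3, size=(N, L))
W_out = rng.normal(scale=0.3, size=(L, K + N))
b_x, b_y = np.zeros(N), np.zeros(L)

def run(inputs):
    """Unroll the recurrence of Eq. (18.1) over a sequence of inputs u(1..T)."""
    x, y = np.zeros(N), np.zeros(L)          # initial state x0 = 0, y0 = 0
    outputs = []
    for u in inputs:
        x = np.tanh(W_in @ u + W @ x + W_back @ y + b_x)
        y = W_out @ np.concatenate([u, x]) + b_y        # linear f_out here
        outputs.append(y)
    return np.array(outputs)

U = rng.normal(size=(10, K))      # a toy input sequence of length T = 10
print(run(U).shape)               # (10, 1)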
In order to perform a (possibly stochastic) gradient descent of the loss, we need to compute
the derivative of the loss with respect to the weights and biases. All the weights, whether feeding
the hidden or output units, contribute to the loss. Without loss of generality, we formulate the
derivative with respect to a weight we denote wk,l :
$$\forall w_{k,l},\quad \frac{\partial L}{\partial w_{k,l}} = \sum_{t=1}^{T}\sum_{i=1}^{L} \frac{\partial L_t}{\partial y_i}\,\frac{\partial y_i(t)}{\partial w_{k,l}}$$
This requires computing the derivative of the output activities with respect to the weights and biases. Let us make explicit the derivatives with respect to the weights only. From eq. (18.1), these derivatives read:
$$\frac{\partial y(t)}{\partial w_{k,l}} = f^{out\,\prime}(a_i(t))\left[\frac{\partial W^{out}}{\partial w_{k,l}}\begin{bmatrix} u(t) \\ x(t) \end{bmatrix} + W^{out}\begin{bmatrix} \frac{\partial u(t)}{\partial w_{k,l}} \\[4pt] \frac{\partial x(t)}{\partial w_{k,l}} \end{bmatrix}\right]
= f^{out\,\prime}(a_i(t))\left[\frac{\partial W^{out}}{\partial w_{k,l}}\begin{bmatrix} u(t) \\ x(t) \end{bmatrix} + W^{out}\begin{bmatrix} 0 \\[4pt] \frac{\partial x(t)}{\partial w_{k,l}} \end{bmatrix}\right]$$
$$a_i(t) = W^{out}\begin{bmatrix} u(t) \\ x(t) \end{bmatrix} + b^y$$
We cannot go much further without specifying with respect to which weight the derivative is computed. Depending on this weight, one gets various formulas. For example:
$$\frac{\partial y(t)}{\partial w^{out}_{k,l}} = f^{out\,\prime}(a_i(t))\left[\frac{\partial W^{out}}{\partial w^{out}_{k,l}}\begin{bmatrix} u(t) \\ x(t) \end{bmatrix} + W^{out}\begin{bmatrix} 0 \\[4pt] \frac{\partial x(t)}{\partial w^{out}_{k,l}} \end{bmatrix}\right]$$
The matrix $\frac{\partial W^{out}}{\partial w^{out}_{k,l}}$ is full of zeros with a single 1 at line $k$, column $l$. If one computes the derivative with respect to a weight feeding the hidden layer, the term involving $\frac{\partial W^{out}}{\partial w_{k,l}}$ vanishes but the other remains. For example:
$$\frac{\partial y(t)}{\partial w^{in}_{k,l}} = f^{out\,\prime}(a_i(t))\left[0 + W^{out}\begin{bmatrix} 0 \\[4pt] \frac{\partial x(t)}{\partial w^{in}_{k,l}} \end{bmatrix}\right] = f^{out\,\prime}(a_i(t))\, W^{out}\begin{bmatrix} 0 \\[4pt] \frac{\partial x(t)}{\partial w^{in}_{k,l}} \end{bmatrix}$$
Whatever the weight we consider, the derivatives of the output activities require computing the derivatives of the hidden layer with respect to the weights as well. Similarly to the derivatives of the output layer activities, one differentiates eq. (18.1) with respect to the weight and ends up with a formula that we can summarize as:
$$\frac{\partial x(t)}{\partial w^{in}_{k,l}} = g\!\left(\frac{\partial x(t-1)}{\partial w^{in}_{k,l}}, \frac{\partial y(t-1)}{\partial w^{in}_{k,l}}, x(t-1), y(t-1), u(t), W^{in}, W, W^{back}, b^x, f\right)$$
Note that the derivatives of the output activities y(t) with respect to some hidden layer weights depend on the derivatives of the hidden activities x(t) at time t, which are themselves computed from the derivatives of the hidden and output activities at time t − 1; overall, all the derivatives at time t can be computed from the derivatives at time t − 1. As the initial state is independent of the weights, the initial conditions read:
$$\frac{\partial x(0)}{\partial w_{k,l}} = 0, \qquad \frac{\partial y(0)}{\partial w_{k,l}} = 0$$
To compute the gradient of the loss, we need to compute the derivative of all the hidden and output units (N + L units), for every time step (T steps), with respect to every weight (N² + 2NL + KN weights), which leads to a very expensive computational cost; this is the main drawback of this method compared to other methods such as the backpropagation through time presented in the next section. If the number of hidden units dominates the number of inputs and outputs, the time complexity of one step is of order N⁴. However, RTRL is a forward differentiation method, meaning that the derivatives of the loss are computed at each time step and therefore the parameters can be updated online, i.e. at each time step.
Figure 18.3: Backpropagation through time is backpropagation applied to the recurrent neural
network unfolded in time where the network at each time step is considered as one layer of a
feedforward neural network of depth T . The notations of the illustration differ slightly from the
ones used in the previous section. The figure is from (Sutskever, 2013).
when these are fed back to the hidden units. One could set the initial activities of these units arbitrarily to 0. However, this is not guaranteed to be the optimal choice. Another possibility is to treat the initial state as a variable to be optimized: one can then, in a gradient descent method, compute the derivatives of the loss with respect to the initial state and follow the negative gradient.
Figure 18.4: In an echo state network (ESN), the recurrent weights are predefined and fixed; only
the weights from the hidden to output layers are trained. The figure is from (Lukosevicius, 2012).
The description of echo state networks follows (Lukosevicius, 2012). The hidden units in the ESN are leaky integrators and the output unit activities are linear with respect to the hidden (and possibly input) units. The output units are called readout units. From eq. (18.1), with a linear output transfer function f_out(x) = x and no feedback connections from the output to the hidden layer (W^back = 0), the update equations read:
$$x(n) = (1-\alpha)\,x(n-1) + \alpha \tanh\left(W^{in} u(n) + W x(n-1) + b^x\right)$$
$$y(n) = W^{out}\begin{bmatrix} u(n) \\ x(n) \end{bmatrix} + b^y$$
This is the network one would consider for a regression problem: the fixed predefined recurrent network extracts features from the input stream and a linear readout learns to map the features to the output to regress. In the context of a classification problem, one would use a non-linear output transfer function such as the softmax, which constrains the output activities to lie in [0, 1] and to sum up to 1. Learning the weights from the recurrent network (or reservoir) to the output is not the biggest issue with ESNs. In the case of a single output, if the sequence is not too long and the hidden layer not too large, one can compute the optimal weights from the Moore-Penrose pseudo-inverse, e.g. $w^{out} = (XX^T)^{-1} X y$, where X gathers the hidden states over all time steps and y the sequence of outputs to predict. One could also apply online learning to the readout weights. The regularization of the readout weights introduced with feedforward neural networks (e.g. L2 penalty) applies in this context as well. The interested reader is referred to (Lukosevicius, 2012) for more information on this.
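A minimal NumPy sketch of an ESN following the equations above could look as follows: a sparse random reservoir rescaled to a spectral radius below 1, leaky integration, and a linear readout fitted by least squares on a toy delay task (all sizes and scalings are arbitrary illustrative choices).

import numpy as np

rng = np.random.default_rng(0)
K, N = 1, 200                      # input and reservoir sizes
alpha = 0.3                        # leak rate

# sparse random reservoir, rescaled to a spectral radius slightly below 1
W = rng.uniform(-0.5, 0.5, size=(N, N)) * (rng.random((N, N)) < 0.1)
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))
W_in = rng.uniform(-0.5, 0.5, size=(N, K))
b_x = rng.uniform(-0.5, 0.5, size=N)

def harvest_states(inputs):
    """Run the fixed reservoir over the input sequence and collect [u(n); x(n)]."""
    x = np.zeros(N)
    states = []
    for u in inputs:
        x = (1 - alpha) * x + alpha * np.tanh(W_in @ u + W @ x + b_x)
        states.append(np.concatenate([u, x]))
    return np.array(states)

# toy task: predict u(n-3) from the input stream u
T = 1000
u = rng.uniform(-1, 1, size=(T, K))
y = np.roll(u[:, 0], 3)
X = harvest_states(u)[50:]                     # discard an initial washout
y = y[50:]
w_out, *_ = np.linalg.lstsq(X, y, rcond=None)  # linear readout (ridge also works)
print(np.mean((X @ w_out - y) ** 2))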
The size of the reservoir (hidden layer) is usually taken to be as big as possible, hopefully enriching the hidden representation from which the readout is computed. In (Triefenbach et al., 2010), the authors make use of reservoirs with 20,000 hidden units.
One big issue with ESN is to be able to define the input to hidden weight matrix Win , the
recurrent hidden weight matrix W and the leaking rate α. The hidden recurrent weight matrix
is usually generated as a sparse matrix, as it turns out to give better results in practice than a dense matrix, and numerical computation libraries can perform operations with sparse matrices efficiently, which speeds up the evaluation of the network. The input matrix W^in is a dense matrix. Various distributions are used to generate the coefficients of the matrices, such as a uniform or a Gaussian distribution. It is usually advised to scale the hidden recurrent weights W so that their spectral radius (largest eigenvalue in magnitude) is strictly smaller than 1, although a spectral radius close to 1 is not always optimal (Lukosevicius, 2012). The spectral radius influences how fast the influence of the inputs on the reservoir activities fades. If one thinks of the update of the reservoir as a repeated application of the weight matrix W, a small spectral radius makes the influence of an input integrated at some time step by the reservoir vanish more quickly (exponentially). The leak factor α of the leaky integrator influences how quickly the dynamics of the reservoir evolve. If the input or output time series evolve quickly within a few time steps while the leak factor gives the reservoir a strong inertia towards its previous state (α close to 0 in the update above), the dynamics of the reservoir will not be fast enough. We will not go further into the details of how to set up a
reservoir, as various elements can be found in (Lukosevicius, 2012; Jaeger, 2002). All the previous details call for a careful design of the input and hidden layers of the ESN, yet much simpler (more constrained) architectures still perform favourably, as presented in (Rodan and Tiño, 2011). To finish this section on ESNs, Mantas Lukoševičius provides source code on his website (http://minds.jacobs-university.de/mantasCode) with implementations of ESNs in various programming languages.
Figure 18.5: The original LSTM memory cell introduced in (Hochreiter and Schmidhuber, 1997) is able to memorize a piece of information. This information is protected from the perturbations provided by the other units within the network by an input gate. The other units are protected from the influence of the memory cell by an output gate. The network has to learn when to memorize and when to release a piece of information. LSTM memory units can be arranged to build memory cells which share their input and output gates. Figure adapted from (Gers et al., 2000).
Figure 18.6: Modified LSTM unit with a forget gate which modulates the gain of the recurrent
memory feedback pathway which was originally set to 1.0. A full reset would be obtained with a
gain ft = 0. The figure is from (Gers et al., 2000).
The above equations simply state that all the hidden and input units contribute to the input, output and forget gating of a unit, as well as to the definition of the potential new input to store in the cell. In order to memorize an input, the input gate has to be closed (i_t ≈ 0) and the forget gate open (f_t ≈ 1). To replace the content of the memory cell, it is sufficient to close the forget gate (f_t ≈ 0) and to open the input gate (i_t ≈ 1). In the original LSTM unit, the forget factor was always set to 1.0; replacing the content of the memory cell would therefore require some amount of time since, when f_t = 1, we get c_t = c_{t−1} + c*. Training of such a network can be performed by applying algorithms such as Real-Time Recurrent Learning or Backpropagation Through Time introduced in the previous sections.
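As an illustration, a single step of an LSTM memory cell with input, forget and output gates can be sketched as follows; this follows the standard formulation with a forget gate (in the spirit of Gers et al., 2000), with arbitrary toy dimensions, and is not the exact parameterization used in the cited works.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

n_in, n_cell = 3, 4
# one weight matrix per gate and one for the candidate input, each acting on [x_t, h_{t-1}]
Wi, Wf, Wo, Wc = (rng.normal(scale=0.3, size=(n_cell, n_in + n_cell)) for _ in range(4))
bi, bf, bo, bc = (np.zeros(n_cell) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    i = sigmoid(Wi @ z + bi)          # input gate
    f = sigmoid(Wf @ z + bf)          # forget gate (fixed to 1 in the original LSTM)
    o = sigmoid(Wo @ z + bo)          # output gate
    c_star = np.tanh(Wc @ z + bc)     # candidate content
    c = f * c_prev + i * c_star       # memory cell update
    h = o * np.tanh(c)                # exposed (gated) output
    return h, c

h, c = np.zeros(n_cell), np.zeros(n_cell)
for x_t in rng.normal(size=(5, n_in)):    # toy input sequence
    h, c = lstm_step(x_t, h, c)
print(h)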
In some situations, for example in speech processing, it might be helpful to consider both past and future (to some extent) inputs in order to classify the current input. For example, when one is speaking, there are co-articulation effects where the next phoneme to be pronounced influences the end of the previous one. In this context, (Graves and Schmidhuber, 2005) introduced bidirectional LSTM, which consists in two LSTM networks processing the input, one in the forward direction and the other in the backward direction. The classification at a given time step can then be influenced by both the past and future contexts. There have also been successful works on unsegmented data (Graves et al., 2006), where the network is directly fed with the continuous signal (e.g. the full speech signal) rather than with chunks (e.g. phonemes) segmented in a preprocessing step.
Figure 18.7: A handwritten sentence generated by the recurrent neural network of (Graves, 2013)
and fed with the sentence “A LSTM network generating handwritten sentences”.
Recently, it has been proposed to combine deep feedforward networks (convolutional neural
networks) with recurrent neural networks (bidirectional LSTM) in order to produce captions of
images (Karpathy and Li, 2014).
Chapter 19
Energy based models
We further suppose that there can be self-connections but that the weights of self-connections are
restricted to be positive :
∀i, wii ≥ 0
One can then define the following energy (Lyapunov) function¹:
$$E = -\sum_i s_i b_i - \frac{1}{2}\sum_{i,\, j\ne i} s_i s_j w_{ij} - \sum_i w_{ii} s_i \qquad (19.2)$$
We can rewrite the energy to isolate the terms in which the state of a specific neuron is involved:
$$\forall k,\quad E_k = -s_k b_k - w_{kk} s_k - \sum_{j\ne k} w_{kj} s_k s_j - \frac{1}{2}\sum_{i\ne k}\sum_{j\ne i,\, j\ne k} s_i s_j w_{ij} - \sum_{i\ne k} s_i b_i - \sum_{i\ne k} w_{ii} s_i$$
From this expression, we can compute the energy gap, i.e. the difference in energy when the neuron k is in state 1 and when it is in state −1:
$$\Delta E_k = E\big|_{s_k=1} - E\big|_{s_k=-1} = -2\Big(b_k + w_{kk} + \sum_{j\ne k} w_{kj} s_j\Big)$$
When updating neuron k, if its state was s_k(t) = 1 and the update makes it turn off, s_k(t+1) = −1, then according to eq. (19.1) we have $\sum_j w_{kj} s_j(t) + b_k = \sum_{j\ne k} w_{kj} s_j + w_{kk} + b_k < 0$. Therefore, ΔE_k > 0. The update produces a modification of the energy of −ΔE_k < 0 and therefore the energy strictly decreases. If the neuron was in state s_k(t) = −1 and the update makes it turn on, s_k(t+1) = 1, this means that $\sum_j w_{kj} s_j(t) + b_k = \sum_{j\ne k} w_{kj} s_j(t) - w_{kk} + b_k > 0$. Given that $w_{kk} \ge 0$, we have ΔE_k < 0. The update changes the energy by ΔE_k < 0 and therefore the energy again strictly decreases. If the neuron does not change its state, the energy is constant.
1 It shall be noted that this formulation of the energy function must be modified if we allow the neuron to change
its state when the net input equals 0, in this case, see the work of Floreen et Orponen Complexity issues in Discrete
Hopfield Networks.
Therefore, sequential updates make the energy function of eq. (19.2) non-increasing: the energy strictly decreases whenever a neuron changes its state, and is constant otherwise. Given that there is a finite number of states, we can conclude that the network converges in a finite number of iterations, the fixed point being a local minimum of the energy function.
19.1.2 Example
We consider a Hopfield network with 100 binary neurons with states in {−1, 1}. The weights are symmetric and randomly generated in [−1, 1]. Self-connections are restricted to be positive (random in [0, 1]). The biases are randomly generated in [−1, 1]. At each iteration, we randomly choose one of the productive rules (if any), the updates being stopped whenever there is no productive rule left to apply. The energy as a function of the number of productive rules applied is shown in figure 19.1. On this example, it took 97 iterations before reaching a minimum of the energy function.
Figure 19.1: Evolution of the energy of a 1D Hopfield network as a function of the number of productive rules applied.
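A NumPy sketch in the spirit of this example (random symmetric weights, non-negative self-connections, asynchronous updates of randomly chosen neurons) is given below; the assertion checks that the energy never increases along the updates. All numerical choices are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n = 100
W = rng.uniform(-1, 1, size=(n, n))
W = (W + W.T) / 2                                  # symmetric weights
np.fill_diagonal(W, rng.uniform(0, 1, size=n))     # non-negative self-connections
b = rng.uniform(-1, 1, size=n)
s = rng.choice([-1, 1], size=n)

def energy(s):
    diag = np.diag(W)
    off = W - np.diag(diag)
    return -s @ b - 0.5 * s @ off @ s - diag @ s   # Eq. (19.2)

energies = [energy(s)]
for it in range(1000):
    k = rng.integers(n)                            # asynchronous update of one neuron
    net = W[k] @ s + b[k]
    if net > 0:
        s[k] = 1
    elif net < 0:
        s[k] = -1
    energies.append(energy(s))

assert all(e2 <= e1 + 1e-9 for e1, e2 in zip(energies, energies[1:]))
print(energies[0], energies[-1])                   # the energy has decreased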
19.1.3 Training
Hopfield suggested that such a network could be used as a memory where patterns to be memo-
rized would be local minima of the energy function. When the network starts from a state close
to one minimum it will eventually relax to it. This means that, say, a picture might be completely
reconstructed from only a subpart of it.
To store a pattern p in a hopfield network, we need to ensure that this pattern p is a minimum
of the energy function (19.2) which we can do by a gradient descent of the energy function:
∆wii = αpi
∆wij = αpi pj
∆bi = αpi
If all the patterns to be stored are available, we can set the weights and biases to:
$$W = \frac{1}{N}\sum_i p_i p_i^T, \qquad b = \frac{1}{N}\sum_i p_i$$
Example We consider a Hopfield neural network with 100 neurons, their states being either −1 or 1. We first consider a single pattern to be memorized. This pattern is composed of four segments alternately set to −1 and 1. We show on figure 19.2 the evolution of the states once the weights have been set according to the learning rule (the batch version).
Figure 19.2: Evolution of a Hopfield network trained to store the pattern (−1, −1, ..., 1, 1, ..., −1, −1, ..., 1, 1).
It is not shown here, but the learning rule above is somewhat specific to the states −1, 1. If we use the same learning rule with states 0, 1 and run the same example, we may keep random values in the domain where the pattern is 0, as the weights and biases for these neurons equal zero and therefore their state never leaves its initial value.
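The batch learning rule and the relaxation towards a stored pattern can be sketched as follows: one pattern made of four alternating segments, a corrupted initial state, and deterministic sweeps over the neurons (all these choices are illustrative).

import numpy as np

rng = np.random.default_rng(1)
n = 100
# a single pattern made of four alternating segments of -1 and 1
p = np.concatenate([-np.ones(25), np.ones(25), -np.ones(25), np.ones(25)])

# batch learning rule (here a single stored pattern, so N = 1)
W = np.outer(p, p)
b = p.copy()

# start from a corrupted version of the pattern (30 flipped components)
s = p.copy()
flip = rng.choice(n, size=30, replace=False)
s[flip] *= -1

for sweep in range(5):                 # a few deterministic sweeps over all neurons
    for k in range(n):
        net = W[k] @ s + b[k]
        if net != 0:                   # only productive rules are applied
            s[k] = np.sign(net)

print(np.array_equal(s, p))            # the pattern is recovered from the corrupted state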
$$p(v, h) = \frac{e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$$
The term $\sum_{u,g} e^{-E(u,g)}$ is called the partition function. The probability of a visible state or of a hidden state is then obtained by marginalizing over the other variable.
2 historically introduced in (Smolensky, 1986), it was popularized by G. Hinton and collaborators who devised
efficient learning algorithms such as the contrastive divergence algorithm(Carreira-Perpiñán and Hinton, 2005)
In the remainder of this section, we give some elements about restricted Boltzmann machines with binary units and the way they can be trained. We refer the reader to (Hinton, 2012) for more details on training RBMs. RBMs are not restricted to binary units and variants with real-valued activations have been proposed. RBMs have been applied successfully to classification, where the hidden units can feed a feedforward network that discriminates the class, or where the input together with the label can be used as the visible part of the RBM. Finally, RBMs have been stacked to build deep belief networks and deep Boltzmann machines (Hinton, 2009; Salakhutdinov and Hinton, 2009).
Figure 19.3: Graphical representation of a restricted Boltzmann machine with n = 4 visible units
and m = 3 hidden units.
Let us denote b^v, b^h the biases of respectively the visible and hidden units, and w the weight matrix between the visible and hidden units. The weights are symmetric in the sense that if w_ij is the weight between the visible unit i and the hidden unit j, the hidden unit j is also connected to the visible unit i with the weight w_ij. We define an energy function E(v, h) which depends on the state of the visible and hidden units as:
$$E(v, h) = -\sum_i v_i b^v_i - \sum_j h_j b^h_j - \sum_i \sum_j v_i h_j w_{ij} \qquad (19.12)$$
From this energy function, we can define a probability over the states of the network as:
$$p(\mathbf{v} = v, \mathbf{h} = h) = \frac{1}{Z} e^{-E(v,h)}$$
where Z is a normalization factor, called the partition function and defined as:
$$Z = \sum_{v,h} e^{-E(v,h)}$$
$$p(\mathbf{v} = v) = \sum_h p(\mathbf{v} = v \mid \mathbf{h} = h)\, p(\mathbf{h} = h) = \sum_h p(\mathbf{v} = v, \mathbf{h} = h) = \frac{1}{Z}\sum_h e^{-E(v,h)}$$
We now introduce the free energy (which will render the derivations easier) as:
$$\mathcal{F}(v) = -\log\Big(\sum_h \exp(-E(v,h))\Big)$$
We can then rewrite the marginal of v and the partition function in terms of the free energy:
$$Z = \sum_v \exp(-\mathcal{F}(v)), \qquad p(\mathbf{v} = v) = \frac{\exp(-\mathcal{F}(v))}{\sum_{v'} \exp(-\mathcal{F}(v'))}$$
With the expression of the energy function in equation (19.12), the free energy can be written as:
$$\begin{aligned}
\mathcal{F}(v) &= -\log\Big(\sum_h \exp(-E(v,h))\Big)\\
&= -\log\Big(\sum_h \exp\big(\textstyle\sum_i v_i b^v_i\big) \exp\big(\textstyle\sum_j h_j b^h_j + \sum_i\sum_j w_{ij} v_i h_j\big)\Big)\\
&= -\log\Big(\prod_i \exp(v_i b^v_i) \sum_h \exp\big(\textstyle\sum_j h_j b^h_j + \sum_{i,j} w_{ij} v_i h_j\big)\Big)\\
&= -\sum_i v_i b^v_i - \log\Big(\sum_h \exp\big(\textstyle\sum_j h_j (b^h_j + \sum_i v_i w_{ij})\big)\Big)\\
&= -\sum_i v_i b^v_i - \log\Big(\sum_h \prod_j \exp\big(h_j (b^h_j + \textstyle\sum_i v_i w_{ij})\big)\Big)\\
&= -\sum_i v_i b^v_i - \log\Big(\prod_j \sum_{h_j} \exp\big(h_j (b^h_j + \textstyle\sum_i v_i w_{ij})\big)\Big)\\
&= -\sum_i v_i b^v_i - \sum_j \log\Big(\sum_{h_j} \exp\big(h_j (b^h_j + \textstyle\sum_i v_i w_{ij})\big)\Big)
\end{aligned}$$
Since the hidden units have binary states, the expression can be further simplified:
$$\mathcal{F}(v) = -\sum_i v_i b^v_i - \sum_j \log\Big(1 + \exp\big(b^h_j + \textstyle\sum_i v_i w_{ij}\big)\Big)$$
We are now interested in the conditional probabilities p(v|h) and p(h|v) which define the update
rules of the network. By definition of the conditional probabilities and of the free energy3 :
$$\begin{aligned}
p(\mathbf{h} = h \mid \mathbf{v} = v) &= \frac{p(\mathbf{h} = h, \mathbf{v} = v)}{p(\mathbf{v} = v)} = \frac{e^{-E(v,h)}}{\sum_{h'} e^{-E(v,h')}}\\
&= \frac{\exp\big(\sum_i v_i b^v_i + \sum_j h_j b^h_j + \sum_{i,j} v_i h_j w_{ij}\big)}{\sum_{h'} \exp\big(\sum_i v_i b^v_i + \sum_j h'_j b^h_j + \sum_{i,j} v_i h'_j w_{ij}\big)}\\
&= \frac{\exp\big(\sum_j h_j b^h_j + \sum_{i,j} v_i h_j w_{ij}\big)}{\sum_{h'} \exp\big(\sum_j h'_j b^h_j + \sum_{i,j} v_i h'_j w_{ij}\big)}\\
&= \frac{\prod_j \exp\big(h_j (b^h_j + \sum_i v_i w_{ij})\big)}{\prod_j \sum_{h'_j} \exp\big(h'_j (b^h_j + \sum_i v_i w_{ij})\big)}\\
&= \prod_j \frac{\exp\big(h_j (b^h_j + \sum_i v_i w_{ij})\big)}{\sum_{h'_j} \exp\big(h'_j (b^h_j + \sum_i v_i w_{ij})\big)} = \prod_j p(\mathbf{h}_j = h_j \mid \mathbf{v} = v)
\end{aligned}$$
By identifying the two last terms, the conditional probabilities of the hidden components then read:
$$p(\mathbf{h}_j = h_j \mid \mathbf{v} = v) = \frac{\exp\big(h_j (b^h_j + \sum_i v_i w_{ij})\big)}{\sum_{h'_j} \exp\big(h'_j (b^h_j + \sum_i v_i w_{ij})\big)}$$
Since the hidden units are binary, the update rules finally read:
$$p(\mathbf{h}_j = 1 \mid \mathbf{v} = v) = \frac{\exp\big(b^h_j + \sum_i v_i w_{ij}\big)}{1 + \exp\big(b^h_j + \sum_i v_i w_{ij}\big)} = \sigma\Big(b^h_j + \sum_i v_i w_{ij}\Big)$$
with σ the logistic function defined as $\sigma(x) = \frac{1}{1+\exp(-x)}$. By a similar derivation, since the network is symmetric, we find:
$$p(\mathbf{v}_i = 1 \mid \mathbf{h} = h) = \sigma\Big(b^v_i + \sum_j h_j w_{ij}\Big)$$
19.2.2 Training
We now wish to train the network so that higher probabilities are given to the training samples clamped on the visible units and lower probabilities to all the other configurations. This will in effect shape the energy landscape in favor of the training samples. We therefore wish to maximize the likelihood $\prod_d p(\mathbf{v} = \hat{v}_d)$, where the $\hat{v}_d$ are the training data. Maximizing this product is equivalent to maximizing the sum of the logarithms of the probabilities, a technical point simplifying the derivations. The sum of the logarithms of the probabilities is called the log-likelihood, and we define a cost function as its negative:
$$\mathcal{L}(\theta) = -\sum_d \log\big(p(\mathbf{v} = \hat{v}_d \mid \theta)\big)$$
with θ the parameter vector, which we now explicitly introduce in the notations to highlight the dependency of the cost function on the parameters. To compute the gradient of the cost function, one uses the expression of p(v) in terms of the free energy, which yields
$$\frac{\partial \mathcal{L}(\theta)}{\partial \theta} = \sum_d \left(\frac{\partial \mathcal{F}(\hat{v}_d)}{\partial \theta} - \mathbb{E}_p\Big[\frac{\partial \mathcal{F}(v)}{\partial \theta}\Big]\right),$$
where the first term is called the positive phase and the second one the negative phase.
3 For the last equality, we use the fact that the components of h are conditionally independent given v, because there is no connection between the hidden units.
We now have to see how the two terms can be computed to update the parameters. We begin
with the positive phase which can be analytically derived by using the expression of the free
energy :
$$\begin{aligned}
\frac{\partial \mathcal{F}(v_d)}{\partial \theta} &= -\sum_i v_{d,i} \frac{\partial b^v_i}{\partial \theta} - \sum_j \frac{1}{1 + \exp\big(b^h_j + \sum_i v_{d,i} w_{i,j}\big)}\, \frac{\partial \exp\big(b^h_j + \sum_i v_{d,i} w_{i,j}\big)}{\partial \theta}\\
&= -\sum_i v_{d,i} \frac{\partial b^v_i}{\partial \theta} - \sum_j \frac{\exp\big(b^h_j + \sum_i v_{d,i} w_{i,j}\big)}{1 + \exp\big(b^h_j + \sum_i v_{d,i} w_{i,j}\big)} \left(\frac{\partial b^h_j}{\partial \theta} + \sum_i v_{d,i} \frac{\partial w_{ij}}{\partial \theta}\right)\\
&= -\sum_i v_{d,i} \frac{\partial b^v_i}{\partial \theta} - \sum_j \sigma\Big(b^h_j + \sum_i v_{d,i} w_{i,j}\Big) \left(\frac{\partial b^h_j}{\partial \theta} + \sum_i v_{d,i} \frac{\partial w_{ij}}{\partial \theta}\right)\\
&= -\sum_i v_{d,i} \frac{\partial b^v_i}{\partial \theta} - \sum_j \sigma\Big(b^h_j + \sum_i v_{d,i} w_{i,j}\Big) \frac{\partial b^h_j}{\partial \theta} - \sum_{i,j} \sigma\Big(b^h_j + \sum_i v_{d,i} w_{i,j}\Big) v_{d,i} \frac{\partial w_{ij}}{\partial \theta}
\end{aligned}$$
We can now derive specific expressions for the biases and weights :
$$\forall i,\ \frac{\partial \mathcal{F}(v_d)}{\partial b^v_i} = -v_{d,i}, \qquad
\forall j,\ \frac{\partial \mathcal{F}(v_d)}{\partial b^h_j} = -\sigma\Big(b^h_j + \sum_i v_{d,i} w_{i,j}\Big), \qquad
\forall i, j,\ \frac{\partial \mathcal{F}(v_d)}{\partial w_{ij}} = -\sigma\Big(b^h_j + \sum_i v_{d,i} w_{i,j}\Big)\, v_{d,i}$$
We can recognize Hebbian-like learning rules: the weights are updated by the product of a pre- and a post-synaptic term. For the negative phase, the main difficulty is that we cannot compute the marginal of the visible units p(v). We can however notice that the expectation can be approximated with Monte-Carlo sampling. Suppose that, even if we cannot compute the marginal, we can get samples from the model. We would then consider a set N of samples drawn from p(v) and approximate the expectation as:
$$\mathbb{E}_p\Big[\frac{\partial \mathcal{F}(v)}{\partial \theta}\Big] \approx \frac{1}{|\mathcal{N}|}\sum_{v \in \mathcal{N}} \frac{\partial \mathcal{F}(v)}{\partial \theta}$$
Since the sum can easily be computed, we only need a method for sampling from p(v). Such methods are called Markov Chain Monte Carlo: starting from a given sample, we alternately sample the visible and hidden units. In principle, we would need several iterations before reaching a so-called thermal equilibrium. The thermal equilibrium is a kind of steady state where, even if the states keep changing (because of the stochastic update rules), the probabilities from which the states are sampled are fixed. While in principle we would need to reach thermal equilibrium (and this can take an unknown amount of iterations), an approximate method called contrastive divergence gives quite good results. In contrastive divergence, we initialize the visible units to a training sample, then update the hidden layer, the visible layer and the hidden layer again. This is called CD-1, while in general CD-k considers k such updates. The visible state that we sampled is called a reconstruction. The negative phase is computed only on the last visible and hidden states.
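A minimal NumPy sketch of CD-1 for a binary RBM, using the conditional probabilities derived above, could look as follows (toy dimensions and data, a single Gibbs step per update):

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

n_vis, n_hid = 6, 3
w = rng.normal(scale=0.1, size=(n_vis, n_hid))
b_v, b_h = np.zeros(n_vis), np.zeros(n_hid)

def sample(p):
    return (rng.random(p.shape) < p).astype(float)

def cd1_step(v0, lr=0.1):
    """One CD-1 update of the parameters from a single binary training vector v0."""
    global w, b_v, b_h
    # positive phase: hidden probabilities with the data clamped on the visible units
    ph0 = sigmoid(b_h + v0 @ w)
    h0 = sample(ph0)
    # one Gibbs step: reconstruction of the visible units, then hidden probabilities again
    pv1 = sigmoid(b_v + w @ h0)
    v1 = sample(pv1)
    ph1 = sigmoid(b_h + v1 @ w)
    # approximate gradient step (positive phase minus negative phase)
    w += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b_v += lr * (v0 - v1)
    b_h += lr * (ph0 - ph1)

# toy binary data: the visible vector is either all zeros or all ones
data = np.vstack([np.zeros((50, n_vis)), np.ones((50, n_vis))])
for epoch in range(20):
    for v in rng.permutation(data):
        cd1_step(v)

v = data[-1]
h = sample(sigmoid(b_h + v @ w))
print(sigmoid(b_v + w @ h))          # reconstruction probabilities of an all-ones vector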
Part VI
Ensemble methods
Chapter 20
Introduction
In supervised learning, predictions are made based on an estimator built with a given learning
algorithm. Ensemble methods aim at combining the predictions of several base estimators in order
to improve generalization or robustness over a single estimator.
To get an informal idea of ensemble learning¹, consider Fig. 20.1. The top square corresponds to a classification problem with positive (denoted “+”) and negative (denoted “−”) examples. As estimators, we consider “decision stumps”, that is, linear separators that take their decision based on a single dimension of the input space (here, they must be vertical or horizontal), as illustrated by the three squares in the middle. Now, if one takes a weighted combination of such decision stumps, the classification problem can be solved, as illustrated on the bottom square.
Strictly speaking, most of the ideas presented in the following chapters apply to arbitrary base estimators. However, they are often used with decision trees, so we start by presenting this learning paradigm, before providing an overview of the ensemble methods studied (non-exhaustively, as usual).
Figure 20.1: Combining weak learners to form a stronger learner. The figure is strongly inspired
from Schapire and Freund (2012).
Figure 20.2: Top left: this partition cannot be obtained with a recursive binary tree. Top right:
a partition corresponding to a binary tree. Bottom left: the corresponding tree. Bottom right: a
function associated to this tree (each leaf—that is, partition of the input space—is associated to a
constant value). The figure is taken from Hastie et al. (2009).
This model can be represented by the binary recursive tree shown on the bottom left panel. Inputs are fed to the root (top) of the tree, and they are assigned to the left or right branch, depending on whether the condition is satisfied or not, until reaching a leaf (terminal node). The leaves correspond to the regions R1, . . . , R5. The corresponding regression model predicts y with a constant c_m in region R_m:
$$f(x) = \sum_{m=1}^{5} c_m 1_{\{x \in R_m\}}.$$
The bottom right panel shows such a function, for the preceding tree and some constants cm .
An advantage of trees is their interpretability (one can easily see from the drawn tree the stratification of the data leading to the predicted value). It remains to show how such a tree can be built, depending on the problem at hand.
To minimize the risk based on the ℓ2-loss, one should solve the following optimization problem:
$$\min_{c_1,\dots,c_M} R_n(f) = \min_{c_1,\dots,c_M} \frac{1}{n}\sum_{i=1}^{n} (y_i - f(x_i))^2.$$
The solution is easily obtained by setting the gradient (with respect to each c_m) to zero:
$$\hat{c}_m = \operatorname{ave}(y_i \mid x_i \in R_m) = \frac{\sum_{i=1}^{n} y_i 1_{\{x_i \in R_m\}}}{\sum_{i=1}^{n} 1_{\{x_i \in R_m\}}}.$$
This is simply the empirical expectation of outputs corresponding to inputs belonging to region
Rm .
However, finding the best binary partition in terms of the risk R_n is much more difficult, and even computationally infeasible in general. Hence, the idea is to proceed with a greedy algorithm. We start with the whole dataset. Let j be a splitting variable (a component of the input) and s a split point, and define the pair of half planes
$$R_1(j, s) = \{x \mid x_j \le s\} \quad \text{and} \quad R_2(j, s) = \{x \mid x_j > s\}.$$
For a given choice of j and s (notice that 1 ≤ j ≤ d, as each input has d components, and that it is enough to consider n − 1 split points, obtained by ordering the j-th components of the inputs in the dataset), the inner minimization problem is solved with
$$\hat{c}_1 = \operatorname{ave}(y_i \mid x_i \in R_1(j, s)) \quad \text{and} \quad \hat{c}_2 = \operatorname{ave}(y_i \mid x_i \in R_2(j, s)).$$
Therefore, by scanning through each dimension of all the inputs, the determination of the best pair (j, s) is feasible. Then, having found the best split, we partition the data into the two resulting regions and repeat the splitting procedure for each of these regions. The process is then repeated again on each resulting region, and so on, until a stopping criterion is met, for example a maximum depth, a maximum number of leaves or a minimum number of samples per leaf.
The stopping criterion is not innocuous. Clearly, a very large tree will overfit the data (consider a tree with as many leaves as samples) while a too small tree might not capture the important structure. For example, a decision stump (mentioned at the beginning of this chapter) is a tree with two nodes. Consider the example of Fig. 20.1: a decision stump cannot capture the structure of the data, contrary to a slightly larger tree. A solution would be to prune the tree: one constructs a big tree, then prunes it by collapsing some of its internal nodes according to some criterion. See Hastie et al. (2009, Ch. 9) for more details. We do not study this further, as ensemble methods allow using such trees (big trees for bagging, small trees for boosting, see the next chapters).
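The greedy search of the best pair (j, s) for the squared-error criterion can be sketched as follows (an exhaustive scan over dimensions and observed split points, on toy data):

import numpy as np

def best_split(X, y):
    """Exhaustive search of the best pair (j, s) for the squared-error criterion."""
    n, d = X.shape
    best = (None, None, np.inf)
    for j in range(d):                       # scan every splitting variable
        for s in np.unique(X[:, j])[:-1]:    # candidate split points
            left, right = X[:, j] <= s, X[:, j] > s
            c1, c2 = y[left].mean(), y[right].mean()
            sse = np.sum((y[left] - c1) ** 2) + np.sum((y[right] - c2) ** 2)
            if sse < best[2]:
                best = (j, s, sse)
    return best

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 2))
y = np.where(X[:, 0] <= 0.3, 1.0, 5.0) + 0.1 * rng.normal(size=100)
print(best_split(X, y))   # should pick j = 0 with a split point near 0.3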
For classification with K classes, define, for a node m containing $n_m$ observations,
$$\hat{p}_{m,k} = \frac{1}{n_m}\sum_{x_i \in R_m} 1_{\{y_i = k\}} \qquad (20.1)$$
the proportion of class k observations in node m. We classify the observations in node m to the class $k(m) = \operatorname{argmax}_{1\le k\le K} \hat{p}_{m,k}$ (majority class). Different measures $Q(D_m)$ of node impurity³ can be considered:
• misclassification error,
$$Q(D_m) = \frac{1}{n_m}\sum_{x_i \in R_m} 1_{\{y_i \ne k(m)\}} = 1 - \hat{p}_{m,k(m)}; \qquad (20.2)$$
• Gini index,
$$Q(D_m) = \sum_{k \ne k'} \hat{p}_{m,k}\, \hat{p}_{m,k'} = \sum_{k=1}^{K} \hat{p}_{m,k}(1 - \hat{p}_{m,k});$$
• cross-entropy,
$$Q(D_m) = -\sum_{k=1}^{K} \hat{p}_{m,k} \ln \hat{p}_{m,k}.$$
These measures are illustrated in Fig. 20.3 in the case of binary classification.
The tree is grown as previously. For a region R_m, consider a couple (j, s) of splitting variable and split point, and write D_{m,L}(j, s) the resulting dataset of the left node (of size $n_{m_L}$) and D_{m,R}(j, s) the dataset of the right node (of size $n_{m_R}$). The tree is grown by solving
$$\min_{j, s}\ \Big( n_{m_L}\, Q\big(D_{m,L}(j, s)\big) + n_{m_R}\, Q\big(D_{m,R}(j, s)\big) \Big)$$
for one of the preceding measures of impurity. As for the regression case, the tree can be pruned, but we do not study this aspect.
but we do not study this aspect.
3 In the regression case, the measure of impurity is
1 X
Q(Dm ) = (yi − ĉm )2 ,
nm xi ∈Rm
Figure 20.3: Measure of node impurity for binary classification, as a function of the proportion p
in the second class.
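The three impurity measures are easily computed from the vector of class proportions $\hat{p}_{m,k}$ of a node; a small NumPy sketch:

import numpy as np

def impurities(p_hat):
    """Node impurity measures from the vector of class proportions (summing to 1)."""
    p_hat = np.asarray(p_hat, dtype=float)
    misclassification = 1.0 - p_hat.max()
    gini = np.sum(p_hat * (1.0 - p_hat))
    nonzero = p_hat[p_hat > 0]                      # convention 0 ln 0 = 0
    cross_entropy = -np.sum(nonzero * np.log(nonzero))
    return misclassification, gini, cross_entropy

# binary node with 70% of the samples in the second class
print(impurities([0.3, 0.7]))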
20.2 Overview
In this section, we provide a brief overview of the ensemble methods studied next. As explained
before, such methods aim at combining the predictions of several base estimators (for example, the
trees studied above) in order to improve generalization or robustness over a single estimator.
We will distinguish mainly two families of ensemble methods:
• the bagging methods (or more generally averaging methods), to be presented in Ch. 21.
The underlying idea is to build several estimators, (more or less) independently and in a
randomized fashion, and to average their prediction. This leads to a reduction of variance for
the combined estimator, that makes it applicable for example to large and unpruned trees;
• the boosting methods, to be presented in Ch. 22. The idea is to build sequentially (weak)
base estimators in order to reduce the bias of the combined estimator (see for example
Fig. 20.1 for an illustration), the motivation being to combine weak learners to form a strong
learner.
There are other ensemble methods that we will not study here. We mention some of them for
completeness:
• Bayesian model averaging (BMA) combine a set of candidate models (e.g., for classification)
using a Bayesian viewpoint. Informally, let D be the dataset, ξ a quantity of interest (e.g.,
the prediction of the class for a given input) and write Mm each of the M candidate model.
BMA provides the posterior distribution on ξ conditioned on the examples in D by integrating
over models:
M
X M
X
P (ξ | D) = P (ξ | Mm , D)P (Mm | D) ∝ P (ξ | Mm , D)P (D | Mn )P (Mn ).
m=1 m=1
See Hoeting et al. (1999) for more on Bayesian model averaging and more generally Part VII
for an introduction to Bayesian Machine Learning (the above equations should become clear
in light of this part of the course material);
• mixture of experts combine local predictors considering possibly heterogeneous sets of fea-
tures. They can also be seen as a variation of decision trees4 , the difference being that the
tree splits are not hard decisions but rather soft (fuzzy or probabilistic) ones, and that the
model in each leaf might be more complex than a simple constant prediction. See Jacobs
et al. (1991); Jordan and Jacobs (1994) or Hastie et al. (2009, Ch. 9) for more on this subject;
• stacking is a way of combining heterogeneous estimators for a given problem. The basic idea is to combine different estimators in a statistical way, that is, by minimizing a given risk. For example, consider that a classification problem is solved by using many different classification algorithms, each one producing an estimator. Then, one can construct a hypothesis space composed of, for example, all linear combinations of these estimators. A classification risk can then be minimized over this hypothesis space. Generally, it can be shown that the learned combination is at least as good as the best estimator. For more on this subject, see Wolpert (1992); Breiman (1996b); Smyth and Wolpert (1999); Ozay and Vural (2012), for example.
4 As such, they might not be considered as an ensemble method, opinions vary on this subject.
Chapter 21
Bagging
Bagging stands for “bootstrap and aggregating”. The underlying idea is to learn several estimators (more or less) independently (by introducing some kind of randomization) and to average their predictions. The averaged estimator is usually better than the single estimators because averaging reduces the variance. We motivate this informally. Assume that $X_1, \dots, X_B$ are B i.i.d. random variables of mean $\mu = E[X_1]$ and of variance $\sigma^2 = \operatorname{var}(X_1) = E[(X_1 - \mu)^2]$. Consider the empirical mean $\mu_B = \frac{1}{B}\sum_{b=1}^{B} X_b$. The expectation does not change, $E[\mu_B] = \mu$, while the variance is reduced (thanks to the independence of the random variables), $\operatorname{var}(\mu_B) = \frac{1}{B}\sigma^2$.
In the supervised learning paradigm, the random quantity is the dataset $D = \{(x_i, y_i)_{1\le i\le n}\}$, where samples are drawn from a fixed but unknown distribution. From this, an estimate $f_D$ is computed by minimizing the empirical risk of interest (see Ch. 5). This is a random quantity (through the dependency on the dataset) that admits an expectation. This estimator also has a variance, which somehow tells how different the predictions will be if the dataset is perturbed (this can also be linked to the Vapnik-Chervonenkis bound studied in Ch. 5).
Assume that we can sample datasets on demand. Then, let $D_1, \dots, D_B$ be datasets drawn independently, and write $f_b = f_{D_b}$ the associated minimizer of the empirical risk. The averaged estimator is $f_{ave} = \frac{1}{B}\sum_{b=1}^{B} f_b$. This does not change the expectation, $E[f_{ave}] = E[f_1]$, but it reduces the variance, $\operatorname{var}(f_{ave}) = \frac{1}{B}\operatorname{var}(f_1)$. This is of interest, for example, for decision trees: we have seen in Ch. 20 that large and unpruned trees have a small bias but a large variance.
Unfortunately, it is not possible to sample datasets on demand, as the underlying distribution is unknown. We have to make do with the sole dataset we have. That is where bootstrapping is useful. This can be seen as an approximation of the scheme described at the beginning of the chapter, the approximation coming from the fact that the empirical distribution is used instead of the unknown
underlying distribution1 . Due to this approximation, independence (or even decorrelation) cannot
be assumed. Yet, this can improve the results empirically.
This idea can typically be applied to decision trees. For regression trees, Eq. (21.1) can be applied directly. For classification trees, an average of predicted classes would not make sense. Recall that a classification tree makes a majority vote over the examples belonging to the leaf where the input of interest ends up (and associates this way a class to each sample). A first possibility is to do a majority vote over trees:
$$f_{bag}(x) = \operatorname{argmax}_{1 \le k \le K} \left( \frac{1}{B}\sum_{b=1}^{B} 1_{\{f_b(x)=k\}} \right).$$
Another bagging strategy consists in considering, for each tree, the class proportions of the leaf corresponding to the input of interest (see Eq. (20.1)), averaging them over all trees, and outputting the class that maximizes this averaged class proportion.
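A minimal sketch of bagging classification trees, with bootstrap replicates drawn with replacement and a majority vote over unpruned scikit-learn trees on a toy problem, could be:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 300
X = rng.uniform(-1, 1, size=(n, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)      # a toy non-linear classification problem

B = 25
trees = []
for b in range(B):
    idx = rng.integers(0, n, size=n)          # bootstrap replicate (sampling with replacement)
    tree = DecisionTreeClassifier()           # large, unpruned tree: low bias, high variance
    tree.fit(X[idx], y[idx])
    trees.append(tree)

def bagged_predict(X_new):
    # majority vote over the trees (averaging the predicted class indicators)
    votes = np.stack([t.predict(X_new) for t in trees])
    return (votes.mean(axis=0) > 0.5).astype(int)

print(np.mean(bagged_predict(X) == y))        # training accuracy of the bagged ensemble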
There exist variations of the bagging approach, depending on how the datasets are sampled from the original training set:
• samples can be drawn with replacement, which is the principle of the bagging approach explained above (Breiman, 1996a);
• alternatively, random subsets of the samples can be drawn (without replacement), which is known as pasting (Breiman, 1999);
• one can also select randomly a subset of the components of the inputs (which are generally multi-dimensional) to learn the models. When random subsets of the dataset are drawn as random subsets of the features, the method is known as random subspaces (Ho, 1998);
• it is possible to combine these ideas. When base estimators are built on subsets of both samples and features, the method is known as random patches (Louppe and Geurts, 2012).
When averaged models are trees, Breiman (2001) has proposed random forests to further reduce
the variance. The underlying idea is to reduce the correlation between the trees by randomizing
their constructions, without increasing the variance too much. This randomization is achieved in
the tree-growing process thanks to random selection of input variables3 : at each split, m < d of the
input variables are selected at random as candidates for splitting, the choice of the best variable
and split-point among these candidates being as explained in Ch. 20. See also Alg. 18.
Intuitively, reducing m will reduce the correlation between any pair of trees in the ensemble, and hence, by Eq. (21.2), will reduce the variance of the average. However, the corresponding hypothesis space will be smaller, leading to an increased bias. A heuristic is to choose $m = \lfloor\sqrt{p}\rfloor$ and a minimum node size of 1 for classification, and $m = \lfloor\frac{p}{3}\rfloor$ and a minimum node size of 5 for regression. For further information about random forests, the reader can refer to (Hastie et al., 2009, Ch. 15) (which notably provides a bias-variance analysis).
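In practice, random forests are readily available in libraries; for instance, with scikit-learn, the per-split random selection of roughly $m = \lfloor\sqrt{p}\rfloor$ candidate variables corresponds to max_features="sqrt" (the data below is a toy example):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 10))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# max_features="sqrt": random subset of candidate variables drawn at each split
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
forest.fit(X, y)
print(forest.score(X, y))     # training accuracy of the forest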
• split-points are also chosen randomly (in addition to splitting dimensions). More precisely,
m < p of the input variables are chosen at random, and for each of these variables a split-point
is selected at random;
• the full learning dataset is used to grow each tree (instead of a bootstrapped replica).
The rationale behind choosing also the split-point at random is to further reduce the correlation
between trees (so as to reduce the variance of the average of the ensemble more strongly). The
rationale for using the full learning set is to achieve a lower bias (at the price of an increased
variance, that should be compensated by the randomization of split-points). The corresponding
approach is described in Alg. 19.
2 Indeed, we have that $\operatorname{var}(\mu_B) = E\big[(\frac{1}{B}\sum_b (X_b - \mu))^2\big] = \frac{1}{B^2}\sum_{b,b'} E[(X_b-\mu)(X_{b'}-\mu)] = \frac{1}{B^2}\big(\sum_b E[(X_b-\mu)^2] + \sum_{b\ne b'} E[(X_b-\mu)(X_{b'}-\mu)]\big) = \frac{1}{B^2}\big(B\sigma^2 + B(B-1)\rho\sigma^2\big) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$.
3 Notice that this is different from random subspaces: input variables are chosen randomly for each split, and not once for the whole tree.
Empirically, it often provides better results than random forests. Another advantage of this
approach is its lower computational complexity compared to random forests (instead of searching
the best split-point among the m drawn dimensions, one chooses the split-point among the m
randomly drawn split-points). See the original paper (Geurts et al., 2006) for more details.
Chapter 22
Boosting
We have seen in Ch. 21 that bagging consists in learning in parallel a set of models with low bias and high variance (learning being randomized, for example through bootstrapping), the prediction being made by averaging all these models. Boosting takes a different route. The underlying idea is to add sequentially models with high bias and low variance, so as to reduce the bias of the ensemble. This is illustrated on Fig. 20.1 page 212: combining decision stumps (binary trees with two leaf nodes) allows constructing a complex decision boundary.
The rest of this chapter is organized as follows. In Sec. 22.1, we present AdaBoost, a seminal and very effective boosting algorithm. In Sec. 22.2, we show how this algorithm can be derived and demonstrate why the ensemble achieves a lower error than each of its components (called weak learners) taken separately. In Sec. 22.3, we extend these ideas using an optimization perspective.
22.1 AdaBoost
The seminal AdaBoost algorithm of Freund and Schapire (1997) deals with cost-insensitive binary
classification. The available dataset is of the form D = {(xi , yi )1≤i≤n } with yi ∈ {−1, +1}. Before
presenting AdaBoost, we discuss briefly weighted classification.
This way, all samples of the dataset have the same importance (each sample has a weight $\frac{1}{n}$). Now, assume that we want to associate a different weight $w_i$ to each example $(x_i, y_i)$ (this will be useful for boosting), such that $\sum_i w_i = 1$ and $w_i \ge 0$. The empirical risk to be considered is
$$R_n(f) = \sum_{i=1}^{n} w_i 1_{\{y_i \ne f(x_i)\}}.$$
This is a generalization of the approaches considered before, in the sense that they correspond to the choice $w_i = \frac{1}{n}$, for all $1 \le i \le n$. Weighting the samples allows putting more emphasis on some of them and less on the others.
Minimizing the weighted empirical risk can be done (approximately) by sampling a bootstrap replicate according to the discrete distribution $(w_1, \dots, w_n)$ (sampling with replacement according to this distribution). It can also be done more directly, in a problem-dependent manner. For example, consider the classification trees presented in Ch. 20. We have seen that a measure of node impurity is the misclassification error of Eq. (20.2) (see page 214). It can be replaced by a weighted misclassification error, where each sample is counted with its weight $w_i$.
Algorithm 20 AdaBoost
Require: A dataset D = {(x_i, y_i)_{1≤i≤n}}, the size T of the ensemble.
1: Initialize the weights $w_i^1 = \frac{1}{n}$, $1 \le i \le n$.
2: for t = 1 to T do
3:   Fit a binary classifier $f_t(x)$ to the training data using the weights $w_i^t$.
4:   Compute the error $\epsilon_t$ made by this classifier: $\epsilon_t = \sum_{i=1}^{n} w_i^t 1_{\{y_i \ne f_t(x_i)\}}$.
5:   Compute the weight of the classifier: $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$.
6:   Update the sample weights: $w_i^{t+1} = \dfrac{w_i^t e^{-\alpha_t y_i f_t(x_i)}}{\sum_{j=1}^{n} w_j^t e^{-\alpha_t y_j f_t(x_j)}}$.
7: end for
8: return the decision rule $H_T(x) = \operatorname{sgn}(F_T(x))$ with $F_T(x) = \sum_{t=1}^{T} \alpha_t f_t(x)$.
Let us explain the rationale behind this algorithm. At the beginning, all samples are equally weighted. Then, at each iteration t, one first trains a binary classifier $f_t(x)$ with the training set $D = \{(x_i, y_i)_{1\le i\le n}\}$ and weights $w_i^t$ (that is, by minimizing the weighted risk $R_n(f) = \sum_{i=1}^n w_i^t 1_{\{y_i \ne f(x_i)\}}$). Write $\epsilon_t = R_n(f_t)$ the error made by this classifier (see line 4 in Alg. 20). Notice that we necessarily have $\epsilon_t < \frac{1}{2}$ (otherwise, the classifier does worse than random guessing, and $-f_t$ is a better classifier, with an empirical weighted risk below $\frac{1}{2}$). The closer $\epsilon_t$ is to 0, the better the classifier. However, with a weak learner¹ such as a decision stump, the error will more probably be close to $\frac{1}{2}$. Then, one computes the learning rate $\alpha_t$ (see line 5 in Alg. 20). This rate is a decreasing function of the error $\epsilon_t$: with $\epsilon_t = \frac{1}{2}$, $\alpha_t = 0$ (which means that if the classifier
1 The learner is weak in the sense that it has a high bias.
does no better than random guessing, it is not added to the ensemble), and $\lim_{\epsilon_t \to 0} \alpha_t = +\infty$ (which suggests to stop adding models to the ensemble when the empirical risk is null). Then, the weights are updated (see line 6 in Alg. 20). This can be rewritten as (up to the normalization factor)
$$w_i^{t+1} \propto \begin{cases} w_i^t e^{-\alpha_t} & \text{if } f_t(x_i) = y_i \\ w_i^t e^{\alpha_t} & \text{if } f_t(x_i) \ne y_i \end{cases}$$
This means that if the example $(x_i, y_i)$ is correctly classified, its weight is decreased, while if it is incorrectly classified, its weight is increased. The final decision rule (the strong learner²) is the sign of the weighted combination of the learned classifiers:
$$H_T(x) = \operatorname{sgn}(F_T(x)) \quad \text{with} \quad F_T(x) = \sum_{t=1}^{T} \alpha_t f_t(x).$$
To sum up, AdaBoost is a sequential algorithm. At each iteration, the samples that were misclassified by the preceding classifier have their weight increased. Therefore, examples that are difficult to classify correctly receive ever-increasing influence as the iterations proceed.
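A minimal sketch of Alg. 20 with decision stumps as weak learners (scikit-learn stumps fitted with sample weights, on a toy problem) could look as follows:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 300
X = rng.uniform(-1, 1, size=(n, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)      # labels in {-1, +1}

T = 50
w = np.full(n, 1.0 / n)                          # initial weights (line 1)
alphas, stumps = [], []
for t in range(T):
    stump = DecisionTreeClassifier(max_depth=1)  # a decision stump as weak learner
    stump.fit(X, y, sample_weight=w)             # line 3: weighted fit
    pred = stump.predict(X)
    eps = np.sum(w * (pred != y))                # line 4: weighted error
    alpha = 0.5 * np.log((1 - eps) / eps)        # line 5: weight of the weak learner
    w = w * np.exp(-alpha * y * pred)            # line 6: reweight the samples
    w /= w.sum()
    alphas.append(alpha)
    stumps.append(stump)

def strong_classifier(X_new):
    F = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
    return np.sign(F)

print(np.mean(strong_classifier(X) == y))        # training accuracy of the ensemble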
AdaBoost is a very efficient algorithm. For example, it is behind the face detection algorithm
embedded in recent cameras and smartphones, see Viola and Jones (2001). If this chapter focuses
on AdaBoost, boosting is a much larger field, the interested reader can refer to Schapire and Freund
(2012) for a deeper introduction (see also Hastie et al. (2009, Ch. 10) for a different point of view).
The loss is low if the class is correctly predicted, and high otherwise. Moreover, we are looking for an additive model of the form
$$F_T(x) = \sum_{t=1}^{T} \alpha_t f_t(x),$$
with $f_t$ being a binary classifier (that is, $f_t \in \{-1, +1\}^X$), called a basis function in the sequel. The corresponding optimization problem is therefore
$$\min_{(\alpha_t, f_t)_{1\le t\le T}} \frac{1}{n}\sum_{i=1}^{n} e^{-y_i \sum_{t=1}^{T} \alpha_t f_t(x_i)}.$$
Yet, this optimization problem is too complicated. A simple alternative is to search for an approximate solution by sequentially adding basis functions and their associated weights. Define $F_0 = 0$ and
$F_t = F_{t-1} + \alpha_t f_t$. This consists in solving sequentially the following subproblems:
$$\min_{\alpha, f} \frac{1}{n} \sum_{i=1}^n e^{-y_i (F_{t-1}(x_i) + \alpha f(x_i))}.$$
2 The learner is strong because it has a reduced bias, as will be shown later.
This is reminiscent of gradient descent (see also Sec. 22.3). This idea can also be straightforwardly
abstracted to any loss function $L$ (not necessarily corresponding to a binary classification problem):
$$\min_{\alpha, f} \frac{1}{n} \sum_{i=1}^n L(y_i, F_{t-1}(x_i) + \alpha f(x_i)).$$
Now, we compute the solution of this problem in the case of the exponential loss, with binary
classifiers as basis functions.
At each iteration $t \ge 1$, we have to solve
$$(\alpha_t, f_t) = \operatorname*{argmin}_{\alpha, f} \frac{1}{n} \sum_{i=1}^n e^{-y_i (F_{t-1}(x_i) + \alpha f(x_i))}.$$
In other words, $f_t$ minimizes the weighted empirical risk corresponding to line 3 of Alg. 20 (we will see
later that the weights are indeed the same). Here, $\mathcal{H}$ is the hypothesis space of the considered weak
learners (or basis functions); for example, it can be the space of decision stumps. Write $\epsilon_t$ the
corresponding error:
$$\epsilon_t = \sum_{i=1}^n w_i^t\, \mathbb{1}_{\{y_i \neq f_t(x_i)\}}.$$
In other words, the training error drops exponentially fast as a function of the number of
combined weak learners. For example, if each weak learner has a 40% misclassification rate, then
$\gamma_t = 0.1$ and the empirical risk is bounded by $\left(\sqrt{1 - 4(0.1)^2}\right)^T \le 0.98^T$, which can be arbitrarily
close to zero, given a large enough $T$. Now, we prove this result.
Proof of Th. 22.1. Recall that $F_T(x) = \sum_{t=1}^T \alpha_t f_t(x)$. Write $Z_t$ the normalizing factor of the
weights at round $t$:
$$Z_t = \sum_{i=1}^n w_i^t\, e^{-\alpha_t y_i f_t(x_i)}. \qquad (22.7)$$
Unraveling the recurrence of AdaBoost that defines the weights, we have
$$w_i^{T+1} = w_i^T \frac{e^{-\alpha_T y_i f_T(x_i)}}{Z_T} = w_i^{T-1} \frac{e^{-\alpha_{T-1} y_i f_{T-1}(x_i)}}{Z_{T-1}} \frac{e^{-\alpha_T y_i f_T(x_i)}}{Z_T} = \dots = w_i^1 \frac{e^{-\alpha_1 y_i f_1(x_i)}}{Z_1} \cdots \frac{e^{-\alpha_T y_i f_T(x_i)}}{Z_T} = \frac{w_i^1\, e^{-y_i \sum_{t=1}^T \alpha_t f_t(x_i)}}{\prod_{t=1}^T Z_t} = \frac{w_i^1\, e^{-y_i F_T(x_i)}}{\prod_{t=1}^T Z_t}. \qquad (22.8)$$
Recall that $H_T(x) = \operatorname{sgn}(F_T(x))$. We would like to bound the binary loss by the exponential
one, its convex surrogate. If $H_T(x) \neq y$, then $y F_T(x) < 0$, thus $e^{-y F_T(x)} \ge 1$. Hence we
always have $\mathbb{1}_{\{y \neq H_T(x)\}} \le e^{-y F_T(x)}$. Therefore, the empirical risk of interest can be bounded
as follows:
$$\frac{1}{n}\sum_{i=1}^n \mathbb{1}_{\{y_i \neq H_T(x_i)\}} \le \frac{1}{n}\sum_{i=1}^n e^{-y_i F_T(x_i)} = \sum_{i=1}^n w_i^1\, e^{-y_i F_T(x_i)} \ \text{(by def. of } w_i^1\text{)} = \sum_{i=1}^n w_i^{T+1} \prod_{t=1}^T Z_t \ \text{(by Eq. (22.8))} = \prod_{t=1}^T Z_t \ \text{(as } w^{T+1} \text{ is a discrete distribution)} \qquad (22.9)$$
Recall the definition of $J_t(\alpha, f)$ in Eq. (22.2) and the definition of $Z_t$ in Eq. (22.7). It is clear that
$Z_t = J_t(\alpha_t, f_t)$. Therefore, from Eq. (22.5), we have that $Z_t = 2\sqrt{\epsilon_t(1 - \epsilon_t)} = \sqrt{1 - 4\gamma_t^2}$, with $\gamma_t = \frac{1}{2} - \epsilon_t$.
Plugging this result into Eq. (22.9) provides the first bound of the theorem. Using the fact that
for all $x \in \mathbb{R}$ we have $1 + x \le e^x$ provides the second bound and concludes the proof.
Notice that optimizing the bound in Eq. (22.9) (by minimizing $\prod_{t=1}^T Z_t$ over $\alpha_t$ and $f_t$, $1 \le t \le T$, using the relation (22.10)) allows deriving the AdaBoost algorithm. This is how it was originally
done (Freund and Schapire, 1997).
While this result shows that combining enough weak learners, whatever their quality, allows reaching
an arbitrarily small empirical risk, it tells nothing about the generalization error (that is, how the risk
$R(H_T) = \mathbb{E}\left[\mathbb{1}_{\{Y \neq H_T(X)\}}\right]$ can be controlled). Direct bounds on this risk can be obtained by using
rather directly the Vapnik-Chervonenkis theory presented in Sec. 5.1. Yet, this analysis would
suggest that AdaBoost should suffer from overfitting: to obtain a low empirical risk, one has to add many
basis functions (or weak learners), leading to a large Vapnik-Chervonenkis dimension, and thus a
large variance. However, AdaBoost does not suffer from this in general. Another (and better) line
of analysis is based on the concept of margin (which is also central for support vector machines).
Here, for the classifier $H_T(x) = \operatorname{sgn}(F_T(x))$, the margin of an example $(x, y)$ is the quantity $y F_T(x)$.
The larger it is (in absolute value), the more confident we are about the prediction (being correct
or not, depending on the sign). One can provide bounds on the risk based on this notion of margin:
very roughly, the larger the margin, the sharper the bound. On the other hand, it is possible to
show that AdaBoost tends to enlarge the margins of the computed classifier as the number of
iterations increases. A deeper discussion of this is beyond the scope of this manuscript, but the
interested reader can refer to Schapire and Freund (2012, Ch. 4 and 5) for more details.
Consider the binary classification problem with a convex surrogate (see Sec. 5.2). We are looking
for a classifier $H(x) = \operatorname{sgn}(F(x))$ with $F \in \mathbb{R}^{\mathcal{X}}$. Let $L(y, F(x))$ be a convex surrogate of the binary
loss (for example, the exponential loss, $L(y, F(x)) = e^{-y F(x)}$). We would like to minimize the
empirical risk:
$$\min_{F \in \mathbb{R}^{\mathcal{X}}} R_n(F) \quad \text{with} \quad R_n(F) = \frac{1}{n}\sum_{i=1}^n L(y_i, F(x_i)).$$
A natural idea is to do so by gradient descent,
$$F_{t+1} = F_t - \alpha_t \nabla_F R_n(F_t),$$
with $\alpha_t$ the learning rate. The problem here is that $F$ is not a variable, it is a function, so we need
to introduce the concept of functional gradient. To do so, we need to introduce a relevant Hilbert
space.
Assume that the input space $\mathcal{X}$ is measurable and let $\mu$ be a probability measure. The
function space $L^2(\mathcal{X}, \mathbb{R}, \mu)$ is the set of all equivalence classes of functions $F \in \mathbb{R}^{\mathcal{X}}$ such that
the Lebesgue integral $\int_{\mathcal{X}} F(x)^2 \, d\mu(x)$ is finite. This Hilbert space has a natural inner product:
$\langle F, G \rangle_\mu = \int_{\mathcal{X}} F(x) G(x)\, d\mu(x)$. A functional is an operator that associates a scalar to a function
of this Hilbert space. Let $J : L^2(\mathcal{X}, \mathbb{R}, \mu) \to \mathbb{R}$ be such a functional; its Fréchet derivative is the
linear operator $\nabla J(F)$ satisfying
$$J(F + G) = J(F) + \langle \nabla J(F), G \rangle_\mu + o(\|G\|_\mu).$$
The functional we are interested in is the empirical risk $R_n$, and one can compute$^3$ its Fréchet
derivative $\nabla_F R_n(F)$. So, we can write a gradient descent:
$$F_{t+1} = F_t - \alpha_t \nabla_F R_n(F_t).$$
However, the Fréchet derivative is a function (indeed, a set of functions), known only at the
datapoints $x_i$. It does not allow generalizing and it is not a practical object for computing. The
idea is therefore to “restrict” this gradient to the hypothesis space $\mathcal{H}$ of interest. By “restricting”
the gradient, we mean here looking for the function of $\mathcal{H}$ that is the most collinear with the gradient
(with a comparable norm). This way, we follow approximately the direction of the gradient, so
we reduce the empirical risk. Searching for this collinear function amounts to solving the following
optimization problem:
$$f_t \in \operatorname*{argmax}_{f \in \mathcal{H}} \frac{\langle \nabla_F R_n(F_t), f \rangle_n}{\|f\|_n}.$$
Then, we apply the gradient update, but with the functional gradient replaced by its approximation $f_t$:
$$F_t = F_{t-1} - \alpha_t f_t.$$
Thus, we compute an additive model, as before. Yet, it is here obtained as a restricted functional
gradient descent.
3 Generally speaking, all rules of gradient computation, such as the composition rule, apply. From a practical
point of view, as only the datapoints $x_i$ matter (given the considered discrete measure), one can see the function
$F$ as a vector $(F(x_1), \dots, F(x_n))^\top$ and take the derivative of the loss with respect to each component seen as a
variable. We do this later for the exponential loss.
We now apply this idea to the exponential loss. We recall that the associated empirical risk is
$$R_n(F) = \frac{1}{n}\sum_{i=1}^n e^{-y_i F(x_i)}.$$
The functional gradient can be computed with the informal method explained in footnote 3 (page 227): seeing $F$ as the vector $(F(x_1), \dots, F(x_n))^\top$, the component of $\nabla R_n(F)$ associated with $x_i$ is $-\frac{1}{n} y_i e^{-y_i F(x_i)}$.
There remains to project this gradient onto the hypothesis space. Here we consider $\mathcal{H} \subset \{-1, +1\}^{\mathcal{X}}$
(for example, $\mathcal{H}$ can be the set of decision stumps, as usual), thus for $f \in \mathcal{H}$ we have $\|f\|_n^2 = \frac{1}{n}\sum_{i=1}^n f(x_i)^2 = 1$. The update rule is thus
$$F_{t+1} = F_t - \alpha_t f_t \quad \text{with} \quad f_t = \operatorname*{argmax}_{f \in \mathcal{H}} \frac{\langle \nabla R_n(F_t), f \rangle_n}{\|f\|_n} = \operatorname*{argmax}_{f \in \mathcal{H}} \langle \nabla R_n(F_t), f \rangle_n,$$
the last line being obtained by injecting the negative sign in the optimization problem (recall that
f takes only the values ±1). As expressed in Eq. (22.12), the (negative of the) restricted gradient
is exactly the classifier computed in line 3 of Alg. 20, aiming at minimizing the error of line 4 of
the same algorithm.
There remains to choose the learning rate. Convex optimization theory offers a bunch of
choices for this. For example, a classic (but not necessarily wise) choice consists in setting $\alpha_t$ such
that $\sum_{t \ge 1} \alpha_t = +\infty$ and $\sum_{t \ge 1} \alpha_t^2 < \infty$ (typically, $\alpha_t \propto \frac{1}{t}$). Here we can perform a line search,
that is, we can look for the learning rate that implies the maximum decrease of the empirical
risk. Formally, this is done by solving the following optimization problem: $\alpha_t = \operatorname*{argmin}_{\alpha \ge 0} R_n(F_t - \alpha f_t)$.
This is exactly the learning rate of AdaBoost (see line 5 of Alg. 20).
Therefore, we have derived AdaBoost in a third way, from an optimization perspective. This perspective is of
high interest, as it allows relying on the whole field of optimization. The interested reader can refer
to Mason et al. (1999); Friedman (2001); Grubb and Bagnell (2011); Geist (2015b) for more about
this kind of approach.
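As an illustration of this optimization view, here is a minimal sketch of restricted functional gradient descent for a generic differentiable loss (assuming NumPy; fitting a least-squares regression stump to the negative gradient and a grid-based line search are simplifying choices of this sketch, not something prescribed by the text):

```python
import numpy as np

def exp_loss(y, F):
    """Empirical exponential risk (1/n) sum_i exp(-y_i F(x_i))."""
    return np.mean(np.exp(-y * F))

def exp_loss_grad(y, F):
    """Component-wise derivative of the empirical risk w.r.t. F(x_i)."""
    return -y * np.exp(-y * F) / len(y)

def fit_regression_stump(X, r):
    """Least-squares stump fitted to targets r: a common proxy for the projection onto H."""
    best = (np.inf, 0, 0.0, (0.0, 0.0))
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j])[:-1]:          # keep both sides non-empty
            left = X[:, j] <= thr
            vl, vr = r[left].mean(), r[~left].mean()
            sse = np.sum((r - np.where(left, vl, vr)) ** 2)
            if sse < best[0]:
                best = (sse, j, thr, (vl, vr))
    return best[1:]

def stump_eval(stump, X):
    j, thr, (vl, vr) = stump
    return np.where(X[:, j] <= thr, vl, vr)

def gradient_boost(X, y, T, loss=exp_loss, loss_grad=exp_loss_grad,
                   alphas=np.linspace(0.01, 2.0, 50)):
    F = np.zeros(X.shape[0])                         # F_0 = 0
    model = []
    for _ in range(T):
        g = loss_grad(y, F)                          # functional gradient at the datapoints
        stump = fit_regression_stump(X, -g)          # restricted (approximate) negative gradient
        h = stump_eval(stump, X)
        a = alphas[np.argmin([loss(y, F + a * h) for a in alphas])]   # grid line search
        F = F + a * h
        model.append((a, stump))
    return model
```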
Part VII
Theoretical Foundations
So you are probably asking yourself why a non-Bayesian method such as “Naive Bayes” is introduced
in a chapter devoted to Bayesian Machine Learning. And why even talk about Bayesian
Machine Learning if it is rarely used in practice?
Indeed Bayesian Machine Learning provides a general methodology along with theoretical tools
to design Machine Learning methods suiting specific problems. By stating that everything should
be described by distributions – including model parameters – Bayesian Machine Learning encom-
passes a very wide application scope. Many existing ML methods can be reinterpreted in the
Bayesian framework, often leading to interesting generalizations or sound justifications of choices
that would stay arbitrary otherwise. For instance the ordinary least squares linear regression (OLS)
is a frequentist regression method but it can be reinterpreted as a specific instance of a more general
Bayesian linear regression. This latter model provides a probabilistic interpretation of the
least squares principle and it also legitimates, with a sound theoretical argument, the origin of ridge
regression (also called Tikhonov regularization, which consists in adding an L2 regularization term
to the OLS objective function). Similarly “Naive Bayes” can be interpreted in a more general and
powerful Bayesian context. After a thorough review, it appears that most methods making an
intensive use of probabilities are neither purely frequentist nor purely Bayesian, they are both!
So you might ask “why distinguish two chapters respectively on Frequentist and
Bayesian Machine Learning if there is no fundamental difference between them?”
The key to the answer lies in the word “probabilistic”: to be classified as Bayesian in a weak sense,
a method must at least rely on a probabilistic discriminative model, i.e. a model that generates an
output Y given an input X and the parameters Θ of the model according to some distribution
P Y | X,Θ . Clearly the basic versions of SVM, neural networks, or decision trees do not output such
a distribution but a single target value1 : they cannot be Bayesian and are consequently considered
as frequentist. Conversely a non Bayesian method (in the strong sense) such as “Naive Bayes”
that relies on a probabilistic generative model should actually be considered as Bayesian (in a weak
sense) as it can always – at least theoretically – be generalized into a full Bayesian method. More
exactly, frequentist probabilistic methods can be derived from Bayesian ones: given some Bayesian
method (let’s say the Bayesian version of “Naive Bayes”) , parameter values θ? learned by some
associated frequentist method (let’s say the basic version of “Naive Bayes”) can be obtained by
first running the Bayesian method to learn some parameter distribution P Θ | S (θ), and then finding
the best parameter value θ? that minimizes the expected value of some cost function according
to P Θ | S (θ). The extraction of this optimal parameter value for some cost function is called a
Bayes estimator. The most well known Bayes estimators are the Maximum A Posteriori estimator
(MAP) and Maximum Likelihood estimator (MLE). For instance the simple frequentist version of
“Naive Bayes” is derived by applying the MLE estimator to the Bayesian version of “Naive Bayes”.
The same holds for linear regression: the Ordinary Least Square method (OLS) is what we get
when one applies the MLE estimator to the Bayesian linear regression.
So finally you might ask why one should use probabilistic frequentist methods if they all admit
full Bayesian counterparts that are more powerful.
There ain’t no such thing as a free lunch. Though a full Bayesian method encompasses the corresponding
frequentist methods (in the sense that frequentist methods are obtained from the Bayesian
ones by applying some Bayes estimator), the process of keeping the distribution of hypotheses/parameters P Θ | S up to date when new samples are received can be very demanding in both
computation time and memory footprint. Frequentist versions are more lightweight as they only
compute one specific value θ* for the model parameters.
Overview
In summary, the chapter distinguishes two types of Bayesian methods: in a weak and in a strong
sense. This distinction does not come from the scientific literature but is introduced by the author in
order to explain how many ML methods can simultaneously be interpreted in both the frequentist
and the Bayesian Machine Learning contexts.
1 Of course a single value can always be interpreted as a degenerate distribution but this is very reductive.
• Bayesian methods in a weak sense gather all probabilistic methods from which some fully
Bayesian methods (in the strong sense) can be straightforwardly derived, that is, all
methods that model the distribution P Y | X .
• Bayesian methods in a strong sense consist in taking a Bayesian method in a weak
sense and representing the model parameters by a full distribution PΘ instead of a single
value θ. These methods are fundamentally online and can integrate some prior knowledge.
Sections 23.3 and 23.4 further develop Bayesian Machine Learning in the weak and
strong sense respectively, after some basic notions of probability theory are recalled in Section 23.2.
• A measure P that maps every event of E to its probability and that verifies the following
axioms:
∀E ∈ E, 0 ≤ P (E) ≤ 1
∀E1 , E2 ∈ E, P (E1 ∪ E2 ) + P (E1 ∩ E2 ) = P (E1 ) + P (E2 )
P (Ω) = 1
Events are thus the sets of outcomes that can be measured by some probability.
Example:
Let’s roll two six-sided dice. One possible definition of the associated probability space (Ω, E, P ) is
to define an outcome as a couple (d1 , d2 ) where d1 and d2 respectively represent the value obtained
for the first and the second die. The set Ω of outcomes is thus the Cartesian product {1 . . . 6} × {1 . . . 6}.
The richest possible set E of events is the set P (Ω) of subsets of Ω, which is trivially a σ-algebra.
The event “the result is a duplicate” is then represented by the set Edup = {(1, 1), (2, 2), . . . , (6, 6)}.
The probability function for fair dice is then $P : E \mapsto \frac{|E|}{|\Omega|} = \frac{|E|}{36}$. The probability $P(E_{dup})$ is thus $\frac{6}{36} = \frac{1}{6}$.
Random variable
Given a probability space (Ω, E, P ), a random variable X taking its value in a set ΩX is a function
fX : Ω → ΩX that maps any outcome of Ω to a given value of X so that it induces a probability
space (ΩX , EX , PX ) for X. The function PX that maps a set of values of X to a probability is
called the distribution of X. For this distribution to be properly defined, the σ-algebra EX must
be chosen so that the mapping function fX is measurable, in other words, so that every event of
EX corresponds to an event of E:
$$\forall E_X \in \mathcal{E}_X, \quad f_X^{-1}(E_X) \in \mathcal{E}$$
The probability PX (EX ) for X to take its value in a given subset EX ∈ EX is then:
$$P_X(E_X) = P\left(f_X^{-1}(E_X)\right)$$
The domain DX of a random variable X denotes the smallest event of EX such that $P_X(D_X) = 1$.
In practice one often takes ΩX = DX .
Example:
In the previous example, let’s consider the random variable S that is the sum of the faces of the two
dice. The function mapping an outcome to a value of S is $f_S : (d_1, d_2) \mapsto d_1 + d_2$. The domain
of S is DS = {2, . . . , 12}. The resulting probability space of S is (DS , ES , PS ) where
$$\mathcal{E}_S = \mathcal{P}(D_S), \qquad P_S(\{s\}) = P\left(\{(d_1, d_2) \in \Omega \mid d_1 + d_2 = s\}\right) = \frac{\min(s - 1, 13 - s)}{36}$$
For instance the probability for the sum to be equal to four is PS ({4}) = 3/36 as there are three
matching rolls that correspond to a sum of four: (1, 3), (2, 2) and (3, 1).
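Both probabilities can be checked by a short enumeration of Ω (a minimal sketch using only the Python standard library):

```python
from itertools import product
from fractions import Fraction

# Outcome space of two fair six-sided dice: the Cartesian product {1..6} x {1..6}.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(E) = |E| / |Omega| for fair dice."""
    return Fraction(len([o for o in omega if event(o)]), len(omega))

print(prob(lambda o: o[0] == o[1]))        # P(E_dup) = 1/6
print(prob(lambda o: o[0] + o[1] == 4))    # P(S = 4) = 1/12 (= 3/36)
```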
Notation convention
1. In the next sections, one will use interchangeably the equivalent notations PX (EX ) and
P (X ∈ EX ) to denote the probability of an event EX defined on a random variable X of
distribution PX . The second notation will be preferred as it is lighter and more readable
when the probability to express is complex. In particular this convention generalizes to joint
and conditional distributions. For instance one will prefer to write P ( (X, Y ) ∈ A | Z ∈ B )
rather than P (X,Y ) | Z∈B (A). Moreover, if EX is a singleton {x}, PX (x) will be a shortcut
for PX ({x}).
2. Second, events EX on random variable X can be specified either extensively, using set no-
tation, or intensively, using logical predicates. For instance PX ({0} ∪ [1, 2[) is equivalent to
P (X = 0 or (X ≥ 1 and X < 2)).
3. Most importantly, because Bayesian methods make extensive use of sometimes complex
distributions, it is common usage to simplify expressions with a somewhat abusive but very
handy notation, consisting in denoting the distribution PX as P (X). For instance PX|Y =y
will be replaced by P ( X | Y ). Also the probability density function fX and the distribution
PX of a continuous random variable X will often be interchangeable and referred to with the
generic term of distribution, as their algebraic properties with respect to conditioning and
marginalization are the same.
Marginalization
Whereas the distributions of two random variables X and Y do not necessarily suffice to define
their joint distribution, the converse is always true: given the joint distribution PX,Y of X and Y ,
it is always possible to derive the distributions of X and Y using the summation property. This
operation is called marginalization and is given by the formula in the discrete case:
$$P_X(x) = \sum_{y \in \Omega_Y} P_{X,Y}(x, y)$$
Marginalization can be interpreted as a loss of information since the distribution of Y has been
lost in the result.
Independence
Two events A and B are said independent if and only if
P (A ∩ B) = P (A) × P (B)
Two random variables X and Y are independent if and only if the joint distribution of X and Y is
defined and verifies:
$$P_{X,Y} = P_X \times P_Y$$
Whether one considers two events or two variables, independence will be denoted $A \perp\!\!\!\perp B$ and
$X \perp\!\!\!\perp Y$.
Given two random variables X and Y, the observation of a value y of Y updates in some way the initial distribution PX that X had before one observed the value
of Y. The new distribution of X after the observation Y = y is called the conditional distribution of X
given Y = y and is denoted P X | Y =y (x) or simply P ( X = x | Y = y ). When the values x
or y do not carry specific information relative to the current context, one will use the somewhat
abusive but lighter notation P X | Y (x) or simply P ( X | Y ). Conditional distributions can easily
be deduced from joint distributions using marginalization over X:
$$P(X = x \mid Y = y) = \frac{P(X = x \cap Y = y)}{P(Y = y)} = \frac{P(X = x \cap Y = y)}{\sum_{x'} P(X = x' \cap Y = y)}$$
Of course conditional probability is not specific to random variables and holds for any type of
event. The conditional probability of event A given that event B occurred is denoted P ( A | B )
and is equal to:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
An immediate property is that A and B are independent if and only if P ( A | B ) = P (A). In other
words, A and B are independent if and only if knowing the value of B has absolutely no influence
on the probability of A.
Bayes’ rule
A straightforward but nonetheless fundamental property of conditional probabilities is Bayes’ rule
also called Bayes’ theorem:
P ( A | B ) × P (B) = P ( B | A ) × P (A)
This symmetric expression is equivalent to the asymmetric form:
$$P(A \mid B) = \frac{1}{P(B)}\, P(B \mid A) \times P(A)$$
Or using the distribution notation:
$$P_{X \mid Y = y}(x) = \frac{1}{P_Y(y)}\, P_{Y \mid X = x}(y)\, P_X(x)$$
This latter form is the one at the heart of Bayesian Machine Learning as explained in the next
sections.
As suggested by its name, the likelihood $L_z(\theta) = P(Y = y \mid X = x, \Theta = \theta)$ measures how plausible some parameter values θ are given a
sample z = (x, y). The higher the likelihood, the more probable the parameters.
If one assumes our sample set is reduced to one single example z = (x, y), learning the model
consists in finding the best parameter value θ* that maximizes the likelihood Lz (θ). This is what
is called the maximum likelihood estimator (MLE):
$$\hat\theta_{MLE} = \operatorname*{argmax}_{\theta}\, L_z(\theta)$$
Of course one single data point is not sufficient to learn the values of the many
parameters of a model in a robust manner, so the result will be prone to over-fitting. If one assumes to have
several data points Z = {z1 , . . . , zn } = {(x1 , y1 ), . . . , (xn , yn )}, the MLE principle can still be applied but
on a likelihood function that is far more complex as it depends on every sample:
$$L_Z(\theta) = P(Y_1 = y_1, \dots, Y_n = y_n \mid X_1 = x_1, \dots, X_n = x_n, \Theta = \theta)$$
To simplify this problem, one usually assumes samples are independent and identically distributed
(i.i.d.), which is generally an acceptable hypothesis. In this case the likelihood function can be
simplified thanks to the independence of samples given the model (i.e. given the parameters θ):
$$L_Z(\theta) = \frac{P(Y_1 = y_1, \dots, Y_n = y_n, X_1 = x_1, \dots, X_n = x_n \mid \Theta = \theta)}{P(X_1 = x_1, \dots, X_n = x_n \mid \Theta = \theta)} = \frac{\prod_i P(Y_i = y_i, X_i = x_i \mid \Theta = \theta)}{\prod_i P(X_i = x_i \mid \Theta = \theta)} = \prod_i P(Y_i = y_i \mid X_i = x_i, \Theta = \theta) = \prod_i L_{z_i}(\theta)$$
Because the number of data points can be large and because probabilities are always less than one (and
often close to zero), the likelihood can reach extremely low values close to zero. As a consequence,
the numerical optimization to determine $\hat\theta_{MLE}$ can raise precision issues. For this reason a common
practice is to optimize the so-called log likelihood, defined as $\log L_Z(\theta) \stackrel{\text{def}}{=} \log(L_Z(\theta))$, since the logarithm
is an increasing function with a high sensitivity for small positive values. In addition it
transforms multiplication into addition, so that the MLE is finally defined as:
$$\hat\theta_{MLE} \stackrel{\text{def}}{=} \operatorname*{argmax}_{\theta}\, (\log L_Z(\theta)) = \operatorname*{argmax}_{\theta} \left( \sum_i \log L_{z_i}(\theta) \right)$$
Even if the shape of these functions $\log L_{z_i}(\theta)$ can be very complex, one can always use some
numerical optimization method (like gradient ascent, etc.) to find a local maximum θ* that hopefully
will be equal to, or at least a good approximation of, $\hat\theta_{MLE}$. Now we know how to solve our wrestler
detector problem given some parametrized likelihood function. The next step is thus to define the
shape of this likelihood function, in other words to describe how we parametrize P Y | X=x,Θ=θ (y)
with θ.
c(m,h,w,g,d) = P ( M = m | H = h, W = w, G = g, D = d )
c(h,w,g,d) = P ( M = ’yes’ | H = h, W = w, G = g, D = d )
These many coefficients define the components of the parameter vector Θ. Since our model is now
defined, one must address two questions:
• What are the model parameters $\hat\theta_{MLE}$ that maximize the likelihood relative to the sample
set?
In the general case, an approximate resolution method must be used to find a local optimum
of the rightmost term of the previous equation. However the present shape of the log likelihood
function allows us to find the exact solution analytically. Indeed the global maximum point $\hat\theta_{MLE}$
is also a local maximum point of LZ (θ). Moreover if one assumes the likelihood function is never
zero, log LZ (θ) is always differentiable and $\hat\theta_{MLE}$ must verify: $\nabla_\theta \log L_Z(\hat\theta_{MLE}) = 0$.
What is the value of $\frac{\partial \log L_{z_i}(\theta)}{\partial c_{(h,w,g,d)}}$ for a given coefficient $c = c_{(h,w,g,d)}$? This coefficient only appears if
the sample $z_i$ has the same input features, i.e. if $z_i = (h', w', g', d', m')$ with $h' = h$, $w' = w$, $g' = g$
and $d' = d$. There are then two different cases: if $m' = \text{'yes'}$, the contribution of $z_i$ to the log likelihood is $\log c$, with derivative $\frac{1}{c}$; if $m' = \text{'no'}$, the contribution is $\log(1 - c)$, with derivative $-\frac{1}{1 - c}$.
Now let us add these results to get the whole expression of $\frac{\partial \log L_Z(\theta)}{\partial c}$. For this purpose, let us
define $N_c^y$ (respectively $N_c^n$) the number of samples $z = (h', w', g', d', m')$ such that $(h', w', g', d') = (h, w, g, d)$ and $m' = \text{'yes'}$ (resp. $m' = \text{'no'}$):
$$\sum_i \frac{\partial \log L_{z_i}(\theta)}{\partial c} = N_c^y \cdot \frac{1}{c} - N_c^n \cdot \frac{1}{1 - c} = 0$$
Finally isolating $c$ in the latter expression gives the natural and extremely simple result:
$$c = \frac{N_c^y}{N_c^y + N_c^n}$$
In other words, the maximum likelihood estimator computes a discrete conditional distribution
P A | B=b (a) simply by computing, for each value a of A, the ratio of the number $N_{A,B}$ of samples
satisfying simultaneously A = a and B = b over the number $N_B$ of samples satisfying B = b. This
is a general result: computing the coefficients of some discrete distribution occurring in some
MLE estimation $\hat\theta_{MLE}$ can simply be achieved by naive counting in the dataset.
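A minimal sketch of this counting estimator (the dataset format and the field names below are purely illustrative):

```python
from collections import Counter

def conditional_mle(data, a_name, b_name):
    """MLE of P(A = a | B = b) by counting: N(A = a and B = b) / N(B = b)."""
    joint = Counter((row[a_name], row[b_name]) for row in data)
    marg = Counter(row[b_name] for row in data)
    return {(a, b): n / marg[b] for (a, b), n in joint.items()}

# Hypothetical toy dataset for the wrestler example.
data = [
    {"member": "yes", "gender": "m"},
    {"member": "no",  "gender": "m"},
    {"member": "no",  "gender": "f"},
    {"member": "no",  "gender": "m"},
]
print(conditional_mle(data, "member", "gender"))
# {('yes', 'm'): 0.33..., ('no', 'm'): 0.66..., ('no', 'f'): 1.0}
```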
In order to fully specify the joint distribution P (X1 ,...Xm ) | Θ , one needs to learn the coefficients
θx1 ,...,xm = P(X1 ,...Xm ) (x1 , . . . , xm ) from the data. However their huge number, $2^m - 1$, grows exponentially
with m and quickly leads to failure, due to overfitting, even if a large number n of
samples is available. The model of Bayesian Networks helps to specify the same joint distribution
with fewer parameters, assuming one knows some independence relations between variables.
Bayesian Networks have two nested levels of definition: the first level is simply a graph that
states the existence of some factorization of the joint distribution resulting from some independence
relations between variables. The second level is the graph of the first level plus some probability
tables that altogether fully define the probabilities of the joint distribution.
independence relations between random variables, but it is not the only one. Markov networks are another well-known
graphical model.
Figure 23.1: example Bayesian networks over the variables A, B, C, D, E, F and G.
product of functions. For the network of figure 23.1a, this factorization reads:
$$P_{A,B,C,D,E,F,G}(a,b,c,d,e,f,g) = f_A(a) \times f_B(b,a) \times f_C(c) \times f_D(d,b,c) \times f_E(e,c) \times f_F(f,b,d,e) \times f_G(g,f)$$
More generally to be compatible with a given Bayesian network, a decomposition of the joint
distribution must comply with the three following rules:
• There is a one-to-one mapping between variables and factoring functions or factors: e.g. A
is mapped to fA .
• Every factoring function fX mapped to variable X takes as arguments the value x of X along
with the values of the parent variables of X.
• For every variable X of factor fX (x, y1 , . . . , yk ) and for every value (y1 , . . . , yk ) of the parent
variables Y1 , . . . , Yk of X, one always has:
$$\forall y_1, \dots, \forall y_k, \quad \sum_x f_X(x, y_1, \dots, y_k) = 1$$
From these abstract axioms, one can indeed derive a very simple interpretation of these factors:
each factor is the conditional distribution of its variable given its parents.
Let us prove this result on an example, as a sketch for the general proof: let us show that factor
fD (d, b, c) is nothing else than the conditional distribution P D | B=b,C=c (d):
$$P_{B,C,D}(b,c,d) = \sum_{a,e,f,g} P_{A,B,C,D,E,F,G}(a,b,c,d,e,f,g) = \sum_{a,e,f,g} f_A(a) \times f_B(b,a) \times f_C(c) \times f_D(d,b,c) \times f_E(e,c) \times f_F(f,b,d,e) \times f_G(g,f)$$
$$= \sum_a \left( f_A(a) \times f_B(b,a) \times f_C(c) \times f_D(d,b,c) \times \sum_e \left( f_E(e,c) \times \sum_f \left( f_F(f,b,d,e) \times \sum_g f_G(g,f) \right) \right) \right)$$
This expression can be simplified by applying in cascade the summation property on factors,
starting from the end:
$$P_{B,C,D}(b,c,d) = \sum_a f_A(a) \times f_B(b,a) \times f_C(c) \times f_D(d,b,c) \times \sum_e f_E(e,c) \times \sum_f f_F(f,b,d,e) \times 1$$
$$= \sum_a f_A(a) \times f_B(b,a) \times f_C(c) \times f_D(d,b,c) \times \sum_e f_E(e,c) \times 1 = \sum_a f_A(a) \times f_B(b,a) \times f_C(c) \times f_D(d,b,c) \times 1$$
$$= f_D(d,b,c) \times \sum_a f_A(a) \times f_B(b,a) \times f_C(c)$$
Finally,
$$P_{D \mid B=b, C=c}(d) = \frac{P_{B,C,D}(b,c,d)}{P_{B,C}(b,c)} = \frac{f_D(d,b,c) \times \sum_a f_A(a) \times f_B(b,a) \times f_C(c)}{\sum_a f_A(a) \times f_B(b,a) \times f_C(c)} = f_D(d,b,c)$$
For instance, if one assumes all variables of the example of figure 23.1a take their value in {0, 1},
the CPT of variable D could look like table 23.1. The missing probabilities for D = 1 can
be reconstructed using the 1-complement of the probabilities of D = 0.

Table 23.1: CPT of the variable D.
D  B  C  P D | B,C
0  0  0  0.2
0  0  1  0.4
0  1  0  0.1
0  1  1  0.7

Learning the CPTs from some dataset using the Maximum Likelihood principle is again very easy as it amounts to estimating
probabilities by directly counting occurrences in the dataset.
One advantage of such a decomposition is to drastically reduce the number of parameters. Since
all variables of the example are assumed to be binary, the number of entries of the CPT of some
variable X is equal to $2^{|parents(X)|}$. The total number of model parameters is therefore equal to
$2^0 + 2^1 + 2^0 + 2^2 + 2^1 + 2^3 + 2^1 = 20$, to be compared with $2^7 - 1 = 127$ parameters for specifying
the same non-factorized joint distribution. The advantage of the reduction is two-fold: not only
does the model require less memory and computation time, but it also reduces the risk of overfitting.
Of course CPTs are only valid for discrete variables. Continuous variables require parametrizing
their distributions, but the number of required parameters is usually very low, so that the total
parameter number of the whole model remains low. This case will be explained later.
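A small sketch of this parameter count, with the parent sets read off the factorization above (binary variables assumed):

```python
# Parent sets of the network of figure 23.1a, as read off the factorization above.
parents = {"A": [], "B": ["A"], "C": [], "D": ["B", "C"],
           "E": ["C"], "F": ["B", "D", "E"], "G": ["F"]}

# With binary variables, the CPT of X has 2^|parents(X)| entries.
n_factorized = sum(2 ** len(p) for p in parents.values())
n_full = 2 ** len(parents) - 1       # non-factorized joint distribution
print(n_factorized, n_full)          # 20 127
```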
The graph of a Bayesian network helps to formalize the hypotheses of ML methods in terms of
variable independence. In order to understand this point, one must first study, for some Bayesian
network, the necessary and sufficient condition for two subsets X and Y of variables to be independent.
Let us consider the case where these subsets are reduced to one single variable, respectively
X and Y . Clearly if X is a parent of Y , they are dependent (unless the CPT of Y does not
depend on the value of X, but in this case the connection from X to Y can be dropped). Let us
consider the interesting case where X and Y are not directly connected but they are indirectly
connected through a third variable Z. Table 23.2 lists all of these graph configurations modulo
variable renaming symmetry.
Let us prove this last third case (c) (the other proofs are left to the reader as they just consist in similar derivations):
$$P_{Y \mid X} = \frac{P_{X,Y}}{P_X} = \frac{\sum_Z P_{X,Y,Z}}{\sum_{Y,Z} P_{X,Y,Z}} = \frac{\sum_Z P_X\, P_Y\, P_{Z \mid X,Y}}{\sum_{Y,Z} P_X\, P_Y\, P_{Z \mid X,Y}} = \frac{P_X\, P_Y \sum_Z P_{Z \mid X,Y}}{P_X \sum_Y P_Y \sum_Z P_{Z \mid X,Y}} = \frac{P_X\, P_Y}{P_X} = P_Y$$
What happens now if the value z of Z is known? Does it change the independence relation
between X and Y? It completely reverses the results, as shown in table 23.3. Note the graphical
convention: when the value of a variable is known, the variable is colored in gray.

Table 23.3: Independence configurations of X and Y given Z (i.e. the node of Z is grayed out).

Again, let us prove one of these cases:
$$P_{Y \mid X, Z} = \frac{P_{X,Y,Z}}{P_{X,Z}} = \frac{P_Z\, P_{X \mid Z}\, P_{Y \mid Z}}{\sum_Y P_Z\, P_{X \mid Z}\, P_{Y \mid Z}} = \frac{P_Z\, P_{X \mid Z}\, P_{Y \mid Z}}{P_Z\, P_{X \mid Z} \sum_Y P_{Y \mid Z}} = \frac{P_Z\, P_{X \mid Z}\, P_{Y \mid Z}}{P_Z\, P_{X \mid Z} \times 1} = P_{Y \mid Z}$$
One sees on this example that the information contained in the observation “variable X is equal to
some value x” spreads over the graph by modifying the distribution of surrounding variables. A
variable Y will be independent of variable X if its distribution is unchanged, that is, if, along any
undirected path connecting X and Y, some intermediary node Z in some configuration blocks
the information X = x. The question is to identify these blocking conditions. The
answer to this question is provided by the following algorithm, whose application scope is even
more general since it determines whether two sets of variables are independent conditionally on a
third set of variables. This theorem introduces the notion of d-separation.
Given a Bayesian network, two subsets X and Y of variables are independent of
each other given the values of a third set Z of variables, if and only if:
1. For every couple (X, Y ) of variables of X × Y, decide whether X and Y are
d-separated given Z (definition to come shortly). If one of them is not d-separated,
conclude X and Y are not independent conditionally on Z. If all of
them are d-separated, then they are independent.
2. Two variables X and Y are d-separated given Z if and only if every undirected
path P going from X to Y is d-separated, that is, if one of the following events
occurs along the path:
Let us take some examples based on the Bayesian Network of Fig. 23.1a. By introducing the
notation X ⊥⊥ Y | Z to state X and Y are independent given Z, one has:
• $B \perp\!\!\!\perp E$ since every path connecting B and E is a d-separation:
• $B \perp\!\!\!\perp E \mid D, C$ since now path B → D ← C → E is blocked at C (config. a').
• $B \not\perp\!\!\!\perp E \mid G$ since path B → F ← E is not blocking as G is a descendant of F and G is
known (config. c).
• $C \not\perp\!\!\!\perp G \mid D$ since path C → E → F → G is not blocked at E and F (config. b in both
cases).
• $C \perp\!\!\!\perp G \mid D, E$ since every path connecting C and G is a d-separation:
• Clearly the gender G is the “most causal” variable so that it has no parent in the Bayesian
network.
• The height is not caused by any other variables but the gender. The only parent of H is thus
G.
3 Establishing the right causality relations can be the source of intense arguments: for instance, should the
level of sports practice depend on gender? Men and women might disagree on the subject. . . However the practical
importance of these arguments must be mitigated: the orientation of some edges has generally not much consequence
on the results.
• Deciding on practicing wrestling mainly depends on the gender and the height: M has two
parents G and H.
• Sport practice is a consequence of practicing wrestling (i.e. all wrestlers practice sports),
and possibly also of the gender but not of the height, so S has only M and G as parents.
• Wearing sportswear can be explained by sport practice and possibly by the gender: D has
two parents S and G.
• Finally the weight is determined by the gender, the height, sport practice and if the student
is a wrestler: W has four parents G, H, S and M .
All these causal relations define the Bayesian Network of figure 23.2a.
Figure 23.2: Bayesian networks of the wrestler example: (a) the network built from the causal relations above, including the latent variable Sport; (b) the network obtained after removing Sport.
The next step is to fill the CPTs with probabilities by estimating them from the dataset.
However the variable S is latent: one does not know its values. There exist some general methods
to infer the distribution of hidden variables, as will be explained in chapter 25. For now, it is
simpler to delete this variable S from the network along with its incident edges. However the
deletion of a variable might cut some paths of dependence between the remaining variables. These
links must be restored. In the example, there are essentially 6 paths going through S: between G
and D, G and M, G and W, M and D, M and W, and D and W. However G is already directly
connected to D so there is no edge to add. The cases G and M, G and W, and M and W are
similar. The path between M and D must be restored: since causality is a transitive relationship,
clearly the new arc must go from M to D. Finally the path between D and W must be restored,
but here there is no obvious causal relation between D and W: one arbitrarily chooses to add an
arc from W to D. This gives the Bayesian Network of figure 23.2b.
Since weight and height have been divided into 20 intervals, one can determine that the number
of model parameters (i.e. the number of CPT entries) of the network of figure 23.2b is about 1750, as
detailed in table 23.4. Most of the parameters come from the CPT of the weight W as it combines
the 20 possible values of W with the 20 possible values of the height H. In practice however, many
combinations are unlikely or even impossible (like a 50 kg, 2.3 m tall person), so that the number
of parameters actually used for classification is much smaller.
The optimal parameter value $\hat\Theta$ can be obtained analytically for most standard distributions. For
instance the parameters of a normal distribution $\mathcal{N}(\mu, \sigma^2)$ can directly be computed by estimating
the mean and variance of the samples (where $J(y_1, \dots, y_k)$ denotes the set of indices of the samples whose parent variables take the values $(y_1, \dots, y_k)$):
$$\hat\mu = \frac{\sum_{j \in J(y_1, \dots, y_k)} x^j}{|J(y_1, \dots, y_k)|} \qquad \hat\sigma^2 = \frac{\sum_{j \in J(y_1, \dots, y_k)} \left( x^j - \hat\mu \right)^2}{|J(y_1, \dots, y_k)| - 1}$$
Example:
Let’s take the height variable H as an example. H has only one parent, the gender G. One can
assume the height has a normal distribution $\mathcal{N}(\mu_g, \sigma_g^2)$ given the gender G = g. Only four real
parameters $\mu_{female}$, $\sigma_{female}^2$, $\mu_{male}$ and $\sigma_{male}^2$ must be learnt instead of a CPT of 38 entries.
If some parent variables are continuous, the dependence between these variables and X may
also be modelled by some parametrized function.
Example:
To make it clear, let’s take the example of the weight variable W . W has G, M and H as parents.
4 The (k, θ)-gamma distribution has a density defined on [0, ∞[ equal to $f(x) = \frac{1}{\Gamma(k)\, \theta^k}\, x^{k-1} e^{-x/\theta}$.
5 Of course it is only an approximation, as a normal distribution could theoretically generate negative weights.
$$\forall i, \forall j, \quad i \neq j \Rightarrow X_i \perp\!\!\!\perp X_j \mid Y$$
This amounts to considering the very simple Bayesian Network displayed on figure 23.3a. Figure 23.3b
gives an equivalent representation using plates.

Figure 23.3: the Naive Bayes network, (a) in explicit form with Y as the single parent of X1 , . . . , XM , and (b) in plate notation.

Plates are a graphical convention of Bayesian
Networks to express complex networks in a compact form: given a plate, its content must be
duplicated as many times as the specified number (usually given in the bottom right corner of the
plate). Such a Bayesian Network states that the joint distribution can be factorized according to
the equation:
$$P_{X_1, \dots, X_M, Y}(x_1, \dots, x_M, y) = P_Y(y) \times \prod_i P_{X_i \mid Y}(x_i)$$
Let’s look at the consequences of such a hypothesis on both steps of model learning and model
prediction.
Learning step
Learning a Naive Bayes model consists in estimating the discrete distribution PY of the target
variable Y and the distribution P Xi | Y of every feature variable Xi given the class variable Y .
• PY (y) is estimated as the number of samples of class Y = y divided by the size N of the
dataset, that is, if one denotes, for some event A, N(A) the number of data points verifying A:
$$\hat P_Y(y) = \frac{N(Y = y)}{N}$$
• For a discrete feature Xi , the conditional distribution is estimated in the same way:
$$\hat P_{X_i \mid Y = y}(x_i) = \frac{N(Y = y \cap X_i = x_i)}{N(Y = y)}$$
• For a continuous feature Xi , one chooses a parametric distribution whose parameter vector
maximizes the likelihood. If $x_i^j$ denotes the value of feature Xi of the jth data point, one has:
$$\hat\Theta_i^y = \operatorname*{argmax}_{\Theta} \sum_j \log P_{X_i \mid Y = y_j, \Theta}(x_i^j)$$
Prediction step
Given a new example to classify, predicting the distribution of its class y is straightforward given
its features (x1 , . . . , xm ):
$$P_{Y \mid X_1 = x_1, \dots, X_M = x_m}(y) = \frac{1}{K}\, \hat P_Y(y) \times \prod_i \hat P_{X_i \mid Y = y}(x_i),$$
where K is a normalization constant that does not depend on y. If the risk is defined on the standard 0-1 loss function, the best classifier is the one that chooses
the class of highest probability:
$$\hat y = \operatorname*{argmax}_y P_{Y \mid X_1 = x_1, \dots, X_M = x_m}(y) = \operatorname*{argmax}_y P_{X_1, \dots, X_M, Y}(x_1, \dots, x_m, y) = \operatorname*{argmax}_y \left( \hat P_Y(y) \times \prod_i \hat P_{X_i \mid Y}(x_i) \right)$$
Therefore predicting y just consists in computing a product of probabilities for each value of y and
selecting the class with the highest product value. Note that the computation of K is not necessary.
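A minimal sketch of the resulting classifier for categorical features (pure Python; smoothing is deliberately left out here, it comes back with the fully Bayesian version in the next sections):

```python
from collections import Counter, defaultdict

def nb_fit(X, y):
    """Learn P(Y) and P(X_i | Y) by counting (one pass over the dataset)."""
    n = len(y)
    p_y = {c: k / n for c, k in Counter(y).items()}
    counts = defaultdict(Counter)          # (feature index, class) -> Counter of values
    for xs, c in zip(X, y):
        for i, v in enumerate(xs):
            counts[(i, c)][v] += 1
    class_n = Counter(y)
    p_x_given_y = {key: {v: k / class_n[key[1]] for v, k in cnt.items()}
                   for key, cnt in counts.items()}
    return p_y, p_x_given_y

def nb_predict(model, xs):
    """Return argmax_y P(y) * prod_i P(x_i | y); the normalization constant K is not needed."""
    p_y, p_x_given_y = model
    best, best_score = None, -1.0
    for c, py in p_y.items():
        score = py
        for i, v in enumerate(xs):
            score *= p_x_given_y.get((i, c), {}).get(v, 0.0)
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical toy data: features = (height bin, gender), class = wrestler or not.
X = [("tall", "m"), ("tall", "m"), ("short", "f"), ("short", "m")]
y = ["yes", "yes", "no", "no"]
model = nb_fit(X, y)
print(nb_predict(model, ("tall", "m")))    # 'yes'
```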
Conclusions
Naive Bayes is a must-know Bayesian classification method, if only because it is the simplest one. As a
consequence, its advantages and drawbacks are extreme. The main advantages are:
• The risk of overfitting is lower than with any other state-of-the-art method.
• The method is extremely fast and scalable for both the prediction and learning steps. The model
can be learnt very quickly with only one scan of the dataset and with a time complexity of
Θ(N × M ).
• Discriminative models are less powerful than generative models as they only attempt to model
the conditional distribution P Y | X,Θ of some output variable Y given some input variables
X. The model does not take into account the distribution of input features PX . These models
are only suitable for supervised learning problems. Note that given a generative model, it is
always possible to deduce a discriminative model as:
$$P_{Y \mid X, \Theta} = \frac{P_{X, Y \mid \Theta}}{\sum_Y P_{X, Y \mid \Theta}}$$
Example:
Let us take an extremely simple example that will be further developed in the next sections: one
wants to estimate the performance of a given network server. The corresponding generative model
defines the distribution PT of the processing time T required to answer a query. To keep it simple,
T is assumed to be normally distributed (i.e. $T \sim \mathcal{N}(\mu, \sigma^2)$ with $\theta = [\mu, \sigma^2]$), even if this is theoretically
absurd: indeed, with such a distribution, the probability of having some negative processing
time is non-null.
However Bayesian Machine Learning does not work on a single hypothesis or model but on a
whole distribution of models in order to represent the current uncertainty about the knowledge of
the model. In other words, the model should be viewed as a random variable denoted M, which can
be represented equivalently by a random variable Θ of model parameters6 . In order to describe the
6 Depending on the context, one interchangeably uses M or Θ to represent the model as a random variable.
distribution PΘ of the model parameters, one usually uses some parametric representation P Θ | κ
whose parameters κ are called hyperparameters. Hyperparameters are not random variables but
constants. This fundamental difference is represented schematically by the Bayesian networks of
figure 23.4.
Figure 23.4: Difference between Bayesian Machine Learning in a weak sense (model Θ → X) and in a strong sense (κ → Θ → X, where the hyperparameters κ are constants).
Example:
The model parameter vector Θ of the network server is [µ, σ]. Let’s assume its technical specification
states the standard deviation σ of the processing time is almost fixed, equal to 10 ms. However the
average processing time µ can vary depending on some components of the server. It is equal to
50 ms ± 5 ms. This initial knowledge can be modelled by defining a distribution on Θ = [µ, σ]:
$\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$ with $\mu_0 = 50$ ms and $\sigma_0 = 5$ ms, and $P(\sigma) = \delta(\sigma, \sigma_T)$ with $\sigma_T = 10$ ms.
Like any other Machine Learning method, a Bayesian method must provide two algorithms: a
learning algorithm and a predicting algorithm.
• The learning algorithm must infer the distribution P Θ | S,κ (θ) of the parameters/hypotheses conditioned
on the set of observed samples S and the values of the hyperparameters κ.
• The prediction algorithm must infer an output y from some new input x and the learnt
distribution P Θ | S,κ of the parameters/hypotheses.
Of course learning and prediction steps can be interleaved as Bayesian methods are naturally
incremental: the learning algorithm first consists in updating the current model given some new
samples, the second one in predicting some new samples given the current model. These two steps
are further explained in the two next paragraphs.
Bayesian prediction
Given a Bayesian model with parameters Θ and possibly some input x, what output value y should
the model predict? In fact this question is nonsense for two reasons: first the model is stochastic, the
correct answer to give is that the output value will be drawn according to distribution P Y | θ,X=x .
Secondly, even the parameters Θ are not known perfectly and are described by some distribution
P Θ | S,κ . Predicting the distribution of the output must thus be averaged with the distribution of
parameters:
$$P_{Y \mid X=x, S, \kappa}(y) = \mathbb{E}_\Theta\left[ P_{Y \mid \Theta, X=x}(y) \right] = \int P_{Y \mid \Theta=\theta, X=x}(y) \cdot P_{\Theta \mid S, \kappa}(\theta)\, d\theta$$
Example:
On the server example, the distribution (in probability density) of the output T is
$$P_{T \mid \kappa}(t) = \int P_{T \mid \theta}(t) \times P_{\theta \mid \kappa}(\theta)\, d\theta \propto \iint e^{-\frac{(t-\mu)^2}{2\sigma^2}} \times e^{-\frac{(\mu-\mu_0)^2}{2\sigma_0^2}}\, \delta(\sigma, \sigma_T)\, d\mu\, d\sigma \propto \int e^{-\frac{(t-\mu)^2}{2\sigma_T^2}} \times e^{-\frac{(\mu-\mu_0)^2}{2\sigma_0^2}}\, d\mu$$
$$\propto e^{-\frac{(t-\mu_0)^2}{2(\sigma_0^2 + \sigma_T^2)}} \times \int e^{-\frac{(\mu-\mu_1)^2}{2\sigma_1^2}}\, d\mu \propto e^{-\frac{(t-\mu_0)^2}{2(\sigma_0^2 + \sigma_T^2)}} \times \int \frac{1}{\sqrt{2\pi\sigma_1^2}}\, e^{-\frac{(\mu-\mu_1)^2}{2\sigma_1^2}}\, d\mu$$
$$P_{T \mid \kappa}(t) \propto e^{-\frac{(t-\mu_0)^2}{2(\sigma_0^2 + \sigma_T^2)}}$$
As expected, the most probable value for t is µ0 = 50 ms. However the variance is σ0 2 + σT2 =
125 ms2 . This is because the uncertainty on t comes from two different sources: first the varying
processing time of queries (represented by σT2 ) and second the uncertain knowledge about the aver-
age processing time of the server (represented by σ0 2 ).
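This decomposition of the predictive variance can be checked with a short Monte Carlo sketch (assuming NumPy and the numbers of the example, µ0 = 50 ms, σ0 = 5 ms, σT = 10 ms):

```python
import numpy as np

rng = np.random.default_rng(0)
mu0, sigma0, sigmaT = 50.0, 5.0, 10.0

# Sample the parameter mu from the prior, then the processing time t given mu.
mu = rng.normal(mu0, sigma0, size=1_000_000)
t = rng.normal(mu, sigmaT)

print(t.mean(), t.var())   # ~50 and ~125 = sigma0^2 + sigmaT^2
```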
Bayesian inference
In Bayesian Machine Learning the distribution P Θ | S,κ of the model is updated every time some
new data is added to the sample set S. These data are also called observations to emphasize the fact
samples can come on-the-fly and that the Machine Learning process never stops. Given a supervised
or unsupervised learning problem with some set O of observations, the general methodology of
Bayesian Machine Learning is the following:
1. First choose a family of generative models as a candidate for producing an observation o. This
family of models is supposedly characterized by a distribution P ( O | θ ) for some unknown
parameter vector θ.
2. Second, consider the model parameters θ as a random variable Θ and choose some distribution
PΘ (θ). This distribution PΘ (θ), called the a priori distribution or prior, is initialized to represent
the prior knowledge, if any, on the model. In practice the prior is parametrized by some
hyperparameters κ and should be written P Θ | κ (θ).
3. The third step is to condition the distribution PΘ (θ) on the observations O, that is, to replace
P Θ | κ (θ) by P Θ | O,κ (θ). If the observations contain sufficient information, the uncertainty (i.e.
entropy) on the model will decrease.
4. The third step is repeated every time some new observation becomes available.
The third step is where Bayes’ rule comes into action. To see this, let us consider an abstract
model defined by distribution PΘ (θ). A first observation o1 (or possibly even a set of samples) becomes
available. Thanks to Bayes’ rule, P Θ | O=o1 ,κ can be deduced from P Θ | κ :
$$P_{\Theta \mid O = o_1, \kappa}(\theta) = \frac{1}{P_{O \mid \kappa}(o_1)}\, P_{O \mid \Theta = \theta, \kappa}(o_1)\, P_{\Theta \mid \kappa}(\theta) = \frac{1}{P_{O \mid \kappa}(o_1)}\, P_{O \mid \Theta = \theta}(o_1)\, P_{\Theta \mid \kappa}(\theta),$$
$$\text{with} \quad P_{O \mid \kappa}(o_1) = \int P_{O \mid \Theta = \theta}(o_1)\, P_{\Theta \mid \kappa}(\theta)\, d\theta$$
Example:
On the server example, suppose a first query is processed in t1 = 46 ms. How does it update our
knowledge on the parameters Θ = [µ, σ²] describing our server? Let’s apply the previous equation in
this context:
$$P_{\Theta \mid T = t_1, \kappa}(\theta) = \frac{1}{P_{T \mid \kappa}(t_1)}\, P_{T \mid \Theta = \theta}(t_1) \times P_{\Theta \mid \kappa}(\theta) \propto e^{-\frac{(t_1 - \mu)^2}{2\sigma^2}} \times e^{-\frac{(\mu - \mu_0)^2}{2\sigma_0^2}}\, \delta(\sigma, \sigma_T) \propto e^{-\frac{(t_1 - \mu)^2}{2\sigma_T^2}} \times e^{-\frac{(\mu - \mu_0)^2}{2\sigma_0^2}}\, \delta(\sigma, \sigma_T)$$
$$\propto e^{-\frac{(t_1 - \mu_0)^2}{2(\sigma_0^2 + \sigma_T^2)}} \times e^{-\frac{(\mu - \mu_1)^2}{2\sigma_1^2}}\, \delta(\sigma, \sigma_T) \propto e^{-\frac{(\mu - \mu_1)^2}{2\sigma_1^2}}\, \delta(\sigma, \sigma_T)$$
where the constants σ1 = 4.47 ms and µ1 = 49.2 ms are the same as previously defined in equation (23.83):
$$\sigma_1^2 = \frac{1}{\frac{1}{\sigma_0^2} + \frac{1}{\sigma_T^2}} \qquad \text{and} \qquad \mu_1 = \frac{\frac{1}{\sigma_0^2}\,\mu_0 + \frac{1}{\sigma_T^2}\, t_1}{\frac{1}{\sigma_0^2} + \frac{1}{\sigma_T^2}}$$
• $\mu \perp\!\!\!\perp \sigma \mid T_1 = t_1, \kappa$ (as µ and σ are separable in the joint posterior distribution).
– µ1 is a weighted average of µ0 with weight $\frac{1}{\sigma_0^2}$ and of t1 with weight $\frac{1}{\sigma_T^2}$: µ1 is thus a
compromise between the prior knowledge of µ, represented by µ0 , and the observation
t1 . The smaller the variance ($\sigma_0^2$ for µ0 , $\sigma_T^2$ for t1 ), the higher the weight.
Repeating the operation when new observations o2 and o3 become available gives (with some simplified
notation):
$$P_{\Theta \mid o_1, o_2, \kappa} = \frac{1}{P_{o_2 \mid o_1, \kappa}}\, P_{o_2 \mid \Theta}\, P_{\Theta \mid o_1, \kappa} \qquad P_{\Theta \mid o_1, o_2, o_3, \kappa} = \frac{1}{P_{o_3 \mid o_1, o_2, \kappa}}\, P_{o_3 \mid \Theta}\, P_{\Theta \mid o_1, o_2, \kappa}$$
The more observations are collected, the less influence the initial prior P Θ | κ has on the posterior
P Θ | O,κ .
Example:
On the server example, suppose n observations (t1 , . . . , tn ) have been made. How does it update our
knowledge on the parameters Θ = [µ, σ²] describing our server? By applying n times the Bayesian
inference found previously for each observation ti , it is straightforward to show:
$$\mu \mid t_1, \dots, t_n, \kappa \sim \mathcal{N}(\mu_n, \sigma_n^2) \quad \text{with} \quad \sigma_n^2 = \frac{1}{\frac{n}{\sigma_T^2} + \frac{1}{\sigma_0^2}} \quad \text{and} \quad \mu_n = \frac{\frac{n}{\sigma_T^2}\,\bar t + \frac{1}{\sigma_0^2}\,\mu_0}{\frac{n}{\sigma_T^2} + \frac{1}{\sigma_0^2}} \quad \text{with} \quad \bar t = \frac{1}{n}\sum_i t_i$$
The variance $\sigma_n^2$ decreases in Θ(1/n), so that µ converges relatively slowly, in Θ(1/√n), towards
the average $\bar t$ of the observations when the number n of observations increases. The expression of
µn also shows that the weight of the prior becomes negligible when n is large.
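A minimal sketch of this sequential inference, where each observation turns the current prior into the next posterior (pure Python; the observation values are hypothetical):

```python
def update(mu_prior, var_prior, t, var_noise):
    """Posterior of mu after observing t, for a normal prior and a known noise variance."""
    var_post = 1.0 / (1.0 / var_prior + 1.0 / var_noise)
    mu_post = var_post * (mu_prior / var_prior + t / var_noise)
    return mu_post, var_post

mu, var = 50.0, 5.0 ** 2             # prior: mu0 = 50 ms, sigma0 = 5 ms
var_noise = 10.0 ** 2                # known sigma_T = 10 ms
for t in [46.0, 52.0, 49.0]:         # hypothetical processing times
    mu, var = update(mu, var, t, var_noise)
    print(mu, var)                   # first step: mu = 49.2, var = 20.0
```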
More generally, Bayesian inference consists in updating the model distribution P ( Θ | O ), established
so far from the past observations O, when some new observation o is received:
$$P(\Theta \mid O, o) = \frac{P(o \mid \Theta)\, P(\Theta \mid O)}{P(o \mid O)}$$
From a practical perspective, it is useless to memorize all successive models and observations.
Algorithms only keep in memory a representation of the current distribution PΘ of parameters,
replacing the prior by the posterior once a new observation is made. Assuming all parameters are
discrete and the number V of their value combinations is not too large, the typical structure of a
Bayesian method looks like algorithm 21. However, most of the time the space of parameters is
not countable (e.g. continuous parameters) or is very large, so that the previous algorithm is not
tractable. Two solutions then exist:
• Either use some standard sampling techniques (Monte Carlo or Markov Chain Monte Carlo
algorithms) to have some approximated representation of posterior PΘ . These techniques
will be introduced later in chapter 28.
• Or choose some families of prior distribution parametrized by some hyperparameters such
that the posterior distribution has the same algebraic form as the prior. Such distributions
are called conjugate priors and are presented in the next subsection.
We have seen on the server example that, with a normal prior on µ and a normal likelihood, the posterior on µ is again normal. Since the posterior
becomes the prior of the next observation, one has the guarantee that this property propagates
and that the distribution of parameters will always be normal. In such a case, Bayesian inference
is only about updating the hyperparameters of PΘ , and this can be done efficiently with normal
distributions. Such a property is actually not restricted to normal distributions and is intensively
used in Bayesian Machine Learning:
Example:
One has already seen that given a normal likelihood $X \sim \mathcal{N}(\mu, \sigma^2)$ parametrized by $\Theta = [\mu, \sigma^2]$,
one possible conjugate prior is a constant variance $\sigma_X^2$ and a normal distribution $\mathcal{N}(\mu_0, \sigma_0^2)$ for
µ, parametrized by $\kappa = [\mu_0, \sigma_0]$. The resulting posterior after making observations (x1 , . . . , xn ) is
$\mu \mid x_1, \dots, x_n \sim \mathcal{N}(\mu_n, \sigma_n^2)$ where:
$$\sigma_n^2 = \frac{1}{\frac{n}{\sigma_X^2} + \frac{1}{\sigma_0^2}} \qquad \text{and} \qquad \mu_n = \frac{\frac{n}{\sigma_X^2}\,\bar x + \frac{1}{\sigma_0^2}\,\mu_0}{\frac{n}{\sigma_X^2} + \frac{1}{\sigma_0^2}} \quad \text{with} \quad \bar x = \frac{1}{n}\sum_i x_i$$
The same holds in the multivariate case: for a known covariance Σ and a prior $\vec\mu \sim \mathcal{N}(\vec\mu_0, \Sigma_0)$, the posterior is $\vec\mu \mid \vec x_1, \dots, \vec x_n \sim \mathcal{N}(\vec\mu_1, \Sigma_1)$ with:
$$\Sigma_1 = \left( n\,\Sigma^{-1} + \Sigma_0^{-1} \right)^{-1} \qquad \vec\mu_1 = \left( n\,\Sigma^{-1} + \Sigma_0^{-1} \right)^{-1} \left( n\,\Sigma^{-1} \bar{\vec x} + \Sigma_0^{-1} \vec\mu_0 \right) \quad \text{with} \quad \bar{\vec x} = \frac{1}{n}\sum_i \vec x_i$$
This conjugate prior is a generalization of the univariate case and is extensively used in many
applications such as Bayesian linear regression, Gaussian Processes, Kalman filters, etc. Most of the
common discrete and continuous distributions have conjugate priors (see for instance wikipedia).
The parameters Θ of the model are the CPTs of the underlying Bayesian network. It has already
been shown that the MLE principle amounts to counting occurrences in the dataset:
$$\hat P_Y(y) = \frac{N(Y = y)}{N} \qquad \hat P_{X_i \mid Y = y}(x_i) = \frac{N(Y = y \cap X_i = x_i)}{N(Y = y)}$$
Because every CPT has its own parameters, every CPT can be processed as an independent
likelihood function. Let’s focus for instance on the CPT of Y. If Y can take k values encoded by
numbers from 1 to k, the CPT is a categorical distribution described by k probabilities (θ1 , . . . , θk )
such that $\sum_i \theta_i = 1$ (so in practice there are only k − 1 degrees of freedom).
Given such a likelihood function P Y | Θ with Θ = [θ1 , . . . , θk ], is there any simple conjugate
prior? Such a prior must be a distribution over categorical distributions: in other words, a sample of
this distribution must be a point of the (k − 1)-simplex, denoted $\Delta^{k-1}$, that is, the subset
of $\mathbb{R}^k$ whose point coordinates (θ1 , . . . , θk ) are non-negative and verify $\sum_i \theta_i = 1$. To answer this question one first
introduces Dirichlet distributions:
The normalisation factor B(α) is the beta function of the vector α. It is equal to:
$$B(\alpha) = \frac{\prod_i \Gamma(\alpha_i)}{\Gamma\left( \sum_i \alpha_i \right)}$$
The gamma function Γ generalizes the factorial function, with Γ(n) = (n − 1)! for n ∈ ℕ*:
$$\Gamma(x) = \int_0^{\infty} t^{x-1} e^{-t}\, dt$$
One then claims that Dirichlet distributions are conjugate priors of categorical distributions.
In order to prove it, let us define Ni the number of observations equal to i among some observations
O = (y1 , . . . , yn ). Then, considering that the prior is a Dirichlet distribution of hyperparameters κ =
(α1 , . . . , αk ), one has:
The posterior distribution is thus another Dirichlet distribution, of parameters α′ = (N1 + α1 , . . . , Nk + αk ).
This also gives a natural interpretation of the prior parameters αi : the presence of a non-null
parameter αi amounts to observing αi fictitious observations whose value yj is equal to i. This interpretation
gives a rule of thumb to determine a prior that represents the right amount of confidence
in the initial knowledge of the problem.
Finally, the strong Bayesian version of Naive Bayes only consists in initializing every CPT
entry with some α value. From the point of view of the implementation, the difference between
the standard Naive Bayes and the fully Bayesian Naive Bayes is very small (it only initializes
already existing counters to some non-null values instead of setting them to zero). This illustrates
the point that Bayesian Machine Learning in the weak and in the strong sense are just two available
options for the same class of methods. The next section on the Bayes estimator makes the link even
stronger.
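A sketch of the conjugate update for one categorical distribution (with α = 1 for every category this is the usual Laplace smoothing of the counts):

```python
from collections import Counter

def dirichlet_posterior(observations, categories, alpha=1.0):
    """Posterior Dirichlet parameters and posterior mean after observing categorical data."""
    counts = Counter(observations)
    post = {c: alpha + counts.get(c, 0) for c in categories}   # alpha_i' = alpha_i + N_i
    total = sum(post.values())
    mean = {c: a / total for c, a in post.items()}             # E[theta_i | O]
    return post, mean

obs = ["yes", "yes", "no", "yes"]
print(dirichlet_posterior(obs, ["yes", "no"], alpha=1.0))
# ({'yes': 4, 'no': 2}, {'yes': 0.66..., 'no': 0.33...})
```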
Bayes estimator
Given some parameter distribution PΘ , Bayesian estimation consists in finding the optimal parameter
value θ* relative to some user-defined loss function. This operation is called the Bayes
estimator:
Given some data O, some learnt model P Θ | O and some loss function $L(\theta, \hat\theta)$ measuring
the cost of choosing parameters $\hat\theta$ when the real parameters are equal to θ, the
Bayes estimator $\hat\theta$ is the parameter value that minimizes the risk relative to the
posterior distribution P Θ | O :
$$\hat\theta = \operatorname*{argmin}_{\theta'} \mathbb{E}_{P_{\Theta \mid O}}\left[ L(\Theta, \theta') \right]$$
Example:
In the case where the loss is quadratic, $L(\theta, \hat\theta) = (\hat\theta - \theta)^2$, the Bayes estimator is the expected value of
Θ | O:
$$\hat\theta = \operatorname*{argmin}_{\theta'} \mathbb{E}_{P_{\Theta \mid O}}\left[ (\theta' - \Theta)^2 \right] = \operatorname*{argmin}_{\theta'} \left( (\theta' - \mathbb{E}[\Theta])^2 + \mathbb{E}\left[ (\Theta - \mathbb{E}[\Theta])^2 \right] \right) = \mathbb{E}_{P_{\Theta \mid O}}[\Theta]$$
The Maximum A Posteriori estimator (MAP) selects the parameters that maximize
the posterior:
$$\hat\theta_{MAP} = \operatorname*{argmax}_{\theta'} \left( P(O \mid \Theta = \theta')\, P(\Theta = \theta') \right)$$
The Maximum Likelihood estimator (MLE) selects the parameters that maximize
the likelihood:
$$\hat\theta_{MLE} = \operatorname*{argmax}_{\theta'} P(O \mid \Theta = \theta')$$
As a conclusion, the MLE principle is a specific subcase of Bayesian estimation, applicable when the
problem has no prior knowledge on the model parameters and is not cost-sensitive.
Chapter 24
Gaussian and Linear Models for Supervised Learning
In the previous chapter, a first classification method called “Naive Bayes” was presented as being
probably the most straightforward Bayesian learning method. However, the underlying assumption
of descriptive features being independent conditionally on the class feature is most of the time too naive.
What are the consequences for the models if one rejects this oversimplifying hypothesis? In the case
of categorical descriptive features, one already knows it amounts to merging the dependent descriptive
features into one joint random variable1 whose distribution is still a categorical distribution. Even
if the introduction of this new categorical distribution requires a larger number of model parameters
and thus increases the risk of overfitting, the form of the resulting model remains unchanged, i.e.
identical to the initial Naive Bayes setting.
However, in the case of continuous features, merging them into a single joint random variable
requires a deeper analysis. The distribution of the resulting random variable obviously depends on
the marginal distributions of the descriptive features. However many joint distributions can share the
same set of marginal distributions. Even in the simplest case where all these features are assumed
to have a univariate Gaussian distribution, the resulting joint distribution might not be Gaussian.
In this chapter one focuses on the specific case where this joint distribution is indeed Gaussian, i.e.
is a multivariate normal distribution (MVN). As will be seen shortly, the multivariate normal
distribution is the simplest and most natural form of joint distribution for continuous variables
in order to take into account correlation (and thus dependency) between real random variables.
MVNs have elegant algebraic properties that underlie many Bayesian methods presented in this
chapter and the subsequent ones. In particular, normal distributions are closely related to linear
models.
The current chapter thus shows how normal distributions occur in Bayesian linear models
or when relaxing the independence assumption in Naive Bayes. To this end, section 24.1
first investigates some fundamental properties of MVNs. These properties are then applied in
section 24.2 to Bayesian classification without requiring, like Naive Bayes, any strong hypothesis of
independence. Section 24.3 then considers linear regression problems and shows how the Bayesian
approach generalizes and legitimates the classical Ordinary Least Squares (OLS) method and
regularized versions of it, like Ridge regression. Finally section ?? considers linear classification
problems and again shows how the Bayesian approach generalizes the classical logistic regression.
1 Or several joint random variables if the dependent descriptive features can be gathered in such a way that these groups are mutually independent conditionally on the target feature.
Figure 24.1: Example of a MVN pdf with d = 2, µ = [1, 1]^T, Σ = [[3, 1], [1, 2]].
It is interesting to observe that the MVN is directly parameterized by moments of the first and second order (i.e. mean and covariance) and that the MVN is the most uncertain (maximum-entropy) distribution with a given expected value and covariance. Since most ML methods only estimate the first two moments, MVNs are a natural choice to represent an unknown distribution constrained to have a given mean and covariance.
One fundamental property intensively used by linear models is the fact that a linear function of a normal random vector is again a normal random vector whose parameters can easily be computed using linear algebra: if X ∼ N(µ, Σ) and Y = A X + b, then Y ∼ N(A µ + b, A Σ A^T).
This latter property, applied with b = 0_m and A = [I_m 0_{m,n−m}], allows one to derive the marginal distributions of a MVN:
Partitioning X = (X1, X2) with mean (µ1, µ2) and covariance blocks Σ11, Σ12, Σ22, one then has
X1 ∼ N(µ1, Σ11)        X2 ∼ N(µ2, Σ22)
Conditioning the distribution of X1 on some value x2 of X2 again gives a MVN:
X1 | X2 = x2 ∼ N( µ1 + Σ12 Σ22^{-1} (x2 − µ2), Σ11 − Σ12 Σ22^{-1} Σ12^T )
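A small numerical check of these properties (a sketch, not part of the original text) using NumPy; the particular µ, Σ, A and b are arbitrary example values.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 1.0])
Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])

# Linear transform: Y = A X + b is normal with mean A mu + b and covariance A Sigma A^T.
A = np.array([[2.0, -1.0], [0.5, 1.0]])
b = np.array([0.5, -1.0])
X = rng.multivariate_normal(mu, Sigma, size=100_000)
Y = X @ A.T + b
print(Y.mean(axis=0), A @ mu + b)         # empirical vs theoretical mean
print(np.cov(Y.T), A @ Sigma @ A.T)       # empirical vs theoretical covariance

# Conditioning X1 on X2 = x2 (scalar blocks here).
x2 = 2.0
cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2 - mu[1])
cond_var = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]
print(cond_mean, cond_var)
```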
Property 24.3. Given n i.i.d. samples Xi ∼ N(µ, Σ), the MLE is given by:
µ̂_MLE = (1/n) Σ_i x_i
Σ̂_MLE = (1/n) Σ_i (x_i − µ̂_MLE)(x_i − µ̂_MLE)^T
Details of the proof are skipped. It consists in computing the gradient of the log-likelihood with respect to µ and Σ, then setting it to zero.
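The estimators above are easy to compute in practice; here is a minimal NumPy sketch (not from the original text) with arbitrary true parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = np.array([1.0, 1.0])
Sigma_true = np.array([[3.0, 1.0],
                       [1.0, 2.0]])
x = rng.multivariate_normal(mu_true, Sigma_true, size=5000)  # rows are the samples x_i

n = len(x)
mu_mle = x.mean(axis=0)                    # (1/n) sum_i x_i
centered = x - mu_mle
Sigma_mle = centered.T @ centered / n      # (1/n) sum_i (x_i - mu)(x_i - mu)^T (biased MLE)

print(mu_mle)
print(Sigma_mle)
```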
In a Bayesian context, it might be useful to know a conjugate prior for MVNs in order to make fast and exact inference. In reality, an MVN likelihood X ∼ N(µ, Σ) accepts several conjugate priors of various levels of complexity, each one being adapted to specific hypotheses or restrictions. In the most general case, when the mean vector µ and the covariance matrix Σ are unknown random variables, a possible prior is a normal-inverse-Wishart distribution:
After n observations (Xi)_{1≤i≤n}, the prior's hyperparameters (µ0, κ0, Σ0, ν0) are updated as follows:
X̄ ← (1/n) Σ_{i=1}^n Xi
µ1 ← κ0/(κ0 + n) µ0 + n/(κ0 + n) X̄
κ1 ← κ0 + n
Σ̄ ← (1/n) Σ_{i=1}^n (Xi − X̄)(Xi − X̄)^T
Σ1 ← Σ0 + n Σ̄ + (κ0 n)/(κ0 + n) (µ0 − X̄)(µ0 − X̄)^T
ν1 ← ν0 + n
However in many models, one can assume Σ is known, equal to a constant matrix Σc . In this
case, a simpler conjugate prior for µ is a MVN:
µ ∼ N (µ0 , Σ0 ) and Σ = Σc
After n observations (Xi)_{1≤i≤n}, the prior's hyperparameters (µ0, Σ0) are updated as follows:
X̄ ← (1/n) Σ_{i=1}^n Xi
Σ1 ← ( Σ0^{-1} + n Σc^{-1} )^{-1}
µ1 ← Σ1 ( Σ0^{-1} µ0 + n Σc^{-1} X̄ )
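These updates translate directly into code. Below is a hedged NumPy sketch (not part of the original text); the prior and the true parameters are arbitrary example values.

```python
import numpy as np

def posterior_mean_known_cov(X, mu0, Sigma0, Sigma_c):
    """Posterior N(mu1, Sigma1) of the mean mu of a MVN with known covariance Sigma_c,
    given a prior mu ~ N(mu0, Sigma0) and observations X (one sample per row)."""
    n = len(X)
    xbar = X.mean(axis=0)
    Sigma0_inv = np.linalg.inv(Sigma0)
    Sigma_c_inv = np.linalg.inv(Sigma_c)
    Sigma1 = np.linalg.inv(Sigma0_inv + n * Sigma_c_inv)
    mu1 = Sigma1 @ (Sigma0_inv @ mu0 + n * Sigma_c_inv @ xbar)
    return mu1, Sigma1

rng = np.random.default_rng(0)
Sigma_c = np.array([[3.0, 1.0], [1.0, 2.0]])
X = rng.multivariate_normal([1.0, 1.0], Sigma_c, size=50)
mu1, Sigma1 = posterior_mean_known_cov(X, mu0=np.zeros(2), Sigma0=10.0 * np.eye(2), Sigma_c=Sigma_c)
print(mu1)      # close to the true mean [1, 1]
print(Sigma1)   # the posterior covariance shrinks as n grows
```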
This last result can be extended to the case where X ∼ N (µ, Σ) is not directly observed but
only a linear projection Y = A × X + b of it. Then it is still possible to update parameters µ0
and Σ0 of the normal distribution of µ given observations of Y based on the following property:
This last property is central in any Bayesian problem where observations linearly depend on model
parameters, such as in Bayesian Linear Regression (see section 24.3.3) or Kalman filters (see
section 26.4.2).
Assuming a problem with k classes numbered from 1 to k, the CPT of Y is defined, as in Naive Bayes, by a categorical distribution of k parameters (πy)_{1≤y≤k} = (PY(y))_{1≤y≤k} that can be estimated using the MLE estimator (or a MAP estimator if required):
P̂Y(y) = N(Y = y) / N
The difference between GDA methods lies in the specification of the distribution of X conditionally on Y. The main variants are presented here from the most expressive method (QDA) to the least expressive ones (LDA and diagonal LDA).
Estimating the parameters of class k, i.e. its mean vector µk and covariance matrix Σk, is straightforward, using for instance the MLE estimators presented in section 24.1.1.
The method is qualified as quadratic according to the following property.
Proof. Points on a decision boundary separating two classes, let's say Y = 1 and Y = 2, have equal densities. By replacing the density by its expression, one has:
P(Y = 1 | X = x, Θ) = P(Y = 2 | X = x, Θ)
⇔ P_{X|Y=1,Θ1}(x) PY(1) = P_{X|Y=2,Θ2}(x) PY(2)
⇔ π1/√det(Σ1) exp( −½ (x − µ1)^T Σ1^{-1} (x − µ1) ) = π2/√det(Σ2) exp( −½ (x − µ2)^T Σ2^{-1} (x − µ2) )
⇔ cst + (x − µ1)^T Σ1^{-1} (x − µ1) = cst + (x − µ2)^T Σ2^{-1} (x − µ2)
⇔ (x − a)^T B^{-1} (x − a) + c = 0 for some a ∈ R^d, B ∈ R^{d×d}, c ∈ R
This latter equation is a level set of a quadratic form and thus it describes a quadratic hypersurface.
Figure 24.2 shows examples of 2D class boundaries using QDA. Observe 1) how the boundaries are curved quadratic lines, 2) how the MVN of every class has its own mean and covariance matrix (whose eigenvectors are represented by the main axes of the ellipses).
Figure 24.2: Class boundaries of QDA applied to data projected onto their two largest principal components. (a) Iris dataset; (b) Gene dataset (SRBCT).
The total number of free parameters to describe the distribution of X and Y is k − 1 + k (m + m(m+1)/2) = ((m+3)/2 m + 1) k − 1, to be compared with the (2m + 1) k − 1 parameters required by Naive Bayes. QDA's model complexity is thus Θ(k m²), to be compared with Naive Bayes's complexity Θ(k m). QDA is thus prone to overfitting when the number m of dimensions gets large compared to the number n of available samples.
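The comparison illustrated by Figure 24.2 and Table 24.1 can be reproduced in spirit with scikit-learn; the following sketch (not from the original text, and not necessarily how the original figures were produced) fits QDA, LDA and Gaussian Naive Bayes on the Iris dataset projected onto its two largest principal components and reports training accuracy only.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X2 = PCA(n_components=2).fit_transform(X)   # projection on the two largest principal components

for name, clf in [("QDA", QuadraticDiscriminantAnalysis()),
                  ("LDA", LinearDiscriminantAnalysis()),
                  ("Naive Bayes", GaussianNB())]:
    clf.fit(X2, y)
    print(name, clf.score(X2, y))            # training accuracy, for illustration only
```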
This latter expression is the equation of a hyperplane orthogonal to the line passing through the points µ1 and µ2 for the Mahalanobis distance relative to Σ.
Figure 24.3 shows examples of 2D class boundaries using LDA, to be compared with those of Fig. 24.2 obtained with QDA. Observe now 1) how the boundaries are straight lines, 2) how the MVN of every class has its own mean but shares the same covariance matrix (whose eigenvectors are represented by the main axes of the ellipses).
Figure 24.3: Class boundaries of LDA applied to data projected onto their two largest principal components.
The model complexity of diagonal LDA is thus Θ(k m). Indeed it is even sparser, i.e. less complex, than Naive Bayes since its exact number of parameters is k − 1 + k m + m.
NaiveBayes
ayes Diagonal
NaiveBayes
LDA
(a) NB on Iris dataset Diagonal LDA
(b) NB on Gene dataset (SRBCT)
Figure 24.4: Class boundaries of Naive Bayes (NB) and Diagonal LDA (D-LDA) applied to data
projected onto their two largest principal components.
Table 24.1: Accuracy and logarithmic loss of the four considered methods for an easy dataset (Iris)
and a difficult dataset (SRBCT). Best scores appear written in bold characters.
A normal linear regression model assumes a real output Y and m real input features X = (x_j)_{1≤j≤m} such that:
P_{Y | X, σ², W} ∼ N( W^T X, σ² )
The output variance σ² ∈ R+ and the coefficient vector W ∈ R^m are the model parameters.
Note that this model is only discriminative, not generative: nothing is said about the distribution of the input features X.
The Ordinary Least Squares (OLS) estimator Ŵ_OLS for W minimizes the empirical risk for the quadratic loss, also called mean square error (MSE):
R(W) = MSE(W) = (1/n) Σ_{i=1}^n ( y_i − W^T x_i )²
Its closed-form expression is
Ŵ_OLS = ( X^T X )^{-1} X^T y
where the rectangular n × m matrix X, called the design matrix, is such that its i-th line contains the i-th input sample x_i. ( X^T X )^{-1} X^T is called the Moore-Penrose inverse, or pseudoinverse, of X. It is defined as soon as X has rank equal to m, i.e. as soon as at least m samples x_i are linearly independent among the n available (this is in general the case as soon as n ≫ m).
Proof. The empirical risk is a convex function of W. It has a unique minimum that can be computed by differentiating the empirical risk R(W) = (1/n) Σ_{i=1}^n ( y_i − W^T x_i )² with respect to W. Setting this gradient to zero and solving the resulting equation leads to the expression of Ŵ_OLS.
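The closed-form solution is a one-liner in NumPy; the following sketch (not from the original text, with arbitrary synthetic data) illustrates it.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 3
X = rng.normal(size=(n, m))                       # design matrix, one sample per line
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.3, size=n)    # noisy linear outputs

# Closed-form OLS: W = (X^T X)^{-1} X^T y (requires rank(X) = m)
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
# Numerically safer alternative: w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_ols)
```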
Property 24.9. The MLE estimator of a normal linear regression model is identical to the OLS estimator:
Ŵ_MLE = Ŵ_OLS
Proof.
L_Z(W, σ²) = Π_{i=1}^n P( Yi = yi | Xi = xi, W, σ² )
           = Π_{i=1}^n 1/√(2π σ²) exp( −(yi − W^T xi)² / (2σ²) )
           = 1/(2π σ²)^{n/2} exp( −(1/(2σ²)) Σ_{i=1}^n (yi − W^T xi)² )
           = 1/(2π σ²)^{n/2} exp( −(n/(2σ²)) MSE(W) )
Since the likelihood is maximized when the mean square error is minimized, ŴM LE = ŴOLS .
The posterior distribution can then be derived as a MVN with respect to W, simply by applying property 24.6 with A = X and b = 0.
P( W, σ² | (X, y), µ0, Σ0, σy² ) ∝ L_Z(W, σ²) P( W, σ² | µ0, Σ0, σy² )
∝ N( y | X W, σ² I_n ) N( W | µ0, Σ0 ) δ( σ² | σy² )
∝ N( W | µ1, Σ1 ) δ( σ² | σy² )
with  Σ1 = ( Σ0^{-1} + (1/σy²) X^T X )^{-1}
      µ1 = Σ1 ( Σ0^{-1} µ0 + (1/σy²) X^T y )
The term σy² Σ0^{-1} µ0 appearing in the MAP estimator is the influence of the prior knowledge µ0 and of the confidence in it (represented by Σ0^{-1}). This additional term acts as a regularizer. The particular case µ0 = 0 and Σ0^{-1} = (λ/σy²) I_m corresponds to the L2-regularized version of OLS, better known as Ridge regression:
µ̂_Ridge = ( X^T X + λ I_m )^{-1} X^T y
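The correspondence between the Bayesian posterior mean and Ridge regression can be checked numerically. The following sketch (not from the original text, with arbitrary synthetic data and hyperparameters) computes both and shows they coincide.

```python
import numpy as np

def posterior_linear_regression(X, y, mu0, Sigma0, sigma_y2):
    """Posterior N(mu1, Sigma1) over W for y = X W + noise of variance sigma_y2,
    with prior W ~ N(mu0, Sigma0)."""
    Sigma0_inv = np.linalg.inv(Sigma0)
    Sigma1 = np.linalg.inv(Sigma0_inv + X.T @ X / sigma_y2)
    mu1 = Sigma1 @ (Sigma0_inv @ mu0 + X.T @ y / sigma_y2)
    return mu1, Sigma1

rng = np.random.default_rng(0)
n, m = 100, 3
X = rng.normal(size=(n, m))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=n)

lam, sigma_y2 = 1.0, 0.3 ** 2
# Prior mu0 = 0 and Sigma0^{-1} = (lambda / sigma_y^2) I  <=>  MAP = Ridge with parameter lambda
mu1, _ = posterior_linear_regression(X, y, np.zeros(m), (sigma_y2 / lam) * np.eye(m), sigma_y2)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)
print(mu1)       # MAP / posterior mean
print(w_ridge)   # identical to the Ridge solution
```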
Chapter 25
Models with latent variables
Given a model of parameters Θ with visible variables V = (V1, ..., Vp) and latent variables H = (H1, ..., Hq), the first step is to infer the marginal likelihood P_{V1,...,Vp | Θ} from the observations (v1, ..., vp), before applying some Bayes estimator, for instance MLE (to take the simplest one).
However since the model describes the full joint distribution PV1 ,...,Vp ,H1 ,...,Hq ,Θ , one needs to
compute:
P(V1 = v1, ..., Vp = vp, Θ = θ) = Σ_{h1,...,hq} P(V1 = v1, ..., Vp = vp, H1 = h1, ..., Hq = hq, Θ = θ)
Given a dataset Z of N samples (v1^(i), ..., vp^(i))_{1≤i≤N}, the marginal likelihood is:
L_Z(θ) = Π_i Σ_{h1,...,hq} P(V1 = v1^(i), ..., Vp = vp^(i), H1 = h1, ..., Hq = hq, Θ = θ)
where the joint distribution P(V1 = v1^(i), ..., Vp = vp^(i), H1 = h1, ..., Hq = hq, Θ = θ) of the model
is itself decomposed into factors according to the underlying graphical model. Maximizing a product of sums of products with a high number of terms and factors is an intractable task, even if, in principle, it is always possible to find a local optimum θ* using optimization methods like gradient ascent.
Example:
In the very simple wrestling example, the joint distribution (omitting parameters for concision) of the Bayesian network of Fig. 23.2a is
P_{G,H,M,S,D,W} = P_G P_{H|G} P_{M|G,H} P_{S|G,M} P_{D|G,S} P_{W|G,H,M,S}
Given a single sample (g, h, m, d, w), applying the MLE amounts to maximizing the marginalized joint distribution without the latent variable S:
P_{G,H,M,D,W}(g, h, m, d, w) = Σ_s P_G(g) × P_{H|G=g}(h) × P_{M|G=g,H=h}(m) × P_{S|G=g,M=m}(s) × P_{D|G=g,S=s}(d) × P_{W|G=g,H=h,M=m,S=s}(w)
= P_G(g) × P_{H|G=g}(h) × P_{M|G=g,H=h}(m) × Σ_s P_{S|G=g,M=m}(s) × P_{D|G=g,S=s}(d) × P_{W|G=g,H=h,M=m,S=s}(w)
For a dataset Z of N samples, the marginal log-likelihood can be decomposed into a sum of four terms:
log L_Z(θ) = Σ_{i=1}^N log P_G(g^(i)) + Σ_{i=1}^N log P_{H|G=g^(i)}(h^(i)) + Σ_{i=1}^N log P_{M|G=g^(i),H=h^(i)}(m^(i))
             + Σ_{i=1}^N log Σ_s P_{S|G=g^(i),M=m^(i)}(s) × P_{D|G=g^(i),S=s}(d^(i)) × P_{W|G=g^(i),H=h^(i),M=m^(i),S=s}(w^(i))
Because the parameters of the conditional distribution tables (CDTs) of G, H and M are isolated
in separate terms, the inference of their values is an easy task: it suffices to count occurrences of
relevant events in the data set. However learning the remaining CDTs of S, D and W is much
harder since their parameters are all interdependent within the summation over s.
25.2 EM Algorithm
Given some model of likelihood P V,H | Θ with parameters Θ, visible variables V = (V1 , . . . , Vp )
and latent variables H = (H1 , . . . , Hq ), and given a dataset Z = {v (i) }1≤i≤N of i.i.d. samples, the
problem is to find the global maximum θ? of marginal likelihood P V | Θ where hidden variables H
have been marginalized out:
θ* = argmax_θ log L_Z(θ)
   = argmax_θ Π_i P_{V | Θ=θ}(v^(i))
   = argmax_θ Π_i ( Σ_h P_{V,H | Θ=θ}(v^(i), h) )
This optimization problem is difficult as model parameters θ and latent variable distributions si-
multaneously occur in a product of sums (see section 25.1). Expectation maximization, abbreviated
EM, is a generic heuristic algorithm that solves this problem. Unfortunately, like any standard
numerical optimisation method, EM is heuristic: it only finds a local maximum, i.e. there is no
guarantee that the found local optimum is also a global one. But compared to standard numerical
optimisation methods, EM is simpler and faster.
Even if EM is heuristic, it results from a theoretical construction: indeed EM relies on a lower
bound of the marginal log likelihood log LZ (θ). Let’s assume one knows for each sample V (i) ,
a distribution qi that approximates somehow the true distribution P H (i) | V (i) =v(i) ,θ of hidden
variable H (i) . Using some results from information theory (i.e. using the fact the Kullback-Leibler
divergence between qi and P H (i) | V (i) =v(i) ,θ is always non-negative), every distribution qi verifies:
log L_Z(θ) ≥ − Σ_i ⟨log qi(H)⟩_{H∼qi} + Σ_i ⟨log P(H, V = v^(i) | Θ = θ)⟩_{H∼qi}
The right-hand side of the equation can be viewed as a lower bound log L̃Z,q1 ,...,qN (θ) of the
marginal log likelihood log LZ (θ) made of two terms:
• The energy E(q1, ..., qN, θ) = Σ_i ⟨log P(H, V = v^(i) | Θ = θ)⟩_{H∼qi} is the full log-likelihood averaged over the distributions of the hidden variables H^(i). This term is always non-positive.
• The entropy H(q1, ..., qN) = − Σ_i ⟨log qi(H)⟩_{H∼qi} quantifies the uncertainty on H^(i). This corrective term is always non-negative and does not depend on the parameters θ.
Instead of directly finding a local maximum for the marginal likelihood LZ (θ), EM seeks a local
maximum of the lower bound L̃Z,(qi )i (θ). This choice is justified by two properties: first, as one
will see shortly, finding a local maximum of the lower bound is much easier (at least, not more
difficult than problems based on models without hidden variables). Second a local maximum of
L̃Z,(qi )i (θ) is also a local maximum of the target function LZ (θ).
EM uses an iterative approach to estimate the distribution of both latent variables and the
values of model parameters. Starting from some guess θ? of parameters, it goes alternatively into
an expectation step and a maximization step until parameters converge:
• The Expectation step (E step of EM) keeps θ* constant and optimizes log L̃_{Z,(qi)i}(θ*) as a function of the distributions (qi)i only. Solving this problem is straightforward: if one sets every qi to the distribution P_{H^(i) | V^(i)=v^(i), θ=θ*} in equation (25.12), one finds that L̃_{Z,(qi)i}(θ*) is equal to the marginal likelihood L_Z(θ*), which is also an upper bound according to equation (25.12). Because L_Z(θ*) is a constant relatively to (qi)i, it is also the maximum that can be reached by variations of (q1, ..., qN). Therefore the solution of the E step always consists in setting the distribution of every hidden variable H^(i) to its expected distribution given the current parameter value θ* and the visible values v^(i):
qi ← P_{H^(i) | V^(i)=v^(i), Θ=θ*}
• The Maximization step (M step of EM) is the dual of the estimation step: it keeps the distributions (qi)i constant and optimizes L̃_{Z,(qi)i}(θ) as a function of the parameters θ only. Because the entropy does not depend on the parameters θ, it is sufficient to maximize the energy:
θ* = argmax_θ E(q1, ..., qN, θ)
This problem is very similar to a standard MLE problem for a model without latent variables and can be solved using standard numerical optimization methods. The only difference is that the energy function is more complex than a simple likelihood function, as the likelihood is averaged over the distributions qi. Some concrete examples will be given in the next sections.
Algorithm 22 EM algorithm.
1: // Initialization step
2: Set current parameter models θ* to some initial guess θ0
3: repeat
4:   // E-step
5:   for every sample v^(i) do
6:     qi ← P_{H | V=v^(i), Θ=θ*}
7:   end for
8:   // M-step
9:   θ*_old ← θ*
10:  θ* ← argmax_θ Σ_i ⟨log P(H, V = v^(i) | Θ = θ)⟩_{H∼qi}
11: until |θ* − θ*_old| < ε
12: return θ*
The E and M steps alternate until the parameters converge. EM is summarized by algorithm 22.
The guarantee offered by EM might seem weak as it does not maximize the marginal likelihood L_Z(θ) but only a lower bound of it. However, it can be shown that the iterations of EM never decrease the marginal likelihood (i.e. L_Z(θ*_new) ≥ L_Z(θ*_old)). In other words, each EM iteration monotonically improves the marginal likelihood.
25.3 Bayesian clustering
A mixture model assumes a discrete latent cluster variable H taking values between 1 and C, with P(H = c) = pc, and cluster-specific distributions
P_{V | H=c} = P_{V | θc}
This joint distribution can be viewed as a weighted average of the distributions P_{V|θc}. This model is called a mixture as it consists in mixing samples, first by drawing a value h for H between 1 and C according to the weights of θH, and then by drawing a value for V according to P_{V|θh}. As shown on the Bayesian network of figure 25.2, the parameters of a mixture model are thus the collection of vectors θ_{V|H} = (θ1, ..., θC) and the vector θH = [p1, ..., pC].
Such a model can be solved (approximatively) using the generic EM algorithm. Of course the
full solution depends on the exact nature of cluster distributions P V | θc . For instance the next
section 25.3.2 will develop an example where cluster distributions are normal. However it is already
possible to infer the mixture coefficients in θH without specifying the cluster distributions. To see
this let’s consider some dataset Z = (v (1) , . . . , v (N ) ).
Estimation step
As seen in the previous section, the estimation step is generic: it consists in updating the currently
estimated distribution qi of H (i) :
∀c ∈ {1, ..., C},  qi(c) = P_{H^(i) | V^(i)=v^(i), θ}(c)
                         ∝ P_{V^(i) | H^(i)=c, θ}(v^(i)) × P_{H^(i) | θ}(c)
                         ∝ P_{V^(i) | θc}(v^(i)) × pc
Maximization step
The M-step maximizes the energy relatively to model parameters ΘH ∪ {Θ1 , . . . , ΘC }. In case of
a mixture model, the energy is:
E(ΘH, Θ1, ..., ΘC) = Σ_{i=1}^N ⟨log P(V = v^(i), H | θ)⟩_{H∼qi}
= Σ_{i=1}^N ⟨log P(V = v^(i) | H, θ_{V|H})⟩_{H∼qi} + Σ_{i=1}^N ⟨log P(H | θH)⟩_{H∼qi}
= Σ_{i=1}^N ( Σ_{c=1}^C log P_{V|θc}(v^(i)) qi(c) ) + Σ_{i=1}^N ( Σ_{c=1}^C log(pc) qi(c) )
Maximizing the energy relatively to the mixture coefficients (i.e. the components of θH) only requires taking into account the second term. Moreover, this is a constrained optimization problem since the variables p1 to pC must satisfy the constraint Σ_c pc = 1. One thus derives the following Lagrangian:
L = Σ_{i=1}^N Σ_{c=1}^C log(pc) qi(c) − λ ( Σ_c pc − 1 )
  = Σ_{c=1}^C ( Σ_{i=1}^N qi(c) ) log(pc) − λ ( Σ_c pc − 1 )
Finally the result is very intuitive since the mixture coefficient of a cluster is just the average
probability for samples to be generated by it:
pc = (1/N) Σ_{i=1}^N qi(c)
Little can be said about the maximization relatively to cluster parameters θ1 , . . . θC since it depends
on the nature of cluster distributions. However it is interesting to see that only the first term of
the energy depends on θ1 , . . . θC , and that all these parameters can be solved independently of each
other since:
∂E/∂θc = 0  ⇔  ∂/∂θc ( Σ_{i=1}^N Σ_{c'=1}^C log P_{V|θc'}(v^(i)) qi(c') ) = 0
            ⇔  ∂/∂θc Σ_{i=1}^N qi(c) log P_{V|θc}(v^(i)) = 0
            ⇔  Σ_{i=1}^N qi(c) ∂/∂θc log P_{V|θc}(v^(i)) = 0
A Gaussian Mixture Model or GMM is defined by a mixture of C normal distributions N(V⃗c, Γc). The parameters of a GMM are:
• For each cluster c from 1 to C, the cluster parameters θc are the expected vector V⃗c and the covariance matrix Γc of the normal distribution.
Pseudocode 23 already describes the general resolution of a mixture model when applying the
EM algorithm. The only remaining task to specify is how to find the best parameters θ? c during
the maximization step. Recalling equation (25.31), one has:
∂E/∂V⃗c = Σ_{i=1}^N qi(c) ∂/∂V⃗c log P_{V|θc}(v^(i))
        = Σ_{i=1}^N qi(c) ∂/∂V⃗c log( 1/√(2π det(Γc)) exp( −½ (v^(i) − V⃗c)^T Γc^{-1} (v^(i) − V⃗c) ) )
        = Σ_{i=1}^N qi(c) ∂/∂V⃗c ( −½ log(2π det(Γc)) − ½ (v^(i) − V⃗c)^T Γc^{-1} (v^(i) − V⃗c) )
        = Σ_{i=1}^N qi(c) Γc^{-1} (v^(i) − V⃗c)
        = Γc^{-1} ( Σ_{i=1}^N qi(c) (v^(i) − V⃗c) )
Setting this gradient to zero yields V⃗c = ( Σ_{i=1}^N qi(c) v^(i) ) / ( Σ_{i=1}^N qi(c) ), i.e. the mean of the samples weighted by their responsibilities qi(c).
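To make the whole derivation concrete, here is a hedged NumPy/SciPy sketch of EM for a GMM (not the text's pseudocode 23; a minimal re-implementation with arbitrary synthetic data and a small regularization term added to the covariances for numerical stability).

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, C, n_iter=100, seed=0):
    """Minimal EM for a Gaussian mixture with C clusters (full covariances)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    p = np.full(C, 1.0 / C)                                   # mixture coefficients p_c
    mu = X[rng.choice(n, C, replace=False)]                   # cluster means V_c
    Gamma = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * C)    # cluster covariances
    for _ in range(n_iter):
        # E step: responsibilities q_i(c) proportional to p_c * N(v_i | mu_c, Gamma_c)
        q = np.stack([p[c] * multivariate_normal.pdf(X, mu[c], Gamma[c]) for c in range(C)], axis=1)
        q /= q.sum(axis=1, keepdims=True)
        # M step: responsibility-weighted means, covariances and mixture coefficients
        Nc = q.sum(axis=0)
        p = Nc / n
        mu = (q.T @ X) / Nc[:, None]
        for c in range(C):
            Xc = X - mu[c]
            Gamma[c] = (q[:, c, None] * Xc).T @ Xc / Nc[c] + 1e-6 * np.eye(d)
    return p, mu, Gamma

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.5, size=(200, 2)),
               rng.normal([3, 3], 0.7, size=(200, 2))])
p, mu, Gamma = em_gmm(X, C=2)
print(p)
print(mu)
```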
Chapter 26
Markov Models
26.1 Introduction
Dynamic system state and stochastic process
Many practical problems consist in determining the dynamic state of some real system. More
precisely the goal is to track the distribution P Xt | O of the system state Xt at time t given
some observations O carrying some information about the state. Depending on the nature of the
problem, the tracking has to be made in real-time with online observations or it can be processed
offline, in batch mode. Bayesian filtering and Bayesian smoothing respectively address the first
and second class of problems:
• Bayesian filtering estimates the distribution P Xt | O online, i.e. given only past and present
observations O relatively to current time t. Estimating in real time the position and speed
of a vehicle is a Bayesian filtering problem.
• Bayesian smoothing estimates the distribution P Xt | O offline, i.e. given past, present and
posterior observations O relatively to time t. Recognizing words in a recorded sound signal
is a Bayesian smoothing problem. Obviously Bayesian smoothing brings better results than
Bayesian filtering as more observations are taken into account.
In both cases, the state of the system must be modelled by some distribution since the state is never perfectly known:
• First, the dynamic behaviour of the system is only approximately known and can be influenced by some unobserved forces (latent variables).
• Second, the observations can be noisy and/or only partially informative (i.e. the system state cannot be fully reconstructed from the observations).
Because dynamic systems require maintaining a full distribution over the current state, methods that deal with dynamic systems are fundamentally Bayesian. Maintaining a distribution on the system state not only tells us the most probable value for the current state but also provides the amount of confidence one should place in our state estimate: a spread distribution (i.e. with a large entropy) will mean a large uncertainty on the state value, whereas a concentrated distribution (i.e. with low entropy) will mean a precise knowledge of the current state value. Because the
system state is dynamic and evolves with time, the uncertainty does not necessarily decrease with
time. Indeed when some new observation is received, the uncertainty on the system state usually
decreases, but when no observation has been received for a long time, the uncertainty on the system
state increases. Positioning systems are a good application example: a GPS receiver provides the
current position of a vehicle with a good level of precision. However if the GPS signal drops,
the uncertainty of the position will increase as the vehicle is moving. In some other problems,
the dynamic behaviour of a system or signal does not increase uncertainty, but on the contrary,
helps to remove noise. Indeed the high level of dependency between successive states of a system
brings some additional information that can be exploited to remove noise in order to improve some
classification problems. A typical example is speech recognition where speech time slices are highly
correlated.
Dynamic systems take many forms. However, all of them have one thing in common: dynamic systems imply modelling a sequence of random state variables (Xt) indexed by some time variable t, such that a sample builds a trajectory (xt). This is formalized by the notion of stochastic process.
A stochastic process is a sequence of random variables (Xt)_{t∈T} for some probability space (Ω, E, P).
A stochastic process can have a discrete or continuous time space, and a discrete, continuous or hybrid state space.
In the following, one will focus on discrete-time stochastic processes for two reasons: first, their discrete nature naturally matches computational models. Second, a continuous-time process can generally be approximated by a discrete-time process. In order to simplify the notation, the time space will be assumed to be the set N, as any indexing system can always be remapped to N.
Markov Models
For every stochastic process (Xt )t∈N and for every N ∈ N, the joint distribution of (X0 , . . . , XN )
can always be written as:
PX0 ,...XN = PX0 × P X1 | X0 × · · · × P XN | X0 ,...,XN −1
This is true for any joint distribution and this can be represented by Bayesian Network of Fig. 26.1.
As already stated, modelling the full joint distribution when the number of variables gets large
is not tractable. Fortunately in many problems, if information contained in every variable Xt
is sufficient (i.e. if the state space is rich enough), the prediction of the next state Xt+1 only
depends on the current state Xt , not on the past states Xt0 with t0 < t. Such memoryless property
characterizes Markov models:
A Markov model (of order 1) is a stochastic process (Xt)_{t∈N} that verifies the Markov property: given the current state Xt at any time t, the knowledge of any past state Xt′ with t′ < t does not help to better predict any future state Xt″ with t″ > t.
The joint distribution is then considerably simplified as shown by Bayesian networks of figures 26.2a
and 26.3a.
Stationarity
In addition to satisfying the Markov property, the studied models will usually be assumed to be
stationary, that is, not to evolve with time. Their parameters Θ are thus assumed to be constant.
∀t, P Xt+1 | Xt = P X1 | X0
Given a stationary Markov model, the joint distribution of (Xt )t∈N only depends on the distri-
bution PX0 of the initial state X0 and on the distribution of state transitions P X1 | X0 .
P_{X0,...,XN}(x0, ..., xN) = P_{X0}(x0) × Π_{i=1}^N P_{X1|X0=x_{i−1}}(x_i)
Learning a stationary Markov model thus consists in inferring these two distributions P_{X1|X0,O} and P_{X0|O} from observations.
Observability
A last important notion is the nature of available observations. If the state of a stochastic process
(Xt )t∈N is (fully) observable the observations directly give the state values. In this case the MLE
principle allows to learn the two distributions P X1 | X0 ,O and P X0 | O by counting the events
in the observation (for discrete variables) or by inferring distribution parameters (for continuous
variables).
If the stochastic process is only partially observable – as it is often the case in practice – the
observation Yt at time t is only weakly connected to the past and current states (Xt0 )0≤t0 ≤t . The
problem is more complex to solve as the state variables Xt are latent variables and their value must
be estimated using approximated algorithms like EM. A Markov model that is partially observable
is called a Hidden Markov Model and can be represented by Bayesian network of Figure 26.4. Since
for some applications, observations Yt can randomly occur in time, they are also called emissions.
So far one assumed the time space is discrete1 but what about the state space? Some systems
have a finite number of possible states (i.e. state of an automaton), some other have a continuous
1 This choice is mostly for simplicity and is not really a restriction. Most discrete-time models that are going to
state (i.e. position and speed of a vehicle). Both cases are studied respectively in this order. Let’s
first consider a dynamic system that can be described by a stationary Markov model whose state
variable Xt can only take a finite number of states numbered from 1 to n. The two next sections
respectively consider the case of fully observable and partially observable models.
A homogeneous Markov chain can be represented by an oriented weighted graph whose vertices represent the state values and whose arcs x → x′ represent a transition from state x to state x′ with non-zero probability. Every arc x → x′ is weighted by the probability P(Xt+1 = x′ | Xt = x). Such a graph is as informative as a transition matrix, as illustrated by figure 26.5, so that both formalisms are equivalent.
Figure 26.5: Equivalent representation of a Markov chain: graph (a) and transition matrix (b).
26.2.2 Properties
Given a Markov chain of parameters P_{X0} and {Pt | 0 ≤ t < T}, the distribution of Xt is given by
P_{Xt} = ( Π_{t′=0}^{t−1} P_{t′} )^T × P_{X0}
Sampling a Markov chain consists in generating a trajectory of states (x0, x1, ...) such that x0 is drawn from P_{X0} and every transition xt → xt+1 is drawn according to Pt. Markov chains can thus be viewed as a model of random walk on a finite state space. One major question raised by many applications is to know whether the distribution of Xt converges to some limit P∞ when t → +∞. A second important question is to know whether this limit P∞ is unique and does not depend on the initial state distribution P_{X0}. In such a case P∞ is called the equilibrium distribution. If one considers the homogeneous case, clearly the probability vector P∞ must be a fixed point of the transition, i.e. an eigenvector of P^T for the eigenvalue 1:
P∞ = P^T × P∞
Such a distribution P∞ is said to be stationary². Every Markov chain has at least one stationary distribution, as stated by the following theorem:
Theorem 26.1. The largest absolute eigenvalue of a stochastic matrix is 1. The eigenvectors for
the eigenvalue 1 have coefficients of the same sign. As a consequence every Markov chain admits
at least one stationary distribution.
However this last result is not a sufficient condition for a Markov chain to converge towards an equilibrium distribution. To do so, one needs to introduce the notions of reducibility and periodicity. Given two distinct state values s1 and s2, s2 is said to be accessible from s1 if the probability to reach state s2 in a finite number of steps starting from state s1 is non-zero, or said differently, if there is at least one (oriented) path connecting s1 to s2 in the Markov chain's transition graph. s1 and s2 are said to be communicating if s1 is accessible from s2 and conversely. Communication defines an equivalence relation whose equivalence classes are called communication classes. A Markov chain is said to be irreducible if every pair of vertices communicate (i.e. there is only one communication class), or equivalently, if the transition graph is strongly connected³. For instance, given the Markov chain of figure 26.5, state s6 is accessible from any other state whereas s1 is not accessible from any other state. States s3, s4 and s5 are communicating. The communication classes are {{s1}, {s2}, {s3, s4, s5}, {s6}}. The graph is thus reducible.
2 Not to be confused with the stationarity of Markov processes.
Periodicity is another important characteristic of Markov chain trajectories: the period p(s) of a state s is the greatest common divisor of the possible times to return to state s:
p(s) = gcd {t | P_{Xt=s | X0=s} > 0} = gcd {t | (P^t)_{s,s} > 0}
A Markov chain is aperiodic if the period of every state is equal to 1. In the Markov chain of figure 26.5, the period of s1 and s6 is 1 whereas the period of s3, s4 and s5 is 3. s2 is not periodic.
The following theorem concludes on the initial question about Markov chain convergence:
Theorem 26.2. An irreducible and aperiodic Markov chain has an equilibrium distribution: it converges to a unique stationary distribution, regardless of the initial state distribution P_{X0}. In particular, this is the case for Markov chains whose transition matrix coefficients are all positive, that is, whose graph is complete⁴.
26.2.3 Learning
Learning a Markov chain on an interval of time [0, ..., T − 1] consists in inferring the first T transition matrices {Pt | 0 ≤ t < T} and the initial state distribution P_{X0} from a dataset Z = {z_k} where every data z_k is a sequence of T successive states (x_0^k, ..., x_{T−1}^k). Because the states are observable, learning a Markov chain using the MLE estimator is straightforward: determining the parameter P(Xt+1 = j | Xt = i) is done by counting:
P̂(Xt+1 = j | Xt = i) = N(x_{t+1} = j and x_t = i) / N(x_t = i)
Every parameter is learnt from N samples on average, where N is the number of data sequences. However it is likely that some transitions will occur rarely. In order to have a good confidence in the estimation of the n²·T + n model parameters, N must be very large.
Things get nicer if the Markov chain is homogeneous. In this case, the number of parameters to learn is only n² + n. Every transition probability can then be learnt from N × (T − 1) samples instead of only N, so that the dataset can be much smaller.
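The counting estimator above is trivial to implement; here is a short NumPy sketch for the homogeneous case (not from the original text, with made-up toy sequences).

```python
import numpy as np

def estimate_homogeneous_chain(sequences, n_states):
    """MLE of the initial distribution and transition matrix of a homogeneous Markov chain
    from fully observed state sequences (states are integers in {0, ..., n_states-1})."""
    P0 = np.zeros(n_states)
    counts = np.zeros((n_states, n_states))
    for seq in sequences:
        P0[seq[0]] += 1
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1
    P0 /= P0.sum()
    # Each line is normalized; rarely visited states would need smoothing in practice.
    P = counts / counts.sum(axis=1, keepdims=True)
    return P0, P

sequences = [[0, 1, 2, 2, 1], [0, 2, 2, 1, 0], [1, 2, 2, 2, 1]]
P0, P = estimate_homogeneous_chain(sequences, n_states=3)
print(P0)
print(P)
```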
26.2.4 Application
The notion of equilibrium distribution of Markov chains has been extensively used in many applications. It is the theoretical foundation of Markov Chain Monte Carlo algorithms (MCMC) like the Metropolis-Hastings algorithm introduced in chapter 28. It is also used to compute various importance indexes, the most popular of which is probably Google's PageRank index.
PageRank’s index measures the visibility of a website on the Web but similar indexes exist
for measuring the notoriety of people in social networks, the importance of scientific journals
or researchers in their community, etc. The intuition is that the more a website is visible, the
more likely a random websurfer will spend time on this website. This problem can be formalized
as sampling a Markov chain. Let’s index webpages from 1 to the number n of web-pages. The
3 A directed graph is strongly connected if for every couple (s1, s2) of vertices, there is at least one (oriented) path connecting s1 to s2.
4 A directed graph is complete if for every couple (s1, s2) of vertices, there is at least one arc from s1 to s2.
websurfer starts from a random webpage i and then randomly selects an outgoing link on this page, jumping this way to a new page. He does this a large number of times and records for each page the number N[i] of times he visited it. When the websurfer reaches a page without outgoing links, he jumps randomly to a new webpage. The resulting histogram N[i] defines a distribution whose coefficients are the PageRank indexes. In other words, the PageRank indexes are the equilibrium distribution of a random walk over a finite state space, i.e. over a homogeneous Markov chain.
This is summarized by algorithm 25. If J(i) denotes the set of webpages accessible from page i,
1 1j∈J(i) 1
if J(i) 6= ∅, Pi,j = α+ (1 − α) ≥ α > 0
n |J| n
1
otherwise Pi,j = >0
n
Consequently the algorithm is guaranteed to converge towards a unique set of PageRank indexes
that do not depend on the choice for the initial page.
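As an illustration (not the text's algorithm 25), a minimal power-iteration sketch in NumPy; the tiny link graph is hypothetical.

```python
import numpy as np

def pagerank(links, alpha=0.15, n_iter=100):
    """Power iteration towards the PageRank equilibrium distribution.
    links[i] is the set of pages reachable from page i (possibly empty)."""
    n = len(links)
    P = np.zeros((n, n))
    for i, J in enumerate(links):
        if J:
            P[i] = alpha / n
            for j in J:
                P[i, j] += (1 - alpha) / len(J)
        else:
            P[i] = 1.0 / n                 # dangling page: jump uniformly
    rank = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        rank = P.T @ rank                  # fixed point: P_infty = P^T P_infty
    return rank

links = [{1, 2}, {2}, {0}, set()]          # a tiny 4-page web graph (example)
print(pagerank(links))
```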
26.3 Hidden Markov Models
When the state of the process is not directly observable, one needs to consider partially observable Markov chains, or equivalently discrete hidden Markov models. In practice these discrete models are simply referred to as Hidden Markov Models.
A discrete Hidden Markov Model (HMM) is a hidden Markov model whose state
variables (Xt )t∈N are discrete. By supposing states range from 1 to n then such a
HMM is specified by:
• A Markov chain specification (i.e. initial state distribution PX0 and transition
distributions/matrices (Pt )t∈N )
• Emission distributions P Yt | Xt ,Θ that depend on parameters Θ.
– If the observations are discrete, with emission values ranging from 1 to m, the emission distributions can be represented by emission matrices {Qt}_{t∈N}. Every emission matrix is an n × m stochastic matrix whose i-th line represents the distribution P_{Yt | Xt=i}:
Qt = ( P(Yt = 1 | Xt = 1)  ···  P(Yt = m | Xt = 1) ; ... ; P(Yt = 1 | Xt = n)  ···  P(Yt = m | Xt = n) )
To solve this problem let's first introduce the so-called alpha coefficients:
αt : x ↦ P(Xt = x, Y0 = y0, ..., Yt = yt | θ)
Once the values of αt have been computed, it is straightforward to compute the state distribution by normalization:
P_{Xt | Y0=y0,...,Yt=yt, θ}(x) = αt(x) / Σ_{x′=1}^n αt(x′)
According to the graphical model of HMMs (see figure 26.4), the alpha coefficients can easily be computed by forward recursion on t:
αt(x) = P(Xt = x, Y0 = y0, ..., Yt = yt)
      = P(Yt = yt | Xt = x) × P(Xt = x, Y0 = y0, ..., Yt−1 = yt−1)
      = Qt^{x,yt} × Σ_{x′} P(Xt = x, Xt−1 = x′, Y0 = y0, ..., Yt−1 = yt−1)
      = Qt^{x,yt} × Σ_{x′} P(Xt = x | Xt−1 = x′) × P(Xt−1 = x′, Y0 = y0, ..., Yt−1 = yt−1)
      = Qt^{x,yt} × Σ_{x′} P_{t−1}^{x′,x} × αt−1(x′)
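The forward recursion is a few lines of NumPy; here is a hedged sketch for a homogeneous HMM (not from the original text; the transition and emission matrices are toy values, and each αt is rescaled to avoid numerical underflow, which does not change the filtered distributions).

```python
import numpy as np

def forward_filter(P0, P, Q, observations):
    """Alpha recursion (Bayesian filtering) for a homogeneous discrete HMM.
    P0: initial distribution (n,), P: transition matrix (n, n),
    Q: emission matrix (n, m), observations: sequence of emission indices."""
    alpha = P0 * Q[:, observations[0]]        # alpha_0(x) = P(X0=x) P(y0 | x)
    alpha /= alpha.sum()
    filtered = [alpha.copy()]
    for y in observations[1:]:
        alpha = Q[:, y] * (P.T @ alpha)       # alpha_t(x) = Q[x,y_t] sum_x' P[x',x] alpha_{t-1}(x')
        alpha /= alpha.sum()                  # rescaling; proportionality is preserved
        filtered.append(alpha.copy())
    return np.array(filtered)

P0 = np.array([0.6, 0.4])
P = np.array([[0.7, 0.3],
              [0.2, 0.8]])
Q = np.array([[0.9, 0.1],
              [0.3, 0.7]])
print(forward_filter(P0, P, Q, [0, 0, 1, 1]))
```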
In analogy with alpha coefficients for filtering, one introduces the beta coefficients as the likelihood
for state x at instant t for making the future observations:
βt : x 7→ P ( Yt+1 = yt+1 , . . . , YT = yT | Xt = x, θ )
Then the sought probability can be deduced by normalization on both alpha and beta coefficients:
P_{Xt | Y0=y0,...,YT=yT, θ}(x) = αt(x) βt(x) / Σ_{x′} αt(x′) βt(x′)
The remaining task is to compute the beta coefficients. Again this can be done by recursion, but in the backward direction (t starts from T and decreases):
βT(x) = 1    and    βt(x) = Σ_{x′} P_t^{x,x′} Q_{t+1}^{x′,y_{t+1}} β_{t+1}(x′)
Computations of alpha and beta coefficients are independent and can be done in parallel. These
two computation tasks merged together define the forward-backward algorithm.
Given a HMM and some observations O = {y0, ..., yT}, finding the most probable state trajectory consists in finding the most probable sequence (x*_0, ..., x*_T) of states, that is:
(x*_0, ..., x*_T) = argmax_{x0,...,xT} P(X0 = x0, ..., XT = xT | Y0 = y0, ..., YT = yT)
This problem can be solved efficiently using dynamic programming. Let's first define the value function as:
V(t, x) = max_{x0,...,x_{t−1}} P(X0 = x0, ..., Xt−1 = x_{t−1}, Xt = x, Y0 = y0, ..., Yt = yt)
In practice one computes the log value log V to avoid numerical precision issues, and one also memorizes for all pairs (t, x) the state x*_{t−1}(x) for which this value V(t, x) is reached, so that the most probable trajectory can be reconstructed:
x*_{t−1}(x) = argmax_{x′} ( log P_{t−1}^{x′,x} + log V(t−1, x′) )
log V(t, x) = log Q_t^{x,y_t} + log P_{t−1}^{x*_{t−1}(x),x} + log V(t−1, x*_{t−1}(x))
Once the pairs (V(t, x), x*_{t−1}(x)) have been computed in a forward order for all times t and states x, the most probable state trajectory (x*_0, ..., x*_T) is computed backward:
x*_T = argmax_x V(T, x),  x*_{T−1} = x*_{T−1}(x*_T),  ...,  x*_{t−1} = x*_{t−1}(x*_t),  ...,  x*_0 = x*_0(x*_1)
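This dynamic programming scheme is the Viterbi algorithm. Below is a hedged NumPy sketch for a homogeneous HMM (not from the original text; it reuses the toy matrices of the filtering example above).

```python
import numpy as np

def viterbi(P0, P, Q, observations):
    """Most probable state trajectory of a homogeneous discrete HMM (log domain)."""
    T, n = len(observations), len(P0)
    logV = np.log(P0) + np.log(Q[:, observations[0]])
    backptr = np.zeros((T, n), dtype=int)
    for t in range(1, T):
        scores = logV[:, None] + np.log(P)       # scores[x', x] = log V(t-1, x') + log P[x', x]
        backptr[t] = scores.argmax(axis=0)       # best predecessor for each state x
        logV = scores.max(axis=0) + np.log(Q[:, observations[t]])
    # backward reconstruction of the trajectory
    path = [int(logV.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

P0 = np.array([0.6, 0.4])
P = np.array([[0.7, 0.3], [0.2, 0.8]])
Q = np.array([[0.9, 0.1], [0.3, 0.7]])
print(viterbi(P0, P, Q, [0, 0, 1, 1, 1]))
```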
Since HMM is a model with latent variables (X0 , . . . , XT ), this ML problem can be approached by
EM (using the MLE estimator). The specific instance of EM is called the Baum-Welch algorithm.
Estimation step
Let's define X^(i) and Y^(i) respectively as the joint variables (X0^(i), ..., XT^(i)) and (Y0^(i), ..., YT^(i)). The X^(i) variables are latent. In general the variables X^(i) cannot be further decomposed. However, because of the Markov property, one can decompose the joint distribution of X^(i) as a product:
P_{X^(i) | Y^(i)} = P_{X0^(i) | Y^(i)} Π_t P_{X_{t+1}^(i) | Xt^(i), Y^(i)}
Estimating the distribution of X^(i) can be done by estimating these factors, or alternatively, by estimating the distributions ( P_{Xt^(i) | y^(i)} )_{0≤t≤T} and ( P_{Xt^(i), X_{t+1}^(i) | y^(i)} )_{0≤t≤T−1}.
Let's thus denote at^(i) and Bt^(i) the approximated distributions of P_{Xt^(i) | y^(i)} and P_{Xt^(i), X_{t+1}^(i) | y^(i)}. The "E" step updates at^(i) and Bt^(i) for every sample i and for every time t according to the current HMM parameters θ:
at^(i) ← P_{Xt^(i) | y0^(i),...,yT^(i), θ}
Bt^(i) ← P_{Xt^(i), X_{t+1}^(i) | y0^(i),...,yT^(i), θ}
This is a smoothing problem studied in section 26.3.3. Computing at^(i) can directly be solved with the forward-backward algorithm. Computation of Bt^(i) can also be done by reusing the alpha and beta coefficients according to the following equation:
Bt^(i)(x, x′) ∝ αt(x) × P_t^{x,x′} × Q_{t+1}^{x′,y_{t+1}} × β_{t+1}(x′)
Maximization step
The "M" step finds the HMM parameters θ = {P0} ∪ {Pt}_{0≤t≤T−1} ∪ {Qt}_{0≤t≤T} that maximize the energy, according to the distributions at^(i) and Bt^(i). Assuming that the trajectory samples are i.i.d., the energy is (omitting parameters):
E(θ) = Σ_{i=1}^N ⟨ log P(Y^(i) = y^(i), X^(i) | θ) ⟩_{X^(i) ∼ a^(i), B^(i)}
     = Σ_{i=1}^N ( ⟨log P(X0^(i))⟩_{X0^(i) ∼ a0^(i)} + Σ_{t=0}^{T−1} ⟨log P(X_{t+1}^(i) | Xt^(i))⟩_{(Xt^(i), X_{t+1}^(i)) ∼ Bt^(i)} + Σ_{t=0}^{T} ⟨log P(Yt^(i) = yt^(i) | Xt^(i))⟩_{Xt^(i) ∼ at^(i)} )
Lines of Pt and Qt must be normalized so that their coefficients sum up to one. In case the HMM is homogeneous, the previous equations become:
P0 = (1/N) Σ_{i=1}^N a0^(i)
∀x,  P(x, ∗) ∝ Σ_{i=1}^N Σ_{t=0}^{T−1} Bt^(i)(x, ∗)
∀y,  Q(∗, y) ∝ Σ_{i=1}^N Σ_{t=0}^{T} 1_{yt^(i)=y} at^(i)
The command is observable (otherwise it would not be worth integrating it into the model) and helps to know the current system state (Bayesian filtering).
Example:
Let’s take an example of a car that is localized thanks to a GPS receiver. The state is Xt =
(xt , yt , θt , vt )T , where (x, y) is the couple longitude-latitude, θ is the heading angle (a null angle
means the car is heading east), and v is the velocity. The variations of altitude z are neglected.
The command is Ut = (Ct , αt )T where C is the curvature (i.e. one assumes the wheel can be turned
instantly) and α is the force determined by the combined action of throttle and brakes. Finally the
output/observation is Yt = (xt^gps, yt^gps, vt^odo) where (x^gps, y^gps) are the GPS coordinates and where v^odo is the speed measured by the car odometer.
Because of the Markov property, the dynamic of the state (i.e the derivative of Xt ) is assumed
to be a function of only the current state and of the current command: in other words Xt+∆t only
depends on Xt and Ut for ∆t > 0; it is independent of the previous states Xt0 and commands Ut0
for t′ < t. The observation Yt generally only depends on the current state Xt, even if the methods presented further can also support a dependence of Yt on Ut. If the model is deterministic (i.e. there
is no source of uncertainty in the model) one can describe our model by a standard state-space
representation as represented on figure 26.7. Such representation is defined by two families of
functions ft and gt:
dXt/dt = ft(Xt, Ut)
Yt = gt(Xt, Ut)
The first equation represents the state integration, the second one is the output equation.
Example:
dx/dt = vt cos(θt)
dy/dt = vt sin(θt)
dθ/dt = vt Ct
dv/dt = (1/M) αt − (f/M) vt
xt^gps = xt
yt^gps = yt
vt^odo = vt
M is the mass of the loaded vehicle and f is a friction coefficient. Because the functions ft and gt do not depend on time t, the system is homogeneous (unless the car crashes, and only as a first approximation since the mass actually varies over time with the fuel, load and passengers).
However, as already stated, time is assumed to be discrete, first for simplicity and second because practical implementations assume time is discrete. The notion of derivative is thus replaced by a finite difference, so that the considered models are:
Xt+1 = ft (Xt , Ut )
Yt = gt (Xt , Ut )
Example:
Assuming the car computer updates the state representation with a time period of τ seconds (typi-
cally 0.1s), the state integration is now:
xt+1 = xt + vt cos(θt) τ
yt+1 = yt + vt sin(θt) τ
θt+1 = θt + vt Ct τ
vt+1 = (1 − f τ / M) vt + (τ / M) αt
The previous car model assumes the reality perfectly matches the model which is of course very
naive, mostly for two distinct reasons:
• First the dynamic is not perfectly known and some external factors can disturb the state
evolution. Some wind or slope can slow down the car or accelerate it in an unpredictable
way (at least for the wind).
• Second the observations can be noisy. The standard deviation of the GPS coordinates for a fixed point is typically of a few meters.
Therefore the model must be stochastic and integrate some uncertainty thanks to a Bayesian approach. The functions ft and gt have to be replaced respectively by distributions P_{Xt+1 | Xt, Ut, Θt} and P_{Yt | Xt, Ut, Θt} of parameters (Θt)_{t∈N}.
In most cases, the system dynamics is fixed so that the model is homogeneous and the distribution parameters do not depend on time.
The problem of learning such models consists in finding the model parameters Θ that best match some data (i.e. sequences of command, state and output triplets (ut, xt, yt)_{t∈N}). This problem is usually not easy to solve as it requires representing – at least approximately – the complex distributions P_{Xt+1 | Xt, Ut, Θ} and P_{Yt | Xt, Ut, Θ}, and then using sampling techniques to estimate them. Further assumptions on the model allow one to drastically simplify the problem, as shown in the next section.
A discrete-time linear Markov model with continuous state and observation spaces is characterized by four matrix time series At ∈ Mn×n(R), Bt ∈ Mn×p(R), Ct ∈ Mq×n(R) and Dt ∈ Mq×p(R), along with two white zero-centred noises (εt^X ∈ R^n)_{t∈N} and (εt^Y ∈ R^q)_{t∈N}, and optionally two vector time series Xt^0 ∈ R^n and Yt^0 ∈ R^q, such that^a:
Xt+1 = At Xt + Bt Ut + Xt^0 + εt^X
Yt = Ct Xt + Dt Ut + Yt^0 + εt^Y
In the homogeneous case, matrices and vectors are fixed, equal respectively to A, B,
C, D, X 0 and Y 0 .
a These terms X 0 and Y 0 are generally omitted as they can be integrated in the matrices B and
D by extending the command vector with a constant component equal to 1. While elegant, this
choice is misleading and inefficient from an implementation point of view.
Example:
Clearly our car model is not linear because of expressions like vt cos(θt), vt sin(θt) or vt Ct. Let's modify our problem to make it linear (one will see later how the car problem can be solved). Let's consider a logistic elevator that loads and unloads packages from very long shelves in a factory warehouse: this elevator is a motorized trolley equipped with a lift and mounted on linear rails that run along the shelves. This robot can thus move in the XZ plane thanks to these rails. The state is X = (x, z, v^x, v^z) where (x, z) and (v^x, v^z) are the position and speed coordinates in the XZ plane. The command U = (α^x, α^z) gathers the X and Z forces of the robot's electrical engines. An
odometer provides the trolley velocity v^odo whereas a position encoder gives the elevation z^enc of the lift. The output vector is thus Y = (v^odo, z^enc). Assuming the embedded computer updates the state every τ = 0.1 second, and that the mass of the load can be neglected compared to the mass M of the elevator, the corresponding discretized model is:
xt+1 = xt + vt^x τ + ε^x
zt+1 = zt + vt^z τ + ε^z
vt+1^x = vt^x + (1/M) (αt^x − fx vt^x) τ + εt^{vx}
vt+1^z = vt^z + (1/M) (αt^z − fz vt^z − M g) τ + εt^{vz}
Coefficients fx and fz represent the friction forces along the X and Z axes; g is the gravity acceleration constant; the noises ε^{vx} and ε^{vz} respectively represent the unknown forces acting on the trolley and the lift, and the noises ε^x and ε^z represent the risk of slipping (assumed to be null hereafter). The output equation for Yt = (v^odo, z^enc) is given by
vt^odo = vt^x + εt^odo
zt^enc = zt + εt^enc
Noises ε^odo and ε^enc respectively represent the measurement noise of the trolley odometer and of the elevation encoder. From these equations one derives the parameters of our model:
A = [ 1  0  τ  0 ;  0  1  0  τ ;  0  0  1 − fx τ/M  0 ;  0  0  0  1 − fz τ/M ]
B = [ 0  0 ;  0  0 ;  τ/M  0 ;  0  τ/M ]
X^0 = [ 0 ;  0 ;  0 ;  −g τ ]
εt^X = [ εt^x ;  εt^z ;  εt^{vx} ;  εt^{vz} ]
C = [ 0  0  1  0 ;  0  1  0  0 ]
D = [ 0  0 ;  0  0 ]
Y^0 = [ 0 ;  0 ]
εt^Y = [ εt^odo ;  εt^enc ]
However the linearity hypothesis is not sufficient to keep a simple and tractable representation
of the state distributions P Xt | Y0 ,...,Yt when t is growing. Further assumptions have to be made:
Kalman filters consider the specific subcase where initial state, state and observation noises are
assumed to be gaussian:
A Kalman filter estimates the state distribution P_{Xt | Y0,...,Yt} for a linear discrete-time continuous-state Markov model where:
• The state and observation noises (εt^X)_{t∈N} and (εt^Y)_{t∈N} are white and normal with null expected values and known covariance matrices, respectively denoted (Qt)_{t∈N} (with Qt ∈ Mn×n(R)) and (Rt)_{t∈N} (with Rt ∈ Mq×q(R)).
• The initial state X0 follows a normal distribution X0 ∼ N(X̂0, P0) of known parameters.
Example:
In the robot example, the noises on x, v x , z and v z are all independent. Similarly the measurement
noise of the odometer and the position encoder are independent. One also assumes all sources of
noise are constant with time, so that the covariance matrices of the noises are constant and diagonal:
Q = diag( σx², σz², σ_{vx}², σ_{vz}² )        R = diag( σodo², σenc² )
Because multivariate normal distributions are closed under linear combinations, it is obvious that P_{Xt | Y0,...,Yt} will remain normal. The real question is to know how to update the parameters of this normal distribution during state integration and observation. To this end, one introduces the following useful notation:
• X̂t|t−1 and Pt|t−1 are respectively the expected value and the covariance matrix of the current
state Xt | Y0 , . . . , Yt−1 given the past observations, abbreviated as Xt|t−1 .
• X̂t|t and Pt|t are respectively the expected value and the covariance matrix of the current
state Xt | Y0 , . . . , Yt given the past and present observations, abbreviated as Xt|t .
Let’s prove by induction that Xt | Y0 , . . . , Yt−1 is normal: Xt|t−1 ∼ N X̂t|t−1 , Pt|t−1 .
Proof. The induction is verified at rank 0 since X0 ∼ N X̂0 , P0 by hypothesis. Let’s as-
sume Xt|t−1 ∼ N X̂t|t−1 , Pt|t−1 and let’s prove this property at rank t + 1, i.e. Xt+1|t ∼
N X̂t+1|t , Pt+1|t .
The proof is split in two halves: the first half shows Xt|t ∼ N X̂t|t , Pt|t , the second shows
Xt+1|t ∼ N X̂t+1|t , Pt+1|t . Both halves are similar but the first half is more difficult than the
second so let’s assume in a first stage that the first half is already proven and let’s prove first the
second half.
Since Ut is a known constant that can be interpreted as a normal distribution of null covariance, and since εt^X is a white noise independent of Xt|t, one then has:
(Xt|t, εt^X, Ut)^T ∼ N( (X̂t|t, 0_{n,1}, Ut)^T, blockdiag(Pt|t, Qt, 0_{p,p}) )
Xt+1|t = At Xt + Bt Ut + εt^X = M × (Xt|t, εt^X, Ut)^T    with M = [ At  I_{n,n}  Bt ]
Because one knows that the joint variable (Xt|t−1, Yt|t−1) is normal (since Y is given by a linear combination of normal variables) and that conditioning a normal joint variable gives another normal variable, one can deduce that Xt|t is also normal. Let's compute its parameters. First let's recall the rule for conditioning a multivariate normal distribution:
(X1, X2)^T ∼ N( (X̂1, X̂2)^T, [ Σ11  Σ12 ; Σ12^T  Σ22 ] )
⇒ X1 | X2 = x2 ∼ N( X̂1 + Σ12 Σ22^{-1} (x2 − X̂2), Σ11 − Σ12 Σ22^{-1} Σ12^T )        (26.107)
But one first needs to determine the joint distribution of Xt|t−1 and Yt|t−1 before applying these equations:
(Xt|t−1, Yt|t−1)^T = M′ × (Xt|t−1, εt^Y, Ut)^T    with M′ = [ I_{n,n}  0_{n,q}  0_{n,p} ; Ct  I_{q,q}  Dt ]
Because
(Xt|t−1, εt^Y, Ut)^T ∼ N( (X̂t|t−1, 0_{q,1}, Ut)^T, blockdiag(Pt|t−1, Rt, 0_{p,p}) )
consequently:
(Xt|t−1, Yt|t−1)^T ∼ N( M′ × (X̂t|t−1, 0, Ut)^T, M′ × blockdiag(Pt|t−1, Rt, 0) × M′^T )
                   ∼ N( (X̂t|t−1, Ŷt|t−1)^T, [ Pt|t−1  Pt|t−1 Ct^T ; Ct Pt|t−1  St|t−1 ] )
with
Ŷt|t−1 = Ct X̂t|t−1 + Dt Ut
St|t−1 = Ct Pt|t−1 Ct^T + Rt
Applying the conditioning rule (26.107) then gives
X̂t|t = X̂t|t−1 + Pt|t−1 Ct^T St|t−1^{-1} ( yt − Ŷt|t−1 )
This proof not only demonstrates that the state Xt follows a normal distribution, but it also
gives – and this is essential from an application perspective – the equations to update the parame-
ters of the state distribution 1) when time must be increased, also called the prediction equations,
and 2) when some observations are received, also called the update equations.
X̂t+1|t = At X̂t|t + Bt Ut
Ŷt|t−1 = Ct X̂t|t−1 + Dt Ut
3. Innovation (that is, the signed error between the output and the expected output):
ỹt = yt − Ŷt|t−1
4. Kalman filter gain (that estimates how strongly the innovation should correct the state):
Kt = Pt|t−1 Ct^T St|t−1^{-1}
X̂t|t = X̂t|t−1 + Kt ỹt
Implementing a Kalman filter typically amounts to implementing the three functions given in algorithms 28, 29 and 30.
It is interesting to note that the update function can be called with different types of observation vectors Y. This is of practical interest for systems equipped with different types of sensors providing measures at different times/rates. This ability to extract the best information from multiple sensors is called information fusion.
Example:
Algorithm 28 init X̂0 , P0
Algorithm 29 predict(t0 , U)
Require: New time t0 , and command U
1: Compute A, B and Q given current state and time
2: X̂ ← A X̂ + B U
3: P ← A P AT + Q
4: t ← t0
In addition to the existing sensors, the elevator is equipped with an optical sensor mounted on the lift that triggers an output every time the sensor gets aligned with visual landmarks stuck on the shelves. The position (x^opt, z^opt) of the robot can then be inferred by querying a database that maps every landmark to a rack position. This second output Y^opt = (x^opt, z^opt) provides a measure of position much more accurate than the first output, but it is only available occasionally, at a much lower rate.
The Kalman filter only works for linear systems but most real systems are not linear. The navigation systems of wheeled vehicles are an important example. One solution to apply the Kalman filter to a non-linear system is to linearise the system equations in the vicinity of the currently estimated state X̂. This provides a first-order approximation called the Extended Kalman Filter.
Algorithm 30 update(t0, Y, U)
Require: Observation timestamp t0 and value y, command U
1: Call predict(t0, U)
2: Compute C, D and R given current state and time
3: Ŷ ← C X̂ + D U
4: S ← C P C^T + R
5: K ← P C^T S^{-1}
6: X̂ ← X̂ + K (y − Ŷ)
7: P ← (In − K C) P
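As an illustration (not the text's algorithms, just a compact equivalent), here is a hedged NumPy sketch of the predict/update functions for the homogeneous case.

```python
import numpy as np

class KalmanFilter:
    """Minimal homogeneous Kalman filter in the spirit of algorithms 28-30 (a sketch)."""
    def __init__(self, A, B, C, D, Q, R, x0, P0):
        self.A, self.B, self.C, self.D, self.Q, self.R = A, B, C, D, Q, R
        self.x, self.P = x0.copy(), P0.copy()

    def predict(self, u):
        # State integration: propagate the mean and inflate the covariance with Q.
        self.x = self.A @ self.x + self.B @ u
        self.P = self.A @ self.P @ self.A.T + self.Q
        return self.x

    def update(self, y, u):
        # Correct the prediction with the innovation y - y_hat.
        y_hat = self.C @ self.x + self.D @ u
        S = self.C @ self.P @ self.C.T + self.R
        K = self.P @ self.C.T @ np.linalg.inv(S)          # Kalman gain
        self.x = self.x + K @ (y - y_hat)
        self.P = (np.eye(len(self.x)) - K @ self.C) @ self.P
        return self.x
```

In practice, predict is called at every time step while update is called only when an observation arrives (after a predict to the observation timestamp, as in algorithm 30); this is what makes the fusion of sensors running at different rates possible.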
The Extended Kalman filter (EKF) consists, given a non-linear state space representation
Xt+1 = ft(Xt, Ut)
Yt = gt(Xt, Ut)
in:
• Prediction equations:
1. Predicted state expected value:
X̂t+1|t = ft(X̂t|t, Ut)
2. Jacobian matrix of ft:
At = ∂ft/∂X (X̂t|t)
3. Predicted state covariance matrix:
Pt+1|t = At Pt|t At^T + Qt
• Update equations:
1. Predicted output expected value:
Ŷt|t−1 = gt(X̂t|t−1, Ut)
2. Jacobian matrix of gt:
Ct = ∂gt/∂X (X̂t|t−1)
3. Predicted output covariance matrix:
St|t−1 = Ct Pt|t−1 Ct^T + Rt
ỹt = yt − Ŷt|t−1
Kt = Pt|t−1 Ct^T St|t−1^{-1}
X̂t|t = X̂t|t−1 + Kt ỹt
Pt|t = (In − Kt Ct) Pt|t−1
Example:
The EKF allows one to estimate the position and speed of our car. As a reminder, the state, command and output of this system are:
Xt = (xt, yt, θt, vt)^T        Ut = (Ct, αt)^T        Yt = (xt^gps, yt^gps, vt^odo)^T
To define the Q matrix, one needs to determine the main sources of uncertainty in the model dynamics. The acceleration uncertainty standard deviation is estimated roughly at 3 m/s², the risk of slipping is considered to be null in normal conditions and the uncertainty on the rotational speed is estimated at 10 deg/s ≈ 0.2 rad/s. For the matrix R, the GPS accuracy is about 2 m and the odometer precision is 3 km/h ≈ 1 m/s, so finally:
Q = diag( 0, 0, (0.2 τ)², (3 τ)² )        R = diag( 2², 2², 1 )
The EKF can efficiently track a system state as long as the uncertainty on the state is kept small. However, if observations are missing for too long, the uncertainty (represented by P) increases and the predicted state will likely diverge from the real state. In such cases, alternative methods like particle filtering (see section 28.3.4) must be used instead.
Chapter 27
Non-parametric Bayesian methods and Gaussian processes
27.1 Introduction
So far all the studied methods rely on the existence of some model parameters θ. Such methods
are said parametric. Given some classification/regression problem predicting an output Y from an
input X and given some i.i.d samples Z = ((x1 , y1 ), . . . , (xn , yn )), a parametric method is divided
in two steps:
• The learning step infers the parameters from the samples, that is to say, determines the posterior P_{Θ|Z} as:
P_{Θ|Z}(θ) ∝ ( Π_i P_{Yi | Xi=xi, Θ=θ}(yi) ) × PΘ(θ)
• The prediction step infers the output from the input and the posterior, that is to say, determines P_{Y|X=x,Z} as:
P_{Y|X=x,Z}(y) = ∫_θ P_{Y|X=x,Θ=θ}(y) × P_{Θ|Z}(θ) dθ
θ
Merging these two steps in one leads to the notion of non-parametric methods:
P_{Y|X=x,Z}(y) = ∫_θ P_{Y|X=x,Θ=θ}(y) × P_{Θ|Z}(θ) dθ
               ∝ ∫_θ P_{Y|X=x,Θ=θ}(y) × ( Π_i P_{Yi|Xi=xi,θ}(yi) ) × PΘ(θ) dθ
In a non-parametric approach, the model does not make parameters explicit. One directly infers P_{Y | X=x, Z} from the observed samples, which are obviously not i.i.d. any more as the parameters have been marginalized out:

  P_{Y | X=x, Z} ∝ P_{Y, Y_1=y_1, ..., Y_n=y_n | X=x, X_1=x_1, ..., X_n=x_n}
The K-nearest neighbour classification method is an example of non-parametric method. Let’s
develop the example of Gaussian processes.
Kernels can also be built as a scalar product in some space, thanks to a transformation g: k(x1, x2) = ⟨g(x1), g(x2)⟩. Since the outputs of a Gaussian process are generally spatially correlated, i.e. f(x1) and f(x2) get more correlated when x1 and x2 get closer, the kernel functions are chosen so that k(x1, x2) grows when x2 tends to x1.
Figure 27.1: Representation of a Gaussian process GP(µ, k) for µ : x 7→ sin(x) and k(x1 , x2 ) =
kSE (x1 , x2 ) with l = 1.
However, this first representation considers every input x independently of the others and does not take into account the correlation k(x1, x2) between f(x1) and f(x2). This representation does not emphasize the correlation of outputs of similar inputs (i.e. the smoothness of samples).
A second possible representation is to draw randomly several samples from f and plot them graphically as curves. Sampling is however not obvious, as a sample of a Gaussian process is a function y : x ∈ R ↦ y(x) ∈ R, that is, an infinite number of points. A possible approximation is to choose an interval of representation [x_min, x_max] and to split this interval with n regularly spaced points:

  x_i = i (x_max − x_min)/n + x_min

Then one approximates the sample curve y : x ∈ R ↦ y(x) ∈ R by the finite set of points (x_1, y_1), ..., (x_n, y_n) such that (y_1, ..., y_n) is drawn from the multivariate normal distribution:
  (y_1, ..., y_n) ∼ N( (µ(x_1), ..., µ(x_n))ᵀ , [k(x_i, x_j)]_{1≤i,j≤n} )
The sampling algorithm is detailed in pseudocode 31. In particular, it explains how to sample a multivariate normal distribution thanks to a covariance matrix decomposition and from the following property of multivariate normal distributions:

  X ∼ N(µ, Σ) ⇒ A X + B ∼ N(A µ + B, A Σ Aᵀ)
Figure 27.2 provides 10 samples from the Gaussian process introduced on Fig. 27.1.
Of course the previous representations can be generalized to an input space of higher dimension
by generating a grid of input points instead of subdividing an interval of the real line.
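The sampling procedure described above can be sketched in Python/NumPy as follows; the helper names (sample_gp_prior, k_se) are illustrative and the jitter term is a common numerical safeguard, not part of the algorithm itself:

import numpy as np

def sample_gp_prior(mu, k, xs, n_samples=10, jitter=1e-9):
    """Draw samples from GP(mu, k) evaluated on the grid xs."""
    m = np.array([mu(x) for x in xs])
    K = np.array([[k(x1, x2) for x2 in xs] for x1 in xs])
    # Decompose the covariance (Sigma = C C^T); jitter keeps it positive definite
    C = np.linalg.cholesky(K + jitter * np.eye(len(xs)))
    # x = C y + mu with y ~ N(0, I) follows N(mu, Sigma)
    return m + (C @ np.random.randn(len(xs), n_samples)).T

# Example in the spirit of Fig. 27.2: mu = sin, squared exponential kernel, l = 1
l = 1.0
k_se = lambda x1, x2: np.exp(-(x1 - x2) ** 2 / (2 * l ** 2))
xs = np.linspace(0, 10, 100)
samples = sample_gp_prior(np.sin, k_se, xs)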
Figure 27.2: 10 samples from GP(µ, k) for µ : x 7→ sin(x) and k(x1 , x2 ) = kSE (x1 , x2 ) with l = 1.
(c) l = 3   (d) l = 10
Figure 27.3: 10 samples from GP(µ, k) for µ : x 7→ 0 and k(x1 , x2 ) = kSE (x1 , x2 ) with various
values for l.
27.2.4 Prediction
Since Gaussian processes are non-parametric models, there is no learning step: the prediction of an output Y for the input X = x is made directly from the observations O = ((x1, y1), ..., (xk, yk)) by conditioning the joint multivariate normal distribution of (f(x), f(x1), ..., f(xk)) on the observations f(x1) = y1, ..., f(xk) = yk:
  (f(x'_1), ..., f(x'_n)) | O ∼ N( µ_p + Σ_po Σ_oo⁻¹ (y_o − µ_o) , Σ_pp − Σ_po Σ_oo⁻¹ Σ_poᵀ )

where

  µ_p = (µ(x'_1), ..., µ(x'_n))ᵀ      Σ_pp = [k(x'_i, x'_j)]_{1≤i,j≤n}
  µ_o = (µ(x_1), ..., µ(x_k))ᵀ        Σ_oo = [k(x_i, x_j)]_{1≤i,j≤k}        (27.7)
  y_o = (y_1, ..., y_k)ᵀ              Σ_po = [k(x'_i, x_j)]_{1≤i≤n, 1≤j≤k}
Proof. Given observations O = ((x1, y1), ..., (xk, yk)), for all finite sets (x'_1, ..., x'_n) of input points, one knows that (f(x'_1), ..., f(x'_n), f(x_1), ..., f(x_k)) follows a multivariate normal distribution. Moreover, given two vectors A and B of random variables, if the joint distribution of (A, B) follows a multivariate normal distribution, then the distribution A | B = b of A given that B equals b is still a multivariate normal distribution, whose parameters are given by the following formula, where Σ_XY denotes the covariance matrix between X and Y:
  (A, B) ∼ N( (E[A], E[B]) , ( Σ_AA  Σ_AB ; Σ_ABᵀ  Σ_BB ) )  ⇒
  A | B = b ∼ N( E[A] + Σ_AB Σ_BB⁻¹ (b − E[B]) , Σ_AA − Σ_AB Σ_BB⁻¹ Σ_ABᵀ )
From these expressions, one can deduce the distribution of f(x) | O (by taking n = 1 and x'_1 = x). This is useful to update the drawing of the average and confidence interval curves of f(x) | O. Moreover, these expressions, when combined with the sampling technique presented in algorithm 31, allow drawing graphically samples from the conditioned Gaussian process. Figure 27.4 shows how a one-dimensional Gaussian process is conditioned progressively as new observations become available. On the last figure, the fourth observation (1.1, −2) contradicts the first one (1, 1). This introduces a kind of singularity, with abrupt changes, and a high expected value of about 15 around input x ≈ 0.5.
27.2.5 Regularization
So far the observations ((x1 , y1 ), . . . , (xk , yk )) were assumed to be perfect, without noise. As a
consequence the distribution of f (xi ) | O is atomic (see how variance of input points is null on
Fig. 27.2). This produces a kind of overfitting, observable on the last figure of Fig. 27.4. Moreover
two contradictory observations (x1 , y1 ) and (x2 , y2 ) (i.e x1 ∼ x2 but y1 y2 ) can introduce
singularities.
Figure 27.4: Conditioning a Gaussian process to an increasing number of observations. The initial
Gaussian process has a null µ function and uses a squared exponential kernel kSE with l = 1.
To avoid overfitting, one can model the imperfectness of the observations by introducing an observation noise ε:

  ∀i, Y_i = f(x_i) + ε_i

The variables ε_i are assumed to be normal white noise with known variance σ_ε²:

  ∀i, ε_i ∼ N(0, σ_ε²)
Finally,

  (f(x'_1), ..., f(x'_n)) | Y_1 = y_1, ..., Y_k = y_k ∼
  N( µ_p + Σ_po (Σ_oo + σ_ε² I_k)⁻¹ (y_o − µ_o) , Σ_pp − Σ_po (Σ_oo + σ_ε² I_k)⁻¹ Σ_poᵀ )      (27.8)
The introduction of noise is a form of regularization to defeat overfitting. Figure 27.5 takes the
same example introduced in figure 27.4 with some additional observation noise. One sees that the
variance at an input point that has been observed is not zero anymore and no singularities appear
on the last figure.
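As a sketch, Eq. (27.8) can be implemented directly with NumPy as follows; the function name is illustrative, and in practice a Cholesky solve would be preferred to an explicit inverse:

import numpy as np

def gp_predict(mu, k, x_obs, y_obs, x_pred, sigma_eps=0.3):
    """Posterior mean and covariance of a GP at x_pred given noisy
    observations (x_obs, y_obs), following Eq. (27.8)."""
    mu_o = np.array([mu(x) for x in x_obs])
    mu_p = np.array([mu(x) for x in x_pred])
    S_oo = np.array([[k(a, b) for b in x_obs] for a in x_obs])
    S_pp = np.array([[k(a, b) for b in x_pred] for a in x_pred])
    S_po = np.array([[k(a, b) for b in x_obs] for a in x_pred])
    # (Sigma_oo + sigma^2 I)^{-1} applied to the centred observations
    A = np.linalg.inv(S_oo + sigma_eps ** 2 * np.eye(len(x_obs)))
    mean = mu_p + S_po @ A @ (np.asarray(y_obs) - mu_o)
    cov = S_pp - S_po @ A @ S_po.T
    return mean, cov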
Figure 27.5: Conditioning a Gaussian process to noisy observations. The initial Gaussian process is the same as on Fig. 27.4. The observation noise variance has been set to σ_ε² = 0.3².
Chapter 28
Approximate Inference
In the discrete case, X takes its values among (v_1, ..., v_K). Sampling X consists in drawing u from U_[0,1] and choosing the sample value v_i such that:

  ∑_{j=1}^{i−1} P(X = v_j) ≤ u < ∑_{j=1}^{i} P(X = v_j)
In the continuous case, X has a probability density function f. Sampling X consists again in drawing u from U_[0,1] and choosing the sample value as the value v satisfying:

  ∫_{−∞}^{v} f(x) dx = u
Of course, simpler and faster procedures exist for specific distributions. For instance, sampling the canonical Gaussian distribution N(0, 1) can be done by the Box-Muller method: draw u_1, u_2 independently from U_[0,1] and return √(−2 ln u_1) cos(2π u_2). Sampling a multivariate normal distribution N(µ, Σ) of dimension m can be done as follows:
1. Compute a matrix C such that Σ = C Cᵀ (e.g. by Cholesky decomposition).
2. Build a vector y of size m, whose components are all drawn from N(0, 1).
3. Return sample x = C y + µ
In this case, samples from P_X can be drawn thanks to the rejection sampling algorithm 32.

Proof. Let Y be the boolean random variable true if and only if u ≤ p*(x)/(M q(x)). One has:

  P(Y = true | X = x) = ∫_0^{min(1, p*(x)/(M q(x)))} 1 du = min(1, p*(x)/(M q(x))) = p*(x)/(M q(x)),

the last equality holding because one assumes p*(x)/(M q(x)) ≤ 1.
The distribution of the output X is then

  P_X(x) = P(X = x | Y = true)
         = P(Y = true | X = x) P(X = x) / P(Y = true)
         = [p*(x)/(M q(x))] q(x) / ∫ P(Y = true, X = x) dx
         = [p*(x)/M] / ∫ [p*(x)/(M q(x))] q(x) dx
         = p*(x) / ∫ p*(x) dx

The factor ∫ p*(x) dx is the normalization factor Z of p*, so that the output distribution is the normalized distribution of p*.
At every iteration, the probability of rejection is 1 − Z/M < 1, so that, even if the algorithm is not guaranteed to terminate, the probability of an infinite loop is zero. The average number of iterations is M/Z. However, for the loop to exit as quickly as possible, it is important to choose q and M in such a way that, for every value x, M q(x) is as close as possible to p*(x) (ideally M q is equal to p*). For a given distribution q(x), M must ideally be set to:

  M = sup_x p*(x)/q(x)

If the shape of q is too different from the shape of p*, M must be chosen large, so that the average number of iterations increases.
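A minimal Python sketch of rejection sampling, assuming the caller provides the unnormalized target p*, a sampler and a density for the proposal q, and the constant M (all names are illustrative):

import numpy as np

def rejection_sample(p_star, q_sample, q_pdf, M):
    """Draw one sample from the distribution proportional to p_star,
    using proposal q and a constant M with p_star(x) <= M * q_pdf(x)."""
    while True:
        x = q_sample()                       # propose
        u = np.random.uniform(0.0, 1.0)      # uniform acceptance threshold
        if u <= p_star(x) / (M * q_pdf(x)):  # accept with probability p*/(M q)
            return x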
  P_{X | X \ {X}} = P_{X | blanket(X)} = (1/Z) P_{X | parents(X)} ∏_{X' ∈ children(X)} P_{X' | parents(X')}
Another advantage of Gibbs sampling is its relative simplicity of implementation. However, the jumps of the random walk are very limited as they are necessarily collinear with one main axis (in the case of continuous distributions). When areas of high probability density are remote and disconnected “islands”, Gibbs sampling is likely to be stuck on one single island, without the ability to jump remotely to another one. In such cases, more elaborate sampling techniques are required.
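As a toy illustration of these axis-aligned moves, here is a Gibbs sampler for a bivariate Gaussian with unit variances and correlation ρ, for which both conditional distributions are known exactly (the example and names are illustrative, not taken from the pseudocode of this chapter):

import numpy as np

def gibbs_bivariate_gaussian(rho, n_steps, x0=(0.0, 0.0)):
    """Gibbs sampling for a standard bivariate Gaussian with correlation rho:
    each step resamples one coordinate from its exact conditional, so every
    move is parallel to an axis."""
    x1, x2 = x0
    samples = []
    for _ in range(n_steps):
        # p(x1 | x2) = N(rho * x2, 1 - rho^2)
        x1 = rho * x2 + np.sqrt(1 - rho ** 2) * np.random.randn()
        # p(x2 | x1) = N(rho * x1, 1 - rho^2)
        x2 = rho * x1 + np.sqrt(1 - rho ** 2) * np.random.randn()
        samples.append((x1, x2))
    return np.array(samples)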
This very general condition is called the global balance equation. This equation can be rewritten as:

  ∀x, 1 × p(x) = ∫ q(x | x') p(x') dx'
  ⇔ ∀x, ∫ q(x' | x) dx' × p(x) = ∫ q(x | x') p(x') dx'
  ⇔ ∀x, ∫ ( q(x' | x) p(x) − q(x | x') p(x') ) dx' = 0

From this last expression, one sees that the global balance equation admits a specific solution known as the detailed balance equation:

  ∀x, ∀x', q(x' | x) p(x) = q(x | x') p(x')
For instance, Gibbs sampling, introduced in the previous section, is an example of a Markov chain satisfying the detailed balance equation.
Proof. Considering a transition x → x' such that x = (x1, ..., xm) and x' = (x'_1, ..., x'_m), either the transition probabilities q(x' | x) and q(x | x') are null, or there exists an index i such that for all j ≠ i, x'_j = x_j. In the first case, the detailed balance equation is obviously true. In the second case one has:
Gibbs sampling is a very particular case of MCMC. A general question is how one can build a transition distribution q satisfying the detailed balance equation. One method is to start from any transition distribution q̃(x | x'), called the proposal distribution, and to use the rejection principle in order to “transform” it into a valid distribution q(x | x') satisfying the detailed balance. Let's denote a(x | x') the acceptance probability (i.e. the complement to 1 of the rejection probability) of the transition x' → x. The detailed balance is satisfied if:

  ∀x, ∀x', a(x | x') q̃(x | x') p(x') = a(x' | x) q̃(x' | x) p(x)

This idea is at the origin of the most famous MCMC method, called the Metropolis-Hastings algorithm, detailed by pseudocode 35.
The algorithm is characterized by an acceptance probability equal to

  a(x | x') = min( 1 , [q̃(x' | x) × p(x)] / [q̃(x | x') × p(x')] )
Proof.

  ∀x, ∀x', a(x' | x) q̃(x' | x) p(x) = min( 1 , [q̃(x | x') × p(x')] / [q̃(x' | x) × p(x)] ) q̃(x' | x) p(x)
                                     = min( q̃(x' | x) p(x) , q̃(x | x') × p(x') )
                                     = min( q̃(x | x') p(x') , q̃(x' | x) × p(x) )
                                     = min( 1 , [q̃(x' | x) × p(x)] / [q̃(x | x') × p(x')] ) q̃(x | x') p(x')
                                     = a(x | x') q̃(x | x') p(x')
One additional advantage of the Metropolis-Hastings algorithm is that it accepts as input an unnormalized target distribution p*(x). The reason is that the acceptance factor depends on the ratio p(x)/p(x') = p*(x)/p*(x'), which is independent of the normalizing factor.
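A minimal random-walk Metropolis-Hastings sketch in Python: with a symmetric Gaussian proposal, the ratio q̃(x'|x)/q̃(x|x') cancels, so only the (possibly unnormalized) target p* appears in the acceptance probability (names are illustrative):

import numpy as np

def metropolis_hastings(p_star, x0, n_steps, step=1.0):
    """Random-walk Metropolis-Hastings targeting the unnormalized density p_star."""
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    samples = []
    for _ in range(n_steps):
        x_new = x + step * np.random.randn(*x.shape)   # propose x' ~ q(.|x), symmetric
        a = min(1.0, p_star(x_new) / p_star(x))        # acceptance probability
        if np.random.rand() < a:
            x = x_new                                   # accept, otherwise keep x
        samples.append(x.copy())
    return np.array(samples)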
  Ê(f(X)) = [ ∑_{i=1}^n w_i f(x_i) ] / [ ∑_{i=1}^n w_i ] = [ ∑_{i=1}^n (p*(x_i)/q(x_i)) f(x_i) ] / [ ∑_{i=1}^n p*(x_i)/q(x_i) ]
Proof. The proof uses a trick to redefine the expected value relatively to p as an expected value relatively to q:

  E_{p(X)}( f(X) ) = ∫ f(x) p(x) dx = ∫ f(x) (p(x)/q(x)) q(x) dx = E_{q(X)}( f(X) p(X)/q(X) )
                   = E_{q(X)}( f(X) p*(X)/q(X) ) / ∫ p*(x') dx'
                   = E_{q(X)}( f(X) p*(X)/q(X) ) / ∫ (p*(x')/q(x')) q(x') dx'
                   = E_{q(X)}( f(X) p*(X)/q(X) ) / E_{q(X)}( p*(X)/q(X) )
Applying the law of large numbers to the last expression by replacing expected values with arith-
metic average of n samples of q, one gets the desired result.
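The self-normalized estimator above can be sketched as follows (p_star, q_pdf and f are assumed to be vectorized functions; all names are illustrative):

import numpy as np

def importance_estimate(f, p_star, q_sample, q_pdf, n=10_000):
    """Self-normalized importance sampling estimate of E_p[f(X)], where p is the
    normalized version of p_star and q is the proposal distribution."""
    xs = np.array([q_sample() for _ in range(n)])
    w = p_star(xs) / q_pdf(xs)          # unnormalized importance weights
    return np.sum(w * f(xs)) / np.sum(w)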
Resampling
Importance sampling works in theory for an infinite number of samples. In practice, things do not work so well when q does not match p. In this case most samples x_i will have a very low weight (as they are very likely to be sampled from q but not from p). Only very few samples will be likely to be sampled from p and will get a relatively high weight. This imbalance in the weight distribution leads to imprecise results compared to the computation cost (as many samples with very low weight are useless). The procedure of resampling helps to redistribute weights homogeneously over samples. Algorithm 36 describes the procedure. Note that one sample can be duplicated several times
Algorithm 36 Resampling
1: Inputs : set of weighted samples (xi , wi )1≤i≤n
2: Let p be the categorical distribution over i ∈ {1, ..., n} with p(i) = w_i / ∑_{j=1}^n w_j
3: Output multiset S ← ∅ of weighted samples
4: for i = 1 . . . n do
5: Draw an index j from p
6: Add sample xj to S with weight w = 1
7: end for
8: return S
in the output multiset1 . Resampling more likely selects samples of heavy weight. However in the
end every “surviving” sample has the same weight.
1A multiset is a set where an element can be duplicated an arbitrary number of times.
Particle filtering
Particle filtering is an adaptation of importance sampling to deal with trajectories, i.e. where a sample x_i is a temporal sequence (x_i^0, ..., x_i^t, ...) of state points x_i^t ∈ R^k. For this reason particle filtering is also called Sequential Importance Resampling (SIR). Particle filters are a powerful tool used in place of Kalman filters to solve hard Bayesian filtering problems, where the considered dynamic models are non-linear and/or weakly Markovian (i.e. Markov models with a high order). Particle filtering updates trajectory samples dynamically at each time step, by analogy with a beam of independent particles drawing trajectories in a state space. One advantage of particle filtering is that it can work with an unnormalized dynamic model, i.e. an unnormalized transition distribution p*(X_{t+1} | X_0, ..., X_t). Particle filtering also takes as input an importance distribution, defined as a transition distribution q(X_{t+1} | X_0, ..., X_t) as well. However, q is meant to be much simpler to sample than p*.
At the current time t, particle filtering updates every trajectory (also called a particle). When passing from time t to time t + 1, the method completes every trajectory x_i = (x_i^0, ..., x_i^t) with a new sample x_i^{t+1} drawn from q(X_{t+1} | X_0 = x_i^0, ..., X_t = x_i^t). Trajectories thus follow the q distribution. However, the weight w_i of every trajectory is updated consequently, so that the weighted average of a function f(x) over all particles x_i provides a valid approximation of the expected value E(f(X_0, ..., X_{t+1})) with respect to distribution p*. If w_i^t is the value of the weight of trajectory x_i at time t, the weight update is simply:

  w_i^{t+1} = w_i^t × α_i^{t+1}    with    α_i^{t+1} = p*(X_{t+1} | X_0, ..., X_t) / q(X_{t+1} | X_0, ..., X_t)
since, according to importance sampling, one has:

  w_i^{t+1} = p*(X_0, ..., X_{t+1}) / q(X_0, ..., X_{t+1})
            = [ p*(X_0, ..., X_t) / q(X_0, ..., X_t) ] × [ p*(X_{t+1} | X_0, ..., X_t) / q(X_{t+1} | X_0, ..., X_t) ]
            = w_i^t × α_i^{t+1}
The weights of trajectories are thus updated incrementally. When many trajectories get low weights, a resampling is done to eliminate trajectories with low weights and to duplicate trajectories with heavy weights. The randomness of transitions then splits these duplicates to explore different paths. The steps are listed in algorithm 37.
One remaining question is the choice of the importance distribution. One requirement is that sampling the importance distribution must be an easy task. On the other hand, the importance distribution must be as close as possible to the target distribution, in order to avoid resampling. The optimal choice for q depends on the nature of the problem. However, in many problems the state variable is hidden and partially observable through a visible variable V. Let's assume the model is Markovian of order 1 (this is for simplification and is not a strong requirement). In this case the target transition distribution can be decomposed as the product of two factors:

  p*(X_t = x_i^t | X_{t−1} = x_i^{t−1}) = p*_1(X_t = x_i^t | X_{t−1} = x_i^{t−1}) × p*_2(V_t = v_t | X_t = x_i^t)

In many practical models, the transition distribution p*_1 is often a normalized distribution p_1 that is easy to sample (e.g. a multivariate Gaussian distribution), whereas the emission distribution p*_2 is complex and fundamentally unnormalized (since the variable to sample is not the fixed observation v_t but the state x_i^t). For this reason, one chooses the importance distribution q as p_1. The particle weights are then equal to the unnormalized emission probabilities:
  α_i^t = p*(X_t | X_{t−1}) / q(X_t | X_{t−1})
        = [ p*_1(X_t | X_{t−1}) × p*_2(V_t = v_t | X_t) ] / p*_1(X_t | X_{t−1})
        = p*_2(V_t = v_t | X_t)
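A minimal sketch of one bootstrap particle-filter step, where the importance distribution is the transition p_1 and the weights are multiplied by the emission likelihood p*_2(v_t | x_t); the resampling criterion (effective sample size) is one common choice among others, and the helper names are illustrative:

import numpy as np

def particle_filter_step(particles, weights, transition_sample, emission_lik, v_obs):
    """One step of a bootstrap particle filter."""
    # 1. Propagate: importance distribution q = p1
    particles = np.array([transition_sample(x) for x in particles])
    # 2. Reweight: alpha_i = p2*(v_t | x_t^i)
    weights = weights * np.array([emission_lik(v_obs, x) for x in particles])
    weights = weights / np.sum(weights)
    # 3. Resample if the effective sample size is too small
    n = len(particles)
    if 1.0 / np.sum(weights ** 2) < n / 2:
        idx = np.random.choice(n, size=n, p=weights)
        particles, weights = particles[idx], np.full(n, 1.0 / n)
    return particles, weights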
Chapter 29
Bandits
A (multi-armed) bandit problem is the most basic example of a sequential decision problem with
a trade-off between exploration and exploitation. A gambler (or player, or forecaster) is facing a
number of options (or actions). At each time step, the player chooses an option and receives a
reward (or a payoff). The goal is to maximize the total sum of rewards obtained in a sequence
of allocations. A tradeoff between exploration and exploitation arises: the player must balance
the exploitation of actions that did well in the past and the exploration of actions that could
give higher reward in the future. The name “bandit” comes from the American slang “one-armed
bandit” that refers to a slot machine: the gambler is facing many slot machines at once in a casino
and must repeatedly choose where to insert the next coin.
Bandits have numerous applications. They have first been introduced by Thompson (1933)
for studying clinical trials (different treatments are available for a given disease, one must choose
which treatment to use on the next patient). Nowadays, they are widely used in online services
(for adapting the service to the user’s individual sequence of requests). For example, they can
be used for ad placement (determining which advertisement to display on a web page, see for
example Chapelle and Li (2011)). They can also be used in cognitive radio for opportunistic
spectrum access (Jouini et al., 2012). They can also be used less directly. For example, they
are at the core of the MoGo program (Gelly et al., 2006) that plays Go at world-class level (see
also Munos (2014)).
The rest of this chapter is organized as follows. In Sec. 29.1, we formalize the stochastic
bandit problem. In Sec. 29.2, we explain the idea of “optimism in the face of uncertainty” and
introduce concentration inequalities. In Sec. 29.3 we present the classical UCB (Upper Confidence
Bound) strategy and prove its effectiveness. In Sec. 29.4, we discuss briefly other kinds of bandits
and problems. The material presented in this chapter is largely inspired from the monograph
of Bubeck and Cesa-Bianchi (2012).
Write µ* = max_{1≤i≤K} µ_i the highest expectation and i* ∈ argmax_{1≤i≤K} µ_i the corresponding arm (which is not necessarily unique).
The ideal (but unreachable) strategy would consist in choosing systematically I_t = i*. Therefore, the quality of a strategy can be measured with the regret, defined as the cumulative difference
(in expectation) between the optimal arm and chosen arms1 after n rounds:
" n #
X
Rn = nµ∗ − E µIt . (29.1)
t=1
The better the strategy, the lower the regret. So, we should design the sequential decisions so as to minimize this quantity.
Next, we formulate this regret differently. Write

  T_i(s) = ∑_{t=1}^s 1_{I_t = i}

the number of times the player selected arm i during the first s rounds, and

  ∆_i = µ* − µ_i

the suboptimality gap of arm i, so that the regret can be rewritten as

  R_n = ∑_{i=1}^K ∆_i E[T_i(n)].      (29.2)

Therefore, a good strategy should control E[T_i(n)] for i ≠ i*, the (expected) number of times a suboptimal arm is played.
In Fig. 29.1, arm 2 is clearly better than arm K (as the lower bound of the interval of arm 2 is higher than the upper bound of arm K), but it is more difficult to tell which of arms 2 and 3 is the better. Optimism in the face of uncertainty consists in acting greedily with respect to the most “favorable” case, here acting greedily with respect to the highest upper bound of the arms' confidence intervals. In Fig. 29.1, optimism in the face of uncertainty consists in choosing arm 3. Computing these confidence intervals will be done next through the use of a concentration inequality. The related strategy and the analysis of its regret are provided in Sec. 29.3.
Let X1, ..., Xn be i.i.d. (independent and identically distributed) random variables. Write µ = E[X1] their common expectation and µ_n the related empirical mean:

  µ_n = (1/n) ∑_{i=1}^n X_i.
Typically, these random variables are the rewards obtained for pulling a given arm n times. The
question we would like to answer is: how close is µn to µ (with some probability)? We first give a
general answer to this question before instantiating it to the case of bounded random variables.
Theorem 29.1 (Hoeffding's inequality (Hoeffding, 1963; Bubeck and Cesa-Bianchi, 2012)). Assume that there exists a convex function ψ : R+ → R+ such that

  ∀λ ≥ 0, ln E[e^{λ(X1−µ)}] ≤ ψ(λ)  and  ln E[e^{λ(µ−X1)}] ≤ ψ(λ).

Then, writing ψ*(ε) = sup_{λ≥0} (λε − ψ(λ)) the Legendre-Fenchel transform of ψ:

  P(µ_n − µ ≥ ε) ≤ e^{−nψ*(ε)}  and  P(µ − µ_n ≥ ε) ≤ e^{−nψ*(ε)}.
Before proving this result (called a concentration inequality, as it states how the empirical mean concentrates around the expectation), we give some intuitions about its meaning. The moment condition (the assumption about the existence of the ψ function) provides information about the tail of the distribution, notably how it concentrates around its mean. For example, if X ∼ N(µ, σ²) (X is Gaussian of mean µ and variance σ²), it is a standard result of probability theory that for any λ ∈ R we have ln E[exp(λ(X − µ))] = λ²σ²/2. We will see later that we have a similar result for bounded random variables. The Legendre-Fenchel transform is a standard tool in convex optimization (how it is introduced will be clear in the proof). For example, for ψ(λ) ∝ λ², we have ψ*(ε) ∝ ε². Next, we explain why this result is indeed a confidence interval. A direct corollary of
this theorem is

  P( |µ_n − µ| ≥ ψ*⁻¹( (1/n) ln(2/δ) ) ) ≤ δ.

This is called a PAC (Probably Approximately Correct) result. It is indeed a confidence interval, as it says that with probability at least 1 − δ we have

  µ ∈ [ µ_n − ψ*⁻¹( (1/n) ln(2/δ) ) , µ_n + ψ*⁻¹( (1/n) ln(2/δ) ) ].
This is exactly the kind of result we were looking for. Next, we prove the theorem.
Proof of Th. 29.1. The proof is based on what is called a Chernoff argument. Let λ > 0; we have

  P(µ_n − µ ≥ ε) = P( ∑_{i=1}^n (X_i − µ) ≥ nε ) = P( e^{λ ∑_{i=1}^n (X_i − µ)} ≥ e^{λnε} ).
Recall that Markov's inequality states2 that if Y is a positive random variable and c a positive constant, we have

  P(Y ≥ c) ≤ E[Y] / c.
Therefore, we have

  P( e^{λ ∑_{i=1}^n (X_i − µ)} ≥ e^{λnε} ) ≤ e^{−nλε} E[ e^{λ ∑_{i=1}^n (X_i − µ)} ]      (by Markov)
    = e^{−λnε} ∏_{i=1}^n E[ e^{λ(X_i − µ)} ]      (by independence)
    = e^{−λnε} ( E[ e^{λ(X_1 − µ)} ] )^n      (r.v. are i.d.)
    = e^{−n(λε − ln E[exp(λ(X_1 − µ))])}
    ≤ e^{−n(λε − ψ(λ))}.
2 Refer to any basic course on probability. We give the proof for completeness. We have

  E[Y] = ∫ Y dP = ∫_{Y<c} Y dP + ∫_{Y≥c} Y dP ≥ ∫_{Y≥c} Y dP ≥ ∫_{Y≥c} c dP = c P(Y ≥ c).
This being true for any λ > 0, it is true for the minimizer, thus

  P(µ_n − µ ≥ ε) ≤ e^{−n sup_{λ>0} (λε − ψ(λ))} = e^{−nψ*(ε)}.

This shows the first inequality, the proof for the second one being the same.
When introducing the stochastic bandit problem in Sec. 29.1, we assumed that the rewards are bounded (this is not mandatory, but usual). In this case, the bound can be instantiated thanks to the following lemma, due to Hoeffding (1963), that specifies ψ in this case.
Lemma 29.1 (Hoeffding (1963)). Let Y be a random variable such that E[Y] = 0 and c ≤ Y ≤ d almost surely3. Then, for any s ≥ 0, we have

  E[e^{sY}] ≤ e^{s²(d−c)²/8}.
Proof. Let s > 0. The function x ↦ e^{sx} is convex, thus e^{s(tx+(1−t)y)} ≤ t e^{sx} + (1 − t) e^{sy}. Notice also that for any x,

  x = [(d − x)/(d − c)] c + [(x − c)/(d − c)] d.

Therefore, we have for the r.v. Y:

  e^{sY} ≤ [(d − Y)/(d − c)] e^{sc} + [(Y − c)/(d − c)] e^{sd}.
Taking the expectation (recall that E[Y] = 0):

  E[e^{sY}] ≤ [d/(d − c)] e^{sc} + [−c/(d − c)] e^{sd}
            = e^{sc} ( d/(d − c) + [−c/(d − c)] e^{s(d−c)} ).

Define p = −c/(d − c) > 0 (which implies that d/(d − c) = 1 − p) and u = s(d − c). Therefore, sc = −pu and we can write

  E[e^{sY}] ≤ e^{−pu} (1 − p + p e^u) = e^{ϕ(u)}    with    ϕ(u) = −pu + ln(1 − p + p e^u).
We will bound ϕ(u). We have that ϕ(0) = 0 and ϕ'(0) = 0. The second derivative is

  ϕ''(u) = p e^u (1 − p) / (1 − p + p e^u)² ≤ 1/4.

For this last statement, note that by writing t = p e^u / (1 − p + p e^u) > 0 we have ϕ''(u) = t(1 − t), which is obviously bounded by 1/4. From the Taylor-Lagrange formula, there exists a ξ such that

  ϕ(u) = ϕ(0) + ϕ'(0) u + ϕ''(ξ) u²/2 ≤ u²/8 = s²(d − c)²/8,

which proves the result.
From this, we have a direct corollary of Th. 29.1.
Corollary 29.1 (Hoeffding (1963)). Assume that 0 ≤ X1 ≤ 1 almost surely. Then we have

  P(µ_n − µ ≥ ε) ≤ e^{−2nε²}  and  P(µ − µ_n ≥ ε) ≤ e^{−2nε²}.
3 Obviously, c ≤ 0 ≤ d.
We will next apply these results to derive a strategy for the bandit problem.
As we are looking for an upper bound on µ_i, we have from Th. 29.1 that with probability at least 1 − δ,

  µ_i < µ_{i,s} + ψ*⁻¹( (1/s) ln(1/δ) ),

where µ_{i,s} denotes the empirical mean of arm i after s pulls.
With the choice δ = 1/t^α, where α > 0 is an input parameter (and with s = T_i(t−1), the number of times arm i has been played before round t), we obtain the so-called (α, ψ)-UCB strategy of Bubeck and Cesa-Bianchi (2012):

  I_t ∈ argmax_{1≤i≤K} [ µ_{i,T_i(t−1)} + ψ*⁻¹( α ln t / T_i(t−1) ) ].
If the rewards are bounded in [0, 1], we obtain the original UCB (Upper Confidence Bound) strategy of Auer et al. (2002) (using the result in the proof of Cor. 29.1, that is ψ*(ε) = 2ε² ⇔ ψ*⁻¹(u) = √(u/2)):

  I_t ∈ argmax_{1≤i≤K} [ µ_{i,T_i(t−1)} + √( α ln t / (2 T_i(t−1)) ) ].
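A minimal sketch of this UCB strategy for rewards bounded in [0, 1]; pull(i) is an assumed callback returning one reward for arm i, and the other names are illustrative:

import numpy as np

def ucb(pull, K, n, alpha=2.1):
    """UCB over n rounds with K arms and bounded rewards."""
    counts = np.zeros(K)          # T_i(t-1)
    means = np.zeros(K)           # empirical means
    for t in range(1, n + 1):
        if t <= K:
            i = t - 1             # play each arm once to initialize
        else:
            bonus = np.sqrt(alpha * np.log(t) / (2 * counts))
            i = int(np.argmax(means + bonus))
        r = pull(i)
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]   # incremental mean update
    return means, counts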
Therefore, we end up with a simple strategy that only requires updating empirical means as arms are pulled, and acting greedily with respect to the above quantity (which is the empirical mean plus a kind of exploration bonus). An important question is to know what regret is suffered by these strategies. The answer is given by the next theorem.
Theorem 29.2 (Auer et al. (2002); Bubeck and Cesa-Bianchi (2012)). Assume that the reward distributions satisfy the assumption of Th. 29.1. Then the (α, ψ)-UCB strategy with α > 2 satisfies

  R_n ≤ ∑_{i:∆_i>0} ( [α ∆_i / ψ*(∆_i/2)] ln n + α/(α − 2) ).
If rewards are bounded in [0, 1], the bound on the regret simplifies as (using the fact that ψ*(ε) = 2ε²):

  R_n ≤ ∑_{i:∆_i>0} ( (2α/∆_i) ln n + α/(α − 2) ).
This tells that each suboptimal arm is chosen no more than a logarithmic number of times (ln n), and that arms close to the optimal one are chosen more often (the 1/∆_i factor). We prove the result now.
Proof of Th. 29.2. Assume without loss of generality that I_t = i ≠ i*. Then, at least one of the three following equations must be true:

  µ_{i*,T_{i*}(t−1)} + ψ*⁻¹( α ln t / T_{i*}(t−1) ) ≤ µ*      (29.4)
  µ_{i,T_i(t−1)} > µ_i + ψ*⁻¹( α ln t / T_i(t−1) )      (29.5)
  T_i(t−1) < α ln n / ψ*(∆_i/2)      (29.6)
Eq. (29.4) states that the upper-bound for the optimal arm is below the associated mean, Eq. (29.5)
states that the lower-bound for the considered arm is above the associated mean and Eq. (29.6)
means that arm i has not been pulled enough. If the three equations were false, we would have
  µ_{i*,T_{i*}(t−1)} + ψ*⁻¹( α ln t / T_{i*}(t−1) ) > µ*      (by (29.4) false)
     = µ_i + ∆_i      (by def. of ∆_i)
     ≥ µ_i + 2ψ*⁻¹( α ln t / T_i(t−1) )      (by (29.6) false)
     ≥ µ_{i,T_i(t−1)} + ψ*⁻¹( α ln t / T_i(t−1) )      (by (29.5) false),

which implies that I_t ≠ i, which is a contradiction.
Recall that for controlling the regret it is enough to control the (expected) number of times each arm has been played (Eq. (29.1) being equivalent to Eq. (29.2)). Define u as

  u = ⌈ α ln n / ψ*(∆_i/2) ⌉.
We have that

  E[T_i(n)] = E[ ∑_{t=1}^n 1_{I_t=i} ]
            ≤ u + E[ ∑_{t=u+1}^n 1_{I_t=i and (29.6) is false} ]
            ≤ u + E[ ∑_{t=u+1}^n 1_{(29.4) or (29.5) is true} ]
            ≤ u + ∑_{t=u+1}^n ( P((29.4) is true) + P((29.5) is true) ).
One can then show that P((29.4) is true) ≤ 1/t^{α−1}, where one uses a union bound and Hoeffding with δ = t^{−α}. The same bound holds for the probability of event (29.5):

  P((29.5) is true) ≤ 1/t^{α−1}.
Putting everything together, one obtains

  E[T_i(n)] ≤ α ln n / ψ*(∆_i/2) + α/(α − 2),

which, combined with Eq. (29.2), gives the announced regret bound.
Chapter 30
Reinforcement learning
Reinforcement learning (RL) can be broadly seen as the machine learning answer to the control problem. In this paradigm, an agent interacts with the world by taking actions and observing the resulting configuration of the world (in a sequential manner). This agent receives (numerical) rewards (given by some oracle) that are local information about the quality of the control. The aim of this agent is to learn the sequence of actions that maximizes some notion of cumulative reward. This chapter provides an introduction to the field of reinforcement learning; more can be found in reference textbooks (Sutton and Barto, 1998; Bertsekas and Tsitsiklis, 1996; Szepesvári, 2010; Sigaud and Buffet, 2013). This field is inspired by behavioral psychology (this explains part of the vocabulary, such as the notion of reward) and has connections to computational neuroscience, yet this chapter focuses on the mathematical and learning aspects.
30.1 Introduction
In reinforcement learning, an agent interacts with a system (sometimes called the environment, or the world), as exemplified in Fig. 30.1. At each discrete time step, the system is in a given state (or configuration) that can be observed by the agent. Based on this state, the agent applies some action. Following this action, the system transits to a new state and the agent receives a numerical reward from an oracle, this reward being a local clue of the quality of the control. The goal of the agent is to take sequential decisions so as to maximize some notion of cumulative reward, typically the sum of rewards gathered along the followed path. An important thing to understand is that the rewards quantify the goal of the control, and not how this goal should be reached (this is what the agent has to learn).
To clarify this paradigm, consider the simple example of a robot in a maze. The goal of the
robot is to find the shortest path to the exit. The state of the system is the current position of the robot in the maze. Four actions are available, one for each direction. Choosing such an action amounts to moving in the required direction. In this problem, the reward can be −1 for any move, except for the move that leads to the exit, which is rewarded by 0. Notice that the only informative reward is given for going through the exit. Here, for any path leading to the exit, the sum of rewards is the negative of the length of the path. Therefore, maximizing the sum of rewards is equivalent to finding the shortest path to the exit.
A first issue is to formalize mathematically such a control problem. In reinforcement learning,
this is widely done thanks to Markov Decision Processes (MDPs), to be presented in Sec. 30.2. A
second issue is to compute the best possible control when the model (the MDP) is known, which
is addressed by Dynamic Programming (DP), to be presented in Sec. 30.3. Consider again the
maze problem. A smart strategy consists in starting from the exit, and then retro-propagating
the possible paths until reaching the starting point. This is roughly what DP does. Puterman
(1994) and Bertsekas (1995) provide reference textbooks on MDPs and DP. A third problem is
to estimate this optimal strategy from data (interaction data between the agent and the system),
when the model is unknown (this is reinforcement learning). This is addressed in Sec. 30.4-30.6.
30.2 Formalism
In the sequel, we write ∆_X the set of probability measures over a discrete set X and Y^X the set of applications from X to Y. A Markov Decision Process (MDP) is a tuple {S, A, P, r, γ} where:
• S is the (finite) state space1 and A is the (finite) action space2;
• P ∈ ∆_S^{S×A} is the Markovian transition kernel. The term P(s'|s, a) denotes the probability of transiting to state s' given that action a was chosen in state s. The transition kernel is Markovian because the probability to go to s' depends on the fact that action a was chosen in state s, but it does not depend on the path followed to reach this state s. This assumption is at the core of everything presented here3;
• r ∈ R^{S×A} is the reward function4; it associates the reward r(s, a) with taking action a in state s. The reward function is assumed to be uniformly bounded;
• γ ∈ (0, 1) is a discount factor that favors shorter-term rewards (see the definition of the value function, later). The closer γ is to 1, the more importance we give to far (in time) rewards. Usually, this parameter is set to a value close to 1.
So, the system is in state s ∈ S, the agent chooses an action a ∈ A and gets the reward r(s, a), then the system transits stochastically to a new state s', this new state being drawn from the conditional probability P(.|s, a).
1 We will assume larger (countable or even infinite compact) state spaces later.
2 It is quite difficult to consider large action spaces, but see Sec. 30.6
3 Consider the maze example of Sec. 30.1. If the robot knows its position, the dynamics are indeed Markovian. If the agent has only a partial observation (for example, it knows whether there are walls around it, but no more), the
dynamics are no longer Markovian. This is known as partially observable MDPs, see for example Kaelbling et al.
(1998). This topic will not be covered in this chapter, but note that the general strategy consists in computing
something which is Markovian.
4 One can define more generally the reward function as r ∈ R^{S×A×S}, that is, giving a reward r(s, a, s') for each transition from s to s' under action a. However, one can define an expected reward function as r̄(s, a) = ∑_{s'∈S} P(s'|s, a) r(s, a, s'), the mean reward for choosing action a in state s. As only this mean reward will be of importance in the following, we keep the notations simple.
A (deterministic) policy π ∈ A^S maps each state to an action5. Its quality is quantified by its value function vπ ∈ R^S, defined as

  vπ(s) = E[ ∑_{t=0}^∞ γ^t r(S_t, π(S_t)) | S_0 = s, S_{t+1} ∼ P(.|S_t, π(S_t)) ].      (30.1)
In other words, if the agent starts in state s and keeps following the policy π (that is, taking the action given by π whenever a decision is required), it will receive a sequence of rewards. The discounted cumulative reward of this state is the sum of the rewards gathered along the (infinite) trajectory, the reward received at the t-th time step being discounted by γ^t (this favors closer rewards and allows the sum to be finite). This quantity being random (due to the transitions being random), the value of the state is defined as the expectation of the discounted cumulative reward. There exist other criteria for quantifying the quality of a policy, but they will not be considered here6.
A value function allows quantifying the quality of a policy, and it allows comparing policies as
follows:
π1 ≥ π2 ⇔ ∀s ∈ S, vπ1 (s) ≥ vπ2 (s).
Notice that this is a partial ordering, thus two policies might not be comparable. Solving an MDP
means computing the optimal policy π∗ satisfying vπ∗ ≥ vπ , for all π ∈ AS . In other words, the
optimal policy satisfies
π∗ ∈ argmax vπ .
π∈AS
It is possible to show that such a policy exists (the result is admitted). Before showing how such
an optimal policy can be computed (see Sec. 30.3), we develop the notion of value function.
5 More precisely, it is a deterministic policy. One can consider more generally a stochastic policy, that is π ∈ ∆_A^S: for each state s, π(.|s) is a distribution over actions. This will be useful in Sec. 30.6, but for now deterministic policies are enough.
6 Still, we can mention the finite-horizon criterion, defined for a given horizon H, the associated value function being

  vπ(s) = E[ ∑_{t=0}^H r(S_t, π(S_t)) | S_0 = s, S_{t+1} ∼ P(.|S_t, π(S_t)) ],

or the average criterion, the corresponding value being

  vπ(s) = lim_{H→∞} (1/H) E[ ∑_{t=0}^H r(S_t, π(S_t)) | S_0 = s, S_{t+1} ∼ P(.|S_t, π(S_t)) ].
The value function satisfies the Bellman evaluation equation

  ∀s ∈ S, vπ(s) = r(s, π(s)) + γ ∑_{s'∈S} P(s'|s, π(s)) vπ(s').      (30.2)

In other words, computing the value of state s can be done by knowing the value of the possible next states (and the probability to reach these states). Notice that this is a linear system. To see this more clearly, first notice that a function from R^S can be seen as a vector, and vice-versa (the state space being finite). We introduce Pπ ∈ R^{S×S} and rπ ∈ R^S, defined as

  Pπ(s, s') = P(s'|s, π(s))  and  rπ(s) = r(s, π(s)).

The term Pπ is a stochastic matrix (each row sums to one) and rπ is a vector. Using this notation, Eq. (30.2) can be written as

  vπ = rπ + γ Pπ vπ,  that is,  vπ = (I − γPπ)⁻¹ rπ,      (30.3)

where I is the identity matrix. Notice that Pπ being a stochastic matrix, its spectrum (largest eigenvalue) is bounded by 1, and as γ < 1, the matrix (I − γPπ) is indeed invertible.
We can now introduce the Bellman evaluation operator7 Tπ : R^S → R^S, defined as

  ∀v ∈ R^S, Tπ v = rπ + γ Pπ v.
This affine operator applies to any function v ∈ RS (not necessarily a value function corresponding
to a policy) and Eq. (30.3) shows that vπ is the unique fixed point of this operator8 :
vπ = Tπ vπ .
Therefore, we have a tool to compute the value function of any policy, which proves useful for quantifying its quality. However, we would also like to characterize directly the optimal policy π∗ (more precisely, the related value function).
Write v∗ = vπ∗ the value function associated with an optimal policy (called the optimal value function). Assume that the optimal value function v∗ is known, but not the optimal policy π∗. For any state, this policy should take the action that leads to the highest possible value, that is

  π∗(s) ∈ argmax_{a∈A} ( r(s, a) + γ ∑_{s'∈S} P(s'|s, a) v∗(s') ).      (30.4)
We say that π∗ is greedy with respect to v∗. Therefore, knowing v∗, one can compute π∗. The remaining problem is to characterize v∗. We have seen in the evaluation problem that knowing the value of the next state is sufficient to compute the value of the current state (see Eq. (30.2)). The same principle can be applied to compute the optimal value. Knowing the optimal value of the next state, the optimal value of the current state is the one that maximizes the sum of the immediate reward and of the discounted optimal value of the next state:

  ∀s ∈ S, v∗(s) = max_{a∈A} ( r(s, a) + γ ∑_{s'∈S} P(s'|s, a) v∗(s') ).
To see if this problem admits a solution, we introduce the Bellman optimality operator T∗ : R^S → R^S, defined as

  ∀v ∈ R^S, T∗ v = max_{π∈A^S} (rπ + γ Pπ v).      (30.5)
This operator is a contraction in supremum norm9 . Therefore, thanks to the Banach theorem10 ,
T∗ admits v∗ as its unique fixed point:
v∗ = T∗ v∗ .
Next, we show how the optimal policy (or equivalently the optimal value function, as shown in Eq. (30.4)) can be computed.
30.3 Dynamic programming
A first approach is linear programming: the optimal value function is the solution of the linear program

  min_{v∈R^S} 1ᵀ v    subject to    v ≥ T∗ v.      (30.7)

We start by explaining why v∗ is indeed the solution of this linear program, then we express it in a less compact form.
Recall that the operator T∗ applies to any v ∈ R^S and that it is defined as T∗ v = max_{π∈A^S} (rπ + γPπ v) (see Eq. (30.5)). Let π be any policy and v be such that v ≥ T∗ v. By the definition of T∗, we have that v ≥ rπ + γPπ v (the inequality is true for any policy, in particular for π). We can repeatedly apply this inequality:

  v ≥ rπ + γPπ v      (30.8)
    ≥ rπ + γPπ (rπ + γPπ v)
    ≥ ... ≥ ∑_{t=0}^∞ γ^t Pπ^t rπ      (30.9)
    = (I − γPπ)⁻¹ rπ = vπ,      (30.10)
9 Let u, v ∈ R^S and write s any state. Assume without loss of generality that [T∗v](s) ≥ [T∗u](s). Write also a^s_∗ ∈ argmax_{a∈A} (r(s, a) + γ ∑_{s'∈S} P(s'|s, a) v(s')). We have

  0 ≤ [T∗v](s) − [T∗u](s) ≤ γ ∑_{s'∈S} P(s'|s, a^s_∗)(v(s') − u(s')) ≤ γ‖v − u‖∞.

This being true for any s, we have ‖T∗v − T∗u‖∞ ≤ γ‖v − u‖∞.
10 If an operator T is a contraction, it admits a unique fixed point, which can be constructed as the limit of the sequence v_{k+1} = T v_k, for any initialization v_0 (Banach fixed-point theorem).
where Eq. (30.9) is obtained by repeatedly applying inequality (30.8) and Eq. (30.10) uses the fact11 that (I − γPπ)⁻¹ = ∑_{t=0}^∞ γ^t Pπ^t and that vπ = (I − γPπ)⁻¹ rπ (recall Eq. (30.3)). This being
v ≥ T∗ v ⇒ v ≥ v∗ .
Moreover, v ≥ v∗ implies that 1> v ≥ 1> v∗ . Consequently, minimizing 1> v under the constraint
that v ≥ T∗ v provides the optimal value function.
Algorithm 38 (linear programming)
1: Solve the linear program
     min_{v∈R^S} ∑_{s∈S} v(s)    subject to    ∀(s, a) ∈ S × A, v(s) ≥ r(s, a) + γ ∑_{s'∈S} P(s'|s, a) v(s')
   and get v∗.
2: return the policy π∗ defined as
     π∗(s) ∈ argmax_{a∈A} ( r(s, a) + γ ∑_{s'∈S} P(s'|s, a) v∗(s') ).
The optimal policy can be computed from v∗ as the greedy policy (recall Eq. (30.4)). The linear
programming approach to solving MDP is summarized in Alg. 38 (observe that this formulation is
indeed equivalent to Eq. (30.7)). This program has |S| variables and |S × A| constraints.
Therefore, the Banach fixed-point theorem states that for any initialization v0, the sequence defined as

  v_{k+1} = T∗ v_k

will converge to v∗: lim_{k→∞} v_k = v∗. This provides a simple algorithm for computing v∗. However, the convergence is asymptotic, and one should stop the iterations at some point. A natural stopping criterion is to check whether two subsequent functions are close enough (in supremum norm), that is ‖v_{k+1} − v_k‖∞ ≤ ε, for a user-defined value of ε.
Doing so, we obtain a function v_k ∈ R^S, which is not necessarily a value function (as there is no reason that it corresponds to a policy). Yet, what we are interested in is finding a control policy. The notion of greedy policy can be extended to any function v ∈ R^S. We define a policy π to be greedy with respect to a function v, noted π ∈ G(v), as follows:

  π ∈ G(v) ⇔ Tπ v = T∗ v ⇔ ∀s ∈ S, π(s) ∈ argmax_{a∈A} ( r(s, a) + γ ∑_{s'∈S} P(s'|s, a) v(s') ).
The output of the algorithm is thus simply πk ∈ G(vk ). This method is called value iteration and
is summarized in Alg. 39.
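As an illustration, here is a minimal sketch of value iteration for a small finite MDP given as NumPy arrays (P of shape |S|×|A|×|S| and r of shape |S|×|A|); the names are illustrative:

import numpy as np

def value_iteration(P, r, gamma, eps=1e-6):
    """Value iteration: iterate the Bellman optimality operator, then return
    a greedy policy with respect to the last iterate."""
    n_states, n_actions, _ = P.shape
    v = np.zeros(n_states)
    while True:
        # [T* v](s) = max_a r(s,a) + gamma * sum_s' P(s'|s,a) v(s')
        q = r + gamma * P @ v          # shape (|S|, |A|)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) <= eps:
            break
        v = v_new
    policy = q.argmax(axis=1)          # greedy policy
    return policy, v_new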
The stopping criterion makes sense: the iterations are stopped when they do not involve much change. Yet, we can wonder how close the computed v_k is to the optimal value function v∗. We
11 This is a Taylor expansion, valid because Pπ is a stochastic matrix and γ is strictly bounded by 1. It can also be checked algebraically.
5: end for
6: k ← k + 1
7: until ‖v_{k+1} − v_k‖∞ ≤ ε
8: return a policy π_k ∈ G(v_k):
     π_k(s) ∈ argmax_{a∈A} ( r(s, a) + γ ∑_{s'∈S} P(s'|s, a) v_k(s') ).
or more compactly that Tπ' vπ = T∗ vπ ≥ Tπ vπ = vπ. Yet, this is not enough to tell that π' is better than π (that is, that vπ' ≥ vπ).
This result is indeed true; it can be shown as follows:

  Tπ' vπ = T∗ vπ ≥ Tπ vπ = vπ
  ⇔ rπ' + γPπ' vπ ≥ vπ
  ⇔ rπ' + γPπ' vπ' + γPπ' (vπ − vπ') ≥ vπ
  ⇔ (I − γPπ') vπ' ≥ (I − γPπ') vπ
  ⇒ vπ' ≥ vπ.
We used notably the fact that rπ0 + γPπ0 vπ0 = Tπ0 vπ0 = vπ0 and (for the last line) the fact that
pre-multiplying a componentwise inequality by a positive matrix does not change the inequality12 .
This suggests the following algorithm. Choose an initial policy π0 and then iterate as follows:
1. solve Tπk vπk = vπk (this is called the policy evaluation step);
2. compute πk+1 ∈ G(vπk ) (this is called the policy improvement step).
We have shown that vπk+1 ≥ vπk , so either vπk+1 > vπk or vπk+1 = vπk . In this equality case, we
have
T∗ vπk = Tπk+1 vπk = Tπk+1 vπk+1 = vπk+1 = vπk .
This means that vπk is the fixed point of T∗ , and thus that vπk = v∗ , that is πk+1 = π∗ is the
optimal policy. This suggests stopping the iterations when vπ_{k+1} = vπ_k. Moreover, the number of policies being finite (it is |A|^{|S|}), the number of iterations is finite (bounded by the number of policies). This method is called policy iteration and is summarized in Alg. 40.
Each iteration of policy iteration has a higher computational cost than one of value iteration (as a linear system has to be solved), but it converges in finite time. Moreover, it often converges empirically very quickly (only a few iterations are required).
would be false for an arbitrary matrix). Here, the considered matrix is (I − γPπ')⁻¹ = ∑_{t≥0} γ^t Pπ'^t; all its elements are obviously positive.
13 Consider the Tetris game, the state being the current board; actions correspond to placing falling tetrominoes, the reward is +1 for removing a line, and the size of the state space is 2^200 for a 10 × 20 board, which is far too huge to be handled by a machine.
14 We have considered finite state spaces so far. Yet, all involved sums are expectations. For example, the Bellman evaluation operator is [Tπ v](s) = r(s, π(s)) + γ ∑_{s'∈S} P(s'|s, π(s)) v(s'). With a continuous state space, it can simply be written as [Tπ v](s) = r(s, π(s)) + γ ∫_S v(s') P(ds'|s, π(s)). More abstractly, it can be written as [Tπ v](s) = r(s, π(s)) + γ E_{S'∼P(.|s,π(s))}[v(S')], which encompasses both cases. Up to some technicalities, what we have presented so far remains true for the continuous case.
5: k ←k+1
6: until vk+1 = vk
7: return the policy πk+1 = π∗
where θ are the parameters to be learnt and φ : S → R^d is a predefined feature vector (the vector whose components are the d user-defined basis functions φ_i(s));
• the model might be unknown and one has to rely on a dataset of the type
D = {(si , ai , ri , s0i )1≤i≤n }, (30.11)
where action ai is taken in state si (according to a given policy, to a random policy, or
something else), the reward satisfies ri = r(si , ai ) and the next state is sampled according
to the dynamics, s0i ∼ P (.|si , ai ). There are multiple ways this dataset can be obtained.
For example, it can be given beforehand (batch learning, the main case we consider in this
section). It can also be gathered in an online fashion by the agent that tries to learn the
optimal control (a case we address in Sec. 30.5). It can also be obtained in a somehow
controlled manner if one has access to a simulator (which does not mean that the model
is known). Here is an example of how this data can be used: first, notice that almost
everything turns around Bellman operators, and that without the model, they cannot be
computed. However, they can still be approximated from data. Assume that ai = π(si ) for
a policy of interest π. One can consider the sampled operator
[T̂π v](si ) = ri + γv(s0i ).
This is an unbiased estimate of the evaluation operator, as E[[T̂π v](si )|si ] = ES 0 ∼P (.|si ,ai ) [ri +
γv(S 0 )] = [Tπ v](si ).
In this section, we will study approximate value and policy iteration algorithms that handle
these problems. There exist also approximate linear programming approaches, extending the one
presented in Sec. 30.3.1, but we will not discuss them here (as they are restrictive, in some sense).
The interested reader can nevertheless refer (notably) to de Farias and Van Roy (2003, 2004) for
more about this.
Computing a greedy policy is a key step of dynamic programming (to compute the optimal policy from the optimal value function or to improve a policy in the policy iteration scheme). However, computing a greedy policy requires knowing the model, as

  π ∈ G(v) ⇔ ∀s ∈ S, π(s) ∈ argmax_{a∈A} ( r(s, a) + γ ∑_{s'∈S} P(s'|s, a) v(s') ).
Assume that we can estimate the optimal value function from the data (thus, without knowing the model). This is a first step, but what we are interested in is a good control policy. Deducing a greedy policy from this estimated value function would not (or at least hardly) be possible from data only.
Another problem with value functions is that the Bellman optimality operator cannot be sampled as easily as the Bellman evaluation operator. We have just seen that [T̂π v](s_i) = r_i + γ v(s'_i) is an unbiased estimate of the evaluation operator. Recall the definition of the optimality operator (see Eq. (30.6)):

  [T∗ v](s) = max_{a∈A} ( r(s, a) + γ ∑_{s'∈S} P(s'|s, a) v(s') ).

To define a sampled operator T̂∗, one would need to sample all actions (and related next states) for all states s_i in the dataset15 (in order to compute the max). Write s'_{i,a} a next state sampled according to P(.|s_i, a). One could consider the following sampled operator

  [T̂∗ v](s_i) = max_{a∈A} ( r(s_i, a) + γ v(s'_{i,a}) ).

Anyway, this estimator would be biased (as the expectation of a max is not the max of an expectation, E[[T̂∗ v](s_i)|s_i] ≠ [T∗ v](s_i)).
There is a simple solution to alleviate these problems, namely the state-action value function, also called Q-function or quality function. For a given policy π, the state-action value function Qπ ∈ R^{S×A} associates to each state-action pair the expected discounted cumulative reward for starting in this state, taking this action (which might be different from the action advised by the policy) and following the policy π afterwards:

  ∀(s, a) ∈ S×A, Qπ(s, a) = E[ ∑_{t=0}^∞ γ^t r(S_t, A_t) | S_0 = s, A_0 = a, S_{t+1} ∼ P(.|S_t, A_t), A_{t+1} = π(S_{t+1}) ].
Roughly speaking, this adds a degree of freedom to the definition of the value function by letting
free the choice of the first action. Notably, it is clear from this definition that value and Q-functions
are related as follows:
vπ (s) = Qπ (s, π(s)).
A Bellman evaluation operator can easily be defined (we use the same notation, which is slightly abusive as it operates on a different object) as Tπ : R^{S×A} → R^{S×A} such that for Q ∈ R^{S×A} we have componentwise:

  ∀(s, a) ∈ S × A, [Tπ Q](s, a) = r(s, a) + γ ∑_{s'∈S} P(s'|s, a) Q(s', π(s')).

The state-action value function Qπ is the unique fixed point of the operator Tπ (this operator being a γ-contraction):

  Qπ = Tπ Qπ.
Therefore, computing the Q-function of a policy π also amounts to solving a linear system. The optimal quality function Q∗ = Qπ∗ also satisfies a fixed-point equation. Let us define the Bellman optimality operator T∗ : R^{S×A} → R^{S×A} such that for Q ∈ R^{S×A} we have componentwise:

  ∀(s, a) ∈ S × A, [T∗ Q](s, a) = r(s, a) + γ ∑_{s'∈S} P(s'|s, a) max_{a'∈A} Q(s', a').

The optimal state-action value function Q∗ is the unique fixed point of the operator T∗ (this operator being a γ-contraction):

  Q∗ = T∗ Q∗.
Notice that the optimal value and quality functions are related as follows: v∗(s) = max_{a∈A} Q∗(s, a).
A first advantage of the quality function is that it allows computing a greedy policy without knowing the model. Indeed, for a policy improvement step (to compute a policy π' greedy with respect to vπ, recalling that vπ(s) = Qπ(s, π(s))), we have

  π' ∈ G(vπ) ⇔ ∀s ∈ S, π'(s) ∈ argmax_{a∈A} ( r(s, a) + γ ∑_{s'∈S} P(s'|s, a) vπ(s') )
             ⇔ ∀s ∈ S, π'(s) ∈ argmax_{a∈A} Qπ(s, a).
If one is able to compute the optimal quality function, it is possible to compute the optimal policy as being greedy with respect to it: π∗(s) ∈ argmax_{a∈A} Q∗(s, a).
In all cases, we define a policy π as being greedy with respect to a function Q ∈ R^{S×A} (which is not necessarily the state-action value function of a policy) as

  π ∈ G(Q) ⇔ ∀s ∈ S, π(s) ∈ argmax_{a∈A} Q(s, a).

When working with data (with the dataset given in Eq. (30.11)), both operators can be sampled. The Bellman evaluation operator can be sampled as

  [T̂π Q](s_i, a_i) = r_i + γ Q(s'_i, π(s'_i)),

and the Bellman optimality operator as

  [T̂∗ Q](s_i, a_i) = r_i + γ max_{a'∈A} Q(s'_i, a').

Now, both sampled operators are unbiased (contrary to the sampled optimality operator applying to value functions).
The policy and value iteration algorithms can easily be rewritten with state-action value functions replacing value functions. For all the reasons given so far, it is customary to work with quality functions when the model is unknown. We have seen that when the state space is too large, value functions should be searched for in some hypothesis space. For example, with a linear parameterization, the quality functions would be of the form Qθ(s, a) = θᵀφ(s, a). Yet, the states are usually continuous while the actions are discrete (a less frequent case in supervised learning). A standard approach consists in defining a feature vector φ(s) for the state space, and extending it to the state-action space as follows (δ being the Dirac function):

  φ(s, a) = ( δ_{a=a_1} φ(s)ᵀ  ...  δ_{a=a_{|A|}} φ(s)ᵀ )ᵀ.

Notice that this is reminiscent of the concept of score function for cost-sensitive multiclass classification seen in Sec. 5.2.
and the aim is to estimate from this set of transitions the optimal Q-function Q∗ (from which we can estimate an optimal policy by being greedy). The Bellman optimality operator applying on R^{S×A} is a γ-contraction (the proof is very similar to the case of value functions). Therefore, thanks to the Banach theorem, the iteration

  Q_{k+1} = T∗ Q_k

will converge to Q∗. Yet, there are two problems:
• the operator T∗ cannot be applied to Q_k, the model being unknown;
• as we are working with a too large state space, the Q-functions should belong to some hypothesis space H, and there is no reason that T∗ Q_k ∈ H holds.
The first problem can be avoided by using the sampled operator presented before instead; the second one indeed corresponds to a regression problem, as shown below.
Now, we construct an approximate variant of value iteration applied to state-action value functions. Assume that we adopt a linear parametrization for the Q-function, that is, we consider the following hypothesis space

  H = {Qθ(s, a) = θᵀφ(s, a), θ ∈ R^d},

with φ(s, a) a predefined feature vector. Let θ0 be some initial parameter vector and let Q0 = Qθ0 be the associated quality function. At iteration k we have Q_k = Q_{θ_k}. We can sample the optimality operator for the state-action couples available in the dataset D:

  [T̂∗ Q_k](s_i, a_i) = r_i + γ max_{a'∈A} Q_k(s'_i, a').

So, we have n target values for the next function Q_{k+1} (corresponding to inputs (s_i, a_i)), and this function must belong to H. Finding the function Q_{k+1} is therefore a regression problem that can be solved by minimizing the risk based on the ℓ2-loss, for example. Therefore, Q_{k+1} can be computed as follows:

  Q_{k+1} ∈ argmin_{Qθ∈H} (1/n) ∑_{i=1}^n ( Qθ(s_i, a_i) − [T̂∗ Q_k](s_i, a_i) )².
Given the chosen hypothesis space, this is simply a linear least-squares problem with inputs (s_i, a_i) and outputs r_i + γ max_{a'∈A} Q_k(s'_i, a'), and simple calculus16 gives the solution:

  Q_{k+1} = Q_{θ_{k+1}}    with    θ_{k+1} = ( ∑_{i=1}^n φ(s_i, a_i) φ(s_i, a_i)ᵀ )⁻¹ ∑_{i=1}^n φ(s_i, a_i) ( r_i + γ max_{a'∈A} Q_{θ_k}(s'_i, a') ).
Alternatively, we can see this as projecting the sampled operator applied to the previous function onto the hypothesis space, which can be written more compactly as Q_{k+1} = Π T̂∗ Q_k, writing Π the projection.
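The approximate value iteration just described (linear parametrization, ℓ2 regression) can be sketched as follows; the transition list and feature map are assumed to be provided by the user, and the names are illustrative:

import numpy as np

def fitted_q_iteration(data, phi, n_actions, gamma, n_iter=50):
    """Approximate value iteration with Q_theta(s, a) = theta^T phi(s, a),
    solved by linear least squares at each iteration.
    data: list of transitions (s, a, r, s_next); phi(s, a): feature vector."""
    d = len(phi(data[0][0], data[0][1]))
    theta = np.zeros(d)
    Phi = np.array([phi(s, a) for (s, a, r, s_next) in data])      # regression inputs
    G = np.linalg.pinv(Phi.T @ Phi) @ Phi.T                        # least-squares operator
    for _ in range(n_iter):
        # targets: sampled optimality operator r_i + gamma * max_a' Q_theta(s'_i, a')
        targets = np.array([
            r + gamma * max(theta @ phi(s_next, a2) for a2 in range(n_actions))
            for (s, a, r, s_next) in data
        ])
        theta = G @ targets                                        # regression step
    return theta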
For the regression step, we have considered a quadratic risk with a linear parametrization (that is, a linear least-squares), but other regression schemes can be envisioned. Writing abstractly A the operator that gives a function from observations (such as the result of minimizing a risk, or a random forest, or something else), approximate value iteration can be written generically as (for some initialization Q0)

  Q_{k+1} = A T̂∗ Q_k.

Yet, even if the Bellman operator T∗ is a contraction, there is no reason for the composed operator A T̂∗ to be a contraction. Indeed, in the example developed before (with the linear least-squares), the operator Π T̂∗ is not a contraction, and the iterations may diverge. A sufficient condition for the operator A T̂∗ to be a contraction is to use averagers as function approximators in the regression step. We do not explain here what an averager is, but random forests and ensembles of extremely randomized trees (see Ch. 21) belong to this category. Using this kind of function approximator in the regression step therefore ensures convergence.
16 Generically, the problem is to solve min_θ (1/n) ∑_{i=1}^n (y_i − θᵀx_i)². Computing the gradient (with respect to θ) and setting it to zero provides the solution ( ∑_i x_i x_iᵀ ) θ = ∑_i x_i y_i.
3: Solve the regression problem with inputs (s_i, a_i) and outputs [T̂∗ Q_k](s_i, a_i) to get the Q-function Q_{k+1}
4: end for
5: return the greedy policy π_{K+1} ∈ G(Q_{K+1}):
     π_{K+1}(s) ∈ argmax_{a∈A} Q_{K+1}(s, a).
The generic approximate value iteration is provided in Alg. 41. An important thing to notice is
that this algorithm reduces the learning of an optimal control to a sequence of supervised learning
problems. For the definition of an averager and a discussion on necessary conditions for AT̂∗ to be
a contraction, see Gordon (1995). When the function approximator is an ensemble of extremely
randomized trees, the algorithm is known as fitted-Q (Ernst et al., 2005) and is quite efficient em-
pirically. Approximate value iteration has been experimented with other function approximators,
such as neural networks (Riedmiller, 2005) or Nadaraya-Watson estimators (Ormoneit and Sen,
2002).
So far, we have assumed that the dataset is given beforehand. The quality of this dataset is
very important for good empirical results. For example, if the states si in D are sampled from too
small a part of the state space, no algorithm will be able to recover a good controller. If one has access to
a simulator (or to the real system), it is possible to choose how data are sampled (how states si are
sampled, according to what policy actions ai are sampled, the next states s'i being imposed by the
dynamics). In this case, one can wonder what the best way to sample states is. A sensible approach
would consist in following the current policy πk+1 ∈ G(Qk+1), slightly randomized (this is linked
to what is known as the exploration-exploitation dilemma, to be discussed in Sec. 30.5). Choosing
the right distribution is a difficult problem, and we will not discuss it much further here. However,
there is an important remark: in supervised learning, the distribution is fixed beforehand (given
by the problem at hand), while in approximate dynamic programming (that is, in reinforcement
learning) only the dynamics is fixed, which can give rise to many different distributions on transitions.
Therefore, things are much more difficult to analyse in this setting.
A word about model evaluation is also in order. An important question is: how good is the
policy πK+1 returned by approximate value iteration? When estimating a function in supervised
learning, its quality can be assessed by using cross-validation, for example. In reinforcement
learning, this is much more difficult: cross-validation cannot be applied. The best way to assess
the quality of the policy πK+1 is to apply it to the control problem (and if it is a real system,
and not a simulated one, this can be dangerous). There are a few works on model evaluation for
reinforcement learning (Farahmand and Szepesvári, 2011; Thomas et al., 2015), but notice that
there is no answer as easy as in supervised learning.
Given any function Q ∈ R^{S×A}, computing an associated greedy policy is easy (that is partly why
the state-action value function has been introduced). Therefore, the step to be approximated is
the policy evaluation step. In other words, the problem consists in estimating the quality function
of a given policy, from data. An iteration of approximate policy iteration can be (informally)
summarized as follows:
1. approximate policy evaluation: find a function Qk ∈ H such that Qk ≈ Tπk Qk;
2. policy improvement: compute a greedy policy πk+1 ∈ G(Qk).
The whole procedure is presented in Alg. 42.
3: policy improvement:
πk+1 ∈ G(Qk ).
4: end for
5: return the policy πK+1
So, the core question is: how can we estimate the quality function of a given policy from a given
dataset? As before, the model is unknown and the Bellman operator can only be sampled. Moreover,
the state space being too large, we are looking for a Q-function belonging to some predefined
hypothesis space. Here, we will assume a linear parametrization, that is, H = {Qθ(s, a) =
θ^⊤φ(s, a), θ ∈ R^d}. Let π be any policy; we now discuss how to find an approximate fixed point of
Tπ, that is, a function Qθ ∈ H such that Qθ ≈ Tπ Qθ.
Unfortunately, the Q-function Qπ is obviously unknown (it is what we would like to estimate).
However, if a simulator is available, it can be estimated for any given state-action couple (si, ai).
To do so, the idea is to sample a full trajectory starting in si where action ai is chosen first, all
subsequent states being sampled according to the system dynamics and all subsequent actions
being chosen according to the policy π. Writing qi for the associated discounted cumulative reward (the
sum of discounted rewards gathered along the sampled trajectory), it is an unbiased estimate of
the state-action value function: E[qi|si, ai] = Qπ(si, ai). This is called a Monte Carlo rollout. This
fits the regression setting and one can solve
\[
\min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \big(q_i - Q_\theta(s_i, a_i)\big)^2
\]
to generalize the simulated state-action values. The solution here is the classical linear least-squares
estimator, the parameter vector minimizing this empirical risk being
\[
\theta_n = \left( \sum_{i=1}^{n} \phi(s_i, a_i)\,\phi(s_i, a_i)^\top \right)^{-1} \sum_{i=1}^{n} \phi(s_i, a_i)\, q_i .
\]
Notice that other losses and other function approximators could be used (as it is a standard
regression problem). The disadvantage of this approach is that it requires a simulator (which is
not always available, and which can be costly to use) and that the rollouts can be quite noisy (due
to the variance induced by the stochasticity of the system). Also, while it formally requires sampling
infinite trajectories, in practice finite trajectories are sampled17.
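As an illustration, here is a minimal sketch of this rollout-based policy evaluation. The simulator interface (a `step` function returning a reward and a next state), the feature map and the truncation horizon are assumptions made for the example; the trajectory is truncated at a fixed horizon, as discussed above.

```python
import numpy as np

def rollout_value(step, s, a, policy, gamma=0.95, horizon=200):
    """Monte Carlo rollout: discounted return of one trajectory starting with (s, a),
    then following `policy`. step(s, a) is assumed to return (reward, next_state)."""
    q, discount = 0.0, 1.0
    for _ in range(horizon):           # finite truncation of the formally infinite trajectory
        r, s = step(s, a)
        q += discount * r
        discount *= gamma
        a = policy(s)
    return q

def evaluate_policy_lstsq(step, starts, policy, phi, gamma=0.95, horizon=200):
    """Fit Q_theta(s, a) = theta @ phi(s, a) to rollout estimates by linear least-squares.
    `starts` is a list of state-action couples (s_i, a_i)."""
    Phi = np.array([phi(s, a) for (s, a) in starts])
    q = np.array([rollout_value(step, s, a, policy, gamma, horizon) for (s, a) in starts])
    theta, *_ = np.linalg.lstsq(Phi, q, rcond=None)   # minimizes the empirical quadratic risk
    return theta
```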
Residual approach
As we are looking for an approximate fixed point of Tπ, a natural idea consists in minimizing
‖Qθ − Tπ Qθ‖ for some norm. If we can set this quantity to zero, then we have found the fixed
point (and thus the state-action value function Qπ). This is called a residual approach, as we try
to minimize the residual between Qθ and its image Tπ Qθ under the Bellman evaluation operator.
We work with data and the operator can only be sampled, so a natural optimization problem to
consider is
\[
\min_{\theta\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^{n} \big([\hat{T}_\pi Q_\theta](s_i,a_i) - Q_\theta(s_i,a_i)\big)^2
= \min_{\theta\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^{n} \big(r_i + \gamma Q_\theta(s'_i, \pi(s'_i)) - Q_\theta(s_i,a_i)\big)^2 .
\]
With a linear parametrization Qθ(s, a) = θ^⊤φ(s, a), this can be solved analytically by zeroing the
gradient; the minimizer is then given by
\[
\theta_n = \left(\frac{1}{n}\sum_{i=1}^{n} \Delta\phi_i\, \Delta\phi_i^\top\right)^{-1} \frac{1}{n}\sum_{i=1}^{n} \Delta\phi_i\, r_i
\quad\text{with}\quad \Delta\phi_i = \phi(s_i,a_i) - \gamma\,\phi(s'_i, \pi(s'_i)).
\]
The problem with this approach is that it leads to minimizing a biased surrogate of the residual.
Indeed, while T̂π is an unbiased estimator of the Bellman operator, this is no longer true for its square
(the square of the mean is not the mean of the square):
\[
\mathbb{E}\big[([\hat{T}_\pi Q_\theta](s_i,a_i) - Q_\theta(s_i,a_i))^2 \,\big|\, s_i,a_i\big]
= \big([T_\pi Q_\theta](s_i,a_i) - Q_\theta(s_i,a_i)\big)^2 + \operatorname{var}\big([\hat{T}_\pi Q_\theta](s_i,a_i) \,\big|\, s_i,a_i\big)
\neq \big([T_\pi Q_\theta](s_i,a_i) - Q_\theta(s_i,a_i)\big)^2 .
\]
If the dynamics is deterministic, this approach is fine (the variance term is zero). However, with
stochastic transitions the estimate will be biased. The variance term acts as a regularizing factor,
which is good in general, but not here, as it cannot be controlled (there is no factor for trading off
the risk and the regularization). For more about this, see Antos et al. (2008).
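A minimal sketch of this residual minimization in the linear case is given below; the transition format and the feature map are the same assumptions as before, and a small ridge term is added to keep the matrix invertible. It simply implements the closed-form minimizer with the Δφ features, and inherits the bias discussed above when transitions are stochastic.

```python
import numpy as np

def residual_policy_evaluation(transitions, phi, policy, gamma=0.95, reg=1e-6):
    """Minimize the sampled Bellman residual for a fixed policy.

    transitions: list of (s, a, r, s_next) tuples
    Returns theta such that Q_theta(s, a) = theta @ phi(s, a).
    """
    d = phi(*transitions[0][:2]).shape[0]
    A = reg * np.eye(d)                 # small ridge term, assumed here for numerical stability
    b = np.zeros(d)
    n = len(transitions)
    for (s, a, r, s_next) in transitions:
        delta_phi = phi(s, a) - gamma * phi(s_next, policy(s_next))
        A += np.outer(delta_phi, delta_phi) / n
        b += delta_phi * r / n
    return np.linalg.solve(A, b)
```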
Figure 30.2: Illustration for the Monte Carlo Rollouts (left) and LSTD (right). See the text for
details.
As usual, we work with data. LSTD can be expressed as solving the following nested optimization
problem:
\[
\begin{cases}
w_\theta = \operatorname{argmin}_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^{n} \big(r_i + \gamma Q_\theta(s'_i, \pi(s'_i)) - Q_w(s_i,a_i)\big)^2 \\[4pt]
\theta_n = \operatorname{argmin}_{\theta \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^{n} \big(Q_\theta(s_i,a_i) - Q_{w_\theta}(s_i,a_i)\big)^2
\end{cases}
\]
The first equation corresponds to the projection of T̂π Qθ onto H, and the second equation to the
minimization of the distance between Qθ and the projection of T̂π Qθ. This is a nested optimization
problem, as both solutions are interleaved (wθ depends on θ and vice-versa). Given the chosen linear
parametrization, this can be solved analytically. The first equation is a simple linear least-squares
problem in w, whose solution is given by
\[
w_\theta = \left(\sum_{i=1}^{n} \phi(s_i,a_i)\,\phi(s_i,a_i)^\top\right)^{-1} \sum_{i=1}^{n} \phi(s_i,a_i)\big(r_i + \gamma\,\theta^\top\phi(s'_i,\pi(s'_i))\big).
\]
The second equation is minimized with θ = wθ (and the distance is zero). Therefore, the solution
is
\[
\theta_n = w_{\theta_n}
\;\Leftrightarrow\;
\theta_n = \left(\sum_{i=1}^{n} \phi(s_i,a_i)\,\phi(s_i,a_i)^\top\right)^{-1} \sum_{i=1}^{n} \phi(s_i,a_i)\big(r_i + \gamma\,\theta_n^\top\phi(s'_i,\pi(s'_i))\big)
\;\Leftrightarrow\;
\theta_n = \left(\sum_{i=1}^{n} \phi(s_i,a_i)\,\big(\phi(s_i,a_i) - \gamma\,\phi(s'_i,\pi(s'_i))\big)^\top\right)^{-1} \sum_{i=1}^{n} \phi(s_i,a_i)\, r_i .
\]
When LSTD is the policy evaluation step of approximate policy iteration, the resulting algorithm
is called LSPI (Lagoudakis and Parr, 2003) for least-squares policy iteration, and is summarized
in Alg. 43.
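The closed-form LSTD solution above translates directly into code. The following sketch (with the same assumed transition format and feature map, and a small ridge term for invertibility) also shows the LSPI loop, which alternates LSTD evaluation of the current greedy policy with the improvement step.

```python
import numpy as np

def lstd(transitions, phi, policy, gamma=0.95, reg=1e-6):
    """LSTD: theta_n = (sum phi (phi - gamma phi')^T)^{-1} sum phi r, for a fixed policy."""
    d = phi(*transitions[0][:2]).shape[0]
    A = reg * np.eye(d)
    b = np.zeros(d)
    for (s, a, r, s_next) in transitions:
        f = phi(s, a)
        A += np.outer(f, f - gamma * phi(s_next, policy(s_next)))
        b += f * r
    return np.linalg.solve(A, b)

def lspi(transitions, phi, actions, gamma=0.95, K=20):
    """Least-squares policy iteration: LSTD evaluation followed by greedy improvement."""
    d = phi(*transitions[0][:2]).shape[0]
    theta = np.zeros(d)
    for _ in range(K):
        # Greedy policy with respect to the current Q_theta.
        policy = lambda s, th=theta: max(actions, key=lambda a: th @ phi(s, a))
        theta = lstd(transitions, phi, policy, gamma)
    return theta
```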
We have seen that a central question here is how to approximate the quality function from data.
We have presented the main methods, but many others exist. For an overview of the subject, the
interested reader can refer to Geist and Pietquin (2013); Geist and Scherrer (2014) (some others
will be briefly discussed in Sec. 30.5). For a discussion about the link between the residual approach
and LSTD, see Scherrer (2010).
3: policy improvement:
πk+1 ∈ G(Qθk ).
4: end for
5: return the policy πK+1
This is a cost-sensitive multiclass classification problem (see also Sec. 5.2). Notice that a policy
can be seen as a decision rule, as it associates a label (an action) to an input (a state). Take a
policy π: it suffers a cost of max_{a∈A} Qπk(si, a) − Qπk(si, π(si)) for choosing the action π(si)
in state si instead of the greedy action argmax_{a∈A} Qπk(si, a). So, solving the above optimization
problem with an infinite amount of data and a rich enough hypothesis space gives a greedy
policy πk+1 ∈ G(Qπk). As we work with a finite amount of data and as the hypothesis space may not
contain the greedy policy, in practice we obtain an approximate greedy policy.
Obviously, the state-action value function is unknown, while it is required to express problem (30.12). Yet, we only need to know (possibly approximately) the state-action value function
for the state-action couples {(si, a), 1 ≤ i ≤ n, a ∈ A}. We can estimate these values by performing Monte
Carlo rollouts. Notice that we only need to estimate the function pointwise; we do not need to
generalize it to the whole state-action space (generalization is done by the policy). The interesting
aspect here is that we have reduced the optimal control problem to a sequence of classification
problems.
This approach is called DPI (Lazaric et al., 2010), for direct policy iteration. Notice that all
the approximate dynamic programming algorithms we have presented so far are special cases of the generic
approximate modified policy iteration approach. The interested reader can refer to Scherrer et al.
(2015) for more about this.
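A rough sketch of one such classification-based iteration is given below. The rollout routine is an assumed parameter (for instance the rollout_value function sketched earlier), and the final classifier is only hinted at: a plain classifier fit on the greedy labels is a common simplification of the cost-sensitive learner described in the text.

```python
import numpy as np

def dpi_iteration(states, actions, rollout, policy, n_rollouts=10):
    """One (simplified) iteration of classification-based policy iteration (DPI).

    rollout(s, a, policy) is assumed to return one Monte Carlo estimate of Q_{pi_k}(s, a).
    For each sampled state and each action, the Q-value is estimated pointwise; the costs of
    the classification problem are then the regrets with respect to the estimated greedy action.
    """
    q_hat = np.array([
        [np.mean([rollout(s, a, policy) for _ in range(n_rollouts)]) for a in actions]
        for s in states
    ])                                                 # shape (n_states, n_actions)
    greedy_labels = q_hat.argmax(axis=1)               # estimated greedy action for each state
    costs = q_hat.max(axis=1, keepdims=True) - q_hat   # cost of choosing each action instead
    # A cost-sensitive multiclass classifier trained on (states, costs) would give pi_{k+1};
    # fitting a plain classifier on (states, greedy_labels) is a common simplification.
    return greedy_labels, costs
```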
SARSA
Let π be any policy; we now present an algorithm for estimating the Q-function of this policy in
an online fashion. Let H = {Qθ(s, a) = θ^⊤φ(s, a), θ ∈ R^d} be a hypothesis space of linearly
parameterized functions. Assuming that Qπ is known, we would like to solve
\[
\min_{\theta\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^{n} \big(Q_\pi(s_i,a_i) - Q_\theta(s_i,a_i)\big)^2 .
\]
If we minimize this risk using a stochastic gradient descent, we classically get the following update
rule,
\[
\theta_{i+1} = \theta_i - \frac{\alpha_i}{2}\,\nabla_\theta \big(Q_\pi(s_i,a_i) - Q_\theta(s_i,a_i)\big)^2\Big|_{\theta=\theta_i}
= \theta_i + \alpha_i\,\phi(s_i,a_i)\big(Q_\pi(s_i,a_i) - \theta_i^\top\phi(s_i,a_i)\big),
\]
where αi is a learning rate. This is a standard Widrow-Hoff update. At time-step i, the state-action
couple (si , ai ) is observed, and the parameter vector is updated according to the gain αi φ(si , ai )
and to the prediction error Qπ (si , ai ) − θi> φ(si , ai ).
Unfortunately, and as usual, the state-action value function is unknown (as it is what we would
like to estimate). The idea is to apply a bootstrapping principle: the unobserved value Qπ(si, ai) is
replaced by an estimate computed by applying the sampled Bellman evaluation operator to the current estimate, that is, [T̂π Qθi](si, ai) = ri + γQθi(si+1, π(si+1)), where si+1 ∼ P(·|si, ai)
(si+1 is obtained by applying action ai to the system). To sum up, let (si, ai, ri, si+1, ai+1) be the
current transition, with ri = r(si, ai), si+1 ∼ P(·|si, ai) and ai+1 = π(si+1); the update rule is
\[
\theta_{i+1} = \theta_i + \alpha_i\,\phi(s_i,a_i)\big(r_i + \gamma\,\theta_i^\top\phi(s_{i+1},a_{i+1}) - \theta_i^\top\phi(s_i,a_i)\big).
\]
This is called a temporal difference algorithm, due to the prediction error ri + γQθi(si+1, π(si+1)) −
Qθi(si, ai) being a temporal difference.
Algorithm 44 SARSA
Require: An initial parameter vector θ0 , the initial state s0 , an initial action a0 , the learning
rates (αi )i≥0
1: i = 0
2: while true do
3: Apply action ai in state si
4: Get the reward ri and observe the new state si+1
5: Choose the action ai+1 to be applied in state si+1
6: Update the parameter vector of the Q-function according to the transition
(si , ai , ri , si+1 , ai+1 );
θi+1 = θi + αi φ(si, ai)(ri + γθi^⊤φ(si+1, ai+1) − θi^⊤φ(si, ai))
7: i←i+1
8: end while
The resulting algorithm is called SARSA (due to the transition (si , ai , ri , si+1 , ai+1 )) and is
summarized in Alg. 44. Notice that we remain vague about how action ai+1 is chosen (see line 5
in Alg. 44). We have said just before that action ai+1 is chosen according to π, the policy to
be evaluated. If we consider a policy evaluation problem, that’s fine. However, we would like to
learn the optimal control. A possibility would be to take ai+1 as the greedy action with respect to
the current estimate Qθi. This would correspond to an optimistic18 approximate policy iteration
scheme. Yet, this would be too conservative (if the Q-function is badly estimated, the agent can
get stuck) and it should be balanced with some exploration. Before expanding on this, we present
an alternative algorithm.
18. The optimism lies in the fact that the evaluation step occurs for only one transition before taking the greedy step.
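A minimal sketch of the SARSA update with linear features follows. The environment interface and the feature map are assumptions; the action selection of line 5 of Alg. 44 is left as a parameter, anticipating the exploration policies discussed further below.

```python
import numpy as np

def sarsa(env_step, phi, select_action, s0, a0, alpha=0.05, gamma=0.95, n_steps=10_000):
    """On-policy temporal difference learning (SARSA) with Q_theta(s, a) = theta @ phi(s, a).

    env_step(s, a) is assumed to return (reward, next_state);
    select_action(theta, s) chooses the next action (line 5 of Alg. 44).
    """
    theta = np.zeros(phi(s0, a0).shape[0])
    s, a = s0, a0
    for _ in range(n_steps):
        r, s_next = env_step(s, a)
        a_next = select_action(theta, s_next)
        # Temporal difference: bootstrapped target minus current prediction.
        td_error = r + gamma * theta @ phi(s_next, a_next) - theta @ phi(s, a)
        theta = theta + alpha * td_error * phi(s, a)
        s, a = s_next, a_next
    return theta
```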
Q-learning
We can do the same job as for SARSA with the optimal Q-function Q∗ directly, instead of Qπ.
The update rule (assuming Q∗ known) is
\[
\theta_{i+1} = \theta_i + \alpha_i\,\phi(s_i,a_i)\big(Q_*(s_i,a_i) - \theta_i^\top\phi(s_i,a_i)\big).
\]
The function Q∗ is unknown, but it can be bootstrapped by replacing it with the estimate obtained
by applying the sampled Bellman optimality operator to the current estimate, [T̂∗ Qθi](si, ai) =
ri + γ max_{a∈A} Qθi(si+1, a), giving the following update rule:
\[
\theta_{i+1} = \theta_i + \alpha_i\,\phi(s_i,a_i)\Big(r_i + \gamma \max_{a\in A} Q_{\theta_i}(s_{i+1},a) - Q_{\theta_i}(s_i,a_i)\Big)
= \theta_i + \alpha_i\,\phi(s_i,a_i)\Big(r_i + \gamma \max_{a\in A}\big(\theta_i^\top\phi(s_{i+1},a)\big) - \theta_i^\top\phi(s_i,a_i)\Big).
\]
Notice that whatever the behavior policy (the way actions ai are chosen), we will directly estimate
the optimal Q-function (given some assumptions, notably that the behavior policy visits all state-action
pairs often enough). This is called an off-policy algorithm: the estimated policy (here
the optimal one) is different from the behavior policy. This is different from SARSA (used for
policy evaluation), where the followed policy and the evaluated policy are the same; this is called
on-policy learning.
Algorithm 45 Q-learning
Require: An initial parameter vector θ0 , the initial state s0 , the learning rates (αi )i≥0
1: i = 0
2: while true do
3: Choose the action ai to be applied in state si
4: Apply action ai in state si
5: Get the reward ri and observe the new state si+1
6: Update the parameter vector of the Q-function according to the transition (si , ai , ri , si+1 );
θi+1 = θi + αi φ(si, ai)(ri + γ max_{a∈A}(θi^⊤φ(si+1, a)) − θi^⊤φ(si, ai))
7: i←i+1
8: end while
The resulting algorithm is called Q-learning and is summarized in Alg. 45. Again, we remain
vague about how action ai is chosen (see line 3 in Alg. 45). If the Q-function were perfectly estimated,
the wisest choice would be to take the greedy action. However, if some action has never
been tried (or not often enough, so that the Q-function estimate has not converged), one cannot know
whether it is in fact better than the greedy one. Therefore, exploitation (taking the greedy
action) should be combined with exploration (taking another action).
Notice that there exist many other online learning algorithms. For example, LSTD can be
made online by using the Sherman-Morrison formula (much like how recursive linear least-squares is derived
from batch linear least-squares). For more about this, see again Geist and Pietquin (2013); Geist and
Scherrer (2014).
cumulative rewards). We present two simple solutions, that is, policies balancing exploration and
exploitation, that can be used in line 5 of Alg. 44 and line 3 of Alg. 45.
Let ε be in (0, 1). An ε-greedy policy chooses the greedy action with probability 1 − ε, and a
random action with probability ε:
\[
\pi_\varepsilon(s) = \begin{cases} \operatorname{argmax}_{a\in A} Q_\theta(s,a) & \text{with probability } 1-\varepsilon, \\ \text{a random action} & \text{with probability } \varepsilon. \end{cases}
\]
Assume that ε is small. Most of the time, the agent will act greedily with respect to the currently
estimated quality function. However, from time to time, it will try a different action, to see if it
cannot gather higher values. Practically, it is customary to set a high value of ε at the beginning of
learning, so as to favor exploration, and then to decrease ε as the estimation of the Q-function
improves, so as to act more and more greedily.
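As an illustration, here is a sketch of an ε-greedy action selection together with the Q-learning update it would typically be combined with. The feature map, the environment interface and the decreasing ε schedule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(theta, s, phi, actions, epsilon):
    """Greedy action with probability 1 - epsilon, uniformly random action otherwise."""
    if rng.random() < epsilon:
        return actions[rng.integers(len(actions))]
    return max(actions, key=lambda a: theta @ phi(s, a))

def q_learning(env_step, phi, actions, s0, gamma=0.95, alpha=0.05, n_steps=10_000):
    """Off-policy Q-learning with linear features and a decaying epsilon-greedy behavior policy."""
    theta = np.zeros(phi(s0, actions[0]).shape[0])
    s = s0
    for t in range(n_steps):
        epsilon = max(0.05, 1.0 - t / n_steps)      # explore a lot first, then act more greedily
        a = epsilon_greedy(theta, s, phi, actions, epsilon)
        r, s_next = env_step(s, a)
        target = r + gamma * max(theta @ phi(s_next, b) for b in actions)
        theta = theta + alpha * (target - theta @ phi(s, a)) * phi(s, a)
        s = s_next
    return theta
```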
A different kind of policy is the softmax policy. Let Qθ be the currently estimated quality function
and let τ > 0 be a temperature parameter; the softmax policy is a stochastic policy defined as
\[
\pi_\tau(a|s) = \frac{e^{\frac{1}{\tau} Q_\theta(s,a)}}{\sum_{a'\in A} e^{\frac{1}{\tau} Q_\theta(s,a')}}.
\]
It is easy to check that this indeed defines a conditional probability. It is clear from this definition
that actions with high Q-values will be sampled more often. The parameter τ allows going from a
purely uniform random policy (τ → ∞) to the greedy policy (τ → 0).
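The softmax policy can be sketched as follows (same assumed feature map). Subtracting the maximum before exponentiating is a standard numerical precaution, not part of the definition.

```python
import numpy as np

rng = np.random.default_rng()

def softmax_policy(theta, s, phi, actions, tau=1.0):
    """Sample an action with probability proportional to exp(Q_theta(s, a) / tau)."""
    q = np.array([theta @ phi(s, a) for a in actions]) / tau
    p = np.exp(q - q.max())      # shifting by the max avoids overflow, probabilities are unchanged
    p /= p.sum()
    return actions[rng.choice(len(actions), p=p)]
```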
These policies make sense for balancing exploration and exploitation. However, one can wonder
how effective they are (empirically, which can be tried, but also theoretically). Moreover, one can
wonder what the best exploration strategy is. This is a very difficult question. Indeed, consider
the bandit problem studied in Ch. 29. A bandit is simply an MDP with a single state (there is
no transition kernel and γ = 0). We have studied the UCB strategy, which addresses the exploration-exploitation
dilemma. Things are much more complicated in the general MDP setting. For a few
possible strategies, the interested reader can refer to Bayesian reinforcement learning (Poupart et al.,
2006; Vlassis et al., 2012) or R-max (Brafman and Tennenholtz, 2003), among many others.
Here, we look for parameterized policies belonging to some hypothesis space F = {πθ , θ ∈ Rd }.
For example, for discrete actions, a common choice is to parameterize the policy with a softmax
distribution:
\[
\pi_\theta(a|s) = \frac{e^{\theta^\top\phi(s,a)}}{\sum_{a'\in A} e^{\theta^\top\phi(s,a')}}, \tag{30.13}
\]
with φ(s, a) being a predefined feature vector. For continuous actions, a common approach consists
in embedding a parameterized deterministic policy into a Gaussian distribution. For example, if
A = R, we can consider
\[
\pi_\theta(a|s) \propto e^{-\frac{1}{2}\left(\frac{a - \theta^\top\phi(s)}{\sigma}\right)^2},
\]
with φ(s) being a predefined feature vector.
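As a small illustration, both parametrizations can be written as sampling routines; the feature maps and the fixed standard deviation σ are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng()

def softmax_policy_sample(theta, s, phi, actions):
    """Discrete actions: pi_theta(a|s) proportional to exp(theta @ phi(s, a)), as in Eq. (30.13)."""
    logits = np.array([theta @ phi(s, a) for a in actions])
    p = np.exp(logits - logits.max())    # numerically stable softmax
    p /= p.sum()
    return actions[rng.choice(len(actions), p=p)]

def gaussian_policy_sample(theta, s, phi, sigma=0.1):
    """Continuous actions (A = R): a deterministic policy theta @ phi(s) embedded in a Gaussian."""
    return rng.normal(loc=theta @ phi(s), scale=sigma)
```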
Let ν ∈ ∆S be a distribution over states, defined by the user (it weights states where we would
like to have good estimates). The policy search methods aim at solving the following optimization
problem:
\[
\max_{\theta\in\mathbb{R}^d} J(\theta) \quad\text{with}\quad J(\theta) = \sum_{s\in S}\nu(s)\, v_{\pi_\theta}(s) = \mathbb{E}_{S\sim\nu}\big[v_{\pi_\theta}(S)\big].
\]
In dynamic programming, the aim is to find the policy that maximizes the value for every state. In
the current setting, there are too many states, so we instead try to find the policy that maximizes
the associated value function averaged over a predefined distribution over states.
where we used a classic log-trick for the last line19. We can see that, componentwise, this is a Bellman
evaluation equation for the policy πθ and the reward Qπθ(s, a)∇θj ln πθ(a|s). Let 1 ≤ j ≤ d and θj
be the j-th component of the vector θ, and write R(s) = Σ_{a∈A} πθ(a|s) Qπθ(s, a) ∇θj ln πθ(a|s); we
then have equivalently that
\[
\forall\, 1 \le j \le d, \quad \nabla_{\theta_j} v_{\pi_\theta} = (I - \gamma P_\pi)^{-1} R .
\]
Noticing that the componentwise gradient of the objective can be written as ∇θj J(θ) = ν^⊤ ∇θj vπθ,
we therefore have
\[
\nabla_{\theta_j} J(\theta) = \nu^\top (I - \gamma P_\pi)^{-1} R .
\]
The quantity defined as
\[
d_{\nu,\pi} = (1-\gamma)\,\nu^\top (I - \gamma P_\pi)^{-1}
\]
is a distribution over states (the factor (1 − γ) makes it sum to one), so the gradient of the objective
can be written as
\[
\nabla_\theta J(\theta) = \frac{1}{1-\gamma} \sum_{s\in S} d_{\nu,\pi}(s) \sum_{a\in A} \pi_\theta(a|s)\, Q_{\pi_\theta}(s,a)\, \nabla_\theta \ln \pi_\theta(a|s)
= \frac{1}{1-\gamma}\, \mathbb{E}_{S\sim d_{\nu,\pi},\, A\sim\pi_\theta(\cdot|S)}\big[Q_{\pi_\theta}(S,A)\, \nabla_\theta \ln \pi_\theta(A|S)\big]. \tag{30.14}
\]
19. This log-trick is the fact that ∇π(a|s) = π(a|s)∇ ln π(a|s). The log is the reason why we consider stochastic policies (no action can have probability zero, or the log is ill defined).
This result is called the policy gradient theorem; see Sutton et al. (1999), who first provided this
result, for an alternative derivation.
A local maximum can thus be searched for by doing a gradient ascent,
θ ← θ + α∇θ J(θ),
with α being a learning rate. The gradient can be estimated using Monte Carlo rollouts, see Baxter
and Bartlett (2001).
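As an illustration, here is a sketch of a REINFORCE-like Monte Carlo estimate of the gradient (30.14), in which the Q-values are replaced by the discounted returns observed along sampled trajectories. This is one simple way to perform the Monte Carlo estimation alluded to above, not the specific estimator of Baxter and Bartlett (2001); the `grad_log_policy`, `sample_start`, `sample_action` and `env_step` interfaces are assumptions made for the sketch.

```python
import numpy as np

def policy_gradient_estimate(env_step, sample_start, sample_action, grad_log_policy,
                             theta, gamma=0.95, horizon=100, n_trajectories=20):
    """REINFORCE-like Monte Carlo estimate of the policy gradient (30.14).

    sample_start() draws an initial state from nu, sample_action(theta, s) samples a ~ pi_theta(.|s),
    env_step(s, a) returns (reward, next_state), and grad_log_policy(theta, s, a) returns
    grad_theta ln pi_theta(a|s). Q_{pi_theta}(s_t, a_t) is replaced by the return observed from t on.
    """
    grad = np.zeros_like(theta)
    for _ in range(n_trajectories):
        s = sample_start()
        traj = []                                  # (state, action, reward) triples
        for _ in range(horizon):
            a = sample_action(theta, s)
            r, s_next = env_step(s, a)
            traj.append((s, a, r))
            s = s_next
        g = 0.0                                    # discounted return, accumulated backwards
        for t, (s_t, a_t, r_t) in reversed(list(enumerate(traj))):
            g = r_t + gamma * g
            grad += (gamma ** t) * g * grad_log_policy(theta, s_t, a_t)
    return grad / n_trajectories

# One step of gradient ascent on J(theta), as in the text:
# theta = theta + alpha * policy_gradient_estimate(...)
```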
Replacing Qπθ by an approximation Qw in the gradient (30.14) leaves it unchanged as soon as
\[
\mathbb{E}_{d_{\nu,\pi}}\big[Q_{\pi_\theta}(S,A)\,\nabla_\theta \ln\pi_\theta(A|S)\big] = \mathbb{E}_{d_{\nu,\pi}}\big[Q_w(S,A)\,\nabla_\theta \ln\pi_\theta(A|S)\big]
\;\Leftrightarrow\; \mathbb{E}_{d_{\nu,\pi}}\big[(Q_{\pi_\theta}(S,A) - Q_w(S,A))\,\nabla_\theta \ln\pi_\theta(A|S)\big] = 0. \tag{30.15}
\]
If the parametrization of Qw is compatible with that of the policy, in the sense that ∇θ ln πθ(a|s) = ∇w Qw(s, a), this condition is equivalent to
\[
\mathbb{E}_{d_{\nu,\pi}}\big[(Q_{\pi_\theta}(S,A) - Q_w(S,A))\,\nabla_w Q_w(S,A)\big] = 0
\;\Leftrightarrow\; \nabla_w \mathbb{E}_{d_{\nu,\pi}}\big[(Q_{\pi_\theta}(S,A) - Q_w(S,A))^2\big] = 0.
\]
In other words, Qw must be a local optimum of the risk based on the ℓ2-loss, with the state-action
distribution given by dν,π, and with the target function being Qπθ.
We have just shown that if the parametrization of the state-action value function is compatible,
in the sense that
\[
\forall (s,a) \in S\times A, \quad \nabla_\theta \ln \pi_\theta(a|s) = \nabla_w Q_w(s,a),
\]
and if Qw is a local optimum of the risk based on the ℓ2-loss, with state-action distribution given
by dν,π and with the target function being Qπθ, then Qw can be used in place of Qπθ in the policy
gradient (30.14).
So, the state-action value function appearing in the gradient can be replaced by its projection onto
the hypothesis space of compatible functions. This result was first given by Sutton et al.
(1999). Notice that while formally this projection should be computed using Monte Carlo rollouts, in
practice temporal difference algorithms are often used (and they do not compute the projection,
in general).
Let us see what this compatibility condition gives with the softmax parametrization of Eq. (30.13).
We have
\[
\nabla_\theta \ln \pi_\theta(a|s) = \nabla_\theta \ln \frac{e^{\theta^\top\phi(s,a)}}{\sum_{a'\in A} e^{\theta^\top\phi(s,a')}}
= \phi(s,a) - \frac{\sum_{a'\in A} \phi(s,a')\, e^{\theta^\top\phi(s,a')}}{\sum_{a'\in A} e^{\theta^\top\phi(s,a')}}
= \phi(s,a) - \sum_{a'\in A} \pi_\theta(a'|s)\,\phi(s,a').
\]
However, this is not really a problem. Indeed, let v ∈ RS be any function depending only on
states. One can easily show that Edν,π [v(s)∇θ ln πθ (a|s)] = 0, using the same trick as in Eq. (30.16).
Therefore, we have
with F (θ) the Fisher information matrix. In our case, the matrix is defined as (Peters and Schaal,
2008)
F (θ) = Edν,π [∇θ ln πθ (A|S)(∇θ ln πθ (A|S))> ].
Let Qw be a linearly parameterized function satisfying the required conditions, that is, Qw(s, a) =
w^⊤∇θ ln πθ(a|s) and ∇w Edν,π[(Qπθ(S, A) − Qw(S, A))²] = 0; we then have
\[
F(\theta)^{-1}\,\nabla_\theta J(\theta) = \frac{1}{1-\gamma}\, w .
\]
Therefore, the policy parameters are simply updated using the parameters computed for the quality
function. The related algorithms are called natural actor-critics and have been introduced by Peters
and Schaal (2008). For more about policy search and actor-critics, the interested reader can refer
to Grondman et al. (2012); Deisenroth et al. (2013).
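To close, here is a schematic sketch of one natural actor-critic update under these assumptions: a critic built on the compatible features ψ(s, a) = ∇θ ln πθ(a|s) is fitted by least-squares to rollout estimates of the Q-values (an illustrative choice; as noted above, temporal difference critics are more common in practice), and the critic weights are then used directly as the ascent direction for the policy parameters.

```python
import numpy as np

def natural_actor_critic_step(theta, rollouts, grad_log_policy, alpha=0.1, reg=1e-6):
    """One schematic natural actor-critic update.

    rollouts: list of (s, a, q_hat) triples, q_hat being a Monte Carlo estimate of Q_{pi_theta}(s, a).
    The critic Q_w(s, a) = w @ grad_log_policy(theta, s, a) uses the compatible features,
    and the policy parameters are then updated in the direction of the critic weights w.
    """
    d = theta.shape[0]
    A = reg * np.eye(d)
    b = np.zeros(d)
    for (s, a, q_hat) in rollouts:
        psi = grad_log_policy(theta, s, a)     # compatible feature vector
        A += np.outer(psi, psi)
        b += psi * q_hat
    w = np.linalg.solve(A, b)                  # least-squares fit of the compatible critic
    return theta + alpha * w                   # ascent direction proportional to the natural gradient
```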
Bibliography
Evgeniou, T., Pontil, M., and Poggio, T. (2000). Regularization Networks and Support Vector
Machines. Advances in Computational Mathematics, 13(1):1–50.
Farahmand, A.-m. and Szepesvári, C. (2011). Model selection in reinforcement learning. Machine
learning, 85(3):299–332.
Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and
an application to boosting. Journal of computer and system sciences, 55(1):119–139.
Freund, Y. and Schapire, R. E. (1999). Large Margin Classification Using the Perceptron Algo-
rithm. Machine Learning, 37(3):277–296.
Frezza-Buet, H. (2014). Online Computing of Non-Stationary Distributions Velocity Fields by an
Accuracy Controlled Growing Neural Gas. Neural Networks, 60:203–221.
Friedman, J., Hastie, T., and Tibshirani, R. (2000). Additive logistic regression: a statistical view
of boosting. The annals of statistics, 28(2):337–407.
Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of
statistics, pages 1189–1232.
Fritzke, B. (1995a). A growing neural gas network learns topologies. In Tesauro, G., Touretzky,
D. S., and Leen, T. K., editors, Advances in Neural Information Processing Systems 7, pages
625–632. MIT Press, Cambridge MA.
Fritzke, B. (1995b). Growing Grid - a self-organizing network with constant neighborhood range
and adaptation strength. Neural Processing Letters, 2:9–13.
Fritzke, B. (1997). Some Competitive Learning Methods. http://www.ki.inf.tu-dresden.de/
~fritzke/JavaPaper/.
Fukushima, K. (1980). Neocognitron: A Self-Organizing Neural Network Model for a Mechanism
of Pattern Recognition Unaffected by Shift in Position. Biological Cybernetics, 36:193–202.
Gabillon, V., Ghavamzadeh, M., and Lazaric, A. (2012). Best arm identification: A unified ap-
proach to fixed budget and fixed confidence. In Advances in Neural Information Processing
Systems, pages 3212–3220.
Gallant, S. I. (1990). Perceptron-based learning algorithms. IEEE Transactions on Neural Net-
works, 1(2):179–191.
Geist, M. (2013-2014). Abrégé non exhaustif sur l’évaluation et la sélection de modèles et la
sélection de variables. Technical report, CentraleSupélec.
Geist, M. (2015a). Précis introductif à l’apprentissage statistique. Support de cours, Cen-
traleSupélec. http://www.metz.supelec.fr/metz/personnel/geist_mat/pdfs/poly_as_
v2.pdf.
Geist, M. (2015b). Soft-max boosting. Machine Learning.
Geist, M. and Pietquin, O. (2013). An Algorithmic Survey of Parametric Value Function Approx-
imation. IEEE Transactions on Neural Networks and Learning Systems, 24(6):845–867.
Geist, M. and Scherrer, B. (2014). Off-policy Learning with Eligibility Traces: A Survey. Journal
of Machine Learning Research (JMLR), 15:289–333.
Gelly, S., Wang, Y., Munos, R., and Teytaud, O. (2006). Modification of UCT with patterns in
Monte-Carlo go. Technical Report RR-6062, 32:30–56.
Gers, F. A., Schmidhuber, J., and Cummins, F. A. (2000). Learning to Forget: Continual Predic-
tion with LSTM. Neural Computation, 12(10):2451–2471.
Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. Machine learning,
63(1):3–42.
Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural
networks. In Teh, Y. W. and Titterington, D. M., editors, AISTATS, volume 9 of JMLR
Proceedings, pages 249–256. JMLR.org.
Graves, A. (2013). Generating Sequences With Recurrent Neural Networks. CoRR, abs/1308.0850.
Graves, A., Fernández, S., Gomez, F. J., and Schmidhuber, J. (2006). Connectionist temporal
classification: labelling unsegmented sequence data with recurrent neural networks. In Cohen,
W. W. and Moore, A., editors, ICML, volume 148 of ACM International Conference Proceeding
Series, pages 369–376. ACM.
Graves, A., Mohamed, A.-r., and Hinton, G. E. (2013). Speech recognition with deep
recurrent neural networks. In ICASSP, pages 6645–6649. IEEE.
Graves, A. and Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM
and other neural network architectures. Neural Networks, 18(5-6):602–610.
Grondman, I., Buşoniu, L., Lopes, G. A., and Babuška, R. (2012). A survey of actor-critic rein-
forcement learning: Standard and natural policy gradients. Systems, Man, and Cybernetics,
Part C: Applications and Reviews, IEEE Transactions on, 42(6):1291–1307.
Grubb, A. and Bagnell, D. (2011). Generalized boosting algorithms for convex optimization. In
International Conference on Machine Learning, pages 1209–1216.
Guermeur, Y. (2007). Vc theory of large margin multi-category classifiers. The Journal of Machine
Learning Research, 8:2551–2594.
Guyon, I. and Elisseeff, A. (2003). An Introduction to Variable and Feature Selection. J. Mach.
Learn. Res., 3:1157–1182.
Györfi, L., Kohler, M., Krzyzak, A., and Walk, H. (2006). A distribution-free theory of nonpara-
metric regression. Springer.
Hall, M. A. (1999). Correlation-based Feature Selection for Machine Learning. PhD thesis.
Han, H.-G. and Qiao, J.-F. (2012). Adaptive Computation Algorithm for RBF Neural Network.
IEEE Trans. Neural Netw. Learning Syst., 23(2):342–347.
Hansen, N. (2006). The CMA evolution strategy: a comparing review. In Lozano, J., Larranaga,
P., Inza, I., and Bengoetxea, E., editors, Towards a new evolutionary computation. Advances
on estimation of distribution algorithms, pages 75–102. Springer.
Hartman, E. J., Keeler, J. D., and Kowalski, J. M. (1990). Layered Neural Networks with
Gaussian Hidden Units as Universal Approximations. Neural Computation, 2(2):210–215. doi:
10.1162/neco.1990.2.2.210.
Håstad, J. (1986). Almost Optimal Lower Bounds for Small Depth Circuits. In Hartmanis, J.,
editor, STOC, pages 6–20. ACM.
Håstad, J. and Goldmann, M. (1991). On the Power of Small-Depth Threshold Circuits. Compu-
tational Complexity, 1:113–129.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer,
2nd edition.
Hertz, J., Krogh, A., and Palmer, R. G. (1991). Introduction to the Theory of Neural Computation.
Addison-Wesley, Redwood City, CA.
Hinton, G. and Roweis, S. (2002). Stochastic Neighbor Embedding. In Advances in Neural Infor-
mation Processing Systems 15, pages 833–840. MIT Press.
Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A Fast Learning Algorithm for Deep Belief
Nets. Neural Comput., 18(7):1527–1554.
Hinton, G. E. and Sejnowski, T. (1986). Learning and relearning in Boltzmann machines. In Parallel
distributed processing: Explorations in the microstructure of cognition, pages 282–317. MIT
Press, Cambridge, MA.
Ho, T. K. (1998). The random subspace method for constructing decision forests. Pattern Analysis
and Machine Intelligence, IEEE Transactions on, 20(8):832–844.
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of
the American statistical association, 58(301):13–30.
Hoeting, J. A., Madigan, D., Raftery, A. E., and Volinsky, C. T. (1999). Bayesian model averaging:
a tutorial. Statistical science, pages 382–401.
Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal
approximators. Neural Networks, 2:356–366.
Hornik, K., Stinchcombe, M., and White, H. (1990). Universal Approximation of an Unknown
Mapping and Its Derivatives Using Multilayer Feedforward Networks. Neural Networks, 3:551–
560.
Hsu, D., Kakade, S. M., and Zhang, T. (2014). Random design analysis of ridge regression.
Foundations of Computational Mathematics, 14(3):569–600.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixtures of local
experts. Neural computation, 3(1):79–87.
Jaeger, H. (2002). A tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF
and the "echo state network" approach. Technical report, Fraunhofer Institute for Autonomous
Intelligent Systems (AIS). http://minds.jacobs-university.de/sites/default/files/
uploads/papers/ESNTutorialRev.pdf.
Jaeger, H. (2004). Harnessing Nonlinearity: Predicting Chaotic Systems and Saving Energy in
Wireless Communication. Science, 304:78–80.
Japkowicz, N., Hanson, S. J., and Gluck, M. A. (2000). Nonlinear Autoassociation Is Not Equiva-
lent to PCA. Neural Computation, 12(3):531–545.
Jaynes, E. T. and Bretthorst, G. L., editors (2003). Probability theory : the logic of science.
Cambridge University Press, Cambridge, UK, New York.
Johansson, E. M., Dowla, F. U., and Goodman, D. M. (1991). Backpropagation Learning for
Multilayer Feed-Forward Neural Networks Using the Conjugate Gradient Method. Int. J.
Neural Syst., 2(4):291–301.
Jordan, M. I. and Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm.
Neural computation, 6(2):181–214.
Jouini, W., Moy, C., and Palicot, J. (2012). Decision making for cognitive radio equipment: analysis
of the first 10 years of exploration. EURASIP J. Wireless Comm. and Networking, 2012:26.
Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. (1998). Planning and acting in partially
observable stochastic domains. Artificial Intelligence, 101(1-2):99–134.
Kanungo, T., Mount, D. M., Netanyahu, N., Piatko, C., Silverman, R., and Wu, A. Y. (2002). An
efficient k-means clustering algorithm: Analysis and implementation. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 24:881–892.
Karpathy, A. and Li, F.-F. (2014). Deep Visual-Semantic Alignments for Generating Image De-
scriptions. CoRR, abs/1412.2306.
Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., and Murthy, K. R. K. (1999). Improvements
to Platt’s SMO Algorithm for SVM Classifier Design. Technical Report CD-99-14, National
University of Singapore.
Korda, N., Kaufmann, E., and Munos, R. (2013). Thompson sampling for 1-dimensional exponen-
tial family bandits. In Advances in Neural Information Processing Systems, pages 1448–1456.
Krizhevsky, A. (2014). One weird trick for parallelizing convolutional neural networks. CoRR,
abs/1404.5997.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet Classification with Deep Con-
volutional Neural Networks. In Bartlett, P. L., Pereira, F. C. N., Burges, C. J. C., Bottou, L.,
and Weinberger, K. Q., editors, NIPS, pages 1106–1114.
Lagoudakis, M. G. and Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning
Research (JMLR), 4:1107–1149.
Lawson, C. and Hanson, R. (1974). Solving least squares problems. Prentice-Hall series in automatic
computation. Prentice-Hall, Englewood Cliffs, NJ.
Lazaric, A., Ghavamzadeh, M., and Munos, R. (2010). Analysis of a classification-based policy
iteration algorithm. In International Conference on Machine Learning (ICML), pages 607–614.
Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11):2278–2324.
LeCun, Y., Bottou, L., Orr, G., and Müller, K.-R. (1998). Efficient BackProp. In Orr, G. and
Müller, K.-R., editors, Neural Networks: Tricks of the Trade, volume 1524 of Lecture Notes
in Computer Science, pages 9–50. Springer Berlin Heidelberg.
Lee, Y., Lin, Y., and Wahba, G. (2004). Multicategory support vector machines: Theory and
application to the classification of microarray data and satellite radiance data. Journal of the
American Statistical Association, 99(465):67–81.
Linde, Y., Buzo, A., and Gray, R. M. (1980). Algorithm for Vector Quantization Design. IEEE
transactions on communications systems, 28(1):84–95.
Louppe, G. and Geurts, P. (2012). Ensembles on random patches. In Machine Learning and
Knowledge Discovery in Databases, pages 346–361. Springer.
Lugosi, G. and Wegkamp, M. (2004). Complexity regularization via localized random penalties.
The Annals of Statistics, 32(4):1679–1697.
Lukosevicius, M. (2012). A Practical Guide to Applying Echo State Networks. In Montavon, G.,
Orr, G. B., and Müller, K.-R., editors, Neural Networks: Tricks of the Trade (2nd ed.), volume
7700 of Lecture Notes in Computer Science, pages 659–686. Springer.
MacQueen, J. B. (1967). Some Methods for Classification and Analysis of MultiVariate Obser-
vations. In Cam, L. M. L. and Neyman, J., editors, Proc. of the fifth Berkeley Symposium
on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California
Press.
Martens, J. (2010). Deep learning via Hessian-free optimization. In Fürnkranz, J. and Joachims,
T., editors, Proc. of the International Conference on Machine Learning (ICML) 2010, pages
735–742. Omnipress.
Martinez, T. M. and Schulten, K. J. (1994). Topology Representing Networks. Neural Networks,
7(3):507–522.
Mason, L., Baxter, J., Bartlett, P., and Frean, M. (1999). Boosting algorithms as gradient descent
in function space. In Neural Information Processing Systems (NIPS).
McCulloch, W. S. and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous
activity. Bulletin of Mathematical Biophysics, 5:115–133.
Mhaskar, H. and Micchelli, C. (1995). Degree of Approximation by Neural and Translation Net-
works with a Single Hidden Layer. Advances in Applied Mathematics, 16(2):151–183.
Munos, R. (2014). From bandits to Monte-Carlo Tree Search: The optimistic principle applied to
optimization and planning. Foundations and Trends in Machine Learning.
Muselli, M. (1997). On convergence properties of pocket algorithm. IEEE Transactions on Neural
Networks, 8(3):623–629.
Nair, V. and Hinton, G. E. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines.
In Fürnkranz, J. and Joachims, T., editors, ICML, pages 807–814. Omnipress.
Ormoneit, D. and Sen, S. (2002). Kernel-Based Reinforcement Learning. Machine Learning,
49:161–178.
Ozay, M. and Vural, F. T. Y. (2012). A new fuzzy stacked generalization technique and analysis
of its performance. arXiv preprint arXiv:1204.0171.
Park, J. and Sandberg, I. W. (1991). Universal Approximation Using Radial-Basis-Function Net-
works. Neural Computation, 3:246–257.
Pascanu, R., Montufar, G., and Bengio, Y. (2013). On the number of inference regions of deep
feed forward networks with piece-wise linear activations. CoRR, abs/1312.6098.
Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. Philo-
sophical Magazine Series 6, 2(11):559–572.
Peng, J. X., Li, K., and Irwin, G. W. (2007). A Novel Continuous Forward Algorithm for RBF
Neural Modelling. IEEE Trans. Automat. Contr., 52(1):117–122.
Peters, J. and Schaal, S. (2008). Natural Actor-Critic. Neurocomputing, 71:1180–1190.
Platt, J. (1998). Fast training of support vector machines using sequential minimal optimization.
In Schölkopf, B., Burges, C., and Smola, A., editors, Advances in Kernel Methods: Support
Vector Machines. MIT Press, Cambridge, MA.
Poupart, P., Vlassis, N., Hoey, J., and Regan, K. (2006). An analytic solution to discrete bayesian
reinforcement learning. In International Conference on Machine Learning (ICML), pages
697–704.
Prechelt, L. (1996). Early Stopping-But When? In Orr, G. B. and Müller, K.-R., editors, Neural
Networks: Tricks of the Trade, volume 1524 of Lecture Notes in Computer Science, pages
55–69. Springer.
Pudil, P., Novovičová, J., and Kittler, J. (1993). Floating search methods in feature selection. Pattern
Recognition Letters, 15:1119–1125.
Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming.
Wiley-Interscience.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
Ranzato, M. A., Poultney, C. S., Chopra, S., and LeCun, Y. (2006). Efficient Learning of Sparse
Representations with an Energy-Based Model. In Schölkopf, B., Platt, J. C., and Hoffman,
T., editors, NIPS 2006, pages 1137–1144. MIT Press.
Riedmiller, M. (2005). Neural fitted q iteration–first experiences with a data efficient neural
reinforcement learning method. In European Conference on Machine Learning (ECML), pages
317–328. Springer.
Rodan, A. and Tiño, P. (2011). Minimum Complexity Echo State Network. IEEE Transactions
on Neural Networks, 22(1):131–144.
Rosenblatt, F. (1962). Principles of neurodynamics; perceptrons and the theory of brain mecha-
nisms. Washington, Spartan Books.
Roweis, S. T. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embed-
ding. Science, 290:2323–2326.
Rumelhart, D., Hinton, G., and Williams, R. (1986). Learning Internal Representations by Error
Propagation, pages 318–362. MIT Press.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A.,
Khosla, A., Bernstein, M. S., Berg, A. C., and Li, F.-F. (2014). ImageNet Large Scale Visual
Recognition Challenge. CoRR, abs/1409.0575.
Salakhutdinov, R. and Hinton, G. E. (2009). Deep Boltzmann Machines. Journal of Machine
Learning Research - Proceedings Track, 5:448–455.
Sammon, J. W. (1969). A Nonlinear Mapping for Data Structure Analysis. IEEE Trans. Comput.,
18(5):401–409.
Schapire, R. E. and Freund, Y. (2012). Boosting: Foundations and algorithms. MIT press.
Scherer, D., Müller, A. C., and Behnke, S. (2010). Evaluation of Pooling Operations in Convo-
lutional Architectures for Object Recognition. In Diamantaras, K. I., Duch, W., and Iliadis,
L. S., editors, ICANN (3), volume 6354 of Lecture Notes in Computer Science, pages 92–101.
Springer.
Scherrer, B. (2010). Should one compute the Temporal Difference fix point or minimize the Bellman
Residual? The unified oblique projection view. In International Conference on Machine
Learning (ICML), pages 959–966.
Scherrer, B., Ghavamzadeh, M., Gabillon, V., Lesner, B., and Geist, M. (2015). Approximate
Modified Policy Iteration and its Application to the Game of Tetris. Journal of Machine
Learning Research.
Schmidhuber, J. (1992). Learning Complex, Extended Sequences Using the Principle of History
Compression. Neural Computation, 4(2):234–242.
Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61:85–
117.
Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A. J., and Williamson, R. C. (2001). Estimating
the support of a high-dimensional distribution. Neural Computation, 13:1443–1471.
Scholkopf, B., Smola, A., and Müller, K.-R. (1999). Kernel principal component analysis. In
Advances in kernel methods - support vector learning, pages 327–352. MIT Press.
Schölkopf, B., Smola, A. J., Williamson, R. C., and Bartlett, P. L. (2000). New support vector
algorithms. Neural Computation, 12:1207–1245.
Schwenker, F., Kestler, H. A., and Palm, G. (2001). Three learning phases for radial-basis-function
networks. Neural Networks, 14(4-5):439–458.
Shawe-Taylor, J. and Cristanini, N. (2004). Kernel Methods for Pattern Analysis. Cambridge
University Press.
Sigaud, O. and Buffet, O. (2013). Markov decision processes in artificial intelligence. John Wiley
& Sons.
Simard, P. Y., Steinkraus, D., and Platt, J. C. (2003). Best Practices for Convolutional Neural
Networks Applied to Visual Document Analysis. In ICDAR, pages 958–962. IEEE Computer
Society.
Smyth, P. and Wolpert, D. (1999). Linearly combining density estimators via stacking. Machine
Learning, 36(1-2):59–83.
Soheili, N. (2014). Elementary Algorithms for Solving Convex Optimization Problems. PhD thesis,
Carnegie Mellon University.
Soheili, N. and Pena, J. (2013). A primal-dual smooth perceptron-von neumann algorithm. Discrete
Geometry and Optimization, 69:303–320.
Somol, P., Novovicova, J., and Pudil, P. (2010). Pattern recognition recent advances, chapter
Efficient Feature Subset Selection and Subset Size Optimization. InTech.
Somol, P., Pudil, P., Novovičová, J., and Paclík, P. (1999). Adaptive floating search methods in
feature selection. Pattern Recognition Letters, 20(11-13):1157–1163.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout:
A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning
Research, 15:1929–1958.
Steinwart, I. (2007). How to compare different loss functions and their risks. Constructive Approx-
imation, 26(2):225–287.
Sutskever, I. (2013). Training recurrent neural networks. PhD thesis, University of Toronto.
Sutskever, I., Martens, J., and Hinton, G. E. (2011). Generating Text with Recurrent Neural
Networks. In Getoor, L. and Scheffer, T., editors, ICML, pages 1017–1024. Omnipress.
Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to Sequence Learning with Neural
Networks. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger,
K. Q., editors, NIPS, pages 3104–3112.
Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. The MIT Press.
Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (1999). Policy Gradient Methods
for Reinforcement Learning with Function Approximation. In Neural Information Processing
Systems (NIPS), pages 1057–1063.
Szepesvári, C. (2010). Algorithms for reinforcement learning. Synthesis Lectures on Artificial
Intelligence and Machine Learning, 4(1):1–103.
Tenenbaum, J. B., de Silva, V., and Langford, J. C. (2000). A Global Geometric Framework for
Nonlinear Dimensionality Reduction. Science, 290(5500):2319.
Thomas, P., Theocharous, G., and Ghavamzadeh, M. (2015). High confidence off-policy evaluation.
In Twenty-Ninth AAAI Conference on Artificial Intelligence.
Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view
of the evidence of two samples. Biometrika, pages 285–294.
Tibshirani, R. (1996a). Regression shrinkage and selection via the LASSO. Journal of the Royal
Statistical Society. Series B (Methodological), pages 267–288.
Tibshirani, R. (1996b). Regression shrinkage and selection via the Lasso. Journal of the Royal
Statistical Society. Series B (Methodological), pages 267–288.
Tikhonov, A. (1963). Solution of incorrectly formulated problems and the regularization method.
In Soviet Mathematics, volume 5, pages 1035–1038.
Tipping, M. E. (2001). Sparse Kernel Principal Component Analysis. In Leen, T., Dietterich, T.,
and Tresp, V., editors, Advances in Neural Information Processing Systems 13, pages 633–639.
MIT Press.
Triefenbach, F., Jalalvand, A., Schrauwen, B., and Martens, J.-P. (2010). Phoneme Recognition
with Large Hierarchical Reservoirs. In Lafferty, J. D., Williams, C. K. I., Shawe-Taylor, J.,
Zemel, R. S., and Culotta, A., editors, NIPS, pages 2307–2315. Curran Associates, Inc.
Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11):1134–1142.
van der Maaten, L. (2014). Accelerating t-SNE using Tree-Based algorithms. Journal of Machine
Learning Research, 15:3221–3245.
van der Maaten, L. and Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine
Learning Research, 9:2579–2605.
Vapnik, V. N. (1998). Statistical Learning Theory. Wiley-Interscience.
Vapnik, V. N. (1999). An overview of statistical learning theory. Neural Networks, IEEE Trans-
actions on, 10(5):988–999.
Vapnik, V. N. (2000). The Nature of Statistical Learning Theory. Statistics for Engineering and
Information Science. Springer.
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and composing
robust features with denoising autoencoders. In Cohen, W. W., McCallum, A., and Roweis,
S. T., editors, ICML, volume 307 of ACM International Conference Proceeding Series, pages
1096–1103. ACM.
Viola, P. and Jones, M. (2001). Rapid object detection using a boosted cascade of simple features.
In Computer Vision and Pattern Recognition (CVPR). IEEE.
Vlassis, N., Ghavamzadeh, M., Mannor, S., and Poupart, P. (2012). Bayesian reinforcement learn-
ing. In Reinforcement Learning, pages 359–386. Springer.
Waibel, A. H., Hanazawa, T., Hinton, G. E., Shikano, K., and Lang, K. J. (1989). Phoneme
recognition using time-delay neural networks. IEEE Trans. Acoustics, Speech, and Signal
Processing, 37(3):328–339.
Wang, C., Venkatesh, S. S., and Judd, J. S. (1993). Optimal Stopping and Effective Machine
Complexity in Learning. In Cowan, J. D., Tesauro, G., and Alspector, J., editors, NIPS,
pages 303–310. Morgan Kaufmann.
Wang, J. (2013). Boosting the generalized margin in cost-sensitive multiclass classification. Journal
of Computational and Graphical Statistics, 22(1):178–192.
Wegkamp, M. (2003). Model selection in nonparametric regression. Ann. Statist., 31(1):252–273.
Werbos, P. (1981). Application of advances in nonlinear sensitivity analysis. In Proc. of the 10th
IFIP conference, pages 762–770.
Werbos, P. J. (1988). Generalization of backpropagation with application to a recurrent gas market
model. Neural Networks, 1(4):339–356.
Widrow, B. and Hoff, M. (1962). Associative Storage and Retrieval of Digital Information in
Networks of Adaptive Neurons. Biological Prototypes and Synthetic Systems, 1.
Williams, R. J. and Zipser, D. (1989). A Learning Algorithm for Continually Running Fully
Recurrent Neural Networks. Neural Computation, 1(2):270–280.
Wolpert, D. H. (1992). Stacked generalization. Neural networks, 5(2):241–259.
Yu, D. and Deng, L. (2015). Automatic Speech Recognition A Deep Learning Approach. Springer-
Verlag.
Zeiler, M., Ranzato, M., Monga, R., Mao, M., Yang, K., Le, Q., Nguyen, P., Senior, A., Vanhoucke,
V., Dean, J., and Hinton, G. (2013). On Rectified Linear Units For Speech Processing. In
38th International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages
3517–3521, Vancouver. IEEE.
Zhao, Z., Morstatter, F., Sharma, S., Alelyani, S., Anand, A., and Liu, H. (2008). Advancing
Feature Selection Research - ASU Feature Selection Repository. Technical report, Arizona
State University.
Zou, H. and Hastie, T. (2003). Regularization and Variable Selection via the Elastic Net. Journal
of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320.