
Machine Learning

Last revision: October 13, 2019

Jérémy Fix

Hervé Frezza-Buet

Matthieu Geist

Frédéric Pennerath
Contents

I Overview 11
1 Introduction 13
1.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.1.1 Data sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.1.2 Data conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.2 Different learning problems... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.2.1 Unsupervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.2.2 Supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.2.3 Semi-supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.2.4 Reinforcement learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.3 Different learning strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.3.1 Inductive Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.3.2 Transductive learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

2 The frequentist approach 29


2.1 Hypothesis spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.1.1 Parametric and nonparametric hypothesis spaces . . . . . . . . . . . . . . . 29
2.1.2 The linear case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.1.3 Linear separability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2 Risks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2.1 Loss functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2.2 Real and empirical risks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.2.3 Are good predictors good ? . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.4 Empirical risk minimization and overfitting . . . . . . . . . . . . . . . . . . 33
2.3 Ensemble methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.1 Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.2 Random models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.3 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3 The Bayesian approach 37


3.1 Density of probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.1.1 Reminder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.1.2 Joint and conditional densities of probability . . . . . . . . . . . . . . . . . 37
3.1.3 The Bayes’ rule for densities of probability . . . . . . . . . . . . . . . . . . 38
3.2 Bayesian inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.1 The model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.2 The parameter update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.3 Update from bunches of data . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3 Bayesian learning for real . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

4 Evaluation 45
4.1 Real risk estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.1.1 Cross-validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.1.2 Real risk optimisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2 The specific case of classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46


4.2.1 Confusion matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46


4.2.2 The specific case of bi-class problems . . . . . . . . . . . . . . . . . . . . . . 48

II Concepts for Machine Learning 51

5 Risks 53
5.1 Controlling the risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.1.1 The considered learning paradigm . . . . . . . . . . . . . . . . . . . . . . . 53
5.1.2 Bias-variance decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.1.3 Consistency of empirical risk minimization . . . . . . . . . . . . . . . . . . . 56
5.1.4 Towards bounds on the risk . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.1.5 To go further . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.2 Classification, convex surrogates and calibration . . . . . . . . . . . . . . . . . . . . 62
5.2.1 Binary classification with binary loss . . . . . . . . . . . . . . . . . . . . . 62
5.2.2 Cost-sensitive multiclass classification . . . . . . . . . . . . . . . . . . . . . 64
5.2.3 Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.2.4 To go further . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.3 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.3.1 Penalizing complex solutions . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.3.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.3.3 To go further . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

6 Preprocessing 69
6.1 Selecting and conditioning data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.1.1 Collecting data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.1.2 Conditioning the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.2 Dimensionality reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.2.1 Variable selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.2.2 Principal Component Analysis (PCA) . . . . . . . . . . . . . . . . . . . . . 75
6.2.3 Relationship between covariance, Gram and euclidean distance matrices . . 83
6.2.4 Principal component analysis in large dimensional space (N ≪ d) . . . . . . 85
6.2.5 Kernel PCA (KPCA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.2.6 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.2.7 Manifold learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

III Support vector machines 93

7 Introduction 95
7.1 Acknowledgment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.3 By the way, what is an SVM? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.4 How does it work? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

8 Linear separator 97
8.1 Problem Features and Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
8.1.1 The Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
8.1.2 The Linear Separator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
8.2 Separability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
8.3 Margin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

9 An optimisation problem 101


9.1 The problem the SVM has to solve . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
9.1.1 The separable case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
9.1.2 General case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
9.1.3 Relation with the ERM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
9.2 Lagrangian resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
9.2.1 A convex problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
9.2.2 The direct problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
9.2.3 The Dual Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
9.2.4 An intuitive view of optimization under constraints . . . . . . . . . . . . . . 105
9.2.5 Back to the specific case of SVMs . . . . . . . . . . . . . . . . . . . . . . . 109

10 Kernels 113
10.1 The feature space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
10.2 Which Functions Are Kernels? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
10.2.1 A Simple Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
10.2.2 Conditions for a kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
10.2.3 Reference kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
10.2.4 Assembling kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
10.3 The core idea for SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
10.4 Some Kernel Tricks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
10.4.1 Data Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
10.4.2 Centering and Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
10.5 Kernels for Structured Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
10.5.1 Document Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
10.5.2 Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
10.5.3 Other Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

11 Solving SVMs 121


11.1 Quadratic Optimization Problems with SMO . . . . . . . . . . . . . . . . . . . . . 121
11.1.1 General Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
11.1.2 Optimality Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
11.1.3 Optimisation Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
11.1.4 Numerical Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
11.2 Parameter Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

12 Regression 125
12.1 Definition of the Optimization Problem . . . . . . . . . . . . . . . . . . . . . . . . 125
12.2 Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
12.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

13 Compendium of SVMs 129


13.1 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
13.1.1 C-SVC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
13.1.2 ν-SVC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
13.2 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
13.2.1 ε-SVR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
13.2.2 ν-SVR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
13.3 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
13.3.1 Minimal enclosing sphere . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
13.3.2 One-class SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

IV Vector Quantization 133


14 Introduction and notations for vector quantization 135
14.1 An unsupervised learning problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
14.1.1 Formalization as a dummy supervised learning problem . . . . . . . . . . . 135
14.1.2 Choosing the suitable loss function . . . . . . . . . . . . . . . . . . . . . . 136
14.1.3 Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
14.2 Minimum of distortion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
14.2.1 Non unicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
14.2.2 Sensitivity to the density of input samples . . . . . . . . . . . . . . . . . . . 138
14.2.3 Controlling the quantization accuracy . . . . . . . . . . . . . . . . . . . . . 139
14.3 Preserving topology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
14.3.1 Notations for graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
14.3.2 Masked Delaunay triangulation . . . . . . . . . . . . . . . . . . . . . . . . 142
14.3.3 Structuring raw data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

15 Main algorithms 147


15.1 K-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
15.1.1 The Linde-Buzo-Gray algorithm . . . . . . . . . . . . . . . . . . . . . . . . 147
15.1.2 The online k-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
15.2 Incremental neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
15.2.1 Growing Neural Gas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
15.2.2 Growing Grid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
15.3 Self-Organizing Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
15.3.1 Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
15.3.2 Convergence issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
15.3.3 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

V Neural networks 155


16 Introduction 157

17 Feedforward neural networks 161


17.1 Single Layer perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
17.1.1 Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
17.1.2 ADaptive LINear Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
17.1.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
17.1.4 Single layer perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
17.2 Radial Basis Function networks (RBF) . . . . . . . . . . . . . . . . . . . . . . . . . 174
17.2.1 Architecture and training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
17.2.2 Universal approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
17.3 Multilayer perceptron (MLP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
17.3.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
17.3.2 Learning : error backpropagation . . . . . . . . . . . . . . . . . . . . . . . . 176
17.3.3 Universal approximator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
17.3.4 The need for using deep networks . . . . . . . . . . . . . . . . . . . . . . . . 180
17.4 Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
17.4.1 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
17.4.2 Learning procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
17.5 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
17.6 Convolutional neural networks : an early successful deep neural network . . . . . . 184
17.7 Autoencoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
17.8 Where is the problem with a deep neural network and how to alleviate it ? . . . . 186
17.9 Success stories of Deep neural networks . . . . . . . . . . . . . . . . . . . . . . . . 187

18 Recurrent neural networks 191


18.1 Dealing with temporal data using feedforward networks . . . . . . . . . . . . . . . 191
18.2 General recurrent neural network (RNN) . . . . . . . . . . . . . . . . . . . . . . . . 192
18.2.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
18.2.2 Real Time Recurrent Learning (RTRL) . . . . . . . . . . . . . . . . . . . . 193
18.2.3 Backpropagation Through Time (BPTT) . . . . . . . . . . . . . . . . . . . 194
18.2.4 What about the initial state ? . . . . . . . . . . . . . . . . . . . . . . . . . . 194
18.3 Echo state networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
18.4 Long Short Term memory (LSTM) . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
18.4.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
18.4.2 Example of applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198

19 Energy based models 201


19.1 Hopfield neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
19.1.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
19.1.2 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
19.1.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
19.2 Restricted Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
19.2.1 RBM with binary units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
19.2.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206

VI Ensemble methods 209


20 Introduction 211
20.1 Decision trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
20.1.1 Basic idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
20.1.2 Building regression trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
20.1.3 Building classification trees . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
20.1.4 More on trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
20.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215

21 Bagging 217
21.1 Bootstrap aggregating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
21.2 Random forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
21.3 Extremely randomized trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219

22 Boosting 221
22.1 AdaBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
22.1.1 Weighted binary classification . . . . . . . . . . . . . . . . . . . . . . . . . . 221
22.1.2 The AdaBoost algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
22.2 Derivation and partial analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
22.2.1 Forward stagewise additive modeling . . . . . . . . . . . . . . . . . . . . . . 223
22.2.2 Bounding the empirical risk . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
22.3 Restricted functional gradient descent . . . . . . . . . . . . . . . . . . . . . . . . . 226

VII Bayesian Machine Learning 231


23 Theoretical Foundations 235
23.1 Preliminary discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
23.2 A short reminder of elementary notions in probability theory . . . . . . . . . . . . 237
23.2.1 Probability and random variables . . . . . . . . . . . . . . . . . . . . . . . . 237
23.2.2 Joint distribution and independence . . . . . . . . . . . . . . . . . . . . . . 239
23.2.3 Conditional distributions and Bayes’ rule . . . . . . . . . . . . . . . . . . . 239
23.3 Bayesian Machine Learning in a weak sense . . . . . . . . . . . . . . . . . . . . . . 240
23.3.1 Maximal Likelihood Principle . . . . . . . . . . . . . . . . . . . . . . . . . . 241

23.3.2 A brute-force approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242


23.3.3 Bayesian Networks come to the rescue . . . . . . . . . . . . . . . . . . . . . 244
23.3.4 Continuous variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
23.3.5 Naive Bayes: the standard version . . . . . . . . . . . . . . . . . . . . . . . 254
23.4 Bayesian Machine Learning in a strong sense . . . . . . . . . . . . . . . . . . . . . 257
23.4.1 Principles of Bayesian statistics and inference . . . . . . . . . . . . . . . . 257
23.4.2 Conjugate Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
23.4.3 Bayesian estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265

24 Gaussian and Linear Models for Supervised Learning 267


24.1 Multivariate normal distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
24.1.1 Definition and fundamental properties . . . . . . . . . . . . . . . . . . . . . 268
24.2 Gaussian Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
24.2.1 Quadratic Discriminant Analysis (QDA) . . . . . . . . . . . . . . . . . . . . 271
24.2.2 Linear Discriminant Analysis (LDA) . . . . . . . . . . . . . . . . . . . . . . 272
24.2.3 Diagonal LDA and Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . 273
24.2.4 Comparison summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
24.3 Linear regression models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
24.3.1 Linear regression model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
24.3.2 Ordinary Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
24.3.3 Bayesian Linear Regression and Ridge regression . . . . . . . . . . . . . . . 276

25 Models with Latent Variables 277


25.1 Latent Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
25.2 EM Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
25.3 Bayesian Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
25.3.1 Mixture models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
25.3.2 Gaussian Mixture Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283

26 Markov Models 285


26.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
26.2 Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
26.2.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
26.2.2 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
26.2.3 Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
26.2.4 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
26.3 Hidden Markov models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
26.3.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
26.3.2 Bayesian filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
26.3.3 Bayesian smoothing and the forward-backward algorithm . . . . . . . . . . 293
26.3.4 Most probable trajectory and the Viterbi algorithm . . . . . . . . . . . . . 294
26.3.5 Learning HMM and the Baum-Welch algorithm . . . . . . . . . . . . . . . . 295
26.4 Continuous-state Markov models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
26.4.1 State-space representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
26.4.2 Kalman Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
26.4.3 Extended Kalman Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305

27 Non-parametric Bayesian methods and Gaussian Processes 309


27.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
27.2 Gaussian Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
27.2.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
27.2.2 Representation and sampling . . . . . . . . . . . . . . . . . . . . . . . . . . 310
27.2.3 Influence of kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
27.2.4 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
27.2.5 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
27.3 A word on complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316

28 Approximate Inference 319


28.1 Interest of sampling techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
28.2 Univariate sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
28.2.1 Direct sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
28.2.2 Rejection sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
28.3 Multivariate sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
28.3.1 Ancestral Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
28.3.2 Gibbs Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
28.3.3 Markov Chain Monte Carlo and the Metropolis-Hastings algorithm . . . . . 324
28.3.4 Importance sampling and particle filter . . . . . . . . . . . . . . . . . . . . 325

VIII Sequential Decision Making 329


29 Bandits 331
29.1 The stochastic bandit problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
29.2 Optimism in the face of uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . . 332
29.3 The UCB strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
29.4 More on bandits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338

30 Reinforcement learning 339


30.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
30.2 Formalism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
30.2.1 Markov Decision Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
30.2.2 Policy and value function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
30.2.3 Bellman operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
30.3 Dynamic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
30.3.1 Linear programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
30.3.2 Value iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
30.3.3 Policy iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
30.4 Approximate Dynamic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . 346
30.4.1 State-action value function . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
30.4.2 Approximate value iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
30.4.3 Approximate policy iteration . . . . . . . . . . . . . . . . . . . . . . . . . . 351
30.5 Online learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
30.5.1 SARSA and Q-learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
30.5.2 The exploration-exploitation dilemma . . . . . . . . . . . . . . . . . . . . . 357
30.6 Policy search and actor-critic methods . . . . . . . . . . . . . . . . . . . . . . . . . 358
30.6.1 The policy gradient theorem . . . . . . . . . . . . . . . . . . . . . . . . . . 359
30.6.2 Actor-critic methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
Part I

Overview

Chapter 1

Introduction

Endowing machines with the ability to learn things may sound excessive or even inappropriate, since the common meaning of learning is a cognitive process exhibited by humans or animals. Learning implies being able to reuse in the future what has been learned in the past, which requires some memory capacities. Before the age of computers, memory was also a concept mostly used for humans or animals. From the point of view of physicists, any system that keeps a trace of its history is obviously endowed with memory, and the process leading to the engram of that trace could naturally be called learning. Would we say that the solid matter inside a boiled egg is the memory of the cooking stage? Was the boiling itself a learning process?
In this document, we will not enter into a semiotic analysis of learning. Instead, let us sketch out what learning is in the so-called “machine learning” field. This first part aims at outlining that field, highlighting the overall structure of a domain which has grown significantly during the last decades.

1.1 Datasets
One intends to use machine learning techniques when s/he has to cope with data. Extracting information from a dataset in order to further process some new data that was not in the initial dataset can be thought of as the core of machine learning. Every processing coming out of a machine learning approach is fundamentally data driven. Indeed, if the processing could have been designed from a specific knowledge about what it should do, then this knowledge could have been used for designing the process without any help from machine learning. For example, nobody uses collections of moon, earth and sun positions to predict eclipses, since the prediction process can be set up from Newton’s laws. As opposed to this, there is no known algorithm (i.e. law) telling how to recognize which handwritten digits are written in the zip-code area of a letter. For such problems, learning is to be considered, trying to set up a process that is grounded on a huge amount of data (zip-code scan images and their transcription into digits) to perform its recognition task.

1.1.1 Data sampling


The model
Let Z stands for the set where data are taken from. For example, let us consider the students of the
Machine Learning class, represented by their weight in kilograms and height in centimeters. The
def
data for this case are taken in Z = [20, 160]×[50, 250] where the datum z = (weight, height) ∈ Z
represents a specific student. In machine learning, data are considered to be distributed according
to some unknown distribution. In other words, let us consider a random variable Z taking values
in Z whose probability distribution is PZ . Let us denote by pZ the corresponding probability
density, that is supposed to exist. In other words, for A ⊂ Z,
Z
def def
P (Z ∈ A) = PZ (A) = pZ (z) dz
A


Taking a realization z of the random variable Z is denoted by z ⇠ P_Z in the following.

The random variable Z drives the sampling of a dataset S = {z_1, · · · , z_i, · · · , z_N} such that ∀i, z_i ⇠ P_Z, i.e. the z_i are independent and identically distributed¹ (i.i.d.). In the fake case of the students of the Machine Learning class, the dataset depicted in figure 1.1 is considered for further examples.

1 In probability theory, instead of dealing with samples tossed from a probability distribution, one rather considers a single random variable made of N independent and identically distributed variables, which is more rigorous than considering samples as done here.

Figure 1.1: The dataset S of the weights and heights of the students attending the Machine
Learning class (fake data). |S| = N = 300.

Sometimes, the generation of data (i.e. P_Z) comes with a supplementary process that associates a label to the data. This supplementary labelling process is called the oracle, and it is defined as a conditional distribution. In this case, let us rename the data into inputs, renaming the set Z, the datum z, the random variable Z and the distribution P_Z into X, x, X and P_X respectively. The label (or output) y ∈ Y given by the oracle to some input x results from a stochastic process as well. In other words, the oracle is defined as a conditional distribution P(Y | X), which is unknown as well. In this case, datasets are made of (x, y) pairs, sampled according to P_X and the oracle, i.e.

\[
z \stackrel{\text{def}}{=} (x, y), \qquad \mathcal{Z} \stackrel{\text{def}}{=} \mathcal{X} \times \mathcal{Y}, \qquad p_Z(x, y) \stackrel{\text{def}}{=} p_{Y|X}(y \mid x)\, p_X(x) \tag{1.1}
\]

From a computational point of view, the dataset S is as if it had been generated by algorithm 1. In the example of the Machine Learning class students, let us now add a supplementary Boolean

attribute to each student, telling whether s/he belongs to the University Wrestling Team. Our dataset is now made of pairs ((w, h), b), with (w, h) the weight and height and b set to true if the student belongs to the Wrestling Team. Such a dataset is represented in figure 1.2. One can see that the sample distribution is the same as in figure 1.1, but labels have been added. Dealing with such data in the context of machine learning makes the assumption that the dataset could have been obtained from algorithm 1, i.e. that whether a student belongs to the wrestling team or not can somehow be deduced from his/her weight and height.

Algorithm 1 MakeDataSet
1: S ← ∅
2: for i = 1 to N do
3:   x ⇠ P_X // Toss the input.
4:   y ⇠ P_{Y | X=x} // Thanks to the oracle, toss the output from x.
5:   S ← S ∪ {(x, y)}
6: end for
7: return S
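To make algorithm 1 concrete, here is a minimal Python sketch. The sampling distribution and the oracle below are fake placeholders chosen for illustration; in practice P_X and P(Y | X) are unknown.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_input():
    # Toss a fake (weight, height) pair; stands in for the unknown P_X.
    return rng.normal(loc=[70.0, 175.0], scale=[12.0, 10.0])

def oracle(x):
    # Fake stochastic oracle P(Y | X = x): heavier students are more
    # likely to be labelled as wrestlers.
    weight, height = x
    p_wrestler = 1.0 / (1.0 + np.exp(-(weight - 90.0) / 5.0))
    return rng.random() < p_wrestler

def make_dataset(N):
    # Mirrors algorithm 1: i.i.d. inputs, outputs tossed by the oracle.
    return [(x, oracle(x)) for x in (sample_input() for _ in range(N))]

S = make_dataset(300)
```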

Figure 1.2: Weights and heights of the students attending the Machine Learning class. The ones
belonging to the University Wrestling Team are plotted in green.

Ways of sampling
The way the data feed the learning process is important in practical applications of machine
learning techniques. Indeed, some of them are restricted to a specific sampling strategy. Let us
overview the strategies here.


The first strategy is batch sampling. It consists in providing the learning process with a batch of data, sampled once; the learning process then computes from this given dataset. The second strategy is online sampling. It consists in feeding the learning process with a continuous flow of successive data samples. In this case, each piece of data contributes to an update of the learning process. As opposed to the batch case, where one has to wait for the end of the dataset processing, an online learning process is therefore anytime. Some hybrid methods also exist. They are fed with a continuous flow of usually small datasets, and the learning process updates as each small dataset is submitted. Such small datasets are often referred to as mini-batches, and the corresponding update is called an epoch.
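As a rough illustration, here is a hedged Python sketch of the three update regimes, assuming some model object exposing a hypothetical `update(samples)` method:

```python
def batch_learning(model, dataset):
    # The whole dataset is available at once; one computation uses all of it.
    model.update(dataset)

def online_learning(model, stream):
    # Each incoming sample immediately updates the model, which is usable anytime.
    for sample in stream:
        model.update([sample])

def mini_batch_learning(model, stream, batch_size=32):
    # Hybrid scheme: small datasets (mini-batches) are processed one after the other.
    buffer = []
    for sample in stream:
        buffer.append(sample)
        if len(buffer) == batch_size:
            model.update(buffer)
            buffer = []
```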
For the previously mentioned ways of sampling, the sampling was strictly the sampling of the random variable Z (i.i.d.). When the data contain labels, the labelling process may be very expensive. In the case of skin cancer diagnosis for example, selecting a random picture of the back of some patient (i.e. the input x) is easy, but labelling requires asking a medical expert to analyze the picture in order to set the appropriate label y (cancer or not). In this case, one may have to select the pictures before submitting them to the expert. Indeed, if some picture is very similar to a previously labelled picture, the cost of asking the expert may not be worth it. This sample selection is called active sampling. It breaks the i.i.d. nature of the samples², while i.i.d. sampling is often required for theoretical convergence guarantees.

2 Indeed, they are still independent but the distribution changes.

1.1.2 Data conditioning


Using machine learning for real problems requires being able to find the suitable learning technique, but also to preprocess the data.

Data preprocessing is in most cases the place for introducing problem-specific knowledge, such as dedicated signal processing, suitable metrics, etc.

Each datum z is indeed a set of attributes, i.e. a set of key-value pairs. In the case of our students in figure 1.2, there are three attributes: weight, height and is_wrestler. Each datum is a specific value for each of the attributes.
In practical cases, some values may be missing for some attributes, which may raise problems. Some values may also be over-represented. In the example of the wrestler students in figure 1.2, as few of them actually belong to the wrestling team, blindly predicting that a student does not belong to the team may not be such a bad prediction. To avoid this, one may try to balance the dataset, picking samples such that there is an equal amount of wrestler and non-wrestler students. This must be done carefully, since the dataset is then no longer sampled according to algorithm 1, as discussed further in section 2.2.3.
The type of the attributes has to be considered carefully. Some attributes are numerical. This is the case for the weight and height in our example. Some others are categorical. This is the case for is_wrestler. A nationality attribute is also categorical, since there is no order between nationality values, they cannot be added, etc.

Many machine learning algorithms handle vectors, i.e. a set of scalar attributes. Representing categorical attribute values with numbers, for example using USA = 1, Belgium = 2, France = 3, · · · for a nationality attribute, induces scalar operations on the values, whose semantics may be silly. In the given example, France > USA, and Belgium is an intermediate value between France and USA.
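A common remedy, not detailed in this document, is to expand such a categorical attribute into one Boolean indicator per possible value (often called one-hot encoding), so that no spurious ordering or arithmetic is induced. A minimal sketch of the idea:

```python
def one_hot(value, values):
    # One indicator per possible categorical value.
    return [1.0 if value == v else 0.0 for v in values]

nationalities = ["USA", "Belgium", "France"]
print(one_hot("Belgium", nationalities))  # [0.0, 1.0, 0.0]
```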

Another problem that arises with vectorial data is the scale of the different attributes. In our student example, if weights were given in grams and heights in meters, the first dimension of the vectorial data (the weight) would live in a [20000, 160000] range, while the second (the height) would live in a [0, 3] range. With such scales, the point cloud of figure 1.1 would be a flat horizontal line. In other words, the learning algorithm would consider the height as a constant, fluctuating slightly from one data sample to the other³. To avoid this, attribute values may be rescaled (usually standardization⁴ is performed).
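As an illustration of the standardization mentioned above, here is a minimal NumPy sketch (per-attribute rescaling to zero mean and unit variance; the sample values are made up):

```python
import numpy as np

def standardize(X):
    # X has one row per sample and one column per numerical attribute.
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0.0] = 1.0  # Guard against constant attributes.
    return (X - mean) / std

X = np.array([[70000.0, 1.75], [55000.0, 1.62], [95000.0, 1.90]])  # grams, meters
print(standardize(X))
```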
The high number of attributes in some data may also lead to an increase of the computation time, as well as a lack of generalization performance. To avoid this, variable selection methods exist that select, among all the attributes, a subset that appears to be useful for the learning. This is detailed in section 6.2.1.
Last, let us raise a warning: be aware of the attribute meaning. For example, if in the student dataset all wrestlers are registered first, and if the line number in the datafile is an attribute, comparing this line number to the appropriate threshold (that is, the number of wrestlers) leads to a perfect, but silly, learning. Such silly cases may occur since toolboxes for machine learning usually provide scripts that take the data file as input and may use all attributes for doing their job. Be sure that datafiles do not need cleaning before feeding the toolbox.

1.2 Different learning problems...


Let us overview in this section the main classes of learning problems. Indeed, identifying the right
nature of the problem to be learned drives the whole design of a learning procedure.

1.2.1 Unsupervised learning


Unsupervised learning consists in handling data that are not labelled. In other words, samples z
are only tossed from a distribution PZ , as in figure 1.1, and no oracle is involved in the process.
With such a paradigm, the only thing that can be done is somehow describing the distribution.

Membership test

Relying on the samples in the dataset S, one can determine whether a point belongs or not to the distribution of the samples⁵. A membership test for a point could for example consist in finding the two samples closest to the tested point, measuring the two distances from the point to these selected samples, averaging these distances, and comparing the result to some threshold. Figure 1.3 shows the result. Once computed, the membership function can be used to say that a student with a 120 cm height and a 100 kg weight is very unlikely to be one of my students⁶.
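A minimal Python sketch of this two-nearest-neighbour membership test (the threshold and the tiny dataset below are arbitrary assumptions, in the spirit of figure 1.3):

```python
import numpy as np

def membership_test(point, S, threshold=7.0):
    # Average the distances to the two closest samples and compare to a threshold.
    dists = np.sort(np.linalg.norm(S - point, axis=1))
    return dists[:2].mean() <= threshold

S = np.array([[72.0, 178.0], [65.0, 170.0], [90.0, 185.0]])  # (weight, height) samples
print(membership_test(np.array([100.0, 120.0]), S))  # Far from the samples: False
```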

Identification of components

Another way to extract information from a distribution of samples is to recognize some components. For example, let us consider in our case lines as elementary components. One can describe the data with its main fitting lines, as in figure 1.4. Projecting high-dimensional data onto a few lines, adjusted according to the distribution, may make it possible to express the high-dimensional original data in a reduced space while preserving its variability.
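For instance, the main fitting line can be obtained from the leading principal direction of the samples; a minimal NumPy sketch (a preview of the PCA of section 6.2.2, on made-up data):

```python
import numpy as np

def main_line(S):
    # Line through the mean of the samples, along the direction of
    # largest variance (first principal component).
    mean = S.mean(axis=0)
    _, _, Vt = np.linalg.svd(S - mean, full_matrices=False)
    direction = Vt[0]  # Unit vector of the leading component.
    return mean, direction

S = np.random.default_rng(0).normal(size=(300, 2)) @ np.array([[12.0, 3.0], [3.0, 8.0]])
mean, direction = main_line(S)
```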

Clustering

Last, some algorithms make it possible to represent the samples as a set of clusters, as in figure 1.5. This may help to set up an a posteriori identification of groups inside the data. Analyzing the groups/clusters in figure 1.5 may lead me to guess that there are three kinds of students in my class. The examination of the small cluster on the right suggests that some students form a specific group. I may investigate into this... and discover that there are wrestlers in the classroom.
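A minimal clustering sketch using scikit-learn's k-means (the choice of three clusters mirrors figure 1.5 and is an assumption; the data below are fake):

```python
import numpy as np
from sklearn.cluster import KMeans

S = np.random.default_rng(0).normal(loc=[70.0, 175.0], scale=[12.0, 10.0], size=(300, 2))
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(S)
print(np.bincount(labels))  # Number of students assigned to each cluster.
```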
3 E.g., the Euclidean distance order of magnitude between two samples is mainly due to their weight difference.
4 Rescaled such that the mean is 0 and the variance is 1.
5 Is it likely that the point could have belonged to some dataset generated with the unknown P_Z?
6 I cannot know why. Do such students actually exist? If yes, why do they not attend my lecture? Keep in mind that the data is fake, and so are the conclusions we get from it.

Figure 1.3: Weights and heights of the students attending the Machine Learning class. All 2D
points are submitted to the membership test function described in the text (with threshold 7).
The gray areas contain the points which passed the test.

Figure 1.4: Weights and heights of the students attending the Machine Learning class, linear
component analysis.

Figure 1.5: Weights and heights of the students attending the Machine Learning class, clustering.

1.2.2 Supervised learning


Supervised learning denotes learning problems where each datum consists of an input and an output (also called a label). Each sample z is thus a pair (x, y), where y is supposed to be deducible from x by a process that needs to be set up by learning. This is what figure 1.2 illustrates. Let us recall⁷ the notation z ∈ Z, with Z = X × Y here.

Bi-class classification
The most basic, but commonly addressed, supervised learning problem is two-class classification. A supervised learning problem is a two-class classification problem when |Y| = 2, as in figure 1.2 where Y = {wrestler, non wrestler}. Learning consists in setting up a labelling process that tells whether a student is a wrestler or not, according to his/her weight and height. In other words, the learned labelling process should predict the label y of an incoming new datum (x, y) from x only. A very classical example is the linear classifier. It consists of a hyperplane. A hyperplane splits the space into two regions. The linear classifier assigns one label to one region, and the other label to the other region. The learning consists in finding a suitable hyperplane, as in figure 1.6. When a new student enters the machine learning class, if the hyperplane is well placed⁸, one can guess whether that student is a wrestler or not, just by identifying in which region the student is.

Figure 1.6: Weights and heights of the students attending the Machine Learning class, linear
separation.
7 See section 1.1.1.
8 This is not always possible...

The case of the linear separation in figure 1.6 allows us to introduce the concept of classification score. Even if the output has only two values, the hyperplane is defined by ax + by + c = 0. So for any (x, y) datum, once a, b and c have been determined by learning, one can compute the scalar s = ax + by + c and decide, according to the sign of s, in which region, green or yellow, the input (x, y) lies. In this case, the binary decision for labelling relies on the previous computation of a scalar. The higher s is, the more distant from the line (x, y) is. So a highly positive s means that the classifier says “strongly” that the student is a wrestler. In general, some classifiers add a score to the label they produce. This can be a statistical score, or a geometrical score as for the hyperplane used in our example.
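A minimal sketch of such a score-based linear decision; the coefficients a, b, c below are arbitrary placeholders, not learned values:

```python
def linear_score(weight, height, a=0.9, b=-0.3, c=-10.0):
    # Signed score with respect to the hyperplane a*weight + b*height + c = 0.
    return a * weight + b * height + c

def predict_wrestler(weight, height):
    s = linear_score(weight, height)
    return ("wrestler" if s > 0 else "non wrestler", s)  # Label plus its score.

print(predict_wrestler(95.0, 180.0))
```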

Multi-class classification
A multi-class classification problem is a problem where Y is finite (i.e. |Y| < ∞). In the case of the students, the label could have been the name of the sport they practice, instead of only telling whether they practice wrestling or not. Even if some genuinely multi-class learning algorithms exist, there are also multi-class approaches that combine two-class algorithms. This can be achieved by a one-versus-all (or one-versus-rest) strategy (see algorithm 2 for learning, and then algorithm 3 for labelling a new input) or by a one-versus-one strategy (see algorithms 4 and 5). The former requires scores, as opposed to the latter. The latter requires more computation.

Algorithm 2 One vs All(learner, S)

1: classifiers ← ∅
2: for y ∈ Y do
3:   S′ ← {(x, 1_{y′=y}) | (x, y′) ∈ S}
4:   c ← learner(S′) // learner is a two-class learning algorithm, thus c is a two-class classifier.
5:   classifiers ← classifiers ∪ {(y, c)} // Let us store the classifier.
6:   // s = c(x) is the score of x. The higher s is, the more confident c is in saying that x should be labelled with y.
7: end for
8: return classifiers

Algorithm 3 One vs All classifier(classifiers, x)

1: (y⋆, c⋆) ← argmax_{(y,c) ∈ classifiers} c(x)
2: return y⋆
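A minimal Python sketch of the one-versus-all strategy of algorithms 2 and 3, assuming a hypothetical binary `learner` that returns a scoring function:

```python
def one_vs_all_train(learner, S):
    # One binary classifier per label; the positives are the samples of that label.
    labels = {y for _, y in S}
    return {y: learner([(x, 1 if y2 == y else 0) for x, y2 in S]) for y in labels}

def one_vs_all_predict(classifiers, x):
    # Pick the label whose classifier returns the highest score for x.
    return max(classifiers, key=lambda y: classifiers[y](x))
```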

Algorithm 4 One vs One(learner, S)

1: L ← {{y, y′} | (y, y′) ∈ Y × Y, y ≠ y′}
2: classifiers ← ∅
3: for {y, y′} ∈ L do
4:   S′ ← {(x, y″) ∈ S | y″ ∈ {y, y′}} // We extract from S the data with label y or y′.
5:   c ← learner(S′) // learner is a two-class learning algorithm, thus c is a two-class classifier.
6:   classifiers ← classifiers ∪ {c}
7: end for
8: return classifiers

Algorithm 5 One vs One classifier(classifiers, x)

1: return argmax_{y ∈ Y} Σ_{c ∈ classifiers} 1_{c(x)=y} // The sum is the number of classifiers that voted for class y.

Regression

A regression problem is a supervised learning problem where the labels are continuous, as opposed to classification. The standard case is Y = R. For example, if the dataset of the students attending the machine learning class contains their weights and heights, as well as a Boolean label telling whether a student is overweight or not, and if this Boolean is the label to be predicted from weight and height, the supervised learning problem is a classification problem, as presented previously. If now the BMI⁹ of the student is given instead of the overweight flag, and if this index value has to be predicted from weight and height, the supervised learning problem becomes a regression.
The distinction between classification and regression may sound artificial, since it only denotes whether Y is finite or continuous. From a mathematical point of view, such a distinction is irrelevant. Nevertheless, supervised learning methods are often dedicated to either regression or classification, which justifies the distinction between the two.
Last, let us mention the case where Y = Rⁿ. It can be solved with n scalar regression problems, one for predicting each component of the label. When this is applied, each component prediction is learned separately from the others, whereas they may not be independent. Some methods like neural networks¹⁰ handle multidimensional prediction as a whole, thus exploiting the possible relations between the dimensions.

1.2.3 Semi-supervised learning


In some supervised learning problems, getting the label is expensive. This occurs when a human expert has to be hired for the labelling, or when the labelling requires a large amount of time. The consequence is that many labels are missing in the data. This is called a semi-supervised learning problem. Such data are illustrated in figure 1.7, taking the example of the students attending the machine learning class.
Considering semi-supervised learning makes the assumption that the labelling process is smooth: close inputs are likely to be labelled identically. In figure 1.7, since the green (wrestler) dots are localized in the same region, I may infer that all students in that region are wrestlers. If I had to call one of the unlabelled students to ask him/her whether s/he is a wrestler or not, I would rather choose to call students who lie at the border of the regions. If I can afford only two phone calls, I would call the two thickened students in the figure. Actively choosing which examples have to be labelled first is referred to as active learning in the domain.

1.2.4 Reinforcement learning


Until now, we have addressed learning paradigms for which data are given to be analyzed. The reinforcement learning paradigm rather concerns control problems, and is thus very close to optimal control. Control is usually the realm of automatic control, and one may wonder why such a concept is addressed in a machine learning context. Indeed, reinforcement learning addresses problems where the dynamics of the controlled system is unknown to the controller, as opposed to automatic control where the equations of that dynamics are given. Let us introduce reinforcement learning with an example.

A control problem
Let us play the famous weakest link game, used in the Weakest Link TV show. Here, a single player is considered. The player tries to get the highest amount of money in a limited time. The game starts with a 20$ question. The player needs to answer it right to reach the next question. The next question is a 50$ one. If the player answers it right, s/he is asked the next question, which is a 100$ one. The values of the questions are 20, 50, 100, 200, 300, 450, 600, 800 and 1000$. Each time the player reaches a new question, s/he has two options. The first option is to try to answer it and go to the next stage. The second option is to say “bank”. In that case, the amount of money associated with the last question is actually won, and the game restarts from the first stage.
9 Body Mass Index
10 Multi-layered perceptrons indeed.

Figure 1.7: Weights and heights of the students attending the Machine Learning class. Blue dots
are unlabelled data. Yellow dots are students that are known to be non-wrestlers, whereas green
dots are those who are known to be wrestlers.

Let us model the game with a Markovian Decision Process (MDP). The first element of the MDP is a state space S. Here S = {q_0, · · · , q_i, · · · , q_9}, i.e. the nine question levels of the game and the initial state. Second, let us denote by A the set of actions. Here A = {answer, bank}. Third is the transition matrix¹¹ T, defined such that T^a_{s,s′} is the probability of reaching state s′ from state s when performing action a. The last element is a reward matrix R, defined such that R^a_{s,s′} is the expected reward when the transition (s, a, s′) occurs.
For the weakest link game, the modeling is quite easy. Let us suppose that the probability of answering a question right is p. The reward is deterministic here. The following holds.

• ∀s ∈ S, ∀s′ ∈ S \ {q_0}, T^{bank}_{s,s′} = 0 and T^{bank}_{s,q_0} = 1. Banking makes a transition to q_0.

• ∀i ∈ [0..8], T^{answer}_{q_i,q_{i+1}} = p and T^{answer}_{q_i,q_0} = 1 − p. Moreover, T^{answer}_{q_9,q_0} = 1, and the other transition probabilities are null.

• R^{bank}_{q_1,q_0} = 20, R^{bank}_{q_2,q_0} = 50, R^{bank}_{q_3,q_0} = 100, R^{bank}_{q_4,q_0} = 200, R^{bank}_{q_5,q_0} = 300, R^{bank}_{q_6,q_0} = 450, R^{bank}_{q_7,q_0} = 600, R^{bank}_{q_8,q_0} = 800, R^{bank}_{q_9,q_0} = 1000, and R^{a}_{s,s′} = 0 otherwise. This is the reward profile.

Figure 1.8 illustrates the modeling of the game with an MDP. The purpose of reinforcement learning is to solve the MDP, i.e. to find a policy that allows the player to accumulate the maximal amount of reward. Such a policy is called the optimal policy. A policy, in this context, is simply the function that tells, for each state, which action to perform. It can be stochastic, but it can be shown that the optimal policy is indeed deterministic.


Figure 1.8: Modelling the weakest link game with a Markovian Decision Process. See text for
details.

In the TV show, the players play during a fixed duration, and they have to maximize their return (i.e. the sum of the money obtained when banking). In reinforcement learning, usually, no such duration is given. Time stress is modeled as a probability 1 − γ, γ ∈ [0, 1[, of ending the game at each action. This is not modelled directly in T and R. The optimal policy, which is what reinforcement learning computes, takes γ into account. If γ is high, one can hope to reach the last states, and thus one may not bank for the first questions. If γ is low, it is better to be an Epicurean¹², i.e. to bank as soon as a question is answered correctly.

Resolution
Even for such a reduced problem, finding the right strategy is not obvious, even when the MDP (S, A, T, R, γ) is known. When everything is known, the problem of finding what to do in each state in order to accumulate the highest amount of reward (i.e. finding the optimal policy) is addressed by optimal control, a field of automatic control. Reinforcement learning addresses this problem when T and R are unknown. To do so, reinforcement learning methods often rely on inner supervised learning.
Figure 1.9 shows the optimal policy in different cases.
11 It is a 3D tensor indeed....
12 carpe diem
Figure 1.9: According to γ and p, the best policy consists in banking at a specific question level.
This is what this plot illustrates. See text for details.
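As an illustration of what "solving the MDP" means, here is a hedged NumPy sketch that builds the T and R tensors described above and computes, by value iteration (an algorithm detailed in chapter 30), the action to take in each state for given p and γ:

```python
import numpy as np

def weakest_link_mdp(p):
    # States q0..q9, actions 0 = answer, 1 = bank.
    n, rewards = 10, [0, 20, 50, 100, 200, 300, 450, 600, 800, 1000]
    T = np.zeros((2, n, n))
    R = np.zeros((2, n, n))
    for i in range(n):
        T[1, i, 0] = 1.0             # Banking always goes back to q0...
        R[1, i, 0] = rewards[i]      # ...and yields the money of the last question.
        if i < n - 1:
            T[0, i, i + 1] = p       # Correct answer: move to the next question.
            T[0, i, 0] = 1.0 - p     # Wrong answer: back to q0, no reward.
        else:
            T[0, i, 0] = 1.0         # Answering at q9 restarts the game.
    return T, R

def optimal_policy(T, R, gamma, n_iter=1000):
    # Value iteration: V(s) <- max_a sum_s' T[a,s,s'] * (R[a,s,s'] + gamma * V(s')).
    V = np.zeros(T.shape[1])
    for _ in range(n_iter):
        Q = (T * (R + gamma * V)).sum(axis=2)  # One value per (action, state).
        V = Q.max(axis=0)
    return Q.argmax(axis=0)                    # 0 = answer, 1 = bank, for each state.

T, R = weakest_link_mdp(p=0.8)
print(optimal_policy(T, R, gamma=0.9))
```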

1.3 Different learning strategies


1.3.1 Inductive Learning
The main strategy for learning is to rely on the dataset to build up a predictor. Learning a rule from examples is called induction. Therefore, computing a predictor from the data is an inductive learning process.
The way the predictor is induced is an induction principle. The principle that is mostly used is empirical risk minimization (ERM), as detailed further. Roughly speaking, it consists in finding a predictor that best predicts the labels in the dataset from their corresponding inputs. Will such a predictor be able to perform well on new data? We would like the answer to be yes, meaning that the ERM principle has good generalization capabilities. Indeed, inductive principles that lead to predictors having poor generalization capabilities are useless, since the best they can do is to retrieve the labels of the data inputs, while these labels are already known.
Controlling the generalization capabilities of inductive learning strategies is the core of the statistical analysis of the main machine learning algorithms.

1.3.2 Transductive learning


The transductive learning strategy concerns supervised learning. Instead of learning a predictor from a dataset, transductive learning consists in embedding the dataset in the predictor. Then, when the predictor receives some input, it computes its label from the labels of the embedded dataset.
To sum up, setting up a predictor from the dataset, which is the learning stage, is nothing but storing the dataset. All the computation is done when a new datum has to be labelled.
A famous example of transductive learning in classification is the k-nearest neighbours algorithm (see algorithm 6), whose classification is depicted in figure 1.10.

Algorithm 6 knn predict(k, S, x)

1: C = ∅
2: for i = 1 to k do
3:    (x⋆, y⋆) = argmin_{(x′,y′)∈S} |x′ − x|   // Find the data with the input closest to x.
4:    C ← C ∪ {(x⋆, y⋆)}
5:    S ← S \ {(x⋆, y⋆)}
6: end for
7:    y⋆ = argmax_{y∈Y} Σ_{(x′,y′)∈C} 1_{y′=y}   // y⋆ is the most frequent label in C.
8: return y⋆
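As a complement to the pseudo-code, here is a brute-force Python sketch of algorithm 6 (the names and the toy dataset are ours):

import numpy as np
from collections import Counter

def knn_predict(k, S, x):
    """S: list of (input, label) pairs, x: query input (numpy arrays)."""
    # Keep the k samples whose inputs are closest to x.
    neighbours = sorted(S, key=lambda s: np.linalg.norm(s[0] - x))[:k]
    # Return the most frequent label among these k neighbours.
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

S = [(np.array([60.0, 170.0]), "A"), (np.array([80.0, 185.0]), "B"),
     (np.array([65.0, 172.0]), "A")]
print(knn_predict(2, S, np.array([62.0, 171.0])))   # -> "A"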

Figure 1.10: Weights and heights of the students attending the Machine Learning class. The
colored areas correspond to the labels given by a KNN (with k = 10) relying on the dataset.
Chapter 2

The frequentist approach

The frequentist approach consists in relying on samples in order to estimate probabilities. It is
opposed to the Bayesian approach, where probability distributions are manipulated instead
(see chapter 3). Let us introduce here the main objects involved in machine learning when the
frequentist approach is used. Supervised learning is mainly addressed here.

2.1 Hypothesis spaces


In the supervised learning case, inductive learning consists in building up a predictor, i.e. a function
p ∈ Y^X, that associates a label y = p(x) to any input x ∈ X. In practice, a particular algorithm
is not able to search within the whole set of labelling functions Y^X, but only within a subset H ⊂ Y^X of
it, called the hypothesis space.

2.1.1 Parametric and nonparametric hypothesis spaces


Basically, hypothesis spaces are function sets. It is usual to define parametric functions, i.e. functions
whose definition is related to some parameter. Let θ ∈ Θ be a parameter and let f ∈ Y^{Θ×X} be a function
such that y = f(θ, x). Considering θ as a parameter leads to define f_θ ∈ Y^X as f_θ(x) = f(θ, x).
The set H = {f_θ | θ ∈ Θ} ⊂ Y^X is a parametric hypothesis space induced by f. Parametric hypothesis
spaces are interesting since H is a “very small” subset of the full functional space Y^X. As
a consequence, H can be explored conveniently thanks to an exploration of the parameter space Θ.
This exploration enables to search for an optimal function within parametric hypothesis spaces,
since it consists in searching some optimal θ? ∈ Θ. To do so, classical optimization techniques can
be used, as the well-known gradient descent.
Even if parametric hypothesis spaces are commonly used in machine learning, some methods
involve a nonparametric hypothesis space. In other words, H ⊂ Y X can hardly be expressed
as related to some parameter set Θ. For example, k-nearest neighbours (knn) predictors (see
algorithm 6) have been presented in section 1.3.2. Each knn-predictor is parametrized by a set
of samples. We could consider that Θ is the set of all finite subsets of X × Y, which is very big,
but usually, Θ is limited to Rn . When Θ is so big, it is rather said that the hypothesis space
is nonparametric. Another classical example of nonparametric hypothesis space is the set of all
decision trees allowing to assign a label y ∈ Y to some input x ∈ X .

2.1.2 The linear case


In machine learning, linear functions are commonly used, since many mathematical results are
available for them. Linear functions are a prototypical case of parametric hypothesis space. Let us
restrict here to scalar linear functions in R^n. Here, Θ is R^n as well, since f_θ(x) = θ^T x. Using linear
hypothesis spaces may appear restrictive, but there is a way to perform nonlinear computation
with linear methods. This way consists in mapping the input set X to some other set Φ, thanks


to a nonlinear function ϕ ∈ Φ^X. In this case, Θ = R^{dim(Φ)}, f_θ(x) = θ^T ϕ(x) and the hypothesis
set is H = {f_θ | θ ∈ R^{dim(Φ)}} ⊂ Y^X.
When such a projection is used, X is called the ambient space and Φ is called the feature
space. Usually, dim(Φ) ≫ dim(X), meaning that using the nonlinear projection ϕ consists in
applying linear methods in high dimension in order to get an overall nonlinear processing. This
has far-reaching consequences, as illustrated next.

2.1.3 Linear separability


Let us consider a bi-class supervised learning problem, with X = R^n and Y = {•, ◦}. A linear
separator in R^n is defined by (θ, b) ∈ R^n × R. It associates to some input x ∈ X the label ◦ when
θ^T x + b ≥ 0, and • otherwise^1. In other words, the associated label depends only on which side of
the hyperplane θ^T x + b = 0 the input x actually stands.
Let us now consider a dataset S = {(x_1, y_1), · · · , (x_i, y_i), · · · , (x_N, y_N)} ⊂ X × Y. It is said to
be linearly separable if some (θ, b) ∈ R^n × R exists such that, for all the samples in S, the separator
defined from (θ, b) gives the right label.

Linear separability is crucial in machine learning. Keep in mind that a collection of


n + 1 random points in Rn , given random binary labels, is likely to be separable. In
other words, a learning problem consisting in separating n + 1 points in Rn is easy,
and thus not interesting...

Let us illustrate this in R^2 with 3 points in figure 2.1. Of course, if the three points were aligned,
some labellings would not be separable.

Figure 2.1: For these points in R^2, any labelling makes the obtained dataset linearly separable.
This is true for most point configurations. This remark can be extended to n + 1 points in R^n.

Let us now consider the labeled points S in figure 2.2-left. They are obviously not linearly
separable. In this case, the trick is to project X into a feature space Φ, as mentioned in the previous
section, thanks to some function ϕ. In order to name components, x ∈ X = R^2 is denoted by
x = (x^1, x^2). Let us use

    ϕ(x) = (x^1, x^2, (x^1)^2 + (x^2)^2)^T    (2.1)

1 If we denote θ′ = (θ, b) ∈ R^{n+1} and x′ = (x, 1) ∈ R^{n+1}, the expression θ^T x + b rewrites as θ′^T x′. This is usually
done in order to avoid a specific formulation for the offset b in mathematical expressions.

Let us define ϕ(S) = {(ϕ(x), y) | (x, y) ∈ S}. Figure 2.2-middle shows that ϕ(S) can be separated.
The separation frontier in the ambient space X is obtained from a reverse projection of ϕ(X) ∩
{θ^T x + b = 0} computed in the feature space^2. This is how ϕ makes it possible to perform a non-linear
separation in the input space, while a linear separator is involved in the feature space.
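This projection can be sketched in a few lines of Python; the points and the separator below are made up for illustration, they are not the ones of figure 2.2.

import numpy as np

def phi(x):
    """Map a 2D input to the 3D feature space of equation (2.1)."""
    return np.array([x[0], x[1], x[0]**2 + x[1]**2])

# A labelling that is not linearly separable in R^2: negatives form a ring around positives.
X_pos = np.array([[0.1, 0.0], [-0.2, 0.1], [0.0, -0.1]])                # inside the circle
X_neg = np.array([[1.0, 0.0], [0.0, 1.1], [-1.0, 0.2], [0.3, -1.0]])    # outside

# In the feature space, the separator theta^T phi(x) + b with theta = (0, 0, 1) and b = -0.5
# (a horizontal hyperplane cutting the paraboloid) separates the two groups.
theta, b = np.array([0.0, 0.0, 1.0]), -0.5
print([theta @ phi(x) + b < 0 for x in X_pos])   # all True  -> labelled "inside"
print([theta @ phi(x) + b < 0 for x in X_neg])   # all False -> labelled "outside"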

Figure 2.2: On the left, the labeling of 10 points in R^2 is not linearly separable. The same points
projected in R^3 by ϕ, defined by equation (2.1), become separable by a hyperplane, as the middle of
the figure shows. The half-space over the hyperplane corresponds to black labels; it is shaded. The
paraboloid is the projection of the whole R^2 into R^3 by ϕ. The part of the paraboloid above the
hyperplane is darkened as well; it corresponds to regions labeled as black. On the right, the points are
represented back in R^2, as on the left, but the plane area which projects into the dark half-space
in R^3 is darkened. It can be seen that the linear separation in R^3 (the feature space) corresponds to
a non-linear separation in R^2 (the ambient space), since ϕ is a non-linear projection.

The trick of projecting a dataset into a high-dimension space, so that the projected points
become linearly separable, seems powerful. However, it has a drawback. In figure 2.2, the projection
has been chosen carefully. This cannot be done in non-artificial situations. In real cases, we would
rather have built a dataset-agnostic high-dimensional projection into R^n, with n ≫ 9, in order to
make our 10 points very easily separable, and then set up a separator to make the decision. This
cannot be represented, as opposed to what we did for the middle part of figure 2.2 where ϕ projects
into R^3. Nevertheless, the right part of figure 2.2 can still be sketched out, even if the feature space
cannot be plotted. The figure that we would obtain may look like figure 2.3... The separation has poor
generalization capabilities, and can hardly be used to predict the label of new incoming samples.
Once again, this drawback occurs because it is easy to separate points in high dimensions, so that
separation becomes useless. This is referred to as the curse of dimensionality.

2.2 Risks
Let us consider here supervised learning and the definitions given in section 1.1.1. We would like
to find some predictor (or hypothesis) h ∈ H such that the label y = h(x) given by h to some input
x is likely to be the label associated to that x if it is sampled by algorithm 1, i.e. if that x were
labeled by the oracle. This is what a “good” predictor is supposed to do. The concept of risk aims
at defining and measuring the quality of a predictor.

2.2.1 Loss functions


In order to evaluate predictors, a loss function, denoted by L ∈ (R^+)^{Y×Y}, needs to be defined for
comparing the label returned by some predictor to the one it should have returned. For example,
one can use the binary loss defined as

    L(y, y′) = 1 if y ≠ y′, 0 otherwise.    (2.2)
2 (θ, b) ∈ R3 × R here.

Figure 2.3: The curse of dimensionality. See text for detail.

The binary loss is suitable for a finite label set Y, i.e. classification, but it may be rough when Y is
continuous, i.e. regression. When Y = R, one can use the quadratic loss defined as

    L(y, y′) = (y − y′)^2

2.2.2 Real and empirical risks


Once some loss function is chosen, the idea is to measure the quality of some predictor h ∈
H. If (x, y) ∼ P_Z, according to equation (1.1) or algorithm 1 (page 15), we would like that
L(h(x), y) ≈ 0. It means that the predictor should behave as the oracle does, for the inputs that
are likely to be tossed.
The real risk of h, denoted by R(h), is thus the expectation of the loss, for the data sampled
according to algorithm 1, i.e.

    R(h) = ∫_{X×Y} L(h(x), y) p_Z(z) dz,  with z = (x, y)

Unfortunately, the real risk cannot be computed, since the oracle pZ is unknown. Nevertheless,
it can be estimated. The classical method for estimating an expectation is the computation of the
average over a collection of samples. The dataset S actually contains samples tossed from PZ , as
algorithm 1 shows. The following average, called the empirical risk of h computed from S, denoted
by R_emp^S(h) or R_N(h) (N = |S|), thus estimates the real risk.

    R_emp^S(h) = (1/|S|) Σ_{(x,y)∈S} L(h(x), y)
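With the binary loss, the empirical risk is simply the error rate of h on S. A small sketch (the predictor and the dataset are made up):

import numpy as np

def empirical_risk(h, S, loss):
    return np.mean([loss(h(x), y) for x, y in S])

binary_loss = lambda y_pred, y: float(y_pred != y)

# Toy example: predict "tall" when the height (second coordinate) exceeds 175.
h = lambda x: "tall" if x[1] > 175 else "short"
S = [(np.array([60.0, 170.0]), "short"), (np.array([80.0, 185.0]), "tall"),
     (np.array([75.0, 174.0]), "tall")]
print(empirical_risk(h, S, binary_loss))   # 1/3: one sample is mislabelled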

The criterion that measures how well some hypothesis h imitates the labeling per-
formed by the oracle is the real risk. It cannot be computed. The empirical risk may
be an estimation of it. Is this estimation reliable ? Can we consider that a predictor
which has a low empirical risk on some dataset has a low real risk in general ? This
is the core question of supervised machine learning.

2.2.3 Are good predictors good ?


I have at my disposal a daily record of seismographs in France, and I have the list of the days when
an earthquake occurred in France. I want to predict from these data whether there will be an
earthquake or not this afternoon, knowing the seismograph record of this morning.
Without any expertise in geophysics, I can predict that there will be no earthquake this afternoon...
ignoring the seismograph data of the day. I behave as a constant function, always telling
that no earthquake will occur. Both my empirical risk and real risk are very low, so I am a good
earthquake predictor, from a statistical point of view. Making prediction errors on very infrequent
inputs does not alter the real risk much.
Another example of such good predictors is recognizing the digit 0 among randomly chosen
hand-written digits. Once again, always answering that a digit is not a 0 leads to a real risk of only 10%
(with the binary loss) if the digits are uniformly distributed over the ten digits.

A good predictor is a predictor that minimizes the real risk. Good predictors may be
uninteresting, especially when the distribution of labels is strongly unbalanced.

If I really want to build up an alert system for earthquakes, I would need to balance the
dataset with 50% non-earthquake data, taken randomly in the huge mass of non-earthquake days, and
50% earthquake data (made of all the, but few, earthquake data I actually have). If there are, on
the balanced dataset, only 1% of cases for which I predict an earthquake while it actually does not occur (false
positive detections), this 1% may trigger frequent false alerts in the daily use of my detector. In
other words, after the balancing process, the database has lost its statistical significance for real-life
usage.

If data is very unbalanced, another available solution to the prediction problem is to
collect the regular data, i.e. the seismographs of non-earthquake days here, and to
learn a membership test with an unsupervised method, as introduced in section 1.2.1.
Exceptional situations, such as earthquakes, may then be labeled as non-members by this test.

2.2.4 Empirical risk minimization and overfitting


When a dataset S is given, and when some hypothesis space H is determined to extract some good
predictor from, one very common induction principle is to find the hypothesis that best succeeds
in predicting the labels of the dataset. This inductive principle is the empirical risk minimization
(ERM). In other words, the idea is to find

    ĥ_S = argmin_{h∈H} R_emp^S(h)

where R_emp^S(h) is the empirical risk of h computed on S. The function ĥ_S is also denoted by ĥ_N
(N = |S|). When H is a parametric function set (see section 2.1.1), one can use a gradient descent
in the parameter space to find ĥ_S.
Some problems arise if H gathers complex functions, i.e. if H is rich. Indeed, one can then find in H
many hypotheses h for which R_emp^S(h) = 0. Such hypotheses fit the dataset perfectly well.
In general, when a perfect fit to the data occurs, one should suspect an overfitting situation,
where the function ĥ_S that has been found performs a learning by heart. For example, this
is what happened in figure 2.3, since the non-linear decision region gives the right label for all the
ten points, whereas it will certainly not generalize this good labelling to new data. Overfitting
occurs when H is rich enough to contain functions that can learn big datasets by heart.

Let us sum up the problem. The best predictor available in H is h⋆ = argmin_{h∈H} R(h).
It cannot be computed. If R(h⋆) = 0, it can be said that H contains a function that behaves
like the unknown oracle. If it is not null, R(h⋆) is called the inductive bias, revealing that H is
not rich enough to fit the oracle better than this^a. If, for some dataset S, R_emp^S(ĥ_S) ≈ 0,
an overfitting has to be suspected, since it may happen that R(ĥ_S) is high whereas ĥ_S behaves
perfectly on the dataset. The low empirical risk is then misleading. On the contrary, if R_emp^S(ĥ_S)
is high for some dataset S, H seems not rich enough to fit the problem. In this case
R_emp^S(ĥ_S) may be a good approximation of R(ĥ_S). Moreover, in this case as well,
R(ĥ_S) may approximate the inductive bias R(h⋆), meaning that the empirical risk
minimization, from which ĥ_S is obtained, is a good inductive principle (ĥ_S behaves
as well as h⋆). In order to reduce the inductive bias, one may prefer rich hypothesis
spaces... but then, this richness allows the ĥ_S found on some S to vary
from one dataset to another, since learning is a learning by heart that just fits the
given dataset without generalizing the labeling process. In other words, the variance
here means that ĥ_S can be completely different from ĥ_{S′} even when S and S′ differ only
slightly. Choosing the appropriate level of richness for H is difficult; it is referred to as
the bias-variance trade-off.
a. It may also be due to some randomness in the oracle itself... Indeed, the inductive bias is not
exactly R(h⋆), see paragraph 5.1.2 for details.

2.3 Ensemble methods


Let us say here a few words about ensemble learning for supervised learning. The core idea is to
handle a set of predictors instead of a single one. Once learning has occurred, all predictors propose
a label for an incoming new input. The proposed labels are used to compute the label actually
assigned to the new input. For example, in the case of classification, each predictor proposition can
be viewed as a vote for one of the available labels. The label which gets the maximum number
of votes is then the one assigned to the input. In the case of regression, the average value of all the
proposed labels can be used to compute the label finally given to the input. Of course, when some
data sample S is given, learning several predictors may lead to different ones if the process is non-deterministic
somewhere. The point is that, even if each predictor performs just slightly better
than a random labelling, the merged prediction can actually benefit from the variability of the
predictors and be quite good.

2.3.1 Bagging
The first place where randomness can be introduced is the sample set. For training each predictor,
one can build up a dataset from S by sampling P data from S uniformly, with replacement. If the
learning process is very sensitive to the data (some samples can be duplicated since replacement
occurs, some others may be missing, ...), the predictors obtained from this process may show variability.
Nevertheless, that is counterbalanced by the merging procedure described previously, so as to get a
good overall prediction. Making datasets this way is called bootstrapping, and relying on this to
set up many predictors, whose predictions are merged, is called bagging.
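A minimal sketch of this bootstrap/bagging procedure; the base learning algorithm train is left abstract (libraries such as scikit-learn provide ready-made implementations, e.g. BaggingClassifier).

import numpy as np
from collections import Counter

def bagging_train(S, train, n_predictors, rng=np.random.default_rng(0)):
    """Train n_predictors base predictors, each on a bootstrap sample of S."""
    predictors = []
    for _ in range(n_predictors):
        bootstrap = [S[i] for i in rng.integers(0, len(S), size=len(S))]  # with replacement
        predictors.append(train(bootstrap))
    return predictors

def bagging_predict(predictors, x):
    """Classification: each predictor votes, the majority label wins."""
    votes = Counter(h(x) for h in predictors)
    return votes.most_common(1)[0][0]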

2.3.2 Random models


The second place for adding randomness in the building up of a predictor collection is the predictor
setup itself. Indeed, instead of using a heavy and accurate optimization process from the data in S,
one can set up almost-completely-random predictors... It means randomly generated predictors
that behave only slightly better than random. This is interesting when this generation saves computation
time (compared to an accurate optimization process). As for bagging, the merging of the predictions
compensates for the weaknesses of each single predictor output. Dataset bootstrapping can
also be used for training random models, adding more variability.

2.3.3 Boosting
Boosting is a specific case in ensemble methods since it does not rely on randomness. Indeed,
what is actually boosted in the boosting method is a weak predictor learning algorithm. To do the
boosting trick, the weak learning algorithm has to be able to handle weighted datasets. A weighted
dataset is a dataset where each sample is associated with a weight reflecting the importance given
to that sample in the resulting predictor construction.
Boosting is an iterative process. First, equal weights are given to all samples in S, and a first
predictor is learned from this. Then, the weights are reconsidered: they are increased for the
samples badly labeled by the previously learned predictor. A second predictor is learned from the
dataset with this new weight distribution... and so on. The boosting theory gives formulas for
weighting the predictions of each constructed predictor in order to set up a final predictor as a
weighted sum of the individual predictions.
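As an illustration, here is a sketch of the weight update in the spirit of AdaBoost, one well-known instance of the boosting formulas mentioned above, for labels in {−1, +1}; weak_train is an assumed weak learning algorithm able to handle a weighted dataset.

import numpy as np

def adaboost(X, y, weak_train, n_rounds):
    """X: (n, d) inputs, y: labels in {-1, +1}, weak_train(X, y, w) -> predictor."""
    n = len(y)
    w = np.full(n, 1.0 / n)                       # equal weights at the beginning
    predictors, alphas = [], []
    for _ in range(n_rounds):
        h = weak_train(X, y, w)
        pred = np.array([h(x) for x in X])
        eps = np.sum(w[pred != y])                # weighted error of the weak predictor
        alpha = 0.5 * np.log((1 - eps) / eps)     # its weight in the final vote
        w = w * np.exp(-alpha * y * pred)         # increase weights of misclassified samples
        w = w / w.sum()
        predictors.append(h); alphas.append(alpha)
    return lambda x: np.sign(sum(a * h(x) for a, h in zip(alphas, predictors)))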
Chapter 3

The Bayesian approach

This chapter introduces the basics of Bayesian inference on a toy example, in order to build up
the general scheme of Bayesian approaches to machine learning. The mathematics here aim at
being intuitive rather than providing a rigorous definition of probabilities. The latter is grounded
on measure theory, which is not addressed here.

3.1 Density of probability


3.1.1 Reminder
Let us remind here that a random variable X taking values in X is described by its probability
distribution P_X. Let A ⊂ X be a set of values; the probability that the random variable X “takes” a
value in this set is P_X(A). Moreover, let us suppose that a density of probability can be associated
to the probability distribution. It is a function p_X ∈ (R^+)^X such that P_X(A) = ∫_A p_X(x) dx. In the
following, random variables are described thanks to the associated density of probability, that is
supposed to exist.

3.1.2 Joint and conditional densities of probability


Let us suppose now that occurring events are pairs of values (x, y) ∈ X × Y. Let us denote by X.Y
a random variable taking values (x, y) ∈ X × Y. The density of probability associated with this
variable is p_{X.Y} ∈ (R^+)^{X×Y}. This represents the occurrences of pairs (x, y) in the world modeled
by X.Y. If p_{X.Y}(x, y) is higher, the pairs of values which are close to (x, y) are more “probable^1”.
Figure 3.1 illustrates this, as well as the incoming definitions.
Let us now suppose that only the x component of incoming events is of interest. The random
variable describing its occurrence is denoted by X and it takes its values in X. The associated
density of probability can be computed as:

    ∀x ∈ X,  p_X(x) = ∫_Y p_{X.Y}(x, y) dy

Here, the random variable X is obtained^2 by a marginalization of the variable X.Y. We can obtain
Y the same way. It can also be said that X.Y is a joint variable of X and Y.
If we now only consider the occurring pairs such that x = x_0, the “probability” of
such pairs to occur is the “probability” of pairs (x, y) knowing that x = x_0, in other words the
“probability” of y knowing that x = x_0. This is how conditional densities of probability can be
defined.
1 As both X and Y are usually continuous, the probability for some (x, y) to occur exactly is null, of course, even
when the density p_{X.Y}(x, y) for that pair is high. Saying that (x, y) is “probable” is thus abusive; this is why it is
quoted.
2 Here, its density of probability is obtained. It describes the random variable when it exists.


    ∀y ∈ Y,  p_{Y|X}(y|x_0) = p_{X.Y}(x_0, y) / ∫_Y p_{X.Y}(x_0, y′) dy′ = p_{X.Y}(x_0, y) / p_X(x_0)    (3.1)

(the first equality being a definition). In equation 3.1, the argument y is highlighted, in order to stress that p_{Y|X}(y|x_0) is a function
of y. This will not be recalled in the following.


Figure 3.1: Joint and conditional densities of probability. X = [0, 10] and Y = [20, 30]. In the
figure, A = p X.Y (x, y), B = p X.Y (x0 , y), C = pX (x0 ) and D = p Y |X (y|x0 ).

3.1.3 The Bayes’ rule for densities of probability


From equation (3.1), it can be derived straightforwardly that

∀(x, y) ∈ X × Y, p Y |X (y|x) × pX (x) = p X.Y (x, y) = p X|Y (x|y) × pY (y)

which leads to the following expression of the Bayes’ rule, expressed for densities of probability:

    ∀(x, y) ∈ X × Y,  p_{Y|X}(y|x) = p_{X|Y}(x|y) × p_Y(y) / p_X(x)    (3.2)

It is very similar to the more usual Bayes’ rule with probabilities, i.e. P(A | B) = P(B | A) × P(A) / P(B), but the
components of the formula here are functions (the densities) rather than scalars (the probabilities).
The Bayes’ rule for densities of probability is the core mechanism for Bayesian learning, as
illustrated next.
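Before moving to inference, note that densities can be manipulated numerically on a discretized grid, which gives a direct illustration of marginalization, conditioning and the rule above (the joint density below is an arbitrary example):

import numpy as np

# Discretize X = [0, 10] and Y = [20, 30] and pick an arbitrary joint density p_XY.
x = np.linspace(0, 10, 200)
y = np.linspace(20, 30, 200)
dx, dy = x[1] - x[0], y[1] - y[0]
X, Y = np.meshgrid(x, y, indexing="ij")
p_xy = np.exp(-((X - 4) ** 2) / 2 - ((Y - 26) ** 2) / 8)
p_xy /= p_xy.sum() * dx * dy                      # normalize so that it integrates to 1

p_x = p_xy.sum(axis=1) * dy                       # marginalization over y
p_y_given_x0 = p_xy[100, :] / p_x[100]            # conditioning on x_0 = x[100]

# Bayes' rule: recover p(y | x_0) from p(x_0 | y), p(y) and p(x_0).
p_y = p_xy.sum(axis=0) * dx
p_x0_given_y = p_xy[100, :] / p_y
bayes = p_x0_given_y * p_y / p_x[100]
print(np.allclose(bayes, p_y_given_x0))           # True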

3.2 Bayesian inference


Let us illustrate what Bayesian inference is on a toy example.

3.2.1 The model

As usual in machine learning (see introduction section 1.1.1), the data is modelled by a random
variable Z providing values in Z. As for the frequentist approach (see section 2.1.1), the idea
is to infer from data samples a parametric model that fits them. This parametric model, as in
the frequentist approach as well, is given a priori.

For the sake of plotting densities of probability later on, let us consider that models have a scalar
parameter θ ∈ Θ = [0, 1]. For a specific value θ, we consider the data to be samples z ∈ Z = [0, 1]
of the random variable m_θ whose density of probability p_{m_θ}(z) is shown in figure 3.2.

Figure 3.2: The density of probability pmθ (z) for θ = .3.

This density represents the distribution of data for a specific θ. Let us now introduce a random
variable T with values in Θ... this is the trick of Bayesian inference. Its density pT (θ) represents
the values that the parameter θ is likely to have. This is set a priori to any kind of distribution.
Introducing the random variable T gives the density of probability p_{m_θ} a conditional flavor. Indeed,
let us consider that

    ∀z ∈ Z,  p_{m_θ}(z) = p_{Z|T}(z|θ).

We have thus modeled a situation where getting a data sample consists in first tossing θ ∼ P_T
and second tossing z ∼ P_{m_θ}. Once again, the first toss sounds artificial... but this is the trick,
as the next section shows.
From this modeling of data generation, the joint density of probability can be computed easily
from the Bayes’ rule (see equation (3.2)).

∀(z, θ) ∈ Z × Θ, p Z.T (z, θ) = p Z|T (z|θ) × pT (θ) = pmθ (z) × pT (θ) (3.3)

Figures 3.3 and 3.4 show the joint density of probability for different pT .

3.2.2 The parameter update


In the previous section, we have described probabilities for a given model m_θ as well as an a priori
P_T. Let us consider p_T to be the one in figure 3.4 (i.e. parabolic), in order to stress that this choice
is arbitrary and artificial. These probabilities do not reflect any reality without being related to
real data. Let us suppose here that the hidden process actually generating the data is P_Z = m_{0.7}.
Let us sample a new data sample z ∼ m_{0.7} from it. From equation 3.2, i.e. the Bayes’ rule for densities
of probability, the following can be written.

    ∀θ ∈ Θ,  p_{T|Z}(θ|z) = p_{Z|T}(z|θ) × p_T(θ) / p_Z(z)    (3.4)

Note that here, the data z is a parameter. The variable is θ, so p T |Z=z is a density of probability
over Θ. Indeed, it tells how θ is distributed under our a priori hypotheses, knowing that the data
z has been observed.
In equation (3.4), pT (θ) is called the prior. It can be computed since it is given a priori. The
value pZ (z) is a normalization constant. It is the “probability” of the occurrence of that data when
the situation is modelled as we do... the way the data is actually sampled is not considered. pZ (z)
can be computed here numerically3 from the joint density of probability given by equation (3.3).
The density of probability p Z|T (z|θ) is known since it is our model, i.e. pmθ (z), as already stated.
3 This is a marginalization computed from an integral: p_Z(z) = ∫_Θ p_{Z.T}(z, θ) dθ.

Figure 3.3: Joint distribution of parameter θ and data z. The distribution pT (uniform here) is
given a priori, as well as p Z|T which is the one defined in figure 3.2.

Figure 3.4: Joint distribution of parameter θ and data z. The distribution pT (parabolic here) is
given a priori, as well as p Z|T which is the one defined in figure 3.2.

The Bayesian inference consists in updating the prior p_T(θ) so that it becomes the function
p_{T|Z}(θ|z) that has just been computed. Using that new prior changes the situation that we model.
Indeed, this prior has somehow taken into account the data z that has been provided. Bayesian learning
then consists in repeating this update for each new data sample that is tossed. This is illustrated
in figure 3.5, where it can be seen that p_T(θ) gets more and more focused, as the data samples are
provided, on the value θ = 0.7, which is actually the parameter we used for tossing data samples.
The shape of the distribution reflects the uncertainty concerning the estimated value of that
parameter.
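The updates plotted in figure 3.5 can be reproduced with a simple grid-based sketch; the model p_{Z|T} used below is a placeholder (a proper density on [0, 1] for every θ, but not the one of figure 3.2), and the data is generated with θ = 0.7 as in the text.

import numpy as np

theta = np.linspace(0, 1, 501)                 # discretized parameter space Θ
prior = 6 * theta * (1 - theta)                # an arbitrary parabolic prior p_T(θ)

def p_z_given_theta(z, theta):
    # Placeholder model: p(z|θ) = 2zθ + 2(1-z)(1-θ), a proper density for every θ.
    return 2 * z * theta + 2 * (1 - z) * (1 - theta)

rng = np.random.default_rng(0)
for _ in range(500):
    # Sample z from the placeholder model with θ = 0.7 (mixture of two triangular densities).
    z = np.sqrt(rng.random()) if rng.random() < 0.7 else 1 - np.sqrt(rng.random())
    posterior = p_z_given_theta(z, theta) * prior       # numerator of equation (3.4)
    prior = posterior / np.trapz(posterior, theta)      # p_Z(z) is just the normalization
print(theta[np.argmax(prior)])                          # concentrates around 0.7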

[Figure 3.5 panels: the prior p_T(θ) after 0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 15 and 100 submitted samples.]

Figure 3.5: Bayesian inference. Each plot is the prior pT (θ) after the mentioned number of samples
provided.

3.2.3 Update from bunches of data


Let us now consider two data samples z_2 and z_1, tossed independently from the unknown process.
They are samples of a random variable K = (Z, Z). Since the samples are independent, and since
each one is linked to m_θ, the following stands:

    p_{K|T}(z_2, z_1|θ) = p_{m_θ}(z_2) × p_{m_θ}(z_1) = p_{Z|T}(z_2|θ) × p_{Z|T}(z_1|θ)

So the Bayesian update (equation (3.4)), when a pair of independent samples is given, is the
following:

    p_{T|K}(θ|z_2, z_1) = p_{K|T}(z_2, z_1|θ) × p_T(θ) / p_K(z_2, z_1)
                        = p_{Z|T}(z_2|θ) × p_{Z|T}(z_1|θ) × p_T(θ) / p_K(z_2, z_1)
                        ∝ p_{Z|T}(z_2|θ) × [ p_{Z|T}(z_1|θ) × p_T(θ) / p_Z(z_1) ]
                        = p_{Z|T}(z_2|θ) × p_{T|Z}(θ|z_1)

where p_K(z_2, z_1) and p_Z(z_1) do not depend on θ and only play the role of normalization constants.
The last expression is, once normalized over Θ, exactly the successive application of equation 3.4 to z_1
(yielding the updated prior p_{T|Z}(θ|z_1)) and then to z_2. This
can be generalized to any number of samples. As a consequence, the result of successive applications
of the Bayesian inference, when samples are mutually independent, is the density of probability
of the parameter, knowing the whole bunch of samples, whatever the order of submission, and
knowing the initial prior. The effect of the latter may vanish when the number of samples is high.

3.3 Bayesian learning for real


In real problems, the model depends on θ ∈ R^n, so both m_θ and the prior p_T(θ) are multidimensional
functions. Moreover, one often wants to compute the update analytically, with probability
densities whose form enables an analytical computation (e.g. multidimensional Gaussian densities,
Dirichlet densities, etc.). This raises a problem. Having a prior of a certain kind does not guarantee
that it will still be of the same kind after an update. Indeed, in figure 3.5, the prior was initially
parabolic, but it takes a non-parabolic shape after the very first update, and it becomes similar to
a Gaussian-like shape in the end. It would be nice, when the computation is run fully analytically^4,
to handle a kind of density that is stable when equation (3.4) is applied. To ensure this, the prior
should be conjugate to the model. For example, when the model is Gaussian, and if the prior
is also Gaussian, the new prior will still be Gaussian^5.
Having such a conjugate prior at hand is very restrictive. To get rid of the limitation in the
shapes of the probability densities, one can use Monte Carlo sampling^6.
Last, let us mention that Bayesian updates are intrinsically online methods (see section 1.1.1),
since an update is made each time a sample is presented. The update can also be adapted to take a
bulk of samples instead of a single one, as seen in section 3.2.3. Some of the Bayesian approaches to machine learning are
addressed in part VII.

4 Numerical evaluations were used to process figure 3.5.


5 and the new mean and variance can be computed analytically !
6 See MCMC sampling (Andrieu et al., 2003).
Chapter 4

Evaluation

4.1 Real risk estimation


As previously mentioned in paragraph 2.2.4, measuring the performance of some predictor on
the dataset that has been used to train it may lead to a very optimistic estimation of its real
performance. The extreme case is overfitting, when the predictor behaves perfectly on the training
data while it is actually very bad in general.
The real risk, which is the reliable measure of a predictor performance, cannot be computed,
as opposed to the empirical risk... which cannot be trusted! Performance evaluation has thus to be
done carefully.

4.1.1 Cross-validation
Overfitting occurs when the hypothesis ĥ_S computed from the dataset S sticks to that set. In this
case, R_emp^S(ĥ_S) ≈ 0: this is the fitting to the training data which is expected from any inductive
principle, but R_emp^{S′}(ĥ_S) is high for another dataset S′, meaning that the fitting to S is actually
an overfitting.
Let us call S the training set, since learning consists in building R_emp^S from it. One can detect
overfitting by computing the empirical risk of ĥ_S on another dataset S′, called the test set. Of
course, both S and S′ are sampled i.i.d. according to P_Z (see algorithm 1).
Usually, only a single dataset is available, and one can do better than simply evaluating the empirical
risk on a single train/test split. This is the idea of the k-fold cross-validation procedure described by algorithm 7,
which provides an estimation of the real risk of ĥ_S by generalizing the use of only two training
and test sets described so far. This is illustrated in figure 4.1.

Algorithm 7 cross validation(k, S, α)

1: // α is a learning algorithm
2: Split S into a partition {S_1, · · · , S_i, · · · , S_k} such that ∀i, |S_i| ≈ |S|/k
3: for i = 1 to k do
4:    Train from S′_i = S \ S_i using α and get the predictor ĥ_{S′_i} = α(S′_i). // S \ S_i is the training set.
5:    Compute R_i = R_emp^{S_i}(ĥ_{S′_i}). // S_i is the test set.
6: end for
7: return (1/k) Σ_{i=1}^k R_i // This is an estimation of R(ĥ_S).

When k = |S|, the k-fold cross-validation is referred to as leave-one-out cross-validation. It
requires a lot of computation, but it can be useful when little data is available.
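A minimal Python sketch of algorithm 7; the learning algorithm alpha and the loss are left abstract (scikit-learn provides similar utilities, e.g. cross_val_score).

import random
import numpy as np

def cross_validation(k, S, alpha, loss):
    """S: list of (x, y) pairs, alpha: dataset -> predictor, loss: (y_pred, y) -> float."""
    S = list(S)
    random.Random(0).shuffle(S)                   # shuffle before building the k folds
    folds = [S[i::k] for i in range(k)]           # k parts of roughly equal size
    risks = []
    for i in range(k):
        train_set = [s for j, fold in enumerate(folds) if j != i for s in fold]
        h = alpha(train_set)                      # train on S \ S_i
        risks.append(np.mean([loss(h(x), y) for x, y in folds[i]]))  # empirical risk on S_i
    return float(np.mean(risks))                  # estimation of the real risk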



Figure 4.1: 4-fold cross-validation. See text and algorithm 7 for detail.

The least that can be done when applying supervised machine learning is to
evaluate the method with cross-validation.

4.1.2 Real risk optimisation


In real situations, machine learning algorithms often depend on parameters that can be tuned.
One should choose the parameter configuration for which the real risk is minimal. It can be
approximated by choosing the parameter configuration for which the estimated real risk is minimal.
This very common situation can lead to estimation errors. Let us introduce notations to
illustrate the problem. Let α_θ be an algorithm whose parameters are θ. For a given θ, one can compute
h_θ from a dataset S, and evaluate its real risk R(h_θ) ≈ R_θ = cross validation(k, S, α_θ). Since
R_θ can be computed, one can also compute

    θ⋆ = argmin_{θ∈Θ} R_θ

and then, applying α_{θ⋆} on S gives a predictor h_{θ⋆}, which is the one returned by the whole process.
The error that could be made with this approach is considering that R_{θ⋆}, computed during
the optimization process, is an estimation of the real risk of the whole process. Indeed, it
estimates the real risk of using our learning algorithm with parameter θ⋆, but not the real risk of
the whole optimization process, since the value of θ⋆ is not independent from S.
In this case, apart from dividing S into training and test sets for cross-validation, an extra
validation set S′ should be used to measure R_emp^{S′}(h_{θ⋆}) and estimate the real risk of our optimization
process. As we have extended training and test sets to a whole cross-validation process, the
validation set can be extended to a cross-validation as well... To do so, let us denote by metalearn
the algorithm described above, reminded in algorithm 8.
This algorithm’s real risk can be estimated by cross-validation, as any other, by simply calling
cross validation(k, S, metalearn). This process involves two nested levels of cross-validation (see
figure 4.2), which generalize the use of train, test and validation sets.

4.2 The specific case of classification


4.2.1 Confusion matrix
The performance measurement for classification problems is, as for any supervised learning prob-
lem, the real risk. It is usually computed by using the binary loss (see equation (2.2)) so that the
real risk is the percentage of errors that can be expected from the predictor/classifier.
Nevertheless, it can be relevant to get into more details when analyzing errors. To do so, the
confusion matrix is usually computed. It consists in using a test set^1 and reporting in a matrix the
1 It is better to use a set that has not been used for training the predictor.

Algorithm 8 metalearn(S)
1: // α is a learning algorithm with parameters θ, k is a constant.
2: R? = +∞
3: for θ ∈ Θ do
4: // Consider operation research techniques if Θ cannot be iterated
5: Compute Rθ = cross validation (k, S, αθ ).
6: if Rθ < R? then
7: R ? ← Rθ
8: θ? ← θ
9: end if
10: end for
11: Use αθ? to train on S and get a predictor hθ? = αθ? (S).
12: return hθ?


Figure 4.2: Cross-validation of algorithm 8 involves a nested cross-validation. See text for detail.


Figure 4.3: On the left, a label dataset is depicted. Positively labeled samples are painted in
pale yellow, negatively labeled ones are painted in green. The middle of the figure shows a linear
separator. It separates the plane into sub-regions. The sub-region corresponding to positive labels
is painted in yellow as well, the negative sub-region is depicted in green. On the right, a recall of
the confusion matrix coefficients with the colored graphical notation.

responses of the classifier, as in table 4.1. In this example, when a sample belongs to class C, it
is systematically given a C label by the predictor. Nevertheless, the C label is also given by the
predictor when the real class is not C.

              predicted
               A    B    C    D
    real  A   13    2    2    3
          B    1   25    4    0
          C    0    0   40    0
          D    2    0    1    7

Table 4.1: Confusion matrix for some classifier computed from a dataset S with |S| = 100 data
samples. The classes can be A, B, C or D. Numbers in the matrix are the number of samples
satisfying the condition (their sum is 100).

The confusion matrix can also be meaningful when the problem is cost-sensitive. It means that
errors may have different weights/costs according to their nature. For example, predicting that a
patient has no cancer while s/he actually has one is not the same as predicting a cancer while the
patient is healthy... Some cost matrix C can be associated to the confusion matrix, where C_ij is the
cost of predicting class i while the real class is j. Usually, the C_ii are null.

4.2.2 The specific case of bi-class problems


When |Y| = 2, the classification problem is bi-class, which is very common. The two classes are
often referred to as positive and negative classes, i.e. Y = {P, N}, since the problem usually consists
in detecting something (positive means detected). The coefficients of the confusion matrix in this
case have specific names, as table 4.2 shows. Let us depict these values by a linear separation
example in figure 4.3.

              predicted
               P    N
    real  P   TP   FN
          N   FP   TN

Table 4.2: Confusion matrix for some bi-class classifier. Class P is the positive class and class N the
negative one. The coefficient names are true positives (TP), true negatives (TN), false positives
(FP), false negatives (FN).

[Figure 4.4 panels: sensitivity/recall, precision, specificity, 1−specificity.]

Figure 4.4: The different concepts are illustrated in the case of a linear separator. Each concept
is a ratio (see text), represented by two red areas: the central one is the numerator, and the
surrounding one the denominator.

Lots of measures are based on the confusion matrix coefficients. The main ones are sensitivity
or recall (TP/(TP + FN)), specificity (TN/(FP + TN)) and precision (TP/(FP + TP)). Those definitions can be depicted
graphically as well, as in figure 4.4.
It is also common to plot a predictor performance in a ROC space. That space is a chart
depicted in figure 4.5.

It allows situating the performances independently of the imbalance between the
classes.

Last, one can summarize the trade-off between sensitivity and precision with the f-score (or
f1-score) that is computed as

    f = 2 × (sensitivity × precision) / (sensitivity + precision)
The f-score is a way to merge sensitivity and precision into a single number, as shown by the chart
in figure 4.6.
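These quantities are straightforward to compute from the four confusion-matrix coefficients; a small sketch with arbitrary values:

def biclass_metrics(tp, fn, fp, tn):
    sensitivity = tp / (tp + fn)          # also called recall
    specificity = tn / (fp + tn)
    precision   = tp / (fp + tp)
    f_score = 2 * sensitivity * precision / (sensitivity + precision)
    return sensitivity, specificity, precision, f_score

# Arbitrary example: 40 true positives, 10 false negatives, 5 false positives, 945 true negatives.
print(biclass_metrics(tp=40, fn=10, fp=5, tn=945))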

[Figure 4.5 sketch: the ROC space, with 1−specificity on the horizontal axis (from 0 to 1) and sensitivity on the vertical axis (from 0 to 1); predictors above the diagonal are better than random, those below are worse than random.]
Figure 4.5: ROC space. See text for detail.

[Figure 4.6 sketch: the f-score as a function of sensitivity and precision (both from 0 to 1).]

Figure 4.6: F-score. See text for detail.


Part II

Concepts for Machine Learning

Chapter 5

Risks

In machine learning, one wants to learn something from data. The “something” to be learnt
is usually quantified thanks to the central notion of risk. However, the risk is an asymptotical
concept, and learning is practically done by the optimization of an empirical counterpart of it. In
section 5.1, we provide a brief introduction to statistical learning theory. Notably, we will provide
(partial) answers to questions such as:

• does minimizing an empirical risk asymptotically lead to minimizing the risk?

• to what extent the minimized empirical risk is a good approximation of the risk?

These questions are respectively related to the consistency of machine learning and to overfitting.
In the case of classification, the natural (empirical) risk (based on the binary loss) is a difficult
thing to optimize. Therefore, it is customary to minimize a convex surrogate (or proxy) instead. In
section 5.2, we motivate the introduction of these surrogates, exemplify some of them and discuss
their calibration (in other words, is it consistent to minimize them instead of the risk?).
As will be shown in section 5.1, overfitting is heavily related to the capacity (or richness) of
the considered hypothesis spaces. In section 5.3, we briefly introduce the concept of regulariza-
tion, which is a modification of the risk that penalizes complex solutions (the aim being to avoid
overfitting by restricting the search space).

5.1 Controlling the risk


This section provides a brief introduction to statistical learning theory. We start by formalizing
the learning problem, focusing mainly on supervised learning.

5.1.1 The considered learning paradigm


To formalize the learning problem, we assume that we have:

• a random generator of vectors x ∈ X , sampled i.i.d. (independently and identically dis-


tributed) from a distribution P (x), fixed but unknown;

• an oracle that for each input x provides an output y ∈ Y, sampled according to the conditional
distribution P (y | x), also fixed but unknown. If Y = R, we’re facing a regression problem,
whereas if Y is a finite set, we’re facing a classification problem;

• a machine that can implement a set of functions, this set being called the hypothesis space
H = {f : X → Y} ⊂ Y X .

In a first try, supervised learning can be framed as follows: pick the f ∈ H that predicts “the best” the
responses of the oracle. The choice of f must be done based on a dataset D = {(x_i, y_i)_{1≤i≤n}} of
n examples sampled i.i.d. from the joint distribution P(x, y) = P(x)P(y | x). Even if these examples
are fixed in practice (the dataset is given beforehand), they should really be understood here as


i.i.d. random variables. All quantities computed using these samples are therefore also random
variables.
Before stating more precisely what “the best” means formally, we give some examples of hy-
pothesis spaces:

linear predictions: write Y = R (regression) and X = R^p, the hypothesis space is

    H = { f_{α,β} : X → R,  f_{α,β}(x) = α^T x + β,  α ∈ R^p, β ∈ R }.

In this case, searching for a function f ∈ H amounts to searching for the related parameters
α and β. If Y = {−1, 1} (binary classification, see section 5.2 for multiclass classification),
we can define similarly the following space (writing sgn the operator that gives the sign of a
scalar):

    H = { f_{α,β} : X → {−1, 1},  f_{α,β}(x) = sgn(α^T x + β),  α ∈ R^p, β ∈ R };

Radial Basis Function Networks (RBFN): the underlying idea here is that many functions
can be represented as a mixture of Gaussians. Given d vectors µ_i ∈ R^p (the centers of the
Gaussians) and d symmetric and positive definite matrices Σ_i (variance matrices) chosen a
priori, the hypothesis space is

    H = { f_{α,β} : x → Σ_{i=1}^d α_i exp(−(1/2) (x − µ_i)^T Σ_i^{−1} (x − µ_i)) + β }.

Each Gaussian function is generally called a basis function (a small code sketch of such a
feature-based hypothesis space is given after this list). Using the same sign trick as
before, this can be used to build a hypothesis space for classification;

linear parameterization: In an abstract way, write φ : X → Rd a vector function concatenat-


ing predefined basis functions (φ(x) is usually called the feature vector and φj (x), its j th
component, is the j th basis function), we have the following hypothesis space:

    H = { f_α : x → α^T φ(x),  α ∈ R^d }.    (5.1)

Again, this can be modified to form an hypothesis space for binary classification;

nonlinear parametrization: In all above examples, the dependency on the parameters is linear.
This is not necessarily the case, consider an RBFN where the weight of each basis function,
but also the mean and variance of each basis function, has to be learnt. Another classical
example is when predictions are made using an artificial neural network, see part V;

Reproducing Kernel Hilbert Space (RKHS): Let {(xi , yi )1≤i≤n } be the dataset and let K
be a Mercer kernel1 , the hypothesis space can be written as
    H = { f_α : x → Σ_{i=1}^n α_i K(x, x_i),  α ∈ R^n }.

Notice that, contrary to the previous examples, the hypothesis depends here on the dataset
of learning samples (through the use of the inputs xi ). The related approach is usually called
non-parametric. Notice also that H is not the RKHS, but a subset of it, see part III for
details.
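As announced in the RBFN example, here is a minimal sketch of such feature-based hypothesis spaces: a feature map made of Gaussian basis functions and the induced linear-in-the-parameters predictor (centers, variance and weights are arbitrary).

import numpy as np

# Arbitrary centers and an isotropic variance for d = 3 Gaussian basis functions in R^2.
centers = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]])
sigma2 = 0.5

def phi(x):
    """Feature vector: one Gaussian basis function per center."""
    return np.exp(-0.5 * np.sum((x - centers) ** 2, axis=1) / sigma2)

def f(alpha, beta, x):
    """A member f_{alpha,beta} of the RBFN hypothesis space (regression version)."""
    return alpha @ phi(x) + beta

alpha, beta = np.array([1.0, -2.0, 0.5]), 0.1
print(f(alpha, beta, np.array([0.2, 0.3])))
# For binary classification, take the sign: sgn(alpha @ phi(x) + beta).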

Now, we still need to make precise, formally, what we mean by predicting “the best”. To do so, we
introduce the notion of loss function. A loss function L : Y × Y → R^+ measures how two outputs
are similar. It allows quantifying locally the quality of the prediction related to a predictor f ∈ H:
L(y, f (x)) measures the error between the response y of the oracle for a given input x and the
prediction f (x) of the machine for the same input. Here are some examples of loss functions:
1 Roughly speaking, this is the functional generalization of a symmetric and positive definite matrix, see part III

for details.

`2 -loss: it is defined as
L(y, f (x)) = (y − f (x))2 .

It is the typical loss used for regression.

`1 -loss: it is defined as
L(y, f (x)) = |y − f (x)|.

It is of interest in a regression setting when there are outliers (because outliers are less
penalized with the `1 -loss), for example;

binary loss: it is defined as


    L(y, f(x)) = 1_{y ≠ f(x)} = 1 if f(x) ≠ y, 0 else.    (5.2)

It is the ideal (but unpractical, see section 5.2) loss for classification.

If the loss function quantifies locally the quality of the prediction, we need a more global
measure of this quality. This is quantified by the risk, formally defined as the expected loss:
    R(f) = ∫_{X×Y} L(y, f(x)) dP(x, y) = E[L(Y, f(X))].

Ideally, supervised learning would thus consist in minimizing the risk R(f) under the constraint
f ∈ H, giving the solution

    f_0 = argmin_{f∈H} R(f).

Unfortunately, recall that the joint distribution P(x, y) is unknown, so the risk cannot even be
computed for a given function f. However, we have partial information about this distribution,
through the dataset D = {(x_i, y_i)_{1≤i≤n}}. It is natural to define the empirical risk as

    R_n(f) = (1/n) Σ_{i=1}^n L(y_i, f(x_i)).

A supervised learning algorithm will therefore minimize this empirical risk, providing the solution

    f_n = argmin_{f∈H} R_n(f).

This is called the empirical risk minimization (or ERM for short). Notice that fn is a random
quantity (as it depends on the dataset, which is a collection of i.i.d. random variables).
To sum up, given a dataset (assumed to be sampled i.i.d. from an unknown joint distribution),
given an hypothesis space (chosen by the practitioner) and given a loss function (also chosen by
the practitioner, depending on the problem at hand), a supervised learning algorithm minimizes
the empirical risk, constructed from the dataset, instead of ideally the risk. The natural questions
that arise are:

• does f_n converge to f_0 (and with what type of convergence)? If it is not the case, machine
learning cannot be consistent;

• given n samples and the chosen hypothesis space (and also depending on the considered loss
function), how close is fn to f0 ?

These are the questions we study next. Before that, we briefly discuss the bias-variance decomposition
(a central notion in machine learning).

5.1.2 Bias-variance decomposition


Recall that f0 and fn are respectively the minimizers of the risk R and of the empirical risk Rn ,
when the search is constrained to the hypothesis space H. Write f∗ the minimizer of the risk in
the unconstrained case:
    f_∗ = argmin_{f∈Y^X} R(f),   R_∗ = R(f_∗).

The term R_∗ is the best one can hope for (notice that it is not necessarily equal to zero, it depends
on the noise of the oracle). In some cases, we can express analytically the minimizer f_∗, but
the corresponding solution cannot be computed, the underlying distribution being unknown. For
example, with an ℓ_2-loss, one can show that f_∗(x) = E[y | x] = ∫_Y y dP(y | x), which cannot be
computed (the conditional distribution being only observed through data).
Consider the following decomposition:

    R(f_n) − R_∗  =  [ R(f_0) − R_∗ ]  +  [ R(f_n) − R(f_0) ],

where the left-hand side is the error, the first bracketed term the bias and the second one the variance.

Each of these terms is obviously positive. The term R(fn ) − R∗ is the error of using fn (computed
from the data) instead of the best (but unreachable) choice f∗ . The term R(f0 ) − R∗ is called the
bias. It is a deterministic term that compares the best solution one can find in the hypothesis space
H to the best possible solution (without constraint). This term therefore depends on the choice of
H, but not on the data used. The richer the hypothesis space (the more functions belonging to it),
the smaller this term can be. The term R(f_n) − R(f_0) is called the variance. It is a stochastic
term (through the dependency on the data) that should go to zero as the number of samples goes
to infinity. In the next sections, we will see that this is a necessary condition for the ERM to
be consistent, and that we can bound it (in probability) with a function depending on the richness
of H and on the number of samples.

5.1.3 Consistency of empirical risk minimization


In the following sections, we abstract a little bit the notations. We write z = (x, y) an example
and Q(z, f ) = L(y, f (x)) (and P (z) = P (x, y) the joint distribution). Consequently, the risk and
its empirical counterpart are now:
    R(f) = E[Q(Z, f)]    (5.3)
    and R_n(f) = (1/n) Σ_{i=1}^n Q(z_i, f).    (5.4)

This allows lightening the notations and gives more generality to the following results2 .
Let f be a fixed function of H. The risk (5.3) is the expectation of the random variable
L(Y, f (X)) and the empirical risk (5.4) is the empirical expectation of i.i.d. samples of the same
random variable3 . In the probability theory, the convergence of an empirical expectation is given
by the laws of large numbers (there are many of them, depending on required assumptions and on
the considered convergence). Here, we will focus on convergence in probabilities. Therefore, the
weak law of large numbers states that, for the fixed f ∈ H chosen beforehand (before seing the
data), we have:
P
Rn (f ) −−−−→ R(f ) ⇔ ∀ > 0, P (|R(f ) − Rn (f )| > ) −−−−→ 0.
n→∞ n→∞

Notice that we cannot replace f by fn (the minimizer of the empirical risk) in the above expression,
because fn depends on the data and is therefore itself a random quantity.
2 For
example, it applies also to some unsupervised learning algorithms. Let x1 , . . . , xn be i.i.d. samples from an
unknown distribution P (x) to be estimated. Let {pα (x), α ∈ Rd } be a set of parameterized densities. Consider the
(one-parameter) loss function L(p(x, α)) = − log pα (x) and the related risk R(pα ) = E [− log pα (x)] (this corresponds
to maximizing the likelihood of the data). This case is also handled by the notations z and Q.
3 We recall that it is important to understand that, given this statistical model of supervised learning, the dataset

is a random quantity (so is the empirical risk), even if in practice the dataset is given beforehand and imposed.
Imagine that we could repeat the experience (of generating the dataset). The dataset would be different, but still
sampled i.i.d. from the same joint distribution.

Figure 5.1: An illustration of the convergence behavior of the risk and its empirical counterpart.
Consider for example a regression problem where one tries to fit a polynomial to data. With too
few points, the empirical risk will be zero while the risk will be high. With an increasing number
of data points, both risks should converge to the same quantity.

Figure 5.2: The considered hypothesis space for showing that limits in Def. 5.1 are not equivalent.

For the ERM to be consistent, we require that fn converges to f0 in some sense. What we are
interested in is the quality of the solution, quantified by the risk. So, the convergence to be studied
is the one of the (empirical) risk of fn to the risk of f0 . This gives a first definition of consistency
of ERM.

Definition 5.1 (Classic consistency of the ERM principle). We say that the ERM principle is
consistent for the set of functions Q(z, f), f ∈ H, and for the distribution P(z) if the following
convergences (in probability) occur:

    R(f_n) →_P R(f_0)   and   R_n(f_n) →_P R(f_0)   as n → ∞.

Notice that the two limits are not equivalent, as illustrated in Fig. 5.1. To show this more
formally, assume that z ∈ (0, 1) and consider H the space of functions such that Q(z, f) = 1
everywhere except on a finite number of intervals of cumulated length ε, as illustrated in Fig. 5.2.
Assume that P(z) is uniform on (0, 1). Then, for any n ∈ N, R_n(f_n) = 0 (pick the function with n
intervals centered at the z_1, . . . , z_n datapoints). On the other hand, for any f ∈ H, R(f) = 1 − ε.
Therefore, R(f_0) − R_n(f_n) = 1 − ε does not converge to zero while R(f_0) − R(f_n) = 0 does.
The problem with this definition of consistency is that it encompasses trivial cases
of consistency. Assume that for a set of functions Q(z, f), f ∈ H, the ERM is not consistent.
Now, let φ be a function such that φ(z) < inf_{f∈H} Q(z, f) and add the corresponding function to
H (see Fig. 5.3). With this extended set, the ERM becomes consistent: for any distribution and
any sampled dataset, the empirical risk is minimized with φ(z), which is also the argument that
minimizes the risk. This is a case we would like to avoid, motivating a stricter definition of
consistency.

Definition 5.2 (Strict consistency of the ERM principle). Let Q(z, f), f ∈ H be a set of functions
and P(z) a distribution. For c ∈ R, define H(c) as the set

    H(c) = { f ∈ H : R(f) = ∫ Q(z, f) dP(z) ≥ c }.

Figure 5.3: Triviality of the classic consistency.

The principle of ERM is said to be strictly consistent (for the above set of functions and distribu-
tion) if for any c ∈ R, we have
    inf_{f∈H(c)} R_n(f)  →_P  inf_{f∈H(c)} R(f)   as n → ∞.

With this definition, the function φ(z) used in the previous explanation will be removed from
the set H(c) for a large enough value of c. The fact that this definition of strict consistency implies
the definition of classic consistency (the converse being obviously false) is not straightforward, but
it can be demonstrated.
This notion of consistency is fundamental in machine learning. If it is not satisfied, minimizing
an empirical risk makes no sense (so, roughly speaking, machine learning would be useless). What we
would like is a (necessary and) sufficient condition for strict consistency, that is, for a convergence
in probability of the quantities of interest. The weak law of large numbers states that for any
fixed function f in H, we have this convergence. However, this is not sufficient. As f_n is a random
quantity, there is no way to know beforehand what function will minimize the empirical risk, so the
standard law of large numbers does not apply. However, assume that we have a uniform convergence
in probability, in the sense that
    sup_{f∈H} |R(f) − R_n(f)| →_P 0   ⇔   ∀ε > 0, P( sup_{f∈H} |R(f) − R_n(f)| > ε ) → 0 as n → ∞,   (5.5)

which is much stronger than the weak law of large numbers (see the next section for a more
quantitative discussion). With such a uniform convergence, the ERM principle is strictly consistent
(convergence occurs in the worst case, so it occurs for f_n: |R(f_n) − R_n(f_n)| ≤ sup_{f∈H} |R(f) −
R_n(f)|). This is the basis of a fundamental result of statistical learning theory.
Theorem 5.1 (Vapnik's key theorem). Assume that there exist two constants a and A such that
for any function Q(z, f), f ∈ H, and for a given distribution P(z), we have

    a ≤ R(f) = ∫ Q(z, f) dP(z) ≤ A.

Then, the following assertions are equivalent:


1. for the distribution P(z), the ERM principle is strictly consistent for the set of functions
Q(z, f), f ∈ H;

2. for the distribution P(z) there is a one-sided uniform convergence over the set of functions
Q(z, f), f ∈ H,

    ∀ε > 0, P( sup_{f∈H} (R(f) − R_n(f))_+ > ε ) → 0 as n → ∞,

where (x)_+ = max(x, 0).


Notice that the theorem requires only a one-sided uniform convergence (contrary to the two-
sided uniform convergence of Eq. (5.5)). This is because we want to minimize the risk, and we
do not care about its maximization. Next, we discuss this result in a more quantitative way and
provide sufficient conditions on the structure of the hypothesis space for the ERM principle to be
consistent for any distribution.

5.1.4 Towards bounds on the risk


In this section, we restrict the learning paradigm by assuming that for any z and any f , Q(z, f ) ∈
{0, 1}. This corresponds notably to the case of classification with the binary loss. The results
presented next can be extended to a more general case (up to additional technical difficulties), but
this restriction simplifies the discussion.
We have seen in the preceding section that consistency of the ERM principle has to do with a
notion of uniform convergence, that somehow extends the weak law of large numbers. We start by
providing a quantitative (that is, non-asymptotic) version of this law.

Theorem 5.2 (Hoeffding's inequality). Let X_1, . . . , X_n be i.i.d. random variables, bounded in
(0, 1) and of common mean µ = E[X_1]. Then:

    ∀ε > 0,  P( |(1/n) Σ_{i=1}^n X_i − µ| > ε ) ≤ 2 e^{−2nε²}.

Obviously, the weak law of large numbers is a corollary of this result. This is called a con-
centration inequality: it states how the empirical mean (which is a random variable) concentrates
around its expectation. Such concentration inequalities are the basis of what is called PAC (Prob-
ably Approximately Correct) analysis. The preceding result can be equivalently written as
    P( |(1/n) Σ_{i=1}^n X_i − µ| ≤ ε ) ≥ 1 − 2 e^{−2nε²}.
Write δ = 2 e^{−2nε²}, that is ε = √(ln(2/δ) / (2n)). Hoeffding's inequality can equivalently be stated as: with
probability at least 1 − δ, we have

    |(1/n) Σ_{i=1}^n X_i − µ| ≤ √(ln(2/δ) / (2n)).

So the result is probably (with probability at least 1 − δ) approximately (the error being at most
√(ln(2/δ) / (2n))) correct.
This can be directly applied to our problem (recall Eqs. (5.3) and (5.4), which correspond re-
spectively to an expectation and to an empirical expectation). Let f ∈ H be a function chosen
beforehand, then with probability at least 1 − δ we have

    |R(f) − R_n(f)| ≤ √(ln(2/δ) / (2n)).
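As a sanity check of this PAC statement, the following short Monte Carlo simulation (a minimal sketch in Python/NumPy; the Bernoulli parameter, the sample size and the number of trials are arbitrary illustrative choices, not values from the text) draws many datasets and compares the observed frequency of large deviations of the empirical mean with the Hoeffding bound 2e^{−2nε²}.

    # Monte Carlo check of Hoeffding's inequality on Bernoulli variables.
    import numpy as np

    rng = np.random.default_rng(0)
    n, mu, eps, n_trials = 200, 0.3, 0.05, 100_000

    samples = rng.binomial(1, mu, size=(n_trials, n))        # n_trials datasets of size n
    empirical_means = samples.mean(axis=1)                   # one empirical mean per dataset
    observed = np.mean(np.abs(empirical_means - mu) > eps)   # fraction of large deviations
    bound = 2 * np.exp(-2 * n * eps ** 2)

    print(f"observed P(|mean - mu| > eps) = {observed:.4f}")
    print(f"Hoeffding bound               = {bound:.4f}")    # always >= the observed frequency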

Unfortunately, this cannot be extended directly to uniform convergence. Indeed, probabilities are
about measuring sets. What Hoeffding says, when applied to the risk, is that the measure of the
set of datasets satisfying |R(f) − R_n(f)| > ε is at most δ. Now, this set of datasets depends on the
function f of interest. Take another function f', the corresponding set of datasets will be different,
so the measure of both sets of datasets (corresponding respectively to f and f', such that the
inequality of interest is satisfied for both functions) is no longer bounded by δ, but by 2δ. This is
illustrated in Fig. 5.4. We write this idea more formally now.
Assume that H is a finite set, such that Card H = h. We can write

    P( sup_{f∈H} |R(f) − R_n(f)| > ε ) = P( ∪_{f∈H} {|R(f) − R_n(f)| > ε} )   (by def. of the sup)
                                       ≤ Σ_{f∈H} P( |R(f) − R_n(f)| > ε )     (union bound)
                                       ≤ 2h e^{−2nε²}                          (by Hoeffding on each term of the sum).

Figure 5.4: Here, R is the risk and R_n is the empirical risk for two different datasets (sampled
from the same law). For a given function f ∈ H, the fluctuation (or variation) of R_n(f) around
R(f) is controlled by the Hoeffding inequality. However, f_n depends on the dataset and the
fluctuation of the associated empirical risk cannot be controlled by Hoeffding.

Figure 5.5: Idea for counting functions.

In other words, we can say that with probability at least 1 − δ, we have

    |R(f_n) − R_n(f_n)| ≤ sup_{f∈H} |R(f) − R_n(f)| ≤ √(ln(2h/δ) / (2n)).

Moreover, we have just shown that, if the hypothesis space is a finite set, then the ERM principle is
strictly consistent, for any distribution (no specific assumption has been made about P (z), apart
from the fact that the related random variables are bounded; this means that it will work for
any—classification here—problem).
Unfortunately, even for the simplest cases exemplified before, the hypothesis space is uncount-
able and this method (Hoeffding with a union bound) does not apply. Yet, the underlying fun-
damental idea is good; it is just the way of counting functions which is not smart enough. Next
we introduce the Vapnik-Chervonenkis dimension, which is one measure of the capacity (a way
of counting functions in a smart way) of a hypothesis space, among others (e.g., Rademacher
averages, shattering dimension, and so on).
Recall that we are focusing here on the case of classification. The basic idea for counting
functions in a smart way is as follows: as we work with empirical risks, there is no difference
between two functions that provide the same empirical risk, so they should not be counted twice.
Consider the examples provided in Fig. 5.5, with 3 to 4 labelled points. Suppose that the classifiers (the
functions of the hypothesis space) make their predictions based on a separating hyperplane (see the
left figure). If two classifiers predict the same labels for the provided examples, they will have the
same risk and will not be distinguishable. In the left figure, there are 8 different classifiers from
this viewpoint. In the middle figure, there are only 6 different classifiers (the points are aligned). In
the right figure, there are 8 possible classifiers (not all configurations of the labels are possible
given a linear separator). Generally speaking, given any hypothesis space and n data points, there
are at most 2^n possible classifiers (still relatively to the value of the associated empirical
risk). We have just seen in the simple example of Fig. 5.5 that if there are more than 3 points, a

linear separator will not be able to provide all possibilities. This rough idea is formalized by the
following result.

Theorem 5.3 (Vapnik & Chervonenkis, Sauer & Shelah). Define G^H(n), the growth function of a
set of functions Q(z, f), f ∈ H, as⁴

    G^H(n) = ln( sup_{z_1,...,z_n ∈ Z} N^H(z_1, . . . , z_n) )

    with N^H(z_1, . . . , z_n) = Card(Q_{z_1,...,z_n})

    and Q_{z_1,...,z_n} = { (Q(z_1, f) . . . Q(z_n, f))^T : f ∈ H }.

The growth function satisfies one of the two following properties:

1. either G^H(n) is linear,

    G^H(n) = n ln 2, ∀n ∈ N*;

2. or G^H(n) grows at most logarithmically after a given rank,

    G^H(n) = n ln 2 if n ≤ h,   and   G^H(n) ≤ h(1 + ln(n/h)) if n > h,

with h the greatest integer such that G^H(n) = n ln 2.

In the first case, the Vapnik-Chervonenkis dimension is infinite, d_VC(H) = ∞. In the second case,
we have d_VC(H) = h.

Alternatively, we can say that the Vapnik-Chervonenkis dimension of a set of indicator functions
Q(z, f), f ∈ H is the maximum number h of vectors z_1, . . . , z_h which can be separated in all 2^h
possible ways using functions of this set (shattered by this set of functions). It plays the role of
the cardinal of this set (while being much smarter). This dimension depends on the structure of
the hypothesis space, but not on the distribution of interest (not on P(z), so not on the specific
problem at hand). It can be estimated or computed in many cases. For example, if classification
is done thanks to a linear separator in R^d, then we have d_VC(H) = d + 1. This says that the
Vapnik-Chervonenkis dimension is a measure of the capacity of the space (and again, it is not
the only one).
The following result can be proven.

Theorem 5.4 (Bound on the risk). Let δ be in (0, 1). With probability at least 1 − δ, we have

    ∀f ∈ H,  R(f) ≤ R_n(f) + 2 √( (2/n) ( d_VC(H) ln(2en/d_VC(H)) + ln(2/δ) ) ),   (5.6)

with e being the exponential number (the mathematical constant).

This being true for any f, it is also true for f_n, the minimizer of the empirical risk. This gives
a bound on the risk, based on the number of samples and on the capacity of H. Notably, this tells
us that if n ≫ d_VC(H), then the error term is small and we do not have problems of overfitting (and
conversely, if we do not have enough samples, the empirical risk will be far away from the risk,
leading to an overfitting problem). A direct (and important) corollary of this result is that the
empirical risk minimization is strictly consistent if d_VC(H) < ∞.
⁴ Q_{z_1,...,z_n} is the set of all possible combinations of labels that can be computed given the functions in H for
the data points z_1, . . . , z_n, N^H(z_1, . . . , z_n) is the cardinal of this set (bounded by 2^n), and G^H(n) is the
logarithm of the supremum of these cardinals over arbitrary datasets (bounded by n ln 2).
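To get a feeling for the bound (5.6), the following sketch (illustrative Python; the function name is ours, and the choice d_VC = d + 1 = 11 corresponds to linear separators in R^10) evaluates the confidence term of the bound for a growing number of samples, showing that it becomes small once n ≫ d_VC(H).

    # Numerical evaluation of the confidence term of the bound (5.6).
    import numpy as np

    def vc_confidence_term(n, d_vc, delta=0.05):
        """Second term of Eq. (5.6): 2 * sqrt((2/n) * (d_vc * ln(2en/d_vc) + ln(2/delta)))."""
        return 2.0 * np.sqrt(2.0 / n * (d_vc * np.log(2 * np.e * n / d_vc) + np.log(2 / delta)))

    d_vc = 11  # e.g. linear separators in R^10
    for n in [100, 1_000, 10_000, 100_000]:
        print(f"n = {n:>6d}   confidence term = {vc_confidence_term(n, d_vc):.3f}")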

5.1.5 To go further
We have provided a short (and dense) introduction to statistical learning theory. For a deeper
introduction, the interested student can follow the optional course “apprentissage statistique” or
read the associated course material by Geist (2015a) (in French). Most of the material presented
in the above sections comes from the book by Vapnik (1998), who did a seminal work in statistical
learning theory. Vapnik (1999) provides a much shorter (than the book) introduction. A seminal
work regarding PAC analysis is the one of Valiant (1984). Other references that might be of
interest (the list being far from exhaustive and in no particular order, the last reference focusing more on
concentration inequalities) are (Bousquet et al., 2004; Cucker and Smale, 2001; Evgeniou et al.,
2000; Hsu et al., 2014; Györfi et al., 2006; Boucheron et al., 2013). For a general analysis of support
vector machines (to be presented in Part III), see Guermeur (2007).

5.2 Classification, convex surrogates and calibration


In this section, we focus on the problem of classification. In this case, the input set X is usually a
subspace of R^p and the output space is Y = {1, . . . , K}, a finite set of labels. The natural loss for
classification is the binary loss (5.2), defined as L(y, f(x)) = 1_{y ≠ f(x)}, which gives the risk⁵

    R(f) = E[L(Y, f(X))] = E[1_{Y ≠ f(X)}] = P(Y ≠ f(X)).

In other words, minimizing this risk corresponds to minimizing the probability of predicting a
wrong label. That is why the binary loss is the natural loss for classification. The associated
empirical risk is
    R_n(f) = (1/n) Σ_{i=1}^n 1_{y_i ≠ f(x_i)}.
There are two problems with this risk:
1. with this formulation, the hypothesis space should satisfy H ⊂ {1, . . . , K}^X, and it is quite
hard to design (for example through a parameterization) a space of functions that output
labels (much harder than designing spaces of functions that output reals);
2. the resulting optimization problem (minimizing the empirical risk) is really hard to solve
(not smooth, not convex, and so on).
Therefore, it is customary to minimize a surrogate (or proxy) to this risk of interest.

5.2.1 Binary classification with binary loss


We start by discussing the case of binary classification. Without loss of generality, assume that
Y = {−1, 1}, which will be more convenient. Recall the hypothesis space of linear parameteriza-
tions (5.1),

    H = {f_α : x ↦ α^T φ(x), α ∈ R^d},

with φ : X → R^d a predefined feature vector. For any x, f_α(x) is a scalar, while we would like
to output an element of {−1, 1}. This can be done using the sign trick briefly mentioned before.
Write G the following hypothesis space:

    G = {g_α : x ↦ sgn(f_α(x)), f_α ∈ H} ⊂ {−1, 1}^X.

We have just designed a (simple) hypothesis space for classification (through the space H). Con-
sidering the binary loss, this leads to the following optimization problem:

    min_{α∈R^d} R_n(g_α) = min_{α∈R^d} (1/n) Σ_{i=1}^n 1_{y_i ≠ sgn(f_α(x_i))} = min_{α∈R^d} (1/n) Σ_{i=1}^n 1_{y_i ≠ sgn(α^T φ(x_i))}.
⁵ We recall that in probability, for an event A depending on a random variable Z, we have that E[1_A] =
∫ 1_A dP(z) = ∫_A dP(z) = P(A).

                              ϕ(t) for t ∈ R
    hinge loss                max(0, 1 − t)
    truncated least-squares   (max(0, 1 − t))^2
    least-squares             (1 − t)^2
    exponential loss          e^{−t}
    sigmoid loss              1 − tanh(t)
    logistic loss             ln(1 + e^{−t})

Table 5.1: Some classic loss functions for binary classification, see also Eq. (5.7).

[Figure: curves of L(y, f(x)) = ϕ(y·f(x)) as a function of y·f(x) ∈ [−3, 3], for ϕ the binary loss
1_{R−}(y·f(x)), the hinge loss max(1 − y·f(x), 0), the exponential loss e^{−y·f(x)} and the logistic
loss ln(1 + e^{−y·f(x)}).]

Figure 5.6: Some plots of classic loss functions for binary classification, see also Eq. (5.7).
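Curves in the spirit of Fig. 5.6 can be reproduced with the following sketch (illustrative Python, assuming matplotlib is available), which plots some of the losses of Tab. 5.1 as functions of the margin t = y·f(x).

    # Plot a few classification losses as functions of the margin t = y*f(x).
    import numpy as np
    import matplotlib.pyplot as plt

    t = np.linspace(-3, 3, 300)
    losses = {
        "binary loss":  (t <= 0).astype(float),
        "hinge":        np.maximum(0.0, 1.0 - t),
        "exponential":  np.exp(-t),
        "logistic":     np.log(1.0 + np.exp(-t)),
    }
    for name, values in losses.items():
        plt.plot(t, values, label=name)
    plt.xlabel("y.f(x)")
    plt.ylabel("loss")
    plt.legend()
    plt.show()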

The question is: how to solve this optimization problem? It is not convex (convexity being a very
nice property in optimization), not even smooth, so there is no easy answer.
A solution is to introduce a surrogate to this risk, that works on f_α instead of g_α, and that
solves (approximately) the same problem. We give a simple example now. The rationale is to
introduce a loss function such that the loss is low when y and f(x) have the same sign, and high
in the other case. Consider the following risk and its empirical counterpart:

    R(f_α) = E[e^{−Y f_α(X)}]   and   R_n(f_α) = (1/n) Σ_{i=1}^n e^{−y_i f_α(x_i)}.

We have that R(f_α) ≥ 0. To minimize this risk, we should set f_α such that sgn(f_α(x_i)) = sgn(y_i),
and with a high (ideally infinite) absolute value. Therefore, minimizing this risk makes sense from
a classification perspective. Moreover, as here f_α is linear (in the parameters), one can easily show
that R_n(f_α) is a smooth and convex function of α. This allows using a number of optimization
algorithms to solve the related problem, with strong guarantees of computing the global minimum.
This is called the exponential loss. More generally, writing L(y, f(x)) = ϕ(yf(x)) for a
convenient and well-chosen function ϕ : R → R, we have the following generic surrogate for binary
classification:

    R(f_α) = E[ϕ(Y f_α(X))]   and   R_n(f_α) = (1/n) Σ_{i=1}^n ϕ(y_i f_α(x_i)).   (5.7)
In Tab. 5.1 and Fig. 5.6, we provide some classic loss functions. Using a convex surrogate allows for
solving a convex optimization problem when minimizing the empirical risk. In other words, given
some dataset S = {(x_1, y_1), · · · , (x_i, y_i), · · · , (x_N, y_N)}, the functional R_n(f) is convex with respect
to f. Let us recall that

    R_n(f) = (1/N) Σ_{(x_i,y_i)∈S} L(y_i, f(x_i)) = (1/N) Σ_{(x_i,y_i)∈S} ϕ(y_i f(x_i))

is a sum (with positive weights 1/N) of the terms K_i(f) := ϕ(y_i f(x_i)). Therefore, the convexity of
R_n(f) can be deduced from the convexity of each of these terms with respect to f. This comes naturally
from the convexity of ϕ, i.e.

    ∀λ ∈ [0, 1],  ϕ(λx + (1 − λ)x') ≤ λϕ(x) + (1 − λ)ϕ(x'),

                              ψ(s) for s ∈ R
    hinge loss                max(0, 1 + s)
    truncated least-squares   (max(0, 1 + s))^2
    least-squares             (1 + s)^2
    exponential loss          e^{s}
    logistic loss             ln(1 + e^{s})

Table 5.2: Some loss functions for cost-sensitive multiclass classification, see also Eq. (5.12).

since it enables us to derive that

    K_i(λf + (1 − λ)f') = ϕ(y_i (λf + (1 − λ)f')(x_i))
                        = ϕ(λ y_i f(x_i) + (1 − λ) y_i f'(x_i))
                        ≤ λ ϕ(y_i f(x_i)) + (1 − λ) ϕ(y_i f'(x_i))
                        = λ K_i(f) + (1 − λ) K_i(f'),

which shows that R_n(f) is convex with respect to f. As linear functions are considered here, i.e.
f(x) = f_α(x) = α^T φ(x), the convexity of R_n(f) with respect to f allows for asserting the convexity
of R_n(f_α) with respect to the parameters α, since the composition of a convex function and an affine
function is convex.
Knowing what surrogate to choose is not an easy question: different surrogates have different theoretical
guarantees, they lead to problems that are more or less easy to optimize, and so on. Notice that once one
has solved α_n = argmin_{α∈R^d} R_n(f_α), for the empirical risk given in Eq. (5.7), the decision rule (the
solution to the classification problem) is g_{α_n}(x) = sgn(f_{α_n}(x)).
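As an illustration of this whole pipeline, the following sketch (our own minimal Python example, not the reference implementation of the course) minimizes the empirical risk (5.7) with the logistic loss and a linear parameterization f_α(x) = α^T φ(x), with φ(x) = (x, 1), by plain gradient descent, and then classifies with the sign of f_α; the toy data and the step size are arbitrary choices.

    # Binary classification with the logistic surrogate and gradient descent.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    X = rng.normal(size=(n, 2))
    y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=n))  # labels in {-1, +1}

    Phi = np.hstack([X, np.ones((n, 1))])   # feature map phi(x) = (x, 1) adds a bias term
    alpha = np.zeros(Phi.shape[1])

    for _ in range(2000):
        margins = y * (Phi @ alpha)         # y_i * f_alpha(x_i)
        # gradient of (1/n) * sum_i ln(1 + exp(-margin_i)) with respect to alpha
        grad = -(Phi * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        alpha -= 0.5 * grad

    predictions = np.sign(Phi @ alpha)      # decision rule: sgn(f_alpha(x))
    print("training accuracy:", np.mean(predictions == y))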

5.2.2 Cost-sensitive multiclass classification


We consider now the general cost-sensitive multiclass classification problem. Let G ⊂ {1, . . . , K}^X
be a hypothesis space of classifiers. The general risk considered here is

    R(g) = E[c(X, g(X), Y)],

where c(x, g(x), y) is the cost⁶ of assigning label g(x) to input x when the oracle provides the
response y. For example, the binary loss corresponds to c(x, g(x), y) = 1_{y ≠ g(x)}. Cost-sensitive
classification might be of interest for some applications. Assume that one wants to predict if a
patient has a given disease or not, given a set of physiological measurements. If the patient is
predicted to be ill, further (more expensive) tests are conducted to be sure, and the patient is
treated. If the patient is not predicted to be ill, nothing is done. However, if the prediction is
false, the patient dies, as no treatment is given to him. Predicting that a patient is not ill while he
is has a higher cost than the converse. Moreover, the consequence of the patient not being treated while
he is ill might also depend on the patient (for example, if he is young and athletic, he won't die),
so this cost might depend on the input too.
The empirical risk to be minimized is

    R_n(g) = (1/n) Σ_{i=1}^n c(x_i, g(x_i), y_i).

We have the same problem as before: designing the space G is difficult and the resulting optimiza-
tion problem is hard, if at all solvable. Consider again the hypothesis space of linear parameteriza-
tions (5.1),

    H = {f_α : x ↦ α^T φ(x), α ∈ R^d}.

From this, we would like to design a space G ⊂ {1, . . . , K}^X. Write f_{α_1,...,α_K} : X → R^K the function
defined as

    f_{α_1,...,α_K}(x) = (f_{α_1}(x) . . . f_{α_K}(x))^T   such that ∀1 ≤ k ≤ K, f_{α_k} ∈ H.
6 There should be no cost for assigning the correct label, that is c(x, y, y) = 0.

In other words, for each label k, we define a function f_{α_k} ∈ H which can be interpreted as the
score of this label. For a given input, the predicted label is the one with the highest score. The
corresponding space of classifiers is

    G = { g_{α_1,...,α_K} : x ↦ argmax_{1≤k≤K} f_{α_k}(x),   ∀1 ≤ k ≤ K : f_{α_k} ∈ H }.

It remains to define a surrogate to the risk of interest that operates on the functions f_{α_k}
instead of g_{α_1,...,α_K}. Let ψ : R → R be one of the functions defined in Tab. 5.2; we consider the
following surrogate:

    R_n(f_{α_1,...,α_K}) = (1/n) Σ_{i=1}^n Σ_{k=1}^K c(x_i, k, y_i) ψ(f_{α_k}(x_i)).   (5.12)

For the functions of Tab. 5.2, if the functions f_{α_k} are linearly parameterized, the resulting em-
pirical risk is convex. Consider for example the exponential loss, ψ(s) = e^s. For the loss
Σ_{k=1}^K c(x_i, k, y_i) exp(f_{α_k}(x_i)) to be small, we should have f_{α_{y_i}}(x_i) > f_{α_k}(x_i) for k ≠ y_i, which shows
informally that this surrogate makes sense (generally, the constraint Σ_{k=1}^K f_{α_k}(x) = 0 is added).
Said otherwise, minimizing this surrogate loss should push for smaller scores for labels with larger
costs. Consequently, the classifier that selects the label with maximal score should incur a small
cost.
Notice that the surrogate (5.12) does not generalize the one of Eq. (5.7): if c(x, f(x), y) =
1_{y ≠ f(x)} and if K = 2, the two expressions are not equal. Notice also that it is not the sole solution.
For example, consider the following surrogate:

    R_n(f_{α_1,...,α_K}) = (1/n) Σ_{i=1}^n Σ_{k=1}^K c(x_i, k, y_i) e^{f_{α_k}(x_i) − f_{α_{y_i}}(x_i)}.   (5.13)

It is also a valid surrogate. Knowing what surrogate to use in the cost-sensitive multiclass classi-
fication case is a rather open question.
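For concreteness, the following sketch (illustrative Python; the helper name surrogate_risk, the toy data and the cost function are our own choices) evaluates the surrogate (5.12) with the exponential loss ψ(s) = e^s and linear scores; minimizing it with any convex optimizer would yield the cost-sensitive classifier described above.

    # Evaluation of the cost-sensitive surrogate (5.12) with psi(s) = exp(s).
    import numpy as np

    def surrogate_risk(A, X, y, cost):
        """A: (K, d) score parameters, X: (n, d) inputs, y: (n,) labels in {0..K-1},
        cost(x, k, y): cost of predicting k when the true label is y."""
        scores = X @ A.T                     # (n, K), entry [i, k] is f_k(x_i)
        n, K = scores.shape
        total = 0.0
        for i in range(n):
            for k in range(K):
                total += cost(X[i], k, y[i]) * np.exp(scores[i, k])
        return total / n

    # binary-cost example: no cost for the correct label, cost 1 otherwise
    binary_cost = lambda x, k, y: float(k != y)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    y = rng.integers(0, 4, size=50)
    A = 0.1 * rng.normal(size=(4, 3))
    print("surrogate risk:", surrogate_risk(A, X, y, binary_cost))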

5.2.3 Calibration
Defining a convex surrogate loss is a common technique to reduce the computational cost of learning
a classifier. However, even if the resulting problem is more amenable to efficient optimization, it is
unclear whether minimizing the surrogate loss results in a good accuracy (that is, whether it will minimize
the original loss of interest). The study of this problem is known as calibration. For example, write
R the risk and R_ϕ its convex surrogate. The question is: if we want to have a suboptimality gap ε
for the risk R, how small should the suboptimality gap for R_ϕ be? If there exists a positive-valued
function δ (called a calibration function) such that

    R_ϕ(f) ≤ δ(ε) ⇒ R(f) ≤ ε,

the surrogate loss is said to be calibrated with respect to the primary loss. A deeper study of
calibration is beyond the scope of this course material, but the interested student can look at the
references provided next.

5.2.4 To go further
Using a convex surrogate, such as the hinge loss in binary classification (Cortes and Vapnik,
1995), is a common technique in machine learning. Bartlett et al. (2006) studies the calibration
of surrogates provided in Eq. (5.7) (see also the work of Steinwart (2007) for a more general
treatment). The surrogate of Eq. 5.12 has been introduced by Lee et al. (2004) in the cost-
insensitive case and extended to the cost-sensitive case by Wang (2013). Ávila Pires et al. (2013)
study its calibration. The surrogate of Eq. (5.13) has been proposed by Beijbom et al. (2014), the
motivation being to introduce “guess-averse” estimators. An alternative to convex surrogate is to
use “smooth surrogates” (with no problem of calibration, at the cost of convexity), see the work
of Geist (2015b).

5.3 Regularization
We have seen in Sec. 5.1 that the number of samples should be large compared to the capacity (or
richness) of the considered hypothesis space. If one does not have enough samples, or has a too rich hypoth-
esis space, the problem of overfitting might appear and one should choose a smaller hypothesis
space. However, this is not necessarily an easy task. For example, consider a linear parameteriza-
tion: how should one choose beforehand the basis functions to be removed? When working with
an RKHS (roughly speaking, when using the kernel trick, see Part III), the hypothesis space is im-
plicitly defined through a kernel (and with a Gaussian kernel, commonly used with support vector
machines—see Part III again—, the corresponding Vapnik-Chervonenkis dimension is infinite) and
it cannot easily be shrunk explicitly. A solution to this problem is to regularize the risk.

5.3.1 Penalizing complex solutions


Let H be a hypothesis space and R_n be the empirical risk of interest. Let Ω : H → R be a
function that penalizes the complexity of a candidate solution. Regularization amounts to minimizing
the following objective:

    J_n(f) = R_n(f) + λΩ(f),

where λ is called the regularization factor (it is a free parameter) and allows setting a compromise
between the minimization of the empirical risk and the complexity of the solution. Informally, this can
also be seen as solving the following optimization problem:

    min_{f∈H : Ω(f)≤η} R_n(f),

for some value of the (free) parameter η. Instead of searching for a minimizer of the empirical
risk in the whole hypothesis space, we only look for solutions of a maximum complexity (as
measured by Ω) of η. How to solve the corresponding optimization problem obviously depends
on the risk and on the measure of complexity of solutions. The theoretical analysis of the related
solutions also depends on these instantiations. We give some examples of possible choices for Ω in
the next section.

5.3.2 Examples
For simplicity, we consider here a space of parameterized functions:

    H = {f_α : X → R, α ∈ R^d}.

We give some classic regularization terms:

ℓ2-penalization: it is defined as

    Ω(f_α) = ‖α‖²_2 = Σ_{j=1}^d α_j²

and is also known as Tikhonov regularization. It is often used with a linear parameteriza-
tion and an ℓ2-loss, providing the regularized linear least-squares (which admits an analytical
solution, see the sketch after this list);

ℓ0-penalization: this is sometimes called the sparsity norm, or ℓ0-norm, even though it is not a
norm. It is defined as

    Ω(f_α) = ‖α‖_0 = Card({j ∈ {1, . . . , d} : α_j ≠ 0}).

The more coefficients differ from zero (the less sparse the solution), the more f_α is pe-
nalized. Notice that solving the corresponding optimization problem is intractable in
general;

ℓ1-penalization: it is defined as

    Ω(f_α) = ‖α‖_1 = Σ_{j=1}^d |α_j|.

This is often used as a proxy for the ℓ0 norm, as it also promotes sparse solutions.

We do not provide more examples, but there exist many more ways of penalizing an empirical risk.
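As announced in the ℓ2 item above, here is a minimal sketch (illustrative Python; the exact scaling convention for λ is a choice, and the toy data are ours) of regularized linear least-squares, whose closed-form solution is α = (Φ^T Φ + λI)^{-1} Φ^T y.

    # Ridge regression (l2-regularized linear least squares) in closed form.
    import numpy as np

    def ridge(Phi, y, lam):
        """Phi: (n, d) design matrix, y: (n,) targets, lam: regularization factor."""
        d = Phi.shape[1]
        return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

    rng = np.random.default_rng(0)
    Phi = rng.normal(size=(100, 5))
    y = Phi @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.normal(size=100)
    print("lambda = 0  :", ridge(Phi, y, 0.0))
    print("lambda = 10 :", ridge(Phi, y, 10.0))   # coefficients are shrunk toward zero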

5.3.3 To go further
A (not always convenient) alternative to regularization is structural risk minimization7 (Vapnik,
1998). For a risk based on the `2 -loss and a linear parameterization, `2 -penalization has been
proposed by Tikhonov (1963) and `1 -penalization by Tibshirani (1996a) (it is known as LASSO,
least absolute shrinkage and selection operator, an efficient method for solving it being proposed
by Efron et al. (2004b)). Under the same assumptions (that is, a linear least-squares problem),
see Hsu et al. (2014) for an analysis of `2 -penalization and Bunea et al. (2007) for `1 -penalization
(among many others). For regularization in support vector machines (and the link between margin
maximization and risk minimization), see for example Evgeniou et al. (2000).

⁷ Roughly speaking, assume that you have an increasing sequence of hypothesis spaces, H = ∪_k H_k with H_1 ⊂ · · · ⊂ H_k ⊂ · · · .
One can minimize the empirical risk in each subspace and evaluate the corresponding bound on the risk (see for
example Eq. (5.6)), and choose the solution with the smallest upper bound on the risk.
Chapter 6

Preprocessing

6.1 Selecting and conditioning data


6.1.1 Collecting data
There might be different starting points in machine learning depending on your objectives. If
your aim is to solve a particular problem with particular data, you necessarily have to begin with
collecting samples and, if you work on a classification problem, to label all the collected samples.
However, if your aim is rather to develop new machine learning techniques, then it is relevant
to work on popular benchmarks on which other techniques have already been tried. There are
various datasets that are commonly used for this second aspect of developing a new machine
learning algorithm. While reading machine learning papers you will certainly notice the use of
the following datasets:

• the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/), which is hosting
a large collection of datasets such as the IRIS dataset for classification (150 samples with 4
features characterizing flowers of 3 classes) and the diabetes dataset (442 samples with 10 features
and a target to regress)

• the MNIST dataset (http://yann.lecun.com/exdb/mnist/) with 28 × 28 black and white
images of handwritten digits, with a total of 70000 samples

• face image databases; various links to databases are provided at http://www.face-rec.org/databases/

There are actually plenty of datasets that can be used, a lot of them being listed on the Wikipedia
page https://en.wikipedia.org/wiki/List_of_datasets_for_machine_learning_research.

There are also some online platforms, such as Kaggle1 , on which competitions are proposed
with associated datasets.

6.1.2 Conditioning the data


Dealing with ordinal and categorical features
The machine learning algorithms we consider in this document work with vectorial and numerical
inputs. It turns out that some datasets (e.g. medical) contain features that are not numerical but
rather ordinal or categorical. In these two situations, the variable is actually a label rather than
a number. The difference between ordinal and categorical data is that ordinal data are naturally
ordered (e.g. ratings as excellent, very good, good, fair, poor), while categorical features have no
natural ordering (e.g. countries France, United States, Spain, ...).
Let us now consider different encoding schemes of ordinal and categorical features. Suppose
we have a feature being the rating of a movie with three possible values : excellent, fair and poor.
¹ https://www.kaggle.com


In order to keep the ordering of the values when encoding the feature, one may use an increasing
number with the following encoding:

    Ordinal feature value     poor   fair   excellent
    Numerical feature value   -1     0      1
When the dataset contains categorical features, we cannot use the same encoding as for or-
dinal features, simply because we would otherwise induce an ordering of the feature values that is not
in the data. Let us consider a dataset of people with a feature indicating their nationality:
F = {American, Spanish, German, French}. Sometimes, the feature value might appear as an
integer because someone decided to encode categorical features with a number, e.g. American=0,
Spanish=1, German=2, French=3, so that the fact that a feature is actually categorical might not
immediately pop out. One possible encoding scheme for handling such categorical features is to
use a so-called one-hot encoding. The idea is to create |F| features, all equal to zero except the
one corresponding to the observed value, which is set to one.

    Categorical feature value   American       Spanish        German         French
    Numerical feature values    [1, 0, 0, 0]   [0, 1, 0, 0]   [0, 0, 1, 0]   [0, 0, 0, 1]
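A minimal sketch of both encodings, assuming the pandas library (not otherwise used in this document) is available; pd.get_dummies produces the one-hot columns of the table above, and the ordinal mapping is the one of the first table.

    # Encoding an ordinal feature (rating) and a categorical one (nationality).
    import pandas as pd

    df = pd.DataFrame({
        "rating":      ["poor", "excellent", "fair", "poor"],        # ordinal
        "nationality": ["French", "German", "Spanish", "American"],  # categorical
    })

    # ordinal feature: map to increasing numbers to preserve the ordering
    df["rating_num"] = df["rating"].map({"poor": -1, "fair": 0, "excellent": 1})

    # categorical feature: one-hot encoding, one binary column per possible value
    one_hot = pd.get_dummies(df["nationality"], prefix="nat")
    print(pd.concat([df, one_hot], axis=1))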

Working with text documents


Text documents are specific in the sense that they are purely symbolic. In order to feed a machine
learning algorithm which requires feature vectors as input, one has to convert these text documents
into feature vectors. One way to do so is to use the bag-of-words representation. In the bag-of-
words model, we take all the vocabulary of the corpus at hand, treating the words as
case-insensitive and dropping the various punctuation elements. We then build a vector
with the size of the vocabulary as the number of dimensions, count the occurrences of
each word and fill in the vector accordingly. Therefore, each text document gets represented as a
feature vector. The bag-of-words encoding can be extended to take into account patterns of words
in the documents with n-grams, as you would do, for example, when computing Gabors on an image.
N-grams are patterns of n consecutive words. Clearly, as we are working with text documents that
have some grammatical structure, you would never encounter all the possible n-grams you could
imagine from a given vocabulary. If you were to know the true distribution of text documents, it
would be a manifold of lower dimension than the vector space spanned by the n-gram counts;
your input vectors would therefore be much sparser than when using bag-of-words, but at the
same time they focus on patterns of words which can be more informative.
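A minimal bag-of-words sketch with scikit-learn's CountVectorizer (the two-document corpus is a toy example of ours); lowercasing and punctuation removal are its default behaviour, and passing ngram_range=(1, 2) would add bigram features on top of the single words.

    # Bag-of-words representation of a small corpus.
    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [
        "the cat sat on the mat",
        "the dog ate the cat",
    ]
    vectorizer = CountVectorizer()          # or CountVectorizer(ngram_range=(1, 2))
    X = vectorizer.fit_transform(corpus)    # sparse matrix, one row per document

    print(sorted(vectorizer.vocabulary_))   # the learned vocabulary
    print(X.toarray())                      # word counts, one column per vocabulary entry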

Imputation of missing values


It might be that, for some reason, the dataset contains entries with some missing features. This
can be indicated in various ways depending on the dataset. For example, the Heart Disease dataset²
has some entries with missing features indicated by the special value -9; some other datasets might
use a blank space to indicate a missing value. One easy way to deal with missing values is simply
to drop the entries which have at least one missing feature, but this might dramatically decrease
the number of data points in your dataset. One may rather consider imputing the missing values, i.e.
filling in the missing feature with something. Several possibilities can be considered to define which
value to put in the missing slots, for example the mean, median or most frequent feature value
over the whole dataset.
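A minimal imputation sketch with scikit-learn's SimpleImputer, assuming the missing entries are marked with the special value -9 as in the Heart Disease dataset; the tiny array below is of course purely illustrative.

    # Mean imputation of features marked as missing with the value -9.
    import numpy as np
    from sklearn.impute import SimpleImputer

    X = np.array([[63.0,  1.0, 145.0],
                  [67.0, -9.0, 160.0],    # second feature is missing
                  [41.0,  0.0,  -9.0]])   # third feature is missing

    imputer = SimpleImputer(missing_values=-9.0, strategy="mean")
    print(imputer.fit_transform(X))       # missing slots replaced by the column mean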

Feature scaling
Given a set {x_0, · · · , x_i, · · · , x_{N−1}} ∈ R^d of input vectors, it is usual that the variables do not
vary in the same ranges. Several machine learning algorithms compute Euclidean
distances in input space, such as k-nearest neighbors or support vector machines
with radial kernels. In this situation, if one dimension has a range that is, say, some orders of magnitude
larger than the others, it will dominate the computation of the distance, and the other dimensions will
hardly be taken into account. There are also some cases where the learning algorithm can
² http://archive.ics.uci.edu/ml/datasets/Heart+Disease

converge faster if the features are normalized. An example of this situation is when you perform
gradient descent on, for example, a linear regression model with a quadratic cost³. If the features
are normalized, the loss will be much more circularly symmetric than if they are not. In this case, the
gradient points toward the minimum and the algorithm converges faster. As we
shall see later, for example when speaking about neural networks, we will minimize cost functions
with penalties on the parameters we are looking for. This penalty actually ensures that we do
not overfit the training set. If we do not standardize the data, the penalty will typically penalize
much more the dimensions which cover the largest range, which is not a desired effect.
There are various ways to ensure that the features belong to the same range. Denoting by
x_{i,j} the j-th feature of the input x_i, you might:

• scale each feature so that it belongs to the range [0, 1]:

    ∀i ∈ [0, N−1], ∀j ∈ [0, d−1],  x'_{i,j} = (x_{i,j} − min_{k∈[0,N−1]} x_{k,j}) / (max_{k∈[0,N−1]} x_{k,j} − min_{k∈[0,N−1]} x_{k,j})

• standardize each feature so that it has zero mean and unit variance:

    ∀i ∈ [0, N−1], ∀j ∈ [0, d−1],  x'_{i,j} = (x_{i,j} − x̄_{.,j}) / σ_j

    with  x̄_{.,j} = (1/N) Σ_{k=0}^{N−1} x_{k,j}   and   σ_j = √( (1/N) Σ_{k=0}^{N−1} (x_{k,j} − x̄_{.,j})² )

Remember that feature scaling is performed based on the training data. This has two implications.
The first is that you must remember the parameters of the scaling, so that when you apply the learned
predictor to a new input, you actually make use of the same scaling as the one used for learning
the predictor. Also, as the scaling parameters are computed from the training set, you should in
principle include the scaling step in the whole pipeline that has to be cross-validated.
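A minimal sketch of the first point above with scikit-learn's StandardScaler (the synthetic data are ours): the mean and standard deviation are estimated on the training set only, and the very same transformation is then reused on new inputs.

    # Standardization fitted on the training set and reused on new data.
    import numpy as np
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X_train = rng.normal(loc=10.0, scale=5.0, size=(100, 3))
    X_test = rng.normal(loc=10.0, scale=5.0, size=(20, 3))

    scaler = StandardScaler().fit(X_train)     # scaling parameters learned from the training data
    X_train_std = scaler.transform(X_train)    # zero mean, unit variance per feature
    X_test_std = scaler.transform(X_test)      # same transformation applied to new inputs

    print(X_train_std.mean(axis=0), X_train_std.std(axis=0))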

6.2 Dimensionality reduction


Typical data we encounter in machine learning live in a high-dimensional space. This does not
necessarily mean that the data span the whole input space. They may rather occupy a space of
lower dimension, or at least the data might be projected into such a low-dimensional space without
disrupting the relationships between the data. Discovering this space of lower dimension has
many interests in machine learning. If we focus on fewer than three dimensions, we can then afford
to visualize the dataset we are working with. We may also dramatically decrease the size of the
input vectors we are working with and therefore speed up the algorithms working on these data.
Finally, projecting our fixed number of input vectors into a lower dimensional space and then learning
from these fewer features can help fight the curse of dimensionality, especially when the
number of original features is large. In this case, your data can quickly get sparse in the input
space and it is therefore hard to properly generalize. For example, if one has N input vectors
living in a d-dimensional space and one supposes that, in order to be efficient, the predictor requires
distinguishing between 10 values in each dimension, one ends up with 10^d different configurations
to distinguish, a number that grows exponentially with the number of features.
Optimally transforming original data that live in a high-dimensional space into a lower di-
mensional space is the focus of dimensionality reduction methods. This broad definition gives the
essence of dimensionality reduction. There are still some elements to be defined, such as what we
mean by "optimally" and which types of transformations we consider. Reducing the
dimension of some input data can be done by selecting a subset of the original dimensions, which
is termed feature selection, or by computing new features from the original dimensions,
³ which is considered just for illustrative purposes, as it can be solved analytically

which is termed feature extraction. In the next sections we focus on three specific fea-
ture extraction methods, namely Principal Component Analysis (PCA), Kernel PCA and Locally
Linear Embedding. These are derived from different definitions of optimality and of the transformations
applied to the data.

6.2.1 Variable selection


Foreword The elements in this section are mainly the integration of a former document. That
document (Geist, 2014) was written in French and is here translated into English. In this section, we
provide a quick overview of variable selection. The interested reader is referred to recent reviews
on this topic (Guyon and Elisseeff, 2003; Somol et al., 2010).

Taxonomy Suppose we are given input vectors {x_0, · · · , x_i, · · · , x_{N−1}} ∈ R^d and outputs
{y_0, · · · , y_i, · · · , y_{N−1}} to predict. It might be, for some reason, that not all the input dimensions are
necessary to predict the output. To take a dummy example to fix ideas, if one dimension
of all the input vectors is constant while the target to predict is varying, this dimension does
not bring any information for a predictor to predict the output. The problem of finding a subset
of the original dimensions that is hopefully sufficient to produce a good, or even better, predictor
of the output is referred to as variable selection. There are three main approaches to variable
selection: filters, wrappers and embedded methods. In the first two approaches, variables are selected or
dropped out according to a heuristic for the filters (e.g. correlation or mutual information between
the dimensions and the output) and according to an estimation of the real risk for the wrappers
(which implies computing several predictors minimizing a given loss). The filters are clearly less
computationally expensive than the wrappers but also less theoretically grounded. The embedded
methods introduce a penalty term in the loss to be minimized (e.g. an L1 penalty) which tends to
make the computed predictor rely on fewer than the available dimensions.

The LASSO (Least Absolute Shrinkage and Selection Operator, (Tibshirani, 1996b)) algorithm
is one example of embedded method. It adds an ℓ1 penalty to a linear least-squares loss⁴. The
resulting loss reads:

    R^S_emp(θ) + λ‖θ‖_1 = (1/N) Σ_{i=0}^{N−1} (y_i − θ^T x_i)² + λ‖θ‖_1

The ℓ1 penalty term promotes sparse predictors, i.e. predictors in which the parameter
vector θ has null coefficients, and therefore performs variable selection.
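A minimal sketch of this variable-selection effect using scikit-learn's Lasso (the toy data, where only two input dimensions actually matter, and the value of the regularization factor are ours); note how most entries of the estimated parameter vector are exactly zero.

    # Sparsity induced by the l1 penalty of the LASSO.
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)  # only x_0 and x_3 matter

    lasso = Lasso(alpha=0.1).fit(X, y)
    print(lasso.coef_)   # most coefficients are exactly zero: those variables are dropped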

Searching in the variable space Let us formalize the variable selection problem. The inputs
are vectors x ∈ R^d. A variable of x is one of its d components, which we denote x_j: x =
(x_0 x_1 . . . x_{d−1})^T. We define a set of k out of the d variables as X_{σ,k}:

    X_{σ,k} = {x_j : j ∈ σ^{-1}(1)},   with σ : [|1..d|] → {0, 1} and Card(σ^{-1}(1)) = k.

The function σ indicates, for each variable, if we should keep it or not. We denote by X_d the set
with all the variables. There are d!/(k!(d−k)!) possible X_{σ,k} sets with k variables, and the search space
actually contains 2^d sets. We suppose, and this will be detailed later on, that we have a score
function J which gives the quality J(X_{σ,k}) of the k selected variables. Some examples of score
functions will be given in the next paragraphs. Let us introduce some quantities:

• the individual relevance S_0, defined as:

    S_0(x_i) = J({x_i}), 1 ≤ i ≤ d

• given a set X_{σ,k}, the relevance S_{X_{σ,k}} of the variables x_j ∈ X_{σ,k}, defined as:

    ∀x_j ∈ X_{σ,k},  S_{X_{σ,k}}(x_j) = J(X_{σ,k}) − J(X_{σ,k} \ {x_j})
⁴ the regressor is linear in the input: f(θ, x) = θ^T x

• given a set X_{σ,k}, the relevance S^+_{X_{σ,k}} of the variables x_j ∈ X_d \ X_{σ,k}, defined as:

    ∀x_j ∈ X_d \ X_{σ,k},  S^+_{X_{σ,k}}(x_j) = J(X_{σ,k} ∪ {x_j}) − J(X_{σ,k})

We can now define a few additional notions which allow us to estimate how good it is to add or
remove a variable. Given a set of variables X_{σ,k}:

• x_j ∈ X_{σ,k} is the most relevant variable iff

    x_j = argmax_{x_k ∈ X_{σ,k}} S_{X_{σ,k}}(x_k)

• x_j ∈ X_{σ,k} is the least relevant variable iff

    x_j = argmin_{x_k ∈ X_{σ,k}} S_{X_{σ,k}}(x_k)

• x_j ∈ X_d \ X_{σ,k} is the next most relevant variable iff

    x_j = argmax_{x_k ∈ X_d \ X_{σ,k}} S^+_{X_{σ,k}}(x_k)

• x_j ∈ X_d \ X_{σ,k} is the next least relevant variable iff

    x_j = argmin_{x_k ∈ X_d \ X_{σ,k}} S^+_{X_{σ,k}}(x_k)

[Figure: the lattice of variable subsets, from the empty set ∅ (1 set), the singletons {x_0}, . . . , {x_{d−1}}
(d sets), the pairs {x_i, x_j} (d(d−1)/2 sets), and in general d!/(k!(d−k)!) sets at level k, down to the
full set X_d (1 set). Sequential Forward Search explores this lattice from top (∅) to bottom, Sequential
Backward Search from bottom (X_d) to top.]
Figure 6.1: Forward and Backward sequential search (SFS, SBS). See the text for details.

We can now introduce different strategies for building a set of variables X_{σ,k}. Starting from an
empty set of variables X = ∅, one can be greedy with respect to the score and add at each step the
next most relevant variable until reaching the desired number of variables. This strategy is called
Sequential Forward Search (SFS). Another possibility is to start from the full set X = X_d and drop
variables by removing the least relevant variable until reaching the desired number of variables.
This strategy is called Sequential Backward Search (SBS). These are the most classical strategies,
illustrated in figure 6.1.
The SFS and SBS are heuristics and not necessarily optimal. It might be, for example, following
the SFS strategy, that an added variable renders a previously added variable unnecessary. In
general, there is no guarantee that the SFS and SBS strategies find the optimal set of variables
among the exponentially growing number of configurations (2^d). Other strategies have
been proposed in the literature, such as the Sequential Floating Forward Search (SFFS) and Sequential
Floating Backward Search (SFBS) introduced in (Pudil et al., 1993), which combine forward and
backward steps. This algorithm has been extended in (Somol et al., 1999) and additional variants
are presented in (Somol et al., 2010).
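A minimal sketch of the SFS strategy (the function names and the toy score, a negated least-squares residual used as J, are our own illustrative choices): at each step, the next most relevant variable according to the provided score is added to the current subset.

    # Greedy Sequential Forward Search for a user-supplied score J (higher is better).
    import numpy as np

    def sfs(score, d, k):
        """Greedy forward selection of k variables out of d, given score(subset) -> float."""
        selected = []
        remaining = list(range(d))
        for _ in range(k):
            best = max(remaining, key=lambda j: score(selected + [j]))
            selected.append(best)
            remaining.remove(best)
        return selected

    # toy score: negated residual error of a least-squares fit on the selected columns
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 8))
    y = 2.0 * X[:, 1] - X[:, 5] + 0.1 * rng.normal(size=100)

    def score(subset):
        Xs = X[:, subset]
        theta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        return -np.sum((y - Xs @ theta) ** 2)

    print(sfs(score, d=8, k=2))   # should recover columns 1 and 5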

Filter and Wrapper methods Having introduced strategies for building up a set of variables
X_{σ,k}, we still need to define the score function J. The filter methods use heuristics in the definition
of J, while wrappers define the score function J from an estimation of the real risk of the predictor.
A first example of filter heuristic is Correlation-based Feature Selection (CFS) (Hall, 1999;
Guyon and Elisseeff, 2003). It looks for a set of variables that balances the correlation between
the variables and the output to predict, y, and the correlation between the variables themselves.
The strategy is to keep features highly correlated with the output, yet uncorrelated to each other.
Namely, given a training set {(x_i, y_i), 0 ≤ i ≤ N − 1}, the score J of a set X_{σ,k} is defined as:

    J_CFS(X_{σ,k}) = k r(X_{σ,k}, y) / √( k(k−1) r(X_{σ,k}, X_{σ,k}) )

    r(X_{σ,k}, y) = (1/k) Σ_{i ∈ σ^{-1}(1)} r(x_i, y)

    r(X_{σ,k}, X_{σ,k}) = (1/(k(k−1))) Σ_{i,j ∈ σ^{-1}(1), i≠j} r(x_i, x_j)

with r a measure of correlation, for example:

    r(x_i, y) = ( (1/N) Σ_{k=0}^{N−1} x_{k,i} y_k ) / √( ( (1/N) Σ_{k=0}^{N−1} x_{k,i}² ) ( (1/N) Σ_{k=0}^{N−1} y_k² ) )

    r(x_i, x_j) = ( (1/N) Σ_{k=0}^{N−1} x_{k,i} x_{k,j} ) / √( ( (1/N) Σ_{k=0}^{N−1} x_{k,i}² ) ( (1/N) Σ_{k=0}^{N−1} x_{k,j}² ) )

where x_{j,i} denotes the i-th component of the input x_j. One can use other correlation measures,
such as mutual information (Hall, 1999). Other filter heuristics have been proposed and some of
them are reviewed in (Zhao et al., 2008).

The wrappers use an estimation of the real risk as the score function J. Denote x^σ ∈ R^k
the vector in which only the k components in X_{σ,k} are retained. Suppose we have a training
set {(x_i, y_i), 0 ≤ i ≤ N − 1} and a validation set {(x'_i, y'_i), 0 ≤ i ≤ M − 1}. Let us denote f̂ the
predictor obtained from a given learning algorithm on the training set {(x^σ_i, y_i), 0 ≤ i ≤ N − 1}.
The score J(X_{σ,k}) can be defined as:

    J(X_{σ,k}) = (1/M) Σ_{i=0}^{M−1} L( y'_i, f̂(x'^σ_i) )

with L a given loss (usually strongly dependent on the considered learning algorithm). Other esti-
mations of the real risk could be used, such as K-fold cross-validation or bootstrapping.

Embedded methods The embedded methods make use of the specifics of the learning algorithm
you consider and are therefore usually dependent on this chosen algorithm. For example, the
support vector machines (SVM) can be considered as an embedded approach, as they choose the
support vectors among the training set (in which case, basis functions are selected rather than
variables by themselves). LASSO (Tibshirani, 1996b) is a learning algorithm whose ℓ1
penalty term drives the weights of some of the variables to zero. Related to LASSO, some
other methods use a similar penalty, such as LARS (Efron et al., 2004a) or the elastic net (Zou and Hastie,
2003). Regression trees⁵ (Breiman et al., 1983) also have internal mechanisms for variable selection.
Another approach, which we might rather consider as a model selection approach (in the sense
that it selects a hypothesis space, which is actually identical to variable selection when the model
is linear), is called complexity regularization. It consists in adding a penalty term to the empirical
risk to be minimized, this penalty term being dependent on the input data (Barron, 1988; Bartlett et al.,
2002; Wegkamp, 2003; Lugosi and Wegkamp, 2004). The rationale is to get Vapnik-Chervonenkis
5 See chapter 20

type bounds (see Chapter 5) that depend on the considered data, therefore allowing
a better estimation of the risk for selecting a model, with theoretical guarantees. This type of
approach is complex and mathematically involved and will not be detailed further.

6.2.2 Principal Component Analysis (PCA)


Problem statement: Finding an affine subspace minimizing the reconstruction error

Suppose we are given N input vectors {x_0, · · · , x_i, · · · , x_{N−1}} ∈ R^d. We want to summarize these
d-dimensional vectors by only r dimensions with r < d (but typically r ≪ d), each of these dimen-
sions being a linear combination of the original dimensions. We consider a slightly more
restrictive situation where we want the dimensions to be orthogonal to each other. This problem is known
as the Principal Component Analysis problem and was introduced by Pearson (1901). It can be
defined⁶ as finding an affine transformation of the data so that the error we make by reconstructing
the data from their projections is minimized. As an example, consider the case depicted in Fig. 6.2,
where we have data in d = 2 dimensions and we are trying to summarize these data by a single
number, i.e. we are looking for a line on which to project the data. It is as if the line we
are looking for were connected to the data points with little springs. Each spring pulls the line so that it
ultimately reaches an equilibrium. For this one-dimensional case, the line is defined by its origin
w_0 and direction w_1.

[Figure: a 2D scatter of data points, the fitted line defined by its origin w_0 and direction w_1, and
dashed segments connecting each point to its orthogonal projection on the line.]

Figure 6.2: Given a set of data points, we seek a line on which to project the data so that the
sum of the norms of the vectors (red dashed) connecting the data points to their projection is
minimized. This is the problem we solve when we are looking for the first principal component of
the data points.

Formally, the problem can be defined as:

    min_{{w_0,··· ,w_j,··· ,w_r}⊂R^d} Σ_{i=0}^{N−1} ‖ x_i − w_0 − Σ_{j=1}^r (w_j^T (x_i − w_0)) w_j ‖²_2   subject to w_i^T w_j = δ_{i,j}   (6.1)

Let us denote by W ∈ M_{d,r}(R) the matrix whose columns are the vectors {w_1, · · · , w_i, · · · , w_r}:

    W = (w_1 | · · · | w_i | · · · | w_r)

⁶ There exist other equivalent ways to derive the principal components, such as finding projection axes maximizing
the variance of the projected data. We come back to this equivalence later in this section.

The constrained optimization problem (6.1) can be written in matrix form as:

    min_{w_0∈R^d, W∈M_{d,r}(R)} Σ_{i=0}^{N−1} ‖ x_i − w_0 − W W^T (x_i − w_0) ‖²_2   subject to W^T W = I_r

    min_{w_0∈R^d, W∈M_{d,r}(R)} Σ_{i=0}^{N−1} ‖ (I_d − W W^T)(x_i − w_0) ‖²_2   subject to W^T W = I_r   (6.2)

The above matrix form could have been written directly if we have in mind that, given that {w_1, · · · , w_j, · · · , w_r}
are orthogonal unit vectors, W W^T is the matrix of the orthogonal projection on the r-dimensional sub-
space generated by {w_1, · · · , w_j, · · · , w_r}. The norm of the residual x − x̂, where x̂ = W W^T x is
the projection of x on the subspace generated by {w_1, · · · , w_j, · · · , w_r}, is ‖(I_d − W W^T) x‖_2.
Before expanding the expression inside (6.2) a little, we first note that:

    (I_d − W W^T)^T = (I_d − W W^T)

Second, since W^T W = I_r, we have:

    (W W^T)² = W W^T W W^T = W W^T

The matrix W W^T is therefore idempotent. For any idempotent matrix M, we also know that
I − M is idempotent, as (I − M)² = I − 2M + M² = I − M. Putting all the previous steps together,
we end up with:

    (I_d − W W^T)^T (I_d − W W^T) = (I_d − W W^T)

We are now ready to expand eq. (6.2):

    Σ_{i=0}^{N−1} ‖ (I_d − W W^T)(x_i − w_0) ‖²_2 = Σ_{i=0}^{N−1} (x_i − w_0)^T (I_d − W W^T)^T (I_d − W W^T)(x_i − w_0)
                                                  = Σ_{i=0}^{N−1} (x_i − w_0)^T (I_d − W W^T)(x_i − w_0)

Deriving w_0

We now solve eq. (6.2) with respect to w_0 by computing the derivative with respect to w_0 and
setting it to zero⁷:

    d/dw_0 Σ_{i=0}^{N−1} ‖ (I_d − W W^T)(x_i − w_0) ‖²_2 = −2 Σ_{i=0}^{N−1} (I_d − W W^T)(x_i − w_0)
                                                         = −2 (I_d − W W^T) Σ_{i=0}^{N−1} (x_i − w_0)

    d/dw_0 Σ_{i=0}^{N−1} ‖ (I_d − W W^T)(x_i − w_0) ‖²_2 = 0 ⇔ (I_d − W W^T) Σ_{i=0}^{N−1} (x_i − w_0) = 0
                                                           ⇔ (I_d − W W^T) ( (1/N) Σ_{i=0}^{N−1} x_i − w_0 ) = 0

The vectors u satisfying (Id − W.WT ).u = 0 are the vectors belonging to the subspace generated
by the column vectors of W, i.e. by the vectors {w1 , · · · , wj , · · · , wr }. This actually makes
sense if one thinks of how an affine subspace is defined : the origin w0 of the affine subspace
can be translated by any linear combination of the vectors {w1 , · · · , wj , · · · , wr } and we still
⁷ For computing the derivative, we note that for any vector x, matrix A, and vector functions u(x), v(x),
d(u^T A v)/dx = (du/dx) A v + (dv/dx) A^T u.

get the same affine subspace. For the 1D example of Fig. 6.2, this means translating w_0 along
w_1. Finally, we note that the value of the objective in (6.2) is the same for any
w_0 = (1/N) Σ_{k=0}^{N−1} x_k + h, h ∈ span{w_1, · · · , w_j, · · · , w_r}:

    ∀h ∈ span{w_1, · · · , w_j, · · · , w_r},  (I_d − W W^T) h = 0

    ⇒ ∀h ∈ span{w_1, · · · , w_j, · · · , w_r},
      (I_d − W W^T)(x_i − (1/N) Σ_{k=0}^{N−1} x_k − h) = (I_d − W W^T)(x_i − (1/N) Σ_{k=0}^{N−1} x_k)

We can also note that the optimization problem defined by eq. (6.2) is the same whatever w_0, as
soon as w_0 = (1/N) Σ_{k=0}^{N−1} x_k + h, h ∈ span{w_1, · · · , w_j, · · · , w_r}. So let us take w_0 = (1/N) Σ_{k=0}^{N−1} x_k = x̄,
which means that the data get centered by the sample mean of {x_0, · · · , x_i, · · · , x_{N−1}} before being
projected. If we look back at the example drawn in Fig. 6.2, we clearly see that we may have moved
w_0 along w_1 without changing the line on which the data are projected. In general, the fact that
w_0 is defined up to an h ∈ span{w_1, · · · , w_j, · · · , w_r} simply means that the origin of the hyperplane
span{w_1, · · · , w_j, · · · , w_r} can be defined up to a translation within this hyperplane.

Deriving the principal component vectors {w_1, · · · , w_j, · · · , w_r}

We rewrite the optimization problem eq. (6.2) as:

    min_{W∈M_{d,r}(R)} Σ_{i=0}^{N−1} ‖ (I_d − W W^T)(x_i − x̄) ‖²_2   subject to W^T W = I_r   (6.3)

For simplicity, let us denote x̃_i = x_i − x̄ and let us expand the inner term of eq. (6.3) a little:

    ‖ (I_d − W W^T) x̃_i ‖²_2 = x̃_i^T (I_d − W W^T) x̃_i = x̃_i^T x̃_i − x̃_i^T W W^T x̃_i
Also, one may notice the following :


N
X −1 N
X −1 N
X −1 X
r r N
X X −1 r
X
x̃i T WWT x̃i = (WT x̃i )T (WT x̃i ) = (wjT x̃i )2 = (x̃i T wj )2 = wjT X̃X̃T wj
i=0 i=0 i=0 j=1 j=1 i=0 j=1

where X̃ = (x̃1 | · · · |x̃N −1 ), i.e. the matrix with columns x̃i . This matrix is the so-called sam-
ple covariance matrix. Therefore the minimization problem (6.3) is equivalent to the following
maximization problem :
r
X
max wjT X̃X̃T wj subject to WT W = Ir (6.4)
W∈Md,r (R)
j=1

Now, to begin simply and build our intuition, we shall have a look at the solution when we are
looking for only one principal component vector. In this case, the optimization problem reads:

    max_{w_1∈R^d} w_1^T X̃ X̃^T w_1   subject to w_1^T w_1 = 1

To solve this constrained optimization problem, we introduce the associated Lagrangian:

    L(w_1, λ_1) = w_1^T X̃ X̃^T w_1 + λ_1 (1 − w_1^T w_1)

We now compute the gradients of the Lagrangian with respect to w_1 and λ_1:

    ∂L/∂λ_1 = 1 − w_1^T w_1
    ∂L/∂w_1 = 2 X̃ X̃^T w_1 − 2 λ_1 w_1

Looking for the critical points, i.e. where the gradient vanishes, we get:

    ∂L/∂λ_1 = 0 ⇒ w_1^T w_1 = 1
    ∂L/∂w_1 = 0 ⇒ X̃ X̃^T w_1 = λ_1 w_1

We therefore conclude that w_1 is an eigenvector of the sample covariance matrix X̃ X̃^T with
corresponding eigenvalue λ_1. Now, the question is which eigenvector should we consider. Well,
to answer that question, we just need to look at what it means for w_1 to be an eigenvector in the
term we wish to maximize (6.4):

    w_1^T X̃ X̃^T w_1 = w_1^T (λ_1 w_1) = λ_1 w_1^T w_1 = λ_1

which is maximized for the largest eigenvalue of the sample covariance matrix. So, to conclude
this first part:

    The first principal component vector is a normalized eigenvector associated with the
    largest eigenvalue of the sample covariance matrix X̃ X̃^T.

Now let us have a look at the problem of finding two components. The problem to be solved
is:

    max_{w_1,w_2∈R^d} w_1^T X̃ X̃^T w_1 + w_2^T X̃ X̃^T w_2   subject to w_i^T w_j = δ_{i,j}

One first thing we can note is that the sum of the variances of the projections can be written in a
slightly different way, taking into account the fact that w_1 and w_2 are orthogonal, i.e. w_1^T w_2 = 0.
The matrix of residuals of the orthogonal projection of the data X̃ onto the vector w_1 is (I_d − w_1 w_1^T) X̃, and:

    w_2^T ((I_d − w_1 w_1^T) X̃)((I_d − w_1 w_1^T) X̃)^T w_2 = w_2^T (I_d − w_1 w_1^T) X̃ X̃^T (I_d − w_1 w_1^T) w_2

    (I_d − w_1 w_1^T) w_2 = w_2 − (w_1 w_1^T) w_2
                          = w_2 − w_1 (w_1^T w_2)
                          = w_2

    ⇒ w_2^T ((I_d − w_1 w_1^T) X̃)((I_d − w_1 w_1^T) X̃)^T w_2 = w_2^T X̃ X̃^T w_2

Therefore the variance of the data projected on w_2 equals the variance of the residuals (after pro-
jection on w_1) projected on w_2. We can then proceed iteratively, and this greedy algorithm leads
to selecting the normalized eigenvectors associated with the largest eigenvalues of the sample
covariance matrix (we justify the correctness of this greedy procedure below).

    If the matrix X̃ X̃^T has r distinct eigenvalues, our solution {w_1, · · · , w_k, · · · , w_r}
    consists of the r normalized eigenvectors associated with the largest eigenvalues of
    X̃ X̃^T.

The solution to the PCA is not unique. Indeed, if w_1 is an eigenvector of M, then −w_1 is
also an eigenvector of M. In terms of the principal components, this means that they are at least
defined up to a sign. For example, in illustration 6.2, we drew w_1 but we could have considered
−w_1 as well. Also, if the matrix X̃ X̃^T has eigenvalues with a multiplicity larger than 1, since
the eigenvectors of an eigenvalue λ of multiplicity k_λ of a positive semi-definite symmetric matrix M
span a subspace of dimension k_λ, any orthonormal basis of this subspace provides valid elements of the solution
to the PCA problem.
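A minimal sketch of the procedure derived above (our own NumPy example, not the reference implementation, on synthetic correlated data): center the data, compute X̃X̃^T, keep the eigenvectors associated with the r largest eigenvalues as the columns of W, and project.

    # PCA via eigendecomposition of the sample covariance matrix X_tilde X_tilde^T.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))    # (N, d) correlated data
    r = 2

    x_bar = X.mean(axis=0)
    X_tilde = (X - x_bar).T                                    # (d, N), columns are x_i - x_bar
    eigvals, eigvecs = np.linalg.eigh(X_tilde @ X_tilde.T)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]
    W = eigvecs[:, order[:r]]                                  # d x r, top-r eigenvectors

    X_reduced = (X - x_bar) @ W                                # (N, r) coordinates W^T (x_i - x_bar)
    explained = eigvals[order[:r]].sum() / eigvals.sum()
    print("fraction of sample variance kept:", explained)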

Is the greedy algorithm optimal ?


There is a tricky step in the previous section, because we still need to justify that the greedy al-
gorithm above actually leads to an optimal solution. In the context of our optimization problem,
we indeed used a greedy algorithm where we begin by finding the first vector on which to
project the data and then proceed by looking for the second vector on which to project the residuals,
and so on. In general, there is no reason for a greedy algorithm to find the optimal solution to an
optimization problem, but in our case it does. Indeed, we shall demonstrate that,
whatever the set of projection vectors, eigenvectors associated with the largest eigenvalues of the sample
covariance matrix are solutions to the maximization problem given by equation (6.4).

Theorem 6.1. For any symmetric positive semi-definite matrix M ∈ M_{d,d}(R), denote {λ_i}_{i=1}^d
its eigenvalues, with λ_1 ≥ λ_2 ≥ · · · ≥ λ_d ≥ 0. For any set of r ∈ [|1, d|] orthogonal unit vectors
{v_1, · · · , v_j, · · · , v_r}, we have:

    Σ_{j=1}^r v_j^T M v_j ≤ Σ_{j=1}^r λ_j

and this upper bound is reached by eigenvectors associated with the r largest eigenvalues of M.

Proof. Suppose we have r orthogonal unit vectors {v1, · · · , vj, · · · , vr} on which we project the
data. We want to compute the maximal value of \(\sum_{j=1}^{r} v_j^T M v_j\).

Given the matrix M is real symmetric, there exists a basis of Rd of eigenvectors. Let us
denote this basis {e1, · · · , ei, · · · , ed} and the associated eigenvalues {λ1, · · · , λi, · · · , λd}. Denote
βi,j = eiT vj the coordinates of vj in this basis:

\[
\forall j \in [|1, r|], \quad v_j = \sum_{i=1}^{d} \beta_{i,j} e_i
\]

From which it follows:

\[
\begin{aligned}
\sum_{j=1}^{r} v_j^T M v_j
&= \sum_{j=1}^{r} \Big(\sum_{i=1}^{d} \beta_{i,j} e_i\Big)^T M \Big(\sum_{i=1}^{d} \beta_{i,j} e_i\Big)\\
&= \sum_{j=1}^{r} \Big(\sum_{i=1}^{d} \beta_{i,j} e_i\Big)^T \Big(\sum_{i=1}^{d} \beta_{i,j} M e_i\Big)
 = \sum_{j=1}^{r} \Big(\sum_{i=1}^{d} \beta_{i,j} e_i\Big)^T \Big(\sum_{i=1}^{d} \beta_{i,j} \lambda_i e_i\Big)\\
&= \sum_{j=1}^{r} \sum_{i=1}^{d} \lambda_i \beta_{i,j}^2
 = \sum_{i=1}^{d} \lambda_i \sum_{j=1}^{r} \beta_{i,j}^2
\qquad (6.5)
\end{aligned}
\]
The next question is about upper bounding \(\sum_{j} \beta_{i,j}^2\). To do so, we extend the set of orthogonal
unit vectors {v1, · · · , vj, · · · , vr} with d − r vectors uj to form a basis {v1, · · · , vr, u1, · · · , ud−r}
of Rd. We can express the eigenvectors ei in this basis as:

\[
\forall i \in [|1, d|], \quad e_i = \sum_{j=1}^{r} \beta_{i,j} v_j + \sum_{j=1}^{d-r} (e_i^T u_j)\, u_j
\]

The eigenvectors are unit vectors, from which it follows:

\[
\forall i, \quad 1 = |e_i|^2 = \sum_{j=1}^{r} \beta_{i,j}^2 + \sum_{j=1}^{d-r} (e_i^T u_j)^2
\]

This leads to the inequality \(\forall i, \; \sum_{j=1}^{r} \beta_{i,j}^2 \leq 1\). Moreover, the vj being orthonormal,
\(\sum_{i=1}^{d} \sum_{j=1}^{r} \beta_{i,j}^2 = \sum_{j=1}^{r} |v_j|^2 = r\). Injecting this in equation (6.5), the quantity
\(\sum_{i=1}^{d} \lambda_i \sum_{j=1}^{r} \beta_{i,j}^2\) is a weighted sum of the eigenvalues whose weights lie in [0, 1] and
sum to r; it is therefore at most the sum of the r largest eigenvalues:

\[
\sum_{j=1}^{r} v_j^T M v_j = \sum_{i=1}^{d} \lambda_i \sum_{j=1}^{r} \beta_{i,j}^2 \leq \sum_{j=1}^{r} \lambda_j
\]

If the projection vectors are eigenvectors {w1, · · · , wj, · · · , wr} associated with the r largest
eigenvalues {λ1, · · · , λj, · · · , λr}, we have:

\[
\sum_{j=1}^{r} w_j^T M w_j = \sum_{j=1}^{r} \lambda_j w_j^T w_j = \sum_{j=1}^{r} \lambda_j
\]

which shows that the bound is reached and concludes the proof.

It remains to apply this theorem to the symmetric positive semi-definite matrix X̃X̃T
to conclude that eigenvectors associated with the largest eigenvalues are indeed a solution to our
optimization problem.

What is the fraction of sample variance that we keep ?


If we keep only r eigenvalues, we may wonder how much of the sample variance we keep. The
sample variance of the data points is the sum of the diagonal elements of the sample covariance
matrix (i.e. its trace). Also, given the sample covariance matrix is real symmetric, its trace equals
the sum of its eigenvalues. Indeed, for any matrix M and orthogonal matrix P, we have:

\[
\operatorname{Tr}(P^{-1} M P) = \sum_{i,k,l} P^{-1}_{i,l} M_{l,k} P_{k,i} = \sum_{k,l} (P P^{-1})_{k,l} M_{l,k} = \sum_{k,l} \delta_{k,l} M_{l,k} = \sum_{k} M_{k,k} = \operatorname{Tr}(M)
\]

and therefore, denoting σ the sample variance of the data:

\[
\sigma = \operatorname{Tr}(\tilde{X}\tilde{X}^T) = \sum_{i} \lambda_i
\]

Finally, considering only r eigenvalues, the sample variance of the projected data equals the
fraction \(\sum_{i=0}^{r-1} \lambda_i / \sum_{i} \lambda_i\) of the original sample variance.

Using the Singular Value Decomposition to compute the PCA


In order to compute the eigenvectors of X̃X̃T, i.e. the vectors {w1, · · · , wj, · · · , wr}, there is no
need to compute the matrix product and perform the eigenvalue decomposition of the product:
one can rather make use of a decomposition of the matrix X̃ such as the Singular Value
Decomposition. As we shall see, the Singular Value Decomposition also allows to directly compute
the principal components of our data, without having to explicitly project each sample on the
principal component vectors. Indeed, we know that there exist an orthogonal matrix U ∈ Md,d(R)
(UUT = UT U = Id), a diagonal matrix D ∈ Md,N(R) with non-negative real numbers on the
diagonal and an orthogonal matrix V ∈ MN,N(R) (VVT = VT V = IN) such that:

\[
\tilde{X} = U D V^T
\quad \Rightarrow \quad
\tilde{X}\tilde{X}^T = U D V^T V D^T U^T = U D D^T U^T = U D D^T U^{-1}
\]

We recognize the diagonalization of X̃X̃T, the eigenvectors of X̃X̃T being the column vectors
of U. In the singular value decomposition, we suppose that the diagonal elements of D are
ordered by decreasing magnitude, and implementations usually behave that way. Therefore, the
vectors {w1, · · · , wj, · · · , wr} we are looking for are the first r column vectors of U: W =
[U1 | U2 | · · · | Ur]. The principal components are the projections of the data over the axes
{w1, · · · , wj, · · · , wr} = {u1, · · · , uj, · · · , ur}, i.e. WT X̃. Let us compute the projection of our
samples over the principal component vectors:

\[
U^T \tilde{X} = U^T U D V^T = D V^T
\]

with DVT ∈ Md,N(R). The principal components are therefore the first r rows of DVT.

Algorithm 9 gives all the steps for performing a SVD-based PCA. An example of application
of this algorithm on artificial data is shown on Fig. 6.3a, and the result of applying PCA to 5000
samples from the handwritten MNIST dataset is shown on Fig. 6.3b. From the example over the
MNIST dataset, one can observe that the linear projection reveals some isolated clusters (for the
digits 0 and 1) but others remain interleaved. Do not be misled, PCA is an unsupervised algorithm
as it does not take into account the labels for finding the projections. The labels are added to the
figures after the PCA is performed, and it turns out that some classes appear isolated.
Using only two principal components, the PCA captures 4.48% of the variance, computed as
\((\lambda_0 + \lambda_1) / \sum_{i} \lambda_i\),
where the λi are the eigenvalues of the sample covariance matrix, ordered by decreasing magnitude.

If the centered data are stacked as the columns of X̃, then, from the SVD of X̃ = UDVT
(the singular values being ordered by decreasing magnitude on the diagonal of D), one
finds the r first principal components as the r first rows of DVT and the projection
vectors as the r first columns of U.

Algorithm 9 Algorithm for computing r principal components.


Input: {x0, · · · , xi, · · · , xN−1} // The original data in Rd
Output: {y0, · · · , yi, · · · , yN−1} // The projected data in Rr
1: Compute the mean of the data \(\bar{x} = \frac{1}{N}\sum_{i=0}^{N-1} x_i\)
2: Center the data ∀i, x̃i = xi − x̄
3: Stack the centered data in the columns of X̃ ∈ Md,N(R)
4: Compute the SVD of X̃ = UDVT
5: The principal components are the first r rows of DVT: [y0 | · · · | yN−1] = (DVT)[: r, :]
6: The projection vectors are the first r column vectors of U.
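As an illustration, here is a minimal NumPy sketch of Algorithm 9; the function name and the toy data are ours and only serve as an example.

import numpy as np

def svd_pca(X, r):
    # SVD-based PCA sketch: X holds one sample per column (d x N),
    # r is the number of principal components to keep.
    mean = X.mean(axis=1, keepdims=True)            # step 1: mean of the data
    X_tilde = X - mean                              # step 2: center the data
    # step 4: SVD of the centered data (singular values sorted by decreasing magnitude)
    U, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)
    components = (np.diag(s) @ Vt)[:r, :]           # step 5: first r rows of D V^T
    W = U[:, :r]                                    # step 6: first r columns of U
    return components, W, mean

# Toy usage: 200 samples in R^5, projected to R^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))
Y, W, mean = svd_pca(X, r=2)
print(Y.shape)                                      # (2, 200)
print(np.allclose(Y, W.T @ (X - mean)))             # the components are indeed W^T X_tilde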

Finding an affine subspace maximizing the variance of the projections


There is a common alternative for deriving the principal components that leads to an algorithm
equivalent to the one presented in the previous section. This alternative consists in finding
axes on which to project the data so that the variance of the projections is maximized. The
optimization problem is therefore introduced in a slightly different way than when we optimized
the reconstruction error. To simplify the presentation, we here suppose that the data are centered.
The optimization problem can then be written as:

\[
\max_{\{w_1, \cdots, w_j, \cdots, w_r\} \in \mathbb{R}^d} \; \sum_{j=1}^{r} \frac{1}{N-1} \sum_{i=0}^{N-1} (w_j^T \tilde{x}_i)^2
\quad \text{subject to} \quad w_i^T w_j = \delta_{i,j}
\]

We can rewrite the inner term a little:

\[
\begin{aligned}
\sum_{j=1}^{r} \frac{1}{N-1} \sum_{i=0}^{N-1} (w_j^T \tilde{x}_i)^2
&= \sum_{j} \frac{1}{N-1} \sum_{i} w_j^T \tilde{x}_i \tilde{x}_i^T w_j
 = \sum_{j} w_j^T \Big(\frac{1}{N-1} \sum_{i} \tilde{x}_i \tilde{x}_i^T\Big) w_j\\
&= \sum_{j} \frac{1}{N-1} w_j^T \tilde{X}\tilde{X}^T w_j\\
&= \frac{1}{N-1} \sum_{i=0}^{N-1} \tilde{x}_i^T W W^T \tilde{x}_i
\end{aligned}
\]

[Figure 6.3: (a) PCA of a set of N = 72 datapoints in d = 2 dimensions. On the left, the original
datapoints with their mean shown as a red dot. The data are centered and stacked in the rows of X̃;
from the SVD X̃ = UDVT, the two columns of V are extracted and shown as the green and blue
arrows, and the projection of the datapoints on these two principal vectors is shown on the right.
(b) PCA applied to 1000 samples from the MNIST handwritten digits dataset: each colored point
represents one 28 × 28 image, the color indicating the associated label. Do not be misled, the labels
are not used by PCA, which is an unsupervised technique; they are only added after applying the
PCA, to get an idea of how the digits are clustered. The two first components computed from 5000
samples actually capture only 4.48% of the variance. (c) The 10 first principal vectors when applying
PCA to 5000 samples of the MNIST handwritten digits dataset; all 28 × 28 vectors are normalized
and saturated in the range [−0.10, 0.10]. One can better appreciate, for example, why the 0 and 1
digits have, respectively, a negative and positive first component.]

We first recognize the sample covariance matrix of the data:

\[
\Sigma = \frac{1}{N-1} \sum_{i=0}^{N-1} \tilde{x}_i \tilde{x}_i^T = \frac{1}{N-1} \sum_{i=0}^{N-1} (x_i - \bar{x})(x_i - \bar{x})^T
\]

and also the same optimization problem as (6.4), which we found when defining the PCA as
minimizing the reconstruction error. We can then conclude that the optimal projection vectors
are the eigenvectors of the sample covariance matrix Σ = (1/(N−1)) X̃X̃T associated with the
largest r eigenvalues.

6.2.3 Relationship between covariance, Gram and Euclidean distance matrices
Sample covariance matrix. The sample covariance matrix of a set of vectors {x0, · · · , xi, · · · , xN−1} ∈
Rd is the d × d matrix Σ defined as:

\[
\Sigma = \frac{1}{N-1} \sum_{i=0}^{N-1} (x_i - \bar{x})(x_i - \bar{x})^T = \frac{1}{N-1} X X^T,
\qquad
\bar{x} = \frac{1}{N} \sum_{i=0}^{N-1} x_i
\]

where X = [x0 − x̄ | x1 − x̄ | · · · | xN−1 − x̄], i.e. the vectors xi − x̄ are the column vectors of X.
The sample covariance matrix is symmetric: ΣT = Σ. The eigenvalues of any sample covariance
matrix are non-negative. For any eigenvalue-eigenvector pair λ, v of Σ, we have:

\[
\lambda v = \Sigma v = \frac{1}{N-1} X X^T v
\quad \Rightarrow \quad
\lambda |v|^2 = \lambda v^T v = \frac{1}{N-1} v^T X X^T v = \frac{1}{N-1} (X^T v)^T (X^T v) = \frac{1}{N-1} |X^T v|^2
\]

from which it follows λ ≥ 0.

Gram matrix. The Gram matrix of a set of vectors {x0, · · · , xi, · · · , xN−1} ∈ Rd is the N × N
matrix G containing the scalar products of the vectors xi, namely Gij = xiT xj:

\[
G = \begin{pmatrix}
x_0^T x_0 & x_0^T x_1 & \cdots & x_0^T x_{N-1}\\
\vdots & \vdots & \ddots & \vdots\\
x_{N-1}^T x_0 & x_{N-1}^T x_1 & \cdots & x_{N-1}^T x_{N-1}
\end{pmatrix} = X^T X
\]

where X = [x0 | x1 | · · · | xN−1], i.e. the vectors xi are the column vectors of X. The Gram matrix is
symmetric: GT = G.

Euclidean distance matrix. The Euclidean distance matrix of a set of vectors {x0, · · · , xi, · · · , xN−1} ∈
Rd is the N × N matrix D whose elements are the squared Euclidean distances between the vectors
xi, namely Dij = |xi − xj|²:

\[
D = \begin{pmatrix}
0 & |x_0 - x_1|^2 & \cdots & |x_0 - x_{N-1}|^2\\
\vdots & \vdots & \ddots & \vdots\\
|x_{N-1} - x_0|^2 & |x_{N-1} - x_1|^2 & \cdots & 0
\end{pmatrix}
\]

The Euclidean distance matrix D is symmetric and has zeros on the main diagonal: DT = D,
∀i, Dii = 0.

Relationship between Euclidean distance and Gram matrices. Let us now detail how the
covariance, Gram and Euclidean distance matrices are related. The Euclidean distances can be
expressed from scalar products only:

\[
\forall i, j, \quad |x_i - x_j|^2 = (x_i - x_j)^T (x_i - x_j) = x_i^T x_i + x_j^T x_j - 2 x_i^T x_j
\]

Therefore, the Euclidean distance matrix and the Gram matrix are related by:

\[
\forall i, j, \quad D_{ij} = G_{ii} + G_{jj} - 2 G_{ij}
\]

Actually, the Euclidean distance matrix can be built from the Gram matrix but the converse is not
true: the same Euclidean distance matrix can be obtained from different configurations of the input
vectors, and hence from different Gram matrices. For example:

\[
X = \begin{pmatrix} 1 & \frac{1}{2}\\ 0 & \frac{\sqrt{3}}{2} \end{pmatrix},
\quad D = \begin{pmatrix} 0 & 1\\ 1 & 0 \end{pmatrix},
\quad G = \begin{pmatrix} 1 & \frac{1}{2}\\ \frac{1}{2} & 1 \end{pmatrix}
\]
\[
X = \begin{pmatrix} \sqrt{2} & \frac{3}{2\sqrt{2}}\\[2pt] 0 & \sqrt{\frac{7}{8}} \end{pmatrix},
\quad D = \begin{pmatrix} 0 & 1\\ 1 & 0 \end{pmatrix},
\quad G = \begin{pmatrix} 2 & \frac{3}{2}\\ \frac{3}{2} & 2 \end{pmatrix}
\]

More generally, the distance matrix is invariant to translation while the Gram matrix is not.
Indeed:

\[
\forall x, y, c, \quad |(x + c) - (y + c)| = |x - y|,
\qquad
(x + c)^T (y + c) = x^T y + c^T y + x^T c + c^T c
\]

It is actually sufficient to define an origin for a set of vectors so that their Gram matrix can
be deduced from the distance matrix. We can indeed recover the Gram matrix of the vectors
\(x_i - \frac{1}{N}\sum_j x_j\) from the distance matrix, and one can show that it is given by:

\[
G = -\frac{1}{2} H D H
\]

where H is a so-called centering matrix defined by:

\[
H = I - \frac{1}{N} e e^T
\]

with e a vector full of 1, i.e. \(H_{i,j} = \delta_{i,j} - \frac{1}{N}\). The above transformation leading from D to G is
called double centering.
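The double centering formula can be checked numerically; in the small NumPy sketch below (the variable names are ours), we build the squared distance matrix D of a few random points and verify that −HDH/2 equals the Gram matrix of the centered points.

import numpy as np

rng = np.random.default_rng(0)
N, d = 6, 3
X = rng.normal(size=(d, N))                          # samples as columns

# Squared Euclidean distance matrix and Gram matrix of the centered points.
sq_norms = (X ** 2).sum(axis=0)
D = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X.T @ X
Xc = X - X.mean(axis=1, keepdims=True)
G_centered = Xc.T @ Xc

# Double centering: G = -1/2 H D H with H = I - (1/N) e e^T.
H = np.eye(N) - np.ones((N, N)) / N
print(np.allclose(-0.5 * H @ D @ H, G_centered))     # True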

Relationship between the covariance and Gram matrices. In order to stress an interesting
relationship between the covariance and Gram matrices, which will be used in the next section for
extending PCA to kernel PCA, we need a bit of linear algebra.
Lemma 6.1. ∀A ∈ Rn×m , ker (A) = ker (AT A)
Proof. Let us consider A ∈ Rn×m . It is clear that

∀x ∈ Rm , Ax = 0 ⇒ AT Ax = 0

Therefore ker (A) ⊆ ker (AT A).


Now take x ∈ Rm , such that x ∈ ker (AT A). Then :

\[
A^T A x = 0 \Rightarrow x^T A^T A x = 0 \Leftrightarrow (Ax)^T Ax = 0 \Leftrightarrow |Ax|^2 = 0 \Leftrightarrow Ax = 0
\]

Therefore ker (AT A) ⊆ ker (A) which ends the proof.



We now recall the rank-nullity theorem, in the context of matrices, although it could have been
stated in a more general way with linear maps.

Theorem 6.2 (Rank-nullity). ∀A ∈ Rn×m, rk(A) + dim(ker(A)) = m.

We can demonstrate a theorem linking the rank of the covariance and Gram matrices.

Theorem 6.3. ∀A ∈ Rn×m, rk(AT A) = rk(AAT) ≤ min(n, m).

Proof. By applying lemma 6.1 and the rank-nullity theorem:

\[
\forall A \in \mathbb{R}^{n \times m}, \quad \operatorname{rk}(A) + \dim(\ker(A)) = m = \operatorname{rk}(A^T A) + \dim(\ker(A^T A))
\;\Rightarrow\; \operatorname{rk}(A) = \operatorname{rk}(A^T A) \text{ and } \operatorname{rk}(A) \leq m
\]

Applying lemma 6.1 and the rank-nullity theorem to the matrix AT:

\[
\forall A \in \mathbb{R}^{n \times m}, \quad \operatorname{rk}(A^T) + \dim(\ker(A^T)) = n = \operatorname{rk}(A A^T) + \dim(\ker(A A^T))
\;\Rightarrow\; \operatorname{rk}(A^T) = \operatorname{rk}(A A^T) \text{ and } \operatorname{rk}(A) \leq n
\]

We also know that the column rank and the row rank are equal, and therefore rk(A) = rk(AT),
which ends the proof.
Given that the sample covariance matrix Σ = (1/(N−1)) XXT has the same rank as (N − 1)Σ = XXT,
and given the Gram matrix G = XT X, applying the previous theorem leads to:

\[
\operatorname{rk}(\Sigma) = \operatorname{rk}(X X^T) = \operatorname{rk}(X^T X) = \operatorname{rk}(G) \leq \min(d, N)
\]

Given that the covariance and Gram matrices have the same rank, they have the same number of
nonzero eigenvalues8. There is actually an even stronger property:
Lemma 6.2 (Eigenvalues of the covariance and gram matrices). The nonzero eigenvalues of the
scaled covariance matrix (N − 1)Σ = XXT and gram matrix G = XT X are the same :

{λ ∈ R∗ , ∃v 6= 0, (N − 1)Σv = λv} = {λ ∈ R∗ , ∃v 6= 0, Gv = λv}

Proof. Consider a nonzero eigenvalue λ ∈ R∗ of XXT . There exists v 6= 0, XXT v = λv. Left-
multiplying by XT gives XT XXT v = λXT v. Denoting w = XT v, we have XT Xw = λw. If
w = XT v = 0, we have λv = XXT v = Xw = 0 and therefore v = 0 as λ 6= 0 which is in
contradiction with our hypothesis. So, necessarily, w 6= 0, and (λ, XT v) is an eigenvalue-eigenvector
pair of XT X = G. We have therefore demonstrated the inclusion

{λ ∈ R∗ , ∃v 6= 0, (N − 1)Σv = λv} ⊆ {λ ∈ R∗ , ∃v 6= 0, Gv = λv}

The same argument applied to XT gives the reverse inclusion, so the two sets are actually equal.
During the demonstration, we also showed that if (λ, v) ∈ R × Rd is an eigenvalue-eigenvector
pair of XXT , then (λ, XT v) ∈ R × RN is an eigenvalue-eigenvector pair of XT X. Conversely if
(λ, w) ∈ R × RN is an eigenvalue-eigenvector pair of XT X, then (λ, Xw) ∈ R × Rd is an eigenvalue-
eigenvector pair of XXT .

6.2.4 Principal component analysis in large dimensional space (N ≪ d)


Suppose we are given N input vectors {x0, · · · , xi, · · · , xN−1} ∈ Rd. Suppose also that the number
of dimensions d is much larger than the number of input vectors N, which we denote d ≫ N. This
is the case for example when we are working with a small dataset of images. As detailed in the
previous section, PCA operates with the following steps:

1. center the input vectors: \(\bar{x} = \frac{1}{N}\sum_{i=0}^{N-1} x_i\), x̃i = xi − x̄
8 Any real symmetric matrix A can be diagonalized as A = ODOT. The rank is the dimension of the space
spanned by the columns or the rows of A, which has the same dimension as the space spanned by the
columns or the rows of D, and therefore equals the number of nonzero eigenvalues of A.

2. stack the centered vectors in the matrix X̃ = [x̃0 |x̃1 | · · · |x̃N −1 ]

3. compute the r normalized eigenvectors vj associated with the r largest eigenvalues of the
matrix X̃X̃T ∈ Rd×d

4. project your data on the r normalized eigenvectors vj

When the number of features d increases, the covariance matrix can become quite large and the
computation of the eigenvectors can be cumbersome. In light of lemma 6.2, it turns out that
the eigenvectors of X̃X̃T can be computed from the eigenvectors of X̃T X̃ ∈ RN×N. In case N ≪ d,
it is much less expensive to compute these eigenvectors. We can therefore state another, equivalent,
way of computing PCA:

1. center the input vectors: \(\bar{x} = \frac{1}{N}\sum_{i=0}^{N-1} x_i\), x̃i = xi − x̄

2. stack the centered vectors in the matrix X̃ = [x̃0 | x̃1 | · · · | x̃N−1]

3. compute the r normalized eigenvectors wj ∈ RN associated with the r largest eigenvalues λj
of the matrix X̃T X̃ ∈ RN×N

4. project your data on the r normalized9 eigenvectors \(\frac{\tilde{X} w_j}{|\tilde{X} w_j|} = \frac{1}{\sqrt{\lambda_j}} \tilde{X} w_j\)

We can reformulate a little bit this procedure by making use only of scalar products, and this
will turn out to be very useful for deriving a non-linear extension of PCA :
1. center the input vectors: \(\bar{x} = \frac{1}{N}\sum_{i=0}^{N-1} x_i\), x̃i = xi − x̄

2. compute the r normalized eigenvectors wj ∈ RN associated with the r largest eigenvalues λj
of the Gram matrix
\[
G = \begin{pmatrix}
\tilde{x}_0 \cdot \tilde{x}_0 & \cdots & \tilde{x}_0 \cdot \tilde{x}_{N-1}\\
\vdots & \ddots & \vdots\\
\tilde{x}_{N-1} \cdot \tilde{x}_0 & \cdots & \tilde{x}_{N-1} \cdot \tilde{x}_{N-1}
\end{pmatrix}
\]

3. project your data on the r normalized eigenvectors \(\frac{\tilde{X} w_j}{|\tilde{X} w_j|} = \frac{1}{\sqrt{\lambda_j}} \tilde{X} w_j\):
\[
\forall v \in \mathbb{R}^d, \quad \Big(\frac{1}{\sqrt{\lambda_j}} \tilde{X} w_j\Big) \cdot v
= \frac{1}{\sqrt{\lambda_j}} w_j \cdot \begin{pmatrix} \tilde{x}_0 \cdot v\\ \vdots\\ \tilde{x}_{N-1} \cdot v \end{pmatrix}
\]
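The equivalence between the two routes can be illustrated with a small NumPy sketch (the names and sizes below are arbitrary): for N ≪ d, it diagonalizes the N × N Gram matrix instead of the d × d matrix X̃X̃T and checks that both give the same projections, up to the sign of each component.

import numpy as np

rng = np.random.default_rng(0)
d, N, r = 500, 20, 3                            # many features, few samples
X = rng.normal(size=(d, N))
Xc = X - X.mean(axis=1, keepdims=True)

# Route 1: eigenvectors of the d x d matrix Xc Xc^T (costly when d is large).
evals1, evecs1 = np.linalg.eigh(Xc @ Xc.T)
V = evecs1[:, np.argsort(evals1)[::-1][:r]]     # top-r eigenvectors in R^d
proj1 = V.T @ Xc

# Route 2: eigenvectors of the N x N Gram matrix Xc^T Xc.
evals2, evecs2 = np.linalg.eigh(Xc.T @ Xc)
order = np.argsort(evals2)[::-1][:r]
W, lam = evecs2[:, order], evals2[order]
V2 = (Xc @ W) / np.sqrt(lam)                    # normalized vectors (1/sqrt(lambda_j)) Xc w_j
proj2 = V2.T @ Xc

print(np.allclose(np.abs(proj1), np.abs(proj2)))   # equal up to the sign of each axis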

6.2.5 Kernel PCA (KPCA)


Given a set of N vectors {x0, · · · , xi, · · · , xN−1} ∈ Rd, suppose we have a, possibly non-linear,
mapping ϕ of our input space into a so-called feature space F:

\[
\varphi : \mathbb{R}^d \to F, \qquad x \mapsto \varphi(x)
\]

Typically, F can be a vector space of larger dimension (even infinite) than the input space, e.g.
F = Rm, m ≫ d. Let us assume for now (but we come back soon to this point) that the transformed
data are centered, i.e.

\[
\frac{1}{N} \sum_{i=0}^{N-1} \varphi(x_i) = 0
\]

Let us now perform a linear PCA on the transformed data {ϕ(x0), · · · , ϕ(xi), · · · , ϕ(xN−1)}.
To do so, we will follow the procedure given in the previous section when working only with scalar
products:

9 \(|\tilde{X} w_j|^2 = (\tilde{X} w_j)^T \tilde{X} w_j = w_j^T \tilde{X}^T \tilde{X} w_j = \lambda_j w_j^T w_j = \lambda_j\)

1. center the input vectors {ϕ(x0), · · · , ϕ(xi), · · · , ϕ(xN−1)}: there is nothing to do since we
consider, for now, that the mapped vectors are centered (at the end of the section, we come
back to the case where the mapped vectors are not centered)

2. compute the r normalized eigenvectors wk ∈ RN associated with the r largest eigenvalues λk
of the Gram matrix G ∈ RN×N:
\[
G = \begin{pmatrix}
\varphi(x_0) \cdot \varphi(x_0) & \varphi(x_0) \cdot \varphi(x_1) & \cdots & \varphi(x_0) \cdot \varphi(x_{N-1})\\
\varphi(x_1) \cdot \varphi(x_0) & \varphi(x_1) \cdot \varphi(x_1) & \cdots & \varphi(x_1) \cdot \varphi(x_{N-1})\\
\vdots & \vdots & \ddots & \vdots\\
\varphi(x_{N-1}) \cdot \varphi(x_0) & \varphi(x_{N-1}) \cdot \varphi(x_1) & \cdots & \varphi(x_{N-1}) \cdot \varphi(x_{N-1})
\end{pmatrix}
\]

3. project your data on the r normalized eigenvectors \(\frac{1}{\sqrt{\lambda_k}} \varphi(X) w_k\), where ϕ(X) denotes
the matrix whose columns are the mapped samples ϕ(xi)

Let us have a look at what the last point, the projection of a vector to get its component,
looks like:

\[
\forall v \in \mathbb{R}^d, \quad \Big(\frac{1}{\sqrt{\lambda_k}} \varphi(X) w_k\Big) \cdot \varphi(v)
= \frac{1}{\sqrt{\lambda_k}} w_k \cdot \begin{pmatrix}
\varphi(x_0) \cdot \varphi(v)\\
\varphi(x_1) \cdot \varphi(v)\\
\vdots\\
\varphi(x_{N-1}) \cdot \varphi(v)
\end{pmatrix}
\]

Therefore, the k-th principal component is computed solely from scalar products between the
vectors mapped into the feature space F. Actually, the only thing we need when performing the
PCA in the feature space is scalar products between vectors in feature space, since the Gram
matrix, from which we extract the eigenvectors, is also computed only from scalar products in the
feature space. We never need to compute explicitly the feature vectors ϕ(v) ∈ F; we only need to
evaluate scalar products between two elements of F. This algorithm is known as the Kernel PCA
(Scholkopf et al., 1999).
The fact that all we need to compute is scalar products in the feature space implies that we
can employ the so-called kernel trick, which implicitly defines the mapping function ϕ from the
definition of a so-called kernel function k. Not every function k is a kernel, as it does not always
correspond to a scalar product in some feature space. However, there are conditions, known
as Mercer's theorem, which determine when a function k is actually a kernel. This is explained
in detail in chapter 10. For our purpose, we just briefly introduce some well-known kernels:

• the polynomial kernel \(k_d(x, x') = (x \cdot x' + c)^d\)

• the Gaussian, or RBF, kernel \(k_{rbf}(x, x') = \exp\left(-\frac{|x - x'|^2}{2\sigma^2}\right)\)

It can be shown that the Gaussian kernel actually projects the data into an infinite-dimensional
space. The reader is referred to chapter 10 for more details on kernels.
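For concreteness, these two kernels and the corresponding Gram matrix can be computed as in the NumPy sketch below (the function names and the values of c, d and σ are arbitrary choices of ours):

import numpy as np

def polynomial_kernel(X, Y, degree=3, c=1.0):
    # k(x, x') = (x.x' + c)^degree, for all pairs of columns of X and Y.
    return (X.T @ Y + c) ** degree

def rbf_kernel(X, Y, sigma=1.0):
    # k(x, x') = exp(-|x - x'|^2 / (2 sigma^2)), for all pairs of columns of X and Y.
    sq_dists = ((X[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 5))                 # 5 samples in R^2, one per column
K = rbf_kernel(X, X)                        # 5 x 5 Gram matrix in feature space
print(K.shape, np.allclose(K, K.T))         # (5, 5) True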

The last point that is not yet solved is : what about feature vectors that are actually not
centered in the feature space ? One can actually work with uncentered feature vectors if we change
the kernel :
\[
\begin{aligned}
\forall i, j, \quad
&\Big(\varphi(x_i) - \frac{1}{N}\sum_{p=0}^{N-1} \varphi(x_p)\Big) \cdot \Big(\varphi(x_j) - \frac{1}{N}\sum_{p=0}^{N-1} \varphi(x_p)\Big)\\
&= \varphi(x_i) \cdot \varphi(x_j) - \frac{1}{N}\sum_{p=0}^{N-1} \varphi(x_i) \cdot \varphi(x_p) - \frac{1}{N}\sum_{p=0}^{N-1} \varphi(x_j) \cdot \varphi(x_p) + \frac{1}{N^2}\sum_{p=0}^{N-1}\sum_{t=0}^{N-1} \varphi(x_p) \cdot \varphi(x_t)\\
&= k(x_i, x_j) - \frac{1}{N}\sum_{p=0}^{N-1} \big(k(x_i, x_p) + k(x_p, x_j)\big) + \frac{1}{N^2}\sum_{p=0}^{N-1}\sum_{t=0}^{N-1} k(x_p, x_t)
\end{aligned}
\]

Therefore we can introduce the kernel k̃ which takes as input the input vectors {x0, · · · , xi, · · · , xN−1} ∈
Rd and computes the scalar product between the centered feature vectors. It can be shown (Scholkopf
et al., 1999) that the associated Gram matrix K̃ is given by:

\[
\tilde{K} = \Big(I_N - \frac{1}{N}\mathbb{1}\Big) K \Big(I_N - \frac{1}{N}\mathbb{1}\Big)
\]

where IN is the identity matrix and 𝟙 is the square N × N matrix with all entries set to 1.
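Putting the pieces together, kernel PCA can be sketched in a few lines of NumPy (this is only an illustration with an RBF kernel; the function name and parameter values are ours):

import numpy as np

def kernel_pca(X, r, sigma=1.0):
    # Kernel PCA sketch: X holds one sample per column, r components are returned.
    N = X.shape[1]
    sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    K = np.exp(-sq / (2.0 * sigma ** 2))               # Gram matrix in feature space (RBF kernel)
    C = np.eye(N) - np.ones((N, N)) / N
    K_tilde = C @ K @ C                                 # centered Gram matrix
    evals, evecs = np.linalg.eigh(K_tilde)
    order = np.argsort(evals)[::-1][:r]
    lam, W = evals[order], evecs[:, order]
    # k-th component of sample i: (1/sqrt(lambda_k)) sum_j W[j, k] K_tilde[j, i]
    return (K_tilde @ W / np.sqrt(lam)).T

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 50))
Y = kernel_pca(X, r=2, sigma=1.0)
print(Y.shape)                                          # (2, 50)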
Let us now apply K-PCA to the MNIST dataset. Figure 6.4 illustrates the two first principal
components of 5000 digits from the MNIST dataset using a RBF kernel with σ = 4.8,
which corresponds to the mean Euclidean distance in the image space between the considered digits
and their closest neighbor. This non-linear projection captures 7% of the variability of the original
data, compared to the linear PCA (fig 6.3b) which captured only around 5%. At the time of writing
this section, it is not completely clear to what extent the kernel PCA brings any improvement
over the PCA when applied to the MNIST dataset. However, there are some datasets for which
k-PCA appears superior to PCA in feature extraction, and the reader is referred for example to
(Scholkopf et al., 1999), in which k-PCA and linear PCA are compared in the context of extracting
features feeding a classifier, and where it is shown that extracting features with k-PCA leads to
better classification performances.


Figure 6.4: The Kernel PCA applied to 5000 samples from the MNIST handwritten digits dataset
using a RBF kernel with σ = 4.8, corresponding to the mean Euclidean distance between the
considered images and their closest neighbor. Each colored point represents one 28 × 28 image, the
color indicating the associated label. Do not be misled, the labels are not used for the k-PCA,
which is an unsupervised technique. The labels are used after applying the k-PCA, just to get
an idea of how the digits are clustered. The two first components computed from 5000 samples
capture approximately 7% of the variance.

6.2.6 Further reading


There are many methods for performing dimensionality reduction. The PCA and kernel
PCA both try to preserve the variance of the original data. The kernel PCA is rather time-consuming
to compute and might actually require a lot of principal components to
ensure capturing most of the data variance. A Sparse Kernel PCA algorithm proposes to overcome
this limitation (Tipping, 2001). The PCA and its variants should really be seen as compression
methods, in the sense that these techniques try to extract fewer features than the original number
of features while still being able to reconstruct the original data as well as possible.

6.2.7 Manifold learning


Some other algorithms rely on a different problem statement. Rather than trying to minimize the
reconstruction error of the data, one might try to preserve the distances between the input vectors.
These methods really seek to keep the topology of the dataset while transforming it into a smaller
dimensional space; they are referred to as manifold learning methods. In that respect,
they can really be used to visualize the data, for example on a 2D screen. With this formulation,
we really seek a lower dimensional representation of the original data ensuring that the pairwise
distances are preserved, and therefore that the layout of the data in 2D is as similar as possible to the
layout of the data in the original large dimensional space. This leads to a family of algorithms like
the multidimensional scaling (MDS) algorithm (Cox and Cox, 2000), Isomap (Tenenbaum et al.,
2000), Sammon mapping (Sammon, 1969), Locally Linear Embedding (Roweis and Saul, 2000),
Stochastic Neighborhood Embedding (SNE) (Hinton and Roweis, 2002) and t-SNE (van der Maaten
and Hinton, 2008). The reader is referred to (Lee and Verleysen, 2007), where many non-linear
dimensionality reduction techniques are reviewed. Below, we develop a little the t-SNE method,
one of the recent and successful manifold learning methods.

t-Stochastic Neighborhood Embedding (t-SNE)


Suppose we are given N input vectors {x0, · · · , xi, · · · , xN−1} ∈ Rd for which we can define a
similarity pij. We are looking for N output vectors {y0, · · · , yi, · · · , yN−1} ∈ Rr, with r ≪ d, so
that, denoting qij the similarity between the i-th and j-th points yi, yj in the low dimensional space,
the similarities pij and qij are as close as possible. The t-SNE algorithm relies on specific choices
for computing the similarities pij, qij and for measuring the discrepancy between them. In SNE, the
similarities between points in the original high dimensional space are computed as the conditional
probability p_{j|i} that xj would be picked as the neighbor of xi if the neighbors of xi were selected
according to a normal distribution centered on xi:

\[
\forall i, j, \quad p_{j|i} = \frac{\exp\left(-\frac{|x_i - x_j|^2}{2\sigma_i^2}\right)}{\sum_{k \neq i} \exp\left(-\frac{|x_i - x_k|^2}{2\sigma_i^2}\right)}
\]

Note that this definition is asymmetric: within SNE, the probability that xj is picked as the
neighbor of xi differs in general from the probability that xi is picked as the neighbor of xj. While
in SNE the similarity is directly taken as pij = p_{j|i}, in t-SNE the similarities are symmetrized:

\[
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}
\]
The similarities in the low dimensional space could have been defined similarly. However, as argued
in (van der Maaten and Hinton, 2008), using a Gaussian for defining these similarities puts too many
constraints on the locations of the projections in the low dimensional space. They propose instead
to use a Student's t-distribution with 1 degree of freedom, which leads to defining the similarities qij
as:

\[
\forall i, j, \quad q_{ij} = \frac{(1 + |y_i - y_j|^2)^{-1}}{\sum_{k \neq l} (1 + |y_k - y_l|^2)^{-1}}
\]

The Student's t-distribution with one degree of freedom (or Cauchy distribution) has a heavier tail
and allows the datapoints in the low-dimensional space to be pushed a little further apart.

Once the similarities have been defined, it remains to introduce the criterion to be optimized.
The dissimilarity between the similarities pij and qij can be estimated with the Kullback-Leibler
divergence and reads:

\[
C = \sum_{i,j} p_{ij} \log\left(\frac{p_{ij}}{q_{ij}}\right)
\]

One can then minimize C with respect to the points yi in the low dimensional space by performing
a gradient descent (with momentum) of C with respect to the yi . The gradient reads (see (van der
Maaten and Hinton, 2008)) :

\[
\frac{\partial C}{\partial y_i} = 4 \sum_{j} (p_{ij} - q_{ij})(1 + |y_i - y_j|^2)^{-1} (y_i - y_j)
\]

The complexity of the algorithm is quadratic because all the pairwise distances need to be
computed. In (van der Maaten, 2014), an improvement of the complexity of the algorithm is
introduced, based on an approximation. The idea, which makes the evaluation of the gradient
faster, is the following: looking at the gradient for one point yi, one can see it as a sum of
influences, or forces, which push or pull the point yi. When some points are far away from yi, one can
approximate their individual forces by a single one originating from the center of mass of these
points. By approximating the individual contributions of several far points by a single one, one
can use the Barnes-Hut approximation used in physics and improve the complexity from O(N²)
to O(N log N). Applying t-SNE on 5000 digits of MNIST is shown on fig. 6.5.


Figure 6.5: The t-SNE applied to 5000 samples from the MNIST handwritten digits dataset, using
2 components and a perplexity of 40 (van der Maaten and Hinton, 2008). Each colored point
represents one 28 × 28 image, the color indicating the associated label. Do not be misled, the
labels are not used for the t-SNE, which is an unsupervised technique. However, it turns out that
t-SNE nicely clusters the different classes.
Part III

Support vector machines

Chapter 7

Introduction

7.1 Acknowledgment
The elements in the whole part III are mainly the integration of a former document. That document
was written in French and translated into English by Cédric Pradalier. The text here thus reuses
that translation.
The reader can refer to (Cristanini and Shawe-Taylor, 2000; Shawe-Taylor and Cristanini, 2004;
Vapnik, 2000) for a more exhaustive view of what is presented here.

7.2 Objectives
This document part aims at providing a practical introduction to Support Vector Machines (SVM).
Although this document presents an introduction to the topics and the foundations of the problem,
we refer the reader interested in more mathematical or practical details to the documents cited in
the bibliography. Nevertheless, there should be enough material here to get an “intuitive” under-
standing of SVMs, with an engineering approach allowing a quick and grounded implementation
of these techniques.
SVMs involve a number of mathematical notions, among which the theory of generalization –
only hinted at here –, optimization and kernel-based machine learning. We will only cover here
the aspects of these theoretical tools required to understand what are SVMs.

7.3 By the way, what is an SVM?


An SVM is a machine learning algorithm able to identify a separator. So, the essential question is:
what is a separator? Let's consider a finite set of vectors in Rn, separated into two groups, or in
other words, into two classes. The membership of a vector to one group or the other is indicated by
a label, “group 1” or “group 2”. Building a separator boils down to building a function
that takes as input a vector of our set and outputs its membership. SVMs provide a
solution to this problem, as would a simple recording of the vector memberships. However, with
SVMs we expect good generalization properties: should we consider another vector, not in the
initial set, the SVM will probably be able to identify its membership to one of the two groups,
based on the membership of the initial set of vectors.
The notion of separator can be extended to situations where the SVM is estimating a regression.
In this case, the label associated to a given vector is no longer a discrete membership but a real
value. This leads to a different problem, since we no longer try to assess to which group a vector
belongs (e.g. group 1 or group 2), but we want to estimate how much a vector is “worth”.

95
96 CHAPTER 7. INTRODUCTION

7.4 How does it work?


The core idea is to define an optimization problem based on the vectors for which we know the class
membership. Such a problem could be seen as “optimize such and such value while making sure
that ... ”. There are two main difficulties: the first is to define the right optimization problem. This
notion of “right” problem is what refers to the mathematical theory of generalization, and makes
SVMs a somewhat difficult-to-approach tool. The second difficulty is to solve this optimization
problem once it is defined. There we enter the realm of algorithmic subtleties, from which we will
only consider the SMO algorithm.
Chapter 8

Linear separator

We will first consider the case of a simple (although ultimately not so simple) separator: the linear
separator. In practice, this is the core of SVMs, even if they also provide much more powerful
separators than the ones we are going to cover in this chapter.

8.1 Problem Features and Notations


8.1.1 The Samples
As mentioned in introduction, we are considering a finite set of labelled vectors. In our specific
case, we will refer to the set of given labelled vectors as the set of samples S, containing N elements.

S = {(x1 , y1 ), · · · , (xi , yi ), · · · , (xN , yN )}. ∀i, (xi , yi ) ∈ X × Y

The membership of a vector to a class is represented here with the value y ∈ Y = {−1, 1}, which
will simplify the expressions later. Let us consider X = Rn as the input set in the following.

8.1.2 The Linear Separator


The dot product of two vectors will be denoted x.y. Using this notation, we can define the linear
separator hw,b , (w, b) ∈ X × R with the following equation:

hw,b (x) = w.x + b

This separator does not exclusively output values in Y = {−1, 1}, but we will consider that
when the result of hw,b(x) is positive, the vector x belongs to the same class as the samples
labelled +1, and that if the result is negative, the vector x belongs to the same class as the samples
labelled −1.
Before digging deeper into this notion of linear separator, let's point out that the equation
hw,b(x) = 0 defines the separation border between the two classes, and that this border is an
affine hyperplane in the case of a linear separator.

8.2 Separability
Let's consider again our sample set S, and let's separate it into two sub-sets according to the value
of the label y. We define:
S + = {(x, y) ∈ S | y = 1}
S − = {(x, y) ∈ S | y = −1}
Stating that S is linearly separable means that there exists w ∈ X and b ∈ R such that:

hw,b (x) > 0 ∀x ∈ S +


hw,b (x) < 0 ∀x ∈ S −

97

This is not always feasible. There can be label distributions over the vectors in S that make S
non linearly separable. In the case of samples taken in the X = R2 plane, stating that the sample
distribution is linearly separable means that we can draw a line (thus an hyperplane) such that
the samples of the +1 class are on one side of this border and those of the −1 class on the other
side.

8.3 Margin
For the following, let's assume that S is linearly separable. This rather strong hypothesis will be
relaxed later, but for now it will let us introduce some notions. The core idea of SVM is that,
additionally to separating the samples of each class, it is necessary that the hyperplane cuts “right
in the middle”. To formally define this notion of “right in the middle” (cf. figure 8.1), we introduce
the margin.

Figure 8.1: The same samples (class −1 or +1 is marked with different colors) are, on both figures,
separated by a line. The notion of margin allows to qualify mathematically the fact the separation
on the right figure is “better” than the one on the left.

Using figure 8.2, we can already make the following observations. First the curves defined
by equation hw,b (x) = C are parallel hyperplanes and w is normal to these hyperplanes. The
parameter b expresses a shift of the separator plane, i.e. a translation of hw,b (x). The norm |w| of
w affects the level set hw,b (x) = C. The larger |w|, the more compressed the level set will be.

[Figure annotations: the level sets hw,b(x) = cste and hw,b(x) = 0, the normal vector w, the margins,
and the distances d = |hw,b(x)|/|w| to a sample and d = |b|/|w| to the origin.]

Figure 8.2: Definition of a separator hw,b(x). The values on the graph are the values of the
separator at the sample points, not the Euclidean distances. If the Euclidean distance of a point x
to the separation border is d, then |hw,b(x)| at that point is d|w|.

When looking for a given separating border, we are faced with an indetermination regarding

the choice of w and b. Any vector w not null and orthogonal to the hyperplane will do. Once this
vector chosen, we determine b such that b/|w| is the oriented measure1 of the distance from the
origin to the separating hyperplane.
The margin is defined with respect to a separator hw,b and a given set of samples S. We will
denote this margin γhw,b(S). It is defined from a per-sample function γhw,b(x, y), also called the
margin, or rather the sample margin. This latter margin is:

γhw,b (x, y) = y × hw,b (x) (8.1)

Since y ∈ {−1, 1} and a separator puts samples with label +1 on the positive side of its border
and those of class −1 on the negative side, the sample margin is, up to the norm of w, the distance
from a sample to the border. The (overall) margin, for all the samples, is simply the minimum of
all margins:
γhw,b (S) = min γhw,b (x, y) (8.2)
(x,y)∈S

Coming back to the two cases of figure 8.1, it seems clear that on the right side – the better
separator –, the margin γhw,b (S) is larger, because the the border cuts further from the samples.
Maximizing the margin is most of the work of an SVM during the learning phase.

1 The direction of w defines the positive direction.


Chapter 9

An optimisation problem

9.1 The problem the SVM has to solve


9.1.1 The separable case
In continuity with chapter 8, let’s assume that we are still given a sample set S that is separable
with a linear separator. If the separator effectively separate S, with on the positive side, all the
+1-labelled samples, and on the negative side, the ones labelled −1, then all the sample margin
are positive (cf. eq. (8.1)). When one of the margin is negative, then the separator is not correctly
separating the two classes although it should be possible since we assume the sample set linearly
separable. In this incorrect case, γhw,b (S) is negative (cf. eq. (8.2)). Thus, maximising the margin
means first that we separate (positive margin) and then that we separate well (maximal margin).
A separator with maximal margin is such that it makes the margin of the sample with the
smallest margin larger than the margin of the sample with the smallest margin for all the other
possible separators.
Let's consider this sample of smallest margin. In fact, there can be more than one (with equal
margins), belonging to either the positive or the negative class. Actually, considering figure 9.1,
there are necessarily one sample of the positive class and one sample of the negative class that
constrain this margin, and the separating border has to cut right in the middle. We can note as
well that only these samples constrain the separating hyperplane, and that we could remove all
the others from the sample set without changing the maximal-margin separator. This is why these
samples are named support vectors.
Note that the separating hyperplane with maximal margin is defined up to a constant scale,
and that we can choose this scale such that the support vectors lie on the level sets with values +1
and −1. Figure 9.2 illustrates this point. In this particular case, the distance from the support
vectors to the separating plane – i.e. the margin – is simply 1/|w|.
Starting from this observation, let's consider the case where S is separable. Figure 9.3 depicts
two separators such that all the samples lie not only on the right side (γhw,b(x, y) > 0), but also
outside of the band defined by the level sets −1 and +1 (γhw,b(x, y) ≥ 1).
The width of the band delimited by the level sets −1 and +1 is 2/|w|. To find the separator
of maximal margin, it is then only necessary to search for the separator for which |w| is minimal
among the separators analogous to those from figure 9.3, that is among all those that verify, for
all samples, γhw,b(x, y) ≥ 1. Minimizing |w| amounts to widening the band −1/+1 until it “blocks”
against the support vectors, which brings us back to the situation depicted in figure 9.2.
The hyperplane with maximal margin, for a sample set S = {(x1 , y1 ), · · · , (xi , yi ), · · · , (xN , yN )},
is thus the solution to the following optimization problem, in which the coefficient 1/2 is just added
to simplify the calculus of the derivative in the coming sections:

\[
\text{find } \operatorname*{argmin}_{w,b} \; \frac{1}{2} w \cdot w
\quad \text{subject to} \quad y_i (w \cdot x_i + b) \geq 1, \; \forall (x_i, y_i) \in S
\]



Figure 9.1: The separator with maximal margin has a border which is constrained by at least one
sample of each class. The support vectors are marked with a cross.


Figure 9.2: The separator in this figure has the same border as the one from figure 9.1, so it
separates the samples with the same quality. However, the difference is that the level sets hw,b(x) =
1 and hw,b(x) = −1 contain the support vectors. To achieve this, it was necessary to modify the
norm of w, and adjust b accordingly.


Figure 9.3: Both separators in this figure make the margin of all samples greater than 1. For both
of them, the width of the bands is 2/|w|, where w is the term appearing in hw,b (x) = w.x + b.
Consequently, the “most vertical” separator on this figure has a vector w with a smaller norm than
the other.

9.1.2 General case


For the general case where S is not separable, the solution consists in authorizing some of the
samples to have a margin smaller than one, or even negative. To this end, we will transform the
constraint yi(w.xi + b) ≥ 1 into yi(w.xi + b) ≥ 1 − ξi, with ξi ≥ 0. Obviously, doing that without
further constraints leads to unwanted effects, since we could minimize w.w down to zero by selecting
large enough ξi. To avoid this, we add the variables ξi, named slack variables, to the minimization
problem so as to prevent them from becoming exceedingly large. This limits the influence of the
samples violating the separability on the separating solution of the optimization problem. In
summary, the optimization problem becomes the following, with C a positive parameter defining
the tolerance of the SVM to incorrectly separated samples:
\[
\text{find } \operatorname*{argmin}_{w,b,\xi} \; \frac{1}{2} w \cdot w + C \sum_i \xi_i
\quad \text{subject to} \quad
\begin{cases}
y_i (w \cdot x_i + b) \geq 1 - \xi_i, & \forall (x_i, y_i) \in S\\
\xi_i \geq 0, & \forall i
\end{cases}
\qquad (9.1)
\]

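The role of C can be observed by training a linear soft-margin SVM with an off-the-shelf solver; the sketch below uses scikit-learn on two overlapping Gaussian classes (the data and the values of C are only an example):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-1.0, size=(50, 2)), rng.normal(loc=+1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # A small C tolerates margin violations (typically many support vectors);
    # a large C penalizes them strongly (typically fewer support vectors, larger |w|).
    print(C, clf.support_.size, np.linalg.norm(clf.coef_))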
9.1.3 Relation with the ERM


We have introduced the SVMs as a learning process which maximizes the margin for separating the
two classes. This formulation is different from minimizing an empirical risk, whereas the empirical
risk minimization (ERM) is the induction principle which is legitimated theoretically, as introduced
in chapter 5.
The margin optimization process described here could be thought of as another induction
principle, since it does not rely on minimizing the total number of errors made on the learning
dataset. Nevertheless, the margin optimization can be described as an ERM process thanks to the
formulation of a specific loss function.
Let us recall notations. We have a decision function h (x) = w.x + b which associates a scalar
to the input. The sign of this scalar should be the same as the associated label y ∈ {−1, 1}, in
other words, the value yh (x) should be positive for correctly classified data. The loss associated
to a data sample (x, y) is L (h (x) , y). Let us recall (see paragraph 5.2.1) the hinge loss as
Lhinge (h (x) , y) = max (0, 1 − yh (x)).

Figure 9.4: Binary and hinge losses.

This loss approximates the binary loss, as figure 9.4 shows. Let us now reconsider equation (9.1).
The two constraints can be rewritten as

\[
\xi_i \geq \max\left(0, 1 - y_i (w \cdot x_i + b)\right)
\]

For a given (w, b), the objective function in equation (9.1) is improved
when the sum is minimal, i.e. when each ξi is minimal. Therefore, as the slack variables should
be minimal, we can use the previous expression as an equality before injecting it into the objective
function, without changing the optimal solution. So the minimization problem of equation (9.1)
can be rewritten as

\[
\text{find } \operatorname*{argmin}_{w,b} \; \frac{1}{2} w \cdot w + C \sum_i \max\left(0, 1 - y_i h(x_i)\right)
\]

which, dividing the objective by C and setting λ = 1/(2C), is

\[
\text{find } \operatorname*{argmin}_{w,b} \; \sum_i L_{hinge}(h(x_i), y_i) + \lambda \, w \cdot w.
\]

This last formulation is an actual ERM problem, with an additional regularization term λw.w.

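This regularized hinge-loss objective can be minimized directly; the crude subgradient descent below (a sketch, with arbitrary data, λ and step size) is enough to see the objective decrease from its initial value:

import numpy as np

def hinge_erm_objective(w, b, X, y, lam):
    # sum_i max(0, 1 - y_i (w.x_i + b)) + lam * w.w
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins).sum() + lam * w @ w

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, size=(50, 2)), rng.normal(+1.0, size=(50, 2))])
y = np.array([-1.0] * 50 + [+1.0] * 50)

w, b, lam, lr = np.zeros(2), 0.0, 0.1, 0.01
print(hinge_erm_objective(w, b, X, y, lam))            # objective before optimization
for _ in range(200):
    active = (y * (X @ w + b)) < 1.0                   # samples violating the margin of 1
    grad_w = -(y[active, None] * X[active]).sum(axis=0) + 2.0 * lam * w
    grad_b = -y[active].sum()
    w, b = w - lr * grad_w, b - lr * grad_b
print(hinge_erm_objective(w, b, X, y, lam))            # typically much smaller afterwards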
9.2 Lagrangian resolution


9.2.1 A convex problem
In the following, we will assume that the SVM optimization problem is convex, that is to say, this
optimization has a single global optimum and no local optima. This statement, not justified here,
is essential because the convexity of the problem is what ensures the convergence of the SVM
algorithms towards the optimal solution.

9.2.2 The direct problem


The intent in this document is not to introduce optimization theory, but to bring in the absolute
minimum to understand the relation with the SVMs. In section 9.1.2, we defined an optimization
problem in the following form:

\[
\text{find } \operatorname*{argmin}_{k} \; f(k)
\quad \text{subject to} \quad g_i(k) \leq 0, \; 1 \leq i \leq n
\]

Solving this problem requires defining the following function, called the problem's Lagrangian, which
involves the constraints multiplied by coefficients αi ≥ 0. These coefficients are named Lagrange
multipliers. The constraints gi are affine.

\[
\mathcal{L}(k, \alpha) = f(k) + \sum_{1 \leq i \leq n} \alpha_i g_i(k)
\]

In this case, the theory says that the vector k⋆ minimizing f(k) while respecting the constraints
must be such that L has a saddle point at (k⋆, α⋆). At this point, L is a minimum with respect to k
and a maximum with respect to α:

\[
\forall k, \forall \alpha \geq 0, \quad \mathcal{L}(k^\star, \alpha) \leq \mathcal{L}(k^\star, \alpha^\star) \leq \mathcal{L}(k, \alpha^\star)
\]

The following stands at the optimum:

\[
\frac{\partial \mathcal{L}}{\partial k}(k^\star, \alpha^\star) = 0
\]

whereas \(\frac{\partial \mathcal{L}}{\partial \alpha}(k^\star, \alpha^\star)\), which should be null as well at the saddle point, may not be
defined (see the top-right frame in figure 9.5).
These conditions are sufficient to define the optimum if the Lagrangian is a convex function,
which will be the case for SVMs. See Cristanini and Shawe-Taylor (2000) for supplementary
mathematical justifications.
The challenge is that writing these conditions does not always lead easily to a solution to the
initial problem. In the case of SVMs, solving the dual problem will result in an easier (although
not easy) solution.

9.2.3 The Dual Problem



The dual problem is the result of inserting the optimality conditions given by ∂L/∂k = 0
into the expression of L. This expression will only involve the multipliers, in a maximization
problem under another set of constraints.
This comes from the saddle shape of the Lagrangian around the optimum: the optimum is
a minimum along k, but a maximum along α (cf. figure 9.5). Injecting ∂L/∂k = 0 into L(k, α) is
equivalent to defining a function θ(α) which, for a given α, expresses the minimum value of the
Lagrangian, that is to say the minimal value of L resulting from fixing α and minimizing over k. It
then remains to maximize θ(α) over α, which is the dual optimization problem.
The advantage of the dual problem in the context of SVMs will become clearer as soon as we
leave the generalities of optimization theory and come back to its application to SVMs.

9.2.4 An intuitive view of optimization under constraints


This section will give an intuitive view of the theory of optimization under constraints. However,
the impatient reader may accept these theorems and directly jump to the next section. Thanks to
Arnaud Golinvaux1 for his contribution to the following.

Optimization under equality constraints


Let’s start from the following optimization problem, while keeping in mind figure 9.6.

find argmin f (k)


k
subject to g(k) = 0 where g(k) = (gi (k))1≤i≤n
Let’s consider the solution k ? of our optimization problem. This solution, due to the constraints,
can be different from the minimum K of the objective function (see figure 9.6). The constraints
in g are a function of k. In principle, we’re only looking for k ? in the kernel of g, i.e. amongst the
k such that g(k) = 0.
1 3rd year student at Supélec in 2012

Figure 9.5: Transition from the primal problem to the dual problem. The Lagrangian L (k, α) is
saddle-shaped around the solution of the problem. The “bottom of the valley”, i.e. the minima

along k, is represented on the saddle by a dashed line. The equation ∂k L = 0 enables us to link k
and α, so as to express k as a function of α, denoted K (α). This link is the projection of the valley
to the “horizontal” plane. Injecting this relation into L gives a function L (k, K (α)) = θ (α). This
function is the objective function of the dual problem, which we’re trying to maximize, as shown
in the figure.


Figure 9.6: See text.


108 CHAPTER 9. AN OPTIMISATION PROBLEM

Staying in the kernel (i.e. on the bold curve in the figure) leads to the following consequence
around the solution k⋆. Let h be an infinitesimal displacement around the optimum, such that k⋆ + h
is still in the kernel of g. We have:

\[
g(k^\star) = 0, \qquad g(k^\star + h) = 0, \qquad g(k^\star + h) = g(k^\star) + \left. J_g \right|_k (k^\star)\, h
\]

where Jg|k(k0) represents the Jacobian matrix of g with respect to k, taken at k0. We can
immediately deduce that:

\[
\left. J_g \right|_k (k^\star)\, h = 0
\]
A displacement h satisfying the constraints around the solution is thus included in the vector space
defined by the kernel of the Jacobian matrix:

h ∈ ker ( Jg|k (k ? ))

Let’s consider now the consequence of this displacement regarding our objective function f . By
linearizing around k ? while using the constraint-satisfying displacement h above, we have:

f (k ? + h) = f (k ? ) + ∇f |k (k ? ) .h

As a reminder, the gradient is a degenerate Jacobian when the function is scalar- instead of vector-
valued. Being around k⋆ means that f(k⋆) is minimal, so long as our displacements respect the
constraints (in bold on figure 9.6, i.e. the curve C). So ∇f|k(k⋆).h ≥ 0. However, similarly to
h, −h satisfies the constraints as well, since the set ker(Jg|k(k⋆)) of displacements h satisfying the
constraints is a matrix kernel and as such a vector space. Hence, we also have −∇f|k(k⋆).h ≥ 0.
From this, we can deduce that

\[
\left.\nabla f\right|_k (k^\star) \cdot h = 0, \quad \forall h \in \ker\left(\left. J_g \right|_k (k^\star)\right)
\]
In other words, ∇f |k (k ? ) is in the vector sub-space E orthogonal to ker ( Jg|k (k ? )). But this space
E happens to be the one spanned by the column vectors of the matrix Jg|k (k ? ). Hence, we can
confirm that ∇f |k (k ? ) is a linear combination of these column vectors.
But, since g is the vector of the n scalar constraints gi,

Jg|k (k ? ) = [ ∇g1 |k (k ? ) , ∇g2 |k (k ? ) · · · ∇gn |k (k ? )]

so these column vectors are the gradients of each of the constraints with respect to the parameters.
As a result, we have:
\[
\exists (\alpha_1, \cdots, \alpha_n) \in \mathbb{R}^n : \quad \left.\nabla f\right|_k (k^\star) + \sum_{i=1}^{n} \alpha_i \left.\nabla g_i\right|_k (k^\star) = 0
\]

Hence the idea of setting the following Lagrangian:

\[
\mathcal{L}(k, \alpha) = f(k) + \sum_{i=1}^{n} \alpha_i g_i(k)
\]

whose gradient with respect to k must be null at the constrained optimum.

The case of inequality constraints


Let's consider now the following optimization problem:

\[
\text{find } \operatorname*{argmin}_{k} \; f(k)
\quad \text{subject to} \quad g_i(k) \leq 0, \; 1 \leq i \leq n
\]

The idea is to associate to each constraint gi(k) ≤ 0 a new scalar parameter yi. We group the
yi of the inequality constraints into a vector y. We set g′i(k, y) = gi(k) + yi². An optimization problem
with inequality constraints thus becomes a problem with equality constraints, but with additional
parameters.

\[
\text{find } \operatorname*{argmin}_{k, y} \; f(k)
\quad \text{subject to} \quad g'(k, y) = 0
\]

The trick is that, necessarily, the new parameters in y cannot influence the objective function
f(k). Be aware, though, that everything that was written earlier on the parameter k must now be
applied to the union of k and y in this new problem. Following the discussion above, we can set
the following Lagrangian:

\[
\mathcal{L}(k, y, \alpha) = f(k) + \sum_{i=1}^{n} \alpha_i \left(g_i(k) + y_i^2\right)
\]

The gradient of the Lagrangian with respect to the parameters (now k and y) must be null. It
is thus null if we consider it with respect to k and with respect to y.
By differentiating with respect to k, we still have:
\[
\left.\nabla f\right|_k (k^\star) + \sum_{i=1}^{n} \alpha_i \left.\nabla g_i\right|_k (k^\star) = 0
\]

By differentiating along yi, we get 2αi yi = 0, which means either αi = 0 or yi = 0. But yi = 0 is
equivalent, from the definition of the new constraints g′, to gi(k) = 0... and the k we are referring to
is the one defined by the other derivative of the Lagrangian, i.e. k⋆. This is why, at the optimum,
we have αi gi(k⋆) = 0 (these are the supplementary KKT conditions that will be of use later). These
conditions are important because the value of αi at the optimum lets us know whether the constraint
gi is saturated or not: a non-zero value of αi corresponds to a saturated constraint (equality).
The second derivative of the Lagrangian with respect to yi happens to be 2αi. Intuitively2, if
we refer to curve C in figure 9.6, and by keeping in mind that the parameters yi are now part of
the parameters k in this figure, we can clearly see that the curvature of C is positive (upward),
and thus that this second derivative is positive. Hence, αi ≥ 0.
In practice, we finally ended up completely suppressing any reference to yi in the expression of
our conditions of interest. Thus, we can avoid bringing in these variables in the Lagrangian, and
although it contains inequality constraints, we can keep the following Lagrangian for our problem:
\[
\mathcal{L}(k, \alpha) = f(k) + \sum_{i=1}^{n} \alpha_i g_i(k)
\]
\[
\left.\nabla \mathcal{L}\right|_k (k^\star, \alpha) = 0, \qquad
\forall i, \; \alpha_i \geq 0, \qquad
\forall i, \; \alpha_i g_i(k^\star) = 0
\]

9.2.5 Back to the specific case of SVMs


Let's come back to the problem defined in section 9.1.2, and let's denote by αi and µi the
multipliers relative to the two types of constraints. After having rewritten these constraints so
as to make inequalities appear (≤) and match the form seen in section 9.2.2, we can define the
Lagrangian of our problem as:

\[
\begin{aligned}
\mathcal{L}(w, b, \xi, \alpha, \mu)
&= \frac{1}{2} w \cdot w + C \sum_i^N \xi_i - \sum_i^N \alpha_i \left(y_i (w \cdot x_i + b) + \xi_i - 1\right) - \sum_i^N \mu_i \xi_i\\
&= \frac{1}{2} w \cdot w + \sum_i^N \xi_i (C - \alpha_i - \mu_i) + \sum_i^N \alpha_i - \sum_i^N \alpha_i y_i (w \cdot x_i + b)
\end{aligned}
\]
\[
\forall i, \; \alpha_i \geq 0, \qquad \forall i, \; \mu_i \geq 0
\]
2 To develop this proof correctly, please refer to a math text book

The two types of constraints of our problem are inequalities (cf. section 9.1.2). The theory then
says that when a constraint is a strict inequality at the optimum, its Lagrange multiplier is null,
whereas a non-null multiplier corresponds to a saturated constraint, that is to say an equality. So,
for a constraint gi(...) ≤ 0 with associated multiplier ki, we have either ki = 0 and gi(...) < 0,
or ki > 0 and gi(...) = 0. These two cases can be summarized into a single expression, ki gi(...) = 0.
This expression is named the supplementary KKT3 condition. In our problem, we can then express
six KKT conditions: the constraints, the positivity of the multipliers and the supplementary KKT
conditions:
∀i, αi ≥ 0 (KKT1)
∀i, µi ≥ 0 (KKT2)
∀i, ξi ≥ 0 (KKT3)
∀i, yi (w.xi + b) ≥ 1 − ξi (KKT4)
∀i, µi ξi = 0 (KKT5)
∀i, αi (yi (w.xi + b) + ξi − 1) = 0 (KKT6)
This being defined, let us set to zero the partial derivatives of the Lagrangian with respect to the
terms that are not Lagrange multipliers:

\[
\frac{\partial \mathcal{L}}{\partial w} = 0 \Rightarrow w = \sum_i^N \alpha_i y_i x_i \qquad \text{(L1)}
\]
\[
\frac{\partial \mathcal{L}}{\partial b} = 0 \Rightarrow \sum_i^N \alpha_i y_i = 0 \qquad \text{(L2)}
\]
\[
\frac{\partial \mathcal{L}}{\partial \xi_i} = 0 \Rightarrow \forall i, \; C - \alpha_i - \mu_i = 0 \qquad \text{(L3)}
\]
Equation L1, injected into the expression of the Lagrangian, lets us remove the term w. Equation
L3 lets us remove the µi as well. L2 lets us eliminate b, which is now in L multiplied by a null term.
After these substitutions, we have a Lagrangian expression that depends only on the αi. We will
now maximize it over these αi, knowing that injecting L1, L2 and L3 already guarantees
that we have a minimum with respect to w, ξ and b. This is the dual problem. The constraints on
this problem can be inferred from the constraints on the αi resulting from the KKT equations.
Using L3, KKT2 and KKT3, we can show the following by considering the two cases resulting from
KKT5:

• either ξi = 0, µi > 0, and then 0 ≤ αi < C,

• or ξi > 0, µi = 0, and then αi = C according to L3.

The constraints on the αi are thus 0 ≤ αi ≤ C and L2. Hence, we must solve the following
optimization problem, dual to our initial problem, to find the Lagrange multipliers αi:

\[
\text{find } \operatorname*{argmax}_{\alpha} \; \sum_i^N \alpha_i - \frac{1}{2} \sum_j^N \sum_i^N \alpha_j \alpha_i y_j y_i \, x_j \cdot x_i
\quad \text{subject to} \quad
\begin{cases}
\sum_i \alpha_i y_i = 0\\
\forall i, \; 0 \leq \alpha_i \leq C
\end{cases}
\]
Additionally, from the two cases mentioned above, we can also deduce that ξi(αi − C) = 0.
This means that accepting a badly separated sample xi (ξi ≠ 0) is equivalent to using its αi with
the maximum value C.
One interesting aspect of this expression of the dual problem is that it only involves the samples
xi , or more precisely only their dot products. This will be useful later when moving away from
3 Karush-Kuhn-Tucker

linear separators. Furthermore, the vector of the separating hyperplane being defined by L1, it is the result of contributions from all the samples xi, each weighted by its αi. However, many of these values may turn out to be zero after optimization. Such samples have no influence on the definition of the separator. Those that remain, that is those for which αi is non-zero, are named support vectors, because they are the ones that define the separating hyperplane.
Solving the dual problem is not trivial. So far, we have only defined the problem. In particular, b has now disappeared from the dual problem and we will have to work hard⁴ to find it back once this problem is solved. We'll discuss this point further in chapter 11.
Let’s complete this chapter with an example of linear separation, where the separator is the
solution of the optimization problem we’ve defined earlier. The samples that effectively influence
the expression L1 with a non-zero coefficient αi , i.e. the support vectors, are marked with a cross
in figure 9.7.
The separation is thus defined with the following equation:
    hw,b (x) = ( Σ_{i=1}^{N} αi yi xi )·x + b

which we will rather write as follows, to only involve the samples through their dot products:

    hw,b (x) = Σ_{i=1}^{N} αi yi xi·x + b        (9.2)

Figure 9.7: Hyperplane resulting from the resolution of the optimization problem from section 9.1.2.
The separation border hw,b (x) = 0 is the bold curve. The curve hw,b (x) = 1 is shown as a thin
line, and hw,b (x) = −1 in a dashed line. The support vectors are marked with crosses.

4 See section 11.1.2 page 123


Chapter 10

Kernels

The main interest of kernels in the context of SVMs is that everything written about linear separation also applies readily to non-linear separation once we bring kernels in, so long as we do it right.

10.1 The feature space


Let's imagine a set of samples S, labelled with −1 or +1 according to their class membership as before, but which is not linearly separable. The methods we've seen up to the last chapter will still apply, but they will provide a low-quality solution and many of the samples will be support vectors (cf. figure 10.1).

Figure 10.1: Non linearly separable samples.

In order to build a better separation of the samples, a solution is to project the samples into
a different space1 , and to implement a linear separation in this space where it will hopefully work
better.
1 Often of very high dimension


Let ϕ be this projection. We have:

    ϕ (x) = ( ϕ1 (x), ϕ2 (x), ϕ3 (x), …, ϕn (x) )ᵀ

Obviously, the functions ϕi are not necessarily linear. Furthermore, we can have n = ∞! So, if we use the approaches seen in chapter 9 and apply them in the feature space, that is to say, if we work with the following sample set with binary labels (Y = {−1, 1}):

ϕ (S) = {(ϕ (x1 ), y1 ) · · · (ϕ (xi ), yi ) · · · (ϕ (xN ), yN )}

instead of
S = {(x1 , y1 ), · · · , (xi , yi ), · · · , (xN , yN )}
then, we just have to perform a linear separation on the corpus ϕ (S). Using equation L1, we get a
separator w and a value for b. Now, to decide the class of a new vector x, we could compute ϕ (x)
and apply the separator on ϕ (x) to find out its class membership, −1 or +1.
In practice, we will avoid computing ϕ (x) explicitly by noting that the optimization problem defined in chapter 9 only involves the samples through dot products of pairs of samples.
Let's denote k (x, x′) the product ϕ (x)·ϕ (x′). Working on corpus ϕ (S) is equivalent to working on corpus S with the algorithms of chapter 9, but replacing every occurrence of •·• with k (•, •).
So far, the interest of kernels may not be obvious, because to compute k (x, x′), we still need to apply its definition, that is to project x and x′ into the feature space and to compute, in the feature space, their dot product.
However, the trick, known as the kernel trick, is that we will actually avoid performing this projection because we will compute k (x, x′) in another way. Actually k (x, x′) is a function that we will choose, making sure that there exists, in theory, a projection ϕ into a space that we will not even try to describe. In this way, we will compute k (x, x′) directly each time the algorithm from chapter 9 refers to a dot product, and that's all. The projection into the huge feature space will be kept implicit.
Let’s take an example. Consider
!
2
0 |x − x0 |
k (x, x ) = exp −

It is well known that this function corresponds to the dot product of the projections of x and x′ into an infinite-dimensional space. The optimization algorithm that uses this function, also known as a kernel, will compute a linear separation in this space while maximizing the margin, without ever having to enumerate the infinitely many components of the projected vectors to compute their products!
The separation function is then directly inspired from equation (9.2) page 111, once the optimal αi are found and b is computed:

    sep (x) = Σ_{i=1}^{N} αi yi k (xi, x) + b

knowing that many of the terms of the sum are zero if the problem is actually separable. In our
case, the separator can then be rewritten as:
    sep (x) = Σ_{i=1}^{N} αi yi exp( −|xi − x|² / (2σ²) ) + b

The level set sep (x) = 0 defines the separation border between the classes, and the level sets sep (x) = 1 and sep (x) = −1 represent the margin. Figure 10.2 depicts the result of the algorithm given in chapter 9 with our kernel function.

Figure 10.2: Solution of the optimization problem from section 9.1.2 on the corpus from figure 10.1, but with Gaussian kernels. The separating border sep (x) = 0 is depicted with a bold line and the level sets sep (x) = 1 and sep (x) = −1 respectively with the dashed and thin lines. The support vectors are marked with crosses.
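To make the kernel trick concrete, here is a small sketch (Python with NumPy, a choice not prescribed by the text) of the decision function sep (x) = Σ αi yi k (xi, x) + b, assuming the αi and b have already been produced by some optimizer; the multipliers and the width sigma below are illustrative values only.

import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    # k(x, x') = exp(-|x - x'|^2 / (2 sigma^2))
    d = x - xp
    return np.exp(-np.dot(d, d) / (2.0 * sigma ** 2))

def sep(x, X, y, alpha, b, kernel=gaussian_kernel):
    # sep(x) = sum_i alpha_i y_i k(x_i, x) + b
    # Only the support vectors (alpha_i != 0) actually contribute to the sum.
    return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X)) + b

# Toy usage with made-up multipliers (normally obtained from the dual problem):
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1, 1, -1, -1])
alpha = np.array([0.5, 0.5, 0.5, 0.5])   # hypothetical values
b = 0.0
label = np.sign(sep(np.array([0.2, 0.1]), X, y, alpha, b))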

10.2 Which Functions Are Kernels?


Obviously, it would be too easy if any function of two vectors were a kernel. For a function to be a kernel, there must exist a projection into a feature space such that the function outputs the same result as the dot product of the projected vectors. Although we don't have to instantiate this projection (as seen before), we have to make sure it exists. The convergence of the SVM is at stake, because this convergence is only ensured when the problem is convex, as mentioned in paragraph 9.2.1, and that requires the kernel not to be just any function.

10.2.1 A Simple Example


We consider now the following function to compare two vectors:

    k (x, x′) = (x·x′ + c)²

Is it a kernel? If so, what is the corresponding projection? One way to prove it is to exhibit the dot product of the projected vectors.

    (x·x′ + c)² = ( Σ_i x^i x′^i + c )²
                = Σ_{i,j} x^i x′^i x^j x′^j + 2c Σ_i x^i x′^i + c²
                = Σ_{i,j} (x^i x^j)(x′^i x′^j) + Σ_i (√(2c) x^i)(√(2c) x′^i) + (c)(c)

Hence, we can deduce that the projection into a space where our kernel is a dot product is the function that combines the components of the vector x two by two:

    ϕ (x) = ( (x^1)², x^1 x^2, x^1 x^3, …, x^n x^{n−1}, (x^n)², √(2c)·x^1, √(2c)·x^2, …, √(2c)·x^n, c )ᵀ
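As a sanity check of this derivation, the following sketch (Python/NumPy, added here purely for illustration) builds the explicit projection ϕ and verifies numerically that ϕ(x)·ϕ(x′) equals (x·x′ + c)²:

import numpy as np

def phi(x, c):
    # All pairwise products x^i x^j, then the sqrt(2c) x^i terms, then c.
    pairs = [x[i] * x[j] for i in range(len(x)) for j in range(len(x))]
    linear = [np.sqrt(2.0 * c) * xi for xi in x]
    return np.array(pairs + linear + [c])

def k(x, xp, c):
    return (np.dot(x, xp) + c) ** 2

x, xp, c = np.array([1.0, 2.0, 3.0]), np.array([-1.0, 0.5, 2.0]), 1.5
assert np.isclose(np.dot(phi(x, c), phi(xp, c)), k(x, xp, c))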

10.2.2 Conditions for a kernel


There exist mathematical conditions, known as Mercer's theorem, that determine whether a function is a kernel or not, without having to build the projection into the feature space. In practice, we have to ensure that for any sample set of size N, the matrix [ k (xi, xj) ]_{1≤i,j≤N} is positive semi-definite². We will not delve further into these conditions because in most cases, we will rather build kernels from existing kernels.

10.2.3 Reference kernels


In this section, we describe two commonly used kernels. The first one is the polynomial kernel:

    kd (x, x′) = (x·x′ + c)^d

It corresponds to a projection Φ(x) into a feature space where each component φi (x) is a product of components of x of degree at most d (a monomial). The separator computed from this kernel is a polynomial of degree d whose terms are components of x. The larger the constant c, the more importance is given to the low-order terms. With c = 1 and d = 3, figure 10.3 depicts the result of the separation.
The second kernel we will introduce here is the Gaussian kernel, also known as RBF³, mentioned earlier:

    krbf (x, x′) = exp( −|x − x′|² / (2σ²) )

This kernel corresponds to a projection into an infinite-dimensional space. However, in this space, all the points are projected onto the hypersphere with radius 1. This can easily be seen from |ϕ (x)|² = k (x, x) = exp(0) = 1.

10.2.4 Assembling kernels


Once we know some kernels, here are some results leading to the definition of other kernels.
Let:
• k1, k2, k3 be three kernel functions.
• f be a real-valued function.
• Φ be a function that projects the vectors into another vector space.
• B be a positive semi-definite matrix.
• p be a polynomial with positive coefficients.
² i.e. all its eigenvalues are non-negative.
3 Radial Basis Function, a name coming from RBF neural networks which are generalized by SVMs using this
kernel.

Figure 10.3: Solution of the optimization problem of section 9.1.2 over the corpus from figure 10.1, but with a polynomial kernel of degree 3. The separating border sep (x) = 0 is depicted with a bold line and the level sets sep (x) = 1 and sep (x) = −1 respectively with the dashed and thin lines. The support vectors are marked with crosses.

• α be a positive number.
Then, the following functions k are also kernels:
k (x, x0 ) = k1 (x, x0 ) + k2 (x, x0 )
k (x, x0 ) = αk1 (x, x0 )
k (x, x0 ) = k1 (x, x0 ) k2 (x, x0 )
k (x, x0 ) = f (x) f (x0 )
k (x, x0 ) = k3 (Φ (x) , Φ (x0 ))
k (x, x0 ) = xT Bx0
k (x, x0 ) = p (k1 (x, x0 ))
k (x, x0 ) = exp (k1 (x, x0 ))
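A minimal sketch (Python, not part of the original text) of how these closure rules can be used in practice, building new kernel functions from existing ones; the final example rebuilds a Gaussian kernel by combining the rules above, with an illustrative value of sigma.

import numpy as np

def linear(x, xp):                    # k(x, x') = x.x'
    return np.dot(x, xp)

def scale(k1, alpha):                 # alpha * k1, alpha >= 0
    return lambda x, xp: alpha * k1(x, xp)

def add(k1, k2):                      # k1 + k2
    return lambda x, xp: k1(x, xp) + k2(x, xp)

def mul(k1, k2):                      # k1 * k2
    return lambda x, xp: k1(x, xp) * k2(x, xp)

def expo(k1):                         # exp(k1)
    return lambda x, xp: np.exp(k1(x, xp))

# Example: exp(-|x - x'|^2 / (2 sigma^2)) = exp(x.x' / sigma^2) * f(x) f(x')
# with f(x) = exp(-|x|^2 / (2 sigma^2)), i.e. a product of an exp-kernel and an f(x)f(x') kernel.
sigma = 1.0
f = lambda x: np.exp(-np.dot(x, x) / (2 * sigma ** 2))
from_f = lambda x, xp: f(x) * f(xp)
gauss = mul(expo(scale(linear, 1.0 / sigma ** 2)), from_f)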

10.3 The core idea for SVM


Now that we have added the kernel trick to our tool-box, we can work with very high-dimensional spaces, without having to really enter them. Admittedly, finding a linear separator – and also a linear regression – is made easier by projecting the samples into very high-dimensional spaces... In return, such an easy-to-build separation is usually meaningless. In other words, it is easy to learn by heart, that is to learn a function that will not generalize to new samples. This is what is usually referred to as the "curse of dimensionality". By maximizing the margin, SVMs remain efficient in these high-dimensional spaces and are not affected by this curse. By projecting into the feature space to use an algorithm that maximizes the margin, we obtain a good separability while keeping good generalization capabilities. This is the core idea of SVMs.

10.4 Some Kernel Tricks


The ability to compute a dot product between vectors is sufficient to implement more operations
than could be expected, and these results can be obtained without having to explicitly project the
vectors into the feature space. Here are some examples.

10.4.1 Data Normalization


The norm of a feature vector is given by:

    |ϕ (x)| = √( ϕ (x)·ϕ (x) ) = √( k (x, x) )

Hence, we can very easily work on normalized data... in the feature space! In practice, the dot product is:

    ( ϕ (x) / |ϕ (x)| ) · ( ϕ (x′) / |ϕ (x′)| ) = ϕ (x)·ϕ (x′) / ( |ϕ (x)| |ϕ (x′)| ) = k (x, x′) / √( k (x, x) k (x′, x′) )

So, we just need to use the right-hand side of the above expression as a new kernel, built upon a kernel k, to work with normalized vectors in the feature space corresponding to k. Denoting k̄ the normalized kernel, we simply have:

    k̄ (x, x′) = k (x, x′) / √( k (x, x) k (x′, x′) )

We can even quite easily compute the distance, in feature space, between the projections of two vectors:

    |ϕ (x) − ϕ (x′)| = √( k (x, x) − 2 k (x, x′) + k (x′, x′) )
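A short sketch of these two tricks (Python, illustrative only), built on top of any kernel function k; the polynomial kernel used at the end is just an arbitrary example.

import numpy as np

def normalized(k):
    # k_bar(x, x') = k(x, x') / sqrt(k(x, x) k(x', x'))
    return lambda x, xp: k(x, xp) / np.sqrt(k(x, x) * k(xp, xp))

def feature_distance(k, x, xp):
    # |phi(x) - phi(x')| = sqrt(k(x, x) - 2 k(x, x') + k(x', x'))
    return np.sqrt(k(x, x) - 2.0 * k(x, xp) + k(xp, xp))

poly = lambda x, xp: (np.dot(x, xp) + 1.0) ** 3     # any kernel works here
poly_bar = normalized(poly)
d = feature_distance(poly, np.array([1.0, 0.0]), np.array([0.0, 1.0]))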

10.4.2 Centering and Reduction


For some algorithms⁴, it is more efficient to center the samples (subtract the center of gravity) and reduce them (divide by their standard deviation). This is something we can achieve in the feature space as well. As a reminder, in the following N is the number of samples in our training set.
Let's start by computing the sample variance, which will let us reduce them should we want to.

    var = (1/N) Σ_{i=1}^{N} | ϕ (xi) − (1/N) Σ_{j=1}^{N} ϕ (xj) |² = · · · = (1/N) Σ_{i=1}^{N} k (xi, xi) − (1/N²) Σ_{i,j=1}^{N} k (xi, xj)

To work on centered and reduced samples, we use the kernel k̂ defined as follows:

    k̂ (x, x′) = [ ( ϕ (x) − (1/N) Σ_{j=1}^{N} ϕ (xj) ) / √var ] · [ ( ϕ (x′) − (1/N) Σ_{j=1}^{N} ϕ (xj) ) / √var ]
              = (1/var) [ k (x, x′) − (1/N) Σ_{i=1}^{N} k (x, xi) − (1/N) Σ_{i=1}^{N} k (x′, xi) + (1/N²) Σ_{i,j=1}^{N} k (xi, xj) ]

Note that these kernels are very computationally expensive, even in comparison with SVMs
which tend to be computationally heavy algorithms with simple kernels. This is the type of
situation where one would rather pre-compute and store the kernel values for all the sample pairs
in the data-base.
⁴ SVMs are not the only ones to take advantage of the kernel trick.
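Since these corrections only involve sums of kernel values, they are usually applied directly to the precomputed Gram matrix. A possible sketch (Python/NumPy, written here as an illustration of the formulas above):

import numpy as np

def center_reduce_gram(K):
    # K is the N x N matrix of k(x_i, x_j) on the training samples.
    N = K.shape[0]
    mean_rows = K.mean(axis=0)              # (1/N) sum_i k(., x_i) for each sample
    mean_all = K.mean()                     # (1/N^2) sum_{i,j} k(x_i, x_j)
    var = np.trace(K) / N - mean_all        # (1/N) sum_i k(x_i, x_i) - (1/N^2) sum_{i,j} k(x_i, x_j)
    # k_hat(x_i, x_j) = (1/var) [ K_ij - row mean - column mean + overall mean ]
    return (K - mean_rows[None, :] - mean_rows[:, None] + mean_all) / var

# Usage: K = np.array([[k(xi, xj) for xj in X] for xi in X]); K_hat = center_reduce_gram(K)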

10.5 Kernels for Structured Data


In practice, as soon as we have data in a vector space, their dot product is a kernel function. So, one way to use SVMs is to express the samples as "vectors". In the situations we are considering, these vectors live in a very high-dimensional space... It is not necessarily practical, even if possible, to project them into a feature space. Nevertheless, the ability of SVMs to maintain their generalization power even in high-dimensional spaces, where samples tend to be "far apart", is of essential importance in this section.
All the approaches alluded to here come from Shawe-Taylor and Cristanini (2004), where many other techniques are also presented.

10.5.1 Document Analysis


One way to process documents with statistical methods is to consider them as bags of words. To
this end, we define a dictionary of N words m1 , · · · , mN , as well as the following function ϕ,
mapping a document d to a vector ϕ(d):

ϕ : Documents → NN
d 7 → ϕ(d) = (f (m1 , d), · · · , f (mN , d))

where f (m, d) is the number of occurrences of word m in document d. For a set {dl}l of l documents, the document-term matrix D, whose line i is given by the vector ϕ(di), lets us define a dot product (hence a kernel) over the documents. In practice, k (di, dj) is given by the coefficient (i, j) of DDᵀ. We can mitigate the fact that documents have different lengths by using a normalized kernel.
Furthermore, we can tune this kernel by injecting some a-priori semantic knowledge. For instance, we can define a diagonal matrix R where each diagonal value corresponds to the importance of a given word. We can also define a matrix P of semantic proximity whose coefficient pi,j represents the semantic proximity of words mi and mj. The semantic matrix S = RP lets us create a kernel taking advantage of this knowledge:

    k (di, dj) = ϕ(di) S Sᵀ ϕ(dj)
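As a small illustrative sketch (Python; the dictionary and documents below are invented for the example), the basic bag-of-words kernel matrix can be obtained from the document-term matrix D as DDᵀ:

import numpy as np
from collections import Counter

def doc_term_matrix(documents, dictionary):
    # D[i, j] = number of occurrences of word m_j in document d_i.
    rows = []
    for d in documents:
        counts = Counter(d.lower().split())
        rows.append([counts[m] for m in dictionary])
    return np.array(rows, dtype=float)

dictionary = ["kernel", "svm", "margin", "graph"]            # hypothetical dictionary
docs = ["the svm kernel maximizes the margin",
        "a graph kernel for structured data"]
D = doc_term_matrix(docs, dictionary)
K = D @ D.T        # K[i, j] = phi(d_i) . phi(d_j)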

10.5.2 Strings
Character strings have received a lot of attention in computer science, and many approaches have been designed to quantify the similarity between two strings. In particular, SVMs are one of the machine learning techniques used to process DNA sequences, where string similarity is essential. This section provides an example of such a function.
Let's consider the case of the p-spectrum kernel. We intend to compare two strings, possibly of different lengths, by using their common sub-strings of length p. Let Σ be an alphabet; we denote Σ^p the set of strings of length p built on Σ and s1 s2 the concatenation of s1 and s2. We also denote |A| the number of elements in a set A. We can then define the following expression, for u ∈ Σ^p:

ϕpu (s) = |{(v1 , v2 ) : s = v1 uv2 }|

For a string s, we get one ϕ^p_u (s) per possible sub-string u, and ϕ^p_u (s) is zero for most u ∈ Σ^p. Thus, we are projecting a string s onto a vector space with |Σ|^p dimensions, and the components of the vector ϕ^p (s) are the ϕ^p_u (s). We can finally define a kernel with a simple dot product:

    k (s, t) = ϕ^p (s)·ϕ^p (t) = Σ_{u∈Σ^p} ϕ^p_u (s) ϕ^p_u (t)

Let's make this kernel explicit with p = 3 and the following strings: bateau, rateau, oiseau, croise, ciseaux. The elements of Σ³ leading to non-zero components are ate, aux, bat, cis, cro, eau, ise, ois, rat, roi, sea, tea. The lines in table 10.1 are the non-zero components of the vectors ϕ^p (s).
We can then represent as a matrix the values of the kernel for every pair of words, as shown in table 10.2.
120 CHAPTER 10. KERNELS

          ate aux bat cis cro eau ise ois rat roi sea tea
bateau      1   0   1   0   0   1   0   0   0   0   0   1
rateau      1   0   0   0   0   1   0   0   1   0   0   1
oiseau      0   0   0   0   0   1   1   1   0   0   1   0
croise      0   0   0   0   1   0   1   1   0   1   0   0
ciseaux     0   1   0   1   0   1   1   0   0   0   1   0

Table 10.1: 3-spectrum of the words bateau, rateau, oiseau, croise, ciseaux.

k bateau rateau oiseau croise ciseaux


bateau 4 3 1 0 1
rateau 3 4 1 0 1
oiseau 1 1 4 2 3
croise 0 0 2 4 1
ciseaux 1 1 3 1 5

Table 10.2: 3-spectrum kernel of the words bateau, rateau, oiseau, croise, ciseaux.
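The table above can be reproduced with a few lines of code; here is a sketch (Python, added for illustration and not taken from the original text) of the p-spectrum kernel:

from collections import Counter

def spectrum(s, p):
    # phi^p_u(s): number of occurrences of each substring u of length p in s.
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))

def k_spectrum(s, t, p):
    # k(s, t) = sum_u phi^p_u(s) * phi^p_u(t)
    fs, ft = spectrum(s, p), spectrum(t, p)
    return sum(fs[u] * ft[u] for u in fs)

words = ["bateau", "rateau", "oiseau", "croise", "ciseaux"]
K = [[k_spectrum(s, t, 3) for t in words] for s in words]   # reproduces table 10.2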

10.5.3 Other Examples


We can also design recursive kernels, which let us define the "dot product" of two structured objects (e.g. symbolic objects), such as graphs or trees. The subtlety and the difficulty consist in producing a number out of two structured objects while making sure that the function evaluating this number respects Mercer's property.
Chapter 11

Solving SVMs

11.1 Quadratic Optimization Problems with SMO


11.1.1 General Principle
There are several methods to solve the optimization problem defined in section 9.1.2. One of them is the SMO¹ algorithm. Even for this approach, several refinements can be found in the literature.
The core idea is to start from the KKT conditions of the dual problem defined page 110. We need to move in the αi space in order to maximize the objective function. The SMO algorithm consists in moving by changing pairs of αi. Equation L2, presented in 9.2.5, shows that if we keep all αi but two constant, the two we can change are linearly dependent: we can express one of them as a linear function of the other. This lets us rewrite the objective function as a function of a single αi only, and to compute its gradient.
The algorithm stops when some conditions, characterizing the optimality, are satisfied.
In practice, this technique is rich in algorithmic tricks to optimally choose the pairs of αi and increase the convergence rate. As such it tends to be rather difficult to implement. In the following, we present the technique from Keerthi et al. (1999), which is an improvement on the initial SMO algorithm presented in Platt (1998).

11.1.2 Optimality Detection


To detect the optimality, the idea is to start again from the dual problem defined page 110. This is a minimization problem (we multiply the objective function by −1 for that) under three constraints (the second one having two inequalities). Even if this problem is already the result of a Lagrangian resolution, we can use the Lagrangian technique again to solve it... but we are not running in circles, because this Lagrangian will help us express optimality conditions that will become the stopping criteria for an algorithm. And this algorithm solves directly the dual problem defined page 110. So, this Lagrangian is²:

    L (α, δ, µ, β) = (1/2) Σ_{j=1}^{N} Σ_{i=1}^{N} αj αi yj yi xj·xi − Σ_{i=1}^{N} αi − Σ_{i=1}^{N} δi αi + Σ_{i=1}^{N} µi (αi − C) − β Σ_{i=1}^{N} αi yi

For the sake of notation simplicity, we define:

    Fi = Σ_{j=1}^{N} αj yj xj·xi − yi

We can then express the KKT conditions, that is to say the zeroing of the partial derivatives of the Lagrangian. We can also express the additional KKT conditions, which are that when a multiplier is zero, then the constraint is not saturated, and when it is non-zero, then the constraint
1 Sequential Minimal Optimization
2 Be careful, the αi are now primal parameters and the δi , µi and the parameter β are the Lagrange multipliers.


is saturated. Thus, the product of the constraint and its multiplier is zero at the optimum, without the two factors being zero at the same time (see page 110 for a reminder). The multipliers are also all positive.

    ∀ 1 ≤ i ≤ N:    ∂L/∂αi = (Fi − β) yi − δi + µi = 0
                    δi αi = 0
                    µi (αi − C) = 0
These conditions get simpler when they are written in the following way, distinguishing three cases
according to the value of αi .
Case αi = 0 : So δi > 0 and µi = 0, thus

(Fi − β)yi ≥ 0

Case 0 < αi < C : So δi = 0 and µi = 0, thus

(Fi − β)yi = 0

Case αi = C : So δi = 0 and µi > 0, thus

(Fi − β)yi ≤ 0

Since yi ∈ {−1, 1}, we can separate the values of i according to the sign of Fi − β. This lets us define the following sets of indices:

I0 = {i : 0 < αi < C}
I1 = {i : yi = 1, αi = 0}
I2 = {i : yi = −1, αi = C}
I3 = {i : yi = 1, αi = C}
I4 = {i : yi = −1, αi = 0}
Isup = I0 ∪ I1 ∪ I2
Iinf = I0 ∪ I3 ∪ I4

Then:
i ∈ Isup ⇒ β ≤ Fi
i ∈ Iinf ⇒ β ≥ Fi
We can then define the following bounds on these sets:

bsup = min Fi
i∈Isup
binf = max Fi
i∈Iinf

In practice, when we iterate the algorithm, we will have bsup ≤ binf as long as we haven't reached the optimum, but at the optimum:

    binf ≤ bsup

We can reverse this condition, and say that we haven't reached the optimum as long as we can find two indices, one in Isup and the other in Iinf, which violate the condition binf ≤ bsup. Such an index pair defines a violation of the optimality conditions:
(i, j) such that i ∈ Isup and j ∈ Iinf is a violation if Fi < Fj (11.1)

Equation (11.1) is theoretical, since on a real computer we will never have, numerically, binf ≤ bsup at the optimum. We will satisfy ourselves with defining this condition "up to a bit", say τ > 0. In other words, the approximate optimality condition is:

binf ≤ bsup + τ (11.2)



Equation (11.1), which defines when a pair of indices violates the optimality condition, is then modified accordingly:

    (i, j) such that i ∈ Isup and j ∈ Iinf is a violation if Fi < Fj − τ        (11.3)

The criterion (11.3) will be tested to check whether we need to keep running the optimization algorithm, or if we can consider that the optimum has been reached.
Before closing this paragraph, finally aimed at presenting the stopping criteria of the algorithm we will define in the following sections, let's point out that, at the optimum, bsup ≈ binf ≈ β... and that this value is also the b of our separator! In section 9.2.5 page 111, we lamented that a closed-form solution for b was, until then, not made available by the Lagrangian solution.
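To make the stopping test concrete, here is a sketch (Python/NumPy, written for illustration and not taken from Keerthi et al.) of the computation of the Fi, the index sets, and the bounds bsup and binf used in criterion (11.2); K is assumed to be the precomputed Gram matrix of the samples.

import numpy as np

def optimality_gap(alpha, y, K, C, tau=1e-3):
    # F_i = sum_j alpha_j y_j k(x_j, x_i) - y_i
    F = K @ (alpha * y) - y
    I0 = (alpha > 0) & (alpha < C)
    I_sup = I0 | ((y == 1) & (alpha == 0)) | ((y == -1) & (alpha == C))
    I_inf = I0 | ((y == 1) & (alpha == C)) | ((y == -1) & (alpha == 0))
    b_sup = F[I_sup].min()
    b_inf = F[I_inf].max()
    # Approximate optimality (up to tau) holds when b_inf <= b_sup + tau.
    return b_inf - b_sup <= tau, b_sup, b_inf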

11.1.3 Optimisation Algorithm


The principle of the SMO optimization starts by noting that equation L2 (cf. 9.2.5) links the αi, since their sum, weighted by the yi, is constant. So, if we only allow two of these coefficients αi1 and αi2 to change (in [0, C]²), keeping the others constant, L2 implies that αi1 yi1 + αi2 yi2 + γ = 0, where γ is the constant contribution of the other coefficients. With αi2 = (−γ − αi1 yi1)/yi2, we can write the objective function of the dual problem (cf. page 110) as a function of αi1 only. We can then compute the point αi1⋆ for which this function is maximal³. We then update αi1 ← αi1⋆ and we can obtain αi2. It is nevertheless necessary to control these variations so as to make sure that (αi1, αi2) stays in [0, C]². Then, we select another pair of coefficients to update, etc.
So, most of the problem is how to choose the pairs (i1, i2). This is a thorny problem, whose solution influences how quickly we reach the maximum. The solution proposed in Keerthi et al. (1999) consists in considering all i1, to find out whether i1 ∈ Isup or i1 ∈ Iinf. For the selected i1, i2 will be the index belonging to the other set, for which the bound bsup or binf, according to which set contains i1, is reached. In this way, we work on the pair that most violates the optimality constraints. More refinements can be found in detail in the pseudo-code presented in Keerthi et al. (1999).
Numerically, the update of a pair can lead to negligible changes of the pair (αi1 , αi2 ), say smaller
in absolute value than a threshold θ. If all the updates generated by a traversal of the samples are
negligible, then we can stop the algorithm (stopping case #1). Furthermore, when we find out,
via equation (11.2), that the optimality is reached, then we also stop the algorithm (stopping case
#2).

11.1.4 Numerical Problems


Stopping case #1 depends on the value of θ, whereas case #2 depends on τ. But it is completely possible to stop due to case #1 without having reached optimality... This often occurs, for instance, when the optimization is playing on the 15th decimal.
To mitigate this, we need to choose the parameters τ and θ carefully. A good thing to check when the algorithm stops is the ratio of pairs of coefficients which violate the optimality criterion. When this ratio is high, we need to reduce θ. When, on the other hand, the solution seems obviously wrong whereas none of the pairs still violates the optimality criterion, then the value of τ is probably too high.

11.2 Parameter Tuning


The choice of the parameters best fitted to a given problem is far from intuitive. In the case of the classification problem we have seen until now, the parameter C needs to be determined. Similarly, when we work, for instance, with Gaussian kernels, the parameter σ must be tuned as well.
To this end, a brute-force approach consists in trying many values of the parameters⁴ and measuring the generalization error of the SVM with cross-validation. Finally, we select the parameters for which the generalization error is minimal.
3 It is a quadratic function for which we just need to zero the derivative.
4 Grid search method
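One common way to implement this grid search in practice is sketched below (Python with scikit-learn, which is one possible choice among others and not prescribed by the text; the parameter ranges and the toy dataset are arbitrary, and gamma plays the role of 1/(2σ²)).

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100],
              "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)   # 5-fold cross-validation
search.fit(X, y)
print(search.best_params_, search.best_score_)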
Chapter 12

Regression

Until now, we have only considered the problem of separating a corpus of samples into two classes, according to their labels −1 or +1. Regression consists in using labels with values in R and searching for a function that maps a vector to its label, based on the samples in the corpus.

12.1 Definition of the Optimization Problem


Similarly to previous sections, we will start by considering the case of a linear regression, and we will later generalize to other regressions by replacing dot products with kernels.
The regression we are presenting here takes its root in the fact that a linear function hw,b is good when it fits all the samples well, up to an error ε > 0. In other words:

    ∀i, |(w·xi + b) − yi| ≤ ε

Clearly, this constraint is too strong in general, and in practice we will optimize an objective function that authorizes some of the samples to violate the constraint. This is depicted in figure 12.1.
In a process similar to that of a linear separator, we reach the definition of the following optimization problem for a regression:

    find argmin_{w,b,ξ,ξ′}   (1/2) w·w + C Σ_{i=1}^{N} (ξi + ξ′i)

    subject to    yi − w·xi − b ≤ ε + ξi,    ∀(xi, yi) ∈ S
                  w·xi + b − yi ≤ ε + ξ′i,   ∀(xi, yi) ∈ S
                  ξi, ξ′i ≥ 0,               ∀i

12.2 Resolution
Solving this optimization problem is once again easier after switching to the dual problem, as was the case for the linear separator in chapter 9. Let αi and αi′ be the multipliers for the first two constraints of our optimization problem. The vector w of the regression function is given by:

    w_{α,α′} = Σ_{i=1}^{N} (αi − αi′) xi

with αi and αi′ solutions of the following dual problem:

    find argmax_{α,α′}   Σ_{i=1}^{N} yi (αi − αi′) − ε Σ_{i=1}^{N} (αi + αi′) − (1/2) w_{α,α′}·w_{α,α′}

    subject to    Σ_{i=1}^{N} (αi − αi′) = 0
                  αi, αi′ ∈ [0, C], ∀i


Figure 12.1: Linear regression. The white dots have abscissa xi, a one-dimensional vector, and ordinate yi. The dashed band represents the set of acceptable distances to the regression line, |w·x + b − y| ≤ ε, and not many samples are outside this set.

Once again, it "just" remains to apply an algorithm that will search for the maximum of this dual problem. Approaches similar to the SMO algorithm exist but lead to algorithms that are relatively hard to implement.
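In practice one therefore typically relies on an existing solver. As an illustration only (Python with scikit-learn, one possible choice; the toy data and the parameter values, loosely echoing figure 12.2, are arbitrary):

import numpy as np
from sklearn.svm import SVR

# 1D toy regression problem, in the spirit of figure 12.2.
X = np.linspace(-1, 1, 60).reshape(-1, 1)
y = np.sin(3 * X).ravel() + 0.1 * np.random.uniform(-1, 1, 60)

reg = SVR(kernel="rbf", C=10.0, epsilon=0.05, gamma=2.0)   # epsilon is the tolerance of the tube
reg.fit(X, y)
y_hat = reg.predict(X)
print("number of support vectors:", len(reg.support_))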

12.3 Examples
Figure 12.2 demonstrates the use of this type of SVM for regression in the case of 1D vectors, for
different kernels. Figure 12.3 gives an example in the case of 2D vectors.

Figure 12.2: Regression on 1D vectors, similar to figure 12.1. Left: using a standard dot product. Middle: using a 3rd degree polynomial kernel. Right: using a Gaussian kernel. The support vectors are marked with a cross. They are the vectors xi for which one of the pair (αi, αi′) is non-zero. They are the ones constraining the position of the regression curve. The tolerance [−ε, +ε] is represented with dashed lines.

Figure 12.3: Left: the function z = f(x, y) = exp(−2.5(x² + y²)) · cos(8 √(x² + y²)) that we used to generate the samples. Middle: 150 samples, obtained by randomly drawing x and y and defining the vector xi = (x, y) with a label yi = f(x, y) + ν, with ν drawn uniformly from [−0.1, 0.1]. Right: result of the regression on these samples, with a Gaussian kernel with variance σ = 0.25 and a tolerance ε = 0.05.
Chapter 13

Compendium of SVMs

Ultimately, the principle of the approaches seen so far is always the same: we define a quadratic
optimization problem that can be solved using only dot products of pairs of samples.

13.1 Classification
These approaches are called SVC for Support Vector Classification.

13.1.1 C-SVC
The C-SVC is the approach we've seen so far. The optimization problem is only given here as a reminder:

    find argmin_{w,b,ξ}   (1/2) w·w + C Σ_i ξi

    subject to    yi (w·xi + b) ≥ 1 − ξi, ∀i
                  ξi ≥ 0, ∀i

13.1.2 ν-SVC
The problem of a C-SVM is that C, which defines when to use the slack variables ξi, does not depend on the number of samples. In some cases, we might want to define the number of support vectors based on the number of samples instead of giving an absolute value. The parameter ν ∈ ]0, 1] is linked to the ratio of examples that can be used as support vectors¹. In a C-SVM, we always force the samples to be located outside the band [−1, 1]. Here, we choose a band [−ρ, ρ], and we adjust ρ to obtain the desired ratio of support vectors. This defines the ν-SVC problem.

    find argmin_{w,b,ξ,ρ}   (1/2) w·w − νρ + (1/N) Σ_i ξi

    subject to    yi (w·xi + b) ≥ ρ − ξi, ∀i
                  ξi ≥ 0, ∀i
                  ρ ≥ 0

Expressing the objective function this way is however not so intuitive, and how ν defines the ratio of samples used as support vectors is far from obvious. This can be justified by looking at the KKT conditions of this optimization problem (Schölkopf et al., 2000).

13.2 Regression
These approaches are named SVR for Support Vector Regression.
1 This ratio tends towards ν when we have many samples.


13.2.1 ε-SVR
The ε-SVR is the approach we presented earlier. The optimization problem is given here as a reminder.

    find argmin_{w,b,ξ,ξ′}   (1/2) w·w + C Σ_i (ξi + ξ′i)

    subject to    w·xi + b − yi ≥ −ε − ξi, ∀i
                  w·xi + b − yi ≤ ε + ξ′i, ∀i
                  ξi, ξ′i ≥ 0, ∀i

13.2.2 ν-SVR
Similarly to the ν-SVC, the purpose here is to modulate the width ε of the ε-SVR according to a parameter ν ∈ ]0, 1]. The objective is to define the number of samples outside a tube of radius ε around the regression function as a ratio ν of the total number of samples².

    find argmin_{w,b,ξ,ξ′,ε}   (1/2) w·w + C ( νε + (1/N) Σ_i (ξi + ξ′i) )

    subject to    w·xi + b − yi ≥ −ε − ξi, ∀i
                  w·xi + b − yi ≤ ε + ξ′i, ∀i
                  ξi, ξ′i ≥ 0, ∀i
                  ε ≥ 0
As was the case for ν-SVC, the justification of this ν-SVR formulation of the optimization problem
can be found in Schölkopf et al. (2000) and is a consequence of the KKT conditions resulting from
this problem.

13.3 Unsupervised Learning


SVMs can also be used to perform unsupervised machine learning (distribution analysis). In this case, we are seeking to build a novelty detector. We ask the SVM to describe the samples as clusters of points. Once these clusters have been defined, any sample falling outside of any cluster is deemed not to have been generated by the same distribution as the others. Such a different sample is then the result of a new phenomenon. This is particularly practical when we have many examples of our phenomenon, but not many counter-examples. For instance, in the case of the analysis of an EEG signal to predict epileptic crises, the data-set mostly describes states that do not precede a crisis, because crises are, thankfully, much less frequent than the normal state. One way to detect a crisis is to observe when a signal falls outside of the perimeter of the normal signals.

13.3.1 Minimal enclosing sphere


The idea behind these techniques is to enclose the samples x1, · · · , xN in a minimal-radius sphere, called the minimal enclosing sphere. Obviously, if we use a strict constraint, the first noisy sample far from the others will prevent the sphere from sticking to the bulk of the data-set. In practice, we might rather want to search for the smallest sphere containing α% of the data. This problem is NP-complete...
We can again use the trick of the slack variables, which gives an approximate solution to the full NP-complete problem:

    find argmin_{ω,r,ξ}   r² + C Σ_i ξi

    subject to    |xi − ω|² ≤ r² + ξi, ∀i
                  ξi ≥ 0, ∀i
Using the Lagrangian solution, we obtain a formula giving r and an indicator function whose value is 1 outside the sphere, both results fortunately using only dot products (cf. figure 13.1).
² At least, in practice, the ratio of samples outside a tube of radius ε around the function tends towards ν when the number of samples is high.



Figure 13.1: Minimal enclosing sphere. Left: using the standard dot product. Right: using a
Gaussian kernel with σ = 3. In both cases, C = 0.1 and the samples are randomly drawn in a
10 × 10 square.

As before, we can find a ν-version of this problem, in order to control the number of support vectors and thus the ratio of samples lying outside the sphere (Shawe-Taylor and Cristanini, 2004). The optimization problem is the same as before, setting C = 1/νN and then multiplying the objective function by ν³. Once again, the reason why this value of C leads to ν being effectively linked with the ratio of samples outside the sphere can be found by analysing the KKT conditions.

    find argmin_{ω,r,ξ}   νr² + (1/N) Σ_i ξi

    subject to    |xi − ω|² ≤ r² + ξi, ∀i
                  ξi ≥ 0, ∀i

13.3.2 One-class SVM


This case is a bit peculiar, although very much used. The goal is to find a hyperplane satisfying two conditions. The first one is that the origin must lie on the negative side of the plane, with the samples all on the positive side, up to the slack variables, as usual. The second condition is that this hyperplane must be as far as possible from the origin. In general, the advantage of this setup is not obvious. The samples could sit in a spherical pattern around the origin and no such hyperplane would exist. In practice, the one-class SVM is relevant for radial kernels such as Gaussian kernels. Remember that these kernels project the vectors onto a hypersphere centered on the origin (see paragraph 10.2.3). The origin is thus "naturally" far from the samples for this type of kernel, and the samples are clustered on a part of the hypersphere. We can isolate this part by pushing the hyperplane as far as possible from the origin (cf. figure 13.2).

    find argmin_{w,ξ,ρ}   (1/2) w·w − ρ + (1/(νN)) Σ_i ξi

    subject to    w·xi ≥ ρ − ξi, ∀i
                  ξi ≥ 0, ∀i
Once again, the parameter ν is linked to the ratio of samples on the negative side of the
hyperplane. This is justified in Schölkopf et al. (2001) using arguments on the KKT conditions of
the problem.

3 which does not change anything to the result.



Figure 13.2: One class SVM. The samples are the same as in figure 13.1. We use a Gaussian kernel
with σ = 3 and ν = 0.2, which means that around 20% of the samples are outside the selected
region.
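As an illustration of this formulation, here is a hedged sketch (Python with scikit-learn, one possible implementation, not prescribed by the text); the parameters echo figure 13.2, with gamma standing for 1/(2σ²) and the uniform toy samples replacing the original data.

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))              # samples drawn in a 10 x 10 square

det = OneClassSVM(kernel="rbf", gamma=1.0 / (2 * 3.0 ** 2), nu=0.2)
det.fit(X)
labels = det.predict(X)          # +1 inside the learned region, -1 outside
print("fraction outside:", np.mean(labels == -1))  # roughly nu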
Part IV

Vector Quantization

Chapter 14

Introduction and notations for


vector quantization

14.1 An unsupervised learning problem


14.1.1 Formalization as a dummy supervised learning problem
The vector quantization techniques belong to the class of unsupervised learning methods. A synthetic overview of these methods can be found in (Fritzke, 1997). Here, let us rather introduce vector quantization as a degenerate case of supervised learning, where a specific set of hypotheses H is used.
Let us consider a random variable X whose values are in X, from which the learning dataset is sampled. There is no labeling here, since only the distribution of inputs is of interest. Let us consider a dummy labeling process, i.e. an oracle, that is the identity function. This oracle is the conditional distribution P (y | x) = δx,y, and the random variable providing the samples with a supervised learning flavor is thus (X, X). Let us consider a loss function L ∈ (R⁺)^{X×X}. The quadratic loss is usually considered in vector quantization, but this is not mandatory.
Let us approximate the oracle (the identity...) with a hypothesis h ∈ H ⊂ X^X. The set of hypotheses H is taken as the following set of functions

    H = { hΩ, Ω ∈ P_{N⋆}(X) | hΩ (x) = argmin_{ω∈Ω} L (ω, x) }

where P_{N⋆}(X) is the set of finite subsets of X. In other words, hΩ is defined according to a set of values Ω = {ω0, · · · , ωi, · · · , ωK}. It computes its output as the ωi which is the closest to the input. The ωi's are called the prototypes. Not all hypotheses in H are required to use the same number of prototypes for their computation.
Let us define the distortion induced by a set of prototypes Ω as R (hΩ ). It is the expectation,
when a sample x is taken according to X, of the error made by assimilating x to its closest
prototype in Ω. It can be approximated by an empirical risk RSemp (hΩ ) measured on a dataset
S = {x1 , · · · , xi , · · · , xN }, actually viewed here as S = {(x1 , x1 ), · · · , (xi , xi ), · · · , (xN , xN )}. This
empirical risk is considered here as the distortion induced by hΩ on the data.
Since we allow for arbitrarily large Ωs in this definition, it is obvious that, when the values of
X are bounded, having a huge set Ω of prototypes uniformly spread over the values taken by X
enables to have R (hΩ ) as small as wanted. Therefore, as opposed to real supervised learning, the
goal here is not only to reduce the distortion, but rather to have a minimal distortion when only
few prototypes (i.e. |Ω| < K) are allowed. In this case, the prototypes for which this minimal
distortion is obtained are “well spread” over the data.
Handling a discrete collection of few prototypes gives the method its name of vector quantiza-
tion.


14.1.2 Choosing the suitable loss function

In the previous definition, where a dummy supervised learning problem is used to describe an unsupervised learning problem, the loss function L is crucial.

This is the place for adding the semantics of the problem in the algorithms. Even
if the vast majority of references in the literature use X = Rn and L (x, x0 ) =
(x − x0 ).(x − x0 ), the central role of the loss has to be kept in mind when vector
quantization is applied to real problems.

Let us take the example of handwritten digit recognition, where inputs are digits provided as gray-scaled 8 × 8 images. In this case, X = [0, 1]⁶⁴, where 0 stands for white and 1 for black. Let us consider the three inputs x1, x2 and x3 depicted in figure 14.1. Using the default loss function mentioned above, the following stands: L (x1, x2) = L (x2, x3) = L (x1, x3) = 20. In other words, x1 x2 x3 forms an equilateral triangle in [0, 1]⁶⁴. An appropriate design should have led to L (x1, x2) < L (x1, x3), since samples x1 and x2 look very similar to each other. Figure 14.2 shows the inadequacy of the ℓ2 norm as well.

[Figure panels: Sample x1 | Sample x2 | Sample x3]

Figure 14.1: Three digit inputs. Each digit is made of 11 black pixels. Each pair of digits is such
that the digits have only one black pixel in common.

Figure 14.2: The original image is on the left. The three other images are respectively obtained
from it by a shift, the adding of black rectangles and a darkening. In each of these three cases, the
pixel-wise `2 distance to the original image is the same, while the visual alteration we experiment
as observers is not. The illustration is taken from http://cs231n.github.io/classification/.

14.1.3 Samples
Voronoı̈ subsets
In real cases, the random variable X is unknown. It is supposed to drive the production of the
dataset S = {x1 , · · · , xi , · · · , xN }. Since hΩ in H consists in returning the prototype which is
the closest to the given argument, gathering the samples according to the labels given by hΩ is
meaningful. This leads to the definition1 of Voronoı̈ subsets as follows:
 
    V_Ω^S (ω) ≝ { x ∈ S | argmin_{ω′∈Ω} L (ω′, x) = ω }

    ∀(ω, ω′) ∈ Ω², ω ≠ ω′ ⇒ V_Ω^S (ω) ∩ V_Ω^S (ω′) = ∅
    ∪_{ω∈Ω} V_Ω^S (ω) = S

As the Voronoï subsets form a partition of S, the empirical distortion R_emp^S (hΩ) can be decomposed on each of them:

    R_emp^S (hΩ) = (1/N) Σ_{x∈S} L (x, hΩ (x))
                 = (1/N) Σ_{ω∈Ω} Σ_{x∈V_Ω^S (ω)} L (x, ω)
                 = (1/N) Σ_{ω∈Ω} 𝒱_Ω^S (ω)

    where 𝒱_Ω^S (ω) ≝ Σ_{x∈V_Ω^S (ω)} L (x, ω)

Let us call VΩS (ω) the Voronoı̈ distortion caused by ω, since it is the contribution of the samples
“around” ω to the global distortion RSemp (hΩ ). The relevance of Voronoı̈ distortion in the control
of vector quantization algorithms has been introduced in (Frezza-Buet, 2014), it is detailed in
forthcoming paragraphs.
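These quantities are straightforward to compute; here is a short sketch (Python/NumPy, added for illustration, using the quadratic loss as an example):

import numpy as np

def voronoi_distortions(S, prototypes):
    # S: (N, d) samples, prototypes: (K, d). Quadratic loss L(x, w) = |x - w|^2.
    d2 = ((S[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)   # (N, K)
    closest = d2.argmin(axis=1)                    # index of h_Omega(x) for each sample
    V = np.array([d2[closest == k, k].sum() for k in range(len(prototypes))])
    R_emp = V.sum() / len(S)                       # empirical distortion
    return V, R_emp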

Modeling the sample set as the result of a rejection sampling process


Let us consider that X is bounded, in order to allow for choosing some x uniformly from it, i.e. x ⇠ U_X. Let us define some density function p ∈ [0, 1]^X, such that p (x)/∫_X p (x) dx is the probability density of the random variable X.
Even if, in real situations, S is given, let us model it as the result of the rejection sampling process described in algorithm 10. The point is that |S| ≤ M. Let us rename S as S^M when we want to stress that the data set is obtained from a rejection sampling process involving M attempts.
The point to be noticed is that, for a given set of prototypes Ω and a given density function p, the Voronoï distortions are approximately proportional to the number of attempts in the rejection sampling process.

    ∀ω ∈ Ω, 𝒱_Ω^{S^{k×M}} (ω) ≈ k × 𝒱_Ω^{S^M} (ω)        (14.8)

14.2 Minimum of distortion


Let us suppose in this section that the number of prototypes is constant, i.e. |Ω| = κ ∈ N. The goal of vector quantization techniques is to find the prototypes for which the distortion induced on some data set is minimal. Let us call Ω⋆κ the result:

    Ω⋆κ ∈ argmin_{Ω∈Pκ(X)} R_emp^S (hΩ)        (14.9)

where Pκ (X) denotes the set of all κ-sized subsets of X.

1 The argmin operator is supposed to return a single element in the definition, which may be false theoretically.

This case is not addressed for the sake of clarity, since this is not a big issue in real cases.

Algorithm 10 Computation of S.
1: S ← ∅ // Start with an empty set.
2: for i ← 1 to M do
3: // Let us consider M attempts to add a sample in S.
4: x ⇠ U_X , u ⇠ U_[0,1[ // Draw a random place x in X and a uniform u in [0, 1[.
5: if u < p (x) then
6: // The test will pass with a probability p (x).
7: S ← S ∪ {x} // x is kept (i.e. not rejected).
8: end if
9: end for
10: return S
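A direct transcription of algorithm 10 (Python/NumPy, written for illustration, with X = [−0.5, 0.5]² as in the experiments of this section):

import numpy as np

def rejection_sampling(p, M, rng=np.random.default_rng()):
    # Each of the M attempts draws x uniformly in X = [-0.5, 0.5]^2
    # and keeps it with probability p(x), so |S| <= M.
    S = []
    for _ in range(M):
        x = rng.uniform(-0.5, 0.5, size=2)
        if rng.uniform() < p(x):
            S.append(x)
    return np.array(S)

S = rejection_sampling(lambda x: 1.0, M=50000)    # uniform density, as in figure 14.4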


14.2.1 Non unicity


The fact that Ω?κ is not unique may occur if the dataset configuration is symmetrical, as illustrated
on figure 14.3.


Figure 14.3: Here, X = [−0.5, 0.5]2 . S is represented by smaller dots, while Ω is the larger ones.
Left and right figures show two distinct optimal configurations for κ = 6.

14.2.2 Sensitivity to the density of input samples


Let us experiment with the computation of the Voronoï distortion when Ω⋆κ is reached. In the figures, Ω⋆κ is computed by algorithm 14 presented in section 15.1. In the experiment, X = [−0.5, 0.5]² and κ = 500 are used. The dataset is built from algorithm 10, using M = 50000. On the figures, only 5000 inputs are actually displayed, for the sake of clarity. The first experiment consists in using p = 1. On such a uniform distribution, figure 14.4 shows that prototypes are uniformly spread over the surface of X (left). The values of 𝒱_{Ω⋆κ}^{S^50000} (ω) for the κ prototypes ω are also represented (middle) as well as their histogram (right). It is observed experimentally that the Voronoï distortion is almost equally shared between the prototypes when the positions of the prototypes minimize the global distortion. Note that fluctuations in the Voronoï distortion values can be observed, in spite of the huge number of input samples, suggesting that this heterogeneity is not a discretization effect but reflects the nature of the Ω⋆κ for that input configuration².
Let us now observe the effect of density variations in the examples, using the density function p shown in figure 14.5 to set up the sample set. This leads to figure 14.6. It shows that, in spite
2 As far as the author knows, no mathematical results are available to support or contradict this observation.

[Figure 14.4 panels: prototypes and at most 5000 samples | Voronoi error map | Voronoi error histogram]

Figure 14.4: Vector quantization of a uniform distribution by κ = 500 prototypes. S^M is represented by the smaller dots, while Ω⋆κ is the larger ones. The darkened range in the histogram is the shortest interval containing 50% of the values.

of some small fluctuations, the Voronoï distortion is almost equally shared between the prototypes as well when Ω⋆κ is reached.

[Figure: surface plot of the density p over X = [−0.5, 0.5]²]

Figure 14.5: Non uniform density function.

The value of the equal distortion share, i.e. the center of the darkened range in figures 14.4-right and 14.6-right, is obtained from a large number of weak sample contributions around prototypes where p is high, while it is obtained from a smaller number of stronger sample contributions around prototypes where p is low.

In other words, the organization of the prototypes when optimality is achieved is


sensitive to the density of the samples: prototypes are sparser where the density is
lower.

14.2.3 Controlling the quantization accuracy


It has been shown in the previous section that, when the minimum of distortion is reached by a set
of prototypes, the Voronoı̈ distortion is shared, almost equally, between the prototypes. The actual

[Figure 14.6 panels: prototypes and at most 5000 samples | Voronoi error map | Voronoi error histogram]

Figure 14.6: Same as figure 14.4, using a non uniform density function p.

value of this share (e.g 0.03 in figure 14.4, 0.013 in figure 14.6) appears to give a hint about the quantization accuracy: the more numerous the prototypes are, the lower the share for each one is. Let us use the value of this share to control the quantization accuracy. The samples are supposed to be tossed according to algorithm 10, and thus S is rather referred to as S^M.
Equation (14.8) states that 𝒱_{Ω⋆κ}^{S^M} (ω) is proportional to M. Let us denote by T the proportionality coefficient.

    ∀ω ∈ Ω⋆κ, 𝒱_{Ω⋆κ}^{S^M} (ω) ≈ M T,   T ∈ R⁺        (14.10)

The coefficient T can be viewed as a target, determined in advance in order to set the accuracy of the quantization. Once T is fixed, a targeted vector quantization process consists of setting the appropriate number of prototypes κ, knowing M, such that each of the measured 𝒱_{Ω⋆κ}^{S^M} (ω), ω ∈ Ω⋆κ, is close to the value M T. This is what the VQ-T algorithm does (see algorithm 11). The use of the shortest confidence interval allows the extraction of the "main" values from a collection (Guenther, 1969). The implementation of algorithm 11 is naive, since it can easily be improved by a dichotomic approach.

Algorithm 11 VQ-T(T, S^M)
1: κ ← 1 // Start with a single prototype; [· · ·] denotes lists.
2: repeat
3:   Compute Ω⋆κ according to eq. (14.9) // Use ⟨your favorite VQ algorithm⟩ here
4:   (a, b) ← shortest confidence interval([𝒱_{Ω⋆κ}^{S^M} (ω)]_{ω∈Ω⋆κ}, δ) // Use δ = .5
5:   if T.M < a then
6:     κ ← κ + 1
7:   else if T.M > b then
8:     κ ← κ − 1
9:   end if
10: until T.M ∈ [a, b]
11: return Ω⋆κ

The value of T can be chosen by trial and error, but considering a geometrical interpretation is worth it. Let us consider X = [−0.5, 0.5]², the quadratic loss L (x, ω) = (x − ω)² and a uniform distribution (i.e. p = 1). Let us suppose that, in such a situation, the desired quantization accuracy consists of κ⋆ prototypes. The prototypes are the elements of Ω⋆κ⋆ obtained from a vector quantization algorithm. The quantization accuracy in figure 14.4, where p = 1, actually corresponds to κ⋆ = 500. Let us consider the Voronoï tessellation induced by the prototypes. The Voronoï tessellation is the partition of X into κ⋆ cells Cω ≝ { x ∈ X | hΩ⋆κ⋆ (x) = ω }. Let us also consider Sω ≝ { x ∈ S | hΩ⋆κ⋆ (x) = ω } similarly. It can be considered that the Voronoï

Algorithm 12 shortest confidence interval(V, δ)


Require: δ ∈ [.5, 1] // V = [vi ]0≤i<N is the list of N values.
1: l ← int (δ.N ) // int (x) stands for the closest integer to x
2: sort V in an increasing order
3: // r is set greater than any possible interval length.
4: r ← vN −1 − v0 + 1
5: for k ← 0 to N − l do
6: // Values from k to k + l are a fraction δ of all the values.
7: // r0 is the width of the range of these values.
8: r0 ← vk+l−1 − vk
9: if r0 < r then
10: r ← r0 , j ← k // The minimum found is saved.
11: end if
12: end for
13: return (vj , vj+l−1 )

tessellation divides the area of X, which is 1 here, into κ⋆ parts with the same surface, i.e. 1/κ⋆ each. Let us approximate the shape of each cell Cω by a disk centered at ω. The radius r is such that the surface of the disk is the area of the cell, i.e. πr² = 1/κ⋆, i.e. r² = 1/(πκ⋆). The quadratic momentum of the disk³ is πr⁴/2. The variance µ of the disk is the momentum divided by the disk area, i.e. µ = (πr⁴/2)/(πr²) = r²/2 = 1/(2πκ⋆). The variance ν of any Sω approximates the variance µ. By definition, ν = 𝒱_Ω^{S^M} (ω)/|Sω|. As p is uniform, there are exactly M samples in S^M and they are equally shared among the Sω. Therefore, |Sω| ≈ M/κ⋆. So the variance can be re-written as ν ≈ κ⋆ 𝒱_Ω^{S^M} (ω)/M. Identifying µ with ν leads to µ ≈ κ⋆ 𝒱_Ω^{S^M} (ω)/M and thus 𝒱_Ω^{S^M} (ω) ≈ µM/κ⋆. Identifying the latter expression with equation (14.10) leads to T = µ/κ⋆. As µ = 1/(2πκ⋆), we have

    T = 1/(2πκ⋆²)

In the case of figure 14.4, since κ⋆ = 500, we have T = 6.37 × 10⁻⁷ and thus, from equation (14.10) with M = 50000, 𝒱_Ω^{S^M} (ω) ≈ 50000 × 6.37 × 10⁻⁷ ≈ 0.0318. This value actually lies within the darkened range in figure 14.4.
Let us now apply algorithm 11 with the density p depicted in figure 14.5, with M = 50000 and T = 6.37 × 10⁻⁷ (i.e. a target M T = 0.0318). This leads to κ = 343. The configuration is displayed in figure 14.7. Comparing the upper-right regions of figures 14.7 and 14.6 shows that in figure 14.7, the accuracy is similar to the desired one, i.e. that of figure 14.4.

[Figure 14.7 panels: prototypes and at most 5000 samples | Voronoi error map | Voronoi error histogram]

Figure 14.7: Same as figure 14.6 except that κ = 343.

³ The quadratic momentum is ∫_disk |x − G|² dx, with G the center of the disk.

14.3 Preserving topology


The input space X has a topology induced by the loss function L, since it can usually allow for
the definition of neighborhoods in X . Nevertheless, the actual distribution of inputs in X may
not span the entire X , but may rather lie on a manifold. For example, if X = R2 , the inputs may
lie on a circle, that is a 1-D sub-manifold of R2 . Preserving topology consists in building up a
structure around the prototypes in Ω in order to reflect the topology of the manifold where the
samples are extracted from. This structure is a graph, where the prototypes are the vertices, and an edge between two prototypes means that they are actually neighbours within the manifold. Lots of vector quantization algorithms handle the prototypes as a graph.

14.3.1 Notations for graphs


Let us here set up notations for graphs, since all the vector quantization algorithms presented next consist in handling a graph. A graph consists of vertices and edges. The vertices are a finite set V = { v1, · · · , vi, · · · , v|V| } ⊂ N containing vertex identifiers. The edges are non-oriented and
connect distinct vertices. An edge, here, is thus a 2-sized set of vertices. Let us denote an edge
{v, v 0 } = {v 0 , v} by ev↔v0 or ev0 ↔v . Let E be the finite set of the edges of the graph.
Let us define the set of the neighbors of a vertex v and the corresponding edges as

N (v) = {v 0 ∈ V | ev↔v0 ∈ E}
E (v) = {ea↔b ∈ E | a = v or b = v}

Extending the graph with a new vertex simply consists in adding a new element (that was not
in V) in V.
One notation issue comes from the need to anchor values to both vertices and edges. In other
words, we need these objects to have “attributes”, in the programming sense of the term. This can
be represented by functions. For example, as prototypes are handled by vertices in the algorithms,
a function proto ∈ X^V is defined, such that ω = proto (v) is the prototype hosted by v. Edges may carry an age (an integer). In that case, the function age ∈ N^E defines the age age (ev↔v′) of the edge ev↔v′. Some other attributes/functions may be used further on.
Last, attribute assignment means that the function is changed. It is denoted by ←. Changing the prototype handled by a vertex is thus denoted by proto (v) ← ω. It means that a new proto is now considered. It is identical to the previous one, except that from now on, for the value v, proto (v) = ω.
The notation for the hypothesis space can be set from vertices, rather than from prototypes, for the sake of clarity in the algorithms. Indeed, let us denote:

    hV (x) = argmin_{v′∈V} L (proto (v′), x)        (14.13)

We will allow the following writings, without ambiguity:

    ω = hΩ (x)
    v = hV (x)

14.3.2 Masked Delaunay triangulation


We have already mentioned the Voronoı̈ tessellation earlier. The tessellation is a partition of the
space into convex regions, delimited by hyperplanes, so that all the points in a cell are closer to
one prototype. Figure 14.8 illustrates this in the R2 case.
The Delaunay triangulation can be obtained as the dual of the Voronoï tessellation as follows: if two cells share a common boundary hyperplane, the prototypes of those cells are connected in the Delaunay triangulation. This is what figure 14.9, on top, shows. Let us notice that the input samples are not uniformly distributed in X = R², since they lie within a two-piece distribution. In Martinez and Schulten (1994), the interest of the masked Delaunay triangulation has been highlighted, since it reflects the topology of the inner manifold. The masked Delaunay triangulation is

Figure 14.8: Top: vector quantization of the input samples (grey dots) by 50 prototypes (blue dots). Bottom: Voronoï tessellation (green). Each cell is the region where points are closer to the central prototype than to the other ones.

Figure 14.9: Top: Delaunay triangulation (Voronoı̈ tessellation dual). Middle: Masked Delau-
nay triangulation. Bottom: Approximated masked Delaunay triangulation obtained by CHL (see
algorithm 13).

a sub-graph of the Delaunay triangulation that only keeps edges covering the manifold "well" (see middle of figure 14.9).
Computing Delaunay triangulations geometrically is feasible, but determining, from a set of samples, which edges actually belong to the masked Delaunay triangulation is not obvious. Moreover, many vector quantization algorithms build the masked Delaunay triangulation incrementally. In that context, the very simple competitive Hebbian learning algorithm (Martinez and Schulten, 1994) is the basis of these algorithms (see algorithm 13). It leads to the graph at the bottom of figure 14.9. As one can see, some edges are missing, since the graph obtained is not a triangulation.

Figure 14.10: Left: Voronoı̈ tesselation for nearly co-cyclic points. Right: The second order Voronoı̈
tesselation.

Missing edges often correspond to four points that almost lie on the same circle. Such points are depicted in figure 14.10. On the left, the Voronoï tessellation is shown for prototypes A, B, C, D. Regions A and B, B and C, C and D, D and A, as well as A and C, share a common edge. The Delaunay triangulation is made of the 5 segments [AB], [BC], [CD], [DA], [AC]. On the right plot in figure 14.10, the second order Voronoï tessellation is depicted (in red). Each point in a cell in that plot has the same two closest prototypes. In other words, in the CHL procedure (algorithm 13), if a tossed sample belongs to one of the second order Voronoï cells, the corresponding edge of the Delaunay triangulation is created. It can be seen in figure 14.10-right that the creation of the edge AC is very unlikely with CHL, since the corresponding region, i.e. the central cell, is tiny for almost co-circular points. This explains why some edges are missing when comparing the middle and bottom plots in figure 14.9.

14.3.3 Structuring raw data


The main power of vector quantization is its ability to bridge the gap between the analogical and symbolic worlds. In other words, the input data are drawn according to some input distribution, which is a continuous object, whereas the result of the process is a graph4, which is a symbolic object. This enables to talk about the distribution in symbolic terms such as “the distribution contains two separated parts”, “the distribution is made of two cycles”, etc. Putting words onto the analogical input received by sensors is what humans do when they speak about what they perceive, and vector quantization can be viewed as a step in that direction for artificial systems.
Let us illustrate this on the example of digit recognition. When someone recognizes a shape as the digit 9, s/he can explain that the shape is a nine because it is made of a cycle with a tail pending on the right. So if we consider the distribution of the pixels belonging to the shape of

4 A set of prototypes, without any edges, is a graph as well.

Algorithm 13 Competitive Hebbian Learning, CHL(Ω, n)


Require: V is a set of vertices hosting prototypes, |V| ≥ 2.
1: E ← ∅, i ← 0.
2: while i < n do
3: i←i+1
4: Sample some x.
5: find v = hV (x) // See eq. (14.13)
6: find v 0 = hV\{v} (x)
7: if ev↔v0 ∈/ E then
8: i←0
9: E ← E ∪ {ev↔v0 }
10: end if
11: end while
12: return E

a nine in a picture, a topology preserving vector quantization may lead to a graph from which a
cycle and a tail can be extracted (see figure 14.11).
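As an illustration (this is not part of the original algorithm description), a minimal Python sketch of the CHL procedure of algorithm 13 is given below. It assumes that the loss L is the squared Euclidean distance, that prototypes are stored in a plain list, and that edges are stored as index pairs; the ring-shaped sampler is only a hypothetical example.

import numpy as np

def chl(prototypes, sample_input, n):
    # Competitive Hebbian Learning (algorithm 13), assuming a squared Euclidean loss.
    # prototypes: list of d-dimensional vectors, sample_input(): draws one sample x.
    # The loop stops once n consecutive samples have created no new edge.
    edges = set()
    i = 0
    while i < n:
        i += 1
        x = sample_input()
        # Competition: indices of the two closest prototypes (eq. 14.13).
        d = [np.sum((p - x) ** 2) for p in prototypes]
        v, v2 = np.argsort(d)[:2]
        e = (min(v, v2), max(v, v2))   # undirected edge
        if e not in edges:
            edges.add(e)
            i = 0                      # reset the counter, as in algorithm 13
    return edges

# Hypothetical usage on a 2D ring-shaped distribution.
def sample_ring():
    theta = np.random.uniform(0, 2 * np.pi)
    r = np.random.uniform(0.8, 1.0)
    return np.array([r * np.cos(theta), r * np.sin(theta)])

protos = [sample_ring() for _ in range(30)]
print(len(chl(protos, sample_ring, n=200)), "edges created")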

Figure 14.11: The pixels of the digit 9 can be structured as a cycle with pending tail on the right.
Chapter 15

Main algorithms

In this chapter, main vector quantization techniques are presented. Nevertheless, the reader should
keep in mind that lots of variations of those algorithms are available in the literature. An overview
of vector quantization algorithms can also be found in Fritzke (1997).
The reader is invited to refer to paragraph 14.3.1 for notations related to graphs.

15.1 K-means
15.1.1 The Linde-Buzo-Gray algorithm
The k-means algorithm (Lloyd, 1982; Linde et al., 1980; Kanungo et al., 2002) is certainly the most famous vector quantization algorithm, since it is implemented in any numerical processing framework. It considers a set of samples a priori and computes the position of k prototypes so that they minimize the distortion (see algorithm 14). The idea is to update the prototypes so that each of them is the mean of the samples that lie in its Voronoï region.
It is batch, since it works on a set given as one bulk of data, and k is a parameter that has to be determined by the user. Line 6 of algorithm 14 consists of cloning some existing prototypes. Here, cloning a vertex means creating a new vertex hosting a random prototype which is very close to the prototype of the initial vertex. The new vertex is added into V. Reaching the stopping condition has been proven, but the result may be a local minimum of the distortion. Figures 14.4, 14.6 and 14.7 were obtained by the use of this algorithm.

Algorithm 14 k-means
1: Sample S = {x1 , · · · , xi , · · · , xN } according to p.
2: Compute ω1 as the mean of S.
3: V = {v} such that proto (v) = ω1 // Let us start with a single vertex
4: while |V| < k do
5: Select randomly n = min (k − |V| , |V|) vertices in V.
6: Clone these n vertices (the clones are added in V).
7: repeat
8: ∀ x ∈ S, label (x) ← hV (x)
9: ∀ v ∈ V, proto (v) ← mean{x∈S | label(x)=v} x
10: until No label change has occurred.
11: end while
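To make algorithm 14 concrete, here is a minimal Python sketch under the assumption of a squared Euclidean distortion; the cloning noise level and the toy dataset are arbitrary illustration choices, not part of the original algorithm.

import numpy as np

def lbg_kmeans(samples, k, clone_noise=1e-3, seed=0):
    # Linde-Buzo-Gray flavour of k-means (algorithm 14). samples: (N, d) array.
    rng = np.random.default_rng(seed)
    protos = [samples.mean(axis=0)]              # start with a single prototype
    while len(protos) < k:
        n = min(k - len(protos), len(protos))    # clone n existing vertices
        for v in rng.choice(len(protos), size=n, replace=False):
            protos.append(protos[v] + clone_noise * rng.standard_normal(samples.shape[1]))
        labels = None
        while True:                              # Lloyd iterations until labels are stable
            d = ((samples[:, None, :] - np.asarray(protos)[None, :, :]) ** 2).sum(axis=2)
            new_labels = d.argmin(axis=1)
            if labels is not None and np.array_equal(labels, new_labels):
                break
            labels = new_labels
            for j in range(len(protos)):
                members = samples[labels == j]
                if len(members) > 0:
                    protos[j] = members.mean(axis=0)
    return np.asarray(protos)

# Hypothetical usage: 500 samples drawn from two Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (250, 2)), rng.normal(5, 1, (250, 2))])
print(lbg_kmeans(X, k=8).shape)   # -> (8, 2)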

15.1.2 The online k-means


There is an online version of the Linde-Buzo-Gray algorithm (MacQueen, 1967), that is not really
useful for actually processing data. Nevertheless, the structure of the algorithms presented further
reflects the one of this online k-means. Algorithm 15 is very easy to program.


Algorithm 15 k-means online


1: Sample Ω = {ω1 , · · · , ωi , · · · , ωk } according to p.
2: Make V = {v1 , · · · , vi , · · · , vk } such that ∀i, proto (vi ) = ωi .
3: while true do
4: Sample x according to p.
5: Determine v ? = hV (x). // Competition.
6: proto (v ? ) ← proto (v ? ) + α(x − proto (v ? )) // learning, α ∈]0, 1].
7: end while

There is no real stopping criterion. After a while, the prototypes hosted by the vertices in V are placed so as to minimize the distortion1. Line 5 selects the vertex whose prototype is the closest to the input sample. This stage is a competition. Line 6 says that the winning vertex v⋆ is the only one whose prototype is modified after the processing of the current input. This is called a winner-take-all learning rule.
The update of the winning vertex (line 6) is performed by a low-pass first order recursive filter. This computes each proto(v) as the mean of the input samples for which v hosted the closest prototype. The same idea motivates the Linde-Buzo-Gray algorithm, as stated previously. Increasing α makes the prototypes shake, whereas smaller α leads to more stable positions. A good compromise could be to use a large α = 0.1 for the first steps, allowing the prototypes to roughly take their positions, and then use a much smaller α = 0.005.
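A minimal Python version of algorithm 15 with this two-stage learning rate schedule could look as follows (the particular values 0.1 and 0.005 are simply the ones suggested above, not mandatory):

import numpy as np

def online_kmeans(sample_input, k, n_steps=10000):
    # Online k-means (algorithm 15): competition then winner-take-all update.
    protos = np.array([sample_input() for _ in range(k)])        # lines 1-2
    for t in range(n_steps):
        alpha = 0.1 if t < n_steps // 10 else 0.005              # rough then fine tuning
        x = sample_input()                                       # line 4
        winner = np.argmin(np.sum((protos - x) ** 2, axis=1))    # line 5, competition
        protos[winner] += alpha * (x - protos[winner])           # line 6, learning
    return protos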

15.2 Incremental neural networks


One issue with vector quantization is to find the appropriate number of vertices. In k-means, this number is determined a priori. On the contrary, in incremental neural network algorithms, the idea is to increase the number of vertices until some stopping criterion is achieved.

15.2.1 Growing Neural Gas


The growing neural gas (GNG) was proposed in Fritzke (1995a). It consists of handling a graph of prototypes, by successively cloning the current vertices and updating their connections thanks to the competitive Hebbian learning presented in section 14.3.2. Successive stages of the GNG evolution are shown in figure 15.1, and the GNG procedure is algorithm 16.
In order to control the number of prototypes and adjust it when the input distribution changes, a GNG-T algorithm was proposed (Frezza-Buet, 2014) as an extension of GNG, taking advantage of the accuracy control introduced in section 14.2.3.

15.2.2 Growing Grid


With GNG-inspired algorithms, the topology of the manifold where the input samples are lying is retrieved by competitive Hebbian learning, as figure 15.1 shows. Another approach consists of maintaining the graph as a growing grid (Fritzke, 1995b), thus ignoring the actual topology of the data. The relevance of forcing the topology is detailed in section 15.3, where self-organizing maps are introduced. Here, the idea is to start with four prototypes, connected as a square, and to add a full row or a full column when a growth of the network is required. The connections are created when the row (or the column) is added, and they are not updated afterwards, as opposed to GNG. Figure 15.2 shows the successive stages of the Growing Grid evolution.

1 Algorithm 15 is indeed a stochastic gradient descent.



Figure 15.1: Successive evolution steps of GNG from a 3D input sample distribution. Dots are the input samples. The graph is represented as a red grid, showing edges that intersect at vertices. Each vertex v is placed at the position of proto(v).

Algorithm 16 Growing Neural Gas

1: Choose randomly (ω, ω′) ∈ X^2, set up V = {v, v′} such that proto(v) = ω and proto(v′) = ω′.
2: dist(v) ← 0, dist(v′) ← 0 // New vertices have not accumulated local distortion yet.
3: i ← 0 // this counts the samples.
4: E ← ∅ // Empty edge set at start.
5: while stopping criterion not met do
6:   Sample some x according to p, i ← i + 1.
7:   find v = hV(x)
8:   find v′ = hV\{v}(x)
9:   if ev↔v′ ∉ E then
10:     E ← E ∪ {ev↔v′} // An edge is added if it did not exist already.
11:   end if
12:   age(ev↔v′) ← 0 // The age of the edge is set (or reset) to 0.
13:   dist(v) ← dist(v) + L(proto(v), x) // The local distortion is accumulated for the winner.
14:   proto(v) ← proto(v) + α(x − proto(v)) // learning, α ∈ ]0, 1].
15:   ∀v′′ ∈ N(v), proto(v′′) ← proto(v′′) + ζα(x − proto(v′′)) // weaker learning, ζ ∈ ]0, 1]. ζ = 0.1 is ok.
16:   ∀ev↔v′′ ∈ E(v), age(ev↔v′′) ← age(ev↔v′′) + 1 // edges emanating from v get older.
17:   Remove edges older than amax. If a vertex has no more edges after these removals, suppress it.
18:   if i is a multiple of λ then
19:     find v = argmax_{v′′∈V} dist(v′′) // v is the vertex with the highest accumulated distortion.
20:     find v′ = argmax_{v′′∈N(v)} dist(v′′) // v′ is the neighbor of v with the highest accumulated distortion.
21:     create a new vertex v′′ such that proto(v′′) = (proto(v) + proto(v′))/2 and add it in V.
22:     E ← E \ {ev↔v′}.
23:     E ← E ∪ {ev↔v′′, ev′′↔v′}.
24:     dist(v) ← dist(v) − γ dist(v) // γ ∈ ]0, 1].
25:     dist(v′) ← dist(v′) − γ dist(v′)
26:     dist(v′′) ← (dist(v) + dist(v′))/2.
27:   end if
28:   ∀v ∈ V, dist(v) ← dist(v) − γ dist(v)
29: end while
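The distinctive part of GNG, compared to the online k-means, is the periodic insertion of a new vertex between the vertex of highest accumulated distortion and its worst neighbor (lines 19–26 of algorithm 16). A minimal Python sketch of this step is given below; it assumes the graph is stored as dictionaries (vertex id → prototype / accumulated distortion) and a set of frozenset edges, and the helper neighbours() is a placeholder, not notation from the original algorithm.

def gng_insert_vertex(proto, dist, edges, neighbours, gamma=0.5):
    # Insertion step of GNG (lines 19-26 of algorithm 16).
    # proto, dist: dict vertex -> prototype (numpy array) / accumulated distortion,
    # edges: set of frozenset({v, v'}), neighbours(v): vertices connected to v.
    v = max(dist, key=dist.get)                       # highest accumulated distortion
    v_prime = max(neighbours(v), key=dist.get)        # worst neighbour of v
    v_new = max(proto) + 1                            # fresh vertex id (assumes integer ids)
    proto[v_new] = (proto[v] + proto[v_prime]) / 2.0
    edges.discard(frozenset({v, v_prime}))
    edges.add(frozenset({v, v_new}))
    edges.add(frozenset({v_new, v_prime}))
    dist[v] -= gamma * dist[v]                        # decrease the accumulated distortions
    dist[v_prime] -= gamma * dist[v_prime]
    dist[v_new] = (dist[v] + dist[v_prime]) / 2.0
    return v_new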

Figure 15.2: Successive evolution steps of Growing Grid from a 3D input sample distribution.
Drawing convention is the one of figure 15.1.

15.3 Self-Organizing Maps


15.3.1 Principle
A self-organizing map (Kohonen, 1989, 2013) is a vector quantization algorithm where the prototypes are organized as a graph a priori. The graph is kept constant during the whole process, as opposed to the incremental networks presented previously. Even if the graph used for self-organizing maps is a 2D grid, the algorithm can be implemented with any kind of graph.
With self-organizing maps (SOM), two distances are involved. The first is the loss L used to match prototypes against inputs. The second one is a graph distance, denoted by ν, that is a distance separating two vertices in the graph. For example, ν(v, v′) can be the number of edges of the shortest path from v to v′.
Given this, the self-organizing map can be introduced (see algorithm 17). Note that it is very close to algorithm 15.

Algorithm 17 Self-Organizing Maps


1: Let (V, E) be a graph, given a priori. // The proto (v), v ∈ V can be initialized randomly.
2: Let ν be a graph distance induced by the edge set E.
3: while true do
4: Sample x according to p.
5: Determine v ? = hV (x). // Competition.
6: ∀v ∈ V, proto (v) ← proto (v) + αh (ν (v ? , v)) (x − proto (v)) // learning, α ∈]0, 1].
7: end while
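A compact Python sketch of algorithm 17 for a rectangular grid of prototypes follows; the choice of ν as the Euclidean distance between grid coordinates and of a linearly decreasing modulation h are only one possible setting (they anticipate the example of the next subsection), and the two-phase radius schedule shown in the comment mirrors figure 15.3-right.

import numpy as np

def som(sample_input, grid_shape=(20, 20), dim=2, n_steps=20000,
        alpha=0.05, radius=3.0, protos=None):
    # Self-organizing map (algorithm 17) on a rectangular grid.
    # nu is the Euclidean distance between grid coordinates, h(nu) = max(1 - nu/radius, 0).
    rows, cols = grid_shape
    coords = np.array([(i, j) for i in range(rows) for j in range(cols)], dtype=float)
    if protos is None:
        protos = np.random.rand(rows * cols, dim)          # random initialization
    else:
        protos = protos.reshape(rows * cols, dim).copy()
    for _ in range(n_steps):
        x = sample_input()
        winner = np.argmin(np.sum((protos - x) ** 2, axis=1))   # competition
        nu = np.linalg.norm(coords - coords[winner], axis=1)    # graph distance
        h = np.clip(1.0 - nu / radius, 0.0, 1.0)                # winner-take-most modulation
        protos += alpha * h[:, None] * (x - protos)             # learning for all vertices
    return protos.reshape(rows, cols, dim)

# Two-phase schedule: wide radius to unfold the map, then a narrow one to refine it.
# p = som(my_sampler, radius=15.0)
# p = som(my_sampler, radius=3.0, protos=p)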

As for algorithm 15, line 5 performs a competition, i.e. the selection of the vertex whose prototype is the closest to the current input. However, the learning stage, i.e. line 6, slightly changes in algorithm 17. Indeed, learning applies to all the prototypes. The strength of learning is determined by the term αh(.). It corresponds to a modulation of the learning rate α. The function h ∈ [0, 1]^{R+} has to be a decreasing function such that h(0) = 1. As h(ν(v⋆, v)) is used at line 6, the modulation is the highest when ν(v⋆, v) = 0, i.e. when v = v⋆ is considered. The modulation decreases for the neighbors of v⋆ in the graph, but h(ν(v⋆, v)) is still high for them. For prototypes v that are far, in the graph, from v⋆, h(ν(v⋆, v)) is weaker and the learning has no significant effect. To sum up, as opposed to algorithm 15, v⋆ is not the only prototype that is modified by the current input sample, since its close neighbors also learn. This is called a winner-take-most learning scheme.

Note that a function h such that h(0) = 1 and h(ν) = 0 for ν > 0 makes algorithm 17 identical to algorithm 15.

15.3.2 Convergence issues


The convergence of self-organizing maps has not been proved in the general case, even if it works fine in practical cases (Cottrell et al., 1998). Nevertheless, one has to clearly understand what SOMs actually compute, and the effects of the function h. Let us illustrate this on an example in X = R^2, where the density p has a coronal shape. The graph is a grid V = {vi,j}_{1≤i,j≤20}. The edges are those of a rectangular mesh, sketched on the figures. Nevertheless, the edges only influence the algorithm through ν. Here, we use for the sake of simplicity ν(vi,j, vi′,j′) = √((i − i′)^2 + (j − j′)^2). The function h is defined in equation (15.1).

h(ν) = 1 − ν/r if ν ≤ r, 0 otherwise    (15.1)

Figure 15.3: SOM applied to a coronal input distribution. Drawing convention is the one of
figure 15.1. See text for details .

Figure 15.3 shows the results. In figure 15.3-left, r = 15 is used (see equation (15.1)). A wide area of prototypes around the winner actually learns. This has an averaging effect: all the prototypes tend to be attracted to the mean of the input samples. Nevertheless, it can be observed that the grid is “unfolded”. In figure 15.3-middle, r = 3 is used. The averaging effect is weaker, and the prototypes cover the input sample distribution better. The drawback is that the map is “folded”. In figure 15.3-right, we used first r = 15, and then r = 3. This leads to both an unfolded and nicely covering prototype distribution. Nevertheless, as figure 15.3-right shows, some prototypes (the middle ones in the figure) lie outside the distribution, because the map elasticity pulls them in opposite directions. Such prototypes are sometimes called dead units.

The h function in the literature is often presented as a Gaussian with a slowly decreasing variance, which complicates the formulation of the SOM algorithm. Indeed, simpler h can be used, and the progressive decay can be reduced to a few values, the first ones with a wide h expansion and the last ones with a narrower h.

Once the map is correctly unfolded over the input samples, the following stands : two prototypes
that are close according to ν are also close according to L. The reverse is wrong (see figure 15.4).

Be careful with the input sampling. It has to be random. For example, in figure 15.3-right, submitting examples line by line, from the top to the bottom and from the left to the right within each line, would have led to a bad unfolding.

Figure 15.4: When the input manifold dimension is higher than the one of the map. On the figures,
the blue and green prototypes are close according to L but far according to ν. Left and middle:
The graph is a ring. Right: the graph is a grid. Drawing convention is the one of figure 15.1.

15.3.3 Example
Let us consider the case of written digits. Input samples are 28 × 28 gray-scaled images, where digits are written. Input samples are thus x ∈ X = [0, 255]^784. As a loss function L, we do not use directly the Euclidean distance in R^784 (see section 14.1.2). Rather, when we compare proto(v) to x, we first blur the images and then compute the Euclidean distance between the blurred objects.
In this example, the input samples lie in a manifold in R^784. Visualizing this manifold is not easy. As we force the topology to be 2D, since we use a grid for connecting the vertices, we are able to represent the hosted prototypes as a 2D grid on a screen, displaying at each grid position the prototypical image. Recognition can be performed as follows: ask an expert to label each vertex according to its prototype (their number is finite). When a handwritten digit needs to be labeled, find the vertex hosting the closest prototype in the map and give its label to the input.
Another, and maybe more fundamental, aspect of the map in figure 15.5 is that it represents the distribution of all the input digits over the surface of the screen, trying to place the prototypes such that the ones that are close on the screen are actually close digits in R^784. This is an example of using self-organizing maps as non-linear projections for visualizing high dimensional data.

Figure 15.5: Handwritten digits mapping with a self-organizing map.


Part V

Neural networks

Chapter 16

Introduction

What is a neural network ? A neural network is basically a set of interconnected units (i.e. a
graph of units), having inputs and outputs, which compute by themselves a pretty simple function
of their inputs. The idea of studying a network of interconnected units performing a rather basic computation originates from (McCulloch and Pitts, 1943) which introduces a simple model of a neuron with several inputs xi feeding the neuron through weighted connections of weight wi. The weighted sum of the input contributions Σi wi xi provides the pre-activation of the neuron from which its output is computed with a Heaviside transfer function h(x) = 1_{x≥0}:

y = h( Σi wi xi )
h(x) = 0 if x < 0, 1 otherwise

The neuron model of (McCulloch and Pitts, 1943) was not equipped with learning rules allowing to
adapt its weights. As we shall see in the next chapters, various improvements were found ultimately
leading to trainable neural networks.
Even if the first motivation was to model how the brain works, we shall prefer speaking about
units rather than neurons as speaking about neurons tend to insist too much on a relationship with
biological neurons. Definitely, biological neurons inspired (and still inspire) the design of neural
networks but neural networks can be considered as a specific structure of predictors in machine
learning on their own without having to refer to any biological motivations to justify their study.
If we denote x the inputs of a unit and y its output, a prototypical neural network unit links
the inputs to the output through a non-linear function f applied to a linear combination of the
inputs :

a = wT x + b
y = f (a)

where f is a so-called transfer function, w a set of weights, b a bias and a the pre-activation which
is introduced for convenience. The transfer function linking the pre-activation and the output
(or activation) of the unit can take different forms and below is a list of some commonly chosen
transfer functions :

• hyperbolic tangent : f(a) = tanh(a)

• sigmoid1 : f(a) = 1/(1 + exp(−a))

• rectified linear unit (ReLu) : f(a) = max(a, 0) = [a]^+

• softplus : f(a) = log(1 + exp(a))

1 also called logistic or squashing function


While the hyperbolic tangent and sigmoid were common choices for the transfer function, it turns out that the softplus and rectified linear units bring interesting results in terms of performance of the learned predictor and of speed of learning (Nair and Hinton, 2010; Zeiler et al., 2013). The ReLu is really quick to evaluate contrary to transfer functions involving exponentials! It also behaves quite favourably when having to differentiate it, as we shall see later in the chapter. These transfer functions are plotted on figure 16.1. There are also population-based transfer functions where the output of a unit actually depends on the pre-activation of a collection of other units. A popular example is the softmax function. If we consider a population of units for which we denote ai the pre-activations and yi the outputs, the softmax computes the outputs as:

∀i, yi = exp(ai) / Σj exp(aj)

The softmax is especially used in the context of learning a classifier as the softmax transfer function constrains the activations to lie in the range [0, 1] and to sum up to 1. The softmax also induces a competition among the interconnected units: if one unit raises its activation, due to the normalization constraint, it necessarily induces a drop of the activation of at least one of the other units.
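These transfer functions are straightforward to implement; the following numpy sketch also uses the usual numerically stable form of the softmax (subtracting the maximum pre-activation before exponentiating), an implementation detail not discussed in the text.

import numpy as np

def tanh(a):      return np.tanh(a)
def sigmoid(a):   return 1.0 / (1.0 + np.exp(-a))
def relu(a):      return np.maximum(a, 0.0)
def softplus(a):  return np.log1p(np.exp(a))

def softmax(a):
    # Population-based transfer function: outputs lie in [0, 1] and sum to 1.
    e = np.exp(a - np.max(a))     # shifting by max(a) avoids overflow, same result
    return e / e.sum()

a = np.array([1.0, 2.0, -0.5])
print(softmax(a), softmax(a).sum())   # probabilities summing to 1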

Figure 16.1: Classical transfer functions; a) hyperbolic tangent f(a) = tanh(a), b) sigmoid f(a) = 1/(1 + exp(−a)), c) rectified linear unit f(a) = max(a, 0) = [a]^+, d) softplus: f(a) = log(1 + exp(a)).

Up to now, we especially focused on the type of computation that a single unit is performing.
The topology of the network is also a distinguishable feature of neural networks (fig 16.2). Some
neural networks are acyclic or feedforward; you can, say, identify the leaves with the inputs and
the root with the output if we take the convention that information flows from the leaves up to
the root. In general, we can group the units into layers and therefore speak about the input layer,
the output layer, and the hidden layers in between. The hidden layers are so called because they contain the units for which you actually do not know the value, while the inputs and outputs are provided by the dataset in a supervised learning problem. Actually, nothing prevents you from
considering an architecture where the units have connections to a layer that is not the next one
with so called skip-layer connections. In particular, if one knows that the output contains some
linear dependencies on the input, it could be beneficial to add such skip layer connections. These
connections do not actually enhance the expressiveness of the architecture but slightly push the
network into the right direction when learning comes into play.
When the data has a hierarchical structure, some neural networks such as recursive and re-
current neural networks are more appropriate. With recursive neural networks, the same network
is evaluated on children to compute a representation of a parent. The children can actually be
inputs from the dataset or could also be some parent representation. Recursive neural networks
are appropriate when dealing with data that have actually a hierarchical structure such as in nat-
ural language processing. In a recurrent neural network, cycles within the network are introduced.
These cycles produce a memory effect in the network as the activations of the units depend not
only on the current input but also on the previous activations within the network. This type of
network is particularly suited for datasets such as time series.
Having briefly sketched what are neural networks, in the next chapters, we go in details through
a variety of neural network architectures especially focusing on how we train them, i.e. how we
find optimal parameters given a regression or classification problem and what they are good for.
Classical books on neural networks include (Bishop, 1995; Hertz et al., 1991). (Schmidhuber, 2015;
Bengio et al., 2013) recently reviewed the history of the ideas in the neural network community


Figure 16.2: a) A feedforward neural network is an acyclic graph of units. b) A recursive neural network is applied recursively to compute the parent representation of two children, these two children being possibly parent nodes themselves. c) A recurrent neural network is fed with a sequence of inputs, the network itself possibly containing cycles.

and pointed out as well recent trends in the field. There is also the book of (Bengio et al., 2015) that is, at the time of writing these lines, a work in progress written by researchers from the university of Montreal (Y. Bengio), one of the top leading research groups in neural networks with the university of Toronto (G. Hinton) and the IDSIA research group (J. Schmidhuber). The online book of Michael Nielsen (http://neuralnetworksanddeeplearning.com/) is also a good reference.
Chapter 17

Feedforward neural networks

17.1 Single Layer perceptron


17.1.1 Perceptron
Architecture
The perceptron was introduced in (Rosenblatt, 1962). The simple perceptron he introduced is
depicted on fig. 17.1. It is made of an input layer, an association layer and an output layer1 . The
association layer A computes predefined functions of the inputs while the connections between the
association layer and output layer are trainable (or plastic). Given an input x, we denote Φj (x)
the activities in the association layer, the basis functions Φj being predefined.


Figure 17.1: A perceptron is an acyclic graph with an input or sensory layer x, an association layer a and an output or result layer r. The association layer activities are computed with predefined basis functions φi. The weights between the associative and result layers are trainable (or plastic).

The outputs of the perceptron are computed in two steps: 1) a linear combination of the activities in the associative layer defining the pre-activation, 2) a transfer function g applied on the pre-activation:

∀i, ri = g( Σj wj,i aj + bi ) = g( Σj wj,i φj(x) + bi )

1 Actually, F. Rosenblatt introduces the layers as a Sensory layer, an Association layer and a Result layer and builds up architectures from these S, A, R layers.


Let us denote ns, na, nr the number of units in, respectively, the sensory, associative and result layers. Introducing the vector of basis function outputs φ(x) = (φ0(x), φ1(x), …, φ_{na−1}(x))^T ∈ R^{na}, the weight matrix W ∈ R^{na×nr} with Wi,j = wi,j, where wi,j is the weight connecting the associative unit i to the output unit j, and the bias vector b = (b0, b1, …, b_{nr−1})^T ∈ R^{nr}, the computation of the perceptron can be written compactly in matrix form:

r = g( W^T φ(x) + b )

where g is applied element-wise. The formula can be written even more compactly by adding an extra constant basis function φb(x) = 1 with extra weights that encompass the bias vector b. We would then consider a weight matrix in R^{(na+1)×nr} and the vector of basis functions would contain na + 1 entries, one of them set to 1.

The perceptron was introduced in the context of binary classification in which case the transfer function g is taken to be the sign function:

g(x) = −1 if x < 0, +1 if x ≥ 0

As a last note to finish the introduction of the perceptron, it would actually be quite reductive to summarize the contribution of (Rosenblatt, 1962) to the study of the S-A-R architecture for binary classification in this way, as he also studied variants of this architecture, and the interested reader is referred to the original book2. In the next sections, we present how one can learn the weights between the association and result layers.

Learning : the perceptron algorithm


Given a classification problem with a set of input/output pairs (xi, yi) ∈ R^d × {−1, 1}, we want to learn a perceptron defined by:

y = g( W^T φ(x) )
g(x) = −1 if x < 0, +1 if x ≥ 0

In this setting, φ(x) ∈ R^{na+1}, φ(x) = (1, φ0(x), φ1(x), …)^T and, since there is a single output, W ∈ R^{na+1} as well. The perceptron learning rule operates online and updates, after each sample (xi, yi), the weights according to:

W ← W           if the input is correctly classified, i.e. g(W^T φ(xi)) = yi
W ← W + φ(xi)   if the input is incorrectly classified as −1, i.e. g(W^T φ(xi)) = −1 and yi = +1
W ← W − φ(xi)   if the input is incorrectly classified as +1, i.e. g(W^T φ(xi)) = +1 and yi = −1

The perceptron learning rule can actually be written more compactly as:

W ← W              if the input is correctly classified, i.e. g(W^T φ(xi)) = yi
W ← W + yi φ(xi)   otherwise
2 http://catalog.hathitrust.org/Record/000203591
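A minimal Python sketch of this learning rule follows; it assumes the basis functions are simply the raw inputs augmented with a constant 1, and it cycles over the dataset until no mistake is made or a maximum number of epochs is reached (this stopping policy is an illustration choice).

import numpy as np

def perceptron_train(X, y, max_epochs=100):
    # Perceptron learning rule. X: (N, d) inputs, y: (N,) labels in {-1, +1}.
    # Returns w in R^{d+1}; the bias is handled by a constant feature.
    phi = np.hstack([np.ones((X.shape[0], 1)), X])    # phi(x) = (1, x)
    w = np.zeros(phi.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(phi, y):
            if np.sign(xi @ w) != yi:                 # misclassified (a zero score counts as a mistake here)
                w += yi * xi                          # W <- W + y_i phi(x_i)
                mistakes += 1
        if mistakes == 0:                             # converged: the data has been separated
            break
    return w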

Geometrical interpretation
We can have a geometrical understanding of how the perceptron and its learning rule work. Suppose we have a set of transformed input vectors φ(xi) ∈ R^{na+1} and associated labels yi ∈ {−1, 1}. Consider the space R^{na+1} in which the transformed inputs φ(xi) as well as the weight vector w of the perceptron lie. We can associate a hyperplane with each transformed input vector φ(xi), defined by the following equation:

v^T φ(xi) = 0

This hyperplane splits the space R^{na+1} into two regions: one in which v^T φ(xi) < 0 and one in which v^T φ(xi) ≥ 0. Consider the case where an input is correctly classified (fig. 17.2). If the input vector is positive (yi = 1), then it means that both the weight vector w and the transformed input φ(xi) belong to the same half space. If the input vector is negative (yi = −1), then it means that the weight vector w and the transformed input φ(xi) do not belong to the same half space.

Figure 17.2: Case when a perceptron correctly classifies an input xi. If the input is positive (yi = 1), both the weight and transformed input belong to the same half space. If the input is negative (yi = −1), the weight vector and transformed input do not belong to the same half space. The grey region indicates the half space in which a weight vector would misclassify the input.

Now, consider the case where the perceptron misclassifies an input (fig. 17.3). If the input is positive (yi = 1), the weight vector and the transformed input do not belong to the same half space. In order to correctly classify the input, they should actually belong to the same half space. In this case, the perceptron learning rule updates the weights as w + φ(xi), which brings, at least in our example, the weight vector into the correct half space. If the input is negative (yi = −1), both the weight and the transformed input belong to the same half space while they should not in order to correctly classify the input. In this case, the perceptron learning rule updates the weights as w − φ(xi), which brings, at least in our example, the weight vector into the complementary half space where it should lie in order to correctly classify the input.

Figure 17.3: Case when a perceptron misclassifies an input xi. If the input is positive (yi = 1), the weight vector and transformed input do not belong to the same half space. If the input is negative (yi = −1) both the weight and transformed input belong to the same half space. The grey region indicates the half space in which a weight vector actually misclassifies the input.

Let us now consider the case where we have two inputs x1 and x2, respectively positive and negative (y1 = 1, y2 = −1). Drawing the hyperplanes defined by v^T φ(x1) = 0 and v^T φ(x2) = 0 delineates a subspace of R^{na+1} in which a weight vector should lie in order to correctly classify the two input vectors, a cone of feasible solutions. Such a region does not necessarily exist. Indeed, if, for example, we add an extra input x3 to the two-inputs example we just considered, such that x3 = x2 − x1, and set y3 = +1, the cone of feasible solutions vanishes. We will go back to this question of feasibility in section 17.1.3.


Figure 17.4: Considering two inputs x1 and x2 respectively positive and negative (y1 = 1, y2 = −1). In order to correctly classify the two inputs, the weight vector must belong to the white region, the cone of feasible solutions.

Linear separability
As we shall see in the next sections, the perceptron algorithm can only solve a particular class of classification problems which are called linearly separable.

Definition 17.1 (Linear separability). A binary classification problem (xi, yi) ∈ R^d × {−1, 1}, i ∈ [1..N] is said to be linearly separable if there exists w ∈ R^d such that:

∀i, sign(w^T xi) = yi

with ∀x < 0, sign(x) = −1, ∀x ≥ 0, sign(x) = +1.

The exact values of the outputs, whether {0, 1} or {−1, 1}, do not really matter in the above definition. To illustrate the notion of linear separability, consider a binary classification problem with binary inputs, i.e. boolean expressions. Consider two inputs x1, x2 in {0, 1} and one output y ∈ {−1, 1}. The boolean expressions and(x1, x2) or or(x1, x2) are both linearly separable as shown on fig. 17.5.

Figure 17.5: The AND and OR boolean functions are linearly separable as a line in the input
space can be defined in order to place the positive inputs on one side and the negative inputs on
the other side.
17.1. SINGLE LAYER PERCEPTRON 165

Not all the binary classification problems are linearly separable. One famous example is the
XOR function depicted on fig. 17.6 for which there is no way to define a line separating the positive
and negative inputs.

Figure 17.6: The XOR boolean function is not linearly separable as no line can be defined to split
the positive and negative inputs.

One may wonder how many linearly separable functions with discrete inputs and outputs exist, or even generalize and wonder about the probability that a randomly picked classification problem with real inputs is linearly separable. Actually, it turns out that everything depends on the ratio between the number of data points N and the dimensionality of the input d. If N < d, any labelling of the inputs can be linearly separated. The probability of getting a linearly separable problem then quickly drops as the number of samples gets larger than the number of dimensions (Cover, 1965).

Perceptron learning rule convergence theorem

In case a classification problem is linearly separable, the perceptron learning rule can be shown to
converge to a solution in a finite number of steps. Without loss of generality, we will consider a
problem linearly separable in the input space. When introducing the perceptron, we mentioned
using transformed inputs by introducing basis functions φi and we could consider a linearly sepa-
rable classification problem in the transformed input space. However, as the basis functions were
predefined, it is absolutely equivalent to consider that a problem is linearly separable in an input
space whatever is this input space (“raw” or transformed). The perceptron convergence theorem
states :

Theorem 17.1 (Perceptron convergence theorem). A classification problem (xi , yi ) ∈ Rd ×{−1, 1}, i ∈
[1..N ] is linearly separable (def 17.1) if and only if the perceptron learning rule converges to an
optimal solution in a finite number of steps.

Proof. Consider a linearly separable binary classification problem (xi , yi ) ∈ Rd ×{−1, 1}, i ∈ [1..N ].
By definition, there exists ŵ such that :

∀i, sign(ŵ^T xi) = yi ⇒ ∀i, yi ŵ^T xi > 0

Necessarily, |ŵ|_2 > 0. Let us denote wt the weight vector obtained after t misclassified inputs have been processed and (xt, yt) the t-th misclassified input/output pair, and let us suppose that there exists an infinite sequence of misclassified input/output pairs3; otherwise, the proof ends immediately. For any t > 0, since the input/output pair (xt, yt) was misclassified with the weights wt−1, it means that yt (wt−1)^T xt < 0. The sequence of weights wt after k updates using the perceptron learning rule will be:

w1 = w0 + y1 x1
w2 = w1 + y2 x2
⋮
wk = wk−1 + yk xk

3 at least one of the input/output pairs is considered infinitely many times

Taking k > 0 and summing all the above equations leads to wk − w0 = Σ_{i=1}^{k} yi xi. Let us compute the scalar product with ŵ (one solution to the linear separation problem):

ŵ^T (wk − w0) = Σ_{i=1}^{k} yi ŵ^T xi

Since the problem is by hypothesis linearly separable, ∀i, yi ŵ^T xi > 0. Let us denote tm = min_{i∈[1,N]} yi ŵ^T xi > 0. Therefore, we end up with:

ŵ^T (wk − w0) ≥ k tm > 0

Reminding the Cauchy-Schwarz inequality4, we get:

|ŵ|_2 |wk − w0|_2 ≥ ŵ^T (wk − w0) ≥ k tm
⇒ |wk − w0|_2 ≥ k tm / |ŵ|_2
⇒ |wk|_2 ≥ −|w0|_2 + k tm / |ŵ|_2

Note that tm/|ŵ|_2 is a constant that only depends on the dataset and on a fixed solution ŵ. Therefore, |wk|_2 is lower bounded by a linear function of the number of misclassified input/output pairs k. This is
a first point. Let us now focus on upper bounding the norm of wk:

∀k > 0, wk = wk−1 + yk xk
⇒ |wk|_2^2 = |wk−1|_2^2 + |yk xk|_2^2 + 2 yk (wk−1)^T xk

Remind that the input/output pair (xk, yk) is the k-th misclassified input/output pair, meaning yk (wk−1)^T xk < 0 and therefore:

∀k > 0, |wk|_2^2 < |wk−1|_2^2 + |yk xk|_2^2
⇒ ∀k > 0, |wk|_2^2 − |wk−1|_2^2 < |yk xk|_2^2
⇒ |wk|_2^2 − |w0|_2^2 = Σ_{i=0}^{k−1} ( |wi+1|_2^2 − |wi|_2^2 ) < Σ_{i=0}^{k−1} |yi+1 xi+1|_2^2
⇒ |wk|_2^2 < |w0|_2^2 + k tM

with tM = max_{i∈[1,N]} |yi xi|_2^2. The latter implies |wk|_2 < √( |w0|_2^2 + k tM ). That is the second point. We therefore demonstrated that:

∀k, −|w0|_2 + k tm / |ŵ|_2 ≤ |wk|_2 < √( |w0|_2^2 + k tM )
tm = min_{i∈[1,N]} yi ŵ^T xi > 0
tM = max_{i∈[1,N]} |yi xi|_2^2

4 For any vector space E with a scalar product (a pre-Hilbert space), denoted (u.v), then |(u.v)|2 ≤ (u.u)(v.v)
17.1. SINGLE LAYER PERCEPTRON 167

In the lower bound, we have a linearly increasing function of k. In the upper bound, we have an increasing function in √k. Necessarily, there is a finite value of the number of misclassified input/output pairs k for which the two curves cross, after which the inequality cannot hold anymore. This raises a contradiction and leads to the conclusion that there cannot be an infinite sequence of misclassified input/output pairs: the perceptron algorithm converges. Therefore, we demonstrated that if the classification problem is linearly separable, then the perceptron learning rule converges in a finite number of updates.

The other implication is straightforward. If the perceptron is converging in a finite number of


steps, it means that after say k updates, no mistakes are performed and, given the definition of
the perceptron, this implies that the classification problem is linearly separable.

Given the equivalence, we can then also state that, in case the classification problem is not
linearly separable, the perceptron algorithm will never converge since, otherwise, the classification
problem would have been linearly separable.

More on perceptrons
While we demonstrated the convergence of the perceptron learning rule, we did not say much about the rate of convergence. The learning rule we considered, and the associated algorithm which picks the input/output pairs one after the other, is not the algorithm that provides the fastest rate of convergence. There are variants of the perceptron learning rule with improved rates of convergence (Gallant, 1990; Muselli, 1997; Soheili and Pena, 2013; Soheili, 2014).

There are also extensions of the perceptron using kernels. As one may note, the weights of the perceptron are always a sum of the input samples weighted by the labels:

w = Σ_{i∈I} yi xi

where I is the set of misclassified inputs that we encountered during learning. At some point in time, in order to test the prediction of the perceptron, we simply compute the dot product of the weights with the vector x to test:

w.x = Σ_{i∈I} yi xi.x

and test the sign of w.x to decide whether x belongs to the positive or the negative class. Given that the computation is expressed only with dot products, one can extend the algorithm using kernels as in (Freund and Schapire, 1999). Given a mapping function ϕ of our input space into a so-called feature space Φ:

ϕ : R^d → Φ
x ↦ ϕ(x)

The weight vector would then be expressed in the feature space:

w = Σ_{i∈I} yi ϕ(xi)

As before, testing an input x (which is also a step during learning) would imply computing the dot product of the weights with the input, now projected into the feature space ϕ(x):

w.ϕ(x) = Σ_{i∈I} yi ϕ(xi).ϕ(x) = Σ_{i∈I} yi k(xi, x)

where k is a kernel (see chapter 10 for more details). For example, we show on fig 17.7 an example of binary classification, using RBF kernels with a variance σ = 0.3, where the perceptron is trained with the perceptron learning rule. Each class contains 100 samples and convergence was actually obtained by iterating only two times over the training set. Please note that the point of this illustration is simply to illustrate the application of the perceptron; it is clear that such a classifier does not possess a large margin around the classes, which may translate into a bad generalization. However, the interested reader can read (Freund and Schapire, 1999) where the voted-perceptron algorithm is introduced, a modification of the perceptron algorithm with guaranteed margins.
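A minimal Python sketch of this kernelized perceptron with an RBF kernel follows; the bandwidth and the number of epochs are arbitrary illustration choices, not the exact setting of figure 17.7, and the expansion set I is represented as a vector of counters since a sample can be misclassified several times.

import numpy as np

def rbf_kernel(a, b, sigma=0.3):
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def kernel_perceptron_train(X, y, kernel=rbf_kernel, n_epochs=10):
    # Kernel perceptron: the classifier is sign( sum_{i in I} y_i k(x_i, x) ).
    counts = np.zeros(len(y))
    for _ in range(n_epochs):
        for i, (xi, yi) in enumerate(zip(X, y)):
            score = sum(c * yj * kernel(xj, xi)
                        for c, xj, yj in zip(counts, X, y) if c > 0)
            if np.sign(score) != yi:      # misclassified: add this sample to the expansion
                counts[i] += 1
    return counts

def kernel_perceptron_predict(counts, X, y, x, kernel=rbf_kernel):
    score = sum(c * yj * kernel(xj, x) for c, xj, yj in zip(counts, X, y) if c > 0)
    return 1 if score >= 0 else -1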


Figure 17.7: Application of the perceptron learning rule with RBF kernels (σ = 0.3) with 100
samples for both the positive and negative classes. Convergence was obtained in two iterations
over the training set.

17.1.2 ADaptive LINear Elements


At the same time (Rosenblatt, 1962) introduced the perceptron, (Widrow and Hoff, 1962) introduced a very similar single layer architecture known as the ADaptive LInear Elements (ADALINE). While the architecture is similar, the learning algorithm is different. Rather than using the perceptron learning rule, the ADALINE network is trained with the Least-Mean-Square (LMS) algorithm. Suppose we are given a training set for a binary classification problem S = {(xi, yi) ∈ R^d × {−1, 1}, i ∈ [1..N]}; the LMS algorithm looks for weights w⋆ ∈ R^{d+1} minimizing the empirical risk:

R^S_emp(w) = (1/N) Σ_{i=1}^{N} L(yi, fw(xi)) = (1/N) Σ_{i=1}^{N} |yi − fw(xi)|_2^2
fw(x) = w^T (1, x^T)^T

There are two possibilities to solve this optimization problem. The first possibility is a batch
method where all the samples are considered and this least mean square problem can actually be
solved analytically by computing its derivative with respect to w and setting it to zero.

dR^S_emp/dwj (w) = 0, ∀j
⇔ −(2/N) Σ_{i=1}^{N} (yi − w^T (1, xi^T)^T) (1, xi^T)^T = 0
⇔ Σ_{i=1}^{N} (w^T (1, xi^T)^T) (1, xi^T)^T = Σ_{i=1}^{N} yi (1, xi^T)^T

Let us now introduce the vector y with Yi = yi, and the matrix X = [ (1, x1^T)^T  (1, x2^T)^T  ⋯  (1, xN^T)^T ] ∈ R^{(d+1)×N}, i.e. a first row of ones stacked on top of the inputs arranged column-wise. We can then rewrite Σ_{i=1}^{N} yi (1, xi^T)^T = Xy. For the left-hand side term, let us write, with a slight abuse of notation, xi = (1, xi^T)^T:

∀j ∈ [1, d + 1], ( Σ_{i=1}^{N} (w^T xi) xi )_j = Σ_{i=1}^{N} Σ_{k=1}^{d+1} wk (xi)k (xi)j
= Σ_{k=1}^{d+1} wk Σ_{i=1}^{N} Xk,i Xj,i
= Σ_{k=1}^{d+1} wk Σ_{i=1}^{N} Xj,i (X^T)i,k
= Σ_{k=1}^{d+1} wk (XX^T)j,k = Σ_{k=1}^{d+1} (XX^T)j,k wk

And therefore:

Σ_{i=1}^{N} (w^T xi) xi = (XX^T) w

Finally, the solution to the least square reads :

(XXT )w = Xy (17.1)

which is known as the normal equations. If the matrix XX^T is not singular, the solution to the least square problem is:

w = (XX^T)^{-1} Xy

and (XX^T)^{-1} X ∈ R^{(d+1)×N} is actually the Moore-Penrose pseudo-inverse of X. In case the matrix XX^T is not invertible, the solution to the least square problem is not unique. One can then look for the solution w with minimal norm. It turns out that this can be computed from the Singular Value Decomposition (SVD) of X. The SVD of X ∈ R^{(d+1)×N} is X = UΣV^T with U ∈ R^{(d+1)×(d+1)} and V ∈ R^{N×N} two orthogonal matrices (U^{-1} = U^T, V^{-1} = V^T) and Σ a diagonal matrix with non-negative elements (some can be equal to zero depending on the rank of the matrix X). The minimal norm solution to the least square problem is then defined by (17.2) (Lawson and Hanson, 1974).

w = (VΣ+ UT )y (17.2)

with:

Σ+_{i,i} = 1/Σ_{i,i} if Σ_{i,i} ≠ 0, and Σ+_{i,i} = 0 otherwise
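In practice, both the normal-equation solution and the minimal norm solution can be obtained with a numerical library. The sketch below (an illustration, not the text's reference implementation) relies on numpy's least-squares solver, which internally uses an SVD and returns the minimal norm solution when the system is rank-deficient.

import numpy as np

def adaline_fit(X, y):
    # Batch least squares for the ADALINE/LMS problem.
    # X: (N, d) raw inputs, y: (N,) targets.
    # Returns w in R^{d+1} minimizing sum_i (y_i - w^T (1, x_i))^2.
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])     # augmented inputs, one row per sample
    # lstsq solves min_w ||Xb w - y||_2 and returns the minimal norm solution if needed.
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def adaline_predict(w, X):
    return np.hstack([np.ones((X.shape[0], 1)), X]) @ w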

It might not be convenient to solve the optimization problem in a single shot as it requires computing the pseudo-inverse of a matrix which grows with the number of samples. Also, the previous method is batch and requires all the samples to be available to compute the optimal solution for the weights. An alternative is to update the parameters w online, one sample at a time. One simple approach is then to compute the gradient of the loss and to perform a so-called steepest descent or gradient descent. The derivative can be taken considering the whole training set (gradient descent) or only one sample at a time (Stochastic Gradient Descent - SGD). If we consider one sample at a time to make the updates online, it reads:

∀i, ∇w L(yi, fw(xi)) = d/dw |yi − fw(xi)|_2^2 = −2 (dfw/dw)(xi) (yi − fw(xi)) = −2 xi (yi − fw(xi))

We can then update the weights according to a fraction of the steepest descent. Therefore, at time t, after observing the input/output pair (xi, yi), the weights would be updated according to:

wt+1 = wt − (α/2) ∇w L(yi, fw(xi)) = wt + α xi (yi − fw(xi))

This is actually the so-called delta rule or Widrow-Hoff rule as considered by (Widrow and Hoff, 1962). Even if (Widrow and Hoff, 1962) originally considered binary classification with their architecture, using a linear transfer function and a quadratic loss, as we will see in section 17.1.4, different combinations of transfer function and loss are considered depending on the type of problem to be solved (regression or classification).

17.1.3 Limitations

As explained in the previous section, any neural network in which only the last layer contains train-
able weights with a binary transfer function can only solve linearly separable binary classification
problems. One of the famous example that had a strong negative impact on the research efforts
in neural networks is the XOR binary function. The XOR classification problem with two inputs
x1 , x2 is indeed not linearly separable in the x1 , x2 space. However, if we transform the inputs
and work in the x1 x2 , x1 x2 space, the problem becomes linearly separable (fig. 17.8). However,
the question is to determine how the inputs should be transformed so that the problem becomes
linearly separable. In other words, how one might learn appropriate features computed from the
inputs so that a classification (or regression) problem becomes solvable.


Figure 17.8: The XOR boolean function x1 ⊕ x2 is not linearly separable in the (x1, x2) space but becomes linearly separable when projected into the (x1 x̄2, x̄1 x2) space.

In section 17.1.1, we saw an example of a perceptron with appropriately chosen basis functions which performs a non-linear classification. In section 17.2, we study a particular type of “single-layer” neural network, the radial basis function network, in which an appropriate choice of features computed from the inputs allows to solve non-linear regression problems. Actually, the limitation of these networks is not that the perceptron can only represent linearly separable problems; the true question is how to learn the appropriate features, and this is what we will see in the section on multilayer perceptrons.

17.1.4 Single layer perceptron

In the previous sections, we introduced both the perceptron and Adaline networks from an historical
perspective in the sense that our presentation sticks to the architecture introduced respectively by
(Rosenblatt, 1962) and (Widrow and Hoff, 1962). We now inspect the question of single layer
neural networks from a different perspective by considering which architecture one might use in
order to solve regression or classification problems.

Regression
Suppose we are given a monodimensional regression problem S = {(xi, yi) ∈ R^d × R, i ∈ [1..N]}. In that case, one would use a linear transfer function g(x) = x and a quadratic loss, i.e.:

L(y, fw(x)) = |y − fw(x)|_2^2
fw(x) = w^T (1, x^T)^T

The empirical risk to be minimized therefore reads:

R^S_emp = (1/N) Σ_{i=1}^{N} | yi − w^T (1, xi^T)^T |_2^2

As detailed in section 17.1.2, the empirical risk can be minimized in batch mode, using all the training set, and the optimal weights w⋆ ∈ R^{d+1} are given by solving a linear least square problem, i.e. by the equations (17.1) or (17.2) depending on whether or not XX^T is invertible, with X = [ (1, x1^T)^T ⋯ (1, xN^T)^T ]. We remind the previous results for completeness:

w⋆ = (XX^T)^{-1} Xy if XX^T is invertible
w⋆ = (VΣ+U^T)y otherwise (the minimal norm solution)

The second possibility to optimize for the weights w is to perform learning online with the stochastic gradient descent. You can perform gradient descent using one sample at a time (stochastic gradient), all the samples (batch gradient)5 or mini-batch gradient considering only a part of the samples at every iteration. For the stochastic gradient descent, given some initial weights w0, the update rule is:

wt+1 = wt − (α/2) ∇w L(yi, fw(xi)) = wt + α xi (yi − fw(xi))

where α is a learning rate to be defined (pretty small if fixed, i.e. α ≈ 10^{-2}, 10^{-3}, or adaptive as we will see later in this chapter). Mini-batches can be meaningful if you use parallel processors (e.g. GPUs) as you actually compute the gradient for several samples with the same weights and can then exploit the parallelism of the hardware more efficiently.

Binary classification
Let us now consider binary classification problems: we are given S = {(xi, yi) ∈ R^d × {0, 1}, i ∈ [1..N]}. For learning a classifier, one can actually devise several architectures and associated learning algorithms but some are more appropriate than others. The first option we consider is to use the logistic transfer function6 g(x) = 1/(1 + exp(−x)) which allows to interpret the output as the conditional probability of belonging to one of the classes (as g(x) ∈ [0, 1]) given an input. In this situation the quadratic loss is not appropriate (see at the end of this paragraph why) and the cross-entropy loss should be preferred:

L(y, ŷ) = −y ln(ŷ) − (1 − y) ln(1 − ŷ)
R^S_emp(w) = −(1/N) Σ_{i=1}^{N} ( yi ln(fw(xi)) + (1 − yi) ln(1 − fw(xi)) )
fw(x) = g( w^T (1, x^T)^T )
g(x) = 1/(1 + exp(−x))
5 Please note that it is actually meaningless to perform a batch gradient descent in this situation as the optimal weights can be computed analytically. For multilayer perceptrons, it makes much more sense.
6 In practice, (LeCun et al., 1998) suggests to use a scaled hyperbolic tangent transfer function g(x) = 1.7159 tanh(0.6666 x)

As y ∈ {0, 1} and ŷ ∈ [0, 1], we can note that L(y, ŷ) ≥ 0. Also, ∀y ∈ {0, 1}, L(y, ŷ) = 0 ⇔ ŷ = y. We will now compute the gradient of the loss with respect to the weights. A few preliminaries will be helpful:

∀x, g′(x) = exp(−x)/(1 + exp(−x))^2 = (1/(1 + exp(−x))) (1 − 1/(1 + exp(−x))) = g(x)(1 − g(x))
∀y, ∀ŷ, ∂L/∂ŷ (y, ŷ) = (ŷ − y) / ( ŷ(1 − ŷ) )

Let us denote xi = (1, xi^T)^T the augmented input. We then get:

∀i, ∂L/∂w (yi, g(w^T xi)) = xi g′(w^T xi) ∂L/∂ŷ (yi, g(w^T xi)) = xi ( g(w^T xi) − yi )

So, updating the weights with the stochastic gradient leads to:

wt+1 = wt − α ∇w L(yi, fw(xi)) = wt + α xi (yi − g(w^T xi))

which is actually very similar to the update obtained when considering a linear transfer function with a quadratic loss for the regression problem we considered previously. The transfer function is taken to be the logistic function σ(x) = 1/(1 + exp(−x)). If we were to use, for example, the hyperbolic tangent tanh(x) = (exp(x) − exp(−x))/(exp(x) + exp(−x)), one would have to adapt the loss accordingly, taking into account the fact that the hyperbolic tangent is linearly linked to the logistic function as tanh(x) = 2σ(2x) − 1. The outputs must then also be defined in {−1, 1}.
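A small Python sketch of this stochastic gradient descent with a logistic output and the cross-entropy loss follows; the learning rate, number of epochs and shuffling policy are placeholders for illustration.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_logistic_sgd(X, y, alpha=0.1, n_epochs=100, seed=0):
    # Single-layer binary classifier: logistic output, cross-entropy loss.
    # X: (N, d) inputs, y: (N,) targets in {0, 1}.
    # Implements w <- w + alpha * x_i * (y_i - g(w^T x_i)), x_i being the augmented input.
    rng = np.random.default_rng(seed)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):
            w += alpha * Xb[i] * (y[i] - sigmoid(Xb[i] @ w))
    return w

def predict_proba(w, X):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return sigmoid(Xb @ w)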

What is going on if, rather than the cross-entropy loss, we take the quadratic loss, but still with the logistic transfer function? Performing the computation, we get:

∀i, d/dw |yi − g(w^T xi)|_2^2 = −2 xi (yi − g(w^T xi)) g′(w^T xi)

We see that the gradient of the quadratic cost keeps a g′(w^T xi) term which was cancelled out when using the cross-entropy loss. The issue we then encounter is when, for an input xi, we get g(w^T xi) ≈ 0 or g(w^T xi) ≈ 1, where the derivative of the logistic function is close to zero. This is the case for example when an input is misclassified and the initial weights are sufficiently strong to bring the logistic function into its saturated part. In this case, the gradient is really flat and it will take quite a long time for the parameters to escape from this region.

Multiclass classification

In the case of a classification problem with c classes S = {(xi, yi) ∈ R^d × {0, · · ·, c − 1}, i ∈ [1..N]}, we would encode the output with the 1-of-c or one-hot encoding, i.e. the size of the output vector y is the number of classes and we set yk = δk,y. We can then devise two architectures. The first one is to take a sigmoidal transfer function for the output layer and use the cross-entropy loss applied to all the predicted output dimensions:

L(y, ŷ) = − Σ_{k=0}^{c−1} ( yk ln(ŷk) + (1 − yk) ln(1 − ŷk) )
R^S_emp(W) = (1/N) Σ_{i=1}^{N} L(yi, fW(xi))
fW(x) = g( W^T (1, x^T)^T ) = ( g(w0^T x), g(w1^T x), …, g(w_{c−1}^T x) )^T
g(x) = 1/(1 + exp(−x))

where g is applied element-wise, x denotes (with a slight abuse of notation) the input augmented with a constant 1, and W is now a (d + 1) × c matrix with the weights to each of the c output units in its columns. One can then verify that the derivative of the loss with respect to any weight wk,j reads:

∂L(y, fW(x)) / ∂wk,j = ( −yk + g(wk^T x) ) xj

In this case, we cannot interpret the outputs as a discrete probability distribution as they are not normalized. If one wants to interpret the outputs as the conditional probability over the labels given the inputs, we can guarantee that the outputs are in the range [0, 1] and sum up to 1 by using the soft-max transfer function. Denoting W the weight matrix where the j-th column W.,j = wj contains the weights from the input to the j-th output, given an input xi it is handy to introduce the notation:

aj = wj^T xi

The predicted outputs then read:

∀j ∈ [0, c − 1], ŷj = exp(aj) / Σ_k exp(ak)

In this case, the appropriate loss is the negative log-likelihood loss defined as:

L(y, ŷ) = − log(ŷy)

This supposes that y is the class number. In case y is encoding the class with the 1-of-c or one-hot encoding, then you just get the cross-entropy loss − Σ_{k=0}^{c−1} yk log(ŷk). If we write the empirical risk as a function of the parameter matrix W, denoting wj its j-th column:

R^S_emp(W) = (1/N) Σ_{i=1}^{N} L(yi, fW(xi)) = −(1/N) Σ_{i=1}^{N} Σ_{k=0}^{c−1} 1_{yi=k} log( exp(wk^T xi) / Σ_{l=0}^{c−1} exp(wl^T xi) )
Here, the derivatives are a bit more tedious to compute. Let us compute some intermediate steps:

∀x, ∀k, j, ∀i ≠ k,  ∂/∂wi,j log( exp(wk^T x) / Σ_{l=0}^{c−1} exp(wl^T x) ) = −xj exp(wi^T x) / Σ_{l=0}^{c−1} exp(wl^T x) = −xj [fW(x)]i
∀x, ∀k, j,  ∂/∂wk,j log( exp(wk^T x) / Σ_{l=0}^{c−1} exp(wl^T x) ) = xj − xj exp(wk^T x) / Σ_{l=0}^{c−1} exp(wl^T x) = xj − xj [fW(x)]k
⇒ ∀x, ∀i, k, j,  ∂/∂wi,j log( exp(wk^T x) / Σ_{l=0}^{c−1} exp(wl^T x) ) = xj ( δi,k − [fW(x)]i )

From these, we deduce the derivative of the loss:

∀y, ∀x, ∀i, j,  ∂L/∂wi,j (y, fW(x)) = xj ( [fW(x)]i − δy,i )
It can be seen that this is actually a generalization of the binary classification with a sigmoidal output and a cross-entropy loss. If we apply the above formula with two outputs but just focusing on the output of class 1 (the positive class), and noting that ∀y ∈ {0, 1}, δy,1 = y, we get:

∀y ∈ {0, 1}, ∀x, ∀j,  ∂L/∂w1,j (y, fW(x)) = xj ( [fW(x)]1 − y )

with [fW(x)]1 = exp(w1^T x) / ( exp(w0^T x) + exp(w1^T x) ) = 1 / ( 1 + exp(−(w1 − w0)^T x) ). In the case of multiple classes, the softmax output introduces a competition between the different classes, which is not the case with logistic outputs. The outputs get normalized and one can interpret the output ŷj as the conditional probability P(y = j | x).
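The corresponding stochastic gradient step for the softmax output with the negative log-likelihood loss can be sketched as follows (an illustration, with an arbitrary learning rate); the update uses exactly the derivative xj([fW(x)]i − δy,i) derived above.

import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def sgd_step_softmax(W, x, y, alpha=0.1):
    # One stochastic gradient step for a softmax classifier.
    # W: (d+1, c) weight matrix (one column per class), x: (d,) raw input,
    # y: integer class label in [0, c-1].
    xb = np.concatenate(([1.0], x))            # augmented input
    probs = softmax(W.T @ xb)                  # predicted distribution over the c classes
    delta = np.zeros(W.shape[1]); delta[y] = 1.0
    grad = np.outer(xb, probs - delta)         # dL/dW_{j,i} = x_j ( [f_W(x)]_i - delta_{y,i} )
    return W - alpha * grad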

17.2 Radial Basis Function networks (RBF)


17.2.1 Architecture and training
The radial basis function network (Broomhead and Lowe, 1988) is a type of perceptron in which the basis functions are radial, φj(x) = exp(−|x − cj|_2^2 / σj^2). In its simplest form, the centers and standard deviations are fixed. To illustrate RBF networks, consider a single dimensional regression problem S = {(xi, yi) ∈ R^d × R, i ∈ [1..N]}. The regressor is defined as:

∀x ∈ R^d, fw(x) = Σ_{k=0}^{K−1} wk φk(x) = w^T φ(x)

where φ(x) = (φ0(x), φ1(x), …, φ_{K−1}(x))^T, considering K basis functions with one constant function, e.g. φ0(x) = 1. The regression problem can be solved by looking for the weight vector w minimizing the empirical risk7 with a quadratic loss:

L(y, fw(x)) = |y − fw(x)|_2^2
fw(x) = w^T φ(x)

The empirical risk to be minimized therefore reads:

R^S_emp = (1/N) Σ_{i=1}^{N} | yi − w^T φ(xi) |_2^2

7 we shall later come back to the question of generalization

As we already saw in the previous section, this least square minimization problem can be solved analytically or iteratively with a steepest descent. Analytically, the optimal weights read:

w⋆ = ( φ(X) φ(X)^T )^{-1} φ(X) y if φ(X) φ(X)^T is invertible, with φ(X) = [ φ(x1) ⋯ φ(xN) ]
w⋆ = (VΣ+U^T) y otherwise (the minimal norm solution, UΣV^T being now the SVD of φ(X))

How do we define the centers and standard deviations of the basis functions? There are actually
several possibilities (Schwenker et al., 2001; Peng et al., 2007; Han and Qiao, 2012). The simplest
is to randomly pick K − 1 centers from the inputs and compute a common standard deviation
as the mean of the distances between the selected inputs and their closest selected neighbors. We
then train/compute the optimal weights to minimize the risk. Another possibility is to apply a
clustering algorithm (e.g. k-means) to identify good candidates for the centers and compute the
standard deviation as before. Then, after this unsupervised learning step, one would learn in
a supervised manner the optimal weights. These are two-phase training algorithms for training
RBF (Schwenker et al., 2001). Another possibility is to train the RBF in three phases (Schwenker
et al., 2001). The first two phases consist in initializing the centers and standard deviations
of the kernels with some clustering algorithm and then computing the optimal weights directly or
with a steepest descent. The third phase consists in adapting all the parameters (weights, centers,
standard deviations) using a steepest descent. One can actually compute the gradients of the loss
with respect to the weights, centers and standard deviations (Schwenker et al., 2001; Bishop, 1995):
L(y, f_w(x)) = |y - f_w(x)|_2^2
f_w(x) = w^T \phi(x)
\forall k, \quad \phi_k(x) = \exp\left(-\frac{|x - c_k|^2}{2\sigma_k^2}\right)
\forall k, j, \quad \frac{\partial \phi_k(x)}{\partial c_j} = \delta_{k,j}\, \phi_k(x) \frac{x - c_k}{\sigma_k^2}
\forall k, j, \quad \frac{\partial \phi_k(x)}{\partial \sigma_j} = \delta_{k,j}\, \phi_k(x) \frac{|x - c_k|^2}{\sigma_k^3}
\frac{\partial L(y, f_w(x))}{\partial w} = -2 \phi(x) (y - f_w(x))
\forall k, \quad \frac{\partial L(y, f_w(x))}{\partial c_k} = -2 (y - f_w(x)) \frac{\partial f_w(x)}{\partial c_k} = -2 (y - f_w(x)) w_k \phi_k(x) \frac{x - c_k}{\sigma_k^2}
\forall k, \quad \frac{\partial L(y, f_w(x))}{\partial \sigma_k} = -2 (y - f_w(x)) \frac{\partial f_w(x)}{\partial \sigma_k} = -2 (y - f_w(x)) w_k \phi_k(x) \frac{|x - c_k|^2}{\sigma_k^3}

Some other algorithms for optimizing both the weights and basis function parameters can be
found in (Peng et al., 2007; Han and Qiao, 2012).
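The two-phase procedure described above can be sketched in a few lines of NumPy. The following is only an illustration under my own assumptions (random selection of the centers, a common standard deviation, least-squares weights); the function names and toy data are not from the text.

```python
import numpy as np

def fit_rbf(X, y, K, seed=0):
    """Two-phase RBF training sketch. X: (N, d) inputs, y: (N,) targets, K: Gaussian kernels."""
    rng = np.random.default_rng(seed)
    # Phase 1: pick K centers among the inputs and a common standard deviation.
    centers = X[rng.choice(len(X), size=K, replace=False)]
    dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    sigma = dists.min(axis=1).mean()           # mean distance to the closest selected neighbor

    def phi(Xq):
        d2 = ((Xq[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        G = np.exp(-d2 / (2 * sigma ** 2))
        return np.hstack([np.ones((len(Xq), 1)), G])   # constant basis function phi_0 = 1

    # Phase 2: optimal weights in the least-squares sense (minimal norm if singular).
    w, *_ = np.linalg.lstsq(phi(X), y, rcond=None)
    return lambda Xq: phi(Xq) @ w

# Toy usage on the noisy sine example used later in the chapter.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 1))
y = 0.5 + 0.4 * np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.05, size=30)
f = fit_rbf(X, y, K=10)
print(f(X)[:5])
```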

17.2.2 Universal approximation


It can be shown that any sufficiently smooth function can be approximated arbitrarily well by
an RBF network with a "sufficient" number of kernels. The interested reader is referred to (Park
and Sandberg, 1991; Hartman Eric J. et al., 1990).

17.3 Multilayer perceptron (MLP)


17.3.1 Architecture
A multi-layer perceptron is built from one input layer, one output layer and several (≥ 1) hidden
layers, as depicted on fig 17.9 with L − 1 hidden layers. The transfer functions for the hidden
layers and the output layer are usually taken to be different and are respectively denoted g and f.
For example, for a classification problem, whatever the transfer function in the hidden layers, we
usually take the softmax transfer function for the output layer, which guarantees that the outputs
y_i^{(L)} define a discrete probability distribution. For a regression problem, the transfer function
of the output is taken as a linear function f(x) = x while non linearities are introduced in the
hidden layers.
Let us introduce some notations:

• w_{ij}^{(l)} the weight between the j-th unit of layer l − 1 and the i-th unit of layer l,

• a_i^{(l)} the pre-activation of the unit i of layer l,

• y_i^{(l)} the output of the unit i of layer l.

Every unit computes its pre-activation as a linear combination of its inputs. For simplicity
of the notations, we denote y_i^{(0)} = x_i and I_l the set of indices of the units in layer l ∈ [0, L].
Remember that, in order to take into account the bias (offset) in the linear combination of the
inputs, each layer l ∈ [0, L − 1] has one unit with a constant output equal to 1. This means, for
example, that if the inputs are taken from R^d, the input layer actually contains d + 1 units. The
computations within the network read:

Pre-activations: \forall l \in [1, L], \forall i \in I_l, \quad a_i^{(l)} = \sum_{j \in I_{l-1}} w_{ij}^{(l)} y_j^{(l-1)}, \qquad \text{i.e. } a^{(l)} = W^{(l)} y^{(l-1)}


Figure 17.9: A multilayer perceptron is built from an input and output layer with several hidden
layers in between. Each layer other than the output is extended with a unit of constant output 1
for the bias.

17.3.2 Learning : error backpropagation


Now that we have introduced the architecture, we need to devise an algorithm to optimize the
parameters (the weights) of the neural network, and we make a step into the domain of numerical
optimization. We can actually resort to any optimization algorithm, such as derivative-free
optimization algorithms (e.g. line-search, Brent's method (Brent, 1973), black-box optimization
such as CMA-ES (Hansen, 2006), Particle Swarm Optimization (Engelbrecht, 2007; Eberhart
and Kennedy, 1995), ...), optimization algorithms that make use of the gradient (steepest
descent (Werbos, 1981; Rumelhart et al., 1986), natural gradient (Amari, 1998a), conjugate
gradient (Johansson et al., 1991), ...) or algorithms that use the second order derivatives (Hessian),
sometimes only approximating it as in (Martens, 2010). We come back to the topic of optimization
algorithms in section 17.5. For now, let us consider error backpropagation, which is historically
a major breakthrough in the neural network community as it brought the ability to learn multilayer
neural networks (Werbos, 1981; Rumelhart et al., 1986).
We consider architectures for which an appropriate combination of loss function and output
transfer function has been chosen. As we saw in section 17.1.4, it means:

• for a regression problem with a vectorial output, a linear transfer function and a quadratic
loss (the factor 1/2 is introduced to get formulas similar to the classification case; it is just a
scaling factor):

f(a) = a, \qquad L(y, \hat{y}) = \frac{1}{2} |y - \hat{y}|_2^2

• for a multi-class classification problem, a softmax output transfer function and the negative
log-likelihood loss:

f(a) = \frac{1}{\sum_k \exp(a_k)} \left(\exp(a_0)\ \exp(a_1)\ \cdots\ \exp(a_{c-1})\right)^T, \qquad L(y, \hat{y}) = -\sum_k y_k \log(\hat{y}_k)

Starting from some initial weight and bias vector w, its update following the steepest descent
reads:

w \leftarrow w - \alpha \nabla_w L

Let us compute the derivatives of the loss with respect to a weight from the last hidden layer
to the output layer. Denoting a^{(L)} = W^{(L)} y^{(L-1)} the pre-activations of the output layer, where
W^{(L)} is the weight matrix from the last hidden layer to the output layer (W^{(L)}_{i,j} is the weight
from the hidden unit j to the output unit i), the predicted output can be written as ŷ = f(a^{(L)}).
In order to compute the gradient with respect to any weight, we apply the chain rule; in the case
of a weight w_{i,j}^{(L)} between the j-th hidden unit and the i-th output unit, the gradient of the
loss reads:

\forall i, j, \quad \frac{\partial L(y, \hat{y})}{\partial w_{i,j}^{(L)}} = \sum_k \frac{\partial a_k^{(L)}}{\partial w_{i,j}^{(L)}} \frac{\partial L(y, \hat{y})}{\partial a_k^{(L)}}

It is convenient to introduce the notation \Delta_k^{(L)} = \frac{\partial L(y, \hat{y})}{\partial a_k^{(L)}}. The above formula is then written as:

\forall i, j, \quad \frac{\partial L(y, \hat{y})}{\partial w_{i,j}^{(L)}} = \sum_k \frac{\partial a_k^{(L)}}{\partial w_{i,j}^{(L)}} \Delta_k^{(L)}

Whether in regression or classification, the pre-activations are computed as the product of the
weight matrix with the output of layer L − 1: a_k^{(L)} = \sum_i w_{ki}^{(L)} y_i^{(L-1)}. Therefore:

\frac{\partial a_k^{(L)}}{\partial w_{i,j}^{(L)}} = \delta_{i,k}\, y_j^{(L-1)}

And so, the derivative of the loss reads:

\forall i, j, \quad \frac{\partial L(y, \hat{y})}{\partial w_{i,j}^{(L)}} = \sum_k \frac{\partial a_k^{(L)}}{\partial w_{i,j}^{(L)}} \Delta_k^{(L)} = \sum_k \delta_{i,k}\, y_j^{(L-1)} \Delta_k^{(L)} = y_j^{(L-1)} \Delta_i^{(L)}

We now need to make explicit the term \Delta_i^{(L)}, which is the derivative of the loss with respect to the
pre-activations. The computations are actually similar to the ones carried out in section 17.1.4 and
are repeated here for completeness:

• in case of a regression:

\Delta_k^{(L)} = \frac{\partial L(y, \hat{y})}{\partial a_k^{(L)}} = \frac{1}{2} \frac{\partial |y - \hat{y}|_2^2}{\partial a_k^{(L)}} = \frac{1}{2} \sum_i \frac{\partial (y_i - y_i^{(L)})^2}{\partial a_k^{(L)}} = -\sum_i (y_i - y_i^{(L)}) \frac{\partial y_i^{(L)}}{\partial a_k^{(L)}}

\frac{\partial y_i^{(L)}}{\partial a_k^{(L)}} = \frac{\partial}{\partial a_k^{(L)}} f\left(a_i^{(L)}\right) = \delta_{k,i}

\Rightarrow \Delta_k^{(L)} = -\sum_i (y_i - y_i^{(L)}) \delta_{k,i} = -(y_k - y_k^{(L)})

\Rightarrow \forall i, j, \quad \frac{\partial L(y, \hat{y})}{\partial w_{i,j}^{(L)}} = y_j^{(L-1)} \Delta_i^{(L)} = -y_j^{(L-1)} (y_i - y_i^{(L)})

• in case of a classification, denoting c(x) the class of the input x (i.e. \forall i, y_i = \delta_{i,c(x)} using
the 1-of-c encoding of the desired output):

\Delta_k^{(L)} = \frac{\partial L(y, \hat{y})}{\partial a_k^{(L)}} = -\sum_i \frac{\partial}{\partial a_k^{(L)}} \left(y_i \log y_i^{(L)}\right) = -\sum_i \frac{y_i}{y_i^{(L)}} \frac{\partial y_i^{(L)}}{\partial a_k^{(L)}}

\frac{\partial y_i^{(L)}}{\partial a_k^{(L)}} = \frac{\partial}{\partial a_k^{(L)}} \frac{\exp(a_i^{(L)})}{\sum_l \exp(a_l^{(L)})} = \frac{\delta_{i,k} \exp(a_i^{(L)}) \sum_l \exp(a_l^{(L)}) - \exp(a_i^{(L)}) \exp(a_k^{(L)})}{\left(\sum_l \exp(a_l^{(L)})\right)^2}
                                             = \frac{\exp(a_i^{(L)})}{\sum_l \exp(a_l^{(L)})} \left(\delta_{i,k} - \frac{\exp(a_k^{(L)})}{\sum_l \exp(a_l^{(L)})}\right) = y_i^{(L)} (\delta_{i,k} - y_k^{(L)})

\Rightarrow \Delta_k^{(L)} = -\sum_i \frac{y_i}{y_i^{(L)}} y_i^{(L)} (\delta_{i,k} - y_k^{(L)}) = -\sum_i y_i (\delta_{i,k} - y_k^{(L)}) = -(\delta_{c(x),k} - y_k^{(L)}) = -(y_k - y_k^{(L)})

\Rightarrow \forall i, j, \quad \frac{\partial L(y, \hat{y})}{\partial w_{i,j}^{(L)}} = y_j^{(L-1)} \Delta_i^{(L)} = -y_j^{(L-1)} (y_i - y_i^{(L)})

We now turn to the computation of the derivatives with respect to a weight or bias afferent to
a unit in layer L − 1:

\forall i, j, \quad \frac{\partial L(y, \hat{y})}{\partial w_{i,j}^{(L-1)}} = \sum_k \frac{\partial L(y, \hat{y})}{\partial a_k^{(L-1)}} \frac{\partial a_k^{(L-1)}}{\partial w_{i,j}^{(L-1)}} = \sum_k \frac{\partial L(y, \hat{y})}{\partial a_k^{(L-1)}} \delta_{i,k}\, y_j^{(L-2)} = \frac{\partial L(y, \hat{y})}{\partial a_i^{(L-1)}} y_j^{(L-2)} = y_j^{(L-2)} \Delta_i^{(L-1)}

\Delta_i^{(L-1)} = \frac{\partial L(y, \hat{y})}{\partial a_i^{(L-1)}} = \sum_k \frac{\partial L(y, \hat{y})}{\partial a_k^{(L)}} \frac{\partial a_k^{(L)}}{\partial y_i^{(L-1)}} \frac{\partial y_i^{(L-1)}}{\partial a_i^{(L-1)}} = \sum_k \Delta_k^{(L)} w_{ki}^{(L)} g'(a_i^{(L-1)}) = g'(a_i^{(L-1)}) \sum_k \Delta_k^{(L)} w_{ki}^{(L)}

If we look at the structure of \Delta_i^{(L-1)}, it is basically the \Delta term of the next layer, weighted
by the weights of the projection from the unit i to the next layer, everything premultiplied by the
derivative of the hidden layer transfer function. Hence the name "error backpropagation": the
error term \Delta_i^{(L)} on the output layer is propagated backward into the previous layer. This
process is recursive and backpropagation goes downward, down to the input layer. We did not
detail the derivative of the hidden layer transfer function as it is specific to the network you
consider:

• rectified linear units (in which case one should rather speak of a subgradient): g(x) = \max(x, 0), \quad g'(x) = 1_{x > 0}

• logistic units: g(x) = \frac{1}{1 + \exp(-x)}, \quad g'(x) = g(x)(1 - g(x))

• hyperbolic tangent units: g(x) = \tanh(x), \quad g'(x) = 1 - g^2(x)
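To make the recursion concrete, here is a minimal NumPy sketch of one backpropagation step for a single-hidden-layer classifier with logistic hidden units and a softmax output. The shapes, names and toy values are assumptions for the example, not the notation of the text.

```python
import numpy as np

def forward_backward(W1, W2, x, y_true):
    """W1: (h, d+1) hidden weights, W2: (c, h+1) output weights (last column = bias),
    x: (d,) input, y_true: integer class label."""
    x1 = np.append(x, 1.0)                            # input extended with the constant unit
    a1 = W1 @ x1                                      # hidden pre-activations
    y1 = np.append(1.0 / (1.0 + np.exp(-a1)), 1.0)    # logistic hidden units + constant unit
    a2 = W2 @ y1                                      # output pre-activations
    e = np.exp(a2 - a2.max())
    y2 = e / e.sum()                                  # softmax outputs

    delta2 = y2.copy()
    delta2[y_true] -= 1.0                             # Delta^(L) = y^(L) - y (softmax + NLL)
    grad_W2 = np.outer(delta2, y1)                    # y_j^(L-1) Delta_i^(L)
    g_prime = y1[:-1] * (1.0 - y1[:-1])               # derivative of the logistic hidden units
    delta1 = g_prime * (W2[:, :-1].T @ delta2)        # backpropagated error term
    grad_W1 = np.outer(delta1, x1)
    return grad_W1, grad_W2

# One steepest-descent update on a random example.
rng = np.random.default_rng(0)
W1, W2 = 0.1 * rng.normal(size=(4, 3)), 0.1 * rng.normal(size=(3, 5))
gW1, gW2 = forward_backward(W1, W2, x=np.array([0.2, -1.0]), y_true=1)
W1 -= 0.1 * gW1
W2 -= 0.1 * gW2
```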

17.3.3 Universal approximator


A single hidden layer neural network with a linear output unit can approximate any continuous
function arbitrarily well, provided it has enough hidden units (Hornik et al., 1990; Cybenko, 1989;
Hornik et al., 1989). We propose to illustrate, with a single dimensional input and output, the
ability of single hidden layer neural networks to approximate a smooth function. Take for example
a logistic transfer function:

\phi_i(x) = \frac{1}{1 + \exp(-\alpha(x - c_i))}

where we just introduce the bias. In our notations, the term inside the exponential is really
\begin{pmatrix} -\alpha c_i \\ \alpha \end{pmatrix}^T \begin{pmatrix} 1 \\ x \end{pmatrix}. We can then combine two such sigmoids with different centers and different gains
\alpha. Some examples are drawn on fig 17.10. Combining arbitrarily close sigmoids, one can actually
build up bell-shaped functions.


Figure 17.10: By combining two sigmoids with arbitrarily close centers, one can actually build up
arbitrarily local bell shape functions which can then be weighted in order to produce any smooth
function. The full line plots are six sigmoids which are then grouped by pairs and the difference
of these pairs are plotted with dashed lines. For generating the plot, the centers of the pairs are
{0.2, 0.25}, {0.5, 0.52}, {0.8, 0.9} with a gain α = 50.

Intuitively, we can define a bunch of pairs of sigmoids (outputs from the hidden layer) that
create local bump functions, which can then be weighted with the weights from the hidden layer
to the output in order to approximate any continuous function. The formal proofs are given in
the references (Hornik et al., 1990; Cybenko, 1989; Hornik et al., 1989).
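The construction of a bump from a pair of sigmoids is easy to reproduce numerically; in the sketch below the centers, gain and weights are arbitrary illustration values.

```python
import numpy as np

def sigmoid(x, c, alpha=50.0):
    # Logistic unit centered at c with gain alpha, as in the text.
    return 1.0 / (1.0 + np.exp(-alpha * (x - c)))

x = np.linspace(0.0, 1.0, 200)
# Difference of two nearby sigmoids: an approximately local, bell-shaped bump.
bump = sigmoid(x, 0.5) - sigmoid(x, 0.52)
# A weighted sum of such bumps (hidden-to-output weights) approximates a smooth target.
approx = 0.3 * (sigmoid(x, 0.2) - sigmoid(x, 0.25)) + 0.8 * bump
```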

17.3.4 The need for using deep networks


In the previous section, we established that neural networks with a single hidden layer can ap-
proximate arbitrarily well any well behaved function. This result actually convinced people that
neural networks are a good class of predictors. However, the theorem does not state the size of
the hidden layer, and it turns out that this number can actually be pretty large (Mhaskar
and Micchelli, 1995). Why should we consider a deep (large number of hidden layers) rather than
a shallow (small number of hidden layers) neural network? By using many layers, we compose
non-linear functions many times and can achieve a targeted non-linear mapping without having to
consider very large layers; if we were to use few layers, we would need a large number
of units (Håstad and Goldmann, 1991; Håstad, 1986; Delalleau and Bengio, 2011; Pascanu et al.,
2013). These results show that, for some functions, a shallow network may need a number of hidden
units growing exponentially with the input dimension while a deep architecture requires a linearly
growing number of hidden units. There are also, now, several empirical results indicating the
superiority of deep networks (Bengio et al., 2007). These results became possible after new techniques
for training deep neural networks were developed, as we will see in the next chapter.

17.4 Generalization
So far, we only discussed minimizing the empirical risk given some data. However, if we only
minimize the empirical risk we will (hopefully) perform well on the training set, but the
generalization performance will usually be bad. This is especially true when the dataset
is limited. It turns out that some recent works dealing with very large datasets (or works
augmenting the original dataset with transformations, as in (Ciressan et al., 2012)) do
not encounter this issue, as working with a very large dataset can be understood as working with an
infinite stream of different inputs. However, if the dataset is not "sufficiently" large with respect
to the number of degrees of freedom of the network, the network might overfit the training set and
perform very badly on data it has not seen in the training set. In this section, we review some
popular methods allowing to hopefully get good generalization performance by counterbalancing
the minimization of the empirical risk and penalizing models that are too "complex".
To take an example of overfitting, imagine that you have some 1D data to regress. Suppose
that the unknown model is actually linear in the input, say y = αx, but obviously you do not
know the model that generated the data, as this is what you want to learn from samples, say N
samples (x_i, y_i). If you were to minimize only the empirical risk, you could simply consider the
Lagrange interpolating polynomial:

f(x) = \sum_{i=0}^{N-1} y_i \frac{\prod_{j \in [0, N-1], j \neq i} (x - x_j)}{\prod_{j \in [0, N-1], j \neq i} (x_i - x_j)}

This regressor fits the samples perfectly: its empirical risk is actually null. To avoid too much
math, simply suppose that your data are slightly noisy. It is clear that the Lagrange polynomial
will not be a linear function of the input, since such a linear function would not have a null
empirical risk while the Lagrange polynomial fits the data perfectly. You would get higher order
monomials, which might lead to bad generalization on unseen input/output pairs since you are
actually fitting both the data and the noise that you would like to filter out.
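The phenomenon can be reproduced numerically. The sketch below is my own toy setup: it fits the interpolating polynomial of degree N − 1 (equivalent to the Lagrange polynomial) with np.polyfit and compares it to a degree-one fit on noisy linear data.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
x = np.sort(rng.uniform(0, 1, N))
y = 2.0 * x + rng.normal(0, 0.05, N)          # noisy samples of a linear model y = alpha x

# Degree N-1 polynomial: the interpolating (Lagrange) polynomial, null empirical risk.
p_interp = np.polyfit(x, y, deg=N - 1)
# Degree 1 polynomial: close to the true model, small but non-null empirical risk.
p_lin = np.polyfit(x, y, deg=1)

x_test = np.linspace(0, 1, 100)
print(np.abs(np.polyval(p_interp, x_test) - 2 * x_test).max())   # large excursions off the samples
print(np.abs(np.polyval(p_lin, x_test) - 2 * x_test).max())      # small error everywhere
```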

17.4.1 Regularization
So far, we just spoke about minimizing the empirical risk. However, this is not really the quantity
of interest. More relevant is the minimization of the real risk, which we usually do not have access
to (this is less and less true as the datasets we work with grow: the abundance of data may actually
prevent overfitting and remove the need for regularizing the neural networks). The issue we may
encounter when only focusing on the empirical risk is overfitting, where we would perform perfectly
on the training set but badly on data that were not present in the dataset, i.e. we would have
a bad generalization error. For example, on figure 17.11, we generated data from a sine with
normally distributed noise:

h(x) = 0.5 + 0.4 \sin(2\pi x) + \epsilon, \quad \epsilon \sim N(0, 0.05)

Using an RBF with one kernel per sample, it turns out the optimal solution to the least square
regression clearly overfits the data, as shown by the learned predictor plotted in solid line on
fig. 17.11a.

Figure 17.11: With 30 samples generated according to eq.17.4.1, and building a RBF with one
kernel per sample with a standard deviation σ = 0.05, fitting the RBF without any penalty leads
to an overfitted regressor (a) while fitting the RBF with a weight decay penalty λ = 2 or L1 norm
penalty (α = 0.005) provides a better generalization (b,c). The dashed line indicates the noise-free
data. Original example from (Bishop, 1995).

Weight decay : L2 norm penalty


The weight decay, or L2 norm penalty, consists in adding to the cost function to be minimized an
extra term which depends on the norm of the learned weights, without the biases. You would
then minimize the cost function:

J(w) = L(w) + \frac{\lambda}{2} \sum_{k=1}^{d+1} w_k^2

Performing a gradient descent on this extended cost function simply adds a linear term to the
gradient:

\nabla_w J = \nabla_w L + \lambda w
w \leftarrow w - \alpha(\nabla_w L + \lambda w) = (1 - \alpha\lambda) w - \alpha \nabla_w L

Note that if the predictor is linear and the cost quadratic, adding an L2 penalty simply adds λI
to the XX^T matrix to be inverted when computing the optimal solution (actually λI with the first
diagonal element set to zero to avoid regularizing the bias). Note also that the bias is not included
in the regularization. The L2 penalty enforces the weights to keep a low norm: it brings w
closer to 0. One may see the L2 penalty as a brake on activating the non-linearities of your network.
If one has in mind the RBF network with f_w(x) = \sum_{k=0}^{d+1} w_k \phi_k(x), we see that if the norm of w is
low, it will tend to prevent activating the non-linearities. An example of RBF with an L2 penalty is
shown on fig 17.11b. With this example in mind, we might understand why it is not a good idea
to penalize the bias term. The bias term is the mean component of your data. To see this, we can
rewrite the cost function to be minimized by making the bias explicit:

\operatorname{argmin}_w \sum_{i=1}^{N} \left(y_i - w_0 - \sum_{k \geq 1} w_k \phi_k(x_i)\right)^2

If you now compute the gradient with respect to w_0 and set it to zero, you will find:

w_0 = \frac{1}{N} \sum_{i=1}^{N} y_i - \sum_{k \geq 1} w_k \left(\frac{1}{N} \sum_{i=1}^{N} \phi_k(x_i)\right)

It is usually a good idea to standardize the inputs (it also helps the gradient descent: without
standardization the cost function is elongated rather than circular and the descent takes longer
to converge), in which case the second term vanishes and you recover the mean of the outputs as
the optimal bias. This is one reason why you should not regularize the bias. The penalty should
only affect the activation of the non linearities, which are the main causes of overfitting.

There is also another idea which helps build intuition about why weight decay helps generalization.
Weight decay tends to bring the weights closer to zero. It turns out that with a logistic
(or a tanh) transfer function, when the weights get small, the pre-activations tend to lie where
the logistic is almost linear. Therefore, if the weights are constrained to be small, each layer is
actually a linear layer and the whole stack of layers of the multilayer perceptron collapses to a
single linear layer. The weights will bring the logistic functions into their saturated parts only if
doing so decreases the loss sufficiently with respect to the weight decay amplitude. In that way, we
understand weight decay as a penalty on the complexity, richness or expressiveness of a multilayer
perceptron.

Sparse weights : L1 norm penalty


The L1 norm penalty consists in adding to the cost function to be minimized an extra term which
depends on the absolute values of the weights, without the biases. You would then minimize the
cost function:

J(w) = L(w) + \lambda \sum_{k=1}^{d+1} |w_k|

Performing a gradient descent on this extended cost function simply adds to the gradient a term
which depends on the sign of the components of w:

\nabla_w J = \nabla_w L + \lambda \operatorname{sign}(w)
w \leftarrow w - \alpha(\nabla_w L + \lambda \operatorname{sign}(w)) = w - \alpha\lambda \operatorname{sign}(w) - \alpha \nabla_w L

On figure 17.12, we give an illustration (Hastie et al., 2009) that helps understand the influence of
the L1 norm penalty. Under the hypothesis that our loss is quadratic, the L1 norm penalty tends
to favor solutions that are more aligned with the axes than the L2 penalty, which leads to sparse
weights (i.e. more components equal to 0).

Figure 17.12: The L1 penalty tends to promote solutions that have few non null components.
Indeed, compared to the L2 penalty, the optimal solution with the L1 penalty will be more axis
aligned. The figure is taken from (Hastie et al., 2009).

On figure 17.11c, an RBF is regressed with 30 gaussian kernels and an L1 penalty with α = 0.005.
The linear regression with the L1 penalty is solved with the LASSO-Lars algorithm (implemented
as Lasso-Lars in scikit-learn, http://scikit-learn.org). It turns out that of the 30 basis functions,
only 10 get activated. The norm of the optimal solution plotted on the figure is around
|w*|_2 ≈ 0.57, actually quite close to the norm of the optimal solution with the L2 penalty,
|w*|_2 ≈ 0.61, but the solution is much sparser.
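Both penalties only change the gradient step. The following sketch shows the corresponding update; the toy data and hyperparameter values are arbitrary illustration choices, not prescriptions from the text.

```python
import numpy as np

def regularized_step(w, grad_loss, alpha, l2=0.0, l1=0.0):
    """One steepest-descent step on the penalized cost J(w) = L(w) + l2/2 |w|^2 + l1 |w|_1.
    grad_loss is the gradient of the data term L at w; the bias is assumed excluded from w."""
    return w - alpha * (grad_loss + l2 * w + l1 * np.sign(w))

# Toy usage: ridge/lasso-style updates on a quadratic loss L(w) = 1/2 |Xw - y|^2.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 10)), rng.normal(size=50)
w = np.zeros(10)
for _ in range(200):
    grad = X.T @ (X @ w - y)
    w = regularized_step(w, grad, alpha=1e-3, l2=2.0, l1=0.005)
```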

17.4.2 Learning procedure


The two methods introduced in the previous section add extra terms to the cost to be minimized.
In this section, we actually consider variations of the learning procedure.

Dropout
Dropout is a regularizing technique introduced in (Srivastava et al., 2014) and illustrated on
fig 17.13. The motivation is actually quite interesting: it is based on the idea of avoiding co-
adaptation. It means that, in order to enforce the hidden units of an MLP to learn sound and
robust features, one will actually discard some of its feedforward inputs during training. Discarding
is controlled by a binary gate tossed for each input connection following a Bernoulli distribution
of parameter p. By doing so, the units can hardly compensate their failures with the help of the
others (co-adaptation) as they tend to work with a random subset of other units. At test time,
the full network is used and the contribution of each unit is scaled by the probability that it was
kept during training, thereby averaging the contributions of the thinned networks. The authors
report in (Srivastava et al., 2014) that using p = 20% or p = 50% significantly improved the
generalization performance of various architectures.

Figure 17.13: Dropout consists in discarding some units during training with a given probability p,
taken as 20% or 50% as suggested by (Srivastava et al., 2014). Figure taken from (Srivastava et al.,
2014).
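A sketch of the mechanism, assuming p denotes the probability of dropping a unit as in the paragraph above; the helper name and toy vector are my own.

```python
import numpy as np

def dropout_forward(y_hidden, p_drop, rng, train=True):
    """Drop each hidden activation with probability p_drop during training;
    at test time, keep everything and scale by the retention probability."""
    if train:
        mask = rng.random(y_hidden.shape) >= p_drop          # Bernoulli gate per unit
        return y_hidden * mask
    return y_hidden * (1.0 - p_drop)                          # average over the thinned networks

rng = np.random.default_rng(0)
h = np.array([0.3, 1.2, -0.7, 0.9])
print(dropout_forward(h, p_drop=0.5, rng=rng))                # training: random thinning
print(dropout_forward(h, p_drop=0.5, rng=rng, train=False))   # test: scaled full network
```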

Early stopping
Early stopping consists, in its most naive implementation, in tracking, in parallel with the training
error, a validation error computed on a subset of the inputs not in the training set (say, 10% of
the data). In an ideal situation, one would observe errors as a function of the training epoch that
look like figure 17.14a. Initially, both the errors on the training and validation sets decrease. At
some point in time, however, while the training error goes on decreasing, the validation error starts
increasing (see (Wang et al., 1993) for a theoretical analysis in a simplified case). This point in
time should actually be the point where learning is stopped in order to avoid overfitting. In
practice, the errors do not strictly follow this ideal picture (see fig. 17.14b) (Prechelt, 1996). One
can however still monitor the performance of the neural network on the validation set during
training and select, at the end of the training period, the weights that led to the lowest error on
the validation set.

Figure 17.14: a) Ideal training and validation curves which clearly indicate when to stop learning.
b) Real cases might actually be much more fluctuating. Images from (Prechelt, 1996)
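In its monitored form, early stopping amounts to the following loop. This is only a sketch: fit_one_epoch and validation_error are hypothetical user-provided callables, not functions defined in the text.

```python
import numpy as np

def train_with_early_stopping(fit_one_epoch, validation_error, w0, max_epochs=200):
    """Keep training but remember the weights achieving the lowest validation error."""
    w, best_w, best_err = w0, w0.copy(), np.inf
    for epoch in range(max_epochs):
        w = fit_one_epoch(w)               # one pass of (stochastic) gradient descent
        err = validation_error(w)          # error on held-out data (e.g. 10% of the samples)
        if err < best_err:
            best_err, best_w = err, w.copy()
    return best_w                           # weights selected by early stopping
```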

17.5 Optimization
Several optimization techniques that are not actually specific to neural networks turn out to con-
verge faster than the classical (stochastic) gradient descent. The interested reader is referred to
(Bengio et al., 2015), Chap. 8, for an in-depth presentation of aspects such as momentum, first
and second order methods (conjugate gradients, Hessian-free optimizers (Martens, 2010), saddle-
free optimizers (Dauphin et al., 2014)). There is actually extensive research on understanding the
landscape of the cost function we encounter with neural networks and on designing specific
optimization techniques that take these aspects into account.

17.6 Convolutional neural networks : an early successful deep neural network
Among the oldest deep neural networks that worked very well are the Neocognitron (Fukushima, 1980)
and the convolutional neural networks (Lecun et al., 1998), which both rely on similar principles.
These were introduced for classifying images. The basic idea behind convolutional neural
networks is to take into account the structure of the images, and especially the fact that filters
extracting sound features from local patches of the images are translation invariant. To
get this point, imagine you want to detect specific orientations in a gray-scale image using signal
processing techniques. You would certainly end up using Gabor filters. Actually, you would perform
the convolution of your input image with the Gabor filter just because the way to detect a specific
orientation (computing the Gabor filter on a local patch of the image) is independent of the location
where you want to detect that orientation; that is the meaning of translation invariance in the
extraction of features. Therefore, the idea of convolutional neural networks is, instead of considering
fully connected networks with completely independent weights, to learn filters that are then used
to perform convolutions over the input (image or hidden layer). This technique of constraining the
weights of several units to be the same is known as weight sharing. One example of convolutional
neural network is shown on fig 17.15.
A convolutional neural network is built from a stack of convolution and pooling layers. Then,
at the very top of the stack, some fully connected layers are added. The convolution layers simply
perform a local matrix product between a small patch of the input and a so-called filter which is
a few units in width and height. This filter is applied over the whole image as you would perform
a usual convolution: this is where the invariance in the feature extraction operates. The feature
kernels are actually tensors with the width and height of the filter as well as a depth.

Figure 17.15: A convolutional neural network as introduced in (Lecun et al., 1998). The first layers
compute convolutions of their input with several trainable filters. This weight sharing dramatically
decreases the number of weights to learn by exploiting a fundamental structure of images: the
extraction of features from images is translation invariant.

For example, the first convolution layer applied to an RGB image has filters of depth 3. If k filters
are computed from the input image, the next convolution layer will have a depth of k. After
the convolution layer, one finds a pooling layer. A pooling layer introduces another translation
invariance. In the original work, (Lecun et al., 1998) considers subsampling, which reads:

y_i^{(l)} = \tanh\left(\beta \sum_{j \in RF_i} y_j^{(l-1)} + b\right)

Subsampling consists in computing the average of the convolutional layer outputs over a local patch
that we call the receptive field. It turns out that another pooling operation, known as max-pooling,
works significantly better in practice (Scherer et al., 2010). Max-pooling consists in computing the
max rather than the average within the receptive field of a unit:

y_i^{(l)} = \max_{j \in RF_i} y_j^{(l-1)}

Such convolutional neural networks with non-overlapping max-pooling layers appear to be very
effective (Ciresan et al., 2011b). In (Simard et al., 2003), the authors present some "good" choices
for setting up a convolutional neural network which turn out to work well in practice, in terms of
initialization of the weights and of the organization of convolutional and fully-connected layers.
One additional point the authors present is data augmentation, which consists in applying
distortions to the training set images in order to feed the network with many more inputs than
if we just considered the original dataset. Hopefully, the data augmentation technique provides
additional sensible inputs which mimic the availability of an infinite dataset and might therefore
remove the need to regularize the network.
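As an illustration of the pooling operations above, here is a minimal NumPy sketch of non-overlapping max-pooling over 2 × 2 receptive fields (the toy array is my own):

```python
import numpy as np

def max_pool_2d(feature_map, size=2):
    """Non-overlapping max-pooling over size x size receptive fields.
    feature_map: (H, W) array with H and W divisible by size."""
    H, W = feature_map.shape
    blocks = feature_map.reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3))

fm = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2d(fm))   # [[ 5.  7.] [13. 15.]]
```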

17.7 Autoencoders
Autoencoders (also known as Diabolo networks) are a specific type of neural
network where the objective is to train a network that is able to reconstruct its input. A simple
single layer autoencoder is represented on fig 17.16. Usually, the hidden layers have the shape of
a bottleneck, with smaller and smaller layers, with the aim that the input x gets compressed into
a so-called code c which is sufficiently informative to allow the reconstruction x' to be close to the
input x. In the simple autoencoder of fig 17.16, the equations read:

c = \sigma(W x + b)
x' = W' c + b'

where σ is a transfer function (e.g. logistic). One may constrain the decoding weights W' to be
equal to the transpose of the coding weights W to decrease the number of parameters to train.
If one uses a linear transfer function and a quadratic cost, one then seeks to minimize the
reconstruction error from some low dimensional projection, which is exactly what PCA is doing.
However, when non linear transfer functions are used, the autoencoders do not behave as a PCA
at all (Japkowicz et al., 2000) and can be used to extract useful non linear features from the
inputs; autoencoders turn out to be effective architectures for performing non linear dimensionality
reduction (Hinton et al., 2006).


Figure 17.16: A simple single hidden layer autoencoder. The input x goes through a bottleneck
to create a code c from which we seek to build up a reconstruction x' as similar as possible to the
original input x.

Actually, one may even use a hidden layer with the same number of units as the input, or even
more, and still get sensible hidden units that are not merely learning the identity function (Bengio
et al., 2007; Ranzato et al., 2006), provided the architecture is appropriately regularized (early
stopping or L1 norm penalty). In (Hinton et al., 2006), a deep autoencoder with three hidden
layers between the input and code layer is introduced to perform dimensionality reduction. The
authors also present a way to efficiently train such deep architectures. Variants of the autoencoders
where noise, acting as a regularizer, is introduced in the hidden layers are presented in (Vincent
et al., 2008). Injecting noise enforces the network to learn robust features and prevents the
autoencoder from simply learning the identity function when large hidden layers are considered.
These autoencoders are called denoising autoencoders. In (Vincent et al., 2008), the authors also
introduce the stacked denoising autoencoder, which is merely a stack of encoders trained iteratively:
a first single layer denoising autoencoder is trained; then, the learned code is used as the input for
training a second denoising autoencoder, and so on and so forth.
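As a concrete illustration of the autoencoder equations above, here is a small NumPy sketch of one gradient step on the reconstruction error of a tied-weights autoencoder (W' = W^T). The dimensions, learning rate and variable names are assumptions for the example.

```python
import numpy as np

def autoencoder_step(W, b, b_prime, x, lr=0.1):
    """One gradient step on loss = 1/2 |x - x'|^2 with c = sigma(W x + b), x' = W^T c + b'."""
    c = 1.0 / (1.0 + np.exp(-(W @ x + b)))            # code (logistic encoder)
    x_rec = W.T @ c + b_prime                         # linear decoder with tied weights
    err = x_rec - x
    grad_c = W @ err * c * (1.0 - c)                  # backprop through the encoder
    grad_W = np.outer(grad_c, x) + np.outer(c, err)   # encoder and decoder both use W
    W -= lr * grad_W
    b -= lr * grad_c
    b_prime -= lr * err
    return W, b, b_prime

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(3, 8))                     # 8-dimensional input, 3-dimensional code
b, b_prime = np.zeros(3), np.zeros(8)
x = rng.normal(size=8)
W, b, b_prime = autoencoder_step(W, b, b_prime, x)
```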

17.8 Where is the problem with a deep neural network and how to alleviate it ?
We already introduced two deep architectures: the convolutional neural networks and the deep
autoencoders. Actually, the convolutional neural networks were the only successful deep neural
networks until recently. The reason is that the convolutional neural networks are not fully connected:
their shared weights are constrained by several patches of the images. When fully connected
feedforward neural networks were considered, they were hard to train. One reason was raised by
(Hochreiter, 1991), who identified that training is essentially difficult with gradient descent
because the gradient tends either to vanish or to blow up as it gets backpropagated toward the first
layers. Actually, this issue of vanishing gradient is shared with the recurrent neural networks. The
recurrent neural networks will be introduced in the next chapter, but let us just mention a few words
on them. While feedforward neural networks are acyclic, recurrent neural networks contain cycles.
One way to train them is to see each time step of the activities as a layer of a feedforward
neural network. Using backpropagation on this unfolded network propagates the gradient
through the history of the activities of the network and is called backpropagation through time
(BPTT). If many time steps are considered, we actually build up a feedforward neural network with
many layers. Gradient descent in recurrent neural networks is known to be difficult (Bengio et al.,
1994).

Around 2006, it was found that the issue of gradient vanishing can be alleviated by an appro-
priate initialization of the network parameters (Hinton et al., 2006; Bengio et al., 2007). A similar
idea was already introduced in (Schmidhuber, 1992). The idea is to train the feedforward (or
recurrent) neural networks in an unsupervised way. One such method relies on the autoencoders
that we introduced in a previous section. Remember that autoencoders seek to learn a useful code,
useful in the sense that it can be sparse and allows reconstructing the original data. Robust and
sparse features are then extracted from the input image. This unsupervised learning seems to bring
the weights of the neural network into a region much more favorable for fine tuning with stochastic
gradient descent. One may find additional elements on why training deep networks is difficult in
(Glorot and Bengio, 2010).

17.9 Success stories of Deep neural networks


Several recent works push forward the use of neural networks for classification and regression
problems, to name a few: image recognition (Ciressan et al., 2012; Krizhevsky et al., 2012), image
to text translation (Karpathy and Li, 2014), speech recognition (Graves et al., 2013; Deng and
Platt, 2014; Deng et al., 2013; Yu and Deng, 2015), speech translation (Sutskever et al., 2014),
drug activity prediction (Dahl et al., 2014). In this section, we review some of these success stories
especially focusing on the architectures and training procedures involved.
One famous example of a successful deep neural network, already presented in section 17.6, is the
convolutional neural network LeNet of (Lecun et al., 1998). Recently, it is still deep
convolutional neural networks that rank first on the MNIST dataset (Ciressan et al., 2012). In
(Ciressan et al., 2012), the authors actually present results on other datasets as well (e.g. CIFAR,
traffic signs, NORB, Chinese characters) but we concentrate here on the MNIST dataset. As a
reminder, the MNIST dataset has 10 classes of 28 × 28 black and white handwritten digits, with a
training set of 60,000 images and a test set of 10,000 images. In (Ciressan et al., 2012), the authors
trained multiple convolutional networks with the same architecture, depicted on fig 17.17. The
authors applied small distortions to the original dataset (shrinkage) to build up 35 datasets on
which they trained a convolutional neural network separately (during training, the images are
further distorted before each epoch). They then averaged the responses of the 35 neural networks
for the final classification. The final architecture is called the Multicolumn Convolutional Deep
Neural Network (MCDNN). The final architecture has up to a million parameters.

Figure 17.17: a) Convolutional neural network trained on the MNIST dataset by (Ciressan et al.,
2012). Applying small distortions to the original data, the authors build up 35 datasets on
which a convolutional neural network is trained separately and then averaged as in (b). Images
from (Ciressan et al., 2012).

In several of their works (Cireşan et al., 2010; Ciresan et al., 2011a; Ciressan et al., 2012), the
authors use, as a non-linear activation function within the hidden layers of the MLPs (Cireşan et
al., 2010) or within the fully connected and convolutional layers of the convolutional networks
(Ciressan et al., 2012), a scaled hyperbolic tangent suggested by (LeCun et al., 1998):

g(x) = 1.7159 \tanh(0.6666 x)

For the classification output, a softmax is considered. Interestingly, learning is performed using the
"good old on-line back-propagation" (Cireşan et al., 2010) without momentum or any other tricks,
except an exponentially decreasing learning rate schedule and an initialization of the weights uni-
formly in [-0.05, 0.05]. Note however that this is possible because fast GPU implementations permit
training the networks for many epochs. As stated by the authors, one convolutional neural network
such as on figure 17.17a took 14 hours to train on GPUs, which would have easily taken a month
on a CPU. As reported by the authors, the final architecture ranks first on the MNIST
classification:

Network architecture    Misclassification on the test set    Reference
CNN                     0.70 %                               (Lecun et al., 1998)
CNN                     0.40 %                               (Simard et al., 2003)
CNN                     0.39 %                               (Ranzato et al., 2006)
MLP                     0.35 %                               (Cireşan et al., 2010)
CNN committee           0.27 %                               (Ciresan et al., 2011a)
MCDNN                   0.23 %                               (Ciressan et al., 2012)

Figure 17.18: Examples of classes from the ImageNet dataset. (Russakovsky et al., 2014).

The SuperVision deep convolutional neural network of (Krizhevsky et al., 2012) ranks first
in the ImageNet classification problem (Russakovsky et al., 2014). The 2012 ImageNet challenge
consisted in classifying 1000 different categories of objects. The training set was made of 1.2
million images, the validation set of 50,000 images and the test set contained 100,000 images. Some
examples of the ImageNet dataset are shown on fig 17.18. The SuperVision network of (Krizhevsky
et al., 2012) is built from 7 hidden layers: 5 convolutional layers followed by 2 fully connected
layers. The output layer uses a soft-max transfer function. The hidden layers have a rectified
linear transfer function. The total number of parameters to learn reaches 60 million. With so many
parameters to learn, the authors proposed to extract random patches of 224 × 224 pixels from the
256 × 256 pixel images in order to augment the dataset. Learning uses stochastic gradient descent
with dropout (probability of 0.5 to set the output of a hidden unit to 0), a momentum of 0.9 and
a weight decay of 0.0005. The weights are initialized according to a specific scheme detailed in
(Krizhevsky et al., 2012), basically relying on normally distributed weights and unit or zero biases
depending on the layer. The learning rate is adapted through a heuristic which consists in dividing
it by 10 when the validation error stops improving. According to the authors, it took about a week
to train the network on two GPUs, involving 90 epochs over the whole dataset of 1.2 million images.
Recently, (Krizhevsky, 2014) introduced a new way to make the training of convolutional neural
networks on GPUs faster.
Chapter 18

Recurrent neural networks

18.1 Dealing with temporal data using feedforward networks

Suppose we want to learn a predictor for which the current decision depends on the current input
and on the previous inputs. One can actually solve such a task with a feedforward neural network by
extending the input layer with, say, n parts, each fed with one sample from x_t, x_{t-1}, x_{t-2}, \ldots, x_{t-n+1}.
The input of such a network is called a tapped delay line and the feedforward network built from
such an input, a time-delay neural network (TDNN) (Waibel et al., 1989).


Figure 18.1: A time-delay neural network (TDNN) takes the history into account for its decision
by being fed by a sliding window over the input. The main limitation of such a network is that the
maximal amount of previous inputs the network can integrate is fixed in the length of the delay
line.
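Building the tapped delay line only amounts to reshaping the time series into overlapping windows, as in the following sketch (array and function names are my own):

```python
import numpy as np

def tapped_delay_line(signal, n):
    """Build the TDNN input matrix: row t contains x_t, x_{t-1}, ..., x_{t-n+1}."""
    windows = [signal[t - n + 1:t + 1][::-1] for t in range(n - 1, len(signal))]
    return np.stack(windows)

x = np.arange(10.0)            # a toy scalar time series
X = tapped_delay_line(x, n=4)  # each row can be fed to a feedforward network
print(X[0])                    # [3. 2. 1. 0.]
```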

The main issue with such a network is that the history that can be used to make the prediction
is dependent on the predefined length of the delay line. In addition, the TDNN is using separate
weight vectors to extract the information from the samples of the different time steps which is not
always optimal, especially if one needs to extract the same piece of information from the input but
at different time steps. This means we introduce several weights that must be trained to perform
the same work and this might impair generalization.
Rather than relying on a predefined time window over the input, a recurrent neural network
can learn to capture and integrate the sufficient amount of information from past inputs in order
to make a correct decision.


18.2 General recurrent neural network (RNN)


18.2.1 Architecture
The feedforward neural networks introduced in the previous chapter form an acyclic graph. When
one allows for cycles in the connection graph, one builds up recurrent neural networks such as the
one depicted on figure 18.2.


Figure 18.2: A recurrent neural network is a neural network with cycles. The outputs may or may
not be fed back to the network. The output is influenced by the hidden units and may also receive
direct excitation from the input.

Recurrent neural networks are particularly well suited when working with datasets where the
decision requires some form of memory, such as for example when working with time series (e.g.
speech signals); hopefully, the recurrent neural network will learn the dependency in time needed
to correctly predict the current output. Depending on the application, it might be required that
the output feeds back to the hidden layers as illustrated on fig. 18.2. To describe the computations
within a recurrent neural network, we use the same notations as in (Jaeger, 2002):

• u_i, i ∈ [1, K] denote the input unit activities,

• x_j, j ∈ [1, N] denote the hidden or internal state activities,

• y_k, k ∈ [1, L] denote the output unit activities.

The units are interconnected with weighted connections denoted:

• W^{in} ∈ R^{N×K}, the weight matrix from the inputs u_i to the hidden units x_j,

• W^{back} ∈ R^{N×L}, the weight matrix from the outputs y_k to the hidden units x_j,

• W ∈ R^{N×N}, the weight matrix between the hidden units x_j,

• W^{out} ∈ R^{L×(K+N)}, the weight matrix from the input and hidden units to the output units.

The biases of the hidden and output units are denoted b_x and b_y. Similarly to the case of
feedforward neural networks, we may use different transfer functions for the hidden units and
the output units. Therefore, we introduce the transfer functions f and f^{out} for respectively the
hidden and output units. We can now write down the equations that rule the activities within
the recurrent neural network. The network is supposed to be initialized to some initial state
x(0) = x_0, y(0) = y_0:

\forall t > 0, \quad x(t) = f\left(W^{in} u(t) + W x(t-1) + W^{back} y(t-1) + b_x\right)    (18.1)

y(t) = f^{out}\left(W^{out} \begin{pmatrix} u(t) \\ x(t) \end{pmatrix} + b_y\right)

18.2.2 Real Time Recurrent Learning (RTRL)


In (Williams and Zipser, 1989), the authors introduced the real time recurrent learning algorithm
which is a gradient descent method applied to recurrent neural networks. We will follow in part the
presentation given in (Jaeger, 2002). We consider an epoch of T timesteps. A loss is introduced
between the outputs y (t) , t ∈ [1, T ] and the desired outputs d (t) , t ∈ [1, T ] :
L = \sum_{t=1}^{T} L^t = \sum_{t=1}^{T} L(y(t), d(t))

In order to perform a (possibly stochastic) gradient descent of the loss, we need to compute
the derivative of the loss with respect to the weights and biases. All the weights, whether feeding
the hidden or output units, contribute to the loss. Without loss of generality, we formulate the
derivative with respect to a weight we denote wk,l :
\forall w_{k,l}, \quad \frac{\partial L}{\partial w_{k,l}} = \sum_{t=1}^{T} \sum_{i=1}^{L} \frac{\partial L^t}{\partial y_i(t)} \frac{\partial y_i(t)}{\partial w_{k,l}}

This requires computing the derivatives of the output activities with respect to the weights and
biases. Let us just make explicit the derivatives with respect to the weights. From eq. (18.1),
these derivatives read:

\frac{\partial y_i(t)}{\partial w_{k,l}} = f^{out\,\prime}(a_i(t)) \left[\frac{\partial W^{out}}{\partial w_{k,l}} \begin{pmatrix} u(t) \\ x(t) \end{pmatrix} + W^{out} \begin{pmatrix} \partial u(t)/\partial w_{k,l} \\ \partial x(t)/\partial w_{k,l} \end{pmatrix}\right]_i
                                        = f^{out\,\prime}(a_i(t)) \left[\frac{\partial W^{out}}{\partial w_{k,l}} \begin{pmatrix} u(t) \\ x(t) \end{pmatrix} + W^{out} \begin{pmatrix} 0 \\ \partial x(t)/\partial w_{k,l} \end{pmatrix}\right]_i

a_i(t) = \left[W^{out} \begin{pmatrix} u(t) \\ x(t) \end{pmatrix} + b_y\right]_i

We cannot go much further without specifying with respect to which weight the derivative is
actually computed; depending on this weight, one would actually get various formulas. For
example, for a weight of the output matrix:

\frac{\partial y_i(t)}{\partial w^{out}_{k,l}} = f^{out\,\prime}(a_i(t)) \left[\frac{\partial W^{out}}{\partial w^{out}_{k,l}} \begin{pmatrix} u(t) \\ x(t) \end{pmatrix} + W^{out} \begin{pmatrix} 0 \\ \partial x(t)/\partial w^{out}_{k,l} \end{pmatrix}\right]_i

The matrix \frac{\partial W^{out}}{\partial w^{out}_{k,l}} is full of zeros with a single 1 at line k, column l. If one computes
the derivative with respect to a weight feeding the hidden layer, the term \frac{\partial W^{out}}{\partial w_{k,l}} vanishes
but the other remains. For example:

\frac{\partial y_i(t)}{\partial w^{in}_{k,l}} = f^{out\,\prime}(a_i(t)) \left[\frac{\partial W^{out}}{\partial w^{in}_{k,l}} \begin{pmatrix} u(t) \\ x(t) \end{pmatrix} + W^{out} \begin{pmatrix} 0 \\ \partial x(t)/\partial w^{in}_{k,l} \end{pmatrix}\right]_i
                                             = f^{out\,\prime}(a_i(t)) \left[W^{out} \begin{pmatrix} 0 \\ \partial x(t)/\partial w^{in}_{k,l} \end{pmatrix}\right]_i
Whatever the weight we consider, the derivatives of the output activities require computing
the derivatives of the hidden layer with respect to the weights as well. Similarly to the derivatives
of the output layer activities, one would derive eq. (18.1) with respect to the weight and end up
with some formula that we can summarize as:

\frac{\partial x(t)}{\partial w^{in}_{k,l}} = g\left(\frac{\partial x(t-1)}{\partial w^{in}_{k,l}}, \frac{\partial y(t-1)}{\partial w^{in}_{k,l}}, x(t-1), y(t-1), u(t), W^{in}, W, W^{back}, b_x, f\right)

Note that the derivatives of the output activities y(t) with respect to some hidden layer weight
depend on the derivatives of the hidden activities x(t) at time t, which are themselves computed
from the derivatives of the hidden and output activities at time t − 1; overall, all the derivatives
at time t can be computed from the derivatives at time t − 1. As the initial state is independent
of the weights, the initial conditions read:

\frac{\partial x(0)}{\partial w_{k,l}} = 0, \qquad \frac{\partial y(0)}{\partial w_{k,l}} = 0

To compute the gradient of the loss, we need to compute the derivatives of all the hidden
and output units (N + L units), for every time step (T steps), with respect to every weight
(N² + 2N·L + K·N weights), which leads to a very expensive computational cost. This is
the main drawback of this method compared to other methods such as the backpropagation
through time presented in the next section. If the number of hidden units dominates the number
of inputs and outputs, the time complexity of one step is of the order of N⁴. However, RTRL is a
forward differentiation method, meaning that the derivatives of the loss are computed at each time
step and therefore the parameters can be updated online, i.e. at each time step.

18.2.3 Backpropagation Through Time (BPTT)


Backpropagation through time was introduced in (Werbos, 1988). Contrary to RTRL which is
a forward differentiation algorithm, BPTT requires a forward and backward pass similar to the
backpropagation applied in the context of feedforward neural networks. Actually, BPTT is simply
the backpropagation applied to the recurrent neural network unfolded in time, where the network
at one time step is seen as a layer in a feedforward network of depth T , the number of time steps
of an epoch. The unfolding in time of the network is shown on figure 18.3 from (Sutskever, 2013).

Figure 18.3: Backpropagation through time is backpropagation applied to the recurrent neural
network unfolded in time where the network at each time step is considered as one layer of a
feedforward neural network of depth T . The notations of the illustration differ slightly from the
ones used in the previous section. The figure is from (Sutskever, 2013).

Derivation of the BPTT algorithm is an application of backpropagation to the unfolded network,
and the reader interested in the equations is referred to (Sutskever, 2013; Jaeger, 2002). Contrary
to the RTRL algorithm, BPTT is a batch algorithm which requires computing the full forward
pass before backpropagating the gradient of the loss. However, it is computationally less expensive
than RTRL; if the number of hidden units dominates the number of inputs, each time step requires
of the order of N² operations for the forward and backward pass and therefore the time complexity
of one epoch is of the order of T·N².

18.2.4 What about the initial state ?


Because the recurrent neural networks are recurrent, we need to define an initial state for the
recurrent neurons, i.e. the units whose activities at time t − 1 are required to compute some
activities at time t. This includes the hidden units but also the output units when these are fed
back to the hidden units. One could set the initial activities of these units arbitrarily to 0.
However, this is not guaranteed to be the optimal choice. Another possibility is to treat the initial
state as a variable to be optimized: in a gradient descent method, one can therefore compute the
derivatives of the loss with respect to the initial state and follow the negative gradient.

18.3 Echo state networks


Echo state networks (Jaeger, 2004) are a particular type of recurrent neural networks in which
the recurrent weights are predefined and fixed and one only seeks to learn the projection from the
hidden layer to the output. An echo state network is depicted on fig 18.4.

Figure 18.4: In an echo state network (ESN), the recurrent weights are predefined and fixed; only
the weights from the hidden to output layers are trained. The figure is from (Lukosevicius, 2012).

The description of echo state networks follows (Lukosevicius, 2012). The hidden units in the
ESN are leaky integrators and the output unit activities are linear with respect to the hidden (and
possibly input) units. The output units are called readout units. From eq. (18.1), with a linear
output transfer function f^{out}(x) = x and no feedback connections from the output to the hidden
layer (W^{back} = 0), the update equations read:

x(n) = (1 - \alpha) x(n-1) + \alpha \tanh\left(W^{in} u(n) + W x(n-1) + b_x\right)

y(n) = W^{out} \begin{pmatrix} u(n) \\ x(n) \end{pmatrix} + b_y

This is the network one would consider for a regression problem: the fixed, predefined recurrent
network extracts features from the input stream and a linear readout learns to map the features
to the output to regress. In the context of a classification problem, one would use a non linear
output transfer function such as the softmax, which constrains the output activities to lie in [0,1]
and to sum up to 1. Learning the weights from the recurrent network (or reservoir) to the output
is not the biggest issue with ESN. In the case of a single output, if the sequence is not too long and
the hidden layer not too large, one could compute the optimal weights from the Moore-Penrose
pseudo inverse, e.g. w^{out} = (X X^T)^{-1} X y where X gathers the hidden states during all time
steps and y the sequence of outputs to predict. One could also apply online learning to the readout
weights. The regularizations of the readout weights introduced with feedforward neural networks
(e.g. the L2 penalty) apply in this context as well. The interested reader is referred to (Lukosevicius,
2012) for more information on this.

The size of the reservoir (hidden layer) is usually taken to be as big as possible, hopefully
enriching the hidden representation from which the readout is computed. In (Triefenbach et al.,
2010), the authors make use of reservoirs with 20,000 hidden units.

One big issue with ESN is to define the input to hidden weight matrix W^{in}, the recurrent
hidden weight matrix W and the leaking rate α. The hidden recurrent weight matrix is usually
generated as a sparse matrix, as it turns out that in practice it gives better results than a dense
matrix, and numerical computation libraries can compute operations with sparse matrices
efficiently, which speeds up the evaluation of the network. The input matrix W^{in} is a dense
matrix. Various distributions are used to generate the coefficients of the matrices, such as a
uniform or Gaussian distribution. It is usually advised to scale the hidden recurrent weights W
so that their spectral radius (largest eigenvalue in magnitude) is strictly smaller than 1, although
a spectral radius of 1 is not always optimal (Lukosevicius, 2012). The spectral radius influences
how quickly the influence of the inputs on the reservoir activities fades. If one thinks of the update
of the reservoir as a repeated application of the weight matrix W, a small spectral radius makes
the contribution of an input integrated at some time step vanish more quickly (exponentially).
The leak factor α of the leaky integrator influences how quickly the dynamics of the reservoir
evolve. If the input or output time series evolve quickly over a few time steps and the leak factor
is set too small, the dynamics of the reservoir will not be fast enough, as the reservoir keeps a
strong inertia from its previous state. We will not go further into the details of how to set up a
reservoir, as various elements can be found in (Lukosevicius, 2012; Jaeger, 2002). Actually, all the
previous details seem to favor a careful design of the input and hidden layers of the ESN, yet it
seems that much simpler (more constrained) architectures still perform favourably, as presented
in (Rodan and Tiño, 2011). To finish this section on ESN, Mantas Lukoševičius provides source
code on his website (http://minds.jacobs-university.de/mantasCode) with implementations of
ESN in various programming languages.
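To summarize the pipeline, here is a compact NumPy sketch of an ESN: a fixed random reservoir is driven by the input sequence and a linear readout is learned by ridge regression (L2 penalty). The scalings, sparsity level and the toy task are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np

def run_reservoir(U, N=100, alpha=0.3, rho=0.9, seed=0):
    """Drive a random, fixed reservoir with the input sequence U (T, K); return the states (T, N)."""
    rng = np.random.default_rng(seed)
    K = U.shape[1]
    W_in = rng.uniform(-0.5, 0.5, size=(N, K + 1))                          # dense input weights (+ bias)
    W = rng.uniform(-0.5, 0.5, size=(N, N)) * (rng.random((N, N)) < 0.1)    # sparse recurrent weights
    W *= rho / np.abs(np.linalg.eigvals(W)).max()                           # rescale the spectral radius
    x = np.zeros(N)
    states = []
    for u in U:
        x = (1 - alpha) * x + alpha * np.tanh(W_in @ np.append(u, 1.0) + W @ x)
        states.append(x.copy())
    return np.array(states)

# Readout learned by ridge regression (L2 penalty) on the collected states.
rng = np.random.default_rng(1)
U = rng.normal(size=(500, 1))
y = np.roll(U[:, 0], 3)                       # toy task: recall the input presented 3 steps ago
X = run_reservoir(U)
lam = 1e-2
w_out = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
```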

18.4 Long Short Term memory (LSTM)


18.4.1 Architecture
Recurrent neural networks share the same vanishing/exploding gradient issue that we encounter
with feedforward neural networks because, by nature, recurrent neural networks are deep networks,
and the issue of vanishing/exploding gradients limits the ability of a recurrent neural network to
capture long term dependencies, i.e. to remember some view of the inputs presented a long time
before and required for the current prediction. The analysis carried out in (Hochreiter, 1991) for
feedforward neural networks led the same author to introduce in (Hochreiter and Schmidhuber,
1997) a specific recurrent neural network architecture which does not face the vanishing/exploding
gradient issue. The LSTM architecture is built with a specific unit called the memory cell,
depicted on fig 18.5. The idea behind the memory cell is to be able to keep a stored piece of
information unperturbed during an arbitrarily long time period. This helps in learning long term
dependencies, i.e. dependencies between an input at some time x_{t-τ} and the output y_t for "long"
time delays τ. To do so, the memory cell input and output are controlled by gates which are
themselves dependent on the activities of the other units within the network. The original LSTM
memory cell introduced by (Hochreiter and Schmidhuber, 1997) is depicted on fig 18.5.
The input and output gates modulate the entrance and "release"1 of information. This basic
memory unit was further extended in (Gers et al., 2000), which introduces a forget gate. Indeed,
modification of the content memorized in an LSTM memory unit is done by imposing a new input
and opening the input gate. The motivation of the forget gate introduced in (Gers et al., 2000) is
to modify the recurrent pathway gain from f_t = 1.0 in the original unit to a smaller positive value
(possibly f_t = 0.0 to get a full reset) so that the stored information vanishes.
With the input, output and forget gates, the full equations (Graves, 2013) of the LSTM unit
read:

i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)
f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)
c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)
o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_{t-1} + b_o)
h_t = o_t \tanh(c_t)
1 This is not strictly a release, as the information is not dropped from the memory unit when the output gate
opens, but rather influences the other units.



Figure 18.5: The original LSTM memory cell introduced in (Hochreiter and Schmidhuber, 1997) is
able to memorize an information. This information is protected from the perturbations provided
by the other units within the network by an input gate. The other units are protected from the
influence of the memory cell by an output gate. The network has to learn when to memorize and
release a piece of information. LSTM memory units can be arranged to build up memory cells
which share their input and output gates. Figure adapted from (Gers et al., 2000).

Figure 18.6: Modified LSTM unit with a forget gate which modulates the gain of the recurrent
memory feedback pathway which was originally set to 1.0. A full reset would be obtained with a
gain ft = 0. The figure is from (Gers et al., 2000).

The above equations simply state that all the hidden and input units contribute to the input,
output and forget gating of a unit, as well as to the definition of the potential new input to store
in the cell. In order to keep a stored value unchanged, the input gate has to be closed (it ≈ 0) and
the forget gate open (ft ≈ 1). To replace the content of the memory cell, it is sufficient to close
the forget gate (ft ≈ 0) and to open the input gate (it ≈ 1). In the original LSTM unit, the forget
factor was always set to 1.0 and therefore replacing the content of the memory cell would require
some amount of time since, when ft = 1, we get ct = ct−1 + c∗. Training of such a network can be
performed by applying algorithms such as Real Time Recurrent Learning or Backpropagation
Through Time, which we introduced in the previous sections.
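As an illustration, here is a minimal NumPy sketch of a single forward step implementing the gate equations above (peephole variant); the parameter names (W_xi, W_hi, ..., stored in a dictionary) and the sizes are arbitrary choices made for the example, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following the equations above; p maps parameter names to arrays."""
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["W_ci"] @ c_prev + p["b_i"])
    f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["W_cf"] @ c_prev + p["b_f"])
    c_t = f_t * c_prev + i_t * np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["W_co"] @ c_prev + p["b_o"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# Tiny usage example with random parameters (illustrative sizes).
rng = np.random.default_rng(0)
n_x, n_h = 3, 5
p = {name: rng.normal(scale=0.1, size=(n_h, n_x if name.startswith("W_x") else n_h))
     for name in ["W_xi", "W_hi", "W_ci", "W_xf", "W_hf", "W_cf",
                  "W_xc", "W_hc", "W_xo", "W_ho", "W_co"]}
p.update({name: np.zeros(n_h) for name in ["b_i", "b_f", "b_c", "b_o"]})
h, c = np.zeros(n_h), np.zeros(n_h)
for x_t in rng.normal(size=(10, n_x)):
    h, c = lstm_step(x_t, h, c, p)
```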
In some situations, for example in speech processing, it might be helpful to consider both past
and future (to some extent) inputs in order to classify the current input. For example, when one is
speaking, there are co-articulation effects where the next phoneme to be pronounced influences the
end of the previous one. In this context, (Graves and Schmidhuber, 2005) introduced bidirectional
LSTM, which consists in two LSTM networks processing the input, one in the forward direction
and the other in the backward direction. The classification at a given time step can then be influenced
by both the past and future contexts. There has also been successful work on unsegmented data
(Graves et al., 2006), where the network is directly fed with the continuous signal (e.g. the full
speech signal) rather than with chunks (e.g. phonemes) that have been segmented in a
preprocessing step.

18.4.2 Example of applications


Recurrent neural networks, especially LSTM based neural networks, are among the best performing
methods on benchmarks involving sequential data (Schmidhuber, 2015). One example of
application domain is dialog systems, with benchmarks on, for example, speech synthesis or
speech recognition, which were usually addressed using Hidden Markov Models. There are also
applications on handwritten texts where the objective is to transcribe a handwritten text
into a computer string. In these contexts, bidirectional LSTM networks were applied successfully
and reached state of the art performances (Graves et al., 2013; Graves and Schmidhuber, 2005;
Graves et al., 2006). Bidirectional LSTMs are made of two LSTM networks, one fed with the signal
from its beginning and the other fed with the signal from its end. The point of using bidirectional
LSTMs is to improve the predictor by taking into account both the past context and the future
context, for example in order to take coarticulation into account. Coarticulation is the fact that the way a
phoneme is pronounced depends on the previous and next phonemes to be pronounced. Actually,
the same phenomenon occurs when one is handwriting. LSTMs also perform extremely well
on unsegmented data. A classic approach to transcribe a speech signal is to segment it into
phonemes that are each individually classified. That might be feasible but actually turns out to be
unnecessary with LSTM RNNs. When one wants to transcribe a handwritten word, the same issue
occurs as the letters are glued together in the word.

Other applications of recurrent neural networks focus on building generative models (Graves,
2013; Sutskever et al., 2011). In (Sutskever et al., 2011), the authors train a specific RNN architecture
that they call the multiplicative RNN, where the recurrent connections are gated by learned
functions of the inputs. The authors train their network on a character prediction task. Then,
they apply the network to generating sentences: the idea is to feed the network with a seed, i.e. a
small piece of text, take the most probable output character, feed it back into the network as the
next input, and iterate. Some samples of generated texts are given in (Sutskever et al.,
2011), where it is interesting to note that the network was able to capture some sense of English
with a pretty good grammar. In (Graves, 2013), the author trains an LSTM network to predict
and generate texts. The author also trains LSTM RNNs on a handwriting task where the dataset
consists of the x, y coordinates of a pen used to write a sentence, as well as its up/down position. The
network is successfully trained to predict the next position of the pen. The author goes even further
by actually synthesizing handwritten sentences from computer strings. A demo is available
on the website of Alex Graves (http://www.cs.toronto.edu/~graves/handwriting.html) and
illustrated on fig. 18.7.

Figure 18.7: A handwritten sentence generated by the recurrent neural network of (Graves, 2013)
and fed with the sentence “A LSTM network generating handwritten sentences”.

Recently, it has been proposed to combine deep feedforward networks (convolutional neural
networks) with recurrent neural networks (bidirectional LSTM) in order to produce captions of
images (Karpathy and Li, 2014).
Chapter 19

Energy based models

19.1 Hopfield neural networks


19.1.1 Definition
Hopfield networks are recurrent neural networks with binary threshold units and symmetric
connections. If we denote si ∈ {−1, 1} the state of neuron i, bi its bias and wij = wji the weight of
the connection between neurons i and j, the update rule for the state si is simply:

∀i, si(t + 1) = 1 if Σ_j wij sj(t) + bi > 0,   si(t) if Σ_j wij sj(t) + bi = 0,   −1 otherwise.    (19.1)

We further suppose that there can be self-connections but that the weights of self-connections are
restricted to be positive :
∀i, wii ≥ 0
One can then define the following energy (Lyapunov) function1:

E = − Σ_i si bi − (1/2) Σ_{i, j≠i} si sj wij − Σ_i wii si    (19.2)

We can rewrite the energy to isolate the terms in which the state of a specific neuron is involved :
∀k, Ek = −sk bk − wkk sk − Σ_{j≠k} wkj sk sj − (1/2) Σ_{i≠k} Σ_{j≠i, j≠k} si sj wij − Σ_{i≠k} si bi − Σ_{i≠k} wii si

From this expression, we can compute the energy gap, i.e. the difference in energy when the neuron
k is in state 1 and when it is in state −1:

∀k, ∆Ek = Ek(sk = 1) − Ek(sk = −1) = 2(−bk − wkk − Σ_{j≠k} wkj sj)

When updating neuron k, if its state was sk(t) = 1 and the update makes it turn off, sk(t + 1) = −1,
then according to eq. (19.1) this means that Σ_j wkj sj(t) + bk = Σ_{j≠k} wkj sj + wkk + bk < 0. Therefore,
∆Ek > 0. The update produces a modification of the energy of −∆Ek < 0 and therefore the
energy is strictly decreasing. If the neuron was in state sk(t) = −1 and the update makes it turn
on, sk(t + 1) = 1, this means that Σ_j wkj sj(t) + bk = Σ_{j≠k} wkj sj(t) − wkk + bk > 0. Given that −wkk ≤ 0,
this implies ∆Ek < 0. The update produces a modification of the energy of ∆Ek < 0 and therefore the energy
is again strictly decreasing. In case the neuron does not change its state, the energy is constant.
1 It shall be noted that this formulation of the energy function must be modified if we allow the neuron to change
its state when the net input equals 0; in this case, see the work of Floréen and Orponen, Complexity issues in Discrete
Hopfield Networks.


Therefore, sequential updates make the energy function of eq. (19.2) decrease. The energy
function is strictly decreasing whenever a neuron changes its state and is constant otherwise. Given
that there is a finite number of states, we can conclude that the network will converge in a finite
number of iterations, the fixed point being a local minimum of the energy function.

19.1.2 Example
We consider a Hopfield network with 100 binary neurons with states in {−1, 1}. The weights
are symmetric and randomly generated in [−1, 1]. Self-connections are restricted to be positive
(random in [0, 1]). The biases are randomly generated in [−1, 1]. At each iteration, we randomly
choose one of the productive rules (if any), i.e. one of the updates that would change the state of
a neuron; the updates are stopped whenever there is no more productive rule to apply. The energy
as a function of the number of productive rules applied is shown on figure 19.1. On this example,
it took 97 iterations before reaching a minimum of the energy function.

Figure 19.1: Evolution of the energy of a 1D Hopfield network as a function of the number of productive
rules applied
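The example above can be reproduced with a few lines of NumPy; the following sketch (with an arbitrary random seed) applies randomly chosen productive rules and records the energy (19.2), which is guaranteed to decrease.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
W = rng.uniform(-1, 1, (n, n))
W = (W + W.T) / 2                             # symmetric weights in [-1, 1]
np.fill_diagonal(W, rng.uniform(0, 1, n))     # positive self-connections
b = rng.uniform(-1, 1, n)
s = rng.choice([-1, 1], size=n)               # random initial state

def energy(s):
    """Energy function of Eq. (19.2)."""
    off = W - np.diag(np.diag(W))
    return -s @ b - 0.5 * s @ off @ s - np.diag(W) @ s

energies = [energy(s)]
while True:
    # A rule is productive if applying the update (19.1) to neuron i would flip its state.
    productive = [i for i in range(n) if (W[i] @ s + b[i]) * s[i] < 0]
    if not productive:
        break
    i = rng.choice(productive)
    s[i] = 1 if W[i] @ s + b[i] > 0 else -1
    energies.append(energy(s))

print(len(energies) - 1, "productive rules applied, final energy", energies[-1])
```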

19.1.3 Training
Hopfield suggested that such a network could be used as a memory where patterns to be memo-
rized would be local minima of the energy function. When the network starts from a state close
to one minimum it will eventually relax to it. This means that, say, a picture might be completely
reconstructed from only a subpart of it.

To store a pattern p in a Hopfield network, we need to ensure that this pattern p is a minimum
of the energy function (19.2), which we can do by gradient descent on the energy function:

∆wii = αpi
∆wij = αpi pj
∆bi = αpi

If all the patterns to be stored are available, we can set the weights and biases to:

W = (1/N) Σ_i pi pi^T,    b = (1/N) Σ_i pi

where pi is a pattern considered as a column vector.



Example We consider a Hopfield neural network with 100 neurons, their states being either
−1 or 1. We first consider a single pattern to be memorized. This pattern is composed of four
segments alternately set to −1 and 1. We show on figure 19.2 the evolution of the states once
the weights have been set according to the learning rule (the batch version).

Figure 19.2: Evolution of a Hopfield network trained to store the pattern (−1, −1, ...1, 1, .. −
1, −1, ..., 1, 1)

It is not shown here, but the learning rule above is specific to the states −1, 1. If we
use the same learning rule with states 0, 1 and run the same example, we may keep random values
in the region where the pattern is 0, as the weights and biases for these neurons equal zero and
therefore their state never leaves its initial value.
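The following sketch illustrates this example: the segment pattern is stored with the batch rule above (a single pattern, so N = 1) and then recalled from a corrupted copy; the number of corrupted components is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
# Four segments alternately set to -1 and +1, as in the example above.
pattern = np.repeat([-1, 1, -1, 1], n // 4)

patterns = [pattern]                          # here a single pattern (N = 1)
W = sum(np.outer(p, p) for p in patterns) / len(patterns)
b = sum(patterns) / len(patterns)

# Start from a corrupted copy (30 flipped components) and relax the network.
s = pattern.copy()
flipped = rng.choice(n, size=30, replace=False)
s[flipped] *= -1

for _ in range(10):                           # a few asynchronous sweeps
    for i in rng.permutation(n):
        a = W[i] @ s + b[i]
        if a != 0:
            s[i] = 1 if a > 0 else -1

print("fraction of recovered components:", (s == pattern).mean())
```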

19.2 Restricted Boltzmann Machines


Boltzmann Machines (Hinton and Sejnowski, 1986) and restricted Boltzmann Machines are stochastic
generative neural networks. The feedforward and recurrent neural networks presented in the
previous chapters are discriminative neural networks: in a classification context, they learn the
conditional probability of the label given the input. The Boltzmann machine and its restricted
version are generative models that seek to model the joint probability distribution of the labels
and inputs. While discriminative models can only output the label given an input, generative
models can output the label as well as input samples, by conditioning the learned joint
probability on some prior over the labels or over the inputs. A Boltzmann machine is a single
layer stochastic neural network. With binary activations, the activation of a unit is sampled from
a probability distribution parametrized by the weighted sum of its inputs (plus the bias). In
a restricted Boltzmann machine2, the units are divided into two subgroups, the so-called visible and
hidden units. One motivation for this distinction between hidden and visible units comes when one
wants to model some data generated by some unknown processes (some hidden causes). The only
thing we get is some observation of the system (the visible units) and we would like to infer the
hidden causes of these observations.
Denoting vi the activation of the visible units and hi the activation of the hidden units, we
introduce the energy function associated to a state v, h by eq. (19.9).
E(v, h) = Σ_i vi b^v_i + Σ_i hi b^h_i + Σ_{i,j} vi hj w^{hv}_{ij} + Σ_{i<j} vi vj w^v_{ij} + Σ_{i<j} hi hj w^h_{ij}    (19.9)

The probability of a joint configuration (v, h) is defined as:

p(v, h) = e^{−E(v,h)} / Σ_{u,g} e^{−E(u,g)}

The term Σ_{u,g} e^{−E(u,g)} is called the partition function. The probability of a visible state or of a
2 historically introduced in (Smolensky, 1986), it was popularized by G. Hinton and collaborators who devised

efficient learning algorithms such as the contrastive divergence algorithm(Carreira-Perpiñán and Hinton, 2005)

hidden state is given by:

p(v) = Σ_g e^{−E(v,g)} / Σ_{u,g} e^{−E(u,g)}
p(h) = Σ_u e^{−E(u,h)} / Σ_{u,g} e^{−E(u,g)}

In the remainder of this section, we give some elements about restricted Boltzmann
machines with binary units and the way they can be trained. We refer the reader to (Hinton,
2012) for more details on training RBMs. RBMs are not restricted to binary units,
and variants with real valued activations have been proposed. RBMs have been applied successfully
to classification, where the hidden units can feed a feedforward network discriminating the class
or the label, while the input is used as the visible part of the RBM. Finally, RBMs
have been stacked to build up deep belief networks and deep Boltzmann machines (Hinton, 2009;
Salakhutdinov and Hinton, 2009).

19.2.1 RBM with binary units


A Restricted Boltzmann Machine is made of two layers called visible and hidden. The visible layer
is what is observed from a phenomenon while the hidden layer represents the hidden causes that
generate the observations. Each layer is a set of units denoted vi, i ∈ [0..n−1], hj, j ∈ [0..m−1], which
have binary values ∀i, vi ∈ {0, 1}, ∀j, hj ∈ {0, 1}. The visible units are connected to all the hidden
units and the hidden units are connected to all the visible units, but there is no connection within a
layer, which gives the name restricted to this type of Boltzmann machine. The lack of intra-layer
connections makes it possible to analytically derive update rules for the network. Such a network
is represented on figure 19.3.

Figure 19.3: Graphical representation of a restricted Boltzmann machine with n = 4 visible units
and m = 3 hidden units.

Let’s denote bv, bh the biases of respectively the visible and hidden units, and w the weight
matrix between the visible and hidden units. The weights are symmetric in the sense that if wij
is the weight between the visible unit i and the hidden unit j, the hidden unit j is also connected to
the visible unit i with the weight wij. We define an energy function E(v, h) which depends on the
state of the visible and hidden units as:

E(v, h) = − Σ_i vi b^v_i − Σ_j hj b^h_j − Σ_i Σ_j vi hj wij    (19.12)

From this energy function, we can define a probability over the states of the network as :

p(v = v, h = h) = (1/Z) e^{−E(v,h)}

where Z is a normalization factor, called the partition function and defined as:

Z = Σ_{v,h} e^{−E(v,h)}

The random variables v and h are discrete and the marginal of v is defined as:

p(v = v) = Σ_h p(v = v | h = h) p(h = h) = Σ_h p(v = v, h = h) = (1/Z) Σ_h e^{−E(v,h)}

We now introduce the free energy (which will make the derivations easier) as:

F(v) = − log( Σ_h exp(−E(v, h)) )

We can then rewrite the marginal of v and the partition function in terms of the free energy:

Z = Σ_v exp(−F(v)),    p(v = v) = exp(−F(v)) / Σ_{v'} exp(−F(v'))

With the expression of the energy function in equation (19.12) the free energy can be written
as :

F(v) = − log( Σ_h exp(−E(v, h)) )
     = − log( Σ_h exp(Σ_i vi b^v_i) exp(Σ_j hj b^h_j + Σ_{i,j} wij vi hj) )
     = − log( Π_i exp(vi b^v_i) Σ_h exp(Σ_j hj b^h_j + Σ_{i,j} wij vi hj) )
     = − Σ_i vi b^v_i − log( Σ_h exp(Σ_j hj (b^h_j + Σ_i vi wij)) )
     = − Σ_i vi b^v_i − log( Σ_h Π_j exp(hj (b^h_j + Σ_i vi wij)) )
     = − Σ_i vi b^v_i − log( Π_j Σ_{hj} exp(hj (b^h_j + Σ_i vi wij)) )
     = − Σ_i vi b^v_i − Σ_j log( Σ_{hj} exp(hj (b^h_j + Σ_i vi wij)) )

Since the hidden units have binary states, the expression can be further simplified :

F(v) = − Σ_i vi b^v_i − Σ_j log( 1 + exp(b^h_j + Σ_i vi wij) )

We are now interested in the conditional probabilities p(v|h) and p(h|v) which define the update

rules of the network. By definition of the conditional probabilities and of the free energy3 :

p(h = h | v = v) = p(h = h, v = v) / p(v = v)
                 = e^{−E(v,h)} / Σ_{h'} e^{−E(v,h')}
                 = exp(Σ_i vi b^v_i + Σ_j hj b^h_j + Σ_{i,j} vi hj wij) / Σ_{h'} exp(Σ_i vi b^v_i + Σ_j h'_j b^h_j + Σ_{i,j} vi h'_j wij)
                 = exp(Σ_j hj b^h_j + Σ_{i,j} vi hj wij) / Σ_{h'} exp(Σ_j h'_j b^h_j + Σ_{i,j} vi h'_j wij)
                 = Π_j exp(hj (b^h_j + Σ_i vi wij)) / Π_j Σ_{h'_j} exp(h'_j (b^h_j + Σ_i vi wij))
                 = Π_j [ exp(hj (b^h_j + Σ_i vi wij)) / Σ_{h'_j} exp(h'_j (b^h_j + Σ_i vi wij)) ] = Π_j p(hj = hj | v = v)

By identifying the two last terms, the conditional probabilities of the hidden components then
read:

p(hj = hj | v = v) = exp(hj (b^h_j + Σ_i vi wij)) / Σ_{h'_j} exp(h'_j (b^h_j + Σ_i vi wij))

Since the hidden units are binary, the update rules finally read:

p(hj = 1 | v = v) = exp(b^h_j + Σ_i vi wij) / (1 + exp(b^h_j + Σ_i vi wij)) = σ(b^h_j + Σ_i vi wij)

with σ the logistic function defined as σ(x) = 1/(1 + exp(−x)). By a similar derivation, since the network
is symmetric, we find:

p(vi = 1 | h = h) = σ(b^v_i + Σ_j hj wij)

These are classical update rules of neural networks.
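These conditional distributions translate directly into a sampling routine. The following NumPy sketch (with an illustrative weight matrix of shape (n, m), visible × hidden) computes p(hj = 1 | v) and p(vi = 1 | h) and draws Bernoulli samples from them.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_h_given_v(v, W, b_h):
    """p(h_j = 1 | v) = sigma(b^h_j + sum_i v_i w_ij); W has shape (n_visible, n_hidden)."""
    p_h = sigmoid(b_h + v @ W)
    return p_h, (rng.random(p_h.shape) < p_h).astype(float)

def sample_v_given_h(h, W, b_v):
    """p(v_i = 1 | h) = sigma(b^v_i + sum_j h_j w_ij)."""
    p_v = sigmoid(b_v + h @ W.T)
    return p_v, (rng.random(p_v.shape) < p_v).astype(float)
```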

19.2.2 Training
We now wish to train the network so that a higher probability is given to the training samples
clamped on the visible units and a lower probability to all the other samples. This will in effect
shape the energy landscape in favor of the training samples. We therefore wish to maximise the
probability Π_d p(v = v̂_d), where the v̂_d are the training data. Maximizing this product, which is called
the likelihood, is equivalent to maximizing the sum of the logarithms of the probabilities. This is a
technical point simplifying the derivations. The sum of the logarithms of the probabilities is called
the log likelihood, and we define a cost function as:

L(θ) = − Σ_d log(p(v = v̂_d | θ))

with θ the parameter vector that we now explicitly introduce in the notations to highlight the
dependency of the cost function on the parameters. To compute the gradient of the cost function,
3 for the last equality, we use the fact that the components of h are pairwise independent because there is no

connection within the hidden layer



we use the expression of the free energy :


− ∂ log p(v = v_d)/∂θ = ∂F(v_d)/∂θ + ∂ log(Σ_{v'} e^{−F(v')})/∂θ
                      = ∂F(v_d)/∂θ + (1/Σ_{v'} e^{−F(v')}) Σ_{v'} ∂e^{−F(v')}/∂θ
                      = ∂F(v_d)/∂θ − (1/Z) Σ_{v'} ∂F(v')/∂θ e^{−F(v')}
                      = ∂F(v_d)/∂θ − Σ_{v'} p(v = v') ∂F(v')/∂θ
                      = ∂F(v_d)/∂θ − E_p[ ∂F(v)/∂θ ]
In this expression, we recognize a term depending on the free energy of a training example and
a second term which is a sum over all the possible states and depends only on the model. In the
vocabulary of Boltzmann machines, the first term is called the positive phase while the second
one is the negative phase. These names do not merely refer to the sign of the expressions but
rather to the influence of these terms. The first term increases the probability of a training sample
(by reducing its free energy), while the second decreases the probability of the samples produced
by the model. It is the combined effect of both terms that will shape the energy landscape so
that the model generates our training data.

We now have to see how the two terms can be computed to update the parameters. We begin
with the positive phase which can be analytically derived by using the expression of the free
energy :
∂F(v_d)/∂θ = − Σ_i v_{d,i} ∂b^v_i/∂θ − Σ_j [1/(1 + exp(b^h_j + Σ_i v_{d,i} w_{i,j}))] ∂ exp(b^h_j + Σ_i v_{d,i} w_{i,j})/∂θ
            = − Σ_i v_{d,i} ∂b^v_i/∂θ − Σ_j [exp(b^h_j + Σ_i v_{d,i} w_{i,j})/(1 + exp(b^h_j + Σ_i v_{d,i} w_{i,j}))] (∂b^h_j/∂θ + Σ_i v_{d,i} ∂w_{ij}/∂θ)
            = − Σ_i v_{d,i} ∂b^v_i/∂θ − Σ_j σ(b^h_j + Σ_i v_{d,i} w_{i,j}) (∂b^h_j/∂θ + Σ_i v_{d,i} ∂w_{ij}/∂θ)
            = − Σ_i v_{d,i} ∂b^v_i/∂θ − Σ_j σ(b^h_j + Σ_i v_{d,i} w_{i,j}) ∂b^h_j/∂θ − Σ_{i,j} σ(b^h_j + Σ_i v_{d,i} w_{i,j}) v_{d,i} ∂w_{ij}/∂θ

We can now derive specific expressions for the biases and weights :
∀i, ∂F(v_d)/∂b^v_i = − v_{d,i}
∀j, ∂F(v_d)/∂b^h_j = − σ(b^h_j + Σ_i v_{d,i} w_{i,j})
∀i, j, ∂F(v_d)/∂w_{ij} = − σ(b^h_j + Σ_i v_{d,i} w_{i,j}) v_{d,i}

We can recognize Hebbian-like learning rules; the weights are updated by the product of a pre- and
a post-synaptic term. For the negative phase, the main difficulty is that we cannot compute the
marginal of the visible units p(v). We can however notice that the expectation can be approximated with
Monte-Carlo. Suppose that, even if we cannot compute the marginal, we can get samples from the
model. We would then consider a set N of samples drawn from p(v) and approximate the expectation
as:

E_p[ ∂F(v)/∂θ ] ≈ (1/|N|) Σ_{v ∈ N} ∂F(v)/∂θ

Given such samples, the sum can be computed, so what we need is a method for sampling from p(v).
Such methods are called Markov Chain Monte Carlo: starting from a given sample, we sample
alternately the visible and the hidden units. We would in principle need several iterations before
reaching a so-called thermal equilibrium. The thermal equilibrium is a kind of steady state where,
even if the states change (because of the stochastic update rules), the probabilities from which the
states are sampled are fixed. While in principle we would need to reach thermal equilibrium (and
this can actually take an unknown number of iterations), an approximate method called contrastive
divergence gives pretty good results. In contrastive divergence, we initialize the visible units
to a training sample, then update the hidden layer, the visible layer and the hidden layer again.
This is called CD-1, while in general CD-k considers k such updates. The visible state that we
sampled is called a reconstruction. The negative phase is computed only on the last visible and
hidden states.
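As a summary of this section, here is a minimal NumPy sketch of a CD-1 update as described above (positive phase on the data, negative phase on the reconstruction obtained after one Gibbs step); the learning rate, the sizes and the random data are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, b_v, b_h, lr=0.1):
    """One CD-1 step on a minibatch v0 of shape (batch, n_visible); updates W, b_v, b_h in place."""
    p_h0 = sigmoid(b_h + v0 @ W)                         # positive phase statistics
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)   # sampled hidden states
    p_v1 = sigmoid(b_v + h0 @ W.T)                       # reconstruction
    p_h1 = sigmoid(b_h + p_v1 @ W)                       # negative phase statistics
    batch = v0.shape[0]
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / batch
    b_v += lr * (v0 - p_v1).mean(axis=0)
    b_h += lr * (p_h0 - p_h1).mean(axis=0)

# Illustrative usage on random binary data.
n_v, n_h = 6, 3
W = 0.01 * rng.normal(size=(n_v, n_h))
b_v, b_h = np.zeros(n_v), np.zeros(n_h)
data = (rng.random((50, n_v)) < 0.5).astype(float)
for epoch in range(100):
    cd1_update(data, W, b_v, b_h)
```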
Part VI

Ensemble methods

Chapter 20

Introduction

In supervised learning, predictions are made based on an estimator built with a given learning
algorithm. Ensemble methods aim at combining the predictions of several base estimators in order
to improve generalization or robustness over a single estimator.
To get an informal idea of ensemble learning1 , consider Fig. 20.1. The top square corresponds
to a classification problem with positive (denoted “+”) and negative (denoted “−”) examples. As
estimators, we consider “decision stumps”, that is linear separators that take their decision based on a
single dimension of the input space (here, they must be vertical or horizontal), as illustrated by
the three squares in the middle. Now, if one takes a weighted combination of such decision stumps,
the classification problem can be solved, as illustrated on the bottom square.

Strictly speaking, most of the ideas presented in the following chapters apply to arbitrary
base estimators. However, they are often used with decision trees, so we start by presenting
this learning paradigm, before providing a (non-exhaustive, as usual) overview of the studied
ensemble methods.

20.1 Decision trees


Next, we briefly present decision trees (Breiman et al., 1984). This section is largely inspired
from2 Hastie et al. (2009, Ch. 9) and focuses on the CART (Classification and Regression Tree)
algorithm, of which C4.5 (Quinlan, 1993) is a classic competitor.

20.1.1 Basic idea


The idea of decision trees is to partition the input space into a set of rectangles and to fit a
simple (constant) model in each partition. As an illustration, consider Fig. 20.2, corresponding to
a regression problem with continuous output y and two-dimensional input x = (x1 , x2 )> ∈ (0, 1)2 .
The top right panel shows a partition of the input space made by a recursive binary tree (the top
left panel illustrates another partition, but that cannot be obtained with a recursive binary tree).
The five regions can be obtained as follows:

• first, construct two regions, depending on x1 ≤ t1 or x1 > t1 ;

• then, if x1 ≤ t1, construct two regions, depending on x2 ≤ t2 (region R1) or x2 > t2 (region R2);

• else, if x1 > t1, construct two regions, depending on x1 ≤ t3 (region R3) or x1 > t3;

• lastly, if x1 > t3, construct two regions, depending on x2 ≤ t4 (region R4) or x2 > t4 (region R5).
1 This is more precisely an illustration of boosting.
2 This book is an excellent general introduction to machine learning and is (legally and freely) available online:
http://statweb.stanford.edu/~tibs/ElemStatLearn/.


Figure 20.1: Combining weak learners to form a stronger learner. The figure is strongly inspired
from Schapire and Freund (2012).

Figure 20.2: Top left: this partition cannot be obtained with a recursive binary tree. Top right:
a partition corresponding to a binary tree. Bottom left: the corresponding tree. Bottom right: a
function associated to this tree (each leaf—that is, partition of the input space—is associated to a
constant value). The figure is taken from Hastie et al. (2009).

This model can be represented by the binary recursive tree shown on the bottom left panel. Inputs
are fed to the root (top) of the tree, and they are assigned to the left or right branch, depending
on whether the condition is satisfied or not, until reaching a leaf (terminal node). The
leaves correspond to the regions R1, . . . , R5. The corresponding regression model predicts y with
a constant cm in region Rm:

f(x) = Σ_{m=1}^5 cm 1{x ∈ Rm}.

The bottom right panel shows such a function, for the preceding tree and some constants cm .
An advantage of trees is their interpretability (one can easily see from the drawn tree the
stratification of the data leading to the predicted value). It remains to show how such a tree can be
built, depending on the problem at hand.

20.1.2 Building regression trees


Assume that we want to build a regression tree using the dataset D = {(xi , yi )1≤i≤n } with xi =
(xi,1 , . . . , xi,d )> ∈ Rd . The algorithm should learn the topology (through splitting variables and
split points) of the tree as well as values associated to each leaf, so as to minimize some learning
criterion (such as the empirical risk based on the `2 -loss).
First, assume that a partition in M regions (R1 , . . . , RM ) is fixed, and the regression model
associates a constant cm in each region:
f(x) = Σ_{m=1}^M cm 1{x ∈ Rm}.

To minimize the risk based on the ℓ2-loss, one should solve the following optimization problem:

min_{c1,...,cM} Rn(f) = min_{c1,...,cM} (1/n) Σ_{i=1}^n (yi − f(xi))².

The solution is easily obtained by setting the gradient (with respect to each cm) to zero:

ĉm = ave(yi | xi ∈ Rm) = Σ_{i=1}^n yi 1{xi ∈ Rm} / Σ_{i=1}^n 1{xi ∈ Rm}.

This is simply the empirical expectation of outputs corresponding to inputs belonging to region
Rm .
However, finding the best binary partition in terms of the risk Rn is much more difficult, and
even computationally infeasible in general. Hence, the idea is to proceed with a greedy algorithm.
We start with the whole dataset. Let j be a splitting variable (a component of the input) and s a
split point, and define the pair of half planes

R1 (j, s) = {xi ∈ D : xi,j ≤ s} and R2 (j, s) = {xi ∈ D : xi,j > s}.

Then, we search for the couple (j, s) that solves


 
min_{j,s} [ min_{c1} Σ_{xi ∈ R1(j,s)} (yi − c1)² + min_{c2} Σ_{xi ∈ R2(j,s)} (yi − c2)² ].

For a given choice of j and s (notice that 1 ≤ j ≤ d, as each input has d components, and that it
is enough to consider n − 1 split points, obtained by ordering the j-th components of the inputs in the
dataset), the inner minimization problem is solved with

ĉ1 = ave(yi |xi ∈ R1 (j, s)) and ĉ2 = ave(yi |xi ∈ R2 (j, s)).

Therefore, by scanning through each dimension of all the inputs, determination of the best pair
(j, s) is feasible. Then, having found the best split, we partition the data into the two resulting
regions and repeat the splitting procedure on each of them, and so on. This is repeated until a
stopping criterion is met, for example a maximum depth, a maximum number of leaves or a
minimum number of samples per leaf.
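The greedy search described above is easy to implement; the following NumPy sketch exhaustively scans the (j, s) pairs for a single regression node (the function and variable names are illustrative).

```python
import numpy as np

def best_split(X, y):
    """Return the best (j, s) pair for a regression node under the squared loss."""
    n, d = X.shape
    best_j, best_s, best_sse = None, None, np.inf
    for j in range(d):
        values = np.unique(X[:, j])
        # Candidate split points: midpoints between consecutive sorted values of feature j.
        for s in (values[:-1] + values[1:]) / 2:
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best_sse:
                best_j, best_s, best_sse = j, s, sse
    return best_j, best_s

# Illustrative usage on a toy one-dimensional problem (the split should land near 0.4).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (50, 1))
y = np.where(X[:, 0] <= 0.4, 0.0, 1.0) + 0.1 * rng.normal(size=50)
print(best_split(X, y))
```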
The stopping criterion matters. Clearly, a very large tree will overfit the data (consider
a tree with as many leaves as samples) while a too small tree might not capture the important
structure. For example, a decision stump (mentioned at the beginning of this chapter) is a tree with
two leaf nodes. Consider the example of Fig. 20.1: a decision stump cannot capture the structure of the
data, contrary to a slightly larger tree. A solution would be to prune the tree: one constructs a big
tree, then prunes it by collapsing some of its internal nodes according to some criterion. See Hastie
et al. (2009, Ch. 9) for more details. We do not study this further, as ensemble methods allow
using such trees (big trees for bagging, small trees for boosting, see the next chapters).

20.1.3 Building classification trees


For building a classification tree (y ∈ {1, . . . , K}), the risk based on the ℓ2-loss is not the best
choice; we should consider other criteria to split the nodes.
For a node m representing a region Rm with nm = Σ_{i=1}^n 1{xi ∈ Rm} data points (writing Dm =
{(xi, yi) ∈ D : xi ∈ Rm} the associated dataset), write

p̂_{m,k} = (1/nm) Σ_{xi ∈ Rm} 1{yi = k}    (20.1)

the proportion of class k observations in node m. We classify the observations in node m to class
k(m) = argmax1≤k≤K p̂m,k (majority class). Different measures Q(Dm ) of node impurity3 can be
considered:

• misclassification error,

Q(Dm) = (1/nm) Σ_{xi ∈ Rm} 1{yi ≠ k(m)} = 1 − p̂_{m,k(m)};    (20.2)

• Gini index,

Q(Dm) = Σ_{k ≠ k'} p̂_{m,k} p̂_{m,k'} = Σ_{k=1}^K p̂_{m,k} (1 − p̂_{m,k});

• cross-entropy,

Q(Dm) = − Σ_{k=1}^K p̂_{m,k} ln p̂_{m,k}.

These measures are illustrated in Fig. 20.3 in the case of binary classification.
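These three measures are straightforward to compute; the short sketch below (with an illustrative function name) evaluates them for the labels falling into a node.

```python
import numpy as np

def impurities(y, classes):
    """Misclassification error, Gini index and cross-entropy of a node containing labels y."""
    p = np.array([(y == k).mean() for k in classes])
    misclassification = 1.0 - p.max()
    gini = (p * (1.0 - p)).sum()
    cross_entropy = -(p[p > 0] * np.log(p[p > 0])).sum()
    return misclassification, gini, cross_entropy

# Example: a binary node with 80% of its samples in class 1.
y = np.array([1] * 8 + [0] * 2)
print(impurities(y, classes=[0, 1]))   # approximately (0.2, 0.32, 0.50)
```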
The tree is grown as previously. For a region Rm, consider a couple (j, s) of splitting variable
and split point, and write Dm,L(j, s) the resulting dataset of the left node (of size nmL) and
Dm,R(j, s) the dataset of the right node (of size nmR). The tree is grown by solving

min_{j,s} ( nmL Q(Dm,L(j, s)) + nmR Q(Dm,R(j, s)) )

for one of the preceding measures of impurity. As for the regression case, the tree can be pruned,
but we do not study this aspect.
3 In the regression case, the measure of impurity is Q(Dm) = (1/nm) Σ_{xi ∈ Rm} (yi − ĉm)², with ĉm = ave(yi | xi ∈ Rm).




Figure 20.3: Measure of node impurity for binary classification, as a function of the proportion p
in the second class.

20.1.4 More on trees


We have seen that decision trees are interpretable predictors that are quite easy to train. They have
other advantages. For example, they can handle categorical values (when one of the components
of the input takes a finite number of values), for example by binarizing them. They can be
extended to cost-sensitive learning quite easily. They can also handle missing input values (if some
components of some inputs are missing). See Hastie et al. (2009, Ch. 9) for more details on these
subjects.
Unfortunately, trees are unstable, in the sense that they have a high variance. Often, a small
change in the data can result in a very different series of split, thus on a different predictor. This
instability is mainly caused by the hierarchical nature of the process: the effect of an error in the
top split is propagated down to all of the splits below it. Bagging, to be presented in Ch. 21,
is a way to reduce the related variance. The smaller the tree, the lesser this variance effect, but
also the weaker the learner. Boosting, to be presented in Ch. 22, is a way to combine such weak
learners to form a strong learner (without the instability of trees).

20.2 Overview
In this section, we provide a brief overview of the ensemble methods studied next. As explained
before, such methods aim at combining the predictions of several base estimators (for example, the
trees studied above) in order to improve generalization or robustness over a single estimator.
We will distinguish mainly two families of ensemble methods:

• the bagging methods (or more generally averaging methods), to be presented in Ch. 21.
The underlying idea is to build several estimators, (more or less) independently and in a
randomized fashion, and to average their prediction. This leads to a reduction of variance for
the combined estimator, that makes it applicable for example to large and unpruned trees;

• the boosting methods, to be presented in Ch. 22. The idea is to build sequentially (weak)
base estimators in order to reduce the bias of the combined estimator (see for example
Fig. 20.1 for an illustration), the motivation being to combine weak learners to form a strong
learner.

There are other ensemble methods that we will not study here. We mention some of them for
completeness:

• Bayesian model averaging (BMA) combines a set of candidate models (e.g., for classification)
using a Bayesian viewpoint. Informally, let D be the dataset, ξ a quantity of interest (e.g.,
the prediction of the class for a given input) and write Mm each of the M candidate models.
BMA provides the posterior distribution on ξ conditioned on the examples in D by integrating
over models:

P(ξ | D) = Σ_{m=1}^M P(ξ | Mm, D) P(Mm | D) ∝ Σ_{m=1}^M P(ξ | Mm, D) P(D | Mm) P(Mm).

See Hoeting et al. (1999) for more on Bayesian model averaging and more generally Part VII
for an introduction to Bayesian Machine Learning (the above equations should become clear
in light of this part of the course material);
• mixture of experts combine local predictors considering possibly heterogeneous sets of fea-
tures. They can also be seen as a variation of decision trees4 , the difference being that the
tree splits are not hard decisions but rather soft (fuzzy or probabilistic) ones, and that the
model in each leaf might be more complex than a simple constant prediction. See Jacobs
et al. (1991); Jordan and Jacobs (1994) or Hastie et al. (2009, Ch. 9) for more on this subject;
• stacking is a way of combining heterogeneous estimators for a given problem. The basic
idea is to combine different estimators in a statistical way, that is by minimizing a given
risk. For example, consider that a classification problem is solved by using many different
classification algorithms, each one producing an estimator. Then, one can construct a
hypothesis space composed of, for example, all linear combinations of these estimators.
Then, a classification risk can be minimized over this hypothesis space. Generally, it can be
shown that the learned combination is at least as good as the best single estimator. For more
on this subject, see Wolpert (1992); Breiman (1996b); Smyth and Wolpert (1999); Ozay and
Vural (2012), for example.

4 As such, they might not be considered as an ensemble method, opinions vary on this subject.
Chapter 21

Bagging

Bagging stands for “bootstrap and aggregating”. The underlying idea is to learn several estimators
(more or less) independently (by introducing some kind of randomization) and to average their
predictions. The averaged estimator is usually better than the single estimators because it reduces
its variance. We motivate this informally. Assume that X1, . . . , XB are B i.i.d. random variables
of mean µ = E[X1] and of variance σ² = var(X1) = E[(X1 − µ)²]. Consider the empirical mean
µB = (1/B) Σ_{b=1}^B Xb. The expectation does not change, E[µB] = µ, while the variance is reduced
(thanks to decorrelation of the random variables), var(µB) = σ²/B.
In the supervised learning paradigm, the random quantity is the dataset D = {(xi , yi )1≤i≤n },
where samples are drawn from a fixed but unknown distribution. From this, an estimate fD is
computed by minimizing the empirical risk of interest (see Ch. 5). This is a random quantity
(through the dependency on the dataset) that admits an expectation. This estimator has also a
variance, which somehow tells how different will be the predictions if the dataset is perturbated
(this can also be linked to the Vapnik-Chervonenkis bound studied in Ch. 5).
Assume that we can sample datasets on demand. Then, let D1 , . . . , DB be datasets drawn
independently, and write fb = fDb the associated minimizer of the empirical risk. The averaged
estimator is fave = (1/B) Σ_{b=1}^B fb. This does not change the expectation, E[fave] = E[f1], but it
reduces the variance, var(fave) = var(f1)/B. This is of interest for example for decision trees: we
have seen in Sec. 20 that large and unpruned trees have a small bias but a large variance.
Unfortunately, it is not possible to sample datasets on demand, as the underlying distribution
is unknown. We have to do with the sole dataset we have. That is where bootstrapping is useful.

21.1 Bootstrap aggregating


Generally speaking, in statistics, bootstrapping refers to any method that relies on random sam-
pling with replacement. Here, bootstrapping is applied to the dataset. The basic idea is to
randomly draw datasets with replacement from the training data, each one having the same size
as the original training set. This is done B times, producing B bootstrap datasets, denoted Db .
Let S(D) be any quantity that can be computed from the data set. From the bootstrap sampling
one can compute the quantities S(Db ) and use them to compute any statistic of interest. This is
illustrated in Fig. 21.1.
Bootstrap aggregation, or bagging, consists in bootstrapping the dataset to get B datasets Db ,
1 ≤ b ≤ B, in learning a predictor fDb = fb for each of these datasets and then in averaging the
predictors:
fbag(x) = (1/B) Σ_{b=1}^B fb(x).    (21.1)

This can be seen as an approximation of the scheme described at the beginning of the chapter, the
approximation coming from the fact that the empirical distribution is used instead of the unknown


Figure 21.1: Illustration of the bootstrapping principle.

underlying distribution1 . Due to this approximation, independence (or even decorrelation) cannot
be assumed. Yet, this can improve the results empirically.
This idea can typically be applied to decision trees. For regression trees, Eq. (21.1) can be applied
directly. For classification trees, an average of predicted classes would not make sense. Recall that
a classification tree makes a majority vote over the examples belonging to the leaf where the
input of interest ends up (and associates this way a class to each input). A first possibility is to
do a majority vote over trees:

fbag(x) = argmax_{1≤k≤K} (1/B) Σ_{b=1}^B 1{fb(x) = k}.

Another bagging strategy consists in considering the class proportions of the leaf corresponding
to the input of interest for each tree (see Eq. (20.1)), in averaging them over all trees and in outputting
the class that maximizes this averaged class proportion.
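A minimal sketch of bagged classification trees, assuming scikit-learn is available for the base trees (the helper names are illustrative); the first aggregation strategy above (majority vote over trees) is used, and non-negative integer class labels are assumed.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, B=25, rng=None):
    """Learn B trees, each on a bootstrap replica (sampling with replacement) of the dataset."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = len(X)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def predict_majority(trees, X):
    """Majority vote over the trees; assumes non-negative integer class labels."""
    votes = np.array([tree.predict(X) for tree in trees]).astype(int)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```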
There exist variations of the bagging approach, depending on how datasets are sampled from
the original training set:
• samples can be drawn with replacement, which is the principle of the bagging approach
explained above (Breiman, 1996a);
• alternatively, random subsets of the dataset can be drawn as random subsets of the samples,
which is known as pasting (Breiman, 1999);
• one can also select randomly a subset of the components of the inputs (which are generally
multi-dimensional) to learn the models. When random subsets of the dataset are drawn as
random subsets of the features, the method is known as random subspaces (Ho, 1998);
• it is possible to combine these ideas. When base estimators are built on subsets of both
samples and features, the method is known as random patches (Louppe and Geurts, 2012).

21.2 Random forests


Bagging averages many noisy but approximately unbiased models to reduce the variance. However,
there is necessarily some overlap between bootstrapped datasets, so the models corresponding to
each of these datasets are correlated. We have justified informally the idea of averaging by the
fact that if X1, . . . , XB are i.i.d. random variables, then var(µB) = σ²/B (using the notations
introduced at the beginning of the chapter). Now, assume that these random variables are identically
distributed but not independent, with a positive pairwise correlation ρ:

for i ≠ j,    ρ = E[(Xi − µ)(Xj − µ)] / sqrt(var(Xi) var(Xj)).
1 This empirical distribution is a discrete distribution that associates a probability of 1/n to each sample (xi, yi).
Sampling from the dataset with replacement is sampling according to this empirical distribution.

Then, one can easily show2 that

var(µB) = ρσ² + ((1 − ρ)/B) σ².    (21.2)

Therefore, the variance cannot be shrunk below ρσ².
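Eq. (21.2) can be checked numerically; the sketch below builds B identically distributed variables with pairwise correlation ρ (through a shared component Z) and compares the empirical variance of their average with the formula; the values of ρ, B and the number of trials are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, sigma, B, n_trials = 0.3, 1.0, 25, 200_000

# X_b = sqrt(rho) * Z + sqrt(1 - rho) * eps_b has variance sigma^2 and pairwise correlation rho.
Z = rng.normal(size=(n_trials, 1))
eps = rng.normal(size=(n_trials, B))
X = sigma * (np.sqrt(rho) * Z + np.sqrt(1 - rho) * eps)

mu_B = X.mean(axis=1)
print(mu_B.var())                                   # empirical variance of the average
print(rho * sigma**2 + (1 - rho) / B * sigma**2)    # Eq. (21.2), here 0.328
```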

Algorithm 18 Random Forest


Require: A dataset D = {(xi , yi )1≤i≤n }, the size B of the ensemble, the number m of candidates
for splitting.
1: for b = 1 to B do
2: Draw a bootstrap dataset Db of size n from the original training set D.
3: Grow a random tree using the bootstrapped dataset:
4: repeat
5: for all terminal node do
6: Select m variables among d, at random.
7: Pick the best variable and split-point couple among the m.
8: Split the node into two daughter nodes.
9: end for
10: until the stopping criterion is met (e.g., minimum number of sample per node reached)
11: end for
12: return the ensemble of B trees.

When averaged models are trees, Breiman (2001) has proposed random forests to further reduce
the variance. The underlying idea is to reduce the correlation between the trees by randomizing
their constructions, without increasing the variance too much. This randomization is achieved in
the tree-growing process thanks to random selection of input variables3 : at each split, m < d of the
input variables are selected at random as candidates for splitting, the choice of the best variable
and split-point among these candidates being as explained in Ch. 20. See also Alg. 18.
Intuitively, reducing m will reduce the correlation between any pair of trees in the ensemble,
and hence, by Eq. (21.2), will reduce the variance of the average. However, the corresponding
hypothesis space will be smaller, leading to an increased bias. A heuristic is to choose m = ⌊√d⌋
and a minimum node size of 1 for classification, and m = ⌊d/3⌋ and a minimum node size of 5 for
regression. For further information about random forests, the reader can refer to (Hastie et al.,
2009, Ch. 15) (which notably provides a bias-variance analysis).

21.3 Extremely randomized trees


Randomization can be pushed further with extremely randomized forests, introduced by Geurts
et al. (2006). It is basically a random forest, with two important differences:

• split-points are also chosen randomly (in addition to splitting dimensions). More precisely,
m < d of the input variables are chosen at random, and for each of these variables a split-point
is selected at random;

• the full learning dataset is used to grow each tree (instead of a bootstrapped replica).

The rationale behind choosing also the split-point at random is to further reduce the correlation
between trees (so as to reduce the variance of the average of the ensemble more strongly). The
rationale for using the full learning set is to achieve a lower bias (at the price of an increased
variance, that should be compensated by the randomization of split-points). The corresponding
approach is described in Alg. 19.
2 Indeed, we have that var(µB) = E[((1/B) Σ_b (Xb − µ))²] = (1/B²) Σ_{b,b'} E[(Xb − µ)(Xb' − µ)] =
(1/B²)(Σ_b E[(Xb − µ)²] + Σ_{b≠b'} E[(Xb − µ)(Xb' − µ)]) = (1/B²)(Bσ² + B(B − 1)ρσ²) = ρσ² + (1 − ρ)σ²/B.
3 Notice that this is different from random subspaces: input variables are chosen randomly for each splitting, and

not while bootstrapping the dataset.



Algorithm 19 Extremely Randomized Forest


Require: A dataset D = {(xi , yi )1≤i≤n }, the size B of the ensemble, the number m of candidates
for splitting.
1: for b = 1 to B do
2: Grow a random tree using the original dataset:
3: repeat
4: for all terminal node do
5: Select m variables among d, at random.
6: for all sampled variables do
7: Select a split at random
8: end for
9: Pick the best variable and split-point couple among the m candidates.
10: Split the node into two daughter nodes.
11: end for
12: until the stopping criterion is met (e.g., minimum number of sample per node reached)
13: end for
14: return the ensemble of B trees.

Empirically, it often provides better results than random forests. Another advantage of this
approach is its lower computational complexity compared to random forests (instead of searching
the best split-point among the m drawn dimensions, one chooses the split-point among the m
randomly drawn split-points). See the original paper (Geurts et al., 2006) for more details.
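For completeness, here is a short usage sketch assuming scikit-learn is available: its RandomForestClassifier and ExtraTreesClassifier implement (up to implementation details) Alg. 18 and Alg. 19, with max_features playing the role of m; the dataset and parameter values below are purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Random forest: bootstrap replicas and m = sqrt(d) candidate features per split.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
# Extremely randomized trees: by default the full dataset and random split points.
et = ExtraTreesClassifier(n_estimators=200, max_features="sqrt", random_state=0)

for name, model in [("random forest", rf), ("extra trees", et)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean())
```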
Chapter 22

Boosting

We have seen in Ch. 21 that bagging consists in learning in parallel a set of models with low bias
and high variance (learning being randomized, for example through bootstrapping), the prediction
being made by averaging all these models. Boosting takes a different route. The underlying idea
is to add sequentially models with high bias and low variance, so as to reduce the bias of the
ensemble. This is illustrated on Fig. 20.1 page 212: combining decision stumps (binary trees with
two leaf nodes) allows constructing a complex decision boundary.
The rest of this chapter is organized as follows. In Sec. 22.1, we present AdaBoost, a seminal
and very effective boosting algorithm. In Sec. 22.2, we show how this algorithm can be derived and
demonstrate why the ensemble achieves a lower error than each of its components (called weak
learners) taken separately. In Sec. 22.3, we extend these ideas using an optimization perspective.

22.1 AdaBoost
The seminal AdaBoost algorithm of Freund and Schapire (1997) deals with cost-insensitive binary
classification. The available dataset is of the form D = {(xi , yi )1≤i≤n } with yi ∈ {−1, +1}. Before
presenting AdaBoost, we discuss briefly weighted classification.

22.1.1 Weighted binary classification


So far, our estimates (or decision rules) have been obtained by minimizing an empirical risk or a
convex surrogate (see Ch. 5.2). Typically, for binary classification the risk of interest is
Rn(f) = (1/n) Σ_{i=1}^n 1{yi ≠ f(xi)} = Σ_{i=1}^n (1/n) 1{yi ≠ f(xi)}.

This way, all samples of the dataset have the same importance (each sample has a weight 1/n). Now,
assume that we want to associate a different weight wi to each example (xi, yi) (this will be useful
for boosting), such that Σ_i wi = 1 and wi ≥ 0. The empirical risk to be considered is

Rn(f) = Σ_{i=1}^n wi 1{yi ≠ f(xi)}.

This is a generalization of the approaches considered before, in the sense that they correspond to
the choice wi = 1/n, for all 1 ≤ i ≤ n. Weighting the samples allows putting more emphasis on some
of them and less on the others.
Minimizing the weighted empirical risk can be done (approximately) by sampling a bootstrap
replicate according to the discrete distribution (w1 , . . . , wn ) (sampling with replacement according
to this distribution). It can also be done more directly, in a problem-dependent manner. For
example, consider the classification trees presented in Ch. 20. We have seen that a measure of
node impurity is the misclassification error of Eq. (20.2) (see page 214). It can be replaced by a


weighted misclassification error:


Q(Dm) = Σ_{xi ∈ Rm} wi 1{yi ≠ k(m)}.

The tree is then grown by solving

min_{j,s} ( Q(Dm,L(j, s)) + Q(Dm,R(j, s)) ).

Other measures of node impurity can be adapted in a similar way.


In the rest of this section, we assume that a weak learner1 is available (typically, a decision
stump), and that it is able to minimize the weighted risk using the dataset D and weights wi ,
1 ≤ i ≤ n. Notice that we have presented weighted binary classification, but this idea is much
more general (e.g., weighted regression).

22.1.2 The AdaBoost algorithm


AdaBoost is presented in Alg. 20.

Algorithm 20 AdaBoost
Require: A dataset D = {(xi , yi )1≤i≤n }, the size T of the ensemble.
1: Initialize the weights w_i^1 = 1/n, 1 ≤ i ≤ n.
2: for t = 1 to T do
3: Fit a binary classifier ft (x) to the training data using weights wit .
4: Compute the error εt made by this classifier:

εt = Σ_{i=1}^n w_i^t 1{yi ≠ ft(xi)}.

5: Compute the learning rate αt:

αt = (1/2) ln((1 − εt)/εt).

6: Update the weights, for all 1 ≤ i ≤ n:

w_i^{t+1} = w_i^t e^{−αt yi ft(xi)} / Σ_{j=1}^n w_j^t e^{−αt yj ft(xj)}.

7: end for
8: return the decision rule

HT(x) = sgn(FT(x))   with   FT(x) = Σ_{t=1}^T αt ft(x).

Let us explain the rationale behind this algorithm. At the beginning, all samples are equally
weighted. Then, at each iteration t, one first trains a binary classifier ft(x) with the training
set D = {(xi, yi)1≤i≤n} and weights w_i^t (that is, such as minimizing the weighted risk Rn(f) =
Σ_{i=1}^n w_i^t 1{yi ≠ f(xi)}). Write εt = Rn(ft) the error made by this classifier (see line 4 in Alg. 20).
Notice that we necessarily have εt < 1/2 (otherwise, the classifier does worse than random
guessing, and −ft is a better classifier, with an empirical weighted risk below 1/2). The closer εt is to 0,
the better the classifier. However, with a weak learner such as a decision stump, the error will
more probably be close to 1/2. Then, one computes the learning rate αt (see line 5 in Alg. 20). This
rate is a decreasing function of the error εt: with εt = 1/2, αt = 0 (which means that if the classifier
1 The learner is weak in the sense that it has a high bias.

does no better than random guessing, it is not added to the ensemble), and lim_{εt→0} αt = +∞
(which suggests to stop adding models to the ensemble when the empirical risk is null). Then,
the weights are updated (see line 6 in Alg. 20). This can be rewritten as (up to the normalization
factor)

w_i^{t+1} ∝ w_i^t e^{−αt} if ft(xi) = yi,    w_i^{t+1} ∝ w_i^t e^{αt} if ft(xi) ≠ yi.
This means that if the example (xi , yi ) is correctly classified, its weight is decreased, while if it is
incorrectly classified, its weight is increased. The final decision rule (the strong learner2 ) is the
sign of the weighted combination of the learned classifiers:
HT(x) = sgn(FT(x))   with   FT(x) = Σ_{t=1}^T αt ft(x).

To sum up, AdaBoost is a sequential algorithm. At each iteration, samples that were misclassified
by the preceding classifier have their weight increased. Therefore, examples that are difficult to
classify correctly receive ever-increasing influence as iterations proceed.
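Alg. 20 is short enough to be implemented directly; the following sketch uses scikit-learn decision stumps (DecisionTreeClassifier with max_depth=1, fitted with sample_weight) as weak learners and assumes labels in {−1, +1} stored in NumPy arrays; the early-stopping tests handle the degenerate cases εt = 0 and εt ≥ 1/2 discussed above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    """AdaBoost (Alg. 20) with decision stumps; X, y are NumPy arrays, y in {-1, +1}."""
    n = len(X)
    w = np.full(n, 1.0 / n)
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = w[pred != y].sum()
        if eps <= 0.0 or eps >= 0.5:          # perfect or non-informative weak learner
            break
        alpha = 0.5 * np.log((1 - eps) / eps)
        w *= np.exp(-alpha * y * pred)        # increase the weight of misclassified samples
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    F = sum(alpha * stump.predict(X) for alpha, stump in zip(alphas, stumps))
    return np.sign(F)
```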
AdaBoost is a very efficient algorithm. For example, it is behind the face detection algorithm
embedded in recent cameras and smartphones, see Viola and Jones (2001). If this chapter focuses
on AdaBoost, boosting is a much larger field, the interested reader can refer to Schapire and Freund
(2012) for a deeper introduction (see also Hastie et al. (2009, Ch. 10) for a different point of view).

22.2 Derivation and partial analysis


If the algorithm makes sense, one can wonder how it is derived (that is, why this choice of reweighting
and not another one), and what type of guarantee it can offer.

22.2.1 Forward stagewise additive modeling


Here, we show that AdaBoost fits an additive model that minimizes (sequentially) the empirical
risk based on the exponential loss (a convex surrogate presented in Ch. 5.2). This way of deriving
AdaBoost has first been proposed by Friedman et al. (2000), it is not how it has been derived
originally (Freund and Schapire, 1997).
We have seen in Ch. 5.2 that a convex surrogate for the binary classification problem is the
risk based on the exponential loss:
Rn(F) = (1/n) Σ_{i=1}^n e^{−yi F(xi)}.

The loss is low if the class is correctly predicted, and high otherwise. Moreover, we are looking for
an additive model of the form
FT(x) = Σ_{t=1}^T αt ft(x),

with ft being a binary classifier (that is, ft ∈ {−1, +1}X ), called a basis function in the sequel.
The corresponding optimization problem is therefore
n
1 X −yi PTt=1 αt ft (xi )
min e .
(αt ,ft )1≤t≤T n i=1

Yet, this optimization problem is too complicated. A simple alternative is to search for an
approximate solution by sequentially adding basis functions and their associated weights. Define F0 = 0 and
Ft = Ft−1 + αt ft. This consists in solving sequentially the following subproblems:

min_{α,f} (1/n) Σ_{i=1}^n e^{−yi (Ft−1(xi) + α f(xi))}.
2 The learner is strong because it has a reduced bias, as will be shown later.

This is reminiscent of gradient descent (see also Sec. 22.3). This idea can also be straightforwardly
abstracted to any loss function L (not necessarily corresponding to a binary classification problem):
min_{α,f} (1/n) Σ_{i=1}^n L(yi, Ft−1(xi) + α f(xi)).

Now, we compute the solution of this problem in the case of the exponential loss, with binary
classifiers as basis functions.
At each iteration t ≥ 1, we have to solve
(αt, ft) = argmin_{α,f} (1/n) Σ_{i=1}^n e^{−yi (Ft−1(xi) + α f(xi))}.

Define the weight wit as


w_i^t = e^{−yi Ft−1(xi)} / Σ_{j=1}^n e^{−yj Ft−1(xj)}.

The optimization problem can thus be rewritten as


(αt, ft) = argmin_{α,f} [ (1/n) Σ_{j=1}^n e^{−yj Ft−1(xj)} ] Σ_{i=1}^n w_i^t e^{−α yi f(xi)}
          = argmin_{α,f} Σ_{i=1}^n w_i^t e^{−α yi f(xi)},    (22.1)

the term (1/n) Σ_{j=1}^n e^{−yj Ft−1(xj)} being a (positive) constant with respect to the optimization. Notice
also that the terms w_i^t depend neither on α nor on f(x), so they can be seen as weights applied to
each example. Notice also that w_i^1 = 1/n (as F0 = 0 by assumption). Write Jt(α, f) the criterion of
Eq. (22.1). It can be written equivalently as (by rearranging the sums):
Jt(α, f) = Σ_{i=1}^n w_i^t e^{−α yi f(xi)}    (22.2)
         = e^{−α} Σ_{i: yi = f(xi)} w_i^t + e^{α} Σ_{i: yi ≠ f(xi)} w_i^t
         = (e^{α} − e^{−α}) Σ_{i=1}^n w_i^t 1{yi ≠ f(xi)} + e^{−α} Σ_{i=1}^n w_i^t
         = (e^{α} − e^{−α}) Σ_{i=1}^n w_i^t 1{yi ≠ f(xi)} + e^{−α},    (22.3)

using the fact that Σ_{i=1}^n w_i^t = 1 (from the definition of the weights).
We can see from Eq. (22.3) that the solution can be obtained in two steps (the optimizations over α
and f having been separated). First, for any α > 0, the solution to Eq. (22.3) for f(x) is

ft = argmin_{f ∈ H} Σ_{i=1}^n w_i^t 1{yi ≠ f(xi)}.    (22.4)

In other words, ft minimizes the weighted empirical risk corresponding to line 3 of Alg. 20 (we will see
later that the weights are indeed the same). Here, H is the hypothesis space of the considered weak
learners (or basis functions). For example, it can be the space of decision stumps. Write εt the
corresponding error

εt = Σ_{i=1}^n w_i^t 1{yi ≠ ft(xi)}.

We therefore have that

Jt(α, ft) = (e^{α} − e^{−α}) εt + e^{−α}.    (22.5)

Solving for α, we get

∇α Jt(α, ft) = 0 ⇔ αt = (1/2) ln((1 − εt)/εt).    (22.6)
We retrieve the learning rate of line 5 in Alg. 20. Thus, we have solved problem (22.1), its solution
being given by Eqs. (22.4) and (22.6). Regarding the weights update, we have that (up to the
normalizing constant):
w_i^{t+1} ∝ e^{−yi Ft(xi)} = e^{−yi (Ft−1(xi) + αt ft(xi))} = e^{−yi Ft−1(xi)} e^{−αt yi ft(xi)} ∝ w_i^t e^{−αt yi ft(xi)},
which is the update rule in line 6 of Alg. 20.
Therefore, we have shown that AdaBoost can be derived as an additive model minimizing
sequentially the empirical risk based on the exponential loss (this is generally called forward stagewise
additive modeling). Now, one can wonder to what extent this empirical risk is indeed minimized,
notably based on the quality (or bias) of each basis function (each function ft—for example a decision
stump—minimizing an intermediate weighted risk).

22.2.2 Bounding the empirical risk


In this section, we provide (and prove) a bound on the empirical risk (based on the binary loss) of
the solution HT computed by AdaBoost. For this, we define the edge γt = 1/2 − εt, which measures
how much better than random guessing (error rate of 1/2) the error rate of the t-th learned classifier
ft is. As explained before, these weak learners will typically have a high bias but a low variance (such as
decision stumps), so the edge γt will be small (the weak learners do only slightly better than
random guessing). The next result shows that the ensemble HT(x) = sgn(FT(x)) does much better
than a single weak learner, given that it is big enough.
Theorem 22.1 (Freund and Schapire 1997). Write $\gamma_t = \frac{1}{2} - \epsilon_t$ the edge of the $t$-th classifier. The empirical risk of the combined classifier $H_T$ produced by AdaBoost (Alg. 20) satisfies
$$\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{\{y_i \neq H_T(x_i)\}} \;\leq\; \prod_{t=1}^{T}\sqrt{1 - 4\gamma_t^2} \;\leq\; e^{-2\sum_{t=1}^{T}\gamma_t^2}.$$

In other words, the training error drops exponentially fast as a function of the number of combined weak learners. For example, if each weak learner has a 40% misclassification rate, then $\gamma_t = 0.1$ and the empirical risk is bounded by $\left(\sqrt{1-4(0.1)^2}\right)^T \leq 0.98^T$, which can be made arbitrarily close to zero for a large enough $T$. Now, we prove this result.
Proof of Th. 22.1. Recall that $F_T(x) = \sum_{t=1}^{T}\alpha_t f_t(x)$. Write $Z_t$ the normalizing factor of the weights at round $t$:
$$Z_t = \sum_{i=1}^{n} w_i^t\, e^{-\alpha_t y_i f_t(x_i)}. \qquad (22.7)$$
Unraveling the recurrence of AdaBoost that defines the weights, we have
$$w_i^{T+1} = w_i^{T}\, \frac{e^{-\alpha_T y_i f_T(x_i)}}{Z_T}
= w_i^{T-1}\, \frac{e^{-\alpha_{T-1} y_i f_{T-1}(x_i)}}{Z_{T-1}}\, \frac{e^{-\alpha_T y_i f_T(x_i)}}{Z_T}
= w_i^{1}\, \frac{e^{-\alpha_1 y_i f_1(x_i)}}{Z_1}\cdots\frac{e^{-\alpha_T y_i f_T(x_i)}}{Z_T}
= \frac{w_i^1\, e^{-y_i \sum_{t=1}^{T}\alpha_t f_t(x_i)}}{\prod_{t=1}^{T} Z_t}
= \frac{w_i^1\, e^{-y_i F_T(x_i)}}{\prod_{t=1}^{T} Z_t}. \qquad (22.8)$$

Recall that $H_T(x) = \operatorname{sgn}(F_T(x))$. We would like to bound the binary loss by the exponential one, its convex surrogate. If $H_T(x) \neq y$, then $yF_T(x) < 0$, thus $e^{-yF_T(x)} \geq 1$. Therefore, we always have $\mathbf{1}_{\{y \neq H_T(x)\}} \leq e^{-yF_T(x)}$, and the empirical risk of interest can be bounded as follows:
$$\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}_{\{y_i \neq H_T(x_i)\}} \leq \frac{1}{n}\sum_{i=1}^{n} e^{-y_i F_T(x_i)}$$
$$= \sum_{i=1}^{n} w_i^1\, e^{-y_i F_T(x_i)} \quad \text{(by def. of } w_i^1\text{)}$$
$$= \sum_{i=1}^{n} w_i^{T+1} \prod_{t=1}^{T} Z_t \quad \text{(by Eq. (22.8))}$$
$$= \prod_{t=1}^{T} Z_t \quad \text{(as } w^{T+1} \text{ is a discrete distribution)} \qquad (22.9)$$

Recall the definition of Jt (α, f ) in Eq. (22.2) and the definition of Zt in Eq. (22.7). It is clear that
Zt = Jt (αt , ft ). Therefore, from Eq. (22.5), we have that

$$Z_t = \left(e^{\alpha_t} - e^{-\alpha_t}\right)\epsilon_t + e^{-\alpha_t} \qquad (22.10)$$
$$= e^{-\alpha_t}(1-\epsilon_t) + e^{\alpha_t}\epsilon_t$$
$$= e^{-\alpha_t}\left(\tfrac{1}{2} + \gamma_t\right) + e^{\alpha_t}\left(\tfrac{1}{2} - \gamma_t\right) \quad \text{(by def., } \epsilon_t = \tfrac{1}{2} - \gamma_t\text{)}$$
$$= \sqrt{\frac{\tfrac{1}{2}-\gamma_t}{\tfrac{1}{2}+\gamma_t}}\left(\tfrac{1}{2}+\gamma_t\right) + \sqrt{\frac{\tfrac{1}{2}+\gamma_t}{\tfrac{1}{2}-\gamma_t}}\left(\tfrac{1}{2}-\gamma_t\right) \quad \text{(by the definitions of } \alpha_t \text{ (line 5 in Alg. 20) and } \gamma_t\text{)}$$
$$= \sqrt{1 - 4\gamma_t^2} \quad \text{(by direct calculus)}.$$

Plugging this result into Eq. (22.9) provides the first bound of the theorem. Using the fact that for all $x \in \mathbb{R}$ we have $1 + x \leq e^x$ provides the second bound and concludes the proof.
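As a quick numerical illustration of the theorem, one can evaluate both bounds for the 40%-error example above ($\gamma = 0.1$) and a few ensemble sizes; the values below are purely illustrative:

    import numpy as np

    gamma = 0.1                                       # edge of each weak learner (40% error rate)
    for T in (10, 50, 100, 200):
        prod_bound = np.sqrt(1 - 4 * gamma**2) ** T   # first (tighter) bound of Th. 22.1
        exp_bound = np.exp(-2 * T * gamma**2)         # second bound
        print(T, prod_bound, exp_bound)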
Notice that optimizing the bound in Eq. (22.9) (by minimizing $\prod_{t=1}^{T} Z_t$ over $\alpha_t$ and $f_t$, $1 \leq t \leq T$, using the relation (22.10)) allows deriving the AdaBoost algorithm. This is how it was done originally (Freund and Schapire, 1997).
While this result shows that combining enough weak learners, whatever their quality, yields an arbitrarily small empirical risk, it tells nothing about the generalization error (that is, how the risk $R(H_T) = \mathbb{E}\bigl[\mathbf{1}_{\{Y \neq H_T(X)\}}\bigr]$ can be controlled). Direct bounds on this risk can be obtained by applying rather directly the Vapnik-Chervonenkis theory presented in Ch. 5.1. Yet, this analysis would suggest that AdaBoost suffers from overfitting: to obtain a low empirical risk, one has to add many basis functions (or weak learners), leading to a large Vapnik-Chervonenkis dimension, and thus a large variance. However, AdaBoost does not suffer from this in general. Another (and better) line of analysis is based on the concept of margin (which is also central for support vector machines). Here, for the classifier $H_T(x) = \operatorname{sgn}(F_T(x))$, the margin of an example $(x, y)$ is the quantity $yF_T(x)$. The larger it is (in absolute value), the more confident we are about the prediction (being correct or not, depending on the sign). One can provide bounds on the risk based on this notion of margin: very roughly, the larger the margins, the sharper the bound. On the other hand, it is possible to show that AdaBoost tends to enlarge the margins of the computed classifier as the number of iterations increases. A deeper discussion of this is beyond the scope of this manuscript, but the interested reader can refer to Schapire and Freund (2012, Ch. 4 and 5) for more details.

22.3 Restricted functional gradient descent


We have seen that AdaBoost can be derived as a stagewise additive modeling approach for min-
imizing the exponential loss. Here, we somehow generalize this idea from a convex optimization
perspective.

Consider the binary classification problem with a convex surrogate (see Ch. 5.2). We're looking for a classifier $H(x) = \operatorname{sgn}(F(x))$ with $F \in \mathbb{R}^{\mathcal{X}}$. Let $L(y, F(x))$ be a convex surrogate to the binary loss (for example, the exponential loss, $L(y, F(x)) = e^{-yF(x)}$). We would like to minimize the empirical risk:
$$\min_{F \in \mathbb{R}^{\mathcal{X}}} R_n(F) \quad \text{with} \quad R_n(F) = \frac{1}{n}\sum_{i=1}^{n} L(y_i, F(x_i)).$$

As a sum of convex functions, $R_n$ is convex in $F$. A standard approach for minimizing a convex function is to perform a gradient descent. This would give an update rule of the kind
$$F_{t+1} = F_t - \alpha_t \nabla_F R_n(F_t),$$

with αt the learning rate. The problem here is that F is not a variable, it is a function, so we need
to introduce the concept of functional gradient. To do so, we need to introduce a relevant Hilbert
space.
Assume that the input space $\mathcal{X}$ is measurable and let $\mu$ be a probability measure. The function space $L^2(\mathcal{X}, \mathbb{R}, \mu)$ is the set of all equivalence classes of functions $F \in \mathbb{R}^{\mathcal{X}}$ such that the Lebesgue integral $\int_{\mathcal{X}} F(x)^2\, d\mu(x)$ is finite. This Hilbert space has a natural inner product: $\langle F, G\rangle_\mu = \int_{\mathcal{X}} F(x)G(x)\, d\mu(x)$. A functional is an operator that associates a scalar to a function of this Hilbert space. Let $J : L^2(\mathcal{X}, \mathbb{R}, \mu) \to \mathbb{R}$ be such a functional; its Fréchet derivative is the linear operator $\nabla J(F)$ satisfying
$$\lim_{G \to 0} \frac{J(F + G) - J(F) - \langle \nabla J(F), G\rangle_\mu}{\|G\|_\mu} = 0.$$
It can also be implicitly defined as $J(F + G) = J(F) + \langle \nabla J(F), G\rangle_\mu + o(\|G\|_\mu)$.


The probability measure we're interested in here is the discrete measure that associates a probability $\frac{1}{n}$ to each input $x_i$ of the dataset (and a probability of 0 to any $x$ that does not belong to the dataset). Write $\hat{\rho}$ this measure; the associated inner product is
$$\langle F, G\rangle_n = \langle F, G\rangle_{\hat{\rho}} = \frac{1}{n}\sum_{i=1}^{n} F(x_i)G(x_i).$$

The functional we are interested in is the empirical risk Rn , and one can compute3 its Fréchet
derivative ∇F Rn (F ). So, we can write a gradient descent:

Ft+1 = Ft − αt ∇F Rn (Ft ).

However, the Fréchet derivative is a function (indeed, a set of functions), known only at the datapoints $x_i$. It does not allow generalization and it is not a practical object to compute with. The idea is therefore to "restrict" this gradient to the hypothesis space $\mathcal{H}$ of interest. By "restricting" the gradient, we mean here looking for the function of $\mathcal{H}$ that is the most collinear with the gradient (with a comparable norm). This way, we follow approximately the direction of the gradient, so we reduce the empirical risk. Searching for this collinear function amounts to solving the following optimization problem:
$$f_t \in \operatorname*{argmax}_{f \in \mathcal{H}} \frac{\langle \nabla_F R_n(F_t), f\rangle_n}{\|f\|_n}.$$
Then, we apply the gradient update, but with the functional gradient being replaced by its ap-
proximation ft :
Ft = Ft−1 − αt ft .
Thus, we compute an additive model, as before. Yet, it is here obtained as a restricted functional
gradient descent.
3 Generally speaking, all rules of gradient computation, such as the composition rule, apply. From a practical point of view, as only the datapoints $x_i$ matter (given the considered discrete measure), one can see the function $F$ as a vector $(F(x_1), \ldots, F(x_n))^\top$ and take the derivative of the loss with respect to each component seen as a variable. We do this later for the exponential loss.

We now apply this idea to the exponential loss. We recall that the associated empirical risk is
$$R_n(F) = \frac{1}{n}\sum_{i=1}^{n} e^{-y_i F(x_i)}.$$

For a function F , the functional gradient is the set


$$\nabla R_n(F) = \left\{ G \in \mathbb{R}^{\mathcal{X}} : G(x_i) = -y_i\, e^{-y_i F(x_i)} \right\}. \qquad (22.11)$$

To get this result, we can apply the informal method explained in footnote 3 page 227:

$$G(x_i) = \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} = \frac{\partial e^{-y_i F(x_i)}}{\partial F(x_i)} = -y_i\, e^{-y_i F(x_i)}.$$

There remains to project this gradient on the hypothesis space. Here we consider $\mathcal{H} \subset \{-1, +1\}^{\mathcal{X}}$ (for example, $\mathcal{H}$ can be the set of decision stumps, as usual), thus for $f \in \mathcal{H}$ we have $\|f\|_n^2 = \frac{1}{n}\sum_{i=1}^{n} f(x_i)^2 = 1$. The update rule is thus

$$F_{t+1} = F_t - \alpha_t f_t \quad \text{with} \quad f_t = \operatorname*{argmax}_{f \in \mathcal{H}} \frac{\langle \nabla R_n(F_t), f\rangle_n}{\|f\|_n} = \operatorname*{argmax}_{f \in \mathcal{H}} \langle \nabla R_n(F_t), f\rangle_n.$$

Using the result of Eq. (22.11), we have


$$\langle \nabla R_n(F_t), f\rangle_n = \frac{1}{n}\sum_{i=1}^{n} G(x_i) f(x_i) = \frac{1}{n}\sum_{i=1}^{n} e^{-y_i F_t(x_i)}\bigl(-y_i f(x_i)\bigr).$$

On the other hand, notice that


$$-y_i f(x_i) = \begin{cases} 1 & \text{if } y_i \neq f(x_i) \\ -1 & \text{if } y_i = f(x_i) \end{cases} \;=\; 2\,\mathbf{1}_{\{y_i \neq f(x_i)\}} - 1.$$

Therefore, we have (as $e^{-y_i F_t(x_i)}$ does not depend on $f$)
$$\operatorname*{argmax}_{f \in \mathcal{H}} \langle \nabla R_n(F_t), f\rangle_n = \operatorname*{argmax}_{f \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} e^{-y_i F_t(x_i)}\, \mathbf{1}_{\{y_i \neq f(x_i)\}}.$$

The update rule is therefore


$$F_{t+1} = F_t - \alpha_t f_t \quad \text{with} \quad f_t = \operatorname*{argmax}_{f \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} e^{-y_i F_t(x_i)}\, \mathbf{1}_{\{y_i \neq f(x_i)\}}$$
$$F_{t+1} = F_t + \alpha_t f_t \quad \text{with} \quad f_t = \operatorname*{argmin}_{f \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} e^{-y_i F_t(x_i)}\, \mathbf{1}_{\{y_i \neq f(x_i)\}}, \qquad (22.12)$$

the last line being obtained by injecting the negative sign in the optimization problem (recall that
f takes only the values ±1). As expressed in Eq. (22.12), the (negative of the) restricted gradient
is exactly the classifier computed in line 3 of Alg. 20, aiming at minimizing the error of line 4 of
the same algorithm.
There remains to choose the learning rate. The convex optimization theory offers a bunch of choices for this. For example, a classic (but not necessarily wise) choice consists in setting $\alpha_t$ such that $\sum_{t \geq 1} \alpha_t = +\infty$ and $\sum_{t \geq 1} \alpha_t^2 < \infty$ (typically, $\alpha_t \propto \frac{1}{t}$). Here we can perform a line search, that is, we can look for the learning rate that yields the maximum decrease of the empirical risk. Formally, this is done by solving the following optimization problem:
$$\alpha_t = \operatorname*{argmin}_{\alpha > 0} R_n(F_t + \alpha f_t).$$

Using the same techniques as before, we have


$$n R_n(F_t + \alpha f_t) = \sum_{i=1}^{n} e^{-y_i\left(F_t(x_i) + \alpha f_t(x_i)\right)}$$
$$= e^{-\alpha}\sum_{i:\, y_i = f_t(x_i)} e^{-y_i F_t(x_i)} + e^{\alpha}\sum_{i:\, y_i \neq f_t(x_i)} e^{-y_i F_t(x_i)}$$
$$= \left(e^{\alpha} - e^{-\alpha}\right)\sum_{i=1}^{n} e^{-y_i F_t(x_i)}\, \mathbf{1}_{\{y_i \neq f_t(x_i)\}} + e^{-\alpha}\sum_{i=1}^{n} e^{-y_i F_t(x_i)}.$$

Solving for α, we get


$$\nabla_\alpha R_n(F_t + \alpha f_t) = 0 \;\Longleftrightarrow\; \alpha_t = \frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right) \quad \text{with} \quad \epsilon_t = \frac{\sum_{i=1}^{n} e^{-y_i F_t(x_i)}\, \mathbf{1}_{\{y_i \neq f_t(x_i)\}}}{\sum_{i=1}^{n} e^{-y_i F_t(x_i)}}.$$

This is exactly the learning rate of AdaBoost (see line 5 of Alg. 20).
Therefore, we have derived AdaBoost in a third way, from an optimization perspective. It is of
high interest, as it allows relying on the whole optimization field. The interested reader can refer
to Mason et al. (1999); Friedman (2001); Grubb and Bagnell (2011); Geist (2015b) for more about
this kind of approach.
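As an illustration of this optimization viewpoint, here is a small sketch of restricted functional gradient descent (gradient boosting) for a generic differentiable loss: at each round a regression stump is fitted to the pointwise negative gradient, and a fixed shrinkage coefficient replaces the line search. The stump-based hypothesis space, the shrinkage choice and all names are mine, made for the sake of the example:

    import numpy as np

    def fit_reg_stump(X, r):
        """Least-squares regression stump fitted to the pseudo-residuals r."""
        best = (np.inf, 0, 0.0, 0.0, 0.0)
        for j in range(X.shape[1]):
            for thr in np.unique(X[:, j])[:-1]:       # keep both sides of the split non-empty
                left = X[:, j] <= thr
                vl, vr = r[left].mean(), r[~left].mean()
                sse = np.sum((np.where(left, vl, vr) - r) ** 2)
                if sse < best[0]:
                    best = (sse, j, thr, vl, vr)
        _, j, thr, vl, vr = best
        return lambda Z: np.where(Z[:, j] <= thr, vl, vr)

    def gradient_boost(X, y, loss_grad, T=100, lr=0.1):
        """Restricted functional gradient descent on the empirical risk R_n."""
        F = np.zeros(len(y))
        learners = []
        for _ in range(T):
            r = -loss_grad(y, F)                      # negative functional gradient at the datapoints
            h = fit_reg_stump(X, r)                   # "projection" on the hypothesis space
            learners.append(h)
            F = F + lr * h(X)                         # fixed shrinkage in place of the line search
        return learners

    # Exponential-loss gradient, as in Eq. (22.11): dL/dF(x_i) = -y_i exp(-y_i F(x_i)).
    exp_loss_grad = lambda y, F: -y * np.exp(-y * F)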
Part VII

Bayesian Machine Learning


The goal of Machine Learning is to generalize a finite set of data or samples into a descriptive or predictive model. Modelling uncertainty is clearly a major issue in Machine Learning, since collected samples
may be noisy or they may partially lack the relevant information required to build a reliable
model. Therefore the probabilities – which are a mathematical formalization of the notion of
uncertainty – appear to be a proper fundamental tool to approach Machine Learning problems in
a sound manner. Bayesian Machine Learning is a branch of Machine Learning that extensively
uses probability distributions in order to represent not only outputs of predictive models, but also
input data and even the models.
This part is an introduction to Bayesian Machine Learning in general and to some prominent
Bayesian ML methods in particular. Its main objective is not to provide a comprehensive list of
these methods, but rather, to emphasize the common theoretical principles on which these methods
are built. Reading this chapter should provide a first level of understanding on the application
scope, the strengths and the limitations of the main Bayesian machine learning methods, along
with some insights about how these methods can be adapted to concrete problems. The first
chapter is a central introduction of the theoretical elements used extensively in the subsequent
chapters. These elements are the heart of Bayesian Machine Learning: mastering them is the key
to properly apply standard methods to problems, or even designing new ones.
Chapter 23

Theoretical Foundations

23.1 Preliminary discussions


Frequentist versus Bayesian in a strong sense
Classical methods in Machine Learning consist in selecting, in a predefined hypothesis set H, the hypothesis h? that best matches a set of samples S. In practice, these methods consist in solving an optimization problem, that is, finding the hypothesis h? that minimizes a risk function measuring the cost of the divergence between the samples predicted by the model and the observed samples.
Based on this definition, methods can consider any sort of hypothesis, even families of classifiers
that do not rely on probabilities, like SVM or any other sort of geometric classification model.
Such methods are considered as part of Frequentist Machine Learning as their theoretical study
uses a frequentist interpretation of probabilities. Indeed samples are assumed to be available on
demand from a repeatable sampling process. These samples all follow some unknown but fixed
distribution PX . Samples are assumed i.i.d., so that a dataset S = {x1 , . . . , xN } of N samples is considered as a sample of a random variable whose distribution is $P_S = \prod_i P_X(x_i)$. Since the best matching hypothesis h? is computed from the sample set S, it can itself be interpreted as a random variable H ? . The study of the distribution P H ? | S (h? ) and of its convergence properties as N increases legitimates the term of frequentist Machine Learning.
In contrast, Bayesian Machine Learning does not primarily consist in an optimization problem, and it does not select (at least in theory) a single hypothesis h? . Instead, it keeps updated the current distribution of hypotheses P H | S (h) (or equivalently the distribution P Θ | S (θ) of model parameters Θ) conditionally on the set of samples S observed so far. As samples are received, some hypotheses become more likely than others, and this is reflected in the updates of P H | S (h). This is why Bayesian Machine Learning methods naturally fit online problems and can process new samples on the fly: the already seen samples can be safely discarded as their information has somewhat been saved in the updated distribution P H | S (h). Because the distribution of hypotheses is updated thanks to Bayes' rule, this type of method is called Bayesian.

Frequentist versus Bayesian in a weak sense


That being said, the semantic frontier between frequentist and Bayesian approaches is often very
subtle. It is tempting to think that a method that somehow uses Bayes' rule is Bayesian. However this is a necessary but not sufficient condition: in other words, a method using probabilities and Bayes' rule is not necessarily Bayesian in the strong sense given above. A good example is the well-known Machine Learning method called “Naive Bayes” that will be introduced shortly. In its simplest and most well-known form, the method uses – as stated by its name – Bayes' rule to infer the most probable values for the parameters of the classifier. However, the learning process of Naive Bayes does not describe model parameters by some distribution. The basic version of “Naive Bayes” is thus not Bayesian in a strong sense. In fact, most Machine Learning methods used in practice fix some given optimal values for their model parameters that best match the observed
samples. Because they do not describe the model parameters by full distributions, they are not
Bayesian in the strict sense.


So you are probably asking yourself why a non-Bayesian method such as “Naive Bayes” is introduced in a chapter devoted to Bayesian Machine Learning? And why even talk about Bayesian Machine Learning if it is rarely used in practice?
Indeed Bayesian Machine Learning provides a general methodology along with theoretical tools
to design Machine Learning methods suiting specific problems. By stating that everything should
be described by distributions – including model parameters – Bayesian Machine Learning encom-
passes a very wide application scope. Many existing ML methods can be reinterpreted in the
Bayesian framework, often leading to interesting generalizations or sound justifications of choices
that would stay arbitrary otherwise. For instance the ordinary least square linear regression (OLS)
is a frequentist regression method but it can be reinterpreted as a specific instance of a more gen-
eral Bayesian linear regression. This latter model provides a probabilistic interpretation of the
least squares principle and it also legitimates with a sound theoretical argument the origin of ridge
regression (also called Tikhonov regularization, that consists in adding a L2 regularization term
in the OLS objective function). Similarly “Naive Bayes” can be interpreted in a more general and
powerful Bayesian context. After a thorough review, it appears that most methods making an
intensive use of probabilities are neither purely frequentist nor purely Bayesian, they are both!
So you might ask yourself “why distinguish two chapters respectively on Frequentist and Bayesian Machine Learning if there is no fundamental difference between them?”
The key to the answer is in the word “probabilistic”: to be classified as Bayesian in a weak sense, a method must at least rely on a probabilistic discriminative model, i.e. a model that generates an
output Y given an input X and the parameters Θ of the model according to some distribution
P Y | X,Θ . Clearly the basic versions of SVM, neural networks, or decision trees do not output such
a distribution but a single target value1 : they cannot be Bayesian and are consequently considered
as frequentist. Conversely a non Bayesian method (in the strong sense) such as “Naive Bayes”
that relies on a probabilistic generative model should actually be considered as Bayesian (in a weak
sense) as it can always – at least theoretically – be generalized into a full Bayesian method. More
exactly, frequentist probabilistic methods can be derived from Bayesian ones: given some Bayesian
method (let’s say the Bayesian version of “Naive Bayes”) , parameter values θ? learned by some
associated frequentist method (let’s say the basic version of “Naive Bayes”) can be obtained by
first running the Bayesian method to learn some parameter distribution P Θ | S (θ), and then finding
the best parameter value θ? that minimizes the expected value of some cost function according
to P Θ | S (θ). The extraction of this optimal parameter value for some cost function is called a
Bayes estimator. The most well known Bayes estimators are the Maximum A Posteriori estimator
(MAP) and Maximum Likelihood estimator (MLE). For instance the simple frequentist version of
“Naive Bayes” is derived by applying the MLE estimator to the Bayesian version of “Naive Bayes”.
The same holds for linear regression: the Ordinary Least Square method (OLS) is what we get
when one applies the MLE estimator to the Bayesian linear regression.
So finally you might ask why one should use probabilistic frequentist methods if they all admit full Bayesian counterparts that are more powerful?
There ain’t no such thing as a free lunch. Though a full Bayesian method encompasses the corresponding frequentist methods (in the sense that frequentist methods are obtained from the Bayesian ones by applying some Bayes estimator), the process maintaining the distribution of hypotheses/parameters P Θ | S up-to-date when new samples are received can be very demanding in both
computation time and memory footprint. Frequentist versions are more lightweight as they only
compute one specific value θ? for the model parameters.

Overview
In summary, the chapter distinguishes two types of Bayesian methods: in a weak and in a strong
sense. This distinction does not come from the scientific literature but is introduced by the author in
order to understand how many ML methods can simultaneously be interpreted in both frequentist
and Bayesian Machine Learning contexts.
1 Of course a single value can always be interpreted as a degenerate distribution but this is very reductive.

• Bayesian methods in a weak sense gather all probabilistic methods from which some fully Bayesian methods (in the strong sense) can be straightforwardly derived, that is, all methods that model the distribution P Y | X .
• Bayesian methods in a strong sense consist in considering a Bayesian method in a weak sense and representing the model parameters by a full distribution PΘ instead of a single value θ. These methods are fundamentally online and can integrate some prior knowledge.
The next sections 23.3 and 23.4 further develop Bayesian Machine Learning in the weak and strong senses respectively, after some basic notions of probability theory are recalled in section 23.2.

23.2 A short reminder of elementary notions in probability


theory
As already stated, Bayesian Machine Learning uses probability distributions to represent every object it manipulates: model outputs, but also data, model inputs and, most importantly, the models themselves are processed as random variables. This section briefly recalls some basic notions of probability theory that are essential for understanding the subsequent sections. The confident reader can skip it and jump directly to section 23.3. On the contrary, the reader who does not feel comfortable with these notions is invited to read a book on probability theory such as Jaynes and Bretthorst (2003) for a complete treatment of the subject.

23.2.1 Probability and random variables


Probability space
A probability space is a triple (Ω, E, P ) defined by :
• A set Ω, called universe, of possible outcomes.
• A set E of events. An event is formally defined as the subset E ⊆ Ω of outcomes it covers.
Depending on the type of outcomes, it can be a countable or uncountable set. The set E of
events must be a σ-algebra, i.e. must be closed under countable-fold set operations: union
(or), intersection (and) and complement (not).
Disjunction / union: ∀E1 , E2 ∈ E, E1 ∪ E2 ∈ E
Conjunction / intersection: ∀E1 , E2 ∈ E, E1 ∩ E2 ∈ E
Negation / complement: ∀E ∈ E, Ē = Ω \ E ∈ E

• A measure P that maps every event of E to its probability and that verifies the following
axioms:
∀E ∈ E, 0 ≤ P (E) ≤ 1
∀E1 , E2 ∈ E, P (E1 ∪ E2 ) + P (E1 ∩ E2 ) = P (E1 ) + P (E2 )
P (Ω) = 1
Events are thus the sets of outcomes that can be measured by some probability.

Example:
Let’s roll two six-sided dice. One possible definition of the associated probability space (Ω, E, P ) is
to define an outcome as a couple (d1 , d2 ) where d1 and d2 respectively represent the obtained value
for the first and second dice. The set Ω of outcomes is thus the Cartesian product {1 . . . 6}×{1 . . . 6}.
The richest possible set E of events is the set P (Ω) of subsets of Ω, which is trivially a σ-algebra.
The event “the result is a duplicate” is then represented by the set Edup = {(1, 1), (2, 2), . . . , (6, 6)}.
The probability function for fair dice is then $P : E \mapsto \frac{|E|}{|\Omega|} = \frac{|E|}{36}$. The probability $P(E_{dup})$ is thus $\frac{1}{6}$.

Random variable
Given a probability space (Ω, E, P ), a random variable X taking its value in a set ΩX is a function
fX : Ω → ΩX that maps any outcome of Ω to a given value of X so that it induces a probability
space (ΩX , EX , PX ) for X. The function PX that maps a set of values of X to a probability is
called the distribution of X. For this distribution to be properly defined, the σ-algebra EX must
be chosen so that the mapping function fX is measurable, in other words, so that every event of
EX corresponds to an event of E:
$$\forall E_X \in \mathcal{E}_X, \quad f_X^{-1}(E_X) \in \mathcal{E}$$
The probability $P_X(E_X)$ for $X$ to take its value in a given subset $E_X \in \mathcal{E}_X$ is then:
$$P_X(E_X) = P\left(f_X^{-1}(E_X)\right)$$
The domain DX of a random variable X denotes the smallest event of EX such that P (DX ) = 1.
In practice one often takes ΩX = DX .

Example:
In the previous example, let’s consider the random variable S that is the sum of the faces of the two dice. The function mapping an outcome to a value of S is $f_S : (d_1, d_2) \mapsto d_1 + d_2$. The domain of S is $D_S = \{2, \ldots, 12\}$. The resulting probability space of S is $(D_S, \mathcal{E}_S, P_S)$ where
$$\mathcal{E}_S = \mathcal{P}(D_S), \qquad P_S(\{s\}) = P(\{(d_1, d_2) \in \Omega \mid d_1 + d_2 = s\}) = \frac{\min(s-1,\ 13-s)}{36}$$
For instance the probability for the sum to be equal to four is PS ({4}) = 3/36 as there are three
matching rolls that correspond to a sum of four: (1, 3), (2, 2) and (3, 1).
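This distribution can be checked by brute-force enumeration of the 36 equally likely outcomes, for instance in Python:

    from collections import Counter
    from fractions import Fraction
    from itertools import product

    # Enumerate the outcomes (d1, d2) and derive the distribution of S = d1 + d2.
    outcomes = list(product(range(1, 7), repeat=2))
    counts = Counter(d1 + d2 for d1, d2 in outcomes)
    P_S = {s: Fraction(c, len(outcomes)) for s, c in counts.items()}
    print(P_S[4])   # 1/12, i.e. 3/36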

Important classes of random variables


Discrete random variable A discrete random variable X has a domain DX = {x1 . . . xk } that
is finite or at least countable.
Real-valued random variable A real-valued random variable has a domain that is a subset of $\mathbb{R}^n$ for some positive integer n. The σ-algebra used for real-valued random variables is the Borel algebra generated by all possible open and closed sets of $\mathbb{R}^n$.
Continuous random variable A particular case of real-valued random variable is the continuous random variable. In that case the full distribution function $P_X$ of a continuous random variable X can be deduced from the cumulative distribution function $F_X(x) = P(X \leq x)$ with $x \in \mathbb{R}^n$, where ≤ is the product order, i.e. $(x_1, \ldots, x_n) \leq (y_1, \ldots, y_n)$ means that $x_1 \leq y_1, \ldots, x_n \leq y_n$. Equivalently one can use the probability density function $f_X = \frac{\partial^n F_X}{\partial x_1 \cdots \partial x_n}$ instead of the cumulative distribution function $F_X$. The distribution of a continuous random variable is smooth as it is obtained by integration from the density function.
Mix random variable One can mix discrete and continuous random variables (using a mixture model) to get a mixed random variable. The values $x \in \mathbb{R}^n$ for which $P(\{x\}) > 0$ are said to be atomic.

Notation convention
1. In the next sections, one will use interchangeably the equivalent notations PX (EX ) and
P (X ∈ EX ) to denote the probability of an event EX defined on a random variable X of
distribution PX . The second notation will be preferred as it is lighter and more readable
when the probability to express is complex. In particular this convention generalizes to joint
and conditional distributions. For instance one will prefer to write P ( (X, Y ) ∈ A | Z ∈ B )
rather than P (X,Y ) | Z∈B (A). Moreover, if EX is a singleton {x}, PX (x) will be a shortcut
for PX ({x}).

2. Second, events EX on random variable X can be specified either extensively, using set no-
tation, or intensively, using logical predicates. For instance PX ({0} ∪ [1, 2[) is equivalent to
P (X = 0 or (X ≥ 1 and X < 2)).

3. Most importantly, because Bayesian methods make some extensive use of sometimes complex
distributions, it is a common usage to simplify expressions with a somewhat abusive but very
handy notation, consisting in denoting the distribution PX as P (X). For instance PX|Y =y
will be replaced by P ( X | Y ). Also the probability density function fX and the distribution
PX of a continuous random variable X will often be used interchangeably and referred to with the
generic term of distribution as their algebraic properties with respect to conditioning and
marginalization are the same.

23.2.2 Joint distribution and independence


Joint distribution
Given two random variables X and Y defined on a common probability space (Ω, E, P ), the joint
variable Z = (X, Y ) combines all pairs of values x of X and y of Y , i.e. its domain is the Cartesian product ΩX × ΩY . The joint variable Z is not necessarily a properly defined random variable. To
be so, all pairs (x, y) must be mapped to an event of E. The distribution of Z is then called the
joint distribution PX,Y of X and Y .

Marginalization
Whereas the distributions of two random variables X and Y do not necessarily suffice to define
their joint distribution, the converse is always true: given the joint distribution PX,Y of X and Y ,
it is always possible to derive the distributions of X and Y using the summation property. This
operation is called marginalization and is given, in the discrete case, by the formula
$$P_X(x) = \sum_{y \in \Omega_Y} P_{X,Y}(x, y)$$
or equivalently, in the continuous case,
$$P_X(x) = \int_{\Omega_Y} P_{X,Y}(x, y)\, dy$$

Marginalization can be interpreted as a loss of information since the distribution of Y has been
lost in the result.
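As a small illustration (with made-up numbers), marginalizing a discrete joint distribution stored as a table simply amounts to summing over the axis of the discarded variable:

    import numpy as np

    # Joint distribution P(X, Y) over X in {0, 1, 2} and Y in {0, 1} (illustrative values).
    P_XY = np.array([[0.10, 0.20],
                     [0.25, 0.15],
                     [0.05, 0.25]])
    assert np.isclose(P_XY.sum(), 1.0)

    P_X = P_XY.sum(axis=1)   # marginalize out Y
    P_Y = P_XY.sum(axis=0)   # marginalize out X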

Independence
Two events A and B are said independent if and only if

P (A ∩ B) = P (A) × P (B)

Two random variables X and Y are independent if and only if the joint distribution of X and Y is defined and verifies:
$$P_{X,Y} = P_X \times P_Y$$
Whether one considers two events or two variables, independence will be denoted A ⊥⊥ B and X ⊥⊥ Y .

23.2.3 Conditional distributions and Bayes’ rule


Conditional distributions
Assuming some joint distribution of two random variables (X, Y ), let’s imagine one collects an outcome (x, y) but only the value y is observable. The fact that Y is equal to y is a valuable piece of knowledge that has (in general) some influence on the possible and most likely values of x. The
observation of y updates in some way the initial distribution PX of X before one observed the value
of Y . The new distribution of X after the observation Y = y is called conditional distribution of X
given Y is equal to y and is denoted P X | Y =y (x) or simply P ( X = x | Y = y ). When the values x
or y do not carry specific information relatively to the current context, one will use the somewhat
abusive but lighter notation P X | Y (x) or simply P ( X | Y ). Conditional distributions can easily
be deduced from joint distributions using marginalization over X:
$$P(X = x \mid Y = y) = \frac{P(X = x \cap Y = y)}{P(Y = y)} = \frac{P(X = x \cap Y = y)}{\sum_x P(X = x \cap Y = y)}$$

Of course conditional probability is not specific to random variables and is true for every type of
events. The conditional probability of event A given the fact event B occurred is denoted P ( A | B )
and is equal to:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
An immediate property is that A and B are independent if and only if P ( A | B ) = P (A). In other
words, A and B are independent if and only if knowing the value of B has absolutely no influence
on the probability of A.

Bayes’ rule
A straightforward but nonetheless fundamental property of conditional probabilities is Bayes’ rule
also called Bayes’ theorem:
P ( A | B ) × P (B) = P ( B | A ) × P (A)
This symmetric expression is equivalent to the asymmetric form:
$$P(A \mid B) = \frac{1}{P(B)}\, P(B \mid A) \times P(A)$$
Or using the distribution notation:
$$P_{X \mid Y=y}(x) = \frac{1}{P_Y(y)}\, P_{Y \mid X=x}(y)\, P_X(x)$$
This latter form is the one at the heart of Bayesian Machine Learning as explained in the next
sections.

23.3 Bayesian Machine Learning in a weak sense


This first chapter is an introduction to the general Bayesian methodology and the associated the-
oretical tools that are required to understand Bayesian methods, whereas each subsequent chapter is devoted to some state-of-the-art probabilistic methods relying on the fundamental notions of this chapter. Therefore it is essential to properly understand this first chapter.
The first section is dedicated to Bayesian methods in the weak sense, i.e., all probabilistic methods based on a model characterized by some likelihood P Y,X | Θ , that is to say, methods
that are derived by the general Bayesian Machine Learning methodology, and can be used either
in a full Bayesian context (i.e. model parameters are described by distributions) or simplified
into a frequentist context (i.e. model parameters are set to some optimal values optimizing some
objective functions). This includes many state-of-the-art methods like “Naive Bayes” and Linear
Discriminant Analysis (LDA), linear and logistic regressions, mixture models, Gaussian Processes,
Markov models, and so on.
In order to introduce progressively the main notions used in Bayesian theory, one will reuse
the example of the introductory chapter: the game is to deduce from the physical description
of many students, those that are members of the university’s wrestling club. Please note that
the section introduces the various Bayesian concepts using a narrative logic, so that the building
blocks get assembled progressively in an understandable manner. As a consequence, concepts are
not introduced in a top down order, from the most fundamental to the most specific ones. The
whole Bayesian framework will only get visible in the next section 23.4.

23.3.1 Maximal Likelihood Principle


So what is the problem again? Let’s say every university student is described by five variables:
• His/her height H,
• His/her weight W ,
• His/her gender G ∈ {male, f emale},
• His/her dress code says how often the student wears sportswear: D ∈ {never, sometimes, often},
and
• His/her wrestling club membership M ∈ {yes, no}.
H and W are continuous variables whereas G, D and M are discrete variables. The considered
problem is a supervised binary classification problem consisting in finding a modeling function f
that maps the physical description (H, W, G, D) of a student to the unknown membership value of
M . M will thus be the output Y , and (H, W, G, D) the input X of this function f : X 7→ Y . To
determine the function f , one also assumes the membership of a subset of students is known and
one has collected all the value pairs (x, y) for these students in a dataset S of samples.
Because our approach is Bayesian, one decides to use distributions. Instead of determining
a single output value y = f (x) from an input, one will determine the whole output distribution
P Y | X=x (y) conditionally to the input, in order to take into account the uncertainty contained
in the link between the input and the output (missing information through other external vari-
ables, intrinsic noise, etc.). In practice, the shape of this distribution is completely determined
by the value θ of a vector of parameters Θ of finite dimension. To emphasize the role played by
these parameters, the distribution should be denoted P Y | X=x,Θ=θ (y). Every probability is thus
determined by the values of three variables: the input X, the output Y , and the parameters Θ.
The partial function θ 7→ P Y | X=x,Θ=θ (y) that maps a parameter value θ for some given sample
z = (x, y) is called the likelihood of Θ and is denoted Lz (θ).
Lz=(x,y) (θ) = P Y | X=x,Θ=θ (y)

As stated by its name, the likelihood estimates how plausible are some parameter values θ given a
sample z = (x, y). The higher the likelihood, the more probable the parameters.
If one assumes our sample set is reduced to one single example z = (x, y), learning the model
consists in finding the best parameter value θ? that maximizes the likelihood Lz (θ). This is what
is called the maximum likelihood estimator (MLE):
$$\hat{\theta}_{MLE} = \operatorname*{argmax}_\theta\ L_z(\theta)$$

Of course a single data point is not sufficient to learn in a robust manner the values of the many parameters of a model, so that the result will be prone to over-fitting. If one assumes to have several data points Z = {z1 , . . . zn } = {(x1 , y1 ), . . . , (xn , yn )}, the MLE principle can still be applied but
on a likelihood function that is far more complex as it depends on every sample:
LZ (θ) = P ( Y1 = y1 , . . . Yn = yn | X1 = x1 , . . . , Xn = xn , Θ = θ )

To simplify this problem, one usually assumes samples are independent and identically distributed
(i.i.d) which is generally an acceptable hypothesis. In this case the likelihood function can be
simplified thanks to the independence of samples given the model (i.e. given the parameters θ):
$$L_Z(\theta) = \frac{P(Y_1 = y_1, \ldots, Y_n = y_n, X_1 = x_1, \ldots, X_n = x_n \mid \Theta = \theta)}{P(X_1 = x_1, \ldots, X_n = x_n \mid \Theta = \theta)}$$
$$= \frac{\prod_i P(Y_i = y_i, X_i = x_i \mid \Theta = \theta)}{\prod_i P(X_i = x_i \mid \Theta = \theta)}$$
$$= \prod_i P(Y_i = y_i \mid X_i = x_i, \Theta = \theta)$$
$$= \prod_i L_{z_i}(\theta)$$

Because the number of data can be large and because probabilities are always less than one (and
often close to zero), the likelihood can reach extremely low values close to zero. As a consequence,
the numerical optimization to determine θ̂M LE can raise precision issues. For this reason a common practice is to optimize the so-called log-likelihood, defined as $\log L_Z(\theta) \stackrel{\text{def}}{=} \log(L_Z(\theta))$, since the logarithm is an increasing function with a high sensitivity for small positive values. In addition it transforms multiplication into addition, so that the MLE is finally defined as:
$$\hat{\theta}_{MLE} \stackrel{\text{def}}{=} \operatorname*{argmax}_\theta\ \log L_Z(\theta) = \operatorname*{argmax}_\theta \left(\sum_i \log L_{z_i}(\theta)\right)$$

Even if the shape of these functions $\log L_{z_i}(\theta)$ can be very complex, one can always use some numerical optimization method (like gradient ascent, etc.) to find a local maximum θ? that hopefully will be equal to, or at least a good approximation of, θ̂M LE . Now we know how to solve our wrestler
detector problem given some parametrized likelihood function. The next step is thus to define the
shape of this likelihood function, in other words to describe how we parametrize P Y | X=x,Θ=θ (y)
with θ.
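As a toy illustration of the MLE principle (unrelated to the wrestler example), one can numerically maximize the log-likelihood of a Bernoulli model, for which the closed-form solution is simply the sample mean:

    import numpy as np

    rng = np.random.default_rng(0)
    z = rng.binomial(1, 0.3, size=200)                 # i.i.d. samples of a Bernoulli(0.3)

    def log_likelihood(theta, z):
        return np.sum(z * np.log(theta) + (1 - z) * np.log(1 - theta))

    grid = np.linspace(1e-3, 1 - 1e-3, 999)
    theta_mle = grid[np.argmax([log_likelihood(t, z) for t in grid])]
    print(theta_mle, z.mean())                         # both close to 0.3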

23.3.2 A brute-force approach


If every variable was discrete, one way to solve (naively) the problem would be to estimate from
the dataset every single probability of the whole distribution P Y | X=x (y). To do this, let’s first
discretize the continuous variables H and W . For instance, let us split the height into 20 intervals of length 5 cm ranging from 140 cm to 240 cm. Then, for every student, replace his/her real height
h by the index of the interval containing h. Do the same for the weight w by dividing the range
from 40 kg to 140 kg into 20 intervals of length 5 kg. The new fully discrete distribution is then
described by multi-indexed coefficients:

c(m,h,w,g,d) = P ( M = m | H = h, W = w, G = g, D = d )

However since P ( M = m | H = h, W = w, G = g, D = d ) is a distribution, some coefficients must


sum up to 1:
∀(h, w, g, d) c(’yes’,h,w,g,d) + c(’no’,h,w,g,d) = 1
Every coefficient c(’no’,h,w,g,d) can thus be deduced from c(’yes’,h,w,g,d) . Let us therefore reduce the
parameters of our model to the following more restricted set of coefficients:

c(h,w,g,d) = P ( M = ’yes’ | H = h, W = w, G = g, D = d )

These many coefficients define the components of the parameter vector Θ. Since our model is now
defined, one must address two questions:

• What are the model parameters θ̂M LE that maximize the likelihood relatively to the sample
set?

• Is this method likely to work in practice?

Maximum Likelihood Estimator for discrete distributions


The most plausible model parameter vector θ̂M LE is the one that maximizes the likelihood as
explained in section 23.3.1:
$$\hat{\theta}_{MLE} = \operatorname*{argmax}_\theta\ L_Z(\theta) = \operatorname*{argmax}_\theta \left(\sum_i \log L_{z_i}(\theta)\right)$$

In the general case, an approximated resolution method must be used to find a local optimum
of the rightmost term of the previous equation. However the present shape of the log likelihood
function allows us to find the exact solution analytically. Indeed the global maximum point θ̂M LE

is also a local maximum point of LZ (θ). Moreover if one assumes the likelihood function is never
zero, log LZ (θ) is always differentiable and θ̂M LE must verify:

$$\forall(h, w, g, d), \quad \frac{\partial \log L_Z(\theta)}{\partial c_{(h,w,g,d)}} = \sum_i \frac{\partial \log L_{z_i}(\theta)}{\partial c_{(h,w,g,d)}} = 0$$

What is the value of $\frac{\partial \log L_{z_i}(\theta)}{\partial c_{(h,w,g,d)}}$ for a given coefficient $c = c_{(h,w,g,d)}$? This coefficient only appears if the sample $z_i$ has the same input features, i.e. if $z_i = (h', w', g', d', m')$ with $h' = h$, $w' = w$, $g' = g$ and $d' = d$. There are then two different cases:

• If m' = yes (the student is a wrestler) then
$$\frac{\partial \log L_{z_i}(\theta)}{\partial c} = \frac{\partial \log\bigl(P(M = \text{'yes'} \mid H = h, W = w, G = g, D = d)\bigr)}{\partial c} = \frac{\partial \log(c)}{\partial c} = \frac{1}{c}$$
• If m' = no (the student is not a wrestler) then
$$\frac{\partial \log L_{z_i}(\theta)}{\partial c} = \frac{\partial \log\bigl(P(M = \text{'no'} \mid H = h, W = w, G = g, D = d)\bigr)}{\partial c} = \frac{\partial \log(1 - c)}{\partial c} = -\frac{1}{1 - c}$$

Now let us add these results to get the whole expression of $\frac{\partial \log L_Z(\theta)}{\partial c}$. For this purpose, let us define $N_c^y$ (respectively $N_c^n$) the number of samples $z = (h', w', g', d', m')$ such that $(h', w', g', d') = (h, w, g, d)$ and $m' = \text{'yes'}$ (resp. $m' = \text{'no'}$).
$$\sum_i \frac{\partial \log L_{z_i}(\theta)}{\partial c} = N_c^y \cdot \frac{1}{c} - N_c^n \cdot \frac{1}{1-c} = 0$$
Finally isolating c in the latter expression gives the natural and extremely simple result:
$$c = \frac{N_c^y}{N_c^y + N_c^n}$$

In other words, the maximum likelihood estimator computes a discrete conditional distribution $P_{A \mid B=b}(a)$ simply by computing, for each value a of A, the ratio of the number $N_{A \text{ and } B}$ of samples satisfying simultaneously A = a and B = b over the number $N_B$ of samples satisfying B = b. This is a general result: computing the coefficients of some discrete distribution occurring in some MLE estimation θ̂M LE can simply be achieved by naive counting in the dataset.
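In code, this counting estimator is a one-liner on any tabular dataset; the toy samples below are of course only illustrative:

    from collections import Counter

    # MLE of the discrete conditional distribution P(A | B = b) by counting.
    samples = [("yes", "x"), ("no", "x"), ("yes", "x"),
               ("yes", "y"), ("no", "y"), ("yes", "y")]
    b = "x"
    matching = [a for a, b_val in samples if b_val == b]            # samples with B = b
    counts = Counter(matching)                                      # N_{A and B} for each value a
    p_A_given_b = {a: n / len(matching) for a, n in counts.items()}
    print(p_A_given_b)   # {'yes': 0.666..., 'no': 0.333...}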

The curse of dimensionality


One then might ask whether this method of estimating the whole distribution by counting events
in the samples really works in practice? The answer is clearly no most of the time, unless one has
an extremely large dataset. To understand why it does not work, let us compute the size of Θ,
that is, the number of coefficients c(h,w,g,d) . Since h and w can each take 20 different discrete values,
Θ is a vector of dimension 20 × 20 × 2 × 3 = 2400. If one wants to estimate precisely the value of
every coefficient, one should request that each estimation relies on at least 50 samples. In total,
the dataset should contain more than 100000 samples of students! And this number is even too
optimistic as it is likely to have some unbalanced distribution of samples so that some coefficients
will be estimated using only one or two samples. Now imagine one adds even more variables (student width, body shape, etc.): the number of model parameters increases at an exponential rate!
This phenomenon is called the curse of dimensionality in the Machine Learning literature. It
says that a model with too many parameters can always fit the data in many different ways, so
that it will be prone to over-fitting. The next question is how can the number of model parameters
be reduced?

23.3.3 Bayesian Networks come to the rescue


In order to reduce the computational complexity of a model and to avoid the curse of dimensionality, some additional hypotheses must be introduced so that they simplify the model and thus the likelihood function Lz=(x,y) (θ) = P Y | X=x,Θ=θ (y) using a reduced number of parameters. The main method to achieve such a simplification consists in factorizing the likelihood function into independent factors, in other words, isolating subgroups of feature variables that are independent of each other. The most common tool used to specify these independence relations between
random variables is the model of Bayesian Networks (also called Belief Network2 ). This section
introduces the minimal notions of Bayesian Networks that are required to understand how they
can help to simplify Bayesian Machine Learning problems.
Let us consider an abstract problem defined on a set (X1 , . . . Xm ) of variables. One assumes
all these variables are observable and that many samples (xi1 , . . . xim )1≤i≤n have been collected in
some dataset Z. In addition one also assumes all variables are discrete and can take two different
values each. These two last assumptions are to make the example more concrete, even if Bayesian
networks also work with continuous variables. The most informative description of a set of variables is the full joint distribution P (X1 ,...Xm ) | Θ (x1 , . . . xm ), where Θ is
the vector of unknown parameters to be computed. Once parameters Θ are computed, one can
solve any supervised classification problem. For instance if one wants to predict Xm from values
(x1 , . . . xm−1 ) of (X1 , . . . Xm−1 ), one can compute the distribution of Xm and extracts the most
likely value x̂m of Xm :

$$\hat{x}_m = \operatorname*{argmax}_{x_m}\ P_{X_m \mid X_1 = x_1, \ldots, X_{m-1} = x_{m-1}, \Theta = \theta}(x_m)
= \operatorname*{argmax}_{x_m}\ \frac{P_{(X_1, \ldots, X_m) \mid \Theta = \theta}(x_1, \ldots, x_m)}{P_{(X_1, \ldots, X_{m-1}) \mid \Theta = \theta}(x_1, \ldots, x_{m-1})}
= \operatorname*{argmax}_{x_m}\ P_{(X_1, \ldots, X_m) \mid \Theta = \theta}(x_1, \ldots, x_m)$$

In order to fully specify the joint distribution P (X1 ,...Xm ) | Θ , one needs to learn the coefficients
θx1 ,...,xm = P(X1 ,...Xm ) (x1 , . . . xm ) from the data. However their huge number, $2^m - 1$, grows exponentially with m and will quickly lead to a failure, due to overfitting, even if a large number n of samples is available. The model of Bayesian Networks helps to specify the same joint distribution with fewer parameters, assuming one knows some relations of independence between variables.
Bayesian Networks have two nested levels of definition: the first level is simply a graph that
states the existence of some factorization of the joint distribution resulting from some independence
relations between variables. The second level is the graph of the first level plus some probability
tables that altogether fully define the probabilities of the joint distribution.

Bayesian Network as a factorization of the joint distribution


On the most elementary level, a Bayesian Network defined on a set of variables X = {X1 , . . . Xm }
is any directed acyclic graph (DAG) whose vertices are the set X of variables and that represents
a specific factorization of the joint distribution PX1 ,...Xm . Let’s consider the directed graphs
displayed on figure 23.1. The graph on the left can represent a Bayesian Network on variables
X = {A, B, C, D, E, F, G} as there is no oriented cycle (cycles BDF and CDF E are not oriented
cycles). On the other hand the graph on the right contains the oriented cycle ABD so that it
cannot represent a Bayesian Network. A variable Y is said parent (resp. a child) of a variable X
if and only if there is an arc from Y to X (resp. from X to Y ). The set of parents of a variable
X is denoted parents(X). Similarly a variable Z is said to be an ancestor (resp. a descendant) of
variable X if there is an oriented path from Z to X (resp. from X to Z).
But the real question is “what does the Bayesian Network of figure 23.1a represent?” This graph says that the joint distribution PA,B,C,D,E,F,G (a, b, c, d, e, f, g) can be decomposed into the following
2 Bayesian Networks are one of the Graphical models that can specify in a graphical language (based on graphs)

independence relations between random variables, but it is not the only one. Markov networks is another well-known
graphical model.

[Figure 23.1: Bayesian Network example (a) and counterexample (b). (a) A directed acyclic graph over the variables A, B, C, D, E, F, G; (b) a directed cyclic graph containing the oriented cycle ABD.]

product of functions:

$$P_{A,B,C,D,E,F,G}(a,b,c,d,e,f,g) = f_A(a) \times f_B(b,a) \times f_C(c) \times f_D(d,b,c) \times f_E(e,c) \times f_F(f,b,d,e) \times f_G(g,f)$$

in a way such that:


$$\sum_a f_A(a) = 1, \qquad \forall a,\ \sum_b f_B(b,a) = 1, \qquad \sum_c f_C(c) = 1, \qquad \forall b, \forall c,\ \sum_d f_D(d,b,c) = 1,$$
$$\forall c,\ \sum_e f_E(e,c) = 1, \qquad \forall b, \forall d, \forall e,\ \sum_f f_F(f,b,d,e) = 1, \qquad \forall f,\ \sum_g f_G(g,f) = 1$$

More generally to be compatible with a given Bayesian network, a decomposition of the joint
distribution must comply with the three following rules:

• There is a one-to-one mapping between variables and factoring functions or factors: e.g. A
is mapped to fA .

• Every factoring function fX mapped to variable X takes as arguments the value x of X along
with the values of the parent variables of X.

• For every variable X of factor fX (x, y1 , . . . , yk ) and for every value (y1 , . . . , yk ) of the parent
variables Y1 , . . . , Yk of X, one always has:
$$\forall y_1, \ldots, \forall y_k, \quad \sum_x f_X(x, y_1, \ldots, y_k) = 1$$

From these abstract axioms, one can indeed derive a very simple interpretation of these factors:

In a Bayesian network, the factor $f_X$ of variable X defines the distribution of X conditionally to its parents. Formally:
$$\forall X, \quad f_X(x, y_1, \ldots, y_k) = P_{X \mid Y_1 = y_1, \ldots, Y_k = y_k}(x)$$
where $Y_1, \ldots, Y_k$ are the parent variables of X.

Let us prove this result on an example, as a sketch for the general proof: let us show that the factor $f_D(d, b, c)$ is nothing else than the conditional distribution $P_{D \mid B=b, C=c}(d)$:
$$P_{B,C,D}(b,c,d) = \sum_{a,e,f,g} P_{A,B,C,D,E,F,G}(a,b,c,d,e,f,g)$$
$$= \sum_{a,e,f,g} f_A(a) \times f_B(b,a) \times f_C(c) \times f_D(d,b,c) \times f_E(e,c) \times f_F(f,b,d,e) \times f_G(g,f)$$
$$= \sum_a \left( f_A(a) \times f_B(b,a) \times f_C(c) \times f_D(d,b,c) \times \sum_e \left( f_E(e,c) \times \sum_f \left( f_F(f,b,d,e) \times \sum_g f_G(g,f) \right)\right)\right)$$

This expression can be simplified by applying in cascade the summation property on the factors, starting from the end:
$$P_{B,C,D}(b,c,d) = \sum_a f_A(a) \times f_B(b,a) \times f_C(c) \times f_D(d,b,c) \times \sum_e f_E(e,c) \times \sum_f f_F(f,b,d,e) \times 1$$
$$= \sum_a f_A(a) \times f_B(b,a) \times f_C(c) \times f_D(d,b,c) \times \sum_e f_E(e,c) \times 1$$
$$= \sum_a f_A(a) \times f_B(b,a) \times f_C(c) \times f_D(d,b,c) \times 1$$
$$= f_D(d,b,c) \times \sum_a f_A(a) \times f_B(b,a) \times f_C(c)$$

Then PB,C can be deduced by marginalization of PB,C,D :


$$P_{B,C}(b,c) = \sum_d P_{B,C,D}(b,c,d) = \sum_d \left( f_D(d,b,c) \times \sum_a f_A(a) \times f_B(b,a) \times f_C(c) \right)$$
$$= \left(\sum_d f_D(d,b,c)\right) \times \left(\sum_a f_A(a) \times f_B(b,a) \times f_C(c)\right) = 1 \times \sum_a f_A(a) \times f_B(b,a) \times f_C(c)$$

Finally,

$$P_{D \mid B=b, C=c}(d) = \frac{P_{B,C,D}(b,c,d)}{P_{B,C}(b,c)} = \frac{f_D(d,b,c) \times \sum_a f_A(a) \times f_B(b,a) \times f_C(c)}{\sum_a f_A(a) \times f_B(b,a) \times f_C(c)} = f_D(d,b,c)$$

In conclusion, a Bayesian Network decomposes a joint distribution into a product of conditional distributions. The example of figure 23.1a gives:
$$P_{A,B,C,D,E,F,G}(a,b,c,d,e,f,g) = P_A(a) \times P_{B \mid A=a}(b) \times P_C(c) \times P_{D \mid B=b, C=c}(d) \times P_{E \mid C=c}(e) \times P_{F \mid B=b, D=d, E=e}(f) \times P_{G \mid F=f}(g)$$
Or equivalently, using a somewhat more ambiguous but shorter notation:
$$P_{A,B,C,D,E,F,G} = P_A \times P_{B \mid A} \times P_C \times P_{D \mid B,C} \times P_{E \mid C} \times P_{F \mid B,D,E} \times P_{G \mid F}$$
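This factorization can be coded directly: once a conditional probability table is attached to each node, the joint probability of a full assignment is the product of one factor per variable. The sketch below uses the structure of figure 23.1a with made-up probability values; it is an illustration only:

    import random
    from itertools import product

    # Structure of the DAG of Fig. 23.1a (parents of each variable).
    parents = {"A": [], "B": ["A"], "C": [], "D": ["B", "C"],
               "E": ["C"], "F": ["B", "D", "E"], "G": ["F"]}

    # CPTs for binary variables, filled with made-up numbers (in practice they
    # would be learned from data by counting, cf. the MLE result above).
    rng = random.Random(0)
    cpt = {}
    for var, pas in parents.items():
        for pa_values in product((0, 1), repeat=len(pas)):
            p = rng.uniform(0.1, 0.9)
            cpt[(var, 0) + pa_values] = p          # P(X = 0 | parents)
            cpt[(var, 1) + pa_values] = 1.0 - p    # P(X = 1 | parents)

    def joint(assignment):
        """P(A=a, ..., G=g) as the product of the factors f_X(x, parents(X))."""
        prob = 1.0
        for var, pas in parents.items():
            key = (var, assignment[var]) + tuple(assignment[p] for p in pas)
            prob *= cpt[key]
        return prob

    print(joint({"A": 1, "B": 0, "C": 1, "D": 1, "E": 0, "F": 1, "G": 0}))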

Bayesian Network as a complete model of joint distribution


The graph of a Bayesian Network specifies a factorization of the joint distribution. However it
does not specify the values of each factor so that the probabilities of the joint distribution cannot
be computed.
To do this, the graph of a Bayesian Network must be completed with what are called Conditional
Probability Tables (CPT). Given a variable X and its parents parents(X) = {Y1 , . . . , Yk }, the CPT
of X is a table specifying all the values of its factor fX , that is, it is a table represented by k + 2
columns X, Y1 . . . Yk , and P X | Y1 ,...Yk . Every row contains one of the possible values for the k + 1-
tuple (x, y1 , . . . , yk ) followed by the probability P X | Y1 =y1 ,...Yk =yk (x) for these values. Considering
the subset of rows matching some given value (y1 , . . . , yk ) for the parent variables, one (and only
one) of them can be discarded as the probabilities of these entries must sum up to one:
$$\sum_x P_{X \mid Y_1 = y_1, \ldots, Y_k = y_k}(x) = 1$$

For instance, if one assumes all variables of the example of figure 23.1a take their value in {0, 1},
the CPT of variable D could look like the table 23.1. The missing probabilities for D = 1 can
be reconstructed using the 1-complement of probabilities of D = 0. Learning the CPTs from

Table 23.1: CPT of variable D.

D B C P D | B,C
0 0 0 0.2
0 0 1 0.4
0 1 0 0.1
0 1 1 0.7

some dataset using the Maximum Likelihood principle is again very easy, as it amounts to estimating probabilities by directly counting occurrences in the dataset.
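For instance, with a tabular dataset, the CPT of D given (B, C) is obtained by a simple group-by average (here on synthetic data, for illustration only):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    df = pd.DataFrame({"B": rng.integers(0, 2, 1000),
                       "C": rng.integers(0, 2, 1000)})
    df["D"] = (rng.random(1000) < 0.2 + 0.5 * df["B"] * df["C"]).astype(int)

    cpt_D = df.groupby(["B", "C"])["D"].mean()   # estimates P(D = 1 | B = b, C = c) by counting
    print(cpt_D)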
One advantage of such a decomposition is to reduce drastically the number of parameters. Since
all variables of the example are assumed to be binary, the number of entries of a CPT of some variable X is equal to $2^{|parents(X)|}$. The total number of model parameters is therefore equal to $2^0 + 2^1 + 2^0 + 2^2 + 2^1 + 2^3 + 2^1 = 20$, to be compared with $2^7 - 1 = 127$ parameters for specifying the same non-factorized joint distribution. The advantage of the reduction is two-fold: not only

the model requires less memory and computation time, but it also reduces the risk of overfitting.
Of course CPTs are only valid with discrete variables. Continuous variables require to parametrize
their distributions but the number of required parameters is usually very low so that the total
parameter number of the whole model remains low. This case will be explained later.

Bayesian network as a dependence modelling tool

The graph of a Bayesian network helps to formalize the hypothesis of ML methods in term of
variable independence. In order to understand this point, one must first study for some Bayesian
network the necessary and sufficient condition for two subsets X and Y of variables to be indepen-
dent.

Let us consider the case where these subsets are reduced to one single variable, respectively
X and Y . Clearly if X is a parent of Y , they are dependent (unless the CPT of Y does not
depend on the value of X, but in this case the connection from X to Y can be dropped). Let us
consider the interesting case where X and Y are not directly connected but they are indirectly
connected through a third variable Z. Table 23.2 lists all of these graph configurations modulo
variable renaming symmetry.

Table 23.2: Independence configurations of X and Y (the value of Z is not observed).

(a) Common parent, X ← Z → Y: X and Y have Z as parent. Learning the value of X brings some new information on the distribution of Z and then on Y:
$$P_{Y \mid X} \neq P_Y$$
Therefore X and Y are not independent.

(b) Chain, X → Z → Y (or X ← Z ← Y): Z is in between X and Y. Learning the value of X can have some consequence on the distribution of Z and then on Y:
$$P_{Y \mid X} \neq P_Y$$
Therefore X and Y are not independent.

(c) Common child, X → Z ← Y: Z has X and Y as parents. Learning the value of X can have some consequence on the distribution of Z. But this has no influence on the distribution of Y, which is given by a factor $f_Y(y)$ independent of X and Z:
$$P_{Y \mid X} = P_Y$$
Therefore X and Y are independent.

Let us prove this last third case (c) (the other proofs are left to the reader as they just consist
in exhibiting examples where X and Y are not independent):

$$P_{Y \mid X} = \frac{P_{X,Y}}{P_X}
= \frac{\sum_Z P_{X,Y,Z}}{\sum_{Y,Z} P_{X,Y,Z}}
= \frac{\sum_Z P_X\, P_Y\, P_{Z \mid X,Y}}{\sum_{Y,Z} P_X\, P_Y\, P_{Z \mid X,Y}}
= \frac{P_X\, P_Y \sum_Z P_{Z \mid X,Y}}{P_X \sum_Y P_Y \sum_Z P_{Z \mid X,Y}}
= \frac{P_X\, P_Y}{P_X}
= P_Y$$

What happens now if the value z of Z is known? Does it change the independence relation
between X and Y ? It completely reverses the results as shown on table 23.3. Note the graphical
convention: when the value of a variable is known, the variable is colored in gray. Again, let us

Table 23.3: Independence configurations of X and Y given Z (i.e. the node of Z is grayed out).

(a') Common parent, X ← Z → Y, with Z observed: X and Y have Z as parent. Learning the value of X does not have any consequence on the distribution of Z anymore, since Z is known. So it has no influence on Y either:
$$P_{Y \mid X, Z} = P_{Y \mid Z}$$
Therefore X and Y are independent given Z.

(b') Chain, X → Z → Y (or X ← Z ← Y), with Z observed: Z is in between X and Y. Learning the value of X cannot have any consequence on the distribution of Z since Z is known. So it has no influence on Y either:
$$P_{Y \mid X, Z} = P_{Y \mid Z}$$
Therefore X and Y are independent given Z.

(c') Common child, X → Z ← Y, with Z observed: Z has X and Y as parents. Learning the value of X can have some consequence on the distribution of Y given Z:
$$P_{Y \mid X, Z} \neq P_{Y \mid Z}$$
Therefore X and Y are not independent given Z.

prove the first case (a’):

$$P_{Y \mid X,Z} = \frac{P_{X,Y,Z}}{P_{X,Z}}
= \frac{P_Z\, P_{X \mid Z}\, P_{Y \mid Z}}{\sum_Y P_Z\, P_{X \mid Z}\, P_{Y \mid Z}}
= \frac{P_Z\, P_{X \mid Z}\, P_{Y \mid Z}}{P_Z\, P_{X \mid Z} \sum_Y P_{Y \mid Z}}
= \frac{P_Z\, P_{X \mid Z}\, P_{Y \mid Z}}{P_Z\, P_{X \mid Z} \times 1}
= P_{Y \mid Z}$$

One sees on this example that the information contained in observation “variable X is equal to
some value x” spreads over the graph by modifying the distribution of surrounding variables. A
variable Y will be independent of variable X if the distribution is unchanged, that is, for any
undirected path connecting X and Y , some intermediary configuration node Z will block in some
way the information X = x. The question is to identify what are these blocking conditions. The
answer to this question is provided by the following algorithm whose application scope is even
more general since it determines whether two sets of variables are independent conditionally on a
third set of variables. This theorem introduces the notion of d-separation.

Given a Bayesian network, two subsets X and Y of variables are independent of each other given the values of a third set Z of variables, if and only if:
1. For every couple (X, Y ) of variables of X × Y, decide whether X and Y are d-separated given Z (definition to come shortly). If one of the couples is not d-separated, conclude that X and Y are not independent conditionally on Z. If all of them are d-separated, then they are independent.
2. Two variables X and Y are d-separated given Z if and only if every undirected
path P going from X to Y is d-separated, that is, if one of the following events
occurs along the path:

• If P contains a pattern X . . . A ← B → C . . . Y such that B ∈ Z (cf


config. a’).
• If P contains a pattern X . . . A → B → C . . . Y or X . . . A ← B ← C . . . Y
such that B ∈ Z (cf config. b’).
• If P contains a pattern X . . . A → B ← C . . . Y such that B and all
descendants of B are not in Z (cf generalization of config. c).

Let us take some examples based on the Bayesian Network of Fig. 23.1a. By introducing the
notation X ⊥⊥ Y | Z to state that X and Y are independent given Z (and X ⊥̸⊥ Y | Z for the
negation), one has:

• B ⊥⊥ E since every path connecting B and E is d-separated:
  – Path B → D ← C → E is blocked at D (config. c).
  – Path B → D ← F → E is blocked at F (config. c).
  – Path B → F ← D ← C → E is blocked at F (config. c).
  – Path B → F ← E is blocked at F (config. c).

• B ⊥̸⊥ E | D since path B → D ← C → E is not blocking anymore: D is known (config. c')
  and C is unknown (config. a), so some information propagates between B and E through D
  and C.

• B ⊥⊥ E | D, C since now path B → D ← C → E is blocked at C (config. a').

• B ⊥̸⊥ E | G since path B → F ← E is not blocking as G is a descendant of F and G is
  known (config. c).

• C ⊥̸⊥ G | D since path C → E → F → G is not blocked at E and F (config. b in both cases).

• C ⊥⊥ G | D, E since every path connecting C and G is d-separated:
  – Path C → D → F → G is blocked at D (config. b').
  – Path C → E → F → G is blocked at E (config. b').
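
The blocking behaviour of the collider configurations (c) and (c') is easy to check numerically.
The sketch below is a small Monte Carlo illustration in plain numpy; the variables and the XOR
relation are our own toy choices (not the network of Fig. 23.1a): X and Y are drawn independently
and Z is a deterministic function of both.

import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Collider X -> Z <- Y, with X and Y marginally independent (config. c)
x = rng.integers(0, 2, size=n)          # X ~ Bernoulli(0.5)
y = rng.integers(0, 2, size=n)          # Y ~ Bernoulli(0.5)
z = np.logical_xor(x, y).astype(int)    # Z depends on both X and Y

# Marginally: P(Y=1 | X=0) and P(Y=1 | X=1) are (almost) equal -> X and Y independent
print(y[x == 0].mean(), y[x == 1].mean())

# Given Z = 1: P(Y=1 | X=0, Z=1) differs from P(Y=1 | X=1, Z=1)
# -> X and Y become dependent once the collider is observed (config. c')
sel = z == 1
print(y[(x == 0) & sel].mean(), y[(x == 1) & sel].mean())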

Independence and causality

The specification of independence relations between variables as explained in the previous
subsection, along with the notion of causality defined shortly, makes Bayesian Networks a powerful
tool to translate some expert knowledge into probabilistic models. One illustrates this point on
the classification problem of wrestler students.
But first let us explain what causality means and what the relationship between causality and
independence is. Causality can only be explained by introducing the notions of time and of a
physical model able to establish cause-to-consequence relationships between events. Informally, a
random variable X is causal for variable Y if X and Y are not independent AND if “the definition
of X on the time axis occurs before the definition of Y”. On our example, one can for instance say
that the gender G of a student is a causal variable of his/her height H, as the gender is
determined before the birth of the student, whereas the adult height is determined after many
years. Whereas statistical dependence is a symmetric relation, causality is asymmetric. Causality
gives an orientation to a dependence relation between two variables, oriented by the time
direction, i.e. oriented from the cause to the consequence. If X is a causal variable of Y and if
one wants to formalize the stochastic relation linking X and Y, it is much more natural to consider
the conditional distribution of the consequence given the cause, P_{Y|X}, rather than the converse
P_{X|Y}. Because Bayesian Networks are oriented graphical models that represent conditional
distributions, they are the perfect tool to formalize the knowledge of some experts, by simply
specifying the causal relations that the expert has observed between relevant variables.
Let’s make the exercise for the classification problem of the wrestler students. One plays the
role of the expert and gives below the reasoning that led to the network of figure 23.2a. The
starting point is to define the target variable, which is here the boolean variable M that is true
if the student is a member of the wrestling association. Then one also considers the observable
variables that can help to figure out the value of M: the gender G, the height H, the weight W and
the dress code D (i.e. whether the student likes to wear sportswear or not). The problem with the
dress code variable is that a student is likely to wear sportswear if he/she likes sport in
general, not only wrestling. However, because wrestlers tend to be tall and heavy, one can identify
wrestlers among sports people using the height/weight variables. This is why one introduces a new
variable S that is true if the student regularly practices some sport. Variable S is different from
the other ones as its value cannot easily be determined just by looking at the physical appearance
of students. Such a variable whose values are not observed in samples is said to be a hidden
variable or a latent variable. The next step is to find some causal relations between variables³:

• Clearly the gender G is the “most causal” variable so that it has no parent in the Bayesian
  network.

• The height is not caused by any other variable but the gender. The only parent of H is thus
  G.

³ Establishing the right causality relations can be the source of intense arguments: for instance, should the
level of sports practice depend on gender? Men and women might disagree on the subject… However the practical
importance of these arguments must be mitigated: the orientation of some edges has generally not much consequence
on the results.

• Deciding on practicing wrestling mainly depends on the gender and the height: M has two
  parents, G and H.
• Sport practice is a consequence of practicing wrestling (i.e. all wrestlers practice sports),
  possibly also of the gender but not of the height, so S has only M and G as parents.
• Wearing sportswear can be explained by sport practice and possibly by the gender: D has
  two parents, S and G.
• Finally the weight is determined by the gender, the height, sport practice and whether the
  student is a wrestler: W has four parents, G, H, S and M.
All these causal relations define the Bayesian Network of figure 23.2a.

Figure 23.2: The Wrestling Club membership example. (a) With variable S. (b) Without variable S.

The next step is to fill the CPTs with probabilities by estimating them from the dataset.
However the variable S is latent: one does not know its values. There exist some general methods
to infer the distribution of hidden variables, as will be explained in chapter 25. For now, it is
simpler to delete this variable S from the network along with its incident edges. However the
deletion of a variable might cut some paths of dependence between the remaining variables. These
links must be restored. On the example, there are essentially 6 paths going through S: between G
and D, G and M, G and W, M and D, M and W, and D and W. However G is already directly
connected to D so there is no edge to add. The cases G and M, G and W, and M and W are
similar. The path between M and D must be restored: since causality is a transitive relationship,
clearly the new arc must go from M to D. Finally the path between D and W must be restored,
but here there is no obvious causal relation between D and W: one arbitrarily chooses to add an
arc from W to D. This gives the Bayesian Network of figure 23.2b.
Since weight and height have been divided into 20 intervals, one can determine that the number
of model parameters (i.e. the number of CPT entries) of the network of figure 23.2b is about 1750,
as detailed in table 23.4. Most of the parameters come from the CPT of the weight W as it combines
the 20 possible values of W with the 20 possible values of the height H. In practice however, many
combinations are unlikely or even impossible (like a 50 kg, 2.3 m high person) so that the number
of parameters actually used for classification is much smaller.

Table 23.4: Size of CPTs.

Variable name   Number of values   Number of parent value combinations   Size of CPT
G               2                  1                                      1 × 1 = 1
H               20                 2                                      19 × 2 = 38
M               2                  2 × 20 = 40                            1 × 40 = 40
W               20                 2 × 2 × 20 = 80                        19 × 80 = 1520
D               3                  2 × 2 × 20 = 80                        2 × 80 = 160
Total                                                                     1759

23.3.4 Continuous variables

Discretization of continuous variables is not a satisfying approach as some information is lost in
the discretizing process and as one might generate a huge number of parameters. On our example,
one knows that for a large number of students, the height and weight distributions are likely to
look like gamma distributions⁴ for some parameters (k, θ), which in turn can be approximated by
normal distributions N(µ, σ²)⁵.
More generally, let us consider some variable X and its parents Y1, …, Yk that are assumed in a
first stage to be all discrete. Then for each possible combination (y1, …, yk) of parents’ values,
one can always try to interpolate the distribution P_{X | Y1=y1, …, Yk=yk} with a family of
parametrized distributions P_{X | Θ(y1,…,yk)}. The interpolation consists in finding the best
parameters Θ̂(y1,…,yk) that maximize the likelihood. If x^j is the X value of the jth data sample
and if the set J(y1, …, yk) contains all indexes of data such that Y1 = y1, …, Yk = yk, then one
has:

    Θ̂(y1,…,yk) = argmax_Θ ( Σ_{j ∈ J(y1,…,yk)} log P_{X | Y1=y1,…,Yk=yk,Θ}(x^j) )

The optimal parameter value Θ̂ can be obtained analytically for most standard distributions. For
instance the parameters of a normal distribution N(µ, σ²) can directly be computed by estimating
the mean and variance of the samples:

    µ̂ = ( Σ_{j ∈ J(y1,…,yk)} x^j ) / |J(y1,…,yk)|

    σ̂² = ( Σ_{j ∈ J(y1,…,yk)} (x^j − µ̂)² ) / ( |J(y1,…,yk)| − 1 )

If no analytical solution is known, a numerical optimization algorithm (gradient ascent) must be


used.

Example:
Let’s take the height variable H as an example. H has only one parent, the gender G. One can
assume the height has a normal distribution N(µ_g, σ_g²) given the gender G = g. Only four real
parameters µ_female, σ²_female, µ_male and σ²_male must be learnt instead of a CPT of 38 entries.
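
For a discrete parent and a continuous child like H given G, this estimation takes a couple of
lines of code. The sketch below is a minimal numpy illustration; the data values are made up for
the example.

import numpy as np

# Hypothetical data: one entry per student, gender and height (in cm)
genders = np.array(["female", "male", "male", "female", "male", "female"])
heights = np.array([165.0, 180.0, 175.0, 170.0, 185.0, 160.0])

params = {}
for g in ("female", "male"):
    h = heights[genders == g]
    # Estimate the conditional normal distribution N(mu_g, sigma_g^2)
    params[g] = (h.mean(), h.var(ddof=1))   # ddof=1 matches the |J|-1 denominator above

print(params)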

If some parent variables are continuous, the dependence between these variables and X may
also be modelled by some parametrized function.

Example:
To make it clear, let’s take the example of the weight variable W. W has G, M and H as parents.

⁴ The (k, θ)-gamma distribution has a density defined on [0, ∞[ equal to f(x) = x^{k−1} e^{−x/θ} / (Γ(k) θ^k).
⁵ Of course it is only an approximation as a normal distribution could theoretically generate negative weights.

H is therefore a continuous parent variable of W. For students of gender G = g and wrestling club
membership M = m, the weight W is assumed to have a normal distribution N(f_{g,m}(h), g_{g,m}(h))
whose expected value and variance depend on the height H = h through some given parameterized
functions f_{g,m} and g_{g,m}. For instance one can choose a linear representation
f_{g,m}(h) = µ⁰_{g,m} + µ¹_{g,m} h and g_{g,m}(h) = (σ²)⁰_{g,m} + (σ²)¹_{g,m} h. This new
conditional distribution of W then only requires 4 × 2 × 2 = 16 parameters instead of 1520! The
values of these parameters can be determined by linear regression (OLS) in the case of linear
representations, or by any regression method in the general case (SVR, regression trees, etc.).
In total the new model takes 99 parameters instead of 1759. Modelling distributions of continuous
variables with parametrized functions is thus less sensitive to overfitting.

23.3.5 Naive Bayes: the standard version


In the previous section, one showed on the wrestler classification problem how Bayesian Networks
can help to reduce the number of model parameters required to represent the joint distribution, and
thus to overcome the curse of dimensionality. However such an approach requires an expert with a
sufficient amount of knowledge about the application domain in order to build a Bayesian Network
with a minimal number of connections. In many applications, such knowledge is not available.
In such a case, the classification method called “Naive Bayes” may be applied as it proposes a
very simple generic factorization of the joint distribution. Even if this simplicity may sometimes
produce simplistic and inefficient models for complex problems – explaining by the way the
unflattering name “naive” of the method – it can work well for simple problems and is very robust
against overfitting as it relies on very few parameters. Considering a target class variable Y and
M feature variables {Xi}_{1≤i≤M}, “Naive Bayes” assumes the feature variables are independent of
each other given the target class Y:

    ∀i, ∀j, i ≠ j ⇒ Xi ⊥⊥ Xj | Y

This amounts to considering the very simple Bayesian Network displayed on figure 23.3a.

Figure 23.3: Naive Bayes assumption. (a) Developed network. (b) With plate notation.

Figure 23.3b gives an equivalent representation using plates. Plates are a graphical convention of
Bayesian Networks to express complex networks in a compact form: given a plate, its content must be
duplicated as many times as the specified number (usually given on the bottom right corner of the
plate). Such a Bayesian Network states that the joint distribution can be factorized according to
the equation:

    P_{X1,…,XM,Y}(x1, …, xm, y) = P_Y(y) × Π_i P_{Xi | Y}(xi)

Let’s look at the consequences of such hypothesis on both steps of model learning and model
prediction.

Learning step
Learning a Naive Bayes model consists in estimating the discrete distribution PY of the target
variable Y and the distribution P Xi | Y of every feature variable Xi given the class variable Y .

• P_Y(y) is estimated as the number of samples of class Y = y divided by the size N of the
  dataset, that is, if one denotes for some event A the number N(A) of data verifying A:

      P̂_Y(y) = N(Y = y) / N

• If Xi is discrete, P_{Xi | Y} is also computed by counting:

      P̂_{Xi | Y=y}(xi) = N(Y = y ∩ Xi = xi) / N(Y = y)

• If Xi is continuous, P_{Xi | Y=y} must be approximated using a family of parametrized
  distributions P_{Xi | Θᵢʸ}. The method then consists in finding the best parameter value Θ̂ᵢʸ
  that maximizes the likelihood. If xᵢ^j denotes the value of feature Xi of the jth data sample
  and y^j its class, one has:

      Θ̂ᵢʸ = argmax_Θ ( Σ_{j : y^j = y} log P_{Xi | Y=y,Θ}(xᵢ^j) )

Prediction step
Given a new example to classify, predicting the distribution of its class y is straightforward given
its features (x1 , . . . , xm ):

    P_{Y | X1=x1,…,XM=xm}(y) = P_{X1,…,XM,Y}(x1, …, xm, y) / K
                             = (1/K) · P̂_Y(y) × Π_i P̂_{Xi | Y}(xi)

where K is a normalization factor equal to:

    K = Σ_y P_{X1,…,XM,Y}(x1, …, xm, y)

If the risk is defined on the standard 0-1 loss function, the best classifier is the one that
chooses the class of highest probability:

    ŷ = argmax_y P_{Y | X1=x1,…,XM=xm}(y)
      = argmax_y P_{X1,…,XM,Y}(x1, …, xm, y)
      = argmax_y ( P̂_Y(y) × Π_i P̂_{Xi | Y}(xi) )

Therefore prediction just consists in computing this product of probabilities for each value of y
and selecting the class with the highest product value. Note that the computation of K is not
necessary.
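
Both steps fit in a few lines of code for discrete features. The following sketch is a minimal
illustration in plain Python/numpy, with hypothetical data and no smoothing of unseen feature
values.

import numpy as np
from collections import Counter

def nb_fit(X, y):
    """Estimate P(Y) and P(X_i | Y) by counting (discrete features only)."""
    n, m = X.shape
    p_y = {c: cnt / n for c, cnt in Counter(y).items()}
    p_xi_given_y = {}                      # (feature index, class) -> {value: probability}
    for c in p_y:
        Xc = X[y == c]
        for i in range(m):
            counts = Counter(Xc[:, i])
            p_xi_given_y[(i, c)] = {v: k / len(Xc) for v, k in counts.items()}
    return p_y, p_xi_given_y

def nb_predict(x, p_y, p_xi_given_y):
    """Return argmax_y P(y) * prod_i P(x_i | y); the constant K is never computed."""
    scores = {}
    for c, py in p_y.items():
        score = py
        for i, v in enumerate(x):
            score *= p_xi_given_y[(i, c)].get(v, 0.0)
        scores[c] = score
    return max(scores, key=scores.get)

# Tiny hypothetical example with two binary features
X = np.array([[0, 1], [1, 1], [1, 0], [0, 0], [1, 1]])
y = np.array([0, 1, 1, 0, 1])
p_y, p_x = nb_fit(X, y)
print(nb_predict([1, 1], p_y, p_x))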

Conclusions
Naive Bayes is a must-know Bayesian classification method, because it is the simplest one. As a
consequence, its advantages and drawbacks are extreme. The main advantages are:
• The risk of overfitting is lower than with any other state-of-the-art method.
• The method is extremely fast and scalable for both prediction and learning steps. The model
  can be learnt very quickly with only one scan of the dataset and with a time complexity of
  Θ(N × M).

• The method seamlessly works with discrete and continuous variables.

However there are also some strong disadvantages:


• The independence hypothesis of Naive Bayes is too naive. The classifier is prone to
  underfitting as it won’t capture any complex joint distribution between the target Y and
  several features.
• In particular, Naive Bayes does not take into account information redundancy between
  features. Naive Bayes tends to overweight the redundant features in its classification
  decision. This is why dimensionality reduction techniques (PCA, etc.) should be used as a
  preprocessing step for Naive Bayes.

23.4 Bayesian Machine Learning in a strong sense


So far, the studied method of “Naive Bayes” was only Bayesian in the weak sense, according to
the distinction between weak and strong senses drawn in section 23.1: as a reminder,
a Bayesian method in the weak sense has models from which it is possible to derive a likelihood
function L_Z(θ) relatively to some dataset Z and then to search for the parameter value θ that
maximizes the likelihood. The current section covers Bayesian Machine Learning in the strong
sense, where the model parameters Θ (of some Bayesian method in the weak sense) are described
not by a single value but rather by a full distribution P_{Θ | Z}. The next subsection develops
this approach on an abstract level and on a very simple example before the strong Bayesian version
of Naive Bayes is given as a more comprehensive example.

23.4.1 Principles of Bayesian statistics and inference


Basic notions
Because Bayesian Machine Learning lies at the intersection of Machine Learning and Bayesian
statistics, it is important to bridge notions from both worlds:
In statistics, a specific model (i.e with fixed parameter values) denoted m plays the same
role as a specific hypothesis h in Machine Learning. In most cases, the model is defined by a
vector of predefined parameters denoted θ (usually defined in some Euclidean space Rn ). Statistics
distinguishes two types of models which can be connected somehow to the notion of supervised
and unsupervised learning.
• Generative models attempt to model the joint distribution of all variables (feature vector X
  and target Y), P_{X,Y | Θ}. They are called generative as such models can be used to generate
  new samples, once the parameters θ have been learnt. These models are suitable for both
  supervised learning problems and unsupervised density estimation problems. Naive Bayes is an
  example of a method based on a generative model as the joint distribution is fully described
  by the model parameters according to:

      P_{X,Y | Θ} = P_{Y | Θ} × Π_i P_{Xi | Y,Θ}

• Discriminative models are less powerful than generative models as they only attempt to model
  the conditional distribution P_{Y | X,Θ} of some output variable Y given some input variables
  X. The model does not take into account the distribution P_X of the input features. These
  models are only suitable for supervised learning problems. Note that given a generative
  model, it is always possible to deduce a discriminative model as:

      P_{Y | X,Θ} = P_{X,Y | Θ} / ( Σ_Y P_{X,Y | Θ} )

Example:
Let us take an extremely simple example that will be further developed in the next sections: one
wants to estimate the performance of a given network server. The corresponding generative model
defines the distribution P_T of the processing time T required to answer a query. To keep it
simple, T is assumed to be normally distributed (i.e. T ∼ N(µ, σ²) with θ = [µ, σ²]), even if this
is theoretically absurd: indeed with such a distribution, the probability of having some negative
processing time is non null.

However Bayesian Machine Learning does not work on a single hypothesis or model but on a
whole distribution of models in order to represent the current uncertainty about the knowledge of
the model. In other words, the model should be viewed as a random variable denoted M, which can
be represented equivalently by a random variable Θ of model parameters6 . In order to describe the
6 Depending on the context, one interchangeably uses M or Θ to represent the model as a random variable.

distribution PΘ of the model parameters, one usually uses some parametric representation P Θ | κ
whose parameters κ are called hyperparameters. Hyperparameters are not random variables but
constants. This fundamental difference is represented schematically by the Bayesian networks of
figure 23.4.
Figure 23.4: Difference between Bayesian Machine Learning in a weak and strong sense.
(a) Weak sense: Θ → X. (b) Strong sense: κ → Θ → X.

Example:
The model parameter vector Θ of the network server is [µ, σ]. Let’s assume its technical
specification states that the standard deviation σ of the processing time is almost fixed, equal to
10 ms. However the average processing time µ can vary depending on some components of the server.
It is equal to 50 ms ± 5 ms. This initial knowledge can be modelled by defining a distribution on
Θ = [µ, σ]:

• σ is a random variable following a Dirac distribution centered on σ_T = 10 ms.

• µ is a random variable following a normal distribution N(µ₀, σ₀²) with µ₀ = 50 ms and
  σ₀ = 5 ms.

• µ and σ are independent.

The hyperparameters are thus κ = [µ₀, σ₀, σ_T].

Like any other Machine Learning method, a Bayesian method must provide two algorithms: a
learning algorithm and a predicting algorithm.

• The learning algorithm must infer the distribution P Θ | S,κ (h) of parameters/hypothesis con-
ditioned on the set of observed samples S and values of hyperparameters κ.

• The prediction algorithm must infer an output y from some new input x and the learnt
distribution P Θ | S,κ of parameters/hypothesis.

Of course learning and prediction steps can be interleaved as Bayesian methods are naturally
incremental: the learning algorithm first consists in updating the current model given some new
samples, the second one in predicting some new samples given the current model. These two steps
are further explained in the two next paragraphs.

Bayesian prediction
Given a Bayesian model with parameters Θ and possibly some input x, what output value y should
the model predict? In fact this question is ill-posed for two reasons: first, the model is
stochastic, so the correct answer to give is that the output value will be drawn according to the
distribution P_{Y | θ,X=x}. Secondly, even the parameters Θ are not known perfectly and are
described by some distribution P_{Θ | S,κ}. The predicted distribution of the output must thus be
averaged over the distribution of parameters:

    P_{Y | X=x,S,κ}(y) = E_Θ[ P_{Y | Θ,X=x}(y) ]
                       = ∫ P_{Y | Θ=θ,X=x}(y) · P_{Θ | S,κ}(θ) · dθ

Example:
On the server example, the distribution (in density of probabilities) of the output T is

    P_{T | κ}(t) = ∫ P_{T | θ}(t) × P_{θ | κ}(θ) dθ
                 ∝ ∫∫ exp(−(t−µ)²/(2σ²)) × exp(−(µ−µ₀)²/(2σ₀²)) δ(σ, σ_T) dµ dσ
                 ∝ ∫ exp(−(t−µ)²/(2σ_T²)) × exp(−(µ−µ₀)²/(2σ₀²)) dµ
                 ∝ exp(−(t−µ₀)²/(2(σ₀² + σ_T²))) × ∫ exp(−(µ−µ₁)²/(2σ₁²)) dµ

by introducing the constants σ₁² and µ₁:

    σ₁² = 1 / (1/σ₀² + 1/σ_T²)    and    µ₁ = ( µ₀/σ₀² + t/σ_T² ) / ( 1/σ₀² + 1/σ_T² )    (23.83)

The previous expression can thus be rewritten as a product of two normal densities, one in t
(t ∼ N(µ₀, σ₀² + σ_T²)) and one in µ (µ ∼ N(µ₁, σ₁²)), so:

    P_{T | κ}(t) ∝ exp(−(t−µ₀)²/(2(σ₀² + σ_T²))) × ∫ (1/√(2π σ₁²)) exp(−(µ−µ₁)²/(2σ₁²)) dµ
                 ∝ exp(−(t−µ₀)²/(2(σ₀² + σ_T²)))

A Bayesian prediction thus outputs the following distribution for t:

    t ∼ N(µ₀, σ₀² + σ_T²)

As expected, the most probable value for t is µ₀ = 50 ms. However the variance is σ₀² + σ_T² =
125 ms². This is because the uncertainty on t comes from two different sources: first the varying
processing time of queries (represented by σ_T²) and second the uncertain knowledge about the
average processing time of the server (represented by σ₀²).
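
This predictive distribution is easy to cross-check by simulation: sample µ from the prior, then t
given µ. The sketch below uses the hyperparameter values of the example (numpy only).

import numpy as np

rng = np.random.default_rng(0)
mu0, sigma0, sigmaT = 50.0, 5.0, 10.0       # hyperparameters of the running example

# Sample from the predictive: first mu ~ N(mu0, sigma0^2), then t ~ N(mu, sigmaT^2)
mu = rng.normal(mu0, sigma0, size=1_000_000)
t = rng.normal(mu, sigmaT)

print(t.mean())   # close to mu0 = 50
print(t.var())    # close to sigma0^2 + sigmaT^2 = 125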

Bayesian inference
In Bayesian Machine Learning the distribution P_{Θ | S,κ} of the model is updated every time some
new data is added to the sample set S. These data are also called observations to emphasize the
fact that samples can come on-the-fly and that the Machine Learning process never stops. Given a
supervised
or unsupervised learning problem with some set O of observations, the general methodology of
Bayesian Machine Learning is the following:
1. First choose a family of generative models as a candidate for producing an observation o. This
family of models is supposedly characterized by a distribution P ( O | θ ) for some unknown
parameter vector θ.

2. Second consider the model parameters θ as a random variable Θ and choose some distribution
PΘ (θ). This distribution PΘ (θ) called a priori distribution or prior is initialized to represent
the prior knowledge, if any, on the model. In practice the prior is parametrized by some
hyperparameters κ and should be written P Θ | κ (θ).

3. The third step is to condition the distribution PΘ (θ) on the observations O, that is, replacing
P Θ | κ (θ) by P Θ | O,κ (θ). If observations contain sufficient information, the uncertainty (i.e.
entropy) on the model will reduce.

4. The third step is repeated every time some new observation becomes available.

The third step is where Bayes’ rule comes into action. To see this, let us consider an abstract
model defined by the distribution P_Θ(θ). A first observation o₁ (or possibly a set of samples)
becomes available. Thanks to Bayes’ rule, P_{Θ | O=o₁,κ} can be deduced from P_{Θ | κ}:

    P_{Θ | O=o₁,κ}(θ) = (1 / P_{O | κ}(o₁)) P_{O | Θ=θ,κ}(o₁) P_{Θ | κ}(θ)
                      = (1 / P_{O | κ}(o₁)) P_{O | Θ=θ}(o₁) P_{Θ | κ}(θ)

    with  P_{O | κ}(o₁) = ∫ P_{O | Θ=θ}(o₁) P_{Θ | κ}(θ) dθ

Example:
On the server example, suppose a first query is processed in t₁ = 46 ms. How does it update our
knowledge on the parameters Θ = [µ, σ²] describing our server? Let’s apply the previous equation in
this context:

    P_{Θ | T=t₁,κ}(θ) = (1 / P_{T | κ}(t₁)) P_{T | Θ=θ}(t₁) × P_{Θ | κ}(θ)
                      ∝ exp(−(t₁−µ)²/(2σ²)) × exp(−(µ−µ₀)²/(2σ₀²)) δ(σ, σ_T)
                      ∝ exp(−(t₁−µ)²/(2σ_T²)) × exp(−(µ−µ₀)²/(2σ₀²)) δ(σ, σ_T)
                      ∝ exp(−(t₁−µ₀)²/(2(σ₀² + σ_T²))) × exp(−(µ−µ₁)²/(2σ₁²)) δ(σ, σ_T)
                      ∝ exp(−(µ−µ₁)²/(2σ₁²)) δ(σ, σ_T)

where the constants σ₁ ≈ 4.47 ms and µ₁ = 49.2 ms are the same as previously defined in
equation (23.83), applied with t = t₁:

    σ₁² = 1 / (1/σ₀² + 1/σ_T²)    and    µ₁ = ( µ₀/σ₀² + t₁/σ_T² ) / ( 1/σ₀² + 1/σ_T² )

The last expression shows that:

• µ ⊥⊥ σ | T₁ = t₁, κ (as µ and σ are separable in the joint posterior distribution).

• σ | T₁ = t₁, κ ∼ δ(σ, σ_T). Because the prior distribution sets σ to the value σ_T with
  probability 1, the value of σ will never change.

• µ | T₁ = t₁, κ ∼ N(µ₁, σ₁²). This is the most interesting result:

  – σ₁ is smaller than σ₀: the uncertainty on µ has decreased thanks to the observation of
    t₁. The smaller the uncertainty on the observation (i.e. the smaller σ_T²), the larger the
    decrease.

  – µ₁ is a weighted average of µ₀ with weight 1/σ₀² and t₁ with weight 1/σ_T²: µ₁ is thus a
    compromise between the prior knowledge of µ, represented by µ₀, and the observation t₁.
    The smaller the variance (σ₀² for µ₀, σ_T² for t₁), the higher the weight.

These three statements fully define the posterior distribution P_{Θ | T=t₁,κ}.

Repeating the operation when new observations o₂ and o₃ become available gives (with some
simplified notation):

    P_{Θ | o₁,o₂,κ}    = (1 / P_{o₂ | o₁,κ}) P_{o₂ | Θ} P_{Θ | o₁,κ}
    P_{Θ | o₁,o₂,o₃,κ} = (1 / P_{o₃ | o₁,o₂,κ}) P_{o₃ | Θ} P_{Θ | o₁,o₂,κ}

The more observations are collected, the less influence the initial prior P_{Θ | κ} has on the
posterior P_{Θ | O,κ}.

Example:
On the server example, suppose n observations (t₁, …, tₙ) have been made. How do they update our
knowledge on the parameters Θ = [µ, σ²] describing our server? By applying n times the Bayesian
inference found previously for each observation tᵢ, it is straightforward to show:

    µ | t₁, …, tₙ, κ ∼ N(µₙ, σₙ²)

    with  σₙ² = 1 / ( n/σ_T² + 1/σ₀² )

    and   µₙ = ( (n/σ_T²) t̄ + µ₀/σ₀² ) / ( n/σ_T² + 1/σ₀² )    with  t̄ = (1/n) Σᵢ tᵢ

The variance σₙ² decreases in Θ(1/n) so that µ converges relatively slowly, in Θ(1/√n), towards
the average t̄ of the observations when the number n of observations increases. The expression of
µₙ also shows that the weight of the prior becomes negligible when n is large.
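
These closed-form updates translate directly into code. The sketch below (numpy; the list of
observations is made up) performs the update one observation at a time and checks that it matches
the batch formulas above.

import numpy as np

mu0, sigma0_sq, sigmaT_sq = 50.0, 25.0, 100.0
observations = [46.0, 52.0, 48.0, 51.0]

# Sequential update: the posterior after each observation becomes the next prior
mu, sigma_sq = mu0, sigma0_sq
for t in observations:
    sigma_new = 1.0 / (1.0 / sigma_sq + 1.0 / sigmaT_sq)
    mu = sigma_new * (mu / sigma_sq + t / sigmaT_sq)
    sigma_sq = sigma_new

# Batch formulas with n observations
n = len(observations)
t_bar = np.mean(observations)
sigma_n_sq = 1.0 / (n / sigmaT_sq + 1.0 / sigma0_sq)
mu_n = ((n / sigmaT_sq) * t_bar + mu0 / sigma0_sq) / (n / sigmaT_sq + 1.0 / sigma0_sq)

print(mu, sigma_sq)        # sequential result
print(mu_n, sigma_n_sq)    # identical batch result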

More generally, Bayesian inference consists in updating the model distribution P ( Θ | O ) es-
tablished so far from the past observations O when some new observation o is received:

The fundamental equation of Bayesian inference is equation (23.103):

    P_{Θ | o,O,κ} = (1 / P(o | O, κ)) P_{O | Θ}(o) P_{Θ | O,κ}                    (23.103)

Every factor of this equation has a special role and name:

• The distribution P_{O | Θ}(o) has already been introduced. It is the likelihood of
  the parameters θ relatively to the observation o. In other words, P_{O | Θ=θ}(o)
  is a score between 0 and 1 that represents how plausible the generative model of
  parameters θ is given the observation o. Since the observation o is fixed and the
  model parameters θ are unknown, the likelihood is considered as a function of θ
  and is denoted L_o(θ). As such, this function is not a distribution (i.e. its
  integral over θ is not necessarily equal to one).
• The factor P_{Θ | O,κ} is the a priori distribution or prior. It is so called since
  it is the current distribution of model parameters before the observation o is made.
  When o is the first observation ever made (i.e. when O = ∅), the prior should
  reflect the a priori knowledge on the parameters. In absence of any prior
  knowledge, one usually makes the most neutral choice, like choosing a uniform
  distribution if the support of θ is known, or a normal distribution with very
  large variance if the support is infinite.
• The result P_{Θ | o,O,κ} is the a posteriori distribution or posterior. It is the
  resulting model distribution updated by the new observation o. The posterior
  becomes the prior of the next observation.
• The factor P(o | O, κ) is constant with respect to the model parameters Θ. It can
  be viewed as a normalizing factor K that ensures the resulting posterior is a
  distribution:

      K = P(o | O, κ) = ∫ P_{O | Θ=θ}(o) P_{Θ | O,κ}(θ) dθ

  In practice this factor is either ignored when it is useless, or it is computed a
  posteriori by normalizing P_{Θ | o,O,κ} (see algorithm 21).

From a practical perspective, it is useless to memorize all successive models and observations.
Algorithms only keep in memory a representation of the current distribution PΘ of parameters,
replacing the prior by the posterior once a new observation is made. Assuming all parameters are
discrete and the number V of their value combinations is not too large, the typical structure of a
Bayesian method looks like algorithm 21. However most of the time, the space of parameters is
rarely countable (e.g. continuous parameters) or very large so that the previous algorithm is not
tractable. Two solutions then exist:
• Either use some standard sampling techniques (Monte Carlo or Markov Chain Monte Carlo
algorithms) to have some approximated representation of posterior PΘ . These techniques
will be introduced later in chapter 28.
• Or choose some families of prior distribution parametrized by some hyperparameters such
that the posterior distribution has the same algebraic form as the prior. Such distributions
are called conjugate priors and are presented in the next subsection.

23.4.2 Conjugate Priors


General definition
The example of the network server was nice as the assumption that the prior follows a normal
distribution implied that the posterior of an observation was also normal.

Algorithm 21 Brute-force algorithm for Bayesian inference.


1: Create array PΘ [] of size V and initialize it to reflect prior knowledge on Θ.
2: // Online processing loop
3: loop
4: Wait for some new observation o.
5: K←0
6: // Bayesian inference of posterior
7: for every parameter value θ do
8: PΘ [θ] ← Lo (θ) PΘ [θ]
9: K ← K + PΘ [θ]
10: end for
11: // Normalization of posterior
12: for every θ do
13: PΘ [θ] ← (1/K) PΘ [θ]
14: end for
15: end loop
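
A direct transcription of algorithm 21 could look as follows (a Python sketch for a
one-dimensional parameter discretized on a grid; the grid bounds and the Gaussian likelihood are
placeholder choices to be adapted to the problem at hand).

import numpy as np

# Discrete grid of candidate parameter values theta (V values)
thetas = np.linspace(40.0, 60.0, 201)

# Prior knowledge on Theta: here a flat prior
p_theta = np.full(thetas.shape, 1.0 / thetas.size)

def likelihood(o, theta, sigma_T=10.0):
    """L_o(theta): likelihood of observation o under the model of parameter theta."""
    return np.exp(-0.5 * ((o - theta) / sigma_T) ** 2)

def bayes_update(p_theta, o):
    """One pass of the online loop: multiply by the likelihood, then normalize."""
    posterior = likelihood(o, thetas) * p_theta
    return posterior / posterior.sum()

for o in [46.0, 52.0, 48.0]:           # observations arriving on-the-fly
    p_theta = bayes_update(p_theta, o)

print(thetas[np.argmax(p_theta)])      # MAP estimate on the grid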

Because the posterior becomes the prior of the next observation, one has the guarantee that this
property propagates and that the distribution of parameters will always be normal. In such a case,
Bayesian inference is only about updating the hyperparameters of P_Θ and this can be done
efficiently with normal distributions. Such a property is actually not restricted to normal
distributions and is intensively used in Bayesian Machine Learning:

Given a model described by a likelihood L_X(θ) of parameters Θ, a conjugate
prior for this likelihood is a parametrization f_κ of the prior distribution (i.e.
∃κ, ∀θ, P_Θ(θ) = f_κ(θ)) by some hyperparameters κ such that the resulting posterior
can be parametrized the same way:

    ∃κ, P_Θ = f_κ  ⇒  ∃κ′, P_{Θ | O=o} = f_{κ′}

A likelihood can have many different types of conjugate priors.

Example:
One has already seen that given a normal likelihood X ∼ N(µ, σ²) parametrized by Θ = [µ, σ²],
one possible conjugate prior is a constant variance σ_X² and a normal distribution N(µ₀, σ₀²) for
µ, parametrized by κ = [µ₀, σ₀]. The resulting posterior after making observations (x₁, …, xₙ) is
µ | x₁, …, xₙ ∼ N(µₙ, σₙ²) where:

    σₙ² = 1 / ( n/σ_X² + 1/σ₀² )

    µₙ = ( (n/σ_X²) x̄ + µ₀/σ₀² ) / ( n/σ_X² + 1/σ₀² )    with  x̄ = (1/n) ∑ᵢ xᵢ

This example can be generalized to an observation vector X following a multivariate normal
distribution N(µ, Σ) of fixed covariance matrix Σ and unknown expected value µ. A possible
conjugate prior is then a multivariate normal distribution µ ∼ N(µ₀, Σ₀). Indeed, after making
observations (x₁, …, xₙ), the posterior will be another multivariate normal distribution
µ | x₁, …, xₙ ∼ N(µ₁, Σ₁) with:

    Σ₁ = ( n Σ⁻¹ + Σ₀⁻¹ )⁻¹

    µ₁ = ( n Σ⁻¹ + Σ₀⁻¹ )⁻¹ ( n Σ⁻¹ x̄ + Σ₀⁻¹ µ₀ )    with  x̄ = (1/n) ∑ᵢ xᵢ

This conjugate prior is a generalization of the univariate case and is extensively used in many
applications such as Bayesian linear regression, Gaussian Processes, Kalman filters, etc. Most of the
common discrete and continuous distributions have conjugate priors (see for instance wikipedia).

Example of the Bayesian version of “Naive Bayes”


It is now possible to design a full Bayesian version of “Naive Bayes”. As a simplification, let’s
assume all variables are discrete. Naive Bayes states that the joint distribution of the feature
variables (Xi) and the target Y can be factorized as:

    P_{X1,…,XM,Y | Θ}(x1, …, xm, y) = P_{Y | Θ}(y) × Π_i P_{Xi | Y,Θ}(xi)

The parameters Θ of the model are the CPTs of the underlying Bayesian network. It has already
been shown that the MLE principle amounts to counting occurrences in the dataset:

    P̂_Y(y) = N(Y = y) / N

    P̂_{Xi | Y=y}(xi) = N(Y = y ∩ Xi = xi) / N(Y = y)
Because every CPT has its own parameters, every CPT can be processed as an independent
likelihood function. Let’s focus for instance on the CPT of Y. If Y can take k values encoded by
numbers from 1 to k, the CPT is a categorical distribution described by k probabilities
(θ₁, …, θₖ) such that Σᵢ θᵢ = 1 (so in practice there are only k − 1 degrees of freedom).
Given such a likelihood function P_{Y | Θ} with Θ = [θ₁, …, θₖ], is there any simple conjugate
prior? Such a prior must be a distribution over categorical distributions: in other words, a sample
of this distribution must be a point of the (k − 1)-simplex denoted ∆^{k−1}, that is, the affine
subspace of R^k whose point coordinates (θ₁, …, θₖ) verify Σᵢ θᵢ = 1. To answer this question one
first introduces Dirichlet distributions:

A Dirichlet distribution of dimension k is parametrized by k parameters α =
(α₁, …, αₖ). Its support is the (k − 1)-simplex, i.e. every possible k-categorical
distribution. More exactly, the density of probability to draw a point
θ = (θ₁, …, θₖ) ∈ R^k is equal to:

    p_Θ(θ) = (1 / B(α)) · Π_i θᵢ^{αᵢ−1}   if θ ∈ ∆^{k−1}
           = 0                            if θ ∉ ∆^{k−1}

The normalisation factor B(α) is the beta function of the vector α. It is equal to:

    B(α) = ( Π_i Γ(αᵢ) ) / Γ( Σᵢ αᵢ )

The gamma function Γ(x) generalizes the factorial function, as Γ(n) = (n − 1)! if n ∈ N:

    Γ(x) = ∫₀^∞ t^{x−1} e^{−t} dt

The beta function B(α) is thus a generalization of multinomial coefficients.



One then claims that Dirichlet distributions are conjugate priors of categorical distributions.
In order to prove it, let us define Nᵢ as the number of observations equal to i in some set of
observations O = (y₁, …, yₙ). Then, considering that the prior is a Dirichlet distribution of
hyperparameters κ = (α₁, …, αₖ), one has:

    p_{Θ | O,κ}(θ) ∝ P(O | Θ) × p_{Θ | κ}(θ)
                   ∝ ( Π_j θ_{y_j} ) × ( Π_i θᵢ^{αᵢ−1} )
                   ∝ ( Π_i θᵢ^{Nᵢ} ) × ( Π_i θᵢ^{αᵢ−1} )
                   ∝ Π_i θᵢ^{Nᵢ+αᵢ−1}

The posterior distribution is thus another Dirichlet distribution of parameters
α′ = (N₁ + α₁, …, Nₖ + αₖ). This also gives a natural interpretation of the prior parameters αᵢ:
the presence of a non-null parameter αᵢ amounts to observing αᵢ fictitious observations whose value
is equal to i. This interpretation gives a rule of thumb to determine a prior that represents the
right amount of confidence in the initial knowledge of the problem.
Finally the strong Bayesian version of Naive Bayes only consists in initializing every CPT
entry with some α value. From the point of view of the implementation, the difference between
the standard Naive Bayes and the fully Bayesian Naive Bayes is very small (it only initializes
already existing counters to some non-null values instead of setting them to zero), as sketched
below. This illustrates the point that Bayesian Machine Learning in the weak and in the strong
sense are just two available options for the same class of methods. The next section on Bayes
estimators makes the link even stronger.
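
As announced above, a minimal sketch of this difference for a single CPT (plain numpy; the α
values and the observations are arbitrary):

import numpy as np

k = 3                                   # number of classes of Y
alpha = np.full(k, 1.0)                 # Dirichlet prior: alpha_i fictitious observations per value

observations = [0, 2, 2, 1, 2, 0]       # observed values of Y, encoded in {0, ..., k-1}
counts = np.bincount(observations, minlength=k)

# Posterior is Dirichlet(N_i + alpha_i); its mean gives a smoothed estimate of P(Y = i)
posterior_alpha = counts + alpha
p_y = posterior_alpha / posterior_alpha.sum()

# MLE (standard Naive Bayes) would instead use counts / counts.sum()
print(p_y, counts / counts.sum())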

23.4.3 Bayesian estimation


Bayesian Machine Learning is powerful as it enables online updates. However maintaining the
whole distribution P_Θ might be costly in both processing time and memory usage. This amounts
to using a sledgehammer to crack a nut when the goal of Machine Learning is to identify a unique
optimal set θ* of parameter values. This can happen for instance when designing a real-time
pattern recognition system: Machine Learning then helps to identify a unique set of parameters
that will tune a fast classifier. The next sections study how to extract such optimal parameter
values from the posterior distribution.

Bayes estimator
Given some parameter distribution P_Θ, Bayesian estimation consists in finding the optimal
parameter value θ* relatively to some user-defined loss function. This operation is called the
Bayes estimator:

    Given some data O, some learnt model P_{Θ | O} and some loss function L(θ, θ̂) measuring
    the cost of choosing parameters θ̂ when the real parameters are equal to θ, the Bayes
    estimator θ̂ is the parameter value that minimizes the risk relatively to the posterior
    distribution P_{Θ | O}:

        θ̂ = argmin_{θ′} E_{P_{Θ|O}}[ L(Θ, θ′) ]

Example:
In the case where the loss is quadratic, L(θ, θ̂) = (θ̂ − θ)², the Bayes estimator is the expected
value of Θ | O:

    θ̂ = argmin_{θ′} E_{P_{Θ|O}}[ (θ′ − Θ)² ]
      = argmin_{θ′} ( (θ′ − E[Θ])² + E[ (Θ − E[Θ])² ] )
      = E_{P_{Θ|O}}[Θ]

Maximum A Posteriori estimator (MAP)

What is the Bayes estimator for problems which are not cost sensitive, that is, when the user
chooses the binary loss:

    L(θ, θ̂) = 0 if θ = θ̂, 1 otherwise

One then has:

    θ̂ = argmin_{θ′} E_{P_{Θ|O}}[ L(Θ, θ′) ]
      = argmin_{θ′} E_{P_{Θ|O}}[ 1 − δ(Θ, θ′) ]
      = argmax_{θ′} E_{P_{Θ|O}}[ δ(Θ, θ′) ]
      = argmax_{θ′} P_{Θ | O}(θ′)
      = argmax_{θ′} ( (1 / P(O)) P(O | Θ = θ′) P(Θ = θ′) )
      = argmax_{θ′} ( P(O | Θ = θ′) P(Θ = θ′) )

This estimator is called the Maximum A Posteriori estimator (MAP):

    The Maximum A Posteriori estimator (MAP) selects the parameters that maximize the posterior:

        θ̂_MAP = argmax_{θ′} ( P(O | Θ = θ′) P(Θ = θ′) )

Maximum Likelihood estimator (MLE)

If, in addition to a binary loss, one has no initial knowledge about the values of the parameters,
the prior is a uniform distribution equal to a constant K and the MAP estimator is equal to the
classical Maximum Likelihood estimator:

    θ̂_MAP = argmax_{θ′} ( P(O | Θ = θ′) P(Θ = θ′) )
          = argmax_{θ′} ( P(O | Θ = θ′) )

    The Maximum Likelihood estimator (MLE) selects the parameters that maximize the likelihood:

        θ̂_MLE = argmax_{θ′} ( P(O | Θ = θ′) )

As a conclusion, the MLE principle is a specific subcase of a Bayesian estimation when the
problem has no prior knowledge on the model parameters and is not cost sensitive.
Chapter 24

Gaussian and Linear Models for Supervised Learning

In the previous chapter, a first classification method called “Naive Bayes” was presented as being
probably the most straightforward Bayesian learning method. However the underlying assumption of
descriptive features being independent conditionally to the class feature is most of the time too
naive. What are the consequences for the models if one rejects this oversimplifying hypothesis? In
the case of categorical descriptive features, one already knows it amounts to merging the dependent
descriptive features into one joint random variable¹ whose distribution is still a categorical
distribution. Even if the introduction of this new categorical distribution requires a larger
number of model parameters and thus increases the risk of overfitting, the form of the resulting
model remains unchanged, i.e. identical to the initial Naive Bayes setting.
However in the case of continuous features, merging them into a single joint random variable
requires a deeper analysis. The distribution of the resulting random variable obviously depends on
the marginal distributions of the descriptive features. However many joint distributions can share
the same set of marginal distributions. Even in the simplest case where all these features are
assumed to have a univariate Gaussian distribution, the resulting joint distribution might not be
Gaussian. In this chapter one focuses on the specific case where this joint distribution is indeed
Gaussian, i.e. is a multivariate normal distribution (MVN). As will be seen shortly, the
multivariate normal distribution is the simplest and most natural form of joint distribution for
continuous variables in order to take into account correlation (and thus dependency) between real
random variables. MVNs have elegant algebraic properties that underlie many Bayesian methods
presented in this chapter and the subsequent ones. In particular, normal distributions are closely
related to linear models.
The current chapter thus shows how normal distributions occur in Bayesian linear models or when
relaxing the independence assumption in Naive Bayes. To this end, section 24.1 first investigates
some fundamental properties of MVNs. These properties are then applied in section 24.2 to Bayesian
classification without requiring, like Naive Bayes, any strong hypothesis of independence.
Section 24.3 then considers linear regression problems and shows how the Bayesian approach
generalizes and legitimates the classical Ordinary Least Squares (OLS) method and regularized
versions of it, like Ridge regression. Finally section ?? considers linear classification problems
and again shows how the Bayesian approach generalizes the classical logistic regression.

24.1 Multivariate normal distributions


The multivariate normal distribution (MVN) is the most fundamental distribution for continuous
random variables. This section recalls its fundamental properties.

1 Or several joint random variables if dependent descriptive features can be gathered in such a way that these

groups are independent between each other conditionally to the target feature.


24.1.1 Definition and fundamental properties


A MVN N (µ, Σ) is parameterized by its two first moments: its mean vector µ and its covariance
matrix Σ.

Definition 24.1. The probability density function of a multivariate normal distribution N(µ, Σ)
of order d is

    p_X(x) = (1 / √((2π)^d det(Σ))) exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) ),   x ∈ R^d

An example of a density of MVN of dimension 2 is shown on Fig. 24.1.

Figure 24.1: Example of a MVN pdf with d = 2, µ = [1, 1]ᵀ, Σ = [[3, 1], [1, 2]].

It is interesting to observe that the MVN is directly parameterized by the moments of first and
second order (i.e. mean and covariance) and that the MVN is the most uncertain distribution with
given expected value and covariance:

Property 24.1. Given a vector µ ∈ Rd and a positive semi-definite matrix Σ ∈


Rd×d , the distribution of maximal entropy H (f ) = E [− log(f )] chosen among distri-
butions whose mean vector is equal to µ and covariance matrix is equal to Σ is the
multivariate normal distribution N (µ, Σ).

Since most ML methods only estimate the two first moments, MVN are a natural choice to represent
an unknown distribution constrained to have given mean and covariance.
One fundamental property intensively used by linear models is the fact that a linear function of
a normal random vector is again a normal random vector with known parameters that can easily
be computed using linear algebra:

Property 24.2. Given a normal random vector X ∼ N(µ, Σ) of order n, a vector b ∈ R^m and a
matrix A ∈ R^{m×n}, the random vector A X + b is again normal, with parameters:

    A X + b ∼ N( A µ + b, A Σ Aᵀ )

This latter property applied with b = 0_m and A = [I_m 0_{m,n−m}] allows one to derive the marginal
distributions of a MVN:

Corollary 24.1. Every component subset of a normal random vector is a normal random vector,
i.e. if a random vector X = [X1 X2] is split into two disjoint random vectors X1 and X2 so that

    [X1; X2] ∼ N( [µ1; µ2], [Σ11 Σ12; Σ12ᵀ Σ22] )

then

    X1 ∼ N(µ1, Σ11)        X2 ∼ N(µ2, Σ22)

Conditioning the distribution of X1 on some value x2 for X2 again gives a MVN according to the
following theorem:

Theorem 24.1. Every component subset of a normal random vector conditioned on some values for
another component subset is a normal random vector, i.e. if a random vector X = [X1 X2] is split
into two disjoint random vectors X1 and X2 so that

    [X1; X2] ∼ N( [µ1; µ2], [Σ11 Σ12; Σ12ᵀ Σ22] )

then

    X1 | X2 = x2 ∼ N( µ1 + Σ12 Σ22⁻¹ (x2 − µ2), Σ11 − Σ12 Σ22⁻¹ Σ12ᵀ )        (24.1)

MLE estimator of MVN

Property 24.3. Given n i.i.d. samples Xᵢ ∼ N(µ, Σ), the MLE is given by:

    µ̂_MLE = (1/n) Σᵢ xᵢ

    Σ̂_MLE = (1/n) Σᵢ (xᵢ − µ̂_MLE)(xᵢ − µ̂_MLE)ᵀ

Details of the proof are skipped. It consists in differentiating the log-likelihood as a function
of µ and Σ, then equating the gradient to 0.
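
With numpy this estimator is immediate. The sketch below samples from the MVN of figure 24.1 and
recovers its parameters; note that the covariance uses the 1/n normalization of property 24.3
(np.cov would use the unbiased 1/(n−1) normalization by default).

import numpy as np

rng = np.random.default_rng(0)
mu_true = np.array([1.0, 1.0])
sigma_true = np.array([[3.0, 1.0], [1.0, 2.0]])

X = rng.multivariate_normal(mu_true, sigma_true, size=5000)   # one sample per row

mu_mle = X.mean(axis=0)
centered = X - mu_mle
sigma_mle = centered.T @ centered / X.shape[0]    # 1/n normalization, as in property 24.3

print(mu_mle)
print(sigma_mle)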

Conjugate Prior of MVN

In a Bayesian context, it might be useful to know a conjugate prior for MVNs in order to make fast
and exact inference. In reality, an MVN likelihood X ∼ N (µ, Σ) accepts several conjugate priors
of various levels of complexity, each one being adapted to specific hypothesis or restrictions. In the
most general case, when mean vector µ and covariance matrix Σ are unknown random variables,
a possible prior is a normal-inverse-Wishart distribution:

Property 24.4. A normal-inverse-Wishart distribution NIW(µ₀, κ₀, Σ₀, ν₀) is a conjugate prior
for a MVN likelihood X ∼ N(µ, Σ), such that the mean vector µ given Σ follows an MVN prior
N(µ₀, Σ/κ₀) and the covariance matrix Σ prior is an inverse-Wishart distribution W⁻¹(Σ₀, ν₀).
In short:

    (µ, Σ) ∼ NIW(µ₀, κ₀, Σ₀, ν₀)  ⇔  µ | Σ ∼ N(µ₀, Σ/κ₀)  with  Σ ∼ W⁻¹(Σ₀, ν₀)

After n observations (Xᵢ)_{1≤i≤n}, the prior’s hyperparameters (µ₀, κ₀, Σ₀, ν₀) are updated as
follows:

    X̄ ← (1/n) Σᵢ Xᵢ
    µ₁ ← κ₀/(κ₀ + n) µ₀ + n/(κ₀ + n) X̄
    κ₁ ← κ₀ + n
    Σ̄ ← (1/n) Σᵢ (Xᵢ − X̄)(Xᵢ − X̄)ᵀ
    Σ₁ ← Σ₀ + n Σ̄ + (ν₀ n)/(ν₀ + n) (µ₀ − X̄)(µ₀ − X̄)ᵀ
    ν₁ ← ν₀ + n

However in many models, one can assume Σ is known, equal to a constant matrix Σ_c. In this
case, a simpler conjugate prior for µ is a MVN:

Property 24.5. A MVN-Dirac distribution is a conjugate prior for a MVN likelihood X ∼ N(µ, Σ),
such that the mean vector µ follows an MVN prior N(µ₀, Σ₀) and the covariance matrix Σ prior is
a Dirac distribution δ(Σ_c). In short:

    µ ∼ N(µ₀, Σ₀)  and  Σ = Σ_c

After n observations (Xᵢ)_{1≤i≤n}, the prior’s hyperparameters (µ₀, Σ₀) are updated as follows:

    X̄ ← (1/n) Σᵢ Xᵢ
    Σ₁ ← ( Σ₀⁻¹ + n Σ_c⁻¹ )⁻¹
    µ₁ ← Σ₁ × ( Σ₀⁻¹ µ₀ + n Σ_c⁻¹ X̄ )

This last result can be extended to the case where X ∼ N (µ, Σ) is not directly observed but
only a linear projection Y = A × X + b of it. Then it is still possible to update parameters µ0
and Σ0 of the normal distribution of µ given observations of Y based on the following property:

Property 24.6. Given a random vector X ∼ N(µ, Σ) normally distributed, such that:

• The mean vector µ follows an MVN prior N(µ₀, Σ₀) and the covariance matrix Σ prior is a
  Dirac distribution δ(Σ_c).

• One observes a value y of an output variable Y = A × X + b, linear in X, with some intrinsic
  noise/uncertainty of covariance matrix Σ_y, i.e.:

      Y ∼ N( A µ + b, A Σ Aᵀ + Σ_y )

Then the posterior of µ is still a MVN N(µ₁, Σ₁) whose hyperparameters are:

    Σ₁ ← ( Σ₀⁻¹ + Aᵀ Σ_y⁻¹ A )⁻¹
    µ₁ ← Σ₁ × ( Σ₀⁻¹ µ₀ + Aᵀ Σ_y⁻¹ (y − b) )

This last property is central in any Bayesian problem where observations linearly depend on model
parameters, such as in Bayesian Linear Regression (see section 24.3.3) or Kalman filters (see
section 26.4.2).
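
Property 24.6 is a pure linear-algebra update and can be implemented in a few lines; the sketch
below (numpy; the dimensions and numerical values are arbitrary) computes Σ₁ and µ₁ exactly as in
the formulas above.

import numpy as np

def gaussian_linear_update(mu0, Sigma0, A, b, Sigma_y, y):
    """Posterior N(mu1, Sigma1) of mu after observing y = A x + b + noise (property 24.6)."""
    Sigma1 = np.linalg.inv(np.linalg.inv(Sigma0) + A.T @ np.linalg.inv(Sigma_y) @ A)
    mu1 = Sigma1 @ (np.linalg.inv(Sigma0) @ mu0 + A.T @ np.linalg.inv(Sigma_y) @ (y - b))
    return mu1, Sigma1

# Toy example: a 2-dimensional mean observed through a 1-dimensional linear projection
mu0 = np.zeros(2)
Sigma0 = np.eye(2)
A = np.array([[1.0, 0.5]])
b = np.zeros(1)
Sigma_y = np.array([[0.1]])
y = np.array([2.0])

mu1, Sigma1 = gaussian_linear_update(mu0, Sigma0, A, b, Sigma_y, y)
print(mu1, Sigma1, sep="\n")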

24.2 Gaussian Discriminant Analysis

A first very simple application of MVNs is to consider a supervised classification problem where
the descriptive features X = (Xᵢ)_{1≤i≤m} follow a MVN within every class, in other words, a MVN
conditionally to the target feature Y. This assumption defines methods gathered under the general
name of Gaussian Discriminant Analysis (GDA). In some way, GDA generalizes Naive Bayes in the sense
that the descriptive features are not anymore assumed to be independent conditionally to the
target. Contrary to its name, GDA defines truly generative models, not discriminative ones, since
the joint distribution of X and Y is:

    P_{X,Y | Θ}(x, y) = P_{Y | Θ}(y) × P_{X | Y,Θ}(x)

Assuming a problem with k classes numbered from 1 to k, the CPT of Y is defined like with Naive
Bayes, by a categorical distribution of k parameters (π_y)_{1≤y≤k} = (P_Y(y))_{1≤y≤k} that can be
estimated using the MLE estimator (or a MAP estimator if required):

    P̂_Y(y) = N(Y = y) / N
The difference between GDA methods lies in the specification of the distribution of X conditionally
to Y. The main variants are presented here from the most expressive method (QDA) to the least
expressive ones (LDA and diagonal LDA).

24.2.1 Quadratic Discriminant Analysis (QDA)

Quadratic Discriminant Analysis (QDA) makes the most general assumption within Gaussian
Discriminant Analysis: every class has its own MVN to describe the descriptive variables.

    Quadratic Discriminant Analysis assumes a normal distribution of features X conditionally
    to the class Y:

        P_{X | Y=k,Θₖ}(x) ∼ N(µₖ, Σₖ)


Estimating the parameters of class k, i.e. the mean vector µₖ and the covariance matrix Σₖ, is
straightforward, using for instance the MLE estimators presented in section 24.1.1.
The method is qualified as quadratic according to the following property.

Property 24.7. Decision boundaries of QDA are quadratic hypersurfaces.

Proof. Points on a decision boundary separating two classes, let’s say Y = 1 and Y = 2, have
equal densities. By replacing the density by its expression, one has:

    P_{Y=1 | X=x,Θ} = P_{Y=2 | X=x,Θ}
    ⇔ P_{X | Y=1,Θ₁}(x) P_Y(1) = P_{X | Y=2,Θ₂}(x) P_Y(2)
    ⇔ (π₁/√det(Σ₁)) exp(−(1/2)(x−µ₁)ᵀΣ₁⁻¹(x−µ₁)) = (π₂/√det(Σ₂)) exp(−(1/2)(x−µ₂)ᵀΣ₂⁻¹(x−µ₂))
    ⇔ cst + (x−µ₁)ᵀΣ₁⁻¹(x−µ₁) = cst + (x−µ₂)ᵀΣ₂⁻¹(x−µ₂)
    ⇔ (x−a)ᵀB⁻¹(x−a) + c = 0  for some a ∈ R^d, B ∈ R^{d×d}, c ∈ R

This latter equation is a level set of a quadratic form and thus describes a quadratic
hypersurface.

Figure 24.2 shows examples of 2D class boundaries using QDA. Observe 1) how the boundaries are
curved quadratic lines, 2) how the MVN of every class has its own mean and covariance matrix
(whose eigenvectors are represented by the main axes of the ellipses).

Figure 24.2: Class boundaries of QDA applied to data projected onto their two largest principal
components. (a) Iris dataset. (b) Gene dataset (SRBCT).

The total number of free parameters to describe the distribution of X and Y is
k − 1 + k × (m + m(m+1)/2) = ((m+3)/2 m + 1) k − 1, to be compared with the (2m + 1) k − 1
parameters required by Naive Bayes. QDA’s model complexity is thus Θ(k m²), to be compared with
Naive Bayes’s complexity Θ(k m). QDA is thus prone to overfitting when the number m of dimensions
gets large compared to the number n of available samples.

24.2.2 Linear Discriminant Analysis (LDA)

The relatively large number of parameters required by QDA legitimates the introduction of a
slightly simpler and somewhat coarser model called Linear Discriminant Analysis, better known
under its acronym LDA. Compared to QDA, LDA assumes that all classes share the same covariance
matrix Σ (whereas QDA considers that every class has its own covariance matrix Σₖ).

    Linear Discriminant Analysis assumes a normal distribution of features X conditionally to
    the class Y, with the same covariance matrix for all classes:

        P_{X | Y=k,µₖ,Σ}(x) ∼ N(µₖ, Σ)

This reduces the number of free parameters to k − 1 + k m + m(m+1)/2. The model complexity is now
Θ(k m + m²) instead of Θ(k m²). A consequence is that the class boundaries are hyperplanes instead
of quadratic surfaces, legitimating the “linear” qualifier of the method.

Property 24.8. Decision boundaries of LDA are hyperplanes.

Proof. Points on a decision boundary separating two classes, let’s say Y = 1 and Y = 2, have
equal densities. By replacing the density by its expression, one has:

    P_{Y=1 | X=x,Θ} = P_{Y=2 | X=x,Θ}
    ⇔ P_{X | Y=1,Θ₁}(x) P_Y(1) = P_{X | Y=2,Θ₂}(x) P_Y(2)
    ⇔ (π₁/√det(Σ)) exp(−(1/2)(x−µ₁)ᵀΣ⁻¹(x−µ₁)) = (π₂/√det(Σ)) exp(−(1/2)(x−µ₂)ᵀΣ⁻¹(x−µ₂))
    ⇔ (x−µ₁)ᵀΣ⁻¹(x−µ₁) − (x−µ₂)ᵀΣ⁻¹(x−µ₂) = cst

This latter expression is the equation of a hyperplane orthogonal (in the sense of the Mahalanobis
distance relative to Σ) to the line passing through the points µ₁ and µ₂.
Figure 24.3 shows examples of 2D class boundaries using LDA, to be compared with those of
Fig. 24.2 obtained with QDA. Observe now 1) how the boundaries are straight lines, 2) how the
MVN of every class has its own mean but shares the same covariance matrix (whose eigenvectors
are represented by the main axes of the ellipses).

Figure 24.3: Class boundaries of LDA applied to data projected onto their two largest principal
components. (a) Iris dataset. (b) Gene dataset (SRBCT).

24.2.3 Diagonal LDA and Naive Bayes

LDA reduces the model complexity by sharing parameters between classes. Another strategy to reduce
the model complexity is to assume independence between the descriptive variables.
Adding this independence hypothesis to QDA makes the k covariance matrices Σₖ diagonal.
This “diagonal QDA” turns out to be equivalent to Naive Bayes (in the classical case where real
variables are assumed to be normally distributed), with k − 1 + 2 k m free parameters.
Adding the same independence hypothesis to LDA produces a method called Diagonal LDA,
with a unique diagonal covariance matrix Σ. The model complexity drops to Naive Bayes’s complexity
Θ(k m). It is indeed even sparser, i.e. less complex, than Naive Bayes since the exact number of
parameters is k − 1 + k m + m.

Figure 24.4: Class boundaries of Naive Bayes (NB) and Diagonal LDA (D-LDA) applied to data
projected onto their two largest principal components. (a) NB on Iris dataset. (b) NB on Gene
dataset (SRBCT). (c) D-LDA on Iris dataset. (d) D-LDA on Gene dataset (SRBCT).

24.2.4 Comparison summary

Choosing between QDA, LDA, Naive Bayes or Diagonal LDA mainly depends on the dimensionality of the
considered problem. QDA has many parameters so that it has a low bias but a larger variance. At the
other end, Diagonal LDA has a high bias but a lower variance. The results shown in table 24.1
illustrate the varying level of performance of the four methods when considering easy or difficult
datasets: the Iris dataset has easily separable classes with relatively many samples compared to
the number (4) of features, whereas the SRBCT dataset has only 83 samples for 2308 features.

Table 24.1: Accuracy and logarithmic loss of the four considered methods for an easy dataset (Iris)
and a difficult dataset (SRBCT). Best scores appear in bold characters.

                   Iris dataset          Gene dataset (SRBCT)
Method             Accuracy   LogLoss    Accuracy   LogLoss
QDA                0.97       0.06       0.21       22.8
LDA                0.97       0.26       0.34       6.5
Naive Bayes        0.94       0.15       0.95       1.4
Diagonal LDA       0.93       1.10       0.70       1.3
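
These trade-offs are easy to reproduce with off-the-shelf implementations; the sketch below uses
scikit-learn’s LDA, QDA and Gaussian Naive Bayes on the Iris dataset (Diagonal LDA has no direct
scikit-learn counterpart, and the cross-validated scores depend on the evaluation protocol, so they
will not match table 24.1 exactly).

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

for name, clf in [("QDA", QuadraticDiscriminantAnalysis()),
                  ("LDA", LinearDiscriminantAnalysis()),
                  ("Naive Bayes", GaussianNB())]:
    scores = cross_val_score(clf, X, y, cv=5)      # 5-fold cross-validated accuracy
    print(f"{name:12s} {scores.mean():.2f}")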

24.3 Linear regression models


The most quoted statistical estimation method is probably Ordinary Least Squares (OLS). This is understandable as the problem this method addresses is at the same time fundamental, occurs naturally in many applications, and admits a simple exact solution based only on elementary linear algebra. The present section reinvestigates this well-known problem in order to interpret it in the Bayesian framework.

24.3.1 Linear regression model


The fundamental assumption is that some target real variable Y follows a normal distribution with
constant variance σ 2 but whose mean µ (X) varies linearly with a set of real features X = (xj ) :

A normal linear regression model assumes a real output Y and m real input features
X = (xj )1≤j≤m such that:

P_{Y | X, σ², W} ∼ N(Wᵀ X, σ²)

The output variance σ 2 ∈ R+ and the coefficient vector W ∈ Rm are the model
parameters.

Note that this model is only discriminative, not generative: nothing is said about the distribution
of the input features X.

24.3.2 Ordinary Least Squares


Let Z = {z1, . . . , zn} = {(x1, y1), . . . , (xn, yn)} be a dataset of n data points, each composed of a vector x ∈ Rᵐ of input features and a scalar output y.

The Ordinary Least Squares (OLS) estimator Ŵ_OLS for W minimizes the empirical risk for the quadratic loss, also called mean square error (MSE):

R(W) = MSE(W) = (1/n) Σ_{i=1}^{n} (y_i − Wᵀ x_i)²

Its expression is given by:

Ŵ_OLS = (Xᵀ X)⁻¹ Xᵀ y

where the rectangular n × m matrix X, called the design matrix, is such that its ith line contains the ith input sample x_i. The matrix (Xᵀ X)⁻¹ Xᵀ is called the Moore-Penrose inverse, or pseudoinverse, of X. It is defined as soon as X has rank m, i.e. as soon as at least m samples x_i are linearly independent among the n available (which is in general the case as soon as n ≫ m).

Proof. The empirical risk is a convex function of W. It has a unique minimum that can be computed by differentiating the empirical risk R(W) = (1/n) Σ_{i=1}^{n} (y_i − Wᵀ x_i)² with respect to W. Setting this gradient to zero and solving the resulting equation leads to the expression of Ŵ_OLS.
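As a minimal sketch (synthetic data, made-up dimensions), the closed-form estimator can be computed directly with NumPy; numpy.linalg.lstsq is a numerically safer alternative to forming Xᵀ X explicitly:

import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 3
X = rng.normal(size=(n, m))                  # design matrix, one sample per line
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)    # noisy linear outputs

w_ols = np.linalg.solve(X.T @ X, X.T @ y)    # W_OLS = (X^T X)^{-1} X^T y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # same solution, better conditioning
print(w_ols, w_lstsq)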

Property 24.9. The MLE estimator of a normal linear regression model is identical to the OLS
estimator:
Ŵ_MLE = Ŵ_OLS

Proof.

L_Z(W, σ²) = ∏_{i=1}^{n} P(Y_i = y_i | X_i = x_i, W, σ²)
           = ∏_{i=1}^{n} (1/√(2πσ²)) exp( −(1/(2σ²)) (y_i − Wᵀ x_i)² )
           = (1/(√(2πσ²))ⁿ) exp( −(1/(2σ²)) Σ_{i=1}^{n} (y_i − Wᵀ x_i)² )
           = (1/(√(2πσ²))ⁿ) exp( −(n/(2σ²)) MSE(W) )

Since the likelihood is maximized when the mean square error is minimized, ŴM LE = ŴOLS .

24.3.3 Bayesian Linear Regression and Ridge regression


A possible Bayesian approach for learning a linear model with parameters (W, σ²) consists in setting σ² to a constant σ_y² and choosing a MVN N(µ0, Σ0) for W, in order to apply property 24.6. To this end, one first observes that the model likelihood after n observations Z = (X, y) can be written in compact form as a MVN with respect to y:

L_Z(W, σ²) = ∏_{i=1}^{n} N(y_i | x_iᵀ W, σ²)
           = N(y | X W, σ² I_n)

The posterior distribution can then be derived as a MVN with respect to W, simply by applying property 24.6 with A = X and b = 0.

P(W, σ² | (X, y), µ0, Σ0, σ_y²) ∝ L_Z(W, σ²) · P(W, σ² | µ0, Σ0, σ_y²)
                                ∝ N(y | X W, σ² I_n) · N(W | µ0, Σ0) · δ(σ² | σ_y²)
                                ∝ N(W | µ1, Σ1) · δ(σ² | σ_y²)

with  Σ1 = ( Σ0⁻¹ + (1/σ_y²) Xᵀ X )⁻¹
      µ1 = Σ1 ( Σ0⁻¹ µ0 + (1/σ_y²) Xᵀ y )

The MAP estimator for µ is thus:

µ̂_MAP = ( σ_y² Σ0⁻¹ + Xᵀ X )⁻¹ ( σ_y² Σ0⁻¹ µ0 + Xᵀ y )

to be compared with the MLE estimator:

µ̂_MLE = ( Xᵀ X )⁻¹ Xᵀ y

The term σ_y² Σ0⁻¹ µ0 appearing in the MAP estimator is the influence of the prior knowledge µ0 and of the confidence in it (represented by Σ0⁻¹). This additional term acts as a regularizer. The particular case µ0 = 0 and Σ0⁻¹ = (λ/σ_y²) I_m corresponds to the L2-regularized version of OLS, better known as Ridge regression:

µ̂_Ridge = ( Xᵀ X + λ I_m )⁻¹ Xᵀ y
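A minimal sketch of this closed form, assuming µ0 = 0 and Σ0⁻¹ = (λ/σ_y²) I_m (synthetic data; λ = 0 recovers OLS):

import numpy as np

def ridge(X, y, lam):
    # Closed-form Ridge/MAP estimator: (X^T X + lam I)^{-1} X^T y
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, 0.0, 0.0, -1.0, 2.0]) + 0.5 * rng.normal(size=50)
for lam in (0.0, 1.0, 10.0):
    print(lam, ridge(X, y, lam))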
Chapter 25

Models with Latent Variables

25.1 Latent Variables


So far the studied models were fully observable: samples always had values for all the variables
of the model, except in one case: in the wrestling association problem, one first introduced a
variable S telling whether the student likes sport (see Fig 23.2a). But because this variable was
not observable (i.e. one could not infer the value of S just by looking at the student) one removed
it from the model, to consider the new model of Fig. 23.2b.

A latent variable or alternatively a hidden variable is a random variable whose values


in the available samples are unknown. A model containing at least one latent variable
is said partially observable. If V = (V1 , . . . , Vp ) is the joint variable gathering all
visible variables and if H = (H1 , . . . , Hq ) gathers all hidden variables, a model with
latent variables of parameters Θ can generically be summarized by the graphical model
of Fig. 25.1.

Figure 25.1: Model with hidden variables

Given a model of parameters Θ with visible variables V = (V1 , . . . , Vp ) and latent variables
H = (H1 , . . . , Hq ), the first step is to infer the marginal likelihood P V1 ,...,Vp | Θ from observations
(v1 , . . . , vp ), before applying some Bayes’s estimator like for instance MLE (to take the simplest
one):

θ̂_MLE = argmax_θ P_{V1,...,Vp | Θ=θ}(v1, . . . , vp) = argmax_θ P(V1 = v1, . . . , Vp = vp, Θ = θ)

However since the model describes the full joint distribution P_{V1,...,Vp,H1,...,Hq,Θ}, one needs to compute:

P(V1 = v1, . . . , Vp = vp, Θ = θ) = Σ_{h1,...,hq} P(V1 = v1, . . . , Vp = vp, H1 = h1, . . . , Hq = hq, Θ = θ)


Given a dataset Z of N samples (v1^(i), . . . , vp^(i))_{1≤i≤N}, the marginal likelihood is:

L_Z(θ) = ∏_i ( Σ_{h1,...,hq} P(V1 = v1^(i), . . . , Vp = vp^(i), H1 = h1, . . . , Hq = hq, Θ = θ) )

where the joint distribution P(V1 = v1^(i), . . . , Vp = vp^(i), H1 = h1, . . . , Hq = hq, Θ = θ) of the model is itself decomposed into factors according to the underlying graphical model. Maximizing a product
of sums of products, with a high number of terms and factors is just an intractable task even if in
principle, it is always possible to find a local optimum θ? using optimization methods like gradient
ascent.

Example:
In the very simple wrestling example, the joint distribution (omitting parameters for concision) of
Bayesian network of Fig 23.2a is
P_{G,H,M,S,D,W} = P_G · P_{H|G} · P_{M|G,H} · P_{S|G,M} · P_{D|G,S} · P_{W|G,H,M,S}

Given a single sample (g, h, m, d, w), applying the MLE amounts to maximizing the marginalized joint distribution without the latent variable S:

P_{G,H,M,D,W}(g, h, m, d, w) = Σ_s P_G(g) × P_{H|G=g}(h) × P_{M|G=g,H=h}(m) × P_{S|G=g,M=m}(s) × P_{D|G=g,S=s}(d) × P_{W|G=g,H=h,M=m,S=s}(w)
                             = P_G(g) × P_{H|G=g}(h) × P_{M|G=g,H=h}(m) × Σ_s P_{S|G=g,M=m}(s) × P_{D|G=g,S=s}(d) × P_{W|G=g,H=h,M=m,S=s}(w)

For a dataset Z of N samples, the marginal loglikelihood can be decomposed into a sum of four terms:

log L_Z(θ) = Σ_{i=1}^{N} log P_G(g^(i)) + Σ_{i=1}^{N} log P_{H|G=g^(i)}(h^(i)) + Σ_{i=1}^{N} log P_{M|G=g^(i),H=h^(i)}(m^(i))
           + Σ_{i=1}^{N} log ( Σ_s P_{S|G=g^(i),M=m^(i)}(s) × P_{D|G=g^(i),S=s}(d^(i)) × P_{W|G=g^(i),H=h^(i),M=m^(i),S=s}(w^(i)) )

Because the parameters of the conditional distribution tables (CDTs) of G, H and M are isolated
in separate terms, the inference of their values is an easy task: it suffices to count occurrences of
relevant events in the data set. However learning the remaining CDTs of S, D and W is much
harder since their parameters are all interdependent within the summation over s.

25.2 EM Algorithm
Given some model of likelihood P V,H | Θ with parameters Θ, visible variables V = (V1 , . . . , Vp )
and latent variables H = (H1 , . . . , Hq ), and given a dataset Z = {v (i) }1≤i≤N of i.i.d. samples, the
problem is to find the global maximum θ? of marginal likelihood P V | Θ where hidden variables H
have been marginalized out:
θ⋆ = argmax_θ log L_Z(θ)
   = argmax_θ ∏_i P_{V|Θ=θ}(v^(i))
   = argmax_θ ∏_i ( Σ_h P_{V,H|Θ=θ}(v^(i), h) )

This optimization problem is difficult as model parameters θ and latent variable distributions si-
multaneously occur in a product of sums (see section 25.1). Expectation maximization, abbreviated
EM, is a generic heuristic algorithm that solves this problem. Unfortunately, like any standard
numerical optimisation method, EM is heuristic: it only finds a local maximum, i.e. there is no
guarantee that the found local optimum is also a global one. But compared to standard numerical
optimisation methods, EM is simpler and faster.
Even if EM is heuristic, it results from a theoretical construction: indeed EM relies on a lower bound of the marginal log likelihood log L_Z(θ). Let's assume one knows, for each sample V^(i), a distribution q_i that approximates somehow the true distribution P_{H^(i) | V^(i)=v^(i), θ} of the hidden variable H^(i). Using some results from information theory (i.e. using the fact that the Kullback-Leibler divergence between q_i and P_{H^(i) | V^(i)=v^(i), θ} is always non-negative), every distribution q_i verifies:

log L_Z(θ) ≥ − Σ_i ⟨log q_i(H)⟩_{H∼q_i} + Σ_i ⟨ log P(H, V = v^(i) | Θ = θ) ⟩_{H∼q_i}

The right-hand side of the equation can be viewed as a lower bound log L̃_{Z,q1,...,qN}(θ) of the marginal log likelihood log L_Z(θ), made of two terms:

• The energy E(q1, . . . , qN, θ) = Σ_i ⟨ log P(H, V = v^(i) | Θ = θ) ⟩_{H∼q_i} is the full log likelihood averaged over the distributions of the hidden variables H^(i). This term is always non-positive.

• The entropy H(q1, . . . , qN) = − Σ_i ⟨log q_i(H)⟩_{H∼q_i} quantifies the uncertainty on H^(i). This corrective term is always non-negative and does not depend on the parameters θ.

Instead of directly finding a local maximum for the marginal likelihood LZ (θ), EM seeks a local
maximum of the lower bound L̃Z,(qi )i (θ). This choice is justified by two properties: first, as one
will see shortly, finding a local maximum of the lower bound is much easier (at least, not more
difficult than problems based on models without hidden variables). Second a local maximum of
L̃Z,(qi )i (θ) is also a local maximum of the target function LZ (θ).
EM uses an iterative approach to estimate the distribution of both latent variables and the
values of model parameters. Starting from some guess θ? of parameters, it goes alternatively into
an expectation step and a maximization step until parameters converge:

• The Expectation step (E step of EM) keeps θ⋆ constant and optimizes log L̃_{Z,(q_i)_i}(θ⋆) as a function of the distributions (q_i)_i only. Solving this problem is straightforward: if one sets every q_i to the distribution P_{H^(i) | V^(i)=v^(i), θ=θ⋆} in equation (25.12), one finds that L̃_{Z,(q_i)_i}(θ⋆) is equal to the marginal likelihood L_Z(θ⋆), which is also an upper bound according to equation (25.12). Because L_Z(θ⋆) is a constant relatively to (q_i)_i, it is also the maximum that can be reached by variations of (q1, . . . , qN). Therefore the solution of the E step always consists in setting the distribution of every hidden variable H^(i) to its expected distribution given the current parameter value θ⋆ and the visible values v^(i):

∀i, q_i = P_{H^(i) | V^(i)=v^(i), θ=θ⋆}

• The Maximization step (M step of EM) is the dual of the estimation step: it keeps distri-
butions (qi )i constant and optimizes L̃Z,(qi )i (θ) as a function of parameters θ only. Because
the entropy does not depend on the parameters θ, it is sufficient to maximize the energy:

θ⋆ = argmax_θ E(q1, . . . , qN, θ)

This problem is very similar to a standard MLE problem for a model without latent variables
and can be solved using standard numerical optimization methods. The only difference is
that the energy function is more complex than a simple likelihood function as the likelihood
is averaged over distributions qi . Some concrete example will be given in the next sections.

Algorithm 22 EM algorithm.
1: // Initialization step
2: Set current parameter models θ? to some initial guess θ0
3: repeat
4: // E-step
5: for every sample v (i) do
6: qi ← P H | V =v(i) ,Θ=θ?
7: end for
8: // M-step
9: θ⋆_old ← θ⋆
10: θ⋆ ← argmax_θ Σ_i ⟨ log P(H, V = v^(i) | Θ = θ) ⟩_{H∼q_i}
11: until |θ ? − θ ? old | < ε
12: return θ ?

The E and M steps alternate until the parameters converge. EM is summarized by algorithm 22.
The guarantee offered by EM might seem weak, as it does not maximize the marginal likelihood L_Z(θ) but only a lower bound of it. However it can be shown that iterations of EM never decrease the marginal likelihood (i.e. L_Z(θ⋆_new) ≥ L_Z(θ⋆_old)). In other words, EM has the following fundamental property:

EM always returns a local optimum θ? of the marginal likelihood LZ (θ).

25.3 Bayesian Clustering


So far, the Bayesian approach has only addressed supervised classification problems. However the notions of latent variable and mixture models make it possible to address unsupervised classification problems, also called clustering problems.

25.3.1 Mixture models


A mixture model consists in “mixing” C distributions P_{V|θc} defined on the same set V = (V1, . . . , Vp) of visible variables. Every distribution represents a cluster parametrized by θc. The mixing is done by introducing an additional latent variable H whose value c, ranging from 1 to C, can be interpreted as the identifier of a randomly selected cluster:

PV | H=c = PV | θc

H follows a categorical distribution characterized by C probabilities θH = [p1 , . . . , pC ] where pc


denotes P(H = c). The marginal distribution of V under the model is thus:

P_V(v) = Σ_{c=1}^{C} P_{V|θc}(v) · P_{H|θH}(c)
       = Σ_{c=1}^{C} P_{V|θc}(v) · p_c

This marginal distribution can be viewed as a weighted average of the distributions P_{V|θc}. This model is
called a mixture as it consists in mixing samples, first by drawing a value h for H between 1 and
C according to weights of θH , and then by drawing a value for V according to P V | θh . As shown
on the Bayesian network of figure 25.2, the parameters of a mixture model are thus the collection
of vectors θ V | H = (θ1 , . . . , θC ) and the vector θH = [p1 , . . . , pC ].
Such a model can be solved (approximately) using the generic EM algorithm. Of course the
full solution depends on the exact nature of cluster distributions P V | θc . For instance the next

Figure 25.2: A mixture model

section 25.3.2 will develop an example where cluster distributions are normal. However it is already
possible to infer the mixture coefficients in θH without specifying the cluster distributions. To see
this let’s consider some dataset Z = (v (1) , . . . , v (N ) ).

Estimation step
As seen in the previous section, the estimation step is generic: it consists in updating the currently
estimated distribution qi of H (i) :
∀c ∈ {1, . . . , C}, q_i(c) = P_{H^(i) | V^(i)=v^(i), θ}(c)
                           ∝ P_{V^(i) | H^(i)=c, θ}(v^(i)) × P_{H^(i) | θ}(c)
                           ∝ P_{V^(i) | θc}(v^(i)) × p_c

Maximization step
The M-step maximizes the energy relative to the model parameters ΘH ∪ {Θ1, . . . , ΘC}. In the case of a mixture model, the energy is:

E(ΘH, Θ1, . . . , ΘC) = Σ_{i=1}^{N} ⟨ log P(V = v^(i), H | θ) ⟩_{H∼q_i}
                      = Σ_{i=1}^{N} ⟨ log P(V = v^(i) | H, θ_{V|H}) ⟩_{H∼q_i} + Σ_{i=1}^{N} ⟨ log P(H | θH) ⟩_{H∼q_i}
                      = Σ_{i=1}^{N} ( Σ_{c=1}^{C} log(P_{V|θc}(v^(i))) q_i(c) ) + Σ_{i=1}^{N} ( Σ_{c=1}^{C} log(p_c) q_i(c) )

Maximizing the energy relative to the mixture coefficients (i.e. the components of θH) only requires taking into account the second term. Moreover this is a constrained optimization problem since the variables p1 to pC must verify the constraint Σ_c p_c = 1. One thus derives the following Lagrangian:

L = Σ_{i=1}^{N} Σ_{c=1}^{C} log(p_c) q_i(c) − λ ( Σ_c p_c − 1 )
  = Σ_{c=1}^{C} log(p_c) ( Σ_{i=1}^{N} q_i(c) ) − λ ( Σ_c p_c − 1 )

So that θH for a local maximum of the energy verifies the system:

∀c ∈ {1 . . . C},  ( Σ_{i=1}^{N} q_i(c) ) / p_c − λ = 0
Σ_c p_c = 1
⇒  ∀c ∈ {1 . . . C},  p_c = ( Σ_{i=1}^{N} q_i(c) ) / λ,   with λ = N

Finally the result is very intuitive since the mixture coefficient of a cluster is just the average probability for samples to be generated by it:

p_c = (1/N) Σ_{i=1}^{N} q_i(c)

Little can be said about the maximization relative to the cluster parameters θ1, . . . , θC since it depends on the nature of the cluster distributions. However it is interesting to see that only the first term of the energy depends on θ1, . . . , θC, and that all these parameters can be solved independently of each other since:

∂E/∂θc = 0  ⇔  ∂/∂θc ( Σ_{i=1}^{N} Σ_{c'=1}^{C} log(P_{V|θ_{c'}}(v^(i))) q_i(c') ) = 0
            ⇔  ∂/∂θc ( Σ_{i=1}^{N} q_i(c) log(P_{V|θc}(v^(i))) ) = 0
            ⇔  Σ_{i=1}^{N} q_i(c) ∂/∂θc ( log(P_{V|θc}(v^(i))) ) = 0

All these observations are summarized by algorithm 23.

Algorithm 23 EM algorithm for mixture models.


1: // Initialization step
2: for c = 1 . . . C do
3: pc ← C1
4: Initialize parameters θ? c of cluster c with some guess.
5: end for
6: repeat
7: // E-step
8: for every sample v (i) do
9: S←0
10: for c = 1 . . . C do 
11: qi (c) ← P V (i) | θ? c v (i) × pc
12: S ← S + qi (c)
13: end for
14: // Normalize qi
15: for c = 1 . . . C do
16: qi (c) ← qi (c)/S
17: end for
18: end for
19: // M-step
20: |θ? |1 ← 0
21: for c = 1 . . . C do
22:   p_c ← (1/N) Σ_{i=1}^{N} q_i(c)
23:   θ⋆_old ← θ⋆_c
24:   θ⋆_c ← solve θ_c in Σ_{i=1}^{N} q_i(c) ∂/∂θ_c log P_{V|θc}(v^(i)) = 0
25:   |θ⋆|_1 ← |θ⋆|_1 + |θ⋆_c − θ⋆_old|_1
26: end for
27: until |θ ? |1 < ε
28: return (p1 , θ ? 1 , . . . , pC , θ ? C )

25.3.2 Gaussian Mixture Models


Let’s now develop a complete example of a well-known mixture model called the Gaussian Mixture Model or GMM. GMMs deal with continuous problems where the visible variables are all continuous: V⃗ = (V1, . . . , Vp) ∈ Rᵖ. The fundamental hypothesis of GMMs is that every cluster follows a
multivariate normal distribution:

A Gaussian Mixture Model or GMM is defined by a mixture of C normal distributions N(V⃗_c, Γ_c). The parameters of a GMM are:

• The mixture coefficients θH = [p1, . . . , pC].

• For each cluster c from 1 to C, the cluster parameters θc are the expected vector V⃗_c and the covariance matrix Γ_c of the normal distribution.

Pseudocode 23 already describes the general resolution of a mixture model when applying the
EM algorithm. The only remaining task to specify is how to find the best parameters θ? c during
the maximization step. Recalling equation (25.31), one has:

∂E/∂V⃗_c = Σ_{i=1}^{N} q_i(c) ∂/∂V⃗_c ( log P_{V|θc}(v^(i)) )
         = Σ_{i=1}^{N} q_i(c) ∂/∂V⃗_c ( log ( (1/√(2π det(Γ_c))) exp( −½ (v^(i) − V⃗_c)ᵀ Γ_c⁻¹ (v^(i) − V⃗_c) ) ) )
         = Σ_{i=1}^{N} q_i(c) ∂/∂V⃗_c ( −½ ( log(2π det(Γ_c)) + (v^(i) − V⃗_c)ᵀ Γ_c⁻¹ (v^(i) − V⃗_c) ) )
         = Σ_{i=1}^{N} q_i(c) Γ_c⁻¹ (v^(i) − V⃗_c)
         = Γ_c⁻¹ ( Σ_{i=1}^{N} q_i(c) (v^(i) − V⃗_c) )

Therefore, only one expected value V⃗_c for cluster c maximizes the energy, and it can be computed as a barycenter of the samples v^(i) weighted by the mixture coefficients q_i(c) of cluster c:

V⃗_c = ( Σ_{i=1}^{N} q_i(c) v^(i) ) / ( Σ_{i=1}^{N} q_i(c) )

A similar calculus allows to determine the covariance matrix:

Γ_c = ( Σ_{i=1}^{N} q_i(c) (v^(i) − V⃗_c)(v^(i) − V⃗_c)ᵀ ) / ( Σ_{i=1}^{N} q_i(c) )

One finally gets a complete runnable solution summarized by algorithm 24.


GMM can be interpreted as a generalization of K-means. Indeed, if one constrains every cluster of the GMM to have a constant isotropic covariance matrix Γ_c = σ² I and lets σ tend to zero, algorithm 24 becomes equivalent to the standard algorithm for K-means: the expectation step becomes the assignment step of samples to the closest cluster (since σ → 0) and the maximization step becomes the update step, where the update of V⃗_c corresponds to the update of the centroid of cluster c.

Algorithm 24 EM algorithm for GMM.


1: // Initialization step
2: for c = 1 . . . C do
3: pc ← C1
4: ~c and Γc of cluster c with some guess.
Initialize parameters V
5: end for
6: repeat
7: // E-step
8: for every sample v (i) do
9: S←0
10: for c = 1 . . . C do
11:   ΔV⃗ ← v^(i) − V⃗_c
12:   q_i(c) ← det(Γ_c)^{−1/2} exp( −½ ΔV⃗ᵀ Γ_c⁻¹ ΔV⃗ ) × p_c
c
13: S ← S + qi (c)
14: end for
15: // Normalize qi
16: for c = 1 . . . C do
17: qi (c) ← qi (c)/S
18: end for
19: end for
20: // M-step
21: |θ? |1 ← 0
22: for c = 1 . . . C do
23:   S_q ← Σ_{i=1}^{N} q_i(c)
24:   p_c ← S_q / N
25:   V⃗_old ← V⃗_c ; Γ_old ← Γ_c
26:   V⃗_c ← ( Σ_{i=1}^{N} q_i(c) v^(i) ) / S_q
27:   Γ_c ← ( Σ_{i=1}^{N} q_i(c) (v^(i) − V⃗_c)(v^(i) − V⃗_c)ᵀ ) / S_q
28:   |θ⋆|_1 ← |θ⋆|_1 + |V⃗_c − V⃗_old|_1 + |Γ_c − Γ_old|_1
29: end for
30: until |θ⋆|_1 < ε
31: return (p1, V⃗_1, Γ_1, . . . , pC, V⃗_C, Γ_C)
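A compact NumPy sketch of algorithm 24, assuming full covariance matrices, a fixed number of iterations instead of the convergence test, and a small regularization term added to the covariances for numerical stability:

import numpy as np

def gmm_em(V, C, n_iter=100, seed=0):
    # EM for a Gaussian Mixture Model. V: (N, p) samples, C: number of clusters.
    rng = np.random.default_rng(seed)
    N, p = V.shape
    pi = np.full(C, 1.0 / C)                          # mixture coefficients p_c
    mu = V[rng.choice(N, C, replace=False)].copy()    # cluster means, initial guess
    Sigma = np.stack([np.cov(V.T) + 1e-6 * np.eye(p)] * C)
    q = np.zeros((N, C))
    for _ in range(n_iter):
        # E-step: q_i(c) proportional to N(v_i | mu_c, Sigma_c) * p_c
        for c in range(C):
            diff = V - mu[c]
            inv = np.linalg.inv(Sigma[c])
            expo = -0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff)
            norm = np.sqrt((2 * np.pi) ** p * np.linalg.det(Sigma[c]))
            q[:, c] = pi[c] * np.exp(expo) / norm
        q /= q.sum(axis=1, keepdims=True)
        # M-step: update p_c, mu_c (barycenter) and Sigma_c (weighted covariance)
        Sq = q.sum(axis=0)
        pi = Sq / N
        mu = (q.T @ V) / Sq[:, None]
        for c in range(C):
            diff = V - mu[c]
            Sigma[c] = (q[:, c, None] * diff).T @ diff / Sq[c] + 1e-6 * np.eye(p)
    return pi, mu, Sigma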
Chapter 26

Markov Models

26.1 Introduction
Dynamic system state and stochastic process
Many practical problems consist in determining the dynamic state of some real system. More
precisely the goal is to track the distribution P Xt | O of the system state Xt at time t given
some observations O carrying some information about the state. Depending on the nature of the
problem, the tracking has to be made in real-time with online observations or it can be processed
offline, in batch mode. Bayesian filtering and Bayesian smoothing respectively address the first
and second class of problems:
• Bayesian filtering estimates the distribution P Xt | O online, i.e. given only past and present
observations O relatively to current time t. Estimating in real time the position and speed
of a vehicle is a Bayesian filtering problem.
• Bayesian smoothing estimates the distribution P Xt | O offline, i.e. given past, present and
posterior observations O relatively to time t. Recognizing words in a recorded sound signal
is a Bayesian smoothing problem. Obviously Bayesian smoothing brings better results than
Bayesian filtering as more observations are taken into account.
In both cases, the state of the system must be modelled by some distribution since the state is
never perfectly known:
• First, the dynamic behaviour of the system is approximately known and can be influenced by some unobserved forces (latent variables).
• Second, the observations can be noisy and/or partially informative (i.e. the system state
cannot be fully reconstructed from the observations).
Because dynamic systems require to maintain a full distribution of the current state, methods
that deal with dynamic systems are fundamentally Bayesian. Maintaining a distribution on the
system state not only tells us the most probable value for the current state, but it also indicates how much confidence one should place in this estimate: a spread distribution (i.e. with a large
entropy) will mean a large uncertainty on the state value, whereas a concentrated distribution
(i.e. with low entropy) will mean a precise knowledge on the current state value. Because the
system state is dynamic and evolves with time, the uncertainty does not necessarily decrease with
time. Indeed when some new observation is received, the uncertainty on the system state usually
decreases, but when no observation has been received for a long time, the uncertainty on the system
state increases. Positioning systems are a good application example: a GPS receiver provides the
current position of a vehicle with a good level of precision. However if the GPS signal drops,
the uncertainty of the position will increase as the vehicle is moving. In some other problems,
the dynamic behaviour of a system or signal does not increase uncertainty, but on the contrary,
helps to remove noise. Indeed the high level of dependency between successive states of a system
brings some additional information that can be exploited to remove noise in order to improve some


classification problems. A typical example is speech recognition where speech time slices are highly
correlated.
Dynamic systems take many forms. However all of them have one thing in common: dynamic
systems imply to model a sequence of random state variables (Xt ) indexed by some time variable t
such that a sample builds a trajectory (xt ). This is formalized by the notion of stochastic process.

A stochastic process is a sequence of random variables (Xt )t∈T for some probability
space (Ω, E, P ) where:

• T is the time space and is a subset of R.


• All random variables take their value in one set S called state space.
• For all finite subset T 0 ⊂ T , the joint variable (Xt )t∈T 0 is measurable, of
distribution P(Xt )t∈T 0 .

A stochastic process can have discrete or continuous time space, and discrete, con-
tinuous or hybrid state space.

In the following, one will focus on discrete-time stochastic processes for two reasons: first their
discrete nature naturally matches computational models. Second a continuous-time process can
generally be approximated by a discrete-time process. In order to simplify the notation, the time
space will be assumed to be the set N as any indexing system can always be remapped to N.

Markov Models
For every stochastic process (Xt )t∈N and for every N ∈ N, the joint distribution of (X0 , . . . , XN )
can always be written as:
PX0 ,...XN = PX0 × P X1 | X0 × · · · × P XN | X0 ,...,XN −1

This is true for any joint distribution and this can be represented by Bayesian Network of Fig. 26.1.

X0 X1 X2 X3 X4 ...

Figure 26.1: Stochastic process

As already stated, modelling the full joint distribution when the number of variables gets large
is not tractable. Fortunately in many problems, if information contained in every variable Xt
is sufficient (i.e. if the state space is rich enough), the prediction of the next state Xt+1 only
depends on the current state Xt , not on the past states Xt0 with t0 < t. Such memoryless property
characterizes Markov models:

A Markov model (of order 1) is a stochastic process (Xt )t∈N that verifies the Markov
property: given any current state Xt at any time t, the knowledge of any past state
Xt0 with t0 < t does not help to better predict any future state Xt0 with t0 > t:

∀t ≥ 0, P Xt | X0 ,X1 ,...,Xt−1 = P Xt | Xt−1

More generally a Markov model of order k verifies:

∀t ≥ 0, P Xt | X0 ,X1 ,...,Xt−1 = P Xt | Xt−k ,...,Xt−1



The joint distribution is then considerably simplified as shown by Bayesian networks of figures 26.2a
and 26.3a.

X0 X1 X2 X3 X4 ...

(a) A Markov model of order 1

Figure 26.2: Markov models

X0 X1 X2 X3 X4 ...

(a) A Markov model of order 2

Stationarity
In addition to satisfying the Markov property, the studied models will usually be assumed to be
stationary, that is, not to evolve with time. Their parameters Θ are thus assumed to be constant.

A stochastic process (Xt )t∈N is stationary if and only if:

∀t, P Xt+1 | Xt = P X1 | X0

A process is stationary of order k if and only if:

∀t, P Xt+k | Xt ,...Xt+k−1 = P Xk | X0 ,...,Xk−1

Given a stationary Markov model, the joint distribution of (Xt )t∈N only depends on the distri-
bution PX0 of the initial state X0 and on the distribution of state transitions P X1 | X0 .
P_{X0,...,XN}(x0, . . . , xN) = P_{X0}(x0) × ∏_{i=1}^{N} P_{X1|X0=x_{i−1}}(x_i)

Learning a stationary Markov model thus consists in inferring these two distributions P_{X1|X0,O} and P_{X0|O} from observations.

Observability
A last important notion is the nature of available observations. If the state of a stochastic process
(Xt )t∈N is (fully) observable the observations directly give the state values. In this case the MLE
principle allows to learn the two distributions P X1 | X0 ,O and P X0 | O by counting the events
in the observation (for discrete variables) or by inferring distribution parameters (for continuous
variables).
If the stochastic process is only partially observable – as it is often the case in practice – the
observation Yt at time t is only weakly connected to the past and current states (Xt0 )0≤t0 ≤t . The
problem is more complex to solve as the state variables Xt are latent variables and their value must
be estimated using approximated algorithms like EM. A Markov model that is partially observable
is called a Hidden Markov Model and can be represented by Bayesian network of Figure 26.4. Since
for some applications, observations Yt can randomly occur in time, they are also called emissions.
So far one assumed the time space is discrete1 but what about the state space? Some systems
have a finite number of possible states (i.e. state of an automaton), some other have a continuous
1 This choice is mostly for simplicity and is not really a restriction. Most discrete-time models that are going to

be presented have some time-continuous counterparts.



X0 X1 X2 X3 X4 ...

Y0 Y1 Y2 Y3 Y4

Figure 26.4: Hidden Markov Model

state (i.e. position and speed of a vehicle). Both cases are studied respectively in this order. Let’s
first consider a dynamic system that can be described by a stationary Markov model whose state
variable Xt can only take a finite number of states numbered from 1 to n. The two next sections
respectively consider the case of fully observable and partially observable models.

26.2 Markov Chains


26.2.1 Definition
A fully observable discrete-state Markov model is called a Markov chain:

A Markov Chain (MC) is a discrete-state discrete-time Markov model. A Markov


Chain is specified by:
• A distribution for the initial state, specified by an n-probability vector^a P_{X0}:

  P_{X0} = ( P(X0 = 1), . . . , P(X0 = n) )ᵀ

• n × n transition matrices {Pt |t ∈ N} indexed by time t. Every transition


matrix is stochasticb so that its ith line represents the distribution P Xt+1 | Xt =i
of arrival state at time t + 1 when leaving state i at time t:
 
  P_t = ( P(X_{t+1} = 1 | X_t = 1)   · · ·   P(X_{t+1} = n | X_t = 1)
          ⋮                                   ⋮
          P(X_{t+1} = 1 | X_t = n)   · · ·   P(X_{t+1} = n | X_t = n) )

A stationary Markov chain is said homogeneous. In that case transition ma-


trices are all equal to a unique matrix P = P0 . Because transition matrices
are often sparse (with many null coefficients), it is possible to represent them
by an automaton graph, where n vertices represent the n states and where an
arc from vertex i to vertex j represents a transition from state i to j such that
P ( Xt+1 = j | Xt = i ) > 0.
a A probability vector represents a discrete distribution. It is a vector of non negative coefficients

that sum up to one.


b A stochastic matrix is a matrix whose line vectors are probability vectors.

A homogeneous Markov chain can be represented by an oriented weighted graph whose vertices represent the state values and whose arcs x → x′ represent a transition from state x to state x′ with non-zero probability. Every arc x → x′ is weighted by the probability P(X_{t+1} = x′ | X_t = x). Such a graph is as informative as a transition matrix, as illustrated by figure 26.5, so that both formalisms are equivalent.

(Figure: automaton graph over the states s1, . . . , s6, together with the corresponding transition matrix)

      ( 0.5  0.5  0    0    0    0   )
      ( 0    0    0.2  0.8  0    0   )
  P = ( 0    0    0    0    1    0   )
      ( 0    0    1    0    0    0   )
      ( 0    0    0    0.9  0    0.1 )
      ( 0    0    0    0    0    1   )

(a) graph    (b) transition matrix

Figure 26.5: Equivalent representation of a Markov chain: graph (a) and transition matrix (b)

26.2.2 Properties
Given a Markov chain of parameters PX0 and {Pt |0 ≤ t < T }, the distribution of Xt is given by
P_{X_t} = ( ∏_{t'=0}^{t−1} P_{t'}ᵀ ) × P_{X_0}

This recurrence equation is a consequence of P_{X_{t+1}} = P_tᵀ × P_{X_t}:

∀j, P_{X_{t+1}}(j) = P(X_{t+1} = j)
                   = Σ_i P( (X_{t+1} = j) ∩ (X_t = i) )
                   = Σ_i P(X_{t+1} = j | X_t = i) × P(X_t = i)
                   = Σ_i P_t^{(i,j)} × P_{X_t}(i)
                   = ( P_tᵀ × P_{X_t} )^{(j)}

If the chain is homogeneous, the latter equation is a simple matrix exponentiation:

P_{X_t} = (Pᵀ)^t × P_{X_0}
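A small NumPy sketch of these two computations, using a made-up homogeneous chain for the illustration:

import numpy as np

P = np.array([[0.9, 0.1, 0.0],        # transition matrix of a homogeneous chain
              [0.0, 0.5, 0.5],        # (each line is a probability vector)
              [0.2, 0.0, 0.8]])
p0 = np.array([1.0, 0.0, 0.0])        # initial state distribution P_X0

p = p0.copy()
for _ in range(50):                   # P_Xt = (P^T)^t P_X0, computed iteratively
    p = P.T @ p
print(p)

# A stationary distribution is an eigenvector of P^T for the eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
stat = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
print(stat / stat.sum())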

Sampling a Markov Chain consists in generating a trajectory of state (x0 , x1 , . . . ) such that x0
is drawn from PX0 and every transition xt → xt+1 is drawn according to Pt . Markov Chains can
thus be viewed as a model of random walk on a finite state space. One major question raised by
many applications is to know if the distribution of Xt converges to some limit P∞ when t → +∞. A
second important question is to know if this limit P∞ is unique and does not depend on the initial
state distribution P_{X_0}. In such a case P∞ is called the equilibrium distribution. If one considers the homogeneous case, clearly the probability vector P∞ must be a fixed point of the map p ↦ Pᵀ p, i.e. an eigenvector of Pᵀ for the eigenvalue 1:

P∞ = Pᵀ × P∞

Such a distribution P∞ is said to be stationary². Every Markov chain has at least one stationary distri-
bution as stated by the following theorem:
Theorem 26.1. The largest absolute eigenvalue of a stochastic matrix is 1. The eigenvectors for
the eigenvalue 1 have coefficients of the same sign. As a consequence every Markov chain admits
at least one stationary distribution.
However this last result is not a sufficient condition for a Markov chain to converge towards an
equilibrium distribution. To do so, one needs to introduce the notions of reducibility and periodic-
ity: Given two distinct state values s1 and s2 , s2 is said accessible from s1 if the probability to reach
² Not to be confused with the stationarity of Markov processes

state s2 in a finite number of steps by starting from state s1 is non zero, or said differently, if there
is at least one (oriented) path connecting s1 to s2 in the Markov chain’s transition graph. s1 and s2
are said communicating if s1 is accessible from s2 and conversely. Communication defines an equiv-
alence relation whose equivalence classes are called communication classes. A Markov chain is said
irreducible if every pair of vertices communicate (i.e there is only one communication class), or said
equivalently, if the transition graph is strongly connected3 . For instance, given the Markov chain
of 26.5, state s6 is accessible from any other state whereas s1 is not accessible from any other state.
States s3 , s4 and s5 are communicating. Communication classes are {{s1 }, {s2 }, {s3 , s4 , s5 }, {s6 }}.
The graph is thus reducible.
Periodicity is another important characteristic of Markov chain trajectories: the period p(s) of
a state s is the greatest common divisor of the possible times to return to state s:

p(s) = gcd { t | P_{X_t=s | X_0=s} > 0 } = gcd { t | (P^t)_{s,s} > 0 }

A Markov chain is aperiodic if the period of every state is equal to 1. In the Markov chain of Fig. 26.5, the period of s1 and s6 is 1 whereas the period of s3, s4 and s5 is 3. s2 is not periodic (the chain never returns to it).
The following theorem concludes on the initial question about Markov chain convergence:

Theorem 26.2. An irreducible and aperiodic Markov chain has an equilibrium distribution: it converges to a unique stationary distribution, regardless of the initial state distribution P_{X_0}. In particular this is the case for Markov chains whose transition matrix coefficients are all positive, that is, whose graph is complete⁴.

26.2.3 Learning
Learning a Markov Chain on an interval of time [0, . . . , T − 1] consists in inferring the first T transition matrices {P_t | 0 ≤ t < T} and the initial state distribution P_{X_0} from a dataset Z = {z_k} where every data point z_k is a sequence of T successive states (x_0^k, . . . , x_{T−1}^k). Because states are observable, learning a Markov Chain using the MLE estimator is straightforward: determining the parameter P(X_{t+1} = j | X_t = i) is done by counting:

N (xt+1 = j and xt = i)
P̂ ( Xt+1 = j | Xt = i ) =
N (xt = i)

Every parameter is learnt from N samples on average, if N is the number of data. However it is likely that some transitions will occur rarely. In order to have a good confidence in the estimation of the n² · T + n model parameters, N must be very large.
Things get nicer if the Markov chain is homogeneous. In this case, the number of parameters to learn is only n² + n. Every transition probability can be learnt from N × (T − 1) samples instead of only N, so that the dataset can be much smaller.
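A sketch of this counting-based MLE for the homogeneous case, with states numbered from 0 to n − 1 for convenience:

import numpy as np

def learn_markov_chain(trajectories, n):
    # MLE of (P_X0, P) for a homogeneous Markov chain from fully observed trajectories.
    p0 = np.zeros(n)
    counts = np.zeros((n, n))
    for traj in trajectories:
        p0[traj[0]] += 1
        for i, j in zip(traj[:-1], traj[1:]):
            counts[i, j] += 1                 # count transitions i -> j
    p0 /= p0.sum()
    P = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)  # avoid division by zero
    return p0, P

p0, P = learn_markov_chain([[0, 1, 1, 2, 0], [1, 2, 2, 0, 1]], n=3)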

26.2.4 Application
The notion of equilibrium distribution of Markov chains has been extensively used in many applications. It is the theoretical foundation of Markov Chain Monte Carlo algorithms (MCMC) like the Metropolis-Hastings algorithm introduced in chapter 28. It is also used to compute various importance indexes, the most popular of which is probably Google's PageRank.
PageRank’s index measures the visibility of a website on the Web but similar indexes exist
for measuring the notoriety of people in social networks, the importance of scientific journals
or researchers in their community, etc. The intuition is that the more a website is visible, the
more likely a random websurfer will spend time on this website. This problem can be formalized
as sampling a Markov chain. Let’s index webpages from 1 to the number n of web-pages. The
3 A directed graph is strongly connected if for every couple (s , s ) of vertices, there is at least one (oriented)
1 2
path connecting s1 to s2 .
4 A directed graph is complete if for every couple (s , s ) of vertices, there is at least one arc from s to s .
1 2 1 2

websurfer starts from a random webpage i and then randomly selects an outgoing link on this page, jumping this way to a new page. He does this a large number of times and records for each page the number N[i] of times he visited it. When the websurfer reaches a page without outgoing links, he jumps randomly to a new webpage. The resulting histogram N[i] defines a distribution whose
coefficients are the PageRank indexes. In other words, the PageRank indexes are the equilibrium
distribution of a random walk over a finite state space, i.e. over a homogeneous Markov chain.
This is summarized by algorithm 25. If J(i) denotes the set of webpages accessible from page i,

Algorithm 25 Naive PageRank algorithm


1: Input: number T of iterations
2: for i ← 1 to n do
3: N [i] ← 0
4: end for
5: i ← page drawn from U1,...,n
6: for t ← 1 to T do
7: N [i] ← N [i] + 1
8: Collect output links J pointed from i
9: if J = ∅ then
10: i ← page drawn from U1,...,n
11: else
12: i ← page drawn from UJ
13: end if
14: end for
15: return Probability vector {N[i]/T}

the resulting transition matrix of this Markov chain is defined by:

if J(i) ≠ ∅,   P_{i,j} = 1_{j∈J(i)} / |J(i)|
otherwise      P_{i,j} = 1/n
However the Markov chain is in general neither irreducible nor aperiodic, so that it does not necessarily converge.
One solution is to modify the transition matrix so that every transition from one page to another is possible with a non-zero probability. To do so, one considers a threshold α, typically 0.1, which is the probability at every step of resetting the procedure: in other words, before jumping to another page, one draws a number u between 0 and 1. If u ≤ α, one restarts from a new page selected randomly in the whole web. Otherwise, one selects an outgoing link as before. This
procedure is formalized in algorithm 26. The Markov chain matrix coefficients are now all positive:

if J(i) ≠ ∅,   P_{i,j} = α/n + (1 − α) 1_{j∈J(i)} / |J(i)|  ≥  α/n > 0
otherwise      P_{i,j} = 1/n > 0
Consequently the algorithm is guaranteed to converge towards a unique set of PageRank indexes
that do not depend on the choice for the initial page.
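For a small graph the same equilibrium distribution can also be computed deterministically, by building the modified transition matrix and iterating P_{X_{t+1}} = Pᵀ P_{X_t}; the sketch below uses a made-up link structure:

import numpy as np

def pagerank(links, alpha=0.1, n_iter=100):
    # links[i] = set of pages pointed to by page i (pages are numbered 0..n-1).
    n = len(links)
    P = np.zeros((n, n))
    for i, J in enumerate(links):
        if J:
            P[i, :] = alpha / n                       # random reset with probability alpha
            for j in J:
                P[i, j] += (1 - alpha) / len(J)       # otherwise follow an outgoing link
        else:
            P[i, :] = 1.0 / n                         # dangling page: jump anywhere
    p = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        p = P.T @ p                                   # converge to the equilibrium distribution
    return p

print(pagerank([{1, 2}, {2}, {0}, set()]))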

26.3 Hidden Markov models


26.3.1 Definition
However in most applications like speech or handwriting recognition, the state is not directly
observable and cannot be completely reconstructed from the observations/emissions. One then

Algorithm 26 PageRank algorithm


1: Input: number T of iterations, threshold α
2: for i ← 1 to n do
3: N [i] ← 0
4: end for
5: i ← page drawn from U1,...,n
6: for t ← 1 to T do
7: N [i] ← N [i] + 1
8: Draw u from U[0,1]
9: if u ≤ α then
10: i ← page drawn from U1,...,n
11: else
12: Collect output links J pointed from i
13: if J = ∅ then
14: i ← page drawn from U1,...,n
15: else
16: i ← page drawn from UJ
17: end if
18: end if
19: end for
20: return Probability vector {N[i]/T}

needs to consider partially observable Markov Chains, or equivalently discrete Hidden Markov
models. In practice these discrete models are simply referred to as Hidden Markov Models.

A discrete Hidden Markov Model (HMM) is a hidden Markov model whose state
variables (Xt )t∈N are discrete. By supposing states range from 1 to n then such a
HMM is specified by:
• A Markov chain specification (i.e. initial state distribution PX0 and transition
distributions/matrices (Pt )t∈N )
• Emission distributions P Yt | Xt ,Θ that depend on parameters Θ.

– If the observations are discrete, with emission values ranging from 1 to m, the emission distributions can be represented by emission matrices {Q_t}_{t∈N}. Every emission matrix is an n × m stochastic matrix so that its ith line represents the distribution P_{Y_t | X_t=i}:

  Q_t = ( P(Y_t = 1 | X_t = 1)   · · ·   P(Y_t = m | X_t = 1)
          ⋮                               ⋮
          P(Y_t = 1 | X_t = n)   · · ·   P(Y_t = m | X_t = n) )

– If the emissions are continuous, then the emission distributions can be represented by a list of distribution parameters. For instance if all emissions for all states follow some normal distributions, the parameters will be a list (Ŷ_t^1, Σ_t^1, . . . , Ŷ_t^n, Σ_t^n) such that ∀i, Y_t | X_t = i ∼ N(Ŷ_t^i, Σ_t^i).

A homogeneous HMM is such that both transition and emission distributions


do not depend on time t.

26.3.2 Bayesian filtering


Bayesian filtering is about evaluating the current state given past and present observations:

Given a HMM of parameters (Pt , Qt )t∈{0,...,T } , some current time t ∈ {0, . . . , T }


and available observations Ot = {y0 , . . . yt }, filtering consists in evaluating the dis-
tribution:
P Xt | Y0 =y0 ,...Yt =yt ,θ

To solve this problem let’s first introduce the so-called alpha coefficients:

αt : x 7→ P (Xt = x, Y0 = y0 , . . . Yt = yt , θ)

Once the values of αt have been computed, it is straightforward to compute the state distribution
by normalization:
P_{X_t | Y_0=y_0,...,Y_t=y_t, θ}(x) = α_t(x) / Σ_{x'=1}^{n} α_t(x')

According to the graphical model of HMMs (see figure 26.4), the alpha coefficients can easily be
computed by forward recursion on t:

α_t(x) = P(X_t = x, Y_0 = y_0, . . . , Y_t = y_t)
       = P(Y_t = y_t | X_t = x) × P(X_t = x, Y_0 = y_0, . . . , Y_{t−1} = y_{t−1})
       = Q_t^{x,y_t} × Σ_{x'} P(X_t = x, X_{t−1} = x', Y_0 = y_0, . . . , Y_{t−1} = y_{t−1})
       = Q_t^{x,y_t} × Σ_{x'} P(X_t = x | X_{t−1} = x') × P(X_{t−1} = x', Y_0 = y_0, . . . , Y_{t−1} = y_{t−1})
       = Q_t^{x,y_t} × Σ_{x'} P_{t−1}^{x',x} × α_{t−1}(x')
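A sketch of this forward recursion for a homogeneous HMM with discrete emissions (P is the n × n transition matrix, Q the n × m emission matrix, p0 the initial distribution); normalizing α_t at every step avoids numerical underflow and directly yields the filtering distribution:

import numpy as np

def hmm_filter(p0, P, Q, observations):
    # Returns the filtering distributions P(X_t | y_0..y_t) for every t.
    alpha = p0 * Q[:, observations[0]]
    alpha = alpha / alpha.sum()
    filtered = [alpha]
    for y in observations[1:]:
        alpha = Q[:, y] * (P.T @ alpha)   # alpha_t(x) = Q[x, y_t] * sum_x' P[x', x] alpha_{t-1}(x')
        alpha = alpha / alpha.sum()
        filtered.append(alpha)
    return np.array(filtered)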

26.3.3 Bayesian smoothing and the forward-backward algorithm


Bayesian smoothing is about evaluating the current state given past, present and future observa-
tions:

Given some current time t and available observations O = {y0 , . . . yt , . . . , yT },


smoothing a HMM of parameters θ consists in evaluating the distribution:

P Xt | Y0 =y0 ,...Yt =yt ,...,YT =yT ,θ

In analogy with alpha coefficients for filtering, one introduces the beta coefficients as the likelihood
for state x at instant t for making the future observations:

βt : x 7→ P ( Yt+1 = yt+1 , . . . , YT = yT | Xt = x, θ )

Then the sought probability can be deduced by normalization on both alpha and beta coefficients:

P_{X_t | Y_0=y_0,...,Y_T=y_T}(x) ∝ P(X_t = x, Y_0 = y_0, . . . , Y_T = y_T)
  ∝ P(Y_{t+1} = y_{t+1}, . . . , Y_T = y_T | X_t = x, Y_0 = y_0, . . . , Y_t = y_t) × P(X_t = x, Y_0 = y_0, . . . , Y_t = y_t)
  ∝ P(Y_{t+1} = y_{t+1}, . . . , Y_T = y_T | X_t = x) × α_t(x)
  ∝ β_t(x) × α_t(x)

hence  P_{X_t | Y_0=y_0,...,Y_T=y_T}(x) = α_t(x) β_t(x) / Σ_{x'=1}^{n} α_t(x') β_t(x')

The remaining task is to compute the beta coefficients. Again this can be done by recursion but
in the backward direction (t starts from T and decreases):

β_t(x) = P(Y_{t+1} = y_{t+1}, . . . , Y_T = y_T | X_t = x)
       = Σ_{x'} P(X_{t+1} = x', Y_{t+1} = y_{t+1}, . . . , Y_T = y_T | X_t = x)
       = Σ_{x'} P(Y_{t+1} = y_{t+1}, . . . , Y_T = y_T | X_{t+1} = x', X_t = x) × P(X_{t+1} = x' | X_t = x)
       = Σ_{x'} P(Y_{t+1} = y_{t+1} | X_{t+1} = x', Y_{t+2} = y_{t+2}, . . . , Y_T = y_T) × P(Y_{t+2} = y_{t+2}, . . . , Y_T = y_T | X_{t+1} = x') × P_t^{x,x'}
       = Σ_{x'} P(Y_{t+1} = y_{t+1} | X_{t+1} = x') × β_{t+1}(x') × P_t^{x,x'}
       = Σ_{x'} P_t^{x,x'} × Q_{t+1}^{x',y_{t+1}} × β_{t+1}(x')

Computations of alpha and beta coefficients are independent and can be done in parallel. These
two computation tasks merged together define the forward-backward algorithm.
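A matching sketch of the forward-backward combination, using the same conventions as above (unnormalized α and β, which is acceptable only for short sequences; scaled versions are used in practice):

import numpy as np

def hmm_smooth(p0, P, Q, observations):
    # Returns the smoothing distributions P(X_t | y_0..y_T) for every t.
    T, n = len(observations), len(p0)
    alpha = np.zeros((T, n))
    beta = np.ones((T, n))
    alpha[0] = p0 * Q[:, observations[0]]
    for t in range(1, T):                  # forward pass (alpha recursion)
        alpha[t] = Q[:, observations[t]] * (P.T @ alpha[t - 1])
    for t in range(T - 2, -1, -1):         # backward pass (beta recursion)
        beta[t] = P @ (Q[:, observations[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)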

26.3.4 Most probable trajectory and the Viterbi algorithm


In addition to filtering and smoothing, another problem is to find the most probable state trajectory
of a HMM:

Given a HMM and some observations O = {y0 , . . . , yT }, finding the most probable
state trajectory consists in finding the most probable sequence (x? 0 , . . . , x? T ) of states
that is:

(x⋆_0, . . . , x⋆_T) = argmax_{x_0,...,x_T} P(X_0 = x_0, . . . , X_T = x_T | Y_0 = y_0, . . . , Y_T = y_T)

This problem can be solved efficiently using dynamic programming. Let’s first define the value
function as:

V : (t, x) ↦ max_{x_0,...,x_{t−1}} P(X_0 = x_0, . . . , X_{t−1} = x_{t−1}, X_t = x, Y_0 = y_0, . . . , Y_t = y_t)

One then finds a Bellman equation on the values:

V(t+1, x) = max_{x_0,...,x_t} P(X_0 = x_0, . . . , X_t = x_t, X_{t+1} = x, Y_0 = y_0, . . . , Y_{t+1} = y_{t+1})
          = max_{x_0,...,x_t} P_{X_{t+1}|X_t=x_t}(x) × P_{Y_{t+1}|X_{t+1}=x}(y_{t+1}) × ( ∏_{t'=1}^{t} P_{X_{t'}|X_{t'−1}=x_{t'−1}}(x_{t'}) × P_{Y_{t'}|X_{t'}=x_{t'}}(y_{t'}) ) × P_{X_0}(x_0) P_{Y_0|X_0=x_0}(y_0)
          = P_{Y_{t+1}|X_{t+1}=x}(y_{t+1}) × max_{x_t} ( P_{X_{t+1}|X_t=x_t}(x) × max_{x_0,...,x_{t−1}} ( ∏_{t'=1}^{t} P_{X_{t'}|X_{t'−1}=x_{t'−1}}(x_{t'}) × P_{Y_{t'}|X_{t'}=x_{t'}}(y_{t'}) ) × P_{X_0}(x_0) P_{Y_0|X_0=x_0}(y_0) )
          = P_{Y_{t+1}|X_{t+1}=x}(y_{t+1}) × max_{x_t} P_{X_{t+1}|X_t=x_t}(x) × V(t, x_t)
          = Q_{t+1}^{x,y_{t+1}} × max_{x'} ( P_t^{x',x} × V(t, x') )

In practice one computes the log value log V to avoid numerical precision issues and one also
memorizes for all pairs (t, x) the state x? t−1 (x) for which this value V (t, x) is reached, so that the
most probable trajectory can be reconstructed:
 0

x? t−1 (x) = argmax log Ptx ,x + log V (t, x0 )
x0
x,y x? t−1 (x),x
log V (t + 1, x) = log Qt+1t+1 + log Pt + log V (t, x? t−1 (x))

Once the pairs (V (t, x), x? t−1 (x)) have been computed in a forward order for all times t and states
x, the most probable state trajectory (x? 0 , . . . , x? T ) are computed backward:
x? T = argmax V (T, x), x? T −1 = x? T −1 (x? T ) . . . x? t−1 = x? t−1 (x? t ) . . . x? 0 = x? 0 (x? 1 )
x

This is the Viterbi algorithm.
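A log-space sketch of the Viterbi algorithm for a homogeneous HMM, with the same conventions as above for p0, P and Q:

import numpy as np

def viterbi(p0, P, Q, observations):
    # Most probable state sequence for a homogeneous discrete HMM, in log-space.
    T, n = len(observations), len(p0)
    logV = np.zeros((T, n))
    back = np.zeros((T, n), dtype=int)           # back[t, x] = best predecessor of x at time t
    logV[0] = np.log(p0) + np.log(Q[:, observations[0]])
    for t in range(1, T):
        scores = logV[t - 1][:, None] + np.log(P)     # scores[x', x]
        back[t] = scores.argmax(axis=0)
        logV[t] = np.log(Q[:, observations[t]]) + scores.max(axis=0)
    path = [int(logV[-1].argmax())]
    for t in range(T - 1, 0, -1):                # backtrack from the best final state
        path.append(int(back[t, path[-1]]))
    return path[::-1]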

26.3.5 Learning HMM and the Baum-Welch algorithm


So far the parameters of HMMs have been assumed to be known. However in many problems, like
speech recognition, the HMM parameters cannot be made explicit and must be learned from data:

Given a sequential problem where the number of hidden states is supposed to be n


and given N trajectory samples Z = {y^(i)}_{1≤i≤N} where every trajectory y^(i) is made of observations (y_0^(i), . . . , y_T^(i)) on some time interval [0, . . . , T], learning the HMM
consists in inferring transition and emission matrices (Pt )0≤t≤T and (Qt )0≤t≤T (in
a strong Bayesian sense or using a Bayes estimator, MLE by default).

Since HMM is a model with latent variables (X0 , . . . , XT ), this ML problem can be approached by
EM (using the MLE estimator). The specific instance of EM is called the Baum-Welch algorithm.

Estimation step
Let's define X^(i) and Y^(i) respectively as the joint variables (X_0^(i), . . . , X_T^(i)) and (Y_0^(i), . . . , Y_T^(i)). The X^(i) variables are latent. In general the variables X^(i) cannot be further decomposed. However, because of the Markov property, one can decompose the joint distribution of X^(i) as a product:

P_{X^(i) | Y^(i)} = P_{X_0^(i) | Y^(i)} ∏_t P_{X_{t+1}^(i) | X_t^(i), Y^(i)}

Estimating the distribution of X^(i) can be done by estimating these factors, or alternatively, by estimating the distributions (P_{X_t^(i) | y^(i)})_{0≤t≤T} and (P_{X_t^(i), X_{t+1}^(i) | y^(i)})_{0≤t≤T−1}.
Let's thus denote a_t^(i) and B_t^(i) the approximated distributions of P_{X_t^(i) | y^(i)} and P_{X_t^(i), X_{t+1}^(i) | y^(i)}. The “E” step updates a_t^(i) and B_t^(i) for every sample i and for every time t according to the current HMM parameters θ:

a_t^(i) ← P_{X_t^(i) | y_0^(i), . . . , y_T^(i), θ}
B_t^(i) ← P_{X_t^(i), X_{t+1}^(i) | y_0^(i), . . . , y_T^(i), θ}

This is a smoothing problem studied in section 26.3.3. Computing a_t^(i) can directly be done with the forward-backward algorithm. The computation of B_t^(i) can also be done by reusing the alpha and beta coefficients according to the following equation:

B_t^(i)(x, x') ∝ α_t(x) × P_t^{x,x'} × Q_{t+1}^{x',y_{t+1}^(i)} × β_{t+1}(x')

Maximization step

The “M” step finds the HMM parameters θ = {P_0} ∪ (∪_{t=0}^{T−1} P_t) ∪ (∪_{t=0}^{T} Q_t) that maximize the energy, according to the distributions a_t^(i) and B_t^(i). Assuming that the trajectory samples are i.i.d., the energy is (omitting parameters):

E(θ) = Σ_{i=1}^{N} ⟨ log P(Y^(i) = y^(i), X^(i) | θ) ⟩_{X^(i) ∼ a^(i), B^(i)}
     = Σ_{i=1}^{N} ( ⟨ log P(X_0^(i)) ⟩_{X_0^(i) ∼ a_0^(i)}
                   + Σ_{t=0}^{T−1} ⟨ log P(X_{t+1}^(i) | X_t^(i)) ⟩_{X_t^(i), X_{t+1}^(i) ∼ B_t^(i)}
                   + Σ_{t=0}^{T} ⟨ log P(Y_t^(i) = y_t^(i) | X_t^(i)) ⟩_{X_t^(i) ∼ a_t^(i)} )
Maximizing this expression using Lagrangian optimization, one gets:


P_0 = (1/N) Σ_{i=1}^{N} a_0^(i)
∀t, ∀x,  P_t(x, ∗) ∝ Σ_{i=1}^{N} B_t^(i)(x, ∗)
∀t, ∀y,  Q_t(∗, y) ∝ Σ_{i=1}^{N} 1_{y_t^(i)=y} a_t^(i)

Lines of Pt and Qt must be normalized so that coefficients sum up to one. In case the HMM is
homogeneous, the previous equations become:
P_0 = (1/N) Σ_{i=1}^{N} a_0^(i)
∀x,  P(x, ∗) ∝ Σ_{i=1}^{N} Σ_{t=0}^{T−1} B_t^(i)(x, ∗)
∀y,  Q(∗, y) ∝ Σ_{i=1}^{N} Σ_{t=0}^{T} 1_{y_t^(i)=y} a_t^(i)

The Baum-Welch algorithm in the homogeneous case is summarized by pseudocode 27 which is


not optimized (i.e. scalability can be improved greatly by computing, during the E step, the sums that are needed in the M step).
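As a sketch, the homogeneous M-step reduces to a few sums once the E-step has produced, for every trajectory, an array a of shape (T+1, n) and an array B of shape (T, n, n) with the forward-backward pass above:

import numpy as np

def baum_welch_m_step(a_list, B_list, obs_list, n, m):
    # Re-estimate (P0, P, Q) from smoothed marginals a_t and pairwise marginals B_t.
    P0 = np.zeros(n)
    P = np.zeros((n, n))
    Q = np.zeros((n, m))
    for a, B, obs in zip(a_list, B_list, obs_list):
        P0 += a[0]
        P += B.sum(axis=0)                  # sum over t of B_t(x, x')
        for t, y in enumerate(obs):
            Q[:, y] += a[t]
    P0 /= P0.sum()
    P /= P.sum(axis=1, keepdims=True)       # normalize lines
    Q /= Q.sum(axis=1, keepdims=True)
    return P0, P, Q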

26.4 Continuous-state Markov models


26.4.1 State-space representation
The Markov models considered so far had discrete states, although many dynamic systems have continuous states. A typical example is localization or navigation systems that provide the location (latitude, longitude, altitude (x, y, z)), heading (yaw, pitch and roll angles (Φ, Θ, Ψ)), and speed v of a moving vehicle/robot.
Let us now consider dynamic systems described by a partially observable Markov model whose
state and time spaces are continuous: state variables Xt are thus vectors of Rn and observations Yt
are vectors of Rq . The model can also be partially controllable thanks to a command represented
by a vector Ut ∈ Rp so that the Bayesian network of our model is given on Figure 26.6. The

Algorithm 27 Baum-Welch algorithm.


1: // Initialization step
2: Set current parameter models (P0 , P, Q) to some initial guess
3: repeat
4: // E-step
5: for every trajectory y (i) do
6: Apply forward-backward algorithm to compute αt and βt vectors.
7: for every time t do
8: for every state x do
9: a_t^(i)(x) ← α_t(x) × β_t(x)
10: for every state x' do
11:   B_t^(i)(x, x') ← α_t(x) × P^{x,x'} × Q^{x',y_{t+1}^(i)} × β_{t+1}(x')
12: end for
13: Normalize a_t^(i) so that coefficients sum up to 1.
14: end for
15: Normalize B_t^(i) so that coefficients sum up to 1.
16: end for
17: end for
18: // M-step
19: θ? old ← (P0 , P, Q)
20: P0 ← 0, P ← 0, Q ← 0
21: for every trajectory y (i) do
22: for every state x do
23:   P_0(x) ← P_0(x) + a_0^(i)(x)
24:   for every time t do
25:     Q(x, y_t^(i)) ← Q(x, y_t^(i)) + a_t^(i)(x)
26:     for every state x' do
27:       P(x, x') ← P(x, x') + B_t^(i)(x, x')
28: end for
29: end for
30: end for
31: end for
32: Normalize P0 and lines of P and Q
33: until |(P0 , P, Q) − θ? old | < ε
34: return (P0 , P, Q)

U0 U1 U2 U3 U4

X0 X1 X2 X3 X4 ...

Y0 Y1 Y2 Y3 Y4

Figure 26.6: Partially controllable hidden Markov model

command is observable (otherwise it would not be worth integrating it into the model) and helps
to know the current system state (Bayesian filtering).

Example:
Let’s take an example of a car that is localized thanks to a GPS receiver. The state is Xt =
(xt , yt , θt , vt )T , where (x, y) is the couple longitude-latitude, θ is the heading angle (a null angle
means the car is heading east), and v is the velocity. The variations of altitude z are neglected.
The command is Ut = (Ct , αt )T where C is the curvature (i.e. one assumes the wheel can be turned
instantly) and α is the force determined by the combined action of throttle and brakes. Finally the
output/observation is Y_t = (x_t^gps, y_t^gps, v_t^odo) where (x^gps, y^gps) are the GPS coordinates and where v^odo is the speed measured by the car odometer.

Because of the Markov property, the dynamic of the state (i.e the derivative of Xt ) is assumed
to be a function of only the current state and of the current command: in other words Xt+∆t only
depends on Xt and Ut for ∆t > 0; it is independent of the previous states Xt0 and commands Ut0
for t0 < t. The observation Yt generally only depends on the current state Xt even if the further
methods can support a further dependence of Yt on Ut . If the model is deterministic (i.e. there
is no source of uncertainty in the model) one can describe our model by a standard state-space
representation as represented on figure 26.7. Such representation is defined by two families of

Delay τ

Xt ft (Xt , Ut ) Xt+1

Ut gt (Xt , Ut ) Yt

Figure 26.7: State-space representation

functions ft and gt :
dX_t/dt = f_t(X_t, U_t)
Y_t = g_t(X_t, U_t)

The first equation represents the state integration, the second one is the output equation.

Example:

On our car example, the integration of X_t = (x_t, y_t, θ_t, v_t)ᵀ given U_t = (C_t, α_t)ᵀ is:

dx/dt = v_t cos(θ_t)
dy/dt = v_t sin(θ_t)
dθ/dt = v_t C_t
dv/dt = (1/M) α_t − (f/M) v_t

The output equation of Y_t = (x_t^gps, y_t^gps, v_t^odo) is given by

x_t^gps = x_t
y_t^gps = y_t
v_t^odo = v_t

M is the mass of the loaded vehicle and f is a friction coefficient. Because the functions f_t and g_t do not depend on time t, the system is homogeneous (unless the car crashes, and only as a first approximation since the mass can vary over time with fuel, load and passengers).

However, as already stated, time is assumed to be discrete, first for simplicity and second because practical implementations work in discrete time. The notion of derivative is thus replaced by a finite difference model so that the considered models are:

Xt+1 = ft (Xt , Ut )
Yt = gt (Xt , Ut )

Example:
Assuming the car computer updates the state representation with a time period of τ seconds (typi-
cally 0.1s), the state integration is now:

x_{t+1} = x_t + v_t cos(θ_t) τ
y_{t+1} = y_t + v_t sin(θ_t) τ
θ_{t+1} = θ_t + v_t C_t τ
v_{t+1} = (1 − fτ/M) v_t + (τ/M) α_t

The previous car model assumes the reality perfectly matches the model which is of course very
naive, mostly for two distinct reasons:

• First the dynamic is not perfectly known and some external factors can disturb the state
evolution. Some wind or slope can slow down the car or accelerate it in an unpredictable
way (at least for the wind).

• Second the observations can be noisy. The standard deviation of GPS coordinates for a fixed
point is typically of a few meters.

Therefore the model must be stochastic and integrate some uncertainty thanks to a Bayesian
approach. Functions ft and gt have to be replaced respectively by distributions P Xt+1 | Xt ,Ut ,Θt
and P Yt | Xt ,Ut ,Θt of parameters (Θt )t∈N .

X_{t+1} ∼ P_{X_{t+1} | X_t, U_t, Θ_t}
Y_t ∼ P_{Y_t | X_t, U_t, Θ_t}

In most cases, the system dynamic is fixed so that the model is homogeneous and the distribution
parameters do not depend on time:

X_{t+1} ∼ P_{X_{t+1} | X_t, U_t, Θ}
Y_t ∼ P_{Y_t | X_t, U_t, Θ}

The problem of learning such models consists in finding the model parameters Θ that best match some data (i.e. sequences of command, state and output triplets (u_t, x_t, y_t)_{t∈N}). This problem is usually not easy to solve as it requires representing – at least approximately – the complex distributions P_{X_{t+1} | X_t, U_t, Θ} and P_{Y_t | X_t, U_t, Θ}, and then using sampling techniques to estimate them. Further assumptions on the model allow to drastically simplify the problem, as shown in the next section.

26.4.2 Kalman Filter


Assuming models are linear usually provides tractable solutions. So let us consider a linear Markov model with continuous state and observation spaces. The uncertainty on both the model dynamics and the observations can be modelled by some additive white noise (since coloured noise can always be obtained from white noise by adding previous noise samples to the state vector X_t).

A discrete-time linear Markov model with continuous space and observation is char-
acterized by four matrix time series At ∈ Mn×n (R), Bt ∈ Mn×p (R), Ct ∈ Mq×n (R)
and D_t ∈ M_{q×p}(R), along with two white zero-centred noises (ε_t^X ∈ Rⁿ)_{t∈N} and (ε_t^Y ∈ R^q)_{t∈N}, and optionally two vector time series^a X_t^0 ∈ Rⁿ and Y_t^0 ∈ R^q such that:

X_{t+1} = A_t X_t + B_t U_t + X_t^0 + ε_t^X
Y_t = C_t X_t + D_t U_t + Y_t^0 + ε_t^Y

In the homogeneous case, matrices and vectors are fixed, equal respectively to A, B,
C, D, X 0 and Y 0 .
a These terms X 0 and Y 0 are generally omitted as they can be integrated in the matrices B and
D by extending the command vector with a constant component equal to 1. While elegant, this
choice is misleading and inefficient from an implementation point of view.

Example:
Clearly our car model is not linear because of expressions like vt cos(θt ), vt sin(θt ) or vt Ct . Let’s
modify our problem to make it linear. One will see how the car problem can be solved later. Let’s
consider a logistic elevator that loads and unloads packages from very long shelves in a factory
warehouse: this elevator is a motorized trolley equipped with a lift and mounted on linear rails
that run along the shelves. This robot can thus move in the XZ plane, the trolley moving along the
X axis and the lift along the Z axis. The state is
X = (x, z, v^x , v^z ) where (x, z) and (v^x , v^z ) are the position and speed coordinates in the XZ plane.
The command U = (α^x , α^z ) gathers the X and Z forces applied by the robot's electrical engines. An

odometer provides the trolley velocity v odo whereas a position encoder gives the elevation z enc of
the lift. The output vector is thus Y = (v odo , z enc ). Assuming the embedded computer updates the
state every τ = 0.1 second, and that the mass of the load can be neglected compared to the mass M
of the elevator, the corresponding discretized model is:

xt+1 = xt + v^x_t τ + ε^x
zt+1 = zt + v^z_t τ + ε^z
v^x_{t+1} = v^x_t + (1/M) (α^x_t − fx v^x_t) τ + ε^{vx}_t
v^z_{t+1} = v^z_t + (1/M) (α^z_t − fz v^z_t − M g) τ + ε^{vz}_t

Coefficients fx and fz represent the friction coefficients along the X and Z axes; g is the gravity acceleration
constant; the noises ε^{vx} and ε^{vz} respectively represent the unknown forces acting on the trolley and the lift, and the
noises ε^x and ε^z represent the risk of slipping (assumed to be null hereafter). The output
equation for Yt = (v^odo , z^enc ) is given by

v^odo_t = v^x_t + ε^odo_t
z^enc_t = zt + ε^enc_t

The noises ε^odo and ε^enc respectively represent the measurement noise of the trolley odometer and of the
elevation encoder. From these equations one derives the parameters of our model:
A = [ 1   0   τ              0
      0   1   0              τ
      0   0   1 − fx τ/M     0
      0   0   0              1 − fz τ/M ]

B = [ 0     0
      0     0
      τ/M   0
      0     τ/M ]

X^0 = ( 0, 0, 0, −g τ )^T        ε^X_t = ( ε^x_t , ε^z_t , ε^{vx}_t , ε^{vz}_t )^T

C = [ 0  0  1  0
      0  1  0  0 ]

D = [ 0  0
      0  0 ]

Y^0 = ( 0, 0 )^T        ε^Y_t = ( ε^odo_t , ε^enc_t )^T

However the linearity hypothesis is not sufficient to keep a simple and tractable representation
of the state distributions P Xt | Y0 ,...,Yt when t is growing. Further assumptions have to be made:
Kalman filters consider the specific subcase where initial state, state and observation noises are
assumed to be gaussian:

A Kalman filter estimates the state distribution P Xt | Y0 ,...,Yt for a linear discrete-
time continuous-state Markov model where
• The state and observation noises (ε^X_t)t∈N and (ε^Y_t)t∈N are white and normal
with null expected values and known covariance matrices, respectively denoted
(Qt )t∈N (with Qt ∈ Mn×n (R)) and (Rt )t∈N (with Rt ∈ Mq×q (R)).

• The initial state X0 follows a normal distribution X0 ∼ N( X̂0 , P0 ) of known
parameters.

Example:
In the robot example, the noises on x, v x , z and v z are all independent. Similarly the measurement
noise of the odometer and the position encoder are independent. One also assumes all sources of

noises are constant over time, so that the noise covariance matrices are constant and diagonal:

Q = [ σx²   0     0      0
      0     σz²   0      0
      0     0     σvx²   0
      0     0     0      σvz² ]

R = [ σodo²   0
      0       σenc² ]

Because multivariate normal distributions are closed under linear combinations, it is obvious
that P Xt | Y0 ,...,Yt will remain normal. The real question is to know how to update the parameters
of this normal distribution during state integration and observation. To this end, one introduces
the following useful notation:

• X̂t|t−1 and Pt|t−1 are respectively the expected value and the covariance matrix of the current
state Xt | Y0 , . . . , Yt−1 given the past observations, abbreviated as Xt|t−1 .

• X̂t|t and Pt|t are respectively the expected value and the covariance matrix of the current
state Xt | Y0 , . . . , Yt given the past and present observations, abbreviated as Xt|t .
 
Let’s prove by induction that Xt | Y0 , . . . , Yt−1 is normal: Xt|t−1 ∼ N X̂t|t−1 , Pt|t−1 .
 
Proof. The induction is verified at rank 0 since X0 ∼ N X̂0 , P0 by hypothesis. Let’s as-
 
sume Xt|t−1 ∼ N X̂t|t−1 , Pt|t−1 and let’s prove this property at rank t + 1, i.e. Xt+1|t ∼
 
N X̂t+1|t , Pt+1|t .
 
The proof is split in two halves: the first half shows Xt|t ∼ N X̂t|t , Pt|t , the second shows
 
Xt+1|t ∼ N X̂t+1|t , Pt+1|t . Both halves are similar but the first half is more difficult than the
second so let’s assume in a first stage that the first half is already proven and let’s prove first the
second half.
Since Ut is a known constant that can be interpreted as a normal distribution of null covariance
and since εXt is a white noise independent of Xt|t , one then has:
     
[ Xt|t  ]        [ X̂t|t ]   [ Pt|t   0n,n   0n,p ]
[ ε^X_t ]  ∼  N( [ 0n,1 ] , [ 0n,n   Qt     0n,p ] )
[ Ut    ]        [ Ut   ]   [ 0p,n   0p,n   0p,p ]

One also has:

Xt+1|t = At Xt|t + Bt Ut + ε^X_t
       = M × [ Xt|t ; ε^X_t ; Ut ]        with M = [ At  In,n  Bt ]

So, since a linear combination of jointly normal variables is still normal:

Xt+1|t ∼ N( M × [ X̂t|t ; 0 ; Ut ] , M × [ Pt|t 0 0 ; 0 Qt 0 ; 0 0 0 ] × M^T )
       ∼ N( At X̂t|t + Bt Ut , At Pt|t At^T + Qt )

This also proves that:

X̂t+1|t = At X̂t|t + Bt Ut and Pt+1|t = At Pt|t ATt + Qt


 
This proves the second half of the proof. Now let's prove that Xt|t ∼ N( X̂t|t , Pt|t ). According
to the Markov property, if yt is the observation made at time t and if Yt|t−1 denotes the output
prediction Yt | Y0 , . . . , Yt−1 given the past observations, one has:

P Xt|t = P Xt | Y0 ,...,Yt−1 ,Yt =yt
       = P ( Xt , Yt = yt | Y0 , . . . , Yt−1 ) / P ( Yt = yt | Y0 , . . . , Yt−1 )
       = P Xt|t−1 | Yt|t−1 =yt

Because one knows that the joint variable (Xt|t−1 , Yt|t−1 ) is normal (since Y is given by a linear
combination of normal variables) and that conditioning a variable X with a variable Y when joint
variable (X, Y ) is normal gives another normal variable X | Y , one can deduce Xt|t is also normal.
Let’s compute its parameter. First let’s recall the rules for conditioning a multivariate normal
distribution. If:
       
[ X1 ]        [ X̂1 ]   [ Σ11     Σ12 ]
[ X2 ]  ∼  N( [ X̂2 ] , [ Σ12^T   Σ22 ] )
  ⇒  X1 | X2 = x2 ∼ N( X̂1 + Σ12 Σ22^-1 (x2 − X̂2) , Σ11 − Σ12 Σ22^-1 Σ12^T )        (26.107)
But one first needs to determine the joint distribution of Xt|t−1 and Yt|t−1 before applying these
equations:
 
[ Xt|t−1 ]         [ Xt|t−1 ]                    [ In,n   0n,q   0n,p ]
[ Yt|t−1 ]  = M' × [ ε^Y_t  ]        with M' =   [ Ct     Iq,q   Dt   ]
                   [ Ut     ]

Because

[ Xt|t−1 ]        [ X̂t|t−1 ]   [ Pt|t−1   0n,n   0n,p ]
[ ε^Y_t  ]  ∼  N( [ 0n,1    ] , [ 0n,n     Rt     0n,p ] )
[ Ut     ]        [ Ut      ]   [ 0p,n     0p,n   0p,p ]

Consequently:

[ Xt|t−1 ]
[ Yt|t−1 ]  ∼  N( M' × [ X̂t|t−1 ; 0 ; Ut ] , M' × [ Pt|t−1 0 0 ; 0 Rt 0 ; 0 0 0 ] × M'^T )

            ∼  N( [ X̂t|t−1 ]   [ Pt|t−1        Pt|t−1 Ct^T ] )
                  [ Ŷt|t−1 ] , [ Ct Pt|t−1     St|t−1      ]

with

Ŷt|t−1 = Ct X̂t|t−1 + Dt Ut
St|t−1 = Ct Pt|t−1 Ct^T + Rt

By applying equation (26.107) to this joint distribution, one finally gets:


 
Xt|t ∼ N( X̂t|t , Pt|t )

with

X̂t|t = X̂t|t−1 + Pt|t−1 Ct^T St|t−1^-1 ( yt − Ŷt|t−1 )
Pt|t = Pt|t−1 − Pt|t−1 Ct^T St|t−1^-1 Ct Pt|t−1
     = ( In − Pt|t−1 Ct^T St|t−1^-1 Ct ) Pt|t−1

This proof not only demonstrates that the state Xt follows a normal distribution, but it also
gives – and this is essential from an application perspective – the equations to update the parame-
ters of the state distribution 1) when time must be increased, also called the prediction equations,
and 2) when some observations are received, also called the update equations.

The Kalman filter consists in:


• Applying the prediction equations as soon as the state must be integrated from
a time t to a later time t0 > t. The prediction equations compute in order:
1. Predicted state expected value:

X̂t+1|t = At X̂t|t + Bt Ut

2. Predicted state covariance matrix:

Pt+1|t = At Pt|t ATt + Qt

• Applying the update equations as soon as an observation is received. Since


the timestamp t0 of an observation is usually larger than the time t of the
current state X̂, it is required first to apply the prediction equations to update
the current predicted state to time t0 before applying the update equations. The
update equations compute in order:
1. Predicted output expected value:

Ŷt|t−1 = Ct X̂t|t−1 + Dt Ut

2. Predicted output covariance matrix:

St|t−1 = Ct Pt|t−1 CTt + Rt

3. Innovation (that is the signed error between output and expected output):

ỹt = yt − Ŷt|t−1

4. Kalman filter gain (that estimates how strongly the innovation should cor-
rect the state):
Kt = Pt|t−1 Ct^T St|t−1^-1

5. Updated state expected value:

X̂t|t = X̂t|t−1 + Kt ỹt

6. Updated state covariance matrix:

Pt|t = (In − Kt Ct ) Pt|t−1

Implementing a Kalman filter typically amounts to implementing the three functions given in Algorithms 28, 29
and 30.
It is interesting to note that the update function can be called with different types of observation
vectors Y. This is of practical interest for systems equipped with different types of sensors providing
measurements at different times/rates. This property of extracting the best information from multiple
sensors is called information fusion.
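As an illustration, here is a minimal Python/numpy sketch of such a filter for the homogeneous case (constant A, B, C, D, Q, R and no X^0/Y^0 offsets). The class and variable names are illustrative placeholders, not part of any particular library:

import numpy as np

class KalmanFilter:
    """Minimal linear Kalman filter (homogeneous case, no offset terms)."""

    def __init__(self, A, B, C, D, Q, R, x0_hat, P0):
        self.A, self.B, self.C, self.D = A, B, C, D
        self.Q, self.R = Q, R
        self.x_hat = x0_hat      # current state estimate X̂
        self.P = P0              # current state covariance

    def predict(self, u):
        # Prediction equations: propagate mean and covariance one time step.
        self.x_hat = self.A @ self.x_hat + self.B @ u
        self.P = self.A @ self.P @ self.A.T + self.Q

    def update(self, y, u):
        # Update equations: correct the prediction with the observation y.
        y_hat = self.C @ self.x_hat + self.D @ u                  # predicted output
        S = self.C @ self.P @ self.C.T + self.R                   # innovation covariance
        K = self.P @ self.C.T @ np.linalg.inv(S)                  # Kalman gain
        self.x_hat = self.x_hat + K @ (y - y_hat)                 # corrected mean
        self.P = (np.eye(self.P.shape[0]) - K @ self.C) @ self.P  # corrected covariance

In practice, predict is called whenever the state must be integrated to a later time, and update is called each time a (possibly sensor-specific) observation arrives, exactly as in Algorithms 28–30.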

Example:

 
Algorithm 28 init X̂0 , P0

Require: Initial expected state X̂0 and covariance P0


1: t, X̂ and P are global variables.
2: t ← 0
3: X̂ ← X̂0
4: P ← P0

Algorithm 29 predict(t0 , U)
Require: New time t0 , and command U
1: Compute A, B and Q given current state and time
2: X̂ ← A X̂ + B U
3: P ← A P AT + Q
4: t ← t0

In addition to the existing sensors, the elevator is equipped with an optical sensor mounted on
the lift that triggers an output every time the sensor gets aligned with one of the visual landmarks stuck on
the shelves. The position (x^opt , z^opt ) of the robot can then be inferred by querying a database that maps
every landmark to a rack position. This second output Y^opt = (x^opt , z^opt ) provides a measure of
position much more accurate than the first output, but it is only available occasionally, at a much lower
rate.

26.4.3 Extended Kalman Filter

The Kalman filter only works for linear systems, but most real systems are not linear. The navigation
systems of wheeled vehicles are an important example. One solution to apply the Kalman filter to
a non-linear system is to linearise the system equations in the vicinity of the currently estimated
state X̂. This first-order approximation is called the Extended Kalman Filter.

Algorithm 30 update(t0 , Y, U)
Require: Observation timestamp t0 and value y, command U
1: Call predict(t0 , U)
2: Compute C, D and R given current state and time
3: Ŷ ← C X̂ + D U
4: S ← C P CT + R
5: K ← P CT S−1 
6: X̂ ← X̂ + K y − Ŷ
7: P ← (In − K C) P

The Extended Kalman filter (EKF) consists, given a non-linear state space repre-
sentation

Xt+1 = ft (Xt , Ut )
Yt = gt (Xt , Ut )

in:
• Prediction equations:
1. Predicted state expected value:

X̂t+1|t = ft (X̂t|t , Ut )

2. Jacobian matrix of ft :

At = ∂ft/∂X (Xt|t)
3. Predicted state covariance matrix:

Pt+1|t = At Pt|t ATt + Qt

• Update equations:
1. Predicted output expected value:

Ŷt|t−1 = gt (X̂t|t−1 , Ut )

2. Jacobian matrix of gt :

Ct = ∂gt/∂X (Xt|t−1)
3. Predicted output covariance matrix:

St|t−1 = Ct Pt|t−1 CTt + Rt

4. The remaining equations are identical to the classical Kalman filter:

ỹt = yt − Ŷt|t−1
Kt = Pt|t−1 Ct^T St|t−1^-1
X̂t|t = X̂t|t−1 + Kt ỹt
Pt|t = (In − Kt Ct ) Pt|t−1

Example:
The EKF allows one to estimate the position and speed of our car. As a reminder, the state, command and
output of this system are:

Xt = ( xt , yt , θt , vt )^T        Ut = ( Ct , αt )^T        Yt = ( x^gps_t , y^gps_t , v^odo_t )^T

The integration equations are non-linear:

Xt+1 = f(Xt , Ut) + ε^X_t

where f(Xt , Ut) = ( fx(Xt , Ut) , fy(Xt , Ut) , fθ(Xt , Ut) , fv(Xt , Ut) )^T with

fx(X, U) = x + v cos(θ) τ
fy(X, U) = y + v sin(θ) τ
fθ(X, U) = θ + v C τ
fv(X, U) = (1 − f τ / M) v + (τ / M) α

While the observation equations are linear:


 
Yt = [ 1  0  0  0
       0  1  0  0
       0  0  0  1 ] Xt + ε^Y_t

An EKF can be implemented by computing the Jacobian matrix of ft :

At = ∂f/∂X = [ ∂fx/∂x  ∂fx/∂y  ∂fx/∂θ  ∂fx/∂v
               ∂fy/∂x  ∂fy/∂y  ∂fy/∂θ  ∂fy/∂v
               ∂fθ/∂x  ∂fθ/∂y  ∂fθ/∂θ  ∂fθ/∂v
               ∂fv/∂x  ∂fv/∂y  ∂fv/∂θ  ∂fv/∂v ]

   = [ 1  0  −v sin(θ) τ   cos(θ) τ
       0  1   v cos(θ) τ   sin(θ) τ
       0  0   1             C τ
       0  0   0             1 − f τ / M ]

To define the Q matrix, one needs to determine the main sources of uncertainty in the model
dynamics. The standard deviation of the acceleration uncertainty is roughly estimated at 3 m/s², the risk
of slipping is considered to be null in normal conditions, and the uncertainty on the rotational speed
is estimated at 10 deg/s ≈ 0.2 rad/s. For the matrix R, the GPS accuracy is about 2 m and the odometer
precision is 3 km/h ≈ 1 m/s, so finally:
 
Q = [ 0  0  0         0
      0  0  0         0
      0  0  (0.2 τ)²  0
      0  0  0         (3 τ)² ]

R = [ 2²  0   0
      0   2²  0
      0   0   1² ]
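For illustration, a minimal Python/numpy sketch of the EKF prediction and update steps for this car model could look as follows; the values of tau, M and the friction coefficient are arbitrary placeholders, and the observation matrix C is the linear one given above:

import numpy as np

def f(x, u, tau=0.1, M=1000.0, fric=50.0):
    """Non-linear state integration for the car model (state = [x, y, theta, v])."""
    px, py, theta, v = x
    curv, acc = u  # curvature command C and acceleration force alpha
    return np.array([
        px + v * np.cos(theta) * tau,
        py + v * np.sin(theta) * tau,
        theta + v * curv * tau,
        (1 - fric * tau / M) * v + (tau / M) * acc,
    ])

def jacobian_f(x, u, tau=0.1, M=1000.0, fric=50.0):
    """Jacobian A_t = df/dX evaluated at the current estimate."""
    _, _, theta, v = x
    curv, _ = u
    return np.array([
        [1, 0, -v * np.sin(theta) * tau, np.cos(theta) * tau],
        [0, 1,  v * np.cos(theta) * tau, np.sin(theta) * tau],
        [0, 0,  1,                       curv * tau],
        [0, 0,  0,                       1 - fric * tau / M],
    ])

C = np.array([[1., 0., 0., 0.],
              [0., 1., 0., 0.],
              [0., 0., 0., 1.]])

def ekf_predict(x_hat, P, u, Q):
    # Linearize around the current estimate, then apply the prediction equations.
    A = jacobian_f(x_hat, u)
    return f(x_hat, u), A @ P @ A.T + Q

def ekf_update(x_hat, P, y, R):
    # Observation model is linear here, so the update is the classical Kalman update.
    S = C @ P @ C.T + R                      # innovation covariance
    K = P @ C.T @ np.linalg.inv(S)           # Kalman gain
    x_hat = x_hat + K @ (y - C @ x_hat)
    P = (np.eye(4) - K @ C) @ P
    return x_hat, P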

The EKF can efficiently track a system state as long as the uncertainty on the state is kept small.
However, if observations are missing for too long, the uncertainty (represented by P) increases
and the predicted state will likely diverge from the real state. In such cases, alternative methods
like particle filtering (see section 28.3.4) must be used instead.
Chapter 27

Non-parametric Bayesian methods


and Gaussian Processes

27.1 Introduction
So far all the studied methods rely on the existence of some model parameters θ. Such methods
are said to be parametric. Given some classification/regression problem predicting an output Y from an
input X and given some i.i.d samples Z = ((x1 , y1 ), . . . , (xn , yn )), a parametric method is divided
in two steps:
• The learning step infers parameters from samples, that is to say, determines the posterior
P Θ | Z as:

P Θ | Z (θ) ∝ ( ∏_i P Yi | Xi =xi ,Θ=θ (yi) ) × PΘ(θ)

• The prediction step infers the output from the input and the posterior, that is to say, deter-
mines P Y | X=x,Z as:

P Y | X=x,Z (y) = ∫_θ P Y | X=x,Θ=θ (y) × P Θ | Z (θ) dθ

Merging these two steps in one leads to the notion of non-parametric methods:
P Y | X=x,Z (y) = ∫_θ P Y | X=x,Θ=θ (y) × P Θ | Z (θ) dθ
               ∝ ∫_θ P Y | X=x,Θ=θ (y) × ( ∏_i P Yi | Xi =xi ,Θ=θ (yi) ) × PΘ(θ) dθ

In a non-parametric approach, the model does not make parameters explicit. One directly infers
P Y | X=x,Z from the observed samples, which are obviously not i.i.d any more as the parameters
have been marginalized out:
PY | X=x,Z ∝ P Y,Y1 =y1 ,...,Yn =yn | X=x,X1 =x1 ,...,Xn =xn
The K-nearest neighbour classification method is an example of non-parametric method. Let’s
develop the example of Gaussian processes.

27.2 Gaussian Process


27.2.1 Definition
Gaussian Processes are a powerful regression tool also called kriging in some application fields.
They illustrate the notion of non-parametric method in a Bayesian context.


A Gaussian Process (GP) is a stochastic process from Rm to R such that every


finite subset of its variables follows a multivariate normal distribution. A Gaussian
process denoted GP(µ, k) is fully defined by two functions µ and k that respectively
provide the expected value and covariance of any pair of variables. Precisely, process
f ∼ GP(µ, k) is defined by:
∀n, ∀(x1 , . . . , xn) ∈ (Rm)^n,

(f(x1), . . . , f(xn)) ∼ N( ( µ(x1), . . . , µ(xn) )^T , K )   with   Kij = k(xi , xj) for 1 ≤ i, j ≤ n

Where:

∀x ∈ Rm , µ(x) = E[ f(x) ]
∀(x1 , x2) ∈ (Rm)² , k(x1 , x2) = E[ (f(x1) − µ(x1)) (f(x2) − µ(x2)) ]

A Gaussian process can be viewed as a generalization of the multivariate normal distribution to an
infinite collection of variables indexed by Rm.
The functions µ and k can be interpreted as the hyperparameters of the Bayesian prior of
Gaussian processes. Since hyperparameters are set by the user, one might wonder whether any
function can be chosen for µ and k or in other words, are there constraints that must be verified
by µ and k.
• The function µ, which defines the expected value, is a continuous, generally smooth, function. With-
out particular knowledge about the modelled process, a common choice is the null function.
• The function k must ensure that for any subset of inputs, the covariance matrix is positive
semidefinite. Such a function k is called a covariance function or kernel. In particular, a
kernel function is a continuous, symmetric, non negative function. There are many ways to
construct kernels from other kernels: the sum and product of kernels is another kernel. A
common choice is the γ-exponential kernel of parameters γ, σ and l:
 
kE(x1 , x2) = σ² exp( − ( ‖x1 − x2‖ / l )^γ )

An even more specific choice is to take γ = 2 to get the squared exponential kernel:

kSE(x1 , x2) = σ² exp( − ( ‖x1 − x2‖ / l )² )

Kernels can also be built as a scalar product in some space, thanks to a transformation g:

k(x1 , x2) = ⟨ g(x1) , g(x2) ⟩

Since outputs of a Gaussian process are generally spatially correlated, i.e. f (x1 ) and f (x2 ) get
more correlated when x1 and x2 get closer, the kernel functions are chosen so that k(x1 , x2 )
grows when x2 tends to x1 .

27.2.2 Representation and sampling


A first question is how can one represent a Gaussian process. To make it simple, let’s assume
m = 1. In such a case, a Gaussian process can be viewed as a one-dimensional stochastic process

that can be represented on the real line R. A first representation of f ∼ GP(µ, k) is to draw the
three functions x ↦ µ(x), x ↦ µ(x) − 2√k(x, x) and x ↦ µ(x) + 2√k(x, x) for all x ∈ R. The first
function represents the expected value of the outputs and the interval [µ(x) − 2√k(x, x), µ(x) +
2√k(x, x)] represents a 95 % confidence interval for a normal distribution. Such a representation
is given on figure 27.1.


Figure 27.1: Representation of a Gaussian process GP(µ, k) for µ : x 7→ sin(x) and k(x1 , x2 ) =
kSE (x1 , x2 ) with l = 1.

However this first representation considers every input x independently of the others and does
not take into account the correlation k(x1 , x2) between f(x1) and f(x2). This representation
does not emphasize the correlation of outputs of similar inputs (i.e. the smoothness of samples).
A second possible representation is to draw randomly several samples from f and plot them
graphically as curves. Sampling is however not obvious as a sample of a Gaussian process is a
function y : x ∈ R ↦ y(x) ∈ R, that is, an infinite number of points. A possible approximation
is to choose an interval of representation [xmin , xmax] and to split this interval with n regularly
spaced points:

xi = i (xmax − xmin) / n + xmin
Then one approximates the sample curve y : x ∈ R ↦ y(x) ∈ R by the finite set of points (x1 , y1), . . . , (xn , yn)
such that (y1 , . . . , yn) is drawn from the multivariate normal distribution:

(y1 , . . . , yn) ← N( ( µ(x1), . . . , µ(xn) )^T , K )   with   Kij = k(xi , xj)

The sampling algorithm is detailed in pseudocode 31. In particular, it explains how to sample
a multivariate normal distribution thanks to a covariance matrix decomposition and the
following property of multivariate normal distributions:

X ∼ N(µ, Σ) ⇒ A X + B ∼ N( A µ + B , A Σ A^T )

Figure 27.2 provides 10 samples from the Gaussian process introduced on Fig. 27.1.
Of course the previous representations can be generalized to an input space of higher dimension
by generating a grid of input points instead of subdividing an interval of the real line.

27.2.3 Influence of kernel


The kernel function k defines the spatial correlation of the process between close input points.
Figure 27.3 illustrates the growing spatial correlation when the distance parameter l of the squared
exponential kernel increases. One can see that the samples tend to be white noise when l tends
to 0, and tend to be constant when l increases.

Algorithm 31 Gaussian Process sampling.


1: Input: functions µ and k, interval [xmin , xmax ], number n of points
2: Create n-vectors X, Y, Ŷ and n × n matrices Σ, A
3: // Generation of regularly spaced input points
4: for i = 0 . . . n − 1 do
5: X[i] ← (xmax − xmin )/n × i + xmin
6: end for
7: // Generation of multivariate normal distribution parameters
8: for i = 0 . . . n − 1 do
9: Ŷ[i] ← µ(X[i])
10: for j = i . . . n − 1 do
11: Σ[i, j] ← Σ[j, i] ← k(X[i], X[j]);
12: end for
13: end for
14: // Sampling of multivariate normal distribution for outputs
15: Compute A such that Σ = A A^T
16: // A is computed using an extended Cholesky decomposition or
17: // using a spectral decomposition A = M Λ^(1/2) where Σ = M Λ M^T
18: for i = 0 . . . n − 1 do
19: Y[i] ← sample drawn from N (0, 1)
20: end for
21: Y ← Ŷ + A Y
22: // Output graphics
23: for i = 0 . . . n − 1 do
24: Draw point (X[i], Y[i])
25: end for
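For illustration, here is a minimal numpy sketch of this sampling procedure for a one-dimensional GP with the squared exponential kernel; the values of sigma, l, the interval and the grid size are arbitrary placeholders:

import numpy as np

def k_se(x1, x2, sigma=1.0, l=1.0):
    """Squared exponential kernel."""
    return sigma**2 * np.exp(-((x1 - x2) / l)**2)

def sample_gp(mu, k, x_min=0.0, x_max=10.0, n=200, n_samples=10):
    """Draw n_samples approximate sample curves from GP(mu, k) on a regular grid."""
    x = np.linspace(x_min, x_max, n)
    mean = np.array([mu(xi) for xi in x])
    cov = np.array([[k(xi, xj) for xj in x] for xi in x])
    # A small jitter on the diagonal keeps the Cholesky decomposition numerically stable.
    A = np.linalg.cholesky(cov + 1e-10 * np.eye(n))
    # y = mean + A z with z ~ N(0, I) follows N(mean, cov).
    z = np.random.randn(n, n_samples)
    return x, mean[:, None] + A @ z

# Example: curves similar to those of Fig. 27.2
x, samples = sample_gp(np.sin, k_se)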

Figure 27.2: 10 samples from GP(µ, k) for µ : x ↦ sin(x) and k(x1 , x2) = kSE(x1 , x2) with l = 1.

[Figure 27.3, four panels of samples: (a) l = 0.1, (b) l = 0.5, (c) l = 3, (d) l = 10.]

Figure 27.3: 10 samples from GP(µ, k) for µ : x ↦ 0 and k(x1 , x2) = kSE(x1 , x2) with various
values for l.

27.2.4 Prediction

Since Gaussian processes are non-parametric models, there is no learning step: prediction of an
output Y for the input X = x is made directly from the observations O = ((x1 , y1), . . . , (xk , yk))
by conditioning the joint multivariate normal distribution of (f(x), f(x1), . . . , f(xk)) on the obser-
vations f(x1) = y1 , . . . , f(xk) = yk :

A Gaussian process f ∼ GP(µ, k) conditioned on observations O =
((x1 , y1), . . . , (xk , yk)) is still a Gaussian process. Given n input points (x'1 , . . . , x'n),
the distribution of the resulting process f | O for these inputs is a multivariate nor-
mal distribution:

(f(x'1), . . . , f(x'n)) | f(x1) = y1 , . . . , f(xk) = yk ∼
        N( µp + Σpo Σoo^-1 (yo − µo) , Σpp − Σpo Σoo^-1 Σpo^T )

where

µp = ( µ(x'1), . . . , µ(x'n) )^T        Σpp = [ k(x'i , x'j) ]_{1≤i,j≤n}
µo = ( µ(x1), . . . , µ(xk) )^T          Σoo = [ k(xi , xj) ]_{1≤i,j≤k}        (27.7)
yo = ( y1 , . . . , yk )^T               Σpo = [ k(x'i , xj) ]_{1≤i≤n, 1≤j≤k}

Proof. Given observations O = ((x1 , y1), . . . , (xk , yk)), for every finite set (x'1 , . . . , x'n) of input points,
one knows that (f(x'1), . . . , f(x'n), f(x1), . . . , f(xk)) follows a multivariate normal distribution.
Moreover, given two vectors A and B of random variables, if the joint distribution of A ∪ B follows
a multivariate normal distribution, then it is proven that the distribution A | B = b of A given
that B is equal to the value b is still a multivariate normal distribution, whose parameters are given by the
following formula, where ΣXY denotes the covariance matrix between X and Y:

[ A ]        [ E[A] ]   [ ΣAA     ΣAB ]
[ B ]  ∼  N( [ E[B] ] , [ ΣAB^T   ΣBB ] )   ⇒

A | B = b ∼ N( E[A] + ΣAB ΣBB^-1 (b − E[B]) , ΣAA − ΣAB ΣBB^-1 ΣAB^T )

Consequently, by taking A = (f(x'1), . . . , f(x'n)), B = (f(x1), . . . , f(xk)) and b = (y1 , . . . , yk),
one knows that (f(x'1), . . . , f(x'n)) | O follows a multivariate normal distribution, so that f | O
is a Gaussian process. The previous identity combined with the definition of functions µ and k
provides the expressions of vectors and matrices in equation (27.7).

From these expressions, one can deduce the distribution of f(x) | O (by taking n = 1 and x'1 = x).
This is useful to update the drawing of the average and confidence interval curves of f(x) | O.
Moreover these expressions, when combined with the sampling technique presented in algorithm 31,
allow one to draw graphically samples from the conditioned Gaussian process. Figure 27.4 shows how a
one-dimensional Gaussian process is conditioned progressively as new observations become available.
On the last panel, the fourth observation (1.1, −2) contradicts the first one (1, 1). This introduces
a kind of singularity, with abrupt changes, and a high expected value of about 15 around input x of
0.5.

27.2.5 Regularization
So far the observations ((x1 , y1 ), . . . , (xk , yk )) were assumed to be perfect, without noise. As a
consequence the distribution of f (xi ) | O is atomic (see how variance of input points is null on
Fig. 27.2). This produces a kind of overfitting, observable on the last figure of Fig. 27.4. Moreover
two contradictory observations (x1 , y1 ) and (x2 , y2 ) (i.e x1 ∼ x2 but y1  y2 ) can introduce
singularities.

[Figure 27.4, four panels showing the conditioned process after each new observation: (a) observation (1, 1), (b) observation (8, −2), (c) observation (4, 2), (d) observation (1.1, −2).]

Figure 27.4: Conditioning a Gaussian process to an increasing number of observations. The initial
Gaussian process has a null µ function and uses a squared exponential kernel kSE with l = 1.

To avoid overfitting, one can model the imperfectness of the observation by introducing an
observation noise ε:
∀i, Yi = f (xi ) + εi
Variables εi are assumed to be normal white noises with known variance σε2 :

∀i, εi ∼ N 0, σε2

What are the consequences on equation (27.7)?


• The expected value vector µo is unchanged since

E [Yi ] = E [f (xi ) + εi ] = µ(xi )

• Σoo is replaced by Σoo + σε2 Ik since

cov(Yi , Yj ) = cov(f (xi ) + εi , f (xj ) + εj ) = k(xi , xj ) + cov(εi , εj ) = k(xi , xj ) + σε2 δ(i = j)

• Σpo is left unchanged since

cov(f (x0i ), Yj ) = cov(f (x0i ), f (xj ) + εj ) = cov(f (x0i ), f (xj )) + 0 = k(x0i , xj )

Finally

Given a Gaussian process f ∼ GP(µ, k) conditioned on observations O =


((x1 , y1 ), . . . , (xk , yk )) such that the observations contain some additive Gaussian
white noises of variance σε2 , the distribution of the Gaussian process f | O for input
points (x01 , . . . , x0n ) is a multivariate normal distribution:

(f(x'1), . . . , f(x'n)) | Y1 = y1 , . . . , Yk = yk ∼
    N( µp + Σpo (Σoo + σε² Ik)^-1 (yo − µo) , Σpp − Σpo (Σoo + σε² Ik)^-1 Σpo^T )        (27.8)

The introduction of noise is a form of regularization to defeat overfitting. Figure 27.5 takes the
same example introduced in figure 27.4 with some additional observation noise. One sees that the
variance at an input point that has been observed is not zero anymore and no singularities appear
on the last figure.
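As an illustration, here is a minimal numpy sketch of this noisy GP prediction (equation (27.8)); the kernel, the zero mean function and the noise level are arbitrary placeholders:

import numpy as np

def gp_predict(x_obs, y_obs, x_pred, mu, k, noise_var=0.0):
    """Posterior mean and covariance of a GP at x_pred, given noisy observations."""
    mu_o = np.array([mu(x) for x in x_obs])
    mu_p = np.array([mu(x) for x in x_pred])
    K_oo = np.array([[k(a, b) for b in x_obs] for a in x_obs])
    K_pp = np.array([[k(a, b) for b in x_pred] for a in x_pred])
    K_po = np.array([[k(a, b) for b in x_obs] for a in x_pred])
    # Equation (27.8): conditioning with the sigma_eps^2 I regularization term.
    inv = np.linalg.inv(K_oo + noise_var * np.eye(len(x_obs)))
    mean = mu_p + K_po @ inv @ (np.array(y_obs) - mu_o)
    cov = K_pp - K_po @ inv @ K_po.T
    return mean, cov

# Usage, reusing the k_se kernel sketched earlier and the observations of Fig. 27.5:
# mean, cov = gp_predict([1, 8, 4, 1.1], [1, -2, 2, -2], np.linspace(0, 10, 100),
#                        mu=lambda x: 0.0, k=k_se, noise_var=0.3**2)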

27.3 A word on complexity


The prediction equations (27.7) or (27.8) can be implemented by an algorithm of complexity Θ(k³ +
k gµ(m) + k² gk(m)) where k is the number of observations and gµ(m) and gk(m) are the complexities of
µ and of the kernel k for a single input point of Rm. gµ(m) and gk(m) are typically in Θ(m) so that
the whole complexity is Θ(k³ + k² m). Like other non-parametric methods (KNN, K-means, etc.),
Gaussian processes scale nicely with the dimension m of the input space but they quickly become
limited when the number k of observations grows.

[Figure 27.5, four panels: (a) observation (1, 1), (b) observation (8, −2), (c) observation (4, 2), (d) observation (1.1, −2).]

Figure 27.5: Conditioning a Gaussian process to noisy observations. The initial Gaussian process
is the same as on Fig. 27.4. The observation noise variance has been set to σε² = 0.3².
Chapter 28

Approximate Inference

28.1 Interest of sampling techniques


In the presence of a complex graphical model describing the joint distribution PX of variables X =
(X1 , . . . , Xm), deriving a conditional distribution P X0 | X00 where X0 ∪ X00 ⊂ X and X0 ∩ X00 = ∅,
or even simply a marginalized distribution PX0, is generally a difficult problem. First, the distribution
P X0 | X00 generally has no closed form in the continuous case (the multivariate normal distribution is an
exception). Second, even in the simplest discrete case, the computation is generally intractable
when the number m of variables is large. For this reason, tractable methods based on sampling
techniques are preferred in order to approximate such a distribution, hereafter referred to as
the target distribution.
Sampling consists in drawing independent samples {xi}1≤i≤n from the target distribution
P X0 | X00. In the general case this is not a straightforward problem and the next sections of
this chapter are devoted to introducing elaborate sampling algorithms that solve this general prob-
lem. But first let's briefly explain where and how sampling is useful. Computing a conditional
distribution is useful in Bayesian prediction in general. Bayesian inference requires computing
the distribution P Θ | O of the parameters given the observations. In models containing hidden
variables, the E step of the EM algorithm requires computing the distribution of the hidden vari-
ables conditioned on the observations and the current model parameters. A target distribution PX0
can be estimated from samples {xi}1≤i≤n by computing a histogram (in the discrete case) or by
using more powerful estimation methods (regression, kernel methods, etc.). If the knowledge of
the whole distribution is not required but only some particular value must be computed (like with
MLE and MAP estimators that compute some parameter values θ̂), and this target value
can be rewritten as the expected value of some function f(X0), then according to the law of large
numbers, this expected value can be estimated by averaging the values of many samples:

E( f(X0) ) ≈ (1/n) Σ_{i=1}^n f(xi)

This approximation scheme defines Monte Carlo approaches. All these methods rely on some
sampling algorithm whose purpose is to draw independent samples (xi) from a random variable
X that can be either univariate or multivariate.

28.2 Univariate sampling


28.2.1 Direct sampling
Let’s consider the simplest sampling problem first: X is univariate and the distribution PX is
known. Pseudorandom generators naturally generate samples from the uniform distribution U[0,1] .
These samples can in turn be used to generate samples from any other univariate distribution. In
the discrete case, X can take m values from v1 to vm . Sampling X consists in drawing a sample
u from the uniform distribution U[0,1] and choosing the sample value as the unique value vi such


that:

Σ_{j=1}^{i−1} P(X = vj) ≤ u < Σ_{j=1}^{i} P(X = vj)

In the continuous case X has a probability density function f. Sampling X consists again in drawing
u from U[0,1] and choosing the sample value as the value v satisfying:

∫_{−∞}^{v} f(x) dx = u

Of course simpler and faster procedures exist for specific distributions. For instance sampling the
canonical Gaussian distribution N (0, 1) is done by the Box-Muller method:

1. Draw uθ from U]0,1]

2. Draw ur from U]0,1]


3. Return sample x = √( −2 ln(ur) ) cos(2π uθ)

Samples from N(µ, σ²) can then be generated as σ x + µ where x is drawn from N(0, 1). More
generally sampling a multivariate normal distribution N (µ, Σ) can be done thanks to the following
procedure:

1. Decompose the m × m covariance matrix Σ = CCT using eigenvalue decomposition or


Cholesky decomposition.

2. Build a vector y of size m, whose components are all drawn from N (0, 1).

3. Return sample x = C y + µ
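A minimal numpy sketch of this multivariate sampling procedure (here with a Cholesky decomposition; the mean and covariance values below are arbitrary examples) could be:

import numpy as np

def sample_multivariate_normal(mu, Sigma, n_samples=1):
    """Draw samples from N(mu, Sigma) using the transformation x = C y + mu."""
    C = np.linalg.cholesky(Sigma)             # Sigma = C C^T
    y = np.random.randn(n_samples, len(mu))   # components drawn from N(0, 1)
    return y @ C.T + mu

# Example: 1000 samples from a correlated 2D Gaussian
samples = sample_multivariate_normal(np.array([0., 1.]),
                                     np.array([[2.0, 0.8],
                                               [0.8, 1.0]]), 1000)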

28.2.2 Rejection sampling


In some cases however the univariate distribution PX is not known: only an unnormalized version
of it, denoted p? (X), is accessible, such that PX = p? (X) / Z where Z is a normalizing factor. Because
normalizing requires integration over a large interval, the accurate computation of Z might be too
hard. The problem is even worse in the multivariate case. This situation naturally arises when
applying Bayesian inference:

P ( θ | O ) = P ( O | θ ) P (θ) / P (O)

Indeed the normalizing factor P (O) is likely to be intractable as it requires integrating, over
the whole domain of θ, a product of complex factors, one for each data point. In this case, one can only
compute the product p? (θ) = P ( O | θ ) P (θ) for a finite set of parameter samples θ.
The technique of rejection sampling can be used in such situations. It assumes the existence of
a distribution q(x) and a number M > 0 such that M q is an upper bound of p? :

∃M, ∀x, p? (x) ≤ M q(x)

In this case samples from PX can be drawn thanks to the rejection sampling algorithm 32. This

Algorithm 32 Rejection sampling algorithm


1: Inputs : unnormalized distribution p? (x), distribution q(x), coefficient M
2: repeat
3: Draw x from q(x)
4: Draw u from U[0,1]
5: until u ≤ p? (x) / (M q(x))
6: return x

algorithm is correct as the distribution of the output X is PX :



Proof. Let Y be the boolean random variable that is true if and only if u ≤ p? (x) / (M q(x)). Because one
assumes p? (x) / (M q(x)) ≤ 1, one has:

P ( Y = true | X = x ) = ∫_0^{min(1, p? (x)/(M q(x)))} 1 du
                       = min( 1 , p? (x) / (M q(x)) )
                       = p? (x) / (M q(x))

The distribution of the output X is then

PX (x) = P ( X = x | Y = true )
       = P ( Y = true | X = x ) P (X = x) / P (Y = true)
       = [ p? (x) / (M q(x)) ] q(x) / ∫ P (Y = true, X = x) dx
       = p? (x) / ( M ∫ [ p? (x) / (M q(x)) ] q(x) dx )
       = p? (x) / ∫ p? (x) dx

The factor ∫ p? (x) dx is the normalization factor Z of p?, so that the output distribution is the
normalized distribution of p?.
At every iteration, the probability of rejection is 1 − Z/M < 1 so that, even if the algorithm is
not guaranteed to terminate, the probability of an infinite loop is zero. The average number of
iterations is M/Z. However, for the loop to exit as quickly as possible, it is important to choose q and
M in such a way that for every value x, M q(x) is as close as possible to p? (x) (ideally M q is equal
to p?). For a given distribution q(x), M must ideally be set to:

M = sup_x ( p? (x) / q(x) )

If the shape of q is too different from the shape of p?, M must be chosen large and the average
number of iterations increases.
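As an illustration, here is a minimal sketch of rejection sampling for an unnormalized one-dimensional target with a uniform proposal; the target p?, the interval and the constant M are arbitrary examples chosen so that p? (x) ≤ M q(x):

import numpy as np

def rejection_sample(p_star, q_sample, q_pdf, M, n_samples=1000):
    """Draw samples from the distribution proportional to p_star using proposal q."""
    samples = []
    while len(samples) < n_samples:
        x = q_sample()                           # draw a candidate from q
        u = np.random.rand()                     # draw u from U[0,1]
        if u <= p_star(x) / (M * q_pdf(x)):      # accept with probability p*/(M q)
            samples.append(x)
    return np.array(samples)

# Example: unnormalized bimodal target on [-5, 5] with a uniform proposal.
p_star = lambda x: np.exp(-(x - 2)**2) + 0.5 * np.exp(-(x + 2)**2)
q_pdf = lambda x: 1.0 / 10.0                     # uniform density on [-5, 5]
q_sample = lambda: np.random.uniform(-5, 5)
M = 15.0                                         # ensures p_star(x) <= M q(x) on [-5, 5]
xs = rejection_sample(p_star, q_sample, q_pdf, M)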

28.3 Multivariate sampling


In theory the rejection sampling algorithm also works for a multivariate variable X. But in practice
the rejection rate gets too high, so that rejection sampling gets very slow. The reason is that the
discrepancy between the shapes of q and p? increases with the number of dimensions. In the
PX and that the goal is to sample from the target distribution P X0 | X00 where X0 ∪ X00 ⊂ X and
X0 ∩ X00 = ∅.

28.3.1 Ancestral Sampling


The previous problem is in general difficult. However there are cases where sampling is straightfor-
ward. Let's first assume X00 = ∅. In this case, one can start by sampling all variables without parents
according to their CPT. Then, one can sample the variables whose parents have all already been sam-
pled, thanks to their CPT. This procedure is repeated until all variables have been sampled. Now if
X00 ≠ ∅ but the variables in X00 have no parents, the procedure can still be applied, simply fixing those
variables to their conditioning values. This method is called ancestral sampling for obvious reasons and
is summarized by algorithm 33.

Algorithm 33 Ancestral sampling


1: Inputs : Bayesian network for PX , target variables X0 , conditioned variables X00 without
parents
2: Create empty dictionary V // V maps already sampled variables to their values
3: Create empty list L // L contains sampled variables to process
4: for every variable X in X do
5: if X has no parents then
6: if X ∈ X00 such that X = x then
7: V [X] ← x
8: else
9: V [X] ← v where v is drawn from the CPT of X
10: end if
11: Append X to L
12: end if
13: end for
14: while L is not empty do
15: Remove first element X of L
16: for every child X 0 of X that is not already sampled (i.e not in V ) do
17: if every parent of X 0 is already sampled (i.e. is in V ) then
18: V [X 0 ] ← v where v is drawn from the CPT of X 0
19: Append X 0 to L
20: end if
21: end for
22: end while
23: Create a sample array S
24: for every variable X in V do
25: Append V [X] to S
26: end for
27: return S

28.3.2 Gibbs Sampling


However ancestral sampling fails as soon as some variable of X0 is an ancestor of a variable of X00. In
this case Gibbs sampling can be used. Gibbs sampling is a particular kind of Markov Chain Monte
Carlo method (MCMC). The general principle of MCMC consists in considering a Markov chain
whose equilibrium state (see section 26.2) is the target distribution. MCMC then generates
random walks from this Markov chain by starting from some arbitrary initial state. After a
sufficient number of iterations called the mixing time T, the current state can be considered as
a sample of the equilibrium state, that is, of the target distribution. Because successive states
are highly correlated, only one state per trajectory (or several states, but separated by a sufficient
number of iterations) is used as a sample of the target distribution, so that the different samples
can be considered independent. The advantage of MCMC in general compared to basic Monte Carlo
methods is its very wide application scope, as it can be applied to very complex target distributions.
The drawback is that it requires a lot of computation to generate one single sample.

The specific Markov chain of Gibbs sampling consists, given the current state x = (x1 , . . . , xm) of
the random walk, in choosing randomly a single variable Xi and sampling it from P Xi | {Xj =xj | j≠i},
i.e. conditionally on the values of all remaining variables, which are kept constant. One advantage of
Gibbs sampling is that it naturally works with target distributions that are conditional distributions
P X0 | X00 : the values of conditioned variables (in X00) are never sampled but kept constant, equal to
their conditioning value. Algorithm 34 summarizes the generic approach of Gibbs sampling.

Algorithm 34 Gibbs sampling algorithm


1: Inputs : Bayesian network for PX where X = (X1 , . . . , Xm ), target variables X0 , conditioned
variables X00 = {Xc1 = vc1 , . . . , Xck = vck }, mixing time T
2: Choose some initial state x0 = (x1 , . . . , xm ) such that for every index cj , xcj = vcj
3: for t = 1 . . . T do
4: Given xt−1 = (x1 , . . . , xm ), choose i between 1 and m such that Xi ∉ X00
5: Draw x from P ( Xi | X1 = xt1 , . . . Xi−1 = xi−1 , Xi+1 = xi+1 , . . . , Xm = xm )
6: xt ← (x1 , . . . , xi−1 , x, xi+1 , . . . , xm )
7: end for
8: return xT

This sampling of Xi can be performed relatively easily as it is a univariate sampling of a single


variable X that is drawn according to the Markov blanket of X:

Given a Bayesian Network, the Markov blanket blanket(X) of a variable X
is the union parents(X) ∪ children(X) ∪ ( ∪_{X0 ∈ children(X)} parents(X0) ) of the set
parents(X) of parents of X, the set children(X) of children of X and the set of
parents of the children of X. Because the Markov blanket
blocks every path leaving X, conditioning the distribution P X | blanket(X) with
X0 = x0 for some additional variable X0 ∉ blanket(X) does not modify it:

P X | X\{X} = P X | blanket(X)
            = (1/Z) P X | parents(X) ∏_{X0 ∈ children(X)} P X0 | parents(X0)

where the normalization factor can be computed as:

Z = Σ_x P X | parents(X) (x) ∏_{X0 ∈ children(X)} P X0 | parents(X0)

Another advantage of Gibbs sampling is its relative simplicity of implementation. However the
jumps of the random walk are very limited as they are necessarily collinear with one main axis
(in the case of continuous distributions). When the areas of high probability density form remote,
disconnected “islands”, Gibbs sampling is likely to get stuck on one single island, without the
ability to jump remotely to another one. In such cases, more elaborate sampling techniques are
required.
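To make the coordinate-wise updates concrete, here is a minimal sketch of a Gibbs sampler for a bivariate normal distribution; this toy example (not a Bayesian network) is chosen because its full conditionals are known in closed form, and the correlation rho is an arbitrary placeholder:

import numpy as np

def gibbs_bivariate_normal(rho=0.8, T=500):
    """Gibbs sampling of (X1, X2) ~ N(0, [[1, rho], [rho, 1]])."""
    x1, x2 = 0.0, 0.0   # arbitrary initial state
    for _ in range(T):
        # Full conditionals of a bivariate normal: Xi | Xj = xj ~ N(rho*xj, 1 - rho^2)
        x1 = np.random.normal(rho * x2, np.sqrt(1 - rho**2))
        x2 = np.random.normal(rho * x1, np.sqrt(1 - rho**2))
    return x1, x2   # state after the mixing time, used as one sample

# One (approximately independent) sample per random walk:
samples = np.array([gibbs_bivariate_normal() for _ in range(1000)])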

28.3.3 Markov Chain Monte Carlo and the Metropolis-Hastings algo-


rithm
Given a multivariate variable X of possibly high dimension and a target distribution p(X), how can
one find a general method to construct a homogeneous Markov chain such that its equilibrium
state is p(X)?

Let us denote q( x | x0 ) the transition distribution P Xt | Xt−1 =x0 (x) of such a homogeneous
Markov chain. Then p(X) is a stationary state distribution if and only if:

∀x, p(x) = ∫ q( x | x0 ) p(x0) dx0

This very general condition is called the global balance equation. It can be rewritten
as:

∀x, 1 × p(x) = ∫ q( x | x0 ) p(x0) dx0
⇔ ∀x, ( ∫ q( x0 | x ) dx0 ) × p(x) = ∫ q( x | x0 ) p(x0) dx0
⇔ ∀x, ∫ ( q( x0 | x ) p(x) − q( x | x0 ) p(x0) ) dx0 = 0

From this last expression one sees that the global balance equation admits a specific solution known as the detailed
balance equation:

∀x, ∀x0 , q( x0 | x ) p(x) = q( x | x0 ) p(x0)
For instance, Gibbs sampling introduced in the previous section is an example of Markov chain
satisfying the detailed balance equation:

Proof. Considering a transition x → x0 such that x = (x1 , . . . , xm) and x0 = (x01 , . . . , x0m), either
the transition probabilities q( x0 | x ) and q( x | x0 ) are null or there exists an index i such that for
all j ≠ i, x0j = xj. In the first case the detailed balance equation is obviously true. In the second
case one has:

q( x0 | x ) p(x) = P ( Xi = x0i | ∀j ≠ i, Xj = xj ) P ( ∀j, Xj = xj )
= P ( Xi = x0i | ∀j ≠ i, Xj = xj ) P ( Xi = xi | ∀j ≠ i, Xj = xj ) P ( ∀j ≠ i, Xj = xj )
= P ( Xi = x0i , ∀j ≠ i, Xj = xj ) P ( Xi = xi | ∀j ≠ i, Xj = xj )
= P ( ∀j, Xj = x0j ) P ( Xi = xi | ∀j ≠ i, Xj = x0j )
= q( x | x0 ) p(x0)

Gibbs sampling is a very particular case of MCMC. A general question is how one can build
a transition distribution q satisfying the detailed balance equation. One method is to start from
any transition distribution q̃( x | x0 ) called proposal distribution and to use the rejection principle
in order to “transform” it into a valid distribution q( x | x0 ) satisfying the detailed balance. Let’s
denote a( x | x0 ) the acceptance probability (i.e. the complement to 1 of the rejection probability)
of the transition x0 → x. The detailed balance is satisfied if:

∀x, ∀x0 , a( x0 | x ) q̃( x0 | x ) p(x) = a( x | x0 ) q̃( x | x0 ) p(x0 )



Algorithm 35 Metropolis-Hastings sampling algorithm


1: Inputs : unnormalized target distribution p∗ (x), proposal distribution q̃( x0 | x ), mixing time
T
2: Choose some x0
3: for t = 1 . . . T do
4: Draw x from q̃ ( x | xt−1 )
| x )×p∗ (x)
5: r ← q̃(q̃(x x| t−1
xt−1 )×p∗ (xt−1 )
6: Draw u from U[0,1]
7: if u < r then
8: xt ← x
9: else
10: xt ← xt−1
11: end if
12: end for
13: return xT

This idea is at the origin of the most famous MCMC method called the Metropolis-Hastings
algorithm detailed by pseudocode 35.
The algorithm is characterized by an acceptance probability equal to

 
a( x | x0 ) = min( 1 , [ q̃( x0 | x ) × p(x) ] / [ q̃( x | x0 ) × p(x0) ] )

Proof.

∀x, ∀x0 ,
a( x0 | x ) q̃( x0 | x ) p(x) = min( 1 , [ q̃( x | x0 ) × p(x0) ] / [ q̃( x0 | x ) × p(x) ] ) q̃( x0 | x ) p(x)
                             = min( q̃( x0 | x ) p(x) , q̃( x | x0 ) p(x0) )
                             = min( q̃( x | x0 ) p(x0) , q̃( x0 | x ) p(x) )
                             = min( 1 , [ q̃( x0 | x ) × p(x) ] / [ q̃( x | x0 ) × p(x0) ] ) q̃( x | x0 ) p(x0)
                             = a( x | x0 ) q̃( x | x0 ) p(x0)

One additional advantage of the Metropolis-Hastings algorithm is that it accepts as input an unnor-
malized target distribution p? (x). The reason is that the acceptance factor depends on the ratio
p(x) / p(x0) = p? (x) / p? (x0), which is independent of the normalizing factor.
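For illustration, a minimal sketch of Algorithm 35 for a continuous target, using a Gaussian random-walk proposal (which is symmetric, so the q̃ ratio cancels out); the target p? and the step size are arbitrary placeholders:

import numpy as np

def metropolis_hastings(p_star, x0, T=5000, step=1.0):
    """Random-walk Metropolis-Hastings for an unnormalized 1D target p_star."""
    x = x0
    for _ in range(T):
        x_new = x + step * np.random.randn()     # symmetric proposal q~(x'|x)
        r = p_star(x_new) / p_star(x)            # q~ ratio cancels for symmetric proposals
        if np.random.rand() < r:                 # accept with probability min(1, r)
            x = x_new
    return x    # state after the mixing time

# Example: one sample per random walk from an unnormalized bimodal target
p_star = lambda x: np.exp(-(x - 2)**2) + 0.5 * np.exp(-(x + 2)**2)
samples = np.array([metropolis_hastings(p_star, x0=0.0) for _ in range(200)])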

28.3.4 Importance sampling and particle filter

Compared to the previous methods, importance sampling is a slightly different technique, as it does
not generate samples from the target distribution p(X) but from an importance distribution q(X)
chosen so that sampling from q is much simpler than sampling from p. However, importance
sampling makes it possible to compute any expected value E (f (X))p(X) relative to the target
distribution p(X), or even relative to an unnormalized version p? (X) of the target distribution.

Given an unnormalized target distribution p? (X), a function f : X 7→ Rk , a distri-


bution q(X) that can easily be sampled, importance sampling computes an approxi-
mation Ê (f (X))p(X) of E (f (X))p(X) by:

1. Draw n samples (xi )1≤i≤n from q(X).

2. Compute Ê (f (X)) as a weighted average of (f (xi ), wi )1≤i≤n where the weight wi
is equal to p? (xi) / q(xi):

Ê (f (X)) = [ Σ_{i=1}^n wi f (xi) ] / [ Σ_{i=1}^n wi ]
          = [ Σ_{i=1}^n (p? (xi)/q(xi)) f (xi) ] / [ Σ_{i=1}^n p? (xi)/q(xi) ]

Proof. The proof uses a trick to redefine the expected value relative to p as an expected value
relative to q:

E (f (X))p(X) = ∫ f (x) p(x) dx = ∫ f (x) (p(x)/q(x)) q(x) dx
             = E ( f (x) p(x)/q(x) )q(X)
             = E ( f (x) p? (x) / ( q(x) ∫ p? (x0) dx0 ) )q(X)
             = E ( f (x) p? (x)/q(x) )q(X) / ∫ ( p? (x0)/q(x0) ) q(x0) dx0
             = E ( f (x) p? (x)/q(x) )q(X) / E ( p? (x)/q(x) )q(X)

Applying the law of large numbers to the last expression, by replacing the expected values with the
arithmetic averages over n samples of q, one gets the desired result.
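A minimal numpy sketch of this self-normalized importance sampling estimator (with an arbitrary Gaussian importance distribution and the same toy bimodal target as above) could be:

import numpy as np

def importance_sampling(f, p_star, q_sample, q_pdf, n=10000):
    """Estimate E_p[f(X)] for the unnormalized target p_star using importance distribution q."""
    x = q_sample(n)                       # draw n samples from q
    w = p_star(x) / q_pdf(x)              # unnormalized importance weights
    return np.sum(w * f(x)) / np.sum(w)   # self-normalized weighted average

# Example: mean of the bimodal target, with a broad Gaussian as q
p_star = lambda x: np.exp(-(x - 2)**2) + 0.5 * np.exp(-(x + 2)**2)
q_pdf = lambda x: np.exp(-x**2 / (2 * 3.0**2)) / np.sqrt(2 * np.pi * 3.0**2)
q_sample = lambda n: 3.0 * np.random.randn(n)
mean_estimate = importance_sampling(lambda x: x, p_star, q_sample, q_pdf)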

Resampling
Importance sampling works in theory for an infinite number of samples. In practice, things do not
work so well when q doesn’t match p. In this case most samples xi will have a very low weight (as
they are very likely to be sampled from q but not from p). Only very few samples will be likely to
be sampled from p and will get relatively high weight. This imbalance in weight distribution will
lead to imprecise results compared to the computation cost (as many samples with very low weight
are useless). The procedure of resampling helps to redistribute weights homogeneously across samples.
Algorithm 36 describes the procedure. Note that one sample can be duplicated several times

Algorithm 36 Resampling
1: Inputs : set of weighted samples (xi , wi )1≤i≤n
2: Let p be the categorical distribution over i ∈ {1, . . . , n} with p(i) = wi / Σ_{j=1}^n wj
3: Output multiset S ← ∅ of weighted samples
4: for i = 1 . . . n do
5: Draw an index j from p
6: Add sample xj to S with weight w = 1
7: end for
8: return S

in the output multiset1 . Resampling is more likely to select samples of heavy weight. However in the
end every “surviving” sample has the same weight.
1A multiset is a set where an element can be duplicated an arbitrary number of times.

Particle filtering
Particle filtering is an adaptation of importance sampling to deal with temporal trajectories, i.e.
where a sample xi is a temporal sequence (x_i^0 , . . . , x_i^t , . . . ) of state points x_i^t ∈ Rk. For this reason
particle filtering is also called Sequential Importance Resampling (SIR). Particle filters are a powerful
tool used in place of Kalman filters to solve hard Bayesian filtering problems, where the con-
sidered dynamic models are non-linear and/or weakly Markovian (i.e. Markov models with a high
order). Particle filtering dynamically updates trajectory samples at each time step, by analogy
with a beam of independent particles drawing trajectories in a state space. One advantage of
particle filtering is that it can work with an unnormalized dynamic model, i.e. an unnormalized tran-
sition distribution p? ( Xt+1 | X0 , . . . , Xt ). Particle filtering also takes as input some importance
distribution, defined as a transition distribution q( Xt+1 | X0 , . . . , Xt ) as well. However q is meant
to be much simpler to sample than p?.

At the current time t, particle filtering updates every trajectory (also called particle). When pass-
ing from time t to time t + 1, the method completes every trajectory xi = (x_i^0 , . . . , x_i^t ) with a new
sample x_i^{t+1} drawn from q( Xt+1 | X0 = x_i^0 , . . . , Xt = x_i^t ). Trajectories thus follow the q distri-
bution. However the weight wi of every trajectory is updated accordingly, so that the weighted
average of a function f (x) over all particles xi provides a valid approximation of the expected value
E(f (X0 , . . . , Xt+1 )) relative to the distribution p?. If w_i^t is the value of the weight of trajectory xi
at time t, the weight update is simply:

w_i^{t+1} = w_i^t × α_i^{t+1}   with   α_i^{t+1} = p? ( Xt+1 | X0 , . . . , Xt ) / q( Xt+1 | X0 , . . . , Xt )

since, according to importance sampling, one has:

w_i^{t+1} = p? (X0 , . . . , Xt+1 ) / q(X0 , . . . , Xt+1 )
          = [ p? (X0 , . . . , Xt ) / q(X0 , . . . , Xt ) ] × [ p? ( Xt+1 | X0 , . . . , Xt ) / q( Xt+1 | X0 , . . . , Xt ) ]
          = w_i^t × α_i^{t+1}
= wit × αit+1

The weights of the trajectories are thus updated incrementally. When many trajectories get low
weights, a resampling is done to eliminate the trajectories with low weights and to duplicate the trajectories
with heavy weights. The randomness of the transitions then splits these duplicates to explore different
paths. The steps are listed in algorithm 37.
One remaining question is the choice of the importance distribution. One requirement is that
sampling the importance distribution must be an easy task. On the other hand, the importance
distribution must be as close as possible to the target distribution, in order to avoid resampling.
The optimal choice for q depends on the nature of the problem. However in many problems, the
state variable is hidden and partially observable through a visible variable V . Let’s assume the
model is Markovian of order 1 (but this is for simplification and it is not a strong requirement).
In this case the target transition distribution can be decomposed as the product of two factors:

p? ( Xt = x_i^t | Xt−1 = x_i^{t−1} ) = p?1 ( Xt = x_i^t | Xt−1 = x_i^{t−1} ) × p?2 ( Vt = vt | Xt = x_i^t )

In many practical models, the transition distribution p?1 is often a normalized distribution p1 that is easy to
sample (e.g. a multivariate Gaussian distribution), whereas the emission distribution p?2 is complex
and fundamentally unnormalized (since the variable to sample is not the fixed observation vt but
the state x_i^t). For this reason, one chooses the importance distribution q as p1. The particle weights
are then equal to the unnormalized emission probabilities:
are then equal to the unormalized emission probabilities:
p? ( Xt | Xt−1 )
αit =
q( Xt | Xt−1 )
p?1 ( Xt | Xt−1 ) × p?2 ( Vt = vt | Xt )
=
p?1 ( Xt | Xt−1 )
= p?2 ( Vt = vt | Xt )
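As an illustration, here is a minimal sketch of this bootstrap choice (q = p1) for a one-dimensional random-walk state with a Gaussian observation model; the dynamics, the observation noise and the resampling threshold are arbitrary placeholders, and a common effective-sample-size test is used in place of the weight standard deviation criterion of Algorithm 37:

import numpy as np

def bootstrap_particle_filter(observations, n_particles=1000,
                              dyn_std=1.0, obs_std=0.5):
    """Bootstrap particle filter: q = p1 (the dynamics), weights = emission likelihood."""
    particles = np.zeros(n_particles)               # initial states
    weights = np.ones(n_particles) / n_particles
    means = []
    for v in observations:
        # Propagate each particle through the (here Gaussian random-walk) dynamics p1.
        particles = particles + dyn_std * np.random.randn(n_particles)
        # Reweight by the emission probability p2(Vt = v | Xt).
        weights *= np.exp(-0.5 * ((v - particles) / obs_std) ** 2)
        weights /= weights.sum()
        means.append(np.sum(weights * particles))   # filtered state estimate
        # Resample when the weights become too unbalanced.
        if 1.0 / np.sum(weights ** 2) < n_particles / 2:
            idx = np.random.choice(n_particles, n_particles, p=weights)
            particles = particles[idx]
            weights = np.ones(n_particles) / n_particles
    return np.array(means)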

Algorithm 37 Particle filtering


1: Inputs : target distribution p? , importance distribution q, number n of particles, maximal
tolerated standard deviation σmax of weights
2: for i = 1 . . . n do
3: Draw x0i from q(X0 )
4: wi ← p? (x0i ) / q(x0i )
5: end for
6: for t = 1 . . . T do
7: for i = 1 . . . n do
8: Draw xti from q( Xt | X0 = x0i , . . . , Xt−1 = xt−1i )
9: wi ← wi × p? ( Xt = xti | X0 = x0i , . . . , Xt−1 = xt−1i ) / q( Xt = xti | X0 = x0i , . . . , Xt−1 = xt−1i )
10: end for
11: Normalize weights wi so that Σi wi = 1
12: Compute the standard deviation of the weights: σ ← √( Σi wi² − 1/n )
13: if σ > σmax then
14: Do resampling
15: end if
16: end for
17: return weighted particles (xti , wi )1≤i≤n,0≤t≤T
Part VIII

Sequential Decision Making

Chapter 29

Bandits

A (multi-armed) bandit problem is the most basic example of a sequential decision problem with
a trade-off between exploration and exploitation. A gambler (or player, or forecaster) is facing a
number of options (or actions). At each time step, the player chooses an option and receives a
reward (or a payoff). The goal is to maximize the total sum of rewards obtained in a sequence
of allocations. A tradeoff between exploration and exploitation arises: the player must balance
the exploitation of actions that did well in the past and the exploration of actions that could
give higher reward in the future. The name “bandit” comes from the American slang “one-armed
bandit” that refers to a slot machine: the gambler is facing many slot machines at once in a casino
and must repeatedly choose where to insert the next coin.
Bandits have numerous applications. They have first been introduced by Thompson (1933)
for studying clinical trials (different treatments are available for a given disease, one must choose
which treatment to use on the next patient). Nowadays, they are widely used in online services
(for adapting the service to the user’s individual sequence of requests). For example, they can
be used for ad placement (determining which advertisement to display on a web page, see for
example Chapelle and Li (2011)). They can also be used in cognitive radio for opportunistic
spectrum access (Jouini et al., 2012). They can also be used less directly. For example, they
are at the core of the MoGo program (Gelly et al., 2006) that plays Go at world-class level (see
also Munos (2014)).
The rest of this chapter is organized as follows. In Sec. 29.1, we formalize the stochastic
bandit problem. In Sec. 29.2, we explain the idea of “optimism in the face of uncertainty” and
introduce concentration inequalities. In Sec. 29.3 we present the classical UCB (Upper Confidence
Bound) strategy and prove its effectiveness. In Sec. 29.4, we discuss briefly other kinds of bandits
and problems. The material presented in this chapter is largely inspired from the monograph
of Bubeck and Cesa-Bianchi (2012).

29.1 The stochastic bandit problem


The basic stochastic multi-armed bandit problem is formalized as follows. Each arm (or option,
or action) i ∈ {1, . . . , K} is associated to an unknown probability measure νi . In the sequel, we
assume that the rewards are bounded (in [0, 1], without loss of generality, that is νi ([0, 1]) = 1).
At each time step t = 1, 2, . . . , the player chooses an arm It ∈ {1, . . . , K} (based on past choices
and observations) and receives a reward XIt ,t drawn from νIt , independently from the past. We
write µi = E [Xi,t ] = ∫ x dνi (x) the expectation of the ith arm and define

µ∗ = max_{1≤i≤K} µi   and   i∗ ∈ argmax_{1≤i≤K} µi

the highest expectation and the corresponding arm (which is not necessarily unique).
The ideal (but unreachable) strategy would consist in choosing systematically It = i∗ . There-
fore, the quality of a strategy can be measured with the regret, defined as the cumulative difference


(in expectation) between the optimal arm and chosen arms1 after n rounds:
" n #
X
Rn = nµ∗ − E µIt . (29.1)
t=1

The better the strategy, the lower the regret. So, we should design the sequential decisions so as
to minimize this quantity.

Next, we formulate this regret differently. Write

Ti (s) = Σ_{t=1}^s 1{It =i}

the number of times the player selected arm i during the first s rounds and

∆i = µ∗ − µi

the suboptimality of arm i. Obviously, we have that


Σ_{i=1}^K Ti (n) = n.

On the other hand, one can easily check that


Σ_{t=1}^n µIt = Σ_{i=1}^K µi Ti (n).

Therefore, the regret can be rewritten as follows:


" n #
X
Rn = nµ∗ − E µIt
t=1
" K
#! " K
#
X X
= E Ti (n) µ∗ − E µi Ti (n)
i=1 i=1
K
X
= ∆i E [Ti (n)]. (29.2)
i=1

Therefore, a good strategy should control E [Ti (n)] for i ≠ i∗ , the (expected) number of times a
suboptimal arm is played.

29.2 Optimism in the face of uncertainty


At time step t, the player has gathered a number of observations (a reward each time an arm has
been pulled). From this, he can estimate the expectation of each arm (by computing the empirical
mean). A strategy could be to act greedily with respect to these estimated means (select the arm
with the highest empirical mean). However, this would be a bad strategy. Assume for example
that rewards are drawn according to Bernoulli distributions (that is, the reward is either 0 or 1).
Then this (pure exploitation) strategy would lead to always selecting the first arm that provided a
reward (which is not necessarily the optimal arm, obviously). One should add some exploration
to this strategy. For example, let 0 < ε < 1 be a user-defined value. A possible strategy (called
ε-greedy strategy) is to act greedily with respect to the estimated expectations with probability 1 − ε,
and to choose an arm at random with probability ε.
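For illustration, here is a minimal sketch of the ε-greedy strategy on Bernoulli arms; the arm means, the horizon and ε are arbitrary placeholders:

import numpy as np

def epsilon_greedy(arm_means, n_rounds=10000, eps=0.1):
    """epsilon-greedy bandit strategy on Bernoulli arms; returns the total reward."""
    K = len(arm_means)
    counts = np.zeros(K)       # Ti(t): number of pulls of each arm
    sums = np.zeros(K)         # cumulated reward of each arm
    total = 0.0
    for t in range(n_rounds):
        if np.random.rand() < eps or counts.min() == 0:
            i = np.random.randint(K)                       # explore (also until every arm is pulled once)
        else:
            i = np.argmax(sums / counts)                   # exploit: highest empirical mean
        reward = float(np.random.rand() < arm_means[i])    # Bernoulli reward
        counts[i] += 1
        sums[i] += reward
        total += reward
    return total

total_reward = epsilon_greedy([0.3, 0.5, 0.7])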
Now, assume that we are able to construct a high probability confidence interval for each arm
(with high probability, the true expectation µi of arm i is in a given interval), as illustrated in
Fig. 29.1. In this example, we can say (still with high probability) that arm 2 is better than arm
1 Notice that It is a random quantity, as it depends on past observed rewards.

Figure 29.1: Optimism in the face of uncertainty.

K (as the lower bound of the interval of arm 2 is higher than the upper bound of arm K), but it
is more difficult to tell which of arms 2 and 3 is better. Optimism in the face of uncertainty
consists in acting greedily with respect to the most “favorable” case, here acting greedily with respect
to the upper bound of each arm's confidence interval. In Fig. 29.1, optimism in the face of
uncertainty consists in choosing arm 3. Computing these confidence intervals will be done next
through the use of a concentration inequality. The related strategy and the analysis of its regret
are provided in Sec. 29.3.
Let X1 , . . . , Xn be i.i.d. (independent and identically distributed) random variables. Write
µ = E [X1 ] their common expectation and µn the related empirical mean:
µ_n = (1/n) Σ_{i=1}^{n} X_i.

Typically, these random variables are the rewards obtained for pulling a given arm n times. The
question we would like to answer is: how close is µn to µ (with some probability)? We first give a
general answer to this question before instantiating it to the case of bounded random variables.

Theorem 29.1 (Hoeffding’s inequality (Hoeffding, 1963; Bubeck and Cesa-Bianchi, 2012)). Assume
that there exists a convex function ψ : R+ → R+ such that

∀λ ≥ 0,   ln E[e^{λ(X_1−µ)}] ≤ ψ(λ)   and   ln E[e^{λ(µ−X_1)}] ≤ ψ(λ).

Define the Legendre-Fenchel transform of ψ as

ψ∗(ε) = sup_{λ≥0} (λε − ψ(λ)).

Then:

P(µ_n − µ ≥ ε) ≤ e^{−nψ∗(ε)}   and   P(µ − µ_n ≥ ε) ≤ e^{−nψ∗(ε)}.

Before proving this result (called a concentration inequality, as it states how the empirical mean
concentrates around the expectation), we give some intuitions about its meaning. The moment
condition (the assumption about the existence of the ψ function) provides information about the
tail of the distribution, notably how it concentrates around its mean. For example, if X ∼ N(µ, σ²)
(X is Gaussian with mean µ and variance σ²), it is a standard result of probability theory that
for any λ ∈ R, we have ln E[exp(λ(X − µ))] = λ²σ²/2. We will see later that we have a similar result
for bounded random variables. The Legendre-Fenchel transform is a standard tool in convex
optimization (how it is introduced will be clear in the proof). For example, for ψ(λ) ∝ λ², we have
ψ∗(ε) ∝ ε². Next, we explain why this result is indeed a confidence interval. A direct corollary of

this theorem is

P(|µ_n − µ| ≥ ε) = P({µ_n − µ ≥ ε} ∪ {µ − µ_n ≥ ε})
              ≤ P(µ_n − µ ≥ ε) + P(µ − µ_n ≥ ε)   (union bound)
              ≤ 2e^{−nψ∗(ε)}   (by Th. 29.1).   (29.3)

Write δ this upper bound on the probability:

δ = 2e^{−nψ∗(ε)}   ⇔   ε = ψ∗^{-1}( (1/n) ln(2/δ) ).

Eq. (29.3) can thus equivalently be rewritten as

P( |µ_n − µ| ≥ ψ∗^{-1}( (1/n) ln(2/δ) ) ) ≤ δ   ⇔   P( |µ_n − µ| ≤ ψ∗^{-1}( (1/n) ln(2/δ) ) ) ≥ 1 − δ.

Alternatively, we can say that with probability at least 1 − δ, we have

|µ_n − µ| ≤ ψ∗^{-1}( (1/n) ln(2/δ) ).

This is called a PAC (Probably Approximately Correct) result. It is indeed a confidence
interval, as it says that with probability at least 1 − δ we have

µ ∈ [ µ_n − ψ∗^{-1}( (1/n) ln(2/δ) ), µ_n + ψ∗^{-1}( (1/n) ln(2/δ) ) ].

This is exactly the kind of result we were looking for. Next, we prove the theorem.

Proof of Th. 29.1. The proof is based on what is called a Chernoff argument. Let λ > 0, we have

P(µ_n − µ ≥ ε) = P( Σ_{i=1}^{n} (X_i − µ) ≥ nε )
            = P( e^{λ Σ_{i=1}^{n} (X_i − µ)} ≥ e^{λnε} ).

Recall that Markov’s inequality states2 that if Y is a positive random variable and c a positive
constant we have

P(Y ≥ c) ≤ E[Y] / c.

Therefore, we have

P( e^{λ Σ_{i=1}^{n} (X_i − µ)} ≥ e^{λnε} ) ≤ e^{−nλε} E[ e^{λ Σ_{i=1}^{n} (X_i − µ)} ]   (by Markov)
                                    = e^{−λnε} Π_{i=1}^{n} E[ e^{λ(X_i − µ)} ]   (by independence)
                                    = e^{−λnε} ( E[ e^{λ(X_1 − µ)} ] )^n   (r.v. are i.d.)
                                    = e^{−n(λε − ln E[exp(λ(X_1 − µ))])}
                                    ≤ e^{−n(λε − ψ(λ))}.

2 Refer to any basic course on probability. We give the proof for completeness. We have

E[Y] = ∫ Y dP = ∫_{Y<c} Y dP + ∫_{Y≥c} Y dP ≥ ∫_{Y≥c} Y dP ≥ c ∫_{Y≥c} dP = c P(Y ≥ c),

from which Markov’s inequality follows.



This being true for any λ > 0, it also holds for the infimum, thus

P(µ_n − µ ≥ ε) ≤ inf_{λ>0} e^{−n(λε − ψ(λ))}
            = e^{−n sup_{λ>0} (λε − ψ(λ))}
            = e^{−nψ∗(ε)}.

This shows the first inequality, the proof of the second one being the same.
When introducing the stochastic bandit problem in Sec. 29.1, we have assumed that the rewards
are bounded (this is not mandatory, but usual). In this case, the bound can be instantiated, thanks
to the following lemma due to Hoeffding (1963), which specifies ψ in this case.
Lemma 29.1 (Hoeffding (1963)). Let Y be a random variable such that E[Y] = 0 and c ≤ Y ≤ d
almost surely3. Then, for any s ≥ 0, we have

E[e^{sY}] ≤ e^{s²(d−c)²/8}.

Proof. Let s > 0. The function x ↦ e^{sx} is convex, thus e^{s(tx+(1−t)y)} ≤ t e^{sx} + (1 − t) e^{sy}. Notice
also that for any x

x = ((d − x)/(d − c)) c + ((x − c)/(d − c)) d.

Therefore, we have for the r.v. Y:

e^{sY} ≤ ((d − Y)/(d − c)) e^{sc} + ((Y − c)/(d − c)) e^{sd}.

Taking the expectation (recall that E[Y] = 0):

E[e^{sY}] ≤ (d/(d − c)) e^{sc} + (−c/(d − c)) e^{sd}
        = e^{sc} ( d/(d − c) + (−c/(d − c)) e^{s(d−c)} ).

Define p = −c/(d − c) > 0 (which implies that d/(d − c) = 1 − p) and u = s(d − c). Therefore, sc = −pu and
we can write

E[e^{sY}] ≤ e^{−pu} (1 − p + p e^u) = e^{ϕ(u)}   with   ϕ(u) = −pu + ln(1 − p + p e^u).

We will bound ϕ(u). We have that ϕ(0) = 0 and ϕ'(0) = 0. The second derivative is

ϕ''(u) = p e^u (1 − p) / (1 − p + p e^u)² ≤ 1/4.

For this last statement, note that by writing t = p e^u / (1 − p + p e^u) > 0 we have ϕ''(u) = t(1 − t), which is
obviously bounded by 1/4. From the Taylor-Lagrange formula, there exists a ξ such that

ϕ(u) = ϕ(0) + ϕ'(0) u + ϕ''(ξ) u²/2 ≤ u²/8 = s²(d − c)²/8,

which proves the result.
From this, we have a direct corollary of Th. 29.1.

Corollary 29.1 (Hoeffding (1963)). Assume that 0 ≤ X_1 ≤ 1 almost surely. Then we have

P(µ_n − µ ≥ ε) ≤ e^{−2nε²}   and   P(µ − µ_n ≥ ε) ≤ e^{−2nε²}.

3 Obviously, c ≤ 0 ≤ d.

Proof. We have that

0 ≤ X_1 ≤ 1 ⇔ −µ ≤ X_1 − µ ≤ 1 − µ.

Applying Lemma 29.1 with Y = X_1 − µ, c = −µ and d = 1 − µ, we obtain ψ(λ) = λ²/8 (the same
holds for µ − X_1). To obtain the Legendre-Fenchel transform, we set the derivative of λε − λ²/8 to
zero, which gives λ = 4ε, thus ψ∗(ε) = 2ε². Applying Th. 29.1 allows concluding.

We will next apply these results to derive a strategy for the bandit problem.

29.3 The UCB strategy


Recall that optimism in the face of uncertainty consists in acting greedily with respect to the
upper bound of a confidence interval. The UCB (Upper Confidence Bound) strategy applies this
principle. Write µ_{i,s} the empirical mean of the rewards obtained by pulling arm i for s times:

µ_{i,s} = (1/s) Σ_{t=1}^{s} X_{i,t}.

As we are looking for an upper bound on µ_i, we have from Th. 29.1 that with probability at least
1 − δ,

µ_i < µ_{i,s} + ψ∗^{-1}( (1/s) ln(1/δ) ).

With the choice δ = 1/t^α, where α > 0 is an input parameter (and with s = T_i(t−1), the number of
times arm i has been played before round t), we obtain the so-called (α, ψ)-UCB strategy of Bubeck
and Cesa-Bianchi (2012):

I_t ∈ argmax_{1≤i≤K} ( µ_{i,T_i(t−1)} + ψ∗^{-1}( α ln t / T_i(t−1) ) ).

If the rewards are bounded in [0, 1], we obtain the original UCB (Upper Confidence Bound) strategy
of Auer et al. (2002) (using the results in the proof of Cor. 29.1, that is ψ∗(ε) = 2ε² ⇔ ψ∗^{-1}(u) = √(u/2)):

I_t ∈ argmax_{1≤i≤K} ( µ_{i,T_i(t−1)} + √( α ln t / (2 T_i(t−1)) ) ).
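As an illustration, a minimal simulation sketch of this UCB strategy for Bernoulli rewards in [0, 1] could look as follows (the arm expectations, α and the horizon are arbitrary choices made for this example):

import numpy as np

rng = np.random.default_rng(1)
means = np.array([0.3, 0.5, 0.7])   # hypothetical Bernoulli arm expectations
K, n, alpha = len(means), 10000, 2.1

counts = np.zeros(K)                # T_i(t - 1)
sums = np.zeros(K)                  # cumulated rewards of each arm

for t in range(1, n + 1):
    if t <= K:
        i = t - 1                   # pull each arm once to initialize the empirical means
    else:
        ucb = sums / counts + np.sqrt(alpha * np.log(t) / (2.0 * counts))
        i = int(np.argmax(ucb))     # optimism: act greedily w.r.t. the upper confidence bounds
    x = float(rng.random() < means[i])
    counts[i] += 1
    sums[i] += x

pseudo_regret = n * means.max() - (counts * means).sum()   # cf. Eq. (29.2)
print(counts, pseudo_regret)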

Therefore, we end up with a simple strategy that only requires updating empirical means as
arms are pulled, and acting greedily with respect to the above quantity (which is the empirical mean
plus a kind of exploration bonus). An important question is to know what regret is suffered by these strategies.
The answer is given by the next theorem.

Theorem 29.2 (Auer et al. (2002); Bubeck and Cesa-Bianchi (2012)). Assume that the reward
distributions satisfy the assumption of Th. 29.1. Then the (α, ψ)-UCB strategy with α > 2 satisfies

R_n ≤ Σ_{i: ∆_i>0} ( (α ∆_i / ψ∗(∆_i/2)) ln n + (α/(α−2)) ∆_i ).

If rewards are bounded in [0, 1], the bound on the regret simplifies as (using the fact that
ψ∗(ε) = 2ε²):

R_n ≤ Σ_{i: ∆_i>0} ( (2α/∆_i) ln n + (α/(α−2)) ∆_i ).

This tells us that each suboptimal arm is chosen no more than a logarithmic number of times (the ln n factor),
and that arms close to the optimal one are chosen more often (the 1/∆_i factor). We prove the result now.

Proof of Th. 29.2. Assume without loss of generality that I_t = i ≠ i∗. Then, at least one of the
three following equations must be true:

µ_{i∗,T_{i∗}(t−1)} + ψ∗^{-1}( α ln t / T_{i∗}(t−1) ) ≤ µ∗   (29.4)

µ_{i,T_i(t−1)} > µ_i + ψ∗^{-1}( α ln t / T_i(t−1) )   (29.5)

T_i(t−1) < α ln n / ψ∗(∆_i/2)   (29.6)

Eq. (29.4) states that the upper bound for the optimal arm is below the associated mean, Eq. (29.5)
states that the lower bound for the considered arm is above the associated mean and Eq. (29.6)
means that arm i has not been pulled enough. If the three equations were false, we would have

µ_{i∗,T_{i∗}(t−1)} + ψ∗^{-1}( α ln t / T_{i∗}(t−1) ) > µ∗   (by (29.4) false)
                                           = µ_i + ∆_i   (by def. of ∆_i)
                                           ≥ µ_i + 2 ψ∗^{-1}( α ln t / T_i(t−1) )   (by (29.6) false)
                                           ≥ µ_{i,T_i(t−1)} + ψ∗^{-1}( α ln t / T_i(t−1) )   (by (29.5) false),

which implies that I_t ≠ i, which is a contradiction.
Recall that for controlling the regret it is enough to control the (expected) number of times
each arm has been played (Eq. (29.1) being equivalent to Eq. (29.2)). Define u as

u = ⌈ α ln n / ψ∗(∆_i/2) ⌉.

We have that

E[T_i(n)] = E[ Σ_{t=1}^{n} 1{I_t=i} ]
         ≤ u + E[ Σ_{t=u+1}^{n} 1{I_t=i and (29.6) is false} ]
         ≤ u + E[ Σ_{t=u+1}^{n} 1{(29.4) or (29.5) is true} ]
         ≤ u + Σ_{t=u+1}^{n} ( P((29.4) is true) + P((29.5) is true) ).

We can bound the probability of event (29.4) as follows:

P((29.4) is true) ≤ P( ∃s ∈ {1, . . . , t} : µ_{i∗,s} + ψ∗^{-1}( α ln t / s ) ≤ µ∗ )
              ≤ Σ_{s=1}^{t} P( µ_{i∗,s} + ψ∗^{-1}( α ln t / s ) ≤ µ∗ )
              ≤ Σ_{s=1}^{t} 1/t^α = 1/t^{α−1},

where we used a union bound and Hoeffding with δ = t^{−α}. The same bound holds for the
probability of event (29.5):

P((29.5) is true) ≤ 1/t^{α−1}.

Combining these results, we have

E[T_i(n)] ≤ u + 2 Σ_{t=u+1}^{n} 1/t^{α−1}
         ≤ α ln n / ψ∗(∆_i/2) + 1 + 2 ∫_{1}^{∞} t^{−(α−1)} dt
         = α ln n / ψ∗(∆_i/2) + α/(α−2).

Injecting this result in Eq. (29.2) concludes the proof.


Some strategies allow achieving a lower regret (and even optimal ones, in the sense that lower
bounds on the regret—for any possible strategy—can be proven), at the price of more complicated
analysis. Yet, UCB is a popular strategy, simple to apply and quite effective in practice. See Bubeck
and Cesa-Bianchi (2012, Ch. 2) for more details.

29.4 More on bandits


There are other kinds of problems for stochastic bandits. For example, the best arm identification
problem consists in identifying the best arm given a fixed budget or a fixed confidence (e.g.,
see Gabillon et al. (2012)), not paying much attention to the sum of rewards obtained when
pulling arms (for example, a company may want to identify the best product among K variants
before actually placing it on the market). The stochastic bandit can also be addressed from a
Bayesian viewpoint (Thompson, 1933; Korda et al., 2013).
There are also other kinds of bandits. In adversarial bandits, the rewards do not follow a
distribution, they are given by an opponent (or adversary). In contextual bandits, side information
is associated with each arm. In linear bandits, there is some structure in the reward function (which
allows handling a possibly infinite number of arms). In Markovian bandits, each arm is associated
with a Markov process and pulling an arm causes a stochastic transition of the underlying chain.
There are (many) other kinds of bandits. See Bubeck and Cesa-Bianchi (2012) and references
therein for a deeper introduction to the subject.
Chapter 30

Reinforcement learning

Reinforcement learning (RL) can be broadly seen as the machine learning answer to the control
problem. In this paradigm, an agent interacts with the world by taking actions and observing
the resulting configuration of the world (in a sequential manner). This agent receives (numerical)
rewards (given by some oracle) that are a local information about the quality of the control. The
aim of this agent is to learn the sequence of actions that maximizes some notion of cumulative
reward. This chapter provides an introduction to the field of reinforcement learning; more can be
found in reference textbooks (Sutton and Barto, 1998; Bertsekas and Tsitsiklis, 1996; Szepesvári,
2010; Sigaud and Buffet, 2013). This field is inspired by behavioral psychology (this explains part
of the vocabulary, such as the notion of reward) and has connections to computational neuroscience,
yet this chapter focuses on the mathematical and learning aspects.

30.1 Introduction
In reinforcement learning, an agent interacts with a system (sometimes called the environment,
or the world), as exemplified in Fig. 30.1. At each discrete time step, the system is in a given state
(or configuration) that can be observed by the agent. Based on this state, the agent applies some
action. Following this action, the system transits to a new state and the agent receives a numerical
reward from an oracle, this reward being a local clue of the quality of the control. The goal of
the agent is to take sequential decisions so as to maximize some notion of cumulative reward,
typically the sum of rewards gathered along the followed path. An important thing to understand
is that the rewards quantify the goal of the control, and not how this goal should be reached (this
is what the agent has to learn).
To clarify this paradigm, consider the simple example of a robot in a maze. The goal of the

Figure 30.1: The perception-action cycle in reinforcement learning.


robot is to find the shortest path to the exit. The state of the system is the current position of
the robot in the maze. Four actions are available, one for each direction. Choosing such an action
amounts to moving in the required direction. In this problem, the reward can be −1 for any move,
except for the move that leads to the exit which is rewarded by 0. Notice that the only informative
reward is given for going through the exit. Here, for any path leading to the exit, the sum of
rewards is the negative of the length of the path. Therefore, maximizing the sum of rewards is
equivalent to finding the shortest path to the exit.
A first issue is to formalize mathematically such a control problem. In reinforcement learning,
this is widely done thanks to Markov Decision Processes (MDPs), to be presented in Sec. 30.2. A
second issue is to compute the best possible control when the model (the MDP) is known, which
is addressed by Dynamic Programming (DP), to be presented in Sec. 30.3. Consider again the
maze problem. A smart strategy consists in starting from the exit, and then retro-propagating
the possible paths until reaching the starting point. This is roughly what DP does. Puterman
(1994) and Bertsekas (1995) provide reference textbooks on MDPs and DP. A third problem is
to estimate this optimal strategy from data (interaction data between the agent and the system),
when the model is unknown (this is reinforcement learning). This is addressed in Sec. 30.4-30.6.

30.2 Formalism
In the sequel, we write ∆_X the set of probability measures over a discrete set X and Y^X the set of
mappings from X to Y.

30.2.1 Markov Decision Process


A Markov Decision Process (MDP) is a tuple {S, A, P, r, γ} where:

• S is the (finite) state space1;

• A is the (finite) action space2;

• P ∈ ∆_S^{S×A} is the Markovian transition kernel. The term P(s'|s, a) denotes the probability
of transiting to state s' given that action a was chosen in state s. The transition kernel is
Markovian because the probability of going to s' depends on the fact that action a was chosen
in state s, but it does not depend on the path followed to reach this state s. This assumption
is at the core of everything presented here3;

• r ∈ R^{S×A} is the reward function4; it associates the reward r(s, a) to taking action a in state
s. The reward function is assumed to be uniformly bounded;

• γ ∈ (0, 1) is a discount factor that favors shorter term rewards (see the definition of the value
function, later). The closer γ is to 1, the more importance we give to far (in time) rewards.
Usually, this parameter is set to a value close to 1.

So, the system is in state s ∈ S, the agent chooses an action a ∈ A and gets the reward r(s, a),
then the system transits stochastically to a new state s', this new state being drawn from the
conditional probability P(.|s, a).
1 We will consider larger (countable or even infinite compact) state spaces later.
2 It is quite difficult to consider large action spaces, but see Sec. 30.6.
3 Consider the maze example of Sec. 30.1. If the robot knows its position, the dynamics are indeed Markovian.
If the agent has only a partial observation (for example, it knows whether there are walls around it, but no more), the
dynamics are no longer Markovian. This is known as partially observable MDPs, see for example Kaelbling et al.
(1998). This topic will not be covered in this chapter, but note that the general strategy consists in computing
something which is Markovian.
4 One can define more generally the reward function as r ∈ R^{S×A×S}, that is, giving a reward r(s, a, s') for
each transition from s to s' under action a. However, one can define an expected reward function as r̄(s, a) =
Σ_{s'∈S} P(s'|s, a) r(s, a, s'), the mean reward for choosing action a in state s. As only this mean reward will be of
importance in the following, we keep the notations simple.

30.2.2 Policy and value function


The strategy of the agent (the way it chooses actions) is called a policy5 and is noted π ∈ A^S:
in state s, an agent applying policy π chooses the action π(s). The problem is to find the best
policy, but this requires quantifying the quality of a policy. This is done thanks to the associated
value function v_π ∈ R^S, which associates to each state s the expected (transitions being stochastic)
and discounted (by γ) cumulative reward that is obtained by following policy π from state s:

v_π(s) = E[ Σ_{t=0}^{∞} γ^t r(S_t, π(S_t)) | S_0 = s, S_{t+1} ∼ P(.|S_t, π(S_t)) ].   (30.1)

In other words, if the agent starts in state s and keeps following the policy π (that is taking the
action given by π whenever a decision is required), it will receive a sequence of rewards. The
discounted cumulative reward of this state is the sum of gathered rewards along the (infinite)
trajectory, the reward being received at the tth time step being discounted by γ t (this favors closer
rewards and allows for the sum to be finite). This quantity being random (due to the transitions
being random), the value of the state is defined as the expectation of the discounted cumulative
reward. There exist other criteria for quantifying the quality of a policy, but they will not be
considered here6 .
A value function allows quantifying the quality of a policy, and it allows comparing policies as
follows:
π1 ≥ π2 ⇔ ∀s ∈ S, vπ1 (s) ≥ vπ2 (s).
Notice that this is a partial ordering, thus two policies might not be comparable. Solving an MDP
means computing the optimal policy π∗ satisfying vπ∗ ≥ vπ , for all π ∈ AS . In other words, the
optimal policy satisfies
π∗ ∈ argmax_{π∈A^S} v_π.

It is possible to show that such a policy exists (the result is admitted). Before showing how such
an optimal policy can be computed (see Sec. 30.3), we develop the notion of value function.

30.2.3 Bellman operators


The value function, as defined in Eq. (30.1), is not practical. However, it can be rewritten in a
recursive way:

v_π(s) = E[ Σ_{t=0}^{∞} γ^t r(S_t, π(S_t)) | S_0 = s, S_{t+1} ∼ P(.|S_t, π(S_t)) ]
      = r(s, π(s)) + E[ Σ_{t=1}^{∞} γ^t r(S_t, π(S_t)) | S_0 = s, S_{t+1} ∼ P(.|S_t, π(S_t)) ]
      = r(s, π(s)) + γ E[ Σ_{t=0}^{∞} γ^t r(S_{t+1}, π(S_{t+1})) | S_0 = s, S_{t+1} ∼ P(.|S_t, π(S_t)) ]
⇔ v_π(s) = r(s, π(s)) + γ Σ_{s'∈S} P(s'|s, π(s)) v_π(s').   (30.2)

5 More precisely, it is a deterministic policy. One can consider more generally a stochastic policy, that is π ∈ ∆_A^S:
for each state s, π(.|s) is a distribution over actions. This will be useful in Sec. 30.6, but for now deterministic
policies are enough.
6 Still, we can mention the finite horizon criterion, defined for a given horizon H, the associated value function
being

v_π(s) = E[ Σ_{t=0}^{H} r(S_t, π(S_t)) | S_0 = s, S_{t+1} ∼ P(.|S_t, π(S_t)) ],

or the average criterion, the corresponding value being

v_π(s) = lim_{H→∞} (1/H) E[ Σ_{t=0}^{H} r(S_t, π(S_t)) | S_0 = s, S_{t+1} ∼ P(.|S_t, π(S_t)) ].

In other words, computing the value of state s can be done by knowing the values of the possible next
states (and the probability to reach these states). Notice that this is a linear system. To see this
more clearly, first notice that a function from R^S can be seen as a vector, and vice-versa (the state
space being finite). We introduce P_π ∈ R^{S×S} and r_π ∈ R^S, defined as

P_π = (P(s'|s, π(s)))_{s,s'∈S}   and   r_π = (r(s, π(s)))_{s∈S}.

The term P_π is a stochastic matrix (each row sums to one) and r_π is a vector. Using this notation,
Eq. (30.2) can be written as

v_π = r_π + γ P_π v_π   ⇔   v_π = (I − γP_π)^{-1} r_π,   (30.3)

where I is the identity matrix. Notice that P_π being a stochastic matrix, its spectral radius is
bounded by 1, and as γ < 1, the matrix (I − γP_π) is indeed invertible.
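As an illustration, here is a minimal Python sketch that evaluates a policy by solving the linear system of Eq. (30.3) on a small randomly generated MDP (all quantities below are made up for the example):

import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.95

# Toy MDP: P[a, s, s'] is a transition kernel, r[s, a] a reward function (both made up).
P = rng.random((nA, nS, nS))
P /= P.sum(axis=2, keepdims=True)        # each row P[a, s, :] sums to one
r = rng.random((nS, nA))

pi = rng.integers(nA, size=nS)           # an arbitrary deterministic policy

P_pi = P[pi, np.arange(nS), :]           # P_pi[s, s'] = P(s' | s, pi(s))
r_pi = r[np.arange(nS), pi]              # r_pi[s] = r(s, pi(s))

v_pi = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)   # v_pi = (I - gamma P_pi)^{-1} r_pi
print(v_pi)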
We can now introduce the Bellman evaluation operator7 T_π : R^S → R^S, defined as

∀v ∈ R^S,   T_π v = r_π + γ P_π v,

which means equivalently that componentwise we have

∀s ∈ S,   [T_π v](s) = r(s, π(s)) + γ Σ_{s'∈S} P(s'|s, π(s)) v(s').

This affine operator applies to any function v ∈ R^S (not necessarily a value function corresponding
to a policy) and Eq. (30.3) shows that v_π is the unique fixed point of this operator8:

v_π = T_π v_π.

Therefore, we have a tool to compute the value function of any policy, which is useful for
quantifying its quality. However, we would also like to characterize directly the optimal policy π∗
(more precisely, the related value function).
Write v∗ = v_{π∗} the value function associated to an optimal policy (called the optimal value
function). Assume that the optimal value function v∗ is known, but not the optimal policy π∗. For
any state, this policy should take the action that leads to the highest possible value, that is

π∗(s) ∈ argmax_{a∈A} ( r(s, a) + γ Σ_{s'∈S} P(s'|s, a) v∗(s') ).   (30.4)

We say that π∗ is greedy with respect to v∗. Therefore, knowing v∗, one can compute π∗. The
remaining problem is to characterize v∗. We have seen in the evaluation problem that knowing
the value of the next state is sufficient to compute the value of the current state (see Eq. (30.2)).
The same principle can be applied to compute the optimal value. Knowing the optimal value of
the next state, the optimal value of the current state is the one that maximizes the sum of the
immediate reward and of the discounted optimal value of the next state:

∀s ∈ S,   v∗(s) = max_{a∈A} ( r(s, a) + γ Σ_{s'∈S} P(s'|s, a) v∗(s') ).

To see if this problem admits a solution, we introduce the Bellman optimality operator T∗ :
R^S → R^S, defined as

∀v ∈ R^S,   T∗ v = max_{π∈A^S} (r_π + γ P_π v),   (30.5)

7 Named after Bellman (1957).
8 Indeed, one can easily show that it is a contraction for the supremum norm defined as ‖v‖∞ = max_{s∈S} |v(s)|:
for any u, v ∈ R^S, we have

‖T_π v − T_π u‖∞ = ‖γ P_π (v − u)‖∞ ≤ γ ‖v − u‖∞.

This also shows unicity of the fixed point and gives a way to compute it, thanks to the Banach theorem, see later.

which means equivalently that componentwise we have

∀s ∈ S,   [T∗ v](s) = max_{a∈A} ( r(s, a) + γ Σ_{s'∈S} P(s'|s, a) v(s') ).   (30.6)

This operator is a contraction in supremum norm9. Therefore, thanks to the Banach theorem10,
T∗ admits v∗ as its unique fixed point:

v∗ = T∗ v∗.

Next, we show how the optimal policy (or equivalently the optimal value function, as shown in
Eq. (30.4)) can be computed.

30.3 Dynamic Programming


Dynamic Programming (DP) refers to the set of methods that allow solving an MDP when the
model is known (that is, when the transition kernel and the reward function are analytically
known). We present the most classic ones here, namely linear programming, value iteration and policy
iteration. In practice, the model is often unknown and one has to rely on data (see Sec. 30.4 and
next) and thus on learning, but methods for this case are often based on the classic DP paradigm.

30.3.1 Linear programming


The optimal value function v∗ can be computed by solving the following linear program (writing
1 ∈ R^S the vector with all components equal to 1):

min_{v∈R^S} 1ᵀ v   subject to   v ≥ T∗ v.   (30.7)

We start by explaining why v∗ is indeed the solution of this linear program, then we express it in
a less compact form.
Recall that the operator T∗ applies to any v ∈ R^S and that it is defined as T∗ v = max_{π∈A^S} (r_π + γ P_π v)
(see Eq. (30.5)). Let π be any policy and v be such that v ≥ T∗ v. By the definition of T∗, we have
that v ≥ r_π + γ P_π v (the inequality is true for any policy, in particular for π). We can repeatedly
apply this inequality:

v ≥ r_π + γ P_π v   (30.8)
  ≥ r_π + γ P_π (r_π + γ P_π v)
  ≥ Σ_{t=0}^{∞} γ^t P_π^t r_π   (30.9)
  = (I − γ P_π)^{-1} r_π = v_π,   (30.10)

9 Let u, v ∈ R^S and write s any state. Assume without loss of generality that [T∗ v](s) ≥ [T∗ u](s). Write also
a_s^∗ ∈ argmax_{a∈A} ( r(s, a) + γ Σ_{s'∈S} P(s'|s, a) v(s') ). We have

0 ≤ |[T∗ v](s) − [T∗ u](s)| = [T∗ v](s) − [T∗ u](s)
  = max_{a∈A} ( r(s, a) + γ Σ_{s'∈S} P(s'|s, a) v(s') ) − max_{a∈A} ( r(s, a) + γ Σ_{s'∈S} P(s'|s, a) u(s') )
  ≤ r(s, a_s^∗) + γ Σ_{s'∈S} P(s'|s, a_s^∗) v(s') − r(s, a_s^∗) − γ Σ_{s'∈S} P(s'|s, a_s^∗) u(s') ≤ γ ‖v − u‖∞.

This being true for any s, we have ‖T∗ v − T∗ u‖∞ ≤ γ ‖v − u‖∞.
10 If an operator T is a contraction, it admits a unique fixed point, which can be constructed as the limit of the
sequence v_{k+1} = T v_k for any initialization v_0.



where Eq. (30.9) is obtained by repeatedly applying inequality (30.8) and Eq. (30.10) uses the
fact11 that (I − γP_π)^{-1} = Σ_{t=0}^{∞} γ^t P_π^t and that v_π = (I − γP_π)^{-1} r_π (recall Eq. (30.3)). This being
true for any policy, it is true for the optimal one, and we have just shown that

v ≥ T∗ v ⇒ v ≥ v∗.

Moreover, v ≥ v∗ implies that 1ᵀ v ≥ 1ᵀ v∗. Consequently, minimizing 1ᵀ v under the constraint
that v ≥ T∗ v provides the optimal value function.

Algorithm 38 Linear programming

1: Solve

min_{v∈R^S} Σ_{s∈S} v(s)
subject to   v(s) ≥ r(s, a) + γ Σ_{s'∈S} P(s'|s, a) v(s'),   ∀s ∈ S, ∀a ∈ A

and get v∗.
2: return the policy π∗ defined as

π∗(s) ∈ argmax_{a∈A} ( r(s, a) + γ Σ_{s'∈S} P(s'|s, a) v∗(s') ).

The optimal policy can be computed from v∗ as the greedy policy (recall Eq. (30.4)). The linear
programming approach to solving MDPs is summarized in Alg. 38 (observe that this formulation is
indeed equivalent to Eq. (30.7)). This program has |S| variables and |S × A| constraints.
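A minimal Python sketch of Alg. 38, using scipy's linear programming solver on a made-up toy MDP, could look like this (the arrays P and r and all sizes are assumptions of the example):

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.95
P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)   # toy transition kernel
r = rng.random((nS, nA))                                          # toy reward function

# For all (s, a): v(s) - gamma * sum_s' P(s'|s, a) v(s') >= r(s, a),
# rewritten as A_ub @ v <= b_ub for linprog.
A_ub = np.vstack([-(np.eye(nS) - gamma * P[a]) for a in range(nA)])
b_ub = np.concatenate([-r[:, a] for a in range(nA)])

res = linprog(c=np.ones(nS), A_ub=A_ub, b_ub=b_ub, bounds=(None, None))
v_star = res.x

# Greedy policy with respect to v_star, cf. Eq. (30.4).
q = r + gamma * np.einsum("ast,t->sa", P, v_star)
pi_star = q.argmax(axis=1)
print(v_star, pi_star)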

30.3.2 Value iteration


We have seen in Sec. 30.2 that the operator T∗ is a γ-contraction:

∀u, v ∈ R^S,   ‖T∗ u − T∗ v‖∞ ≤ γ ‖u − v‖∞.

Therefore, the Banach fixed-point theorem states that for any initialization v_0, the sequence defined
as

v_{k+1} = T∗ v_k

will converge to v∗: lim_{k→∞} v_k = v∗. This provides a simple algorithm for computing v∗. However,
the convergence is asymptotic, and one has to stop the iterations at some point. A natural stopping
criterion is to check whether two subsequent functions are close enough (in supremum norm), that is
‖v_{k+1} − v_k‖∞ ≤ ε, for a user-defined value of ε.
Doing so, we obtain a function v_k ∈ R^S, which is not necessarily a value function (as there is no
reason that it corresponds to a policy). Yet, what we are interested in is finding a control policy.
The notion of greedy policy can be extended to any function v ∈ R^S. We define a policy π to be
greedy with respect to a function v, noted π ∈ G(v), as follows:

π ∈ G(v) ⇔ T_π v = T∗ v ⇔ ∀s ∈ S,   π(s) ∈ argmax_{a∈A} ( r(s, a) + γ Σ_{s'∈S} P(s'|s, a) v(s') ).

The output of the algorithm is thus simply π_k ∈ G(v_k). This method is called value iteration and
is summarized in Alg. 39.
The stopping criterion makes sense, as the iterations are stopped when they do not involve much
change. Yet, we can wonder how close the computed v_k is to the optimal value function v∗.
11 This is a Taylor expansion, valid due to P_π being a stochastic matrix and γ being strictly bounded by 1. It can
also be checked algebraically.

Algorithm 39 Value iteration

Require: An initial v_0 ∈ R^S, a stopping criterion ε
1: k = 0
2: repeat
3:   for all s ∈ S do
4:     v_{k+1}(s) = max_{a∈A} ( r(s, a) + γ Σ_{s'∈S} P(s'|s, a) v_k(s') )
5:   end for
6:   k ← k + 1
7: until ‖v_{k+1} − v_k‖∞ ≤ ε
8: return a policy π_k ∈ G(v_k):

π_k(s) ∈ argmax_{a∈A} ( r(s, a) + γ Σ_{s'∈S} P(s'|s, a) v_k(s') ).
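A minimal Python sketch of Alg. 39 on a made-up toy MDP (the arrays P and r and the tolerance are assumptions of the example) could be:

import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, eps = 4, 2, 0.95, 1e-6
P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)   # toy transition kernel
r = rng.random((nS, nA))                                          # toy reward function

v = np.zeros(nS)
while True:
    # [T* v](s) = max_a ( r(s, a) + gamma * sum_s' P(s'|s, a) v(s') )
    v_new = (r + gamma * np.einsum("ast,t->sa", P, v)).max(axis=1)
    if np.max(np.abs(v_new - v)) <= eps:     # stopping criterion ||v_{k+1} - v_k||_inf <= eps
        v = v_new
        break
    v = v_new

pi = (r + gamma * np.einsum("ast,t->sa", P, v)).argmax(axis=1)    # greedy policy
print(v, pi)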

We have the following simple bound:

‖v∗ − v_k‖∞ = ‖T∗ v∗ − T∗ v_k + T∗ v_k − v_k‖∞   (recall that v∗ = T∗ v∗)
           ≤ ‖T∗ v∗ − T∗ v_k‖∞ + ‖T∗ v_k − v_k‖∞   (triangle ineq.)
           ≤ γ ‖v∗ − v_k‖∞ + ‖T∗ v_k − v_k‖∞   (T∗ is a γ-contraction)
           ≤ γ ‖v∗ − v_k‖∞ + ε   (T∗ v_k = v_{k+1} and ‖v_{k+1} − v_k‖∞ ≤ ε)
⇔ ‖v∗ − v_k‖∞ ≤ ε / (1 − γ).

This result tells us how close v_k is to v∗ given the considered stopping criterion (there is an
expansion of the error ε by a factor 1/(1−γ)). Yet, it does not tell how good the policy π_k ∈ G(v_k) is, the
output of the algorithm that will be applied to the system, and this is what we are really interested
in. The quality of a policy is quantified by the associated value function, thus the closeness to
optimality of π_k can be quantified by ‖v∗ − v_{π_k}‖∞. We have

‖v∗ − v_{π_k}‖∞ = ‖T∗ v∗ − T_{π_k} v_k + T_{π_k} v_k − T_{π_k} v_{π_k}‖∞
             ≤ ‖T∗ v∗ − T_{π_k} v_k‖∞ + ‖T_{π_k} v_k − T_{π_k} v_{π_k}‖∞
             ≤ ‖T∗ v∗ − T∗ v_k‖∞ + γ ‖v_k − v_{π_k}‖∞
             ≤ γ ‖v∗ − v_k‖∞ + γ ‖v_k − v∗ + v∗ − v_{π_k}‖∞
             ≤ 2γ ‖v∗ − v_k‖∞ + γ ‖v∗ − v_{π_k}‖∞
⇔ ‖v∗ − v_{π_k}‖∞ ≤ (2γ / (1 − γ)) ‖v∗ − v_k‖∞,

where we used the fact that v∗ and v_{π_k} are respectively fixed points of T∗ and T_{π_k}, that T_{π_k} v_k =
T∗ v_k (as π_k ∈ G(v_k)), that T∗ and T_{π_k} are γ-contractions, as well as the triangle inequality.
Combining this result with the preceding one, we have

‖v∗ − v_{π_k}‖∞ ≤ 2γε / (1 − γ)².

So, we can set ε so as to obtain the desired control quality (note that the bound might not be
tight).

30.3.3 Policy iteration


Let π be any policy; its value function v_π can be computed as the solution of the linear system in
Eq. (30.3). How could we propose a better policy? A natural idea is to consider a policy that is
greedy with respect to v_π, π' ∈ G(v_π). Indeed, such a policy satisfies

∀s ∈ S,   π'(s) ∈ argmax_{a∈A} ( r(s, a) + γ Σ_{s'∈S} P(s'|s, a) v_π(s') ),

which implies that

∀s ∈ S,   r(s, π'(s)) + γ Σ_{s'∈S} P(s'|s, π'(s)) v_π(s') ≥ r(s, π(s)) + γ Σ_{s'∈S} P(s'|s, π(s)) v_π(s') = v_π(s),

or more compactly that T_{π'} v_π = T∗ v_π ≥ T_π v_π = v_π. Yet, this is not enough to tell that π' is better
than π (that is, that v_{π'} ≥ v_π).
This result is indeed true, and it can be shown as follows:

T_{π'} v_π = T∗ v_π ≥ T_π v_π = v_π
⇔ r_{π'} + γ P_{π'} v_π ≥ v_π
⇔ r_{π'} + γ P_{π'} v_{π'} + γ P_{π'} (v_π − v_{π'}) ≥ v_π
⇔ (I − γ P_{π'}) v_{π'} ≥ (I − γ P_{π'}) v_π
⇒ v_{π'} ≥ v_π.

We used notably the fact that r_{π'} + γ P_{π'} v_{π'} = T_{π'} v_{π'} = v_{π'} and (for the last line) the fact that
pre-multiplying a componentwise inequality by a matrix with positive entries does not change the inequality12.
This suggests the following algorithm. Choose an initial policy π_0 and then iterate as follows:
1. solve T_{π_k} v_{π_k} = v_{π_k} (this is called the policy evaluation step);
2. compute π_{k+1} ∈ G(v_{π_k}) (this is called the policy improvement step).
We have shown that v_{π_{k+1}} ≥ v_{π_k}, so either v_{π_{k+1}} > v_{π_k} or v_{π_{k+1}} = v_{π_k}. In the equality case, we
have

T∗ v_{π_k} = T_{π_{k+1}} v_{π_k} = T_{π_{k+1}} v_{π_{k+1}} = v_{π_{k+1}} = v_{π_k}.

This means that v_{π_k} is the fixed point of T∗, and thus that v_{π_k} = v∗, that is, π_{k+1} = π∗ is the
optimal policy. This suggests stopping the iterations when v_{π_{k+1}} = v_{π_k}. Moreover, the number of
policies being finite (it is |A|^{|S|}), the number of iterations is finite (bounded by the number of
policies). This method is called policy iteration and is summarized in Alg. 40.
Each iteration of policy iteration has a higher computational cost than one of value iteration
(as a linear system has to be solved), but it converges in finite time. Moreover, it often converges
very quickly in practice (only a few iterations are required).
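A minimal Python sketch of policy iteration (Alg. 40) on a made-up toy MDP (the arrays P and r are assumptions of the example) could be:

import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.95
P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)   # toy transition kernel
r = rng.random((nS, nA))                                          # toy reward function

pi = np.zeros(nS, dtype=int)
while True:
    # Policy evaluation: solve (I - gamma P_pi) v = r_pi exactly.
    P_pi = P[pi, np.arange(nS), :]
    r_pi = r[np.arange(nS), pi]
    v = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    # Policy improvement: act greedily with respect to v.
    pi_new = (r + gamma * np.einsum("ast,t->sa", P, v)).argmax(axis=1)
    if np.array_equal(pi_new, pi):   # stop when the policy is stable
        break
    pi = pi_new

print(pi, v)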

30.4 Approximate Dynamic Programming


In Sec. 30.3, we have studied methods for solving MDPs. They can be used if the state and action
spaces are small enough for a value function to be explicitly represented and they require the model
(that is, the transition kernel and the reward function) to be known. Unfortunately, it is far from
being the usual case:
• the state space can be too large13 (even continuous14 ) for the value function to be represented
exactly. In this case, one can adopt for example a linear parametrization for the value function
12 For two vectors u and v satisfying u ≥ v, and for a matrix A having positive elements, we have Au ≥ Av (this

would be false for an arbitrary matrix). Here, the considered matrix is (I − γPπ0 )−1 = t≥0 γ t Pπt 0 , all its elements
P
are obviously positive.
13 Consider the Tetris game, the state being the current board, actions correspond to placing falling tetraminos,

the reward is +1 for removing a line, the size of the state space is 2200 for a 10 × 20 board, which is too huge to be
handled by a machine.
14 We have considered finite state spaces so far. Yet, all involved sums are expectations. For example, the

Bellman evaluation operator is [Tπ v](s) = r(s, π(s)) +R γ s0 ∈S P (s0 |s, π(s))vπ (s0 ). With a continuous state space,
P
it can be simply written as [Tπ v](s) = r(s, π(s)) + γ S vπ (s0 )P (ds0 |s, π(s)). More abstractly, it can be written as
[Tπ v](s) = r(s, π(s)) + γES 0 ∼P (.|s,π(s)) [v(S 0 )], which encompasses both cases. Up to some technicalities, what we
have presented so far remains true for the continuous case.

Algorithm 40 Policy iteration

Require: An initial π_0 ∈ A^S
1: k = 0
2: repeat
3:   solve (policy evaluation)

     v_k(s) = r(s, π_k(s)) + γ Σ_{s'∈S} P(s'|s, π_k(s)) v_k(s'),   ∀s ∈ S.

4:   Compute (policy improvement)

     π_{k+1}(s) ∈ argmax_{a∈A} ( r(s, a) + γ Σ_{s'∈S} P(s'|s, a) v_k(s') ).

5:   k ← k + 1
6: until v_{k+1} = v_k
7: return the policy π_{k+1} = π∗

(see also the discussion on hypothesis spaces in Ch. 5),

v_θ(s) = θᵀ φ(s) = Σ_{i=1}^{d} θ_i φ_i(s),

where θ are the parameters to be learnt and φ : S → R^d is a predefined feature vector (the
vector whose components are the d user-defined basis functions φ_i(s));
• the model might be unknown and one has to rely on a dataset of the type

D = {(s_i, a_i, r_i, s'_i)_{1≤i≤n}},   (30.11)

where action a_i is taken in state s_i (according to a given policy, to a random policy, or
something else), the reward satisfies r_i = r(s_i, a_i) and the next state is sampled according
to the dynamics, s'_i ∼ P(.|s_i, a_i). There are multiple ways this dataset can be obtained.
For example, it can be given beforehand (batch learning, the main case we consider in this
section). It can also be gathered in an online fashion by the agent that tries to learn the
optimal control (a case we address in Sec. 30.5). It can also be obtained in a somehow
controlled manner if one has access to a simulator (which does not mean that the model
is known). Here is an example of how these data can be used: first, notice that almost
everything turns around Bellman operators, and that without the model, they cannot be
computed. However, they can still be approximated from data. Assume that a_i = π(s_i) for
a policy of interest π. One can consider the sampled operator

[T̂_π v](s_i) = r_i + γ v(s'_i).

This is an unbiased estimate of the evaluation operator, as E[[T̂_π v](s_i)|s_i] = E_{S'∼P(.|s_i,a_i)}[r_i + γ v(S')] = [T_π v](s_i).
In this section, we will study approximate value and policy iteration algorithms that handle
these problems. There exist also approximate linear programming approaches, extending the one
presented in Sec. 30.3.1, but we will not discuss them here (as they are restrictive, in some sense).
The interested reader can nevertheless refer (notably) to de Farias and Van Roy (2003, 2004) for
more about this.

30.4.1 State-action value function


Before introducing approximate dynamic programming (ADP), we need to extend the notion of
value function. Indeed, we have seen that the notion of greedy policy is central in dynamic

programming (to compute the optimal policy from the optimal value function or to improve a
policy in the policy iteration scheme). However, computing a greedy policy requires knowing the
model, as

π ∈ G(v) ⇔ ∀s ∈ S,   π(s) ∈ argmax_{a∈A} ( r(s, a) + γ Σ_{s'∈S} P(s'|s, a) v(s') ).

Assume that we can estimate the optimal value function from the data (thus, without knowing
the model). This is a first step, but what we are interested in is a good control policy. Deducing a
greedy policy from this estimated value function would not (or at least hardly) be possible from data only.
Another problem with value functions is that the Bellman optimality operator cannot be sampled
as easily as the Bellman evaluation operator. We have seen just before that [T̂_π v](s_i) =
r_i + γ v(s'_i) is an unbiased estimate of the evaluation operator. Recall the definition of the optimality
operator (see Eq. (30.6)):

[T∗ v](s) = max_{a∈A} E_{S'∼P(.|s,a)}[r(s, a) + γ v(S')].

To define a sampled operator T̂∗, one would need to sample all actions (and related next states)
for all states s_i in the dataset15 (in order to compute the max). Write s'_{i,a} a next state sampled
according to P(.|s_i, a). One could consider the following sampled operator

[T̂∗ v](s_i) = max_{a∈A} ( r(s_i, a) + γ v(s'_{i,a}) ).

Anyway, this estimator would be biased (as the expectation of a max is not the max of an expectation,
E[[T̂∗ v](s_i)|s_i] ≠ [T∗ v](s_i)).
There is a simple solution to alleviate these problems, namely the state-action value function,
also called Q-function or quality function. For a given policy π, the state-action value function
Q_π ∈ R^{S×A} associates to each state-action pair (s, a) the expected discounted cumulative reward
obtained by starting in this state, taking this action (which might be different from the action advised by the
policy) and following the policy π afterward:

∀(s, a) ∈ S×A,   Q_π(s, a) = E[ Σ_{t=0}^{∞} γ^t r(S_t, A_t) | S_0 = s, A_0 = a, S_{t+1} ∼ P(.|S_t, A_t), A_{t+1} = π(S_{t+1}) ].

Roughly speaking, this adds a degree of freedom to the definition of the value function by leaving
the choice of the first action free. Notably, it is clear from this definition that value and Q-functions
are related as follows:

v_π(s) = Q_π(s, π(s)).
A Bellman evaluation operator can easily be defined (we use the same notation, which is
slightly abusive as it operates on a different object) as T_π : R^{S×A} → R^{S×A} such that for Q ∈ R^{S×A}
we have componentwise:

∀(s, a) ∈ S × A,   [T_π Q](s, a) = r(s, a) + γ Σ_{s'∈S} P(s'|s, a) Q(s', π(s')).

The state-action value function Q_π is the unique fixed point of the operator T_π (this operator being
a γ-contraction):

Q_π = T_π Q_π.
Therefore, computing the Q-function of a policy π also amounts to solving a linear system. The
optimal state-action value function Q∗ = Q_{π∗} also satisfies a fixed-point equation. Let us define the Bellman
optimality operator T∗ : R^{S×A} → R^{S×A} such that for Q ∈ R^{S×A} we have componentwise:

∀(s, a) ∈ S × A,   [T∗ Q](s, a) = r(s, a) + γ Σ_{s'∈S} P(s'|s, a) max_{a'∈A} Q(s', a').

15 To do so, we should have access to a simulator.



The optimal state-action value function Q∗ is the unique fixed point of the operator T∗ (this
operator being a γ-contraction):
Q∗ = T∗ Q∗ .
Notice that the optimal value and quality functions are related as follows:

∀s ∈ S,   v∗(s) = max_{a∈A} Q∗(s, a).

A first advantage of the quality function is that it allows computing a greedy policy without
knowing the model. Indeed, for a policy improvement step (to compute a policy that is greedy with respect
to v_π, recalling that it satisfies v_π(s) = Q_π(s, π(s))), we have

π' ∈ G(v_π) ⇔ ∀s ∈ S,   π'(s) ∈ argmax_{a∈A} ( r(s, a) + γ Σ_{s'∈S} P(s'|s, a) v_π(s') )
          ⇔ ∀s ∈ S,   π'(s) ∈ argmax_{a∈A} Q_π(s, a).

If one is able to compute the optimal quality function, it is possible to compute the optimal policy
as being greedy with respect to it:

π∗(s) ∈ argmax_{a∈A} Q∗(s, a).

In all cases, we define a policy π as being greedy with respect to a function Q ∈ R^{S×A} (which is
not necessarily the state-action value function of a policy) as

∀Q ∈ R^{S×A},   π ∈ G(Q) ⇔ ∀s ∈ S,   π(s) ∈ argmax_{a∈A} Q(s, a).

When working with data (with the dataset given in Eq. (30.11)), both operators can be sampled.
The Bellman evaluation operator can be sampled as

[T̂_π Q](s_i, a_i) = r_i + γ Q(s'_i, π(s'_i)).

The Bellman optimality operator can also be sampled:

[T̂∗ Q](s_i, a_i) = r_i + γ max_{a'∈A} Q(s'_i, a').

Now, both sampled operators are unbiased (contrary to the sampled optimality operator applying to
value functions).
The policy and value iteration algorithms can be easily rewritten with state-action value functions
replacing value functions. For all the reasons given so far, it is customary to work with
quality functions when the model is unknown. We have seen that when the state space is too
large, value functions should be searched for in some hypothesis space. For example, with a linear
parameterization, the quality functions would be of the form Q_θ(s, a) = θᵀ φ(s, a). Yet, the states
are usually continuous while the actions are discrete (a less frequent case in supervised learning).
A standard approach consists in defining a feature vector φ(s) for the state space, and extending
it to the state-action space as follows (δ being the Dirac function):

φ(s, a) = [ δ_{a=a_1} φ(s)ᵀ   ...   δ_{a=a_{|A|}} φ(s)ᵀ ]ᵀ.

Notice that this is reminiscent of the concept of score function for cost-sensitive multiclass classification
seen in Sec. 5.2.
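A small Python sketch of this block one-hot construction (with a made-up state feature map φ(s) for a scalar state) could be:

import numpy as np

def phi_state(s):
    # Hypothetical state features, here simple polynomial features of a scalar state.
    return np.array([1.0, s, s ** 2])

def phi(s, a, n_actions):
    # Block one-hot encoding: the state features fill the block associated with action a.
    d = phi_state(s).size
    out = np.zeros(n_actions * d)
    out[a * d:(a + 1) * d] = phi_state(s)
    return out

print(phi(0.5, 0, n_actions=2))   # [1.  0.5 0.25 0.  0.  0.  ]
print(phi(0.5, 1, n_actions=2))   # [0.  0.  0.   1.  0.5 0.25]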

30.4.2 Approximate value iteration


In this section, we work with a dataset as in Eq. (30.11), that is

D = {(si , ai , ri , s0i )1≤i≤n },



and the aim is to estimate from this set of transitions the optimal Q-function Q∗ (from which we
can estimate an optimal policy by being greedy). The Bellman optimality operator applying on
RS×A is a γ-contraction (the proof is very similar to the case of value functions). Therefore, thanks
to the Banach theorem, the iteration
Qk+1 = T∗ Qk
will converge to Q∗ . Yet, there are two problems:
• the operator T∗ cannot be applied to Qk , the model being unknown;
• as we are working with a too large state space, the Q-functions should belong to some
hypothesis space H, and there is no reason that T∗ Q_k ∈ H holds.
The first problem can be avoided by using the sampled operator presented before instead; the
second one indeed corresponds to a regression problem, as shown below.
Now, we construct an approximate variant of value iteration applied to state-action value
functions. Assume that we adopt a linear parametrization for the Q-function, that is, we consider
the following hypothesis space

H = {Q_θ(s, a) = θᵀ φ(s, a), θ ∈ R^d},

with φ(s, a) a predefined feature vector. Let θ_0 be some initial parameter vector and let Q_0 = Q_{θ_0}
be the associated quality function. At iteration k we have Q_k = Q_{θ_k}. We can sample the optimality
operator for the state-action couples available in the dataset D:

∀1 ≤ i ≤ n,   [T̂∗ Q_k](s_i, a_i) = r_i + γ max_{a'∈A} Q_k(s'_i, a').

So, we have n target values for the next function Q_{k+1} (corresponding to inputs (s_i, a_i)), and
this function must belong to H. Finding the function Q_{k+1} is therefore a regression problem that
can be solved by minimizing the risk based on the ℓ2-loss, for example. Therefore, Q_{k+1} can be
computed as follows:

Q_{k+1} ∈ argmin_{Q_θ∈H} (1/n) Σ_{i=1}^{n} ( Q_θ(s_i, a_i) − [T̂∗ Q_k](s_i, a_i) )².

Given the chosen hypothesis space, this is simply a linear least-squares problem with inputs (s_i, a_i)
and outputs r_i + γ max_{a'∈A} Q_k(s'_i, a'), and simple calculus16 gives the solution:

Q_{k+1} = Q_{θ_{k+1}}   with   θ_{k+1} = ( Σ_{i=1}^{n} φ(s_i, a_i) φ(s_i, a_i)ᵀ )^{-1} Σ_{i=1}^{n} φ(s_i, a_i) ( r_i + γ max_{a'∈A} Q_{θ_k}(s'_i, a') ).

Alternatively, we can see this as projecting the sampled operator applied to the previous function
onto the hypothesis space, which can be written more compactly as Q_{k+1} = Π T̂∗ Q_k, writing Π the
projection.
For the regression step, we have considered a quadratic risk with a linear parametrization (that
is, a linear least-squares), but other regression schemes can be envisioned. Write abstractly A the
operator that gives a function from observations (such as the result of minimizing a risk, or a
random forest, or something else); approximate value iteration can be written generically as (for
some initialization Q_0)

Q_{k+1} = A T̂∗ Q_k.

Yet, if the Bellman operator T∗ is a contraction, there is no reason for the composed operator A T̂∗
to be a contraction. Indeed, in the example developed before (with the linear least-squares), the
operator Π T̂∗ is not a contraction, and the iterations may diverge. A sufficient condition for the
operator A T̂∗ to be a contraction is to use averagers as function approximators in the regression
step. We do not explain here what an averager is, but random forests and ensembles of extremely
randomized trees (see Ch. 21) belong to this category. Using this kind of function approximator
in the regression step therefore ensures convergence.
16 Generically, the problem is to solve min_θ (1/n) Σ_{i=1}^{n} (y_i − θᵀ x_i)². Computing the gradient (with respect to θ) and setting
it to zero provides the solution Σ_i x_i x_iᵀ θ = Σ_i x_i y_i.

Algorithm 41 Approximate value iteration

Require: A dataset D = {(s_i, a_i, r_i, s'_i)_{1≤i≤n}}, the number K of iterations, a function approximator,
an initial state-action value function Q_0.
1: for k = 0 to K do
2:   Apply the sampled Bellman operator to the function Q_k:

     [T̂∗ Q_k](s_i, a_i) = r_i + γ max_{a'∈A} Q_k(s'_i, a').

3:   Solve the regression problem with inputs (s_i, a_i) and outputs [T̂∗ Q_k](s_i, a_i) to get the Q-function Q_{k+1}
4: end for
5: return The greedy policy π_{K+1} ∈ G(Q_{K+1}):

∀s ∈ S,   π_{K+1}(s) ∈ argmax_{a∈A} Q_{K+1}(s, a).

The generic approximate value iteration is provided in Alg. 41. An important thing to notice is
that this algorithm reduces the learning of an optimal control to a sequence of supervised learning
problems. For the definition of an averager and a discussion on the conditions for A T̂∗ to be
a contraction, see Gordon (1995). When the function approximator is an ensemble of extremely
randomized trees, the algorithm is known as fitted-Q (Ernst et al., 2005) and is quite efficient
empirically. Approximate value iteration has been experimented with other function approximators,
such as neural networks (Riedmiller, 2005) or Nadaraya-Watson estimators (Ormoneit and Sen,
2002).
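A minimal fitted-Q sketch, using scikit-learn's ExtraTreesRegressor on a made-up batch of transitions with a small discrete action set, could look like this (dimensions, data and the number of iterations are all assumptions of the example):

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
n, d, actions, gamma, K = 1000, 2, [0, 1], 0.95, 20

# Made-up batch of transitions (s_i, a_i, r_i, s'_i).
S = rng.random((n, d))
A = rng.integers(len(actions), size=n)
R = rng.random(n)
S_next = rng.random((n, d))

X = np.column_stack([S, A])   # regression inputs (s_i, a_i)
model = None
for k in range(K):
    if model is None:
        y = R                 # first iteration: regress on the immediate rewards
    else:
        # Sampled optimality operator: r_i + gamma * max_a' Q_k(s'_i, a')
        q_next = np.column_stack([
            model.predict(np.column_stack([S_next, np.full(n, a)])) for a in actions
        ])
        y = R + gamma * q_next.max(axis=1)
    model = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X, y)

# Greedy action in an arbitrary state s.
s = np.array([0.2, 0.7])
q_s = [model.predict(np.append(s, a)[None, :])[0] for a in actions]
print(int(np.argmax(q_s)))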
So far, we have assumed that the dataset is given beforehand. The quality of this dataset is
very important for good empirical results. For example, if states si in D are sampled in a too small
part of the state space, no algorithm will be able to recover a good controller. If one has access to
a simulator (or to the real system), it is possible to choose how data are sampled (how states si are
sampled, according to what policy actions ai are sampled, the next states s0i being imposed by the
dynamics). In this case, one can wonder what is the best way to sample states. A sensible approach
would consist in following the current policy πk+1 ∈ G(Qk+1 ) slightly randomized (this is linked
to what is known as the exploration-exploitation dilemma, to be discussed in Sec. 30.5). Choosing
the right distribution is a difficult problem, and we will not discuss it much further here. However,
there is an important remark: in supervised learning, the distribution is fixed beforehand (given
by the problem at hand), while in approximate dynamic programming (that is, in reinforcement
learning) only the dynamics is fixed, which can involve many different distributions on transitions.
Therefore, things are much more difficult to analyse in this setting.
We can also have a word about model evaluation. An important question is: how good is the
policy πK+1 returned by approximate value iteration? When estimating a function in supervised
learning, its quality can be assessed by using cross-validation, for example. In reinforcement
learning, this is much more difficult, cross-validation cannot be applied. The best way to assess
the quality of the policy πK+1 is to apply it to the control problem (and if it is a real system,
and not a simulated one, it can be dangerous). There are few works on model evaluation for
reinforcement learning (Farahmand and Szepesvári, 2011; Thomas et al., 2015), but notice that
there is no answer as easy as in supervised learning.

30.4.3 Approximate policy iteration


The policy iteration algorithm can also be approximated. Recall the steps involved in an iteration
of policy iteration:

1. policy evaluation: solve the fixed-point equation Qπk = Tπk Qπk ;

2. policy improvement: compute the greedy policy πk+1 = G(Qπk ).



Given any function Q ∈ RS×A , computing an associated greedy policy is easy (that is partly why
the state-action value function has been introduced). Therefore, the step to be approximated is
the policy evaluation step. In other words, the problem consists in estimating the quality function
of a given policy, from data. An iteration of approximate policy iteration can be (informally)
summarized as follows:
1. approximate policy evaluation: find a function Qk ∈ H such that Qk ≈ Tπk Qk ;
2. policy improvement: compute the greedy policy π_{k+1} ∈ G(Q_k).
The whole procedure is presented in Alg. 42.

Algorithm 42 Approximate policy iteration


Require: An initial π0 ∈ AS (possibly an initial Q0 and π0 ∈ G(Q0 )), number of iterations K
1: for k = 0 to K do
2: approximate policy evaluation:

find Qk ∈ H such that Qk ≈ Tπk Qk .

3: policy improvement:
πk+1 ∈ G(Qk ).
4: end for
5: return the policy πK+1

So, the core question is: how to estimate the quality function of a given policy from a given
dataset? As before, the model is unknown and the Bellman operator can only be sampled. Moreover,
the state space being too large, we are looking for a Q-function belonging to some predefined
hypothesis space. Here, we will assume a linear parametrization, that is H = {Q_θ(s, a) =
θᵀ φ(s, a), θ ∈ R^d}. Let π be any policy; we discuss now how to find an approximate fixed point of
T_π, that is, a function Q_θ ∈ H such that Q_θ ≈ T_π Q_θ.

Monte Carlo Rollouts

We know that Q_π is the fixed point of T_π. Therefore, solving the approximate fixed-point equation
amounts to approximating Q_π. Assume that this function is known for the state-action couples
(s_i, a_i) in the dataset D. Then, this is simply a regression problem. For example, with an ℓ2-loss,
the problem to be solved is the following linear least-squares:

min_{θ∈R^d} (1/n) Σ_{i=1}^{n} ( Q_π(s_i, a_i) − Q_θ(s_i, a_i) )².

Unfortunately, the Q-function Q_π is obviously unknown (it is what we would like to estimate).
However, if a simulator is available, it can be estimated for any given state-action couple (s_i, a_i).
To do so, the idea is to sample a full trajectory starting in s_i where action a_i is chosen first, all
subsequent states being sampled according to the system dynamics and all subsequent actions
being chosen according to the policy π. Write q_i the associated discounted cumulative reward (the
sum of discounted rewards gathered along the sampled trajectory); it is an unbiased estimate of
the state-action value function: E[q_i|s_i, a_i] = Q_π(s_i, a_i). This is called a Monte Carlo rollout. This
fits the regression setting and one can solve

min_{θ∈R^d} (1/n) Σ_{i=1}^{n} ( q_i − Q_θ(s_i, a_i) )²

to generalize the simulated state-action values. The solution is here the classical linear least-squares
estimator, the parameter vector minimizing this empirical risk being

θ_n = ( Σ_{i=1}^{n} φ(s_i, a_i) φ(s_i, a_i)ᵀ )^{-1} Σ_{i=1}^{n} φ(s_i, a_i) q_i.
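A minimal Python sketch of this rollout-based estimation, with a made-up toy simulator, a made-up policy and block one-hot features (all assumptions of the example, truncated rollouts being used in place of infinite ones), could be:

import numpy as np

rng = np.random.default_rng(0)
gamma, H, n_actions = 0.95, 200, 2

def step(s, a):
    # Hypothetical toy simulator: noisy move left/right, reward for staying close to 0.5.
    s_next = np.clip(s + (0.1 if a == 1 else -0.1) + 0.05 * rng.normal(), 0.0, 1.0)
    return -abs(s_next - 0.5), s_next

def phi(s, a):
    out = np.zeros(n_actions * 3)
    out[a * 3:(a + 1) * 3] = [1.0, s, s ** 2]   # block one-hot state-action features
    return out

pi = lambda s: int(s < 0.5)                     # the policy to evaluate

def rollout(s, a):
    # Truncated Monte Carlo rollout: unbiased up to the gamma^H truncation error.
    q, disc = 0.0, 1.0
    for _ in range(H):
        rwd, s = step(s, a)
        q += disc * rwd
        disc *= gamma
        a = pi(s)
    return q

S = rng.random(200)
A = rng.integers(n_actions, size=200)
Phi = np.array([phi(s, a) for s, a in zip(S, A)])
q_targets = np.array([rollout(s, a) for s, a in zip(S, A)])
theta = np.linalg.lstsq(Phi, q_targets, rcond=None)[0]   # linear least-squares fit of Q_pi
print(theta)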

Notice that other losses and other function approximators could be used (as it is a standard
regression problem). The disadvantage of this approach is that it requires a simulator (which is
not always available, and which can be costly to use) and that the rollouts can be quite noisy (due
to the variance induced by the stochasticity of the system). Also, while it formally requires sampling
infinite trajectories, in practice finite trajectories are sampled17.

Residual approach
As we are looking for an approximate fixed point of T_π, a natural idea consists in minimizing
‖Q_θ − T_π Q_θ‖ for some norm. If we can set this quantity to zero, then we have found the fixed
point (and thus the state-action value function Q_π). This is called a residual approach as we try
to minimize the residual between Q_θ and its image T_π Q_θ under the Bellman evaluation operator.
We work with data and the operator can only be sampled, so a natural optimization problem to
consider is

min_{θ∈R^d} (1/n) Σ_{i=1}^{n} ( [T̂_π Q_θ](s_i, a_i) − Q_θ(s_i, a_i) )² = min_{θ∈R^d} (1/n) Σ_{i=1}^{n} ( r_i + γ Q_θ(s'_i, π(s'_i)) − Q_θ(s_i, a_i) )².

With a linear parametrization Q_θ(s, a) = θᵀ φ(s, a), this can be solved analytically by zeroing the
gradient; the minimizer is then given by

θ_n = ( (1/n) Σ_{i=1}^{n} ∆φ_i ∆φ_iᵀ )^{-1} ( (1/n) Σ_{i=1}^{n} ∆φ_i r_i )   with   ∆φ_i = φ(s_i, a_i) − γ φ(s'_i, π(s'_i)).

The problem with this approach is that it leads to minimizing a biased surrogate of the residual.
Indeed, if T̂_π is an unbiased estimator of the Bellman operator, this is no longer true for its square
(the square of the mean is not the mean of the square):

E[ ([T̂_π Q_θ](s_i, a_i) − Q_θ(s_i, a_i))² | s_i, a_i ] = ([T_π Q_θ](s_i, a_i) − Q_θ(s_i, a_i))² + var( [T̂_π Q_θ](s_i, a_i) | s_i, a_i )
                                           ≠ ([T_π Q_θ](s_i, a_i) − Q_θ(s_i, a_i))².

If the dynamics is deterministic, this approach is fine (the variance term is null). However, with
stochastic transitions the estimate will be biased. The variance term acts as a regularizing factor,
which is good in general, but not here, as it cannot be controlled (there is no factor for trading off
the risk and the regularization). For more about this, see Antos et al. (2008).

Least-Squares Temporal Differences


Asymptotically, a linear least-squares projects the function of interest onto the hypothesis space.
Writing Π this projection, the idea here is to find a fixed point of the composed operator Π T_π, that
is, to solve Q_θ = Π T_π Q_θ. This approach is known as LSTD (Least-Squares Temporal Differences)
and has been originally proposed from a different perspective (more precisely, an error-in-variable
model) by Bradtke and Barto (1996).
Consider Fig. 30.2. On the left side, we show what the Monte Carlo approach does asymptotically: it
projects the Q-function of interest onto the hypothesis space H. We have seen that this requires
a simulator. Consider now the figure on the right. Let Q_θ be a function of H. One can apply
the Bellman evaluation operator to this function, to get T_π Q_θ. Notice that there is no reason for
T_π Q_θ to belong to the hypothesis space H. The residual approach tries to directly minimize the
distance between these functions (but we have seen that it is biased). Now, take T_π Q_θ and project
it back onto the hypothesis space, which gives Π T_π Q_θ. LSTD searches for the parameter vector that
minimizes the distance between Q_θ and Π T_π Q_θ, the dashed line in the figure (and one can show
that this distance is zero).
17 A first solution is to truncate the trajectories to a horizon H; the error will then be bounded by γ^H ‖r‖∞ / (1 − γ). Another
solution is to sample the horizon according to a geometric law of parameter 1/(1 − γ) and to take q_i as the undiscounted
cumulative reward. One can check that it is an unbiased estimate, but the random horizon can be arbitrarily long (a
geometric law is unbounded).

Figure 30.2: Illustration for the Monte Carlo Rollouts (left) and LSTD (right). See the text for
details.

As usual, we work with data. LSTD can be expressed as solving the following nested optimization
problem:

w_θ = argmin_{w∈R^d} (1/n) Σ_{i=1}^{n} ( r_i + γ Q_θ(s'_i, π(s'_i)) − Q_w(s_i, a_i) )²
θ_n = argmin_{θ∈R^d} (1/n) Σ_{i=1}^{n} ( Q_θ(s_i, a_i) − Q_{w_θ}(s_i, a_i) )².

The first equation corresponds to the projection of T̂_π Q_θ onto H and the second equation to the
minimization of the distance between Q_θ and the projection of T̂_π Q_θ. This is a nested optimization
problem as both solutions are interleaved (w_θ depends on θ and vice-versa). Given the chosen linear
parametrization, this can be solved analytically. The first equation is a simple linear least-squares
problem in w, whose solution is given by

w_θ = ( Σ_{i=1}^{n} φ(s_i, a_i) φ(s_i, a_i)ᵀ )^{-1} Σ_{i=1}^{n} φ(s_i, a_i) ( r_i + γ θᵀ φ(s'_i, π(s'_i)) ).

The second equation is minimized with θ = w_θ (and the distance is zero). Therefore, the solution
is

θ_n = w_{θ_n} ⇔ θ_n = ( Σ_{i=1}^{n} φ(s_i, a_i) φ(s_i, a_i)ᵀ )^{-1} Σ_{i=1}^{n} φ(s_i, a_i) ( r_i + γ θ_nᵀ φ(s'_i, π(s'_i)) )
         ⇔ θ_n = ( Σ_{i=1}^{n} φ(s_i, a_i) ( φ(s_i, a_i) − γ φ(s'_i, π(s'_i)) )ᵀ )^{-1} Σ_{i=1}^{n} φ(s_i, a_i) r_i.
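A numpy sketch of this LSTD estimator, on a made-up batch of transitions with block one-hot features and a made-up policy (all assumptions of the example), could be:

import numpy as np

rng = np.random.default_rng(0)
n, gamma, n_actions, d = 500, 0.95, 2, 3

def phi(s, a):
    out = np.zeros(n_actions * d)
    out[a * d:(a + 1) * d] = [1.0, s, s ** 2]   # block one-hot state-action features
    return out

# Made-up batch of transitions (s_i, a_i, r_i, s'_i) and the policy to evaluate.
S, A = rng.random(n), rng.integers(n_actions, size=n)
R, S_next = rng.random(n), rng.random(n)
pi = lambda s: int(s < 0.5)

Phi = np.array([phi(s, a) for s, a in zip(S, A)])
Phi_next = np.array([phi(s, pi(s)) for s in S_next])

# theta_n = ( sum_i phi_i (phi_i - gamma phi'_i)^T )^{-1} sum_i phi_i r_i
A_mat = Phi.T @ (Phi - gamma * Phi_next)
b_vec = Phi.T @ R
theta = np.linalg.solve(A_mat, b_vec)
print(theta)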

When LSTD is used as the policy evaluation step of approximate policy iteration, the resulting algorithm
is called LSPI (Lagoudakis and Parr, 2003), for least-squares policy iteration, and is summarized
in Alg. 43.
We have seen that a central question here is how to approximate the quality function from data.
We have shown the main methods, but many others exist. For an overview of the subject, the
interested reader can refer to Geist and Pietquin (2013); Geist and Scherrer (2014) (some others
will be briefly discussed in Sec. 30.5). For a discussion about the link between the residual approach
and LSTD, see Scherrer (2010).

Approximating the policy

So far, we have tried to estimate a state-action value function, the policy being deduced from it.
However, in some cases, it might be easier to approximate the policy directly (for example, because
it has a simpler form). In an approximate policy iteration scheme, this would mean that we
approximate the policy improvement step, instead of the policy evaluation step.
At iteration k, assume that Qπk(si, a) is known for all 1 ≤ i ≤ n and a ∈ A. Here, let F ⊂ A^S
be a hypothesis space of policies. Consider the following optimization problem:
\[
\pi_{k+1} \in \operatorname{argmin}_{\pi \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \left( \max_{a \in A} Q_{\pi_k}(s_i, a) - Q_{\pi_k}(s_i, \pi(s_i)) \right). \tag{30.12}
\]

Algorithm 43 Least-squares policy iteration

Require: An initial π0 ∈ A^S (possibly an initial Q0 and π0 ∈ G(Q0)), number of iterations K
1: for k = 0 to K do
2:   approximate policy evaluation:
\[
\theta_k = \left( \sum_{i=1}^n \phi(s_i, a_i) \left( \phi(s_i, a_i) - \gamma \phi(s'_i, \pi_k(s'_i)) \right)^\top \right)^{-1} \sum_{i=1}^n \phi(s_i, a_i) r_i.
\]
3:   policy improvement:
\[
\pi_{k+1} \in \mathcal{G}(Q_{\theta_k}).
\]
4: end for
5: return the policy π_{K+1}

This is a cost-sensitive multiclass classification problem (see also Sec. 5.2). Notice that a policy
can be seen as a decision rule (as it associates a label, an action, to an input, a state). Take a
policy π: it will suffer a cost of max_{a∈A} Qπk(si, a) − Qπk(si, π(si)) for choosing the action π(si)
in state si instead of the greedy action argmax_{a∈A} Qπk(si, a). So, solving the above optimization
problem, with an infinite amount of data and a rich enough hypothesis space, gives the greedy
policy πk+1 ∈ G(Qπk). As we work with a finite amount of data and as the hypothesis space may not
contain the greedy policy, in practice we obtain an approximate greedy policy.
Obviously, the state-action value function is unknown, whereas it is required to express prob-
lem (30.12). Yet, we only need to know (possibly approximately) the state-action value function
for the state-action couples {(si, a)}_{1≤i≤n, a∈A}. We can estimate these values by performing Monte
Carlo rollouts. Notice that we only need to estimate the function pointwise; we do not need to
generalize it to the whole state-action space (generalization is done by the policy). The interesting
aspect here is that we have reduced the optimal control problem to a sequence of classification
problems.
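As an illustration, such a pointwise rollout estimate of Qπk(s, a) could look like the following sketch; the simulator interface (env.reset_to and env.step), the truncation horizon and the number of rollouts are assumptions made for the example, not part of the text.

```python
import numpy as np

def mc_rollout_q(env, policy, s, a, gamma, horizon, n_rollouts):
    """Pointwise Monte Carlo estimate of Q_pi(s, a) with truncated rollouts.

    env.reset_to(s) puts the (assumed) simulator in state s, env.step(a)
    returns (next_state, reward), and policy(state) returns an action.
    """
    returns = []
    for _ in range(n_rollouts):
        env.reset_to(s)
        state, ret, discount = s, 0.0, 1.0
        action = a
        for _ in range(horizon):   # truncation bias bounded as in footnote 17
            state, reward = env.step(action)
            ret += discount * reward
            discount *= gamma
            action = policy(state)
        returns.append(ret)
    return np.mean(returns)
```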
This approach is called DPI (Lazaric et al., 2010) for direct policy iteration. Notice that all
the approximate dynamic programming algorithms we have presented so far are special cases of the
generic approximate modified policy iteration approach. The interested reader can refer to Scherrer
et al. (2015) for more about this.

30.5 Online learning


In Sec. 30.4, we have considered that a set of data was provided, and we tried to learn a good
controller from it (with possible calls to a simulator from time to time). Another setting of interest
is online learning. Here, the agent interacts with the system and tries to learn the optimal
control while doing so. This calls for learning in an online fashion (whereas the methods
considered so far are batch algorithms). Here, we will focus on learning state-action value functions
(the policy being derived from them). As the agent can choose the action that will be applied to
the system, it can somehow control the kind of data it will observe. At each time step, it
will have to choose between two alternatives: it can either choose the action that it thinks is
optimal, according to the currently estimated quality function, or it can choose a supposedly
suboptimal action that can improve its knowledge of the world (its estimate of the Q-function)
and therefore lead to higher cumulative rewards in the future. This is called the exploration-
exploitation dilemma.

30.5.1 SARSA and Q-learning


We start by presenting methods for estimating quality functions in an online manner.

SARSA
Let π be any policy; we now present an algorithm for estimating the Q-function of this policy in
an online fashion. Let H = {Qθ(s, a) = θ^⊤ φ(s, a), θ ∈ R^d} be a hypothesis space of linearly
parameterized functions. Assuming that Qπ is known, we would like to solve
\[
\min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n \left( Q_\pi(s_i, a_i) - Q_\theta(s_i, a_i) \right)^2.
\]

If we minimize this risk using a stochastic gradient descent, we classically get the following update
rule,
\[
\theta_{i+1} = \theta_i - \frac{\alpha_i}{2} \nabla_\theta \left( Q_\pi(s_i, a_i) - Q_\theta(s_i, a_i) \right)^2 \Big|_{\theta = \theta_i}
= \theta_i + \alpha_i \phi(s_i, a_i) \left( Q_\pi(s_i, a_i) - \theta_i^\top \phi(s_i, a_i) \right),
\]

where αi is a learning rate. This is a standard Widrow-Hoff update. At time-step i, the state-action
couple (si , ai ) is observed, and the parameter vector is updated according to the gain αi φ(si , ai )
and to the prediction error Qπ (si , ai ) − θi> φ(si , ai ).
Unfortunately, and as usual, the state-action value function is unknown (as it is what we would
like to estimate). The idea is to apply a bootstrapping principle: the unobserved value Qπ (si , ai ) is
replaced by an estimate computed by applying the sampled Bellman evaluation operator to the cur-
rent estimate Qθi (si , ai ), that is [T̂π Qθi ](si , ai ) = ri + γQθi (si+1 , π(si+1 )), where si+1 ∼ P (.|si , ai )
(si+1 is obtained by applying action ai to the system). To sum up, let (si , ai , ri , si+1 , ai+1 ) be the
current transition, with ri = r(si , ai ), si+1 ∼ P (.|si , ai ) and ai+1 = π(si+1 ), the update rule is

\[
\theta_{i+1} = \theta_i + \alpha_i \phi(s_i, a_i) \left( r_i + \gamma Q_{\theta_i}(s_{i+1}, \pi(s_{i+1})) - Q_{\theta_i}(s_i, a_i) \right)
= \theta_i + \alpha_i \phi(s_i, a_i) \left( r_i + \gamma \theta_i^\top \phi(s_{i+1}, a_{i+1}) - \theta_i^\top \phi(s_i, a_i) \right).
\]

This is called a temporal difference algorithm, due to the prediction error ri +γQθi (si+1 , π(si+1 ))−
Qθi (si , ai ) being a temporal difference.

Algorithm 44 SARSA
Require: An initial parameter vector θ0 , the initial state s0 , an initial action a0 , the learning
rates (αi )i≥0
1: i = 0
2: while true do
3: Apply action ai in state si
4: Get the reward ri and observe the new state si+1
5: Choose the action ai+1 to be applied in state si+1
6: Update the parameter vector of the Q-function according to the transition
(si , ai , ri , si+1 , ai+1 );

\[
\theta_{i+1} = \theta_i + \alpha_i \phi(s_i, a_i) \left( r_i + \gamma \theta_i^\top \phi(s_{i+1}, a_{i+1}) - \theta_i^\top \phi(s_i, a_i) \right)
\]

7: i←i+1
8: end while
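For concreteness, the update of line 6 of Alg. 44 can be written in a few lines of NumPy. This is only an illustrative sketch in which the feature map phi (returning a vector of the same size as theta) is an assumption.

```python
import numpy as np

def sarsa_update(theta, phi, transition, alpha, gamma):
    """One SARSA step for Q_theta(s, a) = theta^T phi(s, a).

    transition is the tuple (s, a, r, s_next, a_next); phi is an assumed
    feature map returning a NumPy vector.
    """
    s, a, r, s_next, a_next = transition
    td_error = r + gamma * theta @ phi(s_next, a_next) - theta @ phi(s, a)
    return theta + alpha * phi(s, a) * td_error
```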

The resulting algorithm is called SARSA (due to the transition (si , ai , ri , si+1 , ai+1 )) and is
summarized in Alg. 44. Notice that we remain vague about how action ai+1 is chosen (see line 5
in Alg. 44). We have said just before that action ai+1 is chosen according to π, the policy to
be evaluated. If we consider a policy evaluation problem, that’s fine. However, we would like to
learn the optimal control. A possibility would be to take ai+1 as the greedy action with respect to
the current estimate Qθi. This would correspond to an optimistic18 approximate policy iteration
18 The optimism lies in the fact that the evaluation step occurs for only one transition before taking the greedy

step.

scheme. Yet, this would be too conservative (if the Q-function is badly estimated, the agent can
get stuck) and it should be balanced with some exploration. Before expanding on this, we present
an alternative algorithm.

Q-learning
We can do the same job as for SARSA with the optimal Q-function Q∗ directly, instead of Qπ .
The update rule (assuming Q∗ known) is:

\[
\theta_{i+1} = \theta_i + \alpha_i \phi(s_i, a_i) \left( Q_*(s_i, a_i) - \theta_i^\top \phi(s_i, a_i) \right).
\]

The function Q∗ is unknown, but it can be bootstrapped by replacing it by the estimate obtained
by applying the sampled Bellman optimality operator to the current estimate, [T̂∗ Qθi](si, ai) =
ri + γ max_{a∈A} Qθi(si+1, a), giving the following update rule:
\[
\theta_{i+1} = \theta_i + \alpha_i \phi(s_i, a_i) \left( r_i + \gamma \max_{a \in A} Q_{\theta_i}(s_{i+1}, a) - Q_{\theta_i}(s_i, a_i) \right)
= \theta_i + \alpha_i \phi(s_i, a_i) \left( r_i + \gamma \max_{a \in A} \left( \theta_i^\top \phi(s_{i+1}, a) \right) - \theta_i^\top \phi(s_i, a_i) \right).
\]

Notice that whatever the behavior policy (the way actions ai are chosen), we will directly estimate
the optimal Q-function (under some assumptions, notably that the behavior policy visits all
state-action pairs often enough). This is called an off-policy algorithm (the estimated policy—here
the optimal one—is different from the behavior policy). This is different from SARSA (used for
policy evaluation), where the followed policy and the evaluated policy are the same. This is called
on-policy learning.

Algorithm 45 Q-learning
Require: An initial parameter vector θ0 , the initial state s0 , the learning rates (αi )i≥0
1: i = 0
2: while true do
3: Choose the action ai to be applied in state si
4: Apply action ai in state si
5: Get the reward ri and observe the new state si+1
6: Update the parameter vector of the Q-function according to the transition (si , ai , ri , si+1 );
 
\[
\theta_{i+1} = \theta_i + \alpha_i \phi(s_i, a_i) \left( r_i + \gamma \max_{a \in A} \left( \theta_i^\top \phi(s_{i+1}, a) \right) - \theta_i^\top \phi(s_i, a_i) \right)
\]

7: i←i+1
8: end while
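The update of line 6 of Alg. 45 only differs from the SARSA sketch given earlier by the max over actions in the temporal difference. A minimal version, where the finite action set actions and the feature map phi are again assumptions:

```python
import numpy as np

def q_learning_update(theta, phi, actions, transition, alpha, gamma):
    """One Q-learning step for Q_theta(s, a) = theta^T phi(s, a)."""
    s, a, r, s_next = transition
    q_next = max(theta @ phi(s_next, b) for b in actions)  # max_a Q_theta(s', a)
    td_error = r + gamma * q_next - theta @ phi(s, a)
    return theta + alpha * phi(s, a) * td_error
```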

The resulting algorithm is called Q-learning and is summarized in Alg. 45. Again, we remain
vague about how action ai is chosen (see line 3 in Alg. 45). If the Q-function is perfectly estimated,
the wisest choice would consist in taking the greedy action. However, if some action has never
been tried (or not tried enough, that is, the Q-function estimate has not converged), one cannot know
whether it is in fact a better action than the greedy one. Therefore, exploitation (taking the greedy
action) should be combined with exploration (taking another action).
Notice that there exist many other online learning algorithms. For example, LSTD can be
made online by using Sherman-Morrison (much like how recursive linear least-squares are derived
from linear least-squares). For more about this, see again Geist and Pietquin (2013); Geist and
Scherrer (2014).

30.5.2 The exploration-exploitation dilemma


Therefore, for both SARSA and Q-learning, there is a dilemma between exploitation (taking the
greedy action) and exploration (taking a non-greedy action, leading possibly to higher discounted

cumulative rewards). We present two simple solutions, that is, policies balancing exploration and
exploitation, which can be used in line 5 of Alg. 44 and line 3 of Alg. 45.
Let ε be in (0, 1). An ε-greedy policy chooses the greedy action with probability 1 − ε, and a
random action with probability ε:
\[
\pi_\epsilon(s) = \begin{cases} \operatorname{argmax}_{a \in A} Q_\theta(s, a) & \text{with probability } 1 - \epsilon \\ \text{a random action} & \text{with probability } \epsilon \end{cases}.
\]

Assume that ε is small. Most of the time, the agent will act greedily with respect to the currently
estimated quality function. However, from time to time, it will try a different action, to see if it
cannot gather higher values. In practice, it is customary to set a high value of ε at the beginning of
learning, so as to favor exploration, and then to decrease ε as the estimation of the Q-function
improves, so as to act more and more greedily.
A different kind of policy is the softmax policy. Let Qθ be the currently estimated quality function
and let τ > 0 be a temperature parameter; the softmax policy is a stochastic policy defined as
\[
\pi_\tau(a|s) = \frac{e^{\frac{1}{\tau} Q_\theta(s, a)}}{\sum_{a' \in A} e^{\frac{1}{\tau} Q_\theta(s, a')}}.
\]

It is easy to check that this indeed defines a conditional probability. It is clear from this definition
that actions with high Q-values will be sampled more often. The parameter τ allows going from a
purely uniform random policy (τ → ∞) to the greedy policy (τ → 0).
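Both policies are straightforward to implement. In the sketch below, q(s, a) stands for the current estimate Qθ(s, a) and actions is the finite action set; both are assumptions of the illustration.

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(q, actions, s, epsilon):
    """Greedy action w.p. 1 - epsilon, uniformly random action w.p. epsilon."""
    if rng.random() < epsilon:
        return actions[rng.integers(len(actions))]
    return max(actions, key=lambda a: q(s, a))

def softmax_policy(q, actions, s, tau):
    """Sample an action proportionally to exp(Q(s, a) / tau)."""
    values = np.array([q(s, a) / tau for a in actions])
    values -= values.max()                      # numerical stability
    probs = np.exp(values) / np.exp(values).sum()
    return actions[rng.choice(len(actions), p=probs)]
```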
These policies make sense for balancing exploration and exploitation. However, one can wonder
how effective they are (empirically, which can be tried, but also theoretically). Moreover, one can
wonder what the best exploration strategy is. This is a very difficult question. Indeed, consider
the bandit problem studied in Ch. 29. A bandit is indeed an MDP with a single state (there is
no transition kernel and γ = 0). We have studied the UCB strategy, which addresses the exploration-
exploitation dilemma. Things are much more complicated in the general MDP setting. For a few
possible strategies, the interested reader can refer to Bayesian reinforcement learning (Poupart et al.,
2006; Vlassis et al., 2012) or R-max (Brafman and Tennenholtz, 2003), among many others.

30.6 Policy search and actor-critic methods


So far, we have only considered discrete actions. Indeed, all approaches studied before involve
computing some max or argmax over actions, which becomes a possibly difficult optimization
problem when actions are continuous. In this section, we discuss an alternative way to estimate
an optimal policy, by searching directly in the policy space. The idea is to parameterize the policy
and to look for the set of parameters optimizing some objective function.
To do so, we need to work with stochastic policies, such that no action has a null probability
(for a reason that will be clear later). A stochastic policy π ∈ ∆SA associates to each state s a
conditional probability over actions π(.|s). All the things we have defined for deterministic policies
extend naturally to stochastic policies. For example, the Bellman evaluation equation is
\[
v_\pi(s) = \sum_{a \in A} \pi(a|s) \left( r(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) v_\pi(s') \right).
\]

The value and quality functions are linked as follows:
\[
v_\pi(s) = \sum_{a \in A} \pi(a|s) Q_\pi(s, a).
\]

Therefore, we can notably write
\[
Q_\pi(s, a) = r(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) v_\pi(s'),
\]

as in the deterministic case.



Here, we look for parameterized policies belonging to some hypothesis space F = {πθ , θ ∈ Rd }.
For example, for discrete actions, a common choice is to parameterize the policy with a softmax
distribution:
\[
\pi_\theta(a|s) = \frac{e^{\theta^\top \phi(s, a)}}{\sum_{a' \in A} e^{\theta^\top \phi(s, a')}}, \tag{30.13}
\]

with φ(s, a) being a predefined feature vector. For continuous actions, a common approach consists
in embedding a parameterized deterministic policy into a Gaussian distribution. For example, if
A = R, we can consider
\[
\pi_\theta(a|s) \propto e^{-\frac{1}{2} \left( \frac{a - \theta^\top \phi(s)}{\sigma} \right)^2},
\]
with φ(s) being a predefined feature vector.
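For gradient-based policy search (introduced next), what matters in practice is the score ∇θ ln πθ(a|s). As an illustration, a sketch for the softmax parametrization of Eq. (30.13) is given below; its closed form, φ(s, a) − Σ_{a'} πθ(a'|s)φ(s, a'), is derived later in this section, and the feature map phi and the action set are assumptions of the example.

```python
import numpy as np

def softmax_probs(theta, phi, actions, s):
    """pi_theta(.|s) for the softmax parametrization of Eq. (30.13)."""
    logits = np.array([theta @ phi(s, a) for a in actions])
    logits -= logits.max()                      # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def softmax_score(theta, phi, actions, s, a_idx):
    """grad_theta log pi_theta(a|s) with a = actions[a_idx]."""
    probs = softmax_probs(theta, phi, actions, s)
    mean_feat = sum(p * phi(s, b) for p, b in zip(probs, actions))
    return phi(s, actions[a_idx]) - mean_feat
```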
Let ν ∈ ∆S be a distribution over states, defined by the user (it weights states where we would
like to have good estimates). The policy search methods aim at solving the following optimization
problem:
\[
\max_{\theta \in \mathbb{R}^d} J(\theta) \quad \text{with} \quad J(\theta) = \sum_{s \in S} \nu(s) v_{\pi_\theta}(s) = \mathbb{E}_{S \sim \nu}[v_{\pi_\theta}(S)].
\]

In dynamic programming, the aim is to find the policy that maximizes the value for every state. In
the current setting, there are too many states, so we instead try to find the policy that maximizes
the associated value function averaged over a predefined distribution over states.

30.6.1 The policy gradient theorem


It remains to see how this optimization problem can be solved. A natural idea is to perform
a gradient ascent. To do so, we need to compute the gradient of J(θ):
\[
\nabla_\theta J(\theta) = \sum_{s \in S} \nu(s) \nabla_\theta v_{\pi_\theta}(s).
\]

If we look at the gradient of the value function:
\[
\begin{aligned}
\nabla_\theta v_{\pi_\theta}(s) &= \nabla_\theta \sum_{a \in A} \pi_\theta(a|s) Q_{\pi_\theta}(s, a) \\
&= \sum_{a \in A} \left( \nabla_\theta(\pi_\theta(a|s)) Q_{\pi_\theta}(s, a) + \pi_\theta(a|s) \nabla_\theta(Q_{\pi_\theta}(s, a)) \right) \\
&= \sum_{a \in A} \left( \nabla_\theta(\pi_\theta(a|s)) Q_{\pi_\theta}(s, a) + \pi_\theta(a|s) \nabla_\theta \left( r(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) v_{\pi_\theta}(s') \right) \right) \\
&= \sum_{a \in A} \pi_\theta(a|s) \left( Q_{\pi_\theta}(s, a) \nabla_\theta \ln \pi_\theta(a|s) + \gamma \sum_{s' \in S} P(s'|s, a) \nabla_\theta v_{\pi_\theta}(s') \right),
\end{aligned}
\]

where we used a classic log-trick for the last line.19 We can see that, componentwise, this is a Bellman
evaluation equation for the policy πθ and the reward Qπθ(s, a)∇θj ln πθ(a|s). Let 1 ≤ j ≤ d and θj
be the j-th component of the vector θ; write also R(s) = Σ_{a∈A} πθ(a|s) Qπθ(s, a) ∇θj ln πθ(a|s). We
have equivalently that
\[
\forall 1 \leq j \leq d, \quad \nabla_{\theta_j} v_{\pi_\theta} = (I - \gamma P_\pi)^{-1} R.
\]
Notice that the componentwise gradient of the objective can be written as ∇θj J(θ) = ν^⊤ ∇θj vπθ; we
therefore have
\[
\nabla_{\theta_j} J(\theta) = \nu^\top (I - \gamma P_\pi)^{-1} R.
\]
The quantity defined as
\[
d_{\nu,\pi} = (1 - \gamma) \nu^\top (I - \gamma P_\pi)^{-1}
\]
19 This log-trick is the fact that ∇π(a|s) = π(a|s)∇ ln π(a|s). The log is the reason why we consider stochastic

policies (no action can have probability zero, or the log is ill defined).

is a distribution. It corresponds to the state occupancy obtained by sampling an initial state
according to ν and then following the policy πθ for a random time drawn from a geometric
distribution of parameter 1/(1 − γ). From the two previous equations, we can finally write
\[
\begin{aligned}
\nabla_\theta J(\theta) &= \frac{1}{1 - \gamma} \sum_{s \in S} d_{\nu,\pi}(s) \sum_{a \in A} \pi_\theta(a|s) Q_{\pi_\theta}(s, a) \nabla_\theta \ln \pi_\theta(a|s) \\
&= \frac{1}{1 - \gamma} \mathbb{E}_{S \sim d_{\nu,\pi}, A \sim \pi_\theta(\cdot|S)}[Q_{\pi_\theta}(S, A) \nabla_\theta \ln \pi_\theta(A|S)]. 
\end{aligned} \tag{30.14}
\]

This result is called the policy gradient theorem; see Sutton et al. (1999), who first provided this
result, for an alternative derivation.
A local maximum can thus be searched for by doing a gradient ascent,

θ ← θ + α∇θ J(θ),

with α being a learning rate. The gradient can be estimated using Monte Carlo rollouts, see Baxter
and Bartlett (2001).
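A basic Monte Carlo estimation of this gradient, in the spirit of such rollout-based methods, could look like the sketch below. It is only an illustration (up to normalization constants): the simulator interface env.reset()/env.step(), the episode horizon and the use of the discounted return as a pointwise estimate of Qπθ(S, A) are assumptions, not part of the text.

```python
import numpy as np

def policy_gradient_estimate(env, sample_action, score, theta, gamma,
                             n_episodes, horizon):
    """Monte Carlo estimate of grad_theta J(theta) from sampled episodes.

    sample_action(theta, s) samples a ~ pi_theta(.|s); score(theta, s, a)
    returns grad_theta log pi_theta(a|s); env is an assumed simulator.
    """
    grad = np.zeros_like(theta)
    for _ in range(n_episodes):
        s = env.reset()
        trajectory = []
        for _ in range(horizon):
            a = sample_action(theta, s)
            s_next, r = env.step(a)
            trajectory.append((s, a, r))
            s = s_next
        # discounted return from time t serves as an estimate of Q(s_t, a_t)
        ret = 0.0
        for (s, a, r) in reversed(trajectory):
            ret = r + gamma * ret
            grad += score(theta, s, a) * ret
    return grad / n_episodes

# gradient ascent step: theta <- theta + alpha * policy_gradient_estimate(...)
```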

30.6.2 Actor-critic methods


Most of the methods we have considered so far are called critic methods (as quality functions are
estimated, not the policies, and the value is a critic of the policy). Policy search (and DPI, the
approximate policy iteration algorithm that reduces to a sequence of classification problems) is
called an actor method, as only the policy is learnt, and it is the object which interacts with the
system.
Sometimes, it is possible to learn both the policy and the quality function. Such methods are called
actor-critic methods. Consider for example the gradient in Eq. (30.14). It involves a Q-function,
which can be estimated pointwise using rollouts. Now, we have studied methods for approximating
such a function. The question we study now is: can we replace Qπ in Eq. (30.14) by an approximation
Qw ∈ H, without changing the gradient?

Policy gradient with a critic


Let Qw ∈ H be a parameterized function of RS×A (the parametrization will be made explicit later).
We would like to replace Qπ by Qw in Eq. 30.14. To do so, we must have (to shorten the notations,
we write “S ∼ dν,π , A ∼ πθ (.|S)” as dν,π ):

Edν,π [Qπθ (S, A)∇θ ln πθ (A|S)] = Edν,π [Qw (S, A)∇θ ln πθ (A|S)]
⇔Edν,π [(Qπθ (S, A) − Qw (S, A))∇θ ln πθ (A|S)] = 0. (30.15)

Now, assume that we have

∇θ ln πθ (a|s) = ∇w Qw (s, a), ∀(s, a) ∈ S × A.

Injecting this into Eq. (30.15), one can recognize a gradient:

Edν,π [(Qπθ (S, A) − Qw (S, A))∇w Qw (S, A)] = 0 ⇔ ∇w Edν,π [(Qπθ (S, A) − Qw (S, A))2 ] = 0.

In other words, Qw must be a local optimum of the risk based on the `2 -loss, with the state-action
distribution given by dν,π , and with the target function being Qπ .
We have just shown that if the parametrization of the state-action value function is compatible,
in the sense that
∀(s, a) ∈ S × A, ∇θ ln πθ (a|s) = ∇w Qw (s, a),
and if Qw is a local optimum of the risk based on the `2 -loss, with state-action distribution given
by dν,π , and with the target function being Qπ , that is

∇w Edν,π [(Qπθ (S, A) − Qw (S, A))2 ] = 0,



then the gradient satisfies

∇θ J(θ) = Edν,π [Qw (S, A)∇θ ln πθ (A|S)].

So, the state-action value function appearing in the gradient can be replaced by its projection onto
the hypothesis space of compatible functions. This result was first given by Sutton et al.
(1999). Notice that, while formally the problem should be solved using Monte Carlo rollouts, in
practice temporal difference algorithms are often used (and they do not compute the projection,
in general).
Let us see what this compatibility condition gives with the softmax parametrization of Eq. (30.13).
We have
\[
\begin{aligned}
\nabla_\theta \ln \pi_\theta(a|s) &= \nabla_\theta \ln \frac{e^{\theta^\top \phi(s, a)}}{\sum_{a' \in A} e^{\theta^\top \phi(s, a')}} \\
&= \phi(s, a) - \frac{\sum_{a' \in A} \phi(s, a') e^{\theta^\top \phi(s, a')}}{\sum_{a' \in A} e^{\theta^\top \phi(s, a')}} \\
&= \phi(s, a) - \sum_{a' \in A} \pi_\theta(a'|s) \phi(s, a').
\end{aligned}
\]

So, a compatible function approximation would be
\[
Q_w(s, a) = w^\top \left( \phi(s, a) - \sum_{a' \in A} \pi_\theta(a'|s) \phi(s, a') \right).
\]
Yet, notice that we would have Σ_{a∈A} πθ(a|s) Qw(s, a) = 0, for any s ∈ S. This has no reason to
be true for a quality function. Yet, this is true for the advantage function, defined as
\[
A_\pi(s, a) = Q_\pi(s, a) - v_\pi(s).
\]
Indeed, this would be true for any compatible approximator, as
\[
\sum_{a \in A} \pi(a|s) \nabla_\theta \ln \pi_\theta(a|s) = \sum_{a \in A} \nabla_\theta \pi_\theta(a|s) = \nabla_\theta \sum_{a \in A} \pi_\theta(a|s) = \nabla_\theta 1 = 0. \tag{30.16}
\]

However, this is not really a problem. Indeed, let v ∈ R^S be any function depending only on
states. One can easily show that E_{dν,π}[v(S)∇θ ln πθ(A|S)] = 0, using the same trick as in Eq. (30.16).
Therefore, we have
\[
\forall v \in \mathbb{R}^S, \quad \nabla_\theta J(\theta) = \mathbb{E}_{d_{\nu,\pi}}[(Q_\pi(S, A) + v(S)) \nabla_\theta \ln \pi_\theta(A|S)].
\]

This means that we can simply take
\[
Q_w(s, a) = w^\top \phi(s, a)
\]
as a compatible function representation for the estimated quality function.

Natural policy gradient


An alternative to gradient ascent is natural gradient ascent. The natural gradient is the gradient
premultiplied by the inverse of the Fisher information matrix (instead of following the steepest
direction in the parameter space, it follows the steepest direction with respect to Fisher metric,
which tends to be much more efficient empirically, see Amari (1998b) for more about natural
gradients).
Generally, the natural gradient ∇̃ is defined as
\[
\tilde{\nabla}_\theta J(\theta) = F(\theta)^{-1} \nabla_\theta J(\theta),
\]

with F (θ) the Fisher information matrix. In our case, the matrix is defined as (Peters and Schaal,
2008)
F (θ) = Edν,π [∇θ ln πθ (A|S)(∇θ ln πθ (A|S))> ].

Let Qw be a linearly parameterized function satisfying the required conditions, that is, Qw(s, a) =
w^⊤ ∇θ ln πθ(a|s) and ∇w E_{dν,π}[(Qπθ(S, A) − Qw(S, A))²] = 0; then we have
\[
\begin{aligned}
\tilde{\nabla}_\theta J(\theta) &= F(\theta)^{-1} \nabla_\theta J(\theta) \\
&= \left( \mathbb{E}_{d_{\nu,\pi}}[\nabla_\theta \ln \pi_\theta(A|S)(\nabla_\theta \ln \pi_\theta(A|S))^\top] \right)^{-1} \mathbb{E}_{d_{\nu,\pi}}\left[ \left( \nabla_\theta \ln \pi_\theta(A|S)^\top w \right) \nabla_\theta \ln \pi_\theta(A|S) \right] \\
&= \left( \mathbb{E}_{d_{\nu,\pi}}[\nabla_\theta \ln \pi_\theta(A|S)(\nabla_\theta \ln \pi_\theta(A|S))^\top] \right)^{-1} \mathbb{E}_{d_{\nu,\pi}}[\nabla_\theta \ln \pi_\theta(A|S)(\nabla_\theta \ln \pi_\theta(A|S))^\top] \, w \\
&= w.
\end{aligned}
\]

Therefore, the policy parameters are simply updated using the parameters computed for the quality
function. The related algorithms are called natural actor-critics and have been introduced by Peters
and Schaal (2008). For more about policy search and actor-critics, the interested reader can refer
to Grondman et al. (2012); Deisenroth et al. (2013).
Bibliography

Amari, S. (1998a). Natural Gradient Works Efficiently in Learning. Neural Computation,


10(2):251–276.
Amari, S.-I. (1998b). Natural gradient works efficiently in learning. Neural Computation, 10(2):251–
276.
Andrieu, C., de Freitas, N., Doucet, A., and Jordan, M. I. (2003). An Introduction to MCMC for
Machine Learning. Machine Learning, 50(1-2):5–43.
Antos, A., Szepesvári, C., and Munos, R. (2008). Learning near-optimal policies with Bellman-
residual minimization based fitted policy iteration and a single sample path. Machine Learning,
71(1):89–129.
Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit
problem. Machine learning, 47(2-3):235–256.
Ávila Pires, B., Szepesvari, C., and Ghavamzadeh, M. (2013). Cost-sensitive multiclass classifica-
tion risk bounds. In International Conference on Machine Learning (ICML), pages 1391–1399.
Barron, A. R. (1988). Complexity Regularization with Application to Artificial Neural Networks.
In Nonparametric Functional Estimation and Related Topics, pages 561–576. Kluwer Academic
Publishers.
Bartlett, P. L., Boucheron, S., and Lugosi, G. (2002). Model Selection and Error Estimation.
Machine Learning, 48(1-3):85–113.
Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. (2006). Convexity, classification, and risk
bounds. Journal of the American Statistical Association, 101(473):138–156.
Baxter, J. and Bartlett, P. L. (2001). Infinite-horizon policy-gradient estimation. Journal of
Artificial Intelligence Research, pages 319–350.
Beijbom, O., Saberian, M., Kriegman, D., and Vasconcelos, N. (2014). Guess-averse loss func-
tions for cost-sensitive multiclass boosting. In International Conference on Machine Learning
(ICML), pages 586–594.
Bellman, R. (1957). A Markovian Decision Process. Journal of Mathematics and Mechanics,
6(5):679–684.
Bengio, Y., Courville, A. C., and Vincent, P. (2013). Representation Learning: A Review and New
Perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798–1828.
Bengio, Y., Goodfellow, I. J., and Courville, A. (2015). Deep Learning. Book in preparation for
MIT Press.
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training of
deep networks. In Schölkopf, B., Platt, J., and Hoffman, T., editors, Advances in Neural
Information Processing Systems 19, pages 153–160. MIT Press, Cambridge, MA.
Bengio, Y., Simard, P. Y., and Frasconi, P. (1994). Learning long-term dependencies with gradient
descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166.


Bertsekas, D. P. (1995). Dynamic Programming and Optimal Control. Athena Scientific.


Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific.
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press, Oxford,
UK.
Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration inequalities: a non asymptotic
theory of independence.
Bousquet, O., Boucheron, S., and Lugosi, G. (2004). Introduction to Statistical Learning Theory.
In Advanced Lectures on Machine Learning, pages 169–207.
Bradtke, S. J. and Barto, A. G. (1996). Linear Least-Squares algorithms for temporal difference
learning. Machine Learning, 22(1-3):33–57.
Brafman, R. I. and Tennenholtz, M. (2003). R-max:a general polynomial time algorithm for near-
optimal reinforcement learning. The Journal of Machine Learning Research, 3:213–231.
Breiman, L. (1996a). Bagging predictors. Machine learning, 24(2):123–140.
Breiman, L. (1996b). Stacked regressions. Machine learning, 24(1):49–64.
Breiman, L. (1999). Pasting small votes for classification in large databases and on-line. Machine
Learning, 36(1-2):85–103.
Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32.
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1983). CART: Classification and Regression
Trees. Wadsworth: Belmont, CA.
Breiman, L., Friedman, J., Stone, C. J., and Olshen, R. A. (1984). Classification and regression
trees. CRC press.
Brent, R. (1973). Algorithms for minimization without derivatives, chapter 3. Prentice-Hall.
Broomhead, D. and Lowe, D. (1988). Multivariable Functional Interpolation and Adaptive Net-
works. Complex Systems, 2:321–355.
Bubeck, S. and Cesa-Bianchi, N. (2012). Regret Analysis of Stochastic and Nonstochastic Multi-
armed Bandit Problems. Foundations and Trends in Machine Learning, 5(1):1–122.
Bunea, F., Tsybakov, A., Wegkamp, M., et al. (2007). Sparsity oracle inequalities for the lasso.
Electronic Journal of Statistics, 1:169–194.
Carreira-Perpiñán, M. Á. and Hinton, G. (2005). On Contrastive Divergence Learning. In Cowell,
R. G. and Ghahramani, Z., editors, AISTATS. Society for Artificial Intelligence and Statistics.
Chapelle, O. and Li, L. (2011). An empirical evaluation of thompson sampling. In Advances in
neural information processing systems, pages 2249–2257.
Cireşan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. (2010). Deep, Big, Simple
Neural Nets for Handwritten Digit Recognition. Neural Computation, 22(12):3207–3220.
Ciresan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. (2011a). Convolutional Neural
Network Committees for Handwritten Character Classification. In ICDAR, pages 1135–1139.
IEEE Computer Society.
Ciresan, D. C., Meier, U., Masci, J., Gambardella, L. M., and Schmidhuber, J. (2011b). Flexible,
High Performance Convolutional Neural Networks for Image Classification. In Walsh, T.,
editor, IJCAI, pages 1237–1242. IJCAI/AAAI.
Ciressan, D., Meier, U., and Schmidhuber, J. (2012). Multi-column Deep Neural Networks for
Image Classification. Technical report, IDSIA.

Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine learning, 20(3):273–297.


Cottrell, M., Fort, J., and Pagès, G. (1998). Theoretical aspects of the SOM algorithm. Neuro-
computing, 21(1–3):119–138.
Cover, T. M. (1965). Geometrical and Statistical Properties of Systems of Linear Inequalities
with Applications in Pattern Recognition. Electronic Computers, IEEE Transactions on, EC-
14(3):326–334.
Cox, T. F. and Cox, M. (2000). Multidimensional Scaling, Second Edition. Chapman and
Hall/CRC, 2 edition.
Cristanini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and other
kernel-based learning methods. Cambridge University Press.
Cucker, F. and Smale, S. (2001). On the mathematical foundations of learning. Bulletin of the
american mathematical society, 39(1):1–49.
Cybenko, G. (1989). Approximation by Superpositions of a Sigmoidal Function. Mathematics of
Control, Signals, and Systems, 2(4):303–314.
Dahl, G. E., Jaitly, N., and Salakhutdinov, R. (2014). Multi-task Neural Networks for QSAR
Predictions. CoRR, abs/1406.1231.
Dauphin, Y. N., Pascanu, R., Gülçehre, c., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying
and attacking the saddle point problem in high-dimensional non-convex optimization. In
Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q., editors,
NIPS, pages 2933–2941.
de Farias, D. P. and Van Roy, B. (2003). The linear programming approach to approximate
dynamic programming. Operations Research, 51(6):850–865.
de Farias, D. P. and Van Roy, B. (2004). On constraint sampling in the linear programming ap-
proach to approximate dynamic programming. Mathematics of operations research, 29(3):462–
478.
Deisenroth, M. P., Neumann, G., Peters, J., et al. (2013). A survey on policy search for robotics.
Foundations and Trends in Robotics, 2(1-2):1–142.
Delalleau, O. and Bengio, Y. (2011). Shallow vs. Deep Sum-Product Networks. In Shawe-Taylor,
J., Zemel, R. S., Bartlett, P. L., Pereira, F. C. N., and Weinberger, K. Q., editors, NIPS,
pages 666–674.
Deng, L., Hinton, G. E., and Kingsbury, B. (2013). New types of deep neural network learning
for speech recognition and related applications: an overview. In ICASSP, pages 8599–8603.
IEEE.
Deng, L. and Platt, J. C. (2014). Ensemble deep learning for speech recognition. In Li, H., Meng,
H. M., Ma, B., Chng, E., and Xie, L., editors, INTERSPEECH, pages 1915–1919. ISCA.
Eberhart, R. C. and Kennedy, J. (1995). Particle swarm optimization. In Proceedings, volume 4,
pages 1942–1948.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004a). Least Angle Regression. The
Annals of Statistics, 32(2):407–451.
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R., et al. (2004b). Least angle regression. The
Annals of statistics, 32(2):407–499.
Engelbrecht, A. P. (2007). Fundamentals of Computational Swarm Intelligence. Wiley.
Ernst, D., Geurts, P., and Wehenkel, L. (2005). Tree-Based Batch Mode Reinforcement Learning.
Journal of Machine Learning Research, 6:503–556.

Evgeniou, T., Pontil, M., and Poggio, T. (2000). Regularization Networks and Support Vector
Machines. Advances in Computational Mathematics, 13(1):1–50.
Farahmand, A.-m. and Szepesvári, C. (2011). Model selection in reinforcement learning. Machine
learning, 85(3):299–332.
Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and
an application to boosting. Journal of computer and system sciences, 55(1):119–139.
Freund, Y. and Schapire, R. E. (1999). Large Margin Classification Using the Perceptron Algo-
rithm. Machine Learning, 37(3):277–296.
Frezza-Buet, H. (2014). Online Computing of Non-Stationary Distributions Velocity Fields by an
Accuracy Controlled Growing Neural Gas. Neural Networks, 60:203–221.
Friedman, J., Hastie, T., and Tibshirani, R. (2000). Additive logistic regression: a statistical view
of boosting. The annals of statistics, 28(2):337–407.
Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of
statistics, pages 1189–1232.
Fritzke, B. (1995a). A growing neural gas network learns topologies. In Tesauro, G., Touretzky,
D. S., and Leen, T. K., editors, Advances in Neural Information Processing Systems 7, pages
625–632. MIT Press, Cambridge MA.
Fritzke, B. (1995b). Growing Grid - a self-organizing network with constant neighborhood range
and adaptation strength. Neural Processing Letters, 2:9–13.
Fritzke, B. (1997). Some Competitive Learning Methods. http://www.ki.inf.tu-dresden.de/~fritzke/JavaPaper/.
Fukushima, K. (1980). Neocognitron: A Self-Organizing Neural Network Model for a Mechanism
of Pattern Recognition Unaffected by Shift in Position. Biological Cybernetics, 36:193–202.
Gabillon, V., Ghavamzadeh, M., and Lazaric, A. (2012). Best arm identification: A unified ap-
proach to fixed budget and fixed confidence. In Advances in Neural Information Processing
Systems, pages 3212–3220.
Gallant, S. I. (1990). Perceptron-based learning algorithms. IEEE Transactions on Neural Net-
works, 1(2):179–191.
Geist, M. (2013-2014). Abrégé non exhaustif sur l’évaluation et la sélection de modèles et la
sélection de variables. Technical report, CentraleSupélec.
Geist, M. (2015a). Précis introductif à l'apprentissage statistique. Support de cours, CentraleSupélec.
https://www.metz.supelec.fr/metz/personnel/geist_mat/pdfs/poly_as_v2.pdf.
Geist, M. (2015b). Soft-max boosting. Machine Learning.
Geist, M. and Pietquin, O. (2013). An Algorithmic Survey of Parametric Value Function Approx-
imation. IEEE Transactions on Neural Networks and Learning Systems, 24(6):845–867.
Geist, M. and Scherrer, B. (2014). Off-policy Learning with Eligibility Traces: A Survey. Journal
of Machine Learning Research (JMLR), 15:289–333.
Gelly, S., Wang, Y., Munos, R., and Teytaud, O. (2006). Modification of UCT with patterns in
Monte-Carlo go. Technical Report RR-6062, 32:30–56.
Gers, F. A., Schmidhuber, J., and Cummins, F. A. (2000). Learning to Forget: Continual Predic-
tion with LSTM. Neural Computation, 12(10):2451–2471.
Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. Machine learning,
63(1):3–42.

Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural
networks. In Teh, Y. W. and Titterington, D. M., editors, AISTATS, volume 9 of JMLR
Proceedings, pages 249–256. JMLR.org.

Gordon, G. (1995). Stable Function Approximation in Dynamic Programming. In International


Conference on Machine Learning (IMCL).

Graves, A. (2013). Generating Sequences With Recurrent Neural Networks. CoRR, abs/1308.0850.

Graves, A., Fernández, S., Gomez, F. J., and Schmidhuber, J. (2006). Connectionist temporal
classification: labelling unsegmented sequence data with recurrent neural networks. In Cohen,
W. W. and Moore, A., editors, ICML, volume 148 of ACM International Conference Proceeding
Series, pages 369–376. ACM.

Graves, A., rahman Mohamed, A., and Hinton, G. E. (2013). Speech recognition with deep
recurrent neural networks. In ICASSP, pages 6645–6649. IEEE.

Graves, A. and Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM
and other neural network architectures. Neural Networks, 18(5-6):602–610.

Grondman, I., Buşoniu, L., Lopes, G. A., and Babuška, R. (2012). A survey of actor-critic rein-
forcement learning: Standard and natural policy gradients. Systems, Man, and Cybernetics,
Part C: Applications and Reviews, IEEE Transactions on, 42(6):1291–1307.

Grubb, A. and Bagnell, D. (2011). Generalized boosting algorithms for convex optimization. In
International Conference on Machine Learning, pages 1209–1216.

Guenther, W. C. (1969). Shortest Confidence Intervals. The American Statistician, 23(1):22–25.

Guermeur, Y. (2007). Vc theory of large margin multi-category classifiers. The Journal of Machine
Learning Research, 8:2551–2594.

Guyon, I. and Elisseeff, A. (2003). An Introduction to Variable and Feature Selection. J. Mach.
Learn. Res., 3:1157–1182.

Györfi, L., Kohler, M., Krzyzak, A., and Walk, H. (2006). A distribution-free theory of nonpara-
metric regression. Springer.

Hall, M. A. (1999). Correlation-based Feature Selection for Machine Learning. PhD thesis.

Han, H.-G. and Qiao, J.-F. (2012). Adaptive Computation Algorithm for RBF Neural Network.
IEEE Trans. Neural Netw. Learning Syst., 23(2):342–347.

Hansen, N. (2006). The CMA evolution strategy: a comparing review. In Lozano, J., Larranaga,
P., Inza, I., and Bengoetxea, E., editors, Towards a new evolutionary computation. Advances
on estimation of distribution algorithms, pages 75–102. Springer.

Hartman Eric J., Keeler James D., and Kowalski Jacek M. (1990). Layered Neural Networks with
Gaussian Hidden Units as Universal Approximations. Neural Computation, 2(2):210–215. doi:
10.1162/neco.1990.2.2.210.

Håstad, J. (1986). Almost Optimal Lower Bounds for Small Depth Circuits. In Hartmanis, J.,
editor, STOC, pages 6–20. ACM.

Håstad, J. and Goldmann, M. (1991). On the Power of Small-Depth Threshold Circuits. Compu-
tational Complexity, 1:113–129.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer,
2nd edition.

Hertz, J., Krogh, A., and Palmer, R. G. (1991). Introduction to the Theory of Neural Computation.
Addison-Wesley, Redwood City, CA.

Hinton, G. and Roweis, S. (2002). Stochastic Neighbor Embedding. In Advances in Neural Infor-
mation Processing Systems 15, pages 833–840. MIT Press.

Hinton, G. E. (2009). Deep belief networks. Scholarpedia, 4(5):5947.

Hinton, G. E. (2012). A Practical Guide to Training Restricted Boltzmann Machines. In Montavon,


G., Orr, G. B., and Müller, K.-R., editors, Neural Networks: Tricks of the Trade (2nd ed.),
volume 7700 of Lecture Notes in Computer Science, pages 599–619. Springer.

Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A Fast Learning Algorithm for Deep Belief
Nets. Neural Comput., 18(7):1527–1554.

Hinton, G. E. and Sejnowski, T. (1986). Learning and relearning in Boltzmann machines. In Parallel
distributed processing: Explorations in the microstructure of cognition, pages 282–317–. MIT
Press, Cambridge, MA.

Ho, T. K. (1998). The random subspace method for constructing decision forests. Pattern Analysis
and Machine Intelligence, IEEE Transactions on, 20(8):832–844.

Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut


für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München.

Hochreiter, S. and Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation,


9:1735–1780.

Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of
the American statistical association, 58(301):13–30.

Hoeting, J. A., Madigan, D., Raftery, A. E., and Volinsky, C. T. (1999). Bayesian model averaging:
a tutorial. Statistical science, pages 382–401.

Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal
approximators. Neural Networks, 2:356–366.

Hornik, K., Stinchcombe, M., and White, H. (1990). Universal Approximation of an Unknown
Mapping and Its Derivatives Using Multilayer Feedforward Networks. Neural Networks, 3:551–
560.

Hsu, D., Kakade, S. M., and Zhang, T. (2014). Random design analysis of ridge regression.
Foundations of Computational Mathematics, 14(3):569–600.

Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixtures of local
experts. Neural computation, 3(1):79–87.

Jaeger, H. (2002). A tutorial on training recurrent neural networks, covering BPTT, RTRL, EKF
and the "echo state network" approach. Technical report, Fraunhofer Institute for Autonomous
Intelligent Systems (AIS). http://minds.jacobs-university.de/sites/default/files/uploads/papers/ESNTutorialRev.pdf.

Jaeger, H. (2004). Harnessing Nonlinearity: Predicting Chaotic Systems and Saving Energy in
Wireless Communication. Science, 304:78–80.

Japkowicz, N., Hanson, S. J., and Gluck, M. A. (2000). Nonlinear Autoassociation Is Not Equiva-
lent to PCA. Neural Computation, 12(3):531–545.

Jaynes, E. T. and Bretthorst, G. L., editors (2003). Probability theory : the logic of science.
Cambridge University Press, Cambridge, UK, New York.

Johansson, E. M., Dowla, F. U., and Goodman, D. M. (1991). Backpropagation Learning for
Multilayer Feed-Forward Neural Networks Using the Conjugate Gradient Method. Int. J.
Neural Syst., 2(4):291–301.

Jordan, M. I. and Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm.
Neural computation, 6(2):181–214.

Jouini, W., Moy, C., and Palicot, J. (2012). Decision making for cognitive radio equipment: analysis
of the first 10 years of exploration. EURASIP J. Wireless Comm. and Networking, 2012:26.

Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. (1998). Planning and acting in partially
observable stochastic domains. Artificial Intelligence, 101(1-2):99–134.

Kanungo, T., Mount, D. M., Netanyahu, N., Piatko, C., Silverman, R., and Wu, A. Y. (2002). An
efficient k-means clustering algorithm: Analysis and implementation. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 24:881–892.

Karpathy, A. and Li, F.-F. (2014). Deep Visual-Semantic Alignments for Generating Image De-
scriptions. CoRR, abs/1412.2306.

Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., and Murthy, K. R. K. (1999). Improvements
to Platt’s SMO Algorithm for SVM Classifier Design. Technical Report CD-99-14, National
University of Singapore.

Kohonen, T. (1989). Self-Organization and Associative Memory, volume 8 of Springer Series in


Information Sciences. Springer-Verlag.

Kohonen, T. (2013). Essentials of the self-organizing maps. Neural Networks, (37):52–65.

Korda, N., Kaufmann, E., and Munos, R. (2013). Thompson sampling for 1-dimensional exponen-
tial family bandits. In Advances in Neural Information Processing Systems, pages 1448–1456.

Krizhevsky, A. (2014). One weird trick for parallelizing convolutional neural networks. CoRR,
abs/1404.5997.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet Classification with Deep Con-
volutional Neural Networks. In Bartlett, P. L., Pereira, F. C. N., Burges, C. J. C., Bottou, L.,
and Weinberger, K. Q., editors, NIPS, pages 1106–1114.

Lagoudakis, M. G. and Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning
Research (JMLR), 4:1107–1149.

Lawson, C. and Hanson, R. (1974). Solving least squares problems. Prentice-Hall series in automatic
computation. Prentice-Hall, Englewood Cliffs, NJ.

Lazaric, A., Ghavamzadeh, M., and Munos, R. (2010). Analysis of a classification-based policy
iteration algorithm. In International Conference on Machine Learning (ICML), pages 607–614.

Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11):2278–2324.

LeCun, Y., Bottou, L., Orr, G., and Müller, K.-R. (1998). Efficient BackProp. In Orr, G. and
Müller, K.-R., editors, Neural Networks: Tricks of the Trade, volume 1524 of Lecture Notes
in Computer Science, pages 9–50. Springer Berlin Heidelberg.

Lee, J. A. and Verleysen, M. (2007). Nonlinear dimensionality reduction. Springer.

Lee, Y., Lin, Y., and Wahba, G. (2004). Multicategory support vector machines: Theory and
application to the classification of microarray data and satellite radiance data. Journal of the
American Statistical Association, 99(465):67–81.

Linde, Y., Buzo, A., and Gray, R. M. (1980). Algorithm for Vector Quantization Design. IEEE
transactions on communications systems, 28(1):84–95.

Lloyd, S. P. (1982). Least Squares Quantization in PCM. IEEE Transactions on Information


Theory, 28(2):129–137.

Louppe, G. and Geurts, P. (2012). Ensembles on random patches. In Machine Learning and
Knowledge Discovery in Databases, pages 346–361. Springer.
Lugosi, G. and Wegkamp, M. (2004). Complexity regularization via localized random penalties.
The Annals of Statistic, 32(4):1679–1697.
Lukosevicius, M. (2012). A Practical Guide to Applying Echo State Networks. In Montavon, G.,
Orr, G. B., and Müller, K.-R., editors, Neural Networks: Tricks of the Trade (2nd ed.), volume
7700 of Lecture Notes in Computer Science, pages 659–686. Springer.
MacQueen, J. B. (1967). Some Methods for Classification and Analysis of MultiVariate Obser-
vations. In Cam, L. M. L. and Neyman, J., editors, Proc. of the fifth Berkeley Symposium
on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California
Press.
Martens, J. (2010). Deep learning via Hessian-free optimization. In Fürnkranz, J. and Joachims,
T., editors, Proc. of the International Conference on Machine Learning (ICML) 2010, pages
735–742. Omnipress.
Martinez, T. M. and Schulten, K. J. (1994). Topology Representing Networks. Neural Networks,
7(3):507–522.
Mason, L., Baxter, J., Bartlett, P., and Frean, M. (1999). Boosting algorithms as gradient descent
in function space. In Neural Information Processing Systems (NIPS).
McCulloch, W. S. and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous
activity. Bulletin of Mathematical Biophysics, 5:115–133.
Mhaskar, H. and Micchelli, C. (1995). Degree of Approximation by Neural and Translation Net-
works with a Single Hidden Layer. Advances in Applied Mathematics, 16(2):151–183.
Munos, R. (2014). From bandits to Monte-Carlo Tree Search: The optimistic principle applied to
optimization and planning. Foundations and Trends in Machine Learning.
Muselli, M. (1997). On convergence properties of pocket algorithm. IEEE Transactions on Neural
Networks, 8(3):623–629.
Nair, V. and Hinton, G. E. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines.
In Fürnkranz, J. and Joachims, T., editors, ICML, pages 807–814. Omnipress.
Ormoneit, D. and Sen, S. (2002). Kernel-Based Reinforcement Learning. Machine Learning,
49:161–178.
Ozay, M. and Vural, F. T. Y. (2012). A new fuzzy stacked generalization technique and analysis
of its performance. arXiv preprint arXiv:1204.0171.
Park, J. and Sandberg, I. W. (1991). Universal Approximation Using Radial-Basis-Function Net-
works. Neural Computation, 3:246–257.
Pascanu, R., Montufar, G., and Bengio, Y. (2013). On the number of inference regions of deep
feed forward networks with piece-wise linear activations. CoRR, abs/1312.6098.
Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. Philo-
sophical Magazine Series 6, 2(11):559–572.
Peng, J. X., Li, K., and Irwin, G. W. (2007). A Novel Continuous Forward Algorithm for RBF
Neural Modelling. IEEE Trans. Automat. Contr., 52(1):117–122.
Peters, J. and Schaal, S. (2008). Natural Actor-Critic. Neurocomputing, 71:1180–1190.
Platt, J. (1998). Fast training of support vector machines using sequential minimal optimization.
In schölkopf, B., Burges, C., and Smola, A., editors, Advances in Kernel Methods: Support
Vector Machines. MIT Press, Cambridge, MA.

Poupart, P., Vlassis, N., Hoey, J., and Regan, K. (2006). An analytic solution to discrete bayesian
reinforcement learning. In International Conference on Machine Learning (ICML), pages
697–704.
Prechelt, L. (1996). Early Stopping-But When? In Orr, G. B. and Müller, K.-R., editors, Neural
Networks: Tricks of the Trade, volume 1524 of Lecture Notes in Computer Science, pages
55–69. Springer.
Pudil, P., Novovic̆ová, and Kittler, J. (1993). Floating search methods in feature selection. Pattern
Recognition Letters, 15:1119–1125.
Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming.
Wiley-Interscience.
Quinlan, J. R. (1993). C4. 5: Programs for machine learning.
Ranzato, M. A., Poultney, C. S., Chopra, S., and LeCun, Y. (2006). Efficient Learning of Sparse
Representations with an Energy-Based Model. In Schölkopf, B., Platt, J. C., and Hoffman,
T., editors, NIPS 2006, pages 1137–1144. MIT Press.
Riedmiller, M. (2005). Neural fitted q iteration–first experiences with a data efficient neural
reinforcement learning method. In European Conference on Machine Learning (ECML), pages
317–328. Springer.
Rodan, A. and Tiño, P. (2011). Minimum Complexity Echo State Network. IEEE Transactions
on Neural Networks, 22(1):131–144.
Rosenblatt, F. (1962). Principles of neurodynamics; perceptrons and the theory of brain mecha-
nisms. Washington, Spartan Books.
Roweis, S. T. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embed-
ding. Science, 290:2323–2326.
Rumelhart, D., Hinton, G., and Williams, R. (1986). Learning Internal Representations by Error
Propagation, pages 318–362. MIT Press.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A.,
Khosla, A., Bernstein, M. S., Berg, A. C., and Li, F.-F. (2014). ImageNet Large Scale Visual
Recognition Challenge. CoRR, abs/1409.0575.
Salakhutdinov, R. and Hinton, G. E. (2009). Deep Boltzmann Machines. Journal of Machine
Learning Research - Proceedings Track, 5:448–455.
Sammon, J. W. (1969). A Nonlinear Mapping for Data Structure Analysis. IEEE Trans. Comput.,
18(5):401–409.
Schapire, R. E. and Freund, Y. (2012). Boosting: Foundations and algorithms. MIT press.
Scherer, D., Müller, A. C., and Behnke, S. (2010). Evaluation of Pooling Operations in Convo-
lutional Architectures for Object Recognition. In Diamantaras, K. I., Duch, W., and Iliadis,
L. S., editors, ICANN (3), volume 6354 of Lecture Notes in Computer Science, pages 92–101.
Springer.
Scherrer, B. (2010). Should one compute the Temporal Difference fix point or minimize the Bellman
Residual? The unified oblique projection view. In International Conference on Machine
Learning (ICML), pages 959–966.
Scherrer, B., Ghavamzadeh, M., Gabillon, V., Lesner, B., and Geist, M. (2015). Approximate
Modified Policy Iteration and its Application to the Game of Tetris. Journal of Machine
Learning Research.
Schmidhuber, J. (1992). Learning Complex, Extended Sequences Using the Principle of History
Compression. Neural Computation, 4(2):234–242.

Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61:85–
117.

Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A. J., and Williamson, R. C. (2001). Estimating
the support of a high-dimensional distribution. Neural Computation, 13:1443–1471.

Scholkopf, B., Smola, A., and Müller, K.-R. (1999). Kernel principal component analysis. In
Advances in kernel methods - support vector learning, pages 327–352. MIT Press.

Schölkopf, B., Smola, A. J., Williamson, R. C., and Bartlett, P. L. (2000). New support vector
algorithms. Neural Computation, 12:1207–1245.

Schwenker, F., Kestler, H. A., and Palm, G. (2001). Three learning phases for radial-basis-function
networks. Neural Networks, 14(4-5):439–458.

Shawe-Taylor, J. and Cristanini, N. (2004). Kernel Methods for Pattern Analysis. Cambridge
University Press.

Sigaud, O. and Buffet, O. (2013). Markov decision processes in artificial intelligence. John Wiley
& Sons.

Simard, P. Y., Steinkraus, D., and Platt, J. C. (2003). Best Practices for Convolutional Neural
Networks Applied to Visual Document Analysis. In ICDAR, pages 958–962. IEEE Computer
Society.

Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony


theory. In Parallel distributed processing: Explorations in the microstructure of cognition,
pages 194–281–. MIT Press, Cambridge, MA.

Smyth, P. and Wolpert, D. (1999). Linearly combining density estimators via stacking. Machine
Learning, 36(1-2):59–83.

Soheili, N. (2014). Elementary Algorithms for Solving Convex Optimization Problems. PhD thesis,
Carnegie Mellon University.

Soheili, N. and Pena, J. (2013). A primal-dual smooth perceptron-von neumann algorithm. Discrete
Geometry and Optimization, 69:303–320.

Somol, P., Novovicova, J., and Pudil, P. (2010). Pattern recognition recent advances, chapter
Efficient Feature Subset Selection and Subset Size Optimization. InTech.

Somol, P., Pudil, P., Novovicová, J., and Paclı́k, P. (1999). Adaptive floating search methods in
feature selection. Pattern Recognition Letters, 20(11-13):1157–1163.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout:
A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning
Research, 15:1929–1958.

Steinwart, I. (2007). How to compare different loss functions and their risks. Constructive Approx-
imation, 26(2):225–287.

Sutskever, I. (2013). Training recurrent neural networks. PhD thesis, University of Toronto.

Sutskever, I., Martens, J., and Hinton, G. E. (2011). Generating Text with Recurrent Neural
Networks. In Getoor, L. and Scheffer, T., editors, ICML, pages 1017–1024. Omnipress.

Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to Sequence Learning with Neural
Networks. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger,
K. Q., editors, NIPS, pages 3104–3112.

Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. The MIT Press.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (1999). Policy Gradient Methods
for Reinforcement Learning with Function Approximation. In Neural Information Processing
Systems (NIPS), pages 1057–1063.
Szepesvári, C. (2010). Algorithms for reinforcement learning. Synthesis Lectures on Artificial
Intelligence and Machine Learning, 4(1):1–103.
Tenenbaum, J. B., de Silva, V., and Langford, J. C. (2000). A Global Geometric Framework for
Nonlinear Dimensionality Reduction. Science, 290(5500):2319.
Thomas, P., Theocharous, G., and Ghavamzadeh, M. (2015). High confidence off-policy evaluation.
In Twenty-Ninth AAAI Conference on Artificial Intelligence.
Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view
of the evidence of two samples. Biometrika, pages 285–294.
Tibshirani, R. (1996a). Regression shrinkage and selection via the LASSO. Journal of the Royal
Statistical Society. Series B (Methodological), pages 267–288.
Tibshirani, R. (1996b). Regression shrinkage and selection via the Lasso. Journal of the Royal
Statistical Society. Series B (Methodological), pages 267–288.
Tikhonov, A. (1963). Solution of incorrectly formulated problems and the regularization method.
In Soviet Mathematics, volume 5, pages 1035–1038.
Tipping, M. E. (2001). Sparse Kernel Principal Component Analysis. In Leen, T., Dietterich, T.,
and Tresp, V., editors, Advances in Neural Information Processing Systems 13, pages 633–639.
MIT Press.
Triefenbach, F., Jalalvand, A., Schrauwen, B., and Martens, J.-P. (2010). Phoneme Recognition
with Large Hierarchical Reservoirs. In Lafferty, J. D., Williams, C. K. I., Shawe-Taylor, J.,
Zemel, R. S., and Culotta, A., editors, NIPS, pages 2307–2315. Curran Associates, Inc.
Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11):1134–1142.
van der Maaten, L. (2014). Accelerating t-SNE using Tree-Based algorithms. Journal of Machine
Learning Research, 15:3221–3245.
van der Maaten, L. and Hinton, G. (2008). Visualizaing Data using t-SNE. Journal of Machine
Learning Research, 9:2579–2605.
Vapnik, V. N. (1998). Statistical Learning Theory. Wiley-Interscience.
Vapnik, V. N. (1999). An overview of statistical learning theory. Neural Networks, IEEE Trans-
actions on, 10(5):988–999.
Vapnik, V. N. (2000). The Nature of Statistical Learning Theory. Statistics for Engineering and
Information Science. Springer.
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and composing
robust features with denoising autoencoders. In Cohen, W. W., McCallum, A., and Roweis,
S. T., editors, ICML, volume 307 of ACM International Conference Proceeding Series, pages
1096–1103. ACM.
Viola, P. and Jones, M. (2001). Rapid object detection using a boosted cascade of simple features.
In Computer Vision and Pattern Recognition (CVPR). IEEE.
Vlassis, N., Ghavamzadeh, M., Mannor, S., and Poupart, P. (2012). Bayesian reinforcement learn-
ing. In Reinforcement Learning, pages 359–386. Springer.
Waibel, A. H., Hanazawa, T., Hinton, G. E., Shikano, K., and Lang, K. J. (1989). Phoneme
recognition using time-delay neural networks. IEEE Trans. Acoustics, Speech, and Signal
Processing, 37(3):328–339.

Wang, C., Venkatesh, S. S., and Judd, J. S. (1993). Optimal Stopping and Effective Machine
Complexity in Learning. In Cowan, J. D., Tesauro, G., and Alspector, J., editors, NIPS,
pages 303–310. Morgan Kaufmann.
Wang, J. (2013). Boosting the generalized margin in cost-sensitive multiclass classification. Journal
of Computational and Graphical Statistics, 22(1):178–192.
Wegkamp, M. (2003). Model selection in nonparametric regression. Ann. Statist., 31(1):252–273.

Werbos, P. (1981). Application of advances in nonlinear sensitivity analysis. In Proc. of the 10th
IFIP conference, pages 762–770.
Werbos, P. J. (1988). Generalization of backpropagation with application to a recurrent gas market
model. Neural Networks, 1(4):339–356.

Widrow, B. and Hoff, M. (1962). Associative Storage and Retrieval of Digital Information in
Networks of Adaptive Neurons. Biological Prototypes and Synthetic Systems, 1.
Williams, R. J. and Zipser, D. (1989). A Learning Algorithm for Continually Running Fully
Recurrent Neural Networks. Neural Computation, 1(2):270–280.
Wolpert, D. H. (1992). Stacked generalization. Neural networks, 5(2):241–259.

Yu, D. and Deng, L. (2015). Automatic Speech Recognition A Deep Learning Approach. Springer-
Verlag.
Zeiler, M., Ranzato, M., Monga, R., Mao, M., Yang, K., Le, Q., Nguyen, P., Senior, A., Vanhoucke,
V., Dean, J., and Hinton, G. (2013). On Rectified Linear Units For Speech Processing. In
38th International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages
3517–3521, Vancouver. IEEE.
Zhao, Z., Morstatter, F., Sharma, S., Alelyani, S., Anand, A., and Liu, H. (2008). Advancing
Feature Selection Research - ASU Feature Selection Repository. Technical report, Arizona
State University.

Zou, H. and Hastie, T. (2003). Regularization and Variable Selection via the Elastic Net. Journal
of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320.
Index

ε-SVR, 130
γ-exponential kernel, 310
ν-SVC, 129
ν-SVR, 130
a posteriori distribution, 262
a priori distribution, 260, 262
accessible, 289
active learning, 23
active sampling, 16
actor-critic, 360
AdaBoost, 221
ADALINE, 168
alpha coefficients, 293
ambient space, 30
ancestor, 244
ancestral sampling, 321
aperiodic, 290
approximate dynamic programming (ADP), 347
approximate policy iteration, 352
approximate value iteration, 350
atomic, 238
bag-of-words, 70
Bagging, 217
bagging, 34
bags of words, 119
bandit, 331
batch sampling, 16
Baum-Welch algorithm, 295
Bayes estimator, 236, 265
Bayes’ rule, 38, 240
Bayes’ theorem, 240
Bayesian filtering, 285
Bayesian inference, 38, 42, 261, 262
Bayesian learning, 42
Bayesian Machine Learning, 233, 235
Bayesian model averaging, 215
Bayesian Networks, 244
Bayesian smoothing, 285
Belief Network, 244
Bellman evaluation operator, 342, 348
Bellman optimality operator, 342, 348
beta function, 264
bias-variance decomposition, 55
bias-variance trade-off, 34
binary loss, 31
Boosting, 221
boosting, 35
bootstrapping, 34
C-SVC, 129
calibration, 65
causality, 251
chain rule, 177
classification, 21
classification tree, 214
command, 296
communicating, 290
communication classes, 290
competition, 148
complete, 290
concentration inequality, 333
conditional distribution, 240
Conditional Probability Tables, 247
confusion matrix, 46
conjugate prior, 43, 263
consistency of the ERM principle, 57
continuous random variable, 238
Correlation-based feature selection, 74
cost sensitive, 48
covariance function, 310
cumulative distribution function, 238
curse of dimensionality, 31, 71
d-separated, 250
d-separation, 250
data augmentation, 185
dataset, 14
dead units, 151
decision stump, 214
decision trees, 211
Delaunay triangulation, 142
delta rule, 170
denoising autoencoders, 186
density function, 137
design matrix, 275
detailed balance equation, 324
directed acyclic graph, 244
Dirichlet distributions, 264
discrete random variable, 238
Discriminative models, 257
distortion, 135
distribution, 238
domain, 238
double centering, 84
dual problem, 105
Dynamic Programming (DP), 343
Echo state networks, 195
EM, 279
embedded, 72
Emission distributions, 292
emission matrices, 292
emissions, 287
empirical risk, 32, 55
empirical risk minimization, 27, 33, 55
energy, 279
ensemble learning, 34
entropy, 279
equilibrium distribution, 289
ERM, 33
Euclidean distance matrix, 83
events, 237
Expectation maximization, 279
Expectation step, 279
exploration-exploitation dilemma, 355
Extended Kalman Filter, 305
Extended Kalman filter, 306
extremely randomized forests, 219
f-score, 49
f1-score, 49
factors, 245
false negatives, 48
false positives, 48
feature extraction, 72
feature selection, 71
feature space, 30, 86, 167
feedforward neural network, 159
filters, 72
forward stagewise additive modeling, 225
forward-backward algorithm, 294
frequentist approach, 29
Frequentist Machine Learning, 235
gamma function, 264
Gaussian Mixture Model, 283
Gaussian Process, 310
Gaussian Processes, 309
generalization, 27
Generative models, 257
global balance equation, 324
gradient descent, 169
Gram matrix, 83
Graphical models, 244
greedy policy, 344, 349
growing grid, 148
growing neural gas, 148
Hidden Markov Model, 292
hidden variable, 251, 277
hinge loss, 63, 103
Hoeffding’s inequality, 59, 333
homogeneous, 288, 292
hyperparameters, 258
hypothesis space, 29, 53
i.i.d, 14
importance distribution, 325
importance sampling, 325, 326
independent, 239
independent and identically distributed, 14
induction principle, 27, 33
inductive bias, 34
inductive learning, 27
information fusion, 304
Innovation, 304
irreducible, 290
joint distribution, 239
joint variable, 37
k-fold cross-validation, 45
k-means, 147
k-nearest neighbours, 27
Kalman filter, 301, 304
Kalman filter gain, 304
Kalman filters, 301
kernel, 87, 113, 310
kernel trick, 87, 114, 117
kriging, 309
Lagrange multipliers, 105
Lagrangian, 105
latent variable, 251, 277
learning by heart, 33
Least Absolute Shrinkage and Selection Operator, 72
leave-one-out cross-validation, 45
likelihood, 241, 262
line search, 228
linear functions, 29
linear Markov model, 300
linear separator, 97
linearly separable, 30, 97, 164
log likelihood, 242
loss function, 31, 54
manifold learning, 89
margin, 98, 226
marginalization, 37
Markov blanket, 323
Markov Chain, 288
Markov chain, 288
Markov Chain Monte Carlo method, 323
Markov Decision Process (MDP), 340
Markov model, 286
Markov model of order k, 286
Markov models, 286
Markov networks, 244
Markov property, 286
Markovian Decision Process, 25
masked Delaunay triangulation, 142
Maximization step, 279
Maximum A Posteriori estimator, 236, 266
Maximum Likelihood estimator, 236, 266
mean square error, 275
measurable, 238
Mercer’s theorem, 87, 116
mini-batch, 16
minimal enclosing sphere, 130
minimal norm solution, 169
mix random variable, 238
mixing time, 323
mixture model, 280
mixture of experts, 216
Monte Carlo, 319
Moore-Penrose inverse, 275
Moore-Penrose pseudo-inverse, 169
multi-class classification, 22
n-grams, 70
natural gradient, 361
neural network, 157
non-parametric approach, 309
non-parametric method, 309
nonparametric hypothesis space, 29
normal equations, 169
observable, 287
observations, 259
off-policy, 357
on-policy, 357
one-class SVM, 131
one-versus-all, 22
one-versus-one, 22
one-versus-rest, 22
online sampling, 16
optimal control, 23
optimal policy, 25
optimism in the face of uncertainty, 333
optimization problem, 104
optimization theory, 104
oracle, 14, 53
Ordinary Least Square, 275
outcomes, 237
output equation, 298
overfitting, 33, 45
p-spectrum kernel, 119
PAC (Probably Approximately Correct), 59, 334
parametric, 309
parametric functions, 29
parametric hypothesis space, 29
parent, 244
parent variables, 245
partially observable, 277, 287
Particle filtering, 327
perceptron convergence theorem, 165
perceptron learning rule, 162
period, 290
plates, 254
policy, 25, 341
policy iteration, 346
policy search, 359
posterior, 262
precision, 49
prediction equations, 304
Principal Component Analysis, 75
prior, 39, 260, 262
probability, 237
probability density function, 238
probability space, 237
probability vector, 288
proposal distribution, 324
prototypes, 135
pseudoinverse matrix, 275
quadratic loss, 32
random forests, 219
random variable, 238
real risk, 32
real-valued random variable, 238
recall, 49
recurrent neural network, 159
recurrent neural networks, 192
recursive neural network, 159
regression, 22
regression tree, 213
regret, 331
reinforcement learning, 23, 339
rejection sampling, 320
resampling, 326
restricted functional gradient descent, 227
risk, 31, 55
ROC space, 49
sample covariance matrix, 77, 83
scalar linear functions, 29
score, 22, 65
second order Voronoï tessellation, 145
self-organizing map, 150
semi-supervised learning, 23
sensitivity, 49
Sequential Backward Search, 73
Sequential Floating Backward Search, 73
Sequential Floating Forward Search, 73
Sequential Forward Search, 73
Sequential Importance Sampling, 327
shortest confidence interval, 140
Singular Value Decomposition, 169
slack variables, 103
SMO, 121
specificity, 49
squared exponential kernel, 310
stacked denoising autoencoder, 186
stacking, 216
standardization, 17
state, 285
state integration, 298
state space, 286
state-action value function, 348
stationary, 287, 289
stationary of order k, 287
statistical learning theory, 53
steepest descent, 169
Stochastic Gradient Descent, 169
stochastic matrix, 288
stochastic process, 286
strong learner, 223
strongly connected, 290
Supervised learning, 21
support vectors, 101
surrogate, 62
SVC, 129
SVR, 129
target, 140
targeted vector quantization process, 140
test set, 45
the curse of dimensionality, 243
time space, 286
time-delay neural network, 191
training set, 45
transductive learning, 27
transition matrices, 288
true negatives, 48
true positives, 48
two-class classification, 21
UCB (Upper Confidence Bound) strategy, 336
uncertainty, 42
universe, 237
Unsupervised learning, 17
update equations, 304
validation set, 46
value function, 341
value iteration, 344
vanishing gradient, 186
Vapnik-Chervonenkis dimension, 60
variable selection, 72
vector quantization, 135
Viterbi algorithm, 295
Voronoï distortion, 137
Voronoï subsets, 137
Voronoï tessellation, 140, 142
weak learner, 222
weight sharing, 184
winner-take-all, 148
winner-take-most, 150
wrappers, 72
